# Modern NLP in Python, 2019
### _- Or -_
## What you can learn about food by analyzing 4 million Yelp reviews

#### Before we get started...
__whois?__
- Patrick Harrison
- Director of AI Engineering @ S&P Global - _**we are hiring**_
- University of Virginia
- patrick@hrsn.me

_Note: I presented an older version of this notebook as a tutorial during the [PyData DC 2016 conference](http://pydata.org/dc2016/schedule/presentation/11/). To view the video of that presentation on YouTube, see [here](https://www.youtube.com/watch?v=6zm9NC9uRkk)._

## Our Trail Map
This tutorial features an end-to-end data science & natural language processing pipeline, starting with **raw data** and running through **preparing**, **modeling**, **visualizing**, and **analyzing** the data. We'll touch on the following points:
1. A tour of the dataset
1. Introduction to text processing with spaCy
1. Automatic phrase modeling
1. Topic modeling with LDA
1. Visualizing topic models with pyLDAvis
1. Word vector models with word2vec
1. Visualizing word2vec with t-SNE
1. Text categorization (classification) with spaCy's `textcat` model
1. Contextual word vectors with spaCy Pytorch Transformers

...and we might even learn a thing or two about Python along the way.

Let's get started!

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it &mdash; it's largely about food, after all!

**Note:** If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:
1. Please visit the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Please review, agree to, and respect Yelp's terms of use!
1. The dataset downloads as a `.tar` file; unarchive it
1. Place the uncompressed dataset files (`business.json`, etc.) in a directory named `yelp_dataset`
1. Place the `yelp_dataset` directory within the `data` directory in the *Modern NLP in Python* project folder

That's it! You're ready to go.

The current iteration of the Yelp dataset (as of this demo) consists of the following data:
- __1.6M__ users
- __193K__ businesses
- __6.7M__ user reviews

When focusing on restaurants alone, there are approximately __59K__ restaurants with approximately __4.2M__ user reviews written about them.

The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:
- __business.json__ &mdash; _the records for individual businesses_
- __review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

In [1]:
import os

data_directory = os.path.join('..', 'data', 'yelp_dataset')

businesses_filepath = os.path.join(data_directory, 'business.json')

with open(businesses_filepath) as f:
    first_business_record = f.readline() 

print(first_business_record)

{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}



The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:
- __business\_id__ &mdash; _unique identifier for businesses_
- __categories__ &mdash; _a comma-delimited list containing relevant category values of businesses_

The _categories_ attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the _Restaurants_ tag in the _categories_ list. In addition, the _categories_ list may contain more detailed information about restaurants, such as the type of food they serve.
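As a small illustration of the _categories_ check, the sketch below parses one record and splits the comma-delimited string into individual tags. The sample record is a trimmed copy of the one printed above, and `parse_categories` is a hypothetical helper written for this example, not part of the dataset or of any library.

```python
import json

# trimmed copy of the business record printed above
sample_record = '{"business_id":"1SWheh84yJXfytovILXOAQ","categories":"Golf, Active Life"}'

def parse_categories(business):
    """
    hypothetical helper: split the comma-delimited categories
    string into a set of tags (empty set if missing or None)
    """
    categories = business.get('categories') or ''
    return {tag.strip() for tag in categories.split(',') if tag.strip()}

business = json.loads(sample_record)
tags = parse_categories(business)

print(tags)
print('Restaurants' in tags)
```

Membership in a set of tags is an exact match. A substring test against the raw _categories_ string is simpler, but it would also match any category name that merely contains the word.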

The review records are stored in a similar manner &mdash; _key, value_ pairs containing information about the reviews.

In [2]:
review_json_filepath = os.path.join(data_directory, 'review.json')

with open(review_json_filepath) as f:
    first_review_record = f.readline()
    
print(first_review_record)

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}



A few attributes of note on the review records:
- __business\_id__ &mdash; _indicates which business the review is about_
- __text__ &mdash; _the natural language text the user wrote_

The _text_ attribute will be our focus today!

_json_ is a handy file format for data interchange, but it's typically not the most usable for any sort of modeling work. Let's do a bit more data preparation to get our data in a more usable format. Our next code block will do the following:
1. Read in each business record and convert it to a Python `dict`
2. Filter out business records that aren't about restaurants (i.e., records without the "Restaurants" category)
3. Create a `frozenset` of the business IDs for restaurants, which we'll use in the next step

In [3]:
import json

restaurant_ids = set()

# open the businesses file
with open(businesses_filepath) as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business does not have a 'categories' attribute,
        # or if that attribute is None, skip to the next one
        if not business.get('categories'):
            continue
        
        # if this business is not a restaurant, skip to the next one
        if 'Restaurants' not in business['categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business['business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print(f'{len(restaurant_ids):,} restaurants in the dataset.')

59,371 restaurants in the dataset.


Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [4]:
scratch_directory = os.path.join('..', 'scratch')

# create a scratch directory if one doesn't already exist
os.makedirs(scratch_directory, exist_ok=True)

review_txt_filepath = os.path.join(scratch_directory, 'review_text_all.txt')

In [5]:
# this is a bit time consuming - set execute = True
# if you want to execute data prep yourself.

execute = False

if execute:
    
    review_count = 0

    # create & open a new file in write mode
    with open(review_txt_filepath, 'w') as review_txt_file:

        # open the existing review json file
        with open(review_json_filepath) as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review['business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print(f'Text from {review_count:,} restaurant reviews written to the new txt file.')
    
else:
    
    # count the reviews in the existing file
    with open(review_txt_filepath) as review_txt_file:
        # summing over the lines also works if the file is empty
        review_count = sum(1 for line in review_txt_file)

    print(f'Text from {review_count:,} restaurant reviews in the txt file.')

Text from 4,203,848 restaurant reviews in the txt file.
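The one-review-per-line layout depends on escaping the line breaks inside each review. Here is a minimal sketch of that round trip, using a made-up review; `escape_newlines` and `unescape_newlines` are hypothetical helper names for this example.

```python
def escape_newlines(text):
    """replace real line breaks with the two characters backslash + n"""
    return text.replace('\n', '\\n')

def unescape_newlines(line):
    """drop the trailing record separator, then restore the line breaks"""
    return line.rstrip('\n').replace('\\n', '\n')

# made-up review text containing a blank line
review_text = 'Great tacos.\n\nWill be back!'

# what would be written to the txt file: exactly one physical line
stored_line = escape_newlines(review_text) + '\n'

restored = unescape_newlines(stored_line)

print(repr(stored_line))
print(restored == review_text)
```

One limitation of this simple scheme: a review that already contains a literal backslash followed by `n` would come back with a line break in its place. For review text that is probably rare enough to ignore.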


## spaCy &mdash; Industrial-Strength NLP in Python

![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization, such as lowercasing, lemmatization, and token shape analysis
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

spaCy is written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the _GIL_).

In [6]:
import spacy
from spacy import displacy
import pandas as pd
import itertools as it

nlp = spacy.load('en_core_web_md')

Let's grab a sample review to play with.

In [7]:
review_num = 754600

with open(review_txt_filepath) as f:
    sample_review = list(it.islice(f, review_num, review_num+1))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

Kinjo is a breath of fresh sushi, the first lungful in a long time, one I would frequent if my home wasn't 900 km away.  A lunch at Kinjo is like eating solid joy, perfectly transmitted from its owner to each table...and to think this is a chain, albeit a small one.  Only three Kinjos exist, scattered like buckshot across the city of Calgary, validating the appeal of the décor, the quality of its sushi, and the passion of its owner. 
 
I know I'm long past my obligation for an opening joke, something involving the inner workings of my gastro-intestinal system or perhaps a remark a passing reader may mistaken as faintly racist.  Truth is, right beside my computer, I have a Thunderdome-like Aunty's wheel I spin to determine the bizarre pop-culture references I'll have to drop during the course of a review.  My initial attempt landed on Silence of the Lambs, which I've touched on three times already.  A second fell on Pink Floyd's laser concert which...okay, I've honestly no idea why that

Hand the review text to spaCy, and be prepared to wait...

In [8]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 98.2 ms, sys: 9.21 ms, total: 107 ms
Wall time: 107 ms


...a fraction of a second or so. Let's take a look at what we got during that time...

In [9]:
print(parsed_review)

Kinjo is a breath of fresh sushi, the first lungful in a long time, one I would frequent if my home wasn't 900 km away.  A lunch at Kinjo is like eating solid joy, perfectly transmitted from its owner to each table...and to think this is a chain, albeit a small one.  Only three Kinjos exist, scattered like buckshot across the city of Calgary, validating the appeal of the décor, the quality of its sushi, and the passion of its owner. 
 
I know I'm long past my obligation for an opening joke, something involving the inner workings of my gastro-intestinal system or perhaps a remark a passing reader may mistaken as faintly racist.  Truth is, right beside my computer, I have a Thunderdome-like Aunty's wheel I spin to determine the bizarre pop-culture references I'll have to drop during the course of a review.  My initial attempt landed on Silence of the Lambs, which I've touched on three times already.  A second fell on Pink Floyd's laser concert which...okay, I've honestly no idea why that

Looks the same! What happened under the hood?

What about sentence detection and segmentation?

In [10]:
for num, sentence in enumerate(parsed_review.sents):
    print(f'Sentence {num + 1}:')
    print(sentence)
    print('')

Sentence 1:
Kinjo is a breath of fresh sushi, the first lungful in a long time, one I would frequent if my home wasn't 900 km away.  

Sentence 2:
A lunch at Kinjo is like eating solid joy, perfectly transmitted from its owner to each table...and to think this is a chain, albeit a small one.  

Sentence 3:
Only three Kinjos exist, scattered like buckshot across the city of Calgary, validating the appeal of the décor, the quality of its sushi, and the passion of its owner. 
 


Sentence 4:
I know I'm long past my obligation for an opening joke, something involving the inner workings of my gastro-intestinal system or perhaps a remark a passing reader may mistaken as faintly racist.  

Sentence 5:
Truth is, right beside my computer, I have a Thunderdome-like Aunty's wheel I spin to determine the bizarre pop-culture references I'll have to drop during the course of a review.  

Sentence 6:
My initial attempt landed on Silence of the Lambs, which I've touched on three times already.  

Sent

What about text normalization, like lemmatization and token shape analysis?

In [11]:
token_text = [token.orth_ for token in parsed_review]
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(
    zip(token_text, token_lemma, token_shape),
    columns=['token_text', 'token_lemma', 'token_shape']
    )

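| | token_text | token_lemma | token_shape |
|---|---|---|---|
| 0 | Kinjo | Kinjo | Xxxxx |
| 1 | is | be | xx |
| 2 | a | a | x |
| 3 | breath | breath | xxxx |
| 4 | of | of | xx |
| 5 | fresh | fresh | xxxx |
| 6 | sushi | sushi | xxxx |
| 7 | , | , | , |
| 8 | the | the | xxx |
| 9 | first | first | xxxx |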
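To make the _token shape_ column less mysterious, here is a rough pure-Python approximation of the idea: map uppercase letters to `X`, lowercase letters to `x`, digits to `d`, and cap runs of the same symbol at four. This is a sketch inferred from the output above, not spaCy's actual implementation, and `word_shape` is a hypothetical name.

```python
def word_shape(text, max_run=4):
    """
    approximate a token's orthographic shape;
    runs of the same shape symbol are capped at max_run
    """
    shape = []
    last_symbol = None
    run_length = 0

    for char in text:
        if char.isalpha():
            symbol = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            symbol = 'd'
        else:
            # punctuation and other characters are kept as-is
            symbol = char

        if symbol == last_symbol:
            run_length += 1
        else:
            last_symbol = symbol
            run_length = 1

        if run_length <= max_run:
            shape.append(symbol)

    return ''.join(shape)

for word in ['Kinjo', 'breath', 'is', '900']:
    print(word, word_shape(word))
```

Capping long runs means words like _breath_ and _fresh_ share the shape `xxxx`, so the shape captures capitalization and digit patterns rather than exact word length.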


What about part of speech tagging?

In [12]:
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(
    zip(token_text, token_pos),
    columns=['token_text', 'part_of_speech']
    )

| | token_text | part_of_speech |
|---|---|---|
| 0 | Kinjo | PROPN |
| 1 | is | VERB |
| 2 | a | DET |
| 3 | breath | NOUN |
| 4 | of | ADP |
| 5 | fresh | ADJ |
| 6 | sushi | NOUN |
| 7 | , | PUNCT |
| 8 | the | DET |
| 9 | first | ADJ |


What about named entity detection?

In [13]:
displacy.render(parsed_review, style="ent")

In [14]:
for num, entity in enumerate(parsed_review.ents):
    print(f'Entity {num + 1}:', entity, '-', entity.label_)
    print('')

Entity 1: Kinjo - ORG

Entity 2: first - ORDINAL

Entity 3: 900 km - QUANTITY

Entity 4: Kinjo - ORG

Entity 5: Kinjos - ORG

Entity 6: Calgary - GPE

Entity 7: Thunderdome - GPE

Entity 8: Aunty - NORP

Entity 9: Silence of the Lambs - WORK_OF_ART

Entity 10: three - CARDINAL

Entity 11: second - ORDINAL

Entity 12: Pink Floyd's - ORG

Entity 13: Kinjo - ORG

Entity 14: Peter Kinjo - PERSON

Entity 15: that day - DATE

Entity 16: Peter - PERSON

Entity 17: one - CARDINAL

Entity 18: Edo Japan - ORG

Entity 19: Vancouver - GPE

Entity 20: Saskatoon - GPE

Entity 21: over a hundred - CARDINAL

Entity 22: Peter - PERSON

Entity 23: Calgary - GPE

Entity 24: Canada - GPE

Entity 25: China - GPE

Entity 26: Canada - GPE

Entity 27: Kinjo - ORG

Entity 28: one - CARDINAL

Entity 29: Lucile Ball - FAC

Entity 30: Laser Floyd - ORG

Entity 31: Teletubbie - PERSON

Entity 32: Guangzhou - GPE

Entity 33: Kinjo - GPE

Entity 34: twenty - CARDINAL

Entity 35: Kinjo - ORG

Entity 36: three - CARDI

What about token-level entity analysis?

In [15]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(
    zip(token_text, token_entity_type, token_entity_iob),
    columns=['token_text', 'entity_type', 'inside_outside_begin']
    )

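| | token_text | entity_type | inside_outside_begin |
|---|---|---|---|
| 0 | Kinjo | ORG | B |
| 1 | is | | O |
| 2 | a | | O |
| 3 | breath | | O |
| 4 | of | | O |
| 5 | fresh | | O |
| 6 | sushi | | O |
| 7 | , | | O |
| 8 | the | | O |
| 9 | first | ORDINAL | B |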
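The _inside/outside/begin_ tags are enough to rebuild multi-token entities from token-level data. The sketch below shows one way to do that in plain Python. The token list is a hand-written fragment consistent with the entity listing above (_Kinjo_ as `ORG`, _900 km_ as `QUANTITY`), and `iob_to_spans` is a hypothetical helper, not a spaCy function.

```python
def iob_to_spans(tokens, iob_tags, entity_types):
    """
    group tokens into (entity text, entity type) pairs:
    'B' begins an entity, 'I' continues it, 'O' is outside any entity
    """
    spans = []
    current_tokens, current_type = [], None

    for token, iob, entity_type in zip(tokens, iob_tags, entity_types):
        if iob == 'I' and current_tokens:
            # continue the entity that is currently open
            current_tokens.append(token)
            continue

        # any other tag closes the open entity, if there is one
        if current_tokens:
            spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None

        if iob == 'B':
            current_tokens, current_type = [token], entity_type

    # close an entity that runs to the end of the text
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))

    return spans

# hand-written fragment for illustration
tokens = ['Kinjo', 'is', '900', 'km', 'away']
iob_tags = ['B', 'O', 'B', 'I', 'O']
entity_types = ['ORG', '', 'QUANTITY', 'QUANTITY', '']

spans = iob_to_spans(tokens, iob_tags, entity_types)
print(spans)
```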


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?
- stopword
- punctuation
- whitespace
- represents a number
- whether or not the token is included in spaCy's default vocabulary?

In [16]:
token_attributes = [
        (
        token.orth_,
        token.prob,
        token.is_stop,
        token.is_punct,
        token.is_space,
        token.like_num,
        token.is_oov
        ) for token in parsed_review
    ]

df = pd.DataFrame(
    token_attributes,
    columns=[
        'text',
        'log_probability',
        'stop?',
        'punctuation?',
        'whitespace?',
        'number?',
        'out of vocab.?'
        ]
    )

df.loc[:, 'stop?':'out of vocab.?'] = (
    df
    .loc[:, 'stop?':'out of vocab.?']
    .applymap(lambda x: 'Yes' if x else '')
    )
                                               
df

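| | text | log_probability | stop? | punctuation? | whitespace? | number? | out of vocab.? |
|---|---|---|---|---|---|---|---|
| 0 | Kinjo | -18.924250 | | | | | |
| 1 | is | -4.457749 | Yes | | | | |
| 2 | a | -3.929788 | Yes | | | | |
| 3 | breath | -10.806992 | | | | | |
| 4 | of | -4.275874 | Yes | | | | |
| 5 | fresh | -10.332080 | | | | | |
| 6 | sushi | -12.298863 | | | | | |
| 7 | , | -3.454960 | | Yes | | | |
| 8 | the | -3.528767 | Yes | | | | |
| 9 | first | -7.063717 | Yes | | | | |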


If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box.

I think it will eventually become a core part of the Python data science ecosystem &mdash; it will do for natural language computing what other great libraries have done for numerical computing.

## Phrase Modeling

_Phrase modeling_ is an approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that _co-occur_ (i.e., appear one after another) much more frequently than you would expect by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:
* $count(A)$ is the number of times token $A$ appears in the corpus
* $count(B)$ is the number of times token $B$ appears in the corpus
* $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus *in order*
* $N$ is the total size of the corpus vocabulary
* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
* $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
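To get a feel for the formula, here is a direct transcription into Python with made-up counts. The function name `phrase_score` and all of the numbers (including the threshold) are assumptions chosen for illustration, not values taken from the Yelp corpus.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """score a candidate phrase 'A B' using the formula above"""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# illustrative settings, not tuned values
threshold = 10.0
min_count = 5
vocab_size = 100_000

# two tokens that usually appear together
strong = phrase_score(
    count_a=1_000, count_b=800, count_ab=500,
    vocab_size=vocab_size, min_count=min_count
    )

# two equally frequent tokens that rarely appear together
weak = phrase_score(
    count_a=1_000, count_b=800, count_ab=6,
    vocab_size=vocab_size, min_count=min_count
    )

print(f'strong pair: {strong:.3f}, accepted: {strong > threshold}')
print(f'weak pair: {weak:.3f}, accepted: {weak > threshold}')
```

Note how subtracting $count_{min}$ pushes the score of rare pairs toward zero (or below it), so a pair has to occur more than $count_{min}$ times before it can clear any positive threshold.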

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so _new york_ would become *new\_york*). But you would also expect multi-word expressions that represent common concepts but aren't specifically named entities (such as _happy hour_) to become phrases in the model.
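The merging step itself can be sketched in a few lines of plain Python. This is a simplified stand-in for what a trained phrase model does when applied to new text, assuming the set of accepted phrases is already known; `merge_phrases` and `learned_phrases` are hypothetical names, and gensim's real implementation differs in its details.

```python
def merge_phrases(tokens, phrases, delimiter='_'):
    """
    scan left to right and join any adjacent pair of tokens
    that appears in the set of accepted phrases
    """
    merged = []
    i = 0

    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])

        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1

    return merged

# pretend these pairs scored above the threshold during training
learned_phrases = {('new', 'york'), ('happy', 'hour')}

tokens = 'we love happy hour in new york'.split()
merged_tokens = merge_phrases(tokens, learned_phrases)

print(' '.join(merged_tokens))
```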

We turn to the indispensable [**gensim**](https://radimrehurek.com/gensim/index.html) library to help us with phrase modeling &mdash; the [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class in particular.

In [17]:
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment text of complete reviews into sentences & normalize text
1. First-order phrase modeling $\rightarrow$ _apply first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply second-order phrase model to transform sentences_
1. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization.

In [18]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def pronoun_lemmatize(token):
    """
    helper function to preserve pronouns and force lowercasing while lemmatizing
    """
    
    if token.lemma_ == '-PRON-':
        return token.lower_
    
    else:
        return token.lemma_.lower()

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with open(filename) as f:
        for review in f:
            yield review.replace('\\n', '\n')

Next, we will use spaCy to:

- Iterate over the 4.2M reviews in the `review_text_all.txt` file we created before
- Segment the reviews into individual sentences
- Remove punctuation and excess whitespace
- Lemmatize the text

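... and do so in large batches, thanks to spaCy's `nlp.pipe()` function, which streams texts through the pipeline rather than parsing one review at a time. (The `n_threads` argument in the cell below was originally meant to enable multithreading; in spaCy v2.1 and later it is deprecated and has no effect.) We'll write this data back out to a new file (`sentence_lemmatized_all.txt`), with one normalized sentence per line. During the process, we'll also preprocess the text from the full, non-sentence-segmented reviews the same way, and save it in a file called `review_lemmatized_all.txt`.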

We'll use all of this data later for learning models.

In [19]:
review_lemmatized_filepath = os.path.join(scratch_directory, 'review_lemmatized_all.txt')
sentence_lemmatized_filepath = os.path.join(scratch_directory, 'sentence_lemmatized_all.txt')

>⚠️ **Heads-up:** if you want to re-run the text preprocessing yourself, the next cell, which sentencizes and lemmatizes all the restaurant review text in the Yelp dataset, took me about **12 hours** to run.

In [20]:
# this is a bit time consuming - set execute = True
# if you want to execute data prep yourself.

execute = False

if execute:

    with open(review_lemmatized_filepath, 'w') as review_file:
        with open(sentence_lemmatized_filepath, 'w') as sentence_file:
            
            pipe = nlp.pipe(
                line_review(review_txt_filepath),
                batch_size=10000,
                n_threads=8
                )
            
            for parsed_review in pipe:
                
                # lemmatize the text of the review, removing punctuation and whitespace
                lemmatized_review = ' '.join([
                    pronoun_lemmatize(token)
                    for token in parsed_review
                    if not punct_space(token)
                    ])
                
                # save the text from each lemmatized review as a new line in a file
                review_file.write(lemmatized_review + '\n')
        
                # iterate over each sentence in the review
                for sent in parsed_review.sents:
                    
                    # lemmatize the text of each sentence
                    lemmatized_sentence = ' '.join([
                        pronoun_lemmatize(token)
                        for token in sent
                        if not punct_space(token)
                        ])
                    
                    # save the text from each lemmatized sentence as a new line in a file
                    sentence_file.write(lemmatized_sentence + '\n')

If your data is organized like our `sentence_lemmatized_all` file now is &mdash; a large text file with one document/sentence per line &mdash; gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It *streams* the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
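To show what "streaming from disk" means in practice, here is a stripped-down sketch of a re-iterable, line-at-a-time corpus reader. It is not gensim's implementation; `LineTokens` and the demo file are hypothetical, and the two demo sentences are borrowed from the lemmatized output shown in this notebook.

```python
import os
import tempfile

class LineTokens:
    """
    minimal streaming corpus: each pass re-opens the file
    and yields one whitespace-tokenized line at a time
    """

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, encoding='utf-8') as f:
            for line in f:
                yield line.split()

with tempfile.TemporaryDirectory() as tmp_dir:

    # write a tiny demo corpus, one sentence per line
    demo_filepath = os.path.join(tmp_dir, 'demo_sentences.txt')
    with open(demo_filepath, 'w', encoding='utf-8') as f:
        f.write('service be non existent at best\n')
        f.write('there be even a ms pac man\n')

    corpus = LineTokens(demo_filepath)

    # the object can be iterated more than once, unlike a bare generator
    first_pass = list(corpus)
    second_pass = list(corpus)

print(first_pass)
print(first_pass == second_pass)
```

Making the corpus a class with `__iter__`, rather than a one-shot generator, matters because a model may need to make several passes over the same data.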

In [21]:
sentences_unigrams = LineSentence(sentence_lemmatized_filepath)

Let's take a look at a few sample sentences in our new, transformed file.

In [22]:
for sentence_unigrams in it.islice(sentences_unigrams, 60, 70):
    print(' '.join(sentence_unigrams))
    print('')

like walk back in time every saturday morning my sister and i be in a bowling league and after we be do we would spend a few quarter play the pin ball machine until our mother come to pick us up

my sister be dare and play the machine hard she be afraid of that tilt show up and freeze the game

i on the other hand be a bit more gentle and want to make sure i get my quarter 's worth

this place have row and row of machine some be really old and some be more of a mid 80 's theme

there be even a ms pac man

it be fun to spend an afternoon play the machine and remember all the fun of my early teen year

walk in around 4 on a friday afternoon we sit at a table just off the bar and walk out after 5 min or so

do not even think they realize we walk in

however everyone at the bar notice we walk in

service be non existent at best





Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "`pin ball`", to be linked together to form a new, single token: "`pin_ball`".

In [23]:
bigram_model_filepath = os.path.join(scratch_directory, 'bigram_phrase_model')

>⚠️ **Heads-up:** if you want to re-train the phrase model yourself, the next cell took me about **12 minutes** to run.

In [24]:
# this is a bit time consuming - set execute = True
# if you want to execute modeling yourself.

execute = False

if execute:

    bigram_phrases = Phrases(sentences_unigrams)
    
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    bigram_phrases = Phraser(bigram_phrases)
    bigram_phrases.save(bigram_model_filepath)

In [25]:
# load the finished model from disk
bigram_phrases = Phraser.load(bigram_model_filepath)

Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.

In [26]:
sentences_bigrams_filepath = os.path.join(scratch_directory, 'sentence_bigram_phrases_all.txt')

>⚠️ **Heads-up:** if you want to re-run the text preprocessing yourself, the next cell took me about **17 minutes** to run.

In [27]:
# this is a bit time consuming - set execute = True
# if you want to execute data prep yourself.

execute = False

if execute:

    with open(sentences_bigrams_filepath, 'w') as f:
        
        for sentence_unigrams in sentences_unigrams:
            
            sentence_bigrams = ' '.join(bigram_phrases[sentence_unigrams])
            
            f.write(sentence_bigrams + '\n')

In [28]:
sentences_bigrams = LineSentence(sentences_bigrams_filepath)

In [29]:
for sentence_bigrams in it.islice(sentences_bigrams, 60, 70):
    print(' '.join(sentence_bigrams))
    print('')

like walk back in time every saturday_morning my sister and i be in a bowling_league and after we be do we would spend a few quarter play the pin_ball machine until our mother come to pick us up

my sister be dare and play the machine hard she be afraid of that tilt show up and freeze the game

i on the other hand be a bit more gentle and want to make sure i get my quarter 's worth

this place have row and row of machine some be really old and some be more of a mid 80 's theme

there be even a ms_pac man

it be fun to spend an afternoon play the machine and remember all the fun of my early teen year

walk in around 4 on a friday afternoon we sit at a table just off the bar and walk out after 5 min or so

do not even think they realize we walk in

however everyone at the bar notice we walk in

service be non_existent at best



Looks like the phrase modeling worked! We now see two-word phrases, such as "`pin_ball`" and "`saturday_morning`", linked together in the text as a single token. Next, we'll train a _second-order_ phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like "`ms_pac man`" will become fully joined to "`ms_pac_man`".

In [30]:
trigram_model_filepath = os.path.join(scratch_directory, 'trigram_phrase_model')

In [31]:
# this is a bit time consuming - set execute = True
# if you want to execute modeling yourself.

execute = False

if execute:

    trigram_phrases = Phrases(sentences_bigrams)
    
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    trigram_phrases = Phraser(trigram_phrases)
    trigram_phrases.save(trigram_model_filepath)

In [32]:
# load the finished model from disk
trigram_phrases = Phraser.load(trigram_model_filepath)

We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.

In [33]:
sentences_trigrams_filepath = os.path.join(scratch_directory, 'sentence_trigram_phrases_all.txt')

In [34]:
# this is a bit time consuming - set execute = True
# if you want to execute data prep yourself.

execute = False

if execute:

    with open(sentences_trigrams_filepath, 'w') as f:
        
        for sentence_bigrams in sentences_bigrams:
            
            sentence_trigrams = ' '.join(trigram_phrases[sentence_bigrams])
            
            f.write(sentence_trigrams + '\n')

In [35]:
sentences_trigrams = LineSentence(sentences_trigrams_filepath)

In [36]:
for sentence_trigrams in it.islice(sentences_trigrams, 60, 70):
    print(' '.join(sentence_trigrams))
    print('')

like walk back in time every saturday_morning my sister and i be in a bowling_league and after we be do we would spend a few quarter play the pin_ball_machine until our mother come to pick us up

my sister be dare and play the machine hard she be afraid of that tilt show up and freeze the game

i on the other hand be a bit more gentle and want to make sure i get my quarter 's worth

this place have row and row of machine some be really old and some be more of a mid 80 's theme

there be even a ms_pac_man

it be fun to spend an afternoon play the machine and remember all the fun of my early teen year

walk in around 4 on a friday_afternoon we sit at a table just off the bar and walk out after 5 min or so

do not even think they realize we walk in

however everyone at the bar notice we walk in

service be non_existent at best



Looks like the second-order phrase model was successful. We're now seeing three-word phrases, such as "`pin_ball_machine`" and "`ms_pac_man`".

The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.

In addition, we'll remove stopwords at this point. _Stopwords_ are very common words, like _a_, _the_, _and_, and so on, that serve functional roles in natural language, but typically don't contribute to the overall meaning of text. Filtering stopwords is a common procedure that allows higher-level NLP modeling techniques to focus on the words that carry more semantic weight.
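Stopword filtering amounts to a set-membership test on each token. A minimal sketch, using a tiny hand-picked stopword set and a made-up sentence rather than spaCy's full English list:

```python
# hypothetical, deliberately tiny stopword list for illustration
stop_words = {'a', 'the', 'and', 'be'}

# made-up lemmatized, phrase-merged text
tokens = 'the happy_hour be a great deal and the service be fast'.split()

filtered_tokens = [token for token in tokens if token not in stop_words]

print(' '.join(filtered_tokens))
```

Because the phrase models run before this step, a merged token such as `happy_hour` is compared against the stopword list as a single unit, so phrases survive even if one of their component words would have been filtered on its own.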

Finally, we'll write the transformed text out to a new file, with one review per line.

In [37]:
review_trigrams_filepath = os.path.join(scratch_directory, 'review_trigrams_all.txt')

>⚠️ **Heads-up:** if you want to re-run the text preprocessing yourself, the next cell took me about **30 minutes** to run.

In [38]:
# this is a bit time consuming - set execute = True
# if you want to execute data prep yourself.

execute = False

if execute:
    
    reviews_lemmatized = LineSentence(review_lemmatized_filepath)

    with open(review_trigrams_filepath, 'w') as f:
        
        for review_unigrams in reviews_lemmatized:
                        
            # apply the first-order and second-order phrase models
            review_bigrams = bigram_phrases[review_unigrams]
            review_trigrams = trigram_phrases[review_bigrams]

            # remove any remaining stopwords
            review_trigrams = [
                term
                for term in review_trigrams
                if term not in nlp.Defaults.stop_words
                ]

            # write the transformed review as a line in the new file
            review_trigrams = ' '.join(review_trigrams)
            f.write(review_trigrams + '\n')

Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.

In [39]:
review_num = 20

print('Original:' + '\n')

for review in it.islice(line_review(review_txt_filepath), review_num, review_num+1):
    print(review)

print('----' + '\n')
print('Transformed:' + '\n')

with open(review_trigrams_filepath) as f:
    for review in it.islice(f, review_num, review_num+1):
        print(review)

Original:

ended up here because Raku was closed and it received great ratings on Yelp.  I'm so glad I came here.  One of the better meals I've had.  Started off with the mushroom dish and the lettuce wrap.  both were amazing. the lettuce wrap is like having a flavor party in your mouth.  also had the panang duck which was terrific. highly recommend all three dishes. one dish that wasn't so good was the seabass with drunken noodles. overall it was an excellent meal, intimate setting, and great service. definitely will be back.

----

Transformed:

end raku closed receive great rating yelp glad come meal start mushroom dish lettuce_wrap amazing lettuce_wrap like flavor party mouth panang duck terrific highly_recommend dish dish good seabass drunken_noodle overall excellent meal intimate_setting great service definitely



You can see that most of the grammatical structure has been scrubbed from the text &mdash; capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic *meaning* is still present. Also, multi-word concepts such as "`lettuce_wrap`" and "`intimate_setting`" have been joined into single tokens, as expected. The review text is now ready for topic modeling. 

## Topic Modeling with Latent Dirichlet Allocation (_LDA_)

*Topic modeling* is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using [*Latent Dirichlet Allocation*](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) or LDA, a popular approach to topic modeling.

In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a *vector* of token counts. There are two layers in this model &mdash; documents and tokens &mdash; and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:
* Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully independent from each other &mdash; there's no sense of connection between related tokens, such as _knife_ and _fork_.
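The disadvantages listed above are easy to see on a toy corpus. The following is a minimal sketch using only the standard library; the three "documents" are made up for illustration, and `count_vector` is a hypothetical helper, not part of gensim.

```python
from collections import Counter

docs = [
    'the steak be cook medium_rare'.split(),
    'great sushi and fresh fish'.split(),
    'the knife and fork be dirty'.split(),
]

# the vocabulary is every distinct token in the corpus, in a fixed order
vocabulary = sorted({token for doc in docs for token in doc})


def count_vector(doc):
    """Return one count per vocabulary token (mostly zeros)."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


vectors = [count_vector(doc) for doc in docs]

for vector in vectors:
    print(f'{len(vector)} dimensions, {vector.count(0)} of them zero')

# 'knife' and 'fork' land in unrelated dimensions, even though
# the words themselves are closely related
print(vocabulary.index('knife'), vocabulary.index('fork'))
```

Even with three short documents, more than half of each vector is zero. With a real vocabulary of tens of thousands of tokens, the effect is far more pronounced.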

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of *topics*, and the *topics* are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.

![LDA](https://s3.amazonaws.com/skipgram-images/LDA.png)
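To get a feel for how a Dirichlet prior encourages documents to consist of "a handful of topics", the sketch below draws samples from a symmetric Dirichlet distribution using only the standard library. The concentration values (`0.1` and `10.0`), the topic count, and the helper names are illustrative assumptions, not settings taken from our model.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of the given size."""
    # normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


rng = random.Random(0)
num_topics = 10


def mean_largest_share(alpha, trials=500):
    """Average weight of the single largest topic across many samples."""
    return sum(
        max(sample_dirichlet(alpha, num_topics, rng))
        for _ in range(trials)
    ) / trials


sparse_share = mean_largest_share(alpha=0.1)
even_share = mean_largest_share(alpha=10.0)

print(f'alpha=0.1  -> largest topic holds {sparse_share:.2f} of the mixture on average')
print(f'alpha=10.0 -> largest topic holds {even_share:.2f} of the mixture on average')
```

A small concentration parameter puts most of the probability mass on one or two topics per sample, while a large one spreads it almost evenly. The first behavior is the one the LDA prior relies on.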

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its [**LdaMulticore**](https://radimrehurek.com/gensim/models/ldamulticore.html) class.

In [40]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's [**Dictionary**](https://radimrehurek.com/gensim/corpora/dictionary.html) class for this.

In [41]:
dictionary_filepath = os.path.join(scratch_directory, 'trigram_dict_all.dict')

In [42]:
# this is a bit time consuming - set execute = True
# if you want to learn the dictionary yourself.

execute = False

if execute:

    reviews_trigrams = LineSentence(review_trigrams_filepath)

    # learn the dictionary by iterating over all of the reviews
    dictionary_trigrams = Dictionary(reviews_trigrams)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    dictionary_trigrams.filter_extremes(no_below=20, no_above=0.4)
    dictionary_trigrams.compactify()

    dictionary_trigrams.save(dictionary_filepath)    
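The `filter_extremes` call above prunes the vocabulary by *document frequency*: `no_below` is a minimum number of documents a token must appear in, and `no_above` is a maximum *fraction* of documents. Here is a minimal sketch of that rule on a made-up corpus. `filter_by_document_frequency` is a hypothetical stand-in, not gensim's code, and it ignores gensim's additional `keep_n` cap.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """
    Keep tokens that appear in at least `no_below` documents
    and in no more than `no_above` (a fraction) of all documents.
    """
    # count each token once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {
        token
        for token, freq in doc_freq.items()
        if no_below <= freq <= max_docs
    }


docs = [
    ['good', 'steak', 'service'],
    ['good', 'sushi', 'service'],
    ['good', 'steak', 'xyzzy'],
    ['good', 'pizza'],
    ['good', 'sushi'],
]

kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus, *good* is dropped for appearing in every document, and *xyzzy* and *pizza* are dropped for appearing in only one.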

In [43]:
# load the finished dictionary from disk
dictionary_trigrams = Dictionary.load(dictionary_filepath)


Like many NLP techniques, LDA uses a simplifying assumption known as the [*bag-of-words* model](https://en.wikipedia.org/wiki/Bag-of-words_model). In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded. 

Using the gensim Dictionary we learned, we can generate a bag-of-words representation for each review. The `bow_generator` function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the following code, "bag-of-words" is abbreviated as `bow`.
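Before we build the real thing, here is a minimal sketch of what a bag-of-words conversion does. The `token2id` mapping is a made-up stand-in for a learned Dictionary, and `to_bow` is a hypothetical helper that mimics the shape of gensim's `doc2bow` output: a sorted list of `(token_id, count)` pairs that skips tokens missing from the vocabulary.

```python
from collections import Counter

# hypothetical token -> integer id mapping, standing in for a learned Dictionary
token2id = {'lettuce_wrap': 0, 'amazing': 1, 'dish': 2, 'great': 3}


def to_bow(tokens):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


review = 'lettuce_wrap amazing lettuce_wrap dish dish dish raku'.split()
bow = to_bow(review)

print(bow)
```

Word order is gone, *raku* is dropped because it isn't in the vocabulary, and *great* simply doesn't appear in the output because its count is zero.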

In [44]:
bow_corpus_filepath = os.path.join(scratch_directory, 'bow_trigrams_corpus_all.mm')

In [45]:
def bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield dictionary_trigrams.doc2bow(review)

>⚠️ **Heads-up:** if you want to re-build the bag-of-words corpus yourself, the next cell took me about **5 minutes** to run.

In [46]:
# this is a bit time consuming - set execute = True
# if you want to build the bag-of-words corpus yourself.

execute = False

if execute:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(
        bow_corpus_filepath,
        bow_generator(review_trigrams_filepath)
        )

In [47]:
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(bow_corpus_filepath)

With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to `LdaMulticore` as inputs, along with the number of topics the model should learn. For this demo, we're asking for 50 topics.

In [48]:
lda_model_filepath = os.path.join(scratch_directory, 'lda_model_all')

>⚠️ **Heads-up:** if you want to re-run LDA modeling yourself, the next cell took me about **10 minutes** to run.

In [49]:
# this is a bit time consuming - set execute = True
# if you want to train the LDA model yourself.

execute = False

if execute:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(
            trigram_bow_corpus,
            num_topics=50,
            id2word=dictionary_trigrams,
            workers=7
            )
    
    lda.save(lda_model_filepath)

In [50]:
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [51]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print(f'{"term":20} {"frequency"}' + '\n')

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(f'{term:20} {round(frequency, 3):.3f}')

In [52]:
explore_topic(topic_number=0)

term                 frequency

steak                0.150
cook                 0.029
cut                  0.016
order                0.015
medium               0.014
steakhouse           0.013
medium_rare          0.012
filet                0.012
meat                 0.011
ribeye               0.010
meal                 0.009
like                 0.009
rare                 0.008
house                0.008
bone                 0.007
perfection           0.007
tender               0.007
potato               0.007
butter               0.007
filet_mignon         0.006
dinner               0.006
rib_eye              0.006
prime_rib            0.006
wife                 0.006
philly               0.006


The first topic has strong associations with words like *steak*, *cut*, *medium*, *steakhouse*, and *filet*, as well as a handful of more general words. You might call this the **Steak** topic!

It's possible to go through and inspect each topic in the same way, and try to assign a human-interpretable label that captures the essence of each one. I've given it a shot for all 50 topics below.

In [53]:
topic_names = {
    0: 'steak',
    1: 'menu & ordering',
    2: 'mexican',
    3: 'dessert',
    4: 'vegetarian',
    5: 'buffet',
    6: 'italian',
    7: 'thai',
    8: 'taste',
    9: 'customer service',
    10: 'portions',
    11: 'nightlife',
    12: 'burger & fries',
    13: 'classy ambience', #
    14: 'long wait',
    15: 'chicken',
    16: 'sandwiches',
    17: 'good serivce',
    18: 'vegas hotel',
    19: 'pizza',
    20: 'salad',
    21: 'bar vibe', #
    22: 'meal experience', #
    23: 'slow service',
    24: 'brunch',
    25: 'portion sizes',
    26: 'beer, wings, sports',
    27: 'breakfast',
    28: 'miscellaneous',
    29: 'non-English',
    30: 'deli',
    31: 'barbecue',
    32: 'local business',
    33: 'miscellaneous',
    34: 'hole-in-the-wall',
    35: 'asian',
    36: 'specials',
    37: 'coffeeshop',
    38: 'prices',
    39: 'flavor & texture',
    40: 'noodles',
    41: 'canadian',
    42: 'highly recommended',
    43: 'sushi',
    44: 'ordering',
    45: 'mediterranean',
    46: 'decent value',
    47: 'cleanliness',
    48: 'lobster',
    49: 'seafood'
    }

In [54]:
topic_names_filepath = os.path.join(scratch_directory, 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)

You can see that, along with **steak**, there are a variety of topics related to different styles of food, such as **mexican**, **thai**, **pizza**, **sushi**, and so on. In addition, there are topics that are more related to the overall restaurant *experience*, like **nightlife**, **good service**, **long wait**, and **prices**.

Beyond these two categories, there are still some topics that are difficult to apply a meaningful human interpretation to, such as topics 28 and 33, which I've simply labeled **miscellaneous**.

Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data &mdash; preferably in an interactive format. Fortunately, we have the fantastic [**pyLDAvis**](https://pyldavis.readthedocs.io/en/latest/readme.html) library to help with that!

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [55]:
LDAvis_data_filepath = os.path.join(scratch_directory, 'ldavis_prepared')

In [56]:
# this is a bit time consuming - set execute = True
# if you want to execute data prep yourself.

execute = False

if execute:

    LDAvis_prepared = pyLDAvis.gensim.prepare(
        lda,
        trigram_bow_corpus,
        dictionary_trigrams
        )

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)        

In [57]:
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

`pyLDAvis.display(...)` displays the topic model visualization in-line in the notebook.

In [58]:
pyLDAvis.display(LDAvis_prepared)

### Wait, what am I looking at again?
There are a lot of moving parts in the visualization. Here's a brief summary:

* On the left, there is a plot of the "distance" between all of the topics (labeled as the _Intertopic Distance Map_)
  * The plot is rendered in two dimensions according to a [*multidimensional scaling (MDS)*](https://en.wikipedia.org/wiki/Multidimensional_scaling) algorithm. Topics that are generally similar should appear close together on the plot, while *dis*similar topics should appear far apart.
  * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
  * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
* On the right, there is a bar chart showing top terms.
  * When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's *saliency* is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
  * When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter $\lambda$, which can be adjusted with a slider above the bar chart.
    * Setting the $\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
    * Setting $\lambda$ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic &mdash; i.e., terms that occur *only* in this topic, and do not occur in other topics.
    * Setting $\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
* Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
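The effect of the $\lambda$ slider is easier to see with numbers. The LDAvis authors define the relevance of term $w$ to topic $t$ as $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The sketch below applies that formula to three terms with made-up probabilities; `relevance` and `rank` are hypothetical helpers, not part of pyLDAvis.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """Weighted mix of log-probability within the topic and log-lift."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


# made-up probabilities: (p(term | topic), p(term) across the whole corpus)
terms = {
    'good':   (0.050, 0.0400),   # frequent in the topic, but frequent everywhere
    'steak':  (0.040, 0.0020),
    'ribeye': (0.010, 0.0002),   # rarer, but almost exclusive to the topic
}


def rank(lam):
    """Order the terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda term: -relevance(*terms[term], lam))


for lam in (1.0, 0.6, 0.0):
    print(f'lambda={lam}: {rank(lam)}')
```

At $\lambda = 1$ the generic word *good* tops the list because it is the most probable term in the topic. At $\lambda = 0$ the near-exclusive *ribeye* wins. Intermediate values trade off the two.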

A more detailed explanation of the pyLDAvis visualization can be found [here](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf). Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's `LdaMulticore` object and pyLDAvis' visualization, you have to dig through the terms manually.

### Analyzing our LDA model
The interactive visualization pyLDAvis produces is helpful for both:
1. Better understanding and interpreting individual topics, and
1. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its most frequent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human-interpretable name or "meaning" to each topic.

For (2), exploring the _Intertopic Distance Map_ can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

In our plot, there is a stark divide along the y-axis, with most topics in the upper half of the plot and one outlier, topic 48, far at the bottom. Inspecting the outlier topic provides a plausible explanation: the topic contains many non-English words, while most of the rest of the topics are in English. So, one of the main attributes that distinguish the reviews in the dataset from one another is their language.

This finding isn't entirely a surprise. In addition to English-speaking cities, the Yelp dataset includes reviews of businesses from around the world, sometimes written in other languages. Multiple languages aren't a problem for our demo, but for a real NLP application, you might need to ensure that the text you're processing is written in English (or is at least tagged for language) before passing it along to some downstream processing. If that were the case, the divide along the y-axis in the topic plot would immediately alert you to a potential data quality issue.

In the upper half of the plot, there are two large, distinct groups of topics &mdash; let's call them "super-topics" &mdash; one in the upper-left quadrant and the other in the upper-right quadrant. These super-topics correlate reasonably well with the pattern we'd noticed while naming the topics:

* The super-topic in the upper-_right_ tends to be about *food*. It groups together the **burger & fries**, **mexican**, **seafood**, and **barbecue** topics, among others.
* The super-topic in the upper-_left_ tends to be about other elements of the *restaurant experience*. It groups together the **menu & ordering**, **slow service**, **nightlife**, and **prices** topics, among others.

So, in addition to the 50 direct topics the model has learned, our analysis suggests a higher-level pattern in the data. Restaurant reviewers in the Yelp dataset talk about two main things in their reviews, in general: (1) the food, and (2) their overall restaurant experience. For this dataset, this is a very intuitive result, and we probably didn't need a sophisticated modeling technique to tell it to us. When working with datasets from other domains, though, such high-level patterns may be much less obvious from the outset &mdash; and that's where topic modeling can help.

### Describing text with LDA
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% _Topic A_, 20% _Topic B_, 20% _Topic C_, and 10% _Topic D_.
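In gensim, a trained LDA model returns this mixture as a sparse list of `(topic_id, probability)` pairs, leaving out topics whose probability falls below a minimum threshold. If you want a fixed-length vector instead, for example to feed into another model, a minimal sketch of the conversion looks like this. The mixture values are made up, and `to_dense` is a hypothetical helper.

```python
def to_dense(topic_mixture, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, probability in topic_mixture:
        dense[topic_id] = probability
    return dense


# made-up mixture in the (topic_id, probability) format
review_lda = [(17, 0.42), (20, 0.23), (22, 0.13), (0, 0.05)]

review_vector = to_dense(review_lda, num_topics=50)

print(len(review_vector))
print(sum(review_vector))
```

Because low-probability topics are omitted from the sparse form, the dense vector may sum to somewhat less than 1.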

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
1. Using spaCy to remove punctuation and lemmatize the text
1. Applying our first-order phrase model to join word pairs
1. Applying our second-order phrase model to join longer phrases
1. Removing stopwords
1. Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The `lda_description(...)` function will perform all these steps for us, including printing the resulting topical description of the input text.

In [59]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    
    review = next(
        it.islice(
            line_review(review_txt_filepath),
            review_number,
            review_number+1
            )
        )
    
    return review

In [60]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    review_unigrams = [
        pronoun_lemmatize(token)
        for token in parsed_review
        if not punct_space(token)
        ]
    
    # apply the first-order and second-order phrase models
    review_bigrams = bigram_phrases[review_unigrams]
    review_trigrams = trigram_phrases[review_bigrams]
    
    # remove any remaining stopwords
    review_trigrams = [
        term
        for term in review_trigrams
        if term not in nlp.Defaults.stop_words
        ]
    
    # create a bag-of-words representation
    review_bow = dictionary_trigrams.doc2bow(review_trigrams)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda topic_number_freq: -topic_number_freq[-1])
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print(f'{topic_names[topic_number]:25} {round(freq, 3):.3f}')

In [61]:
sample_review = get_sample_review(0)

print(sample_review)

Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.

Drink prices were pretty good.

The Server, Dawn, was friendly and accommodating. Very happy with her.

In summation, a great pub experience. Would go again!



In [62]:
lda_description(sample_review)

good serivce              0.420
salad                     0.229
meal experience           0.131
beer, wings, sports       0.070
sandwiches                0.066
steak                     0.054


In [63]:
sample_review = get_sample_review(3)

print(sample_review)

This place has gone down hill.  Clearly they have cut back on staff and food quality

Many of the reviews were written before the menu changed.  I've been going for years and the food quality has gone down hill.

The service is slow & my salad, which was $15, was as bad as it gets.

It's just not worth spending the money on this place when there are so many other options.



In [64]:
lda_description(sample_review)

slow service              0.359
prices                    0.273
customer service          0.138
menu & ordering           0.085
vegetarian                0.054
salad                     0.053


## Word Vector Embedding with Word2Vec

Pop quiz! Can you complete this text snippet?

<br><br>

![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)

<br><br><br>
You just demonstrated the core machine learning concept behind word vector embedding models!
<br><br><br>

![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash; they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).
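The sliding-window idea described above can be sketched in a few lines. This only shows how focus words get paired with their context; it does not do any learning, and real implementations add refinements that this sketch leaves out. `context_windows` is a hypothetical helper, and the window width of 2 is chosen just to keep the output small.

```python
def context_windows(tokens, window=2):
    """Yield (focus_word, context_words) for each position in the sentence."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield focus, left + right


sentence = 'service be non_existent at best'.split()

for focus, context in context_windows(sentence, window=2):
    print(f'{focus:15} {context}')
```

Each word takes a turn as the focus word, and words near the edges of the sentence simply get a shorter context.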

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [65]:
from gensim.models import Word2Vec

sentences_trigrams = LineSentence(sentences_trigrams_filepath)
word2vec_filepath = os.path.join(scratch_directory, 'word2vec_model_all')

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twenty epochs.

>⚠️ **Heads-up:** if you want to re-run word2vec modeling yourself, the next cell took me about **3 hours** to run.

In [66]:
# this is a bit time consuming - set execute = True
# if you want to train the word2vec model yourself.

execute = False

if execute:

    # initialize the model and train it for twenty epochs (iter=20)
    food2vec = Word2Vec(
        sentences_trigrams,
        size=100,
        window=5,
        min_count=50,
        sg=1,
        workers=7,
        iter=20
        )
    
    food2vec.save(word2vec_filepath)

In [67]:
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print(f'{food2vec.epochs} training epochs so far.')

20 training epochs so far.



On my eight-core machine, each training epoch over all the text in the ~4 million Yelp reviews takes about 5-10 minutes.

In [68]:
print(f'{len(food2vec.wv.vocab):,} terms in the food2vec vocabulary.')

51,429 terms in the food2vec vocabulary.


Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [69]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [
    (term, voc.index, voc.count)
    for term, voc in food2vec.wv.vocab.items()
    ]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda term_tuple: -term_tuple[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(
    food2vec.wv.vectors_norm[term_indices, :],
    index=ordered_terms
    )

word_vectors

term,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
be,-0.023416,-0.142913,-0.004794,-0.012889,0.065825,0.075074,-0.035807,-0.031397,-0.050686,-0.064565,...,-0.075246,0.090017,-0.092428,0.059273,0.030500,0.098081,-0.111015,-0.110737,-0.088010,-0.063364
the,0.043502,-0.105723,0.052810,-0.078622,-0.042990,-0.000897,0.058548,-0.067336,-0.078492,-0.196075,...,0.043645,0.031939,-0.035012,0.086955,0.021448,0.037390,-0.175062,-0.029105,-0.019383,-0.067769
and,0.006359,-0.216517,0.097626,-0.026131,0.013498,0.145217,0.063160,-0.058840,-0.007965,-0.029374,...,0.007195,0.010971,-0.082707,0.122966,0.004408,0.042919,-0.113115,-0.184433,-0.019660,0.036782
i,0.034299,-0.163179,-0.078324,-0.020232,0.111377,0.033043,0.019029,-0.038514,-0.051655,-0.088047,...,0.089193,-0.139334,-0.056023,-0.122086,0.128636,0.064481,-0.126704,-0.036542,0.001247,0.074245
a,-0.171251,-0.164302,0.025663,0.020231,-0.101620,0.047604,0.048166,-0.057027,-0.035873,-0.280124,...,0.102896,-0.024255,-0.046795,0.083137,-0.013577,-0.012463,-0.041822,-0.149201,0.127688,-0.011571
to,-0.062844,-0.025136,0.064782,0.070788,-0.041678,-0.018272,0.008294,-0.027996,-0.141299,-0.025855,...,0.017984,-0.103828,-0.001059,-0.049926,-0.034160,-0.072268,-0.108650,-0.024590,0.055596,0.060642
it,0.064423,-0.232705,-0.035534,0.052913,0.108685,0.067501,-0.113117,0.037841,-0.015733,-0.089063,...,-0.046577,-0.101408,0.006010,0.103724,-0.067950,0.062719,-0.043278,-0.100272,0.024008,0.034564
have,-0.018352,-0.074112,-0.052361,0.096576,-0.160598,0.215235,-0.101649,-0.113178,0.005519,-0.132883,...,0.098933,0.029703,-0.124155,0.167376,0.002192,0.090882,-0.139284,-0.070572,-0.041808,0.175469
of,-0.107046,-0.101431,0.102965,0.060700,-0.045057,0.098876,0.283991,-0.011696,-0.015677,-0.077917,...,0.121449,-0.003703,-0.101046,0.117000,0.020804,0.056932,-0.199627,0.011778,0.037286,-0.138425
not,-0.057087,-0.245522,-0.070689,0.137021,0.036406,-0.017087,-0.151932,-0.061277,0.076573,0.076463,...,0.031787,-0.213293,-0.033646,-0.083094,0.003777,0.027993,-0.048434,-0.210237,-0.167275,0.049110


Holy wall of numbers! This DataFrame has 51,429 rows &mdash; one for each term in the vocabulary &mdash; and 100 columns. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [70]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):

        print(f'{word:20} {round(similarity, 3)}')
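
To get a feel for what a lookup like this is doing, here is a minimal sketch of ranking terms by cosine similarity. It assumes nothing about gensim's internals: the vectors are hypothetical 3-dimensional toys invented for illustration (the real model uses 100 dimensions), and `toy_most_similar` is a made-up helper name.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only;
# the real food2vec vectors have 100 dimensions.
toy_vectors = {
    "pasta":     [0.9, 0.1, 0.0],
    "spaghetti": [0.8, 0.2, 0.1],
    "burger":    [0.1, 0.9, 0.2],
    "coffee":    [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_most_similar(token, vectors, topn=2):
    """Rank every other term by its cosine similarity to token."""
    query = vectors[token]
    scored = [
        (word, cosine_similarity(query, vector))
        for word, vector in vectors.items()
        if word != token
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

for word, similarity in toy_most_similar("pasta", toy_vectors):
    print(f"{word:20} {round(similarity, 3)}")
```

Cosine similarity only cares about the direction of the vectors, not their length, which is why two terms can score close to 1.0 even if one of them is much more frequent than the other.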

### What things are like McDonald's?

In [71]:
get_related_terms("mcdonald_'s")

mcdonalds            0.986
mcdonald             0.951
mcd_'s               0.948
wendy_'s             0.938
mcd                  0.934
mcds                 0.924
mc_donald_'s         0.913
bk                   0.9
wendys               0.897
mc_donalds           0.896


The model has learned that fast food restaurants are similar to each other! In particular, *wendy's* and *bk*, short for Burger King, are similar to McDonald's, according to this dataset. In addition, the model has found that alternate spellings for McDonald's are probably related, such as *mcd's*.

### When is happy hour?

In [72]:
get_related_terms('happy_hour', topn=15)

hh                   0.929
happy_hr             0.906
reverse_happy_hour   0.888
happy_hour-          0.849
happy_hour_3_6pm     0.807
hooch_hour           0.801
happy_hours          0.795
reverse_hh           0.787
4_6pm                0.782
3_6pm                0.747
7pm                  0.739
4pm-7pm              0.722
3pm-6pm              0.719
m_f                  0.707
3pm-7pm              0.694


The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look &mdash; the model has turned up phrases like *3-6pm*, *4-7pm*, and *m-f*, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, and what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays.

### Let's make pasta tonight. Which style do you want?

In [73]:
get_related_terms('pasta', topn=20)

rigatoni             0.854
penne                0.848
fettuccini           0.843
fettuccine           0.842
bolognese            0.841
spaghetti            0.837
fettucini            0.833
gnocci               0.83
linguini             0.827
linguine             0.826
manicotti            0.822
pastas               0.818
cavatelli            0.817
lasagne              0.816
lasagna              0.815
gnocchi              0.814
tortellini           0.813
angel_hair_pasta     0.811
bucatini             0.807
tagliatelle          0.806


## Word algebra!
No self-respecting word2vec demo would be complete without a healthy dose of *word algebra*, also known as *analogy completion*.

The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Provide a set of words or phrases that you'd like to add or subtract.
1. Look up the vectors that represent those terms in the word vector model.
1. Add and subtract those vectors to produce a new, combined vector.
1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
1. Return the word(s) associated with the similar vector(s).
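
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not gensim's implementation (which may differ in details such as how it normalizes vectors): the 2-dimensional vectors are hypothetical and hand-picked, and `toy_word_algebra` is an invented name.

```python
import math

# Hypothetical 2-dimensional vectors, invented for illustration:
# axis 0 loosely means "is a meal", axis 1 loosely means "late in the day".
toy_vectors = {
    "breakfast": [1.0, 0.0],
    "lunch":     [1.0, 0.2],
    "dinner":    [1.0, 0.9],
    "morning":   [0.1, 0.0],
    "day":       [0.1, 0.2],
    "night":     [0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_word_algebra(add, subtract=(), topn=1, vectors=toy_vectors):
    """Add and subtract term vectors, then return the nearest other terms."""
    dims = len(next(iter(vectors.values())))
    combined = [0.0] * dims

    # steps 2 and 3: look up each vector and add or subtract it
    for word in add:
        combined = [c + v for c, v in zip(combined, vectors[word])]
    for word in subtract:
        combined = [c - v for c, v in zip(combined, vectors[word])]

    # steps 4 and 5: rank the remaining terms by cosine similarity
    inputs = set(add) | set(subtract)
    scored = [
        (word, cosine_similarity(combined, vector))
        for word, vector in vectors.items()
        if word not in inputs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

for term, similarity in toy_word_algebra(add=["lunch", "night"], subtract=["day"]):
    print(term)
```

Note that the input terms are excluded from the candidates; otherwise the nearest neighbor of a combined vector would often just be one of the words you started with.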

But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [74]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.wv.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print(term)

### breakfast + lunch = ?
Let's start with a softball.

In [75]:
word_algebra(add=['breakfast', 'lunch'], topn=2)

bfast
brunch


OK, so the model knows that *brunch* is a combination of *breakfast* and *lunch*. What else?

### lunch - day + night = ?

In [76]:
word_algebra(add=['lunch', 'night'], subtract=['day'])

dinner


Now we're getting a bit more nuanced. The model has discovered that:
- Both *lunch* and *dinner* are meals
- The main difference between them is time of day
- Day and night are times of day
- Lunch is associated with day, and dinner is associated with night

What else?

### taco - mexican + chinese = ?

In [77]:
word_algebra(add=['taco', 'chinese'], subtract=['mexican'])

dumpling


Here's an entirely new and different type of relationship that the model has learned.
- It knows that tacos are a characteristic example of Mexican food
- It knows that Mexican and Chinese are both styles of food
- If you subtract *Mexican* from *taco*, you're left with something like the concept of a _"characteristic type of food"_, which is represented as a new vector
- If you add that new _"characteristic type of food"_ vector to Chinese, you get *dumpling*.

What else?

### bun - american + mexican = ?

In [78]:
word_algebra(add=['bun', 'mexican'], subtract=['american'], topn=3)

corn_tortilla
tortilla
flour_tortilla


The model knows that both *buns* and *tortillas* are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.

What else?

### filet mignon - beef + seafood = ?

In [79]:
word_algebra(add=['filet', 'seafood'], subtract=['beef'])

lobster_tail


The model has learned a concept of *delicacy*. If you take filet and subtract beef from it, you're left with a vector that roughly corresponds to delicacy. If you add the delicacy vector to *seafood*, you get *lobster tail*.

What else?

### coffee - drink + snack = ?

In [80]:
word_algebra(add=['coffee', 'snack'], subtract=['drink'])

pastry


The model knows that if you're on your coffee break, but instead of drinking something, you're eating something... that thing is most likely a pastry.

What else?

### McDonald's + fine dining = ?

In [81]:
word_algebra(add=["mcdonald_'s", 'fine_dining'])

denny_'s


Touché. It makes sense, though. The model has learned that both McDonald's and Denny's are large chains, and that both serve fast, casual, American-style food. But Denny's has some elements that are slightly more upscale, such as printed menus and table service. Fine dining, indeed.

*What if we keep going?*

### Denny's + fine dining = ?

In [82]:
word_algebra(add=["denny_'s", 'fine_dining'], topn=2)

dennys
tgi_friday


This seems like a good place to land... what if we explore the vector space around *TGI Friday* a bit, in a few different directions? Let's see what we find.

#### TGI Friday + italian = ?

In [83]:
word_algebra(add=['tgi_friday', 'italian'])

olive_garden


#### TGI Friday + pancakes = ?

In [84]:
word_algebra(add=['tgi_friday', 'pancakes'])

ihop


#### TGI Friday + pizza = ?

In [85]:
word_algebra(add=['tgi_friday', 'pizza'])

pizza_hut


You could do this all day.

## Word Vector Visualization with t-SNE

[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf), or *t-SNE* for short, is a dimensionality reduction technique that helps with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a two- or three-dimensional representation such that the relative distances between nearby points are preserved as closely as possible.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its [TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class.
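
Before running t-SNE on real word vectors, it can help to try it on something tiny. The sketch below is an assumption-laden toy, not part of the original pipeline: the data is synthetic (two invented clusters of 100-dimensional points), and the `perplexity`, `init`, and `random_state` values are illustrative choices rather than tuned settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the word vectors: two well-separated clusters
# of 100-dimensional points (invented data, for illustration only).
rng = np.random.RandomState(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 100))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(20, 100))
points = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples;
# random_state makes the layout repeatable from run to run.
toy_tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
embedded = toy_tsne.fit_transform(points)

# one row per input point, one column per output dimension
print(embedded.shape)
```

t-SNE is stochastic, so without a fixed `random_state` the same input can produce a differently rotated or rearranged layout each time you run it.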

In [86]:
from sklearn.manifold import TSNE

Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:
1. Drop stopwords &mdash; it's probably not too interesting to visualize *the*, *of*, *or*, and so on
1. Take only the 5,000 most frequent terms in the vocabulary &mdash; no need to visualize all ~50,000 terms right now.

In [87]:
tsne_input = (
    word_vectors
    .drop(nlp.Defaults.stop_words, errors='ignore')
    .head(5000)
    )

tsne_input.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
good,0.054082,-0.074172,0.068609,-0.064248,-0.0586,0.151142,0.08799,0.005272,-0.072053,-0.11467,...,-0.092038,0.017464,0.145565,0.035842,0.023439,0.091448,-0.117423,-0.154449,0.041695,-0.042822
food,0.104257,-0.068298,-0.041186,-0.196919,-0.00842,0.039336,0.042791,-0.098655,-0.14361,0.086271,...,-0.203544,0.074036,-0.017852,0.196475,0.03443,-0.141855,0.017693,-0.065174,0.304121,-0.02814
place,0.11556,-0.131115,-0.083913,-0.016288,0.031346,-0.042556,-0.025425,0.023599,-0.117645,0.141195,...,-0.002841,-0.060658,0.042985,-0.00066,0.062435,0.067196,-0.013473,-0.079611,0.232834,0.045938
order,0.111774,0.027879,0.124779,0.184456,-0.081644,0.142205,0.001316,-0.067226,-0.178887,-0.033263,...,0.146139,0.038975,-0.130769,0.081337,-0.001812,-0.039516,-0.01424,-0.114969,0.016693,0.167479
great,0.088939,0.012518,0.090518,-0.11747,-0.121321,0.141392,0.143355,0.011386,-0.008909,-0.144096,...,-0.023113,-0.04643,0.101145,0.165928,0.01611,0.036126,-0.080476,-0.060198,0.07622,0.074725


In [88]:
tsne_filepath = os.path.join(scratch_directory, 'tsne_model')

tsne_vectors_filepath = os.path.join(scratch_directory, 'tsne_vectors.npy')

In [89]:
# this is a bit time consuming - set execute = True
# if you want to run TSNE modeling yourself.

# pd.np has been removed from recent pandas releases,
# so we import numpy directly
import numpy as np

execute = False

if execute:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    np.save(tsne_vectors_filepath, tsne_vectors)

In [90]:
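# pd.np has been removed from recent pandas releases,
# so we import numpy directly
import numpy as np

with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = np.load(tsne_vectors_filepath)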

tsne_vectors = pd.DataFrame(
    tsne_vectors,
    index=pd.Index(tsne_input.index),
    columns=['x_coord', 'y_coord']
    )

Now we have a two-dimensional representation of our data! Let's take a look.

In [91]:
tsne_vectors.head()

Unnamed: 0,x_coord,y_coord
good,15.73045,21.12479
food,8.418741,40.431412
place,-5.161271,37.085274
order,-31.196054,40.358353
great,9.165603,14.587344


In [92]:
tsne_vectors['word'] = tsne_vectors.index

### Plotting Word Vectors with Bokeh

In [93]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [94]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(
    title='t-SNE Word Embeddings',
    plot_width=800,
    plot_height=800,
    tools=(
        'pan, wheel_zoom, box_zoom,'
        'box_select, reset'
        ),
    active_scroll='wheel_zoom'
    )

# add a hover tool to display words on roll-over
tsne_plot.add_tools(
    HoverTool(tooltips = '@word')
    )

# draw the words as circles on the plot
tsne_plot.circle(
    'x_coord',
    'y_coord',
    source=plot_data,
    color='blue',
    line_alpha=0.2,
    fill_alpha=0.1,
    size=10,
    hover_line_color='black'
    )

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

## Text Categorization with spaCy's `textcat`

First, we'll prepare a bit of training data for classification in spaCy's preferred format.

In [95]:
with open(review_json_filepath) as f:
    
    first_review = next(f)
    
    print(json.loads(first_review))

{'review_id': 'Q1sbwvVQXV2734tPgoKj4Q', 'user_id': 'hG7b0MtEbXx5QzbzE6C_VA', 'business_id': 'ujmEBvifdJM6h6RLv4wQIg', 'stars': 1.0, 'useful': 6, 'funny': 1, 'cool': 0, 'text': 'Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.', 'date': '2013-05-07 04:34:36'}


One of the attributes of reviews in the Yelp dataset is `funny`, which is the number of Yelp users that flagged the review as funny.

We're going to train a text classification model to try to predict whether or not a review is funny. First, we'll collect examples of both funny and *un*funny reviews. We want really good funny reviews, so we'll require our funny reviews to have received more than 5 "funny" votes from Yelp users. Our unfunny reviews must not have received any "funny" votes.

In [96]:
total_examples = 1000

funny_reviews = []
unfunny_reviews = []

with open(review_json_filepath) as f:
    for idx, review in enumerate(f):
        
        review = json.loads(review)
        
        if review['funny'] > 5 and len(funny_reviews) < (total_examples / 2):
           
            funny_reviews.append((
                review['text'], {
                    'cats': {
                        'FUNNY': 1.0,
                        'UNFUNNY': 0.0
                        }
                    }
            ))
            
            continue
            
        if review['funny'] == 0 and len(unfunny_reviews) < (total_examples / 2):
            
            unfunny_reviews.append((
                review['text'], {
                    'cats': {
                        'FUNNY': 0.0,
                        'UNFUNNY': 1.0
                        }
                    }
            ))
            
            continue
            
        if len(funny_reviews) >= (total_examples / 2) and len(unfunny_reviews) >= (total_examples / 2):
            
            break

Let's preview the reviews and spaCy's preferred format for representing labels.

In [97]:
review_text, review_cats = funny_reviews[1]

print(review_text)
print('')
print(review_cats)

If you are looking for the best pierogies in Pittsburgh, this is your place. There are a few small tables outside but most of the business is carry out. Pierogies Plus wins Best Pierogies every year. Why? Because the owner is from Poland and she is making the real deal pierogies. The best part is that they are hand pinched by a group of older Polish and Hungarian women. 
The biggest seller is potato and cheese but they sell many flavors. They are like plump pillows of softness. You can buy them buy the dozen. You can get them cold to take home and freeze or warm and ready to eat. The warm ones are served with butter and onions.  It's definitely a comfort food. The best part is that they ship internationally. Yes, they are that good.

{'cats': {'FUNNY': 1.0, 'UNFUNNY': 0.0}}


In [98]:
review_text, review_cats = unfunny_reviews[1]

print(review_text)
print('')
print(review_cats)

I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!

{'cats': {'FUNNY': 0.0, 'UNFUNNY': 1.0}}


We'll split the data 50/50 into train and test sets, then randomly shuffle each set.

In [99]:
import random

In [100]:
train_data = funny_reviews[:int(len(funny_reviews) / 2)] + unfunny_reviews[:int(len(unfunny_reviews) / 2)]
test_data = funny_reviews[int(len(funny_reviews) / 2):] + unfunny_reviews[int(len(unfunny_reviews) / 2):]

random.shuffle(train_data)
random.shuffle(test_data)
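
One thing to be aware of: `random.shuffle` is unseeded here, so the example order changes on every run. If you want a repeatable split, a seeded helper along these lines is one option. This is a sketch, not the notebook's method; `split_and_shuffle`, its `seed` parameter, and the toy reviews are all hypothetical.

```python
import random

def split_and_shuffle(positives, negatives, train_fraction=0.5, seed=42):
    """Split each class separately so both sets stay balanced, then shuffle."""
    rng = random.Random(seed)  # private generator; leaves global state alone

    n_pos = int(len(positives) * train_fraction)
    n_neg = int(len(negatives) * train_fraction)

    train = positives[:n_pos] + negatives[:n_neg]
    test = positives[n_pos:] + negatives[n_neg:]

    rng.shuffle(train)
    rng.shuffle(test)

    return train, test

# Hypothetical stand-ins for funny_reviews and unfunny_reviews.
toy_funny = [
    (f"funny review {i}", {"cats": {"FUNNY": 1.0, "UNFUNNY": 0.0}})
    for i in range(4)
]
toy_unfunny = [
    (f"unfunny review {i}", {"cats": {"FUNNY": 0.0, "UNFUNNY": 1.0}})
    for i in range(4)
]

toy_train, toy_test = split_and_shuffle(toy_funny, toy_unfunny)
print(len(toy_train), len(toy_test))
```

Splitting each class before combining keeps the funny/unfunny ratio identical in both sets, which matches what the slicing in the cell above does.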

Next, we'll create a new `textcat` model and add it to our existing `nlp` spaCy pipeline.

In [101]:
original_pipe_names = nlp.pipe_names
original_pipe_names

['tagger', 'parser', 'ner']

In [102]:
textcat = nlp.create_pipe(
    'textcat',
    config={'exclusive_classes': True}
    )

textcat.add_label('FUNNY')
textcat.add_label('UNFUNNY')

1

In [103]:
nlp.add_pipe(textcat)

nlp.pipe_names

['tagger', 'parser', 'ner', 'textcat']

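Here's a helper function for evaluating the performance of our text classification model. It computes precision, recall, and F-score for a single label, treating a score of 0.5 or higher as a positive prediction.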

In [104]:
def evaluate(tokenizer, textcat, texts, cats, label='FUNNY'):
    
    docs = (tokenizer(text) for text in texts)
    
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    
    for i, doc in enumerate(textcat.pipe(docs)):

        gold = cats[i]['cats']
        score = doc.cats[label]
        
        if label not in gold:
            continue
        if score >= 0.5 and gold[label] >= 0.5:
            tp += 1.0
        elif score >= 0.5 and gold[label] < 0.5:
            fp += 1.0
        elif score < 0.5 and gold[label] < 0.5:
            tn += 1
        elif score < 0.5 and gold[label] >= 0.5:
            fn += 1
    
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
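
If the precision and recall arithmetic is unfamiliar, here is the same calculation in isolation. The counts are hypothetical, made up to keep the numbers easy to follow, and `precision_recall_f` is an invented helper. It guards against division by zero with explicit checks, whereas the helper above starts `fp` and `fn` at `1e-8` for the same purpose.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-score from raw counts."""
    # precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: of everything truly positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    if precision + recall == 0:
        f_score = 0.0
    else:
        # harmonic mean of precision and recall
        f_score = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f_score

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
toy_p, toy_r, toy_f = precision_recall_f(tp=80, fp=20, fn=40)
print(f"{toy_p:.3f}\t{toy_r:.3f}\t{toy_f:.3f}")
```

Because the F-score is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well by maximizing precision or recall alone.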

We'll start the training loop for our `textcat` model, using `train_data` for training the model and `test_data` for evaluating its performance.

In [105]:
from spacy.util import minibatch

In [106]:
%%time

with nlp.disable_pipes(*original_pipe_names):
    
    optimizer = nlp.begin_training()
    
    print("Training the model...")
    print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
    
    for i in range(10):
        losses = {}

        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)

        batches = minibatch(train_data, size=8)

        for batch in batches:
            texts, cats = zip(*batch)
            nlp.update(texts, cats, sgd=optimizer, drop=0.2, losses=losses)

        with textcat.model.use_params(optimizer.averages):

            # evaluate on the held-out test data
            test_texts, test_cats = zip(*test_data)
            scores = evaluate(nlp.tokenizer, textcat, test_texts, test_cats)

        print(
            "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                losses["textcat"],
                scores["textcat_p"],
                scores["textcat_r"],
                scores["textcat_f"],
                )
            )

Training the model...
LOSS 	  P  	  R  	  F  
0.498	0.696	0.284	0.403
0.465	0.702	0.736	0.719
0.342	0.708	0.768	0.737
0.277	0.660	0.800	0.723
0.193	0.701	0.704	0.703
0.140	0.685	0.748	0.715
0.077	0.685	0.740	0.712
0.065	0.692	0.700	0.696
0.043	0.678	0.700	0.689
0.040	0.683	0.688	0.685
CPU times: user 3min 19s, sys: 55.9 s, total: 4min 14s
Wall time: 42 s


We now have a trained model! It looks like we started to overfit the training data fairly early: the F-score on the test data peaks at 0.737 in the third epoch and drifts down to 0.685 by the tenth, even as the training loss keeps dropping from 0.498 to 0.040.
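
A simple way to act on this is to track which epoch scored best on the test data. The sketch below copies the numbers from the training log above into a list and picks the epoch with the highest F-score; the variable names are invented for illustration, and in a real training loop you would presumably record the scores as you go and save the model at its best epoch.

```python
# (loss, precision, recall, F-score) per epoch,
# copied from the training log printed above
training_log = [
    (0.498, 0.696, 0.284, 0.403),
    (0.465, 0.702, 0.736, 0.719),
    (0.342, 0.708, 0.768, 0.737),
    (0.277, 0.660, 0.800, 0.723),
    (0.193, 0.701, 0.704, 0.703),
    (0.140, 0.685, 0.748, 0.715),
    (0.077, 0.685, 0.740, 0.712),
    (0.065, 0.692, 0.700, 0.696),
    (0.043, 0.678, 0.700, 0.689),
    (0.040, 0.683, 0.688, 0.685),
]

# number the epochs from 1 and keep the one with the highest F-score
best_epoch, best_row = max(
    enumerate(training_log, start=1),
    key=lambda item: item[1][3]
    )

print(f"best epoch: {best_epoch}, F-score: {best_row[3]:.3f}")
```

Strictly speaking, choosing an epoch based on test scores means the test set is no longer an unbiased estimate of performance; a separate validation set is the usual remedy.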

Let's dig in further on the model's performance on our test data. First, we'll use our spaCy pipeline to generate category predictions for every review in our test data.

In [107]:
test_reviews, test_cats = zip(*test_data)

In [108]:
test_docs = [
    nlp(review, disable=original_pipe_names)
    for review in test_reviews
    ]

In [109]:
test_docs[0]

This place serves consistently good Americanized Middle Eastern food. I've had  tuna and lamb plates here, and they've been good. The flavors won't blow you away but just about everything on the menu is tasty and well prepared. Wait staff has been knowledgeable about the food each time I've been there.  Waterworks location doesn't have a liquor license but there's a state wine store nearby and you can bring a bottle into the restaurant.

In [110]:
print('Model predictions:', test_docs[0].cats)

Model predictions: {'FUNNY': 0.00012644033995456994, 'UNFUNNY': 0.9998735189437866}


In [111]:
print('True labels:', test_cats[0])

True labels: {'cats': {'FUNNY': 0.0, 'UNFUNNY': 1.0}}


Next, we'll unpack the `textcat` predictions and labels into a `DataFrame`, which is a little more conducive to further data analysis.

In [112]:
test_funny_labels = [cats['cats']['FUNNY'] for cats in test_cats]
test_unfunny_labels = [cats['cats']['UNFUNNY'] for cats in test_cats]

test_funny_preds = [doc.cats['FUNNY'] for doc in test_docs]
test_unfunny_preds = [doc.cats['UNFUNNY'] for doc in test_docs]

test_df = pd.DataFrame(
    zip(
        test_reviews,
        test_funny_labels,
        test_funny_preds,
        test_unfunny_labels,
        test_unfunny_preds
        ),
    columns = [
        'text',
        'FUNNY',
        'FUNNY_pred',
        'UNFUNNY',
        'UNFUNNY_pred'
        ]
    )

test_df

Unnamed: 0,text,FUNNY,FUNNY_pred,UNFUNNY,UNFUNNY_pred
0,This place serves consistently good Americaniz...,0.0,0.000126,1.0,0.999874
1,Located at the Venetian Hotel in Las Vegas. Gr...,0.0,0.001876,1.0,0.998124
2,I'm so..... Heated... I bought the 3 amigo and...,0.0,0.001392,1.0,0.998608
3,"When going to Vegas, I always use HotWire to f...",1.0,0.986706,0.0,0.013294
4,Where do I start!? Came here for the first tim...,0.0,0.000687,1.0,0.999313
5,My family and I visited and decided to take so...,0.0,0.020842,1.0,0.979158
6,Delicious! My first visit. I have been hearing...,1.0,0.888721,0.0,0.111279
7,"It's been some time since I've had a new, good...",1.0,0.998460,0.0,0.001540
8,Yes! As good as it gets in way of pinball mach...,0.0,0.000487,1.0,0.999513
9,Authentic and Rustic full of ambiance.. Great ...,0.0,0.003008,1.0,0.996992


Let's look at some descriptive statistics about the test data.

In [113]:
test_df.describe()

Unnamed: 0,FUNNY,FUNNY_pred,UNFUNNY,UNFUNNY_pred
count,500.0,500.0,500.0,500.0
mean,0.5,0.561154,0.5,0.438846
std,0.500501,0.443924,0.500501,0.443924
min,0.0,9.2e-05,0.0,0.000258
25%,0.0,0.013053,0.0,0.011491
50%,0.5,0.807424,0.5,0.192576
75%,1.0,0.988509,1.0,0.986947
max,1.0,0.999742,1.0,0.999908


Next, we'll break out the predicted scores for the test reviews by their true label: funny or not.

In [114]:
grouped = (
    test_df
    .groupby('FUNNY')
    ['FUNNY_pred']
    .describe()
    .transpose()
    )

grouped

FUNNY,0.0,1.0
count,250.0,250.0
mean,0.384747,0.737561
std,0.434497,0.379015
min,9.2e-05,0.000125
25%,0.001399,0.436964
50%,0.069915,0.978173
75%,0.922619,0.994821
max,0.998146,0.999742


### Visualizing Model Predictions with Bokeh

We'll create a stacked histogram showing the distribution of both the funny reviews and the unfunny reviews, based on the funny prediction score the model assigned to the reviews. 

First, we'll prepare the data for visualization.

In [115]:
funny_df = test_df[test_df['FUNNY'] == 1]
unfunny_df = test_df[test_df['UNFUNNY'] == 1]

funny_hist = (
    funny_df['FUNNY_pred']
    .apply(lambda x: int(round(x * 10, 0)))
    .value_counts()
    .sort_index()
    )

unfunny_hist = (
    unfunny_df['FUNNY_pred']
    .apply(lambda x: int(round(x * 10, 0)))
    .value_counts()
    .sort_index()
    )


funny_score_table = (
    pd.concat(
        [funny_hist, unfunny_hist],
        axis=1
        )
    .fillna(0)
    )

funny_score_table.columns = ['FUNNY_count', 'UNFUNNY_count']

funny_score_table["TOTAL_count"] = funny_score_table.sum(axis=1)

funny_score_table = pd.concat([
    pd.Series(funny_score_table.index, name='funny_pred') / 10,
    funny_score_table
    ],
    axis = 1
    )

funny_score_table.loc[-1] = [0, 0, 0, 0]
funny_score_table.loc[11] = [1, 0, 0, 0]

funny_score_table = funny_score_table.sort_index()

funny_score_table

Unnamed: 0,funny_pred,FUNNY_count,UNFUNNY_count,TOTAL_count
-1,0.0,0,0,0
0,0.0,32,118,150
1,0.1,10,19,29
2,0.2,8,3,11
3,0.3,6,5,11
4,0.4,8,5,13
5,0.5,1,5,6
6,0.6,3,5,8
7,0.7,10,4,14
8,0.8,5,10,15
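
The bucketing step in the cell above is worth a closer look. Rounding `score * 10` to the nearest integer produces 11 buckets (0 through 10), not 10. Here is a small sketch with hypothetical scores, invented for illustration; `score_bucket` is a made-up name for the lambda used above.

```python
from collections import Counter

def score_bucket(score):
    """Map a prediction score in [0, 1] to an integer bucket from 0 to 10."""
    return int(round(score * 10, 0))

# Hypothetical prediction scores, invented for illustration.
toy_scores = [0.01, 0.04, 0.12, 0.49, 0.51, 0.88, 0.97, 0.999]

bucket_counts = Counter(score_bucket(score) for score in toy_scores)

# print every bucket, including the empty ones
for bucket in range(11):
    print(bucket / 10, bucket_counts.get(bucket, 0))
```

Because of the rounding, the two edge buckets are only half as wide as the others: bucket 0 catches scores just below 0.05 and under, and bucket 10 catches scores from roughly 0.95 up. Keep that in mind when comparing the heights at the ends of the histogram to those in the middle.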


Finally, we'll plot our histograms with Bokeh.

In [116]:
from bokeh.plotting import figure, output_notebook, show

output_notebook()

In [117]:
# create the plot and configure the
# dimensions, color, and tools
funny_score_distribution = figure(
    width=850,
    height=600,
    background_fill_color="#F7F7F7",
    tools="pan,box_zoom,wheel_zoom,crosshair,save,reset"
    )

# add a title and axis labels
funny_score_distribution.title.text = "Funny Label by Funny Prediction Score"
funny_score_distribution.xaxis.axis_label = "Funny Prediction Score"
funny_score_distribution.yaxis.axis_label = "Count of Reviews"

# draw the histogram of funny reviews as a golden "patch"
funny_score_distribution.patch(
    funny_score_table["funny_pred"],
    funny_score_table["FUNNY_count"],
    fill_color="GoldenRod",
    fill_alpha=0.8,
    line_color="wheat",
    line_alpha=0.8,
    legend="Funny Reviews",
    );

# draw the histogram of unfunny reviews as a gray "patch"
# stacked on top of the existing funny reviews patch
funny_score_distribution.patch(
    pd.concat([
        funny_score_table['funny_pred'],
        funny_score_table['funny_pred'].sort_index(ascending=False)
        ]),
    pd.concat([
        funny_score_table['FUNNY_count'],
        funny_score_table['TOTAL_count'].sort_index(ascending=False)
        ]),
    fill_color="#555555",
    fill_alpha=0.6,
    line_color=None,
    legend="Unfunny Reviews",
    );

# engage!
show(funny_score_distribution);

## Transformer Models with spaCy PyTorch Transformers

In [118]:
import numpy as np

Sentence 1:

>The crab cakes were delicious, the white sangria hit the spot, and the sea `bass` dinner entree was wonderful!

Sentence 2:

>The music had a lot of `bass` to it and the smoke machine went off periodically, which actually felt refreshing lol.

Sentence 3:

>Fresh Mediterranean sea `bass` grilled lightly and served with a simple coating in EVOO, fresh lemon to squeeze, and herbs.

In [119]:
bass_1 = (
    'The crab cakes were delicious, the white sangria hit the spot, '
    'and the sea bass dinner entree was wonderful!'
    )

bass_2 = (
    'The music had a lot of bass to it and the smoke machine went off periodically, '
    'which actually felt refreshing lol.'
    )

bass_3 = (
    'Fresh Mediterranean sea bass grilled lightly and served '
    'with a simple coating in EVOO, fresh lemon to squeeze, and herbs.'
    )

In [120]:
bass_doc_1 = nlp(bass_1)
bass_doc_2 = nlp(bass_2)
bass_doc_3 = nlp(bass_3)

bass_token_1 = bass_doc_1[16]
bass_token_2 = bass_doc_2[6]
bass_token_3 = bass_doc_3[3]

print(bass_token_1)
print(bass_token_2)
print(bass_token_3)

bass
bass
bass


In [121]:
print(bass_token_1.similarity(bass_token_2))
print(bass_token_1.similarity(bass_token_3))
print(bass_token_2.similarity(bass_token_3))

1.0
1.0
1.0


In [122]:
np.allclose(bass_token_1.vector, bass_token_2.vector)

True

In [123]:
pd.DataFrame(
    zip(bass_token_1.vector, bass_token_2.vector),
    columns=['bass (fish)', 'bass (music)']
    )

Unnamed: 0,bass (fish),bass (music)
0,-0.413120,-0.413120
1,0.875580,0.875580
2,-0.100770,-0.100770
3,0.180940,0.180940
4,0.884870,0.884870
5,0.243010,0.243010
6,0.330550,0.330550
7,-0.321360,-0.321360
8,-0.074179,-0.074179
9,1.008300,1.008300


### Contextual Word Vectors with BERT

In [124]:
bert_nlp = spacy.load('en_pytt_bertbaseuncased_lg')

In [125]:
bert_bass_doc_1 = bert_nlp(bass_1)
bert_bass_doc_2 = bert_nlp(bass_2)
bert_bass_doc_3 = bert_nlp(bass_3)

bert_bass_token_1 = bert_bass_doc_1[16]
bert_bass_token_2 = bert_bass_doc_2[6]
bert_bass_token_3 = bert_bass_doc_3[3]

print(bert_bass_token_1)
print(bert_bass_token_2)
print(bert_bass_token_3)

bass
bass
bass


In [126]:
print(bass_1)
print('')
print(bass_2)
print('')
print('Similarity:', bert_bass_token_1.similarity(bert_bass_token_2))

The crab cakes were delicious, the white sangria hit the spot, and the sea bass dinner entree was wonderful!

The music had a lot of bass to it and the smoke machine went off periodically, which actually felt refreshing lol.

Similarity: 0.4680403


In [127]:
print(bass_1)
print('')
print(bass_3)
print('')
print('Similarity:', bert_bass_token_1.similarity(bert_bass_token_3))

The crab cakes were delicious, the white sangria hit the spot, and the sea bass dinner entree was wonderful!

Fresh Mediterranean sea bass grilled lightly and served with a simple coating in EVOO, fresh lemon to squeeze, and herbs.

Similarity: 0.7809529


In [128]:
print(bass_2)
print('')
print(bass_3)
print('')
print('Similarity:', bert_bass_token_2.similarity(bert_bass_token_3))

The music had a lot of bass to it and the smoke machine went off periodically, which actually felt refreshing lol.

Fresh Mediterranean sea bass grilled lightly and served with a simple coating in EVOO, fresh lemon to squeeze, and herbs.

Similarity: 0.4732631


In [129]:
np.allclose(bert_bass_token_1.vector, bert_bass_token_2.vector)

False

In [130]:
pd.DataFrame(
    zip(bert_bass_token_1.vector, bert_bass_token_2.vector),
    columns=['bass (fish)', 'bass (music)']
    )

Unnamed: 0,bass (fish),bass (music)
0,0.011329,0.591231
1,0.043102,-0.152798
2,0.639056,0.207664
3,0.334233,-0.063802
4,0.336107,-0.600250
5,-0.490059,-0.081239
6,0.087794,-0.041648
7,0.591766,0.506240
8,-0.365087,-0.249177
9,0.256170,0.417884


In [131]:
print(bert_bass_doc_1._.pytt_word_pieces_)

['[CLS]', 'the', 'crab', 'cakes', 'were', 'delicious', ',', 'the', 'white', 'sang', '##ria', 'hit', 'the', 'spot', ',', 'and', 'the', 'sea', 'bass', 'dinner', 'en', '##tree', 'was', 'wonderful', '!', '[SEP]']


In [132]:
print(list(bert_bass_doc_1))

[The, crab, cakes, were, delicious, ,, the, white, sangria, hit, the, spot, ,, and, the, sea, bass, dinner, entree, was, wonderful, !]


In [133]:
print(bert_bass_doc_1._.pytt_alignment)

[[1], [2], [3], [4], [5], [6], [7], [8], [9, 10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20, 21], [22], [23], [24]]


In [134]:
i = 8

print(bert_bass_doc_1[i])

sangria


In [135]:
bert_bass_doc_1._.pytt_alignment[i]

[9, 10]

In [136]:
for j in bert_bass_doc_1._.pytt_alignment[i]:
    print(bert_bass_doc_1._.pytt_word_pieces_[j])

sang
##ria
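
To see the alignment idea without loading a BERT model, here is a plain-Python sketch that uses the word pieces, tokens, and alignment printed above, copied in as literal lists. The helper name `word_pieces_for_token` is invented for illustration; it just repeats the lookup from the previous cell for every token.

```python
# Copied from the outputs printed above.
word_pieces = [
    '[CLS]', 'the', 'crab', 'cakes', 'were', 'delicious', ',', 'the',
    'white', 'sang', '##ria', 'hit', 'the', 'spot', ',', 'and', 'the',
    'sea', 'bass', 'dinner', 'en', '##tree', 'was', 'wonderful', '!', '[SEP]'
    ]

tokens = [
    'The', 'crab', 'cakes', 'were', 'delicious', ',', 'the', 'white',
    'sangria', 'hit', 'the', 'spot', ',', 'and', 'the', 'sea', 'bass',
    'dinner', 'entree', 'was', 'wonderful', '!'
    ]

alignment = [
    [1], [2], [3], [4], [5], [6], [7], [8], [9, 10], [11], [12], [13],
    [14], [15], [16], [17], [18], [19], [20, 21], [22], [23], [24]
    ]

def word_pieces_for_token(i):
    """Return the word pieces that spaCy token i was split into."""
    return [word_pieces[j] for j in alignment[i]]

# collect only the tokens that were split into more than one word piece
split_tokens = {
    tokens[i]: word_pieces_for_token(i)
    for i in range(len(tokens))
    if len(alignment[i]) > 1
    }

print(split_tokens)
```

Most tokens map to exactly one word piece. Less common words like *sangria* and *entree* are broken into pieces the model's vocabulary does contain, and the special `[CLS]` and `[SEP]` pieces are not aligned to any spaCy token.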


## Conclusion

Whew! Let's round up the major components that we've seen:

1. Text processing with **spaCy**
1. Automated **phrase modeling**
1. Topic modeling with **LDA** $\ \longrightarrow\ $ visualization with **pyLDAvis**
1. Word vector modeling with **word2vec** $\ \longrightarrow\ $ visualization with **t-SNE**
1. Text categorization with spaCy's **textcat** model
1. Contextual word vectors with **BERT** via **spaCy PyTorch Transformers**

#### Why use these models?
Dense vector representations for text like LDA, word2vec, and BERT can greatly improve performance for a number of common, text-heavy problems like:
- Text classification
- Search
- Recommendations
- Question answering

...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.

## AI Engineering @ S&P Global &mdash; *we are hiring!*