# Capstone 2 Project

## NLP training on ca. 3 million Yelp reviews

## What is covered in this part of the project
This part of the project deals with natural language processing of the Yelp data set. Specifically, it will import the raw data in json format, filter the reviews that relate to restaurants, prepare the data for NLP and perform a number of NLP functions:

1. Text preparation, including tokenization, lemmatization and normalization of the text using the spaCy library
1. Phrase modeling using gensim
1. Topic modeling with LDA
1. Word vector models with word2vec
1. Clustering words that commonly appear together 

Topic modeling and word2vec clusters will provide context to the reviews in order to return summary information that is easily interpretable and understandable to an average user, and may provide useful insight.

## Loading the Yelp data
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it &mdash; it's largely about food, after all!

**Note:** If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. 

After filtering for restaurants, there are approximately __52K__ restaurants with approximately __2.9M__ user reviews related to them.

The raw data is available in six files in _.json_ format, of which two are relevant for the project:
- __business.json__ &mdash; _the records for individual businesses_
- __review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. 

In [None]:
import os
import codecs

data_directory = os.path.join('data',
                              'dataset')

businesses_filepath = os.path.join(data_directory,
                                   'business.json')

with codecs.open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline() 

print first_business_record

The business records consist of _key, value_ pairs containing information about the particular business. The information from this file is treated in a separate notebook named _"Restaurants"_. 

This notebook will process information contained in the _review.json_ file.

In [None]:
review_json_filepath = os.path.join(data_directory,
                                    'review.json')

with codecs.open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print first_review_record

A few attributes of note on the review records:
- __business\_id__ &mdash; _identifies the business in question and provides a link to each business in the business.json file_
- __text__ &mdash; _the actual text of the review written by the user_


Preparing the text for NLP analysis requires the following steps:
1. Load each business record line by line using _json.loads_, which converts each json object into a Python dict
2. Filter the business records to keep only those that have "Restaurants" in the category list
3. Form a set of the business_id values in order to use it as a filter when processing the reviews file
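
Before running these steps on the full file, it can help to try them on a handful of in-memory records. The sketch below is only an illustration: the four records are hypothetical, and the handling of a comma-separated string or a missing `categories` value is an assumption on my part, since I have not checked how every release of the dataset stores that field.

```python
import json

# Hypothetical records standing in for lines of business.json.
sample_lines = [
    '{"business_id": "a1", "categories": ["Restaurants", "Mexican"]}',
    '{"business_id": "b2", "categories": ["Dentists"]}',
    '{"business_id": "c3", "categories": "Pizza, Restaurants"}',
    '{"business_id": "d4", "categories": null}',
]


def is_restaurant(business):
    """Return True if 'Restaurants' is one of the business categories."""
    categories = business.get('categories') or []
    # Assumption: categories may be a list or a comma-separated string.
    if not isinstance(categories, list):
        categories = [c.strip() for c in categories.split(',')]
    return 'Restaurants' in categories


# Steps 1-3: parse each line, keep restaurants, collect their ids in a set.
sample_restaurant_ids = frozenset(
    business['business_id']
    for business in (json.loads(line) for line in sample_lines)
    if is_restaurant(business)
)

print(sorted(sample_restaurant_ids))
```

Splitting the string on commas, rather than testing for a substring, avoids matching a category that merely contains the word "Restaurants".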

In [None]:
import json

restaurant_ids = set()

# open the businesses file
with codecs.open(businesses_filepath, encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print '{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.'

This step creates a new file that contains only the text from reviews about restaurants, with one review per line in the file.
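
Since reviews can contain line breaks of their own, keeping one review per line means escaping those line breaks on the way out and restoring them on the way back in. A minimal sketch of that round trip, using a hypothetical review text and hypothetical helper names:

```python
def escape_newlines(text):
    """Replace real line breaks with the two characters backslash + n."""
    return text.replace('\n', '\\n')


def unescape_newlines(line):
    """Turn the escaped form back into real line breaks."""
    return line.replace('\\n', '\n')


review_text = 'Great tacos.\nSlow service.'  # hypothetical review

stored = escape_newlines(review_text)
restored = unescape_newlines(stored)

print(stored)
print(restored == review_text)
```

One caveat of this simple scheme: a review that already contains a literal backslash followed by `n` will come back with a line break in that position.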

In [None]:
intermediate_directory = os.path.join('data', 'intermediate')

review_txt_filepath = os.path.join(intermediate_directory,
                                   'review_text_all.txt')

In [None]:
%%time

# this part is complete so make the statement false to skip directly to reading the file
if 0 == 1:
    
    review_count = 0

    # create & open a new file in write mode
    with codecs.open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with codecs.open(review_json_filepath, encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print u'''Text from {:,} restaurant reviews
              written to the new txt file.'''.format(review_count)
    
else:
    
    with codecs.open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print u'Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1)

## Text preparation using spaCy

The [**spaCy**](https://spacy.io) library will be used to perform the following NLP tasks: 
- Tokenization
- Text normalization, including converting text to lemmatized form 
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

spaCy provides downloadable English-language models that make it possible to compare a specific corpus of text against general English, for example through lists of common stopwords and the probability of occurrence of words.

In [None]:
import pandas as pd
import itertools as it

The next cell appends the location of the spaCy module and English language model on my laptop to the system path list.

In [None]:
import sys
sys.path.append(r'C:\Windows\System32\.env\Lib\site-packages')

This part handles the text processing using spaCy.

In [None]:
import spacy
import en_core_web_sm
# English model will only load with this code, does not work using the code provided in spaCy documentation
nlp = en_core_web_sm.load() 

In [None]:
# this examines a sample review
with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print sample_review

Sentence detection and segmentation:

In [None]:
parsed_review = nlp(sample_review)

for num, sentence in enumerate(parsed_review.sents):
    print 'Sentence {}:'.format(num + 1)
    print sentence
    print ''

Named entity detection:

In [None]:
for num, entity in enumerate(parsed_review.ents):
    print 'Entity {}:'.format(num + 1), entity, '-', entity.label_
    print ''

Part of speech tagging:

In [None]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_pos),
             columns=['token_text', 'part_of_speech'])

Text normalization includes converting words to their lemmatized form, which removes grammatical inflection in order to simplify and reduce the vocabulary. Words are converted to lowercase, and words with the same stem but different grammatical forms are converted to a single lemma (e.g. "be", "is", "were", "am", "are" are all represented as "be" in lemmatized form).

In [None]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_lemma, token_shape),
             columns=['token_text', 'token_lemma', 'token_shape'])

spaCy will also recognize words / symbols that are stopwords, punctuation, whitespace and numbers. Some of these will be removed at a later stage in order to simplify the corpus by stripping words that are grammatically necessary but do not contribute to the meaning.

## Phrase Modeling with gensim

_Phrase modeling_ is an algorithm that passes over the text corpus to learn multi-word concepts. Using the algorithm once will detect phrases such as "ice cream" by detecting that "ice" and "cream" appear together with a frequency that passes a certain threshold. At the next pass, the algorithm will treat the bigram "ice_cream" as a single token and may also learn "vanilla_ice_cream" as a phrase and treat it as a trigram. 
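
To make the threshold idea concrete, here is a minimal pure-Python sketch of a single phrase-modeling pass. It is not gensim's implementation: the score is a simplified form of the collocation score from the word2vec phrases paper (pair count minus a discount, divided by the product of the individual word counts), scaled here by the vocabulary size, and the toy corpus, `min_count` and `threshold` values are all hypothetical.

```python
from collections import Counter


def learn_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those above the threshold."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigram_counts)
    phrases = {}
    for (a, b), pair_count in bigram_counts.items():
        # Pairs seen only min_count times (or fewer) score zero or below.
        score = ((pair_count - min_count) * vocab_size
                 / float(unigram_counts[a] * unigram_counts[b]))
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


def apply_bigrams(tokens, phrases):
    """Join each detected pair into a single underscore-separated token."""
    joined = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            joined.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


# Hypothetical, already lemmatized toy corpus.
toy_sentences = [
    ['i', 'love', 'ice', 'cream'],
    ['the', 'ice', 'cream', 'be', 'great'],
    ['ice', 'cream', 'and', 'cake'],
    ['i', 'like', 'the', 'cake'],
    ['we', 'love', 'the', 'service'],
]

toy_phrases = learn_bigrams(toy_sentences)
print(apply_bigrams(['i', 'love', 'ice', 'cream'], toy_phrases))
```

Running the same two functions again on the joined output is what lets a second pass pick up longer phrases: "ice_cream" is then an ordinary token that can pair with a neighbour.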

This part makes three passes over the text corpus in order to reveal trigrams, which will help provide a robust level of meaning, given that the corpus is fairly large.

In [None]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment text of complete reviews into sentences & normalize text
1. First-order phrase modeling $\rightarrow$ _apply first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply second-order phrase model to transform sentences_
1. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:
- Iterate over the roughly 2.9M reviews in the `review_text_all.txt` file we created before
- Segment the reviews into individual sentences
- Remove punctuation and excess whitespace
- Lemmatize the text

... and do so efficiently in parallel, thanks to spaCy's `nlp.pipe()` function.

In [2]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [None]:
unigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'unigram_sentences_all.txt')

Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (`unigram_sentences_all`), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [None]:
%%time

# this is completed

if 0 == 1:

    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')

If your data is organized like our `unigram_sentences_all` file now is &mdash; a large text file with one document/sentence per line &mdash; gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It *streams* the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
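
The streaming idea itself is simple enough to sketch in a few lines. The class below is a hypothetical stand-in written for illustration, not gensim's `LineSentence`; it reopens the file on every loop, so it can be iterated more than once, which a plain generator cannot.

```python
import io
import os
import tempfile


class StreamedSentences(object):
    """Hypothetical stand-in: yields one list of tokens per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Only one line is held in memory at a time.
        with io.open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()


# Write a tiny demo file with one normalized sentence per line.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_sentences.txt')
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(u'the ice cream be great\n')
    f.write(u'we love the service\n')

demo_sentences = StreamedSentences(demo_path)
first_pass = list(demo_sentences)
second_pass = list(demo_sentences)  # works again: the file is reopened

print(first_pass == second_pass)
```

Being able to restart the iteration matters because models that train over several passes need to read the corpus from the beginning each time.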

In [None]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "`ice cream`", to be linked together to form a new, single token: "`ice_cream`".

In [None]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [None]:
%%time

# this is completed

if 0 == 1:

    bigram_model = Phrases(unigram_sentences, max_vocab_size=10000000)

    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

In [None]:
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

In [None]:
%%time

# this is completed
# warning message: For a faster implementation, use the gensim.models.phrases.Phraser class
if 0 == 1:

    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

In [None]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [None]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [None]:
%%time

# this is completed
if 0 == 1:

    trigram_model = Phrases(bigram_sentences, max_vocab_size=10000000)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

In [None]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [None]:
%%time

# this is completed
if 0 == 1:

    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

In [None]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.

In addition, we'll remove stopwords at this point. _Stopwords_ are very common words, like _a_, _the_, _and_, and so on, that serve functional roles in natural language, but typically don't contribute to the overall meaning of text. Filtering stopwords is a common procedure that allows higher-level NLP modeling techniques to focus on the words that carry more semantic weight.
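
As a small illustration of the filtering step, here is a sketch that uses a tiny, hypothetical stopword set in place of spaCy's much longer list. The token list is made up as well.

```python
# Hypothetical, tiny stopword set for illustration only.
demo_stopwords = frozenset(['a', 'the', 'and', 'be', 'to', 'of'])


def remove_stopwords(tokens, stopwords=demo_stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


demo_tokens = ['the', 'ice_cream', 'be', 'creamy', 'and', 'piece_of_cake']
filtered_tokens = remove_stopwords(demo_tokens)

print(filtered_tokens)
```

Filtering after the phrase models have been applied means a joined token such as `piece_of_cake` is kept whole, even though it contains a stopword.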

Finally, we'll write the transformed text out to a new file, with one review per line.

In [None]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [None]:
%%time

# this is completed
if 0 == 1:

    with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                      batch_size=10000, n_threads=4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.lang.en.stop_words.STOP_WORDS]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')

You can see that most of the grammatical structure has been scrubbed from the text &mdash; capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic *meaning* is still present. Also, multi-word concepts such as "`friday_night`" and "`above_average`" have been joined into single tokens, as expected. The review text is now ready for higher-level modeling. 

## Topic Modeling with Latent Dirichlet Allocation (_LDA_)

*Topic modeling* is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using [*Latent Dirichlet Allocation*](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) or LDA, a popular approach to topic modeling.

In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a *vector* of token counts. There are two layers in this model, documents and tokens, and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:
* Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully independent of each other; there's no sense of connection between related tokens, such as _knife_ and _fork_.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of *topics*, and the *topics* are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.
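
To get a feel for why a Dirichlet prior favours "a handful of topics", it helps to draw a few samples. The sketch below uses only the standard library and relies on the fact that normalizing independent Gamma draws gives a Dirichlet sample. The concentration values (0.1 and 10.0), the ten topics and the helper names are hypothetical choices for illustration and are not taken from the model trained in this notebook.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


def mean_largest_share(alpha, size=10, n_samples=500, seed=0):
    """Average weight of the single largest topic over many sampled mixtures."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, size, rng))
               for _ in range(n_samples)]
    return sum(largest) / n_samples


# Small alpha: most of the weight lands on one or two topics.
sparse_share = mean_largest_share(alpha=0.1)
# Large alpha: the weight is spread fairly evenly across all topics.
even_share = mean_largest_share(alpha=10.0)

print('largest share, alpha=0.1 : {:.2f}'.format(sparse_share))
print('largest share, alpha=10.0: {:.2f}'.format(even_share))
```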

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its [**LdaMulticore**](https://radimrehurek.com/gensim/models/ldamulticore.html) class.

In [None]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import warnings
import cPickle as pickle

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's [**Dictionary**](https://radimrehurek.com/gensim/corpora/dictionary.html) class for this.

In [None]:
trigram_dictionary_filepath = os.path.join(intermediate_directory,
                                           'trigram_dict_all.dict')

In [None]:
%%time

# this is completed
if 0 == 1:

    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)
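
The effect of the `no_below` and `no_above` settings used above can be sketched in plain Python. This is a simplified illustration of document-frequency filtering, not gensim's code. The toy documents and thresholds are hypothetical, and I am assuming that `no_below` is an absolute number of documents while `no_above` is a fraction of all documents.

```python
from collections import Counter


def filter_by_document_frequency(documents, no_below, no_above):
    """Keep tokens found in at least no_below documents and in no more
    than a no_above fraction of all documents."""
    doc_freq = Counter()
    for tokens in documents:
        # Count each token once per document, however often it occurs.
        doc_freq.update(set(tokens))

    max_docs = no_above * len(documents)
    return sorted(token for token, freq in doc_freq.items()
                  if no_below <= freq <= max_docs)


# Hypothetical toy corpus of four tiny reviews.
demo_docs = [
    ['food', 'taco', 'salsa'],
    ['food', 'taco', 'burrito'],
    ['food', 'sushi'],
    ['food', 'pizza', 'taco'],
]

kept_tokens = filter_by_document_frequency(demo_docs, no_below=2, no_above=0.75)
print(kept_tokens)
```

Here `food` is dropped for appearing in every document, and the tokens seen only once are dropped for being too rare.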

Like many NLP techniques, LDA uses a simplifying assumption known as the [*bag-of-words* model](https://en.wikipedia.org/wiki/Bag-of-words_model). In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded. 
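
A bag-of-words representation is easy to build by hand for a single document. The sketch below uses a hypothetical three-token vocabulary and returns a sorted list of (token id, count) pairs. Tokens missing from the vocabulary are simply ignored, which is an assumption of this illustration.

```python
from collections import Counter

# Hypothetical mini-vocabulary mapping each known token to an integer id.
demo_token2id = {'taco': 0, 'salsa': 1, 'margarita': 2}


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in tokens
                     if token in token2id)
    return sorted(counts.items())


demo_review = ['taco', 'salsa', 'taco', 'guacamole', 'taco']
demo_bow = to_bow(demo_review, demo_token2id)

print(demo_bow)
```

Only the counts survive: the order of the words in `demo_review` cannot be recovered from `demo_bow`.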

We can now use the gensim Dictionary we learned to generate a bag-of-words representation for each review. The `trigram_bow_generator` function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the following code, "bag-of-words" is abbreviated as `bow`.

In [None]:
trigram_bow_filepath = os.path.join(intermediate_directory,
                                    'trigram_bow_corpus_all.mm')

In [None]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [None]:
%%time

# this is completed
if 0 == 1:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to `LdaMulticore` as inputs, along with the number of topics the model should learn. For this demo, we're asking for 50 topics.

In [None]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

In [None]:
%%time

# this is completed
if 0 == 1:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=50,
                           id2word=trigram_dictionary,
                           workers=3)
    
    lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [None]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print u'{:20} {}'.format(u'term', u'frequency') + u'\n'

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print u'{:20} {:.3f}'.format(term, round(frequency, 3))

In [None]:
explore_topic(topic_number=17)

This topic has strong associations with words like *taco*, *salsa*, *chip*, *burrito*, and *margarita*, as well as a handful of more general words. You might call this the **Mexican food** topic!

It's possible to go through and inspect each topic in the same way, and try to assign a human-interpretable label that captures the essence of each one. I've given it a shot for all 50 topics below.

In [None]:
topic_names = {0: u'dessert',
               1: u'table, ordering & service',
               2: u'airport & flying, allergies',
               3: u'buffet & dishes',
               4: u'pizza & italian',
               5: u'bar & drinking',
               6: u'salad & lunch',
               7: u'kids & family',
               8: u'las vegas',
               9: u'fish & asian',
               10: u'seating experience',
               11: u'waffle / cupcake', # waffle + various unrelated terms
               12: u'family members',
               13: u'flavor & food experience',
               14: u'general ambience',
               15: u'night / happy hour / drinks',
               16: u'brunch',
               17: u'mexican',
               18: u'meat',
               19: u'price & paying',
               20: u'wedding / german', # various unrelated words incl single letters
               21: u'various locations', # e.g. north, south, town names, uptown, street etc.
               22: u'cleanliness',
               23: u'mall / cheesecake_factory', # plus various unrelated words and some experience words
               24: u'positive service experience',
               25: u'buying / offers',
               26: u'positive food experience',
               27: u'cheese / wings',
               28: u'french language reviews',
               29: u'positive experience',
               30: u'ordering and service', # general and negative service experience
               31: u'time & place',
               32: u'menu & dishes',
               33: u'sushi & fish',
               34: u'german language reviews',
               35: u'positive food experience',
               36: u'timing & waiting',
               37: u'seafood',
               38: u'tea & drinks',
               39: u'thai',
               40: u'staff', 
               41: u'sandwiches',
               42: u'yes / know / eat', # unrelated common words
               43: u'hot_dog, chilli', # hot is the most prominent word, both in the sense of spicy and hot_dog
               44: u'drinks / coffee / atmosphere',
               45: u'burgers',
               46: u'breakfast food', 
               47: u'price / service / portions',
               48: u'steaks',
               49: u'review / yelp'} # various common unrelated words

In [None]:
topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)

You can see that, along with **mexican**, there are a variety of topics related to different styles of food, such as **thai**, **steak**, **sushi**, **pizza**, and so on. In addition, there are topics that are more related to the overall restaurant *experience*, like **ambience & seating**, **good service**, **waiting**, and **price**.

Beyond these two categories, there are still some topics that are difficult to apply a meaningful human interpretation to, such as topics 23 and 42.


### Describing text with LDA
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% _Topic A_, 20% _Topic B_, 20% _Topic C_, and 10% _Topic D_.
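
In code, such a mixture is conveniently handled as a list of (topic number, weight) pairs. The sketch below ranks a made-up mixture and drops the topics that fall under a minimum weight. The numbers and the `top_topics` helper are hypothetical.

```python
# Hypothetical topic mixture for one review; the weights sum to 1.0.
demo_mixture = [(3, 0.20), (17, 0.50), (30, 0.04), (36, 0.20), (5, 0.06)]


def top_topics(mixture, min_weight=0.05):
    """Sort topics by descending weight and drop those below min_weight."""
    ranked = sorted(mixture, key=lambda pair: -pair[1])
    return [(topic, weight) for topic, weight in ranked
            if weight >= min_weight]


for topic, weight in top_topics(demo_mixture):
    print('topic {:2d}  {:.2f}'.format(topic, weight))
```

The minimum weight keeps the description short by leaving out topics that contribute very little to the text.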

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
1. Using spaCy to remove punctuation and lemmatize the text
1. Applying our first-order phrase model to join word pairs
1. Applying our second-order phrase model to join longer phrases
1. Removing stopwords
1. Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The `lda_description(...)` function will perform all these steps for us, including printing the resulting topical description of the input text.

In [None]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    
    return list(it.islice(line_review(review_txt_filepath),
                          review_number, review_number+1))[0]

In [None]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]
    
    # apply the first-order and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if term not in spacy.lang.en.stop_words.STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda (topic_number, freq): -freq)
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print '{:25} {}'.format(topic_names[topic_number],
                                round(freq, 3))

In [None]:
sample_review = get_sample_review(222)
print sample_review

In [None]:
lda_description(sample_review)

In [None]:
sample_review = get_sample_review(122)
print sample_review

In [None]:
lda_description(sample_review)

## Word Vector Embedding with Word2Vec

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash; they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.
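
The sliding window is easy to picture with a few lines of code. This sketch only produces the (focus word, context words) pairs that a window would expose. It leaves out the learning step entirely, and the sentence, window width and function name are hypothetical.

```python
def context_windows(tokens, window=2):
    """Yield each token as the focus word, together with the tokens
    up to `window` positions before and after it."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield focus, left + right


demo_sentence = ['the', 'ice_cream', 'be', 'really', 'good']
demo_pairs = list(context_windows(demo_sentence, window=1))

for focus, context in demo_pairs:
    print('{:10} {}'.format(focus, context))
```

Words at the edges of a sentence simply get a shorter context.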

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.
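To make the sliding window concrete, here is a minimal sketch of how each word takes a turn as the focus word while its neighbors form the context. The helper name `context_windows` and the example sentence are made up for illustration. This is a simplification and not gensim's implementation, which differs in details such as how the window width is chosen at each position.

```python
def context_windows(tokens, window=2):
    """Yield (focus_word, context_words) for every position in tokens."""
    for position, focus in enumerate(tokens):
        # up to `window` tokens before the focus word
        start = max(0, position - window)
        before = tokens[start:position]
        # up to `window` tokens after the focus word
        after = tokens[position + 1:position + 1 + window]
        yield focus, before + after


# hypothetical, already-normalized sentence
sentence = ['the', 'happy_hour', 'menu', 'was', 'great']

for focus, context in context_windows(sentence, window=2):
    print('{:12} {}'.format(focus, context))
```

Each `(focus, context)` pair is one small training example. One epoch corresponds to generating these pairs for every sentence in the corpus once.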

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

To use word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [None]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.
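As a rough sketch of how those settings might be passed to gensim: in releases before 4.0 the constructor calls the dimensionality and epoch count `size` and `iter` (matching the `food2vec.iter` attribute used in this notebook), while from 4.0 onward they are `vector_size` and `epochs`. The helper names below are hypothetical, and the window width of 5 is an assumption, not a setting confirmed by this notebook.

```python
def word2vec_params(gensim_major_version, dimensions=100, window=5, epochs=12):
    """Return Word2Vec keyword arguments named for the given gensim major version."""
    params = {'window': window}  # window=5 is an assumed value
    if gensim_major_version >= 4:
        params['vector_size'] = dimensions
        params['epochs'] = epochs
    else:
        params['size'] = dimensions
        params['iter'] = epochs
    return params


def build_untrained_model(sentences):
    """Create a model and build its vocabulary (hypothetical helper)."""
    # imported here so the parameter helper above works without gensim installed
    import gensim
    from gensim.models import Word2Vec

    major = int(gensim.__version__.split('.')[0])
    model = Word2Vec(**word2vec_params(major))
    model.build_vocab(sentences)
    return model
```

A model built this way has a vocabulary but untrained vectors, so it still needs a call to `train` before it is useful.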

In [None]:
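# flip this condition to True to retrain; this branch assumes food2vec
# has already been initialized and its vocabulary built
if 0 == 1:
    food2vec.train(trigram_sentences, total_examples=food2vec.corpus_count, epochs=food2vec.iter)
    print(food2vec.train_count)
    food2vec.save(word2vec_filepath)
    
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

# train_count is the number of times train() has been called
print(u'{} training runs so far.'.format(food2vec.train_count))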

In [None]:
print(food2vec.iter)
print(food2vec.train_count)

On my four-core machine, each epoch over all the text in the ~1 million Yelp reviews takes about 5-10 minutes.

In [None]:
print(u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))

Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [None]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda entry: -entry[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors

In [None]:
word_vectors.shape

This DataFrame has 80,588 rows (one for each term in the vocabulary) and 100 columns. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [None]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):

        print(u'{:20} {}'.format(word, round(similarity, 3)))
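A lookup like `most_similar` ranks terms by the cosine similarity between their word vectors. Here is a minimal sketch of that idea using made-up three-dimensional vectors. The `toy_vectors` values and both helper names are hypothetical and are not taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_terms(token, vectors, topn=3):
    """Rank every other term by cosine similarity to token."""
    target = vectors[token]
    scored = [(term, cosine_similarity(target, vector))
              for term, vector in vectors.items() if term != token]
    # highest similarity first
    scored.sort(key=lambda pair: -pair[1])
    return scored[:topn]


# made-up vectors, for illustration only
toy_vectors = {
    'pasta': [0.9, 0.1, 0.0],
    'spaghetti': [0.8, 0.2, 0.1],
    'happy_hour': [0.0, 0.9, 0.4],
    'hh': [0.1, 0.8, 0.5],
}

for term, score in most_similar_terms('pasta', toy_vectors):
    print('{:20} {}'.format(term, round(score, 3)))
```

With real 100-dimensional vectors the computation is the same, just over many more terms and dimensions.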

### Related items

In [None]:
get_related_terms(u'restaurant')

In [None]:
get_related_terms(u'happy_hour')

In [None]:
get_related_terms(u'pasta', topn=20)

The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look &mdash; the model has turned up phrases like *3-6pm*, *4-7pm*, and *mon-fri*, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, and what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays.

## Word Vector Clustering

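To visualize the word vectors, we'll use t-SNE (t-distributed stochastic neighbor embedding), a dimensionality reduction technique that projects high-dimensional data into a low-dimensional space, here two dimensions, for plotting. Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first: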
1. Drop stopwords, since it's probably not too interesting to visualize *the*, *of*, *or*, and so on
1. Take only the 5,000 most frequent terms in the vocabulary, since there's no need to visualize all ~80,000 terms right now

In [None]:
tsne_input = word_vectors.drop(spacy.en.STOP_WORDS, errors=u'ignore')
tsne_input = tsne_input.head(5000)

In [None]:
tsne_input.head(20)

In [None]:
tsne_input.to_csv('data/wv_model_ready.csv', encoding='utf-8')

In [None]:
tsne_filepath = os.path.join(intermediate_directory,
                             u'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                     u'tsne_vectors.npy')

In [None]:
%%time

if 0 == 1:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    # pickle files must be opened in binary mode
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)
    
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])