# What you can learn about food by analyzing a million Yelp reviews

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it &mdash; it's largely about food, after all!

**Note:** If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:
1. Please visit the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Please review, agree to, and respect Yelp's terms of use!
1. The dataset downloads as a compressed .tgz file; uncompress it
1. Place the uncompressed dataset files (*yelp_academic_dataset_business.json*, etc.) in a directory named *yelp_dataset_challenge_academic_dataset*
1. Place the *yelp_dataset_challenge_academic_dataset* within the *data* directory in the *Modern NLP in Python* project folder

That's it! You're ready to go.

The current iteration of the Yelp dataset (as of this demo) consists of the following data:
- __552K__ users
- __77K__ businesses
- __2.2M__ user reviews

When focusing on restaurants alone, there are approximately __22K__ restaurants with approximately __1M__ user reviews written about them.

The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:
- __yelp\_academic\_dataset\_business.json__ &mdash; _the records for individual businesses_
- __yelp\_academic\_dataset\_review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

In [6]:
import os
import re
import json
data_directory = os.path.join('..', 'data',
                              'yelp_dataset_challenge_academic_dataset', 'dataset')

The review records are stored as _key, value_ pairs containing information about the reviews.

In [3]:
review_filepath = os.path.join(data_directory, 'review.json')

with open(review_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print(first_review_record)

{"review_id":"VfBHSwC5Vz_pbFluy07i9Q","user_id":"cjpdDjZyprfyDG3RlkVG3w","business_id":"uYHaNptLzDLoV_JZ_MuzUA","stars":5,"date":"2016-07-12","text":"My girlfriend and I stayed here for 3 nights and loved it. The location of this hotel and very decent price makes this an amazing deal. When you walk out the front door Scott Monument and Princes street are right in front of you, Edinburgh Castle and the Royal Mile is a 2 minute walk via a close right around the corner, and there are so many hidden gems nearby including Calton Hill and the newly opened Arches that made this location incredible.\n\nThe hotel itself was also very nice with a reasonably priced bar, very considerate staff, and small but comfortable rooms with excellent bathrooms and showers. Only two minor complaints are no telephones in room for room service (not a huge deal for us) and no AC in the room, but they have huge windows which can be fully opened. The staff were incredible though, letting us borrow umbrellas for t

A few attributes of note on the review records:
- __text__ &mdash; _the natural language text the user wrote_
- __stars__ &mdash; _the number of stars the reviewer left_

The _text_ and the _stars_ attribute will be our focus today!
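
To make the record layout concrete, here is a minimal sketch of parsing records in this one-object-per-line format and pulling out those two fields. The two sample records are made up for illustration (they are not from the dataset), and `iter_text_and_stars` is a hypothetical helper name, not part of any library.

```python
import io
import json

# Two made-up records in the same one-JSON-object-per-line layout
# (hypothetical data, not taken from the Yelp dataset).
sample_lines = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great tacos.\\nWill return."}\n'
    '{"review_id": "r2", "stars": 2, "text": "Slow service."}\n'
)

def iter_text_and_stars(lines):
    """Yield (text, stars) for each JSON record, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # fall back to 'NA' when a field is missing, as the data prep cell does
        yield record.get('text', 'NA'), record.get('stars', 'NA')

parsed = list(iter_text_and_stars(sample_lines))
for text, stars in parsed:
    print(stars, repr(text))
```

Because each line is parsed on its own, the same loop should work on an open file handle without loading the whole file into memory.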

Next, we will create two new files: one containing only the review text, with one review per line, and one containing the matching star rating for each review on the same line number.

In [23]:
intermediate_directory = os.path.join('..','data','yelp_dataset_challenge_academic_dataset', 'extracted_from_json')

review_txt_filepath = os.path.join(intermediate_directory,'sentiment_data',
                                   'review_text_for_sentiment.txt')

review_sentiment_filepath = os.path.join(intermediate_directory, 'sentiment_data','sentiment_of_review_text.txt')

review_json_filepath = os.path.join(data_directory,'review.json')

In [24]:
%%time
# Make the if statement True
# if you want to execute data prep yourself.

if True:
    
    review_count = 0

    # create & open the new files in write mode
    with open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:
        with open(review_sentiment_filepath, 'w', encoding='utf_8') as review_sentiment_file:

            # open the existing review json file
            with open(review_json_filepath, encoding='utf_8') as review_json_file:
                # loop through all reviews in the existing file and convert to dict
                for review_json in review_json_file:
                    review = json.loads(review_json)
                    # write the review as a line in the new file
                    # escape newline characters in the original review text
                    review_txt_file.write(review.get('text','NA').replace('\n', '\\n') + '\n')
                    review_sentiment_file.write(str(review.get('stars','NA')) +'\n')
                    review_count += 1

    print('Text from {} reviews written to the new txt file.'.format(review_count))
    
else:
    
    with open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print('Text from {} reviews in the txt file.'.format(review_count + 1))



Text from 4736897 reviews written to the new txt file.
CPU times: user 1min 7s, sys: 18 s, total: 1min 25s
Wall time: 1min 28s
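
The newline handling in the cell above is easy to get wrong, so here is a small sketch of the round trip in isolation. The helper names `escape_newlines` and `unescape_newlines` and the example text are assumptions made for illustration; the notebook itself inlines the `replace` calls.

```python
def escape_newlines(text):
    """Replace real newlines with the two characters backslash + n."""
    return text.replace('\n', '\\n')

def unescape_newlines(line):
    """Reverse escape_newlines for a line read back from the file."""
    # drop the line terminator first, then restore the embedded newlines
    return line.rstrip('\n').replace('\\n', '\n')

review_text = "Loved it.\n\nWould stay again."   # made-up example text
stored_line = escape_newlines(review_text) + '\n'

# the stored form occupies exactly one line in the output file
assert stored_line.count('\n') == 1

restored = unescape_newlines(stored_line)
print(restored == review_text)
```

One caveat with this simple scheme: a review that already contains a literal backslash followed by `n` would come back with a real newline in its place, so the round trip is only exact for text without that sequence.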


In [25]:
# count the lines in the two files written above

from itertools import repeat, takewhile

def rawincount(filename):
    """Count newline bytes by reading the file in 1 MiB binary chunks."""
    with open(filename, 'rb') as f:
        # keep reading chunks until read() returns an empty bytes object
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

print('Len of review text file:{}\nLen of review sentiment file:{}'.format(
    rawincount(review_txt_filepath), rawincount(review_sentiment_filepath)))

Len of review text file:4736897
Len of review sentiment file:4736897


In [None]:
# Good!  The lengths of the files match!

## spaCy &mdash; Industrial-Strength NLP in Python

![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization, such as lowercasing, stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

spaCy is written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the _GIL_).

In [19]:
#!python -m spacy download en_core_web_md
#!python -m spacy link en_core_web_md en_default


    Link en_default already exists

    To overwrite an existing link, use the --force flag.



In [20]:
import spacy
import pandas as pd
import itertools as it
nlp = spacy.load('en_default')

Let's grab a sample review to play with.

In [21]:
with open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

First time at this group of hotels. Pretty new, only one in UK, another to open in Edinburgh and one in London. Rooms not very big but great price and location for a weekend in Edinburgh. Rooms clean, comfortable, good shower and free wifi!



## Word Vector Embedding with Word2Vec

Pop quiz! Can you complete this text snippet?

<br><br>

![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)

<br><br><br>
You just demonstrated the core machine learning concept behind word vector embedding models!
<br><br><br>

![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash; they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.
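
The sliding window itself can be sketched in a few lines of plain Python. This is only an illustration of how focus words and their contexts are paired up; it does no learning at all, and real implementations add details (such as subsampling very frequent words) that are left out here. The function name `window_pairs` and the example sentence are made up for this sketch.

```python
def window_pairs(tokens, window=2):
    """Yield (focus, context_words) for every position in tokens."""
    for i, focus in enumerate(tokens):
        # up to `window` tokens on each side, clipped at the sentence edges
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield focus, left + right

tokens = ['the', 'pizza', 'was', 'really', 'tasty']
pairs = list(window_pairs(tokens, window=2))

for focus, context in pairs:
    print(focus, context)
```

Each token takes one turn as the focus word, and tokens near the start or end of the sentence simply get a shorter context.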

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [58]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.

In [59]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if 0 == 1:

    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)
    
    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1,12):

        food2vec.train(trigram_sentences)
        food2vec.save(word2vec_filepath)
        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print('{} training epochs so far.'.format(food2vec.train_count))

12 training epochs so far.
CPU times: user 5.43 s, sys: 891 ms, total: 6.32 s
Wall time: 7.12 s


On my four-core machine, each epoch over all the text in the ~1 million Yelp reviews takes about 5-10 minutes.

In [60]:
print('{:,} terms in the food2vec vocabulary.'.format(len(food2vec.vocab)))

50,835 terms in the food2vec vocabulary.


Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [90]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda item: -item[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors

term,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.035762,-0.173890,-0.035782,-0.007144,0.032371,-0.065272,-0.219383,-0.064665,0.002739,0.025802,...,0.050136,0.044030,0.145281,-0.020442,0.128879,-0.076461,0.075532,-0.012841,0.024710,-0.067555
be,-0.074780,-0.049524,0.085974,-0.098892,0.141556,0.024878,-0.011119,-0.175374,0.005410,-0.110996,...,-0.199047,-0.081284,-0.198344,0.007257,0.075339,0.070266,-0.008326,-0.127542,-0.046246,0.110279
and,-0.070505,-0.026918,0.028344,-0.099909,0.127974,-0.058155,-0.056091,-0.028973,0.197281,-0.040528,...,-0.049051,-0.212434,-0.042576,0.055731,0.117097,-0.206737,0.055435,-0.065056,0.052316,-0.078666
i,-0.161238,0.050831,-0.081706,-0.084479,0.053073,-0.102327,-0.108607,-0.001920,-0.057367,-0.050715,...,0.028528,-0.016578,-0.179229,0.053357,0.070913,0.036893,-0.000544,-0.007254,-0.056005,0.106345
a,-0.083491,-0.033712,-0.124125,-0.110776,-0.033046,-0.089950,0.025416,-0.052321,-0.059281,0.074985,...,-0.101939,0.022392,0.057049,0.015819,-0.001798,0.001103,0.003096,0.037175,-0.074279,0.001683
to,-0.012082,0.033135,-0.063183,-0.057252,-0.018721,-0.017931,-0.027784,0.112110,0.020549,-0.174336,...,-0.017111,-0.067532,-0.022149,0.154788,-0.093789,-0.020456,0.065478,0.075484,-0.053530,-0.005314
it,0.025022,0.081581,0.127987,-0.188015,0.041450,-0.126222,0.172725,-0.149931,-0.069566,-0.036031,...,0.045720,0.094828,0.089329,0.051623,-0.108989,-0.145476,0.068617,0.090687,-0.101725,0.090377
have,-0.140812,-0.070552,0.022102,0.001077,0.109890,-0.061365,0.046450,0.003073,0.113845,-0.038957,...,-0.051071,-0.090922,-0.022011,0.157082,-0.082406,-0.010306,-0.063481,-0.098728,-0.064020,0.153466
of,-0.036341,-0.054903,0.000644,-0.010602,0.168195,-0.058505,-0.052342,0.039159,-0.053572,-0.160039,...,0.085908,-0.211464,-0.084990,0.082315,0.223018,-0.142501,0.280647,0.003435,-0.037710,-0.145140
not,-0.075276,0.109047,0.055135,0.052251,0.209437,0.084334,-0.122419,-0.193307,0.000699,-0.099067,...,-0.150619,-0.060446,0.181940,-0.118538,-0.002879,0.018827,0.084586,0.040437,0.070277,-0.047521


Holy wall of numbers! This DataFrame has 50,835 rows (one for each term in the vocabulary) and 100 columns. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [63]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):

        print('{:20} {}'.format(word, round(similarity, 3)))
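
A lookup like this boils down to ranking terms by the cosine similarity between their vectors. The sketch below shows that idea on a tiny hand-made vocabulary. The three-dimensional `toy_vectors` are invented for illustration (they are not values from the trained model), and `related_terms` is a hypothetical stand-in, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 3-dimensional vectors, for illustration only.
toy_vectors = {
    'pizza':  [0.9, 0.1, 0.0],
    'pasta':  [0.8, 0.2, 0.1],
    'burger': [0.6, 0.5, 0.1],
    'waiter': [0.0, 0.2, 0.9],
}

def related_terms(token, vectors, topn=2):
    """Return the topn (term, similarity) pairs closest to token."""
    target = vectors[token]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != token]
    # highest similarity first
    return sorted(scored, key=lambda pair: -pair[1])[:topn]

ranked = related_terms('pizza', toy_vectors)
for term, score in ranked:
    print('{:20} {}'.format(term, round(score, 3)))
```

With real word vectors the same ranking runs over tens of thousands of terms, which is why libraries precompute unit-length vectors and use matrix operations instead of a Python loop.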