In [2]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  We will use NLP basics to handle Yelp review data. Our objective is to be able to extract the sentiment (positive or negative) from review text.  


## Download and parse the data


The data set can be downloaded from Yelp:
'yelp_train_academic_dataset_review_reduced.json.gz'

The training data are a series of JSON objects, one per line, in a gzipped file. Python supports gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has an interface similar to `open` (it defaults to binary mode), but decompresses `.gz` files transparently.

The built-in `json` package has a `loads()` function that converts a JSON string into a Python dictionary.  We could call that once for each row of the file. [`ujson`](https://pypi.org/project/ujson/) has the same interface as the built-in `json` library, but is *substantially* faster (at the cost of less robust handling of malformed JSON).  We will use that inside a list comprehension to get a list of dictionaries:

In [3]:
from sklearn import base
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer
from sklearn.linear_model import Ridge
import dill
from sklearn.decomposition import TruncatedSVD
import pandas as pd

In [4]:
import gzip
import ujson as json

with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]
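
If `ujson` is not installed, the standard-library `json` module can be used the same way. The following is a minimal sketch under that assumption; the two records and the file name `sample_reviews.json.gz` are made up for illustration and are not part of the Yelp data.

```python
import gzip
import json
import os
import tempfile

# Two made-up review records, stored as one JSON object per line.
sample = [
    {"stars": 5, "text": "Great tacos."},
    {"stars": 1, "text": "Cold food, slow service."},
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# "rt" opens the gzipped file in text mode, so each line is a str.
with gzip.open(path, "rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), records[0]["stars"])
```

Opening the file in text mode sidesteps any question of whether the JSON parser accepts `bytes` input.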

Scikit-learn will want the labels in a separate data structure, so let's pull those out now.

In [7]:
stars = [row['stars'] for row in data]

In [10]:
#Reading review ids in a pandas data frame
review_data = pd.read_json("yelp_train_academic_dataset_review_reduced.json.gz", lines=True,compression="gzip")

In [11]:
review_data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,vcNAWiLM4dR7D2nwwJ7nCA,2013-04-19,_ePLBPrkrf4bhyiKWEn4Qg,1,I don't know what Dr. Goldberg was like before...,review,Qrs3EICADUKNFoUq2iHStA,"{'funny': 0, 'useful': 0, 'cool': 0}"
1,JwUE5GmEO-sH1FuwJgKBlQ,2009-05-04,ow1c4Lcl3ObWxDC2yurwjQ,4,"If you like lot lizards, you'll love the Pine ...",review,ZYaumz29bl9qHpu-KVtMGA,"{'funny': 6, 'useful': 0, 'cool': 0}"
2,JwUE5GmEO-sH1FuwJgKBlQ,2011-03-31,4iPPOQIo5Mr1NAUPUgCUrQ,4,Only went here once about a year and a half ag...,review,EEYwj6_t1OT5WQGypqEPNg,"{'funny': 0, 'useful': 0, 'cool': 0}"
3,JwUE5GmEO-sH1FuwJgKBlQ,2012-01-08,_utPYHIdXeq8CqQ4iYD1bw,3,Ate a Saturday morning breakfast at the Pine C...,review,MnXcXwr0keJpkIiwuPsOKg,"{'funny': 0, 'useful': 1, 'cool': 0}"
4,JwUE5GmEO-sH1FuwJgKBlQ,2012-08-26,gksnzyc9jQ9hNXESjvTrQw,3,This is definitely not your usual truck stop. ...,review,wC8r-m6KHifL6R2i8ok8yg,"{'funny': 0, 'useful': 1, 'cool': 0}"


### Notes:

1. [Pandas](http://pandas.pydata.org/) is able to read JSON text directly.  Use the `read_json()` function with the `lines=True` keyword argument.

2. There are obvious mistakes in the data.  There is no need to correct them.
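
As a small illustration of note 1, the sketch below reads line-delimited JSON from an in-memory string rather than from the Yelp file. The two records are invented for the example.

```python
import io

import pandas as pd

# Two made-up records, one JSON object per line.
json_lines = (
    '{"stars": 5, "text": "Great tacos."}\n'
    '{"stars": 1, "text": "Cold food."}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
df = pd.read_json(io.StringIO(json_lines), lines=True)
print(df.shape)
```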


## Building models


We will build and train several estimators that predict the star rating given certain features. In most cases we will use a pipeline containing custom or pre-built transformers and an existing estimator.


For some models it is also useful to serialize the trained model to disk.  This allows us to reload it after restarting the Jupyter notebook, without needing to retrain it.  We will be using the [`dill` library](https://pypi.python.org/pypi/dill) for this (although the [`joblib` library](http://scikit-learn.org/stable/modules/model_persistence.html) also works). The files must be opened in binary mode:
```python
dill.dump(estimator, open('estimator.dill', 'wb'))
estimator = dill.load(open('estimator.dill', 'rb'))
```
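
The round trip can be tried without training anything. The sketch below uses the standard-library `pickle` module, whose `dump`/`load` interface `dill` mirrors, and a plain dictionary as a hypothetical stand-in for a fitted estimator.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator.
model = {"name": "toy-model", "coef": [0.5, -1.25]}

path = os.path.join(tempfile.mkdtemp(), "estimator.pkl")

# Serialized objects are bytes, so the file is opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Using `with` blocks also ensures the file handle is closed once the dump or load finishes.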

## bag_of_words_model

We will build a linear model predicting the star rating based on the count of the words in each document (bag-of-words model). We will use a [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) or [`HashingVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) to produce a feature matrix giving the counts of each word in each review. Then we will feed this into a linear model to predict the number of stars for each review.
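
To make the feature matrix concrete, here is a minimal sketch on two invented documents (not Yelp data). It uses `CountVectorizer` with default settings, so each column corresponds to one word of the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents.
docs = ["good food good service", "bad food"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

col = vectorizer.vocabulary_  # maps each word to its column index
dense = counts.toarray()

print(dense.shape)
print(dense[0, col["good"]], dense[1, col["bad"]])
```

`HashingVectorizer` produces the same kind of matrix but maps words to columns with a hash function, so it keeps no vocabulary in memory.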


In [5]:

class ColumnSelectTransformer(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col_names):
        self.col_names = col_names  # We will need these in transform()
        self.data = {}
    def fit(self, X, y=None):
        # This transformer doesn't need to learn anything about the data,
        # so it can just return self without any further processing
        return self
    
    def transform(self, X):
        # Return an array with the same number of rows as X and one
        # column for each in self.col_names
        return [row[self.col_names] for row in X]
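
A quick way to check the transformer is to run it on a couple of hand-written rows. The sketch below repeats a trimmed copy of the class so that it is self-contained; the rows are invented.

```python
from sklearn import base


class ColumnSelectTransformer(base.BaseEstimator, base.TransformerMixin):
    # Trimmed copy of the transformer above, repeated so this example runs alone.
    def __init__(self, col_names):
        self.col_names = col_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.col_names] for row in X]


# Made-up rows in the same shape as the parsed review dictionaries.
rows = [
    {"text": "Great tacos.", "stars": 5},
    {"text": "Cold food.", "stars": 1},
]

texts = ColumnSelectTransformer("text").fit_transform(rows)
print(texts)
```

With a single column name, the result is a flat list of strings, which is the input format the text vectorizers expect.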

In [8]:

bag_of_words_est = Pipeline([
    ("Cst", ColumnSelectTransformer("text")),
    ("HashingVect", HashingVectorizer()),
    # Frequency filter (if necessary)
    ("RG", Ridge())
])
bag_of_words_est.fit(data, stars)

Pipeline(memory=None,
     steps=[('Cst', ColumnSelectTransformer(col_names='text')), ('HashingVect', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngr...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

## normalized_model

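Normalization is key for good linear regression. Previously, we used the raw word counts with no further normalization.  Here we add a normalization transformer to the pipeline to improve the score.

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common normalization scheme used in text processing. Note that with `use_idf=False`, as in the pipeline below, `TfidfTransformer` only normalizes the term frequencies of each review (to unit L2 norm by default) and skips the inverse-document-frequency weighting.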
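
The effect of `TfidfTransformer(use_idf=False)` can be seen on a tiny count matrix. In this sketch the counts are invented: the second "review" is ten times longer than the first but uses the two words in the same proportion.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up word counts: rows are reviews, columns are words.
counts = np.array([
    [3, 4],
    [30, 40],
])

# With use_idf=False, each row is only rescaled to unit L2 norm (the default norm).
tf = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()
print(tf)
```

Both rows map to the same vector, so review length no longer influences the features fed to the regression.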

In [None]:
bag_of_words_normalized_est = Pipeline([
    ("Cst", ColumnSelectTransformer("text")),
    ("HashingVect", HashingVectorizer()),
    ("Tfid_normalizer", TfidfTransformer(use_idf=False)),
    ("RG", Ridge())
])
bag_of_words_normalized_est.fit(data, stars)

In [None]:
dill.dump(bag_of_words_normalized_est, open('bag_of_words_normalized_est.dill', 'wb'))

## bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear.  This is going to be a much higher dimensional problem (large $p$) so you should be careful about overfitting.
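
The growth in dimensionality is visible even on one sentence. This sketch uses `CountVectorizer` rather than `HashingVectorizer` only because it exposes the vocabulary for inspection; the sentence is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was great"]

# Unigrams only versus unigrams plus bigrams.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
both = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(unigrams.vocabulary_), len(both.vocabulary_))
print(sorted(both.vocabulary_))
```

Four words yield three additional bigram features here; on a real corpus the number of distinct bigrams grows much faster than the number of distinct words.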

In [None]:
bag_of_words_bigram_est = Pipeline([
    ("Cst", ColumnSelectTransformer("text")),
    ("HashingVect", HashingVectorizer(ngram_range = (1,2))),
    ("Tfid_normalizer", TfidfTransformer(use_idf=False)),
    ("RG", Ridge())
])
bag_of_words_bigram_est.fit(data, stars)

In [None]:
dill.dump(bag_of_words_bigram_est, open('bag_of_words_bigram_est.dill', 'wb'))

In [228]:
bag_of_words_bigram_est = dill.load(open('bag_of_words_bigram_est.dill', 'rb'))

## food_bigrams

We want to find collocations, that is, bigrams that are "special" and appear more often than we'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can find word pairs that would be unlikely to occur consecutively if their words appeared independently, based on the underlying probability of each word. Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  We will find the top 100 (mostly food) bigrams under this statistic, using the 'right' prior factor: in the code below, a constant of 30 is added to each unigram count, which damps the score of bigrams made of very rare words. We will estimate the probabilities by counting, using `CountVectorizer` to count how many times each word and each bigram appears in each review.
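
The statistic and the role of the prior factor can be worked through on a toy corpus in plain Python. This is a sketch, not the notebook's implementation: the three sentences and the helper names `score` and `top_pair` are invented. It uses raw counts instead of probabilities, as the notebook's code does, since the normalizing totals only rescale every score by the same constant.

```python
from collections import Counter

# Made-up corpus of three short "reviews".
corpus = [
    "hong kong style cafe",
    "great food in hong kong",
    "great service and great food",
]

unigram_counts = Counter()
bigram_counts = Counter()
for doc in corpus:
    tokens = doc.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))


def score(pair, smoothing):
    # count(w1 w2) / ((count(w1) + smoothing) * (count(w2) + smoothing))
    w1, w2 = pair
    return bigram_counts[pair] / (
        (unigram_counts[w1] + smoothing) * (unigram_counts[w2] + smoothing)
    )


def top_pair(smoothing):
    # Highest-scoring bigram for a given smoothing constant.
    return max(bigram_counts, key=lambda pair: score(pair, smoothing))


print(score(("style", "cafe"), 0), score(("hong", "kong"), 0))
print(top_pair(5))
```

Without smoothing, a pair of words that each occur once ("style cafe") gets the maximum score of 1.0 and outranks "hong kong" at 0.5. With a smoothing constant of 5, the repeated pair "hong kong" comes out on top.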

We can use the business data set to map each business to its id and categories. It can be downloaded from Yelp:
"yelp_train_academic_dataset_business.json.gz"

In [13]:
business_data = pd.read_json("yelp_train_academic_dataset_business.json.gz", lines=True,compression="gzip")

Each row of this file corresponds to a single business.  The `categories` key gives a list of categories for each; we take all businesses where "Restaurants" appears.

In [230]:
business_data.head()

Unnamed: 0,attributes,business_id,categories,city,full_address,hours,latitude,longitude,name,neighborhoods,open,review_count,stars,state,type
0,{'By Appointment Only': True},vcNAWiLM4dR7D2nwwJ7nCA,"[Doctors, Health & Medical]",Phoenix,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...","{'Tuesday': {'close': '17:00', 'open': '08:00'...",33.499313,-111.983758,"Eric Goldberg, MD",[],True,7,3.5,AZ,business
1,"{'Take-out': True, 'Good For': {'dessert': Fal...",JwUE5GmEO-sH1FuwJgKBlQ,[Restaurants],De Forest,"6162 US Highway 51\nDe Forest, WI 53532",{},43.238893,-89.335844,Pine Cone Restaurant,[],True,26,4.0,WI,business
2,"{'Take-out': True, 'Good For': {'dessert': Fal...",uGykseHzyS5xAMWoN6YUqA,"[American (Traditional), Restaurants]",De Forest,"505 W North St\nDe Forest, WI 53532","{'Monday': {'close': '22:00', 'open': '06:00'}...",43.252267,-89.353437,Deforest Family Restaurant,[],True,16,4.0,WI,business
3,"{'Take-out': True, 'Wi-Fi': 'free', 'Takes Res...",LRKJF43s9-3jG9Lgx4zODg,"[Food, Ice Cream & Frozen Yogurt, Fast Food, R...",De Forest,"4910 County Rd V\nDe Forest, WI 53532","{'Monday': {'close': '22:00', 'open': '10:30'}...",43.251045,-89.374983,Culver's,[],True,7,4.5,WI,business
4,"{'Take-out': True, 'Has TV': False, 'Outdoor S...",RgDg-k9S5YD_BaxMckifkg,"[Chinese, Restaurants]",De Forest,"631 S Main St\nDe Forest, WI 53532","{'Monday': {'close': '22:00', 'open': '11:00'}...",43.240875,-89.343722,Chang Jiang Chinese Kitchen,[],True,3,4.0,WI,business


In [14]:
restaurant_ids = business_data["business_id"].loc[business_data["categories"].apply(lambda x:"Restaurants" in x)]

1        JwUE5GmEO-sH1FuwJgKBlQ
2        uGykseHzyS5xAMWoN6YUqA
3        LRKJF43s9-3jG9Lgx4zODg
4        RgDg-k9S5YD_BaxMckifkg
9        _wZTYYL7cutanzAnJUTGMA
12       zOc8lbjViUZajbY7M0aUCQ
13       UgjVZTSOaYoEvws_lAP_Dw
15       SKLw05kEIlZcpTD5pqma8Q
16       77ESrCo7hQ96VpCWWdvoxg
21       KTqNU4plO23583DYAMGXYg
26       ShEYKerTwb2LSORE5o_s7A
27       HaBkx5PwvbBpQ2iNCgHnVQ
28       KW6HejC-67KSL9J8Cz1dSw
29       wl-7A4jC0f27MOEmW-XTbQ
31       suQyHycqv8nA7EualcUB3g
35       uga5g16PncJOtY7Sc3A05w
38       xbJ9tdGbcVJIUkNgHSCwZQ
39       koRLvzIl-4fCybOIdk56jA
44       cruBFtsFaBuhX_I72uU8pA
45       X47o6R2rWiqGqvBtC7VS3Q
48       lN-5-YTsaJr_IByyA476iw
50       IEmqrFe96NOhU07TA0rZdw
53       KFJ1jBfFkRfyn3AoAUl3YQ
56       G1qUGBNYNS220jKlPAlFsA
59       6oDZ78JiSKjXVLcurCpnhw
61       kiAsweI4sOjDzGyz3Y_HsQ
63       q8fD82us6uuGufvI44NoAg
64       oc0rCahXOaJeHLzzDdSfyA
65       Nj0vQjUYjaa7qbvZlEr2jQ
67       B0Vuwn6Hugc-0U5n31YBfg
                  ...          
37874   

In [16]:
#Check points
assert len(restaurant_ids) == 12876

In [12]:
#two data frames are review_data and business_data

In [17]:
review_data[["business_id","text"]].head()

Unnamed: 0,business_id,text
0,vcNAWiLM4dR7D2nwwJ7nCA,I don't know what Dr. Goldberg was like before...
1,JwUE5GmEO-sH1FuwJgKBlQ,"If you like lot lizards, you'll love the Pine ..."
2,JwUE5GmEO-sH1FuwJgKBlQ,Only went here once about a year and a half ag...
3,JwUE5GmEO-sH1FuwJgKBlQ,Ate a Saturday morning breakfast at the Pine C...
4,JwUE5GmEO-sH1FuwJgKBlQ,This is definitely not your usual truck stop. ...


In [18]:
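#Creating a new data frame with only the restaurant reviews
restaurant_reviews = pd.merge(review_data[["business_id","text"]],restaurant_ids.to_frame(), on="business_id")
#The cells below use restaurant_reviews_txt, which is never defined in this notebook;
#assuming it is the review text column of the merged frame
restaurant_reviews_txt = restaurant_reviews["text"]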

In [21]:
bigrams = CountVectorizer(min_df=21,stop_words="english",ngram_range = (2,2))
bigramsVec = bigrams.fit_transform(restaurant_reviews_txt.values)
unigram = CountVectorizer(stop_words="english")
unigramVec = unigram.fit_transform(restaurant_reviews_txt.values)
unigramVec = unigramVec.sum(axis=0)
bigramsVec = bigramsVec.sum(axis=0)

In [22]:
dill.dump(unigramVec, open('unigramVec.dill', 'wb'))

In [23]:
dill.dump(bigramsVec, open('bigramsVec.dill', 'wb'))

In [387]:
bgram = bigramsVec.A1
ngram = unigramVec.A1 

In [390]:
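def FindProb(bigram, bigramCount, unigram, unigramCount):
    # Score each bigram by count(w1 w2) / ((count(w1) + 30) * (count(w2) + 30)).
    # The +30 is the prior factor that damps bigrams made of rare words.
    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.
    bigramNames = bigram.get_feature_names()
    unigramDict = unigram.vocabulary_  # maps word -> column index
    Normalized = []
    for i, name in enumerate(bigramNames):
        w = name.split(" ")
        num = bigramCount[i]
        d1 = unigramCount[unigramDict[w[0]]] + 30
        d2 = unigramCount[unigramDict[w[1]]] + 30
        Normalized.append(num / (d1 * d2))
    return Normalized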

In [391]:
Norm = FindProb(bigrams, bgram, unigram, ngram)

In [394]:
names=bigrams.get_feature_names()
names

['00 00',
 '00 dinner',
 '00 lunch',
 '00 meal',
 '00 people',
 '00 person',
 '00 plus',
 '00 pm',
 '00 tip',
 '00 worth',
 '10 00',
 '10 10',
 '10 11',
 '10 12',
 '10 13',
 '10 14',
 '10 15',
 '10 20',
 '10 30',
 '10 30am',
 '10 30pm',
 '10 45',
 '10 50',
 '10 95',
 '10 99',
 '10 bucks',
 '10 burger',
 '10 coupon',
 '10 days',
 '10 different',
 '10 discount',
 '10 dishes',
 '10 dollars',
 '10 drink',
 '10 feet',
 '10 food',
 '10 free',
 '10 good',
 '10 got',
 '10 great',
 '10 just',
 '10 lunch',
 '10 meal',
 '10 miles',
 '10 min',
 '10 mins',
 '10 minute',
 '10 minutes',
 '10 order',
 '10 ordered',
 '10 oz',
 '10 people',
 '10 person',
 '10 pizza',
 '10 plate',
 '10 plus',
 '10 pm',
 '10 really',
 '10 sandwich',
 '10 seconds',
 '10 service',
 '10 small',
 '10 stars',
 '10 tables',
 '10 time',
 '10 times',
 '10 tip',
 '10 year',
 '10 years',
 '100 00',
 '100 bucks',
 '100 challenge',
 '100 degree',
 '100 degrees',
 '100 different',
 '100 people',
 '100 person',
 '100 sure',
 '100 times

In [397]:
Top = [[a, b] for a, b in zip(names, Norm)]
Top = sorted(Top, key=lambda x:-x[1])
Top

[['knick knacks', 0.007562008469449486],
 ['rula bula', 0.0074211502782931356],
 ['ropa vieja', 0.007165437302423604],
 ['itty bitty', 0.006979062811565304],
 ['dac biet', 0.006771780877299374],
 ['gulab jamun', 0.006701414743112435],
 ['patatas bravas', 0.006085192697768763],
 ['puerto rican', 0.00590717299578059],
 ['wal mart', 0.0057817998994469585],
 ['bradley ogden', 0.005681818181818182],
 ['lomo saltado', 0.005643738977072311],
 ['vice versa', 0.005526083112290009],
 ['valle luna', 0.00545578540577404],
 ['kao tod', 0.005432595573440644],
 ['artery clogging', 0.005151604356785399],
 ['har gow', 0.005098257322951428],
 ['pina colada', 0.00497787610619469],
 ['bells whistles', 0.00496031746031746],
 ['harry potter', 0.0048430840759395586],
 ['aguas frescas', 0.004618473895582329],
 ['ping pang', 0.004591710101762224],
 ['casey moore', 0.0045281465472882575],
 ['pin kaow', 0.004526935264825713],
 ['cochinita pibil', 0.004387602688573563],
 ['scantily clad', 0.004330343334364367],
 

In [399]:
# Keep only the names of the 100 highest-scoring bigrams
Top100 = [name for name, _ in Top[:100]]
Top100

['knick knacks',
 'rula bula',
 'ropa vieja',
 'itty bitty',
 'dac biet',
 'gulab jamun',
 'patatas bravas',
 'puerto rican',
 'wal mart',
 'bradley ogden',
 'lomo saltado',
 'vice versa',
 'valle luna',
 'kao tod',
 'artery clogging',
 'har gow',
 'pina colada',
 'bells whistles',
 'harry potter',
 'aguas frescas',
 'ping pang',
 'casey moore',
 'pin kaow',
 'cochinita pibil',
 'scantily clad',
 'demi glace',
 'lactose intolerant',
 'thit nuong',
 'kilt lifter',
 'moscow mule',
 'woody allen',
 'hustle bustle',
 'dulce leche',
 'cabo wabo',
 'kee mao',
 'tres leches',
 'arnold palmer',
 'coca cola',
 'stainless steel',
 'kool aid',
 'osso bucco',
 'rick moonen',
 'van buren',
 'huli huli',
 'fleur lys',
 'insult injury',
 'quench thirst',
 'bok choy',
 'fogo chao',
 'jean philippe',
 'toby keith',
 'tilted kilt',
 'identity crisis',
 'parmigiano reggiano',
 'hush puppies',
 'sierra bonita',
 'nba finals',
 'panna cotta',
 'apache junction',
 'petit fours',
 'hong kong',
 'peter piper'

In [None]:
#Check point
assert len(restaurant_reviews_txt) == 143361

*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*