<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset/challenge). As part of the job selection process, some of my friends have been asked to create analysis of this dataset, so I want to empower you to have a head start.  

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data for you to something more managable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
*Successfully complete these all these objectives to earn a 2. There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

In [1]:
import pandas as pd

yelp = pd.read_json('./data/review_sample.json', lines=True)

In [2]:
yelp.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA


## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.

In [3]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS



In [4]:
def tokenize(doc):
    return [token for token in simple_preprocess(doc) if token not in STOPWORDS]

In [5]:
yelp['tokens'] = yelp['text'].apply(tokenize)

In [6]:
yelp['tokens'][0:10]

0    [beware, fake, fake, fake, small, business, lo...
1    [came, lunch, togo, service, quick, staff, fri...
2    [ve, vegas, dozens, times, stepped, foot, circ...
3    [went, night, closed, street, party, best, act...
4    [stars, bad, price, lunch, seniors, pay, eatin...
5    [tasty, fast, casual, latin, street, food, men...
6    [absolutely, amazing, incredible, production, ...
7    [came, pho, enjoyed, got, pm, busy, got, serve...
8    [absolutely, unique, experience, nail, shop, f...
9    [wow, walked, sat, bar, minutes, bartenders, w...
Name: tokens, dtype: object

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews
2. Write a fake review and query for the 10 most similiar reviews, print the text of the reviews. Do you notice any patterns?
    - Given the size of the dataset, it will probably be best to use a `NearestNeighbors` model for this. 

In [34]:
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [8]:
tfidf = TfidfVectorizer(tokenizer=tokenize, min_df=0.1, max_df=0.9, ngram_range=(1,2))

sparse = tfidf.fit_transform(yelp['text'])

dtm = pd.DataFrame(sparse.todense(), columns=tfidf.get_feature_names())

dtm.head()

Unnamed: 0,amazing,best,better,came,come,definitely,delicious,experience,food,friendly,...,people,place,recommend,restaurant,service,staff,time,try,ve,went
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.706692,0.0,0.0,0.0
1,0.0,0.0,0.0,0.508688,0.0,0.0,0.0,0.0,0.0,0.496724,...,0.0,0.0,0.0,0.0,0.363028,0.489918,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.458496,0.482884
3,0.0,0.511373,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.558702,0.366225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.540582
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.700211,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=10, p=2, radius=1.0)

In [10]:
query = [""" I took my family to this establishment and we were immediately seated. 
            However, it took nearly 30 minutes for someone to take our order. 
            When they finally did, we waited an hour without any food so we decided
            to leave and find somewhere else to leave. When the owner saw us leaving 
            he was immediately irate and began to chase after us. We tried to drive away 
            but he ran in front of our car. We started to move forward and he jumped on the hood.
            We then proceeded to drive him to the local authorities. """]

In [11]:
looking = tfidf.transform(query)
nn.kneighbors(looking.todense())

(array([[0.        , 0.        , 0.42529452, 0.46630881, 0.48821427,
         0.48995944, 0.4935585 , 0.49501427, 0.49537281, 0.50291764]]),
 array([[6731, 7091, 6757, 9683, 3530, 2585, 7840,  328, 1883, 6746]],
       dtype=int64))

In [12]:
yelp['text'][6746]

'Currently waiting for my food in the drive thru. I don\'t care how good this food is going to be because this place has some of the worst customer service I have ever experienced. There are 3 people in my car who all have different orders and different questions and before you know it the girl who takes orders complains and asks us to drive to the window instead because she couldn\'t handle the simple task of actually answering our questions. We waited 20 min to get to the window and order our food because they took 20 min to make the order for the car in front of that BUT WAIT when we got to the window some guy I\'m guessing he was the cook says "so you were the ones with all those questions or whatever SO anyways you ordered..." Like how fucking rude can you be??? And to top it off the girl who took our order laughed and rolled her eyes when he finished taking our order, I\'m sorry you can\'t handle customers asking you simple questions about a menu you SHOULD know about. Learn how 

In [13]:
yelp['text'][1883]

'Excellent!!! Very fresh and tasty. Very well priced a for big portions. Two can eat off one order. My Fav place for Thai food.'

In [14]:
yelp['text'][328]

'I pulled up in the drive-through, said hello twice, and waited a couple minutes before a man said "yeah, what do you want?". He got my 3 item order wrong and had to be corrected. When I pulled up to pay he stuck his hand out the window and told me to stop before I was at the window. He was clearly taking the order of the person begging me. Once he was done with that he signaled to me to pull ahead. I worked in fast food long enough to know that he was trying to keep his timer down, but this was extremely rude and unnecessary. He was not friendly at the window either and the food was cold.'

In [15]:
yelp['text'][7840]

"9/22/17\nCalling and placing an order was easy. Payment over the phone with no complications.  Food arrived exactly when they said it would. Great amount of food for the price.  The food also tasted amazing and was the perfect temperature.  Ground and through, I give this place an A+++!!!\nFast Forward to 11/4/17\nI called and placed an order at 16:30. I am informed it will be 40 minutes to an hour for delivery, which is more than acceptable for a Saturday evening.  At 17:50, an hour and 20 minutes later, I started calling to see where my order was.  20 minutes of calling with no answer. So I make my own food, leftovers, and at 18:20 the driver shows up.  Really, an hour and 50 minutes? Food is room temperature and I asked what the delay was and I was informed it's a busy night.  Unacceptable.  We have always had great customer service and excellent food, but today proved us wrong.  I asked for the manager to call us and the driver took my number.  If I get a call, I would expect some

In [16]:
yelp['text'][2585]

'I was very disappointed with the way I was treated my first time ordering from this restaurant.\nAfter buying a coupon book to support our local high school baseball team, I decided to use a coupon that entitled me to a free order of small breadsticks with any large pizza. I told the man over the phone that I had the coupon, and he quoted me $14 and some odd cents for my order. But when the delivery driver showed up, he told me it was $21. I asked why it was so much, and he said the man on the phone forgot to add something to my bill. I should have looked it over more carefully before I signed it, but I signed it anyway and then reviewed the receipt in my home.\nThe restaurant charged me for a large order of breadsticks with my pizza. I called to explain their mistake and to ask for the $6 back that they charged me for the breadsticks. A very rude and inexperienced young girl was not helpful or courteous to me, did not apologize for the mistake, and told me she could refund me $3.50 w

In [17]:
yelp['text'][3530]

"I don't love the sushi here, but my husband thinks it is acceptable. We used to order from here if we crave sushi but is broke since the sushi here is very cheap, so we will order the platters and have food for a few meals. The sushi here is about the same level as supermarket sushi."

In [18]:
yelp['text'][9683]

"The most horrible Taco Bell I've ever been too. I made my order and waited for her to take my separate order and didn't say anything for 5 mins. I finally drove up and waited for another 5 mins for her to answer the window for me just to repeat my second order in the pouring rain. Then messed up both orders and combined them into 1 full order when I clearly said 2! It takes about 30 minutes to get food here. Even if your the only one in line. Don't eat here. If I could give it 0 stars I would."

In [19]:
yelp['text'][6757]

"This location is the worst in all of beautiful 'merica. We have used the drive thru on 6 separate occasions in a period of 4 months when we first moved to this neighborhood. EVERY time they forgot items for our order or gave us completely wrong food. Called a couple times to give 'em some feedback without being rude. It has been 2 years now since they screwed up our last order because we don't go there anymore. This Taco Bell/KFC is a mile from our house but we much rather drive 3 miles further and visit the other locations."

In [20]:
yelp['text'][7091]

'So we are asked to park the car for our order.  12 minutes later we get our food.  Cold fries.   Asked for extra onions on the cheeseburgers.  No extra onions.  What the hell?'

In [21]:
yelp['text'][6731]

"Stopped by to pick up some BBQ beef to take home and make sandwiches for the family. Unlike some BBQ places in the valley, the meat was very lean, well cooked with a slightly smokey, well seasoned taste. We did not get any sides nor drinks so I cannot evaluate those. \nThe cost per pound for the beef was a couple of dollars less than Dave's BBQ, Tom's BBQ or Andrew's BBQ. There were four customers ahead of me in line, but it only took 5 minutes for the food to be ready after I placed my order."

## Analysis
### Some reviews matched the unhappiness of my fake review but others were positive and just talked about ordering food.

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a piepline object with a sklearn `CountVectorizer` or `TfidfVector` and any sklearn classifier. Use that pipeline to estimate a model to predict `stars`. Use the Pipeline to predict a star rating for your fake review from Part 2. 
2. Tune the entire pipeline with a GridSearch

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [23]:
vect = TfidfVectorizer(stop_words='english')
rfc = RandomForestClassifier()

pipe = Pipeline([('vect', vect), ('clf', rfc)])

In [24]:
parameters = {
    'vect__max_df': [0.75, 1.0],
    'vect__min_df': [.02, .05],
    'vect__max_features': [100, 500],
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [5, 8]
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(yelp['text'], yelp['stars'])

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   34.2s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  2.7min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__max_df': [0.75, 1.0], 'vect__min_df': [0.02, 0.05], 'vect__max_features': [100, 500], 'clf__n_estimators': [100, 200], 'clf__max_depth': [5, 8]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [42]:
grid_search.best_params_

{'clf__max_depth': 8,
 'clf__n_estimators': 100,
 'vect__max_df': 1.0,
 'vect__max_features': 500,
 'vect__min_df': 0.02}

In [25]:
grid_search.predict(query)

array([5], dtype=int64)

In [35]:
vect2 = CountVectorizer(stop_words='english')

pipe2 = Pipeline([('vect', vect2), ('clf', rfc)])

In [38]:
parameters = {
    'vect__max_df': [0.75, 1.0],
    'vect__min_df': [.02, .05],
    'vect__max_features': [100, 500],
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [5, 8]
}

grid_search2 = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search2.fit(yelp['text'], yelp['stars'])

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   34.7s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  2.9min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__max_df': [0.75, 1.0], 'vect__min_df': [0.02, 0.05], 'vect__max_features': [100, 500], 'clf__n_estimators': [100, 200], 'clf__max_depth': [5, 8]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [39]:
grid_search2.predict(query)

array([5], dtype=int64)

In [43]:
grid_search2.best_params_

{'clf__max_depth': 8,
 'clf__n_estimators': 100,
 'vect__max_df': 1.0,
 'vect__max_features': 500,
 'vect__min_df': 0.02}

## Stretch goal
The difference between the CountVectorizer and the TfidfVectorizer is not apparent here because I implemented stop words. However, with some research, you will find that CountVectorizer simply obtains the count for every word. With the TfidfVectorizer, the use of IDF(Inverse Document Frequency) helps to adjust the weights of count because some words like "the" or "and" are going to appear more frequently regardless of their importance in the text.

## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate a LDA topic model of the review text
    - Keep the `iterations` parameter at or below 5 to reduce run time
    - The `workers` parameter should match the number of physical cores on your machine.
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

In [26]:
from gensim.models import LdaMulticore
from gensim.corpora import Dictionary

Learn the vocubalary of the yelp data:

In [27]:
id2word = Dictionary(yelp['tokens'])

Create a bag of words representation of the entire corpus

In [28]:
corpus = [id2word.doc2bow(text) for text in yelp['tokens']]

Your LDA model should be ready for estimation: 

In [29]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   iterations=5,
                   workers=8,
                   num_topics = 14 # You can change this parameter
                  )

In [30]:
lda.print_topics()

[(0,
  '0.013*"place" + 0.011*"food" + 0.009*"good" + 0.008*"like" + 0.008*"time" + 0.007*"great" + 0.005*"service" + 0.004*"love" + 0.004*"best" + 0.003*"delicious"'),
 (1,
  '0.014*"great" + 0.009*"service" + 0.008*"food" + 0.007*"good" + 0.007*"place" + 0.005*"like" + 0.005*"got" + 0.005*"time" + 0.004*"staff" + 0.004*"come"'),
 (2,
  '0.011*"place" + 0.009*"great" + 0.009*"food" + 0.008*"good" + 0.007*"time" + 0.007*"like" + 0.005*"service" + 0.004*"got" + 0.004*"try" + 0.004*"order"'),
 (3,
  '0.012*"good" + 0.009*"food" + 0.009*"like" + 0.009*"place" + 0.008*"service" + 0.007*"great" + 0.006*"time" + 0.006*"nice" + 0.004*"got" + 0.004*"order"'),
 (4,
  '0.014*"good" + 0.014*"food" + 0.008*"place" + 0.008*"service" + 0.006*"great" + 0.006*"like" + 0.006*"time" + 0.004*"got" + 0.004*"love" + 0.004*"come"'),
 (5,
  '0.011*"food" + 0.010*"good" + 0.008*"like" + 0.007*"service" + 0.007*"great" + 0.007*"place" + 0.007*"time" + 0.005*"got" + 0.005*"ve" + 0.004*"nice"'),
 (6,
  '0.011*"s

In [31]:
import re
words = [re.findall(r'"([^"]*)"', t[1]) for t in lda.print_topics()]

In [32]:
topics = [' '.join(t[0:5]) for t in words]

Create 1-2 visualizations of the results

In [33]:
for id, t in enumerate(topics):
    print(f"------ Topic {id} ------")
    print(t, end='\n')
    print("\n")

------ Topic 0 ------
place food good like time


------ Topic 1 ------
great service food good place


------ Topic 2 ------
place great food good time


------ Topic 3 ------
good food like place service


------ Topic 4 ------
good food place service great


------ Topic 5 ------
food good like service great


------ Topic 6 ------
service place food great time


------ Topic 7 ------
good food place great time


------ Topic 8 ------
food good place service like


------ Topic 9 ------
good great food place time


------ Topic 10 ------
food time place great like


------ Topic 11 ------
place great good food service


------ Topic 12 ------
food service place time like


------ Topic 13 ------
great place good food service




In [34]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

  from collections import Iterable


In [36]:
# pyLDAvis.gensim.prepare(lda, corpus, id2word)

## Paragraph 
 Given how Yelp has become somewhat famous for where people go to complain about bad restaurant experiences, I was surprised at how many topics were extremely positive. I think this might have to do with the fact that I didn't remove duplicates or maybe there are a lot of positive reviews. If I had more time, I would comb through the data, make sure I got rid of all the duplicates, and check to see if I missed anything else.

## Stretch Goals

Complete one or more of these to push your score towards a three: 
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the yelp dataset - this one is v. hard. 
* Use a generator object on the reviews file - this would help you with the analyzing the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)