<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how to turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset/challenge). As part of the job selection process, some of my friends have been asked to create an analysis of this dataset, so I want to give you a head start.

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data down to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
*Successfully complete all of these objectives to earn a 2. There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

In [1]:
import pandas as pd

yelp = pd.read_json('review_sample.json', lines=True)

In [2]:
yelp.head()

| | business_id | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | nDuEqIyRc8YKS1q1fX0CZg | 1 | 2015-03-31 16:50:30 | 0 | eZs2tpEJtXPwawvHnHZIgQ | 1 | BEWARE!!! FAKE, FAKE, FAKE....We also own a sm... | 10 | n1LM36qNg4rqGXIcvVXv8w |
| 1 | eMYeEapscbKNqUDCx705hg | 0 | 2015-12-16 05:31:03 | 0 | DoQDWJsNbU0KL1O29l_Xug | 4 | Came here for lunch Togo. Service was quick. S... | 0 | 5CgjjDAic2-FAvCtiHpytA |
| 2 | 6Q7-wkCPc1KF75jZLOTcMw | 1 | 2010-06-20 19:14:48 | 1 | DDOdGU7zh56yQHmUnL1idQ | 3 | I've been to Vegas dozens of times and had nev... | 2 | BdV-cf3LScmb8kZ7iiBcMA |
| 3 | k3zrItO4l9hwfLRwHBDc9w | 3 | 2010-07-13 00:33:45 | 4 | LfTMUWnfGFMOfOIyJcwLVA | 1 | We went here on a night where they closed off ... | 5 | cZZnBqh4gAEy4CdNvJailQ |
| 4 | 6hpfRwGlOzbNv7k5eP9rsQ | 1 | 2018-06-30 02:30:01 | 0 | zJSUdI7bJ8PNJAg4lnl_Gg | 4 | 3.5 to 4 stars\n\nNot bad for the price, $12.9... | 5 | n9QO4ClYAS7h9fpQwa5bhA |


In [3]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [4]:
import spacy
from spacy.tokenizer import Tokenizer
import gensim
from gensim.parsing.preprocessing import STOPWORDS

nlp = spacy.load('en_core_web_sm')

tokenizer = Tokenizer(nlp.vocab)
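def tokenize(doc):
    # NOTE: `token` here is a spaCy Token, not a string, so `token not in STOPWORDS`
    # is always True and no stop words are removed (see the output in Part 1).
    # Comparing `token.text.lower()` against STOPWORDS would apply the filter.
    tokens = [token.text for token in tokenizer(doc) if token not in STOPWORDS]

    return tokens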

## Part 1: Tokenize Function
<a id="p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.

In [5]:
test = tokenize(yelp['text'].iloc[0])
print(test)

['BEWARE!!!', 'FAKE,', 'FAKE,', 'FAKE....We', 'also', 'own', 'a', 'small', 'business', 'in', 'Los', 'Alamitos,', 'CA', 'and', 'received', 'what', 'looked', 'like', 'a', 'legitimate', 'bill', 'for', '$70', 'with', 'an', 'account', 'number', 'and', 'all.', ' ', 'I', 'called', 'the', 'phone', 'number', 'listed', '(866)', '273-7934.', ' ', 'The', 'wait', 'time', 'on', 'hold', 'said', '20', 'minutes', 'and', 'to', 'leave', 'a', 'message.', ' ', 'I', 'could', 'not', 'get', 'a', 'live', 'person', 'on', 'the', 'phone', 'no', 'matter', 'what', 'number', 'I', 'selected.', ' ', 'I', 'left', 'a', 'very', 'FIRM', 'message', 'that', 'I', 'would', 'be', 'contacting', 'the', 'BBB', 'and', 'my', 'attorney', 'regarding', 'their', 'company', 'trying', 'to', 'scam', 'businesses.', 'This', 'has', 'to', 'be', 'illegal!!!!!']
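
The printed tokens still include common words such as 'a', 'and', and 'the'. In `tokenize`, each `token` is a spaCy `Token` object, so testing it against a set of strings never matches. Below is a minimal sketch of the intended filter using only the standard library, so it runs on its own. The regular expression and the tiny `STOP_WORDS` set are illustrative assumptions, not spaCy's tokenizer or gensim's `STOPWORDS`.

```python
import re

# Tiny illustrative stop-word set; a stand-in for gensim's much larger STOPWORDS.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "i", "to", "in", "for", "with", "what", "also"}
)

def tokenize_sketch(doc):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", doc.lower())
    # Compare the token *string* (not a Token object) against the stop-word set.
    return [w for w in words if w not in STOP_WORDS]

print(tokenize_sketch("BEWARE!!! FAKE, FAKE, FAKE....We also own a small business"))
# ['beware', 'fake', 'fake', 'fake', 'we', 'own', 'small', 'business']
```

With the spaCy tokenizer used in this notebook, the equivalent change would be to test `token.text.lower()` rather than `token` in the list comprehension.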


## Part 2: Vector Representation
<a id="p2"></a>
1. Create a vector representation of the reviews
2. Write a fake review and query for the 10 most similar reviews, then print the text of those reviews. Do you notice any patterns?
    - Given the size of the dataset, it will probably be best to use a `NearestNeighbors` model for this. 

In [6]:
data = yelp['text'].values

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

# Learn the vocabulary and compute TF-IDF weights per document
dtm = tfidf.fit_transform(data)
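
To make the values stored in `dtm` more concrete, here is a small standard-library sketch of a TF-IDF computation. It uses the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is how scikit-learn documents its default behaviour. The whitespace tokenization and the function name `tfidf_rows` are simplifying assumptions, so the numbers will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["good food", "bad food"])
print(vocab)  # ['bad', 'food', 'good']
```

In the first row, "good" receives a larger weight than "food" because "food" appears in every document and is therefore less distinctive.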


In [8]:
# Instantiate
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')

# Fit on dense TF-IDF vectors (ball_tree does not accept sparse input)
nn.fit(dtm.toarray())

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=10, p=2, radius=1.0)

In [9]:
fake_review = ["This restaurant was one of the worst I've ever seen. The waitress tried to steal my phone and the food tasted stale. Horrible service. Will not be coming back!"]

# Vectorize the fake review with the already-fitted vocabulary
new = tfidf.transform(fake_review)

# Distances and row indices of the 10 nearest reviews
nn.kneighbors(new.toarray())



(array([[1.        , 1.        , 1.15240698, 1.20476727, 1.23566525,
         1.23704062, 1.23894771, 1.24319716, 1.24480605, 1.25606254]]),
 array([[6311, 6204,  753, 4162, 9193, 9914, 4913, 2847, 7235, 9266]]))
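
A note on reading these distances: `TfidfVectorizer` L2-normalizes each row by default, so for two non-empty reviews the Euclidean distance `d` and the cosine similarity `cos` are related by `d = sqrt(2 - 2 * cos)`. A distance of exactly 1.0 is also what a unit-length query gets against an all-zero row, so one possible explanation for the first two hits is that those reviews contain none of the 5000 vocabulary terms. The sketch below checks both facts on made-up vectors; the vectors are assumptions for illustration, not rows of `dtm`.

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

query = [0.6, 0.8, 0.0]      # unit length
neighbor = [0.8, 0.6, 0.0]   # unit length
empty = [0.0, 0.0, 0.0]      # a document with no in-vocabulary terms

d = euclidean(query, neighbor)
print(math.isclose(d, math.sqrt(2 - 2 * dot(query, neighbor))))  # True
print(math.isclose(euclidean(query, empty), 1.0))                # True
```

Printing `data[6311]` and `data[6204]` would show whether those two reviews really are empty after vectorization.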

In [10]:
data[753]

"The service was horrible. Everything took too long and the food tasted bland. I love sushi and go out for it regularly and this is the worst place I've been to"

## Part 3: Classification
<a id="p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier. Use that pipeline to estimate a model to predict `stars`. Use the pipeline to predict a star rating for your fake review from Part 2.
2. Tune the entire pipeline with a GridSearch

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

vect = TfidfVectorizer(stop_words='english')
sgdc = SGDClassifier()

pipe = Pipeline([('vect', vect), ('clf', sgdc)])

In [12]:
star_data = yelp

In [13]:
pipe.fit(star_data['text'], star_data['stars'])



Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))])

In [14]:
pipe.predict(["The service was horrible. Everything took too long and the food tasted bland. I love sushi and go out for it regularly and this is the worst place I've been to"])

array([1])

In [15]:
pipe.predict(["Best restaurant ever! I loved it! So good."])

array([5])

In [16]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'clf__max_iter': (20, 10, 100),
}
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(star_data['text'], star_data['stars'])


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:   14.3s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 0.75, 1.0), 'clf__max_iter': (20, 10, 100)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
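
The "9 candidates, totalling 45 fits" line in the log follows directly from the parameter grid: every combination of the two settings is tried once per cross-validation fold. A small standard-library sketch of that enumeration is below. It only mirrors the counting and does not run any model.

```python
from itertools import product

param_grid = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "clf__max_iter": (20, 10, 100),
}
cv_folds = 5

# Cartesian product of all parameter values gives one candidate per combination.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 9 45
```

After fitting, `grid_search.best_params_` and `grid_search.best_score_` report the winning combination and its mean cross-validated score.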

## Part 4: Topic Modeling
<a id="p4"></a>

Let's find out what those yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

In [17]:
!pip install gensim



In [18]:
import numpy as np
import gensim
import os
import re

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

from gensim.models.ldamulticore import LdaMulticore

import pandas as pd

In [19]:
def doc_stream(limit=101):
    # Yield token lists for the first `limit` reviews only
    # (same 101 documents as the original `n <= 100` counter)
    for text in yelp['text'].head(limit):
        yield tokenize(str(text).strip('\n'))

In [20]:
# A Dictionary Representation of all the words in our corpus
id2word = corpora.Dictionary(doc_stream())

In [21]:
print(id2word)

Dictionary(3855 unique tokens: [' ', '$70', '(866)', '20', '273-7934.']...)


In [22]:
import sys
# Note: getsizeof reports only the shallow size of the Dictionary object, not the tokens it holds
print(sys.getsizeof(id2word))

56


In [23]:
corpus = [id2word.doc2bow(text) for text in doc_stream()]

In [24]:
print(corpus)

[[(0, 4), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 5), (13, 1), (14, 1), (15, 1), (16, 5), (17, 1), (18, 1), (19, 1), (20, 1), (21, 4), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 3), (52, 2), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 3), (64, 1), (65, 1), (66, 3), (67, 1), (68, 1), (69, 1), (70, 2), (71, 1), (72, 1)], [(16, 1), (23, 1), (31, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 2)], [(0, 2), (12, 8), (14, 1), (16, 7), (21, 12), (25, 1), (30, 1), (31, 2), (32, 1), (36, 1), (40, 1), (43, 1), (48, 3), (52, 2), (62, 1), (63, 
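
Each inner list above is one review in bag-of-words form: pairs of `(token_id, count)`, so `(0, 4)` means the token with id 0 (the `' '` token in the dictionary printout) occurs four times in the first review. The sketch below imitates that representation with the standard library. The helper names `build_vocab` and `to_bow` are made up for illustration, and the id assignment is an assumption that only approximates what gensim's `Dictionary` does.

```python
from collections import Counter

def build_vocab(docs):
    """Map each token to an integer id, assigning new ids document by document."""
    token2id = {}
    for tokens in docs:
        for tok in sorted(set(tokens)):
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

docs = [["food", "was", "great", "great"], ["service", "was", "slow"]]
token2id = build_vocab(docs)
print(to_bow(docs[0], token2id))  # [(0, 1), (1, 2), (2, 1)]
print(to_bow(docs[1], token2id))  # [(2, 1), (3, 1), (4, 1)]
```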

In [25]:
#lda = LdaMulticore(corpus=corpus,
                   #id2word=id2word,
                   #random_state=723812,
                   #num_topics = 15,
                   #passes=10,
                   #workers=4
                  #)

lda = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True,
)

In [26]:
print(lda)

LdaModel(num_terms=3855, num_topics=10, decay=0.5, chunksize=100)
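
The instructions suggest labelling each topic with its three most important words. With the fitted model, `lda.show_topic(topic_id, topn=3)` returns `(word, probability)` pairs for one topic. The standard-library sketch below shows the same top-n selection on a plain dictionary of weights. The weights are made up for illustration and are not results from this model.

```python
import heapq

def top_words(word_weights, n=3):
    """Return the n highest-weighted words of one topic, best first."""
    best = heapq.nlargest(n, word_weights.items(), key=lambda item: item[1])
    return [word for word, _ in best]

# Made-up weights for illustration only; not output from the LDA model above.
example_topic = {"service": 0.04, "food": 0.03, "hot": 0.02, "san": 0.01}
print(top_words(example_topic))  # ['service', 'food', 'hot']
```

Joining the returned words (for example with `", ".join(...)`) gives a short label that can be used as a legend entry or axis tick in a visualization.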


In [27]:
import pyLDAvis
import pyLDAvis.gensim

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)
#vis



In [28]:
vis

In [30]:
Although many uninformative words survived the stop-word filtering step, this topic model proved interesting.
"Service" ranked among the top 30 most salient terms, as expected: a typical review mentions the service received, whether good or bad.
The most frequently occurring proper noun in the model is the "San" of San Francisco. The city is famous for its food, and it is also where Yelp
was founded. "Hot" appeared in the same topic as San Francisco; it is unclear whether this refers to the
temperature inside the restaurant or to the taste of the food (more likely the former).

Examining the 8th, 9th, and 10th topics revealed less ambiguous words (and emoticons): "Okay", "bummer", ":(:(:(", "GREAT", "weird", "price!!!".
I was surprised to see that these words occurred with such low frequency, making up only about 0.2% of tokens.

<gensim.corpora.dictionary.Dictionary at 0x7fe06e942b00>

## Stretch Goals

Complete one or more of these to push your score towards a three:
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the yelp dataset - this one is very hard.
* Use a generator object on the reviews file - this would help you with analyzing the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)