<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how to turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset/challenge). As part of the job selection process, some of my friends have been asked to create an analysis of this dataset, so I want to give you a head start.

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data down to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
*Successfully complete all of these objectives to earn a 2. There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

In [1]:
import pandas as pd

yelp = pd.read_json('./data/review_sample.json', lines=True)

In [2]:
yelp.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA


In [3]:
yelp.columns

Index(['business_id', 'cool', 'date', 'funny', 'review_id', 'stars', 'text',
       'useful', 'user_id'],
      dtype='object')

## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.

# ADD STOP WORD REMOVAL & THROW SOME CHARTS IN FOR GOOD MEASURE & REMOVE ASIAN/FOREIGN LANGUAGES

In [4]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_lg")

# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

In [5]:
STOP_WORDS = nlp.Defaults.stop_words.union(['', ' ','  ','   ', '-', '1', 'the,', ' the', 'the ',
                                            'be', 'like' ,'coffee', '', 'i', 'I', 'be'])

In [6]:
import re

In [7]:
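def tokenize(text):
    """Tokenize an iterable of documents; returns one list of tokens per document."""
    tokens = []

    # tokenizer.pipe streams each document through the spaCy Tokenizer
    for doc in tokenizer.pipe(text):

        doc_tokens = []

        for token in doc: 
            # Stop words are checked against the raw token, before cleaning and lowercasing
            if token.text not in STOP_WORDS:
                # Strip every character that is not a letter, digit, space, or caret
                doc_tokens_text = re.sub(r'[^a-zA-Z ^0-9]', '', token.text)
                doc_tokens.append(doc_tokens_text.lower())

        tokens.append(doc_tokens)
    return tokens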

In [8]:
yelp['tokens'] = tokenize(yelp['text'])

In [9]:
yelp['tokens']

0       [beware, fake, fake, fakewe, small, business, ...
1       [came, lunch, togo, service, quick, staff, fri...
2       [ive, vegas, dozens, times, stepped, foot, cir...
3       [we, went, night, closed, street, party, and, ...
4       [35, 4, stars, , not, bad, price, 1299, lunch,...
5       [tasty, fast, casual, latin, street, food, the...
6       [this, absolutely, amazing, what, incredible, ...
7       [came, pho, enjoyed, it, we, got, 900pm, busy,...
8       [absolutely, unique, experience, nail, shop, f...
9       [wow, walked, sat, bar, 10, minutes, all, bart...
10      [we, popped, dinner, yesterday, reservation, d...
11      [thw, worst, stay, ever, so, ended, paying, 70...
12      [great, friendly, customer, service, quality, ...
13      [the, food, great, it, super, busy, server, at...
14      [talk, getting, ripped, off, they, charged, 42...
15      [girls, night, tonight, kid, decided, drive, h...
16      [stopped, drinks, flying, charlotte, weeks, ba...
17      [this,
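
The token lists above still contain entries such as `the`, `we`, and the empty string. That follows from the order of operations in `tokenize`: the stop-word check runs on the raw token, so `The` or `the,` slips through and is only lowercased and stripped afterwards. Below is a minimal sketch of the reverse order (clean first, filter second), written for one document at a time. The function name `tokenize_one` and the small `SAMPLE_STOP_WORDS` set are assumptions made for illustration; it uses plain whitespace splitting rather than spaCy.

```python
import re

# Small illustrative stop list (an assumption); the notebook's STOP_WORDS
# set could be passed in instead.
SAMPLE_STOP_WORDS = {"the", "we", "it", "and", "i", "was", "a"}


def tokenize_one(text, stop_words=SAMPLE_STOP_WORDS):
    """Tokenize a single document: lowercase, strip non-alphanumerics, then drop stop words."""
    tokens = []
    for raw in text.split():
        # Normalize first so that "The" and "the," both become "the"
        cleaned = re.sub(r"[^a-z0-9]", "", raw.lower())
        # Skip tokens that are empty after cleaning, as well as stop words
        if cleaned and cleaned not in stop_words:
            tokens.append(cleaned)
    return tokens


print(tokenize_one("The food was GREAT, and we loved it!"))
```

Applied to the DataFrame, something like `yelp['text'].apply(tokenize_one)` would give one token list per review.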

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews
2. Write a fake review and query for the 10 most similar reviews, then print the text of those reviews. Do you notice any patterns?
    - Given the size of the dataset, it will probably be best to use a `NearestNeighbors` model for this. 

# 1

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [11]:
vect = CountVectorizer(stop_words='english', min_df = 0.05, max_df= 0.90)

In [12]:
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.9, max_features=None, min_df=0.05,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [13]:
vect.fit(yelp['text'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.9, max_features=None, min_df=0.05,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [14]:
dtm = vect.transform(yelp['text'])

In [15]:
dtm

<10000x116 sparse matrix of type '<class 'numpy.int64'>'
	with 112374 stored elements in Compressed Sparse Row format>

In [16]:
# print(dtm)

In [17]:
dtm.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [18]:
dtm = pd.DataFrame(dtm.todense(),columns=vect.get_feature_names())

In [19]:
dtm.shape

(10000, 116)

In [20]:
dtm.head()

Unnamed: 0,10,amazing,area,asked,away,awesome,bad,bar,best,better,...,vegas,wait,want,wanted,wasn,way,went,work,worth,years
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,2,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,1,2,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
dtm.columns

Index(['10', 'amazing', 'area', 'asked', 'away', 'awesome', 'bad', 'bar',
       'best', 'better',
       ...
       'vegas', 'wait', 'want', 'wanted', 'wasn', 'way', 'went', 'work',
       'worth', 'years'],
      dtype='object', length=116)

# 2

In [22]:
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
nn  = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [24]:
nn.kneighbors([dtm.iloc[356]])

(array([[0.        , 3.87298335, 4.12310563, 4.12310563, 4.12310563]]),
 array([[ 356, 3888, 6475, 5232, 5072]], dtype=int64))

In [25]:
yelp['text'][356][:200]

'I went to this place after seeing all the hype from different blogs posting about their drinks! \nI ordered their signature tea (recommended by the cashier) - I am glad I asked her and chose less ice/5'

In [26]:
yelp['text'][3888][:200]

"They advertise that they will  price match on the internet in their camera department. h\nHowever, don't expect to receieve any of the internet promo items like a free camera bag, or free memory card w"

In [27]:
# tfidf = TfidfVectorizer(stop_words = 'english', min_df=.025, max_df=.95, ngram_range=(1,2))
tfidf = TfidfVectorizer(stop_words = STOP_WORDS, min_df=.025, max_df=.95, ngram_range=(1,2))

In [28]:
sparse = tfidf.fit_transform(yelp['text'])


In [29]:
sparse

<10000x317 sparse matrix of type '<class 'numpy.float64'>'
	with 174191 stored elements in Compressed Sparse Row format>

In [30]:
dtm = pd.DataFrame(sparse.todense(), columns=tfidf.get_feature_names())

In [31]:
dtm.head()

Unnamed: 0,10,15,20,30,able,absolutely,actually,ago,amazing,area,...,wonderful,work,working,worst,worth,wouldn,wrong,year,years,yelp
0,0.0,0.0,0.241159,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.126582,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14977,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.302932,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.225699,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
dtm.shape

(10000, 317)

In [33]:
nn.fit(dtm)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [34]:
dist_matrix  = cosine_similarity(dtm)

In [35]:
df = pd.DataFrame(dist_matrix)

In [36]:
df.shape

(10000, 10000)

In [37]:
query = ['''horrible place''']

In [38]:
new = tfidf.transform(query)
new

<1x317 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [39]:
nn.kneighbors(new.todense())

(array([[0.        , 0.73128674, 0.87937313, 0.91067666, 0.91406045]]),
 array([[4062, 8013, 1503, 2483, 5129]], dtype=int64))

In [40]:
# The nearest neighbor is an accurate match for the query
yelp['text'][4062]

'Place is horrible it smells like cigarets smoke penetrated. Out dated and nasty'
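
The task asks for the 10 most similar reviews, while the `NearestNeighbors` model above was created with `n_neighbors=5` and the default Euclidean (Minkowski, `p=2`) distance. One possible variation, sketched below on a toy corpus, is to use cosine distance and pass the number of neighbors at query time. The `reviews` list is a made-up stand-in for `yelp['text']`, and `k` is set to 3 only because the toy corpus is tiny; on the real data it would be 10.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for yelp['text'] (an assumption, not the real data)
reviews = [
    "horrible place with rude staff",
    "great coffee and friendly staff",
    "the pizza was cold and horrible",
    "amazing nail salon, very clean",
    "horrible service, horrible food",
    "lovely patio and great drinks",
]

# Fit TF-IDF on the corpus and keep the sparse matrix as-is
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)

# Brute-force search supports the cosine metric on sparse input
nn = NearestNeighbors(metric="cosine", algorithm="brute").fit(X)

fake_review = ["horrible place"]
k = 3  # would be 10 on the full sample
distances, indices = nn.kneighbors(tfidf.transform(fake_review), n_neighbors=k)

# Print each neighbor's cosine distance next to its text
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {reviews[idx]}")
```

Cosine distance ignores document length, which may help when short fake reviews are compared against long real ones.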

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier. Use that pipeline to estimate a model to predict `stars`. Use the Pipeline to predict a star rating for your fake review from Part 2. 
2. Tune the entire pipeline with a GridSearch

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [42]:
clf = RandomForestClassifier()


In [43]:
pipe = Pipeline([
    # VECTORIZER
    ('vect', vect),
    # CLASSIFIER
    ('clf', clf)
    ])

In [44]:
parameters = {
    'vect__max_df': ( 0.75, 1.0),
    'vect__min_df': (.02, .05),
    'vect__max_features': (500,1000),
    'clf__n_estimators':(5, 10),
    'clf__max_depth':(15,20)
}

In [45]:
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(yelp['text'], yelp['stars'])

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:   28.2s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.9,
                                                        max_features=None,
                                                        min_df=0.05,
                                                        ngram_range=(1, 1),
                                         

In [46]:
grid_search.predict(['horrible place','not the worst place but also not the best place enjoyed the coffee', 'love',
                     'hated spending time here it was expensive and bad'])

array([1, 5, 5, 5], dtype=int64)

In [47]:
grid_search.best_score_

0.5417
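
A cross-validated accuracy of 0.5417 is easier to interpret next to a majority-class baseline, meaning the accuracy obtained by always predicting the most common star rating. The sketch below shows the calculation on a short made-up list; `stars` is a hypothetical stand-in for `yelp['stars']`, so the printed numbers say nothing about the real data.

```python
from collections import Counter

# Hypothetical star ratings standing in for yelp['stars']
stars = [5, 5, 4, 1, 5, 3, 4, 5, 1, 2]

# Count each rating and take the most frequent one
counts = Counter(stars)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(stars)

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

If the grid search score is only slightly above this baseline on the real ratings, the classifier is learning little beyond the class distribution.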

## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
    - Keep the `iterations` parameter at or below 5 to reduce run time
    - The `workers` parameter should match the number of physical cores on your machine.
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

In [48]:
from gensim.models import LdaMulticore
from gensim.corpora import Dictionary
from gensim import corpora
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

Learn the vocabulary of the yelp data:

In [49]:
id2word = corpora.Dictionary(yelp['tokens'])

In [50]:
id2word.token2id['touch']

473

In [51]:
id2word[100]

'long'

In [52]:
id2word[74]

'checkin'

Create a bag of words representation of the entire corpus

In [53]:
corpus = [id2word.doc2bow(text) for text in yelp['tokens']]

Your LDA model should be ready for estimation: 

In [54]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   iterations=5,
                   workers=6,
                   num_topics = 10 # You can change this parameter
                  )

In [55]:
lda.print_topics()

[(0,
  '0.021*"" + 0.019*"the" + 0.009*"good" + 0.008*"food" + 0.006*"it" + 0.006*"we" + 0.006*"great" + 0.006*"time" + 0.005*"service" + 0.005*"place"'),
 (1,
  '0.027*"" + 0.022*"the" + 0.010*"good" + 0.009*"food" + 0.008*"place" + 0.007*"its" + 0.007*"it" + 0.006*"we" + 0.006*"great" + 0.006*"service"'),
 (2,
  '0.028*"" + 0.015*"the" + 0.008*"place" + 0.008*"service" + 0.008*"it" + 0.007*"food" + 0.007*"great" + 0.006*"we" + 0.006*"good" + 0.005*"time"'),
 (3,
  '0.020*"" + 0.012*"the" + 0.008*"great" + 0.007*"it" + 0.007*"we" + 0.007*"place" + 0.007*"time" + 0.006*"good" + 0.006*"food" + 0.005*"got"'),
 (4,
  '0.028*"" + 0.015*"the" + 0.010*"food" + 0.009*"place" + 0.008*"good" + 0.007*"it" + 0.007*"great" + 0.006*"service" + 0.006*"we" + 0.006*"time"'),
 (5,
  '0.019*"" + 0.013*"the" + 0.011*"great" + 0.008*"place" + 0.008*"food" + 0.007*"good" + 0.007*"it" + 0.007*"we" + 0.006*"service" + 0.004*"they"'),
 (6,
  '0.025*"" + 0.011*"the" + 0.009*"place" + 0.008*"food" + 0.007*"grea

In [56]:
words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]

In [57]:
words

[['', 'the', 'good', 'food', 'it', 'we', 'great', 'time', 'service', 'place'],
 ['', 'the', 'good', 'food', 'place', 'its', 'it', 'we', 'great', 'service'],
 ['', 'the', 'place', 'service', 'it', 'food', 'great', 'we', 'good', 'time'],
 ['', 'the', 'great', 'it', 'we', 'place', 'time', 'good', 'food', 'got'],
 ['', 'the', 'food', 'place', 'good', 'it', 'great', 'service', 'we', 'time'],
 ['', 'the', 'great', 'place', 'food', 'good', 'it', 'we', 'service', 'they'],
 ['', 'the', 'place', 'food', 'great', 'service', 'time', 'good', 'it', 'we'],
 ['', 'the', 'good', 'food', 'it', 'place', 'time', 'its', 'service', 'we'],
 ['', 'the', 'service', 'food', 'good', 'great', 'place', 'it', 'time', 'we'],
 ['', 'the', 'good', 'food', 'place', 'time', 'we', 'great', 'it', 'service']]

Create 1-2 visualizations of the results

In [58]:
import pyLDAvis.gensim

In [59]:
pyLDAvis.enable_notebook()

In [60]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)



I had trouble filtering certain tokens out (the empty string and "the" lead every topic) and could not fix it within the time constraints.

The model seems to pick out sentiment in a hit-or-miss fashion. That is usually how Yelpers review too, so I don't believe it is accurate beyond a coin toss. This is unfortunate, but I'm sure that, with a little tuning,
I can get this puppy predicting ratings properly.

## Stretch Goals

Complete one or more of these to push your score towards a three: 
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the yelp dataset - this one is very hard. 
* Use a generator object on the reviews file - this would help you analyze the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)