<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how to turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset/challenge). As part of the job selection process, some of my friends have been asked to create an analysis of this dataset, so I want to give you a head start.

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data down to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the Sprint Challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
*Successfully complete all of these objectives to earn a 2. There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the Yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model that predicts the Yelp star rating
* <a href="#p4">Part 4</a>: Estimate and interpret a topic model of the Yelp reviews

In [32]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import spacy
import re
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer

yelp = pd.read_json('review_sample.json', lines=True)

In [3]:
yelp.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA


In [4]:
nlp = spacy.load("en_core_web_lg")

## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.

In [33]:
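def tokenize(doc):
    # nlp.tokenizer returns a spaCy Doc; convert it to a plain list of token strings
    return [token.text for token in nlp.tokenizer(doc)]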
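
For comparison, a tokenizer that meets the same two requirements (one document in, a list of tokens out) can be sketched with the standard library alone. The name `simple_tokenize` and the tiny stop-word set are illustrative assumptions, not part of the challenge or of spaCy. The sketch lowercases the text, keeps runs of letters, and drops a few common words.

```python
import re

# Hypothetical, deliberately tiny stop-word set, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "was", "is", "to", "of"}

def simple_tokenize(doc):
    """Return a list of lowercase alphabetic tokens, minus stop words."""
    words = re.findall(r"[a-z]+", doc.lower())
    return [w for w in words if w not in STOP_WORDS]

print(simple_tokenize("The service was quick, and the staff was friendly!"))
```

A real stop-word list (spaCy's or scikit-learn's) would be much longer; the point here is only the shape of the function.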

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews
2. Write a fake review and query for the 10 most similar reviews, then print the text of those reviews. Do you notice any patterns?
    - Given the size of the dataset, it will probably be best to use a `NearestNeighbors` model for this. 

In [6]:

# 1. Remove new line characters (raw strings keep the regex escapes intact)
yelp['clean_text'] = yelp['text'].apply(lambda x: re.sub(r'\s+', ' ', x))

# 2. Remove Emails
yelp['clean_text'] = yelp['clean_text'].apply(lambda x: re.sub(r'From: \S+@\S+', '', x))

# 3. Remove non-alphabetic characters (digits and punctuation are dropped too)
yelp['clean_text'] = yelp['clean_text'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))

# 4. Remove extra whitespace 
yelp['clean_text'] = yelp['clean_text'].apply(lambda x: ' '.join(x.split()))

yelp['tokens'] = yelp['clean_text'].apply(lambda x: nlp.tokenizer(x))

tfidf = TfidfVectorizer(stop_words='english', max_features=500)

data = list(yelp['clean_text'])
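
To see what the four cleaning steps do to a single review, they can be wrapped in one function and run on a short string. The helper name `clean_text` is an assumption made for this sketch; the regular expressions are the ones used above, and the first sample string is the visible start of the fifth review in `yelp.head()`.

```python
import re

def clean_text(text):
    """Apply the same four cleaning steps to a single review string."""
    text = re.sub(r"\s+", " ", text)            # 1. collapse newlines and whitespace runs
    text = re.sub(r"From: \S+@\S+", "", text)   # 2. drop email headers
    text = re.sub(r"[^a-zA-Z]", " ", text)      # 3. replace every non-letter with a space
    return " ".join(text.split())               # 4. squeeze the leftover spaces

raw = "3.5 to 4 stars\n\nNot bad for the price, $12.9"
print(clean_text(raw))                    # to stars Not bad for the price
print(clean_text("I've been to Vegas"))   # I ve been to Vegas
```

Step 3 is aggressive: numbers such as "3.5" and "4" disappear entirely, and contractions are split into fragments like "I ve". That may matter when the goal is to predict a star rating.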


In [7]:
data

['BEWARE FAKE FAKE FAKE We also own a small business in Los Alamitos CA and received what looked like a legitimate bill for with an account number and all I called the phone number listed The wait time on hold said minutes and to leave a message I could not get a live person on the phone no matter what number I selected I left a very FIRM message that I would be contacting the BBB and my attorney regarding their company trying to scam businesses This has to be illegal',
 'Came here for lunch Togo Service was quick Staff was friendly No complaints here Sweet tea is good Parking can be a pain sometimes',
 'I ve been to Vegas dozens of times and had never stepped foot into Circus Circus For one reason mostly the resort is marketed to familes that travel to Vegas so what business does a twenty three year old have at Circus Circus Well I needed a room for one night and I got on hotels com and the rate was only so I pulled the trigger Upon arriving the interior was nicer than I expected and 

In [8]:
dtm = tfidf.fit_transform(data)
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())
dtm.sample(7)

Unnamed: 0,able,absolutely,actually,add,ago,amazing,appointment,area,arrived,ask,...,working,worst,worth,wouldn,wrong,year,years,yelp,yes,yummy
1427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6182,0.0,0.0,0.0,0.0,0.0,0.0,0.292583,0.0,0.0,0.0,...,0.268504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7392,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113575,...,0.0,0.128473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3743,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
831,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6906,0.20337,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.216935,0.0,0.0,0.0,0.0,0.0,0.0
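
The weights in the table can be reproduced by hand on a toy corpus. This is a minimal sketch that assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization); the three tiny documents are made up, and tokenization and stop-word removal are not reproduced.

```python
import math
from collections import Counter

# Made-up, pre-tokenized toy corpus.
docs = [
    ["food", "great", "service"],
    ["food", "slow", "service", "service"],
    ["great", "massage"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents each term appears.
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return one L2-normalized TF-IDF row, ordered like `vocab`."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

rows = [tfidf_row(d) for d in docs]
for term, weight in zip(vocab, rows[2]):
    print(f"{term:8s} {weight:.3f}")
```

In the third document, "massage" gets a higher weight than "great" because it appears in fewer documents. Terms that are absent stay at 0.0, which is why most cells in the sampled rows above are zero.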


In [9]:
nn = NearestNeighbors(n_neighbors= 10, algorithm='kd_tree')
nn.fit(dtm)
text = [''' I had a wonderfull meal and the service was excellent. 
            I would highly recommend this that everyone should try 
            this place to eat at least once. I am not the owner, 
            and they did not pay me to write this, I swear. (ask for Tim)
            ''']
model_food = tfidf.transform(text)

nn.kneighbors(model_food.toarray())  # toarray() returns a plain ndarray rather than np.matrix

(array([[1.        , 1.        , 1.        , 1.        , 1.00124083,
         1.07531694, 1.08642147, 1.09137052, 1.10718802, 1.10892212]]),
 array([[6311,  469, 6204, 9889, 9465,  436, 6468, 1508, 9626, 4117]],
       dtype=int64))

In [10]:
# Print the first 200 characters of each of the 10 nearest reviews
for idx in [6311, 469, 6204, 9889, 9465, 436, 6468, 1508, 9626, 4117]:
    print(data[idx][:200])
    print("\n")




O o thenk nnn b cgv xx TV cvg nvehxcfvvv c nb b c y nb and the vghvhridd h d c v vv ruddy





Bon massage Spa propre organis manque juste le stationnement prix abordable en promotion personnels respectueux


This place has excellent service and they work as a team Food was excellent Highly recommend


Excellent service Steve was very accommodating and worked in a tight time frame Would highly recommend


Went to Social House on a recommendation with my girlfriend and she is a vegetarian Service and good was excellent we highly recommend food was fresh and tasty A must try


Best pedicure I ve ever had I highly recommend She did not rush through massage or sugar scrub hot stones and towel were great too Foot massage on nd level pedicure was practically reflexology I highl


This place is so unique I have been here many times and love the personalized service and attention I get from the owner She has a great selection of cute consignment clothes and items I highly recomm


Food was 

I noticed that the first result is complete gibberish, the second is blank, and the others seem as fake as mine... Nobody rushes to Yelp to leave such glowing reviews. But besides that, yes, they are all positive.
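
One detail in the `kneighbors` output may explain the odd first results: four of the ten distances are exactly 1.0. Assuming `TfidfVectorizer`'s default L2 normalization, the query vector has unit length, so a review containing none of the 500 vocabulary terms (an all-zero row) sits at Euclidean distance 1 from it. That is closer than a genuine review that shares only part of its vocabulary with the query. The toy vectors below are made up to illustrate the effect.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query      = [0.6, 0.8, 0.0]   # unit-length TF-IDF vector for the fake review
empty_doc  = [0.0, 0.0, 0.0]   # review with no vocabulary terms at all
weak_match = [0.6, 0.0, 0.8]   # unit-length review sharing one term with the query
no_overlap = [0.0, 0.0, 1.0]   # unit-length review sharing no terms

for name, vec in [("empty_doc", empty_doc),
                  ("weak_match", weak_match),
                  ("no_overlap", no_overlap)]:
    print(f"{name:10s} {euclidean(query, vec):.3f}")
```

The empty document ranks ahead of the weak match, which mirrors the gibberish and non-English reviews ranking ahead of the real matches here. Dropping all-zero rows before fitting, or using cosine distance instead, are two options worth trying.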

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier. Use that pipeline to estimate a model to predict `stars`. Use the pipeline to predict a star rating for your fake review from Part 2. 
2. Tune the entire pipeline with a GridSearch

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import TruncatedSVD
from scipy.stats import uniform

In [12]:
X_train, X_test, y_train, y_test = train_test_split(yelp['clean_text'], 
                                                    yelp['stars'], 
                                                    test_size=0.2, 
                                                    stratify=yelp['stars'],
                                                    random_state=17)

In [13]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(8000,) (8000,) (2000,) (2000,)


In [14]:
X_train.head(2)

7243    I give five stars because the food at this loc...
3954    This place was open at am That helps since a l...
Name: clean_text, dtype: object

In [15]:
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2)) #vectorizer
clf = LinearSVC() #classifier

pipe = Pipeline([('vect', vect), ('clf', clf)])


In [16]:
parameters = {
    'vect__max_df': (30/100, 50/100),
    'vect__min_df': (2, 5, 10),
    'vect__max_features': (5000, 20000),
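    # Caution: on scikit-learn versions where LinearSVC defaults to dual=True
    # (with loss='squared_hinge'), penalty='l1' is rejected, so those
    # candidates fail to fit and are scored as NaN.
    'clf__penalty': ('l1','l2'),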
    'clf__C': (0.1, 0.5, 1., 2.)
}
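
Before launching a grid search it helps to know how many fits it will trigger. `GridSearchCV` tries every combination in the grid, so the count is the product of the option counts times the number of folds. The sketch below copies the grid from the cell above and counts combinations with `itertools.product`; it does not call scikit-learn.

```python
from itertools import product

# Same grid as above, copied here so the snippet runs on its own.
parameters = {
    'vect__max_df': (0.3, 0.5),
    'vect__min_df': (2, 5, 10),
    'vect__max_features': (5000, 20000),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.1, 0.5, 1., 2.),
}

candidates = list(product(*parameters.values()))
n_folds = 5  # matches cv=5

print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# 96 candidates, 480 fits
```

Adding one more three-valued parameter would triple the run time, which is the usual argument for a randomized search on larger grids.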


In [17]:
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=12, verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:   14.7s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:  1.2min
[Parallel(n_jobs=12)]: Done 426 tasks      | elapsed:  2.9min
[Parallel(n_jobs=12)]: Done 480 out of 480 | elapsed:  3.3min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 2),
                                                        no

In [18]:
grid_search.best_score_

0.61525

In [19]:
params = grid_search.best_params_
params

{'clf__C': 0.5,
 'clf__penalty': 'l2',
 'vect__max_df': 0.5,
 'vect__max_features': 20000,
 'vect__min_df': 2}

In [20]:
print(f'{grid_search.predict(text)[0]} star rating predicted for text : {text[0]}')

5 star rating predicted for text :  I had a wonderfull meal and the service was excellent. 
            I would highly recommend this that everyone should try 
            this place to eat at least once. I am not the owner, 
            and they did not pay me to write this, I swear. (ask for Tim)
            


In [47]:
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,random_state=0)
distributions = dict(C=uniform(loc=0, scale=4),penalty=['l2', 'l1'])
RSCV = RandomizedSearchCV(logistic, distributions, random_state=17)

In [50]:
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
svd = TruncatedSVD(algorithm='randomized', n_iter=10) 

pipe = Pipeline([
    ('vect', vect),      # TF-IDF Vectorizer
    ('svd', svd),        # Truncated SVD Dimensionality Reduction
    ('clf', clf)         # LinearSVC classifier
])

In [57]:
r_params = {
    'vect__max_df': (0.3, .5),
    'vect__min_df': (2, 5, 10),
    'vect__max_features': (5000, 20000),
}
rando_search = RandomizedSearchCV(pipe, r_params, cv=5, n_jobs=-1, verbose=1, random_state = 17)

rando_search.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   15.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   23.6s finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('vect',
                                              TfidfVectorizer(analyzer='word',
                                                              binary=False,
                                                              decode_error='strict',
                                                              dtype=<class 'numpy.float64'>,
                                                              encoding='utf-8',
                                                              input='content',
                                                              lowercase=True,
                                                              max_df=1.0,
                                                              max_features=None,
                                                              min_df=1,
                                                        

In [58]:
rando_search.best_score_

0.46699999999999997

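## Worse than the previous model's accuracy (0.467 vs. 0.615)

A likely cause: `TruncatedSVD` was left at its default of `n_components=2`, so the classifier only sees two features per review.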

## Part 4: Topic Modeling

Let's find out what those Yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
    - Keep the `iterations` parameter at or below 5 to reduce run time
    - The `workers` parameter should match the number of physical cores on your machine.
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook for how to extract them. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

In [21]:
from gensim.models import LdaMulticore
from gensim.corpora import Dictionary
from tqdm import tqdm
tqdm.pandas()

import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [22]:
def get_lemmas(x):
    # Keep the lemma of every token that is neither a stop word nor punctuation
    return [token.lemma_ for token in nlp(x)
            if not token.is_stop and not token.is_punct]

yelp['lemmas'] = yelp['clean_text'].progress_apply(get_lemmas)

100%|███████████████████████████████████████████████| 10000/10000 [07:33<00:00, 22.05it/s]


Learn the vocabulary of the Yelp data:

In [35]:
id2word = Dictionary(yelp['lemmas'])
print(f' Before filtering : {len(id2word.keys())} words in the custom dictionary')
id2word.filter_extremes(no_below=3, no_above=0.75)
print(f' After filtering : {len(id2word.keys())} words in the custom dictionary')

 Before filtering : 26075 words in the custom dictionary
 After filtering : 9300 words in the custom dictionary
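
The drop from 26,075 to 9,300 words comes from the two thresholds passed to `filter_extremes`. The following is a plain-Python sketch of that filtering idea, not gensim's implementation. It assumes the documented meaning of the parameters: keep tokens that appear in at least `no_below` documents and in no more than a fraction `no_above` of all documents. gensim's real method also has a `keep_n` cap, which is omitted here, and the toy documents are made up.

```python
from collections import Counter

def filter_extremes(docs, no_below=3, no_above=0.75):
    """Return the set of tokens whose document frequency is within bounds."""
    n_docs = len(docs)
    # Count each token once per document.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * n_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["food", "good", "pizza"],
    ["food", "good", "service"],
    ["food", "good", "slow"],
    ["food", "massage", "good"],
    ["food", "pizza", "rare"],
]

kept = filter_extremes(docs, no_below=2, no_above=0.8)
print(sorted(kept))  # ['good', 'pizza']
```

"food" is dropped for being in every document, and the one-off words are dropped for being too rare, which is the same trade-off made on the Yelp vocabulary.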


Create a bag of words representation of the entire corpus

In [36]:
corpus = [id2word.doc2bow(txt) for txt in yelp['lemmas']]
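
Each entry of `corpus` is a sparse list of `(token_id, count)` pairs. The sketch below shows the idea in plain Python. The helpers `build_vocab` and `to_bow` are hypothetical names, and the ids they assign will not match the ones gensim assigns; tokens missing from the vocabulary are skipped, as they are after filtering the dictionary.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each token in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["pizza", "good", "pizza"], ["service", "slow"]]
token2id = build_vocab(docs)

print(to_bow(["pizza", "pizza", "good", "unknown"], token2id))
# [(0, 2), (1, 1)]
```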

Your LDA model should be ready for estimation: 

In [37]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   iterations=5,
                   workers=12,
                   num_topics = 10 # You can change this parameter
                  )

Create 1-2 visualizations of the results

In [39]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word, mds='mmds')
pyLDAvis.display(vis)

###   As shown in the visualization above, I have separated the contents of 10,000 Yelp reviews into just 10 topics based on the words used within those reviews. This visualization is interactive: choosing a particular topic shows the thirty most relevant terms for that topic. The red bar represents a term's estimated frequency within the selected topic, and the blue bar represents its overall frequency across the whole corpus. Hovering over any given term resizes the circles to show which topics that term was found in, and how often.

###   Sliding the 'relevance metric' bar toward the left surfaces the words that are most distinctive to each topic, which in turn gives us some insight into what that topic is about. Choosing topic 4 and sliding the bar to lambda = 0 reveals words such as 'yuk', 'spoil', and 'diarrhea'. This topic appears to be composed mostly of bad restaurant reviews.
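
The effect of the slider can be sketched numerically. As I understand it, pyLDAvis ranks terms with the relevance score from the LDAvis paper, `lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))`. The probabilities below are made up for illustration and are not taken from the fitted model.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term to a topic for a given lambda in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up values: (p(word | topic), p(word) across the corpus)
terms = {
    "food":  (0.050, 0.040),    # frequent in the topic, but frequent everywhere
    "spoil": (0.004, 0.0005),   # rare overall, concentrated in this topic
}

for lam in (1.0, 0.0):
    ranked = sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)
    print(f"lambda={lam}: {ranked}")
```

At lambda = 1 the ranking follows raw frequency within the topic, so "food" wins. At lambda = 0 it follows how much more likely a word is inside the topic than overall, so a rare but concentrated word like "spoil" rises to the top.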

In [38]:
lda2 = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   iterations=5,
                   workers=12,
                   num_topics = 30 # You can change this parameter
                  )

In [40]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda2, corpus, id2word, mds='mmds')
pyLDAvis.display(vis)

## This visualization displays the same data, but separated into 30 different topics. Now the reviews are separated a bit more cleanly into reviews about travel, about restaurants, and about activities.

## Stretch Goals

Complete one or more of these to push your score towards a three: 
* Create more visualizations of the LDA results and provide written analysis
* Incorporate RandomizedSearchCV into the document classification pipeline
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the Yelp dataset - this one is very hard. 
* Use a generator object on the reviews file - this would help you with analyzing the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)

In [59]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_lg")
tokenizer = Tokenizer(nlp.vocab)

sample_text = "Natural Language Processing is really fun!"
[token.text for token in tokenizer(sample_text)]

['Natural', 'Language', 'Processing', 'is', 'really', 'fun!']
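
Note that `'fun!'` comes back as a single token. A `Tokenizer` built from only `nlp.vocab` carries no prefix, suffix, or infix rules, so it effectively splits on whitespace. The plain-Python sketch below mimics that behavior and, for contrast, uses a hypothetical regex that splits punctuation into its own tokens.

```python
import re

sample_text = "Natural Language Processing is really fun!"

# Whitespace-only splitting, like the bare Tokenizer above.
whitespace_tokens = sample_text.split()

# Illustrative alternative: words and single punctuation marks as separate tokens.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", sample_text)

print(whitespace_tokens)
print(punct_aware_tokens)
```

The pipeline's own `nlp.tokenizer` already includes punctuation rules, so it is usually the simpler choice when punctuation should be split off.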