
## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for correct ones. Follow the instructions for each Task carefully.

### Instructions

- **Download** this notebook as you would any other ipynb file
- **Upload** to Google Colab or work locally (if you have that set up)
- **Delete** `raise NotImplementedError()`
- **Write** your code in the `# YOUR CODE HERE` space
- **Execute** the Test cells that contain assert statements. These help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
- **Save** your notebook when you are finished
- **Download** as an ipynb file (if working in Colab)
- **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)



# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how to turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset). As part of the job selection process, some of my friends have been asked to create an analysis of this dataset, so I want to give you a head start.

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data down to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
Successfully complete all these objectives to earn full credit. 

**Successful completion is defined as passing all the unit tests in each objective.**  

Each unit test that you pass is 1 point. 

There are 5 total possible points in this sprint challenge. 


There are more details on each objective further down in the notebook.
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

____

# Before you submit your notebook you must first

1) Restart your notebook's kernel

2) Run all cells sequentially, from top to bottom, so that cell numbers are sequential numbers (i.e. 1,2,3,4,5...)
- Easiest way to do this is to click on the **Cell** tab at the top of your notebook and select **Run All** from the drop down menu. 

3) Comment out the cell that generates a pyLDAvis visual in objective 4 (see instructions in that section). 
____



### Import Data

In [46]:
import pandas as pd
import json

# Load reviews from URL
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_4/unit1_nlp/review_sample.json'

# Import data into a DataFrame named df
df = pd.read_json(data_url, lines=True)

In [47]:
print(df.shape)
df.head()

(10000, 9)


Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA


In [3]:
# Visible Testing
assert isinstance(df, pd.DataFrame), 'df is not a DataFrame. Did you import the data into df?'
assert df.shape[0] == 10000, 'DataFrame df has the wrong number of rows.'

## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.
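The contract above (one document in, a list of tokens out) can be sketched without any NLP library. This is only a minimal illustration, not the expected solution: the name `tokenize_simple` and the tiny `SIMPLE_STOP_WORDS` set are hypothetical, and a real tokenizer would likely use spaCy or a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
SIMPLE_STOP_WORDS = {"i", "we", "the", "a", "an", "and", "was", "is", "to", "of"}

def tokenize_simple(doc):
    """Lowercase the document, pull out word-like chunks, drop stop words."""
    # Lowercasing first means "We" and "we" are treated the same way.
    words = re.findall(r"[a-z0-9']+", doc.lower())
    return [word for word in words if word not in SIMPLE_STOP_WORDS]

example_tokens = tokenize_simple("We loved the pizza, and the service was quick!")
print(example_tokens)
```

Lowercasing before the stop-word check is a design choice here: it keeps capitalized stop words at the start of a sentence from slipping through.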

In [48]:
# Optional: Consider using spaCy in your function. The spaCy library can be imported by running this cell.
# A pre-trained model (en_core_web_sm) has been made available to you in the CodeGrade container.
# If you DON'T need to use the en_core_web_sm model, you can comment it out below.

import spacy
nlp = spacy.load('en_core_web_sm')

In [71]:
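# Create a function to tokenize documents
# Add whitespace tokens and 'I' to spaCy's default stop words

STOP_WORDS = nlp.Defaults.stop_words.union([' ', '\n', '\n\n', 'I'])

def tokenize(doc):
  """Return a list of lowercased tokens, skipping stop words and punctuation."""
  tokens = []
  doc = nlp(doc)

  for token in doc:
    # Note: the stop-word check runs on the original casing, so capitalized
    # stop words such as "We" or "And" are kept (and then lowercased).
    if token.text not in STOP_WORDS and not token.is_punct:
      tokens.append(token.text.lower())

  return tokens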

In [74]:
'''Testing'''
assert isinstance(tokenize(df.sample(n=1)["text"].iloc[0]), list), "Make sure your tokenizer function accepts a single document and returns a list of tokens!"

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews (i.e. create a doc-term matrix).
2. Write a fake review and query for the 10 most similar reviews, then print the text of those reviews. Do you notice any patterns?
    - Given the size of the dataset, use `NearestNeighbors` model for this. 
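The two steps above can be sketched on a handful of made-up reviews. This is a hedged illustration under stated assumptions, not the graded solution: the `toy_*` names and review texts are invented, and the cosine metric is a choice made for this sketch (the task itself does not specify a metric).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up reviews, for illustration only.
toy_reviews = [
    "The gelato here is creamy and delicious",
    "Terrible car rental, the staff never answered the phone",
    "Great pizza and friendly service",
    "Best gelato flavors in town",
]

# Step 1: build a sparse document-term matrix.
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_dtm = toy_tfidf.fit_transform(toy_reviews)

# Step 2: fit a nearest-neighbors index directly on the sparse matrix.
toy_nn = NearestNeighbors(n_neighbors=2, metric="cosine")
toy_nn.fit(toy_dtm)

# Vectorize the query with the SAME fitted vectorizer, then look up neighbors.
query = toy_tfidf.transform(["I love gelato"])
distances, indices = toy_nn.kneighbors(query)

for idx in indices[0]:
    print("Review:", toy_reviews[idx])
```

Keeping the matrix sparse avoids materializing a dense array with one column per vocabulary term, which matters more as the vocabulary grows.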

In [83]:
# Create a vector representation of the reviews 
# Name that doc-term matrix "dtm"

from sklearn.feature_extraction.text import TfidfVectorizer

#instantiate the vectorizer
tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1,2),
                        min_df=3,
                        max_df=0.25)

#build the vocabulary, transform text
dtm = tfidf.fit_transform(df['text'])

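# Note: on scikit-learn 1.0+ use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())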
print(dtm.shape)
dtm.head()

(10000, 31690)


Unnamed: 0,00,00 extra,00 pm,000,00am,00pm,01,04,05,06,07,08,10,10 00,10 10,10 12,10 15,10 20,10 30,10 30am,10 50,10 55,10 95,10 came,10 coupon,10 days,10 different,10 dinner,10 dollars,10 drinks,10 food,10 hours,10 lunch,10 min,10 mins,10 minute,10 minutes,10 months,10 oz,10 people,...,yummmm,yummy,yummy breakfast,yummy came,yummy dessert,yummy food,yummy good,yummy looking,yummy place,yummy service,yummy tummy,yummy want,yummy yummy,yup,yuppie,yuzu,zach,zack,zen,zero,zero customer,zero star,zero stars,zest,zing,zip,zombie,zombies,zone,zoo,zucchini,zucchini fries,zumba,ça,équipe,érable,était,était très,été,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [84]:
# Create and fit a NearestNeighbors model named "nn"
from sklearn.neighbors import NearestNeighbors

#fit on dtm
nn = NearestNeighbors(n_neighbors=10)
nn.fit(dtm)

#sample a doc from dtm to use as query point
doc = dtm.iloc[0].values

#query using kneighbors
nn.kneighbors([doc])

(array([[0.        , 1.        , 1.        , 1.29367092, 1.29654258,
         1.31654242, 1.3287342 , 1.33418235, 1.33872057, 1.34074972]]),
 array([[   0, 6311, 6204, 3276, 2131, 2943, 6019,   14, 4386, 8470]]))

In [87]:
print(df['text'][0])

BEWARE!!! FAKE, FAKE, FAKE....We also own a small business in Los Alamitos, CA and received what looked like a legitimate bill for $70 with an account number and all.  I called the phone number listed (866) 273-7934.  The wait time on hold said 20 minutes and to leave a message.  I could not get a live person on the phone no matter what number I selected.  I left a very FIRM message that I would be contacting the BBB and my attorney regarding their company trying to scam businesses. This has to be illegal!!!!!


In [91]:
print(df['text'][2131])

Haven't even gotten the chance to hear a quote or anything from them because they never pick up the phone. I've called several times over the past three days (always during normal business hours!!!) because I'm shopping for car insurance. The phone rings several times, and eventually goes to a message stating their business hours (and confirming that I was indeed calling when they were supposedly open). Then it tells me to leave a message and lists extensions of people there to leave messages with.

If they can't be bothered to pick up the phone to get new customers, I can only imagine how unavailable they must be to any customer who gets into an accident.


In [85]:
'''Testing.'''
assert nn.__module__ == 'sklearn.neighbors._unsupervised', ' nn is not a NearestNeighbors instance.'
assert nn.n_neighbors == 10, 'nn has the wrong value for n_neighbors'

In [110]:
# Create a fake review and find the 10 most similar reviews

fake_review = ["This place has the best gelato. I eat gelato three or four times a week, and this gelateria never disappoints. The flavors are always changing, and tastes so authentic and good!"]

doc = tfidf.transform(fake_review)

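# toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn versions reject
k = nn.kneighbors(doc.toarray())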

In [111]:

for doc in k[1][0]:
  print("Review:", df['text'][doc])

Review: 天氣很熱吃不下東西，今天我點了一個韓國冷面湯、餐後點了甜點，冰沙系列不會太甜膩，覺得店家很用心製作，包含擺盤精緻、佐料衛生，夏日想開胃，這是一個不錯的選擇，服務人員也很敬業，以後會常常來
Review: 旅行でラスベガスに来ましたがネイルがはげてるのが気になり、探したお店でした。
質問にも丁寧に答えてくれましたし、日本人の方も日本語が話せる方も居て、とても綺麗で居心地のいいお店でした。 
ネイルはちはるさんと言う綺麗な方が丁寧にしてくれとても気に入りました。
予定になかったまつ毛エクステもお願いし、日本ではまだあまりないブラウンカラーのエクステをしてもらい、とても気に入りました。
また是非マッサージなどで伺いたいと思います。
Review: I stopped in because I was craving Gelato, and was pleasantly surprised with the variety of flavors they sell! They also had sugar-free and vegan gelatos, which caters to everyone's tastes. I ordered the bubblegum gelato and it was amazing, with loads of actual bubblegum pieces!!
Review: I've eaten here several times but never disappoints. Although you can never take from dining in Italia, this gets close. Love when. they offer the guanciale. Have an order of prosciutto and buratta and you are transported. Finish with an amaro and homemade gelato and enjoy.
Review: Great little pizza and gelato shop. It reminded me of being back in Italy. If I ever return to 

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier.
    - Use that pipeline to train a model to predict the `stars` feature (i.e. the labels). 
    - Use that Pipeline to predict a star rating for your fake review from Part 2. 



2. Create a parameter dict including `one parameter for the vectorizer` and `one parameter for the model`. 
    - Include 2 possible values for each parameter
    - **Use `n_jobs` = 1** 
    - Due to limited computational resources on CodeGrade, `DO NOT INCLUDE ADDITIONAL PARAMETERS OR VALUES PLEASE.`
    
    
3. Train the entire pipeline with a GridSearch
    - Name your GridSearch object as `gs`

In [112]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import TruncatedSVD

# Name the gridsearch instance "gs"

# dimensionality reduction of the TF-IDF matrix (truncated SVD)
svd = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=10)

#vectorizer
vect = TfidfVectorizer(stop_words='english',
                       ngram_range=(1,2),
                       min_df=2,
                       max_df=0.3)

#classifier
kn = KNeighborsClassifier()

pipe = Pipeline([
                 ('vect', vect),
                 ('svd', svd),
                 ('clf', kn)
])

pipe

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.3, max_features=None,
                                 min_df=2, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('svd',
                 TruncatedSVD(algorithm='randomized', n_components=2, n_iter=10,
                              random_state=None, tol=0.0)),
  

In [115]:
#Tune hyperparameters for the pipeline

parameters = {
    'vect__max_features': (10000, 20000),
    'clf__algorithm': ('kd_tree', 'brute')
}

gs = GridSearchCV(pipe, parameters, cv=5, n_jobs=1, verbose=1)
gs.fit(df['text'], df['stars'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  1.1min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.3,
                                                        max_features=None,
                                                        min_df=2,
                                                        ngram_range=(1, 2),
                                                        no

In [116]:
gs.best_score_

0.36970000000000003

In [117]:
gs.best_params_

{'clf__algorithm': 'kd_tree', 'vect__max_features': 20000}

In [119]:
#Prediction

fake_review = ["This place has the best gelato. I eat gelato three or four times a week, and this gelateria never disappoints. The flavors are always changing, and tastes so authentic and good!"]

pred = gs.predict(fake_review)
pred

array([5])

In [121]:
fake_review = ["I dread coming to this store. The customer service is middling at best. It just feels cold and unwelcoming. And there's too much junk food."]

pred = gs.predict(fake_review)
pred

array([1])

In [122]:
# Visible Testing
prediction = gs.predict(["I wish dogs knew how to speak English."])[0]
assert prediction in df.stars.values, 'Your gs object should be able to accept raw text within a list. Did you include a vectorizer in your pipeline?'

## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
    - Set num_topics to `5`
    - Name your LDA model `lda`
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

When you instantiate your LDA model, it should look like this: 

```python
lda = LdaModel(corpus=corpus,
               id2word=id2word,
               random_state=723812,
               num_topics = num_topics,
               passes=1
              )

```

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

## Note about  pyLDAvis

**pyLDAvis** is the Topic modeling package that we used in class to visualize the topics that LDA generates for us.

You are welcome to use pyLDAvis for your visualization. However, **you MUST comment out the code that imports the package and the cell that generates the visualization before you submit your notebook to CodeGrade.** 

You should still leave the printout of the visualization for graders to see (i.e. comment out the cell after you run it to create the viz). 

In [123]:
from gensim import corpora
# Due to limited computational resources on CodeGrade, use the non-multicore version of LDA
from gensim.models.ldamodel import LdaModel
import gensim
import re

### 1. Estimate an LDA topic model of the review text

In [126]:
# Remember to read the LDA docs for more information on the various class attributes and methods available to you
# in the LDA model: https://radimrehurek.com/gensim/models/ldamodel.html

# use tokenize function you created earlier to create tokens 
df['tokens'] = df['text'].apply(tokenize)
df.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,tokens
0,nDuEqIyRc8YKS1q1fX0CZg,1,2015-03-31 16:50:30,0,eZs2tpEJtXPwawvHnHZIgQ,1,"BEWARE!!! FAKE, FAKE, FAKE....We also own a sm...",10,n1LM36qNg4rqGXIcvVXv8w,"[beware, fake, fake, fake, we, small, business..."
1,eMYeEapscbKNqUDCx705hg,0,2015-12-16 05:31:03,0,DoQDWJsNbU0KL1O29l_Xug,4,Came here for lunch Togo. Service was quick. S...,0,5CgjjDAic2-FAvCtiHpytA,"[came, lunch, togo, service, quick, staff, fri..."
2,6Q7-wkCPc1KF75jZLOTcMw,1,2010-06-20 19:14:48,1,DDOdGU7zh56yQHmUnL1idQ,3,I've been to Vegas dozens of times and had nev...,2,BdV-cf3LScmb8kZ7iiBcMA,"[vegas, dozens, times, stepped, foot, circus, ..."
3,k3zrItO4l9hwfLRwHBDc9w,3,2010-07-13 00:33:45,4,LfTMUWnfGFMOfOIyJcwLVA,1,We went here on a night where they closed off ...,5,cZZnBqh4gAEy4CdNvJailQ,"[we, went, night, closed, street, party, and, ..."
4,6hpfRwGlOzbNv7k5eP9rsQ,1,2018-06-30 02:30:01,0,zJSUdI7bJ8PNJAg4lnl_Gg,4,"3.5 to 4 stars\n\nNot bad for the price, $12.9...",5,n9QO4ClYAS7h9fpQwa5bhA,"[3.5, 4, stars, not, bad, price, $, 12.99, lun..."


In [127]:
# create an id2word object (hint: use corpora.Dictionary)
id2word = corpora.Dictionary(df['tokens'])

# create a corpus object (hint: id2word.doc2bow)
corpus = [id2word.doc2bow(text) for text in df['tokens']]

# don't change this value 
num_topics = 5

# instantiate an lda model
lda = LdaModel(corpus=corpus,
               id2word=id2word,
               random_state=723812,
               num_topics = num_topics,
               passes=1
              )

#### Testing

In [128]:
# Visible Testing
assert lda.get_topics().shape[0] == 5, 'Did your model complete its training? Did you set num_topics to 5?'

#### 2. Create 1-2 visualizations of the results

In [133]:
import seaborn as sns
import matplotlib.pyplot as plt

# !pip install pyLDAvis

# Use pyLDAvis (or a plotting tool of your choice) to visualize your results 


In [136]:
# import pyLDAvis
# import pyLDAvis.gensim

# pyLDAvis.enable_notebook()
# vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)
# vis

#### 3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

Here are the top distinguishing words in each topic, each with a tentative label:

topic 1: time, service, customer, car, day, experience

- Possibly car-related businesses, such as car rentals or dealerships.

topic 2: service, chicken, friendly, delicious, fresh

- Possibly restaurants with healthier food.

topic 3: cheese, best, try, meat, sauce, salad, fries

- Possibly restaurants with less healthy food.

topic 4: room, time, nice, people, love, hotel, clean

- Hotels.

topic 5: time, restaurant, pizza, pretty, night

- Possibly trendy restaurants.

The model would benefit from more text cleaning to remove the most common words, and perhaps from lemmatization too. Many words currently overlap across topics (for example, "time" and "service"), which makes it difficult to identify what each topic is about.


## Stretch Goals

Complete one or more of these to push your score towards a three: 
* Incorporate named entity recognition into your analysis
* Compare vectorization methods in the classification section
* Analyze more (or all) of the yelp dataset - this one is very hard. 
* Use a generator object on the reviews file - this would help you analyze the whole dataset.
* Incorporate any of the other yelp dataset entities in your analysis (business, users, etc.)
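For the generator stretch goal, a minimal sketch might read a JSON-lines file one review at a time instead of loading everything into memory. The function name `stream_reviews` is hypothetical, and an in-memory `io.StringIO` stands in for the real reviews file so the example runs on its own.

```python
import io
import json

def stream_reviews(lines):
    """Yield one review dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file handle on a JSON-lines reviews file.
sample_file = io.StringIO(
    '{"stars": 5, "text": "Great gelato"}\n'
    '\n'
    '{"stars": 1, "text": "Never again"}\n'
)

# Only one review is held in memory at a time while counting ratings.
star_counts = {}
for review in stream_reviews(sample_file):
    star_counts[review["stars"]] = star_counts.get(review["stars"], 0) + 1

print(star_counts)
```

With a real file, the same function could be called as `stream_reviews(open(path))`, where `path` is wherever the full dataset was downloaded.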