# Yelp Reviews Project
## *Data Science - Machine Learning*

In this project I use Natural Language Processing to work with Yelp reviews. I am going to figure out how to process text, how to turn text into vectors, and how to model topics from documents, and then apply those skills to a real NLP dataset: [Yelp](https://www.yelp.com/dataset).  

The full dataset is massive (almost 8 GB uncompressed), so I work with a sample that is more manageable. As I work on the project, I also add comments and conclusions about my findings and describe anything I want to analyze in the future.

____


## Project Objectives 

* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use the tokens in a classification model on Yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

____

### Part 0: Import Necessary Packages
For this section, I will need to import:
- `spacy`
- `pandas`
- `seaborn`
- `matplotlib`
- `NearestNeighbors`
- `Pipeline`
- `TfidfVectorizer`
- `KNeighborsClassifier`
- `GridSearchCV`
- `corpora`
- `LdaModel`
- `gensim`
- `re`


In [None]:
import spacy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
import gensim
import gensim.corpora as corpora
import re

In [None]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)



### Part 0: Import Data

In [None]:
# Load reviews from URL
data_url = 'https://raw.githubusercontent.com/bloominstituteoftechnology/data-science-practice-datasets/main/unit_4/unit1_nlp/review_sample.json'

# Import data into a DataFrame named df
df = pd.read_json(data_url,lines=True)
df.head()

| | business_id | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | nDuEqIyRc8YKS1q1fX0CZg | 1 | 2015-03-31 16:50:30 | 0 | eZs2tpEJtXPwawvHnHZIgQ | 1 | BEWARE!!! FAKE, FAKE, FAKE....We also own a sm... | 10 | n1LM36qNg4rqGXIcvVXv8w |
| 1 | eMYeEapscbKNqUDCx705hg | 0 | 2015-12-16 05:31:03 | 0 | DoQDWJsNbU0KL1O29l_Xug | 4 | Came here for lunch Togo. Service was quick. S... | 0 | 5CgjjDAic2-FAvCtiHpytA |
| 2 | 6Q7-wkCPc1KF75jZLOTcMw | 1 | 2010-06-20 19:14:48 | 1 | DDOdGU7zh56yQHmUnL1idQ | 3 | I've been to Vegas dozens of times and had nev... | 2 | BdV-cf3LScmb8kZ7iiBcMA |
| 3 | k3zrItO4l9hwfLRwHBDc9w | 3 | 2010-07-13 00:33:45 | 4 | LfTMUWnfGFMOfOIyJcwLVA | 1 | We went here on a night where they closed off ... | 5 | cZZnBqh4gAEy4CdNvJailQ |
| 4 | 6hpfRwGlOzbNv7k5eP9rsQ | 1 | 2018-06-30 02:30:01 | 0 | zJSUdI7bJ8PNJAg4lnl_Gg | 4 | 3.5 to 4 stars\n\nNot bad for the price, $12.9... | 5 | n9QO4ClYAS7h9fpQwa5bhA |


## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. My function will:
- Accept one document at a time
- Return a list of tokens


In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
def tokenize(doc):
    doc = doc.lower()

    # Replace punctuation and any other non-alphanumeric character with a space
    doc = re.sub(r'[^a-zA-Z0-9]', ' ', doc)

    # Split into Tokens
    tokens = doc.split()

    return tokens
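
As a quick check of the behavior described above, the sketch below applies the same regex-based approach to a single sentence. It is self-contained, so the function is redefined under the hypothetical name `tokenize_demo`, and the sample sentence is made up rather than taken from the dataset.

```python
import re

def tokenize_demo(doc):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    doc = doc.lower()
    doc = re.sub(r'[^a-z0-9]', ' ', doc)
    return doc.split()

# Made-up sample sentence (not a row from the Yelp data)
sample = "I've been here twice. Great tacos, slow service!!!"
tokens_demo = tokenize_demo(sample)
print(tokens_demo)
```

Note that contractions are split at the apostrophe (`I've` becomes `i` and `ve`), which is a side effect of replacing every non-alphanumeric character with a space.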

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews (i.e. create a doc-term matrix).
    * Name that doc-term matrix `dtm`
2. Write a fake review. Assign the text of the review to an object called `fake_review`. 
3. Query the fake review for the 10 most similar reviews, print the text of the reviews. 
    - Given the size of the dataset, use `NearestNeighbors` model for this. Name the model `nn`.
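
The three steps above can be sketched on a toy corpus as follows. This is a minimal illustration, not the notebook's actual run: the four reviews are made up, the `_demo` names are hypothetical, and the cosine metric is my assumption (the default `NearestNeighbors` metric is Minkowski/Euclidean). The key point is that the query has to be transformed with the same fitted vectorizer, so that it has the same columns as the doc-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for df['text'] (made-up reviews)
docs = [
    "The pizza was great and the service was fast",
    "Terrible service and cold pizza",
    "Great haircut, friendly staff",
    "The staff was rude and the wait was long",
]

# Fit the vectorizer once and build the doc-term matrix
vect_demo = CountVectorizer(stop_words='english')
dtm_demo = vect_demo.fit_transform(docs)

# Fit a nearest-neighbors model on the doc-term matrix
nn_demo = NearestNeighbors(n_neighbors=2, metric='cosine')
nn_demo.fit(dtm_demo)

# Vectorize the query with the SAME fitted vectorizer, then look up its neighbors
query = "great pizza"
query_vec = vect_demo.transform([query])
distances, indices = nn_demo.kneighbors(query_vec)

for dist, idx in zip(distances[0], indices[0]):
    print(round(float(dist), 3), docs[idx])
```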

In [None]:
%%time
from sklearn.feature_extraction.text import CountVectorizer

# Build the doc-term matrix: one row per review, one column per vocabulary term
vect = CountVectorizer(stop_words='english', max_features=10)
vect.fit(df['text'])
dtm = vect.transform(df['text'])

In [None]:
# Create and fit a NearestNeighbors model named "nn"
nn = NearestNeighbors(n_neighbors=10)

nn.fit(dtm)

In [None]:
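# Create a fake review and find the 10 most similar reviews
fake_review = 'Good job'
# Vectorize the query with the fitted vectorizer so it has the same columns as dtm
query_vector = vect.transform([fake_review])
distances, indices = nn.kneighbors(query_vector)
# Print the text of the 10 nearest reviews
for i in indices[0]:
    print(df['text'].iloc[i])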

## Part 3: Classification
<a id="#p3"></a>
My goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier.
    - Use that pipeline to train a model to predict the `stars` feature (i.e. the labels). 
    - Use that pipeline to predict a star rating for my fake review from Part 2. 



2. Create a parameter dict including one parameter for the vectorizer and one parameter for the model.
    - Include 2 possible values for each parameter
    - **Use `n_jobs` = 1**, due to limited computational resources on CodeGrader
    
    
3. Train the entire pipeline with a GridSearch
    - Name the GridSearch object as `gs`
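
A minimal sketch of what these requirements look like in code, under stated assumptions: the eight reviews and their ratings are made up, the `_demo` names are hypothetical, and the choice of `TfidfVectorizer` with `KNeighborsClassifier` is only one possible combination. Parameter names in the grid follow sklearn's `<step name>__<parameter>` convention for pipelines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up reviews and star ratings standing in for df['text'] and df['stars']
texts = [
    "great food and friendly staff", "amazing service, loved it",
    "best tacos in town", "wonderful experience, great staff",
    "terrible food and rude staff", "awful service, never again",
    "worst tacos in town", "horrible experience, rude staff",
]
stars = [5, 5, 5, 5, 1, 1, 1, 1]

pipe_demo = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', KNeighborsClassifier()),
])

# One vectorizer parameter and one model parameter, two values each
param_grid_demo = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__n_neighbors': [1, 3],
}

# cv=2 only because the toy dataset is tiny
gs_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=2, n_jobs=1)
gs_demo.fit(texts, stars)

print(gs_demo.best_params_)
prediction_demo = gs_demo.predict(["great staff and great food"])
print(prediction_demo)
```

Because `GridSearchCV` refits the best pipeline on all the data by default, the fitted search object can be used directly for prediction on raw text.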

In [None]:
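from sklearn.ensemble import RandomForestClassifier

# Features are the raw review text; the labels are the star ratings
target = 'stars'
y = df[target]
X = df['text']

cv = CountVectorizer(stop_words='english', lowercase=False) 
rfc = RandomForestClassifier(random_state=42)

pipe = Pipeline([('vect', cv),
                 ('clf', rfc)])

# Keys follow the '<step name>__<parameter>' convention of sklearn pipelines
param_grid = {
    'vect__max_df': [.75],
    'vect__min_df': [.015],
    'vect__max_features': [5, 10, 20, 50, 100]}

gs = GridSearchCV(pipe, param_grid, n_jobs=1, verbose=1)
gs.fit(X, y)

# Predict a star rating for the fake review from Part 2
gs.predict([fake_review])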

## Part 4: Topic Modeling

Let's find out what those Yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
    - Set num_topics to `5`
    - Name your LDA model `lda`
2. Create 1-2 visualizations of the results
    - I will use the 3 most important words of each topic in relevant visualizations. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model



### 1. Estimate an LDA topic model of the review text

* Use the `tokenize` function I created earlier to create tokens.
* Create an `id2word` object. 
* Create a `corpus` object.
* Instantiate an `lda` model. 


In [None]:
num_topics = 5

In [None]:
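def clean_data(text): 

    # Replace email addresses with a placeholder (the dot is escaped to match a literal '.')
    text = re.sub(r'\S+@\S+\.\S+', 'EMAIL', text)
    # Collapse runs of spaces into a single space
    text = re.sub(r'[ ]{2,}', ' ', text)

    return text.lower().strip()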
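
To see what this cleaning step does to a single string, here is a small self-contained sketch. The function is redefined under the hypothetical name `clean_data_demo`, and the input string and email address are made up.

```python
import re

def clean_data_demo(text):
    # Replace email addresses with a placeholder
    text = re.sub(r'\S+@\S+\.\S+', 'EMAIL', text)
    # Collapse runs of spaces into a single space
    text = re.sub(r'[ ]{2,}', ' ', text)
    return text.lower().strip()

raw = "  Contact  me at  jane.doe@example.com   for details "
cleaned = clean_data_demo(raw)
print(cleaned)
```

Because lowercasing happens after the substitution, the `EMAIL` placeholder ends up as `email` in the cleaned text.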

In [None]:
df["clean_data"] = df["text"].apply(clean_data)

In [None]:
df['lemmas'] = df['clean_data'].map(lambda x: [token.lemma_ for token in nlp(x) if not (token.is_stop or token.is_punct or token.is_space)])

In [None]:
id2word = corpora.Dictionary(df['lemmas'])
corpus = [id2word.doc2bow(doc_lemmas) for doc_lemmas in df['lemmas']]
# Inspect the bag of words, i.e. (token_id, count) pairs, for the first review
print(corpus[0])

In [None]:
%%time
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
               id2word=id2word,
               random_state=723812,
               num_topics = num_topics,
               passes=1
              )
lda.save('lda.model')


### 2. Create 1-2 visualizations of the results

Assign one of the visualizations to a variable called `visual_plot`.


In [None]:
#!python -m spacy download en_core_web_md

In [None]:
#!wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/main/requirements.txt
#!pip install -r requirements.txt

In [None]:
#!pip install pyLDAvis

In [None]:
#pyLDAvis.enable_notebook()
#visual_plot = pyLDAvis.gensim.prepare(lda, corpus, id2word)

#visual_plot

In [None]:
# Concatenate every review into one long string, separated by spaces
big_string = ' '.join(df['text'])

In [None]:
corpus_token = tokenize(big_string)

In [None]:
from collections import Counter
# `Counter` accepts an iterable directly, but I can also instantiate an empty one and update it.
word_counts = Counter()
word_counts.update(corpus_token)

word_counts.most_common(10)

In [None]:
# Split the 10 most common (token, count) pairs into two parallel lists for plotting
top_10 = word_counts.most_common(10)
tokens = [token for token, count in top_10]
counts = [count for token, count in top_10]
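
The `Counter` pattern used here (start empty, then update) can be illustrated with a tiny made-up token list. The `_demo` name and the tokens are hypothetical.

```python
from collections import Counter

# Start with an empty counter and feed it tokens in batches
word_counts_demo = Counter()
word_counts_demo.update(['good', 'food', 'good'])
word_counts_demo.update(['good', 'service'])

# most_common returns (token, count) pairs, highest count first
print(word_counts_demo.most_common(2))
```

Calling `update` repeatedly accumulates counts, so the same pattern also works when the tokens arrive one review at a time instead of as one large string.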

In [None]:
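fig, visual_plot = plt.subplots()

# Bar chart of the 10 most common tokens, drawn on the axes stored in visual_plot
positions = list(range(1, 11))
visual_plot.bar(x=positions, height=counts)
visual_plot.set_xticks(positions)
visual_plot.set_xticklabels(tokens, rotation=45)
visual_plot.set_ylabel('Count')
plt.show()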