<a href="https://colab.research.google.com/github/JeffreyLuo333/ML-Notebooks/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Yelp Review Sentiment Classification

In this project, we will build a classifier that can predict how a user feels (positively or negatively) about a given restaurant from their review. This is an example of **sentiment analysis**: being able to quantify an individual's opinion about a particular topic merely from the words they use.

In this notebook, we'll:


*   Explore the Yelp review dataset
*   Preprocess and vectorize our text data for NLP
*   Train a sentiment analysis classifier with logistic regression
*   Explore and improve our model
*   Train a model with word embeddings
*   Use word embeddings to calculate similarity and analogies




In [1]:
#@title Import our libraries (this may take a minute or two)
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv).
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import spacy
import wordcloud
import os # Good for navigating your computer's files
import sys
pd.options.mode.chained_assignment = None #suppress warnings

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
nltk.download('wordnet')
nltk.download('punkt')

from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
!python -m spacy download en_core_web_md
import en_core_web_md
text_to_nlp = en_core_web_md.load()



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     42.8/42.8 MB 14.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
#@title Import our data

# import gdown
#gdown.download('https://drive.google.com/uc?id=1u0tnEF2Q1a7H_gUEH-ZB3ATx02w8dF4p', 'yelp_final.csv', True)
data_file  = 'yelp_final.csv'

!wget https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%203%20-%20NLP/yelp_final.csv


--2024-10-04 18:11:10--  https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%203%20-%20NLP/yelp_final.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.207, 142.250.141.207, 74.125.137.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 760976 (743K) [text/csv]
Saving to: ‘yelp_final.csv’


2024-10-04 18:11:10 (126 MB/s) - ‘yelp_final.csv’ saved [760976/760976]



# Data Exploration

First we read in the file containing the reviews and take a look at the data available to us.

In [3]:
# read our data in using 'pd.read_csv('file')'
yelp_full = pd.read_csv(data_file)
yelp_full.head()

| | business_id | stars | text | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 5 | My wife took me here on my birthday for breakf... | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 5 | I have no idea why some people give bad review... | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | _1QQZuf4zZOyFCvXc0o6Vg | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 3 | 6ozycU1RpktNG2-1BroVtw | 5 | General Manager Scott Petello is a good egg!!!... | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |
| 4 | zp713qNhx8d9KCJJnrw1xA | 5 | Drop what you're doing and drive here. After I... | wFweIWhv2fREZV_dYkz_1g | 7 | 7 | 4 |



Let's keep only the two columns we need:



In [4]:
needed_columns = ["stars","text"]
yelp = yelp_full[needed_columns].copy()  # work on a copy so new columns are not added to a view of yelp_full
yelp.head()

| | stars | text |
|---|---|---|
| 0 | 5 | My wife took me here on my birthday for breakf... |
| 1 | 5 | I have no idea why some people give bad review... |
| 2 | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... |
| 3 | 5 | General Manager Scott Petello is a good egg!!!... |
| 4 | 5 | Drop what you're doing and drive here. After I... |


The `text` column is the one we are primarily interested in.

# Preparing Our Data for Machine Learning

We'll need to prepare our data to use logistic regression. First, let's prepare our output column:

### Preparing to Classify
We're going to try to predict the sentiment - **positive** or **negative** - based on a review's text.

In order to reduce our problem to a **binary classification** (two classes) problem, we will:

 - label 4 and 5 star reviews as 'good'
 - label 1, 2, 3 star reviews as 'bad'


In [5]:
def is_good_review(num_stars):
    if num_stars >= 4:
        return 'good'
    else:
        return 'bad'

# Add an 'is_good_review' column that labels each review 'good' or 'bad' based on its stars.
yelp['is_good_review'] = yelp['stars'].apply(is_good_review)
yelp.head()

| | stars | text | is_good_review |
|---|---|---|---|
| 0 | 5 | My wife took me here on my birthday for breakf... | good |
| 1 | 5 | I have no idea why some people give bad review... | good |
| 2 | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | good |
| 3 | 5 | General Manager Scott Petello is a good egg!!!... | good |
| 4 | 5 | Drop what you're doing and drive here. After I... | good |


## Text Preprocessing: A Preview

Now, the trickier part: preparing our text input.

We'll need a few steps to preprocess our text and represent it numerically.

We'll talk through all the steps here, then use a single function to implement them.

## Tokenization

First of all, we would like to **tokenize** each review: convert it from a single string into a list of words.
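
To get a feel for what tokenization does, here is a minimal sketch that uses a regular expression. It is a simplification for illustration only: the notebook itself relies on spaCy, which handles punctuation and contractions more carefully and also provides lemmas. The helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return its runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = simple_tokenize("The food was great. The ambience was also great.")
print(tokens)
# ['the', 'food', 'was', 'great', 'the', 'ambience', 'was', 'also', 'great']
```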

## Stopwords

Next, let's remove **stopwords**: words which are there to provide grammatical structure, but don't give us much information about a review's sentiment.

We're going to remove these stopwords from the user reviews.

Tokenization and removal of stop words are universal to nearly every NLP application. In some cases, additional cleaning may be required (for example, removal of proper nouns, removal of digits) but we can build a text preprocessing function with these "base" cleaning steps.

Putting all these together, we can come up with a text cleaning function that we can apply to all of our reviews.
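
As a rough sketch of such a cleaning function, the example below tokenizes on whitespace, strips punctuation, and drops stopwords. The stopword set `TOY_STOPWORDS` and the helper `clean_text` are hypothetical and hand-picked for this illustration; real stopword lists (such as spaCy's) are much longer.

```python
# A tiny, hand-picked stopword set for illustration only (hypothetical).
TOY_STOPWORDS = {"the", "was", "also", "but", "a", "an", "and"}

def clean_text(text):
    """Split on whitespace, strip surrounding punctuation, lowercase, and drop stopwords."""
    tokens = [word.strip(".,!?").lower() for word in text.split()]
    return [tok for tok in tokens if tok and tok not in TOY_STOPWORDS]

cleaned = clean_text("The food was great. The ambience was also great.")
print(cleaned)
# ['food', 'great', 'ambience', 'great']
```

Notice that what remains is mostly the words that carry the review's opinion.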

## Vectors

Finally, we'll need to convert our text to **vectors**, or lists of numbers. We'll start off doing this with Bag of Words.


### Bag of Words

In a **bag of words** approach, we count how many times each word was used in each review.

Suppose we want to represent two **reviews**:
- "The food was great. The ambience was also great."
- "Great ambience, but not great food!"

First we define our vocabulary: the list of *each unique word* that appears in any of the reviews. So our **vocabulary** is:
- [also, ambience, but, food, great, not, the, was].

Next, we count up how many times each word was used. (You can also think of this as adding up one-hot encodings.)

Our reviews are encoded as:
- **First review:** [1, 1, 0, 1, 2, 0, 2, 2].
- **Second review:** [0, 1, 1, 1, 2, 1, 0, 0]



## Preprocessing Our Text in Action

Let's use bag-of-words to prepare our data.

First, let's select our input *X* and output *y*:

In [6]:
X_text = yelp['text']
y = yelp['is_good_review']

Now, let's prepare our data. First, we'll use `CountVectorizer`, a useful tool from scikit-learn, together with a spaCy-based `tokenize` function, to:
*   Tokenize our reviews
*   Remove stopwords
*   Prepare our vocabulary


In [7]:
#@title Initialize the text cleaning function { display-mode: "form" }
def tokenize(text):
    """Return the lemmas of a review, skipping stopwords and punctuation."""
    clean_tokens = []
    for token in text_to_nlp(text):
        # '-PRON-' is the placeholder lemma that older spaCy versions (v2) assigned to any pronoun.
        if not token.is_stop and token.lemma_ != '-PRON-' and not token.is_punct:
            clean_tokens.append(token.lemma_)
    return clean_tokens

The cell below will take a moment.

In [8]:
bow_transformer = CountVectorizer(analyzer=tokenize, max_features=1000).fit(X_text)

Now, we can see our entire vocabulary.

In [9]:
# Sort the (word, index) pairs by index and print them.
sorted_vocab = sorted(bow_transformer.vocabulary_.items(), key=lambda item: item[1])
for word, index in sorted_vocab:
  print(f"'{word}': {index}")

'
': 0
'

': 1
'
 ': 2
' ': 3
' 
': 4
' 

': 5
'  ': 6
'$': 7
'+': 8
'1': 9
'1/2': 10
'10': 11
'100': 12
'11': 13
'12': 14
'15': 15
'2': 16
'20': 17
'25': 18
'3': 19
'30': 20
'4': 21
'45': 22
'5': 23
'50': 24
'6': 25
'7': 26
'8': 27
'9': 28
'=': 29
'ASU': 30
'AZ': 31
'Arizona': 32
'BBQ': 33
'Burger': 34
'California': 35
'Chandler': 36
'Chicago': 37
'Chicken': 38
'Chili': 39
'Day': 40
'Food': 41
'Friday': 42
'Good': 43
'Green': 44
'Grill': 45
'Happy': 46
'Hour': 47
'Ice': 48
'Mesa': 49
'Mexican': 50
'Mill': 51
'New': 52
'Old': 53
'Phoenix': 54
'Pizza': 55
'Pork': 56
'Saturday': 57
'Scottsdale': 58
'Service': 59
'St.': 60
'Starbucks': 61
'Sunday': 62
'Tempe': 63
'Thai': 64
'Town': 65
'Valley': 66
'Wednesday': 67
'White': 68
'Yelp': 69
'able': 70
'absolutely': 71
'actually': 72
'add': 73
'addition': 74
'admit': 75
'adult': 76
'afternoon': 77
'ago': 78
'agree': 79
'air': 80
'allow': 81
'amazing': 82
'ambiance': 83
'annoying': 84
'answer': 85
'anymore': 86
'anyways': 87
'app': 88
'apparentl

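The number is the **index** of a word in the vocabulary, that is, its position once the vocabulary is sorted. The ordering is by character code rather than strictly alphabetical, so whitespace, symbols, digits, and capitalized words all come before lowercase words.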
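
The ordering in the printed vocabulary appears consistent with Python's built-in `sorted`, which compares strings character by character using their character codes. Here is a small sketch with a few words picked from the output above (the list itself is just an illustration):

```python
words = ["app", "Yelp", "10", "$", "able", "Arizona"]

# Symbols and digits sort first, then uppercase letters, then lowercase letters.
ordered = sorted(words)
print(ordered)
# ['$', '10', 'Arizona', 'Yelp', 'able', 'app']
```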

By the way, how many words do we have?


In [10]:
len(bow_transformer.vocabulary_)

1000

Now that our vocabulary is ready, we can **transform** each review into a bag of words.



In [11]:
X = bow_transformer.transform(X_text)

Finally, we've converted our reviews to numerical data that we can use in a logistic regression.

We can see what `X` looks like by printing it out as a DataFrame.

In [12]:
pd.DataFrame(X.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,0,3,0,8,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,2,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,2,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,3,0,18,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,2,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
996,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,11,0,0,0,0,0,0,1,0,...,1,0,1,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Creating a Baseline Classifier

Now, back to our sentiment analysis problem. Our data is ready for machine learning.

Our classification problem is a classic two-class classification problem, and so we will use the tried-and-tested **Logistic Regression** machine learning model.

As always, we'll start by setting aside testing and training data:

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

### Training the Model
Now, we can create and train our model.

In [14]:
logistic_model = LogisticRegression()
logistic_model.fit(X_train,y_train)

### Testing Your Model
Now, let's evaluate our model's accuracy. The model needs to **predict** the sentiment, and then we'll **calculate the accuracy** using `accuracy_score`.

In [15]:
y_pred = logistic_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print (accuracy)

0.77
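
Accuracy is a single number, so it can hide *which* class the model gets wrong. A confusion matrix breaks the predictions down by true and predicted class. Below is a minimal sketch on made-up labels (not the Yelp data); in the notebook you could pass `y_test` and `y_pred` instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration only; they are not taken from the Yelp data.
y_true = ['good', 'good', 'bad', 'bad', 'good']
y_hat = ['good', 'bad', 'bad', 'good', 'good']

# Rows are the true classes and columns the predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_hat, labels=['bad', 'good'])
toy_accuracy = accuracy_score(y_true, y_hat)
print(cm)
# [[1 1]
#  [1 2]]
print(toy_accuracy)
# 0.6
```

The diagonal entries count correct predictions (3 of 5 here), which is where the accuracy of 0.6 comes from.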


# Exploring The Model

Let's explore our model in more depth.

### Using a Different Classifier

We used logistic regression for our baseline model, but there are many other classifier models we could use.

One common model is called Multinomial Naive Bayes. Naive Bayes uses Bayes' theorem to predict the class of new input data. Its key ("naive") assumption is that, within a given class, all the features are independent of each other: among good reviews, say, the number of times a review uses "potato" is unrelated to the number of times it uses "server".
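
In symbols, the usual textbook form of this idea (not taken from this notebook) is shown below, where $c$ is a class such as 'good' or 'bad' and $w_1, \dots, w_n$ are the words in a review:

$$P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

The model computes this score for each class and predicts the class with the larger one. The product is where the independence assumption enters: each word contributes its own factor, regardless of which other words appear.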

Let's build a model using a Naive Bayes classifier.

In [16]:
from sklearn.naive_bayes import MultinomialNB
nb_model = MultinomialNB()

We can train and generate predictions from this model in the same way we did for our Logistic Regression model. We can try training this model on the same data and see if it performs better or worse than our logistic regression model. Then, evaluate the model accuracy as we did for the Logistic Regression classifier.



In [17]:
nb_model.fit(X_train,y_train)
y_pred=nb_model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(accuracy)

0.745


Experiment with [other models](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) to try to get the highest accuracy.

In [18]:
from sklearn.gaussian_process import GaussianProcessClassifier
model = GaussianProcessClassifier()
model.fit(X_train.toarray(),y_train)
y_pred = model.predict(X_test.toarray())
accuracy = accuracy_score(y_test,y_pred)
print(accuracy)

0.635
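
If you want to try several classifiers at once, a loop keeps the comparison tidy. The sketch below uses synthetic data from `make_classification` purely as a stand-in so that it runs on its own; in the notebook you would reuse `X_train`, `X_test`, `y_train`, and `y_test` instead. The choice of models and their settings is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the Yelp reviews).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=101)

candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, clf in candidates.items():
    clf.fit(Xtr, ytr)
    scores[name] = accuracy_score(yte, clf.predict(Xte))

# Print the models from best to worst accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```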


# Training Logistic Regression with Word Embeddings

# Word Embedding Math

One reason text embeddings are cool is that we can use them to explore connections in meaning between different words, including calculating similarity between words and completing [analogies](http://bionlp-www.utu.fi/wv_demo/).

We'll start by creating a dictionary containing the vectors for all the words in our vocabulary. We'll stick to the vocabulary above of 1000 words from the Yelp reviews. If you want to use more words, change `max_features` when creating `bow_transformer`.

In [26]:
vocab_dict = dict() #initialize dictionary

for word in bow_transformer.vocabulary_:
    vocab_dict[word] = text_to_nlp(word).vector

for word, vec in vocab_dict.items(): # Iterating through the dictionary to print each key and value
  print ('Word: {}. Vector length: {}'.format(word, len(vec)))

print()
print ('{} words in our dictionary'.format(len(vocab_dict)))

Word: wife. Vector length: 300
Word: take. Vector length: 300
Word: birthday. Vector length: 300
Word: breakfast. Vector length: 300
Word: excellent. Vector length: 300
Word:  . Vector length: 300
Word: perfect. Vector length: 300
Word: sit. Vector length: 300
Word: outside. Vector length: 300
Word: ground. Vector length: 300
Word: waitress. Vector length: 300
Word: food. Vector length: 300
Word: arrive. Vector length: 300
Word: quickly. Vector length: 300
Word: busy. Vector length: 300
Word: Saturday. Vector length: 300
Word: morning. Vector length: 300
Word: look. Vector length: 300
Word: like. Vector length: 300
Word: place. Vector length: 300
Word: fill. Vector length: 300
Word: pretty. Vector length: 300
Word: early. Vector length: 300
Word: well. Vector length: 300
Word: 

. Vector length: 300
Word: simply. Vector length: 300
Word: good. Vector length: 300
Word: sure. Vector length: 300
Word: use. Vector length: 300
Word: ingredient. Vector length: 300
Word: fresh. Vector length:

Next, let's calculate the similarity between two words, using their word-vector representations.

A common way to calculate the similarity between two vectors is called *cosine similarity*. It depends on the angle between those two vectors when plotted in space. As an example, imagine we had two three-dimensional vectors:

In [27]:
v0 = [2,3,1]
v1 = [2,4,1]

Run the code below to plot those vectors.

In [28]:
#@title Run this to create an interactive 3D plot
#Code from https://stackoverflow.com/questions/47319238/python-plot-3d-vectors
import numpy as np
import plotly.graph_objs as go

def vector_plot(tvects,is_vect=True,orig=[0,0,0]):
    """Plot vectors using plotly"""

    if is_vect:
        if not hasattr(orig[0],"__iter__"):
            coords = [[orig,np.sum([orig,v],axis=0)] for v in tvects]
        else:
            coords = [[o,np.sum([o,v],axis=0)] for o,v in zip(orig,tvects)]
    else:
        coords = tvects

    data = []
    for i,c in enumerate(coords):
        X1, Y1, Z1 = zip(c[0])
        X2, Y2, Z2 = zip(c[1])
        vector = go.Scatter3d(x = [X1[0],X2[0]],
                              y = [Y1[0],Y2[0]],
                              z = [Z1[0],Z2[0]],
                              marker = dict(size = [0,5],
                                            color = ['blue'],
                                            line=dict(width=5,
                                                      color='DarkSlateGrey')),
                              name = 'Vector'+str(i+1))
        data.append(vector)

    layout = go.Layout(
             margin = dict(l = 4,
                           r = 4,
                           b = 4,
                           t = 4)
                  )
    fig = go.Figure(data=data,layout=layout)
    fig.show()


vector_plot([v0,v1])

For our word vectors, we can imagine doing the same thing in 300-dimensional space. Of course, it's much harder to plot that. [Here](https://projector.tensorflow.org/) is one representation that you can play around with.

Then we find the cosine of the angle between the two vectors to get the similarity.

If the vectors point in exactly the same direction, the angle will be 0, so we get a similarity of $\cos(0^\circ) = 1$.

If the vectors point in exactly opposite directions, the angle will be 180 degrees, so we get a similarity of $\cos(180^\circ) = -1$.

There's a useful [mathematical trick](https://www.mathsisfun.com/algebra/vectors-dot-product.html) to find the cosine similarity:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/1d94e5903f7936d3c131e040ef2c51b473dd071d)

Where $A_1, A_2, ..., A_{300}$ are the elements of the first vector and $B_1, B_2, ..., B_{300}$ are the elements of the second vector.
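
As a worked example, here is the same calculation for the two small vectors from earlier, $v_0 = [2, 3, 1]$ and $v_1 = [2, 4, 1]$:

$$\cos\theta = \frac{v_0 \cdot v_1}{\|v_0\|\,\|v_1\|} = \frac{2 \cdot 2 + 3 \cdot 4 + 1 \cdot 1}{\sqrt{2^2 + 3^2 + 1^2}\,\sqrt{2^2 + 4^2 + 1^2}} = \frac{17}{\sqrt{14}\,\sqrt{21}} \approx 0.9915$$

The two vectors point in almost the same direction, so the similarity is close to 1.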


In [29]:
def vector_cosine_similarity(vec1,vec2):
  #Assume vec1 and vec2 have the same size

  numerator = 0
  for i in range(len(vec1)):
    numerator += vec1[i]*vec2[i]
  mag1 = (sum(elem**2 for elem in vec1))**0.5
  mag2 = (sum(elem**2 for elem in vec2))**0.5
  similarity = numerator/(mag1*mag2)
  return similarity

print(vector_cosine_similarity(v0,v1))

0.9914601339836675
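
The loop-based function above can also be written with NumPy. The sketch below is one possible alternative, not a replacement the notebook requires; the name `cosine_similarity_np` is hypothetical. It also guards against all-zero vectors, for which the angle is undefined and the division would otherwise be 0/0.

```python
import numpy as np

def cosine_similarity_np(vec1, vec2):
    """Cosine similarity of two equal-length vectors; returns nan if either is all zeros."""
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return float('nan')  # the angle is undefined for a zero vector
    return float(np.dot(a, b) / denom)

similarity = cosine_similarity_np([2, 3, 1], [2, 4, 1])
print(similarity)
```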


Now, use your cosine similarity function to calculate the similarity between two words. Try out a few words from the dataset.

In [31]:
def word_similarity(word1, word2):
  # Returns a similarity between -1 and 1, or None if either word is missing.
  try:
    vec1 = vocab_dict[word1]
    vec2 = vocab_dict[word2]
    return vector_cosine_similarity(vec1, vec2)
  except KeyError:
    print('Word not in dictionary')

print(word_similarity('burger','steak'))

0.7387349091587005


Now, we can use our functions above to find the *most* similar words to any particular word.

`find_most_similar(start_vec)` prints the 5 words whose vectors are most similar to `start_vec`, along with their similarities.


In [32]:
def find_nearest_neighbor(word):
  try:
    vec = vocab_dict[word]
    find_most_similar(vec)
  except KeyError:
    print ('Word not in dictionary')

def find_most_similar(start_vec):
  # Prints the top 5 most similar words to start_vec, and their similarities.

  similarity_series = pd.Series(np.nan, index = vocab_dict.keys())
  for word, vec in vocab_dict.items():
    similarity_series[word] = vector_cosine_similarity(start_vec, vec)
  similarity_series = similarity_series[similarity_series.notna()] #get rid of N/A
  five_most_similar = similarity_series.sort_values().tail()
  print (five_most_similar) #words and similarities

find_nearest_neighbor('bagel')


invalid value encountered in scalar divide



taco        0.790093
pickle      1.000000
sandwich    1.000000
burrito     1.000000
bagel       1.000000
dtype: float64


Finally, we can use the functions we've built to complete word analogies, like the ones you can try out [here](http://bionlp-www.utu.fi/wv_demo/). For example:

*   Breakfast is to bagel as lunch is to ________.

This requires a bit of "word arithmetic":
let's say A1, A2, and B1 are vectors for three words we know. We're trying to find B2 to complete

*   A1 is to A2 as B1 is to B2.

Intuitively, this means that the difference between A1 and A2 is the same as the difference between B1 and B2. So we write

*   A1 - A2 = B1 - B2

Once we know the vector that we "expect" for B2, we can use our previous functions to find the word whose representation is closest to that vector.
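
To see the arithmetic on numbers small enough to check by hand, here is a sketch with made-up 2-D vectors. The vectors, the dictionary `toy_vectors`, and the helper `toy_analogy` are all hypothetical and chosen only so the result is easy to verify; real word vectors have 300 dimensions. Rearranging A1 - A2 = B1 - B2 gives B2 = B1 - A1 + A2.

```python
import numpy as np

# Hypothetical 2-D vectors chosen by hand so the arithmetic is easy to follow.
toy_vectors = {
    'breakfast': np.array([1.0, 0.0]),
    'bagel':     np.array([1.0, 1.0]),
    'lunch':     np.array([2.0, 0.0]),
    'sandwich':  np.array([2.0, 1.0]),
    'taco':      np.array([0.0, 3.0]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_analogy(word_a1, word_a2, word_b1):
    """Solve a1 - a2 = b1 - b2 for b2, then return the closest word not used in the question."""
    target = toy_vectors[word_b1] - toy_vectors[word_a1] + toy_vectors[word_a2]
    candidates = {word: cosine(target, vec) for word, vec in toy_vectors.items()
                  if word not in (word_a1, word_a2, word_b1)}
    return max(candidates, key=candidates.get)

answer = toy_analogy('breakfast', 'bagel', 'lunch')
print(answer)
# sandwich
```

This sketch leaves the three question words out of the candidates. That is a design choice: the target vector usually lies close to the words it was built from, so they would otherwise tend to rank near the top.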

In [33]:
def find_analogy(word_a1, word_a2, word_b1):
  # Convert the words to vectors a1, a2, b1.
  # If word_a1:word_a2 as word_b1:word_b2, then
  #   a1 - a2 = b1 - b2
  # so b2 = b1 - a1 + a2.
  # Compute that target vector and print the words closest to it.

  a1_vec = vocab_dict[word_a1]
  a2_vec = vocab_dict[word_a2]
  b1_vec = vocab_dict[word_b1]
  find_most_similar(b1_vec - a1_vec + a2_vec)

find_analogy('breakfast','bagel','lunch')


invalid value encountered in scalar divide



bagel       0.782805
sandwich    0.782805
pickle      0.782805
burrito     0.782805
lunch       0.815342
dtype: float64
