## Sentiment Analysis of Restaurant Reviews

In this analysis we clean a set of restaurant reviews with some NLP methods, and then run sentiment analysis to determine which reviews are positive and which are negative.

- We aim to compare a few different models and find the best one

What steps will we take?

- Import Dataset
- Preprocess Data
- Vectorization (NLP)
- Training and Classification
- Conclusion


### Import Dataset

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Read in the data

df = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)

In [None]:
# Inspect the data, looks good

df.head()

### Preprocess Data
Each review undergoes a preprocessing step that strips out content carrying little signal for sentiment.

- Remove stopwords, digits, and special characters
- Normalize each review with stemming

The loop below prints the result of each step for the first two reviews:

ps = PorterStemmer() # Create the PorterStemmer object once
stop_words = set(stopwords.words('english')) # Build the stopword set once, not once per word
for i in range(2):
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i]) # Replace every non-letter with a space
    print("After sub: ", review)
    review = review.lower() # Make all letters lowercase
    print("After lower: ", review)
    review = review.split() # Split the review into a list of words
    print("After split: ", review)
    review = [ps.stem(word) for word in review if word not in stop_words] # Remove stopwords and stem the words
    print("After stemming: ", review)
    review = ' '.join(review) # Join the words back together
    print("After join: ", review)
    corpus.append(review) # Add the cleaned review to the corpus

In [None]:
# We're using NLTK (Natural Language Toolkit) to clean the data
# The markdown cell above shows an example that prints each step

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

ps = PorterStemmer() # Create the PorterStemmer object once
stop_words = set(stopwords.words('english')) # Build the stopword set once, not once per word

corpus = []
for i in range(len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i]) # Replace every non-letter with a space
    review = review.lower() # Make all letters lowercase
    review = review.split() # Split the review into a list of words
    review = [ps.stem(word) for word in review if word not in stop_words] # Remove stopwords and stem the words
    review = ' '.join(review) # Join the words back together
    corpus.append(review) # Add the cleaned review to the corpus


In [None]:
# Inspect corpus, looks good

corpus

### Vectorization

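We want to create a numerical representation of each review, so that a model can process it. This process is called vectorization. Here we use a bag-of-words model: each review becomes a row of word counts, with one column per word in the vocabulary.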
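
To make the idea concrete, here is a minimal hand-rolled sketch of turning text into count vectors. It uses only the standard library, and the three toy reviews are made up for illustration (they are not from the dataset). It is not how `CountVectorizer` is implemented internally, just the same idea at a small scale.

```python
from collections import Counter

# Toy, already-cleaned reviews (hypothetical examples, not taken from the dataset)
toy_corpus = ["love place", "bad food bad servic", "love food"]

# Vocabulary: every distinct token, sorted so the column order is stable
vocabulary = sorted({token for doc in toy_corpus for token in doc.split()})


def to_count_vector(doc, vocabulary):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


toy_matrix = [to_count_vector(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)  # ['bad', 'food', 'love', 'place', 'servic']
for row in toy_matrix:
    print(row)
# [0, 0, 1, 1, 0]
# [2, 1, 0, 0, 1]
# [0, 1, 1, 0, 0]
```

Each row is one review and each column is one word, which is the same layout as the `X` matrix built in the next cell.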

In [7]:
# Create the bag of words model

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500) # Only keep the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:,1].values

In [8]:
X.sum(axis=0) # Check the number of times each word appears

array([9, 1, 1, ..., 2, 4, 5])
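
The `max_features=1500` setting caps the vocabulary at the most frequent words. The sketch below shows that idea on a made-up toy corpus with a hypothetical helper, `top_k_vocabulary`. How ties at the cut-off are broken here is a property of this sketch only, and may not match what scikit-learn does.

```python
from collections import Counter


def top_k_vocabulary(corpus, k):
    """Keep the k tokens with the highest total count across the corpus."""
    totals = Counter(token for doc in corpus for token in doc.split())
    # most_common orders by count, highest first; sort the survivors alphabetically
    return sorted(word for word, _ in totals.most_common(k))


# Hypothetical cleaned reviews: 'good' appears 3 times, 'food' twice, the rest once
toy_corpus = ["good food good servic", "bad food", "good place"]

print(top_k_vocabulary(toy_corpus, k=2))  # ['food', 'good']
```

Words that fall outside the cap simply get no column, so rare words no longer contribute to the count vectors.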

In [9]:
# Inspect y, looks good

y

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,

### Training and Classification

The data is split into a training set (70%) and a test set (30%), ready for classification.


In [10]:
# Split data into train, test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.30, random_state=42)

In [11]:
print(X_train.shape) # Check the shape of the training data
print(X_test.shape) # Check the shape of the test data

(700, 1500)
(300, 1500)
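
For intuition, a shuffled hold-out split can be sketched in a few lines of plain Python. The helper `simple_train_test_split` and the stand-in data are hypothetical. The sketch will not reproduce the exact rows that `train_test_split` picks, since the shuffling differs, but it shows the mechanics behind the 700/300 shapes above.

```python
import random


def simple_train_test_split(rows, labels, test_size=0.30, seed=42):
    """Shuffle the indices once, then cut them into a test part and a train part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical stand-in data: 1000 rows, the same size as the review dataset
rows = [[i] for i in range(1000)]
labels = [i % 2 for i in range(1000)]

tr_X, te_X, tr_y, te_y = simple_train_test_split(rows, labels)
print(len(tr_X), len(te_X))  # 700 300
```

Rows and labels are indexed with the same shuffled indices, so every review stays paired with its own label.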


Let's start the classification party

**Logistic Regression**

In [12]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=42) # Create an unfitted LogisticRegression model
classifier.fit(X_train, y_train) # Fit the model to the training data

y_pred = classifier.predict(X_test) # Predict the test data


In [13]:
y_pred # Inspect the predictions

array([0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1])

In [14]:
# Build the confusion matrix on the test data (rows: actual class, columns: predicted class)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[124  28]
 [ 44 104]]
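
The four cells of this matrix are enough to work out the usual metrics by hand. The sketch below copies the numbers printed above and assumes scikit-learn's layout for binary labels (rows are actual classes, columns are predicted classes, so the cells read `[[TN, FP], [FN, TP]]`). The variable names are only for illustration.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes
cm_lr = [[124, 28],
         [44, 104]]

tn, fp = cm_lr[0]  # actual negatives: correctly rejected, wrongly flagged as positive
fn, tp = cm_lr[1]  # actual positives: missed, correctly found

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.7600
print(f"Precision: {precision:.4f}")  # 0.7879
print(f"Recall:    {recall:.4f}")     # 0.7027
print(f"F1-Score:  {f1:.4f}")         # 0.7429
```

These values should agree with what the scikit-learn metric functions report for the positive class.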


In [15]:
# Inspect accuracy, precision, recall, f1-score

from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1-Score: ', f1_score(y_test, y_pred))

print(classification_report(y_test, y_pred))

Accuracy:  0.76
Precision:  0.7878787878787878
Recall:  0.7027027027027027
F1-Score:  0.7428571428571429
              precision    recall  f1-score   support

           0       0.74      0.82      0.78       152
           1       0.79      0.70      0.74       148

    accuracy                           0.76       300
   macro avg       0.76      0.76      0.76       300
weighted avg       0.76      0.76      0.76       300



**Multinomial Naive Bayes**

In [16]:
# We create a multinomial naive bayes model to compare with the logistic regression model

from sklearn.naive_bayes import MultinomialNB
NB_cls = MultinomialNB()
NB_cls.fit(X_train, y_train)

y_pred_NB = NB_cls.predict(X_test)


In [17]:
# inspect the predictions

y_pred_NB

array([0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1])

In [18]:
# Make a confusion matrix
# Compared with logistic regression: fewer false negatives (35 vs 44), more false positives (37 vs 28)

cm_NB = confusion_matrix(y_test, y_pred_NB)
print(cm_NB)


[[115  37]
 [ 35 113]]


In [19]:
# Inspect accuracy, precision, recall, f1-score

print('Accuracy: ', accuracy_score(y_test, y_pred_NB))
print('Precision: ', precision_score(y_test, y_pred_NB))
print('Recall: ', recall_score(y_test, y_pred_NB))
print('F1-Score: ', f1_score(y_test, y_pred_NB))


Accuracy:  0.76
Precision:  0.7533333333333333
Recall:  0.7635135135135135
F1-Score:  0.7583892617449665


In [20]:
# Inspect accuracy, precision, recall, f1-score of both models  

print('Logistic Regression')
print(classification_report(y_test, y_pred))

print('Multinomial Naive Bayes')
print(classification_report(y_test, y_pred_NB))

Logistic Regression
              precision    recall  f1-score   support

           0       0.74      0.82      0.78       152
           1       0.79      0.70      0.74       148

    accuracy                           0.76       300
   macro avg       0.76      0.76      0.76       300
weighted avg       0.76      0.76      0.76       300

Multinomial Naive Bayes
              precision    recall  f1-score   support

           0       0.77      0.76      0.76       152
           1       0.75      0.76      0.76       148

    accuracy                           0.76       300
   macro avg       0.76      0.76      0.76       300
weighted avg       0.76      0.76      0.76       300
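
Both reports show the same accuracy, so it helps to look at how each model gets things wrong. The sketch below simply re-reads the two confusion matrices printed earlier (again assuming the `[[TN, FP], [FN, TP]]` layout) and is meant as a quick side-by-side, not a statistical comparison.

```python
# Confusion matrices copied from the outputs above
matrices = {
    "Logistic Regression": [[124, 28], [44, 104]],
    "Multinomial Naive Bayes": [[115, 37], [35, 113]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        "false_pos": fp,
        "false_neg": fn,
        "errors": fp + fn,
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    }
    print(f"{name}: FP={fp}, FN={fn}, errors={fp + fn}, "
          f"accuracy={summary[name]['accuracy']:.2f}")
# Logistic Regression: FP=28, FN=44, errors=72, accuracy=0.76
# Multinomial Naive Bayes: FP=37, FN=35, errors=72, accuracy=0.76
```

Each model misclassifies 72 of the 300 test reviews, which is why accuracy ties at 0.76. Logistic regression misses more positive reviews, while Naive Bayes spreads its errors almost evenly between the two classes.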

