# 7.2 Restaurant Reviews Dataset

The restaurant reviews dataset contains 1000 single-line English restaurant reviews, each labelled
with a polarity (1 = positive review; 0 = negative review). The review text is noisy, in the sense
that not every token corresponds to an English word or a punctuation mark.
In this notebook, we will develop sentiment analysis models for predicting the polarity of restaurant
reviews.

**b)** Use the pandas library to import the restaurant reviews dataset.

In [1]:
import pandas as pd
import os

# Importing the dataset
df = pd.read_csv(os.path.join(os.getcwd(), "Restaurant_Reviews.tsv"), sep="\t")

**c)** Using the regular expressions library (re), perform some simple cleanup on the text. For
example, you may consider only alphabetic character sequences, and convert the whole text
to lowercase.

In [2]:
import re

reviews = []

for text in df['Review']:
    # Get the review, replace non-alphabetic characters with spaces, lowercase and tokenize
    reviews.append(re.sub('[^a-zA-Z]', ' ', text).lower().split())
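
To see what this cleanup does to a single review, the sketch below applies the same substitution to a made-up sentence. The sample text is an assumption for illustration, not a row taken from the dataset.

```python
import re

# Hypothetical review text, used only for illustration
sample = "Wow... Loved this place!! 10/10"

# Replace every non-alphabetic character with a space, lowercase, split on whitespace
tokens = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print(tokens)  # ['wow', 'loved', 'this', 'place']
```

Note that digits disappear along with punctuation, so a rating such as "10/10" leaves no token, and a contraction such as "don't" is split into "don" and "t".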

**d)** Tokenize the text, use NLTK’s Porter stemmer to stem the obtained tokens, and remove stop
words using NLTK’s stop word list for English.

In [3]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

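ps = PorterStemmer()
# Build the stop word set once instead of rebuilding it for every token
# (run nltk.download('stopwords') once if the list is not installed)
stop_words = set(stopwords.words('english'))
corpus = []

for review in reviews:
    # Stemming and stop word removal
    review = ' '.join([ps.stem(word) for word in review if word not in stop_words])
    corpus.append(review)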
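
As a minimal sketch of what this step produces for one review, the example below stems a short token list and filters it against a tiny hand-written stop list. The names `tokens` and `toy_stop_words` are assumptions for illustration; the toy list stands in for `stopwords.words('english')` so the example runs without downloading the NLTK corpus.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Tokens as produced by the cleanup step (illustrative input)
tokens = ['wow', 'loved', 'this', 'place']

# Tiny hand-written stop list, an assumption standing in for NLTK's English list
toy_stop_words = {'this', 'the', 'is'}

# Drop stop words first, then stem the remaining tokens
cleaned = ' '.join(ps.stem(t) for t in tokens if t not in toy_stop_words)
print(cleaned)  # wow love place
```

NLTK's English stop word list also contains negations such as "not", so a review like "not good" loses its negation after this step. Whether to keep such words is a modelling choice that may be worth testing.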

**e)** From the cleaned up and tokenized corpus, create bag-of-words features, using sklearn’s
CountVectorizer. Now you should have obtained a structured dataset, where each
restaurant review is represented by a vector of word counts with the size of the vocabulary
(mostly 0’s, since each review uses only a few of the vocabulary words).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1500)
x = vectorizer.fit_transform(corpus).toarray()
y = df.iloc[:,-1].values

print(x.shape, y.shape)

(1000, 1500) (1000,)
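
The toy example below illustrates how `CountVectorizer` builds its feature matrix, and how the `binary=True` option turns counts into presence flags. The two-document corpus is an assumption made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already stemmed documents (illustrative only)
toy_corpus = ["good food good servic", "bad food"]

# Default behaviour: each column holds the number of occurrences of a term
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus).toarray()

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(count_vec.vocabulary_)
print(vocab)   # ['bad', 'food', 'good', 'servic']
print(counts)  # [[0 1 2 1]
               #  [1 1 0 0]]

# binary=True records only whether a term occurs at all
binary_vec = CountVectorizer(binary=True)
flags = binary_vec.fit_transform(toy_corpus).toarray()
print(flags)   # [[0 1 1 1]
               #  [1 1 0 0]]
```

In the notebook, `max_features = 1500` additionally keeps only the 1500 most frequent terms, which is why the feature matrix has 1500 columns.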


**f)** Split the data into train and test sets.

In [5]:
# Split dataset into training and test sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(800, 1500) (800,)
(200, 1500) (200,)
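
A plain random split does not guarantee that both sets keep the same class proportions. As a hedged sketch, the `stratify` parameter of `train_test_split` can be used to enforce this; the synthetic arrays `x_toy` and `y_toy` below are assumptions standing in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with perfectly balanced labels
x_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# stratify=y_toy keeps the class ratio identical in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy)

print(np.bincount(y_tr), np.bincount(y_te))  # [40 40] [10 10]
```

In the notebook this would amount to adding `stratify = y` to the existing call. Doing so changes which reviews land in the test set, so the scores reported in the following cells would change as well.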


**g)** Try to fit a Naïve Bayes classifier to the training set, and predict its test set results. Analyse
the confusion matrix and the classification scores (accuracy, precision, recall, F1).


In [6]:
# Fit Naive Bayes to the training set

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predict test set results

y_pred = classifier.predict(x_test)

print(y_pred)

# Generate metrics

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# confusion matrix
print(confusion_matrix(y_test, y_pred))

# accuracy
print('Accuracy: ', accuracy_score(y_test, y_pred))

# precision
print('Precision: ', precision_score(y_test, y_pred))

# recall
print('Recall: ', recall_score(y_test, y_pred))

# f1
print('F1: ', f1_score(y_test, y_pred))

[1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1
 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1
 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 0 0
 1 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1
 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0
 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1]
[[55 42]
 [12 91]]
Accuracy:  0.73
Precision:  0.6842105263157895
Recall:  0.883495145631068
F1:  0.7711864406779663
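
The scores above can be recomputed by hand from the confusion matrix, which helps when analysing them. The sketch below uses the four counts printed above; in sklearn's layout, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Counts taken from the confusion matrix printed above
tn, fp = 55, 42   # true negatives, false positives
fn, tp = 12, 91   # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 146 / 200
precision = tp / (tp + fp)                           # 91 / 133
recall = tp / (tp + fn)                              # 91 / 103
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

The classifier recovers most positive reviews (91 of 103), but it also labels 42 of the 97 negative reviews as positive, which is what pulls precision down to about 0.68.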


**h)** Try out the model by prompting the user to input a restaurant review and predicting its class.

In [None]:
# Simple test

rev = input("Enter review: ")
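rev = re.sub('[^a-zA-Z]', ' ', rev).lower().split()
# Apply the same stop word removal and stemming as for the training corpus
stop_words = set(stopwords.words('english'))
rev = ' '.join([ps.stem(w) for w in rev if w not in stop_words])
X = vectorizer.transform([rev]).toarray()

# predict returns one label per input row; take the single prediction
if classifier.predict(X)[0] == 1:
    print('positive review (+)')
else:
    print('negative review (-)')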

**i)** Experiment with other classifiers and see if you can improve on the performance of the
classification model.

#### SVM

In [None]:
from sklearn.svm import SVC

classifier = SVC()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1: ', f1_score(y_test, y_pred))

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1: ', f1_score(y_test, y_pred))

#### Perceptron

In [None]:
from sklearn.linear_model import Perceptron

classifier = Perceptron()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1: ', f1_score(y_test, y_pred))

#### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1: ', f1_score(y_test, y_pred))

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1: ', f1_score(y_test, y_pred))
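
Since every cell in this part repeats the same fit, predict and score steps, one possible way to compare classifiers side by side is a small helper function. This is only a sketch: the function `evaluate`, the `candidates` dictionary and the synthetic data from `make_classification` are assumptions for illustration and are not part of the original exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(classifier, x_train, x_test, y_train, y_test):
    """Fit the classifier and return (accuracy, F1) on the test set."""
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    return accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)


# Synthetic stand-in for the bag-of-words features (not the review data)
x_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
# train_test_split returns x_train, x_test, y_train, y_test in this order
split = train_test_split(x_demo, y_demo, test_size=0.20, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

results = {name: evaluate(clf, *split) for name, clf in candidates.items()}
for name, (acc, f1) in results.items():
    print(f'{name}: accuracy={acc:.3f}, F1={f1:.3f}')
```

In the notebook, the same helper could be called with `x_train, x_test, y_train, y_test` from step f). Fixing `random_state` on classifiers that use randomness, such as the decision tree, makes their scores reproducible between runs.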