## Business problem

We have a dataset of restaurant reviews. Using these reviews, we will apply natural language processing (NLP) to predict whether a review is positive or negative.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
df.head(10)

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
5,Now I am getting angry and I want my damn pho.,0
6,Honeslty it didn't taste THAT fresh.),0
7,The potatoes were like rubber and you could te...,0
8,The fries were great too.,1
9,A great touch.,1


## Cleaning the texts

Each review is cleaned with the following steps:

1. Replace every character that is not a letter (punctuation, digits, etc.) with a space
2. Convert all characters to lower case
3. Split each review into separate words
4. Build a list of stop words that excludes the word 'not', since we need it to recognise 'bad' reviews
5. Remove the stop words and apply the Porter stemmer to reduce each remaining word to its stem
6. Join the remaining words back into a single string
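To see why the word 'not' is kept, the steps above can be sketched on a single sentence. This is a simplified illustration, not the notebook's pipeline: `STOP_WORDS` is a tiny made-up set standing in for NLTK's English list, the helper name `clean` is hypothetical, and stemming is left out so the snippet runs with the standard library alone.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration
STOP_WORDS = {"the", "is", "was", "and", "a", "not"}

def clean(text, stop_words):
    """Keep letters only, lower-case, split, drop stop words, re-join."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)

# Treating 'not' as a stop word loses the negation
with_not_removed = clean("Crust is not good.", STOP_WORDS)
# Excluding 'not' from the stop words preserves it
with_not_kept = clean("Crust is not good.", STOP_WORDS - {"not"})

print(with_not_removed)  # crust good
print(with_not_kept)     # crust not good
```

With 'not' removed, the negative review "Crust is not good." becomes indistinguishable from a positive one.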

In [6]:
import re
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []

# Build the stemmer and the stop-word set once, outside the loop
ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')  # keep 'not' so negations survive cleaning

for text in df['Review']:
    # Replace everything that is not a letter with a space
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower().split()
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    corpus.append(' '.join(review))

## Creating the Bag of Words model

The fit_transform method learns the vocabulary of the corpus and returns a matrix with one row per review and one column per word; each cell holds the number of times that word occurs in that review. The goal is a matrix/2D array that a machine learning algorithm can be trained on.

The max_features parameter limits the vocabulary to the 1500 most frequent words.
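As a minimal sketch of what the vectorizer produces, the toy corpus below (three made-up, already-cleaned reviews, not the real dataset) is turned into a count matrix. All variable names here are illustrative. The second vectorizer uses `max_features=1` to show that only the most frequent word is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of already-cleaned reviews (illustrative, not the real dataset)
toy_corpus = ["wow love place", "crust not good", "not tasti textur nasti"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_X)

# With max_features=1 only the most frequent word across the corpus is kept
small_cv = CountVectorizer(max_features=1)
small_X = small_cv.fit_transform(toy_corpus).toarray()
print(list(small_cv.vocabulary_))
```

Each row of `toy_X` is one review and each column one word, so the first row has a 1 in the columns for 'love', 'place' and 'wow' and 0 elsewhere.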

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)

X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, -1].values

Without the limit the vocabulary would contain 1566 words, but because we set max_features, the tokenization yields 1500 columns. In other words, each review is represented by the counts of 1500 words.

In [24]:
len(X[0])

1500

## Splitting the dataset into training and test set

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
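To make the effect of `test_size=0.2` concrete, here is a small sketch on placeholder data. The 1000-row array is an assumption chosen to be consistent with the 200 test rows reported later, and the `stratify` argument is an optional extra that the notebook itself does not use; it keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 1000 rows, standing in for the real feature matrix
demo_X = np.zeros((1000, 5))
demo_y = np.array([0, 1] * 500)  # perfectly balanced labels

# stratify=demo_y preserves the 50/50 class ratio in train and test
Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=1, stratify=demo_y
)

print(Xtr.shape, Xte.shape)    # (800, 5) (200, 5)
print(ytr.mean(), yte.mean())  # 0.5 0.5
```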

For this example we use Naive Bayes, but other classifiers can be tried in the same way to see whether they achieve a better score.
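One possible way to try other methods is sketched below. The helper `compare_models` and the choice of candidates (MultinomialNB and LogisticRegression next to GaussianNB) are assumptions made for illustration, and the data is synthetic so the snippet runs on its own; no claim is made about which model scores best on the real reviews.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

def compare_models(X_train, X_test, y_train, y_test):
    """Fit several classifiers and return their test accuracy by name."""
    models = {
        "GaussianNB": GaussianNB(),
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }

# Synthetic word counts, only so the sketch is self-contained
rng = np.random.default_rng(0)
fake_X = rng.poisson(1.0, size=(300, 20))
fake_y = (fake_X[:, 0] > fake_X[:, 1]).astype(int)

# train_test_split returns X_train, X_test, y_train, y_test in that order
parts = train_test_split(fake_X, fake_y, test_size=0.2, random_state=1)
scores = compare_models(*parts)
print(scores)
```

On the notebook's data the call would be `compare_models(X_train, X_test, y_train, y_test)`.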

## Training the Naive Bayes model

In [26]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

GaussianNB()

## Predicting the test set results

In [28]:
y_pred = gnb_clf.predict(X_test)

## Confusion matrix

In [42]:
from sklearn.metrics import confusion_matrix, accuracy_score

print('Confusion matrix:')

c_matrix = confusion_matrix(y_test, y_pred)
print(c_matrix)

print('----------------------------')

acc = accuracy_score(y_test, y_pred)
print('Accuracy: {}%'.format((acc * 100)))

Confusion matrix:
[[60 48]
 [15 77]]
----------------------------
Accuracy: 68.5%


From the confusion matrix we can conclude that:

60 reviews were correctly predicted negative (true negatives), i.e. predicted negative and actually negative. <br>
77 reviews were correctly predicted positive (true positives), i.e. predicted positive and actually positive. <br>

15 reviews were incorrectly predicted negative (false negatives), i.e. predicted negative but actually positive. <br>
48 reviews were incorrectly predicted positive (false positives), i.e. predicted positive but actually negative. <br>
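The four counts also give precision, recall and F1, which say more than accuracy alone. The sketch below copies the numbers from the printed confusion matrix and uses plain Python. With the scikit-learn array, `tn, fp, fn, tp = c_matrix.ravel()` would yield the same four values.

```python
# Counts copied from the confusion matrix: rows are actual, columns are predicted
(tn, fp), (fn, tp) = [[60, 48], [15, 77]]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)     # of the actually positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.685 precision=0.616 recall=0.837 f1=0.710
```

The 48 false positives pull precision down to about 0.62, so the model tends to label negative reviews as positive more often than the reverse.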