<h1>Predicting reviews of a restaurant</h1>

<p>The dataset contains restaurant reviews, so we will build machine learning models that predict whether a review is positive or negative.</p>
Columns:
1 - Review (the review text)
2 - Liked (the rating)

CSV stands for Comma-Separated Values: the columns in the file are separated by commas. TSV stands for Tab-Separated Values: it works the same way, but the separator is a tab.
<p>Why is it better to choose a <b>TSV</b> file than a <b>CSV</b> file when working with text?</p>
Reviews often contain commas, as in "Food, great!". A parser that splits on every comma would read this review as two columns, "Food" and "great!", when it is actually a single review column, and that would corrupt the input to the algorithm. With TSV, the parser knows there is only one column, because the delimiter is a tab and a tab is very unlikely to appear in a review.
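<p>As a minimal sketch of this difference, the snippet below parses the same review once with a comma delimiter and once with a tab delimiter, using Python's standard <i>csv</i> module. The two one-line "files" are made up for illustration and are not taken from the dataset.</p>

```python
import csv
import io

# Hypothetical one-line files holding the same review and its rating
csv_text = "Food, great!,1\n"
tsv_text = "Food, great!\t1\n"

# Unquoted comma-separated line: the comma inside the review splits it
csv_row = next(csv.reader(io.StringIO(csv_text), delimiter=","))
# Tab-separated line: the comma is ordinary text inside the review
tsv_row = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_row)
print(tsv_row)
```

<p>The comma-delimited parse yields three fields, while the tab-delimited parse yields the expected two: the review and its rating.</p>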

The rating column is equal to either 1 or 0: 1 means a positive review and 0 means a negative review.

<h2>Importing the libraries</h2>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

<h2>Importing the dataset</h2>

In [2]:
# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
dataset = pd.read_csv("Restaurant_reviews.tsv", delimiter="\t", quoting=3)
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


<h2>Cleaning the text</h2>

<p>The first step is to clean the text. In the end we will create a <b>Bag of Words model</b>, which keeps only the relevant words: we will get rid of words like "the", "on", "or", remove the punctuation, and merge different forms of the same word, so that "walk" and "walking" are treated as one.</p>
<p>Then we will represent each review as a vector with one column per word. There will be a lot of columns, and for each review each column holds the number of times the associated word appears in that review.</p>

In [3]:
import re
# We will take the first review as an example
review = dataset['Review'][0]
print(review)

Wow... Loved this place.


<h4>First step: Getting rid of irrelevant characters</h4>
<p>Characters such as digits or punctuation will not help the algorithm predict the result, so we keep only the letters.
We replace the other characters with a space, because simply deleting them would create meaningless words.
Example: take "Wow,There is so beautiful". If we delete the comma, we get a new word, "WowThere", which of course makes no sense.</p>
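<p>The example sentence can be checked directly. The sketch below compares deleting the unwanted characters with replacing them by a space. The first pattern keeps spaces on purpose, so that only the comma is deleted.</p>

```python
import re

text = "Wow,There is so beautiful"

# Deleting everything that is not a letter or a space glues two words together
deleted = re.sub(r"[^a-zA-Z ]", "", text)
# Replacing every non-letter with a space keeps the words apart
spaced = re.sub(r"[^a-zA-Z]", " ", text)

print(deleted.split())
print(spaced.split())
```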

In [4]:
# Replace every unwanted character with a space
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
review

'Wow    Loved this place '

In [5]:
# Convert all the letters of the review to lowercase
review = review.lower()
review

'wow    loved this place '

<h4>Second step: Keeping only the important words</h4>
<p>In a review, some words are very important for determining whether it is positive or negative, like "Loved" in the first example. Conversely, there are words that do not help the algorithm, like "the": these words have no impact on the review, so if we delete them, the nature of the review stays the same.</p>
<p>These irrelevant words are listed in a package called <i>stopwords</i> in the NLTK library. We will download this package and use it to check for such words.</p>
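<p>To make the idea concrete without downloading anything, here is a small sketch that filters a word list against a hand-written stop-word set. The set below is an assumption made up for illustration. The English list shipped with NLTK is much longer.</p>

```python
# Hypothetical hand-written stop-word set, for illustration only;
# the NLTK English list is much longer
stop_words = {"the", "this", "is", "a", "on", "or"}

words = ["wow", "loved", "this", "place"]

# Keep only the words that are not stop words
kept = [w for w in words if w not in stop_words]
print(kept)
```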

In [6]:
# In this review the word that will help the algorithm predict is "Loved"
# All the irrelevant words are in a package called stopwords in the nltk library
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\L\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Split the review into individual words
review = review.split()
review

['wow', 'loved', 'this', 'place']

<h4>Third step: Getting rid of variations of the words</h4>
<p>To explain this, let's say we have two sentences:<br>
"I was taking a ride in the car"<br>
"I was riding in the car"<br>
The two sentences clearly have the same meaning, yet the words "ride" and "riding" are seen as different words even though they refer to the same thing. We must treat them as one word, which is why we use stemming. Stemming extracts the root of a word: in our example, "loved" is converted to "love".</p>
<p>Stemming is a powerful tool. However, it sometimes introduces errors: for example, the words "University" and "Universe" are both reduced to "univers".</p>
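<p>The sketch below illustrates both effects with a crude suffix-stripping function. It is a deliberately simplified stand-in and not the Porter algorithm, so its stems differ from NLTK's (it produces "rid" where Porter produces "ride"). The suffix list and the minimum stem length are assumptions chosen for this example.</p>

```python
# Hypothetical suffix list, checked in this order (not Porter's rules)
SUFFIXES = ("ity", "ing", "ed", "e")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Useful merge: two forms of the same word share one stem
print(crude_stem("ride"), crude_stem("riding"))
# Over-stemming: two unrelated words also share one stem
print(crude_stem("University"), crude_stem("Universe"))
```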

In [8]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
review = [ps.stem(word) for word in review if word not in stop_words]
# The stemmer has converted "loved" to "love"
review

['wow', 'love', 'place']

<p>We are now on a good path to reduce the sparsity that will arise when we create our bag of words model.</p>

<h4>Fourth step: Joining the words back into one string</h4>
<p>In this step we join the words back into a single string, with a space between each word.</p>

In [9]:
review = ' '.join(review)
review


'wow love place'

<h3>Applying the cleaning steps to all reviews and collecting them into one corpus</h3>

In [10]:
corpus = []
# Append the first example review, which is already cleaned
corpus.append(review)
# Create the stemmer and the stop-word set once, outside the loop
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
for i in range(1, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [48]:
# Print just the first five reviews
print(corpus[0:5])

['wow love place', 'crust good', 'tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price']
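<p>The four cleaning steps can also be bundled into one function. The sketch below is one possible way to do it. The name <i>clean_review</i> and its parameters are hypothetical, and the stop-word set and the stemmer are passed in so that the example runs without NLTK.</p>

```python
import re

def clean_review(text, stop_words, stem):
    """Apply the four cleaning steps to one review and return a string."""
    # Step 1: keep only letters, then lowercase
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Steps 2 and 3: drop stop words and stem what remains
    words = [stem(w) for w in letters_only.split() if w not in stop_words]
    # Step 4: join the words back into one string
    return " ".join(words)

# Stand-ins so the sketch runs without NLTK (both are assumptions)
toy_stop_words = {"this", "the", "is"}

def identity_stem(word):
    return word

cleaned = clean_review("Wow... Loved this place.", toy_stop_words, identity_stem)
print(cleaned)
```

<p>In the notebook, one would presumably pass the NLTK stop-word set and <i>ps.stem</i> instead of the stand-ins.</p>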


<h2>Creating the Bag of Words</h2>
<p>What is the bag of words and why do we need to create it?</p>
<p>The main goal of the first processing step was not only to clean the text but also to build the corpus. The bag of words creates one column for each word, and since we have a lot of words, we will have a lot of columns.<br>
The rows of the table are the reviews. So the table has 1000 rows for the 1000 reviews and one column per word. Each cell corresponds to one specific review and one specific word, and it holds the number of times the column's word appears in that review.</p>
<p>For the first review, the column for "wow" in the first row is equal to 1 because the word appears in it, and the same goes for "love" and "place". All the remaining words are equal to 0 for this row, so we will get a lot of zeros in the matrix. This is called <b>sparsity</b>.</p>
<p>So why do we need to create this matrix? Simply because we want to predict whether a review is positive or negative, and we need this matrix to train the model: for all these reviews we know the real sentiment, and the model will learn correlations between the words and the sentiment.</p>
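<p>To see what such a table looks like on a small scale, here is a minimal pure-Python sketch that builds a bag of words for the first three cleaned reviews and measures its sparsity. It only illustrates the idea and is not how <i>CountVectorizer</i> is implemented. The alphabetical column order is an assumption made for readability.</p>

```python
from collections import Counter

# First three cleaned reviews from the corpus printed earlier
docs = ["wow love place", "crust good", "tasti textur nasti"]

# Vocabulary: one column per distinct word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per review, one count per vocabulary word
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocabulary])

# Sparsity: share of cells that are zero
cells = [value for row in rows for value in row]
sparsity = cells.count(0) / len(cells)

print(vocabulary)
print(rows)
print(round(sparsity, 2))
```

<p>Even with three short reviews, two thirds of the cells are zero. With 1000 reviews and 1500 columns the share of zeros is much higher.</p>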

In [20]:
# The library needed for the creation of bag of words
from sklearn.feature_extraction.text import CountVectorizer
# Here we keep the 1500 most frequent words
cv = CountVectorizer(max_features = 1500)
matrix = cv.fit_transform(corpus).toarray()

<p>The result is a matrix of 1000 rows and 1500 columns.</p>

In [22]:
matrix.shape

(1000, 1500)

In [26]:
# Creating the dependent variable
y = dataset.iloc[:, 1].values

In [34]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(matrix, y, test_size = 0.20, random_state = 0)

In [35]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [37]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int64)

In [39]:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[55, 42],
       [12, 91]], dtype=int64)

<p>In these results we have 55 + 91 = 146 correct predictions and 42 + 12 = 54 incorrect predictions on the 200 test reviews, which is not bad for a model trained on only 800 observations. We used a Naive Bayes model here, and training it on more observations would likely make its predictions more accurate.</p>

In [45]:
print(((55 + 91)/ 200)*100,"%")

73.0 %
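<p>Accuracy is only one way to read a confusion matrix. As a sketch, the snippet below recomputes the accuracy from the matrix above and adds precision and recall for the positive class. The matrix is copied by hand from the notebook output, and the layout (rows for true labels, columns for predicted labels) follows scikit-learn's convention.</p>

```python
# Confusion matrix copied from the notebook output: rows are true labels,
# columns are predicted labels (scikit-learn's convention)
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

<p>The 42 false positives pull precision below the overall accuracy, which suggests that the model leans towards predicting positive reviews.</p>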
