<h1>Predicting reviews of a restaurant</h1>

<p>The dataset contains restaurant reviews, and we will build machine learning models that predict whether a review is positive or negative.</p>
Columns:
1 - review (column <i>Review</i> in the file)
2 - rate (column <i>Liked</i> in the file)

CSV stands for Comma-Separated Values: the columns in the file are separated by commas. TSV stands for Tab-Separated Values: the same idea, but the separator is a tab.
<p>Why is it better to choose a <b>TSV</b> file than a <b>CSV</b> file when working with text?</p>
Reviews often contain commas, as in "Food, great!". A parser that splits on commas would read this review as two columns, "Food" and "great!", when it is actually a single review column, and that would break everything downstream. With a TSV file, the parser knows there is only one column because the delimiter is a tab, and a tab is very unlikely to appear inside a review.
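
The difference can be seen on a single line with Python's built-in csv module. This is only a small sketch: the line below is made up in the style of the dataset, and the names line, comma_fields and tab_fields are chosen for illustration.

```python
import csv
import io

# One made-up line in the style of the dataset: review text, a tab, then the rate.
line = "Food, great!\t1\n"

# Splitting on commas: the comma inside the review creates a false column boundary.
comma_fields = next(csv.reader(io.StringIO(line), delimiter=","))

# Splitting on tabs: the review stays in one field and the rate in another.
tab_fields = next(csv.reader(io.StringIO(line), delimiter="\t"))

print(comma_fields)  # ['Food', ' great!\t1']
print(tab_fields)    # ['Food, great!', '1']
```

A CSV file can protect such commas by wrapping the field in quotes, but free text may also contain quotes, so a tab delimiter is the simpler choice here.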

The rate column is equal to 1 or 0: 1 means a positive review and 0 a negative review.

<h3>Importing the libraries</h3>

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

<h3>Importing the dataset</h3>

In [27]:
# quoting=3 (csv.QUOTE_NONE) makes the parser treat double quotes as ordinary characters
dataset = pd.read_csv("Restaurant_reviews.tsv", delimiter="\t", quoting=3)
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


<h3>Cleaning the text</h3>

<p>The first step is to clean the text. In the end, we will create a <b>Bag of Words model</b>, which keeps only the relevant words. That means we will get rid of words like "The", "on", "or", as well as the punctuation, and we will merge the different forms of a word, so that "Walking" is treated the same as "Walk".</p>
<p>Then we will represent each review as a vector with one column per word. There will be a lot of columns, and for each review each column will hold the number of times the associated word appears in that review.</p>
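
<p>As a rough sketch of that representation, the snippet below builds the count vectors by hand for two made-up reviews that are assumed to be already cleaned. The data and the names reviews, vocabulary and matrix are hypothetical and only illustrate the idea.</p>

```python
from collections import Counter

# Two made-up, already-cleaned reviews (hypothetical data, not from the dataset).
reviews = [["love", "place", "love"], ["crust", "good"]]

# The vocabulary gives one column per distinct word, in alphabetical order.
vocabulary = sorted({word for words in reviews for word in words})

# Each review becomes a row of word counts, in vocabulary order.
matrix = []
for words in reviews:
    counts = Counter(words)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['crust', 'good', 'love', 'place']
print(matrix)      # [[0, 0, 2, 1], [1, 1, 0, 0]]
```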

In [32]:
import re
# We will take the first review as an example
review = dataset['Review'][0]
print(review)

Wow... Loved this place.


<h4>First step : Getting rid of irrelevant characters</h4>
Characters like numbers or punctuation will not help the algorithm predict the result, so we will keep only the letters.
We will replace the other characters with a space, because simply deleting them would create incomprehensible words.
Example : take "Wow,There is so beautiful". If we delete the comma, we get a new word "WowThere", which of course makes no sense.
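
The two options can be compared directly on that example. This is a small sketch; the names text, deleted and spaced are chosen for illustration, and the first pattern keeps spaces so that only the comma is deleted.

```python
import re

text = "Wow,There is so beautiful"

# Deleting the unwanted characters glues the neighbouring words together.
# The space inside the character class is kept on purpose.
deleted = re.sub("[^a-zA-Z ]", "", text)

# Replacing every non-letter with a space keeps the words apart.
spaced = re.sub("[^a-zA-Z]", " ", text)

print(deleted)  # WowThere is so beautiful
print(spaced)   # Wow There is so beautiful
```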

In [39]:
# Replace every character that is not a letter with a space
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
review

'Wow    Loved this place '

In [40]:
# Convert all the letters of the review to lowercase
review = review.lower()
review

'wow    loved this place '

<h4>Second step : Keeping only the important words</h4>
<p>In the reviews, some words are very important for deciding whether the review is positive or negative, like "Loved" in the first example. On the contrary, there are words that do not help the algorithm, like "The". These words have no impact on the review: if we delete them, the nature of the review stays the same.</p>
<p>The irrelevant words are listed in a package called <i>stopwords</i> in the NLTK library. We will download this package and use it to check whether a review contains such words.</p>
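
<p>The filtering itself is a simple membership test. The sketch below uses a tiny hand-written stop-word set instead of the NLTK list, so it is only an illustration: the names stop_words, words and kept are hypothetical, and the real English list in NLTK is much longer.</p>

```python
# A tiny, hand-picked stop-word set for illustration only (not the NLTK list).
stop_words = {"the", "this", "on", "or", "is", "a"}

# The first review after cleaning, lowercasing and splitting into words.
words = ["wow", "loved", "this", "place"]

# Keep only the words that are not stop words.
kept = [word for word in words if word not in stop_words]

print(kept)  # ['wow', 'loved', 'place']
```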

In [35]:
# In this review, the word that will help the algorithm predict is "Loved"
# All the irrelevant words are in a package called stopwords in the nltk library
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\L\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [42]:
# Split the review into a list of words.
# Run this cell only once: `review` must still be a string, a list has no split() method.
review = review.split()
review

['wow', 'loved', 'this', 'place']

In [37]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stop-word set once, then drop the stop words and stem the remaining words
english_stopwords = set(stopwords.words('english'))
review = [ps.stem(word) for word in review if word not in english_stopwords]
review

['wow', 'love', 'place']