# Natural Language Processing

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the dataset

In [2]:
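# The file is tab-separated; quoting=3 (csv.QUOTE_NONE) makes pandas treat double quotes as ordinary characters
data = pd.read_csv("Restaurant_Reviews.tsv", sep='\t', quoting=3)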
data

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0
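
The effect of `quoting=3` can be checked in isolation with the standard `csv` module, whose `QUOTE_NONE` constant has the value 3. This is a minimal sketch on a hypothetical one-row sample (the sample text is an assumption, not a row of the dataset):

```python
import csv
import io

# Hypothetical one-row sample in the same tab-separated layout as the dataset.
sample = 'Review\tLiked\n"Best" tacos in town.\t1\n'

# csv.QUOTE_NONE is the constant behind quoting=3: quotes are kept as plain text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first_row = list(reader)

print(csv.QUOTE_NONE)  # 3
print(first_row)       # ['"Best" tacos in town.', '1']
```

Without this setting, a review that contains double quotes could be parsed differently from how it was written.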


## Cleaning the texts

In [10]:
import re  # regular expression operations
import nltk  # Natural Language Toolkit

# Download the list of stop words: common words such as "the" or "an"
# that do not help the prediction and that we want to leave out of the reviews.
nltk.download("stopwords")
from nltk.corpus import stopwords  # make the stop word lists available in the notebook
from nltk.stem.porter import PorterStemmer  # to apply stemming to our reviews

# Stemming keeps only the root of a word, which says enough about what the word means.
# For example, "loved" is reduced to "love".

corpus = []  # will contain all the cleaned reviews

# Build the stemmer and the stop word set once, outside the loop.
# "not" is removed from the stop words because it carries the negation
# ("not good" must not be reduced to "good").
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
all_stopwords = set(all_stopwords)

# Iterate over all the reviews and clean each one
for i in range(len(data)):
    # 1. Replace every character that is not a letter (a-z or A-Z) with a space.
    #    Inside the brackets, ^ means "not". This removes punctuation and digits.
    review = re.sub("[^a-zA-Z]", " ", data["Review"][i])

    # 2. Convert all capital letters to lowercase
    review = review.lower()

    # 3. Split the review into separate words so that each one can be stemmed
    review = review.split()

    # 4. Stem every word that is not an English stop word
    review = [ps.stem(word) for word in review if word not in all_stopwords]

    # 5. Join the words back into a single string, separated by spaces
    review = " ".join(review)

    # 6. Add the cleaned review to the corpus
    corpus.append(review)

[nltk_data] Downloading package stopwords to /Users/Anna/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
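
The first three cleaning steps use only the standard library, so they can be tried on a single review. A small sketch using the first review shown in the dataset preview (the variable names are illustrative):

```python
import re

raw = "Wow... Loved this place."  # first review of the dataset

# Step 1: every non-letter character becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", raw)

# Step 2: lowercase everything.
lowered = letters_only.lower()

# Step 3: split() with no argument also collapses the runs of spaces left by step 1.
words = lowered.split()

print(words)  # ['wow', 'loved', 'this', 'place']
```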


In [12]:
#print(corpus)
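
To see what stop word removal and stemming do to a review, and why "not" is kept, here is a small sketch. It assumes a short hand-written stop word set (`demo_stopwords`, a hypothetical stand-in for the much longer NLTK list) so that it runs without downloading anything:

```python
from nltk.stem.porter import PorterStemmer

# Illustrative stand-in for the NLTK English stop word list (assumption:
# the real list is much longer); "not" is deliberately left out of it.
demo_stopwords = {"is", "the", "this", "and", "was", "just"}

ps = PorterStemmer()

def clean(words):
    """Drop stop words and stem the words that remain."""
    return " ".join(ps.stem(w) for w in words if w not in demo_stopwords)

print(clean(["crust", "is", "not", "good"]))     # crust not good
print(clean(["wow", "loved", "this", "place"]))  # wow love place
```

If "not" were treated as a stop word, the first review would be reduced to "crust good", which reads as a positive review.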

## Creating the Bag of Words model

In [5]:
# Bag of Words: each review is a row and each word found across all the reviews is a column.
# A cell holds the number of times the word occurs in the review (0 if it does not occur).

In [19]:
# Tokenization
from sklearn.feature_extraction.text import CountVectorizer
# max_features caps the number of columns of the matrix by keeping only the most frequent words.
# To choose it, first run without max_features and check len(x[0]).
cv = CountVectorizer(max_features=1500)

# Features: the reviews
# fit_transform takes all the words from all the reviews in the corpus and puts each word
# in its own column; toarray() turns the sparse result into a dense 2D array.
x = cv.fit_transform(corpus).toarray()

# Dependent variable: Liked (the last column of the file)
y = data.iloc[:, -1].values

In [20]:
len(x[0])  # number of words (columns) kept after tokenization

1500

Out of the 1566 distinct words found across all the reviews, we keep the 1500 most frequent ones (`max_features = 1500`). For each review, the column of a word holds the number of times that word appears in the review, and zero if the word does not appear.

## Splitting the dataset into the Training set and Test set

In [21]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x,y, test_size=0.2, random_state=0)
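
With `test_size=0.2`, 1000 reviews are split into 800 for training and 200 for testing, and a fixed `random_state` makes the split reproducible. A quick check on stand-in arrays (`demo_x` and `demo_y` are dummy data with the same shapes as `x` and `y`, an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as x and y (contents are dummy values).
demo_x = np.zeros((1000, 1500), dtype=np.int8)
demo_y = np.arange(1000) % 2

a_train, a_test, b_train, b_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0
)
print(a_train.shape, a_test.shape)  # (800, 1500) (200, 1500)

# The same random_state gives the same split on every run.
_, _, _, b_test_again = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0
)
print(np.array_equal(b_test, b_test_again))  # True
```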

## Training the Naive Bayes model on the Training set

In [22]:
from sklearn.naive_bayes import GaussianNB
classifier=GaussianNB()
classifier.fit(x_train, y_train)

# We can try different models

GaussianNB()
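
As one example of a different model, `MultinomialNB` from the same scikit-learn module is designed for count features such as word counts. The sketch below only shows how it is used on a made-up count matrix (`toy_counts` and `toy_labels` are assumptions); whether it scores better than `GaussianNB` on this dataset is not shown here and would have to be measured:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: column 0 stands for a "positive" word, column 1 for a "negative" one.
toy_counts = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
toy_labels = np.array([1, 1, 0, 0])

alt_classifier = MultinomialNB()
alt_classifier.fit(toy_counts, toy_labels)

toy_pred = alt_classifier.predict(np.array([[1, 0], [0, 1]]))
print(toy_pred)  # [1 0]
```

In the notebook, swapping models would only change the import and the line that creates `classifier`; the `fit` and `predict` calls stay the same.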

## Predicting the Test set results

In [28]:
y_pred=classifier.predict(x_test)
#print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

## Making the Confusion Matrix

In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm=confusion_matrix(y_test, y_pred)
print(cm)


[[55 42]
 [12 91]]


In [26]:
accuracy_score(y_test, y_pred)

0.73
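
The accuracy can be recomputed by hand from the confusion matrix printed above, together with precision and recall for the "liked" class. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, so the four cells are read as follows:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = [[55, 42],
      [12, 91]]

(tn, fp), (fn, tp) = cm

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of the reviews predicted as liked, the share truly liked
recall = tp / (tp + fn)     # of the truly liked reviews, the share that was found

print(round(accuracy, 2))   # 0.73
print(round(precision, 2))  # 0.68
print(round(recall, 2))     # 0.88
```

The 42 false positives (negative reviews predicted as liked) account for most of the errors.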