# Importing the Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset

In [None]:
dataset=pd.read_csv('Restaurant_Reviews.tsv',delimiter="\t",quoting=3)

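quoting=3 (the value of csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters rather than as quoting characters, so reviews that contain quotation marks are read as written.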
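
To see what quoting=3 changes, here is a minimal sketch using Python's built-in csv module, which defines the same constant (csv.QUOTE_NONE equals 3) that pandas accepts. The two-line sample is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical sample: one review that contains double quotes.
raw = 'Review\tLiked\n"Great" food\t1\n'

# Default quoting: a leading double quote is treated as a quoting character.
rows_default = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# QUOTE_NONE (the integer 3): double quotes are kept as ordinary characters.
rows_none = list(
    csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(csv.QUOTE_NONE)   # the integer passed as quoting=3
print(rows_default[1])  # the quotes around Great are consumed by the parser
print(rows_none[1])     # the quotes survive as part of the review text
```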

To upload the dataset in Google Colab:

1.   Upload the dataset to Google Drive.
2.   Click the "Files" icon on the left-hand side.
3.   Click "Permit access to Google Drive files" (the Drive icon). Once Drive is mounted, the icon shows a cross through it and reads "Unmount Drive".
4.   Click the upload icon and select the "Restaurant_Reviews.tsv" file.


# Cleaning the Texts

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus_review=[]        #list that will hold all the cleaned reviews
ps=PorterStemmer()      #stemmer object, created once and reused for every review
#Using stopwords.words('english') directly would also remove "not",
#so "not good" would become "good" and the review would lose its meaning.
#Remove "not" from the stopword list so it stays in the reviews.
all_stopwords=stopwords.words('english')
all_stopwords.remove('not')
stopword_set=set(all_stopwords)  #build the set once for fast membership checks
for i in range(len(dataset)):
  #replace every non-letter character with a space
  review=re.sub('[^a-zA-Z]',' ',dataset['Review'][i])
  review=review.lower()   #convert the review to lowercase
  review=review.split()   #split the review into words so stemming can be applied
  #stem every word of the review that is not a stopword
  review=[ps.stem(w) for w in review if w not in stopword_set]
  #join the stemmed words back into a single string
  review=' '.join(review)
  #add the cleaned review to the corpus_review list
  corpus_review.append(review)

In [None]:
print(corpus_review)

 

1.   The nltk library helps us remove stopwords such as "the" and "a".
2.   The stopword list is imported from the corpus module of the nltk library.
3.   PorterStemmer is used to apply stemming to our reviews.
4.   Stemming keeps only the root of each word, which is enough to indicate what the word means (for example, "loved" becomes "love").
5.   The sparse matrix contains all the different words from all the reviews. It is built when the bag of words model is created.
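
The cleaning steps, and the reason for keeping "not", can be sketched without NLTK. This is a simplified illustration: `TOY_STOPWORDS` is a tiny made-up stand-in for NLTK's English stopword list, the `clean` helper is hypothetical, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical stand-in for NLTK's stopword list (the real list is much longer).
TOY_STOPWORDS = {"the", "was", "is", "a", "this", "not"}

def clean(text, stop_words):
    """Keep letters only, lowercase, split into words, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stop_words)

review = "The crust was NOT good!!"

# With "not" treated as a stopword, the negation disappears.
print(clean(review, TOY_STOPWORDS))            # crust good

# With "not" taken out of the stopword set, the meaning is preserved.
print(clean(review, TOY_STOPWORDS - {"not"}))  # crust not good
```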







# Creating the Bag of Words Model



1.   The sparse matrix has one row per review and one column per distinct word taken from all the reviews.
2.   Each cell holds the number of times the column's word appears in the row's review. It is 0 if the word is not in the review and 1 or more if the word is part of the review.
3.   Splitting the reviews into words is called tokenization. Each distinct word then becomes one column of the matrix.
4.   The bag of words model is this sparse matrix: its columns are all the different words that remain in the reviews after cleaning.
5.   The sparse matrix will be the matrix of features. Together with the dependent variable vector, which contains the binary outcome, it will be used to train the ML model (a Naive Bayes model).
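
As a rough illustration of the idea, the matrix can be built by hand for a tiny made-up corpus of already-cleaned reviews. This sketch only mirrors the concept; it is not how CountVectorizer is implemented.

```python
# Hypothetical cleaned reviews, one string per review.
corpus = ["wow love place", "crust not good", "good food good price"]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for review in corpus for word in review.split()})

# One row per review; each cell counts how often the column's word occurs.
matrix = [[review.split().count(term) for term in vocabulary] for review in corpus]

print(vocabulary)
for row in matrix:
    print(row)
# ['crust', 'food', 'good', 'love', 'not', 'place', 'price', 'wow']
# [0, 0, 0, 1, 0, 1, 0, 1]
# [1, 0, 1, 0, 1, 0, 0, 0]
# [0, 1, 2, 0, 0, 0, 1, 0]
```

Most cells are zero because each review uses only a few of the words, which is why the matrix is called sparse.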



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
#tokenization is done by the CountVectorizer class from scikit-learn's feature_extraction module
cv=CountVectorizer(max_features=1500)
#max_features caps the number of columns (words), ie the width of the matrix
#only the most frequent words are kept, so rare words that add little are excluded
#CountVectorizer creates the matrix of features, ie the sparse matrix
X=cv.fit_transform(corpus_review).toarray()
#fit learns the vocabulary from the corpus
#transform turns each review into a row of word counts; toarray makes the result a dense array
y=dataset.iloc[:,-1].values
#dependent variable y takes the last column of the dataset


In [None]:
len(X[0])
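
The effect of max_features can be sketched with a plain word count on a made-up corpus. This is a simplified illustration, not CountVectorizer's actual implementation; in particular, ties between equally frequent words may be broken differently.

```python
from collections import Counter

# Hypothetical cleaned reviews.
corpus = ["good food good price", "crust not good", "food not fresh"]
max_features = 3  # keep only the 3 most frequent words

# Count every word across the whole corpus.
word_counts = Counter(word for review in corpus for word in review.split())

# Keep the most frequent words as columns, in alphabetical order.
kept = sorted(word for word, _ in word_counts.most_common(max_features))

# Each review becomes a row with one count per kept word.
rows = [[review.split().count(word) for word in kept] for review in corpus]

print(kept)          # ['food', 'good', 'not']
print(len(rows[0]))  # 3, the same check as len(X[0]) in the notebook
```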

# Splitting the dataset into training and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

# Training the Naive Bayes model on the training set

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier=GaussianNB()
classifier.fit(X_train,y_train)

# Predicting the test set results

In [None]:
y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))





1.   y_pred = vector of predictions made by the Naive Bayes model (left column of the printed array)
2.   y_test = vector of real results containing the real outcome of each review (right column of the printed array)



# Making the confusion matrix



1.   accuracy_score is the number of correct predictions divided by the total number of observations in the test set



In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(y_test,y_pred)
print(cm)
accuracy_score(y_test,y_pred)
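
Both metrics can be reproduced by hand on a small example. The labels below are made up for illustration; the layout of the matrix follows scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

# Same layout as confusion_matrix for labels [0, 1]: [[TN, FP], [FN, TP]].
cm_by_hand = [[tn, fp], [fn, tp]]

# Accuracy: correct predictions divided by all observations.
accuracy = (tn + tp) / len(y_true)

print(cm_by_hand)  # [[3, 1], [1, 3]]
print(accuracy)    # 0.75
```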

# Predicting whether a single review is positive or negative

# Positive review

Use our model to predict if the following review:

"I love this restaurant so much"

is positive or negative.



Solution: We just repeat the same text preprocessing process we did before, but this time with a single review.

In [None]:
new_review = 'I love this restaurant so much'
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
all_stopwords = set(all_stopwords)  # build the set once, not once per word
new_review = [ps.stem(word) for word in new_review if word not in all_stopwords]
new_review = ' '.join(new_review)
new_corpus = [new_review]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)


The review was correctly predicted as positive by our model.

# Negative review

Use our model to predict if the following review:

"I hate this restaurant so much"

is positive or negative.

Solution: We just repeat the same text preprocessing process we did before, but this time with a single review.

In [None]:
new_review = 'I hate this restaurant so much'
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
all_stopwords = set(all_stopwords)  # build the set once, not once per word
new_review = [ps.stem(word) for word in new_review if word not in all_stopwords]
new_review = ' '.join(new_review)
new_corpus = [new_review]
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)

The review was correctly predicted as negative by our model.
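
Since the positive and negative examples repeat the same preprocessing, the steps could be wrapped in helper functions. The sketch below is one possible refactoring; `preprocess` and `predict_sentiment` are hypothetical names that do not appear in the notebook, and the demo at the end uses a made-up stopword set and a no-op stemmer so that it runs on its own.

```python
import re

def preprocess(text, stem, stop_words):
    """Clean one review with the same steps used for the training corpus."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stem(w) for w in words if w not in stop_words)

def predict_sentiment(text, stem, stop_words, vectorizer, model):
    """Clean a review, vectorize it with a fitted vectorizer, and predict."""
    cleaned = preprocess(text, stem, stop_words)
    features = vectorizer.transform([cleaned]).toarray()
    return model.predict(features)[0]

# Stand-alone demo of the cleaning step (assumed toy inputs, not NLTK's).
toy_stop_words = {"i", "this", "so"}
print(preprocess("I hate this restaurant so much", lambda w: w, toy_stop_words))
# hate restaurant much

# In the notebook, the call would look something like:
# predict_sentiment('I love this restaurant so much',
#                   ps.stem, set(all_stopwords), cv, classifier)
```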