# Sentiment Analysis of Customer Reviews for a Restaurant

# Overview

In this project, we'll build a two-way polarity (positive vs. negative) classifier for customer feedback on a restaurant, without using NLTK's built-in sentiment analysis engine.

We'll create our own pre-processing module to handle the raw text reviews from the customers. 

Next, we'll build a bag-of-words model and train several classification models on it, evaluating the performance of each.

# Data Used

Restaurant_Reviews.tsv: This file contains ~1000 raw reviews, along with their polarity labels (1 = positive, 0 = negative). We'll use this file to train our classifiers.

# Preprocessing

The first step is to preprocess the reviews so that they are easier to work with and ready for feature extraction and for training the classifiers.

For the preprocessing, we'll:

- strip everything except letters (punctuation, digits and so on)
- lowercase all the words
- split each review string into a list of words (tokenize)
- remove stopwords (irrelevant words such as 'is', 'an', 'the', 'this', which add nothing useful for our model)
- stem the remaining words (reduce each word to its root, e.g. 'loved' becomes 'love')
- create a bag-of-words model
- split the dataset into a training set and a test set

In [1]:
# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [3]:
# Import Dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3 )

In [8]:
# Preview the Data
dataset.head()

                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


In [9]:
# Cleaning Text
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()  # create the stemmer once, not once per review
stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word
corpus = []
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # keep letters only
    review = review.lower()
    review = review.split()  # split the review string into a list of words
    review = [ps.stem(word) for word in review if word not in stop_words]  # drop stopwords, stem the rest
    review = ' '.join(review)
    corpus.append(review)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kathurhi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [35]:
corpus[0:5]

[u'wow love place',
 'crust good',
 u'tasti textur nasti',
 u'stop late may bank holiday rick steve recommend love',
 u'select menu great price']
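
One thing this preview makes visible: "Crust is not good." is reduced to `crust good`, because "not" is treated as a stopword and the negation is dropped before any classifier sees it. The sketch below shows one possible workaround, which is to exempt selected words from stopword removal. It is only an illustration under stated assumptions: `STOPWORDS` is a tiny hand-written set rather than NLTK's list, stemming is skipped so the snippet runs without downloads, and the helper name `clean_review` and its `keep` parameter are hypothetical.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
STOPWORDS = {"is", "an", "the", "this", "and", "was", "just", "not"}

def clean_review(text, keep=()):
    """Strip non-letters, lowercase, split, and drop stopwords not listed in `keep`."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    stop = STOPWORDS - set(keep)
    return " ".join(w for w in words if w not in stop)

print(clean_review("Crust is not good."))                # crust good
print(clean_review("Crust is not good.", keep=["not"]))  # crust not good
```

Whether keeping negation words actually improves the classifier on this dataset would need to be measured; the sketch only shows where such a change would go.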

In [13]:
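# Creating a bag-of-words model
# fit_transform returns a sparse matrix; toarray() converts it to a dense array
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)  # keep at most the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values  # labels from the 'Liked' column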
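
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what the cell above does conceptually: build a vocabulary from the cleaned reviews, then count how often each vocabulary word occurs in each review. It is not how `CountVectorizer` is implemented (for instance, the default `CountVectorizer` tokenizer also ignores single-character tokens), and the function name `bag_of_words` and the three-review toy corpus are made up for illustration.

```python
from collections import Counter

def bag_of_words(corpus, max_features=None):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in corpus[i]."""
    totals = Counter(word for doc in corpus for word in doc.split())
    # Keep the most frequent words, then sort them alphabetically for stable columns.
    kept = [word for word, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    rows = []
    for doc in corpus:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocab, X_toy = bag_of_words(["wow love place", "crust good", "love good crust crust"])
print(vocab)  # ['crust', 'good', 'love', 'place', 'wow']
print(X_toy)  # [[0, 0, 1, 1, 1], [1, 1, 0, 0, 0], [2, 1, 1, 0, 0]]
```

Each row is one review and each column is one word, which is the same shape of data that `X` holds for the real corpus.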

In [15]:
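# Splitting the dataset into the Training set and Test set
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in newer releases)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)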
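
As a rough picture of what the split does, the sketch below shuffles row indices with a fixed seed and holds out 20% of them. It is an assumption-laden illustration, not scikit-learn's algorithm: the helper `split_indices` is hypothetical, and it will not reproduce the exact rows that `train_test_split` selects with `random_state = 0`. It does show why 1000 reviews at `test_size = 0.20` leave 800 rows for training and 200 for testing.

```python
import random

def split_indices(n_samples, test_size=0.20, seed=0):
    """Shuffle indices reproducibly and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split every run
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1000)
print("%d train rows, %d test rows" % (len(train_idx), len(test_idx)))  # 800 train rows, 200 test rows
```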



# Classifying

# Naive Bayes classifier

In [16]:
#Fitting Naive Bayes classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [17]:
#Predict Test set result
y_pred = classifier.predict(X_test)

In [18]:
y_pred

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int64)

In [19]:
y_test

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1], dtype=int64)

In [20]:
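# Making the confusion matrix and evaluating the Naive Bayes classifier
from sklearn.metrics import confusion_matrix  # confusion_matrix is a function, not a class
cm = confusion_matrix(y_test, y_pred)
# Rows are actual labels, columns are predicted labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = [int(v) for v in cm.ravel()]
# Whole-number percentages (// truncates the fractional part)
Accuracy_Percentage = (tn + tp) * 100 // (tn + fp + fn + tp)
Precision_Percentage = tp * 100 // (tp + fp)
Recall_Percentage = tp * 100 // (tp + fn)

In [21]:
cm

array([[55, 42],
       [12, 91]], dtype=int64)

In [22]:
Accuracy_Percentage

73

In [23]:
Precision_Percentage

68

In [24]:
Recall_Percentage

88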
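
The percentages above are whole numbers, so the fractional part is lost. As a worked check, the sketch below recomputes the same metrics from the confusion matrix shown earlier without truncation, and also derives the F1 score (the harmonic mean of precision and recall), which the notebook itself does not report. The helper name `binary_metrics` is hypothetical; the matrix layout `[[TN, FP], [FN, TP]]` is the one `confusion_matrix` returns for labels 0 and 1.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 from cm = [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / float(tp + tn + fp + fn)
    precision = tp / float(tp + fp)  # of all predicted positives, how many were right
    recall = tp / float(tp + fn)     # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([[55, 42], [12, 91]])
print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f" % (acc, prec, rec, f1))
# accuracy=0.730 precision=0.684 recall=0.883 f1=0.771
```

The gap between recall and precision reflects the 42 negative reviews that were predicted as positive, compared with only 12 positive reviews predicted as negative.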