# Sentiment Analysis of Customer Feedback Using Python

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [2]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


### The command below shows that the input dataset has 1000 rows and 2 columns.

In [3]:
# Shape of the dataset
print(dataset.shape)

(1000, 2)


## Data Preprocessing

### Cleaning the texts
#### Punctuation and numbers add little value to the final analysis and can reduce model performance, so they are removed from the input text.
#### Stemming reduces a word to its root by stripping its suffix. For example, removing 'ing' from 'walking' gives the root word 'walk'. This process is applied to every word in the input text.
#### Stopwords are the most common words in a natural language, such as 'the', 'in', 'on', 'over', and 'is'. They add little value when analyzing text and building NLP models, so they are removed. The word 'not' is kept because it reverses the sentiment of a review.
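
The cleaning steps can be tried on a single sentence before running them over the whole dataset. The sketch below is a simplified illustration only: `TOY_STOPWORDS` and `clean_review` are hypothetical names, the stopword set is a tiny hand-picked subset rather than NLTK's English list, and stemming is left out because it needs the Porter stemmer from NLTK.

```python
import re

# Hypothetical, hand-picked stopword subset for illustration only.
# 'not' is deliberately absent so that negations survive the cleaning.
TOY_STOPWORDS = {"the", "is", "and", "was", "this", "just"}

def clean_review(text):
    """Keep letters only, lowercase the text, and drop the toy stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in TOY_STOPWORDS)

print(clean_review("Crust is not good."))        # crust not good
print(clean_review("Wow... Loved this place."))  # wow loved place
```

Without stemming, 'loved' stays as it is; the Porter stemmer used in the notebook reduces it to 'love'.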

In [4]:
import re
import nltk
nltk.download('stopwords')
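# Stopword list, used to remove stopwords
from nltk.corpus import stopwords
# Porter stemmer, used to reduce words to their roots
from nltk.stem.porter import PorterStemmer
corpus = []
# Build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep 'not' because it reverses the sentiment
all_stopwords = set(all_stopwords)
for i in range(len(dataset)):
  # Keep letters only, lowercase the text, and split it into words
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
  review = review.lower()
  review = review.split()
  # Stem every word that is not a stopword
  review = [ps.stem(word) for word in review if word not in all_stopwords]
  review = ' '.join(review)
  corpus.append(review)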

[nltk_data] Downloading package stopwords to /home/nbuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Reviews before and after applying the code above

In [5]:
#print(corpus)
print("Before Applying The Code")
print("------------------------")
#print(dataset.Review[:10].to_string(index=False))
print(list(dataset.Review[:10]))
print("After Applying The Code")
print("------------------------")
print(corpus[:10])    

Before Applying The Code
------------------------
['Wow... Loved this place.', 'Crust is not good.', 'Not tasty and the texture was just nasty.', 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.', 'The selection on the menu was great and so were the prices.', 'Now I am getting angry and I want my damn pho.', "Honeslty it didn't taste THAT fresh.)", 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.', 'The fries were great too.', 'A great touch.']
After Applying The Code
------------------------
['wow love place', 'crust not good', 'not tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price', 'get angri want damn pho', 'honeslti tast fresh', 'potato like rubber could tell made ahead time kept warmer', 'fri great', 'great touch']


## Creating the Bag of Words model using CountVectorizer

#### A bag-of-words is a representation of text that describes the occurrence of words in a document. It involves two things:
#### A vocabulary of known words.
#### A measure of the presence of known words.

#### Potential features are extracted from the cleaned dataset and converted to numerical form. A vectorization technique (here, bag-of-words) converts the text into numbers by building a matrix in which each column represents a feature (a word in the vocabulary) and each row represents an individual review.
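
To make the matrix concrete, the following is a minimal pure-Python sketch of the bag-of-words idea. The helper name `bag_of_words` is hypothetical, and the third document is made up to show a count above 1; `CountVectorizer` additionally handles tokenization and the `max_features` limit, which this sketch leaves out.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one row of word counts per document."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Two cleaned reviews from the corpus plus one made-up example
docs = ['crust not good', 'great touch', 'good good food']
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)  # ['crust', 'food', 'good', 'great', 'not', 'touch']
for row in matrix:
    print(row)
# [1, 0, 1, 0, 1, 0]
# [0, 0, 0, 1, 0, 1]
# [0, 1, 2, 0, 0, 0]
```

Each row is one review and each column is one vocabulary word, which is the same layout as the matrix `X` built in the next cell.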

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# max_features keeps only the 1500 most frequent words in the corpus as features.
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

In [7]:
# Keep the cleaned review text for the final submission. The split below uses the
# same test_size and random_state as the split of X and y in the next cell, so
# X1_test lines up row for row with the predictions made on X_test.
X1 = corpus
y1 = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.20, random_state = 0)

## Splitting the dataset into the Training set and Test set

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## Training the Naive Bayes model on the Training set

In [9]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix:\n",cm)
#accuracy_score(y_test, y_pred)
acc_log1 = round(accuracy_score(y_test,y_pred)*100, 2)
print('accuracy percentage on test dataset is', acc_log1)

Confusion Matrix:
 [[55 42]
 [12 91]]
accuracy percentage on test dataset is 73.0
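
The accuracy figure can be recomputed by hand from the confusion matrix. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below uses the Naive Bayes matrix printed above; precision and recall for the positive class are added as an illustration and are not part of the original notebook.

```python
# Confusion matrix printed for the Naive Bayes model:
# rows = true label (0, 1), columns = predicted label (0, 1)
cm = [[55, 42],
      [12, 91]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of truly positive reviews that were found

print(round(accuracy * 100, 2))   # 73.0
print(round(precision * 100, 2))  # 68.42
print(round(recall * 100, 2))     # 88.35
```

The 42 false positives show that this model often labels negative reviews as positive, which accuracy alone does not reveal.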


### Training the Random Forest model on the Training set

In [10]:
from sklearn.ensemble import RandomForestClassifier
# No random_state is set, so the results below can vary slightly between runs
classifier = RandomForestClassifier(n_estimators = 501,
                                    criterion = 'entropy')
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test) 

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix:\n",cm)
#accuracy_score(y_test, y_pred)
acc_log2 = round(accuracy_score(y_test,y_pred)*100, 2)
print('accuracy percentage on test dataset is', acc_log2)


Confusion Matrix:
 [[89  8]
 [39 64]]
accuracy percentage on test dataset is 76.5


### Training the Logistic Regression model on the Training set

In [11]:
from sklearn import linear_model
classifier = linear_model.LogisticRegression(C=1.5)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix:\n",cm)
#accuracy_score(y_test, y_pred)
acc_log3 = round(accuracy_score(y_test,y_pred)*100, 2)
print('accuracy percentage on test dataset is', acc_log3)


Confusion Matrix:
 [[79 18]
 [27 76]]
accuracy percentage on test dataset is 77.5


### Evaluating The Best Model

In [12]:
models = pd.DataFrame({
    'Model': ['Naive Bayes', 'Random Forest', 'Logistic Regression'],
    'Score': [acc_log1, acc_log2, acc_log3]})
models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
2,Logistic Regression,77.5
1,Random Forest,76.5
0,Naive Bayes,73.0


### Predicting the Test Data with the Logistic Regression Model and Writing the Test Data and Predicted Column to a CSV File

In [13]:
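# X1_test holds the cleaned test reviews; y_pred holds the predictions of the
# last model trained above (Logistic Regression)
df1 = pd.DataFrame(X1_test)
df2 = pd.DataFrame(y_pred)
df2.columns = ['Predicted Type']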
submission = pd.concat([df1,df2],axis = 1)
submission.to_csv('submission.csv', index=False)
submission

Unnamed: 0,0,Predicted Type
0,present food aw,0
1,worst food servic,0
2,never dine place,0
3,guess mayb went night disgrac,0
4,sushi lover avoid place mean,0
5,ambianc much better,0
6,hole wall great mexican street taco friendli s...,1
7,food bad enough enjoy deal world worst annoy d...,0
8,never ever go back,0
9,atmospher fun,1
