# SENTIMENT ANALYSIS OF RESTAURANT REVIEWS

# Creation of the initial dataset (30/11/2022 – 05/12/2022)

Let's first create a dataframe named 'df' from the provided TSV file of restaurant reviews. We will use pandas for this.

In [16]:
# Importing essential libraries
import pandas as pd

df = pd.read_table('Restaurant_Reviews.tsv',delimiter='\t')
df

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0


Now we will process the data in our dataframe. As we can see, the review sentences include capital letters, punctuation, spelling mistakes, and unneeded words,
which need to be dealt with before feeding the data to the model.
Here we replace every non-letter character with a space and convert the text to lower case.

In [17]:
import re   # Regular Expression package

# Cleaning the reviews
cleaned_reviews = []
for i in df['Review']:
    # Replace every non-letter character with a space
    review = re.sub(pattern='[^a-zA-Z]', repl=' ', string=i)
    # Convert the entire review to lower case
    review = review.lower()
    # Collect the cleaned review; the list becomes a new column in the next cell
    cleaned_reviews.append(review)



In [18]:
df['Cleaned reviews']=cleaned_reviews  #adding column to the dataframe
df

Unnamed: 0,Review,Liked,Cleaned reviews
0,Wow... Loved this place.,1,wow loved this place
1,Crust is not good.,0,crust is not good
2,Not tasty and the texture was just nasty.,0,not tasty and the texture was just nasty
3,Stopped by during the late May bank holiday of...,1,stopped by during the late may bank holiday of...
4,The selection on the menu was great and so wer...,1,the selection on the menu was great and so wer...
...,...,...,...
995,I think food should have flavor and texture an...,0,i think food should have flavor and texture an...
996,Appetite instantly gone.,0,appetite instantly gone
997,Overall I was not impressed and would not go b...,0,overall i was not impressed and would not go b...
998,"The whole experience was underwhelming, and I ...",0,the whole experience was underwhelming and i ...
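The same cleaning can also be written without an explicit loop. The following is a minimal sketch, assuming a small hypothetical dataframe `demo` in place of `df`; unlike the loop above, it additionally collapses repeated spaces and trims the ends, which is an assumption about what is wanted and not part of the original cleaning.

```python
import pandas as pd

# Hypothetical sample standing in for df['Review']
demo = pd.DataFrame({"Review": ["Wow... Loved this place.", "Crust is not good."]})

# Replace non-letters with a space, lower-case, then collapse runs of whitespace
demo["Cleaned reviews"] = (
    demo["Review"]
    .str.replace(r"[^a-zA-Z]", " ", regex=True)
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(demo["Cleaned reviews"].tolist())
```

Note that replacing apostrophes with spaces splits contractions, so "didn't" becomes "didn t", as can be seen in the cleaned output of this notebook.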


In [19]:
#removing unwanted column
del df['Review']
df.head(10)

Unnamed: 0,Liked,Cleaned reviews
0,1,wow loved this place
1,0,crust is not good
2,0,not tasty and the texture was just nasty
3,1,stopped by during the late may bank holiday of...
4,1,the selection on the menu was great and so wer...
5,0,now i am getting angry and i want my damn pho
6,0,honeslty it didn t taste that fresh
7,0,the potatoes were like rubber and you could te...
8,1,the fries were great too
9,1,a great touch


# Data Extraction through web (06/12/2022) - (13/12/2022)  


# Web scraping
Now we will extract restaurant reviews from the web by web scraping.
We will need two libraries: 'requests' and 'BeautifulSoup' (from the 'bs4' package).
Website: magicpin.in



First we will scrape reviews for multiple restaurants that have only one page of reviews each.

In [20]:
#  now scraping from multiple urls
    
import requests
from bs4 import BeautifulSoup
url=['https://magicpin.in/Mumbai/Mumbai-Central/Restaurant/Food-Box/store/2ac94/reviews/',
     'https://magicpin.in/Mumbai/Town-Hall/Restaurant/Aromas-Cafe-and-Bistro/store/57b3b1/reviews/',
    'https://magicpin.in/Mumbai/Bhuleshwar/Restaurant/Jilani-Fast-Food-Corner/store/35c2b9/reviews/',
    'https://magicpin.in/Mumbai/Fort/Restaurant/Lunchbox---Meals-And-Thalis/store/57bb52/reviews/',
    'https://magicpin.in/Mumbai/Lokhandwala/Restaurant/Ubq-By-Barbeque-Nation/store/13074a3/reviews/',
     'https://magicpin.in/Mumbai/Goregaon-West/Restaurant/Green-Leaf---Only-Veg/store/5cc111/reviews/',
     'https://magicpin.in/Mumbai/Poonam-Sagar-Complex/Restaurant/Celebration-Point/store/5970cc/reviews/',
     'https://magicpin.in/Mumbai/Kandivali-East/Restaurant/Kusum-Rolls/store/399275/reviews/'
    
    ]
newlist=[]  # temporary list for storing the scraped reviews

for a in url:
    site=requests.get(a, timeout=10)  # timeout so that a dead connection cannot hang the loop
    soup = BeautifulSoup(site.content, 'html.parser')
    s=soup.find('section', class_='merchant-brick merchant-ratings merchant-recent-ratings')
    if s is None:  # the reviews section is missing on this page, so skip it
        continue
    content = s.find_all('p', class_='review')
    for j in content:
        newlist.append(j.text)

newlist1=[]  # formatting the data and storing it in newlist1
for i in newlist:
    x=i.strip()
    y=x.replace('\n','')
    newlist1.append(y)
newlist1

ConnectionError: HTTPSConnectionPool(host='magicpin.in', port=443): Max retries exceeded with url: /Mumbai/Mumbai-Central/Restaurant/Food-Box/store/2ac94/reviews/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000023A8D607760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
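The `ConnectionError` above (`getaddrinfo failed`) means the host name could not be resolved, so no page was downloaded at all. Below is a minimal sketch of how the download and the parsing could be separated so that a network failure skips one URL instead of stopping the whole cell. The helper names `extract_reviews` and `fetch_reviews` and the sample HTML are hypothetical; only the class names are taken from the cell above.

```python
import requests
from bs4 import BeautifulSoup

SECTION_CLASS = 'merchant-brick merchant-ratings merchant-recent-ratings'

def extract_reviews(html):
    """Return the stripped review texts found in one page of HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    section = soup.find('section', class_=SECTION_CLASS)
    if section is None:          # layout changed or the page has no reviews
        return []
    return [p.text.strip().replace('\n', '')
            for p in section.find_all('p', class_='review')]

def fetch_reviews(page_url):
    """Download one page; return [] instead of raising on network errors."""
    try:
        response = requests.get(page_url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print('Skipping', page_url, '->', err)
        return []
    return extract_reviews(response.content)

# Made-up HTML that mimics the structure assumed above
sample_html = '''
<section class="merchant-brick merchant-ratings merchant-recent-ratings">
  <p class="review"> Great food </p>
  <p class="review">Slow service</p>
</section>
'''
print(extract_reviews(sample_html))
```

Keeping the parser separate from the download also makes it possible to check the parsing logic on saved HTML without any network access.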



Now we will scrape reviews from one restaurant whose reviews span multiple pages. We will write the code that extracts the review sentences from one page and iterate it over all the pages; in this case there are 59 pages.



In [None]:
# let's scrape some more from another restaurant URL with multiple pages

import requests
from bs4 import BeautifulSoup
url='https://magicpin.in/Mumbai/Bandra-West/Restaurant/Barbeque-Nation/store/23368/reviews'
newlist=[]
for i in range(1,60):  # there are 59 pages of reviews
    site=requests.get(url + '?page=' + str(i), timeout=10)
    soup = BeautifulSoup(site.content, 'html.parser')
    s=soup.find('section', class_='merchant-brick merchant-ratings merchant-recent-ratings')
    if s is None:  # no reviews section on this page, so skip it
        continue
    content = s.find_all('p', class_='review')
    for j in content:
        newlist.append(j.text)

# newlist1 was created in the previous scraping cell
for i in newlist:
    x=i.strip()
    y=x.replace('\n','')
    newlist1.append(y)  # adding the newly scraped reviews to newlist1
newlist1

The scraped reviews contain duplicates, which should be removed before going further. This ensures that our model is trained on unique and varied data.

In [None]:
#removing duplicate reviews from list

newlist1=[*set(newlist1)]
newlist1
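Converting to a `set` removes duplicates but does not keep the original order of the reviews, and the order can differ between runs. If a stable order matters, a small alternative is sketched here, assuming a hypothetical list `scraped` in place of `newlist1`.

```python
# Hypothetical scraped list with one repeated review
scraped = ["good food", "bad service", "good food", "nice place"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_reviews = list(dict.fromkeys(scraped))
print(unique_reviews)  # ['good food', 'bad service', 'nice place']
```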


We will be adding the scraped data to our provided dataset, so we will first clean and format the scraped data in the same way as we did earlier.

In [None]:

import re   # Regular Expression package
from nltk.corpus import stopwords   # stopwords package
from nltk.stem.porter import PorterStemmer   #porter stemmer for stemming words

# Cleaning the reviews
cleaned_newreviews=[]
for i in newlist1:
    # Replace every non-letter character with a space
    newreview = re.sub(pattern='[^a-zA-Z]',repl=' ', string=i)
    # Convert the entire review to lower case
    newreview = newreview.lower()
    # Collect the cleaned review in the cleaned_newreviews list
    cleaned_newreviews.append(newreview)
   
        
    

In [None]:
cleaned_newreviews   # displaying the cleaned list of scraped reviews.

Before adding these scraped reviews to our original dataframe, we need to classify them. Let's create a dataframe df2 for these reviews to process them.

In [None]:
# creating new dataframe df2 from the cleaned_newreviews list;
# the 'Liked' values based on review polarity are added in a later cell
df2=pd.DataFrame({"Cleaned reviews":cleaned_newreviews})
df2

As we can see, it has 173 new scraped reviews.

Now we will classify the scraped reviews with k-means.

First we will convert the words to numerical arrays by using the count vectorizer.

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

Tvect = TfidfVectorizer(stop_words='english')
Cvect = CountVectorizer()



X = Cvect.fit_transform(cleaned_newreviews)
X.toarray()

In [None]:
# Create "Liked" feature column in dataframe


from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, max_iter=10000)
df2["Liked"] = kmeans.fit_predict(X)
df2["Liked"] = df2["Liked"].astype("category")

In [None]:
df2.head(50)
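K-means is unsupervised, so the ids 0 and 1 it assigns are arbitrary cluster numbers. Nothing guarantees that cluster 1 means "liked", or that the two clusters separate by sentiment at all. A minimal sketch of how the clusters could be inspected before trusting them as labels follows; the toy reviews are hypothetical, and `n_init` and `random_state` are set here only as an assumption to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, not the scraped data
toy_reviews = ["food was great", "great food great place",
               "service was bad", "bad bad service"]

X_toy = CountVectorizer().fit_transform(toy_reviews)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_toy)

# Print the members of each cluster so a human can decide what each id means
for cluster_id in (0, 1):
    members = [r for r, lab in zip(toy_reviews, labels) if lab == cluster_id]
    print(cluster_id, members)
```

Reading a sample of each cluster in this way is one option for checking whether the ids need to be swapped or individual labels corrected.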

# UPDATING DATAFRAME WITH SCRAPED REVIEWS 

In [None]:
df  # checking original clean dataset

In [None]:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df3=pd.concat([df, df2], ignore_index=True) # adding the new cleaned reviews dataframe to the original dataframe.

df3 is our latest dataframe, including the initial cleaned data as well as the newly extracted and cleaned data.

In [None]:

df3.to_csv('updated_reviews_with_scraped.csv', index=False)  # let's save df3 to csv, as scraping takes time.
df3 # df3 is our latest dataframe, including the initial cleaned data as well as the newly extracted and cleaned data.

# Exploratory Data Analysis (14/12/2022 – 19/12/2022) 

We will do EDA on our new df3 dataframe.

We will look at the information about our data.

Before loading the dataframe below, we verified whether the classification was done accurately and corrected it where it was not.

In [None]:
# we have exported our dataframe df3 as a csv file; let's load the verified and edited version.
import pandas as pd
df3=pd.read_csv('updated_reviews_with_scraped_edited.csv')
df3.info()

Here we observe that some values are missing, so we remove the null values by dropping the rows that contain them.

In [None]:
df3=df3.dropna()
df3=df3.reset_index(drop=True) 

In [None]:
df3

Sentiment count:
Here we will look at the number of negative and positive reviews in the dataset to check whether it is balanced. We will plot a pie chart.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig=plt.figure(figsize=(10,10))
colors=["blue",'pink']
pos=df3[df3['Liked']==1]
neg=df3[df3['Liked']==0]
ck=[pos['Liked'].count(),neg['Liked'].count()]
legpie=plt.pie(ck,labels=["Positive","Negative"],
                 autopct ='%1.1f%%', 
                 shadow = True,
                 colors = colors,
                 startangle = 45,
                 explode=(0, 0.1))
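The balance can also be read off as numbers, which is easier to quote than a chart. A minimal sketch, assuming a hypothetical series `liked` in place of `df3['Liked']`:

```python
import pandas as pd

# Hypothetical labels standing in for df3['Liked']
liked = pd.Series([1, 0, 1, 1, 0, 1], name="Liked")

counts = liked.value_counts()                 # absolute count per class
shares = liked.value_counts(normalize=True)   # fraction per class
print(counts.to_dict())
print(shares.round(2).to_dict())
```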

# Text visualization.

In [None]:
df4 = df3["Cleaned reviews"].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0).reset_index()
df4.columns = ["words", "counts"]
df4.head()
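The top-level `pd.value_counts` function used above is deprecated in recent pandas releases, and applying it row by row builds a wide intermediate table. An alternative way to get the same kind of word/count table is sketched here, assuming a hypothetical series `reviews` in place of `df3["Cleaned reviews"]`. Note one difference: `str.split()` without an argument drops the empty strings that `split(" ")` produces for repeated spaces.

```python
import pandas as pd

# Hypothetical cleaned reviews standing in for df3["Cleaned reviews"]
reviews = pd.Series(["great food great place", "bad food"])

word_counts = (
    reviews.str.split()      # split on any whitespace, dropping empty strings
    .explode()               # one row per word
    .value_counts()          # sorted by count, descending
    .rename_axis("words")
    .reset_index(name="counts")
)
print(word_counts)
```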

In [None]:
df4=df4.sort_values("counts", ascending = False)

The most frequently repeated words (those with a count above 50) and their counts can be seen in the bar graph below.

In [None]:
df4[df4["counts"]> 50].plot.bar(x="words", y="counts")
plt.show()


#  Feature Engineering (20/12/2022 – 28/12/2022) 


In [None]:
#The goal of feature engineering is simply to make your data better suited to the problem at hand.

#let's view our latest dataframe df3, which contains the initial as well as the scraped data, cleaned and combined.
import pandas as pd
df3=pd.read_csv('updated_reviews_with_scraped_edited.csv')
df3=df3.dropna()
df3=df3.reset_index(drop=True) 
df3.info()
df3

- Our dataframe contains only two features: 'Liked', which contains the polarity of the review, and 'Cleaned reviews', which contains the cleaned text.

- The 'Liked' column is the target variable.

Here we will create a new feature from the data in our existing 'Cleaned reviews' feature.

In [None]:
#Using vectorizer which converts words into vectors.

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

Tvect = TfidfVectorizer(stop_words='english')
Cvect = CountVectorizer()

X= Cvect.fit_transform(df3['Cleaned reviews']).toarray()
X

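Now we will view our newly created features by converting the array into a dataframe.

In [None]:
import pandas as pd
# X is already a dense array (see .toarray() above), so it is passed directly;
# the column names come from the fitted CountVectorizer, Cvect
data = pd.DataFrame(X, columns=Cvect.get_feature_names_out())
data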

In [None]:
Xlist=list(X)  #converting to list
df3['word_vector']=Xlist # adding under new feature - 'Word vector'
df3

Above we have created a new feature named 'word_vector'.

#  PREDICTIVE MODELS


In [27]:
import pandas as pd
df3=pd.read_csv('updated_reviews_with_scraped_edited.csv')
df3=df3.dropna()
df3=df3.reset_index(drop=True) 
df3.info()
df3

#creating feature (X) and target (y) variables
X=df3['Cleaned reviews']
y=df3.Liked


#Using vectorizer which converts words into vectors.

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

Tvect = TfidfVectorizer(stop_words='english')
Cvect = CountVectorizer()

X= Cvect.fit_transform(df3['Cleaned reviews']).toarray()



#Creating a train test split from data

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.1,random_state=100)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1173 entries, 0 to 1172
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Liked            1173 non-null   int64 
 1   Cleaned reviews  1173 non-null   object
dtypes: int64(1), object(1)
memory usage: 18.5+ KB
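In the cell above, `Cvect` is fitted on all reviews before `train_test_split`, so its vocabulary also includes words that occur only in the test reviews. One possible alternative, sketched here on hypothetical toy data and not the method used in this notebook, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df3
texts = ["great food", "loved this place", "tasty and fresh", "great service",
         "bad food", "not good", "horrible taste", "bad service"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps both classes in each part
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=100, stratify=labels)

# The vectorizer inside the pipeline is fitted on the training text only
pipe = make_pipeline(CountVectorizer(), LogisticRegression(random_state=100))
pipe.fit(text_train, y_train)
print(pipe.predict(text_test))
```

A pipeline also accepts raw strings at prediction time, so no separate call to the vectorizer is needed.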


# Let's create models using various classification algorithms

In [28]:
#1- Using Logistic regression

from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression(random_state=100).fit(X_train,y_train)
ypred=model1.predict(X_test)
print('Logistic regression model evaluation')

# Accuracy, Precision and Recall 

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,ypred)
score2 = precision_score(y_test,ypred)
score3= recall_score(y_test,ypred)
print("---- Scores ----")
print("Accuracy score is: {}%".format(round(score1*100,2)))
print("Precision score is: {}".format(round(score2,2)))
print("Recall score is: {}".format(round(score3,2)))


Logistic regression model evaluation
---- Scores ----
Accuracy score is: 88.14%
Precision score is: 0.92
Recall score is: 0.82
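The same three metrics are computed and printed for every model in this notebook. A small helper could remove that repetition; the sketch below assumes a hypothetical function name `report_scores` and uses made-up labels only to show the call.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def report_scores(name, y_true, y_pred):
    """Print and return accuracy, precision and recall for one model."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    print(name, "model evaluation")
    print("---- Scores ----")
    print("Accuracy score is: {}%".format(round(scores["accuracy"] * 100, 2)))
    print("Precision score is: {}".format(round(scores["precision"], 2)))
    print("Recall score is: {}".format(round(scores["recall"], 2)))
    return scores

# Hypothetical labels, only to show the call
demo_scores = report_scores("Demo", [1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

With the notebook's variables the call would look like `report_scores('Logistic regression', y_test, ypred)`.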


In [29]:
#2-Using Support Vector Classifier or SVC

from sklearn.svm import SVC
model2 = SVC(random_state=50)
model2.fit(X_train, y_train)
ypred = model2.predict(X_test)

print('SVC model evaluation')

# Accuracy, Precision and Recall
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,ypred)
score2 = precision_score(y_test,ypred)
score3= recall_score(y_test,ypred)
print("---- Scores ----")
print("Accuracy score is: {}%".format(round(score1*100,2)))
print("Precision score is: {}".format(round(score2,2)))
print("Recall score is: {}".format(round(score3,2)))

SVC model evaluation
---- Scores ----
Accuracy score is: 83.05%
Precision score is: 0.88
Recall score is: 0.75


In [30]:
#3-Using Naive Bayes

from sklearn.naive_bayes import BernoulliNB,GaussianNB,MultinomialNB

NB=[BernoulliNB,GaussianNB,MultinomialNB]
for i in NB:
    model3=i().fit(X_train,y_train)
    ypred=model3.predict(X_test)
    print(i.__name__,'model evaluation')
    # Accuracy, Precision and Recall
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    score1 = accuracy_score(y_test,ypred)
    score2 = precision_score(y_test,ypred)
    score3= recall_score(y_test,ypred)
    print("---- Scores ----")
    print("Accuracy score is: {}%".format(round(score1*100,2)))
    print("Precision score is: {}".format(round(score2,2)))
    print("Recall score is: {}".format(round(score3,2)))

    
model3a=BernoulliNB().fit(X_train,y_train)
    
model3b=GaussianNB().fit(X_train,y_train)
    
model3c=MultinomialNB().fit(X_train,y_train)


BernoulliNB model evaluation
---- Scores ----
Accuracy score is: 82.2%
Precision score is: 0.82
Recall score is: 0.8
GaussianNB model evaluation
---- Scores ----
Accuracy score is: 71.19%
Precision score is: 0.65
Recall score is: 0.86
MultinomialNB model evaluation
---- Scores ----
Accuracy score is: 87.29%
Precision score is: 0.92
Recall score is: 0.8


In [6]:
#RANDOM FOREST

In [31]:
#Hyperparameter tuning for Random Forest


from sklearn.ensemble import RandomForestClassifier
best_n = 0
best_acc = 0
for n in range(10,100,10):
    model4b = RandomForestClassifier(criterion='entropy', n_estimators=n,max_depth=n)
    model4b.fit(X_train,y_train)
    y_pred=model4b.predict(X_test)
    acc = accuracy_score(y_test,y_pred)
    if acc > best_acc:
        best_n = n
        best_acc = acc
    print(n, acc)
print('\n', best_n, best_acc)

10 0.7542372881355932
20 0.7796610169491526
30 0.788135593220339
40 0.8305084745762712
50 0.8559322033898306
60 0.8559322033898306
70 0.8898305084745762
80 0.8305084745762712
90 0.8813559322033898

 70 0.8898305084745762
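The loop above selects `n` by accuracy on the test set, which is the same set used afterwards to report the final score, so that score may be optimistic. It also ties `n_estimators` and `max_depth` to the same value, and without a `random_state` the numbers change from run to run. A minimal sketch of tuning with cross-validation on the training data only follows; the synthetic data, the grid values, and `cv=3` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the word-count features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=100)

# The two parameters are searched independently of each other
param_grid = {"n_estimators": [10, 30], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=100),
    param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)   # only the training part is used for tuning

print(search.best_params_, round(search.best_score_, 3))
print("held-out accuracy:", search.score(X_te, y_te))
```

The held-out test set is touched only once, after the parameters have been chosen.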


In [32]:
#4-Using Random forest

from sklearn.ensemble import RandomForestClassifier
model4=RandomForestClassifier(n_estimators = 70, criterion = 'entropy', max_depth=70)
model4.fit(X_train,y_train)
ypred=model4.predict(X_test)
print('RandomForest model evaluation')

# Accuracy, Precision and Recall
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,ypred)
score2 = precision_score(y_test,ypred)
score3= recall_score(y_test,ypred)
print("---- Scores ----")
print("Accuracy score is: {}%".format(round(score1*100,2)))
print("Precision score is: {}".format(round(score2,2)))
print("Recall score is: {}".format(round(score3,2)))




RandomForest model evaluation
---- Scores ----
Accuracy score is: 87.29%
Precision score is: 0.86
Recall score is: 0.88


In [33]:
#5-Using XGBoost

from xgboost import XGBClassifier
model5 = XGBClassifier(max_depth=10, n_estimators=100).fit(X_train,y_train)
ypred=model5.predict(X_test)
print('XGBclassifier model evaluation')


# Accuracy, Precision and Recall
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,ypred)
score2 = precision_score(y_test,ypred)
score3= recall_score(y_test,ypred)
print("---- Scores ----")
print("Accuracy score is: {}%".format(round(score1*100,2)))
print("Precision score is: {}".format(round(score2,2)))
print("Recall score is: {}".format(round(score3,2)))


XGBclassifier model evaluation
---- Scores ----
Accuracy score is: 85.59%
Precision score is: 0.85
Recall score is: 0.84


# Review prediction

In [34]:
import pandas as pd
df3=pd.read_csv('updated_reviews_with_scraped_edited.csv')
df3=df3.dropna()
df3=df3.reset_index(drop=True) 
df3.info()
df3

#creating feature (X) and target (y) variables
X=df3['Cleaned reviews']
y=df3.Liked


#Creating a train test split from data

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.1,random_state=1000)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1173 entries, 0 to 1172
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Liked            1173 non-null   int64 
 1   Cleaned reviews  1173 non-null   object
dtypes: int64(1), object(1)
memory usage: 18.5+ KB


In [35]:
#let's predict sample reviews using our logistic regression model (model1)

import re   # Regular Expression package

#let's define a function for prediction
def predict_sentiment(sample_review):
    sample_review = re.sub(pattern='[^a-zA-Z]',repl=' ', string = sample_review)
    sample_review = sample_review.lower()
    temp = Cvect.transform([sample_review]).toarray()
    return model1.predict(temp)

In [36]:

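# Predicting values from our test set
# (this split uses random_state=1000, so some of these reviews may have been
#  in the training data of model1, which was split with random_state=100)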
sample_review = X_test
for i in sample_review:
    if predict_sentiment(i):
        print('-',i,': This is a POSITIVE review.')
    else:
        print('-',i,': This is a NEGATIVE review!')

- insults  profound deuchebaggery  and had to go outside for a smoke break while serving just to solidify it  : This is a NEGATIVE review!
- how can you call yourself a steakhouse if you can t properly cook a steak  i don t understand  : This is a NEGATIVE review!
- i don t think i ll be running back to carly s anytime soon for food  : This is a NEGATIVE review!
- anyway  this fs restaurant has a wonderful breakfast lunch  : This is a POSITIVE review.
- i had to wait over    minutes to get my drink and longer to get   arepas  : This is a NEGATIVE review!
- chuck : This is a NEGATIVE review!
- i don t recommend unless your car breaks down in front of it and you are starving  : This is a NEGATIVE review!
- the black eyed peas and sweet potatoes    unreal  : This is a POSITIVE review.
- i could care less    the interior is just beautiful  : This is a POSITIVE review.
- i vomited in the bathroom mid lunch  : This is a NEGATIVE review!
- great pork sandwich  : This is a POSITIVE review.
- i

In [38]:
import pickle
# Save the trained logistic regression model; the with-block closes the file
with open('Model.pkl','wb') as f:
    pickle.dump(model1, f)

In [40]:
# Loading model to compare the results
with open('Model.pkl','rb') as f:
    model = pickle.load(f)


import re   # Regular Expression package

#let's define a function for prediction
def predict_sentiment(sample_review):
    sample_review = re.sub(pattern='[^a-zA-Z]',repl=' ', string = sample_review)
    sample_review = sample_review.lower()
    temp = Cvect.transform([sample_review]).toarray()
    return model.predict(temp)


sample_review = ['very bad','horrible taste','bad place']
for i in sample_review:
    if predict_sentiment(i):
        print('-',i,': This is a POSITIVE review.')
    else:
        print('-',i,': This is a NEGATIVE review!')

- very bad : This is a NEGATIVE review!
- horrible taste : This is a NEGATIVE review!
- bad place : This is a NEGATIVE review!
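`Model.pkl` contains only the classifier, while `predict_sentiment` still relies on the fitted `Cvect` being in memory, so in a fresh session the loaded model cannot be used on its own. One option, sketched here with hypothetical toy data and a hypothetical file name, is to store the vectorizer and the model together in a single pickle.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data
texts = ["great food", "loved it", "bad food", "horrible taste"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = LogisticRegression(random_state=100).fit(vec.fit_transform(texts), labels)

# Save the vectorizer and the model together (file name is an assumption)
path = os.path.join(tempfile.gettempdir(), "Model_with_vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vec, "model": clf}, f)

# Later, or in a fresh session: load both and predict on raw text
with open(path, "rb") as f:
    bundle = pickle.load(f)
restored = bundle["model"].predict(bundle["vectorizer"].transform(["bad food"]))
print(restored)
```

As with any pickle, the file should only be loaded from a trusted source and with a compatible scikit-learn version.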
