# **Case Study on Zomato to Predict Ratings from the Reviews**

**The objective of this case study is to build a model that predicts the rating given in a review from the text of that review, using NLP and machine learning.**

The dataset contains reviews and their ratings. We will use the review text to predict the rating.

# **Importing necessary libraries**

In [2]:
import pandas as pd #used to analyze data
import numpy as np #used for working with arrays
import re #used to work with Regular Expressions.


In [3]:
from google.colab import files
uploaded = files.upload()

Saving Zomato_reviews.csv to Zomato_reviews.csv


# **Importing the data**

In [4]:
reviews0 = pd.read_csv("Zomato_reviews.csv", encoding= 'unicode_escape')

In [5]:
reviews0.head()

Unnamed: 0,rating,review_text
0,1.0,"Their service is worst, pricing in menu is dif..."
1,5.0,really appreciate their quality and timing . I...
2,4.0,"Went there on a Friday night, the place was su..."
3,4.0,A very decent place serving good food.\r\nOrde...
4,5.0,One of the BEST places for steaks in the city....


# **Getting the summary statistics**

In [6]:
reviews0.describe(include="all")

Unnamed: 0,rating,review_text
count,27762.0,27748
unique,,10548
top,,good
freq,,278
mean,3.665784,
std,1.284573,
min,1.0,
25%,3.0,
50%,4.0,
75%,5.0,


# **Removing rows with missing values**

Here we can see that the review text is missing for 14 rows (27762 ratings but only 27748 reviews). We get rid of these rows using the code below.

In [7]:
reviews1 = reviews0[~reviews0.review_text.isnull()].copy()
reviews1.reset_index(inplace=True, drop=True)

In [8]:
reviews0.shape, reviews1.shape

((27762, 2), (27748, 2))

## **Converting to list for easy manipulation**


In [9]:
reviews_list = reviews1.review_text.values

Printing the length of the result (note that `.values` returns a NumPy array rather than a Python list)

In [10]:
len(reviews_list)

27748

# **Cleaning up the text**

Now let's clean the text with the following steps:

**Normalize the case**

**Remove stop words**, after removing "not" and "no" from the stop word list so that negations are kept

**Remove punctuation**

First, let's normalize the case by converting every review to lowercase.

In [11]:
reviews_lower = [txt.lower() for txt in reviews_list]

In [12]:
reviews_lower[2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food.\r\nordered chilli fish, chicken & pork sizzler.\r\neverything tasted good but pork could have been slightly better cooked.\r\ntried 2 beverages, both were very sweet.']

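We can see that all the words are now lowercase. Next, we collapse line breaks and repeated whitespace into single spaces by splitting each review on whitespace and joining the pieces back together with `" ".join`.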

In [13]:
reviews_lower = [" ".join(txt.split()) for txt in reviews_lower]

In [14]:
reviews_lower[2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food. ordered chilli fish, chicken & pork sizzler. everything tasted good but pork could have been slightly better cooked. tried 2 beverages, both were very sweet.']
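The `split()` / `join()` idiom used above collapses every run of whitespace, including `\r\n` line breaks and tabs, into a single space. A minimal illustration on a made-up string (the sample text is hypothetical, not a row from the dataset):

```python
# Hypothetical sample text; not taken from Zomato_reviews.csv.
raw = "a very decent place.\r\nordered   chilli fish,\tchicken."

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, \r, \n) and never returns empty strings.
normalized = " ".join(raw.split())

print(normalized)
```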

## **Tokenize**

NLTK contains a module called tokenize with a **word_tokenize()** function that will help us split a text into tokens or words.

In [15]:
from nltk.tokenize import word_tokenize

In [16]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [17]:
print(word_tokenize(reviews_lower[0]))

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


Tokenizing all the sentences in the reviews_lower data

In [18]:
reviews_tokens = [word_tokenize(sent) for sent in reviews_lower]
print(reviews_tokens[0])


['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


## **Remove stop words and punctuations**

**Stop words** are words that don't add much information to a sentence. To remove **stop words** and **punctuation** using NLTK, we first download the stop word lists with `nltk.download('stopwords')`, then select the English list with `stopwords.words('english')` and save it to a variable. Punctuation can be removed by importing `punctuation` from `string` and saving it as a list in another variable. We can then define a function that removes these unnecessary stop words and punctuation marks from our data.

In [19]:
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [20]:
stop_nltk = stopwords.words("english")
stop_punct = list(punctuation)

In [21]:
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Removing the words that are useful for our task (negations) from the stop word list, so that they are kept in the reviews

In [22]:
stop_nltk.remove("no")
stop_nltk.remove("not")
stop_nltk.remove("don")
stop_nltk.remove("won")

Checking whether a removed word is still in the stop word list.

In [23]:
"no" in stop_nltk

False

Building the final list of tokens to remove by combining the stop words, the punctuation and a few extra symbols and words listed below

In [24]:
stop_final = stop_nltk + stop_punct + ["...", "``","''", "====", "must"]


Defining a function to remove stop_final from the reviews list

In [25]:
def del_stop(sent):
    return [term for term in sent if term not in stop_final]

Applying the function to reviews_tokens[1]

In [26]:
del_stop(reviews_tokens[1])

['really',
 'appreciate',
 'quality',
 'timing',
 'tried',
 'thattil',
 'kutti',
 'dosa',
 "'ve",
 'addicted',
 'dosa',
 'really',
 'chutney',
 'really',
 'good',
 'money',
 'worth',
 'much',
 'better',
 'thattukada',
 'try']

In [27]:
reviews_clean = [del_stop(sent) for sent in reviews_tokens]

In [28]:
reviews_clean = [" ".join(sent) for sent in reviews_clean]
reviews_clean[:2]

['service worst pricing menu different bill give bill increased pricing even serving water menu order need call 3-4 times even non busy day',
 "really appreciate quality timing tried thattil kutti dosa 've addicted dosa really chutney really good money worth much better thattukada try"]

Thus we have cleaned our data so that we can use it to build the model and make predictions.
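The cleaning steps above (lowercase, collapse whitespace, tokenize, drop stop words and punctuation, join) can be sketched as one function. This is only an illustration under stated assumptions: `simple_tokenize` is a hypothetical regex stand-in for `word_tokenize`, and `STOP_WORDS` is a tiny hand-written list rather than NLTK's English list, so the output will not match the notebook's results token for token.

```python
import re
from string import punctuation

# Tiny hand-written stop list for illustration only; the notebook uses
# NLTK's English list with "no", "not", "don" and "won" removed.
STOP_WORDS = {"the", "was", "is", "a", "it", "and", "i", "their"}
STOP_PUNCT = set(punctuation)

def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: a token is either a
    # word (optionally with internal ' or -) or a single punctuation mark.
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*|[^\sa-z0-9]", text)

def clean_review(text):
    # 1. normalize case, 2. collapse whitespace, 3. tokenize,
    # 4. drop stop words and punctuation, 5. join back into a string.
    text = " ".join(text.lower().split())
    tokens = simple_tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS and t not in STOP_PUNCT]
    return " ".join(kept)

print(clean_review("The food was NOT good,\r\nand the service is slow."))
# food not good service slow
```

Keeping "not" in the text matters here: without it, the example would reduce to "food good service slow", which reads as a positive review.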

## **Separate X and Y and perform train test split, 70-30**

Printing the length of cleaned data

In [29]:
len(reviews_clean)

27748

Splitting the data into the independent variable (X, the cleaned reviews) and the dependent variable (y, the ratings)

In [30]:
X = reviews_clean
y = reviews1.rating

Now splitting the whole dataset into training and testing sets

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)
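A quick check of the arithmetic behind the 70-30 split. The sketch assumes that, for a fractional `test_size`, the test set size is rounded up and the remaining rows go to the training set; this assumption agrees with the lengths printed further down in this notebook.

```python
import math

n_samples = 27748   # rows left after dropping the missing reviews
test_size = 0.30

# Assumption: the test size is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 19423 8325
```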


## **Document term matrix using TfIdf**

**TF-IDF** is a popular way to weigh terms for NLP tasks. It scores a term by how often it occurs in a document (term frequency), scaled down by how many documents in the corpus contain it (inverse document frequency). Words that occur in almost every review therefore get low weights, while words that are more descriptive of a particular review get higher weights.
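To make the weighting concrete, here is a small pure-Python sketch on a made-up corpus. It assumes the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm="l2"`, raw term counts), for which the IDF of a term is `ln((1 + n) / (1 + df)) + 1`; the corpus and the helper name `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased.
docs = [
    "good food good service",
    "bad food",
    "good place",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency (assumed TfidfVectorizer default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

def tfidf(doc):
    # Term count times IDF, then scaled so the vector has unit L2 norm.
    counts = Counter(doc)
    weights = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(idf)
print(tfidf(tokenized[1]))
```

In the second document, "bad" gets a higher weight than "food" because "food" also appears in another document. Common words are down-weighted rather than removed.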

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [33]:
vectorizer = TfidfVectorizer(max_features = 5000)

In [34]:
len(X_train), len(X_test)

(19423, 8325)

Fitting the vectorizer on the training data only, so that no information from the test set leaks into the vocabulary or the IDF weights

In [35]:
X_train_bow = vectorizer.fit_transform(X_train)

In [36]:
X_test_bow = vectorizer.transform(X_test)

Printing the shape of the TF-IDF matrices of the training and testing data

In [37]:
X_train_bow.shape, X_test_bow.shape

((19423, 5000), (8325, 5000))

## **Model building**

In [38]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


In [39]:
RandomForestRegressor()

RandomForestRegressor()

In [40]:
learner_rf = RandomForestRegressor(random_state=42)

In [41]:
learner_rf.fit(X_train_bow, y_train)

RandomForestRegressor(random_state=42)

In [42]:
y_train_preds = learner_rf.predict(X_train_bow)

In [43]:
from sklearn.metrics import mean_squared_error

In [44]:
mean_squared_error(y_train, y_train_preds)**0.5

0.23684233164605095
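The expression `mean_squared_error(...)**0.5` above is the root mean squared error (RMSE), which is expressed in the same unit as the ratings (stars). A minimal pure-Python sketch of the same quantity, with a hypothetical helper name `rmse` and made-up numbers:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings and predictions, for illustration only.
print(rmse([5.0, 4.0, 1.0], [4.0, 4.0, 3.0]))
```

Keep in mind that the value above is measured on the same data the forest was trained on, so it is an optimistic estimate; the test-set RMSE computed later in the notebook is the more meaningful number.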

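## **Changing the number of trees**

In scikit-learn 0.22 and later the default is `n_estimators=100`, so `n_estimators=30` fits fewer trees than the model above.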

In [45]:
learner_rf = RandomForestRegressor(random_state=42, n_estimators=30)

In [46]:
%%time
learner_rf.fit(X_train_bow, y_train)

CPU times: user 2min 10s, sys: 151 ms, total: 2min 10s
Wall time: 2min 10s


RandomForestRegressor(n_estimators=30, random_state=42)

In [47]:
y_train_preds = learner_rf.predict(X_train_bow)

In [48]:
mean_squared_error(y_train, y_train_preds)**0.5

0.24670225379076136

## **Hyper-parameter tuning**

**GridSearchCV** searches for the best parameter values within a given grid of candidate values. For every combination in the grid, it trains and scores the model with cross-validation and keeps the combination with the best average score. We pass it the model and the parameter grid; by default it then refits the best combination on the whole training set, so that it can be used to make predictions.

In [49]:
from sklearn.model_selection import GridSearchCV

In [50]:
RandomForestRegressor()

RandomForestRegressor()

In [51]:
learner_rf = RandomForestRegressor(random_state=42)

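**max_features:** At each split, a random forest considers only a random subset of the features and looks for the best split among them. max_features sets the size of that subset. It can be an integer (a number of features), a float (a fraction of the features), “sqrt”, “log2” or None (all features). “auto” was accepted by older scikit-learn versions and, for a regressor, meant all features.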



**max_depth** is the maximum depth allowed for each tree, i.e. the length of the longest path from the root node down to the farthest leaf node.

In [52]:
# Create the parameter grid to search over
# ("auto" is only accepted by older scikit-learn versions; in newer ones use None or 1.0 for all features)
param_grid = {
    'max_features': [500, "sqrt", "log2", "auto"],
    'max_depth': [20, 25, 30]
}

In [53]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = learner_rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 1, scoring = "neg_mean_squared_error" )


In [54]:
grid_search.fit(X_train_bow, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [20, 25, 30],
                         'max_features': [500, 'sqrt', 'log2', 'auto']},
             scoring='neg_mean_squared_error', verbose=1)
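The "12 candidates, totalling 60 fits" in the log follows directly from the grid: 4 values of `max_features` times 3 values of `max_depth`, each evaluated on 5 folds. A small sketch that enumerates the candidates the same way a grid search would (the enumeration order shown here is an assumption and may differ from scikit-learn's internal order):

```python
from itertools import product

# Same grid as in the notebook.
param_grid = {
    "max_features": [500, "sqrt", "log2", "auto"],
    "max_depth": [20, 25, 30],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(candidates))             # 12
print(len(candidates) * cv_folds)  # 60
print(candidates[0])               # {'max_depth': 20, 'max_features': 500}
```

Because the scoring is `neg_mean_squared_error`, higher (closer to zero) is better, and the candidate with the highest mean score across the folds is selected.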

In [55]:
grid_search.cv_results_

{'mean_fit_time': array([ 21.43790364,   3.10006599,   1.08372607, 136.47960043,
         30.23206444,   4.74353414,   1.53775978, 175.92281604,
         37.36585717,   6.03855934,   1.74621334, 202.73735857]),
 'std_fit_time': array([7.30044698e-01, 4.59105883e-02, 1.26327850e-02, 1.96634567e+00,
        2.79127828e+00, 4.22235091e-01, 1.91722364e-01, 2.92163826e+00,
        3.61909170e-01, 5.42202341e-01, 8.91621280e-03, 2.87579263e+01]),
 'mean_score_time': array([0.10625358, 0.10166903, 0.09512672, 0.10663872, 0.11701632,
        0.14636149, 0.10268979, 0.12045789, 0.1375814 , 0.12638478,
        0.10880585, 0.12223282]),
 'std_score_time': array([0.00154978, 0.00296263, 0.00813756, 0.00237473, 0.00122169,
        0.04720366, 0.00307437, 0.0087029 , 0.01250709, 0.0121136 ,
        0.00261173, 0.01296921]),
 'param_max_depth': masked_array(data=[20, 20, 20, 20, 25, 25, 25, 25, 30, 30, 30, 30],
              mask=[False, False, False, False, False, False, False, False,
              

Let's see which is the best estimator

In [56]:
grid_search.best_estimator_

RandomForestRegressor(max_depth=30, max_features=500, random_state=42)

# **Using the best estimator to make predictions on the test set**

In [57]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow)

In [58]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)


In [59]:
mean_squared_error(y_train, y_train_pred)**0.5

0.519080829238783

In [60]:
mean_squared_error(y_test, y_test_pred)**0.5

0.625162520140185

# **Identifying mismatch cases**

In [61]:
res_df = pd.DataFrame({'review':X_test, 'rating':y_test, 'rating_pred':y_test_pred})

In [62]:
res_df[(res_df.rating - res_df.rating_pred)>=2].shape

(9, 3)

In [63]:
res_df[(res_df.rating - res_df.rating_pred)>=2]

Unnamed: 0,review,rating,rating_pred
7277,life saviours serving excellent food worst tim...,5.0,2.352474
4771,not good,5.0,2.000031
16510,may not polished serving packaging etc never b...,5.0,1.952442
14845,oh memories place first drink bangalore almost...,5.0,2.57838
16916,delivered time really liked food,5.0,2.893742
15201,sauce not included,4.0,1.742956
27705,options would improvement long quality not com...,4.0,1.828912
3165,rice quantity less,5.0,2.750608
16515,may not polished serving packaging etc never b...,5.0,1.952442
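The filter above only catches reviews where the model predicted at least 2 stars too low, because `rating - rating_pred` is negative when the prediction is too high. To catch mismatches in both directions, the absolute difference can be used; in pandas that would be `res_df[(res_df.rating - res_df.rating_pred).abs() >= 2]`. The same idea as a pure-Python sketch on made-up rows (the data and the helper name `mismatches` are hypothetical):

```python
# Hypothetical (review, rating, predicted rating) triples.
results = [
    ("not good", 5.0, 2.0),
    ("loved the food", 5.0, 4.6),
    ("great place", 1.0, 3.8),
    ("rice quantity less", 5.0, 2.75),
]

def mismatches(rows, threshold=2.0):
    """Rows whose absolute error is at least `threshold` stars."""
    return [row for row in rows if abs(row[1] - row[2]) >= threshold]

for review, rating, pred in mismatches(results):
    print(f"{review!r}: actual={rating}, predicted={pred}")
```

Several of the mismatches in the table above look like noisy labels rather than model errors (for example "not good" rated 5.0), and rows 16510 and 16515 appear to be the same review, so some reviews are duplicated in the dataset.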
