# NLP - Zomato Review.

## Description:
Natural language processing (NLP) on customer reviews from Zomato, an online food delivery service, in order to predict the rating from the review text.


## Objectives:

•	Import the required libraries. 

•	Preprocess the reviews.

•	Extract features from the words.

•	Fit and evaluate statistical models.


# Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import re

# Loading the dataframe

In [2]:
reviews0=pd.read_csv("Zomato_reviews.csv")

In [3]:
reviews0.head()

Unnamed: 0,rating,review_text
0,1.0,"Their service is worst, pricing in menu is dif..."
1,5.0,really appreciate their quality and timing . I...
2,4.0,"Went there on a Friday night, the place was su..."
3,4.0,A very decent place serving good food.\r\nOrde...
4,5.0,One of the BEST places for steaks in the city....


# Describing the dataframe

In [4]:
reviews0.describe(include="all")

Unnamed: 0,rating,review_text
count,27762.0,27748
unique,,10548
top,,good
freq,,278
mean,3.665784,
std,1.284573,
min,1.0,
25%,3.0,
50%,4.0,
75%,5.0,


# Checking for null Values and dropping them.

In [5]:
reviews1 = reviews0[~reviews0.review_text.isnull()].copy()
reviews1.reset_index(inplace=True,drop=True)

In [6]:
reviews0.shape,reviews1.shape,

((27762, 2), (27748, 2))

# Converting the reviews to an array for easy manipulation

In [7]:
reviews_list = reviews1.review_text.values


In [8]:
len(reviews_list)

27748

# Preprocessing.
 Text clean up
 
 Normalise (convert to lower case)
 
 Remove stop words, keeping negations such as "not" and "no" out of the stop word list
 
 Remove punctuation

### Lower-casing the reviews and printing the first 5

Each review is converted to lower case and the results are collected in a list.

In [9]:
reviews_lower=[txt.lower() for txt in reviews_list]

In [10]:
reviews_lower[0:5]

['their service is worst, pricing in menu is different from bill. they can give you a bill with increased pricing. even for serving water,menu, order you need to call them 3-4 times even on a non busy day.',
 "really appreciate their quality and timing . i have tried the thattil kutti dosa i've been addicted to the dosa really and the chutney... really good and money worth much better than a thattukada must try it",
 'went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food.\r\nordered chilli fish, chicken & pork sizzler.\r\neverything tasted good but pork could have been slightly bett

### Cleaning the text by removing unwanted whitespace characters.

" ".join(txt.split()) splits the text on any whitespace and joins it again with single spaces, which removes the unwanted \r\n line breaks and repeated spaces in the reviews.



In [11]:
reviews_lower = [" ".join(txt.split()) for txt in reviews_lower]

In [12]:
reviews_lower[0:5]

['their service is worst, pricing in menu is different from bill. they can give you a bill with increased pricing. even for serving water,menu, order you need to call them 3-4 times even on a non busy day.',
 "really appreciate their quality and timing . i have tried the thattil kutti dosa i've been addicted to the dosa really and the chutney... really good and money worth much better than a thattukada must try it",
 'went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food. ordered chilli fish, chicken & pork sizzler. everything tasted good but pork could have been slightly better coo
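The effect of the split-and-join step can be seen on a small string. The sample text below is made up for illustration; it only assumes that a review contains \r\n line breaks and repeated spaces, as review 3 does.

```python
# Hypothetical sample with a Windows line break and repeated spaces.
raw = "a very decent place.\r\nordered   chilli fish"

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, \r, \n) and drops empty strings.
tokens = raw.split()

# Joining with a single space gives one clean line.
normalised = " ".join(tokens)
print(normalised)
```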

# Preprocessing Techniques.

## 1. Tokenization 

    Breaks the sentence into separate words. These words are called tokens.

    A simple tokenizer splits the text wherever there is a space.

    word_tokenize also treats punctuation marks as separate tokens, since they can carry meaning.

In [13]:
import nltk
# import word_tokenize - to separate the words in a sentence.
from nltk.tokenize import word_tokenize

In [14]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\2211582\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Separating the sentences into words 

    word_tokenize - function from nltk.tokenize.

    reviews_lower[0] - the review at index position 0.

In [15]:
print(word_tokenize(reviews_lower[0]))

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']
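To see why a tokenizer is more than a whitespace split, the sketch below compares str.split() with a simple regular expression that separates punctuation from words. The regex is only an approximation chosen for illustration; it is not the algorithm word_tokenize uses, and it handles cases such as 3-4 or i've differently.

```python
import re

text = "their service is worst, pricing in menu is different from bill."

# Whitespace splitting leaves punctuation attached to the words.
by_space = text.split()

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
by_regex = re.findall(r"\w+|[^\w\s]", text)

print(by_space[3])    # 'worst,'
print(by_regex[3:5])  # ['worst', ',']
```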


### Tokenizing every review. 

    word_tokenize is applied to each review. Numeric tokens such as '3-4' are still present at this stage. 

In [16]:
reviews_tokens = [word_tokenize(sent) for sent in reviews_lower]
print(reviews_tokens[1])

['really', 'appreciate', 'their', 'quality', 'and', 'timing', '.', 'i', 'have', 'tried', 'the', 'thattil', 'kutti', 'dosa', 'i', "'ve", 'been', 'addicted', 'to', 'the', 'dosa', 'really', 'and', 'the', 'chutney', '...', 'really', 'good', 'and', 'money', 'worth', 'much', 'better', 'than', 'a', 'thattukada', 'must', 'try', 'it']


## 2. Remove stop words and punctuation

#### Corpus  
    A collection of authentic text or data organised into a dataset. Authentic here means text written in the natural language as it is actually used.

### Importing the punctuation and stopwords.

In [17]:
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\2211582\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

###  Loading the stopwords for the English language.

    stopwords.words("english") returns the list of English stop words.

    list(punctuation) lists the punctuation characters.

In [18]:
stop_nltk=stopwords.words("english")


In [19]:
stop_punct = list(punctuation)

### Listing the stopwords.

In [20]:
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Listing the Punctuations. 

In [21]:
print(stop_punct)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [22]:
len(stop_punct)

32

In [23]:
len(stop_nltk)

179

### Removing the stop words.

Once the stop words are identified, they are removed from the reviews.

Removing stop words leaves the words that carry the message of the review.

Here, some words are taken out of the stop word list so that they stay in the reviews: "no", "not", "don", "won", "off", "some".

In [24]:
stop_nltk.remove("no")
stop_nltk.remove("not")
stop_nltk.remove("don")
stop_nltk.remove("won")
stop_nltk.remove("off")
stop_nltk.remove("some")
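The reason for keeping negations can be shown with a small example. The stop word set here is a short hypothetical subset, not the full NLTK list, and keep is a name introduced for this sketch. Using a set difference also avoids the ValueError that list.remove raises when a word is missing from the list.

```python
# Hypothetical, shortened stop-word list for illustration only.
stop_words = {"the", "was", "is", "not", "no", "a"}

# Words that carry sentiment and should survive filtering.
keep = {"not", "no"}
stop_custom = stop_words - keep

tokens = ["the", "food", "was", "not", "good"]

with_default = [t for t in tokens if t not in stop_words]
with_custom = [t for t in tokens if t not in stop_custom]

print(with_default)  # ['food', 'good']
print(with_custom)   # ['food', 'not', 'good']
```

With the default list the review reads as positive; with the customised list the negation is preserved.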



### Checking whether a stop word is present in the reviews.

In [25]:
"their" in reviews_tokens[1]

True

In [26]:
any("off" in tokens for tokens in reviews_tokens[0:3])

False
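A slice such as reviews_tokens[0:3] is a list of token lists, so a plain in test compares the word against whole lists rather than against the tokens inside them. The toy data below is made up to show the difference; any with a generator checks each inner list in turn.

```python
# Hypothetical tokenized reviews.
docs = [["service", "was", "off"], ["really", "good"]]

# Compares "off" with each inner list, so this is always False.
direct = "off" in docs

# Checks the tokens of every review.
nested = any("off" in tokens for tokens in docs)

print(direct, nested)  # False True
```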

### Combining the stop words and the punctuation into one list and printing it.

In [27]:
stop_final = stop_nltk + stop_punct + ["...","``","''","===","must",'-', '.', '/', ':', ';']

In [28]:
stop_final

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few

### Deleting all the stop words and punctuation from the review (sent)

In [29]:
def del_stop(sent):
    return [term for term in sent if term not in stop_final]

### Review after deleting the stop words and punctuation.

In [30]:
del_stop(reviews_tokens[1])

['really',
 'appreciate',
 'quality',
 'timing',
 'tried',
 'thattil',
 'kutti',
 'dosa',
 "'ve",
 'addicted',
 'dosa',
 'really',
 'chutney',
 'really',
 'good',
 'money',
 'worth',
 'much',
 'better',
 'thattukada',
 'try']

### Using "re" to remove the punctuation.

In [110]:
import string

in_revi=reviews_lower[3]
reslt= re.sub('[%s]' % re.escape(string.punctuation), '',in_revi)
print(reslt)

a very decent place serving good food ordered chilli fish chicken  pork sizzler everything tasted good but pork could have been slightly better cooked tried 2 beverages both were very sweet


### Complete Cleaning.

The reviews after removing both stop words and punctuation, so that only the meaningful words remain.

In [31]:
reviews_clean = [del_stop(sent) for sent in reviews_tokens]

In [32]:
reviews_clean = [" ".join(sent) for sent in reviews_clean]
reviews_clean[2:5]

['went friday night place surprisingly empty interesting menu almost fully made dosas bullseye dosa cheese masala dosa bullseye dosa really good egg perfectly cooked half boiled state masala cheese masala good cheese bit chewy liking chutney good sambar average dishes reasonably priced',
 'decent place serving good food ordered chilli fish chicken pork sizzler everything tasted good pork could slightly better cooked tried 2 beverages sweet',
 'one best places steaks city tried beef steak chili rum grilled fish orange jalapenos exceptionally good herbed rice mashed potatoes serves alongside equally delecatble service prompt zomato gold great steal steak lover place visit hope come another ourself somewhere cbd wish back soon bon appetit']

In [33]:
len(reviews_clean)

27748

## Training and validating the data.

### Separating x and y and performing the train-test split

In [34]:
reviews_clean

['service worst pricing menu different bill give bill increased pricing even serving water menu order need call 3-4 times even non busy day',
 "really appreciate quality timing tried thattil kutti dosa 've addicted dosa really chutney really good money worth much better thattukada try",
 'went friday night place surprisingly empty interesting menu almost fully made dosas bullseye dosa cheese masala dosa bullseye dosa really good egg perfectly cooked half boiled state masala cheese masala good cheese bit chewy liking chutney good sambar average dishes reasonably priced',
 'decent place serving good food ordered chilli fish chicken pork sizzler everything tasted good pork could slightly better cooked tried 2 beverages sweet',
 'one best places steaks city tried beef steak chili rum grilled fish orange jalapenos exceptionally good herbed rice mashed potatoes serves alongside equally delecatble service prompt zomato gold great steal steak lover place visit hope come another ourself somewhe

In [35]:
reviews1.rating

0        1.0
1        5.0
2        4.0
3        4.0
4        5.0
        ... 
27743    4.0
27744    4.0
27745    5.0
27746    5.0
27747    3.0
Name: rating, Length: 27748, dtype: float64

In [36]:
x= reviews_clean
y = reviews1.rating

In [37]:
from sklearn.model_selection import train_test_split
x_train,x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state=42)

# Feature Extraction

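## Document-term matrix using TF-IDF

The feature extraction used here is TF-IDF. 

TF - term frequency (how often a word occurs in a document).

TF calculation:

    Number of occurrences of the word divided by the total number of words in the document.

IDF - inverse document frequency (measures how rare a word is across all documents; rarely used words may hold significant information).

IDF calculation:

    Logarithm of (total number of documents divided by the number of documents that contain the word).

The TF-IDF score is the product TF × IDF. Note that TfidfVectorizer uses a variant of these formulas by default: raw counts for TF, a smoothed IDF, and L2 normalisation of each row.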
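A worked example may help with the two quantities described above. The sketch below computes TF and IDF by hand for three made-up documents, using the textbook form idf = ln(N / df). The helper names tf and idf are written for this sketch; TfidfVectorizer applies smoothing and row normalisation by default, so its numbers will differ from these.

```python
import math

# Hypothetical mini-corpus of already cleaned reviews.
docs = [
    "good food good service",
    "bad service",
    "good place",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Share of the document's tokens that are this term.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Number of documents that contain the term at least once.
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

for term in ["good", "bad"]:
    print(term, round(tf(term, tokenized[0]), 3), round(idf(term), 3))
```

The word "bad" appears in only one of the three documents, so it gets a higher IDF than "good", which appears in two.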



In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [39]:
vectorizer = TfidfVectorizer(max_features = 5000)

In [40]:
len(x_train),len(x_test)

(19423, 8325)

In [41]:
x_train_bow = vectorizer.fit_transform(x_train)

In [42]:
x_test_bow = vectorizer.transform(x_test)

In [43]:
x_train_bow.shape,x_test_bow.shape

((19423, 5000), (8325, 5000))

## Model Building

The statistical model used here is a random forest regressor, which treats the rating as a continuous target. 

In [44]:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor



In [82]:
?RandomForestRegressor

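### Fitting a Random Forest Regressor with the default number of trees.

No n_estimators value is passed, so scikit-learn's default is used (100 trees in recent versions), not a single tree.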

In [46]:
learner_rf = RandomForestRegressor(random_state=42)

In [None]:
learner_rf.fit(x_train_bow,y_train)

### Predicting and computing the RMSE on the training data

In [48]:
y_train_preds = learner_rf.predict(x_train_bow)

In [49]:
from sklearn.metrics import mean_squared_error

In [50]:
mean_squared_error(y_train, y_train_preds)**0.5

0.2374354773429524
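Raising the mean squared error to the power 0.5 gives the root mean squared error (RMSE), which is in the same units as the ratings. A small hand computation with made-up numbers shows what the metric measures; rmse is a helper written for this sketch, not a scikit-learn function.

```python
import math

def rmse(actual, predicted):
    # Square each error, average them, then take the square root.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ratings and predictions.
y_true = [1.0, 5.0, 4.0, 4.0]
y_pred = [2.0, 5.0, 3.0, 4.0]

print(rmse(y_true, y_pred))
```

An RMSE measured on the training data, as here, tends to be optimistic; the test set gives a fairer picture of how the model generalises.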

### Increasing the number of trees.

In [69]:
learner_rf = RandomForestRegressor(random_state=42, n_estimators=1100)


In [70]:
%%time
learner_rf.fit(x_train_bow, y_train)

CPU times: total: 1h 6min
Wall time: 1h 26min 49s


RandomForestRegressor(n_estimators=1100, random_state=42)

In [71]:
y_train_preds = learner_rf.predict(x_train_bow)

In [72]:
mean_squared_error(y_train, y_train_preds)**0.5

0.23428164760326076

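## Hyper-parameter Tuning.

    In machine learning, hyper-parameter tuning (or hyper-parameter optimization) is the problem of picking an optimal set of hyper-parameters for a learning algorithm. A hyper-parameter is a setting, such as max_depth or max_features, that is fixed before training and controls how the model is learned, rather than being learned from the data. 

    In the grid below, max_depth lists 50 twice, so 12 candidates are fitted although only 8 are distinct. The value "auto" for max_features was deprecated in scikit-learn 1.1 and removed in 1.3, so newer versions need a different value there.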

In [73]:
from sklearn.model_selection import GridSearchCV

In [74]:
?RandomForestRegressor

In [75]:
learner_rf = RandomForestRegressor(random_state=42)

In [76]:
param_grid= {
    'max_features':[1000,'sqrt','log2',"auto"],
    'max_depth':[50,50,100]
}

In [77]:
grid_search = GridSearchCV(estimator = learner_rf,param_grid = param_grid,
                          cv = 5, n_jobs = -1, verbose = 1, scoring = "neg_mean_squared_error" )

In [79]:
grid_search.fit(x_train_bow, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [50, 50, 100],
                         'max_features': [1000, 'sqrt', 'log2', 'auto']},
             scoring='neg_mean_squared_error', verbose=1)
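The count of 12 candidates comes from the Cartesian product of the two lists in param_grid. The sketch below reproduces that count with itertools.product and shows how many of the combinations are distinct. It mirrors how the grid is expanded but does not call scikit-learn.

```python
from itertools import product

param_grid = {
    "max_features": [1000, "sqrt", "log2", "auto"],
    "max_depth": [50, 50, 100],
}

# Every combination of one value per parameter.
candidates = list(product(param_grid["max_features"], param_grid["max_depth"]))
distinct = set(candidates)

print(len(candidates), len(distinct))  # 12 8
print(len(candidates) * 5)             # 60 fits with cv=5
```

Removing the repeated 50 from max_depth would cut the search by a third without changing which settings are explored.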

In [81]:
from sklearn.model_selection import GridSearchCV

In [83]:
grid_search.best_estimator_

RandomForestRegressor(max_depth=100, max_features=1000, random_state=42)

In [84]:
y_train_pred = grid_search.best_estimator_.predict(x_train_bow)

In [85]:
y_test_pred = grid_search.best_estimator_.predict(x_test_bow)


In [86]:
mean_squared_error(y_train, y_train_pred)**0.5

0.2539152158396246

In [87]:
mean_squared_error(y_test, y_test_pred)**0.5

0.483176080429695

In [88]:
res_df = pd.DataFrame({'review':x_test, 'rating':y_test, 'rating_pred':y_test_pred})

In [89]:
res_df[(res_df.rating - res_df.rating_pred)>=2].shape

(19, 3)
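The filter above keeps only reviews whose true rating is at least 2 points higher than the prediction. Over-predictions by the same margin are not counted. The sketch below, with made-up numbers, contrasts the signed difference with the absolute difference.

```python
# Hypothetical true ratings and predictions.
ratings = [5.0, 1.0, 4.0]
preds = [2.5, 3.5, 4.2]

# Signed difference: catches under-predictions only.
under = [(r, p) for r, p in zip(ratings, preds) if r - p >= 2]

# Absolute difference: catches errors in both directions.
both = [(r, p) for r, p in zip(ratings, preds) if abs(r - p) >= 2]

print(len(under), len(both))  # 1 2
```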

## Model Building- GradientBoostingRegressor. 

The statistical model used here is GradientBoostingRegressor. The variable name learner_rf is reused from the random forest section. 

In [114]:
?GradientBoostingRegressor

In [116]:
learner_rf = GradientBoostingRegressor(random_state=42)

In [117]:
learner_rf.fit(x_train_bow,y_train)

GradientBoostingRegressor(random_state=42)

In [118]:
y_train_preds = learner_rf.predict(x_train_bow)

In [119]:
from sklearn.metrics import mean_squared_error

In [120]:
mean_squared_error(y_train, y_train_preds)**0.5

0.8191796746007274

### Increasing the number of trees.

In [121]:
learner_rf = GradientBoostingRegressor(random_state=42, n_estimators=1100)

In [144]:
%%time
learner_rf.fit(x_train_bow, y_train)

CPU times: total: 4min 18s
Wall time: 5min 15s


GradientBoostingRegressor(n_estimators=1100, random_state=42)

In [146]:
y_train_preds = learner_rf.predict(x_train_bow)

In [147]:
mean_squared_error(y_train, y_train_preds)**0.5

0.5199618794331563

## Hyper-parameter Tuning.

In [148]:
from sklearn.model_selection import GridSearchCV

In [149]:
?GradientBoostingRegressor

In [150]:
learner_rf = GradientBoostingRegressor(random_state=42)

In [151]:
param_grid= {
    'max_features':[1000,'sqrt','log2',"auto"],
    'max_depth':[50,50,100]
}

In [152]:
grid_search = GridSearchCV(estimator = learner_rf,param_grid = param_grid,
                          cv = 5, n_jobs = -1, verbose = 1, scoring = "neg_mean_squared_error" )

In [153]:
grid_search.fit(x_train_bow, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(cv=5, estimator=GradientBoostingRegressor(random_state=42),
             n_jobs=-1,
             param_grid={'max_depth': [50, 50, 100],
                         'max_features': [1000, 'sqrt', 'log2', 'auto']},
             scoring='neg_mean_squared_error', verbose=1)

In [154]:
grid_search.best_estimator_

GradientBoostingRegressor(max_depth=100, max_features='sqrt', random_state=42)

In [155]:
y_train_pred = grid_search.best_estimator_.predict(x_train_bow)

In [156]:
y_test_pred = grid_search.best_estimator_.predict(x_test_bow)


In [157]:
mean_squared_error(y_test, y_test_pred)**0.5

0.45340918188867235

In [158]:
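# Uses y_train_preds from the 1100-tree model (In [146]), not y_train_pred from the grid search, so the earlier value is repeated.
mean_squared_error(y_train, y_train_preds)**0.5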

0.5199618794331563

In [159]:
res_df = pd.DataFrame({'review':x_test, 'rating':y_test, 'rating_pred':y_test_pred})

In [160]:
res_df[(res_df.rating - res_df.rating_pred)>=2].shape

(7, 3)

# re package

### Using "re" to remove numerics.

In [122]:
reviews_lower[0]

'their service is worst, pricing in menu is different from bill. they can give you a bill with increased pricing. even for serving water,menu, order you need to call them 3-4 times even on a non busy day.'

In [123]:
# removing the numbers using regex

in_revi=reviews_lower[0]
ot_revi = re.sub(r"\d+","",in_revi)
print(ot_revi)

their service is worst, pricing in menu is different from bill. they can give you a bill with increased pricing. even for serving water,menu, order you need to call them - times even on a non busy day.


### Using "re.sub" 

Removes the punctuation.

re.sub('pattern','replace','string')

Returns a string in which every match of the pattern is replaced by the replacement text.

If the pattern is not found, re.sub() returns the original string.

In [124]:
import string

in_revi=reviews_lower[0]
reslt= re.sub('[%s]' % re.escape(string.punctuation), '',in_revi)
print(reslt)

their service is worst pricing in menu is different from bill they can give you a bill with increased pricing even for serving watermenu order you need to call them 34 times even on a non busy day
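Replacing punctuation with an empty string glues neighbouring words together when there is no space around the mark, as in watermenu and 34 above. One possible alternative, shown here as a sketch, replaces each mark with a space and then collapses repeated whitespace.

```python
import re
import string

text = "even for serving water,menu, order you need to call them 3-4 times"

pattern = "[%s]" % re.escape(string.punctuation)

# Empty replacement merges "water,menu" into one token.
glued = re.sub(pattern, "", text)

# Space replacement keeps the words apart; split/join tidies the spacing.
spaced = " ".join(re.sub(pattern, " ", text).split())

print(glued)
print(spaced)
```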


### Using "re.split" to tokenize 

Separates the sentence into words. 

re.split('pattern','string')

If the pattern is not found, it returns a list containing the original string as its only element.

In [141]:
in_revi=reviews_lower[1]
reslt= re.split(r'\s',in_revi)
print(reslt)

['really', 'appreciate', 'their', 'quality', 'and', 'timing', '.', 'i', 'have', 'tried', 'the', 'thattil', 'kutti', 'dosa', "i've", 'been', 'addicted', 'to', 'the', 'dosa', 'really', 'and', 'the', 'chutney...', 'really', 'good', 'and', 'money', 'worth', 'much', 'better', 'than', 'a', 'thattukada', 'must', 'try', 'it']


### re.search() 

Searches for a word in the sentence.

The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.
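The match object carries the position and text of the first match. A short sketch on a made-up sentence shows the methods most often used and the None check that is needed before calling them.

```python
import re

sentence = "much better than a thattukada"

match = re.search("better", sentence)
if match is not None:
    # span() gives (start, end); group() gives the matched text.
    print(match.span(), match.group())

# No match: re.search returns None, so guard before using the result.
missing = re.search("worse", sentence)
print(missing)
```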

In [128]:
in_revi=reviews_lower[1]
x= re.search("better",in_revi)
print(x)

<re.Match object; span=(171, 177), match='better'>


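### re.compile()

re.compile() builds a reusable pattern object. Here its split() method is customised by changing the value of maxsplit.

With maxsplit = 5 it splits off the first 5 words and returns the rest of the review as the sixth element.

maxsplit - the maximum number of splits that will occur.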

In [130]:
in_revi=reviews_lower[1]
x= re.compile(r"\s")

split = x.split(in_revi, maxsplit = 5)
print(x)
print(split)

re.compile('\\s')
['really', 'appreciate', 'their', 'quality', 'and', "timing . i have tried the thattil kutti dosa i've been addicted to the dosa really and the chutney... really good and money worth much better than a thattukada must try it"]
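Compiling is mainly useful when the same pattern is applied many times. The sketch below reuses one compiled pattern across two made-up strings and shows that maxsplit=n yields at most n + 1 pieces. The pattern \s+ is used here instead of \s so that repeated spaces do not produce empty strings.

```python
import re

# Compile once, reuse for every string.
whitespace = re.compile(r"\s+")

samples = ["really good food", "one  two   three four"]

for s in samples:
    # maxsplit=2 performs at most two splits, so at most three pieces.
    print(whitespace.split(s, maxsplit=2))
```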


### re.split('\d+')

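The re.split method splits the string wherever there is a match and returns a list of the pieces.

If there is no match, the list contains the whole string as a single element, as with the review below, which has no digits.

\d matches a single digit.

+ matches one or more occurrences of the pattern to its left.

The backslash \ is used to escape metacharacters and to introduce special sequences such as \d. 

\d followed by + therefore matches a run of one or more digits.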



In [142]:
in_revi=reviews_lower[2]
reslt= re.split(r'\d+',in_revi)
print(reslt)

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.']


### re.subn()

The re.subn() is similar to re.sub() except it returns a tuple of 2 items containing the new string and the number of substitutions made.

\B - Opposite of \b. Matches only at positions that are not at the beginning or end of a word.

\Bwent looks for "went" inside a word. In this review "went" is at the start of the text, which is a word boundary, so nothing matches and the count is 0.


In [143]:
in_revi=reviews_lower[2]
# pattern = "\Bboards"
x= re.subn(r"\Bwent","pads",in_revi)
print(x) # prints the tuple: (new string, number of substitutions).
print(x[0]) # prints only the string, the first item of the tuple.


('went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.', 0)
went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.
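The difference between \b and \B is easier to see on a short string. In this made-up example emp occurs once at the start of a word and once inside a word.

```python
import re

text = "empty attempt"

# \b requires a word boundary before "emp": matches "empty" only.
at_start = re.subn(r"\bemp", "EMP", text)

# \B requires that there is no boundary: matches inside "attempt" only.
inside = re.subn(r"\Bemp", "EMP", text)

print(at_start)  # ('EMPty attempt', 1)
print(inside)    # ('empty attEMPt', 1)
```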


### re.findall()

The re.findall() method returns a list of strings containing all matches.


Here it extracts the numbers from the string.


In [145]:
reslt=re.findall(r'\d+',reviews_lower[0])
print(reslt)

['3', '4']
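If a range such as 3-4 should be kept as one item, the pattern can be extended. This is a sketch of one option; the optional group (?:-\d+)? is an assumption about how ranges are written in the reviews, and the sample text is made up from phrases seen above.

```python
import re

text = "you need to call them 3-4 times, tried 2 beverages"

# Digits only: the range is split into separate numbers.
digits = re.findall(r"\d+", text)

# Digits optionally followed by "-digits": the range stays together.
ranges = re.findall(r"\d+(?:-\d+)?", text)

print(digits)  # ['3', '4', '2']
print(ranges)  # ['3-4', '2']
```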
