# Case study 3
### Problem statement

Zomato is India's largest platform for discovering restaurants and ordering food. It operates in India as well as in a few cities internationally. Bangalore is one of the biggest customer and restaurant bases for Zomato, with 4 to 5 million users using the platform each month.

Users on the platform can also post reviews of restaurants and provide a rating accompanying each review. The content of a review should ideally reflect the rating provided by the customer. In many cases, however, the rating does not match the review text, owing to multiple reasons. Matching reviews and ratings is very important, as it builds customer trust in the platform and helps the user get an accurate picture of the restaurant.

You, as a data scientist, need to enable the identification and cleanup of such cases, to ensure the ratings are reflective of the reviews and that the reviews seem trustworthy to the customer. You will need to use NLP techniques in conjunction with machine learning models to predict the rating from the review text.

**Domain**: Hospitality and internet

**Analysis to be done**: Perform specific data cleanup and build a rating prediction model using the Random Forest technique.


# Load up basic dependencies


In [1]:
import pandas as pd
import numpy as np
import re
from google.colab import files
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Load and View the Dataset



In [2]:
uploaded = files.upload()

Saving Zomato_reviews.csv to Zomato_reviews (1).csv


In [3]:
reviews0 = pd.read_csv("Zomato_reviews.csv", encoding= 'unicode_escape')

In [None]:
reviews0.head()  # view the first 5 reviews and ratings

Unnamed: 0,rating,review_text
0,1.0,"Their service is worst, pricing in menu is dif..."
1,5.0,really appreciate their quality and timing . I...
2,4.0,"Went there on a Friday night, the place was su..."
3,4.0,A very decent place serving good food.\r\nOrde...
4,5.0,One of the BEST places for steaks in the city....


In [4]:
reviews0.describe(include="all")

Unnamed: 0,rating,review_text
count,27762.0,27748
unique,,10548
top,,good
freq,,278
mean,3.665784,
std,1.284573,
min,1.0,
25%,3.0,
50%,4.0,
75%,5.0,


# Basic Data Processing

## Remove all records with no review text

In [5]:
reviews1 = reviews0[~reviews0.review_text.isnull()].copy()
reviews1.reset_index(inplace=True, drop=True)

In [6]:
reviews0.shape, reviews1.shape

((27762, 2), (27748, 2))

**Checking imbalances**

In [7]:
reviews0["rating"].value_counts()

4.0    8632
5.0    8118
3.0    3762
1.0    3126
2.0    1675
3.5    1078
4.5     933
2.5     261
1.5     177
Name: rating, dtype: int64
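To put these counts in proportion, the short sketch below recomputes class shares from the numbers printed above. The variable names are illustrative and not part of the original notebook.

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {
    4.0: 8632, 5.0: 8118, 3.0: 3762, 1.0: 3126, 2.0: 1675,
    3.5: 1078, 4.5: 933, 2.5: 261, 1.5: 177,
}

total = sum(rating_counts.values())

# Share of reviews rated exactly 4 or 5 stars.
share_4_or_5 = (rating_counts[4.0] + rating_counts[5.0]) / total

# Share of reviews with a half-star rating (1.5, 2.5, 3.5, 4.5).
half_star_share = sum(c for r, c in rating_counts.items() if r % 1 != 0) / total

print(total, round(share_4_or_5, 3), round(half_star_share, 3))
```

About 60% of the ratings are 4 or 5 stars and fewer than 9% are half-star values, so a model can look accurate on average while still doing poorly on the rarer low and half-star ratings.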

## **Converting to a NumPy array for easy manipulation**


In [8]:
reviews_list = reviews1.review_text.values

In [9]:
len(reviews_list)

27748

## Text clean up

* Normalize the case

* Remove stop words

* Keep "not" and "no" by removing them from the stop word list

* Remove punctuation

### Normalizing case / Case Conversion
All the texts are converted into lower case

In [10]:
reviews_lower = [txt.lower() for txt in reviews_list]

In [11]:
reviews_lower[2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food.\r\nordered chilli fish, chicken & pork sizzler.\r\neverything tasted good but pork could have been slightly better cooked.\r\ntried 2 beverages, both were very sweet.']

In [12]:
# collapse runs of whitespace (including \r\n line breaks) into single spaces
reviews_lower = [" ".join(txt.split()) for txt in reviews_lower]

In [13]:
reviews_lower[2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food. ordered chilli fish, chicken & pork sizzler. everything tasted good but pork could have been slightly better cooked. tried 2 beverages, both were very sweet.']

There are different ways to preprocess text:

* stop word removal
* tokenization
* stemming

## **Tokenization**
Among these, the most important step is tokenization. It’s the process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. 

**Why do we need tokenization?**
Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of our pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements. The token occurrences in a document can be used directly as a vector representing that document. 

This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or they can be used in a machine learning pipeline as features that trigger more complex decisions or behavior.

Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we call it word tokenization.
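The difference between sentence and word tokenization can be sketched with two small regular-expression helpers. This is a simplified illustration, not how NLTK works internally; the function names `sentence_tokenize` and `word_tokenize_simple` are hypothetical and the rules ignore abbreviations, contractions, and similar cases.

```python
import re

def sentence_tokenize(text):
    """Split text into sentences at whitespace that follows ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize_simple(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # Words may contain internal hyphens (e.g. "3-4"); any other
    # non-space, non-word character becomes its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

text = "the chutney was good. the sambar was average."
sentences = sentence_tokenize(text)
tokens = [word_tokenize_simple(s) for s in sentences]

print(sentences)
print(tokens)
```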

### NLTK Word Tokenize
NLTK (Natural Language Toolkit) is an open-source Python library for Natural Language Processing. It has easy-to-use interfaces for over 50 corpora and lexical resources such as WordNet, along with a set of text processing libraries for classification, tokenization, stemming, and tagging.

We can easily tokenize the sentences and words of the text with the tokenize module of NLTK.

In [14]:
print(word_tokenize(reviews_lower[0]))

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


In [15]:
reviews_tokens = [word_tokenize(sent) for sent in reviews_lower]
print(reviews_tokens[0])


['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


## **Remove stop words and punctuations**

1. **Stop words** are words that don't add much information to a sentence. For example, the last sentence can be shortened to: stop words don't add information sentence. Although that doesn't look like a proper English sentence, we'd likely understand the meaning if we heard it somewhere. That's why in many cases we can make our models simpler by ignoring these words. Stop words are usually the most common words in natural texts.

2. **Punctuation marks** are removed after tokenization: once a sentence is split into tokens, every token that is a punctuation mark is dropped.
* Removing punctuation is a standard preparation step in machine learning and data analysis activities.
* For example, when building a text classification model, punctuation usually carries little useful signal; therefore, we eliminate it during the pre-processing step.
* When working with user-generated text data, such as social media postings, we will encounter a lot of punctuation that may not be beneficial for the task at hand; thus, removing it becomes a necessary pre-processing chore.

In [16]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
stop_nltk = stopwords.words("english")
stop_punct = list(punctuation)

In [18]:
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [19]:
print(stop_punct)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [20]:
stop_nltk.remove("no")
stop_nltk.remove("not")
stop_nltk.remove("don")
stop_nltk.remove("won")
stop_nltk.remove("did")


In [21]:
"no" in stop_nltk

False
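Why negations are kept can be seen in a small sketch. It uses a hypothetical mini stop word list instead of NLTK's full English list, and the helper name `remove_stop_words` is illustrative only.

```python
# Hypothetical mini stop word list for illustration;
# the notebook itself uses NLTK's English list.
default_stop_words = {"the", "was", "is", "a", "not", "no"}

# Keep negations by removing them from the stop word set.
negation_words = {"not", "no"}
custom_stop_words = default_stop_words - negation_words

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop word collection."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "food", "was", "not", "good"]

without_negation = remove_stop_words(tokens, default_stop_words)
with_negation = remove_stop_words(tokens, custom_stop_words)

print(without_negation)  # the negation is lost
print(with_negation)     # the negation is kept
```

With the unmodified list, "not good" collapses to "good", which would push a rating model in exactly the wrong direction.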

In [22]:
stop_final = stop_nltk + stop_punct + ["...", "``","''", "====", "must"]


In [23]:
def del_stop(sent):
    return [term for term in sent if term not in stop_final]

In [24]:
del_stop(reviews_tokens[1])

['really',
 'appreciate',
 'quality',
 'timing',
 'tried',
 'thattil',
 'kutti',
 'dosa',
 "'ve",
 'addicted',
 'dosa',
 'really',
 'chutney',
 'really',
 'good',
 'money',
 'worth',
 'much',
 'better',
 'thattukada',
 'try']

In [34]:
reviews_clean = [del_stop(sent) for sent in reviews_tokens]


In [28]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [35]:
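lemmatizer = WordNetLemmatizer()

In [36]:
def apply_lemmatization(sent):
    # lemmatize() treats every term as a noun by default (pos="n"),
    # so plurals are reduced ("dishes" -> "dish") while verb forms such as "went" stay unchanged
    return [lemmatizer.lemmatize(term) for term in sent]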

In [37]:
reviews_clean[2]

['went',
 'friday',
 'night',
 'place',
 'surprisingly',
 'empty',
 'interesting',
 'menu',
 'almost',
 'fully',
 'made',
 'dosas',
 'bullseye',
 'dosa',
 'cheese',
 'masala',
 'dosa',
 'bullseye',
 'dosa',
 'really',
 'good',
 'egg',
 'perfectly',
 'cooked',
 'half',
 'boiled',
 'state',
 'masala',
 'cheese',
 'masala',
 'good',
 'cheese',
 'bit',
 'chewy',
 'liking',
 'chutney',
 'good',
 'sambar',
 'average',
 'dishes',
 'reasonably',
 'priced']

In [38]:
apply_lemmatization(reviews_clean[2])

['went',
 'friday',
 'night',
 'place',
 'surprisingly',
 'empty',
 'interesting',
 'menu',
 'almost',
 'fully',
 'made',
 'dosas',
 'bullseye',
 'dosa',
 'cheese',
 'masala',
 'dosa',
 'bullseye',
 'dosa',
 'really',
 'good',
 'egg',
 'perfectly',
 'cooked',
 'half',
 'boiled',
 'state',
 'masala',
 'cheese',
 'masala',
 'good',
 'cheese',
 'bit',
 'chewy',
 'liking',
 'chutney',
 'good',
 'sambar',
 'average',
 'dish',
 'reasonably',
 'priced']

In [40]:
reviews_clean = [apply_lemmatization(sent) for sent in reviews_clean]

In [41]:
reviews_clean = [" ".join(sent) for sent in reviews_clean]
reviews_clean[:2]

['service worst pricing menu different bill give bill increased pricing even serving water menu order need call 3-4 time even non busy day',
 "really appreciate quality timing tried thattil kutti dosa 've addicted dosa really chutney really good money worth much better thattukada try"]

In [42]:
len(reviews_clean)

27748

## **Separate X and Y and perform train test split, 70-30**

In [43]:
X = reviews_clean
y = reviews1.rating

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)


In [45]:
X_train

['great place weekend brunch .. drink available reasonable price variety food option available veg nonveg people',
 'came enjoy sushi ordered rainbow maki sriracha maki along salad amazing wanted try sushi wallet said no place bit pricey dessert disappointing though vietnamese banana cake supposed come candied banana came cut banana caramel replaced scorched service also poor ask thrice cheque ambience good seating comfortable mixed experience',
 "one hyped restaurant bangalore .. thinking visiting place quite long finally happened .. restaurant 31st floor .. 2 part one inside sitting another one open space good thing restaurant view .. r planning romantic dinner proposal dinner burning pocket wo n't hurt u choose place .. food not mark .. schezwan chicken chicken steak ordered chicken hard service slow property quite older u could understand seeing sofasets stain onto .. service could b better ask plate .. served food not plate .. 2 starter 2 alcoholic drink 1 non alcoholic drink cost

## **Document term matrix using TfIdf**

* TF stands for term frequency. It is the number of times a word “x” occurs in a document.

* IDF stands for inverse document frequency. Document frequency is the number of documents that contain the word “x”; the inverse document frequency shrinks as the document frequency grows, so words that appear in almost every review receive a low weight and rare words receive a high weight.

Natural language processing (NLP) uses the tf-idf technique to convert text documents into a numerical form that a model can consume. Here each review is a document and the words in the review are tokens. `TfidfVectorizer` creates a matrix of documents against token scores, which is why the result is also known as a document-term matrix (DTM).

**Vectorization**: `TfidfVectorizer` first learns a vocabulary from the training data, assigning each word a column index, and then outputs one vector per document. Each entry of the vector is the tf-idf weight of the corresponding word in that document: the term frequency multiplied by the inverse document frequency. By default scikit-learn also scales each document vector to unit length, so the weights lie between 0 and 1. A weight of 0 means the word does not occur in the document, and larger weights mark words that are frequent in the document but rare in the corpus.

Applying `TfidfVectorizer` to a corpus gives a sparse matrix of shape (number of documents, vocabulary size). The first dimension indexes the documents and the second indexes the words or terms in the vocabulary. Each row describes one document: the entry in column j is the tf-idf weight (a float) of term j in that document. In this notebook the vectorizer is fitted on the 19,423 training reviews with `max_features = 10000`, so the training matrix has shape (19423, 10000).

In short, `TfidfVectorizer` turns our text data into a numerical form that can be used by other ML algorithms, such as the Random Forest used below.

The sklearn library stores this matrix in a sparse format. Most entries are zero, because any single review uses only a small fraction of the vocabulary, so only the non-zero values and their indices are kept. This saves memory and speeds up computation when dealing with large datasets.
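The weighting can be sketched in plain Python on a toy corpus. This is a minimal illustration that assumes scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation per document) and simplifies tokenization to a whitespace split; the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Scale the document vector to unit length (L2 norm).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["food good", "food bad", "service good food"]  # toy corpus
weights = tfidf(docs)

for doc, row in zip(docs, weights):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, "bad" occurs in only one review while "food" occurs in all three, so "bad" ends up with the larger weight even though both words appear once.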

In [51]:
vectorizer = TfidfVectorizer(max_features = 10000)

In [52]:
len(X_train), len(X_test)

(19423, 8325)
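These sizes follow from the 70-30 split of the 27,748 cleaned reviews. The arithmetic below is a sketch that assumes `train_test_split` rounds a fractional test size up and assigns the remaining samples to the training set.

```python
import math

n_reviews = 27748  # reviews left after dropping rows without text
test_size = 0.30   # fraction passed to train_test_split

# Assumption: the test set size is rounded up and
# the training set receives the remaining samples.
n_test = math.ceil(test_size * n_reviews)
n_train = n_reviews - n_test

print(n_train, n_test)
```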

In [53]:
X_train_bow = vectorizer.fit_transform(X_train)

In [54]:
X_test_bow = vectorizer.transform(X_test)

In [55]:
X_train_bow.shape, X_test_bow.shape

((19423, 10000), (8325, 10000))

## **Model building**

#### Random forest 

Random Forest is an ensemble technique capable of performing both regression and classification tasks. It trains multiple decision trees, each on a bootstrap sample of the training data, and aggregates their outputs, a technique called bootstrap aggregation and commonly known as bagging. The basic idea is to combine multiple decision trees in determining the final output rather than relying on an individual decision tree.
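Bagging itself can be sketched in a few lines of plain Python. The example below is only an illustration under simplifying assumptions: one-split trees ("stumps") on a single numeric feature stand in for full decision trees, the per-split feature sampling that Random Forest adds is omitted, and the names `fit_stump` and `fit_bagged_stumps` as well as the toy data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs; return a predict function."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample is identical: predict the mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)

xs = [1, 2, 3, 4, 5, 6, 7, 8]                    # toy feature
ys = [1.0, 1.5, 1.0, 2.0, 4.0, 4.5, 5.0, 4.5]    # toy ratings
predict = fit_bagged_stumps(xs, ys)

print(predict(1), predict(8))
```

Each stump sees a slightly different resample of the data, and averaging them smooths out the quirks of any single tree.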

In [None]:
?RandomForestRegressor

In [56]:
learner_rf = RandomForestRegressor(random_state=42)

In [57]:
learner_rf.fit(X_train_bow, y_train)

RandomForestRegressor(random_state=42)

In [58]:
y_train_preds = learner_rf.predict(X_train_bow)

In [62]:
from sklearn.metrics import mean_squared_error,r2_score

In [63]:
mean_squared_error(y_train, y_train_preds)**0.5

0.2280107027325059

In [64]:
r2_score(y_train,y_train_preds)

0.9683257307920315

## **Increasing the number of trees**



In [68]:
learner_rf = RandomForestRegressor(random_state=42, n_estimators=1000)


In [72]:
%%time
learner_rf.fit(X_train_bow, y_train)

CPU times: user 58min 12s, sys: 4.4 s, total: 58min 16s
Wall time: 58min


RandomForestRegressor(n_estimators=1000, random_state=42)

In [73]:
y_train_preds = learner_rf.predict(X_train_bow)

In [74]:
mean_squared_error(y_train, y_train_preds)**0.5

0.22514344099837558

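**Increasing the number of trees from the default 100 to 1000 lowers the training RMSE only marginally (from 0.2280 to 0.2251), at the cost of roughly 58 minutes of fitting time.** Both figures are measured on the training data, so they do not show how well the model generalizes.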

## **Hyper-parameter tuning**

A model hyperparameter is a setting that is external to the model and whose value is not learned from the data during training. The value of the hyperparameter has to be set before the learning process begins. Examples are C in Support Vector Machines, k in k-Nearest Neighbors, and the number of hidden layers in Neural Networks.

**Grid-search** is used to find the optimal hyperparameters of a model which results in the most ‘accurate’ predictions.

**Cross-validation** is used to evaluate the performance of the models. It measures how well a model generalizes to an independent dataset, which gives a good estimate of how a predictive model will perform on unseen data.

In [None]:
?RandomForestRegressor

In [75]:
learner_rf = RandomForestRegressor(random_state=42,n_estimators=35)

In [76]:
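# Create the parameter grid: candidate values for each hyperparameter
# Note: "auto" (all features, for a regressor) was removed in scikit-learn 1.3; on newer versions use 1.0 or None instead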
param_grid = {
    'max_features': [500, "sqrt", "log2", "auto"],
    'max_depth': [10, 15, 20]
}

The `param_grid` parameter takes a dictionary that maps each hyperparameter name to the list of values to try, as shown above. Grid search evaluates every combination of these values.
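How the grid expands into candidate models can be sketched with `itertools.product`. This illustration only enumerates the combinations; it does not train anything, and the variable names other than `param_grid` are not part of the notebook.

```python
from itertools import product

# Same grid as in the notebook cell above.
param_grid = {
    "max_features": [500, "sqrt", "log2", "auto"],
    "max_depth": [10, 15, 20],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

n_folds = 6  # value of cv used in this notebook
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
print(candidates[0])
```

Four values of `max_features` times three values of `max_depth` give 12 candidates; with 6 folds each, that is 72 model fits.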

In [77]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = learner_rf, param_grid = param_grid, 
                          cv = 6, n_jobs = -1, verbose = 1, scoring = "neg_mean_squared_error" )


We mentioned that cross-validation is carried out to estimate the performance of a model. In k-fold cross-validation, k is the number of folds. As shown above, through `cv=6`, we use 6-fold cross-validation: the training data is split into 6 folds and each candidate is trained 6 times, each time validated on a different held-out fold. This means that 6 is the k value.
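The fold mechanics can be sketched without any library. The helper below is a hypothetical illustration that uses contiguous folds without shuffling; `kfold_indices` is not a scikit-learn function.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(kfold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once and for training in the remaining folds.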

`scoring='neg_mean_squared_error'` scores each candidate by the negative of its mean squared error. Grid search always selects the candidate with the highest score, so the error is negated: maximizing the negative MSE is the same as minimizing the MSE.
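The sign convention can be illustrated with two made-up candidates. All ratings, predictions, and names in this sketch are hypothetical.

```python
def mean_squared_error_simple(y_true, y_pred):
    """Plain-Python mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [5.0, 4.0, 1.0, 3.0]  # hypothetical ratings

# Hypothetical predictions from two candidate models.
candidate_preds = {
    "candidate_a": [4.0, 4.0, 2.0, 3.0],
    "candidate_b": [3.0, 3.0, 3.0, 3.0],
}

# Score = negative MSE, so a higher score means a lower error.
scores = {name: -mean_squared_error_simple(y_true, preds)
          for name, preds in candidate_preds.items()}

best = max(scores, key=scores.get)

# Convert the best score back to an RMSE for reporting.
rmse_of_best = (-scores[best]) ** 0.5

print(scores, best, rmse_of_best)
```

The same conversion applies to a fitted grid search: negating the best score and taking the square root gives a cross-validated RMSE.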

The `n_jobs` parameter specifies the number of concurrent processes that should be used for routines parallelized with the library joblib. In our case, -1 means that all available CPUs are used.

`verbose` controls how much logging information is produced. We set it to 1 so that the search reports the number of folds, candidates, and total fits; 0 would disable these messages.

## Fitting the data.
We do this through `grid_search.fit(X_train_bow, y_train)`, which fits the model for every parameter combination in the grid on every fold.

In [78]:
grid_search.fit(X_train_bow, y_train)

Fitting 6 folds for each of 12 candidates, totalling 72 fits


GridSearchCV(cv=6,
             estimator=RandomForestRegressor(n_estimators=35, random_state=42),
             n_jobs=-1,
             param_grid={'max_depth': [10, 15, 20],
                         'max_features': [500, 'sqrt', 'log2', 'auto']},
             scoring='neg_mean_squared_error', verbose=1)

In [79]:
grid_search.cv_results_

{'mean_fit_time': array([ 1.55667675,  0.48463802,  0.185449  , 15.7205048 ,  2.64798319,
         0.74193919,  0.23918275, 26.13915626,  4.10916062,  1.05066343,
         0.31177219, 36.87497274]),
 'std_fit_time': array([0.06125171, 0.00829479, 0.0208894 , 0.26152519, 0.07144849,
        0.02317705, 0.01425118, 0.41991089, 0.09625898, 0.03442877,
        0.00943139, 1.03852334]),
 'mean_score_time': array([0.01489802, 0.01492568, 0.01412133, 0.01659008, 0.01763829,
        0.01750565, 0.01590065, 0.01945802, 0.02098342, 0.02267734,
        0.01808715, 0.02059579]),
 'std_score_time': array([0.00043385, 0.00053641, 0.00053077, 0.00234973, 0.00036372,
        0.00116304, 0.00041194, 0.00094548, 0.00066643, 0.00679073,
        0.00087608, 0.00183534]),
 'param_max_depth': masked_array(data=[10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False],
        fill_value='?',
 

In [80]:
grid_search.best_estimator_

RandomForestRegressor(max_depth=20, n_estimators=35, random_state=42)

In [81]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow)

In [82]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)


In [83]:
mean_squared_error(y_train, y_train_pred)**0.5

0.6747454928551583

In [84]:
mean_squared_error(y_test, y_test_pred)**0.5

0.7517546027617316

In [88]:
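# Note: y_train_preds (with a trailing "s") still holds the predictions of the 1000-tree forest fitted earlier,
# not y_train_pred from the tuned model, so this R² describes the earlier model
r2_score(y_train, y_train_preds)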

0.9691173374381444

## Identifying the mismatches

In [85]:
res_df = pd.DataFrame({'review':X_test, 'rating':y_test, 'rating_pred':y_test_pred})

In [86]:
res_df[(res_df.rating - res_df.rating_pred)>=2].shape

(17, 3)

In [87]:
res_df[(res_df.rating - res_df.rating_pred)>=2]

Unnamed: 0,review,rating,rating_pred
7277,life saviour serving excellent food worst time...,5.0,1.045725
26186,manjushree restaurant quite old one main eater...,4.0,1.857143
2141,food good did not get item one pack,4.0,1.930098
6835,food good onion lemon not given,4.0,1.391083
21981,another restaurant main street j p nagar wide ...,5.0,2.90178
7104,nice taste biryani not hot,4.0,1.655281
4771,not good,5.0,1.895617
19793,part review programme ordered bombay masala qu...,5.0,1.776756
13196,delicious food say vegetarian thought might sh...,5.0,1.975555
16510,may not polished serving packaging etc never b...,5.0,1.692914
