
# ZOMATO REVIEW NLP

## Importing the libraries

In [2]:
#Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm #Colormap
import nltk

import os
import warnings
warnings.filterwarnings('ignore')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Reading input data

In [4]:
zomato = pd.read_csv('Restaurant reviews.csv')
print(zomato.shape)

(10000, 7)


In [5]:
zomato.head()

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,"1 Review , 2 Followers",5/25/2019 15:54,0
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,"1 Review , 1 Follower",5/24/2019 22:11,0
4,Beyond Flavours,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0


## Example review with URLs, HTML tags, emojis and other text noise

In [10]:
text = u'<div><h1><Title>The apple π was [*][AMAZING][*] and YuMmY too\U0001f602! You can Checkout the entire Menu in https://www.zomato.com/chennai/top-restaurants</div></h1></Title>'
print(text)

<div><h1><Title>The apple π was [*][AMAZING][*] and YuMmY too😂! You can Checkout the entire Menu in https://www.zomato.com/chennai/top-restaurants</div></h1></Title>


### Stripping HTML tags

In [15]:
from bs4 import BeautifulSoup
def strip_html(text):
  soup = BeautifulSoup(text,"html.parser")
  return soup.get_text()

In [16]:
text = strip_html(text)
print(text)

The apple π was [*][AMAZING][*] and YuMmY too😂! You can Checkout the entire Menu in https://www.zomato.com/chennai/top-restaurants


The HTML tags have been removed; only the text content remains.

### Removing URLs

In [17]:
import re
text = re.sub(r"http\S+","",text)
print(text)

The apple π was [*][AMAZING][*] and YuMmY too😂! You can Checkout the entire Menu in 
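The pattern `http\S+` only catches links that begin with `http`. As a rough sketch (the broader pattern, the `remove_urls` name and the sample sentence are assumptions for illustration, not part of the notebook), bare `www.` links could be handled as well:

```python
import re

# Hypothetical broader pattern: matches http://, https:// and bare www. links.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Delete every URL-like token from the text."""
    return URL_PATTERN.sub("", text)

sample = "Menu at https://www.zomato.com/chennai and www.example.com today"
print(remove_urls(sample))
```

Like the original pattern, this leaves the surrounding spaces in place, so a later whitespace clean-up may still be useful.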


### Removing emojis

In [23]:
def deEmojify(text):
  regex_pattern = re.compile(pattern = "["
    u"\U0001F600-\U0001F64F"  #emoticons
    u"\U0001F300-\U0001F5FF"  #miscellaneous symbols and pictographs
    u"\U0001F680-\U0001F6FF"  #transport and map symbols
    u"\U0001F1E0-\U0001F1FF"  #flags (iOS)
    "]", flags = re.UNICODE)
  return regex_pattern.sub(r'',text)

In [25]:
text = deEmojify(text)
print(text)

The apple π was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in 
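The four ranges above cover the most common emoji blocks, but not every emoji falls inside them. The sketch below adds two more Unicode ranges; which ranges are worth adding is an assumption here and depends on what actually appears in the reviews, and `de_emojify_extended` is a hypothetical name.

```python
import re

# Ranges from the notebook plus two assumed additions (marked below).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U0001F900-\U0001F9FF"  # assumption: supplemental symbols & pictographs
    "\u2600-\u27BF"          # assumption: miscellaneous symbols and dingbats
    "]+"
)

def de_emojify_extended(text):
    """Remove every character that falls in one of the ranges above."""
    return EMOJI_PATTERN.sub("", text)

# U+1F602 is in the original ranges; U+1F923 and U+2764 are only in the added ones.
print(de_emojify_extended("yummy\U0001F602\U0001F923\u2764!"))
```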


### Text Encoding

In [26]:
#Let's see what happens if we encode the text to utf-8 format
text.encode('utf-8','ignore')

b'The apple \xcf\x80 was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in '

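We can see that the π character is encoded as the two bytes \xcf\x80 in UTF-8. UTF-8 can represent every Unicode character, so nothing is lost in this encoding.

But we want to remove the π symbol, so we encode to ASCII instead. ASCII covers only 128 basic characters (English letters, digits, punctuation and control codes), and with the 'ignore' error handler any character outside that set is silently dropped.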

In [27]:
text.encode('ascii','ignore')

b'The apple  was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in '

Encoding to ASCII with 'ignore' drops the π symbol because ASCII has no code for it.

In [28]:
text = text.encode('ascii','ignore')
print(text)

b'The apple  was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in '


The leading 'b' shows that the result is now a bytes object rather than a string.
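One side effect of ASCII encoding with 'ignore' is that accented letters are dropped entirely. If that matters, a possible alternative (not used in this notebook) is to normalise the text with the standard `unicodedata` module first, so that the base letter survives. The `to_ascii` helper and the sample string are assumptions for illustration.

```python
import unicodedata

def to_ascii(text, keep_base_letters=True):
    """Return an ASCII-only str; optionally keep the base letter of accented characters."""
    if keep_base_letters:
        # NFKD splits an accented letter into the base letter plus a combining
        # accent; the accent is then dropped by the ASCII encoding step.
        text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

sample = "apple \u03c0 caf\u00e9"  # "apple π café"
print(repr(to_ascii(sample, keep_base_letters=False)))  # same behaviour as the cell above
print(repr(to_ascii(sample)))
```

The trailing `.decode("ascii")` turns the bytes object back into a string in the same step.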

In [29]:
# Unicode encoding 
def to_unicode(text):
    if isinstance(text, float):
        text = str(text)
    if isinstance(text, int):
        text = str(text)
    if not isinstance(text, str):
        text = text.decode('utf-8', 'ignore')
    return text

The isinstance() function checks whether a value is of a given type. Here it is used to convert floats and ints to strings, and to decode anything else that is not already a string (such as the bytes object produced above) back into a string.
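The float branch matters in practice because a missing value in a pandas text column is stored as the float NaN. The snippet below is a condensed copy of the same logic, repeated so it runs on its own; the sample values are made up for illustration.

```python
def to_unicode(text):
    # Numbers (including NaN, which is a float) are converted with str().
    if isinstance(text, (float, int)):
        text = str(text)
    # Anything that is still not a str is assumed to be bytes and decoded.
    if not isinstance(text, str):
        text = text.decode('utf-8', 'ignore')
    return text

for value in [float('nan'), 5, b'The apple  was', 'already text']:
    print(repr(to_unicode(value)))
```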

In [30]:
text = to_unicode(text)
print(text)

The apple  was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in 


### Removing Symbols

In [31]:
import re,string

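#Despite the name, this removes only the [, ] and * characters; the text between the brackets is kept
def remove_between_square_brackets(text):
  return re.sub(r'\[|\]|\*','',text)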

In [33]:
text = remove_between_square_brackets(text)
print(text)

The apple  was AMAZING and YuMmY too! You can Checkout the entire Menu in 


### Removing Special Characters

In [34]:
def remove_special_characters(text,remove_digits=True):
  pattern = r'[^a-zA-Z0-9\s]'
  text = re.sub(pattern,'',text)
  return text

In [35]:
text = remove_special_characters(text)
print(text)

The apple  was AMAZING and YuMmY too You can Checkout the entire Menu in 
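Note that the `remove_digits` argument is accepted but never used, so digits are always kept. A hypothetical variant that honours the flag might look like this (the `_v2` name and the sample sentence are assumptions; the default is set to False here so that it matches what the function above actually does):

```python
import re

def remove_special_characters_v2(text, remove_digits=False):
    """Drop punctuation and symbols; also drop digits when remove_digits is True."""
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

print(repr(remove_special_characters_v2("Table for 2, please!")))
print(repr(remove_special_characters_v2("Table for 2, please!", remove_digits=True)))
```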


### Lowercase conversion

In [36]:
text = text.lower()
print(text)

the apple  was amazing and yummy too you can checkout the entire menu in 


## Preprocessing Zomato reviews

In [47]:
#Collating all functions together and applying it for the zomato reviews dataset
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

def to_unicode(text):
    if isinstance(text, float):
        text = str(text)
    if isinstance(text, int):
        text = str(text)
    if not isinstance(text, str):
        text = text.decode('utf-8', 'ignore')
    return text

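#Removing the square brackets and asterisks (the text between the brackets is kept)
def remove_between_square_brackets(text):
    return re.sub(r'\[|\]|\*', '', text)


#Define function for removing special characters
#Note: remove_digits is accepted but not used; digits are always kept
def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-Z0-9\s]'  #A-Z, not A-z: the range A-z would also keep [ \ ] ^ _ and `
    text=re.sub(pattern,' ',text)
    return text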

def denoise_text(text):
  text = to_unicode(text)
  text = strip_html(text)
  text = re.sub(r"http\S+", "", text)
  text = deEmojify(text)
  text = text.encode('ascii', 'ignore')
  text = to_unicode(text)
  text = remove_between_square_brackets(text)
  text = remove_special_characters(text)
  text = text.lower()
  return text

In [50]:
#Let's try denoise_text on a sample review
zomato['Review'][300]

'Haleem, the best place to try out.\nAvaialble in all multiple outlets.\nHaleem, usually found in the Ramdan season, months of march to July served by Shah ghouse and badam kheer are must try'

In [51]:
tempText = denoise_text(zomato['Review'][300])
print(tempText)

haleem  the best place to try out 
avaialble in all multiple outlets 
haleem  usually found in the ramdan season  months of march to july served by shah ghouse and badam kheer are must try


In [52]:
#Applying the function on review column
zomato['Review'] = zomato['Review'].apply(denoise_text)
zomato['Review'].head()

0    the ambience was good  food was quite good   h...
1    ambience is too good for a pleasant evening  s...
2    a must try   great food great ambience  thnx f...
3    soumen das and arun was a great guy  only beca...
4    food is good we ordered kodi drumsticks and ba...
Name: Review, dtype: object

In [53]:
#Processed example of randomly selected review
zomato['Review'][111]

'we ordered tandoori chicken as starters that s very tasty and light and then butter naan with mutton masala and as main course we ordered chicken biryani  egg and mutton biryani those are very very tasty and light mutton piecs cooked with perfection'

## Removing stopwords

Stop words are removed from each review as follows:

1. Split the review into words. We use the Toktok tokeniser from the NLTK library for this.
2. Check each word against the English stop word list from NLTK.
3. If the word is in the stop word list, skip it; otherwise, add it to a new list.
4. Join the words in the new list back into a single string.
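The steps above can be sketched without NLTK as follows. This is only an illustration: the tiny stop word set and `str.split()` are stand-ins (assumptions) for NLTK's English list and the Toktok tokeniser.

```python
# Tiny hand-picked stop word set, standing in for NLTK's English list.
STOP_WORDS = {"the", "was", "and", "to", "it", "by", "had"}

def remove_stopwords_sketch(text):
    tokens = text.split()                                      # split into words
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]  # keep non-stop words
    return " ".join(kept)                                      # join back into a string

print(remove_stopwords_sketch("the curry was delivered to me and it had leaked"))
```

Using a set rather than a list for the stop words keeps each membership check fast, which adds up over thousands of reviews.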

In [55]:
#Libraries
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

In [57]:
#Tokenization of text
tokenizer = ToktokTokenizer()

In [59]:
#Setting English stopwords
stopword_list = nltk.corpus.stopwords.words('english')

#Removing standard english stop words like prepositions, adverbs
from nltk.tokenize import word_tokenize,sent_tokenize
stop = set(stopwords.words('english'))
print(stop)

{'once', 'our', 'me', 'ourselves', 'did', 'such', 'of', 'his', "she's", 's', "you'll", 'don', 'been', 'shouldn', 'this', 'about', 'have', 'doesn', 've', 'it', 'until', "wouldn't", "should've", 'mustn', 'wouldn', 'am', 'again', 'if', 'you', 'then', 'its', 'with', 'yourselves', 'she', 'down', 'where', 'no', "hadn't", 'will', 'what', 'they', 'how', 'myself', "that'll", "couldn't", 'some', 'here', 'won', 'after', 'couldn', 'himself', 'an', 'on', 'ain', "it's", 'isn', 'are', 'there', 'very', 'yours', 'nor', 'under', 'from', "wasn't", "shan't", 'by', 'i', 'll', 'mightn', "needn't", "won't", 'hasn', 'themselves', 'we', 'hadn', "didn't", "you're", 'has', "don't", 'ma', "shouldn't", 'o', 'than', 'ours', "weren't", 'be', 'he', 'who', 'into', 'her', 'that', 'doing', 'above', 'against', 'both', 'too', "hasn't", 'because', 'same', 'why', 'when', 'didn', 'these', 'at', 'him', 'having', 'any', 'during', 'other', 'can', 'out', 'while', 't', 'wasn', 'those', 'few', 'itself', 're', 'which', 'my', 'below

In [60]:
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [61]:
#Removing the stopwords
def remove_stopwords(text,is_lower_case=False):
  tokens = tokenizer.tokenize(text)
  tokens = [token.strip() for token in tokens]
  if is_lower_case:
    filtered_tokens = [token for token in tokens if token not in stopword_list]
  else:
    filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
  filtered_text = ' '.join(filtered_tokens)
  return filtered_text

In [64]:
zomato['Review'][678]

'you need to improve your packing  had ordered veg thai green curry with rice  by the time it was delivered to me  curry had splashed all over the container   even leaking from package  zomato delivery person said it leaked while getting it  had second thoughts abt whether to consume it or not '

In [63]:
remove_stopwords(zomato['Review'][678])

'need improve packing ordered veg thai green curry rice time delivered curry splashed container even leaking package zomato delivery person said leaked getting second thoughts abt whether consume'

In [65]:
#Apply function on review column
zomato['Review']=zomato['Review'].apply(remove_stopwords)

In [67]:
#Processed example of randomly selected review text
zomato['Review'][300]

'haleem best place try avaialble multiple outlets haleem usually found ramdan season months march july served shah ghouse badam kheer must try'

## Stemming and Lemmatization

In [68]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
def simple_stemmer(text):
  ps = SnowballStemmer(language='english')
  return ' '.join([ps.stem(word) for word in tokenizer.tokenize(text)])

In [69]:
zomato['Review'][1]

'ambience good pleasant evening service prompt food good good experience soumen das kudos service'

In [70]:
simple_stemmer(zomato['Review'][1])

'ambienc good pleasant even servic prompt food good good experi soumen das kudo servic'

In [77]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
#Lemmatizer example
def lemmatize_all(sentence):
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith('NN'):
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):
            yield wnl.lemmatize(word, pos='a')
        else:
            yield word

def lemmatize_text(text):
    return ' '.join(lemmatize_all(text))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [75]:
zomato['Review'][1]

'ambience good pleasant evening service prompt food good good experience soumen das kudos service'

In [78]:
lemmatize_text(zomato['Review'][1])

'ambience good pleasant even service prompt food good good experience soumen das kudos service'

In [79]:
#Raw example of randomly selected review text
zomato['Review'][456]

'food ambience nice chiken chettinadu chicken pulao toooo good beer vodka appreciate kamal quick sevice'

In [80]:
#Apply Lemmatization
zomato['Review'] = zomato['Review'].apply(lemmatize_text)

In [81]:
#Processed example of randomly selected review text
zomato['Review'][456]

'food ambience nice chiken chettinadu chicken pulao toooo good beer vodka appreciate kamal quick sevice'

In lemmatisation, the text is first tokenised and passed to a part-of-speech (POS) tagger, which attaches a POS tag to each word. Each word is then converted to its lemma using both the word itself and its POS tag. A lemmatiser generally gives more accurate results than a stemmer, but it needs significantly more computation. Tagging mistakes also carry through: in review 1 above, 'evening' became 'even', presumably because the tagger labelled it as a verb.
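The tag handling in lemmatize_all can be isolated into a small lookup. The sketch below is plain Python with a hypothetical function name; it also maps adverb tags (RB) to 'r', which the notebook's function does not do, so treat that branch as an optional extension.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith('NN'):
        return 'n'   # noun
    if tag.startswith('VB'):
        return 'v'   # verb
    if tag.startswith('JJ'):
        return 'a'   # adjective
    if tag.startswith('RB'):
        return 'r'   # adverb (not handled in lemmatize_all)
    return None      # leave the word unchanged

for tag in ['NNS', 'VBD', 'JJR', 'RB', 'IN']:
    print(tag, '->', penn_to_wordnet(tag))
```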

## Creating features using Bag of Words model

In [82]:
#Transformed reviews
norm_reviews = zomato.Review

In [83]:
norm_reviews

0       ambience good food quite good saturday lunch c...
1       ambience good pleasant even service prompt foo...
2       must try great food great ambience thnx servic...
3       soumen das arun great guy behavior sincerety g...
4       food good order kodi drumstick basket mutton b...
                              ...                        
9995    madhumathi mahajan well start nice courteous s...
9996    place never disappoint us food courteous staff...
9997    bad rating mainly chicken bone find veg food a...
9998    personally love prefer chinese food couple tim...
9999    check try delicious chinese food see non veg l...
Name: Review, Length: 10000, dtype: object

In [84]:
zomato

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
0,Beyond Flavours,Rusha Chakraborty,ambience good food quite good saturday lunch c...,5,"1 Review , 2 Followers",5/25/2019 15:54,0
1,Beyond Flavours,Anusha Tirumalaneedi,ambience good pleasant even service prompt foo...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0
2,Beyond Flavours,Ashok Shekhawat,must try great food great ambience thnx servic...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0
3,Beyond Flavours,Swapnil Sarkar,soumen das arun great guy behavior sincerety g...,5,"1 Review , 1 Follower",5/24/2019 22:11,0
4,Beyond Flavours,Dileep,food good order kodi drumstick basket mutton b...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0
...,...,...,...,...,...,...,...
9995,Chinese Pavilion,Abhishek Mahajan,madhumathi mahajan well start nice courteous s...,3,"53 Reviews , 54 Followers",6/5/2016 0:08,0
9996,Chinese Pavilion,Sharad Agrawal,place never disappoint us food courteous staff...,4.5,"2 Reviews , 53 Followers",6/4/2016 22:01,0
9997,Chinese Pavilion,Ramandeep,bad rating mainly chicken bone find veg food a...,1.5,"65 Reviews , 423 Followers",6/3/2016 10:37,3
9998,Chinese Pavilion,Nayana Shanbhag,personally love prefer chinese food couple tim...,4,"13 Reviews , 144 Followers",5/31/2016 17:22,0


In [100]:
#Convert data to float
zomato['Rating'] = pd.to_numeric(zomato['Rating'],errors='coerce')

In [101]:
zomato.dtypes

Restaurant     object
Reviewer       object
Review         object
Rating        float64
Metadata       object
Time           object
Pictures        int64
dtype: object

In [102]:
def create_sentiment(rating):
  if rating>=3:
    sentiment = 'Positive'
  else:
    sentiment = 'Negative'
  return sentiment

zomato['Sentiment'] = zomato['Rating'].apply(create_sentiment)
zomato['Sentiment'].value_counts()

Positive    7514
Negative    2486
Name: Sentiment, dtype: int64
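One detail to be aware of: pd.to_numeric with errors='coerce' turns any rating it cannot parse into NaN, and NaN >= 3 evaluates to False, so such a row would be labelled 'Negative' by create_sentiment. Whether this dataset contains such rows is not shown here. A hypothetical variant that leaves those rows unlabelled instead:

```python
import math

def create_sentiment_safe(rating):
    """Return None for missing (NaN) ratings instead of labelling them."""
    if rating is None or math.isnan(rating):
        return None
    return 'Positive' if rating >= 3 else 'Negative'

nan = float('nan')
print(nan >= 3)                     # False, so the original function returns 'Negative'
print(create_sentiment_safe(nan))   # None
print(create_sentiment_safe(4.5))   # Positive
print(create_sentiment_safe(1.5))   # Negative
```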

In [103]:
from sklearn.preprocessing import LabelBinarizer

#Labelling the rating data
lb = LabelBinarizer()

#Transformed rating data
sentiment_data = lb.fit_transform(zomato['Sentiment'])
print(sentiment_data.shape)

(10000, 1)
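With two classes, LabelBinarizer produces a single column of 0s and 1s, assigned according to the sorted order of the labels. The plain-Python imitation below (not scikit-learn's own code, and with made-up labels) shows the resulting mapping; after fitting, lb.classes_ holds the order that was actually used.

```python
labels = ['Positive', 'Negative', 'Positive', 'Positive']

# Sort the distinct labels; with two classes, the first maps to 0 and the second to 1.
classes = sorted(set(labels))
encoded = [[classes.index(label)] for label in labels]

print(classes)
print(encoded)
```

So in this notebook a value of 1 corresponds to 'Positive' and 0 to 'Negative'.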


## Fitting the bag of words model on the entire dataset

In [104]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv_fit = cv.fit(norm_reviews)

In [141]:
cv_fit.vocabulary_

{'ambience': 788,
 'good': 5045,
 'food': 4574,
 'quite': 9310,
 'saturday': 10136,
 'lunch': 6969,
 'cost': 2957,
 'effective': 3909,
 'place': 8813,
 'sate': 10119,
 'brunch': 1963,
 'one': 8134,
 'also': 746,
 'chill': 2429,
 'friend': 4715,
 'parent': 8484,
 'waiter': 12590,
 'soumen': 10836,
 'das': 3246,
 'really': 9497,
 'courteous': 2988,
 'helpful': 5396,
 'pleasant': 8846,
 'even': 4120,
 'service': 10330,
 'prompt': 9129,
 'experience': 4231,
 'kudos': 6539,
 'must': 7682,
 'try': 12022,
 'great': 5130,
 'thnx': 11738,
 'pradeep': 8984,
 'subroto': 11204,
 'personal': 8684,
 'recommendation': 9545,
 'penne': 8639,
 'alfredo': 697,
 'pasta': 8538,
 'music': 7674,
 'background': 1304,
 'amazing': 772,
 'arun': 1069,
 'guy': 5213,
 'behavior': 1521,
 'sincerety': 10586,
 'course': 2985,
 'would': 12872,
 'like': 6787,
 'visit': 12527,
 'order': 8184,
 'kodi': 6482,
 'drumstick': 3803,
 'basket': 1436,
 'mutton': 7693,
 'biryani': 1703,
 'thanks': 11672,
 'serve': 10322,
 'well'

In [142]:
print(cv_fit)

CountVectorizer()
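To make the vocabulary mapping concrete, here is a simplified bag-of-words sketch in plain Python. It is an illustration of the idea rather than CountVectorizer's implementation: the three toy documents are made up, tokens are split on whitespace, and none of CountVectorizer's token filtering is reproduced.

```python
from collections import Counter

docs = ["food good good", "service good", "food bad"]

# Vocabulary: each distinct term gets a column index, in alphabetical order.
terms = sorted({word for doc in docs for word in doc.split()})
vocabulary = {term: idx for idx, term in enumerate(terms)}

def to_counts(doc):
    """Return one row of the document-term matrix for a single document."""
    counts = Counter(doc.split())
    return [counts.get(term, 0) for term in vocabulary]

print(vocabulary)
for doc in docs:
    print(to_counts(doc))
```

Each review becomes a row of term counts, and the number in vocabulary_ for a term is the column that holds its count.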


In [110]:
### Transform the train and the test dataset separately
norm_train_reviews=zomato['Review'][:8000]
print('train:','\n',norm_train_reviews[0])
norm_train_cv_reviews=cv_fit.transform(norm_train_reviews)

#Normalised test reviews
norm_test_reviews=zomato['Review'][8000:]
print('test:','\n',norm_test_reviews[8001])
norm_test_cv_reviews=cv_fit.transform(norm_test_reviews)

train: 
 ambience good food quite good saturday lunch cost effective good place sate brunch one also chill friend parent waiter soumen das really courteous helpful
test: 
 pleasant experience tfw stuff tangdi kabab fantastic succulent mildly flavour great salad go sizzler biryani great innovative touch right well cook cabbage base add new twist term biryani definitely must visit
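Slicing the first 8000 rows keeps the file order, and the rows appear to be grouped by restaurant, so the test set may contain only restaurants the model never saw during training. A shuffled split is one possible alternative. The sketch below uses only the standard library with a hypothetical helper name; scikit-learn's train_test_split is the more usual tool for this.

```python
import random

def shuffled_split(n_rows, train_fraction=0.8, seed=42):
    """Return two disjoint lists of row positions, shuffled reproducibly."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = shuffled_split(10000)
print(len(train_idx), len(test_idx))
```

The positions could then be used with zomato['Review'].iloc[train_idx] and sentiment_data[train_idx], so that reviews and labels stay aligned.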


### Splitting the output variable into test and train

In [113]:
#Splitting the sentiment data
train_sentiments=sentiment_data[:8000]
test_sentiments=sentiment_data[8000:]

In [116]:
#Calculating sentiment count
zomato['Sentiment'].value_counts()

Positive    7514
Negative    2486
Name: Sentiment, dtype: int64
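The classes are imbalanced, which is worth keeping in mind when judging the model. As a reference point, the snippet below computes the accuracy of always predicting the majority class, using the counts printed above. Note that these counts cover the full dataset, so the figure for the 2000-row test split may differ.

```python
counts = {'Positive': 7514, 'Negative': 2486}  # from value_counts() above

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(majority_label, baseline_accuracy)
```

A trained classifier is only useful to the extent that it beats this baseline.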

In [117]:
from sklearn.linear_model import LogisticRegression,SGDClassifier

#Training the model
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)

#Fitting the model for the bag of words
lr_bow=lr.fit(norm_train_cv_reviews,train_sentiments)
print(lr_bow)

LogisticRegression(C=1, max_iter=500, random_state=42)


In [118]:
#Predicting test-set sentiment from the bag-of-words features
lr_bow_predict=lr.predict(norm_test_cv_reviews)
print(lr_bow_predict)

[1 1 0 ... 1 1 1]
