<a href="https://www.kaggle.com/kamaljp/analyse-bengaluru-restaurants?scriptVersionId=86684496" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Purpose

Let the data come alive through efficient use of the plotting libraries.

Do EDA with simple transformations that others can reuse with minimal modifications.

Use the restaurant reviews to explore NLP modeling with multiple heuristics.

### What to Expect

A slew of transformations is needed to clean the data and get it ready for visualisation. The cuisines and dishes-liked columns in particular posed an interesting [challenge](#chal-1), and changing the data types and cleaning the dataset took some [imagination](#chal-2). These steps are hidden from viewers who only want the visualisations; just unhide the code to see them.

### Sneak Peek

The [North Star](#vis-17) in the dark sky of restaurants jumps out. The plots shine with vibrancy once the [effort](#vis-14) to prepare the data is complete. The ease with which plots could be [rendered](#vis-5) with simple commands was possible only after the transformations mentioned above. The visuals provide [interfaces](#vis-6) to check each category interactively. Take a look and see if there is a story somewhere that has been missed.

<a id='go_up'>PS: Purpose, What to Expect and Sneak Peek are hidden in the cell above; unhide it to see them. The blue words are links that take you to the relevant location or chart in the notebook.</a>

## Contents:

[Starting the data preparation and DataFrame creation](#datpre)

[Visual_14 Restaurant distribution on multiple factors?](#vis-14)

Wondering why Visual 14 is at the top? The idea for this visual came late in the EDA, but it captures the major part of the dataset.

[Visual_1 Restaurants with reservation table and online ordering](#vis-1)

[Visual_2 Locations that are online_ordering friendly](#vis-2)

[Visual_3 Location that are having Table Booking Restaurants](#vis-3)

[Visual_4 Restaurants Types and their locales](#vis-4)

[Visual_5 Dine Types](#vis-5)

[Visual_6 Review ratings and Votes](#vis-6)

[Visual_7 Where people Liked more dishes Vs Ratings](#vis-7)

[Visual_8 Which type of restaurant serves a variety of food](#vis-8)

[Visual_9 Which type of restaurant is pocket friendly](#vis-9)

[Visual_10 Where are pocket friendly restaurants](#vis-10)

[Visual_11 Do costly restaurants have best or worst rating?](#vis-11)

[Visual_12 Which type of restaurants have best or worst rating?](#vis-12)

[Visual_13 Which location has the best or worst rated hotels?](#vis-13)

We have explored the restaurants, their locations and their review ratings. The food has its own dimension: the types of food served and the dishes liked hold some insights of their own.

[Visual_15 Cuisine types served in Bangalore restaurants?](#vis-15)

[Visual_16 Dishes liked the most by Bangaloreans?](#vis-16)

[Visual_17 What will it cost you when ordering?](#vis-17)

[Visual_18 What will it cost you in a certain type of restaurant?](#vis-18)

[NLP Sentiment Analysis](#NLP)


In [None]:
import os
import numpy as np 
import pandas as pd 
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode
from plotly.subplots import make_subplots
import plotly.graph_objects as go
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 5000)
import warnings
warnings.filterwarnings("ignore")
#os.mkdir('/kaggle/working/individual_charts/')
import matplotlib.pyplot as plt
#Will come in handy to wrap the lengthy texts
import textwrap
#useful libraries and functions
#import sidetable as stb
from itertools import repeat
#Gives temporary control over the pandas display options
from pandas import option_context
#Importing Market Basket Analysis libraries
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

from wordcloud import WordCloud
from geopy.geocoders import Nominatim

import gensim
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

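def long_sentences_seperate(sentence, width=30):
    """Wrap long text at `width` characters, joining the lines with <br> for Plotly labels."""
    try:
        splittext = textwrap.wrap(sentence,width)
        text = '<br>'.join(splittext)#whitespace is removed, and the sentence is joined
        return text
    except (AttributeError, TypeError):
        #Non-string values (for example NaN) are returned unchanged
        return sentence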

def load_csv(base_dir,file_name):
    """Loads a CSV file into a Pandas DataFrame"""
    file_path = os.path.join(base_dir,file_name)
    df = pd.read_csv(file_path,low_memory=False)
    return df    

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Supporting Functions that are used at various locations in the notebook

#Function to reduce the names to just abbreviations
def shrnk_name(company):
    lngt = company.split(' ')
    temp = str()
    if len(lngt) > 1:
        for x in lngt:
            temp = temp + x[0]
        return temp
    else:
        return company

#Function that converts cost strings such as "1,200" into numbers.
#Thousands separators (",") are removed first; values containing "." become floats, the rest integers.
def convert_cost(x):
    temp = x.replace(',','')
    if '.' in temp:
        return float(temp)
    return int(temp)

In [None]:
base_dir = '../input/zomato-bangalore-restaurants'
file_name = 'zomato.csv'
main_df = load_csv(base_dir,file_name)

### <a id='datpre'> Data preparation and DataFrame creation</a>

Rooting out the null values, renaming the columns and creating new columns for better rendering of the dataset is carried out in the next few cells, which are hidden. For those curious to learn, here are some of the things you will find:

1) Creation of set to get the unique values of the cuisines and dishes

2) Using merge and concat operations on the dataframes

3) Creation of dataframes from the lists of cuisines and dishes, to find how frequently they are liked or served.

4) Data manipulation using the split and replace functions 
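
The steps in this list can be sketched on a tiny example. This is a minimal illustration in plain Python rather than the notebook's actual cells: `sample_cuisines` is made-up data, and `strip()` is used here where the notebook removes every space with `replace(' ','')`.

```python
from collections import Counter

# Hypothetical sample rows mimicking the comma-separated 'cuisines' column
sample_cuisines = [
    "North Indian, Chinese",
    "Chinese, Thai",
    "North Indian",
]

# Split each string on commas and trim the surrounding spaces
split_rows = [[item.strip() for item in row.split(",")] for row in sample_cuisines]

# A set keeps only the unique cuisine names
unique_cuisines = set(item for row in split_rows for item in row)

# A Counter records how often each cuisine is served across all rows
cuisine_counts = Counter(item for row in split_rows for item in row)

print(sorted(unique_cuisines))
print(cuisine_counts.most_common(2))
```

The same pattern applies to the `dish_liked` column; the notebook does the counting with `value_counts()` and `pd.concat` on the split columns.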

Feel free to explore. The Python idioms used are kept to those most of us are familiar with.


In [None]:
#Taking the important columns further for effective EDA
#.copy() gives an independent dataframe, so the assignments below do not act on a view of main_df
anlys_df = main_df[['name', 'online_order', 'book_table', 'rate', 'votes',
                    'location', 'rest_type', 'dish_liked', 'cuisines',
                    'approx_cost(for two people)', 'reviews_list']].copy()

In [None]:
#Doing some cleanUPs, general fillna() functions are not applicable here.
anlys_df.loc[anlys_df.dish_liked.isna(),'dish_liked'] = 'None_Liked'
anlys_df.loc[anlys_df.location.isna(),'location'] = 'not_provided'
anlys_df.loc[anlys_df.rest_type.isna(),'rest_type'] = 'Unknown'
anlys_df.loc[anlys_df.cuisines.isna(),'cuisines'] = 'Unknown'

#I am assuming the value here, will be changed, or even can be used as prediction set
anlys_df.loc[anlys_df['approx_cost(for two people)'].isna(),'approx_cost(for two people)'] = '0'
anlys_df.loc[anlys_df.rate.isna(),'rate'] = '0/5'
anlys_df.loc[anlys_df.rate == 'NEW','rate'] = '0/5'
anlys_df.loc[anlys_df.rate == '-','rate'] = '0/5'

#### <a id='chal-2'> Imagining the way to transform the data</a>

In [None]:
#dish_liked column is string type with multiple items
anlys_df['tot_dish_liked'] = anlys_df.dish_liked.apply(lambda x : len(x.split(',')))
anlys_df['tot_cuisines'] = anlys_df.cuisines.apply(lambda x : len(x.split(',')))

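#There are multiple reviews in the list, so creating a separate column with their count.
#reviews_list is a string holding a list of (rating, review) tuples; len() on the raw string would count
#characters, so the string is parsed first
import ast
anlys_df['tot_reviews'] = anlys_df.reviews_list.apply(lambda x: len(ast.literal_eval(x)))
#Can think about running NLP sentiment analysis on these reviews later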

#Getting the review rate of each restaurants
anlys_df['review_rate'] = anlys_df.rate.apply(lambda x: float(x.split('/')[0]))
anlys_df.drop('rate',axis=1,inplace=True)
#converting the votes column to integer for easy manipulation
anlys_df.votes = anlys_df.votes.astype(int)

#calculating votes in favour
anlys_df['temp'] = anlys_df.votes * anlys_df.review_rate
anlys_df['in_fav'] = anlys_df.temp.apply(lambda x: round(x/5,0))
anlys_df.drop('temp',axis=1,inplace=True)

#Converting approx cost to int
anlys_df['approx_cost_per_pair'] = anlys_df['approx_cost(for two people)'].apply(lambda x: convert_cost(x))
anlys_df.drop('approx_cost(for two people)',axis=1,inplace = True)
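
The rating clean-up and the "votes in favour" calculation in the cell above can be read as two small functions. This is a minimal sketch in plain Python; `parse_rate`, `votes_in_favour` and the sample values are hypothetical names and data introduced only for illustration.

```python
def parse_rate(rate):
    """Return the numeric part of an 'x.y/5' rating string; placeholders become 0.0."""
    if rate is None or rate.strip() in ("NEW", "-", ""):
        return 0.0
    return float(rate.split("/")[0])


def votes_in_favour(votes, review_rate):
    """Scale the vote count by the share of the 5-point rating that was awarded."""
    return round(votes * review_rate / 5, 0)


# Hypothetical (rate, votes) pairs
sample = [("4.1/5", 775), ("NEW", 0), ("3.8 /5", 918)]
for rate, votes in sample:
    numeric_rate = parse_rate(rate)
    print(rate, numeric_rate, votes_in_favour(votes, numeric_rate))
```

Treating `NEW` and `-` as 0 mirrors the assumption made in the clean-up cell; such rows could equally be set aside as a prediction set.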

In [None]:
#There are so many types of restaurants. To help visualisation, creating this column from the first listed type
anlys_df['dine_type'] = anlys_df.rest_type.apply(lambda x: x.split(',')[0])
anlys_df.loc[anlys_df.online_order == 'Yes','online_order'] = 'online'
anlys_df.loc[anlys_df.online_order == 'No','online_order'] = 'offline'
anlys_df.loc[anlys_df.book_table == 'Yes','book_table'] = 'booking_allowed'
anlys_df.loc[anlys_df.book_table == 'No','book_table'] = 'no_booking'

#### <a id='chal-1'> Idea is to convert the lists of cuisines and dishes liked into separate columns</a>

In [None]:
anlys_df.cuisines = anlys_df.cuisines.apply(lambda x: x.replace(' ','').split(','))
anlys_df.dish_liked = anlys_df.dish_liked.apply(lambda x: x.replace(' ','').split(','))


#convert the lists into separate dataframes
cusines = pd.DataFrame(anlys_df.cuisines.to_list(),columns=['C1','C2','C3','C4','C5','C6','C7','C8'])
dishes = pd.DataFrame(anlys_df.dish_liked.to_list(),columns=['D1','D2','D3','D4','D5','D6','D7'])

#merge the dataframes on to anlys_df
anlys_df = pd.merge(left=anlys_df,right=cusines,how='left',left_index=True,right_index=True)
anlys_df = pd.merge(left=anlys_df,right=dishes,how='left',left_index=True,right_index=True)

#Need a string value in place of None
anlys_df.fillna('NA',inplace=True)

In [None]:
# Idea is to locate the unique cuisines served and the unique dishes liked across all the restaurants
cuisine_set = set()

#C1..C8 sit at column positions 16 to 23
for cols in anlys_df.columns[16:24]:
    #print(cols)
    for cuisi in anlys_df[cols].apply(lambda x: x.replace(' ','')):
        cuisine_set.add(cuisi)
        
dishes_set = set()

#D1..D7 sit at column positions 24 onwards
for cols in anlys_df.columns[24:]:
    #print(cols)
    for cuisi in anlys_df[cols].apply(lambda x: x.replace(' ','')):
        dishes_set.add(cuisi)

#'NA' is only the filler for empty slots, so it is not counted
cuisine_set.discard('NA')
dishes_set.discard('NA')
        
print('There are total {} unique cuisines listed on Zomato'.format(len(cuisine_set)))
print('There are total {} unique dishes listed on Zomato'.format(len(dishes_set)))

In [None]:
#value_counts() gives pandas Series which can be concatenated to pandas dataframe directly.
#market_basket algorithms can be used to find relationships between the dishes liked, and the cuisines served

#cuisines Dataframe
cusines_df = pd.DataFrame()
for cols in anlys_df.columns[16:24]:
    cusines_df = pd.concat([cusines_df,anlys_df[cols].value_counts()],axis = 1,ignore_index=True)
    
cusines_df.columns =['First','second','third','fourth','fifth','sixth','seventh','eighth']
cusines_df.fillna(0,inplace=True)
cusines_df['total_served']= cusines_df.sum(axis=1)

#dishes Dataframe
dishes_df = pd.DataFrame()
for cols in anlys_df.columns[24:]:
    dishes_df = pd.concat([dishes_df,anlys_df[cols].value_counts()],axis = 1,ignore_index=True)
    
dishes_df.columns =['First','second','third','fourth','fifth','sixth','seventh']
dishes_df.fillna(0,inplace=True)

dishes_df['total_liked']= dishes_df.sum(axis=1)
print(cusines_df.head(1))
print(dishes_df.head(1))

In [None]:
# Making the Rows to Columns
cuis_transpose = cusines_df.T
dish_transpose = dishes_df.T

#Binarising the counts: any value greater than 0 becomes 1, as the Market Basket algorithm requires
for cols in cuis_transpose.columns: 
    cuis_transpose[cols] = cuis_transpose[cols].apply(lambda x : 1 if x>0 else 0)

for cols in dish_transpose.columns: 
    dish_transpose[cols] = dish_transpose[cols].apply(lambda x : 1 if x>0 else 0)

In [None]:
#Collecting garbage memory
import gc
gc.collect()

[Go To Contents](#go_up)

### <a id='vis-14'> Visual_14 Restaurant distribution on multiple factors?</a>

Every dataset has mainly two types of data: continuous and discrete. A treemap chart mixes the two vibrantly. The color gradients created by the continuous variable and the neat demarcation of the discrete categorical variables can be both mesmerising and informative. The idea struck as I was nearing the end of my EDA, when I realised that no multi-factor EDA had been done.

The underlying algorithm in the treemap takes care of many details that would be a challenge to set by hand. For example, in the treemap below the review rate is a continuous variable used for color, while location is a discrete category.

The color of an entire location is averaged from its underlying components. This behaviour is built in rather than explicit; you come across it either by surprise or through continued research and experimentation.
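
The averaging can be checked by hand. The sketch below assumes what the Plotly Express documentation describes for a numeric `color`: a parent sector takes the average of its children's color values, weighted by their `values`. The rows and the helper `parent_colour` are made up for illustration.

```python
# Hypothetical leaf rows: (location, dine_type, restaurant_count, review_rate)
leaves = [
    ("BTM", "Quick Bites", 30, 3.5),
    ("BTM", "Casual Dining", 10, 4.1),
    ("HSR", "Cafe", 5, 4.0),
]


def parent_colour(rows, location):
    """Average review_rate of a location's leaves, weighted by restaurant count."""
    selected = [(count, rate) for loc, _, count, rate in rows if loc == location]
    total = sum(count for count, _ in selected)
    return sum(count * rate for count, rate in selected) / total


print(round(parent_colour(leaves, "BTM"), 3))
print(round(parent_colour(leaves, "HSR"), 3))
```

A location dominated by many modestly rated restaurants therefore shows a color close to that modest rating, even if it also hosts a few highly rated ones.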


In [None]:
grp_14 = anlys_df.groupby(['location','tot_dish_liked','tot_cuisines','online_order', 'book_table', 
                           'tot_reviews', 'review_rate','approx_cost_per_pair','dine_type'])['name'].count().reset_index()
vis_14 = px.treemap(data_frame=grp_14,
                    path=['location','dine_type','tot_dish_liked','tot_cuisines'],
                    names='approx_cost_per_pair',
                    values = 'name',
                    color='review_rate',
                    title='Restaurants Distribution on various factors')
vis_14.update_layout(height = 1000)
#del grp_14. Not deleting since used in vis 17 and 18
vis_14.show()

### <a id='vis-1'> Visual_1 Restaurants with reservation table and online ordering</a>

In [None]:
grp_1 = anlys_df.groupby(['book_table','online_order'])['name'].count().reset_index()
vis_1 = px.bar(data_frame=grp_1,x='book_table',y='name',color='online_order')
del grp_1
vis_1.show()

[Go To Contents](#go_up)

### <a id='vis-2'> Visual_2 Locations that are online_ordering friendly</a>

In [None]:
grp_2 = anlys_df.groupby(['location','online_order'])['name'].count().reset_index()
vis_2 = px.bar(data_frame=grp_2,y='location',x='name',color='online_order',orientation='h')
vis_2.update_layout(yaxis={'categoryorder':'total ascending'},height = 1000)
del grp_2
vis_2.show()

[Go To Contents](#go_up)

### <a id='vis-3'> Visual_3 Location that are having Table Booking Restaurants</a>

In [None]:
grp_3 = anlys_df.groupby(['location','book_table'])['name'].count().reset_index()
vis_3 = px.bar(data_frame=grp_3,y='location',x='name',color='book_table',orientation='h')
vis_3.update_layout(yaxis={'categoryorder':'total ascending'},height = 1000)
del grp_3
vis_3.show()

[Go To Contents](#go_up)

### <a id='vis-4'> Visual_4 Restaurants Types and their locales </a>

In [None]:
grp_4 = anlys_df.groupby(['dine_type','location','book_table'])['name'].count().reset_index()
vis_4 = px.bar(data_frame=grp_4,y='location',x='name',color='dine_type',title = 'Restaurant Types and Locales')
vis_4.update_layout(yaxis={'categoryorder':'total ascending'},height = 1200)
del grp_4
vis_4.show()

[Go To Contents](#go_up)

### <a id='vis-5'> Visual_5 Dine Types </a>

In [None]:
grp_5 = anlys_df.groupby('dine_type')['name'].count().reset_index()
vis_5 = px.bar(data_frame=grp_5,y='dine_type',x='name',
               title = 'Dine Types')
vis_5.update_layout(yaxis={'categoryorder':'total ascending'},height = 1200)
del grp_5
vis_5.show()

[Go To Contents](#go_up)

### <a id='vis-6'> Visual_6 Review ratings and Votes </a>

In [None]:
vis_6 = px.scatter(data_frame=anlys_df,x='review_rate',y='votes',animation_frame='dine_type',
                   color='location',title='Review Rating and Votes')
vis_6.update_layout(height = 1200)
vis_6.show()

[Go To Contents](#go_up)

### <a id='vis-7'> Visual_7 Where people Liked more dishes Vs Ratings</a>

In [None]:
grp_7 = anlys_df.groupby(['tot_dish_liked','dine_type'])['name'].count().reset_index()
vis_7 = px.bar(data_frame=grp_7,x='tot_dish_liked',y='name',color='dine_type',
               title='Where People liked more dishes')
vis_7.update_layout(xaxis={'categoryorder':'total ascending'},height = 1200)
del grp_7
vis_7.show()

[Go To Contents](#go_up)

### <a id='vis-8'> Visual_8 Which type of restaurant serves a variety of food</a>

In [None]:
grp_8 = anlys_df.groupby(['tot_cuisines','dine_type'])['name'].count().reset_index()
grp_8.tot_cuisines = grp_8.tot_cuisines.astype('category')
vis_8 = px.bar(data_frame=grp_8,y='dine_type',x='name',color='tot_cuisines',
               title='Where variety of cuisine is more?',orientation='h')
vis_8.update_layout(yaxis={'categoryorder':'total ascending'},height = 1200)
del grp_8
vis_8.show()

[Go To Contents](#go_up)

### <a id='vis-9'> Visual_9 Which type of restaurant is pocket friendly</a>

In [None]:
grp_9 = anlys_df.groupby(['approx_cost_per_pair','dine_type'])['name'].count().reset_index()
grp_9.approx_cost_per_pair = grp_9.approx_cost_per_pair.astype('category')
vis_9 = px.bar(data_frame=grp_9,y='dine_type',x='name',color='approx_cost_per_pair',
               title='Which type of restaurant is pocket friendly?',orientation='h')
vis_9.update_layout(yaxis={'categoryorder':'total ascending'},height = 1200)
del grp_9
vis_9.show()

[Go To Contents](#go_up)

### <a id='vis-10'> Visual_10 Where are pocket friendly restaurants</a>

In [None]:
grp_10 = anlys_df.groupby(['approx_cost_per_pair','location'])['name'].count().reset_index()
grp_10.approx_cost_per_pair = grp_10.approx_cost_per_pair.astype('category')
vis_10 = px.bar(data_frame=grp_10,y='location',x='name',color='approx_cost_per_pair',
               title='Where are pocket friendly restaurants?',orientation='h')
vis_10.update_layout(yaxis={'categoryorder':'total ascending'},height = 1200)
del grp_10
vis_10.show()

[Go To Contents](#go_up)

### <a id='vis-11'> Visual_11 Do costly restaurants have best or worst rating?</a>

In [None]:
grp_11 = anlys_df.groupby(['approx_cost_per_pair','review_rate'])['name'].count().reset_index()
grp_11.approx_cost_per_pair = grp_11.approx_cost_per_pair.astype('category')
vis_11 = px.bar(data_frame=grp_11,y='review_rate',x='name',color='approx_cost_per_pair',
               title='Do costly restaurants have better rating?',orientation='h')
vis_11.update_layout(yaxis={'categoryorder':'total ascending'},height = 800)
del grp_11
vis_11.show()

[Go To Contents](#go_up)

### <a id='vis-12'> Visual_12 Which type of restaurants have best or worst rating?</a>

In [None]:
grp_12 = anlys_df.groupby(['dine_type','review_rate'])['name'].count().reset_index()
grp_12.dine_type = grp_12.dine_type.astype('category')
vis_12 = px.bar(data_frame=grp_12,y='review_rate',x='name',color='dine_type',
               title='Which type of restaurant has a better rating?',orientation='h')
vis_12.update_layout(yaxis={'categoryorder':'total ascending'},height = 800)
del grp_12
vis_12.show()

[Go To Contents](#go_up)

### <a id='vis-13'> Visual_13 Which location has the best or worst rated hotels?</a>

In [None]:
grp_13 = anlys_df.groupby(['location','review_rate'])['name'].count().reset_index()
vis_13 = px.bar(data_frame=grp_13,y='location',x='name',color='review_rate',
               title='Which location has a better rating?',orientation='h')
vis_13.update_layout(yaxis={'categoryorder':'total ascending'},height = 800)
del grp_13
vis_13.show()

[Go To Contents](#go_up)

### <a id='vis-15'> Visual_15 Cuisine types served in Bangalore restaurants?</a>

In [None]:
# changing all the not applicable values to 0 
cusines_df.loc[cusines_df.index == 'NA',:] = 0
dishes_df.loc[dishes_df.index == 'NA',:] = 0

In [None]:
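#considering only the top 25 cuisine types and dishes, ranked by their totals, for the visualisation.
#The 'None_Liked' placeholder is dropped so that it does not appear as a dish.
cuis_cons = cusines_df.sort_values('total_served',ascending=False)[:25]
dish_cons = dishes_df.drop(index='None_Liked',errors='ignore').sort_values('total_liked',ascending=False)[:25]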

In [None]:
vis_15 = make_subplots(rows=8,cols=1)

x = 1
for colum in cuis_cons.columns[:-1]:
    vis_15.add_trace(go.Bar(orientation='h',x=cuis_cons[colum],y=cuis_cons.index,name=colum),row=x,col=1)
    x = x + 1
    vis_15.update_yaxes(categoryorder="total descending")

vis_15.update_layout(height = 1500,title='Top cuisines offered at Restaurants')
vis_15.show()

[Go To Contents](#go_up)

### <a id='vis-16'> Visual_16 Dishes liked the most by Bangaloreans?</a>

In [None]:
vis_16 = make_subplots(rows=7,cols=1)

x = 1
for colum in dish_cons.columns[:-1]:
    vis_16.add_trace(go.Bar(orientation='h',x=dish_cons[colum],y=dish_cons.index,name=colum),row=x,col=1)
    x = x + 1
    vis_16.update_yaxes(categoryorder="total descending") #This option is great find

vis_16.update_layout(height = 1500,title='Dishes liked the most by Bangaloreans')
vis_16.show()

[Go To Contents](#go_up)

### <a id='vis-17'> Visual_17 What will it cost you when ordering?</a>

This question was raised by my dad. The treemap below goes a step further and shows which areas have the costlier restaurants; they stand out like a "North Star" above a dark ocean.

In [None]:
vis_17 = px.treemap(data_frame=grp_14,
                    path=['location','book_table','online_order'],
                    names='review_rate',
                    values = 'name',
                    color='approx_cost_per_pair',
                    title='What will it cost you when ordering?')
vis_17.update_layout(height = 1000)
vis_17.show()

[Go To Contents](#go_up)

### <a id='vis-18'> Visual_18 What will it cost you in a certain type of restaurant?</a>

In [None]:
vis_18 = px.treemap(data_frame=grp_14,
                    path=['location','dine_type','book_table','online_order'],
                    values = 'approx_cost_per_pair',
                    color='dine_type',
                    title='What will it cost in a certain type of restaurant?')
vis_18.update_layout(height = 1000)
#del grp_14. Not deleting since used in vis 17 and 18
vis_18.show()

Further analysis will continue as new questions arise. This dataset is a treasure trove that can be dug into for more insights and machine learning experimentation:

1) Review ratings predictions using the NLP sentiment analysis

2) Classification of the restaurant type based on multiple factors, and providing the probability

3) Choropleth and Scatter Geo plot rendering, by merging the location coordinates of the restaurants

4) Association rules can be generated to show the "Food you may like" by using the food someone has liked. 

5) The population of the location can be predicted based on the number of restaurants
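
Item 4 rests on two measures, support and confidence. The sketch below computes them by hand in plain Python on made-up baskets; `baskets`, `support` and `confidence` are hypothetical names for illustration. The `apriori` and `association_rules` functions from `mlxtend`, imported at the top of the notebook, would automate this on a one-hot encoded table.

```python
# Hypothetical baskets: dishes liked at four restaurants
baskets = [
    {"Biryani", "Chicken", "Pasta"},
    {"Biryani", "Chicken"},
    {"Pasta", "Burgers"},
    {"Biryani", "Pasta"},
]


def support(itemset):
    """Share of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """How often consequent appears in the baskets that contain antecedent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"Biryani"}))
print(confidence({"Chicken"}, {"Biryani"}))
```

A rule such as "Chicken, so you may like Biryani" is worth suggesting only when both its support and its confidence are reasonably high.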

Want to review any other chart? Go to the contents using this [link](#go_up).

In [None]:
import gc
gc.collect()

### <a id='NLP'> NLP Sentiment Analysis </a>

In [None]:
import pkg_resources
import pip
installedPackages = {pkg.key for pkg in pkg_resources.working_set}
required = {'nltk', 'spacy', 'textblob', 'backtrader'}
missing = required - installedPackages
if missing:
    !pip install nltk==3.4
    !pip install textblob==0.15.3
    !pip install -U SpaCy==2.2.0
    !python -m spacy download en_core_web_lg
    !pip install backtrader==1.9.74.123    

In [None]:
#NLP libraries
from textblob import TextBlob
import spacy
from tqdm import tqdm #library to show the progress bar
import re
import nltk
import warnings
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
import csv

from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from gensim.models import word2vec

from sklearn.model_selection import train_test_split

#Run the command python -m spacy download en_core_web_lg to download this model
#https://spacy.io/models
import en_core_web_lg
nlp = en_core_web_lg.load()

#Libraries for processing the news headlines
from lxml import etree
import json
from io import StringIO
from os import listdir
from os.path import isfile, join
from pandas.tseries.offsets import BDay
from scipy.stats.mstats import winsorize
from copy import copy

# Libraries for Classification for modeling the sentiments
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Keras package for the deep learning model for the sentiment prediction. 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Dropout, Activation
from keras.layers.embeddings import Embedding

# Load libraries
import seaborn as sns
import datetime
from datetime import date
import matplotlib.pyplot as plt

#Additional Libraries 
import json  
import zipfile
import os.path
import sys

[Go To Contents](#go_up)

In [None]:
#Creating subset of the anlys dataframe to start our NLP sentiment analysis(main_df has missing values!!!)
nlp_df = anlys_df[['name','tot_reviews','reviews_list','review_rate','votes','in_fav']]
nlp_df.info()

In [None]:
#Using only the first 100 restaurants; loading all the reviews for model building is unnecessary at this moment
part_df = anlys_df[:100]

In [None]:
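#Getting the ratings and the reviews stripped
import ast
part_ratings = []

for name,ratings in tqdm(zip(part_df['name'],part_df['reviews_list'])):
    #literal_eval parses the stringified list without executing arbitrary code
    ratings = ast.literal_eval(ratings)
    for score, doc in ratings:
        if score:
            score = score.strip("Rated").strip()
            #Remove only the leading 'RATED' marker; str.strip('RATED') would also eat trailing capitals
            if doc.startswith('RATED'):
                doc = doc[len('RATED'):]
            doc = doc.strip()
            score = float(score)
            part_ratings.append([name,score, doc])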

In [None]:
sample_rating_df = pd.DataFrame(part_ratings,columns=['name','rating','review'])
sample_rating_df.head()

In [None]:
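#The nlp model that has been instantiated holds vectors for many tokens of the English vocabulary. Each token
#is represented using 300 numbers.
#The expression below converts each review into the mean of its token vectors. Multiplying by np.ones(300)
#broadcasts the NaN produced by an empty review into a 300-element row, so the array stays rectangular.
#np is used directly because the pd.np alias has been removed from recent pandas versions.
part_vectors = np.array([np.array([token.vector for token in nlp(s)]).mean(axis=0)*np.ones((300)) \
                           for s in sample_rating_df['review']])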

In [None]:
#The vectors are created from the dictionary of already existing library, so the below sentence becomes a 300
#element array which is in turn created by individual words that the sentence makes. This concept is later
#useful in understanding other modeling methods.
vec = [token.vector for token in nlp('Restaurant location was very calm and futuristic')]
vec[0].shape

#Each word creates 300 element array, and then sentence is converted to 300 element array.
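
The averaging described in the comments above can be sketched without spaCy. This is a toy illustration in plain Python: `toy_vectors` holds made-up 4-dimensional embeddings (the `en_core_web_lg` model uses 300 dimensions), and `sentence_vector` is a hypothetical helper, not part of any library.

```python
# Hypothetical 4-dimensional word embeddings
toy_vectors = {
    "calm": [0.2, 0.0, 0.4, 0.6],
    "location": [0.4, 0.2, 0.0, 0.2],
}


def sentence_vector(sentence, vectors, dim=4):
    """Average the vectors of the known words; return zeros when no word is known."""
    known = [vectors[word] for word in sentence.lower().split() if word in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values position by position across all word vectors
    return [sum(values) / len(known) for values in zip(*known)]


print(sentence_vector("Calm location", toy_vectors))
print(sentence_vector("unknown words", toy_vectors))
```

Whatever the length of the sentence, the result always has the same number of elements as a single word vector, which is what lets each review become one fixed-width row of features.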

In [None]:
#There will be total 300 columns of numbers, and one column of existing review ratings. Now the language
#modeling problem is simply a machine learning problem that will be solved using the traditional
#classification algorithms.

print('The shape of the part vector is {}'.format(part_vectors.shape))
print('The number of reviews that were collected from the first {} restaurants is {}'.format(len(part_df), sample_rating_df.shape[0]))

[Go To Contents](#go_up)

### <a id='startM'> Modeling starts with the reviews of the first 100 restaurants </a>

In [None]:
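# split out validation dataset for the end
#The target must be a class label, not the review text. Assumption: ratings above 2.5 count as positive (1)
#and the rest as negative (0), the same threshold used for the 'sent' column later in the notebook
Y = (sample_rating_df["rating"] > 2.5).astype(int).values
X = part_vectors

#Check if there are infinites in the array
#print('Are there infinite values: {}'.format(np.all(np.isfinite(X))))
#Check if there are null values in the array. It seems there are
#print('Are there Null values: {}'.format(np.any(np.isnan(X))))
X = np.nan_to_num(X) #Running this function to replace the null values
np.any(np.isnan(X))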

from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
validation_size = 0.3
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)

# test options for classification
num_folds = 10
seed = 7
scoring = 'accuracy'

# spot check the algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))#default solver doesnt work
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
#Neural Network
models.append(('NN', MLPClassifier()))
#Ensable Models 
models.append(('RF', RandomForestClassifier()))

### <a id='Run'> Executing the multiple models </a>

In [None]:
results = []
names = []
kfold_results = []
test_results = []
train_results = []
for name, model in models:
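    #random_state only takes effect when shuffle=True; recent scikit-learn versions raise an error otherwise
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)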
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    #msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    #print(msg)
    # Full Training period
    res = model.fit(X_train, Y_train)
    train_result = accuracy_score(Y_train, res.predict(X_train))
    train_results.append(train_result)
    
    # Test results
    test_result = accuracy_score(Y_test, res.predict(X_test))
    test_results.append(test_result)    
    
    msg = "%s: %f (%f) %f %f" % (name, cv_results.mean(), cv_results.std(), train_result, test_result)
    print(msg)
    #scikit-learn expects (y_true, y_pred): rows are the actual classes, columns the predictions
    print(confusion_matrix(Y_test, res.predict(X_test)))
    #print(classification_report(Y_test, res.predict(X_test)))

In [None]:
# compare algorithms
from matplotlib import pyplot
fig = pyplot.figure()
ind = np.arange(len(names))  # the x locations for the groups
width = 0.35  # the width of the bars
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
#The plotted values are accuracy scores, so the bars are labelled as accuracy
pyplot.bar(ind - width/2, train_results,  width=width, label='Train Accuracy')
pyplot.bar(ind + width/2, test_results, width=width, label='Test Accuracy')
fig.set_size_inches(15,8)
pyplot.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)
pyplot.show()

In [None]:
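import ast
all_ratings = []

for name,ratings in tqdm(zip(anlys_df['name'],anlys_df['reviews_list'])):
    #literal_eval parses the stringified list without executing arbitrary code
    ratings = ast.literal_eval(ratings)
    for score, doc in ratings:
        if score:
            score = score.strip("Rated").strip()
            #Remove only the leading 'RATED' marker; str.strip('RATED') would also eat trailing capitals
            if doc.startswith('RATED'):
                doc = doc[len('RATED'):]
            doc = doc.strip()
            score = float(score)
            all_ratings.append([name,score, doc])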

In [None]:
rating_df=pd.DataFrame(all_ratings,columns=['name','rating','review'])
rating_df.to_csv('ratings.csv')
rating_df.head()

In [None]:
rest=anlys_df['name'].value_counts()[:9].index
def produce_wordcloud(rest):
    
    plt.figure(figsize=(20,30))
    for i,r in enumerate(rest):
        plt.subplot(3,3,i+1)
        corpus=rating_df[rating_df['name']==r]['review'].values.tolist()
        corpus=' '.join(x  for x in corpus)
        wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
                      width=800, height=800).generate(corpus)
        plt.imshow(wordcloud)
        plt.title(r)
        plt.axis("off")

        
produce_wordcloud(rest)

[Go To Contents](#go_up)

In [None]:
#Converting the ratings to either 0 or 1
rating_df['sent']=rating_df['rating'].apply(lambda x: 1 if int(x)>2.5 else 0)

### Using an existing library called TextBlob

In [None]:
sample_rating_df['sentiment_textblob'] = [TextBlob(s).sentiment.polarity for s in sample_rating_df['review']] 
sample_rating_df.head(10)

In [None]:
vis_19 = go.Figure()

#sentiment_textblob was computed on sample_rating_df, so both axes are taken from that dataframe
vis_19.add_trace(go.Scatter(x=sample_rating_df['rating'],y=sample_rating_df['sentiment_textblob'],mode='markers'))
vis_19.update_xaxes(title='Actual Rating')
vis_19.update_yaxes(title='Sentiment by TB')
vis_19.update_layout(title='Comparing the actual rating with TB rating')
vis_19.show()

In [None]:
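stops=set(stopwords.words('english'))

lem=WordNetLemmatizer()
#creating corpus for the positive sentiment reviews: each review is tokenised first, so that stop words
#are dropped and the remaining words are lemmatised one by one
positive_reviews = rating_df[rating_df['sent']==1]['review'][:3000]
positive_tokens=[lem.lemmatize(w) for review in positive_reviews for w in word_tokenize(review.lower()) if w not in stops]
corpus_positive =' '.join(positive_tokens)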

In [None]:
vect=TfidfVectorizer()
vect_fit_pos=vect.fit(positive_tokens)

In [None]:
#Latend Drichlet Model
id_map=dict((v,k) for k,v in vect.vocabulary_.items()) #Changes the items and keys
vectorized_data=vect_fit_pos.transform(tokens)
gensim_corpus=gensim.matutils.Sparse2Corpus(vectorized_data,documents_columns=False)
ldamodel = gensim.models.ldamodel.LdaModel(gensim_corpus,id2word=id_map,num_topics=5,random_state=34,passes=25)
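
The `id_map` step exists because the vectorizer's `vocabulary_` maps each word to its column index, while the LDA model's `id2word` argument needs the opposite direction. A minimal sketch of that inversion, with a made-up vocabulary:

```python
# Toy stand-in for vect.vocabulary_ (word -> column index); values are made up
toy_vocabulary = {"food": 0, "good": 1, "service": 2}

# Swap keys and values to get column index -> word
toy_id_map = {index: word for word, index in toy_vocabulary.items()}
print(toy_id_map)
```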

In [None]:
counter=Counter(positive_tokens) #Count words; a Counter over the joined string would count characters
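
`Counter` counts whatever items its input yields. A string yields characters and a list of tokens yields words, so the word counts used for the topic keywords need the token list. A quick check on a made-up sentence, with `str.split()` standing in for `word_tokenize`:

```python
from collections import Counter

toy_text = "good food good service"

char_counts = Counter(toy_text)          # iterates over characters
word_counts = Counter(toy_text.split())  # iterates over words

print(char_counts["good"])  # no single character equals "good"
print(word_counts["good"])
```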

In [None]:
import matplotlib.colors as mcolors

out=[]
topics=ldamodel.show_topics(formatted=False)
for i,topic in topics:
    for word,weight in topic:
        out.append([word,i,weight,counter[word]])

dataframe = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])        


# Plot Word Count and Weights of Topic Keywords
fig, axes = plt.subplots(2, 2, figsize=(8,6), sharey=True, dpi=160)
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
for i, ax in enumerate(axes.flatten()):
    ax.bar(x='word', height="word_count", data=dataframe.loc[dataframe.topic_id==i, :], color=cols[i], width=0.3, alpha=0.3, label='Word Count')
    ax_twin = ax.twinx()
    ax_twin.bar(x='word', height="importance", data=dataframe.loc[dataframe.topic_id==i, :], color=cols[i], width=0.2, label='Weights')
    ax.set_ylabel('Word Count', color=cols[i])
    #ax_twin.set_ylim(0, 0.030); ax.set_ylim(0, 3500)
    ax.set_title('Topic: ' + str(i), color=cols[i], fontsize=8)
    ax.tick_params(axis='y', left=False)
    ax.set_xticklabels(dataframe.loc[dataframe.topic_id==i, 'word'], rotation=30, horizontalalignment= 'right')
    ax.legend(loc='upper left'); ax_twin.legend(loc='upper right')

fig.tight_layout(w_pad=2)    
fig.suptitle('Word Count and Importance of Topic Keywords', fontsize=8, y=1.05)    
plt.show()

In [None]:
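stops=set(stopwords.words('english'))
lem=WordNetLemmatizer()
#Building the model with negative reviews
#Each review is tokenized first, so that lemmatization and stopword removal act on words and not on whole reviews.
tokens=[lem.lemmatize(w.lower()) for review in rating_df[rating_df['sent']==0]['review'][:3000]
        for w in word_tokenize(review) if w.lower() not in stops]
corpus=' '.join(tokens)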

In [None]:
vect=TfidfVectorizer()
vect_fit=vect.fit(tokens)

id_map=dict((v,k) for k,v in vect.vocabulary_.items())
vectorized_data=vect_fit.transform(tokens)

gensim_corpus=gensim.matutils.Sparse2Corpus(vectorized_data,documents_columns=False)
ldamodel = gensim.models.ldamodel.LdaModel(gensim_corpus,id2word=id_map,num_topics=5,random_state=34,passes=25)


In [None]:
counter=Counter(tokens) #Count words; a Counter over the joined string would count characters

In [None]:
out=[]
topics=ldamodel.show_topics(formatted=False)
for i,topic in topics:
    for word,weight in topic:
        out.append([word,i,weight,counter[word]])

dataframe = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])        


# Plot Word Count and Weights of Topic Keywords
fig, axes = plt.subplots(2, 2, figsize=(8,6), sharey=True, dpi=160)
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
for i, ax in enumerate(axes.flatten()):
    ax.bar(x='word', height="word_count", data=dataframe.loc[dataframe.topic_id==i, :], color=cols[i], width=0.3, alpha=0.3, label='Word Count')
    ax_twin = ax.twinx()
    ax_twin.bar(x='word', height="importance", data=dataframe.loc[dataframe.topic_id==i, :], color=cols[i], width=0.2, label='Weights')
    ax.set_ylabel('Word Count', color=cols[i])
    #ax_twin.set_ylim(0, 0.030); ax.set_ylim(0, 3500)
    ax.set_title('Topic: ' + str(i), color=cols[i], fontsize=8)
    ax.tick_params(axis='y', left=False)
    ax.set_xticklabels(dataframe.loc[dataframe.topic_id==i, 'word'], rotation=30, horizontalalignment= 'right')
    ax.legend(loc='upper left'); ax_twin.legend(loc='upper right')

fig.tight_layout(w_pad=2)    
fig.suptitle('Word Count and Importance of Topic Keywords', fontsize=8, y=1.05)    
plt.show()

In [None]:
stops=set(stopwords.words('english'))
lem=WordNetLemmatizer()
corpus=[]
for review in tqdm(rating_df['review'][:10000]):
    words=[]
    for x in word_tokenize(review):
        x=lem.lemmatize(x.lower())
        if x not in stops:
            words.append(x)
            
    corpus.append(words)

In [None]:
model = word2vec.Word2Vec(corpus, vector_size=100, window=20, min_count=200, workers=4)

In [None]:
from sklearn.manifold import TSNE
def tsne_plot(model):
    "Creates a t-SNE model from the word vectors and plots it"
    labels = []
    tokens = []

    for word in model.wv.key_to_index:
        tokens.append(model.wv[word])
        labels.append(word)
    
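    #Note: recent scikit-learn releases rename n_iter to max_iter; perplexity must be smaller than the number of words
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)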
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(10, 10)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
tsne_plot(model)

In [None]:
positive=rating_df[rating_df['rating']>3]['review'][:2000]
negative=rating_df[rating_df['rating']<2.5]['review'][:2000]

def return_corpus(df):
    "Returns, for each review, the list of its adjectives (POS tag 'JJ')"
    corpus=[]
    for review in df:
        tagged=nltk.pos_tag(word_tokenize(review))
        adj=[]
        for x in tagged:
            if x[1]=='JJ':
                adj.append(x[0])
        corpus.append(adj)
    return corpus

In [None]:
corpus=return_corpus(positive)
model = word2vec.Word2Vec(corpus, vector_size=100, min_count=10,window=20, workers=4)
tsne_plot(model)
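
The adjective filter keeps only tokens tagged `JJ`. In the Penn Treebank tag set, comparatives and superlatives carry the tags `JJR` and `JJS`, so words like "tastier" or "best" are dropped. The sketch below works on an already tagged, made-up list and shows the difference; `keep_adjectives` is a hypothetical helper, not part of NLTK.

```python
def keep_adjectives(tagged_tokens, include_degrees=True):
    """Filter (word, tag) pairs down to adjectives (hypothetical helper)."""
    tags = {"JJ", "JJR", "JJS"} if include_degrees else {"JJ"}
    return [word for word, tag in tagged_tokens if tag in tags]


# Made-up output in the (word, tag) shape that a POS tagger returns
toy_tagged = [("the", "DT"), ("best", "JJS"), ("spicy", "JJ"),
              ("biryani", "NN"), ("tastier", "JJR")]

print(keep_adjectives(toy_tagged))
print(keep_adjectives(toy_tagged, include_degrees=False))
```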

In [None]:
corpus=return_corpus(negative)
model = word2vec.Word2Vec(corpus, vector_size=100, min_count=10,window=20, workers=4)
tsne_plot(model)

#### [Sentiment Analysis]()<a id="sentimental" ></a><br>

In [None]:
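#Binary sentiment label: 1 if rating > 2.5, else 0 (float() avoids truncating fractional ratings)
rating_df['sent']=rating_df['rating'].apply(lambda x: 1 if float(x)>2.5 else 0)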

In [None]:
max_features=3000
tokenizer=Tokenizer(num_words=max_features,split=' ')
tokenizer.fit_on_texts(rating_df['review'].values)
X = tokenizer.texts_to_sequences(rating_df['review'].values)
X = pad_sequences(X)
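
To see what the tokenizing and padding steps produce, here is a simplified pure-Python sketch. It is not the Keras implementation: it splits on whitespace only, and every function name is made up. It mirrors three behaviours of the Keras defaults: words are indexed from 1 by descending frequency, only indices below `num_words` are kept, and shorter sequences are left-padded with zeros.

```python
from collections import Counter


def fit_word_index_sketch(texts):
    """Index words from 1 by descending frequency (ties keep first-seen order)."""
    counts = Counter(w for text in texts for w in text.lower().split())
    return {w: rank for rank, (w, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences_sketch(texts, word_index, num_words):
    """Replace words by their index, dropping words with index >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


def pad_sequences_sketch(seqs, pad_value=0):
    """Left-pad every sequence to the length of the longest one."""
    maxlen = max((len(s) for s in seqs), default=0)
    return [[pad_value] * (maxlen - len(s)) + s for s in seqs]


toy_texts = ["good food", "food was not good at all"]
toy_index = fit_word_index_sketch(toy_texts)
toy_seqs = texts_to_sequences_sketch(toy_texts, toy_index, num_words=4)
print(pad_sequences_sketch(toy_seqs))
```

With `num_words=4` only the three most frequent words survive, which is why the second review shrinks to three indices.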

In [None]:
embed_dim = 32
lstm_out = 32

model = Sequential()
model.add(Embedding(max_features, embed_dim,input_length = X.shape[1]))
#model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary() #summary() prints the table itself; wrapping it in print() only adds a trailing None

In [None]:
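#One-hot encode the labels; dtype=int keeps the columns numeric on pandas versions that default to booleans
Y = pd.get_dummies(rating_df['sent'].astype(int), dtype=int).values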
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
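
The labels are one-hot encoded because the network ends in a two-unit softmax layer trained with `categorical_crossentropy`, which expects one column per class. A minimal sketch of that encoding on made-up labels, with a hypothetical `one_hot` helper standing in for `pd.get_dummies`:

```python
def one_hot(labels, num_classes=2):
    """Turn integer class labels into one row per label, one column per class."""
    return [[1 if label == k else 0 for k in range(num_classes)]
            for label in labels]


toy_sent_labels = [1, 0, 1]
toy_Y = one_hot(toy_sent_labels)
print(toy_Y)
```

Column 0 marks negative reviews and column 1 marks positive ones, matching the order of the sorted class values.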

In [None]:
batch_size = 3200
model.fit(X_train, Y_train, epochs = 5, batch_size=batch_size)

In [None]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

In [None]:
Y_train