<a href="https://colab.research.google.com/github/Ruchika810/Unsupervised-ML/blob/main/Team_notebook_Zomato_Restaurant_Clustering_and_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.

India is famous for its diverse cuisines, available in a large number of restaurants and hotel resorts, which is reminiscent of unity in diversity. The restaurant business in India is always evolving. More Indians are warming up to the idea of eating restaurant food, whether by dining out or getting food delivered. The growing number of restaurants in every state of India has been a motivation to inspect the data and get insights, interesting facts and figures about the Indian food industry in each city. So, this project focuses on analysing the Zomato restaurant data for each city in India.

The project serves both customers and the company. The task is to analyze the sentiment of the customer reviews in the data and draw useful conclusions from them in the form of visualizations, and to cluster the Zomato restaurants into different segments. Visualizing the data makes it easier to analyse at a glance. The analysis also addresses business cases that can help customers find the best restaurants in their locality and help the company grow by working on the areas where it currently lags.

The restaurant data can be used to cluster the restaurants into segments. It also has valuable information on cuisine and cost, which can be used for cost vs. benefit analysis.

The review data can be used for sentiment analysis. The reviewer metadata can also be used to identify the critics in the industry.

# **Attribute Information**

## **Zomato Restaurant names and Metadata**
Use this dataset for the clustering part.

1. Name : Name of Restaurants

2. Links : URL Links of Restaurants

3. Cost : Per person estimated Cost of dining

4. Collections : Tagging of Restaurants w.r.t. Zomato categories

5. Cuisines : Cuisines served by Restaurants

6. Timings : Restaurant Timings

## **Zomato Restaurant reviews**
Merge this dataset with the names and metadata dataset, then use it for the sentiment analysis part.

1. Restaurant : Name of the Restaurant

2. Reviewer : Name of the Reviewer

3. Review : Review Text

4. Rating : Rating Provided by Reviewer

5. Metadata : Reviewer Metadata - No. of Reviews and followers

6. Time: Date and Time of Review

7. Pictures : No. of pictures posted with review

# **Importing libraries and reading the datasets**

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from prettytable import PrettyTable 

import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn import svm

from sklearn.metrics import classification_report,confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Mounting google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing the datasets
df1 = pd.read_csv('/content/drive/MyDrive/Datasets/Zomato Restaurant names and Metadata.csv')
df2 = pd.read_csv('/content/drive/MyDrive/Datasets/Zomato Restaurant reviews.csv')


# **Inspecting the "Zomato Restaurant names and Metadata" dataset**

In [None]:
# First five rows of the dataset
df1.head()

In [None]:
# Last five rows of the dataset
df1.tail()

In [None]:
# Shape of the dataset
df1.shape

The dataset consists of the data of 105 restaurants, which is represented by 6 columns including the name of the restaurant.

In [None]:
# Data type of each column
df1.info()

Here the cost column needs to be of datatype int or float.

In [None]:
# Changing data type of cost column from object to integer
df1['Cost'] = df1['Cost'].str.replace("," , "").astype('int64')

In [None]:
# finding statistical measures of numerical column
df1.describe()

In [None]:
# Checking skewness of the numeric column (Cost); numeric_only skips the text columns
df1.skew(numeric_only = True)

Here the distribution of the Cost column is positively skewed. This can also be seen in a density plot.

In [None]:
# Density (KDE) plot of the Cost column; sns.distplot is deprecated in recent seaborn releases
sns.kdeplot(x = df1['Cost'])
plt.show()

In [None]:
# Checking null value count of each column
df1.isnull().sum()

There are null values in the "Collections" and "Timings" columns. As these columns are of type object, we can replace the null values with a string.

In [None]:
# Filling null values with 'Unknown'
df1.fillna('Unknown', inplace = True)

In [None]:
# Checking for any duplicate rows (counts the rows that repeat an earlier row)
df1.duplicated().sum()

There are no duplicate rows. 

**Now we can proceed towards the exploratory data analysis part where we will find some insights from the dataset.**

## EDA on "Zomato Restaurant names and Metadata" dataset

In [None]:
cuisine_list = df1['Cuisines'].str.split(', ')         # Separating the cuisines by splitting the column on commas
restaurants = {}                                       # Creating an empty dictionary which will store the cuisine name as key and count of restaurant as value
for i in cuisine_list:                                 # Iterating through each index
  for j in i:                                          # Iterating inside a particular index
    if (j in restaurants):
      restaurants[j] += 1
    else:
      restaurants[j] = 1

In [None]:
X = pd.DataFrame(restaurants.values(), index = restaurants.keys(), columns = ['Number_of_Restaurants'])  # Converting the above dictionary to a dataframe (columns must be a list, not a set)
X.sort_values(by = 'Number_of_Restaurants', ascending = False, inplace = True)               # Sorting in descending order to get the most available cuisines at the top
X = X.head(10)  # Fetching the top 10 cuisines
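The counting loop above can also be written with pandas alone. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset); it splits each row into a list, gives every cuisine its own row with `explode`, and counts the occurrences.

```python
import pandas as pd

# Hypothetical stand-in for df1['Cuisines']
toy = pd.DataFrame(
    {"Cuisines": ["North Indian, Chinese", "Chinese", "Italian, North Indian, Chinese"]}
)

# Split each row into a list, put every cuisine on its own row, then count
cuisine_counts = (
    toy["Cuisines"]
    .str.split(", ")
    .explode()
    .value_counts()
)

print(cuisine_counts.head(10))
```

The result is a Series sorted in descending order, so `head(10)` gives the ten most common cuisines directly.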

In [None]:
# Plotting the above result
plt.figure(figsize = (14, 6))
sns.barplot(x = 'Number_of_Restaurants', y = X.index,  data = X, palette = "mako")
plt.title("Top 10 popular cuisines", size = 25)
plt.xlabel("Number of Restaurants", size = 15)
plt.ylabel("Cuisines", size = 15)
plt.show()

"North Indian" cuisine is the most popular cuisine which is available in more than 50% of restaurants.

"Chinese" cuisine is the 2nd most available cuisine.

In [None]:
collection_list = df1['Collections'].str.split(', ')  # Separating the collections by splitting the column on commas
rest = {}                                             # Creating an empty dictionary which will store the collection name as key and count of restaurants as value
for i in collection_list:                             # Iterating through each index
  for j in i:                                         # Iterating inside a particular index
    if (j in rest):
      rest[j] += 1
    else:
      rest[j] = 1

In [None]:
Y = pd.DataFrame(rest.values(), index = rest.keys(), columns = ['Number_of_Restaurants'])  # Converting the above dictionary to a dataframe (columns must be a list, not a set)
Y.sort_values(by = 'Number_of_Restaurants', ascending = False, inplace = True)   # Sorting in descending order to get the most common collections at the top
Y = Y[1:11]  # Skipping the first row (the 'Unknown' placeholder used for null values) and keeping the next 10

In [None]:
# Plotting the above result
plt.figure(figsize = (14, 6))
sns.barplot(x = 'Number_of_Restaurants', y = Y.index,  data = Y, palette = "rocket")
plt.title("Top 10 popular collections", size = 25)
plt.xlabel("Number of Restaurants", size = 15)
plt.ylabel("Collections", size = 15)
plt.show()

In [None]:
# Creating a new dataframe which is sorted by cost.
rest_cost = df1.sort_values(by = 'Cost',ascending = False)

In [None]:
# Top 10 most expensive restaurants
rest_cost[['Name','Cuisines','Cost']][0:10]

In [None]:
# Top 10 cheapest restaurants
rest_cost[['Name','Cuisines','Cost']].tail(10).sort_values(by = 'Cost', ascending = True)

# **Inspecting the "Zomato Restaurant reviews" dataset**

In [None]:
# First five rows of the dataset
df2.head()

In [None]:
# Last five rows of the dataset
df2.tail()

In [None]:
# Shape of the dataset
df2.shape

There are a total of 10000 reviews for the 105 restaurants.

In [None]:
# Checking null value count of each column
df2.isnull().sum()

There are a few null values in each column. Instead of filling those few null values, it is better to drop those rows.

In [None]:
# Dropping the null values
df2.dropna(inplace = True)

In [None]:
# column information
df2.info()

Here we need to change the dtype of rating column from object to int/float. 

In [None]:
# Checking the values of rating column
df2['Rating'].value_counts()

As there is a word 'Like' in the rating column, we can't convert the rating column dtype to integer/float. So to proceed further we have to remove/replace this word from the column.

In [None]:
# Replacing 'Like' word
df2['Rating'] = df2['Rating'].replace('Like', 5)

In [None]:
# Converting dtype of 'Rating' from object to float64. 
df2['Rating'] = df2['Rating'].astype('float64')

Now, from the Metadata column, we need to separate the number of reviews and the number of followers. For this we will create two separate columns to store those values.

In [None]:
# Splitting on the first comma into the reviews part and the followers part
meta_parts = df2['Metadata'].str.split(',', n = 1, expand = True)
df2['No_of_Reviews'] = pd.to_numeric(meta_parts[0].str.strip().str.split(' ').str[0])      # Leading number of the reviews part
df2['No_of_Followers'] = pd.to_numeric(meta_parts[1].str.strip().str.split(' ').str[0])    # Leading number of the followers part (NaN if absent)

# Removing the 'Metadata' column
df2.drop(['Metadata'], axis = 1, inplace=True)
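As an alternative to splitting on commas and spaces, the two numbers can be pulled out with regular expressions. This is a minimal sketch on made-up strings; the sample values and the exact "Review"/"Follower" wording are assumptions about the Metadata format, not confirmed by the dataset.

```python
import pandas as pd

# Hypothetical sample strings; the real 'Metadata' format is assumed to look like this
meta = pd.Series(["1 Review , 2 Followers", "31 Reviews , 27 Followers", "3 Reviews"])

# Capture the number that precedes each keyword; a missing part becomes NaN
n_reviews = pd.to_numeric(meta.str.extract(r"(\d+)\s+Review", expand=False))
n_followers = pd.to_numeric(meta.str.extract(r"(\d+)\s+Follower", expand=False))

print(pd.DataFrame({"No_of_Reviews": n_reviews, "No_of_Followers": n_followers}))
```

The regex version does not depend on the order of the two parts or on the spacing around the comma.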

In [None]:
# Checking the final modified dataset
df2.head()

In [None]:
df2.info()

**Now we can proceed towards the exploratory data analysis part where we will find some insights from the dataset.**

## EDA on "Zomato Restaurant reviews" dataset

In [None]:
# Top rated restaurants
plt.figure(figsize=(10,6))
df2.groupby('Restaurant')['Rating'].mean().sort_values(ascending = False).head(10).plot.barh(color = 'g') # Finding the average rating of each restaurant.
plt.title("Top rated restaurants", fontsize=20)
plt.xlabel("Average Rating", fontsize=15)
plt.ylabel("Name of the restaurant", fontsize=15)
plt.gca().invert_yaxis()
plt.show()

In [None]:
# Worst rated restaurants
plt.figure(figsize=(10,6))
df2.groupby('Restaurant')['Rating'].mean().sort_values().head(10).plot.barh(color = 'darkred')
plt.title("Worst rated restaurants", fontsize=20)
plt.xlabel("Average Rating", fontsize=15)
plt.ylabel("Name of the restaurant", fontsize=15)
plt.xlim([0, 5])
plt.gca().invert_yaxis()
plt.show()

In [None]:
# Top reviewers
plt.figure(figsize=(12,6))
df2['Reviewer'].value_counts().sort_values(ascending = False).head(10).plot.bar(color = 'b')
plt.title("Top reviewers", fontsize=25)
plt.xlabel("Reviewer", fontsize=15)
plt.ylabel("Number of reviews", fontsize=15)
plt.show()

In [None]:
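# Most reviewed restaurants. Note: No_of_Reviews is each reviewer's total review count, so this sums reviewer activity per restaurant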
plt.figure(figsize=(10,6))
df2.groupby('Restaurant')['No_of_Reviews'].sum().sort_values(ascending = False).head(10).plot.barh(color = 'mediumspringgreen')
plt.title("Most reviewed restaurants", fontsize=20)
plt.xlabel("Number of reviews", fontsize=15)
plt.ylabel("Name of the restaurant", fontsize=15)
plt.gca().invert_yaxis()
plt.show()

In [None]:
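# Most followed restaurants. Note: No_of_Followers is each reviewer's follower count, so this sums the followers of a restaurant's reviewers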
plt.figure(figsize=(10,6))
df2.groupby('Restaurant')['No_of_Followers'].sum().sort_values(ascending = False).head(10).plot.barh(color = 'indigo')
plt.title("Most followed restaurants", fontsize=20)
plt.xlabel("Number of followers", fontsize=15)
plt.ylabel("Name of the restaurant", fontsize=15)
plt.gca().invert_yaxis()
plt.show()

# **Data preparation for clustering**

In [None]:
# Changing the column name which will help to merge both the datasets.
df2.rename(columns = {'Restaurant':'Name'}, inplace = True)

Now we need to aggregate 'Rating', 'No_of_Reviews' and 'No_of_Followers' into a single value for each restaurant. For this we take the average of each column, grouped by restaurant name.

In [None]:
# Per-restaurant mean of each column, broadcast back to every review row.
# groupby().transform() avoids the chained assignment df2[col][mask] = ..., which is unreliable in recent pandas.
grouped = df2.groupby('Name')
df2['Mean_Rating'] = grouped['Rating'].transform('mean')
df2['Mean_Reviews'] = grouped['No_of_Reviews'].transform('mean')
df2['Mean_Followers'] = grouped['No_of_Followers'].transform('mean')
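If only one row per restaurant is needed, the aggregation can produce it directly, which removes the need to drop duplicates afterwards. A minimal sketch on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical review-level data: two reviews for restaurant A, one for B
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Rating": [4.0, 5.0, 3.0],
    "No_of_Followers": [10, 30, 2],
})

# Named aggregation: one output row per restaurant
per_restaurant = (
    toy.groupby("Name", as_index=False)
       .agg(Mean_Rating=("Rating", "mean"),
            Mean_Followers=("No_of_Followers", "mean"))
)

print(per_restaurant)
```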

In [None]:
df2.head()

In [None]:
# Creating a new dataframe by taking only the required columns for clustering
df_clust2 = df2[['Name', 'Mean_Rating', 'Mean_Followers']].copy()

As the newly formed dataset repeats the same values for every review of a restaurant, we have to remove the duplicate rows.

In [None]:
df_clust2 = df_clust2.drop_duplicates()

In [None]:
# Separating out the required columns from df1 for clustering
df_clust1 = df1[['Name', 'Cost']]

In [None]:
# Merging both the datasets
final_df = pd.merge(df_clust1, df_clust2, on = 'Name')

In [None]:
final_df.head(10)

In [None]:
final_df.describe()

In [None]:
# Checking the distribution of all the columns
for var in final_df.describe().columns:
  sns.histplot(final_df[var].dropna(), kde = True)   # sns.distplot is deprecated in recent seaborn releases
  plt.ylabel('frequency')
  plt.xlabel(var)
  plt.show()

As the distributions of 'Cost' and 'Mean_Followers' are slightly right skewed, we will apply a square root transformation to these columns to reduce the skew and bring them closer to a normal distribution.

In [None]:
# Applying square root transformation on 'Cost' and 'Mean_Followers' column
final_df['Cost'] = final_df['Cost']**0.5
final_df['Mean_Followers'] = final_df['Mean_Followers']**0.5
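The effect of the transformation can be checked numerically by comparing the skewness before and after. The sketch below uses synthetic right-skewed data (a log-normal sample chosen only for illustration, not the real Cost column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column such as Cost
rng = np.random.default_rng(0)
cost = pd.Series(rng.lognormal(mean=6.5, sigma=0.6, size=500))

skew_before = cost.skew()
skew_after = np.sqrt(cost).skew()   # square root compresses the long right tail

print(f"skew before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution.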

In [None]:
# Checking the distribution of all the columns
for var in final_df.describe().columns:
  sns.histplot(final_df[var].dropna(), kde = True)   # sns.distplot is deprecated in recent seaborn releases
  plt.ylabel('frequency')
  plt.xlabel(var)
  plt.show()

Now that all the features are nearly normally distributed, we can proceed further to cluster them.

# **Clustering**

## Clustering by 'Cost' and 'Mean_Rating' 

**Feature scaling**

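Feature scaling is a technique for bringing the values of all the independent features of the dataset onto the same scale. K-Means and hierarchical clustering rely on Euclidean distances, so without scaling a feature with a large range (such as Cost) would dominate a feature with a small range (such as Mean_Rating). Scaling is therefore an important stage of data preprocessing.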

In [None]:
features_rec_mon=['Cost', 'Mean_Rating']
X_features_rec_mon = final_df[features_rec_mon].values
scaler_rec_mon = preprocessing.StandardScaler()
X_rec_mon=scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon
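To make the scaling step concrete, the following sketch compares `StandardScaler` with the z-score computed by hand, (x - mean) / standard deviation, on a small made-up array (the values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of (Cost, Mean_Rating) on very different scales
raw = np.array([[200.0, 3.5],
                [800.0, 4.2],
                [1500.0, 3.9]])

scaled = StandardScaler().fit_transform(raw)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes comparably to the distance calculations.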

**Silhouette score**

The silhouette score evaluates the quality of the clusters produced by algorithms such as K-Means by measuring how similar each sample is to its own cluster compared with the nearest other cluster. It is calculated for each sample and then averaged. Values range from -1 to 1, and the higher the silhouette score, the better the clusters.

In [None]:
# Calculating silhouette score for a range of clusters
range_n_clusters = [2,3,4,5,6,7,8,9]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state = 100)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
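For a single sample, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on four made-up points and compares it with scikit-learn (the points and labels are hypothetical):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight, well separated toy clusters
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i):
    """Silhouette value of sample i, computed from the definition."""
    dists = np.linalg.norm(points - points[i], axis=1)
    same = labels == labels[i]
    same[i] = False                      # exclude the point itself
    a = dists[same].mean()               # mean intra-cluster distance
    b = min(dists[labels == k].mean()    # mean distance to the nearest other cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(points))])

print(manual.mean(), silhouette_score(points, labels))
```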

**Optimal number of clusters = 4**

**Elbow method**

In cluster analysis, the elbow method is a heuristic for determining the number of clusters in a data set. It consists of plotting a measure of cluster compactness as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. Here we plot the within-cluster sum of squared distances (the `inertia_` attribute of KMeans).

In [None]:
# Elbow method
sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000, random_state = 100)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_
    
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
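The quantity plotted on the y-axis can be reproduced by hand: it is the sum of squared distances from each point to the centre of the cluster it was assigned to. A minimal sketch on synthetic data (the two blobs are generated only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs standing in for the scaled features
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, (30, 2)),
                  rng.normal(5, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=100).fit(data)

# Centre assigned to each point, then the sum of squared distances to it
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((data - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Inertia always decreases as k grows, which is why the elbow (the point where the decrease flattens) is used rather than the minimum.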

**Optimal number of clusters = 4**

In [None]:
# Fitting kmeans clustering algorithm
kmeans = KMeans(n_clusters=4,random_state = 100)
kmeans.fit(X)
y_kmeans= kmeans.predict(X)

In [None]:
# Plotting the clusters
plt.figure(figsize=(15,10))
plt.title('Restaurant segmentation based on Cost and Mean_Rating')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=100)

# Plotting cluster centres
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=400, alpha=0.5)
plt.show()

In [None]:
#Finding the clusters for the observation given in the dataset
final_df['Cluster'] = kmeans.labels_
final_df[['Name', 'Cost', 'Mean_Rating','Cluster']].head(10)

**Dendrogram**

Hierarchical clustering builds a hierarchy of nested clusters, which is visualized and analysed with a dendrogram. A dendrogram is a tree-like diagram that shows the order in which the data points are merged and the distance at which each merge happens. Cutting the tree at a chosen height gives the clusters.

In [None]:
# Using the dendrogram to find the optimal number of clusters
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')
plt.axhline(y=10, color='r', linestyle='--')
plt.show()
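The horizontal line drawn on the dendrogram corresponds to cutting the tree at that distance. The cut can also be done in code with SciPy's `fcluster`. This is a sketch on synthetic data; the threshold of 10 is chosen for this toy example and is not a recommendation for the restaurant data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Two synthetic, well separated groups of 20 points each
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (20, 2)),
                  rng.normal(4, 0.3, (20, 2))])

Z = sch.linkage(data, method="ward")

# Every merge above the distance threshold is undone; what remains are the clusters
labels = sch.fcluster(Z, t=10, criterion="distance")
n_clusters = len(np.unique(labels))

print(n_clusters)
```

Counting the vertical branches crossed by the line on the plot gives the same number as `n_clusters` here.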

**Optimal number of clusters = 2**

In [None]:
# Fitting hierarchical clustering
hc = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')   # Ward linkage always uses Euclidean distance; the 'affinity' argument was removed in recent scikit-learn
y_hc = hc.fit_predict(X)

In [None]:
# Visualizing the clusters (two dimensions only)
plt.figure(figsize=(13,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Category 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Category 2')

plt.title('Clusters of Restaurants')
plt.legend()
plt.show()

## Clustering by 'Mean_Rating' and 'Mean_Followers' 

**Feature scaling**

In [None]:
features_rec_mon=['Mean_Rating', 'Mean_Followers']
X_features_rec_mon = final_df[features_rec_mon].values
scaler_rec_mon = preprocessing.StandardScaler()
X_rec_mon=scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon

**Silhouette score**

In [None]:
# Calculating silhouette score for a range of clusters
range_n_clusters = [2,3,4,5,6,7,8,9]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state = 100)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

**Optimal number of clusters = 3**

**Elbow method**

In [None]:
sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000, random_state = 100)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_
    
#Plotting the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

**Optimal number of clusters = 3**

In [None]:
# Fitting kmeans clustering algorithm
kmeans = KMeans(n_clusters=3,random_state = 100)
kmeans.fit(X)
y_kmeans= kmeans.predict(X)

In [None]:
# Plotting the clusters
plt.figure(figsize=(15,10))
plt.title('Restaurant segmentation based on Mean_Rating and Mean_Followers')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=100)

# Plotting cluster centres
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=400, alpha=0.5)
plt.show()

In [None]:
#Finding the clusters for the observation given in the dataset
final_df['Cluster'] = kmeans.labels_
final_df[['Name', 'Mean_Rating', 'Mean_Followers','Cluster']].head(10)

**Dendrogram**

In [None]:
# Using the dendrogram to find the optimal number of clusters
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')
plt.axhline(y=8.5, color='r', linestyle='--')
plt.show()

**Optimal number of clusters = 3**

In [None]:
# Fitting hierarchical clustering
hc = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')   # Ward linkage always uses Euclidean distance; the 'affinity' argument was removed in recent scikit-learn
y_hc = hc.fit_predict(X)

In [None]:
# Visualizing the clusters (two dimensions only)
plt.figure(figsize=(13,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Category 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Category 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Category 3')

plt.title('Clusters of Restaurants')
plt.legend()
plt.show()

## Clustering by 'Cost' and 'Mean_Followers' 

**Feature scaling**

In [None]:
features_rec_mon=['Cost', 'Mean_Followers']
X_features_rec_mon = final_df[features_rec_mon].values
scaler_rec_mon = preprocessing.StandardScaler()
X_rec_mon=scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon

**Silhouette score**

In [None]:
# Calculating silhouette score for a range of clusters
range_n_clusters = [2,3,4,5,6,7,8,9]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state = 100)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

**Optimal number of clusters = 9**

**Elbow method**

In [None]:
sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000, random_state = 100)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_
    
#Plotting the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

**Optimal number of clusters = 3**

In [None]:
# Fitting the kmeans clustering algorithm
kmeans = KMeans(n_clusters=3,random_state = 100)
kmeans.fit(X)
y_kmeans= kmeans.predict(X)

In [None]:
# Plotting the clusters
plt.figure(figsize=(15,10))
plt.title('Restaurant segmentation based on Cost and Mean_Followers')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=100)

# Plotting cluster centres
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=400, alpha=0.5)
plt.show()

In [None]:
#Finding the clusters for the observation given in the dataset
final_df['Cluster'] = kmeans.labels_
final_df[['Name', 'Cost', 'Mean_Followers','Cluster']].head(10)

**Dendrogram**

In [None]:
# Using the dendrogram to find the optimal number of clusters
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')
plt.axhline(y=10, color='r', linestyle='--')
plt.show()

**Optimal number of clusters = 2**

In [None]:
# Fitting hierarchical clustering
hc = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')   # Ward linkage always uses Euclidean distance; the 'affinity' argument was removed in recent scikit-learn
y_hc = hc.fit_predict(X)

In [None]:
# Visualizing the clusters (two dimensions only)
plt.figure(figsize=(13,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Category 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Category 2')

plt.title('Clusters of Restaurants')
plt.legend()
plt.show()

## Clustering by 'Cost', 'Mean_Rating' and 'Mean_Followers' 

**Feature scaling**

In [None]:
features_rec_mon=['Cost', 'Mean_Rating', 'Mean_Followers']
X_features_rec_mon = final_df[features_rec_mon].values
scaler_rec_mon = preprocessing.StandardScaler()
X_rec_mon=scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon

**Silhouette score**

In [None]:
# Calculating silhouette score for a range of clusters
range_n_clusters = [2,3,4,5,6,7,8,9]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state = 100)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

**Optimal number of clusters = 6**

**Elbow method**

In [None]:
sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000, random_state = 100)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_
    
#Plotting the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

**Optimal number of clusters = 4**

In [None]:
# Fitting the kmeans clustering algorithm
kmeans = KMeans(n_clusters=4,random_state = 100)
kmeans.fit(X)
y_kmeans= kmeans.predict(X)

In [None]:
# Plotting the clusters (only the first two scaled features, Cost and Mean_Rating, are shown)
plt.figure(figsize=(15,10))
plt.title('Restaurant segmentation based on Cost, Mean_Rating and Mean_Followers')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=100)

# Plotting cluster centres
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=400, alpha=0.5)
plt.show()

In [None]:
#Finding the clusters for the observation given in the dataset
final_df['Cluster'] = kmeans.labels_
final_df.head(10)

**Dendrogram**

In [None]:
# Using the dendrogram to find the optimal number of clusters
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')
plt.axhline(y=11, color='r', linestyle='--')
plt.show()

**Optimal number of clusters = 2**

In [None]:
# Fitting hierarchical clustering
hc = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')   # Ward linkage always uses Euclidean distance; the 'affinity' argument was removed in recent scikit-learn
y_hc = hc.fit_predict(X)

In [None]:
# Visualizing the clusters (two dimensions only)
plt.figure(figsize=(13,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Category 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Category 2')

plt.title('Clusters of Restaurants')
plt.legend()
plt.show()

## Summary

In [None]:
# Specify the Column Names while initializing the Table 
myTable = PrettyTable(['SL No.',"Model_Name",'Data', "Optimal_Number_of_clusters"]) 
  
# Add rows 
myTable.add_row(['1', "K-Means with silhouette_score ", "Cost and Mean_Rating", "4"]) 
myTable.add_row(['2', "K-Means with Elbow method  ", "Cost and Mean_Rating", "4"])
myTable.add_row(['3', "Hierarchical Clustering", "Cost and Mean_Rating", "2"]) 

myTable.add_row(['4',"K-Means with silhouette_score ", "Mean_Rating and Mean_Followers", "3"]) 
myTable.add_row(['5',"K-Means with Elbow method  ", "Mean_Rating and Mean_Followers", "3"])
myTable.add_row(['6',"Hierarchical Clustering", "Mean_Rating and Mean_Followers", "3"])

myTable.add_row(['7',"K-Means with silhouette_score ", "Cost and Mean_Followers", "9"]) 
myTable.add_row(['8',"K-Means with Elbow method  ", "Cost and Mean_Followers", "3"])
myTable.add_row(['9',"Hierarchical clustering  ", "Cost and Mean_Followers", "2"])

myTable.add_row(['10',"K-Means with silhouette_score ", "Cost, Mean_Rating and Mean_Followers", "6"]) 
myTable.add_row(['11',"K-Means with Elbow method  ", "Cost, Mean_Rating and Mean_Followers", "4"])
myTable.add_row(['12',"Hierarchical clustering  ", "Cost, Mean_Rating and Mean_Followers", "2"])

print(myTable)

## Conclusion from clustering

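With K-Means and the elbow method, the optimal number of clusters is either three or four when two variables are taken at a time, and four when all three variables are taken together. The silhouette score agrees for the first two feature pairs but suggests more clusters (nine and six) for the other two feature sets, while hierarchical clustering suggests two or three clusters.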

# **Sentiment Analysis**

## Data preprocessing and EDA

In [None]:
nlp_df = df2[['Name', 'Review', 'Rating']].copy()   # Copy, so that the column assignments below do not touch df2

In [None]:
nlp_df.head()

In [None]:
# Binarizing the rating: 1 (positive) for ratings of 4 and above, 0 (negative) otherwise
nlp_df['Rating'] = np.where(nlp_df['Rating']<4, 0, 1)

In [None]:
nlp_df.head(10)

In [None]:
# Checking whether there is class imbalance or not
nlp_df['Rating'].value_counts().plot.bar()
plt.title("Count of positive and negative reviews", fontsize=20)
plt.ylabel("Number of reviews", fontsize=15)
plt.show()

**Restaurants with the most positive reviews:**

In [None]:
# Restaurants with the most positive reviews
plt.figure(figsize=(10,6))
nlp_df[nlp_df['Rating']==1]['Name'].value_counts()[:10].plot.barh(color = 'g')
plt.title("Restaurants with most number of positive reviews", fontsize=20)
plt.xlabel("Number of positive reviews", fontsize=15)
plt.ylabel("Name of the restaurant", fontsize=15)
plt.gca().invert_yaxis()
plt.show()

**Restaurants with the most negative reviews:**

In [None]:
# Restaurants with the most negative reviews
plt.figure(figsize=(10,6))
nlp_df[nlp_df['Rating']==0]['Name'].value_counts()[:10].plot.barh(color = 'r')
plt.title("Restaurants with most number of negative reviews", fontsize=20)
plt.xlabel("Number of negative reviews", fontsize=15)
plt.ylabel("Name of the restaurant", fontsize=15)
plt.gca().invert_yaxis()
plt.show()

In [None]:
nlp_df.head()

In [None]:
nlp_df["Review"] = nlp_df["Review"].str.lower()

## Removing stopwords and punctuation

Stop words (such as "the", "is" and "and") occur in abundance in any human language but carry little meaning on their own. By removing them, we remove low-level information from the text and give more focus to the important information.

In [None]:
# Downloading stopwords
nltk.download('stopwords')

In [None]:
# Creating a function to remove stopwords and punctuation from the 'Review' column

stop_words = set(stopwords.words('english'))   # Built once; far faster than re-reading the list for every word

def text_process(msg):
    nopunc = [char for char in msg if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return ' '.join([word for word in nopunc.split() if word.lower() not in stop_words])
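The cleaning step can be illustrated on one sentence without downloading anything. The sketch below uses a tiny hard-coded stop-word set, which is an assumption for illustration only (the notebook uses NLTK's full English list), and `str.translate` to strip punctuation in one pass.

```python
import string

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "was", "and", "a", "is", "but"}

# Translation table that deletes every punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text):
    """Remove punctuation, then drop stop words (case-insensitive)."""
    no_punct = text.translate(PUNCT_TABLE)
    return " ".join(w for w in no_punct.split() if w.lower() not in STOP_WORDS)

cleaned = clean("The food was great, and the service is quick!")
print(cleaned)  # food great service quick
```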

In [None]:
# Applying text_process function to "Review" column and storing the changes in a new column
nlp_df['Filtered_Review'] = nlp_df['Review'].apply(text_process)

In [None]:
print(nlp_df['Review'].iloc[0])            # iloc picks the first row by position, even if index label 0 was dropped
print(nlp_df['Filtered_Review'].iloc[0])

## Stemming

Stemming is a method of normalizing words in Natural Language Processing. It reduces each word to a shorter base form (its stem) by stripping suffixes, so that variations of the same word are treated as one term and the vocabulary becomes smaller.

In other words, there is one root word but many variations of it. For example, for the root word "eat", variations such as "eats" and "eating" are reduced to "eat". Because stemmers are rule based, irregular forms such as "eaten" may be left unchanged.

In [None]:
# creating an object of stemming function
stemmer = SnowballStemmer("english")

def stemming(text):
  '''a function which stems each word in the given text'''
  text = [stemmer.stem(word) for word in text.split()]
  return " ".join(text)
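A quick way to see what the stemmer does is to run it on a few words. The word list below is made up for illustration:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical sample words: two variations each of "eat" and "order"
words = ["eats", "eating", "ordered", "ordering"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Variations of the same word map to the same stem, which is what lets the vectorizer count them as a single feature.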

In [None]:
# Applying stemming function to the column
nlp_df['Filtered_Review'] = nlp_df['Filtered_Review'].apply(stemming)

## Vectorization

Vectorization is the conversion of input data from its raw format (here, text) into vectors of real numbers, which is the format that ML models support.

In Machine Learning, vectorization is a step in feature extraction. The idea is to get distinct features out of the text for the model to train on, by converting the text to numerical vectors. Here we use TF-IDF, which weights each term by how often it occurs in a review and down-weights terms that occur in many reviews.

In [None]:
# Creating an object of TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=500)
# Vectorizing The column
X = vectorizer.fit_transform(nlp_df['Filtered_Review'])

In [None]:
# Names of the features (get_feature_names() was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out())

In [None]:
X.toarray().shape

## Train test split

In [None]:
# Hold out 20% of the reviews for testing; random_state makes the split reproducible
review_train, review_test, label_train, label_test = train_test_split(
    nlp_df['Filtered_Review'], nlp_df['Rating'], test_size=0.20, random_state=100)

In [None]:
review_train.head()

Once the train and test reviews are represented as vectors, we can train the sentiment analysis classifiers, starting with a Naive Bayes classifier.

In [None]:
train_vectorized = vectorizer.transform(review_train)
test_vectorized = vectorizer.transform(review_test)

In [None]:
train_vectorized

In [None]:
train_array= train_vectorized.toarray()
test_array = test_vectorized.toarray()

## Naive Bayes Classifier

Naive Bayes algorithms are widely used in sentiment analysis, spam filtering, recommendation systems and similar tasks. They are fast and easy to implement, but their biggest disadvantage is the assumption that the predictors are independent of each other. In most real-life cases the predictors are dependent, which hinders the performance of the classifier.

Here it is used as a baseline model.
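The independence assumption can be seen directly in a hand-written Gaussian Naive Bayes: the log-likelihood of a sample is just a sum of per-feature terms. The sketch below is a simplified illustration in plain Python, not scikit-learn's `GaussianNB` implementation; the training data and the `fit` and `predict` helpers are made up for this example.

```python
import math

# Made-up two-feature training data: label -> list of feature vectors
train = {
    "positive": [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1]],
    "negative": [[0.1, 0.9], [0.2, 0.7], [0.1, 0.8]],
}

def fit(train, eps=1e-9):
    """Store the log prior and per-feature mean and variance for each class."""
    model = {}
    total = sum(len(rows) for rows in train.values())
    for label, rows in train.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive
        variances = [sum((x - m) ** 2 for x in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / total), means, variances)
    return model

def predict(model, x):
    """Pick the class with the highest log prior plus summed feature log-likelihoods."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        # Independence assumption: features contribute separate additive terms
        log_lik = sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                      for xi, m, v in zip(x, means, variances))
        scores[label] = log_prior + log_lik
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, [0.85, 0.15]))
print(predict(model, [0.15, 0.85]))
```

Because each feature is scored on its own, correlated features are effectively counted more than once, which is one way the dependence between predictors can hurt the classifier.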

In [None]:
# Instantiating naive bayes classifier
nb_clf = GaussianNB()

In [None]:
# Training the model
nb_clf.fit(train_array,label_train)

In [None]:
# Predictions
train_preds = nb_clf.predict(train_array)
test_preds = nb_clf.predict(test_array)

In [None]:
# Print the classification report for train and test
print(classification_report(label_train,train_preds))
print(classification_report(label_test,test_preds))

# Confusion matrix
sns.heatmap(confusion_matrix(label_test, test_preds), annot=True, fmt='d')
plt.show()

## Logistic regression

In statistics, the (binary) logistic model (or logit model) is a statistical model that models the probability of one event (out of two alternatives) taking place by having the log-odds (the logarithm of the odds) for the event be a linear combination of one or more independent variables ("predictors").
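The relationship between the linear combination, the log-odds and the probability can be checked numerically. In this small sketch the weights, bias and feature values are hypothetical numbers chosen only for illustration; they are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

weights = [1.5, -2.0]   # hypothetical coefficients
bias = 0.25             # hypothetical intercept
features = [0.8, 0.1]   # hypothetical feature values for one review

# The log-odds are a linear combination of the predictors
log_odds = bias + sum(w * x for w, x in zip(weights, features))
probability = sigmoid(log_odds)

# Taking the logarithm of the odds recovers the linear combination
recovered = math.log(probability / (1 - probability))
print(log_odds, probability, recovered)
```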

In [None]:
# Instantiating logistic regression classifier
lr_clf = LogisticRegression(max_iter=1000)

In [None]:
# Training the model
lr_clf.fit(train_array,label_train)

In [None]:
# Getting the predicted classes
lr_train_class_preds = lr_clf.predict(train_array)
lr_test_class_preds = lr_clf.predict(test_array)

In [None]:
# Classification report
print(classification_report(label_train, lr_train_class_preds))
print(classification_report(label_test, lr_test_class_preds))

# Confusion matrix
sns.heatmap(confusion_matrix(label_test, lr_test_class_preds), annot=True, fmt='d')
plt.show()

## XGBoost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.
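The core idea of gradient boosting, adding small trees that each correct the errors of the ensemble so far, can be sketched in a few lines. This is a heavily simplified toy for squared-error regression with one-split trees (stumps) on a single feature. It is not how XGBoost is implemented, which adds regularization, second-order gradients and many optimizations. The data and the `fit_stump`, `boost` and `mse` helpers are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((threshold, left_val, right_val))
        preds = [p + learning_rate * (left_val if x <= threshold else right_val)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]                  # made-up feature values
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]      # made-up targets

base, stumps, preds = boost(xs, ys)
initial_error = mse(ys, [base] * len(ys))
final_error = mse(ys, preds)
print(initial_error, final_error)
```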

In [None]:
# Instantiating XGBoost classifier
xgb_clf = xgb.XGBClassifier(random_state=20)

In [None]:
# Training the model
xgb_clf.fit(train_array,label_train)

In [None]:
# Getting the predicted classes
xgb_train_preds = xgb_clf.predict(train_array)
xgb_test_preds = xgb_clf.predict(test_array)

In [None]:
# Classification report
print (classification_report(label_train, xgb_train_preds))
print (classification_report(label_test, xgb_test_preds))

# Confusion matrix
sns.heatmap(confusion_matrix(label_test, xgb_test_preds), annot=True, fmt='d')
plt.show()

## SVM

The goal of the SVM algorithm is to find the decision boundary that best separates n-dimensional space into classes, so that new data points can be assigned to the correct category. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that define the hyperplane. These extreme cases are called support vectors, hence the name Support Vector Machine.
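For a linear boundary, the hyperplane and its support vectors can be illustrated with simple arithmetic. The sketch below assumes a hyperplane `w·x + b = 0` with hypothetical values for `w` and `b` and four made-up points; it does not train an SVM. With a non-linear kernel such as RBF, the same idea applies in an implicit feature space rather than in the original one.

```python
import math

w = [1.0, 1.0]   # hypothetical normal vector of the hyperplane
b = -3.0         # hypothetical offset

# Made-up labelled points: (coordinates, class label)
points = [([1.0, 1.0], -1), ([0.0, 1.0], -1), ([2.0, 2.0], 1), ([3.0, 3.0], 1)]

def decision(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance(x):
    """Geometric distance from a point to the hyperplane."""
    return abs(decision(x)) / math.sqrt(sum(wi * wi for wi in w))

predictions = [1 if decision(x) >= 0 else -1 for x, _ in points]

# The points closest to the boundary play the role of support vectors
min_dist = min(distance(x) for x, _ in points)
support_vectors = [x for x, _ in points if math.isclose(distance(x), min_dist)]
print(predictions, support_vectors)
```

Only the closest points determine where the boundary sits; moving the farther points, without crossing the margin, would leave it unchanged.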

In [None]:
# Instantiating SVM classifier
svm_clf = svm.SVC(kernel = 'rbf', C = 0.5)

In [None]:
# Training the model 
svm_clf.fit(train_array, label_train)

In [None]:
# Getting the predicted classes
svm_train_preds = svm_clf.predict(train_array)
svm_test_preds = svm_clf.predict(test_array)

In [None]:
# Classification report
print (classification_report(label_train, svm_train_preds))
print (classification_report(label_test, svm_test_preds))

# Confusion matrix
sns.heatmap(confusion_matrix(label_test, svm_test_preds), annot=True, fmt='d')
plt.show()

## Summary

In [None]:
# Specify the column names while initializing the table
myTable = PrettyTable(["SL No.", "Model_Name", "Train accuracy", "Test accuracy"])

# Add one row per model
myTable.add_row(["1", "Naive-Bayes", "0.84", "0.83"])
myTable.add_row(["2", "Logistic Regression", "0.88", "0.86"])
myTable.add_row(["3", "XGBoost", "0.86", "0.84"])
myTable.add_row(["4", "SVM", "0.92", "0.87"])

print(myTable)

## Conclusion from sentiment analysis

Logistic regression gives good accuracy (0.88 train, 0.86 test) with little sign of overfitting.

SVM gives the highest test accuracy (0.87), but the larger gap from its train accuracy (0.92) suggests some overfitting.

For sentiment analysis, logistic regression and SVM are the two most appropriate models, with test accuracy of about 86-87%.
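The overfitting comparison can be made explicit by computing the gap between train and test accuracy for each model. The sketch below uses only the figures from the summary table; the `results` and `gaps` names are introduced here for illustration.

```python
# (train accuracy, test accuracy) copied from the summary table
results = {
    "Naive-Bayes": (0.84, 0.83),
    "Logistic Regression": (0.88, 0.86),
    "XGBoost": (0.86, 0.84),
    "SVM": (0.92, 0.87),
}

# A larger train-test gap is a rough indicator of overfitting
gaps = {name: round(train - test, 2) for name, (train, test) in results.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: train-test gap = {gap:.2f}")

largest = max(gaps, key=gaps.get)
```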