<a href="https://colab.research.google.com/github/Mayank0195/Capstone_Project_Zomato_Resturant_Clustering_and_Sentimental_Analysis/blob/main/ZOMATO_RESTAURANT_CLUSTERING_AND_SENTIMENT_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.

India is famous for its diverse cuisines, available in a large number of restaurants and hotel resorts, which is reminiscent of unity in diversity. The restaurant business in India is always evolving. More Indians are warming up to the idea of eating restaurant food, whether by dining out or getting food delivered. The growing number of restaurants in every state of India has been a motivation to inspect the data and get some insights, interesting facts and figures about the Indian food industry in each city. So, this project focuses on analysing the Zomato restaurant data for each city in India.

The project focuses on both customers and the company. The task is to analyze the sentiments of the customer reviews in the data and draw useful conclusions in the form of visualizations, and to cluster the Zomato restaurants into different segments. The data is visualized because that makes it easy to analyse at a glance. The analysis also addresses business cases that can directly help customers find the best restaurant in their locality and help the company grow and work on the areas in which it is currently lagging.

The data can be used to cluster the restaurants into segments. It also holds valuable information on cuisines and costs, which can be used in a cost-benefit analysis.

The data can also be used for sentiment analysis, and the reviewer metadata can be used to identify the critics in the industry.

### **Attribute Information**

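#### **Zomato Restaurant names and Metadata**
Use this dataset for the clustering part. Its columns, as used later in this notebook, are Name, Links, Cost, Collections, Cuisines, and Timings.

#### **Zomato Restaurant Reviews**
Use this dataset for the sentiment analysis part.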

1. Restaurant : Name of the Restaurant

2. Reviewer : Name of the Reviewer

3. Review : Review Text

4. Rating : Rating Provided by Reviewer

5. MetaData : Reviewer Metadata - No. of Reviews and followers

6. Time : Date and Time of Review

7. Pictures : No. of pictures posted with review

### **Notebook Breakdown:**
* Business Problem Analysis
* Data Collection
* Data Cleaning and Preprocessing
* Feature Engineering
* Exploratory Data Analysis
    - Best Restaurants in the City
    - The Most Popular Cuisines in Hyderabad
    - Restaurants and their Costs
    - Cost-Benefit Analysis
    - Hypotheses Generation on visualized data for Clustering
* Restaurant Clustering
    - K means Clustering on Cost and Ratings
    - Multi-Dimensional K means Restaurant Clustering 
        -  Principal Component Analysis
        -  Silhouette Score
        -  K means Clustering
        -  Cluster Exploration
* Sentiment Analysis 
    -  Exploratory Data Analysis
        -  Critics in the Industry
    -  Text Pre-Processing and Text Visualization
    - Modeling
* Conclusion

### **Business Problem Analysis**

Indian cuisine consists of a variety of regional and traditional cuisines native to the Indian subcontinent. With every state, you can find something different to love. Besides traditional North Indian and South Indian food, the food culture is heavily inspired by and evolved around various civilizations. To say that Indians are food lovers would be an understatement. 
The restaurant business in India has been booming and people even like to celebrate small occasions of their lives with good food and great ambiance. 
Here comes Zomato, connecting people and restaurants.
Zomato is an Indian restaurant aggregator which provides information, menus, and user reviews of restaurants, and also has food delivery options. They basically take orders on the restaurant's behalf and get the food delivered at the convenience of your doorstep.

The problem statement here has two datasets for us to work on:
* Zomato Restaurant Names and Metadata
* Zomato Restaurant Reviews

To ensure Zomato's success, it is important for the company to analyze its datasets and make appropriate strategic decisions. The problem statement asks us to cluster the restaurants to help customers find the best restaurants in their city according to their taste, and to help the company understand the fields in which it is lagging. This will help Zomato build a good recommendation system for its customers. It also asks for a cost-benefit analysis using the cuisines and costs of the restaurants.
To understand the fields that need work, it is important to do sentiment analysis to get an idea of how people really feel about a particular restaurant. It is also important to identify the industry critics and pay special attention to their reviews in order to build a reputation worth praising.


In [1]:
#importing all the important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import time
from wordcloud import WordCloud
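# note: plot_precision_recall_curve was removed in scikit-learn 1.2; on newer versions import PrecisionRecallDisplay instead
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, plot_precision_recall_curve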
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_style("whitegrid",{'grid.linestyle': '--'})
plt.rcParams.update({'figure.figsize':(8,5),'figure.dpi':100})
from datetime import datetime

# Set the display figure size using rcParams method 
sns.set(rc={'figure.figsize':(10,6)})
plt.rcParams['figure.figsize'] = [10,6]



## **1. Dataset Reading**

In [None]:
#mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#reading datasets
rest_df = pd.read_csv("/content/drive/MyDrive/AlmaBetter/Capstone_Project_ML_Unsupervised/Zomato Restaurant names and Metadata.csv")
reviews_df = pd.read_csv("/content/drive/MyDrive/AlmaBetter/Capstone_Project_ML_Unsupervised/Zomato Restaurant reviews.csv")

### **Checking the Head and Tail of the Metadata and Reviews**

In [None]:
rest_df.head()

In [None]:
rest_df.tail()

In [None]:
#first five rows of reviews dataset
reviews_df.head()

In [None]:
reviews_df.tail()

## **2. Dataset Discovery**

**Discovering the dataset to get a notion of what the attributes describe.**

In [None]:
rest_df.count()

In [None]:
reviews_df.count()

In [None]:
rest_df.columns

In [None]:
reviews_df.columns

In [None]:
rest_df.shape

In [None]:
reviews_df.shape

In [None]:
#restaurants info - null count and dtypes
rest_df.info()

In [None]:
reviews_df.info()

In [None]:
rest_df.describe().transpose()

In [None]:
reviews_df.describe()

## **3. Data Cleaning**

In [None]:
import missingno as msno
msno.matrix(rest_df)
plt.show()

In [None]:
# finding the count of null values
rest_df.isnull().sum()

**Around 50% of the data is missing in the categorical column "Collections", which holds the tags Zomato assigns for better search results.**
**Even with categorical imputation, it would be difficult to assign tags that match each restaurant, and even more difficult to convert them into a meaningful numerical feature afterward.**

**If a variable carries little information and has 50% or more missing values, it is better to drop it.**
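
The 50% rule of thumb above can be checked directly by computing the share of missing values per column. The following is a minimal sketch on a made-up toy frame; `toy_df`, its values, and the `threshold` variable are illustrative assumptions, not part of the Zomato data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the restaurant metadata (values are made up).
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Collections": ["Top Picks", np.nan, np.nan, np.nan],
    "Timings": ["11am-11pm", "12pm-12am", np.nan, "11am-11pm"],
})

# Fraction of missing values in each column.
missing_share = toy_df.isnull().mean()

# Hypothetical cut-off mirroring the rule of thumb in the text.
threshold = 0.5
cols_to_drop = missing_share[missing_share >= threshold].index.tolist()

print(missing_share)
print(cols_to_drop)
```

On the real data, the same two lines applied to `rest_df` would show whether "Collections" is the only column at or above the cut-off.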

In [None]:
#drop collections
rest_df.drop('Collections', axis=1, inplace=True)

In [None]:
#Impute one missing timing row with the mode
rest_df['Timings'] = rest_df['Timings'].fillna(rest_df['Timings'].mode()[0])

In [None]:
#check nulls
rest_df.isnull().sum()

In [None]:
rest_df.Cost.unique()

In [None]:
# changing cost datatype
rest_df['Cost'] = rest_df['Cost'].str.replace(',','')
rest_df['Cost'] = rest_df['Cost'].astype('int')

In [None]:
import missingno as msno
msno.matrix(reviews_df)
plt.show()

**The "Review" column has text that needs to be analyzed to understand the sentiments and without it, the analysis cannot be done. It can also be seen that most of the null values in the review column also have nulls in other corresponding columns such as Reviewer, Rating, Metadata, and Time. These instances should be dropped.**

In [None]:
# finding the count of null values
reviews_df.isnull().sum()

In [None]:
#dropping null rows in reviews first
reviews_df.dropna(subset = ["Review"], inplace=True)

In [None]:
# checking
reviews_df.isnull().sum()

In [None]:
#rating is in object type
reviews_df['Rating'].unique()

In [None]:
# 'Like' is not a numeric rating; it is mapped to 4 here
# correcting and changing the datatype
reviews_df['Rating'] = reviews_df['Rating'].replace('Like','4')
reviews_df['Rating'] = reviews_df['Rating'].astype('float')
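
One way to spot stray non-numeric ratings such as 'Like' before converting the column is to coerce the values and look at what fails. This is a small sketch on made-up values; `toy_ratings` is a hypothetical stand-in for `reviews_df['Rating']`.

```python
import pandas as pd

# Made-up ratings stored as strings, with one non-numeric entry.
toy_ratings = pd.Series(["5", "4.5", "Like", "3"])

# errors="coerce" turns anything that is not a number into NaN.
as_numbers = pd.to_numeric(toy_ratings, errors="coerce")

# The original values that failed to convert.
unexpected = toy_ratings[as_numbers.isna()].unique().tolist()
print(unexpected)
```

How to treat such values (mapping to a fixed rating, as the notebook does, or dropping them) remains a modelling choice.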

### **Feature Engineering**

Feature engineering is the process of selecting, manipulating, and transforming raw data into meaningful numerical features that can be used by machine learning algorithms. 




#### **Zomato Restaurant names and Metadata**

First, the restaurants dataset has columns such as Links, Cuisines, and Timings which aren't directly interpretable.
The location of the restaurant can be extracted from the Links column.
Cuisines can be grouped into a few categories, and the total number of cuisines served by each restaurant can be counted.
Timings can be categorized into thr

In [None]:
rest_df.head()

**Links**

In [None]:
# link value
rest_df.loc[0,'Links']

In [None]:
#function to extract location of the restaurant
def location(link):
  link_elements = link.split("/")
  return link_elements[3]

#create a location feature
rest_df['Location'] = rest_df['Links'].apply(location)
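
The positional split above relies on the location always being the fourth `/`-separated element. A slightly more defensive variant, sketched here with the standard library, takes the first path segment of the URL instead. The function name `location_from_link` and the example URL are made up for illustration, and the URL shape is an assumption about how the links are structured.

```python
from urllib.parse import urlparse

def location_from_link(link):
    """Return the first path segment of a URL, or None if the path is empty."""
    segments = [s for s in urlparse(link).path.split("/") if s]
    return segments[0] if segments else None

# Made-up link in the assumed shape: scheme://host/<location>/<restaurant>
example_link = "https://www.zomato.com/hyderabad/example-restaurant"
print(location_from_link(example_link))
```

Unlike indexing into `link.split("/")`, this returns `None` rather than raising an `IndexError` when a link has no path.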

In [None]:
# looks like the dataset consists of the restaurants in Hyderabad
rest_df['Location'].unique()

In [None]:
# exploring the other value
rest_df[rest_df.isin(['thetiltbarrepublic'])].stack()

In [None]:
#doesn't have location
rest_df.loc[68,:]

In [None]:
#dropping unnecessary columns
rest_df.drop(['Links','Location'],axis=1,inplace=True)

In [None]:
rest_df.columns

In [None]:
#let's drop time as it would not be required
reviews_df.drop(['Time'],axis=1,inplace=True)

**Cuisines**

Here, the cuisines served by each restaurant are stored as a single string, so they need to be categorized and turned into dummy variables.
The procedure is as follows:
* First, the strings are split to get the cuisines as lists.
* A frequency dictionary is created to find the unique cuisines and how often each occurs.
* Misspelled and closely related cuisines are clubbed together to reduce the number of unique cuisines.
* Finally, the cuisines are one-hot encoded: a data frame is created with the unique cuisines as columns, and a restaurant gets a 1 in a column if it serves that cuisine.

In [None]:
#splitting to create list instead of strings
rest_df['Cuisines'] = rest_df['Cuisines'].apply(lambda x : x.split(','))

#creating a list of all cuisine lists for different restaurants
cuisine_list = []
for idx in rest_df.index:
  cuisine_list.append(rest_df['Cuisines'][idx])

#creating a flat list
cuisine_list = [item for sublist in cuisine_list for item in sublist]

In [None]:
#frequency dict
frequency_dict = {}
for elem in cuisine_list:
  if elem not in frequency_dict.keys():
    frequency_dict[elem] = cuisine_list.count(elem)
  else:
    pass

#frequency dictionary
frequency_dict

**It is observable that many of the cuisines are misspelled in terms of an extra space added at the beginning of the string. For example, there are two categories for North Indian food - 'North Indian' and ' North Indian'.**

**Another point to note is there are various unnecessary categories made. For example, there are 'Chinese' and ' Momos' both in the dataset as different cuisines. Let's try to club and correct them.**
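
The leading-space duplicates come from splitting on `','` without trimming. As an illustration only (the notebook's `cuisine_dict` below works on the untrimmed names), the following sketch shows how stripping whitespace merges the variants; the example string is made up.

```python
from collections import Counter

raw_cuisines = "North Indian, Chinese, North Indian"  # made-up example string

# A plain split keeps the leading space on every item after the first.
naive = raw_cuisines.split(",")

# Stripping each item merges 'North Indian' and ' North Indian'.
cleaned = [item.strip() for item in raw_cuisines.split(",")]

print(Counter(naive))
print(Counter(cleaned))
```

`Counter` also builds the frequency dictionary in a single pass, instead of calling `list.count` once per unique item.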

In [None]:
#minimising the number of cuisines by sorting and categorizing them out
cuisine_dict = {'Chinese':['Chinese',' Chinese','Momos',' Momos'],'North Indian':['North Indian',' North Indian',' BBQ','BBQ',' Biryani','Biryani','Kebab',' Kebab'],'Continental':['Continental',' Continental',' American','American',' BBQ','BBQ','Burger',' Burger','Finger Food',' Finger Food', ' Juices',' Pizza',' Salad',' Wraps'],
                'Andhra':['Andhra',' Andhra'],'Arabian':['Arabian',' Arabian'],'Asian': ['Asian',' Asian'],'Bakery':['Bakery',' Bakery'],
                'Beverages':['Beverages',' Beverages'],'Cafe':['Cafe',' Cafe'],'Desserts':['Desserts',' Desserts',' Mithai','Ice Cream'],
                'European':['European',' European',' Spanish'],'Fast Food':['Fast Food',' Fast Food','Burger',' Burger'],'Goan':['Goan',' Goan'],
                'Hyderabadi':['Hyderabadi',' Hyderabadi',' Biryani','Biryani'],'Indonesian':['Indonesian',' Indonesian'],'Italian':['Italian',' Italian',' Pizza'],
                'Japanese':['Japanese',' Japanese',' Sushi'],'Malaysian':['Malaysian',' Malaysian'],'Mediterranean':['Mediterranean',' Mediterranean'],
                'Modern Indian':['Modern Indian',' Modern Indian',' Salad'],'Mughlai':['Mughlai',' Mughlai',' BBQ','BBQ','Kebab',' Kebab'],
                'Seafood':['Seafood',' Seafood'],'South Indian':['South Indian',' South Indian'],
                'Thai':['Thai',' Thai'],'Healthy Food':['Healthy Food'],'Lebanese':['Lebanese'],'Mexican':['Mexican'],'North Eastern':['North Eastern'],
                'Street Food':['Street Food']}

In [None]:
# just in case 
names_df = rest_df.copy()

In [None]:
#the function returns a list of error-free, mapped cuisines according to the dictionary created
def cuisine_corrector(cuisine):
  list1 = []
  # for every cuisine in the list of a particular row
  for elem in cuisine:
    # and for every key-value pair in the dict
    for key,value in cuisine_dict.items():
      # if the cuisine exactly matches one of the category keys, append it and stop searching
      if elem == key:
        list1.append(key)
        break
      # otherwise, if the cuisine appears in this category's list of variants, append the category;
      # there is no break here, so an item listed under several categories maps to all of them
      if elem in value:
        list1.append(key)

  return list(set(list1)) # returns a list of unique cuisines

In [None]:
#correcting and getting the desired lists as row values for cuisines column
names_df['Cuisines'] = names_df['Cuisines'].apply(cuisine_corrector)

In [None]:
#check
names_df.head(3)

**The next step is to create column features for the unique cuisines and assign values according to the row values available.**

In [None]:
# concatenate new columns with the dataset
names_df = pd.concat([names_df,pd.DataFrame(columns=list(cuisine_dict.keys()))])

In [None]:
# iterating for every row in the dataframe
for i, row in names_df.iterrows():
  # and for every row we iterate over the new columns only
  for column in list(names_df.columns):
      if column not in ['Name','Cost','Cuisines','Timings']:
        # and check if the column is in the list of cuisines available for that row
        if column in row['Cuisines']:
          #then assign it as 1 else 0
          names_df.loc[i,column] = 1
        else:
          names_df.loc[i,column] = 0
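
The row-by-row loop above can also be expressed with pandas string methods. The sketch below is an alternative under stated assumptions, not the notebook's method: `toy_df` is made-up data standing in for `names_df` after cuisine correction, and no cuisine name is assumed to contain the `|` character used as a separator.

```python
import pandas as pd

# Toy data standing in for names_df after cuisine correction (values are made up).
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Cuisines": [["Chinese", "Thai"], ["Thai"], []],
})

# Join each list into one '|'-separated string, then expand it into 0/1 columns.
encoded = toy_df["Cuisines"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the original frame.
toy_encoded = pd.concat([toy_df, encoded], axis=1)
print(toy_encoded)
```

Note that this only creates columns for cuisines that actually occur, whereas the loop above creates one column per key of `cuisine_dict`.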

In [None]:
#let's check
names_df.head(2)

In [None]:
# value for 1st restaurant and verifying 
names_df.loc[0,'Cuisines']

In [None]:
#creating a new column for the total number of cuisines served by restaurants
names_df['Total Cuisines'] = names_df['Cuisines'].apply(len)


In [None]:
#drop cuisines column
names_df.drop(['Cuisines'],axis=1,inplace=True)

**Timings**

In [None]:
#analyse the unique values in Timings
names_df['Timings'].unique()

**Upon analyzing the unique values in the timings columns, it can be concluded that the restaurants are more or less open at the same timings and don't really provide a considerable variation in order to cluster the restaurants.**

In [None]:
#drop timings
names_df.drop(['Timings'],axis=1,inplace=True)

**Restaurant Average Ratings**

In [None]:
# groupby restaurant and ratings to get average ratings
restaurant_ratings = reviews_df.groupby('Restaurant')['Rating'].mean().reset_index()
restaurant_ratings.rename(columns={'Restaurant':'Name'},inplace=True)
#sort restaurants according to ratings and getting top 5 restaurants
restaurant_ratings.sort_values(by='Rating',ascending = False).head()

In [None]:
#adding an average rating feature in restaurant names and metadata dataframe
names_df = names_df.merge(restaurant_ratings,on='Name',how='left')
names_df.rename(columns={'Rating':'Avg Rating'},inplace=True)
names_df.head(1)

In [None]:
# info on the final dataset
names_df.info()

In [None]:
#five restaurants have not been rated by people yet
names_df['Avg Rating'] = names_df['Avg Rating'].fillna(0)

#### **Zomato Restaurant Reviews**

In [None]:
#head
reviews_df.head(1)

In [None]:
# splitting metadata into reviews and followers separately
# (unpacking the .str accessor no longer works in recent pandas; expand=True returns the parts as columns)
reviews_df[['Reviews','Followers']] = reviews_df['Metadata'].str.split(',', n=1, expand=True)
reviews_df['Reviews'] = pd.to_numeric(reviews_df['Reviews'].str.split(' ').str[0])
reviews_df['Followers'] = pd.to_numeric(reviews_df['Followers'].str.split(' ').str[1])

reviews_df.head(1)
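
The split-based parsing above depends on the exact position of spaces and commas. A regular-expression variant is sketched below as an alternative; the sample strings are made up and the "N Review(s) , M Follower(s)" layout is an assumption inferred from the split logic, not confirmed against the data.

```python
import pandas as pd

# Made-up metadata strings in the assumed layout; the last one has no follower count.
toy_meta = pd.Series([
    "1 Review , 2 Followers",
    "3 Reviews , 10 Followers",
    "5 Reviews",
])

# Capture the number that precedes each keyword; rows without a match become NaN.
toy_reviews = pd.to_numeric(toy_meta.str.extract(r"(\d+)\s+Review", expand=False))
toy_followers = pd.to_numeric(toy_meta.str.extract(r"(\d+)\s+Follower", expand=False))

print(toy_reviews.tolist())
print(toy_followers.tolist())
```

Reviewers with no follower count end up as NaN, which is worth handling explicitly before grouping on that column, since `groupby` drops NaN keys by default.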

In [None]:
#drop Metadata
reviews_df.drop(['Metadata'],axis=1,inplace=True)

In [None]:
#create a separate dataframe for reviewers and their activity
reviewers_df = reviews_df.groupby(['Reviewer','Reviews','Followers'])['Rating'].mean().reset_index()
reviewers_df.sort_values(by=['Reviews','Followers','Rating'],ascending=[False,False,True],inplace=True,ignore_index=True)

#sorting out the critics of the industry: the people with the most reviews written and the most followers who have given low ratings on average
reviewers_df.head(3)

### **Exploratory Data Analysis**
Exploratory data analysis is a crucial part of data analysis. It involves exploring and analyzing the dataset given to find patterns, trends and conclusions to make better decisions related to the data, often using statistical graphics and other data visualization tools to summarize the results. Python libraries like pandas are used to explore the data and matplotlib and seaborn to visualize it.

Some important aspects to include in the project are as follows:

* Best restaurants in the city
* The Most Popular Cuisines in Hyderabad
* Restaurants and their Costs
* Cost-Benefit Analysis
* Hypotheses Generation on visualized data for Clustering

#### **Best Restaurants in the City**

Many factors go into choosing a good restaurant, such as food, ambiance, cost, location, and reviews, but the most important ones are cuisine, cost, and reviews.
The first consideration is whether the restaurant serves the cuisine you like and whether it tastes good. The second is value for money: it is important that you get what you paid for. Reviews help with both decisions. They give you an idea of what the restaurant is like from people who have already been to the place.

The dataset here has the features Name, Cost, Total Cuisines, and Average Ratings to help in the decision making. The best restaurants in the city would have high ratings, a large number of cuisines served, and low cost. Let's go ahead and explore a bit.

In [None]:
# sorting out the best restaurants
best_restaurants = names_df[['Name','Avg Rating','Total Cuisines','Cost']]
# sort_values returns a new frame, which avoids modifying a slice of names_df in place
best_restaurants = best_restaurants.sort_values(by=['Avg Rating','Total Cuisines','Cost'],ascending=[False,False,True],ignore_index=True)
#top10
best_restaurants = best_restaurants.head(10)
best_restaurants

In [None]:
#visualizing the best restaurants 
sns.barplot(x='Avg Rating', y='Name',data=best_restaurants)
plt.title('Best Restaurants in Hyderabad',size=10)

In [None]:
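#distribution of Average Ratings in Hyderabad
# (sns.distplot is deprecated; histplot with kde=True draws a comparable histogram with a density curve)
sns.histplot(x=names_df['Avg Rating'], kde=True, stat='density')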
plt.xlabel('Average Rating',size=8)
plt.title('Distribution of Average Restaurant Ratings in Hyderabad',size=10)

**A few restaurants in the original restaurant dataset have not been rated yet, and most restaurants have ratings between 3.5 and 4. The company should push existing restaurants to act on their reviews and onboard restaurants with better service in the future to improve the overall rating distribution.**

#### **The Most Popular Cuisines in Hyderabad**

In [None]:
#creating a new dataframe for the cuisines and number of restaurants providing them
#list of cuisines
cuisines1 = list(cuisine_dict.keys())
#creating a new dataframe
popular_cuisines = pd.DataFrame()
#creating a feature called cuisines and assigning unique cuisines as values
popular_cuisines['Cuisines'] = cuisines1
#creating a feature of sum of cuisines in the whole dataset
popular_cuisines['Total Restaurants'] = [names_df[i].sum() for i in cuisines1]
#sort values
popular_cuisines.sort_values('Total Restaurants',ascending=False,inplace=True,ignore_index=True)
popular_cuisines

In [None]:
#visualizing cuisines
sns.barplot(x='Total Restaurants', y='Cuisines',data=popular_cuisines)
plt.title('The Most Popular Cuisines in Hyderabad',size=10)

**Although Hyderabad is in South India, North Indian food dominates its restaurants, followed by Chinese and Continental. The number of cuisines shows the diverse food options available in Hyderabad.**

#### **Restaurants and their Costs**

In [None]:
#visualizing Restaurant Costs
names_df.sort_values(['Cost']).plot(x="Name", y=["Cost"], kind="bar", figsize=(20, 8))
plt.xlabel('Restaurants',size=10)
plt.ylabel('Cost',size=10)
plt.title('Costs of Restaurants in Hyderabad',size=15)
plt.legend(['Average Cost at Restaurant'])

In [None]:
#top 5 cheapest restaurants
names_df[['Name','Cost']].sort_values(['Cost']).head()

**The cheapest restaurants in the dataset are mostly small food joints and bakeries.**

In [None]:
#top 5 costliest restaurants
names_df[['Name','Cost']].sort_values(['Cost'],ascending=False).head()

**The most expensive restaurants in the dataset are restaurants in hotels rated four stars and above.**

In [None]:
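#distribution of Cost in Hyderabad
# (sns.distplot is deprecated; histplot with kde=True draws a comparable histogram with a density curve)
sns.histplot(x=names_df['Cost'], kde=True, stat='density')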
plt.xlabel('Cost',size=8)
plt.title('Distribution of Restaurant Costs in Hyderabad',size=10)