# **Problem Statement**

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.

India is well known for the diverse cuisines served in its many restaurants and hotel resorts, a reflection of its unity in diversity. The restaurant business in India is always evolving, and more Indians are warming to the idea of eating restaurant food, whether by dining out or having food delivered. The growing number of restaurants in every state has motivated this inspection of the data to draw out insights, interesting facts and figures about the Indian food industry in each city. This project therefore focuses on analysing the Zomato restaurant data for each city in India.

The project focuses on customers and the company. The task is to analyse the sentiment of the customer reviews in the data and present useful conclusions as visualizations, and also to cluster the Zomato restaurants into different segments. The data is visualized because that makes it easy to analyse at a glance. The analysis also addresses business cases that can directly help customers find the best restaurant in their locality and help the company grow and work on the areas in which it is currently lagging.

This could help in clustering the restaurants into segments. The data also has valuable information on cuisine and cost, which can be used in a cost vs. benefit analysis.

The data could also be used for sentiment analysis, and the reviewer metadata can be used to identify the critics in the industry.

# **Attribute Information**

## **Zomato Restaurant names and Metadata**
Use this dataset for the clustering part.

1. Name : Name of Restaurants

2. Links : URL Links of Restaurants

3. Cost : Per person estimated Cost of dining

4. Collections : Tagging of Restaurants w.r.t. Zomato categories

5. Cuisines : Cuisines served by Restaurants

6. Timings : Restaurant Timings

## **Zomato Restaurant reviews**
Merge this dataset with the names and metadata dataset, then use it for the sentiment analysis part.

1. Restaurant : Name of the Restaurant

2. Reviewer : Name of the Reviewer

3. Review : Review Text

4. Rating : Rating Provided by Reviewer

5. Metadata : Reviewer Metadata - No. of Reviews and Followers

6. Time : Date and Time of Review

7. Pictures : No. of pictures posted with review

In [130]:
# importing the essential libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [131]:
# Show up to 500 rows and 500 columns, and allow a display width of 500 characters
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)

In [132]:
#reading the restaurant names data
names_df = pd.read_csv('Zomato Restaurant names and Metadata.csv')

#reading the zomato reviews data
reviews_df = pd.read_csv('Zomato Restaurant reviews.csv')

# Basic analysis on names data

In [133]:
# Shape of the data
names_df.shape

(105, 6)

In [134]:
#head of the data
names_df.head(5)

Unnamed: 0,Name,Links,Cost,Collections,Cuisines,Timings
0,Beyond Flavours,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
1,Paradise,https://www.zomato.com/hyderabad/paradise-gach...,800,Hyderabad's Hottest,"Biryani, North Indian, Chinese",11 AM to 11 PM
2,Flechazo,https://www.zomato.com/hyderabad/flechazo-gach...,1300,"Great Buffets, Hyderabad's Hottest","Asian, Mediterranean, North Indian, Desserts","11:30 AM to 4:30 PM, 6:30 PM to 11 PM"
3,Shah Ghouse Hotel & Restaurant,https://www.zomato.com/hyderabad/shah-ghouse-h...,800,Late Night Restaurants,"Biryani, North Indian, Chinese, Seafood, Bever...",12 Noon to 2 AM
4,Over The Moon Brew Company,https://www.zomato.com/hyderabad/over-the-moon...,1200,"Best Bars & Pubs, Food Hygiene Rated Restauran...","Asian, Continental, North Indian, Chinese, Med...","12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no..."


In [135]:
# Basic information about our dataframe
names_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         105 non-null    object
 1   Links        105 non-null    object
 2   Cost         105 non-null    object
 3   Collections  51 non-null     object
 4   Cuisines     105 non-null    object
 5   Timings      104 non-null    object
dtypes: object(6)
memory usage: 5.0+ KB


In [136]:
# Number of unique values in each column
names_df.nunique()

Name           105
Links          105
Cost            29
Collections     42
Cuisines        92
Timings         77
dtype: int64

* There are a total of 105 unique restaurants

# Basic analysis on reviews data

In [137]:
# Shape of the data
reviews_df.shape

(10000, 7)

In [138]:
#head of the data
reviews_df.head(5)

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,"1 Review , 2 Followers",5/25/2019 15:54,0
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,"1 Review , 1 Follower",5/24/2019 22:11,0
4,Beyond Flavours,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0


In [139]:
# Basic information about our dataframe
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Restaurant  10000 non-null  object
 1   Reviewer    9962 non-null   object
 2   Review      9955 non-null   object
 3   Rating      9962 non-null   object
 4   Metadata    9962 non-null   object
 5   Time        9962 non-null   object
 6   Pictures    10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB


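* Review has the most missing values (45 of 10000 rows); Reviewer, Rating, Metadata and Time are each missing in 38 rows
* The rows with missing values are a small share of the data, so it is okay to drop them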
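
Before dropping rows, it can help to check how many rows are fully populated and what share of the data would be lost. The following is a minimal sketch on a toy frame; `toy_reviews` and its values are made up for illustration and are not the real `reviews_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for reviews_df (hypothetical values, not the real data)
toy_reviews = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B"],
    "Review": ["good", np.nan, "ok", np.nan],
    "Rating": ["5", np.nan, "3", "4"],
})

# Number of nulls in each column
null_counts = toy_reviews.isna().sum()

# Number of rows that have no null in any column
complete_rows = int(toy_reviews.notna().all(axis=1).sum())

# Share of rows that would be lost by dropping every incomplete row
dropped_share = 1 - complete_rows / len(toy_reviews)

print(null_counts)
print(complete_rows, dropped_share)
```

Run against the real `reviews_df`, the same three expressions would confirm whether the per-column counts from `info()` overlap on the same rows.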

In [140]:
# Number of unique values in each column
reviews_df.nunique()

Restaurant     100
Reviewer      7446
Review        9364
Rating          10
Metadata      2477
Time          9782
Pictures        36
dtype: int64

* There are only 100 unique restaurants here, so 5 of the 105 restaurants do not have any reviews
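
Comparing unique counts only shows how many restaurants lack reviews, and it assumes every reviewed restaurant also appears in the names data. A small sketch of how the actual names could be listed, using toy frames with made-up values (`toy_names` and `toy_reviews` are hypothetical stand-ins for `names_df` and `reviews_df`):

```python
import pandas as pd

# Toy stand-ins (hypothetical values, not the real data)
toy_names = pd.DataFrame({"Name": ["A", "B", "C"]})
toy_reviews = pd.DataFrame({"Restaurant": ["A", "A", "B"]})

# Restaurants present in the metadata but absent from the reviews
has_review = toy_names["Name"].isin(toy_reviews["Restaurant"])
no_reviews = toy_names.loc[~has_review, "Name"].tolist()

print(no_reviews)
```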

In [141]:
reviews_df['Rating'].unique()

array(['5', '4', '1', '3', '2', '3.5', '4.5', '2.5', '1.5', 'Like', nan],
      dtype=object)

* NaN values in the Rating column can be dropped
* 'Like' is not a numeric rating, so it is probably an invalid entry
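
One way to find non-numeric entries such as 'Like' without listing them by hand is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN. This is a sketch on a made-up series, not the approach the notebook itself takes:

```python
import numpy as np
import pandas as pd

# Toy ratings column (hypothetical values mirroring the kinds seen above)
toy_ratings = pd.Series(["5", "3.5", "Like", np.nan, "1"])

# Unparseable strings become NaN instead of raising an error
numeric_ratings = pd.to_numeric(toy_ratings, errors="coerce")

# Entries that were present originally but failed to parse
invalid_mask = toy_ratings.notna() & numeric_ratings.isna()
invalid_values = toy_ratings[invalid_mask].tolist()

print(invalid_values)
```

The coerced series is numeric, so it could also serve as the rating column for later analysis if that is wanted.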

# Merging the datasets

In [142]:
#Changing the column name for convenience while merging
names_df = names_df.rename(columns={'Name':'Restaurant'})

# Merging the two dataframes
df = pd.merge(reviews_df, names_df, how='left', on='Restaurant')
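
A left merge keeps every review, even when its restaurant name has no match in the metadata. The sketch below, on toy frames with made-up values, shows how the `validate` and `indicator` parameters of `pd.merge` could be used to check that restaurant names are unique on the right side and to list any unmatched reviews.

```python
import pandas as pd

# Toy stand-ins (hypothetical values, not the real data)
toy_reviews = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "Z"],
    "Rating": ["5", "4", "3", "2"],
})
toy_names = pd.DataFrame({
    "Restaurant": ["A", "B", "C"],
    "Cost": ["800", "1,300", "400"],
})

# validate raises MergeError if a restaurant name repeats in toy_names;
# indicator adds a _merge column describing where each row came from
merged = pd.merge(
    toy_reviews, toy_names,
    how="left", on="Restaurant",
    validate="many_to_one", indicator=True,
)

# Reviews whose restaurant was not found in the metadata
unmatched = merged.loc[merged["_merge"] == "left_only", "Restaurant"].tolist()

print(len(merged), unmatched)
```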

In [143]:
df.head()

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures,Links,Cost,Collections,Cuisines,Timings
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,"1 Review , 2 Followers",5/25/2019 15:54,0,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,"1 Review , 1 Follower",5/24/2019 22:11,0,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
4,Beyond Flavours,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0,https://www.zomato.com/hyderabad/beyond-flavou...,800,"Food Hygiene Rated Restaurants in Hyderabad, C...","Chinese, Continental, Kebab, European, South I...","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"


# Data Cleaning

In [144]:
# Dropping the unwanted columns
df.drop(['Reviewer', 'Time', 'Pictures', 'Links', 'Collections'], axis=1, inplace=True)

In [145]:
print('\033[1m' + 'Column\t   null count' + '\033[0m')
print(df.isnull().sum())

Column	   null count
Restaurant      0
Review         45
Rating         38
Metadata       38
Cost            0
Cuisines        0
Timings       100
dtype: int64


In [146]:
# Dropping the rows where Rating has the invalid entry 'Like'
df = df[df['Rating']!='Like']

In [147]:
# rows with Review,Rating,Metadata as null values
df[(df['Review'].isnull())&(df['Rating'].isnull())&(df['Metadata'].isnull())].shape[0]

38

* There are 38 rows in which Review, Rating and Metadata are all null, so imputing the ratings is unnecessary
* Even if we imputed the ratings, we would still have to drop those rows because their other columns are null

In [148]:
# Keeping only the rows where Review and Timings are present; this also removes the rows with null Rating and Metadata
df = df[(df['Review'].notna())&(df['Timings'].notna())]
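
The boolean filter above can also be expressed with `DataFrame.dropna` and its `subset` parameter. A minimal sketch on a toy frame with made-up values, which also repeats the count of rows where several columns are null at once:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame (hypothetical values)
toy_df = pd.DataFrame({
    "Review": ["good", np.nan, "ok", "bad"],
    "Rating": ["5", np.nan, "3", "1"],
    "Timings": ["11 AM to 11 PM", "11 AM to 11 PM", np.nan, "12 Noon to 2 AM"],
})

# Rows where Review and Rating are both null
both_null = int(toy_df[["Review", "Rating"]].isna().all(axis=1).sum())

# Drop rows that are missing a Review or a Timings value
cleaned = toy_df.dropna(subset=["Review", "Timings"])

print(both_null)
print(cleaned)
```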

In [149]:
print('\033[1m' + 'Column\t   null count' + '\033[0m')
print(df.isnull().sum())

Column	   null count
Restaurant    0
Review        0
Rating        0
Metadata      0
Cost          0
Cuisines      0
Timings       0
dtype: int64


# Data PreProcessing

In [150]:
#Datatypes of our dataframe
df.dtypes

Restaurant    object
Review        object
Rating        object
Metadata      object
Cost          object
Cuisines      object
Timings       object
dtype: object

In [151]:
#Changing the datatypes of Restaurant, Review
df['Restaurant'] = df['Restaurant'].astype(str)
df['Review'] = df['Review'].astype(str)

In [152]:
# Cost uses a comma as the thousands separator (e.g. '22,500' means 22500)
df['Cost'] = df['Cost'].str.replace(',','').astype(int)
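
As a quick check of the conversion, here is the same idea on a made-up series. Passing `regex=False` is an optional addition, not something the cell above does; it makes the replacement a literal one regardless of the pandas version's default.

```python
import pandas as pd

# Toy cost strings (hypothetical values)
toy_cost = pd.Series(["800", "1,300", "22,500"])

# Remove the thousands separator literally, then convert to integers
cost_int = toy_cost.str.replace(",", "", regex=False).astype(int)

print(cost_int.tolist())
```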

In [154]:
# Making a cuisine list for every restaurant
df['Cuisines'] = df['Cuisines'].str.split(',')
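
Because the cuisines are separated by a comma followed by a space, splitting on `','` alone leaves a leading space on every item after the first. The sketch below, on made-up values, shows the difference and one possible way to strip the items; whether the stripped form is needed depends on how the lists are used later.

```python
import pandas as pd

# Toy cuisine strings (hypothetical values in the same format as the data)
toy_cuisines = pd.Series([
    "Biryani, North Indian, Chinese",
    "Ice Cream, Desserts",
])

# Splitting on the comma alone keeps the leading spaces
plain_split = toy_cuisines.str.split(",")

# Stripping each item gives clean cuisine names
clean_split = toy_cuisines.str.split(",").apply(
    lambda items: [item.strip() for item in items]
)

# One row per cuisine, handy for counting how often each cuisine appears
cuisine_counts = clean_split.explode().value_counts()

print(plain_split.iloc[0])
print(clean_split.iloc[0])
```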

In [155]:
df.sample(5)

Unnamed: 0,Restaurant,Review,Rating,Metadata,Cost,Cuisines,Timings
1712,Hotel Zara Hi-Fi,It’s a good restaurant and immediately deliver...,5,"1 Review , 2 Followers",400,"[Chinese, North Indian]",11:30 AM to 1 AM
7855,Khaan Saab,I went to khaansaab on a Sunday night. It was ...,5,"13 Reviews , 26 Followers",1100,"[North Indian, Mughlai]","12 Noon to 3:30 PM, 7 PM to 11:30 PM"
2756,"3B's - Buddies, Bar & Barbecue",Good service by govind and Shivam they were ve...,5,"1 Review , 7 Followers",1100,"[North Indian, Mediterranean, European]","12 Noon to 4 PM, 6:30 PM to 11:30 PM"
2369,Amul,Awesome !🙏,5,0 Reviews,150,"[Ice Cream, Desserts]",10 AM to 5 AM
1209,Absolute Sizzlers,Its great place dine absolute sizzler absolute...,5,1 Review,750,"[Continental, American, Chinese]",11:30 AM to 1 AM
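
The Metadata strings in the sample combine a review count with an optional follower count. A minimal sketch of how they could be split into numeric columns with `Series.str.extract`, using made-up values in the same format; treating a missing follower count as 0 is an assumption made for this example.

```python
import pandas as pd

# Toy metadata strings (hypothetical values in the formats seen above)
toy_meta = pd.Series([
    "1 Review , 2 Followers",
    "13 Reviews , 26 Followers",
    "0 Reviews",
    "1 Review",
])

# Digits before "Review" or "Reviews"
reviews_count = toy_meta.str.extract(r"(\d+)\s+Reviews?", expand=False).astype(float)

# Digits before "Follower" or "Followers"; assumed to be 0 when absent
followers_count = (
    toy_meta.str.extract(r"(\d+)\s+Followers?", expand=False)
    .astype(float)
    .fillna(0)
)

print(reviews_count.tolist())
print(followers_count.tolist())
```

Numeric reviewer counts like these are what would be needed to identify prolific reviewers or critics, as mentioned in the problem statement.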
