In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/zomato/zomato.csv


# Zomato Restaurant Dataset Analysis

This notebook contains an analysis of the Zomato restaurant dataset. We will clean the data, perform exploratory data analysis (EDA), and visualize various aspects of the data. The dataset includes information about restaurants, such as their location, type, cost, ratings, and more.

## Objectives
- Data Cleaning
- Data Visualization
- Insights and Observations

In [7]:
# Importing necessary libraries
import pandas as pd
import numpy as np

## Load Dataset

We will start by loading the Zomato dataset into a pandas DataFrame.

In [8]:
# Load the dataset
df = pd.read_csv('/kaggle/input/zomato/zomato.csv')
# Display the first few rows of the dataframe
df.head()

,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7/5,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8/5,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari


## Dataset Information

Let's check the structure and information of the dataset to understand its contents and identify any issues.

In [9]:
# Display dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [10]:
# Display summary statistics of the dataset
df.describe()

,votes
count,51717.0
mean,283.697527
std,803.838853
min,0.0
25%,7.0
50%,41.0
75%,198.0
max,16832.0


In [11]:
# Check for missing values in the dataset
df.isnull().sum()

url                                0
address                            0
name                               0
online_order                       0
book_table                         0
rate                            7775
votes                              0
phone                           1208
location                          21
rest_type                        227
dish_liked                     28078
cuisines                          45
approx_cost(for two people)      346
reviews_list                       0
menu_item                          0
listed_in(type)                    0
listed_in(city)                    0
dtype: int64
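
To put the counts above in proportion, here is a minimal sketch that reports the share of missing values per column. The `missing_summary` helper and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, highest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for the Zomato frame
toy = pd.DataFrame({
    "rate": ["4.1/5", None, "3.8/5", None],
    "votes": [775, 0, 918, 88],
    "dish_liked": [None, None, None, "Masala Dosa"],
})
print(missing_summary(toy))
```

Applied to the full frame, the counts above imply that `dish_liked` is missing in roughly 54% of rows (28,078 of 51,717).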

## Data Cleaning

We will clean the dataset by performing the following steps:
- Checking for duplicate rows
- Dropping redundant columns
- Renaming columns
- Removing rows with missing values
- Cleaning individual columns

In [12]:
# Check for duplicate rows in the dataset
df.duplicated().sum()

0

In [13]:
# Drop redundant columns
df = df.drop(['url','address','phone','menu_item'] , axis = 1)
df.head(5)

,name,online_order,book_table,rate,votes,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,listed_in(type),listed_in(city)
0,Jalsa,Yes,Yes,4.1/5,775,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",Buffet,Banashankari
1,Spice Elephant,Yes,No,4.1/5,787,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",Buffet,Banashankari
2,San Churro Cafe,Yes,No,3.8/5,918,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",Buffet,Banashankari
3,Addhuri Udupi Bhojana,No,No,3.7/5,88,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",Buffet,Banashankari
4,Grand Village,No,No,3.8/5,166,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",Buffet,Banashankari


In [14]:
# Rename Columns for Better Readability
df = df.rename(columns={
    'listed_in(type)': 'listed_in_type',
    'listed_in(city)': 'listed_in_city'
})

In [15]:
# Remove rows with any missing value (this leaves 23,406 of the 51,717 rows)
df.dropna(inplace = True)
# Check for any remaining missing values in the dataset
df.isnull().any()

name                           False
online_order                   False
book_table                     False
rate                           False
votes                          False
location                       False
rest_type                      False
dish_liked                     False
cuisines                       False
approx_cost(for two people)    False
reviews_list                   False
listed_in_type                 False
listed_in_city                 False
dtype: bool
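
Because `dish_liked` is missing for 28,078 rows, an unconditional `dropna` discards more than half of the data. Below is a sketch of a narrower alternative, assuming only some columns are required for a given analysis. The column choice in `required` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rate": ["4.1/5", np.nan, "3.8/5", "3.7/5"],
    "dish_liked": [np.nan, np.nan, "Churros", np.nan],
})

# Dropping on every column keeps only fully populated rows
strict = toy.dropna()

# Dropping only on the columns an analysis needs keeps more rows
required = ["rate"]  # hypothetical choice for a ratings analysis
lenient = toy.dropna(subset=required)

print(len(strict), len(lenient))
```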

In [16]:
# Display unique values in the 'rate' column
df['rate'].unique()

array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
       '3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
       '3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
       '3.4/5', '2.7/5', '4.7/5', 'NEW', '2.4/5', '2.2/5', '2.3/5',
       '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5',
       '2.7 /5', '2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5',
       '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5',
       '3.3 /5', '4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5',
       '3.5 /5', '3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

### Clean Individual Columns

In [17]:
# Treat missing values, 'NEW', and '-' as '0.0/5', strip the '/5' suffix, then convert to float
df['rate'] = df['rate'].replace([np.nan, 'NEW', '-'], '0.0/5')
df['rate'] = df['rate'].astype(str).str.replace('/5', '', regex=False).str.strip().astype(float)

In [18]:
df['rate']

0        4.1
1        4.1
2        3.8
3        3.7
4        3.8
        ... 
51705    3.8
51707    3.9
51708    2.8
51711    2.5
51715    4.3
Name: rate, Length: 23406, dtype: float64

In [19]:
df['rate'].unique()

array([4.1, 3.8, 3.7, 4.6, 4. , 4.2, 3.9, 3. , 3.6, 2.8, 4.4, 3.1, 4.3,
       2.6, 3.3, 3.5, 3.2, 4.5, 2.5, 2.9, 3.4, 2.7, 4.7, 0. , 2.4, 2.2,
       2.3, 4.8, 4.9, 2.1, 2. , 1.8])
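
The `0.` in the array above comes from mapping `'NEW'` to `0.0/5`, which lowers any average that includes unrated restaurants. Below is a sketch of an alternative, assuming unrated entries should be treated as missing instead. `parse_rate` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def parse_rate(series: pd.Series) -> pd.Series:
    """Convert strings such as '3.8 /5' to floats; 'NEW' and '-' become NaN."""
    cleaned = (
        series.astype(str)
        .str.replace("/5", "", regex=False)
        .str.strip()
    )
    # errors='coerce' turns anything non-numeric ('NEW', '-') into NaN
    return pd.to_numeric(cleaned, errors="coerce")

raw = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", np.nan])
rates = parse_rate(raw)
print(rates.tolist())
print(rates.mean())  # NaN entries are skipped by default
```

With this variant, `groupby(...).mean()` ignores unrated restaurants rather than counting them as zero.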

In [20]:
# Check the unique values in the cost column
df['approx_cost(for two people)'].unique()

array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
       '750', '200', '850', '1,200', '150', '350', '250', '1,500',
       '1,300', '1,000', '100', '900', '1,100', '1,600', '950', '230',
       '1,700', '1,400', '1,350', '2,200', '2,000', '1,800', '1,900',
       '180', '330', '2,500', '2,100', '3,000', '2,800', '3,400', '40',
       '1,250', '3,500', '4,000', '2,400', '1,450', '3,200', '6,000',
       '1,050', '4,100', '2,300', '120', '2,600', '5,000', '3,700',
       '1,650', '2,700', '4,500'], dtype=object)

In [21]:
# Remove thousands separators from 'approx_cost(for two people)' and convert to float
df['approx_cost(for two people)'] = df['approx_cost(for two people)'].astype(str)
df['approx_cost(for two people)'] = df['approx_cost(for two people)'].str.replace(',', '', regex=False).replace('nan', np.nan).astype(float)
df['approx_cost(for two people)'].unique()

array([ 800.,  300.,  600.,  700.,  550.,  500.,  450.,  650.,  400.,
        750.,  200.,  850., 1200.,  150.,  350.,  250., 1500., 1300.,
       1000.,  100.,  900., 1100., 1600.,  950.,  230., 1700., 1400.,
       1350., 2200., 2000., 1800., 1900.,  180.,  330., 2500., 2100.,
       3000., 2800., 3400.,   40., 1250., 3500., 4000., 2400., 1450.,
       3200., 6000., 1050., 4100., 2300.,  120., 2600., 5000., 3700.,
       1650., 2700., 4500.])

In [22]:
# Highest rating and vote count per restaurant name (the names become the index)
rest = df.groupby('name')[['rate' , 'votes']].max()
rest

name,rate,votes
#L-81 Cafe,3.9,48
#refuel,3.7,37
1000 B.C,3.2,49
100ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ°C,3.7,41
1131 Bar + Kitchen,4.6,2861
...,...,...
b CafÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ© - Shangri-La Hotel,4.3,429
eat.fit,4.6,1238
i-Bar - The Park Bangalore,3.8,625
nu.tree,4.4,300


In [25]:
import re
a1= """ Bohra Bohra CafÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ©

Urban Solace - CafÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ© for the Soul

KazÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ©"""
# Characters left behind by the mis-decoded 'é'; removing them also drops the accented letter
a = re.compile('[Â©\x83\x82Ã]')
a.sub('', a1)

' Bohra Bohra Caf\n\nUrban Solace - Caf for the Soul\n\nKaz'
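
Stripping the stray characters also removes the accented letter, so `Café` becomes `Caf`. Below is a sketch of an alternative that tries to undo the mis-decoding instead, assuming the names were UTF-8 bytes that were repeatedly decoded as Latin-1. `repair_mojibake` is a hypothetical helper and may not recover every name in the real data.

```python
def repair_mojibake(text: str, max_rounds: int = 20) -> str:
    """Undo repeated 'UTF-8 bytes read as Latin-1' decoding, if possible."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # text is no longer (or never was) this kind of mojibake
        if candidate == text:
            break  # plain ASCII: nothing left to repair
        text = candidate
    return text

# Build a garbled sample by applying the mis-decoding three times
garbled = "Café"
for _ in range(3):
    garbled = garbled.encode("utf-8").decode("latin-1")

print(repair_mojibake(garbled))
```

Strings that are already clean, such as `Jalsa` or a correctly encoded `Café`, pass through unchanged because the round trip either changes nothing or raises a decoding error.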

In [47]:
# Apply the compiled pattern `a` to the restaurant names in the index of `rest`
rest.index = rest.index.map(lambda x: a.sub('', x) if isinstance(x, str) else x)


In [28]:
#Rechecking the null values
df.isnull().sum()

name                           0
online_order                   0
book_table                     0
rate                           0
votes                          0
location                       0
rest_type                      0
dish_liked                     0
cuisines                       0
approx_cost(for two people)    0
reviews_list                   0
listed_in_type                 0
listed_in_city                 0
dtype: int64

## Data Visualization

We will create various visualizations to explore the dataset:
- Restaurants offering online ordering or not
- Restaurants allowing table booking or not
- Table booking vs. rating
- Locations with the most restaurants
- Relation between location and rating
- Restaurant type
- Rating distribution by restaurant type
- Types of services
- Relation between service type and rating
- Cost of restaurant
- Number of restaurants in a location
- Most famous restaurant chains in Bengaluru

## Import Visualization Libraries

We will use various libraries for data visualization:
- **Matplotlib**: A plotting library for creating static, animated, and interactive visualizations in Python.
- **Seaborn**: A Python visualization library based on Matplotlib that provides a high-level interface for drawing attractive statistical graphics.
- **Plotly**: A graphing library for interactive, publication-quality charts.

### About Plotly
Plotly is a powerful visualization library that allows you to create interactive and dynamic graphs. It is particularly useful for creating plots that you can interact with in a Jupyter notebook or on a web page.

Here are some key features of Plotly:
- **Interactivity**: Plotly graphs can be zoomed, panned, and hovered over to get more details.
- **Wide Range of Charts**: Supports a wide variety of chart types including line plots, scatter plots, bar charts, pie charts, bubble charts, histograms, and more.
- **High-Quality Visuals**: Produces high-quality visuals suitable for publication and presentation.
- **Customization**: Highly customizable with a vast range of options for modifying the appearance of your plots.

Now, let's import the libraries.

In [29]:
# Importing necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
# Setting the visualization style
sns.set(style="whitegrid")

In [None]:
# Creating a pie chart for online order availability
fig1 = px.pie(df,'online_order', title='Online Order Availability', color = 'online_order')
fig1.show()

In [None]:
#Creating a pie chart for Table booking Availability
fig2 = px.pie(df, 'book_table', title='Table Booking Availability' , color = 'book_table')
fig2.show()
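
The shares shown in the two pie charts can also be read off numerically. Here is a minimal sketch on toy data; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "online_order": ["Yes", "Yes", "No", "Yes"],
    "book_table": ["No", "No", "Yes", "No"],
})

for column in ["online_order", "book_table"]:
    # normalize=True returns fractions instead of raw counts
    shares = toy[column].value_counts(normalize=True).mul(100).round(1)
    print(column, shares.to_dict())
```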

In [None]:
# Table booking vs rating
fig3 = px.box(df, x='book_table', y='rate', title='Table Booking vs Rating', color='book_table')
fig3.show()
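
A box plot compares the two groups visually; the same comparison can be summarized with a grouped aggregate. This is a sketch on invented toy values, not results from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "book_table": ["Yes", "Yes", "No", "No", "No"],
    "rate": [4.2, 4.4, 3.6, 3.8, 4.0],
})

# Median, mean, and group size of ratings for each booking option
summary = toy.groupby("book_table")["rate"].agg(["median", "mean", "count"])
print(summary)
```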





In [None]:
# Top 10 locations by number of restaurants
location_counts = df['location'].value_counts().head(10)
fig4 = px.bar(x=location_counts.values, y=location_counts.index, orientation='h', title='Top 10 Locations', labels={'x': 'Count', 'y': 'Location', 'color': 'Count'}, color=location_counts.values)
fig4.show()

In [None]:
# Five locations with the highest average rating
lo = df.groupby('location')['rate'].mean().nlargest(5).reset_index()
fig5 = px.line(lo, x='location', y='rate', labels={'location': 'Location', 'rate': 'Average Rating'}, title='Top 5 Locations by Average Rating')
fig5.update_traces(mode='lines+markers')
fig5.show()
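
A mean computed over a handful of restaurants can rank a small location above much larger ones. Below is a sketch that only ranks locations with enough listings. The threshold `min_listings` and the toy values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B", "C"],
    "rate": [4.0, 4.2, 3.8, 3.5, 3.7, 3.9, 4.9],
})

min_listings = 3  # hypothetical cutoff

stats = toy.groupby("location")["rate"].agg(["mean", "count"])
# Drop locations with too few listings before ranking by mean rating
ranked = stats[stats["count"] >= min_listings].sort_values("mean", ascending=False)
print(ranked)
```

Here location `C` has the highest mean but is excluded because it rests on a single listing.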

In [36]:
# Restaurant Type
fig6 = px.histogram(df, x='rest_type', title='Restaurant Type', color='rest_type')
fig6.show()





In [37]:
# Rating distribution by restaurant type (violin plot)
fig7 = px.violin(df, y='rest_type', x='rate', title='Rest Type vs Rating Distribution' , color = 'rest_type')
fig7.show()





In [38]:
# Counting the different types of services
service_counts = df['listed_in_type'].value_counts().reset_index()
# Renaming columns for clarity
service_counts.columns = ['Service Type', 'Count']
# Creating a bar chart for the types of services
fig8 = px.bar(service_counts, x='Service Type', y='Count', labels={'Service Type': 'Service Type', 'Count': 'Count'}, title='Types of Services' ,color = 'Service Type')
fig8.show()





In [39]:
# Relation between Type and Rating
fig9 = px.box(df, y='listed_in_type', x='rate', title='Type of Service vs Rating' , color = 'listed_in_type')
fig9.show()





In [40]:
# Distribution of the approximate cost for two people
fig10 = px.histogram(df, x='approx_cost(for two people)', nbins=100, title='Cost for Two People')
fig10.show()





In [54]:
# Service types listed per city (head(15) keeps only the first 15 city/type pairs)
city_service_counts = df.groupby(['listed_in_city', 'listed_in_type']).size().reset_index(name='count').head(15)
fig11 = px.sunburst(city_service_counts, path=['listed_in_city', 'listed_in_type'], values='count', title='Popular Services Listed in Each City',color = 'listed_in_city')
fig11.show()


In [57]:
#Explore how the number of votes correlates with restaurant ratings
fig12 = px.scatter(df, x='votes', y='rate', title='Relationship between Votes and Rating',color='votes')
fig12.show()
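
The scatter plot can be complemented with a single number. Below is a sketch that computes Spearman's rank correlation as the Pearson correlation of ranks, using pandas only. The toy values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "votes": [10, 50, 200, 1000, 5000],
    "rate": [3.1, 3.5, 3.9, 4.2, 4.6],
})

# Spearman's rho equals the Pearson correlation of the ranked values
rho = toy["votes"].rank().corr(toy["rate"].rank())
print(round(rho, 3))
```

A rank-based measure is a reasonable choice here because the vote counts are heavily skewed (median 41, maximum 16,832 in the summary statistics above).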


In [60]:
#The most liked dishes based on customer reviews
from collections import Counter
dish_likes = df['dish_liked'].dropna().str.split(',').apply(lambda x: [item.strip() for item in x])
dish_counts = Counter([item for sublist in dish_likes for item in sublist]).most_common(10)
dishes, counts = zip(*dish_counts)
fig13 = px.bar(x=dishes, y=counts, title='Most Liked Dishes',color=dishes)
fig13.show()
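
The same count can be sketched with pandas alone, using `str.split` and `explode` in place of `Counter`. The toy series below is invented for illustration.

```python
import pandas as pd

toy = pd.Series(["Pasta, Lunch Buffet", "Momos, Lunch Buffet", "Masala Dosa", None])

dish_counts = (
    toy.dropna()
    .str.split(",")
    .explode()      # one row per dish
    .str.strip()    # remove the space left after each comma
    .value_counts()
)
print(dish_counts.head(3))
```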






## Insights and Observations

- **Online Order Availability**: The majority of restaurants offer online orders.
- **Table Booking Availability**: Fewer restaurants provide table booking options.
- **Table Booking vs Rating**: Restaurants with table booking generally have higher ratings.
- **Top Locations**: The bar chart ranks the 10 locations with the most restaurants.
- **Rating by Location**: Average ratings differ by location; the line chart shows the five highest-rated locations.
- **Top Restaurant Types**: The histogram shows how many listings fall under each restaurant type.
- **Rest Type vs Rating Distribution**: The violin plot shows how ratings are distributed within each restaurant type.
- **Types of Services**: The bar chart counts listings per service type.
- **Type of Service vs Rating**: The box plot compares rating distributions across service types.
- **Cost of Restaurant**: The histogram shows the distribution of the approximate cost for two people.
- **Number of Restaurants in Top Locations**: The number of restaurants in the top 10 locations.
- **Famous Restaurant Chains**: The most famous restaurant chains in Bengaluru.

## Conclusion

This analysis provided insights into the Zomato restaurant dataset. We explored various aspects such as online order availability, table booking, locations, types of restaurants, and their ratings. The visualizations helped in understanding the data better and deriving meaningful insights.

### Future Work
- Further analysis can be done to predict ratings based on restaurant features.
- Additional external data sources can be integrated for deeper insights.

Thank you for exploring this analysis!

## References

- [Zomato EDA dataset on Kaggle](https://www.kaggle.com/datasets/pranavuikey/zomato-eda)
- [Plotly Documentation](https://plotly.com/python/)
- [Seaborn Documentation](https://seaborn.pydata.org/)