In [49]:
print("Zomato analytics project")

Zomato analytics project


In [50]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rabhar/zomato-restaurants-in-india")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/zomato-restaurants-in-india


Restaurant analytics can be very useful for companies like Swiggy and Zomato. It helps them understand which restaurants are in high demand, which cuisines are more preferred in different cities, and how much revenue different locations contribute. Analytics also shows how wide their delivery coverage is across various pin codes and how customers are responding in different regions.

Based on these insights, they can make smarter business decisions such as:


* optimizing recommendations
* improving delivery coverage
* designing city-specific marketing strategies
* enhancing customer experience
* helping restaurants grow through data-driven insights

In this project, I explore restaurant data to uncover meaningful insights about demand, pricing, cuisines, ratings, and city-wise behavior.

**Dataset**

For this analysis, I have used the **Zomato Restaurants in India** dataset from Kaggle.  
It contains detailed information about restaurants across various Indian cities, including:
- Restaurant Name  
- City / Location  
- Cuisines  
- Ratings and Votes  
- Average Cost  
- Delivery and Dining Details

Dataset source: https://www.kaggle.com/datasets/rabhar/zomato-restaurants-in-india


In [51]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [52]:
# Apply seaborn's whitegrid style to all plots
sns.set(style="whitegrid")


In [53]:
import os
os.listdir('/kaggle/input/zomato-restaurants-in-india')

['zomato_restaurants_in_India.csv']

In [54]:
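# Load the data
df = pd.read_csv("/kaggle/input/zomato-restaurants-in-india/zomato_restaurants_in_India.csv")
# Note: bare expressions that are not the last line of a cell are not displayed,
# so df.head() and df.shape produce no output here; df.info() prints its summary.
df.head()
df.shape
df.info()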



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211944 entries, 0 to 211943
Data columns (total 26 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   res_id                211944 non-null  int64  
 1   name                  211944 non-null  object 
 2   establishment         211944 non-null  object 
 3   url                   211944 non-null  object 
 4   address               211810 non-null  object 
 5   city                  211944 non-null  object 
 6   city_id               211944 non-null  int64  
 7   locality              211944 non-null  object 
 8   latitude              211944 non-null  float64
 9   longitude             211944 non-null  float64
 10  zipcode               48757 non-null   object 
 11  country_id            211944 non-null  int64  
 12  locality_verbose      211944 non-null  object 
 13  cuisines              210553 non-null  object 
 14  timings               208070 non-null  object 
 15  

In [55]:
df.columns


Index(['res_id', 'name', 'establishment', 'url', 'address', 'city', 'city_id',
       'locality', 'latitude', 'longitude', 'zipcode', 'country_id',
       'locality_verbose', 'cuisines', 'timings', 'average_cost_for_two',
       'price_range', 'currency', 'highlights', 'aggregate_rating',
       'rating_text', 'votes', 'photo_count', 'opentable_support', 'delivery',
       'takeaway'],
      dtype='object')

**Data Cleaning**

**Checking that datatypes are consistent**

In [56]:

df.dtypes

res_id                    int64
name                     object
establishment            object
url                      object
address                  object
city                     object
city_id                   int64
locality                 object
latitude                float64
longitude               float64
zipcode                  object
country_id                int64
locality_verbose         object
cuisines                 object
timings                  object
average_cost_for_two      int64
price_range               int64
currency                 object
highlights               object
aggregate_rating        float64
rating_text              object
votes                     int64
photo_count               int64
opentable_support       float64
delivery                  int64
takeaway                  int64
dtype: object

Although the delivery column is stored as int64, it represents a binary categorical feature (delivery available vs not available). Since there is no numerical ordering or magnitude involved, it should be converted to a categorical datatype.
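The conversion described above could look roughly like the following. This is a minimal sketch on a toy frame named `demo` (a hypothetical stand-in, not the real dataset), and the values in it are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these values are made up for illustration.
demo = pd.DataFrame({"delivery": [1, 0, 1, 0, 1]})

# Cast the integer flag to pandas' categorical dtype so it is treated
# as a label rather than a quantity.
demo["delivery"] = demo["delivery"].astype("category")

print(demo["delivery"].dtype)
print(demo["delivery"].cat.categories.tolist())
```

On the real data the same call would be applied to `df['delivery']`; it is worth checking `df['delivery'].unique()` first to confirm which codes actually occur.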

**Removing irrelevant columns**


The following columns don’t add analytical value for EDA:


* url: This only contains the restaurant’s Zomato webpage link.
* address: The address is a long, unstructured text field. Most location insights can already be derived from structured fields such as city, locality, and zipcode.
* city_id: This is simply a numeric mapping of the city column. Since the city name is already available and more readable, city_id is redundant.
* country_id: This dataset only contains restaurants from India, so the country is the same for all rows.
* locality_verbose: This is a more descriptive version of locality, but the structured locality and city columns already cover the same information.

These columns were dropped because they are identifiers, URLs, redundant fields, or unstructured text, none of which contribute meaningful insights to exploratory data analysis.


In [8]:
df = df.drop(columns=['url','city_id','country_id','locality_verbose','address'])


In [9]:
df.columns

Index(['res_id', 'name', 'establishment', 'city', 'locality', 'latitude',
       'longitude', 'zipcode', 'cuisines', 'timings', 'average_cost_for_two',
       'price_range', 'currency', 'highlights', 'aggregate_rating',
       'rating_text', 'votes', 'photo_count', 'opentable_support', 'delivery',
       'takeaway'],
      dtype='object')

**Handling Missing Values**

In [10]:
df.isnull().sum()


res_id                       0
name                         0
establishment                0
city                         0
locality                     0
latitude                     0
longitude                    0
zipcode                 163187
cuisines                  1391
timings                   3874
average_cost_for_two         0
price_range                  0
currency                     0
highlights                   0
aggregate_rating             0
rating_text                  0
votes                        0
photo_count                  0
opentable_support           48
delivery                     0
takeaway                     0
dtype: int64

• Removed zipcode because it had a very high percentage of missing values and offered limited analytical value.

• Dropped rows with missing cuisines since cuisine information is essential for analysis.

• Filled missing timings values with “Not Available”.

• Removed the opentable_support column, as it is not relevant to this analysis.


In [None]:
df = df.drop(columns=['zipcode'])
df = df.dropna(subset=['cuisines'])
df['timings'] = df['timings'].fillna("Not Available")
df = df.drop(columns=['opentable_support'])
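To put the missing-value counts in proportion, they can be expressed as a share of all rows. The sketch below simply reuses the counts printed by `df.isnull().sum()` and the 211944 rows reported by `df.info()`; the variable names are my own.

```python
import pandas as pd

# Counts copied from the df.isnull().sum() output above.
total_rows = 211944
missing = pd.Series(
    {"zipcode": 163187, "cuisines": 1391, "timings": 3874, "opentable_support": 48}
)

# Share of rows with a missing value, in percent.
missing_pct = (missing / total_rows * 100).round(2)
print(missing_pct)
```

Roughly 77% of zipcode values are missing, while cuisines and timings are each missing in under 2% of rows. On the live DataFrame the same figures should be obtainable with `df.isnull().mean() * 100`.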




**Removing Duplicates**

The dataset contains many repeated restaurant entries. Duplicates were identified using the res_id column, which identifies each restaurant, and only one row per res_id was retained so that repeated restaurants do not bias the analysis.

In [11]:
df['res_id'].duplicated().sum()


np.int64(156376)

In [12]:
df = df.drop_duplicates(subset=['res_id'])


In [13]:
df.shape

(55568, 21)

In [14]:

df['res_id'].duplicated().sum()

np.int64(0)
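Before relying on `drop_duplicates(subset=['res_id'])`, it can help to check whether rows sharing a res_id are identical in every column, since keeping the first row silently discards any differing values. A minimal sketch on a hypothetical toy frame (`demo`, with made-up values):

```python
import pandas as pd

# Toy data: res_id 101 is an exact duplicate, res_id 103 repeats with different votes.
demo = pd.DataFrame({
    "res_id": [101, 101, 102, 103, 103],
    "name": ["A", "A", "B", "C", "C"],
    "votes": [10, 10, 5, 7, 8],
})

# Rows whose res_id appears more than once.
dup_ids = demo["res_id"].duplicated(keep=False)
# Rows that are identical to another row across all columns.
exact_dups = demo.duplicated(keep=False)

# res_ids that repeat but whose rows are not identical.
conflicting = demo.loc[dup_ids & ~exact_dups, "res_id"].unique().tolist()
print(conflicting)

# Same de-duplication as in the notebook: keep the first row per res_id.
deduped = demo.drop_duplicates(subset=["res_id"], keep="first")
print(deduped.shape)
```

If the list of conflicting ids is empty on the real data, keeping the first occurrence loses nothing; otherwise the choice of which row to keep deserves a closer look.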

**Cleaning ratings**

* Some restaurants have aggregate_rating = 0.0 but are actually “Not Rated”.
* Some may have a rating_text that does not match the numeric rating.
* Some have very few votes, which makes their rating unreliable.
* The rating column needs to be numeric and clean.
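For the last point, one defensive option is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN. In this dataset the column is already float64, so the sketch below uses a hypothetical string Series (`raw`, with invented placeholder values) purely to show the behaviour.

```python
import pandas as pd

# Hypothetical raw ratings containing non-numeric placeholders.
raw = pd.Series(["4.4", "3.9", "NEW", "-", "0.0"], name="aggregate_rating")

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)
print(int(numeric.isna().sum()))
```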


In [28]:
df['rating_text']

0         Very Good
1         Very Good
2         Very Good
3         Very Good
4         Excellent
            ...    
211882      Average
211925    Very Good
211926         Good
211940    Very Good
211942         Good
Name: rating_text, Length: 45510, dtype: object

The rating_text column was dropped because it is simply a textual representation of the aggregate_rating column. Since aggregate_rating already provides a more precise and useful numerical value for analysis, keeping both would be redundant.
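One way to confirm that rating_text is just a banded label for aggregate_rating is to look at the numeric range behind each label. The sketch below uses a handful of label/rating pairs taken from the rows shown in the outputs of this notebook; on the real data the same groupby would run on `df`.

```python
import pandas as pd

# A few (rating_text, aggregate_rating) pairs as they appear in the notebook outputs.
demo = pd.DataFrame({
    "rating_text": ["Excellent", "Very Good", "Very Good", "Good", "Good", "Average"],
    "aggregate_rating": [4.9, 4.4, 4.0, 3.9, 3.7, 2.9],
})

# Numeric range and row count behind each text label.
bands = demo.groupby("rating_text")["aggregate_rating"].agg(["min", "max", "count"])
print(bands.sort_values("min"))
```

If the min/max ranges of the labels do not overlap, the text column adds no information beyond the number.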

In [29]:
df = df.drop(columns = 'rating_text')

In [30]:
df.columns

Index(['res_id', 'name', 'establishment', 'city', 'locality', 'latitude',
       'longitude', 'zipcode', 'cuisines', 'timings', 'average_cost_for_two',
       'price_range', 'currency', 'highlights', 'aggregate_rating', 'votes',
       'photo_count', 'opentable_support', 'delivery', 'takeaway'],
      dtype='object')

In [32]:
df['aggregate_rating']

0         4.4
1         4.4
2         4.2
3         4.3
4         4.9
         ... 
211882    2.9
211925    4.0
211926    3.9
211940    4.1
211942    3.7
Name: aggregate_rating, Length: 45510, dtype: float64

In [19]:
df['aggregate_rating'].unique()

array([4.4, 4.2, 4.3, 4.9, 4. , 3.8, 3.4, 4.1, 3.5, 4.6, 3.9, 3.6, 4.5,
       4.7, 3.7, 4.8, 3.2, 0. , 3.3, 2.8, 3.1, 2.6, 3. , 2.7, 2.9, 2.2,
       2.3, 2.4, 2.5, 2.1, 1.8, 2. , 1.9])

In [20]:
df['aggregate_rating'].describe()

count    55568.000000
mean         2.958593
std          1.464576
min          0.000000
25%          2.900000
50%          3.500000
75%          3.900000
max          4.900000
Name: aggregate_rating, dtype: float64

In [23]:
# Count how many restaurants have a rating of 0
(df['aggregate_rating'] == 0).sum()


np.int64(10058)

A rating of 0.0 marks a restaurant as unrated rather than poorly rated. Since these rows carry no information for rating analysis, they were removed.
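The effect of the zeros on the summary statistics can be worked out from the numbers already printed: 55568 rows with a mean of 2.958593, of which 10058 are exactly 0. A short arithmetic sketch (variable names are my own):

```python
# Figures copied from the describe() and zero-count outputs above.
n_all = 55568
mean_all = 2.958593
n_zero = 10058

# Zeros add nothing to the sum, so the sum of all ratings equals the sum of rated ones.
n_rated = n_all - n_zero
mean_rated = mean_all * n_all / n_rated

print(n_rated)
print(round(mean_rated, 2))
```

The mean among rated restaurants comes out at about 3.6, noticeably higher than the 2.96 reported with the unrated rows included.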

In [24]:
df = df[df['aggregate_rating'] > 0]

In [34]:
df.shape

(45510, 20)

To improve the reliability of rating-based analysis, only restaurants with at least 5 votes were retained. Ratings with very few votes may not be representative of overall customer sentiment and can introduce noise into the analysis.
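The cutoff of 5 is a judgment call. A quick way to see how sensitive the row count is to that choice is to tabulate how many restaurants survive at several thresholds. The sketch below uses a hypothetical toy Series of vote counts; the candidate thresholds are arbitrary.

```python
import pandas as pd

# Toy vote counts for illustration only.
votes = pd.Series([814, 1203, 4, 111, 2, 207, 0, 9, 5, 128], name="votes")

# Number of rows retained at each candidate minimum-vote threshold.
retained = {t: int((votes >= t).sum()) for t in (1, 5, 10, 50)}
print(retained)
```

Running the same comprehension on `df['votes']` would show how many restaurants each threshold keeps in the real data.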

In [35]:
df['votes']

0          814
1         1203
2          801
3          693
4          470
          ... 
211882       4
211925     111
211926     207
211940     187
211942     128
Name: votes, Length: 45510, dtype: int64

In [36]:
df = df[df['votes'] >= 5]

In [37]:
df.shape

(43207, 20)