# 🧪 EDA Tutorial – Indian Restaurants Dataset


## Introduction
Exploratory Data Analysis (EDA) is a crucial first step in any data science or machine learning workflow. While it’s not mandatory for building models, it is highly recommended as it helps uncover patterns, detect anomalies, test assumptions, and better understand the structure of your data.

In this project, we’ll perform a detailed EDA on a dataset of Indian Restaurants sourced from Zomato (link to dataset). This guide serves as a hands-on reference for conducting basic to intermediate-level analysis, useful across many datasets.

You’ll learn how to:

- Understand dataset structure and composition
- Handle missing data and duplicates
- Use `groupby`, `apply`, and `unique` effectively
- Create insightful visualizations: bar charts, scatter plots, box plots, word clouds, density plots
- Derive business insights from the data

Let’s dive in!

## 🗂️ Project Workflow
#### 1. Importing Libraries and Data
#### 2. Preprocessing Steps

- Removing duplicates
- Handling missing values
- Dropping unhelpful features

#### 3. Exploratory Analysis

- ##### ✅ Restaurant Chains
  - Number of outlets
  - Average ratings

- ##### 🏢 Establishment Types
  - Count, ratings, votes, photos

- ##### 🌆 City-Wise Analysis
  - Restaurants per city
  - Aggregated performance metrics

- ##### 🍱 Cuisine Insights
  - Unique cuisines
  - Highest-rated types

- ##### ✨ Restaurant Highlights
  - Feature frequency and ratings
  - WordCloud for highlights

#### 4. Cost and Rating Analysis

- Rating distribution
- Price range and cost comparisons
- Correlation between rating, price, and votes

#### 5. 📌 Conclusion

- Summarizing key insights
- Suggestions for stakeholders or further modeling

## Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

### Preprocessing

#### Exploring data

In [2]:
data = pd.read_csv("../input/zomato/zomato_restaurants_in_India.csv")


In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.isnull().sum()


In [None]:
data.describe()


In [None]:
data[data["city"]=="Mumbai"]


In [None]:
data[data["city"]=="Pune"]

In [None]:
data[data["city"]=="Bangalore"]

In [None]:
data.info()

In [None]:
data[data["average_cost_for_two"]==3000]

In [None]:
data.drop_duplicates(["res_id"],keep='first',inplace=True)
data.shape

Data Cleaning Update:
We discovered that approximately **75%** of our dataset consisted of duplicate records.
Because we removed them early in the process, the rest of the analysis is unaffected.
Even after the reduction, we still have over **55,000** unique restaurant entries, which is more than sufficient for meaningful analysis.
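
To make the duplicate share explicit rather than inferring it from the change in `data.shape`, the following minimal sketch measures it on a toy table. The toy values are made up for illustration; only the `res_id` column name comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the restaurant table (values are invented for illustration).
toy = pd.DataFrame({
    "res_id": [101, 101, 102, 103, 103, 103],
    "name": ["A", "A", "B", "C", "C", "C"],
})

# duplicated() flags every repeat of a res_id after its first occurrence.
dup_mask = toy.duplicated(subset=["res_id"], keep="first")
dup_share = dup_mask.mean()  # fraction of rows that are repeats

# Same de-duplication rule as in the cell above, without inplace=True.
deduped = toy.drop_duplicates(subset=["res_id"], keep="first")

print(f"{dup_share:.0%} of rows are duplicates; {len(deduped)} unique rows remain")
```

Running the same two lines on `data` before dropping duplicates would give the exact percentage for the real dataset.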


Next Step: **Handling Missing Values**
Let’s now examine which variables have missing data.

In [None]:
data.isnull().sum()

We identified **5 variables** with missing values. Among them, the `zipcode` feature has approximately **80% missing data**, making it unreliable for analysis—so we’ll exclude it from further consideration.

For the remaining **4 features**, we can consider **imputation**. However, before investing effort into filling missing values, we’ll evaluate whether these features are even necessary for our analysis.
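
Raw null counts are easier to act on when expressed as percentages. The sketch below shows one way to do that on a toy frame; the column names mirror the dataset, while the values and the 50% cut-off are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "zipcode": [np.nan, np.nan, np.nan, np.nan, "400001"],
    "cuisines": ["North Indian", None, "Cafe", "Chinese", "Fast Food"],
    "city": ["Mumbai", "Pune", "Agra", "Shimla", "Mumbai"],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Hypothetical rule: columns above this threshold are candidates for dropping.
threshold = 50
to_drop = missing_pct[missing_pct > threshold].index.tolist()
print("Candidates to drop:", to_drop)
```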

### 🔍 Feature Evaluation: Omit or Keep?

Let’s examine the remaining features individually:

- `res_id`:
This is a unique identifier for each restaurant. Since it's not useful for analysis or modeling, we can safely **exclude** it.

- `name`:
This feature is **important**. We’ll use it to identify and highlight **top restaurants**, so it should be **retained**.

- `establishment`:
Let's explore the values in this column to assess its **variety and relevance** to our objectives. If the distribution is meaningful (e.g., cafe, fine dining, quick bites), it could provide useful segmentation.

Out of 5 columns with missing values, `zipcode` has about **80%** missing data, so we’ll skip it.

For the other 4, we’ll first check if they’re even useful. If not, no need to waste time fixing them.

Here’s the plan:

- **`res_id` is just an ID—skip.**

- **`name` helps us find top restaurants—keep.**

- **`establishment` could be useful—let’s check the values and decide.**

In [None]:
data["establishment"].unique()

In [None]:
print(data["establishment"].unique()[0])
print(type(data["establishment"].unique()[0]))

The `establishment` column appears to be a valuable feature for exploratory data analysis. However, its current format includes unwanted characters such as **square brackets (`[]`) and quotes (`''`)**, which add noise to the data.

To clean this up, we’ll use the `apply()` function to strip out these characters. Additionally, one of the values is an empty string (`""`), which we’ll replace with `"NA"` to avoid ambiguity during analysis.

In [None]:
# Removing [' '] from each value
print(data["establishment"].unique()[0])
data["establishment"] = data["establishment"].apply(lambda x: x[2:-2])
print(data["establishment"].unique()[0])

# Changing '' to 'NA'
# (replace keeps plain Python strings; np.where inside apply would return 0-d NumPy arrays)
print(data["establishment"].unique())
data["establishment"] = data["establishment"].replace("", "NA")
print(data["establishment"].unique())
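
As an alternative to slicing with `apply()`, the same cleanup can be sketched with pandas' vectorised string methods. This assumes the raw values look like `['Quick Bites']`, with `[]` for restaurants that have no establishment type; the sample values below are made up.

```python
import pandas as pd

# Made-up sample in the assumed raw format.
est = pd.Series(["['Quick Bites']", "['Casual Dining']", "[]"])

# Strip brackets and quotes from both ends, then label empty strings as "NA".
cleaned = est.str.strip("[]'").replace("", "NA")
print(cleaned.tolist())
```

Unlike fixed slicing with `x[2:-2]`, stripping by character set does not depend on every value having exactly two wrapper characters on each side.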

- `url`:
This column contains links to individual restaurant pages. Since it doesn’t contribute any analytical value and is unlikely to be used in our visualizations or modeling, we will **drop it**.

- `address`:
While detailed, this column contains lengthy unstructured text that can be difficult to standardize or categorize. Unless we plan to perform text analysis or geocoding, it's best to **exclude** it from our analysis.

- `city`:
This could be an **important feature** for geographical segmentation or city-level insights.
Let’s explore the unique values in the `city` column to check for duplicates, inconsistencies, or sparsely populated categories.

In [None]:
len(data[data["city"]=="Shimla"])

In [None]:
len(data[data["city"]=="Agra"])

In [None]:
len(data["city"].unique())

In [None]:
data["city"].unique()

In [None]:
data.isnull().sum()

In [None]:
data.head()

In [None]:
data[data["city"]=="Jabalpur"]

Looks good.

- `city_id`: Either the city name or the city id identifies a city uniquely, so one of the two features is enough.

- `locality`: Let's check the number of unique values.

In [None]:
data["locality"].nunique()


Although this could be an interesting feature, it has too many unique classes to be useful here, so we will skip it.

- `latitude`: Helpful for plotting geographic maps, but we won't be doing that here.

- `longitude`: Same as above.

- `zipcode`: Approximately 80% of its values are missing.

- `country_id`: Since this dataset covers Indian restaurants only, there should be just one unique id here. Let's check.

In [None]:
data["country_id"].unique()

- `locality_verbose`: Same as `locality`.

In [None]:
data["locality_verbose"].nunique()

- `cuisines`: This feature has some missing values. It has 9382 unique classes, but each restaurant lists several cuisines, and it is the many possible combinations in those lists that inflate the count. Let's check the actual number of unique cuisine classes. Before splitting the lists, null values need to be replaced with a label.


In [None]:
print(data["cuisines"].nunique())
print(data["cuisines"].unique())
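
To count individual cuisines rather than combinations, one option is to split each entry and flatten the result. The sketch below assumes the column holds comma-separated strings such as `"North Indian, Chinese"`; the sample rows and the `"No cuisine"` label are hypothetical.

```python
import pandas as pd

# Made-up rows in the assumed comma-separated format.
toy = pd.DataFrame({
    "cuisines": ["North Indian, Chinese", "Chinese", None, "Cafe, Fast Food, Chinese"],
})

per_cuisine = (
    toy["cuisines"]
    .fillna("No cuisine")   # hypothetical label for missing values
    .str.split(", ")        # one list of cuisines per restaurant
    .explode()              # one row per (restaurant, cuisine) pair
)

print("Unique cuisine labels:", per_cuisine.nunique())
print(per_cuisine.value_counts().head(3))
```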

Now that we have taken a deep look at our data, let's start with some EDA!

# Exploratory Data Analysis (EDA)
**Restaurant chains**

Here chains represent restaurants with more than one outlet

Chains vs Outlets

In [None]:
outlets = data["name"].value_counts()

In [None]:
outlets 

In [None]:
chains = outlets[outlets >= 2]
single = outlets[outlets == 1]

In [None]:
data.shape

In [None]:
chains

In [None]:
print("Total Restaurants = ", data.shape[0])
print("Total Restaurants that are part of some chain = ", data.shape[0] - single.shape[0])
print("Percentage of Restaurants that are part of a chain = ", np.round((data.shape[0] - single.shape[0]) / data.shape[0],2)*100, "%")

About 35% of all restaurants are part of some kind of restaurant chain. Keep in mind that two unrelated restaurants might share exactly the same name, so this figure may be slightly overstated.

Top restaurant chains (by number of outlets)
Let's plot a horizontal bar graph to look at Top 10 restaurant chains. For the color scheme, we are using a list of pre-defined and selected colours to make the chart more appealing. If you want your analysis to look good visually, you should customize each and every element of your graph.

In [None]:
chains.head(10)

In [None]:
top10_chains = data["name"].value_counts()[:10].sort_values(ascending=True)

In [None]:
height = top10_chains.values
bars = top10_chains.index
y_pos = np.arange(len(bars))

fig = plt.figure(figsize=[11,7], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")


colors = ["#f9cdac","#f2a49f","#ec7c92","#e65586","#bc438b","#933291","#692398","#551c7b","#41155e","#2d0f41"]
plt.barh(y_pos, height, color=colors)
 
plt.xticks(color="#424242")

plt.yticks(y_pos, bars, color="#424242")
plt.xlabel("Number of outlets in India")

for i, v in enumerate(height):
    ax.text(v+3, i, str(v), color='#424242')
plt.title("Top 10 restaurant chains in India (by number of outlets)")


plt.show()

This chart is dominated by big fast food chains.

Top restaurant chains (by average rating)
Here we will look at the top chains by their ratings. I have restricted the list to chains with more than 4 outlets, so that names with only a handful of outlets do not distort the ranking.

In [None]:
outlets = data["name"].value_counts()

In [None]:
atleast_5_outlets = outlets[outlets > 4]

In [None]:
top10_chains2 = data[data["name"].isin(atleast_5_outlets.index)].groupby("name")["aggregate_rating"].mean().sort_values(ascending=False).head(10).sort_values()
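
The filter-then-rank step can also be written so that the outlet count stays visible next to the average rating, which makes it easier to judge how much each mean can be trusted. This is a minimal sketch on made-up data; `min_outlets` is a hypothetical parameter name.

```python
import pandas as pd

# Made-up ratings for three restaurant names.
toy = pd.DataFrame({
    "name": ["A"] * 5 + ["B"] * 2 + ["C"] * 6,
    "aggregate_rating": [4.0, 4.2, 4.4, 4.1, 4.3,
                         4.9, 4.9,
                         3.0, 3.5, 3.2, 3.1, 3.4, 3.6],
})

# One row per name with both the outlet count and the mean rating.
stats = toy.groupby("name")["aggregate_rating"].agg(outlets="count", mean_rating="mean")

min_outlets = 5  # same idea as "more than 4 outlets"
ranked = stats[stats["outlets"] >= min_outlets].sort_values("mean_rating", ascending=False)
print(ranked)
```

Here `B` has the highest mean but only two outlets, so it is excluded from the ranking.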


In [None]:
height = pd.Series(top10_chains2.values).map(lambda x : np.round(x, 2))
bars = top10_chains2.index
y_pos = np.arange(len(bars))

fig = plt.figure(figsize=[11,7], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")


colors = ['#fded86', '#fce36b', '#f7c65d', '#f1a84f', '#ec8c41', '#e76f34', '#e25328', '#b04829', '#7e3e2b', '#4c3430']
plt.barh(y_pos, height, color=colors)

plt.xlim(3)
plt.xticks(color="#424242")
plt.yticks(y_pos, bars, color="#424242")
plt.xlabel("Average rating")

for i, v in enumerate(height):
    ax.text(v + 0.01, i, str(v), color='#424242')
plt.title("Top 10 restaurant chains in India (by average rating)")


plt.show()

Interestingly, no fast food chain appears in this chart. Maintaining a high rating requires consistently good service, which is hard to sustain for a fast food chain with an outlet on every street.

**Establishment Types**

Number of restaurants (by establishment type)

In [None]:
est_count = data.groupby("establishment").count()["res_id"].sort_values(ascending=False)[:5]

fig = plt.figure(figsize=[8,5], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")


colors = ["#2d0f41",'#933291',"#e65586","#f2a49f","#f9cdac"]
plt.bar(est_count.index, est_count.values, color=colors)

plt.xticks(color="#424242")
plt.yticks(range(0, 25000, 5000), color="#424242")
plt.xlabel("Top 5 establishment types")

for i, v in enumerate(est_count):
    ax.text(i-0.2, v+500, str(v), color='#424242')
plt.title("Number of restaurants (by establishment type)")


plt.show()

The top 3 are casual and quick-service formats, while positions 4 to 6 are dessert-based shops.

Average rating, votes and photos (by Establishment)
Here, we will not plot each result, since that would fill this notebook with horizontal bar charts. Horizontal bar charts are about the only practical way to display results of this kind when there are many classes to compare (here, 10). Instead, let's look at the aggregated values directly.

In [None]:
data

In [None]:
plt.hist(data["city"].dropna(), bins = 30, color = "red", edgecolor = "black")
plt.title("City distribution")
plt.show()

In [None]:
rating_by_est = data.groupby("establishment")["aggregate_rating"].mean().sort_values(ascending=False).head(10)
rating_by_est


In [None]:
top10_votes_by_est = data.groupby("establishment")["votes"].mean().sort_values(ascending=False).head(10)
top10_votes_by_est


In [None]:
top10_photos_by_est = data.groupby("establishment")["photo_count"].mean().sort_values(ascending=False).head(10)
top10_photos_by_est


It can be concluded that establishments with alcohol availability have the highest average ratings, votes and photo uploads.
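
The three separate `groupby` calls above can be combined into a single summary table, which makes it easier to compare the metrics side by side. The sketch below uses made-up numbers; only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up values for two establishment types.
toy = pd.DataFrame({
    "establishment": ["Pub", "Pub", "Quick Bites", "Quick Bites", "Quick Bites"],
    "aggregate_rating": [4.0, 4.2, 3.0, 3.4, 3.2],
    "votes": [500, 700, 50, 70, 60],
    "photo_count": [100, 300, 10, 20, 30],
})

# Mean of all three metrics per establishment type in one pass.
summary = (
    toy.groupby("establishment")[["aggregate_rating", "votes", "photo_count"]]
    .mean()
    .sort_values("aggregate_rating", ascending=False)
)
print(summary)
```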

#### Cities

#### Number of restaurants (by city)

In [None]:
city_counts = data.groupby("city").count()["res_id"].sort_values(ascending=True)[-10:]

height = pd.Series(city_counts.values)
bars = city_counts.index
y_pos = np.arange(len(bars))

fig = plt.figure(figsize=[11,7], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")

colors = ['#dcecc9', '#aadacc', '#78c6d0', '#48b3d3', '#3e94c0', '#3474ac', '#2a5599', '#203686', '#18216b', '#11174b']
plt.barh(y_pos, height, color=colors)

plt.xlim(3)
plt.xticks(color="#424242")
plt.yticks(y_pos, bars, color="#424242")
plt.xlabel("Number of outlets")

for i, v in enumerate(height):
    ax.text(v + 20, i, str(v), color='#424242')
plt.title("Number of restaurants (by city)")


plt.show()

As expected, metro cities have more restaurants than other cities, with South India dominating the top 4.

Average rating, votes and photos (by city)

In [None]:
rating_by_city = data.groupby("city")["aggregate_rating"].mean().sort_values(ascending=False).head(10)
rating_by_city

In [None]:
top10_votes_by_city = data.groupby("city")["votes"].mean().sort_values(ascending=False).head(10)
top10_votes_by_city


In [None]:
top10_photos_by_city = data.groupby("city")["photo_count"].mean().sort_values(ascending=False).head(10)
top10_photos_by_city

Gurgaon has the highest rated restaurants, whereas Hyderabad has the most critics (votes). Mumbai and New Delhi dominate in photo uploads per outlet.

**Cuisine**

**Unique cuisines**

In [None]:
print("Total number of unique cuisines = ", data["cuisines"].nunique())

Number of restaurants (by cuisine)

In [None]:
import matplotlib.pyplot as plt

# Extract the cuisines column
cuisines = data["cuisines"]

# Get the top 5 most common cuisines
c_count = cuisines.value_counts().head(5)

# Plotting
fig = plt.figure(figsize=[8,5], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")

colors = ['#4c3430', '#b04829', '#ec8c41', '#f7c65d','#fded86']
plt.bar(c_count.index, c_count.values, color=colors)

plt.xticks(range(0, 5), c_count.index, color="#424242")
plt.yticks(range(0, max(c_count.values)+5000, 5000), color="#424242")
plt.xlabel("Top 5 cuisines", color="#424242")

for i, v in enumerate(c_count.values):
    ax.text(i-0.2, v+500, str(v), color='#424242')
    
plt.title("Number of restaurants (by cuisine type)", color="#424242")
plt.show()


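Surprisingly, Fast Food comes second in the list of cuisines that Indians prefer, even more than cafe, North Indian and South Indian food. Note that these counts are per exact `cuisines` entry, so a restaurant that lists several cuisines is counted under that combination rather than under each individual cuisine.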

In [None]:
import pandas as pd
import numpy as np

# Split each comma-separated cuisines string into a list (empty list for missing values)
data["cuisines2"] = data["cuisines"].apply(lambda x: x.split(", ") if pd.notnull(x) else [])
all_cuisines = [cuisine for sublist in data["cuisines2"] for cuisine in sublist]
cuisines_list = list(set(all_cuisines))

# "Sum" accumulates ratings and "Total" counts restaurants per cuisine.
# Both must start at zero: the loop in the next cell fills them, and
# pre-filling them with counts would double "Total" and inflate "Sum".
zeros = np.zeros(shape=(len(cuisines_list), 2))
c_and_r = pd.DataFrame(zeros, index=cuisines_list, columns=["Sum", "Total"])


In [None]:
for i, x in data.iterrows():
    for j in x["cuisines2"]:
        c_and_r.at[j, "Sum"] += x["aggregate_rating"]
        c_and_r.at[j, "Total"] += 1

In [None]:
c_and_r["Mean"] = c_and_r["Sum"] / c_and_r["Total"]
c_and_r

In [None]:
c_and_r[["Mean", "Total"]].sort_values(by="Mean", ascending=False).head(10)

A few cuisines in this list can be ignored because very few restaurants serve them. The overall conclusion is that international (and rarely available) cuisines are rated higher than local cuisines.
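
The `iterrows()` loop works but is slow on tens of thousands of rows. A vectorised way to get the same `Mean` and `Total` columns is sketched below with `explode()` and `groupby()`. The sample rows are made up, and the sketch assumes the comma-separated format used when building `cuisines2`.

```python
import pandas as pd

# Made-up rows; the last one has no cuisines and is skipped.
toy = pd.DataFrame({
    "cuisines": ["North Indian, Chinese", "Chinese", "Cafe", None],
    "aggregate_rating": [4.0, 3.0, 4.5, 2.0],
})

exploded = (
    toy.dropna(subset=["cuisines"])
    .assign(cuisine=lambda d: d["cuisines"].str.split(", "))
    .explode("cuisine")  # one row per (restaurant, cuisine) pair
)

# Mean rating and number of restaurants per individual cuisine.
c_stats = exploded.groupby("cuisine")["aggregate_rating"].agg(Mean="mean", Total="count")
print(c_stats.sort_values("Mean", ascending=False))
```

Keeping `Total` next to `Mean` makes it easy to filter out cuisines served by only a handful of restaurants before ranking.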

### Highlights/Features of restaurants

#### Unique highlights

In [None]:
hl = pd.Series([item for sublist in data["highlights"].dropna().apply(lambda x: x.split(", ")) for item in sublist])
print("Top highlights of restaurants:")
print(hl.value_counts().head(10))

In [None]:
print("Total number of unique highlights = ", hl.nunique())

#### Number of restaurants (by highlights)

In [None]:
h_count = hl.value_counts()[:5]

fig = plt.figure(figsize=[10,6], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")

colors = ['#11174b', '#2a5599', '#3e94c0', '#78c6d0', '#dcecc9']
plt.bar(h_count.index, h_count.values, color=colors)

plt.xticks(color="#424242")
plt.yticks(range(0, 70000, 10000), color="#424242")
plt.xlabel("Top 5 highlights")

for i, v in enumerate(h_count):
    ax.text(i-0.2, v+500, str(v), color='#424242')

plt.title("Number of restaurants (by highlights)")
plt.show()

The top 5 highlights don't convey much information, since they are common to almost every restaurant. Let's look at less common highlights that matter more to customers.

**Highest rated highlights**

In [None]:
data["highlights"][0]

In [None]:
data["highlights2"] = data['highlights'].apply(lambda x : x[2:-2].split("', '"))

hl_list = hl.unique().tolist()
zeros = np.zeros(shape=(len(hl_list),2))
h_and_r = pd.DataFrame(zeros, index=hl_list, columns=["Sum","Total"])

In [None]:
data["highlights2"] = data["highlights"].apply(lambda x: x.split(", ") if pd.notnull(x) else [])

all_highlights = [item for sublist in data["highlights2"] for item in sublist]

highlights_list = list(set(all_highlights))

import numpy as np
h_and_r = pd.DataFrame(np.zeros((len(highlights_list), 2)), index=highlights_list, columns=["Sum", "Total"])
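
The two cells above parse `highlights` in different ways (slicing off `['` and `']` versus splitting on `", "`), and which one is right depends on how the column is stored. If the values are string representations of Python lists, as the slicing variant assumes, `ast.literal_eval` is a more robust option. The sketch below uses made-up values in that assumed format; `parse_list` is a hypothetical helper name.

```python
import ast
import pandas as pd

# Made-up values in the assumed "stringified list" format.
raw = pd.Series(["['Lunch', 'Takeaway Available']", "['Dinner']", None])

def parse_list(cell):
    """Turn "['a', 'b']" into ['a', 'b']; missing values become an empty list."""
    if pd.isnull(cell):
        return []
    return ast.literal_eval(cell)

parsed = raw.apply(parse_list)

# Flatten to one highlight per row; empty lists become NaN and are dropped.
hl_flat = parsed.explode().dropna()
print(hl_flat.tolist())
```

Checking `data["highlights"][0]` first, as done above, shows which format actually applies before choosing a parser.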


In [None]:
for i, x in data.iterrows():
    for j in x["highlights2"]:
        if j in h_and_r.index:
            h_and_r.at[j, "Sum"] += x["aggregate_rating"]
            h_and_r.at[j, "Total"] += 1


In [None]:
h_and_r["Mean"] = h_and_r["Sum"] / h_and_r["Total"]
h_and_r

In [None]:
h_and_r[["Mean","Total"]].sort_values(by="Mean", ascending=False)[:10]

We can safely ignore highlights which have a frequency of less than 10 since they can be considered as outliers. Features like Gastro pub, Craft beer, Romantic dining and Sneakpeek are well received among customers.

#### Highlights wordcloud

Here we will create a word cloud of the top 30 highlights.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

hl_str = " ".join(str(i) for i in hl)

wordcloud = WordCloud(width=800, height=500,
                      background_color='white',
                      min_font_size=10,
                      max_words=30).generate(hl_str)

plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()


### Ratings and cost

#### Ratings distribution

Let's see how the ratings are distributed.

In [None]:
sns.kdeplot(data['aggregate_rating'], fill=True)  # fill replaces the deprecated shade argument
plt.title("Ratings distribution")
plt.show()


There is a huge spike at 0, which might correspond to newly opened or unrated restaurants. The majority of restaurants are rated between 3 and 4, with fewer managing to go beyond 4.
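
If the zeros do stand for unrated restaurants, they pull every average downwards. The sketch below shows the size of that effect on made-up ratings, under the assumption that a rating of 0 means "not yet rated".

```python
import pandas as pd

# Made-up ratings; 0.0 is assumed to mean "not yet rated".
ratings = pd.Series([0.0, 0.0, 3.2, 3.8, 4.1, 0.0, 3.5])

rated = ratings[ratings > 0]  # keep only restaurants with an actual rating

print("Share unrated:", (ratings == 0).mean())
print("Mean including zeros:", ratings.mean())
print("Mean of rated only:", rated.mean())
```

Whether to exclude the zeros depends on the question being asked, so it is worth reporting both figures.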

Average cost for two distribution

In [None]:
sns.kdeplot(data['average_cost_for_two'], fill=True)  # fill replaces the deprecated shade argument
plt.title("Average cost for 2 distribution")
plt.show()

With a few restaurants charging an average of Rs.25000+ for two, this graph is extremely skewed. Let's take a closer look at the lower range of 0 to 6000.

In [None]:
sns.kdeplot(data['average_cost_for_two'], fill=True)  # fill replaces the deprecated shade argument
plt.xlim([0, 6000])
plt.xticks(range(0,6000,500))
plt.title("Average cost for 2 distribution")
plt.show()

The majority of restaurants are budget friendly, with an average cost between Rs.250 and Rs.800 for two.
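
For a distribution this skewed, the median and quantiles summarise the typical cost better than the mean. The sketch below illustrates this on made-up costs with one extreme value; the 0.95 cut-off is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up costs for two, with one extreme value.
cost = pd.Series([200, 300, 400, 500, 600, 700, 800, 1000, 1500, 30000])

print("Mean:", cost.mean())      # pulled up by the single extreme value
print("Median:", cost.median())  # barely affected by it

# Keep everything up to the 95th percentile for a clearer plot or summary.
upper = cost.quantile(0.95)
trimmed = cost[cost <= upper]
print("Rows kept:", len(trimmed), "of", len(cost))
```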

Price range count

In [None]:
pr_count = data.groupby("price_range").count()["name"]

fig = plt.figure(figsize=[8,5], frameon=False)
ax = fig.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_color("#424242")
ax.spines["bottom"].set_color("#424242")


colors = ["#2d0f41",'#933291',"#f2a49f","#f9cdac"]
plt.bar(pr_count.index, pr_count.values, color=colors)

plt.xticks(range(0, 5), color="#424242")
plt.yticks(range(0, 40000, 5000), color="#424242")
plt.xlabel("Price Ranges")

for i, v in enumerate(pr_count):
    ax.text(i+0.85, v+700, str(v), color='#424242')
plt.title("Number of restaurants (by price ranges)")


plt.show()

The price range chart supports our previous observation from the average cost chart: the number of restaurants decreases as the price range increases.

##### Relation between Average price for two and Rating

In [None]:
np.round(data["average_cost_for_two"].corr(data["aggregate_rating"]), 2)

A positive correlation can be seen between a restaurant's average cost and its rating.
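
The default (Pearson) correlation is sensitive to extreme values, and the cost column has a few. A rank-based (Spearman) correlation is a useful cross-check. In the sketch below it is computed as the Pearson correlation of the ranks, which is its definition and needs nothing beyond pandas; the data points are made up.

```python
import pandas as pd

# Made-up data: rating rises with cost, with one extreme cost value.
toy = pd.DataFrame({
    "average_cost_for_two": [200, 400, 600, 800, 25000],
    "aggregate_rating": [3.0, 3.4, 3.7, 3.9, 4.0],
})

cost = toy["average_cost_for_two"]
rating = toy["aggregate_rating"]

pearson = cost.corr(rating)                 # linear association
spearman = cost.rank().corr(rating.rank())  # monotonic association (Pearson on ranks)

print("Pearson:", round(pearson, 2))
print("Spearman:", round(spearman, 2))
```

In this toy example the relationship is perfectly monotonic, so the rank-based value is 1, while the extreme cost value drags the Pearson value below that.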

In [None]:
plt.plot("average_cost_for_two","aggregate_rating", data=data, linestyle="none", marker="o")
plt.xlim([0,6000])
plt.title("Relationship between Average cost and Rating")
plt.xlabel("Average cost for two")
plt.ylabel("Ratings")
plt.show()

There appears to be a direct relation between the two. Let's take a smaller sample to draw a clearer scatter plot.

In [None]:
plt.plot("average_cost_for_two","aggregate_rating", data=data.sample(1000), linestyle="none", marker="o")
plt.xlim([0,3000])
plt.show()

This relation suggests that as the average cost for two increases, a restaurant is more likely to be rated highly. Let's look at price range for a better comparison.


##### Relation between Price range and Rating

In [None]:
np.round(data["price_range"].corr(data["aggregate_rating"]), 2)

In [None]:
sns.boxplot(x='price_range', y='aggregate_rating', data=data)
plt.ylim(1)
plt.title("Relationship between Price range and Ratings")
plt.show()

Now it is clearer. Restaurants that charge more tend to be rated higher, plausibly because they offer more services and therefore have a better chance of getting good ratings from their customers.

## Conclusions

After working on this data, we can conclude the following:

1. Approximately 35% of restaurants in India are part of some chain
2. Domino's Pizza, Cafe Coffee Day and KFC are the biggest fast food chains in the country by number of outlets
3. Barbecue and grill chains have higher average ratings than other types of restaurants
4. Quick bites and casual dining establishments have the most outlets
5. Establishments with alcohol availability have the highest average ratings, votes and photo uploads
6. Bangalore has the most restaurants
7. Gurgaon has the highest rated restaurants (average 3.83), whereas Hyderabad has the most critics (votes). Mumbai and New Delhi dominate in photo uploads per outlet
8. After North Indian, Chinese is the most preferred cuisine in India
9. International cuisines are better rated than local cuisines
10. Gastro pub, Romantic Dining and Craft Beer features are well rated by customers
11. Most restaurants are rated between 3 and 4
12. The majority of restaurants are budget friendly, with an average cost for two between Rs.250 and Rs.800
13. There are fewer restaurants at higher price ranges
14. As the average cost for two increases, the chance of a restaurant having a higher rating increases

##### Now we have come to the end of this project, I hope you learned some new tricks.

#### Please give this notebook an upvote if you find it useful!

In [None]:
data.to_csv("zomato_final.csv")