<h1>Importing Libraries and Loading the Dataset</h1>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import chardet
import scipy as sp

In [3]:
with open("./data/zomato.csv", 'rb') as f:
    result = chardet.detect(f.read(10000))
    print(result)

{'encoding': 'MacRoman', 'confidence': 0.6358511488511488, 'language': ''}


In [4]:
df = pd.read_csv("./data/zomato.csv", encoding=result['encoding'])
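chardet reported MacRoman with a confidence of only about 0.64, based on the first 10,000 bytes, so the read may be worth guarding. The following is a minimal sketch, not part of the original notebook: the helper name `read_csv_with_fallback` and the candidate encoding list are assumptions chosen for illustration, and the input is toy bytes rather than the Zomato file.

```python
import io
import pandas as pd

def read_csv_with_fallback(source_bytes, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in order, return (DataFrame, encoding used)."""
    last_error = None
    for enc in encodings:
        try:
            frame = pd.read_csv(io.BytesIO(source_bytes), encoding=enc)
            return frame, enc
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding
            last_error = err
    raise last_error

# Toy input: "Café" encoded as Latin-1 is not valid UTF-8, so the first attempt fails
raw = "name,votes\nCafé,10\n".encode("latin-1")
demo_df, used_encoding = read_csv_with_fallback(raw)
print(used_encoding)
```

Whichever encoding ends up being used, spot-checking a few non-ASCII restaurant names after loading is a cheap way to confirm that the text decoded sensibly.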

<h1>Task 1: Data Exploration</h1>

In [6]:
df.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


In [8]:
df.shape

(9551, 21)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

In [10]:
df.describe()

Unnamed: 0,Restaurant ID,Country Code,Longitude,Latitude,Average Cost for two,Price range,Aggregate rating,Votes
count,9551.0,9551.0,9551.0,9551.0,9551.0,9551.0,9551.0,9551.0
mean,9051128.0,18.365616,64.126574,25.854381,1199.210763,1.804837,2.66637,156.909748
std,8791521.0,56.750546,41.467058,11.007935,16121.183073,0.905609,1.516378,430.169145
min,53.0,1.0,-157.948486,-41.330428,0.0,1.0,0.0,0.0
25%,301962.5,1.0,77.081343,28.478713,250.0,1.0,2.5,5.0
50%,6004089.0,1.0,77.191964,28.570469,400.0,2.0,3.2,31.0
75%,18352290.0,1.0,77.282006,28.642758,700.0,2.0,3.7,131.0
max,18500650.0,216.0,174.832089,55.97698,800000.0,4.0,4.9,10934.0


In [11]:
df.isnull().sum()

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64

In [14]:
df.duplicated().sum()

np.int64(0)

In [15]:
df.nunique()

Restaurant ID           9551
Restaurant Name         7446
Country Code              15
City                     141
Address                 8918
Locality                1208
Locality Verbose        1265
Longitude               8120
Latitude                8677
Cuisines                1825
Average Cost for two     140
Currency                  12
Has Table booking          2
Has Online delivery        2
Is delivering now          2
Switch to order menu       1
Price range                4
Aggregate rating          33
Rating color               6
Rating text                6
Votes                   1012
dtype: int64

<h2>Data Preprocessing on Features</h2>

<h3>Categorical Cols</h3>

In [16]:
categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

In [17]:
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")

Restaurant Name: 7446 unique values
City: 141 unique values
Address: 8918 unique values
Locality: 1208 unique values
Locality Verbose: 1265 unique values
Cuisines: 1825 unique values
Currency: 12 unique values
Has Table booking: 2 unique values
Has Online delivery: 2 unique values
Is delivering now: 2 unique values
Switch to order menu: 1 unique values
Rating color: 6 unique values
Rating text: 6 unique values


In [18]:
for col in categorical_cols:
    print(df[col].value_counts(normalize=True).head(10))

Restaurant Name
Cafe Coffee Day     0.008690
Domino's Pizza      0.008271
Subway              0.006596
Green Chick Chop    0.005340
McDonald's          0.005026
Keventers           0.003560
Pizza Hut           0.003141
Giani               0.003036
Baskin Robbins      0.002932
Barbeque Nation     0.002722
Name: proportion, dtype: float64
City
New Delhi       0.573029
Gurgaon         0.117056
Noida           0.113077
Faridabad       0.026280
Ghaziabad       0.002618
Bhubaneshwar    0.002199
Lucknow         0.002199
Ahmedabad       0.002199
Amritsar        0.002199
Guwahati        0.002199
Name: proportion, dtype: float64
Address
Sector 41, Noida                                                              0.001152
Dilli Haat, INA, New Delhi                                                    0.001152
Greater Kailash (GK) 1, New Delhi                                             0.001047
The Imperial, Janpath, New Delhi                                              0.000942
Food Court, 3rd F

In [22]:
df['Primary Cuisine'] = df['Cuisines'].str.split(',').str[0].str.strip().str.lower()
top_cuisines = df['Primary Cuisine'].value_counts().nlargest(10).index
df['Cuisine Grouped'] = df['Primary Cuisine'].apply(lambda x: x if x in top_cuisines else 'Other')

In [23]:
text_cols = ['City', 'Locality', 'Locality Verbose', 'Rating text']
for col in text_cols:
    df[col] = df[col].str.strip().str.lower()

In [24]:
cat_cols = ['City', 'Has Table booking', 'Has Online delivery', 'Is delivering now', 'Rating text', 'Cuisine Grouped']
df[cat_cols] = df[cat_cols].astype('category')


In [25]:
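# Bins are right-closed, so ratings of exactly 0.0 (the column minimum) fall outside (0, 2] and become NaN;
# pass include_lowest=True if they should be binned as 'Poor' instead
df['Rating Category'] = pd.cut(df['Aggregate rating'], bins=[0, 2, 3, 4, 5], labels=['Poor', 'Average', 'Good', 'Excellent'])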
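`pd.cut` uses right-closed intervals by default, so a value equal to the lowest bin edge is left unbinned unless `include_lowest=True` is passed. Since `df.describe()` reports a minimum `Aggregate rating` of 0.0, this affects the binning here. A small sketch on toy ratings (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

ratings = pd.Series([0.0, 1.5, 2.0, 3.2, 4.9])  # toy values for illustration
bins = [0, 2, 3, 4, 5]
labels = ['Poor', 'Average', 'Good', 'Excellent']

# Default behaviour: intervals are (0, 2], (2, 3], (3, 4], (4, 5]
default_cut = pd.cut(ratings, bins=bins, labels=labels)

# include_lowest=True makes the first interval closed on the left as well
inclusive_cut = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```

Whether a 0.0 rating should count as 'Poor' or stay missing is a modelling choice; the sketch only shows how to control it.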

<h2>Numerical Cols</h2>

In [20]:
df[numerical_cols].describe()

Unnamed: 0,Restaurant ID,Country Code,Longitude,Latitude,Average Cost for two,Price range,Aggregate rating,Votes
count,9551.0,9551.0,9551.0,9551.0,9551.0,9551.0,9551.0,9551.0
mean,9051128.0,18.365616,64.126574,25.854381,1199.210763,1.804837,2.66637,156.909748
std,8791521.0,56.750546,41.467058,11.007935,16121.183073,0.905609,1.516378,430.169145
min,53.0,1.0,-157.948486,-41.330428,0.0,1.0,0.0,0.0
25%,301962.5,1.0,77.081343,28.478713,250.0,1.0,2.5,5.0
50%,6004089.0,1.0,77.191964,28.570469,400.0,2.0,3.2,31.0
75%,18352290.0,1.0,77.282006,28.642758,700.0,2.0,3.7,131.0
max,18500650.0,216.0,174.832089,55.97698,800000.0,4.0,4.9,10934.0


In [31]:
df[numerical_cols].skew()

Restaurant ID           0.061570
Country Code            3.043965
Longitude              -2.807328
Latitude               -3.081635
Average Cost for two    2.568842
Price range             0.889618
Aggregate rating       -0.954130
Votes                   3.781207
dtype: float64

In [26]:
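from scipy.stats.mstats import winsorize

# Replace the lowest 1% and highest 1% of values with the nearest remaining value
# to limit the pull of extreme outliers (e.g. the 800,000 maximum cost)
df['Capped_Cost'] = winsorize(df['Average Cost for two'], limits=[0.01, 0.01])
df['Capped_Votes'] = winsorize(df['Votes'], limits=[0.01, 0.01])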

In [27]:
df['Capped_Cost'] = pd.Series(df['Capped_Cost'])
df['Capped_Votes'] = pd.Series(df['Capped_Votes'])

In [28]:
# Same bins and labels as 'Rating Category', stored under a second column name
df['Rating_Level'] = pd.cut(df['Aggregate rating'],
                            bins=[0, 2, 3, 4, 5],
                            labels=['Poor', 'Average', 'Good', 'Excellent'])

In [29]:
price_map = {1: 'Low', 2: 'Mid', 3: 'High', 4: 'Luxury'}
df['Price_Level'] = df['Price range'].map(price_map)

In [30]:
df.drop(columns=['Switch to order menu'], inplace=True)  # zero-variance feature (only one unique value)

Some additional grouping

In [36]:
# city grouping 
city_proportion = df['City'].value_counts(normalize=True)
threshold = 0.01  # i.e., 1%
common_cities = city_proportion[city_proportion > threshold].index
df['City_grouped'] = df['City'].apply(lambda x: x if x in common_cities else 'Other')

In [37]:
# Grouping all non-INR currencies as "Other"
df['Currency_grouped'] = df['Currency'].apply(
    lambda x: x if x == 'Indian Rupees(Rs.)' else 'Other'
)

In [38]:
# Getting top 10 most frequent cuisines
top_cuisines = df['Cuisines'].value_counts().nlargest(10).index

# Grouping all others into "Other"
df['Cuisines_grouped'] = df['Cuisines'].apply(
    lambda x: x if x in top_cuisines else 'Other'
)

📌 1. Initial Data Exploration
✅ Basic Checks:

    Loaded the dataset with the encoding detected by chardet (MacRoman, confidence ≈ 0.64)

    Verified with:

        df.shape → Checked rows and columns

        df.info() → Checked datatypes and nulls

        df.describe() → Summary statistics for numerical columns

        df.duplicated() → Checked for duplicate rows

        df.isnull().sum() → Counted missing values per column
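
The checks listed above can be bundled into one call when the same inspection is repeated on several frames. This is a sketch only: `basic_checks` is a hypothetical helper name, and the toy frame stands in for the Zomato data.

```python
import pandas as pd

def basic_checks(frame):
    """Hypothetical helper: gather shape, duplicate count, and missing values in one dict."""
    missing = frame.isnull().sum()
    return {
        'shape': frame.shape,
        'duplicate_rows': int(frame.duplicated().sum()),
        # Keep only the columns that actually contain missing values
        'missing_by_column': missing[missing > 0].to_dict(),
    }

# Toy frame: one missing cuisine and one fully duplicated row
toy = pd.DataFrame({'Cuisines': ['Japanese', None, 'Japanese'], 'Votes': [10, 5, 10]})
summary = basic_checks(toy)
print(summary)
```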

📌 2. Class Distribution Review (Categorical Columns)
🔍 Checked value_counts(normalize=True) for key categorical features:
Restaurant Name

    7446 unique values

    Highly sparse — top brands include: Cafe Coffee Day, Domino's, Subway, etc.

City

    Highly imbalanced:

        New Delhi → 57%

        Gurgaon, Noida → 11-12% each

        Others → < 3% (each)

Currency

    12 unique values

        Indian Rupees(Rs.) → 90.5%

        All others < 6%

Locality & Locality Verbose

    1200+ unique values, long-tailed distributions

Has Table Booking / Online Delivery / Is Delivering Now

    Binary features, but class imbalance noted:

        Is Delivering Now → 99.6% = No

Rating Color & Text

    6 categories, somewhat imbalanced but still meaningful
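
The imbalance review above can be condensed into one table that lists, for each categorical column, its most frequent value and that value's share. A minimal sketch, assuming a hypothetical helper `dominant_share` and toy data in place of the real frame:

```python
import pandas as pd

def dominant_share(frame, columns):
    """Hypothetical helper: most frequent value, its share, and cardinality per column."""
    rows = []
    for col in columns:
        shares = frame[col].value_counts(normalize=True)  # sorted in descending order
        rows.append({
            'column': col,
            'top_value': shares.index[0],
            'top_share': shares.iloc[0],
            'n_unique': frame[col].nunique(),
        })
    return pd.DataFrame(rows).set_index('column')

# Toy data loosely shaped like the imbalance described above (not the real counts)
toy = pd.DataFrame({
    'Is delivering now': ['No'] * 99 + ['Yes'],
    'City': ['new delhi'] * 57 + ['gurgaon'] * 12 + ['noida'] * 11 + ['faridabad'] * 20,
})
report = dominant_share(toy, toy.columns)
print(report)
```

Columns with a `top_share` close to 1 carry little signal on their own, while columns with a very large `n_unique` are candidates for grouping.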

📌 3. Numerical Feature Analysis
✅ df.describe() Summary:

    Extreme values present (e.g., Average Cost for Two max = 800,000)

    Votes and Cost are highly skewed

✅ Skewness Check:

df[numerical_cols].skew()

    Votes: 8.8 (highly right-skewed)

    Average Cost for two: 35.4 (very high skew)

    Country Code: 3.04 (right-skewed)

    Aggregate rating: -0.95 (left-skewed)
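
The effect of capping on skewness can be checked by comparing `skew()` before and after winsorizing. The sketch below uses the same 1% limits as the notebook but runs on synthetic data; the seed, distribution, and sizes are assumptions for illustration and do not reproduce the Zomato figures.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
# Synthetic right-skewed "votes" plus one extreme value; not the Zomato data
votes = pd.Series(np.append(rng.exponential(scale=100, size=999), 50_000.0))

# winsorize returns a masked array, so convert it back to a plain array first
capped_values = np.asarray(winsorize(votes.to_numpy(), limits=[0.01, 0.01]))
capped = pd.Series(capped_values, index=votes.index)

skew_raw = votes.skew()
skew_capped = capped.skew()
print(f"raw skew: {skew_raw:.2f}, capped skew: {skew_capped:.2f}")
```

Capping shortens the tail without dropping rows, which is why the row count stays the same while the maximum and the skew both fall.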

📌 4. Feature Engineering
🆕 Features Created:
🔹 Currency_grouped

df['Currency_grouped'] = df['Currency'].apply(lambda x: x if x == 'Indian Rupees(Rs.)' else 'Other')

🔹 Cuisines_grouped

top_cuisines = df['Cuisines'].value_counts().nlargest(10).index
df['Cuisines_grouped'] = df['Cuisines'].apply(lambda x: x if x in top_cuisines else 'Other')

🔹 City_grouped

city_proportion = df['City'].value_counts(normalize=True)
common_cities = city_proportion[city_proportion > 0.01].index
df['City_grouped'] = df['City'].apply(lambda x: x if x in common_cities else 'Other')

📌 5. Class Imbalance Handling (Categorical)
👇 Applied Grouping to:

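Feature     Strategy                                         Notes
City        Grouped cities with a share below 1% as "Other"  Stored in City_grouped
Currency    Kept Indian Rupees, grouped all others           Stored in Currency_grouped
Cuisines    Kept the 10 most frequent combinations           Stored in Cuisines_grouped
Locality    Could optionally group the long tail             Not applied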
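
The same share-based rule used for City could be applied to long-tailed columns such as Locality. A minimal sketch, assuming a hypothetical helper `group_rare` and toy data; the 5% threshold in the example is illustrative only.

```python
import pandas as pd

def group_rare(series, min_share=0.01, other_label='Other'):
    """Hypothetical helper: keep values whose share exceeds min_share, relabel the rest."""
    values = series.astype(object)  # avoids errors when the input is a category dtype
    shares = values.value_counts(normalize=True)
    keep = set(shares[shares > min_share].index)
    # Values outside `keep` (including missing ones) are replaced by other_label
    return values.where(values.isin(keep), other_label)

# Toy data standing in for df['Locality']
locality = pd.Series(['a'] * 60 + ['b'] * 38 + ['c', 'd'])
grouped = group_rare(locality, min_share=0.05)
print(grouped.value_counts())
```

Keeping the threshold as a parameter makes it easy to compare how many categories survive at 1%, 0.5%, and so on before settling on a value.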