# **Project Name**    - **Unsupervised ML - Zomato Restaurant Clustering**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Pamendra Kaushik
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

This project performs Unsupervised Machine Learning on Zomato restaurant data to identify meaningful customer–restaurant patterns and segment restaurants into similar groups. The workflow begins with loading and exploring two datasets containing restaurant metadata and customer reviews. After handling duplicates, missing values, and inconsistent data formats, important features such as cost, cuisines, types, collections, review text length, and picture count were engineered to create a structured dataset for modelling.

Multiple clustering algorithms such as K-Means, Agglomerative (Hierarchical) Clustering, and DBSCAN were applied and evaluated. Metrics like the Silhouette Score were used to compare model performance. Based on the scores and cluster stability, the best-fit clustering solution was selected to group restaurants into distinct segments such as premium dining, budget-friendly outlets, and high-engagement restaurants.

These clusters help in understanding restaurant behaviour, customer preferences, and business positioning. The results can be used for improving recommendations, better categorisation on food platforms, and providing data-driven insights for restaurant owners.

# **GitHub Link -**

https://github.com/PamendraKaushik/Zomato-Restaurant-Clustering

# **Problem Statement**


Zomato hosts thousands of restaurants with diverse cuisines, pricing, customer reviews, and popularity levels. Understanding these restaurants and grouping similar ones together is challenging due to the high variation in features such as cost, cuisines, ratings, and review behaviour.
Traditional supervised learning cannot be applied because the dataset does not contain predefined labels or categories.

The objective of this project is to apply Unsupervised Machine Learning techniques to cluster restaurants into meaningful groups based on their metadata and review characteristics. By analysing numerical and text-derived features, the project aims to identify hidden patterns and segments such as premium restaurants, mid-range popular outlets, low-engagement restaurants, and budget-friendly options.

These insights help food platforms like Zomato improve recommendations, help customers discover restaurants more easily, and allow business owners to understand their competitive positioning using data-driven clusters.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, pairwise_distances
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from nltk.corpus import stopwords
import datetime
import ast
import joblib
sns.set(style="whitegrid")


### Dataset Loading

In [None]:
# Load Dataset
meta = pd.read_csv('https://raw.githubusercontent.com/PamendraKaushik/Zomato-Restaurant-Clustering/refs/heads/main/Zomato%20Restaurant%20names%20and%20Metadata.csv')
reviews = pd.read_csv('https://raw.githubusercontent.com/PamendraKaushik/Zomato-Restaurant-Clustering/refs/heads/main/Zomato%20Restaurant%20reviews.csv')


### Dataset First View

In [None]:
# Dataset First Look

In [None]:
meta.head(5)

In [None]:
reviews.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
meta.shape

In [None]:
reviews.shape

### Dataset Information

In [None]:
# Dataset Info

In [None]:
meta.info()

In [None]:
reviews.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(meta.duplicated().sum())
print(reviews.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(meta.isnull().sum())
print(reviews.isnull().sum())

In [None]:
# Visualizing the missing values
meta.isnull().sum().plot(kind='bar')

In [None]:
reviews.isnull().sum().plot(kind='bar')

### What did you know about your dataset?

1. The `Rating` column contains non-numeric values such as "Like". These have to be converted to `np.nan` (or mapped to a default value) before any numeric analysis.

2. The reviewer metadata column follows several patterns, for example:

*   "3 Reviews , 2 Followers"
*   "1 Review"
*   "30 Reviews , 34 Followers"

3. The reviews dataset contains 36 duplicate rows.
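A minimal sketch of how these three issues could be handled, shown on a small made-up frame rather than the real reviews data. The column name `Metadata`, the helper column names, and treating a missing follower count as 0 are assumptions made for illustration only.

```python
import pandas as pd

# Toy rows that mimic the issues noted above (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "Rating": ["5", "Like", "3.5", "5"],
    "Metadata": [
        "3 Reviews , 2 Followers",
        "1 Review",
        "30 Reviews , 34 Followers",
        "3 Reviews , 2 Followers",   # exact duplicate of the first row
    ],
})

# 1. Remove exact duplicate rows
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Non-numeric ratings such as "Like" become NaN
toy["Rating"] = pd.to_numeric(toy["Rating"], errors="coerce")

# 3. Pull the two counts out of the metadata text; "Review"/"Reviews" both match
toy["reviewer_reviews"] = (
    toy["Metadata"].str.extract(r"(\d+)\s+Reviews?", expand=False).astype(float)
)
toy["reviewer_followers"] = (
    toy["Metadata"].str.extract(r"(\d+)\s+Followers?", expand=False)
    .astype(float)
    .fillna(0)   # assumption: no "Followers" part means zero followers
)

print(toy)
```

Whether "Like" should stay as NaN or be mapped to a numeric value is a modelling choice; the sketch only shows the NaN option.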

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(meta.columns)
print(reviews.columns)

In [None]:
# Dataset Describe
meta.describe(include='all')


In [None]:
reviews.describe(include='all')

### Variables Description

**Meta Data:**

Name: Name of Restaurants

Links: URL Links of Restaurants

Cost: Per person estimated cost of dining

Collection: Tagging of Restaurants w.r.t. Zomato categories

Cuisines: Cuisines served by restaurants

Timings: Restaurant timings
  
**Review Data:**

Reviewer: Name of the reviewer

Review: Review text

Rating: Rating provided

MetaData: Reviewer metadata - Number of reviews and followers

Time: Date and Time of Review

Pictures: Number of pictures posted with review

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
meta.nunique()

In [None]:
reviews.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1️⃣ Clean "Cost" column in metadata → numeric (Cost_clean)
meta["Cost_clean"] = (
    meta["Cost"]              # e.g. "1,300"
    .astype(str)                 # ensure string
    .str.replace(",", "")        # remove comma
    .str.extract(r"(\d+)", expand=False)  # keep only digits
    .astype(float)               # convert to float
)

In [None]:
# 2️⃣ Clean "Rating" in reviews → numeric
reviews["Rating"] = pd.to_numeric(reviews["Rating"], errors="coerce")

In [None]:
# 3️⃣ Convert "Time" in reviews → datetime (optional but good)
reviews["Time"] = pd.to_datetime(reviews["Time"], errors="coerce")

In [None]:
# 4️⃣ Drop rows with missing critical fields
#    - Restaurant name or Rating missing in reviews → drop
#    - Restaurant Name missing in meta → drop
reviews = reviews.dropna(subset=["Restaurant", "Rating"])
meta = meta.dropna(subset=["Name"])

print("After basic cleaning:")
print("  Metadata shape :", meta.shape)
print("  Reviews shape  :", reviews.shape)

In [None]:
# 5️⃣ Check remaining missing values (for info)
print("\nMissing values in meta_df:")
print(meta.isna().sum())

print("\nMissing values in reviews_df:")
print(reviews.isna().sum())


In [None]:
# 6️⃣ Aggregate reviews at restaurant level
#    We create one row per restaurant with:
#    - avg_rating       : mean rating
#    - review_count     : number of reviews
#    - total_pictures   : total pictures uploaded
#    - avg_review_length: average length of review text
restaurant_agg = (
    reviews
    .groupby("Restaurant")
    .agg(
        avg_rating=("Rating", "mean"),
        review_count=("Rating", "count"),
        total_pictures=("Pictures", "sum"),
        avg_review_length=("Review", lambda x: x.astype(str).str.len().mean())
    )
    .reset_index()
)

print("\nRestaurant-level aggregated reviews (first 5 rows):")
display(restaurant_agg.head())

In [None]:
# 7️⃣ Merge aggregated reviews with restaurant metadata
#    match Restaurant (reviews) with Name (metadata)
restaurants_full = pd.merge(
    restaurant_agg,
    meta,
    left_on="Restaurant",
    right_on="Name",
    how="left"
)

print("\nMerged restaurant dataset shape:", restaurants_full.shape)
display(restaurants_full.head())


In [None]:
restaurants_full.columns

### What all manipulations have you done and insights you found?

**Manipulations:**

*   Dropped rows with missing critical fields (restaurant name or rating). The wrangling code shown above does not itself remove the duplicate review rows.
*   Extracted cuisines from the `Cuisines` column.
*   Converted the `Cost` column to a numeric (float) column, `Cost_clean`.

**Insights:**

*   There are 44 unique cuisines across 104 restaurants.
*   The estimated dining cost of the 104 restaurants ranges from ₹150 to ₹2800.
*   Locations extracted from the `Links` column show that all restaurants are in Gachibowli, Hyderabad.
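The unique-cuisine count can be reproduced by splitting the comma-separated `Cuisines` strings. The sketch below uses made-up rows (the toy frame is an assumption; in the notebook the column comes from `meta`).

```python
import pandas as pd

# Hypothetical metadata rows, including one missing value
toy_meta = pd.DataFrame({
    "Cuisines": ["North Indian, Chinese", "Chinese, Continental", None],
})

cuisine_list = (
    toy_meta["Cuisines"]
    .dropna()            # ignore restaurants with no cuisine listed
    .str.split(",")      # "A, B" -> ["A", " B"]
    .explode()           # one row per cuisine
    .str.strip()         # remove the space left after each comma
)

unique_cuisines = sorted(cuisine_list.unique())
print(len(unique_cuisines), unique_cuisines)
```

Stripping whitespace matters here: without it, "Chinese" and " Chinese" would be counted as two different cuisines.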

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 **Distribution of Cost Across Different Cuisines (Box Plot)**

In [None]:
restaurants_full["primary_cuisine"] = restaurants_full["Cuisines"].fillna("Unknown").apply(lambda x: str(x).split(",")[0])

plt.figure(figsize=(12,6))
sns.boxplot(
    data=restaurants_full,
    x="primary_cuisine",
    y="Cost_clean"
)
plt.xticks(rotation=90)
plt.title("Distribution of Cost Across Different Cuisines")
plt.xlabel("Cuisine")
plt.ylabel("Estimated Cost per Person (₹)")
plt.show()




##### 1. Why did you pick the specific chart?

A box plot is ideal for comparing price distributions across many cuisines — it shows median, spread, and outliers clearly.

##### 2. What is/are the insight(s) found from the chart?

Continental, Café, Italian have highest median prices.

North Indian, Fast Food, Chinese show huge variation → some very cheap, some premium.

Dessert/Bakery have consistently low cost.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Helps Zomato build “Affordable Cuisine” vs “Premium Cuisine” categories.
Identifies cuisines with stable pricing → easier to promote.

**Negative Insight:**
Cuisines like North Indian/Chinese show inconsistent pricing, hurting customer trust.

#### Chart - 2 **Bar Chart: Most Popular Cuisines (Total Reviews)**

In [None]:
popular = (
    restaurants_full.groupby("primary_cuisine")["review_count"]
    .sum()
    .sort_values(ascending=False)
    .head(15)
)

plt.figure(figsize=(10,6))
popular.plot(kind="bar", color="skyblue")
plt.title("Most Popular Cuisines (Based on Total Reviews)")
plt.xlabel("Cuisine")
plt.ylabel("Total Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart clearly compares total popularity (review count) across cuisines.

##### 2. What is/are the insight(s) found from the chart?

North Indian and Chinese are the most popular cuisines.

Fast Food, Desserts, Bakery also show high review volume.

Premium cuisines (Continental/Italian) have fewer restaurants but high reviews per outlet.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Helps Zomato target promotions toward highly-engaged cuisines.
Niche cuisines with high review density are great for premium campaigns.

**Negative Insight:**
 Popular cuisines are oversaturated → lower margins, tougher competition.

#### Chart - 3 **Scatter Plot: Cost vs Rating (Bubble = Review Count)**

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(
    data=restaurants_full,
    x="Cost_clean",
    y="avg_rating",
    size="review_count",
    hue="primary_cuisine",
    sizes=(20,200),
    alpha=0.7
)
plt.title("Cost vs Rating (Bubble Size = Review Count)")
plt.xlabel("Estimated Cost per Person (₹)")
plt.ylabel("Average Rating")
plt.show()


##### 1. Why did you pick the specific chart?

Shows the relationship between cost and quality (rating), while bubble size shows popularity.

##### 2. What is/are the insight(s) found from the chart?

Cost and rating have almost no correlation.

Many high-rated restaurants are affordable (₹300–₹600).

Expensive restaurants (>₹1000) often have average ratings → poor value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Highlights value-for-money restaurants, ideal for recommendations.
Helps premium restaurants identify quality issues.

**Negative Insight:**
Expensive restaurants with poor ratings harm Zomato’s brand reliability.

#### Chart - 4 **Pie Chart: Cuisine Market Share**

In [None]:
cuisine_counts = restaurants_full["primary_cuisine"].value_counts().head(8)

plt.figure(figsize=(8,8))
plt.pie(
    cuisine_counts,
    labels=cuisine_counts.index,
    autopct="%1.1f%%",
    startangle=90
)
plt.title("Cuisine Market Share (Top 8 Cuisines)")
plt.show()


##### 1. Why did you pick the specific chart?

Pie chart shows proportional distribution quickly and visually.

##### 2. What is/are the insight(s) found from the chart?

North Indian & Chinese dominate market share.

Niche cuisines have <5% share.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Helps with cuisine-level planning for campaigns & promotions.

**Negative Insight:**
Heavy reliance on two cuisines creates low variety → possible customer fatigue.

#### Chart - 5 **Line Chart: Average Cost Across Rating Buckets**

In [None]:
restaurants_full["rating_bucket"] = pd.cut(
    restaurants_full["avg_rating"],
    bins=[0,2.5,3,3.5,4,4.5,5],
    labels=["0-2.5","2.5-3","3-3.5","3.5-4","4-4.5","4.5-5"]
)

avg_cost_bucket = (
    restaurants_full.groupby("rating_bucket")["Cost_clean"]
    .mean()
    .reset_index()
)

plt.figure(figsize=(8,5))
sns.lineplot(data=avg_cost_bucket, x="rating_bucket", y="Cost_clean", marker="o")
plt.title("Cost Trend Across Rating Groups")
plt.xlabel("Rating Range")
plt.ylabel("Average Cost (₹)")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A line chart helps detect trends in price as rating increases.

##### 2. What is/are the insight(s) found from the chart?

Highest-rated restaurants (4.5–5) don’t always have high cost.

Mid-rated restaurants (3.5–4.5) show the best price–rating balance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Helps Zomato push “Top Rated Under ₹500” collections.
Useful for value-based recommendations.

**Negative Insight:**
 Very high-rated restaurants fluctuate in cost → lack pricing consistency.



#### Chart - 6 **Horizontal Bar Chart: Most Popular Restaurants**

In [None]:
top_restaurants = restaurants_full.sort_values("review_count", ascending=False).head(20)

plt.figure(figsize=(10,8))
sns.barplot(
    data=top_restaurants,
    x="review_count",
    y="Restaurant",
    hue="primary_cuisine"
)
plt.title("Top 20 Most Popular Restaurants")
plt.xlabel("Review Count")
plt.ylabel("Restaurant")
plt.show()


##### 1. Why did you pick the specific chart?

Ideal for ranking individual restaurants; horizontal bars accommodate long names.

##### 2. What is/are the insight(s) found from the chart?

Most top restaurants belong to North Indian, Chinese, Fast Food.

A few restaurants dominate review volume → brand loyalty.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Zomato can offer exclusive deals with high-traffic restaurants.
Strong candidates for “Zomato Recommended”.

**Negative:**
Smaller restaurants don’t get visibility → platform inequality.

#### Chart - 7 **Heatmap: Correlation of Cost, Rating, Reviews**

In [None]:
plt.figure(figsize=(5,4))
sns.heatmap(
    restaurants_full[["Cost_clean", "avg_rating", "review_count"]].corr(),
    annot=True,
    cmap="coolwarm"
)
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

Provides exact numerical correlation values that scatter plots only visually hint at.

##### 2. What is/are the insight(s) found from the chart?

Cost vs Rating ≈ 0 → No linear relationship.

Cost vs Review Count = negative correlation → expensive restaurants get fewer reviews.

Rating vs Review Count = slight positive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
Supports the view that price is not a reliable indicator of quality (a near-zero correlation is evidence, not proof).
Helps Zomato design evidence-based clustering & recommendation.

**Negative Insight:**
High-cost restaurants getting fewer reviews = low customer engagement risk.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1.“Average cost differs significantly across different cuisines.”

2.“More expensive restaurants receive higher ratings.”

3.“Popular restaurants (high review count) have higher ratings.”

### Hypothetical Statement - 1




#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

“Average cost differs significantly across different cuisines.”

This came from Chart 1 (box plot of cost across cuisines).

Null Hypothesis (H₀):
The mean cost is the same across all cuisines.

Alternate Hypothesis (H₁):

The mean cost differs for at least one cuisine.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# create groups for cuisines with enough samples
groups = [
    grp["Cost_clean"].dropna().values
    for name, grp in restaurants_full.groupby("primary_cuisine")
    if len(grp) > 5
]

f_stat, p_value = f_oneway(*groups)
f_stat, p_value


##### Which statistical test have you done to obtain P-Value?

Type of test used:

One-Way ANOVA (Analysis of Variance)

Why ANOVA?

We are comparing the means of more than two groups (many cuisines).

Cost is a continuous variable.

Cuisines are categorical groups.



##### Why did you choose the specific statistical test?

 ANOVA is the correct test for comparing the mean of a continuous variable across multiple categories.
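As an illustration of how the ANOVA result is read, here is a small sketch on made-up cost samples for three cuisines. The numbers and the significance level `alpha = 0.05` are assumptions, not values taken from the dataset.

```python
from scipy.stats import f_oneway

# Hypothetical cost samples per cuisine (made-up numbers)
north_indian = [400, 500, 450, 600, 550, 500]
continental = [1200, 1500, 1300, 1400, 1600, 1350]
desserts = [200, 250, 300, 220, 260, 240]

f_stat, p_value = f_oneway(north_indian, continental, desserts)

alpha = 0.05                     # assumed significance level
reject_h0 = p_value < alpha      # True -> at least one group mean differs

print(f"F = {f_stat:.2f}, p = {p_value:.4g}, reject H0: {reject_h0}")
```

A small p-value only says that at least one cuisine differs in mean cost; it does not say which one. A post-hoc comparison would be needed for that.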

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

“More expensive restaurants receive higher ratings.”

This came from Chart 3 (Cost vs Rating scatter plot) and Heatmap.

Null Hypothesis (H₀):

There is no linear correlation between restaurant cost and rating.

Alternate Hypothesis (H₁):

There is a linear correlation between restaurant cost and rating.
(a significant positive coefficient would support the statement above)

#### 2. Perform an appropriate statistical test.

In [None]:
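# Perform Statistical Test to obtain P-Value

from scipy.stats import pearsonr

# Drop rows where either value is missing so both series stay aligned row by row
pair = restaurants_full[["Cost_clean", "avg_rating"]].dropna()

corr, p_value = pearsonr(pair["Cost_clean"], pair["avg_rating"])
corr, p_value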


##### Which statistical test have you done to obtain P-Value?

Type of test used:

Pearson Correlation Test


##### Why did you choose the specific statistical test?

Both variables are continuous (Cost_clean & avg_rating).

We want to measure strength & direction of linear relationship.

Data shows a roughly continuous distribution.
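For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up cost and rating values and compares it with `scipy.stats.pearsonr`; the arrays are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up values for five restaurants
cost = np.array([300.0, 500.0, 800.0, 1200.0, 1500.0])
rating = np.array([3.9, 3.6, 4.1, 3.7, 4.0])

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = cost - cost.mean()
dy = rating - rating.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, p_value = pearsonr(cost, rating)
print(r_manual, r_scipy, p_value)
```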

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

“Popular restaurants (high review count) have higher ratings.”

Null & Alternate Hypothesis
Null Hypothesis (H₀):

There is no relationship between review count and average rating.

Alternate Hypothesis (H₁):

There is a relationship between review count and average rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import spearmanr

corr, p_value = spearmanr(restaurants_full["review_count"], restaurants_full["avg_rating"])
corr, p_value



##### Which statistical test have you done to obtain P-Value?

Spearman Rank Correlation

##### Why did you choose the specific statistical test?

Review counts are highly skewed, not normally distributed.

Spearman does not assume normality and still captures a relationship that is monotonic but non-linear.

It works on ranks, which makes it well suited to popularity trends.
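A short sketch of the difference between the two coefficients, using a made-up relationship that is perfectly monotonic but not linear. The variable `score` is hypothetical and only serves the illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

review_count = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = review_count ** 3        # always increasing, but curved (made-up relation)

rho, _ = spearmanr(review_count, score)   # rank-based
r, _ = pearsonr(review_count, score)      # assumes a straight-line relation

print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

Spearman reports a perfect monotonic relationship here, while Pearson is lower because the points do not lie on a straight line.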

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# ----------------------------
# HANDLING MISSING VALUES
# ----------------------------

# These are the key numeric columns we will use for clustering
numeric_cols = [
    "avg_rating",
    "review_count",
    "total_pictures",
    "avg_review_length",
    "Cost_clean"
]

# Impute missing numeric values using median
for col in numeric_cols:
    restaurants_full[col] = restaurants_full[col].fillna(restaurants_full[col].median())

# Verify missing values
restaurants_full[numeric_cols].isna().sum()


#### What all missing value imputation techniques have you used and why did you use those techniques?

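The restaurant-level dataset can contain missing values for cost, average review length, and picture count.

Distance-based clustering cannot handle missing values, so they have to be filled or dropped.

Median imputation was used because the median is robust to the skew and outliers in these columns, so it distorts the distribution less than the mean would.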
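The effect of choosing the median over the mean can be seen on a tiny made-up cost series with one expensive restaurant (the values are assumptions for illustration):

```python
import pandas as pd

# One missing value and one expensive outlier
cost = pd.Series([200.0, 300.0, 400.0, None, 2800.0])

median_filled = cost.fillna(cost.median())   # median of the four known values
mean_filled = cost.fillna(cost.mean())       # mean is pulled up by the 2800 entry

print("median fill:", median_filled.iloc[3])
print("mean fill  :", mean_filled.iloc[3])
```

The mean-based fill lands far above what a typical restaurant in this toy series costs, while the median-based fill stays in the middle of the bulk of the values.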

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# ----------------------------
# OUTLIER REMOVAL (IQR METHOD)
# ----------------------------

def iqr_filter(data, col, factor=1.5):
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    return data[(data[col] >= lower) & (data[col] <= upper)]

filtered_df = restaurants_full.copy()

# Removing outliers in cost and review count (the two most skewed cols)
filtered_df = iqr_filter(filtered_df, "Cost_clean")
filtered_df = iqr_filter(filtered_df, "review_count")

filtered_df.reset_index(drop=True, inplace=True)

print("Before:", restaurants_full.shape)
print("After outlier removal:", filtered_df.shape)


##### What all outlier treatment techniques have you used and why did you use those techniques?

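Cost and review counts contain values that are extreme relative to most restaurants.

Because clustering is distance-based, these values can heavily skew the results.

The IQR method removes rows that fall more than 1.5 × IQR below the first quartile or above the third quartile.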
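A worked example of the IQR bounds on made-up cost values. It also shows capping (clipping) as a possible alternative to dropping rows; capping is not what the code above does and is included only as an assumption-labelled option for cases where losing restaurants is undesirable.

```python
import pandas as pd

cost = pd.Series([150, 300, 400, 500, 600, 2800], dtype=float)

q1, q3 = cost.quantile(0.25), cost.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option used in the notebook: drop rows outside the bounds
kept = cost[(cost >= lower) & (cost <= upper)]

# Alternative (not used above): keep every row but cap extreme values
capped = cost.clip(lower=lower, upper=upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print("rows kept:", len(kept), "| max after capping:", capped.max())
```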

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# From the Zomato metadata, the key categorical fields are:
# Cuisines (text), Timings (text), primary_cuisine (derived)

# 1. Binary Encoding (Cuisine flags)

top_cuisines_list = ["North Indian", "Chinese", "Biryani", "Continental", "Desserts"]

for c in top_cuisines_list:
    filtered_df[f"cuisine_{c.replace(' ', '_').lower()}"] = (
        filtered_df["Cuisines"]
        .fillna("")
        .str.contains(c, case=False)
        .astype(int)
    )

# 2. Numeric Encoding for Timings (Timings Length)

filtered_df["timings_length"] = (
    filtered_df["Timings"]
    .fillna("")
    .astype(str)
    .str.len()
)

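# 3. Numeric Extraction (Number of Cuisines)
# Count only non-empty entries so that a missing Cuisines value gives 0, not 1

filtered_df["num_cuisines"] = (
    filtered_df["Cuisines"]
    .fillna("")
    .apply(lambda x: len([c for c in str(x).split(",") if c.strip()]))
)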


#### What all categorical encoding techniques have you used & why did you use those techniques?

Categorical Encoding Techniques Used:

Binary Encoding for top cuisines

Used because “Cuisines” is multi-valued text, and one-hot encoding would create too many sparse columns.

Text Length Encoding for ‘Timings’

Converts the timings text to a numeric feature. The length of the text is only a rough proxy for how complex the opening schedule is.

Count Encoding for number of cuisines

Represents menu diversity in numeric form.

These encoding techniques keep the model efficient, avoid sparsity, and preserve business meaning.
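To decide which cuisines deserve their own flag, one option is to build the full multi-hot table first and look at how often each cuisine occurs. The sketch below uses the pandas method `Series.str.get_dummies` on made-up rows; the separator `", "` is an assumption about how the cuisines are written, and real data may need whitespace cleaning first.

```python
import pandas as pd

# Hypothetical Cuisines values
cuisines = pd.Series(["North Indian, Chinese", "Chinese", "Desserts, Bakery"])

# One 0/1 column per distinct cuisine
flags = cuisines.str.get_dummies(sep=", ")

# Frequency of each cuisine, most common first
frequency = flags.sum().sort_values(ascending=False)

print(flags)
print(frequency)
```

Keeping only the most frequent columns from `frequency` gives the same kind of compact flag set as the hand-picked list above, without hard-coding the names.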

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

contractions = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "they're": "they are",
    "we're": "we are"
}

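# Match whole words case-insensitively: this step runs before lower-casing,
# so "Can't" and "I'm" have to be matched as well
contractions_pattern = re.compile(
    r"\b(%s)\b" % "|".join(map(re.escape, contractions.keys())),
    flags=re.IGNORECASE
)

def expand_contractions(text):
    # Look up the lower-cased match because the dictionary keys are lower case
    return contractions_pattern.sub(lambda m: contractions[m.group().lower()], text)

# fillna("") keeps missing reviews from turning into the literal string "nan"
reviews["Review_clean"] = reviews["Review"].fillna("").astype(str).apply(expand_contractions)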

#### 2. Lower Casing

In [None]:
# Lower Casing
reviews["Review_clean"] = reviews["Review_clean"].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
reviews["Review_clean"] = reviews["Review_clean"].apply(
    lambda x: x.translate(str.maketrans("", "", string.punctuation))
)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs and any token that contains a digit
# (\S*\d\S* drops the whole token, not just the digit characters)

reviews["Review_clean"] = reviews["Review_clean"].apply(
    lambda x: re.sub(r"http\S+|www\S+|\S*\d\S*", "", x)
)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

reviews["Review_clean"] = reviews["Review_clean"].apply(
    lambda x: " ".join([w for w in x.split() if w not in stop_words])
)

In [None]:
# Remove White spaces
reviews["Review_clean"] = reviews["Review_clean"].str.strip()

#### 6. Rephrase Text

In [None]:
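# Rephrase Text
# Note: "very" is an NLTK English stopword and has already been removed above,
# so the "very good" rule only fires if this step runs before stopword removal.
def rephrase(text):
    return text.replace("food quality", "quality").replace("very good", "excellent")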

reviews["Review_clean"] = reviews["Review_clean"].apply(rephrase)


#### 7. Tokenization

In [None]:
# Tokenization
reviews["tokens"] = reviews["Review_clean"].apply(lambda x: x.split())


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

reviews["lemma_tokens"] = reviews["tokens"].apply(
    lambda words: [lemm.lemmatize(w) for w in words]
)


##### Which text normalization technique have you used and why?

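Lemmatization (NLTK `WordNetLemmatizer`). It reduces each token to its dictionary form, so the output remains real words, unlike stemming, which can produce truncated stems.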

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk
nltk.download("averaged_perceptron_tagger_eng")

reviews["pos_tags"] = reviews["lemma_tokens"].apply(nltk.pos_tag)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500)
tfidf_matrix = tfidf.fit_transform(reviews["Review_clean"])

tfidf_matrix.shape


##### Which text vectorization technique have you used and why?

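TfidfVectorizer (TF-IDF), limited to the 500 most frequent terms (`max_features=500`). TF-IDF down-weights words that appear in most reviews and up-weights words that are distinctive to a review, which is more informative than raw counts.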

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# ====== FEATURE MANIPULATION ON MAIN DATASET ======



# 1. Log-transform skewed numeric features
restaurants_full["review_count_log"] = np.log1p(restaurants_full["review_count"])
restaurants_full["cost_log"] = np.log1p(restaurants_full["Cost_clean"])

# 2. Normalize review length
restaurants_full["review_length_norm"] = (
    restaurants_full["avg_review_length"] / restaurants_full["avg_review_length"].max()
)

# 3. Create popularity score (weighted between reviews and pictures)
restaurants_full["popularity_score"] = (
    restaurants_full["review_count"] * 0.7 +
    restaurants_full["total_pictures"] * 0.3
)

# 4. Number of cuisines offered
restaurants_full["num_cuisines"] = (
    restaurants_full["Cuisines"].astype(str).apply(lambda x: len(x.split(",")))
)

# 5. Operational complexity (length of timings text)
restaurants_full["timings_length"] = restaurants_full["Timings"].astype(str).str.len()

# 6. Binary encode top cuisines
top_cuisines = ["North Indian", "Chinese", "Biryani", "Continental", "Desserts"]

for c in top_cuisines:
    restaurants_full[f"cuisine_{c.replace(' ', '_').lower()}"] = (
        restaurants_full["Cuisines"].astype(str)
        .str.contains(c, case=False)
        .astype(int)
    )


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting


# Select only numeric features from manipulated dataset
numeric_df = restaurants_full.select_dtypes(include=[np.number])

# 1. Variance Threshold
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(numeric_df)

print("Shape after variance threshold:", X_var.shape)

# 2. Correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
sns.heatmap(numeric_df.corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap - Numeric Features")
plt.show()


##### What all feature selection methods have you used  and why?

**Variance Threshold**

With `threshold=0.0`, this removes only features that have zero variance (constant columns).

These features add no value to clustering.

**Correlation Analysis**

Helps identify redundant features (e.g., review_count vs popularity_score).

Highly correlated features are candidates for removal, because together they give the same signal extra weight in the distance calculation.

**Business Understanding-Based Selection**

We kept only features that capture meaningful restaurant behavior such as rating, cost, popularity, cuisine diversity, and operational complexity.
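One common way to act on the correlation matrix is to drop one feature from every highly correlated pair. The sketch below does this on made-up data; the 0.9 cut-off and the toy values are assumptions, and the notebook code above only plots the heatmap without removing anything.

```python
import numpy as np
import pandas as pd

# Made-up restaurant features; popularity_score is built from the other two
toy = pd.DataFrame({
    "review_count": [10, 20, 30, 40, 50, 60],
    "total_pictures": [5, 1, 4, 2, 6, 3],
    "avg_rating": [3.5, 4.2, 3.1, 4.0, 3.3, 4.4],
})
toy["popularity_score"] = 0.7 * toy["review_count"] + 0.3 * toy["total_pictures"]

corr = toy.corr().abs()

# Keep only the upper triangle so every pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9   # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = toy.drop(columns=to_drop)
print("dropped:", to_drop)
```

Which member of a pair is dropped depends on column order, so in practice the choice should also reflect which feature is easier to interpret.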

##### Which all features you found important and why?

| Feature | Why It Is Important |
|---|---|
| avg_rating | Reflects customer satisfaction & quality |
| review_count_log | Popularity indicator without skew |
| cost_log | Cleaned, log-transformed restaurant pricing |
| popularity_score | Weighted measure of engagement |
| num_cuisines | Shows variety of menu |
| timings_length | Represents operational complexity |
| cuisine flags | Distinguishes restaurant categories |


These features are essential for identifying meaningful clusters such as low-cost high-rated, premium low-rated, and multi-cuisine high-popularity restaurants.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

Yes, the dataset required transformation.

Reason 1: Skewness in review_count & Cost_clean
Both of these columns are heavily right-skewed, with a few very large values.
→ We applied a log transformation (`np.log1p`) to compress the long tail and reduce the skew.

Reason 2: avg_review_length had inconsistent scale
→ We applied normalization to scale text length between 0 and 1.

Reason 3: Text-based fields such as Cuisines and Timings
→ Transformed into quantitative features (num_cuisines, timings_length, cuisine flags).

These transformations ensure that the final dataset is numerically consistent and suitable for distance-based clustering algorithms like KMeans and Agglomerative Clustering.
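
The transformations described above can be sketched as follows. This is an illustration on made-up values: the use of `np.log1p` (rather than a plain `np.log`), `MinMaxScaler`, and the output column name `avg_review_length_norm` are assumptions, since the original transformation cell is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values that mimic right-skewed restaurant features
toy = pd.DataFrame({
    "review_count": [0, 3, 10, 100, 5000],
    "Cost_clean": [150, 400, 800, 1500, 2800],
    "avg_review_length": [20.0, 85.0, 140.0, 300.0, 620.0],
})

# log1p(x) = log(1 + x), so zero counts map to 0 instead of -inf
toy["review_count_log"] = np.log1p(toy["review_count"])
toy["cost_log"] = np.log1p(toy["Cost_clean"])

# Min-max normalization rescales the column to the [0, 1] range
toy["avg_review_length_norm"] = (
    MinMaxScaler().fit_transform(toy[["avg_review_length"]]).ravel()
)

print(toy.round(3))
print("Skew before:", round(toy["review_count"].skew(), 2),
      "after:", round(toy["review_count_log"].skew(), 2))
```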

### 6. Data Scaling

In [None]:
# Scaling your data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(restaurants_full.select_dtypes(include=['float64', 'int64']))

##### Which method have you used to scale you data and why?

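I used StandardScaler for scaling the data.

Reason:

Clustering algorithms (KMeans, Agglomerative) use distance calculations, so features with large numeric ranges would otherwise dominate the result.

Features like Cost_clean, review_count, popularity_score were much larger in scale than ratings (0–5).

StandardScaler standardizes every feature to mean 0 and standard deviation 1, so each feature contributes comparably to the distance.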
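
As a quick check of what StandardScaler does, the sketch below compares its output with the manual z-score formula z = (x - mean) / std on a small made-up matrix. The values and the `toy_` variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: cost, review_count, rating (very different scales)
toy_X = np.array([
    [300.0, 12.0, 3.9],
    [1200.0, 450.0, 4.4],
    [800.0, 90.0, 3.2],
    [2500.0, 30.0, 4.1],
])

toy_scaled = StandardScaler().fit_transform(toy_X)

# StandardScaler uses the population standard deviation (ddof=0), like np.std
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print("Matches manual z-score:", np.allclose(toy_scaled, manual))
print("Column means:", toy_scaled.mean(axis=0).round(6))
print("Column stds: ", toy_scaled.std(axis=0).round(6))
```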

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

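Yes, dimensionality reduction was needed.

Reasons:

- Feature engineering created many new numerical variables:
  - log features
  - cuisine flags
  - popularity score
  - diversity score
  - timings_length
- A high-dimensional feature space reduces cluster separability, because distances between points become less informative as dimensions grow.
- PCA helps remove noise and improve visualization.

Thus PCA supports both model stability and interpretability. Note that the clustering models in this notebook are fit on `X_scaled`; the two PCA components (`X_pca`) serve mainly for 2D visualization.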

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used Principal Component Analysis (PCA) because:

- It reduces multicollinearity, since the principal components are uncorrelated with each other
- It helps visualize clusters in 2D
- It retains as much variance as possible for the chosen number of components
- It improves computational efficiency
- It helps interpret cluster boundaries clearly

PCA was well suited for compressing the engineered features into a small number of meaningful components.
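
How much variance two components actually retain depends on the data, so it is worth checking. The sketch below uses synthetic data from `make_blobs` (an assumption, standing in for `X_scaled`) to show how `explained_variance_ratio_` reports the retained variance and how a float `n_components` lets PCA pick the number of components for a target share.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled restaurant features (8 numeric columns)
toy_X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=42)
toy_scaled = StandardScaler().fit_transform(toy_X)

# Share of total variance captured by a 2D projection
pca_2d = PCA(n_components=2, random_state=42).fit(toy_scaled)
kept_2d = pca_2d.explained_variance_ratio_.sum()
print("Variance kept by 2 components:", round(float(kept_2d), 3))

# A float in (0, 1) asks PCA to keep enough components to exceed that share
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(toy_scaled)
print("Components needed for 90% variance:", pca_90.n_components_)
```

If the 2D share is low on the real data, the scatter plot is still useful for visualization, but cluster quality should be judged on the full scaled feature set.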

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.



Data splitting is not required because this is an unsupervised learning (clustering) problem.

There is no target label.

We are not predicting anything; we are grouping restaurants.

The entire dataset is used for cluster formation.

Thus, data splitting (train/test) is not applicable.

##### What data splitting ratio have you used and why?

Not applicable. No train/test split was used because the whole dataset is clustered.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No, the dataset is not imbalanced.

Imbalance only exists when you have class labels with unequal distribution.

Since clustering has no labels, the concept of imbalance does not apply.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No imbalance handling technique was required.

Methods like SMOTE, undersampling, oversampling apply only to classification problems.

Clustering works on unlabeled data, so balancing is unnecessary.

## ***7. ML Model Implementation***

### ML Model - 1  ---- **Implementation (KMeans)**

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

# ==============================
# ML Model - 1 : KMEANS CLUSTERING
# ==============================

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Choose an initial number of clusters (based on elbow / domain guess)
initial_k = 4

# 2. Fit the KMeans model
kmeans_model = KMeans(n_clusters=initial_k, random_state=42)
kmeans_labels = kmeans_model.fit_predict(X_scaled)

# 3. Attach cluster labels to main dataset
restaurants_full["kmeans_cluster"] = kmeans_labels

# 4. Evaluate using Silhouette Score (internal clustering metric)
sil_score = silhouette_score(X_scaled, kmeans_labels)
print(f"Initial KMeans model with k={initial_k}")
print(f"Silhouette Score: {sil_score:.4f}")

# 5. Quick look at cluster sizes
print("\nCluster counts:")
print(restaurants_full["kmeans_cluster"].value_counts())


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

For ML Model – 1, I used KMeans clustering.
KMeans is an unsupervised learning algorithm that partitions the data into K clusters by minimizing the sum of squared distances between points and their assigned cluster centers.

To evaluate the model, I used two internal clustering metrics:

Inertia / SSE (Sum of Squared Errors) → measures how compact the clusters are (lower is better).

Silhouette Score → measures how well samples are assigned to their clusters vs other clusters (closer to 1 is better, 0 means overlapping, negative is bad).

After training KMeans with an initial choice of k = 4, I computed the Silhouette Score.
This score provides an internal measure of clustering quality and is visualized against different values of K in the evaluation metric score chart (Elbow + Silhouette plots).
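
To make the Silhouette Score concrete, the sketch below computes it by hand on six made-up points and compares the result with `silhouette_score`. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to any other cluster, and the per-point score is (b - a) / max(a, b). The toy points and labels are illustrative only.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Two well-separated toy clusters of three points each
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(toy_X)  # Euclidean distances between all pairs
scores = []
for i in range(len(toy_X)):
    same = toy_labels == toy_labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = D[i, same].mean()
    b = min(D[i, toy_labels == c].mean()
            for c in np.unique(toy_labels) if c != toy_labels[i])
    scores.append((b - a) / max(a, b))

manual_silhouette = float(np.mean(scores))
print("Manual:", round(manual_silhouette, 4))
print("sklearn:", round(float(silhouette_score(toy_X, toy_labels)), 4))
```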

In [None]:
# Visualizing evaluation Metric Score chart
# ==============================================
# EVALUATION METRIC SCORE CHART (SSE + SILHOUETTE)
# ==============================================



from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sse = []
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)

    sse.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot 1: Elbow (SSE vs k)
plt.figure(figsize=(7,5))
plt.plot(K_range, sse, marker='o')
plt.title("Elbow Method - SSE vs Number of Clusters (k)")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("SSE (Inertia)")
plt.grid(True)
plt.show()

# Plot 2: Silhouette Score vs k
plt.figure(figsize=(7,5))
plt.plot(K_range, sil_scores, marker='o')
plt.title("Silhouette Score vs Number of Clusters (k)")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
# ======================================
# KMEANS HYPERPARAMETER OPTIMIZATION
# (choosing best k using Silhouette Score)
# ======================================

best_k = None
best_sil = -1
silhouette_per_k = {}

for k in range(2, 11):
    kmeans_temp = KMeans(n_clusters=k, random_state=42)
    labels_temp = kmeans_temp.fit_predict(X_scaled)
    sil_temp = silhouette_score(X_scaled, labels_temp)
    silhouette_per_k[k] = sil_temp

    if sil_temp > best_sil:
        best_sil = sil_temp
        best_k = k

print("Silhouette score for each k:")
for k, s in silhouette_per_k.items():
    print(f"k={k}: Silhouette Score = {s:.4f}")

print("\nBest k based on Silhouette Score:", best_k)
print("Best Silhouette Score:", best_sil)

# Fit final optimized KMeans model
kmeans_opt = KMeans(n_clusters=best_k, random_state=42)
opt_labels = kmeans_opt.fit_predict(X_scaled)

restaurants_full["kmeans_cluster_opt"] = opt_labels

print("\nOptimized KMeans cluster counts:")
print(restaurants_full["kmeans_cluster_opt"].value_counts())


##### Which hyperparameter optimization technique have you used and why?

For KMeans, the most important hyperparameter is the number of clusters (k).
I used a grid-search style approach over k (from 2 to 10), evaluating each model using the Silhouette Score.

This is similar to GridSearchCV, but instead of a labeled validation set, I used an internal clustering validity index (Silhouette Score) as the objective function.
I selected the k which maximized the Silhouette Score as the optimal hyperparameter value.

This method is appropriate for unsupervised learning where traditional cross-validation with labels is not possible.


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, there was an improvement after hyperparameter optimization.

Initially, I chose k = 4 based on a visual inspection of the Elbow curve.
After performing Silhouette-based hyperparameter search over k = 2 to 10, I selected the value of k that achieved the highest Silhouette Score.

The improvement was:

- Before tuning: Silhouette Score for the initial k = 4 (`sil_score`)
- After tuning: Silhouette Score for `best_k` chosen by the search (`best_sil`)

The evaluation metric score chart (Silhouette Score vs k) shows that the chosen best_k gives the highest Silhouette Score in the searched range, which means better cluster separation and more meaningful grouping of restaurants.

### ML Model - 2 Implementation (Agglomerative Clustering)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ==============================
# ML Model - 2 : AGGLOMERATIVE CLUSTERING
# ==============================

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

initial_k_agg = 4  # same as KMeans for comparison

agg_model = AgglomerativeClustering(
    n_clusters=initial_k_agg,
    metric="euclidean",   # `metric` replaces the deprecated `affinity` argument (scikit-learn >= 1.2)
    linkage="ward"
)

agg_labels = agg_model.fit_predict(X_scaled)

restaurants_full["agg_cluster"] = agg_labels

sil_agg = silhouette_score(X_scaled, agg_labels)
print(f"Agglomerative Clustering with k={initial_k_agg}")
print(f"Silhouette Score: {sil_agg:.4f}")

print("\nCluster counts:")
print(restaurants_full["agg_cluster"].value_counts())


ML Model – 2 is Agglomerative (Hierarchical) Clustering.
Unlike KMeans, which partitions data by optimizing the distance to centroids, Agglomerative Clustering is a bottom-up hierarchical algorithm.
It starts with each point as its own cluster and iteratively merges the closest pairs of clusters until the desired number of clusters is reached.

I used:

- Euclidean distance as the distance metric
- Ward linkage, which at each step merges the pair of clusters that gives the smallest increase in within-cluster variance

I evaluated the clustering quality using Silhouette Score and compared its performance against KMeans. Having two different clustering models helps validate whether the cluster structure is stable and robust.
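
The bottom-up merge process can be inspected directly with SciPy's hierarchy tools. The sketch below is an illustration on synthetic data (an assumption, standing in for `X_scaled`): it builds the Ward linkage matrix, cuts the tree into three flat clusters, and computes the dendrogram layout.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled restaurant features
toy_X, _ = make_blobs(n_samples=60, n_features=4, centers=3, random_state=42)
toy_scaled = StandardScaler().fit_transform(toy_X)

# Ward linkage on Euclidean distances; each row of Z records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
Z = linkage(toy_scaled, method="ward")
print("Number of merges:", Z.shape[0])

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
toy_tree_labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", np.bincount(toy_tree_labels)[1:])

# no_plot=True returns only the layout; remove it (with matplotlib installed)
# to draw the dendrogram
tree = dendrogram(Z, no_plot=True)
print("First leaves in dendrogram order:", tree["leaves"][:10])
```

Large jumps in the merge distance (third column of `Z`) suggest natural places to cut the tree, which complements the Silhouette-based choice of k.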

In [None]:
# Visualizing evaluation Metric Score chart

# =======================================================
# AGGLOMERATIVE CLUSTERING - EVALUATION METRIC SCORE CHART
# =======================================================

import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

K_range = range(2, 11)
sil_scores_agg = []

for k in K_range:
    agg_temp = AgglomerativeClustering(
        n_clusters=k,
        metric="euclidean",
        linkage="ward"
    )
    labels_temp = agg_temp.fit_predict(X_scaled)
    sil_temp = silhouette_score(X_scaled, labels_temp)
    sil_scores_agg.append(sil_temp)

plt.figure(figsize=(7,5))
plt.plot(K_range, sil_scores_agg, marker='o')
plt.title("Agglomerative Clustering - Silhouette Score vs Number of Clusters (k)")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
# =====================================================
# AGGLOMERATIVE CLUSTERING - HYPERPARAMETER OPTIMIZATION
# (SEARCH OVER NUMBER OF CLUSTERS USING SILHOUETTE SCORE)
# =====================================================

best_k_agg = None
best_sil_agg = -1
silhouette_per_k_agg = {}

for k in range(2, 11):
    agg_temp = AgglomerativeClustering(
        n_clusters=k,
        metric="euclidean",
        linkage="ward"
    )
    labels_temp = agg_temp.fit_predict(X_scaled)
    sil_temp = silhouette_score(X_scaled, labels_temp)
    silhouette_per_k_agg[k] = sil_temp

    if sil_temp > best_sil_agg:
        best_sil_agg = sil_temp
        best_k_agg = k

print("Silhouette score for each k (Agglomerative):")
for k, s in silhouette_per_k_agg.items():
    print(f"k={k}: Silhouette Score = {s:.4f}")

print("\nBest k (Agglomerative) based on Silhouette Score:", best_k_agg)
print("Best Silhouette Score (Agglomerative):", best_sil_agg)

# Fit final optimized Agglomerative model
agg_opt = AgglomerativeClustering(
    n_clusters=best_k_agg,
    metric="euclidean",
    linkage="ward"
)
agg_opt_labels = agg_opt.fit_predict(X_scaled)

restaurants_full["agg_cluster_opt"] = agg_opt_labels

print("\nOptimized Agglomerative Cluster counts:")
print(restaurants_full["agg_cluster_opt"].value_counts())


##### Which hyperparameter optimization technique have you used and why?

For Agglomerative Clustering, the most important hyperparameter is the number of clusters (n_clusters).
I used a grid search–style hyperparameter optimization where I trained the model for different values of k (from 2 to 10) and computed the Silhouette Score for each.

This approach is similar to GridSearchCV, but adapted to unsupervised learning because we do not have labels.
The model with the highest Silhouette Score was selected as the best configuration (best_k_agg).

This is appropriate because Silhouette Score is a standard internal metric to evaluate how well instances are clustered.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, there was improvement after hyperparameter optimization for Agglomerative Clustering.

Initially, I fixed n_clusters = 4 (to match KMeans and based on Elbow intuition).
After running a Silhouette-based search over k = 2 to 10, I selected the value of k that produced the maximum Silhouette Score.

The updated evaluation score chart (Silhouette Score vs k for Agglomerative) shows that the chosen best_k_agg gives the highest Silhouette Score in the searched range, which indicates:

- Better separation between clusters
- More coherent grouping of restaurants

I then refit the model with best_k_agg and updated the cluster labels in the dataset as agg_cluster_opt.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

| Metric | Business Impact |
|---|---|
| SSE / Inertia | Measures compactness → meaningful groups |
| Silhouette Score | Measures separation → reliable segmentation |
| Elbow Method | Indicates the optimal number of clusters |
| Dendrogram | Reveals structure, competition, and hidden segments |

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered three main evaluation criteria because they directly connect to business outcomes and decision-making:

1️⃣ Silhouette Score (Most Important for Business)

Why?
Silhouette Score measures how well restaurants fit within their assigned cluster and how different each cluster is from others.

Business Impact:

- Helps Zomato identify clear, distinct customer/restaurant segments
- Enables more accurate recommendation systems
- Reduces overlap between segments (e.g., premium vs budget)
- Ensures that the segmentation strategy is reliable and actionable

This metric gives the strongest indication of how meaningful and business-ready the clusters are.

2️⃣ SSE / Inertia (Cluster Compactness)

Why?
Measures how similar restaurants inside each cluster are.

Business Impact:

- Ensures each cluster represents a homogeneous group
- Makes market segmentation stable
- Helps identify cohesive categories for targeted marketing

For a fixed k, a lower SSE means tighter, more useful clusters for business decisions. SSE always decreases as k grows, so it should not be compared across different values of k on its own.

3️⃣ Elbow Method (Choosing Optimal K)

Why?
The elbow point gives a practical balance between too many clusters (over-segmentation) and too few clusters (information loss).

Business Impact:

- Prevents unnecessary complexity in business strategy
- Ensures a workable segmentation for marketing campaigns
- Helps Zomato work with a realistic number of restaurant sectors

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I selected the KMeans Clustering Model as the final model.

✔ Reasons for choosing KMeans over Agglomerative:

- **Higher Silhouette Score:** KMeans produced better cluster separation and more meaningful segments.
- **More scalable for large datasets:** Zomato's real data is huge (>100k restaurants). KMeans works efficiently even at scale, whereas Agglomerative Clustering becomes slow and memory-heavy because its cost grows roughly quadratically with the number of restaurants.
- **More stable and repeatable:** With random_state=42, KMeans gives consistent clusters and can assign new restaurants to existing clusters with `predict`. Agglomerative Clustering is deterministic for a fixed dataset, but its greedy merges cannot be undone, and scikit-learn's `AgglomerativeClustering` has no `predict` method, so new data requires refitting.
- **Produces business-friendly clusters:** KMeans created clusters that clearly matched real-world segments like:
  - Premium fine dining
  - Budget-friendly restaurants
  - Highly popular fast food outlets
  - Low-rated niche restaurants
- **Easy to interpret with PCA:** KMeans clusters looked clearer and more separable in PCA scatter plots.
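
The stability argument can be checked empirically. The sketch below is an illustration on synthetic, well-separated blobs (an assumption, not the Zomato data): it refits KMeans with several seeds and compares the partitions with the Adjusted Rand Index (ARI), then compares KMeans against Agglomerative Clustering. An ARI close to 1 means two partitions agree almost perfectly.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Four clearly separated synthetic groups (hypothetical stand-in for X_scaled)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
toy_X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=42)
toy_scaled = StandardScaler().fit_transform(toy_X)

# Refit KMeans with different seeds and compare each run to the first one
runs = [
    KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(toy_scaled)
    for seed in range(5)
]
seed_ari = [adjusted_rand_score(runs[0], labels) for labels in runs[1:]]
print("KMeans seed-to-seed ARI:", np.round(seed_ari, 3))

# Agreement between the two algorithms on the same data
toy_agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(toy_scaled)
cross_ari = adjusted_rand_score(runs[0], toy_agg_labels)
print("KMeans vs Agglomerative ARI:", round(cross_ari, 3))
```

On real data the ARI values will usually be lower than on this toy example; values that vary a lot between seeds would indicate that the segments are not stable.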

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

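Since clustering is unsupervised, traditional label-based feature importance does not apply.
However, the model can still be explained using:

✔ 1. KMeans Cluster Centroids → Feature Influence

We can analyze the cluster centers (`kmeans_opt.cluster_centers_`) to see which features differ most between clusters and therefore play the biggest role.

✔ 2. SHAP for Clustering (surrogate model)

SHAP explains models that map features to an output, so it cannot be applied to cluster labels directly. One common workaround is to train a supervised surrogate classifier that predicts the cluster label from the features and apply SHAP to that surrogate, optionally alongside the PCA projection for visual context, to interpret how features drive cluster formation.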
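
A minimal sketch of the centroid-based explanation follows. It uses made-up data with three features named after the notebook's engineered columns; the `toy_` objects and the "influence" measure (spread of the standardized centroids per feature) are assumptions for illustration, not an established importance score.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two groups differ in rating and cost, not in num_cuisines
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "avg_rating": np.concatenate([rng.normal(4.3, 0.2, 50), rng.normal(3.4, 0.2, 50)]),
    "cost_log": np.concatenate([rng.normal(7.5, 0.2, 50), rng.normal(6.0, 0.2, 50)]),
    "num_cuisines": rng.integers(1, 6, 100).astype(float),
})

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)
toy_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_scaled)

# Centroids mapped back to original units, for business-readable profiles
centroids = pd.DataFrame(
    toy_scaler.inverse_transform(toy_km.cluster_centers_), columns=toy.columns
)
print(centroids.round(2))

# Features whose standardized centroids are far apart separate the clusters most
influence = pd.Series(
    toy_km.cluster_centers_.std(axis=0), index=toy.columns
).sort_values(ascending=False)
print(influence.round(3))
```

Applied to the notebook's fitted model, the same idea would use `scaler.inverse_transform(kmeans_opt.cluster_centers_)` with the column names of the numeric features that were scaled.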

# **Conclusion**

The objective of this project was to analyze Zomato restaurant data and segment restaurants into meaningful groups using unsupervised machine learning techniques. Through extensive data cleaning, feature engineering, exploratory analysis, scaling, dimensionality reduction, and clustering using KMeans and Agglomerative Clustering, the project successfully uncovered clear and actionable restaurant segments.

KMeans emerged as the best-performing model due to its higher Silhouette Score, better cluster separation, and more interpretable groupings. The assigned clusters revealed distinct business personas such as premium fine-dining restaurants, budget-friendly popular outlets, multi-cuisine family restaurants, and low-engagement niche restaurants. These segments were validated through evaluation metrics like SSE, Silhouette Score, Elbow Method, and Hierarchical dendrograms.

The insights derived from clustering carry significant business value. Zomato can use these segments to improve personalized recommendations, optimize marketing campaigns, identify restaurants needing operational or rating improvements, understand cuisine trends, and enhance user engagement across the platform. From a partner perspective, cluster insights help restaurants understand where they stand in the competitive landscape and how they can improve pricing, menu diversity, and customer engagement.

Finally, the model was saved and tested on unseen data for deployability, indicating that the clustering system can be integrated into real-time applications such as recommendation engines, campaign targeting systems, and market research dashboards.

In conclusion, this project demonstrates how unsupervised machine learning can transform raw restaurant data into valuable business intelligence, enabling Zomato to make data-driven decisions that benefit customers, restaurant partners, and the platform’s overall growth.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***