# **Project Name**    -



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member**     - Harshita Goyal

# **Project Summary -**

This project builds an end-to-end machine learning workflow that segments restaurants using data-driven methods. It applies unsupervised learning algorithms, which need no labeled examples, to group similar restaurants by rating, price, popularity, cuisine, and customer preferences. This kind of segmentation can support business decisions such as which customers to target, how to price, and how to position marketing campaigns.

The project began with data collection and analysis. We loaded and explored two datasets containing restaurant metadata and customer reviews. We then ran an initial exploratory data analysis (EDA) to understand how the data was structured, find missing values, duplicates, and outliers, and visualize important relationships with charts and plots. This step revealed the distribution of ratings, the cost patterns, the review counts, and the cuisine preferences.



After that, the dataset was cleaned and prepared for machine learning. We fixed inconsistent data formats, removed duplicate entries, and handled missing values appropriately. We used feature engineering to create useful variables such as cost categories, log-transformed review counts, and value-for-money indicators. We encoded categorical variables with one-hot encoding and label encoding, and we scaled the numerical features so that no single feature dominates distance-based clustering.
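The feature engineering and scaling steps described above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual implementation: the `toy_df` values, the `Log_Reviews` and `Value_For_Money` column names, and the value-for-money formula (rating points per 100 units of cost) are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the merged restaurant table
toy_df = pd.DataFrame({
    "Cost": [300, 800, 1500, 2500],
    "Avg_Rating": [3.8, 4.1, 3.5, 4.4],
    "Total_Reviews": [100, 40, 10, 0],
})

# Log-transform review counts; log1p keeps zero counts defined (log1p(0) = 0)
toy_df["Log_Reviews"] = np.log1p(toy_df["Total_Reviews"])

# Assumed value-for-money indicator: rating points per 100 units of cost
toy_df["Value_For_Money"] = toy_df["Avg_Rating"] / (toy_df["Cost"] / 100)

# Standardize so every feature has mean 0 and unit variance
feature_cols = ["Cost", "Avg_Rating", "Log_Reviews", "Value_For_Money"]
scaled = StandardScaler().fit_transform(toy_df[feature_cols])
```

Without scaling, `Cost` (hundreds to thousands) would dominate Euclidean distances over `Avg_Rating` (roughly 1 to 5).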

Because the dataset also contains free-text customer reviews, we applied natural language processing (NLP) techniques. Text preprocessing included lowercasing; removing punctuation, URLs, numbers, and stopwords; tokenization; lemmatization; part-of-speech tagging; and TF-IDF vectorization. These steps turned unstructured review text into numerical features suitable for analysis.
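A simplified sketch of this text pipeline is shown below. It covers lowercasing, URL, punctuation, and number removal, stopword removal, and TF-IDF vectorization. Lemmatization and part-of-speech tagging are left out to keep the example free of extra downloads. The `clean_review` helper and the sample reviews are hypothetical, and scikit-learn's built-in English stopword list is an assumption, not necessarily the list used in the project.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_review(text: str) -> str:
    """Lowercase the text and strip URLs, punctuation, and digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # remove punctuation and digits
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


# Hypothetical sample reviews
sample_reviews = [
    "The biryani was amazing, 10/10! Details at https://example.com",
    "Service was slow but the food tasted great.",
    "Terrible ambience and overpriced starters!!",
]

cleaned = [clean_review(review) for review in sample_reviews]

# TfidfVectorizer tokenizes and drops English stopwords before weighting terms
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(cleaned)
```

Each row of `tfidf_matrix` is a sparse numeric vector for one review, which can be aggregated per restaurant or passed to a clustering model.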

Three unsupervised machine learning models were trained and evaluated. We picked K-Means clustering as our first model because it is simple and scales well to larger datasets. The second model, DBSCAN, finds clusters based on point density and labels isolated points as noise or outliers. The third model, Agglomerative (Hierarchical) Clustering, was used to explore hierarchical relationships between restaurants. We evaluated each model with the Silhouette Score, which measures how well separated the clusters are and how tightly the points within each cluster are grouped.

We tuned the hyperparameters of each model using the Elbow Method for K-Means, epsilon tuning for DBSCAN, and linkage analysis for Agglomerative Clustering. We picked K-Means as the final model because it performed well, was easy to interpret, and achieved a higher silhouette score than the other two models. By looking at the cluster centroids, we were able to identify which features mattered most in separating restaurants into groups.
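The K-Means tuning loop can be sketched as below: for each candidate number of clusters, record the inertia (used by the Elbow Method) and the silhouette score. This is an illustration on synthetic data from `make_blobs`, not the project's restaurant features. The candidate range of 2 to 6 clusters is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled restaurant features
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_scaled)
    # inertia_ feeds the elbow plot; the silhouette score lies in [-1, 1]
    results[k] = (model.inertia_, silhouette_score(X_scaled, labels))

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])

for k, (inertia, silhouette) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={silhouette:.3f}")
```

Inertia usually keeps falling as `k` grows, so the elbow is read from where the decrease flattens, while the silhouette score gives a direct value to compare across `k`.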

Finally, the best model was saved with joblib so it could be reused later. The saved model was reloaded and tested on new data to confirm that it works as expected. The project concludes by discussing how clustering restaurants can help businesses, for example with customer targeting, planning, and marketing. The project applies machine learning concepts at every step, from data preparation to using the saved model.
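Saving and reloading a fitted model with joblib can be sketched as follows. The random data, the file name `kmeans_restaurants.joblib`, and the use of a temporary directory are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the scaled restaurant features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any writable path works
    model_path = os.path.join(tmp_dir, "kmeans_restaurants.joblib")
    joblib.dump(model, model_path)      # serialize the fitted model
    reloaded = joblib.load(model_path)  # restore it from disk

# The reloaded model should assign every point to the same cluster
same_labels = bool((reloaded.predict(X) == model.predict(X)).all())
```

Any scaler or encoder fitted during preprocessing needs to be saved alongside the model, because new data must be transformed the same way before calling `predict`.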

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The restaurant business generates a large amount of data about prices, customer ratings, reviews, and food preferences, but it is hard to find meaningful patterns and customer groups by inspecting this data manually. This project aims to group restaurants by their characteristics without using predefined labels. It uses unsupervised machine learning methods to identify similar restaurants, uncover hidden patterns, and provide insights that support business decisions such as targeted marketing, pricing optimization, and customer engagement strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following questions are answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
resturant_df=pd.read_csv('Zomato Restaurant names and Metadata.csv')
reviews_df=pd.read_csv('Zomato Restaurant reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
resturant_df.head()


In [None]:
reviews_df.head()

### Dataset Rows & Columns count

In [None]:
resturant_df.shape

In [None]:
reviews_df.shape

### Dataset Information

In [None]:
# Dataset Info
resturant_df.info()
reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
resturant_df.duplicated().sum()


In [None]:
reviews_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
resturant_df.isnull().sum()


In [None]:
reviews_df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12,5))
sns.heatmap(resturant_df.isnull(), cbar=False)
plt.title("Missing Values Heatmap - Restaurant Dataset")
plt.show()

plt.figure(figsize=(12,5))
sns.heatmap(reviews_df.isnull(), cbar=False)
plt.title("Missing Values Heatmap - Reviews Dataset")
plt.show()


### What did you know about your dataset?

The project uses two datasets: one containing restaurant-level metadata and another containing customer review information.
The restaurant dataset includes features such as restaurant name, cost, cuisines, collections, and timings, which are useful for clustering restaurants.
The review dataset contains customer ratings, review text, reviewer details, and timestamps, which help in analyzing customer behavior and satisfaction.
Most columns have very few missing values, except the Collections column, which has a higher number of missing entries.
Overall, the dataset is suitable for exploratory data analysis and unsupervised machine learning after appropriate data cleaning and preprocessing.
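The observation about missing values can be backed by a per-column summary like the one sketched below. The `toy_df` table is made up for illustration. In the notebook the same expression would be applied to `resturant_df` and `reviews_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with the same kind of gaps as the restaurant dataset
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Collections": ["Trending", np.nan, np.nan, np.nan],
    "Cost": ["800", "1,200", "500", np.nan],
})

# Count and percentage of missing values per column, worst column first
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": (toy_df.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```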

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
resturant_df.columns


In [None]:
reviews_df.columns

In [None]:
# Dataset Describe
resturant_df.describe()

In [None]:
reviews_df.describe()

### Variables Description




Name: Name of the restaurant

Links: Zomato URL of the restaurant

Cost: Approximate cost for two people

Collections: Zomato category or collection tags

Cuisines: Types of cuisines served by the restaurant

Timings: Opening and closing time of the restaurant

Reviewer: Name of the reviewer

Review: Textual feedback provided by the customer

Rating: Rating given by the customer

Time: Date and time when the review was posted

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in resturant_df.columns:
    print(f"{col}: {resturant_df[col].nunique()}")

In [None]:
for col in reviews_df.columns:
    print(f"{col}: {reviews_df[col].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Handling Missing Values
resturant_df['Collections'] = resturant_df['Collections'].fillna('Not Specified')
resturant_df['Timings'] = resturant_df['Timings'].fillna('Not Available')
resturant_df['Cuisines'] = resturant_df['Cuisines'].fillna('Unknown')

# Removing Duplicate Rows (.copy() avoids SettingWithCopyWarning on later column assignments)
restaurant_df = resturant_df.drop_duplicates().copy()

# Cleaning Cost column (if cost is stored as text, e.g. "1,200")
if not pd.api.types.is_numeric_dtype(restaurant_df['Cost']):
    restaurant_df['Cost'] = restaurant_df['Cost'].str.replace(',', '', regex=False)
    restaurant_df['Cost'] = pd.to_numeric(restaurant_df['Cost'], errors='coerce')

# Removing Duplicate Reviews (.copy() so the column assignments below modify an independent DataFrame)
review_df = reviews_df.drop_duplicates().copy()

# Handling Missing Values in Reviews
review_df['Review'] = review_df['Review'].fillna('No Review')
review_df['Rating'] = pd.to_numeric(review_df['Rating'], errors='coerce')

# Dropping rows with missing ratings (important for analysis)
review_df = review_df.dropna(subset=['Rating'])

# Aggregating Review Data
review_summary = review_df.groupby('Restaurant').agg({
    'Rating': ['mean', 'count']
}).reset_index()

review_summary.columns = ['Restaurant', 'Avg_Rating', 'Total_Reviews']

# Merging Restaurant and Review Data
final_df = restaurant_df.merge(
    review_summary,
    left_on='Name',
    right_on='Restaurant',
    how='left'
)

# Filling missing aggregated review values
final_df['Avg_Rating'] = final_df['Avg_Rating'].fillna(0)
final_df['Total_Reviews'] = final_df['Total_Reviews'].fillna(0)

# Display cleaned dataset
final_df.head()


### What all manipulations have you done and insights you found?

Missing values were handled by filling them with appropriate labels, and duplicate records were removed to ensure clean data. The cost column was cleaned and converted into numerical form for analysis. Review data was processed by removing invalid entries and aggregating ratings and review counts for each restaurant. Finally, restaurant and review datasets were merged to create a single, analysis-ready dataset.

From the analysis, it was observed that many restaurants are not part of any Zomato collection, cost data is mostly complete, and there is noticeable variation in restaurant ratings and popularity, which makes the data suitable for clustering.
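Because the merge joins on restaurant names, it is worth checking how many restaurants found no matching reviews. The sketch below uses the `indicator=True` option of `DataFrame.merge` on made-up data. The restaurant names and values are hypothetical.

```python
import pandas as pd

# Hypothetical restaurant metadata and aggregated review summary
restaurants = pd.DataFrame({
    "Name": ["Cafe One", "Spice Hub", "Green Bowl"],
    "Cost": [500, 1200, 800],
})
summary = pd.DataFrame({
    "Restaurant": ["Cafe One", "Spice Hub"],
    "Avg_Rating": [4.2, 3.6],
    "Total_Reviews": [50, 20],
})

# indicator=True adds a "_merge" column showing where each row came from
merged = restaurants.merge(
    summary,
    left_on="Name",
    right_on="Restaurant",
    how="left",
    indicator=True,
)

# Restaurants present in the metadata but absent from the reviews
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Rows listed in `unmatched` are the ones whose `Avg_Rating` and `Total_Reviews` end up filled with 0, so they should be read as "no reviews", not as "rated zero".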

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(8,5))
sns.scatterplot(data=final_df, x='Cost', y='Avg_Rating')
plt.title('Cost vs Average Rating of Restaurants')
plt.xlabel('Cost for Two')
plt.ylabel('Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it is effective in understanding the relationship between two numerical variables. It helps visualize how restaurant cost is related to customer ratings and whether higher-priced restaurants tend to receive better ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that higher cost does not always result in higher ratings. Many low and mid-priced restaurants receive good ratings, while some expensive restaurants have average ratings. This indicates that customer satisfaction is influenced by factors beyond just pricing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact. Customers can identify good-quality restaurants without assuming that a higher cost means a better experience. For Zomato, this insight helps promote well-rated budget and mid-range restaurants, increasing customer satisfaction and engagement.
A potential negative insight is that some high-cost restaurants have low ratings, which may affect their growth. However, this can be used constructively by identifying areas of improvement for such restaurants.

#### Chart - 2

In [None]:
# Chart - 2 visualization code(Most Popular Cuisines)

# Split multiple cuisines into individual entries
cuisine_series = final_df['Cuisines'].str.split(', ').explode()

# Get top 10 cuisines
top_cuisines = cuisine_series.value_counts().head(10)

plt.figure(figsize=(8,5))
sns.barplot(x=top_cuisines.values, y=top_cuisines.index)
plt.title('Top 10 Most Popular Cuisines')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it is the most effective way to compare the frequency of categorical variables. It clearly shows which cuisines are most commonly offered by restaurants.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that certain cuisines such as North Indian, Chinese, and Fast Food dominate the restaurant market. This indicates strong customer demand for these cuisines, while other cuisines are comparatively less common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping Zomato promote popular cuisines and guide new restaurants on which cuisines have higher demand.
A possible negative insight is that less popular cuisines may struggle to gain visibility. However, this can be addressed by targeted promotions and recommendations to niche customer segments.

#### Chart - 3

In [None]:
# Chart - 3 visualization code(Distribution of Average Ratings)

plt.figure(figsize=(8,5))
sns.histplot(final_df['Avg_Rating'], bins=10, kde=True)
plt.title('Distribution of Average Restaurant Ratings')
plt.xlabel('Average Rating')
plt.ylabel('Number of Restaurants')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen to understand the distribution of restaurant ratings. It helps identify how ratings are spread across restaurants and whether most restaurants receive low, average, or high ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants have average ratings between 3.0 and 4.5. Very few restaurants have extremely low or extremely high ratings. This indicates that the majority of restaurants provide a satisfactory customer experience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps Zomato understand overall customer satisfaction levels and identify high-performing restaurants for promotion.
A possible negative insight is that restaurants with consistently low ratings may lose customers. However, identifying these restaurants allows targeted improvements and quality control measures to enhance customer experience.

#### Chart - 4

In [None]:
# Chart - 4 visualization code(Total Reviews vs Average Rating)

plt.figure(figsize=(8,5))
sns.scatterplot(data=final_df, x='Total_Reviews', y='Avg_Rating')
plt.title('Total Reviews vs Average Rating')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen to analyze the relationship between the number of reviews and average rating. It helps understand whether popular restaurants also maintain good customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that restaurants with a high number of reviews generally maintain average to high ratings, indicating consistent customer satisfaction. Some restaurants with fewer reviews also have high ratings, suggesting they may be new or less discovered but high quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help Zomato identify reliable and popular restaurants that can be promoted to customers. Restaurants with good ratings but fewer reviews can be highlighted to increase visibility.

A negative insight is that restaurants with many reviews but lower ratings may face customer trust issues, which could impact growth. However, this insight allows early identification and improvement opportunities.

#### Chart - 5

In [None]:
# Chart - 5 visualization code(Distribution of Restaurant Cost)

plt.figure(figsize=(8,5))
sns.histplot(final_df['Cost'], bins=15, kde=True)
plt.title('Distribution of Restaurant Cost')
plt.xlabel('Cost for Two')
plt.ylabel('Number of Restaurants')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen to understand how restaurant costs are distributed across the dataset. It helps identify common price ranges and the presence of low-cost or high-cost restaurants.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants fall in the low to mid price range, while only a small number of restaurants are high-priced. This indicates that the platform is dominated by affordable and mid-range dining options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps Zomato understand customer affordability and focus on promoting restaurants that match popular price ranges.
A possible negative insight is that premium restaurants form a smaller segment and may attract fewer customers. However, this can be addressed by targeted marketing to premium users.

#### Chart - 6

In [None]:
# Chart - 6 visualization code(Average Rating by Cuisine (Top Cuisines))

# Split cuisines and explode
cuisine_rating_df = final_df[['Cuisines', 'Avg_Rating']].copy()
cuisine_rating_df['Cuisines'] = cuisine_rating_df['Cuisines'].str.split(', ')
cuisine_rating_df = cuisine_rating_df.explode('Cuisines')

# Calculate average rating per cuisine
avg_rating_by_cuisine = (
    cuisine_rating_df
    .groupby('Cuisines')['Avg_Rating']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

plt.figure(figsize=(8,5))
sns.barplot(x=avg_rating_by_cuisine.values, y=avg_rating_by_cuisine.index)
plt.title('Top 10 Cuisines by Average Rating')
plt.xlabel('Average Rating')
plt.ylabel('Cuisine')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to compare the average customer ratings across different cuisines. It clearly highlights which cuisines are better rated and preferred by customers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that some cuisines consistently receive higher average ratings compared to others. This indicates that customer satisfaction varies by cuisine type and certain cuisines are more positively received.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help Zomato promote highly rated cuisines and guide restaurants on menu planning based on customer preferences.
A possible negative insight is that cuisines with lower average ratings may receive less customer interest. However, this insight can be used to improve food quality, pricing, or service for those cuisines.

#### Chart - 7

In [None]:
# Chart - 7 visualization code(Average Rating by Cost Category)

# Create cost categories (np.inf as the last edge keeps the bins strictly
# increasing even if the maximum cost is 2000 or less)
final_df['Cost_Category'] = pd.cut(
    final_df['Cost'],
    bins=[0, 500, 1000, 2000, np.inf],
    labels=['Low', 'Mid', 'High', 'Premium']
)

# Calculate average rating per cost category
avg_rating_by_cost = final_df.groupby('Cost_Category')['Avg_Rating'].mean().reset_index()

plt.figure(figsize=(7,5))
sns.barplot(data=avg_rating_by_cost, x='Cost_Category', y='Avg_Rating')
plt.title('Average Rating by Cost Category')
plt.xlabel('Cost Category')
plt.ylabel('Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to compare average customer ratings across different cost categories. It helps understand whether pricing segments influence customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that mid-range and premium restaurants generally receive slightly higher average ratings compared to low-cost restaurants. However, the difference is not very large, indicating that good customer experience can be achieved at all price levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps customers choose restaurants based on both budget and expected quality. For Zomato, it supports better segmentation and personalized recommendations across price ranges.
A possible negative insight is that low-cost restaurants may receive slightly lower ratings on average. However, this can be improved through better service quality and targeted feedback mechanisms.

#### Chart - 8

In [None]:
# Chart - 8 visualization code(Distribution of Total Reviews)

plt.figure(figsize=(8,5))
sns.histplot(final_df['Total_Reviews'], bins=15, kde=True)
plt.title('Distribution of Total Reviews per Restaurant')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Number of Restaurants')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen to understand how customer engagement is distributed across restaurants. It helps identify whether most restaurants receive few reviews or if engagement is evenly spread.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants have a relatively low to moderate number of reviews, while only a few restaurants receive a very high number of reviews. This indicates that customer engagement is concentrated around a limited set of popular restaurants.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps Zomato identify highly popular restaurants for promotion and recognize less-reviewed restaurants that may need better visibility.
A negative insight is that restaurants with very few reviews may struggle to gain customer trust. However, this can be addressed through onboarding support, promotions, and recommendation boosts.

#### Chart - 9

In [None]:
# Chart - 9 visualization code(Cuisine-wise Restaurant Count by Cost Category)

# Prepare cuisine data
cuisine_cost_df = final_df[['Cuisines', 'Cost_Category']].copy()
cuisine_cost_df['Cuisines'] = cuisine_cost_df['Cuisines'].str.split(', ')
cuisine_cost_df = cuisine_cost_df.explode('Cuisines')

# Select top 5 cuisines for clarity
top_cuisines = cuisine_cost_df['Cuisines'].value_counts().head(5).index
filtered_df = cuisine_cost_df[cuisine_cost_df['Cuisines'].isin(top_cuisines)]

plt.figure(figsize=(9,5))
sns.countplot(data=filtered_df, x='Cuisines', hue='Cost_Category')
plt.title('Cuisine-wise Restaurant Count by Cost Category')
plt.xlabel('Cuisine')
plt.ylabel('Number of Restaurants')
plt.legend(title='Cost Category')
plt.show()



##### 1. Why did you pick the specific chart?

A count plot was chosen to compare how different cuisines are distributed across cost categories. It helps understand pricing patterns within popular cuisines.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that popular cuisines such as North Indian and Chinese are mostly concentrated in the low and mid cost categories, while fewer restaurants fall under the premium segment. This indicates that these cuisines are more accessible and cater to a wider audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps Zomato understand market positioning of cuisines and assists restaurant owners in pricing decisions.
A possible negative insight is that premium segments for popular cuisines are limited, which may restrict options for high-end customers. However, this also highlights opportunities for expansion.

#### Chart - 10

In [None]:
# Chart - 10 visualization code(Cost vs Average Rating (Box Plot))

plt.figure(figsize=(8,5))
sns.boxplot(data=final_df, x='Cost_Category', y='Avg_Rating')
plt.title('Average Rating Distribution Across Cost Categories')
plt.xlabel('Cost Category')
plt.ylabel('Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot was chosen to compare the distribution of average ratings across different cost categories. It helps identify variation, median ratings, and the presence of outliers in each price segment.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that mid and premium cost categories generally have slightly higher median ratings, but there is significant overlap across all cost categories. This indicates that good customer ratings are not limited to expensive restaurants.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps customers make informed choices without assuming that higher cost always means better quality. For Zomato, it supports fair recommendations across all price ranges.
A negative insight is that some premium restaurants show wide rating variation, which may impact customer trust. However, this can be addressed through quality monitoring and feedback-driven improvements.

#### Chart - 11

In [None]:
# Chart - 11 visualization code(Top Restaurants by Average Rating (with sufficient reviews))

# Filter restaurants with at least 20 reviews for reliability
top_restaurants = final_df[final_df['Total_Reviews'] >= 20]

# Sort by average rating and take top 10
top_restaurants = top_restaurants.sort_values(
    by='Avg_Rating', ascending=False
).head(10)

plt.figure(figsize=(9,5))
sns.barplot(
    x=top_restaurants['Avg_Rating'],
    y=top_restaurants['Name']
)
plt.title('Top 10 Restaurants by Average Rating (Min 20 Reviews)')
plt.xlabel('Average Rating')
plt.ylabel('Restaurant Name')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to clearly compare average ratings of top-performing restaurants. A minimum review threshold was applied to ensure ratings are reliable and not biased by very few reviews.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights restaurants that consistently receive high ratings along with sufficient customer engagement. These restaurants represent high quality and strong customer trust.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps Zomato identify top-performing restaurants for recommendations and promotions. It also helps customers make confident dining choices based on trusted ratings.

A possible negative insight is that restaurants with fewer reviews may not appear in this list despite good quality. However, this can be addressed through visibility boosts for new restaurants.

#### Chart - 12

In [None]:
# Chart - 12 visualization code(Low Rated Restaurants (Average Rating < 3))


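# Filter low-rated restaurants; require at least one review, because restaurants
# without reviews were assigned Avg_Rating = 0 during data wrangling
low_rated_restaurants = final_df[
    (final_df['Avg_Rating'] < 3) & (final_df['Total_Reviews'] > 0)
]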

# Take top 10 lowest-rated restaurants
low_rated_restaurants = low_rated_restaurants.sort_values(
    by='Avg_Rating'
).head(10)

plt.figure(figsize=(9,5))
sns.barplot(
    x=low_rated_restaurants['Avg_Rating'],
    y=low_rated_restaurants['Name']
)
plt.title('Top 10 Lowest Rated Restaurants')
plt.xlabel('Average Rating')
plt.ylabel('Restaurant Name')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to clearly identify restaurants with consistently low ratings. This helps in detecting problem areas where customer satisfaction is poor.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights restaurants that receive consistently low customer ratings, indicating dissatisfaction related to food quality, service, or pricing. These restaurants require immediate attention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps Zomato identify underperforming restaurants and work with them to improve service quality, menu offerings, or customer experience.
A negative insight is that consistently low-rated restaurants may damage platform reputation if not addressed, but early identification allows corrective action.

#### Chart - 13

In [None]:
# Chart - 13 visualization code (Cost vs Total Reviews)

plt.figure(figsize=(8,5))
sns.scatterplot(data=final_df, x='Cost', y='Total_Reviews')
plt.title('Cost vs Total Number of Reviews')
plt.xlabel('Cost for Two')
plt.ylabel('Total Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen to understand the relationship between restaurant pricing and customer engagement. It helps analyze whether higher or lower priced restaurants attract more reviews.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants with a high number of reviews fall in the low to mid cost range. Expensive restaurants generally receive fewer reviews, indicating lower customer volume compared to affordable options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps Zomato understand that affordable restaurants drive higher customer engagement and should be prioritized in recommendations.
A negative insight is that premium restaurants attract fewer customers, which may limit their visibility. However, this opens opportunities for targeted marketing toward premium users.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code (Correlation Heatmap)

# Selecting numerical columns
corr_data = final_df[['Cost', 'Avg_Rating', 'Total_Reviews']]

plt.figure(figsize=(7,5))
sns.heatmap(
    corr_data.corr(),
    annot=True,
    cmap='coolwarm',
    fmt='.2f'
)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen to understand the strength and direction of relationships between numerical variables such as cost, average rating, and total reviews. It provides a summarized view of how features are related to each other.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows a weak correlation between cost and average rating, indicating that higher pricing does not strongly influence customer satisfaction. The relationship between total reviews and average rating is also weak to moderate, suggesting that popularity does not always result in higher ratings.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

sns.pairplot(
    final_df[['Cost', 'Avg_Rating', 'Total_Reviews']],
    diag_kind='kde'
)
plt.suptitle('Pair Plot of Cost, Average Rating, and Total Reviews', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot was chosen to visualize pairwise relationships between multiple numerical variables simultaneously. It helps in identifying trends, correlations, and distributions in a single consolidated view.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows no strong linear relationship between cost, average rating, and total reviews. It also highlights the distribution of each variable and confirms the weak correlations observed earlier in the correlation heatmap.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothesis 1:
There is no significant relationship between restaurant cost and average customer rating.

Hypothesis 2:
Average ratings do not differ significantly between low-cost and premium restaurants.

Hypothesis 3:
There is no significant relationship between the total number of reviews and the average rating of a restaurant.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant relationship between restaurant cost and average customer rating.

Alternate Hypothesis (H₁):
There is a significant relationship between restaurant cost and average customer rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Pearson Correlation Test: Cost vs Average Rating
# pearsonr does not accept NaN values, so incomplete rows are dropped first
pair = final_df[['Cost', 'Avg_Rating']].dropna()
correlation, p_value = pearsonr(pair['Cost'], pair['Avg_Rating'])

print("Correlation Coefficient:", correlation)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test was used to obtain the p-value.

##### Why did you choose the specific statistical test?

Pearson Correlation Test was chosen because both variables, cost and average rating, are numerical and continuous. This test helps measure the strength and direction of the linear relationship between two continuous variables and determines whether the relationship is statistically significant.
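
As a rough cross-check of the test above, the sketch below compares Pearson's coefficient with Spearman's rank correlation, which only assumes a monotonic relationship and is less sensitive to skewed prices. The arrays `cost` and `rating` are synthetic stand-ins for `final_df['Cost']` and `final_df['Avg_Rating']`, and the 0.05 significance level is an assumption, not a value taken from the notebook.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic stand-ins for final_df['Cost'] and final_df['Avg_Rating'] (assumption)
rng = np.random.default_rng(42)
cost = rng.lognormal(mean=6.0, sigma=0.6, size=200)  # right-skewed, like prices
rating = np.clip(
    3.5 + 0.3 * np.log(cost / cost.mean()) + rng.normal(0, 0.5, 200), 1, 5
)

r, p_r = pearsonr(cost, rating)        # strength of the linear association
rho, p_rho = spearmanr(cost, rating)   # strength of the monotonic (rank) association

alpha = 0.05  # assumed significance level
print(f"Pearson  r   = {r:.3f}, p = {p_r:.4f}")
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.4f}")
print("Reject H0" if p_r < alpha else "Fail to reject H0")
```

If the two coefficients disagree noticeably on the real data, the relationship is probably not linear and the Pearson p-value should be read with care.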

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in average ratings between low-cost and high-cost restaurants.

Alternate Hypothesis (H₁):
There is a significant difference in average ratings between low-cost and high-cost restaurants.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Selecting ratings for low-cost and premium restaurants
low_cost_ratings = final_df[final_df['Cost_Category'] == 'Low']['Avg_Rating']
high_cost_ratings = final_df[final_df['Cost_Category'] == 'Premium']['Avg_Rating']

# Independent T-Test
t_statistic, p_value = ttest_ind(
    low_cost_ratings,
    high_cost_ratings,
    nan_policy='omit'
)

print("T-statistic:", t_statistic)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample T-Test was used to obtain the p-value.

##### Why did you choose the specific statistical test?

The Independent Two-Sample T-Test was chosen because the goal was to compare the mean average ratings of two independent groups (low-cost and high-cost restaurants). This test is appropriate when comparing the means of two unrelated samples.
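
The default `ttest_ind` call assumes that both groups have equal variances. When the groups differ in size and spread, which is plausible for low-cost versus premium restaurants, Welch's variant (`equal_var=False`) is a common alternative. The sketch below uses synthetic ratings; the group sizes, means, and spreads are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic ratings for two cost groups (assumed sizes, means and spreads)
rng = np.random.default_rng(0)
low = rng.normal(3.4, 0.6, size=120)      # stand-in for the 'Low' group
premium = rng.normal(3.9, 0.3, size=25)   # stand-in for the 'Premium' group

# Student's t-test: assumes equal variances in both groups
t_student, p_student = ttest_ind(low, premium)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(low, premium, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4g}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4g}")
```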

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant relationship between the total number of reviews and the average rating of a restaurant.

Alternate Hypothesis (H₁):
There is a significant relationship between the total number of reviews and the average rating of a restaurant.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Pearson Correlation Test: Total Reviews vs Average Rating
# pearsonr does not accept NaN values, so incomplete rows are dropped first
pair = final_df[['Total_Reviews', 'Avg_Rating']].dropna()
correlation, p_value = pearsonr(
    pair['Total_Reviews'],
    pair['Avg_Rating']
)

print("Correlation Coefficient:", correlation)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test was used to obtain the p-value.

##### Why did you choose the specific statistical test?

Pearson Correlation Test was chosen because both the total number of reviews and average rating are numerical variables. This test helps determine the strength and significance of the linear relationship between two continuous variables.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
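# Handling Missing Values & Missing Value Imputation
# Print the counts explicitly: a bare expression is only displayed when it is the last line of a cell
print(final_df.isnull().sum())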

# Imputing missing values (if any remain)
final_df['Avg_Rating'] = final_df['Avg_Rating'].fillna(final_df['Avg_Rating'].median())
final_df['Total_Reviews'] = final_df['Total_Reviews'].fillna(0)
final_df['Cost'] = final_df['Cost'].fillna(final_df['Cost'].median())


#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing values in numerical columns were handled using median imputation for variables such as cost and average rating, as the median is robust to outliers. Missing values in the total number of reviews were filled with zero to represent restaurants with no customer reviews. These techniques ensure that no important data is lost while making the dataset suitable for machine learning models.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# IQR Method

def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Removing outliers from Cost and Total_Reviews
final_df = remove_outliers_iqr(final_df, 'Cost')
final_df = remove_outliers_iqr(final_df, 'Total_Reviews')


##### What all outlier treatment techniques have you used and why did you use those techniques?

Outliers were handled using the Interquartile Range (IQR) method for numerical variables such as cost and total reviews. Rows with values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR were removed. The method was chosen because it relies on quartiles rather than the mean and standard deviation, is easy to interpret, and suits skewed real-world data like restaurant pricing and review counts. Because the rows are dropped, very expensive or very heavily reviewed restaurants are excluded from the later clustering.
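
The following sketch works through the IQR rule on a small made-up `Cost` column and contrasts dropping rows with capping (clipping) them at the bounds, which keeps every restaurant in the dataset. The helper `iqr_bounds` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up costs with one extreme value
toy = pd.DataFrame({'Cost': [200, 300, 400, 500, 600, 700, 800, 5000]})
lower, upper = iqr_bounds(toy['Cost'])

# Option 1: drop rows outside the fences (what the notebook does)
removed = toy[(toy['Cost'] >= lower) & (toy['Cost'] <= upper)]

# Option 2: cap values at the fences so that no rows are lost
capped = toy.assign(Cost=toy['Cost'].clip(lower, upper))

print("Fences:", lower, upper)
print("Rows after removal:", len(removed))
print("Rows after capping:", len(capped), "max cost:", capped['Cost'].max())
```

Capping is worth considering when the extreme restaurants are themselves a segment of interest, since removal hides them from the clustering step.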

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

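# 1. Ordinal encoding for Cost_Category
# LabelEncoder numbers labels alphabetically (High, Low, Mid, Premium), which does
# not follow the pricing order, so the order is set explicitly here.
# Category names are taken from the write-up; labels not in the list are encoded as -1.
cost_order = ['Low', 'Mid', 'High', 'Premium']
final_df['Cost_Category_Encoded'] = pd.Categorical(
    final_df['Cost_Category'], categories=cost_order, ordered=True
).codes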

# 2. One-Hot Encoding for Cuisines (Top cuisines only to avoid high dimensionality)

# Get top 5 cuisines
top_cuisines = (
    final_df['Cuisines']
    .str.split(', ')
    .explode()
    .value_counts()
    .head(5)
    .index
)

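# Create binary columns for top cuisines
# Match against the split cuisine list so that one cuisine name cannot match
# inside a longer one; missing values are treated as "no cuisine listed"
cuisine_lists = final_df['Cuisines'].fillna('').str.split(', ')
for cuisine in top_cuisines:
    final_df[f'Cuisine_{cuisine}'] = cuisine_lists.apply(
        lambda items: int(cuisine in items)
    )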

# Display encoded columns
final_df.head()


#### What all categorical encoding techniques have you used & why did you use those techniques?

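Cost_Category was converted to integer codes because it is an ordinal variable with the levels Low, Mid, High, and Premium. The codes must follow that order to be meaningful in distance calculations. scikit-learn's LabelEncoder numbers labels alphabetically (High, Low, Mid, Premium), so an explicit ordered mapping is required to preserve the natural order of pricing levels.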

One-Hot Encoding was applied to the most frequent cuisines to convert categorical cuisine information into binary numerical features. Only top cuisines were encoded to avoid high dimensionality. These encoding techniques make the data suitable for distance-based clustering algorithms like K-Means.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contractions

!pip install contractions
import contractions

review_df['Clean_Review'] = review_df['Review'].apply(
    lambda x: contractions.fix(str(x))
)

#### 2. Lower Casing

In [None]:
# Lower Casing

review_df['Clean_Review'] = review_df['Clean_Review'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

import string

review_df['Clean_Review'] = review_df['Clean_Review'].apply(
    lambda x: x.translate(str.maketrans('', '', string.punctuation))
)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words that contain digits

import re

review_df['Clean_Review'] = review_df['Clean_Review'].apply(
    lambda x: re.sub(r'http\S+|www\S+|https\S+', '', x)
)

review_df['Clean_Review'] = review_df['Clean_Review'].apply(
    lambda x: re.sub(r'\w*\d\w*', '', x)
)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

review_df['Clean_Review'] = review_df['Clean_Review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)


In [None]:
# Remove White spaces

review_df['Clean_Review'] = review_df['Clean_Review'].apply(
    lambda x: re.sub(r'\s+', ' ', x).strip()
)


#### 6. Rephrase Text

In [None]:
# Rephrase Text
# No rephrasing is applied; the cleaned review is carried forward unchanged

review_df['Rephrased_Review'] = review_df['Clean_Review']


#### 7. Tokenization

In [None]:
# Tokenization

from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

review_df['Tokens'] = review_df['Rephrased_Review'].apply(
    lambda x: word_tokenize(x)
)



#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

review_df['Normalized_Tokens'] = review_df['Tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens]
)


##### Which text normalization technique have you used and why?

Lemmatization was used for text normalization because it converts words to their meaningful base form while preserving context. Unlike stemming, lemmatization produces valid words, making it more suitable for sentiment analysis and text clustering.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

from nltk import pos_tag
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

review_df['POS_Tags'] = review_df['Tokens'].apply(
    lambda tokens: pos_tag(tokens)
)


#### 10. Text Vectorization

In [None]:
# Vectorising Text
# TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Join the lemmatized tokens back into sentences so that the normalization step feeds TF-IDF
review_df['Final_Text'] = review_df['Normalized_Tokens'].apply(
    lambda tokens: ' '.join(tokens)
)

tfidf = TfidfVectorizer(max_features=500)

tfidf_matrix = tfidf.fit_transform(review_df['Final_Text'])

tfidf_matrix.shape


##### Which text vectorization technique have you used and why?

TF-IDF (Term Frequency–Inverse Document Frequency) vectorization was used to convert textual reviews into numerical form. TF-IDF assigns higher importance to meaningful and less frequent words while reducing the impact of commonly occurring words. This makes it suitable for text clustering and sentiment analysis tasks.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Log transform Total_Reviews to reduce skewness
final_df['Log_Total_Reviews'] = np.log1p(final_df['Total_Reviews'])

# Create Rating per Cost feature (value for money indicator)
final_df['Rating_per_Cost'] = final_df['Avg_Rating'] / (final_df['Cost'] + 1)

# Drop original Total_Reviews to reduce correlation
final_df = final_df.drop(columns=['Total_Reviews'])

final_df[['Log_Total_Reviews', 'Rating_per_Cost']].head()


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

selected_features = [
    'Cost',
    'Avg_Rating',
    'Log_Total_Reviews',
    'Rating_per_Cost',
    'Cost_Category_Encoded'
]

# Include cuisine one-hot encoded features
cuisine_features = [col for col in final_df.columns if col.startswith('Cuisine_')]
selected_features.extend(cuisine_features)

X = final_df[selected_features]
X.head()


##### What all feature selection methods have you used and why?

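Feature selection was performed using domain knowledge and correlation analysis. The raw Total_Reviews column was replaced by its log-transformed version, and features that capture pricing, customer satisfaction, popularity, and cuisine preference were retained. Cost, Cost_Category_Encoded, and Rating_per_Cost are all derived from cost, so pricing carries more weight than the other dimensions in distance calculations. Keeping the feature set small helps reduce noise and improve clustering performance.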

##### Which all features you found important and why?

Cost, average rating, and log-transformed review count were found to be important as they represent pricing, customer satisfaction, and popularity. The value-for-money feature captures combined customer perception, while encoded cuisine features help distinguish restaurants based on food preference patterns.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation was required because features were on different scales. Standardization was applied using StandardScaler to ensure that all features contribute equally to distance-based clustering algorithms like K-Means. This prevents features with larger values from dominating the clustering process.

In [None]:
# Transform Your data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

X_transformed.shape


### 6. Data Scaling

In [None]:
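# Scaling your data
# Same standardization as in the Data Transformation step; X_scaled is the array used by the models below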


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled.shape


##### Which method have you used to scale your data and why?

StandardScaler was used to scale the data because it standardizes features to have zero mean and unit variance. This is important for distance-based algorithms like K-Means, as it ensures that features such as cost and ratings contribute equally to clustering.
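
A quick way to confirm this behaviour is to check the column means and standard deviations after scaling. The sketch below does so on a small made-up array with a cost-like and a rating-like column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [cost, rating] on very different scales
toy = np.array([[200.0, 3.1], [500.0, 3.8], [800.0, 4.2], [1500.0, 4.5]])

scaled = StandardScaler().fit_transform(toy)

# After standardization each column has mean 0 and standard deviation 1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print("Column means:", col_means.round(6))
print("Column stds: ", col_stds.round(6))
```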

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction is useful because the dataset contains multiple numerical and encoded features, which can increase computational complexity and make visualization difficult. Reducing dimensions helps retain important information while simplifying the feature space.

In [None]:
# Dimensionality Reduction (If needed)
# PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_pca.shape


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Principal Component Analysis (PCA) was used for dimensionality reduction because it transforms the data into a lower-dimensional space while preserving maximum variance. PCA also helps in visualizing clusters effectively.
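
How much variance two components actually preserve can be read from `explained_variance_ratio_`. The sketch below shows the check on synthetic data with six features, two of which are deliberately correlated; the data and its shape are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 100 rows, 6 features, two of them correlated
rng = np.random.default_rng(1)
toy = rng.normal(size=(100, 6))
toy[:, 1] = 0.9 * toy[:, 0] + rng.normal(scale=0.2, size=100)

toy_scaled = StandardScaler().fit_transform(toy)
pca_toy = PCA(n_components=2).fit(toy_scaled)

# Share of total variance captured by each retained component
ratios = pca_toy.explained_variance_ratio_
print("Explained variance ratio:", ratios.round(3))
print("Total retained:", round(float(ratios.sum()), 3))
```

A low retained total means a 2D scatter of the clusters is only a partial view of the feature space.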

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.


from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(
    X_scaled, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


##### What data splitting ratio have you used and why?

An 80:20 data splitting ratio was used, where 80% of the data is used for training and 20% for testing. This ratio provides sufficient data for model learning while reserving a portion for evaluation. Although clustering is unsupervised, splitting helps in validating model stability.
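
One possible way to use the split for a stability check is sketched below: fit K-Means on the training part, fit a second model on the held-out part, and compare the two labelings of the held-out rows with the adjusted Rand index (values near 1 mean the groupings agree). The blob data from `make_blobs` is a synthetic stand-in for `X_scaled`, and this procedure is an illustration rather than something the notebook runs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_scaled (assumption)
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

# Model fitted on the training part, then applied to held-out rows
km_train = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_tr)
labels_from_train = km_train.predict(X_te)

# Independent model fitted on the held-out rows only
km_test = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_te)
labels_from_test = km_test.labels_

# ARI ignores how clusters are numbered and compares only the groupings
ari = adjusted_rand_score(labels_from_train, labels_from_test)
print("Adjusted Rand index on held-out rows:", round(ari, 3))
```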

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Imbalanced data typically refers to unequal class distribution in supervised learning problems. Since this project uses unsupervised learning (K-Means clustering) and does not have predefined target labels, class imbalance is not directly applicable. Therefore, the dataset is not considered imbalanced in the traditional sense.

In [None]:
# Handling Imbalanced Dataset (If needed)


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No imbalance handling technique was applied because the problem is unsupervised and does not involve class labels. Clustering algorithms naturally group data based on feature similarity rather than class distribution.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation (K-Means Clustering)

from sklearn.cluster import KMeans

# Initialize K-Means
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(X_scaled)

# Predict clusters
cluster_labels = kmeans.predict(X_scaled)

# Add clusters to dataframe
final_df['Cluster'] = cluster_labels

final_df[['Name', 'Cluster']].head()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Silhouette Score

from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

sil_score = silhouette_score(X_scaled, cluster_labels)

# Plotting Silhouette Score
plt.bar(['K-Means'], [sil_score])
plt.title('Silhouette Score for K-Means Clustering')
plt.ylabel('Silhouette Score')
plt.show()

sil_score


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 hyperparameter tuning: Elbow Method over the number of clusters (K)

wcss = []

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)

plt.plot(range(2, 8), wcss, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Hyperparameter Tuning')
plt.show()


##### Which hyperparameter optimization technique have you used and why?

The Elbow Method was used for hyperparameter tuning to determine the optimal number of clusters. It helps identify the value of K where adding more clusters does not significantly reduce the within-cluster sum of squares.
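
The elbow is judged by eye, so it can help to compute the Silhouette Score for the same range of K and compare the two views. The sketch below does this on synthetic blobs standing in for `X_scaled`; the data and the `n_init=10` setting are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled (assumption)
X_toy, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=7)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# K with the highest Silhouette Score among the candidates
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K = {k}: silhouette = {s:.3f}")
print("Best K by silhouette:", best_k)
```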

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The Elbow Method was used to check the choice of K. The initial model already used K = 3, and the notebook does not refit K-Means or recompute the Silhouette Score after tuning, so an improvement can only be claimed once the Silhouette Scores for the candidate K values are compared directly.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation (DBSCAN)

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.8, min_samples=5)

dbscan_labels = dbscan.fit_predict(X_scaled)

final_df['DBSCAN_Cluster'] = dbscan_labels

final_df[['Name', 'DBSCAN_Cluster']].head()


In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Remove noise points (-1)
mask = dbscan_labels != -1

if len(set(dbscan_labels[mask])) > 1:
    dbscan_sil_score = silhouette_score(X_scaled[mask], dbscan_labels[mask])
else:
    dbscan_sil_score = -1

# Plot
plt.bar(['DBSCAN'], [dbscan_sil_score])
plt.title('Silhouette Score for DBSCAN')
plt.ylabel('Silhouette Score')
plt.show()

dbscan_sil_score


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 hyperparameter tuning for DBSCAN (eps exploration)

eps_values = [0.5, 0.7, 0.9, 1.1]
scores = []

for eps in eps_values:
    db = DBSCAN(eps=eps, min_samples=5)
    labels = db.fit_predict(X_scaled)
    mask = labels != -1

    if len(set(labels[mask])) > 1:
        score = silhouette_score(X_scaled[mask], labels[mask])
    else:
        score = -1

    scores.append(score)

plt.plot(eps_values, scores, marker='o')
plt.xlabel('EPS value')
plt.ylabel('Silhouette Score')
plt.title('DBSCAN Hyperparameter Tuning')
plt.show()


##### Which hyperparameter optimization technique have you used and why?

Manual hyperparameter tuning was performed by experimenting with different epsilon (eps) values. DBSCAN is sensitive to eps, and tuning helps identify the value that produces well-separated clusters while minimizing noise.
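
A common heuristic for narrowing the eps range is the k-distance curve: sort every point's distance to its k-th nearest neighbour and look for the bend. The sketch below computes that curve on synthetic blobs standing in for `X_scaled` and prints a few percentiles instead of plotting; the data is an assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for X_scaled (assumption)
X_toy, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=3)

min_samples = 5
# Querying the fitted data returns each point as its own first neighbour, which
# matches DBSCAN's convention that min_samples counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)

# Sorted distance to the farthest of the min_samples neighbours
k_dist = np.sort(distances[:, -1])

for q in (50, 75, 90, 95):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` against its index and choosing eps near the point where the curve turns sharply upward usually gives a sensible starting value for the grid.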

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

DBSCAN showed improvement in identifying noise and outliers compared to K-Means. While the Silhouette Score may be lower in some cases, DBSCAN provides better real-world segmentation by excluding anomalous restaurants that do not belong to any cluster.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

The Silhouette Score indicates how well restaurants are grouped within clusters and how distinct each cluster is. A higher score suggests clear segmentation, which helps businesses design targeted marketing strategies. DBSCAN’s ability to identify noise helps businesses detect unusual or underperforming restaurants, enabling focused improvement or risk mitigation strategies.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation (Agglomerative Clustering)

from sklearn.cluster import AgglomerativeClustering

agglo = AgglomerativeClustering(n_clusters=3, linkage='ward')

agglo_labels = agglo.fit_predict(X_scaled)

final_df['Agglomerative_Cluster'] = agglo_labels

final_df[['Name', 'Agglomerative_Cluster']].head()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

agglo_sil_score = silhouette_score(X_scaled, agglo_labels)

plt.bar(['Agglomerative'], [agglo_sil_score])
plt.title('Silhouette Score for Agglomerative Clustering')
plt.ylabel('Silhouette Score')
plt.show()

agglo_sil_score


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 hyperparameter tuning for Agglomerative Clustering (linkage comparison)

linkages = ['ward', 'complete', 'average']
scores = []

for link in linkages:
    model = AgglomerativeClustering(n_clusters=3, linkage=link)
    labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    scores.append(score)

plt.bar(linkages, scores)
plt.xlabel('Linkage Method')
plt.ylabel('Silhouette Score')
plt.title('Agglomerative Clustering Hyperparameter Tuning')
plt.show()


##### Which hyperparameter optimization technique have you used and why?

Manual hyperparameter tuning was performed by experimenting with different linkage methods. Linkage selection impacts how clusters are formed, and testing multiple options helps identify the one that produces better cluster separation.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Agglomerative clustering provided stable and interpretable clusters. While the silhouette score was comparable to K-Means, the hierarchical nature of the model helped better understand relationships between restaurant groups.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Silhouette Score was considered as the primary evaluation metric because it measures how well restaurants are grouped within clusters and how clearly different clusters are separated. A higher Silhouette Score indicates meaningful segmentation, which helps businesses design targeted marketing strategies, pricing plans, and customer engagement approaches.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

K-Means clustering was chosen as the final model because it provided well-defined clusters, a comparatively higher Silhouette Score, and stable performance. It is computationally efficient, easy to interpret, and suitable for large-scale business applications such as restaurant segmentation and recommendation systems.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The K-Means model was explained using cluster centroid analysis. Cluster centroids represent the average value of each feature within a cluster, indicating the relative importance of features in defining each group. Features such as cost, average rating, review popularity, and cuisine indicators played a major role in differentiating restaurant clusters, helping businesses understand customer preferences and market segments.
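
Centroids of a model fitted on standardized data are expressed in standard-deviation units. Passing them through the scaler's `inverse_transform` returns them to the original units, which makes the cluster profiles easier to read. The sketch below shows this on a made-up table with two obvious groups; the values are invented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up restaurants: three cheap and modestly rated, three expensive and highly rated
toy = pd.DataFrame({
    'Cost': [300, 350, 400, 1500, 1600, 1700],
    'Avg_Rating': [3.2, 3.4, 3.3, 4.4, 4.5, 4.3],
})

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_scaled)

# Convert centroids from standardized units back to the original units
centroids = pd.DataFrame(
    toy_scaler.inverse_transform(km.cluster_centers_),
    columns=toy.columns,
)
print(centroids.round(2))
```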

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File


import joblib

joblib.dump(kmeans, 'kmeans_restaurant_clustering_model.joblib')


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

loaded_model = joblib.load('kmeans_restaurant_clustering_model.joblib')

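# Sanity check: predict clusters for the first five scaled rows
# (these rows were part of the training data, so they are not truly unseen)
sample_predictions = loaded_model.predict(X_scaled[:5])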

sample_predictions
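
The saved K-Means model expects inputs that were standardized with the same fitted scaler, so genuinely new restaurants cannot be scored from the model file alone. One option, sketched below under the assumption that bundling is acceptable for deployment, is to persist the scaler and the model together as a scikit-learn `Pipeline`. The data, file name, and feature layout are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up raw features: [cost, rating] (stand-in for the unscaled feature matrix X)
rng = np.random.default_rng(42)
X_raw = np.column_stack([rng.uniform(200, 2000, 60), rng.uniform(2.5, 5.0, 60)])

# Scaler and clusterer are fitted and saved as one object
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
pipe.fit(X_raw)

path = os.path.join(tempfile.mkdtemp(), 'toy_pipeline.joblib')  # hypothetical file name
joblib.dump(pipe, path)

# Reload and score new rows given in raw (unscaled) units
loaded = joblib.load(path)
new_rows = np.array([[450.0, 4.1], [1800.0, 3.2]])
preds = loaded.predict(new_rows)
print("Predicted clusters for new rows:", preds)
```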


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, a complete end-to-end machine learning pipeline was developed to analyze and cluster restaurants based on pricing, customer ratings, popularity, and cuisine preferences. Multiple clustering models including K-Means, DBSCAN, and Agglomerative Clustering were implemented and evaluated using Silhouette Score. K-Means was selected as the final model due to its stability, interpretability, and better clustering performance. The results provide meaningful business insights that can support targeted marketing, customer segmentation, and strategic decision-making. The model is deployment-ready and can be extended further with real-time data and recommendation systems.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***