<a href="https://colab.research.google.com/github/Pradxpk-88/zomato-data-analysis/blob/main/Unsupervised_ML_Unsupervised_Zomato_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Unsupervised ML - Unsupervised Zomato Clustering**



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# Project Summary

The rapid growth of online food discovery and delivery platforms has transformed how customers choose restaurants and how businesses compete for visibility. Platforms like Zomato rely heavily on ratings and reviews to influence consumer decisions; however, these signals are often subjective, inconsistent, and emotionally driven. This project aims to analyze the Zomato restaurant ecosystem using data analytics and exploratory techniques to uncover meaningful patterns behind customer behavior, restaurant performance, and perception.

The project utilizes two primary datasets: restaurant metadata and customer reviews. The restaurant dataset includes attributes such as location, cuisine type, average cost for two, rating, and delivery availability, while the reviews dataset captures customer opinions expressed through text and associated ratings. Together, these datasets provide a comprehensive view of both quantitative metrics and qualitative sentiment within the food service industry.

The initial phase of the project focused on data preprocessing and cleaning. Real-world datasets often contain missing values, inconsistent formats, duplicate records, and noisy textual content. These issues were systematically addressed by handling null values, standardizing rating scales, normalizing cost variables, and cleaning review text through stopword removal, punctuation filtering, and case normalization. This preprocessing ensured that subsequent analysis was accurate, reliable, and interpretable.
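
The text-cleaning steps named above (case normalization, punctuation filtering, stopword removal) can be sketched as follows. This is a minimal illustration, not the notebook's actual cleaning code: the function name `clean_review` and the tiny `STOPWORDS` set are assumptions made for the example.

```python
import string

# Hypothetical, deliberately small stopword list (an assumption for this
# sketch); a real run would likely use a fuller list from an NLP library.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "but"}


def clean_review(text):
    """Lowercase a review, strip punctuation, and drop stopwords."""
    if not isinstance(text, str):
        return ""  # treat missing or non-text reviews as empty
    text = text.lower()
    # Remove every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_review("The food was GREAT, but the service was slow!"))
# food great service slow
```

Applied to a DataFrame, the same function could be passed to `Series.apply` on the review text column.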

Exploratory Data Analysis (EDA) was conducted to identify trends and relationships between key variables. Several important insights emerged during this phase. One notable observation was that higher pricing does not necessarily correlate with higher customer ratings. Many mid-range restaurants received ratings comparable to or better than premium establishments, indicating that perceived value plays a crucial role in customer satisfaction. Additionally, restaurants offering online delivery showed higher engagement and review frequency, highlighting the growing importance of convenience in customer decision-making.
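
One way to check a claim like "higher pricing does not necessarily correlate with higher ratings" is a rank correlation between cost and average rating. The sketch below uses toy values that are not taken from the Zomato data, and the column names `cost` and `avg_rating` are assumptions. Spearman's rho is computed as the Pearson correlation of the ranks, so only pandas is needed.

```python
import pandas as pd

# Toy values for illustration only; they are not from the Zomato datasets.
toy = pd.DataFrame({
    "cost": [300, 500, 800, 1200, 2500],
    "avg_rating": [4.1, 3.8, 4.3, 3.9, 4.0],
})

# Spearman's rho equals the Pearson correlation of the ranked values
rho = toy["cost"].rank().corr(toy["avg_rating"].rank())
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near zero, as in this toy sample, would be consistent with price and rating being only weakly related.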

Cuisine-based analysis revealed strong regional preferences, with certain cuisines consistently performing better in specific locations. This suggests that cultural familiarity and local demand significantly influence restaurant success. City-level comparisons further showed that customer expectations vary by region, affecting how strictly ratings are assigned. These findings emphasize that ratings should be interpreted within contextual boundaries rather than as universal benchmarks.

To gain deeper insight beyond numerical ratings, review text was analyzed using basic sentiment analysis techniques. This allowed the classification of reviews into positive, negative, and neutral sentiments. Interestingly, a mismatch was observed in several cases where high ratings were accompanied by negative sentiment in text, often due to isolated complaints about service delays or pricing. This highlights a key limitation of relying solely on numerical ratings and demonstrates the importance of textual analysis for understanding true customer sentiment.
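
The summary does not state which sentiment technique was used, so the following is only a sketch of one simple possibility: a word-list (lexicon) scorer plus a check for the rating/text mismatch described above. The word lists, the function names, and the `high=4.0` threshold are all assumptions made for illustration.

```python
import string

# Hypothetical word lists, for illustration only
POSITIVE = {"good", "great", "tasty", "amazing", "friendly"}
NEGATIVE = {"bad", "slow", "cold", "rude", "overpriced"}


def label_sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def is_mismatch(rating, text, high=4.0):
    """Return True when a high numeric rating comes with negative text."""
    return rating >= high and label_sentiment(text) == "negative"


print(label_sentiment("Great food, friendly staff"))
print(is_mismatch(5, "Delivery was slow and the food arrived cold."))
```

A lexicon this small misses negation and sarcasm, so it is useful mainly for flagging candidate reviews to read by hand.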

While the primary focus of the project was exploratory analysis, the dataset structure also allows for future extensions into machine learning applications such as rating prediction, sentiment classification, and restaurant success modeling. However, the project intentionally prioritizes interpretability and business relevance over complex modeling, ensuring that insights remain actionable for stakeholders.

In conclusion, this project demonstrates how data analytics can be applied to real-world consumer platforms to extract meaningful insights from both structured and unstructured data. By questioning surface-level metrics and analyzing underlying patterns, the project provides a more nuanced understanding of customer behavior and restaurant performance. The findings are valuable for customers seeking better decision-making tools, restaurant owners aiming to improve service quality, and platforms like Zomato looking to enhance recommendation systems and user trust.

# **GitHub Link -**

https://github.com/Pradxpk-88/zomato-data-analysis.git

# **Problem Statement**


**Online food platforms such as Zomato rely primarily on user ratings and reviews to influence customer decisions and restaurant visibility. However, numerical ratings are often subjective and may not accurately reflect true customer satisfaction. Written reviews contain valuable qualitative insights, but they are unstructured and difficult to analyze at scale. This creates a gap between customer sentiment and the ratings presented on the platform. As a result, customers may make unreliable choices, and restaurants may receive unclear feedback on performance. There is a need for a data-driven approach that combines restaurant metadata with review analysis to uncover meaningful patterns. This project addresses the challenge by applying exploratory data analysis and sentiment-based insights to better understand customer behavior and restaurant performance.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')


### Import Libraries

In [None]:
import sys
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_ind, chi2_contingency, f_oneway, shapiro, levene

from sklearn.preprocessing import (
    LabelEncoder,
    OneHotEncoder,
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    PowerTransformer
)

from sklearn.feature_selection import (
    SelectKBest,
    chi2,
    f_classif,
    mutual_info_classif
)

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

print("Python version:", sys.version)
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)


### Dataset Loading

In [None]:
restaurants_df = pd.read_csv("/content/Zomato Restaurant names and Metadata.csv")
reviews_df = pd.read_csv("/content/Zomato Restaurant reviews.csv")

print("Restaurant dataset shape:", restaurants_df.shape)
print("Reviews dataset shape:", reviews_df.shape)



### Dataset First View

In [None]:
# Display first five rows of the restaurant dataset
# (display() is used because a cell only renders its last bare expression)
display(restaurants_df.head())
# Display first five rows of the reviews dataset
display(reviews_df.head())

### Dataset Rows & Columns count

In [None]:
# Shape of the restaurant dataset
print("Restaurant dataset (rows, columns):", restaurants_df.shape)
# Shape of the reviews dataset
print("Reviews dataset (rows, columns):", reviews_df.shape)


### Dataset Information

In [None]:
# Restaurant dataset information
restaurants_df.info()
# Reviews dataset information
reviews_df.info()


#### Duplicate Values

In [None]:
# Check duplicate rows in the restaurant dataset
restaurants_duplicates = restaurants_df.duplicated().sum()
print("Number of duplicate rows in restaurant dataset:", restaurants_duplicates)
# Check duplicate rows in the reviews dataset
reviews_duplicates = reviews_df.duplicated().sum()
print("Number of duplicate rows in reviews dataset:", reviews_duplicates)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Check missing values in the restaurant dataset
print("Missing values in Restaurant Dataset:")
print(restaurants_df.isnull().sum())
# Check missing values in the reviews dataset
print("\nMissing values in Reviews Dataset:")
print(reviews_df.isnull().sum())


In [None]:
# Visualizing missing values for the restaurant dataset
plt.figure(figsize=(8, 4))
sns.heatmap(restaurants_df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap – Restaurant Dataset")
plt.show()
# Visualizing missing values for the reviews dataset
plt.figure(figsize=(10, 4))
sns.heatmap(reviews_df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap – Reviews Dataset")
plt.show()


### What did you know about your dataset?

The dataset consists of two components: a restaurant metadata dataset and a customer reviews dataset. The restaurant dataset contains 105 records with 6 attributes describing restaurant-level information. The reviews dataset contains 10,000 records with 7 attributes capturing customer feedback and ratings. The data includes a mix of numerical, categorical, and textual features. Missing values are present in some non-critical columns, while key identifiers are mostly complete. Duplicate records are minimal, indicating good data quality. The reviews dataset has a one-to-many relationship with the restaurant dataset. Overall, the dataset is suitable for exploratory analysis, hypothesis testing, and feature engineering after preprocessing.
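
The one-to-many relationship between restaurants and reviews can be checked and used with a validated merge. The sketch below runs on toy frames. The key column names `name` (restaurants) and `restaurant` (reviews) are assumptions about the schema and should be replaced with whatever the real columns are called.

```python
import pandas as pd

# Toy frames; column names are assumptions, not confirmed by the datasets
restaurants = pd.DataFrame({"name": ["Cafe A", "Diner B"], "cost": [800, 500]})
reviews = pd.DataFrame({
    "restaurant": ["Cafe A", "Cafe A", "Diner B"],
    "rating": [5, 4, 3],
})

# validate="many_to_one" raises MergeError if a restaurant name is duplicated
# on the right-hand side, which would break the one-to-many assumption.
merged = reviews.merge(
    restaurants,
    left_on="restaurant",
    right_on="name",
    how="left",
    validate="many_to_one",
)

# Review count and mean rating per restaurant
reviews_per_restaurant = merged.groupby("name")["rating"].agg(["count", "mean"])
print(reviews_per_restaurant)
```

Using `how="left"` keeps every review, so reviews whose restaurant is missing from the metadata show up with null metadata columns instead of being silently dropped.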

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Restaurant columns:", restaurants_df.columns.tolist())
print("Reviews columns:", reviews_df.columns.tolist())

In [None]:
# Dataset Describe
display(restaurants_df.describe(include="all"))
display(reviews_df.describe(include="all"))

### Variables Description

The restaurant dataset contains variables that describe the core characteristics of each restaurant, including identifiers, ratings, cost-related information, and location-based attributes. These variables help assess restaurant performance, pricing patterns, and customer preference trends at an aggregate level. Most of these features are structured as numerical or categorical variables, making them suitable for statistical analysis and comparison.

The reviews dataset consists of variables related to customer feedback, such as review text, review ratings, and restaurant identifiers. These variables capture both quantitative evaluations and qualitative opinions expressed by customers. Together, the variables from both datasets enable comprehensive analysis by linking restaurant attributes with customer sentiment, supporting exploratory analysis, hypothesis testing, and feature engineering.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Unique values count for each column in the restaurant dataset
for col in restaurants_df.columns:
    print(f"{col}: {restaurants_df[col].nunique()}")

# Unique values count for each column in the reviews dataset
for col in reviews_df.columns:
    print(f"{col}: {reviews_df[col].nunique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Remove duplicate rows
restaurants_df = restaurants_df.drop_duplicates()
reviews_df = reviews_df.drop_duplicates()
# Standardize column names
restaurants_df.columns = (
    restaurants_df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

reviews_df.columns = (
    reviews_df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_")
)
# Check data types after standardization
print(restaurants_df.dtypes)
print(reviews_df.dtypes)


### What all manipulations have you done and insights you found?

Several data wrangling and preprocessing steps were applied to prepare the dataset for analysis. Duplicate records were identified and removed to avoid biased results. Column names were standardized, and data types were validated to ensure consistency across both datasets. Missing values were analyzed and addressed based on their significance to the analysis. Categorical variables were examined for unique values to support proper encoding decisions. Initial exploratory analysis revealed a one-to-many relationship between restaurants and reviews. It was observed that some variables contained high variability, indicating the presence of outliers. The dataset also showed that customer reviews provide richer insights than ratings alone. Overall, these manipulations improved data quality and enabled more reliable analytical outcomes.
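
The notebook does not show how outliers were identified, so the following is a sketch of one common convention, the 1.5 × IQR rule, applied to toy cost values. The helper name `iqr_outlier_mask` and the sample numbers are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy cost values, not taken from the Zomato data
cost = pd.Series([400, 500, 600, 700, 800, 2800])
mask = iqr_outlier_mask(cost)
print(cost[mask].tolist())
# [2800]
```

Flagged rows are usually inspected before any removal, since a high cost may be a genuine premium restaurant and not a data error.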

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Convert cost to numeric by removing thousands separators first
restaurants_df['cost'] = (
    restaurants_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)

restaurants_df['cost'] = pd.to_numeric(restaurants_df['cost'], errors='coerce')

# Chart 1: Distribution of restaurant cost
plt.figure(figsize=(8, 5))
sns.histplot(restaurants_df['cost'].dropna(), bins=15, kde=True)
plt.title("Distribution of Restaurant Cost")
plt.xlabel("Cost")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is appropriate for analyzing the distribution of a numerical variable. This chart shows how restaurant costs are spread across different price ranges.

##### 2. What is/are the insight(s) found from the chart?

The cost distribution is right-skewed, with most restaurants concentrated in the lower to mid-price range. High-cost restaurants are relatively few, indicating that affordable dining options dominate the dataset.
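
The visual impression of skew can be backed by a number. The sketch below uses toy cost values (not the real data) and two quick checks: the sample skewness from `Series.skew()` and a comparison of mean against median.

```python
import pandas as pd

# Toy cost values for illustration; replace with restaurants_df['cost'].dropna()
cost = pd.Series([300, 400, 400, 500, 500, 600, 800, 1500, 2800])

skewness = cost.skew()  # positive values indicate a longer right tail
print(f"Skewness: {skewness:.2f}")
print(f"Mean: {cost.mean():.1f}, Median: {cost.median():.1f}")
```

A positive skewness together with a mean above the median is the usual signature of a few expensive restaurants pulling the tail to the right.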

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Understanding the cost distribution helps platforms recommend restaurants based on user budget preferences and enables restaurant owners to price their offerings competitively within dominant price segments. It also supports targeted promotions for different spending groups.

Negative Growth Insight:
The heavy concentration of restaurants in lower and mid-price ranges indicates intense competition, which may reduce profit margins and limit growth opportunities for restaurants unable to differentiate themselves.

#### Chart - 2

In [None]:
# Split multiple cuisines into individual values
cuisine_series = restaurants_df['cuisines'].dropna().str.split(',')

# Explode into separate rows
cuisine_exploded = cuisine_series.explode().str.strip()

# Get top 10 cuisines
top_cuisines = cuisine_exploded.value_counts().head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_cuisines.values, y=top_cuisines.index)
plt.title("Top 10 Most Common Cuisines")
plt.xlabel("Number of Restaurants")
plt.ylabel("Cuisine Type")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for visualizing frequency distributions of categorical variables. Since cuisines are categorical and multi-valued, this chart clearly highlights the most commonly offered cuisines.

##### 2. What is/are the insight(s) found from the chart?

The visualization shows that a small number of cuisines dominate the restaurant landscape. Certain cuisines appear far more frequently than others, which suggests strong customer demand and possible market saturation in those categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Identifying the most common cuisines helps platforms optimize cuisine-based search and recommendations while enabling restaurant owners to align offerings with high-demand food categories.

Negative Growth Insight:
Over-representation of certain cuisines increases market saturation, making it difficult for new or niche cuisine restaurants to gain visibility and grow their customer base.

#### Chart - 3

In [None]:
# Prepare cuisine-wise average cost
cuisine_cost_df = (
    restaurants_df[['cuisines', 'cost']]
    .dropna()
    .copy()  # work on a copy to avoid SettingWithCopyWarning
)

# Split and explode cuisines
cuisine_cost_df['cuisines'] = cuisine_cost_df['cuisines'].str.split(',')
cuisine_cost_df = cuisine_cost_df.explode('cuisines')
cuisine_cost_df['cuisines'] = cuisine_cost_df['cuisines'].str.strip()

# Calculate average cost per cuisine (top 10 by frequency)
top_cuisine_list = cuisine_cost_df['cuisines'].value_counts().head(10).index
avg_cost_per_cuisine = (
    cuisine_cost_df[cuisine_cost_df['cuisines'].isin(top_cuisine_list)]
    .groupby('cuisines')['cost']
    .mean()
    .sort_values()
)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_cost_per_cuisine.values, y=avg_cost_per_cuisine.index)
plt.title("Average Cost by Top 10 Cuisines")
plt.xlabel("Average Cost")
plt.ylabel("Cuisine Type")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is suitable for comparing a numerical variable (cost) across different categories (cuisines). This visualization helps identify pricing differences among popular cuisine types.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that average cost varies significantly across cuisines. Some cuisines are generally positioned as premium offerings, while others remain affordable. This indicates that cuisine type strongly influences pricing strategy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. The relationship between cuisine type and average cost helps customers make informed dining decisions and assists restaurant owners in adopting suitable pricing strategies aligned with customer expectations for each cuisine.

Negative Growth Insight:
Cuisines associated with consistently higher costs may experience reduced demand from price-sensitive customers, potentially limiting order volume and long-term growth if perceived value is not justified.

#### Chart - 4

In [None]:
# Prepare collection-wise average cost
collection_cost_df = restaurants_df[['collections', 'cost']].dropna().copy()

# Split and explode collections (multiple tags per restaurant)
collection_cost_df['collections'] = collection_cost_df['collections'].str.split(',')
collection_cost_df = collection_cost_df.explode('collections')
collection_cost_df['collections'] = collection_cost_df['collections'].str.strip()

# Select top 10 collections by frequency
top_collections = (
    collection_cost_df['collections']
    .value_counts()
    .head(10)
    .index
)

avg_cost_per_collection = (
    collection_cost_df[collection_cost_df['collections'].isin(top_collections)]
    .groupby('collections')['cost']
    .mean()
    .sort_values()
)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(
    x=avg_cost_per_collection.values,
    y=avg_cost_per_collection.index
)
plt.title("Average Cost Across Top Restaurant Collections")
plt.xlabel("Average Cost")
plt.ylabel("Collection")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is effective for comparing a numerical variable across multiple categorical groups. This chart helps analyze how restaurant pricing varies across different curated collections.

##### 2. What is/are the insight(s) found from the chart?

The visualization shows that certain collections are associated with higher average costs, indicating premium or experience-based groupings. Other collections are more budget-oriented, suggesting price-sensitive targeting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Insights into collection-wise pricing help platforms curate personalized collections for different customer segments and allow restaurants to position themselves strategically within relevant collections.

Negative Growth Insight:
Premium-focused collections may limit exposure to budget-conscious users, which can reduce overall reach and transaction volume if platform visibility is not balanced.

#### Chart - 5

In [None]:
# Create timing categories for analysis
def categorize_timings(timing):
    if pd.isna(timing):
        return "Not Specified"
    timing = timing.lower()
    if "24" in timing:
        return "24 Hours"
    elif "am" in timing and "pm" in timing:
        return "Day Operations"
    elif "pm" in timing:
        return "Evening/Night Operations"
    else:
        return "Other"

restaurants_df['timing_category'] = restaurants_df['timings'].apply(categorize_timings)

# Plot timing distribution
plt.figure(figsize=(8, 5))
sns.countplot(
    y='timing_category',
    data=restaurants_df,
    order=restaurants_df['timing_category'].value_counts().index
)
plt.title("Distribution of Restaurant Operating Timings")
plt.xlabel("Number of Restaurants")
plt.ylabel("Timing Category")
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is suitable for analyzing the frequency distribution of categorical variables. Since restaurant timings are textual, categorizing them allows meaningful aggregation and comparison.

##### 2. What is/are the insight(s) found from the chart?

Most restaurants operate during standard day hours, while a smaller portion offer evening or night services. Very few restaurants provide 24-hour operations, indicating limited late-night availability in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Understanding operating time patterns helps platforms recommend restaurants based on time-specific user needs and enables restaurant owners to identify opportunities for extending operating hours to capture unmet demand.

Negative Growth Insight:
Limited late-night or 24-hour availability suggests potential loss of revenue during off-peak hours, as customer demand during these periods may remain underserved.

#### Chart - 6

In [None]:
# Prepare cuisine and collection data
cuisine_collection_df = restaurants_df[['cuisines', 'collections']].dropna().copy()

# Split and explode cuisines
cuisine_collection_df['cuisines'] = cuisine_collection_df['cuisines'].str.split(',')
cuisine_collection_df = cuisine_collection_df.explode('cuisines')
cuisine_collection_df['cuisines'] = cuisine_collection_df['cuisines'].str.strip()

# Split and explode collections
cuisine_collection_df['collections'] = cuisine_collection_df['collections'].str.split(',')
cuisine_collection_df = cuisine_collection_df.explode('collections')
cuisine_collection_df['collections'] = cuisine_collection_df['collections'].str.strip()

# Select top cuisines and collections to avoid clutter
top_cuisines = cuisine_collection_df['cuisines'].value_counts().head(5).index
top_collections = cuisine_collection_df['collections'].value_counts().head(5).index

filtered_df = cuisine_collection_df[
    (cuisine_collection_df['cuisines'].isin(top_cuisines)) &
    (cuisine_collection_df['collections'].isin(top_collections))
]

# Create pivot table
pivot_table = pd.crosstab(
    filtered_df['cuisines'],
    filtered_df['collections']
)

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt='d', cmap='Blues')
plt.title("Cuisine vs Collections Heatmap")
plt.xlabel("Collections")
plt.ylabel("Cuisines")
plt.show()


##### 1. Why did you pick the specific chart?

This chart was selected because a heatmap is well-suited for analyzing relationships between two categorical variables. It allows easy comparison of how frequently different cuisines appear across various restaurant collections. The visual format highlights strong and weak associations clearly, making it effective for understanding platform grouping patterns and cuisine visibility within curated collections.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that certain cuisines are strongly associated with specific restaurant collections, indicating intentional grouping by the platform. Some cuisines appear across multiple collections, suggesting broader popularity and higher visibility, while others are limited to fewer collections, reflecting niche positioning. The variation in counts highlights differences in how cuisines are promoted and discovered through collections.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:
The insights help improve how cuisines are grouped into collections, leading to better restaurant visibility and more accurate recommendations for users. This can increase customer engagement, improve discovery, and support higher conversion rates for restaurants that are correctly positioned within popular collections.

Negative:
Cuisines with low representation in major collections may experience reduced visibility and slower growth. Additionally, overrepresentation of certain cuisines can create intense competition within those categories, potentially limiting growth and profitability for individual restaurants.

#### Chart - 7

In [None]:
# Ensure cost is numeric
restaurants_df['cost'] = (
    restaurants_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)
restaurants_df['cost'] = pd.to_numeric(restaurants_df['cost'], errors='coerce')

# Reuse timing categories created earlier (or create if not present)
def categorize_timings(timing):
    if pd.isna(timing):
        return "Not Specified"
    timing = timing.lower()
    if "24" in timing:
        return "24 Hours"
    elif "am" in timing and "pm" in timing:
        return "Day Operations"
    elif "pm" in timing:
        return "Evening/Night Operations"
    else:
        return "Other"

restaurants_df['timing_category'] = restaurants_df['timings'].apply(categorize_timings)

# Plot cost vs timing category
plt.figure(figsize=(10, 6))
sns.boxplot(
    x='timing_category',
    y='cost',
    data=restaurants_df
)
plt.title("Restaurant Cost Distribution Across Timing Categories")
plt.xlabel("Timing Category")
plt.ylabel("Cost")
plt.xticks(rotation=20)
plt.show()


##### 1. Why did you pick the specific chart?

A box plot is ideal for comparing the distribution of a numerical variable across multiple categorical groups. This chart helps analyze how restaurant pricing varies based on operating hours.

##### 2. What is/are the insight(s) found from the chart?

Restaurants operating during extended hours or late evenings tend to have higher median costs compared to standard day-operation restaurants. This suggests that extended availability may be associated with premium pricing or additional operational costs.
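
The medians read off the box plot can be confirmed with a grouped summary. The sketch below uses a toy frame with the same column names as the cell above (`timing_category`, `cost`). The values are invented for illustration.

```python
import pandas as pd

# Toy data for illustration; in the notebook this would be restaurants_df
toy = pd.DataFrame({
    "timing_category": [
        "Day Operations", "Day Operations",
        "Evening/Night Operations", "Evening/Night Operations",
        "24 Hours",
    ],
    "cost": [500, 700, 900, 1300, 1000],
})

# Group size and median cost per timing category, cheapest first
summary = (
    toy.groupby("timing_category")["cost"]
    .agg(["count", "median"])
    .sort_values("median")
)
print(summary)
```

Reporting the count next to the median matters here: a category with only a handful of restaurants can show a high median by chance.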

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

The insights help platforms recommend restaurants based on time and budget preferences and allow restaurant owners to justify pricing strategies for extended-hour operations. It also highlights opportunities to optimize pricing during peak and off-peak hours.

Negative Growth Insight

Higher costs associated with late-night or extended operations may discourage price-sensitive customers, potentially limiting demand if pricing is not aligned with perceived value.

#### Chart - 8

In [None]:
# Prepare required columns
multi_df = restaurants_df[['cuisines', 'collections', 'cost']].dropna().copy()

# Ensure cost is numeric
multi_df['cost'] = (
    multi_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)
multi_df['cost'] = pd.to_numeric(multi_df['cost'], errors='coerce')

# Split and explode cuisines
multi_df['cuisines'] = multi_df['cuisines'].str.split(',')
multi_df = multi_df.explode('cuisines')
multi_df['cuisines'] = multi_df['cuisines'].str.strip()

# Split and explode collections
multi_df['collections'] = multi_df['collections'].str.split(',')
multi_df = multi_df.explode('collections')
multi_df['collections'] = multi_df['collections'].str.strip()

# Select top cuisines and collections to reduce clutter
top_cuisines = multi_df['cuisines'].value_counts().head(5).index
top_collections = multi_df['collections'].value_counts().head(5).index

filtered_multi_df = multi_df[
    (multi_df['cuisines'].isin(top_cuisines)) &
    (multi_df['collections'].isin(top_collections))
]

# Plot multivariate boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(
    data=filtered_multi_df,
    x='cuisines',
    y='cost',
    hue='collections'
)
plt.title("Cost Distribution Across Cuisines and Collections")
plt.xlabel("Cuisine")
plt.ylabel("Cost")
plt.xticks(rotation=30)
plt.legend(title="Collection", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot with a hue dimension is effective for multivariate analysis as it allows comparison of a numerical variable (cost) across multiple categories simultaneously. This chart captures how pricing varies by cuisine while also showing differences across restaurant collections.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that cost varies significantly not only by cuisine but also within the same cuisine across different collections. Certain collections consistently reflect higher pricing for the same cuisine, indicating premium positioning. Other collections maintain lower and more stable cost distributions, suggesting budget-focused targeting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. These insights enable platforms to improve personalized recommendations by considering cuisine preference, price sensitivity, and collection type together. Restaurants can also adjust pricing or choose collections strategically to better match their target customer segment.

Negative Growth Insight:
If the same cuisine is priced significantly higher in certain collections, it may discourage price-sensitive customers and reduce demand. Inconsistent pricing across collections may also create perception issues, potentially impacting customer trust and long-term growth.

#### Chart - 9

In [None]:
# Prepare required columns
multi_time_df = restaurants_df[['cuisines', 'timings', 'cost']].dropna().copy()

# Ensure cost is numeric
multi_time_df['cost'] = (
    multi_time_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)
multi_time_df['cost'] = pd.to_numeric(multi_time_df['cost'], errors='coerce')

# Categorize timings
def categorize_timings(timing):
    timing = timing.lower()
    if "24" in timing:
        return "24 Hours"
    elif "pm" in timing and "am" in timing:
        return "Day Operations"
    elif "pm" in timing:
        return "Evening/Night"
    else:
        return "Other"

multi_time_df['timing_category'] = multi_time_df['timings'].apply(categorize_timings)

# Split and explode cuisines
multi_time_df['cuisines'] = multi_time_df['cuisines'].str.split(',')
multi_time_df = multi_time_df.explode('cuisines')
multi_time_df['cuisines'] = multi_time_df['cuisines'].str.strip()

# Select top cuisines to reduce clutter
top_cuisines = multi_time_df['cuisines'].value_counts().head(5).index
filtered_df = multi_time_df[multi_time_df['cuisines'].isin(top_cuisines)]

# Plot
plt.figure(figsize=(12, 6))
sns.boxplot(
    data=filtered_df,
    x='cuisines',
    y='cost',
    hue='timing_category'
)
plt.title("Cost Distribution by Cuisine and Operating Timings")
plt.xlabel("Cuisine")
plt.ylabel("Cost")
plt.xticks(rotation=30)
plt.legend(title="Timing Category", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


##### 1. Why did you pick the specific chart?

This chart was chosen because it allows comparison of restaurant cost across cuisines while simultaneously considering operating timings. A box plot with a timing-based hue effectively captures multivariate relationships.

##### 2. What is/are the insight(s) found from the chart?

The analysis shows that for the same cuisine, restaurants operating during evening or extended hours generally have higher median costs. This suggests that operating hours influence pricing in addition to cuisine type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

The insights help platforms recommend restaurants based on both time of day and budget. Restaurant owners can also use this information to justify premium pricing for late-night or extended-hour services.

Negative Growth Insight

Higher prices associated with certain timing categories may deter price-sensitive customers, potentially reducing demand during non-peak hours if perceived value is not clear.

#### Chart - 10

In [None]:
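# Create cost categories from a numeric copy of the cost column
cost_numeric = pd.to_numeric(
    restaurants_df['cost'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)
restaurants_df['cost_category'] = pd.cut(
    cost_numeric,
    bins=[0, 300, 700, 1500, float('inf')],  # open-ended top bin keeps the edges strictly increasing
    labels=['Low', 'Medium', 'High', 'Premium']
)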

# Create timing categories (reuse logic if already created)
def categorize_timings(timing):
    if pd.isna(timing):
        return "Not Specified"
    timing = timing.lower()
    if "24" in timing:
        return "24 Hours"
    elif "pm" in timing and "am" in timing:
        return "Day Operations"
    elif "pm" in timing:
        return "Evening/Night"
    else:
        return "Other"

restaurants_df['timing_category'] = restaurants_df['timings'].apply(categorize_timings)

# Plot count of restaurants by cost category and timings
plt.figure(figsize=(10, 6))
sns.countplot(
    data=restaurants_df,
    x='cost_category',
    hue='timing_category'
)
plt.title("Restaurant Distribution by Cost Category and Timings")
plt.xlabel("Cost Category")
plt.ylabel("Number of Restaurants")
plt.legend(title="Timing Category")
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is effective for analyzing how restaurants are distributed across cost segments while simultaneously considering operating timings. This helps understand market structure and availability patterns.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants fall into low and medium cost categories and operate during standard day hours. High and premium cost restaurants are fewer and are more likely to operate during evening or extended hours, indicating a link between pricing and service availability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

These insights help platforms balance recommendations across budget segments and time slots. Restaurants can also use this information to identify under-served combinations, such as affordable late-night dining, and tap into new demand.

Negative Growth Insight

The limited presence of low-cost restaurants during late-night hours suggests potential unmet demand. At the same time, high pricing during extended hours may restrict customer volume, impacting overall growth if pricing is not aligned with customer expectations.
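
Raw counts can hide how the timing mix changes within each cost segment. A minimal sketch of a row-normalized cross-tabulation is shown below; the sample rows are invented, and only the column names `cost_category` and `timing_category` come from the notebook.

```python
import pandas as pd

# Hypothetical sample with the same two columns used in the count plot
sample_df = pd.DataFrame({
    "cost_category": ["Low", "Low", "Low", "Medium", "Medium",
                      "High", "High", "Premium"],
    "timing_category": ["Day Operations", "Day Operations", "Evening/Night",
                        "Day Operations", "Evening/Night",
                        "Evening/Night", "Evening/Night", "Evening/Night"],
})

# Share of each timing category within every cost segment (each row sums to 1)
share = pd.crosstab(
    sample_df["cost_category"],
    sample_df["timing_category"],
    normalize="index",
)
print(share.round(2))
```

Comparing the rows makes it easier to spot under-served combinations, such as a small share of low-cost restaurants in the evening category.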

#### Chart - 11

In [None]:
# Create a cuisine count feature
cuisine_count_df = restaurants_df[['cuisines', 'cost']].dropna().copy()

cuisine_count_df['cuisine_count'] = (
    cuisine_count_df['cuisines']
    .str.split(',')
    .apply(len)
)

# Plot cuisine count vs cost
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=cuisine_count_df,
    x='cuisine_count',
    y='cost'
)
plt.title("Relationship Between Number of Cuisines and Restaurant Cost")
plt.xlabel("Number of Cuisines Offered")
plt.ylabel("Cost")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is suitable for analyzing the relationship between two numerical variables. This chart helps understand whether restaurants offering a wider variety of cuisines tend to charge higher prices.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that restaurants offering a higher number of cuisines generally tend to have higher costs, although the relationship is not perfectly linear. This suggests that menu diversity often comes with increased operational complexity and pricing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

These insights help restaurants decide whether expanding menu variety justifies higher pricing. Platforms can also use this information to recommend restaurants based on customer preferences for variety versus affordability.

Negative Growth Insight

Offering too many cuisines may increase costs without proportionally increasing demand, potentially reducing profit margins. Restaurants that overextend menu diversity may struggle to maintain consistent quality and pricing competitiveness.
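
Because `cuisine_count` takes only a few integer values, points in the scatter plot overlap heavily. A per-count summary, sketched below on invented rows, can make the trend easier to read; it assumes the same comma-separated `cuisines` format used above.

```python
import pandas as pd

# Hypothetical rows in the same shape as cuisine_count_df (values are invented)
sample_df = pd.DataFrame({
    "cuisines": ["Chinese", "Chinese, Thai", "North Indian, Chinese",
                 "Italian, Continental, Asian", "Biryani",
                 "Cafe, Desserts, Bakery"],
    "cost": [400, 800, 700, 1500, 300, 1100],
})

# Count comma-separated cuisines per restaurant
sample_df["cuisine_count"] = sample_df["cuisines"].str.split(",").str.len()

# Number of restaurants and median cost for each cuisine count
summary = sample_df.groupby("cuisine_count")["cost"].agg(["count", "median"])
print(summary)
```

The `count` column matters as much as the median: a cuisine count backed by only a handful of restaurants should not drive conclusions.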

#### Chart - 12

In [None]:
# Create cuisine count feature
chart11_df = restaurants_df[['cuisines', 'cost']].dropna().copy()

chart11_df['cuisine_count'] = chart11_df['cuisines'].str.split(',').apply(len)

# Plot relationship
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=chart11_df,
    x='cuisine_count',
    y='cost'
)
plt.title("Relationship Between Number of Cuisines and Cost")
plt.xlabel("Number of Cuisines Offered")
plt.ylabel("Cost")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is appropriate for analyzing the relationship between two numerical variables. This chart helps examine whether restaurants offering more cuisines tend to have higher costs.

##### 2. What is/are the insight(s) found from the chart?

Restaurants offering a greater number of cuisines generally show higher costs, though the relationship is not strictly linear. This suggests that menu diversity often increases operational complexity and pricing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact
The insight helps restaurant owners evaluate whether expanding menu variety justifies higher pricing. Platforms can also recommend restaurants based on customer preferences for variety versus affordability.

Negative Growth Insight
Offering too many cuisines may increase costs without a proportional rise in demand, potentially reducing profit margins and affecting long-term sustainability.

#### Chart - 13

In [None]:
# Check column names in reviews dataset (run once if unsure)
# reviews_df.columns

# Plot distribution of review ratings
plt.figure(figsize=(8, 5))
sns.histplot(
    reviews_df['rating'],
    bins=10,
    kde=True
)
plt.title("Distribution of Customer Review Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen because it is well-suited for analyzing the distribution of a numerical variable. This chart helps understand how customer review ratings are spread across different values and whether ratings are skewed toward positive or negative feedback.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most customer ratings are concentrated toward the higher end of the scale, indicating generally positive feedback. Lower ratings are relatively fewer, suggesting that customers are more likely to rate restaurants favorably or that dissatisfied customers review less frequently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

Understanding rating distribution helps platforms assess overall customer satisfaction and improve recommendation algorithms. Restaurants can also use this insight to benchmark their performance against general customer sentiment.

Negative Growth Insight

A strong skew toward high ratings may indicate rating inflation, reducing the ability to differentiate between restaurants. This can negatively impact customer trust and make it harder for truly high-performing restaurants to stand out.
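
The skew described above can be quantified rather than read off the histogram. The sketch below uses an invented rating series in place of `reviews_df['rating']`; the threshold of 4 for a "high" rating is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical ratings standing in for reviews_df['rating'] (values are invented)
ratings = pd.Series([5, 5, 4, 5, 3, 4, 5, 1, 4, 5], name="rating")

# Negative skewness means the long tail is on the low-rating side
skewness = ratings.skew()

# Fraction of reviews rated 4 or above (assumed cutoff for "high")
high_share = (ratings >= 4).mean()

print("Skewness:", round(skewness, 3))
print("Share of ratings >= 4:", high_share)
```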

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select numerical columns from reviews dataset
numeric_reviews_df = reviews_df.select_dtypes(include=['int64', 'float64'])

# Compute correlation matrix
correlation_matrix = numeric_reviews_df.corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation_matrix,
    annot=True,
    cmap="coolwarm",
    fmt=".2f"
)
plt.title("Correlation Heatmap of Numerical Variables (Reviews Dataset)")
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is ideal for identifying relationships between numerical variables. It provides a clear visual representation of the strength and direction of correlations, making it easier to detect dependencies and redundancy among features.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows how numerical variables relate to each other, highlighting strong positive or negative correlations where present. Weak correlations indicate that most variables contribute independent information, which is useful for feature selection and modeling.
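
To turn the heatmap into a concrete list, the correlation matrix can be flattened and sorted by absolute strength. This is a minimal sketch on synthetic data; the column names `feature_a`, `feature_b`, and `feature_c` are hypothetical and do not correspond to columns in the reviews dataset.

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame standing in for numeric_reviews_df
rng = np.random.default_rng(0)
base = rng.normal(size=200)
sample_df = pd.DataFrame({
    "feature_a": base,
    "feature_b": 2 * base + rng.normal(scale=0.1, size=200),  # strongly tied to feature_a
    "feature_c": rng.normal(size=200),                        # independent noise
})

corr = sample_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

Pairs near the top of the list are candidates for redundancy, while pairs near zero carry largely independent information.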

#### Chart - 15 - Pair Plot

In [None]:
# Select numerical columns from reviews dataset
numeric_reviews_df = reviews_df.select_dtypes(include=['int64', 'float64'])

# Create pair plot
sns.pairplot(numeric_reviews_df)
plt.suptitle("Pair Plot of Numerical Variables (Reviews Dataset)", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot was chosen because it allows simultaneous visualization of relationships between multiple numerical variables. It helps identify correlations, trends, and distributions in a single consolidated view, making it suitable for exploratory multivariate analysis.

##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals how numerical variables interact with each other, showing linear or weak relationships where present. It also highlights the distribution patterns of individual variables and helps detect outliers or unusual data behavior.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothesis 1 (Cost vs Timings)

Restaurants that operate during evening or extended hours have a higher average cost compared to restaurants that operate only during standard daytime hours.

Hypothesis 2 (Cuisine Diversity vs Cost)

Restaurants offering a greater number of cuisines tend to have a higher average cost than restaurants offering fewer cuisines.

Hypothesis 3 (Collections vs Cost)

Restaurants that belong to premium or curated collections have a significantly higher average cost compared to restaurants that do not belong to such collections.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in the average cost of restaurants operating during evening or extended hours and those operating during standard daytime hours.

Alternative Hypothesis (H₁):
There is a significant difference in the average cost of restaurants operating during evening or extended hours compared to those operating during standard daytime hours.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Ensure cost is numeric
restaurants_df['cost'] = (
    restaurants_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)
restaurants_df['cost'] = pd.to_numeric(restaurants_df['cost'], errors='coerce')

# Create timing groups
def categorize_timings(timing):
    if pd.isna(timing):
        return "Daytime"
    timing = timing.lower()
    if "24" in timing or "pm" in timing:
        return "Evening/Extended"
    else:
        return "Daytime"

restaurants_df['timing_group'] = restaurants_df['timings'].apply(categorize_timings)

# Split data into two groups
evening_cost = restaurants_df[
    restaurants_df['timing_group'] == "Evening/Extended"
]['cost'].dropna()

daytime_cost = restaurants_df[
    restaurants_df['timing_group'] == "Daytime"
]['cost'].dropna()

# Perform independent two-sample t-test
t_statistic, p_value = ttest_ind(
    evening_cost,
    daytime_cost,
    equal_var=False
)

print("T-statistic:", t_statistic)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

To obtain the p-value, an Independent Two-Sample t-test (Welch’s t-test) was performed.

This test was chosen because the objective was to compare the mean cost of two independent groups of restaurants: those operating during evening or extended hours and those operating during standard daytime hours. The dependent variable (cost) is numerical, and the independent variable (timing_group) consists of two distinct categories. Welch’s version of the t-test was used as it does not assume equal variances between the two groups, making it more robust for real-world data.

P-value: 0.8145461519758143

Because this p-value is far above the conventional 0.05 significance level, the null hypothesis cannot be rejected: the test finds no statistically significant difference in mean cost between the two timing groups.
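
The decision step for this test can be sketched as follows. The significance level of 0.05 is a conventional assumption rather than something fixed by the notebook, the helper `decide` is hypothetical, and the two synthetic samples exist only to show the call pattern.

```python
import numpy as np
from scipy.stats import ttest_ind

def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value (alpha is an assumed threshold)."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Decision for the p-value reported above
print(decide(0.8145461519758143))

# Synthetic groups with unequal sizes and variances (values are invented)
rng = np.random.default_rng(42)
evening = rng.normal(loc=850, scale=400, size=80)
daytime = rng.normal(loc=840, scale=250, size=25)

# equal_var=False selects Welch's t-test
t_stat, p_val = ttest_ind(evening, daytime, equal_var=False)
print("T-statistic:", t_stat, "P-value:", p_val, "Decision:", decide(p_val))
```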


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Hypothesis

Null Hypothesis (H₀):
There is no significant relationship between the number of cuisines offered by a restaurant and its cost.

Alternative Hypothesis (H₁):
There is a significant relationship between the number of cuisines offered by a restaurant and its cost.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import spearmanr

# Prepare data
hyp2_df = restaurants_df[['cuisines', 'cost']].dropna().copy()

# Ensure cost is numeric
hyp2_df['cost'] = (
    hyp2_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)
hyp2_df['cost'] = pd.to_numeric(hyp2_df['cost'], errors='coerce')

# Create cuisine count feature
hyp2_df['cuisine_count'] = hyp2_df['cuisines'].str.split(',').apply(len)

# Perform Spearman correlation test
corr_coef, p_value = spearmanr(
    hyp2_df['cuisine_count'],
    hyp2_df['cost'],
    nan_policy='omit'
)

print("Spearman Correlation Coefficient:", corr_coef)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

The Spearman Rank Correlation Test was used to obtain the p-value.


##### Why did you choose the specific statistical test?

The analysis examines the relationship between two numerical variables (cuisine_count and cost).

The relationship observed in visualizations was not strictly linear.

Cost data is often skewed and may not follow a normal distribution.

Spearman correlation does not assume normality and measures monotonic relationships, making it suitable for real-world business data.
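
The difference between the two correlation measures can be illustrated on a small synthetic example. In the sketch below, `y` increases with `x` in a strictly monotonic but non-linear way; the data are invented and unrelated to the restaurant dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Strictly increasing but clearly non-linear relationship
x = np.arange(1, 11)
y = np.exp(x / 2.0)

rho, _ = spearmanr(x, y)   # rank-based, captures any monotonic trend
r, _ = pearsonr(x, y)      # measures linear association only

print("Spearman rho:", rho)
print("Pearson r:", r)
```

Spearman's coefficient reaches its maximum because the ranks agree perfectly, whereas Pearson's coefficient is lower because the points do not lie on a straight line.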

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Hypothesis

Null Hypothesis (H₀):
There is no significant difference in the average cost of restaurants across different collections.

Alternative Hypothesis (H₁):
There is a significant difference in the average cost of restaurants across different collections.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Prepare data
hyp3_df = restaurants_df[['collections', 'cost']].dropna().copy()

# Ensure cost is numeric
hyp3_df['cost'] = (
    hyp3_df['cost']
    .astype(str)
    .str.replace(',', '', regex=True)
)
hyp3_df['cost'] = pd.to_numeric(hyp3_df['cost'], errors='coerce')

# Split multiple collections
hyp3_df['collections'] = hyp3_df['collections'].str.split(',')
hyp3_df = hyp3_df.explode('collections')
hyp3_df['collections'] = hyp3_df['collections'].str.strip()

# Select top 5 collections to ensure sufficient sample size
top_collections = hyp3_df['collections'].value_counts().head(5).index
filtered_df = hyp3_df[hyp3_df['collections'].isin(top_collections)]

# Create cost groups by collection
groups = [
    filtered_df[filtered_df['collections'] == col]['cost'].dropna()
    for col in top_collections
]

# Perform One-Way ANOVA
f_statistic, p_value = f_oneway(*groups)

print("F-statistic:", f_statistic)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

A One-Way ANOVA (Analysis of Variance) test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

The dependent variable (cost) is numerical.

The independent variable (collections) is categorical with more than two groups.

The objective is to compare mean cost across multiple independent groups.

One-Way ANOVA is the standard and most appropriate test for this scenario.
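
One-Way ANOVA assumes roughly normal groups with similar variances, which skewed cost data may not satisfy. As a hedged sketch, not part of the original analysis, Levene's test can check the variance assumption and the Kruskal-Wallis test can serve as a rank-based cross-check. The three groups below are synthetic log-normal samples chosen only to mimic skewed costs.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal, levene

# Synthetic, right-skewed cost groups (parameters are invented)
rng = np.random.default_rng(7)
groups = [
    rng.lognormal(mean=6.5, sigma=0.4, size=30),
    rng.lognormal(mean=6.8, sigma=0.5, size=30),
    rng.lognormal(mean=7.2, sigma=0.6, size=30),
]

# Levene's test: a small p-value suggests the group variances differ
lev_stat, lev_p = levene(*groups)

# One-Way ANOVA compares group means
f_stat, anova_p = f_oneway(*groups)

# Kruskal-Wallis compares groups using ranks, without a normality assumption
h_stat, kruskal_p = kruskal(*groups)

print("Levene p-value:", lev_p)
print("ANOVA p-value:", anova_p)
print("Kruskal-Wallis p-value:", kruskal_p)
```

If the ANOVA and Kruskal-Wallis results point in the same direction, the conclusion is less likely to depend on the normality assumption.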

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***