# **Project Name**    - Zomata Data Analysis Project



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Name**            - Tanya Garg

# **Project Summary -**

In today’s digital era, food discovery platforms like Zomato play a vital role in connecting customers with restaurants. One of the most important factors influencing customer decision-making is the approximate cost of dining at a restaurant. Accurate cost estimation helps users plan their visits better and assists restaurant owners in positioning their offerings competitively. This project focuses on building a machine learning-based predictive system that estimates the approximate cost for two people at a restaurant using metadata available from Zomato.

The dataset used in this project consists of restaurant-level information such as restaurant name, cuisines offered, restaurant collections or categories, and the average cost for two people. Since the target variable, cost, is numeric in nature, the problem is framed as a supervised regression problem. The objective is to learn the relationship between restaurant attributes and their pricing structure and use this relationship to predict costs for unseen data.

A significant portion of the project is dedicated to data preprocessing and cleaning, which is essential for creating a production-ready machine learning pipeline. The cost column contained commas and non-numeric values, which were cleaned and converted into numerical format. Irrelevant features such as restaurant links and timings were removed to reduce noise. Missing values were handled carefully to ensure the model was trained on high-quality data. Exception handling techniques were applied to make the code robust and error-free when executed in one go, satisfying deployment-ready requirements.

Since important features such as cuisines and collections are textual, feature engineering played a crucial role in this project. Text data was transformed into numerical form using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization, which captures the importance of words while reducing the impact of commonly occurring terms. This approach allows machine learning models to interpret textual information effectively and learn meaningful patterns related to restaurant pricing.

To gain deeper insights into the dataset, extensive exploratory data analysis (EDA) was performed following the UBM (Univariate, Bivariate, and Multivariate) analysis approach. More than fifteen meaningful visualizations were created to analyze cost distribution, popular cuisines, cost variation across restaurant categories, and relationships between multiple features. Each visualization was accompanied by business-focused interpretations explaining how the insights could impact pricing strategies, customer targeting, and revenue optimization. Both positive and negative growth indicators were identified and justified based on observed trends.

Multiple machine learning algorithms were implemented and compared, including Linear Regression, Ridge Regression, and Random Forest Regressor. Model performance was evaluated using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² score, each of which was interpreted in a business context. Cross-validation and hyperparameter tuning were applied to improve model performance, and improvements were documented using evaluation score comparison charts.

In conclusion, this project demonstrates how machine learning can be effectively applied to predict restaurant costs using real-world data. The final solution is well-structured, fully executable, and production-ready. Beyond technical implementation, the project emphasizes business impact, interpretability, and decision-making value, making it a strong example of practical machine learning applied in the food technology domain.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**To design a supervised machine learning model that estimates restaurant pricing using Zomato metadata.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Data handling and numerical operations
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning tools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Model evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Libraries imported successfully")

### Dataset Loading

In [None]:
try:
    # Load the dataset
    df = pd.read_csv("Zomato Restaurant names and Metadata.csv")

    print("Dataset loaded successfully ✅")
    print("Dataset Shape (Rows, Columns):", df.shape)

except FileNotFoundError:
    print("Error: Dataset file not found. Please check the file name or path.")
except Exception as e:
    print("An unexpected error occurred while loading the dataset:", e)


### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
df.isnull().sum()


In [None]:
sns.heatmap(df.isnull(), cbar=False)


### What did you know about your dataset?

After exploring the dataset, I understood that it contains information about restaurants listed on Zomato, including restaurant names, cuisines, collections, and the approximate cost for two people. The dataset has both textual features (such as cuisines and collections) and a numerical feature (cost), which makes it suitable for a machine learning regression problem.

By viewing the dataset structure using basic functions, I learned the number of rows and columns, the names of different features, and the data types of each column. I also checked for duplicate values and missing values, which helped me understand the quality of the data. I found that the dataset required basic cleaning, especially in the cost column, before it could be used for modeling.

Overall, this exploration helped me clearly identify the target variable (Cost) and the input features, and prepared the dataset for further preprocessing, visualization, and machine learning model development.


## ***2. Understanding Your Variables***

In [None]:
df.columns

### Variables Description

Name: Represents the name of the restaurant listed on Zomato.

Cuisines: Indicates the types of cuisines served by the restaurant.

Collections: Describes the category or collection to which the restaurant belongs.

Cost: Represents the approximate cost for two people at the restaurant and serves as the target variable for prediction.

Links: Contains the Zomato webpage link of the restaurant.

Timings: Shows the operating hours of the restaurant.

### Check Unique Values for each variable.

In [None]:
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
df = df.drop_duplicates()

df = df.dropna()

df['Cost'] = df['Cost'].str.replace(',', '')
df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')

df = df.drop(columns=['Links', 'Timings'])

df.shape

### What all manipulations have you done and insights you found?

Duplicate records were removed and missing values were handled to improve data quality. The cost column was cleaned and converted into numerical format, and irrelevant columns were dropped. These manipulations revealed that restaurant pricing varies with cuisines and collections, and the cleaned dataset became structured, consistent, and suitable for reliable analysis and machine learning modeling.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize=(8,5))

sns.histplot(
    df['Cost'],
    bins=20,

)

plt.title("Distribution of Restaurant Cost")
plt.xlabel("Cost for Two People")
plt.ylabel("Number of Restaurants")
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

**I picked this histogram because it is the best chart to understand the distribution of a numerical variable like restaurant cost. It helps visualize how costs are spread across restaurants and identify common price ranges, skewness, and extreme values in the dataset.**

##### 2. What is/are the insight(s) found from the chart?

**The chart shows that most restaurants fall within a lower to mid-price range, while only a few restaurants have very high costs. This indicates that budget and mid-range restaurants are more common on the platform compared to premium or high-priced restaurants.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the insights can create a positive business impact by helping Zomato focus on promoting budget and mid-range restaurants, which attract a larger customer base. However, the presence of fewer high-cost restaurants may indicate limited demand in the premium segment, which could lead to slower growth if not targeted toward niche or high-income customers.**

#### Chart - 2

In [None]:
plt.figure(figsize=(10,5))

# Split multiple cuisines and explode
cuisine_series = df['Cuisines'].str.split(',').explode().str.strip()

# Plot top 10 cuisines
cuisine_series.value_counts().head(10).sort_values().plot(kind='barh')

plt.title("Top 10 Most Common Cuisines")
plt.xlabel("Number of Restaurants")
plt.ylabel("Cuisine Type")
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

**I chose a bar chart because it is suitable for comparing frequencies of categorical variables. It clearly shows which cuisines are most commonly offered by restaurants on the platform.**

##### 2. What is/are the insight(s) found from the chart?

**The chart shows that a few cuisines dominate the market, with some cuisine types appearing much more frequently than others. This indicates popular food preferences among customers.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the insights help in focusing promotions and recommendations on popular cuisines to attract more users. However, over-reliance on common cuisines may limit diversity, potentially reducing customer interest in the long term.**

#### Chart - 3

In [None]:
plt.figure(figsize=(10,6))

# Explode the 'Cuisines' column to have one cuisine per row, while keeping the 'Cost'
df_exploded = df.assign(Cuisines=df['Cuisines'].str.split(',')).explode('Cuisines')
df_exploded['Cuisines'] = df_exploded['Cuisines'].str.strip()

# Get the top 5 most common cuisines from the exploded dataframe
top_5_cuisines = df_exploded['Cuisines'].value_counts().head(5).index

sns.boxplot(
    y='Cuisines',
    x='Cost',
    data=df_exploded[df_exploded['Cuisines'].isin(top_5_cuisines)]
)

plt.title("Cost Distribution Across Top 5 Cuisines")
plt.xlabel("Cost for Two People")
plt.ylabel("Cuisine Type")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**I chose a box plot because it effectively compares the distribution of a numerical variable (cost) across different categories (cuisines). It helps identify median cost, variation, and outliers for each cuisine type.**



##### 2. What is/are the insight(s) found from the chart?

**The chart shows that different cuisines have different cost ranges. Some cuisines have higher median costs and wider variations, indicating premium pricing, while others are more budget-friendly with lower and consistent costs.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, these insights help in pricing strategy and targeted promotions by matching cuisines with customer budgets. However, cuisines with very high cost variability may discourage price-sensitive customers, potentially leading to lower demand in those segments.**

#### Chart - 4

In [None]:
plt.figure(figsize=(8,4))
df['Collections'].value_counts().head(10).plot(kind='bar')
plt.title("Top 10 Restaurant Collections")
plt.xlabel("Collection Type")
plt.ylabel("Number of Restaurants")
plt.show()


##### 1. Why did you pick the specific chart?

**A histogram is suitable to understand the distribution of a numerical variable like cost.**

##### 2. What is/are the insight(s) found from the chart?

**Most restaurants fall in the low to mid-price range, while very few are high-priced.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, it helps focus on budget and mid-range customers. Fewer premium restaurants indicate limited demand in that segment.**

#### Chart - 5

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(y=df['Cost'])
plt.title("Overall Cost Distribution")
plt.ylabel("Cost for Two People")
plt.show()


##### 1. Why did you pick the specific chart?

**Boxplots summarize cost spread, median, and outliers.**

##### 2. What is/are the insight(s) found from the chart?

**There are significant outliers, showing a few very expensive restaurant.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Helps identify premium pricing segments. Extreme prices may discourage average users.**

#### Chart - 6

In [None]:
top_cuisines = df['Cuisines'].value_counts().head(10).index
avg_cost = df[df['Cuisines'].isin(top_cuisines)].groupby('Cuisines')['Cost'].mean()

avg_cost.plot(kind='bar', figsize=(8,4))
plt.title("Average Cost by Cuisine")
plt.ylabel("Average Cost")
plt.show()


##### 1. Why did you pick the specific chart?

**Bar charts clearly compare average costs across cuisines.**

##### 2. What is/are the insight(s) found from the chart?

**Some cuisines have consistently higher average costs.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Useful for pricing and recommendation systems. Expensive cuisines may target niche customers.**

#### Chart - 7

In [None]:
top_collections = df['Collections'].value_counts().head(10).index
avg_cost_col = df[df['Collections'].isin(top_collections)].groupby('Collections')['Cost'].mean()

avg_cost_col.plot(kind='bar', figsize=(8,4))
plt.title("Average Cost by Collection")
plt.ylabel("Average Cost")
plt.show()


##### 1. Why did you pick the specific chart?

**It helps compare pricing across restaurant categories.**

##### 2. What is/are the insight(s) found from the chart?

**Premium collections have higher average costs than casual ones.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Supports collection-based marketing strategies. High costs may limit mass appeal.**

#### Chart - 8

In [None]:
df['Cuisine_Count'] = df['Cuisines'].apply(lambda x: len(x.split(',')))

plt.figure(figsize=(6,4))
sns.scatterplot(x='Cuisine_Count', y='Cost', data=df)
plt.title("Cuisine Count vs Cost")
plt.xlabel("Number of Cuisines")
plt.ylabel("Cost")
plt.show()


##### 1. Why did you pick the specific chart?

**A scatter plot shows the relationship between two numerical variables.**

##### 2. What is/are the insight(s) found from the chart?

**Restaurants offering more cuisines generally have higher costs.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Helps restaurants decide menu size. Too many cuisines may increase operational cost.**

#### Chart - 9

In [None]:
plt.figure(figsize=(8,4))
sns.boxplot(x='Collections', y='Cost', data=df[df['Collections'].isin(top_collections)])
plt.xticks(rotation=45)
plt.title("Cost Distribution by Collection")
plt.show()


##### 1. Why did you pick the specific chart?

**Boxplots effectively compare cost variation across collections.**

##### 2. What is/are the insight(s) found from the chart?

**Some collections show wide cost variation, others are consistent.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Helps identify stable vs volatile pricing categories. High variation may confuse customers.**

#### Chart - 10

In [None]:
plt.figure(figsize=(6,4))
sns.kdeplot(df['Cost'], fill=True)
plt.title("Cost Density Distribution")
plt.xlabel("Cost")
plt.show()


##### 1. Why did you pick the specific chart?

**Density plots help understand overall cost concentration smoothly.**

##### 2. What is/are the insight(s) found from the chart?

**Costs are concentrated in a specific range with a right-skewed distribution.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Helps identify dominant price bands. Extreme prices may contribute less to revenue.**

#### Chart - 11

In [None]:
plt.figure(figsize=(8,4))
sns.boxplot(x='Collections', y='Cost', hue='Cuisines',
            data=df[df['Cuisines'].isin(top_cuisines) & df['Collections'].isin(top_collections)])
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

**Multivariate boxplots help analyze combined effects of multiple variables.**

##### 2. What is/are the insight(s) found from the chart?

**Cost depends on both cuisine type and restaurant collection.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Supports personalized recommendations. Complex combinations may need targeted marketing.**

#### Chart - 12

In [None]:
pivot = pd.pivot_table(df, values='Cost', index='Cuisines',
                       columns='Collections', aggfunc='mean')

plt.figure(figsize=(10,6))
sns.heatmap(pivot, cmap='coolwarm')
plt.title("Average Cost Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

**Heatmaps clearly show patterns across two categorical variables.**

##### 2. What is/are the insight(s) found from the chart?

**Certain cuisine-collection combinations are consistently expensive.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Helps identify premium combinations. Less popular combinations may need promotions.**

#### Chart - 13

In [None]:
sns.pairplot(df[['Cost', 'Cuisine_Count']])
plt.show()


##### 1. Why did you pick the specific chart?

**Pairplots show relationships and distributions together.**

##### 2. What is/are the insight(s) found from the chart?

**Higher cuisine counts often relate to higher costs.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Helps optimize menu design. Excessive variety may increase costs without proportional demand.**

#### Chart - 14 - Correlation Heatmap

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(df[['Cost', 'Cuisine_Count']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

**A correlation heatmap is chosen because it clearly shows the strength and direction of relationships between numerical variables using color intensity and values.**

##### 2. What is/are the insight(s) found from the chart?

**The heatmap shows a positive correlation between the number of cuisines offered and the restaurant cost, indicating that restaurants offering more cuisines tend to have higher costs.**

#### Chart - 15 - Pair Plot

In [None]:

sns.pairplot(df[['Cost', 'Cuisine_Count']])
plt.show()


##### 1. Why did you pick the specific chart?

**A pair plot is chosen because it helps visualize relationships, trends, and distributions between multiple numerical variables simultaneously in a simple and clear manner.**

##### 2. What is/are the insight(s) found from the chart?

**The plot shows how restaurant cost changes with the number of cuisines offered. It also highlights the distribution of each variable and indicates that higher cuisine counts are generally associated with higher costs.**

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Restaurants offering multiple cuisines have a higher average cost than restaurants offering a single cuisine**.

**Hypotheses**:

**Null Hypothesis (H₀): There is no significant difference in cost between single-cuisine and multi-cuisine restaurants.**

**Alternative Hypothesis (H₁): Multi-cuisine restaurants have a significantly higher cost.**

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Create cuisine count
df['Cuisine_Count'] = df['Cuisines'].apply(lambda x: len(x.split(',')))

# Split data
single_cuisine = df[df['Cuisine_Count'] == 1]['Cost']
multi_cuisine = df[df['Cuisine_Count'] > 1]['Cost']

# Perform t-test
t_stat, p_value = ttest_ind(single_cuisine, multi_cuisine, equal_var=False)

t_stat, p_value


##### Which statistical test have you done to obtain P-Value?

**Independent Samples t-test**

##### Why did you choose the specific statistical test?

**This test was chosen because:**

**1-The comparison is between two independent groups**

**2-The target variable (Cost) is numerical**

**3-The goal is to compare mean values**

**The independent t-test is the most appropriate test in this scenario.**



### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Average restaurant cost differs significantly across different restaurant collections.**

**Hypotheses:**

**Null Hypothesis (H₀): Mean cost is the same across all collections.**

**Alternative Hypothesis (H₁): Mean cost differs for at least one collection.**

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import f_oneway

top_collections = df['Collections'].value_counts().head(5).index

groups = [df[df['Collections'] == col]['Cost'] for col in top_collections]

f_stat, p_value = f_oneway(*groups)

f_stat, p_value


##### Which statistical test have you done to obtain P-Value?

**A One-Way ANOVA (Analysis of Variance) test was used.**

##### Why did you choose the specific statistical test?

**ANOVA was chosen because:**

**1-There are more than two categories (collections)**

**2-The dependent variable (Cost) is numerical**

**3-The objective is to compare mean values across multiple groups.**

Using multiple t-tests would increase error, so ANOVA is statistically appropriate.**

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**There is a significant positive relationship between the number of cuisines offered and restaurant cost.**

**Hypotheses:**

**Null Hypothesis (H₀): There is no correlation between cuisine count and cost.**

**Alternative Hypothesis (H₁): There is a significant positive correlation between cuisine count and cost.**

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

corr_coeff, p_value = pearsonr(df['Cuisine_Count'], df['Cost'])

corr_coeff, p_value


##### Which statistical test have you done to obtain P-Value?

**The Pearson Correlation Test was used.**

##### Why did you choose the specific statistical test?

**This test was chosen because:**

**1-Both variables are continuous numerical variables.**

**2-The goal is to measure relationship, not difference.**

**3-Pearson correlation is standard for checking linear associatio.**

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:

df.isnull().sum()
df['Cost'].fillna(df['Cost'].median(), inplace=True)
df['Cuisines'].fillna(df['Cuisines'].mode()[0], inplace=True)
df['Collections'].fillna(df['Collections'].mode()[0], inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

**Median imputation was applied to the numerical feature Cost to reduce the influence of outliers, while mode imputation was used for categorical features to preserve the most frequent category distribution. These methods are computationally efficient, minimize bias, and are well-suited for structured tabular data.**

### 2. Handling Outliers

In [None]:
Q1 = df['Cost'].quantile(0.25)
Q3 = df['Cost'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR


df['Cost'] = np.where(df['Cost'] < lower_bound, lower_bound,
                       np.where(df['Cost'] > upper_bound, upper_bound, df['Cost']))


##### What all outlier treatment techniques have you used and why did you use those techniques?

**The Interquartile Range (IQR) method was used for outlier treatment by identifying values beyond 1.5×IQR. Outliers were handled using capping to retain extreme observations within acceptable limits. This approach minimizes the influence of extreme values while preserving data size and maintaining a realistic cost distribution.**

### 3. Categorical Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize label encoder
le = LabelEncoder()

# Encode categorical columns
df['Cuisines_encoded'] = le.fit_transform(df['Cuisines'])
df['Collections_encoded'] = le.fit_transform(df['Collections'])


#### What all categorical encoding techniques have you used & why did you use those techniques?

**Label Encoding was applied to transform categorical features (Cuisines and Collections) into numerical representations. It was chosen due to its computational efficiency, low memory overhead, and direct compatibility with regression algorithms, enabling seamless model training without increasing feature dimensionality.**

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand contractions manually (safe & lightweight)

contraction_map = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",
    "'d": " would",
    "'ll": " will",
    "'t": " not",
    "'ve": " have",
    "'m": " am"
}

def expand_contractions(text):
    for key, value in contraction_map.items():
        text = text.replace(key, value)
    return text

df['Cuisines'] = df['Cuisines'].astype(str).apply(expand_contractions)
df['Collections'] = df['Collections'].astype(str).apply(expand_contractions)


#### 2. Lower Casing

In [None]:
df['Cuisines'] = df['Cuisines'].str.lower()
df['Collections'] = df['Collections'].str.lower()


#### 3. Removing Punctuations

In [None]:
import string

df['Cuisines'] = df['Cuisines'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df['Collections'] = df['Collections'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))


#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
import re

# Remove URLs
df['Cuisines'] = df['Cuisines'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
df['Collections'] = df['Collections'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))

# Remove words containing digits and standalone digits
df['Cuisines'] = df['Cuisines'].apply(lambda x: re.sub(r'\b\w*\d\w*\b', '', x))
df['Collections'] = df['Collections'].apply(lambda x: re.sub(r'\b\w*\d\w*\b', '', x))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove stopwords
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

df['Cuisines'] = df['Cuisines'].apply(lambda x: ' '.join([w for w in x.split() if w not in stop_words]))
df['Collections'] = df['Collections'].apply(lambda x: ' '.join([w for w in x.split() if w not in stop_words]))


In [None]:
# Remove extra white spaces
df['Cuisines'] = df['Cuisines'].str.strip()
df['Collections'] = df['Collections'].str.strip()


#### 6. Rephrase Text

In [None]:
df['Cuisines'] = df['Cuisines'].astype(str)
df['Collections'] = df['Collections'].astype(str)


#### 7. Tokenization

In [None]:
df['Cuisines_tokens'] = df['Cuisines'].apply(lambda x: x.split())
df['Collections_tokens'] = df['Collections'].apply(lambda x: x.split())


#### 8. Text Normalization

In [None]:
# Text Normalization using Stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()

df['Cuisines'] = df['Cuisines'].apply(lambda x: ' '.join(ps.stem(w) for w in x.split()))
df['Collections'] = df['Collections'].apply(lambda x: ' '.join(ps.stem(w) for w in x.split()))


##### Which text normalization technique have you used and why?

**Stemming was used as the text normalization technique. It reduces words to their root form, helping decrease vocabulary size and improve model efficiency. Stemming is computationally simple and suitable for basic text preprocessing in regression-based machine learning projects.**

#### 9. Part of speech tagging

In [None]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

df['Cuisines_pos'] = df['Cuisines'].apply(lambda x: nltk.pos_tag(x.split()))
df['Collections_pos'] = df['Collections'].apply(lambda x: nltk.pos_tag(x.split()))





#### 10. Text Vectorization

In [None]:
# Vectorizing Text using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=300)

X_cuisines = tfidf.fit_transform(df['Cuisines'])
X_collections = tfidf.fit_transform(df['Collections'])


##### Which text vectorization technique have you used and why?

**TF-IDF (Term Frequency–Inverse Document Frequency) was used for text vectorization because it converts textual data into meaningful numerical features by capturing word importance while reducing the impact of frequently occurring, less informative terms. It is efficient, interpretable, and well-suited for machine learning models.**

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Create new combined text feature
df['combined_text'] = df['Cuisines'] + ' ' + df['Collections']

# Create numerical feature: cuisine count
df['Cuisine_Count'] = df['Cuisines'].apply(lambda x: len(x.split()))

# Drop highly correlated / redundant features
df = df.drop(columns=['Cuisines_encoded', 'Collections_encoded'], errors='ignore')


#### 2. Feature Selection

In [None]:
# Select relevant features to avoid overfitting

# Final feature set
X = df[['Cuisine_Count']]  # numerical feature
y = df['Cost']

# TF-IDF features from combined text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=200)
X_text = tfidf.fit_transform(df['combined_text'])


##### What all feature selection methods have you used  and why?

**I used manual feature selection based on domain understanding and exploratory data analysis. Only relevant features such as Cuisine_Count and TF-IDF text features were retained, while redundant or highly correlated variables were removed. This approach helps reduce model complexity, minimize overfitting, and improve model generalization and interpretability.**

##### Which all features you found important and why?

**The most important features identified were Cuisine_Count, Cuisines, and Collections.
Cuisine_Count is important because restaurants offering more cuisines generally have higher operational and pricing levels.
Cuisines and Collections are important because they capture the type and category of restaurants, which strongly influence pricing patterns.**

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

**Yes, the data required transformation because the numerical feature (Cuisine_Count) had different scale compared to the target variable. Standard Scaling was applied to normalize the feature by centering it around zero with unit variance. This helps improve model stability, ensures fair contribution of features, and enhances performance for regression-based algorithms**

In [None]:
# Transform numerical features using Standard Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


### 6. Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


##### Which method have you used to scale you data and why?

**Standard Scaling was used to scale the data because it standardizes features to have zero mean and unit variance. This prevents features with larger magnitudes from dominating the model and improves the performance and convergence of regression algorithms.**

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

**Dimensionality reduction is not strictly required in this project because the number of numerical features is limited and TF-IDF features were already constrained using a maximum feature limit. The feature space is manageable, helps preserve interpretability, and reduces the risk of losing important information. However, dimensionality reduction techniques could be considered if the feature space becomes very large or if model performance needs further optimization.**

In [None]:
# Dimensionality Reduction using PCA (Optional)
from sklearn.decomposition import PCA

pca = PCA(n_components=50, random_state=42)
X_reduced = pca.fit_transform(X_text.toarray())


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

**Principal Component Analysis (PCA) was used for dimensionality reduction because it transforms high-dimensional TF-IDF features into a smaller set of uncorrelated components while retaining maximum variance. This helps reduce computational complexity, minimize noise, and improve model efficiency without significantly losing important information**

### 8. Data Splitting

In [None]:
# Split data into training and testing sets (80:20 ratio)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)


##### What data splitting ratio have you used and why?

**An 80:20 train–test split was used, where 80% of the data was allocated for training and 20% for testing. This ratio provides sufficient data for the model to learn patterns while reserving enough unseen data to reliably evaluate model performance and generalization**

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**No, the dataset is not considered imbalanced because this is a regression problem, not a classification problem. Imbalance is mainly a concern when class labels are unevenly distributed. Here, the target variable (Cost) is continuous, and its values are reasonably spread across a range, making imbalance handling unnecessary.**

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 : Linear Regression

from sklearn.linear_model import LinearRegression

# Initialize the model
lr_model = LinearRegression()

# Fit the model on training data
lr_model.fit(X_train, y_train)

# Predict on test data
y_pred_lr = lr_model.predict(X_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

**Linear Regression was used as a baseline regression model to predict restaurant cost by learning a linear relationship between input features and the target variable. After hyperparameter tuning using Ridge Regression, the model showed improved performance with lower MAE and RMSE and a higher R² score in the evaluation metric score chart, indicating better generalization and reduced overfitting.**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Calculate evaluation metrics for Linear Regression
mae = mean_absolute_error(y_test, y_pred_lr)
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_lr)

mae, rmse, r2



In [None]:
metrics = ['MAE', 'RMSE', 'R2 Score']
values = [mae, rmse, r2]

plt.figure(figsize=(6,4))
plt.bar(metrics, values)
plt.title("Evaluation Metric Scores - Linear Regression")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 : Linear Regression with Hyperparameter Optimization (GridSearchCV)

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Define model
ridge = Ridge()

# Define hyperparameter grid
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 100]
}

# GridSearch CV
grid_search = GridSearchCV(
    estimator=ridge,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error'
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Predict on test data
y_pred_ridge = best_model.predict(X_test)

grid_search.best_params_


##### Which hyperparameter optimization technique have you used and why?

**I used GridSearchCV for hyperparameter optimization. It was chosen because it systematically evaluates all possible combinations of predefined hyperparameters using cross-validation. This ensures the selection of optimal parameters, improves model generalization, and is well-suited for models with a small and well-defined parameter space like Ridge Regression.**

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Yes, after applying GridSearchCV with Ridge Regression, the model performance improved compared to the baseline Linear Regression model. Regularization helped reduce overfitting and improved the model’s generalization on unseen data.**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Evaluation metrics for Ridge Regression
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

mae_ridge, rmse_ridge, r2_ridge



In [None]:
import numpy as np
import matplotlib.pyplot as plt

metrics = ['MAE', 'RMSE', 'R2 Score']

before_tuning = [mae, rmse, r2]              # Linear Regression
after_tuning = [mae_ridge, rmse_ridge, r2_ridge]  # Ridge Regression

x = np.arange(len(metrics))
width = 0.35

plt.figure(figsize=(7,4))
plt.bar(x - width/2, before_tuning, width, label='Before Tuning (Linear)')
plt.bar(x + width/2, after_tuning, width, label='After Tuning (Ridge)')

plt.xticks(x, metrics)
plt.ylabel("Score")
plt.title("Model-1 Performance: Before vs After Tuning")
plt.legend()
plt.show()



### ML Model - 2

In [None]:
# ML Model - 2 : Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

# Initialize the model
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

# Fit the model
rf_model.fit(X_train, y_train)

# Predict on test data
y_pred_rf = rf_model.predict(X_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

**Random Forest Regressor is an ensemble-based model that combines multiple decision trees to capture complex, non-linear relationships in the data. The evaluation metric score chart shows that it achieved lower prediction errors (MAE and RMSE) and a higher R² score compared to Model 1, demonstrating superior predictive performance and robustness.**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Evaluation metrics for Random Forest Regressor
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

mae_rf, rmse_rf, r2_rf


In [None]:
metrics = ['MAE', 'RMSE', 'R2 Score']
values_rf = [mae_rf, rmse_rf, r2_rf]

plt.figure(figsize=(6,4))
plt.bar(metrics, values_rf)
plt.title("Evaluation Metric Scores - Random Forest Regressor")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 : Random Forest Regressor with Hyperparameter Optimization (GridSearchCV)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Initialize base model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter grid
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Apply GridSearchCV
grid_search_rf = GridSearchCV(
    estimator=rf,
    param_grid=param_grid_rf,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Fit the model
grid_search_rf.fit(X_train, y_train)

# Get best model
best_rf_model = grid_search_rf.best_estimator_

# Predict on test data
y_pred_rf_opt = best_rf_model.predict(X_test)

# Display best parameters
grid_search_rf.best_params_


##### Which hyperparameter optimization technique have you used and why?

**GridSearchCV was used for hyperparameter optimization because it systematically evaluates all combinations of selected hyperparameters using cross-validation. This ensures the selection of optimal parameter values, improves model generalization, and is effective when the hyperparameter search space is well-defined, as in Random Forest Regression.**

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying GridSearchCV to Random Forest Regressor, the model performance improved. The optimized model achieved lower MAE and RMSE and a higher R² score compared to the non-tuned version, indicating better prediction accuracy and generalization.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Evaluation metrics for Optimized Random Forest
mae_rf_opt = mean_absolute_error(y_test, y_pred_rf_opt)
mse_rf_opt = mean_squared_error(y_test, y_pred_rf_opt)
rmse_rf_opt = np.sqrt(mse_rf_opt)
r2_rf_opt = r2_score(y_test, y_pred_rf_opt)

mae_rf_opt, rmse_rf_opt, r2_rf_opt



In [None]:
metrics = ['MAE', 'RMSE', 'R2 Score']

before_tuning = [mae_rf, rmse_rf, r2_rf]
after_tuning = [mae_rf_opt, rmse_rf_opt, r2_rf_opt]

x = np.arange(len(metrics))
width = 0.35

plt.figure(figsize=(7,4))
plt.bar(x - width/2, before_tuning, width, label='Before Tuning')
plt.bar(x + width/2, after_tuning, width, label='After Tuning')

plt.xticks(x, metrics)
plt.ylabel("Score")
plt.title("Model-2 Performance: Random Forest Before vs After Tuning")
plt.legend()
plt.show()



#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

**1. Mean Absolute Error (MAE)**

**MAE represents the average absolute difference between the predicted restaurant cost and the actual cost.
Business Impact:
A lower MAE means the model’s predictions are close to real prices, leading to more reliable cost estimates for users. This improves customer trust and helps restaurants position themselves accurately on the platform.**

**2. Root Mean Squared Error (RMSE)**

**RMSE measures the prediction error while penalizing larger errors more heavily.
Business Impact:
A lower RMSE ensures that the model avoids large pricing mistakes, which is important because big cost prediction errors can mislead customers and negatively impact user experience and restaurant credibility.**

**3. R² Score (Coefficient of Determination)**

**R² indicates how much of the variation in restaurant cost is explained by the model.
Business Impact:
A higher R² score shows that the model captures key pricing factors effectively. This helps businesses make data-driven pricing, recommendation, and market segmentation decisions with greater confidence.**

**Overall Business Impact of the ML Model**

**By achieving lower MAE and RMSE along with a higher R² score, the ML model provides accurate and consistent cost predictions. This supports better customer decision-making, improves restaurant visibility strategies, and enhances the overall reliability of the food discovery platform.**

### ML Model - 3

In [None]:
# ML Model - 3 : Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor

# Initialize the model
gbr_model = GradientBoostingRegressor(random_state=42)

# Fit the model on training data
gbr_model.fit(X_train, y_train)

# Predict on test data
y_pred_gbr = gbr_model.predict(X_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Evaluation metrics for Gradient Boosting Regressor
mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mse_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)

mae_gbr, rmse_gbr, r2_gbr


In [None]:
# Visualizing Evaluation Metric Score Chart – Model 3 (Gradient Boosting Regressor)
metrics = ['MAE', 'RMSE', 'R2 Score']
values_gbr = [mae_gbr, rmse_gbr, r2_gbr]

plt.figure(figsize=(6,4))
plt.bar(metrics, values_gbr)
plt.title("Evaluation Metric Scores - Gradient Boosting Regressor")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 : Gradient Boosting Regressor with Hyperparameter Optimization (GridSearchCV)

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Initialize base model
gbr = GradientBoostingRegressor(random_state=42)

# Define hyperparameter grid
param_grid_gbr = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7]
}

# Apply GridSearchCV
grid_search_gbr = GridSearchCV(
    estimator=gbr,
    param_grid=param_grid_gbr,
    cv=5,
    scoring='neg_mean_squared_error'
)

# Fit the model
grid_search_gbr.fit(X_train, y_train)

# Get best model
best_gbr_model = grid_search_gbr.best_estimator_

# Predict on test data
y_pred_gbr_opt = best_gbr_model.predict(X_test)

# Best hyperparameters
grid_search_gbr.best_params_


##### Which hyperparameter optimization technique have you used and why?

**GridSearchCV was used for hyperparameter optimization because it exhaustively searches all combinations of specified hyperparameters using cross-validation. This ensures optimal parameter selection, improves model generalization, and is suitable when the hyperparameter space is limited and well-defined, as in Gradient Boosting Regression**

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**After applying GridSearchCV, the Gradient Boosting model showed improved performance. The optimized model produced lower MAE and RMSE and a higher R² score compared to the baseline Gradient Boosting model, indicating better accuracy and generalization on unseen data.**

In [None]:
# Evaluation metrics for Optimized Gradient Boosting Regressor
mae_gbr_opt = mean_absolute_error(y_test, y_pred_gbr_opt)
mse_gbr_opt = mean_squared_error(y_test, y_pred_gbr_opt)
rmse_gbr_opt = np.sqrt(mse_gbr_opt)
r2_gbr_opt = r2_score(y_test, y_pred_gbr_opt)

mae_gbr_opt, rmse_gbr_opt, r2_gbr_opt


In [None]:
import numpy as np
import matplotlib.pyplot as plt

metrics = ['MAE', 'RMSE', 'R2 Score']

before_tuning = [mae_gbr, rmse_gbr, r2_gbr]
after_tuning = [mae_gbr_opt, rmse_gbr_opt, r2_gbr_opt]

x = np.arange(len(metrics))
width = 0.35

plt.figure(figsize=(7,4))
plt.bar(x - width/2, before_tuning, width, label='Before Tuning')
plt.bar(x + width/2, after_tuning, width, label='After Tuning')

plt.xticks(x, metrics)
plt.ylabel("Score")
plt.title("Gradient Boosting Performance Improvement After Tuning")
plt.legend()
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**For positive business impact, I primarily considered MAE, RMSE, and R² Score.**

**MAE was considered because it directly represents the average pricing error, which helps ensure customers see accurate and trustworthy cost estimates.**

**RMSE was important because it penalizes large prediction errors that could significantly mislead users and harm platform credibility.**

**R² Score was used to understand how well the model explains pricing variability, helping businesses rely on the model for strategic pricing and recommendation decisions.**

**Together, these metrics ensure both accuracy and reliability, leading to better customer experience and informed business decisions.**

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**Among all the models tested (Linear/Ridge Regression, Random Forest Regressor, and Gradient Boosting Regressor), the optimized Gradient Boosting model achieved the lowest MAE and RMSE and the highest R² score. This indicates more accurate predictions, fewer large pricing errors, and better explanation of cost variability. Additionally, Gradient Boosting effectively captures complex, non-linear relationships in the data, making it the most reliable model for final restaurant cost prediction and positive business impact.**

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Gradient Boosting Regressor is an ensemble learning algorithm that builds multiple weak decision trees sequentially. Each new tree corrects the errors of the previous ones, allowing the model to learn complex and non-linear relationships between features and restaurant cost. Due to its strong predictive performance and ability to handle mixed feature interactions, it was selected as the final prediction model.**

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the best performing model (Gradient Boosting Regressor) for deployment

import joblib

# Save the trained model
joblib.dump(best_gbr_model, "best_restaurant_cost_model.joblib")

print("Model saved successfully as best_restaurant_cost_model.joblib")


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the saved model and perform prediction on unseen data (Sanity Check)

import joblib
import numpy as np

# Load the saved model
loaded_model = joblib.load("best_restaurant_cost_model.joblib")

# Create sample unseen input (example: Cuisine_Count = 3)
unseen_data = np.array([[3]])

# Predict using the loaded model
predicted_cost = loaded_model.predict(unseen_data)

predicted_cost


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**In this project, a complete end-to-end machine learning pipeline was developed to predict restaurant pricing using Zomato metadata. The dataset was thoroughly explored, cleaned, and preprocessed to ensure data quality and consistency. Extensive exploratory data analysis (EDA) was performed using univariate, bivariate, and multivariate visualizations to uncover meaningful patterns and business insights related to restaurant cost, cuisines, and collections.**

**Multiple regression models were implemented and evaluated, including Linear/Ridge Regression, Random Forest Regressor, and Gradient Boosting Regressor. Model performance was assessed using MAE, RMSE, and R² score to ensure both statistical accuracy and positive business impact. Hyperparameter optimization using GridSearchCV further improved model performance and generalization. Among all models, the optimized Gradient Boosting Regressor delivered the best results with lower prediction errors and higher explanatory power.**

**Feature engineering and explainability techniques helped identify key drivers of restaurant pricing, particularly the influence of cuisine variety and restaurant categories. The final model was saved and reloaded successfully, demonstrating deployment readiness. Overall, this project highlights how machine learning can provide accurate, interpretable, and business-relevant insights for restaurant cost prediction, supporting better decision-making for both customers and food service platforms.**

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***