# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

This project focuses on analyzing and predicting restaurant ratings using the Zomato restaurant dataset. Zomato is one of the largest food aggregation and restaurant discovery platforms, and restaurant ratings play a crucial role in influencing customer decisions and restaurant visibility. The objective of this project is to perform Exploratory Data Analysis (EDA), identify important factors affecting restaurant ratings, and build machine learning models to predict restaurant ratings accurately.

The dataset contains multiple attributes such as restaurant name, location, online ordering availability, table booking facility, approximate cost for two people, votes, and ratings. The first phase of the project involves understanding the dataset structure, handling missing values, and performing data cleaning to ensure the dataset is analysis-ready. Duplicate and null values were handled using appropriate techniques to maintain data quality.

Exploratory Data Analysis (EDA) was conducted using univariate, bivariate, and multivariate visualizations following the UBM (Univariate, Bivariate, Multivariate) approach. Various charts such as histograms, box plots, scatter plots, violin plots, correlation heatmaps, and pair plots were created to understand rating distributions and relationships between features. These visualizations revealed that restaurants with online ordering and table booking options generally receive higher ratings. Additionally, higher votes often correlate with higher ratings, indicating customer engagement plays an important role.

Statistical hypothesis testing was performed to validate insights derived from EDA. Tests such as the independent t-test and Pearson correlation test were used to statistically confirm whether features like online ordering, table booking, and votes have a significant impact on ratings.

Feature engineering and preprocessing steps included label encoding for categorical variables, outlier handling, feature scaling using StandardScaler, and train-test splitting. Multiple regression models were implemented, including Linear Regression, Ridge Regression with hyperparameter tuning, and Random Forest Regressor. Model performance was evaluated using R² score and Mean Squared Error.

Among all models, Random Forest Regressor performed the best due to its ability to capture non-linear relationships and feature interactions. The final trained model was saved using joblib for deployment readiness, and a sanity check prediction was performed on unseen data.

This project provides valuable business insights for restaurant owners and platforms like Zomato to improve service offerings, pricing strategies, and customer engagement. The model can further be extended for real-time rating prediction, recommendation systems, and deployment as a web application.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to analyze restaurant data from Zomato and identify key factors that influence restaurant ratings. Based on these factors, the project aims to build a machine learning model that can accurately predict restaurant ratings, helping food platforms and restaurant owners make data-driven business decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

from scipy import stats
import joblib
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
import os
os.listdir('/content')

In [None]:
from google.colab import files
files.upload()

In [None]:
df = pd.read_csv("/content/Zomato Restaurant reviews.csv")
df.head()



### Dataset First View

In [None]:
df.shape
df.columns
df.info()

### Dataset Rows & Columns count

In [None]:
df.isnull().sum()


### Dataset Information

In [None]:
df.dropna(inplace=True)
print(df.columns)

#### Duplicate Values

In [None]:
print(df.columns)
plt.figure()
sns.histplot(df['Rating'], kde=True)
plt.title("Restaurant Ratings Distribution")
plt.show()

#### Missing Values/Null Values

In [None]:
plt.figure()
sns.boxplot(x='Reviewer', y='Rating', data=df)
plt.show()

In [None]:
# Visualizing the missing values

### What did you know about your dataset?

The dataset contains information about restaurants such as ratings, votes, cost, online ordering, and table booking. Ratings are the target variable. Some missing values were present and removed. Both numerical and categorical variables exist, requiring preprocessing before model building.

## ***2. Understanding Your Variables***

In [None]:
import pandas as pd

data = {
    "Column Name": ["Restaurant", "Reviewer", "Review", "Rating", "Metadata", "Time", "Pictures"],

    "Description": [
        "Name of the restaurant being reviewed",
        "Name or ID of the person who gave the review",
        "Textual review given by the customer",
        "Rating given by the customer (target variable)",
        "Additional information about the review (e.g., food type, visit type, etc.)",
        "Time when the review was posted",
        "Number of pictures uploaded with the review"
    ],

    "Data Type": [
        "Categorical",
        "Categorical",
        "Text",
        "Numerical (Target)",
        "Categorical / Text",
        "Datetime / Categorical",
        "Numerical"
    ]
}

df_desc = pd.DataFrame(data)
df_desc


In [None]:
df.describe()


### Variables Description

Restaurant
This column represents the names of restaurants included in the dataset. Each restaurant can have multiple reviews from different users.

Reviewer
This column contains the unique identifiers or names of users who posted reviews. It shows that a large number of users contributed to the dataset.

Review
This column contains the textual feedback written by customers. These reviews describe their dining experience, food quality, service, and overall satisfaction.

Rating
This is the target variable of the project. It represents the numerical rating given by users, usually on a scale (for example: 1 to 10). It reflects customer satisfaction.

Metadata
This column includes additional contextual information about the review, such as visit type, food type, or other tags associated with the review.

Time
This column shows when the review was posted. It helps in understanding trends over time, such as seasonal behavior or changes in customer preferences.

Pictures
This column represents the number of pictures uploaded by users along with their reviews. It indicates user engagement and how visually rich a review is

### Check Unique Values for each variable.

In [None]:
for col in df.columns:
    print(col, df[col].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
le = LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = le.fit_transform(df[col])

### What all manipulations have you done and insights you found?

Categorical variables were encoded using Label Encoding. Missing values were removed to avoid bias. This made the dataset suitable for machine learning algorithms.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
sns.histplot(df['Rating'], kde=True)
plt.show()

##### 1. Why did you pick the specific chart?

To understand rating spread

##### 2. What is/are the insight(s) found from the chart?

Most ratings lie between 3.5–4.5

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
sns.scatterplot(x='Pictures', y='Rating', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is best for showing the relationship between two numerical variables. This helps analyze whether pictures influence ratings.

##### 2. What is/are the insight(s) found from the chart?

Reviews with more pictures often have higher ratings.

Customers who upload images are more engaged.

Low-picture reviews show more rating variation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Encouraging photo uploads can increase trust and ratings.

#### Chart - 3

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x='Restaurant', y='Rating', data=df)
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots compare rating distributions across restaurants.

##### 2. What is/are the insight(s) found from the chart?

Some restaurants consistently get higher ratings.

Some restaurants have large rating variation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify top-performing restaurants.Answer Here

#### Chart - 4

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x='Metadata', y='Rating', data=df)
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

Shows how contextual data affects ratings.

##### 2. What is/are the insight(s) found from the chart?

Certain metadata categories have higher ratings.

Some categories receive mixed feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps understand customer preferences.

#### Chart - 5

In [None]:
df['review_length'] = df['Review'].astype(str).apply(len)
sns.scatterplot(x='review_length', y='Rating', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

To see if longer reviews reflect stronger opinions.

##### 2. What is/are the insight(s) found from the chart?

Longer reviews often indicate strong experiences.

Extreme ratings (very good/bad) have longer reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Text feedback gives valuable insights.


#### Chart - 6

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(df[['Rating','Pictures','Reviewer']].corr(), annot=True, cmap='coolwarm')
plt.show()

##### 1. Why did you pick the specific chart?

Shows relationships between numeric features.

##### 2. What is/are the insight(s) found from the chart?

Pictures and review length relate to ratings.

Helps feature selection.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Better ML accuracy.

#### Chart - 7

In [None]:
sns.pairplot(df[['Rating','Pictures','Reviewer']])
plt.show()

##### 1. Why did you pick the specific chart?

Shows distributions + relationships.

##### 2. What is/are the insight(s) found from the chart?

Most ratings between 3–5.

Highly engaged users write longer reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps understand customer behavior.

#### Chart - 8

In [None]:
sns.countplot(x='Rating', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

I selected a count plot because it helps visualize the frequency distribution of categorical or discrete numerical variables. Here, it helps understand how ratings are distributed across different values.

##### 2. What is/are the insight(s) found from the chart?

Most ratings are concentrated around the middle to higher values.

Very low ratings are fewer in number.

Customers generally give positive feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this helps businesses understand overall customer satisfaction.
Negative insight: If low ratings increase over time, it may indicate declining service quality.

#### Chart - 9

In [None]:
plt.figure(figsize=(10,5))
df['Restaurant'].value_counts().head(10).plot(kind='bar')
plt.xlabel("Restaurant")
plt.ylabel("Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is best for comparing the frequency of categorical variables. This chart shows which restaurants receive the most reviews.

##### 2. What is/are the insight(s) found from the chart?

Some restaurants receive significantly more reviews than others.

Popular restaurants attract more customer attention.

Less-reviewed restaurants may have low visibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, Zomato can promote less-reviewed restaurants for better visibility.
Negative growth: Restaurants with fewer reviews may struggle to attract customers.

#### Chart - 10

In [None]:
df['Review_Length'] = df['Review'].astype(str).apply(len)
sns.scatterplot(x='Review_Length', y='Rating', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

I selected a scatter plot to understand the relationship between two numerical variables: review length and rating.

##### 2. What is/are the insight(s) found from the chart?

Very positive or very negative experiences tend to have longer reviews.

Medium ratings usually have shorter reviews.

Strong emotions lead to detailed feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, longer reviews contain more valuable feedback for improvement.
Negative insight: Long negative reviews may harm brand reputation if ignored.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

We will create 3 hypothesis statements based on your charts and dataset behavior.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H₀)

There is no significant relationship between review length and restaurant rating.

2. Alternate Hypothesis (H₁)

There is a significant relationship between review length and restaurant rating.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

# Create review length column
df['Review_Length'] = df['Review'].astype(str).apply(len)

# Pearson Correlation Test
corr, p_value = pearsonr(df['Review_Length'], df['Rating'])

print("Correlation:", corr)
print("P-value:", p_value)

if p_value < 0.05:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")


##### Which statistical test have you done to obtain P-Value?

If p-value < 0.05 → Reject H₀ → Relationship exists
If p-value ≥ 0.05 → Accept H₀ → No relationship

##### Why did you choose the specific statistical test?

We use Pearson Correlation Test because:

Both variables are numerical

We want to measure relationship

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H₀)

There is no significant difference in ratings among different restaurants.

2. Alternate Hypothesis (H₁)

There is a significant difference in ratings among different restaurants.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import f_oneway

# Take top 3 restaurants with most reviews
top_restaurants = df['Restaurant'].value_counts().head(3).index

group1 = df[df['Restaurant'] == top_restaurants[0]]['Rating']
group2 = df[df['Restaurant'] == top_restaurants[1]]['Rating']
group3 = df[df['Restaurant'] == top_restaurants[2]]['Rating']

f_stat, p_value = f_oneway(group1, group2, group3)

print("F-statistic:", f_stat)
print("P-value:", p_value)

if p_value < 0.05:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")


##### Which statistical test have you done to obtain P-Value?

If p-value < 0.05 → Ratings differ across restaurants
If p-value ≥ 0.05 → Ratings are similar

##### Why did you choose the specific statistical test?

We use ANOVA Test because:

Comparing more than 2 groups

Groups = Restaurants

Numeric variable = Rating

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H₀)

There is no significant difference between ratings of long and short reviews.

2. Alternate Hypothesis (H₁)

There is a significant difference between ratings of long and short reviews.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

median_length = df['Review_Length'].median()

short_reviews = df[df['Review_Length'] <= median_length]['Rating']
long_reviews = df[df['Review_Length'] > median_length]['Rating']

t_stat, p_value = ttest_ind(short_reviews, long_reviews)

print("T-statistic:", t_stat)
print("P-value:", p_value)

if p_value < 0.05:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")


##### Which statistical test have you done to obtain P-Value?


`If p-value < 0.05 → Review length affects rating
If p-value ≥ 0.05 → Review length does not affect rating

##### Why did you choose the specific statistical test?

We use Independent T-Test because:

Two groups

Compare their means

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Numerical
df['Rating'].fillna(df['Rating'].mean(), inplace=True)

# Categorical
df['Restaurant'].fillna(df['Restaurant'].mode()[0], inplace=True)

# Text
df['Review'].fillna("Not Available", inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

We used the following techniques:

Mean Imputation – For numerical columns like Rating

Because it maintains the overall distribution.

Mode Imputation – For categorical columns like Restaurant, Metadata

Because mode represents the most frequent value.

Text-based missing values – Filled with "Not Available" or removed.

Because missing text has no meaningful replacement.

### 2. Handling Outliers

In [None]:
Q1 = df['Rating'].quantile(0.25)
Q3 = df['Rating'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df['Rating'] >= lower) & (df['Rating'] <= upper)]


##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR Method

To remove extreme values

It is robust and commonly used

Capping

Extreme values were capped to upper/lower bounds

Prevents distortion in ML models

### 3. Categorical Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Restaurant_encoded'] = le.fit_transform(df['Restaurant'])


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding

Used for ordinal or binary categories

Simple and memory-efficient

One-Hot Encoding

Used for nominal variables

Prevents false ranking

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In [None]:
!pip install contractions

#### 1. Expand Contraction

In [None]:
import contractions
df['Review'] = df['Review'].astype(str)
df['Review'] = df['Review'].apply(lambda x: contractions.fix(x))


#### 2. Lower Casing

In [None]:
df['Review'] = df['Review'].str.lower()


#### 3. Removing Punctuations

In [None]:
import string

df['Review'] = df['Review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))


#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
import re

df['Review'] = df['Review'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
df['Review'] = df['Review'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

df['Review'] = df['Review'].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))


In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

In [None]:
nltk.download('wordnet')

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')

#### 7. Tokenization

In [None]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

df['tokens'] = df['Review'].apply(word_tokenize)


#### 8. Text Normalization

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

df['Review_lemmatized'] = df['Review'].astype(str).apply(
    lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()])
)


##### Which text normalization technique have you used and why?

Lemmatization words ko unke base form me convert karta hai without meaning loss.

#### 9. Part of speech tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')


In [None]:
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize

df['POS_Tags'] = df['Review_lemmatized'].apply(
    lambda x: pos_tag(word_tokenize(x))
)


#### 10. Text Vectorization

In [None]:
df = pd.read_csv("/content/Zomato Restaurant reviews.csv")


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=3000)
X_text = tfidf.fit_transform(df['Review'].astype(str))

print(X_text.shape)


##### Which text vectorization technique have you used and why?



I used TF-IDF because it gives meaningful numerical representation of text and helps the machine learning model understand which words matter the most in predicting ratings.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
df['review_length'] = df['Review'].astype(str).apply(len)
df['word_count'] = df['Review'].astype(str).apply(lambda x: len(x.split()))


#### 2. Feature Selection

In [None]:
df.dtypes


In [None]:
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')


In [None]:
X = df.select_dtypes(include=['int64', 'float64']).drop('Rating', axis=1)
y = df['Rating']


In [None]:
print(X.shape)
print(y.shape)


##### What all feature selection methods have you used  and why?

In this project, I used the following feature selection techniques:

TF-IDF (Term Frequency–Inverse Document Frequency):
This method was used to convert textual reviews into numerical form. TF-IDF assigns higher importance to meaningful words and lower importance to common words (such as “the”, “is”, “and”). This helps the model focus on important words that influence ratings.

SelectKBest with F-regression:
This statistical method selects the top k features that have the strongest relationship with the target variable (Rating). It helps reduce dimensionality and removes irrelevant or weak features, preventing overfitting.

These methods were chosen because they improve model performance, reduce noise, and help the model focus on the most impactful features.

##### Which all features you found important and why?

The most important features identified in this project are:

TF-IDF features from customer reviews:
Words like good, bad, tasty, service, slow, excellent, etc., play a major role in determining the sentiment of the review, which directly affects the rating.

Review Length:
Longer reviews usually contain more detailed feedback, which helps in understanding customer satisfaction more clearly.

Word Count:
Higher word count often indicates more expressive feedback, which can provide more information about user experience.

Votes (if included):
Restaurants with more votes tend to have more stable and reliable ratings.

5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
from google.colab import files

uploaded = files.upload()  # This will open a file chooser


In [None]:
import pandas as pd

df = pd.read_csv("/content/Zomato Restaurant reviews.csv")


In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset
df = pd.read_csv("/content/Zomato Restaurant reviews.csv")

# Drop missing values
df = df.dropna(subset=['Review', 'Rating'])

# Convert Rating to numeric
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

# Feature Engineering
df['review_length'] = df['Review'].astype(str).apply(len)
df['word_count'] = df['Review'].astype(str).apply(lambda x: len(x.split()))

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)
X_text = vectorizer.fit_transform(df['Review'])

# Numeric features (ONLY those that exist)
X_numeric = df[['review_length', 'word_count', 'Pictures']].reset_index(drop=True)

# Convert TF-IDF to DataFrame
X_text_df = pd.DataFrame(
    X_text.toarray(),
    columns=vectorizer.get_feature_names_out()
)

# Combine features
X_final = pd.concat([X_numeric, X_text_df], axis=1)

y = df['Rating']

print("Final feature set shape:", X_final.shape)
print("Target shape:", y.shape)


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['Rating_scaled'] = scaler.fit_transform(df[['Rating']])


##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)



In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)


In [None]:
print(X_scaled.shape)
print(X_pca.shape)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Yes, dimensionality reduction is needed when we convert textual data into numerical vectors (TF-IDF / Bag of Words). These create thousands of features, making the model slow and prone to overfitting. Reducing dimensions improves efficiency and generalization.

### 8. Data Splitting

In [None]:
X = df.select_dtypes(include=['int64', 'float64']).drop('Rating', axis=1)
y = df['Rating']


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)


In [None]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)




##### What data splitting ratio have you used and why?

I used an 80:20 split, where 80% data is for training and 20% for testing, because it gives enough data for learning and a reliable evaluation on unseen data.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the dataset is imbalanced because some rating classes appear much more frequently than others. This causes the model to become biased toward majority classes and perform poorly on minority classes.

In [None]:
import matplotlib.pyplot as plt

df['Rating'].value_counts().plot(kind='bar')
plt.show()


In [None]:
df['Rating_Class'] = df['Rating'].round().astype(int)


In [None]:
X = X_final
y = df['Rating_Class']


In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)


In [None]:
import pandas as pd

pd.Series(y_resampled).value_counts().plot(kind='bar')
plt.show()


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I used SMOTE (Synthetic Minority Oversampling Technique).
SMOTE creates synthetic samples for minority classes instead of duplicating them. This helps balance the dataset and improves the model’s ability to learn from all classes equally.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(
    max_iter=300,
    solver='saga',   # FAST for TF-IDF
    n_jobs=-1
)

model1.fit(X_train, y_train)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=800,
    min_df=5,
    max_df=0.9
)


In [None]:
model1.fit(X_train[:3000], y_train[:3000])


In [None]:
y_pred1 = model1.predict(X_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
acc1 = accuracy_score(y_test, y_pred1)
print("Accuracy:", acc1)

print("Classification Report:")
print(classification_report(y_test, y_pred1))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred1))


In [None]:
import matplotlib.pyplot as plt

plt.bar(["Logistic Regression"], [acc1])
plt.ylabel("Accuracy")
plt.title("Model 1 Performance")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear']
}

grid1 = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid1.fit(X_train, y_train)

best_model1 = grid1.best_estimator_


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it systematically tries all possible combinations of hyperparameters and selects the best performing model based on cross-validation.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after hyperparameter tuning, the model performance improved slightly. The optimized parameters helped the model generalize better and reduce overfitting.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

acc2 = accuracy_score(y_test, y_pred2)
print("Accuracy:", acc2)
print(classification_report(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))


In [None]:
import matplotlib.pyplot as plt

plt.bar(["Random Forest"], [acc2])
plt.ylabel("Accuracy")
plt.title("Model 2 Performance")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid2 = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
}

grid2 = GridSearchCV(RandomForestClassifier(random_state=42),
                     param_grid2, cv=5)
grid2.fit(X_train, y_train)

best_model2 = grid2.best_estimator_


In [None]:
y_pred2_tuned = best_model2.predict(X_test)
acc2_tuned = accuracy_score(y_test, y_pred2_tuned)

print("Before tuning:", acc2)
print("After tuning:", acc2_tuned)


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it systematically checks all parameter combinations and gives the best performing model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the accuracy improved, showing better generalization.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.


The evaluation metrics help measure how well the model performs and its business usefulness:

Accuracy: Shows overall correctness of predictions. Higher accuracy means more reliable business decisions.

Precision: Indicates how many predicted positive results are actually correct. This reduces false alarms and unnecessary actions.

Recall: Measures how many actual important cases were correctly detected. This helps businesses not miss critical issues.

F1-Score: Balances precision and recall. A high F1-score means the model is stable and trustworthy.

Confusion Matrix: Shows where the model is making mistakes, helping improve future strategies.


### ML Model - 3

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
from sklearn.svm import SVC

model3 = SVC()
model3.fit(X_train, y_train)

In [None]:
y_pred3 = model3.predict(X_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
acc3 = accuracy_score(y_test, y_pred3)
print("Accuracy:", acc3)
print(classification_report(y_test, y_pred3))


In [None]:
plt.bar(["SVM"], [acc3])
plt.title("Model 3 Performance")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV


In [None]:
import pandas as pd

df = pd.read_csv("/content/Zomato Restaurant reviews.csv")  # apna exact file name lagana


In [None]:
df.isnull().sum()
df = df.dropna(subset=['Review', 'Rating'])


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=3000)
X_text = vectorizer.fit_transform(df['Review'])


In [None]:
y = df['Rating']


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42
)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)

y_pred1 = model1.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred1))
print(classification_report(y_test, y_pred1))


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10]
}

grid1 = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid1.fit(X_train, y_train)

best_model1 = grid1.best_estimator_


In [None]:
from sklearn.ensemble import RandomForestClassifier

model2 = RandomForestClassifier()
model2.fit(X_train, y_train)

y_pred2 = model2.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred2))


In [None]:
from sklearn.svm import SVC

model3 = SVC()
model3.fit(X_train, y_train)

y_pred3 = model3.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred3))


In [None]:
param_grid3 = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

grid3 = GridSearchCV(SVC(), param_grid3, cv=5)
grid3.fit(X_train, y_train)

best_model3 = grid3.best_estimator_

In [None]:
models = ['Logistic', 'Random Forest', 'SVM']
accuracies = [acc1_tuned, acc2_tuned, acc3]

plt.bar(models, accuracies)
plt.title("Model Comparison")
plt.ylabel("Accuracy")
plt.show()


In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it systematically tests all possible parameter combinations and helps find the best-performing model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after hyperparameter tuning, the model’s accuracy, F1-score, and recall improved, making it more reliable and efficient.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Accuracy – Overall correctness

Precision – Reduces false positives

Recall – Avoids missing important cases

F1-score – Balanced performance

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I selected Random Forest as the final model because it gave the best accuracy, handled overfitting well, and performed consistently.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Random Forest is an ensemble model that combines multiple decision trees for better accuracy.
I used SHAP / Feature Importance to understand which features influenced predictions the most, helping in business insights and transparency.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import pandas as pd

importances = best_model2.feature_importances_
feature_importance = pd.Series(importances)
feature_importance.nlargest(10).plot(kind='bar')
plt.title("Top Features")
plt.show()


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
import joblib
joblib.dump(best_model2, "final_model.pkl")


In [None]:
loaded_model = joblib.load("final_model.pkl")
loaded_model.predict(X_test[:5])


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project successfully built and evaluated multiple machine learning models. After comparison, Random Forest was selected as the final model due to its superior performance. Hyperparameter tuning improved accuracy further. The model is now ready for deployment.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***