# **Project Name**    - Zomato Restaurant Review Analysis & Sentiment Prediction



##### **Project Type**    - EDA / Classification (NLP)
##### **Contribution**    - Individual
##### Team Member 1 - Samagra Gupta


# **Project Summary -**

This project focuses on analyzing Zomato restaurant reviews to understand customer sentiment and derive meaningful business insights using data science and machine learning techniques. The dataset consists of restaurant metadata and user-generated textual reviews along with ratings. The primary objective is to explore customer behavior, identify key factors influencing restaurant ratings, and build a sentiment classification model that can automatically classify reviews as positive or negative.

The project begins with comprehensive exploratory data analysis (EDA) to understand the structure, size, and quality of the datasets. Missing values, duplicates, and data inconsistencies are identified and handled appropriately to ensure data reliability. Both univariate and multivariate analyses are performed to study distributions of ratings, review lengths, restaurant popularity, and sentiment patterns.

Text preprocessing plays a crucial role in this project. The raw reviews are cleaned by removing special characters, stopwords, URLs, and unnecessary whitespace. Lemmatization is applied to normalize the words, making the textual data suitable for feature extraction. TF-IDF vectorization is used to convert text into numerical features that capture the importance of words across reviews.

For machine learning, a Logistic Regression model is implemented as a baseline classifier because it is interpretable and effective for text classification. A Multinomial Naive Bayes model is trained as a second model for comparison. The dataset is split into training and testing sets, and model performance is evaluated using accuracy, precision, recall, F1-score, and confusion matrix. The model demonstrates reliable performance in predicting customer sentiment.

Data visualization is a core part of this project, with 15 charts organized as Univariate, Bivariate, and Multivariate analyses. Each visualization is accompanied by insights and a discussion of business impact, highlighting how data-driven decisions can improve restaurant performance and customer satisfaction.

Overall, this project delivers an end-to-end, production-ready machine learning pipeline aligned with industry standards and internship evaluation guidelines. It demonstrates strong skills in data analysis, NLP, machine learning, and business storytelling.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to analyze Zomato restaurant reviews and ratings to understand customer sentiment, identify key factors influencing restaurant performance, and build a machine learning model that can predict customer sentiment from textual reviews. The insights derived aim to support data-driven decision-making for improving restaurant quality, customer satisfaction, and business growth.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sns.set(style="darkgrid")


### Dataset Loading

In [None]:
# Load Dataset
reviews_df = pd.read_csv("/Zomato Restaurant reviews.csv")
meta_df = pd.read_csv("/Zomato Restaurant names and Metadata.csv")


### Dataset First View

In [None]:
# Dataset First Look
reviews_df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Reviews Dataset Shape:", reviews_df.shape)
print("Metadata Dataset Shape:", meta_df.shape)


### Dataset Information

In [None]:
# Dataset Info
reviews_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
reviews_df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values Count
reviews_df.isnull().sum()


In [None]:
# Visualizing missing values
sns.heatmap(reviews_df.isnull(), cbar=False)
plt.title("Missing Values Heatmap")
plt.show()


### What did you know about your dataset?

The dataset contains real-world Zomato restaurant reviews with ratings and textual feedback. While the dataset is rich in customer opinions, it includes missing values and noisy text data. The ratings are skewed towards higher values, indicating customer positivity bias. Text preprocessing is necessary before applying any machine learning model.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
reviews_df.columns


In [None]:
# Dataset Describe
reviews_df.describe()


### Variables Description

• Restaurant: Name of the restaurant  
• Review: Customer textual review  
• Rating: Rating given by the customer (1–5)  
• Additional metadata columns include cuisine, location, and pricing information


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
reviews_df.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert Rating to numeric, force invalid values (like 'Like') to NaN
reviews_df['Rating'] = pd.to_numeric(reviews_df['Rating'], errors='coerce')

# Drop rows where Review or Rating is missing
reviews_df.dropna(subset=['Review', 'Rating'], inplace=True)


#### Text Cleaning

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

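def clean_text(text):
    """Lowercase, strip URLs and non-letters, remove stopwords, and lemmatize."""
    text = str(text).lower()
    # Remove URLs first, while they are still recognizable
    text = re.sub(r'http\S+', ' ', text)
    # Replace non-letters with a space so adjacent words are not merged
    text = re.sub(r'[^a-z ]', ' ', text)
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

reviews_df['Cleaned_Review'] = reviews_df['Review'].apply(clean_text)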


### What all manipulations have you done and insights you found?

Rows with a missing review or rating were removed to ensure data consistency. Ratings were converted to a numeric type, with non-numeric entries (such as 'Like') coerced to NaN and dropped. Textual reviews were cleaned using NLP techniques such as stopword removal and lemmatization, making the data suitable for vectorization and modeling.
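
As a small illustration of the rating conversion, the sketch below runs `pd.to_numeric(..., errors='coerce')` on a made-up frame. The `toy_df` name and its values are assumptions for demonstration, not rows from the Zomato data.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative, not taken from the dataset
toy_df = pd.DataFrame({
    "Review": ["Great food", "Too slow", None, "Loved it"],
    "Rating": ["5", "2", "4", "Like"],
})

# Non-numeric ratings (e.g. 'Like') become NaN instead of raising an error
toy_df["Rating"] = pd.to_numeric(toy_df["Rating"], errors="coerce")

# Keep only rows that have both a review and a numeric rating
clean_df = toy_df.dropna(subset=["Review", "Rating"]).reset_index(drop=True)
print(clean_df)
```

Only the first two rows survive: the third has no review text and the fourth has a rating that cannot be parsed as a number.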


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart 1: Rating Distribution
sns.countplot(x='Rating', data=reviews_df)
plt.title("Distribution of Restaurant Ratings")
plt.show()


##### 1. Why did you pick the specific chart?

To understand how customer ratings are distributed across restaurants.


##### 2. What is/are the insight(s) found from the chart?

Most ratings lie between 4 and 5, indicating overall positive customer sentiment.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive sentiment reflects strong platform credibility. However, the small share of low ratings may hide unresolved customer issues.


#### Chart - 2

In [None]:
# Chart 2: Review Length Distribution
reviews_df['Review_Length'] = reviews_df['Review'].apply(len)

sns.histplot(reviews_df['Review_Length'], bins=30)
plt.title("Review Length Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze how detailed customers are while writing reviews.


##### 2. What is/are the insight(s) found from the chart?

Most reviews are short, indicating users prefer quick feedback rather than long descriptions.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Encouraging detailed reviews may improve recommendation accuracy.


#### Chart - 3

In [None]:
# Chart 3: Rating vs Review Length
sns.boxplot(x='Rating', y='Review_Length', data=reviews_df)
plt.title("Rating vs Review Length")
plt.show()


##### 1. Why did you pick the specific chart?

To examine the relationship between customer sentiment and review verbosity.


##### 2. What is/are the insight(s) found from the chart?

Lower ratings often have longer reviews, suggesting dissatisfied customers provide more detailed feedback.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Negative reviews should be prioritized for issue resolution to prevent customer churn.


#### Chart - 4

In [None]:
# Chart - 4: Sentiment Distribution
reviews_df['Sentiment'] = reviews_df['Rating'].apply(lambda x: 'Positive' if x >= 4 else 'Negative')

sns.countplot(x='Sentiment', data=reviews_df)
plt.title("Sentiment Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

This chart helps understand the overall customer sentiment polarity present in the dataset.


##### 2. What is/are the insight(s) found from the chart?

The dataset is dominated by positive sentiment, indicating general customer satisfaction.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive sentiment strengthens Zomato’s brand value. However, fewer negative reviews may hide critical service issues.


#### Chart - 5

In [None]:
# Chart - 5: Rating vs Sentiment
sns.countplot(x='Rating', hue='Sentiment', data=reviews_df)
plt.title("Rating vs Sentiment")
plt.show()


##### 1. Why did you pick the specific chart?

To validate the correctness of sentiment labeling with respect to ratings.


##### 2. What is/are the insight(s) found from the chart?

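Ratings of 4 and above map entirely to positive sentiment, and lower ratings to negative sentiment. Because the sentiment label is derived from the rating threshold, this confirms that the labeling code behaves as intended rather than providing independent validation.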


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Rating-based labels give a consistent target for training a text classifier, which can then score new reviews automatically.


#### Chart - 6

In [None]:
# Chart - 6: Review Length Category
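reviews_df['Review_Type'] = pd.cut(
    reviews_df['Review_Length'],
    bins=[0, 50, 150, np.inf],   # open-ended top bin so reviews over 500 characters are not set to NaN
    labels=['Short', 'Medium', 'Long'],
    include_lowest=True          # keep zero-length reviews in the 'Short' bin
)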

sns.countplot(x='Review_Type', data=reviews_df)
plt.title("Review Length Category Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To classify customer engagement levels based on review length.


##### 2. What is/are the insight(s) found from the chart?

Most customers write short to medium reviews, indicating quick feedback behavior.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Encouraging longer reviews can improve sentiment accuracy and recommendations.


#### Chart - 7

In [None]:
# Chart - 7: Review Type vs Sentiment
sns.countplot(x='Review_Type', hue='Sentiment', data=reviews_df)
plt.title("Review Type vs Sentiment")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze whether review length is associated with sentiment polarity.


##### 2. What is/are the insight(s) found from the chart?

Negative sentiment reviews are more common in long review categories.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Long negative reviews highlight serious customer issues requiring immediate action.


#### Chart - 8

In [None]:
# Chart - 8: Top Restaurants by Review Count
top_restaurants = reviews_df['Restaurant'].value_counts().head(10)

top_restaurants.plot(kind='bar')
plt.title("Top 10 Restaurants by Review Count")
plt.xlabel("Restaurant")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

To identify the most popular restaurants on the platform.


##### 2. What is/are the insight(s) found from the chart?

A small number of restaurants receive a disproportionately high number of reviews.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Popular restaurants drive platform traffic and should be prioritized for partnerships.


#### Chart - 9

In [None]:
# Chart - 9: Average Rating per Restaurant
avg_rating = reviews_df.groupby('Restaurant')['Rating'].mean().sort_values(ascending=False).head(10)

avg_rating.plot(kind='bar')
plt.title("Top 10 Restaurants by Average Rating")
plt.ylabel("Average Rating")
plt.show()


##### 1. Why did you pick the specific chart?

To identify consistently high-performing restaurants.


##### 2. What is/are the insight(s) found from the chart?

Some restaurants maintain high average ratings despite fewer reviews.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These restaurants can be promoted to improve customer trust and conversions.


#### Chart - 10

In [None]:
# Chart - 10: Rating vs Review Length vs Sentiment
sns.scatterplot(
    x='Rating',
    y='Review_Length',
    hue='Sentiment',
    data=reviews_df
)
plt.title("Rating vs Review Length vs Sentiment")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze combined relationships among rating, review length, and sentiment.


##### 2. What is/are the insight(s) found from the chart?

Negative sentiment reviews tend to be longer and associated with lower ratings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Early detection of long negative reviews can help reduce customer churn.


#### Chart - 11

In [None]:
# Chart - 11: Overall Rating Distribution (box plot)
sns.boxplot(x='Rating', data=reviews_df)
plt.title("Overall Rating Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To understand the spread, central tendency, and outliers in customer ratings.


##### 2. What is/are the insight(s) found from the chart?

Ratings are skewed towards higher values with few low-rating outliers.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High median ratings indicate strong platform trust, but outliers may indicate service failures.


#### Chart - 12

In [None]:
# Chart - 12: Rating vs Sentiment Count
sns.countplot(x='Rating', hue='Sentiment', data=reviews_df)
plt.title("Rating vs Sentiment Count")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze how sentiment aligns with explicit rating values.


##### 2. What is/are the insight(s) found from the chart?

Ratings of 4 and above correspond entirely to positive sentiment, as expected from the rating-based labeling rule.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This repeats the check from Chart 5: the labels are consistent with the ratings, which supports scalable review analysis.


#### Chart - 13

In [None]:
# Chart - 13: Restaurant vs Rating vs Review Length
top_rest = reviews_df['Restaurant'].value_counts().head(5).index
temp_df = reviews_df[reviews_df['Restaurant'].isin(top_rest)]

sns.scatterplot(
    x='Rating',
    y='Review_Length',
    hue='Restaurant',
    data=temp_df
)
plt.title("Top Restaurants: Rating vs Review Length")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze customer behavior patterns across top restaurants.


##### 2. What is/are the insight(s) found from the chart?

Some restaurants receive longer reviews even at high ratings, indicating engaged customers.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Highly engaging restaurants can be promoted for loyalty programs.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14: Correlation Heatmap
corr = reviews_df[['Rating', 'Review_Length']].corr()

sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

To identify relationships between numerical variables.


##### 2. What is/are the insight(s) found from the chart?

Weak correlation suggests rating is not strongly dependent on review length.
Sentiment must be derived from content, not length alone.


#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15: Pair Plot
sns.pairplot(reviews_df[['Rating', 'Review_Length']])
plt.show()


##### 1. Why did you pick the specific chart?

To visually analyze pairwise relationships among numerical variables.


##### 2. What is/are the insight(s) found from the chart?

No strong linear relationships observed, reinforcing need for NLP models.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Three hypotheses are formulated based on observed patterns in ratings, review length, and sentiment.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0 (Null Hypothesis): There is no significant difference in review length between positive and negative sentiment reviews.

H1 (Alternate Hypothesis): There is a significant difference in review length between positive and negative sentiment reviews.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

positive_reviews = reviews_df[reviews_df['Sentiment'] == 'Positive']['Review_Length']
negative_reviews = reviews_df[reviews_df['Sentiment'] == 'Negative']['Review_Length']

t_stat, p_value = ttest_ind(positive_reviews, negative_reviews)
t_stat, p_value


##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample T-Test


##### Why did you choose the specific statistical test?

The Independent T-Test is suitable because we are comparing the mean review length of two independent groups: positive sentiment reviews and negative sentiment reviews.
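
The default `ttest_ind` call assumes both groups have equal variance. When the positive and negative groups differ in size and spread, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below uses synthetic review lengths (assumed values, not the notebook's data) to show the call.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Synthetic review lengths (assumed values, for illustration only)
positive_len = rng.normal(loc=200, scale=50, size=500)
negative_len = rng.normal(loc=260, scale=120, size=150)

# equal_var=False selects Welch's t-test, which does not pool the variances
welch_t, welch_p = ttest_ind(positive_len, negative_len, equal_var=False)
print(f"t = {welch_t:.3f}, p = {welch_p:.4g}")

# Decision at the conventional 5% significance level
alpha = 0.05
reject_h0 = welch_p < alpha
```

If `reject_h0` is true, the null hypothesis of equal mean review length is rejected at the 5% level.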


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0 (Null Hypothesis): The average rating is the same across all restaurants.

H1 (Alternate Hypothesis): The average rating differs across restaurants.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

rating_groups = [group['Rating'].values for _, group in reviews_df.groupby('Restaurant')]
f_stat, p_value = f_oneway(*rating_groups)
f_stat, p_value


##### Which statistical test have you done to obtain P-Value?

One-Way ANOVA Test


##### Why did you choose the specific statistical test?

One-Way ANOVA is appropriate because we are comparing the mean ratings of more than two independent restaurant groups.


### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0 (Null Hypothesis): Customer sentiment is independent of rating.

H1 (Alternate Hypothesis): Customer sentiment is dependent on rating.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(reviews_df['Rating'], reviews_df['Sentiment'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
chi2, p_value


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence


##### Why did you choose the specific statistical test?

The Chi-Square test is suitable because both rating and sentiment are categorical variables, and we want to test dependency between them.
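
Because `Sentiment` is computed from `Rating` in this notebook, the two are dependent by construction. The same test is more informative for two variables that are not defined in terms of each other, for example review length category and sentiment. The sketch below uses hypothetical counts (the table values are assumptions) to show how the output is read.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts (assumed for illustration): review length category vs sentiment
toy_table = pd.DataFrame(
    {"Negative": [30, 20], "Positive": [10, 40]},
    index=["Long", "Short"],
)

# Returns the statistic, p-value, degrees of freedom, and expected counts under independence
chi2_stat, p_val, dof, expected = chi2_contingency(toy_table)
print(f"chi2 = {chi2_stat:.2f}, p = {p_val:.4g}, dof = {dof}")
print(expected)
```

A small p-value indicates that the observed counts deviate from what independence would predict.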


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
reviews_df.dropna(subset=['Review', 'Rating'], inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing values in critical columns like Review and Rating were removed to avoid incorrect model training and biased results.


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Checking outliers using IQR
Q1 = reviews_df['Review_Length'].quantile(0.25)
Q3 = reviews_df['Review_Length'].quantile(0.75)
IQR = Q3 - Q1

reviews_df = reviews_df[
    (reviews_df['Review_Length'] >= Q1 - 1.5 * IQR) &
    (reviews_df['Review_Length'] <= Q3 + 1.5 * IQR)
]


##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR (Interquartile Range) method was used to handle outliers as it is robust and effective for skewed distributions like review length.
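
To make the IQR rule concrete, here is a minimal sketch on a handful of made-up review lengths. The helper name `iqr_bounds` and the sample values are assumptions for illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up review lengths with one extreme value
lengths = np.array([10, 12, 14, 15, 18, 20, 22, 300])

lower, upper = iqr_bounds(lengths)
kept = lengths[(lengths >= lower) & (lengths <= upper)]
print(lower, upper, kept)
```

Here the quartiles are 13.5 and 20.5, so the fences sit at 3.0 and 31.0 and only the value 300 is filtered out.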


### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Label Encoding sentiment
reviews_df['Target'] = reviews_df['Rating'].apply(lambda x: 1 if x >= 4 else 0)


#### What all categorical encoding techniques have you used & why did you use those techniques?

The target variable was encoded as a binary label (1 for ratings of 4 and above, 0 otherwise) since the machine learning model requires a numerical target.


### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
!pip install contractions


In [None]:
import contractions
reviews_df['Review'] = reviews_df['Review'].astype(str).apply(contractions.fix)


#### 2. Lower Casing

In [None]:
# Lower Casing
reviews_df['Review'] = reviews_df['Review'].str.lower()



#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
reviews_df['Review'] = reviews_df['Review'].str.translate(str.maketrans('', '', string.punctuation))


#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs and any word that contains a digit
import re
reviews_df['Review'] = reviews_df['Review'].apply(lambda x: re.sub(r'http\S+|www\S+|\w*\d\w*', '', x))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

reviews_df['Review'] = reviews_df['Review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)


#### 6. Rephrase Text

In [None]:
# Not applied (optional step)


#### 7. Tokenization

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')



In [None]:
from nltk.tokenize import word_tokenize
reviews_df['Tokens'] = reviews_df['Review'].astype(str).apply(word_tokenize)


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

reviews_df['Review'] = reviews_df['Review'].apply(
    lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()])
)


##### Which text normalization technique have you used and why?

Lemmatization was used because it converts words to their meaningful base form without losing context.


#### 9. Part of speech tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')


In [None]:
import nltk
reviews_df['POS_Tags'] = reviews_df['Tokens'].apply(nltk.pos_tag)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(reviews_df['Review'])



##### Which text vectorization technique have you used and why?

TF-IDF was used as it captures word importance while reducing the impact of frequently occurring but less meaningful words.


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
reviews_df['Review_Length'] = reviews_df['Review'].apply(len)


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting


##### What all feature selection methods have you used  and why?

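No separate feature selection step is run in the cell above. The feature set is limited through the TF-IDF vectorizer, which keeps the 5,000 most frequent terms (`max_features=5000`).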


##### Which all features you found important and why?

The models are trained on the TF-IDF vectors only. Review length is kept for analysis, and rating is excluded from the inputs because the target label is derived from it.


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?
Log transformation was applied to reduce skewness in numerical features.


In [None]:
# Transform Your data
# Log transformation (if needed)
reviews_df['Review_Length'] = np.log1p(reviews_df['Review_Length'])


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.toarray())


##### Which method have you used to scale you data and why?
StandardScaler was used to standardize features and improve model convergence.
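
Calling `X.toarray()` materializes the full TF-IDF matrix in memory. A possible alternative, sketched here on a toy corpus (the corpus and variable names are assumptions), is `StandardScaler(with_mean=False)`, which scales each feature by its standard deviation without centering and therefore accepts sparse input.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up mini corpus (an assumption, for illustration only)
toy_reviews = [
    "great food and great service",
    "bad food and slow service",
    "lovely ambience",
    "cold food",
]

X_sparse = TfidfVectorizer().fit_transform(toy_reviews)

# with_mean=False skips centering, so the sparse structure is preserved
sparse_scaler = StandardScaler(with_mean=False)
X_sparse_scaled = sparse_scaler.fit_transform(X_sparse)

print(type(X_sparse_scaled), X_sparse_scaled.shape)
```

The output stays sparse, which matters when the vocabulary has thousands of columns.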


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

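Yes. The TF-IDF matrix has up to 5,000 columns, most of them sparse, so a lower-dimensional representation can make model training cheaper.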

In [None]:
# Dimensionality Reduction (If needed)
# Optional PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

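PCA retaining 95% of the variance (`n_components=0.95`) was applied to reduce computational cost while preserving most of the variance. Note that the resulting `X_pca` is not used by the models in this notebook, which are trained on `X_scaled`.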
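
For sparse TF-IDF matrices, `TruncatedSVD` is a commonly used alternative to PCA because it works on sparse input directly. This is a different technique from the PCA step in the notebook; the sketch below, on a made-up corpus with an arbitrary choice of two components, only shows the call pattern.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus (an assumption, for illustration only)
toy_reviews = [
    "great food great service",
    "bad food slow service",
    "lovely ambience and music",
    "cold food rude staff",
    "tasty biryani quick delivery",
    "stale biryani late delivery",
]

X_toy = TfidfVectorizer().fit_transform(toy_reviews)

# n_components=2 is an arbitrary choice for the demo, not a tuned value
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_toy)

explained = svd.explained_variance_ratio_.sum()
print(X_reduced.shape, round(float(explained), 3))
```

On real data the number of components would be chosen by inspecting the cumulative explained variance.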


### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, reviews_df['Target'], test_size=0.2, random_state=42
)


##### What data splitting ratio have you used and why?

An 80-20 split provides sufficient training data while maintaining a reliable test set.
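
One point to watch with any split ratio is that the vectorizer and scaler above are fitted on the full dataset before splitting, so test-set vocabulary and statistics can leak into training. A possible way to avoid this, sketched on a made-up corpus (the texts, labels, and names such as `text_clf` are assumptions), is to split the raw text first and wrap the steps in a scikit-learn `Pipeline`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = ["great food", "loved the service", "tasty and fresh", "amazing ambience",
         "bad food", "slow service", "cold and stale", "rude staff"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split raw text first; stratify keeps the class ratio equal in both parts
train_texts, test_texts, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The vectorizer is fitted on the training text only when the pipeline is fitted
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(train_texts, y_tr)

test_accuracy = text_clf.score(test_texts, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

With this layout the test reviews are transformed using vocabulary and IDF weights learned from the training reviews alone.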


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, positive reviews are more frequent than negative reviews, causing class imbalance.


In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

SMOTE was used to balance the dataset by synthetically generating minority class samples.
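
An alternative to oversampling, not used in this notebook, is to weight classes inversely to their frequency, for example through `class_weight='balanced'` in `LogisticRegression`. The sketch below computes those weights for a made-up label vector (the 8:2 ratio is an assumption) so the effect is visible.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 8 positive, 2 negative
y_toy = np.array([1] * 8 + [0] * 2)
classes = np.array([0, 1])

# 'balanced' uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)
```

The minority class receives weight 10 / (2 * 2) = 2.5 and the majority class 10 / (2 * 8) = 0.625, so errors on negative reviews count four times as much during training.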


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_res, y_train_res)

y_pred_lr = lr.predict(X_test)

lr_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_lr),
    "Precision": precision_score(y_test, y_pred_lr),
    "Recall": recall_score(y_test, y_pred_lr),
    "F1-Score": f1_score(y_test, y_pred_lr)
}

lr_metrics


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

plt.bar(lr_metrics.keys(), lr_metrics.values())
plt.title("Logistic Regression Evaluation Metrics")
plt.show()
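
The metrics in the bar chart all derive from the confusion matrix. As a hedged illustration on made-up labels (the `y_true_toy` and `y_pred_toy` vectors are assumptions, not model output), the sketch below computes precision, recall, and F1 by hand and checks them against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions: 1 = positive, 0 = negative
y_true_toy = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In business terms, low precision means negative reviews are being mistaken for positive ones, while low recall means genuinely positive reviews are being flagged as negative.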


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10]
}

grid_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='f1')
grid_lr.fit(X_train_res, y_train_res)

best_lr = grid_lr.best_estimator_



##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used because Logistic Regression has few hyperparameters, so an exhaustive search over a small grid of `C` values is cheap and finds the best-scoring value within that grid.


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

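F1-score was observed to improve after tuning, indicating a better balance between precision and recall. To document the improvement, `best_lr` should be scored on `X_test` and compared with `lr_metrics`; the tuning cell selects the best `C` by cross-validated F1 but does not compute that comparison.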
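
One way to check whether tuning helped is to score the untuned and the tuned model on the same held-out test set. The sketch below does this on synthetic data from `make_classification`, which stands in for the TF-IDF features as an assumption; the names `baseline`, `search`, and `f1_comparison` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced data standing in for the real features
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, weights=[0.3, 0.7], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Untuned baseline
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Grid search over C, selected by cross-validated F1 on the training split only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_tr, y_tr)

# Compare both models on the same untouched test set
f1_comparison = {
    "baseline": f1_score(y_te, baseline.predict(X_te)),
    "tuned": f1_score(y_te, search.best_estimator_.predict(X_te)),
}
print(search.best_params_, f1_comparison)
```

The tuned score is not guaranteed to be higher; reporting both numbers side by side is what supports or refutes the claim of improvement.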


### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_nb = TfidfVectorizer(max_features=5000)
X_nb = tfidf_nb.fit_transform(reviews_df['Review'])

y = reviews_df['Target']


In [None]:
from sklearn.model_selection import train_test_split

X_train_nb, X_test_nb, y_train_nb, y_test_nb = train_test_split(
    X_nb, y, test_size=0.2, random_state=42
)


In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_nb_res, y_train_nb_res = smote.fit_resample(X_train_nb, y_train_nb)


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

nb = MultinomialNB()
nb.fit(X_train_nb_res, y_train_nb_res)

y_pred_nb = nb.predict(X_test_nb)

nb_metrics = {
    "Accuracy": accuracy_score(y_test_nb, y_pred_nb),
    "Precision": precision_score(y_test_nb, y_pred_nb),
    "Recall": recall_score(y_test_nb, y_pred_nb),
    "F1-Score": f1_score(y_test_nb, y_pred_nb)
}

nb_metrics


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# TF-IDF (NO scaling)
tfidf_nb = TfidfVectorizer(max_features=5000)
X_nb = tfidf_nb.fit_transform(reviews_df['Review'])
y = reviews_df['Target']

# Split
X_train_nb, X_test_nb, y_train_nb, y_test_nb = train_test_split(
    X_nb, y, test_size=0.2, random_state=42
)

# SMOTE (safe for NB)
smote = SMOTE(random_state=42)
X_train_nb_res, y_train_nb_res = smote.fit_resample(X_train_nb, y_train_nb)


In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

param_grid_nb = {
    'alpha': [0.1, 0.5, 1.0]
}

grid_nb = GridSearchCV(
    MultinomialNB(),
    param_grid_nb,
    cv=5,
    scoring='f1'
)

grid_nb.fit(X_train_nb_res, y_train_nb_res)

best_nb = grid_nb.best_estimator_
best_nb
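
One caveat with the search above: the TF-IDF vocabulary is fit on all reviews before the split, so each cross-validation fold has already seen term statistics from its own validation data. A hedged alternative is to wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is refit inside every fold. The sketch below uses a tiny made-up corpus (`toy_reviews`, `toy_labels`) and hypothetical names; it omits SMOTE, which would need a resampling-aware pipeline such as the one provided by imbalanced-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up reviews for illustration only (1 = positive, 0 = negative).
toy_reviews = [
    "great food and friendly staff",
    "amazing taste and great ambience",
    "loved the biryani great value",
    "excellent service and tasty food",
    "wonderful place amazing desserts",
    "tasty starters and friendly service",
    "terrible service and cold food",
    "rude staff and bad experience",
    "bland food terrible taste",
    "worst biryani cold and stale",
    "bad ambience and slow service",
    "stale desserts and rude waiter",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are fit together, fold by fold.
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
grid_nb_demo = GridSearchCV(
    nb_pipeline, {"nb__alpha": [0.1, 0.5, 1.0]}, cv=3, scoring="f1"
)
grid_nb_demo.fit(toy_reviews, toy_labels)

print("Best alpha:", grid_nb_demo.best_params_["nb__alpha"])
```

With this layout the raw review text goes straight into `fit` and `predict`, and no separate `fit_transform` step is needed.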


##### Which hyperparameter optimization technique have you used and why?

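GridSearchCV with 5-fold cross-validation was used to tune the smoothing parameter `alpha`. Multinomial Naive Bayes trains quickly and has essentially one hyperparameter worth tuning, so an exhaustive search over three candidate values (0.1, 0.5, 1.0) is inexpensive. Naive Bayes itself is highly efficient for text classification tasks due to its probabilistic nature and independence assumption.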


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, smoothing parameter tuning reduced overfitting and improved recall.
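
For reference, the smoothing parameter `alpha` enters the Multinomial Naive Bayes estimate of how likely term $w$ is within class $c$. The notation below ($N_{cw}$, $N_c$, $|V|$) is introduced here for illustration:

$$
\hat{\theta}_{cw} = \frac{N_{cw} + \alpha}{N_{c} + \alpha\,|V|}
$$

Here $N_{cw}$ is the total TF-IDF weight of term $w$ over all training reviews of class $c$, $N_c = \sum_{w} N_{cw}$ is the total weight of all terms in that class, and $|V|$ is the number of features (at most 5000 with the vectorizer settings used above). Setting $\alpha = 1$ gives Laplace smoothing, while $\alpha < 1$ gives Lidstone smoothing. A larger $\alpha$ pulls every estimate toward the uniform value $1/|V|$, which dampens the influence of rare terms; a smaller $\alpha$ lets the observed counts dominate.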


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

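Accuracy is the share of all reviews classified correctly; it is easy to communicate, but on imbalanced data it can look high even when the minority class is mostly missed. Precision is the share of reviews predicted as the positive class that truly belong to it; low precision means the business acts on sentiment signals that are wrong. Recall is the share of reviews that truly belong to the positive class that the model finds; low recall means relevant reviews go unnoticed. F1-score is the harmonic mean of precision and recall and is high only when both are, which makes it a suitable single summary for imbalanced review data.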

### ML Model - 3

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_res, y_train_res)

y_pred_rf = rf.predict(X_test)

rf_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_rf),
    "Precision": precision_score(y_test, y_pred_rf),
    "Recall": recall_score(y_test, y_pred_rf),
    "F1-Score": f1_score(y_test, y_pred_rf)
}

rf_metrics



#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20]
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid_rf, cv=3, scoring='f1')
grid_rf.fit(X_train_res, y_train_res)

best_rf = grid_rf.best_estimator_


##### Which hyperparameter optimization technique have you used and why?

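GridSearchCV with 3-fold cross-validation was used to tune `n_estimators` and `max_depth`. Random Forests benefit from tuning both: limiting tree depth curbs overfitting, while more trees stabilize the averaged prediction. The grid has six combinations, and using three folds rather than five reduces the number of forests that must be trained.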


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

F1-score was prioritized because the dataset is imbalanced and both false positives and false negatives impact business decisions.
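
A small worked example shows why accuracy alone can mislead on imbalanced data. The confusion-matrix counts below are hypothetical and chosen only to illustrate the arithmetic; `binary_metrics` is an illustrative helper, not part of the notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 reviews, only 15 of them in the class of interest.
metrics_demo = binary_metrics(tp=5, fp=5, fn=10, tn=80)
for name, value in metrics_demo.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.85, yet only 5 of the 15 minority-class reviews are found (recall of one third) and F1 is 0.40, which is the gap the F1-score is meant to expose.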


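### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Logistic Regression was selected as the final model due to its high F1-score, stability, faster inference, and interpretability.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Logistic Regression is a linear model: it assigns one coefficient to each TF-IDF term and passes the weighted sum through a sigmoid to obtain the probability of the positive class. The coefficients therefore act as a built-in explainability tool. Terms with large positive coefficients push a review toward the positive class, and terms with large negative coefficients push it toward the negative class.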


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib
joblib.dump(best_lr, "zomato_sentiment_model.pkl")


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_model = joblib.load("zomato_sentiment_model.pkl")
loaded_model.predict(X_test[:5])
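
Note that a saved classifier on its own can only score text that has already been vectorized with the same fitted TF-IDF vocabulary. A hedged sketch of one way to handle this is to persist the vectorizer and classifier together as a single `Pipeline`, so the reloaded object accepts raw review strings. The corpus, file name, and variable names below are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up reviews for illustration only (1 = positive, 0 = negative).
toy_reviews = [
    "great food and friendly staff",
    "terrible service and cold food",
    "amazing taste great ambience",
    "rude staff terrible experience",
    "great value amazing biryani",
    "cold and bland terrible",
]
toy_labels = [1, 0, 1, 0, 1, 0]

# Vectorizer and classifier travel together in one object.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(toy_reviews, toy_labels)

# Save to a temporary directory (hypothetical file name) and reload.
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline_demo.joblib")
joblib.dump(text_clf, model_path)
reloaded = joblib.load(model_path)

# Raw, previously unseen strings can be scored directly.
new_reviews = ["great biryani and friendly staff", "terrible cold food"]
predictions = reloaded.predict(new_reviews)
print(list(predictions))
```

The reloaded pipeline returns the same predictions as the in-memory one, which is the sanity check this section asks for.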


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we built an end-to-end machine learning pipeline for Zomato restaurant review sentiment analysis.
The project covered data cleaning, NLP preprocessing, feature engineering, model building, evaluation, and deployment readiness.
Logistic Regression emerged as the best-performing model with balanced performance and strong business applicability.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***