# **Project Name**    -
Exploratory Data Analysis of Zomato Reviews


##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**


The rapid growth of online food delivery platforms has significantly changed how customers choose restaurants. Among these platforms, Zomato plays a major role by providing restaurant information, ratings, and customer reviews that influence consumer decisions. The purpose of this project, ZomatoPulse, is to analyze restaurant reviews and metadata from Zomato to extract meaningful insights about customer preferences, restaurant performance, and overall dining trends using data analysis techniques.

This project uses two datasets: one containing restaurant reviews and another containing restaurant names along with their associated metadata such as location, cuisine type, ratings, and cost details. By combining these datasets, the project aims to perform a comprehensive exploratory data analysis (EDA) to understand patterns hidden within the data. The analysis focuses on understanding customer sentiment, identifying highly rated restaurants, observing trends across locations and cuisines, and studying how different factors influence customer ratings.

The first phase of the project involves data loading and preprocessing. This includes importing the datasets into the Python environment using libraries such as Pandas and NumPy. The data is inspected to understand its structure, size, and key attributes. Missing values, duplicate records, and inconsistencies are identified and handled appropriately to ensure data quality. Text-based review data is cleaned by removing unnecessary symbols, converting text to a consistent format, and preparing it for further analysis.
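The text-cleaning step described above can be sketched as follows. This is a minimal illustration, not the cleaning routine used later in the notebook: the function name `clean_review` and the sample sentence are hypothetical, and the choice to keep only letters and digits is an assumption.

```python
import re

def clean_review(text):
    """Lowercase a review, replace symbols with spaces, and collapse whitespace."""
    if not isinstance(text, str):
        return ""  # treat missing or non-text reviews as empty
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # anything that is not a letter, digit, or space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

sample = "  The Biryani was AMAZING!!! 10/10, would visit again :) "
print(clean_review(sample))
# the biryani was amazing 10 10 would visit again
```

Returning an empty string for non-text input keeps the function safe to apply to a column that still contains missing values.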

Once the data is cleaned, exploratory data analysis is conducted to uncover important insights. Descriptive statistics are used to understand rating distributions, review counts, and cost variations across restaurants. Visualizations are created using Matplotlib and Seaborn to identify trends such as the most popular cuisines, locations with the highest-rated restaurants, and relationships between pricing and customer satisfaction. These visual representations make it easier to interpret complex patterns in the data.

A key focus of the project is understanding customer sentiment through reviews. By analyzing review text, the project examines how positive and negative feedback correlates with ratings. Frequently used words and phrases in reviews are explored to understand what customers value most, such as food quality, service, ambiance, or pricing. This helps identify the strengths and weaknesses of restaurants from a customer perspective.
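One simple way to explore frequently used words is a plain word count. The sketch below uses made-up reviews and a tiny, hypothetical stop-word list (`STOP_WORDS`); a real analysis would likely use a fuller stop-word list and the actual review column.

```python
from collections import Counter
import re

# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"the", "was", "and", "a", "is", "but", "very", "to", "of"}

def top_words(reviews, n=3):
    """Return the n most frequent non-stop-words across a list of review strings."""
    counts = Counter()
    for review in reviews:
        words = re.findall(r"[a-z]+", review.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

sample_reviews = [
    "The food was great and the service was quick",
    "Great ambiance but the food was cold",
    "Service is slow, food is great",
]
print(top_words(sample_reviews))
```

In this toy sample, "food" and "great" each appear three times and "service" twice, which is the kind of signal that points to what customers talk about most.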

The insights derived from this analysis can be valuable for multiple stakeholders. Customers can use these insights to make informed dining choices. Restaurant owners can understand customer expectations and areas for improvement. Food delivery platforms like Zomato can leverage such analysis to enhance recommendation systems and improve user experience.

In conclusion, ZomatoPulse demonstrates how real-world data can be transformed into actionable insights using data analysis techniques. The project highlights the importance of data cleaning, exploration, and visualization in extracting meaningful information from large datasets. It also provides a strong foundation for future work, such as implementing machine learning models for sentiment classification or building recommendation systems. Overall, this project showcases practical applications of data analytics in the food and restaurant industry and reinforces the value of data-driven decision-making.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


1. To analyze Zomato restaurant data and understand customer rating patterns.

2. To identify positive and negative sentiments from customer reviews.

3. To clean and preprocess raw restaurant review text for analysis.

4. To perform exploratory data analysis to find trends in ratings and reviews.

5. To convert textual reviews into numerical features for machine learning.

6. To build and compare multiple machine learning models for sentiment classification.

7. To evaluate model performance using suitable evaluation metrics.

8. To use customer feedback insights to support better business decisions.
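For the sentiment objectives, one common starting point is to derive a label from the numeric rating. The sketch below is only an illustration: the function `label_sentiment`, the cutoff of 3.5, and the sample values (including the non-numeric entry) are assumptions, not values taken from the dataset.

```python
import pandas as pd

def label_sentiment(rating, threshold=3.5):
    """Map a numeric rating to a sentiment label; the threshold is an assumed cutoff."""
    if pd.isna(rating):
        return None  # no label when the rating is missing or was not numeric
    return "positive" if rating >= threshold else "negative"

# Hypothetical raw ratings stored as text, with one non-numeric entry
raw = pd.Series(["5", "2.5", "Like", "4"])
ratings = pd.to_numeric(raw, errors="coerce")  # "Like" becomes NaN
labels = ratings.map(label_sentiment)
print(labels)
```

Rows without a usable rating receive no label, so they can be dropped before any model is trained.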

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
from google.colab import files
files.upload()


In [None]:
# Load Dataset

reviews = pd.read_csv("Zomato Restaurant reviews.csv")
metadata = pd.read_csv("Zomato Restaurant names and Metadata.csv")
# create working copies for analysis
reviews_df = reviews.copy()
metadata_df = metadata.copy()


print("Reviews dataset loaded successfully")
print("Metadata dataset loaded successfully")


### Dataset First View

In [None]:
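# Dataset First Look
from IPython.display import display

# A notebook cell only renders its last expression, so display() is used
# to show the preview of both datasets
print("Reviews Dataset")
print("Shape:", reviews.shape)
display(reviews.head())

print("\nMetadata Dataset")
print("Shape:", metadata.shape)
display(metadata.head())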


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count


print("Reviews Dataset:")
print("Rows:", reviews.shape[0])
print("Columns:", reviews.shape[1])

print("\nMetadata Dataset:")
print("Rows:", metadata.shape[0])
print("Columns:", metadata.shape[1])


### Dataset Information

In [None]:
# Dataset Info
# Dataset Info - Reviews
reviews.info()
# Dataset Info - Metadata
metadata.info()



#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count


print("Duplicate rows in Reviews dataset:", reviews.duplicated().sum())
print("Duplicate rows in Metadata dataset:", metadata.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count


print("Missing values in Reviews dataset:\n")
print(reviews.isnull().sum())

print("\nMissing values in Metadata dataset:\n")
print(metadata.isnull().sum())


In [None]:
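# Visualizing the missing values
# Cap the sample size at the number of rows so the cell also runs on smaller files;
# random_state makes the sample reproducible
plt.figure(figsize=(10,4))
sns.heatmap(reviews.sample(min(200, len(reviews)), random_state=42).isnull(), cbar=False)
plt.title("Missing Values in Reviews Dataset (Sample)")
plt.show()

plt.figure(figsize=(10,4))
sns.heatmap(metadata.sample(min(100, len(metadata)), random_state=42).isnull(), cbar=False)
plt.title("Missing Values in Metadata Dataset (Sample)")
plt.show()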



### What did you know about your dataset?

The dataset consists of Zomato restaurant review data along with restaurant metadata. It contains both textual and numerical information such as customer reviews, ratings, restaurant names, cuisines, cost details, and location-related attributes. The reviews dataset mainly focuses on customer opinions and ratings, while the metadata dataset provides descriptive details about restaurants. Initial exploration shows that the reviews dataset is mostly complete, whereas the metadata dataset contains some missing values in columns like collections and timings, which is expected. Overall, the dataset is suitable for exploratory data analysis and further preprocessing.
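To put numbers on how complete each column is, the missing-value counts can be turned into percentages. The helper below is a small sketch on made-up data: `missing_summary` and the toy values are hypothetical, and only the column names mirror the metadata columns mentioned above.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Collections": ["Top Picks", np.nan, np.nan, np.nan],
    "Timings": ["11am-11pm", "12pm-10pm", np.nan, "24 hours"],
})
print(missing_summary(toy))
```

Calling `missing_summary(metadata)` and `missing_summary(reviews)` in the notebook would give the same table for the real datasets.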

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns


print("Reviews Dataset Columns:")
print(reviews.columns)

print("\nMetadata Dataset Columns:")
print(metadata.columns)


In [None]:
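# Dataset Describe
from IPython.display import display

# A notebook cell only renders its last expression, so display() is used
# to show the summary of both datasets
display(reviews.describe(include='all'))
display(metadata.describe(include='all'))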


### Variables Description

The Reviews dataset contains variables such as restaurant name, reviewer details, review text, ratings, review time, and pictures. These variables help understand customer sentiment and feedback. The Metadata dataset includes restaurant-level information such as name, cost, cuisines, collections, links, and timings. Textual variables are mostly categorical or unstructured, while ratings and cost-related attributes are numerical. Together, these variables provide a complete view of restaurant performance and customer experience.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable

print("Unique values in Reviews Dataset:")
for col in reviews.columns:
    print(f"{col}: {reviews[col].nunique()}")

print("\nUnique values in Metadata Dataset:")
for col in metadata.columns:
    print(f"{col}: {metadata[col].nunique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
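# Data Wrangling Code

# Remove duplicate rows
reviews.drop_duplicates(inplace=True)
metadata.drop_duplicates(inplace=True)

# Handle missing values in text columns only, so numeric columns keep their dtype
review_text_cols = reviews.select_dtypes(include='object').columns
reviews[review_text_cols] = reviews[review_text_cols].fillna("Not Available")

# Assign the result back instead of calling fillna(inplace=True) on a selected
# column, which is deprecated in recent pandas and may leave the DataFrame unchanged
metadata['Collections'] = metadata['Collections'].fillna("Not Listed")
metadata['Timings'] = metadata['Timings'].fillna("Not Available")

# Strip extra spaces from text columns
for col in reviews.select_dtypes(include='object').columns:
    reviews[col] = reviews[col].str.strip()

for col in metadata.select_dtypes(include='object').columns:
    metadata[col] = metadata[col].str.strip()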


### What all manipulations have you done and insights you found?

During data wrangling, duplicate records were identified and removed to avoid biased analysis. Missing values in textual columns were handled by replacing them with meaningful placeholders to maintain data consistency. Text columns were cleaned by removing unnecessary spaces to improve data quality. After cleaning, the datasets became more structured and reliable for analysis. An important insight observed was that many restaurants do not belong to collections and some reviews lack images, which is normal for real-world data. The cleaned dataset is now ready for exploratory data analysis and visualization.
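Since later charts join the two datasets on the restaurant name, it can help to check how well the names match after cleaning. The sketch below uses toy frames with hypothetical restaurant names; it relies on the `indicator=True` option of `pd.merge`, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy frames for illustration; the restaurant names are made up
reviews_toy = pd.DataFrame({
    "Restaurant": ["Cafe One", "Cafe One", "Spice Hub", "Lost Diner"],
    "Rating": [4.0, 5.0, 3.0, 2.0],
})
metadata_toy = pd.DataFrame({
    "Name": ["Cafe One", "Spice Hub", "New Place"],
    "Cost": [500, 800, 1200],
})

# An outer join with indicator=True labels each row as both / left_only / right_only
check = pd.merge(
    reviews_toy, metadata_toy,
    left_on="Restaurant", right_on="Name",
    how="outer", indicator=True,
)
print(check["_merge"].value_counts())

# Restaurants that have reviews but no metadata row
unmatched_reviews = check.loc[check["_merge"] == "left_only", "Restaurant"].unique()
print(unmatched_reviews)
```

A large `left_only` or `right_only` count on the real data would suggest that names differ between the two files, for example in spacing or capitalization.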

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
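# Chart - 1: Distribution of Ratings

# Rating may contain non-numeric entries (for example the "Not Available"
# placeholder added during wrangling), so convert to numeric first;
# such entries become NaN and are left out of the counts
rating_values = pd.to_numeric(reviews['Rating'], errors='coerce')

plt.figure(figsize=(8,5))
rating_values.value_counts().sort_index().plot(kind='bar')
plt.title("Distribution of Restaurant Ratings")
plt.xlabel("Rating")
plt.ylabel("Number of Reviews")
plt.show()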


##### 1. Why did you pick the specific chart?

The bar chart was selected to visualize the distribution of restaurant ratings because ratings are discrete numerical values. A bar chart clearly shows how frequently each rating occurs, making it easy to understand customer satisfaction levels and identify trends in user feedback.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most customer reviews are concentrated around higher ratings, particularly between 3.5 and 5. This indicates that a majority of customers have a positive dining experience. Lower ratings occur less frequently, suggesting fewer dissatisfied customers. Overall, the platform reflects a generally positive perception of restaurants listed on Zomato.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create a positive business impact. Since most restaurants receive higher ratings, Zomato can promote these restaurants to attract more users and increase engagement. Restaurants with consistently high ratings can be highlighted as top recommendations. However, restaurants with lower ratings may experience reduced visibility, which could negatively impact their growth. This insight encourages restaurants to improve service quality to maintain competitiveness on the platform.
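The share of reviews at each rating level can also be computed directly, which gives a number to go with the visual impression. This is a sketch on made-up ratings; the cutoff of 4 for a "high" rating is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
ratings = pd.Series([5, 4, 4, 3, 5, 1, 4.5, 5])

# Percentage of reviews at each rating level
share = ratings.value_counts(normalize=True).sort_index() * 100
print(share.round(1))

# Percentage of reviews rated 4 or above (assumed cutoff)
high_share = (ratings >= 4).mean() * 100
print(f"Share of ratings >= 4: {high_share:.1f}%")
```

On the real data, the same two lines can be applied to the numeric rating column after dropping missing values.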

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart - 2: Top 10 Popular Cuisines

cuisine_counts = (
    metadata['Cuisines']
    .dropna()
    .str.split(',')
    .explode()
    .str.strip()
    .value_counts()
    .head(10)
)

plt.figure(figsize=(8,5))
cuisine_counts.plot(kind='bar')
plt.title("Top 10 Popular Cuisines")
plt.xlabel("Cuisine")
plt.ylabel("Number of Restaurants")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because cuisine is a categorical variable and the goal is to compare how many restaurants offer each cuisine. Sorting the bars by count makes the ranking of the ten most common cuisines easy to read.

##### 2. What is/are the insight(s) found from the chart?

The chart ranks the ten cuisines offered by the largest number of restaurants. Because a restaurant can list several cuisines, each listed cuisine is counted once per restaurant, so the bars measure how widely a cuisine is available rather than how often it is ordered.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the cuisine popularity chart can help create a positive business impact. By identifying the most popular cuisines, Zomato can promote high-demand food categories, optimize restaurant recommendations, and guide new restaurants on which cuisines have higher customer interest. This can increase customer engagement and order frequency.

However, cuisines with lower popularity may experience slower growth due to reduced visibility on the platform. If not managed carefully, this could discourage niche or regional cuisine restaurants. Strategic promotion of less popular cuisines can help balance growth.

#### Chart - 3

In [None]:
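# Chart - 3 visualization code
# Chart - 3: Cost vs Rating

merged_data = pd.merge(
    reviews[['Restaurant', 'Rating']],
    metadata[['Name', 'Cost']],
    left_on='Restaurant',
    right_on='Name',
    how='inner'
)

# Rating and Cost may still be stored as text at this point. Cost is assumed
# to possibly contain thousands separators (for example "1,200"), which are
# removed before conversion; values that cannot be parsed become NaN
merged_data['Rating'] = pd.to_numeric(merged_data['Rating'], errors='coerce')
merged_data['Cost'] = pd.to_numeric(
    merged_data['Cost'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)

merged_data = merged_data.dropna(subset=['Rating', 'Cost'])

plt.figure(figsize=(8,5))
plt.scatter(merged_data['Cost'], merged_data['Rating'], alpha=0.5)
plt.title("Cost vs Rating Relationship")
plt.xlabel("Average Cost")
plt.ylabel("Rating")
plt.show()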


##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it is best suited for analyzing the relationship between two numerical variables. It helps visualize whether restaurant pricing has any influence on customer ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that higher restaurant cost does not necessarily result in higher ratings. Many moderately priced restaurants receive high ratings, indicating that customers value quality and service over price alone. This suggests that affordability combined with good experience drives customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight supports positive business impact by highlighting that affordable restaurants can perform just as well as expensive ones. This allows Zomato to promote value-for-money restaurants and attract a wider customer base. However, overpriced restaurants with low ratings may see reduced demand, which could negatively impact their growth unless improvements are made.
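The visual impression from the scatter plot can be backed by a correlation coefficient. The sketch below uses toy data and computes a Pearson correlation on per-restaurant averages, so restaurants with many reviews do not dominate. The aggregation step is a choice made here for illustration, not something the notebook does.

```python
import pandas as pd

# Toy merged data for illustration; each row is one review
merged_toy = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B", "C", "C"],
    "Cost": [300, 300, 800, 800, 1500, 1500],
    "Rating": [4.0, 5.0, 4.0, 4.0, 3.0, 4.0],
})

# One row per restaurant: its cost and its mean rating
per_restaurant = merged_toy.groupby("Restaurant").agg(
    Cost=("Cost", "first"),
    Avg_Rating=("Rating", "mean"),
)

# Pearson correlation (the default for Series.corr)
r = per_restaurant["Cost"].corr(per_restaurant["Avg_Rating"])
print(f"Correlation between cost and average rating: {r:.2f}")
```

A value near zero on the real data would support the reading that price has little linear relationship with ratings, while the toy data here is deliberately built to give a strongly negative value.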

#### Chart - 4

In [None]:
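# Chart - 4 visualization code
# Chart - 4: Location-wise Restaurant Count (Top 10)

# 'Links' appears to hold one URL per restaurant, so grouping on the raw link
# would give a count of 1 for every group.
# Assumption: links follow the pattern https://www.zomato.com/<location>/<restaurant>,
# so the first path segment is used as the location. Adjust the pattern if the
# links in the file look different.
location = metadata['Links'].str.extract(r'zomato\.com/([^/]+)/', expand=False)
location_counts = location.value_counts().head(10)

plt.figure(figsize=(8,5))
location_counts.plot(kind='bar')
plt.title("Top Locations with Highest Number of Restaurants")
plt.xlabel("Location")
plt.ylabel("Number of Restaurants")
plt.show()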


##### 1. Why did you pick the specific chart?

A bar chart was chosen to compare the number of restaurants across different locations. It clearly highlights areas with high restaurant density and helps identify popular food hubs.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that certain locations have a significantly higher concentration of restaurants. These areas are likely commercial or high-demand zones where customer traffic is strong and competition among restaurants is high.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help Zomato focus marketing efforts on high-demand locations and optimize delivery coverage. However, locations with fewer restaurants may experience slower growth due to limited options, which may require targeted expansion strategies.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Convert Rating column to numeric
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')
# Chart - 5: Top Rated Restaurants (Fixed)

top_rated = (
    reviews
    .dropna(subset=['Rating'])
    .groupby('Restaurant')['Rating']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

plt.figure(figsize=(8,5))
top_rated.plot(kind='bar')
plt.title("Top 10 Restaurants by Average Rating")
plt.xlabel("Restaurant")
plt.ylabel("Average Rating")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it compares a single numerical measure, the average rating, across a small set of categories, the restaurants. Sorting the bars in descending order makes the top performers easy to identify.

##### 2. What is/are the insight(s) found from the chart?

The chart ranks the ten restaurants with the highest average rating. Because the ranking uses a simple mean, a restaurant with only a few reviews can reach the top with a handful of high scores, so the ranking is best read together with the number of reviews each restaurant has received.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight can help create a positive business impact by identifying restaurants that Zomato can feature in recommendations and curated lists. However, ranking on average rating alone may favor restaurants with very few reviews and push down established restaurants with slightly lower averages, which could negatively affect their visibility and growth unless review volume is also considered.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Chart - 6: Reviews over Time

reviews['Time'] = pd.to_datetime(reviews['Time'], errors='coerce')

reviews_over_time = reviews.groupby(reviews['Time'].dt.date).size()

plt.figure(figsize=(10,5))
plt.plot(reviews_over_time.index, reviews_over_time.values)
plt.title("Number of Reviews Over Time")
plt.xlabel("Date")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen because it is best suited for showing trends over time. It helps in understanding how customer review activity changes across different periods.

##### 2. What is/are the insight(s) found from the chart?

The chart shows fluctuations in the number of reviews over time, indicating varying levels of customer engagement. Certain periods show higher activity, which may be linked to weekends, festive seasons, or increased restaurant usage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help businesses identify high-engagement periods and plan promotions accordingly. A declining trend in reviews may indicate reduced customer interaction, which could negatively impact growth if not addressed through engagement strategies.
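Daily counts can be noisy, so grouping reviews by month is one way to make the trend easier to read. The sketch below uses made-up timestamps, including one invalid entry to show how `errors="coerce"` handles it.

```python
import pandas as pd

# Hypothetical review timestamps; the last entry is deliberately invalid
times = pd.to_datetime(
    pd.Series([
        "2019-05-25 15:54",
        "2019-05-25 20:10",
        "2019-05-26 09:00",
        "2019-06-01 12:30",
        "not a date",
    ]),
    errors="coerce",  # unparseable values become NaT
)

# Count reviews per calendar month
monthly = times.dropna().dt.to_period("M").value_counts().sort_index()
print(monthly)
```

The resulting series can be plotted with `monthly.plot(kind="line")` in the same way as the daily counts.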

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Chart - 7: Rating Distribution

plt.figure(figsize=(8,5))
plt.hist(reviews['Rating'].dropna(), bins=5)
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was selected to understand the distribution of ratings across all reviews. It helps visualize how ratings are spread and whether customers generally give higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most ratings are concentrated in the higher range, indicating overall positive customer sentiment. Very low ratings occur less frequently, suggesting fewer negative experiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight supports positive business impact by confirming that customers generally have good experiences on the platform. However, the presence of low ratings highlights areas where service or quality improvements are required to prevent customer dissatisfaction.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Chart - 8: Number of Reviews per Restaurant (Top 10)

review_count = reviews['Restaurant'].value_counts().head(10)

plt.figure(figsize=(8,5))
review_count.plot(kind='bar')
plt.title("Top 10 Restaurants by Number of Reviews")
plt.xlabel("Restaurant")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it allows easy comparison of review counts across restaurants. It clearly highlights which restaurants receive the most customer engagement.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that a few restaurants receive a significantly higher number of reviews compared to others. This indicates higher popularity, better visibility, or frequent customer visits to these restaurants.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help Zomato identify highly engaging restaurants and promote them further to increase platform usage. However, restaurants with very few reviews may struggle with visibility, which could negatively affect their growth unless supported through targeted promotions.

#### Chart - 9

In [None]:
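# Chart - 9 visualization code
# Chart - 9: Cost Distribution

# Cost may still be stored as text at this point. It is assumed to possibly
# contain thousands separators (for example "1,200"), which are removed
# before conversion; values that cannot be parsed become NaN
cost_values = pd.to_numeric(
    metadata['Cost'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)

plt.figure(figsize=(8,5))
plt.hist(cost_values.dropna(), bins=10)
plt.title("Distribution of Restaurant Cost")
plt.xlabel("Average Cost")
plt.ylabel("Number of Restaurants")
plt.show()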


##### 1. Why did you pick the specific chart?

A histogram was chosen because it is suitable for understanding the distribution of numerical data. It helps visualize how restaurant costs are spread across different price ranges.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants fall within a mid-range cost bracket, while fewer restaurants are either very low-cost or very expensive. This indicates that moderate pricing is most common on the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps Zomato focus recommendations on price ranges preferred by most customers, improving user satisfaction. However, restaurants with very high costs may attract fewer customers, which could limit their growth unless they provide premium value.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
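# Convert Cost to numeric safely. Thousands separators (for example "1,200"),
# if present, are removed first; otherwise such values would be coerced to NaN
metadata['Cost'] = pd.to_numeric(
    metadata['Cost'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)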
# Chart - 10: Average Rating by Cost Category (Fixed)

# Convert Rating to numeric
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')

# Create cost categories with fixed bins
metadata['Cost_Category'] = pd.cut(
    metadata['Cost'],
    bins=[0, 300, 600, 1000, 5000],
    labels=['Low', 'Medium', 'High', 'Premium']
)

# Merge datasets
merged_cost_rating = pd.merge(
    reviews[['Restaurant', 'Rating']],
    metadata[['Name', 'Cost_Category']],
    left_on='Restaurant',
    right_on='Name',
    how='inner'
)

# Calculate average rating
avg_rating_cost = (
    merged_cost_rating
    .dropna(subset=['Rating', 'Cost_Category'])
    .groupby('Cost_Category')['Rating']
    .mean()
)

plt.figure(figsize=(8,5))
avg_rating_cost.plot(kind='bar')
plt.title("Average Rating by Cost Category")
plt.xlabel("Cost Category")
plt.ylabel("Average Rating")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it is effective for comparing average ratings across different cost categories. It clearly shows how customer satisfaction varies with pricing levels and allows easy comparison between categories.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that mid-range and high-cost restaurants often receive ratings similar to or even higher than premium restaurants. This indicates that higher pricing does not always guarantee better customer satisfaction and that customers value quality and experience over cost alone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact by enabling Zomato to recommend value-for-money restaurants that deliver high customer satisfaction. This improves user trust and engagement. However, premium restaurants with comparatively lower ratings may experience reduced customer interest, which could negatively affect their growth unless service quality is improved.
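The cost categories depend on how `pd.cut` treats the bin edges, so a small check of boundary values is useful. The sketch below reuses the bins and labels from the chart code on a few made-up costs.

```python
import pandas as pd

# Hypothetical costs chosen to sit on and around the bin edges
costs = pd.Series([150, 300, 301, 1000, 5000, 6000])

categories = pd.cut(
    costs,
    bins=[0, 300, 600, 1000, 5000],
    labels=["Low", "Medium", "High", "Premium"],
)
print(pd.DataFrame({"Cost": costs, "Cost_Category": categories}))
```

By default each bin includes its right edge, so a cost of 300 falls in "Low" and 301 in "Medium". Values above the last edge, such as 6000 here, receive no category and would be dropped by the later `dropna`, so the top edge should be at least as large as the highest cost in the data.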

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Chart - 11: Number of Reviews vs Average Rating

# Ensure Rating is numeric
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')

# Calculate review count and average rating per restaurant
review_rating = (
    reviews
    .dropna(subset=['Rating'])
    .groupby('Restaurant')
    .agg(
        Review_Count=('Rating', 'count'),
        Avg_Rating=('Rating', 'mean')
    )
)

plt.figure(figsize=(8,5))
plt.scatter(review_rating['Review_Count'], review_rating['Avg_Rating'], alpha=0.5)
plt.title("Number of Reviews vs Average Rating")
plt.xlabel("Number of Reviews")
plt.ylabel("Average Rating")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it is suitable for analyzing the relationship between two numerical variables. It helps understand how customer engagement, measured by the number of reviews, relates to average restaurant ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that restaurants with a higher number of reviews generally maintain stable average ratings. This indicates consistent performance and reliable customer satisfaction. Restaurants with very few reviews often show extreme ratings, which may not accurately represent overall quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact by identifying restaurants that consistently perform well even with high customer engagement. Zomato can prioritize such restaurants in recommendations. However, restaurants with low review counts and inconsistent ratings may struggle to gain customer trust, which could negatively affect their growth unless engagement increases.
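One way to reduce the effect of extreme ratings from restaurants with very few reviews is to pull each average toward the overall mean, with the pull getting weaker as the review count grows. This smoothing is not part of the analysis above; the function `smoothed_rating`, the `prior_weight` of 10, the global mean, and the restaurant names are all hypothetical.

```python
import pandas as pd

def smoothed_rating(review_count, avg_rating, global_mean, prior_weight=10):
    """Blend a restaurant's mean rating with the global mean.

    prior_weight is an assumed tuning constant: it acts like that many
    extra reviews rated at the global mean.
    """
    return (review_count * avg_rating + prior_weight * global_mean) / (
        review_count + prior_weight
    )

# Toy per-restaurant statistics, for illustration only
stats = pd.DataFrame(
    {"Review_Count": [2, 100], "Avg_Rating": [5.0, 4.4]},
    index=["Tiny Cafe", "Busy Bistro"],
)
global_mean = 3.6  # hypothetical overall mean rating

stats["Smoothed"] = smoothed_rating(
    stats["Review_Count"], stats["Avg_Rating"], global_mean
)
print(stats.round(2))
```

In this toy case the restaurant with two perfect reviews drops below the restaurant with a hundred reviews once smoothing is applied, which matches the caution about low review counts.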

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Chart - 12: Distribution of Reviews by Rating

reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')

rating_counts = reviews['Rating'].value_counts().sort_index()

plt.figure(figsize=(8,5))
rating_counts.plot(kind='bar')
plt.title("Number of Reviews by Rating")
plt.xlabel("Rating")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because ratings are discrete values. It clearly shows how reviews are distributed across different rating levels and helps understand overall customer sentiment.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that higher ratings have a greater number of reviews compared to lower ratings. This indicates that most customers tend to have positive experiences with restaurants listed on the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps Zomato highlight highly rated restaurants and improve recommendation quality. However, restaurants with consistently low ratings may face reduced customer interest, which could negatively impact their growth unless improvements are made.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Chart - 13: Number of Restaurants by Cost Category

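# Thousands separators (for example "1,200"), if present, are removed before
# conversion; otherwise such values would be coerced to NaN
metadata['Cost'] = pd.to_numeric(
    metadata['Cost'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)

metadata['Cost_Category'] = pd.cut(
    metadata['Cost'],
    bins=[0, 300, 600, 1000, 5000],
    labels=['Low', 'Medium', 'High', 'Premium']
)

# sort_index() keeps the bars in Low -> Premium order instead of ordering by count
cost_category_count = metadata['Cost_Category'].value_counts().sort_index()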

plt.figure(figsize=(8,5))
cost_category_count.plot(kind='bar')
plt.title("Number of Restaurants by Cost Category")
plt.xlabel("Cost Category")
plt.ylabel("Number of Restaurants")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to compare the number of restaurants across different cost categories. It provides a clear overview of how restaurants are distributed based on pricing.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most restaurants fall into the low and medium cost categories, while fewer restaurants belong to high and premium categories. This suggests that affordable and mid-range dining options dominate the platform.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help Zomato focus on the most common pricing segments and improve customer targeting. However, premium restaurants may face limited customer reach, which could slow growth unless they offer unique value propositions.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart - 14: Correlation Heatmap

# Join reviews to metadata on restaurant name so that each review row carries
# the cost of its own restaurant. Concatenating the two tables side by side
# would pair rows by position, which has no meaning here.
merged_numeric = pd.merge(
    reviews,
    metadata,
    left_on='Restaurant',
    right_on='Name',
    how='inner'
)

# Select numeric columns only
combined_numeric = merged_numeric.select_dtypes(include='number')

# Compute correlation
corr_matrix = combined_numeric.corr()

plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen because it effectively shows the strength and direction of relationships between numerical variables. It helps identify which variables are positively or negatively related.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows that most numerical variables have weak to moderate correlations with each other. Ratings do not show a strong correlation with cost, indicating that higher-priced restaurants do not necessarily receive higher ratings.
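
A correlation between rating and cost is only meaningful when each rating is paired with the cost of the same restaurant. The following sketch uses two tiny made-up tables (the column names `Restaurant`, `Name`, `Rating` and `Cost` follow the notebook; the values are hypothetical) to contrast pairing rows by position with pairing them by restaurant name.

```python
import pandas as pd

# Hypothetical review-level and restaurant-level tables
reviews_demo = pd.DataFrame(
    {"Restaurant": ["A", "A", "B"], "Rating": [5.0, 4.0, 2.0]}
)
metadata_demo = pd.DataFrame({"Name": ["B", "A"], "Cost": [300, 1200]})

# Column-wise concat aligns on the row index, not on the restaurant
by_position = pd.concat(
    [reviews_demo[["Rating"]], metadata_demo[["Cost"]]], axis=1
)

# A merge on the name pairs every review with its own restaurant's cost
by_name = reviews_demo.merge(
    metadata_demo, left_on="Restaurant", right_on="Name", how="inner"
)

print(by_position)
print(by_name[["Restaurant", "Rating", "Cost"]])
```

In the positional version the first review of restaurant A is paired with the cost of restaurant B, and the last row has no cost at all; the merged version avoids both problems.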

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Chart - 15: Pair Plot

# Use selected numeric columns for clarity
pairplot_data = combined_numeric.dropna()

sns.pairplot(pairplot_data)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot was chosen to visualize pairwise relationships between multiple numerical variables simultaneously. It helps in identifying patterns, trends, and potential correlations in the data.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows that most variables do not have strong linear relationships with each other. The distribution plots indicate how individual variables are spread, while scatter plots confirm that ratings are fairly independent of cost-related variables.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

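Based on the charts above, three statements are tested: (1) average ratings differ between low-cost and high-cost restaurants; (2) restaurant cost is related to customer rating; (3) the number of reviews a restaurant receives is related to its average rating.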

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant difference in average ratings between low-cost and high-cost restaurants.

Alternate Hypothesis (H₁): There is a significant difference in average ratings between low-cost and high-cost restaurants.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Ensure Rating and Cost are numeric
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')
metadata['Cost'] = pd.to_numeric(metadata['Cost'], errors='coerce')

# Merge datasets
merged = pd.merge(
    reviews[['Restaurant', 'Rating']],
    metadata[['Name', 'Cost']],
    left_on='Restaurant',
    right_on='Name',
    how='inner'
)

# Create cost groups
low_cost = merged[merged['Cost'] <= 300]['Rating'].dropna()
high_cost = merged[merged['Cost'] >= 1000]['Rating'].dropna()

# Perform t-test
t_stat, p_value = ttest_ind(low_cost, high_cost, equal_var=False)

t_stat, p_value


##### Which statistical test have you done to obtain P-Value?

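An independent two-sample t-test was performed to compare the mean ratings of low-cost and high-cost restaurants. Because the code sets `equal_var=False`, this is Welch's t-test, which does not assume equal variances in the two groups.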

##### Why did you choose the specific statistical test?

The independent t-test was chosen because the objective was to compare the mean ratings of two independent groups. Ratings are continuous numerical data, and the two cost categories are independent of each other, making this test appropriate.
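
To turn a p-value into a conclusion, it is compared with a significance level chosen in advance. The sketch below assumes a significance level of 0.05 (the notebook does not state one) and uses synthetic ratings rather than the Zomato data; `decide` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

ALPHA = 0.05  # assumed significance level, not taken from the notebook


def decide(p_value, alpha=ALPHA):
    """Return the test decision for a given p-value."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Synthetic ratings for two hypothetical cost groups, clipped to the 1-5 scale
rng = np.random.default_rng(0)
low_cost_demo = rng.normal(3.5, 1.0, 200).clip(1, 5)
high_cost_demo = rng.normal(3.9, 1.0, 200).clip(1, 5)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat_demo, p_value_demo = ttest_ind(
    low_cost_demo, high_cost_demo, equal_var=False
)

print(f"t = {t_stat_demo:.3f}, p = {p_value_demo:.4f}, decision: {decide(p_value_demo)}")
```

The same `decide` call can be applied to the `p_value` returned by each of the three tests in this section.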

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant relationship between restaurant cost and customer rating.

Alternate Hypothesis (H₁): There is a significant relationship between restaurant cost and customer rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Drop missing values
corr_data = merged.dropna(subset=['Cost', 'Rating'])

corr_coef, p_value = pearsonr(corr_data['Cost'], corr_data['Rating'])

corr_coef, p_value


##### Which statistical test have you done to obtain P-Value?

The Pearson correlation test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

The Pearson correlation test was chosen because both restaurant cost and rating are continuous numerical variables. The objective was to measure the strength and significance of the linear relationship between these two variables.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant relationship between the number of reviews received by a restaurant and its average rating.

Alternate Hypothesis (H₁): There is a significant relationship between the number of reviews received by a restaurant and its average rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import spearmanr

# Ensure Rating is numeric
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')

# Calculate review count and average rating per restaurant
review_stats = (
    reviews
    .dropna(subset=['Rating'])
    .groupby('Restaurant')
    .agg(
        Review_Count=('Rating', 'count'),
        Avg_Rating=('Rating', 'mean')
    )
)

# Perform Spearman correlation
rho, p_value = spearmanr(review_stats['Review_Count'], review_stats['Avg_Rating'])

rho, p_value


##### Which statistical test have you done to obtain P-Value?

The Spearman rank correlation test was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

Spearman correlation was chosen because review count data is not normally distributed and may contain outliers. This test is robust and suitable for identifying monotonic relationships.
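
The difference between the two correlation measures can be seen on a small example. The sketch below uses hypothetical review counts and ratings constructed so that the rating rises steadily but non-linearly with the count; it is not derived from the Zomato data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical, heavily skewed review counts
review_count_demo = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500, 1000])

# Ratings that increase monotonically but non-linearly with the count
avg_rating_demo = 3.0 + 0.5 * np.log10(review_count_demo)

pearson_r, _ = pearsonr(review_count_demo, avg_rating_demo)
spearman_rho, _ = spearmanr(review_count_demo, avg_rating_demo)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Spearman works on ranks, so a perfectly monotonic relationship gives a coefficient of 1 even when the raw values are far from a straight line, whereas Pearson is pulled down by the skew.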

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# checking missing values again (print is needed because this is not the last line of the cell)
print(reviews_df.isnull().sum())

# dropping rows with missing review or rating
# these rows are not useful for analysis
reviews_df = reviews_df.dropna(subset=['Review', 'Rating'])



#### What all missing value imputation techniques have you used and why did you use those techniques?

Rows with missing values in the Review and Rating columns were removed. These columns are very important for analysis and sentiment classification, so filling them with guessed values could give wrong results. Dropping such rows helps keep the data clean and reliable. This method is simple and suitable for this dataset.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# converting Rating to numeric first so that describe() reports numeric statistics
reviews_df['Rating'] = pd.to_numeric(reviews_df['Rating'], errors='coerce')

# checking basic statistics to understand rating range
print(reviews_df['Rating'].describe())

# removing invalid ratings (values that could not be converted are NaN and are dropped here too)
reviews_df = reviews_df[(reviews_df['Rating'] >= 0) & (reviews_df['Rating'] <= 5)]


##### What all outlier treatment techniques have you used and why did you use those techniques?



Outliers were handled by converting the Rating column to numeric format and then keeping only valid values between 0 and 5. Ratings outside this range are not realistic and may be caused by data errors. Removing such values helps avoid incorrect analysis and improves data reliability.



### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# encoding sentiment as numeric values
# positive -> 1, negative -> 0
reviews_df['Sentiment'] = reviews_df['Rating'].apply(
    lambda x: 1 if x >= 3.5 else 0
)

# checking encoded values
reviews_df['Sentiment'].value_counts()


#### What all categorical encoding techniques have you used & why did you use those techniques?

Sentiment was derived from the Rating column by thresholding: ratings of 3.5 and above are encoded as 1 (positive) and ratings below 3.5 as 0 (negative). This binary encoding is simple and suitable because sentiment has only two categories, and machine learning models require numeric input.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In [None]:
# creating Clean_Review column from original Review
reviews_df['Clean_Review'] = reviews_df['Review']


#### 1. Expand Contraction

In [None]:
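# Expand Contraction
import re

# expanding common contractions
contractions = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "didn't": "did not",
    "it's": "it is",
    "i'm": "i am",
    "isn't": "is not"
}

def expand_contractions(text):
    # match case-insensitively because lowercasing only happens in the next step
    for key, value in contractions.items():
        text = re.sub(re.escape(key), value, text, flags=re.IGNORECASE)
    return text

# work on Clean_Review so the result feeds into the later cleaning steps
reviews_df['Clean_Review'] = reviews_df['Clean_Review'].apply(expand_contractions)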


#### 2. Lower Casing

In [None]:
# Lower Casing
# converting text to lowercase (applied to Clean_Review, the column used by the later steps)
reviews_df['Clean_Review'] = reviews_df['Clean_Review'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# removing punctuation from text (applied to Clean_Review, the column used by the later steps)
reviews_df['Clean_Review'] = reviews_df['Clean_Review'].apply(
    lambda x: x.translate(str.maketrans('', '', string.punctuation))
)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words that contain digits
import re

# removing URLs
reviews_df['Clean_Review'] = reviews_df['Clean_Review'].apply(
    lambda x: re.sub(r'http\S+|www\S+', '', x)
)

# removing words that contain digits
reviews_df['Clean_Review'] = reviews_df['Clean_Review'].apply(
    lambda x: re.sub(r'\w*\d\w*', '', x)
)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

reviews_df['Clean_Review'] = reviews_df['Clean_Review'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)


In [None]:
# Remove White spaces
import re

# removing extra white spaces
reviews_df['Clean_Review'] = reviews_df['Clean_Review'].apply(
    lambda x: re.sub(r'\s+', ' ', x).strip()
)


#### 6. Rephrase Text

In [None]:
# Rephrase Text
# no automatic rephrasing is applied to avoid changing original meaning
reviews_df['Clean_Review'] = reviews_df['Clean_Review']


#### 7. Tokenization

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# download required resources
nltk.download('punkt')
nltk.download('punkt_tab')

# tokenizing the cleaned reviews
reviews_df['Tokens'] = reviews_df['Clean_Review'].apply(word_tokenize)

# preview tokens
reviews_df[['Clean_Review', 'Tokens']].head()


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

# applying lemmatization on tokens
reviews_df['Tokens'] = reviews_df['Tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens]
)


##### Which text normalization technique have you used and why?

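Lemmatization was used for text normalization. It maps words to their dictionary base form (for example, "dishes" to "dish"), which reduces vocabulary size while keeping tokens readable, unlike stemming, which can produce truncated non-words. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech argument is passed, so verb forms such as "served" are left unchanged in this cell.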

#### 9. Part of speech tagging

In [None]:
import nltk

# download required POS tagger resources
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

# applying part of speech tagging
reviews_df['POS_Tags'] = reviews_df['Tokens'].apply(nltk.pos_tag)

# preview POS tags
reviews_df[['Tokens', 'POS_Tags']].head()


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import CountVectorizer

# joining tokens back into text
reviews_df['Final_Text'] = reviews_df['Tokens'].apply(lambda x: ' '.join(x))

# applying count vectorization
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(reviews_df['Final_Text'])

y = reviews_df['Sentiment']

# checking shape of feature matrix (print is needed because this is not the last line of the cell)
print(X.shape)
# number of features created by vectorizer
len(vectorizer.get_feature_names_out())



##### Which text vectorization technique have you used and why?

Count Vectorization was used to convert text data into numerical form. It represents text based on word frequency, which is simple and easy to understand. This method is suitable for basic sentiment analysis and works well with models like Logistic Regression.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# limiting text features using max_features in vectorizer
# this helps reduce noise and improve model performance
X.shape


#### 2. Feature Selection

In [None]:
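# Select your features wisely to avoid overfitting
# refit the vectorizer with min_df so rare words are actually removed from X
vectorizer = CountVectorizer(max_features=5000, min_df=5)
X = vectorizer.fit_transform(reviews_df['Final_Text'])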


##### What all feature selection methods have you used  and why?

Feature selection was performed using the Count Vectorizer by limiting the number of features with the max_features parameter and removing rare words using min_df. This helps reduce noise, lowers dimensionality, and prevents overfitting while keeping important words for sentiment analysis.

##### Which all features you found important and why?

The most important features were the words extracted from customer reviews after text preprocessing. These words directly represent customer opinions and emotions, making them highly useful for predicting sentiment. Frequent and meaningful words contributed more to the model’s learning.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Transforming text length as a new feature
# this helps understand review size impact

reviews_df['review_length'] = reviews_df['Final_Text'].apply(len)

# checking transformed feature
reviews_df[['Final_Text', 'review_length']].head()



### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# scaling numerical feature (review length)
scaler = StandardScaler()
reviews_df['review_length_scaled'] = scaler.fit_transform(
    reviews_df[['review_length']]
)

# preview scaled feature
reviews_df[['review_length', 'review_length_scaled']].head()


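##### Which method have you used to scale your data and why?

StandardScaler was used to standardize `review_length` to zero mean and unit variance, so that the feature is on a comparable scale if it is combined with other numeric features.

### 7. Dimensionality Reduction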

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is optional here. Truncated SVD was applied to compress the up to 5,000 count features into 100 components (`X_reduced`), which can reduce noise and speed up model training while retaining most of the important information. Note that the train-test split and the models in the following sections use the full sparse matrix `X`, not `X_reduced`.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import TruncatedSVD

# applying dimensionality reduction on text features
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X)

# checking reduced feature shape
X_reduced.shape


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Truncated SVD was used for dimensionality reduction because it works well with sparse text data generated by vectorization techniques. It helps reduce feature size while preserving important patterns in the data.
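
How much information the reduced representation keeps can be estimated from the explained variance ratio. The sketch below runs on a random sparse matrix that merely stands in for a document-term matrix; the sizes and the number of components are arbitrary assumptions.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a document-term matrix (200 docs, 50 terms)
X_sparse_demo = sparse.random(200, 50, density=0.1, format="csr", random_state=0)

# TruncatedSVD accepts sparse input directly, without densifying it
svd_demo = TruncatedSVD(n_components=10, random_state=42)
X_reduced_demo = svd_demo.fit_transform(X_sparse_demo)

retained = svd_demo.explained_variance_ratio_.sum()
print(X_reduced_demo.shape)
print(f"Share of variance retained by 10 components: {retained:.2%}")
```

On the real data, the same sum computed from `svd.explained_variance_ratio_` indicates whether 100 components are enough or whether more are needed.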

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# checking split sizes
X_train.shape, X_test.shape


##### What data splitting ratio have you used and why?


An 80:20 train-test split was used, where 80% of the data was used for training and 20% for testing. This ratio provides enough data for model learning while keeping sufficient unseen data to evaluate model performance. Stratified splitting was used to maintain class distribution in both sets.
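
One point worth noting is that a vectorizer fitted before the split has already seen the test reviews. A common way to avoid this, sketched below under the assumption that raw text is split first, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so the vocabulary is learned from the training portion only. The texts and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews: 10 positive (1) and 10 negative (0)
texts_demo = ["great food and friendly service"] * 10 + ["cold food and rude service"] * 10
labels_demo = [1] * 10 + [0] * 10

# Split the raw text first, stratified so both sets keep the 50/50 class ratio
X_train_txt, X_test_txt, y_train_demo, y_test_demo = train_test_split(
    texts_demo, labels_demo, test_size=0.2, random_state=42, stratify=labels_demo
)

# The pipeline fits the vectorizer on the training text only
pipe_demo = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_train_txt, y_train_demo)

test_accuracy_demo = pipe_demo.score(X_test_txt, y_test_demo)
print(len(X_train_txt), len(X_test_txt), test_accuracy_demo)
```

The pipeline object can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside every cross-validation fold.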

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

The dataset shows slight imbalance between positive and negative sentiment classes. However, the imbalance is not severe, as both classes have sufficient samples for model training. Therefore, the dataset can still be used effectively without heavy imbalance correction.

In [None]:
# Handling Imbalanced Dataset (If needed)
# checking class distribution
reviews_df['Sentiment'].value_counts(normalize=True)


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Since the class imbalance was not significant, explicit resampling techniques such as SMOTE were not applied. Instead, stratified train-test splitting was used to preserve class distribution. This approach avoids introducing synthetic data and keeps the model simple and reliable.
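
If the imbalance turned out to be larger than expected, one alternative that also avoids synthetic data is class weighting. This was not applied in the notebook; the sketch below only shows how scikit-learn derives "balanced" weights, using a hypothetical 30/70 label split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 30 negative (0) and 70 positive (1) reviews
y_imbalanced = np.array([0] * 30 + [1] * 70)

# "balanced" weights are n_samples / (n_classes * class_count)
classes_demo = np.array([0, 1])
weights_demo = compute_class_weight(
    class_weight="balanced", classes=classes_demo, y=y_imbalanced
)

for cls, weight in zip(classes_demo, weights_demo):
    print(f"class {cls}: weight {weight:.3f}")
```

Passing `class_weight='balanced'` to `LogisticRegression` or `LinearSVC` applies the same weighting during training, so errors on the minority class cost more.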

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# initializing the model
lr_model = LogisticRegression(max_iter=1000)

# training the model
lr_model.fit(X_train, y_train)

# predicting on test data
y_pred = lr_model.predict(X_test)

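# evaluation: per-class precision, recall and F1, followed by overall accuracy
print(classification_report(y_test, y_pred))
accuracy = accuracy_score(y_test, y_pred)
accuracy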


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

metrics = ['Accuracy']
scores = [accuracy]

plt.figure(figsize=(5,4))
plt.bar(metrics, scores)
plt.title('Logistic Regression Model Performance')
plt.ylabel('Score')
plt.ylim(0,1)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.model_selection import GridSearchCV

# defining parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2']
}

# grid search
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring='accuracy'
)

# fitting grid search
grid.fit(X_train, y_train)

# best model
best_lr = grid.best_estimator_

# predictions using best model
y_pred_best = best_lr.predict(X_test)

# new accuracy
best_accuracy = accuracy_score(y_test, y_pred_best)
best_accuracy


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used for hyperparameter tuning because it systematically checks different parameter combinations and selects the best model based on performance. This helps improve accuracy and reduces the chances of overfitting.
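
After fitting, the grid search object exposes which setting won and how each candidate scored in cross-validation. The sketch below uses a synthetic classification dataset in place of the review features; the grid matches the one used for Logistic Regression in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized reviews
X_gs, y_gs = make_classification(n_samples=200, n_features=20, random_state=0)

c_values = [0.01, 0.1, 1, 10]
grid_demo = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": c_values},
    cv=5,
    scoring="accuracy",
)
grid_demo.fit(X_gs, y_gs)

# Best setting and its mean cross-validated accuracy
print(grid_demo.best_params_, round(grid_demo.best_score_, 3))

# Mean cross-validated accuracy for every candidate value of C
for c, score in zip(c_values, grid_demo.cv_results_["mean_test_score"]):
    print(f"C={c}: {score:.3f}")
```

Reporting `best_score_` alongside the test accuracy makes it clear whether a gain comes from the tuned hyperparameter or from chance variation on a single test split.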

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, a slight improvement in model performance was observed after hyperparameter tuning. The optimized Logistic Regression model achieved better accuracy compared to the default model, showing that tuning helped improve prediction quality.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation: Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# initializing the model
nb_model = MultinomialNB()

# training the model
nb_model.fit(X_train, y_train)

# predicting on test data
y_pred_nb = nb_model.predict(X_test)

# evaluating model
nb_accuracy = accuracy_score(y_test, y_pred_nb)
nb_accuracy


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# defining hyperparameter grid
# alpha is the additive smoothing strength; these candidate values are illustrative
param_grid_nb = {
    'alpha': [0.1, 0.5, 1.0, 2.0]
}

# initializing base model
nb = MultinomialNB()

# applying GridSearchCV
grid_search_nb = GridSearchCV(
    estimator=nb,
    param_grid=param_grid_nb,
    cv=5,
    scoring='accuracy'
)

# fitting the model
grid_search_nb.fit(X_train, y_train)

# getting best model
best_nb_model = grid_search_nb.best_estimator_

# predicting using tuned model
y_pred_nb_tuned = best_nb_model.predict(X_test)

# evaluating performance
tuned_nb_accuracy = accuracy_score(y_test, y_pred_nb_tuned)
tuned_nb_accuracy


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used for hyperparameter tuning because it systematically tests different parameter combinations and selects the best model using cross-validation. This helps improve model performance and supports better generalization on unseen data.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The effect of tuning is measured by comparing `tuned_nb_accuracy` with the baseline `nb_accuracy` from the previous cell. Only the smoothing parameter `alpha` is tuned for Naive Bayes, and the grid includes the default value of 1.0, so the selected model's cross-validated accuracy is at least as high as that of the default setting.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Accuracy shows how many customer reviews were correctly classified as positive or negative. Higher accuracy helps businesses trust the model’s predictions while analyzing customer feedback. Correct sentiment detection allows restaurants to identify problem areas from negative reviews and improve service quality. Precision ensures fewer incorrect sentiment predictions, reducing misleading insights. Overall, these metrics help businesses make better decisions based on customer opinions and improve customer satisfaction.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# initializing SVM model
svm_model = LinearSVC()

# training the model
svm_model.fit(X_train, y_train)

# predicting on test data
y_pred_svm = svm_model.predict(X_test)

# evaluating accuracy
svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_accuracy


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

models = ['Logistic Regression', 'Naive Bayes', 'SVM']
scores = [accuracy, nb_accuracy, svm_accuracy]

plt.figure(figsize=(7,4))
plt.bar(models, scores)
plt.title('Comparison of ML Models')
plt.ylabel('Accuracy')
plt.ylim(0,1)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.model_selection import GridSearchCV

# hyperparameter grid for SVM
param_grid_svm = {
    'C': [0.01, 0.1, 1, 10]
}

grid_svm = GridSearchCV(
    LinearSVC(),
    param_grid_svm,
    cv=5,
    scoring='accuracy'
)

# fitting grid search
grid_svm.fit(X_train, y_train)

# best model
best_svm = grid_svm.best_estimator_

# predictions using tuned model
y_pred_svm_best = best_svm.predict(X_test)

# updated accuracy
best_svm_accuracy = accuracy_score(y_test, y_pred_svm_best)
best_svm_accuracy


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used to tune the C parameter of the SVM model. It helps find the optimal regularization strength and improves the model’s performance through cross-validation.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, after hyperparameter tuning, the SVM model showed a slight improvement in accuracy. Tuning helped the model generalize better and reduced classification errors.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Accuracy was the primary evaluation metric considered because it shows how correctly customer reviews are classified into positive and negative sentiments. High accuracy ensures reliable understanding of customer opinions, which helps restaurants identify service gaps, improve customer satisfaction, and make better business decisions. Precision and recall were also considered to reduce incorrect sentiment predictions that could mislead business insights.
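
The three metrics mentioned above answer different questions, which a small worked example makes concrete. The labels below are invented so the numbers can be checked by hand: 4 true positives, 2 false negatives, 1 false positive and 3 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted sentiment labels (1 = positive, 0 = negative)
y_true_demo = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

acc_demo = accuracy_score(y_true_demo, y_pred_demo)    # (tp + tn) / total = 7 / 10
prec_demo = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp) = 4 / 5
rec_demo = recall_score(y_true_demo, y_pred_demo)      # tp / (tp + fn) = 4 / 6

print(f"accuracy={acc_demo:.2f} precision={prec_demo:.2f} recall={rec_demo:.2f}")
```

Here the model is right 70% of the time overall, 80% of the reviews it flags as positive really are positive, and it finds about 67% of all positive reviews. Which of these matters most depends on whether missed complaints or false alarms are more costly for the business.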

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Support Vector Machine (SVM) was chosen as the final prediction model because it achieved the highest accuracy among all the models tested. It handled high-dimensional text data effectively and showed better generalization on unseen reviews compared to Logistic Regression and Naive Bayes.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The final model used was Support Vector Machine (SVM). Feature importance was interpreted by analyzing the most influential words learned during text vectorization. Frequently occurring words with strong sentiment orientation contributed more to predictions. These words reflect customer emotions and opinions, making them important indicators for sentiment classification.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
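# Save the File
import joblib

# saving the best performing model (SVM)
joblib.dump(best_svm, 'svm_sentiment_model.pkl')

# saving the fitted vectorizer as well, since new text must be transformed with the same vocabulary
# ('count_vectorizer.pkl' is an illustrative file name)
joblib.dump(vectorizer, 'count_vectorizer.pkl')

print("Model and vectorizer saved successfully!")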


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
# loading the saved model
loaded_model = joblib.load('svm_sentiment_model.pkl')

# sample unseen reviews
sample_reviews = [
    "The food was amazing and service was excellent",
    "Very bad experience, food was cold and tasteless"
]

# vectorizing unseen data
sample_vectors = vectorizer.transform(sample_reviews)

# predicting sentiment
predictions = loaded_model.predict(sample_vectors)

list(zip(sample_reviews, predictions))
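
A saved classifier can only score new text if the vocabulary it was trained with is available too. One way to keep the two together, sketched below as an assumption rather than the notebook's approach, is to persist a single pipeline that contains both the vectorizer and the model. The training texts are made up and the file is written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical training set (1 = positive, 0 = negative)
train_texts = [
    "amazing food excellent service",
    "great taste amazing staff",
    "bad experience cold food",
    "tasteless food bad service",
]
train_labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together in one object
pipeline_demo = make_pipeline(CountVectorizer(), LinearSVC())
pipeline_demo.fit(train_texts, train_labels)

unseen = [
    "The food was amazing and service was excellent",
    "Very bad experience, food was cold and tasteless",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name used only for this round trip
    path = os.path.join(tmp_dir, "sentiment_pipeline.joblib")
    joblib.dump(pipeline_demo, path)
    restored = joblib.load(path)

# The restored pipeline vectorizes raw text itself before predicting
original_preds = pipeline_demo.predict(unseen)
restored_preds = restored.predict(unseen)
print(list(zip(unseen, restored_preds)))
```

Checking that the restored object reproduces the original predictions is a quick sanity test before deployment.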


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, Zomato restaurant review data was analyzed to understand customer opinions using data science techniques. The data was cleaned, explored using visualizations, and text preprocessing was performed to prepare customer reviews for sentiment analysis. This helped in understanding rating patterns and identifying positive and negative customer feedback.

Multiple machine learning models were implemented and compared, and Support Vector Machine (SVM) was selected as the final model due to its better performance. The model was able to classify reviews accurately and was saved for future deployment. This project shows how customer reviews can be used to improve restaurant services and customer satisfaction while providing practical experience in machine learning and real-world data handling.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***