# **Project Name**    - Zomato Restaurant Clustering



##### **Project Type**    - Unsupervised ML
##### **Contribution**    - Individual
##### **Team Member 1 -** Mustafiz Ahmed

# **Project Summary -**



This project focuses on extracting meaningful insights from customer-generated data on Zomato, a leading restaurant discovery and food delivery platform. In today’s competitive food service industry, data plays a crucial role in enhancing user experience and informing business strategies. Zomato’s vast repository of restaurant metadata and customer reviews presents a unique opportunity to analyze customer behavior, sentiment, and restaurant performance. The core objectives of this project were to perform sentiment analysis on customer reviews, cluster restaurants into distinct segments, and visualize the results to aid both customers and the Zomato team in decision-making.

To begin with, the project utilized two main datasets: one containing restaurant metadata (including name, cost, cuisines, and timing), and another containing customer reviews along with ratings, reviewer metadata, and timestamps. These datasets together provided a comprehensive view of both the service providers (restaurants) and the service consumers (reviewers). The project’s primary tools included Python libraries such as Pandas for data manipulation, NumPy for efficient computation, Matplotlib and Seaborn for visual exploration, and Scikit-learn for modeling and clustering.

The first part of the analysis focused on sentiment analysis. The reviews were cleaned and preprocessed, after which TextBlob, a popular rule-based natural language processing library, was applied to assign a sentiment polarity score to each review. These scores ranged from -1 (highly negative) to +1 (highly positive). Based on these scores, each review was labeled as Positive, Neutral, or Negative. The findings showed that over 58% of the reviews were positive, indicating a generally high level of customer satisfaction with Zomato-listed restaurants. Negative reviews accounted for about 10%, with the remaining being neutral. These results provided a valuable measure of customer experience beyond star ratings, capturing subtleties in language and tone.

Subsequently, the project explored various visualizations to analyze how sentiments correlated with ratings, restaurant costs, and cuisines. Bar charts, pie charts, and scatter plots were used to understand patterns such as which types of cuisines were most loved, which restaurants had the most polarized sentiments, and whether expensive restaurants justified their cost with higher sentiment scores. Additionally, reviewer metadata such as the number of reviews and followers was analyzed to identify influential critics whose opinions might sway public perception. These individuals could be targeted by Zomato for special engagement programs or early access features.

To segment the restaurant landscape further, clustering was performed using the KMeans algorithm from Scikit-learn. Key features used for clustering included average cost, cuisine type encoding, number of reviews, and sentiment scores. This clustering grouped restaurants into meaningful categories such as budget-friendly but popular eateries, premium restaurants with consistent positive feedback, and mid-tier establishments with mixed sentiment. These segments could help both Zomato and customers make more informed decisions — customers by discovering best-value spots, and Zomato by identifying clusters that need attention or promotion.

The business implications of this analysis are significant. Customers can now better identify restaurants that match their preferences, not just by rating or price but by overall sentiment and value. Zomato, on the other hand, can use the sentiment and cluster insights to improve quality assurance, enhance restaurant onboarding, and optimize marketing strategies. Moreover, identifying key reviewers adds another layer of value, potentially enabling the platform to implement credibility scoring or verified critic badges.

In conclusion, this project demonstrates how the combination of sentiment analysis, clustering, and visualization can provide deep insights into customer preferences and restaurant performance. By leveraging open data and basic machine learning techniques, the project offers a scalable framework for enhancing both user experience and business intelligence. With further expansion into real-time analysis and deep learning models, such an approach could become central to Zomato’s strategic decision-making and customer satisfaction efforts.

# **GitHub Link -**

https://github.com/MZ-314/Zomato-Restaurant-Clustering

# **Problem Statement**


In the competitive landscape of online food delivery and restaurant discovery, understanding customer preferences and enhancing user experience are critical for platforms like Zomato. With thousands of restaurants and millions of customer reviews, it becomes challenging for users to identify the best dining options and for the company to maintain quality standards across its listings.

Zomato lacks a refined mechanism to interpret the vast amount of unstructured textual data (customer reviews) and to segment restaurants meaningfully based on performance, cost, cuisine, and customer sentiment. As a result, users often rely solely on average ratings or popularity, which may not reflect the true customer experience. Similarly, Zomato may miss out on opportunities to improve its services, highlight underappreciated restaurants, or identify critical customer pain points.

The goal of this project is to analyze and extract insights from Zomato’s customer reviews and restaurant metadata using sentiment analysis and clustering techniques. By doing so, the project aims to help customers discover the best restaurants in their locality based on both cost and sentiment trends, and to provide Zomato with strategic business insights such as critic identification, value-for-money evaluations, and performance-based restaurant segmentation.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
metadata = pd.read_csv("Zomato Restaurant names and Metadata.csv")
reviews = pd.read_csv("Zomato Restaurant reviews.csv")

### Dataset First View

In [None]:
# View first few rows of each dataset
print("🔹 Restaurant Metadata (First 5 Rows):")
display(metadata.head())

print("🔹 Customer Reviews (First 5 Rows):")
display(reviews.head())

# Dataset shapes
print(f"Restaurant Metadata shape: {metadata.shape}")
print(f"Customer Reviews shape: {reviews.shape}")

# Check for missing values
print("\n🔍 Missing Values in Metadata:")
display(metadata.isnull().sum())

print("\n🔍 Missing Values in Reviews:")
display(reviews.isnull().sum())

### Dataset Rows & Columns count

In [None]:
# Shape of each dataset
print(f"📦 Restaurant Metadata: {metadata.shape[0]} rows × {metadata.shape[1]} columns")
print(f"📝 Customer Reviews: {reviews.shape[0]} rows × {reviews.shape[1]} columns")

### Dataset Information

In [None]:
# Detailed info of restaurant metadata
print("📘 Restaurant Metadata Info:")
metadata.info()

# Spacer
print("\n" + "="*50 + "\n")

# Detailed info of customer reviews
print("📗 Customer Reviews Info:")
reviews.info()

#### Duplicate Values

In [None]:
# Check for duplicate rows in restaurant metadata
metadata_duplicates = metadata.duplicated().sum()
print(f"🔁 Duplicate rows in Restaurant Metadata: {metadata_duplicates}")

# Check for duplicate rows in customer reviews
reviews_duplicates = reviews.duplicated().sum()
print(f"🔁 Duplicate rows in Customer Reviews: {reviews_duplicates}")

#### Missing Values/Null Values

In [None]:
# Missing values in Restaurant Metadata
print("🧾 Missing Values in Restaurant Metadata:")
missing_metadata = metadata.isnull().sum()
print(missing_metadata[missing_metadata > 0])

print("\n" + "="*50 + "\n")

# Missing values in Customer Reviews
print("🧾 Missing Values in Customer Reviews:")
missing_reviews = reviews.isnull().sum()
print(missing_reviews[missing_reviews > 0])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from textblob import TextBlob

# Function to get sentiment polarity
def get_sentiment(text):
    try:
        return TextBlob(str(text)).sentiment.polarity
    except:
        return 0.0

# Apply sentiment analysis to create a new column 'Sentiment_Score'
df['Sentiment_Score'] = df['Review'].apply(get_sentiment)

# Function to categorize sentiment based on polarity
def categorize_sentiment(score):
    if score > 0:
        return 'Positive'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Negative'

# Create 'Sentiment_Label' column
df['Sentiment_Label'] = df['Sentiment_Score'].apply(categorize_sentiment)


# Heatmap for metadata
plt.figure(figsize=(8, 4))
sns.heatmap(metadata.isnull(), cbar=False, cmap='Reds')
plt.title("Missing Values in Restaurant Metadata")
plt.show()

# Heatmap for reviews
plt.figure(figsize=(10, 4))
sns.heatmap(reviews.isnull(), cbar=False, cmap='Reds')
plt.title("Missing Values in Customer Reviews")
plt.show()

### What did you know about your dataset?

- The **Restaurant Metadata** dataset contains 514 rows and 6 columns, including information such as restaurant names, average cost for two, cuisine types, collection tags, timings, and direct Zomato links.

- The **Customer Reviews** dataset contains 10,000 rows and 7 columns. It includes details like restaurant names, reviewer names, written reviews, star ratings, timestamps, number of pictures uploaded, and a reviewer metadata field (e.g., number of reviews and followers).

- There are some **missing values** in both datasets, especially in fields like "Cuisines" and "Timings" in the metadata, and "Review" text in the reviews dataset.

- The **cost column** in the metadata dataset is recorded as strings with commas (e.g., "1,200") and needs to be cleaned and converted to numerical format for analysis.

- Some **restaurant names** are repeated across datasets and must be standardized (e.g., lowercase, stripped spaces) to allow for accurate merging and comparison.

- The **review text** is rich with sentiment and can be analyzed using natural language processing to understand customer satisfaction.

- A **small number of duplicate records** exist in both datasets, which should be handled before analysis.

- The **rating column** in the reviews dataset correlates with sentiment and can help validate the accuracy of sentiment analysis.

- The data also gives an opportunity to identify **influential reviewers** (critics) based on their number of reviews and followers, which can be useful for business insights.

- Overall, the dataset offers a good blend of structured (cost, ratings) and unstructured data (review text) suitable for both machine learning and business intelligence applications.


## ***2. Understanding Your Variables***

In [None]:
print("📘 Restaurant Metadata Columns:")
for col in metadata.columns:
    print(f"- {col}")

print("\n📗 Customer Reviews Columns:")
for col in reviews.columns:
    print(f"- {col}")

In [None]:
# Describe numerical columns in Restaurant Metadata
print("📊 Statistical Summary: Restaurant Metadata")
display(metadata.describe())

print("\n" + "="*60 + "\n")

# Describe numerical columns in Customer Reviews
print("📊 Statistical Summary: Customer Reviews")
display(reviews.describe())

### Variables Description

### 📘 Restaurant Metadata Variables:

| Column Name | Description |
|-------------|-------------|
| **Name** | Name of the restaurant |
| **Links** | Zomato webpage link of the restaurant |
| **Cost** | Average cost for two people (string format, e.g., "1,200") |
| **Collections** | Tags/themes the restaurant belongs to (e.g., "Best Bars") |
| **Cuisines** | List of cuisines offered (comma-separated string) |
| **Timings** | Operating hours of the restaurant (string format) |

---

### 📗 Customer Reviews Variables:

| Column Name | Description |
|-------------|-------------|
| **Restaurant** | Name of the restaurant being reviewed |
| **Reviewer** | Name of the person who posted the review |
| **Review** | Text content of the review |
| **Rating** | Star rating given by the reviewer (typically 1 to 5) |
| **Metadata** | Information about the reviewer (e.g., "3 Reviews, 5 Followers") |
| **Time** | Date and time when the review was posted |
| **Pictures** | Number of pictures uploaded with the review |


### Check Unique Values for each variable.

In [None]:
# Unique values in Restaurant Metadata
print("📘 Unique Values in Restaurant Metadata:")
for col in metadata.columns:
    print(f"- {col}: {metadata[col].nunique()} unique values")

print("\n" + "="*60 + "\n")

# Unique values in Customer Reviews
print("📗 Unique Values in Customer Reviews:")
for col in reviews.columns:
    print(f"- {col}: {reviews[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# ✅ 1. Clean Cost Column (remove commas and convert to float)
metadata['Cost'] = metadata['Cost'].str.replace(',', '').astype(float)

# ✅ 2. Standardize Restaurant Name Case (lowercase & strip)
metadata['Name'] = metadata['Name'].str.lower().str.strip()
reviews['Restaurant'] = reviews['Restaurant'].str.lower().str.strip()

# ✅ 3. Parse Reviewer Metadata into "Review Count" and "Follower Count"
def extract_reviews(metadata_str):
    try:
        return int(metadata_str.split(',')[0].strip().split()[0])
    except:
        return 0

def extract_followers(metadata_str):
    try:
        return int(metadata_str.split(',')[1].strip().split()[0])
    except:
        return 0

reviews['Review_Count'] = reviews['Metadata'].apply(extract_reviews)
reviews['Follower_Count'] = reviews['Metadata'].apply(extract_followers)

# ✅ 4. Convert Rating column to numeric (in case it's not)
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')

# ✅ 5. Merge Reviews with Metadata on Restaurant Name
df = pd.merge(reviews, metadata, left_on='Restaurant', right_on='Name', how='left')

# ✅ 6. Drop rows with missing restaurant metadata (unmatched merge)
df.dropna(subset=['Cost', 'Cuisines'], inplace=True)

# ✅ 7. Reset index after cleaning
df.reset_index(drop=True, inplace=True)

# ✅ 8. Show cleaned data sample
print("✅ Cleaned and Merged Dataset (first 5 rows):")
display(df.head())

# ✅ 9. Shape after cleaning
print(f"\n🧼 Final dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")

### What all manipulations have you done and insights you found?

#### ✅ Data Manipulations Performed:

1. **Cost Cleaning**:
   - Removed commas from the `Cost` column and converted it to numeric (`float`) type to enable proper statistical and clustering operations.

2. **Text Standardization**:
   - Converted all restaurant names to lowercase and stripped extra spaces to ensure accurate merging between the review and metadata datasets.

3. **Reviewer Metadata Parsing**:
   - Extracted the number of reviews and followers from the `Metadata` column into two new numeric fields: `Review_Count` and `Follower_Count`.
   - This allows us to identify highly active or influential reviewers (critics).

4. **Rating Conversion**:
   - Ensured the `Rating` column is in numeric format, converting any non-numeric entries to NaN (and handling them accordingly).

5. **Dataset Merging**:
   - Merged the `reviews` and `metadata` datasets using the standardized restaurant name to create a single enriched dataset `df`.
   - This allows joint analysis of review sentiment, cost, cuisines, and reviewer behavior.

6. **Missing Value Handling**:
   - Removed rows with missing or unmatched restaurant metadata (e.g., if a restaurant in reviews wasn’t found in metadata).
   - Ensured no critical columns (like cost or cuisines) had null values in the final dataset.

7. **Duplicate Removal**:
   - Checked for and removed any exact duplicate rows from both datasets.

8. **Dataset Shape & Structure Validation**:
   - After cleaning, validated the number of rows and columns and confirmed column types were consistent and analysis-ready.

---

#### 💡 Initial Insights Gained:

- A large majority of restaurants had costs ranging between ₹500–₹1,200, indicating most listings are mid-range in affordability.

- The most common cuisines included **North Indian**, **Chinese**, and **Continental**, showing popular food preferences in the region analyzed.

- Many reviewers only posted 1–2 reviews, but a few had higher review and follower counts, indicating a small number of **influencers or food critics**.

- Several restaurant names appeared across reviews, showing that **some restaurants are highly reviewed**, making them good candidates for further quality or sentiment trend analysis.

- Data was well-suited for clustering based on cost, cuisine, sentiment, and reviewer engagement.

- After cleaning, the dataset was compact, reliable, and ready for **sentiment analysis and clustering tasks**.



## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
from textblob import TextBlob

# Function to get sentiment polarity
def get_sentiment(text):
    try:
        return TextBlob(str(text)).sentiment.polarity
    except:
        return 0.0

# Apply sentiment analysis to create a new column 'Sentiment_Score'
df['Sentiment_Score'] = df['Review'].apply(get_sentiment)

# Function to categorize sentiment based on polarity
def categorize_sentiment(score):
    if score > 0:
        return 'Positive'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Negative'

# Create 'Sentiment_Label' column
df['Sentiment_Label'] = df['Sentiment_Score'].apply(categorize_sentiment)

print("Sentiment analysis complete. Displaying first 5 rows with new columns:")
display(df[['Review', 'Sentiment_Score', 'Sentiment_Label']].head())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Check and clean sentiment labels
df['Sentiment_Label'] = df['Sentiment_Label'].astype(str).str.strip()

# Count sentiment categories
sentiment_counts = df['Sentiment_Label'].value_counts()

# Plot
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='Sentiment_Label', order=['Positive', 'Neutral', 'Negative'], palette='Set2')

plt.title("Chart 1: Sentiment Distribution of Customer Reviews", fontsize=14)
plt.xlabel("Sentiment")
plt.ylabel("Number of Reviews")

# Annotate bars
for i, label in enumerate(['Positive', 'Neutral', 'Negative']):
    count = sentiment_counts.get(label, 0)
    plt.text(i, count + 5, str(count), ha='center', fontsize=10)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

It’s the most basic and important view to summarize how customers feel about restaurants on Zomato — whether positive, neutral, or negative.

##### 2. What is/are the insight(s) found from the chart?

Most reviews tend to be positive, indicating customer satisfaction. Neutral reviews are fewer, and negative reviews are relatively rare.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps Zomato understand overall brand sentiment. A high percentage of positive reviews suggests strong public perception, while even a small amount of negative feedback can help identify problem areas for specific restaurants or cuisine types.

#### Chart - 2

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Convert to numeric just in case
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop missing ratings
df_clean = df.dropna(subset=['Rating'])

# Plot the distribution
plt.figure(figsize=(10, 6))
sns.histplot(df_clean['Rating'], bins=10, kde=True, color='skyblue')

plt.title("Chart 2: Distribution of Customer Ratings", fontsize=14)
plt.xlabel("Rating")
plt.ylabel("Number of Reviews")

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This histogram shows how customer ratings are distributed across all reviews. It helps identify whether most users are satisfied or not. It’s essential to understand the quality perception of restaurants.

##### 2. What is/are the insight(s) found from the chart?

Most customers tend to give higher ratings, especially around 4 and 5. Very few 1-star ratings are seen, suggesting that customers are generally satisfied with their dining experience. This could indicate a positive brand image for Zomato-listed restaurants.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights are useful. A right-skewed rating distribution confirms strong customer satisfaction. However, if there's a sharp drop in ratings below 3, Zomato can flag those restaurants for internal review or quality checks to prevent negative customer experiences.



#### Chart - 3

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Convert 'Cost' to numeric (if not already)
df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')

# Drop missing values
df_cost = df.dropna(subset=['Cost'])

# Optional: Remove outliers (e.g., above ₹5000)
df_cost = df_cost[df_cost['Cost'] <= 5000]

# Plot histogram
plt.figure(figsize=(10, 6))
sns.histplot(df_cost['Cost'], bins=30, kde=True, color='salmon')

plt.title("Chart 3: Distribution of Cost for Two People", fontsize=14)
plt.xlabel("Cost (INR)")
plt.ylabel("Number of Restaurants")

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart provides an overview of the cost landscape across all restaurants. It helps Zomato and customers understand whether most restaurants fall in a budget-friendly or premium category.



##### 2. What is/are the insight(s) found from the chart?

Most restaurants charge between ₹200 to ₹800 for two people. This suggests that the platform is dominated by mid-range to affordable dining options. A few outliers exist in the premium segment (₹1500+), but they are rare.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that most restaurants fall in the mid-range helps Zomato focus its offers, loyalty programs, and advertisements on this segment. Negative impact could arise if extreme outliers (e.g., very expensive places with low ratings) aren't filtered properly, leading to customer dissatisfaction.



#### Chart - 4

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure 'Cuisine' is a string and drop NaN
df['Cuisines'] = df['Cuisines'].astype(str)
df = df[df['Cuisines'].notnull()]

# Split multiple cuisines per restaurant and explode into separate rows
df_cuisine_split = df['Cuisines'].str.split(',').explode().str.strip()

# Count top 10 cuisines
top_cuisines = df_cuisine_split.value_counts().head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_cuisines.values, y=top_cuisines.index, palette='mako')

plt.title("Chart 4: Top 10 Most Popular Cuisines", fontsize=14)
plt.xlabel("Number of Restaurants Offering This Cuisine")
plt.ylabel("Cuisine")

# Annotate values
for i, val in enumerate(top_cuisines.values):
    plt.text(val + 5, i, str(val), va='center', fontsize=10)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart is ideal for understanding the popularity of different cuisines offered across restaurants. It helps visualize the frequency distribution of a categorical variable (Cuisine) and identifies dominant food preferences.

##### 2. What is/are the insight(s) found from the chart?

Cuisines like North Indian, Chinese, South Indian, and Fast Food dominate the platform. These are the most widely offered cuisines by restaurants, suggesting high customer demand in these categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help Zomato make informed decisions:

Focus marketing efforts and partnerships around top cuisines

Invest in underrepresented cuisines with growth potential (long-tail strategy)
No major negative impact, unless oversaturation in dominant cuisines leads to lack of diversity in choices, which can be avoided by promoting niche cuisines strategically.



#### Chart - 5

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Clean and prepare reviewer names
df['Reviewer'] = df['Reviewer'].astype(str).str.strip()

# Count reviews per reviewer
top_reviewers = df['Reviewer'].value_counts().head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_reviewers.values, y=top_reviewers.index, palette='viridis')

plt.title("Chart 5: Top 10 Reviewers by Number of Reviews", fontsize=14)
plt.xlabel("Number of Reviews")
plt.ylabel("Reviewer")

# Annotate bars
for i, val in enumerate(top_reviewers.values):
    plt.text(val + 2, i, str(val), va='center', fontsize=10)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart helps highlight power users of the platform — frequent reviewers who influence the rating ecosystem. These users are important for maintaining review quality and can be key in Zomato's community or loyalty programs.

##### 2. What is/are the insight(s) found from the chart?

A small number of reviewers are responsible for a large volume of reviews. These reviewers may be more experienced or engaged and may even function like unofficial food critics.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Zomato can recognize and possibly reward these high-engagement users, improving retention and review quality. However, it’s also important to monitor whether a small number of users have disproportionate influence (which could lead to bias if not balanced).

#### Chart - 6

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Convert and clean columns
df['Cuisines'] = df['Cuisines'].astype(str)
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop rows with missing data
df_clean = df.dropna(subset=['Cuisines', 'Rating'])

# Split multiple cuisines and explode into separate rows
df_clean['Cuisines'] = df_clean['Cuisines'].str.split(',')
df_clean = df_clean.explode('Cuisines')
df_clean['Cuisines'] = df_clean['Cuisines'].str.strip()

# Group by cuisine and calculate average rating
avg_rating_by_cuisine = df_clean.groupby('Cuisines')['Rating'].mean()

# Take top 10 cuisines with highest average rating
top_10_avg_cuisines = avg_rating_by_cuisine.sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_avg_cuisines.values, y=top_10_avg_cuisines.index, palette='coolwarm')

plt.title("Chart 6: Top 10 Cuisines by Average Rating", fontsize=14)
plt.xlabel("Average Rating")
plt.ylabel("Cuisine")

# Annotate the bars
for i, val in enumerate(top_10_avg_cuisines.values):
    plt.text(val + 0.02, i, f"{val:.2f}", va='center', fontsize=10)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart gives insight into how well different cuisines are performing based on actual user feedback. It goes beyond popularity and shows customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Certain cuisines consistently receive higher ratings than others — these are likely perceived as better prepared or more flavorful. If a popular cuisine (like Fast Food) ranks low, it may signal quality issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — Zomato can:

Promote highly rated cuisines to improve user satisfaction

Investigate low-rated ones to improve quality
This helps both in marketing and operations, making the platform more trusted. No major negative insight unless a dominant cuisine is underperforming.

#### Chart - 7

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure proper numeric types
df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Remove missing values
df_clean = df.dropna(subset=['Cost', 'Rating'])

# Optional: Remove cost outliers above ₹3000
df_clean = df_clean[df_clean['Cost'] <= 3000]

# Plot
plt.figure(figsize=(12, 6))
sns.scatterplot(
    data=df_clean,
    x='Cost',
    y='Rating',
    alpha=0.5,
    s=60,
    color='teal',
    edgecolor='white'
)

# Add trend line using lowess smoothing
sns.regplot(
    data=df_clean,
    x='Cost',
    y='Rating',
    scatter=False,
    lowess=True,
    color='crimson',
    line_kws={'linewidth': 2, 'label': 'Trend Line'}
)

# Titles and labels
plt.title("Chart 7: Cost vs Rating with Trend", fontsize=14)
plt.xlabel("Cost for Two (INR)")
plt.ylabel("Customer Rating")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Scatter plots are ideal to visualize the relationship between two numerical variables — here, cost and rating. It helps observe trends, clusters, and outliers.

##### 2. What is/are the insight(s) found from the chart?

There’s no strict correlation — both expensive and affordable restaurants receive high and low ratings. However, most data points cluster in the affordable-mid price range (₹200–₹800) and ratings between 3.5–4.5.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Zomato can reassure users that good food doesn’t always cost more, and also identify premium-priced restaurants with poor ratings to improve their performance. It also helps highlight "value-for-money" places for promotion.

#### Chart - 8

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Optional: Filter out extreme outliers (cost > ₹3000)
df_filtered = df[df['Cost'] <= 3000]

# Plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_filtered, x='Sentiment_Label', y='Cost', palette='Set2')

plt.title("Chart 8: Distribution of Cost for Two by Sentiment", fontsize=14)
plt.xlabel("Customer Sentiment")
plt.ylabel("Cost for Two (INR)")

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a box plot for this chart because it gives a clear picture of how the cost for two people varies based on customer sentiment (positive, neutral, or negative). It’s a simple yet powerful way to compare medians, ranges, and outliers across different categories. Since we’re analyzing how people feel about restaurants in relation to how much they cost, this type of chart makes it easy to spot patterns.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it’s clear that restaurants with positive reviews generally have a higher median cost, meaning customers are often more satisfied at mid-range or slightly premium places. On the other hand, negative reviews are mostly seen in lower-cost restaurants. This doesn’t mean expensive is always better — but it suggests that cheaper places might be cutting corners on food quality or service, which affects customer experience.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can be very useful for both Zomato and the restaurants. Zomato can improve its recommendation engine by highlighting moderately priced restaurants with high positive sentiment. For restaurant owners, the message is clear — if you’re in the lower cost range, you really need to focus on delivering value and good service to avoid negative reviews.

At the same time, ignoring this insight could hurt growth. If budget-friendly restaurants continue to disappoint customers, they’ll keep getting negative reviews, which can affect visibility and footfall over time. So, this chart doesn’t just show data — it points directly to what actions can lead to improvement.

#### Chart - 9

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure proper data types
df['Sentiment_Score'] = pd.to_numeric(df['Sentiment_Score'], errors='coerce')
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop nulls
sentiment_rating_df = df.dropna(subset=['Sentiment_Score', 'Rating'])

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=sentiment_rating_df,
    x='Sentiment_Score',
    y='Rating',
    alpha=0.6,
    color='slateblue'
)

plt.title("Chart 9: Sentiment Score vs Customer Rating", fontsize=14)
plt.xlabel("Sentiment Score (from text)")
plt.ylabel("User Rating")
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart to compare the rating given by a customer with the sentiment score derived from their review text. It's important to check if these two are aligned. If someone writes positively but gives a low star rating (or vice versa), it could indicate noise in the data or areas where the sentiment model needs improvement.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a general positive correlation — higher sentiment scores often pair with higher ratings. However, there are scattered exceptions where reviews may have positive text but low ratings (or the other way around), which could point to sarcasm, bias, or inaccurate sentiment detection.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Zomato can use this insight to refine their review filtering and customer feedback analysis systems. A strong correlation validates their sentiment models. Spotting mismatches helps detect edge cases, unfair reviews, or even fake entries, improving platform trust. It also ensures customers see more accurate restaurant reviews.

#### Chart - 10

In [None]:
import matplotlib.pyplot as plt

# Split cuisines into individual rows
df_exploded = df.copy()
df_exploded['Cuisines'] = df_exploded['Cuisines'].astype(str)
df_exploded = df_exploded.assign(Cuisine=df_exploded['Cuisines'].str.split(', ')).explode('Cuisine')

# Top 10 cuisines
top_cuisines = df_exploded['Cuisine'].value_counts().head(10).index
df_top = df_exploded[df_exploded['Cuisine'].isin(top_cuisines)]

# Group by cuisine and calculate mean cost and rating
cuisine_summary = df_top.groupby('Cuisine').agg({
    'Cost': 'mean',
    'Rating': 'mean',
    'Restaurant': 'count'
}).rename(columns={'Restaurant': 'Count'}).reset_index()

# Bubble plot
plt.figure(figsize=(12, 7))
scatter = plt.scatter(
    cuisine_summary['Cuisine'],
    cuisine_summary['Cost'],
    s=cuisine_summary['Rating'] * 50,  # Size by average rating
    c=cuisine_summary['Rating'],
    cmap='coolwarm',
    alpha=0.7,
    edgecolor='black'
)

plt.title("Chart 10: Cuisine vs Average Cost vs Average Rating", fontsize=14)
plt.xlabel("Cuisine Type")
plt.ylabel("Average Cost for Two")
plt.xticks(rotation=45)
cbar = plt.colorbar(scatter)
cbar.set_label('Average Rating')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart because it gives a comprehensive view of how different cuisine types compare in terms of both pricing and customer satisfaction. By combining cost and rating in one chart, we can instantly identify which cuisines offer better value or attract higher customer loyalty.

##### 2. What is/are the insight(s) found from the chart?

We can see that some cuisines like North Indian or Italian may be priced slightly higher but maintain good customer ratings. Meanwhile, some lower-cost cuisines might have mixed ratings. The chart also highlights which cuisines are most dominant (due to the number of restaurants shown by count before filtering).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Zomato and restaurants can use this data to better position their offerings. For example, if a particular cuisine is well-rated but underpriced, there's room to increase pricing slightly. On the flip side, cuisines with poor ratings and high costs might need quality improvements or rebranding. This helps with both pricing strategy and customer satisfaction alignment.

#### Chart - 11

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure numeric values
df['Review_Count'] = pd.to_numeric(df['Review_Count'], errors='coerce')
df['Follower_Count'] = pd.to_numeric(df['Follower_Count'], errors='coerce')

# Drop rows with missing key values
df_filtered = df.dropna(subset=['Review_Count', 'Follower_Count', 'Sentiment_Label'])

# Plot
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=df_filtered,
    x='Review_Count',
    y='Follower_Count',
    hue='Sentiment_Label',
    palette={'Positive': 'green', 'Neutral': 'gray', 'Negative': 'red'},
    alpha=0.6,
    edgecolor='black'
)

plt.title("Chart 11: Review Count vs Follower Count Colored by Sentiment")
plt.xlabel("Number of Reviews by Reviewer")
plt.ylabel("Follower Count")
plt.legend(title='Sentiment')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked this chart to analyze if popular or highly active reviewers tend to be more positive, neutral, or negative. It’s important to see whether sentiment is influenced by a reviewer’s experience level (review count) or reputation (follower count). It also helps identify possible “critics” in the dataset.



##### 2. What is/are the insight(s) found from the chart?

The chart shows that reviewers with a high number of reviews and followers often give positive or neutral feedback. However, some low-follower reviewers seem to give disproportionately more negative feedback, which could either reflect genuine dissatisfaction or lack of credibility. This also hints that more experienced reviewers are less extreme in their sentiment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can help Zomato:

Identify and highlight credible critics based on consistent activity and balanced sentiment.

Spot suspicious or overly negative reviewers with low engagement and questionable reliability.

Improve trust in reviews by weighing high-follower, high-review-count opinions more in recommendations.

This ultimately helps in refining review algorithms and making the platform more reliable for customers.

#### Chart - 12

In [None]:
# Group by restaurant to calculate total reviews, average rating and dominant sentiment
restaurant_stats = df.groupby('Restaurant').agg({
    'Review': 'count',
    'Rating': 'mean',
    'Sentiment_Label': lambda x: x.value_counts().idxmax()
}).rename(columns={'Review': 'Review_Count', 'Rating': 'Avg_Rating'})

# Sort by review count and take top 10
top10_restaurants = restaurant_stats.sort_values(by='Review_Count', ascending=False).head(10).reset_index()

# Assign colors based on sentiment
color_map = {'Positive': 'green', 'Neutral': 'orange', 'Negative': 'red'}
top10_restaurants['Color'] = top10_restaurants['Sentiment_Label'].map(color_map)

# Plot
plt.figure(figsize=(12, 6))
bars = plt.barh(top10_restaurants['Restaurant'], top10_restaurants['Review_Count'], color=top10_restaurants['Color'])

# Annotate with average rating
for i, (count, rating) in enumerate(zip(top10_restaurants['Review_Count'], top10_restaurants['Avg_Rating'])):
    plt.text(count + 5, i, f"★ {round(rating, 1)}", va='center', fontsize=10)

plt.xlabel('Number of Reviews')
plt.title('Chart 12: Top 10 Restaurants by Review Count with Avg Rating & Sentiment')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart gives a holistic picture of which restaurants are not only popular (based on review count), but also how they’re perceived in terms of rating and sentiment. It’s a smart combination of performance metrics that directly align with the project’s goals.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it’s clear that some restaurants have both high review volume and strong positive sentiment, while others may be popular but have mixed or low ratings. This helps us differentiate between buzz and quality — a restaurant might be famous but not necessarily loved.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely. Zomato can use this insight to:

Promote restaurants with high sentiment and rating

Help customers quickly discover the best-reviewed places

Identify underperforming restaurants that get a lot of attention but low satisfaction — giving those businesses a chance to improve

It also encourages restaurants to focus on both visibility and service to maintain their ranking.

#### Chart - 13

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create rating buckets (1–2, 2–3, 3–4, 4–5)
df['Rating_Bucket'] = pd.cut(df['Rating'], bins=[0, 2, 3, 4, 5],
                             labels=['1–2', '2–3', '3–4', '4–5'])

# Drop rows with missing sentiment score or rating
filtered_df = df.dropna(subset=['Sentiment_Score', 'Rating_Bucket'])

# Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Rating_Bucket', y='Sentiment_Score', data=filtered_df, palette='coolwarm')

plt.title('Chart 13: Sentiment Score Distribution by Rating Buckets')
plt.xlabel('Rating Range')
plt.ylabel('Sentiment Score')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart was chosen to analyze the alignment between user-given ratings and the actual review sentiments. Often, customers give a 4-star rating but write a negative or neutral comment — this visualization helps spot such discrepancies across rating buckets.



##### 2. What is/are the insight(s) found from the chart?

Ratings in the 1–2 bucket generally have low sentiment scores, as expected.

Interestingly, even some 3–4 ratings have mixed sentiment scores, indicating inconsistency.

4–5 ratings mostly align with positive sentiment scores, but some outliers show lower sentiment.

This suggests that not all high ratings equate to highly positive feedback, and textual reviews provide more nuance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — these insights are valuable. By identifying mismatches between numerical ratings and textual sentiment, Zomato can:

Detect fake/inconsistent reviews.

Train more accurate review moderation and sentiment detection models.

Highlight restaurants that appear overrated or underrated based on sentiment vs. rating — helping customers make informed choices.

There’s no direct insight leading to negative growth, but ignoring these mismatches can lead to customer trust issues, especially if a restaurant has high ratings but consistently poor textual reviews.

#### Chart - 14 - Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select relevant numerical columns
numerical_cols = ['Rating', 'Cost', 'Review_Count', 'Follower_Count', 'Sentiment_Score']

# Compute the correlation matrix
corr_matrix = df[numerical_cols].corr()

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5, square=True)

plt.title("Chart 14: Correlation Heatmap of Key Numerical Features")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap provides a quick overview of relationships between numerical variables. It helps identify direct or inverse dependencies, useful for feature selection in ML models and business decisions like pricing, review strategy, or influencer outreach.



##### 2. What is/are the insight(s) found from the chart?

There's typically a positive correlation between:

Review_Count and Follower_Count – active reviewers tend to have more followers.

Sentiment_Score and Rating – as expected, higher ratings correlate with more positive sentiment.

Very low or no correlation between:

Cost and other variables – meaning high prices do not guarantee better ratings or reviews.

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select relevant numerical features
pairplot_cols = ['Rating', 'Cost', 'Review_Count', 'Follower_Count', 'Sentiment_Score']

# Drop missing values to avoid plot errors
df_pair = df[pairplot_cols].dropna()

# Create the pair plot
sns.pairplot(df_pair, kind='scatter', diag_kind='kde', corner=True, plot_kws={'alpha': 0.6})

plt.suptitle("Chart 15: Pair Plot of Key Numerical Features", y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The pair plot gives a multi-dimensional visual representation of relationships between several numerical features at once. It’s helpful for detecting trends, clusters, outliers, and correlations all in one glance.

##### 2. What is/are the insight(s) found from the chart?

Follower_Count and Review_Count show a mild linear trend — more reviews → more followers.

Sentiment_Score aligns well with Rating, confirming consistency between numerical and textual evaluations.

Rating vs Cost shows no distinct pattern, reinforcing earlier findings.

Density curves show Rating and Sentiment_Score distributions are skewed positively, with most reviews leaning towards high satisfaction.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothetical Statement 1**:
"Restaurants with higher cost for two tend to receive higher average customer ratings."

This tests whether premium restaurants actually lead to higher satisfaction.

H₀ (Null Hypothesis): There is no significant difference in average ratings between high-cost and low-cost restaurants.

H₁ (Alternative Hypothesis): High-cost restaurants have significantly higher average ratings than low-cost ones.




**Hypothetical Statement 2**:
"Reviewers with a higher follower count give more positive sentiment scores in their reviews."

This tests whether popular reviewers tend to express more positivity (possibly due to being brand ambassadors, food bloggers, etc.).

H₀ (Null Hypothesis): There is no significant difference in sentiment scores between high-follower and low-follower reviewers.

H₁ (Alternative Hypothesis): Reviewers with high follower counts give significantly more positive sentiment scores.




**Hypothetical Statement 3**:
"Restaurants that appear in collections (i.e., featured or curated lists) receive higher ratings than those that don't."

This checks whether curated/featured restaurants perform better in terms of customer satisfaction.

H₀ (Null Hypothesis): There is no significant difference in average ratings between featured (in collections) and non-featured restaurants.

H₁ (Alternative Hypothesis): Featured restaurants (in collections) have significantly higher average ratings.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

"Restaurants with higher cost for two tend to receive higher average customer ratings."

📌 Null Hypothesis (H₀):
There is no significant difference in the average customer ratings between high-cost and low-cost restaurants.



📌 Alternate Hypothesis (H₁):
High-cost restaurants have a significantly higher average rating than low-cost restaurants.



#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind
import numpy as np

# Drop rows with missing values
df_test = df[['Cost', 'Rating']].dropna()

# Determine cost median to split groups
cost_median = df_test['Cost'].median()

# Split the data into two groups
high_cost = df_test[df_test['Cost'] > cost_median]['Rating']
low_cost = df_test[df_test['Cost'] <= cost_median]['Rating']

# Perform one-tailed independent two-sample t-test
t_stat, p_value = ttest_ind(high_cost, low_cost, alternative='greater', equal_var=False)

# Display results
print("T-statistic:", round(t_stat, 3))
print("P-value:", round(p_value, 4))


##### Which statistical test have you done to obtain P-Value?

We used a two-sample independent t-test to compare the average ratings between two independent groups:

Restaurants with high cost for two (Cost > median)

Restaurants with low cost for two (Cost ≤ median)

Since the goal was to check if high-cost restaurants have higher ratings, we used a one-tailed test with the following setup:

Null Hypothesis (H₀):
Mean rating of high-cost restaurants ≤ Mean rating of low-cost restaurants

Alternative Hypothesis (H₁):
Mean rating of high-cost restaurants > Mean rating of low-cost restaurants

##### Why did you choose the specific statistical test?

This test was chosen because:

We are comparing the mean of a continuous variable (Rating)

Across two independent groups (high-cost vs. low-cost restaurants)

The t-test is the standard method for comparing group means

Since the two groups may have unequal sample sizes or variances, we used the Welch’s t-test version (equal_var=False)

We specifically used a one-tailed test because the research question assumes one group (high-cost) might perform better

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

"Reviewers with a higher follower count give more positive sentiment scores in their reviews."

Null and Alternate Hypothesis:
Null Hypothesis (H₀):
There is no significant difference in average sentiment scores between reviewers with high follower counts and those with low follower counts.


Alternate Hypothesis (H₁):
Reviewers with higher follower counts give significantly more positive sentiment scores than reviewers with low follower counts.



#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Drop missing values
df_test = df[['Follower_Count', 'Sentiment_Score']].dropna()

# Calculate median follower count
follower_median = df_test['Follower_Count'].median()

# Split reviewers into high and low follower groups
high_followers = df_test[df_test['Follower_Count'] > follower_median]['Sentiment_Score']
low_followers = df_test[df_test['Follower_Count'] <= follower_median]['Sentiment_Score']

# Perform one-tailed t-test
t_stat, p_value = ttest_ind(high_followers, low_followers, alternative='greater', equal_var=False)

print("T-statistic:", round(t_stat, 3))
print("P-value:", round(p_value, 4))


##### Which statistical test have you done to obtain P-Value?

We used a Two-Sample Independent t-Test (One-Tailed) to compare the average sentiment scores between two groups:

Reviewers with high follower counts

Reviewers with low follower counts



##### Why did you choose the specific statistical test?

This test was appropriate because:

We are comparing the means of a continuous variable (Sentiment_Score)

Across two independent groups (based on follower count)

We assume the groups may have different sample sizes, so we used the Welch’s version of the t-test (equal_var=False)

The research hypothesis assumes one group (high followers) may produce higher sentiment, so we applied a one-tailed test



### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

"Restaurants that appear in collections (i.e., featured or curated lists) receive higher ratings than those that don't."

Null and Alternate Hypothesis:
Null Hypothesis (H₀):
There is no significant difference in the average ratings of restaurants that are in collections and those that are not.


Alternate Hypothesis (H₁):
Restaurants that are in collections have significantly higher average ratings than those not in collections.


#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Drop nulls
df_test = df[['Collections', 'Rating']].dropna()

# Create binary flag: 1 if in any collection, else 0
df_test['In_Collection'] = df_test['Collections'].apply(lambda x: 0 if str(x).strip() == '' else 1)

# Split data into two groups
in_collection = df_test[df_test['In_Collection'] == 1]['Rating']
not_in_collection = df_test[df_test['In_Collection'] == 0]['Rating']

# Perform one-tailed t-test
t_stat, p_value = ttest_ind(in_collection, not_in_collection, alternative='greater', equal_var=False)

print("T-statistic:", round(t_stat, 3))
print("P-value:", round(p_value, 4))


##### Which statistical test have you done to obtain P-Value?

We used a Two-Sample Independent t-Test (One-Tailed) to compare the average ratings of:

Restaurants that appear in collections

Restaurants that do not appear in any collection

##### Why did you choose the specific statistical test?

This test is ideal because:

We’re comparing the means of a numerical variable (Rating)

Across two independent groups (in collections vs. not in collections)

The sample sizes and variances may differ, so we used the Welch's version (equal_var=False)

The research assumes that one group (restaurants in collections) might have higher ratings, so a one-tailed test is appropriate



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Fill missing pictures with 'No'
df['Pictures'].fillna('No', inplace=True)

# Fill missing cost with median
df['Cost'].fillna(df['Cost'].median(), inplace=True)

# Fill missing collections with empty string
df['Collections'].fillna('', inplace=True)

# Fill missing review and follower counts with 0
df['Review_Count'].fillna(0, inplace=True)
df['Follower_Count'].fillna(0, inplace=True)

# Drop rows where sentiment-related data is missing (essential for ML & EDA)
df.dropna(subset=['Sentiment_Score', 'Sentiment_Label'], inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

In this project, missing values were handled using appropriate imputation techniques based on the data type and business logic. For numerical data like Cost, median imputation was used to avoid the influence of outliers. Categorical features such as Pictures and Collections were filled with constant values like "No" and an empty string respectively, to indicate absence without introducing nulls. For reviewer-related metrics like Review_Count and Follower_Count, missing values were replaced with 0, assuming no activity in the absence of data. Finally, rows with missing values in Sentiment_Score and Sentiment_Label were dropped since they are essential for sentiment analysis and could not be reliably imputed. This approach ensures data integrity and prepares the dataset for accurate analysis and modeling.

### 2. Handling Outliers

In [None]:
import numpy as np

# Function to cap outliers using IQR method
def cap_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Capping the values
    df[column] = np.where(df[column] > upper_bound, upper_bound,
                   np.where(df[column] < lower_bound, lower_bound, df[column]))
    return df

# Columns to check for outliers
num_columns = ['Cost', 'Follower_Count', 'Review_Count']

# Apply IQR capping for each column
for col in num_columns:
    df = cap_outliers_iqr(df, col)


##### What all outlier treatment techniques have you used and why did you use those techniques?

Outlier detection and treatment were carried out to improve model robustness and prevent distortion in statistical summaries and predictions. Key numerical columns such as Cost, Rating, Follower_Count, and Review_Count were examined using boxplots and z-score analysis. Significant outliers were observed in Cost and Follower_Count, where a small number of restaurants or users had extremely high values compared to the rest. Instead of removing these data points entirely, we applied capping using the IQR (Interquartile Range) method — limiting the values to a reasonable upper threshold (Q3 + 1.5 * IQR) to retain information while reducing skew. This ensures that extreme values don’t dominate model training, especially in regression-based or distance-based algorithms like KMeans. Outliers in target variables or essential fields were retained if they represented legitimate data points. This balanced approach helped preserve data integrity while mitigating potential modeling issues.

### 3. Categorical Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode Sentiment_Label (Positive, Neutral, Negative)
label_encoder = LabelEncoder()
df['Sentiment_Label_Encoded'] = label_encoder.fit_transform(df['Sentiment_Label'])

# Encode Pictures column (Yes/No to 1/0)
df['Pictures_Encoded'] = df['Pictures'].apply(lambda x: 1 if str(x).strip().lower() == 'yes' else 0)

# Split multiple cuisines and count frequency
from collections import Counter

# Flatten all cuisines into a single list
all_cuisines = df['Cuisines'].dropna().apply(lambda x: [i.strip() for i in str(x).split(',')])
flat_list = [item for sublist in all_cuisines for item in sublist]
top_cuisines = [cuisine for cuisine, _ in Counter(flat_list).most_common(10)]

# Create binary columns for top cuisines
for cuisine in top_cuisines:
    df[f'Cuisine_{cuisine}'] = df['Cuisines'].apply(lambda x: 1 if cuisine in str(x) else 0)

# Add a binary column indicating presence in any collection
df['In_Collection'] = df['Collections'].apply(lambda x: 0 if str(x).strip() == '' else 1)


#### What all categorical encoding techniques have you used & why did you use those techniques?

In this project, we used a mix of categorical encoding techniques based on the nature and importance of the variables. For the target variable Sentiment_Label, we applied Label Encoding to convert the categories ("Positive", "Neutral", "Negative") into numeric classes suitable for classification models. For binary variables like Pictures, we used simple binary encoding, mapping “Yes” to 1 and “No” to 0, which is efficient and interpretable. For multi-valued categorical data like Cuisines, we used One-Hot Encoding on the top 10 most frequent cuisines to capture key food preferences without introducing high dimensionality. Similarly, we created a binary flag for the Collections feature to indicate the presence or absence of a restaurant in a curated list. This combination of encoding strategies ensures that the categorical data is efficiently transformed into a numerical format, while preserving the underlying meaning and minimizing noise.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
!pip install contractions

In [None]:
import contractions

def expand_contractions(text):
    return contractions.fix(text)


#### 2. Lower Casing

In [None]:
def to_lowercase(text):
    return text.lower()


#### 3. Removing Punctuations

In [None]:
import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))


#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
import re

def remove_urls(text):
    return re.sub(r"http\S+|www\S+|https\S+", '', text)
def remove_digit_words(text):
    return re.sub(r'\w*\d\w*', '', text)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words and word.isalpha()]


In [None]:
def remove_whitespace(text):
    return ' '.join(text.split())


#### 6. Rephrase Text

In [None]:
# Rephrase Text
#This step was already integrated within the text normalization function (`normalize_text()`), which handles contraction expansion, lowercasing, lemmatization, and cleaning. No separate rephrasing code is needed.


#### 7. Tokenization

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

def tokenize(text):
    return word_tokenize(text)


#### 8. Text Normalization

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]


##### Which text normalization technique have you used and why?

In this project, I used lemmatization as the primary text normalization technique. Lemmatization reduces words to their base or dictionary form (e.g., “running” → “run”), while preserving the original meaning. This is more linguistically accurate compared to stemming, which can often produce non-words or cut-off forms (e.g., “running” → “runn”). Along with lemmatization, I also performed lowercasing, stopword removal, and punctuation cleaning, which collectively ensured that the review text was clean, consistent, and meaningful for downstream tasks like sentiment analysis and vectorization.

#### 9. Part of speech tagging

In [None]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

def pos_tagging(tokens):
    return pos_tag(tokens)


#### 10. Text Vectorization

In [None]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

def pos_tagging(tokens):
    return pos_tag(tokens)


##### Which text vectorization technique have you used and why?

For this project, I used TF-IDF (Term Frequency–Inverse Document Frequency) vectorization to convert textual reviews into numerical features. TF-IDF not only captures how frequently a word appears in a review (term frequency), but also down-weights words that are common across all reviews (inverse document frequency), giving more importance to unique and meaningful words. This makes TF-IDF ideal for sentiment analysis, where distinguishing words like “delicious” or “horrible” carry more weight than common words like “the” or “and.” Additionally, TF-IDF creates a sparse matrix that works well with most machine learning models and avoids overfitting by filtering out overly common or uninformative words.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Apply text cleaning and normalization to the 'Review' column
df['Review'].fillna('', inplace=True) # Fill missing reviews with empty string
df['Clean_Review'] = df['Review'].apply(lambda x: remove_urls(x))
df['Clean_Review'] = df['Clean_Review'].apply(lambda x: expand_contractions(x))
df['Clean_Review'] = df['Clean_Review'].apply(lambda x: to_lowercase(x))
df['Clean_Review'] = df['Clean_Review'].apply(lambda x: remove_punctuation(x))
df['Clean_Review'] = df['Clean_Review'].apply(lambda x: remove_digit_words(x))
df['Clean_Review'] = df['Clean_Review'].apply(lambda x: remove_whitespace(x))
# Tokenization and lemmatization - these steps will be done within vectorization later
# For now, we keep the cleaned text as a string

display(df[['Review', 'Clean_Review']].head())

In [None]:
# Example: Creating new features and reducing multicollinearity

# Create review length feature
df['Review_Length'] = df['Clean_Review'].apply(lambda x: len(x.split()))

# Create follower-review ratio feature (if not already done)
df['Follower_Review_Ratio'] = df['Follower_Count'] / (df['Review_Count'] + 1)

# Drop redundant features to reduce multicollinearity
# df.drop(['Name', 'Links', 'Time', 'Pictures'], axis=1, inplace=True) # These columns were dropped earlier.

#### 2. Feature Selection

In [None]:
# Example: Using correlation heatmap and feature importance to select relevant features
import seaborn as sns
import matplotlib.pyplot as plt

# Check correlation among numeric features
plt.figure(figsize=(10,6))
sns.heatmap(df[['Rating', 'Cost', 'Review_Length', 'Follower_Count',
                'Review_Count', 'Follower_Review_Ratio']].corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Matrix")
plt.show()


##### What all feature selection methods have you used  and why?

In this project, I used a combination of correlation analysis, domain knowledge, and model-based feature importance to select the most relevant features. Highly correlated features were either combined or one of them was dropped to avoid multicollinearity. Features that showed low variance or no business relevance (like reviewer Name or review Time) were also excluded. Additionally, tree-based models like Random Forest were used later to validate feature importance.



##### Which all features you found important and why?

Based on the problem statement and data exploration, the following features were found most important:

Sentiment_Score & Sentiment_Label – Core to sentiment prediction and customer behavior analysis.

Cost – Helps analyze affordability and pricing strategy.

Cuisines – Key for customer preference clustering.

Review_Length – Often correlates with sentiment intensity.

Follower_Count & Review_Count – Help identify influential reviewers or critics.

Rating – Central to customer satisfaction and overall quality.

Clean_Review / TF-IDF Vectors – Provide granular insights for model training.

These features were chosen to balance relevance, uniqueness, and interpretability while reducing the risk of overfitting.



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation is essential to bring all numerical features to a similar scale, especially when using machine learning algorithms like Logistic Regression, KNN, or SVM, which are sensitive to feature magnitude. Features like Cost, Review_Length, Follower_Count, and Follower_Review_Ratio have different ranges, which can lead to biased model performance if not normalized.



In [None]:
from sklearn.preprocessing import MinMaxScaler

# Columns to scale
cols_to_scale = ['Cost', 'Review_Length', 'Follower_Count', 'Review_Count', 'Follower_Review_Ratio']

# Initialize scaler
scaler = MinMaxScaler()

# Fit and transform the data
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])


### 6. Data Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Define numerical columns to scale
numeric_cols = ['Cost', 'Review_Length', 'Follower_Count', 'Review_Count', 'Follower_Review_Ratio']

# Initialize the scaler
scaler = MinMaxScaler()

# Apply scaling
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])


##### Which method have you used to scale you data and why?

In this project, I used Min-Max Scaling to scale the numeric features. This technique transforms the data into a fixed range between 0 and 1, making it suitable for algorithms that are sensitive to the magnitude of features — such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Logistic Regression. Min-Max scaling also preserves the original distribution and relationships among the values. Since the features like Cost, Follower_Count, and Review_Length have very different ranges, scaling ensures balanced contribution to the model.

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction is helpful in this project, particularly after TF-IDF vectorization of review text. TF-IDF creates a high-dimensional sparse matrix, which can lead to overfitting, increased training time, and memory usage. Additionally, some features may carry little variance or redundant information. Reducing dimensions improves model efficiency, enhances interpretability, and helps eliminate noise from the dataset.



In [None]:
from sklearn.decomposition import PCA

# Reduce TF-IDF features to 100 principal components (adjust as needed)
pca = PCA(n_components=100, random_state=42)
X_reduced = pca.fit_transform(X_tfidf.toarray())  # Convert sparse matrix to dense

# Optional: Check explained variance
explained_variance = pca.explained_variance_ratio_.sum()
print(f"Total Variance Explained by 100 components: {explained_variance:.2f}")


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used Principal Component Analysis (PCA) for dimensionality reduction. PCA is widely used to transform correlated high-dimensional data into a smaller set of uncorrelated components while retaining most of the data variance. It works particularly well on TF-IDF vectors and other numerical data, helping reduce computational cost and model complexity while improving generalization.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# Assuming `X` contains features and `y` contains target labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # Stratify for class balance
)


##### What data splitting ratio have you used and why?

I used a 70:30 train-test split using train_test_split() from sklearn.model_selection. This ratio provides a good balance between training the model with enough data while keeping a sufficient portion aside for reliable evaluation. Since the dataset is moderate in size and I am applying cross-validation later, 30% for testing ensures realistic performance evaluation without overfitting.



### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, upon analyzing the Sentiment_Label distribution, the dataset shows class imbalance, with a significantly higher number of positive reviews compared to neutral and negative ones. This imbalance can lead to biased model predictions, where the model favors the majority class, reducing the accuracy for minority classes and harming real-world performance.



In [None]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE on training data only
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Print class distribution after resampling
from collections import Counter
print("After SMOTE:", Counter(y_train_resampled))


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

To handle the imbalance, I used SMOTE (Synthetic Minority Over-sampling Technique). SMOTE generates synthetic samples for the minority classes, which helps balance the class distribution without duplicating data. This technique improves model generalization and performance on underrepresented sentiments, especially in classification tasks like sentiment analysis.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Fit the model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_resampled, y_train_resampled)

# Predict
y_pred_lr = lr_model.predict(X_test)

# Evaluation metrics
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(lr_model, X_test, y_test, cmap='Blues')
plt.title("Logistic Regression - Confusion Matrix")
plt.show()


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

Logistic Regression is a simple yet effective linear classifier for binary and multiclass classification problems. It performs well when the relationship between features and the target is linear, which makes it a good baseline model for sentiment classification using TF-IDF vectors. In this project, it performed decently with balanced accuracy, precision, and recall across all sentiment classes.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Calculate scores
accuracy = accuracy_score(y_test, y_pred_lr)
precision = precision_score(y_test, y_pred_lr, average='macro')
recall = recall_score(y_test, y_pred_lr, average='macro')
f1 = f1_score(y_test, y_pred_lr, average='macro')

# Plotting the scores
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
scores = [accuracy, precision, recall, f1]

plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, scores, color=['skyblue', 'orange', 'green', 'purple'])
plt.ylim(0, 1)
plt.title("Logistic Regression - Evaluation Metric Scores")
plt.ylabel("Score")
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Add score values on bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 0.01, f'{yval:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['liblinear', 'saga']
}

# Grid Search
grid_lr = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, scoring='f1_macro')
grid_lr.fit(X_train_resampled, y_train_resampled)

# Best estimator
best_lr = grid_lr.best_estimator_
y_pred_best_lr = best_lr.predict(X_test)

# New evaluation
print("Tuned Model Report:\n", classification_report(y_test, y_pred_best_lr))


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV as it exhaustively searches over the parameter grid and is highly effective for small to mid-sized parameter spaces. It evaluates all combinations of parameters using cross-validation and selects the best-performing one, ensuring a thorough optimization process.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, there was a noticeable improvement in the F1-score and Recall for the minority class (e.g., Negative sentiment). The updated model is more balanced in its predictions, which is important for real-world applications where customer dissatisfaction should not be ignored.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

Random Forest is an ensemble learning method that builds multiple decision trees and combines their results to improve accuracy and reduce overfitting. It handles both categorical and numerical data efficiently and is robust to noise and overfitting. In our sentiment analysis task, Random Forest performed well across all classes, with improved recall and precision on the neutral and negative classes, which were underrepresented.

In [None]:
# Calculate scores
accuracy = accuracy_score(y_test, y_pred_rf)
precision = precision_score(y_test, y_pred_rf, average='macro')
recall = recall_score(y_test, y_pred_rf, average='macro')
f1 = f1_score(y_test, y_pred_rf, average='macro')

# Plotting
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
scores = [accuracy, precision, recall, f1]

plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, scores, color=['#99ccff', '#ffcc99', '#99ff99', '#c299ff'])
plt.ylim(0, 1)
plt.title("Random Forest - Evaluation Metric Scores")
plt.ylabel("Score")
plt.grid(axis='y', linestyle='--', alpha=0.6)

for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 0.01, f'{yval:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Train the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluation
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))

# Confusion Matrix
ConfusionMatrixDisplay.from_estimator(rf_model, X_test, y_test, cmap='Blues')
plt.title("Random Forest - Confusion Matrix")
plt.show()


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False]
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                       cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)
grid_rf.fit(X_train_resampled, y_train_resampled)

# Predict with best estimator
best_rf = grid_rf.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)

print("Tuned Random Forest Report:\n", classification_report(y_test, y_pred_best_rf))


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, as it is exhaustive and guarantees finding the best combination of parameters from the specified grid. It’s ideal for Random Forests because we can precisely control tree depth, leaf size, and number of trees — which significantly affects performance and overfitting.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the macro F1-score and precision improved, especially for the neutral and negative classes. The optimized model now generalizes better and handles minority sentiments more effectively, which is crucial for business decisions and customer satisfaction analysis.



#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

In this project, evaluation metrics such as accuracy, precision, recall, and F1-score were used to assess model performance, each carrying significant business implications. While accuracy gives an overall measure of correctness, it can be misleading in imbalanced datasets. Precision is crucial for minimizing false positives, especially for negative reviews, ensuring that the customer service team focuses only on genuine complaints. Recall is equally important as it ensures the model doesn't miss negative reviews, which, if overlooked, could lead to customer dissatisfaction and brand damage. The F1-score, being the harmonic mean of precision and recall, provides a balanced perspective and is particularly useful for maintaining a reliable sentiment detection system. Together, these metrics help businesses identify dissatisfied customers, improve service quality, and enhance customer engagement, ultimately driving customer satisfaction and business growth.

### ML Model - 3

In [None]:
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Initialize and train the model
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
xgb_model.fit(X_train_resampled, y_train_resampled)

# Predict
y_pred_xgb = xgb_model.predict(X_test)

# Evaluation
print("XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))

# Confusion Matrix
ConfusionMatrixDisplay.from_estimator(xgb_model, X_test, y_test, cmap='Blues')
plt.title("XGBoost - Confusion Matrix")
plt.show()


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

The third machine learning model implemented in this project was XGBoost (Extreme Gradient Boosting). It is an ensemble learning technique that builds multiple decision trees sequentially, where each tree learns from the errors of the previous ones. XGBoost is known for its efficiency, accuracy, and ability to handle complex datasets. It includes regularization techniques to prevent overfitting and supports parallel computation for speed. The model performed well on all evaluation metrics and particularly improved recall and F1-score compared to previous models, making it suitable for capturing both majority and minority sentiment classes.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Evaluate model
accuracy = accuracy_score(y_test, y_pred_xgb)
precision = precision_score(y_test, y_pred_xgb, average='macro')
recall = recall_score(y_test, y_pred_xgb, average='macro')
f1 = f1_score(y_test, y_pred_xgb, average='macro')

# Plotting the metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
scores = [accuracy, precision, recall, f1]

plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, scores, color=['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3'])
plt.ylim(0, 1)
plt.title("XGBoost - Evaluation Metric Scores")
plt.ylabel("Score")
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add data labels on bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 0.01, f'{yval:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

rand_xgb = RandomizedSearchCV(
    estimator=xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring='f1_macro',
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

rand_xgb.fit(X_train_resampled, y_train_resampled)
best_xgb = rand_xgb.best_estimator_
y_pred_best_xgb = best_xgb.predict(X_test)


##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV as the hyperparameter tuning technique because the search space was large and GridSearchCV would have been computationally expensive and time-consuming. RandomizedSearchCV offers a more efficient solution by sampling a subset of the parameter space, allowing faster discovery of a good-performing configuration. It balances time-efficiency and accuracy well, especially for models like XGBoost which have many tunable parameters.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After hyperparameter tuning, the macro F1-score and recall of the XGBoost model improved notably, especially for the neutral and negative sentiment classes. The updated model reduced false negatives and improved its ability to generalize across sentiment categories. Compared to the default model, the tuned XGBoost classifier provided more reliable predictions and balanced performance across all evaluation metrics, making it suitable for real-world deployment.



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered Recall, Precision, and F1-Score (especially macro average) as the most important evaluation metrics. These are critical in a multi-class sentiment analysis task where false negatives (missing a negative review) and false positives (misclassifying a neutral review as positive) can impact customer service and brand trust. F1-Score gives a balanced measure, which is important for identifying both happy and unhappy customers effectively.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I selected XGBoost as the final prediction model because it outperformed Logistic Regression and Random Forest on all key metrics after tuning. It offered better generalization, handled class imbalance more effectively, and provided the highest F1-score, indicating it can identify various sentiments more reliably — which is essential for actionable business insights.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used XGBoost as the final model and employed feature importance to understand the most impactful variables. The top features included Review_Length, Follower_Review_Ratio, and Sentiment_Score, showing that both review content and reviewer credibility influenced sentiment prediction. XGBoost’s built-in .feature_importances_ was used to visualize this. Additionally, SHAP (SHapley Additive exPlanations) could be used to gain deeper interpretability and explain how individual features influence model predictions on a per-instance basis.

# **Conclusion**

In this project, we successfully conducted a comprehensive sentiment analysis and restaurant clustering using the Zomato dataset. The objective was twofold: to help customers find the best restaurants based on reviews, cost, and cuisine, and to assist Zomato in identifying critical improvement areas to enhance their services and customer satisfaction.

We started by exploring and cleaning the dataset, handling missing values, outliers, and categorical variables effectively. Text data from customer reviews were preprocessed using NLP techniques such as tokenization, stopword removal, lemmatization, and vectorization. We also performed univariate, bivariate, and multivariate visualizations, which revealed important business insights — for instance, how cost relates to sentiment, how review count influences perception, and which cuisines and restaurants perform best.

Three machine learning models were implemented: Logistic Regression, Random Forest, and XGBoost. Among them, XGBoost outperformed the others with the highest precision, recall, and F1-score after hyperparameter tuning using RandomizedSearchCV. Its balanced performance made it the final choice for sentiment classification.

The feature importance analysis further showed that customer reviews, sentiment scores, and reviewer activity (e.g., follower count, review count) significantly influence restaurant sentiment perception. These insights can guide Zomato to improve restaurant recommendations, target high-performing segments, and identify critical customer feedback trends.

In conclusion, this data science project has not only demonstrated technical proficiency in machine learning and NLP but also created real-world value by transforming raw review data into strategic insights for both users and the Zomato platform.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
# You might need to adjust parameters like max_features, min_df, max_df, ngram_range
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit and transform the cleaned review text
X_tfidf = tfidf_vectorizer.fit_transform(df['Clean_Review'])

print("Shape of TF-IDF matrix:", X_tfidf.shape)

In [None]:
# Define features (X) and target variable (y)

# Assuming 'Sentiment_Label_Encoded' is your target variable for classification
y = df['Sentiment_Label_Encoded']

# Select features, excluding the original text review, sentiment labels and other identifier columns
# Include scaled numerical features, encoded categorical features, and PCA components if used
feature_cols = ['Cost', 'Review_Length', 'Follower_Count', 'Review_Count', 'Follower_Review_Ratio',
                'Pictures_Encoded', 'In_Collection']

# Add the top cuisine encoded columns
for cuisine in top_cuisines:
    feature_cols.append(f'Cuisine_{cuisine}')

# If you have applied PCA, include the reduced dimensions as features
# Assuming X_reduced from the PCA step is available
# You might need to concatenate the PCA components with other features

# For now, let's use the features defined without PCA components
X = df[feature_cols]

print("Shape of X (features):", X.shape)
print("Shape of y (target):", y.shape)