# **Project Name**    - Zomato Insights Engine: Clustering & Sentiment Intelligence



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1**    - Chandraprakash kahar


# **Project Summary -**

This project is all about understanding what customers are saying about restaurants on Zomato and using that information to help both users and the company. By analyzing customer reviews, we figure out whether people are happy or unhappy with their experience (sentiment analysis). We also group (cluster) restaurants based on different things like food type, cost, ratings, and review sentiment.

We use visual charts and graphs to make the data easier to understand at a glance. This helps customers quickly find the best restaurants in their area, and it helps Zomato see where restaurants are doing well or need improvement.

We also look at other useful details, like how much meals cost compared to customer satisfaction, and who the reviewers are — identifying regular foodies or critics. All of this helps Zomato make better decisions and improve their services.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Zomato, as a food delivery and restaurant discovery platform, receives a large volume of customer reviews and restaurant data. However, this raw data is unorganized and difficult to interpret without proper analysis. There is a need to:

* Understand customer sentiments from reviews to identify strengths and
weaknesses of restaurants.

* Help customers easily find the best restaurants based on their preferences and location.

* Assist Zomato in identifying areas where restaurants can improve.

* Segment (cluster) restaurants into meaningful groups for better business targeting.

* Use metadata, such as review patterns and reviewer profiles, to identify key influencers or food critics.

* Analyze the relationship between cuisine types, cost, and customer satisfaction to support business and pricing decisions.

The project aims to solve these issues through sentiment analysis, clustering, and visual data representation to drive value for both customers and the company.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following:
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# 🔧 Environment Setup

Run the first code cell under **Import Libraries** below to install the required Python libraries and download the NLP resources used for sentiment analysis and clustering. This helps avoid missing-module errors.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# 📦 Install Required Libraries
!pip install -q textblob nltk wordcloud scikit-learn-extra

# 📚 Import Common Libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string

# ⚙️ Sklearn Libraries
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# 🧠 NLP Libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import textblob.download_corpora

# 📥 Download NLP Resources
nltk.download('punkt')
nltk.download('stopwords')
textblob.download_corpora.download_all()



print("✅ Setup complete! All libraries installed and ready.")


### Dataset Loading

In [None]:

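# Try loading from the local directory first; otherwise fall back to Google Drive (Colab)
restaurant_path = "Zomato Restaurant names and Metadata.csv"
review_path = "Zomato Restaurant reviews.csv"
# Check for the same files that are actually read below
if os.path.exists(restaurant_path) and os.path.exists(review_path):
    restaurant_df = pd.read_csv(restaurant_path)
    review_df = pd.read_csv(review_path)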
else:
    from google.colab import drive
    drive.mount('/content/drive')
    restaurant_df = pd.read_csv('/content/drive/MyDrive/Labmentix intern projects/Zomato project/Zomato Restaurant names and Metadata.csv')
    review_df = pd.read_csv('/content/drive/MyDrive/Labmentix intern projects/Zomato project/Zomato Restaurant reviews.csv')


### Dataset First View

In [None]:
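# Dataset First Look
# A cell only renders its last expression, so display() is used to show both frames
from IPython.display import display

# First look at the restaurant DataFrame
display(restaurant_df.head())
# First look at the review DataFrame
display(review_df.head())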

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Restaurant dataset:\n","Rows:", restaurant_df.shape[0],"\n","Columns:",restaurant_df.shape[1],"\n")
print("Review dataset:\n","Rows:", review_df.shape[0],"\n","Columns:",review_df.shape[1],"\n")


### Dataset Information

In [None]:
# Dataset Info
# restaurant_df DataFrame features/columns informations
restaurant_df.info()
# review DataFrame features / columns informations
review_df.info()

####  Handling Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_rows_in_restaurant_df=restaurant_df.duplicated().sum()
print(f"Number of duplicate rows in restaurant data: {duplicate_rows_in_restaurant_df}\n")

duplicate_rows_in_review_df=review_df.duplicated().sum()
print(f"Number of duplicate rows in review data: {duplicate_rows_in_review_df}")

print(review_df[review_df.duplicated(keep="first")].shape)
# unique_rows_in_review_df holds the rows that are unique across all column values.
# drop_duplicates(keep="last") keeps the last occurrence of each duplicated row;
# .copy() makes the later column assignments safe from SettingWithCopyWarning.
unique_rows_in_review_df = review_df.drop_duplicates(keep="last").copy()

print(unique_rows_in_review_df.shape)

## After removing duplicate rows..
print("After removing duplicate rows..\n")
duplicate_rows_in_restaurant_df=restaurant_df.duplicated().sum()
print(f"Number of duplicate rows in restaurant data: {duplicate_rows_in_restaurant_df}\n")

duplicate_rows_in_review_df=unique_rows_in_review_df.duplicated().sum()
print(f"Number of duplicate rows in review data: {duplicate_rows_in_review_df}")


#### Analysing  Missing Values/Null Values in Datasets

In [None]:
# Missing Values/Null Values Count
print("missing values in restaurant data\n",restaurant_df.isna().sum(),"\n")
print("missing values in review data\n",unique_rows_in_review_df.isna().sum(),"\n")
print(len(restaurant_df))
print(len(unique_rows_in_review_df))



In [None]:
# Visualizing the missing values
%matplotlib inline
#plt.figure(figsize=(10,6))
missing_values=(restaurant_df.isna().sum()/len(restaurant_df))*100
missing_values.plot(kind='bar',color= 'skyblue')


# Set plot labels and title
plt.title('Missing Values in Each Column (Restaurant Data)', fontsize=14)
plt.xlabel('Columns', fontsize=12)
plt.ylabel('Missing Values (%)', fontsize=12)

# Show the plot
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

In [None]:
# Visualizing the missing values

#plt.figure(figsize=(10,6))
missing_values=(unique_rows_in_review_df.isna().sum()/len(unique_rows_in_review_df))*100
missing_values.plot(kind='bar',color= 'skyblue')

# Set plot labels and title
plt.title('Missing Values in Each Column (Review Data)', fontsize=14)
plt.xlabel('Columns', fontsize=12)
plt.ylabel('Missing Values (%)', fontsize=12)

# Show the plot
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
#Restaurant dataset columns ..
print(restaurant_df.columns)


#Review dataset columns ..
print(unique_rows_in_review_df.columns)


In [None]:
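# Dataset Describe
from IPython.display import display

# Dataset 1: Restaurant Data
# Description: Contains metadata about restaurants, scraped from the Zomato platform.
display(restaurant_df.describe(include='all'))

# Dataset 2: Review Data
# Description: User-generated reviews for restaurants in Dataset 1.
display(unique_rows_in_review_df.describe(include='all'))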



### Variables Description

Dataset 1: Restaurant Data
Description: Contains metadata about restaurants

Fields/features:

 * Name: Restaurant name (text)

   Example: "Spice Valley", "The Pasta Factory"

 * Links: URL to restaurant page (web links)
  
   Analytical Use: Could extract domain patterns or link popularity.

 * Cost: Estimated per-person dining cost (numerical)
  
   Type: Continuous (e.g., ₹500, ₹1500)
  
   Key Metric: Price positioning of restaurants.

 * Collection: Zomato's categorization tags (categorical)
  
   Use: Segment restaurants by type.

 * Cuisines: Served cuisines (multi-label categorical)
  
   Format: Often comma-separated (e.g., "Italian, Mediterranean")
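
Because `Cuisines` packs several labels into one comma-separated string, per-cuisine analysis needs one row per restaurant and cuisine pair. The sketch below shows one way to reshape it with `explode`; the sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real data lives in restaurant_df.
sample = pd.DataFrame({
    "Name": ["Spice Valley", "The Pasta Factory"],
    "Cuisines": ["North Indian, Chinese", "Italian, Mediterranean, Chinese"],
})

# Split on commas, emit one row per cuisine, then strip stray whitespace.
exploded = (
    sample.assign(Cuisine=sample["Cuisines"].str.split(","))
    .explode("Cuisine")
    .assign(Cuisine=lambda d: d["Cuisine"].str.strip())
    .reset_index(drop=True)
)

# Number of restaurants offering each cuisine.
cuisine_counts = exploded["Cuisine"].value_counts()
print(cuisine_counts)
```

Splitting on `","` and stripping afterwards tolerates inconsistent spacing after the comma, which a split on `", "` would not.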




---






Dataset 2: Review Data
Description: User-generated reviews for restaurants in Dataset 1.

 Fields/Features:

 * Review: Text content of reviews (text)

  Analytical Use: Sentiment analysis, keyword extraction.

 * Reviewer: Reviewer name (text)

   Use: Identify frequent reviewers (optional anonymization).

 * Rating: Numerical rating (ordinal)

   Scale: Likely 1–5 (stars).

 * Metadata: Reviewer’s profile stats (text)

   Format: "X reviews, Y followers" (e.g., "50 reviews, 200 followers")

 * Time: Timestamp of review (datetime)

   Use: Trend analysis (e.g., rating changes over time).

 * Pictures: Count of images in review (numerical)
   
   Use: Correlate visual content with rating/sentiment.
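
The `Metadata` field described above combines two counts in one string, so it has to be parsed before it can be used numerically. A minimal sketch of such a parser follows; the function name `parse_metadata` and the exact string pattern are assumptions based on the format shown above, and either part is allowed to be missing.

```python
import re

# Assumed pattern: "<n> review(s), <m> follower(s)" in any letter case.
_REVIEWS = re.compile(r"(\d+)\s+reviews?", re.IGNORECASE)
_FOLLOWERS = re.compile(r"(\d+)\s+followers?", re.IGNORECASE)

def parse_metadata(text):
    """Return (review_count, follower_count); missing parts default to 0."""
    if not isinstance(text, str):
        return 0, 0  # NaN or None: no profile information available
    reviews = _REVIEWS.search(text)
    followers = _FOLLOWERS.search(text)
    return (
        int(reviews.group(1)) if reviews else 0,
        int(followers.group(1)) if followers else 0,
    )

print(parse_metadata("50 reviews, 200 followers"))  # (50, 200)
print(parse_metadata("1 Review"))                   # (1, 0)
```

Matching singular and plural forms in one pattern avoids needing two extraction passes.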




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
def print_unique_values(df):
  data={}
  for col in df.columns:
    data[col]=len(df[col].unique())

  for (key,value) in data.items():
    print(f"unique value counts of {key} are :{value}")

  print("\n")

In [None]:

## Unique Values for each variable in Restaurant dataset...
print_unique_values(restaurant_df)

## Unique Values for each variable in Review dataset...
print_unique_values(review_df)

## 3. ***Data Wrangling***

### Data Wrangling Code along with Feature Engineering & Data Pre-processing

In [None]:
import warnings
warnings.filterwarnings('ignore')
# Write your code to make your dataset analysis ready.


# 1. Convert the Cost column to a numeric dtype
restaurant_df['Cost'] = pd.to_numeric(
    restaurant_df["Cost"].astype(str)
    .str.replace(r"[₹$€£,\s]", "", regex=True),  # Remove currency symbols, thousands separators, spaces
    errors="coerce",                              # Unparseable values (e.g. "nan") become NaN
)

#print(restaurant_df['Cost'].dtype)
#restaurant_df.head()

## ..Dealing with missing values in restaurant dataset and review data set...


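# Restaurant Data Missing Value Treatment
# 1. Name: rows without a restaurant name cannot be matched to reviews
restaurant_df = restaurant_df.dropna(subset=['Name']).copy()

# 2. Links
restaurant_df['Links'] = restaurant_df['Links'].fillna('URL not available')

# 3. Collections
restaurant_df['Collections'] = restaurant_df['Collections'].fillna('Unknown')

# 4. Cuisines
restaurant_df['Cuisines'] = restaurant_df['Cuisines'].fillna('Other')

# 5. Cost: impute with the median of the (Collections, Cuisines) group and fall back
#    to the overall median. This runs after steps 3-4 because groupby drops rows whose
#    keys are NaN, which would otherwise blank out Cost for those rows.
overall_cost_median = restaurant_df['Cost'].median()
restaurant_df['Cost'] = restaurant_df.groupby(['Collections', 'Cuisines'])['Cost'].transform(
    lambda x: x.fillna(x.median() if not np.isnan(x.median()) else overall_cost_median)
)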

# 6. Timings: the column is not parsed as datetime here, so a median is undefined;
#    fill gaps with the most frequent value, or a placeholder if no value exists
mode_time = restaurant_df['Timings'].mode()
fill_time = mode_time[0] if not mode_time.empty else 'Not available'

# Fill missing values
restaurant_df['Timings'] = restaurant_df['Timings'].fillna(fill_time)


## Review Data Missing Value Treatment
#1. Review
unique_rows_in_review_df['Review'] = unique_rows_in_review_df.apply(
    lambda x: f"Rating: {x['Rating']}" if pd.isna(x['Review']) else x['Review'],
    axis=1
)
#2. Reviewer
unique_rows_in_review_df['Reviewer'] = unique_rows_in_review_df['Reviewer'].fillna('Anonymous')

#3. Rating: missing ratings get the placeholder value 2.5
unique_rows_in_review_df['Rating'] = unique_rows_in_review_df['Rating'].fillna(2.5)

#unique_rows_in_review_df['Rating'] = pd.to_numeric(unique_rows_in_review_df['Rating'], errors='coerce')


#4. Metadata: extract reviewer stats; "Reviews?" matches both "Review" and "Reviews"
unique_rows_in_review_df['Reviewer_Reviews'] = (
    unique_rows_in_review_df['Metadata'].str.extract(r'(\d+) Reviews?', expand=False)
    .fillna(0).astype(int)
)
unique_rows_in_review_df['Reviewer_Followers'] = (
    unique_rows_in_review_df['Metadata'].str.extract(r'(\d+) Followers?', expand=False)
    .fillna(0).astype(int)
)
unique_rows_in_review_df['Metadata'] = unique_rows_in_review_df['Metadata'].fillna('0 reviews , 0 followers')

# 5. Time
# Fill missing Time values with the mode, then the median, then the current time.
# Convert to datetime with the exact format
unique_rows_in_review_df['Time'] = pd.to_datetime(unique_rows_in_review_df['Time'], format='%m/%d/%Y %H:%M', errors='coerce')

# Get mode while preserving datetime type
mode_time = unique_rows_in_review_df['Time'].mode()
if not mode_time.empty:
    fill_time = mode_time[0]
else:
    # Fallback to median time if no mode exists (checked on the same deduplicated frame)
    fill_time = unique_rows_in_review_df['Time'].median() if not unique_rows_in_review_df['Time'].isna().all() else pd.Timestamp.now()

# Fill missing values
unique_rows_in_review_df['Time'] = unique_rows_in_review_df['Time'].fillna(fill_time)

# Convert back to original string format
unique_rows_in_review_df['Time'] = unique_rows_in_review_df['Time'].dt.strftime('%m/%d/%Y %H:%M')


#6. Pictures
unique_rows_in_review_df['Pictures'] = unique_rows_in_review_df['Pictures'].fillna(0)

In [None]:

##Handling cost feature
restaurant_df['Cost'] = restaurant_df.groupby(['Collections', 'Cuisines'])['Cost'].transform(
    lambda x: x.fillna(x.median() if not np.isnan(x.median()) else restaurant_df['Cost'].median())
)
# Print explicitly: a bare expression in the middle of a cell is not displayed
print(unique_rows_in_review_df['Rating'].isna().sum())
print(len(restaurant_df.Cost.value_counts()))

In [None]:
restaurant_df.info()
unique_rows_in_review_df.info()
print(restaurant_df.isna().sum())
print(unique_rows_in_review_df.isna().sum())

In [None]:
# Print each value explicitly; only the last bare expression of a cell is displayed
print(unique_rows_in_review_df.shape)
print(len(restaurant_df.Cost.value_counts()))
print(unique_rows_in_review_df.columns)

### What all manipulations have you done and insights you found?

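Manipulations performed:

* Removed exact duplicate rows from the review data, keeping the last occurrence.
* Converted `Cost` from text to a numeric column by stripping currency symbols and thousands separators.
* Dropped restaurants with a missing `Name`; filled missing `Links`, `Collections`, and `Cuisines` with placeholder values, `Timings` with the most frequent value, and `Cost` with group medians.
* Filled missing review fields: `Review` with a "Rating: x" placeholder, `Reviewer` with "Anonymous", `Rating` with 2.5, `Pictures` with 0, and `Time` with the most frequent timestamp.
* Extracted the numeric features `Reviewer_Reviews` and `Reviewer_Followers` from `Metadata`.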


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
from collections import Counter

#### Chart - 1

In [None]:
#Chart 1: Distribution of Restaurant Cost

plt.figure(figsize=(10,6))
sns.histplot(restaurant_df['Cost'], bins=20, kde=True, color='orange')
plt.title('Distribution of Restaurant Cost')
plt.xlabel('Estimated Cost per Person')  # matches the Cost definition in Variables Description
plt.ylabel('Frequency')
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?

 A histogram with a KDE overlay shows the overall pricing range and where restaurant costs concentrate.

##### 2. What is/are the insight(s) found from the chart?

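Insight: The histogram shows which price range holds most restaurants and whether a tail of high-cost outlets exists. Attributing the high end to luxury and fine dining, or the low end to cafés and quick bites, needs a breakdown by collection or cuisine, which this chart alone does not provide.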

In [None]:
## Insights
top_prices = Counter(restaurant_df['Cost']).most_common(10)

prices, counts = zip(*top_prices)

# most_common() sorts by frequency, so the first entry is the modal price
top_price = prices[0]
print(f"{top_price} is the most frequent price.")

Tells us whether the platform has more budget, mid-range, or premium options.
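
One way to make the budget, mid-range, and premium split explicit is to bin `Cost` into bands. The sketch below uses `pd.cut`; the sample costs and the band edges (500 and 1500) are hypothetical and should be adjusted to the real distribution.

```python
import pandas as pd

# Hypothetical costs and band edges; tune both against restaurant_df['Cost'].
costs = pd.Series([150, 400, 500, 800, 1200, 2500], name="Cost")
bins = [0, 500, 1500, float("inf")]
labels = ["Budget", "Mid-range", "Premium"]

# Intervals are closed on the right by default, so exactly 500 counts as "Budget".
bands = pd.cut(costs, bins=bins, labels=labels)

# Share of restaurants in each band, kept in band order rather than sorted by size.
band_share = bands.value_counts(normalize=True, sort=False)
print(band_share)
```

In the notebook, the same call could be applied to `restaurant_df['Cost']` once the column is numeric.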



##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Helps target ads/promotions to price-sensitive customers.

❌ Negative: Too many high-cost places may deter budget customers.

#### Chart - 2

In [None]:
#Chart 2: Top 10 Most Common Cuisines


cuisine_list = restaurant_df['Cuisines'].str.split(', ')
flat_cuisines = [item for sublist in cuisine_list for item in sublist]
top_cuisines = Counter(flat_cuisines).most_common(10)

cuisines, counts = zip(*top_cuisines)

plt.figure(figsize=(10,6))
sns.barplot(x=list(counts), y=list(cuisines), palette="viridis")
plt.title('Top 10 Most Common Cuisines')
plt.xlabel('Count')
plt.ylabel('Cuisine')
plt.show()


##### 1. Why did you pick the specific chart?

 A horizontal bar chart ranks cuisines by the number of restaurants that offer them.


##### 2. What is/are the insight(s) found from the chart?

 Identifies the cuisines offered by the most restaurants, which serves as a proxy for popularity.

In [None]:
## Insight
top_cuisine = cuisines[np.argmax(counts)]
print(f"{top_cuisine} is the most common cuisine.")

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Business Impact:

✅ Positive: Help onboard restaurants with popular cuisines.

❌ Negative: Overrepresentation of few cuisines limits diversity.

#### Chart 3: Count of Restaurants per Collection

In [None]:
#Chart 3: Count of Restaurants per Collection
collections_count = restaurant_df['Collections'].value_counts().head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=collections_count.values, y=collections_count.index, palette="coolwarm")
plt.title("Top 10 Restaurant Collections")
plt.xlabel("Number of Restaurants")
plt.ylabel("Collections")
plt.show()


##### 1. Why did you pick the specific chart?

 Shows what themes or types of dining experiences are most frequent.

##### 2. What is/are the insight(s) found from the chart?

 Determines focus areas (e.g., "Date Night", "Trending").

In [None]:
## Insight
print("The most frequent collections are:\n", collections_count.index)


##### 3. Will the gained insights help creating a positive business impact?



Business Impact:

✅ Positive: Useful for content curation and marketing.

❌ Negative: Underrepresented collections may go unnoticed.

#### Chart - 4 Average Cost by Cuisine (Top 10)

In [None]:
# Chart 4: Average Cost by Cuisine (Top 10)
cuisine_cost = []
# Use a distinct loop variable so the `cuisines` tuple from Chart 2 is not overwritten
for cuisine_str, cost in zip(restaurant_df['Cuisines'], restaurant_df['Cost']):
    for cuisine in cuisine_str.split(', '):
        cuisine_cost.append((cuisine, cost))

df_cuisine_cost = pd.DataFrame(cuisine_cost, columns=['Cuisine', 'Cost'])
avg_cost = df_cuisine_cost.groupby('Cuisine')['Cost'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
avg_cost.plot(kind='barh', color='green')
plt.title('Top 10 Cuisines by Average Cost')
plt.xlabel('Average Cost')
plt.gca().invert_yaxis()
plt.show()


##### 1. Why did you pick the specific chart?

Understand cost dynamics across cuisines.

##### 2. What is/are the insight(s) found from the chart?


Reveals which cuisines are premium vs. budget.

##### 3. Will the gained insights help creating a positive business impact?





Business Impact:

✅ Positive: Align pricing strategy by cuisine.

❌ Negative: Mispricing can lead to poor conversions.

#### Chart - 5 Number of Pictures Posted per Review (Distribution)

In [None]:
#Chart 5: Number of Pictures Posted per Review (Distribution)
plt.figure(figsize=(10,6))
# Use the deduplicated, cleaned review frame for consistency with the other charts
sns.histplot(unique_rows_in_review_df['Pictures'], bins=15, kde=True, color='purple')
plt.title('Distribution of Number of Pictures per Review')
plt.xlabel('Number of Pictures')
plt.ylabel('Number of Reviews')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Measures visual content engagement.

##### 2. What is/are the insight(s) found from the chart?

 Shows whether reviewers attach images and, if so, how many per review.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Images drive engagement.

❌ Negative: Low picture count → reduced trust factor.

#### Chart - 6 Top 10 Most Reviewed Restaurants

In [None]:
# Chart - 6 Top 10 Most Reviewed Restaurants
top_reviewed = unique_rows_in_review_df['Restaurant'].value_counts().head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=top_reviewed.values, y=top_reviewed.index, palette="mako")
plt.title('Top 10 Most Reviewed Restaurants')
plt.xlabel('Number of Reviews')
plt.ylabel('Restaurant')
plt.show()



##### 1. Why did you pick the specific chart?

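A horizontal bar chart ranks restaurants by review count, which makes the most-reviewed restaurants easy to compare.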

##### 2. What is/are the insight(s) found from the chart?

In [None]:
##Insight

print("Top 10 most reviewed restaurants are:\n", top_reviewed.index)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7  Sentiment Distribution of Reviews

In [None]:
# Chart - 7 visualization code


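def get_sentiment(text):
    """Label a review by the sign of its TextBlob polarity score (-1 to 1)."""
    # str() guards against non-string values such as NaN
    polarity = TextBlob(str(text)).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    return 'Neutral'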

unique_rows_in_review_df['Sentiment'] = unique_rows_in_review_df['Review'].apply(get_sentiment)

plt.figure(figsize=(8,6))
sns.countplot(x='Sentiment', data=unique_rows_in_review_df, palette='Set2')
plt.title('Sentiment Distribution of Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

Understand customer mood.

##### 2. What is/are the insight(s) found from the chart?

Overall public sentiment.
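
The `get_sentiment` function above labels any non-zero polarity as positive or negative, so very weak scores still pick a side. A possible refinement, shown here only as a sketch, treats scores near zero as neutral. The function name `label_polarity` and the tolerance `neutral_band=0.05` are hypothetical choices, not values taken from the notebook.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical tolerance: scores within +/- neutral_band
    of zero are labelled 'Neutral' instead of being forced to one side.
    """
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"

# Example scores; in the notebook these would come from TextBlob(...).sentiment.polarity.
for score in (0.6, 0.02, -0.01, -0.4):
    print(score, label_polarity(score))
```

A wider band yields more neutral labels, so the threshold is worth checking against a sample of reviews before relying on the resulting distribution.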

In [None]:
senti_df = unique_rows_in_review_df['Sentiment'].value_counts()
most_common_sentiment = senti_df.idxmax()
print("Overall public mood is:", most_common_sentiment)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Positive sentiment boosts conversions.

❌ Negative: Excessive negative sentiment → trust issues.

#### Chart - 8 Word Cloud of Review Text

In [None]:
# Chart - 8 Word Cloud of Review Text

from wordcloud import WordCloud
text = ' '.join(unique_rows_in_review_df['Review'].dropna().tolist())

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

 Visualize most used terms.

##### 2. What is/are the insight(s) found from the chart?

In [None]:
word_list = list(wordcloud.words_.keys())
print("Top 10 most used terms in reviews:\n", word_list[:10])  # Top 10 words


Insight: Keyword trends in customer feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Surface customer priorities.

❌ Negative: Repetitive negative terms signal service gaps.



#### Chart - 9 Rating Distribution

In [None]:
# Chart - 9 Rating Distribution
plt.figure(figsize=(8,6))
sns.countplot(x='Rating', data=unique_rows_in_review_df, order=unique_rows_in_review_df['Rating'].value_counts().index)
plt.title('Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

Shows general satisfaction levels.


##### 2. What is/are the insight(s) found from the chart?

 Shows whether ratings are skewed toward one end of the scale, for example a large share of 5-star reviews.


In [None]:
rating_df = unique_rows_in_review_df['Rating'].value_counts()
most_rating = rating_df.idxmax()
print(f"The most common rating is {most_rating} stars.")

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: High ratings validate restaurant performance.

❌ Negative: Too many 5-star reviews might raise suspicion.

#### Chart - 10 Number of Reviews per Reviewer (Distribution)

In [None]:
# Chart - 10 Number of Reviews per Reviewer (Distribution)
plt.figure(figsize=(10,6))
sns.histplot(unique_rows_in_review_df['Reviewer_Reviews'], bins=20, kde=True)
plt.title('Distribution of Number of Reviews per Reviewer')
plt.xlabel('Review Count')
plt.ylabel('Reviewer Count')
plt.show()


##### 1. Why did you pick the specific chart?

Understand reviewer activity.

##### 2. What is/are the insight(s) found from the chart?

Shows platform dependency on few vs. many users.
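
To put a number on that dependency, one option is the share of all reviews written by the most active reviewers. The sketch below is illustrative: the function name `top_reviewer_share`, the `top_fraction` parameter, and the sample names are assumptions, not part of the notebook.

```python
from collections import Counter

def top_reviewer_share(reviewers, top_fraction=0.1):
    """Share of all reviews written by the most active `top_fraction` of reviewers."""
    counts = sorted(Counter(reviewers).values(), reverse=True)
    if not counts:
        return 0.0
    # Always include at least one reviewer in the "top" group.
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)

# Hypothetical reviewer names, one entry per review row.
sample = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
print(top_reviewer_share(sample, top_fraction=0.25))  # 0.6
```

Passing `unique_rows_in_review_df['Reviewer']` would apply the same calculation to the review data; a value close to `top_fraction` itself suggests activity is spread evenly, while a much larger value points to a few heavy contributors.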

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Strong reviewer base.

❌ Negative: Few super-reviewers may skew trends.

#### Chart - 11 Reviewer Followers (Distribution)

In [None]:
#  Chart 11 - Reviewer Followers (Distribution)
plt.figure(figsize=(8,4))
sns.histplot(unique_rows_in_review_df['Reviewer_Followers'], bins=20, kde=True, color='teal')
plt.title('Distribution of Reviewer Followers')
plt.xlabel('Followers')
plt.ylabel('Reviewer Count')
plt.show()


##### 1. Why did you pick the specific chart?

Highlights influencer presence.

##### 2. What is/are the insight(s) found from the chart?

Shows how follower counts are distributed and whether a small group of high-follower reviewers stands out as potential influencers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Boost high-follower reviews.

❌ Negative: Fake followers can mislead.

#### Chart - 12 Heatmap of Review Counts vs Rating

In [None]:
# Chart - 12 Heatmap of Review Counts vs Rating
heat_df = unique_rows_in_review_df.groupby(['Restaurant', 'Rating']).size().unstack().fillna(0)

plt.figure(figsize=(12,8))
sns.heatmap(heat_df, cmap='YlGnBu')
plt.title('Heatmap of Review Counts vs Rating')
plt.xlabel('Rating')
plt.ylabel('Restaurant')
plt.show()


##### 1. Why did you pick the specific chart?

Combine volume and satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Popular + high-rated = trusted brands.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Target high traffic, well-rated places.

❌ Negative: High volume, low rating = risk area.

#### Chart - 13 Time of Review Posting

In [None]:
# Chart - 13 Time of Review Posting
review_df['Time'] = pd.to_datetime(review_df['Time'], errors='coerce')
review_df['Hour'] = review_df['Time'].dt.hour

plt.figure(figsize=(10,6))
sns.histplot(review_df['Hour'].dropna(), bins=24, kde=True)
plt.title('Time of Day When Reviews Are Posted')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Reviews')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

See when users are most active.

##### 2. What is/are the insight(s) found from the chart?

Discover user behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

✅ Positive: Post promotions during peak times.

❌ Negative: Inactive hours = inefficient content push.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only numeric columns from the review dataset
numeric_cols = unique_rows_in_review_df[['Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']]

# Compute correlation matrix
corr_matrix = numeric_cols.corr()

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Reviewer Features')
plt.show()


##### 1. Why did you pick the specific chart?

To understand how numerical features (e.g., Pictures, Reviewer_Reviews, Reviewer_Followers) relate to each other. It's a diagnostic tool to explore feature importance or redundancy.

##### 2. What is/are the insight(s) found from the chart?

Insight it gives:

Detects whether reviewers with more followers post more pictures.

Finds any relationships (positive or negative) among variables.
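
The heatmap uses the default Pearson coefficient from `.corr()`, which measures linear association and can be dominated by a few extreme values. Count features such as reviews and followers are often heavily skewed, so a rank-based Spearman correlation can be a useful cross-check. The sketch below compares the two on made-up numbers that only stand in for the reviewer features.

```python
import pandas as pd

# Hypothetical values: a perfectly monotonic relationship with one extreme follower count.
sample = pd.DataFrame({
    "Reviewer_Reviews": [1, 2, 3, 4, 5, 6],
    "Reviewer_Followers": [1, 2, 3, 4, 5, 1000],
})

pearson = sample.corr(method="pearson")    # linear association, sensitive to outliers
spearman = sample.corr(method="spearman")  # association between ranks

print(pearson.round(2))
print(spearman.round(2))
```

When the two coefficients disagree noticeably, the relationship is probably monotonic but not linear, or a handful of outliers is driving the Pearson value.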

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Attach each restaurant's Cost to its reviews by name, so every row describes one review.
# (Assigning review columns to restaurant_df by position would pair unrelated rows.)
pairplot_df = unique_rows_in_review_df[
    ['Restaurant', 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']
].merge(restaurant_df[['Name', 'Cost']], left_on='Restaurant', right_on='Name', how='inner')
pairplot_df = pairplot_df[['Cost', 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']]

# Optional: filter out extreme outliers for clearer visualization
pairplot_df = pairplot_df[(pairplot_df['Cost'] < 3000) &
                          (pairplot_df['Reviewer_Reviews'] < 1000) &
                          (pairplot_df['Reviewer_Followers'] < 5000) &
                          (pairplot_df['Pictures'] < 20)]

# Create pair plot
sns.pairplot(pairplot_df, diag_kind='kde', plot_kws={'alpha':0.6, 's':40, 'edgecolor':'k'})
plt.suptitle('Pair Plot: Cost, Pictures, Reviews & Followers', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots are used to explore relationships between multiple numerical variables in a single glance using scatterplots, histograms, and KDEs. It’s useful for detecting trends, outliers, and correlations.

##### 2. What is/are the insight(s) found from the chart?

It can show how numerical variables like Cost, Reviewer_Reviews, Reviewer_Followers, and Pictures interact with each other — which pairs have potential linear or non-linear relationships.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


✅ If p-value < 0.05 → Reject the null hypothesis → There is a statistically significant effect/relationship.


❌ If p-value ≥ 0.05 → Fail to reject the null → There is no statistically significant effect/relationship.


In [None]:
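# Print the conclusion implied by the p-value at the 5% significance level
def print_Stating_rearch_hypothesis(p_val, h_0, h_1):
  if float(p_val) < 0.05:
    # Significant result: reject the null and report the alternative hypothesis
    print(h_1)
  else:
    # Not significant: fail to reject the null hypothesis
    print(h_0)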


### Hypothetical Statement - 1

Hypothesis: Higher-rated restaurants tend to be more expensive.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H₀):
There is no significant difference in cost between high-rated and low-rated restaurants.

2. Alternative Hypothesis (H₁):
High-rated restaurants have significantly higher average cost than low-rated restaurants.



In [None]:
H_0= "There is no significant difference in cost between high-rated and low-rated restaurants."

H_1= "High-rated restaurants have significantly higher average cost than low-rated restaurants."

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind


# Convert rating to numeric
unique_rows_in_review_df['Rating'] = pd.to_numeric(unique_rows_in_review_df['Rating'], errors='coerce')

# Attach each review's rating to its restaurant row (one row per review, so Cost repeats for every review of a restaurant)
merged_df = pd.merge(restaurant_df, unique_rows_in_review_df[['Restaurant', 'Rating']], left_on='Name', right_on='Restaurant', how='inner')

# Group: High-rated (>=4.0), Low-rated (<4.0)
high_rated = merged_df[merged_df['Rating'] >= 4.0]['Cost']
low_rated = merged_df[merged_df['Rating'] < 4.0]['Cost']

# Perform t-test
t_stat, p_value = ttest_ind(high_rated, low_rated, nan_policy='omit')
print(f"P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test (2 groups)

The test proceeds in two steps:

1. Classify restaurants as high-rated (rating ≥ 4.0) or low-rated (rating < 4.0).
2. Compare the average cost of the two groups.

##### Why did you choose the specific statistical test?

We're comparing the means of two independent groups (high-rated vs. low-rated) on a numerical outcome (Cost).
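The alternative hypothesis above is directional ("higher average cost"), and the two rating groups may differ in size and variance. As a hedged illustration, the sketch below computes Welch's t statistic and its approximate degrees of freedom by hand. The cost values are hypothetical and `welch_t` is an illustrative helper, not part of the notebook. In SciPy, the matching call would be `ttest_ind(high_rated, low_rated, equal_var=False, alternative='greater')` (the `alternative` argument assumes SciPy 1.6 or later).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each group mean: sample variance divided by group size
    var_a = variance(sample_a) / n_a
    var_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical per-restaurant costs, for illustration only
high_rated_cost = [1200, 1500, 900, 1800, 1300]
low_rated_cost = [600, 800, 700, 500, 900]

t_stat, dof = welch_t(high_rated_cost, low_rated_cost)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

A positive t statistic points in the direction of H₁; the one-sided p-value is then read from the t distribution with `dof` degrees of freedom.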

In [None]:
# Print the hypothesis supported by the p-value
print_Stating_rearch_hypothesis(p_value,H_0,H_1)

### Hypothetical Statement - 2


Hypothesis: There is a relationship between the type of cuisine and the average restaurant rating.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H₀):
Cuisine type has no impact on average rating.

2. Alternative Hypothesis (H₁):
Cuisine type affects average rating.

In [None]:
H_0= "Cuisine type has no impact on average rating."

H_1= "Cuisine type affects average rating."

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Merge Rating into restaurant_df
merged_df = pd.merge(restaurant_df, review_df[['Restaurant', 'Rating']], left_on='Name', right_on='Restaurant', how='inner')
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')

# Group samples by cuisines
italian = merged_df[merged_df['Cuisines'].str.contains('Italian', na=False)]['Rating'].dropna()
indian = merged_df[merged_df['Cuisines'].str.contains('Indian', na=False)]['Rating'].dropna()
chinese = merged_df[merged_df['Cuisines'].str.contains('Chinese', na=False)]['Rating'].dropna()

# Perform ANOVA
f_stat, p_value = f_oneway(italian, indian, chinese)
print(f"P-Value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

ANOVA (Analysis of Variance)

Compare ratings across 3 or more cuisine types

##### Why did you choose the specific statistical test?

We're comparing the mean rating across three or more groups (different cuisine types).
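One-way ANOVA assumes the groups are independent, but substring filters on `Cuisines` can place the same restaurant in several groups (for example, "North Indian, Chinese" matches both 'Indian' and 'Chinese'). A minimal sketch of one way around this, assuming the first-listed cuisine can be treated as the primary one: assign each row to a single group, then compute the F statistic. The rows are hypothetical, and `primary_cuisine` and `one_way_f` are illustrative helpers.

```python
from collections import defaultdict
from statistics import mean


def primary_cuisine(cuisines):
    """Return the first cuisine in a comma-separated string."""
    return cuisines.split(',')[0].strip()


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of numeric samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical (Cuisines, Rating) rows, for illustration only
rows = [
    ("Italian, Continental", 1), ("Italian", 2), ("Italian, Cafe", 3),
    ("North Indian, Chinese", 2), ("North Indian", 3), ("North Indian, Mughlai", 4),
    ("Chinese, Italian", 3), ("Chinese", 4), ("Chinese, Thai", 5),
]

ratings_by_cuisine = defaultdict(list)
for cuisines, rating in rows:
    ratings_by_cuisine[primary_cuisine(cuisines)].append(rating)

f_stat = one_way_f(list(ratings_by_cuisine.values()))
print(dict(ratings_by_cuisine))
print(f"F = {f_stat:.2f}")
```

With non-overlapping groups built this way, the same lists can be passed to `f_oneway` to obtain the p-value.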

In [None]:
# Print the hypothesis supported by the p-value
print_Stating_rearch_hypothesis(p_value,H_0,H_1)

### Hypothetical Statement - 3

Hypothesis: There is a relationship between a reviewer's number of followers and the number of reviews they have posted.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H₀):
There is no relationship between number of followers and number of reviews.

2. Alternative Hypothesis (H₁):
There is a significant correlation between number of followers and reviews posted.

In [None]:
H_0= "There is no relationship between number of followers and number of reviews."

H_1= "There is a significant correlation between number of followers and reviews posted."

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

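# Keep only rows where both values are present, since pearsonr does not skip NaN values
pair_df = unique_rows_in_review_df[['Reviewer_Followers', 'Reviewer_Reviews']].dropna()
followers = pair_df['Reviewer_Followers']
review_count = pair_df['Reviewer_Reviews']

corr_coef, p_value = pearsonr(followers, review_count)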
print(f"Correlation Coefficient: {corr_coef}")
print(f"P-Value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation

##### Why did you choose the specific statistical test?

We're testing for a linear correlation between two continuous variables (followers and reviews).
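Pearson's coefficient measures linear association and is sensitive to extreme values, which follower and review counts often contain. As a robustness check, a rank-based (Spearman) correlation can be compared with it. The sketch below implements both by hand on hypothetical counts; `average_ranks`, `pearson`, and `spearman` are illustrative helpers. SciPy offers the same measure as `scipy.stats.spearmanr`.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical reviewer counts with one very large follower value
followers = [0, 2, 5, 40, 3000]
reviews = [1, 3, 4, 20, 25]

pearson_r = pearson(followers, reviews)
spearman_r = spearman(followers, reviews)
print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

When the two coefficients differ noticeably, the relationship is likely monotonic but not linear, or it is driven by a few extreme reviewers.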

In [None]:
print_Stating_rearch_hypothesis(p_value,H_0,H_1)

## ***6. Textual Data Preprocessing***

(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
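!pip install contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the corpora used by the stopword filter and the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()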

In [None]:
# Expand Contraction

import contractions
def expand_contractions(text):
    return contractions.fix(text)


#### 2. Lower Casing

In [None]:
# Lower Casing
def to_lowercase(text):
    return text.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))


#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs and words that contain digits
import re

def remove_urls_and_alnum(text):
    text = re.sub(r'http\S+|www\S+', '', text, flags=re.MULTILINE)  # URLs (http\S+ also covers https)
    text = re.sub(r'\w*\d\w*', '', text)  # Words with digits
    return text


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # fetch the stopword corpus without overwriting the imported module
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])



In [None]:
# Remove White spaces

def remove_extra_whitespace(text):
    return ' '.join(text.split())

#### 6. Rephrase Text

In [None]:
# Rephrase Text
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])


#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    return word_tokenize(text)


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
def normalize_text(text):
    text = expand_contractions(text)
    text = to_lowercase(text)
    text = remove_punctuation(text)
    text = remove_urls_and_alnum(text)
    text = remove_stopwords(text)
    text = remove_extra_whitespace(text)
    text = lemmatize_text(text)
    return text


##### Which text normalization technique have you used and why?

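Lemmatization, using NLTK's `WordNetLemmatizer` as the last step of `normalize_text`, after contraction expansion, lowercasing, punctuation, URL and digit removal, stopword removal, and whitespace cleanup. Lemmatization reduces inflected forms to a dictionary form (for example, "dishes" becomes "dish"), so variants of a word are counted as the same token. Unlike stemming, it returns real words, which keeps the cleaned reviews usable for the lexicon-based sentiment scoring that follows.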

#### 9. Part of speech tagging

In [None]:
# POS Tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    return nltk.pos_tag(tokens)


#### 10. Text Vectorization

In [None]:
# # Vectorizing Text
# from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# X_tfidf = tfidf_vectorizer.fit_transform(df['Cleaned_Review'])  # after applying normalize_text()


In [None]:
# Apply to review data
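# Treat missing reviews as empty strings so the text functions do not fail on NaN
unique_rows_in_review_df['Cleaned_Review'] = unique_rows_in_review_df['Review'].fillna('').astype(str).apply(normalize_text)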

In [None]:
from textblob import TextBlob

def get_sentiment_score(text):
    return TextBlob(text).sentiment.polarity  # Range: [-1, 1]

unique_rows_in_review_df['Sentiment_Score'] = unique_rows_in_review_df['Cleaned_Review'].apply(get_sentiment_score)
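One caveat with scoring the cleaned text: NLTK's English stopword list includes negations such as "not", "no", and "nor", so `remove_stopwords` can turn "not good" into "good" before the polarity is computed. The sketch below shows a negation-aware variant on a small hand-written stopword set, used here only so the example runs without downloads. `remove_stopwords_keep_negations` and both word sets are illustrative assumptions, not part of the notebook.

```python
# Small illustrative stopword set; NLTK's English list is much longer
ILLUSTRATIVE_STOPWORDS = {"the", "was", "is", "a", "and", "not", "no", "nor"}
# Words that flip the meaning of what follows and should survive cleaning
NEGATIONS = {"not", "no", "nor"}


def remove_stopwords_plain(text, stop_words=ILLUSTRATIVE_STOPWORDS):
    """Drop every stopword, negations included."""
    return ' '.join(w for w in text.split() if w not in stop_words)


def remove_stopwords_keep_negations(text, stop_words=ILLUSTRATIVE_STOPWORDS):
    """Drop stopwords but keep negation words."""
    effective_stop_words = stop_words - NEGATIONS
    return ' '.join(w for w in text.split() if w not in effective_stop_words)


review = "the food was not good"
plain = remove_stopwords_plain(review)
negation_aware = remove_stopwords_keep_negations(review)

print(plain)           # food good
print(negation_aware)  # food not good
```

With the real NLTK list, the same idea would be `stop_words - {"not", "no", "nor"}` before filtering, or scoring sentiment on the raw review text instead of the cleaned one.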


## ***7. ML Model Implementation***

# Silhouette score

Value range: -1 (worst) to +1 (best)

| Score | Interpretation |
|---|---|
| ~0.71 – 1.00 | Excellent clustering |
| ~0.51 – 0.70 | Good |
| ~0.26 – 0.50 | Reasonable |
| ~0.00 – 0.25 | Weak structure |
| < 0 | Bad clustering (wrong cluster assignment) |


# Davies-Bouldin score
Value range: 0 (perfect clustering) to ∞ (worse)

| Score | Interpretation |
|---|---|
| 0 – 1 | Very good clustering |
| 1 – 2 | Acceptable |
| > 2 | Poor clustering |

### ML Model - 1 Preparing clustering on the restaurant data:

In [None]:
# Let's begin by preparing clustering on the restaurant data using:
# - Pricing (Cost)
# - Sentiment (aggregated Sentiment_Score from review dataset)
# - Popularity (number of reviews per restaurant)


# STEP 1: Aggregate reviews to get average sentiment and review count per restaurant
review_agg = unique_rows_in_review_df.groupby('Restaurant').agg({
    'Sentiment_Score': 'mean',
    'Review': 'count'
}).reset_index().rename(columns={
    'Sentiment_Score': 'Avg_Sentiment',
    'Review': 'Review_Count'
})

# STEP 2: Merge with restaurant data
merged_df = pd.merge(restaurant_df, review_agg, left_on='Name', right_on='Restaurant', how='inner')

# STEP 3: Select and preprocess features for clustering
clustering_features = merged_df[['Cost', 'Avg_Sentiment', 'Review_Count']].dropna()

# Scaling the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(clustering_features)
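Scaling matters here because the three features live on very different ranges: Cost runs into the hundreds or thousands, while the TextBlob polarity behind Avg_Sentiment stays within [-1, 1]. Distance-based clustering would otherwise be dominated by Cost. As a rough illustration of what `StandardScaler` does to one column, the sketch below applies the z-score by hand using the population standard deviation (dividing by n). The values are hypothetical and `standardize` is an illustrative helper.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a column: subtract the mean, divide by the population std."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (divides by n)
    return [(value - mu) / sigma for value in column]


# Hypothetical raw feature values, for illustration only
cost = [300, 800, 1500, 2400]
avg_sentiment = [-0.2, 0.1, 0.3, 0.6]

scaled_cost = standardize(cost)
scaled_sentiment = standardize(avg_sentiment)

# After scaling, both columns have mean 0 and unit spread,
# so neither dominates a Euclidean distance
print([round(v, 3) for v in scaled_cost])
print([round(v, 3) for v in scaled_sentiment])
```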

In [None]:
# Store clustering results and scores
results = {}

# ML Model - 1 Implementation
# ----------- KMeans Clustering ---------------
kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the Algorithm +  Predict on the model
kmeans_labels = kmeans.fit_predict(scaled_features)

results['KMeans'] = {
    'labels': kmeans_labels,
    'silhouette_score': silhouette_score(scaled_features, kmeans_labels),
    'davies_bouldin_score': davies_bouldin_score(scaled_features, kmeans_labels)
}


print(f"silhouette_score:{silhouette_score(scaled_features, kmeans_labels):.4f}\n")
print(f"davies_bouldin_score:{ davies_bouldin_score(scaled_features, kmeans_labels):.4f}")
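To make the silhouette score printed above more concrete, here is a from-scratch sketch of the coefficient on a toy dataset. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. The points, labels, and the `silhouette` helper are illustrative; the notebook itself relies on scikit-learn's `silhouette_score`.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
print(f"matching labels: {good_score:.3f}, mixed labels: {bad_score:.3f}")
```

Labels that follow the real grouping score close to +1, while labels that mix the two groups score much lower.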


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Hand-written cluster descriptions. Cuisine, collections and pictures are not among
# the clustering features, so check these labels against the actual cluster profiles.
cluster_insights = {
    0: "High-cost, Italian, romantic restaurants, very positive sentiment",
    1: "Budget Indian fast food with mixed reviews",
    2: "Cafes in popular collections with high picture uploads (trendy)"
}

# Add labels to dataframe for visualization
clustering_features['KMeans_Cluster'] = kmeans_labels

# Use the descriptions as hue labels so the legend matches the plotted colors
plot_df = clustering_features[['Cost', 'Avg_Sentiment', 'Review_Count']].copy()
plot_df['Cluster'] = [f'Cluster {i}: {cluster_insights[i]}' for i in kmeans_labels]

# Pairplot for KMeans clustering
sns.pairplot(plot_df, hue='Cluster', palette='Set1')
plt.suptitle('KMeans Clustering of Restaurants (Pricing, Sentiment, Popularity)', y=1.02)
plt.show()




### ML Model - 2

In [None]:
# ----------- DBSCAN Clustering ---------------
dbscan = DBSCAN(eps=0.8, min_samples=5)

# Fit the Algorithm + Predict on the model
dbscan_labels = dbscan.fit_predict(scaled_features)

# Exclude noise (-1) before scoring; both metrics need at least two clusters
core_mask = dbscan_labels != -1
if len(set(dbscan_labels[core_mask])) > 1:
    dbscan_sil_score = silhouette_score(scaled_features[core_mask], dbscan_labels[core_mask])
    dbscan_db_score = davies_bouldin_score(scaled_features[core_mask], dbscan_labels[core_mask])
else:
    dbscan_sil_score = None
    dbscan_db_score = None

results['DBSCAN'] = {
    'labels': dbscan_labels,
    'silhouette_score': dbscan_sil_score,
    'davies_bouldin_score': dbscan_db_score
}
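Because DBSCAN marks outliers with the label -1 and chooses the number of clusters itself, it helps to inspect the label distribution before reading any score. A minimal sketch, using hypothetical labels and an illustrative helper `summarize_dbscan_labels`; with the notebook's array, `dbscan_labels.tolist()` could be passed in.

```python
from collections import Counter


def summarize_dbscan_labels(labels):
    """Count clusters and noise points in a sequence of DBSCAN labels."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # label -1 marks noise points
    return {
        'n_clusters': len(counts),
        'n_noise': n_noise,
        'noise_fraction': n_noise / len(labels),
        'cluster_sizes': dict(counts),
    }


# Hypothetical labels, for illustration only
example_labels = [0, 0, 0, 1, 1, -1, 0, -1, 1, 1]
summary = summarize_dbscan_labels(example_labels)
print(summary)
```

A large noise fraction or a single cluster usually suggests revisiting `eps` and `min_samples`.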


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Add labels to dataframe for visualization
clustering_features['DBSCAN_Cluster'] = dbscan_labels

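# Pairplot for DBSCAN clustering (label -1 marks noise points)
# vars restricts the axes to the three features, so earlier cluster-label columns are not plotted
sns.pairplot(clustering_features, vars=['Cost', 'Avg_Sentiment', 'Review_Count'],
             hue='DBSCAN_Cluster', palette='Set1')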
plt.suptitle('DBSCAN Clustering of Restaurants (Pricing, Sentiment, Popularity)', y=1.02)
plt.show()

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# ----------- Hierarchical Clustering (Agglomerative) ---------------
agglo = AgglomerativeClustering(n_clusters=3)
# Fit the Algorithm + Predict on the model
agglo_labels = agglo.fit_predict(scaled_features)

results['Agglomerative'] = {
    'labels': agglo_labels,
    'silhouette_score': silhouette_score(scaled_features, agglo_labels),
    'davies_bouldin_score': davies_bouldin_score(scaled_features, agglo_labels)
}

print(f"silhouette_score:{silhouette_score(scaled_features, agglo_labels):.4f}\n")
print(f"davies_bouldin_score:{ davies_bouldin_score(scaled_features, agglo_labels):.4f}")
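As a companion to the Davies-Bouldin score printed above, the sketch below computes the index by hand on a toy dataset. For each cluster it takes the scatter (mean distance of members to the centroid), forms the ratio of summed scatters to centroid distance against every other cluster, keeps the worst ratio, and averages those worst ratios. The data and the helpers `centroid` and `davies_bouldin` are illustrative; the notebook uses scikit-learn's `davies_bouldin_score`.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def davies_bouldin(points, labels):
    """Davies-Bouldin index using Euclidean distance (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    centroids = {label: centroid(members) for label, members in clusters.items()}
    # Scatter: mean distance from each member to its cluster centroid
    scatter = {
        label: sum(dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    worst_ratios = []
    for i in clusters:
        # Worst (largest) similarity between cluster i and any other cluster
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(worst_ratios) / len(worst_ratios)


# Toy data: two tight, well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]

db_index = davies_bouldin(points, labels)
print(f"Davies-Bouldin index: {db_index:.4f}")
```

Compact clusters that sit far apart give a value near 0, which is why lower scores indicate better separation.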

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
clustering_features['Agglo_Cluster'] = agglo_labels


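# Pairplot for Agglomerative clustering
# vars restricts the axes to the three features, so earlier cluster-label columns are not plotted
sns.pairplot(clustering_features, vars=['Cost', 'Avg_Sentiment', 'Review_Count'],
             hue='Agglo_Cluster', palette='Set1')
plt.suptitle('Agglomerative Clustering of Restaurants (Pricing, Sentiment, Popularity)', y=1.02)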
plt.show()
# ======================= EVALUATION RESULTS ============================ #
results
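Displaying `results` directly also prints the full label arrays, which makes the scores hard to compare. A small sketch of a side-by-side summary follows. `format_scores` is an illustrative helper, and the numbers in `example_results` are placeholders, not the notebook's actual scores. It reports `n/a` where a score is `None`, as can happen for DBSCAN.

```python
def format_scores(results):
    """Build a plain-text comparison table from a results dictionary."""
    lines = [f"{'Model':<15}{'Silhouette':>12}{'Davies-Bouldin':>16}"]
    for name, scores in results.items():
        sil = scores['silhouette_score']
        db = scores['davies_bouldin_score']
        # A score can be None when a model yields fewer than two clusters
        sil_text = 'n/a' if sil is None else f"{sil:.4f}"
        db_text = 'n/a' if db is None else f"{db:.4f}"
        lines.append(f"{name:<15}{sil_text:>12}{db_text:>16}")
    return '\n'.join(lines)


# Placeholder values for illustration only
example_results = {
    'KMeans': {'labels': [0, 1, 2], 'silhouette_score': 0.5, 'davies_bouldin_score': 1.0},
    'DBSCAN': {'labels': [0, -1, 0], 'silhouette_score': None, 'davies_bouldin_score': None},
}

table = format_scores(example_results)
print(table)
```

Higher silhouette and lower Davies-Bouldin values both point to better-separated clusters, so the two columns should be read in opposite directions.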

# 📌 Conclusion

In this project, we explored a dual approach to understand consumer behavior and restaurant segmentation on Zomato using Natural Language Processing (NLP) and unsupervised machine learning techniques.

🔹 **Sentiment Analysis**:  
We applied sentiment analysis on user reviews to quantify public perception. This helped identify top-rated vs. poorly-rated restaurants based on emotional tone, uncovering valuable insights for both customers and business owners.

🔹 **Clustering**:  
Using cost, average review sentiment, and review count per restaurant as features, we clustered restaurants into segments. We experimented with **KMeans**, **DBSCAN**, and **Hierarchical (Agglomerative) Clustering**, evaluated each with the silhouette and Davies-Bouldin scores, and visualized the results with pair plots.

---

### 📊 Key Insights:

- High-end restaurants with Italian cuisine and romantic ambiance generally received **very positive feedback**.
- Budget-friendly Indian fast food spots had **mixed reviews**, revealing room for improvement.
- Trendy cafés in popular collections attracted **a younger, social media-savvy crowd**.

---

### 💡 Business Impact:

- Zomato can personalize recommendations, promotions, and UI placement using these segments.
- Restaurants can benchmark themselves against similar clusters to improve competitiveness.
- Sentiment tracking enables **early detection of service issues** or **rising customer expectations**.

---

### 🚀 Future Scope:

- Add **topic modeling** (e.g., LDA) to dig deeper into review themes.
- Incorporate **geospatial clustering** to factor in location-based preferences.
- Build a real-time dashboard for restaurants and Zomato stakeholders.
