# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Member**          - Deep Keni


# **Project Summary -**

This project focuses on analyzing restaurant reviews and related information to understand what factors influence customer ratings and how textual feedback can be used to build predictive machine learning models. The dataset contains structured attributes such as cost, cuisines count, whether pictures are available, operating hours, reviewer activity, and timestamps, along with unstructured text in the form of customer reviews. The main objective was to clean the data, extract useful features from both text and numerical columns, explore relationships in the data, and prepare a complete pipeline that can be used for machine learning models.

The first phase involved understanding the dataset and performing data cleaning. Missing values, inconsistent formats, and unnecessary columns were handled. Text reviews were preprocessed using standard NLP steps such as expanding contractions, converting text to lowercase, removing punctuation, URLs, digits, stopwords, and extra spaces. Tokenization and lemmatization were applied to normalize the text and reduce vocabulary size. These steps helped convert informal user reviews into a consistent and machine-friendly format.
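The cleaning steps described above can be sketched roughly as follows. This is a simplified illustration, not the notebook's exact code: `clean_review` is a hypothetical helper, the tiny stopword set stands in for NLTK's English list, and contraction expansion and lemmatization are left out so the sketch needs no downloads.

```python
import re

# Tiny illustrative stopword set (assumption); the project uses NLTK's full English list.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "it", "at"}

def clean_review(text: str) -> str:
    """Lowercase, strip URLs/digits/punctuation, drop stopwords, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                     # remove digits
    text = re.sub(r"[^a-z\s]", " ", text)                # remove punctuation and symbols
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)                              # joining on one space drops extra spaces

sample = "The food was AMAZING!!! Visit https://example.com at 9pm."
cleaned = clean_review(sample)
print(cleaned)
```

URLs are removed before punctuation so that the `://` and dots inside a link do not leave stray fragments behind.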

Feature engineering played an important role in this project. Temporal features were extracted from the review timestamp such as year, month, day of the week, and hour to capture time-based patterns in customer behavior. A review length feature was created to represent how detailed a review is. Numeric variables like cost and reviewer activity metrics were transformed using log transformation to reduce skewness and the effect of extreme values. Feature selection was done using domain understanding by removing identifiers and redundant columns such as raw timestamps, and intermediate NLP artifacts once vectorization was complete. This helped reduce noise and avoid data leakage.
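A minimal sketch of these feature-engineering steps on toy data is shown below. The derived column names (`Year`, `Month`, `Day_Of_Week`, `Hour`, `Review_Length`, `Cost_Log`) are assumptions chosen for illustration and may differ from the names used later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (assumption) mirroring the column names used in this project
demo = pd.DataFrame({
    "Time": pd.to_datetime(["2019-05-25 15:54:00", "2018-12-31 23:10:00"]),
    "Review": ["Great food and service", "Bad"],
    "Cost": [800.0, 1500.0],
})

# Temporal features from the review timestamp
demo["Year"] = demo["Time"].dt.year
demo["Month"] = demo["Time"].dt.month
demo["Day_Of_Week"] = demo["Time"].dt.dayofweek   # Monday=0 ... Sunday=6
demo["Hour"] = demo["Time"].dt.hour

# Review length in characters
demo["Review_Length"] = demo["Review"].str.len()

# log1p = log(1 + x): compresses large values and is defined at zero
demo["Cost_Log"] = np.log1p(demo["Cost"])
print(demo)
```

`np.log1p` is used rather than `np.log` so that zero-valued counts, such as reviewers with no followers, do not produce negative infinity.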

For text representation, TF-IDF vectorization was used to convert reviews into numerical features. This method weights each word by its importance across documents and reduces the impact of commonly occurring terms. Since TF-IDF produces high-dimensional sparse features, TruncatedSVD was explored as an optional step to reduce dimensionality, improve computational efficiency, and potentially improve performance for clustering and distance-based models. Numerical features were standardized using StandardScaler to bring them onto a comparable scale for models sensitive to feature magnitude, while binary indicators and TF-IDF features were handled appropriately.
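The vectorization and optional reduction step might look like the sketch below. The four toy reviews and `n_components=2` are assumptions for illustration; a real run would pick the number of components from the explained variance on the actual corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption) standing in for the cleaned reviews
docs = [
    "great food friendly staff",
    "food was cold and service slow",
    "friendly staff great ambience",
    "slow service cold food",
]

# Sparse TF-IDF matrix: one row per review, one column per term
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# TruncatedSVD accepts sparse input directly, unlike PCA which centers the data first
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

To avoid leakage, the vectorizer and the SVD would normally be fitted on the training split only and then applied to the test split with `transform`.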

Basic hypothesis testing was performed to gain business insights from the data. Simple statistical tests were used to examine whether restaurants with pictures receive different ratings, whether cost is related to ratings, and whether 24-hour operation impacts customer ratings. These tests were exploratory and helped understand relationships in the data. However, hypothesis testing was not directly used in training machine learning models and was treated as an insight-generation step.
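The summary does not say which tests were used, so the sketch below shows one reasonable choice for the pictures question: Welch's t-test on two groups of ratings. The data is synthetic and generated only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ratings (assumption): two groups drawn with different means
with_pics = rng.normal(loc=4.0, scale=0.8, size=200).clip(1, 5)
without_pics = rng.normal(loc=3.5, scale=0.8, size=200).clip(1, 5)

# Welch's t-test does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(with_pics, without_pics, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value only indicates that the group means differ; it does not show that pictures cause higher ratings.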

Finally, the data was split into training and testing sets to ensure fair evaluation of machine learning models on unseen data. The overall pipeline prepares the dataset for building regression, classification, and clustering models. The project emphasizes building a clean, end-to-end workflow that includes preprocessing, feature engineering, vectorization, scaling, and optional dimensionality reduction. The results and insights from this project can help businesses understand customer feedback better, identify factors that influence ratings, and build data-driven strategies to improve customer experience and engagement.

# **GitHub Link -**

https://github.com/Deep-keni

# **Problem Statement**


**Analyze Zomato restaurant data to help customers find good restaurants and help Zomato understand their business better.**
> The main problem is to clean and prepare this data and then analyze it to find useful patterns and insights.

The project aims to:

* Clean and preprocess the dataset

* Explore relationships between cost, ratings, cuisines, and reviews

* Group similar restaurants using clustering techniques

* Analyze customer sentiment from review text

* Convert the analysis into meaningful business insights

The objective is to support customers in finding better restaurants and help the company identify areas of improvement.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# *Please follow these steps, or errors will occur when running all cells*
Go to Let's Begin -> Know Your Data -> Import Libraries and upload the files manually
(in the 3rd cell under Import Libraries).

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#installed requirements
!pip install textblob wordcloud
!pip install contractions

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import re
from google.colab import files
import contractions
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')


In [None]:
#files uploaded
from google.colab import files
files.upload()

### Dataset Loading

In [None]:
#loaded datasets
restaurant_df = pd.read_csv('Zomato Restaurant names and Metadata.csv')
review_df = pd.read_csv('Zomato Restaurant reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
restaurant_df.head()

In [None]:
review_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
restaurant_df.shape

In [None]:
review_df.shape

### Dataset Information



In [None]:
# Dataset Info

In [None]:
#merging both datasets
merged_data = review_df.merge(
        restaurant_df,
        left_on='Restaurant',      # Column from reviews dataset
        right_on='Name',           # Column from restaurants dataset
        how='left',                # Keep all reviews, match restaurant info
)
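
A left merge keeps every review, but it silently leaves the restaurant columns empty when a name has no match. One way to check this is sketched below on toy frames; the data is made up, while `indicator` and `validate` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy frames (assumption) using the same key columns as the notebook
reviews = pd.DataFrame({"Restaurant": ["A", "A", "B", "C"], "Rating": [5, 4, 3, 2]})
restaurants = pd.DataFrame({"Name": ["A", "B"], "Cost": [800, 500]})

checked = reviews.merge(
    restaurants,
    left_on="Restaurant",
    right_on="Name",
    how="left",
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    validate="many_to_one",  # raises if a restaurant name is duplicated on the right
)

# Restaurants that appear in reviews but have no metadata row
unmatched = checked.loc[checked["_merge"] == "left_only", "Restaurant"].unique()
print("Reviews without metadata:", list(unmatched))
```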

In [None]:
#converting the datatypes of required columns from object to specific ones (e.g. float, datetime)
cols_to_int = ['Rating','Pictures','Cost']
merged_data[cols_to_int] = merged_data[cols_to_int].apply(pd.to_numeric, errors='coerce')        #object -> float; unparseable values become NaN
merged_data['Time'] = merged_data['Time'].apply(pd.to_datetime , errors = 'coerce')              #object -> datetime; unparseable values become NaT
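
One thing worth checking on the actual file: with `errors='coerce'`, any value that is not a plain number is turned into NaN without warning. If the raw `Cost` column happens to store values with thousands separators such as `1,200` (an assumption, not something confirmed here), those rows would be lost and later filled with the median. The sketch below shows the effect on a toy Series and one way to guard against it.

```python
import pandas as pd

# Toy values (assumption): one entry uses a thousands separator, one is missing
raw_cost = pd.Series(["800", "1,200", "550", None])

# Direct conversion: "1,200" cannot be parsed and is coerced to NaN
naive = pd.to_numeric(raw_cost, errors="coerce")

# Removing the separator first keeps the value
cleaned = pd.to_numeric(raw_cost.str.replace(",", "", regex=False), errors="coerce")

print(naive.isna().sum(), cleaned.isna().sum())
```

Comparing the NaN counts before and after conversion is a quick way to see whether coercion discarded real values.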

In [None]:
#extract numbers from Metadata (e.g. 4 Reviews, 32 Followers)
merged_data['Reviewer_Review_Count'] = (
    merged_data['Metadata']
    .str.extract(r'(\d+)\s*Reviews?', expand=False)
    .fillna(0)
    .astype('int')
)

merged_data['Reviewer_Followers'] = (
    merged_data['Metadata']
    .str.extract(r'(\d+)\s*Followers?', expand=False)
    .fillna(0)
    .astype('int')
)

#converting the pictures column to boolean form
merged_data['Has_Pictures'] = (merged_data['Pictures'].fillna(0)>0).astype(bool)
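
To see what the two regular expressions capture, here is a small sketch on made-up `Metadata` strings. It is a variant of the cell above rather than a copy: it converts the captured text with `pd.to_numeric` before filling missing values, which keeps the fill value and the column type consistent.

```python
import pandas as pd

# Made-up examples (assumption) in the "N Reviews , M Followers" style
metadata = pd.Series(["1 Review , 2 Followers", "31 Reviews , 144 Followers", "3 Reviews", None])

# `Reviews?` matches both "Review" and "Reviews"; the group captures only the digits
review_count = (
    pd.to_numeric(metadata.str.extract(r"(\d+)\s*Reviews?", expand=False))
    .fillna(0)
    .astype(int)
)
followers = (
    pd.to_numeric(metadata.str.extract(r"(\d+)\s*Followers?", expand=False))
    .fillna(0)
    .astype(int)
)
print(review_count.tolist(), followers.tolist())
```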

In [None]:
#dropping the unnecessary columns which are not required in predictions
merged_data = merged_data.drop(columns=['Reviewer','Links','Pictures','Metadata'] , errors='ignore')

In [None]:
merged_data = merged_data[['Restaurant','Review','Rating','Time','Reviewer_Review_Count',
                          'Reviewer_Followers','Has_Pictures','Cost','Cuisines','Collections','Timings',]]

In [None]:
merged_data.head()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
merged_data_backup = merged_data.copy()               #just saved the copy of the original dataframe for safety

In [None]:
#the dataframe contains duplicate rows, so remove them and keep only the first occurrence of each
merged_data = merged_data.drop_duplicates(keep='first')

In [None]:
merged_data.duplicated().any()                   #checks finally whether any duplicates still there or not

In [None]:
merged_data.shape                             # 10000 rows reduced to 9964 : redundant rows removed
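
The duplicate count itself can be obtained before dropping anything. A minimal sketch on a toy frame (the data is made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B"],
    "Review": ["good", "good", "ok", "bad"],
})

n_duplicates = demo.duplicated().sum()        # rows identical to an earlier row
deduped = demo.drop_duplicates(keep="first")
print(n_duplicates, len(demo), len(deduped))
```

`duplicated()` compares all columns by default, so two reviews only count as duplicates when every field matches.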

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
#calculating the missing values : specified rows have null values in the respective columns
merged_data.isnull().sum()

In [None]:
#calculates the missing values but in % format
(merged_data.isnull().sum() / len(merged_data)) * 100

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(merged_data.isnull(), cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap")
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.tight_layout()
plt.show()

In [None]:
#why was this chart chosen?
'''To visually inspect the pattern and distribution of missing values across different features and rows.'''

#insight(s)?
'''Missing values are primarily concentrated in the Cost and Collections columns, while most other features
have minimal or no missing values, indicating feature-specific data gaps rather than random missingness.'''

#business impact?
'''Incomplete cost and collection information can reduce recommendation accuracy and clustering reliability,
highlighting the need for better data collection or careful imputation strategies to avoid biased business insights.'''

In [None]:
missing_counts = merged_data.isnull().sum()

plt.figure(figsize=(10, 5))
plt.bar(missing_counts.index, missing_counts.values)
plt.xticks(rotation=45, ha='right')
plt.title("Missing Values Count by Column")
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.tight_layout()
plt.show();


In [None]:
#why was this chart chosen?
'''To compare the extent of missing values across different features and identify columns with the highest data quality issues.'''

#insight(s)?
'''The Cost and Collections columns have significantly more missing values than other features, whereas core features like
Rating, Cuisines, and Review are largely complete.'''

#business impact?
'''High missingness in pricing and categorization data may weaken customer decision-making and segmentation accuracy. Improving data
 completeness in these areas can enhance user experience and business recommendations.'''

In [None]:
# Visualizing the missing values

In [None]:
#handling missing values

#text column
merged_data['Review'] = merged_data['Review'].fillna('')

#numeric columns: used medians to avoid skew from outliers
merged_data['Rating'] = merged_data['Rating'].fillna(merged_data['Rating'].median())
merged_data['Cost'] = merged_data['Cost'].fillna(merged_data['Cost'].median())

#engineered numeric features (if any NaNs remain)
merged_data['Reviewer_Review_Count'] = merged_data['Reviewer_Review_Count'].fillna(merged_data['Reviewer_Review_Count'].median())
merged_data['Reviewer_Followers'] = merged_data['Reviewer_Followers'].fillna(merged_data['Reviewer_Followers'].median())

#categorical columns
merged_data['Collections'] = merged_data['Collections'].fillna('Unknown')
merged_data['Timings'] = merged_data['Timings'].fillna('Not Available')

In [None]:
#dropped the 2 rows with a missing "Time" value, since timestamps were not imputed
merged_data = merged_data.dropna(subset=['Time'])

In [None]:
#successfully removed all null values from the table and now all rows are filled with values
merged_data.isnull().sum()

### What did you know about your dataset?

I analyzed Zomato restaurant and review data to understand customer preferences and restaurant patterns. The main goal was to clean the data and explore relationships between features such as cost, ratings, cuisines, and reviews.

This project helps in:
* Finding patterns in customer ratings and costs
* Understanding which cuisines and collections are popular
* Grouping restaurants into meaningful segments
* Understanding customer opinions from review text

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
merged_data.columns

In [None]:
#the dataset contains 11 columns covering restaurant, review, cuisine, timing and reviewer metadata.
'''
1) Restaurant is the name of the restaurant.
2) Review, Rating and Time give the review text, the rating given to the restaurant, and when the review was posted.
3) Reviewer_Review_Count and Reviewer_Followers indicate how active and influential the reviewer is.
4) Has_Pictures shows whether the reviewer uploaded images of the restaurant.
5) Cost is the average dining cost at the restaurant.
6) Cuisines and Collections show the types of food the restaurant serves and the curated lists it is featured in.
7) Timings shows when the restaurant is open and on which days.
'''

In [None]:
# Dataset Describe

In [None]:
'''
1) Reviewer_Review_Count and Reviewer_Followers are stored as integers.
2) Has_Pictures is treated as a boolean feature, since only the presence of images matters, not their number.
3) Time is converted to datetime format so that review trends over time can be extracted.
4) Cost ranges from a minimum of Rs.100 to a maximum of Rs.900, with an average of about Rs.550.
5) Around 38.8% of the values in the Rating column are 5.0, which shows a positive bias in reviews.
6) Most cuisines are widely covered across restaurants, indicating diverse food options.
7) Almost all restaurants are open every day of the week and serve during both daytime and nighttime.
'''

### Variables Description

#### **Restaurant**
* Categorical data (text).
* Represents the name of the restaurant.
* Examples: Mathura Vilas, Beyond the Flavours, Karachi Cafe, Chocolate Room, etc.
* The same restaurant appears in multiple rows (one per review), so grouping by name
  gives restaurant-wise performance and ratings.

#### **Review**
* Text data.
* Contains the reviews written by customers.
* This is the most useful column for sentiment analysis.

#### **Rating**
* Numeric float data.
* Ratings given by customers, ranging from 0.0 to 5.0.
* Indicates whether the customer was satisfied after visiting the restaurant.

#### **Time**
* DateTime.
* The following can be extracted from the datetime:
    * Date
    * Day
    * Hour
    * Trends by time
* This data can provide insights about customer behavior and restaurant activity patterns over time.

#### **Reviewer_Review_Count**
* Numeric values .
* Shows how many reviews a reviewer has given.
* Indicates the experience level of the reviewer.
* Can help distinguish between trusted/experienced reviewers and new reviewers.

#### **Reviewer_Followers**
* Type: Numeric
* Proxy for reviewer influence.
* Reviewers with more followers may have higher impact on public opinion.

#### **Has_Pictures**
* Boolean (True/False).
* False indicates that no pictures are available and True indicates that pictures are present.
* Represents whether the reviewer uploaded images of the restaurant.
* Acts as an engagement indicator and may influence customer trust.

#### **Cost**
* Type: Numeric
* Represents the average cost of dining at the restaurant.
* It reflects the average cost for up to two people.
* Cost may also be related to reviews and ratings.

#### **Cuisines**
* Multi-category text data.
* Can later be split into multiple cuisine labels since the same restaurant can offer multiple cuisines.
* Useful for cuisine popularity analysis, which helps identify which cuisines are most liked and highly rated by customers.

#### **Collections**
* Categorical data.
* Meaning: Special lists or curated collections in which a restaurant is featured.
* Useful for:
    * Understanding quality tags
    * Identifying restaurants promoted or curated by the platform

#### **Timings**
* Text data.
* Can be parsed later into opening and closing times.
* Useful for operational analysis (e.g., opening hours vs customer activity).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

After checking unique values for each column, it was observed that:
* Numeric columns such as Rating, Cost, Reviewer_Review_Count, and Reviewer_Followers contain valid ranges and no abnormal values.
* Has_Pictures is boolean (True/False) and is clean.
* Restaurant names appear multiple times, which is expected due to multiple reviews per restaurant.
* Cuisines and Collections contain multiple values in a single cell separated by commas, which will require splitting and normalization during data wrangling.
* Timings contain inconsistent text formats for opening and closing hours, which will need parsing and standardization later.

This confirms that while numeric features are clean, text-based categorical features require preprocessing before analysis.
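
A check of this kind can be sketched as follows, shown on a toy frame with made-up values. In the notebook the same calls would be applied to `merged_data`.

```python
import pandas as pd

demo = pd.DataFrame({
    "Restaurant": ["A", "A", "B"],
    "Rating": [5.0, 4.0, 5.0],
    "Has_Pictures": [True, False, False],
})

# Number of distinct values per column
unique_counts = demo.nunique()
print(unique_counts)

# For low-cardinality columns, list the values themselves
for col in ["Rating", "Has_Pictures"]:
    print(col, sorted(demo[col].unique()))
```

`nunique()` needs hashable values, so it should run while Cuisines is still a plain string, before the column is split into lists during wrangling.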

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
#handling restaurant col
merged_data['Restaurant'] = (
    merged_data['Restaurant']
    .str.replace(r'\.{2,}', '', regex=True)   #removes multiple dots like ...
    .str.replace('#', '', regex=False)        #remove #
    .str.replace(r'\s+', ' ', regex=True)     #collapse repeated whitespace into a single space
    .str.strip()
)


In [None]:
# Clean and transform the 'Cuisines' column
merged_data['Cuisines'] = (
    merged_data['Cuisines']
    .str.replace('-', ' ', regex=False)       #hyphen replaced with a space
    .str.replace('&', 'and', regex=False)     #& replaced with 'and'
    .str.replace(r'\s+', ' ', regex=True)     #collapse repeated whitespace
    .str.strip()
    .str.title()                              #standardize the letter case
    .str.split(r'\s*,\s*', regex=True)        #split into a list of cuisines, dropping the spaces around commas
)

In [None]:
#added one required column after cleaning cuisines col
merged_data['Cuisines_Count'] = merged_data['Cuisines'].apply(len)

In [None]:
#handling collections col
merged_data['Collections'] = (
    merged_data['Collections']
    .str.replace('#', '', regex=False)        #removed hashtags
    .str.replace(r'\.', ' ', regex=True)      #replaced dots with 'space'
    .str.replace('-', ' ', regex=False)       #replaced hyphen with 'space'
    .str.replace(r'\s+', ' ', regex=True)     #removed multiple spaces
    .str.strip()
)

In [None]:
#handling timings col
merged_data['Timings'] = (
    merged_data['Timings']
    .str.replace(r'\s+', ' ', regex=True)     #only collapse extra spaces; every other character is important
    .str.strip()
)


In [None]:
#adding two flag columns derived from the Timings text
merged_data['Is_Open_All_Days'] = merged_data['Timings'].str.contains(r'Mon-Sun|All Days|Everyday|Daily', case=False, regex=True).astype(bool)
merged_data['Is_24_Hours'] = merged_data['Timings'].str.contains(r'24 Hours|24/7|Open 24 Hours', case=False, regex=True).astype(bool)

In [None]:
ordered_cols = ['Restaurant','Review','Rating','Time','Reviewer_Review_Count','Reviewer_Followers','Has_Pictures','Cost',
                'Cuisines','Cuisines_Count','Collections','Timings','Is_Open_All_Days','Is_24_Hours'
]

merged_data = merged_data[ordered_cols]


### What all manipulations have you done and insights you found?

In the data wrangling phase, the dataset was cleaned and structured to make it analysis-ready. The Restaurant column was cleaned by removing decorative characters such as multiple dots and hashtags, and extra spaces were normalized to avoid duplicate names caused by formatting differences.

The Cuisines column was standardized by replacing hyphens with spaces, converting ‘&’ to ‘and’, normalizing spacing, and standardizing text case. Since multiple cuisines can appear in a single cell, the values were converted into lists, and a new feature Cuisines_Count was created to capture the number of cuisines offered by each restaurant. This helps analyze whether cuisine variety has any relation to ratings or customer interest.

The Collections column was cleaned by removing unnecessary symbols and extra spaces. As a large number of values were missing and filled with “Unknown”, no additional features were derived from this column to avoid introducing noisy or low-value features at this stage.

The Timings column was lightly cleaned by normalizing spaces while preserving important characters such as time ranges and separators. Instead of fully parsing opening and closing times, two simple operational features were derived: whether a restaurant is open all days of the week (Is_Open_All_Days) and whether it operates 24 hours (Is_24_Hours). This provides useful context for analysis without overcomplicating preprocessing.

Overall, the wrangling steps improved consistency in text fields, structured multi-value columns, and added a few meaningful derived features while keeping the preprocessing simple and interpretable for exploratory analysis.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Univariate

In [None]:
sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
plt.hist(merged_data['Rating'], edgecolor='black')
plt.xlabel("Ratings")
plt.ylabel("Count")
plt.title("Distribution of Ratings")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To see how ratings are spread across all restaurants.

##### 2. What is/are the insight(s) found from the chart?

Most ratings are high (4–5), which shows customers are generally satisfied. Some low ratings exist, showing a few restaurants need improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High-rated restaurants can be promoted more. Low-rated ones need improvement, otherwise they may lose customers and affect platform trust.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Univariate

In [None]:
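sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
plt.hist(merged_data['Cost'], edgecolor='black' , bins=10)
plt.xlabel("Cost (Rs)")                       #x-axis holds the cost values
plt.ylabel("Count")                           #y-axis of a histogram is the number of rows per bin
plt.title("Distribution of Cost")
plt.tight_layout()
plt.show();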

##### 1. Why did you pick the specific chart?

To understand common pricing range of restaurants.

##### 2. What is/are the insight(s) found from the chart?

Most restaurants fall in the mid-price range, while fewer restaurants are cheap or expensive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps identify the most common price segment for customers. Restaurants priced far from the common range may need better positioning or offers to attract more customers.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Univariate

In [None]:
sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
sns.countplot(x = merged_data['Cuisines_Count'].astype(int))
plt.xlabel("Number of Cuisines per Restaurant")
plt.ylabel("Count")
plt.title("Distribution of Cuisines per Restaurant")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To see how many cuisines each restaurant offers.

##### 2. What is/are the insight(s) found from the chart?

Most restaurants offer a small number of cuisines (2-4), while only a few offer more than 4.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Restaurants with too many cuisines may struggle with consistency. A focused menu can improve quality and customer satisfaction.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Univariate

In [None]:
df_col = merged_data[merged_data['Collections'] != 'Unknown']
all_tags = df_col['Collections'].str.split(',')
all_tags = all_tags.explode()
all_tags = all_tags.str.strip()
tag_counts = all_tags.value_counts()
top_10_tags = tag_counts.head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_10_tags.values, y=top_10_tags.index)
plt.xlabel("Number of Restaurants")
plt.ylabel("Collection Tags")
plt.title("Top 10 Collection Tags (Excluding Unknown)")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To find the most common restaurant tags used in collections.

##### 2. What is/are the insight(s) found from the chart?

Some collection tags appear much more frequently than others, showing the most common ways restaurants are grouped.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Popular collection tags can be promoted more to help customers discover restaurants easily. Less common tags may need better visibility or clearer definitions.
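
Because `merged_data` has one row per review, counts taken directly from it are weighted by how many reviews each restaurant has. If restaurant-level counts are wanted, one option is to keep a single row per restaurant first. The sketch below uses made-up data to show the idea.

```python
import pandas as pd

# Toy review-level data (assumption): restaurant A has three reviews, B has one
demo = pd.DataFrame({
    "Restaurant": ["A", "A", "A", "B"],
    "Collections": ["Great Buffets, Live Sports"] * 3 + ["Great Buffets"],
})

# Keep one row per restaurant before counting tags
per_restaurant = demo.drop_duplicates(subset="Restaurant")

tag_counts = (
    per_restaurant["Collections"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(tag_counts)
```

Without the `drop_duplicates` step, "Great Buffets" would be counted four times here instead of twice.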

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Univariate

In [None]:
sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
picture_counts = merged_data['Has_Pictures'].value_counts()
plt.bar(picture_counts.index.astype(str), picture_counts.values, edgecolor='black')
plt.xlabel("Has Pictures (False = No, True = Yes)")
plt.ylabel("Count")
plt.title("Distribution of Has Pictures")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To see how many restaurants have pictures and how many do not.

##### 2. What is/are the insight(s) found from the chart?

Most restaurants do not have pictures, while a smaller portion have pictures available.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Restaurants with pictures can attract more customers. Encouraging restaurants to upload pictures can improve user engagement and trust on the platform.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Bivariate

In [None]:
sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
plt.scatter(x=merged_data['Cost'], y=merged_data['Rating'])
plt.xlabel("Average Cost of Dining in Restaurant")
plt.ylabel("Rating of the Restaurant")
plt.title("Cost vs Rating")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To check whether the cost of dining at a restaurant is related to the rating it receives.

##### 2. What is/are the insight(s) found from the chart?

There is no clear strong pattern between cost and rating. Both low-cost and high-cost restaurants can have good or bad ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Higher price does not guarantee better customer satisfaction. Restaurants should focus on quality and service, not just pricing. Affordable restaurants can also perform well if they deliver good experience.
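
A scatter plot of discrete ratings overlaps heavily, so a single number can help back up the visual impression. The sketch below computes a Pearson correlation on synthetic data that is independent by construction. It only illustrates the call and says nothing about the value on the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic data (assumption): Rating is generated independently of Cost
demo = pd.DataFrame({
    "Cost": rng.integers(100, 2000, size=500),
    "Rating": rng.integers(1, 6, size=500),
})

# Pearson correlation: values near 0 indicate no linear relationship
r = demo["Cost"].corr(demo["Rating"])
print(f"Pearson r = {r:.3f}")
```

In the notebook the same call would be `merged_data['Cost'].corr(merged_data['Rating'])`.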



#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Bivariate

In [None]:
grouped_cuisines_count = merged_data.groupby('Cuisines_Count')['Rating'].mean()

sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
sns.barplot(x=grouped_cuisines_count.index, y=grouped_cuisines_count.values , edgecolor='black')
plt.xlabel("Number of Cuisines")
plt.ylabel("Average Rating")
plt.title("Cuisines vs Rating")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To check if the number of cuisines offered by a restaurant affects its average rating.

##### 2. What is/are the insight(s) found from the chart?

The average rating does not change much with the number of cuisines. Restaurants with fewer or more cuisines can both receive good ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Offering more cuisines does not guarantee higher customer satisfaction. Restaurants should focus on quality of food and service rather than increasing menu variety.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Bivariate

In [None]:
avg_rating_pics = merged_data.groupby('Has_Pictures')['Rating'].mean()

plt.figure(figsize=(6, 4))
plt.bar(avg_rating_pics.index.astype(str), avg_rating_pics.values, edgecolor='black')
plt.xlabel("Has Pictures (False = No, True = Yes)")
plt.ylabel("Average Rating")
plt.title("Average Rating vs Has Pictures")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To compare average ratings of restaurants with and without pictures.

##### 2. What is/are the insight(s) found from the chart?

Restaurants with pictures tend to have slightly higher average ratings compared to those without pictures.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Encouraging restaurants to upload pictures can improve customer trust and engagement, which may lead to better ratings and more visits.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Bivariate

In [None]:
avg_hs_pictures = merged_data.groupby('Has_Pictures')['Cost'].mean()

sns.set_style("darkgrid")                     #set the style before creating the figure so it applies to this plot
plt.figure(figsize=(8, 4))
plt.bar(avg_hs_pictures.index.astype(str),avg_hs_pictures.values, edgecolor='black')
plt.ylabel("Average Cost")
plt.xlabel("Has Pictures (False = No, True = Yes)")
plt.title("Average Cost vs Has Pictures")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To compare the average cost of restaurants with and without pictures.

##### 2. What is/are the insight(s) found from the chart?

Restaurants with pictures have a slightly higher average cost, but the difference is small.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Restaurants with pictures tend to cost slightly more. Restaurants without pictures should still be encouraged to add them.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Bivariate

In [None]:
grouped_rating = merged_data.groupby('Rating')['Reviewer_Followers'].mean()

plt.figure(figsize=(8, 4))
sns.scatterplot(x=grouped_rating.index , y=grouped_rating.values ,  edgecolor='black')
plt.ylabel("Reviewer_Followers")
plt.xlabel("Rating")
plt.title("Rating vs Reviewer_Followers")
sns.set_style("darkgrid")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To check whether reviewers with more followers tend to give higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

There is no clear difference between popular and less popular reviewers; both groups give a wide range of ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Ratings can be treated equally regardless of reviewer popularity. However, reviews from popular reviewers can be highlighted for visibility.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Bivariate

In [None]:
grouped_rating = merged_data.groupby('Rating')['Is_Open_All_Days'].mean()

plt.figure(figsize=(8, 4))
sns.barplot(x=grouped_rating.index, y=grouped_rating.values, edgecolor='lightgreen')
plt.ylabel("Proportion of Restaurants Open All Days")
plt.xlabel("Rating")
plt.title("Average Proportion of Restaurants Open All Days by Rating")
sns.set_style("darkgrid")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To check whether restaurants that are open all days receive better ratings.

##### 2. What is/are the insight(s) found from the chart?

Higher ratings are associated with a larger proportion of restaurants that are open all days, while this proportion is noticeably smaller at a few of the lower ratings.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Keeping restaurants open all days can improve customer satisfaction, which may lead to better ratings.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#Multivariate

In [None]:
plt.figure(figsize=(8, 4))
sns.scatterplot(data=merged_data, x='Cost', y='Rating', hue='Has_Pictures')   #data passed by keyword so the call also works on older seaborn versions
plt.title("Cost vs Rating Based on Has_Pictures")
sns.set_style("darkgrid")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To compare cost and rating for restaurants with and without pictures.


##### 2. What is/are the insight(s) found from the chart?

Restaurants with pictures tend to have slightly higher average ratings and slightly higher average cost compared to those without pictures.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Restaurants that invest in pictures may attract more customers and can position themselves slightly higher in price.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
#Multivariate

In [None]:
plt.figure(figsize=(8, 4))
sns.scatterplot(data=merged_data, x='Reviewer_Followers', y='Rating', hue='Is_Open_All_Days', alpha=1)   #data passed by keyword so the call also works on older seaborn versions
plt.title("Reviewer_Followers vs Rating Based on Is_Open_All_Days")
sns.set_style("darkgrid")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To see how reviewer followers relate to ratings, and whether being open all days makes any difference.

##### 2. What is/are the insight(s) found from the chart?

There is no pattern between number of followers and rating. Both open-all-days and not-open-all-days restaurants receive mixed ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Ratings are not biased by reviewer popularity or restaurant availability. Being open all days may improve convenience, but it does not alone guarantee higher ratings.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#Multivariate

In [None]:
selected_columns = merged_data[['Rating', 'Cost', 'Reviewer_Followers', 'Is_Open_All_Days','Cuisines_Count','Has_Pictures']]
data_for_heatmap = selected_columns.corr()  #made the correlation matrix

plt.figure(figsize=(8, 4))
sns.heatmap(data_for_heatmap , annot=True ,cmap ='Greens')
plt.title("Correlation Between Different Features")
sns.set_style("darkgrid")
plt.tight_layout()
plt.show();

##### 1. Why did you pick the specific chart?

To understand how multiple features are related to each other .

##### 2. What is/are the insight(s) found from the chart?

Ratings have very weak correlation with cost, reviewer followers, number of cuisines, pictures, and whether the restaurant is open all days. This suggests that ratings have little linear dependence on these factors, although a weak correlation does not rule out non-linear effects. Cost and number of cuisines show some relation, meaning restaurants with more cuisines tend to be slightly more expensive.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
#Multivariate

In [None]:
pairplot_cols = ['Rating', 'Cost', 'Reviewer_Followers', 'Cuisines_Count']   #numerical columns to compare

sns.set_style("darkgrid")
#pairplot builds its own figure, so a separate plt.figure() call is not needed
sns.pairplot(data=merged_data, vars=pairplot_cols, hue='Has_Pictures')
plt.suptitle("Pair Plot of Numerical Features by Has_Pictures", y=1.02)
plt.show();

##### 1. Why did you pick the specific chart?

To observe relationships between multiple numerical features at once and see how they differ based on whether restaurants have pictures.

##### 2. What is/are the insight(s) found from the chart?

There is no strong visible relationship between ratings and cost or reviewer followers. Restaurants with pictures and without pictures are spread similarly across most feature combinations, with only slight differences in distribution.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

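The three statements tested below are:
1. Restaurants with pictures receive different average ratings than restaurants without pictures.
2. Restaurant cost is correlated with rating.
3. Restaurants that are open 24 hours receive different average ratings than other restaurants.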

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### Do restaurants with pictures get higher ratings?
* H0 (Null): There is no significant difference in average ratings between restaurants with pictures and without pictures.
* H1 (Alternate): Restaurants with pictures have a significantly different average rating than those without pictures. The test below is two-sided, so the direction (higher or lower) is read from the sign of the t-statistic.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import ttest_ind

with_pics = merged_data[merged_data['Has_Pictures'] == 1]['Rating']
without_pics = merged_data[merged_data['Has_Pictures'] == 0]['Rating']

t_stat, p_val = ttest_ind(with_pics, without_pics, equal_var=False)
t_stat, p_val

##### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test .

##### Why did you choose the specific statistical test?

This test compares the means of two independent groups (restaurants with pictures vs without pictures) to check if their average ratings are significantly different.
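
If the research question is strictly directional ("do restaurants with pictures get *higher* ratings?"), a one-sided version of the same Welch test may be a closer match. The sketch below assumes SciPy 1.6 or newer, where `ttest_ind` accepts an `alternative` argument; the two arrays are made-up ratings used only for illustration, not values from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative ratings only (not taken from the dataset)
with_pics = np.array([4.0, 4.5, 5.0, 3.5, 4.0, 4.5])
without_pics = np.array([3.5, 4.0, 3.0, 4.0, 3.5, 3.0])

# Two-sided Welch test: H1 is "the two means differ"
t_two, p_two = ttest_ind(with_pics, without_pics, equal_var=False)

# One-sided Welch test: H1 is "with_pics has the higher mean"
t_one, p_one = ttest_ind(with_pics, without_pics, equal_var=False,
                         alternative='greater')

print(f"two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```

When the t-statistic is positive, the one-sided p-value is half of the two-sided one, so the choice of alternative should be fixed before looking at the data.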

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### Is higher cost associated with higher rating?

* H0: There is no significant correlation between restaurant cost and rating.
* H1: There is a significant correlation between restaurant cost and rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import pearsonr

corr, p_val = pearsonr(merged_data['Cost'], merged_data['Rating'])
corr, p_val

##### Which statistical test have you done to obtain P-Value?

Pearson correlation test .

##### Why did you choose the specific statistical test?

Pearson correlation measures the strength and direction of linear relationship between two continuous variables (cost and rating).
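
For reference, the sample coefficient that `pearsonr` reports can be written as follows, where $x_i$ is the cost and $y_i$ the rating of review $i$, and $\bar{x}$, $\bar{y}$ are the sample means over all $n$ reviews (the symbols are introduced here for illustration):

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

A value of $r$ close to 0 only indicates the absence of a linear relationship. Because ratings are bounded and take few distinct values, a rank-based measure such as Spearman correlation could be used as an additional check.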

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### Do 24-hour open restaurants get different ratings compared to non-24-hour ones?

* H0: There is no significant difference in average ratings between 24-hour open restaurants and others.
* H1: There is a significant difference in average ratings between 24-hour open restaurants and others.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import ttest_ind

open_24 = merged_data[merged_data['Is_24_Hours'] == 1]['Rating']
not_open_24 = merged_data[merged_data['Is_24_Hours'] == 0]['Rating']

t_stat, p_val = ttest_ind(open_24, not_open_24, equal_var=False)
t_stat, p_val

##### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test.

##### Why did you choose the specific statistical test?

The test compares average ratings of two independent groups to check whether operating 24 hours has a statistically significant effect on ratings.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

In [None]:
#re-checking and ensuring no missing values remain
#missing reviews are filled with an empty string so that NaN is not read as text during tokenization
merged_data['Review'] = merged_data['Review'].fillna('')
merged_data['Rating'] = merged_data['Rating'].fillna(merged_data['Rating'].median())
merged_data['Cost'] = merged_data['Cost'].fillna(merged_data['Cost'].median())
merged_data['Reviewer_Review_Count'] = merged_data['Reviewer_Review_Count'].fillna(merged_data['Reviewer_Review_Count'].median())
merged_data['Reviewer_Followers'] = merged_data['Reviewer_Followers'].fillna(merged_data['Reviewer_Followers'].median())
merged_data['Collections'] = merged_data['Collections'].fillna('Unknown')
merged_data['Timings'] = merged_data['Timings'].fillna('Not Available')
merged_data['Cuisines_Count'] = merged_data['Cuisines_Count'].fillna(merged_data['Cuisines_Count'].median())

#Is_Open_All_Days and Is_24_Hours were derived explicitly earlier, so every row already has a value.

In [None]:
#this confirms that no missing values remain, as each one was replaced according to its column type
merged_data.isnull().sum()

##### What all missing value imputation techniques have you used and why did you use those techniques?

### Number of null values found in each column
* Review: 9
* Rating: 3
* Time: 2
* Cost: 3686
* Collections: 5000
* Timings: 100

The following modifications were made to the respective columns.

* The text column (Review) was filled with empty strings.
* Numerical columns (Rating, Cost) were imputed using the median to reduce the impact of outliers.
* Categorical columns (Collections, Timings) were filled with meaningful placeholders like “Unknown” or “Not Available”.

#### The Review column is the most important column in the df, so its null values were filled with an empty string; writing in invented reviews, good or bad, could bias the model.
#### Numerical values were filled with the median, which was the most appropriate choice as Cost had almost 35% missing data.
#### Collections was the most unusual column as it had 50% missing values, which were replaced with a suitable placeholder; the same was done for the Timings col.
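
The counts above can be reproduced before imputation with a small helper. This is a minimal sketch: `missing_summary` is a hypothetical name, and the `demo` frame is made-up data standing in for `merged_data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values for each column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({'missing_count': counts, 'missing_percent': percent})
    # Keep only columns with at least one missing value, worst first
    return summary[summary['missing_count'] > 0].sort_values(
        'missing_count', ascending=False)

# Tiny made-up frame standing in for merged_data
demo = pd.DataFrame({
    'Review': ['good food', None, 'slow service', 'nice place'],
    'Cost': [500.0, np.nan, np.nan, 800.0],
    'Rating': [4.0, 3.0, 5.0, 2.0],
})
report = missing_summary(demo)
print(report)
```

Seeing the percentage next to the count makes it easier to decide between imputing a column and dropping it.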



### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

#### Plotting of Columns Before Removing Outliers

In [None]:
#these numerical columns are selected because they are the most likely to contain outliers
#outlier checks are not needed for text, boolean, categorical string or time columns

cols = ['Rating', 'Reviewer_Review_Count', 'Reviewer_Followers', 'Cost', 'Cuisines_Count']

plt.figure(figsize=(10, 6))

for i, col in enumerate(cols, 1):
    plt.subplot(2, 3, i)
    plt.hist(merged_data[col], bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


In [None]:
#Using the IQR technique for each col (explanation is in the text cell below)
#find the 25th and 75th percentile of each column
#either .describe() or .quantile() can be used on a column to get the 25th and 75th percentiles

In [None]:
#loop for the IQR method
for col in ['Rating', 'Reviewer_Review_Count', 'Reviewer_Followers', 'Cost', 'Cuisines_Count']:
    Q1 = merged_data[col].quantile(0.25)
    Q3 = merged_data[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR

    merged_data[col] = np.where(merged_data[col] > upper_limit, upper_limit,
                                np.where(merged_data[col] < lower_limit, lower_limit, merged_data[col]))


#### Plotting of Columns After Removing Outliers

In [None]:
cols = ['Rating', 'Reviewer_Review_Count', 'Reviewer_Followers', 'Cost', 'Cuisines_Count']

plt.figure(figsize=(10, 6))

for i, col in enumerate(cols, 1):
    plt.subplot(2, 3, i)
    plt.hist(merged_data[col], bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

1. The histograms plotted above indicate a good number of outliers in the 5 selected cols.
2. None of the plots followed a normal distribution; all were either right or left skewed, so the Z-score technique was not suitable for finding outliers.
3. For this reason the IQR (Interquartile Range) technique was used to find and handle the outliers.
4. Instead of trimming the outliers, they were capped, which keeps every row in the dataset.
5. In this technique the 25th and 75th percentile of each col are found first; their difference is the IQR, which is then used to compute the lower and upper limits of the col.
6. Values outside these limits were set to the nearest limit, so the outliers are now handled.
7. After applying IQR-based capping, the distributions became less skewed and extreme values were reduced. This helps prevent outliers from dominating model training while retaining all observations.
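
The capping rule described in the steps above can also be written with `Series.clip`, which is equivalent to the nested `np.where` call. This is a minimal sketch: `cap_iqr` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] instead of dropping them."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up follower counts with one extreme point
followers = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
capped = cap_iqr(followers)
print(capped.tolist())
```

Here Q1 = 2.25 and Q3 = 4.75, so the upper limit is 4.75 + 1.5 × 2.5 = 8.5 and the value 100 is pulled down to 8.5 while the row itself is kept.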

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

In [None]:
#converted the boolean True/False columns to int 1/0 and the float count columns to int
cols_to_int = ['Is_Open_All_Days','Is_24_Hours','Has_Pictures','Cuisines_Count','Reviewer_Review_Count','Reviewer_Followers']
merged_data[cols_to_int] = merged_data[cols_to_int].astype(int)

In [None]:
#these columns add little information, and one-hot encoding (the only applicable encoding) would make the data too wide
cols_to_drop = ['Collections', 'Timings']
merged_data = merged_data.drop(columns=cols_to_drop)

* The Timings column contains unstructured textual information about restaurant operating hours, which is difficult to reliably encode into numerical features and may introduce noise. Instead, structured temporal features extracted from the Time column were used.
* The Collections column has high cardinality and sparse categorical values; encoding it could split it into 50-100 additional columns, leading to a high-dimensional feature space and overfitting. Therefore, it was excluded to maintain a compact and interpretable feature set for modeling.


#### What all categorical encoding techniques have you used & why did you use those techniques?

1. In my view there was no real need for categorical encoding in this dataset.
2. The data types of some columns were changed from bool/float to integers wherever required.
3. Ordinal encoding was not possible because the categories have no natural order, but one-hot encoding was an option.
4. The Timings and Collections columns were dropped because one-hot encoding them would have generated too many columns (around 50+), which would be heavy to handle and is not required.
5. The information in these 2 columns was not useful enough to justify the extra columns, so the columns were removed.
6. The Cuisines col was dropped because the new Cuisines_Count col was sufficient to capture the importance of that column.
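
If Collections were to be kept in a later iteration, one common way to limit the column explosion described above is to one-hot encode only the most frequent categories and group the rest. This is a sketch of that alternative, not what the notebook does: `one_hot_top_k` is a hypothetical helper, the category names are invented, and it assumes one label per row.

```python
import pandas as pd

def one_hot_top_k(series, k=2, other_label='Other'):
    """One-hot encode only the k most frequent categories; lump the rest together."""
    top = series.value_counts().nlargest(k).index
    reduced = series.where(series.isin(top), other_label)
    return pd.get_dummies(reduced, prefix=series.name, dtype=int)

# Invented category values, one label per row
collections = pd.Series(
    ['Trending', 'Trending', 'Late Night', 'Late Night', 'Rooftop', 'Unknown'],
    name='Collections',
)
encoded = one_hot_top_k(collections, k=2)
print(encoded.columns.tolist())
```

With `k=2` the result has three columns (two frequent categories plus `Other`) regardless of how many rare categories exist.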



### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

In [None]:
#contractions in review text were expanded (e.g. 'don’t' to 'do not') to standardize language.
merged_data['Review'] = merged_data['Review'].apply(contractions.fix)

#### 2. Lower Casing

In [None]:
# Lower Casing

In [None]:
#converting the case of review col to lowercase .
merged_data['Review'] = merged_data['Review'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

In [None]:
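#URLs are removed first: once punctuation is stripped, 'https://' and 'www.' no longer match the URL pattern used in the next step
merged_data['Review'] = merged_data['Review'].str.replace(r'https?://\S+|www\.\S+', ' ', regex=True)
#removing the unnecessary punctuation and special characters from the review col text using a regex
merged_data['Review'] = merged_data['Review'].str.replace(r'[^\w\s]', ' ', regex=True)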

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

In [None]:
#removing the links inside strings of review col
merged_data['Review'] = merged_data['Review'].str.replace(r'https?://\S+|www\.\S+', '', regex=True)

In [None]:
#removing the texts which include digits in them
merged_data['Review'] = merged_data['Review'].str.replace(r'\b\w*\d\w*\b', '', regex=True)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
#removing the stopwords present in the texts
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):

    words = text.split()         #split into individual words
    filtered_words = [word for word in words if word not in stop_words]       #removed stopwords
    cleaned_text = " ".join(filtered_words)          #join back to sentence

    return cleaned_text

In [None]:
merged_data['Review'] = merged_data['Review'].apply(remove_stopwords)

In [None]:
# Remove White spaces

In [None]:
#removes the trailing , leading and in between spaces from the texts and replaces it with just one space
merged_data['Review'] = merged_data['Review'].str.replace(r'\s+', ' ', regex=True).str.strip()

#### 6. Rephrase Text

In [None]:
# Rephrase Text

##### After reviewing a sample of rows in the Review column, I found that very few contain slang or abbreviations, so the text is left as it is and no rephrasing step is applied.
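
The cleaning steps from sections 2 to 5 can be collected into one function, which also makes the order of the steps explicit. This is a minimal sketch using only the standard library: `clean_review` and `DEMO_STOPWORDS` are hypothetical names, and the short stopword set stands in for NLTK's English list. URLs are stripped before punctuation here, because the URL pattern relies on `://` and `.` still being present.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's English list instead
DEMO_STOPWORDS = {'the', 'a', 'was', 'is', 'and'}

def clean_review(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip URLs, punctuation, digit words and stopwords, in that order."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # URLs first
    text = re.sub(r'[^\w\s]', ' ', text)                 # punctuation and special characters
    text = re.sub(r'\b\w*\d\w*\b', ' ', text)            # words that contain digits
    words = [w for w in text.split() if w not in stop_words]
    return ' '.join(words)                               # split/join also collapses extra spaces

cleaned = clean_review("Visit https://example.com NOW!! Table4 was   great.")
print(cleaned)
```

Applied with `merged_data['Review'].apply(clean_review)`, a function like this keeps the whole text pipeline in one place and easy to test.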

#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize

In [None]:
def tokenizer(text):
    text = str(text)
    tokens = word_tokenize(text)
    return tokens

merged_data['Review_Tokens'] = merged_data['Review'].apply(tokenizer)

In [None]:
new_order = ['Restaurant','Rating','Review','Review_Tokens','Cost','Cuisines_Count','Has_Pictures',
    'Is_Open_All_Days','Is_24_Hours','Time','Reviewer_Review_Count','Reviewer_Followers'
]

merged_data = merged_data[new_order]

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer

In [None]:
lemma = WordNetLemmatizer()

def lemmatizer(tokens):
    tokens = list(tokens)
    lemmatized_words = [lemma.lemmatize(word) for word in tokens]
    return lemmatized_words

merged_data['Review_Lemmas'] = merged_data['Review_Tokens'].apply(lemmatizer)

In [None]:
new_order = ['Restaurant','Rating','Review','Review_Tokens','Review_Lemmas','Cost','Cuisines_Count','Has_Pictures',
    'Is_Open_All_Days','Is_24_Hours','Time','Reviewer_Review_Count','Reviewer_Followers'
]

merged_data = merged_data[new_order]

##### Which text normalization technique have you used and why?

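##### Lemmatization was used for text normalization because it converts words to meaningful base forms, preserving semantic meaning and improving consistency for feature extraction in NLP models. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so the call above mainly reduces plurals (e.g. dishes -> dish); a verb such as eating is reduced to eat only when `pos='v'` is supplied.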

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
sample_tokens = merged_data['Review_Tokens'].head(5)
merged_data['Review_POS_Sample'] = sample_tokens.apply(pos_tag)
merged_data[['Review_Tokens', 'Review_POS_Sample']].head(5)

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [None]:
#creating a new col for sentiment analysis by converting rating to its categorical part .
def create_sentiment(rating):
    if rating >= 4.0:
        return 2  #positive
    elif rating <= 2.0:
        return 0  #negative
    else:
        return 1  #neutral

merged_data['Sentiment'] = merged_data['Rating'].apply(create_sentiment)

In [None]:
import ast

def tokens_to_text(tokens):
    #Review_Tokens holds Python lists, so they can be joined directly
    #a string is parsed only if the list was stored as text (e.g. after a CSV round-trip)
    if isinstance(tokens, str):
        try:
            tokens = ast.literal_eval(tokens)
        except (ValueError, SyntaxError):
            return ""
    return ' '.join(tokens)         #join tokens with spaces

merged_data['Review_Text'] = merged_data['Review_Tokens'].apply(tokens_to_text)

In [None]:
merged_data['Review_Text'] = merged_data['Review']

x_text = merged_data['Review_Text']
y = merged_data['Sentiment']

x_train, x_test, y_train, y_test = train_test_split(x_text, y, test_size=0.2, random_state=42 , stratify=y)

vectorizer = TfidfVectorizer(max_features=3000, ngram_range=(1,2), min_df=5)
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf = vectorizer.transform(x_test)

In [None]:
x_train_tfidf.shape , x_test_tfidf.shape

##### Which text vectorization technique have you used and why?

TF-IDF (Term Frequency–Inverse Document Frequency) was used because it converts text into numerical features and gives more weight to informative words than to words that occur in most reviews.
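
To make the weighting concrete, the sketch below computes TF-IDF by hand for three made-up reviews. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's default settings; `tfidf` is a hypothetical helper, not the library implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors, idf

docs = ["food good service good", "food bad", "food average service slow"]
vectors, idf = tfidf(docs)
print({term: round(value, 3) for term, value in idf.items()})
```

The word "food" appears in every review and receives the lowest idf, while "bad" appears in only one and is weighted more heavily, which is the behaviour the paragraph above describes.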

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

In [None]:
#extracting features from Time col
merged_data['Review_Year'] = merged_data['Time'].dt.year
merged_data['Review_Month'] = merged_data['Time'].dt.month
merged_data['Review_DayOfWeek'] = merged_data['Time'].dt.dayofweek
merged_data['Review_Hour'] = merged_data['Time'].dt.hour

#review length using the Review_Tokens col
merged_data['Review_Length'] = merged_data['Review_Tokens'].apply(len)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

In [None]:
#dropping the less important or redundant columns from the df .
merged_data = merged_data.drop(columns=['Review_Tokens','Review_Lemmas','Time','Review_POS_Sample'])

##### What all feature selection methods have you used  and why?

Feature selection was performed using domain knowledge and correlation-based filtering.
* Non-informative columns (e.g. raw timestamp, demo-only POS tags) were removed.
* Redundant intermediate text columns (tokens and lemmas) were excluded after TF-IDF vectorization to avoid data leakage and reduce dimensionality.
> This manual, knowledge-based selection helps reduce overfitting and improves model generalization across multiple ML models.

##### Which all features you found important and why?

1. The most important features include Cost, Cuisines_Count, Has_Pictures, Is_Open_All_Days, Is_24_Hours, Reviewer_Review_Count, Reviewer_Followers, and temporal features (review year, month, day of week, hour).
2. These features indicate pricing, menu diversity, content richness, restaurant availability, reviewer activity, and time-based patterns, all of which are intuitively related to customer ratings and behavior.
3. Additionally, TF-IDF text features from reviews were retained as they capture sentiment and contextual cues from user feedback.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

In [None]:
#applying logarithmic transformations
from sklearn.preprocessing import FunctionTransformer

In [None]:
trf = FunctionTransformer(np.log1p, validate=False)
cols = ['Cost', 'Reviewer_Review_Count', 'Reviewer_Followers','Review_Length']
merged_data[cols] = trf.fit_transform(merged_data[cols])

* Yes, the skewed numerical features need to be transformed before training so that their distributions are closer to normal.
* Log transformation (`log1p`) was preferred over Yeo–Johnson or Box-Cox because the variables are non-negative and right-skewed, which is typical for cost; `log1p` also handles zero values.
* Log transform offers simplicity, interpretability, and sufficient normalization.
* Yeo–Johnson can handle a wider range of distributions, including negative values, but that flexibility is not needed for this dataset.
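
The effect of `log1p` on right-skewed values can be checked numerically. This is a small sketch with made-up costs; `skewness` is a hypothetical helper that computes the moment-based sample skewness.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Made-up costs with one very expensive restaurant
cost = np.array([100, 200, 300, 400, 500, 5000])
cost_log = np.log1p(cost)

skew_before = skewness(cost)
skew_after = skewness(cost_log)
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The skewness drops after the transform but does not reach zero, so comparing the value before and after on the real columns is a reasonable way to confirm that the transform helps.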

### 6. Data Scaling

In [None]:
# Scaling your data

In [None]:
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack

In [None]:
num_cols = ['Cost', 'Cuisines_Count', 'Reviewer_Review_Count', 'Reviewer_Followers',
    'Review_Year', 'Review_Month', 'Review_DayOfWeek', 'Review_Hour', 'Review_Length'
]

#extract numeric features from merged_data
X_numerical = merged_data[num_cols]

#align numeric features with train/test indices
X_numerical_train = X_numerical.loc[y_train.index]
X_numerical_test = X_numerical.loc[y_test.index]

#scale numeric features (fit & transform train, transform test)
scaler = StandardScaler()
X_numerical_train_scaled = scaler.fit_transform(X_numerical_train)
X_numerical_test_scaled = scaler.transform(X_numerical_test)

#combine TF-IDF with scaled numeric features
X_train_final = hstack([x_train_tfidf, X_numerical_train_scaled])
X_test_final = hstack([x_test_tfidf, X_numerical_test_scaled])

print(X_train_final.shape, X_test_final.shape)

##### Which method have you used to scale you data and why?

Standardization was preferred over Normalization as it centers features to zero mean and unit variance, making it more suitable for the linear and distance-based models used in this study.
* Min–Max normalization can be overly sensitive to extreme values present in cost and reviewer metrics.
* There is also no strict requirement here for values to lie in the range [0, 1], so it is better to preserve the relative spread of each column than to compress it into a fixed range.
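
What "fit on train, transform test" means for standardization can be shown with plain NumPy. This is a sketch on made-up numbers, not the notebook's data; it uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up single feature, already split into train and test rows
train = np.array([[200.0], [400.0], [600.0], [800.0]])
test = np.array([[500.0], [1000.0]])

# Statistics are estimated on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

The test rows are scaled with the training mean and standard deviation, so no information from the test set leaks into the preprocessing.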

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not strictly required, but TF-IDF creates a very large number of features. Reducing dimensions helps remove noise, speeds up model training, and can improve performance for clustering and distance-based models. Hence, dimensionality reduction was explored as an optimization step.

In [None]:
# Dimensionality Reduction (If needed)

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=200, random_state=42)

x_train_tfidf_svd = svd.fit_transform(x_train_tfidf)
x_test_tfidf_svd = svd.transform(x_test_tfidf)

print("Before:", x_train_tfidf.shape)
print("After :", x_train_tfidf_svd.shape)
print("Explained variance:", svd.explained_variance_ratio_.sum())


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

TruncatedSVD was used because it works well with sparse TF-IDF data and reduces thousands of text features into a smaller number of meaningful components while keeping most of the important information.
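
How much information a given number of components keeps can be read from the cumulative explained variance. The sketch below uses a random sparse matrix as a stand-in for the TF-IDF output, so the printed percentages say nothing about the review data; the sizes and component counts are arbitrary assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF output (not real review data)
X = sparse_random(200, 50, density=0.1, format='csr', random_state=42)

svd_demo = TruncatedSVD(n_components=20, random_state=42)
X_reduced = svd_demo.fit_transform(X)

# Cumulative share of variance kept by the first k components
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
for k in (5, 10, 20):
    print(f"{k} components keep {cumulative[k - 1]:.1%} of the variance")
```

Running the same loop on `x_train_tfidf` would show whether 200 components really retain most of the variance or whether a different number is a better trade-off.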

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
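#same stratified split as in the vectorization step, so the indices stay aligned with the TF-IDF and scaled numeric matrices
x_train, x_test, y_train, y_test = train_test_split(
    x_text, y, test_size=0.2, random_state=42, stratify=y
)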

##### What data splitting ratio have you used and why?

1. The dataset was split into training and testing sets using an 80:20 ratio to evaluate model performance on unseen data.
2. Splitting was performed before model training and feature scaling to avoid data leakage and ensure fair evaluation.
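
Because the text and the numeric features are processed separately, it helps to split the row index once and slice every piece with it. This is a minimal sketch on a made-up frame standing in for `merged_data`; the `_demo` variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for merged_data
demo = pd.DataFrame({
    'Review_Text': [f'review {i}' for i in range(20)],
    'Cost': [100.0 * (i + 1) for i in range(20)],
    'Sentiment': [2] * 12 + [1] * 4 + [0] * 4,
})

# Split the row index once, stratified on the target, then slice everything with it
train_idx, test_idx = train_test_split(
    demo.index.to_numpy(), test_size=0.2, random_state=42,
    stratify=demo['Sentiment'],
)

text_train_demo = demo.loc[train_idx, 'Review_Text']
num_train_demo = demo.loc[train_idx, ['Cost']]
y_train_demo = demo.loc[train_idx, 'Sentiment']
y_test_demo = demo.loc[test_idx, 'Sentiment']

print(len(train_idx), len(test_idx))
```

Slicing with a shared index guarantees that the TF-IDF rows, the numeric rows and the labels refer to the same reviews, even if the split settings change later.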

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the dataset is slightly imbalanced.
After converting ratings into High (≥4) and Low (<4), around 63% of the samples belong to the high-rating class and about 37% belong to the low-rating class. This shows a mild imbalance, which is common in review datasets where positive reviews are more frequent than negative ones.

In [None]:
# Handling Imbalanced Dataset (If needed)

In [None]:
rating_binary = (merged_data['Rating'] >= 4).astype(int)
rating_binary.value_counts(normalize=True) * 100


In [None]:
#Review_Length, Reviewer_Followers and Reviewer_Review_Count now hold log1p values,
#so they are kept as floats; casting them to int here would truncate the values and discard information
merged_data[['Review_Length','Reviewer_Followers','Reviewer_Review_Count']].dtypes

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No special resampling technique was used.
> Since the imbalance is mild (63% vs 37%), the original data distribution was kept. During model training, class weighting can be used to reduce bias toward the majority class without creating artificial data. This keeps the model learning realistic patterns from real reviews.
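
The class weighting mentioned above can be illustrated with the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the rule scikit-learn documents for `class_weight='balanced'`. This is a sketch using only the standard library: `balanced_class_weights` is a hypothetical helper, and the labels are synthetic, built to match the 63% / 37% proportions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 63 high-rating and 37 low-rating labels, matching the proportions reported above
labels = [1] * 63 + [0] * 37
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

The minority class receives the larger weight, so each class contributes equally to the loss in total. In practice this is usually requested by passing `class_weight='balanced'` to the classifier rather than computing the weights by hand.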

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

In [None]:
#ML Model 1: Logistic Regression for Sentiment Analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = merged_data['Review_Text']
y = merged_data['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#converted text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=2000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

#fitting the algo
model = LogisticRegression(max_iter=500)
model.fit(X_train_tfidf, y_train)

#predicting the value
y_pred = model.predict(X_test_tfidf)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

In [None]:
#model performance using evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)

print("Model Performance:")
print(f"Accuracy: {accuracy:.4f}")
print("\n" + classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm);

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV

#cross-validation
cv_scores = cross_val_score(model, X_train_tfidf, y_train, cv=5)
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f}")

#hyperparameter tuning
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=3)
grid.fit(X_train_tfidf, y_train)

print(f"\nBest Parameters: {grid.best_params_}")
print(f"Best CV Score: {grid.best_score_:.4f}")

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV to find the best value of C (regularization parameter). It tests different values using cross-validation and selects the best one.
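One refinement worth considering: when the TF-IDF vectorizer is fitted before cross-validation, each validation fold has already influenced the vocabulary and IDF weights. A minimal sketch of the same search with the vectorizer wrapped in a `Pipeline`, so it is refitted inside every fold, could look like this. The toy `texts` and `labels` below are hypothetical stand-ins for `merged_data['Review_Text']` and `merged_data['Sentiment']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the review text and sentiment labels.
texts = [
    "great food and friendly staff", "loved the ambience", "excellent biryani",
    "amazing service", "tasty and fresh", "wonderful place",
    "bad service and cold food", "worst experience ever", "terrible taste",
    "poor hygiene", "rude staff", "food was stale",
]
labels = ["Positive"] * 6 + ["Negative"] * 6

# Vectorizer and classifier are tuned and refitted together in each CV fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=2000)),
    ("clf", LogisticRegression(max_iter=500)),
])

# The "clf__" prefix routes the parameter to the classifier step.
param_grid = {"clf__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

On the real data, `search.best_estimator_` can then be applied directly to raw review text, because the fitted vectorizer travels with the classifier.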

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt

best_model = grid.best_estimator_
y_pred_tuned = best_model.predict(X_test_tfidf)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)

# Comparison chart (separate names so the label vector `y` is not overwritten)
model_names = ['Baseline', 'Tuned']
accuracies = [accuracy, accuracy_tuned]
plt.figure(figsize=(6, 4))
plt.bar(model_names, accuracies, color=['lightblue', 'lightgreen'])
plt.ylim(0, 1)
plt.ylabel('Accuracy')
plt.title('Model Performance Comparison')
# Annotate each bar with its accuracy value
for i, v in enumerate(accuracies):
    plt.text(i, v + 0.02, f'{v:.4f}', ha='center')
plt.show()

The chart shows no substantial improvement after hyperparameter tuning: accuracy rose only marginally (by roughly 1%) and did not decline, so tuning did not hurt the model.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

In [None]:
# ML Model 2: Random Forest
from sklearn.ensemble import RandomForestClassifier

#fitting the algo
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)

y_pred_rf = rf_model.predict(X_test_tfidf)

In [None]:
#model Performance using evaluation metric score chart

acc_rf = accuracy_score(y_test, y_pred_rf)

print("Random Forest Performance:")
print(f"Accuracy: {acc_rf:.4f}")
print("\n" + classification_report(y_test, y_pred_rf, target_names=['Negative', 'Neutral', 'Positive']))

#confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
print(cm_rf)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
cv_scores_rf = cross_val_score(rf_model, X_train_tfidf, y_train, cv=5)
print(f"CV Scores: {cv_scores_rf}")
print(f"Mean CV Score: {cv_scores_rf.mean():.4f}")

#hyperparameter tuning
param_grid_rf = {'n_estimators': [50, 100], 'max_depth': [10, 20]}
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=3)
grid_rf.fit(X_train_tfidf, y_train)

print(f"\nBest Parameters: {grid_rf.best_params_}")
print(f"Best CV Score: {grid_rf.best_score_:.4f}")

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV to tune n_estimators and max_depth. It evaluates every combination in the grid with cross-validation and keeps the one with the best accuracy.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
best_rf = grid_rf.best_estimator_
y_pred_rf_tuned = best_rf.predict(X_test_tfidf)
acc_rf_tuned = accuracy_score(y_test, y_pred_rf_tuned)

print("Before vs After Tuning:")
print(f"Baseline: {acc_rf:.4f}")
print(f"Tuned: {acc_rf_tuned:.4f}")
print(f"Improvement: {acc_rf_tuned - acc_rf:+.4f}")

Tuning did not produce a significant improvement in accuracy, so the Random Forest is not a strong candidate for the final sentiment model.



#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

* The Random Forest can be used alongside Logistic Regression for ensemble predictions.
* It handles complex review patterns somewhat better, but the gain is small.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

In [None]:
# ML Model 3: K-Means Clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Prepare restaurant-level data
restaurant_data = merged_data.groupby('Restaurant').agg({
    'Cost': 'mean',
    'Rating': 'mean',
    'Cuisines_Count': 'first'
}).reset_index()

#features selection and scaling
X_cluster = restaurant_data[['Cost', 'Rating', 'Cuisines_Count']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)

#fitted 4-clusters
kmeans = KMeans(n_clusters=4, random_state=42)
restaurant_data['Cluster'] = kmeans.fit_predict(X_scaled)

print("Cluster distribution:")
print(restaurant_data['Cluster'].value_counts().sort_index())

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

In [None]:
from sklearn.metrics import silhouette_score

sil_score = silhouette_score(X_scaled, restaurant_data['Cluster'])
print(f"Silhouette Score: {sil_score:.4f}")

# Cluster summary
print("\nCluster Characteristics:")
print(restaurant_data.groupby('Cluster')[['Cost', 'Rating', 'Cuisines_Count']].mean())

# Visualization
plt.figure(figsize=(7, 5))
scatter = plt.scatter(restaurant_data['Cost'], restaurant_data['Rating'],
                     c=restaurant_data['Cluster'], cmap='viridis', s=80)
plt.xlabel('Average Cost')
plt.ylabel('Average Rating')
plt.title('Restaurant Clusters')
plt.colorbar(scatter, label='Cluster')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
inertias = []
sil_scores = []

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, km.labels_))

print(f"Best K: {range(2, 8)[np.argmax(sil_scores)]}")

In [None]:
optimal_k = range(2, 8)[np.argmax(sil_scores)]
kmeans_opt = KMeans(n_clusters=optimal_k, random_state=42)
restaurant_data['Cluster_Opt'] = kmeans_opt.fit_predict(X_scaled)

sil_opt = silhouette_score(X_scaled, restaurant_data['Cluster_Opt'])

print(f"Original (K=4): {sil_score:.4f}")
print(f"Optimal (K={optimal_k}): {sil_opt:.4f}")
print(f"Improvement: {sil_opt - sil_score:+.4f}")

##### Which hyperparameter optimization technique have you used and why?

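K-Means is unsupervised, so there are no labels for GridSearchCV to score against. Instead, I ran a manual search over the number of clusters (K = 2 to 7), fitted K-Means for each value, and selected the K with the highest silhouette score. Inertia was also recorded for each K.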

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Using the optimal K improved the silhouette score, producing better-defined restaurant segments for business analysis.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered four main metrics:
1. Accuracy - Shows overall correctness of predictions. Higher accuracy means the model can automate more reviews correctly, saving time and reducing manual work.
2. Precision - Important because it reduces false alarms. High precision means restaurant owners get accurate alerts and don't waste time on false positives.
3. Recall - Critical for catching all negative reviews. If we miss customer complaints, unhappy customers might not get responses, hurting business reputation. High recall ensures we don't miss important feedback.
4. F1-Score - Balances precision and recall. It gives a single number to track overall model quality, making it easier to monitor performance over time.

These metrics together help ensure the model is reliable for real-world use, where both accuracy and catching all complaints matter.
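As a small illustration of how these four metrics differ, the sketch below computes them per class on hand-written labels. The `y_true` and `y_pred` lists are hypothetical and are not taken from the model's output; the point is only that recall for the Negative class can lag behind overall accuracy.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted sentiments for ten reviews.
y_true = ["Negative", "Negative", "Negative", "Negative",
          "Neutral", "Neutral",
          "Positive", "Positive", "Positive", "Positive"]
y_pred = ["Negative", "Negative", "Neutral", "Positive",
          "Neutral", "Positive",
          "Positive", "Positive", "Positive", "Negative"]

class_names = ["Negative", "Neutral", "Positive"]

# Overall fraction of correct predictions.
accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall, and F1, returned in the order of class_names.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=class_names, zero_division=0
)

print(f"Accuracy: {accuracy:.2f}")
for name, p, r, f, s in zip(class_names, precision, recall, f1, support):
    print(f"{name:<9} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

In this toy case only half of the true Negative reviews are caught, which is the kind of gap a single accuracy figure would hide.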

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Logistic Regression sentiment model as the final model. Its accuracy was consistent across the provided data, which suggests it should also generalize to larger datasets.
1. Good Accuracy - It achieved around 80-85% accuracy, which is reliable for sentiment classification.
2. Consistent Performance - The cross-validation scores were stable, showing the model works consistently on different data subsets.
3. Simple and Fast - Logistic Regression is easy to understand and runs quickly, making it suitable for processing large numbers of reviews in real time.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Logistic Regression is a classification algorithm that learns which words in reviews indicate positive, negative, or neutral sentiment. It works in three steps:

1. Convert the review text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).
2. Assign weights to different words based on their importance.
3. Combine these weights to predict the sentiment category.

Feature Importance:
I analyzed which words are most important for predictions:

* For Negative Sentiment: words like "bad", "worst", "terrible", "poor service" have high negative weights. These words strongly indicate customer dissatisfaction.
* For Positive Sentiment: words like "excellent", "amazing", "great food", "loved" have high positive weights. These indicate customer satisfaction.
* For Neutral Sentiment: words like "okay", "average", "decent" fall in the middle.


Understanding feature importance helps:
1. Identify what customers complain about most (e.g., "slow service", "cold food").
2. Recognize what makes customers happy. (e.g., "friendly staff", "delicious")

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
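
A minimal sketch of the save and reload steps, assuming the vectorizer and classifier are bundled in one `Pipeline` so the loaded model can accept raw text. The toy training data, the file name `sentiment_model.joblib`, and the temporary directory are all hypothetical choices for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned reviews.
texts = [
    "great food and friendly staff", "excellent biryani", "loved the place",
    "bad service and cold food", "worst experience ever", "terrible taste",
]
labels = ["Positive"] * 3 + ["Negative"] * 3

sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=2000)),
    ("clf", LogisticRegression(max_iter=500)),
]).fit(texts, labels)

# Save the fitted pipeline (hypothetical file name, temporary location).
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_model.joblib")
joblib.dump(sentiment_pipeline, model_path)

# Load it back and predict on unseen reviews as a sanity check.
loaded_pipeline = joblib.load(model_path)
unseen = ["the food was great", "very bad service"]
original_preds = sentiment_pipeline.predict(unseen)
loaded_preds = loaded_pipeline.predict(unseen)

print("Loaded model predictions:", list(loaded_preds))
print("Matches original model:", list(loaded_preds) == list(original_preds))
```

Saving the classifier alone would not be enough, since new reviews must pass through the same fitted TF-IDF vocabulary before prediction.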

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, the Zomato dataset was successfully cleaned, explored, and analyzed to understand customer behavior and restaurant characteristics. Several visualizations were created to identify patterns between ratings, cost, cuisines, and customer reviews.

Restaurants were grouped into different segments using clustering, and customer sentiments were analyzed from review text.

The results show that factors such as pricing, cuisines offered, and customer engagement play an important role in restaurant popularity and customer satisfaction.

These insights can help customers make better dining choices and help the business improve its recommendation strategies and overall service quality.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***