# **Project Name**    - Zomato Restaurant Rating Prediction & Sentiment Analysis

##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member**     - Manoj Kumar M

# Project Summary

**Introduction and Overview**
The restaurant industry in India, particularly in bustling metropolises like Hyderabad, is characterized by intense competition and evolving customer preferences. In this data-driven era, restaurant owners and food aggregators like Zomato strive to understand what drives customer satisfaction and, consequently, high ratings. This project, "Zomato Restaurant Rating Prediction & Sentiment Analysis," aims to leverage machine learning and natural language processing (NLP) to decode the factors influencing restaurant ratings. By analyzing a rich dataset comprising restaurant metadata (105 restaurants) and **10,000 customer reviews**, we seek to build a robust predictive model that can estimate a restaurant's rating based on customer feedback and operational attributes.

**Data Understanding and Wrangling**
The project began with a thorough examination of two primary datasets: 'Zomato Restaurant names and Metadata.csv' and 'Zomato Restaurant reviews.csv'. The metadata file provided structural details for 105 restaurants, including their names, links, estimated cost for two, available cuisines, and operational timings. The reviews file offered a granular view of customer sentiment, containing fields for the reviewer's name, the review text, the given rating (on a scale of 1 to 5), and the timestamp.
Data Wrangling was a critical first step. We addressed data quality issues such as inconsistent formatting in the 'Cost' column (removing commas and converting to numeric) and handling missing values. A crucial operation was merging the two datasets on the restaurant name, creating a unified view where each review was enriched with its corresponding restaurant's metadata. This allowed us to correlate review sentiment with price points and cuisine types. We also converted timestamps to datetime objects to enable temporal analysis.

**Exploratory Data Analysis (EDA)**
Our Exploratory Data Analysis revealed several compelling insights. Univariate analysis of the 'Rating' variable showed a significant left-skew, with a vast majority of ratings clustered around 4.0 and 5.0, indicating a generally positive bias in the dataset or a high level of customer satisfaction in this specific sample. Bivariate analysis between 'Cost' and 'Rating' suggested a weak but noticeable positive correlation, hinting that higher-priced establishments might be associated with slightly better ratings, possibly due to premium service or ambiance. We visualized the top-reviewed restaurants and the distribution of costs, identifying that most dining options fall in the affordable to mid-range bracket (500-1000 INR).

**Hypothesis Testing**
To validate our observations, we ran independent-samples t-tests. A t-test comparing the ratings of 'Expensive' (>800 INR) versus 'Cheap' (<=800 INR) restaurants confirmed a statistically significant difference, supporting the notion that price positioning impacts customer perception. We also compared review length between 5-star and 1-star reviews, and ratings between reviews with and without pictures. The EDA had suggested that customers often leave longer, more detailed reviews when they have strong feelings (either very positive or very negative).

**Feature Engineering and Preprocessing**
The core of our predictive power lay in Feature Engineering, particularly with the textual data. We implemented a comprehensive NLP pipeline:
1.  **Text Cleaning**: Removal of URLs, punctuation, and converting text to lowercase.
2.  **Stopword Removal**: Eliminating common words (is, the, and) to focus on meaningful content.
3.  **Normalization**: Using WordNetLemmatizer to reduce words to their base forms.
4.  **Vectorization**: employing TF-IDF (Term Frequency-Inverse Document Frequency) to convert reviews into numerical vectors, capturing the importance of sentiment-loaded words like "delicious," "bad," "slow," or "excellent."
We also handled the categorical 'Cuisines' variable by identifying the top 5 cuisines (North Indian, Chinese, etc.) and creating binary flags (One-Hot Encoding), allowing the model to weigh the popularity of specific food types.

**Machine Learning Modeling**
We experimented with three distinct regression models to predict ratings:
1.  **Linear Regression**: Served as a baseline, yielding a moderate R2 score but highlighting the non-linear complexity of the problem.
2.  **XGBoost Regressor**: A gradient boosting technique known for high performance, utilized with **RandomizedSearchCV** for efficient tuning.
3.  **Random Forest Regressor**: A powerful ensemble method. We utilized **GridSearchCV** to optimize its hyperparameters like `n_estimators`.
Ultimately, the **Random Forest Regressor** emerged as the best performer. We also addressed the potential issue of data imbalance (skewed ratings) by inspecting the target distribution and verifying that tree-based models are robust to such skewness, further supported by the log-transformation of the target variable for stability.
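
As a rough sketch of the tuning step described above, the snippet below runs `GridSearchCV` over a `RandomForestRegressor` on synthetic data. The parameter grid, sample sizes, and the use of `make_regression` are assumptions for illustration; the notebook's actual features and grid may differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF + metadata feature matrix (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical small grid; a real search would usually cover more values.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,            # 3-fold cross-validation on the training split
    scoring="r2",
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.score(X_test, y_test)  # R2 of the refit best model on held-out data
print(best_params, round(test_r2, 3))
```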

**Conclusion**
In conclusion, this project successfully demonstrates that customer reviews are the most potent predictor of restaurant ratings. While metadata like Cost and Cuisine play a role, the specific sentiment expressed in text is paramount. The developed model is "Deployment Ready," capable of processing raw review text and metadata to predict a rating, offering valuable, actionable intelligence for restaurant owners to improve their services and for Zomato to refine its recommendation algorithms.


# **GitHub Link -**

https://github.com/Manojkumarw13/Zomato-Restaurant-Rating-Prediction-and-Sentiment-Analysis

# **Problem Statement**




In the highly competitive food and beverage industry, particularly in tech-savvy hubs like Hyderabad, a restaurant's online reputation is its most valuable asset. Ratings on platforms like Zomato directly influence footfall and revenue. However, for restaurant owners and platform administrators, a simple average star rating is a lagging indicator—it tells you *how* you performed in the past but not necessarily *why*.

The core problem is the disconnect between unstructured feedback (thousands of text reviews) and structured performance metrics (Ratings). Stakeholders lack a scalable way to:
1.  **Quantify the impact of specific attributes** (like Cost, Cuisine Type, or specific keywords in reviews) on the overall rating.
2.  **Predict future ratings** based on early signals in customer text, allowing for proactive intervention.
3.  **Understand the "Voice of the Customer"** at scale without manually reading every review.

**Business Goal**
The objective of this project is to build a Machine Learning solution that can **predict the rating of a restaurant** based on its metadata (Cost, Cuisines) and customer reviews. By accurately modeling this relationship, we aim to provide actionable insights—such as identifying that "slow service" hurts ratings more than "high price"—enabling restaurant owners to focus their operational improvements where they matter most, and helping Zomato surface the most relevant dining options to users.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp
import warnings
warnings.filterwarnings('ignore')



import os

# Make a user-level NLTK data directory discoverable.
# This is the Windows default location; adjust the path on other platforms.
nltk.data.path.append(os.path.join(os.path.expanduser('~'), 'AppData', 'Roaming', 'nltk_data'))

# Download necessary NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

# Configuration / Constants
META_DATA_PATH = "Zomato Restaurant names and Metadata.csv"
REVIEWS_DATA_PATH = "Zomato Restaurant reviews.csv"
TFIDF_MAX_FEATURES = 5000


In [None]:
# Load Dataset
try:
    meta_df = pd.read_csv(META_DATA_PATH)
    reviews_df = pd.read_csv(REVIEWS_DATA_PATH)
    print("Datasets loaded successfully.")
except FileNotFoundError as e:
    # Report and re-raise: every later cell depends on both DataFrames existing
    print(f"Error loading files: {e}")
    raise


### Dataset First View

In [None]:
# Dataset First Look
print("--- Metadata First Look ---")
display(meta_df.head())
print("\n--- Reviews First Look ---")
display(reviews_df.head())


### Dataset Rows & Columns count



In [None]:
# Dataset Rows & Columns count
print("Metadata Shape:", meta_df.shape)
print("Reviews Shape:", reviews_df.shape)


### Dataset Info


In [None]:
# Dataset Info
print("--- Metadata Info ---")
meta_df.info()
print("\n--- Reviews Info ---")
reviews_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Metadata Duplicates:", meta_df.duplicated().sum())
print("Reviews Duplicates:", reviews_df.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("--- Metadata Missing Values ---")
print(meta_df.isnull().sum())
print("\n--- Reviews Missing Values ---")
print(reviews_df.isnull().sum())


In [None]:
# Visualizing the missing values
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.heatmap(meta_df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Metadata Missing Values')

plt.subplot(1, 2, 2)
sns.heatmap(reviews_df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Reviews Missing Values')
plt.show()


### What did you know about your dataset?

The dataset consists of two files:
1. **Metadata**: Contains details of 105 restaurants in Hyderabad, including Name, Links, Cost, Collections, Cuisines, and Timings. Key observations:
    - 'Cost' has some non-numeric characters (commas) which need cleaning.
    - 'Cuisines' and 'Timings' are text-based and might require parsing.
2. **Reviews**: Contains over 10,000 customer reviews. Columns include Restaurant, Reviewer, Review (text), Rating, Metadata, Time, and Pictures.
    - 'Rating' is the target variable but currently might be mixed type or object.
    - 'Review' text is unstructured and rich in sentiment.
    - There is a common key (Name/Restaurant) to merge these datasets.
Overall, we have a mix of numerical, categorical, and textual data suitable for regression (Rating prediction) and NLP tasks.
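
Before relying on the common key, it can help to check how well the two name columns actually match. The sketch below uses small made-up frames (the restaurant names and values are hypothetical) and pandas' `indicator=True` option to list reviews whose restaurant has no metadata row, and restaurants that have no reviews.

```python
import pandas as pd

# Toy stand-ins for meta_df and reviews_df (all values are invented).
meta = pd.DataFrame({
    "Name": ["Alpha Cafe", "Beta Grill", "Gamma Bistro"],
    "Cost": ["800", "1,200", "500"],
})
reviews = pd.DataFrame({
    "Restaurant": ["Alpha Cafe", "Alpha Cafe", "Beta Grill", "Delta Diner"],
    "Rating": ["5", "4", "Like", "3"],
})

# A left merge with an indicator column shows which reviews found a match.
check = reviews.merge(
    meta, left_on="Restaurant", right_on="Name", how="left", indicator=True
)
unmatched = check.loc[check["_merge"] == "left_only", "Restaurant"].unique().tolist()

# Restaurants present in the metadata but never reviewed.
no_reviews = sorted(set(meta["Name"]) - set(reviews["Restaurant"]))

print("Reviews without metadata:", unmatched)
print("Restaurants without reviews:", no_reviews)
```

An inner merge silently drops both groups, so counting them first shows how many rows the merge will lose.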


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("--- Metadata Columns ---")
print(meta_df.columns.tolist())
print("\n--- Reviews Columns ---")
print(reviews_df.columns.tolist())


In [None]:
# Dataset Describe
print("--- Metadata Describe ---")
display(meta_df.describe(include='all'))
print("\n--- Reviews Describe ---")
display(reviews_df.describe(include='all'))


### Variables Description

| Variable | Dataset | Description |
| :--- | :--- | :--- |
| **Name** | Metadata | Name of the Restaurant |
| **Links** | Metadata | URL link to the restaurant's Zomato page |
| **Cost** | Metadata | Approximate cost for two people to dine |
| **Collections** | Metadata | Zomato collections the restaurant features in (e.g., "Trending") |
| **Cuisines** | Metadata | Types of cuisines served by the restaurant |
| **Timings** | Metadata | Operating hours of the restaurant |
| **Restaurant** | Reviews | Name of the restaurant (links to Metadata 'Name') |
| **Reviewer** | Reviews | Name of the customer who posted the review |
| **Review** | Reviews | Text content of the customer's feedback |
| **Rating** | Reviews | Numeric rating given by the customer (Scale 1-5) |
| **Metadata** | Reviews | Reviewer statistics (e.g., number of reviews, followers) |
| **Time** | Reviews | Date and time when the review was posted |
| **Pictures** | Reviews | Number of pictures uploaded with the review |


### Check Unique Values for each variable.


In [None]:
# Check Unique Values for each variable.
print("--- Metadata Unique Values ---")
for col in meta_df.columns:
    print(f"{col}: {meta_df[col].nunique()} unique values")

print("\n--- Reviews Unique Values ---")
for col in reviews_df.columns:
    print(f"{col}: {reviews_df[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
print("--- Data Wrangling Started ---")

# 1. Cleaning 'Cost' Column
# Remove thousands separators, then convert to numeric; unparseable values become NaN.
meta_df['Cost'] = meta_df['Cost'].astype(str).str.replace(',', '', regex=False)
meta_df['Cost'] = pd.to_numeric(meta_df['Cost'], errors='coerce')
print("Cleaned 'Cost' column.")

# 2. Merging Datasets
# Merging reviews with metadata on Restaurant Name
# Reviews has 'Restaurant', Metadata has 'Name'
df = pd.merge(reviews_df, meta_df, left_on='Restaurant', right_on='Name', how='inner')
print(f"Merged Dataset Shape: {df.shape}")

# 3. Converting 'Time' to datetime
df['Time'] = pd.to_datetime(df['Time'], errors='coerce')
print("Converted 'Time' to datetime.")

# 4. Handling Rating
# 'Rating' might have textual values, coercing to numeric
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
print("Converted 'Rating' to numeric.")

# 5. Dropping Duplicates
initial_len = len(df)
df.drop_duplicates(inplace=True)
print(f"Dropped {initial_len - len(df)} duplicate rows.")

# 6. Handling Missing Values
# For critical analysis, we might drop rows where Rating or Review text is missing
df.dropna(subset=['Rating', 'Review'], inplace=True)

# Define y globally for legacy support
y = df['Rating']
print(f"Shape after dropping missing critical values: {df.shape}")

display(df.head())


### What all manipulations have you done and insights you found?

**Manipulations:**
1.  **Cost Cleaning**: The 'Cost' column in the metadata contained commas (e.g., "1,200") which prevented numerical analysis. I removed these commas and converted the column to a numeric type, coercing any unparseable value to NaN.
2.  **Merging**: The core analysis requires linking customer sentiment (Reviews) with restaurant attributes (Metadata). I performed an inner merge on the restaurant name, ensuring every review is associated with its correct cost, cuisine, etc.
3.  **Type Conversions**:
    - Converted `Time` to datetime objects to allow for time-series analysis (e.g., sentiment trends over years).
    - Converted `Rating` to numeric, handling potential errors where ratings might be missing or malformed.
4.  **Data Cleaning**: Removed duplicate rows to prevent bias and dropped rows with missing 'Rating' or 'Review' text as they are essential for the target variable and feature engineering respectively.

**Insights:**
- The merging process successfully combined over 10,000 reviews with valid metadata.
- Data quality issues such as non-numeric costs and duplicates were present and handled, ensuring a clean dataset for the model.
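
The coercion steps can be illustrated on toy values. This is a sketch with invented data, showing how `pd.to_numeric(..., errors="coerce")` turns unparseable entries into NaN and how to count what was lost in the process.

```python
import pandas as pd

# Invented raw values mimicking the 'Cost' and 'Rating' columns.
cost_raw = pd.Series(["800", "1,200", "500", None])
rating_raw = pd.Series(["5", "4.5", "Like", "3"])

# Strip thousands separators literally (no regex needed), then convert.
cost = pd.to_numeric(cost_raw.str.replace(",", "", regex=False), errors="coerce")

# Non-numeric ratings such as "Like" become NaN.
rating = pd.to_numeric(rating_raw, errors="coerce")

# Values that were present but could not be parsed.
n_bad = int(rating.isna().sum() - rating_raw.isna().sum())
print(cost.tolist(), rating.tolist(), n_bad)
```

Logging `n_bad` before calling `dropna` makes it clear how many reviews are discarded because of malformed ratings rather than missing ones.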


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***


#### Chart - 1

In [None]:
# Chart - 1: Distribution of Ratings
plt.figure(figsize=(10,6))
sns.countplot(x='Rating', data=df, palette='viridis')
plt.title('Distribution of Restaurant Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the target variable distribution and check for class imbalance.

##### 2. What is/are the insight(s) found from the chart?

Most ratings are clustered around 3.5 to 5.0, indicating a negative skew (more positive ratings). Very few ratings are below 3.0.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

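Yes. Knowing that most ratings are positive sets a baseline for modeling: a model that always predicts the mean rating is the reference to beat. Because this project treats Rating as a regression target, error metrics such as RMSE and R2 are the relevant ones. If the ratings were instead binned into classes, the skew would call for imbalance handling and metrics like F1-score rather than accuracy.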
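
The skew seen in the chart can also be quantified. The snippet below is a small sketch on invented ratings, using pandas' `Series.skew()` (negative values indicate a longer tail toward low ratings) and the share of each rating value.

```python
import pandas as pd

# Invented ratings with most mass at 4 and 5, as in the chart.
ratings = pd.Series([5, 5, 5, 4, 4, 4, 4, 3, 2, 1], dtype=float)

skewness = ratings.skew()  # sample skewness; < 0 means left (negative) skew
share = ratings.value_counts(normalize=True).sort_index()

print(f"Skewness: {skewness:.2f}")
print(share)
```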

#### Chart - 2

In [None]:
# Chart - 2: Top 10 Restaurants by Review Count
top_rest = df['Restaurant'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top_rest.values, y=top_rest.index, palette='magma')
plt.title('Top 10 Most Reviewed Restaurants')
plt.xlabel('Number of Reviews')
plt.show()

##### 1. Why did you pick the specific chart?

To identify which restaurants dominate the conversation and have the most data points.

##### 2. What is/are the insight(s) found from the chart?

A few restaurants (like AB's, Paradise) have significantly more reviews than others, indicating a long-tail distribution.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Popular restaurants drive the platform's traffic. Zomato can feature these more prominently, but also needs strategies to boost visibility for less-reviewed high-quality places.

#### Chart - 3

In [None]:
# Chart - 3: Distribution of Cost for Two
plt.figure(figsize=(10,6))
sns.histplot(df['Cost'], bins=30, kde=True, color='green')
plt.title('Distribution of Cost for Two People')
plt.xlabel('Cost (INR)')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the price points of restaurants in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Most restaurants fall in the affordable range (500-1000 INR). There is a right skew with fewer high-end luxury dining places.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It suggests the target demographic is largely middle-class. Marketing campaigns can focus on 'value for money' segments.

#### Chart - 4

In [None]:
# Chart - 4: Cost vs Rating (Boxplot)
plt.figure(figsize=(12,8))
sns.boxplot(x='Rating', y='Cost', data=df, palette='coolwarm')
plt.title('Relationship between Cost and Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To check if paying more guarantees a better rating.

##### 2. What is/are the insight(s) found from the chart?

There is a slight positive trend: higher-rated restaurants (4.5+) tend to have a higher median cost, but the cost ranges overlap considerably across ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It indicates that while ambiance/premium feel (linked to cost) helps, affordable places can also achieve 5-star status if food quality is good.

#### Chart - 5

In [None]:
# Chart - 5: Review Length vs Rating
df['Review_Len'] = df['Review'].astype(str).apply(len)
plt.figure(figsize=(12,6))
sns.barplot(x='Rating', y='Review_Len', data=df, palette='Blues')
plt.title('Average Review Length per Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To see if customer engagement (text length) varies with satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Extreme ratings (1.0 and 5.0) often have longer reviews. Customers write more when they are extremely happy or extremely angry.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Long reviews are rich information sources. Sentiment analysis on these can yield specific actionable feedback for restaurant owners.

#### Chart - 6

In [None]:
# Chart - 6: Top 10 Cuisines
# Note: this counts full comma-separated 'Cuisines' strings (once per review), not individual cuisines
top_cuisines = df['Cuisines'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top_cuisines.values, y=top_cuisines.index, palette='autumn')
plt.title('Top 10 Cuisine Types')
plt.show()

##### 1. Why did you pick the specific chart?

To identify the most served/popular cuisine types in Hyderabad.

##### 2. What is/are the insight(s) found from the chart?

North Indian and Chinese are dominantly the most common cuisine offerings.
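
To count individual cuisines rather than whole combination strings, the column can be split and exploded. The sketch below uses an invented three-restaurant frame and assumes the cuisines are comma-separated, as in the metadata file.

```python
import pandas as pd

# Invented metadata rows; the real 'Cuisines' column is assumed comma-separated.
meta = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Cuisines": ["North Indian, Chinese", "Chinese, Continental", "North Indian"],
})

cuisine_counts = (
    meta["Cuisines"]
    .str.split(",")   # list of cuisines per restaurant
    .explode()        # one row per (restaurant, cuisine)
    .str.strip()      # remove spaces left over from the split
    .value_counts()
)
print(cuisine_counts)
```

Running this on the metadata counts restaurants per cuisine, while running it on the merged review-level frame weights each cuisine by review volume; the two answer different questions.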

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It reflects market demand. New restaurants might find it safer to include these cuisines to attract the initial crowd.

#### Chart - 7

In [None]:
# Chart - 7: Trend of Reviews over Years
df['Year'] = df['Time'].dt.year
year_counts = df['Year'].value_counts().sort_index()
plt.figure(figsize=(10,6))
sns.lineplot(x=year_counts.index, y=year_counts.values, marker='o', color='purple')
plt.title('Number of Reviews over Years')
plt.xticks(year_counts.index)
plt.show()

##### 1. Why did you pick the specific chart?

To analyze the growth of platform usage or restaurant visits over time.

##### 2. What is/are the insight(s) found from the chart?

There is likely an exponential growth in reviews in recent years (2018-2019) showing increased digital adoption.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It confirms the growing importance of online reputation management for businesses.

#### Chart - 8

In [None]:
# Chart - 8: Month of Review Analysis
df['Month'] = df['Time'].dt.month_name()
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
plt.figure(figsize=(14,6))
sns.countplot(x='Month', data=df, order=month_order, palette='winter')
plt.title('Reviews Count by Month')
plt.show()

##### 1. Why did you pick the specific chart?

To check for seasonality in dining out patterns.

##### 2. What is/are the insight(s) found from the chart?

We may observe peaks during holiday seasons or specific months, though food consumption is generally year-round.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Restaurants can plan marketing offers during lean months to boost footfall.

#### Chart - 9

In [None]:
# Chart - 9: Word Cloud of Reviews
from wordcloud import WordCloud
text = " ".join(review for review in df.Review.astype(str))
wordcloud = WordCloud(width=800, height=400, background_color ='white', colormap='plasma').generate(text)
plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Most Frequent Words in Reviews')
plt.show()

##### 1. Why did you pick the specific chart?

To visually grasp the most common terms used by customers.

##### 2. What is/are the insight(s) found from the chart?

Words like 'food', 'good', 'place', 'chicken', 'tasty' dominate, highlighting that food quality is the primary concern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It reinforces that core product quality (Food) beats secondary aspects like ambiance/service in aggregate feedback.

#### Chart - 10

In [None]:
# Chart - 10: Pictures vs Rating
plt.figure(figsize=(10,6))
sns.scatterplot(x='Rating', y='Pictures', data=df, alpha=0.5, color='orange')
plt.title('Number of Pictures vs Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To see if visual engagement correlates with user satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Higher rated reviews often have more pictures. Satisfied customers like to show off the food.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Encouraging users to upload photos (via rewards) can drive higher engagement and potentially better ratings.

#### Chart - 11

In [None]:
# Chart - 11: Top 10 Reviewers
top_reviewers = df['Reviewer'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top_reviewers.values, y=top_reviewers.index, palette='Spectral')
plt.title('Top 10 Most Active Reviewers')
plt.show()

##### 1. Why did you pick the specific chart?

To identify key influencers on the platform.

##### 2. What is/are the insight(s) found from the chart?

A small set of 'Super Foodies' contribute a disproportionate number of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Engaging these top reviewers with exclusive events can generate significant organic reach and credibility.

#### Chart - 12

In [None]:
# Chart - 12: Cost vs Pictures (Scatter)
plt.figure(figsize=(10,6))
sns.scatterplot(x='Cost', y='Pictures', data=df, color='brown', alpha=0.6)
plt.title('Cost vs Number of Pictures Uploaded')
plt.show()

##### 1. Why did you pick the specific chart?

To see if people take more photos at expensive places.

##### 2. What is/are the insight(s) found from the chart?

There is often a correlation; expensive plating and ambiance encourage photography.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Budget restaurants can improve 'Instagrammability' of dishes to compete with premium places on social reach.

#### Chart - 13

In [None]:
# Chart - 13: Hour of Review
df['Hour'] = df['Time'].dt.hour
plt.figure(figsize=(12,6))
sns.histplot(df['Hour'], bins=24, kde=False, color='teal')
plt.title('Review Posting Time Distribution')
plt.xlabel('Hour of Day (0-23)')
plt.show()

##### 1. Why did you pick the specific chart?

To understand when users are most active on the app.

##### 2. What is/are the insight(s) found from the chart?

Peaks likely occur post-lunch (2-3 PM) and post-dinner (9-11 PM).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Support teams and social media managers should be most active during these peak hours to respond to feedback.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14: Correlation Heatmap
plt.figure(figsize=(10,8))
corr_matrix = df[['Cost', 'Rating', 'Pictures', 'Review_Len']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

To identify linear relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

We might see weak positive correlation between 'Cost' and 'Rating', and 'Pictures' and 'Rating'.
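
Pearson correlation, the default in `DataFrame.corr()`, only captures linear association. Since ratings are ordinal and cost or picture counts are skewed, a rank-based (Spearman) correlation may be a useful cross-check. The data below is invented purely to show the difference between the two methods.

```python
import pandas as pd

# Invented monotonic but non-linear relationship.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is 1 for any strictly increasing relationship; Pearson is lower here.
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```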

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15: Pair Plot
sns.pairplot(df[['Cost', 'Rating', 'Pictures', 'Review_Len']], diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

To view the distribution of each numerical variable and every pairwise relationship between them in a single figure.

##### 2. What is/are the insight(s) found from the chart?

It provides a holistic view. We can see the skewness of Cost/Pictures and how they scatter against Rating.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the average ratings between expensive restaurants (Cost > 800) and affordable restaurants (Cost <= 800).
**Alternate Hypothesis (H1):** There is a significant difference in the average ratings between expensive and affordable restaurants.

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis 1: Cost vs Rating
from scipy.stats import ttest_ind

expensive = df[df['Cost'] > 800]['Rating']
affordable = df[df['Cost'] <= 800]['Rating']

t_stat, p_val = ttest_ind(expensive, affordable, equal_var=False)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

if p_val < 0.05:
    print("Result: Reject Null Hypothesis (Significant difference found).")
else:
    print("Result: Fail to reject Null Hypothesis (No significant difference).")

##### Which statistical test have you done to obtain P-Value?

Welch's t-test (independent samples, unequal variances; `equal_var=False`)

##### Why did you choose the specific statistical test?

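We are comparing the mean of a numerical variable (Rating) across two independent groups (Expensive vs Affordable). The two groups differ in size and may differ in variance, so Welch's version of the test is used. Individual ratings are not normally distributed, but with large samples the sample means are approximately normal by the Central Limit Theorem.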
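
With thousands of reviews, even a tiny difference in means can be statistically significant, so reporting an effect size alongside the p-value helps judge practical relevance. The sketch below uses simulated ratings (the group means and sizes are assumptions) and a hypothetical `cohens_d` helper.

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Simulated ratings for the two cost groups (invented parameters).
rng = np.random.default_rng(0)
expensive = rng.normal(4.0, 1.0, 500).clip(1, 5)
affordable = rng.normal(3.8, 1.2, 800).clip(1, 5)

t_stat, p_val = ttest_ind(expensive, affordable, equal_var=False)  # Welch's t-test
d = cohens_d(expensive, affordable)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}, Cohen's d = {d:.2f}")
```

By common convention, an absolute `d` around 0.2 is read as a small effect and around 0.8 as a large one.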

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the length of reviews between reviews rated 5.0 and reviews rated 1.0.
**Alternate Hypothesis (H1):** There is a significant difference in the length of reviews between 5-star and 1-star ratings.

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis 2: Extreme Ratings vs Review Length
rating_5 = df[df['Rating'] == 5.0]['Review_Len']
rating_1 = df[df['Rating'] == 1.0]['Review_Len']

t_stat, p_val = ttest_ind(rating_5, rating_1, equal_var=False)
print(f"Mean Length (5-star): {rating_5.mean()}")
print(f"Mean Length (1-star): {rating_1.mean()}")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

if p_val < 0.05:
    print("Result: Reject Null Hypothesis (Significant difference found).")
else:
    print("Result: Fail to reject Null Hypothesis.")

##### Which statistical test have you done to obtain P-Value?

Welch's t-test (independent samples, unequal variances)

##### Why did you choose the specific statistical test?

We are comparing the average review length (numerical) between two distinct groups of ratings (5.0 vs 1.0).

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no difference in average rating between reviews with pictures and reviews without pictures.
**Alternate Hypothesis (H1):** Reviews with pictures have a different average rating compared to those without.

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis 3: Pictures vs Rating
with_pics = df[df['Pictures'] > 0]['Rating']
no_pics = df[df['Pictures'] == 0]['Rating']

t_stat, p_val = ttest_ind(with_pics, no_pics, equal_var=False)
print(f"Mean Rating (With Pictures): {with_pics.mean()}")
print(f"Mean Rating (No Pictures): {no_pics.mean()}")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

if p_val < 0.05:
    print("Result: Reject Null Hypothesis (Significant difference found).")
else:
    print("Result: Fail to reject Null Hypothesis.")

##### Which statistical test have you done to obtain P-Value?

Welch's t-test (independent samples, unequal variances)

##### Why did you choose the specific statistical test?

We are checking if the presence of pictures (Group A) is associated with a different mean rating compared to the absence of pictures (Group B).

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# 1. Handling Missing Values
# Inspection
print("Missing before cleaning:")
print(df.isnull().sum())

# Strategy 1: Drop missing targets (Rating/Review are essential)
df.dropna(subset=['Rating', 'Review', 'Time'], inplace=True)

# Strategy 2: Impute 'Collections' if missing (though metadata usually has it)
# Assign the result back: inplace fillna on a selected column is deprecated in recent pandas
if 'Collections' in df.columns:
    df['Collections'] = df['Collections'].fillna('Not Listed')

# Strategy 3: Check Cost
# Cost (cleaned previously) should be fine, but good to check
df['Cost'] = df['Cost'].fillna(df['Cost'].median())

print("Missing after cleaning:")
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

**Used Techniques:**
1.  **Dropping Rows**: applied to `Rating`, `Review`, and `Time`. Rationale: These are the target or core feature columns. Imputing a target variable (Rating) introduces significant bias, and missing text (Review) cannot be imputed.
2.  **Median Imputation**: applied to `Cost` (if any missing). Rationale: Cost data often has outliers (skewed), making Median a more robust measure of central tendency than Mean.
3.  **Constant Imputation**: applied to Categorical columns like `Collections` (filling with "Not Listed") to preserve the data row.
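The three strategies can be sketched on a toy frame as follows. The values are made up and only the column names mirror the notebook; each `fillna` result is assigned back rather than applied in place.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Review": ["good food", "ok", None, "great"],
    "Cost": [500.0, 800.0, 1200.0, np.nan],
    "Collections": ["Trending", None, None, None],
})

# 1. Drop rows where the target or the review text is missing
toy = toy.dropna(subset=["Rating", "Review"]).copy()

# 2. Median imputation for the skewed numeric column
toy["Cost"] = toy["Cost"].fillna(toy["Cost"].median())

# 3. Constant imputation for the categorical column
toy["Collections"] = toy["Collections"].fillna("Not Listed")

print(toy)
```

Note that the median is computed after the rows are dropped, so it reflects only the reviews that remain in the frame.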

### 2. Handling Outliers

In [None]:
# 2. Handling Outliers
# Visualize Cost before
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Cost'])
plt.title("Cost Boxplot (Before Capping)")
plt.show()

# Capping Outliers at 99th Percentile
upper_limit = df['Cost'].quantile(0.99)
print(f"Capping Cost at 99th percentile: {upper_limit}")

df.loc[df['Cost'] > upper_limit, 'Cost'] = upper_limit

# Visualize Cost after
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Cost'])
plt.title("Cost Boxplot (After Capping)")
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Technique Used:** **Winsorization (Capping)** at the 99th Percentile.
**Why:** The `Cost` variable showed some extreme high values (outliers) that could skew linear models. Deleting them might lose valuable info about high-end sentiments. Capping them brings them to the upper boundary of "normal" data, reducing their leverage while keeping the sample.
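As a compact alternative to the `.loc` assignment, upper-tail capping can also be written with `Series.clip`. This is a small sketch on made-up costs, not the notebook's data.

```python
import pandas as pd

# Made-up costs with one extreme value
cost_demo = pd.Series([300, 400, 500, 600, 700, 800, 900, 1000, 1200, 10000], dtype=float)

upper = cost_demo.quantile(0.99)      # cap threshold taken from the data itself
capped = cost_demo.clip(upper=upper)  # values above the threshold are set to it

print(f"cap = {upper:.1f}, max before = {cost_demo.max():.0f}, max after = {capped.max():.1f}")
```

With only ten values the 99th percentile sits close to the maximum, so the cap changes little here; on a frame with thousands of rows the threshold falls well inside the tail.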

### 3. Categorical Encoding

In [None]:
# 3. Categorical Encoding
# 1. Label Encode 'Restaurant' (High cardinality)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Restaurant_Code'] = le.fit_transform(df['Restaurant'])

# 2. One-Hot Encode 'Cuisines' (Multi-valued categorical)
# Simplification: We will take the Primary Cuisine (first one listed) and encode top 10
df['Primary_Cuisine'] = df['Cuisines'].astype(str).apply(lambda x: x.split(',')[0].strip())
top_10_cuisines = df['Primary_Cuisine'].value_counts().nlargest(10).index

for cuisine in top_10_cuisines:
    df[f'Cuisine_{cuisine}'] = np.where(df['Primary_Cuisine'] == cuisine, 1, 0)

# 3. Encode 'Month' (if created earlier)
if 'Month' in df.columns:
    df = pd.get_dummies(df, columns=['Month'], drop_first=True)

print("Encoding Completed. New Columns Example:")
print(df.columns[-15:])

#### What all categorical encoding techniques have you used & why did you use those techniques?

**Techniques Used:**
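1.  **Label Encoding**: For `Restaurant` name. Rationale: High cardinality (many unique restaurants). One-Hot encoding would increase dimensionality drastically. Caveat: the integer codes are arbitrary, so linear models may read them as an ordering; tree-based models are less affected.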
2.  **One-Hot Encoding**: For `Primary_Cuisine` (Top 10) and `Month`. Rationale: Nominal variables without inherent order. One-Hot allows the model to learn distinct weights for each major cuisine/month. We limited to Top 10 cuisines to prevent the "Curse of Dimensionality".
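Taking only the primary cuisine discards the other cuisines a restaurant lists. If that information matters, one possible alternative is a multi-hot encoding with `Series.str.get_dummies`. The sketch below uses made-up rows and assumes the `Cuisines` strings are separated by a comma followed by a space.

```python
import pandas as pd

# Made-up rows; the real 'Cuisines' column is assumed to be comma-separated
cuisines_demo = pd.Series([
    "North Indian, Chinese",
    "Chinese",
    "Biryani, North Indian",
])

# One indicator column per cuisine; a row can carry several 1s
multi_hot = cuisines_demo.str.get_dummies(sep=", ")
print(multi_hot)
```

The same Top 10 restriction could be applied afterwards by keeping only the ten columns with the largest column sums.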

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
contractions_dict = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because",
    "could've": "could have", "couldn't": "could not", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not",
    "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
    "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have",
    "I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have",
    "i'll": "i will", "i'll've": "i will have", "i'm": "i am", "i've": "i have",
    "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
    "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam",
    "mayn't": "may not", "might've": "might have", "mightn't": "might not",
    "mightn't've": "might not have", "must've": "must have", "mustn't": "must not",
    "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have",
    "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
    "she'd": "she would", "she'd've": "she would have", "she'll": "she will",
    "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
    "so's": "so is", "this's": "this is", "that'd": "that would", "that'd've": "that would have",
    "that's": "that is", "there'd": "there would", "there'd've": "there would have",
    "there's": "there is", "here's": "here is", "they'd": "they would",
    "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have",
    "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not",
    "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
    "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will",
    "what'll've": "what will have", "what're": "what are", "what's": "what is", "what've": "what have",
    "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
    "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is",
    "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have",
    "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
    "y'all'd": "you all would", "y'all'd've": "you all would have", "y'all're": "you all are",
    "y'all've": "you all have", "you'd": "you would", "you'd've": "you would have",
    "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"
}
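import re

# Plain str.replace is case-sensitive and order-dependent: "Don't" is skipped and
# "I'd've" is cut short by "I'd". Match case-insensitively and try longer keys first.
_contractions_lower = {key.lower(): value for key, value in contractions_dict.items()}
_contraction_re = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in sorted(_contractions_lower, key=len, reverse=True)) + r")(?!\w)",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    return _contraction_re.sub(lambda m: _contractions_lower[m.group(0).lower()], text)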

df['Review_Clean'] = df['Review'].astype(str).apply(expand_contractions)
print("Contractions expanded.")


#### 2. Lower Casing

In [None]:
# Lower Casing
df['Review_Clean'] = df['Review_Clean'].str.lower()
print("Lower casing applied.")

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['Review_Clean'] = df['Review_Clean'].apply(remove_punctuation)
print("Punctuations removed.")

#### 4. Removing URLs & Removing Words That Contain Digits

In [None]:
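# Remove URLs and any token that contains a digit
import re
def remove_urls_digits(text):
    # Punctuation was stripped in the previous step, so a URL now looks like
    # "httpswwwexamplecom"; the prefix match still catches it (http\S+ also covers https)
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text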

df['Review_Clean'] = df['Review_Clean'].apply(remove_urls_digits)
print("URLs and Digits removed.")

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    return " ".join([word for word in words if word not in stop_words])

df['Review_Clean'] = df['Review_Clean'].apply(remove_stopwords)
print("Stopwords removed.")

In [None]:
# Remove White spaces
df['Review_Clean'] = df['Review_Clean'].str.strip().str.replace(r'\s+', ' ', regex=True)
print("Whitespaces removed.")
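One caveat for sentiment work: NLTK's English stopword list includes negations such as `not` and `no`, so "not good" collapses to "good" after the step above. A hedged sketch of one way around this is to subtract a small set of negation words from the stopword set. The short `demo_stop_words` list and the `negations` set below are hypothetical stand-ins so the example runs without NLTK data.

```python
# Small hand-written stopword list for illustration; NLTK's English list is much longer
demo_stop_words = {"the", "was", "is", "a", "and", "not", "no", "it", "but"}

# Hypothetical set of negation words to keep because they flip sentiment
negations = {"not", "no", "nor", "never"}
sentiment_stop_words = demo_stop_words - negations

def remove_stopwords_keep_negation(text):
    return " ".join(w for w in text.split() if w not in sentiment_stop_words)

plain = " ".join(w for w in "the food was not good".split() if w not in demo_stop_words)
kept = remove_stopwords_keep_negation("the food was not good")
print(plain, "|", kept)
```

In the notebook the same idea would amount to `stop_words - negations` before calling `remove_stopwords`; whether it improves the model is something to verify empirically.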

#### 6. Rephrase Text

In [None]:
# Rephrase Text
# This step typically involves advanced paraphrasing models. 
# For this basic pipeline, we'll skip complex rephrasing but this cell is a placeholder for it.
print("Rephrasing step skipped (Advanced NLP task). Kept original clean text.")

#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize
df['Review_Tokens'] = df['Review_Clean'].apply(word_tokenize)
print("Tokenization completed. Check 'Review_Tokens' column.")

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

df['Review_Lemmatized'] = df['Review_Tokens'].apply(lemmatize_text)
print("Lemmatization completed.")

##### Which text normalization technique have you used and why?

**Technique Used:** **Lemmatization**
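**Why:** Unlike stemming, which chops off suffixes by rule (often producing non-words), lemmatization uses a dictionary (WordNet) to return the base form (lemma) of a word, which preserves meaning for sentiment analysis. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is given, so the call above mainly normalizes plurals (e.g., 'dishes' to 'dish'); mapping 'better' to 'good' requires passing `pos='a'`.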

#### 9. Part of speech tagging

In [None]:
# POS Tagging
from nltk import pos_tag

def pos_tagging(tokens):
    return pos_tag(tokens)

# Applying to a sample to avoid massive computation time on display
sample_tags = df['Review_Tokens'].head(5).apply(pos_tagging)
print("POS Tagging Sample:")
print(sample_tags)

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

# Join tokens back to string for Vectorizer
df['Review_Final'] = df['Review_Lemmatized'].apply(lambda x: " ".join(x))

tfidf = TfidfVectorizer(max_features=TFIDF_MAX_FEATURES)
X_tfidf = tfidf.fit_transform(df['Review_Final'])
print(f"TF-IDF Matrix Shape: {X_tfidf.shape}")

##### Which text vectorization technique have you used and why?

**Technique Used:** **TF-IDF (Term Frequency-Inverse Document Frequency)**
**Why:** 
1.  **Importance Weighting**: Unlike Bag of Words (CountVectorizer), TF-IDF down-weights common words that appear everywhere (less informative) and highlights unique words specific to a review.
2.  **Context**: It captures the relevancy of a word to a specific document rather than just its global frequency.
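The weighting idea can be checked by hand. The sketch below computes term counts multiplied by a smoothed IDF, the formula scikit-learn documents for `smooth_idf=True`, on three made-up reviews; it leaves out the L2 row normalization that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

docs = [
    "food good service good",
    "food bad",
    "food cold biryani",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of reviews that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[2])
print({term: round(weight, 3) for term, weight in weights.items()})
```

"food" appears in every review and gets the minimum IDF, while "biryani" appears once and is weighted higher, which is the down-weighting effect described above.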

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# 1. Create 'Engagement_Score': Weighted sum of Review_Len and Pictures
# Scale both to 0-1 first so the 0.7/0.3 weights act on comparable ranges
df['Len_Norm'] = df['Review_Len'] / df['Review_Len'].max()
df['Pic_Norm'] = df['Pictures'] / df['Pictures'].max()
df['Engagement_Score'] = 0.7 * df['Len_Norm'] + 0.3 * df['Pic_Norm']

# 2. Log Transform Cost (skewed positive)
import numpy as np
df['Log_Cost'] = np.log1p(df['Cost'])

# 3. Drop redundant/unused columns
# 'Time' is processed into 'Year', 'Month', 'Hour' (assumed from viz section)
# 'Review' and 'Review_Clean' are text (used for TF-IDF), we might drop raw text if not needed in X
drop_cols = ['Time', 'Len_Norm', 'Pic_Norm', 'Review', 'Review_Clean', 'Review_Tokens', 'Review_Lemmatized']
# Ensure we only drop what exists
existing_drops = [c for c in drop_cols if c in df.columns]
df.drop(columns=existing_drops, inplace=True)

print("Feature Manipulation Done. New Features: Engagement_Score, Log_Cost")

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.feature_selection import SelectKBest, f_regression

# Define X (features) and y (target)
# Drop Target and Text columns
X_meta = df.drop(columns=['Rating', 'Review_Final', 'Name', 'Restaurant', 'Primary_Cuisine'], errors='ignore')
# Filter only numeric columns for f_regression
X_numeric = X_meta.select_dtypes(include=[np.number])
y = df['Rating']

# Select Top 10 Features
selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X_numeric, y)

selected_indices = selector.get_support(indices=True)
selected_features = X_numeric.columns[selected_indices]

print("Top 10 Selected Features:")
print(selected_features.tolist())

##### What all feature selection methods have you used  and why?

**Methods Used:** **SelectKBest** with **f_regression**.
**Why:**
1.  **Univariate Testing**: It statistically evaluates the relationship between each feature (e.g., Cost, Votes, Engagement) and the target (Rating) independently.
2.  **Filter Method**: It's fast and model-agnostic. We used `f_regression` because the target `Rating` is continuous. This helps remove noisy or irrelevant features before feeding data into complex models, reducing overfitting risks.
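A quick sanity check of the filter on synthetic data, where only one column drives the target, might look like the following. The data and variable names are made up for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))                              # five synthetic features
y_demo = 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)   # only column 2 drives the target

selector_demo = SelectKBest(score_func=f_regression, k=1).fit(X_demo, y_demo)
print("F-scores:", np.round(selector_demo.scores_, 1))
print("Selected column indices:", selector_demo.get_support(indices=True))
```

Because `f_regression` scores each feature on its own, it cannot see interactions or redundancy between features; two highly correlated columns can both be selected.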

##### Which all features you found important and why?

**Important Features:**
Based on the `SelectKBest` (f_regression) analysis, the top features typically include:
1.  **Votes**: Highly correlated with `Rating`. Popular restaurants tend to have higher ratings (social proof).
2.  **Log_Cost / Cost**: Price indicates "premium" status, often linked to better perception/rating.
3.  **Engagement_Score**: Users write longer, picture-rich reviews (high engagement) for experiences they feel strongly about (often positive).
4.  **Cuisine_North Indian / Chinese** (if encoded): Popular cuisines often drive ratings in Indian contexts.

**Why:** `f_regression` scores each feature by the F-statistic of its univariate linear relationship with `Rating`, so the highest-scoring features carry the most linear signal individually. The actual selection for this dataset is the list printed by the cell above.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

**Yes, Transformation Needed.**
**Why:** The `Cost` variable was highly right-skewed (long tail). Linear models do not require normally distributed inputs, but a long tail lets a few very expensive restaurants exert disproportionate leverage on the fitted coefficients.
**Transformation Used:** **Log Transformation (`np.log1p`)**. This compresses the large values and makes the distribution more symmetric, which often helps linear models.

In [None]:
# Transform Your data
# We already performed the Log Transformation in the Manipulation step:
# df['Log_Cost'] = np.log1p(df['Cost'])
# Let's visualize the effect here to confirm.

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df['Cost'], kde=True)
plt.title("Original Cost Distribution")

plt.subplot(1, 2, 2)
sns.histplot(df['Log_Cost'], kde=True)
plt.title("Log-Transformed Cost Distribution")
plt.show()
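Besides the histograms, the effect can be quantified with the skewness statistic. The sketch below uses synthetic log-normal values as a stand-in for `Cost`; the distribution and its parameters are assumptions for the demo.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Synthetic right-skewed "cost" values; the log-normal shape is an assumption
cost_sim = rng.lognormal(mean=6.5, sigma=0.8, size=2000)

skew_before = skew(cost_sim)
skew_after = skew(np.log1p(cost_sim))
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Running `skew(df['Cost'])` and `skew(df['Log_Cost'])` on the real frame would give the corresponding numbers for this dataset.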

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# We have selected features from previous step
# We should scale these for PCA/Modeling
scaler = StandardScaler()
X_selected = X_numeric[selected_features]
X_scaled = scaler.fit_transform(X_selected)

# Convert back to DF for readability
X_scaled_df = pd.DataFrame(X_scaled, columns=selected_features)
print("Data Scaled using StandardScaler.")
print(X_scaled_df.head(3))

##### Which method have you used to scale you data and why?

**Method Used:** **StandardScaler** (Z-score Normalization).
**Why:**
1.  **Algorithm Requirement**: PCA finds directions of maximum variance, and regularized linear models (e.g., Ridge) penalize coefficient size, so both are sensitive to feature scale. If one feature ranges over 0-1 (Engagement) and another over 100-1000 (Votes), the larger one dominates. Scaling brings every feature to mean 0 and standard deviation 1.
2.  **Outlier Handling**: MinMax scaling squeezes the bulk of the data into a narrow band when outliers exist. StandardScaler has no fixed output range, so it is less distorted, although its mean and standard deviation are still affected by outliers (which we already capped).

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

**Yes, potentially.**
**Why:** Even after `SelectKBest`, some features might still be correlated (multicollinearity). Dimensionality reduction (like PCA) creates a new set of orthogonal (uncorrelated) features. It also helps in visualizing high-dimensional data in 2D or 3D space to reveal cluster patterns.

In [None]:
# Dimensionality Reduction
from sklearn.decomposition import PCA

# Let's reduce to 2 components for visualization purposes
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained Variance Ratio: {pca.explained_variance_ratio_}")

# Visualization
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, c=df['Rating'], cmap='viridis')
plt.colorbar(label='Rating')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: Data projected to 2D')
plt.show()

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

**Technique Used:** **PCA (Principal Component Analysis)**.
**Why:**
1.  **Variance Maximization**: It identifies the axes (Principal Components) along which the data varies the most, capturing the "essence" of the correlation structure.
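2.  **Orthogonality**: The resulting components are uncorrelated, which removes multicollinearity among the inputs of a linear model, at the cost of interpretability. In this notebook the two components are used for visualization only; the models below are trained on the scaled features, not on the PCA output.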
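If PCA were to be used for modeling rather than plotting, the number of components could be chosen by a variance threshold instead of being fixed at 2. The following is a sketch on synthetic correlated features; the 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 3))
# Six synthetic features: three independent signals plus three noisy copies of them
X_corr = np.hstack([base, base + rng.normal(scale=0.1, size=(300, 3))])

X_std = StandardScaler().fit_transform(X_corr)
pca_demo = PCA(n_components=0.95, svd_solver="full")  # keep enough components for 95% variance
scores = pca_demo.fit_transform(X_std)

print("components kept:", pca_demo.n_components_)
print("cumulative variance:", np.round(np.cumsum(pca_demo.explained_variance_ratio_), 3))
# Off-diagonal entries are numerically zero: the component scores are uncorrelated
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```

The correlation matrix of the scores is the identity up to floating-point error, which is the orthogonality property in practice.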

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Refactored into a reusable function to keep state consistent and prevent leakage
import scipy.sparse as sp
from sklearn.model_selection import train_test_split

def preprocess_and_split(df, test_size=0.20, random_state=42):
    """
    Performs Train-Test split and applies Preprocessing (Scaling, TF-IDF) 
    strictly separating Train and Test sets to prevent Data Leakage.
    """
    # 1. Define Raw Features (Numeric + Text)
    y = df['Rating']
    
    # Identify Numeric features (Metadata)
    # Excluding 'Rating' (Target), 'Name', 'Restaurant', etc.
    X_meta_raw = df.drop(columns=['Rating', 'Review_Final', 'Name', 'Restaurant', 'Primary_Cuisine'], errors='ignore')
    X_numeric_raw = X_meta_raw.select_dtypes(include=[np.number])
    X_text_raw = df['Review_Final']
    
    # 2. Split Data FIRST
    X_num_train, X_num_test, X_text_train, X_text_test, y_train, y_test = train_test_split(
        X_numeric_raw, X_text_raw, y, test_size=test_size, random_state=random_state
    )
    
    # 3. Process Metadata (Fit on Train, Transform Test)
    # Select Top 10 Features
    selector = SelectKBest(score_func=f_regression, k=10)
    X_num_train_sel = selector.fit_transform(X_num_train, y_train)
    X_num_test_sel = selector.transform(X_num_test)
    
    # Scale Features
    scaler = StandardScaler()
    X_num_train_scaled = scaler.fit_transform(X_num_train_sel)
    X_num_test_scaled = scaler.transform(X_num_test_sel)
    
    # 4. Process Text (TF-IDF) (Fit on Train, Transform Test)
    # Use the global constant if defined, else default
    max_feats = TFIDF_MAX_FEATURES if 'TFIDF_MAX_FEATURES' in globals() else 5000
    tfidf = TfidfVectorizer(max_features=max_feats)
    X_text_train_tfidf = tfidf.fit_transform(X_text_train)
    X_text_test_tfidf = tfidf.transform(X_text_test)
    
    # 5. Combine Features (Stacking Metadata + Text)
    # CSR format supports the row indexing that cross-validation relies on
    X_train = sp.hstack((X_num_train_scaled, X_text_train_tfidf), format='csr')
    X_test = sp.hstack((X_num_test_scaled, X_text_test_tfidf), format='csr')
    
    return X_train, X_test, y_train, y_test, tfidf, selector

# Execute the function
X_train, X_test, y_train, y_test, tfidf_model, feature_selector = preprocess_and_split(df)

print("Data Split & Processed Successfully using `preprocess_and_split` function.")
print(f"Final Train Shape: {X_train.shape}")
print(f"Final Test Shape: {X_test.shape}")
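The same fit-on-train, transform-on-test discipline can also be expressed with scikit-learn's `Pipeline` and `ColumnTransformer`, which apply it automatically inside `fit` and `predict`. The sketch below is an alternative to the manual function, not the notebook's method; the eight rows are made up and the model is a plain `Ridge` chosen only to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows; column names mirror the notebook's engineered frame
demo = pd.DataFrame({
    "Review_Final": ["great food", "bad service", "tasty biryani", "cold food",
                     "lovely ambience", "rude staff", "great taste", "slow service"],
    "Cost": [500, 800, 1200, 300, 1500, 400, 900, 600],
    "Review_Len": [10, 11, 13, 9, 15, 10, 11, 12],
    "Rating": [5.0, 2.0, 4.0, 2.0, 5.0, 1.0, 4.0, 2.0],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Cost", "Review_Len"]),  # list selects a 2-D numeric block
    ("txt", TfidfVectorizer(), "Review_Final"),         # string selects a 1-D text column
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop(columns="Rating"), demo["Rating"], test_size=0.25, random_state=42
)
model.fit(X_tr, y_tr)        # scaler and TF-IDF are fitted on the training rows only
preds = model.predict(X_te)  # test rows are only transformed, never fitted
print(preds.round(2))
```

A pipeline like this can be passed directly to `GridSearchCV`, in which case the preprocessing is refitted inside every cross-validation fold as well.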


##### What data splitting ratio have you used and why?

**Data Splitting Ratio Used:** **80:20** (80% Train, 20% Test).
**Why:**
1.  **Enough Training Data**: 80% of the data gives the model sufficient examples to learn the underlying patterns and relationships (limiting underfitting).
2.  **Generalization Check**: The remaining 20% is a large enough holdout set to estimate the model's performance on unseen data and to reveal overfitting.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**Observation:** Since this is a **Regression** problem (predicting continuous Rating 1.0-5.0), the concept of 'Class Imbalance' (typical in classification with minority classes) translates to 'Target Skewness'.
**Analysis:** We analyze the distribution of the target variable `Rating`. If it's highly skewed (e.g., most ratings are 5.0), the model might ignore low ratings. However, regression models minimize error (RMSE) across the range, so we rarely need 'Sampling' techniques unless the skew is extreme.

In [None]:
# Analyze Target Distribution
print("Rating Distribution:")
print(df['Rating'].value_counts().sort_index())

# Visualize the distribution
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
sns.histplot(df['Rating'], bins=20, kde=True, color='steelblue')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
df['Rating'].value_counts().sort_index().plot(kind='bar', color='coral')
plt.title('Rating Counts')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

# Calculate skewness
from scipy.stats import skew
rating_skewness = skew(df['Rating'])
print(f"\nRating Skewness: {rating_skewness:.4f}")
print("Note: Since this is a regression problem, we don't use traditional imbalance handling.")
print("Instead, we rely on robust models and proper evaluation metrics (RMSE, MAE, R²).")

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

**Technique Used:** **None (Standard Regression Approach)**.
**Why:**
1.  **Regression vs Classification**: Sampling techniques like **SMOTE** or **Random Undersampling** are designed for Classification tasks to balance class counts. They are **not applicable** here as our target is continuous.
2.  **Distribution**: As seen in the plot, ratings may cluster around a few values, but this is not a severe anomaly that requires synthetic data generation. Note that the log transform was applied to the `Cost` input and does not alter the target distribution.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:

# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# 1. Fit
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# 2. Predict
y_pred_lr = lr_model.predict(X_test)

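# Metrics
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)  # same units as Rating
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
print(f"Linear Regression MSE: {mse_lr:.4f}")
print(f"Linear Regression RMSE: {rmse_lr:.4f}")
print(f"Linear Regression MAE: {mae_lr:.4f}")
print(f"Linear Regression R2 Score: {r2_lr:.4f}")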

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**Model Used:** **Linear Regression**
**Performance:** This is the baseline model. It assumes a linear relationship between the features (like Cost, Votes) and Rating. R-squared indicates how much variance in Rating these features explain; a low R2 suggests non-linear effects or noise that the features do not capture.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(6,6))
plt.scatter(y_test, y_pred_lr, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel("Actual Rating")
plt.ylabel("Predicted Rating")
plt.title("Linear Regression: Actual vs Predicted")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Using Ridge Regression (L2 Regularization) to separate from vanilla Linear
param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}
ridge = Ridge()
grid_lr = GridSearchCV(ridge, param_grid, cv=5, scoring='r2')
grid_lr.fit(X_train, y_train)

print(f"Best Alpha: {grid_lr.best_params_}")
y_pred_tune1 = grid_lr.predict(X_test)
print(f"Tuned Ridge R2: {r2_score(y_test, y_pred_tune1):.4f}")

##### Which hyperparameter optimization technique have you used and why?

**Technique:** **GridSearchCV** with **Ridge Regression**.
**Why:** Linear Regression has no hyperparameters. Ridge adds a penalty (alpha) to coefficients to prevent overfitting. GridSearch exhaustively tests alpha values to find the optimal balance between bias and variance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Improvement:** Likely minimal for this dataset if features are already well-selected, but Ridge provides distinct advantages in stability if multicollinearity exists. The R2 score might remain similar or slightly improve.
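Rather than estimating the improvement, it can be measured by scoring the untuned and tuned models on the same test split. The sketch below does this on synthetic data from `make_regression`, used here as a stand-in for the real feature matrix; the numbers it prints say nothing about the Zomato results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real feature matrix (assumption for the demo)
X_syn, y_syn = make_regression(n_samples=300, n_features=50, noise=20.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

baseline = LinearRegression().fit(X_a, y_a)
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}, cv=5, scoring="r2").fit(X_a, y_a)

r2_base = r2_score(y_b, baseline.predict(X_b))
r2_tuned = r2_score(y_b, search.predict(X_b))
print(f"baseline R2={r2_base:.4f}  tuned R2={r2_tuned:.4f}  delta={r2_tuned - r2_base:+.4f}")
print(f"best alpha={search.best_params_['alpha']}, mean CV R2={search.best_score_:.4f}")
```

In the notebook the equivalent comparison is `r2_lr` against `r2_score(y_test, y_pred_tune1)`, both of which are already computed above.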

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**Model Used:** **XGBoost Regressor**
**Performance:** XGBoost (Extreme Gradient Boosting) is highly efficient and flexible. It often provides superior predictive performance by aggregating many weak learners (trees) into a strong one, handling missing values and regularization internally.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:

# ML Model - 2 Implementation with hyperparameter optimization techniques
# Using XGBoost instead of Decision Tree as per project goals
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

xgb_params = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.8, 1.0]
}

# Randomized Search for XGBoost
xgb = XGBRegressor(objective='reg:squarederror', random_state=42)
rand_xgb = RandomizedSearchCV(xgb, xgb_params, n_iter=10, cv=3, scoring='r2', random_state=42)
rand_xgb.fit(X_train, y_train)

print(f"Best XGB Params: {rand_xgb.best_params_}")
y_pred_tune2 = rand_xgb.predict(X_test)
print(f"Tuned XGBoost R2: {r2_score(y_test, y_pred_tune2):.4f}")


In [None]:
# Visualizing evaluation Metric Score chart
# Residual Plot
residuals = y_test - y_pred_tune2
plt.figure(figsize=(8,4))
sns.histplot(residuals, kde=True)
plt.title("XGBoost Residuals Distribution")
plt.xlabel("Error (Actual - Predicted)")
plt.show()

##### Which hyperparameter optimization technique have you used and why?

**Technique:** **RandomizedSearchCV**.
**Why:** XGBoost has many interacting hyperparameters (number of trees, learning rate, depth, subsampling). The grid above contains 3^4 = 81 combinations; RandomizedSearch samples a fixed number of them (`n_iter=10`), often finding a near-optimal solution much faster than exhaustive GridSearch.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

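**Improvement:** Limiting `max_depth`, lowering `learning_rate`, and subsampling rows (`subsample`) regularize the boosted trees and reduce overfitting. Because no untuned XGBoost model is fitted in this notebook, the gain has to be judged by comparing the tuned test R2 printed above with the scores of the other models.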

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**Evaluation Metrics & Business Impact:**
1.  **R2 Score (Coefficient of Determination)**: Indicates the proportion of variance in `Rating` that is predictable from our features. An R2 of 0.8, for example, means the model explains 80% of the variance in ratings. For business, this builds **confidence** in using the model for strategic decisions.
2.  **RMSE (Root Mean Square Error)**: Measures the average magnitude of the error. If RMSE is 0.5, our predictions are typically off by half a star. Lower RMSE means **higher precision** in forecasting which restaurants will succeed or fail.
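For reference, both metrics (plus MAE) can be computed directly from their definitions. The four actual and predicted ratings below are made up so the arithmetic is easy to follow.

```python
import math

# Made-up actual and predicted ratings, for illustration only
y_true_demo = [4.0, 3.0, 5.0, 2.0]
y_pred_demo = [3.5, 3.0, 4.5, 2.5]

n = len(y_true_demo)
errors = [t - p for t, p in zip(y_true_demo, y_pred_demo)]

mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)                       # back in rating units (stars)
mae = sum(abs(e) for e in errors) / n

mean_true = sum(y_true_demo) / n
ss_res = sum(e ** 2 for e in errors)        # variation the model fails to explain
ss_tot = sum((t - mean_true) ** 2 for t in y_true_demo)  # total variation in the target
r2 = 1 - ss_res / ss_tot

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

RMSE squares the errors before averaging, so it is never smaller than MAE and penalizes large misses more heavily.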

### ML Model - 3

In [None]:

# ML Model - 3 Implementation
from sklearn.ensemble import RandomForestRegressor

# 1. Fit
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# 2. Predict
y_pred_rf = rf_model.predict(X_test)

# Metrics
r2_rf = r2_score(y_test, y_pred_rf)
print(f"Random Forest R2 Score: {r2_rf:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**Model Used:** **Random Forest Regressor**
**Performance:** An ensemble of decision trees. It reduces variance by averaging predictions, making it generally more accurate and robust than single trees or linear models for complex tabular data.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(6,6))
plt.scatter(y_test, y_pred_rf, alpha=0.5, color='green')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title("Random Forest: Actual vs Predicted")
plt.xlabel("Actual Rating")
plt.ylabel("Predicted Rating")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:

# ML Model - 3 Implementation with hyperparameter optimization techniques
# Random Forest is computationally expensive, so we keep search space focused
rf_params = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

grid_rf = GridSearchCV(RandomForestRegressor(random_state=42), rf_params, cv=3, scoring='r2') 
# reduced cv=3 for speed in demo
grid_rf.fit(X_train, y_train)

print(f"Best RF Params: {grid_rf.best_params_}")
y_pred_tune3 = grid_rf.predict(X_test)
print(f"Tuned RF R2: {r2_score(y_test, y_pred_tune3):.4f}")

##### Which hyperparameter optimization technique have you used and why?

**Technique:** **GridSearchCV**.
**Why:** We targeted specific high-impact parameters (`n_estimators`, `max_depth`). GridSearch ensures we don't miss the best combination within this focused search space.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Improvement:** Random Forests are quite robust out-of-the-box. Tuning usually squeezes out the last few percentage points of performance by preventing individual trees from growing too deep or adding more estimators for stability.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**Metric:** **RMSE (Root Mean Squared Error)** and **R2 Score**.
**Business Impact:**
1.  **RMSE**: Expressed in the same units as the rating (1-5 stars). For example, an RMSE of 0.3 would mean predictions are typically off by about 0.3 stars, with large misses penalized more heavily than small ones. This precision is crucial for recommending restaurants accurately.
2.  **R2**: Measures goodness of fit as the share of variance in ratings that the model explains. A higher R2 means the features capture more of what drives user ratings, allowing for better-personalized marketing.
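
Both metrics can be worked out by hand from their definitions, RMSE as the square root of the mean squared error and R2 as one minus the ratio of residual to total sum of squares. The four ratings below are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up ratings for illustration (assumption: not project data)
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_pred = np.array([3.5, 3.0, 4.5, 2.5])

residuals = y_true - y_pred                      # [0.5, 0.0, 0.5, -0.5]
ss_res = np.sum(residuals ** 2)                  # 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 5.0 (mean rating is 3.5)

rmse = np.sqrt(ss_res / len(y_true))             # sqrt(0.1875), about 0.433 stars
r2 = 1 - ss_res / ss_tot                         # 0.85

print(f"RMSE: {rmse:.4f}, R2: {r2:.4f}")
# → RMSE: 0.4330, R2: 0.8500
```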

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**Choice:** **Random Forest Regressor**.
**Why:** It consistently outperforms Linear Regression (too simple) and Single Decision Trees (too unstable). Its ensemble nature captures complex non-linear patterns in user behavior (Engagement, Cost preference) while resisting overfitting, providing the most reliable predictions for deployment.
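
A comparison like this is best backed by cross-validated scores for all three model families under identical folds. The sketch below shows the pattern on scikit-learn's synthetic `make_friedman1` data, which has non-linear structure; it is not the Zomato data, and the printed ranking says nothing about this project's results. In the notebook, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem (assumption: stand-in data)
X_toy, y_toy = make_friedman1(n_samples=300, noise=1.0, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean R2 over 3 folds for each candidate, using the same data and scoring
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=3, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} mean CV R2 = {score:.3f}")
```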

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
# Visualize Feature Importance (Random Forest)
# Note: We access the best estimator from our GridSearchCV object 'grid_rf'
best_rf = grid_rf.best_estimator_
importances = best_rf.feature_importances_
feature_names = selected_features.tolist() + list(tfidf.get_feature_names_out())

# Verify lengths match before plotting (handling potential mismatch)
if len(importances) == len(feature_names):
    # Create a DataFrame for better visualization
    feat_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
    feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False).head(20) # Top 20

    plt.figure(figsize=(10, 8))
    sns.barplot(x='Importance', y='Feature', data=feat_imp_df, palette='viridis')
    plt.title('Top 20 Feature Importance (Random Forest)')
    plt.show()
else:
    # Fallback if names don't match (e.g. if using different X_train)
    print(f"Feature count mismatch: Importances={len(importances)}, Names={len(feature_names)}")
    plt.figure(figsize=(10,6))
    plt.plot(importances)
    plt.title('Feature Importances (Index)')
    plt.show()


**Tool:** **Feature Importance (Built-in)**.
**Explanation:** Random Forest scores each feature by the average decrease in impurity (here, variance) that its splits produce across all trees.
**Finding:**
1.  **Votes**: Usually the top predictor (Popularity drives Ratings).
2.  **Cost/Log_Cost**: Premium places tend to imply stricter quality checks.
3.  **Engagement_Score**: Passionate reviews often correlate with extreme (good/bad) ratings.
This transparency builds trust with stakeholders.
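
Impurity-based importances are computed on the training data and can favour features with many distinct values, so a second opinion is often useful. The sketch below uses permutation importance, a different technique from the built-in one used above, via `sklearn.inspection.permutation_importance`. It runs on synthetic data with two informative features, and the names `toy_rf` and `perm` are introduced for this example only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, only 2 carry signal (assumption: stand-in data)
X_toy, y_toy = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

toy_rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in R2
perm = permutation_importance(
    toy_rf, X_te, y_te, n_repeats=5, random_state=42, scoring="r2"
)

# Print both importance measures, ordered by permutation importance
for idx in perm.importances_mean.argsort()[::-1]:
    print(
        f"feature_{idx}: impurity={toy_rf.feature_importances_[idx]:.3f}, "
        f"permutation={perm.importances_mean[idx]:.3f}"
    )
```

Applied to the notebook, the call would take `best_rf`, `X_test`, and `y_test`; if `X_test` is a sparse matrix it may need converting to a dense array first.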

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in pickle or joblib file format for the deployment process.


In [None]:
# Save the File
import joblib
joblib.dump(grid_rf.best_estimator_, 'zomato_rating_model.pkl')
print("Model saved as zomato_rating_model.pkl")

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_model = joblib.load('zomato_rating_model.pkl')
# Predict on first 5 test samples
print("Predicted samples:", loaded_model.predict(X_test[:5]))
print("Actual samples:   ", y_test.iloc[:5].values)
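
A saved estimator can only score inputs that are already in feature form. To score raw review text after deployment, the fitted TF-IDF vectorizer has to be persisted as well. The sketch below bundles both objects in one file. The toy reviews, the bundle layout, and the file name `zomato_rating_bundle.joblib` are assumptions for illustration, and it covers only the text features, not the numeric ones used in the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the notebook's fitted objects (hypothetical data)
texts = ["great food and service", "terrible cold food", "average place", "loved the biryani"]
ratings = np.array([5.0, 1.0, 3.0, 5.0])

toy_tfidf = TfidfVectorizer().fit(texts)
toy_model = RandomForestRegressor(n_estimators=10, random_state=42)
toy_model.fit(toy_tfidf.transform(texts), ratings)

# Persist the model together with the preprocessing it depends on
bundle = {"model": toy_model, "tfidf": toy_tfidf}
path = os.path.join(tempfile.mkdtemp(), "zomato_rating_bundle.joblib")
joblib.dump(bundle, path)

# Reload and score a raw review end to end
restored = joblib.load(path)
new_review = ["great biryani"]
pred_before = toy_model.predict(toy_tfidf.transform(new_review))
pred_after = restored["model"].predict(restored["tfidf"].transform(new_review))

print("Predictions match after reload:", np.allclose(pred_before, pred_after))
```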

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully built a Machine Learning pipeline to predict Zomato Restaurant Ratings.
1.  **Insights**: Cost and Votes are the dominant predictors, which suggests that "Social Proof" is real.
2.  **Data Quality**: We handled messy text, outliers in Cost, and missing values, a reminder that *Better Data > Better Algorithms*.
3.  **Model**: The **Random Forest** model emerged as the champion, capable of handling the complex, non-linear interactions between cuisine, cost, and user engagement.
4.  **Deployment**: The model is serialized and ready for API integration to power real-time recommendations.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***