### Task: Predictive Modeling
Build a regression model to predict the aggregate rating of a restaurant based on available features.
Split the dataset into training and testing sets and evaluate the model's performance using appropriate metrics.
Experiment with different algorithms (e.g., linear regression, decision trees, random forest) and compare their performance.

### Importing Libraries

In [3]:
# Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

### Importing Dataset

In [5]:
# Importing the required dataset
df = pd.read_csv('Dataset .csv')

In [6]:
# getting first five rows of the dataset
df.head()

| | Restaurant ID | Restaurant Name | Country Code | City | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | ... | Currency | Has Table booking | Has Online delivery | Is delivering now | Switch to order menu | Price range | Aggregate rating | Rating color | Rating text | Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6317637 | Le Petit Souffle | 162 | Makati City | Third Floor, Century City Mall, Kalayaan Avenu... | Century City Mall, Poblacion, Makati City | Century City Mall, Poblacion, Makati City, Mak... | 121.027535 | 14.565443 | French, Japanese, Desserts | ... | Botswana Pula(P) | Yes | No | No | No | 3 | 4.8 | Dark Green | Excellent | 314 |
| 1 | 6304287 | Izakaya Kikufuji | 162 | Makati City | Little Tokyo, 2277 Chino Roces Avenue, Legaspi... | Little Tokyo, Legaspi Village, Makati City | Little Tokyo, Legaspi Village, Makati City, Ma... | 121.014101 | 14.553708 | Japanese | ... | Botswana Pula(P) | Yes | No | No | No | 3 | 4.5 | Dark Green | Excellent | 591 |
| 2 | 6300002 | Heat - Edsa Shangri-La | 162 | Mandaluyong City | Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal... | Edsa Shangri-La, Ortigas, Mandaluyong City | Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... | 121.056831 | 14.581404 | Seafood, Asian, Filipino, Indian | ... | Botswana Pula(P) | Yes | No | No | No | 4 | 4.4 | Green | Very Good | 270 |
| 3 | 6318506 | Ooma | 162 | Mandaluyong City | Third Floor, Mega Fashion Hall, SM Megamall, O... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.056475 | 14.585318 | Japanese, Sushi | ... | Botswana Pula(P) | No | No | No | No | 4 | 4.9 | Dark Green | Excellent | 365 |
| 4 | 6314302 | Sambo Kojin | 162 | Mandaluyong City | Third Floor, Mega Atrium, SM Megamall, Ortigas... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.057508 | 14.58445 | Japanese, Korean | ... | Botswana Pula(P) | Yes | No | No | No | 4 | 4.8 | Dark Green | Excellent | 229 |


##### **Checking Number of Rows and Columns**

In [8]:
# Get number of rows and columns
rows, cols = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {cols}")

Number of rows: 9551
Number of columns: 21


##### **Check for Missing Values in Each Column and Handle Them Accordingly**

In [10]:
# Check for missing values in each column

missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values)

Missing values in each column:
Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64


In [11]:
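# 'Cuisines' is the only column with missing values (9 rows).
# For a categorical column we can either drop those rows or fill them with the mode (most frequent value); here we fill.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.

df['Cuisines'] = df['Cuisines'].fillna(df['Cuisines'].mode()[0])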

In [12]:
# Verify missing values after handling them

print("Missing values after handling:")
print(df.isnull().sum())

Missing values after handling:
Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                0
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64
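
As a small illustration of what mode imputation does, the sketch below fills a missing value in a toy Series with its most frequent entry. The cuisine values and the names `cuisines` and `filled` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column with one missing entry (not real dataset rows)
cuisines = pd.Series(["North Indian", None, "Chinese", "North Indian"])

# mode() returns the most frequent non-missing value(s); take the first one
most_frequent = cuisines.mode()[0]

# Replace missing entries with that value
filled = cuisines.fillna(most_frequent)

print(most_frequent)
print(filled.isnull().sum())
```

Because the mode ignores missing entries by default, the fill value is always one of the categories already present in the column.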


##### **Perform Data Type Conversion if Necessary**

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9551 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

Note: No data type conversion is needed.

#### **Convert Categorical Variables to Numeric (One-Hot Encoding)**

In [17]:
# Use One-Hot Encoding to convert categorical columns to numerical columns
df = pd.get_dummies(df, columns=['Currency', 'Has Table booking', 'Has Online delivery', 'Is delivering now', 'Switch to order menu', 'Price range', 'Rating color', 'Rating text', 'Cuisines'], drop_first=True)

# Display the first few rows of the transformed dataset
print(df.head())

   Restaurant ID         Restaurant Name  Country Code              City  \
0        6317637        Le Petit Souffle           162       Makati City   
1        6304287        Izakaya Kikufuji           162       Makati City   
2        6300002  Heat - Edsa Shangri-La           162  Mandaluyong City   
3        6318506                    Ooma           162  Mandaluyong City   
4        6314302             Sambo Kojin           162  Mandaluyong City   

                                             Address  \
0  Third Floor, Century City Mall, Kalayaan Avenu...   
1  Little Tokyo, 2277 Chino Roces Avenue, Legaspi...   
2  Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...   
3  Third Floor, Mega Fashion Hall, SM Megamall, O...   
4  Third Floor, Mega Atrium, SM Megamall, Ortigas...   

                                     Locality  \
0   Century City Mall, Poblacion, Makati City   
1  Little Tokyo, Legaspi Village, Makati City   
2  Edsa Shangri-La, Ortigas, Mandaluyong City   
3      SM 
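
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The rows are hypothetical (styled after the dataset) and the names `toy` and `encoded` are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows; each 'Cuisines' value is a full comma-separated string
toy = pd.DataFrame({
    "Votes": [314, 365, 229],
    "Cuisines": ["Japanese", "Japanese, Sushi", "Japanese, Korean"],
})

# One dummy column per distinct string; drop_first removes the first category
encoded = pd.get_dummies(toy, columns=["Cuisines"], drop_first=True)

print(list(encoded.columns))
```

Each distinct string becomes its own column, so "Japanese, Sushi" and "Japanese, Korean" are treated as unrelated categories rather than as sharing "Japanese". Applied to the full `Cuisines` column, this is consistent with the very wide feature matrix (1856 columns) reported after the train/test split.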

#### **Feature Selection**

We drop the identifier and free-text location columns, which are unlikely to help predict the target variable (Aggregate rating), and keep the remaining columns as features.

In [20]:
# Select features: drop the target and the columns that are less likely to be predictive
# ('Restaurant ID', 'Restaurant Name', 'City', 'Address', 'Locality', 'Locality Verbose')
features = df.drop(columns=['Restaurant ID', 'Restaurant Name', 'City', 'Address', 'Aggregate rating', 'Locality', 'Locality Verbose'])

# Define the target variable
target = df['Aggregate rating']

In [21]:
features.head()

| | Country Code | Longitude | Latitude | Average Cost for two | Votes | Currency_Brazilian Real(R$) | Currency_Dollar($) | Currency_Emirati Diram(AED) | Currency_Indian Rupees(Rs.) | Currency_Indonesian Rupiah(IDR) | ... | Cuisines_Turkish, Arabian, Middle Eastern | Cuisines_Turkish, Arabian, Moroccan, Lebanese | Cuisines_Turkish, Mediterranean, Middle Eastern | Cuisines_Vietnamese | Cuisines_Vietnamese, Fish and Chips | Cuisines_Western, Asian, Cafe | Cuisines_Western, Fusion, Fast Food | Cuisines_World Cuisine | Cuisines_World Cuisine, Mexican, Italian | Cuisines_World Cuisine, Patisserie, Cafe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 162 | 121.027535 | 14.565443 | 1100 | 314 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 162 | 121.014101 | 14.553708 | 1200 | 591 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 162 | 121.056831 | 14.581404 | 4000 | 270 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 162 | 121.056475 | 14.585318 | 1500 | 365 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 162 | 121.057508 | 14.58445 | 1500 | 229 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


#### **Split the Dataset into Training and Testing Sets**

In [23]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Verify the split sizes
print(f"Training data size: {X_train.shape}")
print(f"Testing data size: {X_test.shape}")

Training data size: (7640, 1856)
Testing data size: (1911, 1856)
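
The reported split sizes can be reproduced with a little arithmetic. The sketch below assumes the test count is the ceiling of `test_size * n_samples` and the training set takes the remaining rows, which matches the output above; the variable names are illustrative.

```python
import math

n_samples = 9551   # number of rows reported by df.shape earlier
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

With 9551 rows, 20% is 1910.2, which rounds up to 1911 test rows and leaves 7640 for training.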


#### **Build and Evaluate Models**

We will experiment with Linear Regression, Decision Tree Regressor, and Random Forest Regressor to predict the target variable.

##### **1. Linear Regression**

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the model
linear_model = LinearRegression()

# Train the model
linear_model.fit(X_train, y_train)

# Make predictions
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model's performance
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

print(f"Linear Regression - MSE: {mse_linear}, R2 Score: {r2_linear}")

Linear Regression - MSE: 3409839406.588018, R2 Score: -1498100231.065065


##### **2. Decision Tree Regressor**

In [29]:
from sklearn.tree import DecisionTreeRegressor

# Initialize the model
decision_tree_model = DecisionTreeRegressor(random_state=42)

# Train the model
decision_tree_model.fit(X_train, y_train)

# Make predictions
y_pred_tree = decision_tree_model.predict(X_test)

# Evaluate the model's performance
mse_tree = mean_squared_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)

print(f"Decision Tree Regressor - MSE: {mse_tree}, R2 Score: {r2_tree}")

Decision Tree Regressor - MSE: 0.0566509680795395, R2 Score: 0.9751106083580661


##### **3. Random Forest Regressor**

In [31]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
random_forest_model = RandomForestRegressor(random_state=42)

# Train the model
random_forest_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = random_forest_model.predict(X_test)

# Evaluate the model's performance
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest Regressor - MSE: {mse_rf}, R2 Score: {r2_rf}")

Random Forest Regressor - MSE: 0.02953474884353737, R2 Score: 0.9870240181954022


#### **Model Comparison**

Now we compare the performance of all three models based on Mean Squared Error (MSE, lower is better) and R2 Score (higher is better, with 1.0 as the maximum).

In [34]:
# Compare the models
model_comparison = {
    'Linear Regression': {'MSE': mse_linear, 'R2': r2_linear},
    'Decision Tree Regressor': {'MSE': mse_tree, 'R2': r2_tree},
    'Random Forest Regressor': {'MSE': mse_rf, 'R2': r2_rf}
}

# Display the comparison
comparison_df = pd.DataFrame(model_comparison).T
print(comparison_df)

                                  MSE            R2
Linear Regression        3.409839e+09 -1.498100e+09
Decision Tree Regressor  5.665097e-02  9.751106e-01
Random Forest Regressor  2.953475e-02  9.870240e-01
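
To help read the two metrics in the table, here is a plain-Python sketch of how MSE and R2 are defined. The function names and the sample ratings are hypothetical and are not taken from the dataset; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
def mse_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical ratings for illustration only
y_true = [4.0, 3.0, 5.0, 2.0]
close_pred = [4.0, 3.0, 4.0, 2.0]    # one prediction off by 1
wild_pred = [40.0, 3.0, 5.0, 2.0]    # one prediction far outside the rating scale

print(mse_manual(y_true, close_pred), r2_manual(y_true, close_pred))
print(mse_manual(y_true, wild_pred), r2_manual(y_true, wild_pred))
```

R2 compares the model against always predicting the mean of the actual values, so it turns negative whenever the model's squared error exceeds that baseline. A single prediction far outside the rating scale is enough to drive R2 strongly negative, as the second pair of values shows.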
