## **LEVEL 3 - TASK 1**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# --- This part handles the file upload in Colab ---
print("Please upload your 'Dataset .csv' file:")
uploaded = files.upload()

# Get the file name you just uploaded
file_name = list(uploaded.keys())[0]
print(f"\nSuccessfully uploaded {file_name}")

# --- Task 1: Load the dataset and explore rows/columns ---
df = pd.read_csv(file_name)
print(f"\nThe dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

Please upload your 'Dataset .csv' file:


Saving structured_dataset.csv to structured_dataset.csv

Successfully uploaded structured_dataset.csv

The dataset has 9551 rows and 24 columns.


## **Part 1: Data Preparation (The Most Important Step)**

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# --- 1. Prepare Data for Modeling ---

# IMPORTANT: Drop all rows where 'Aggregate rating' is missing (NaN).
# Note: dropna() only removes NaN values; unrated rows stored as a number are kept.
model_df = df.dropna(subset=['Aggregate rating'])

print(f"Original dataset had {len(df)} rows.")
print(f"Dataset for modeling has {len(model_df)} rows (after dropping 'Not Rated').")

# --- 2. Define Features (X) and Target (y) ---

# Our target variable
target_variable = 'Aggregate rating'

# Our feature list (all numeric or binary)
features = [
    'Average Cost for two',
    'Votes',
    'Price range',
    'Has Table booking',      # This is 1/0
    'Has Online delivery',    # This is 1/0
    'Name Length',            # From Level 2
    'Address Length',         # From Level 2
    'Cuisine Count',          # From Level 2
    'Longitude',
    'Latitude'
]

# Create X and y
X = model_df[features]
y = model_df[target_variable]

print("\nFeatures (X) and Target (y) are ready.")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

Original dataset had 9551 rows.
Dataset for modeling has 9551 rows (after dropping 'Not Rated').

Features (X) and Target (y) are ready.
X shape: (9551, 10)
y shape: (9551,)
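
The row count is identical before and after `dropna`, which suggests the rating column contains no NaN values in this file. If unrated restaurants are instead encoded with a placeholder value, they would still be in `model_df`. The sketch below shows one way to check for that on a toy DataFrame. The placeholder value 0.0 and the toy data are assumptions for illustration, not something confirmed by this notebook.

```python
import pandas as pd

# Toy stand-in for the real data; only the column names match the notebook.
toy = pd.DataFrame({
    "Aggregate rating": [4.1, 0.0, 3.5, None, 0.0],
    "Votes": [120, 0, 45, 10, 2],
})

# dropna() removes only true missing values (NaN / None).
no_nan = toy.dropna(subset=["Aggregate rating"])

# Assumption: a rating of 0.0 marks an unrated restaurant.
rated = no_nan[no_nan["Aggregate rating"] > 0]

print(f"All rows: {len(toy)}")
print(f"After dropna: {len(no_nan)}")
print(f"After removing 0.0 ratings: {len(rated)}")
```

On the real data, `(df['Aggregate rating'] == 0).sum()` would show how many rows carry such a placeholder, if any.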


## **Part 2: Split the Dataset (Training and Testing)**

In [9]:
# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42  # 'random_state' ensures we get the same split every time
)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (7640, 10)
X_test shape: (1911, 10)
y_train shape: (7640,)
y_test shape: (1911,)
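
The 7640 / 1911 split is not an exact 80/20 because 9551 is not divisible by 5. A minimal sketch of the arithmetic, assuming the test count is rounded up and the training count is rounded down (which is consistent with the shapes printed above):

```python
import math

n_samples = 9551   # rows in the modeling dataset
test_size = 0.2    # fraction requested for the test set

# Assumption: test count is rounded up, training count is rounded down.
n_test = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(f"train rows: {n_train}, test rows: {n_test}, total: {n_train + n_test}")
```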


## **Part 3: Model 1 - Linear Regression (The Baseline)**

In [10]:
print("--- Training Linear Regression Model ---")
# Create the model
lr_model = LinearRegression()

# Train the model on the training data
lr_model.fit(X_train, y_train)

# --- Evaluate the Model ---
print("Evaluating Linear Regression...")
y_pred_lr = lr_model.predict(X_test)

# Calculate metrics
r2_lr = r2_score(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)

print(f"R-squared (R²): {r2_lr:.4f}")
print(f"Mean Absolute Error (MAE): {mae_lr:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lr:.4f}")

--- Training Linear Regression Model ---
Evaluating Linear Regression...
R-squared (R²): 0.2784
Mean Absolute Error (MAE): 1.0675
Root Mean Squared Error (RMSE): 1.2816
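
To make the three metrics concrete, here is a small sketch that computes them from their standard definitions in plain Python. The function name `regression_metrics` and the toy values are hypothetical and chosen for illustration; the notebook itself uses the scikit-learn functions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R-squared, MAE, RMSE) from their textbook definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Sum of squared prediction errors
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Sum of squared deviations from the mean (error of a "predict the mean" model)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return r2, mae, rmse

# Toy ratings, not taken from the dataset
y_true_demo = [3.0, 4.0, 2.0, 5.0]
y_pred_demo = [2.5, 4.0, 3.0, 4.5]

r2_demo, mae_demo, rmse_demo = regression_metrics(y_true_demo, y_pred_demo)
print(f"R2={r2_demo:.4f}  MAE={mae_demo:.4f}  RMSE={rmse_demo:.4f}")
```

An R-squared near 0 means the model is barely better than always predicting the mean rating, which is a useful way to read the 0.2784 above.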


## **Part 4: Model 2 & 3 - Decision Tree & Random Forest**

In [11]:
print("\n--- Training Decision Tree Model ---")
# Create the model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# --- Evaluate the Model ---
print("Evaluating Decision Tree...")
y_pred_dt = dt_model.predict(X_test)

r2_dt = r2_score(y_test, y_pred_dt)
mae_dt = mean_absolute_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))

print(f"R-squared (R²): {r2_dt:.4f}")
print(f"Mean Absolute Error (MAE): {mae_dt:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_dt:.4f}")


# Model 3: Random Forest Regressor

print("\n--- Training Random Forest Model ---")
# Create the model
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)

# --- Evaluate the Model ---
print("Evaluating Random Forest...")
y_pred_rf = rf_model.predict(X_test)

r2_rf = r2_score(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print(f"R-squared (R²): {r2_rf:.4f}")
print(f"Mean Absolute Error (MAE): {mae_rf:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_rf:.4f}")


--- Training Decision Tree Model ---
Evaluating Decision Tree...
R-squared (R²): 0.9124
Mean Absolute Error (MAE): 0.2895
Root Mean Squared Error (RMSE): 0.4466

--- Training Random Forest Model ---
Evaluating Random Forest...
R-squared (R²): 0.9597
Mean Absolute Error (MAE): 0.1995
Root Mean Squared Error (RMSE): 0.3029


## **Part 5: Comparison**


**Decision Tree**

- R-squared (R²): 0.9124
- Mean Absolute Error (MAE): 0.2895
- Root Mean Squared Error (RMSE): 0.4466

**Random Forest**

- R-squared (R²): 0.9597
- Mean Absolute Error (MAE): 0.1995
- Root Mean Squared Error (RMSE): 0.3029

**Linear Regression**

- R-squared (R²): 0.2784
- Mean Absolute Error (MAE): 1.0675
- Root Mean Squared Error (RMSE): 1.2816



Key Takeaways:

Best Model:

Random Forest gave the best performance on all three metrics (highest R², lowest MAE and RMSE).

Worst Model:

Linear Regression performed poorly, which suggests that the relationship between these features and the rating is not well captured by a linear model.
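
As a quick check on the comparison, the sketch below ranks the three models by MAE and prints how much larger each error is than the best one. The metric values are copied from the outputs above; the dictionary layout and variable names are illustrative.

```python
# Test-set metrics as reported in the notebook outputs
metrics = {
    "Linear Regression": {"r2": 0.2784, "mae": 1.0675, "rmse": 1.2816},
    "Decision Tree": {"r2": 0.9124, "mae": 0.2895, "rmse": 0.4466},
    "Random Forest": {"r2": 0.9597, "mae": 0.1995, "rmse": 0.3029},
}

# Rank models from lowest to highest MAE
ranked = sorted(metrics, key=lambda name: metrics[name]["mae"])
best_mae = metrics[ranked[0]]["mae"]

for name in ranked:
    m = metrics[name]
    # Relative increase in MAE compared with the best model
    rel = (m["mae"] - best_mae) / best_mae
    print(f"{name:<18} R2={m['r2']:.4f}  MAE={m['mae']:.4f}  "
          f"RMSE={m['rmse']:.4f}  MAE vs best: +{rel:.0%}")
```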

## 1. Random Forest (🏆 Best Performance)

**R-squared (R²)**: 0.9597

What it means: This is an excellent score. It means your model can "explain" about 96% of the variation in restaurant ratings on the test set.

**Mean Absolute Error (MAE)**: 0.1995

What it means: On average, your model's prediction is only about 0.20 stars off from the actual rating. This is a very low error.

**RMSE**: 0.3029

What it means: This is the lowest RMSE of the three models. Because RMSE weights large errors more heavily than MAE does, a low value here suggests that large misses are relatively rare.

## 2. Decision Tree (Good, but Possibly Overfitting)

**R-squared (R²): 0.9124**

What it means: This is also a very good score, explaining about 91% of the variation.

**Mean Absolute Error (MAE): 0.2895**

What it means: The average error is 0.29 stars. This is good, but about 45% higher than the Random Forest's error.

**RMSE: 0.4466**

What it means: This error is noticeably higher than the Random Forest's. A single unrestricted Decision Tree can "overfit" (learn the training data too closely and generalize less well to new data). The test score alone does not prove this; comparing training and test scores would confirm whether it is happening here.
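
One common way to check for overfitting is to compare a model's score on the training data with its score on the test data. The sketch below does this on synthetic data, since the notebook does not report training scores. The data-generating formula, the `max_depth=4` setting, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy, nonlinear data (not the restaurant dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(500, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=500)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# An unrestricted tree versus a depth-limited one
full_tree = DecisionTreeRegressor(random_state=42).fit(Xtr, ytr)
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(Xtr, ytr)

for name, model in [("unrestricted", full_tree), ("max_depth=4", shallow_tree)]:
    train_r2 = r2_score(ytr, model.predict(Xtr))
    test_r2 = r2_score(yte, model.predict(Xte))
    # A large gap between train and test R2 is a sign of overfitting
    print(f"{name:<13} train R2={train_r2:.4f}  test R2={test_r2:.4f}  "
          f"gap={train_r2 - test_r2:.4f}")
```

Applied to this notebook, the same comparison would use `dt_model.predict(X_train)` against `y_train` alongside the test score already computed.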

## 3. Linear Regression (Poor Performance)

**R-squared (R²): 0.2784**

What it means: This is a very low score. It means the model can only explain about 28% of the variation in ratings.

**Mean Absolute Error (MAE): 1.0675**

What it means: The model's predictions are, on average, 1.07 stars off. This is a very high error.

**RMSE: 1.2816**

What it means: The error is more than four times the Random Forest's RMSE, so this model is not useful for prediction here. The likely reason is that the relationship between these features and the rating is too complex for a simple straight-line (linear) model.

In [12]:
results_df = pd.DataFrame({
    'Actual': y_test,
    'Linear_Regression_Prediction': y_pred_lr,
    'Decision_Tree_Prediction': y_pred_dt,
    'Random_Forest_Prediction': y_pred_rf
})

output_filename = 'model_predictions.csv'
results_df.to_csv(output_filename, index=False)

print(f"Predictions saved to {output_filename}")
files.download(output_filename)

Predictions saved to model_predictions.csv

