# Car Price Prediction Assignment
## 1. Problem Definition & Data Collection
**Objective:** Predict the price of used cars based on attributes like mileage, year, engine capacity, and model.
**Dataset:** A collection of used car listings (Toyota, Suzuki, Honda, etc.) filtered for hatchbacks manufactured after 2005.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('ggplot')


## 2. Data Loading & Preprocessing
We load the cleaned dataset, which has already been through the following steps:
- Removal of units (km, cc)
- Filtering for domain (Hatchbacks, Year > 2005, Price <= 10M)
- One-Hot Encoding for Fuel Type
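
The cleaning steps listed above live in `preprocess.py`, which is not shown in this notebook. The following is only a minimal sketch of what such steps could look like. The raw column names (`body_type`, `mileage`, `engine_capacity`, `fuel_type`), the sample rows, and the helper `strip_units` are all hypothetical, chosen for illustration.

```python
import pandas as pd

# Hypothetical raw listings; column names and values are illustrative only.
raw = pd.DataFrame({
    "model": ["Vitz", "Swift", "Civic", "Alto"],
    "body_type": ["Hatchback", "Hatchback", "Sedan", "Hatchback"],
    "manufacture_year": [2012, 2004, 2015, 2018],
    "mileage": ["45,000 km", "120,000 km", "60,000 km", "15,000 km"],
    "engine_capacity": ["1000 cc", "1300 cc", "1800 cc", "660 cc"],
    "fuel_type": ["Petrol", "Petrol", "Hybrid", "Petrol"],
    "price": [2_500_000, 900_000, 4_200_000, 2_100_000],
})


def strip_units(series: pd.Series) -> pd.Series:
    """Keep only the digits of strings such as '45,000 km' and cast to int."""
    return series.str.replace(r"[^\d]", "", regex=True).astype(int)


cleaned = raw.copy()

# Step 1: remove units (km, cc)
cleaned["mileage"] = strip_units(cleaned["mileage"])
cleaned["engine_capacity"] = strip_units(cleaned["engine_capacity"])

# Step 2: domain filter (hatchbacks, year > 2005, price <= 10M)
mask = (
    (cleaned["body_type"] == "Hatchback")
    & (cleaned["manufacture_year"] > 2005)
    & (cleaned["price"] <= 10_000_000)
)
cleaned = cleaned[mask].drop(columns=["body_type"])

# Step 3: one-hot encode fuel type
cleaned = pd.get_dummies(cleaned, columns=["fuel_type"])

print(cleaned)
```

In this toy frame only the Vitz and Alto rows survive the filter: the Swift is too old and the Civic is not a hatchback.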

In [None]:
# Load Data
file_path = 'data/processed/cleaned_car_data.csv'
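try:
    df = pd.read_csv(file_path)
except FileNotFoundError as exc:
    # Stop here: every later cell needs `df`, so continuing would only raise a NameError.
    raise FileNotFoundError(
        f"❌ File not found: {file_path}. Please run 'preprocess.py' first."
    ) from exc
print(f"✅ Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")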

# Overview
df.head()

### Feature Engineering
The 'model' column is categorical text (e.g., 'Vitz', 'Swift').
scikit-learn's Random Forest requires numerical input, so we use **One-Hot Encoding**.

In [None]:
# Define Features (X) and Target (y)
# 'price' must exist, so let drop() fail loudly if the column is missing
X = df.drop(columns=['price'])
y = df['price']

# One-Hot Encoding for 'model' column
# drop_first=True drops one redundant dummy column; the dropped model
# becomes the all-zeros baseline
X = pd.get_dummies(X, columns=['model'], drop_first=True)

print(f"Feature set shape after encoding: {X.shape}")
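
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: three listings, three distinct models (illustrative values).
toy = pd.DataFrame({
    "mileage": [45000, 80000, 30000],
    "model": ["Swift", "Vitz", "Alto"],
})

# Categories are sorted alphabetically (Alto, Swift, Vitz) and the first
# one is dropped, so 'Alto' is represented by zeros in both dummy columns.
encoded = pd.get_dummies(toy, columns=["model"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Three models produce two dummy columns (`model_Swift`, `model_Vitz`). The Alto row has both set to zero, which is how the dropped category is still recoverable.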

## 3. Model Training
We use **Random Forest Regressor**.

**Justification:**
- It handles non-linear relationships better than Linear Regression.
- It is robust to outliers and doesn't require feature scaling.
- It provides built-in feature importance.

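**Hyperparameters (pre-pruning):**
- `n_estimators=100`: Builds 100 trees; averaging over them stabilizes predictions.
- `max_depth=15`: Limits tree depth to reduce **overfitting**.
- `min_samples_split=5`: A node must contain at least 5 samples before it can be split, which stops trees from carving out tiny, noisy partitions (leaf size itself is governed by `min_samples_leaf`).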
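
One way to see what these limits do is to compare a constrained forest with an unconstrained one. The sketch below uses synthetic data, not the car dataset; the helper `summarize` and all variable names are hypothetical, and the number of trees is reduced to 50 only to keep the demo fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem (stand-in data, not the car listings).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 3 * X_demo[:, 0] + 5 * np.sin(X_demo[:, 1]) + rng.normal(0, 2, size=500)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

unconstrained = RandomForestRegressor(n_estimators=50, random_state=42).fit(Xtr, ytr)
constrained = RandomForestRegressor(
    n_estimators=50, max_depth=15, min_samples_split=5, random_state=42
).fit(Xtr, ytr)


def summarize(model):
    """Return (deepest tree depth, total node count) across the forest."""
    depths = [tree.get_depth() for tree in model.estimators_]
    nodes = sum(tree.tree_.node_count for tree in model.estimators_)
    return max(depths), nodes


for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    max_depth, n_nodes = summarize(model)
    print(
        f"{name:>13}: max depth={max_depth}, nodes={n_nodes}, "
        f"train R²={model.score(Xtr, ytr):.3f}, test R²={model.score(Xte, yte):.3f}"
    )
```

A large gap between train and test R² is the usual sign of overfitting. Running the same comparison on the real `X_train`/`X_test` split would show whether the chosen limits actually narrow that gap for this dataset.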

In [None]:
# Train-Test Split (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples:  {X_test.shape[0]}")

# Initialize Model
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,          # Cap tree depth (limits overfitting)
    min_samples_split=5,   # Min samples a node needs before it may split
    random_state=42
)

# Train
print("Training Random Forest model...")
rf_model.fit(X_train, y_train)
print("Training complete.")

## 4. Evaluation
We evaluate the model using:
- **R² Score:** Proportion of the variance in price that the model explains (a goodness-of-fit measure, not a classification accuracy).
- **MAE (Mean Absolute Error):** Average absolute error in Rupees.
- **RMSE (Root Mean Squared Error):** Penalizes large errors more heavily.
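
As a worked example of the three metrics, the sketch below computes them by hand with NumPy on four made-up prices. The arrays and the `demo_` variable names are illustrative only; the notebook itself uses the scikit-learn functions.

```python
import numpy as np

# Made-up actual and predicted prices (illustrative values only).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_hat

# MAE: mean of the absolute errors
demo_mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; squaring makes the
# single 30-unit miss dominate the result
demo_rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
demo_r2 = 1 - ss_res / ss_tot

print(f"MAE:  {demo_mae:.2f}")
print(f"RMSE: {demo_rmse:.2f}")
print(f"R²:   {demo_r2:.4f}")
```

Here MAE is 17.5 while RMSE is about 19.36. The gap comes from the one larger error, which is the sense in which RMSE penalizes large errors more heavily.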

In [None]:
# Predictions
y_pred = rf_model.predict(X_test)

# Metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("-" * 30)
print("Model Performance Results:")
print(f"R² Score (variance explained): {r2:.4f}")
print(f"Mean Absolute Error:           Rs. {mae:,.2f}")
print(f"RMSE:                          Rs. {rmse:,.2f}")
print("-" * 30)

# Visualizing Actual vs Predicted
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.6, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal line (y = x) over the test range
plt.xlabel("Actual Price (Rs)")
plt.ylabel("Predicted Price (Rs)")
plt.title("Actual vs Predicted Prices (Random Forest)")
plt.show()

## 5. Explainability (XAI) - Requirement 4
We use **SHAP (SHapley Additive exPlanations)** to explain *why* the model predicts specific prices.

- **Summary Plot:** Shows which features are most important globally.
- **Dependence Plot:** Shows how a single feature (e.g., Year) affects price.
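
Both plots rest on the additive property of SHAP values, stated here as a reminder. For a single car with feature vector $x$ and $M$ features, the model's prediction decomposes as:

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
```

Here $\phi_0$ is the baseline, the model's average prediction over the background data (exposed in the `shap` library as `explainer.expected_value`), and $\phi_i(x)$ is the SHAP value of feature $i$ for that car, measured in Rupees. A positive $\phi_i$ pushes the predicted price above the baseline and a negative one pulls it below.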

In [None]:
# Initialize JS visualization code
shap.initjs()

# Create TreeExplainer
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

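# 1. Summary Plot (Feature Importance)
# summary_plot resizes the current figure itself, so pass the size via
# plot_size and set the title after the plot has been drawn.
shap.summary_plot(shap_values, X_test, plot_size=(12, 8), show=False)
plt.title("Feature Importance (SHAP Summary)")
plt.show()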

# 2. Dependence Plot for 'manufacture_year'
# This shows how the Year affects the price prediction
print("Dependence Plot: Effect of Manufacture Year on Price")
shap.dependence_plot("manufacture_year", shap_values, X_test)