# 📓 Day 06 — Predicting Car Prices with LightGBM
*Turning raw car data into accurate price predictions*

## 🔹 1. Introduction
We start with a simple question:
💭 *“Given a car’s features — brand, model, mileage, fuel type — can a machine predict its price?”*

This is a **regression** problem: the target is a continuous number, not a category. Today we’ll use **LightGBM**, a fast gradient-boosting library built on decision trees that tends to perform well on tabular data.

## 🔹 2. Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import lightgbm as lgb

## 🔹 3. Load the Data

In [None]:
# Load dataset (assuming you already have the cleaned version)
cars = pd.read_csv("cars_processed.csv")
print("Shape:", cars.shape)
cars.head()

## 🔹 4. Feature & Target Split

In [None]:
X = cars.drop("price", axis=1)
y = cars["price"]

## 🔹 5. Train/Validation/Test Split

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Train:", X_train.shape, "Validation:", X_val.shape, "Test:", X_test.shape)
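The two calls above first hold out 20% of the rows, then cut that held-out part in half, which works out to roughly 80% train, 10% validation, and 10% test. Here is a minimal sketch on a toy array to confirm the arithmetic; the 1,000-row size and the `*_toy` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 1,000 rows, 3 features (illustrative only).
X_toy = np.arange(3000).reshape(1000, 3)
y_toy = np.arange(1000)

# Step 1: hold out 20% of the rows.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Step 2: split the held-out 20% in half -> 10% validation, 10% test.
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

split_sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
print(split_sizes)  # {'train': 800, 'val': 100, 'test': 100}
```

The validation set is used while training (for early stopping), so the test set stays untouched until the final evaluation.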

## 🔹 6. Encoding Categorical Features

In [None]:
categorical_cols = X.select_dtypes(include=["object"]).columns

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_enc = encoder.fit_transform(X_train[categorical_cols])
X_val_enc = encoder.transform(X_val[categorical_cols])
X_test_enc = encoder.transform(X_test[categorical_cols])

# Combine with numerical features
num_train = X_train.drop(categorical_cols, axis=1).to_numpy()
num_val = X_val.drop(categorical_cols, axis=1).to_numpy()
num_test = X_test.drop(categorical_cols, axis=1).to_numpy()

X_train_final = np.hstack([num_train, X_train_enc])
X_val_final   = np.hstack([num_val, X_val_enc])
X_test_final  = np.hstack([num_test, X_test_enc])

print("Final Shape:", X_train_final.shape)

## 🔹 7. Train LightGBM

In [None]:
train_data = lgb.Dataset(X_train_final, label=y_train)
val_data = lgb.Dataset(X_val_final, label=y_val)

params = {
    "objective": "regression",
    "metric": "rmse",
    "boosting_type": "gbdt",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "verbose": -1
}

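# LightGBM 4.x removed the early_stopping_rounds argument of lgb.train;
# early stopping is now configured through a callback.
# Note: num_boost_round defaults to 100; raise it to let early stopping pick the tree count.
model = lgb.train(
    params,
    train_data,
    valid_sets=[val_data],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)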

## 🔹 8. Evaluate the Model

In [None]:
def evaluate(name, X, y, model):
    preds = model.predict(X)
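    # The squared=False argument was deprecated and later removed from scikit-learn,
    # so take the square root of the MSE explicitly.
    rmse = np.sqrt(mean_squared_error(y, preds))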
    mae = mean_absolute_error(y, preds)
    r2 = r2_score(y, preds)
    mape = np.mean(np.abs((y - preds) / y)) * 100

    print(f"{name:<15} | RMSE: {rmse:8.1f} | MAE: {mae:8.1f} | MAPE: {mape:6.2f}% | R²: {r2:.3f}")

evaluate("LightGBM Val", X_val_final, y_val, model)
evaluate("LightGBM Test", X_test_final, y_test, model)
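To see exactly what each metric measures, here is a minimal sketch that computes all four with plain NumPy on a handful of made-up prices. The function name `regression_metrics` and the toy numbers are hypothetical. One design choice differs from the cell above: rows with a true value of zero are skipped in the MAPE, since dividing by zero would otherwise make it infinite.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MAPE (%) and R² with plain NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    rmse = np.sqrt(np.mean(err ** 2))   # squares errors, so big misses dominate
    mae = np.mean(np.abs(err))          # plain average of absolute errors

    # Skip zero targets so the percentage error stays finite (an assumption of this sketch).
    nonzero = y_true != 0
    mape = np.mean(np.abs(err[nonzero] / y_true[nonzero])) * 100

    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

# Made-up prices and predictions, for illustration only.
toy_true = [10000, 20000, 30000, 40000]
toy_pred = [11000, 19000, 33000, 38000]
metrics = regression_metrics(toy_true, toy_pred)

for name, value in metrics.items():
    print(f"{name:>5}: {value:.3f}")
```

On these toy numbers the MAE is 1,750 while the RMSE is about 1,936: the single 3,000 miss pulls the RMSE above the MAE.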

## 🔹 9. Results
Example output (yours may differ slightly):

```
LightGBM Val    | RMSE:   3492.7 | MAE:   2144.7 | MAPE:  26.19% | R²: 0.923
LightGBM Test   | RMSE:   3499.2 | MAE:   2129.3 | MAPE:  27.31% | R²: 0.922
```

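💡 Interpretation:
- RMSE ≈ 3,500 → the root-mean-square error, in price units. It weights large misses more heavily than MAE, so it is not a plain average error.
- MAE ≈ 2,100 → on average, we’re off by about 2,100.
- MAPE ≈ 26-27% → the average relative error. It is high next to the MAE, which suggests large percentage errors on cheaper cars.
- R² ≈ 0.92 → the model explains about 92% of the variance in car prices.

Validation and test scores are nearly identical, so there is no sign of overfitting to the validation set. This is **strong** performance in absolute terms, though the MAPE shows there is room to improve on relative error.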
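A high R² and a fairly high MAPE can coexist, because the same absolute error means very different things for a cheap car and an expensive one. The numbers below are made up purely to illustrate this; every prediction is off by exactly 2,000.

```python
import numpy as np

# Illustrative numbers only: each prediction overshoots by exactly 2,000.
true_prices = np.array([5_000, 10_000, 25_000, 50_000], dtype=float)
predictions = true_prices + 2_000

abs_err = np.abs(true_prices - predictions)
pct_err = abs_err / true_prices * 100  # relative error per car, in percent

for price, pct in zip(true_prices, pct_err):
    print(f"price {price:>8,.0f} -> relative error {pct:5.1f}%")

mae_toy = abs_err.mean()    # 2,000 for every car
mape_toy = pct_err.mean()   # pulled up by the cheapest cars
print(f"MAE: {mae_toy:,.0f} | MAPE: {mape_toy:.1f}%")
```

If relative error matters for your use case, one option worth trying is to check the metrics separately per price band to see where the model struggles.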

## 🔹 10. Reflection
Today, you:
- Transformed raw car features into a machine-readable format.
- Trained **LightGBM**, a state-of-the-art gradient boosting model.
- Evaluated it with **multiple metrics**.
- Achieved **R² > 0.92** on held-out data, meaning your model captures most of the variation in car prices.