
# 🚗 Day 04 — Feature Engineering & Pipelines (Learning Notes)

**Objective:**  
Today we take the cleaned vehicles dataset from Day 03 (`vehicles_cleaned.csv`) and:
- Split into train/test sets
- Engineer new features
- Build preprocessing pipelines (numeric + categorical)
- Train multiple baseline models: Random Forest, Linear Regression, Decision Tree
- Compare their performance

This follows the style from *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow* (Aurélien Géron, Chapter 2).  


## 1. Load the Dataset

In [None]:

import pandas as pd

# Load the cleaned dataset (output from Day 03)
vehicles = pd.read_csv("/mnt/data/vehicles_cleaned.csv")
print("Shape:", vehicles.shape)
vehicles.head()



## 2. Train/Test Split

As Géron emphasizes, always keep a test set aside before doing any heavy exploration or model training.  
We'll stratify based on price categories to ensure balanced splits across cheap/expensive cars.


In [None]:

from sklearn.model_selection import train_test_split
import numpy as np

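# Stratify based on price ranges; include_lowest=True keeps a price of
# exactly 0 in the first bin instead of turning it into NaN
vehicles["price_cat"] = pd.cut(vehicles["price"],
                               bins=[0, 5000, 15000, 30000, 60000, np.inf],
                               labels=[1, 2, 3, 4, 5],
                               include_lowest=True)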

train_set, test_set = train_test_split(vehicles, test_size=0.2, stratify=vehicles["price_cat"], random_state=42)

print("Train size:", len(train_set), " Test size:", len(test_set))

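# Drop the temporary stratification column (reassign rather than drop in place,
# which can trigger pandas' SettingWithCopyWarning on the split results)
train_set = train_set.drop(columns="price_cat")
test_set = test_set.drop(columns="price_cat")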
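
To see what stratification buys us, the sketch below compares category proportions in a stratified test set and a purely random one. It runs on synthetic log-normal prices (an assumption, not the real vehicles data), and the helper `cat_share` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vehicles data: skewed, strictly positive prices
rng = np.random.default_rng(42)
toy = pd.DataFrame({"price": rng.lognormal(mean=9.3, sigma=0.8, size=2000)})

# Same bin edges as in the split above
toy["price_cat"] = pd.cut(toy["price"],
                          bins=[0, 5000, 15000, 30000, 60000, np.inf],
                          labels=[1, 2, 3, 4, 5])

strat_train, strat_test = train_test_split(
    toy, test_size=0.2, stratify=toy["price_cat"], random_state=42)
rand_train, rand_test = train_test_split(toy, test_size=0.2, random_state=42)


def cat_share(df):
    """Fraction of rows falling in each price category."""
    return df["price_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": cat_share(toy),
    "stratified_test": cat_share(strat_test),
    "random_test": cat_share(rand_test),
})
comparison["strat_abs_err"] = (comparison["stratified_test"] - comparison["overall"]).abs()
comparison["rand_abs_err"] = (comparison["random_test"] - comparison["overall"]).abs()
print(comparison.round(4))
```

The stratified column should track the overall proportions closely, while the random split can drift, especially in the small top price bucket.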


## 3. Separate Features & Labels

In [None]:

# Target variable
y_train = train_set["price"]
X_train = train_set.drop("price", axis=1)

y_test = test_set["price"]
X_test = test_set.drop("price", axis=1)

X_train.head()



## 4. Feature Engineering

We create new features:
- **Car Age** = current year - year of the car
- Optionally, **log-transform price** to reduce skewness (not done here yet)


In [None]:

import datetime

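# Note: car_age depends on the date the notebook is run; fix a reference
# year instead if results need to be reproducible
current_year = datetime.datetime.now().year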

X_train = X_train.copy()
X_test = X_test.copy()

# Create new feature: car age
X_train["car_age"] = current_year - X_train["year"]
X_test["car_age"] = current_year - X_test["year"]

# Drop original year column
X_train.drop("year", axis=1, inplace=True)
X_test.drop("year", axis=1, inplace=True)

X_train.head()
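
The log-transform of price mentioned above is not applied in this notebook. If it were tried, one possible route is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and converts predictions back to the original scale. The sketch below uses synthetic data and a plain `LinearRegression` as stand-ins; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): price decays exponentially with car age
rng = np.random.default_rng(0)
car_age = rng.uniform(0, 25, size=500)
price = np.exp(10.5 - 0.08 * car_age + rng.normal(0, 0.3, size=500))
X_demo = car_age.reshape(-1, 1)

# Fit on log1p(price); predictions are mapped back with expm1 automatically
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_demo, price)

preds = log_model.predict(X_demo[:5])
print("Predicted prices (original scale):", preds.round(0))
print("Slope in log space:", log_model.regressor_.coef_[0])
```

Because the wrapper handles the inverse transform, RMSE computed on its predictions stays in price units and remains comparable with the untransformed models.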



## 5. Build Preprocessing Pipelines

- Numeric pipeline: median `SimpleImputer` → `StandardScaler`
- Categorical pipeline: most-frequent `SimpleImputer` → `OneHotEncoder` (unknown categories are ignored at prediction time)
- Combine both with `ColumnTransformer`


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np

# Identify numeric and categorical columns
num_attribs = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_attribs = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

print("Numeric:", num_attribs)
print("Categorical:", cat_attribs)

# Numeric pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('scaler', StandardScaler()),
])

# Categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('onehot', OneHotEncoder(handle_unknown="ignore")),
])

# Full preprocessing pipeline
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])


## 6. Random Forest Regressor

In [None]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf_pipeline = Pipeline([
    ("preprocess", full_pipeline),
    ("model", RandomForestRegressor(random_state=42))
])

rf_scores = cross_val_score(rf_pipeline, X_train, y_train,
                            scoring="neg_mean_squared_error", cv=5)
rf_rmse = np.sqrt(-rf_scores)
print("Random Forest CV RMSE:", rf_rmse.mean())


## 7. Linear Regression & Decision Tree

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Linear Regression pipeline
lr_pipeline = Pipeline([
    ("preprocess", full_pipeline),
    ("model", LinearRegression())
])
lr_scores = cross_val_score(lr_pipeline, X_train, y_train,
                            scoring="neg_mean_squared_error", cv=5)
lr_rmse = np.sqrt(-lr_scores)
print("Linear Regression CV RMSE:", lr_rmse.mean())

# Decision Tree pipeline
dt_pipeline = Pipeline([
    ("preprocess", full_pipeline),
    ("model", DecisionTreeRegressor(random_state=42))
])
dt_scores = cross_val_score(dt_pipeline, X_train, y_train,
                            scoring="neg_mean_squared_error", cv=5)
dt_rmse = np.sqrt(-dt_scores)
print("Decision Tree CV RMSE:", dt_rmse.mean())
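
The three cross-validation runs above can be collected into one table for a side-by-side comparison. The sketch below does this in a loop on synthetic regression data from `make_regression` (an assumption; in the notebook the full pipelines and `X_train`, `y_train` would take its place) and reports the spread across folds as well as the mean.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    # The scorer returns negative RMSE, so flip the sign
    scores = -cross_val_score(model, X_demo, y_demo,
                              scoring="neg_root_mean_squared_error", cv=5)
    rows.append({"model": name,
                 "rmse_mean": scores.mean(),
                 "rmse_std": scores.std()})

results = pd.DataFrame(rows).sort_values("rmse_mean").reset_index(drop=True)
print(results)
```

When the estimator passed to `cross_val_score` is a pipeline that includes preprocessing, the imputer and scaler are refit on each training fold, so no statistics leak from the validation fold.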


## 8. Evaluate on Test Set (Random Forest)

In [None]:

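from sklearn.metrics import mean_squared_error

# Fit on the full training set, then score once on the held-out test set
rf_pipeline.fit(X_train, y_train)

final_preds = rf_pipeline.predict(X_test)
# np.sqrt(MSE) works across scikit-learn versions; the `squared=False`
# argument was deprecated in 1.4 and removed in 1.6
final_rmse = np.sqrt(mean_squared_error(y_test, final_preds))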
print("Random Forest Test RMSE:", final_rmse)



## 9. Next Steps & Learning Notes

- Compare the cross-validated RMSE of the three models (Linear Regression, Decision Tree, Random Forest), looking at the spread across folds as well as the mean
- Tune hyperparameters with `GridSearchCV` / `RandomizedSearchCV`
- Inspect the Random Forest's feature importances
- Explore Gradient Boosting / XGBoost / LightGBM, which may perform better
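
As a starting point for the tuning and feature-importance items, here is a minimal sketch using `GridSearchCV` on a pipeline. The data is synthetic, and the column names, the small parameter grid, and the simplified scaler-only preprocessing are all assumptions chosen to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): price driven mostly by age, a bit by mileage
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "car_age": rng.uniform(0, 25, 300),
    "odometer": rng.uniform(0, 250000, 300),
    "noise": rng.normal(size=300),
})
y_demo = (30000 - 900 * X_demo["car_age"] - 0.03 * X_demo["odometer"]
          + rng.normal(0, 500, 300))

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid,
                      scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)

# Importances come from the fitted forest inside the best pipeline
best_forest = search.best_estimator_.named_steps["model"]
importances = pd.Series(best_forest.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)
```

With the notebook's real pipeline, the importances would need to be matched against the names from the preprocessing step's `get_feature_names_out()`, since one-hot encoding expands the column count.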

This completes the full Day 04 notebook 🚀
