# BA5c: Challenge Notebook – Predictive Modeling of Vehicle Fuel Efficiency

In this notebook, you will work on predicting **Fuel Efficiency (FE)** using real-world data from vehicles manufactured in 2010.

This is an advanced extension of previous notebooks (BA5, BA5a, BA5b). You will:
- Prepare and explore the dataset
- Train multiple predictive models
- Evaluate performance using RMSE
- Conduct model diagnostics
- Take on challenges to explore model robustness and parsimony

**Target Variable:** `FE`  
**Dataset:** `cars2010.csv`


## 1. Load and Preview the Dataset

In [None]:
import pandas as pd

cars_df = pd.read_csv("/mnt/data/cars2010.csv")
cars_df.head()

## 2. Explore Target and Predictors

In [None]:
print("Shape:", cars_df.shape)
print("\nMissing values:")
print(cars_df.isnull().sum())

print("\nTarget Variable Summary (FE):")
print(cars_df['FE'].describe())

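# DataFrame.hist() plots numeric columns only; any non-numeric column in the
# selection (possibly AirAspirationMethod, if it is stored as text) is skipped.
cars_df[['FE', 'EngDispl', 'NumCyl', 'AirAspirationMethod']].hist(figsize=(10,6))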
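
Challenge 1 in Section 5 refers to high-cardinality categorical variables, so it can help to count the distinct levels of each text column while exploring. The sketch below runs on a small made-up frame: the rows are illustrative, not taken from `cars2010.csv`, and the column names `Transmission` and `DriveDesc` are assumptions about the file. On the real data, the same chain would be applied to `cars_df`.

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration only.
toy_df = pd.DataFrame({
    "FE": [28.0, 30.2, 25.6, 27.1],
    "EngDispl": [4.7, 4.7, 4.2, 4.2],
    "Transmission": ["AM6", "M6", "M6", "AM7"],  # assumed column name
    "DriveDesc": ["RWD", "RWD", "AWD", "AWD"],   # assumed column name
})

# Count distinct levels in every non-numeric column, largest first.
cardinality = (
    toy_df.select_dtypes(exclude="number")
    .nunique()
    .sort_values(ascending=False)
)
print(cardinality)
```

Columns near the top of this listing produce the most one-hot columns after encoding, which makes them natural candidates for the drop-and-retrain experiment.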


## 3. Data Preparation: Encoding, Feature Selection

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

X = cars_df.drop(columns=["FE"])
y = cars_df["FE"]

# Separate categorical and numeric columns.
# Selecting on 'number' avoids missing columns stored as int32, bool, or category.
categorical = X.select_dtypes(exclude='number').columns.tolist()
numeric = X.select_dtypes(include='number').columns.tolist()

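# Build a preprocessing pipeline: one-hot encode the categorical columns and
# pass the numeric columns through unchanged.
# handle_unknown='ignore' keeps predict() from failing when the validation set
# contains a level that never appeared in training; such a level is encoded as
# all zeros, the same as the dropped reference level (scikit-learn warns about this).
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical)
], remainder='passthrough')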

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)


## 4. Fit and Evaluate Linear Regression

In [None]:
model = Pipeline([
    ("prep", preprocessor),
    ("lr", LinearRegression())
])
model.fit(X_train, y_train)

y_pred = model.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse:.2f}")
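
An RMSE value is easier to judge next to a naive reference. The sketch below uses made-up numbers (not results from this dataset) to write out the RMSE computation by hand and to compare it with a mean-only baseline from scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up target values and predictions, for illustration only.
y_true = np.array([30.0, 25.0, 35.0, 40.0])
y_hat = np.array([28.0, 27.0, 36.0, 37.0])

# RMSE written out: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# Baseline that ignores the features and always predicts the mean target.
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyRegressor(strategy="mean").fit(X_dummy, y_true)
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline.predict(X_dummy)))

print(f"model RMSE:    {rmse_manual:.2f}")
print(f"baseline RMSE: {rmse_baseline:.2f}")
```

In the notebook itself, the baseline would be fit on `X_train, y_train` and scored on `X_valid, y_valid`, so that both numbers come from the same validation rows.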


## 🧪 5. Challenges

Try the following extensions:

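1. Drop one of the high-cardinality categorical variables and retrain the model. Does the validation RMSE improve or get worse?
2. Fit `Ridge` and `Lasso` models and compare their coefficients and validation RMSEs with those of the linear regression.
3. Add a **polynomial feature** (e.g., `EngDispl^2`) to capture nonlinear effects.
4. Which predictors appear to be most influential? Inspect `.coef_` on the fitted regression step (e.g., `model.named_steps["lr"].coef_`), keeping in mind that raw coefficients are only comparable across predictors measured on the same scale.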
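
As a starting point for challenges 2 to 4, here is a minimal sketch on synthetic data rather than on `cars2010.csv`. The data-generating formula, the `alpha` values, and all variable names are assumptions chosen for illustration; the predictors are standardized so that the coefficients can be compared with each other.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(1.0, 6.0, n)    # stand-in for a predictor such as EngDispl
noise_col = rng.normal(size=n)   # predictor unrelated to the target
# Synthetic target with a curved dependence on x1 (assumed form).
y = 50.0 - 8.0 * x1 + 0.6 * x1**2 + rng.normal(scale=1.0, size=n)

# Columns: x1, its square (the polynomial feature), and the noise predictor.
X = np.column_stack([x1, x1**2, noise_col])
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # alpha values are arbitrary starting points
    "lasso": Lasso(alpha=0.5),
}
results = {}
for name, est in models.items():
    # Standardize first so penalties and coefficients share one scale.
    pipe = make_pipeline(StandardScaler(), est)
    pipe.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_va, pipe.predict(X_va)))
    coefs = pipe[-1].coef_
    results[name] = (rmse, coefs)
    print(f"{name:5s} RMSE={rmse:.2f} coef={np.round(coefs, 2)}")
```

On the real data, the same loop could wrap the notebook's `preprocessor` instead of `StandardScaler` alone, and `alpha` would normally be tuned (for example with cross-validation) rather than fixed.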


## ✅ Summary

- You applied predictive modeling techniques to a new domain.
- You explored categorical encoding and model diagnostics.
- You compared baseline regression performance and took on custom modeling challenges.
