# Module 6: Trees Homework (2025)

This notebook follows the HW06 brief end-to-end and prints the values you need to map to the multiple-choice questions. Each section explains the steps and why the results make sense.

Dataset: car fuel efficiency (target: `fuel_efficiency_mpg`).


## 1. Setup

In [7]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

pd.set_option('display.float_format', lambda x: f'{x:.6f}')

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


## 2. Load dataset

The source URL is provided by the course. If the download fails (e.g., due to network restrictions), save the file as `data/car_fuel_efficiency.csv` and replace the URL below with that local path.
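
The fallback described above can be automated. The following is a minimal sketch, assuming the local copy lives at `data/car_fuel_efficiency.csv`; the helper name `resolve_source` is hypothetical and not part of the course material.

```python
from pathlib import Path

URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
LOCAL_PATH = Path('data/car_fuel_efficiency.csv')  # assumed location of the saved copy

def resolve_source(url, local_path):
    """Prefer a local copy when it exists; otherwise fall back to the URL."""
    local_path = Path(local_path)
    return str(local_path) if local_path.is_file() else url

source = resolve_source(URL, LOCAL_PATH)
print(source)  # this string can be passed to pd.read_csv(source, usecols=cols)
```

With this in place the loading cell does not need to be edited by hand when the network is unavailable.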


In [8]:
URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
cols = [
    'engine_displacement','num_cylinders','horsepower','vehicle_weight','acceleration',
    'model_year','origin','fuel_type','drivetrain','num_doors','fuel_efficiency_mpg'
]

df = pd.read_csv(URL, usecols=cols)
df.head()


| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


## 3. Preparation

- Fill missing values in numeric features with 0.0 (tree-based models split on thresholds, so no rescaling is needed afterwards).
- Fill missing categorical values with 'NA'.
- Split train/val/test: 60/20/20 with `random_state=1`.
- Convert to dictionaries and then to sparse matrices via `DictVectorizer(sparse=True)`.
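
To make the dictionary-to-matrix step concrete, here is a pure-Python sketch of the encoding idea: numeric values keep their column, and each string value becomes its own `name=value` indicator column, with column names sorted. The helpers `fit_feature_names` and `transform` are hypothetical names for illustration; this is not scikit-learn's implementation, and it returns dense lists rather than a sparse matrix.

```python
def column_key(name, value):
    """String values get a one-hot column 'name=value'; numbers keep 'name'."""
    return f'{name}={value}' if isinstance(value, str) else name

def fit_feature_names(records):
    """Collect the sorted set of column names seen in the training records."""
    return sorted({column_key(k, v) for rec in records for k, v in rec.items()})

def transform(records, names):
    """Encode records as dense rows; categories unseen during fit are ignored."""
    index = {n: i for i, n in enumerate(names)}
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for k, v in rec.items():
            key = column_key(k, v)
            if key in index:
                row[index[key]] = 1.0 if isinstance(v, str) else float(v)
        rows.append(row)
    return rows

train = [
    {'origin': 'USA', 'horsepower': 97.0},
    {'origin': 'Europe', 'horsepower': 159.0},
]
names = fit_feature_names(train)
print(names)                    # ['horsepower', 'origin=Europe', 'origin=USA']
print(transform(train, names))  # [[97.0, 0.0, 1.0], [159.0, 1.0, 0.0]]
```

Fitting the column names on the training records only, then reusing them for validation, mirrors the `fit_transform` / `transform` split used in the cell below.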


In [9]:
target = 'fuel_efficiency_mpg'
features = [c for c in df.columns if c != target]

num_cols = [c for c in features if pd.api.types.is_numeric_dtype(df[c])]
cat_cols = [c for c in features if c not in num_cols]

# Impute
X_df = df[features].copy()
for c in num_cols:
    X_df[c] = X_df[c].astype('float64').fillna(0.0)
for c in cat_cols:
    X_df[c] = X_df[c].astype('object').fillna('NA')

y = df[target].values

# Split 60/20/20
X_train_df, X_temp_df, y_train, y_temp = train_test_split(
    X_df, y, test_size=0.4, random_state=1
)
X_val_df, X_test_df, y_val, y_test = train_test_split(
    X_temp_df, y_temp, test_size=0.5, random_state=1
)

# DictVectorizer sparse
train_dicts = X_train_df.to_dict(orient='records')
val_dicts = X_val_df.to_dict(orient='records')

dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)
feature_names = dv.get_feature_names_out()
len(feature_names), feature_names[:10]


(14,
 array(['acceleration', 'drivetrain=All-wheel drive',
        'drivetrain=Front-wheel drive', 'engine_displacement',
        'fuel_type=Diesel', 'fuel_type=Gasoline', 'horsepower',
        'model_year', 'num_cylinders', 'num_doors'], dtype=object))

## 4. Q1: Decision Tree (max_depth=1)
We fit a depth-1 tree and inspect the root split feature (node 0). For one-hot encoded categoricals, we map a `feature_name` like `origin=Asia` to its base feature `origin`.
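
A depth-1 regression tree makes a single split, chosen to minimise the squared error of the two resulting groups. The sketch below illustrates that search in pure Python on made-up toy data (the four rows are hypothetical, not taken from the dataset); `best_stump` is an illustrative helper and only approximates what `DecisionTreeRegressor` does internally.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_stump(X, y, names):
    """Try every midpoint threshold on every feature; keep the lowest total SSE."""
    best = None
    for j, name in enumerate(names):
        levels = sorted({row[j] for row in X})
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (name, thr, total)
    return best

# Hypothetical toy data: heavier cars have lower mpg, cylinders are uninformative
names = ['vehicle_weight', 'num_cylinders']
X = [[2000.0, 4.0], [2500.0, 6.0], [3000.0, 4.0], [3500.0, 6.0]]
y = [20.0, 18.0, 14.0, 12.0]
print(best_stump(X, y, names))  # ('vehicle_weight', 2750.0, 4.0)
```

The feature returned by this search plays the same role as the root split feature read from `tree.tree_.feature[0]`.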


In [10]:
tree = DecisionTreeRegressor(max_depth=1, random_state=1)
tree.fit(X_train, y_train)

# Root node feature index
root_idx = int(tree.tree_.feature[0])
root_name = feature_names[root_idx]
base_name = root_name.split('=')[0]
root_name, base_name


('vehicle_weight', 'vehicle_weight')

The base feature identifies which column the root split uses, even if the actual split is on a specific category.


## 5. Q2: Random Forest (10 trees)
Train a small forest with `n_estimators=10` and evaluate RMSE on the validation set.
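
As a reminder of what the metric measures, here is a small worked example of RMSE on hypothetical numbers. `rmse_plain` is an illustrative pure-Python restatement of the notebook's `rmse` helper, named differently so it does not shadow it.

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical targets and predictions; every prediction is off by exactly 0.5 mpg
y_true = [14.0, 16.0, 12.0, 13.0]
y_pred = [14.5, 15.5, 12.5, 12.5]
print(rmse_plain(y_true, y_pred))  # 0.5
```

Because RMSE is in the same units as the target, a validation RMSE can be read directly as a typical error in mpg.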


In [11]:
rf10 = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf10.fit(X_train, y_train)
val_pred10 = rf10.predict(X_val)
rmse_rf10 = rmse(y_val, val_pred10)
rmse_rf10, round(rmse_rf10, 2)


(0.4602815367032658, 0.46)

## 6. Q3: When does RMSE stop improving (by 3 decimals)?
We scan `n_estimators` from 10 to 200 (step 10), keep the best (rounded to 3 decimals), and report the last `n` where a new best appears. If improvement continues to 200, we answer 200.


In [12]:
ns = list(range(10, 201, 10))
rmse_track = []
best = None
last_new_best_n = None

for n in ns:
    m = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    m.fit(X_train, y_train)
    p = m.predict(X_val)
    r = rmse(y_val, p)
    rmse_track.append((n, r))
    r3 = round(r, 3)
    if best is None or r3 < best:
        best = r3
        last_new_best_n = n

rmse_track[:5], best, last_new_best_n


([(10, 0.4602815367032659),
  (20, 0.4461567458911003),
  (30, 0.4397780761280069),
  (40, 0.4383939265191819),
  (50, 0.4371703249467452)],
 0.435,
 90)

The `last_new_best_n` value is the answer for Q3 (considering 3 decimal places).
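
The "last new best" rule can be checked in isolation. The sketch below applies it to the rounded RMSE track that this notebook reports in its summary; `last_improvement` is a hypothetical helper name.

```python
def last_improvement(track, ndigits=3):
    """Return (best rounded value, n at which that best was first reached)."""
    best = None
    last_n = None
    for n, value in track:
        rounded = round(value, ndigits)
        # Only a strictly lower rounded value counts as an improvement
        if best is None or rounded < best:
            best, last_n = rounded, n
    return best, last_n

# Rounded validation RMSE per n_estimators, as reported in the summary section
track = [
    (10, 0.46), (20, 0.446), (30, 0.44), (40, 0.438), (50, 0.437),
    (60, 0.436), (70, 0.436), (80, 0.436), (90, 0.435),
] + [(n, 0.435) for n in range(100, 201, 10)]

print(last_improvement(track))  # (0.435, 90)
```

Rounding before comparing is what makes the rule ignore changes beyond the third decimal place.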


## 7. Q4: Best max_depth by mean RMSE
For each depth in `[10, 15, 20, 25]`, compute RMSE for each `n_estimators` ∈ {10, 20, ..., 200} and take the mean. Select the depth with the lowest mean RMSE.


In [13]:
depths = [10, 15, 20, 25]
mean_rmse = {}
for d in depths:
    rs = []
    for n in ns:
        m = RandomForestRegressor(n_estimators=n, max_depth=d, random_state=1, n_jobs=-1)
        m.fit(X_train, y_train)
        p = m.predict(X_val)
        rs.append(rmse(y_val, p))
    mean_rmse[d] = float(np.mean(rs))

mean_rmse, min(mean_rmse, key=mean_rmse.get)


({10: 0.43624733022811624,
  15: 0.4378245115127723,
  20: 0.43769343549884143,
  25: 0.43765343428485853},
 10)
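
To see how close the candidates are, the following sketch takes the mean RMSE values printed above and reports each depth's gap to the best one. It only post-processes the numbers already shown and does not retrain anything.

```python
# Mean validation RMSE per max_depth, copied from the output above
mean_rmse = {
    10: 0.43624733022811624,
    15: 0.4378245115127723,
    20: 0.43769343549884143,
    25: 0.43765343428485853,
}

best_depth = min(mean_rmse, key=mean_rmse.get)
# Gap of each depth to the best mean RMSE, rounded to 4 decimals
gaps = {d: round(v - mean_rmse[best_depth], 4) for d, v in mean_rmse.items()}
print(best_depth, gaps)
```

The gaps are on the order of 0.001 to 0.002, so depth 10 wins by a small margin on this validation split.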

## 8. Q5: Feature importances (RF)
Train RF with `n_estimators=10, max_depth=20` and inspect `feature_importances_`. We aggregate importances per base feature (summing the one-hot columns of each categorical) and then compare the four numeric candidates.
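
The aggregation matters mostly for one-hot encoded categoricals, whose importance is spread over several columns. Here is a small sketch with made-up importance values (hypothetical, chosen only to illustrate the summing):

```python
from collections import defaultdict

# Hypothetical per-column importances; they are NOT taken from the fitted model
feature_names = ['origin=Asia', 'origin=Europe', 'origin=USA', 'vehicle_weight']
importances = [0.02, 0.03, 0.05, 0.90]

agg = defaultdict(float)
for name, weight in zip(feature_names, importances):
    base = name.split('=')[0]  # 'origin=Asia' -> 'origin'
    agg[base] += weight

# Rank base features from most to least important
ranked = sorted(agg.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Numeric features have no `=` in their name, so they pass through the aggregation unchanged.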


In [19]:
rf_imp = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf_imp.fit(X_train, y_train)
imp = rf_imp.feature_importances_

# Aggregate by base feature (important for categoricals with one-hot; numeric are one column)
from collections import defaultdict
agg = defaultdict(float)
for name, w in zip(feature_names, imp):
    base = name.split('=')[0]
    agg[base] += float(w)

candidates = ['vehicle_weight','horsepower','acceleration','engine_displacement']
subset = {k: agg.get(k, 0.0) for k in candidates}
subset, max(subset, key=subset.get)


({'vehicle_weight': 0.9598782143148441,
  'horsepower': 0.01593348148976617,
  'acceleration': 0.01144231373523756,
  'engine_displacement': 0.0031594240303503144},
 'vehicle_weight')

## 9. Q6: XGBoost (eta tuning)
We compare `eta=0.3` vs `eta=0.1` for 100 boosting rounds.
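
The role of `eta` (the learning rate) can be illustrated with a deliberately simplified boosting loop. In this sketch the weak learner is just a constant equal to the mean residual, and each round adds `eta` times that constant to the prediction. This is an idealised illustration of shrinkage only: XGBoost fits regularised trees to gradients and starts from a base score, none of which is modelled here. `boosted_mean` is a hypothetical helper.

```python
def boosted_mean(y, eta, rounds):
    """Boost a constant weak learner; return the prediction after `rounds`."""
    pred = 0.0
    for _ in range(rounds):
        # The weak learner fits the residuals with a single constant
        mean_residual = sum(t - pred for t in y) / len(y)
        # Shrinkage: only a fraction eta of the correction is applied
        pred += eta * mean_residual
    return pred

y = [10.0, 14.0, 18.0]  # hypothetical targets with mean 14.0
for eta in (0.3, 0.1):
    print(eta, round(boosted_mean(y, eta, rounds=10), 3))
```

In this toy setting the remaining gap to the target mean shrinks by a factor of `1 - eta` per round, so the smaller `eta` needs more rounds to get equally close. Which value gives the better validation RMSE after 100 rounds on the real data is what the cells below measure.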

Install and train:
```bash
pip install xgboost
```
If installation is blocked, skip execution and answer later; otherwise run the cell below.


In [20]:
# If xgboost is unavailable, skip this cell gracefully
try:
    import xgboost as xgb

    dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
    dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_names)
    watchlist = [(dtrain, 'train'), (dval, 'val')]

    def run_xgb(eta):
        params = {
            'eta': eta,
            'max_depth': 6,
            'min_child_weight': 1,
            'objective': 'reg:squarederror',
            'nthread': 8,
            'seed': 1,
            'verbosity': 0,
        }
        model = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)
        val_pred = model.predict(dval)
        return rmse(y_val, val_pred)

    rmse_03 = run_xgb(0.3)
    rmse_01 = run_xgb(0.1)

    # A bare expression inside `try` is not displayed by the notebook, so print it
    print((round(rmse_03, 3), round(rmse_01, 3)), ('0.3' if rmse_03 < rmse_01 else '0.1' if rmse_01 < rmse_03 else 'equal'))
except Exception as e:
    print('xgboost not available or failed to run:', e)
    rmse_03 = rmse_01 = None


xgboost not available or failed to run: Expecting a sequence of strings for feature names, got: <class 'numpy.ndarray'>

The first attempt fails because this version of `xgb.DMatrix` expects `feature_names` as a sequence of strings, while `dv.get_feature_names_out()` returns a NumPy array. The next cell converts the names to a plain list and reruns the same experiment.


In [21]:
try:
    import xgboost as xgb

    names = list(feature_names)  # ensure it's a list

    dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=names)
    dval   = xgb.DMatrix(X_val,   label=y_val,   feature_names=names)
    watchlist = [(dtrain, 'train'), (dval, 'val')]

    def run_xgb(eta):
        params = {
            'eta': eta,
            'max_depth': 6,
            'min_child_weight': 1,
            'objective': 'reg:squarederror',
            'nthread': 8,
            'seed': 1,
            'verbosity': 0,
        }
        model = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)
        val_pred = model.predict(dval)
        return rmse(y_val, val_pred)

    rmse_03 = run_xgb(0.3)
    rmse_01 = run_xgb(0.1)
    # A bare expression inside `try` is not displayed by the notebook, so print it
    print((round(rmse_03, 3), round(rmse_01, 3)), ('0.3' if rmse_03 < rmse_01 else '0.1' if rmse_01 < rmse_03 else 'equal'))
except Exception as e:
    print('xgboost not available or failed to run:', e)
    rmse_03 = rmse_01 = None

[0]	train-rmse:1.83282	val-rmse:1.82567
[1]	train-rmse:1.33231	val-rmse:1.32771
[2]	train-rmse:0.99034	val-rmse:0.99257
[3]	train-rmse:0.76090	val-rmse:0.76897
[4]	train-rmse:0.61110	val-rmse:0.62742
[5]	train-rmse:0.51643	val-rmse:0.54010
[6]	train-rmse:0.45800	val-rmse:0.48954
[7]	train-rmse:0.42172	val-rmse:0.46026
[8]	train-rmse:0.39836	val-rmse:0.44332
[9]	train-rmse:0.38494	val-rmse:0.43456
[10]	train-rmse:0.37400	val-rmse:0.43004
[11]	train-rmse:0.36596	val-rmse:0.42696
[12]	train-rmse:0.36050	val-rmse:0.42569
[13]	train-rmse:0.35549	val-rmse:0.42519
[14]	train-rmse:0.35143	val-rmse:0.42455
[15]	train-rmse:0.34792	val-rmse:0.42450
[16]	train-rmse:0.34533	val-rmse:0.42478
[17]	train-rmse:0.34356	val-rmse:0.42472
[18]	train-rmse:0.34129	val-rmse:0.42472
[19]	train-rmse:0.33846	val-rmse:0.42502
[20]	train-rmse:0.33724	val-rmse:0.42509
[21]	train-rmse:0.33463	val-rmse:0.42538
[22]	train-rmse:0.33260	val-rmse:0.42543
[23]	train-rmse:0.33023	val-rmse:0.42611
[24]	train-rmse:0.32738	va

## 10. Summary for submission
This prints the exact values you need for the multiple-choice form.


In [18]:
summary = {}
try:
    summary['Q1_root_feature'] = {'full': root_name, 'base': base_name}
except NameError:
    pass
try:
    summary['Q2_rmse_rf10'] = round(rmse_rf10, 3)
except NameError:
    pass
try:
    summary['Q3_last_new_best_n'] = last_new_best_n
    summary['Q3_rmse_track'] = [(n, round(r,3)) for n, r in rmse_track]
except NameError:
    pass
try:
    summary['Q4_mean_rmse'] = {k: round(v, 3) for k, v in mean_rmse.items()}
    summary['Q4_best_depth'] = min(mean_rmse, key=mean_rmse.get)
except NameError:
    pass
try:
    summary['Q5_importances'] = {k: round(v, 4) for k, v in subset.items()}
    summary['Q5_most_important'] = max(subset, key=subset.get)
except NameError:
    pass
try:
    if rmse_03 is not None:
        summary['Q6_xgb_rmse'] = {'eta_0.3': round(rmse_03, 3), 'eta_0.1': round(rmse_01, 3)}
        summary['Q6_best'] = '0.3' if rmse_03 < rmse_01 else '0.1' if rmse_01 < rmse_03 else 'equal'
except NameError:
    pass
summary


{'Q1_root_feature': {'full': 'vehicle_weight', 'base': 'vehicle_weight'},
 'Q2_rmse_rf10': 0.46,
 'Q3_last_new_best_n': 90,
 'Q3_rmse_track': [(10, 0.46),
  (20, 0.446),
  (30, 0.44),
  (40, 0.438),
  (50, 0.437),
  (60, 0.436),
  (70, 0.436),
  (80, 0.436),
  (90, 0.435),
  (100, 0.435),
  (110, 0.435),
  (120, 0.435),
  (130, 0.435),
  (140, 0.435),
  (150, 0.435),
  (160, 0.435),
  (170, 0.435),
  (180, 0.435),
  (190, 0.435),
  (200, 0.435)],
 'Q4_mean_rmse': {10: 0.436, 15: 0.438, 20: 0.438, 25: 0.438},
 'Q4_best_depth': 10,
 'Q5_importances': {'vehicle_weight': 0.9599,
  'horsepower': 0.0159,
  'acceleration': 0.0114,
  'engine_displacement': 0.0032},
 'Q5_most_important': 'vehicle_weight'}