# ML Zoomcamp Homework 6

**Dataset**

In this homework, we continue using the fuel efficiency dataset. You can download it with wget:

wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv

The goal of this homework is to create a regression model for predicting car fuel efficiency (column 'fuel_efficiency_mpg').

**Preparing the dataset**

Preparation:

Fill missing values with zeros.
Do train/validation/test split with 60%/20%/20% distribution.
Use the train_test_split function and set the random_state parameter to 1.
Use DictVectorizer(sparse=True) to turn the dataframes into matrices.

In [8]:
import pandas as pd

url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv"
df = pd.read_csv(url)
df.head()

,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
0,170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
1,130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
2,170,,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
3,220,4.0,,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
4,210,1.0,140.0,3460.87099,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369


In [9]:
#2. Fill missing values with zeros
df = df.fillna(0)

In [10]:
# 3. Define features and target
target = 'fuel_efficiency_mpg'
features = [c for c in df.columns if c != target]

X = df[features]
y = df[target]

In [11]:
#Split into train, validation, and test sets (60/20/20)
from sklearn.model_selection import train_test_split

# First split into 60% train and 40% temp
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1)

# Split temp into 20% val and 20% test (i.e., 0.5 * 40%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

In [12]:
from sklearn.feature_extraction import DictVectorizer

train_dicts = X_train.to_dict(orient='records')
val_dicts = X_val.to_dict(orient='records')
test_dicts = X_test.to_dict(orient='records')

dv = DictVectorizer(sparse=True)
X_train_matrix = dv.fit_transform(train_dicts)
X_val_matrix = dv.transform(val_dicts)
X_test_matrix = dv.transform(test_dicts)

In [13]:
X_train_matrix.shape, X_val_matrix.shape, X_test_matrix.shape

((5822, 14), (1941, 14), (1941, 14))
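
As a sanity check on the shapes above, the two-step split can be reproduced with plain arithmetic. This is a minimal sketch, not part of the original notebook: `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up, which is consistent with the row counts printed above.

```python
from math import ceil

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out); the held-out share is rounded up (assumption)."""
    n_held_out = ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# Total row count, taken from the shapes printed above
n_total = 5822 + 1941 + 1941

# First split: 60% train, 40% temp
n_train, n_temp = split_sizes(n_total, 0.4)

# Second split: temp is halved into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)
```

Splitting the 40% remainder in half is what yields the 20%/20% validation and test shares.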

**Question 1**

Let's train a decision tree regressor to predict the fuel_efficiency_mpg variable.

Train a model with max_depth=1.
Which feature is used for splitting the data?

'vehicle_weight'
'model_year'
'origin'
'fuel_type'

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor

# Load and prepare data
df = pd.read_csv("car_fuel_efficiency.csv")
df = df.fillna(0)

target = 'fuel_efficiency_mpg'
features = [c for c in df.columns if c != target]

X = df[features]
y = df[target]

# Split 60/20/20
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

# Convert to dictionaries
dv = DictVectorizer(sparse=True)
X_train_dicts = X_train.to_dict(orient='records')
X_val_dicts = X_val.to_dict(orient='records')

X_train_matrix = dv.fit_transform(X_train_dicts)
X_val_matrix = dv.transform(X_val_dicts)

# Train decision tree
dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train_matrix, y_train)

# Find the feature used for splitting
split_feature = dv.get_feature_names_out()[dt.tree_.feature[0]]
print("Feature used for splitting:", split_feature)


Feature used for splitting: vehicle_weight


**Question 2**

Train a random forest regressor with these parameters:

n_estimators=10
random_state=1
n_jobs=-1 (optional - to make training faster)
What's the RMSE of this model on the validation data?

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

In [20]:
# Train Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=10,
    random_state=1,
    n_jobs=-1
)
rf.fit(X_train_matrix, y_train)

# Make predictions on validation set
y_pred = rf.predict(X_val_matrix)

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print("RMSE for the Validation data:", rmse)

RMSE for the Validation data: 0.4602815367032658
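
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below spells that out in plain Python; the `rmse` helper and the two-point example values are made up for illustration and are not taken from the dataset.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: errors are 1 and -2, so MSE = (1 + 4) / 2 = 2.5
example = rmse([3.0, 5.0], [2.0, 7.0])
print(example)
```

This is the same quantity that `np.sqrt(mean_squared_error(y_val, y_pred))` computes in the cell above.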


**Question 3**

Now let's experiment with the n_estimators parameter.

Try different values of this parameter from 10 to 200 with step 10.
Set random_state to 1.
Evaluate the model on the validation dataset.
After which value of n_estimators does RMSE stop improving? Consider 3 decimal places for calculating the answer.

10
25
80
200
If it doesn't stop improving, use the latest iteration number in your answer.

In [23]:
# Try different n_estimators
rmse_scores = {}

for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train_matrix, y_train)
    y_pred = rf.predict(X_val_matrix)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_scores[n] = round(rmse, 3)

# Display results
for n, score in rmse_scores.items():
    print(f"n_estimators={n}: RMSE={score}")

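# min() returns the first key with the lowest value, i.e. the smallest
# n_estimators that reaches the lowest rounded RMSE
best_n = min(rmse_scores, key=rmse_scores.get)
print("\nBest n_estimators:", best_n)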


n_estimators=10: RMSE=0.46
n_estimators=20: RMSE=0.446
n_estimators=30: RMSE=0.44
n_estimators=40: RMSE=0.438
n_estimators=50: RMSE=0.437
n_estimators=60: RMSE=0.436
n_estimators=70: RMSE=0.436
n_estimators=80: RMSE=0.436
n_estimators=90: RMSE=0.435
n_estimators=100: RMSE=0.435
n_estimators=110: RMSE=0.435
n_estimators=120: RMSE=0.435
n_estimators=130: RMSE=0.435
n_estimators=140: RMSE=0.435
n_estimators=150: RMSE=0.435
n_estimators=160: RMSE=0.435
n_estimators=170: RMSE=0.435
n_estimators=180: RMSE=0.435
n_estimators=190: RMSE=0.435
n_estimators=200: RMSE=0.435

Best n_estimators: 90
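
One way to read "stops improving" is to walk through the scores in order and remember the last value of n_estimators at which the rounded RMSE dropped below the best score seen so far. The sketch below does this with the rounded scores printed above; `last_improvement` is a hypothetical helper name, not something from the notebook.

```python
def last_improvement(scores, decimals=3):
    """Return the last n at which the rounded score improved on the running best."""
    ordered = sorted(scores.items())
    best_n, best_score = ordered[0][0], round(ordered[0][1], decimals)
    for n, score in ordered[1:]:
        score = round(score, decimals)
        if score < best_score:  # strict improvement only; ties do not count
            best_n, best_score = n, score
    return best_n

# Rounded validation RMSE values copied from the output above
rmse_by_n = {10: 0.46, 20: 0.446, 30: 0.44, 40: 0.438, 50: 0.437,
             60: 0.436, 70: 0.436, 80: 0.436}
rmse_by_n.update({n: 0.435 for n in range(90, 201, 10)})

print(last_improvement(rmse_by_n))
```

On these scores the result agrees with the `min`-based lookup in the cell above.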


**Question 4**

Let's select the best max_depth:

Try different values of max_depth: [10, 15, 20, 25]
For each of these values,
try different values of n_estimators from 10 to 200 (with step 10)
calculate the mean RMSE
Fix the random seed: random_state=1
What's the best max_depth, using the mean RMSE?

10
15
20
25

In [34]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# If your dataset has categorical columns like 'origin' or 'fuel_type':
X_train_enc = pd.get_dummies(X_train, drop_first=True)
X_val_enc = pd.get_dummies(X_val, drop_first=True)

# Ensure both sets have same columns
X_val_enc = X_val_enc.reindex(columns=X_train_enc.columns, fill_value=0)

max_depth_values = [10, 15, 20, 25]
n_estimators_values = range(10, 201, 10)

results = {}

for depth in max_depth_values:
    rmse_list = []
    for n in n_estimators_values:
        model = RandomForestRegressor(
            n_estimators=n,
            max_depth=depth,
            random_state=1
        )
        model.fit(X_train_enc, y_train)
        y_pred = model.predict(X_val_enc)
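        # The squared=False argument is deprecated and absent from newer
        # scikit-learn releases, so take the square root explicitly
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))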
        rmse_list.append(rmse)
    mean_rmse = np.mean(rmse_list)
    results[depth] = mean_rmse
    print(f"max_depth={depth}, mean_RMSE={mean_rmse:.3f}")

best_max_depth = min(results, key=results.get)
print(f"\n✅ Best max_depth: {best_max_depth} with mean RMSE={results[best_max_depth]:.3f}")




max_depth=10, mean_RMSE=0.436
max_depth=15, mean_RMSE=0.438
max_depth=20, mean_RMSE=0.438
max_depth=25, mean_RMSE=0.438

✅ Best max_depth: 10 with mean RMSE=0.436




**Question 5**

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm finds the best split. While doing so, it can calculate the "gain": the reduction in impurity achieved by the split. This gain is useful for understanding which features matter most in tree-based models.

In Scikit-Learn, tree-based models expose this information in the feature_importances_ attribute.
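
To make "gain" concrete, here is a simplified sketch of the impurity reduction for a single regression split, using variance as the impurity measure. The `variance_gain` helper and the five toy rows are made up for illustration; a real forest combines many such reductions across all splits and trees and normalizes them, which this sketch does not attempt.

```python
from statistics import pvariance

def variance_gain(y, go_left):
    """Parent variance minus the size-weighted variance of the two children."""
    left = [v for v, is_left in zip(y, go_left) if is_left]
    right = [v for v, is_left in zip(y, go_left) if not is_left]
    n = len(y)
    weighted_children = (len(left) / n) * pvariance(left) \
        + (len(right) / n) * pvariance(right)
    return pvariance(y) - weighted_children

# Toy rows (not from the dataset): heavier cars get lower mpg
weight = [2000, 2500, 3000, 3500, 4000]
mpg = [20.0, 18.0, 15.0, 13.0, 10.0]

# Candidate split: vehicle weight <= 2750 goes to the left child
gain = variance_gain(mpg, [w <= 2750 for w in weight])
print(round(gain, 3))
```

A split that separates high and low targets well leaves little variance in each child, so its gain is large.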

For this homework question, we'll find the most important feature:

Train the model with these parameters:
n_estimators=10,
max_depth=20,
random_state=1,
n_jobs=-1 (optional)
Get the feature importance information from this model
What's the most important feature (among these 4)?

vehicle_weight
horsepower
acceleration
engine_displacement

In [36]:
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Encode categorical features if not already done
X_train_enc = pd.get_dummies(X_train, drop_first=True)
X_val_enc = pd.get_dummies(X_val, drop_first=True)
X_val_enc = X_val_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Train the model
rf = RandomForestRegressor(
    n_estimators=10,
    max_depth=20,
    random_state=1,
    n_jobs=-1
)
rf.fit(X_train_enc, y_train)

# Get feature importances
importances = pd.Series(rf.feature_importances_, index=X_train_enc.columns)
importances.sort_values(ascending=False).head(10)


vehicle_weight                  0.959935
horsepower                      0.016012
acceleration                    0.011526
engine_displacement             0.003235
model_year                      0.003189
num_cylinders                   0.002366
num_doors                       0.001613
drivetrain_Front-wheel drive    0.000556
fuel_type_Gasoline              0.000555
origin_Europe                   0.000511
dtype: float64

**most important feature:** vehicle_weight

**Question 6**

Now let's train an XGBoost model! For this question, we'll tune the eta parameter:

Install XGBoost
Create DMatrix for train and validation
Create a watchlist
Train a model with these parameters for 100 rounds:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
Now change eta from 0.3 to 0.1.

Which eta leads to the best RMSE score on the validation dataset?

0.3
0.1
Both give equal value

In [43]:
!pip install xgboost


In [45]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Create DMatrices for train and validation
dtrain = xgb.DMatrix(X_train_enc, label=y_train)
dval = xgb.DMatrix(X_val_enc, label=y_val)

watchlist = [(dtrain, 'train'), (dval, 'val')]


In [47]:
#(a) With eta = 0.3
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1
}

model_03 = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=False)
y_pred_03 = model_03.predict(dval)
# squared=False is deprecated in newer scikit-learn; take the square root explicitly
rmse_03 = np.sqrt(mean_squared_error(y_val, y_pred_03))
print("RMSE (eta=0.3):", rmse_03)


RMSE (eta=0.3): 0.4386587759933081




In [49]:
#(b) With eta = 0.1
xgb_params['eta'] = 0.1

model_01 = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=False)
y_pred_01 = model_01.predict(dval)
# squared=False is deprecated in newer scikit-learn; take the square root explicitly
rmse_01 = np.sqrt(mean_squared_error(y_val, y_pred_01))
print("RMSE (eta=0.1):", rmse_01)


RMSE (eta=0.1): 0.41855170013454596




**eta = 0.1 leads to the best RMSE score on the validation dataset.**
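
In XGBoost, eta is the learning rate: each new tree's contribution is scaled by it before being added to the running prediction. The toy sketch below is not XGBoost and uses made-up targets. It replaces trees with the simplest possible learner (the mean of the current residuals) to show how eta controls the step size per round. `boost_constant` is a hypothetical helper.

```python
def boost_constant(y, eta, rounds):
    """Toy boosting: each round fits the mean residual and adds eta times it."""
    pred = 0.0
    history = []
    for _ in range(rounds):
        mean_residual = sum(v - pred for v in y) / len(y)
        pred += eta * mean_residual  # shrink the step by the learning rate
        history.append(pred)
    return history

# Made-up targets with mean 14
y = [10.0, 14.0, 18.0]
fast = boost_constant(y, eta=0.3, rounds=10)
slow = boost_constant(y, eta=0.1, rounds=10)

print(round(fast[-1], 2), round(slow[-1], 2))
```

A smaller eta approaches the target more slowly and typically needs more rounds. With real trees, smaller steps can also reduce overfitting, which would be consistent with eta = 0.1 scoring better after 100 rounds here.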