## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.
> If it's exactly in between two options, select the higher value.


### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'>here</a>.

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).



### Preparing the dataset 

Preparation:

1) Fill missing values with zeros.
2) Do train/validation/test split with 60%/20%/20% distribution. 
   <br>(Use the `train_test_split` function and set the `random_state` parameter to 1.)</b>
3) Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [1]:
path = 'wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'

In [2]:
!wget $path

--2025-11-02 11:12:42--  http://wget/
Resolving wget (wget)... failed: Name or service not known.
wget: unable to resolve host address ‘wget’
--2025-11-02 11:12:42--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 874188 (854K) [text/plain]
Saving to: ‘car_fuel_efficiency.csv.6’


2025-11-02 11:12:42 (295 MB/s) - ‘car_fuel_efficiency.csv.6’ saved [874188/874188]

FINISHED --2025-11-02 11:12:42--
Total wall clock time: 0.2s
Downloaded: 1 files, 854K in 0.003s (295 MB/s)


In [3]:
!pip install xgboost


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [35]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import xgboost as xgb

from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.tree import export_text
from IPython.display import display

In [5]:
df = pd.read_csv('car_fuel_efficiency.csv')
df.head()

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
0,170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
1,130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
2,170,,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
3,220,4.0,,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
4,210,1.0,140.0,3460.87099,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369


#### 1) Fill missing values with zeros.

In [6]:
# check missing values
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

In [7]:
# filling the missing values with zeros
df= df.fillna(0)

In [8]:
# check missing values
df.isnull().sum()

engine_displacement    0
num_cylinders          0
horsepower             0
vehicle_weight         0
acceleration           0
model_year             0
origin                 0
fuel_type              0
drivetrain             0
num_doors              0
fuel_efficiency_mpg    0
dtype: int64

Column number of doors (num_doors) might look numeric (e.g., 2, 3, 4, 5), but its numeric distance has no meaning — e.g., “5 doors” isn’t 2.5× “2 doors”.
It should be treated as categorical.

In order for the encoder to create dummy variables this column needs to be converted to a string by example

In [9]:
df['num_doors'] = df['num_doors'].astype(str)

In [10]:
df.dtypes

engine_displacement      int64
num_cylinders          float64
horsepower             float64
vehicle_weight         float64
acceleration           float64
model_year               int64
origin                  object
fuel_type               object
drivetrain              object
num_doors               object
fuel_efficiency_mpg    float64
dtype: object

In [11]:
# creation of numerical and categorical columns lists
cat_col = list(df.dtypes[df.dtypes=='O'].index)
cat_col

['origin', 'fuel_type', 'drivetrain', 'num_doors']

In [12]:
num_col = list(df.dtypes[df.dtypes!='O'].index)
num_col

['engine_displacement',
 'num_cylinders',
 'horsepower',
 'vehicle_weight',
 'acceleration',
 'model_year',
 'fuel_efficiency_mpg']

#### 2) Do train/validation/test split with 60%/20%/20% distribution.
 Use the train_test_split function and set the random_state parameter to 1.

In [13]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1 )

In [14]:
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

In [15]:
# the target variable is 'fuel_efficiency_mpg'
y_train = df_train.fuel_efficiency_mpg.values
y_test = df_test.fuel_efficiency_mpg.values
y_val = df_val.fuel_efficiency_mpg.values

In [16]:
#delete the y from the datafram in order to avoid future mistakes
#better to do right away to not forget and by accident use it for training
del df_train['fuel_efficiency_mpg']
del df_test['fuel_efficiency_mpg']
del df_val['fuel_efficiency_mpg']

#### 4) Use DictVectorizer(sparse=True) to turn the dataframes into matrices.

In [17]:
train_dicts = df_train.to_dict(orient='records')
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(train_dicts)
feature_names = dv.get_feature_names_out()

In [18]:
val_dicts = df_val.to_dict(orient='records')
X_val = dv.transform(val_dicts)
feature_names = dv.get_feature_names_out()

## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


* `'vehicle_weight'`
* `'model_year'`
* `'origin'`
* `'fuel_type'`

In [19]:
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

0,1,2
,criterion,'squared_error'
,splitter,'best'
,max_depth,1
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,
,random_state,
,max_leaf_nodes,
,min_impurity_decrease,0.0


In [20]:
tree_rules = export_text(dt, feature_names=list(feature_names))
print(tree_rules)

|--- vehicle_weight <= 3022.11
|   |--- value: [16.88]
|--- vehicle_weight >  3022.11
|   |--- value: [12.94]



## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 0.045
* 0.45
* 4.5
* 45.0

In [21]:
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)

In [22]:
rf.fit(X_train,y_train)

0,1,2
,n_estimators,10
,criterion,'squared_error'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,1.0
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


## Question 3

Now let's experiment with the `n_estimators` parameter

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

- 10
- 25
- 80
- 200

If it doesn't stop improving, use the latest iteration number in
your answer.

In [24]:
estimators = list(np.arange(10,210,10))

In [25]:
rmse_dict = {}
for i in estimators:
    rf = RandomForestRegressor(n_estimators=i, random_state=1, n_jobs=-1)
    rf.fit(X_train,y_train)
    y_pred = rf.predict(X_val)
    rmse_value = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_dict[i]=rmse_value

In [26]:
rmse_dict.items()

dict_items([(np.int64(10), np.float64(0.45861437168610425)), (np.int64(20), np.float64(0.45333349501967724)), (np.int64(30), np.float64(0.4507313300709125)), (np.int64(40), np.float64(0.44765683890153896)), (np.int64(50), np.float64(0.4460532825375533)), (np.int64(60), np.float64(0.4451616679632915)), (np.int64(70), np.float64(0.4446432784213396)), (np.int64(80), np.float64(0.4448780343893129)), (np.int64(90), np.float64(0.44434050740741)), (np.int64(100), np.float64(0.4437825252946082)), (np.int64(110), np.float64(0.44315476925344055)), (np.int64(120), np.float64(0.4434152440879842)), (np.int64(130), np.float64(0.4431901624110538)), (np.int64(140), np.float64(0.44301975080434763)), (np.int64(150), np.float64(0.44267458182198255)), (np.int64(160), np.float64(0.442253365222632)), (np.int64(170), np.float64(0.44255926764462034)), (np.int64(180), np.float64(0.44221387317449706)), (np.int64(190), np.float64(0.4425841324092249)), (np.int64(200), np.float64(0.4426035616542575))])

In [27]:
sorted_items = sorted(rmse_dict.items(), key=lambda item: item[1], reverse=True)

# sorted_items is now a list of tuples: [('banana', 3), ('apple', 2), ('cherry', 1)]

# To create a new dictionary from the sorted list
sorted_dict = dict(sorted_items)
sorted_dict

{np.int64(10): np.float64(0.45861437168610425),
 np.int64(20): np.float64(0.45333349501967724),
 np.int64(30): np.float64(0.4507313300709125),
 np.int64(40): np.float64(0.44765683890153896),
 np.int64(50): np.float64(0.4460532825375533),
 np.int64(60): np.float64(0.4451616679632915),
 np.int64(80): np.float64(0.4448780343893129),
 np.int64(70): np.float64(0.4446432784213396),
 np.int64(90): np.float64(0.44434050740741),
 np.int64(100): np.float64(0.4437825252946082),
 np.int64(120): np.float64(0.4434152440879842),
 np.int64(130): np.float64(0.4431901624110538),
 np.int64(110): np.float64(0.44315476925344055),
 np.int64(140): np.float64(0.44301975080434763),
 np.int64(150): np.float64(0.44267458182198255),
 np.int64(200): np.float64(0.4426035616542575),
 np.int64(190): np.float64(0.4425841324092249),
 np.int64(170): np.float64(0.44255926764462034),
 np.int64(160): np.float64(0.442253365222632),
 np.int64(180): np.float64(0.44221387317449706)}

In [None]:
answer 80

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 till 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25

In [28]:
results = {}

max_depth_list = [10, 15, 20, 25]
for max_d in max_depth_list:
    for i in estimators:
        rf = RandomForestRegressor(n_estimators=i, max_depth=max_d, random_state=1, n_jobs=-1)
        rf.fit(X_train,y_train)
        y_pred = rf.predict(X_val)
        rmse_value = np.sqrt(mean_squared_error(y_val, y_pred))
        rmse_dict[i]=rmse_value
    results[max_d] = rmse_dict

In [29]:
mean_rmse_per_depth = {}

for max_d, estimators_rmse_output in results.items():
    mean_rmse_per_depth[max_d] = np.mean(list(estimators_rmse_output.values()))

best_max_depth = min(mean_rmse_per_depth, key=mean_rmse_per_depth.get)
best_max_depth

10

# Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
When doing it, we can calculate "gain" - the reduction in impurity before and after the split. 
This gain is quite useful in understanding what are the important features for tree-based models.

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

* `vehicle_weight`
*	`horsepower`
* `acceleration`
* `engine_displacement`

In [34]:
model = RandomForestRegressor(
    n_estimators=10,
    max_depth=20,
    random_state=1,
    n_jobs=-1
)

# Train model
model.fit(X_val, y_val)

feature_names = dv.get_feature_names_out()

importances = model.feature_importances_

# Combine into a DataFrame
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

most_important_feature = feature_importance_df.iloc[0]
print(f"Most important feature: {most_important_feature['Feature']} with importance {most_important_feature['Importance']}")

Most important feature: vehicle_weight with importance 0.9639160944519313


## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value



#### eta = 0.3

In [38]:
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

In [39]:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=100)

In [40]:
y_pred = model.predict(dval)

In [42]:
rmse_value = np.sqrt(mean_squared_error(y_val, y_pred))
rmse_value

np.float64(0.4470252463214915)

#### eta = 0.1

In [43]:
xgb_params = {
    'eta': 0.1, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=100)

In [44]:
y_pred = model.predict(dval)

In [45]:
rmse_value = np.sqrt(mean_squared_error(y_val, y_pred))
rmse_value

np.float64(0.428650825239578)

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw06
* If your answer doesn't match options exactly, select the closest one. If the answer is exactly in between two options, select the higher value.