In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("car_fuel_efficiency.csv")

### Question 1

**There's one column with missing values. What is it?**

In [4]:
import pandas as pd

df = pd.read_csv("car_fuel_efficiency.csv")

# Select relevant columns
cols = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']
df_filtered = df[cols]

# Check for missing values
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

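Column with missing values: `horsepower` (708 missing). The check above ran on the full `df`, so it also lists `num_cylinders`, `acceleration` and `num_doors`; among the selected columns in `cols`, `horsepower` is the only one with missing values.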
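To restrict the check to the selected columns only, the count can be taken on the filtered frame and reduced to the non-zero entries. The snippet below is a minimal sketch on a made-up toy frame (the values and the `toy`, `selected` and `missing_cols` names are illustrative, not taken from the car dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car dataset (values are made up for illustration)
toy = pd.DataFrame({
    "engine_displacement": [170, 130, 220, 210],
    "horsepower": [159.0, np.nan, 141.0, np.nan],
    "num_cylinders": [4.0, np.nan, 6.0, 4.0],  # has NaNs but is not selected
    "fuel_efficiency_mpg": [13.2, 13.7, 14.2, 16.9],
})

selected = ["engine_displacement", "horsepower", "fuel_efficiency_mpg"]

# Count NaNs in the selected columns only, then keep the non-zero counts
missing = toy[selected].isnull().sum()
missing_cols = missing[missing > 0]
print(missing_cols)
```

On the real data the same pattern would be `df_filtered.isnull().sum()`.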

### Question 2
**What's the median (50th percentile) for variable 'horsepower'?**

In [7]:
df['horsepower'].median()

149.0

Median: 149

**Prepare and split the dataset**

- Shuffle the dataset (the filtered one you created above), use seed 42.
- Split your data into train/val/test sets, with a 60%/20%/20% distribution.
- Use the same code as in the lectures.
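This notebook splits with `train_test_split`. For comparison, a manual shuffle-and-slice split can be sketched with NumPy alone. This is an assumption about what a lecture-style split might look like, not a copy of it; `split_indices` is a hypothetical helper, and 9704 is the total row count implied by the shapes printed in this notebook.

```python
import numpy as np

def split_indices(n, seed, val_frac=0.2, test_frac=0.2):
    """Return shuffled train/val/test index arrays (hypothetical helper)."""
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test

    # Shuffle the row positions reproducibly with a local generator
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)

    # Slice the shuffled positions into three disjoint blocks
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(9704, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays would then be used as `df_filtered.iloc[train_idx]` and so on. Because the sizes here are truncated with `int`, the split comes out as 5824/1940/1940, a couple of rows off the 5822/1941/1941 that `train_test_split` produces, and the shuffled order differs as well, so the resulting RMSE values need not match exactly.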

In [10]:
import numpy as np
from sklearn.model_selection import train_test_split

In [11]:
df_shuffled = df_filtered.sample(frac=1, random_state=42)

df_train, df_temp = train_test_split(df_shuffled, test_size=0.4, random_state=42)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

print(f"Train set shape: {df_train.shape}")
print(f"Validation set shape: {df_val.shape}")
print(f"Test set shape: {df_test.shape}")

Train set shape: (5822, 5)
Validation set shape: (1941, 5)
Test set shape: (1941, 5)


### Question 3

We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training set only!
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using round(score, 2)
- Which option gives better RMSE?
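Two ideas in the list above can be shown in isolation: RMSE as the square root of the mean squared error, and a fill value computed from the training split and then reused on the validation split. This is a minimal sketch with made-up numbers; `rmse` is an illustrative helper, not the function used in the cells below.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up horsepower values with gaps
hp_train = np.array([100.0, np.nan, 140.0, 180.0])
hp_val = np.array([np.nan, 120.0])

# The fill value comes from the training split only: (100 + 140 + 180) / 3
train_mean = float(np.nanmean(hp_train))

# Reuse the training mean for the validation gaps (no validation statistics)
hp_val_filled = np.where(np.isnan(hp_val), train_mean, hp_val)

print(train_mean, hp_val_filled)
print(round(rmse([3.0, 5.0], [2.0, 7.0]), 2))  # errors 1 and -2 -> sqrt(2.5)
```

Computing the mean on the training split alone keeps information from the validation rows out of the model inputs.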

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

features = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
target = 'fuel_efficiency_mpg'

def prepare_X(df):
    return df[features].copy()


**Option 1: imputing with 0**

In [15]:
df_train_0 = df_train.copy()
df_val_0 = df_val.copy()

df_train_0['horsepower'] = df_train_0['horsepower'].fillna(0)
df_val_0['horsepower'] = df_val_0['horsepower'].fillna(0)

X_train_0 = prepare_X(df_train_0)
X_val_0 = prepare_X(df_val_0)
y_train = df_train_0[target].values
y_val = df_val_0[target].values

model_0 = LinearRegression()
model_0.fit(X_train_0, y_train)
y_pred_0 = model_0.predict(X_val_0)

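# The squared=False argument is deprecated in newer scikit-learn releases,
# so take the square root of the MSE explicitly
rmse_0 = np.sqrt(mean_squared_error(y_val, y_pred_0))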
print(f"RMSE_0: {round(rmse_0, 2)}")

RMSE_0: 0.51




**Option 2: Fill missing with training mean**

In [17]:
df_train_mean = df_train.copy()
df_val_mean = df_val.copy()

hp_mean = df_train_mean['horsepower'].mean()

df_train_mean['horsepower'] = df_train_mean['horsepower'].fillna(hp_mean)
df_val_mean['horsepower'] = df_val_mean['horsepower'].fillna(hp_mean)

X_train_mean = prepare_X(df_train_mean)
X_val_mean = prepare_X(df_val_mean)
y_train = df_train_mean[target].values
y_val = df_val_mean[target].values

model_mean = LinearRegression()
model_mean.fit(X_train_mean, y_train)
y_pred_mean = model_mean.predict(X_val_mean)

# Square root of the MSE (works regardless of scikit-learn version)
rmse_mean = np.sqrt(mean_squared_error(y_val, y_pred_mean))
print(f"RMSE_Mean: {round(rmse_mean, 2)}")

RMSE_Mean: 0.46




**Comparing the two RMSEs**

**Answer: With mean**

Filling the missing horsepower values with the training mean gives a validation RMSE of 0.46, versus 0.51 when filling with 0. Mean imputation therefore gives the lower validation error and the better model.

### Question 4 
**Now let's train a regularized linear regression.**

In [20]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error


Filling the NAs with 0. 

In [22]:
df_train_r = df_train.copy()
df_val_r = df_val.copy()

df_train_r['horsepower'] = df_train_r['horsepower'].fillna(0)
df_val_r['horsepower'] = df_val_r['horsepower'].fillna(0)

X_train_r = prepare_X(df_train_r)
X_val_r = prepare_X(df_val_r)
y_train = df_train_r[target].values
y_val = df_val_r[target].values


Round the RMSE scores to 2 decimal digits.

In [24]:
for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    model = Ridge(alpha=r)
    model.fit(X_train_r, y_train)
    y_pred = model.predict(X_val_r)
    # Square root of the MSE (works regardless of scikit-learn version)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    print(r, round(rmse, 2))

0 0.51
0.01 0.51
0.1 0.51
1 0.51
5 0.51
10 0.51
100 0.51




**Which r gives the best RMSE?**

Every value of r gives the same RMSE (0.51) after rounding to 2 decimal digits. With all options tied, the answer is the smallest one, r = 0.
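One possible reason for the tie, not verified against this dataset, is feature scale: when a feature takes values in the thousands, the entries of X^T X are very large compared with r, so the penalty barely moves the weights. The sketch below illustrates this on synthetic data with a closed-form ridge solution. `ridge_fit` is an illustrative helper that leaves the intercept unpenalized by centering the data, which is an assumption made here, not a claim about the `Ridge` internals.

```python
import numpy as np

def ridge_fit(X, y, r):
    """Closed-form ridge with an unpenalized intercept (illustrative helper)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    # Solve (Xc^T Xc + r I) w = Xc^T yc for the weights
    w = np.linalg.solve(Xc.T @ Xc + r * np.eye(X.shape[1]), Xc.T @ yc)
    # Recover the intercept from the means
    w0 = y_mean - X_mean @ w
    return w0, w

rng = np.random.default_rng(0)
# Synthetic feature on a scale of thousands (not the car data)
X = rng.normal(3000.0, 500.0, size=(1000, 1))
y = 30.0 - 0.005 * X[:, 0] + rng.normal(0.0, 0.5, size=1000)

for r in [0, 0.01, 1, 100]:
    w0, w = ridge_fit(X, y, r)
    print(r, w0, w[0])
```

In this toy setup the weight changes by well under one part in 100,000 between r = 0 and r = 100, far below what a 2-decimal RMSE would reveal.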

### Question 5
We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.

Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].

For each seed, do the train/validation/test split with 60%/20%/20% distribution.

Fill the missing values with 0 and train a model without regularization.

For each seed, evaluate the model on the validation dataset and collect the RMSE scores.

What's the standard deviation of all the scores? To compute the standard deviation, use np.std.

Round the result to 3 decimal digits (round(std, 3))

What's the value of std?

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.model_selection import train_test_split

seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rmse_scores = []

for seed in seeds:
    # Split data
    df_shuffled = df.sample(frac=1, random_state=seed)
    df_train, df_temp = train_test_split(df_shuffled, test_size=0.4, random_state=seed)
    df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=seed)

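    # Fill missing values with 0 (assign returns a new frame, which avoids
    # writing into a slice of df_shuffled)
    df_train = df_train.assign(horsepower=df_train['horsepower'].fillna(0))
    df_val = df_val.assign(horsepower=df_val['horsepower'].fillna(0))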

    # Prepare features
    X_train = df_train[features]
    X_val = df_val[features]
    y_train = df_train[target].values
    y_val = df_val[target].values

    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_val)
    # Square root of the MSE (works regardless of scikit-learn version)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_scores.append(rmse)

# Compute standard deviation
std = np.std(rmse_scores)
print("RMSE scores:", rmse_scores)
print("Standard deviation:", round(std, 3))




RMSE scores: [0.518029012129703, 0.5090406519449436, 0.5141627479366627, 0.5152929065813445, 0.5184898612166596, 0.5236053324442334, 0.5150251935468096, 0.5249385057075809, 0.50760288990225, 0.5292310655399466]
Standard deviation: 0.007




**The computed standard deviation is 0.007; the closest answer option is 0.006.**
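`np.std` computes the population standard deviation by default (`ddof=0`, dividing by n). As a quick check, the snippet below recomputes it from the ten RMSE values printed above and also shows the sample version (`ddof=1`) for comparison; the `std_pop` and `std_sample` names are illustrative.

```python
import numpy as np

# Validation RMSE values copied from the output above
rmse_scores = [
    0.518029012129703, 0.5090406519449436, 0.5141627479366627,
    0.5152929065813445, 0.5184898612166596, 0.5236053324442334,
    0.5150251935468096, 0.5249385057075809, 0.50760288990225,
    0.5292310655399466,
]

std_pop = float(np.std(rmse_scores))             # ddof=0: divide by n
std_sample = float(np.std(rmse_scores, ddof=1))  # ddof=1: divide by n - 1

print(round(std_pop, 3), round(std_sample, 3))
```

The spread across seeds is small relative to the RMSE itself (about 0.52), which suggests the score is not very sensitive to how the data is split.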

### Question 6

**Split the dataset like previously, use seed 9.
Combine train and validation datasets.
Fill the missing values with 0 and train a model with r=0.001.
What's the RMSE on the test dataset?**


Step 1: Split the data 60/20/20 (seed = 9)

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Shuffle and split
df_shuffled = df.sample(frac=1, random_state=9)
df_train, df_temp = train_test_split(df_shuffled, test_size=0.4, random_state=9)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=9)


Step 2: Combine train + validation

In [33]:
df_full_train = pd.concat([df_train, df_val])

Step 3: Fill missing values with 0

In [35]:
df_full_train['horsepower'] = df_full_train['horsepower'].fillna(0)
df_test['horsepower'] = df_test['horsepower'].fillna(0)


Step 4: Prepare and train model

In [37]:
features = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
target = 'fuel_efficiency_mpg'

X_full_train = df_full_train[features]
y_full_train = df_full_train[target].values
X_test = df_test[features]
y_test = df_test[target].values

model = Ridge(alpha=0.001)
model.fit(X_full_train, y_full_train)

y_pred = model.predict(X_test)
# Square root of the MSE (works regardless of scikit-learn version)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE on test:", round(rmse_test, 3))


RMSE on test: 0.532




**The test RMSE is 0.532; the closest answer option is 0.515.**