# Module 2: Regression Homework (2025 cohort)

This notebook recreates every step of the assignment so the outputs can be verified before submitting the multiple-choice answers.

## 1. Environment & Imports

We work with a small set of libraries: NumPy, Pandas, and the linear models that come with scikit-learn.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

pd.set_option('display.float_format', lambda x: f'{x:.5f}')

## 2. Load the dataset

Only the columns specified in the homework brief are required, so we filter at read time to reduce the DataFrame to the relevant features plus the target.

In [2]:
DATA_URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
COLUMNS = [
    'engine_displacement',
    'horsepower',
    'vehicle_weight',
    'model_year',
    'fuel_efficiency_mpg',
]
FEATURES = COLUMNS[:-1]
TARGET = 'fuel_efficiency_mpg'

df = pd.read_csv(DATA_URL, usecols=COLUMNS)
df.head()

|   | engine_displacement | horsepower | vehicle_weight | model_year | fuel_efficiency_mpg |
|---|---|---|---|---|---|
| 0 | 170 | 159.0 | 3413.43376 | 2003 | 13.23173 |
| 1 | 130 | 97.0 | 3149.66493 | 2007 | 13.68822 |
| 2 | 170 | 78.0 | 3079.039 | 2018 | 14.24634 |
| 3 | 220 | NaN | 2542.3924 | 2009 | 16.91274 |
| 4 | 210 | 140.0 | 3460.87099 | 2009 | 12.48837 |
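
To see what `usecols` does without touching the network, here is a minimal sketch that reads an inline CSV. The `csv_text` and `wanted` names and the `extra_column` field are made up for the illustration; the two data rows reuse values from the preview above.

```python
import io

import pandas as pd

# Inline stand-in for the remote file; `extra_column` is hypothetical.
csv_text = """engine_displacement,horsepower,extra_column,fuel_efficiency_mpg
170,159.0,a,13.23173
220,,b,16.91274
"""

wanted = ['fuel_efficiency_mpg', 'engine_displacement', 'horsepower']
sample = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Columns come back in file order, not in the order of `wanted`.
print(list(sample.columns))
# The empty horsepower field is parsed as NaN.
print(int(sample['horsepower'].isna().sum()))
```

Because the column order follows the file rather than the list, selecting features explicitly by name (as `frame[FEATURES]` does later) is the safer way to fix the order of the feature matrix.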


## 3. Distribution Check for the Target

We inspect summary statistics and quantiles to decide whether the distribution has a long tail.

In [3]:
df[TARGET].describe(percentiles=[0.5, 0.75, 0.9, 0.95, 0.99])

count   9704.00000
mean      14.98524
std        2.55647
min        6.20097
50%       15.00604
75%       16.70797
90%       18.25946
95%       19.15002
99%       20.88206
max       25.96722
Name: fuel_efficiency_mpg, dtype: float64

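The mean (14.99) and the median (15.01) nearly coincide, and the 99th percentile (20.88) sits about 2.3 standard deviations above the mean, close to what a normal distribution would give. The upper quantiles rise gradually rather than stretching out, so the target looks roughly symmetric with no pronounced long tail, which is enough to answer the EDA prompt in the homework.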
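
Reading a tail off a quantile table is somewhat subjective, so a few numeric indicators can help. The sketch below is one possible check, not part of the homework: `tail_report` is a hypothetical helper, and both series are synthetic samples generated only to contrast a symmetric shape with a long-tailed one.

```python
import numpy as np
import pandas as pd


def tail_report(values: pd.Series) -> dict:
    """Summarize how asymmetric a numeric column is (hypothetical helper)."""
    mean = values.mean()
    median = values.median()
    std = values.std()
    return {
        'mean_minus_median': float(mean - median),
        'skewness': float(values.skew()),
        'p99_in_std_units': float((values.quantile(0.99) - mean) / std),
    }


# Synthetic data for illustration only, not the homework dataset.
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(loc=15.0, scale=2.5, size=10_000))
long_tailed = pd.Series(rng.lognormal(mean=2.7, sigma=0.6, size=10_000))

report_symmetric = tail_report(symmetric)
report_long_tailed = tail_report(long_tailed)
print(report_symmetric)
print(report_long_tailed)
```

A long right tail shows up as a mean well above the median and a clearly positive skewness. Calling `tail_report(df[TARGET])` gives the same three numbers for the homework column.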

## 4. Missing Values (Q1)

Counting missing entries per column points to the feature that requires imputation.

In [4]:
missing_counts = df.isna().sum()
missing_counts

engine_displacement      0
horsepower             708
vehicle_weight           0
model_year               0
fuel_efficiency_mpg      0
dtype: int64

`horsepower` is the only column with missing values (708 rows). This answers Question 1.
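
Counts are easier to judge as a share of the rows. The sketch below shows the idea on a toy frame (the `toy` values are made up) and then applies the same arithmetic to the counts reported above.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the homework data is not loaded here.
toy = pd.DataFrame({
    'horsepower': [159.0, 97.0, np.nan, 140.0],
    'vehicle_weight': [3413.4, 3149.7, 2542.4, 3460.9],
})

# The mean of a boolean mask is the fraction of True values.
missing_share = toy.isna().mean()
print(missing_share)

# Same arithmetic applied to the counts reported above: 708 of 9704 rows.
share_in_homework = 708 / 9704
print(f'{share_in_homework:.1%}')
```

On the real DataFrame the equivalent call would be `df.isna().mean()`.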

## 5. Median Horsepower (Q2)

`Series.median()` returns the 50th percentile and skips missing values by default, so the 708 NaN rows do not affect the result.

In [5]:
median_hp = df['horsepower'].median()
median_hp

np.float64(149.0)

The median horsepower is 149 (Question 2).

## 6. Helper Functions

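The helpers follow the same logic as the course videos: shuffle with a fixed seed, split 60/20/20, build the feature matrix and target vector, and compute RMSE by hand. Note that `shuffle_split` draws its permutation from `np.random.default_rng(seed)`, which orders the rows differently than the legacy `np.random.seed(seed)` and `np.random.shuffle` pair for the same seed, so the numbers can differ slightly from a run that uses the legacy API.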

In [6]:
def shuffle_split(dataframe: pd.DataFrame, seed: int = 42):
    rng = np.random.default_rng(seed)
    indices = np.arange(len(dataframe))
    rng.shuffle(indices)
    df_shuffled = dataframe.iloc[indices].reset_index(drop=True)

    n_total = len(df_shuffled)
    n_train = int(0.6 * n_total)
    n_val = int(0.2 * n_total)

    df_train = df_shuffled.iloc[:n_train].reset_index(drop=True)
    df_val = df_shuffled.iloc[n_train:n_train + n_val].reset_index(drop=True)
    df_test = df_shuffled.iloc[n_train + n_val:].reset_index(drop=True)

    return df_train, df_val, df_test


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def prepare_xy(frame: pd.DataFrame, fill_value):
    x = frame[FEATURES].fillna({'horsepower': fill_value}).to_numpy()
    y = frame[TARGET].to_numpy()
    return x, y
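
A quick sanity check of the helper arithmetic can catch off-by-one mistakes before any model is trained. The sketch below is an illustration under stated assumptions: `split_sizes` is a hypothetical helper that repeats the size calculation from `shuffle_split`, and the three-point RMSE example uses made-up numbers.

```python
import numpy as np


def split_sizes(n_total: int, train_frac: float = 0.6, val_frac: float = 0.2):
    """Mirror the size arithmetic of `shuffle_split` (hypothetical helper)."""
    n_train = int(train_frac * n_total)
    n_val = int(val_frac * n_total)
    # The test set takes whatever is left, so rounding never drops rows.
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


# 9704 is the row count reported by describe() earlier.
sizes = split_sizes(9704)
print(sizes)

# RMSE by hand on three points: the errors are 0, 0, and 2.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
rmse_value = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(rmse_value)  # equals sqrt(4 / 3)
```

Because `int()` truncates, the train and validation sets round down and the test set absorbs the remainder, which is why it ends up two rows larger than the validation set.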

## 7. Question 3: Imputation Strategy Comparison

We stick to seed 42 to match the homework. The `fill_zeros` model sets missing horsepower to 0; the `fill_mean` model uses the training mean and applies it to validation as well.

In [7]:
df_train, df_val, df_test = shuffle_split(df, seed=42)

X_train_zero, y_train = prepare_xy(df_train, fill_value=0)
X_val_zero, y_val = prepare_xy(df_val, fill_value=0)

model_zero = LinearRegression()
model_zero.fit(X_train_zero, y_train)
val_pred_zero = model_zero.predict(X_val_zero)
rmse_zero = rmse(y_val, val_pred_zero)

hp_mean = df_train['horsepower'].mean()
X_train_mean, _ = prepare_xy(df_train, fill_value=hp_mean)
X_val_mean, _ = prepare_xy(df_val, fill_value=hp_mean)

model_mean = LinearRegression()
model_mean.fit(X_train_mean, y_train)
val_pred_mean = model_mean.predict(X_val_mean)
rmse_mean = rmse(y_val, val_pred_mean)

rmse_zero, rmse_mean


(0.52166824787551, 0.46652012986218777)

Rounded to two decimals, the RMSE values are 0.52 when filling with 0 and 0.47 when filling with the mean. Mean imputation performs better, answering Question 3.
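
The key detail in the mean strategy is that the fill value is computed on the training split only and then reused elsewhere, so no information from validation leaks into training. A minimal sketch of that pattern, with a hypothetical `fill_from_train` helper and made-up horsepower values:

```python
import numpy as np
import pandas as pd


def fill_from_train(train: pd.Series, other: pd.Series):
    """Fill NaNs in both series using the mean of `train` only."""
    fill_value = train.mean()  # NaNs are skipped by default
    return train.fillna(fill_value), other.fillna(fill_value), fill_value


# Toy horsepower values for illustration, not taken from the dataset.
hp_train = pd.Series([100.0, np.nan, 140.0])
hp_val = pd.Series([np.nan, 200.0])

train_filled, val_filled, used_value = fill_from_train(hp_train, hp_val)
print(used_value)
print(val_filled.tolist())
```

The validation gap is filled with 120.0, the training mean, even though the only observed validation value is 200.0.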

## 8. Question 4: Regularized Linear Regression

Missing horsepower values are filled with 0 for all runs. We sweep the list of `r` values, passing each one to scikit-learn's `Ridge` as `alpha`, and record the validation RMSE.

In [8]:
ridge_grid = [0, 0.01, 0.1, 1, 5, 10, 100]
rmse_by_r = {}

for r in ridge_grid:
    ridge = Ridge(alpha=r)
    ridge.fit(X_train_zero, y_train)
    preds = ridge.predict(X_val_zero)
    rmse_by_r[r] = rmse(y_val, preds)

{key: round(value, 2) for key, value in rmse_by_r.items()}

{0: 0.52, 0.01: 0.52, 0.1: 0.52, 1: 0.52, 5: 0.52, 10: 0.52, 100: 0.52}

All values of `r` give the same RMSE after rounding to two decimals (0.52). Per the instructions, we select the smallest `r` (0).
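
The tie-breaking rule can be written down explicitly. The sketch below is one way to do it, assuming ties are judged on rounded scores; `pick_smallest_best` is a hypothetical helper and `example_scores` holds made-up numbers chosen so that two candidates tie.

```python
def pick_smallest_best(scores: dict, decimals: int = 2):
    """Return the smallest key whose rounded score equals the best rounded score."""
    rounded = {key: round(value, decimals) for key, value in scores.items()}
    best = min(rounded.values())
    return min(key for key, value in rounded.items() if value == best)


# Made-up scores for illustration; 0.1 and 1 tie after rounding.
example_scores = {0.01: 0.5312, 0.1: 0.5204, 1: 0.5196, 10: 0.5405}
chosen = pick_smallest_best(example_scores)
print(chosen)
```

The summary cell at the end of the notebook compares unrounded scores with a `1e-8` tolerance instead; for the grid above both approaches return 0.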

## 9. Question 5: Impact of the Random Seed

Repeat the train/validation split with seeds 0 through 9, always filling horsepower with 0. Collect the RMSE values and compute their standard deviation.

In [9]:
seed_range = range(10)
rmse_scores = []

for seed in seed_range:
    df_t, df_v, _ = shuffle_split(df, seed=seed)
    x_t, y_t = prepare_xy(df_t, fill_value=0)
    x_v, y_v = prepare_xy(df_v, fill_value=0)

    model = LinearRegression()
    model.fit(x_t, y_t)
    preds = model.predict(x_v)

    rmse_scores.append(rmse(y_v, preds))

rmse_scores, np.std(rmse_scores)

([0.5210033268580287,
  0.5243954303489885,
  0.5251938709279687,
  0.5243806735673154,
  0.5259017805690743,
  0.5252251126484908,
  0.5204468326653946,
  0.5103918743323342,
  0.5203868283492601,
  0.5323005287933144],
 np.float64(0.005336878726231149))

The standard deviation of the validation RMSE values is 0.005 (0.00534 before rounding); the closest answer option in Question 5 is 0.006. The low spread indicates that the model is stable with respect to the random split choice.
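
`np.std` computes the population standard deviation (`ddof=0`) by default, whereas pandas defaults to the sample estimate (`ddof=1`). With only ten values the two differ noticeably, as this short sketch shows using the RMSE values copied from the output above.

```python
import numpy as np

# Validation RMSE values copied from the output above.
scores = np.array([
    0.5210033268580287, 0.5243954303489885, 0.5251938709279687,
    0.5243806735673154, 0.5259017805690743, 0.5252251126484908,
    0.5204468326653946, 0.5103918743323342, 0.5203868283492601,
    0.5323005287933144,
])

population_std = float(np.std(scores))      # ddof=0, NumPy's default
sample_std = float(np.std(scores, ddof=1))  # ddof=1, the pandas default

print(round(population_std, 3), round(sample_std, 3))  # 0.005 0.006
```

Either way the spread is well under 0.01, so the choice of estimator does not change which answer option is closest.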

## 10. Question 6: Final Model on the Test Set

Follow the homework instructions: split with seed 9, combine train and validation, fill NA with 0, train Ridge with `r = 0.001`, then score on the held-out test set.

In [10]:
df_train_9, df_val_9, df_test_9 = shuffle_split(df, seed=9)

df_full_train = pd.concat([df_train_9, df_val_9]).reset_index(drop=True)
X_full_train, y_full_train = prepare_xy(df_full_train, fill_value=0)
X_test_9, y_test_9 = prepare_xy(df_test_9, fill_value=0)

final_model = Ridge(alpha=0.001)
final_model.fit(X_full_train, y_full_train)

test_predictions = final_model.predict(X_test_9)
rmse_test = rmse(y_test_9, test_predictions)
rmse_test

0.5045579433148385

Rounded to three decimals, the test RMSE is 0.505 (Question 6).

## 11. Summary Table for Submission

Collect the exact numbers to copy into the course form.

In [11]:
summary = {
    'Q1_missing_column': 'horsepower',
    'Q2_median_hp': median_hp,
    'Q3_rmse_fill_zero': round(rmse_zero, 2),
    'Q3_rmse_fill_mean': round(rmse_mean, 2),
    'Q3_best_option': 'mean',
    'Q4_best_r': min([r for r, score in rmse_by_r.items() if abs(score - min(rmse_by_r.values())) < 1e-8]),
    'Q5_std_rmse': round(float(np.std(rmse_scores)), 3),
    'Q6_test_rmse': round(rmse_test, 3),
}
summary

{'Q1_missing_column': 'horsepower',
 'Q2_median_hp': np.float64(149.0),
 'Q3_rmse_fill_zero': 0.52,
 'Q3_rmse_fill_mean': 0.47,
 'Q3_best_option': 'mean',
 'Q4_best_r': 0,
 'Q5_std_rmse': 0.005,
 'Q6_test_rmse': 0.505}
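
The `np.float64(149.0)` entry is a NumPy scalar rather than a plain Python float. If the summary needs to be copied or serialized cleanly, one option is to convert NumPy scalars first. The sketch below assumes a hypothetical `to_builtin` helper and uses a subset of the summary values shown above.

```python
import json

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars to plain Python values; leave others unchanged."""
    return value.item() if isinstance(value, np.generic) else value


# Subset of the summary above, typed the way the notebook produced it.
raw = {
    'Q1_missing_column': 'horsepower',
    'Q2_median_hp': np.float64(149.0),
    'Q4_best_r': 0,
}
clean = {key: to_builtin(value) for key, value in raw.items()}

print(clean)
print(json.dumps(clean))
```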