
# Part 1: Fuel Consumption → Horsepower Prediction
Goal: Build regression models to predict horsepower (HP) based on fuel consumption features.

1.1 Load and inspect the dataset

In [4]:
import pandas as pd

df = pd.read_csv('FuelEconomy.csv')
df.head()

print("Shape:", df.shape)
print("\nColumns:")
print(df.columns)

print("\nSummary Statistics:")
display(df.describe())




Shape: (100, 2)

Columns:
Index(['Horse Power', 'Fuel Economy (MPG)'], dtype='object')

Summary Statistics:


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |


In [5]:
missing = df.isnull().sum().sort_values(ascending=False)
print("Missing values per column:")
display(missing[missing > 0])

print("\nTotal missing values:", df.isnull().sum().sum())


Missing values per column:





Total missing values: 0


In [6]:
df = df.dropna()
print("After dropna, shape:", df.shape)

After dropna, shape: (100, 2)


1.2 Train/Test split (70% / 30% random)

In [7]:
from sklearn.model_selection import train_test_split

# Define target (HP) and feature(s)
target_col = "Horse Power"                 # HP column
feature_cols = ["Fuel Economy (MPG)"]      # fuel consumption feature(s)

y = df[target_col]
X = df[feature_cols]

# Split into 70% train and 30% test (random, reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,        # 30% test, 70% train
    random_state=42        # fixed seed for reproducibility
)

print("Train rows:", X_train.shape[0])
print("Test rows:", X_test.shape[0])

Train rows: 70
Test rows: 30


1.3 Model training: Linear + Polynomial regression

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Define models (NO regularization)
models = {
    "Linear Regression": LinearRegression(),

    "Polynomial Regression (Degree 2)": Pipeline([
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("lr", LinearRegression())
    ]),

    "Polynomial Regression (Degree 3)": Pipeline([
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("lr", LinearRegression())
    ]),

    "Polynomial Regression (Degree 4)": Pipeline([
        ("poly", PolynomialFeatures(degree=4, include_bias=False)),
        ("lr", LinearRegression())
    ])
}

# Train (fit) all models
for name, model in models.items():
    model.fit(X_train, y_train)

print("All models trained.")

All models trained.


1.4 Model evaluation (train and test)

In [13]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    # Predictions
    y_tr_pred = model.predict(X_tr)
    y_te_pred = model.predict(X_te)

    # Metrics (train)
    train_mse = mean_squared_error(y_tr, y_tr_pred)
    train_mae = mean_absolute_error(y_tr, y_tr_pred)
    train_r2  = r2_score(y_tr, y_tr_pred)

    # Metrics (test)
    test_mse = mean_squared_error(y_te, y_te_pred)
    test_mae = mean_absolute_error(y_te, y_te_pred)
    test_r2  = r2_score(y_te, y_te_pred)

    return train_mse, train_mae, train_r2, test_mse, test_mae, test_r2

# Build results table
rows = []

for name, model in models.items():
    train_mse, train_mae, train_r2, test_mse, test_mae, test_r2 = evaluate_model(
        model, X_train, y_train, X_test, y_test
    )
    rows.append({
        "Model": name,
        "Train MSE": train_mse,
        "Train MAE": train_mae,
        "Train R^2": train_r2,
        "Test MSE": test_mse,
        "Test MAE": test_mae,
        "Test R^2": test_r2
    })

results_df = pd.DataFrame(rows)

# Sort by best test R2
results_df_sorted = results_df.sort_values(by="Test R^2", ascending=False)

display(results_df_sorted.style.format({
    "Train MSE": "{:.3f}",
    "Train MAE": "{:.3f}",
    "Train R^2": "{:.3f}",
    "Test MSE": "{:.3f}",
    "Test MAE": "{:.3f}",
    "Test R^2": "{:.3f}",
}))

| | Model | Train MSE | Train MAE | Train R^2 | Test MSE | Test MAE | Test R^2 |
|---|---|---|---|---|---|---|---|
| 3 | Polynomial Regression (Degree 4) | 339.700 | 15.508 | 0.911 | 313.799 | 14.735 | 0.914 |
| 2 | Polynomial Regression (Degree 3) | 345.109 | 15.747 | 0.910 | 318.404 | 14.765 | 0.913 |
| 0 | Linear Regression | 357.699 | 16.062 | 0.906 | 318.561 | 14.941 | 0.913 |
| 1 | Polynomial Regression (Degree 2) | 350.880 | 15.996 | 0.908 | 331.105 | 15.148 | 0.909 |
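
MSE is reported in squared horsepower, which is hard to read against the scale of the target. As a small worked step, taking the square root gives RMSE in the same units as HP. The values below are copied from the Test MSE column above; the variable names are illustrative only.

```python
import math

# Test MSE values copied from the results table above (units: HP squared).
test_mse = {
    "Polynomial Regression (Degree 4)": 313.799,
    "Polynomial Regression (Degree 3)": 318.404,
    "Linear Regression": 318.561,
    "Polynomial Regression (Degree 2)": 331.105,
}

# RMSE is the square root of MSE, so it is expressed in HP.
test_rmse = {name: math.sqrt(mse) for name, mse in test_mse.items()}

for name, rmse in test_rmse.items():
    print(f"{name}: RMSE = {rmse:.2f} HP")
```

All four models land within about half a horsepower of each other (roughly 17.7 to 18.2 HP), against a standard deviation of about 62 HP for the target in the summary statistics.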


1.5 Discussion and Interpretation

**Q: Which model performs best on the test set and why?**

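Based on the test metrics, Polynomial Regression (Degree 4) performs best. It has the highest Test R² (0.914), the lowest Test MSE (313.799), and the lowest Test MAE (14.735).
The margin is small, though: Linear Regression and Degree 3 both reach a Test R² of 0.913, and with only 30 test rows a gap of 0.001 is too small to call a clear win.
What the metrics do show is that the Degree 4 model generalizes at least as well as the others on this split, and that its test error is no higher than its training error (Test MSE 313.799 vs. Train MSE 339.700), so there is no sign of overfitting.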
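
Because the test set has only 30 rows, a single split can rank close models differently by chance. One way to check is k-fold cross-validation, sketched below. The sketch runs on synthetic data generated with an assumed linear MPG-to-HP trend (the slope, intercept, and noise level are made up for illustration), so its numbers say nothing about `FuelEconomy.csv`; swapping in the notebook's `X` and `y` would give the real comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the dataset (ASSUMPTION: linear trend plus noise).
rng = np.random.default_rng(42)
mpg = rng.uniform(10, 35, size=100)
hp = 500 - 12 * mpg + rng.normal(0, 18, size=100)
X_demo = mpg.reshape(-1, 1)
y_demo = hp

def make_model(degree):
    """Unregularized polynomial regression of the given degree."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression()),
    ])

# Use the same folds for every model so the comparison is like-for-like.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for degree in (1, 2, 3, 4):
    scores = cross_val_score(make_model(degree), X_demo, y_demo, cv=cv, scoring="r2")
    cv_summary[degree] = (scores.mean(), scores.std())
    print(f"degree={degree}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the spread across folds is larger than the gap between models, the ranking from a single split is better treated as a tie.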

**Q: Does increasing polynomial degree always improve performance?**

Not always. A higher degree can only lower the training error, but test performance does not have to follow.
In my results, Test R² drops from 0.913 (Linear) to 0.909 at Degree 2, returns to 0.913 at Degree 3, and reaches 0.914 at Degree 4. Train MSE, by contrast, falls at every step (357.699 → 350.880 → 345.109 → 339.700), so the gains on the training set do not translate into consistent gains on unseen data.
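
The asymmetry between training and test error follows from how least squares works: each higher degree contains the lower-degree model as a special case, so training error cannot go up, while test error can move either way. A minimal sketch with NumPy on synthetic data (the data-generating trend and noise level are assumptions for illustration, not the notebook's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (ASSUMPTION: truly linear trend plus Gaussian noise).
x = rng.uniform(10, 35, size=100)
y = 500 - 12 * x + rng.normal(0, 18, size=100)

# Standardize x so high powers stay numerically well conditioned.
# An affine rescaling does not change the fitted polynomial's predictions.
x_std = (x - x.mean()) / x.std()

# 70/30 random split, mirroring the notebook's proportions.
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:70], idx[70:]

train_mse_by_degree, test_mse_by_degree = [], []
for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x_std[train_idx], y[train_idx], deg=degree)
    pred_train = np.polyval(coeffs, x_std[train_idx])
    pred_test = np.polyval(coeffs, x_std[test_idx])
    train_mse_by_degree.append(np.mean((y[train_idx] - pred_train) ** 2))
    test_mse_by_degree.append(np.mean((y[test_idx] - pred_test) ** 2))
    print(f"degree={degree}: train MSE={train_mse_by_degree[-1]:.2f}, "
          f"test MSE={test_mse_by_degree[-1]:.2f}")
```

The training column is non-increasing by construction; the test column carries no such guarantee, which matches the notebook's table, where Train MSE falls at every degree but Test R² dips at Degree 2.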


**Q: If a model performs unexpectedly poorly, what are two plausible reasons?**

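Polynomial Regression (Degree 2) performs slightly worse than expected: it has the lowest Test R² (0.909) and the highest test error of all four models (Test MSE = 331.105, Test MAE = 15.148), even though it is more flexible than Linear Regression.

Two plausible reasons:
1) The quadratic term fits the training split rather than the underlying trend

The Degree 2 model contains the linear model as a special case, so underfitting cannot explain why it does worse than Linear Regression.
Its Train MSE is lower (350.880 vs. 357.699) while its Test MSE is higher (331.105 vs. 318.561), which is the pattern expected when the extra term picks up curvature that is specific to the 70 training rows. With only 30 test rows, a gap of 0.004 in Test R² may also be within the noise of this particular random split.

2) Weak/limited feature information (insufficient predictors for HP)

All models have similar test errors and R² values (0.909–0.914), which suggests that the polynomial degree matters little here and that Fuel Economy (MPG) alone explains only about 91% of the variance in horsepower. Closing the remaining gap would likely require additional predictors rather than a more flexible curve.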


# Part 2: Weather → Daily Electricity Consumption Prediction
Goal: Build regression models to predict daily electricity consumption using weather features.

2.1 Load and inspect the dataset

In [3]:
import pandas as pd
import numpy as np

df2 = pd.read_csv('electricity_consumption_based_weather_dataset.csv')
df2.head()

print("Shape:", df2.shape)
print("\nColumns:")
print(df2.columns)

print("\nSummary Statistics:")
display(df2.describe())

Shape: (1433, 6)

Columns:
Index(['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption'], dtype='object')

Summary Statistics:


| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| count | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| mean | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |
| max | 10.2 | 192.3 | 39.4 | 27.2 | 4773.386 |


In [8]:
missing2 = df2.isnull().sum().sort_values(ascending=False)
print("Missing values per column (only showing > 0):")
display(missing2[missing2 > 0])

print("\nTotal missing values:", df2.isnull().sum().sum())

Missing values per column (only showing > 0):


| | 0 |
|---|---|
| AWND | 15 |



Total missing values: 15


In [9]:
df2.columns = df2.columns.str.strip()
print("Cleaned Columns:")
print(df2.columns)

Cleaned Columns:
Index(['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption'], dtype='object')


In [11]:
df2 = df2.dropna()
print("After dropna, shape:", df2.shape)

After dropna, shape: (1418, 6)


In [14]:
target_col2 = "daily_consumption"

y2 = df2[target_col2]
X2 = df2[["AWND", "PRCP", "TMAX", "TMIN"]]

2.2 Train/Test split (70% / 30% random)

In [16]:
y2 = df2[target_col2]
X2 = df2.drop(columns=[target_col2])

print("X2 shape:", X2.shape)
print("y2 shape:", y2.shape)

X2 = X2.select_dtypes(include="number")
print("X2 numeric-only shape:", X2.shape)

X2 shape: (1418, 5)
y2 shape: (1418,)
X2 numeric-only shape: (1418, 4)


In [17]:
from sklearn.model_selection import train_test_split

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2,
    test_size=0.30,      # 30% test, 70% train
    random_state=42      # fixed seed for reproducibility
)

print("Train rows:", X2_train.shape[0])
print("Test rows:", X2_test.shape[0])

Train rows: 992
Test rows: 426


2.3 Model training: Linear + Polynomial regression

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

models2 = {
    "Linear Regression": LinearRegression(),

    # Polynomial Regression degree 2: adds squared terms and pairwise interaction terms
    "Poly (deg=2)": Pipeline([
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("lr", LinearRegression())
    ]),

    # Polynomial Regression degree 3: includes cubic terms in the feature expansion
    "Poly (deg=3)": Pipeline([
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("lr", LinearRegression())
    ]),

    # Polynomial Regression degree 4: includes up to 4th power terms (most flexible, higher overfitting risk)
    "Poly (deg=4)": Pipeline([
        ("poly", PolynomialFeatures(degree=4, include_bias=False)),
        ("lr", LinearRegression())
    ])
}

In [21]:
for name, model in models2.items():
    model.fit(X2_train, y2_train)

print("All Part 2 models trained!")

All Part 2 models trained!


2.4 Model evaluation (train and test)

In [25]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    y_tr_pred = model.predict(X_tr)
    y_te_pred = model.predict(X_te)

    return {
        "Train MSE": mean_squared_error(y_tr, y_tr_pred),
        "Train MAE": mean_absolute_error(y_tr, y_tr_pred),
        "Train R²": r2_score(y_tr, y_tr_pred),
        "Test MSE": mean_squared_error(y_te, y_te_pred),
        "Test MAE": mean_absolute_error(y_te, y_te_pred),
        "Test R²": r2_score(y_te, y_te_pred),
    }



In [26]:
rows = []

for name, model in models2.items():
    metrics = evaluate_model(model, X2_train, y2_train, X2_test, y2_test)
    metrics["Model"] = name
    rows.append(metrics)

results2_df = pd.DataFrame(rows)

# Reorder columns to match the recommended table format
results2_df = results2_df[["Model", "Train MSE", "Train MAE", "Train R²", "Test MSE", "Test MAE", "Test R²"]]

# Sort by Test R² (best on top)
results2_df_sorted = results2_df.sort_values(by="Test R²", ascending=False)

display(results2_df_sorted.style.format({
    "Train MSE": "{:.3f}",
    "Train MAE": "{:.3f}",
    "Train R²": "{:.3f}",
    "Test MSE": "{:.3f}",
    "Test MAE": "{:.3f}",
    "Test R²": "{:.3f}",
}))

| | Model | Train MSE | Train MAE | Train R² | Test MSE | Test MAE | Test R² |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 272403.396 | 384.465 | 0.276 | 248125.786 | 375.405 | 0.299 |
| 1 | Poly (deg=2) | 264765.770 | 379.649 | 0.296 | 255268.494 | 379.039 | 0.279 |
| 2 | Poly (deg=3) | 259249.535 | 375.953 | 0.311 | 265623.658 | 385.235 | 0.250 |
| 3 | Poly (deg=4) | 251909.339 | 372.117 | 0.330 | 12151486.443 | 578.642 | -33.314 |


2.5 Discussion and interpretation

**Q: Which model generalizes best (best test performance)?**

Based on the test results, Linear Regression generalizes best. It has the highest Test R² (0.299), the lowest Test MSE (248,125.786), and the lowest Test MAE (375.405) of the four models. A Test R² of about 0.30 means the model explains roughly 30% of the variance in daily consumption, so weather is related to electricity usage, but these four inputs leave most of the variation unexplained.


**Q: Do polynomial models improve the fit compared to linear regression?**

Not really. Degree 2 and Degree 3 improve the training fit a little (Train R² rises from 0.276 for linear to 0.296 for deg=2 and 0.311 for deg=3), but their test performance drops (Test R² = 0.279 and 0.250, against 0.299 for linear). The extra complexity does not help predict unseen days, which suggests that any nonlinear structure in these four weather features is too weak to be estimated reliably from this data.
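
Part of the explanation is how quickly the polynomial expansion grows. With n input features and degree d, `PolynomialFeatures(include_bias=False)` produces C(n + d, d) - 1 columns. The short calculation below applies that formula to the four weather features and compares the result with the 992 training rows reported above; the helper name is made up for this sketch.

```python
from math import comb

def n_poly_terms(n_features, degree):
    """Number of monomials of total degree 1..degree in n_features variables
    (the full polynomial expansion without the constant/bias column)."""
    return comb(n_features + degree, degree) - 1

n_train_rows = 992   # from the train/test split output
n_features = 4       # AWND, PRCP, TMAX, TMIN

term_counts = {d: n_poly_terms(n_features, d) for d in (1, 2, 3, 4)}
for d, k in term_counts.items():
    print(f"degree={d}: {k} terms, about {n_train_rows / k:.0f} training rows per term")
```

The counts are 4, 14, 34, and 69 terms. Many of the higher-order terms are products of inputs that are likely correlated (TMAX and TMIN, for example), so the effective information per coefficient is probably lower than the row counts alone suggest.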


**Q: If higher-degree models perform worse on the test set, explain this behavior using evidence from metrics (e.g., train error decreases but test error increases).**

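The Degree 4 model is a clear case of overfitting. It has the best training metrics (Train R² = 0.330, Train MSE = 251,909.339) but breaks down on the test set (Test R² = -33.314, Test MSE = 12,151,486.443). The same trend is already visible at lower degrees: Train MSE falls at every step (272,403 → 264,766 → 259,250 → 251,909) while Test MSE rises (248,126 → 255,268 → 265,624 → 12,151,486). For Degree 4, Test MAE grows far less than Test MSE (MAE is about 1.5 times the linear model's, MSE about 49 times), which points to a small number of extreme predictions rather than uniformly bad ones. That is typical of an unregularized high-degree polynomial evaluated on inputs near or beyond the edge of the training range.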
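
A negative R² can look like a bug, but it follows directly from the definition R² = 1 - MSE / Var(y_test): once a model's test MSE exceeds the variance of the test target, predicting the test mean would have done better. As a consistency check, the sketch below recovers the implied test variance from each row of the results table and confirms that all four rows agree. The numbers are copied from the table, and small differences come from R² being rounded to three decimals.

```python
# (Test MSE, Test R^2) pairs copied from the Part 2 results table.
reported = {
    "Linear Regression": (248125.786, 0.299),
    "Poly (deg=2)": (255268.494, 0.279),
    "Poly (deg=3)": (265623.658, 0.250),
    "Poly (deg=4)": (12151486.443, -33.314),
}

# Rearranging R^2 = 1 - MSE / Var gives Var = MSE / (1 - R^2).
implied_var = {name: mse / (1.0 - r2) for name, (mse, r2) in reported.items()}

for name, var in implied_var.items():
    print(f"{name}: implied Var(y_test) = {var:,.0f}")

# Degree 4: MSE is about 34 times the variance, hence R^2 of about -33.
ratio_deg4 = reported["Poly (deg=4)"][0] / implied_var["Linear Regression"]
print(f"deg=4 MSE / variance = {ratio_deg4:.1f}")
```

All four rows imply a test variance of roughly 354,000, i.e. a standard deviation near 595, in line with the 606.8 reported for the full dataset in the summary statistics.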



**Q: If none of the models achieve good test performance, provide at least two reasons supported by your outputs**

Even the best model has a low Test R² (0.299), which suggests the dataset is missing important drivers of electricity usage. One likely reason is that weather features alone are limited: daily consumption is also driven by occupancy, building operations, and human behavior, not just temperature, wind, or precipitation. Another reason is that time-based effects are not modeled. The dataset includes a date column, but it was dropped when only numeric columns were kept (X2 went from 5 columns to 4), so any weekly or seasonal patterns in demand are invisible to the models. Both points are consistent with test errors staying high across all models (Test MAE of 375 or more).
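
To make the second point concrete, here is a minimal sketch of how calendar features could be derived from a date column with pandas. It uses a tiny hypothetical frame rather than the real file, assumes the dates parse with `pd.to_datetime`, and the new column names (`day_of_week`, `is_weekend`, `doy_sin`, `doy_cos`) are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real dataset (values are made up).
demo = pd.DataFrame({
    "date": ["2020-01-04", "2020-01-06", "2020-07-15"],
    "TMAX": [5.0, 3.3, 31.1],
    "daily_consumption": [1800.0, 1500.0, 2100.0],
})

# Parse the strings into datetimes so the .dt accessor becomes available.
demo["date"] = pd.to_datetime(demo["date"])

# Weekly pattern: Monday=0 ... Sunday=6, plus a weekend flag.
demo["day_of_week"] = demo["date"].dt.dayofweek
demo["is_weekend"] = (demo["day_of_week"] >= 5).astype(int)

# Seasonal pattern: encode day-of-year on a circle so Dec 31 sits next to Jan 1.
day_of_year = demo["date"].dt.dayofyear
demo["doy_sin"] = np.sin(2 * np.pi * day_of_year / 365.25)
demo["doy_cos"] = np.cos(2 * np.pi * day_of_year / 365.25)

print(demo.drop(columns=["date"]))
```

These columns are numeric, so they would survive the `select_dtypes(include="number")` step and could be added to `X2` alongside the weather features. Whether they actually raise Test R² on this dataset is untested here.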