# Part 1
### 1.1 Load and Inspect the Dataset

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
# Load the dataset
df = pd.read_csv("FuelEconomy.csv")

In [12]:
# Display first few rows
df.head()

| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| 0 | 118.770799 | 29.344195 |
| 1 | 176.326567 | 24.695934 |
| 2 | 219.262465 | 23.95201 |
| 3 | 187.310009 | 23.384546 |
| 4 | 218.59434 | 23.426739 |


In [9]:
# Display column names and dataset shape
df.columns, df.shape

(Index(['Horse Power', 'Fuel Economy (MPG)'], dtype='object'), (100, 2))

In [10]:
# Summary statistics
df.describe()

| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |


In [11]:
# Check for missing values
df.isna().sum()

| Column | Missing values |
|---|---|
| Horse Power | 0 |
| Fuel Economy (MPG) | 0 |


**Missing Values:**  
There are no missing values in the dataset. Therefore, no data imputation or removal is required.

### 1.2 Train/Test Split

In [14]:
# Feature and target
X = df[["Fuel Economy (MPG)"]]
y = df["Horse Power"]

In [15]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [16]:
X_train.shape, X_test.shape

((70, 1), (30, 1))

The dataset is randomly split into 70% training data and 30% testing data.
A fixed random_state is used to ensure reproducibility.
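The effect of a fixed `random_state` can be checked directly. The sketch below uses a made-up 100-row array (`X_demo`, `y_demo` are hypothetical stand-ins, not the FuelEconomy data) and confirms that two splits with the same seed select the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data with the same number of rows as the dataset (assumption).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The selected rows are identical, so downstream metrics are reproducible.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split, a_train.shape, a_test.shape)
```

Changing the seed would give a different split, and with only 30 test rows the reported metrics can shift noticeably from one split to another.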

### 1.3 Model Training: Linear and Polynomial Regression

In [19]:
# Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

In [20]:
# Polynomial Regression (degree 2)
poly2 = PolynomialFeatures(degree=2)
X_train_poly2 = poly2.fit_transform(X_train)

poly2_reg = LinearRegression()
poly2_reg.fit(X_train_poly2, y_train)

In [21]:
# Polynomial Regression (degree 3)
poly3 = PolynomialFeatures(degree=3)
X_train_poly3 = poly3.fit_transform(X_train)

poly3_reg = LinearRegression()
poly3_reg.fit(X_train_poly3, y_train)

In [22]:
# Polynomial Regression (degree 4)
poly4 = PolynomialFeatures(degree=4)
X_train_poly4 = poly4.fit_transform(X_train)

poly4_reg = LinearRegression()
poly4_reg.fit(X_train_poly4, y_train)

In this section, I train a linear regression model and polynomial regression models
with degrees 2, 3, and 4 to predict horsepower (HP). Polynomial models are implemented
using PolynomialFeatures followed by LinearRegression, without any regularization.
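The same PolynomialFeatures-then-LinearRegression pattern can also be written as a scikit-learn pipeline, which keeps the transform and the regressor together so the test data cannot be transformed with the wrong object. This is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and the coefficients used to generate them are assumptions), not a replacement for the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption), roughly shaped like MPG -> horsepower.
rng = np.random.default_rng(0)
X_demo = rng.uniform(10, 35, size=(100, 1))
y_demo = 400 - 8 * X_demo[:, 0] + rng.normal(0, 15, size=100)

# One pipeline per degree; degree 1 corresponds to plain linear regression.
models = {
    degree: make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    for degree in (1, 2, 3, 4)
}

train_r2 = {}
for degree, model in models.items():
    model.fit(X_demo, y_demo)
    # score() returns R² on the data passed in (here, the training data).
    train_r2[degree] = model.score(X_demo, y_demo)
    print(degree, round(train_r2[degree], 4))
```

Because each higher-degree model contains the lower-degree ones as special cases, training R² should not decrease as the degree grows, which matches the trend reported for the training set.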

### 1.4 Model Evaluation

For each model, performance is evaluated on both the training and testing sets
using mean squared error (MSE), mean absolute error (MAE), and R².
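As a compact illustration of the three metrics, the sketch below wraps them in a small helper. The function name `evaluate` and the example numbers are hypothetical; the cells that follow compute the same quantities one by one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return MSE, MAE, and R² for one set of predictions."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }


# Small hand-checkable example with made-up numbers.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 30.0])

metrics = evaluate(y_true, y_pred)
print(metrics)
```

Here the errors are 2, 2, and 0, so MSE is 8/3, MAE is 4/3, and R² is 1 - 8/200 = 0.96.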

In [24]:
# Linear Regression predictions
y_train_pred_lin = linear_reg.predict(X_train)
y_test_pred_lin = linear_reg.predict(X_test)

# Linear Regression metrics
linear_train_mse = mean_squared_error(y_train, y_train_pred_lin)
linear_test_mse = mean_squared_error(y_test, y_test_pred_lin)

linear_train_mae = mean_absolute_error(y_train, y_train_pred_lin)
linear_test_mae = mean_absolute_error(y_test, y_test_pred_lin)

linear_train_r2 = r2_score(y_train, y_train_pred_lin)
linear_test_r2 = r2_score(y_test, y_test_pred_lin)

In [25]:
# Polynomial Regression (degree 2) predictions
X_test_poly2 = poly2.transform(X_test)

y_train_pred_poly2 = poly2_reg.predict(X_train_poly2)
y_test_pred_poly2 = poly2_reg.predict(X_test_poly2)

# Polynomial degree 2 metrics
poly2_train_mse = mean_squared_error(y_train, y_train_pred_poly2)
poly2_test_mse = mean_squared_error(y_test, y_test_pred_poly2)

poly2_train_mae = mean_absolute_error(y_train, y_train_pred_poly2)
poly2_test_mae = mean_absolute_error(y_test, y_test_pred_poly2)

poly2_train_r2 = r2_score(y_train, y_train_pred_poly2)
poly2_test_r2 = r2_score(y_test, y_test_pred_poly2)

In [26]:
# Polynomial Regression (degree 3) predictions
X_test_poly3 = poly3.transform(X_test)

y_train_pred_poly3 = poly3_reg.predict(X_train_poly3)
y_test_pred_poly3 = poly3_reg.predict(X_test_poly3)

# Polynomial degree 3 metrics
poly3_train_mse = mean_squared_error(y_train, y_train_pred_poly3)
poly3_test_mse = mean_squared_error(y_test, y_test_pred_poly3)

poly3_train_mae = mean_absolute_error(y_train, y_train_pred_poly3)
poly3_test_mae = mean_absolute_error(y_test, y_test_pred_poly3)

poly3_train_r2 = r2_score(y_train, y_train_pred_poly3)
poly3_test_r2 = r2_score(y_test, y_test_pred_poly3)

In [27]:
# Polynomial Regression (degree 4) predictions
X_test_poly4 = poly4.transform(X_test)

y_train_pred_poly4 = poly4_reg.predict(X_train_poly4)
y_test_pred_poly4 = poly4_reg.predict(X_test_poly4)

# Polynomial degree 4 metrics
poly4_train_mse = mean_squared_error(y_train, y_train_pred_poly4)
poly4_test_mse = mean_squared_error(y_test, y_test_pred_poly4)

poly4_train_mae = mean_absolute_error(y_train, y_train_pred_poly4)
poly4_test_mae = mean_absolute_error(y_test, y_test_pred_poly4)

poly4_train_r2 = r2_score(y_train, y_train_pred_poly4)
poly4_test_r2 = r2_score(y_test, y_test_pred_poly4)

In [28]:
# Create a results table
results = pd.DataFrame({
    "Model": ["Linear", "Poly Degree 2", "Poly Degree 3", "Poly Degree 4"]
})

In [31]:
# Add MSE
results["Train MSE"] = [
    linear_train_mse,
    poly2_train_mse,
    poly3_train_mse,
    poly4_train_mse
]

results["Test MSE"] = [
    linear_test_mse,
    poly2_test_mse,
    poly3_test_mse,
    poly4_test_mse
]

In [33]:
# Add MAE
results["Train MAE"] = [
    linear_train_mae,
    poly2_train_mae,
    poly3_train_mae,
    poly4_train_mae
]

results["Test MAE"] = [
    linear_test_mae,
    poly2_test_mae,
    poly3_test_mae,
    poly4_test_mae
]

In [36]:
# Add R²
results["Train R2"] = [
    linear_train_r2,
    poly2_train_r2,
    poly3_train_r2,
    poly4_train_r2
]

results["Test R2"] = [
    linear_test_r2,
    poly2_test_r2,
    poly3_test_r2,
    poly4_test_r2
]

In [37]:
results

| | Model | Train MSE | Test MSE | Train MAE | Test MAE | Train R2 | Test R2 |
|---|---|---|---|---|---|---|---|
| 0 | Linear | 357.69918 | 318.561087 | 16.061689 | 14.940628 | 0.90632 | 0.912561 |
| 1 | Poly Degree 2 | 350.879731 | 331.105434 | 15.995824 | 15.14833 | 0.908106 | 0.909118 |
| 2 | Poly Degree 3 | 345.108668 | 318.404012 | 15.746762 | 14.764973 | 0.909618 | 0.912604 |
| 3 | Poly Degree 4 | 339.700171 | 313.798757 | 15.508465 | 14.735471 | 0.911034 | 0.913868 |


Based on the results in the table above, training error (MSE and MAE) decreases and
training R² increases as the polynomial degree increases. This indicates a
progressively closer fit to the training data.

On the testing set, the pattern is less consistent. The degree 2 model performs slightly
worse than linear regression (test MSE 331.11 vs. 318.56), the degree 3 model is
essentially tied with it (318.40), and the degree 4 model achieves the lowest test MSE
(313.80) and the highest test R² (0.9139). However, the differences between all four
models are small, and with only 30 test observations they should be interpreted with caution.

### 1.5 Discussion and Interpretation

**Which model performs best on the test set and why?**  
Based on the evaluation results, the polynomial regression model with degree 4
performs best on the test set. It achieves the lowest test MSE (313.80) and the
highest test R² (0.9139) among all models. This suggests that the degree 4 model
captures some mild nonlinearity in the relationship between fuel economy and
horsepower. The margin over linear regression is small, however (test R² of 0.9139
vs. 0.9126), so the relationship appears to be close to linear.


**Does increasing polynomial degree always improve performance?**  
No. Training error consistently decreases as model complexity increases, but test
performance does not follow the same pattern: the degree 2 model has a higher test
MSE than linear regression, and the improvement from degree 3 to degree 4 is small.
This suggests diminishing returns from increasing the polynomial degree beyond a
certain point.
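Since the test-set differences between degrees are small, one way to check whether they are meaningful is k-fold cross-validation, which averages the error over several splits instead of relying on a single 30-row test set. This is a technique not used in the notebook above; the sketch below runs on synthetic data (`X_demo`, `y_demo` are assumptions), so its numbers say nothing about the FuelEconomy results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(10, 35, size=(100, 1))
y_demo = 400 - 8 * X_demo[:, 0] + rng.normal(0, 15, size=100)

# Shuffled 5-fold split with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_mse = {}
for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated MSE so that larger is always better.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()
    print(degree, round(cv_mse[degree], 2), round(scores.std(), 2))
```

If the spread across folds is larger than the gap between two degrees, the ranking of those degrees is probably not reliable.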

**Why might a model perform poorly?**  
One possible reason is underfitting or overfitting. For example, linear regression
may underfit the data, as reflected by its higher training error compared to
polynomial models. Another possible reason is insufficient feature information.
Fuel economy alone may not capture all factors influencing horsepower, and additional
features such as engine size or vehicle weight could improve predictive performance.

# Part 2
### 2.1 Load and Inspect the Dataset

In [38]:
# Load the dataset
df_weather = pd.read_csv("electricity_consumption_based_weather_dataset.csv")

In [39]:
# Display first few rows
df_weather.head()

| | date | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|---|
| 0 | 2006-12-16 | 2.5 | 0.0 | 10.6 | 5.0 | 1209.176 |
| 1 | 2006-12-17 | 2.6 | 0.0 | 13.3 | 5.6 | 3390.46 |
| 2 | 2006-12-18 | 2.4 | 0.0 | 15.0 | 6.7 | 2203.826 |
| 3 | 2006-12-19 | 2.4 | 0.0 | 7.2 | 2.2 | 1666.194 |
| 4 | 2006-12-20 | 2.4 | 0.0 | 7.2 | 1.1 | 2225.748 |


In [40]:
# Display column names and dataset shape
df_weather.columns, df_weather.shape

(Index(['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption'], dtype='object'),
 (1433, 6))

In [41]:
# Summary statistics
df_weather.describe()

| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| count | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| mean | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |
| max | 10.2 | 192.3 | 39.4 | 27.2 | 4773.386 |


**Dependent Variable:**  
The dependent variable in this dataset is `daily_consumption`, which represents
daily electricity consumption.

In [42]:
# Check for missing values
df_weather.isna().sum()

| Column | Missing values |
|---|---|
| date | 0 |
| AWND | 15 |
| PRCP | 0 |
| TMAX | 0 |
| TMIN | 0 |
| daily_consumption | 0 |


**Missing Values:**  
The variable `AWND` has 15 missing entries (about 1% of the 1,433 rows); all other
variables are complete. Rows with missing `AWND` values are removed from the dataset
before model training.
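For comparison, the sketch below shows row removal restricted to `AWND` next to one common alternative, median imputation. The toy frame and its values are made up; the imputation option is shown only as an illustration and is not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "AWND": [2.5, np.nan, 2.4, 3.1],
    "TMAX": [10.6, 13.3, 15.0, 7.2],
    "daily_consumption": [1209.176, 3390.46, 2203.826, 1666.194],
})

# Option 1: drop only the rows where AWND is missing.
dropped = toy.dropna(subset=["AWND"])

# Option 2 (alternative, not used here): fill missing AWND with the column median.
awnd_median = toy["AWND"].median()
imputed = toy.assign(AWND=toy["AWND"].fillna(awnd_median))

print(len(dropped), awnd_median, int(imputed["AWND"].isna().sum()))
```

With only about 1% of rows affected, dropping them loses little data. If imputation were used instead, the median would ideally be computed from the training split only, so that no test information leaks into training.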

In [43]:
# Remove rows with missing values
df_weather = df_weather.dropna()

In [44]:
df_weather.shape

(1418, 6)

### 2.2 Train/Test Split

The dataset is randomly split into 70% training data and 30% testing data.
A fixed random_state is used to ensure reproducibility.

In [45]:
# Define features and target
X = df_weather.drop(columns=["daily_consumption", "date"])
y = df_weather["daily_consumption"]

In [46]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [47]:
X_train.shape, X_test.shape

((992, 4), (426, 4))

The dependent variable is `daily_consumption`, and the four weather variables (`AWND`, `PRCP`, `TMAX`, `TMIN`) are used as input features. The `date` column is excluded from the feature set.

### 2.3 Model Training: Linear and Polynomial Regression

In this section, the following regression models are trained to predict
`daily_consumption`:

- Linear Regression  
- Polynomial Regression (degree 2)  
- Polynomial Regression (degree 3)  
- Polynomial Regression (degree 4)

Polynomial regression models are implemented using PolynomialFeatures followed by
LinearRegression. No regularization is used.
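With several input features, the number of columns produced by PolynomialFeatures grows quickly with the degree, because all cross terms are included. The sketch below counts the output columns for four inputs and compares them with the closed-form count C(n + d, d); the array `X_demo` is a placeholder that only fixes the number of input columns.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 4  # AWND, PRCP, TMAX, TMIN
X_demo = np.zeros((1, n_features))  # placeholder; only its width matters here

n_columns = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree).fit(X_demo)
    n_columns[degree] = poly.n_output_features_
    # Number of monomials of total degree <= d in n variables, bias included.
    expected = comb(n_features + degree, degree)
    print(degree, n_columns[degree], expected)
```

At degree 4 the expansion has 70 columns, which are fitted without regularization on 992 training rows.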

In [48]:
# Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

In [49]:
# Polynomial Regression (degree 2)
poly2 = PolynomialFeatures(degree=2)
X_train_poly2 = poly2.fit_transform(X_train)

poly2_reg = LinearRegression()
poly2_reg.fit(X_train_poly2, y_train)

In [50]:
# Polynomial Regression (degree 3)
poly3 = PolynomialFeatures(degree=3)
X_train_poly3 = poly3.fit_transform(X_train)

poly3_reg = LinearRegression()
poly3_reg.fit(X_train_poly3, y_train)

In [51]:
# Polynomial Regression (degree 4)
poly4 = PolynomialFeatures(degree=4)
X_train_poly4 = poly4.fit_transform(X_train)

poly4_reg = LinearRegression()
poly4_reg.fit(X_train_poly4, y_train)

### 2.4 Model Evaluation

For each model, performance is evaluated on both the training and testing sets
using mean squared error (MSE), mean absolute error (MAE), and R².

In [52]:
# Linear Regression predictions
y_train_pred_lin = linear_reg.predict(X_train)
y_test_pred_lin = linear_reg.predict(X_test)

# Linear Regression metrics
linear_train_mse = mean_squared_error(y_train, y_train_pred_lin)
linear_test_mse = mean_squared_error(y_test, y_test_pred_lin)

linear_train_mae = mean_absolute_error(y_train, y_train_pred_lin)
linear_test_mae = mean_absolute_error(y_test, y_test_pred_lin)

linear_train_r2 = r2_score(y_train, y_train_pred_lin)
linear_test_r2 = r2_score(y_test, y_test_pred_lin)

In [53]:
# Polynomial Regression (degree 2) predictions
X_test_poly2 = poly2.transform(X_test)

y_train_pred_poly2 = poly2_reg.predict(X_train_poly2)
y_test_pred_poly2 = poly2_reg.predict(X_test_poly2)

# Degree 2 metrics
poly2_train_mse = mean_squared_error(y_train, y_train_pred_poly2)
poly2_test_mse = mean_squared_error(y_test, y_test_pred_poly2)

poly2_train_mae = mean_absolute_error(y_train, y_train_pred_poly2)
poly2_test_mae = mean_absolute_error(y_test, y_test_pred_poly2)

poly2_train_r2 = r2_score(y_train, y_train_pred_poly2)
poly2_test_r2 = r2_score(y_test, y_test_pred_poly2)

In [54]:
# Polynomial Regression (degree 3) predictions
X_test_poly3 = poly3.transform(X_test)

y_train_pred_poly3 = poly3_reg.predict(X_train_poly3)
y_test_pred_poly3 = poly3_reg.predict(X_test_poly3)

# Degree 3 metrics
poly3_train_mse = mean_squared_error(y_train, y_train_pred_poly3)
poly3_test_mse = mean_squared_error(y_test, y_test_pred_poly3)

poly3_train_mae = mean_absolute_error(y_train, y_train_pred_poly3)
poly3_test_mae = mean_absolute_error(y_test, y_test_pred_poly3)

poly3_train_r2 = r2_score(y_train, y_train_pred_poly3)
poly3_test_r2 = r2_score(y_test, y_test_pred_poly3)

In [55]:
# Polynomial Regression (degree 4) predictions
X_test_poly4 = poly4.transform(X_test)

y_train_pred_poly4 = poly4_reg.predict(X_train_poly4)
y_test_pred_poly4 = poly4_reg.predict(X_test_poly4)

# Degree 4 metrics
poly4_train_mse = mean_squared_error(y_train, y_train_pred_poly4)
poly4_test_mse = mean_squared_error(y_test, y_test_pred_poly4)

poly4_train_mae = mean_absolute_error(y_train, y_train_pred_poly4)
poly4_test_mae = mean_absolute_error(y_test, y_test_pred_poly4)

poly4_train_r2 = r2_score(y_train, y_train_pred_poly4)
poly4_test_r2 = r2_score(y_test, y_test_pred_poly4)

The table below summarizes the performance of each model on the training and
testing sets.

In [56]:
# Create results table
results_part2 = pd.DataFrame({
    "Model": [
        "Linear",
        "Poly Degree 2",
        "Poly Degree 3",
        "Poly Degree 4"
    ]
})

In [58]:
# Add MSE
results_part2["Train MSE"] = [
    linear_train_mse,
    poly2_train_mse,
    poly3_train_mse,
    poly4_train_mse
]

results_part2["Test MSE"] = [
    linear_test_mse,
    poly2_test_mse,
    poly3_test_mse,
    poly4_test_mse
]

In [59]:
# Add MAE
results_part2["Train MAE"] = [
    linear_train_mae,
    poly2_train_mae,
    poly3_train_mae,
    poly4_train_mae
]

results_part2["Test MAE"] = [
    linear_test_mae,
    poly2_test_mae,
    poly3_test_mae,
    poly4_test_mae
]

In [60]:
# Add R²
results_part2["Train R2"] = [
    linear_train_r2,
    poly2_train_r2,
    poly3_train_r2,
    poly4_train_r2
]

results_part2["Test R2"] = [
    linear_test_r2,
    poly2_test_r2,
    poly3_test_r2,
    poly4_test_r2
]

In [61]:
results_part2

| | Model | Train MSE | Test MSE | Train MAE | Test MAE | Train R2 | Test R2 |
|---|---|---|---|---|---|---|---|
| 0 | Linear | 272403.396174 | 248125.8 | 384.465016 | 375.404537 | 0.276 | 0.299333 |
| 1 | Poly Degree 2 | 264765.769932 | 255268.5 | 379.648753 | 379.039083 | 0.2963 | 0.279163 |
| 2 | Poly Degree 3 | 259249.53487 | 265623.7 | 375.952901 | 385.235167 | 0.310961 | 0.249922 |
| 3 | Poly Degree 4 | 251909.339001 | 12151490.0 | 372.116566 | 578.642201 | 0.33047 | -33.313844 |


### 2.5 Discussion and Interpretation

**Which model generalizes best on the test set, and what does this suggest about the
relationship between weather and electricity consumption?**  

Among all models, linear regression achieves the best test performance, with the
highest test R² (approximately 0.30) and the lowest test MSE and test MAE (375.4). This
suggests that while weather variables do influence electricity consumption, the
relationship is relatively weak and only partially captured by the available
features. Weather alone is insufficient to fully explain variations in daily
electricity usage.

**Do polynomial models improve performance compared to linear regression? Why might
electricity consumption depend nonlinearly on weather?**  

Polynomial models of degree 2 and 3 provide only marginal improvements in training
performance and do not improve test performance compared to linear regression. Some
nonlinear effects plausibly exist, such as increased heating or cooling demand at
temperature extremes, but these models do not capture them in a way that generalizes.
The overall relationship between weather and electricity consumption therefore remains
weak without additional contextual features.


**Why do higher-degree polynomial models perform worse on the test set?**  

The degree 4 polynomial model exhibits clear overfitting. Although its training R²
increases to 0.33, its test error increases dramatically, and its test R² is negative (-33.31). This pattern of decreasing training error and sharply increasing test error indicates that the model fits noise in the training data rather than generalizable patterns. The test MSE grows far more than the test MAE, which suggests that a small number of extreme predictions account for most of the error.
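A negative R² can look surprising, so a small worked example may help. R² is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean, so it drops below zero whenever the model does worse than that constant baseline. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values: the predictions are worse than predicting the mean (2.0).
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # 4 + 1 + 0 = 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 1 + 0 + 1 = 2
r2_manual = 1 - ss_res / ss_tot                  # 1 - 2.5 = -1.5

r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

A test R² of -33.31 therefore means the squared error on the test set is roughly 34 times that of a constant prediction at the test-set mean.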

**Why do none of the models achieve strong test performance?**  

One reason is the limited feature set. Electricity consumption is influenced by many
factors beyond weather, including occupancy patterns, human behavior, and economic
activity, which are not captured in this dataset. Additionally, seasonal and temporal
effects are not explicitly modeled because the `date` column is excluded from the
features, which further limits predictive performance.
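One possible way to expose seasonal and weekly patterns to a model is to derive calendar features from the `date` column. The sketch below is an illustration only: it uses a toy frame built from the first three dates shown earlier, and the new column names (`month`, `day_of_week`, `is_weekend`) are hypothetical choices rather than part of the dataset.

```python
import pandas as pd

# Toy frame with the first three dates from the dataset preview.
toy = pd.DataFrame({
    "date": ["2006-12-16", "2006-12-17", "2006-12-18"],
    "TMAX": [10.6, 13.3, 15.0],
})

# Parse the strings into datetimes, then derive simple calendar features.
toy["date"] = pd.to_datetime(toy["date"])
toy["month"] = toy["date"].dt.month
toy["day_of_week"] = toy["date"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["day_of_week"] >= 5

print(toy[["date", "month", "day_of_week", "is_weekend"]])
```

Whether such features would actually improve the test metrics here is untested; they would need to be evaluated with the same train/test procedure.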