# Baseline Model: Linear Regression

This notebook establishes the baseline performance for predicting `value_0` using the raw electricity load features from the OpenML Electricity Load Diagrams dataset.  

We use a chronological 80/20 split and apply Linear Regression with and without feature standardization (`StandardScaler`) to show:

- the raw baseline RMSE,
- the effect of scaling (expected to leave OLS predictions unchanged),
- why the baseline underperforms,
- and how this motivates improved models such as Ridge, SRP, and Gradient Boosting.


In [9]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import json

df = pd.read_pickle("../data/cleaned_data.pkl")  # adjust path if needed
print(df.shape)
df.head()


(105217, 323)


| | id_series | date | value_0 | value_1 | value_2 | value_3 | value_4 | value_5 | value_6 | value_7 | ... | value_311 | value_312 | value_313 | value_314 | value_315 | time_step | hour | day | month | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2012-01-01 00:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 6 |
| 1 | 0 | 2012-01-01 00:15:00 | 3.807107 | 22.759602 | 77.324066 | 136.178862 | 70.731707 | 351.190476 | 9.609949 | 279.461279 | ... | 15.645372 | 12.873025 | 504.828797 | 63.439065 | 761.730205 | 1 | 0 | 1 | 1 | 6 |
| 2 | 0 | 2012-01-01 00:30:00 | 5.076142 | 22.759602 | 77.324066 | 136.178862 | 73.170732 | 354.166667 | 9.044658 | 279.461279 | ... | 15.645372 | 13.458163 | 525.021949 | 60.100167 | 702.346041 | 2 | 0 | 1 | 1 | 6 |
| 3 | 0 | 2012-01-01 00:45:00 | 3.807107 | 22.759602 | 77.324066 | 140.243902 | 69.512195 | 348.214286 | 8.479367 | 279.461279 | ... | 15.645372 | 10.532475 | 526.777875 | 56.761269 | 696.480938 | 3 | 0 | 1 | 1 | 6 |
| 4 | 0 | 2012-01-01 01:00:00 | 3.807107 | 22.759602 | 77.324066 | 140.243902 | 75.609756 | 339.285714 | 7.348785 | 279.461279 | ... | 15.645372 | 14.628438 | 539.947322 | 63.439065 | 693.548387 | 4 | 1 | 1 | 1 | 6 |


## Feature and Target Selection

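For the baseline, we predict `value_0` from all remaining columns except the metadata fields `date`, `id_series`, and `time_step`.  
This leaves 319 features (the other 315 meters plus `hour`, `day`, `month`, and `weekday`) and gives us a simple but consistent benchmark. The same cell also applies the chronological 80/20 split by row position.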

In [10]:
target = "value_0"

drop_cols = ["date", "id_series", "time_step"]
X = df.drop(columns=drop_cols + [target])
y = df[target]

n = len(df)
split_idx = int(0.8 * n)

X_train = X.iloc[:split_idx]
y_train = y.iloc[:split_idx]

X_test = X.iloc[split_idx:]
y_test = y.iloc[split_idx:]

X_train.shape, X_test.shape


((84173, 319), (21044, 319))

## Time-Series Split

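Since this is time-series data, we disable shuffling so that the model is trained only on past data and tested on future data. With `shuffle=False`, `train_test_split` reproduces the index-based split above, as the identical shapes confirm.

In [3]:
from sklearn.model_selection import train_test_split  # not imported in the first cell

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)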

X_train.shape, X_test.shape

((84173, 319), (21044, 319))
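
Both split variants assume the rows are already sorted by time. The following is a minimal sketch of how that assumption could be checked. It runs on a small synthetic frame (`toy` is a hypothetical stand-in, not the real dataset) and also confirms that the two ways of splitting agree.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 15-minute timestamps, standing in for the real data.
toy = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=10, freq="15min"),
    "value_0": range(10),
})

# Sort by time first so that row position corresponds to chronological order.
toy = toy.sort_values("date").reset_index(drop=True)

# Index-based 80/20 split, as in the notebook.
split_idx = int(0.8 * len(toy))
train, test = toy.iloc[:split_idx], toy.iloc[split_idx:]

# Same split via scikit-learn with shuffling disabled.
train_sk, test_sk = train_test_split(toy, test_size=0.2, shuffle=False)

# No leakage: every training timestamp precedes every test timestamp.
no_leakage = train["date"].max() < test["date"].min()
same_split = train.equals(train_sk) and test.equals(test_sk)
print(len(train), len(test), no_leakage, same_split)
```

On the real data, the same `no_leakage` comparison could be applied to `df["date"]` before the metadata columns are dropped.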

## Train the Baseline Model

We fit a standard Linear Regression model. This model serves as our baseline because it is simple and fast, and provides a reference RMSE for comparison.

In [12]:
lr_raw = LinearRegression()
lr_raw.fit(X_train, y_train)

y_pred_raw = lr_raw.predict(X_test)
mse_raw = mean_squared_error(y_test, y_pred_raw)
rmse_raw = np.sqrt(mse_raw)

print("Baseline Linear Regression RMSE (raw features):", rmse_raw)


Baseline Linear Regression RMSE (raw features): 7.329735736397394
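
For reference, RMSE is the square root of the mean squared prediction error, so it is expressed in the same units as `value_0`. The small sketch below uses made-up numbers (not values from the dataset) to show that the manual formula matches the `np.sqrt(mean_squared_error(...))` pattern used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only; errors are 1, 0, and -2.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Manual formula: sqrt(mean((y - y_hat)^2)) = sqrt((1 + 0 + 4) / 3).
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE, as in the notebook cell.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```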


In [13]:
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])

pipe.fit(X_train, y_train)
y_pred_scaled = pipe.predict(X_test)

mse_scaled = mean_squared_error(y_test, y_pred_scaled)
rmse_scaled = np.sqrt(mse_scaled)

print("Baseline Linear Regression RMSE (with StandardScaler):", rmse_scaled)


Baseline Linear Regression RMSE (with StandardScaler): 7.329735736397361


## Save Baseline Results

We store both RMSE values in a JSON file so later models can be compared against them easily.

In [14]:
results = {
    "linear_regression_raw": float(rmse_raw),
    "linear_regression_scaled": float(rmse_scaled)
}

with open("../results/baseline_results.json", "w") as f:
    json.dump(results, f, indent=2)

results


{'linear_regression_raw': 7.329735736397394,
 'linear_regression_scaled': 7.329735736397361}
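
A later notebook could read this file back and report its own RMSE relative to the baseline. The sketch below is one possible way to do that, under stated assumptions: the helper `relative_improvement` is hypothetical, the file is written to a temporary directory so the example is self-contained, and `candidate_rmse` is a placeholder rather than a real result.

```python
import json
import tempfile
from pathlib import Path


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fractional RMSE reduction relative to the baseline (positive = better)."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ../results/baseline_results.json, using the raw RMSE reported above.
    path = Path(tmp) / "baseline_results.json"
    path.write_text(json.dumps({"linear_regression_raw": 7.329735736397394}, indent=2))

    # Read the stored baseline back, as a later notebook would.
    baseline = json.loads(path.read_text())["linear_regression_raw"]

# Placeholder only: pretend a later model reduces RMSE by 10%. Not a measured value.
candidate_rmse = 0.9 * baseline
gain = relative_improvement(baseline, candidate_rmse)
print(f"Relative RMSE improvement: {gain:.1%}")
```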

## Interpretation of Baseline Results

**Raw Baseline RMSE:** ~7.33  
**Scaled Baseline RMSE:** ~7.33

These results align with our EDA and model theory:

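- Ordinary Least Squares with an intercept produces the same predictions after any per-feature rescaling or shifting; the two RMSE values differ only from the 14th decimal place on, which is floating-point rounding.  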
- Very weak correlations (<0.22) between `value_0` and other meters lead to poor linear predictive power.  
- Scatterplots show strong nonlinear relationships and horizontal banding.  
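- High dimensionality (315 meter features plus 4 calendar features) makes unregularized coefficient estimates prone to instability; large differences in feature scales do not change OLS predictions, but they matter once a penalty is added.  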
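
The scaling behaviour described in this list can be illustrated on synthetic data. The sketch below is not the notebook's experiment: the features, coefficients, and `alpha=10.0` are all assumptions chosen to make the effect visible. It compares predictions with and without `StandardScaler` for OLS and for Ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Three synthetic features on very different scales (0.01, 1, 100).
X = rng.normal(size=(n, 3)) * np.array([0.01, 1.0, 100.0])
y = 50.0 * X[:, 0] + 2.0 * X[:, 1] + 0.03 * X[:, 2] + rng.normal(scale=0.1, size=n)

# Chronological-style split: first 400 rows for training, the rest for testing.
X_tr, X_te, y_tr = X[:400], X[400:], y[:400]


def max_pred_gap(make_model):
    """Largest absolute difference between raw and standardized predictions."""
    raw = make_model().fit(X_tr, y_tr).predict(X_te)
    scaled = make_pipeline(StandardScaler(), make_model()).fit(X_tr, y_tr).predict(X_te)
    return float(np.max(np.abs(raw - scaled)))


ols_gap = max_pred_gap(LinearRegression)              # rounding-level difference
ridge_gap = max_pred_gap(lambda: Ridge(alpha=10.0))   # clearly non-zero difference

print(f"OLS gap:   {ols_gap:.2e}")
print(f"Ridge gap: {ridge_gap:.2e}")
```

The OLS gap stays at rounding level because rescaling a feature is absorbed by its coefficient. The Ridge penalty acts on coefficient size, so the small-scale feature, which needs a large coefficient, is shrunk much more heavily when the inputs are left unscaled.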

### Modeling Implications

- Linear Regression will only serve as the **baseline**.  
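- Standardization is still needed for scale-sensitive methods such as **Ridge** and **Lasso**, whose penalties depend on coefficient size, and is kept for **SRP**, even though it does not improve OLS RMSE. Tree-based **Gradient Boosting** is largely insensitive to monotonic feature rescaling, so scaling is optional there.  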
- These results motivate the improved modeling pipeline built in later notebooks.
