# Feature Engineering & Improved Modeling Notebook

In the previous phase, models trained on weather variables alone achieved low R² values, indicating limited explanatory power.
This notebook introduces temporal, categorical, and autoregressive features to better capture systematic patterns in bakery sales.

## Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor


## Load data

In [2]:
DATA_PATH = "../merged_daily_sales_weather.csv"

df = pd.read_csv(DATA_PATH, parse_dates=["Datum"])
df = df.sort_values("Datum").reset_index(drop=True)

df.head()


|   | id | Datum | Warengruppe | Umsatz | Bewoelkung | Temperatur | Windgeschwindigkeit | Wettercode | KielerWoche |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1307011 | 2013-07-01 | 1 | 148.828353 | 6.0 | 17.8375 | 15.0 | 20.0 | 0 |
| 1 | 1307013 | 2013-07-01 | 3 | 201.198426 | 6.0 | 17.8375 | 15.0 | 20.0 | 0 |
| 2 | 1307015 | 2013-07-01 | 5 | 317.475875 | 6.0 | 17.8375 | 15.0 | 20.0 | 0 |
| 3 | 1307012 | 2013-07-01 | 2 | 535.856285 | 6.0 | 17.8375 | 15.0 | 20.0 | 0 |
| 4 | 1307014 | 2013-07-01 | 4 | 65.890169 | 6.0 | 17.8375 | 15.0 | 20.0 | 0 |


## Temporal features
Bakery sales exhibit strong weekly and seasonal patterns driven by human routines rather than weather alone.

In [3]:
df["day_of_week"] = df["Datum"].dt.weekday
df["month"] = df["Datum"].dt.month
df["day_of_year"] = df["Datum"].dt.dayofyear

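# Cyclical encoding: map weekday and day of year onto the unit circle
# so that neighbouring values (e.g. Sunday and Monday) stay close together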
df["dow_sin"] = np.sin(2 * np.pi * df["day_of_week"] / 7)
df["dow_cos"] = np.cos(2 * np.pi * df["day_of_week"] / 7)

df["doy_sin"] = np.sin(2 * np.pi * df["day_of_year"] / 365)
df["doy_cos"] = np.cos(2 * np.pi * df["day_of_year"] / 365)
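
The motivation for the sine/cosine pairs can be checked numerically. The sketch below is illustrative only: `cyclical_encode` is a hypothetical helper, not part of the notebook. It encodes the seven weekdays and compares distances between encoded days.

```python
import numpy as np

def cyclical_encode(values, period):
    """Hypothetical helper: map periodic values to (sin, cos) pairs on the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Weekdays in the pandas dt.weekday convention: Monday=0 ... Sunday=6
dow = cyclical_encode(np.arange(7), period=7)

# Euclidean distances between encoded days
dist_sun_mon = np.linalg.norm(dow[6] - dow[0])  # neighbours across the week boundary
dist_mon_tue = np.linalg.norm(dow[0] - dow[1])  # neighbours within the week
dist_mon_thu = np.linalg.norm(dow[0] - dow[3])  # three days apart

print(f"Sun-Mon: {dist_sun_mon:.3f}  Mon-Tue: {dist_mon_tue:.3f}  Mon-Thu: {dist_mon_thu:.3f}")
```

With the raw integer weekday, Sunday (6) and Monday (0) would be six units apart. In the encoded space they are exactly as close as Monday and Tuesday.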


## Lag features

In [4]:
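# Row-wise lags: shift() follows row order, not calendar days or product groups
df["lag_1"] = df["Umsatz"].shift(1)
df["lag_7"] = df["Umsatz"].shift(7)

# Mean over a 7-row window; the window includes the current row
df["rolling_7"] = df["Umsatz"].rolling(window=7).mean()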
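
Because the data holds one row per product group per day, a plain `shift(1)` reads the previous row, which can belong to a different product group on the same date. The rolling mean also averages over a window that contains the current target value. One possible alternative, shown here on a hypothetical toy frame rather than the real data, computes lags within each `Warengruppe` and shifts before rolling. It assumes each group has exactly one row per day with no gaps, so that shifting by k rows equals shifting by k days.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two product groups, ten consecutive days each
dates = pd.date_range("2013-07-01", periods=10, freq="D")
toy = pd.DataFrame({
    "Datum": dates.repeat(2),
    "Warengruppe": np.tile([1, 2], 10),
    "Umsatz": np.arange(20, dtype=float),
})

# Order rows by group, then by date, so each group's history is contiguous
toy = toy.sort_values(["Warengruppe", "Datum"]).reset_index(drop=True)
grouped = toy.groupby("Warengruppe")["Umsatz"]

# Lags taken within each product group
toy["lag_1"] = grouped.shift(1)
toy["lag_7"] = grouped.shift(7)

# Rolling mean over the previous seven observations, excluding the current row
toy["rolling_7"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=7).mean()
)

print(toy[toy["Warengruppe"] == 1].head(9))
```

In this variant the first row of every group has no `lag_1`, and `rolling_7` is defined only from the eighth observation of each group onward.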


In [5]:
# Drop rows with NaN in any column; this removes the first rows, whose lag features are undefined
df = df.dropna().reset_index(drop=True)


## Categorical Handling

Different product groups exhibit distinct sales dynamics and should be modeled explicitly.

In [6]:
# One-hot encode Warengruppe; drop_first=True drops the first level, which becomes the reference category
df = pd.get_dummies(df, columns=["Warengruppe"], drop_first=True)
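
The effect of `drop_first=True` is easiest to see on a small example. The frame below is hypothetical and only illustrates the column layout that `pd.get_dummies` produces.

```python
import pandas as pd

# Hypothetical toy frame with three product groups
toy = pd.DataFrame({
    "Warengruppe": [1, 2, 3, 1],
    "Umsatz": [10.0, 20.0, 30.0, 12.0],
})

encoded = pd.get_dummies(toy, columns=["Warengruppe"], drop_first=True)

# Group 1 has no column of its own: it is the row where all dummies are false/zero
print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns in the linear model, where the intercept already absorbs the reference group.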


In [7]:
TARGET = "Umsatz"

FEATURES = [
    "Bewoelkung",
    "Temperatur",
    "Windgeschwindigkeit",
    "KielerWoche",
    "dow_sin", "dow_cos",
    "doy_sin", "doy_cos",
    "lag_1", "lag_7", "rolling_7"
]

FEATURES += [col for col in df.columns if col.startswith("Warengruppe_")]

X = df[FEATURES]
y = df[TARGET]


## Time-Aware Train/Test Split

A temporal split is used to prevent information leakage from future sales: the first 80% of rows in date order form the training set and the remaining 20% form the test set.

In [8]:
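# Rows are sorted by date, so the first 80% are the earliest observations
split_idx = int(len(df) * 0.8)

X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# Fit the scaler on the training rows only, then apply the same transform to the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)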
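
A single 80/20 split gives one estimate of out-of-sample error. The `TimeSeriesSplit` class imported at the top of the notebook could be used to obtain several, each with a training window that ends before its test window begins. The following is a minimal sketch on hypothetical synthetic data (`X_demo`, `y_demo`), not a reproduction of the notebook's analysis. Wrapping the scaler in a pipeline is one way to make sure it is refit on each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; rows are assumed to be in chronological order
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

tscv = TimeSeriesSplit(n_splits=5)
fold_mae = []

for train_idx, test_idx in tscv.split(X_demo):
    # The scaler is fit inside the pipeline, on this fold's training rows only
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_hat = model.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], y_hat))

print([round(m, 3) for m in fold_mae])
```

The spread of the per-fold errors indicates how stable a model's performance is across time periods.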


## Models
### Baseline (Linear regression)

In [9]:
lin_model = LinearRegression()
lin_model.fit(X_train_scaled, y_train)

y_pred_lin = lin_model.predict(X_test_scaled)


### Neural Network

In [10]:
nn_model = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    max_iter=500,
    random_state=42
)

nn_model.fit(X_train_scaled, y_train)

y_pred_nn = nn_model.predict(X_test_scaled)




## Evaluation

In [11]:
def evaluate(y_true, y_pred):
    return {
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred)
    }

results = pd.DataFrame([
    {"Model": "Linear (Engineered)", **evaluate(y_test, y_pred_lin)},
    {"Model": "Neural Net (Engineered)", **evaluate(y_test, y_pred_nn)}
])

results


|   | Model | RMSE | MAE | R2 |
|---|---|---|---|---|
| 0 | Linear (Engineered) | 63.635084 | 44.488544 | 0.743705 |
| 1 | Neural Net (Engineered) | 47.012853 | 28.59134 | 0.860112 |
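
For reference, the three metrics returned by `evaluate` can be computed by hand. The sketch below uses small hypothetical arrays, not the notebook's predictions, and follows the standard definitions of RMSE, MAE, and R².

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 260.0])

errors = y_true - y_pred

# Root mean squared error: penalises large errors more strongly
rmse = np.sqrt(np.mean(errors ** 2))

# Mean absolute error: average error magnitude in the units of the target
mae = np.mean(np.abs(errors))

# R^2: one minus the ratio of residual to total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

RMSE and MAE are in the same units as `Umsatz`, whereas R² is unitless and measures improvement over always predicting the mean of the evaluated targets.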
