# Modeling Baselines – Wind Turbine Power Prediction

The goal of this notebook is to build and compare several baseline machine learning models to predict the wind turbine **active power (TARGET)** using only sensor measurements (no wind speed or power curve information).

We evaluate four model families to see how much accuracy each gains over a simple linear baseline:
- linear models (Ridge)
- tree-based ensembles (Extra Trees)
- gradient boosting (HistGradientBoostingRegressor)
- neural networks (MLPRegressor)

### 1 — Loading the prepared dataset

We load the cleaned and merged dataset saved previously as a Parquet file (`engie_full.parquet`).

In [1]:
import os
import pandas as pd
import numpy as np

DATA_DIR = "../data"
PARQUET_PATH = os.path.join(DATA_DIR, "engie_full.parquet")

df = pd.read_parquet(PARQUET_PATH)
print("df shape:", df.shape)
df.head()


df shape: (617386, 79)


Unnamed: 0,ID,MAC_CODE,Date_time,Pitch_angle,Pitch_angle_min,Pitch_angle_max,Pitch_angle_std,Hub_temperature,Hub_temperature_min,Hub_temperature_max,...,Rotor_speed_min,Rotor_speed_max,Rotor_speed_std,Rotor_bearing_temperature,Rotor_bearing_temperature_min,Rotor_bearing_temperature_max,Rotor_bearing_temperature_std,Absolute_wind_direction_c,Nacelle_angle_c,TARGET
0,1,WT3,1.0,92.470001,92.470001,92.470001,0.0,7.0,7.0,7.0,...,0.0,0.0,0.0,2.4,2.4,2.4,0.0,294.19,294.23999,-0.703
1,2,WT3,2.0,92.470001,92.470001,92.470001,0.0,7.0,7.0,7.0,...,0.0,0.0,0.0,2.4,2.4,2.4,0.0,297.82999,294.23999,-0.747
2,3,WT3,3.0,92.470001,92.470001,92.470001,0.0,7.0,7.0,7.0,...,0.0,0.0,0.0,2.4,2.4,2.4,0.0,322.20999,294.23999,-0.791
3,4,WT3,4.0,92.470001,92.470001,92.470001,0.0,6.97,6.7,7.0,...,0.0,0.0,0.0,2.4,2.4,2.4,0.0,318.69,294.23999,-0.736
4,5,WT3,5.0,92.470001,92.470001,92.470001,0.0,6.93,6.0,7.0,...,0.0,0.0,0.0,2.4,2.4,2.5,0.0,314.89001,294.23999,-1.055


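### 2 — Single turbine (WT3)

We keep only the records of turbine WT3 and sort them by `Date_time`, so that row order follows time.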

In [2]:
df = df[df["MAC_CODE"] == "WT3"].copy()
df = df.sort_values("Date_time").reset_index(drop=True)
print(df["MAC_CODE"].unique(), df["Date_time"].min(), df["Date_time"].max(), df.shape)

['WT3'] 1.0 157680.0 (154253, 79)


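### 3 — Train / validation / test split

The split is chronological: the first 70% of rows are used for training, the next 15% for validation, and the last 15% for testing, so each model is evaluated on data recorded after its training period.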

In [3]:
target_col = "TARGET"
# Date_time is used only to order the rows for the split, not as a feature.
drop_cols = ["ID", "Date_time", target_col]

n = len(df)
train = df.iloc[:int(0.70*n)]
val   = df.iloc[int(0.70*n):int(0.85*n)]
test  = df.iloc[int(0.85*n):]

X_train = train.drop(columns=drop_cols)
y_train = train[target_col]

X_val = val.drop(columns=drop_cols)
y_val = val[target_col]

X_test = test.drop(columns=drop_cols)
y_test = test[target_col]
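
The slice boundaries used above can be checked in isolation. The sketch below is a minimal illustration, not part of the original pipeline: the helper name `chrono_split_bounds` is hypothetical, and the row count is the WT3 count printed in section 2.

```python
def chrono_split_bounds(n_rows, train_frac=0.70, val_end_frac=0.85):
    """Return (train_end, val_end) row positions for a chronological split.

    Rows [0, train_end) are train, [train_end, val_end) are validation and
    [val_end, n_rows) are test, mirroring the iloc slices of the notebook.
    """
    train_end = int(train_frac * n_rows)
    val_end = int(val_end_frac * n_rows)
    return train_end, val_end


n_rows = 154253  # number of WT3 rows printed in section 2
train_end, val_end = chrono_split_bounds(n_rows)
sizes = {
    "train": train_end,
    "val": val_end - train_end,
    "test": n_rows - val_end,
}
print(sizes)  # {'train': 107977, 'val': 23138, 'test': 23138}
```

Because the rows were sorted by `Date_time` beforehand, these positional boundaries are also temporal boundaries.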


### 4 — Feature preprocessing

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ["MAC_CODE"] if "MAC_CODE" in X_train.columns else []
num_cols = [c for c in X_train.columns if c not in cat_cols]

preprocess_linear = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ],
    remainder="drop"
)

preprocess_tree = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median"))
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ],
    remainder="drop"
)


### 5 — Evaluation and model training

#### 5.1 — Evaluation function

In [5]:
from sklearn.metrics import mean_absolute_error

def eval_model(name, model, X_train, y_train, X_val, y_val, X_test, y_test):
    model.fit(X_train, y_train)
    pred_val = model.predict(X_val)
    pred_test = model.predict(X_test)
    return {
        "model": name,
        "mae_val": float(mean_absolute_error(y_val, pred_val)),
        "mae_test": float(mean_absolute_error(y_test, pred_test)),
    }
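
For reference, `mean_absolute_error` averages the absolute difference between observed and predicted values. The notation below is introduced here for illustration: $y_i$ is the observed `TARGET`, $\hat{y}_i$ the model prediction, and $n$ the number of rows in the evaluated split.

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

MAE is expressed in the same unit as `TARGET`, which makes the scores of the different models directly comparable.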


#### 5.2 — Ridge

In [6]:
from sklearn.linear_model import Ridge

ridge = Pipeline([
    ("prep", preprocess_linear),
    ("model", Ridge(alpha=10.0, random_state=0))
])

res_ridge = eval_model("Ridge", ridge, X_train, y_train, X_val, y_val, X_test, y_test)
res_ridge


{'model': 'Ridge',
 'mae_val': 105.21223650920521,
 'mae_test': 104.15188167106882}

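#### 5.3 — Random Forest (Extra Trees variant)

Note: this cell fits scikit-learn's `ExtraTreesRegressor` (extremely randomized trees), not `RandomForestRegressor`. The `"RandomForest"` label is kept in the results for consistency with the recorded outputs.

In [14]:
from sklearn.ensemble import ExtraTreesRegressor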
from sklearn.pipeline import Pipeline

rf = Pipeline([
    ("prep", preprocess_tree),
    ("model", ExtraTreesRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_leaf=10,
        max_features="sqrt",
        n_jobs=-1,
        random_state=0
    ))
])
res_rf = eval_model("RandomForest", rf, X_train, y_train, X_val, y_val, X_test, y_test)
res_rf

{'model': 'RandomForest',
 'mae_val': 27.088180421277187,
 'mae_test': 31.697817925054775}

#### 5.4 — HistGradientBoostingRegressor

In [12]:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error

hgb = Pipeline([
    ("prep", preprocess_tree),
    ("model", HistGradientBoostingRegressor(
        max_depth=6,
        learning_rate=0.05,
        max_iter=800,
        random_state=0
    ))
])

hgb.fit(X_train, y_train)

print("HGB MAE val:", mean_absolute_error(y_val, hgb.predict(X_val)))
print("HGB MAE test:", mean_absolute_error(y_test, hgb.predict(X_test)))


HGB MAE val: 16.00445999930796
HGB MAE test: 22.556198530476244




In [15]:
res_hgb = eval_model("HistGradientBoostingRegressor", hgb, X_train, y_train, X_val, y_val, X_test, y_test)
res_hgb

{'model': 'HistGradientBoostingRegressor',
 'mae_val': 16.00445999930796,
 'mae_test': 22.556198530476244}

#### 5.5 — MLP (sklearn, fast)

In [13]:
from sklearn.neural_network import MLPRegressor

mlp = Pipeline([
    ("prep", preprocess_linear),
    ("model", MLPRegressor(
        hidden_layer_sizes=(256,128,64),
        activation="relu",
        alpha=1e-4,
        learning_rate_init=1e-3,
        early_stopping=True,
        n_iter_no_change=20,
        max_iter=500,
        random_state=0
    ))
])

res_mlp = eval_model("MLP", mlp, X_train, y_train, X_val, y_val, X_test, y_test)
res_mlp


{'model': 'MLP', 'mae_val': 19.71522262704102, 'mae_test': 25.60927369887037}

### 6 — Results

In [16]:
results = pd.DataFrame([res_ridge, res_rf, res_hgb, res_mlp]).sort_values("mae_val")
results

| | model | mae_val | mae_test |
|---|---|---|---|
| 2 | HistGradientBoostingRegressor | 16.00446 | 22.556199 |
| 3 | MLP | 19.715223 | 25.609274 |
| 1 | RandomForest | 27.08818 | 31.697818 |
| 0 | Ridge | 105.212237 | 104.151882 |


### 7 — Saving the best model and results

In [17]:
import joblib, json
from pathlib import Path

Path("../models").mkdir(exist_ok=True)
Path("../reports").mkdir(exist_ok=True)

best_name = results.iloc[0]["model"]
best_pipeline = {"Ridge": ridge, "RandomForest": rf, "HistGradientBoostingRegressor": hgb, "MLP": mlp}[best_name]

joblib.dump(best_pipeline, f"../models/best_model_{best_name}.joblib")

with open("../reports/results.json", "w") as f:
    json.dump(results.to_dict(orient="records"), f, indent=2)

print("Saved:", best_name)


Saved: HistGradientBoostingRegressor


## 📊 Results interpretation

The baseline models differ widely: validation MAE ranges from about 16 for HistGradientBoosting to about 105 for Ridge.

### Key observations

• Ridge regression performs poorly (validation MAE ≈ 105, versus 16 to 27 for the other models).  
This suggests that the relationship between the sensor measurements and power output is strongly **non-linear** and cannot be captured by a linear model.

• Tree-based ensembles strongly outperform the linear model.  
Both the Extra Trees ensemble (labeled RandomForest, validation MAE ≈ 27) and gradient boosting (≈ 16) can capture non-linear effects and interactions between features such as temperatures, speeds, and angles.

• HistGradientBoosting achieves the best performance (validation MAE ≈ 16.0, test MAE ≈ 22.6).  
Boosting adds trees sequentially, each one fitted to the errors of the current ensemble, which typically gives better accuracy on structured/tabular datasets.

• The MLP performs well (validation MAE ≈ 19.7), ahead of the Extra Trees ensemble, but does not outperform boosting.  
This is consistent with the common observation that neural networks are often less competitive than gradient boosting on medium-sized tabular datasets.
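
The idea of sequentially correcting errors can be illustrated with a toy example. The sketch below is a deliberately simplified form of gradient boosting (squared-error loss, one-split trees, a single feature, synthetic cubic data); all function names are hypothetical, and it is not how `HistGradientBoostingRegressor` is implemented internally, which relies on histogram binning and deeper trees.

```python
def fit_stump(x, residuals):
    """Find the single threshold on 1-D inputs that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost_predictions(x, y, n_rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    preds = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return preds


def mae(y, preds):
    return sum(abs(yi - pi) for yi, pi in zip(y, preds)) / len(y)


# Synthetic non-linear target (assumption for illustration only).
x = [i / 10 for i in range(40)]
y = [xi ** 3 for xi in x]

mae_0 = mae(y, boost_predictions(x, y, n_rounds=0))    # constant prediction
mae_50 = mae(y, boost_predictions(x, y, n_rounds=50))  # after 50 corrections
print(f"training MAE: {mae_0:.2f} (constant) vs {mae_50:.2f} (50 rounds)")
```

Each round only has to model what the previous rounds got wrong, which is why a sum of very weak learners can approximate a strongly non-linear relationship.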

### Model selection

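Based on validation performance, **HistGradientBoostingRegressor is selected as the final model**.

It provides:


✔ lowest MAE on both validation (≈ 16.0) and test (≈ 22.6)  
✔ simpler pipeline (imputation and one-hot encoding only, no feature scaling, no dependency beyond scikit-learn)

Training time was not measured in this notebook, so relative training speed remains to be checked. All non-linear models score worse on test than on validation (by about 6.6 MAE for HistGradientBoosting, 5.9 for the MLP, and 4.6 for Extra Trees), so the validation score is an optimistic estimate of performance on later data.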
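
The gap between validation and test MAE can be computed directly from the reported scores. The sketch below is a minimal illustration using the same record layout as `results.to_dict(orient="records")`; the values are copied from the results table, and the helper name `add_gap` is hypothetical.

```python
# Scores copied from the results table of section 6.
results_records = [
    {"model": "HistGradientBoostingRegressor", "mae_val": 16.00446, "mae_test": 22.556199},
    {"model": "MLP", "mae_val": 19.715223, "mae_test": 25.609274},
    {"model": "RandomForest", "mae_val": 27.08818, "mae_test": 31.697818},
    {"model": "Ridge", "mae_val": 105.212237, "mae_test": 104.151882},
]


def add_gap(records):
    """Return copies of the records with absolute and relative val-to-test gaps."""
    out = []
    for r in records:
        gap = r["mae_test"] - r["mae_val"]
        out.append({**r, "gap": gap, "gap_pct": 100.0 * gap / r["mae_val"]})
    return out


gaps = add_gap(results_records)
best = min(gaps, key=lambda r: r["mae_val"])  # model selection uses validation MAE

for r in gaps:
    print(f'{r["model"]:<32} gap={r["gap"]:+7.2f}  ({r["gap_pct"]:+6.1f}%)')
print("Selected on validation MAE:", best["model"])
```

A positive gap means the model does worse on the later test period than on the validation period; comparing gaps across models is one simple way to check how far validation scores can be trusted.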



