# Machine Learning Lab - Steel Industry
## Part 2: Energy Consumption Prediction

In this part, we will develop machine learning models to predict the energy consumption of a steel plant. This predictive capability is crucial for cost optimization and production planning.

### Objectives:
- Implement different regression models
- Compare their performance
- Interpret the results in an industrial context

### Data structure:

1. **Target variable (to predict):**
   - `Usage_kWh`: Energy consumption in kilowatt-hours

2. **Numerical descriptive variables:**
   - `Lagging_Current_Reactive.Power_kVarh`: Lagging reactive power
   - `Leading_Current_Reactive_Power_kVarh`: Leading reactive power
   - `CO2(tCO2)`: CO2 emissions, in tonnes of CO2
   - `Lagging_Current_Power_Factor`: Lagging power factor
   - `Leading_Current_Power_Factor`: Leading power factor
   - `NSM`: Number of Seconds from Midnight (time)

3. **Categorical variables:**
   - `Day_of_week`: Day of the week (Monday to Sunday)
   - `WeekStatus`: Type of day (Weekday/Weekend)

### Applied preprocessing:
1. Standardization of numerical variables (mean=0, std=1)
2. One-hot encoding of categorical variables
3. Train/test split (80%/20%)

### Models covered:
- Linear regression (baseline)
- K-Nearest Neighbors (KNN)
- Decision Trees
- Random Forest
- Neural Networks

In [None]:
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import (mean_squared_error, r2_score, mean_absolute_error,
                           explained_variance_score, max_error)
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from scipy import stats
import os

# Display configuration
sns.set_theme()  # Use the default seaborn theme
plt.rcParams['figure.figsize'] = [10, 6]

In [None]:
# Download and prepare the data
if not os.path.exists('Steel_industry_data.csv'):
    print("Downloading the data...")
    # Download the zip archive
    !wget -O steel_industry_data.zip https://archive.ics.uci.edu/static/public/851/steel+industry+energy+consumption.zip
    # Extract the archive
    !unzip -o steel_industry_data.zip
    print("Data downloaded and extracted.")
else:
    print("Data file already present.")

# Load the data
try:
    df = pd.read_csv('Steel_industry_data.csv')
    print(f"Data loaded successfully: {df.shape[0]} observations, {df.shape[1]} variables")
except Exception as e:
    print(f"Error while loading the data: {e}")
    raise

# Split the variables into target and feature groups
target = 'Usage_kWh'
numeric_features = [
    'Lagging_Current_Reactive.Power_kVarh',
    'Leading_Current_Reactive_Power_kVarh',
    'CO2(tCO2)',
    'Lagging_Current_Power_Factor',
    'Leading_Current_Power_Factor',
    'NSM'
]
# 'period' is not in the raw file: it is created in the temporal analysis section
categorical_features = ['Day_of_week', 'WeekStatus', 'period']

# Dataset dimensions
print("Dataset dimensions:")
print(f"Number of observations: {df.shape[0]:,}")
print(f"Number of variables: {df.shape[1]:,}")

# Summary statistics of the numerical variables
print("\nSummary statistics of the numerical variables:")
display(df[numeric_features + [target]].describe())

# Distribution of the target variable
plt.figure(figsize=(10, 5))
sns.histplot(data=df, x=target, bins=50)
# /!\ Fill in the '...' to set the figure title (Matplotlib) to: Distribution of energy consumption /!\
plt...
plt.show()

### 1. Temporal analysis and feature creation

We will structure our temporal data at several levels:
1. Days of the week (Monday to Sunday)
2. Type of day (weekday/weekend)
3. Periods of the day (6 blocks of 4 hours)
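
The 4-hour blocks can also be computed arithmetically. The sketch below assumes, as the notebook's code does, that `NSM` counts seconds from midnight and that the industrial day starts at 06:00. The helper name `nsm_to_period` and the label spellings are illustrative choices, not part of the dataset.

```python
import numpy as np

# Hypothetical labels for the six 4-hour blocks, in order from 06:00
PERIOD_LABELS = np.array(["Morning1", "Morning2", "Afternoon1",
                          "Afternoon2", "Night1", "Night2"])

def nsm_to_period(nsm):
    """Map seconds-from-midnight to a 4-hour block, with the day starting at 06:00."""
    hours = np.asarray(nsm) / 3600.0
    # Shift so that 06:00 becomes 0, wrap around midnight, then cut into 4-hour blocks
    block = ((hours - 6) % 24 // 4).astype(int)
    return PERIOD_LABELS[block]

# Midnight, 06:00, 13:15 and 23:00
print(nsm_to_period([0, 6 * 3600, 13 * 3600 + 900, 23 * 3600]))
```

Shifting by 6 hours before the modulo is what lets the block that straddles midnight (22:00 to 02:00) be handled without a special case.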

In [None]:
# Create the periods of the industrial day
def create_industrial_periods(df):
    # Convert NSM (seconds from midnight) to hours
    df['hour'] = df['NSM'] / 3600

    # Six 4-hour periods, with the industrial day starting at 06:00
    conditions = [
        (df['hour'] >= 6) & (df['hour'] < 10),   # Morning1
        (df['hour'] >= 10) & (df['hour'] < 14),  # Morning2
        (df['hour'] >= 14) & (df['hour'] < 18),  # Afternoon1
        (df['hour'] >= 18) & (df['hour'] < 22),  # Afternoon2
        (df['hour'] >= 22) | (df['hour'] < 2),   # Night1
        (df['hour'] >= 2) & (df['hour'] < 6)     # Night2
    ]

    periods = ['Morning1', 'Morning2', 'Afternoon1', 'Afternoon2', 'Night1', 'Night2']
    df['period'] = np.select(conditions, periods, default='Night2')

    return df

# Apply the periods
# /!\ Fill in the '...' to transform df with the create_industrial_periods() function /!\
df = ...

# Visualize the consumption patterns
plt.figure(figsize=(15, 5))

# 1. Consumption distribution by period
plt.subplot(1, 3, 1)
period_order = ['Morning1', 'Morning2', 'Afternoon1', 'Afternoon2', 'Night1', 'Night2']
sns.boxplot(data=df, x='period', y='Usage_kWh', order=period_order)
plt.title('Consumption distribution by period')
plt.xticks(rotation=45)

# 2. Heatmap: period x day
pivot_period_day = pd.pivot_table(df,
                                values='Usage_kWh',
                                index='period',
                                columns='Day_of_week',
                                aggfunc='mean')
plt.subplot(1, 3, 2)
sns.heatmap(pivot_period_day, cmap='YlOrRd', annot=True, fmt='.0f')
plt.title('Mean consumption\nby period and day')

# 3. Weekday/weekend comparison
plt.subplot(1, 3, 3)
sns.boxplot(data=df, x='period', y='Usage_kWh', hue='WeekStatus', order=period_order)
plt.title('Consumption by period\nand type of day')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Detailed statistics
print("\nMean consumption (kWh) by period and type of day:")
pivot_stats = pd.pivot_table(df,
                           values='Usage_kWh',
                           index='period',
                           columns=['WeekStatus', 'Day_of_week'],
                           aggfunc=['mean', 'std'])
display(pivot_stats.round(2))

❓ **Questions about temporal patterns:**

1. **Production cycles**
   - Which period shows the highest consumption? Why?
   - How does consumption evolve between Morning1 and Morning2?

2. **Day/night variations**
   - What is the difference in consumption between daytime and nighttime periods?
   - Is variability greater during the day or at night?

3. **Weekend impact**
   - How does the consumption pattern change on weekends?
   - Which periods show the greatest weekday/weekend difference?
   - What recommendations would you make for energy optimization?

In [None]:
# Data preparation

# 1. Standardize the numerical variables
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_features]),
    columns=numeric_features
)

# 2. Encode the categorical variables
encoder = OneHotEncoder(sparse_output=False)
categorical_encoded = encoder.fit_transform(df[categorical_features])

# Names of the encoded columns
day_names = [f'Day_{day}' for day in encoder.categories_[0]]
week_status_names = [f'Status_{status}' for status in encoder.categories_[1]]
period_names = [f'Period_{period}' for period in encoder.categories_[2]]
encoded_columns = day_names + week_status_names + period_names

# Check the dimensions
print("\nEncoding dimensions:")
print(f"Number of encoded columns: {len(encoded_columns)}")
print(f"Shape of the encoded data: {categorical_encoded.shape}")
print("Encoded categories:")
for i, feature in enumerate(categorical_features):
    print(f"{feature}: {list(encoder.categories_[i])}")

# Build the DataFrame of encoded variables
df_encoded = pd.DataFrame(categorical_encoded, columns=encoded_columns)

# 3. Combine the features
X = pd.concat([df_scaled, df_encoded], axis=1)
# /!\ Fill in the '...' to assign the target column of the dataframe to y /!\
y = ...

print("Structure of the prepared data:")
print(f"Standardized numerical variables: {len(numeric_features)}")
print(f"Encoded categorical columns: {len(encoded_columns)}")
print(f"Final dimensions of X: {X.shape}")

# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nDimensions of the training and test sets:")
print(f"X_train : {X_train.shape}")
print(f"X_test : {X_test.shape}")
print(f"y_train : {y_train.shape}")
print(f"y_test : {y_test.shape}")
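
The cell above fits the scaler and the encoder on the whole dataset before splitting. A common alternative is to split first and fit the preprocessing on the training rows only, so that no test-set statistics reach the model. The sketch below shows that variant with scikit-learn's `ColumnTransformer` on made-up toy data; the toy frame and the names `toy`, `X_tr_prep` and `X_te_prep` are illustrative assumptions, not part of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the real dataset (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "NSM": rng.integers(0, 96, size=200) * 900,
    "WeekStatus": rng.choice(["Weekday", "Weekend"], size=200),
    "Usage_kWh": rng.uniform(0, 100, size=200),
})

X_toy = toy[["NSM", "WeekStatus"]]
y_toy = toy["Usage_kWh"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# sparse_threshold=0.0 forces a dense NumPy array as output
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["NSM"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["WeekStatus"]),
    ],
    sparse_threshold=0.0,
)

# Fit on the training split only, then reuse the same statistics on the test split
X_tr_prep = preprocess.fit_transform(X_tr)
X_te_prep = preprocess.transform(X_te)

print(X_tr_prep.shape, X_te_prep.shape)
```

With this ordering, the standardized training column has mean 0 and standard deviation 1 exactly, while the test column is only approximately standardized, which is the situation the model will face on new data.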

### 2. Linear regression

To understand how linear regression finds its coefficients, let's implement our own version, trained by gradient descent. The model is:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

The loss (mean squared error) is:
L = (1/n) Σ(y_pred - y_true)²

The coefficients are updated as:
β_new = β_old - α * ∂L/∂β
where α is the learning rate
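
For reference, the update rule can be made explicit by differentiating the loss. The derivation below is a sketch in the notation above, with two indexing conventions introduced here for illustration: $\hat{y}_i$ stands for `y_pred` on observation $i$, $x_{ij}$ is the value of feature $j$ for observation $i$, $n$ is the number of observations and $p$ the number of features.

```latex
\begin{aligned}
\hat{y}_i &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
L = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 \\[4pt]
\frac{\partial L}{\partial \beta_j} &= \frac{2}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) x_{ij},
\qquad
\frac{\partial L}{\partial \beta_0} = \frac{2}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) \\[4pt]
\beta_j &\leftarrow \beta_j - \alpha \, \frac{\partial L}{\partial \beta_j}
\end{aligned}
```

Implementations often drop the constant factor 2, which is equivalent to minimizing $L/2$. The `LinearRegressionGD` class in the next cell does this: it uses `1/n_samples` in its gradients, which gives the same updates as the formulas above with a learning rate half as large.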

In [None]:
class LinearRegressionGD:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []
        self.weights_history = []
        self.bias_history = []

    def fit(self, X, y):
        # Initialize the parameters
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # History, kept for visualization
        self.loss_history = []
        self.weights_history = []
        self.bias_history = []

        # Gradient descent
        for i in range(self.n_iterations):
            # Current prediction
            y_pred = np.dot(X, self.weights) + self.bias

            # Gradients of the MSE (the constant factor 2 is absorbed into the learning rate)
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)

            # Update the parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Compute and store the loss (evaluated before the update) and the parameters
            loss = np.mean((y_pred - y) ** 2)
            self.loss_history.append(loss)
            self.weights_history.append(self.weights.copy())
            self.bias_history.append(self.bias)

            # Progress report
            if (i+1) % 1000 == 0:
                print(f'Iteration {i+1}/{self.n_iterations}, Loss: {loss:.4f}')

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

# Plot the error as a function of each weight
def plot_error_vs_weights(X, y, weights, bias, feature_names, n_points=100):
    plt.figure(figsize=(20, 15))
    n_features = len(weights)
    n_rows = (n_features + 3) // 4  # Round up to get the number of rows

    # MSE at the fitted weights, used to mark the fitted point on each curve
    mse_at_weights = np.mean((y - (np.dot(X, weights) + bias)) ** 2)

    for i, (feature_name, weight) in enumerate(zip(feature_names, weights)):
        # Range of values around the fitted weight
        weight_range = np.linspace(weight - 2, weight + 2, n_points)
        errors = []

        # Compute the error for each value of the weight, all others held fixed
        for w in weight_range:
            weights_temp = weights.copy()
            weights_temp[i] = w
            y_pred = np.dot(X, weights_temp) + bias
            mse = np.mean((y - y_pred) ** 2)
            errors.append(mse)

        # Plot the error curve
        plt.subplot(n_rows, 4, i+1)
        plt.plot(weight_range, errors, 'b-', alpha=0.7)
        plt.axvline(x=weight, color='r', linestyle='--', label=f'w={weight:.2f}')
        plt.title(f'MSE vs {feature_name}')
        plt.xlabel('Weight value')
        plt.ylabel('MSE')
        plt.grid(True)

        # Mark the fitted weight
        plt.plot(weight, mse_at_weights, 'ro', label='Fitted weight')
        plt.legend(loc='upper right')

    plt.tight_layout()
    plt.show()

# Train both models
lr_gd = LinearRegressionGD(learning_rate=0.01, n_iterations=15000)
# /!\ Fill in the '...' to fit the lr_gd linear regression to the data /!\
...(X_train.values, y_train.values)

lr_sk = LinearRegression()
# /!\ Fill in the '...' to fit the lr_sk linear regression to the data /!\
...(X_train, y_train)

# Performance comparison
print("\nComparison of the two implementations:")
comparison = pd.DataFrame(columns=['GD', 'Sklearn'])

# Predictions
y_pred_gd = lr_gd.predict(X_test.values)
y_pred_sk = lr_sk.predict(X_test)

# Metrics
comparison.loc['R² score'] = [
    r2_score(y_test, y_pred_gd),
    r2_score(y_test, y_pred_sk)
]
comparison.loc['MSE'] = [
    mean_squared_error(y_test, y_pred_gd),
    mean_squared_error(y_test, y_pred_sk)
]
comparison.loc['RMSE'] = [
    np.sqrt(mean_squared_error(y_test, y_pred_gd)),
    np.sqrt(mean_squared_error(y_test, y_pred_sk))
]

print("\nPerformance metrics:")
display(comparison.round(4))

# Coefficient comparison
coef_comparison = pd.DataFrame({
    'Feature': X_train.columns,
    'GD': lr_gd.weights,
    'Sklearn': lr_sk.coef_,
    'Difference': np.abs(lr_gd.weights - lr_sk.coef_)
})

print("\nCoefficient comparison:")
display(coef_comparison.round(4))

# Print the regression equation (up to 21 coefficients, sorted by absolute value)
print("\nRegression equation:")
print(f"Usage_kWh = {lr_gd.bias:.2f}", end=" ")
top_coefs = coef_comparison.assign(abs_coef=lambda x: np.abs(x['GD'])).nlargest(21, 'abs_coef')
for _, row in top_coefs.iterrows():
    print(f"+ ({row['GD']:.2f} × {row['Feature']})", end=" ")
print("\n")

# Plot the error curves
plot_error_vs_weights(X_train.values, y_train.values,
                     lr_gd.weights, lr_gd.bias,
                     feature_names=X_train.columns)

### Interpretation of metrics:

1. **R² (Coefficient of determination)**
   - At most 1; usually between 0 and 1, but negative when the model does worse than always predicting the mean
   - The closer to 1, the better the model
   - Represents the proportion of variance explained by the model
   - An R² of 0.8 means the model explains 80% of the data variability

2. **MSE (Mean Squared Error)**
   - Mean of squared errors
   - Heavily penalizes large errors
   - Hard to interpret because the unit is squared

3. **RMSE (Root Mean Squared Error)**
   - Square root of the MSE
   - Same unit as the target variable (kWh)
   - Easier to interpret: typical size of the error in kWh, with large errors weighted more heavily than in a plain average
   - Example: RMSE = 10 means the predictions are typically off by about 10 kWh
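
To make these definitions concrete, the sketch below computes the three metrics by hand with NumPy on four made-up values. It also computes the mean absolute error (MAE), which is not used elsewhere in this lab, to show how the RMSE differs from a plain average error.

```python
import numpy as np

# Made-up values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_pred - y_true                        # [2, -2, 3, -3]
mse = np.mean(errors ** 2)                      # mean of squared errors (kWh²)
rmse = np.sqrt(mse)                             # back in kWh
mae = np.mean(np.abs(errors))                   # average absolute error (kWh)

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}  RMSE={rmse:.3f}  MAE={mae:.2f}  R²={r2:.3f}")
```

The RMSE is never smaller than the MAE, and the gap between the two widens when a few errors are much larger than the rest.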

❓ **In-depth questions about linear regression:**

1. **Convergence**
   - How does the loss evolve over iterations?
   - Why is the decrease fast at first and then slower?
   - How do you know if the global minimum is reached?

2. **Comparison with sklearn**
   - Are the coefficients similar?
   - Why are there differences?

3. **Mathematical understanding**
   - Why use MSE (Mean Squared Error) as the loss function?

4. **Coefficient analysis**
   - Why does CO2(tCO2) have the largest coefficient (26.47)?
   - Do negative coefficients mean a negative influence?

5. **Error curves**
   - Why do the curves have a parabolic shape?
   - What does the width of the parabola mean for each feature?
   - Why do some features have a greater impact?

### 3. K-Nearest Neighbors (KNN)
The K-Nearest Neighbors algorithm is a non-parametric method that predicts consumption by averaging the target values of the k most similar training observations.
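
Because similarity is measured with a distance, the scale of each variable matters. The sketch below uses four made-up training points with one feature in the thousands and one between 0 and 1, and compares the 1-nearest-neighbor prediction before and after standardization. The function `knn_predict` and the data are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x (Euclidean distance)."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Made-up data: column 0 is on a scale of thousands, column 1 lies between 0 and 1
X_toy = np.array([[0.0, 0.10], [1000.0, 0.90], [2000.0, 0.15], [3000.0, 0.95]])
y_toy = np.array([10.0, 90.0, 12.0, 95.0])
query = np.array([1100.0, 0.12])

# Raw features: the distance is dominated by column 0
pred_raw = knn_predict(X_toy, y_toy, query, k=1)

# Standardized features: both columns contribute on a comparable scale
mu, sigma = X_toy.mean(axis=0), X_toy.std(axis=0)
pred_std = knn_predict((X_toy - mu) / sigma, y_toy, (query - mu) / sigma, k=1)

print(pred_raw, pred_std)
```

With raw features, the nearest neighbor is the point at 1000 on the first column, even though its second feature is very different from the query. After standardization, the nearest neighbor becomes a point whose second feature nearly matches, and the prediction changes accordingly.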

In [None]:
class SimpleKNN:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """Simply stores the training data"""
        self.X_train = X
        self.y_train = y
        print(f"Stored {len(X)} training observations")

    def predict_single(self, x, verbose=False):
        """Predicts for a single observation, optionally printing the details"""
        # Distances to all training points
        distances = np.sqrt(np.sum((self.X_train - x)**2, axis=1))

        # Find the k nearest neighbors
        nearest_indices = np.argsort(distances)[:self.k]
        nearest_distances = distances[nearest_indices]

        if verbose:
            print("\nPrediction details:")
            print(f"Observation to predict: {x}")
            print("\nNearest neighbors found:")
            for i, (idx, dist) in enumerate(zip(nearest_indices, nearest_distances)):
                print(f"Neighbor {i+1}:")
                print(f"- Distance: {dist:.2f}")
                print(f"- Value: {self.y_train[idx]:.2f}")

        # Prediction (simple mean of the neighbors' targets)
        prediction = np.mean(self.y_train[nearest_indices])

        if verbose:
            print(f"\nFinal prediction: {prediction:.2f}")

        return prediction

    def predict(self, X):
        """Predicts for several observations"""
        return np.array([self.predict_single(x) for x in X])

# Test with different values of k
k_values = [1, 2, 3, 4, 5]
knn_scores = []

# Pick one test observation as a worked example
example_idx = 42  # An arbitrary index for the example

for k in k_values:
    knn = SimpleKNN(k=k)
    knn.fit(X_train.values, y_train.values)

    # Detailed prediction for the example
    print(f"\nTest with k={k}:")
    example_pred = knn.predict_single(X_test.values[example_idx], verbose=True)

    # Compute the metrics
    y_pred = knn.predict(X_test.values)
    # /!\ Fill in the '...' to compute the R² between the true target values and the predictions /!\
    r2 = ...
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    knn_scores.append(r2)

    print(f"\nOverall metrics:")
    print(f"R² score: {r2:.3f}")
    print(f"MSE: {mse:.3f}")
    print(f"RMSE: {rmse:.3f}")

# Plot the impact of k
plt.figure(figsize=(10, 5))
plt.plot(k_values, knn_scores, 'bo-')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('R² Score')
plt.title('Performance by number of neighbors')
plt.grid(True)
plt.show()

❓ **Questions about the KNN algorithm:**

1. **Algorithm understanding**
   - How does KNN make predictions for a new observation?
   - Why is it important for variables to be standardized?

2. **Choice of k**
   - What happens if k is too small (k=1)?
   - What happens if k is too large (k close to n)?
   - Why do we observe an optimal k in the performance curve?

3. **Interpretability**
   - How to explain a KNN prediction to a user?
   - Can we identify the most important variables?

### 4. Decision Trees
Decision trees allow the creation of easily interpretable prediction rules.
They can capture non-linear relationships and are particularly useful in an industrial context.

Key points:
- Transparent and interpretable model
- Able to capture non-linear relationships
- Risk of overfitting to be controlled

The decision tree recursively splits the data into homogeneous subgroups by choosing the best variables and split thresholds.
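
To illustrate what "choosing the best split threshold" means, here is a simplified sketch for a single feature: it tries every midpoint between consecutive sorted values and keeps the one that minimizes the summed squared error of the two resulting groups. This is an illustration of the idea on made-up data, not scikit-learn's actual implementation; the function `best_split` is hypothetical.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x that minimizes the summed squared error of the two groups."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2  # midpoint
            best_sse = sse
    return best_threshold, best_sse

# Made-up data with two clearly separated regimes
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([5.0, 6.0, 5.0, 50.0, 52.0, 51.0])
threshold, sse = best_split(x_toy, y_toy)
print(threshold, sse)
```

A full tree repeats this search over every feature at every node, then recurses on the two groups until a stopping rule such as `max_depth` is reached.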

In [None]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Create and train the tree with depth 4
dt = DecisionTreeRegressor(max_depth=4, random_state=42)
dt.fit(X_train, y_train)

# Plot the tree, sized for readability
plt.figure(figsize=(30, 15))  # Large figure for readability
plot_tree(dt,
          feature_names=X_train.columns,
          filled=True,
          rounded=True,
          fontsize=8,  # Font size
          precision=2)  # Number of decimals for the values
plt.title('Decision tree (depth=4)', fontsize=15)
plt.show()
plt.show()

# Detailed prediction example
example_idx = 42
example = X_test.iloc[example_idx]
real_value = y_test.iloc[example_idx]
# Pass a one-row DataFrame so that feature names match those seen during fit
prediction = dt.predict(X_test.iloc[[example_idx]])[0]

print("\nDetails of the instance to predict:")
print(f"\nActual consumption: {real_value:.2f} kWh")
print(f"Predicted value: {prediction:.2f} kWh")

print("\nNon-zero features of the instance:")
for feature, value in example.items():
    if abs(value) > 0.01:  # Only show non-zero values
        print(f"{feature}: {value:.2f}")

# Note: the path below is hard-coded text, not computed from dt
print("\nDetailed decision path:")

print("\nLevel 1:")
print("   Test: CO2(tCO2) ≤ 0.22")
print("   Measured value: -0.71")
print("   Number of observations: 28032")
print("   Group mean: 27.29 kWh")
print("   → Left branch (condition true)")

print("\nLevel 2:")
print("   Test: CO2(tCO2) ≤ -0.40")
print("   Measured value: -0.71")
print("   Number of observations: 18049")
print("   Group mean: 5.41 kWh")
print("   → Left branch (condition true)")

print("\nLevel 3:")
print("   Test: Lagging_Current_Reactive.Power_kVarh ≤ 0.60")
print("   Measured value: -0.60")
print("   Number of observations: 16817")
print("   Group mean: 3.91 kWh")
print("   → Left branch (condition true)")

print("\nLevel 4:")
print("   Test: Lagging_Current_Reactive.Power_kVarh ≤ -0.1")
print("   Measured value: -0.60")
print("   Number of observations: 16793")
print("   Group mean: 3.76 kWh")
print("   → Left branch (condition true)")

print("\n→ Final leaf:")
print("   - Number of observations: 16788")
print("   - Predicted value: 3.75 kWh")

# Performance metrics
y_pred = dt.predict(X_test)
print("\nPerformance metrics:")
print(f"R² score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.1f}")


# Create and train the tree with depth 10
# /!\ Fill in the '...' to create a decision tree with a depth of 10 /!\
dt = ...
dt.fit(X_train, y_train)


# Performance metrics
y_pred = dt.predict(X_test)
print("\nPerformance metrics for depth 10:")
print(f"R² score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.1f}")
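
The decision path printed in the cell above is written out by hand. As a hedged alternative, the sketch below reads the path directly from a fitted tree using `decision_path` and the `tree_` attributes of scikit-learn. It runs on synthetic data so that it is self-contained; the names `X_toy`, `y_toy`, `tree` and `sample` are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): the target depends mostly on feature 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 10 * X_toy[:, 0] + rng.normal(scale=0.5, size=300)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)
sample = X_toy[:1]  # one observation, kept 2-D

node_ids = tree.decision_path(sample).indices  # nodes visited, root first
for node_id in node_ids:
    n_obs = tree.tree_.n_node_samples[node_id]
    mean = tree.tree_.value[node_id][0, 0]
    if tree.tree_.children_left[node_id] == -1:  # -1 marks a leaf
        print(f"Leaf {node_id}: {n_obs} observations, prediction {mean:.2f}")
    else:
        feat = tree.tree_.feature[node_id]
        thr = tree.tree_.threshold[node_id]
        side = "left" if sample[0, feat] <= thr else "right"
        print(f"Node {node_id}: x[{feat}] = {sample[0, feat]:.2f} <= {thr:.2f}? -> {side} "
              f"({n_obs} observations, mean {mean:.2f})")
```

The value stored in the final leaf is the mean target of the training observations that reached it, which is exactly what `predict` returns for this sample.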

❓ **Questions:**
1. What are the most important variables according to the decision tree?

2. How does tree depth influence performance?

3. [OPTIONAL] **Tree structure**
   - Why is CO2(tCO2) often chosen as the first split?
   - How does the number of observations decrease at each level?
   - What is the meaning of the average values in the nodes?

4. [OPTIONAL] **Prediction process**
   - How does the tree arrive at its final prediction?
   - Why are predictions more accurate with greater depth?

### 5. Random Forest
Random Forest is an ensemble of decision trees that improves generalization and prediction stability compared to a single tree.

Key points:
- Better generalization than a single tree
- Robust estimation of variable importance
- Reduced overfitting
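
The stability gain from combining many trees can be illustrated with a small simulation. The sketch below is a deliberately idealized assumption: each "tree" is modeled as the true value plus independent Gaussian noise. Real trees in a forest are correlated, so the actual reduction in spread is smaller than what this simulation shows.

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 50.0           # made-up target (kWh)
n_trees, n_repeats = 100, 2000

# Each row is one simulated "forest": n_trees noisy, independent predictions of the same value
tree_preds = true_value + rng.normal(scale=5.0, size=(n_repeats, n_trees))

single_tree_std = tree_preds[:, 0].std()      # spread of one tree's prediction
forest_std = tree_preds.mean(axis=1).std()    # spread of the combined prediction

print(f"single tree: {single_tree_std:.2f}  forest: {forest_std:.2f}")
```

Under the independence assumption, the spread shrinks roughly with the square root of the number of trees. Random Forest decorrelates its trees (bootstrap samples, random feature subsets) precisely to get closer to this ideal.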

In [None]:
# Create a Random Forest with 500 trees
rf = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Detailed prediction example
example_idx = 42
example = X_test.iloc[example_idx]
real_value = y_test.iloc[example_idx]

print("\nDetails of the instance to predict:")
print(f"Actual consumption: {real_value:.2f} kWh")

print("\nNon-zero features of the instance:")
for feature, value in example.items():
    if abs(value) > 0.01:  # Only show non-zero values
        print(f"{feature}: {value:.2f}")

# Prediction of each tree
print("\nPredictions of the individual trees:")
predictions = []
for i, tree in enumerate(rf.estimators_):
    pred = tree.predict([example])[0]
    predictions.append(pred)
    # print(f"\nTree {i+1}:")
    # print(f"Prediction: {pred:.2f} kWh")

    # Decision path of this tree for the example
    path = tree.decision_path([example])
    feature_path = []
    for node_id in path.indices:
        if tree.tree_.children_left[node_id] == -1:  # Leaf node: no test to report
            continue
        feature = X_train.columns[tree.tree_.feature[node_id]]
        threshold = tree.tree_.threshold[node_id]
        value = example[feature]
        direction = "left" if value <= threshold else "right"
        feature_path.append(f"   {feature} ≤ {threshold:.2f} ? {value:.2f} → {direction}")

    # print("Decision path:")
    # for step in feature_path:
    #     print(step)

# Final prediction (mean of the trees)
# /!\ Fill in the '...' to compute the mean of the predictions (NumPy) /!\
final_prediction = ...
print(f"\nFinal prediction (mean of the trees): {final_prediction:.2f} kWh")

# Performance metrics
y_pred = rf.predict(X_test)
print("\nPerformance metrics:")
print(f"R² score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.1f}")

# Variable importance
importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 5 most important variables:")
print(importance.head().to_string())

❓ **Questions:**
1. Why does Random Forest perform better than a single tree?

2. [OPTIONAL] Compare variable importance between Random Forest and a single tree.
   Which estimation seems more reliable to you? Why?

3. [OPTIONAL] **Voting mechanism**
   - Why take the average of the predictions from the trees?

4. **Advantages over a single tree**
   - How does Random Forest avoid overfitting?
   - Why are predictions more stable?

### 6. Neural Networks (MLP)
The Multi-Layer Perceptron is a neural network capable of capturing complex relationships between variables. It is particularly effective for non-linear problems.

Key points:
- High learning capacity
- Able to capture complex relationships
- Requires fine-tuning of hyperparameters
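
To make the notion of "capacity" concrete, the sketch below counts the trainable parameters of a fully connected network and runs a forward pass in plain NumPy, with ReLU on the hidden layers and an identity output as in `MLPRegressor`. The weights are random and untrained, and the input size of 21 is an assumption corresponding to 6 numerical columns plus 15 one-hot columns; `count_parameters` and `forward` are hypothetical helpers.

```python
import numpy as np

def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Weights plus biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def forward(x, weights, biases):
    """Forward pass: ReLU on hidden layers, identity on the output layer."""
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = np.maximum(0.0, activation @ W + b)
    return activation @ weights[-1] + biases[-1]

# Random (untrained) parameters for a 21 -> 12 -> 6 -> 1 network
rng = np.random.default_rng(0)
sizes = [21, 12, 6, 1]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

output = forward(rng.normal(size=(4, 21)), weights, biases)
print(output.shape, count_parameters(21, (12, 6)))
```

Each layer contributes `n_in * n_out` weights and `n_out` biases, so the parameter count grows with the product of adjacent layer sizes rather than with the number of neurons alone.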

In [None]:
# Create a simple MLP for illustration
mlp = MLPRegressor(hidden_layer_sizes=(12, 6),
                  max_iter=250,
                  activation='relu',
                  solver='adam',
                  random_state=42)
# /!\ Fill in the '...' so that the neural network learns to predict y_train from X_train /!\
mlp...

# Detailed prediction example
example_idx = 42
example = X_test.iloc[example_idx]
real_value = y_test.iloc[example_idx]
# Pass a one-row DataFrame so that feature names match those seen during fit
prediction = mlp.predict(X_test.iloc[[example_idx]])[0]

print("\nDetails of the instance to predict:")
print(f"Actual consumption: {real_value:.2f} kWh")
print(f"Predicted value: {prediction:.2f} kWh")

print("\nNon-zero features of the instance:")
for feature, value in example.items():
    if abs(value) > 0.01:  # Only show non-zero values
        print(f"{feature}: {value:.2f}")

# Test different architectures
architectures = [(12,), (25,), (12, 6), (25, 12), (25, 12, 6)]
mlp_scores = []
mlp_predictions = []

print("\nArchitecture comparison:")
for arch in architectures:
    # Create and train the model
    mlp = MLPRegressor(hidden_layer_sizes=arch,
                      max_iter=250,
                      random_state=42)
    mlp.fit(X_train, y_train)

    # Number of trainable parameters (weights and biases)
    n_params = (sum(w.size for w in mlp.coefs_)
                + sum(b.size for b in mlp.intercepts_))

    # Prediction for the example
    pred = mlp.predict(X_test.iloc[[example_idx]])[0]
    mlp_predictions.append(pred)

    # Overall score
    score = r2_score(y_test, mlp.predict(X_test))
    mlp_scores.append(score)

    print(f"\nArchitecture {arch}:")
    print(f"- Neurons per layer: Input({X_train.shape[1]}) → {' → '.join(str(x) for x in arch)} → Output(1)")
    print(f"- Total number of parameters: {n_params:,}")
    print(f"- Prediction for the example: {pred:.2f} kWh")
    print(f"- Overall R² score: {score:.3f}")

# Plot the results
plt.figure(figsize=(15, 5))

# R² scores
plt.subplot(1, 2, 1)
plt.plot(range(len(architectures)), mlp_scores, 'bo-')
plt.xticks(range(len(architectures)), [str(arch) for arch in architectures], rotation=45)
plt.xlabel('Architecture')
plt.ylabel('R² Score')
plt.title('Performance by architecture')
plt.grid(True)

# Predictions for the example
plt.subplot(1, 2, 2)
plt.plot(range(len(architectures)), mlp_predictions, 'ro-', label='Predictions')
plt.axhline(y=real_value, color='g', linestyle='--', label='Actual value')
plt.xticks(range(len(architectures)), [str(arch) for arch in architectures], rotation=45)
plt.xlabel('Architecture')
plt.ylabel('Prediction (kWh)')
plt.title('Predictions for the example')
plt.legend()

plt.tight_layout()
plt.show()

❓ **Questions about the MLP:**

1. **Network architecture**
   - Why use multiple hidden layers?
   - How to choose the number of neurons per layer?

2. **Architecture comparison**
   - Which architecture gives the best results? Why?
   - [OPTIONAL] Is there a trade-off between complexity and performance?

### 7. Final model comparison

Let's now compare all the models to choose the most suitable for our problem.

In [None]:
# Create and train the best models (each one is refit here with the settings used above)
# 1. Linear regression
lr_sk = LinearRegression()
lr_sk.fit(X_train, y_train)

# 2. KNN (with the best k found)
knn_best = KNeighborsRegressor(n_neighbors=3)  # k=3 gave the best results
knn_best.fit(X_train, y_train)

# 3. Decision tree
dt = DecisionTreeRegressor(max_depth=10, random_state=42)
dt.fit(X_train, y_train)

# 4. Random Forest
rf = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# 5. MLP
mlp = MLPRegressor(hidden_layer_sizes=(25, 12), max_iter=250, random_state=42)
mlp.fit(X_train, y_train)


example_idx = 42
example = X_test.iloc[example_idx]
real_value = y_test.iloc[example_idx]

# Dictionary of models
models = {
    'Linear regression': lr_sk,
    'KNN': knn_best,
    'Decision tree': dt,
    'Random Forest': rf,
    'MLP': mlp
}

# Detailed comparison
print("\nPredictions for the example (actual consumption: {:.2f} kWh):".format(real_value))
predictions = {}
for name, model in models.items():
    # Pass a one-row DataFrame so that feature names match those seen during fit
    pred = model.predict(X_test.iloc[[example_idx]])[0]
    predictions[name] = pred
    print(f"{name}: {pred:.2f} kWh")

# Compute the overall metrics
results = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    results.append({
        'Model': name,
        'R²': r2_score(y_test, y_pred),
        'MSE': mean_squared_error(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred))
    })

# /!\ Fill in the '...' to assign to results_df a dataframe built from the results list /!\
results_df = ...
print("\nOverall metrics:")
print(results_df.round(3))

# Plot the comparisons
plt.figure(figsize=(20, 5))

# 1. Predictions for the example
plt.subplot(1, 3, 1)
plt.bar(predictions.keys(), predictions.values(), color='skyblue')
plt.axhline(y=real_value, color='r', linestyle='--', label='Actual value')
plt.xticks(rotation=45)
plt.ylabel('Prediction (kWh)')
plt.title('Model predictions for the example')
plt.legend()

# 2. R² scores
plt.subplot(1, 3, 2)
plt.bar(results_df['Model'], results_df['R²'], color='lightgreen')
plt.xticks(rotation=45)
plt.ylabel('R² Score')
plt.title('Overall model performance (R²)')
# Zoom the axis to make the differences visible
plt.ylim(0.98, 1.0)  # Assumes all R² scores are > 0.98; widen the range otherwise
plt.grid(True, axis='y')

# 3. MSE
plt.subplot(1, 3, 3)
plt.bar(results_df['Model'], results_df['MSE'], color='salmon')
plt.xticks(rotation=45)
plt.ylabel('MSE')
plt.title('Mean squared error (MSE)')
plt.grid(True, axis='y')

plt.tight_layout()
plt.show()
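
The comparison above relies on a single 80/20 split, so small differences between models may be due to the particular split. One possible complement, sketched below, is k-fold cross-validation with `cross_val_score` (already imported at the top of the notebook), which gives a mean score and a spread for each model. The example uses synthetic data and only two models to stay fast; the data and the names `X_toy`, `y_toy`, `candidates` and `cv_scores` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up), standing in for X and y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.3, size=300)

# Shuffled 5-fold split, with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

If two models differ by less than the fold-to-fold spread, the simpler or more interpretable one is usually a defensible choice.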

❓ **Comparative analysis of models:**

1. **Complexity/Interpretability trade-off**
   - In an industrial context, is it better to use a simple model like a decision tree or a more complex model like an MLP?
   - How to justify the choice of model?
   - What are the advantages and disadvantages of using a "black box" model like the MLP in an industrial environment?

2. [OPTIONAL] **Practical and operational aspects**
   - How to handle model updates as new data arrives?

3. **Optimization and improvement**
   - How can these predictions be used to optimize energy consumption?
   - What concrete recommendations can be made to this industrial company?

4. **Robustness and maintenance**
   - How to ensure that models remain accurate over time?
   - How often should models be retrained?