## Exercise 1

In this exercise, do the following:
1. Load the dataset used in the time series example (energy consumption data). You can find it in the notebook "TSA_Example" in the Time Series folder in Moodle.
2. Set up a nested MLflow loop in which different modelling experiments can be tracked, then use the dataset from point 1 to experiment with and track models. Cover the following combinations:
    1. At least 3 model types
    2. At least 3 different feature combinations
    3. At least 3 different options for 3 different hyperparameters
    4. At least 3 different time splits for train/test
3. For each option in the combination, calculate and log the following in MLflow:
    1. RMSE
    2. MAE
    3. Plot of actual vs predicted for 1 month of data
    4. Plot of actual vs predicted for 1 week of data
    5. All of the combination info from point 2: which model, feature combination, hyperparameters, and train/test split were used
4. Turn on the MLflow UI and track your experiments

In [1]:
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import os 

1. Load the dataset used in the time series example (energy consumption data). You can find it in the notebook "TSA_Example" in the Time Series folder in Moodle.

In [2]:

# Define the file path
file_path = "data/EnergyEfficiency.csv"

# Load the dataset
data = pd.read_csv(file_path)

# Display the first few rows
print(data.head())
data.describe()


   RelativeCompactness  SurfaceArea  WallArea  RoofArea  OverallHeight  \
0                 0.98        514.5     294.0    110.25            7.0   
1                 0.98        514.5     294.0    110.25            7.0   
2                 0.98        514.5     294.0    110.25            7.0   
3                 0.98        514.5     294.0    110.25            7.0   
4                 0.90        563.5     318.5    122.50            7.0   

   Orientation  GlazingArea  GlazingAreaDistribution  HeatingLoad  CoolingLoad  
0            2          0.0                        0        15.55        21.33  
1            3          0.0                        0        15.55        21.33  
2            4          0.0                        0        15.55        21.33  
3            5          0.0                        0        15.55        21.33  
4            2          0.0                        0        20.84        28.28  


| | RelativeCompactness | SurfaceArea | WallArea | RoofArea | OverallHeight | Orientation | GlazingArea | GlazingAreaDistribution | HeatingLoad | CoolingLoad |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 0.764167 | 671.708333 | 318.5 | 176.604167 | 5.25 | 3.5 | 0.234375 | 2.8125 | 22.307201 | 24.58776 |
| std | 0.105777 | 88.086116 | 43.626481 | 45.16595 | 1.75114 | 1.118763 | 0.133221 | 1.55096 | 10.090196 | 9.513306 |
| min | 0.62 | 514.5 | 245.0 | 110.25 | 3.5 | 2.0 | 0.0 | 0.0 | 6.01 | 10.9 |
| 25% | 0.6825 | 606.375 | 294.0 | 140.875 | 3.5 | 2.75 | 0.1 | 1.75 | 12.9925 | 15.62 |
| 50% | 0.75 | 673.75 | 318.5 | 183.75 | 5.25 | 3.5 | 0.25 | 3.0 | 18.95 | 22.08 |
| 75% | 0.83 | 741.125 | 343.0 | 220.5 | 7.0 | 4.25 | 0.4 | 4.0 | 31.6675 | 33.1325 |
| max | 0.98 | 808.5 | 416.5 | 220.5 | 7.0 | 5.0 | 0.4 | 5.0 | 43.1 | 48.03 |


2. Set up a nested MLflow loop in which different modelling experiments can be tracked, then use the dataset from point 1 to experiment with and track models. Cover the following combinations:
    1. At least 3 model types
    2. At least 3 different feature combinations
    3. At least 3 different options for 3 different hyperparameters
    4. At least 3 different time splits for train/test

1. At least 3 model types

In [3]:
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(),
    "XGBoost": XGBRegressor()
}

2. At least 3 different feature combinations

In [4]:
# Define feature subsets
feature_combinations = [
    ["RelativeCompactness", "SurfaceArea", "WallArea"],
    ["RoofArea", "OverallHeight", "GlazingArea"],
    ["RelativeCompactness", "SurfaceArea", "WallArea", "RoofArea", "OverallHeight", "GlazingArea"]
]

target = "HeatingLoad"

3. At least 3 different options for 3 different hyperparameters

In [5]:
hyperparameters = {
    "LinearRegression": {"fit_intercept": [True, False]},
    "RandomForest": {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
    "XGBoost": {"n_estimators": [50, 100, 200], "learning_rate": [0.01, 0.1, 0.2]}
}
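
A grid like the one above only lists candidate values; to train one model per setting it has to be expanded into individual combinations. The sketch below does this with the standard library. The helper name `expand_grid` is hypothetical, and the dictionary is copied from the cell above so that the block runs on its own.

```python
from itertools import product

# Copied from the cell above so this sketch is self-contained
hyperparameters = {
    "LinearRegression": {"fit_intercept": [True, False]},
    "RandomForest": {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
    "XGBoost": {"n_estimators": [50, 100, 200], "learning_rate": [0.01, 0.1, 0.2]},
}


def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = {name: expand_grid(grid) for name, grid in hyperparameters.items()}
for name, settings in combos.items():
    print(name, len(settings))
```

For this grid the expansion gives 2 settings for `LinearRegression` and 9 each for `RandomForest` and `XGBoost`. scikit-learn's `ParameterGrid` performs the same expansion if a library helper is preferred.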

4. At least 3 different time splits for train/test

In [6]:
# Each value is the number of folds (n_splits) passed to TimeSeriesSplit
ts_splits = [3, 5, 7]

Let's create the train/test splits.

In [7]:
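def split_data(features, ts_split):
    """Yield (X_train, X_test, y_train, y_test) for each fold of a TimeSeriesSplit.

    Folds follow row order, so the rows of `data` are assumed to be in
    chronological order.
    """
    tscv = TimeSeriesSplit(n_splits=ts_split)
    X = data[features]
    y = data[target]

    for train_idx, test_idx in tscv.split(X):
        yield X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]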
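
To see what `TimeSeriesSplit` hands back, it helps to run it on a handful of rows. The sketch below uses 8 dummy rows (an assumption for illustration, not the exercise data) and prints the row indices of each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 8 dummy rows, assumed to already be in time order
X_demo = np.arange(8).reshape(-1, 1)

folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo)
]
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} test={test_idx}")
```

The training window grows with each fold and every test row comes after all training rows of that fold. This only corresponds to "training on the past" when the rows are sorted chronologically.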

3. For each option in the combination, calculate and log the following in MLflow:
    1. RMSE
    2. MAE
    3. Plot of actual vs predicted for 1 month of data
    4. Plot of actual vs predicted for 1 week of data
    5. All of the combination info from point 2: which model, feature combination, hyperparameters, and train/test split were used
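
As a quick reference for the two required metrics, here is a minimal worked example using only the standard library. The four actual and predicted values are made up for illustration.

```python
import math

# Made-up values for illustration only
actual = [20.0, 22.0, 19.0, 25.0]
predicted = [21.0, 20.0, 19.0, 28.0]

# Per-row errors: 1, -2, 0, 3
errors = [p - a for a, p in zip(actual, predicted)]

# MAE: mean of absolute errors, (1 + 2 + 0 + 3) / 4
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error, sqrt((1 + 4 + 0 + 9) / 4)
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

RMSE is larger than MAE here because squaring gives the single error of 3 more weight than the smaller ones.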

In [8]:
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

# Start MLflow experiment
mlflow.set_experiment("Energy_Consumption_Modeling")


def log_actual_vs_predicted(y_true, y_pred, n_points, title, artifact_name):
    """Plot the first n_points test rows and log the figure to the active run."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(y_true[:n_points], label="Actual", marker="o")
    ax.plot(y_pred[:n_points], label="Predicted", marker="x")
    ax.set_title(f"Actual vs Predicted ({title})")
    ax.legend()
    mlflow.log_figure(fig, artifact_name)
    plt.close(fig)


for feature_idx, features in enumerate(feature_combinations):
    for ts_split in ts_splits:
        # Parent run per feature set and split setting, so the child runs are actually nested
        with mlflow.start_run(run_name=f"features{feature_idx}_splits{ts_split}"):
            folds = split_data(features, ts_split)
            for fold, (X_train, X_test, y_train, y_test) in enumerate(folds):
                for model_name, base_model in models.items():
                    # One child run per hyperparameter setting of this model
                    for params in ParameterGrid(hyperparameters[model_name]):
                        with mlflow.start_run(run_name=f"{model_name}_fold{fold}", nested=True):
                            # Log the full combination used for this run
                            mlflow.log_param("Model", model_name)
                            mlflow.log_param("Features", ", ".join(features))
                            mlflow.log_param("Time_Split", ts_split)
                            mlflow.log_param("Fold", fold)
                            mlflow.log_params(params)

                            # Fit a fresh copy so settings do not leak between runs
                            model = clone(base_model).set_params(**params)
                            model.fit(X_train, y_train)
                            predictions = model.predict(X_test)

                            # Calculate metrics
                            mae = mean_absolute_error(y_test, predictions)
                            mse = mean_squared_error(y_test, predictions)
                            rmse = np.sqrt(mse)
                            r2 = r2_score(y_test, predictions)

                            # Log metrics
                            mlflow.log_metric("MAE", mae)
                            mlflow.log_metric("MSE", mse)
                            mlflow.log_metric("RMSE", rmse)
                            mlflow.log_metric("R2", r2)

                            # The data has no timestamp column, so the first 30 and 7
                            # test rows stand in for "1 month" and "1 week"
                            log_actual_vs_predicted(y_test.values, predictions, 30, "1 Month",
                                                    f"actual_vs_predicted_1m_{model_name}.png")
                            log_actual_vs_predicted(y_test.values, predictions, 7, "1 Week",
                                                    f"actual_vs_predicted_1w_{model_name}.png")

print("MLflow tracking complete!")

MLflow tracking complete!


4. Turn on the MLflow UI and track your experiments

In [None]:
# Blocks the cell until interrupted; by default the UI is served at http://127.0.0.1:5000
!mlflow ui

^C
