# **00-Preparation**
This notebook is the main component of the series: it creates the machine learning model used throughout.

### **Exploratory Data Analysis**
In this part, the dataset is checked for integrity and consistency before it is passed to the training step.

As mentioned in the README, the dataset used in this project is sourced from Kaggle: <a href="https://www.kaggle.com/datasets/govindaramsriram/energy-consumption-dataset-linear-regression">Energy Consumption Dataset</a>. The objective is to create a machine learning model that predicts the energy consumption of a building based on its features.

In [21]:
# import necessary libraries and modules
import os
import pandas as pd
import pickle as pkl
import mlflow
from extended_modules import assistant # from "extended_modules" directory

df = pd.read_csv(os.path.join("..", "data", "test_energy_data.csv")) # read the dataset
df.head(5)

Unnamed: 0,Building Type,Square Footage,Number of Occupants,Appliances Used,Average Temperature,Day of Week,Energy Consumption
0,Residential,24563,15,4,28.52,Weekday,2865.57
1,Commercial,27583,56,23,23.07,Weekend,4283.8
2,Commercial,45313,4,44,33.56,Weekday,5067.83
3,Residential,41625,84,17,27.39,Weekend,4624.3
4,Residential,36720,58,47,17.08,Weekday,4820.59


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Building Type        100 non-null    object 
 1   Square Footage       100 non-null    int64  
 2   Number of Occupants  100 non-null    int64  
 3   Appliances Used      100 non-null    int64  
 4   Average Temperature  100 non-null    float64
 5   Day of Week          100 non-null    object 
 6   Energy Consumption   100 non-null    float64
dtypes: float64(2), int64(3), object(2)
memory usage: 5.6+ KB


In [23]:
report = assistant.report(df)
report.show()

number of columns: 7
number of rows: 100
number of duplicates: 0

number of numerical columns: 5
number of categorical columns: 2


In [24]:
report.df_numr # observe all numerical columns

Unnamed: 0,name,dtype,count,min,q1,q2,q3,max,mean,stddev,null_percent,distribution,min_normal,max_normal,n_outliers
0,Square Footage,int64,100,1161.0,14161.0,27582.5,38109.5,49354.0,25881.92,13711.075264,0.0,non-normal,-34761.75,85276.75,0
1,Number of Occupants,int64,100,2.0,21.0,47.0,73.0,99.0,47.23,29.905526,0.0,non-normal,-76.0,177.0,0
2,Appliances Used,int64,100,1.0,16.75,27.5,39.25,49.0,26.97,14.237846,0.0,non-normal,-32.75,82.75,0
3,Average Temperature,float64,100,10.4,15.6825,21.97,27.4925,34.71,22.0433,6.957951,0.0,non-normal,-7.315,52.425,0
4,Energy Consumption,float64,100,2351.97,3621.925,4249.39,4797.175,6042.56,4187.5783,832.55985,0.0,non-normal,589.095,7805.435,0


In [25]:
report.df_catg # observe all categorical columns

Unnamed: 0,name,dtype,count,n_class,null_percent,distribution
0,Building Type,object,100,3,0.0,multinoulli
1,Day of Week,object,100,2,0.0,bernoulli


In [26]:
# report.export(os.path.join("..", "report", "test_energy_data")) # save the report into the directory path "../report/test_energy_data"

As seen above, the report shows no null values, duplicates, or outliers that would require cleaning, so the dataset is ready for training and we can move on to the feature engineering step. Before that, however, we prepare the data by splitting it into 2 chunks: train and test. In this notebook, only the train data is used, because the goal here is to train the model rather than to use it for real predictions.

In [27]:
X_train, X_test, y_train, y_test = assistant.split_data(df, "Energy Consumption", 0.2) # train 80 : test 20
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80, 6), (20, 6), (80, 1), (20, 1))
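
The implementation of `assistant.split_data` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch built on scikit-learn's `train_test_split`. The function name `split_data_sketch`, the `random_state` argument, and the toy data are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, test_size, random_state=42):
    """Hypothetical stand-in for assistant.split_data (not the actual helper)."""
    X = df.drop(columns=[target])  # every column except the target is a feature
    y = df[[target]]               # keep the target as a one-column DataFrame
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Toy frame with 100 rows, like the dataset above (the values are made up)
toy = pd.DataFrame({
    "Square Footage": range(100),
    "Energy Consumption": [float(i) * 2 for i in range(100)],
})
X_tr, X_te, y_tr, y_te = split_data_sketch(toy, "Energy Consumption", 0.2)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Keeping the target as a one-column DataFrame reproduces the `(80, 1)` and `(20, 1)` target shapes seen above.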

In [28]:
# assistant.export_train_test(X_train, X_test, y_train, y_test, os.path.join("..", "data")) # save the train and test datasets
X_train, X_test, y_train, y_test = assistant.import_train_test(os.path.join("..", "data"))
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80, 6), (20, 6), (80, 1), (20, 1))

### **Feature Engineering**
This step prepares the clean data so that the machine learning model can be trained on it. To do this, the categorical columns are encoded into numerical features. The dataset has 2 categorical columns that need to be transformed:
- **Day of Week:** This column contains only 2 unique values, so it can be encoded with a label encoder. However, the label encoder provided by scikit-learn is designed for target labels (its `fit` and `transform` accept a single 1-D array), so it can't be combined with other transformers. This makes preprocessing awkward if the different preprocessing models are stored separately. As a result, I use a custom transformer named "LabelEncoderTransformer" from my extended modules to encapsulate the label encoder and make it compatible with other transformers.

- **Building Type:** This column contains 3 unique values, so one-hot encoding is applied to it, producing 3 columns. Each of these columns indicates whether a row belongs to that specific building type.

I use the column transformer to wrap these transformers together so that when I save the preprocessing pipeline, I save it once, and when I call the pipeline, I also call it only once.

In [29]:
from sklearn.compose import ColumnTransformer
from extended_modules.sklearnext.preprocessing import LabelEncoderTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[
        ("label-encoder-transformer", LabelEncoderTransformer(), "Day of Week"),
        ("one-hot-encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), ["Building Type"])
    ],
    remainder="passthrough"
)

ct_md = ct.fit(X_train) # fit the column transformer with the data
ct_md

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [30]:
# pkl.dump(ct_md, open(os.path.join("..", "model", "preprocessor.pkl"), "wb")) # save the preprocessing pipeline as a pickle model
ppc_md = pkl.load(open(os.path.join("..", "model", "preprocessor.pkl"), "rb"))
ppc_md

In [31]:
X_train = pd.DataFrame(ppc_md.transform(X_train)) # transform the features by the preprocessing pipeline created
df_train = pd.concat([X_train, y_train], axis=1) # concatenate the features and the target to be the train dataset
df_train.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,Energy Consumption
0,1.0,0.0,0.0,1.0,17982.0,4.0,37.0,13.29,3112.64
1,0.0,0.0,1.0,0.0,27165.0,73.0,25.0,30.15,4987.52
2,1.0,0.0,0.0,1.0,7924.0,63.0,36.0,34.71,3072.63
3,1.0,0.0,1.0,0.0,42767.0,40.0,28.0,17.94,5508.64
4,1.0,0.0,1.0,0.0,2145.0,56.0,12.0,11.77,3348.39


Next, I split the train data into 2 chunks again: train data and validation data. The train data is passed to the model in the training step, and the validation data is passed to the trained model to make predictions and evaluate it.

In [32]:
X_train_train, X_validate, y_train_train, y_validate = assistant.split_data(df_train, "Energy Consumption", 0.2) # train 80 : validation 20
X_train_train.shape, X_validate.shape, y_train_train.shape, y_validate.shape

((64, 8), (16, 8), (64, 1), (16, 1))

In [33]:
# save the train and validation datasets
# assistant.export_train_test(X_train_train, X_validate, y_train_train, y_validate, os.path.join("..", "data"), True)
X_train_train, X_validate, y_train_train, y_validate = assistant.import_train_test(os.path.join("..", "data"), True)
X_train_train.shape, X_validate.shape, y_train_train.shape, y_validate.shape

((64, 8), (16, 8), (64, 1), (16, 1))

### **Model Training**
In this step, several models are trained and evaluated on how accurate their predictions are. 9 models are trained in this step, each with default hyperparameters and each from a different algorithm: linear regression, Ridge regression, Lasso regression, ElasticNet regression, support vector regression (SVR), random forest regression, gradient boosting regression, XGBoost regression, and light gradient boosting machine regression (LGBM).

Each model is trained once but evaluated twice: on the train data (to see the training performance) and on the validation data (to see how it handles unseen data). A model is considered good enough to be used for prediction if its training RMSE is not greater than 50 and its validation RMSE is not greater than 100. The models that pass this threshold are then compared on the same metrics to find the best one.
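
To make the metric and the passing rule concrete, here is a small sketch that computes RMSE by hand and applies the two thresholds. The helper names `rmse` and `passes_threshold` and the sample numbers are made up for illustration; the notebook itself relies on scikit-learn's `root_mean_squared_error`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def passes_threshold(rmse_train, rmse_validate, max_train=50.0, max_validate=100.0):
    """Return True when both RMSE values are within their limits."""
    return rmse_train <= max_train and rmse_validate <= max_validate


# Made-up values purely for illustration
y_true = [3000.0, 4000.0, 5000.0]
y_pred = [3030.0, 3960.0, 5000.0]
score = rmse(y_true, y_pred)
print(round(score, 2), passes_threshold(score, 80.0))
```

RMSE is expressed in the same unit as the target, so a limit of 50 means a typical error of at most 50 units of energy consumption on the train data.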

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import root_mean_squared_error

mlflow.set_tracking_uri("http://127.0.0.1:8080/") # replace with URI you set when starting MLFlow server
mlflow.set_experiment("energy-consumption-prediction") # log all runs into this experiment

# list all regression models
model_instances = [
    LinearRegression(), Ridge(), Lasso(), ElasticNet(), SVR(), RandomForestRegressor(), GradientBoostingRegressor(),
    XGBRegressor(), LGBMRegressor()
]

for md in model_instances:
    with mlflow.start_run(): # instantiate a run
        mlflow.log_params({"model_algorithm": type(md).__name__}) # log model algorithm as run's parameter
        md_md = md.fit(X_train_train, y_train_train) # train model
        y_train_pred = md_md.predict(X_train_train) # make predictions from train data
        y_validate_pred = md_md.predict(X_validate) # make predictions from validation data
        rmse_train = root_mean_squared_error(y_train_train, y_train_pred) # evaluate model by the training predictions: training RMSE
        rmse_validate = root_mean_squared_error(y_validate, y_validate_pred) # evaluate model by the validation predictions: validation RMSE
        if rmse_train <= 50 and rmse_validate <= 100: # passing threshold: training RMSE <= 50, validation RMSE <= 100
            mlflow.sklearn.log_model(md_md, artifact_path="forecasting_model") # log passing models only
        mlflow.log_metrics({"train_rmse": rmse_train, "validate_rmse": rmse_validate}) # log training RMSE and validation RMSE as run's metrics
    mlflow.end_run()



🏃 View run crawling-sheep-223 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/0873776c8d1e45cb81cdca97fa08f1dd
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897




🏃 View run legendary-midge-147 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/314f6f3d380a48d0962ecd032ee595bb
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897




🏃 View run crawling-gull-990 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/3afe1e63aa44444ca31588e9f8b631f6
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897
🏃 View run unique-shrike-487 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/72aef0aa494d4d5d9889f1bb4c3abedf
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897




🏃 View run persistent-fowl-423 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/3063fa7c98764986bb85cf0b6e8ddf7a
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897
🏃 View run polite-kit-838 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/4b7188b253484e6b8f07cb701ff5bac4
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897
🏃 View run beautiful-snail-314 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/ad54bf3f257042a688087768f12d23c2
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897




🏃 View run bald-shrike-923 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/2e32beabbcd9413282dea5c7b7fc03f7
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000023 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 94
[LightGBM] [Info] Number of data points in the train set: 64, number of used features: 7
[LightGBM] [Info] Start training from score 4018.297356
🏃 View run redolent-cub-140 at: http://127.0.0.1:8080/#/experiments/761232900034787897/runs/b3b0bc3bed1248d2824d11c85b0e5ca6
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/761232900034787897


At the end of this stage, I recommend terminating the MLflow server and then restarting it, since this is the approach that has been confirmed to make the MLflow API used in the next step available.

In [35]:
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8080/") # replace with URI you set when starting MLFlow server
mlflow.set_experiment("energy-consumption-prediction")

df_exp = mlflow.search_runs()
df_exp[(df_exp["metrics.train_rmse"] <= 50) & (df_exp["metrics.validate_rmse"] <= 100)]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.train_rmse,metrics.validate_rmse,params.model_algorithm,tags.mlflow.source.name,tags.mlflow.runName,tags.mlflow.source.type,tags.mlflow.user,tags.mlflow.log-model.history
6,3afe1e63aa44444ca31588e9f8b631f6,761232900034787897,FINISHED,mlflow-artifacts:/761232900034787897/3afe1e63a...,2025-01-10 04:21:34.650000+00:00,2025-01-10 04:21:39.634000+00:00,3.237692,2.903962,Lasso,c:\ProgramData\anaconda3\Lib\site-packages\ipy...,crawling-gull-990,LOCAL,007955_Admin,"[{""run_id"": ""3afe1e63aa44444ca31588e9f8b631f6""..."
7,314f6f3d380a48d0962ecd032ee595bb,761232900034787897,FINISHED,mlflow-artifacts:/761232900034787897/314f6f3d3...,2025-01-10 04:21:29.526000+00:00,2025-01-10 04:21:34.618000+00:00,17.295416,16.914298,Ridge,c:\ProgramData\anaconda3\Lib\site-packages\ipy...,legendary-midge-147,LOCAL,007955_Admin,"[{""run_id"": ""314f6f3d380a48d0962ecd032ee595bb""..."
8,0873776c8d1e45cb81cdca97fa08f1dd,761232900034787897,FINISHED,mlflow-artifacts:/761232900034787897/0873776c8...,2025-01-10 04:21:23.680000+00:00,2025-01-10 04:21:29.494000+00:00,0.013813,0.009707,LinearRegression,c:\ProgramData\anaconda3\Lib\site-packages\ipy...,crawling-sheep-223,LOCAL,007955_Admin,"[{""run_id"": ""0873776c8d1e45cb81cdca97fa08f1dd""..."


We can see that among the 9 models, the linear regression model works best with this data: its training RMSE is only approximately 0.0138, and its validation RMSE is only approximately 0.0097. Consequently, this model is selected as the model used for making predictions throughout this series. Finally, the model is saved.
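
As a sketch of how the same choice could be made programmatically rather than by reading the table, the passing runs can be ranked by validation RMSE. The DataFrame below is a hand-built stand-in for the output of `mlflow.search_runs()`, with metric values copied from the table above. The tie-breaking rule (training RMSE second) is an assumption.

```python
import pandas as pd

# Stand-in for the filtered search_runs() output shown above
runs = pd.DataFrame({
    "params.model_algorithm": ["Lasso", "Ridge", "LinearRegression"],
    "metrics.train_rmse": [3.237692, 17.295416, 0.013813],
    "metrics.validate_rmse": [2.903962, 16.914298, 0.009707],
})

# Keep runs within the thresholds, then rank by validation RMSE
# (ties broken by training RMSE, which is an assumed rule)
passing = runs[(runs["metrics.train_rmse"] <= 50) & (runs["metrics.validate_rmse"] <= 100)]
best = passing.sort_values(["metrics.validate_rmse", "metrics.train_rmse"]).iloc[0]
print(best["params.model_algorithm"])
```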

In [None]:
run_id = df_exp[df_exp["params.model_algorithm"] == "LinearRegression"]["run_id"].values[0] # get model's run ID
artifact_path = f"runs:/{run_id}/forecasting_model"
pdt_md = mlflow.sklearn.load_model(artifact_path)
pdt_md

Downloading artifacts:   0%|          | 0/5 [00:00<?, ?it/s]

In [43]:
# pkl.dump(pdt_md, open(os.path.join("..", "model", "predictor.pkl"), "wb")) # save the model
pdt_md = pkl.load(open(os.path.join("..", "model", "predictor.pkl"), "rb"))
pdt_md