This notebook trains an `ExtraTreesRegressor` with the best parameters found in _Energy+efficiency dataset EDA and Regression.ipynb_. All steps are chained as links in a pipeline, which takes the data, preprocesses it, and trains the model in a single call. Finally, the model is saved with the `joblib` package so that it can be used later without retraining.

In [1]:
# Import libraries
import joblib
import numpy as np
import pandas as pd
import sklearn
from joblib import dump, load
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
# Versions of the main packages. These may be needed for the Docker environment.
print('Numpy version:', np.__version__)
print('Pandas version:', pd.__version__)
print('Scikit-learn version:', sklearn.__version__)
print('Joblib version:', joblib.__version__)

Numpy version: 1.18.5
Pandas version: 1.0.4
Scikit-learn version: 0.23.1
Joblib version: 0.15.1
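
If these versions are meant to be pinned in a Docker image, one option is to turn them into `requirements.txt`-style lines. The sketch below is only an illustration: the helper name `pinned_requirements` is hypothetical, and the version numbers are copied from the output above.

```python
# Versions copied from the output above. The keys are the usual pip
# distribution names (sklearn is published on PyPI as "scikit-learn").
versions = {
    "numpy": "1.18.5",
    "pandas": "1.0.4",
    "scikit-learn": "0.23.1",
    "joblib": "0.15.1",
}


def pinned_requirements(versions):
    """Return one 'name==version' line per package (hypothetical helper)."""
    return [f"{name}=={version}" for name, version in versions.items()]


requirements_text = "\n".join(pinned_requirements(versions))
print(requirements_text)
```

Reading the `.xlsx` file with `pd.read_excel` also needs an Excel engine installed (xlrd or openpyxl, depending on the pandas version), so that package would likely have to be pinned as well.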


In [3]:
# Names of columns
columns = ['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height',
           'orientation', 'glazing_area', 'glazing_area_distribution', 'heating_load', 'cooling_load']

# Read data
df = pd.read_excel(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx', names=columns, header=0)

# Data Preprocessing

In [4]:
# Split data to X and y
X = df.drop(['heating_load', 'cooling_load'], axis=1)
y = df[['heating_load', 'cooling_load']]

In [5]:
# Split to X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

In [6]:
# Create a standard scaler
scaler = StandardScaler()

# Create an Extra Trees regressor with the known best parameters
# {'max_depth': 8, 'min_samples_leaf': 5, 'min_samples_split': 22, 'n_estimators': 44}
etr = ExtraTreesRegressor(max_depth=8, min_samples_leaf=5,
                          min_samples_split=22, n_estimators=44, random_state=0)

# Modeling

In [7]:
# Create a pipeline to wrap the scaler and regressor
pipe = make_pipeline(scaler, etr)

# Train the pipeline
pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('extratreesregressor',
                 ExtraTreesRegressor(max_depth=8, min_samples_leaf=5,
                                     min_samples_split=22, n_estimators=44,
                                     random_state=0))])
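
To make clear what the single `fit` call does, the sketch below compares the pipeline with the equivalent manual steps: fit the scaler, transform the features, then fit the regressor. It runs on synthetic stand-in data (an assumption, so that the example does not depend on downloading the dataset), and all names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features and 2 targets, like the dataset.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 8)
y_demo = np.column_stack([X_demo.sum(axis=1), X_demo[:, 0] - X_demo[:, 1]])

params = dict(max_depth=8, min_samples_leaf=5, min_samples_split=22,
              n_estimators=44, random_state=0)

# Pipeline version: scaling and training happen inside one fit call.
pipe_demo = make_pipeline(StandardScaler(), ExtraTreesRegressor(**params))
pipe_demo.fit(X_demo, y_demo)

# Manual version: fit the scaler, transform the features, then fit the regressor.
scaler_demo = StandardScaler().fit(X_demo)
etr_demo = ExtraTreesRegressor(**params).fit(scaler_demo.transform(X_demo), y_demo)

# With the same random_state, both routes should give the same predictions.
X_new = rng.rand(5, 8)
same = np.allclose(pipe_demo.predict(X_new),
                   etr_demo.predict(scaler_demo.transform(X_new)))
print(same)
```

The practical benefit of the pipeline is that the fitted scaler travels with the regressor, so new data is scaled with the training statistics automatically at prediction time.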

In [8]:
# Score of the model
pipe.score(X_test, y_test)

0.9813522771767467
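
For a regressor, `score` returns the coefficient of determination R². With two targets it is a single number, which in scikit-learn 0.23 and later is the unweighted average of the per-target R² values. The sketch below checks this on synthetic stand-in data (an assumption); the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, 2 targets.
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 8)
y_demo = np.column_stack([X_demo.sum(axis=1), X_demo[:, 0] * X_demo[:, 1]])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

pipe_demo = make_pipeline(
    StandardScaler(),
    ExtraTreesRegressor(max_depth=8, min_samples_leaf=5,
                        min_samples_split=22, n_estimators=44, random_state=0))
pipe_demo.fit(X_tr, y_tr)

# One R^2 value per target column.
per_target_r2 = r2_score(y_te, pipe_demo.predict(X_te), multioutput="raw_values")

# The pipeline's score is a single number summarising both targets.
overall_r2 = pipe_demo.score(X_te, y_te)

print(per_target_r2, overall_r2)
```

Looking at `per_target_r2` separately can show whether one of the two loads is predicted noticeably worse than the other, which the averaged score hides.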

In [9]:
# Saving the model
dump(pipe, 'model.joblib')

['model.joblib']

In [10]:
# Testing the loaded model
model = load('model.joblib')
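
One way to confirm that nothing was lost in the save and load round trip is to compare predictions from the in-memory pipeline with those from the reloaded one. This is a minimal sketch on synthetic stand-in data (an assumption), writing to a temporary directory rather than to `model.joblib`; the `_demo` names are illustrative.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, 2 targets.
rng = np.random.RandomState(0)
X_demo = rng.rand(150, 8)
y_demo = np.column_stack([X_demo.sum(axis=1), X_demo[:, 0] - X_demo[:, 1]])

pipe_demo = make_pipeline(
    StandardScaler(),
    ExtraTreesRegressor(max_depth=8, min_samples_leaf=5,
                        min_samples_split=22, n_estimators=44, random_state=0))
pipe_demo.fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.joblib")
    dump(pipe_demo, path)
    restored_demo = load(path)

# The restored pipeline should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(pipe_demo.predict(X_demo),
                               restored_demo.predict(X_demo))
print(round_trip_ok)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.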

In [11]:
# Predictions of the loaded model for X_test
model.predict(X_test)

array([[15.21076923, 17.95384615],
       [10.32466667, 13.49066667],
       [36.00271808, 38.47570084],
       [19.06782541, 25.37542872],
       [32.5233079 , 33.39104329],
       [28.96916667, 31.3025    ],
       [28.5075    , 29.769375  ],
       [29.13841783, 30.75102819],
       [29.27466667, 31.22266667],
       [23.56      , 26.71615385],
       [ 6.72794148, 11.7734517 ],
       [41.90538462, 43.39538462],
       [11.51603749, 14.45692641],
       [41.90538462, 43.39538462],
       [41.90538462, 43.39538462],
       [26.32      , 28.603125  ],
       [10.67      , 13.965     ],
       [28.96916667, 31.3025    ],
       [14.34473661, 15.13146929],
       [12.21421982, 15.00660693],
       [12.65374148, 15.48288447],
       [32.5233079 , 33.39104329],
       [10.67      , 13.965     ],
       [39.10275036, 41.11082517],
       [ 6.72794148, 11.7734517 ],
       [14.44373605, 16.82592951],
       [12.60714155, 14.13397658],
       [13.93720473, 16.32996951],
       [10.67      ,

All of the code above will be saved in *model.py* so that the model can be produced easily.
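
As a rough idea of how such a script could be organised, the sketch below wraps the split, training, and saving steps in one function. The function name `train_and_save` and the demo frame are hypothetical, and the actual *model.py* may be structured differently. The demo uses a small synthetic DataFrame (an assumption) so that it runs without downloading the dataset.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

TARGETS = ['heating_load', 'cooling_load']


def train_and_save(df, model_path):
    """Train the scaler + regressor pipeline on df, save it, return the test score."""
    X = df.drop(TARGETS, axis=1)
    y = df[TARGETS]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    pipe = make_pipeline(
        StandardScaler(),
        ExtraTreesRegressor(max_depth=8, min_samples_leaf=5,
                            min_samples_split=22, n_estimators=44, random_state=0))
    pipe.fit(X_train, y_train)
    dump(pipe, model_path)
    return pipe.score(X_test, y_test)


# Demo on a small synthetic frame (assumption); the real script would pass the
# DataFrame read from the UCI spreadsheet instead.
rng = np.random.RandomState(0)
demo_df = pd.DataFrame(rng.rand(120, 4),
                       columns=['a', 'b', 'heating_load', 'cooling_load'])
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'model.joblib')
    demo_score = train_and_save(demo_df, demo_path)
    demo_model = load(demo_path)

print(demo_model.predict(demo_df[['a', 'b']]).shape)
```

Keeping the data loading outside the function means the same code can be exercised on a small frame like this one before being run on the full dataset.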

---