# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library for packaging and saving an ML model.

We'll use the popular Boston Housing dataset and build a PyTorch regressor on it.

This is a typical case: before feeding the data to the model, we need to preprocess (scale) them. During development, preprocessing code is usually written separately from the model. We want to wrap and save the preprocessing along with the model, so that we have a Preprocessing+Model pipeline ready to take unprocessed data, process them and make predictions.

We can do that with Clearbox Wrapper, but all the preprocessing code must be wrapped in a single function. In this way, we can pass the function to the _save_model_ method.
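
As a minimal sketch of what "a single function" means here, the snippet below chains two preprocessing steps inside one callable that takes raw data and returns processed data. The step names (`fill_missing`, `add_one`) and the toy rows are hypothetical and only serve the illustration; they are not part of Clearbox Wrapper.

```python
# Hypothetical preprocessing steps, for illustration only.
def fill_missing(rows):
    """Replace missing values (None) with 0.0."""
    return [[0.0 if value is None else value for value in row] for row in rows]


def add_one(rows):
    """Increment every value by 1."""
    return [[value + 1.0 for value in row] for row in rows]


def preprocessing(x_data):
    """Single entry point: raw data in, processed data out."""
    data = fill_missing(x_data)
    data = add_one(data)
    return data


print(preprocessing([[1.0, None], [3.0, 4.0]]))
```

Whatever the individual steps are, only the outer function needs to be handed over; it calls everything else.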

## Install and import required libraries

In [1]:
%%capture 
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install torch

!pip install clearbox-wrapper==0.3.10

In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import RobustScaler
import sklearn.metrics as metrics

import torch
import torch.nn as nn
from torch.nn import functional as F

import clearbox_wrapper as cbw

## Datasets

We have two separate CSV files, one for the training set and one for the test set.

In [3]:
boston_training_csv_path = 'boston_training_set.csv'
boston_test_csv_path = 'boston_test_set.csv'

In [4]:
boston_training = pd.read_csv(boston_training_csv_path)
boston_test = pd.read_csv(boston_test_csv_path)

In [5]:
target_column = 'MEDV'

In [6]:
y_train = boston_training[target_column]
X_train = boston_training.drop(target_column, axis=1)

In [7]:
y_test = boston_test[target_column]
X_test = boston_test.drop(target_column, axis=1)

In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404 entries, 0 to 403
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     404 non-null    float64
 1   ZN       404 non-null    float64
 2   INDUS    404 non-null    float64
 3   CHAS     404 non-null    int64  
 4   NOX      404 non-null    float64
 5   RM       404 non-null    float64
 6   AGE      404 non-null    float64
 7   DIS      404 non-null    float64
 8   RAD      404 non-null    int64  
 9   TAX      404 non-null    int64  
 10  PTRATIO  404 non-null    float64
 11  B        404 non-null    float64
 12  LSTAT    404 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 41.2 KB


In [9]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     102 non-null    float64
 1   ZN       102 non-null    float64
 2   INDUS    102 non-null    float64
 3   CHAS     102 non-null    int64  
 4   NOX      102 non-null    float64
 5   RM       102 non-null    float64
 6   AGE      102 non-null    float64
 7   DIS      102 non-null    float64
 8   RAD      102 non-null    int64  
 9   TAX      102 non-null    int64  
 10  PTRATIO  102 non-null    float64
 11  B        102 non-null    float64
 12  LSTAT    102 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 10.5 KB


## Create a preprocessing function

The data need to be preprocessed before being passed as input to the model. You can use your own custom code for the preprocessing; just remember to wrap all of it in a single function.

The following preprocessing is not meant to be meaningful: it is provided only to show the possibilities offered by the wrapper.

We fit a scikit-learn scaler on the X training set:

In [10]:
robust_scaler = RobustScaler()

In [11]:
robust_scaler.fit(X_train)

RobustScaler()

Then we wrap the processing into a function, adding a deliberately useless extra line that increments every value of the dataset by 1:

In [12]:
def boston_preprocessing(x_data):
    processed_data = robust_scaler.transform(x_data)
    processed_data = processed_data + 1.0
    return processed_data
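
To make the arithmetic of the function above concrete, here is a hand-written NumPy sketch of what it computes for one column, assuming `RobustScaler` is used with its default settings (centering on the median, scaling by the 25th to 75th percentile range). The toy column is hypothetical and not taken from the Boston dataset.

```python
import numpy as np

# Hypothetical toy column with one outlier.
column = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(column)
q25, q75 = np.percentile(column, [25.0, 75.0])

scaled = (column - median) / (q75 - q25)  # robust scaling by hand
shifted = scaled + 1.0                    # the extra "+ 1.0" step

print(shifted)
```

The outlier does not distort the scale of the other values, because the median and the interquartile range are insensitive to it.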

## Create and train the model

We build a PyTorch network with some basic parameters...

In [13]:
class BostonModel(nn.Module):
    def __init__(self, n_features, hiddenA, hiddenB):
        super(BostonModel, self).__init__()
        self.linearA = nn.Linear(n_features, hiddenA)
        self.linearB = nn.Linear(hiddenA, hiddenB)
        self.linearC = nn.Linear(hiddenB, 1)

    def forward(self, x):
        yA = F.relu(self.linearA(x))
        yB = F.relu(self.linearB(yA))
        return self.linearC(yB)
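
As a rough size check, the number of trainable parameters of this network can be worked out by hand: each fully connected layer holds one weight per input/output pair plus one bias per output. The sketch assumes the 13 input features of the dataset and the hidden sizes 100 and 50 used later when the model is instantiated; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


n_features, hidden_a, hidden_b = 13, 100, 50

total_params = (
    linear_params(n_features, hidden_a)
    + linear_params(hidden_a, hidden_b)
    + linear_params(hidden_b, 1)
)
print(total_params)
```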

...preprocess the training and test data through our function...

In [14]:
X_train_processed = boston_preprocessing(X_train)
X_train_processed_tensor = torch.Tensor(X_train_processed)

In [15]:
X_test_processed = boston_preprocessing(X_test)
X_test_processed_tensor = torch.Tensor(X_test_processed)

...convert the y training and test data to the PyTorch format as well...

In [16]:
y_train_torch = torch.Tensor(y_train.values)
y_test_torch = torch.Tensor(y_test.values)
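
The target tensors above are one-dimensional, while the network returns one column per sample, which is why the training loop adds a trailing axis to the targets with `torch.unsqueeze`. The sketch below uses NumPy arrays as a stand-in (PyTorch follows the same broadcasting rules) to show what the shapes would look like with and without that extra axis; the array contents are placeholders.

```python
import numpy as np

predictions = np.zeros((25, 1))  # shape of a model output for a batch of 25
targets = np.ones(25)            # shape of a 1-D batch of targets

mismatched = predictions - targets              # silently broadcasts
aligned = predictions - targets[:, np.newaxis]  # element-wise, as intended

print(mismatched.shape, aligned.shape)
```

Without the extra axis the difference becomes a 25 x 25 matrix, so a loss computed on it would not compare each prediction with its own target.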

...and finally create a model instance and fit it on the resulting data:

In [17]:
torch.manual_seed(42)

<torch._C.Generator at 0x7fbc75f94c30>

In [18]:
train_datasets = torch.utils.data.TensorDataset(X_train_processed_tensor, y_train_torch)
train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=25, shuffle=True)

In [19]:
torch_regr = BostonModel(X_train_processed_tensor.shape[1], 100, 50)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(torch_regr.parameters(), lr=0.0001)

In [20]:
n_epochs = 150
all_losses = []
for epoch in range(n_epochs):
    losses = []
    total = 0
    for inputs, target in train_loader:
        optimizer.zero_grad()
        y_pred = torch_regr(inputs)
        loss = criterion(y_pred, torch.unsqueeze(target,dim=1))

        loss.backward()
        
        optimizer.step()
        
        losses.append(loss.item())
        total += 1

    epoch_loss = sum(losses) / total
    all_losses.append(epoch_loss)
                
    mess = f"Epoch #{epoch+1}\tLoss: {all_losses[-1]:.3f}"
    if (epoch%25 == 0):
        print(mess)

Epoch #1	Loss: 14388.408
Epoch #26	Loss: 4106.438
Epoch #51	Loss: 1636.331
Epoch #76	Loss: 1068.869
Epoch #101	Loss: 844.120
Epoch #126	Loss: 729.870
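
The printed values look large because the criterion uses `reduction='sum'`: each batch loss is a sum of squared errors, and the epoch value is the mean of those sums over the batches. A back-of-the-envelope conversion to a per-sample mean squared error, assuming the 404 training rows, the batch size of 25 and a final partial batch that is kept, could look like this:

```python
import math

n_samples = 404          # rows in the training set
batch_size = 25
reported_loss = 729.870  # value printed for epoch 126

n_batches = math.ceil(n_samples / batch_size)
per_sample_mse = reported_loss * n_batches / n_samples

print(n_batches, round(per_sample_mse, 2))
```

The result is on the same scale as the Mean Squared Error reported by scikit-learn in the next cells.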


We show some metrics on the training set...

In [21]:
torch_regr.eval()
training_predictions = torch_regr(X_train_processed_tensor).detach().numpy()

In [22]:
print('Training Report\n')
print('Target Column min: {}'.format(y_train.min()))
print('Target Column max: {}'.format(y_train.max()))
print('Target Column mean: {}'.format(y_train.mean()))
print('Target Column std: {}\n'.format(y_train.std()))
print("Max Error: {}".format(metrics.max_error(y_train, training_predictions)))
print("Mean Absolute Error: {}".format(metrics.mean_absolute_error(y_train, training_predictions)))
print("Mean Squared Error: {}".format(metrics.mean_squared_error(y_train, training_predictions)))
print("R2 Score: {}".format(metrics.r2_score(y_train, training_predictions)))

Training Report

Target Column min: 5.0
Target Column max: 50.0
Target Column mean: 22.796534653465343
Target Column std: 9.332147158711562

Max Error: 34.06868076324463
Mean Absolute Error: 3.6331796780671217
Mean Squared Error: 27.9573566724688
R2 Score: 0.6781827873784485
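
The numbers in this report are tied together: the R2 score equals one minus the ratio between the mean squared error and the population variance of the target. The sketch below re-derives it from the printed values, assuming the standard deviation comes from pandas (which divides by n - 1) and therefore has to be rescaled.

```python
n = 404
target_std = 9.332147158711562  # pandas std, computed with ddof=1
mse = 27.9573566724688

# Convert the sample variance (ddof=1) to the population variance (ddof=0).
population_variance = target_std ** 2 * (n - 1) / n
r2 = 1.0 - mse / population_variance

print(round(r2, 4))
```

The value should land very close to the R2 score printed above, which is a quick way to check that the reported metrics are mutually consistent.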


...and on the test set:

In [23]:
test_predictions = torch_regr(X_test_processed_tensor).detach().numpy()

In [24]:
print('Test Report\n')
print('Target Column min: {}'.format(y_test.min()))
print('Target Column max: {}'.format(y_test.max()))
print('Target Column mean: {}'.format(y_test.mean()))
print('Target Column std: {}\n'.format(y_test.std()))
print("Max Error: {}".format(metrics.max_error(y_test, test_predictions)))
print("Mean Absolute Error: {}".format(metrics.mean_absolute_error(y_test, test_predictions)))
print("Mean Squared Error: {}".format(metrics.mean_squared_error(y_test, test_predictions)))
print("R2 Score: {}".format(metrics.r2_score(y_test, test_predictions)))

Test Report

Target Column min: 5.0
Target Column max: 50.0
Target Column mean: 21.488235294117644
Target Column std: 8.60580386839697

Max Error: 30.797901153564453
Mean Absolute Error: 3.4708297935186647
Mean Squared Error: 27.61264297859737
R2 Score: 0.623466269042327


## Wrap and Save the Model

Finally, we use Clearbox Wrapper to wrap and save the model and the preprocessor as a zipped folder at a specified path.

The model dependency (torch) and its version are detected automatically by CBW and added to the requirements saved in the resulting folder. However (**IMPORTANT**), you need to pass any additional dependencies required by the preprocessing as a list parameter. In this case, we only need to add scikit-learn.

We also pass the training dataset (X train) to `save_model`, in order to generate a Preprocessing and Model Signature. The signature represents the inputs and (optionally) the outputs as data frames with (optionally) named columns and data types.

In [25]:
wrapped_model_path = 'boston_wrapped_model_v0.3.10'

In [26]:
processing_dependencies = ["scikit-learn==0.23.2"]

In [27]:
cbw.save_model(wrapped_model_path, torch_regr, preprocessing=boston_preprocessing, input_data=X_train, additional_deps=processing_dependencies)

## Unzip and load the model

The following cells are not necessary for end users: the generated zip should be uploaded to our SaaS as is. Here, though, we want to show how to load a saved model and compare it to the original one.

In [28]:
import zipfile

In [29]:
zipped_model_path = 'boston_wrapped_model_v0.3.10.zip'
unzipped_model_path = 'boston_wrapped_model_v0.3.10_unzipped'

In [30]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)
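
Before extracting, it can be handy to check that an archive is intact and see what it contains. The following is a generic `zipfile` sketch on a throwaway archive created on the spot; the file name `placeholder.txt` is hypothetical and says nothing about the actual layout of a wrapped model.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical archive standing in for the wrapped model zip.
    archive_path = os.path.join(tmp_dir, "example.zip")
    with zipfile.ZipFile(archive_path, "w") as zip_ref:
        zip_ref.writestr("placeholder.txt", "demo")

    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()     # None when every member passes its CRC check
        member_names = zip_ref.namelist()  # names of the archived files

print(bad_member, member_names)
```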

In [31]:
loaded_model = cbw.load_model(unzipped_model_path)

In [32]:
original_model_predictions = torch_regr(X_test_processed_tensor).detach().numpy()

In [33]:
loaded_model_predictions = loaded_model.predict(X_test)

In [34]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)
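
The check above requires bit-for-bit identical predictions. If the reloaded model were ever run in a different environment, tiny floating-point differences might appear (this is an assumption, not something observed in this tutorial), and a tolerance-based comparison could be used instead. The arrays below are hypothetical stand-ins for the two prediction arrays.

```python
import numpy as np

# Hypothetical prediction arrays, for illustration only.
original = np.array([[21.5], [30.25], [18.0]], dtype=np.float32)
reloaded = original + np.float32(1e-6)

# Passes as long as the arrays agree within the given tolerances.
np.testing.assert_allclose(reloaded, original, rtol=1e-5, atol=1e-5)

max_abs_diff = float(np.max(np.abs(reloaded - original)))
print(max_abs_diff)
```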

## Remove all generated files and directories

In [None]:
import os
import shutil

In [None]:
if os.path.exists(zipped_model_path):
    os.remove(zipped_model_path)

In [None]:
if os.path.exists(unzipped_model_path):
    shutil.rmtree(unzipped_model_path)