# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library to package and save a ML model.

We'll use the popular Boston Housing dataset and build a Pytorch regressor on it.

This is a typical case: before feeding the data to the model, we need to pre-process (scaling) them. Pre-processing code is usually written as a separate element wrt to the model, during the development phase. We want to wrap and save the pre-processing along with the model so to have a pipeline Processing+Model ready to take unprocessed data, process them and make predictions.

We can do that with Clearbox Wrapper, but all the pre-processing code must be wrapped in a single function. In this way, we can pass the function to the _save_model_ method.

## Install and import required libraries

In [1]:
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install torch

!pip install clearbox-wrapper

You should consider upgrading via the '/home/andrea/clearbox_repos/clearbox-model-garden/.venv/bin/python -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/home/andrea/clearbox_repos/clearbox-model-garden/.venv/bin/python -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/home/andrea/clearbox_repos/clearbox-model-garden/.venv/bin/python -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/home/andrea/clearbox_repos/clearbox-model-garden/.venv/bin/python -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/home/andrea/clearbox_repos/clearbox-model-garden/.venv/bin/python -m pip install --upgrade pip' command.[0m


In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import RobustScaler

import torch
import torch.nn as nn

import clearbox_wrapper as cbw

## Datasets

We have two different csv files for the training and test set.

In [3]:
boston_training_csv_path = 'boston_training_set.csv'
boston_test_csv_path = 'boston_test_set.csv'

  and should_run_async(code)


In [4]:
boston_training = pd.read_csv(boston_training_csv_path)
boston_test = pd.read_csv(boston_test_csv_path)

In [5]:
target_column = 'MEDV'

In [6]:
y_train = boston_training[target_column]
X_train = boston_training.drop(target_column, axis=1)

In [7]:
y_test = boston_test[target_column]
X_test = boston_test.drop(target_column, axis=1)

In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404 entries, 0 to 403
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     404 non-null    float64
 1   ZN       404 non-null    float64
 2   INDUS    404 non-null    float64
 3   CHAS     404 non-null    int64  
 4   NOX      404 non-null    float64
 5   RM       404 non-null    float64
 6   AGE      404 non-null    float64
 7   DIS      404 non-null    float64
 8   RAD      404 non-null    int64  
 9   TAX      404 non-null    int64  
 10  PTRATIO  404 non-null    float64
 11  B        404 non-null    float64
 12  LSTAT    404 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 41.2 KB


In [9]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     102 non-null    float64
 1   ZN       102 non-null    float64
 2   INDUS    102 non-null    float64
 3   CHAS     102 non-null    int64  
 4   NOX      102 non-null    float64
 5   RM       102 non-null    float64
 6   AGE      102 non-null    float64
 7   DIS      102 non-null    float64
 8   RAD      102 non-null    int64  
 9   TAX      102 non-null    int64  
 10  PTRATIO  102 non-null    float64
 11  B        102 non-null    float64
 12  LSTAT    102 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 10.5 KB


## Create a preprocessing function

The data need to be preprocessed before be passed as input to the model. You can use your own custom code for the preprocessing, just remember to wrap all of it in a single function.

The following preprocessing makes no sense, it is provided just to show the possibilities offer by the wrapper.

We fit a SKlearn scaler on the X training set:

In [10]:
robust_scaler = RobustScaler()

In [11]:
robust_scaler.fit(X_train)

RobustScaler()

Then, we wrap the processing into a function adding also some useless additional lines that increment by 1 all the values of the dataset and (**IMPORTANT**) we convert the resulting data into the Pytorch format:

In [12]:
def boston_preprocessing(x_data):
    processed_data = robust_scaler.transform(x_data)
    processed_data = processed_data + 1.0
    processed_data = torch.Tensor(processed_data)
    return processed_data

  and should_run_async(code)


## Create and train the model

We build a Pytorch network setting some basic parameters...

In [13]:
num_epochs = 20
learning_rate = 0.0001
size_hidden1 = 25
size_hidden2 = 12
size_hidden3 = 6
size_hidden4 = 1

In [14]:
class BostonModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(13, size_hidden1)
        self.relu1 = nn.ReLU()
        self.lin2 = nn.Linear(size_hidden1, size_hidden2)
        self.relu2 = nn.ReLU()
        self.lin3 = nn.Linear(size_hidden2, size_hidden3)
        self.relu3 = nn.ReLU()
        self.lin4 = nn.Linear(size_hidden3, size_hidden4)

    def forward(self, input):
        return self.lin4(
            self.relu3(self.lin3(self.relu2(self.lin2(self.relu1(self.lin1(input))))))
        )

...add the training function...

In [15]:
def train(model_inp, x_train, y_train, num_epochs=num_epochs):
    datasets = torch.utils.data.TensorDataset(x_train, y_train)
    train_iter = torch.utils.data.DataLoader(datasets, batch_size=10, shuffle=True)
    criterion = nn.MSELoss(reduction="sum")
    optimizer = torch.optim.RMSprop(model_inp.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        for inputs, labels in train_iter:
            # forward pass
            outputs = model_inp(inputs)
            # defining loss
            loss = criterion(outputs, labels)
            # zero the parameter gradients
            optimizer.zero_grad()
            # computing gradients
            loss.backward()
            # accumulating running loss
            running_loss += loss.item()
            # updated weights based on computed gradients
            optimizer.step()
        if (epoch+1) % 5 == 0:
            print(
                "Epoch [%d]/[%d] running accumulative loss across all batches: %.3f"
                % (epoch + 1, num_epochs, running_loss)
            )

...preprocess the training data through our function...

In [16]:
X_train_processed = boston_preprocessing(X_train)

...convert the y training data to the Pytorch format as well...

In [17]:
y_train = torch.Tensor(y_train.values)

  and should_run_async(code)


...and finally create a model instance and fit it on the resulting data:

In [18]:
model = BostonModel()
model.train()
train(model, X_train_processed, y_train)

Epoch [5]/[20] running accumulative loss across all batches: 2382470.088
Epoch [10]/[20] running accumulative loss across all batches: 2286777.667
Epoch [15]/[20] running accumulative loss across all batches: 2133008.015
Epoch [20]/[20] running accumulative loss across all batches: 1907148.792


## Wrap and Save the Model

Finally, we use Clearbox Wrapper to wrap and save the model and the preprocessor as a zipped folder in a specified path. 

The model dependency (torch) and its version it is detected automatically by CBW and added to the requirements saved into the resulting folder. But (**IMPORTANT**) you need to pass as a parameter the additional dependencies required for the preprocessing as a list. We just need to add Scikit-Learn in this case.

In [19]:
wrapped_model_path = 'boston_wrapped_model_v0.0.1'

In [20]:
processing_dependencies = ["scikit-learn==0.23.2"]

In [21]:
cbw.save_model(wrapped_model_path, model, boston_preprocessing, additional_deps=processing_dependencies)

<clearbox_wrapper.clearbox_wrapper.ClearboxWrapper at 0x7fe1348691f0>

## Unzip and load the model

The following cells are not necessary for the final users, the zip created should be uploaded to our SAAS as it is. But here we want to show how to load a saved model and compare it to the original one.

**IMPORTANT**: The wrapped model method _predict_ tries always to predict probabilities if the method required to is available in the saved model. It will look for the method _predict_proba_ of the original, and if it's not there (e.g. regression or model that output probabilities by default), it will use _predict_. So to compare the prediction results, we will use _predict_probas_ for the original model and _predict_ for the saved one.

In [22]:
import zipfile

In [23]:
zipped_model_path = 'boston_wrapped_model_v0.0.1.zip'
unzipped_model_path = 'boston_wrapped_model_v0.0.1_unzipped'

In [25]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)

In [26]:
loaded_model = cbw.load_model(unzipped_model_path)

In [27]:
X_test_processed = boston_preprocessing(X_test)
original_model_predictions = model(X_test_processed).detach().numpy()

In [28]:
loaded_model_predictions = loaded_model.predict(X_test).detach().numpy()

  and should_run_async(code)


In [29]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)

  and should_run_async(code)


## Remove all generated files and directory

In [None]:
import os
import shutil

In [None]:
if os.path.exists(zipped_model_path):
        os.remove(zipped_model_path)

In [None]:
if os.path.exists(unzipped_model_path):
        shutil.rmtree(unzipped_model_path)