# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library to package and save an ML model.

We'll use the popular Diabetes Hospital Readmissions dataset and build a PyTorch classifier on it.

This is a typical case: before feeding the data to the model, we need to preprocess it. During development, preprocessing code is usually written separately from the model. We want to wrap and save the preprocessing together with the model, so that we end up with a Preprocessing+Model pipeline that takes raw data, processes it and makes predictions.

We can do that with Clearbox Wrapper, but all the preprocessing code must be wrapped in a single function, so that we can pass that function to the `save_model` method.
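
To illustrate the single-function requirement, here is a minimal sketch, independent of the dataset used below, of how several hand-written steps (imputation and scaling) could be fitted once and exposed as one callable. The names `make_preprocessing`, `toy_train` and `toy_preprocessing` are hypothetical and exist only for this example.

```python
import numpy as np


def make_preprocessing(x_train):
    """Fit simple statistics on x_train and return ONE function that applies them."""
    # Step 1 statistics: per-column medians, ignoring missing values.
    medians = np.nanmedian(x_train, axis=0)
    filled = np.where(np.isnan(x_train), medians, x_train)
    # Step 2 statistics: per-column mean and standard deviation.
    means = filled.mean(axis=0)
    stds = filled.std(axis=0)
    stds[stds == 0] = 1.0  # avoid division by zero for constant columns

    def preprocessing(x):
        x = np.where(np.isnan(x), medians, x)  # impute missing values
        return (x - means) / stds  # scale with the training statistics

    return preprocessing


# Toy data, made up for illustration only.
toy_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
toy_preprocessing = make_preprocessing(toy_train)
toy_out = toy_preprocessing(np.array([[np.nan, 20.0]]))
print(toy_out)
```

The returned closure keeps the fitted statistics, so it can be called on unseen data without refitting, which is the property a wrapped preprocessing function needs.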

## Install and import required libraries

In [1]:
%%capture 
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install torch

!pip install clearbox-wrapper==0.3.7

In [2]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer


import torch
import torch.nn as nn

import clearbox_wrapper as cbw

## Datasets

We have two separate CSV files, one for the training set and one for the test set.

In [3]:
hospital_training_csv_path = 'hospital_readmissions_training.csv'
hospital_test_csv_path = 'hospital_readmissions_test.csv'

In [4]:
hospital_training = pd.read_csv(hospital_training_csv_path)
hospital_test = pd.read_csv(hospital_test_csv_path)

In [5]:
target_column = 'readmitted'

In [7]:
y_train = hospital_training[target_column]
X_train = hospital_training.drop(target_column, axis=1)

In [8]:
y_test = hospital_test[target_column]
X_test = hospital_test.drop(target_column, axis=1)

In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110612 entries, 0 to 110611
Data columns (total 40 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   age                       110612 non-null  int64 
 1   gender                    110612 non-null  object
 2   race                      110612 non-null  object
 3   diagnosis                 110612 non-null  object
 4   time_in_hospital          110612 non-null  int64 
 5   num_procedures            110612 non-null  int64 
 6   num_lab_procedures        110612 non-null  int64 
 7   num_medications           110612 non-null  int64 
 8   num_inpatient             110612 non-null  int64 
 9   num_outpatient            110612 non-null  int64 
 10  num_emergency             110612 non-null  int64 
 11  num_diagnoses             110612 non-null  int64 
 12  service_utilization       110612 non-null  int64 
 13  num_change                110612 non-null  int64 
 14  num_

In [10]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12290 entries, 0 to 12289
Data columns (total 40 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   age                       12290 non-null  int64 
 1   gender                    12290 non-null  object
 2   race                      12290 non-null  object
 3   diagnosis                 12290 non-null  object
 4   time_in_hospital          12290 non-null  int64 
 5   num_procedures            12290 non-null  int64 
 6   num_lab_procedures        12290 non-null  int64 
 7   num_medications           12290 non-null  int64 
 8   num_inpatient             12290 non-null  int64 
 9   num_outpatient            12290 non-null  int64 
 10  num_emergency             12290 non-null  int64 
 11  num_diagnoses             12290 non-null  int64 
 12  service_utilization       12290 non-null  int64 
 13  num_change                12290 non-null  int64 
 14  num_drugs             

## Create a preprocessing function

The data need to be preprocessed before being passed as input to the model. You can use your own custom code for the preprocessing; just remember to wrap all of it in a single function.

Here, we create a preprocessing pipeline for the features (X) using scikit-learn's `ColumnTransformer` and `Pipeline`, then we fit it on the training X. The fitted `x_processing` object exposes the whole preprocessing through a single `transform` call, so it is easy to pass to the wrapper:

In [11]:
ordinal_features = X_train.select_dtypes(include="number").columns
categorical_features = X_train.select_dtypes(include="object").columns
ordinal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
x_processing = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ordinal_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
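
As a side note on the `handle_unknown="ignore"` option used for the categorical features, the following small sketch shows its effect on a category that was not seen during fitting. The toy values (`"Female"`, `"Male"`, `"Other"`) and variable names are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories (illustrative values only).
toy_train = np.array([["Female"], ["Male"]], dtype=object)
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train)

# "Other" was never seen during fit.
toy_new = np.array([["Male"], ["Other"]], dtype=object)
toy_encoded = toy_encoder.transform(toy_new).toarray()
print(toy_encoded)
```

With this setting, the unseen category is encoded as a row of zeros instead of raising an error, which is convenient when the test set contains values missing from the training set.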

In [12]:
x_processing.fit(X_train)

ColumnTransformer(transformers=[('ord',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 Index(['age', 'time_in_hospital', 'num_procedures', 'num_lab_procedures',
       'num_medications', 'num_inpatient', 'num_outpatient', 'num_emergency',
       'num_diagnoses', 'service_utilization', 'num_change', 'num_drugs'],
      dtype='object')),
                                ('...
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'diabetes_

Then, we wrap the processing into a function:

In [13]:
def hospital_preprocessing(x_data):
    processed_data = x_processing.transform(x_data)
    return processed_data

As usual, we encode the y labels with a simple OneHotEncoder:

In [14]:
y_processing = OneHotEncoder()

In [15]:
y_processing.fit(y_train.values.reshape(-1, 1))

OneHotEncoder()
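
For readers unfamiliar with label encoding, here is a small sketch of what a `OneHotEncoder` produces for a binary label column and how to map the encoding back. The label values `"NO"` and `"YES"` are assumptions made for the example; the actual values of `readmitted` may differ.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label column, reshaped to 2D as the encoder expects.
toy_labels = np.array(["NO", "YES", "NO", "YES"], dtype=object).reshape(-1, 1)

toy_label_encoder = OneHotEncoder()
toy_sparse = toy_label_encoder.fit_transform(toy_labels)  # sparse by default
toy_dense = toy_sparse.toarray()  # one column per class, in sorted order

# Map the one-hot rows back to the original labels.
toy_decoded = toy_label_encoder.inverse_transform(toy_dense)
print(toy_dense)
print(toy_decoded.ravel())
```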

## Create and train the model

We build a PyTorch network, setting some basic parameters...

In [16]:
num_epochs = 35
learning_rate = 0.0001
size_hidden1 = 50
size_hidden2 = 30
size_hidden3 = 10
size_hidden4 = 2

In [17]:
class HospitalModel(nn.Module):
    def __init__(self, input_shape):
        super().__init__()
        self.lin1 = nn.Linear(input_shape, size_hidden1)
        self.relu1 = nn.ReLU()
        self.lin2 = nn.Linear(size_hidden1, size_hidden2)
        self.relu2 = nn.ReLU()
        self.lin3 = nn.Linear(size_hidden2, size_hidden3)
        self.relu3 = nn.ReLU()
        self.lin4 = nn.Linear(size_hidden3, size_hidden4)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input):
        return self.softmax(
            self.lin4(
                self.relu3(
                    self.lin3(self.relu2(self.lin2(self.relu1(self.lin1(input)))))
                )
            )
        )

...add the training function:

In [18]:
def train(model_inp, x_train, y_train, num_epochs=num_epochs):
    datasets = torch.utils.data.TensorDataset(x_train, y_train)
    train_iter = torch.utils.data.DataLoader(datasets, batch_size=500, shuffle=True)
    criterion = nn.BCELoss()
    # optimize the parameters of the model passed in, not a global `model`
    optimizer = torch.optim.Adam(model_inp.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        for inputs, labels in train_iter:
            # forward pass
            outputs = model_inp(inputs)
            # defining loss
            loss = criterion(outputs, labels)
            # zero the parameter gradients
            optimizer.zero_grad()
            # computing gradients
            loss.backward()
            # accumulating running loss
            running_loss += loss.item()
            # updated weights based on computed gradients
            optimizer.step()
        if (epoch+1) % 5 == 0:
            print(
                "Epoch [%d]/[%d] running accumulative loss across all batches: %.3f"
                % (epoch + 1, num_epochs, running_loss)
            )
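
The training loop above scores the softmax outputs against the one-hot labels with `nn.BCELoss`. As a small worked example with made-up numbers, this is the quantity that loss computes with its default mean reduction: the average of -(y log p + (1 - y) log(1 - p)) over all elements. The sketch uses NumPy only, and `toy_probs`, `toy_targets` and `toy_loss` are illustrative names.

```python
import numpy as np

# Two samples, two classes: predicted probabilities and one-hot targets.
toy_probs = np.array([[0.8, 0.2], [0.3, 0.7]])
toy_targets = np.array([[1.0, 0.0], [0.0, 1.0]])

# Binary cross-entropy, element by element.
elementwise = -(
    toy_targets * np.log(toy_probs)
    + (1.0 - toy_targets) * np.log(1.0 - toy_probs)
)

# Mean over every element, matching the default reduction of nn.BCELoss.
toy_loss = elementwise.mean()
print(round(float(toy_loss), 4))
```

Note that the value printed during training is the sum of these per-batch means over an epoch, which is why it is much larger than a single batch loss.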

Next, we preprocess the training data:

In [19]:
X_train_processed = hospital_preprocessing(X_train)
X_train_processed = torch.Tensor(X_train_processed)

Then, we encode the y training data and convert it to a PyTorch tensor:

In [20]:
y_train_processed = y_processing.transform(y_train.values.reshape(-1, 1))

In [21]:
y_train_processed = torch.Tensor(y_train_processed.todense())

...and finally create a model instance and fit it on the resulting data:

In [22]:
model = HospitalModel(X_train_processed.shape[1])
model.train()
train(model, X_train_processed, y_train_processed)

Epoch [5]/[35] running accumulative loss across all batches: 98.409
Epoch [10]/[35] running accumulative loss across all batches: 88.739
Epoch [15]/[35] running accumulative loss across all batches: 84.781
Epoch [20]/[35] running accumulative loss across all batches: 81.288
Epoch [25]/[35] running accumulative loss across all batches: 78.118
Epoch [30]/[35] running accumulative loss across all batches: 75.670
Epoch [35]/[35] running accumulative loss across all batches: 74.326


## Wrap and Save the Model

Finally, we use Clearbox Wrapper to wrap and save the model and the preprocessor as a zipped folder in a specified path. 

The model dependency (torch) and its version are detected automatically by CBW and added to the requirements saved in the resulting folder. However (**IMPORTANT**), you must pass any additional dependencies required by the preprocessing as a list. In this case, we only need to add scikit-learn.

We also pass the training dataset (`X_train`) to `save_model`, so that it can generate a preprocessing and model signature. The signature describes the inputs and, optionally, the outputs as data frames with (optionally) named columns and data types.
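
To give a rough idea of the kind of information a signature records, the sketch below collects column names and data types from a small data frame. This is only an illustration built with pandas: the list-of-dicts layout and the names `toy_X` and `toy_signature` are assumptions, not the format Clearbox Wrapper actually stores.

```python
import pandas as pd

# Tiny made-up frame with a subset of the columns shown earlier.
toy_X = pd.DataFrame(
    {
        "age": [55, 70],
        "gender": ["Female", "Male"],
        "time_in_hospital": [3, 8],
    }
)

# Hypothetical signature layout: one entry per input column.
toy_signature = [
    {"name": column, "type": str(dtype)} for column, dtype in toy_X.dtypes.items()
]
for entry in toy_signature:
    print(entry)
```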

In [23]:
wrapped_model_path = 'hospital_wrapped_model_v0.3.7'

In [24]:
processing_dependencies = ["scikit-learn==0.23.2"]

In [25]:
cbw.save_model(wrapped_model_path, model, preprocessing=hospital_preprocessing, input_data=X_train, additional_deps=processing_dependencies)

## Unzip and load the model

The following cells are not needed by end users: the generated zip should be uploaded to our SaaS as is. Here we just want to show how to load a saved model and compare it with the original one.

In [26]:
import zipfile

In [27]:
zipped_model_path = 'hospital_wrapped_model_v0.3.7.zip'
unzipped_model_path = 'hospital_wrapped_model_v0.3.7_unzipped'

In [28]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)

In [29]:
loaded_model = cbw.load_model(unzipped_model_path)

In [30]:
X_test_processed = hospital_preprocessing(X_test)
X_test_processed = torch.Tensor(X_test_processed)
original_model_predictions = model(X_test_processed).detach().numpy()

In [31]:
loaded_model_predictions = loaded_model.predict(X_test)

In [32]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)
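
The check above requires exact equality. If a saved model were reloaded on different hardware or with different library versions, tiny floating-point differences might appear (an assumption, not something observed in this tutorial). In that case, a tolerance-based comparison is a gentler alternative, sketched here on made-up arrays named `toy_original` and `toy_loaded`.

```python
import numpy as np

# Made-up class probabilities standing in for the two prediction arrays.
toy_original = np.array([[0.9, 0.1], [0.2, 0.8]], dtype=np.float32)
toy_loaded = toy_original + np.float32(1e-7)  # simulate tiny numerical noise

# Passes as long as the arrays agree within the given tolerances.
np.testing.assert_allclose(toy_original, toy_loaded, rtol=1e-5, atol=1e-6)

# Convert probabilities to predicted class indices.
toy_classes = toy_loaded.argmax(axis=1)
print(toy_classes)
```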

## Remove all generated files and directory

In [None]:
import os
import shutil

In [None]:
if os.path.exists(zipped_model_path):
    os.remove(zipped_model_path)

In [None]:
if os.path.exists(unzipped_model_path):
    shutil.rmtree(unzipped_model_path)