# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library for packaging and saving an ML model.

This is the simplest possible case: we'll wrap a scikit-learn model trained on the popular Iris dataset. The dataset contains only numerical values and we'll use all columns, so X needs neither **preprocessing** nor **data preparation**. We'll use a simple LabelEncoder to encode the y strings as numerical values (0, 1, 2), but the LabelEncoder doesn't need to be saved together with the model.

## Install and import required libraries

In [1]:
%%capture
!pip install pandas
!pip install numpy
!pip install scikit-learn

!pip install clearbox-wrapper==0.3.6

In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

import clearbox_wrapper as cbw

## Datasets

We have two separate CSV files, one for the training set and one for the test set.

In [3]:
iris_training_csv_path = 'iris_training_set.csv'
iris_test_csv_path = 'iris_test_set.csv'

In [4]:
iris_training = pd.read_csv(iris_training_csv_path)
iris_test = pd.read_csv(iris_test_csv_path)

In [5]:
target_column = 'species'

In [6]:
y_train = iris_training[target_column]
X_train = iris_training.drop(target_column, axis=1)

In [7]:
y_test = iris_test[target_column]
X_test = iris_test.drop(target_column, axis=1)

In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  120 non-null    float64
 1   sepal_width   120 non-null    float64
 2   petal_length  120 non-null    float64
 3   petal_width   120 non-null    float64
dtypes: float64(4)
memory usage: 3.9 KB


In [9]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  30 non-null     float64
 1   sepal_width   30 non-null     float64
 2   petal_length  30 non-null     float64
 3   petal_width   30 non-null     float64
dtypes: float64(4)
memory usage: 1.1 KB


We create a simple LabelEncoder for the y series:

In [10]:
y_encoder = LabelEncoder()

We fit the LabelEncoder on the training set's y and then encode the y of both datasets:

In [11]:
y_train = y_encoder.fit_transform(y_train)
y_test = y_encoder.transform(y_test)
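
To make the encoding step concrete, here is a minimal sketch of how a LabelEncoder maps strings to integer codes and back. The sample labels are hypothetical stand-ins for the `species` column, not rows taken from the CSV files.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels standing in for the 'species' column.
sample_labels = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ holds the sorted unique labels; a label's position is its code.
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded)
```

Because the codes follow the sorted order of the labels seen during `fit`, the encoder must be fitted on the training labels only and then reused, as the cell above does.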

## Create and train the model

We build a simple scikit-learn decision tree classifier, setting a few basic parameters...

In [12]:
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=42)

...and fit it on the training dataset:

In [13]:
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=4, random_state=42)
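
If the CSV files are not at hand, the same training step can be sketched with the copy of Iris bundled in scikit-learn. This is an illustrative stand-in under that assumption: the split below is not necessarily the one used for `iris_training_set.csv` and `iris_test_set.csv`, and the names `sketch_clf`, `X_tr`, and `X_te` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data replaces the tutorial's CSV files.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the 150 rows, giving a 120/30 split stratified by class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same hyperparameters as the tutorial's classifier.
sketch_clf = DecisionTreeClassifier(max_depth=4, random_state=42)
sketch_clf.fit(X_tr, y_tr)

# max_depth is an upper bound; the fitted tree may be shallower.
fitted_depth = sketch_clf.get_depth()

# Mean accuracy on the held-out rows.
test_accuracy = sketch_clf.score(X_te, y_te)
```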

## Wrap and save the model

Finally, we use Clearbox Wrapper to wrap and save the model as a zipped folder at a specified path. The only dependency this model requires is scikit-learn, which CBW detects automatically and adds to the requirements saved in the resulting folder. We pass the training dataset to `save_model` to generate a Model Signature, which describes the model input as a data frame with (optionally) named columns and their data types.

In [14]:
wrapped_model_path = 'iris_wrapped_model_v0.3.6'

In [15]:
cbw.save_model(wrapped_model_path, tree_clf, input_data=X_train)
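
To give an idea of the information a signature carries, the sketch below collects column names and dtypes from a data frame. It only illustrates the concept: `describe_columns` is a hypothetical helper, and the format Clearbox Wrapper actually stores is not shown here.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as X_train.
sample_X = pd.DataFrame(
    {
        "sepal_length": [5.1, 6.2],
        "sepal_width": [3.5, 2.9],
        "petal_length": [1.4, 4.3],
        "petal_width": [0.2, 1.3],
    }
)


def describe_columns(df):
    """Return a {column name: dtype name} mapping for a DataFrame."""
    return {column: str(dtype) for column, dtype in df.dtypes.items()}


column_types = describe_columns(sample_X)

# New input can be checked against the recorded description before predicting.
same_layout = describe_columns(sample_X.head(1)) == column_types
```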

## Unzip and load the model

The following cells are not necessary for end users: the generated zip file should be uploaded to our SaaS as is. Here, however, we show how to load a saved model and compare it with the original one.

In [16]:
import zipfile

In [17]:
zipped_model_path = 'iris_wrapped_model_v0.3.6.zip'
unzipped_model_path = 'iris_wrapped_model_v0.3.6_unzipped'

In [18]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)
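
The zip round trip can be tried in isolation with the standard library alone. The sketch below zips a hypothetical folder (`demo_model`, standing in for a saved model directory) and extracts it again inside a temporary directory; it does not claim to reproduce how Clearbox Wrapper builds its archive.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Hypothetical folder standing in for a saved model directory.
    model_dir = tmp_path / "demo_model"
    model_dir.mkdir()
    (model_dir / "requirements.txt").write_text("scikit-learn\n")

    # make_archive appends ".zip" to the base name and returns the archive path.
    archive_path = shutil.make_archive(
        str(tmp_path / "demo_model"), "zip", root_dir=model_dir
    )

    # Inspect the archive members, then extract them into a new folder.
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        archive_members = zip_ref.namelist()
        zip_ref.extractall(tmp_path / "demo_model_unzipped")

    extracted_text = (
        tmp_path / "demo_model_unzipped" / "requirements.txt"
    ).read_text()
```

Calling `namelist()` before `extractall()` is a cheap way to confirm what an archive contains before writing anything to disk.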

In [19]:
loaded_model = cbw.load_model(unzipped_model_path)

In [20]:
original_model_predictions = tree_clf.predict_proba(X_test)

In [21]:
loaded_model_predictions = loaded_model.predict_proba(X_test)

In [22]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)
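
`np.testing.assert_array_equal` returns nothing when the arrays match and raises an `AssertionError` otherwise, so a silent cell means the loaded model reproduces the original predictions exactly. The sketch below shows both outcomes on hypothetical probability arrays, along with `np.testing.assert_allclose` as a tolerance-based alternative for cases where tiny floating-point differences are acceptable.

```python
import numpy as np

# Hypothetical class-probability outputs for three samples.
original = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
reloaded = original.copy()
perturbed = original + 1e-9

# Identical arrays: the exact comparison passes silently.
np.testing.assert_array_equal(original, reloaded)

# A tiny perturbation is enough to make the exact comparison fail.
try:
    np.testing.assert_array_equal(original, perturbed)
    exact_match = True
except AssertionError:
    exact_match = False

# A tolerance-based comparison accepts differences up to atol.
np.testing.assert_allclose(original, perturbed, rtol=0, atol=1e-6)
```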

## Remove all generated files and directories

In [None]:
import os
import shutil

In [None]:
if os.path.exists(zipped_model_path):
    os.remove(zipped_model_path)

In [None]:
if os.path.exists(unzipped_model_path):
    shutil.rmtree(unzipped_model_path)
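
The two cleanup cells can also be folded into one helper that handles both files and directories. This is a minimal sketch: `remove_if_exists` is a hypothetical name, and the demo paths live in a temporary directory rather than pointing at the tutorial's artifacts.

```python
import shutil
import tempfile
from pathlib import Path


def remove_if_exists(path):
    """Delete a file or directory tree; return True if something was removed."""
    path = Path(path)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    if path.exists():
        path.unlink()
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the zip file and the unzipped folder.
    demo_file = Path(tmp) / "demo.zip"
    demo_file.write_text("placeholder")
    demo_dir = Path(tmp) / "demo_unzipped"
    demo_dir.mkdir()

    removed_file = remove_if_exists(demo_file)
    removed_dir = remove_if_exists(demo_dir)
    # A second call is a no-op, so the cleanup is safe to re-run.
    removed_again = remove_if_exists(demo_file)
```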