# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library to package and save an ML model.

We'll use the popular UCI Adult dataset and build a Keras classifier on it.

Before the data are fed to the model, they go through two preliminary steps: data cleaning and pre-processing.

## Data Cleaning

By __data cleaning__ we mean the processing of raw data. Any kind of operation is allowed, but cleaning often includes removing or normalizing columns, replacing values, or adding a column based on the values of other columns.

This kind of processing is usually performed before any model is even considered. The entire dataset is cleaned, and the subsequent processing steps and the model are built using only the cleaned data. But this is not always the case. Often the _data cleaning step_ must be treated as the first step of the model-in-production pipeline: the model, ready to take an input and compute a prediction, will receive a dirty data instance that must be cleaned first. After this step, no matter what kind of transformation the data have been through, they should still be readable and understandable by a human user.
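
As a rough illustration of the operations listed above (normalizing a column, adding a derived column, removing columns), here is a minimal sketch on a single raw instance. The function name `toy_cleaning`, the derived column `capital_net`, and the sample values are assumptions made for this example; they are not part of the tutorial's actual cleaning step.

```python
import pandas as pd


def toy_cleaning(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning step: the output is still a readable table."""
    cleaned = raw.copy()
    # Normalize a text column by stripping stray whitespace.
    cleaned["education"] = cleaned["education"].str.strip()
    # Add a column based on the values of other columns (hypothetical feature).
    cleaned["capital_net"] = cleaned["capital_gain"] - cleaned["capital_loss"]
    # Remove the columns that the new one replaces.
    return cleaned.drop(columns=["capital_gain", "capital_loss"])


raw_instance = pd.DataFrame(
    {"education": [" Masters "], "capital_gain": [5000], "capital_loss": [200]}
)
cleaned_instance = toy_cleaning(raw_instance)
print(cleaned_instance)
```

The cleaned instance still contains labels and numbers that a person can read, which is what separates this step from the numeric encoding done later.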

## Pre-processing

This is the kind of pre-processing we're already familiar with. It includes the operations required to transform the data just before feeding them to the model.


We want to wrap and save the _data cleaning_ and _pre-processing_ steps along with the model, so that we have a Data Cleaning + Pre-processing + Model pipeline ready to take raw data, clean and process them, and make predictions.

We can do that with Clearbox Wrapper, but the _data cleaning_ and _pre-processing_ code must be wrapped in two separate functions.
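
To make the idea of a three-stage pipeline concrete, the sketch below chains two separate functions and a model behind a single `predict` call. The class `RawDataPipeline` and the toy functions are hypothetical stand-ins written for this example; this is not how Clearbox Wrapper is implemented, only a picture of the data flow.

```python
from typing import Callable

import numpy as np
import pandas as pd


class RawDataPipeline:
    """Hypothetical stand-in chaining cleaning, pre-processing and a model."""

    def __init__(self, cleaning: Callable, preprocessing: Callable, model: Callable):
        self.cleaning = cleaning
        self.preprocessing = preprocessing
        self.model = model

    def predict(self, raw: pd.DataFrame) -> np.ndarray:
        cleaned = self.cleaning(raw)            # still a readable DataFrame
        features = self.preprocessing(cleaned)  # numeric array for the model
        return self.model(features)


def toy_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    return df.replace({"education": {"11th": "Dropout"}})


def toy_preprocessing(df: pd.DataFrame) -> np.ndarray:
    # Encode "Dropout" as 1.0 and everything else as 0.0.
    return (df["education"] == "Dropout").to_numpy(dtype=float).reshape(-1, 1)


def toy_model(features: np.ndarray) -> np.ndarray:
    return 1.0 - features  # made-up rule, used only to produce an output


pipeline = RawDataPipeline(toy_cleaning, toy_preprocessing, toy_model)
toy_predictions = pipeline.predict(pd.DataFrame({"education": ["11th", "Masters"]}))
print(toy_predictions)
```

Keeping cleaning and pre-processing as two separate callables means the intermediate, human-readable data can be inspected on its own.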

## Install and import required libraries

In [1]:
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install tensorflow

!pip install clearbox-wrapper==0.3.0



In [2]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

import clearbox_wrapper as cbw

In [3]:
cbw.__version__

'0.3.0'

## Datasets

We already have two separate CSV files for the training set and the test set.

In [4]:
adult_training_csv_path = 'adult_training.csv'
adult_test_csv_path = 'adult_test.csv'

In [5]:
adult_training = pd.read_csv(adult_training_csv_path)
adult_test = pd.read_csv(adult_test_csv_path)

In [6]:
target_column = 'income'

In [7]:
y_train = adult_training[target_column]
X_train = adult_training.drop(target_column, axis=1)

In [8]:
y_test = adult_test[target_column]
X_test = adult_test.drop(target_column, axis=1)

In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   work_class      30725 non-null  object
 2   education       32561 non-null  object
 3   marital_status  32561 non-null  object
 4   occupation      30718 non-null  object
 5   relationship    32561 non-null  object
 6   race            32561 non-null  object
 7   sex             32561 non-null  object
 8   capital_gain    32561 non-null  int64 
 9   capital_loss    32561 non-null  int64 
 10  hours_per_week  32561 non-null  int64 
 11  native_country  31978 non-null  object
dtypes: int64(4), object(8)
memory usage: 3.0+ MB


In [10]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16281 entries, 0 to 16280
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             16281 non-null  int64 
 1   work_class      15318 non-null  object
 2   education       16281 non-null  object
 3   marital_status  16281 non-null  object
 4   occupation      15315 non-null  object
 5   relationship    16281 non-null  object
 6   race            16281 non-null  object
 7   sex             16281 non-null  object
 8   capital_gain    16281 non-null  int64 
 9   capital_loss    16281 non-null  int64 
 10  hours_per_week  16281 non-null  int64 
 11  native_country  16007 non-null  object
dtypes: int64(4), object(8)
memory usage: 1.5+ MB
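
The `info()` output for the training set shows that three columns contain missing values. The number of missing entries per column is the total row count minus the non-null count, as this short worked example shows with the figures copied from that output. On the real DataFrame, `X_train.isna().sum()` should report the same numbers.

```python
# Figures copied from the X_train.info() output above.
total_rows = 32561
non_null_counts = {
    "work_class": 30725,
    "occupation": 30718,
    "native_country": 31978,
}

# Missing values per column = total rows - non-null rows.
missing_counts = {col: total_rows - n for col, n in non_null_counts.items()}
print(missing_counts)
```

These gaps are what the imputers in the pre-processing step later fill in.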


## Data Cleaning

Several columns of the dataset have a high cardinality (many distinct values). We'll clean the data by mapping many of these values onto a smaller set of categories. The cleaning is wrapped in a single function.
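
A quick way to see the effect of such a mapping is to count distinct values before and after applying it. The sketch below uses a small toy column and a subset of the education mapping defined in this section; the variable names are chosen for illustration only.

```python
import pandas as pd

toy_education = pd.Series(
    ["10th", "11th", "12th", "HS-grad", "Some-college", "Masters"]
)
toy_education_map = {
    "10th": "Dropout",
    "11th": "Dropout",
    "12th": "Dropout",
    "HS-grad": "High School grad",
    "Some-college": "High School grad",
}

cardinality_before = toy_education.nunique()
cardinality_after = toy_education.replace(toy_education_map).nunique()
print(cardinality_before, cardinality_after)
```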

In [11]:
education_map = {
            "10th": "Dropout",
            "11th": "Dropout",
            "12th": "Dropout",
            "1st-4th": "Dropout",
            "5th-6th": "Dropout",
            "7th-8th": "Dropout",
            "9th": "Dropout",
            "Preschool": "Dropout",
            "HS-grad": "High School grad",
            "Some-college": "High School grad",
            "Masters": "Masters",
            "Prof-school": "Prof-School",
            "Assoc-acdm": "Associates",
            "Assoc-voc": "Associates",
        }

In [12]:
occupation_map = {
            "Adm-clerical": "Admin",
            "Armed-Forces": "Military",
            "Craft-repair": "Blue-Collar",
            "Exec-managerial": "White-Collar",
            "Farming-fishing": "Blue-Collar",
            "Handlers-cleaners": "Blue-Collar",
            "Machine-op-inspct": "Blue-Collar",
            "Other-service": "Service",
            "Priv-house-serv": "Service",
            "Prof-specialty": "Professional",
            "Protective-serv": "Other",
            "Sales": "Sales",
            "Tech-support": "Other",
            "Transport-moving": "Blue-Collar",
        }

In [13]:
country_map = {
            "Cambodia": "SE-Asia",
            "Canada": "British-Commonwealth",
            "China": "China",
            "Columbia": "South-America",
            "Cuba": "Other",
            "Dominican-Republic": "Latin-America",
            "Ecuador": "South-America",
            "El-Salvador": "South-America",
            "England": "British-Commonwealth",
            "Guatemala": "Latin-America",
            "Haiti": "Latin-America",
            "Honduras": "Latin-America",
            "Hong": "China",
            "India": "British-Commonwealth",
            "Ireland": "British-Commonwealth",
            "Jamaica": "Latin-America",
            "Laos": "SE-Asia",
            "Mexico": "Latin-America",
            "Nicaragua": "Latin-America",
            "Outlying-US(Guam-USVI-etc)": "Latin-America",
            "Peru": "South-America",
            "Philippines": "SE-Asia",
            "Puerto-Rico": "Latin-America",
            "Scotland": "British-Commonwealth",
            "Taiwan": "China",
            "Thailand": "SE-Asia",
            "Trinadad&Tobago": "Latin-America",
            "United-States": "United-States",
            "Vietnam": "SE-Asia",
        }

In [14]:
married_map = {
            "Never-married": "Never-Married",
            "Married-AF-spouse": "Married",
            "Married-civ-spouse": "Married",
            "Married-spouse-absent": "Separated",
            "Divorced": "Separated",
        }

In [15]:
mapping = {
            "education": education_map,
            "occupation": occupation_map,
            "native_country": country_map,
            "marital_status": married_map,
        }

In [11]:
def cleaning(x):
    education_map = {
        "10th": "Dropout",
        "11th": "Dropout",
        "12th": "Dropout",
        "1st-4th": "Dropout",
        "5th-6th": "Dropout",
        "7th-8th": "Dropout",
        "9th": "Dropout",
        "Preschool": "Dropout",
        "HS-grad": "High School grad",
        "Some-college": "High School grad",
        "Masters": "Masters",
        "Prof-school": "Prof-School",
        "Assoc-acdm": "Associates",
        "Assoc-voc": "Associates",
        }
    occupation_map = {
        "Adm-clerical": "Admin",
        "Armed-Forces": "Military",
        "Craft-repair": "Blue-Collar",
        "Exec-managerial": "White-Collar",
        "Farming-fishing": "Blue-Collar",
        "Handlers-cleaners": "Blue-Collar",
        "Machine-op-inspct": "Blue-Collar",
        "Other-service": "Service",
        "Priv-house-serv": "Service",
        "Prof-specialty": "Professional",
        "Protective-serv": "Other",
        "Sales": "Sales",
        "Tech-support": "Other",
        "Transport-moving": "Blue-Collar",
    }
    country_map = {
        "Cambodia": "SE-Asia",
        "Canada": "British-Commonwealth",
        "China": "China",
        "Columbia": "South-America",
        "Cuba": "Other",
        "Dominican-Republic": "Latin-America",
        "Ecuador": "South-America",
        "El-Salvador": "South-America",
        "England": "British-Commonwealth",
        "Guatemala": "Latin-America",
        "Haiti": "Latin-America",
        "Honduras": "Latin-America",
        "Hong": "China",
        "India": "British-Commonwealth",
        "Ireland": "British-Commonwealth",
        "Jamaica": "Latin-America",
        "Laos": "SE-Asia",
        "Mexico": "Latin-America",
        "Nicaragua": "Latin-America",
        "Outlying-US(Guam-USVI-etc)": "Latin-America",
        "Peru": "South-America",
        "Philippines": "SE-Asia",
        "Puerto-Rico": "Latin-America",
        "Scotland": "British-Commonwealth",
        "Taiwan": "China",
        "Thailand": "SE-Asia",
        "Trinadad&Tobago": "Latin-America",
        "United-States": "United-States",
        "Vietnam": "SE-Asia",
    }
    married_map = {
        "Never-married": "Never-Married",
        "Married-AF-spouse": "Married",
        "Married-civ-spouse": "Married",
        "Married-spouse-absent": "Separated",
        "Divorced": "Separated",
    }
    mapping = {
        "education": education_map,
        "occupation": occupation_map,
        "native_country": country_map,
        "marital_status": married_map,
    }
    cleaned_x = x.replace(mapping)
    return cleaned_x
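
`DataFrame.replace` with a nested dictionary applies each inner mapping only to the column named by its outer key. Values that have no entry in the mapping, including missing values, pass through unchanged. The toy frame below is an assumption made for illustration; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy_mapping = {"education": {"11th": "Dropout", "HS-grad": "High School grad"}}
toy = pd.DataFrame(
    {
        "education": ["11th", "Bachelors", np.nan],
        # "11th" here is deliberately placed in another column.
        "occupation": ["Sales", "11th", "Sales"],
    }
)

toy_cleaned = toy.replace(toy_mapping)
print(toy_cleaned)
```

This is why values such as `Bachelors` and `Doctorate`, which are absent from `education_map`, survive the cleaning untouched.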

## Create a preprocessing function

The data need to be pre-processed before being passed as input to the model. You can use your own custom code for the pre-processing; just remember to wrap all of it in a single function.

Here, we create a pre-processing pipeline for X using scikit-learn's ColumnTransformer and Pipeline, then we fit it on the training X. The resulting `x_processing` is already a single object, so it is ready to be passed to the wrapper.

In [12]:
ordinal_features = X_train.select_dtypes(include="number").columns
categorical_features = X_train.select_dtypes(include="object").columns
ordinal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
x_processing = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ordinal_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [13]:
x_processing.fit(X_train)

ColumnTransformer(transformers=[('ord',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 Index(['age', 'capital_gain', 'capital_loss', 'hours_per_week'], dtype='object')),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 Index(['work_class', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country'],
      dtype='object'))])
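
The `handle_unknown="ignore"` option decides what happens when the encoder meets a category at transform time that it did not see during `fit`. A minimal sketch with toy categories: the unseen value is encoded as a row of zeros instead of raising an error. This matters whenever the categories present at transform time differ from those present at fit time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Masters"], ["Bachelors"]], dtype=object))

# Categories are stored in sorted order: ['Bachelors', 'Masters'].
known_row = encoder.transform(np.array([["Masters"]], dtype=object)).toarray()
unseen_row = encoder.transform(np.array([["Doctorate"]], dtype=object)).toarray()
print(known_row, unseen_row)
```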

As usual, we encode the y labels with a simple LabelEncoder.

In [14]:
y_processing = LabelEncoder()

In [15]:
y_processing.fit(y_train)

LabelEncoder()
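
For reference, `LabelEncoder` sorts the distinct labels and assigns them consecutive integers starting at 0. The label strings below are an assumption about how the `income` column is written; check `y_processing.classes_` for the real values.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["<=50K", ">50K", "<=50K"]  # assumed spelling of the income labels
toy_encoder = LabelEncoder().fit(toy_labels)

toy_classes = list(toy_encoder.classes_)
toy_encoded = toy_encoder.transform(toy_labels)
print(toy_classes, toy_encoded)
```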

## Create and train the model

We build a simple Keras network setting up some basic parameters:

In [16]:
def keras_model(input_shape):
    keras_clf = Sequential()
    keras_clf.add(Dense(27, input_dim=input_shape, activation="relu"))
    keras_clf.add(Dense(14, activation="relu"))
    keras_clf.add(Dense(7, activation="relu"))
    keras_clf.add(Dense(1, activation="sigmoid"))

    keras_clf.compile(
        optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
    )
    return keras_clf
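
As a sanity check on the architecture, the number of trainable parameters of a stack of `Dense` layers can be worked out by hand: each layer has `inputs * units` weights plus `units` biases. The sketch assumes an input width of 103, which is the shape of the processed data reported later in this tutorial; the total should match what `model.summary()` reports.

```python
# Input width followed by the units of each Dense layer in keras_model.
layer_sizes = [103, 27, 14, 7, 1]

# Dense layer parameters = inputs * units (weights) + units (biases).
params_per_layer = [
    n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```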

Next, we clean and pre-process the training data:

In [17]:
X_train_cleaned = cleaning(X_train)

In [18]:
X_train_processed = x_processing.transform(X_train_cleaned)

Then, we encode the y training data:

In [19]:
y_train_processed = y_processing.transform(y_train)

Finally, we fit the model on the processed data:

In [20]:
model = keras_model(X_train_processed.shape[1])
model.fit(X_train_processed, y_train_processed, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7ff87801ba30>

## Wrap and Save the Model

Finally, we use Clearbox Wrapper to wrap and save the model and the preprocessor as a zipped folder at a specified path.

The model dependency (TensorFlow) and its version are detected automatically by CBW and added to the requirements saved in the resulting folder. However (**IMPORTANT**), you need to pass any additional dependencies required for data cleaning and pre-processing as a list. In this case, we need to add Pandas and Scikit-Learn.

In [23]:
wrapped_model_path = 'adult_wrapped_model_preparation_preprocessing_v0.2'

In [22]:
processing_dependencies = ["pandas==1.2.0", "scikit-learn==0.23.2"]
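
Instead of typing the versions by hand, the pinned strings could be built from the packages installed in the current environment. This is a sketch using the standard library's `importlib.metadata`; the helper name `pin` is hypothetical and is not part of Clearbox Wrapper.

```python
from importlib.metadata import version


def pin(package_names):
    """Return 'name==installed_version' strings for the given distributions."""
    return [f"{name}=={version(name)}" for name in package_names]


pinned_dependencies = pin(["pandas", "scikit-learn"])
print(pinned_dependencies)
```

Pinning the versions actually installed helps keep the saved requirements consistent with the environment in which the cleaning and pre-processing code was tested.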

In [24]:
cbw.save_model(
    wrapped_model_path,
    model,
    preprocessing=x_processing,
    data_preparation=cleaning,
    additional_deps=processing_dependencies,
)

INFO:tensorflow:Assets written to: adult_wrapped_model_preparation_preprocessing_v0.2/data/model/assets


## Unzip and load the model

The following cells are not necessary for final users: the zip file should be uploaded to our SaaS as is. Here, however, we want to show how to load a saved model and compare it with the original one. Lines similar to these are present in the backend of the Clearbox AI SaaS.

In [25]:
import zipfile

In [26]:
zipped_model_path = 'adult_wrapped_model_preparation_preprocessing_v0.2.zip'
unzipped_model_path = 'adult_wrapped_model_preparation_preprocessing_v0.2_unzipped'

In [27]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)
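
The zip and unzip round trip can be tried in isolation with the standard library alone. The sketch below archives a toy folder inside a temporary directory and extracts it again. The folder and file names are made up, and nothing here is meant to describe how Clearbox Wrapper builds its archive.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source_dir = tmp_path / "toy_model"
    source_dir.mkdir()
    (source_dir / "requirements.txt").write_text("pandas\n")

    # make_archive appends ".zip" to the base name it is given.
    archive_path = shutil.make_archive(
        str(source_dir), "zip", root_dir=str(source_dir)
    )
    archive_name = Path(archive_path).name

    # Extract into a separate folder, as done for the wrapped model above.
    target_dir = tmp_path / "toy_model_unzipped"
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)

    extracted_files = sorted(p.name for p in target_dir.iterdir())

print(archive_name, extracted_files)
```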

In [28]:
loaded_model = cbw.load_model(unzipped_model_path)

  ColumnTransformer(transformers=[('ord',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 Index(['age', 'capital_gain', 'capital_loss', 'hours_per_week'], dtype='object')),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 Index(['work_class', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country'],
      dtype='object'))])


With the original model, the input data (X_test) must go through both the data cleaning function and the pre-processing function before reaching the model.

In [29]:
X_test_cleaned = cleaning(X_test)
X_test_processed = x_processing.transform(X_test_cleaned)
original_model_predictions = model.predict(X_test_processed)

With the wrapped model, **both the data cleaning and the pre-processing are part of the predict pipeline**, so we can pass the raw input data directly to the model's predict function:

In [30]:
loaded_model_predictions = loaded_model.predict(X_test)
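
Because the last layer of the network is a single sigmoid unit, the predictions are probabilities in an array of shape `(n, 1)`. A possible way to turn them back into income labels is sketched below. The 0.5 threshold, the toy probabilities, and the label strings are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy values shaped like the output of a single sigmoid unit.
toy_probabilities = np.array([[0.12], [0.87], [0.50]])

# Flatten to 1-D and apply an assumed decision threshold of 0.5.
toy_class_ids = (toy_probabilities.ravel() > 0.5).astype(int)

# Map the integer ids back to the original label strings (assumed spelling).
toy_label_encoder = LabelEncoder().fit(["<=50K", ">50K"])
toy_predicted_labels = toy_label_encoder.inverse_transform(toy_class_ids)
print(toy_class_ids, toy_predicted_labels)
```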

In [31]:
prepared_data = loaded_model.prepare_data(X_test)

In [32]:
processed_data = loaded_model.preprocess_data(prepared_data)

In [33]:
X_test.shape

(16281, 12)

In [34]:
X_test['education'].unique()

array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',
       'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',
       '5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'],
      dtype=object)

In [35]:
prepared_data.shape

(16281, 12)

In [36]:
prepared_data['education'].unique()

array(['Dropout', 'High School grad', 'Associates', 'Prof-School',
       'Bachelors', 'Masters', 'Doctorate'], dtype=object)

In [37]:
processed_data.shape

(16281, 103)
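
The shapes above show that cleaning keeps the 12 columns, while pre-processing widens the data: each numeric column stays one column, and each categorical column becomes one column per category learned during `fit`. A small sketch on a toy frame (column values invented for illustration) reproduces this bookkeeping.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "sex": ["Male", "Female", "Male"],
        "education": ["Dropout", "Masters", "Bachelors"],
    }
)

toy_transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "education"]),
    ]
)
toy_processed = toy_transformer.fit_transform(toy)

# One column for "age" plus one column per distinct category.
expected_width = 1 + toy["sex"].nunique() + toy["education"].nunique()
print(toy_processed.shape, expected_width)
```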

In [38]:
pred_no_preparation = loaded_model.predict(prepared_data, prepare_data=False)



We check that the predictions made with the original model and the wrapped one are equal:

In [39]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)
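
`assert_array_equal` requires the two arrays to match bit for bit. If the predictions were ever computed on different hardware or with different library builds, tiny floating-point differences could make that check fail, and a tolerance-based comparison might be preferable. The sketch below uses made-up values, and the tolerances are arbitrary choices.

```python
import numpy as np

reference = np.array([[0.25], [0.75]], dtype=np.float32)
# Simulate a tiny numerical drift in the reloaded predictions.
reloaded = reference + np.float32(1e-7)

# Passes: the difference is within the chosen tolerances.
np.testing.assert_allclose(reference, reloaded, rtol=1e-5, atol=1e-6)

# The strict comparison rejects the same pair.
try:
    np.testing.assert_array_equal(reference, reloaded)
    exactly_equal = True
except AssertionError:
    exactly_equal = False
print(exactly_equal)
```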

## Remove all generated files and directories

In [None]:
import os
import shutil

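In [None]:
# Remove the zip archive created by cbw.save_model
if os.path.exists(zipped_model_path):
    os.remove(zipped_model_path)

In [None]:
# Remove the directory created when unzipping the archive
if os.path.exists(unzipped_model_path):
    shutil.rmtree(unzipped_model_path)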