# Introduction and Reference

This notebook is based on http://onnx.ai/sklearn-onnx/auto_examples/plot_complex_pipeline.html with some code changes. 

For more examples of using ONNX and ONNX Runtime with classical scikit-learn classifiers, check out:

http://onnx.ai/sklearn-onnx/auto_examples/index.html

Workflow: 

1. Train a sklearn classifier with Pipeline
2. Convert into ONNX format
3. Use ONNX Runtime to do inference 

## Import libraries
You can skip installing the following packages if you're using a container where all libraries are pre-installed. If not, uncomment the lines in the cell below and run it to install the packages.
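
If you are unsure whether the packages are already present, a quick check along the following lines may help before uncommenting anything. This is only a sketch: it assumes the distribution names match those in the install cell, and `installed_versions` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the install cell of this notebook.
PACKAGES = ["scikit-learn", "skl2onnx", "pandas", "onnxruntime"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(PACKAGES)
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any package reported as "not installed" is a candidate for the corresponding `pip install` line.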

In [None]:
# !pip install scikit-learn
# !pip install skl2onnx
# !pip install pandas
# !pip install --upgrade onnxruntime==1.9.0

In [None]:
import os
import time
import pprint
import pandas as pd
import numpy as np
from numpy.testing import assert_almost_equal
import onnxruntime as rt
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## Load Titanic data

In [None]:
# data source: https://www.kaggle.com/c/titanic/data
data = pd.read_csv("datasets/titanic.csv")
data.head()

In [None]:
data.columns

In [None]:
X = data.drop('survived', axis=1)
y = data['survived']
print(data.dtypes)

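# SimpleImputer on strings is not available in the
# ONNX-ML specifications, so missing categorical
# values are filled in beforehand.

for cat in ['embarked', 'sex', 'pclass']:
    # Assign the result back: calling fillna(inplace=True) on a
    # selected column may not update X in recent pandas versions.
    X[cat] = X[cat].fillna('missing')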

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    # --- SimpleImputer is not available for strings in ONNX-ML specifications.
    # ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])


clf.fit(X_train, y_train)

In [None]:
import pickle

# Use a context manager so the file is closed after writing.
with open("models/pipeline_titanic.pkl", "wb") as f:
    pickle.dump(clf, f)
print("saved")

## Define the inputs of the ONNX graph

*sklearn-onnx* does not know which features were used to train the model,
so it must be told the name and type of each input.
We simply reuse the dataframe column definitions.

In [None]:
print(X_train.dtypes)

In [None]:
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType, StringTensorType
from skl2onnx.common.data_types import Int64TensorType

# Map each dataframe column to a named ONNX input
def convert_dataframe_schema(df, drop=None):
    inputs = []
    for k, v in zip(df.columns, df.dtypes):
        if drop is not None and k in drop:
            continue
        if v == 'int64':
            t = Int64TensorType(shape=[None, 1])
        elif v == 'float64':
            # float64 columns are declared as float32 tensors
            t = FloatTensorType(shape=[None, 1])
        else:
            t = StringTensorType(shape=[None, 1])
        # shape [None, 1]: any number of rows, one column per input
        inputs.append((k, t))
    return inputs

initial_inputs = convert_dataframe_schema(X_train)

pprint.pprint(initial_inputs)

In [None]:
# Drop unused inputs
to_drop = {'parch', 'sibsp', 'cabin', 'ticket',
           'name', 'body', 'home.dest', 'boat'}
initial_inputs = convert_dataframe_schema(X_train, to_drop)
pprint.pprint(initial_inputs)

## Convert the pipeline into ONNX



The `convert_sklearn` function produces an ONNX model equivalent to the given scikit-learn model.
API reference: http://onnx.ai/sklearn-onnx/_modules/skl2onnx/convert.html

In [None]:
from skl2onnx import convert_sklearn
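try:
    model_onnx = convert_sklearn(model=clf, name='pipeline_titanic', initial_types=initial_inputs,
                                 target_opset=12, verbose=2)
except Exception as e:
    # Report the failure, then re-raise so that later cells
    # do not run against an undefined model_onnx.
    print(e)
    raise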

In [None]:
# And save.
with open("models/pipeline_titanic.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

## Compare the predictions

As a final step, we need to check that the converted model
produces the same predictions, labels, and probabilities.
Let's start with *scikit-learn*.



In [None]:
print("predict", clf.predict(X_test[:5]))
print("predict_proba", clf.predict_proba(X_test[:2]))

Predictions with *onnxruntime*.
We need to remove the dropped columns and cast the numeric
columns from double (float64) to float (float32), because the
graph inputs were declared as float tensors.
*onnxruntime* does not accept a *dataframe*;
inputs must be given as a dictionary that maps each input name to an array.
One last detail: every column was declared not as a vector
but as a one-column matrix, which explains the last line
with the *reshape*.
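
To make the cast and the reshape concrete, here is a minimal sketch on toy data. The values are made up, and `to_onnx_feed` and `numeric` are hypothetical names standing in for the loop and for `numeric_features` used in the next cell.

```python
import numpy as np

# Toy stand-ins for two columns of the test set (values are made up).
columns = {
    "age": np.array([22.0, 38.0, 26.0]),  # float64, as pandas stores it
    "sex": np.array(["male", "female", "female"], dtype=object),
}
numeric = ["age"]  # hypothetical: plays the role of numeric_features


def to_onnx_feed(cols, numeric_names):
    """Cast numeric columns to float32 and reshape every column to (n, 1)."""
    feed = {}
    for name, values in cols.items():
        if name in numeric_names:
            values = values.astype(np.float32)
        feed[name] = values.reshape((-1, 1))
    return feed


feed = to_onnx_feed(columns, numeric)
for name, arr in feed.items():
    print(name, arr.dtype, arr.shape)
```

Each entry of `feed` is a two-dimensional array with one column, matching the `[None, 1]` shape declared for the graph inputs.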



In [None]:
X_test2 = X_test.drop(to_drop, axis=1)
inputs = {c: X_test2[c].values for c in X_test2.columns}
for c in numeric_features:
    inputs[c] = inputs[c].astype(np.float32)
for k in inputs:
    inputs[k] = inputs[k].reshape((inputs[k].shape[0], 1))

We are ready to run *onnxruntime*.



In [None]:
sess = rt.InferenceSession("models/pipeline_titanic.onnx")
pred_onx = sess.run(None, inputs)
print("predict", pred_onx[0][:5])
print("predict_proba", pred_onx[1][:2])

With the default conversion, *onnxruntime* returns the probabilities as a list of dictionaries.
Let's switch to an array, which requires converting the model again with
the additional option `zipmap` set to `False`.
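
The difference between the two layouts can be illustrated with a small pure-Python sketch. The probabilities below are made up, and `dicts_to_rows` is a hypothetical helper; the notebook itself takes the other route and reconverts the model so that no post-processing is needed.

```python
# Made-up probabilities laid out like the default (zipmap) output:
# one dictionary per row, keyed by class label.
zipmap_output = [
    {0: 0.9, 1: 0.1},
    {0: 0.25, 1: 0.75},
]


def dicts_to_rows(prob_dicts):
    """Turn a list of {label: probability} dicts into (labels, rows)."""
    labels = sorted(prob_dicts[0])
    rows = [[d[label] for label in labels] for d in prob_dicts]
    return labels, rows


labels, rows = dicts_to_rows(zipmap_output)
print(labels)  # column order of the rows below
print(rows)    # one row of probabilities per sample
```

The row-per-sample layout is what the array output looks like, with the column order given by the class labels.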



In [None]:
model_onnx = convert_sklearn(clf, 'pipeline_titanic', initial_inputs,
                             target_opset=12,
                             options={id(clf): {'zipmap': False}})

with open("models/pipeline_titanic_nozipmap.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

In [None]:
sess = rt.InferenceSession("models/pipeline_titanic_nozipmap.onnx")
pred_onx = sess.run(None, inputs)
print("predict", pred_onx[0][:5])
print("predict_proba", pred_onx[1][:2])

Let's check that the probabilities from *scikit-learn* and *onnxruntime* are the same.
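
Because the ONNX inputs are float32 while scikit-learn computes in float64, the two sets of probabilities may differ in the last digits. The sketch below shows one way to compare arrays with an explicit tolerance. It uses made-up values, and the `atol=1e-6` threshold is an assumption rather than a recommendation from the source.

```python
import numpy as np
from numpy.testing import assert_allclose

# Made-up probabilities: a float64 reference and its float32 round trip.
reference = np.array([[0.8123456789, 0.1876543211],
                      [0.3333333333, 0.6666666667]])
single = reference.astype(np.float32)

# Largest absolute difference introduced by the float32 cast.
max_abs_diff = float(np.max(np.abs(reference - single)))
print(max_abs_diff)

# Passes as long as every element agrees to within the chosen tolerance.
assert_allclose(single, reference, rtol=0, atol=1e-6)
```

If the check in the next cell fails only by a tiny margin, a comparison of this kind can help tell float32 rounding apart from a real conversion problem.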



In [None]:
assert_almost_equal(clf.predict_proba(X_test), pred_onx[1])

In [None]:
# compare size of models:
print('Pickle model size (MB):', os.path.getsize("models/pipeline_titanic.pkl")/(1024*1024))
print('ONNX model size with zipmap (MB):', os.path.getsize("models/pipeline_titanic.onnx")/(1024*1024))
print('ONNX model size without zipmap (MB):', os.path.getsize("models/pipeline_titanic_nozipmap.onnx")/(1024*1024))

## Display the ONNX graph

To see the graph converted with *sklearn-onnx*, open one of the saved `.onnx` files in Netron:
https://netron.app/

## Check ONNX model format

In [None]:
import onnx

# Preprocessing: load the ONNX model
model_path = 'models/pipeline_titanic_nozipmap.onnx'
onnx_model = onnx.load(model_path)

# Check the model
try:
    onnx.checker.check_model(onnx_model)
except onnx.checker.ValidationError as e:
    print('The model is invalid: %s' % e)
else:
    print('The model is valid!')