# Wrapping an sklearn model with Catwalk

In this tutorial, we will train and save a simple sklearn model then wrap it with Catwalk.

This notebook creates a LinearRegression model and saves it with pickle, based on the sklearn [Linear Regression Example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py).

## 0) Install dependencies

As well as catwalk, we need the following dependencies installed:

In [None]:
!pip install scikit-learn matplotlib

## 1) Load a dataset

Here we're using the sklearn diabetes dataset. This tutorial uses only a single feature of the dataset (column index 2, the body mass index) so that the regression can be illustrated with a two-dimensional plot.

In [None]:
import numpy as np
from sklearn import datasets

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

print("Number of training examples:", len(diabetes_X_train))
print("Number of testing examples:", len(diabetes_X_test))

## 2) Train a model

In [None]:
from sklearn import linear_model

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

## 3) Evaluate the model

Here the coefficients, the mean squared error and the coefficient of determination are calculated and displayed.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))
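
To make the two metrics above concrete, here is a minimal sketch that computes them by hand in plain Python. The function names `mse_by_hand` and `r2_by_hand` and the toy values are illustrative assumptions, not part of sklearn or Catwalk; in the notebook itself the sklearn functions remain the ones to use.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not taken from the diabetes dataset)
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.0, 7.5, 9.0]

toy_mse = mse_by_hand(toy_true, toy_pred)
toy_r2 = r2_by_hand(toy_true, toy_pred)
print("Mean squared error: %.3f" % toy_mse)            # 0.125
print("Coefficient of determination: %.3f" % toy_r2)   # 0.975
```

A coefficient of determination close to 1 means the predictions explain most of the variance in the targets; a value near 0 means the model does little better than always predicting the mean.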

## 4) Visualise the result

The plot shows the fitted straight line. Linear regression chooses the line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
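
For a single feature, the line that minimizes the residual sum of squares has a simple closed form. The sketch below is an illustration in plain Python, with the hypothetical helper `fit_line` and toy data; it is not how sklearn implements `LinearRegression`, although for one feature with an intercept the two should agree up to floating-point error.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy points that lie exactly on y = 2x + 1
toy_xs = [0.0, 1.0, 2.0, 3.0]
toy_ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(toy_xs, toy_ys)
print("slope:", slope, "intercept:", intercept)  # slope: 2.0 intercept: 1.0
```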

## 5) Save a model artifact

Next we can save our trained model. Here we've opted to simply pickle the model. The catwalk-wrapped model will load and run this pickle file.

Catwalk can test the model against some test cases. This is useful in a CI/CD pipeline where we need to make sure that models are not automatically wrapped with incorrect behaviour. So we will also save the test data and the model predictions along with the model in the same pickle file.

In [None]:
import pickle

with open("model.pkl", "wb") as fp:
    pickle.dump({
        "model": regr,
        "X_test": diabetes_X_test,
        # The model's own predictions, used as the expected outputs in tests
        "y_test": diabetes_y_pred,
    }, fp)
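
Before wrapping the artifact, it can be worth checking that the pickle file loads and has the expected layout. The following is a minimal sketch under stated assumptions: `check_artifact` and `REQUIRED_KEYS` are hypothetical names, and the demo writes a stand-in artifact (plain lists and `None` in place of the trained model) to a temporary directory so that it runs on its own.

```python
import os
import pickle
import tempfile

# Keys the tutorial stores in model.pkl
REQUIRED_KEYS = {"model", "X_test", "y_test"}


def check_artifact(path):
    """Load a pickled artifact and verify its basic layout."""
    with open(path, "rb") as fp:
        artifact = pickle.load(fp)
    missing = REQUIRED_KEYS - set(artifact)
    if missing:
        raise KeyError("artifact is missing keys: %s" % sorted(missing))
    if len(artifact["X_test"]) != len(artifact["y_test"]):
        raise ValueError("X_test and y_test have different lengths")
    return artifact


# Demo with a stand-in artifact; in the notebook you would call
# check_artifact("model.pkl") instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model.pkl")
    with open(demo_path, "wb") as fp:
        pickle.dump(
            {"model": None, "X_test": [[0.1], [0.2]], "y_test": [1.0, 2.0]}, fp
        )
    demo = check_artifact(demo_path)

print(sorted(demo))  # ['X_test', 'model', 'y_test']
```

Note that unpickling executes code from the file, so only load pickle files that come from a source you trust.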

## 6) Create a catwalk Model

Catwalk requires a `model.py`, implementing a single class called `Model`, that follows this interface:

```python
class Model(object):
    """The Model knows how to load itself, provides test data and runs with `Model::predict`.
    """

    def __init__(self, path="."):
        """The Model constructor.

        Use this to initialise your model, including loading any weights etc.

        :param str path: The full path to the folder in which the model is located.
        """
        pass

    def load_test_data(self, path=".") -> (list, list):
        """Loads and returns test data.

        Format of the returned data is similar to pd.DataFrame.records, a list of key-value pairs.

        :param str path: The full path to the folder in which the model is located.
        :return: Tuple of feature, target lists.
        """
        pass

    def predict(self, X) -> dict:
        """Uses the model to predict a value.

        :param dict X: The features to predict against
        :return: The prediction result
        """
        pass
```

Let's create this below:

In [None]:
%%writefile model.py
from os.path import join

import pickle


class Model(object):
    def __init__(self, path="."):
        """The Model constructor.

        Use this to initialise your model, including loading any weights etc.

        :param str path: The full path to the folder in which the model is located.
        """
        # Unpickle the model artifact
        with open(join(path, "model.pkl"), "rb") as fp:
            model_artifact = pickle.load(fp)

        # Extract the model and test data
        self._model = model_artifact["model"]
        self._X_test = model_artifact["X_test"]
        self._y_test = model_artifact["y_test"]

    def load_test_data(self, path=".") -> (list, list):
        """Loads and returns test data.

        Format of the returned data is similar to pd.DataFrame.records, a list of key-value pairs.

        :param str path: The full path to the folder in which the model is located.
        :return: Tuple of feature, target lists.
        """
        # The test data needs to be json-serializable, so here we're using `ndarray.tolist()`
        # to convert to a plain python list
        return [{"X": self._X_test.tolist()}], [{"y": self._y_test.tolist()}]

    def predict(self, X) -> dict:
        """Uses the model to predict a value.

        :param dict X: The features to predict against
        :return: The prediction result
        """
        y = self._model.predict(X["X"])
        # Again we're using `ndarray.tolist()` to convert the model output to a plain python list
        return {"y": y.tolist()}
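
The interface above implies a simple contract: feeding each feature record from `load_test_data` into `predict` should reproduce the matching target record. Below is a minimal sketch of such a check. It is an assumption about the kind of test that is useful here, not a description of what `catwalk test-model` does internally; `StandInModel` and `check_model` are hypothetical names, and the stand-in uses a fixed line instead of the pickled sklearn model so that the example runs on its own.

```python
class StandInModel(object):
    """Hypothetical model with the same interface, backed by y = 2x + 1."""

    def __init__(self, path="."):
        self._slope, self._intercept = 2.0, 1.0

    def load_test_data(self, path="."):
        X = [[0.0], [1.0], [2.0]]
        y = [1.0, 3.0, 5.0]
        # One record of features and one record of targets
        return [{"X": X}], [{"y": y}]

    def predict(self, X):
        return {"y": [self._slope * row[0] + self._intercept for row in X["X"]]}


def check_model(model, tolerance=1e-9):
    """Return True if predict() reproduces the stored targets."""
    features, targets = model.load_test_data()
    for feature, target in zip(features, targets):
        result = model.predict(feature)
        if len(result["y"]) != len(target["y"]):
            return False
        for got, expected in zip(result["y"], target["y"]):
            if abs(got - expected) > tolerance:
                return False
    return True


print(check_model(StandInModel()))  # True
```

The same `check_model` call should also work with the `Model` class written to `model.py`, since it relies only on the three interface methods.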


## 7) Create model metadata

The metadata file is used for the model's name, version and contact information, and to validate the model inputs and outputs.

```yaml
name: "Model name (str)"
version: "Model version (str)"

contact:
  name: "Contact name (str)"
  email: "Contact email (str)"

schema:
  input: "The input schema of the model in OpenAPI format (object / array)"
  output: "The output schema of the model in OpenAPI format (object / array)"
```

The input of our model is a 2D array and the output is one-dimensional. This gives us an IO schema like the following:

In [None]:
%%writefile model.yml
name: "catwalk-sklearn-tutorial"
version: "0.1.0"

contact:
  name: "Andy Elmsley"
  email: "andy.elmsley@leapbeyond.ai"

schema:
  input:
    type: object
    properties:
        X:
            type: array
            items:
                type: array
                items:
                    type: number
  output:
    type: object
    properties:
        y:
            type: array
            items:
                type: number
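
To see which payloads this schema is meant to accept, here is a hand-rolled sketch of the same shape checks in plain Python. It is an illustration only: Catwalk's own validation may work differently, the function names are hypothetical, and the sketch is slightly stricter than the schema because it also requires the `X` and `y` keys to be present.

```python
def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches_input_schema(payload):
    """Object with key X holding an array of arrays of numbers."""
    if not isinstance(payload, dict):
        return False
    X = payload.get("X")  # assumption: treat a missing key as invalid
    if not isinstance(X, list):
        return False
    return all(
        isinstance(row, list) and all(is_number(v) for v in row) for row in X
    )


def matches_output_schema(payload):
    """Object with key y holding an array of numbers."""
    if not isinstance(payload, dict):
        return False
    y = payload.get("y")  # assumption: treat a missing key as invalid
    return isinstance(y, list) and all(is_number(v) for v in y)


print(matches_input_schema({"X": [[0.07786339]]}))  # True: 2D array
print(matches_input_schema({"X": [0.07786339]}))    # False: only 1D
print(matches_output_schema({"y": [150.0]}))        # True: 1D array
```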

## 8) Set the requirements

This model will be shipped around and run in different environments. The requirements.txt file lets us ensure that all dependencies are met in each of these environments.

In [None]:
%%writefile requirements.txt
scikit-learn


## 9) Test the model with Catwalk

Catwalk comes with several tests to make sure you've implemented the model in the way that it expects.

In [None]:
!catwalk test-model

In [None]:
!catwalk test-server

## 10) Serve your model

When the two tests above pass, your model is ready to be served by catwalk!

In a separate terminal, execute the following:

```bash
$ catwalk serve --debug
```

This will start a debug catwalk server.

Once that's ready, try sending some requests...

This first request returns the model metadata:

In [None]:
!curl http://localhost:9090/info | python -m json.tool

This second request asks the model for a prediction:

In [None]:
!curl -H "Content-Type: application/json" \
    -d '{"correlation_id": "foo", "request": {"X": [[0.07786339]]}}' \
    http://localhost:9090/predict | python -m json.tool
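
The same predict request can be sent from Python. The sketch below uses only the standard library and mirrors the URL, header and body of the curl command above; the helper name `build_predict_request` is hypothetical. It builds the request without sending it, so it runs even when no server is up, and the commented lines show how it could be sent once `catwalk serve` is running.

```python
import json
import urllib.request


def build_predict_request(features, correlation_id="foo",
                          base_url="http://localhost:9090"):
    """Build a POST request equivalent to the curl predict call."""
    body = {"correlation_id": correlation_id, "request": {"X": features}}
    return urllib.request.Request(
        base_url + "/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request([[0.07786339]])
print(request.get_method(), request.full_url)

# With the server running in another terminal:
# with urllib.request.urlopen(request) as response:
#     print(json.dumps(json.load(response), indent=2))
```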