# Model Deployment with BentoML and Kubernetes

Developing a model is often the easy part of machine learning in production; the harder question is how to deploy it once it is built. There are many issues to deal with, including package management, API performance and security, and basic model versioning. You could handle all of these manually, but that quickly becomes a cumbersome engineering task. Whether you are a Data Scientist or an ML Engineer, you want to get your model production ready as quickly and easily as possible, with little DevOps friction. In this tutorial, we will build a transaction fraud detection API using the [BentoML](https://bentoml.com) framework.

We chose BentoML because it is easy to use and has many built-in features that make it a great choice for building production-ready ML APIs, including a performant WSGI HTTP server powered by [gunicorn](https://gunicorn.org), built-in pip package management and quick containerization. Its output is simply a Docker container that can be deployed anywhere you want, from serverless platforms to Kubernetes to a simple VM. We demonstrate this on Kubernetes because it gives you a lot of flexibility in resource usage as well as auto-scaling, but you can use any container runtime you want. Specifically, we will use [minikube](https://github.com/kubernetes/minikube), as it is free, easy to use, and runs locally on your machine with a minimal install. Since all Kubernetes clusters are managed with the same utility, the concepts you learn here apply to any of them.

By the end of this tutorial you will be able to:
- Set up a Kubernetes cluster with minikube
- Create Bento services with BentoML and containerize them
- Deploy a Bento service to Kubernetes

### Prerequisites:
- Install Docker
- Install Python 3.8+
- Install JupyterLab
  
You should download the data required for this tutorial from [here](https://drive.google.com/file/d/1MidRYkLdAV-i0qytvsflIcKitK4atiAd/view?usp=sharing). This is originally from a [Kaggle dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data) for Fraud Detection. Place this dataset in a `data` directory in the root of your project. You can run this notebook either in VS Code or Jupyter Notebooks.

## Build a model

First, we need a model to deploy. Let's build a quick model to detect fraudulent transactions. We will need a number of libraries, which we install below. Since the focus of this tutorial is deployment, don't worry about the feature selection or model training specifics. There are only two objects to be concerned with in the training code block:

- **enc**: The encoder object we use to preprocess the data with one-hot encoding.
- **model**: The model object we are going to deploy.

If you wish, create a virtual environment with conda or venv.
```bash
pip install scikit-learn==1.0.2 pandas==1.4.3 numpy==1.23.2 xgboost==1.5.1 bentoml==1.0.0
```
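
Because the same versions are pinned again later when the service is packaged, it can help to confirm that the pins actually took effect in your environment. The following is a minimal sketch using only the standard library; the helper name `check_pins` is our own, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata

# The versions pinned in the pip command above.
PINS = {
    "scikit-learn": "1.0.2",
    "pandas": "1.4.3",
    "numpy": "1.23.2",
    "xgboost": "1.5.1",
    "bentoml": "1.0.0",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for every pin that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report every package whose installed version differs from its pin.
for name, (expected, installed) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pinned package is installed at the expected version.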

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

# Load the data, sample such that the target classes are equal size
df = pd.read_csv("data/train_transaction.csv")
df = pd.concat(
    [df[df.isFraud == 0].sample(n=len(df[df.isFraud == 1])), df[df.isFraud == 1]],
    axis=0,
)

# Select the features and target
X = df[["ProductCD", "P_emaildomain", "R_emaildomain", "card4", "M1", "M2", "M3"]]
y = df.isFraud

# Use one-hot encoding to encode the categorical features
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)

X = pd.DataFrame(
    enc.transform(X).toarray(), columns=enc.get_feature_names_out().reshape(-1)
)
X["TransactionAmt"] = df[["TransactionAmt"]].to_numpy()

# Split the dataset and train the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    nthread=4,
    scale_pos_weight=1,
    seed=27,
)
model = xgb.fit(X_train, y_train)
```



## Set up your Kubernetes Cluster

So, you have built a model, and you want to deploy it so it's actually useful. How do you do that? We are going to use Kubernetes, a system for autoscaling and managing container-based services. As we said in the intro, we are using [minikube](https://github.com/kubernetes/minikube) to create a local Kubernetes instance. You can, however, use any local or cloud runtime you'd like, though you may need to go through additional setup. Some other options are:
### Local + VM
- [kind](https://kind.sigs.k8s.io)
- [K3s](https://k3s.io)
- [MicroK8s](https://microk8s.io)
- [kubeadm](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/)
### Cloud
- [Managed Kubernetes on DigitalOcean](https://www.digitalocean.com/products/kubernetes)
- [EKS on AWS](https://aws.amazon.com/eks/)
- [GKE on GCP](https://cloud.google.com/kubernetes-engine)
- [AKS on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/)

Don't be too scared of using minikube as opposed to a cloud instance. The Kubernetes commands you'll need to run to deploy our ML service are the same on both, as they are all issued through `kubectl`, so you can apply what you learn here on a managed K8s platform pretty easily.

First we need to install the aforementioned `kubectl` utility. This will enable us to interact with our minikube cluster (or any cluster).
```bash
brew install kubectl
```

Let's set up our local minikube instance. Install minikube for your specific platform. We have provided the command for **brew**, since macOS is obviously the best platform:
```bash
brew install minikube
```


Before proceeding, it goes without saying that you'll need the **docker daemon** to be running so make sure that is the case! Minikube is now installed, but it is not running currently. To start up the cluster just run:
```bash
minikube start
```

You may have other clusters you interact with, so you'll need to switch *contexts*. You can think of a context as an abstraction of a cluster. Ensure you're in your local minikube context.
```bash
kubectl config use-context minikube
```

Verify your context by retrieving the cluster info.
```bash
kubectl cluster-info
```
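
If you would rather check the context from a script or notebook cell, the same information is available through `kubectl config current-context`. The sketch below is one way to wrap it with the standard library; the helper name `current_context` is our own, and it simply returns `None` when `kubectl` is missing or no context is selected.

```python
import shutil
import subprocess


def current_context():
    """Return the active kubectl context name, or None if it cannot be determined."""
    if shutil.which("kubectl") is None:
        return None  # kubectl is not installed or not on PATH
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # e.g. no kubeconfig or no context selected
    return result.stdout.strip()


context = current_context()
if context == "minikube":
    print("kubectl is pointing at minikube")
else:
    print(f"Unexpected context: {context!r}")
```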

Congratulations! You have now set up your minikube cluster. Now that we have the infrastructure, we need to create an API service for our fraud detection model.

## Create a Bento Service

While there are a number of tools that ease the stress of deploying a model, one of the more straightforward ways is to use the open source [BentoML](https://bentoml.com) framework. BentoML is a framework for deploying machine learning models that pre-packages the model for you into a callable REST API containerized service. We chose BentoML as it is easy to use, will install the packages we need and can deploy to multiple different cloud services and infrastructures. If you'd like, check out some of the alternative frameworks that are available.
- [MLflow Models](https://mlflow.org/docs/latest/models.html) - Not as flexible as BentoML, more complex and less performant, but does have a lot of features like experiment tracking.
- [TensorFlow Serving](https://github.com/tensorflow/serving) - Only supports TensorFlow, and does not use the Python runtime.
- [TorchServe](https://pytorch.org/serve/) - Only supports PyTorch.

Install the latest stable BentoML release (1.X.X) with the following command. Note that if you use conda, the package is only available in the conda-forge channel, which may not be up to date.

```bash
pip install bentoml
```

Note: you may need to restart the kernel to use updated packages.


Now, we want to create our Bento service using the **model** and **enc** objects we created before. When you *save* a model, Bento will store it locally and version it. Import `bentoml` in your notebook and use the appropriate `save_model` function to save the model to the local **model** store. You may notice we used the **sklearn** `save_model` for the XGBoost model. This is because we created the model through XGBoost's scikit-learn API (`XGBClassifier`).

There are a number of different *optional* arguments you can include when saving a model. We included a number of extra tags to demonstrate how this works.

- **labels**: user-defined labels for managing models (e.g. team=nlp, stage=dev).
- **metadata**: user-defined metadata for storing model training context information or model evaluation metrics (e.g. dataset version, training parameters, confusion matrix, etc).
- **custom_objects**: user-defined additional python objects (e.g. a tokenizer instance, preprocessor functions, etc). Custom objects will be serialized with cloudpickle.
- **signatures**: model signatures for inference (e.g. input/output shapes, whether inference is batched, etc). For more information, see the [BentoML documentation](https://bentoml.com/docs/bento-ml/api/model-signatures/) on signatures.

```python
import bentoml

saved_model = bentoml.sklearn.save_model(
    "fraud_classifier",
    model,
    labels={"owner": "Cerebrium", "stage": "prod"},
    metadata={"version": "1.0.0"},
    custom_objects={"ohe_encoder": enc},
    signatures={
        "predict": {
            "batchable": True,
            "batch_dim": 0,
        }
    },
)
print(f"{saved_model}")
```

```
Model(tag="fraud_classifier:47sbzaqj2oeaautt")
```
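
To build some intuition for the `batchable` and `batch_dim` settings in the signature: they indicate that inputs from several requests may be stacked along dimension 0, passed through `predict` in a single call, and the outputs split back per request. The following is a conceptual sketch in plain Python, not BentoML's actual implementation; `predict` here is a made-up stand-in model with an arbitrary threshold, and `run_batched` is a hypothetical helper.

```python
def predict(rows):
    """Stand-in model (hypothetical): flags a row as fraud when its amount exceeds 400."""
    return [1 if row["TransactionAmt"] > 400 else 0 for row in rows]


def run_batched(requests, predict_fn):
    """Merge several requests along dimension 0, predict once, then split the results."""
    sizes = [len(rows) for rows in requests]
    merged = [row for rows in requests for row in rows]  # concatenate on batch_dim 0
    predictions = predict_fn(merged)  # a single model call for all requests

    # Hand each request back exactly the slice that belongs to it.
    results, start = [], 0
    for size in sizes:
        results.append(predictions[start:start + size])
        start += size
    return results


requests = [
    [{"TransactionAmt": 495.0}],
    [{"TransactionAmt": 20.0}, {"TransactionAmt": 950.0}],
]
print(run_batched(requests, predict))  # [[1], [0, 1]]
```

This only works when the model treats rows independently, which is why batching is something you opt into per signature.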


Next, you will need to create a Bento service. This abstraction tells Bento what model to use to run inference and handle any preprocessing. In this service, we are going to use the `fraud_classifier` model we saved to the local store.

Create a new file called `fraud_detection_service.py` in the project root directory, and paste the following code into it:

```python
import numpy as np
import pandas as pd

import bentoml
from bentoml.io import PandasDataFrame, JSON

ohe_encoder = bentoml.models.get("fraud_classifier:latest").custom_objects[
    "ohe_encoder"
]
fraud_classifier_runner = bentoml.sklearn.get("fraud_classifier:latest").to_runner()

svc = bentoml.Service("fraud_classifier", runners=[fraud_classifier_runner])


@svc.api(input=PandasDataFrame(), output=JSON(), route="/fraud-classifier")
def predict(df: pd.DataFrame) -> np.ndarray:
    X = df[["ProductCD", "P_emaildomain", "R_emaildomain", "card4", "M1", "M2", "M3"]]
    X = X.fillna(pd.NA)  # ensure all missing values are pandas NA
    X = pd.DataFrame(
        ohe_encoder.transform(X).toarray(),
        columns=ohe_encoder.get_feature_names_out().reshape(-1),
    )
    X["TransactionAmt"] = df[["TransactionAmt"]].to_numpy()
    return fraud_classifier_runner.predict.run(X)
```

There are a number of key details in this file to be aware of.

- `ohe_encoder`: You'll notice that we load in the encoder custom object from the *fraud_classifier* model we defined previously. This is because we need to transform the data before we can use it for inference.
- `fraud_classifier_runner`: Here, we load in the *fraud_classifier* model we defined previously and convert it into a **runner**. A runner in BentoML represents a unit of serving logic which wraps a model and can be scaled to maximize throughput and resource use.
- `svc`: This represents the Service object. It is the main entry point for the BentoML service.
- `svc.api`: This is a decorator that tells BentoML this is an API, what kind of input and output the API accepts, and the desired REST route.

As you can see, we instantiate a **Service** class and define an API with DataFrame inputs and JSON outputs. We run all necessary pre-processing with the **encoder** custom object, then call the **model** runner to make predictions.
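
To see why the service has to reuse the encoder fitted at training time rather than fit a new one, consider what one-hot encoding does: the layout of the output columns depends entirely on the categories seen during fitting. The sketch below is a simplified plain-Python illustration, not scikit-learn's implementation; `fit_categories` and `transform` are hypothetical helpers, and values unseen in training encode as all zeros, mirroring `handle_unknown="ignore"`.

```python
def fit_categories(rows, columns):
    """Record the sorted set of values seen per column during training."""
    return {col: sorted({row[col] for row in rows}) for col in columns}


def transform(row, categories):
    """One-hot encode a row; values unseen in training encode as all zeros."""
    encoded = []
    for col, known in categories.items():
        encoded.extend(1.0 if row[col] == value else 0.0 for value in known)
    return encoded


train = [
    {"ProductCD": "W", "card4": "visa"},
    {"ProductCD": "C", "card4": "mastercard"},
]
categories = fit_categories(train, ["ProductCD", "card4"])

# Columns are laid out as: ProductCD=C, ProductCD=W, card4=mastercard, card4=visa
print(transform({"ProductCD": "W", "card4": "visa"}, categories))  # [0.0, 1.0, 0.0, 1.0]
print(transform({"ProductCD": "W", "card4": "amex"}, categories))  # [0.0, 1.0, 0.0, 0.0]
```

An encoder fitted on different data would produce a different column layout, and the model's predictions would silently stop making sense. Storing the encoder as a custom object alongside the model avoids that.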

We can now quickly test out our service. We can run the following command in the terminal:

```bash
bentoml serve fraud_detection_service:svc
```

Open the address printed by `bentoml serve` in your browser and send the following POST request to the `/fraud-classifier` route (the output should be `[1]`):

```json
[{
    "isFraud":0,
    "TransactionAmt":495.0,
    "ProductCD":"W",
    "card4":"visa",
    "P_emaildomain":"live.com",
    "R_emaildomain":null,
    "M1":"T",
    "M2":"T",
    "M3":"T"
}]
```
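
The same request can also be sent from Python. This is a minimal sketch using only the standard library; it assumes the service is listening on `localhost:3000` and uses the `/fraud-classifier` route defined in the service file. The helper names `build_request` and `classify` are our own.

```python
import json
import urllib.request

# Assumption: `bentoml serve` is listening on localhost:3000.
SERVICE_URL = "http://localhost:3000/fraud-classifier"

# The same transaction as in the JSON body above.
transaction = {
    "isFraud": 0,
    "TransactionAmt": 495.0,
    "ProductCD": "W",
    "card4": "visa",
    "P_emaildomain": "live.com",
    "R_emaildomain": None,
    "M1": "T",
    "M2": "T",
    "M3": "T",
}


def build_request(url, records):
    """Build a JSON POST request carrying a list of transaction records."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(url, records, timeout=10):
    """Send the records to the service and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, records), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so this runs without the service.
request = build_request(SERVICE_URL, [transaction])
print(request.get_method(), request.full_url)
```

With the service running, calling `classify(SERVICE_URL, [transaction])` sends the request and returns the parsed response.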

Our service is ready to be packaged into a **Bento**, which is essentially the service packaged with needed dependencies. We're now going to build and containerize the Bento and deploy it to our minikube K8s cluster.

## Bento Building & Containerization

Before we containerize and test our Bento, we need to point our Docker CLI at the Docker daemon running inside minikube instead of the one on our host. Images built this way are stored where the cluster can find them, so there is no need to push them to a remote registry. We do this by setting some environment variables.
```bash
eval $(minikube docker-env)
```
Now any `docker` command we run in this shell session will use minikube's Docker daemon.

To build our Bento, we need to define a `bentofile.yaml` file in our project directory. We use this file to specify things such as Python packages, CUDA installations, the base Docker image, etc. You can read more about the various build options [here](https://docs.bentoml.org/en/latest/concepts/bento.html).

```yaml
service: "fraud_detection_service:svc"  # Same as the argument passed to `bentoml serve`
labels:
   owner: Cerebrium
   stage: prod
include:
- "*.py"  # A pattern for matching which files to include in the bento
python:
   packages:  # Additional pip packages required by the service
   - scikit-learn==1.0.2
   - pandas==1.4.3
   - numpy==1.23.2
   - xgboost==1.5.1
```

Now, we run the following command to build our Bento:

```bash
bentoml build
```

```
Building BentoML service "fraud_classifier:nibfwjaj2wdqiutt" from build context "/Users/elijahrou/Cerebrium/deployment_tut"
Packing model "fraud_classifier:47sbzaqj2oeaautt"
Locking PyPI package versions..

██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
██╔══██╗██╔════╝████╗░██║╚══██╔══╝██╔══██╗████╗░████║██║░░░░░
██████╦╝█████╗░░██╔██╗██║░░░██║░░░██║░░██║██╔████╔██║██║░░░░░
██╔══██╗██╔══╝░░██║╚████║░░░██║░░░██║░░██║██║╚██╔╝██║██║░░░░░
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝

Successfully built Bento(tag="fraud_classifier:nibfwjaj2wdqiutt")
```


'Grats! You created a Bento! You can serve it under the **latest** tag by running the following command in your terminal and then navigating to [localhost:3000](http://localhost:3000).
```bash
bentoml serve fraud_classifier:latest --production
```

Now, we need to create the service container. Using `bentoml containerize`, we will containerize our Bento, tagging the image with your registry link and the name of the service. We are using the *latest* tag, but you could grab the id of the Bento as a substitute (with `bentoml list`). If you're on Apple Silicon, include `--platform=linux/amd64` in the command to avoid compatibility issues.

```bash
bentoml containerize fraud_classifier:latest -t fraud-classifier:latest --platform=linux/amd64
```

Before we deploy the image, let's test that the service is working. Start the container by running the following command in your terminal (if you have a rogue process running on port 3000 from previous commands, you should kill it first):

```bash
docker run -p 3000:3000 fraud-classifier:latest --workers=2
```

Navigate to [localhost:3000](http://localhost:3000) in your browser and test the API with the previous POST request.

![Test the API locally](media/test_api.png)

Ensure the response output is as expected (either 1 or 0). If there are any errors, you likely made a mistake in the service definition in `fraud_detection_service.py`.

## Deploy your service

Well done! Now, there's only one thing left to do. We need to deploy our service image on Kubernetes! We will do this using Kubernetes manifests. We recommend using the [Kubernetes VS Code extension](https://blog.knoldus.com/create-kubernetes-manifests-files-quickly/) to get started creating manifest files. In the end, you should end up with a file, `deployment.yaml`, that looks roughly like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-classifier
spec:
  selector:
    matchLabels:
      app: fraud-classifier
  template:
    metadata:
      labels:
        app: fraud-classifier
    spec:
      containers:
      - name: fraud-classifier
        image: fraud-classifier:latest
        resources:
          limits:
            memory: "2Gi"
            cpu: "1"
        ports:
        - containerPort: 3000
        imagePullPolicy: IfNotPresent
```
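
If you prefer to generate manifests programmatically, the same Deployment can be expressed as a plain Python dictionary and serialized to JSON, which `kubectl apply -f` accepts alongside YAML. The sketch below mirrors the manifest above; the helper name `build_deployment` is our own. Note that the selector's `matchLabels` must match the pod template's labels, otherwise the Deployment is rejected.

```python
import json

APP_LABELS = {"app": "fraud-classifier"}


def build_deployment(name, image, port, labels):
    """Build an apps/v1 Deployment equivalent to the YAML manifest above."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # The selector and the pod template share the same labels.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "imagePullPolicy": "IfNotPresent",
                            "resources": {"limits": {"memory": "2Gi", "cpu": "1"}},
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


deployment = build_deployment(
    "fraud-classifier", "fraud-classifier:latest", 3000, APP_LABELS
)
print(json.dumps(deployment, indent=2))
```

Saving the printed output to a file (for example `deployment.json`, a name of our choosing) gives you a manifest you can apply in the same way as the YAML version.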

We can deploy this manifest to our minikube cluster with the following command:

```bash
kubectl apply -f deployment.yaml
```

You can confirm it is now running by viewing the ReplicaSets and pods.
```bash
kubectl get rs
kubectl get pods
```

We need a load balancer to expose our Bento service and make our Fraud Classifier scalable to multiple pods. Let's create a K8s service by exposing the deployment, forwarding port 80 to the target port 3000.

```bash
kubectl expose deployment fraud-classifier --type=LoadBalancer --port=80 --target-port=3000
```
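
For reference, `kubectl expose` creates a Service object for you. The sketch below builds a roughly equivalent manifest as a Python dictionary, assuming the `app: fraud-classifier` label from the Deployment; the helper name `build_service` is our own, and the object that `kubectl` actually creates will carry additional defaulted fields.

```python
import json


def build_service(name, selector, port, target_port):
    """Build a v1 Service that forwards `port` to `target_port` on matching pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            # Traffic is routed to pods carrying these labels.
            "selector": dict(selector),
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


service = build_service("fraud-classifier", {"app": "fraud-classifier"}, 80, 3000)
print(json.dumps(service, indent=2))
```

Keeping the Service in a manifest like this, rather than creating it imperatively, makes it easier to version alongside the Deployment.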

Now that the load balancer is live, let's make sure our service is live! We can use minikube to quickly create a tunnel through to our service (if you've been using a cloud service, just navigate to the IP of the load balancer).
```bash
minikube service fraud-classifier
```

![Live App](media/live_app.png)

Congratulations, you've deployed your fraud classifier as an API!