# BentoML vs FastAPI: Good-bye FastAPI For Machine Learning!
## Objective comparison between BentoML and FastAPI

### Notes

FastAPI vs BentoML

Similar Features
- Pydantic validation
- Swagger UI
- async
- Starlette

ML Differentiators
- Running models in separate processes
    - https://modelserving.com/blog/breaking-up-with-flask-amp-fastapi-why-ml-model-serving-requires-a-specialized-framework

### Motivation

### But what is an API?

The world is full of APIs. Yet, strangely, most do a poor job of explaining what they are. When you google "what is an API?", you'll get stuff like "API stands for Application Programming Interface" (like we would care), or as Wikipedia puts it:

> An API is a way for two or more computer programs to communicate with each other.

For me, at a high level, an API is just a plain-old URL. An example? ChatGPT.

The link chat.openai.com/chat is a URL that lets you interact with OpenAI's ChatGPT model. You send requests to it through the website's UI using prompts.

But APIs don't need fancy UIs. They can simply be URLs that programmers send requests to in order to perform a variety of tasks, like generating an image from a prompt. For example, I've built an API that returns a cuteness score given a pet's image. Here is its URL:

PASTE THE URL HERE

It has a PREDICT endpoint, but an API can have as many endpoints as it needs, each performing a different function.

An API is just a way to hide complex programming logic, like ML models with billions of parameters, behind a simple interface so that users can interact with it without knowing how the thing behind it was built.
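To make this concrete, here is a minimal sketch of what "sending a request to an endpoint" looks like from the client's side. The URL, the `/predict` path, and the `image_url` field are hypothetical placeholders rather than the real pet cuteness API, and the request is only built, not sent:

```python
import json
from urllib.request import Request

# Hypothetical endpoint; substitute the URL of a real API.
API_URL = "https://example.com/predict"


def build_predict_request(image_url: str) -> Request:
    """Build (but do not send) a POST request carrying a JSON payload."""
    payload = json.dumps({"image_url": image_url}).encode("utf-8")
    return Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("https://example.com/cat.jpg")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request object to `urllib.request.urlopen` would actually send it. Everything the model does to produce a score stays hidden behind that one URL.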

So, in this article, we are comparing the king of API frameworks, FastAPI, which is generally used for web applications, to BentoML, a relatively young library specialized for deploying machine learning models as APIs.

### Similar features

Before we go into the differences in terms of machine learning use cases, let's outline the features FastAPI and BentoML share:

- Starlette: both are built on Starlette, the same powerful ASGI web framework, which makes them fast and easy to use.
- Automatic documentation with Swagger UI: both automatically generate API documentation in the standard OpenAPI format.
- Asynchronous requests: both allow asynchronous requests for heavily Input/Output-bound APIs. This means they can handle multiple requests concurrently instead of processing them one after another.
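
As a rough illustration of the last point, here is a sketch using only Python's `asyncio`. It is not how either framework is implemented internally; `handle_request` is a hypothetical handler, and `asyncio.sleep` stands in for a slow database or network call:

```python
import asyncio
import time


async def handle_request(request_id: int) -> str:
    """Simulate an I/O-bound handler, e.g. one waiting on a database."""
    await asyncio.sleep(0.1)  # stand-in for a slow network or disk call
    return f"response {request_id}"


async def serve_all() -> list:
    # gather() runs the three handlers concurrently on one event loop
    return await asyncio.gather(*(handle_request(i) for i in range(3)))


start = time.perf_counter()
responses = asyncio.run(serve_all())
elapsed = time.perf_counter() - start

print(responses)
print(f"elapsed: {elapsed:.2f}s")  # expect roughly 0.1s rather than 0.3s
```

While one handler waits on I/O, the event loop moves on to the next one, so the three waits overlap instead of adding up.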


These are the basic features required by modern API frameworks. The real differentiators between BentoML and FastAPI are in machine learning use cases.

### Saving/loading models

Building APIs in machine learning starts with a script or notebook that trains a model and saves it for deployment. How the model is saved matters, because the format of the saved model directly affects how it is fed to the API server.

In FastAPI, your choices are limited to a) pickling it or b) pickling it.

```
import joblib

joblib.dump(your_awesome_model, "models/model.pkl")
```

Security and performance issues aside, the biggest problem with pickling is the total lack of model versioning and dependency management.

In a typical ML project, engineers can build dozens or even hundreds of models and store them in a model registry, a place where all trained models are maintained and versioned. Try building a proper model registry inside a directory with filenames like `model_v237.pkl`.
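
To get a feel for the effort, here is a sketch of the bare minimum you would have to write yourself to get unique version tags and metadata on top of pickles. The function names and directory layout are hypothetical, and a plain dictionary stands in for a trained model:

```python
import json
import pickle
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def save_versioned(model, name: str, registry: Path, metadata: dict) -> str:
    """Pickle `model` under a unique tag and write its metadata next to it."""
    version = uuid.uuid4().hex[:8]
    model_dir = registry / name / version
    model_dir.mkdir(parents=True)
    with open(model_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    info = {"created": datetime.now(timezone.utc).isoformat(), **metadata}
    (model_dir / "info.json").write_text(json.dumps(info))
    return f"{name}:{version}"


def load_versioned(tag: str, registry: Path):
    """Load the model stored under a `name:version` tag."""
    name, version = tag.split(":")
    with open(registry / name / version / "model.pkl", "rb") as f:
        return pickle.load(f)


registry = Path(tempfile.mkdtemp())
stand_in_model = {"weights": [0.1, 0.2, 0.3]}  # placeholder for a trained model
tag = save_versioned(stand_in_model, "my_model", registry, {"owner": "me"})
restored = load_versioned(tag, registry)
print(tag, restored == stand_in_model)
```

Even this still lacks listing, a "latest" alias, dependency tracking, and a way to share models with teammates.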

In BentoML, there is a standard procedure for saving and loading models based on the framework used to train them. Say you have a Keras CNN. You could save it with:

```
import bentoml

bentoml.keras.save_model("cnn16", model_cnn)
```

And load it back with:

```
retrieved_cnn = bentoml.keras.load_model("cnn16:latest")
```

> BentoML supports `framework.save/load_model` functions for 12 of the most popular ML frameworks, and support for more frameworks is planned.

Models are not always stand-alone files; sometimes they have dependencies that must be saved with them. In that case, you can use the `custom_objects` and `metadata` parameters to save extra objects and information about the model:

```python
import bentoml

bentoml.keras.save_model(
    "cnn16",
    model_cnn,
    metadata={"desc": "CNN architecture with 16 starting filters", "owner": "Bex"},
    custom_objects={"model_history": history_object['history']}
)
```

### Model registry

Once you save a model with the `save_model` function, BentoML stores it in a local model registry located at `~/bentoml/models`. Here is how you can list its contents:

```
$ bentoml models list
Tag                        Module           Size        Creation Time        Path
cnn16:2uo5fkgxj27exuqj  bentoml.keras  5.81 KiB    2022-12-19 08:36:52  ~/bentoml/models/cnn16/2uo5fkgxj27exuqj
cnn16:nb5vrfgwfgtjruqj  bentoml.keras  5.80 KiB    2022-12-19 21:36:27  ~/bentoml/models/cnn16/nb5vrfgwfgtjruqj
```

Even if models are saved under the same name, each one is assigned a unique, randomly generated version tag. Once you choose the best model from the registry, you can share it with others by exporting it to the unified `.bentomodel` archive format, regardless of the model's framework:

```
$ bentoml models export cnn16:version_tag .
```

If a teammate shares a model in a `.bentomodel` with all its custom objects and metadata, you can easily add it to your own model registry with:

```
$ bentoml models import ./shared_cnn.bentomodel
```

Imagine what a mess the process would be if you were working with pickles.

You might be saying, "Well, great! I can use BentoML for model registry and still build the API with FastAPI". In that case, hold off making a decision for a little longer.

### Input/Output

APIs communicate in JSON. This is a huge headache for data scientists and ML engineers, most of whom spend their lives working with NumPy arrays and Pandas DataFrames.
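
A small sketch of that friction, using only the standard library (nested lists stand in for an array, and four raw bytes stand in for an image file):

```python
import base64
import json

# A 2x2 "array" travels as nested JSON lists and comes back as plain lists.
rows = [[5.1, 3.5], [6.2, 2.9]]
wire_text = json.dumps({"data": rows})
received = json.loads(wire_text)["data"]

# Raw bytes (e.g. an image) are not JSON-serializable as-is.
image_bytes = bytes([137, 80, 78, 71])  # first four bytes of a PNG file
try:
    json.dumps({"image": image_bytes})
    bytes_ok = True
except TypeError:
    bytes_ok = False

# One common workaround: base64-encode the bytes into text first.
encoded = base64.b64encode(image_bytes).decode("ascii")
decoded = base64.b64decode(json.loads(json.dumps({"image": encoded}))["image"])

print(received, bytes_ok, decoded == image_bytes)
```

Every conversion step like this is code you have to write, validate, and keep in sync with what the model expects.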

Let's say you have a model trained on a dataset with four features. Here is how you would write the input/output formats of the model in FastAPI:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import pickle

app = FastAPI()

class Input(BaseModel):
    feature_1: float
    feature_2: float
    feature_3: float
    feature_4: float

class Output(BaseModel):
    prediction: float
    
# Load the trained model
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Register the function as a POST endpoint
@app.post("/predict", response_model=Output)
def predict(inputs: Input):
    # Convert the input data to a numpy array
    data = np.array([inputs.feature_1, inputs.feature_2, inputs.feature_3, inputs.feature_4])

    # Make a prediction using the model
    prediction = model.predict([data])[0]

    # Return the prediction as a response object
    return Output(prediction=prediction)
```

FastAPI uses the Pydantic library for data validation, which means that if the input's shape or data types don't match the `Input` class, FastAPI rejects the request with a validation error.

But you will only face such easy scenarios in cheap machine learning courses. Real-world problems are much more complicated. For example, how would you define the input if your model was trained on hundreds of features (which is pretty common)? You would have to write a class with one attribute per feature, which is absurd.

If you were to use BentoML for validating NumPy array inputs, here is how you'd do it:

```python
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray


runner = bentoml.sklearn.get("model_name:latest").to_runner()

svc = bentoml.Service("classifier", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = runner.predict.run(input_series)
    return result
```

The `NumpyNdarray` class validates inputs against NumPy arrays of any shape. But what if you want to enforce a certain shape on the inputs (which is a best practice)? You can do it with a couple of extra parameters:

```python
@svc.api(input=NumpyNdarray(shape=(-1,15), enforce_shape=True), output=NumpyNdarray())
```

Passing (-1, 15) to the `shape` parameter validates that the inputs have any number of rows but always 15 features. Here is the FastAPI and Pydantic equivalent:

```python
from typing import List
from pydantic import BaseModel, conlist

class Input(BaseModel):
    data: List[conlist(item_type=float, min_items=15, max_items=15)]
```

If you have trouble understanding the above class, you aren't to blame. I scoured Stack Overflow for hours before I could piece together the last line of code.
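
For reference, the rule being enforced is simply "any number of rows, exactly 15 values per row". Spelled out in plain Python, it could look like the sketch below. This only illustrates the rule; it is not how BentoML or Pydantic implement it, and `matches_shape` is a hypothetical helper:

```python
from typing import Sequence


def matches_shape(rows: Sequence[Sequence[float]], shape: tuple) -> bool:
    """Check a 2-D nested sequence against `shape`, where -1 means 'any size'."""
    n_rows, n_cols = shape
    if n_rows != -1 and len(rows) != n_rows:
        return False
    return all(n_cols == -1 or len(row) == n_cols for row in rows)


one_row = [[0.0] * 15]
many_rows = [[0.0] * 15 for _ in range(100)]
too_narrow = [[0.0] * 14]

print(matches_shape(one_row, (-1, 15)))     # True
print(matches_shape(many_rows, (-1, 15)))   # True
print(matches_shape(too_narrow, (-1, 15)))  # False
```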

On top of everything, models don't only deal with NumPy arrays. The data sent to the API could be anything from Pandas DataFrames to binary files like images or audio. The `bentoml.io` module contains validation classes for the most popular types of inputs and outputs:

- PandasDataFrame
- PandasSeries
- JSON
- Text
- Image
- File

Engineers must know so many things before they can deploy their models. Learning Pydantic and JSON just to be able to feed the data to the API would just be overwhelming. 

### Packaging the service

Once you have a service script ready, you have to package it for deployment. One of the most popular ways of doing this is creating a Docker image for the API. Having a Docker image makes it very easy to host the API on different operating systems and cloud environments.

The FastAPI way requires some Docker knowledge, mainly how to write a Dockerfile like the one below:


```Dockerfile
# Get the FastAPI image with Python version 3.7
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

# Create the directory for the container
WORKDIR /app
COPY requirements.txt ./requirements.txt

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

COPY ./app.py ./

# Copy the serialized model
COPY ./models/cnn_model.pkl ./models/cnn_model.pkl

# Run by specifying the host and port
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]
```

Basically, you have to spoon-feed everything to the Dockerfile so that when you run `docker build -t fastapiapp:latest -f docker/Dockerfile .`, the image builds without errors.

In BentoML, everything is much simpler again. In the root directory, you create a `bentofile.yaml` with the following template:

```YAML
service: "service.py:service_name"
include:
 - "*.py"
python:
  packages:
   - scikit_learn
   - numpy
   - tensorflow
```

In the `include` field, you specify all the files the service script requires to run without errors. Under `packages`, you list your project's dependencies. Then, call `bentoml build` in the terminal:

```
$ bentoml build
```

The `build` command packages your entire project into a stand-alone archive inside the `~/bentoml/bentos` folder. 

To convert the archive into a Docker image, you run:

```
$ bentoml containerize model_name:latest
```

Now, you can deploy the image to any environment you want.

### Deploying the service

Once the service is packaged, preferably into a Docker container, it is time to deploy it. Unfortunately, FastAPI's functionality stops there. It gives you the API but doesn't care about how you deploy it. You are supposed to figure that out yourself.

In contrast, BentoML has a dedicated helper library called `bentoctl` that allows you to deploy your containerized APIs on any of the most popular cloud platforms (AWS, GCP, Azure, Heroku). For example, it only takes a few commands to deploy one of the models in the model registry to AWS SageMaker:

```
# Install the Terraform CLI separately first (it is a standalone tool from HashiCorp)
$ pip install bentoctl
$ bentoctl operator install aws-sagemaker
$ export AWS_ACCESS_KEY_ID=REPLACE_WITH_YOUR_ACCESS_KEY
$ export AWS_SECRET_ACCESS_KEY=REPLACE_WITH_YOUR_SECRET_KEY
$ bentoctl init
$ bentoctl build -b model_name:latest -f deployment_config.yaml
$ terraform init
$ terraform apply -var-file=bentoctl.tfvars -auto-approve
```

If you want a detailed overview of the above steps, check out my comprehensive tutorial:

https://towardsdatascience.com/comprehensive-guide-to-deploying-any-ml-model-as-apis-with-python-and-aws-lambda-b441d257f1ec

### GPU serving

Many machine learning models, deep learning ones in particular, need access to GPUs for heavy workloads. This means you should build the Docker image for the API with GPU support.

GPU support is provided by NVIDIA's CUDA toolkit, and installing CUDA manually is one of the most horrible experiences you will face as a programmer. Juggling different CUDA versions for different GPU libraries like TensorFlow, PyTorch, or XGBoost is known as "CUDA hell" (which rightfully sounds like "Go to hell!").

To install CUDA in your Docker image, you would need a monster of a Dockerfile like the one below:

```Dockerfile
FROM nvidia/cuda:11.2.0-runtime-ubuntu20.04

# install utilities
RUN apt-get update && \
    apt-get install --no-install-recommends -y curl

ENV CONDA_AUTO_UPDATE_CONDA=false \
    PATH=/opt/miniconda/bin:$PATH
RUN curl -sLo ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p /opt/miniconda \
    && rm ~/miniconda.sh \
    && sed -i "$ a PATH=/opt/miniconda/bin:\$PATH" /etc/environment

# Installing python dependencies
RUN python3 -m pip --no-cache-dir install --upgrade pip && \
    python3 --version && \
    pip3 --version

RUN pip3 --timeout=300 --no-cache-dir install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

COPY ./requirements.txt .
RUN pip3 --timeout=300 --no-cache-dir install -r requirements.txt

# Copy model files
COPY ./model /model

# Copy app files
COPY ./app /app
WORKDIR /app/
ENV PYTHONPATH=/app
RUN ls -lah /app/*

COPY ./start.sh /start.sh
RUN chmod +x /start.sh

EXPOSE 80
CMD ["/start.sh"]
```

In BentoML, installing CUDA is as easy as adding a few fields under `docker` in the `bentofile.yaml`:


```YAML
service: "service.py:service_name"
include:
- "*.py"
python:
    packages:
    - torch
    - torchvision
    - torchaudio
docker:
    distro: debian
    python_version: "3.8.12"
    cuda_version: "11.6.2"
```

Then, you can just repeat the steps outlined in the last section.

### Batching inputs
https://docs.bentoml.org/en/latest/guides/batching.html