# PyTorch Model Deployment


So far, the FoodVision Mini project has only been accessible to us. The goal of this chapter is to deploy our FoodVision Mini model to the internet as a usable app.

### What is Machine Learning model deployment?

Machine learning model deployment is the process of making your machine learning model accessible to someone or something else, so that they can interact with the model in some way.

That someone or something can be a person, or a program, app, or another model that interacts with our model.

### Why deploy a Machine Learning model?

Evaluating a model on a well-crafted test set or visualizing its results gives a good indication of its performance, but you never truly know how a model performs until it is released in the wild.

Having people who have never used your model interact with it often reveals edge cases never thought of during training. Deployment helps surface errors that are not obvious during training and testing.

### Different types of Machine Learning model deployment

There are many types of model deployment. To decide which one fits best, start with the question:

> What is the ideal scenario for my machine learning model to be used?

And then work backwards from there. In the case of FoodVision Mini, the ideal scenario would entail:

* Someone takes a photo on a mobile device
* The prediction comes back fast

Therefore, this yields two important criteria:
1. The model should work on a mobile device (leading to compute constraints)
2. The model should make predictions fast (because a slow app is not very useful)

When dealing with these criteria, we also have to account for where the data will be stored and whether predictions must be returned immediately or can come later.

Because there are so many criteria to weigh, it is often easier to start with the ideal use case and work backwards from there.

### Where is it going to go?

Where does the model live when it is deployed?

The main question here is whether it lives on-device (also called edge or in the browser) or in the cloud (a computer/server that isn't the actual device someone or something calls the model from).

Both scenarios have their pros and cons. 

| Deployment Location | Pros                                                       | Cons                                                                                                               |
|---------------------|------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| On-device           | Can be very fast (since no data leaves the device)         | Limited compute power (larger models take longer to run)                                                           |
|                     | Privacy preserving (again no data has to leave the device) | Limited storage space (smaller model size required)                                                                |
|                     | No internet connection required (sometimes)                | Device-specific skills often required                                                                              |
| On cloud            | Near unlimited compute power (can scale up when needed)    | Costs can get out of hand (if proper scaling limits aren't enforced)                                               |
|                     | Can deploy one model and use everywhere (via API)          | Predictions can be slower due to data having to leave device and predictions having to come back (network latency) |
|                     | Links into existing cloud ecosystem                        | Data has to leave device (this may cause privacy concerns)                                                         |

Given these considerations, there is a clear trade-off between model performance and prediction time. On-device models tend to be smaller and less performant but respond faster, while cloud models can be larger and more performant at the cost of more computation, more storage, and longer prediction times once network latency is included.
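The storage constraint in the table can be made concrete with some rough arithmetic. The sketch below assumes weights are stored as 32-bit floats (4 bytes each) and ignores buffers and file-format overhead. `estimate_model_size_mb` is a hypothetical helper, and the parameter counts are illustrative round numbers rather than measurements of any particular model.

```python
def estimate_model_size_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate size of a model's weights in megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_parameters * bytes_per_parameter / (1024 * 1024)


# Illustrative parameter counts (assumptions, not measured values)
small_model_mb = estimate_model_size_mb(8_000_000)
large_model_mb = estimate_model_size_mb(86_000_000)

print(f"Small model: ~{small_model_mb:.1f} MB")
print(f"Large model: ~{large_model_mb:.1f} MB")
```

A roughly tenfold gap in parameter count becomes a roughly tenfold gap in storage, which matters far more on a phone than on a cloud server.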

### How is it going to function?

When deploying a machine learning model, one has to decide whether predictions are needed immediately or whether a slight delay is acceptable. These scenarios are generally referred to as:

* Online (real-time): Predictions/inference happen immediately.
* Offline (batch): Predictions/inference happen periodically. 

The periodic predictions can have a varying timescale too, from seconds to hours or days. 

These approaches can be mixed too: the inference pipeline can run online while the training pipeline runs offline, which is what has been done throughout the course.
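The difference between the two modes can be sketched in plain Python. This is only an illustration: `toy_model` is a hypothetical stand-in for a trained classifier, and `predict_online` and `predict_offline` are assumed helper names, not part of any library.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[str]]


def toy_model(batch: Sequence[float]) -> List[str]:
    """Hypothetical stand-in for a trained classifier: labels each input by a threshold."""
    return ["class_a" if value < 0.5 else "class_b" for value in batch]


def predict_online(model: Model, sample: float) -> str:
    """Online (real-time): run the model as soon as a single request arrives."""
    return model([sample])[0]


def predict_offline(model: Model, samples: Sequence[float], batch_size: int = 4) -> List[str]:
    """Offline (batch): process previously collected inputs in fixed-size chunks."""
    predictions: List[str] = []
    for start in range(0, len(samples), batch_size):
        predictions.extend(model(samples[start:start + batch_size]))
    return predictions


# One request answered immediately
online_result = predict_online(toy_model, 0.9)

# Inputs collected over time and processed together later
collected = [0.1, 0.9, 0.4, 0.7, 0.2]
offline_results = predict_offline(toy_model, collected, batch_size=2)

print(online_result)
print(offline_results)
```

In both cases the model is the same. What changes is when it is called and how many inputs it sees per call.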

### Ways to deploy a machine learning model

Here are some of the options:

| Tool/Resource                                  | Deployment Type               |
|------------------------------------------------|-------------------------------|
| Google's ML Kit                                | On-device (Android and iOS)   |
| Apple's Core ML and coremltools Python package | On-device (all Apple devices) |
| Amazon Web Services (AWS) SageMaker            | Cloud                         |
| Google Cloud's Vertex AI                       | Cloud                         |
| Microsoft's Azure Machine Learning             | Cloud                         |
| Hugging Face Spaces                            | Cloud                         |
| API with FastAPI                               | Cloud/self-hosted server      |
| API with TorchServe                            | Cloud/self-hosted server      |
| ONNX (Open Neural Network Exchange)            | Many/general                  |

The best option depends heavily on what is being built and who you are working with.

One of the simplest ways to start small is to turn your machine learning model into a demo app with Gradio and then deploy it on Hugging Face Spaces.


### What will be covered

The goal is to deploy the FoodVision Mini model via a Gradio demo app that meets the following targets:
1. Performance: 95% accuracy
2. Speed: real-time inference at 30 FPS+ (each prediction has a latency below ~0.03 s)
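The speed target follows from simple arithmetic: 30 predictions per second leaves 1/30 ≈ 0.033 seconds per prediction. The sketch below shows one way to turn that into a check, assuming predictions are timed with `time.perf_counter`. `measure_latency` and `meets_speed_goal` are hypothetical helper names, and the lambda is a placeholder for a real prediction function.

```python
import time
from typing import Any, Callable

TARGET_FPS = 30
LATENCY_BUDGET_S = 1 / TARGET_FPS  # seconds available per prediction


def measure_latency(predict_fn: Callable[[Any], Any], sample: Any, n_runs: int = 50) -> float:
    """Return the average time in seconds of one predict_fn(sample) call."""
    predict_fn(sample)  # warm-up call so one-off setup costs are not counted
    start = time.perf_counter()
    for _ in range(n_runs):
        predict_fn(sample)
    return (time.perf_counter() - start) / n_runs


def meets_speed_goal(latency_s: float, target_fps: int = TARGET_FPS) -> bool:
    """True if a per-prediction latency is fast enough for target_fps."""
    return latency_s <= 1 / target_fps


# Placeholder prediction function (an assumption for illustration only)
average_latency = measure_latency(lambda x: x * 2, sample=3)
print(f"Latency budget: {LATENCY_BUDGET_S:.4f} s per prediction")
print(f"Meets 30 FPS goal: {meets_speed_goal(average_latency)}")
```

Averaging over several runs smooths out timing noise from a single call.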

## 0. Setup

In [None]:
try:
    import torch
    import torchvision
    # Check the torch major version and the torchvision minor version
    assert int(torch.__version__.split(".")[0]) >= 2, "torch version should be 2.+"
    assert int(torchvision.__version__.split(".")[1]) >= 15, "torchvision version should be 0.15+"
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")
except (ImportError, AssertionError):
    # Reached when torch/torchvision are missing or older than required
    print("[INFO] torch/torchvision versions not correct. Installing correct versions.")
    !pip3 install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
    import torch
    import torchvision
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")
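The version check above inspects one number per library (the major version of torch, the minor version of torchvision). A slightly more general alternative, sketched here under the assumption that version strings start with `major.minor`, compares `(major, minor)` tuples instead. `parse_major_minor` and `meets_minimum` are hypothetical helper names, not part of torch or torchvision.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from strings such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: Tuple[int, int]) -> bool:
    """True if version is at least minimum, comparing (major, minor) tuples."""
    return parse_major_minor(version) >= minimum


# Example version strings (illustrative values)
print(meets_minimum("2.1.0+cu118", (2, 0)))
print(meets_minimum("0.14.1", (0, 15)))
```

Tuple comparison keeps working when a major version rolls over, for example a hypothetical `1.0.0` release still satisfies a `(0, 15)` minimum.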

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import torch
import torchvision

from torch import nn
from torchvision import transforms

# torchinfo provides model summaries; install it if it is missing
try:
    from torchinfo import summary
except ImportError:
    print("[INFO] Couldn't find torchinfo... installing it")
    !pip install -q torchinfo
    from torchinfo import summary

# Reuse the course's going_modular scripts and helper functions; download them if they are missing
try:
    from going_modular import data_setup, engine
    from helper_functions import download_data, set_seeds, plot_loss_curves
except ImportError:
    print("[INFO] Could not find going_modular scripts. Downloading them from GitHub.")
    !git clone https://github.com/Aaron-Serpilin/Zero-To-Mastery-Pytorch
    !mv Zero-To-Mastery-Pytorch/Fundamentals/going_modular .
    !mv Zero-To-Mastery-Pytorch/Fundamentals/helper_functions.py .
    !rm -rf Zero-To-Mastery-Pytorch
    from going_modular import data_setup, engine
    from helper_functions import download_data, set_seeds, plot_loss_curves

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

## 1. Getting data