# Agenda

1. [Introduction. What is MLOps?](#Intro)
2. [Model Compression](#Compression)
3. [Deployment. Simple Approach. BentoML](#BentoML)
4. [Deployment. Advanced Approach. Custom Services](#Custom)
5. [Homework](#Homework)

<a id='Intro'></a>
# Introduction. What is MLOps?
MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them. MLOps is a collaborative function, often comprising data scientists, DevOps engineers, and IT. <br>
![](https://cms.databricks.com/sites/default/files/inline-images/mlops-components.png) <br>
In this lecture, our main focus will be on the model inference and model deployment steps.

<a id='Compression'></a>
# Model Compression
The primary benefit of model compression is reduced compute cost during inference. Compression reduces CPU/GPU time, memory usage, and disk storage, and it can make a model suitable for production that would otherwise have been too expensive, too slow, or too large. <br>

Even when it’s beneficial, compression is not free. Costs of implementing it include:

- Increased deployment complexity: After implementing various model compression techniques there is more to keep track of, namely the original trained model and the compressed models. We must choose the model to deploy and spend time making this choice.
- Decreased accuracy: Some model compression techniques result in a loss of accuracy (however this is measured). This cost has an obvious counterpart in that the benefits of the model compression technique may outweigh the accuracy loss.
- Compute cost: While model compression reduces the compute resources required for inference, the compression itself may be computationally expensive to perform. Notably, distillation introduces an additional iterative training step.
- Your time: Adding a step to the lifecycle requires an investment of your time.

> **TODO**: You can read more about compression [here](https://medium.com/data-science-at-microsoft/model-compression-and-optimization-why-think-bigger-when-you-can-think-smaller-216ec096f68b).

## ONNX conversion and ONNX Runtime
ONNX is an open format used to represent machine learning models. It defines a common set of operators and a common file format, so that data scientists can use the same model across a wide variety of frameworks and runtimes. Converting a natural language model from your favorite neural network library to ONNX can additionally act as a mild model compression step: the export can fold constants, and the operators defined by ONNX have implementations optimized for specific types of hardware, which can result in a slightly smaller model.<br>


The true utility of ONNX comes in the form of the ONNX Runtime backend. One of the optimizations with the most impact that ONNX Runtime implements is the capacity to “fuse” operations and activations within a model. The result of this fusion is a significant reduction in memory footprint and calculations per inference. For popular NLP model families, there exists customized logic to identify the operations within the models that can be fused.
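
To make the idea of fusion concrete, here is a toy sketch in plain Python. It is not ONNX Runtime code, and the function names are hypothetical: it only illustrates why computing a multiply, an add, and a ReLU in a single pass avoids the intermediate buffers that three separate operators would materialize.

```python
def unfused(xs, scale, bias):
    """Three separate 'operators', each materializing a full intermediate list."""
    scaled = [x * scale for x in xs]        # op 1: Mul
    shifted = [s + bias for s in scaled]    # op 2: Add
    return [max(0.0, v) for v in shifted]   # op 3: Relu


def fused(xs, scale, bias):
    """One fused 'operator': same math, a single pass, no intermediate lists."""
    return [max(0.0, x * scale + bias) for x in xs]


xs = [-2.0, -0.5, 0.0, 1.5, 3.0]
# Fusion must not change the result, only how it is computed.
assert unfused(xs, 2.0, 1.0) == fused(xs, 2.0, 1.0)
print(fused(xs, 2.0, 1.0))
```

A real runtime applies the same principle to whole subgraphs (for example attention or layer normalization blocks), where the saved memory traffic is much larger than in this toy case.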

pip install onnx==1.14.1
pip install onnxruntime
pip install optimum[onnxruntime]==1.13.2

In [1]:
import time
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [2]:
device = 'cpu'
onnx_path = './bert_base_uncased.onnx'

dummy_input = tokenizer('here is the sample text for the dummy input', return_tensors="pt", max_length=50, padding='max_length')

# Positional args must follow the order of the model's forward():
# (input_ids, attention_mask, token_type_ids). input_names are assigned in the same order.
torch.onnx.export(
    model=model,
    args=(dummy_input['input_ids'].to(device), dummy_input['attention_mask'].to(device), dummy_input['token_type_ids'].to(device)),
    f=onnx_path,
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=['input_ids', 'attention_mask', 'token_type_ids'],
    output_names=['output'],
    dynamic_axes={
        'input_ids': {0: 'batch_size'},
        'attention_mask': {0: 'batch_size'},
        'token_type_ids': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

In [3]:
texts = 3*[
    "The road whispers secrets to those who travel without direction.",
    "In the heart of the city, the jazz soul dances with shadows.",
    "Under the moon's gaze, we found stories waiting in every corner.",
    "The rhythm of the night was our compass through uncharted dreams.",
    "With each mile, the horizon whispered promises of freedom.",
    "In the silence of the mountains, our laughter echoed like ancient songs.",
    "We were nomads of the twilight, seeking truth in the stars.",
    "Every sunset was a painting, a masterpiece of our wanderlust.",
    "The neon lights flickered, writing poetry in the dark.",
    "Our conversations were a patchwork of memories and musings.",
    "In the arms of the wilderness, we found our untamed spirits.",
    "The city slept, but we walked its dreams.",
    "With the dawn came clarity, like the first breath of a new world.",
    "Our hearts beat to the rhythm of the train's song.",
    "In the quiet cafes, we sipped on stories and coffee.",
    "The road was a canvas, and our journey, its art.",
    "Night skies told tales older than time in their starry script.",
    "We found solace in the symphony of the wind and waves.",
    "Each town held a secret, whispered in the rustling leaves.",
    "We danced with shadows, embracing the mystery of the night.",
    "The mountains stood as guardians of our deepest thoughts.",
    "In the flicker of the campfire, our dreams danced freely.",
    "The desert's vastness echoed our longing for the unknown.",
    "Our path was lit by the hopes of a thousand adventures.",
    "In the depths of the forest, time stood still, a silent witness.",
    "We followed the river's song, meandering through forgotten lands.",
    "The city's heartbeat was a melody of chaos and beauty.",
    "Under the starlit sky, our souls whispered tales of old.",
    "The open road was our teacher, life its lesson.",
    "In every journey's end, a new story was born.",
    "We were seekers of the dawn, chasing the first light.",
    "The rain's rhythm spoke of journeys yet to come.",
    "Our map was drawn in dreams and detours.",
    "In the twilight, the world seemed to pause, listening to our footsteps.",
    "We wandered through the pages of the earth, writing our story.",
    "The ocean's vastness mirrored our boundless curiosity.",
    "In the stillness of the night, every star held a wish.",
    "Our laughter was the soundtrack of endless roads.",
    "We found poetry in the ordinary, magic in the mundane.",
    "Every mile traveled was a verse in our epic.",
    "The whispers of the wind were our guide through the unknown.",
    "In the heart of the forest, we spoke the language of the wild.",
    "The city at dawn was a canvas of hushed possibilities.",
    "Our journey was a mosaic of moments, each a priceless gem.",
    "With each setting sun, our stories grew richer.",
    "The mountains called to us, their peaks like beckoning fingers.",
    "In the quiet of the countryside, our thoughts found voice.",
    "We were pilgrims of the moonlight, worshiping the night.",
    "The road's end was not a destination, but a new beginning.",
    "In the labyrinth of streets, we found pieces of ourselves."
]

# Measure inference time
start_time = time.time()
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", max_length=50, padding='max_length')
    with torch.no_grad():
        outputs = model(**inputs)
end_time = time.time()

original_model_time = end_time - start_time
print(f"Original Model Inference Time: {original_model_time} seconds")

# Load the ONNX model
session = ort.InferenceSession("bert_base_uncased.onnx")

# Measure inference time
start_time = time.time()
for text in texts:
    inputs = tokenizer(text, return_tensors="np", max_length=50, padding='max_length')
    inputs_onnx = {k: v for k, v in inputs.items()}
    outputs = session.run(None, inputs_onnx)
end_time = time.time()

onnx_model_time = end_time - start_time
print(f"ONNX Model Inference Time: {onnx_model_time} seconds")
print(f"Inference acceleration is {round(original_model_time/onnx_model_time, 2)}x times")

Original Model Inference Time: 11.530035018920898 seconds
ONNX Model Inference Time: 7.327675104141235 seconds
Inference acceleration is 1.57x times
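
A single timed pass like the one above can be noisy (caches, lazy initialization, background load). As a minimal sketch, assuming nothing beyond the standard library, a small helper can add a warm-up pass and report the median over several repeats. The names `benchmark` and `fake_model` are hypothetical; `fake_model` is only a stand-in workload so the example runs without a model.

```python
import statistics
import time


def benchmark(fn, inputs, warmup=1, repeats=5):
    """Return (median, best) wall-clock seconds for running fn over all inputs."""
    for _ in range(warmup):
        for item in inputs:
            fn(item)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        for item in inputs:
            fn(item)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)


def fake_model(text):
    """Stand-in workload (assumption): replace with a real inference call."""
    return sum(ord(ch) for ch in text)


median_s, best_s = benchmark(fake_model, ["some text"] * 100)
print(f"median {median_s:.6f}s, best {best_s:.6f}s")
```

With the objects defined in this notebook, `fn` could be a small function that tokenizes one text and calls `session.run(None, dict(inputs))`, and `inputs` would be `texts`.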


## Quantization

Our next approach, quantization, is the process of mapping values from a large set to a smaller set. Rounding and truncation are both basic examples of quantization but aren’t how quantization manifests in the realm of neural networks.


Neural nets, in most default configurations, have weights stored as 32-bit floating point numbers (fp32). Computing with fp32 is comparatively expensive in arithmetic, memory, and bandwidth, and much modern hardware offers faster instructions for lower-precision types.


The most common quantization process takes fp32 numbers and reduces them to 8-bit integers (int8). The quantized weights take roughly a quarter of the space, and inference can run several times faster on hardware with good int8 support, although the speedup actually achieved depends on the model, the runtime, and the hardware (the measurement later in this section shows a more modest gain). These benefits come at the cost of a loss in precision in the output of the model. Whether this loss in precision affects the target metric for the model is task and model dependent. Typically, when models have discrete outputs, such as identification of a handwritten digit, this precision loss has less effect.


This form of quantization comes with a catch: moving from fp32 to int8 is most beneficial for models running inference on the CPU.
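
The mapping from fp32 to int8 can be sketched in a few lines of plain Python. This is a simplified, symmetric per-tensor scheme chosen for illustration, not the exact procedure that `quantize_dynamic` uses; the helper names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0, -0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.5f}")
```

Each weight is stored as one byte plus a shared scale, which is where the size reduction comes from. The round-trip error is bounded by half the scale, so tensors with a few very large values lose more precision on their small values.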

![](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png)
![](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/8-bit-signed-integer-quantization.png)

> **TODO**: You can read more about the exact process details [here](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/) and get more practical transformers examples [here](https://github.com/ELS-RD/transformer-deploy/blob/main/demo/quantization/quantization_end_to_end.ipynb)

In [4]:
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("bert_base_uncased.onnx", "bert_base_uncased_quant.onnx")



Ignore MatMul due to non constant B: /[/bert/encoder/layer.0/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.0/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.1/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.1/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.2/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.2/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.3/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.3/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.4/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.4/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.5/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/bert/encoder/layer.5/atten

In [5]:
# Load the ONNX model
session = ort.InferenceSession("bert_base_uncased_quant.onnx")

# Measure inference time
start_time = time.time()
for text in texts:
    inputs = tokenizer(text, return_tensors="np", max_length=50, padding='max_length')
    inputs_onnx = {k: v for k, v in inputs.items()}
    outputs = session.run(None, inputs_onnx)
end_time = time.time()

onnx_model_time_quant = end_time - start_time
print(f"ONNX Quantized Model Inference Time: {onnx_model_time_quant} seconds")
print(f"Inference acceleration is {round(original_model_time/onnx_model_time_quant, 2)}x times")

ONNX Quantized Model Inference Time: 5.012095624750311 seconds
Inference acceleration is 2.3x times


## Distillation
![](https://i0.wp.com/www.merchantnavydecoded.com/wp-content/uploads/2023/06/steam-egine-26.png?fit=1140%2C570&ssl=1)
Distillation is one of the most powerful approaches when it comes to model compression. A state-of-the-art distillation process can cut your model size down by a factor of seven, increase inference speed by a factor of ten, and have almost no effect on the model’s accuracy metric (e.g. DistilBERT, TinyBERT).


Here is more good news: Distillation is still fairly young! There are likely many improvements to come. Now for some bad news: Distillation is still fairly young! This means that the process is not yet widely implemented in standard libraries. Research code does exist that can be used to distill various model architectures (such as BERT, GPT2, and BART), though to implement distillation on a custom model it is necessary to understand the full process.

> **TODO**: You can read more about the exact process details [here](https://medium.com/p/dd4973dbc764)

Teacher-student networks: how exactly do they work?
- Train the Teacher Network: The highly complex teacher network is first trained separately on the complete dataset. This step is computationally demanding and is therefore done offline (on high-performing GPUs).
- Establish Correspondence: When designing a student network, a correspondence needs to be established between the intermediate outputs of the student network and those of the teacher network. This correspondence can involve directly passing the output of a layer in the teacher network to the student network, or performing some data augmentation before passing it on. The teacher's knowledge is transferred to the student through the loss function: we train the student so that its output distribution mimics the distribution the teacher provides. The mismatch between the two distributions is measured with the Kullback-Leibler (KL) divergence, which quantifies how far the student's distribution is from the teacher's. The result is a loss function with a term measuring the KL divergence between the student distribution and the teacher distribution.

![](https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Knowledge-Distillation_4.png?resize=900%2C356&ssl=1)
- Forward Pass through the Teacher Network: Pass the data through the teacher network to get all intermediate outputs, then apply data augmentation (if any) to them.
- Backpropagation through the Student Network: Use the outputs from the teacher network and the correspondence relation to backpropagate the error through the student network, so that it learns to replicate the behavior of the teacher network.
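
The KL term of the loss described above can be sketched in plain Python. This is a minimal illustration, assuming the common formulation in which both sets of logits are softened with a temperature and the KL divergence is scaled by the squared temperature; the function names and the example logits are hypothetical. In practice this term is usually combined with the ordinary cross-entropy on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits softened by a temperature (higher = flatter)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(teacher, student)


teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different class

print(distillation_loss(teacher_logits, close_student))
print(distillation_loss(teacher_logits, far_student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, which is the signal backpropagated through the student.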


In [6]:
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Measure inference time
start_time = time.time()
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", max_length=50, padding='max_length')
    with torch.no_grad():
        outputs = model(**inputs)
end_time = time.time()

distilled_model_time = end_time - start_time
print(f"Distilled Model Inference Time: {distilled_model_time} seconds")
print(f"Inference acceleration is {round(original_model_time/distilled_model_time, 2)}x times")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distilled Model Inference Time: 5.7951743602752686 seconds
Inference acceleration is 1.99x times


## Distillation + ONNX + Quantization

In [7]:
device = 'cpu'
onnx_path = './distilbert_base_uncased.onnx'

dummy_input = tokenizer('here is the sample text for the dummy input', return_tensors="pt", max_length=50, padding='max_length')

torch.onnx.export(
    model=model,
    args=(dummy_input['input_ids'].to(device), dummy_input['attention_mask'].to(device)),
    f=onnx_path,
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=['input_ids', 'attention_mask'],
    output_names=['output'],
    dynamic_axes={
        'input_ids': {0: 'batch_size'},
        'attention_mask': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

quantize_dynamic(onnx_path, './distilbert_base_uncased_quant.onnx')

  mask, torch.tensor(torch.finfo(scores.dtype).min)


Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.0/attention/MatMul]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.0/attention/MatMul_1]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.1/attention/MatMul]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.1/attention/MatMul_1]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.2/attention/MatMul]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.2/attention/MatMul_1]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.3/attention/MatMul]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.3/attention/MatMul_1]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.4/attention/MatMul]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.4/attention/MatMul_1]
Ignore MatMul due to non constant B: /[/distilbert/transformer/layer.5/attention/MatMul]
Ignore MatM

In [11]:
# Load the ONNX model
session = ort.InferenceSession('distilbert_base_uncased_quant.onnx')

# Measure inference time
start_time = time.time()
for text in texts:
    inputs = tokenizer(text, return_tensors="np", max_length=50, padding='max_length')
    inputs_onnx = {k: v for k, v in inputs.items()}
    outputs = session.run(None, inputs_onnx)
end_time = time.time()

onnx_model_time_quant_distilled = end_time - start_time
print(f"ONNX Distilled Quantized Model Inference Time: {onnx_model_time_quant_distilled} seconds")
print(f"Inference acceleration is {round(original_model_time/onnx_model_time_quant_distilled, 2)}x times")

ONNX Distilled Quantized Model Inference Time: 2.8224806785583496 seconds
Inference acceleration is 4.09x times


<a id='BentoML'></a>
# Deployment. Simple Approach. BentoML

ML model deployment is the process of integrating a trained ML model into an existing production environment to make practical, actionable decisions based on new data. It's a crucial step in a machine learning project as it allows the model to provide real-world value. Here we will consider **real-time inference**: for applications that require immediate feedback, the model is deployed in an environment that processes requests as they arrive. We start with a simple implementation using BentoML.

**BentoML** is designed for teams working to bring machine learning (ML) models into production in a reliable, scalable, and cost-efficient way. In particular, AI application developers can leverage BentoML to easily integrate state-of-the-art pre-trained models into their applications. By seamlessly bridging the gap between model creation and production deployment, BentoML promotes collaboration between developers and in-house data science teams.

> **TODO**: Read the documentation and experiment with the more advanced options described [here](https://docs.bentoml.org/en/latest/quickstarts/deploy-a-transformer-model-with-bentoml.html).

In [None]:
pip install bentoml

In [12]:
import bentoml
import transformers

pipe = transformers.pipeline("text-classification", device='cpu')

bentoml.transformers.save_model(
  "text-classification-pipe",
  pipe,
  signatures={
    "__call__": {"batchable": True}  # Enable dynamic batching for model
  }
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Model(tag="text-classification-pipe:auqktued2osicqqb", path="/home/abazdyrev/bentoml/models/text-classification-pipe/auqktued2osicqqb/")

In [12]:
!bentoml models list

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
 Tag                      Module                Size        Creation Time       
 text-classification-pi…  bentoml.transformers  256.35 MiB  2023-11-15 14:57:05 


In [13]:
!bentoml serve bentoml_service.py:svc

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-11-15T14:58:07+0000 [INFO] [cli] Environ for worker 0: set CPU thread count to 8
2023-11-15T14:58:07+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-11-15T14:58:07+0000 [INFO] [cli] Starting production HTTP BentoServer from "service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)


In [13]:
import requests

requests.post('http://localhost:3000/classify', data="BentoML is awesome").text

'{"label":"POSITIVE","score":0.9998418092727661}'

In [14]:
import requests

requests.post('http://localhost:3000/classify', data="ML deployment is a very complicated and awful process").text

'{"label":"NEGATIVE","score":0.9997016787528992}'

<a id='Custom'></a>
# Deployment. Advanced Approach. Custom Services

## Docker 
Docker is a platform that uses containerization technology to make it easier to create, deploy, and run applications. Containers can be thought of as a kind of lightweight, standalone, and executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
Here are some key aspects of Docker:
- Containers: Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, and libraries. This makes the software behave consistently across the environments it runs in.
- Isolation: Containers are isolated from each other and the host system. They have their own filesystem, their own networking, and their own isolated process space. This provides a layer of security and allows multiple containers to run on the same host machine without interference.
- Portability: Since Docker containers contain everything needed to run an application, they are highly portable. You can run these containers on any machine that has Docker installed, provided the image was built for that machine's CPU architecture.
- Consistency: Docker provides a consistent environment for development, testing, and production. This consistency helps to reduce the "it works on my machine" problem when working in teams.
- Microservices Architecture: Docker is well-suited for microservices architecture, where complex applications are broken down into smaller, independent services. This makes it easier to update and scale applications.
- Dockerfiles: Docker uses a simple text file called a Dockerfile to automate the building of container images. A Dockerfile specifies all the steps that need to be taken to create the image.
- CI/CD Integration: Docker integrates well with continuous integration and continuous deployment (CI/CD) workflows, making it easier to automate the testing and deployment of applications.

![](https://docs.docker.com/get-started/images/docker-architecture.png)

## FastAPI
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints. It's known for its high speed and ease of use and has been gaining popularity in the Python community. Here are some key features and benefits of FastAPI:
- Speed: FastAPI is built on Starlette for the web parts and Pydantic for the data parts, which makes it one of the fastest Python frameworks available; its documentation describes performance on par with NodeJS and Go.
- Automatic Documentation: FastAPI automatically generates interactive API documentation (using Swagger UI and ReDoc) that lets you call and test your API directly from the browser.
- Easy to Use: It has been designed to be easy to use while also ensuring that new developers can quickly understand its operation. FastAPI simplifies the process of building robust APIs.
- Asynchronous Code Support: FastAPI supports asynchronous request handlers (`async def`) out of the box, making it suitable for I/O-bound applications.
- Security and Authentication: FastAPI includes several tools to help with API security, such as OAuth2 flows (including the password flow with bearer tokens); JWT handling and password hashing are typically added with third-party libraries.
- Extensibility: Being lightweight and based on standard Python tools, FastAPI is very easy to extend with various databases, ORMs, authentication and authorization frameworks, data validation libraries, and more.
- Built for Production: It is designed to run in production environments, and its documentation covers deployment options, including Docker containers.

**FastAPI-Docker model deployment example in ./fastapi_service**

To run it, you need to install `docker` and `docker-compose`.

In [None]:
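# Each "!" command runs in its own subshell, so "!cd" would not persist; use the %cd magic instead.
%cd fastapi_service
!docker-compose build
!docker-compose up -d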

In [15]:
import requests

requests.post('http://localhost:8088/classify', data="ML deployment is a very complicated and awful process").text

'{"label":"NEGATIVE","score":0.9997016787528992}'

In [16]:
import requests

requests.post('http://localhost:8088/classify', data="ML deployment is a very simple and exciting process").text

'{"label":"POSITIVE","score":0.9997023940086365}'
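
Both services above expose the same contract: a `POST /classify` request with raw text in the body, answered with a JSON object. The sketch below reproduces that contract using only the Python standard library, so it is deliberately not FastAPI or BentoML code, and it does not show the contents of `./fastapi_service`. The `classify` function is a hypothetical keyword rule standing in for a real model, and it returns only a label (no score).

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text):
    """Hypothetical stand-in for the model: a keyword rule, not a real classifier."""
    negative_words = {"awful", "complicated", "bad"}
    is_negative = any(word in negative_words for word in text.lower().split())
    return {"label": "NEGATIVE" if is_negative else "POSITIVE"}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(classify(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def post_text(url, text):
    """Send raw text in a POST body and decode the JSON response."""
    request = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
    # Bypass any configured HTTP proxy, since the call stays on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


server = HTTPServer(("127.0.0.1", 0), ClassifyHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/classify"

result = post_text(url, "ML deployment is a very simple and exciting process")
print(result)

server.shutdown()
server.server_close()
```

A framework such as FastAPI adds request validation, automatic documentation, and async handling on top of this basic request/response loop, and Docker packages the service together with its dependencies.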

<a id='Homework'></a>
# Homework
Theory:
- Follow all **TODO** links

Practice (15 points):
- Implement ONNX conversion, quantization, and the use of distilled versions of pretrained models for one of your previous tasks (Twitter disasters/UA Locations/CommonLit/your fine-tuned generative AI model), and check the speed improvements and metric degradation on your local validation set.
- Deploy your model (optimized with ONNX/distillation/quantization) locally using BentoML (or similar tools), or using a custom approach with FastAPI/Flask and Docker.
- (Advanced) Deploy your model to a cloud service, e.g. using GCP free credits.