What can you do when you need a fast, compact, yet highly accurate model?
- to speed up the predictions
- to reduce the memory footprint

HOW:
- knowledge distillation
- quantization
- pruning
- graph optimization
- with Open Neural Network Exchange (ONNX) format & ONNX Runtime (ORT)
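As a rough illustration of the quantization idea listed above, the sketch below maps a handful of floating-point weights to 8-bit integers and back. The helper names `quantize_int8` and `dequantize` and the example weights are made up for this note; real frameworks use more elaborate schemes (per-channel scales, calibration, and so on).

```python
def quantize_int8(values):
    """Affine quantization of floats to int8 (illustrative sketch only)."""
    lo, hi = min(values), max(values)
    # The scale spreads the float range over the 256 available int8 levels
    scale = (hi - lo) / 255 if hi > lo else 1.0
    # The zero point is the integer code that represents the float 0.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate floats."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.8, -0.1, 0.0, 0.35, 1.2]  # made-up example weights
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max round-trip error: {max_error:.4f}")
```

Each weight now needs one byte instead of four, at the price of a small rounding error bounded by about half the scale.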

# Intent Detection as a Case Study

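If a company's call center wants to let customers make a booking without interacting with a human agent, the system has to work out what each customer wants (their intent).

- It also needs to be able to handle out-of-scope queries.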
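One possible way to act on out-of-scope queries is sketched below. It assumes the classifier returns a list of `{"label", "score"}` dicts with the best prediction first, and that out-of-scope examples carry an `"oos"` label. The function name `route_query`, the 0.5 threshold, and the example scores are all hypothetical.

```python
def route_query(predictions, threshold=0.5, oos_label="oos"):
    """Return the intent to act on, or None to signal a fallback.

    `predictions` mimics a text-classification output: a list of
    {"label": str, "score": float} dicts, best prediction first.
    """
    top = predictions[0]
    # Fall back when the model says "out of scope" or is not confident enough
    if top["label"] == oos_label or top["score"] < threshold:
        return None
    return top["label"]


# Made-up predictions, for illustration only
print(route_query([{"label": "car_rental", "score": 0.92}]))
print(route_query([{"label": "oos", "score": 0.88}]))
print(route_query([{"label": "car_rental", "score": 0.21}]))
```

Returning `None` leaves the caller free to decide what the fallback is, for example handing the conversation to a human agent.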

In [None]:
# The BERT-base model reaches 96% accuracy on the CLINC150 dataset.

from transformers import pipeline

# Checkpoint name ends in "clinc" (the dataset name), not "clinic"
bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)
# Now that we have a pipeline, we can pass it a query.

In [1]:
query = "Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in Paris and I need a 15 passenger van"
pipe(query)

# Creating a Performance Benchmark

Model Performance
- How well does our model perform on a well-crafted test set that reflects production data?
    - especially important when the cost of making errors is large
    - such errors are best mitigated with a human in the loop

Latency
- How fast can our model deliver predictions?
    - usually matters in real-time environments that deal with a lot of traffic
    - e.g. Stack Overflow
        - needed a classifier to quickly detect unwelcome comments on the website

Memory
- How can we deploy billion-parameter models like GPT-2 or T5 that require gigabytes of disk storage and RAM?
    - especially important when the model has to generate predictions without access to a powerful cloud server

BUT
- More commonly, failing to address these constraints leads to ballooning costs from running expensive cloud servers that may only need to handle a few requests
- So how can each of these constraints be optimized?

In [3]:
class PerformanceBenchmark:
    # define optim_type
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type
    
    def compute_accuracy(self):
        # We will define this later
        pass
    
    def compute_size(self):
        # We will define this later
        pass
    
    def time_pipeline(self):
        # We will define this later
        pass
    
    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics

"optim_type" parameter
- keeps track of the different optimization techniques

run_benchmark() method
- collects all the metrics in a dictionary with keys given by "optim_type"
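To make the shape of that dictionary concrete, here is a small sketch with a stand-in class. `DummyBenchmark`, its hard-coded numbers, and the `"Distillation"` label are placeholders invented for this note, not real measurements; only the `run_benchmark()` logic mirrors the class above.

```python
class DummyBenchmark:
    """Stand-in for PerformanceBenchmark that returns made-up metrics."""

    def __init__(self, optim_type="BERT baseline"):
        self.optim_type = optim_type

    def compute_size(self):
        return {"size_mb": 100.0}  # placeholder value

    def time_pipeline(self):
        return {"time_avg_ms": 10.0, "time_std_ms": 1.0}  # placeholder values

    def compute_accuracy(self):
        return {"accuracy": 0.9}  # placeholder value

    def run_benchmark(self):
        # Same logic as PerformanceBenchmark.run_benchmark()
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics


# Merge the results of several runs into one dictionary keyed by technique
results = {}
for name in ["BERT baseline", "Distillation"]:
    results.update(DummyBenchmark(optim_type=name).run_benchmark())
print(results)
```

Because every run is keyed by its `optim_type`, results from different techniques can be merged with `dict.update()` and compared side by side.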

In [None]:
# Load CLINC150 dataset (used to finetune our baseline models)
from datasets import load_dataset 

clinc = load_dataset("clinc_oos", "plus")

In [None]:
sample = clinc["test"][42]
sample

# intents are provided as IDs

In [None]:
intents = clinc["test"].features["intent"]
intents.int2str(sample["intent"])

In [None]:
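# Note: load_metric was deprecated and later removed from datasets;
# on recent versions use `import evaluate; evaluate.load("accuracy")` instead.
from datasets import load_metric

accuracy_score = load_metric("accuracy")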

In [4]:
def compute_accuracy(self):
    """This overrides the PerformanceBenchmark.compute_accuracy() method"""
    preds, labels = [], []
    for example in self.dataset:
        # The pipeline returns a list of dicts; take the label of the top prediction
        pred = self.pipeline(example["text"])[0]["label"]
        label = example["intent"]
        preds.append(intents.str2int(pred))  # map each predicted label to its ID
        labels.append(label)

    accuracy = accuracy_score.compute(predictions=preds, references=labels)
    print(f"Accuracy on test set - {accuracy['accuracy']:.3f}")
    return accuracy

PerformanceBenchmark.compute_accuracy = compute_accuracy # add to class
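For reference, the accuracy metric itself is just the fraction of predictions that match the references. The sketch below is a plain-Python stand-in (the function name `simple_accuracy` and the label IDs are made up); it returns a dictionary with an `"accuracy"` key, like the result printed above.

```python
def simple_accuracy(predictions, references):
    """Fraction of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


# Made-up label IDs: three of the four predictions are correct
result = simple_accuracy(predictions=[3, 1, 4, 1], references=[3, 1, 4, 2])
print(result)
```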


In [None]:
list(pipe.model.state_dict().items())[42]
# Each key-value pair corresponds to a specific layer and tensor

In [None]:
import torch
torch.save(pipe.model.state_dict(), "model.pt")  # serialize the weights to disk

In [None]:
import torch
from pathlib import Path # to get info about the underlying files

def compute_size(self):
    """This overrides the PerformanceBenchmark.compute_size() method"""
    state_dict = self.pipeline.model.state_dict()
    tmp_path = Path("model.pt")
    torch.save(state_dict, tmp_path)
    # st_size gives the file size in bytes; convert it to megabytes
    size_mb = tmp_path.stat().st_size / (1024 * 1024)
    # Delete temporary file
    tmp_path.unlink()
    print(f"Model size (MB) - {size_mb:.2f}")
    return {"size_mb": size_mb}

PerformanceBenchmark.compute_size = compute_size # add to class
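A quick sanity check for the number this method reports is to estimate the size from the parameter count. The sketch below assumes roughly 110 million parameters for a BERT-base model (a rounded figure) and 4 bytes per float32 weight; `estimate_size_mb` is a hypothetical helper, and the estimate ignores any file-format overhead.

```python
def estimate_size_mb(num_params, bytes_per_param=4):
    """Rough model size in MB: parameter count times bytes per parameter."""
    return num_params * bytes_per_param / (1024 * 1024)


num_params = 110_000_000  # rounded parameter count, an assumption
fp32_mb = estimate_size_mb(num_params)                     # float32 weights
int8_mb = estimate_size_mb(num_params, bytes_per_param=1)  # 8-bit weights
print(f"float32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

The same arithmetic shows why storing weights in 8 bits cuts the footprint to about a quarter.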

In [None]:
from time import perf_counter  # high-resolution timer

# Measure latency by passing our test query and taking the time difference

for _ in range(3):
    start_time = perf_counter()
    _ = pipe(query)
    latency = perf_counter() - start_time
    print(f"Latency (ms) - {1000 * latency: .3f}")

In [None]:
import numpy as np
from time import perf_counter

def time_pipeline(self, query="What is the pin number for my account?"):
    """This overrides the PerformanceBenchmark.time_pipeline() method"""
    latencies = []
    # Warm up
    for _ in range(10):
        _ = self.pipeline(query)
    # Timed run
    for _ in range(100):
        start_time = perf_counter()  # restart the timer before every call
        _ = self.pipeline(query)
        latency = perf_counter() - start_time
        latencies.append(latency)

    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}

PerformanceBenchmark.time_pipeline = time_pipeline
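The warm-up-then-measure pattern can be tried without loading a model. The sketch below uses only the standard library and times an arbitrary callable; `time_callable` and the dummy workload are made up for illustration, and `pstdev` is used because it matches the default behaviour of `np.std`.

```python
from statistics import mean, pstdev
from time import perf_counter


def time_callable(fn, warmup=3, runs=20):
    """Time `fn()` after a few warm-up calls; returns stats in milliseconds."""
    # Warm-up calls are executed but not recorded
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies.append(perf_counter() - start)
    return {
        "time_avg_ms": 1000 * mean(latencies),
        "time_std_ms": 1000 * pstdev(latencies),
    }


# Dummy workload standing in for a model call
stats = time_callable(lambda: sum(range(10_000)))
print(f"Average latency (ms) - {stats['time_avg_ms']:.3f} +/- {stats['time_std_ms']:.3f}")
```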