<a href="https://colab.research.google.com/github/Gooogr/Book_nlp_with_transformers/blob/main/ch8_optimization_for_production.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we will explore four complementary optimization techniques for transformer models:
* Knowledge distillation
* Quantization
* Pruning
* Graph optimization with the Open Neural Network Exchange (ONNX) format and ONNX Runtime (ORT)

Let’s suppose that we’re trying to build a text-based assistant for our company’s call
center so that customers can request their account balance or make bookings without
needing to speak with a human agent. In order to understand the goals of a customer,
our assistant will need to be able to classify a wide variety of natural language text
into a set of predefined actions or intents.

In [31]:
!pip install -qq transformers[sentencepiece] datasets

In [32]:
! nvidia-smi

Sun Jun  5 12:47:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    31W /  70W |   1408MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [33]:
from transformers import pipeline
from datasets import load_dataset
# Note: load_metric is deprecated in newer datasets releases (the evaluate library replaces it)
from datasets import load_metric
import torch
from pathlib import Path
from time import perf_counter
import numpy as np
from tqdm import tqdm

In [34]:
bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)

In [35]:
query = """Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in
Paris and I need a 15 passenger van"""
pipe(query)

[{'label': 'car_rental', 'score': 0.5490034818649292}]

# Load Dataset (CLINC 150)

In [36]:
# Load the CLINC150 dataset ("plus" configuration)
clinc = load_dataset('clinc_oos', 'plus')

Reusing dataset clinc_oos (/root/.cache/huggingface/datasets/clinc_oos/plus/1.0.0/abcc41d382f8137f039adc747af44714941e8196e845dfbdd8ae7a7e020e6ba1)



In [37]:
# General info
print(clinc)

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 15250
    })
    validation: Dataset({
        features: ['text', 'intent'],
        num_rows: 3100
    })
    test: Dataset({
        features: ['text', 'intent'],
        num_rows: 5500
    })
})


In [38]:
# Dataset sample
intents = clinc["test"].features["intent"]
get_intent = lambda x: intents.int2str(x)

sample = clinc['test'][42]
sample['intent_text'] = get_intent(sample['intent'])
print(sample)

{'text': 'transfer $100 from my checking to saving account', 'intent': 133, 'intent_text': 'transfer'}
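
The `intent` column stores integer class ids, while the pipeline returns label strings, so the benchmark later in this notebook has to convert between the two. A minimal sketch of that round trip, assuming a hypothetical three-label subset: `LABELS`, `int2str`, and `str2int` below are plain-Python stand-ins for the feature's methods, and the ids do not match the real CLINC150 ids.

```python
# Hypothetical label subset for illustration only; ids differ from the real dataset.
LABELS = ["car_rental", "oos", "transfer"]
STR_TO_INT = {name: idx for idx, name in enumerate(LABELS)}


def int2str(idx: int) -> str:
    """Map an integer class id to its label name."""
    return LABELS[idx]


def str2int(name: str) -> int:
    """Map a label name back to its integer class id."""
    return STR_TO_INT[name]


label_id = str2int("transfer")
print(label_id, int2str(label_id))
```

The same idea applies to the real feature object: predictions arrive as strings, references as integers, and one side must be converted before they can be compared.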


# Creating a performance benchmark

In [39]:
accuracy_score = load_metric("accuracy")
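
For reference, accuracy is the fraction of predictions that match the reference labels. A minimal plain-Python sketch of that calculation follows; the `simple_accuracy` helper and the toy ids are illustrative assumptions, not part of the `datasets` API.

```python
def simple_accuracy(predictions, references):
    """Return the fraction of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


# Three of the four toy predictions match their references.
print(simple_accuracy([133, 7, 42, 42], [133, 7, 42, 5]))  # {'accuracy': 0.75}
```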

In [40]:
# list(pipe.model.state_dict().items())[42]

In [41]:
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type

    def compute_accuracy(self):
        preds, labels = [], []
        for example in self.dataset:
            pred = self.pipeline(example["text"])[0]["label"]
            label = example["intent"]
            preds.append(intents.str2int(pred))
            labels.append(label)
        accuracy = accuracy_score.compute(predictions=preds, references=labels)
        print(f"Accuracy on test set - {accuracy['accuracy']:.3f}")
        return accuracy


    def compute_size(self):
        state_dict = self.pipeline.model.state_dict()
        tmp_path = Path("model.pt")
        torch.save(state_dict, tmp_path)
        # Calculate size in megabytes
        size_mb = Path(tmp_path).stat().st_size / (1024 * 1024)
        # Delete temporary file
        tmp_path.unlink()
        print(f"Model size (MB) - {size_mb:.2f}")
        return {"size_mb": size_mb}

    def time_pipeline(self, query="What is the pin number for my account?"):
        """Measure the average latency per query, in milliseconds."""
        latencies = []
        # Warmup
        for _ in range(10):
            _ = self.pipeline(query)
        # Timed run
        for _ in range(100):
            start_time = perf_counter()
            _ = self.pipeline(query)
            latency = perf_counter() - start_time
            latencies.append(latency)
        # Compute run statistics
        time_avg_ms = 1000 * np.mean(latencies)
        time_std_ms = 1000 * np.std(latencies)
        print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
        return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}


    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics
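
The timing logic in `time_pipeline` can be tried in isolation, without loading a model. The sketch below follows the same warmup-then-measure pattern with a dummy callable; `time_callable` and `fake_model` are hypothetical names introduced here, and the 2 ms sleep merely stands in for a forward pass.

```python
import statistics
from time import perf_counter, sleep


def time_callable(fn, n_warmup=10, n_runs=100):
    """Return mean and standard deviation of fn's latency in milliseconds."""
    # Warmup calls are excluded so one-off setup costs do not skew the result.
    for _ in range(n_warmup):
        fn()
    latencies = []
    for _ in range(n_runs):
        start = perf_counter()
        fn()
        latencies.append(perf_counter() - start)
    return {
        "time_avg_ms": 1000 * statistics.mean(latencies),
        # Population standard deviation, the same default as np.std.
        "time_std_ms": 1000 * statistics.pstdev(latencies),
    }


def fake_model():
    """Stand-in for a model call: sleeps for roughly 2 ms."""
    sleep(0.002)


stats = time_callable(fake_model, n_warmup=2, n_runs=20)
print(f"Average latency (ms) - {stats['time_avg_ms']:.2f} +/- {stats['time_std_ms']:.2f}")
```

Measuring a single fixed query keeps the benchmark simple, but latency also depends on input length, so the figure is best read as a relative number for comparing models on the same hardware.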

In [45]:
%%time
# test run ~ 7 min
pb = PerformanceBenchmark(pipe, clinc["test"])
perf_metrics = pb.run_benchmark()

Model size (MB) - 418.16
Average latency (ms) - 111.00 +/- 5.04
Accuracy on test set - 0.867
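
As a rough sanity check on the reported size, the file size can be converted back into an approximate parameter count. The sketch assumes the weights are stored as 32-bit floats (4 bytes each) and ignores serialization overhead and any non-float buffers, so the result is only an estimate.

```python
# Size reported by compute_size above, in units of 1024 * 1024 bytes.
size_mb = 418.16
# Assumption: every stored value is a 32-bit float.
bytes_per_param = 4

approx_params = size_mb * 1024 * 1024 / bytes_per_param
print(f"Approximate parameter count: {approx_params / 1e6:.1f}M")
```

The estimate lands near 110 million parameters, which is consistent with the size commonly quoted for BERT-base.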


# Knowledge Distillation