<a href="https://colab.research.google.com/github/OleksiiLatypov/Practical_Deep_Learning_with_PyTorch/blob/main/Torchscript/template_intro_torchscript.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **TorchScript**

In this lab, you will learn how to use TorchScript to optimise inference runtime, using the Yolov5s and BERT models as examples.

>A GPU is recommended for this assignment: `Runtime` -> `Change runtime type` -> `GPU`

**Instructions**
- Write code in the space indicated with `### START CODE HERE ###`
- Do not use loops (for/while) unless the instructions explicitly tell you to. Parallelization in Deep Learning is key!
- If you get stuck, ask for help in Slack or DM `@DRU Team`

**You will learn**
- How to use tracing or scripting to convert a model to TorchScript
- How to measure the inference time of a model

# **Import packages**

In [None]:
!pip install yolort==0.6.2
!pip install transformers==4.18.0


In [None]:
import torch
import numpy as np
from tqdm import trange

from yolort.models import yolov5s
from transformers import BertTokenizer, BertModel

Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...


In [None]:
# VALIDATION_FIELD[cls] Config

class Config:

    n_imgs = 32
    input_shape = (3, 256, 256)
    yolo_nwarmup = 20
    yolo_nruns = 100

    bert_nwarmup = 50
    bert_nruns = 500

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# **Runtime optimization with TorchScript**

## **What is TorchScript?**
[TorchScript](https://pytorch.org/docs/stable/jit.html#torchscript-language) is a statically typed, high-performance subset of the Python language, specialised for ML applications. Using TorchScript, you can create serialisable and optimisable models from PyTorch code and load them into a process with no dependency on Python, such as a C++ program. This can make your models run faster and independently of the Python runtime.
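
As a rough illustration of the "serialisable" part, the sketch below scripts a tiny module, saves it to an in-memory buffer and loads it back. `TinyNet` and the buffer are hypothetical names used only for this example; they are not part of the lab.

```python
import io

import torch


class TinyNet(torch.nn.Module):
    """Hypothetical toy model, used only to illustrate serialisation."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = TinyNet().eval()
scripted = torch.jit.script(model)

# Serialise to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)

# The restored module no longer needs the TinyNet class definition
restored = torch.jit.load(buffer)

x = torch.rand(3, 4)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```

The same archive could be loaded from a C++ program, which is not shown here.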

## **The Script mode and PyTorch JIT**

Script mode is the part of PyTorch focused on the production use case. It has two components: PyTorch JIT (an optimising compiler) and TorchScript.

Script mode creates an intermediate representation (IR) through [`torch.jit.trace`](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) and [`torch.jit.script`](https://pytorch.org/docs/stable/generated/torch.jit.script.html) to represent computation. The IR is internally optimised and utilises PyTorch JIT compilation at runtime.
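
To get a feel for the IR, you can inspect a compiled function. The sketch below assumes a hypothetical helper `add_relu`; it prints the Python-like source recovered by TorchScript (`.code`) and the underlying graph (`.graph`).

```python
import torch


def add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, used only to have something to compile
    return torch.relu(x + y)


scripted_add_relu = torch.jit.script(add_relu)

# Python-like view of the compiled function
python_like_source = scripted_add_relu.code
# Lower-level graph IR, with operators such as aten::add and aten::relu
ir_text = str(scripted_add_relu.graph)

print(python_like_source)
print(ir_text)
```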

PyTorch JIT is a compiler for PyTorch programs:
- It is a lightweight, thread-safe interpreter
- It supports easy-to-write custom transformations
- It is not just for inference, as it has autodiff support

PyTorch provides two methods for generating TorchScript from your model code:

- **Tracing**: [`torch.jit.trace`](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) takes a trained model and an example input. It runs the model, records the operations performed on the tensors, and turns this recording into a TorchScript module.

- **Scripting**: [`torch.jit.script`](https://pytorch.org/docs/stable/generated/torch.jit.script.html) compiles your model code directly into TorchScript through static inspection of the `nn.Module` contents. So, in contrast to tracing, you only need to pass an instance of your model to `torch.jit.script`; a data sample is not necessary.
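
The practical difference shows up with data-dependent control flow. In the sketch below (`signed_double` is a hypothetical function, not part of the lab), tracing records only the branch taken for the example input, while scripting keeps both branches.

```python
import warnings

import torch


def signed_double(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: the result depends on the sign of the sum
    if x.sum() < 0:
        return x * -2
    return x * 2


positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing warns that the branch depends on tensor data
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(signed_double, positive)
scripted = torch.jit.script(signed_double)

# The trace only recorded "x * 2", so the negative input stays negative
traced_out = traced(negative)
# The scripted version evaluates the condition at runtime
scripted_out = scripted(negative)

print(traced_out)
print(scripted_out)
```

For models without such branches, both methods should produce equivalent results.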

### When to use:

- **Tracing**:
    - Use `torch.jit.trace` if scripting fails because the model uses PyTorch/Python functionality that TorchScript does not support and you are unable to modify the model code.
    - Tracing records only the operations executed for the example input, so Python control flow and configuration values are baked into the graph. Together with [freezing](https://pytorch.org/tutorials/prototype/torchscript_freezing.html), this can help if you need better performance or want to fix architectural decisions at export time.

- **Scripting**:
    - `torch.jit.script` is easy to use because it captures both your model's operations and its full conditional logic. An export is likely either to fail for a well-defined reason, which you can resolve with a clear code modification, or to succeed without warnings.
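
As a small illustration of freezing, the sketch below applies `torch.jit.freeze` to a scripted module in `eval` mode, which inlines its parameters and attributes as constants. `TinyClassifier` is a hypothetical toy model, and no speed-up is claimed for a model this small.

```python
import torch


class TinyClassifier(torch.nn.Module):
    """Hypothetical toy model, used only to illustrate freezing."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.linear(x), dim=-1)


# Freezing expects a module in eval mode
scripted = torch.jit.script(TinyClassifier().eval())
frozen = torch.jit.freeze(scripted)

x = torch.rand(2, 8)
with torch.no_grad():
    outputs_match = torch.allclose(scripted(x), frozen(x))
print(outputs_match)
```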

Scripted and traced code can be mixed. You can see the existing [documentation](https://pytorch.org/docs/stable/jit.html#mixing-tracing-and-scripting) for details and examples.
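
A minimal sketch of mixing the two, assuming hypothetical functions `flip_if_negative` and `double_magnitude`: the branch lives in a scripted function, so it survives even when the caller is traced.

```python
import torch


@torch.jit.script
def flip_if_negative(x: torch.Tensor) -> torch.Tensor:
    # Scripted, so the condition is evaluated at runtime
    if x.sum() < 0:
        return -x
    return x


def double_magnitude(x):
    # Plain Python function that calls the scripted one
    return flip_if_negative(x) * 2


# Tracing the caller keeps the scripted function's control flow intact
traced = torch.jit.trace(double_magnitude, torch.ones(3))

out_positive = traced(torch.ones(3))
out_negative = traced(-torch.ones(3))
print(out_positive, out_negative)
```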


# **Yolov5s inference optimising**

## **Loading the Yolov5s model and scripting it**

We will use yolov5s from the [yolort](https://github.com/zhiqwang/yolov5-rt-stack) library, trained on the COCO dataset, to measure inference time.

In [None]:
# VALIDATION_FIELD[func] native_yolov5s_model

native_yolov5s_model = yolov5s(pretrained=True, size=(256, 256))

Downloading: "https://github.com/zhiqwang/yolov5-rt-stack/releases/download/v0.5.2-alpha/yolov5_darknet_pan_s_r60_coco-9f44bf3f.pt" to /root/.cache/torch/hub/checkpoints/yolov5_darknet_pan_s_r60_coco-9f44bf3f.pt
100%|██████████| 14.0M/14.0M [00:00<00:00, 106MB/s] 


**Exercise**: script the yolov5s model by passing an instance of the model to `torch.jit.script`:

In [None]:
# VALIDATION_FIELD[func] scripted_yolov5s_model

### START CODE HERE ### (1 line of code)
scripted_yolov5s_model = torch.jit.script(native_yolov5s_model)
### END CODE HERE ###

For clarity, let's save the scripted model and load it.

In [None]:
torch.jit.save(scripted_yolov5s_model,'scripted_yolov5s.pt')
loaded_scripted_yolov5s_model = torch.jit.load('scripted_yolov5s.pt')

## **Benchmarking native and scripted Yolov5s model**

Before making inference time measurements, we need to run some dummy examples through the model to do a GPU warm-up. Warm-up will initialise the GPU and prevent it from going into a power-saving mode.

Next, we need to create [`torch.cuda.Event`](https://pytorch.org/docs/stable/generated/torch.cuda.Event.html) objects, which act as timers for measuring time on the GPU.

Finally, after the model is warmed up, we can measure performance. To do this correctly, we need to call [`torch.cuda.synchronize()`](https://alband.github.io/doc_view/cuda.html#torch.cuda.synchronize). GPU work is launched asynchronously, so this call synchronises the host and the device (i.e., CPU and GPU) and ensures that the time is recorded only after the work running on the GPU has finished.
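
To see the role of synchronisation in isolation, here is a minimal sketch that times a single call. It swaps CUDA events for the wall-clock timer `time.perf_counter`, so it also runs on CPU; `time_call_ms`, `layer` and `batch` are hypothetical names used only for this illustration.

```python
import time

import torch


def time_call_ms(fn, *args):
    """Time one call in milliseconds, synchronising if a GPU is present."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Make sure previously queued GPU work does not leak into the timing
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        fn(*args)
    if use_cuda:
        # Wait until the GPU has actually finished before stopping the clock
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = torch.nn.Linear(256, 256).to(device).eval()
batch = torch.rand(64, 256, device=device)

# A few untimed warm-up calls, then one measured call
for _ in range(3):
    time_call_ms(layer, batch)
elapsed_ms = time_call_ms(layer, batch)
print(f"{elapsed_ms:.3f} ms")
```

Without the second `synchronize()`, the timer on a GPU could stop as soon as the work is queued rather than when it completes.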

The `run_model` function below runs the model for a given number of steps (warm-up or main inference) and measures the elapsed time.


In [None]:
# VALIDATION_FIELD[func] run_model

def run_model(model, input, nruns, desc=None, unpack=False):
    """
    Runs a model with inputs and measures performance time

    Arguments:
    model -- a model we will use to test performance
    input -- inputs for a model
    nruns -- steps to run a model
    desc -- description of a progress bar
    unpack -- bool indicator. Indicates unpack input or not

    Return:
    infer_time -- inference time
    """

    pbar = trange(nruns,
                  unit=" runs",
                  desc=desc,
                  bar_format='{desc}: {percentage:3.0f}%|{bar}| {n_fmt} run /{total_fmt} runs '
                '[{elapsed}<{remaining}, {rate_fmt}{postfix}]')


    if torch.cuda.is_available():
        # init CUDA events used as timers
        start_measure_time = torch.cuda.Event(enable_timing=True)
        end_measure_time = torch.cuda.Event(enable_timing=True)

        torch.cuda.synchronize()
        timings = np.zeros((nruns, 1))
        with torch.no_grad():
            for i in pbar:
                start_measure_time.record()
                if unpack:
                    _ = model(*input)
                else:
                    _ = model(input)
                end_measure_time.record()
                # wait for gpu sync
                torch.cuda.synchronize()
                # elapsed time between the two events, in milliseconds
                curr_time = start_measure_time.elapsed_time(end_measure_time)
                timings[i] = curr_time

        infer_time = np.sum(timings)
    else:
        with torch.no_grad():
            for _ in pbar:
                if unpack:
                    _ = model(*input)
                else:
                    _ = model(input)

        infer_time = pbar.format_dict["elapsed"] * 1000

    return infer_time

Now we will use the `run_model` function prepared above to benchmark the model.

**Exercise**: implement the `benchmark` function.

1. Pass dummy inputs to `device`.
2. Pass the model to `device`.
3. Switch the model to `eval` mode.
4. Warm-up the model using `run_model` with params `nwarmup` and `desc="Warm-up"`.
5. Measure inference time of the model using `run_model` with params `ninfer` and `desc="Inference timing"`.

In [None]:
# VALIDATION_FIELD[func] benchmark

def benchmark(model, dummy_input, nwarmup, ninfer, unpack=False, device=Config.device):

    """
    Benchmarks a model inference

    Arguments:
    model -- a model we will benchmark
    dummy_input -- list of dummy inputs for model inference
    nwarmup -- steps for warm-up model
    ninfer -- steps for main model inference
    unpack -- bool indicator. Indicates unpack input or not
    device -- type of device

    Return:
    infer_time -- inference time
    """

    ### START CODE HERE ### (≈ 5 lines of code)
    # move dummy inputs to device
    dummy_input = [x.to(device) for x in dummy_input]


    # pass the model to device
    model.to(device)

    # switch the model to eval mode
    model.eval()

    # warm-up the model
    _ = run_model(model, dummy_input, unpack=unpack, nruns=nwarmup, desc="Warm-up")

    # measure performance of the model
    infer_time = run_model(model, dummy_input, unpack=unpack, nruns=ninfer, desc="Inference timing")
    ### END CODE HERE ###

    print(f"\nInference time for the model: {infer_time:.2f} ms for {ninfer} runs")
    print(f"Avg. inference time for the model: {(infer_time)/ ninfer:.2f} ms\n")

    return infer_time

Let's benchmark the native and scripted Yolov5s models:

In [None]:
# create dummy_input for Yolov5s
yolo_dummy_input = [torch.rand(Config.input_shape) for _ in range(Config.n_imgs)]

# benchmark
print("Native Yolov5s benchmark:\n")
native_yolo_infer_time = benchmark(native_yolov5s_model, yolo_dummy_input, nwarmup=Config.yolo_nwarmup, ninfer=Config.yolo_nruns)
print("Scripted Yolov5s benchmark:\n")
scripted_yolo_infer_time = benchmark(loaded_scripted_yolov5s_model, yolo_dummy_input, nwarmup=Config.yolo_nwarmup, ninfer=Config.yolo_nruns)

print(f"Scripted Yolov5s is {((native_yolo_infer_time - scripted_yolo_infer_time)/ native_yolo_infer_time) * 100 :.2f} percent faster than Native model")

Native Yolov5s benchmark:



Warm-up: 100%|██████████| 20 run /20 runs [00:01<00:00, 14.44 runs/s]
Inference timing: 100%|██████████| 100 run /100 runs [00:07<00:00, 14.07 runs/s]



Inference time for the model: 6975.76 ms for 100 runs
Avg. inference time for the model: 69.76 ms

Scripted Yolov5s benchmark:



Warm-up: 100%|██████████| 20 run /20 runs [00:01<00:00, 19.54 runs/s]
Inference timing: 100%|██████████| 100 run /100 runs [00:05<00:00, 19.59 runs/s]


Inference time for the model: 5031.56 ms for 100 runs
Avg. inference time for the model: 50.32 ms

Scripted Yolov5s is 27.87 percent faster than Native model





**Expected output**:
```
Native Yolov5s benchmark:

Warm-up: 100%|██████████| 20 run /20 runs [00:03<00:00,  5.37 runs/s]
Inference timing: 100%|██████████| 100 run /100 runs [00:18<00:00,  5.42 runs/s]

Inference time for the model: 18185.01 ms for 100 runs
Avg. inference time for the model: 181.85 ms

Scripted Yolov5s benchmark:

Warm-up: 100%|██████████| 20 run /20 runs [00:03<00:00,  5.94 runs/s]
Inference timing: 100%|██████████| 100 run /100 runs [00:16<00:00,  5.94 runs/s]
Inference time for the model: 16608.96 ms for 100 runs
Avg. inference time for the model: 166.09 ms

Scripted Yolov5s is 8.67 percent faster than Native model
```
>(Elapsed time and speed may slightly vary)

>If you are running on a Tesla K80, the scripted Yolov5s model will be **~ 6-10** percent faster than the native Yolov5s model. On a Tesla T4, the performance difference will be more significant, on average **~ 15-20** percent.
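
The "percent faster" figure is the relative reduction in total inference time. As a quick check of the arithmetic, the sketch below recomputes it from the totals in the expected output above; `percent_faster` is a hypothetical helper name.

```python
def percent_faster(baseline_ms: float, optimised_ms: float) -> float:
    """Relative reduction in elapsed time, in percent of the baseline."""
    return (baseline_ms - optimised_ms) / baseline_ms * 100.0


# Totals taken from the expected output: native vs scripted Yolov5s
yolo_gain = percent_faster(18185.01, 16608.96)
print(f"{yolo_gain:.2f}")  # 8.67
```

Note that this measures saved time relative to the native model, not a throughput ratio.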


# **BERT inference optimising**

Here we will use BERT from the `transformers` library provided by Hugging Face.


## **Creating dummy input for BERT**

First, let's initialise the BERT tokenizer:

In [None]:
# VALIDATION_FIELD[func] tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', torchscript=True)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

The next step is to create sample input data for inference.

In [None]:
# VALIDATION_FIELD[str]

text = "I live in [MASK], this country is located on the continent of the same name."

**Exercise**: implement the `create_dummy_input_for_bert` function to create dummy input for the BERT model.

- Tokenize `text_example` using `encode_plus` with the parameter `return_tensors="pt"` and return a list containing `'input_ids'` and `'attention_mask'`:

In [None]:
# VALIDATION_FIELD[func] create_dummy_input_for_bert

def create_dummy_input_for_bert(text_example):

    """
    Creates dummy input for BERT model

    Arguments:
    text_example -- string with a sentence example

    Return:
    dummy_input -- list of 'input_ids' and 'attention_mask'
    """

    ### START CODE HERE ### (≈ 2-4 lines of code)
    text_example = tokenizer.encode_plus(text_example, return_tensors='pt')
    return [text_example['input_ids'], text_example['attention_mask']]

    ### END CODE HERE ###

dummy_input = create_dummy_input_for_bert(text)
print(f"{dummy_input[0]}\n{dummy_input[1]}")

tensor([[ 101, 1045, 2444, 1999,  103, 1010, 2023, 2406, 2003, 2284, 2006, 1996, 9983, 1997, 1996, 2168, 2171, 1012,  102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


**Expected output**:

```
tensor([[ 101, 1045, 2444, 1999,  103, 1010, 2023, 2406, 2003, 2284, 2006, 1996, 9983, 1997, 1996, 2168, 2171, 1012,  102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
```



## **Loading native BERT model and tracing it**

Next, we will load the pretrained BERT model:

In [None]:
# VALIDATION_FIELD[func] native_bert_model

native_bert_model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now we can trace the BERT model.

**Exercise**: trace the BERT model by passing an instance of the model and `dummy_input` to `torch.jit.trace`:

In [None]:
# VALIDATION_FIELD[func] traced_bert_model

### START CODE HERE ### (1 line of code)
traced_bert_model = torch.jit.trace(native_bert_model, tuple(dummy_input))
### END CODE HERE ###

For clarity, let's save the traced BERT model and load it.

In [None]:
torch.jit.save(traced_bert_model, "traced_bert.pt")
loaded_traced_bert_model = torch.jit.load("traced_bert.pt")

## **Benchmarking native and traced BERT model**

In [None]:
# benchmark
print("Native BERT benchmark:\n")
native_bert_infer_time = benchmark(native_bert_model, dummy_input, nwarmup=Config.bert_nwarmup, ninfer=Config.bert_nruns, unpack=True)
print("Traced BERT benchmark:\n")
traced_bert_infer_time = benchmark(loaded_traced_bert_model, dummy_input, nwarmup=Config.bert_nwarmup, ninfer=Config.bert_nruns, unpack=True)

print(f"Traced BERT is {((native_bert_infer_time - traced_bert_infer_time)/ native_bert_infer_time) * 100 :.2f} percent faster than Native model")

Native BERT benchmark:



Warm-up: 100%|██████████| 50 run /50 runs [00:00<00:00, 103.50 runs/s]
Inference timing: 100%|██████████| 500 run /500 runs [00:04<00:00, 101.80 runs/s]



Inference time for the model: 4827.04 ms for 500 runs
Avg. inference time for the model: 9.65 ms

Traced BERT benchmark:



Warm-up: 100%|██████████| 50 run /50 runs [00:02<00:00, 24.23 runs/s]
Inference timing: 100%|██████████| 500 run /500 runs [00:02<00:00, 186.02 runs/s]


Inference time for the model: 2606.74 ms for 500 runs
Avg. inference time for the model: 5.21 ms

Traced BERT is 46.00 percent faster than Native model





**Expected output**:
```
Native BERT benchmark:

Warm-up: 100%|██████████| 50 run /50 runs [00:00<00:00, 58.19 runs/s]
Inference timing: 100%|██████████| 500 run /500 runs [00:07<00:00, 63.76 runs/s]

Inference time for the model: 7643.96 ms for 500 runs
Avg. inference time for the model: 15.29 ms

Traced BERT benchmark:

Warm-up: 100%|██████████| 50 run /50 runs [00:00<00:00, 75.80 runs/s]
Inference timing: 100%|██████████| 500 run /500 runs [00:06<00:00, 76.36 runs/s]
Inference time for the model: 6347.65 ms for 500 runs
Avg. inference time for the model: 12.70 ms

Traced BERT is 16.96 percent faster than Native model
```
>(Elapsed time and speed may slightly vary)

>If you are running on a Tesla K80, the traced BERT model will be **~ 10-20** percent faster than the native BERT model. On a Tesla T4, the performance difference will be more significant, on average **~ 40-50** percent.