# Whisper benchmark

To measure the speedup of `Kernl` on the Whisper model, we use the `test` split of `librispeech` (`clean` configuration) on an RTX 3090 GPU.
Following the [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) paper, we use the following setup:
- `openai/whisper-large-v2` flavor (we use v2 of the weights)
- beam search with 5 beams
- only apply optimization to the decoder, because the encoder accounts for very little of the end-to-end latency
- we leverage CUDA graphs to remove most of the CPU overhead


In [1]:
! pip install datasets soundfile librosa

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
! nvidia-smi

Thu Jan 26 12:38:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 37%   43C    P8    28W / 350W |     49MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

## Setup

In [3]:
import time

import torch
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

from kernl.model_optimization import optimize_model


torch.set_float32_matmul_precision("high")
# torchdynamo.config.cache_size_limit = 512
# torchdynamo.config.dynamic_shapes = True
max_len = 50
num_beams = 5
model_name = "openai/whisper-large-v2"  # "openai/whisper-tiny"

## Load data & model

We define a simple helper function that extracts the input features from an audio sample.

In [4]:
# audio_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")  # small dataset for tests
audio_dataset = load_dataset("librispeech_asr", "clean", split="test")


def get_tokens(item: dict[str, dict]) -> torch.Tensor:
    tensor = processor(item["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features
    return tensor.cuda()


processor = WhisperProcessor.from_pretrained(model_name)
inputs_warmup = get_tokens(audio_dataset[0])

model = WhisperForConditionalGeneration.from_pretrained(model_name).to("cuda").eval()

Found cached dataset librispeech_asr (/home/geantvert/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)



## Baseline

Measurements are done on the mixed precision `FP16` model.
We save each model output so we can check the quality impact of the optimizations.

In [5]:
timings_original = list()
transcriptions = list()
with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    # warmup
    model.generate(inputs_warmup, min_length=max_len, max_length=max_len, num_beams=num_beams, do_sample=False)
    torch.cuda.synchronize()
    for audio in audio_dataset:
        inputs = get_tokens(audio)
        torch.cuda.synchronize()
        start = time.time()
        predicted_ids = model.generate(inputs, min_length=1, max_length=max_len, num_beams=num_beams, do_sample=False)
        torch.cuda.synchronize()
        timings_original.append(time.time() - start)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]
        transcriptions.append(transcription)

assert len(audio_dataset) == len(transcriptions) == len(timings_original)

## Optimized model

### Hugging Face implementation

First, we fix a small inefficiency in the Hugging Face library.  
The fix avoids unnecessary copies of the encoder-derived tensors held in the K/V cache.

The impact on inference speed is limited; the fix mainly reduces the memory footprint.  
In the past, PyTorch 2.0 nightlies had many memory leaks, and this fix was mandatory back then to avoid OOM errors.  
FWIW, all the memory leaks we found during our experiments have been fixed in recent PyTorch versions.  

In [6]:
# apply efficiency fix to HuggingFace implementation of Whisper to limit memory footprint
@staticmethod
def fix_reorder_cache(past, beam_idx):
    reordered_past = ()
    for layer_past in past:
        reordered_past += (
            tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        )
    return reordered_past


WhisperForConditionalGeneration._reorder_cache = fix_reorder_cache
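
To see why the patch above saves memory, here is a minimal stand-alone sketch of the same logic. It assumes the per-layer cache layout is `(self-attn K, self-attn V, cross-attn K, cross-attn V)`; plain Python lists stand in for tensors and list indexing stands in for `index_select(0, beam_idx)`. The name `reorder_cache_sketch` and the toy data are hypothetical.

```python
def reorder_cache_sketch(past, beam_idx):
    """Reorder only the first two cache entries of each layer.

    Lists stand in for tensors (an illustrative assumption).
    """
    reordered_past = ()
    for layer_past in past:
        # Entries 0 and 1 depend on the selected beams: rebuild them.
        self_attn = tuple([state[i] for i in beam_idx] for state in layer_past[:2])
        # Remaining entries are passed through by reference, with no copy.
        reordered_past += (self_attn + layer_past[2:],)
    return reordered_past


# One layer, three beams; the encoder-derived entries are identical across beams.
layer = (["k0", "k1", "k2"], ["v0", "v1", "v2"], ["ek"] * 3, ["ev"] * 3)
past = (layer,)
new_past = reorder_cache_sketch(past, beam_idx=[2, 2, 0])

print(new_past[0][0])              # reordered self-attention keys
print(new_past[0][2] is layer[2])  # True: same object, nothing was copied
```

Because the encoder-derived entries are the same for every beam of a given audio sample, reordering them would only produce identical copies, which is the overhead the patch removes.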

### Kernl optimization

Warmup takes around 12 minutes, most of it spent on CPU by the PyTorch 2.0 dynamo module capturing the graph (down from 50 min with the previous version of Kernl 🤯).
We plan to support dynamo's dynamic shape mode; preliminary benchmarks show a 5X faster warmup on Whisper large (what remains is mostly Triton autotuning).

Note that fewer than 2% of the outputs differ from those of the original model.
Manual inspection shows that the differences are mostly small, many amounting to a single different token in the whole transcription.
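
The manual inspection can be partly automated. The sketch below counts how many words differ between two transcriptions using `difflib` from the standard library. It works on whitespace-separated words rather than Whisper tokens, and both the function name `count_word_diffs` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def count_word_diffs(reference: str, candidate: str) -> int:
    """Count words that differ between two transcriptions."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    diffs = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, inserted or deleted span counts for its longest side.
            diffs += max(i2 - i1, j2 - j1)
    return diffs


# Made-up pair of transcriptions that differ by a single word.
original = "he was in a fevered state of mind"
optimized = "he was in a fevered state of mine"
print(count_word_diffs(original, optimized))
```

Applied to each pair of differing transcriptions, such a count gives a rough view of how large the divergences are.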

The differences arise because our optimized kernels are mathematically equivalent to those of the original model but may not perform operations in the same order as PyTorch kernels. As every operation on float tensors rounds its result, the order of operations matters, even for operations that are commutative in exact arithmetic.

Moreover, our fused kernels accumulate in fp32, which is not possible when chaining PyTorch kernels that exchange fp16 tensors.
When two tokens have very similar scores, these rounding differences matter, which is why outputs differ in a few cases.
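
Both effects can be reproduced with the standard library alone. In this minimal sketch, the `"e"` and `"f"` formats of `struct` round a Python float to half and single precision. The helper names and input values are illustrative assumptions, not Kernl code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_fp16(values):
    """Accumulate in fp16: the running sum is rounded after each addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_fp32_then_fp16(values):
    """Accumulate in fp32 and round to fp16 only once, at the end."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + to_fp16(v))
    return to_fp16(acc)


values = [4096.0] + [1.0] * 8

big_first = sum_fp16(values)                    # 4096.0: each +1.0 is rounded away
small_first = sum_fp16(list(reversed(values)))  # 4104.0: same numbers, other order
fp32_acc = sum_fp32_then_fp16(values)           # 4104.0: fp32 keeps the small terms
print(big_first, small_first, fp32_acc)
```

Between 4096 and 8192, consecutive fp16 values are 4 apart, so adding 1.0 to 4096.0 rounds back to 4096.0. Summing the small terms first, or keeping the accumulator in fp32, preserves them.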

In [7]:
optimize_model(model.model.decoder)
nb_diff = 0
timings_optimized = list()
with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    start = time.time()
    model.generate(inputs_warmup, min_length=max_len, max_length=max_len, num_beams=num_beams, do_sample=False)
    torch.cuda.synchronize()
    print(f"time to warmup: {(time.time() - start)/60:.2f}min")
    for original_model_transcription, audio in zip(transcriptions, audio_dataset):
        inputs = get_tokens(audio)
        torch.cuda.synchronize()
        start = time.time()
        predicted_ids = model.generate(inputs, min_length=1, max_length=max_len, num_beams=num_beams, do_sample=False)
        torch.cuda.synchronize()
        timings_optimized.append(time.time() - start)
        optimized_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]
        nb_diff += original_model_transcription != optimized_transcription

original_mins = sum(timings_original)/60
optimized_mins = sum(timings_optimized)/60
speedup = original_mins / optimized_mins
print(f"Kernl speedup: {speedup:.1f}X ({optimized_mins:.1f} VS {original_mins:.1f} min)")
print(f"# different outputs: {nb_diff}/{len(audio_dataset)} ({nb_diff / len(audio_dataset) * 100:.2f}%)")

print("\nmemory footprint:")
print(f"* allocated: {torch.cuda.memory_allocated(0) / 1024 / 1024 / 1024:.1f}GB")
print(f"* reserved: {torch.cuda.memory_reserved(0) / 1024 / 1024 / 1024:.1f}GB")
print(f"* max reserved: {torch.cuda.max_memory_reserved(0) / 1024 / 1024 / 1024:.1f}GB")

time to warmup: 11.60min
Kernl speedup: 2.4X (20.3 VS 48.7 min)
# different outputs: 37/2620 (1.41%)

memory footprint:
* allocated: 10.9GB
* reserved: 13.4GB
* max reserved: 13.9GB
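
The speedup reported above is a ratio of total times. To look at per-sample latency as well, the two timing lists could be summarized as in the sketch below. The helper names `summarize` and `speedup` are hypothetical, and the short lists are made-up stand-ins for `timings_original` and `timings_optimized`.

```python
import statistics


def summarize(timings):
    """Summary statistics for a list of per-sample latencies in seconds."""
    return {
        "total_min": sum(timings) / 60,
        "mean_s": statistics.fmean(timings),
        "median_s": statistics.median(timings),
        # Last of the 19 cut points splitting the data into 20 groups (95th percentile).
        "p95_s": statistics.quantiles(timings, n=20)[-1],
    }


def speedup(baseline, optimized):
    """Ratio of total baseline time to total optimized time."""
    return sum(baseline) / sum(optimized)


# Made-up latencies, for illustration only.
baseline_timings = [1.2, 0.9, 1.5, 1.0]
optimized_timings = [0.5, 0.4, 0.6, 0.5]

print(summarize(optimized_timings))
print(f"speedup: {speedup(baseline_timings, optimized_timings):.1f}X")
```

The median and the 95th percentile are less sensitive than the total to a few unusually long audio samples.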
