# Whisper benchmark

To measure the speedup of `Kernl` on the Whisper model, we use the `test` split of `librispeech` on a single NVIDIA GPU (the `nvidia-smi` output below reports an A100 with 40 GB for this run).

We use the following setup:
- `openai/whisper-large-v2` weights
- beam search with 5 beams (as advised in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf))
- optimization applied to the decoder only, because the encoder accounts for very little of the end-to-end latency
- use CUDA graphs (to remove most of the CPU overhead)

## Dependencies

Loading `librispeech` requires a few extra dependencies.

In [1]:
! pip install datasets soundfile librosa -q
# on Docker, you may want to install libsndfile1-dev through apt

In [2]:
! nvidia-smi

Fri Jan 27 12:43:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

## Import and setup

In [3]:
import time

import torch
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

from kernl.model_optimization import optimize_model


torch.set_float32_matmul_precision("high")
# torchdynamo.config.cache_size_limit = 512
# torchdynamo.config.dynamic_shapes = True
max_len = 50  # we do not expect more than 50 tokens per audio.
num_beams = 5
model_name = "openai/whisper-large-v2"  # "openai/whisper-tiny"



## Load data & model

We define a small helper that converts an audio sample into the input features expected by Whisper.

In [4]:
# audio_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")  # small dataset for tests
audio_dataset = load_dataset("librispeech_asr", "clean", split="test")


def get_tokens(item: dict[str, dict]) -> torch.Tensor:
    tensor = processor(item["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features
    return tensor.cuda()


processor = WhisperProcessor.from_pretrained(model_name)
inputs_warmup = get_tokens(audio_dataset[0])

model = WhisperForConditionalGeneration.from_pretrained(model_name).to("cuda").eval()

## Baseline

Measurements are done on the model running in mixed precision (`FP16`).
We save each model output so that we can check the impact of the optimizations on quality.

In [5]:
timings_original = list()
transcriptions = list()
with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    # warmup
    model.generate(inputs_warmup, min_length=max_len, max_length=max_len, num_beams=num_beams, do_sample=False)
    torch.cuda.synchronize()
    for audio in audio_dataset:
        inputs = get_tokens(audio)
        torch.cuda.synchronize()
        start = time.time()
        predicted_ids = model.generate(inputs, min_length=1, max_length=max_len, num_beams=num_beams, do_sample=False)
        torch.cuda.synchronize()
        timings_original.append(time.time() - start)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]
        transcriptions.append(transcription)

assert len(audio_dataset) == len(transcriptions) == len(timings_original)
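
Because the latency of each sample is stored separately, the timings can be summarized beyond a simple total. The following is a minimal sketch that is not part of the original benchmark: `summarize_timings` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math
import statistics


def summarize_timings(timings: list[float]) -> dict[str, float]:
    """Summarize per-sample latencies given in seconds."""
    ordered = sorted(timings)
    # nearest-rank 95th percentile
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "total_min": sum(ordered) / 60,
        "mean_s": statistics.mean(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }


# made-up latencies, for illustration only
example_timings = [0.8, 0.9, 1.0, 1.1, 4.0]
summary = summarize_timings(example_timings)
print(summary)
```

Applied to `timings_original`, the median and the 95th percentile would show whether a few long audio files dominate the total.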

## Optimized model

### Hugging Face implementation

First, we fix a small inefficiency in the Hugging Face library.  
The fix avoids unnecessary copies of the encoder tensors held in the K/V cache.

The impact on inference speed is limited; the main benefit is a smaller memory footprint.  
In the past, PyTorch 2.0 nightlies had many memory leaks, and this fix was mandatory to avoid OOM errors.  
FWIW, all the memory leaks we found during our experiments have been fixed in recent PyTorch versions.  

In [6]:
@staticmethod
def fix_reorder_cache(past, beam_idx):
    reordered_past = ()
    for layer_past in past:
        reordered_past += (
            tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
        )
    return reordered_past


WhisperForConditionalGeneration._reorder_cache = fix_reorder_cache
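
To see what the patch changes, here is a toy sketch that uses plain Python lists instead of tensors. It assumes the per-layer layout implied by the slicing in `fix_reorder_cache` (two decoder self-attention entries followed by the encoder-derived entries); `select_rows` is a hypothetical stand-in for `index_select(0, beam_idx)`, and all values are made up.

```python
def select_rows(rows: list, beam_idx: list[int]) -> list:
    """Hypothetical stand-in for `tensor.index_select(0, beam_idx)`: builds a new list."""
    return [rows[i] for i in beam_idx]


def reorder_cache_sketch(past: tuple, beam_idx: list[int]) -> tuple:
    reordered_past = ()
    for layer_past in past:
        # reorder (and therefore copy) only the 2 decoder self-attention entries,
        # and pass the remaining encoder-derived entries through untouched
        reordered_past += (
            tuple(select_rows(state, beam_idx) for state in layer_past[:2]) + layer_past[2:],
        )
    return reordered_past


# one layer, 3 beams: (self-attn key, self-attn value, cross-attn key, cross-attn value)
layer = (["k0", "k1", "k2"], ["v0", "v1", "v2"], ["enc_k"] * 3, ["enc_v"] * 3)
past = (layer,)
new_past = reorder_cache_sketch(past, beam_idx=[2, 0, 0])

print(new_past[0][0])  # self-attention keys follow the selected beams
print(new_past[0][2] is layer[2])  # the encoder-derived entry is the very same object
```

The identity check on the last line is the point of the fix: the encoder-derived entries are reused as-is instead of being copied at every generation step.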

### Kernl optimization

Warmup takes around 12 minutes (13.5 min in the run below), most of it spent on the CPU by PyTorch 2.0 dynamo capturing graphs (down from 50 min with the previous version of Kernl 🤯).
We plan to support the dynamic shape mode of dynamo: preliminary benchmarks show a 5X faster warmup on Whisper large (what remains is mostly Triton autotuning).

Note that < 2% of the outputs differ from those of the original model.
Manual inspection shows that the differences are mostly small, many being only 1 or 2 different tokens in the whole transcription.
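
One way to quantify how small such differences are is to count the words that changed between two transcriptions. The sketch below uses Python's standard `difflib`; `count_word_changes` is a hypothetical helper, and the two sentences are invented examples, not outputs of the benchmark. It counts whitespace-separated words rather than model tokens, so it is only a rough proxy.

```python
import difflib


def count_word_changes(reference: str, candidate: str) -> int:
    """Count words replaced, inserted or deleted between two transcriptions."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    changes = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # a changed span counts for its longest side
            changes += max(i2 - i1, j2 - j1)
    return changes


# invented transcriptions, for illustration only
original = "he hoped there would be stew for dinner"
optimized = "he hoped there would be stew for diner"
nb_changes = count_word_changes(original, optimized)
print(nb_changes)
```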

This happens because of rounding on float tensors when 2 tokens get very similar scores from Whisper.
Keep in mind that floating-point operations are not truly associative: the order of execution always matters because of rounding, even when 2 expressions are mathematically strictly equivalent (which is the case with our optimized kernels). Moreover, because we fuse operations, our kernels tend to be more precise than eager PyTorch: we accumulate in fp32 where eager PyTorch does it in fp16 most of the time.
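
Both effects can be reproduced with the standard library alone. This is a minimal sketch under stated assumptions, not the Kernl kernels: `to_fp16` is a hypothetical helper that rounds to half precision through `struct`, and Python's double precision floats stand in for the wider accumulator.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# 1) grouping changes the result, even in double precision
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)

# 2) accumulating in a narrow format loses more than accumulating in a wide one
values = [to_fp16(0.01)] * 10_000
wide_total = sum(values)  # accumulated in double precision (stand-in for fp32)

acc_fp16 = 0.0
for v in values:
    acc_fp16 = to_fp16(acc_fp16 + v)  # rounded to half precision after every add

print(wide_total, acc_fp16)
```

Once the half precision accumulator is large enough, each small addend is rounded away and the sum stops growing, which is why the accumulator width affects the final scores.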

In the end, the outputs are very similar, and that's what matters :-)

In [7]:
optimize_model(model.model.decoder)
nb_diff = 0
timings_optimized = list()
with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    start = time.time()
    model.generate(inputs_warmup, min_length=max_len, max_length=max_len, num_beams=num_beams, do_sample=False)
    torch.cuda.synchronize()
    print(f"time to warmup: {(time.time() - start)/60:.2f}min")
    for original_model_transcription, audio in zip(transcriptions, audio_dataset):
        inputs = get_tokens(audio)
        torch.cuda.synchronize()
        start = time.time()
        predicted_ids = model.generate(inputs, min_length=1, max_length=max_len, num_beams=num_beams, do_sample=False)
        torch.cuda.synchronize()
        timings_optimized.append(time.time() - start)
        optimized_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0]
        nb_diff += original_model_transcription != optimized_transcription

original_mins = sum(timings_original) / 60
optimized_mins = sum(timings_optimized) / 60
speedup = original_mins / optimized_mins
print(f"Kernl speedup: {speedup:.1f}X ({optimized_mins:.1f} VS {original_mins:.1f} min)")
print(f"# different outputs: {nb_diff}/{len(audio_dataset)} ({nb_diff / len(audio_dataset) * 100:.2f}%)")

print("\nmemory footprint:")
print(f"* allocated: {torch.cuda.memory_allocated(0) / 1024 / 1024 / 1024:.1f}GB")
print(f"* reserved: {torch.cuda.memory_reserved(0) / 1024 / 1024 / 1024:.1f}GB")
print(f"* max reserved: {torch.cuda.max_memory_reserved(0) / 1024 / 1024 / 1024:.1f}GB")

time to warmup: 13.50min
Kernl speedup: 2.3X (17.1 VS 39.0 min)
# different outputs: 37/2620 (1.41%)

memory footprint:
* allocated: 10.9GB
* reserved: 13.4GB
* max reserved: 13.9GB
