# Step-by-step guide to running Whisper inference with vLLM
## Install required packages

In [None]:
pip install -r requirements.txt

## Import the necessary libraries

In [None]:
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
import time
from vllm import LLM, SamplingParams
from transformers import WhisperTokenizerFast
from pathlib import Path
from librosa import resample, load

import numpy as np

## Support functions

Since Whisper processes at most 30 seconds of audio at a time, longer recordings need to be split into chunks.

An example chunking function:

In [6]:
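def chunking(audio: np.ndarray, sample_rate: int) -> list[np.ndarray]:
    """Split audio into 30-second chunks, zero-padding the last one.

    Args:
        audio (np.ndarray): 1-D audio data
        sample_rate (int): Audio sample rate in Hz

    Returns:
        list[np.ndarray]: Audio chunks of 30 seconds each
    """
    # Work with integers: np.pad and np.split expect integer sizes
    max_duration_samples = int(sample_rate * 30)
    # Pad only up to the next multiple of 30 s (no padding if already a multiple)
    padding = (-len(audio)) % max_duration_samples
    audio = np.pad(audio, (0, padding), 'constant', constant_values=0.0)
    # max(1, ...) keeps np.split valid for empty input
    return np.split(audio, max(1, len(audio) // max_duration_samples))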
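
To make the padding arithmetic concrete, here is a small sketch that uses plain integers only. The sample count 889352 is taken from the `sample.wav` output shown later in this guide; the helper name `padding_and_chunks` is hypothetical and only mirrors what a chunking function has to compute.

```python
def padding_and_chunks(n_samples: int, sample_rate: int = 16000, max_seconds: int = 30):
    """Return (padding, n_chunks) needed to cut n_samples into full-length chunks."""
    chunk_len = sample_rate * max_seconds        # 480000 samples at 16 kHz
    padding = (-n_samples) % chunk_len           # 0 when n_samples is already a multiple
    n_chunks = (n_samples + padding) // chunk_len
    return padding, n_chunks


print(padding_and_chunks(889352))  # 55.58 s of audio -> (70648, 2)
print(padding_and_chunks(480000))  # exactly 30 s     -> (0, 1)
```

Note that a recording whose length is an exact multiple of 30 seconds needs no padding at all, so no extra chunk of silence should be produced.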


## Main function

Next, we define the vLLM engine.

In [None]:
whisper = LLM(
        model="openai/whisper-large-v3-turbo",
        limit_mm_per_prompt={"audio": 1},
        gpu_memory_utilization = 0.5,
        dtype = "float16",
        max_num_seqs = 4,
        max_num_batched_tokens=448
    )

Let's look at the arguments we used:

1. `model` - the model to run; it can be any Whisper checkpoint from the Hugging Face Hub.
2. `limit_mm_per_prompt` - always `{"audio": 1}`, because Whisper processes only one audio input per prompt.
3. `gpu_memory_utilization` - the fraction of GPU memory that vLLM may use for model inference and KV caching.
4. `dtype` - the data type of the model weights; `float16` or `bfloat16` is recommended on GPU (`bfloat16` needs hardware support, for example the RTX 30xx series).
5. `max_num_seqs` - the maximum number of sequences processed together, which is effectively the batch size.
6. `max_num_batched_tokens` - the maximum number of tokens that can be processed in a batch. This is a hard limit and should be set to 448 for Whisper, which matches the model's maximum decoder length for a 30-second audio window.

For flexibility, we also want to set the language code dynamically. Without it, Whisper may try to translate the audio into English. See the [Whisper documentation](https://platform.openai.com/docs/guides/speech-to-text/supported-languages/#supported-languages) for the list of supported languages.

<details>
<summary>The full list with codes</summary>

    'afrikaans': 'af',
    'arabic': 'ar',
    'armenian': 'hy',
    'azerbaijani': 'az',
    'belarusian': 'be',
    'bosnian': 'bs',
    'bulgarian': 'bg',
    'catalan': 'ca',
    'chinese': 'zh',
    'croatian': 'hr',
    'czech': 'cs',
    'danish': 'da',
    'dutch': 'nl',
    'english': 'en',
    'estonian': 'et',
    'finnish': 'fi',
    'french': 'fr',
    'galician': 'gl',
    'german': 'de',
    'greek': 'el',
    'hebrew': 'he',
    'hindi': 'hi',
    'hungarian': 'hu',
    'icelandic': 'is',
    'indonesian': 'id',
    'italian': 'it',
    'japanese': 'ja',
    'kannada': 'kn',
    'kazakh': 'kk',
    'korean': 'ko',
    'latvian': 'lv',
    'lithuanian': 'lt',
    'macedonian': 'mk',
    'malay': 'ms',
    'maori': 'mi',
    'marathi': 'mr',
    'nepali': 'ne',
    'norwegian': 'no',
    'persian': 'fa',
    'polish': 'pl',
    'portuguese': 'pt',
    'romanian': 'ro',
    'russian': 'ru',
    'serbian': 'sr',
    'slovak': 'sk',
    'slovenian': 'sl',
    'spanish': 'es',
    'swahili': 'sw',
    'swedish': 'sv',
    'tagalog': 'tl',
    'tamil': 'ta',
    'thai': 'th',
    'turkish': 'tr',
    'ukrainian': 'uk',
    'urdu': 'ur',
    'vietnamese': 'vi',
    'welsh': 'cy'

</details>

In [9]:
tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-large-v3-turbo", language="en")

In [24]:
language = "ko"
lang_code = tokenizer.convert_tokens_to_ids(f"<|{language}|>")
# Unknown tokens map to the tokenizer's unk id (50257 for this model)
if lang_code == tokenizer.unk_token_id:
    raise ValueError(f"Language code for {language} not found")

lang_code

50264

This example assumes that the audio samples are located in the `samples` folder and have the `wav` extension.

In [14]:
audio_files = Path("samples").glob("*.wav")

Whisper expects audio with a 16000 Hz sample rate, so we need to resample other audio first.

In [None]:
samples = {}

for file in list(audio_files):
    # Load the audio file at its native sample rate (sr=None disables automatic resampling)
    audio, sample_rate = load(file, sr=None)
    if sample_rate != 16000:
        # Use librosa to resample the audio to 16 kHz (audio is already a NumPy array)
        audio = resample(audio, orig_sr=sample_rate, target_sr=16000)
        sample_rate = 16000
    print(f"File: {file}, Sample rate: {sample_rate}, Audio shape: {audio.shape}, Duration: {audio.shape[0] / sample_rate:.2f} seconds")
    chunks = chunking(audio, 16000)
    samples[file.stem] = [(chunk,16000) for chunk in chunks]

File: samples/sample.wav, Sample rate: 16000, Audio shape: (889352,), Duration: 55.58 seconds
File: samples/sample2.wav, Sample rate: 16000, Audio shape: (1008610,), Duration: 63.04 seconds
File: samples/sample3.wav, Sample rate: 16000, Audio shape: (596160,), Duration: 37.26 seconds
File: samples/sample4.wav, Sample rate: 16000, Audio shape: (1848960,), Duration: 115.56 seconds


Important notice:

vLLM expects to receive a `tuple[np.ndarray, int]`, where the first element is the audio data and the second is the sample rate. That is exactly what `samples[file.stem] = [(chunk,16000) for chunk in chunks]` does.
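
A quick sanity check before inference can catch malformed items early. The sketch below is only an illustration: `check_audio_item` is a hypothetical helper, and the 16 kHz and 30-second checks reflect the preprocessing used in this guide rather than anything enforced by vLLM itself.

```python
import numpy as np

SAMPLE_RATE = 16000              # sample rate used throughout this guide
MAX_SAMPLES = SAMPLE_RATE * 30   # one 30-second chunk


def check_audio_item(item) -> None:
    """Raise if item is not a (1-D np.ndarray, sample_rate) pair of at most 30 s."""
    if not (isinstance(item, tuple) and len(item) == 2):
        raise TypeError("expected an (audio, sample_rate) tuple")
    audio, sample_rate = item
    if not isinstance(audio, np.ndarray) or audio.ndim != 1:
        raise TypeError("audio must be a 1-D np.ndarray")
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz, got {sample_rate}")
    if len(audio) > MAX_SAMPLES:
        raise ValueError("chunk is longer than 30 seconds")


# A full-length chunk of silence passes the check
check_audio_item((np.zeros(MAX_SAMPLES, dtype=np.float32), SAMPLE_RATE))
```

Running it over every item in `samples` before the inference loop turns a confusing runtime failure into an explicit error message.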

## Inference loop

Now we are ready to process the audio files.

The inference loop can look as follows:

In [None]:
for file, chunks in samples.items():
        
    prompts = [{
                "encoder_prompt": {
                    "prompt": "",
                    "multi_modal_data": {
                        "audio": chunk,
                    },
                },
                # Use the language string (e.g. "ko"), not the numeric token id in lang_code
                "decoder_prompt":
                f"<|startoftranscript|><|{language}|><|transcribe|><|notimestamps|>"
            } for chunk in chunks]
    print(f"File: {file}, Chunks: {len(chunks)}")
    # Create a sampling params object.
    sampling_params = SamplingParams(
        temperature=0,
        top_p=1.0,
        max_tokens=8192,
    )

    start = time.time()

    # Run inference one chunk at a time
    outputs = []
    for i in range(0, len(prompts)):
        output = whisper.generate(prompts[i], sampling_params=sampling_params)
        outputs.extend(output)
    # Print the outputs.
    generated = ""
    for output in outputs:
        prompt = output.prompt
        encoder_prompt = output.encoder_prompt
        generated_text = output.outputs[0].text
        generated+= generated_text
        print(f"Encoder prompt: {encoder_prompt!r}, "
            f"Decoder prompt: {prompt!r}, "
            f"Generated text: {generated_text!r}")

    duration = time.time() - start

    print("Duration:", duration)
    # Requests per second: number of processed prompts divided by wall-clock time
    print("RPS:", len(prompts) / duration)
    print("Generated text:", generated)

Let's check it step by step.

### Prompting
First, we need to build the query according to the vLLM prompting format. There are two possible ways to do this for Whisper:
1. We can use a single prompt that carries both the text prompt and the audio data:
```python
prompts = [
    {
        "prompt": f"<|startoftranscript|><|{language}|><|transcribe|><|notimestamps|>",
        "multi_modal_data": {
            "audio": chunk,
        }
    } for chunk in chunks
]
```
2. We can use separate encoder and decoder prompts, as shown in the inference loop above. In that case the encoder receives the audio data and the decoder receives the text prompt.

The first method is simpler and clearer, but the second is still valid.

The text prompt `<|startoftranscript|><|{language}|><|transcribe|><|notimestamps|>` consists of 4 tokens:

1. `<|startoftranscript|>` is always present.
2. `<|{language}|>` - determines the language of the transcription. It takes the language string (not the numeric id stored in `lang_code`), so when actually passed to the model it looks like `<|en|>` or `<|de|>`.
3. `<|transcribe|>` - determines the task. The alternative is `<|translate|>`, which enables translation from the source language to English. Similar behaviour can also occur when the language token is not specified.
4. `<|notimestamps|>` is an optional token. If it is not provided, Whisper also inserts timestamps for the recognized speech segments, which might be useful in certain scenarios.
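
The four tokens above can be assembled with a small helper. This is a minimal sketch: `build_decoder_prompt` and its parameters are hypothetical names introduced for illustration, and the function only concatenates the special tokens described in the list.

```python
def build_decoder_prompt(language: str, task: str = "transcribe",
                         timestamps: bool = False) -> str:
    """Assemble a Whisper decoder prompt from its special tokens."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    parts = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask Whisper not to emit timestamp tokens
        parts.append("<|notimestamps|>")
    return "".join(parts)


print(build_decoder_prompt("ko"))
# <|startoftranscript|><|ko|><|transcribe|><|notimestamps|>
print(build_decoder_prompt("de", task="translate", timestamps=True))
# <|startoftranscript|><|de|><|translate|>
```

Keeping the prompt construction in one place makes it easy to switch language or task per file without editing the f-string inside the loop.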

Notice: vLLM expects `chunk` in `"multi_modal_data": {"audio": chunk}` to be a tuple `(np.ndarray, sample_rate)`.

### Sampling parameters

Usually, the sampling parameters are set to the following values:
- `temperature`: 0
- `top_p`: 1.0

`max_tokens` is optional, but I recommend setting it to something like 4096 or 8192. The reason is that vLLM's default value is very small (16 tokens), which is not enough for Whisper to fully transcribe a 30-second chunk.

### Processing loop

The processing loop is quite simple. The example uses a plain chunk-by-chunk implementation, but it is also possible to pass batched input like this:

```python
max_num_seqs = 4  # same value as passed to LLM(...)
for i in range(0, len(prompts), max_num_seqs):
    output = whisper.generate(prompts[i:i+max_num_seqs], sampling_params=sampling_params)
    outputs.extend(output)
```

This way, you can increase your throughput even further.
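
To see how the slicing groups the prompts, here is a standalone sketch with integers standing in for prompts. The `batched` helper is a hypothetical name, not part of vLLM; it only reproduces the `range(0, len(prompts), max_num_seqs)` pattern.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python, so the last batch may be shorter
        yield list(items[start:start + batch_size])


chunk_ids = list(range(10))  # stand-ins for 10 prompts
for batch in batched(chunk_ids, 4):
    print(batch)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]
```

The last batch simply contains the remaining items, so no padding of the prompt list is needed.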