# Learn OpenAI Whisper - Chapter 7

## Notebook 2: Quantizing Distil-Whisper with OpenVINO

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VbveJKfwuNS_P7-myiKq_d_IH-svQ5es)

![ch07_2-quantizing-distil-whisper-openvino.png](https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter07/ch07_2-quantizing-distil-whisper-openvino.png)

In this tutorial, we explore how to use [Distil-Whisper](https://huggingface.co/distil-whisper/distil-large-v2), a distilled variant of OpenAI's [Whisper](https://huggingface.co/openai/whisper-large-v2) model, in combination with OpenVINO for efficient and accurate automatic speech recognition (ASR). Distil-Whisper, proposed in the paper [Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430), offers significant speed improvements and parameter reduction while maintaining comparable performance to the original Whisper model.

We delve into the architecture of Whisper, a Transformer-based encoder-decoder model that maps audio spectrogram features to text tokens. The process involves converting raw audio inputs to a log-Mel spectrogram, encoding the spectrogram using a Transformer encoder, and autoregressively predicting text tokens using a decoder.

To simplify the integration of Distil-Whisper with OpenVINO, we utilize the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library for loading the pre-trained model and the [Hugging Face Optimum](https://huggingface.co/docs/optimum) library for converting the model to OpenVINO™ IR format. Additionally, we apply `INT8` post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) to further improve the performance of the OpenVINO Distil-Whisper model.

This tutorial covers the following topics:

1. [Prerequisites](#Prerequisites): Setting up the necessary dependencies and environment.
2. [Loading PyTorch model](#Loading-PyTorch-model): Loading the Distil-Whisper model using PyTorch.
   - [Preparing input sample](#Preparing-input-sample): Preparing an audio input sample for inference.
   - [Running model inference](#Running-model-inference): Running inference on the PyTorch model.
3. [Loading OpenVINO model using Optimum library](#Loading-OpenVINO-model-using-Optimum-library): Converting the model to OpenVINO format using the Optimum library.
   - [Selecting Inference device](#Selecting-Inference-device): Choosing the appropriate inference device.
   - [Compiling OpenVINO model](#Compiling-OpenVINO-model): Compiling the OpenVINO model for optimal performance.
   - [Running OpenVINO model inference](#Running-OpenVINO-model-inference): Running inference on the OpenVINO model.
4. [Comparing performance PyTorch vs OpenVINO](#Comparing-performance-PyTorch-vs-OpenVINO): Benchmarking the performance of the PyTorch and OpenVINO models.
   - [Comparing with OpenAI Whisper](#Comparing-with-OpenAI-Whisper): Comparing Distil-Whisper with the original OpenAI Whisper model.
5. [Using OpenVINO model with HuggingFace pipelines](#Using-OpenVINO-model-with-HuggingFace-pipelines): Integrating the OpenVINO model with HuggingFace pipelines for seamless usage.
6. [Quantizing](#Quantizing): Applying post-training quantization to further optimize the model.
   - [Preparing calibration datasets](#Preparing-calibration-datasets): Preparing datasets for quantization calibration.
   - [Quantizing Distil-Whisper encoder and decoder models](#Quantizing-Distil-Whisper-encoder-and-decoder-models): Quantizing the encoder and decoder components of the Distil-Whisper model.
   - [Running quantized model inference](#Runnig-quantized-model-inference): Running inference on the quantized model.
   - [Comparing performance and accuracy of the original and quantized models](#Comparing-performance-and-accuracy-of-the-original-and-quantized-models): Evaluating the impact of quantization on performance and accuracy.
7. [Running interactive demo](#Running-interactive-demo): Exploring an interactive demonstration of the Distil-Whisper model.

By the end of this tutorial, you will have a comprehensive understanding of how to leverage Distil-Whisper and OpenVINO for efficient and accurate automatic speech recognition. Let's get started!

## Prerequisites
[back to top ⬆️](#Table-of-contents:)

Before diving into the tutorial, ensure that you have the necessary prerequisites in place. This includes authenticating with the Hugging Face Hub using your token and verifying the authentication by running the provided code cells. These steps are crucial for accessing the required models and datasets throughout the tutorial.



In [None]:
# Enter your token in the output prompt below and hit enter on Windows or return on Mac
!huggingface-cli login

In [None]:
# Verify authentication
from huggingface_hub import whoami
whoami()
# you should see something like {'type': 'user',  'id': '...',  'name': 'Wauplin', ...}

In [None]:
%pip install -q sentence-transformers==2.3.1
%pip install -q "transformers==4.37.2"
%pip install -q onnx "git+https://github.com/huggingface/optimum-intel.git" "peft==0.6.2" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "openvino>=2023.2.0" datasets  "gradio>=4.0" "librosa" "soundfile"
%pip install -q "nncf>=2.6.0" "jiwer"

![Restart_the_runtime.png](https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter07/Restart_the_runtime.png)

## Loading the PyTorch model
[back to top ⬆️](#Table-of-contents:)

Loading the PyTorch Whisper model is straightforward with the `transformers` library: the `AutoModelForSpeechSeq2Seq.from_pretrained` method initializes the model. The model is chosen with the dropdown widgets below. Note that the model selector defaults to the third entry of the selected family, which for Distil-Whisper is `distil-whisper/distil-small.en`; pick `distil-whisper/distil-large-v2` or `distil-whisper/distil-medium.en` if you prefer a larger variant. The model is downloaded during the first run, which may take some time.

All of these models are available in the [Distil-Whisper Hugging Face collection](https://huggingface.co/collections/distil-whisper/distil-whisper-models-65411987e6727569748d2eb6). Models of the original Whisper architecture are also accessible, and you can explore them further [here](https://huggingface.co/openai).

It's important to highlight the significance of preprocessing and post-processing in the model's usage. The `AutoProcessor` class, specifically the `WhisperProcessor`, plays a crucial role in preparing the audio input data for the model. It handles tasks such as converting the audio to a Mel-spectrogram representation and decoding the predicted output token IDs back into a string using the tokenizer.

To ensure a smooth and efficient workflow, the `AutoProcessor` class streamlines the preprocessing and post-processing steps, allowing you to focus on the core functionality of the Whisper model. By leveraging this class, you can easily integrate the Whisper model into your speech recognition pipeline, regardless of the specific model variant you choose.

In [None]:
import ipywidgets as widgets

model_ids = {
    "Distil-Whisper": [
        "distil-whisper/distil-large-v2",
        "distil-whisper/distil-medium.en",
        "distil-whisper/distil-small.en"
    ],
    "Whisper": [
        "openai/whisper-large-v3",
        "openai/whisper-large-v2",
        "openai/whisper-large",
        "openai/whisper-medium",
        "openai/whisper-small",
        "openai/whisper-base",
        "openai/whisper-tiny",
        "openai/whisper-medium.en",
        "openai/whisper-small.en",
        "openai/whisper-base.en",
        "openai/whisper-tiny.en",
    ]
}

model_type = widgets.Dropdown(
    options=model_ids.keys(),
    value="Distil-Whisper",
    description="Model type:",
    disabled=False,
)

model_type

In [None]:
model_id = widgets.Dropdown(
    options=model_ids[model_type.value],
    value=model_ids[model_type.value][2],
    description="Model:",
    disabled=False,
)

model_id

In [None]:
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained(model_id.value)

pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id.value)
pt_model.eval();

### Preparing the Input Sample
[back to top ⬆️](#Table-of-contents:)

To use the Whisper model for speech recognition, we need to properly prepare the input audio sample. The `WhisperProcessor` expects the audio data to be in the form of a NumPy array, along with information about the audio sampling rate. It then processes the audio and returns the `input_features` tensor, which is used for making predictions.

The conversion of the audio file to the required NumPy format is conveniently handled by the Hugging Face Datasets library. This library provides a seamless interface for loading and preprocessing audio data, making it easier to integrate with the Whisper model.

To prepare the input sample, the following code:

1. Loads the audio file using the Hugging Face Datasets library.
2. Extracts the audio data as a NumPy array and obtains the sampling rate.
3. Passes the audio array and sampling rate to the `WhisperProcessor`.
4. Retrieves the `input_features` tensor from the processor.
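
As a rough sanity check on step 4, the shape of `input_features` can be estimated from the feature extractor settings. The sketch below is plain arithmetic and does not call the processor. The four constants are assumptions based on commonly used Whisper feature-extractor defaults, not values stated in this notebook, and some checkpoints use a different number of mel bins.

```python
# Assumed Whisper feature-extractor defaults (not stated in this notebook).
SAMPLING_RATE = 16_000   # Hz
CHUNK_LENGTH_S = 30      # seconds; shorter audio is padded to this length
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # mel bins; some checkpoints use a different value


def expected_feature_shape(batch_size: int = 1) -> tuple:
    """Return the (batch, mel_bins, frames) shape implied by the values above."""
    n_samples = SAMPLING_RATE * CHUNK_LENGTH_S
    n_frames = n_samples // HOP_LENGTH
    return (batch_size, N_MELS, n_frames)


print(expected_feature_shape())  # (1, 80, 3000)
```

If the assumptions hold for the selected checkpoint, `input_features.shape` should match this tuple after the cell below has run.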

In [None]:
from datasets import load_dataset

def extract_input_features(sample):
    input_features = processor(
        sample["audio"]["array"],
        sampling_rate=sample["audio"]["sampling_rate"],
        return_tensors="pt",
    ).input_features
    return input_features

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]
input_features = extract_input_features(sample)

### Running Model Inference
[back to top ⬆️](#Table-of-contents:)

With the input sample prepared, we can now perform speech recognition using the Whisper model. The model provides a convenient `generate` interface that simplifies the inference process. Here's how you can run the model inference:

1. Pass the `input_features` tensor to the `generate` method of the Whisper model.
2. The model will process the input and generate the predicted token IDs.
3. Once the generation is complete, use the `processor.batch_decode` method to decode the predicted token IDs into human-readable text transcription.

The `generate` method handles the complex task of sequence generation, taking into account the model's architecture and the provided input features. It produces the predicted token IDs, which represent the transcribed text in an encoded format.

By leveraging the `generate` interface and the `processor.batch_decode` method, you can easily perform speech recognition with the Whisper model. The model takes care of the complex task of mapping the audio input to text output, while the processor handles the necessary decoding step to provide you with the final transcription.
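
To make the idea of autoregressive generation more concrete, here is a minimal sketch of a greedy decoding loop. It is not the implementation behind `generate`, which supports several decoding strategies and many options. The function `toy_next_token` and the token IDs are hypothetical stand-ins for a real decoder and vocabulary.

```python
from typing import Callable, List


def greedy_decode(
    next_token_fn: Callable[[List[int]], int],
    start_token: int,
    eos_token: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Append one token at a time until EOS or the token budget is reached."""
    tokens = [start_token]
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == eos_token:
            break
    return tokens


# Hypothetical stand-in for the decoder: emits 1, 2, 3 and then EOS (id 99).
def toy_next_token(tokens: List[int]) -> int:
    script = [1, 2, 3, 99]
    return script[len(tokens) - 1]


print(greedy_decode(toy_next_token, start_token=0, eos_token=99))  # [0, 1, 2, 3, 99]
```

In the real model, each step conditions on both the encoder output and the tokens generated so far, and `processor.batch_decode` turns the resulting IDs into text.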

In [None]:
import IPython.display as ipd

predicted_ids = pt_model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

display(ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]))
print(f"Reference: {sample['text']}")
print(f"Result: {transcription[0]}")

## Loading OpenVINO Model using Optimum Library
[back to top ⬆️](#Table-of-contents:)

To further optimize the performance of the Whisper model, we can leverage the Hugging Face Optimum API. This high-level API enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format, resulting in faster inference and reduced memory footprint. For more details, refer to the [Hugging Face Optimum documentation](https://huggingface.co/docs/optimum/intel/inference).

The Optimum Intel library provides a seamless way to load optimized models from the [Hugging Face Hub](https://huggingface.co/models) and create pipelines for running inference with the OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API-compatible with Hugging Face Transformers models, which means we only need to replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

To initialize the model class, we call the `from_pretrained` method. When downloading and converting the Transformers model, it's important to add the `export=True` parameter to ensure the model is properly converted to the OpenVINO format. After conversion, we can save the model for future use with the `save_pretrained` method.

One advantage of using the Optimum library is that the tokenizers and processors distributed with the models are also compatible with the OpenVINO model. This means we can reuse the previously initialized processor without any modifications.


In [None]:
from pathlib import Path
from optimum.intel.openvino import OVModelForSpeechSeq2Seq

model_path = Path(model_id.value.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id.value, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

### Selecting the Inference Device
[back to top ⬆️](#Table-of-contents:)

When running inference with the OpenVINO runtime, it's important to select the appropriate inference device based on your hardware capabilities and performance requirements. The OpenVINO toolkit supports a wide range of devices, including CPUs, GPUs, and specialized accelerators.

To select the inference device, you can use the `ov.Core` class from the `openvino` package. This class provides methods to query the available devices and set the desired device for inference.


In [None]:
import openvino as ov
import ipywidgets as widgets

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)

device

### Compiling the OpenVINO Model
[back to top ⬆️](#Table-of-contents:)

After selecting the inference device, the next step is to compile the OpenVINO model for optimal performance on the chosen device. Compiling the model involves applying device-specific optimizations and generating an executable representation of the model that can be efficiently run on the target device.

To compile the OpenVINO model, first assign the target device with the `to` method, then call the `compile` method of the `OVModelForSpeechSeq2Seq` class.


In [None]:
ov_model.to(device.value)
ov_model.compile()

### Running Inference with the OpenVINO Model
[back to top ⬆️](#Table-of-contents:)

With the OpenVINO model compiled and optimized for the selected inference device, we can now run inference to perform speech recognition. The process of running inference with the OpenVINO model is similar to the PyTorch model, but with the added benefits of improved performance and efficiency.

To run inference with the OpenVINO model, we follow these steps:

1. Prepare the input sample by extracting the audio features using the `WhisperProcessor`, as explained in the [Preparing input sample](#Preparing-input-sample) section.

2. Pass the input features to the `generate` method of the OpenVINO model, just like we did with the PyTorch model:

   ```python
   predicted_ids = ov_model.generate(input_features)
   ```

   The `generate` method takes the input features and performs the necessary computations using the optimized OpenVINO model to generate the predicted token IDs.

3. Decode the predicted token IDs into human-readable text using the `processor.batch_decode` method:

   ```python
   transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
   ```

   This step converts the predicted token IDs back into the corresponding text transcription, just like we did with the PyTorch model.


In [None]:
predicted_ids = ov_model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

display(ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]))
print(f"Reference: {sample['text']}")
print(f"Result: {transcription[0]}")

## Comparing Performance: PyTorch vs OpenVINO
[back to top ⬆️](#Table-of-contents:)

Now that we have successfully run inference with both the PyTorch and OpenVINO models, it's crucial to compare their performance to understand the benefits of using OpenVINO optimization. Performance comparison helps us evaluate the speed and efficiency improvements achieved by converting the model to the OpenVINO format and utilizing the optimized inference runtime.

To compare the performance of the PyTorch and OpenVINO models, we can measure and analyze the following metrics:

1. **Inference Time**: Measure the time each model takes to generate the transcription for an input audio sample. The `measure_perf` function in this notebook times only the `generate` call, so feature extraction and decoding of the token IDs are excluded. By comparing the inference times, we can determine which model offers faster execution.

2. **Throughput**: Evaluate the number of audio samples that each model can process per unit of time. Higher throughput indicates better efficiency and the ability to handle a larger volume of audio data in a given timeframe. OpenVINO optimization aims to improve throughput by leveraging hardware-specific optimizations and efficient resource utilization.

3. **Memory Usage**: Monitor the memory consumption of each model during inference. Efficient memory management is crucial, especially when dealing with limited resources or running inference on edge devices. OpenVINO optimization techniques, such as model compression and quantization, can help reduce memory usage without significant impact on accuracy.

4. **CPU and GPU Utilization**: Analyze the utilization of CPU and GPU resources during inference for both models. OpenVINO optimization aims to maximize the utilization of available hardware resources, enabling efficient parallel processing and reducing idle time. By comparing resource utilization, we can assess how effectively each model harnesses the capabilities of the underlying hardware.

To perform the comparison, you can use the `measure_perf` function defined in the notebook. It takes the model and an input sample, calls `generate` `n` times (10 by default), and returns the median generation time, which is less sensitive to outliers than the mean. Of the metrics listed above, this function covers only inference time; throughput, memory usage, and hardware utilization would need separate tooling.

```python
perf_torch = measure_perf(pt_model, sample)
perf_ov = measure_perf(ov_model, sample)
```

By comparing the performance metrics between the PyTorch and OpenVINO models, you can quantify the speed and efficiency improvements achieved through OpenVINO optimization. This comparison highlights the benefits of using OpenVINO for accelerating inference and making the most of the available hardware resources.
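
The timing approach can also be tried in isolation. The sketch below is a generic variant that works on any zero-argument callable and adds an untimed warm-up run, which is an assumption of this example and not part of the notebook's `measure_perf`. The two `time.sleep` workloads are hypothetical placeholders for the PyTorch and OpenVINO `generate` calls.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], n: int = 10, warmup: int = 1) -> float:
    """Return the median wall-clock time of `fn` in seconds over `n` timed runs."""
    for _ in range(warmup):
        fn()  # untimed runs absorb one-off costs such as lazy initialization
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Hypothetical workloads standing in for the two `generate` calls.
slow = median_latency(lambda: time.sleep(0.05), n=5)
fast = median_latency(lambda: time.sleep(0.005), n=5)
print(f"speedup: {slow / fast:.2f}x")
```

The speedup is the ratio of the two medians, the same ratio the notebook prints for `perf_torch / perf_ov`.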



In [None]:
import time
import numpy as np
from tqdm.notebook import tqdm


def measure_perf(model, sample, n=10):
    timers = []
    input_features = extract_input_features(sample)
    for _ in tqdm(range(n), desc="Measuring performance"):
        start = time.perf_counter()
        model.generate(input_features)
        end = time.perf_counter()
        timers.append(end - start)
    return np.median(timers)

In [None]:
perf_torch = measure_perf(pt_model, sample)
perf_ov = measure_perf(ov_model, sample)

In [None]:
print(f"Median torch {model_id.value} generation time: {perf_torch:.3f}s")
print(f"Median openvino {model_id.value} generation time: {perf_ov:.3f}s")
print(f"Performance {model_id.value} openvino speedup: {perf_torch / perf_ov:.3f}")

## Integrating the OpenVINO Model with Hugging Face Pipelines

One of the key advantages of using the OpenVINO model is its seamless compatibility with the Hugging Face [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) interface for `automatic-speech-recognition`. This compatibility allows us to leverage the powerful features and utilities provided by the Hugging Face ecosystem while benefiting from the performance optimizations of OpenVINO.

The `pipeline` interface is particularly useful for transcribing long audio files. Distil-Whisper employs a chunked algorithm that efficiently processes long-form audio by breaking it down into smaller segments. This chunked approach has been shown to be 9 times faster than the sequential algorithm proposed in the original OpenAI Whisper paper. To enable chunking, you can pass the `chunk_length_s` parameter to the pipeline, specifying the desired chunk length in seconds. For Distil-Whisper, a chunk length of 15 seconds has been found to be optimal.

In the following code cells, we create a pipeline for automatic speech recognition using the OpenVINO model (`ov_model`). We set the `generation_config` of the OpenVINO model to match that of the PyTorch model to ensure consistent behavior. The `tokenizer` and `feature_extractor` from the `processor` are also passed to the pipeline to handle the necessary preprocessing and postprocessing steps.

By specifying `max_new_tokens`, we limit the maximum number of tokens generated during transcription. The `chunk_length_s` parameter is set to 15 seconds to enable efficient chunking of long audio files. Additionally, we can activate batching by passing the `batch_size` argument, which allows the pipeline to process multiple audio samples concurrently.
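
To illustrate what chunking means in practice, the following simplified sketch computes the start and end times of overlapping chunks for a long recording. It is not the pipeline's own code. The `stride_s` value of 2.5 seconds (one sixth of the chunk length on each side) is an assumption about the default overlap, and the function name is hypothetical.

```python
from typing import List, Tuple


def chunk_boundaries(
    total_s: float, chunk_length_s: float = 15.0, stride_s: float = 2.5
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping chunks covering `total_s` seconds."""
    if not 0 <= 2 * stride_s < chunk_length_s:
        raise ValueError("stride must be non-negative and under half the chunk length")
    step = chunk_length_s - 2 * stride_s  # new audio contributed by each chunk
    boundaries = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        boundaries.append((start, end))
        if end >= total_s:
            break
        start += step
    return boundaries


print(chunk_boundaries(40.0))
# [(0.0, 15.0), (10.0, 25.0), (20.0, 35.0), (30.0, 40.0)]
```

Because the chunks are independent of each other, they can be grouped into batches, which is what the `batch_size` argument takes advantage of.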

Using the pipeline with the OpenVINO model offers several benefits:

1. **Simplified API**: The pipeline provides a high-level and intuitive interface for performing automatic speech recognition, abstracting away the complexities of the underlying model and preprocessing steps.

2. **Long Audio Support**: With the chunked algorithm and the ability to specify the chunk length, the pipeline can efficiently handle long audio files, making it suitable for various real-world scenarios.

3. **Batching**: By enabling batching through the `batch_size` parameter, the pipeline can process multiple audio samples simultaneously, improving overall throughput and utilization of hardware resources.

4. **Performance Optimization**: The OpenVINO model is optimized for inference and is intended to deliver faster transcription than the PyTorch model while maintaining comparable accuracy; the timing comparison earlier in this notebook lets you check the speedup on your own hardware.

By leveraging the Hugging Face pipeline with the OpenVINO model, you can easily integrate state-of-the-art speech recognition capabilities into your applications, benefiting from the ease of use, flexibility, and performance optimizations provided by this powerful combination.


In [None]:
from transformers import pipeline

ov_model.generation_config = pt_model.generation_config

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
)

In [None]:
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample_long = dataset[0]


def format_timestamp(seconds: float):
    """
    format time in srt-file expected format
    """
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    # SRT expects zero-padded, two-digit hours (HH:MM:SS,mmm)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


def prepare_srt(transcription):
    """
    Format transcription into srt file format
    """
    segment_lines = []
    for idx, segment in enumerate(transcription["chunks"]):
        segment_lines.append(str(idx + 1) + "\n")
        timestamps = segment["timestamp"]
        time_start = format_timestamp(timestamps[0])
        time_end = format_timestamp(timestamps[1])
        time_str = f"{time_start} --> {time_end}\n"
        segment_lines.append(time_str)
        segment_lines.append(segment["text"] + "\n\n")
    return segment_lines

### Obtaining Timestamps and Generating Subtitles
The `pipeline` interface provides a convenient way to obtain timestamps associated with each processed chunk of audio. By passing the `return_timestamps` argument to the pipeline, you can retrieve the start and end timestamps of the speech segments corresponding to each chunk.

Timestamps can be incredibly useful in various scenarios, such as:

1. **Speech Separation**: If you have an audio file containing multiple speakers, the timestamps can help you identify and separate the speech segments corresponding to each speaker. This information can be used to create speaker-specific transcriptions or to isolate individual speakers for further analysis.

2. **Video Subtitles**: Timestamps play a crucial role in generating subtitles for videos. By aligning the transcribed text with the appropriate timestamps, you can create synchronized subtitles that match the speech in the video. This enhances the accessibility and comprehension of the video content.

In the provided example, we demonstrate how to format the transcription output in the popular SRT (SubRip Text) format, which is widely used for video subtitles. The SRT format consists of four parts for each subtitle:

1. Subtitle number
2. Start and end timestamps
3. Subtitle text
4. Blank line
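
As a small worked example of these four parts, the sketch below builds a single SRT entry from one chunk. The helper names `srt_timestamp` and `srt_entry` and the sample chunk are hypothetical; the chunk dictionary only mirrors the `timestamp` and `text` keys that `prepare_srt` reads.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_entry(index: int, start_s: float, end_s: float, text: str) -> str:
    """Build one subtitle block: number, time range, text, blank line."""
    time_range = f"{srt_timestamp(start_s)} --> {srt_timestamp(end_s)}"
    return f"{index}\n{time_range}\n{text.strip()}\n\n"


# Hypothetical chunk shaped like the entries that `prepare_srt` reads.
example_chunk = {"timestamp": (3661.5, 3664.25), "text": " Hello there."}
print(srt_entry(1, *example_chunk["timestamp"], example_chunk["text"]), end="")
# 1
# 01:01:01,500 --> 01:01:04,250
# Hello there.
```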

In [None]:
result = pipe(sample_long["audio"].copy(), return_timestamps=True)

In [None]:
srt_lines = prepare_srt(result)

display(
    ipd.Audio(sample_long["audio"]["array"], rate=sample_long["audio"]["sampling_rate"])
)
print("".join(srt_lines))

## Quantizing the model using NNCF

Quantization is a popular technique for reducing the memory footprint and computational complexity of deep learning models. It involves converting the model's weights and activations from floating-point precision to lower-precision representations, such as INT8. Quantization can significantly speed up inference and make models more efficient, especially for deployment on resource-constrained devices.
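
The core arithmetic behind INT8 quantization can be shown in a few lines. The sketch below uses symmetric per-tensor quantization with a single scale, which is only the basic idea: the scheme NNCF actually applies is more elaborate, and the function names and sample values here are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [q_val * scale for q_val in quantized]


example_weights = [0.5, -1.27, 0.003, 1.0]
q_weights, q_scale = quantize_int8(example_weights)
restored = dequantize(q_weights, q_scale)
max_error = max(abs(a - b) for a, b in zip(example_weights, restored))

print(q_weights)
print(f"max round-trip error: {max_error:.5f}")
```

Each value is stored in one byte plus a shared scale, and the round-trip error is bounded by half the scale. Calibration data is what lets a tool choose suitable scales for activations, whose ranges are not known from the weights alone.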

The [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf/) is a powerful tool that enables post-training quantization of models. It works by adding quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these quantization layers. NNCF is designed to minimize the modifications required to your original training code, making the quantization process straightforward.

The quantization process using NNCF involves the following steps:

1. **Create a calibration dataset**: Select a representative subset of your training data to be used for calibrating the quantization parameters. This dataset should cover a wide range of input variations to ensure accurate quantization.

2. **Run `nncf.quantize`**: Pass your model and the calibration dataset to the `nncf.quantize` function. NNCF will add quantization layers to the model graph and optimize the quantization parameters using the calibration dataset. This step produces quantized versions of the encoder and decoder models.

3. **Serialize the quantized model**: Use the `openvino.save_model` function to serialize the quantized INT8 model. This step saves the quantized model in a format that can be efficiently loaded and executed using the OpenVINO runtime.

It's important to note that quantization is a time- and memory-consuming operation, especially for large models like Distil-Whisper. Running the quantization code provided in this notebook may take some time, depending on the size of your calibration dataset and the available computational resources.

To facilitate experimentation, the notebook includes a checkbox that allows you to select whether you would like to run the quantization process for Distil-Whisper. If you choose to proceed with quantization, the notebook will guide you through the necessary steps and provide detailed explanations along the way.

Quantization can yield significant performance improvements and memory savings, making it a valuable technique for deploying models in production environments. By leveraging NNCF and the OpenVINO runtime, you can easily quantize your Distil-Whisper model and benefit from faster inference and reduced memory consumption.

In [None]:
import ipywidgets as widgets

to_quantize = widgets.Checkbox(
    value=True,
    description='Quantization',
    disabled=False,
)

to_quantize

In [None]:
# Fetch the skip_kernel_extension module, which provides the %%skip cell magic
import urllib.request

urllib.request.urlretrieve(
    url='https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/main/notebooks/utils/skip_kernel_extension.py',
    filename='skip_kernel_extension.py'
)

%load_ext skip_kernel_extension

### Preparing Calibration Datasets

The first step in the quantization process is to prepare calibration datasets. Calibration datasets are used to optimize the quantization parameters and ensure that the quantized model maintains acceptable accuracy. Since we quantize the Whisper encoder and decoder separately, we need to prepare a separate calibration dataset for each model.

To collect the calibration data, we use the `InferRequestWrapper` class, which intercepts the model inputs during inference and collects them into a list. This allows us to capture the input data that will be used for calibration.

Here's an overview of the steps to prepare the calibration datasets:

1. Import the `InferRequestWrapper` class from the `optimum.intel.openvino.quantization` module.

2. Initialize an instance of the `InferRequestWrapper` class for both the encoder and decoder models. This wrapper will intercept the model inputs during inference.

3. Run model inference on a small subset of audio samples. These samples should be representative of the data the model will encounter in production.

4. The `InferRequestWrapper` will collect the input data for each model and store it in separate lists for the encoder and decoder.

5. The collected input data will serve as the calibration datasets for quantization.

It's important to note that the size of the calibration dataset can impact the quality of the quantized model. Generally, increasing the calibration dataset size improves quantization quality. However, using a larger dataset also increases the time and computational resources required for calibration.

When selecting the calibration dataset, aim for a balance between representativeness and size. Choose a diverse set of audio samples that cover different speakers, accents, and recording conditions. This will help the quantization process capture a wide range of input variations and optimize the quantization parameters accordingly.
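
The interception pattern described above can be sketched without any OpenVINO dependency. The class below is a hypothetical stand-in, not the real `InferRequestWrapper`: it records a copy of every input it receives and then delegates to the wrapped callable, so the surrounding code keeps working unchanged.

```python
from typing import Any, Callable, List


class RecordingWrapper:
    """Hypothetical wrapper that stores each input, then calls the wrapped request."""

    def __init__(self, request: Callable[[dict], Any], collected: List[dict]):
        self.request = request
        self.collected = collected

    def __call__(self, inputs: dict) -> Any:
        # Shallow copy so that later mutation of `inputs` is not recorded.
        self.collected.append(dict(inputs))
        return self.request(inputs)


# Toy "model" that sums its inputs; the real requests run OpenVINO inference.
def toy_request(inputs: dict) -> float:
    return sum(inputs.values())


toy_calibration_data: List[dict] = []
wrapped_request = RecordingWrapper(toy_request, toy_calibration_data)
for toy_inputs in ({"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}):
    wrapped_request(toy_inputs)

print(len(toy_calibration_data))  # 2
```

The list passed to the wrapper fills up as a side effect of ordinary inference, which is why running `generate` on a few samples is enough to collect calibration data.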

In [None]:
%%skip not $to_quantize.value

from itertools import islice
from optimum.intel.openvino.quantization import InferRequestWrapper


def collect_calibration_dataset(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):
    # Overwrite model request properties, saving the original ones for restoring later
    original_encoder_request = ov_model.encoder.request
    original_decoder_with_past_request = ov_model.decoder_with_past.request
    encoder_calibration_data = []
    decoder_calibration_data = []
    ov_model.encoder.request = InferRequestWrapper(original_encoder_request, encoder_calibration_data)
    ov_model.decoder_with_past.request = InferRequestWrapper(original_decoder_with_past_request,
                                                             decoder_calibration_data)

    calibration_dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
    for sample in tqdm(islice(calibration_dataset, calibration_dataset_size), desc="Collecting calibration data",
                       total=calibration_dataset_size):
        input_features = extract_input_features(sample)
        ov_model.generate(input_features)

    ov_model.encoder.request = original_encoder_request
    ov_model.decoder_with_past.request = original_decoder_with_past_request

    return encoder_calibration_data, decoder_calibration_data

### Quantizing Distil-Whisper Encoder and Decoder Models

With the calibration datasets prepared, we can now proceed to quantize the Distil-Whisper encoder and decoder models. The `quantize` function in the notebook encapsulates the quantization process by calling `nncf.quantize` on the encoder and decoder-with-past models.

Here's a breakdown of the quantization steps:

1. The `quantize` function takes the OpenVINO model (`ov_model`) and the calibration dataset size as arguments.

2. Inside the function, we first check if the quantized model path already exists. If not, we proceed with the quantization process.

3. The calibration datasets for the encoder and decoder models are collected using the `collect_calibration_dataset` function, which utilizes the `InferRequestWrapper` to intercept model inputs during inference.

4. We quantize the encoder model by calling `nncf.quantize` with the following arguments:
   - The encoder model (`ov_model.encoder.model`)
   - The encoder calibration dataset
   - The subset size (equal to the size of the encoder calibration dataset)
   - The model type (set to `nncf.ModelType.TRANSFORMER`)
   - Advanced quantization parameters (e.g., `smooth_quant_alpha` for Smooth Quant algorithm)

5. The quantized encoder model is saved using the `openvino.save_model` function.

6. We repeat the quantization process for the decoder-with-past model, using the decoder calibration dataset and appropriate quantization parameters.

7. The quantized decoder-with-past model is saved using the `openvino.save_model` function.

It's important to note that we don't quantize the first-step-decoder model because its contribution to the overall inference time is negligible. Quantizing the encoder and decoder-with-past models, which account for the majority of the computation, provides the most significant performance improvements.

By quantizing the encoder and decoder models separately, we can tune each model's quantization parameters to its own input characteristics and computational demands. For example, the code below uses `smooth_quant_alpha=0.50` for the encoder and `smooth_quant_alpha=0.95` for the decoder-with-past model, which lets each component find its own balance between performance and accuracy.
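
To give a feel for what the alpha parameter controls, here is a minimal sketch of the per-channel smoothing factor described in the SmoothQuant paper, `s = max|X|^alpha / max|W|^(1 - alpha)`. The function name and the channel maxima are hypothetical, and NNCF's implementation may differ in its details.

```python
def smoothing_factor(act_max, weight_max, alpha):
    """SmoothQuant-style per-channel scale: s = act_max**alpha / weight_max**(1 - alpha)."""
    return act_max ** alpha / weight_max ** (1 - alpha)


act_max, weight_max = 16.0, 0.25  # hypothetical per-channel maxima

smoothed = {}
for alpha in (0.50, 0.95):
    s = smoothing_factor(act_max, weight_max, alpha)
    # Activations are divided by s and weights multiplied by s, so their product is unchanged
    smoothed[alpha] = (act_max / s, weight_max * s)
    print(f"alpha={alpha}: activation max -> {act_max / s:.3f}, weight max -> {weight_max * s:.3f}")
```

A higher alpha shrinks the activation range more and shifts the quantization difficulty toward the weights, which is one way to read the different values chosen for the encoder and the decoder.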

The `quantize` function abstracts away the complexities of the quantization process, making it easy to apply quantization to the Distil-Whisper model using NNCF. Once the quantization is complete, the quantized models are saved and ready for inference.


In [None]:
%%skip not $to_quantize.value

import gc
import shutil
import nncf

CALIBRATION_DATASET_SIZE = 40  # reduced from the original value of 50
quantized_model_path = Path(f"{model_path}_quantized")


def quantize(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):
    if not quantized_model_path.exists():
        encoder_calibration_data, decoder_calibration_data = collect_calibration_dataset(
            ov_model, calibration_dataset_size
        )
        print("Quantizing encoder")
        quantized_encoder = nncf.quantize(
            ov_model.encoder.model,
            nncf.Dataset(encoder_calibration_data),
            subset_size=len(encoder_calibration_data),
            model_type=nncf.ModelType.TRANSFORMER,
            # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search
            advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.50)
        )
        ov.save_model(quantized_encoder, quantized_model_path / "openvino_encoder_model.xml")
        del quantized_encoder
        del encoder_calibration_data
        gc.collect()

        print("Quantizing decoder with past")
        quantized_decoder_with_past = nncf.quantize(
            ov_model.decoder_with_past.model,
            nncf.Dataset(decoder_calibration_data),
            subset_size=len(decoder_calibration_data),
            model_type=nncf.ModelType.TRANSFORMER,
            # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search
            advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.95)
        )
        ov.save_model(quantized_decoder_with_past, quantized_model_path / "openvino_decoder_with_past_model.xml")
        del quantized_decoder_with_past
        del decoder_calibration_data
        gc.collect()

        # Copy the config file and the first-step-decoder manually
        shutil.copy(model_path / "config.json", quantized_model_path / "config.json")
        shutil.copy(model_path / "openvino_decoder_model.xml", quantized_model_path / "openvino_decoder_model.xml")
        shutil.copy(model_path / "openvino_decoder_model.bin", quantized_model_path / "openvino_decoder_model.bin")

    quantized_ov_model = OVModelForSpeechSeq2Seq.from_pretrained(quantized_model_path, ov_config=ov_config, compile=False)
    quantized_ov_model.to(device.value)
    quantized_ov_model.compile()
    return quantized_ov_model


ov_quantized_model = quantize(ov_model, CALIBRATION_DATASET_SIZE)

### Running Inference with the Quantized Model

Now that we have quantized the Distil-Whisper encoder and decoder models, let's compare the transcription results between the original and quantized models. Running inference with the quantized model allows us to assess the impact of quantization on the model's output and verify that the quantized model maintains acceptable accuracy.

To run inference with the quantized model, we follow these steps:

1. Load a sample audio file for transcription. This can be done using the `load_dataset` function from the Hugging Face Datasets library.

2. Prepare the audio input for the model by extracting the input features using the `extract_input_features` function. This function applies the necessary preprocessing steps, such as converting the audio to a spectrogram representation.

3. Run inference with the original OpenVINO model (`ov_model`) using the `generate` method. This generates the predicted token IDs for the original model.

4. Decode the predicted token IDs from the original model into a human-readable transcription using the `processor.batch_decode` function.

5. Run inference with the quantized OpenVINO model (`ov_quantized_model`) using the `generate` method. This generates the predicted token IDs for the quantized model.

6. Decode the predicted token IDs from the quantized model into a human-readable transcription using the `processor.batch_decode` function.

7. Compare the transcription results from the original and quantized models to assess the impact of quantization on the model's output.
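
For longer transcriptions, comparing by eye becomes tedious. One possible approach, sketched here with the standard library's `difflib`, is to list only the word spans that differ. The `word_diff` helper and the two sample sentences are hypothetical and are not part of the notebook.

```python
from difflib import SequenceMatcher


def word_diff(original, quantized):
    """Return (tag, original_words, quantized_words) for every span that differs."""
    a, b = original.lower().split(), quantized.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical transcriptions from the original and quantized models
original = "the quick brown fox jumps over the lazy dog"
quantized = "the quick brown fox jumped over the lazy dog"

print(word_diff(original, quantized))  # [('replace', ['jumps'], ['jumped'])]
```

An empty list means the two transcriptions match word for word after lowercasing.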


In [None]:
%%skip not $to_quantize.value

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]
input_features = extract_input_features(sample)

predicted_ids = ov_model.generate(input_features)
transcription_original = processor.batch_decode(predicted_ids, skip_special_tokens=True)

predicted_ids = ov_quantized_model.generate(input_features)
transcription_quantized = processor.batch_decode(predicted_ids, skip_special_tokens=True)

display(ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]))
print(f"Original : {transcription_original[0]}")
print(f"Quantized: {transcription_quantized[0]}")

Results are the same!

### Comparing performance and accuracy of the original and quantized models

To evaluate the effectiveness of the quantization process, we need to compare the performance and accuracy of the original and quantized Distil-Whisper models. This comparison will help us understand the trade-offs introduced by quantization and assess whether the quantized model meets the desired performance and accuracy criteria.

#### Measuring Accuracy
To measure the accuracy of the models, we use the Word Error Rate (WER) as the evaluation metric. WER is a common metric in speech recognition: it is the number of word-level errors divided by the number of words in the reference transcription. A lower WER indicates better accuracy.

In this comparison, we report accuracy as `Accuracy = 1 - WER`, expressed as a percentage. Because inserted words also count as errors, WER can exceed 1, so this value is best read as a score rather than the exact share of correctly recognized words.

To calculate the WER, we compare the transcriptions generated by the models against the ground truth transcriptions. The WER is computed by aligning the predicted and ground truth transcriptions and counting the number of word insertions, deletions, and substitutions.
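
As a rough illustration of that alignment, the sketch below computes WER for a single pair of sentences with a word-level edit distance. The notebook itself relies on `jiwer`, which also normalizes the text and aggregates over a whole corpus; the `word_error_rate` helper and the sample sentences are hypothetical, and the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the")
wer_value = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer_value:.3f}, accuracy: {(1 - wer_value) * 100:.2f}%")
```

Here two errors against six reference words give a WER of about 0.333, or an accuracy of about 66.67%.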

#### Measuring Inference Time
In addition to accuracy, we also measure the inference time of the models to assess their performance. Inference time is a critical metric, especially in real-time or resource-constrained scenarios where fast processing is essential.

We measure the inference time separately for three components of the Distil-Whisper model:
1. Encoder: We measure the time taken by the encoder model to process the input audio features and generate the encoded representation.
2. Decoder-with-past: We measure the time taken by the decoder-with-past model to generate the output tokens based on the encoded representation and the previous tokens.
3. Whole model: We measure the end-to-end inference time, including the time taken by both the encoder and decoder-with-past models, as well as any additional processing steps.

By measuring the inference time for each component, we can identify potential bottlenecks and understand the impact of quantization on different parts of the model.

#### Comparing Performance and Accuracy
To compare the performance and accuracy of the original and quantized models, we follow these steps:

1. Prepare a test dataset of audio samples that are representative of the intended use case.

2. Run inference on the test dataset using both the original and quantized models.

3. Calculate the accuracy (1 - WER) for each model by comparing the generated transcriptions against the ground truth.

4. Measure the inference time for the encoder, decoder-with-past, and whole model for both the original and quantized models.

5. Compare the accuracy and inference time results between the original and quantized models.

By analyzing the accuracy and inference time metrics, we can determine the impact of quantization on the model's performance. Ideally, the quantized model should achieve similar accuracy to the original model while providing a significant reduction in inference time.

If the quantized model maintains comparable accuracy and demonstrates improved inference speed, it indicates that the quantization process has successfully optimized the model for faster execution without compromising its recognition quality.

The comparison of performance and accuracy between the original and quantized models is crucial for making informed decisions about deploying the quantized model in production environments. It helps strike the right balance between performance and accuracy based on the specific requirements of the application.

In [None]:
%%skip not $to_quantize.value

import time
from contextlib import contextmanager
from jiwer import wer, wer_standardize


TEST_DATASET_SIZE = 50
MEASURE_TIME = False

@contextmanager
def time_measurement():
    global MEASURE_TIME
    try:
        MEASURE_TIME = True
        yield
    finally:
        MEASURE_TIME = False

def time_fn(obj, fn_name, time_list):
    original_fn = getattr(obj, fn_name)

    def wrapper(*args, **kwargs):
        if not MEASURE_TIME:
            return original_fn(*args, **kwargs)
        start_time = time.perf_counter()
        result = original_fn(*args, **kwargs)
        end_time = time.perf_counter()
        time_list.append(end_time - start_time)
        return result

    setattr(obj, fn_name, wrapper)

def calculate_transcription_time_and_accuracy(ov_model, test_samples):
    encoder_infer_times = []
    decoder_with_past_infer_times = []
    whole_infer_times = []
    time_fn(ov_model, "generate", whole_infer_times)
    time_fn(ov_model.encoder, "forward", encoder_infer_times)
    time_fn(ov_model.decoder_with_past, "forward", decoder_with_past_infer_times)

    ground_truths = []
    predictions = []
    for data_item in tqdm(test_samples, desc="Measuring performance and accuracy"):
        input_features = extract_input_features(data_item)

        with time_measurement():
            predicted_ids = ov_model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

        ground_truths.append(data_item["text"])
        predictions.append(transcription[0])

    word_accuracy = (1 - wer(ground_truths, predictions, reference_transform=wer_standardize,
                             hypothesis_transform=wer_standardize)) * 100
    # Totals rather than means: both models process the same samples, so ratios of totals give the speedup
    total_whole_infer_time = sum(whole_infer_times)
    total_encoder_infer_time = sum(encoder_infer_times)
    total_decoder_with_past_infer_time = sum(decoder_with_past_infer_times)
    return word_accuracy, (total_whole_infer_time, total_encoder_infer_time, total_decoder_with_past_infer_time)

test_dataset = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
test_dataset = test_dataset.shuffle(seed=42).take(TEST_DATASET_SIZE)
test_samples = [sample for sample in test_dataset]

accuracy_original, times_original = calculate_transcription_time_and_accuracy(ov_model, test_samples)
accuracy_quantized, times_quantized = calculate_transcription_time_and_accuracy(ov_quantized_model, test_samples)
print(f"Encoder performance speedup: {times_original[1] / times_quantized[1]:.3f}")
print(f"Decoder with past performance speedup: {times_original[2] / times_quantized[2]:.3f}")
print(f"Whole pipeline performance speedup: {times_original[0] / times_quantized[0]:.3f}")
print(f"Whisper transcription word accuracy. Original model: {accuracy_original:.2f}%. Quantized model: {accuracy_quantized:.2f}%.")
print(f"Accuracy drop: {accuracy_original - accuracy_quantized:.2f}%.")

As we can see, quantization significantly improves inference time without a major drop in accuracy!

## Interactive Demo: Experience Distil-Whisper in Action

To showcase the capabilities of the Distil-Whisper model and provide a hands-on experience, we have developed an interactive demo using the Gradio interface. This demo allows you to test the model's speech recognition performance on your own audio data, making it easy to explore and evaluate the model's effectiveness.

### Features of the Interactive Demo
1. **Audio Upload**: You can upload your own audio files using the provided upload button. This enables you to test the model's performance on a wide range of audio samples, including different speakers, accents, and recording conditions.

2. **Microphone Input**: In addition to audio file upload, the demo also accepts audio recorded through your microphone. You can record your own speech directly within the interface and have the model transcribe it when you click the transcribe button.

3. **Transcription Output**: The demo displays the transcribed text output for each audio input, allowing you to evaluate the model's accuracy and performance instantly. You can compare the generated transcriptions against the actual spoken content to assess the model's recognition quality.

4. **Quantized Model Comparison**: If you have chosen to run the quantization process, the demo provides a side-by-side comparison of the original and quantized models. You can input the same audio and observe the transcription results from both models, enabling you to evaluate the impact of quantization on the model's output.

### Limitations and Future Enhancements
Please note that the current version of Distil-Whisper is specifically trained for English speech recognition. While it excels at transcribing English audio, it may not perform optimally for other languages. However, we are actively working on extending the model's capabilities to support multiple languages in the future.

Multilingual support will greatly expand the utility of Distil-Whisper, enabling its application in a wider range of scenarios and catering to a global user base. Stay tuned for updates on the multilingual version of Distil-Whisper, which will be released in the near future.

### Getting Started with the Interactive Demo
To access the interactive demo, simply run the provided code cells in the notebook. The demo will launch using the Gradio interface, and you can start experimenting with your own audio data right away.

We encourage you to explore the demo extensively, testing the model's performance on various audio samples and evaluating its transcription accuracy. Provide feedback and report any issues you encounter, as your input is valuable in improving the model and the demo experience.

By interacting with the demo, you will gain a hands-on understanding of Distil-Whisper's capabilities and potential applications. Whether you are a researcher, developer, or enthusiast in the field of speech recognition, this interactive demo provides a convenient way to explore and leverage the power of the Distil-Whisper model.

In [None]:
import time  # used by transcribe() below; imported here in case the benchmarking cell was skipped
from transformers.pipelines.audio_utils import ffmpeg_read
import gradio as gr
import urllib.request

urllib.request.urlretrieve(
    url="https://huggingface.co/spaces/distil-whisper/whisper-vs-distil-whisper/resolve/main/assets/example_1.wav",
    filename="example_1.wav",
)

BATCH_SIZE = 16
MAX_AUDIO_MINS = 30  # maximum audio input in minutes


generate_kwargs = {"language": "en", "task": "transcribe"} if not model_id.value.endswith(".en") else {}
ov_pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    generate_kwargs=generate_kwargs,
)
ov_pipe_forward = ov_pipe._forward

if to_quantize.value:
    ov_quantized_model.generation_config = ov_model.generation_config
    ov_quantized_pipe = pipeline(
        "automatic-speech-recognition",
        model=ov_quantized_model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=15,
        generate_kwargs=generate_kwargs,
    )
    ov_quantized_pipe_forward = ov_quantized_pipe._forward


def transcribe(inputs, quantized=False):
    pipe = ov_quantized_pipe if quantized else ov_pipe
    pipe_forward = ov_quantized_pipe_forward if quantized else ov_pipe_forward

    if inputs is None:
        raise gr.Error(
            "No audio file submitted! Please record or upload an audio file before submitting your request."
        )

    with open(inputs, "rb") as f:
        inputs = f.read()

    inputs = ffmpeg_read(inputs, pipe.feature_extractor.sampling_rate)
    audio_length_mins = len(inputs) / pipe.feature_extractor.sampling_rate / 60

    if audio_length_mins > MAX_AUDIO_MINS:
        raise gr.Error(
            f"To ensure fair usage of the Space, the maximum audio length permitted is {MAX_AUDIO_MINS} minutes. "
            f"Got an audio of length {round(audio_length_mins, 3)} minutes."
        )

    inputs = {"array": inputs, "sampling_rate": pipe.feature_extractor.sampling_rate}

    def _forward_ov_time(*args, **kwargs):
        global ov_time
        start_time = time.time()
        result = pipe_forward(*args, **kwargs)
        ov_time = time.time() - start_time
        ov_time = round(ov_time, 2)
        return result

    pipe._forward = _forward_ov_time
    ov_text = pipe(inputs.copy(), batch_size=BATCH_SIZE)["text"]
    return ov_text, ov_time


with gr.Blocks() as demo:
    gr.HTML(
        """
                <div style="text-align: center; max-width: 700px; margin: 0 auto;">
                  <div
                    style="
                      display: inline-flex; align-items: center; gap: 0.8rem; font-size: 1.75rem;
                    "
                  >
                    <h1 style="font-weight: 900; margin-bottom: 7px; line-height: normal;">
                      OpenVINO Distil-Whisper demo
                    </h1>
                  </div>
                </div>
            """
    )
    audio = gr.components.Audio(type="filepath", label="Audio input")
    with gr.Row():
        button = gr.Button("Transcribe")
        if to_quantize.value:
            button_q = gr.Button("Transcribe quantized")
    with gr.Row():
        infer_time = gr.components.Textbox(
            label="OpenVINO Distil-Whisper Transcription Time (s)"
        )
        if to_quantize.value:
            infer_time_q = gr.components.Textbox(
                label="OpenVINO Quantized Distil-Whisper Transcription Time (s)"
            )
    with gr.Row():
        transcription = gr.components.Textbox(
            label="OpenVINO Distil-Whisper Transcription", show_copy_button=True
        )
        if to_quantize.value:
            transcription_q = gr.components.Textbox(
                label="OpenVINO Quantized Distil-Whisper Transcription", show_copy_button=True
            )
    button.click(
        fn=transcribe,
        inputs=audio,
        outputs=[transcription, infer_time],
    )
    if to_quantize.value:
        button_q.click(
            fn=transcribe,
            inputs=[audio, gr.Number(value=1, visible=False)],
            outputs=[transcription_q, infer_time_q],
        )
    gr.Markdown("## Examples")
    gr.Examples(
        [["./example_1.wav"]],
        audio,
        outputs=[transcription, infer_time],
        fn=transcribe,
        cache_examples=False,
    )
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)