# Learn OpenAI Whisper - Chapter 6
## Notebook 3: Video Subtitle Generation using Whisper and OpenVINO™

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e8ZqjqmY7ue2ynLVTDTJXxUCoTQ9XVJ5)

In this tutorial, we use OpenAI's Whisper model together with the OpenVINO toolkit to automatically generate subtitles for a sample video. The process consists of the following steps:

1. Obtaining the pre-trained Whisper model
2. Setting up the PyTorch model pipeline
3. Transforming the model into OpenVINO Intermediate Representation (IR) format using the model conversion API
4. Executing the Whisper pipeline with the converted OpenVINO models to generate the subtitles


## Setting Up the Environment


We start by downloading a helper Python module, `utils.py`, from our GitHub repository.

In [None]:
!wget -nv "https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter06/utils.py" -O utils.py

Next, we install the software dependencies needed to work with AI models and speech data.

In [None]:
%pip install -q cohere openai tiktoken
%pip install -q "openvino>=2023.1.0"
%pip install -q "python-ffmpeg<=1.0.16" moviepy transformers --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "git+https://github.com/garywu007/pytube.git"
%pip install -q gradio
%pip install -q "openai-whisper==20231117" --extra-index-url https://download.pytorch.org/whl/cpu

## Initializing the Whisper Model

OpenAI's Whisper is a powerful Transformer-based encoder-decoder model, also known as a sequence-to-sequence model, designed for speech recognition tasks. It operates by mapping a sequence of audio spectrogram features to a corresponding sequence of text tokens. The process can be broken down into three main steps:

1. **Feature Extraction**: The raw audio inputs are first converted into a log-Mel spectrogram representation using a feature extractor module.

2. **Encoding**: The Transformer encoder then processes the spectrogram, generating a sequence of hidden states that capture the essential information from the audio input.

3. **Decoding**: Finally, the decoder autoregressively predicts the text tokens, conditioned on both the previously generated tokens and the encoder's hidden states.
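
The decoding step can be illustrated with a toy greedy loop. This is only a sketch: `toy_decoder`, the four-entry vocabulary, and the integer "encoder states" are hypothetical stand-ins, not part of Whisper, and the real model scores tens of thousands of tokens with a Transformer decoder.

```python
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary; real Whisper uses a much larger tokenizer.
VOCAB = ["<sot>", "hello", "world", "<eot>"]
SOT, EOT = 0, 3


def toy_decoder(tokens: Sequence[int], encoder_states: Sequence[int]) -> List[float]:
    """Stand-in for the decoder: return one score per vocabulary entry.

    The score depends on both the tokens generated so far (their count)
    and the encoder states (which token should come next).
    """
    position = len(tokens) - 1  # tokens generated after <sot>
    target = encoder_states[position] if position < len(encoder_states) else EOT
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]


def greedy_decode(
    decoder: Callable[[Sequence[int], Sequence[int]], List[float]],
    encoder_states: Sequence[int],
    max_len: int = 10,
) -> List[int]:
    """Append the highest-scoring token until <eot> or max_len is reached."""
    tokens = [SOT]
    while len(tokens) < max_len:
        scores = decoder(tokens, encoder_states)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOT:
            break
    return tokens


encoder_states = [1, 2]  # pretend the encoder "heard" hello, world
decoded = greedy_decode(toy_decoder, encoder_states)
print(" ".join(VOCAB[t] for t in decoded))
```

Each iteration feeds the whole token history back into the decoder, which is what makes the prediction autoregressive.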

The architecture of the Whisper model is illustrated in the diagram below:

![whisper_architecture.svg](https://user-images.githubusercontent.com/29454499/204536571-8f6d8d77-5fbd-4c6d-8e29-14e734837860.svg)

*Source: https://openai.com/research/whisper*

By leveraging this powerful architecture, Whisper achieves state-of-the-art performance on various speech recognition benchmarks, making it an ideal choice for our subtitle generation task.


The creators of Whisper trained several models of varying size and capability to suit different use cases and resource constraints. In this tutorial, we use the `base` model, which offers a good balance between accuracy and speed. The steps demonstrated in this notebook apply equally to the other models in the Whisper family, so you can experiment with different sizes and choose the one that best suits your requirements.

In [None]:
from whisper import _MODELS
import ipywidgets as widgets

model_id = widgets.Dropdown(
    options=list(_MODELS),
    value='base',
    description='Model:',
    disabled=False,
)

model_id

In [None]:
import whisper

# model = whisper.load_model(model_id.value)
model = whisper.load_model(model_id.value, "cpu")
model.eval()
pass

### Converting the Model to OpenVINO Intermediate Representation (IR) Format

To achieve optimal performance and efficiency with the OpenVINO toolkit, it is highly recommended to convert the Whisper model into the OpenVINO-specific Intermediate Representation (IR) format. This process requires two key components:

1. An initialized model object
2. Sample input data for shape inference

We will leverage the `ov.convert_model` function provided by OpenVINO to perform the model conversion. This function takes the initialized model object and sample inputs as arguments and returns an OpenVINO-compatible model that is ready to be loaded onto the target device for inference.

Once the conversion is complete, we can save the OpenVINO model to disk using the `ov.save_model` function. This allows us to reuse the converted model in future sessions without the need to repeat the conversion process, saving valuable time and resources.

By converting the Whisper model to OpenVINO IR format, we can take full advantage of the performance optimizations and hardware acceleration capabilities offered by the OpenVINO toolkit, ensuring efficient and high-quality subtitle generation.
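
An IR model saved with `ov.save_model` consists of two files: an `.xml` file describing the topology and a `.bin` file with the weights. The cells in this notebook check only for the `.xml` file before skipping conversion. The sketch below shows a slightly stricter check; `ir_is_complete` is a hypothetical helper, not part of OpenVINO, and the file contents are placeholders.

```python
import tempfile
from pathlib import Path


def ir_is_complete(xml_path: Path) -> bool:
    """Return True only if both the .xml and the matching .bin file exist."""
    return xml_path.exists() and xml_path.with_suffix(".bin").exists()


with tempfile.TemporaryDirectory() as tmp:
    xml_file = Path(tmp) / "whisper_base_encoder.xml"
    xml_file.write_text("<net/>")                    # placeholder topology file
    only_xml = ir_is_complete(xml_file)              # weights file still missing
    xml_file.with_suffix(".bin").write_bytes(b"")    # placeholder weights file
    both_files = ir_is_complete(xml_file)

print(only_xml, both_files)
```

A check like this avoids reusing a model whose weights file was deleted or never finished writing.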


### Converting the Whisper Encoder to OpenVINO IR

In [None]:
from pathlib import Path

WHISPER_ENCODER_OV = Path(f"whisper_{model_id.value}_encoder.xml")
WHISPER_DECODER_OV = Path(f"whisper_{model_id.value}_decoder.xml")


An example input is created using a tensor of zeros: one 30-second chunk of 3000 spectrogram frames with 80 Mel bins (128 for the `v3` models). The `ov.convert_model` function then converts the encoder to OpenVINO's IR format, and the converted model is saved to disk for future use.

In [None]:
import torch
import openvino as ov

mel = torch.zeros((1, 80 if 'v3' not in model_id.value else 128, 3000))
audio_features = model.encoder(mel)
if not WHISPER_ENCODER_OV.exists():
    encoder_model = ov.convert_model(model.encoder, example_input=mel)
    ov.save_model(encoder_model, WHISPER_ENCODER_OV)
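
The example input shape `(1, 80, 3000)` follows from Whisper's audio preprocessing. The sketch below reproduces the arithmetic; the constants mirror the values used by the `whisper` package as far as we know (16 kHz audio, 30-second chunks, a hop of 160 samples, and a stride-2 convolution in the encoder), and the helper names are hypothetical.

```python
SAMPLE_RATE = 16_000      # audio samples per second
CHUNK_SECONDS = 30        # Whisper processes audio in 30-second windows
HOP_LENGTH = 160          # samples between consecutive spectrogram frames
ENCODER_CONV_STRIDE = 2   # the encoder halves the time axis once


def mel_input_shape(n_mels: int, batch_size: int = 1) -> tuple:
    """Shape of the log-Mel spectrogram fed to the encoder."""
    n_samples = SAMPLE_RATE * CHUNK_SECONDS
    n_frames = n_samples // HOP_LENGTH
    return (batch_size, n_mels, n_frames)


def encoder_positions(n_frames: int) -> int:
    """Number of time positions in the encoder output."""
    return n_frames // ENCODER_CONV_STRIDE


shape = mel_input_shape(n_mels=80)
print(shape)                         # (1, 80, 3000)
print(encoder_positions(shape[-1]))  # 1500
```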

### Converting the Whisper Decoder to OpenVINO IR

The Whisper decoder employs a technique called attention caching to reduce computational complexity and improve efficiency. This involves storing the key and value projections from previous steps in the attention modules, which can then be reused in subsequent computations. However, to ensure accurate tracing and conversion of the decoder to OpenVINO IR format, we need to modify this caching mechanism.

In the following code cells, we will define custom forward functions for the decoder's attention modules and residual blocks. These modified functions will explicitly handle the caching and retrieval of key and value projections, making the caching process more transparent and traceable.

By adapting the decoder's architecture to be more compatible with the OpenVINO conversion process, we can successfully convert the Whisper decoder to OpenVINO IR format, enabling us to leverage the performance benefits of the OpenVINO toolkit while maintaining the decoder's functionality and efficiency.
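
To see why explicit caching gives the same result as recomputing everything, consider a toy single-head attention in plain Python. This is a sketch under simplifying assumptions: the projections are the identity (query = key = value = input), vectors are short lists, and `attend` and `step_with_cache` are hypothetical names unrelated to the Whisper code.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
Cache = Tuple[List[Vector], List[Vector]]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def step_with_cache(x: Vector, cache: Optional[Cache]) -> Tuple[Vector, Cache]:
    """Process one new position, reusing keys/values from earlier steps."""
    keys, values = ([], []) if cache is None else cache
    keys = keys + [x]      # same idea as concatenating along the sequence axis
    values = values + [x]
    return attend(x, keys, values), (keys, values)


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental decoding: one position per step, cache passed in and out explicitly.
cache = None
incremental = []
for x in sequence:
    out, cache = step_with_cache(x, cache)
    incremental.append(out)

# Reference: recompute attention over the full causal prefix at every position.
full = [attend(x, sequence[: i + 1], sequence[: i + 1]) for i, x in enumerate(sequence)]

matches = all(
    math.isclose(a, b) for inc, ref in zip(incremental, full) for a, b in zip(inc, ref)
)
print(matches, len(cache[0]))
```

Passing the cache as an explicit argument and return value, rather than hiding it in hooks, is what makes the data flow visible to a tracer.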

In [None]:
import torch
from typing import Optional, Tuple
from functools import partial


def attention_forward(
        attention_module,
        x: torch.Tensor,
        xa: Optional[torch.Tensor] = None,
        mask: Optional[torch.Tensor] = None,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
):
    """
    Override for forward method of decoder attention module with storing cache values explicitly.
    Parameters:
      attention_module: current attention module
      x: input hidden states.
      xa: input audio features (Optional).
      mask: mask for applying attention (Optional).
      kv_cache: tuple of cached key and value tensors from previous steps (Optional).
    Returns:
      attention module output tensor
      updated kv_cache
    """
    q = attention_module.query(x)

    if xa is None:
        # self-attention: project keys/values for the new positions and,
        # if a cache is provided, prepend the cached keys/values from previous steps.
        k = attention_module.key(x)
        v = attention_module.value(x)
        if kv_cache is not None:
            k = torch.cat((kv_cache[0], k), dim=1)
            v = torch.cat((kv_cache[1], v), dim=1)
        kv_cache_new = (k, v)
    else:
        # cross-attention: keys and values are projected from the audio features and are not cached.
        k = attention_module.key(xa)
        v = attention_module.value(xa)
        kv_cache_new = (None, None)

    wv, qk = attention_module.qkv_attention(q, k, v, mask)
    return attention_module.out(wv), kv_cache_new


def block_forward(
    residual_block,
    x: torch.Tensor,
    xa: Optional[torch.Tensor] = None,
    mask: Optional[torch.Tensor] = None,
    kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
):
    """
    Override for residual block forward method for providing kv_cache to attention module.
      Parameters:
        residual_block: current residual block.
        x: input hidden states.
        xa: input audio features (Optional).
        mask: attention mask (Optional).
        kv_cache: cache for storing attention key values.
      Returns:
        x: residual block output
        kv_cache: updated kv_cache

    """
    x0, kv_cache = residual_block.attn(residual_block.attn_ln(
        x), mask=mask, kv_cache=kv_cache)
    x = x + x0
    if residual_block.cross_attn:
        x1, _ = residual_block.cross_attn(
            residual_block.cross_attn_ln(x), xa)
        x = x + x1
    x = x + residual_block.mlp(residual_block.mlp_ln(x))
    return x, kv_cache



# update forward functions
for block in model.decoder.blocks:
    block.forward = partial(block_forward, block)
    block.attn.forward = partial(attention_forward, block.attn)
    if block.cross_attn:
        block.cross_attn.forward = partial(attention_forward, block.cross_attn)


def decoder_forward(decoder, x: torch.Tensor, xa: torch.Tensor, kv_cache: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor]]] = None):
    """
    Override for decoder forward method.
    Parameters:
      x: torch.LongTensor, shape = (batch_size, <= n_ctx) the text tokens
      xa: torch.Tensor, shape = (batch_size, n_audio_ctx, n_audio_state)
           the encoded audio features to be attended on
      kv_cache: Tuple[Tuple[torch.Tensor, torch.Tensor]], cached keys and values of each block's self-attention from previous steps
    """
    if kv_cache is not None:
        offset = kv_cache[0][0].shape[1]
    else:
        offset = 0
        kv_cache = [None for _ in range(len(decoder.blocks))]
    x = decoder.token_embedding(
        x) + decoder.positional_embedding[offset: offset + x.shape[-1]]
    x = x.to(xa.dtype)
    kv_cache_upd = []

    for block, kv_block_cache in zip(decoder.blocks, kv_cache):
        x, kv_block_cache_upd = block(x, xa, mask=decoder.mask, kv_cache=kv_block_cache)
        kv_cache_upd.append(tuple(kv_block_cache_upd))

    x = decoder.ln(x)
    logits = (
        x @ torch.transpose(decoder.token_embedding.weight.to(x.dtype), 1, 0)).float()

    return logits, tuple(kv_cache_upd)



# override decoder forward
model.decoder.forward = partial(decoder_forward, model.decoder)

In [None]:
tokens = torch.ones((5, 3), dtype=torch.int64)
logits, kv_cache = model.decoder(tokens, audio_features, kv_cache=None)

tokens = torch.ones((5, 1), dtype=torch.int64)

if not WHISPER_DECODER_OV.exists():
    decoder_model = ov.convert_model(model.decoder, example_input=(tokens, audio_features, kv_cache))
    ov.save_model(decoder_model, WHISPER_DECODER_OV)

The decoder autoregressively predicts the next token, guided by the encoder hidden states and the previously predicted sequence. As a result, the shapes of the inputs that depend on the previous step (the tokens and the cached attention keys and values) are dynamic. For efficient use of memory, you can define an upper bound for these dynamic input shapes.
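
The following sketch tracks how those shapes change from step to step. It is shape bookkeeping only, with no model involved: `decoder_step_shapes` is a hypothetical helper, the width of 512 is our assumption for the `base` model, and 448 is assumed to be the maximum text context of Whisper decoders, which serves as the natural upper bound.

```python
from typing import Optional, Tuple

N_TEXT_CTX = 448  # assumed maximum number of text positions in Whisper decoders

Shape = Tuple[int, ...]


def decoder_step_shapes(
    batch_size: int, n_new_tokens: int, cached_len: int, n_state: int
) -> Tuple[Shape, Optional[Shape], Shape]:
    """Return (tokens shape, incoming cache shape, outgoing cache shape)."""
    total_len = cached_len + n_new_tokens
    if total_len > N_TEXT_CTX:
        raise ValueError(f"sequence length {total_len} exceeds {N_TEXT_CTX}")
    tokens_shape = (batch_size, n_new_tokens)
    cache_in = None if cached_len == 0 else (batch_size, cached_len, n_state)
    cache_out = (batch_size, total_len, n_state)
    return tokens_shape, cache_in, cache_out


cached_len = 0
shapes = []
for n_new in (3, 1, 1):  # a 3-token prompt, then one token per step
    step = decoder_step_shapes(batch_size=5, n_new_tokens=n_new,
                               cached_len=cached_len, n_state=512)
    shapes.append(step)
    cached_len = step[2][1]  # the cache grows along the sequence axis

for step in shapes:
    print(step)
```

Only the sequence axis varies between calls, so it is the only dimension that needs a bound.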

### Preparing the Inference Pipeline

The image below illustrates the pipeline of video transcribing using the Whisper model.

![ch06_diagram01.png](https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter06/ch06_diagram01.png)

In the original PyTorch implementation of Whisper, running the transcription pipeline is as simple as calling the `model.transcribe(audio, **parameters)` function, which handles all the necessary steps internally.

We will reuse this pipeline and replace the original PyTorch encoder and decoder with their OpenVINO IR counterparts. This lets us benefit from the performance optimizations and hardware acceleration offered by OpenVINO while keeping the overall structure and functionality of the transcription pipeline.
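
The idea of swapping components behind an unchanged pipeline can be sketched without any model at all. Everything here is hypothetical and purely illustrative: `ToyPipeline`, `AcceleratedEncoder`, and the two reference functions are not the classes from `utils.py`.

```python
from typing import Callable, List


class ToyPipeline:
    """Stand-in for a model whose transcribe() calls its encoder and decoder."""

    def __init__(self, encoder: Callable, decoder: Callable):
        self.encoder = encoder
        self.decoder = decoder

    def transcribe(self, audio: List[float]) -> str:
        features = self.encoder(audio)
        return self.decoder(features)


class AcceleratedEncoder:
    """Wrapper with the same call signature that delegates to another backend."""

    def __init__(self, backend_fn: Callable):
        self._backend_fn = backend_fn

    def __call__(self, audio: List[float]) -> List[float]:
        return self._backend_fn(audio)


def reference_encoder(audio: List[float]) -> List[float]:
    return [round(sample, 1) for sample in audio]


def reference_decoder(features: List[float]) -> str:
    return f"{len(features)} frames"


pipeline = ToyPipeline(reference_encoder, reference_decoder)
before = pipeline.transcribe([0.11, 0.22, 0.33])

# Swap the component; transcribe() itself is untouched.
pipeline.encoder = AcceleratedEncoder(reference_encoder)
after = pipeline.transcribe([0.11, 0.22, 0.33])

print(before == after)
```

Because the replacement is called exactly like the original, the surrounding pipeline code does not need to change.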

In the following sections, we will dive deeper into each step of the pipeline and demonstrate how to integrate the OpenVINO models seamlessly.

### Selecting the Inference Device

One of the key advantages of the OpenVINO toolkit is its ability to optimize and run inference on a wide range of hardware devices, including CPUs, GPUs, and specialized accelerators. To harness this flexibility, we need to specify the target device on which we want to execute the inference pipeline.

In the code cell below, you will find a dropdown menu that allows you to select the desired inference device. The available options are dynamically populated based on the devices supported by your system and the installed OpenVINO runtime.

Choose a device from the dropdown list, considering factors such as performance, power consumption, and availability. The converted models are then compiled for, and executed on, the selected device.

By default, the "AUTO" option is selected, which lets OpenVINO choose the most suitable device based on the available hardware. You can override this by explicitly selecting a specific device from the list.
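
If you prefer to pick the device in code rather than through the dropdown, a simple preference list is one option. This is a sketch: `choose_device` is a hypothetical helper, and the device lists are examples of what `core.available_devices` might return on different machines.

```python
from typing import Iterable, Sequence


def choose_device(available: Sequence[str],
                  preferred: Iterable[str] = ("GPU", "CPU")) -> str:
    """Return the first preferred device that is available, else "AUTO"."""
    for name in preferred:
        for candidate in available:
            # Match "GPU" as well as indexed names such as "GPU.0".
            if candidate == name or candidate.startswith(name + "."):
                return candidate
    return "AUTO"


print(choose_device(["CPU", "GPU.0"]))  # GPU.0
print(choose_device(["CPU"]))           # CPU
print(choose_device([]))                # AUTO
```

The returned string can be passed wherever the notebook uses `device.value`.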


In [None]:
core = ov.Core()

In [None]:
import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

In [None]:
from utils import patch_whisper_for_ov_inference, OpenVINOAudioEncoder, OpenVINOTextDecoder

patch_whisper_for_ov_inference(model)

model.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV, device=device.value)
model.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV, device=device.value)

## Running the Video Transcription Pipeline

With the Whisper model converted to OpenVINO IR format and the inference device selected, we are now ready to run the video transcription pipeline on our chosen video.

For the purpose of this tutorial, we will demonstrate the transcription process using a video from YouTube. In the code cell below, you can enter the URL of the YouTube video you wish to transcribe. Please keep in mind that downloading the video may take some time, depending on the video's length and your internet connection speed.

Once the video URL is provided, the code will automatically download the video and save it to the local file system. The downloaded video file will serve as the input for the transcription pipeline.



In [None]:
import ipywidgets as widgets
# VIDEO_LINK = "https://youtu.be/kgL5LBM-hFI"
VIDEO_LINK = "https://youtu.be/5bs9XoTac88"
link = widgets.Text(
    value=VIDEO_LINK,
    placeholder="Type link for video",
    description="Video:",
    disabled=False
)

link

In [None]:
from pytube import YouTube
from pathlib import Path

print(f"Downloading video {link.value} started")

output_file = Path("downloaded_video.mp4")
yt = YouTube(link.value)
yt.streams.get_highest_resolution().download(filename=output_file)
print(f"Video saved to {output_file}")

In [None]:
from utils import get_audio

audio, duration = get_audio(output_file)

import ipywidgets as widgets
widgets.Video.from_file(output_file, loop=False, width=400, height=400)

Select the task for the model:

* **transcribe** - generate a transcription in the source language (detected automatically).
* **translate** - generate a transcription translated into English.

In [None]:
task = widgets.Select(
    options=["transcribe", "translate"],
    value="translate",
    description="Select task:",
    disabled=False
)
task

In [None]:
torch.cuda.is_available()

In [None]:
transcription = model.transcribe(audio, fp16=torch.cuda.is_available(), task=task.value)

The results will be saved in the `downloaded_video.srt` file. SRT is one of the most popular formats for storing subtitles and is compatible with many modern video players. The file can be loaded alongside the video during playback, or its subtitles can be embedded directly into the video file using `ffmpeg`.

In [None]:
from utils import prepare_srt

srt_lines = prepare_srt(transcription, filter_duration=duration)
# save transcription
with output_file.with_suffix(".srt").open("w") as f:
    f.writelines(srt_lines)
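
For reference, an SRT file is a sequence of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the subtitle text, and a blank line. The sketch below builds such cues from segments shaped like the entries in Whisper's `transcription["segments"]`. It is not the `prepare_srt` implementation from `utils.py`; `format_timestamp`, `segments_to_srt`, and the sample segments are hypothetical.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict]) -> List[str]:
    """Turn segments with start, end, and text keys into SRT lines."""
    lines = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"{index}\n")
        lines.append(f"{start} --> {end}\n")
        lines.append(f"{segment['text'].strip()}\n\n")
    return lines


example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 3661.042, "text": " Goodbye."},
]
example_srt = segments_to_srt(example_segments)
print("".join(example_srt))
```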

Now let us see the results.

In [None]:
print("".join(srt_lines))

In [None]:
# Create the /tmp/gradio directory if it does not already exist

!mkdir -p /tmp/gradio

In [None]:
import gradio as gr

# Define a function that Gradio will use to process inputs
def video_with_srt(t_video, t_srt):
    # The inputs are file paths typed into the text boxes, not uploaded files.
    # Return them as a (video, subtitles) pair for the video output component.
    return t_video, t_srt

# Create the Gradio interface
demo = gr.Interface(
    fn=video_with_srt,  # Pass the function reference
    inputs=[
        gr.Textbox(label="Video File Path"),
        gr.Textbox(label="SRT File Path")
    ],
    outputs="video",  # the returned (video, subtitles) pair is rendered by the video component
    examples=[['downloaded_video.mp4', 'downloaded_video.srt']],  # Example inputs
    allow_flagging="never"
)

try:
    demo.launch(debug=True)
except Exception as e:
    print(e)
    demo.launch(share=True, debug=True)


## Interactive Demo

To showcase the OpenVINO-optimized Whisper model for video transcription, we have created an interactive demo using the Gradio library. The demo lets you enter a YouTube video URL and select the desired task (transcribe or translate) directly from the user interface.

Behind the scenes, the demo application downloads the specified video, extracts the audio, and feeds it into the Whisper model for processing. Once processing finishes, the video is displayed together with the generated subtitles.

The code for the interactive demo is provided in the following cells. It defines the transcription function, creates the Gradio interface, and configures the input and output components.

Feel free to experiment with different videos and transcription tasks.



In [None]:
# Create the /tmp/gradio directory if it does not already exist

!mkdir -p /tmp/gradio


In [None]:
from pathlib import Path

import gradio as gr
import torch
from pytube import YouTube
from utils import get_audio, prepare_srt

def transcribe(url, task):
    output_file = Path("downloaded_video.mp4")
    yt = YouTube(url)
    yt.streams.get_highest_resolution().download(filename=output_file)
    audio, duration = get_audio(output_file)
    transcription = model.transcribe(audio, fp16=torch.cuda.is_available(), task=task.lower())
    srt_lines = prepare_srt(transcription, filter_duration=duration)
    with output_file.with_suffix(".srt").open("w") as f:
        f.writelines(srt_lines)
    return [str(output_file), str(output_file.with_suffix(".srt"))]


demo = gr.Interface(
    transcribe,
    [gr.Textbox(label="YouTube URL"), gr.Radio(["Transcribe", "Translate"], value="Transcribe")],
    "video",
    examples=[["https://youtu.be/5bs9XoTac88", "Translate"],
              ["https://youtu.be/kgL5LBM-hFI", "Transcribe"]],
    allow_flagging="never"
)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/