# Notebook 5.2 Speech Recognition

Speech recognition, also known as automatic speech recognition (ASR), is a technology that converts spoken words into written format or executes specific actions based on verbal commands. It involves machine learning models that analyze speech patterns, phonetics, and language structures to accurately transcribe and understand human speech.

[Whisper](https://openai.com/research/whisper), published by OpenAI, is a popular open-source model for both ASR and speech translation. This means that Whisper has the capability to transcribe speech in multiple languages and facilitate translation from those languages into English.

Due to its underlying Transformer-based encoder-decoder architecture, Whisper can be optimized effectively with BigDL-LLM INT4 optimizations. In this tutorial, we will guide you through building a speech recognition application on a BigDL-LLM optimized Whisper model that can transcribe audio files into text and translate them into English.

## 5.2.1 Install Packages

Follow the instructions in [Chapter 2](../ch_2_Environment_Setup/README.md) to set up your environment if you haven't done so. Then install bigdl-llm:

In [None]:
!pip install --pre --upgrade bigdl-llm[all]

Because this notebook processes audio files, you will also need to install the `librosa` package for audio analysis.

In [None]:
!pip install -U librosa
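
Before moving on, it can help to confirm that the packages are visible to the notebook kernel. The following is a minimal sketch using only the standard library. The list of import names is an assumption based on the imports used later in this notebook (`bigdl`, `transformers`, `torch`, `librosa`), and `missing_packages` is a hypothetical helper, not part of any of these libraries.

```python
import importlib.util

# Top-level import names assumed from the imports used later in this notebook.
REQUIRED = ["bigdl", "transformers", "torch", "librosa"]

def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```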

## 5.2.2 Download Audio Files

To begin, let's prepare some audio files. For example, you can download [an English sample](https://datasets-server.huggingface.co/assets/patrickvonplaten/librispeech_asr_dummy/--/clean/validation/2/audio/audio.wav) from the English audio dataset [librispeech_asr_dummy](https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_dummy) and [a Chinese sample](https://datasets-server.huggingface.co/assets/carlot/AIShell/--/692ef58020d79b21f54eb25b15a4813d4f9650d7/--/default/train/84/audio/audio.wav) from the Chinese audio dataset [AIShell](https://huggingface.co/datasets/carlot/AIShell). Both files were selected at random, so feel free to choose different audio files according to your preferences.
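
If you would rather fetch the two files from code than from a browser, the download could be sketched as below. This assumes the two links above are still reachable. `download_audio` is a hypothetical helper, and the local names `audio_en.wav` and `audio_ch.wav` are the ones the later cells read.

```python
import os
import urllib.request

# URLs copied from the links above; keys are the local file names used by later cells.
AUDIO_FILES = {
    "audio_en.wav": "https://datasets-server.huggingface.co/assets/patrickvonplaten/librispeech_asr_dummy/--/clean/validation/2/audio/audio.wav",
    "audio_ch.wav": "https://datasets-server.huggingface.co/assets/carlot/AIShell/--/692ef58020d79b21f54eb25b15a4813d4f9650d7/--/default/train/84/audio/audio.wav",
}

def download_audio(files=AUDIO_FILES, overwrite=False):
    """Download each URL to its local file name, skipping files that already exist."""
    for filename, url in files.items():
        if os.path.exists(filename) and not overwrite:
            print(f"{filename} already exists, skipping")
            continue
        urllib.request.urlretrieve(url, filename)
        print(f"saved {filename}")
```

Calling `download_audio()` would save both files into the current directory. Nothing is downloaded until the function is called.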

Here we rename the files to `audio_en.wav` and `audio_ch.wav` and put them in the current path. Once they are downloaded, you can play the audio:

In [1]:
import IPython

IPython.display.display(IPython.display.Audio(r"audio_en.wav"))
IPython.display.display(IPython.display.Audio(r"audio_ch.wav"))

## 5.2.3 Load Pretrained Whisper Model

Now, let's load a pretrained Whisper model, using [whisper-medium](https://huggingface.co/openai/whisper-medium) as an example. OpenAI has released pretrained Whisper models in various sizes (including [whisper-small](https://huggingface.co/openai/whisper-small), [whisper-tiny](https://huggingface.co/openai/whisper-tiny), etc.), allowing you to choose the one that best fits your requirements.

Use the one-line `transformers`-style API in `bigdl-llm` to load `whisper-medium` with INT4 optimizations (by specifying `load_in_4bit=True`) as follows. Note that the model class `AutoModelForSpeechSeq2Seq` is used for Whisper:

In [None]:
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(pretrained_model_name_or_path="openai/whisper-medium",
                                                  load_in_4bit=True)

## 5.2.4 Load Whisper Processor

A Whisper processor is also needed, both to pre-process the audio and to post-process model outputs from tokens into text. Use the official `transformers` API to load `WhisperProcessor`:

In [3]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(pretrained_model_name_or_path="openai/whisper-medium")

## 5.2.5 Transcribe English Audio

Once you have optimized the Whisper model using BigDL-LLM with INT4 optimization and loaded the Whisper processor, you are ready to begin transcribing the audio through model inference.

Let's start with the English audio file `audio_en.wav`. Before feeding it into the Whisper processor, we need to extract the sequence data from the raw speech waveform:

In [4]:
import librosa

data_en, sample_rate_en = librosa.load("audio_en.wav", sr=16000)

> **Note**
>
> For `whisper-medium`, its `WhisperFeatureExtractor` (part of `WhisperProcessor`) extracts features from audio using a 16,000Hz sampling rate by default. For accurate recognition, it's important to load the audio file at the same sampling rate that the model's `WhisperFeatureExtractor` uses.
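
To see whether a file's native sampling rate already matches the 16,000Hz mentioned in the note, the WAV header can be inspected with Python's standard `wave` module. This is a minimal sketch that assumes an uncompressed PCM WAV file. `native_sample_rate` and `needs_resampling` are hypothetical helper names introduced only for illustration.

```python
import wave

TARGET_RATE = 16000  # sampling rate expected by the feature extractor, per the note above

def native_sample_rate(path):
    """Read the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's native rate differs from the target rate."""
    return native_sample_rate(path) != target_rate
```

Since `librosa.load(..., sr=16000)` resamples on load, this check is informational. It only tells you whether resampling took place.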

We can then transcribe the audio file from the sequence data, in exactly the same way as with the official `transformers` API:

In [5]:
import torch

# define task type
forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

with torch.inference_mode():
    # extract input features for the Whisper model
    input_features = processor(data_en, sampling_rate=sample_rate_en, return_tensors="pt").input_features

    # predict token ids for transcription
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, max_new_tokens=200)

    # decode token ids into texts
    transcribe_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    print('-'*20, 'English Transcription', '-'*20)
    print(transcribe_str)

-------------------- English Transcription --------------------
[' He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind.']


> **Note**
>
> `forced_decoder_ids` defines the context tokens for the language and the task (transcribe or translate). If it is set to `None`, Whisper will automatically predict them.
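
To make the shape of these context tokens concrete, here is an illustrative sketch. It uses readable token strings in place of the integer ids that the real tokenizer returns, and covers only the two languages used in this notebook. `decoder_prompt` is a hypothetical helper, not part of `transformers`. It mirrors the `(position, token)` pairs that `get_decoder_prompt_ids` produces by default, where position 0 is taken by `<|startoftranscript|>`.

```python
# Token strings for the two languages used in this notebook (Whisper supports many more).
LANGUAGE_TOKENS = {"english": "<|en|>", "chinese": "<|zh|>"}
TASK_TOKENS = {"transcribe": "<|transcribe|>", "translate": "<|translate|>"}

def decoder_prompt(language, task):
    """Return (position, token) pairs; position 0 is reserved for <|startoftranscript|>."""
    forced = [LANGUAGE_TOKENS[language], TASK_TOKENS[task], "<|notimestamps|>"]
    return [(position, token) for position, token in enumerate(forced, start=1)]

print(decoder_prompt("chinese", "translate"))
```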


## 5.2.6 Transcribe Chinese Audio and Translate to English

Now let's move on to the Chinese audio `audio_ch.wav`. Whisper can transcribe multilingual audio and translate it into English. The only difference here is the context tokens specified through `forced_decoder_ids`:

In [7]:
# extract sequence data
data_zh, sample_rate_zh = librosa.load("audio_ch.wav", sr=16000)

# define Chinese transcribe task
forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task="transcribe")

with torch.inference_mode():
    input_features = processor(data_zh, sampling_rate=sample_rate_zh, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    transcribe_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    print('-'*20, 'Chinese Transcription', '-'*20)
    print(transcribe_str)

# define Chinese-to-English translation task
forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task="translate")

with torch.inference_mode():
    input_features = processor(data_zh, sampling_rate=sample_rate_zh, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, max_new_tokens=200)
    translate_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    print('-'*20, 'Chinese to English Translation', '-'*20)
    print(translate_str)

-------------------- Chinese Transcription --------------------
['这样能相对保障产品的质量']
-------------------- Chinese to English Translation --------------------
[' This can ensure the quality of the product relatively.']
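
The English and Chinese examples repeat the same four steps: build the context tokens, extract input features, generate token ids, and decode them. As a sketch under that observation, the steps could be wrapped in one function. `run_whisper` is a hypothetical helper name. It only calls methods already used in the cells above, and expects the `model` and `processor` objects loaded earlier to be passed in.

```python
def run_whisper(model, processor, audio, sampling_rate, language, task, max_new_tokens=200):
    """Run one Whisper pass and return the list of decoded strings."""
    # context tokens for the requested language and task
    forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
    # raw waveform -> model input features
    input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
    # input features -> predicted token ids
    predicted_ids = model.generate(input_features,
                                   forced_decoder_ids=forced_decoder_ids,
                                   max_new_tokens=max_new_tokens)
    # token ids -> text
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

With the variables from the cells above, calling `run_whisper(model, processor, data_zh, sample_rate_zh, "chinese", "translate")` inside a `with torch.inference_mode():` block would repeat the translation step.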


## 5.2.7 What's Next?

In the next chapter, we will explore using BigDL-LLM together with langchain, a framework designed for developing applications with language models. The langchain integration can simplify the application development process.