# Lesson 6: Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the task of transcribing recorded speech into text. Think meeting notes or automatically generated video subtitles.

For this task, you will learn to work with the Whisper model by OpenAI.

- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
``` 
    !pip install transformers
    !pip install -U datasets
    !pip install soundfile
    !pip install librosa
    !pip install gradio
```

The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. 
- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

### Data preparation

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("librispeech_asr",
                       split="train.clean.100",
                       streaming=True,
                       trust_remote_code=True)

In [None]:
example = next(iter(dataset))

In [None]:
dataset_head = dataset.take(5)
list(dataset_head)

In [None]:
list(dataset_head)[2]

In [None]:
example

In [None]:
from IPython.display import Audio as IPythonAudio

IPythonAudio(example["audio"]["array"],
             rate=example["audio"]["sampling_rate"])
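
Each example carries the raw waveform together with its sampling rate, which is enough to work out how long the clip is. A minimal sketch, using a synthetic stand-in dictionary (`fake_example` and `clip_duration_seconds` are hypothetical names, shaped like the `audio` field above) rather than a real LibriSpeech sample:

```python
import numpy as np

def clip_duration_seconds(audio_field):
    """Return the clip length in seconds: number of samples / samples per second."""
    return len(audio_field["array"]) / audio_field["sampling_rate"]

# Hypothetical stand-in for example["audio"]: 2 seconds of silence at 16 kHz.
fake_example = {"audio": {"array": np.zeros(32000), "sampling_rate": 16000}}

print(clip_duration_seconds(fake_example["audio"]))  # 2.0
```

With the real data, the same helper could be called as `clip_duration_seconds(example["audio"])`.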

### Build the pipeline

In [None]:
from transformers import pipeline

In [None]:
asr = pipeline(task="automatic-speech-recognition",
               model="distil-whisper/distil-small.en")

Info about [distil-whisper/distil-small.en](https://huggingface.co/distil-whisper)

In [None]:
asr.feature_extractor.sampling_rate

In [None]:
example['audio']['sampling_rate']

In [None]:
asr(example["audio"]["array"])

In [None]:
example["text"]
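
The two cells above let you compare the model's transcription with the reference text by eye. One common way to put a number on that comparison is the word error rate (WER). The sketch below is an illustration, not part of the lesson: `normalize` and `word_error_rate` are hypothetical helpers, and the strings are made up. The normalization step assumes the reference is uppercase without punctuation while the model output typically has casing and punctuation, so both are lowercased and stripped before comparing.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so casing/punctuation do not count as errors."""
    return re.sub(r"[^a-z0-9' ]", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Made-up strings for illustration, not real model output.
reference = "THE QUICK BROWN FOX JUMPS"
hypothesis = "The quick brown fox jumped."
print(word_error_rate(reference, hypothesis))  # 0.2
```

In the notebook, this could be called as `word_error_rate(example["text"], asr(example["audio"]["array"])["text"])`.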

### Build a shareable app with Gradio

### Troubleshooting Tip
- Note: in the classroom, you may see the code for creating the Gradio app run indefinitely.
  - This is specific to this classroom environment when it's serving many learners at once, and you wouldn't experience this issue if you ran this code on your own machine.
- To fix this, please restart the kernel (Menu Kernel->Restart Kernel) and re-run the code in the lab from the beginning of the lesson.

In [None]:
import os
import gradio as gr

In [None]:
demo = gr.Blocks()

In [None]:
def transcribe_speech(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(filepath)
    return output["text"]

In [None]:
mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")

To learn more about building apps with Gradio, you can check out the short course: [Building Generative AI Applications with Gradio](https://www.deeplearning.ai/short-courses/building-generative-ai-applications-with-gradio/), also taught by Hugging Face.

In [None]:
file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)

In [None]:
with demo:
    gr.TabbedInterface(
        [mic_transcribe,
         file_transcribe],
        ["Transcribe Microphone",
         "Transcribe Audio File"],
    )

demo.launch(share=True, 
            server_port=int(os.environ['PORT1']))
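
The `PORT1` environment variable is provided by the classroom. On your own machine it is probably not set, in which case `os.environ['PORT1']` raises a `KeyError`. A minimal sketch of a fallback, assuming you are happy to use Gradio's default port 7860 when the variable is missing (`resolve_port` is a hypothetical helper, not part of Gradio):

```python
import os

def resolve_port(env=os.environ, variable="PORT1", default=7860):
    """Return the port from the environment, or a default when the variable is unset."""
    return int(env.get(variable, default))

# With a dictionary standing in for the environment:
print(resolve_port({"PORT1": "8080"}))  # 8080
print(resolve_port({}))                 # 7860
```

The launch call could then be written as `demo.launch(share=True, server_port=resolve_port())`.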

In [None]:
demo.close()

## Note: Please stop the demo before continuing with the rest of the lab.
- The app will continue running unless you run
  ```Python
  demo.close()
  ```
- If you run another Gradio app (later in this lesson) without first closing this app, you'll see an error message:
  ```Python
  OSError: Cannot find empty port in range
  ```

* Testing with a longer audio file

In [None]:
import soundfile as sf
import io

In [None]:
audio, sampling_rate = sf.read('narration_example.wav')

In [None]:
sampling_rate

In [None]:
asr.feature_extractor.sampling_rate

In [None]:
asr(audio)

_Note:_ Running the cell above will raise an error, because the audio file has two channels (stereo) and the pipeline expects one:
```
ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline
```


* Convert the audio from stereo to mono (Using librosa)

In [None]:
audio.shape

In [None]:
import numpy as np

audio_transposed = np.transpose(audio)

In [None]:
audio_transposed.shape

In [None]:
import librosa

In [None]:
audio_mono = librosa.to_mono(audio_transposed)
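
Conceptually, the conversion averages the channels of each frame into a single value; the transpose above is needed because `librosa.to_mono` expects the channel axis first, whereas `soundfile` returns frames first. A minimal NumPy sketch of the same idea on a tiny made-up stereo array (the values are hypothetical, chosen only to make the averaging visible):

```python
import numpy as np

# Hypothetical stereo clip in soundfile's layout: (n_frames, n_channels).
stereo = np.array([[0.2, 0.4],
                   [1.0, 0.0],
                   [-0.5, 0.5]])

# Averaging over the channel axis gives one value per frame.
mono = stereo.mean(axis=1)

print(stereo.shape, "->", mono.shape)  # (3, 2) -> (3,)
print(mono)
```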

In [None]:
IPythonAudio(audio_mono,
             rate=sampling_rate)

In [None]:
asr(audio_mono)

_Warning:_ The cell above might produce a warning, or a poor transcription, because the sampling rate of the audio is not the same as the sampling rate the model expects.

Let's check and fix this!

In [None]:
sampling_rate

In [None]:
asr.feature_extractor.sampling_rate

In [None]:
audio_16KHz = librosa.resample(audio_mono,
                               orig_sr=sampling_rate,
                               target_sr=16000)
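
Resampling keeps the duration of the clip but changes how many samples represent each second. The sketch below illustrates that idea with plain linear interpolation. It is not what `librosa.resample` does internally (librosa uses proper band-limited resampling), and `naive_resample`, the 44.1 kHz input rate, and the test tone are all assumptions made for the illustration.

```python
import numpy as np

def naive_resample(signal, orig_sr, target_sr):
    """Linear-interpolation resampler: a rough illustration only, with no anti-aliasing filter."""
    duration = len(signal) / orig_sr
    n_target = int(round(duration * target_sr))
    # Time stamps (in seconds) of the original and the new samples.
    t_orig = np.arange(len(signal)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, signal)

# Hypothetical input: 1 second of a 5 Hz sine wave sampled at 44.1 kHz.
orig_sr, target_sr = 44100, 16000
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 5 * t)

resampled = naive_resample(tone, orig_sr, target_sr)
print(len(tone), "->", len(resampled))  # 44100 -> 16000
```

For real audio, stick with `librosa.resample` as in the cell above; the naive version can introduce aliasing when downsampling.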

In [None]:
asr(
    audio_16KHz,
    chunk_length_s=30, # 30 seconds
    batch_size=4,
    return_timestamps=True,
)["chunks"]
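
With `return_timestamps=True`, each entry in `"chunks"` is a dictionary with a `timestamp` pair (start and end, in seconds) and the `text` for that span. That is most of what a subtitle file needs. A minimal sketch that formats such a list in the SRT style; `format_timestamp` and `chunks_to_srt` are hypothetical helpers, and the chunks are made up rather than real model output:

```python
def format_timestamp(seconds):
    """Render seconds as HH:MM:SS,mmm (the SRT subtitle convention)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def chunks_to_srt(chunks):
    """Turn a list of {"timestamp": (start, end), "text": ...} dicts into SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # Assumption: if the end time is missing, fall back to the start time.
        end = start if end is None else end
        entries.append(f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n"
                       f"{chunk['text'].strip()}")
    return "\n\n".join(entries)

# Made-up chunks shaped like the pipeline output above; not real model output.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello and welcome."},
    {"timestamp": (2.5, 61.25), "text": " This is a longer narration."},
]
print(chunks_to_srt(example_chunks))
```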

* Build the Gradio interface.

In [None]:
import gradio as gr
demo = gr.Blocks()

In [None]:
def transcribe_long_form(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(
      filepath,
      max_new_tokens=256,
      chunk_length_s=30,
      batch_size=8,
    )
    return output["text"]
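
The `chunk_length_s=30` argument tells the pipeline to process long recordings in 30-second windows instead of all at once. The sketch below only shows how such window boundaries could be computed; the pipeline's own chunking and the way it merges text across windows are more involved. `chunk_boundaries` and its `overlap_s` parameter are hypothetical, introduced for this illustration.

```python
def chunk_boundaries(n_samples, sampling_rate, chunk_length_s=30, overlap_s=5):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_length_s * sampling_rate)
    step = chunk - int(overlap_s * sampling_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    boundaries = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        boundaries.append((start, end))
        if end == n_samples:
            break
        start += step
    return boundaries

# Hypothetical 70-second clip at 16 kHz.
for start, end in chunk_boundaries(70 * 16000, 16000):
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

For the 70-second clip this yields three windows (0s to 30s, 25s to 55s, 50s to 70s); the overlap gives neighbouring windows shared context so words at a boundary are less likely to be cut in half.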

In [None]:
mic_transcribe = gr.Interface(
    fn=transcribe_long_form,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")

file_transcribe = gr.Interface(
    fn=transcribe_long_form,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)

In [None]:
with demo:
    gr.TabbedInterface(
        [mic_transcribe,
         file_transcribe],
        ["Transcribe Microphone",
         "Transcribe Audio File"],
    )
demo.launch(share=True, 
            server_port=int(os.environ['PORT1']))

In [None]:
demo.close()

## Note: Please stop the demo before continuing with the rest of the lab.
- The app will continue running unless you run
  ```Python
  demo.close()
  ```
- If you run another Gradio app without first closing this app, you'll see an error message:
  ```Python
  OSError: Cannot find empty port in range
  ```

### Try it yourself!
- Try this model with your own audio files!

In [None]:
import soundfile as sf
import io

audio, sampling_rate = sf.read('narration_example.wav')

In [None]:
sampling_rate

In [None]:
asr.feature_extractor.sampling_rate