# OpenAI Whisper - Speech Recognition

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. In this notebook, we use Whisper to transcribe audio.

## 🛠️ Supported Hardware

This notebook can run on a CPU or a GPU.

✅ AMD Instinct™ Accelerators  
✅ AMD Radeon™ RX/PRO Graphics Cards  
✅ AMD EPYC™ Processors  
✅ AMD Ryzen™ (AI) Processors  

Suggested hardware: **AI PC powered by AMD Ryzen™ AI Processors**

## ⚡ Recommended Software Environment

::::{tab-set}

:::{tab-item} Linux
- [Install Docker container](https://amdresearch.github.io/aup-ai-tutorials//env/env-gpu.html)
- [Install PyTorch](https://amdresearch.github.io/aup-ai-tutorials//env/env-cpu.html)
:::

:::{tab-item} Windows
- [Install DirectML](https://amdresearch.github.io/aup-ai-tutorials//env/env-gpu-windows.html)
- [Install PyTorch](https://amdresearch.github.io/aup-ai-tutorials//env/env-cpu.html)
:::
::::

## 🎯 Goals

- Show you how to download a model from Hugging Face
- Run OpenAI Whisper on an AMD platform
- Get OpenAI Whisper to transcribe an audio file

:::{seealso}
- [Whisper](https://huggingface.co/openai/whisper-small)
- [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- [Whisper GitHub](https://github.com/openai/whisper)
:::

## 🚀 Run OpenAI Whisper on an AMD Platform

Import the necessary packages

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
from IPython.display import Audio

Load the model and processor from Hugging Face

In [None]:
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None

In [None]:
print(f'Model size: {model.num_parameters() * model.dtype.itemsize / 1024 / 1024:.2f} MB')
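
The size printed above is the parameter count multiplied by the number of bytes per parameter. As a rough cross-check that needs no model download, the sketch below applies the same formula to an assumed count of about 244 million parameters, the approximate figure the Whisper repository lists for the small model; the value reported by `model.num_parameters()` may differ slightly. The helper name `model_size_mb` is hypothetical.

```python
def model_size_mb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return the size of the weights in MB, using 1 MB = 1024 * 1024 bytes."""
    return num_parameters * bytes_per_parameter / 1024 / 1024


# Assumption: roughly 244 million parameters for whisper-small.
APPROX_PARAMS = 244_000_000

fp32_mb = model_size_mb(APPROX_PARAMS, 4)  # float32 stores 4 bytes per parameter
fp16_mb = model_size_mb(APPROX_PARAMS, 2)  # float16 stores 2 bytes per parameter
print(f"float32: {fp32_mb:.2f} MB, float16: {fp16_mb:.2f} MB")
```

Halving the bytes per parameter halves the memory footprint, which is why lower-precision weights are a common choice on memory-constrained devices.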

Let's load a test audio file

:::{note} Dataset Download Disclaimer

By executing the next cell, you will initiate the download of the dataset `hf-internal-testing/librispeech_asr_dummy`. Please note that this dataset may include content subject to third-party ownership or licensing restrictions. By proceeding, you acknowledge and agree to the following:
- You are solely responsible for reviewing and complying with any applicable terms of use, licenses, or permissions required by the dataset owner.
- If explicit permission is required from the original owner or provider, you must obtain that permission before using the dataset for any purpose, including research, analysis, or redistribution.
- AMD Inc. is not distributing the dataset and is providing a link solely for your convenience. AMD Inc. does not grant any rights to the dataset and disclaims all liability for misuse or unauthorized access.

If you are uncertain about the licensing or permission requirements, please consult the dataset documentation or contact the dataset owner directly.

:::

In [None]:
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
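
The cells in this notebook read two fields from each audio entry: `array` (the waveform) and `sampling_rate`. Whisper's feature extractor expects 16 kHz audio, so it can be worth checking the rate and the clip length before transcribing. The sketch below does this on a hypothetical stand-in sample so that it runs without downloading anything; `describe_sample` is an illustrative helper, not part of any library.

```python
def describe_sample(sample: dict) -> tuple[float, bool]:
    """Return (duration in seconds, whether the rate is the 16 kHz Whisper expects)."""
    duration = len(sample["array"]) / sample["sampling_rate"]
    return duration, sample["sampling_rate"] == 16_000


# Hypothetical stand-in for ds[0]["audio"]: two seconds of silence at 16 kHz.
fake_sample = {"array": [0.0] * 32_000, "sampling_rate": 16_000}

duration, rate_ok = describe_sample(fake_sample)
print(f"duration: {duration:.1f} s, sampled at 16 kHz: {rate_ok}")
```

Passing the real `sample` from the cell above to `describe_sample` should work the same way, assuming the entry supports `len()` on its `array` field.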

We use the `processor` to convert the raw waveform into the input features that the model expects

In [None]:
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
print(input_features)
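
The printed tensor is a log-Mel spectrogram with a fixed shape, regardless of how long the clip is, because the processor pads or trims every input to a 30-second window. The arithmetic below sketches where that shape comes from. The constants are the values I understand to be the defaults for `openai/whisper-small`; treat them as assumptions and confirm them against `processor.feature_extractor` and `input_features.shape` in your own session.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_SECONDS = 30      # every input is padded or trimmed to this length
HOP_LENGTH = 160        # samples between consecutive spectrogram frames
N_MELS = 80             # Mel bins assumed for whisper-small

n_samples = SAMPLING_RATE * CHUNK_SECONDS  # samples in one 30-second window
n_frames = n_samples // HOP_LENGTH         # spectrogram frames in that window
expected_shape = (1, N_MELS, n_frames)     # (batch, Mel bins, frames)
print(expected_shape)
```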

Let's get the model to generate the output tokens, which we then decode into text with the processor's `batch_decode` method

In [None]:
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

Compare the transcript with the actual audio

In [None]:
print(transcription)
Audio(data=sample['array'], rate=sample['sampling_rate'])
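
Listening is the quickest check, but a numeric comparison can help when you evaluate many clips. A common metric is the word error rate (WER): the word-level edit distance between a reference text and the transcript, divided by the number of reference words. The pure-Python sketch below is a minimal illustration, not a replacement for a dedicated evaluation library. It only lowercases the text, so punctuation differences still count as errors. If the dataset provides a reference transcript (I assume a `text` column, which you should verify), it could serve as the `reference` argument.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] is the edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / max(len(ref), 1)


# Toy example: the hypothesis drops one of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```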

Let's try with a different audio file

In [None]:
sample = ds[9]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

Compare the transcript with the actual audio

In [None]:
print(transcription)
Audio(data=sample['array'], rate=sample['sampling_rate'])
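
Both examples above repeat the same three steps: extract features, generate token IDs, and decode them. One way to avoid the repetition is to wrap the steps in a small function. The sketch below uses only the calls already shown in this notebook; the function name `transcribe` and its signature are hypothetical.

```python
def transcribe(sample, processor, model) -> str:
    """Run feature extraction, generation, and decoding for one audio sample."""
    input_features = processor(
        sample["array"],
        sampling_rate=sample["sampling_rate"],
        return_tensors="pt",
    ).input_features
    predicted_ids = model.generate(input_features)
    # batch_decode returns one string per batch item; a single sample yields one string
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]


# Usage with the objects defined earlier in this notebook:
# print(transcribe(ds[9]["audio"], processor, model))
```

Passing `processor` and `model` as arguments keeps the function free of global state, which makes it easier to swap in a different Whisper checkpoint.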

----------
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.

SPDX-License-Identifier: MIT