---
title: Run OpenAI Whisper Audio Model efficiently on Arm with Hugging Face Transformers

minutes_to_complete: 15

who_is_this_for: This Learning Path is for software developers, ML engineers, and anyone who wants to run the Whisper ASR model efficiently on Arm Neoverse-based CPUs and build speech transcription applications around it.

learning_objectives:
- Install the dependencies required to run the Whisper model.
- Run the OpenAI Whisper model using the Hugging Face Transformers framework.
- Run the whisper-large-v3-turbo model efficiently on an Arm CPU.
- Perform audio-to-text transcription with Whisper.
- Observe the total time taken to generate a transcript with Whisper.


prerequisites:
- An Amazon Graviton4 (or other Arm) compute instance with 32 cores, at least 8GB of RAM, and 32GB of disk space.
- Basic understanding of Python and ML concepts.
- Understanding of Whisper ASR model fundamentals.

author: Nobel Chowdary Mandepudi

### Tags
skilllevels: Intermediate
armips:
- Neoverse
subjects: ML
operatingsystems:
- Linux
tools_software_languages:
- Python
- Whisper
- AWS Graviton

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
134 changes: 134 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/whisper/whisper.md
---
# User change
title: "Setup the Whisper Model"

weight: 2

# Do not modify these elements
layout: "learningpathall"
---

## Before you begin

This Learning Path demonstrates how to run the whisper-large-v3-turbo model as an application that takes audio input and produces a text transcript of it. The instructions in this Learning Path are designed for Arm servers running Ubuntu 24.04 LTS. You need an Arm server instance with 32 cores, at least 8GB of RAM, and 32GB of disk space to run this example. The instructions have been tested on an AWS c8g.8xlarge instance.

## Overview

OpenAI Whisper is an open-source Automatic Speech Recognition (ASR) model trained on multilingual and multitask data, which enables transcript generation in multiple languages as well as translation from other languages into English. This Learning Path explores the foundational aspects of speech-to-text transcription applications, focusing on running OpenAI's Whisper on an Arm CPU, and covers the implementation and performance considerations required to deploy Whisper efficiently using the Hugging Face Transformers framework.

## Install dependencies

Install the following packages on your Arm based server instance:

```bash
sudo apt update
sudo apt install python3-pip python3-venv ffmpeg -y
```
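You can optionally confirm that `ffmpeg` installed correctly before continuing:

```bash
ffmpeg -version
```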

## Install Python dependencies

Create a Python virtual environment:

```bash
python3 -m venv whisper-env
```

Activate the virtual environment:

```bash
source whisper-env/bin/activate
```

Install the required libraries using pip:

```bash
pip install torch transformers accelerate
```
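You can optionally confirm the installed versions. As noted later in this Learning Path, the BF16 fast math kernels require a PyTorch version greater than 2.3.0:

```bash
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"
```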

## Download the sample audio file

Download a sample audio file, which is about 33 seconds of speech in .wav format, or use your own audio file:
```bash
wget https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
```
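If you use your own recording, note that Whisper models operate on 16 kHz mono audio; the Transformers pipeline resamples file input automatically through `ffmpeg`, but you can also convert the file up front. This is a minimal sketch with placeholder filenames:

```bash
# Placeholder filenames; replace with your own recording
ffmpeg -i my-recording.mp3 -ar 16000 -ac 1 my-recording.wav
```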

## Create a Python script for audio-to-text transcription

Create a Python file:

```bash
vim whisper-application.py
```

Write the following code in the `whisper-application.py` file:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import time

# Set the device to CPU and specify the torch data type
device = "cpu"
torch_dtype = torch.float32

# Specify the model name
model_id = "openai/whisper-large-v3-turbo"

# Load the model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)

# Move the model to the specified device
model.to(device)

# Load the processor for the model
processor = AutoProcessor.from_pretrained(model_id)

# Create a pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
)

# Record the start time of the inference
start_time = time.time()

# Perform speech recognition on the audio file
result = pipe("OSR_us_000_0010_8k.wav")

# Record the end time of the inference
end_time = time.time()

# Print the transcribed text
print(f'\n{result["text"]}\n')

# Calculate and print the total inference duration in seconds
duration = end_time - start_time
msg = f'\nInferencing elapsed time: {duration:4.2f} seconds\n'

print(msg)

```

## Use the Arm-specific flags

Set the following flags to enable fast math GEMM kernels, Linux Transparent Huge Page (THP) allocations, and verbose logging to confirm the kernels in use, and to set the LRU cache capacity and `OMP_NUM_THREADS`, so that Whisper runs efficiently on Arm machines:

```bash
export DNNL_DEFAULT_FPMATH_MODE=BF16
export THP_MEM_ALLOC_ENABLE=1
export LRU_CACHE_CAPACITY=1024
export OMP_NUM_THREADS=32
export DNNL_VERBOSE=1
```
{{% notice Note %}}
BF16 support is included in PyTorch versions greater than 2.3.0.
{{% /notice %}}
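Before relying on the BF16 fast math mode, you can optionally check that your CPU advertises BF16 support; on Arm Linux instances such as Graviton4 the feature typically shows up in the `lscpu` flags (the exact flag name depends on your kernel):

```bash
lscpu | grep -i bf16
```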
---
title: Run the Whisper Model
weight: 4

layout: learningpathall
---

## Run the Whisper file
After installing the dependencies and enabling the Arm-specific flags in the previous steps, you can now run the Whisper model and analyze its behavior.

Run the `whisper-application.py` file:

```bash
python3 whisper-application.py
```
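The first run downloads the whisper-large-v3-turbo weights from the Hugging Face Hub, so it takes noticeably longer than subsequent runs. You can check how much disk space the cached model uses; the path below is the default Hugging Face cache location and may differ if `HF_HOME` is set:

```bash
du -sh ~/.cache/huggingface
```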

## Output

You should see output similar to the image below, showing the kernel log (since verbose mode is enabled), the transcript of the audio, and the audio transcription time:
![Whisper output](whisper_output.png)

## Analyze

The log in the image above contains `attr-fpmath:bf16`, which confirms that the fast math BF16 kernels are used in the computation to improve performance.

It also shows the generated text transcript of the audio and the `Inferencing elapsed time`.

By enabling the Arm-specific flags as described in this Learning Path, you can see the performance uplift when running Whisper with the Hugging Face Transformers framework on Arm.
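To measure the uplift yourself, you can rerun the script with the fast math and verbose flags unset and compare the elapsed time. A minimal sketch:

```bash
# Baseline run without fast math kernels and verbose logging
unset DNNL_DEFAULT_FPMATH_MODE DNNL_VERBOSE
python3 whisper-application.py
```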