
WhisperX on AWS Fargate

A Dockerized transcription pipeline using WhisperX, originally intended for offline transcription and diarization of Zoom recordings.

The Docker container runs as an ETL job on AWS ECS with AWS Fargate to transcribe stored recordings. As this is a public sample, it omits the following critical details:

  • Downloading the mp4 recordings, or otherwise loading them into memory.
  • Saving the transcription metadata.

It would be straightforward to fork this repository and load the data from an appropriate source, e.g. S3 or other external storage, as in the sketch below.
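For example, a minimal loader using boto3 might look like the following (the bucket and key names are hypothetical, and this code is not part of the repository):

```python
# Hypothetical S3 loader: fetch a Zoom recording into a local temp file.
import tempfile

import boto3


def download_recording(bucket: str, key: str) -> str:
    """Download an mp4 recording from S3 and return the local file path."""
    s3 = boto3.client("s3")
    tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
    s3.download_fileobj(bucket, key, tmp)
    tmp.close()
    return tmp.name


local_path = download_recording("my-recordings-bucket", "zoom/meeting-001.mp4")
```

Saving the transcription metadata could be handled symmetrically, e.g. with boto3's upload_file.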

Installation

For local development:

  • Install Poetry
  • Install FFmpeg
  • Install dependencies using make: make install

Not all of the Python dependencies are listed in pyproject.toml, since it wasn't clear to the developers how to include some of the more complex dependencies there (e.g. PyTorch and WhisperX).

Execution

To re-deploy (because of changes to the Dockerfile, manifests, or code):

copilot deploy

To invoke the job manually:

copilot job run 

To view the most recent logs:

copilot job logs

Environment

Environment variables needed:

  • HF_TOKEN

To change their values during execution on AWS, you'll need to update the transcribe-recordings JSON secret in AWS Secrets Manager. Sample secret value: {"HF_TOKEN": "hf_exampleToken"}
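For example, with boto3 (the token value is the placeholder from above):

```python
import json

import boto3

client = boto3.client("secretsmanager")

# Overwrite the JSON secret that Copilot injects into the task environment.
client.put_secret_value(
    SecretId="transcribe-recordings",
    SecretString=json.dumps({"HF_TOKEN": "hf_exampleToken"}),
)
```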

For local development, place the key-value pairs in a .env file in this directory.
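Either way, the job code can read the token from the process environment. A minimal sketch, assuming the python-dotenv package (the repository may use a different mechanism):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Reads .env during local development; effectively a no-op on AWS, where
# Copilot injects the secret as an environment variable.
load_dotenv()
hf_token = os.environ["HF_TOKEN"]
```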

To add a new secret (see guide), you'll need to tag it with copilot-application and copilot-environment tags, with their values set to transcribe-recordings and transcribe-recordings-env respectively.
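A sketch of creating such a tagged secret with boto3 (the secret name and value are hypothetical; the tag values come from the paragraph above):

```python
import json

import boto3

client = boto3.client("secretsmanager")

# The copilot-* tags are what make the secret visible to the Copilot
# application and environment.
client.create_secret(
    Name="my-new-secret",  # hypothetical name for the new secret
    SecretString=json.dumps({"SOME_KEY": "some-value"}),
    Tags=[
        {"Key": "copilot-application", "Value": "transcribe-recordings"},
        {"Key": "copilot-environment", "Value": "transcribe-recordings-env"},
    ],
)
```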

The HuggingFace account behind the HF_TOKEN will need to have accepted the user agreements for the two pyannote models used for diarization (see Models below).

Models

This tool uses WhisperX for transcription, segmentation, and diarization. All of the models are hosted on the HuggingFace model hub.
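The end-to-end flow follows standard WhisperX usage: transcribe, align word timestamps, then diarize. A minimal sketch (model size, device, and batch size are illustrative, and the WhisperX API has shifted somewhat across versions):

```python
import os

import whisperx

device = "cpu"  # this deployment runs on CPU-only Fargate tasks
audio = whisperx.load_audio("recording.mp4")

# 1. Transcribe with a Whisper model.
model = whisperx.load_model("large-v2", device, compute_type="int8")
result = model.transcribe(audio, batch_size=4)

# 2. Align word-level timestamps with a language-specific alignment model.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize with the pyannote-based pipeline (requires an HF_TOKEN whose
#    account has accepted the pyannote model user agreements).
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=os.environ["HF_TOKEN"], device=device
)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```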

Development notes

The Docker image is currently ~2GB, thanks to the hefty Python dependencies. To reduce task start-up time, we might consider including the models (which are cached to data/models) in the built Docker image.
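A sketch of such a build-time warm-up script, assuming whisperx.load_model accepts a download_root for the cache location (a hypothetical helper, not part of this repository):

```python
# prefetch_models.py - run during `docker build` so the image ships with the
# Whisper weights already cached under data/models.
import whisperx

# The model size is illustrative; download_root (assumed here) controls where
# the underlying faster-whisper weights are stored.
whisperx.load_model("large-v2", "cpu", compute_type="int8", download_root="data/models")
```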

Important caveat: this basic example uses a single CPU core to transcribe audio. In our experiments, transcription runs 2-4x slower than real time. It should be straightforward to parallelize execution for batch processing of recordings, as sketched below.
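For instance, a batch job could fan out one worker process per recording using only the standard library; transcribe_one below is a hypothetical per-file entry point wrapping the WhisperX pipeline:

```python
from concurrent.futures import ProcessPoolExecutor


def transcribe_one(path: str) -> str:
    """Hypothetical entry point: run the full WhisperX pipeline on one file."""
    return f"transcript for {path}"  # placeholder for the real pipeline


if __name__ == "__main__":
    recordings = ["meeting-001.mp4", "meeting-002.mp4", "meeting-003.mp4"]

    # One worker per recording, bounded by the task's vCPUs (Fargate tops
    # out at 16).
    with ProcessPoolExecutor(max_workers=4) as pool:
        for transcript in pool.map(transcribe_one, recordings):
            print(transcript)
```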

Pros and Cons of this approach

I recommend this approach because it's easy, but there are downsides to adopting AWS Copilot. Due to a longstanding issue, EC2 instances are not supported. Fargate doesn't support GPU provisioning and has relatively low resource limits (16 vCPUs, up to 120GB of memory). As a result, it will be more expensive than a more efficient approach (like a properly-sized EC2 instance). Still, AWS Copilot + Fargate lets you get started quickly and is completely appropriate for transcribing relatively small volumes of recordings.

Contributors

Primary code contributor:

Other contributors:

  • Jionghao Lin (jionghao [at] cmu [dot] edu)