
πŸ€— on Trainium

Overview

A (not so deep) exploration of πŸ€— Transformers training on AWS Trainium.

Inspired by Julien Simon's post "Accelerate Transformer training with AWS Trainium" πŸ™Œ

What is AWS Trainium?

Update (October 2022) πŸ“’ Amazon EC2 Trn1 instances powered by AWS-designed Trainium chips are now generally available

Update (November 2022) πŸ“’ Amazon SageMaker now supports ml.trn1 instances for model training

Update (May 2023) πŸ“’ Amazon SageMaker now supports ml.inf2 and ml.trn1 instances for model deployment

AWS Trainium is a second-generation ML chip purpose-built for training state-of-the-art deep learning models.

Trainium-powered Amazon EC2 Trn1 instances deliver high performance for deep learning training while providing up to 50% cost-to-train savings over comparable GPU-based P4d instances.

Using the AWS Neuron SDK, which integrates with popular frameworks like TensorFlow, PyTorch and Apache MXNet, anyone can start using AWS Trainium by changing just a few lines of code.
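
For PyTorch, those few lines usually amount to requesting an XLA device and marking step boundaries. The snippet below is a minimal sketch (not code from this repository), assuming the PyTorch Neuron stack, whose torch-xla backend exposes NeuronCores as XLA devices; the toy model and dummy batch are placeholders for illustration only.

    # Minimal sketch: a toy training step on a NeuronCore via torch-xla
    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                    # a NeuronCore instead of "cuda" / "cpu"
    model = torch.nn.Linear(128, 2).to(device)  # toy model, for illustration only
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    inputs = torch.randn(8, 128).to(device)     # dummy batch
    labels = torch.randint(0, 2, (8,)).to(device)

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    optimizer.step()
    xm.mark_step()                              # execute the lazily-recorded XLA graph on the device
    print(loss.item())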

Setup

  1. Provision resources using Terraform.

    # It will take a few minutes for the Trn1 instance to be fully configured βŒ›
    cd infra
    terraform init -upgrade -backend-config="config.s3.tfbackend"
    terraform plan
    terraform apply
    
    # ⚠️ Clean up after yourself - don't forget to destroy all resources when you're done!
    terraform destroy
  2. Connect to the Trn1 instance using EC2 Instance Connect.

    Update (June 2023) πŸ“’ You can now connect to an EC2 instance via an EC2 Instance Connect (EIC) Endpoint, so there's no need to create bastion hosts to tunnel SSH/RDP connections to instances with private IP addresses

    # For information on how to set up EC2 Instance Connect, see
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-set-up.html
    # The mssh command is provided by the EC2 Instance Connect CLI (python3 -m pip install ec2instanceconnectcli)
    export AWS_DEFAULT_REGION=$(terraform output -raw region)
    mssh ec2-user@$(terraform output -raw trainium_instance)
  3. Run a training job.

    # Clone this repository to access the training scripts
    git clone https://github.com/JGalego/hf-on-trainium
    cd hf-on-trainium/examples
    
    # Install dependencies
    python3 -m pip install -r requirements.txt
    
    # CPU/GPU 🐌
    python3 original.py
    
    # Trainium (Single Core) ⚑
    python3 trainium_single.py
    
    # Trainium (Distributed) ⚑⚑⚑
    # (a sketch of what a distributed script could look like appears after these steps)
    export TOKENIZERS_PARALLELISM=false  # disable tokenizer parallelism to avoid hidden deadlocks
    export N_PROCS_PER_NODE=2            # either 1, 2, 8 or a multiple of 32
    torchrun --nproc_per_node=$N_PROCS_PER_NODE trainium_distributed.py
  4. Monitor the training job.

    # Track Neuron environment activity
    neuron-top
    
    # Install and start TensorBoard
    python3 -m pip install tensorboard
    tensorboard --logdir runs --port 8080
    
    # Open another terminal window
    # SSH tunnel to TensorBoard
    mssh -NL 8080:localhost:8080 ec2-user@$(terraform output -raw trainium_instance)
    # and head over to http://localhost:8080
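
The training scripts themselves live under examples/ and are not reproduced here. As a rough idea of what the distributed variant launched in step 3 could look like, here is a hedged sketch assuming torchrun spawns one process per NeuronCore and that torch.distributed uses the xla backend registered by torch_xla.distributed.xla_backend; the toy model and data are placeholders, not the repository's code.

    # Hedged sketch of a torchrun-launched training loop on multiple NeuronCores
    # (placeholders only; see examples/trainium_distributed.py for the real script)
    import torch
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the "xla" backend with torch.distributed

    dist.init_process_group("xla")            # torchrun provides rank/world size via env vars
    device = xm.xla_device()                  # each process gets its own NeuronCore

    model = torch.nn.Linear(128, 2).to(device)   # toy model, for illustration only
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    for step in range(10):
        inputs = torch.randn(8, 128).to(device)  # dummy batch
        labels = torch.randint(0, 2, (8,)).to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        xm.optimizer_step(optimizer)          # all-reduces gradients across NeuronCores, then steps
        if xm.is_master_ordinal():
            print(f"step {step}: loss {loss.item():.4f}")

A script like this would be launched exactly as in step 3, e.g. torchrun --nproc_per_node=$N_PROCS_PER_NODE sketch.py.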
