
πŸ€— on Trainium

Overview

A (not so deep) exploration of πŸ€— Transformers training on AWS Trainium.

Inspired by Julien Simon's post "Accelerate Transformer training with AWS Trainium" πŸ™Œ

What is AWS Trainium?

Update (October 2022) πŸ“’ Amazon EC2 Trn1 instances powered by AWS-designed Trainium chips are now generally available

Update (November 2022) πŸ“’ Amazon SageMaker now supports ml.trn1 instances for model training

Update (May 2023) πŸ“’ Amazon SageMaker now supports ml.inf2 and ml.trn1 instances for model deployment

AWS Trainium is a second-generation ML chip purpose-built for training state-of-the-art deep learning models.

Trainium-powered Amazon EC2 Trn1 instances deliver high performance for deep learning training while providing up to 50% cost-to-train savings over comparable GPU-based P4d instances.

Using the AWS Neuron SDK, which integrates with popular frameworks like TensorFlow, PyTorch and Apache MXNet, anyone can start using AWS Trainium by changing just a few lines of code.
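
For PyTorch, those few lines usually amount to requesting an XLA device and marking step boundaries. The snippet below is a minimal sketch (not code from this repository), assuming the PyTorch Neuron stack, whose torch-xla backend exposes NeuronCores as XLA devices; the toy model and dummy batch are placeholders for illustration only.

    # Minimal sketch: a toy training step on a NeuronCore via torch-xla
    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                    # a NeuronCore instead of "cuda" / "cpu"
    model = torch.nn.Linear(128, 2).to(device)  # toy model, for illustration only
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    inputs = torch.randn(8, 128).to(device)     # dummy batch
    labels = torch.randint(0, 2, (8,)).to(device)

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    optimizer.step()
    xm.mark_step()                              # execute the lazily-recorded XLA graph on the device
    print(loss.item())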

Setup

  1. Provision resources using Terraform.

    # It will take a few minutes for the Trn1 instance to be fully configured βŒ›
    cd infra
    terraform init -upgrade -backend-config="config.s3.tfbackend"
    terraform plan
    terraform apply
    
    # ⚠️ Clean up after yourself - don't forget to destroy all resources when you're done!
    terraform destroy
  2. Connect to the Trn1 instance using EC2 Instance Connect.

    Update (June 2023) πŸ“’ You can now connect to an EC2 instance via an EC2 Instance Connect (EIC) Endpoint, so there's no need to create bastion hosts to tunnel SSH/RDP connections to instances with private IP addresses

    # For information on how to set up EC2 Instance Connect, see
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-set-up.html
    # The mssh command is provided by the EC2 Instance Connect CLI (python3 -m pip install ec2instanceconnectcli)
    export AWS_DEFAULT_REGION=$(terraform output -raw region)
    mssh ec2-user@$(terraform output -raw trainium_instance)
  3. Run a training job.

    # Clone this repository to access the training scripts
    git clone https://github.com/JGalego/hf-on-trainium
    cd hf-on-trainium/examples
    
    # Install dependencies
    python3 -m pip install -r requirements.txt
    
    # CPU/GPU 🐌
    python3 original.py
    
    # Trainium (Single Core) ⚑
    python3 trainium_single.py
    
    # Trainium (Distributed) ⚑⚑⚑
    # (a sketch of what a distributed script could look like appears after these steps)
    export TOKENIZERS_PARALLELISM=false  # disable tokenizer parallelism to avoid hidden deadlocks
    export N_PROCS_PER_NODE=2            # either 1, 2, 8 or a multiple of 32
    torchrun --nproc_per_node=$N_PROCS_PER_NODE trainium_distributed.py
  4. Monitor the training job.

    # Track Neuron environment activity
    neuron-top
    
    # Install and start TensorBoard
    python3 -m pip install tensorboard
    tensorboard --logdir runs --port 8080
    
    # Open another terminal window
    # SSH tunnel to TensorBoard
    mssh -NL 8080:localhost:8080 ec2-user@$(terraform output -raw trainium_instance)
    # and head over to http://localhost:8080
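
The training scripts themselves live under examples/ and are not reproduced here. As a rough idea of what the distributed variant launched in step 3 could look like, here is a hedged sketch assuming torchrun spawns one process per NeuronCore and that torch.distributed uses the xla backend registered by torch_xla.distributed.xla_backend; the toy model and data are placeholders, not the repository's code.

    # Hedged sketch of a torchrun-launched training loop on multiple NeuronCores
    # (placeholders only; see examples/trainium_distributed.py for the real script)
    import torch
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the "xla" backend with torch.distributed

    dist.init_process_group("xla")            # torchrun provides rank/world size via env vars
    device = xm.xla_device()                  # each process gets its own NeuronCore

    model = torch.nn.Linear(128, 2).to(device)   # toy model, for illustration only
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    for step in range(10):
        inputs = torch.randn(8, 128).to(device)  # dummy batch
        labels = torch.randint(0, 2, (8,)).to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        xm.optimizer_step(optimizer)          # all-reduces gradients across NeuronCores, then steps
        if xm.is_master_ordinal():
            print(f"step {step}: loss {loss.item():.4f}")

A script like this would be launched exactly as in step 3, e.g. torchrun --nproc_per_node=$N_PROCS_PER_NODE sketch.py.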
