A (not so deep) exploration of 🤗 Transformers training on AWS Trainium.
Inspired by Julien Simon's post on how to Accelerate Transformer training with AWS Trainium 🚀
- **Update (October 2022)** 📢 Amazon EC2 Trn1 instances powered by AWS-designed Trainium chips are now generally available
- **Update (November 2022)** 📢 Amazon SageMaker now supports `ml.trn1` instances for model training
- **Update (May 2023)** 📢 Amazon SageMaker now supports `ml.inf2` and `ml.trn1` instances for model deployment
AWS Trainium is a second-generation ML chip optimized for training state-of-the-art models. Trainium-powered Amazon EC2 Trn1 instances deliver high-performance deep learning training at up to 50% lower cost-to-train than comparable GPU-based P4d instances. Using the AWS Neuron SDK, which integrates with popular frameworks like TensorFlow, PyTorch and Apache MXNet, anyone can start using AWS Trainium by changing just a few lines of code.
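To give a feel for what "a few lines of code" means in practice, here is a minimal sketch (not taken from this repo's scripts) of moving a vanilla PyTorch training loop onto a NeuronCore with torch-xla, which the Neuron SDK ships as part of `torch-neuronx`. The model, data and hyperparameters are placeholders:

```python
import torch
import torch_xla.core.xla_model as xm

# The Trainium-specific changes: grab an XLA device instead of "cuda"
# and let torch-xla step the optimizer (which also flushes the lazy graph).
device = xm.xla_device()

model = torch.nn.Linear(10, 2).to(device)               # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(8, 10).to(device)                   # dummy batch for illustration
    y = torch.randint(0, 2, (8,)).to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer, barrier=True)           # instead of optimizer.step()
```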
- Provision resources using Terraform.

```bash
# It will take a few minutes for the Trn1 instance to be fully configured ⏳
cd infra
terraform init -upgrade -backend-config="config.s3.tfbackend"
terraform plan
terraform apply

# ⚠️ Clean up after yourself - don't forget to destroy all resources when you're done!
terraform destroy
```
- Connect to the Trn1 instance using EC2 Instance Connect.

```bash
# For information on how to set up EC2 Instance Connect, see
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-set-up.html
export AWS_DEFAULT_REGION=$(terraform output -raw region)
mssh ec2-user@$(terraform output -raw trainium_instance)
```
- Run training job.

```bash
# Clone this repository to access the training scripts
git clone https://github.com/JGalego/hf-on-trainium
cd hf-on-trainium/examples

# Install dependencies
python3 -m pip install -r requirements.txt

# CPU/GPU 🐢
python3 original.py

# Trainium (Single Core) ⚡
python3 trainium_single.py

# Trainium (Distributed) ⚡⚡⚡
export TOKENIZERS_PARALLELISM=false  # disabling parallelism to avoid hidden deadlocks
export N_PROCS_PER_NODE=2            # either 1, 2, 8 or a multiple of 32
torchrun --nproc_per_node=$N_PROCS_PER_NODE trainium_distributed.py
```
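For context on what a distributed entry point like `trainium_distributed.py` does under the hood, here is a minimal, hypothetical sketch following the standard torch-xla pattern for Trainium (the actual script may differ); the model and data are placeholders:

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401 - registers the "xla" backend

# torchrun sets RANK, WORLD_SIZE and MASTER_ADDR/PORT for each process
dist.init_process_group("xla")

device = xm.xla_device()                             # one NeuronCore per process
model = torch.nn.Linear(10, 2).to(device)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(8, 10).to(device)                # dummy batch for illustration
    y = torch.randint(0, 2, (8,)).to(device)

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # All-reduces gradients across NeuronCores before stepping the optimizer
    xm.optimizer_step(optimizer)
    xm.mark_step()                                   # flush the lazy XLA graph
```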
- Monitor training job.

```bash
# Track Neuron environment activity
neuron-top

# Install and start TensorBoard
python3 -m pip install tensorboard
tensorboard --logdir runs --port 8080

# Open another terminal window
# SSH tunnel to TensorBoard
mssh -NL 8080:localhost:8080 ec2-user@$(terraform output -raw trainium_instance)
# and head over to http://localhost:8080
```
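TensorBoard picks up whatever event files the training scripts write under `runs`. If you want to log your own metrics, a minimal `SummaryWriter` sketch looks like this (the run name and metric are made up for illustration):

```python
from torch.utils.tensorboard import SummaryWriter

# Writes event files under runs/, the directory TensorBoard is pointed at above
writer = SummaryWriter(log_dir="runs/trainium-demo")  # hypothetical run name

for step in range(100):
    fake_loss = 1.0 / (step + 1)                      # placeholder metric
    writer.add_scalar("train/loss", fake_loss, step)

writer.close()
```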
- AWS Trainium
- Amazon EC2 Trn1 instances
- AWS Neuron SDK documentation
- AWS Neuron SDK samples
- PyTorch on XLA devices
- AWS On Air feat. Silicon Innovation: Trainium and Inferentia
- Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available
- (HuggingFace) Tutorial: Fine-tune a pretrained model
- (Julien Simon) Video: Accelerate Transformer training with AWS Trainium