
aws cloud deployment improvements #2618

Merged (6 commits, Jun 14, 2024)

Conversation

dirkpetersen
Contributor

@dirkpetersen dirkpetersen commented Jun 6, 2024

The following changes for AWS cloud deployment have been tested with 6 configurations, including 3 with a T4 GPU (g4dn.xlarge), across 3 versions of Ubuntu:

t2.small:
20.04: ami-04bad3c587fe60d89
22.04: ami-03c983f9003cb9cd1
24.04: ami-0406d1fdd021121cd

g4dn.xlarge:
20.04: ami-04bad3c587fe60d89
22.04: ami-03c983f9003cb9cd1
24.04: ami-0406d1fdd021121cd

Note: nvflare installs on 24.04 but is not currently supported there

  • changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
  • added g4dn.xlarge as an option to prompt EC2_TYPE
  • fixed typo in prompt REGION
  • increase the default block device size by 8GB
  • install apt package nvidia-driver-535-server if GPU found
  • run modprobe nvidia to avoid reboot if GPU found
  • add ~/.local/bin to PATH
  • add --break-system-packages to pip install (required by Python 3.12)
  • add --no-cache-dir to pip install to avoid disk space issues
  • add @reboot cronjob to ensure nvflare is restarted after a server (re)start (see the sketch below)
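
The GPU, PATH, pip, and cron changes above might look roughly like the following sketch (the paths, the pip target, and the cron command are illustrative assumptions, not the literal aws_start.sh contents):

```bash
#!/bin/bash
# Illustrative sketch of the provisioning steps listed above;
# not the exact aws_start.sh implementation.

# Install and load the NVIDIA driver without a reboot when a GPU is present.
if lspci | grep -qi nvidia; then
    sudo apt-get update
    sudo apt-get install -y nvidia-driver-535-server
    sudo modprobe nvidia          # load the module now instead of rebooting
fi

# pip --user places console scripts here.
export PATH="$HOME/.local/bin:$PATH"

# --no-cache-dir keeps the pip cache off the small root volume;
# --break-system-packages is needed on Python 3.12 images (PEP 668).
pip install --user --no-cache-dir --break-system-packages nvflare

# Restart NVFlare after an instance (re)start; the startup path below
# is a placeholder, not the real location used by the script.
(crontab -l 2>/dev/null; echo "@reboot /var/tmp/cloud/startup/start.sh") | crontab -
```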

Fixes # .

Description

The critical change here is the disk size increase of 8GB. This leaves all tested configurations with between 30% and 60% free disk space. The default install can run out of disk space if the NVIDIA drivers, PyTorch, and a few other packages need to be installed.
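
For illustration, the disk size adjustment might be done with AWS CLI calls along these lines (a sketch assuming the root volume size is raised at launch; variable names are made up, and key pair / security group arguments are omitted):

```bash
# Read the AMI's root device name and default volume size, then launch
# the instance with 8GB added to the root volume (illustrative only).
AMI_IMAGE=ami-03c983f9003cb9cd1
ROOT_DEV=$(aws ec2 describe-images --image-ids "$AMI_IMAGE" \
  --query 'Images[0].RootDeviceName' --output text)
DEFAULT_GB=$(aws ec2 describe-images --image-ids "$AMI_IMAGE" \
  --query 'Images[0].BlockDeviceMappings[0].Ebs.VolumeSize' --output text)
NEW_GB=$((DEFAULT_GB + 8))

aws ec2 run-instances \
  --image-id "$AMI_IMAGE" \
  --instance-type g4dn.xlarge \
  --block-device-mappings "[{\"DeviceName\":\"$ROOT_DEV\",\"Ebs\":{\"VolumeSize\":$NEW_GB}}]"
```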

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@IsaacYangSLA
Collaborator

@dirkpetersen, thanks for the PR. It improves the template quite a lot. However, we may need to have an internal discussion on the deployment scenario. We know different jobs require different dependencies; sometimes PyTorch is one of them, and the deployment then needs larger volumes. We also see numpy-only jobs.
I will approve this PR. Later, we might have additional PRs that change this one after our internal discussions reach conclusions.

Collaborator

@IsaacYangSLA IsaacYangSLA left a comment


Good PR. It covers some scenarios we did not consider previously.

@dirkpetersen
Contributor Author

dirkpetersen commented Jun 7, 2024

Thanks @IsaacYangSLA, this helps us a lot, as we no longer have to run a patched version. Another thing to consider is that this will not run on GPU systems that use RHEL-based images such as Amazon Linux, only Ubuntu. The part of the AI/ML community that does not use Ubuntu seems to be small, though.

@IsaacYangSLA
Collaborator

/build

@IsaacYangSLA IsaacYangSLA enabled auto-merge (rebase) June 11, 2024 01:09
@IsaacYangSLA
Collaborator

/build

@YuanTingHsieh
Collaborator

/build

@YuanTingHsieh
Collaborator

/build

@dirkpetersen
Contributor Author

dirkpetersen commented Jun 13, 2024

@IsaacYangSLA, as I was working with other AWS regions, I noticed that some of our users would struggle to look up an appropriate AMI ID without help from a cloud engineer, since these IDs differ per region. Extensive testing also triggered a number of fine-tuning improvements that I would like to contribute. I assume it might be better to let this pull request go through and then create a new one afterwards, or is it better to add to this one? Looking for your guidance.

I have made the following changes:

  • get exact GPU requirements from ../local/resources.json.[default]
  • add a function to propose the best suitable instance type (find_ec2_gpu_instance_type)
  • check the environment to get the AWS region (default still us-west-2)
  • propose the newest image ID per region based on a pattern, e.g. "ubuntu-*-22.04-amd64-pro-server"
  • add an AWS_PROFILE hint to allow for a different AWS profile
  • call "aws sts get-caller-identity" to see if the SSO token is still active
  • change the 'prompt' command to read -e -i -p for better prompt editing
  • change ssh commands to single quotes for readability and to avoid escaping characters
  • clarify the requirements.txt instruction
  • set --region for all AWS commands to allow multi-region support
  • add suffixes to KEY_FILE, nvflare.log, etc. to allow multiple deployments in parallel
  • install additional OS packages (python3-dev & gcc) to allow ARM packages without wheels
  • use the driver-550-server OS package for >= 22.04 and driver-535-server for <= 20.04
  • separate OS and user packages to allow for a failed install of OS packages
  • start OS and user package installs in parallel but wait twice for the OS package install
  • support waiting for the build of nvidia.ko as well as nvidia.ko.zst (for newer OS releases)
  • improve the final message to allow copy/paste of the ssh and tail commands
  • switch the default instance type to t4g.small if the image name pattern contains *arm64*
  • change control structures to "if [ something ]; then" for readability

The last one is, of course, optional, depending on your team's coding preferences. A sketch of the region/AMI/instance-type discovery flow is shown below.
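
Roughly, the discovery flow above might look like this (a simplified sketch; the resources.json keys, the Canonical owner ID, and the instance-type filter are assumptions rather than the exact aws_start.sh code):

```bash
# Region from the environment, default still us-west-2.
REGION=${AWS_DEFAULT_REGION:-us-west-2}

# Fail early if the SSO token / credentials have expired.
if ! aws sts get-caller-identity >/dev/null 2>&1; then
  echo "AWS credentials missing or expired; try: export AWS_PROFILE=your-profile-name"
  exit 1
fi

# Editable prompt with a prefilled default (read -e -i -p).
read -e -i "ubuntu-*-22.04-amd64-pro-server" -p "* Cloud AMI image name: " AMI_NAME

# Newest AMI in this region matching the name pattern (099720109466 = Canonical).
AMI_IMAGE=$(aws ec2 describe-images --region "$REGION" --owners 099720109466 \
  --filters "Name=name,Values=${AMI_NAME}" "Name=state,Values=available" \
  --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)

# GPU requirements from the site configuration (file path and key names assumed).
NUM_GPUS=$(jq -r '.components[] | select(.id=="resource_manager") | .args.num_of_gpus // 0' \
  ../local/resources.json.default)
GPU_MEM_MIB=$(( $(jq -r '.components[] | select(.id=="resource_manager") | .args.mem_per_gpu_in_GiB // 0' \
  ../local/resources.json.default) * 1024 ))

# Propose the smallest GPU instance type that satisfies the requirements.
find_ec2_gpu_instance_type() {
  aws ec2 describe-instance-types --region "$REGION" \
    --filters "Name=instance-type,Values=g4dn.*,g5.*,g6.*" \
    --query 'InstanceTypes[].{t:InstanceType,g:GpuInfo.Gpus[0].Count,m:GpuInfo.TotalGpuMemoryInMiB,v:VCpuInfo.DefaultVCpus}' \
    --output json |
    jq -r --argjson gpus "$1" --argjson mem "$2" \
      '[.[] | select(.g >= $gpus and .m >= $mem)] | sort_by(.v) | .[0].t'
}

EC2_TYPE=$(find_ec2_gpu_instance_type "$NUM_GPUS" "$GPU_MEM_MIB")
echo "region = $REGION, EC2 type = $EC2_TYPE, ami image = $AMI_IMAGE"
```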

I tested the following combinations over the last couple of days; only the 24.04 ARM option seems to fail, with both NVIDIA drivers 535 and 550.

[image: tested combinations]

The new UI would look like this:

startup/start.sh --cloud aws
This script requires aws (AWS CLI), sshpass, dig and jq.  Now checking if they are installed.
Checking if aws exists. => found
Checking if sshpass exists. => found
Checking if dig exists. => found
Checking if jq exists. => found
Note: run this command first for a different AWS profile:
  export AWS_PROFILE=your-profile-name.

Checking AWS identity ...

* Cloud EC2 region, press ENTER to accept default: us-east-2
* Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): ubuntu-*-22.04-amd64-pro-server
    retrieving AMI ID for ubuntu-*-22.04-amd64-pro-server ... ami-0a7e9bed072bb379b found
    finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g6.xlarge found
* Cloud EC2 type, press ENTER to accept default: g6.xlarge
* Cloud AMI image id, press ENTER to accept default: ami-0a7e9bed072bb379b
region = us-east-2, EC2 type = g6.xlarge, ami image = ami-0a7e9bed072bb379b , OK? (Y/n)
If the client requires additional Python packages, please add them to:
    /home/dp/NVFlare/dirk/Test/AWS-T4.X/startup/requirements.txt
Press ENTER when it's done or no additional dependencies.

Checking if default VPC exists
Default VPC found
Generating key pair for VM
Creating VM at region us-east-2, this may take a few minutes ...
VM created with IP address: 52.14.44.113
Copying files to nvflare_client
Destination folder is ubuntu@52.14.44.113:/var/tmp/cloud
Installing os packages as root in the background, this may take a few minutes ...
Installing user space packages in the background, this may take a few minutes ...
System was provisioned, packages may continue to install in the background.
To terminate the EC2 instance, run the following command:
  aws ec2 terminate-instances --region us-east-2 --instance-ids i-0837e105f2661a4e3
Other resources provisioned
security group: nvflare_client_sg_2036
key pair: NVFlareClientKeyPair
review install progress:
  tail -f /tmp/nvflare-aws-YGR.log
login to instance:
  ssh -i /home/dirk/AWS-T4.X/NVFlareClientKeyPair_i-0837e105f2661a4e3.pem ubuntu@52.14.44.113

Here is the current aws_start.sh script I am using:
https://raw.githubusercontent.com/dirkpetersen/nvflare-cancer/main/aws_start.sh

@chesterxgchen
Collaborator

The part of the AI/ML community that does not use Ubuntu seems to be small, though.


Thank you so much @dirkpetersen for your contribution; we will keep RHEL in mind for future releases.

@chesterxgchen
Collaborator


@dirkpetersen, we really appreciate your contribution and detailed tests.

@@ -32,7 +32,7 @@ aws_start_sh: |
      EC2_TYPE=t2.xlarge
      REGION=us-west-2
    else
-     AMI_IMAGE=ami-04bad3c587fe60d89
+     AMI_IMAGE=ami-03c983f9003cb9cd1 # 22.04 20.04:ami-04bad3c587fe60d89 24.04:ami-0406d1fdd021121cd
Collaborator


@SYangster can you add documentation for these different AMI IDs and Ubuntu versions?

@IsaacYangSLA
Collaborator

/build

@IsaacYangSLA IsaacYangSLA merged commit 3d1a509 into NVIDIA:main Jun 14, 2024
15 of 16 checks passed
nvidianz pushed a commit to nvidianz/NVFlare that referenced this pull request Jul 8, 2024
* the following changes for aws cloud deployment have been tested
with 6 configurations, including 3 with T4 GPU (g4dn.xlarge)

t2.small:
    20.04: ami-04bad3c587fe60d89
    22.04: ami-03c983f9003cb9cd1
    24.04: ami-0406d1fdd021121cd

g4dn.xlarge:
    20.04: ami-04bad3c587fe60d89
    22.04: ami-03c983f9003cb9cd1
    24.04: ami-0406d1fdd021121cd

- changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
- added g4dn.xlarge as an option to prompt EC2_TYPE
- fixed typo in prompt REGION
- get default blockdevice name and set size to 16GB instead of 8GB
- install apt package nvidia-driver-535-server if GPU found
- run modprobe nvidia to avoid reboot if GPU found
- adding ~/.local/bin to PATH
- add --break-system-packages to pip install (required by Python 3.12)
- add --no-cache-dir to pip install to avoid disk space issues
- add @reboot cronjob to ensure nvflare is restarted after a server (re)start

* instead of setting the disk to 16GB increase the existing disk size by 8GB

---------

Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>