
aws cloud deployment improvements #2618

Merged (6 commits, Jun 14, 2024)

Conversation

dirkpetersen
Contributor

@dirkpetersen dirkpetersen commented Jun 6, 2024

The following changes for AWS cloud deployment have been tested with 6 configurations, including 3 with a T4 GPU (g4dn.xlarge), across 3 versions of Ubuntu:

t2.small:
20.04: ami-04bad3c587fe60d89
22.04: ami-03c983f9003cb9cd1
24.04: ami-0406d1fdd021121cd

g4dn.xlarge:
20.04: ami-04bad3c587fe60d89
22.04: ami-03c983f9003cb9cd1
24.04: ami-0406d1fdd021121cd

Note: nvflare installs on 24.04 but is not currently supported there

  • changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
  • added g4dn.xlarge as an option to prompt EC2_TYPE
  • fixed typo in prompt REGION
  • increase the default block device size by 8GB
  • install apt package nvidia-driver-535-server if GPU found
  • run modprobe nvidia to avoid reboot if GPU found
  • add ~/.local/bin to PATH
  • add --break-system-packages to pip install (required by Python 3.12)
  • add --no-cache-dir to pip install to avoid disk space issues
  • add @reboot cronjob to ensure nvflare is restarted after a server (re)start (see the sketch below)
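
The GPU, PATH, pip, and cron changes above might look roughly like the following sketch (the paths, the pip target, and the cron command are illustrative assumptions, not the literal aws_start.sh contents):

```bash
#!/bin/bash
# Illustrative sketch of the provisioning steps listed above;
# not the exact aws_start.sh implementation.

# Install and load the NVIDIA driver without a reboot when a GPU is present.
if lspci | grep -qi nvidia; then
    sudo apt-get update
    sudo apt-get install -y nvidia-driver-535-server
    sudo modprobe nvidia          # load the module now instead of rebooting
fi

# pip --user places console scripts here.
export PATH="$HOME/.local/bin:$PATH"

# --no-cache-dir keeps the pip cache off the small root volume;
# --break-system-packages is needed on Python 3.12 images (PEP 668).
pip install --user --no-cache-dir --break-system-packages nvflare

# Restart NVFlare after an instance (re)start; the startup path below
# is a placeholder, not the real location used by the script.
(crontab -l 2>/dev/null; echo "@reboot /var/tmp/cloud/startup/start.sh") | crontab -
```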

Fixes # .

Description

The critical change here is the disk size increase of 8GB. This leaves all tested configurations with between 30% and 60% free disk space. The default install can run out of disk space if the NVIDIA drivers, PyTorch, and a few other packages need to be installed.
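
For illustration, the disk size adjustment might be done with AWS CLI calls along these lines (a sketch assuming the root volume size is raised at launch; variable names are made up, and key pair / security group arguments are omitted):

```bash
# Read the AMI's root device name and default volume size, then launch
# the instance with 8GB added to the root volume (illustrative only).
AMI_IMAGE=ami-03c983f9003cb9cd1
ROOT_DEV=$(aws ec2 describe-images --image-ids "$AMI_IMAGE" \
  --query 'Images[0].RootDeviceName' --output text)
DEFAULT_GB=$(aws ec2 describe-images --image-ids "$AMI_IMAGE" \
  --query 'Images[0].BlockDeviceMappings[0].Ebs.VolumeSize' --output text)
NEW_GB=$((DEFAULT_GB + 8))

aws ec2 run-instances \
  --image-id "$AMI_IMAGE" \
  --instance-type g4dn.xlarge \
  --block-device-mappings "[{\"DeviceName\":\"$ROOT_DEV\",\"Ebs\":{\"VolumeSize\":$NEW_GB}}]"
```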

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@IsaacYangSLA
Collaborator

@dirkpetersen, thanks for the PR. It improves the template quite a lot. However, we may need to have an internal discussion on the deployment scenario. We know different jobs require different dependencies; sometimes PyTorch is one of them, and the deployment then needs larger volumes. We also see numpy-only jobs.
I will approve this PR. Later, we might have additional PRs that change this one after our internal discussions reach conclusions.

Collaborator

@IsaacYangSLA IsaacYangSLA left a comment


Good PR. It covers some scenarios we did not consider previously.

@dirkpetersen
Contributor Author

dirkpetersen commented Jun 7, 2024

Thanks @IsaacYangSLA, this helps us a lot, as we no longer have to run a patched version. Another thing to consider is that this will not run on GPU systems that use RHEL-based images such as Amazon Linux, only Ubuntu. The part of the AI/ML community that does not use Ubuntu seems to be small, though.

@IsaacYangSLA
Collaborator

/build

@IsaacYangSLA IsaacYangSLA enabled auto-merge (rebase) June 11, 2024 01:09
@IsaacYangSLA
Collaborator

/build

@YuanTingHsieh
Collaborator

/build

@YuanTingHsieh
Collaborator

/build

@dirkpetersen
Contributor Author

dirkpetersen commented Jun 13, 2024

@IsaacYangSLA, as I was working with other AWS regions, I noticed that some of our users would struggle to look up an appropriate AMI ID without help from a cloud engineer, since these IDs differ per region. Extensive testing also triggered a number of fine-tuning improvements that I would like to contribute. I assume it might be better to let this pull request go through and then create a new one afterwards, or is it better to add to this one? Looking for your guidance.

I have made the following changes:

  • get exact GPU requirements from ../local/resources.json.[default]
  • add a function to propose the best suitable instance type (find_ec2_gpu_instance_type)
  • check the environment to get the AWS region (default still us-west-2)
  • propose the newest image ID per region based on a pattern, e.g. "ubuntu-*-22.04-amd64-pro-server"
  • add an AWS_PROFILE hint to allow for a different AWS profile
  • call "aws sts get-caller-identity" to see if the SSO token is still active
  • change the 'prompt' command to read -e -i -p for better prompt editing
  • change ssh commands to single quotes for readability and to avoid escaping characters
  • clarify the requirements.txt instruction
  • set --region for all AWS commands to allow multi-region support
  • add suffixes to KEY_FILE, nvflare.log, etc. to allow multiple deployments in parallel
  • install additional OS packages (python3-dev & gcc) to allow ARM packages without wheels
  • use the driver-550-server OS package for >= 22.04 and driver-535-server for <= 20.04
  • separate OS and user packages to allow for a failed install of OS packages
  • start OS and user package installs in parallel but wait twice for the OS package install
  • support waiting for the build of nvidia.ko as well as nvidia.ko.zst (for newer OS releases)
  • improve the final message to allow copy/paste of the ssh and tail commands
  • switch the default instance type to t4g.small if the image name pattern contains *arm64*
  • change control structures to "if [ something ]; then" for readability

The last one is, of course, optional, depending on your team's coding preferences. A sketch of the region/AMI/instance-type discovery flow is shown below.
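
Roughly, the discovery flow above might look like this (a simplified sketch; the resources.json keys, the Canonical owner ID, and the instance-type filter are assumptions rather than the exact aws_start.sh code):

```bash
# Region from the environment, default still us-west-2.
REGION=${AWS_DEFAULT_REGION:-us-west-2}

# Fail early if the SSO token / credentials have expired.
if ! aws sts get-caller-identity >/dev/null 2>&1; then
  echo "AWS credentials missing or expired; try: export AWS_PROFILE=your-profile-name"
  exit 1
fi

# Editable prompt with a prefilled default (read -e -i -p).
read -e -i "ubuntu-*-22.04-amd64-pro-server" -p "* Cloud AMI image name: " AMI_NAME

# Newest AMI in this region matching the name pattern (099720109466 = Canonical).
AMI_IMAGE=$(aws ec2 describe-images --region "$REGION" --owners 099720109466 \
  --filters "Name=name,Values=${AMI_NAME}" "Name=state,Values=available" \
  --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)

# GPU requirements from the site configuration (file path and key names assumed).
NUM_GPUS=$(jq -r '.components[] | select(.id=="resource_manager") | .args.num_of_gpus // 0' \
  ../local/resources.json.default)
GPU_MEM_MIB=$(( $(jq -r '.components[] | select(.id=="resource_manager") | .args.mem_per_gpu_in_GiB // 0' \
  ../local/resources.json.default) * 1024 ))

# Propose the smallest GPU instance type that satisfies the requirements.
find_ec2_gpu_instance_type() {
  aws ec2 describe-instance-types --region "$REGION" \
    --filters "Name=instance-type,Values=g4dn.*,g5.*,g6.*" \
    --query 'InstanceTypes[].{t:InstanceType,g:GpuInfo.Gpus[0].Count,m:GpuInfo.TotalGpuMemoryInMiB,v:VCpuInfo.DefaultVCpus}' \
    --output json |
    jq -r --argjson gpus "$1" --argjson mem "$2" \
      '[.[] | select(.g >= $gpus and .m >= $mem)] | sort_by(.v) | .[0].t'
}

EC2_TYPE=$(find_ec2_gpu_instance_type "$NUM_GPUS" "$GPU_MEM_MIB")
echo "region = $REGION, EC2 type = $EC2_TYPE, ami image = $AMI_IMAGE"
```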

I tested the following combinations over the last couple of days; only the 24.04 ARM option seems to fail, with both NVIDIA drivers 535 and 550.

[image: tested combinations]

The new UI would look like this:

startup/start.sh --cloud aws
This script requires aws (AWS CLI), sshpass, dig and jq.  Now checking if they are installed.
Checking if aws exists. => found
Checking if sshpass exists. => found
Checking if dig exists. => found
Checking if jq exists. => found
Note: run this command first for a different AWS profile:
  export AWS_PROFILE=your-profile-name.

Checking AWS identity ...

* Cloud EC2 region, press ENTER to accept default: us-east-2
* Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): ubuntu-*-22.04-amd64-pro-server
    retrieving AMI ID for ubuntu-*-22.04-amd64-pro-server ... ami-0a7e9bed072bb379b found
    finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g6.xlarge found
* Cloud EC2 type, press ENTER to accept default: g6.xlarge
* Cloud AMI image id, press ENTER to accept default: ami-0a7e9bed072bb379b
region = us-east-2, EC2 type = g6.xlarge, ami image = ami-0a7e9bed072bb379b , OK? (Y/n)
If the client requires additional Python packages, please add them to:
    /home/dp/NVFlare/dirk/Test/AWS-T4.X/startup/requirements.txt
Press ENTER when it's done or no additional dependencies.

Checking if default VPC exists
Default VPC found
Generating key pair for VM
Creating VM at region us-east-2, this may take a few minutes ...
VM created with IP address: 52.14.44.113
Copying files to nvflare_client
Destination folder is ubuntu@52.14.44.113:/var/tmp/cloud
Installing os packages as root in the background, this may take a few minutes ...
Installing user space packages in the background, this may take a few minutes ...
System was provisioned, packages may continue to install in the background.
To terminate the EC2 instance, run the following command:
  aws ec2 terminate-instances --region us-east-2 --instance-ids i-0837e105f2661a4e3
Other resources provisioned
security group: nvflare_client_sg_2036
key pair: NVFlareClientKeyPair
review install progress:
  tail -f /tmp/nvflare-aws-YGR.log
login to instance:
  ssh -i /home/dirk/AWS-T4.X/NVFlareClientKeyPair_i-0837e105f2661a4e3.pem ubuntu@52.14.44.113

Here is the current aws_start.sh script I am using:
https://raw.githubusercontent.com/dirkpetersen/nvflare-cancer/main/aws_start.sh

@chesterxgchen
Collaborator

The part of the AI/ML community that does not use Ubuntu seems to be small, though.


Thank you so much @dirkpetersen for your contribution; we will keep RHEL in mind for future releases.

@chesterxgchen
Collaborator


@dirkpetersen, we really appreciate your contribution and detailed tests.

@@ -32,7 +32,7 @@ aws_start_sh: |
      EC2_TYPE=t2.xlarge
      REGION=us-west-2
    else
-     AMI_IMAGE=ami-04bad3c587fe60d89
+     AMI_IMAGE=ami-03c983f9003cb9cd1 # 22.04 20.04:ami-04bad3c587fe60d89 24.04:ami-0406d1fdd021121cd
Collaborator


@SYangster can you add documentation for these different AMI IDs and Ubuntu versions?

@IsaacYangSLA
Collaborator

/build

@IsaacYangSLA IsaacYangSLA merged commit 3d1a509 into NVIDIA:main Jun 14, 2024
15 of 16 checks passed
nvidianz pushed a commit to nvidianz/NVFlare that referenced this pull request Jul 8, 2024
* the following changes for aws cloud deployment have been tested
with 6 configurations, including 3 with T4 GPU (g4dn.xlarge)

t2.small:
    20.04: ami-04bad3c587fe60d89
    22.04: ami-03c983f9003cb9cd1
    24.04: ami-0406d1fdd021121cd

g4dn.xlarge:
    20.04: ami-04bad3c587fe60d89
    22.04: ami-03c983f9003cb9cd1
    24.04: ami-0406d1fdd021121cd

- changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
- added g4dn.xlarge as an option to prompt EC2_TYPE
- fixed typo in prompt REGION
- get default blockdevice name and set size to 16GB instead of 8GB
- install apt package nvidia-driver-535-server if GPU found
- run modprobe nvidia to avoid reboot if GPU found
- adding ~/.local/bin to PATH
- add --break-system-packages to pip install (required by Python 3.12)
- add --no-cache-dir to pip install to avoid disk space issues
- add @reboot cronjob to ensure nvflare is restarted after a server (re)start

* instead of setting the disk to 16GB increase the existing disk size by 8GB

---------

Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>