Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XGBoost User Experience Improvment #2616

Closed

Conversation

nvidianz
Copy link
Collaborator

@nvidianz nvidianz commented Jun 6, 2024

Description

Made following changes to XGBoost,

  1. Consolidate all XGBoost parameters to controller so all clients share the same parameters.
  2. Removed the need to configure handlers. They are added automatically.
  3. Moved plugin configuration to resources.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copy link
Collaborator

@yanchengnv yanchengnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general. See my comments for improvement.

yanchengnv
yanchengnv previously approved these changes Jun 7, 2024
Copy link
Collaborator

@yanchengnv yanchengnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nvidianz nvidianz force-pushed the xgboost-user-experience-improvement branch from 5f20e5c to 8302b60 Compare June 14, 2024 17:19
chesterxgchen and others added 7 commits July 8, 2024 06:45
* Fixed the simulator worker sys path.

* fixed the get_new_sys_path() logic, added in unit test.

* fixed isort.

* Changed the _get_new_sys_path() implementation.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
* Moved the get_component_init_parameters to class_utils.

* added docstring and unit test.

---------

Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>
* WIP: auto register fobs.

* WIP

* Made changes to server and client fobs_initialize().

* Added auto fobs regiater.

* codestyle fix.

* Adjust to use register_nvflare_decomposers() for IPCAgent.

* add some cleanup.

* codestyle fix.

* Codestyle fix.

* re-design the custom Fobs register to only scan from nvflare_decomposers folder, and user defined decomposer_module.

* removed no use import.

* Added unit test for custom_fobs_initialize().

* Addressed the PR reviews.

* changed a log warning message.

* Added None handling for register_ext_decomposers.

* Fixed unit test.

* fixed unit test.

* fix unit test.

* codestyle fix.

* Removed the custom decomposer regoister for SP and CP. Added the register for simulator_worker.

* removed no use import.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
* upgrade tf example

* change default

* update readme

* fix typo
apatole and others added 27 commits July 8, 2024 06:57
This change fixes:
  - Typos in the `build_doc.sh` references.
  - Incorrect rendering of python code blocks on Flare API page.
The `project.yml` generated by `nvflare provision` command has
comments `change overseer.example.com to the FQDN of the overseer`
and `change example.com to the FQDN of the server` but the actual
values nowhere has `example.com` mentioned and it is confusing.

This change just fixes comments to make it more clear.

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
* test pr

* fix formatting issues

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
* Initial commit for xgboost-secure

* Initial commit for xgboost-secure

* Change model output path

* Change data mode

* Most basic xgboost process for coding

* Most basic xgboost process for coding

* Most basic xgboost process for coding

* Most basic xgboost process for coding

* First prototype for secure vertical pipeline

* Phase 1 concludes

* add seal pipeline in C++

* experiment will more tree depth to ensure correct node behavior

* experiment will more tree depth to ensure correct node behavior

* update secureboost eval bench

* set header to none for sample alignment

* config processor interface from python

* simplify data preparation, add horizontal testing codes

* remove redundants

* horizontal exps

* update scripts

* update test scripts

* add feature tests

* update to align all outputs' format

* remove conflict

* reorganize

* format

* add flare jobs

* add readme and experiment results
* the following changes for aws cloud deployment have been tested
with 6 configurations, including 3 with T4 GPU (g4dn.xlarge)

t2.small:
    20.04: ami-04bad3c587fe60d89
    22.04: ami-03c983f9003cb9cd1
	24.04: ami-0406d1fdd021121cd

g4dn.xlarge:
    20.04: ami-04bad3c587fe60d89
    22.04: ami-03c983f9003cb9cd1
	24.04: ami-0406d1fdd021121cd

- changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
- added g4dn.xlarge as an option to prompt EC2_TYPE
- fixed typo in prompt REGION
- get default blockdevice name and set size to 16GB instead of 8GB
- install apt package nvidia-driver-535-server if GPU found
- run modprobe nvidia to avoid reboot if GPU found
- adding ~/.local/bin to PATH
- add --break-system-packages to pip install (required by Python 3.12)
- add --no-cache-dir to pip install to avoid disk space issues
- add @reboot cronjob to ensure nvflare is restarted after a server (re)start

* instead of setting the disk to 16GB increase the existing disk size by 8GB

---------

Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
* Added external script for fed_job API.

* enforce must deploy a controller or executor before adding external script.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
The `compose.yml` generated by the DockerBuilder adds a build section
only for the overseer. As a result, when we have non-HA mode
`project.yml', generated `compose.yml` won't have the build section and
so `docker compose build` won't buildthe required NVFlare docker image.

This change adds a build section in the generated `compose.yml` even
for clients and servers. It fixes the above issue and also with this
change, same `compose.yml` can be used to build/run individual servers
and clients using the `docker compose` method.
* Changed the persistor_id for ScatterAndGather to persistor.

* changed the docstring of persistor_id in ScatterAndGather to default to empty string.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…ion of GPU instance types and AMI images (NVIDIA#2650)

* add function get_resources_file to get filename to parse GPU specs from

* add function find_ec2_gpu_instance_type to propose best ec2 instance type

* Modify user prompts to allow for better AWS defaults :
  - Prompt for AMI image name ("ubuntu-*-22.04-amd64-pro-server")
  - Retrieve appropriate default AMI Image ID for the region
  - Prompt for default EC2 instance based on find_ec2_gpu_instance_type
  - Add support for ARM architecture and change EC2_TYPE to t4g.small (ARM)

* Improved package installations with package manager and in user space:
- changed ssh exec commands to single quote for readability and avoid escaping chars
- install additional os packages (python3-dev & gcc) to allow ARM packages without wheels
- use driver-550-server os package for >= 22.04 and driver-535-server for <= 20.04
- separated os and user packages to allow for failed install of os packages
- starting os and user package install in parallel but wait twice for install of os packages
- support waiting for build of nvidia.ko as well as nvidia.ko.zst (for newer os)

* check environment to get AWS region (default is us-west-2)

* set --region for all AWS commands to allow multi-region support

* call "aws sts get-caller-identity" to see if auth works, e.g. SSO token is still active

* adding unique suffixes to KEY_FILE, nvflare.log etc to allow multiple deployments in parallel

* change most "prompt" commands to read -e -i <default> -p for better prompt editing

* Cosmetic changes to improve usabilty :
  - initialize default AMI_IMAGE and EC2_TYPE at top of script
  - add AWS_PROFILE hint to allow for different AWS profile
  - clarify requirements.txt instruction
  - improve final message to allow for copy/paste of ssh and tail command

* adjust server security group command to allow REGION and unique log file

---------

Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
* Fixed the wrong numpy release version.

* Update the numpy version to 1.26.4.
… "read -i" requires Bash >= 4 (newer than 2009 !) as OSX only comes with Bash 3.2 (NVIDIA#2677)

Refactor of Prompt function has:

  - editable prompts with read -i ONLY if Bash >= 4
  - in cases where you don not want or need an editable prompt (e.g. AMI ID) you leave out the 3rd argument
  - fully compatible with previous prompt function (only cosmetic changes in azure template)
* expose min_responses, hide wait_time_after_min_received

* change default to None

* add comment
If you possess more GPUs than clients, a good strategy is to run one client on each GPU.
This can be achieved using the `-gpu` argument during simulation, e.g., `nvflare simulator -n 2 -gpu 0,1 [job]`.


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note some of the changes are already in the main branch so we will need to rebase this branch so we don't change 278 files in this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet