Provisioning script to create cpu+gpu computes #188

Merged
garg-amit merged 104 commits into release-05 from gargamit/cpu-gpu-compute
Jan 9, 2023

Conversation

garg-amit
Contributor

@garg-amit garg-amit commented Nov 28, 2022

Purpose

  • Provide a script to create CPU and GPU computes that share the same datastore.
  • Update both the open and vnet setups to incorporate these changes.
  • Generalize the resource names and make the required changes in the pipeline configs.
  • Fix the token expiration issue that occurs when provisioning resources.
  • List all jobs when deleting run history.
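
As a rough illustration of the first bullet, the sketch below models one silo that owns a CPU compute and a GPU compute, both pointing at the same datastore. The class, the fields, the `build_silo` helper, and the naming scheme are all hypothetical; they are not taken from the repository's bicep scripts or pipeline configs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiloResources:
    """Hypothetical description of one silo: two computes, one shared datastore."""

    cpu_compute: str
    gpu_compute: str
    datastore: str


def build_silo(base_name: str, index: int) -> SiloResources:
    """Derive generic resource names for silo `index` (naming scheme is an assumption)."""
    prefix = f"{base_name}-silo{index}"
    return SiloResources(
        cpu_compute=f"{prefix}-cpu",
        gpu_compute=f"{prefix}-gpu",
        # Both computes above are meant to read from and write to this one datastore.
        datastore=f"datastore_{base_name}_silo{index}",
    )


silos = [build_silo("fldemo", i) for i in range(3)]
for silo in silos:
    print(silo.cpu_compute, silo.gpu_compute, "->", silo.datastore)
```

A pipeline config could then pick either compute per step while keeping a single datastore reference per silo.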

closes #169

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[x] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install

What to Check

Verify that the following are valid

  • Check the CI/CD tests.

majercakdavid and others added 30 commits November 3, 2022 17:22
* generic fedavg pytorch
* support model classes
* add docstrings

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
* remove unused local MNIST data

* add link to provisioning cookbook in docs readme

* recommend creating a conda env in the quickstart

Co-authored-by: thomasp-ms <XXX@me.com>
* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add more intuitive agg output dir path

* reformat using black

* add iteration2 branch for PR build testing

* reformat date and pass kwargs instead in the getUniqueIdentifier fn

* working submit

* working factory submit

* linting

* move component path

* add soft validation

* add soft validation

* Add basic tests on config

* linting

* working bicep deployment for vanilla demo

* proper orchestrator script, double containers

* fix name

* docstring

* docstring

* rollback to using only 1 container

* align naming convention

* instructions

* working submit

* set up permission model

* working orch perms

* wonky perms assignment

* working role assignments

* remove old perm model

* working except silo2orch

* fix typo

* working submit with config

* add sku as param

* use R/W for now

* fix submit to align with bicep provisioning demo

* linting

* remove dataset files

* fix docstring on permission model

* write draft docs with homepage, align structure, remove requirements, ensure demo documented

* rollback change to req

* change factory to use custom model type during validation

* linting

* Display metrics at the pipeline level (#68)

* Fix optional input yaml and mlflow log bugs (#59)

* Accommodate optional input changes and switch from mlflow autologging to manual logging

* code style

* change optional inputs syntax

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Make changes to display all metrics at the pipeline level

* Log preprocessing metadata in mlflow

* linting

* Pass client as an arg

* Fix typo, rmv name from silo config, metric naming convention, and add
metric identifier in the preprocessing component

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Remove redundant files from the mlops directory (#69)

* Remove internal & external dir as provisioning is taken care by bicep

* keep mnist data files

* rename demo script (#71)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Unified documentation (#72)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* simplify sandbox script

* simplify script, ensure it works

* align config of native submit

* align naming conventions between scripts, reinject rbac role

* create test job for quickly debugging provisioning issues

* fix tests

* linting

* move permissions to storage

* align config with bicep scripts

* Document the metrics panel of the pipeline overview in the quickstart (#76)

* document the metrics/pipeline panel in the quickstart

* linting

* add docstrings and disclaimers

* Add instructions on how to create a custom graph  (#78)

* add instructions on how to create a custom graph

* do better comments

* Refine native code (#82)

* fix silo name

* log only one datapoint per iteration for an aggregated metrics

* Align terminology for iteration/round/num_rounds

* linting

* use storage blob data contributor

* add demoBaseName to guid name of role deployment (#85)

Co-authored-by: thomasp-ms <XXX@me.com>

* use id list, add listkeys builtin

* rename and dissociate orchestrator in resource + orchestrator

* separate orchestrator script

* draft sandbox setup

* make silo script distinct

* Update orchestrator_open.bicep

* Update internal_blob_open.bicep

* remove comments

* align hello world example with new naming conventions

* ensure uai assignments are created AFTER storage is created

* linting

* enforce precedence

* merge from secure branch

* use different regions, limit size of account

* reduce to 3 regions, add keys to guid

* substring

* align config

* do not use model

* Add msi version of scripts

* sandbox main can switch between uai and msi

* fix name

* linting

* linting

* implement ignore param, hotfix model with startswith

* Address my own comments on Jeff's PR (#96)

* remove magic number

* little improvements on some comments

* remove unused files

* put dash replacement next to length check

* don't necessarily assume USER AI

* UAI -> XAI

* revert previous UAI -> XAI changes

* move length check next to dash replacement

* typo

* try moving the dependsOn's

* RAGRS -> LRS

* revert dependsON changes

* revert another small change in a comment

Co-authored-by: thomasp-ms <XXX@me.com>

* align config of both submit scripts

* Make distinction between one-off and repeatable provisioning scripts (#99)

* clarify the role needed

* remove "custom role" line

* adjust locations

* use existing rg if not Owner of the sub

* clarify "Secure" setup

* add usage instructions in docstring

* explain what scripts are one-off (vs repeatable)

Co-authored-by: thomasp-ms <XXX@me.com>

* Align round/iteration terminology with the native code (#103)

* rename parameter in config file

* keep iterations instead of rounds

* round -> iteration

Co-authored-by: thomasp-ms <XXX@me.com>

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* align both submits to work

* add optional test

* rename native to literal

* add getting started in readme, introduce emojis

* change person

* remove emojis

* Propose rewriting of readme to highlight motivation first (#110)

* propose rewriting of readme to highlight motivation first

* minor edit

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update README.md

* Update quickstart to mention rg clean-up

* Update quickstart.md

* Update quickstart.md

* Update quickstart.md

* Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120)

* Update quickstart to lower header (hotfix) (#117)
* add arm templates, add button in quickstart
* switch to releasebranchlink

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Add subscription id, resource group and workspace name as CLI args (#122)

* add more cli args

* code style

* code style

* update quickstart doc

* update readme

* Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Continuous Integration Tests (#119)

* take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline

* change path

* Test azure creds in the github workflow

* reformatting

* add pipeline validation and testing workflow

* add permissions

* add permissions

* check only certain dir to trigger workflows

* add soft validation for any iteration branch PR

* add provisioning script test

* testing

* create rg

* create rg

* change compute for testing

* change demoname

* delete old rg

* change demoname

* add demobasename and aml ws name as github secrets

* random demo base name

* auto generate random base name

* random demo base name

* adjust random num length

* add vnet sandbox test

* rmv dependency b/w jobs

* submit various pipelines

* change execution graph path

* add cli args in the factory code

* change compute for testing

* ignore validation - factory

* create custom action

* correct path

* correct path

* add shell in the github action

* create github actions and take required values as input params

* add shell

* add wait condition

* add logs

* linting

* correct rg name

* add azure ml extension

* handle ml extension installation error.

* add release branch test cases

* add script to delete run history

* cronjob test

* cronjob test

* checkout branch

* test run history deletion script

* test run history deletion script

* test run history deletion script

* azure login

* date format change

* remove double quotes

* date format change

* archive run history script tested

* Add vnet-based provisioning options to cookbook (#128)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Make deployment name unique in our github actions (#135)

* set unique name for deployments
* add attempt to deployment name

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Refactor compute/storage scripts to be independent (#132)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Provide motivation in provisioning docs for using service endpoints (#136)

* add motivation for service endpoints
* add link

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Refresh provisioning arm buttons with latest from bicep (#139)

* align names of directories
* rebuild all arm

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update silo_vnet_newstorage.md (#141)

* Add Bicep build vs ARM template diff test  (#140)

* Add diff test for bicep vs arm

* Debug

* Debug

* fix syntax error

* redirect build output to stdout

* correct path

* trigger arm template test when pushing changes to main branch from release* branch

* remove redundant logs

* Add "open aks with cc" provision tutorial and bicep scripts (#138)

* implement bicep scripts to provision open aks with cc
* add aks cc tutorial
* build arm and add in branch
* add button

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Provide script + tutorial to attach pair with an existing storage (#142)

* provision datastore with existing storage
* add arm for existing storage, add docs
* add link in readme

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* add latest arm templates to diff build (#145)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146)

* add jumpbox script with tutorial
* add template to diff build

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update jumpbox_cc.md (#147)

* update tutorials for silos to integrate feedback (#149)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Implement option to turn orchestrator storage fully private (behind PLE) (#150)


Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Tutorial on how to adapt native and factory code to write FL experiments.  (#100)

* WIP: add general information about the factory code

* moving factory-tutorial to another file

* add scenarios

* add instructions on how to adapt literal code

* rename file

* add general info and fix typos

* Jeff's feedback

* Apply code clean-up to provision scripts before bug bash (#148)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Instructions for provisioning external silos (#101)

* very first stab, far from done

* non-secure native job using the on-prem k8s

* use on-prem silos in example factory job

* Revert "very first stab, far from done"

This reverts commit e00d882.

* Revert "use on-prem silos in example factory job"

This reverts commit e2ef884.

* Revert "non-secure native job using the on-prem k8s"

This reverts commit 923e5f3.

* restore doc stub

* Make Git ignore resources for test jobs

* fix gitignore

* typo in comment

* steps A through D

* 2 typos

* move to subdir

* fix workspace creation

* add orchestrator part, role, and timeline

* last commit before PR

* adjust to new open_azureml_workspace.bicep

* first wave after Jeff's comments

* address jeff's comments

* typo

* light trims

Co-authored-by: thomasp-ms <XXX@me.com>

* bump up every title

* skeleton

* first attempt at data prep like Harmke

* change secret name

* wrong secret name

* remove separate unzip

* change clients, create silo data assets

* different names for silo data assets, duh

* cleanup

* adjust secret name in doc

* .

* use latest literal code

* align environment with literal

* base on latest component

* one dataset, comment out 2 unused args (for now)

* introduce new arguments

* reflect modified args in component spec

* remove unused arg from config

* start hooking up to Harmke's trainer

* initialize PTLearner and include in run.py

* use same values as Harmke for epochs and lr

* attributes with _, start implementing local_train

* add loggings, add test(), fix device_

* train_loader_

* align _'s

* fix transform bug

* remove unused constants

* use proper model in aggregation code

* removed unused file

* remove unused code and arguments, logging to DEBUG

* restore `metrics_prefix` parameter

* finish restoring `metrics_prefix`

* do not duplicate model code

* revert dedup attempt

* improve docstrings and descriptions

* change experiment name

* change pipeline name and docstring

* cite sources, remove wrongly added licenses

* italics

* black

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: thomasp-ms <XXX@me.com>
Co-authored-by: unknown <Mitgarg17495@gmail.com>
Co-authored-by: thomasp-ms <XXX@me.com>
* use shared agg component across all examples

* only keep a single {reqs/env}

* use more recent pip version

* standardize component spec name

* support dummy HELLOWORLD example is agg

* black

Co-authored-by: thomasp-ms <XXX@me.com>
* SHARED -> utils, rename agg env

Co-authored-by: thomasp-ms <XXX@me.com>
@garg-amit garg-amit changed the base branch from release-iteration-04 to release-05 December 12, 2022 20:42
@garg-amit garg-amit marked this pull request as ready for review December 16, 2022 19:08
Contributor

@jfomhover jfomhover left a comment

Overall looks good, but I'd love to keep the region name in the silo compute and datastore names; it gives the demos a fancier naming convention.

mlops/bicep/vnet_publicip_sandbox_setup.bicep (resolved)
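
To make the naming suggestion in this review concrete, here is a minimal sketch of region-based names for a silo's computes and datastore. The `silo_resource_names` helper and the name patterns are assumptions for illustration only; the actual convention lives in the bicep scripts and may differ.

```python
import re


def silo_resource_names(region: str) -> dict:
    """Build region-based names; the patterns below are assumptions, not the repo's."""
    # Normalize a display name such as "West US" to a slug such as "westus".
    slug = re.sub(r"[^a-z0-9]", "", region.lower())
    return {
        "cpu_compute": f"silo-{slug}-cpu",
        "gpu_compute": f"silo-{slug}-gpu",
        "datastore": f"datastore_silo_{slug}",
    }


names = silo_resource_names("West US")
for kind, name in names.items():
    print(f"{kind}: {name}")
```

Keeping the region in every name makes it easy to tell at a glance which silo a compute or datastore belongs to.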
@garg-amit garg-amit merged commit 1ec46a4 into release-05 Jan 9, 2023
@garg-amit garg-amit deleted the gargamit/cpu-gpu-compute branch January 9, 2023 18:58

4 participants