Provisioning script to create cpu+gpu computes #188

garg-amit · 2022-11-28T19:53:38Z

Purpose

Provide a script to create CPU and GPU computes that share the same datastore.
Modify both the open and vnet setups to incorporate these changes.
Generalize the resource names and make the required changes in the pipeline configs.
Fix the token expiration issue when provisioning resources.
List all jobs when deleting run history.
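
For reviewers, here is a minimal sketch of the intended layout: each silo gets one CPU compute and one GPU compute, and both point at the same datastore. The resource names, the `SiloResources` class, and the helper functions are hypothetical and only illustrate the idea; the actual names are defined by the bicep scripts and pipeline configs in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiloResources:
    """Names of the resources provisioned for one silo (hypothetical layout)."""

    cpu_compute: str
    gpu_compute: str
    datastore: str


def silo_resources(index: int) -> SiloResources:
    """Build generic, index-based names for one silo.

    The naming pattern is an assumption made for illustration only.
    """
    prefix = f"silo{index}"
    return SiloResources(
        cpu_compute=f"{prefix}-cpu",
        gpu_compute=f"{prefix}-gpu",
        datastore=f"datastore_{prefix}",
    )


def pick_compute(resources: SiloResources, needs_gpu: bool) -> str:
    """Select the compute for a step; the datastore is the same either way."""
    return resources.gpu_compute if needs_gpu else resources.cpu_compute


if __name__ == "__main__":
    silo = silo_resources(0)
    # Both computes of a silo read from and write to one shared datastore.
    print(pick_compute(silo, needs_gpu=False), "->", silo.datastore)
    print(pick_compute(silo, needs_gpu=True), "->", silo.datastore)
```

With this shape, a pipeline step only has to choose between the CPU and GPU compute of a silo; the data location stays the same, so no data needs to be copied between the two.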

closes #169

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[x] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

Get the code

git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install

What to Check

Verify that the following are valid

Check the CI/CD tests.

jfomhover

Overall looks good, but I'd love to keep the region name in the silo compute+datastore names; it gives a fancier naming convention for demos.

mlops/bicep/vnet_publicip_sandbox_setup.bicep

majercakdavid and others added 30 commits November 3, 2022 17:22

init branch

4e61a5d

wip data exploration

200145d

data exploration region/silo

bd47702

basic model

0fdce34

regions

2e78a72

basic network and finished data processing

4dd4776

training

bb4054a

Implement generic FedAvg without model object (#167)

a6a76c6

* generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

add README

0a84a8a

update normalization

79d7f08

update exploration

4205bd9

Thomas/small improvements (#171)

0840806

* remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com>

update example for finance with multiple models

990043c

Merge remote-tracking branch 'upstream/release-iteration-04' into dam…

5516af5

…ajercak/finance-example

successful training through lstm

84aafb2

revert unneeded changes

0b77c7e

remove local exploration ipynb

0d52e0b

fix test metric

52eb12f

different param value for AKS (#179)

4c03c1f

Co-authored-by: thomasp-ms <XXX@me.com>

Merge branch 'release-iteration-04' of https://github.com/Azure-Sampl…

5fdc98e

…es/azure-ml-federated-learning into release-iteration-04

update formatting

869ae7e

Merge branch 'release-iteration-04' into damajercak/finance-example

0a2aebb

add readme section

5fa63d7

rename training to traininsilo for consistency

d6f3b42

add more comments and update docs

41a0562

include urgency in PR template (#184)

8e9f0e5

Co-authored-by: thomasp-ms <XXX@me.com>

Merge branch 'release-iteration-04' of https://github.com/Azure-Sampl…

d0d650d

…es/azure-ml-federated-learning into release-iteration-04

garg-amit added 4 commits December 9, 2022 13:30

align config files to provisioning scripts

cea5b7d

Merge branch 'release-iteration-04' of https://github.com/Azure-Sampl…

c0287e6

…es/azure-ml-federated-learning into gargamit/cpu-gpu-compute

change compute sku in ci/cd

c0fba7d

Merge branch 'release-05' of https://github.com/Azure-Samples/azure-m…

fc59b46

…l-federated-learning into gargamit/cpu-gpu-compute

garg-amit changed the base branch from release-iteration-04 to release-05 December 12, 2022 20:42

garg-amit added 21 commits December 12, 2022 12:47

temp. changes to test cpu gpu computes

312c218

vnet compute 1 name change

80ce969

temp. changes to test cpu gpu computes

57f4bb4

change compute regions

0d3f121

test example pipelines

e5722a4

test example pipelines

27ae567

test example pipelines

40a8016

test example pipelines

421f312

implement compute2 settings for the vnet setup

ec5d1a0

implement compute2 settings for the vnet setup

d53338b

test vnet setup

7d860ea

change computesku

a1bcba9

test industry relevant examples with the vnet setup

556b2dc

update vnet compute with existing storage

90f979b

revert back github workflow changes

76ae04e

final testing

3125001

Merge branch 'release-05' of https://github.com/Azure-Samples/azure-m…

2b6163e

…l-federated-learning into gargamit/cpu-gpu-compute

revert temp. workflow changes

42d44e1

test kaggle kv

c20d032

test open setup

25a2ef9

update arm templates

4a79ea6

garg-amit marked this pull request as ready for review December 16, 2022 19:08

jfomhover approved these changes Jan 6, 2023

mlops/bicep/vnet_publicip_sandbox_setup.bicep (resolved)

garg-amit merged commit 1ec46a4 into release-05 Jan 9, 2023

garg-amit deleted the gargamit/cpu-gpu-compute branch January 9, 2023 18:58
