Release 06: Accelerator has features like native FL SDK contract, benchmark results... (#282)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* Update release branch (#271)

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* init branch

* wip data exploration

* data exploration region/silo

* basic model

* regions

* basic network and finished data processing

* training

* Implement generic FedAvg without model object (#167)

* generic fedavg pytorch
* support model classes
* add docstrings

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
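
As a side note on #167 above: "FedAvg without model object" essentially means averaging checkpoint state_dicts directly, so the same aggregation component can serve every example. A minimal sketch of that idea, assuming PyTorch state_dicts as the exchange format (the `fed_avg` helper and the paths are illustrative, not the repository's actual API):

```python
# Minimal FedAvg sketch: average PyTorch state_dicts element-wise, without
# needing the model class itself (assumes all silos share the same
# architecture, so the dicts have identical keys and shapes).
from typing import Dict, List

import torch


def fed_avg(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Return the element-wise mean of a list of model state_dicts."""
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged


# Illustrative usage: load each silo's checkpoint, average, save the aggregate.
# silo_paths = ["silo1/model.pt", "silo2/model.pt", "silo3/model.pt"]
# aggregated = fed_avg([torch.load(p, map_location="cpu") for p in silo_paths])
# torch.save(aggregated, "aggregated/model.pt")
```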

* add README

* update normalization

* update exploration

* Thomas/small improvements (#171)

* remove unused local MNIST data

* add link to provisioning cookbook in docs readme

* recommend creating a conda env in the quickstart

Co-authored-by: thomasp-ms <XXX@me.com>

* update example for finance with multiple models

* successful training through lstm

* revert unneeded changes

* remove local exploration ipynb

* fix test metric

* different param value for AKS (#179)

Co-authored-by: thomasp-ms <XXX@me.com>

* Pneumonia xray example (#164)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add more intuitive agg output dir path

* reformat using black

* add iteration2 branch for PR build testing

* reformat date and pass kwargs instead in the getUniqueIdentifier fn

* working submit

* working factory submit

* linting

* move component path

* add soft validation

* add soft validation

* Add basic tests on config
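
A hypothetical sketch of what "basic tests on config" could look like (pytest + YAML; the path and key names below are assumptions, not the repository's actual schema):

```python
# Hypothetical pytest sketch for validating a submit config YAML.
# Assumes a config shaped like: {federated_learning: {orchestrator: {...}, silos: [...]}}.
import pytest
import yaml

CONFIG_PATH = "examples/pipelines/fl_cross_silo_literal/config.yaml"  # assumed path


@pytest.fixture
def config():
    with open(CONFIG_PATH) as f:
        return yaml.safe_load(f)


def test_config_has_orchestrator_and_silos(config):
    fl = config["federated_learning"]
    assert "orchestrator" in fl
    assert len(fl["silos"]) >= 1


def test_each_silo_declares_compute_and_datastore(config):
    for silo in config["federated_learning"]["silos"]:
        assert "compute" in silo
        assert "datastore" in silo
```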

* linting

* working bicep deployment for vanilla demo

* proper orchestrator script, double containers

* fix name

* docstring

* docstring

* rollback to using only 1 container

* align naming convention

* instructions

* working submit

* set up permission model

* working orch perms

* wonky perms assignment

* working role assignments

* remove old perm model

* working except silo2orch

* fix typo

* working submit with config

* add sku as param

* use R/W for now

* fix submit to align with bicep provisioning demo

* linting

* remove dataset files

* fix docstring on permission model

* write draft docs with homepage, align structure, remove requirements, ensure demo documented

* rollback change to req

* change factory to use custom model type during validation

* linting

* Display metrics at the pipeline level (#68)

* Fix optional input yaml and mlflow log bugs (#59)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* Accommodate optional input changes and switch from mlflow autologging to manual logging

* code style

* change optional inputs syntax

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Make changes to display all metrics at the pipeline level

* Log preprocessing metadata in mlflow

* linting

* Pass client as an arg

* Fix typo, rmv name from silo config, metric naming convention, and add
metric identifier in the preprocessing component

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
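
The pipeline-level metric display added in #68 hinges on each component logging manually to MLflow under a per-silo prefix (the `metrics_prefix` restored in later commits). A rough sketch of that pattern, with illustrative metric names and placeholder values:

```python
# Sketch of manual MLflow logging with a per-silo metric prefix, so every
# silo's curves show up side by side at the pipeline level.
import argparse

import mlflow


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--metrics_prefix", type=str, default="default-prefix")
    args = parser.parse_args()

    # Placeholder values; a real component logs its actual train/test metrics here.
    history = [(0.9, 0.62), (0.5, 0.81)]

    with mlflow.start_run():
        for epoch, (train_loss, test_acc) in enumerate(history):
            # One datapoint per iteration, namespaced by silo/region.
            mlflow.log_metric(f"{args.metrics_prefix}/train_loss", train_loss, step=epoch)
            mlflow.log_metric(f"{args.metrics_prefix}/test_accuracy", test_acc, step=epoch)


if __name__ == "__main__":
    main()
```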

* Remove redundant files from the mlops directory (#69)

* Remove internal & external dir as provisioning is taken care by bicep

* keep mnist data files

* rename demo script (#71)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Unified documentation (#72)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* simplify sandbox script

* simplify script, ensure it works

* align config of native submit

* align naming conventions between scripts, reinject rbac role

* create test job for quickly debugging provisioning issues

* fix tests

* linting

* move permissions to storage

* align config with bicep scripts

* Document the metrics panel of the pipeline overview in the quickstart (#76)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* linting

* add docstrings and disclaimers

* Add instructions on how to create a custom graph  (#78)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* add instructions on how to create a custom graph

* do better comments

* Refine native code (#82)

* fix silo name

* log only one datapoint per iteration for aggregated metrics

* Align terminology for iteration/round/num_rounds

* linting

* use storage blob data contributor

* add demoBaseName to guid name of role deployment (#85)

Co-authored-by: thomasp-ms <XXX@me.com>

* use id list, add listkeys builtin

* rename and dissociate orchestrator in resource + orchestrator

* separate orchestrator script

* draft sandbox setup

* make silo script distinct

* Update orchestrator_open.bicep

* Update internal_blob_open.bicep

* remove comments

* align hello world example with new naming conventions

* ensure uai assignments are created AFTER storage is created

* linting

* enforce precedence

* merge from secure branch

* use different regions, limit size of account

* reduce to 3 regions, add keys to guid

* substring

* align config

* do not use model

* Add msi version of scripts

* sandbox main can switch between uai and msi

* fix name

* linting

* linting

* implement ignore param, hotfix model with startswith

* Address my own comments on Jeff's PR (#96)

* remove magic number

* little improvements on some comments

* remove unused files

* put dash replacement next to length check

* don't necessarily assume USER AI

* UAI -> XAI

* revert previous UAI -> XAI changes

* move length check next to dash replacement

* typo

* try moving the dependsOn's

* RAGRS -> LRS

* revert dependsON changes

* revert another small change in a comment

Co-authored-by: thomasp-ms <XXX@me.com>

* align config of both submit scripts

* Make distinction between one-off and repeatable provisioning scripts (#99)

* clarify the role needed

* remove "custom role" line

* adjust locations

* use existing rg if not Owner of the sub

* clarify "Secure" setup

* add usage instructions in docstring

* explain what scripts are one-off (vs repeatable)

Co-authored-by: thomasp-ms <XXX@me.com>

* Align round/iteration terminology with the native code (#103)

* rename parameter in config file

* keep iterations instead of rounds

* round -> iteration

Co-authored-by: thomasp-ms <XXX@me.com>
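
For context on the iteration terminology settled in #103: the submit scripts build the pipeline as a plain Python loop over iterations and silos. A heavily simplified sketch using the Azure ML SDK v2; component paths, input/output names and compute names are placeholders, not the repository's real ones:

```python
# Simplified cross-silo FL pipeline skeleton (Azure ML SDK v2 / dsl).
# Placeholder component specs, input/output names and compute names.
from azure.ai.ml import dsl, load_component

training_component = load_component(source="./components/traininsilo/spec.yaml")
aggregation_component = load_component(source="./components/aggregatemodelweights/spec.yaml")

NUM_ITERATIONS = 2  # plain Python constant; iterations are unrolled at build time


@dsl.pipeline(description="Cross-silo federated learning (sketch)")
def fl_pipeline():
    running_checkpoint = None
    for iteration in range(1, NUM_ITERATIONS + 1):
        silo_outputs = []
        for silo_compute in ["silo0-cpu", "silo1-cpu", "silo2-cpu"]:
            kwargs = {"iteration_num": iteration}
            if running_checkpoint is not None:
                kwargs["checkpoint"] = running_checkpoint
            step = training_component(**kwargs)
            step.compute = silo_compute  # training runs inside the silo
            silo_outputs.append(step.outputs.model)
        agg = aggregation_component(
            input_silo_1=silo_outputs[0],
            input_silo_2=silo_outputs[1],
            input_silo_3=silo_outputs[2],
        )
        agg.compute = "orchestrator-cluster"  # aggregation runs in the orchestrator
        running_checkpoint = agg.outputs.aggregated_output
    return {"final_model": running_checkpoint}
```

The actual submit script would then hand the built pipeline to `ml_client.jobs.create_or_update(...)` with an experiment name.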

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* align both submits to work

* add optional test

* rename native to literal

* add getting started in readme, introduce emojis

* change person

* remove emojis

* Propose rewriting of readme to highlight motivation first (#110)

* propose rewriting of readme to highlight motivation first

* minor edit

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update README.md

* Update quickstart to mention rg clean-up

* Update quickstart.md

* Update quickstart.md

* Update quickstart.md

* Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120)

* Update quickstart to lower header (hotfix) (#117)
* add arm templates, add button in quickstart
* switch to releasebranchlink

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Add subscription id, resource group and workspace name as CLI args (#122)

* add more cli args

* code style

* code style

* update quickstart doc

* update readme
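
A hedged sketch of the argparse wiring that #122 describes, taking subscription id, resource group and workspace name as CLI arguments and falling back to a local `config.json` (argument names follow the PR title; the fallback behaviour is an assumption):

```python
# Sketch: accept subscription id, resource group and workspace name as CLI args
# and fall back to a local config file when they are not provided.
import argparse

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential


def get_ml_client():
    parser = argparse.ArgumentParser(description="Submit the FL pipeline")
    parser.add_argument("--subscription_id", type=str, default=None)
    parser.add_argument("--resource_group", type=str, default=None)
    parser.add_argument("--workspace_name", type=str, default=None)
    args, _ = parser.parse_known_args()

    if args.subscription_id and args.resource_group and args.workspace_name:
        # Explicit CLI arguments win over any local configuration.
        return MLClient(
            DefaultAzureCredential(),
            subscription_id=args.subscription_id,
            resource_group_name=args.resource_group,
            workspace_name=args.workspace_name,
        )
    # Otherwise rely on ./config.json downloaded from the Azure ML portal.
    return MLClient.from_config(DefaultAzureCredential())
```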

* Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Continuous Integration Tests (#119)

* take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline

* change path

* Test azure creds in the github workflow

* reformatting

* add pipeline validation and testing workflow

* add permissions

* add permissions

* check only certain dir to trigger workflows

* add soft validation for any iteration branch PR

* add provisioning script test

* testing

* create rg

* create rg

* change compute for testing

* change demoname

* delete old rg

* change demoname

* add demobasename and aml ws name as github secrets

* random demo base name

* auto generate random base name

* random demo base name

* adjust random num length

* add vnet sandbox test

* rmv dependency b/w jobs

* submit various pipelines

* change execution graph path

* add cli args in the factory code

* change compute for testing

* ignore validation - factory

* create custom action

* correct path

* correct path

* add shell in the github action

* create github actions and take required values as input params

* add shell

* add wait condition

* add logs

* linting

* correct rg name

* add azure ml extension

* handle ml extension installation error.

* add release branch test cases

* add script to delete run history

* cronjob test

* cronjob test

* checkout branch

* test run history deletion script

* test run history deletion script

* test run history deletion script

* azure login

* date format change

* remove double quotes

* date format change

* archive run history script tested

* Add vnet-based provisioning options to cookbook (#128)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Make deployment name unique in our github actions (#135)

* set unique name for deployments
* add attempt to deployment name

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Refactor compute/storage scripts to be independent (#132)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Provide motivation in provisioning docs for using service endpoints (#136)

* add motivation for service endpoints
* add link

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Refresh provisioning arm buttons with latest from bicep (#139)

* align names of directories
* rebuild all arm

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update silo_vnet_newstorage.md (#141)

* Add Bicep build vs ARM template diff test  (#140)

* Add diff test for bicep vs arm

* Debug

* Debug

* fix syntax error

* redirect build output to stdout

* correct path

* trigger arm template test when pushing changes to main branch from release* branch

* remove redundant logs

* Add "open aks with cc" provision tutorial and bicep scripts (#138)

* implement bicep scripts to provision open aks with cc
* add aks cc tutorial
* build arm and add in branch
* add button

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Provide script + tutorial to attach pair with an existing storage (#142)

* provision datastore with existing storage
* add arm for existing storage, add docs
* add link in readme

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* add latest arm templates to diff build (#145)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146)

* add jumpbox script with tutorial
* add template to diff build

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update jumpbox_cc.md (#147)

* update tutorials for silos to integrate feedback (#149)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Implement option to turn orchestrator storage fully private (behind PLE) (#150)


Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Tutorial on how to adapt native and factory code to write FL experiments.  (#100)

* WIP: add general information about the factory code

* moving factory-tutorial to another file

* add scenarios

* add instructions on how to adapt literal code

* rename file

* add general info and fix typos

* Jeff's feedback

* Apply code clean-up to provision scripts before bug bash (#148)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Instructions for provisioning external silos (#101)

* very first stab, far from done

* non-secure native job using the on-prem k8s

* use on-prem silos in example factory job

* Revert "very first stab, far from done"

This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca.

* Revert "use on-prem silos in example factory job"

This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe.

* Revert "non-secure native job using the on-prem k8s"

This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284.

* restore doc stub

* Make Git ignore resources for test jobs

* fix gitignore

* typo in comment

* steps A through D

* 2 typos

* move to subdir

* fix workspace creation

* add orchestrator part, role, and timeline

* last commit before PR

* adjust to new open_azureml_workspace.bicep

* first wave after Jeff's comments

* address jeff's comments

* typo

* light trims

Co-authored-by: thomasp-ms <XXX@me.com>

* bump up every title

* skeleton

* first attempt at data prep like Harmke

* change secret name

* wrong secret name

* remove separate unzip

* change clients, create silo data assets

* different names for silo data assets, duh

* cleanup

* adjust secret name in doc

* .

* use latest literal code

* align environment with literal

* base on latest component

* one dataset, comment out 2 unused args (for now)

* introduce new arguments

* reflect modified args in component spec

* remove unused arg from config

* start hooking up to Harmke's trainer

* initialize PTLearner and include in run.py

* use same values as Harmke for epochs and lr

* attributes with _, start implementing local_train

* add logging, add test(), fix device_

* train_loader_

* align _'s

* fix transform bug

* remove unused constants

* use proper model in aggregation code

* removed unused file

* remove unused code and arguments, logging to DEBUG

* restore `metrics_prefix` parameter

* finish restoring `metrics_prefix`

* do not duplicate model code

* revert dedup attempt

* improve docstrings and descriptions

* change experiment name

* change pipeline name and docstring

* cite sources, remove wrongly added licenses

* italics

* black

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: thomasp-ms <XXX@me.com>
Co-authored-by: unknown <Mitgarg17495@gmail.com>
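
The PTLearner hooked into run.py above is essentially a thin PyTorch wrapper exposing `local_train` and `test`. A stripped-down sketch under that assumption (constructor arguments and the trailing-underscore attribute names are illustrative):

```python
# Stripped-down PTLearner sketch: local training + evaluation for one silo.
# Illustrative only; the real class also handles checkpoints and MLflow logging.
import torch
from torch import nn


class PTLearner:
    def __init__(self, model, train_loader, test_loader, lr=0.01, epochs=3):
        self.device_ = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model_ = model.to(self.device_)
        self.train_loader_ = train_loader
        self.test_loader_ = test_loader
        self.loss_ = nn.CrossEntropyLoss()
        self.optimizer_ = torch.optim.SGD(self.model_.parameters(), lr=lr)
        self.epochs_ = epochs

    def local_train(self):
        """Run a few local epochs on this silo's data."""
        self.model_.train()
        for _ in range(self.epochs_):
            for inputs, targets in self.train_loader_:
                inputs, targets = inputs.to(self.device_), targets.to(self.device_)
                self.optimizer_.zero_grad()
                loss = self.loss_(self.model_(inputs), targets)
                loss.backward()
                self.optimizer_.step()

    def test(self):
        """Return accuracy on the local test split."""
        self.model_.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, targets in self.test_loader_:
                inputs, targets = inputs.to(self.device_), targets.to(self.device_)
                preds = self.model_(inputs).argmax(dim=1)
                correct += (preds == targets).sum().item()
                total += targets.numel()
        return correct / max(total, 1)
```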

* update formatting

* add readme section

* rename training to traininsilo for consistency

* add more comments and update docs

* include urgency in PR template (#184)

Co-authored-by: thomasp-ms <XXX@me.com>

* Share resources and standardize component names (#182)

* use shared agg component across all examples

* only keep a single {reqs/env}

* use more recent pip version

* standardize component spec name

* support dummy HELLOWORLD example in agg

* black

Co-authored-by: thomasp-ms <XXX@me.com>

* Thopo/share component and environment (#185)

* use shared agg component across all examples

* only keep a single {reqs/env}

* use more recent pip version

* standardize component spec name

* support dummy HELLOWORLD example in agg

* black

* SHARED -> utils, rename agg env

Co-authored-by: thomasp-ms <XXX@me.com>

* rename config to spec and add upload data step

* upload data script

* use util aggregateweights

* add data splitting pipeline

* docs update

* log pipeline level only once per silo training

* do categorical encoding ahead of splitting

* nit updates

* update comment

* update formatting
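
"do categorical encoding ahead of splitting" matters because every silo must end up with the same encoded feature space. A small pandas sketch of that ordering (column names and the per-region split are made up for illustration):

```python
# Sketch: encode categoricals on the full dataset first, then split per silo,
# so every silo shares the same encoded feature columns.
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["east", "west", "south", "east", "west", "south"],
        "merchant_type": ["food", "travel", "food", "retail", "travel", "retail"],
        "amount": [12.5, 430.0, 8.9, 55.1, 210.4, 33.3],
        "is_fraud": [0, 1, 0, 0, 1, 0],
    }
)

# Encode BEFORE splitting so the dummy columns are identical everywhere.
encoded = pd.get_dummies(df, columns=["merchant_type"])

# Then partition by region, one shard per silo.
silo_shards = {
    region: shard.drop(columns=["region"]) for region, shard in encoded.groupby("region")
}
for region, shard in silo_shards.items():
    shard.to_csv(f"{region}_transactions.csv", index=False)
```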

* Hotfix: grant `az login` permissions to the 'clear run history' script  (#166)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* simplify the job wait condition's code

* add comments

* trigger mnist pipeline check

* test token validity

* grant `az login` permissions to the clear-history script

* revert to sleep wait code

* test access token validity

* nit

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
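
The job wait condition discussed in this hotfix is, at its core, a sleep-based polling loop on the submitted job's status. A hedged sketch with the Azure ML SDK v2 (terminal state names follow Azure ML job statuses; the polling interval is arbitrary):

```python
# Rough sketch of a "wait for the submitted job" loop used in CI.
import time

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(DefaultAzureCredential())


def wait_for_completion(job_name: str, poll_seconds: int = 60) -> str:
    terminal = {"Completed", "Failed", "Canceled"}
    while True:
        status = ml_client.jobs.get(job_name).status
        print(f"Job {job_name} is {status}")
        if status in terminal:
            return status
        time.sleep(poll_seconds)
```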

* fix readme

* aggregate weights on whichever device is available

* update docstrings

* update formatting

* reduce upload pipeline file

* fix datastore

* add info about data upload step

* fix typo

* steps for changing access policies

* update docs

* Named Entity Recognition example (#177)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add multinerd template files

* NER components

* re-structure

* partition data + log metrics

* add readme

* add readme

* restructuring

* restructuring

* add doc strings

* train on gpus

* create a separate component to upload data on silos

* docs

* rename

* add assert statement

* change upload-data job compute to orchestrator compute

* remove ner from literal example choices

* fix doc

* add model-name, tokenizer configurable

* pip version upgrade

* reformatting

* use shared aggregated component

* rename script file

* add note

* create a compute that has access to silos' storage accs

* change data uploading approach

* update doc

* incorporate Thomas's feedback

* fix typo

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
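
"add model-name, tokenizer configurable" boils down to passing a Hugging Face checkpoint name into the NER training script instead of hard-coding it. A minimal sketch (the default model name and label count are examples, not necessarily what the repo uses):

```python
# Sketch: make the pretrained model / tokenizer a CLI parameter of the NER
# training component instead of hard-coding it.
import argparse

from transformers import AutoModelForTokenClassification, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, default="bert-base-cased")  # example default
parser.add_argument("--num_labels", type=int, default=9)
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained(args.model_name)
model = AutoModelForTokenClassification.from_pretrained(
    args.model_name, num_labels=args.num_labels
)
```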

* Create nice-looking homepage for the examples in readme+docs (#190)

* add homepage for industry examples

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Align the medical imaging data provisioning process with other examples (#191)

* adjust paths in config file

* support component with 1 output for Pneumonia

* formatting

* adjust doc to new provisioning

* remove GH action for dataprep

* custom component for provisioning pneumonia data

* black

Co-authored-by: thomasp-ms <XXX@me.com>

* hot fix (#192)

Co-authored-by: thomasp-ms <XXX@me.com>

* Lots of micro-fixes after bug bashing all 3 industry examples (#194)

related to components:
* create distinct names for all components of each scenario
* polish component descriptions
* remove unused mnist datatransfer and postprocessing components
* upgrade all MCR images to a more recent OS
* cut some unnecessary dependencies
* use curated environments whenever possible (to speed up job build time)

related to pipelines:
* fix issues with ccfraud submit script (path to shared folder)
* remove unnecessary json+azure imports in submit scripts
* align all 3 submissions scripts
* in upload data pipeline, make --example required without default value to force intentional decision
* in upload data pipeline, use scenario name in the output path to avoid collision
* give each submit pipeline a distinct experiment and run name for readability

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Standardize all 3 real world example tutorials (#193)

* standardize documentation on all 3 examples
* change titles
* fix spaces
* add pip instructions
* upgrade azure-ai-ml version

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* poc for ddp training

* remove debug code + allow logging from multiple nodes

* update formatting

* provide correct link to Kaggle dataset (#196)

* provide correct link

* .

* .

Co-authored-by: thomasp-ms <XXX@me.com>

* add DDP docs
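
The DDP proof-of-concept above (together with the later "only destroy ddp group if it was created" fix) roughly corresponds to the following PyTorch pattern; this is a sketch, assuming rank and world size arrive via the usual environment variables:

```python
# Sketch: optional DistributedDataParallel setup, with a guarded teardown so
# single-process (non-DDP) runs do not crash on destroy_process_group().
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def maybe_wrap_ddp(model: torch.nn.Module) -> torch.nn.Module:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1:
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)
            model = model.to(f"cuda:{local_rank}")
            return DDP(model, device_ids=[local_rank])
        return DDP(model)
    return model  # plain single-process training


def cleanup():
    # Only tear down the process group if DDP was actually initialized.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```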

* Add CI tests for industry-relevant examples  (#186)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add pneumonia and ner examples tests

* add ccfraud test in the CI/CD pipeline

* add data upload test

* trigger workflow

* CI testing1

* CI testing1

* test kv kaggle creds

* fix creds

* fix creds

* set kaggle creds

* test pneumonia data-upload

* test all industry relevant examples

* upload data test for 3 examples

* add main tests

* rmv redundant chars

* fix typo

* avoid running industry-relevant examples tests on the vnet setup as it is already covered by the open setup

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* CLI commands to add credentials in the workspace keyvault  (#199)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add cli cmds to set a kv secret

* Jeff's feedback

* Implement Thomas's feedback

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
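
#199 adds CLI commands for storing the Kaggle credentials in the workspace key vault; for completeness, a hedged Python equivalent using the Key Vault SDK (the vault URL and secret names below are placeholders, not the repo's conventions):

```python
# Python equivalent of the documented CLI approach: store Kaggle credentials
# as secrets in the workspace key vault. Vault name and secret names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<your-workspace-keyvault-name>.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

client.set_secret("kaggleusername", "<your-kaggle-username>")
client.set_secret("kagglekey", "<your-kaggle-api-key>")
```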

* Thomas/bug bash feedback 04 (#203)

* no need to navigate to a specific directory

* keyvault -> key vault

* improve Kaggle sections

* GPU's for NER example

* ARM templates with latest bicep version

* bold

* GPU instructions in quickstart

Co-authored-by: thomasp-ms <XXX@me.com>

* fix test to align with new sdk (#204)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Hotfix: DataAccessError (orchestrator access) (#205)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* fix bug

* update arm template

* fix a problem that was encountered during resolving merge conflicts

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add GitHub workflow concurrency

* test 1

* test 1

* test 1

* test 2

* test 3

* test 2

* test 3

* implement token expiry workaround

* test 1

* workaround to handle token expiry error

* fix typo

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Implement troubleshooting guide with first typical issues (#208)

* write troubleshooting guide

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Fix order of precedence for AML workspace references in submit.py (#209)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* fix order of precedence

* fix build

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Add data permissions issue to TSG (#210)

* add permissions issue to TSG

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* November notes (#211)

Co-authored-by: thomasp-ms <XXX@me.com>

* create an instance type and select it for the CC run

* upgrade all pip dependencies (#212)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* format

* use azureml built in distribution fw

* Test industry-relevant examples if any changes in the `utils` dir are observed  (#221)

* add test to validate changes in the utils dir

* test1 trigger workflow

* fix typo

* only destroy ddp group if it was created

* remove unnecessary imports

* allow model aggregation across a mixture of ddp and non-ddp processes

* use documentation instead of ps1 script for creating instancetype for CC

* add instance type assignment for all examples

* formatting

* formatting

* update batch size

* update model name

* use older pytorch

* Generalize aggregate component to Babel (#220)

* init branch

* wip data exploration

* data exploration region/silo

* basic model

* regions

* basic network and finished data processing

* training

* Implement generic FedAvg without model object (#167)

* generic fedavg pytorch
* support model classes
* add docstrings

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* add README

* update normalization

* update exploration

* Thomas/small improvements (#171)

* remove unused local MNIST data

* add link to provisioning cookbook in docs readme

* recommend creating a conda env in the quickstart

Co-authored-by: thomasp-ms <XXX@me.com>

* update example for finance with multiple models

* successful training through lstm

* revert unneeded changes

* remove local exploration ipynb

* fix test metric

* different param value for AKS (#179)

Co-authored-by: thomasp-ms <XXX@me.com>

* Pneumonia xray example (#164)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3
silos, each with their own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add the pytorch channel in the conda env yaml, and move the readme to the docs folder
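
  The "main if block" change above follows the common pattern of configuring logging only when a script is executed directly. A hedged illustration of that pattern (module and message names are placeholders):

  ```python
  import logging
  import sys

  logger = logging.getLogger(__name__)


  def run():
      logger.info("starting preprocessing")  # placeholder for the component's real work


  if __name__ == "__main__":
      # Configure handlers only when the script is the entry point,
      # so importing the module does not reconfigure global logging.
      logging.basicConfig(level=logging.INFO, stream=sys.stdout)
      run()
  ```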

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add more intuitive agg output dir path

* reformat using black

* add iteration2 branch for PR build testing

* reformat date and pass kwargs instead in the getUniqueIdentifier fn
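
  For the identifier change above, one plausible shape of a date-based unique-identifier helper that accepts keyword arguments (the repo's actual `getUniqueIdentifier` may differ; names and format are assumptions):

  ```python
  import uuid
  from datetime import datetime


  def get_unique_identifier(length: int = 8, date_format: str = "%Y%m%d_%H%M%S", **kwargs) -> str:
      """Build a run identifier from a formatted timestamp plus a short random suffix.
      Extra keyword arguments are accepted so callers can evolve without breaking the signature."""
      prefix = kwargs.get("prefix", "fl")
      stamp = datetime.utcnow().strftime(date_format)
      return f"{prefix}_{stamp}_{uuid.uuid4().hex[:length]}"


  # Example output shape: fl_20230308_153045_1a2b3c4d
  print(get_unique_identifier())
  ```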

* working submit

* working factory submit

* linting

* move component path

* add soft validation

* add soft validation

* Add basic tests on config
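
  "Soft validation" above means checking the pipeline config without hard-failing on every issue, so submission can still proceed (for example with an `--ignore_validation` style flag). A minimal sketch of that idea, with hypothetical field names:

  ```python
  import logging
  from typing import List

  logger = logging.getLogger(__name__)


  def soft_validate(config: dict, required_keys: List[str]) -> bool:
      """Collect problems as warnings instead of raising, returning False if any were found."""
      problems = [key for key in required_keys if key not in config]
      for key in problems:
          logger.warning("config is missing expected key: %s", key)
      return not problems


  # Usage (field names are illustrative):
  # ok = soft_validate(config, ["orchestrator", "silos", "training"])
  ```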

* linting

* working bicep deployment for vanilla demo

* proper orchestrator script, double containers

* fix name

* docstring

* docstring

* rollback to using only 1 container

* align naming convention

* instructions

* working submit

* set up permission model

* working orch perms

* wonky perms assignment

* working role assignments

* remove old perm model

* working except silo2orch

* fix typo

* working submit with config

* add sku as param

* use R/W for now

* fix submit to align with bicep provisioning demo

* linting

* remove dataset files

* fix docstring on permission model

* write draft docs with homepage, align structure, remove requirements, ensure demo documented

* rollback change to req

* change factory to use custom model type during validation

* linting

* Display metrics at the pipeline level (#68)

* Fix optional input yaml and mlflow log bugs (#59)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up three silos, each with its own compute and datastore.

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add the pytorch channel in the conda env yaml, and move the readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* Accommodate optional input changes and switch from mlflow autologging to manual logging
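
  The switch above from mlflow autologging to manual logging amounts to calling the mlflow logging APIs explicitly in the training code. A hedged sketch (metric names and values are illustrative):

  ```python
  import mlflow

  # Assumes an active MLflow run (Azure ML jobs typically provide one);
  # otherwise wrap the calls in `with mlflow.start_run():`.
  mlflow.log_param("learning_rate", 0.01)

  for epoch in range(3):
      train_loss = 1.0 / (epoch + 1)  # placeholder for the real computed loss
      mlflow.log_metric("train_loss", train_loss, step=epoch)
  ```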

* code style

* change optional inputs syntax

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Make changes to display all metrics at the pipeline level

* Log preprocessing metadata in mlflow

* linting

* Pass client as an arg

* Fix typo, remove name from silo config, fix the metric naming convention, and add a metric identifier in the preprocessing component

Co-aut…
10 people committed Mar 8, 2023
1 parent e423f1b commit c378e97
Showing 49 changed files with 1,575 additions and 1,481 deletions.
@@ -1,5 +1,5 @@
name: Submit example factory pipeline
description: Submit example factory pipeline in AML
name: Submit example scatter-gather pipeline
description: Submit example scatter-gather pipeline in AML
inputs:
client-id:
description: Client ID of the service principal
@@ -46,6 +46,6 @@ runs:
shell: bash
run: pip install -r examples/pipelines/requirements.txt

- name: Submit fl_cross_silo_factory pipeline
- name: Submit fl_cross_silo_scatter_gather pipeline
shell: bash
run: python examples/pipelines/fl_cross_silo_factory/submit.py --subscription_id ${{ inputs.subscription-id }} --resource_group ${{ inputs.resource-group }} --workspace_name ${{ inputs.workspace-name }} --example ${{ inputs.example }} --ignore_validation --wait || [ $? == 5 ]
run: python examples/pipelines/fl_cross_silo_scatter_gather/submit.py --subscription_id ${{ inputs.subscription-id }} --resource_group ${{ inputs.resource-group }} --workspace_name ${{ inputs.workspace-name }} --example ${{ inputs.example }} --ignore_validation --wait || [ $? == 5 ]
28 changes: 14 additions & 14 deletions .github/workflows/pipeline-e2e-test.yaml
@@ -118,7 +118,7 @@ jobs:
workspace-name: aml-fl1${{ github.run_attempt }}${{ github.run_id }}
example: MNIST

open-sandbox-factory-helloworld-test:
open-sandbox-scatter-gather-helloworld-test:
needs: open-sandbox-test
runs-on: ubuntu-latest
permissions:
@@ -127,8 +127,8 @@

- uses: actions/checkout@v2

- name: Submit Helloworld example using the factory code
uses: ./.github/actions/submit-aml-factory-pipeline
- name: Submit Helloworld example using the scatter-gather code
uses: ./.github/actions/submit-aml-scatter-gather-pipeline
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
@@ -137,16 +137,16 @@
workspace-name: aml-fl1${{ github.run_attempt }}${{ github.run_id }}
example: HELLOWORLD

open-sandbox-factory-mnist-test:
needs: open-sandbox-factory-helloworld-test
open-sandbox-scatter-gather-mnist-test:
needs: open-sandbox-scatter-gather-helloworld-test
runs-on: ubuntu-latest
permissions:
id-token: write
steps:
- uses: actions/checkout@v2

- name: Submit MNIST example using the factory code
uses: ./.github/actions/submit-aml-factory-pipeline
- name: Submit MNIST example using the scatter-gather code
uses: ./.github/actions/submit-aml-scatter-gather-pipeline
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
@@ -297,16 +297,16 @@ jobs:
workspace-name: aml-fl2${{ github.run_attempt }}${{ github.run_id }}
example: MNIST

vnet-sandbox-factory-helloworld-test:
vnet-sandbox-scatter-gather-helloworld-test:
needs: vnet-sandbox-test
runs-on: ubuntu-latest
permissions:
id-token: write
steps:
- uses: actions/checkout@v2

- name: Submit Helloworld example using the factory code
uses: ./.github/actions/submit-aml-factory-pipeline
- name: Submit Helloworld example using the scatter-gather code
uses: ./.github/actions/submit-aml-scatter-gather-pipeline
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
@@ -315,16 +315,16 @@
workspace-name: aml-fl2${{ github.run_attempt }}${{ github.run_id }}
example: HELLOWORLD

vnet-sandbox-factory-mnist-test:
needs: vnet-sandbox-factory-helloworld-test
vnet-sandbox-scatter-gather-mnist-test:
needs: vnet-sandbox-scatter-gather-helloworld-test
runs-on: ubuntu-latest
permissions:
id-token: write
steps:
- uses: actions/checkout@v2

- name: Submit MNIST example using the factory code
uses: ./.github/actions/submit-aml-factory-pipeline
- name: Submit MNIST example using the scatter-gather code
uses: ./.github/actions/submit-aml-scatter-gather-pipeline
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
20 changes: 10 additions & 10 deletions .github/workflows/release-branch-test.yaml
@@ -23,7 +23,7 @@ jobs:
components-mnist: ${{ steps.filter.outputs.components-mnist }}
components-utils: ${{ steps.filter.outputs.components-utils }}
literal: ${{ steps.filter.outputs.literal }}
factory: ${{ steps.filter.outputs.factory }}
scatter-gather: ${{ steps.filter.outputs.scatter-gather }}
components-pneumonia: ${{ steps.filter.outputs.components-pneumonia }}
components-ner: ${{ steps.filter.outputs.components-ner }}
components-ccfraud: ${{ steps.filter.outputs.components-ccfraud }}
@@ -56,8 +56,8 @@ jobs:
- 'examples/components/utils/**'
literal:
- 'examples/pipelines/fl_cross_silo_literal/**'
factory:
- 'examples/pipelines/fl_cross_silo_factory/**'
scatter-gather:
- 'examples/pipelines/fl_cross_silo_scatter_gather/**'
components-pneumonia:
- 'examples/components/PNEUMONIA/**'
components-ner:
@@ -149,18 +149,18 @@ jobs:
workspace-name: ${{ secrets.AML_WORKSPACE_NAME }}
example: HELLOWORLD

factory-helloworld-test:
scatter-gather-helloworld-test:
runs-on: ubuntu-latest
needs: paths-filter
if: ${{ (needs.paths-filter.outputs.components-helloworld == 'true') || (needs.paths-filter.outputs.factory == 'true') }}
if: ${{ (needs.paths-filter.outputs.components-helloworld == 'true') || (needs.paths-filter.outputs.scatter-gather == 'true') }}
permissions:
id-token: write
steps:

- uses: actions/checkout@v2

- name: Submit Helloworld example using the factory code
uses: ./.github/actions/submit-aml-factory-pipeline
- name: Submit Helloworld example using the scatter-gather code
uses: ./.github/actions/submit-aml-scatter-gather-pipeline
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
@@ -188,7 +188,7 @@ jobs:
workspace-name: ${{ secrets.AML_WORKSPACE_NAME }}
example: MNIST

factory-mnist-test:
scatter-gather-mnist-test:
runs-on: ubuntu-latest
needs: paths-filter
if: ${{ needs.paths-filter.outputs.components-mnist == 'true' }}
@@ -198,8 +198,8 @@

- uses: actions/checkout@v2

- name: Submit MNIST example using the factory code
uses: ./.github/actions/submit-aml-factory-pipeline
- name: Submit MNIST example using the scatter-gather code
uses: ./.github/actions/submit-aml-scatter-gather-pipeline
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
3 changes: 3 additions & 0 deletions .gitignore
@@ -133,3 +133,6 @@ config.json

# for ignoring test jobs
/examples/pipelines/test*

# for ignoring local sandbox files for debugging/testing
/sandbox/*
24 changes: 24 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,29 @@
# FL Accelerator Changelog

## February 2023 release

We are excited to announce the release of the February iteration of our [FL Accelerator repository](https://github.com/Azure-Samples/azure-ml-federated-learning).

Some of the major updates we have made include the launch of a vertical federated learning feature, an FL pipeline that offers a native AML FL experience integrated with the factory engine, and benchmark results that reveal a comprehensive comparison between FL and non-FL experiments.

### FL Experience
- Implemented _Vertical Federated Learning_ and offered a [tutorial](./docs/tutorials/vertical-fl.md) to run MNIST or CCFRAUD examples.
- Introduced a [scatter-gather](./docs/tutorials/literal-scatter-gather-tutorial.md) pipeline that delivers a real AML FL native experience.
- Conducted a comprehensive comparison between FL and non-FL experiments; the benchmark report can be accessed [here](./docs/concepts/benchmarking.md).

### Provisioning
- Provided [instructions](./docs/tutorials/update-local-data-to-silo-storage-account.md) and a script to facilitate the upload of local data to a silo storage.
- Incremental improvements:
- Enhanced the network security rules and minimized the workspace dependencies for provisioning resources.
<!-- ### Documentation -->

<!-- ### Repository structure
-->

To get started, go [here](./docs/quickstart.md)!

If you find a bug or have a feature request, please open an issue on the [GitHub repository](https://github.com/Azure-Samples/azure-ml-federated-learning/issues).

## January 2023 release

We are excited to announce the release of the January iteration of our [FL Accelerator repository](https://github.com/Azure-Samples/azure-ml-federated-learning).
10 changes: 10 additions & 0 deletions README.md
@@ -12,6 +12,7 @@ This repo provides some code samples for running a federated learning pipeline i
- [Real world examples](#real-world-examples)
- [FL Frameworks](#fl-frameworks)
- [Documentation](#documentation)
- [Real-world example benchmarks](#real-world-example-benchmarks)
- [Need Support?](#need-support)
- [Contributing](#contributing)

@@ -51,6 +52,15 @@ If you are already using a specific FL framework, you can port your code to work

Please find a full documentation of this project [**here**](docs/README.md).

### Real-world example benchmarks

A benchmarking analysis is performed for each real-world example to understand the validity, efficiency, and scalability of our implementation of FL:

| Training overhead | Model performance | Scalability |
|:-:|:-:|:-:|
| [![overhead icon](./docs/pics/pneumonia_time.jpg)](./docs/concepts/benchmarking.md/#21-training-overhead)| [![performance icon](./docs/pics/pneumonia_acc.jpg)](./docs/concepts/benchmarking.md/#22-model-performance)| [![scala icon](./docs/pics/pneumonia_ddp.jpg)](./docs/concepts/benchmarking.md/#23-scalability-with-training)


### Need Support?

Please check the [**troubleshooting guide**](./docs/troubleshoot.md) for possible solutions. If you are unable to find a solution, please open an issue in this repository.
12 changes: 8 additions & 4 deletions docs/README.md
@@ -22,8 +22,9 @@
- [Tutorials](#tutorials)
- [What this repo has to offer?](#what-this-repo-has-to-offer)
- [Provisioning guide](#provisioning-guide)
- [How to adapt the "literal" and the "factory" code for your own scenario](#how-to-adapt-the-literal-and-the-factory-code-for-your-own-scenario)
- [How to adapt the "literal" and the "scatter-gather" code for your own scenario](#how-to-adapt-the-literal-and-the-scatter-gather-code-for-your-own-scenario)
- [Read local data in an on-premises Kubernetes silo](#read-local-data-in-an-on-premises-kubernetes-silo)
- [Upload local data to silo storage account](#upload-local-data-to-silo-storage-account)
- [Troubleshooting guide](#troubleshooting-guide)

## Motivation
@@ -44,7 +45,7 @@ To know more about the resource provisioning alternatives, please go to the prov

## Real-world examples

In addition to the [literal](../examples/pipelines/fl_cross_silo_literal/) and [factory](../examples/pipelines/fl_cross_silo_factory/) sample experiments, we also provide examples based on real-world applications.
In addition to the [literal](../examples/pipelines/fl_cross_silo_literal/) and [scatter-gather](../examples/pipelines/fl_cross_silo_scatter_gather/) sample experiments, we also provide examples based on real-world applications.

> Note: The `upload-data` scripts are only included in the examples for the convenience of executing the FL examples. Please ignore this section if you are performing an actual FL experiment for your scenario.
@@ -123,9 +124,9 @@ This repo provides some code samples for running a federated learning pipeline i

This guide will help you adapt your own setup depending on your provisioning strategy and your constraints. See [here](./provisioning/README.md) for detailed instructions.

## How to adapt the "literal" and the "factory" code for your own scenario
## How to adapt the "literal" and the "scatter-gather" code for your own scenario

The complete tutorial can be found [**here**](./tutorials/literal-factory-tutorial.md)
The complete tutorial can be found [**here**](./tutorials/literal-scatter-gather-tutorial.md)

## Read local data in an on-premises Kubernetes silo

@@ -134,6 +135,9 @@ This tutorial will show you how to access, within an Azure ML job running on an
## Differential privacy for cross-silo horizontal federated learning
The complete tutorial can be found [**here**](./tutorials/dp-for-cross-silo-horizontal-fl.md).

## Upload local data to silo storage account
This tutorial will teach you how to upload local data to a silo storage account. We will be using a CLI job to do the upload. The job will run on the silo compute, which does have access to the silo storage account. See detailed instructions [here](./tutorials/update-local-data-to-silo-storage-account.md).

# Troubleshooting guide

If you experience an issue using this repository, please check the [**troubleshooting guide**](./troubleshoot.md) for possible solutions. If you are unable to find a solution, please open an issue in this repository.
