Update release branch (#275)

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* init branch

* wip data exploration

* data exploration region/silo

* basic model

* regions

* basic network and finished data processing

* training

* Implement generic FedAvg without model object (#167)

* generic fedavg pytorch
* support model classes
* add docstrings

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
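The "generic FedAvg without model object" item above refers to averaging silo checkpoints directly as state dicts. A minimal sketch of that idea, assuming PyTorch checkpoints and hypothetical names such as `silo_checkpoints` (this is not the repository's actual aggregation component):

```python
from collections import OrderedDict


def fedavg_state_dicts(silo_checkpoints, weights=None):
    """Average a list of PyTorch state_dicts without instantiating the model class."""
    if weights is None:
        weights = [1.0 / len(silo_checkpoints)] * len(silo_checkpoints)
    averaged = OrderedDict()
    for key in silo_checkpoints[0]:
        averaged[key] = sum(w * sd[key].float() for w, sd in zip(weights, silo_checkpoints))
    return averaged


# hypothetical usage: load each silo's output checkpoint, then save the average
# aggregated = fedavg_state_dicts([torch.load(p) for p in silo_paths])
# torch.save(aggregated, "aggregated_model.pt")
```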

* add README

* update normalization

* update exploration

* Thomas/small improvements (#171)

* remove unused local MNIST data

* add link to provisioning cookbook in docs readme

* recommend creating a conda env in the quickstart

Co-authored-by: thomasp-ms <XXX@me.com>

* update example for finance with multiple models

* successful training through lstm

* revert unneeded changes

* remove local exploration ipynb

* fix test metric

* different param value for AKS (#179)

Co-authored-by: thomasp-ms <XXX@me.com>

* Pneumonia xray example (#164)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add more intuitive agg output dir path

* reformat using black

* add iteration2 branch for PR build testing

* reformat date and pass kwargs to the getUniqueIdentifier fn instead

* working submit

* working factory submit

* linting

* move component path

* add soft validation

* add soft validation

* Add basic tests on config

* linting

* working bicep deployment for vanilla demo

* proper orchestrator script, double containers

* fix name

* docstring

* docstring

* rollback to using only 1 container

* align naming convention

* instructions

* working submit

* set up permission model

* working orch perms

* wonky perms assignment

* working role assignments

* remove old perm model

* working except silo2orch

* fix typo

* working submit with config

* add sku as param

* use R/W for now

* fix submit to align with bicep provisioning demo

* linting

* remove dataset files

* fix docstring on permission model

* write draft docs with homepage, align structure, remove requirements, ensure demo documented

* rollback change to req

* change factory to use custom model type during validation

* linting

* Display metrics at the pipeline level (#68)

* Fix optional input yaml and mlflow log bugs (#59)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* Accommodate optional input changes and switch from mlflow autologging to manual logging

* code style

* change optional inputs syntax

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Make changes to display all metrics at the pipeline level

* Log preprocessing metadata in mlflow

* linting

* Pass client as an arg

* Fix typo, rmv name from silo config, metric naming convention, and add
metric identifier in the preprocessing component

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
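The pipeline-level metrics items above amount to logging each silo's metrics against the parent (pipeline) run rather than the child job, using a per-silo metric prefix so names don't collide. A minimal sketch, assuming MLflow is pointed at the AzureML workspace; the environment variable lookup and the `metrics_prefix` value are assumptions, not the exact code of the components:

```python
import os

from mlflow.tracking import MlflowClient

client = MlflowClient()

# AzureML exposes the root (pipeline) run id to child jobs; treating this as an assumption.
parent_run_id = os.environ.get("AZUREML_ROOT_RUN_ID")
metrics_prefix = "silo_1"  # hypothetical per-silo metric identifier

if parent_run_id:
    # One datapoint per iteration, prefixed so each silo's curve is distinct at the pipeline level.
    client.log_metric(parent_run_id, f"{metrics_prefix}/train_loss", 0.42, step=1)
```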

* Remove redundant files from the mlops directory (#69)

* Remove internal & external dir as provisioning is taken care by bicep

* keep mnist data files

* rename demo script (#71)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Unified documentation (#72)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* simplify sandbox script

* simplify script, ensure it works

* align config of native submit

* align naming conventions between scripts, reinject rbac role

* create test job for quickly debugging provisioning issues

* fix tests

* linting

* move permissions to storage

* align config with bicep scripts

* Document the metrics panel of the pipeline overview in the quickstart (#76)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* linting

* add docstrings and disclaimers

* Add instructions on how to create a custom graph  (#78)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* add instructions on how to create a custom graph

* do better comments

* Refine native code (#82)

* fix silo name

* log only one datapoint per iteration for aggregated metrics

* Align terminology for iteration/round/num_rounds

* linting

* use storage blob data contributor

* add demoBaseName to guid name of role deployment (#85)

Co-authored-by: thomasp-ms <XXX@me.com>

* use id list, add listkeys builtin

* rename and dissociate orchestrator in resource + orchestrator

* separate orchestrator script

* draft sandbox setup

* make silo script distinct

* Update orchestrator_open.bicep

* Update internal_blob_open.bicep

* remove comments

* align hello world example with new naming conventions

* ensure uai assignments are created AFTER storage is created

* linting

* enforce precedence

* merge from secure branch

* use different regions, limit size of account

* reduce to 3 regions, add keys to guid

* substring

* align config

* do not use model

* Add msi version of scripts

* sandbox main can switch between uai and msi

* fix name

* linting

* linting

* implement ignore param, hotfix model with startswith

* Address my own comments on Jeff's PR (#96)

* remove magic number

* little improvements on some comments

* remove unused files

* put dash replacement next to length check

* don't necessarily assume user-assigned identity (UAI)

* UAI -> XAI

* revert previous UAI -> XAI changes

* move length check next to dash replacement

* typo

* try moving the dependsOn's

* RAGRS -> LRS

* revert dependsON changes

* revert another small change in a comment

Co-authored-by: thomasp-ms <XXX@me.com>

* align config of both submit scripts

* Make distinction between on-off and repeatable provisioning scripts (#99)

* clarify the role needed

* remove "custom role" line

* adjust locations

* use existing rg if not Owner of the sub

* clarify "Secure" setup

* add usage instructions in docstring

* explain what scripts are one-off (vs repeatable)

Co-authored-by: thomasp-ms <XXX@me.com>

* Align round/iteration terminology with the native code (#103)

* rename parameter in config file

* keep iterations instead of rounds

* round -> iteration

Co-authored-by: thomasp-ms <XXX@me.com>

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* align both submits to work

* add optional test

* rename native to literal

* add getting started in readme, introduce emojis

* change person

* remove emojis

* Propose rewriting of readme to highlight motivation first (#110)

* propose rewriting of readme to highlight motivation first

* minor edit

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update README.md

* Update quickstart to mention rg clean-up

* Update quickstart.md

* Update quickstart.md

* Update quickstart.md

* Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120)

* Update quickstart to lower header (hotfix) (#117)
* add arm templates, add button in quickstart
* switch to releasebranchlink

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Add subscription id, resource group and workspace name as CLI args (#122)

* add more cli args

* code style

* code style

* update quickstart doc

* update readme
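As a sketch of what the extra CLI args enable, here is a minimal, hypothetical version of wiring them into an `MLClient`; the argument names are assumptions and may not match the submit scripts exactly:

```python
import argparse

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

parser = argparse.ArgumentParser()
parser.add_argument("--subscription_id", required=True)
parser.add_argument("--resource_group", required=True)
parser.add_argument("--workspace_name", required=True)
args = parser.parse_args()

# Connect to the target AzureML workspace using the values passed on the command line.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id=args.subscription_id,
    resource_group_name=args.resource_group,
    workspace_name=args.workspace_name,
)
```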

* Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Continuous Integration Tests (#119)

* take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline

* change path

* Test azure creds in the github workflow

* reformatting

* add pipeline validation and testing workflow

* add permissions

* add permissions

* check only certain dir to trigger workflows

* add soft validation for any iteration branch PR

* add provisioning script test

* testing

* create rg

* create rg

* change compute for testing

* change demoname

* delete old rg

* change demoname

* add demobasename and aml ws name as github secrets

* random demo base name

* auto generate random base name

* random demo base name

* adjust random num length

* add vnet sandbox test

* rmv dependency b/w jobs

* submit various pipelines

* change execution graph path

* add cli args in the factory code

* change compute for testing

* ignore validation - factory

* create custom action

* correct path

* correct path

* add shell in the github action

* create github actions and take required values as input params

* add shell

* add wait condition

* add logs

* linting

* correct rg name

* add azure ml extension

* handle ml extension installation error.

* add release branch test cases

* add script to delete run history

* cronjob test

* cronjob test

* checkout branch

* test run history deletion script

* test run history deletion script

* test run history deletion script

* azure login

* date format change

* remove double quotes

* date format change

* archive run history script tested

* Add vnet-based provisioning options to cookbook (#128)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Make deployment name unique in our github actions (#135)

* set unique name for deployments
* add attempt to deployment name

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Refactor compute/storage scripts to be independent (#132)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Provide motivation in provisioning docs for using service endpoints (#136)

* add motivation for service endpoints
* add link

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Refresh provisioning arm buttons with latest from bicep (#139)

* align names of directories
* rebuild all arm

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update silo_vnet_newstorage.md (#141)

* Add Bicep build vs ARM template diff test  (#140)

* Add diff test for bicep vs arm

* Debug

* Debug

* fix syntax error

* redirect build output to stdout

* correct path

* trigger arm template test when pushing changes to main branch from release* branch

* remove redundant logs

* Add "open aks with cc" provision tutorial and bicep scripts (#138)

* implement bicep scripts to provision open aks with cc
* add aks cc tutorial
* build arm and add in branch
* add button

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Provide script + tutorial to attach pair with an existing storage (#142)

* provision datastore with existing storage
* add arm for existing storage, add docs
* add link in readme

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* add latest arm templates to diff build (#145)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146)

* add jumpbox script with tutorial
* add template to diff build

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update jumpbox_cc.md (#147)

* update tutorials for silos to integrate feedback (#149)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Implement option to turn orchestrator storage fully private (behind PLE) (#150)


Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Tutorial on how to adapt native and factory code to write FL experiments.  (#100)

* WIP: add general information about the factory code

* moving factory-tutorial to another file

* add scenarios

* add instructions on how to adapt literal code

* rename file

* add general info and fix typos

* Jeff's feedback

* Apply code clean-up to provision scripts before bug bash (#148)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Instructions for provisioning external silos (#101)

* very first stab, far from done

* non-secure native job using the on-prem k8s

* use on-prem silos in example factory job

* Revert "very first stab, far from done"

This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca.

* Revert "use on-prem silos in example factory job"

This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe.

* Revert "non-secure native job using the on-prem k8s"

This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284.

* restore doc stub

* Make Git ignore resources for test jobs

* fix gitignore

* typo in comment

* steps A through D

* 2 typos

* move to subdir

* fix workspace creation

* add orchestrator part, role, and timeline

* last commit before PR

* adjust to new open_azureml_workspace.bicep

* first wave after Jeff's comments

* address jeff's comments

* typo

* light trims

Co-authored-by: thomasp-ms <XXX@me.com>

* bump up every title

* skeleton

* first attempt at data prep like Harmke

* change secret name

* wrong secret name

* remove separate unzip

* change clients, create silo data assets

* different names for silo data assets, duh

* cleanup

* adjust secret name in doc

* .

* use latest literal code

* align environment with literal

* base on latest component

* one dataset, comment out 2 unused args (for now)

* introduce new arguments

* reflect modified args in component spec

* remove unused arg from config

* start hooking up to Harmke's trainer

* initialize PTLearner and include in run.py

* use same values as Harmke for epochs and lr

* attributes with _, start implementing local_train

* add loggings, add test(), fix device_

* train_loader_

* align _'s

* fix transform bug

* remove unused constants

* use proper model in aggregation code

* removed unused file

* remove unused code and arguments, logging to DEBUG

* restore `metrics_prefix` parameter

* finish restoring `metrics_prefix`

* do not duplicate model code

* revert dedup attempt

* improve docstrings and descriptions

* change experiment name

* change pipeline name and docstring

* cite sources, remove wrongly added licenses

* italics

* black

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: thomasp-ms <XXX@me.com>
Co-authored-by: unknown <Mitgarg17495@gmail.com>

* update formatting

* add readme section

* rename training to traininsilo for consistency

* add more comments and update docs

* include urgency in PR template (#184)

Co-authored-by: thomasp-ms <XXX@me.com>

* Share resources and standardize component names (#182)

* use shared agg component across all examples

* only keep a single {reqs/env}

* use more recent pip version

* standardize component spec name

* support dummy HELLOWORLD example in agg

* black

Co-authored-by: thomasp-ms <XXX@me.com>

* Thopo/share component and environment (#185)

* use shared agg component across all examples

* only keep a single {reqs/env}

* use more recent pip version

* standardize component spec name

* support dummy HELLOWORLD example in agg

* black

* SHARED -> utils, rename agg env

Co-authored-by: thomasp-ms <XXX@me.com>

* rename config to spec and add upload data step

* upload data script

* use util aggregateweights

* add data splitting pipeline

* docs update

* log pipeline level only once per silo training

* do categorical encoding ahead of splitting

* nit updates

* update comment

* update formatting

* Hotfix: grant `az login` permissions to the 'clear run history' script  (#166)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* simplify the job wait condition's code

* add comments

* trigger mnist pipeline check

* test token validity

* grant `az login` permissions to the clear-history script

* revert to sleep wait code

* test access token validity

* nit

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* fix readme

* aggregate weights on whichever device is available
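A minimal sketch of aggregating on whichever device is available (illustrative only; the real aggregation component may handle devices differently):

```python
import torch

# Pick GPU when present, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_on_device(checkpoint_path):
    # map_location moves the silo checkpoint onto the chosen device before averaging.
    return torch.load(checkpoint_path, map_location=device)
```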

* update docstrings

* update formatting

* reduce upload pipeline file

* fix datastore

* add info about data upload step

* fix typo

* steps for changing access policies

* update docs

* Named Entity Recognition example (#177)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add multinerd template files

* NER components

* re-structure

* partition data + log metrics

* add readme

* add readme

* restructuring

* restructuring

* add doc strings

* train on gpus

* create a separate component to upload data on silos

* docs

* rename

* add assert statement

* change upload-data job compute to orchestrator compute

* remove ner from literal example choices

* fix doc

* add model-name, tokenizer configurable
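Making the model name and tokenizer configurable presumably means loading them from a CLI argument; a minimal sketch with Hugging Face `transformers`, where the model name and label count are hypothetical placeholders:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "bert-base-cased"  # would come from a --model_name CLI argument
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)
```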

* pip version upgrade

* reformatting

* use shared aggregated component

* rename script file

* add note

* create a compute that has access to silos' storage accs

* change data uploading approach

* update doc

* incorporate Thomas's feedback

* fix typo

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Create nice-looking homepage for the examples in readme+docs (#190)

* add homepage for industry examples

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Align the medical imaging data provisioning process with other examples (#191)

* adjust paths in config file

* support component with 1 output for Pneumonia

* formatting

* adjust doc to new provisioning

* remove GH action for dataprep

* custom component for provisioning pneumonia data

* black

Co-authored-by: thomasp-ms <XXX@me.com>

* hot fix (#192)

Co-authored-by: thomasp-ms <XXX@me.com>

* Lots of micro-fixes after bug bashing all 3 industry examples (#194)

related to components:
* create distinct names for all components of each scenario
* polish component descriptions
* remove unused mnist datatransfer and postprocessing components
* upgrade all MCR images to a more recent OS
* cut some unnecessary dependencies
* use curated environments whenever possible (to speed up job build time)

related to pipelines:
* fix issues with ccfraud submit script (path to shared folder)
* remove unnecessary json+azure imports in submit scripts
* align all 3 submissions scripts
* in upload data pipeline, make --example required without default value to force intentional decision
* in upload data pipeline, use scenario name in the output path to avoid collision
* give each submit pipeline a distinct experiment and run name for readability

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Standardize all 3 real world example tutorials (#193)

* standardize documentation on all 3 examples
* change titles
* fix spaces
* add pip instructions
* upgrade azure-ai-ml version

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* poc for ddp training

* remove debug code + allow logging from multiple nodes

* update formatting

* provide correct link to Kaggle dataset (#196)

* provide correct link

* .

* .

Co-authored-by: thomasp-ms <XXX@me.com>

* add DDP docs

* Add CI tests for industry-relevant examples  (#186)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add pneumonia and ner examples tests

* add ccfraud test in the CI/CD pipeline

* add data upload test

* trigger workflow

* CI testing1

* CI testing1

* test kv kaggle creds

* fix creds

* fix creds

* set kaggle creds

* test pneumonia data-upload

* test all industry relevant examples

* upload data test for 3 examples

* add main tests

* rmv redundant chrs

* fix typo

* avoid running industry-relevant examples tests on the vnet setup as it is already covered by the open setup

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* CLI commands to add credentials in the workspace keyvault  (#199)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add cli cmds to set a kv secret

* Jeff's feedback

* Implement Thomas's feedback

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Thomas/bug bash feedback 04 (#203)

* no need to navigate to a specific directory

* keyvault -> key vault

* improve Kaggle sections

* GPUs for NER example

* ARM templates with latest bicep version

* bold

* GPU instructions in quickstart

Co-authored-by: thomasp-ms <XXX@me.com>

* fix test to align with new sdk (#204)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Hotfix: DataAccessError (orchestrator access) (#205)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* fix bug

* update arm template

* fix a problem encountered while resolving merge conflicts

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add GitHub workflow concurrency

* test 1

* test 1

* test 1

* test 2

* test 3

* test 2

* test 3

* implement token expiry workaround

* test 1

* workaround to handle token expiry error

* fix typo

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Implement troubleshooting guide with first typical issues (#208)

* write troubleshooting guide

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Fix order of precedence for AML workspace references in submit.py (#209)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* fix order of precedence

* fix build

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
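A minimal sketch of one plausible order of precedence when resolving the workspace reference (explicit CLI args win over config file values); the names are illustrative and may not match submit.py exactly:

```python
def resolve_workspace(cli_args, config):
    """Prefer values passed on the command line, fall back to the config file."""
    aml = config.get("aml", {})
    return {
        "subscription_id": cli_args.subscription_id or aml.get("subscription_id"),
        "resource_group": cli_args.resource_group or aml.get("resource_group"),
        "workspace_name": cli_args.workspace_name or aml.get("workspace_name"),
    }
```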

* Add data permissions issue to TSG (#210)

* add permissions issue to TSG

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* November notes (#211)

Co-authored-by: thomasp-ms <XXX@me.com>

* create instance type and select it for the CC run

* upgrade all pip dependencies (#212)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* format

* use azureml built in distribution fw

* Test industry-relevant examples if any changes in the `utils` dir are observed  (#221)

* add test to validate changes in the utils dir

* test1 trigger workflow

* fix typo

* only destroy ddp group if it was created
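A minimal sketch of the guard described above, assuming `torch.distributed` is the distribution framework in play:

```python
import torch.distributed as dist

# Only tear down the process group if it was actually initialized (non-DDP runs skip this).
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()
```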

* remove unnecessary imports

* allow model aggregation across a mixture of ddp and non-ddp processes

* use documentation instead of ps1 script for creating instancetype for CC

* add instance type assignment for all examples

* formatting

* formatting

* update batch size

* update model name

* use older pytorch

* Generalize aggregate component to Babel (#220)

* init branch

* wip data exploration

* data exploration region/silo

* basic model

* regions

* basic network and finished data processing

* training

* Implement generic FedAvg without model object (#167)

* generic fedavg pytorch
* support model classes
* add docstrings

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* add README

* update normalization

* update exploration

* Thomas/small improvements (#171)

* remove unused local MNIST data

* add link to provisioning cookbook in docs readme

* recommend creating a conda env in the quickstart

Co-authored-by: thomasp-ms <XXX@me.com>

* update example for finance with multiple models

* successful training through lstm

* revert unneeded changes

* remove local exploration ipynb

* fix test metric

* different param value for AKS (#179)

Co-authored-by: thomasp-ms <XXX@me.com>

* Pneumonia xray example (#164)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up the three
silos, each having its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add more intuitive agg output dir path

* reformat using black

* add iteration2 branch for PR build testing

* reformat date and pass kwargs to the getUniqueIdentifier fn instead

* working submit

* working factory submit

* linting

* move component path

* add soft validation

* add soft validation

* Add basic tests on config

* linting

* working bicep deployment for vanilla demo

* proper orchestrator script, double containers

* fix name

* docstring

* docstring

* rollback to using only 1 container

* align naming convention

* instructions

* working submit

* set up permission model

* working orch perms

* wonky perms assignment

* working role assignments

* remove old perm model

* working except silo2orch

* fix typo

* working submit with config

* add sku as param

* use R/W for now

* fix submit to align with bicep provisioning demo

* linting

* remove dataset files

* fix docstring on permission model

* write draft docs with homepage, align structure, remove requirements, ensure demo documented

* rollback change to req

* change factory to use custom model type during validation

* linting

* Display metrics at the pipeline level (#68)

* Fix optional input yaml and mlflow log bugs (#59)

* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3 silos,
each with its own compute and datastore.
git config --global user.email "you@example.com"

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* Accommodate optional input changes and switch from mlflow autologging to manual logging

* code style

* change optional inputs syntax

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Make changes to display all metrics at the pipeline level

* Log preprocessing metadata in mlflow

* linting

* Pass client as an arg

* Fix typo, rmv name from silo config, metric naming convention, and add
metric identifier in the preprocessing component

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Remove redundant files from the mlops directory (#69)

* Remove internal & external dir as provisioning is taken care by bicep

* keep mnist data files

* rename demo script (#71)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Unified documentation (#72)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* simplify sandbox script

* simplify script, ensure it works

* align config of native submit

* align naming conventions between scripts, reinject rbac role

* create test job for quickly debugging provisioning issues

* fix tests

* linting

* move permissions to storage

* align config with bicep scripts

* Document the metrics panel of the pipeline overview in the quickstart (#76)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* linting

* add docstrings and disclaimers

* Add instructions on how to create a custom graph  (#78)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* add instructions on how to create a custom graph

* do better comments

* Refine native code (#82)

* fix silo name

* log only one datapoint per iteration for aggregated metrics

* Align terminology for iteration/round/num_rounds

* linting

* use storage blob data contributor

* add demoBaseName to guid name of role deployment (#85)

Co-authored-by: thomasp-ms <XXX@me.com>

* use id list, add listkeys builtin

* rename and dissociate orchestrator in resource + orchestrator

* separate orchestrator script

* draft sandbox setup

* make silo script distinct

* Update orchestrator_open.bicep

* Update internal_blob_open.bicep

* remove comments

* align hello world example with new naming conventions

* ensure uai assignments are created AFTER storage is created

* linting

* enforce precedence

* merge from secure branch

* use different regions, limit size of account

* reduce to 3 regions, add keys to guid

* substring

* align config

* do not u…
9 people committed Feb 22, 2023
1 parent cce3723 commit 72fc40e
Showing 36 changed files with 5,068 additions and 19 deletions.
9 changes: 9 additions & 0 deletions docs/README.md
@@ -17,6 +17,7 @@
- [Concepts](#concepts)
- [Why should you consider Federated Learning?](#why-should-you-consider-federated-learning)
- [How to plan for your Federated Learning project](#how-to-plan-for-your-federated-learning-project)
- [Vertical federated learning](#vertical-federated-learning)
- [Glossary](#glossary)
- [Tutorials](#tutorials)
- [What this repo has to offer?](#what-this-repo-has-to-offer)
@@ -94,6 +95,14 @@ Creating such a graph of jobs can be complex. This repository provides a recipe

We wrote a generic guide on how to get started, ramp-up and mature your [FL project](./concepts/plan-your-fl-project.md).

## Vertical federated learning

> :warning: EXPERIMENTAL :warning: We are delighted to share our solution for vertical federated learning with you; however, please keep in mind that it is still in active development.

Vertical federated learning is a branch of federated learning where the data are split across the features (vertically) instead of across the samples (horizontally). This introduces communication challenges, as the nodes running the code need to exchange the intermediate outputs, and the corresponding gradients, of aligned samples.

We provide examples showing how to run the **MNIST** and **CCFRAUD** examples using vertical federated learning. These are essentially copies of the original examples with the features scattered across the nodes. We invite you to learn more about this approach in the [vertical federated learning tutorial](./tutorials/vertical-fl.md).

## Glossary

The complete glossary list can be seen [**here**](./concepts/glossary.md).
Binary file added docs/pics/fldatatypes.png
Binary file added docs/pics/vfltrainingloop.png
113 changes: 113 additions & 0 deletions docs/tutorials/vertical-fl.md
@@ -0,0 +1,113 @@
# Cross-silo vertical federated learning

## Background
Vertical federated learning (VFL) is a branch of federated learning where the data are split across the features among the participants rather than across the samples (horizontal FL). In other words, it takes federated learning to another level: it allows for cross-organization collaboration without requiring the parties to share the same features, while keeping the privacy and security of each individual's data intact. Real-world examples include, but are not limited to:
- Finance: several institutions owning different pieces of data about their clients (e.g. bank account data, credit card data, loan data, etc.)
- Healthcare: different healthcare facilities may own different modalities (e.g. X-ray scans, prescriptions, patient health records, etc.)
- Retail: each retailer owns different information about a customer, and aggregating this information may result in better recommendations for that customer

<br/><br/>
<div align="center">
<img src="../pics/fldatatypes.png" alt="Homogenous vs heterogenous data" width="400">
</div>

> Note: In this tutorial we refer to the "host" as the party that owns the data labels (and optionally some of the features), and to "contributors" as the parties that own only features and provide the host with the intermediate outputs of their share of the network.

## Objective and contents
This tutorial will guide you through the steps required to set up VFL experiments and point out the important parts of the code. We target the MNIST (handwritten digit recognition) and [CCFRAUD (financial tabular data)](../real-world-examples/ccfraud.md) examples to showcase the versatility of the solution with respect to the type of data. All of the examples here make use of mean aggregation, and the assumption is that the host owns only the labels while the features are distributed equally among the contributors.

## Infrastructure
The first step towards successfully running a VFL example is to provision the infrastructure. To do so, please navigate to the [quickstart](../quickstart.md) and use the **single-button deployment for the vnet infrastructure**. This is necessary for the nodes to be able to communicate.

## Install the required dependencies

You'll need python to submit experiments to AzureML. You can install the required dependencies by running:

```bash
conda env create --file ./examples/pipelines/environment.yml
conda activate fl_experiment_conda_env
```

Alternatively, you can just install the required dependencies:

```bash
python -m pip install -r ./examples/pipelines/requirements.txt
```

## Data provisioning
The data format for VFL is different from regular FL. That is why each of our examples contains its own script for uploading the data needed for that example.

> Note: This will split the data such that each contributor owns its portion of the features and the host owns only the labels (a minimal illustration of this vertical split follows).
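
As a toy illustration of what this split looks like (a sketch only: the column names below, such as `sample_id`, are hypothetical and do not necessarily match the columns produced by the provisioning pipeline), the host ends up with the labels while each contributor keeps a disjoint subset of the feature columns for the same rows:

```python
import pandas as pd

# A small, made-up dataset standing in for the full example data.
df = pd.DataFrame(
    {
        "sample_id": [0, 1, 2],
        "amt": [12.3, 45.6, 7.8],
        "age": [34, 51, 27],
        "city_pop": [120000, 4500, 98000],
        "is_fraud": [0, 1, 0],
    }
)

host_part = df[["sample_id", "is_fraud"]]        # host keeps the labels only
contributor_1 = df[["sample_id", "amt", "age"]]  # first contributor's feature shard
contributor_2 = df[["sample_id", "city_pop"]]    # second contributor's feature shard
```
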
### CCFRAUD

Please follow the steps in [CCFRAUD - Add your Kaggle credentials to the workspace key vault](../real-world-examples/ccfraud.md#Add-your-Kaggle-credentials-to-the-workspace-key-vault). Afterwards, follow the same steps as for **MNIST**, and **please do not forget to replace `--example MNIST_VERTICAL` with `--example CCFRAUD_VERTICAL`**.

### MNIST

This can all be performed easily using a data provisioning pipeline. To run it, follow these steps:

1. If you are not using the quickstart setup, adjust the config file `config.yaml` in `examples/pipelines/utils/upload_data/` to match your setup.

2. Submit the experiment by running:

```bash
python ./examples/pipelines/utils/upload_data/submit.py --example MNIST_VERTICAL --workspace_name "<workspace-name>" --resource_group "<resource-group-name>" --subscription_id "<subscription-id>"
```

> Note: You can use the `--offline` flag when running the job to just build and validate the pipeline without submitting it.

:star: You can simplify this command by entering your workspace details in the file `config.yaml` in this same directory.

:warning: Proceed to the next step only once the pipeline completes. This pipeline will create data in 3 distinct locations.

## Model preparation for VFL
How the model should be orchestrated in VFL is an ongoing research topic. We have decided to go with the most common approach, splitting the model between the host and the contributors, also referred to as **split learning**. This approach can easily be altered by moving layers between the parties, up to hosting the whole model on the contributors while the host provides only the aggregation and/or activation function. We believe this best demonstrates the capabilities of VFL on AzureML, and most existing models can be split this way without requiring too much work. A minimal sketch of such a split is shown below.
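
The following is a minimal, self-contained sketch (PyTorch assumed) of what such a split could look like. The class names, layer sizes, and the mean aggregation are illustrative assumptions for this tutorial and do not mirror the actual model definitions shipped in this repository.

```python
# Hypothetical split-learning layout: contributors own "bottom" models over their
# feature shards, the host owns a "top" model that aggregates the embeddings.
import torch
import torch.nn as nn


class ContributorBottomModel(nn.Module):
    """Encodes the feature subset owned by one contributor into an embedding."""

    def __init__(self, n_features: int, embedding_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32),
            nn.ReLU(),
            nn.Linear(32, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class HostTopModel(nn.Module):
    """Aggregates contributor embeddings (mean) and produces the prediction."""

    def __init__(self, embedding_dim: int = 16, n_classes: int = 10):
        super().__init__()
        self.head = nn.Linear(embedding_dim, n_classes)

    def forward(self, embeddings):
        # Mean aggregation across parties, then the host's own layers.
        aggregated = torch.stack(embeddings, dim=0).mean(dim=0)
        return self.head(aggregated)
```

In this layout, moving layers from `HostTopModel` into the contributors' bottom models (or the other way around) changes how much of the network each party computes without changing the overall training protocol.
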

## Training

### Overview
Now, before we run the training itself, let's take a step back and look at how training works in a VFL setup, which is roughly depicted in the figure below. The first step, which needs to take place ahead of the training, is:

- **Private entity intersection and alignment** - before the training takes place, we need to make sure that all of the parties involved share the same sample space and that these samples are aligned during the training. **Our samples provide these guarantees by design, but please make sure this holds for your custom data. This can be achieved, for example, by adding a preprocessing step before training, as we do not provide any form of PSI as of now.**

Afterwards, we can continue with the regular training loop (a condensed code sketch follows the figure below):
- **Forward pass in contributors** - all contributors, and optionally the host, perform a forward pass on their part of the model using the features they own
- **Intermediate outputs transfer** - all outputs from the previous step are sent to the host, which performs an aggregation (for simplicity's sake we make use of the mean operation)
- **Loss computation** - the host either performs a forward pass on its part of the network or simply passes the aggregated outputs of the previous step through an activation function, followed by the loss computation
- **Gradients computation** - if the host owns some part of the network, it performs a backward pass; in all cases it then computes gradients w.r.t. its inputs
- **Gradient transfer** - all contributors, and optionally the host, receive the gradients w.r.t. their intermediate outputs
- **Backward pass** - the gradients are used to perform a backward pass and update the network weights

<br/><br/>
<div align="center">
<img src="../pics/vfltrainingloop.png" alt="Vertical federated learning training loop" width="400">
</div>
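
Below is a condensed, hypothetical sketch of a single training step in this split-learning setup, using the model classes sketched earlier. For readability everything runs in one process; in the actual pipeline the embeddings and gradients cross node boundaries between the contributors and the host. The function and argument names (`vfl_training_step`, `bottom_models`, `top_model`, `feature_shards`, `optimizers`) are assumptions for illustration, not the repository's API.

```python
import torch.nn.functional as F


def vfl_training_step(bottom_models, top_model, feature_shards, labels, optimizers):
    # `optimizers` is assumed to hold one optimizer per contributor plus one for the host.

    # 1. Forward pass in contributors: each party encodes the features it owns.
    embeddings = [model(x) for model, x in zip(bottom_models, feature_shards)]

    # 2. Intermediate outputs transfer: the detached embeddings are what each
    #    contributor would send to the host.
    received = [e.detach().requires_grad_(True) for e in embeddings]

    # 3. Loss computation on the host (mean aggregation + head + loss).
    logits = top_model(received)
    loss = F.cross_entropy(logits, labels)

    # 4. Gradients computation: the host backpropagates through its own layers,
    #    which also fills in the gradients w.r.t. the received embeddings.
    for opt in optimizers:
        opt.zero_grad()
    loss.backward()

    # 5. Gradient transfer + 6. backward pass: each contributor receives the
    #    gradient of its intermediate output and finishes backprop locally.
    for emb, rec in zip(embeddings, received):
        emb.backward(rec.grad)
    for opt in optimizers:
        opt.step()

    return loss.item()
```

The explicit `detach().requires_grad_(True)` marks the boundary where tensors would leave one party and arrive at another: the host backpropagates only up to the received embeddings, and each contributor continues the backward pass locally from the gradient it gets back.
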

### Steps to launch
1. If you are not using the quickstart setup, adjust the config file `config.yaml` in `examples/pipelines/<example-name>/` to match your setup.

2. Submit the experiment by running:

```bash
python ./examples/pipelines/<example-name>/submit.py --config examples/pipelines/<example-name>/config.yaml --workspace_name "<workspace-name>" --resource_group "<resource-group-name>" --subscription_id "<subscription-id>"
```

> Note: You can use the `--offline` flag when running the job to just build and validate the pipeline without submitting it.

:star: You can simplify this command by entering your workspace details in the file `config.yaml` in this same directory.


## Tips and pitfalls
1. **Vertical federated learning comes at a cost**
There is significant overhead when launching vertical federated learning due to the heavy communication among participants. As we can see in the training loop, there are two transfers per mini-batch: one for the forward-pass outputs and one for the gradients. This means that the training may take longer than expected.
2. **Intersection and entity alignment**
The samples need to be aligned across participants ahead of the training, after computing the set intersection of the samples present at all involved parties. This process can reveal information to other entities that we may want to keep private. Fortunately, **private set intersection** methods are available that come to the rescue (a naive, non-private alignment sketch follows this list).
3. **Communication encryption**
Even though the intermediate outputs and gradients are not raw data, they have still been inferred from private data. Therefore, it's good to use encryption when communicating them to parties outside of Azure.
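
As a concrete (and deliberately naive) illustration of the alignment step, the sketch below intersects sample IDs across parties with plain pandas; the `sample_id` column name and the `align_samples` helper are assumptions for illustration. A real deployment should replace the raw ID exchange with a PSI protocol, as discussed above.

```python
from functools import reduce

import pandas as pd


def align_samples(party_frames, id_column="sample_id"):
    """Keep only the rows whose ID exists at every party, sorted identically."""
    # Intersect the sample IDs owned by all parties (this leaks IDs; PSI avoids that).
    common_ids = reduce(lambda a, b: a & b, (set(df[id_column]) for df in party_frames))
    # Filter and sort each party's data the same way, so row i refers to the
    # same entity everywhere.
    return [
        df[df[id_column].isin(common_ids)].sort_values(id_column).reset_index(drop=True)
        for df in party_frames
    ]
```
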

## Additional resources
- [Private set intersection algorithm overview](https://xianmu.github.io/posts/2018-11-03-private-set-intersection-based-on-rsa-blind-signature.html)



3 changes: 3 additions & 0 deletions examples/components/CCFRAUD/preprocessing/run.py
@@ -116,6 +116,9 @@ def preprocess_data(
logger.debug(f"Train data samples: {len(train_data)}")
logger.debug(f"Test data samples: {len(test_data)}")

os.makedirs(train_data_dir, exist_ok=True)
os.makedirs(test_data_dir, exist_ok=True)

train_data = train_data.sort_values(by="trans_date_trans_time")
test_data = test_data.sort_values(by="trans_date_trans_time")

9 changes: 2 additions & 7 deletions examples/components/CCFRAUD/upload_data/run.py
@@ -17,7 +17,7 @@
3: [["South"], ["Midwest"], ["West", "Northeast"]],
4: [["South"], ["West"], ["Midwest"], ["Northeast"]],
}
CATEGORICAL_PROPS = ["category", "region", "gender", "state", "job"]
CATEGORICAL_PROPS = ["category", "region", "gender", "state"]
ENCODERS = {}


@@ -67,23 +67,18 @@ def preprocess_data(df):
useful_props = [
"amt",
"age",
# "cc_num",
"merch_lat",
"merch_long",
"category",
"region",
"gender",
"state",
# "zip",
"lat",
"long",
"city_pop",
"job",
# "dob",
"trans_date_trans_time",
"is_fraud",
]
categorical = ["category", "region", "gender", "state", "job"]

df.loc[:, "age"] = (pd.Timestamp.now() - pd.to_datetime(df["dob"])) // pd.Timedelta(
"1y"
@@ -92,7 +87,7 @@
# Filter only useful columns
df = df[useful_props]

for column in categorical:
for column in CATEGORICAL_PROPS:
encoder = ENCODERS.get(column)
encoded_data = encoder.transform(df[column].values.reshape(-1, 1)).toarray()
encoded_df = pd.DataFrame(
200 changes: 200 additions & 0 deletions examples/components/CCFRAUD_VERTICAL/preprocessing/run.py
@@ -0,0 +1,200 @@
import os
import argparse
import logging
import sys
import numpy as np

from sklearn.preprocessing import StandardScaler
import pandas as pd
import mlflow

SCALERS = {}


def get_arg_parser(parser=None):
"""Parse the command line arguments for merge using argparse.
Args:
parser (argparse.ArgumentParser or CompliantArgumentParser):
an argument parser instance
Returns:
ArgumentParser: the argument parser instance
Notes:
if parser is None, creates a new parser instance
"""
# add arguments that are specific to the component
if parser is None:
parser = argparse.ArgumentParser(description=__doc__)

parser.add_argument("--raw_training_data", type=str, required=True, help="")
parser.add_argument("--raw_testing_data", type=str, required=True, help="")
parser.add_argument("--train_output", type=str, required=True, help="")
parser.add_argument("--test_output", type=str, required=True, help="")
parser.add_argument(
"--metrics_prefix", type=str, required=False, help="Metrics prefix"
)
return parser


def apply_transforms(df):
"""Applies transformation for datetime and numerical columns
Args:
df (pd.DataFrame):
dataframe to transform
Returns:
pd.DataFrame: transformed dataframe
"""
global SCALERS

datetimes = ["trans_date_trans_time"] # "dob"
normalize = [
"age",
"merch_lat",
"merch_long",
"lat",
"long",
"city_pop",
"trans_date_trans_time",
"amt",
]

for column in datetimes:
if column not in df.columns:
continue
df.loc[:, column] = pd.to_datetime(df[column]).view("int64")
for column in normalize:
if column not in df.columns:
continue

if column not in SCALERS:
print(f"Creating encoder for column: {column}")
# Fit a standard scaler for this column the first time it is seen
scaler = StandardScaler()
scaler.fit(df[column].values.reshape(-1, 1))
SCALERS[column] = scaler

scaler = SCALERS.get(column)
df.loc[:, column] = scaler.transform(df[column].values.reshape(-1, 1))

return df


def preprocess_data(
raw_training_data,
raw_testing_data,
train_data_dir="./",
test_data_dir="./",
metrics_prefix="default-prefix",
):
"""Preprocess the raw_training_data and raw_testing_data and save the processed data to train_data_dir and test_data_dir.
Args:
raw_training_data: Training data directory that need to be processed
raw_testing_data: Testing data directory that need to be processed
train_data_dir: Train data directory where processed train data will be saved
test_data_dir: Test data directory where processed test data will be saved
Returns:
None
"""

logger.info(
f"Raw Training Data path: {raw_training_data}, Raw Testing Data path: {raw_testing_data}, Processed Training Data dir path: {train_data_dir}, Processed Testing Data dir path: {test_data_dir}"
)

logger.debug(f"Loading data...")
train_df = pd.read_csv(raw_training_data + f"/train.csv", index_col=0)
test_df = pd.read_csv(raw_testing_data + f"/test.csv", index_col=0)

if "is_fraud" in train_df.columns:
fraud_weight = (
train_df["is_fraud"].value_counts()[0]
/ train_df["is_fraud"].value_counts()[1]
)
logger.debug(f"Fraud weight: {fraud_weight}")
np.savetxt(train_data_dir + "/fraud_weight.txt", np.array([fraud_weight]))

logger.debug(f"Applying transformations...")
train_data = apply_transforms(train_df)
test_data = apply_transforms(test_df)

logger.debug(f"Train data samples: {len(train_data)}")
logger.debug(f"Test data samples: {len(test_data)}")
logger.info(f"Saving processed data to {train_data_dir} and {test_data_dir}")

os.makedirs(train_data_dir, exist_ok=True)
os.makedirs(test_data_dir, exist_ok=True)

train_data.to_csv(train_data_dir + "/data.csv")
test_data.to_csv(test_data_dir + "/data.csv")

# Mlflow logging
log_metadata(train_data, test_data, metrics_prefix)


def log_metadata(train_df, test_df, metrics_prefix):
with mlflow.start_run() as mlflow_run:
# get Mlflow client
mlflow_client = mlflow.tracking.client.MlflowClient()
root_run_id = mlflow_run.data.tags.get("mlflow.rootRunId")
logger.debug(f"Root runId: {root_run_id}")
if root_run_id:
mlflow_client.log_metric(
run_id=root_run_id,
key=f"{metrics_prefix}/Number of train datapoints",
value=f"{train_df.shape[0]}",
)

mlflow_client.log_metric(
run_id=root_run_id,
key=f"{metrics_prefix}/Number of test datapoints",
value=f"{test_df.shape[0]}",
)


def main(cli_args=None):
"""Component main function.
It parses arguments and executes run() with the right arguments.
Args:
cli_args (List[str], optional): list of args to feed script, useful for debugging. Defaults to None.
"""
# build an arg parser
parser = get_arg_parser()
# run the parser on cli args
args = parser.parse_args(cli_args)
logger.info(f"Running script with arguments: {args}")

def run():
"""Run script with arguments (the core of the component).
Args:
args (argparse.namespace): command line arguments provided to script
"""

preprocess_data(
args.raw_training_data,
args.raw_testing_data,
args.train_output,
args.test_output,
args.metrics_prefix,
)

run()


if __name__ == "__main__":
# Set logging to sys.out
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
log_format = logging.Formatter("[%(asctime)s] [%(levelname)s] - %(message)s")
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
handler.setFormatter(log_format)
logger.addHandler(handler)

main()

0 comments on commit 72fc40e
