Implement generic FedAvg without model object #167

jfomhover · 2022-11-09T19:20:40Z

Purpose

This PR recodes the aggregateweights example from MNIST to aggregate any state dict without having to know the model itself.

This makes the fedavg implement generic enough to be used for potentially any pytorch model in the other worksloads we're developing.

It also implements the averaging model by model, requiring less memory.

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

What to Check

run this in an FL experiment to see if there's no impact on model quality (sanity check)

examples/components/MNIST/aggregatemodelweights/run.py

thomasp-ms

2 comments: 1 minor about a variable name, and one not-so-minor about the environment to use for the component.

Then 2 meta-comments:

it would be nice to show that the model obtained with this new component is exactly the same as the model using the previous one.
I volunteer to put this new component in a shared folder, and to adapt all existing examples to consume it (in a subsequent PR)

examples/components/MNIST/aggregatemodelweights/aggregatemodelweights.yaml

examples/components/MNIST/aggregatemodelweights/run.py

…198) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: David Majercak <damajercak@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: Dávid Majerčák <david1majercak@gmail.com>

* init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882. * Revert "use on-prem silos in example factory job" This reverts commit e2ef884. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f3. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * generalize aggregate * a bit more detail on help * reformat to black * feedback addressed Co-authored-by: David Majercak <damajercak@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> Co-authored-by: Dávid Majerčák <david1majercak@gmail.com>

* Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * modifying test job * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * change experiment name * check in k8s config file * print all files and folders in the volume * another attempt * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * . * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882. * Revert "use on-prem silos in example factory job" This reverts commit e2ef884. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f3. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * read in proper location * move to new directories * restore HELLOWORLD component * black * restore literal job * really restore literal job * eof * . * doc stub * introduce new example in the readme * inconsistencies when using different base branch * . * reference k8s tutorial in ext-silo doc * yml templates and description * provisioning sections * . * add note, fix typo * remove duplicated file * cleaned up component * black * pipeline * clean up config * fix bug of non-existing output directory * second attempt at bug fix * instructions on how to run test job * clean up instructions Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com>

* generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * provide script to create cpu+gpu computes together * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * cpu gpu computes having the same uai * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * add option to create gpu computes * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * Standarize compute names and allow upto 2 computes to be created for orchestrator and silos * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * generalize compute and datastore names * align config files to provisioning scripts * change compute sku in ci/cd * temp. changes to test cpu gpu computes * vnet compute 1 name change * temp. changes to test cpu gpu computes * change compute regions * test example pipelines * test example pipelines * test example pipelines * test example pipelines * implement compute2 settings for the vnet setup * implement compute2 settings for the vnet setup * test vnet setup * change computesku * test industry relevant examples with the vnet setup * update vnet compute with existing storage * revert back github workflow changes * final testing * revert temp. workflow changes * test kaggle kv * test open setup * update arm templates Co-authored-by: David Majercak <damajercak@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: Dávid Majerčák <david1majercak@gmail.com>

* basic tabular data analysis component * improve plots and docs in data analytics * format update * format update * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * add README * update normalization * update exploration * update example for finance with multiple models * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * update formatting * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882. * Revert "use on-prem silos in example factory job" This reverts commit e2ef884. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f3. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * add readme section * rename training to traininsilo for consistency * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * upload data script * add data splitting pipeline * nit updates * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * basic tabular data analysis component * improve plots and docs in data analytics * format update * format update * mlflow pipeline level * revert unwanted rebase changes --------- Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com>

…, differential privacy, etc (#256) * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * poc for ddp training * remove debug code + allow logging from multiple nodes * update formatting * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * add DDP docs * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * create instance type and select it for run for cc * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * format * use azureml built in distribution fw * Test industry-relevant examples if any changes in the `utils` dir are observed (#221) * add test to validate changes in the utils dir * test1 trigger workflow * fix typo * only destroy ddp group if it was created * remove unnecessary imports * allow for mixture of ddp and non-ddp processes model aggregation * use documentation instead of ps1 script for creating instancetype for CC * add instance type assignment for all examples * formatting * formatting * update batch size * update model name * use older pytorch * Generalize aggregate component to Babel (#220) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob dat…

* maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * poc for ddp training * remove debug code + allow logging from multiple nodes * update formatting * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * add DDP docs * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * create instance type and select it for run for cc * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * format * use azureml built in distribution fw * Test industry-relevant examples if any changes in the `utils` dir are observed (#221) * add test to validate changes in the utils dir * test1 trigger workflow * fix typo * only destroy ddp group if it was created * remove unnecessary imports * allow for mixture of ddp and non-ddp processes model aggregation * use documentation instead of ps1 script for creating instancetype for CC * add instance type assignment for all examples * formatting * formatting * update batch size * update model name * use older pytorch * Generalize aggregate component to Babel (#220) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions…

* rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * poc for ddp training * remove debug code + allow logging from multiple nodes * update formatting * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * add DDP docs * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * create instance type and select it for run for cc * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * format * use azureml built in distribution fw * Test industry-relevant examples if any changes in the `utils` dir are observed (#221) * add test to validate changes in the utils dir * test1 trigger workflow * fix typo * only destroy ddp group if it was created * remove unnecessary imports * allow for mixture of ddp and non-ddp processes model aggregation * use documentation instead of ps1 script for creating instancetype for CC * add instance type assignment for all examples * formatting * formatting * update batch size * update model name * use older pytorch * Generalize aggregate component to Babel (#220) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not u…

* add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * poc for ddp training * remove debug code + allow logging from multiple nodes * update formatting * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * add DDP docs * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * create instance type and select it for run for cc * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * format * use azureml built in distribution fw * Test industry-relevant examples if any changes in the `utils` dir are observed (#221) * add test to validate changes in the utils dir * test1 trigger workflow * fix typo * only destroy ddp group if it was created * remove unnecessary imports * allow for mixture of ddp and non-ddp processes model aggregation * use documentation instead of ps1 script for creating instancetype for CC * add instance type assignment for all examples * formatting * formatting * update batch size * update model name * use older pytorch * Generalize aggregate component to Babel (#220) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -…

…chmark results... (#282) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Update release branch (#271) * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * poc for ddp training * remove debug code + allow logging from multiple nodes * update formatting * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * add DDP docs * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * create instance type and select it for run for cc * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * format * use azureml built in distribution fw * Test industry-relevant examples if any changes in the `utils` dir are observed (#221) * add test to validate changes in the utils dir * test1 trigger workflow * fix typo * only destroy ddp group if it was created * remove unnecessary imports * allow for mixture of ddp and non-ddp processes model aggregation * use documentation instead of ps1 script for creating instancetype for CC * add instance type assignment for all examples * formatting * formatting * update batch size * update model name * use older pytorch * Generalize aggregate component to Babel (#220) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-aut…

* Update release branch (#271) * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure branch * use different regions, limit size of account * reduce to 3 regions, add keys to guid * substring * align config * do not use model * Add msi version of scripts * sandbox main can switch between uai and msi * fix name * linting * linting * implement ignore param, hotfix model with startswith * Address my own comments on Jeff's PR (#96) * remove magic number * little improvements on some comments * remove unused files * put dash replacement next to length check * don't necessarily assume USER AI * UAI -> XAI * revert previous UAI -> XAI changes * move length check next to dash replacement * typo * try movind the dependsOn's * RAGRS -> LRS * revert dependsON changes * revert another small change in a comment Co-authored-by: thomasp-ms <XXX@me.com> * align config of both submit scripts * Make distinction between on-off and repeatable provisioning scripts (#99) * clarify the role needed * remove "custom role" line * adjust locations * use existing rg if not Owner of the sub * clarify "Secure" setup * add usage instructions in docstring * explain what scripts are one-off (vs repeatable) Co-authored-by: thomasp-ms <XXX@me.com> * Align round/iteration terminology with the native code (#103) * rename parameter in config file * keep iterations instead of rounds * round -> iteration Co-authored-by: thomasp-ms <XXX@me.com> * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * get all goodies from secureprovisioning branch wip * align both submits to work * add optional test * rename native to literal * add getting started in readme, introduce emojis * change person * remove emojs * Propose rewriting of readme to highlight motivation first (#110) * propose rewriting of readme to highlight motivation first * minor edit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update README.md * Update quickstart to mention rg clean-up * Update quickstart.md * Update quickstart.md * Update quickstart.md * Build bicep scripts as ARM template, add Azure Buttons to quickstart (#120) * Update quickstart to lower header (hotfix) (#117) * add arm templates, add button in quickstart * switch to releasebranchlink Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Add subscription id, resource group and workspace name as CLI args (#122) * add more cli args * code style * code style * update quickstart doc * update readme * Initiate provisioning "cookbook" with list of provisioning scenarios + example (#123) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Continuous Integration Tests (#119) * take values of subscription id, rs grp, ws name, etc from github secrets and submit a native pipeline * change path * Test azure creds in the github workflow * reformatting * add pipeline validation and testing workflow * add permissions * add permissions * check only certain dir to trigger workflows * add soft validation for any iteration branch PR * add provisioning script test * testing * create rg * create rg * change compute for testing * change demoname * delete old rg * change demoname * add demobasename and aml ws name as github secrets * random demo base name * auto generate random base name * random demo base name * adjust random num length * add vnet sandbox test * rmv dependency b/w jobs * submit various pipelines * change execution graph path * add cli args in the factory code * change compute for testing * ignore validation - factory * create custom action * correct path * correct path * add shell in the github action * create github actions and take required values as input params * add shell * add wait condition * add logs * linting * correct rg name * add azure ml extension * handle ml extension installation error. * add release branch test cases * add script to delete run history * cronjob test * cronjob test * checkout branch * test run history deletion script * test run history deletion script * test run history deletion script * azure login * date format change * remove double quotes * date format change * archive run history script tested * Add vnet-based provisioning options to cookbook (#128) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Make deployment name unique in our github actions (#135) * set unique name for deployments * add attempt to deployment name Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refactor compute/storage scripts to be independent (#132) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide motivation in provisioning docs for using service endpoints (#136) * add motivation for service endpoints * add link Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Refresh provisioning arm buttons with latest from bicep (#139) * align names of directories * rebuild all arm Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update silo_vnet_newstorage.md (#141) * Add Bicep build vs ARM template diff test (#140) * Add diff test for bicep vs arm * Debug * Debug * fix syntax error * redirect build output to stdout * coorect path * trigger arm template test when pushing changes to main branch from release* branch * remove redundant logs * Add "open aks with cc" provision tutorial and bicep scripts (#138) * implement bicep scripts to provision open aks with cc * add aks cc tutorial * build arm and add in branch * add button Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Provide script + tutorial to attach pair with an existing storage (#142) * provision datastore with existing storage * add arm for existing storage, add docs * add link in readme Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add latest arm templates to diff build (#145) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implements provisioning script for a confidential compute VM jumpbox inside a vnet (debug) (#146) * add jumpbox script with tutorial * add template to diff build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Update jumpbox_cc.md (#147) * update tutorials for silos to integrate feedback (#149) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Implement option to turn orchestrator storage fully private (behind PLE) (#150) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Tutorial on how to adapt native and factory code to write FL experiments. (#100) * WIP: add general information about the factory code * moving factory-tutorial to another file * add scenarios * add instructions on how to adapt literal code * rename file * add general info and fix typos * Jeff's feedback * Apply code clean-up to provision scripts before bug bash (#148) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Instructions for provisioning external silos (#101) * very first stab, far from done * non-secure native job using the on-prem k8s * use on-prem silos in example factory job * Revert "very first stab, far from done" This reverts commit e00d882dfee6a348eb89cd63e339a051b85ce0ca. * Revert "use on-prem silos in example factory job" This reverts commit e2ef8841c6be25a6c84b57ae079cca8f361323fe. * Revert "non-secure native job using the on-prem k8s" This reverts commit 923e5f321d28b30d8cd9759c47a7ffe5457e3284. * restore doc stub * Make Git ignore resources for test jobs * fix gitignore * typo in comment * steps A through D * 2 typos * move to subdir * fix workspace creation * add orchestrator part, role, and timeline * last commit before PR * adjust to new open_azureml_workspace.bicep * first wave after Jeff's comments * address jeff's comments * typo * light trims Co-authored-by: thomasp-ms <XXX@me.com> * bump up every title * skeleton * first attempt at data prep like Harmke * change secret name * wrong secret name * remove separate unzip * change clients, create silo data assets * different names for silo data assets, duh * cleanup * adjust secret name in doc * . * use latest literal code * align environment with literal * base on latest component * one dataset, comment out 2 unused args (for now) * introduce new arguments * reflect modified args in component spec * remove unused arg from config * start hooking up to Harmke's trainer * initialize PTLearner and include in run.py * use same values as Harmke for epochs and lr * attributes with _, start implementing local_train * add loggings, add test(), fix device_ * train_loader_ * align _'s * fix transform bug * remove unused constants * use proper model in aggregation code * removed unused file * remove unused code and arguments, logging to DEBUG * restore `metrics_prefix` parameter * finish restoring `metrics_prefix` * do not duplicate model code * revert dedup attempt * improve docstrings and descriptions * change experiment name * change pipeline name and docstring * cite sources, remove wrongly added licenses * italics * black Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: unknown <Mitgarg17495@gmail.com> * update formatting * add readme section * rename training to traininsilo for consistency * add more comments and update docs * include urgency in PR template (#184) Co-authored-by: thomasp-ms <XXX@me.com> * Share resources and standardize component names (#182) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black Co-authored-by: thomasp-ms <XXX@me.com> * Thopo/share component and environment (#185) * use shared agg component across all examples * only keep a single {reqs/env} * use more recent pip version * standardize component spec name * support dummy HELLOWORLD example is agg * black * SHARED -> utils, rename agg env Co-authored-by: thomasp-ms <XXX@me.com> * rename config to spec and add upload data step * upload data script * use util aggregateweights * add data splitting pipeline * docs update * log pipeline level only once per silo training * do categorical encoding ahead of splitting * nit updates * update comment * update formatting * Hotfix: grant `az login` permissions to the 'clear run history' script (#166) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * simplify the job wait condition's code * add comments * trigger mnist pipeline check * test token validity * grant `az login` permissions to the clear-history script * revert to sleep wait code * test access token validity * nit Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * fix readme * aggregate weughts on whichever device is available * update docstrings * update formatting * reduce upload pipeline file * fix datastore * add info about data upload step * fix typo * steps for changing access policies * update docs * Named Entity Recognition example (#177) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add multinerd template files * NER components * re-structure * partition data + log metrics * add redme * add readme * restructuring * restructuring * add doc strings * train on gpus * create a separate component to upload data on silos * docs * rename * add assert statement * change upload-data job compute to orchestrator compute * remove ner from literal example choices * fix doc * add model-name, tokenizer configurable * pip version upgrade * reformatting * use shared aggregated component * rename script file * add note * create a compute that has access to silos' storage accs * change data uploading approach * update doc * incorporate Thomas's feedback * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Create nice-looking homepage for the examples in readme+docs (#190) * add homepage for industry examples Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Align the medical imaging data provisioning process with other examples (#191) * adjust paths in config file * support component with 1 output for Pneumonia * formatting * adjust doc to new provisioning * remove GH action for dataprep * custom component for provisioning pneumonia data * black Co-authored-by: thomasp-ms <XXX@me.com> * hot fix (#192) Co-authored-by: thomasp-ms <XXX@me.com> * Lots of micro-fixes after bug bashing all 3 industry examples (#194) related to components: * create distinct names for all components of each scenario * polish component descriptions * remove unused mnist datatransfer and postprocessing components * upgrade all MCR images to a more recent OS * cut some unnecessary dependencies * use curated environments whenever possible (to speed up job build time) related to pipelines: * fix issues with ccfraud submit script (path to shared folder) * remove unnecessary json+azure imports in submit scripts * align all 3 submissions scripts * in upload data pipeline, make --example required without default value to force intentional decision * in upload data pipeline, use scenario name in the output path to avoid collision * give each submit pipeline a distinct experiment and run name for readability Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Standardize all 3 real world example tutorials (#193) * standardize documentation on all 3 examples * change titles * fix spaces * add pip instructions * upgrade azure-ai-ml version Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * poc for ddp training * remove debug code + allow logging from multiple nodes * update formatting * provide correct link to Kaggle dataset (#196) * provide correct link * . * . Co-authored-by: thomasp-ms <XXX@me.com> * add DDP docs * Add CI tests for industry-relevant examples (#186) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add pneumonia and ner examples tests * add ccfraud test in the CI/CD pipeline * add data upload test * trigger workflow * CI testing1 * CI testing1 * test kv kaggle creds * fix creds * fix creds * set kaggle creds * test pneumonia data-upload * test all industry relevant examples * upload data test for 3 examples * add main tests * rmv redundant chrs * fix typo * avoid industry relevant examples tests to run on the vnet setup as it is already covered by the open setup Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * CLI commands to add credentials in the workspace keyvault (#199) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add cli cmds to set a kv secret * Jeff's feedback * Implement Thomas's feedback Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Thomas/bug bash feedback 04 (#203) * no need to navigate to a specific directory * keyvault -> key vault * improve Kaggle sections * GPU's for NER example * ARM templates with latest bicep version * bold * GPU instructions in quickstart Co-authored-by: thomasp-ms <XXX@me.com> * fix test to align with new sdk (#204) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Hotfix: DataAccessError (orchestrator access) (#205) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix bug * update arm template * fix a problem that was encountered during resolving merge conflicts Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add GitHub Action workflow concurrency and implement token expiration policy workaround (#200) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add GitHub workflow concurrency * test 1 * test 1 * test 1 * test 2 * test 3 * test 2 * test 3 * implement token expiry workaround * test 1 * workaround to handle token expiry error * fix typo Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Implement troubleshooting guide with first typical issues (#208) * write troubleshooting guide Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Fix order of precedence for AML workspace references in submit.py (#209) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * fix order of precedence * fix build Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Add data permissions issue to TSG (#210) * add permissions issue to TSG Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * November notes (#211) Co-authored-by: thomasp-ms <XXX@me.com> * create instance type and select it for run for cc * upgrade all pip dependencies (#212) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * format * use azureml built in distribution fw * Test industry-relevant examples if any changes in the `utils` dir are observed (#221) * add test to validate changes in the utils dir * test1 trigger workflow * fix typo * only destroy ddp group if it was created * remove unnecessary imports * allow for mixture of ddp and non-ddp processes model aggregation * use documentation instead of ps1 script for creating instancetype for CC * add instance type assignment for all examples * formatting * formatting * update batch size * update model name * use older pytorch * Generalize aggregate component to Babel (#220) * init branch * wip data exploration * data exploration region/silo * basic model * regions * basic network and finished data processing * training * Implement generic FedAvg without model object (#167) * generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * add README * update normalization * update exploration * Thomas/small improvements (#171) * remove unused local MNIST data * add link to provisioning cookbook in docs readme * recommend creating a conda env in the quickstart Co-authored-by: thomasp-ms <XXX@me.com> * update example for finance with multiple models * successful training through lstm * revert unneeded changes * remove local exploration ipynb * fix test metric * different param value for AKS (#179) Co-authored-by: thomasp-ms <XXX@me.com> * Pneumonia xray example (#164) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * add more intuitive agg output dir path * reformat using black * add iteration2 branch for PR build testing * reformat date and pass kwargs instead in the getUniqueIdentifier fn * working submit * working factory submit * linting * move component path * add soft validation * add soft validation * Add basic tests on config * linting * working bicep deployment for vanilla demo * proper orchestrator script, double containers * fix name * docstring * docstring * rollback to using only 1 container * align naming convention * instructions * working submit * set up permission model * working orch perms * wonky perms assignment * working role assignments * remove old perm model * working except silo2orch * fix typo * working submit with config * add sku as param * use R/W for now * fix submit to align with bicep provisioning demo * linting * remove dataset files * fix docstring on permission model * write draft docs with homepage, align structure, remove requirements, ensure demo documented * rollback change to req * change factory to use custom model type during validation * linting * Display metrics at the pipeline level (#68) * Fix optional input yaml and mlflow log bugs (#59) * refactor components to use dpv2 + remove unnecessary environments * working dpv2 pipeline * refactor scripts with right inputs and outputs * fix code path * implement fake outputs * fix paths * fix imports * fix args of aggregation script * add note, fix component args * add chekcpoint arg * linting * linting * remove sdkv2 folder * add argparse to submit script * add docstring * add docstring * linting * linting * add staging branch to build * rollback changes to build, leave it for another PR * remove logging lien * remove custom uuid * linting * add docstring to custom path function * polish docstring * rename model_silo_X to input_silo_X * rename output * rename agg output * Improve auto-provisioning resources (#35) (#36) * docker file stub * move docker file, implement feedback * login before setting subscription * login before setting subscription * use default k8s version * pin latest version since default won't work * remove executionpolicy part, other small updates * clarify to change job file _in docker filesystem_ * login before setting subscription * formatting * \ -> / * install azureml-core in docker file * propagate changes to section 7 * fix dataset creation command Co-authored-by: thomasp-ms <XXX@me.com> Co-authored-by: thomasp-ms <XXX@me.com> * Refactor folder structure (#37) * `plan` -> `docs` * 'plan' -> 'docs' * 'automated_provisioning' -> 'mlops' * 'fl_arc_k8s' -> 'examples' Co-authored-by: thomasp-ms <XXX@me.com> * auto provisioning - vanilla internal silos (#41) * split internal and external provisioning * adjust directories after internal/external split * introduce overall mlops readme * first stab * remove useless comment and my alias Co-authored-by: thomasp-ms <XXX@me.com> * Perform real FL training on the MNIST dataset Added component files customized for MNIST dataset. Set the setup for 3 silo having their own compute and datastore. git config --global user.email "you@example.com" * refine components and add logs * maintain consistency b/w config files * add requirement and env files * add requirement and env files * rmv redundant dependencies, rename conda envs * Correct epoch default value * point data asset instead of underlying URI * beef up orchestrator cluster (#46) Co-authored-by: thomasp-ms <XXX@me.com> * Provision CPUs for silos (instead of GPUs) (#47) * beef up orchestrator cluster * gpu -> cpu Co-authored-by: thomasp-ms <XXX@me.com> * add preprocessing comp description, fix typo and correct default datastore name * add integration validation test - build * update readme file * Move logger to the maion if block, add pytorch channel in the conda env yaml and move readme to the docs folder * code reformatting using black * add documentation to run an FL experiment * add more intuitive path for aggr output dir * Merge changes * Accomodate optional input chnages and switch from mlflow autologging to manual logging * code style * change optional inputs syntax Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Make changes to display all metrics at the pipeline level * Log preprocessing metadata in mlflow * linting * Pass client as an arg * Fix typo, rmv name from silo config, metric naming convention, and add metric identifier in the preprocessing component Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> Co-authored-by: Jeff Omhover <jf.omhover@gmail.com> Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com> Co-authored-by: thomasp-ms <XXX@me.com> * Remove redundant files from the mlops directory (#69) * Remove internal & external dir as provisioning is taken care by bicep * keep mnist data files * rename demo script (#71) Co-authored-by: Jeff Omhover <jeomhove@microsoft.com> * Unified documentation (#72) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * simplify sandbox script * simplify script, ensure it works * align config of native submit * align naming conventions between scripts, reinject rbac role * create test job for quickly debugging provisioning issues * fix tests * linting * move permissions to storage * align config with bicep scrits * Document the metrics panel of the pipeline overview in the quickstart (#76) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * linting * add docstrings and disclaimers * Add instructions on how to create a custom graph (#78) * WIP: unifying docs * Remove redundant doc file. We can always revisit if needed * FL concepts will be covered in the glossary doc * Remove internal and external silos docs as the code will be re-written in bicep * provide comprehensive documentation * rename file * refine docs * refine docs and rename fl_cross_silo_basic to fl_cross_silo_native * document the metrics/pipeline panel in the quickstart * add instructions on how to create a custom graph * do better comments * Refine native code (#82) * fix silo name * log only one datapoint per iteration for an aggregated metrics * Align terminology for iteration/round/num_rounds * linting * use storage blob data contibutor * add demoBaseName to guid name of role deployment (#85) Co-authored-by: thomasp-ms <XXX@me.com> * use id list, add listkeys builtin * rename and dissociate orchestrator in resource + orchestrator * separate orchestrator script * draft sandbox setup * make silo script distinct * Update orchestrator_open.bicep * Update internal_blob_open.bicep * remove comments * align hello world example with new naming conventions * ensure uai assignments are created AFTER storage is created * linting * enforce precedence * merge from secure …

Jeff Omhover added 5 commits November 8, 2022 21:07

generic fedavg pytorch

e896fd8

support model classes

051e679

support model classes

944c23b

add docstrings

6ace13d

linting

6cfb2c9

garg-amit approved these changes Nov 9, 2022

View reviewed changes

examples/components/MNIST/aggregatemodelweights/run.py Outdated Show resolved Hide resolved

typo

42d2ffc

thomasp-ms approved these changes Nov 10, 2022

View reviewed changes

examples/components/MNIST/aggregatemodelweights/aggregatemodelweights.yaml Outdated Show resolved Hide resolved

examples/components/MNIST/aggregatemodelweights/run.py Outdated Show resolved Hide resolved

Jeff Omhover added 2 commits November 9, 2022 19:53

rename variable

f520906

use old env instead

ff79596

jfomhover merged commit a6a76c6 into release-iteration-04 Nov 11, 2022

jfomhover deleted the jfomhover/fevavg branch November 11, 2022 18:13

majercakdavid pushed a commit that referenced this pull request Jan 8, 2023

Implement generic FedAvg without model object (#167)

632cf41

* generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

majercakdavid pushed a commit that referenced this pull request Jan 8, 2023

Implement generic FedAvg without model object (#167)

367861e

* generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

majercakdavid pushed a commit that referenced this pull request Jan 9, 2023

Implement generic FedAvg without model object (#167)

5608606

* generic fedavg pytorch * support model classes * add docstrings Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement generic FedAvg without model object #167

Implement generic FedAvg without model object #167

jfomhover commented Nov 9, 2022 •

edited

Loading

thomasp-ms left a comment

Implement generic FedAvg without model object #167

Implement generic FedAvg without model object #167

Conversation

jfomhover commented Nov 9, 2022 • edited Loading

Purpose

Does this introduce a breaking change?

Pull Request Type

What to Check

thomasp-ms left a comment

Choose a reason for hiding this comment

jfomhover commented Nov 9, 2022 •

edited

Loading