Release sdkv2 iteration 02 (#114)
* refactor components to use dpv2 + remove unnecessary environments

* working dpv2 pipeline

* refactor scripts with right inputs and outputs

* fix code path

* implement fake outputs

* fix paths

* fix imports

* fix args of aggregation script

* add note, fix component args

* add checkpoint arg

* linting

* linting

* remove sdkv2 folder

* add argparse to submit script

* add docstring

* add docstring

* linting

* linting

* add staging branch to build

* rollback changes to build, leave it for another PR

* remove logging line

* remove custom uuid

* linting

* add docstring to custom path function

* polish docstring

* rename model_silo_X to input_silo_X

* rename output

* rename agg output

* Improve auto-provisioning resources (#35) (#36)

* docker file stub

* move docker file, implement feedback

* login before setting subscription

* login before setting subscription

* use default k8s version

* pin latest version since default won't work

* remove executionpolicy part, other small updates

* clarify to change job file _in docker filesystem_

* login before setting subscription

* formatting

* \ -> /

* install azureml-core in docker file

* propagate changes to section 7

* fix dataset creation command

Co-authored-by: thomasp-ms <XXX@me.com>

Co-authored-by: thomasp-ms <XXX@me.com>

* Refactor folder structure (#37)

* `plan` -> `docs`

* 'plan' -> 'docs'

* 'automated_provisioning' -> 'mlops'

* 'fl_arc_k8s' -> 'examples'

Co-authored-by: thomasp-ms <XXX@me.com>

* auto provisioning - vanilla internal silos (#41)

* split internal and external provisioning

* adjust directories after internal/external split

* introduce overall mlops readme

* first stab

* remove useless comment and my alias

Co-authored-by: thomasp-ms <XXX@me.com>

* Perform real FL training on the MNIST dataset

Added component files customized for the MNIST dataset. Set up 3 silos,
each having its own compute and datastore.

* refine components and add logs

* maintain consistency b/w config files

* add requirement and env files

* add requirement and env files

* rmv redundant dependencies, rename conda envs

* Correct epoch default value

* point data asset instead of underlying URI

* beef up orchestrator cluster (#46)

Co-authored-by: thomasp-ms <XXX@me.com>

* Provision CPUs for silos (instead of GPUs) (#47)

* beef up orchestrator cluster

* gpu -> cpu

Co-authored-by: thomasp-ms <XXX@me.com>

* add preprocessing comp description, fix typo and correct default datastore name

* add integration validation test - build

* update readme file

* Move logger to the main if block, add pytorch channel in the conda env
yaml and move readme to the docs folder

* code reformatting using black

* add documentation to run an FL experiment

* add more intuitive path for aggr output dir

* Merge changes

* add more intuitive agg output dir path

* reformat using black

* add iteration2 branch for PR build testing

* reformat date and pass kwargs instead in the getUniqueIdentifier fn

* working submit

* working factory submit

* linting

* move component path

* add soft validation

* add soft validation

* Add basic tests on config

* linting

* working bicep deployment for vanilla demo

* proper orchestrator script, double containers

* fix name

* docstring

* docstring

* rollback to using only 1 container

* align naming convention

* instructions

* working submit

* set up permission model

* working orch perms

* wonky perms assignment

* working role assignments

* remove old perm model

* working except silo2orch

* fix typo

* working submit with config

* add sku as param

* use R/W for now

* fix submit to align with bicep provisioning demo

* linting

* remove dataset files

* fix docstring on permission model

* write draft docs with homepage, align structure, remove requirements, ensure demo documented

* rollback change to req

* change factory to use custom model type during validation

* linting

* Display metrics at the pipeline level (#68)

* Fix optional input yaml and mlflow log bugs (#59)

* Accommodate optional input changes and switch from mlflow autologging to manual logging

* code style

* change optional inputs syntax

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Make changes to display all metrics at the pipeline level

* Log preprocessing metadata in mlflow

* linting

* Pass client as an arg

* Fix typo, rmv name from silo config, metric naming convention, and add
metric identifier in the preprocessing component

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>

* Remove redundant files from the mlops directory (#69)

* Remove internal & external dir as provisioning is taken care by bicep

* keep mnist data files

* copy files from template

* draft orchestrator

* rename demo script (#71)

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Unified documentation (#72)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* simplify sandbox script

* simplify script, ensure it works

* align config of native submit

* align naming conventions between scripts, reinject rbac role

* protected sandbox draft

* create test job for quickly debugging provisioning issues

* fix tests

* linting

* move permissions to storage

* align config with bicep scripts

* Document the metrics panel of the pipeline overview in the quickstart (#76)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* linting

* add docstrings and disclaimers

* Add instructions on how to create a custom graph  (#78)

* WIP: unifying docs

* Remove redundant doc file. We can always revisit if needed

* FL concepts will be covered in the glossary doc

* Remove internal and external silos docs as the code will be re-written in bicep

* provide comprehensive documentation

* rename file

* refine docs

* refine docs and rename fl_cross_silo_basic to fl_cross_silo_native

* document the metrics/pipeline panel in the quickstart

* add instructions on how to create a custom graph

* working deployment, wrong setup

* do better comments

* Refine native code (#82)

* fix silo name

* log only one datapoint per iteration for aggregated metrics

* Align terminology for iteration/round/num_rounds

* linting

* use storage blob data contributor

* add demoBaseName to guid name of role deployment (#85)

Co-authored-by: thomasp-ms <XXX@me.com>

* use id list, add listkeys builtin

* rename and dissociate orchestrator in resource + orchestrator

* separate orchestrator script

* draft sandbox setup

* make silo script distinct

* add role

* Update orchestrator_open.bicep

* Update internal_blob_open.bicep

* add datastore for orch, align config

* remove comments

* fix datastore name

* align hello world example with new naming conventions

* fix merge

* work in progress

* use mount

* ensure uai assignments are created AFTER storage is created

* linting

* enforce precedence

* merge from secure branch

* use different regions, limit size of account

* reduce to 3 regions, add keys to guid

* substring

* align config

* do not use model

* secure storage

* submittable vnet silo

* sandbox

* Add msi version of scripts

* sandbox main can switch between uai and msi

* align orch with new design

* align silo bicep

* finalize vnet main

* add vnet links

* remove

* specify dependson

* fix name

* linting

* linting

* implement ignore param, hotfix model with startswith

* Address my own comments on Jeff's PR (#96)

* remove magic number

* little improvements on some comments

* remove unused files

* put dash replacement next to length check

* don't necessarily assume USER AI

* UAI -> XAI

* revert previous UAI -> XAI changes

* move length check next to dash replacement

* typo

* try moving the dependsOn's

* RAGRS -> LRS

* revert dependsON changes

* revert another small change in a comment

Co-authored-by: thomasp-ms <XXX@me.com>

* align config of both submit scripts

* fix

* add vnet peering

* fix peering

* Make distinction between on-off and repeatable provisioning scripts (#99)

* clarify the role needed

* remove "custom role" line

* adjust locations

* use existing rg if not Owner of the sub

* clarify "Secure" setup

* add usage instructions in docstring

* explain what scripts are one-off (vs repeatable)

Co-authored-by: thomasp-ms <XXX@me.com>

* Align round/iteration terminology with the native code (#103)

* rename parameter in config file

* keep iterations instead of rounds

* round -> iteration

Co-authored-by: thomasp-ms <XXX@me.com>

* upgrade versions all around

* add distinct permission

* orch and silo as just a pair

* orch and silo as just a pair

* minor fixes

* minor fixes

* set name of datastore

* verify all storage settings

* add rules

* add service endpoint in vnet

* add note in vnet

* use old api

* fix name

* align open sandbox with vnet sandbox

* align config with bicep

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* get all goodies from secureprovisioning branch wip

* align both submits to work

* add optional test

* finalize

* add notice

* add note in quickstart

* Remove unnecessary scripts

* last curation

* rename native to literal

* add getting started in readme, introduce emojis

* change person

* remove emojis

* Propose rewriting of readme to highlight motivation first (#110)

* propose rewriting of readme to highlight motivation first

* minor edit

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>

* Update README.md

* Update quickstart to mention rg clean-up

* Update quickstart.md

* Update quickstart.md

* Update quickstart.md

* Add September release notes (#98)

* first draft

* add Amit's suggestions

* move release notes to CHANGELOG.md

* amit's comments + changes to provisioning bullet

* Jeff's comments

* relative link

Co-authored-by: thomasp-ms <XXX@me.com>

* Add comments for future self

Co-authored-by: Jeff Omhover <jeomhove@microsoft.com>
Co-authored-by: Jeff Omhover <jf.omhover@gmail.com>
Co-authored-by: Thomas <7998422+thomasp-ms@users.noreply.github.com>
Co-authored-by: thomasp-ms <XXX@me.com>
5 people committed Oct 12, 2022
1 parent 126f761 commit ae55069
Showing 76 changed files with 3,500 additions and 1,492 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/python.yaml
@@ -7,7 +7,7 @@ on:
   pull_request:
     branches:
       - main
-      - release-sdkv2-mvp
+      - release-sdkv2-iteration-02

 jobs:
   build:
@@ -26,7 +26,7 @@ jobs:
       if: github.base_ref

     - name: Intall python dependencies
-      run: pip install -r examples/pipelines/fl_cross_silo_basic/requirements.txt
+      run: pip install -r examples/pipelines/fl_cross_silo_literal/requirements.txt

-    - name: Validate fl_cross_silo_basic pipeline
-      run: python examples/pipelines/fl_cross_silo_basic/submit.py
+    - name: Validate fl_cross_silo_literal pipeline
+      run: python examples/pipelines/fl_cross_silo_literal/submit.py
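The validate step above runs the pipeline submission script end to end. For reference, here is a minimal sketch of what such a `submit.py` entry point can look like with the SDK v2 (`azure-ai-ml`); the `--offline` flag and the empty pipeline body are assumptions for illustration, not the repository's actual interface:

```python
"""Minimal sketch of a pipeline submission script (hypothetical interface)."""
import argparse

from azure.ai.ml import MLClient, dsl
from azure.identity import DefaultAzureCredential


@dsl.pipeline(description="Placeholder FL pipeline graph.")
def fl_pipeline():
    # The real script wires preprocessing/training/aggregation components here.
    pass


def main():
    parser = argparse.ArgumentParser(description="Submit the FL pipeline.")
    parser.add_argument("--subscription_id", required=True)
    parser.add_argument("--resource_group", required=True)
    parser.add_argument("--workspace_name", required=True)
    parser.add_argument(
        "--offline",
        action="store_true",
        help="build and validate the pipeline graph without submitting it",
    )
    args = parser.parse_args()

    pipeline_job = fl_pipeline()

    if not args.offline:
        ml_client = MLClient(
            DefaultAzureCredential(),
            args.subscription_id,
            args.resource_group,
            args.workspace_name,
        )
        ml_client.jobs.create_or_update(pipeline_job, experiment_name="fl_demo")


if __name__ == "__main__":
    main()
```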
35 changes: 26 additions & 9 deletions CHANGELOG.md
@@ -1,13 +1,30 @@
-## [project-title] Changelog
+# FL Accelerator Changelog

-<a name="x.y.z"></a>
-# x.y.z (yyyy-mm-dd)
+## September 2022 release

-*Features*
-* ...
+We are excited to announce the release of the September iteration of our [FL Accelerator repository](https://github.com/Azure-Samples/azure-ml-federated-learning).

-*Bug Fixes*
-* ...
+Here are the new features.

-*Breaking Changes*
-* ...
+### Repository structure
+- Cleaned up repository - removed redundant directories/files.
+
+### Provisioning
+- Auto-provisioning scripts to create a complete sandbox using bicep templates (by default an open sandbox on 3 regions). An experimental script with VNets and private endpoints is also available.
+- A quickstart to run an FL demo in minutes, and easily run FL jobs on your own data!
+
+### Documentation
+- [Single location](https://github.com/Azure-Samples/azure-ml-federated-learning/blob/release-sdkv2-iteration-02/README.md) for all documentation.
+
+### FL Experience
+- Easy navigation using Azure Storage Explorer to discover the models of various iterations, via a more intuitive path (\<experiment-name\>/\<iteration-num\>/\<model-name\>).
+- Introduction of "factory" code to simplify the creation of FL experiments.
+- Decoupled components for more readability.
+- Soft validation to check the correct permissions of your assets.
+- Metrics:
+  - The combined losses and model performances of several silos can now be seen at the pipeline level.
+
+To get started, go [here](./docs/quickstart.md)!
+
+If you find a bug or have a feature request, please open an issue on the [GitHub repository](https://github.com/Azure-Samples/azure-ml-federated-learning/issues).
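The "more intuitive path" called out under *FL Experience* refers to where model checkpoints land in the datastore. A trivial sketch of a path-building helper consistent with that layout (the function name and arguments are illustrative, not the repo's actual API):

```python
# Hypothetical helper producing the <experiment-name>/<iteration-num>/<model-name>
# layout mentioned in the changelog; the repo's actual path-building code may differ.
def custom_model_path(experiment_name: str, iteration_num: int, model_name: str) -> str:
    """Build the datastore-relative path under which a model checkpoint is stored."""
    return f"{experiment_name}/{iteration_num}/{model_name}"


print(custom_model_path("fl_demo", 3, "aggregated_model"))  # fl_demo/3/aggregated_model
```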
44 changes: 32 additions & 12 deletions README.md
@@ -1,19 +1,39 @@
 # Federated Learning in Azure ML

-> :warning: Need to update this README!
+Federated Learning (FL) is a framework where one trains a single ML model on distinct datasets that cannot be gathered in a single central location. This enables companies and institutions to comply with regulations related to data location and data access while allowing for innovation and personalization.

-**Disclaimer:** this repo is examples + recipes only, and none of this has SLAs on support or correctness.
+This repo provides some code samples for running a federated learning pipeline in the Azure Machine Learning platform.

-Although there is no specific commitment to a timeline yet, Azure ML is working on first-class support for cross-silo federated learning. Cross-silo federated learning allows data scientists to run pipelines against data in multiple isolated silos, with the platform guaranteeing that only approved and policy-compliant jobs transfer "safe" data across silo boundaries. Here a "silo" means an "isolated" collection of storage and compute. And "isolated" means that the platform guarantees:
-- only compute within the silo can "touch" storage within the silo;
-- only data of public or system metadata classification can be moved outside the silo;
-- only "approved" jobs can change the classification of data or move it outside the silo.
+:warning: Running a full federated learning pipeline raises **security questions that you need to address** before using this repository for production purposes. Please consider this repository as a sample only.

-Silos are expected to be reliable (i.e., no concerns around network connectivity or uptime). We also assume a hard cap of **≤ 100 silos**.
+## Table of contents

-Current contents:
-- `mlops`: a collection of resources to automatically provision the orchestrator and silos resources.
-- `examples`: simple example of using shrike Federated Learning API + Arc + Kubernetes + Azure ML to submit a Federated Learning experiment.
-- `docs`: generic plan for a company to onboard to Federated Learning through Azure ML.
+- [Getting started](#zap-getting-started)
+- [Documentation](#documentation)
+- [Support](#need-support)
+- [Contributing](#contributing)

-More details, recipes, and examples coming soon!
+### :zap: Getting Started
+
+No time to read? Get directly to the [**quickstart**](./docs/quickstart.md) to provision a demo within minutes in your own subscription.
+
+### Documentation
+
+Please find a full documentation of this project [**here**](docs/README.md).
+
+### Need Support?
+
+If you have any feature requests, technical questions, or find any bugs, please do not hesitate to reach out to us.
+
+For bug reports and feature requests, you are welcome to open an [**issue**](https://github.com/Azure-Samples/azure-ml-federated-learning/issues).
+
+### Contributing
+
+This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+
+To contribute, please start by creating a self-assigned [**issue**](https://github.com/Azure-Samples/azure-ml-federated-learning/issues/new) giving a high-level overview of what you'd like to do. Once any discussion there concludes, follow up with a PR.
+
+If you have difficulty creating a branch, please send an email to aims-team@microsoft.com to request "contributor" access to this repository. When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
+
+This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
52 changes: 52 additions & 0 deletions docs/README.md
@@ -0,0 +1,52 @@
# Federated Learning in Azure ML

:warning: Running a full federated learning pipeline raises **security questions that you need to address** before using this repository for production purposes. Please consider this repository as a sample only.


## Table of contents

- [Motivation](#motivation)
- [Getting Started](#getting-started)
- [Why Federated Learning?](#why-should-you-consider-federated-learning)
- [What this repo has to offer](#what-this-repo-has-to-offer)
- [Glossary](#glossary)

### Motivation

Local privacy regulations impose constraints on the movement of data out of a given region, or out of government agencies. Institutions or companies working together to leverage their respective data may also require, or wish, to limit the circulation of this data and impose trust boundaries.

In those contexts, the data cannot be gathered in a central location, as is usual practice for training Machine Learning (ML) models. A technique called Federated Learning (FL) allows for training models in this highly constrained environment. It enables companies and institutions to comply with regulations related to data location and data access while allowing for innovation and achieving better quality models.

### Getting Started

No time to read? Get directly to the [**quickstart**](./quickstart.md) to provision a demo within minutes in your own subscription.

A step-by-step guide for performing a Federated Learning experiment can be found [**here**](./guide.md).

### Why should you consider Federated Learning?

Let's take the example of a data scientist working in a hospital to classify medical images to detect a specific patient condition. The team at the hospital _already_ has a deep learning model trained in a centralized fashion with their own patient data, and it achieved reasonable performance. Now the hospital wants to further improve the model's performance by partnering with other hospitals. Federated Learning enables them to collaborate on the model training while each hospital keeps control of its own data and complies with its local regulations and privacy obligations, yielding better quality models for the benefit of their patients.

Federated Learning (FL) is a framework where one trains a single ML model on distinct datasets that cannot be gathered in a single central location. The basic idea of FL is to train a model by aggregating the results of N isolated training jobs, each running on a separate compute with restricted access to a given data storage.

The training is orchestrated between a central server (_a.k.a._ orchestrator) and multiple clients (_a.k.a._ silos or embassies). The actual model training happens locally inside the silos/clients on their respective data, without the data ever leaving their respective trust boundaries. Only the local models are sent back to the central server/orchestrator for aggregation.

When the computes and data are in the cloud, we say they live in silos, and cross-silo federated learning consists of orchestrating the training and aggregation jobs against the cloud provider. The following figure illustrates what a federated learning solution looks like.

<br/><br/>
<img src="./pics/fl_fig.png" alt="Federated Learning Solution Figure" width="300">

Creating such a graph of jobs can be complex. This repository provides a recipe to help.
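To make the aggregation step concrete, here is a minimal federated-averaging (FedAvg) sketch, assuming each silo reports its model parameters and sample count. This is illustrative only, not the aggregation component shipped in this repo:

```python
# Illustrative federated-averaging (FedAvg) step over per-silo model parameters,
# weighted by each silo's sample count; not this repo's actual aggregation code.
import numpy as np


def fedavg(silo_weights: list[list[np.ndarray]], silo_sizes: list[int]) -> list[np.ndarray]:
    """Weighted average of per-silo parameter lists, weighted by sample count."""
    total = sum(silo_sizes)
    return [
        sum(w[layer] * (n / total) for w, n in zip(silo_weights, silo_sizes))
        for layer in range(len(silo_weights[0]))
    ]


# Two silos, one weight tensor each; silo B carries 3x the samples of silo A:
silo_a = [np.array([1.0, 2.0])]
silo_b = [np.array([3.0, 4.0])]
print(fedavg([silo_a, silo_b], silo_sizes=[100, 300]))  # [array([2.5, 3.5])]
```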

### What this repo has to offer

This repo provides some code samples for running a federated learning pipeline in the Azure Machine Learning platform.

| Folder | Description |
| :--- | :--- |
| [examples](../examples) | Scripts and pipelines to run FL sample experiments |
| [mlops](../mlops) | Provisioning scripts |

### Glossary

The complete glossary list can be seen [**here**](./glossary.md).
31 changes: 0 additions & 31 deletions docs/experiment.md

This file was deleted.

65 changes: 65 additions & 0 deletions docs/glossary.md
@@ -0,0 +1,65 @@
# Glossary


__Data__
<br>

Any file or collection of files. Data is described in terms of classification; only three classifications are needed in the context of this document: "sensitive" (cannot be moved or even looked at), "intermediate" (can be moved around, but with looser restrictions on visibility), and "eyes-on" (can be moved freely and seen by everyone participating in the federated training).

__Storage__
<br>

Wherever data is stored. In this file, storage is assumed to live in Azure. It may exist in locked-down virtual networks.

__Compute__
<br>

Anything that can run "code" (deliberately vague). In this file, compute is assumed to live in Azure.

__Job__
<br>

The execution of code (a collection of files) in an environment (a Docker image) against data (from storage). A job can consume data from multiple storage instances and write back to multiple instances.
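For illustration, in Azure ML SDK v2 (which this release adopts), such a job can be declared as below; the code path, environment, datastore, and compute names are placeholders, not this repo's actual assets:

```python
# Illustrative AML SDK v2 command job matching the definition above;
# all names and paths are placeholders.
from azure.ai.ml import Input, Output, command

train_job = command(
    code="./src",  # the collection of files to execute
    command="python train.py --data ${{inputs.data}} --model ${{outputs.model}}",
    environment="azureml:fl-training-env:1",  # Docker-image-based environment
    inputs={"data": Input(path="azureml://datastores/silo1/paths/mnist")},
    outputs={"model": Output(type="uri_folder")},  # written back to storage
    compute="silo1-cpu-cluster",
)
```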

__Approval__
<br>

REST endpoint to which the platform "asks permission" before running any job. The platform sends the approval endpoint information that includes:

1. Input and output storage
2. Which compute the job wishes to run in
3. The author of the code the job is running
4. Whether or not the job has been code-signed by the configured policies

The approval endpoint can either approve or reject the job based on checked-in configuration (e.g., which storage accounts are associated with which silo), or pass this information on for manual approval.

:exclamation: Note that approval endpoints are not yet supported in 3P-facing AML.
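To make the contract concrete, here is a hedged sketch of the decision an approval endpoint could implement over the four pieces of information listed above; the payload field names and the silo-to-storage map are assumptions, since no public schema is defined here:

```python
# Illustrative approval decision; field names and the silo->storage map are assumed.
SILO_STORAGE = {
    "silo1-compute": {"silo1-storage"},
    "silo2-compute": {"silo2-storage"},
}


def approve(job: dict) -> bool:
    """Reject any job whose compute touches storage outside its own silo,
    or whose code is not signed per the configured policies."""
    allowed = SILO_STORAGE.get(job["compute"], set())
    touched = set(job["input_storage"]) | set(job["output_storage"])
    return job["code_signed"] and touched <= allowed


print(approve({
    "compute": "silo1-compute",
    "input_storage": ["silo1-storage"],
    "output_storage": ["silo1-storage"],
    "author": "data-scientist@contoso.com",
    "code_signed": True,
}))  # True
```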

__Silo__
<br>

Isolated collection of storage and compute. Here, "isolated" means that the platform guarantees:

- Only compute within the silo can "touch" storage within the silo.
- Only data of intermediate or eyes-on classification can be moved outside the silo.
- Only "approved" jobs can change the classification of data or move it outside the silo.

Silos are expected to be reliable (i.e., no concerns around network connectivity or uptime).

:exclamation: Note that we assume a hard cap of ≤ 100 silos at the current stage.

__Orchestrator__
<br>

Collection of storage and compute. The storage is for model parameters, rather than the actual data. A task orchestrator broadcasts the FL task, sends the current model to each silo, and aggregates the gradients from the silos. In this file, the orchestrator is assumed to live in an AML workspace.

__Internal Silos__
<br>

Collection of silos belonging to the same Azure tenant.

__External Silos__
<br>

Collection of silos that reside in a different Azure tenant or with a different cloud provider.
53 changes: 53 additions & 0 deletions docs/guide.md
@@ -0,0 +1,53 @@
# Step-by-Step guide to perform a Federated Learning experiment

The following key milestones can help you run an experiment efficiently:
- A machine learning model. A working model or model architecture with a learning algorithm in a non-federated scheme.
- Resource provisioning. Get the orchestrator and silos ready for FL.
- Validate FL job. Run a synthetic test for the FL framework and understand the effect of each FL hyperparameter.
- Submit an actual FL job.
- Evaluate the performance of a federated model.
- Deploy a federated model.

We describe the key milestones in detail below.


## Machine learning model

The pre-requisite of a _federated_ learning job is a _machine_ learning model, which can be trained in a non-federated scheme. Such a model can be
- a working model trained with some real yet small data, or
- a model architecture that is proven effective via synthetic training data.

## Resource provisioning

If all the data live in __one__ AAD (Azure Active Directory) tenant, you can simply create "vanilla" AML computes and use managed identities to enforce silos (via compute --> storage access). Briefly speaking, each per-silo storage account should be locked down to RBAC-only access (no access keys handed out), and each per-silo compute should have a managed identity that is the only identity granted the "Storage Blob Data Contributor" role on that silo's data. For further details, please check out this [public document](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-managed-identities?tabs=python) on managed identity.
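As a hedged sketch (not this repo's provisioning code, which uses bicep templates), such a role assignment could be created with the `azure-mgmt-authorization` package; the resource IDs below are placeholders, and the exact parameter model can vary across SDK versions:

```python
# Sketch: grant a silo compute's managed identity "Storage Blob Data Contributor"
# on that silo's storage account. All IDs below are placeholders.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

SUBSCRIPTION_ID = "<subscription-id>"
STORAGE_SCOPE = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<silo-storage>"
)
# Well-known built-in role definition for "Storage Blob Data Contributor".
ROLE_DEFINITION_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
client.role_assignments.create(
    scope=STORAGE_SCOPE,
    role_assignment_name=str(uuid.uuid4()),  # assignment names are GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=ROLE_DEFINITION_ID,
        principal_id="<compute-managed-identity-principal-id>",
        principal_type="ServicePrincipal",
    ),
)
```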

To quickly provision resources for your Federated Learning experiment setup, follow the steps given [here](./quickstart.md#deploy-demo-resources-in-azure).

## How to validate an FL job

1. Upload the provided data to each silo’s storage.
2. Download the provided FL script and model to the machine that can connect to the orchestrator workspace.
3. Define one or several metric(s) of interest (they should be consistent with your actual FL job).
4. Select the setting according to the scenario of your actual FL job. For example, differential privacy may be added to improve privacy preservation.
5. Submit the test job to AML and compare the results to our results.
6. Play around with the FL hyperparameters (e.g., parameters in differential privacy, weights for aggregation) to see their impact on the metrics of interest.


## Submit an FL job

1. Ready the baseline model (or base query in the case of Federated _Analysis_) and the resources that can be used to train the model.
2. Confirm there are data in various silos. Access to these data should be restricted.
3. Submit your actual FL job using the FL pipeline with proper hyperparameters.

## Evaluate a federated model

1. You can find the MLflow metrics or artifacts in the pipeline metrics section (a small retrieval sketch follows this list).
2. Look at the loss/evaluation metrics to adjust the hyperparameters such as learning rate, optimizer, etc.
3. Keep iterating until you get the desired results.
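For example, a metric's history (such as an aggregated training loss) can be pulled from the pipeline run with the standard MLflow client; the run id and metric name below are placeholders:

```python
# Sketch: fetch the per-iteration history of a pipeline-level metric with MLflow.
# Assumes MLFLOW_TRACKING_URI points at the AML workspace; names are placeholders.
from mlflow.tracking import MlflowClient

client = MlflowClient()
for m in client.get_metric_history("<pipeline-run-id>", "train_loss"):
    print(f"iteration={m.step} train_loss={m.value:.4f}")
```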

## Deploy a federated model

After training and evaluating your federated model, the last step is to deploy it! That means using the
model in your product, whether that involves batch (offline) inference or hosting it at a production
endpoint. Since the federated model is a single model, deployment can happen just like with your
non-federated models.