Add API worker websocket and HTTP endpoints for FL to PyGrid #445

cereallarceny · 2020-02-05T19:25:00Z

The various worker libraries will need to communicate with PyGrid according to an API that's defined in PyGrid. I currently believe that we should aim to support both WebSocket messages and HTTPS endpoints to accomplish this, hopefully with this philosophy becoming a standard of PyGrid.

All socket calls should follow the format of:

{
  "type": "the type of the message",
  "data": {}
}
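
As a rough illustration of that envelope, here is a minimal sketch of wrapping and unwrapping a socket message. The function names `encodeMessage` and `decodeMessage` are hypothetical and not part of syft.js or PyGrid; treating a missing "data" field as an empty object is also an assumption.

```javascript
// Hypothetical helpers (not part of syft.js): wrap and unwrap socket messages
// in the { type, data } envelope described above.
function encodeMessage(type, data = {}) {
  if (typeof type !== 'string' || type.length === 0) {
    throw new TypeError('type must be a non-empty string');
  }
  return JSON.stringify({ type, data });
}

function decodeMessage(raw) {
  const message = JSON.parse(raw);
  if (message === null || typeof message !== 'object' || typeof message.type !== 'string') {
    throw new TypeError('message must be an object with a string "type"');
  }
  // Assumption: a missing "data" field is treated as an empty payload.
  return { type: message.type, data: message.data ?? {} };
}

// Usage: build an authentication message and read it back.
const wire = encodeMessage('federated/authenticate', { auth_token: 'MY_AUTH_TOKEN' });
const parsed = decodeMessage(wire);
```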

I'd like for the following endpoints to be added:

Authentication with PyGrid

Method in worker library:

const worker = new syft({
  url: 'https://localhost:3000',
  auth_token: MY_AUTH_TOKEN
});

HTTP endpoint: POST /federated/authenticate
Socket "type": federated/authenticate
Request data:

{
  "auth_token": "MY_AUTH_TOKEN"
}

Note that auth_token supplied above is an optional argument depending on the setup of PyGrid.

This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) is attempting to authenticate as a worker with PyGrid.

In order to guarantee the identity of a worker, it's important to have some sort of authentication workflow present. While this isn't strictly required, it will prove an important mechanism in our federated learning workflow for preventing a variety of attacks, most notably a "Sybil attack". This happens when a single worker registers many versions of itself, each with a unique worker ID, thus steering model training to be done by the same worker on the same data, which would overfit the model. To prevent this, we strongly suggest that every deployment of PyGrid's FL system implement some sort of OAuth 2.0 protocol.

In this circumstance, a worker would be logged in to their application via OAuth and would be given an authentication token with which to make secure web requests inside the app. Assuming that PyGrid has also been set up to include this same OAuth mechanism, a worker could forward this auth_token to PyGrid, which then validates that token as an actual user with the same OAuth provider. This matters because PyGrid then doesn't have to incorporate its own authentication system; that responsibility is farmed out to a third-party system.

In the event that the administrator of the PyGrid gateway does not want to add OAuth support, or there is no login capability within the web or mobile app the worker is running on, then this authentication process is skipped and a worker_id is assigned. This is insecure and open to attacks. It's not suggested, but it is required as part of our system.

There are three possible responses, one success and two error responses:

Success - triggered when there is no OAuth flow required by PyGrid OR when there is a required OAuth flow in PyGrid and the auth_token sent by the worker is validated by the third-party provider as belonging to an existing user

{
  "worker_id": "ID OF THE WORKER"
}

Error - triggered when there is an OAuth flow required by PyGrid and no auth_token is sent

{
  "error": "Authentication is required, please pass an 'auth_token'."
}

Error - triggered when there is an OAuth flow required by PyGrid and the auth_token that was sent is invalid

{
  "error": "The 'auth_token' that you passed is invalid."
}

The success response will include a worker_id which should be cached for long-term use. This will be passed with all subsequent calls to PyGrid.
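
The three outcomes above boil down to a small decision tree. The sketch below shows only that decision logic; it is written in JavaScript for consistency with the worker example even though PyGrid itself is a Python project, and `authenticateWorker`, `isTokenValid`, and `createWorkerId` are hypothetical names standing in for the real provider check and worker registration.

```javascript
// Sketch of the decision logic only, not PyGrid's implementation.
// `isTokenValid` and `createWorkerId` are hypothetical callbacks standing in
// for the third-party token check and for worker registration.
function authenticateWorker({ authRequired, authToken, isTokenValid, createWorkerId }) {
  if (!authRequired) {
    // No auth flow configured: a worker_id is assigned without any check.
    return { worker_id: createWorkerId() };
  }
  if (authToken === undefined || authToken === null || authToken === '') {
    return { error: "Authentication is required, please pass an 'auth_token'." };
  }
  if (!isTokenValid(authToken)) {
    return { error: "The 'auth_token' that you passed is invalid." };
  }
  return { worker_id: createWorkerId() };
}

// Example wiring with stand-in callbacks.
const deps = {
  isTokenValid: (token) => token === 'MY_AUTH_TOKEN',
  createWorkerId: () => 'worker-1',
};
const open = authenticateWorker({ authRequired: false, ...deps });
const missing = authenticateWorker({ authRequired: true, ...deps });
const invalid = authenticateWorker({ authRequired: true, authToken: 'nope', ...deps });
const valid = authenticateWorker({ authRequired: true, authToken: 'MY_AUTH_TOKEN', ...deps });
```

On the worker side, whichever library is in use would store the returned worker_id and send it with later calls.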

Connection Speed Test

Method in worker library: job.start()
HTTP endpoint: GET /federated/speed-test and POST /federated/speed-test
Socket "type": N/A
Query string: ?random=RANDOM HASH VALUE&worker_id=ID OF THE WORKER

This endpoint is HTTP only.

We need some way of getting a reliable average upload and download speed for a worker in order to potentially qualify them for joining an FL worker cycle. In order to do this, we need two endpoints at the same location: a GET route for testing worker download speed and a POST route for testing worker upload speed. In each route, a random query string value must be appended onto the end of the request to prevent the server or the worker from caching the result after multiple rounds.

When performing the download speed test, PyGrid will generate a random file of a certain size (to be determined) which the worker may download. The time it takes the worker to download will be captured by the worker and stored.

When performing the upload speed test, the worker will generate a random file of a certain size (to be determined) which will be uploaded to PyGrid (and then discarded). The time it takes the worker to upload will also be captured by the worker and stored.
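
To make the arithmetic concrete, here is a minimal sketch of the worker-side bookkeeping: building the cache-busting URL and turning a timed transfer into a speed figure. All function names are hypothetical, the file size is a placeholder since the real size is still to be determined, and the random value is not cryptographically strong (it only needs to defeat caching).

```javascript
// Hypothetical helpers: none of these names come from syft.js or PyGrid.
function speedTestUrl(
  baseUrl,
  workerId,
  random = Date.now().toString(36) + Math.random().toString(36).slice(2)
) {
  const query = new URLSearchParams({ random, worker_id: workerId });
  return `${baseUrl}/federated/speed-test?${query}`;
}

// Convert a transfer of `bytes` that took `elapsedMs` into megabits per second.
function megabitsPerSecond(bytes, elapsedMs) {
  if (elapsedMs <= 0) throw new RangeError('elapsedMs must be positive');
  return (bytes * 8) / 1e6 / (elapsedMs / 1000);
}

// Average several rounds to smooth out a single slow or fast transfer.
function average(values) {
  if (values.length === 0) throw new RangeError('need at least one sample');
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Usage with made-up timings: 5,000,000 bytes in 1 s, then in 2 s.
const url = speedTestUrl('https://localhost:3000', 'worker-1', 'abc123');
const samples = [megabitsPerSecond(5000000, 1000), megabitsPerSecond(5000000, 2000)];
const avg = average(samples);
```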

Note: The above is merely a proposal of how this workflow should work. The real-world solution should be determined and this document will be modified to fit the best solution we come up with. This paradigm should be heavily tested against real-world connection speed tests to ensure a reliable result. @Prtfw please do some extra research on this to cover our bases.

FL Worker Cycle Request

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: POST /federated/cycle-request
Socket "type": federated/cycle-request
Request data:

{
  "worker_id": "ID OF THE WORKER",
  "model": "my-federated-model",
  "version": "0.1.0",
  "ping": "8ms",
  "download": "46.3mbps",
  "upload": "23.7mbps"
}

Note that version supplied above is an optional argument.

This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) is attempting to join an active federated learning cycle. PyGrid decides whether to accept the worker depending on the current state of the cycle, the speed of the worker's connection, and how many workers have already been chosen.
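
For illustration, a worker library might assemble the request data along these lines. `buildCycleRequest` and its parameter names are hypothetical; the output field names and the "ms"/"mbps" string formats simply mirror the example request above.

```javascript
// Hypothetical helper; the output field names follow the request data above.
function buildCycleRequest({ workerId, model, version, pingMs, downloadMbps, uploadMbps }) {
  const request = {
    worker_id: workerId,
    model,
    ping: `${pingMs}ms`,
    download: `${downloadMbps}mbps`,
    upload: `${uploadMbps}mbps`,
  };
  // "version" is optional, so only include it when the caller supplied one.
  if (version !== undefined) request.version = version;
  return request;
}

const withVersion = buildCycleRequest({
  workerId: 'worker-1',
  model: 'my-federated-model',
  version: '0.1.0',
  pingMs: 8,
  downloadMbps: 46.3,
  uploadMbps: 23.7,
});
const withoutVersion = buildCycleRequest({
  workerId: 'worker-1',
  model: 'my-federated-model',
  pingMs: 8,
  downloadMbps: 46.3,
  uploadMbps: 23.7,
});
```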

Given this information, PyGrid will send one of two responses:

Rejection

{
  "status": "rejected",
  "timeout": 2700,
  "model": "my-federated-model",
  "version": "0.1.0"
}

This means that the worker was rejected from the current cycle and asked to request to join another cycle in 2700 seconds. The number of seconds will depend on when the next cycle is expected to start. If a timeout is not sent, this means that it's the last cycle and there will not be another one to join.
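
A worker could interpret a rejection roughly as follows. This is a sketch under the rules stated above (timeout is in seconds, and a missing timeout means no further cycle); `nextCycleAttempt` is a hypothetical name.

```javascript
// Hypothetical helper: decide if and when to ask to join the next cycle.
function nextCycleAttempt(response, nowMs = Date.now()) {
  if (response.status !== 'rejected') {
    throw new Error(`expected a rejection, got status "${response.status}"`);
  }
  // No timeout means this was the last cycle, so there is nothing to retry.
  if (response.timeout === undefined || response.timeout === null) {
    return { retry: false };
  }
  // The timeout is given in seconds; convert to a millisecond timestamp.
  return { retry: true, retryAtMs: nowMs + response.timeout * 1000 };
}

const later = nextCycleAttempt(
  { status: 'rejected', timeout: 2700, model: 'my-federated-model', version: '0.1.0' },
  0
);
const never = nextCycleAttempt({ status: 'rejected', model: 'my-federated-model', version: '0.1.0' }, 0);
```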

Accepted

{
  "status": "accepted",
  "model": "my-federated-model",
  "version": "0.1.0",
  "request_key": "LONG HASH VALUE",
  "plans": { "training_plan": "ID OF THE TRAINING PLAN", "another_plan": "ID OF ANOTHER PLAN" },
  "client_config": "CLIENT CONFIG OBJECT",
  "protocols": { "secure_agg_protocol": "ID OF THE PROTOCOL" },
  "model_id": "ID OF THE MODEL"
}

In the event that the worker is accepted into the current cycle, they will be sent a named list of the IDs of the various plans they need to execute, a named list of the IDs of the various protocols they need to execute, the ID of the model, and the client config. The plans, protocols, and model will not be downloaded in this response. Instead, the worker will need to make an additional request to receive each of them (due to the size constraints of the response). They will pass the request_key given above as a form of "authenticating" the download request. This key is specific to the relationship between the worker AND the cycle and cannot be reused for future cycles or other workers. This is detailed in the "Plan Download" section.
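
As a sketch of what a worker does with an accepted response, the helper below flattens it into the list of follow-up downloads it implies. `listDownloads` and the `kind`/`name`/`id` shape are hypothetical; only the response field names come from the example above.

```javascript
// Hypothetical helper: list every download implied by an "accepted" response.
function listDownloads(accepted) {
  if (accepted.status !== 'accepted') throw new Error('worker was not accepted');
  const downloads = [];
  for (const [name, id] of Object.entries(accepted.plans ?? {})) {
    downloads.push({ kind: 'plan', name, id });
  }
  for (const [name, id] of Object.entries(accepted.protocols ?? {})) {
    downloads.push({ kind: 'protocol', name, id });
  }
  downloads.push({ kind: 'model', name: accepted.model, id: accepted.model_id });
  return downloads;
}

const downloads = listDownloads({
  status: 'accepted',
  model: 'my-federated-model',
  version: '0.1.0',
  request_key: 'LONG HASH VALUE',
  plans: { training_plan: 'ID OF THE TRAINING PLAN', another_plan: 'ID OF ANOTHER PLAN' },
  client_config: 'CLIENT CONFIG OBJECT',
  protocols: { secure_agg_protocol: 'ID OF THE PROTOCOL' },
  model_id: 'ID OF THE MODEL',
});
```

Each entry would then be fetched over HTTP with the same request_key.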

Note that it is not possible for a worker to participate in the same cycle multiple times. The client creates a "job" request. If they are accepted, they should not be allowed to submit another job request for the same cycle.
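
One way to picture the "once per cycle" rule on the server side is a registry keyed by cycle and worker. This is only an in-memory sketch; `CycleRegistry` and the notion of a `cycleId` are assumptions, and a real PyGrid deployment would presumably keep this in its database.

```javascript
// Hypothetical in-memory registry: at most one accepted job per worker per cycle.
class CycleRegistry {
  constructor() {
    this.accepted = new Set();
  }

  // Returns true the first time a worker is accepted into a cycle, false afterwards.
  tryAccept(cycleId, workerId) {
    const key = JSON.stringify([cycleId, workerId]);
    if (this.accepted.has(key)) return false;
    this.accepted.add(key);
    return true;
  }
}

const registry = new CycleRegistry();
const first = registry.tryAccept('cycle-1', 'worker-1');
const second = registry.tryAccept('cycle-1', 'worker-1');
const otherCycle = registry.tryAccept('cycle-2', 'worker-1');
```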

Plan Download

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-plan
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&plan_id=ID OF THE PLAN&receive_operations_as=list

This endpoint is HTTP only.

This method will allow a worker that has been accepted into a cycle to request the download of a plan from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

The worker also needs to specify how it would like to receive plans: either as a list of operations ("list") or as TorchScript ("torchscript"), depending on the type of worker requesting (#437). This is set in the receive_operations_as parameter of the query string.

Response: This downloads the plan to the worker.
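
A minimal sketch of assembling the plan download URL is shown below. `planDownloadUrl` is a hypothetical helper; the query parameter names and the two allowed formats are taken from the description above, and `URLSearchParams` handles the encoding.

```javascript
// The two formats named in the spec above.
const PLAN_FORMATS = ['list', 'torchscript'];

// Hypothetical helper: build the GET /federated/get-plan URL.
function planDownloadUrl(baseUrl, { workerId, requestKey, planId, receiveOperationsAs }) {
  if (!PLAN_FORMATS.includes(receiveOperationsAs)) {
    throw new RangeError(`receive_operations_as must be one of: ${PLAN_FORMATS.join(', ')}`);
  }
  const query = new URLSearchParams({
    worker_id: workerId,
    request_key: requestKey,
    plan_id: planId,
    receive_operations_as: receiveOperationsAs,
  });
  return `${baseUrl}/federated/get-plan?${query}`;
}

const planUrl = planDownloadUrl('https://localhost:3000', {
  workerId: 'w1',
  requestKey: 'k1',
  planId: 'p1',
  receiveOperationsAs: 'list',
});
```

The protocol and model downloads described below follow the same pattern with protocol_id or model_id in place of plan_id and without receive_operations_as.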

Protocol Download

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-protocol
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&protocol_id=ID OF THE PROTOCOL

This endpoint is HTTP only.

This method will allow a worker that has been accepted into a cycle to request the download of a protocol from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

Response: This downloads the protocol to the worker.

Model Download

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-model
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&model_id=ID OF THE MODEL

This endpoint is HTTP only.

This method will allow a worker that has been accepted into a cycle to request the download of a model from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

Response: This downloads the model to the worker.

Report

Method in worker library: job.report()
HTTP endpoint: POST /federated/report
Socket "type": federated/report
Request data:

{
  "worker_id": "ID OF THE WORKER",
  "request_key": "LONG HASH VALUE",
  "diff": "FINAL MODEL DIFF FROM TRAINING"
}

This method will allow a worker that has been accepted into a cycle and finished training a model on their device to upload the resulting model diff.

If the worker did not execute a protocol after the plan(s) were executed, then they will simply submit their entire model diff. If they want to manually add noise to this diff as a layer of protection, they may do so at the developer's discretion from inside the worker implementation.

If the worker did execute a protocol and they have finished the secure aggregation protocol with other workers, they will now receive a share of the resulting securely aggregated model diff. In this case, they will submit the share of the diff, rather than their original model diff. PyGrid will handle the decryption of the shares once they're all submitted.

Response: { "status": "success" }

The success response is sent whenever the HTTP response is a 200. The worker should not be informed whether the model diff was accepted or denied as part of the global model update.
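
The choice between reporting the raw diff and reporting a share can be sketched like this. `buildReport`, `isReportSuccess`, and the `modelDiff`/`aggregatedShare` parameter names are hypothetical; the request and response field names come from the spec above.

```javascript
// Hypothetical helper: build the report payload.
function buildReport({ workerId, requestKey, modelDiff, aggregatedShare }) {
  // After a secure aggregation protocol the worker reports its share;
  // otherwise it reports its own model diff.
  const diff = aggregatedShare !== undefined ? aggregatedShare : modelDiff;
  if (diff === undefined) {
    throw new Error('either modelDiff or aggregatedShare is required');
  }
  return { worker_id: workerId, request_key: requestKey, diff };
}

// A report counts as delivered on HTTP 200 with { "status": "success" }.
// It says nothing about whether the diff was used in the global update.
function isReportSuccess(httpStatus, body) {
  return httpStatus === 200 && body !== null && typeof body === 'object' && body.status === 'success';
}

const plain = buildReport({ workerId: 'w1', requestKey: 'k1', modelDiff: 'DIFF' });
const shared = buildReport({ workerId: 'w1', requestKey: 'k1', modelDiff: 'DIFF', aggregatedShare: 'SHARE' });
```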


vkkhare · 2020-02-05T20:05:50Z

So one thing I understand is that we don't need to send the worker_id in socket messages anymore. Once authentication is finished and we have set up a websocket connection, we don't need it any further; PyGrid can already store that info with the connection. HTTPS requests would still require such an authentication parameter, mostly in the URL?

Also, the model, TorchScript and protocols might be big, as mentioned earlier, so I think that along with the type (list/torchscript) we should send the URL from which we can download the files. That would probably be an HTTP connection!

cereallarceny · 2020-02-06T07:36:01Z

That's definitely what I was thinking for the model, but I can't imagine plan and protocol being too large for the response. If you think they'd be too big then we can include the ID instead just like we did for model.


vkkhare · 2020-02-06T09:15:53Z

I assume protocols can get big if we have too many operations there. Nevertheless, TorchScript will be a few MB, so we would need an ID for it.

cereallarceny · 2020-02-06T09:31:29Z

Totally fair. I'll go ahead and modify the original post to include the change to an ID.


cereallarceny · 2020-02-06T19:05:36Z

@vkkhare @mjjimenez @mccorby @vvmnnnkv I've finished what I think is a good spec for the various endpoints that PyGrid needs to create. Please take a look and provide your thoughts here.

vkkhare · 2020-02-06T19:36:51Z

I would like to know what the expected size of the diff is. If it is a big file, which I think it will be, the worker will then initiate an upload process, and hence rather than being a one-off submission this would be a long process with status updates, retries, etc.

cereallarceny · 2020-02-06T21:20:04Z

To be honest, I have no idea and I don't know that we will know until we try something. What about going ahead with the general format of what's described above with the understanding that it's likely to change?


vvmnnnkv · 2020-02-08T10:30:21Z

"I would like to know what is the expected size of diff?"

@vkkhare it should be the same size as the model (given all model params are trainable :))

Thoughts on doc:

1. I wonder why download requests are POSTs. Naturally these should be GETs. This enables caching, part-ranges.
2. Not sure everything should be under /federated/ API scope. As I understand, Model hosting is already supported by pygrid, so there should be some endpoint already? In theory, Plans/Protocols might be used outside of FL, so we might want more "generic" API endpoints for them?
3. There's a typo in Model Download section, the URL is get-protocol.
4. I think we should avoid JSON for binary transfers (downloads: Model, Plan, Protocol; uploads: Model diff) because JSON envelope requires to encode binary with base64, which increases the size. Like @vkkhare suggested, these can be separate endpoints with binary content.
5. I don't quite understand how this issue works together with Grid Standardization #447 that also has "get-protocol" endpoint. And the worker scopes for Protocol execution are not described here, but implemented there?
6. How about using swagger, apib, or similar tools for describing APIs? This is easier to maintain than a text document, and we also get API docs auto-generation.
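
To put a number on the base64 overhead mentioned in the list above, here is a quick check using Node's built-in `Buffer`. The 3 MiB size is an arbitrary stand-in for a model or diff, not a measured value.

```javascript
// 3 MiB of zero bytes as a stand-in for binary model data (arbitrary size).
const raw = Buffer.alloc(3 * 1024 * 1024);

// base64 emits 4 output characters for every 3 input bytes.
const encoded = raw.toString('base64');
const overhead = encoded.length / raw.length;

console.log(`raw: ${raw.length} bytes, base64: ${encoded.length} chars, ratio: ${overhead.toFixed(3)}`);
```

So a JSON envelope carrying the binary as base64 is roughly a third larger than sending the bytes directly, before counting any JSON escaping or framing.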

cereallarceny · 2020-02-08T12:25:31Z

@vvmnnnkv

1. Changed
2. I'd like to scope the endpoints by intention because there may be other ways to download a model that requires a different security protocol than the federated learning workflow. For now, I'd like to scope PyGrid according to intention.
3. Changed
4. That's fine with me. What are you suggesting we change then for these endpoints in the documentation above?
5. Grid Standardization #447 is based on the old "grid.js" style of federated learning. Ignore this. It will be changed by this issue. We won't need "get-protocol" as it was implemented in grid.js anymore and thusly can remove it from PyGrid once we have these endpoints. The problem was that "get-protocol" was heavily related to how WebRTC was being done. Now that WebRTC is unrelated to the training process itself, and only related to secure aggregation protocols, the code that was merged into PyGrid is now out of date. This will be updated momentarily. :)
6. We can in the future as a form of documentation, however, I believe the PyGrid team is working on implementing Read The Docs. If you want to create a separate issue for using Swagger on PyGrid, that's fine, but it's out of scope for this issue. Let's use the descriptions of the endpoints in the above issue description and be done with it. 😄

vvmnnnkv · 2020-02-08T21:42:26Z

Thanks!

Downloads are already good! For "report" we can use POST multipart/form-data with 2 parts: application/json for JSON (w/o "diff"), application/octet-stream for binary "diff" content. Or move JSON contents into URL params for simplicity and just upload binary. Or have 2 endpoints: 1) upload binary that will return some uploadID, 2) report that uploadID - this separation might be useful if we'll need to use some other protocol for uploading a binary than simple POST (like @vkkhare mentioned "long process with status updates, retries").
Then I'm confused what request will return scopeId and participants. The Protocol orchestration part - pygrid should group workers before they kick off WebRTC and Protocol execution?

cereallarceny · 2020-02-11T12:37:05Z

@vvmnnnkv @vkkhare I've updated this to force HTTP for all download-related endpoints. Websockets will not be allowed due to the lack of status checks.

cereallarceny · 2020-02-14T15:19:50Z

I've updated this issue to include support for multiple plans and protocols.

cereallarceny · 2020-02-20T11:08:37Z

Great questions @vvmnnnkv - here's my thinking.

Let's hold off on adding a checkpoint to the FL cycle request for now. The reason is that FL cycles are likely to be finished (and thus a new checkpoint would already be released) by the time most users would have updated their web application/mobile app. So by the time you're requesting a checkpoint, it will likely already be outdated. We can add this functionality later on as requested by users, but for now, that would get into some rather complicated branching. I'd rather force things to be sequential. If a developer really likes a particular checkpoint, they can always run get_model() and reupload to PyGrid as a new version. Fortunately, the FL cycle request already supports name and version, just not a checkpoint. 😄

Likewise, the get-model request will actually infer the appropriate version and checkpoint from the request_key. The request_key is like an API token that the worker will have that's specific to their worker, model, and individual cycle. So they won't be able to request a model that they're not explicitly permitted to download.
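
Conceptually, that lookup could work like the sketch below. Everything here is an assumption for illustration: the `grants` map, the `lookupGrant` name, and the stored fields are hypothetical, and the actual data model is whatever PyGrid ends up implementing.

```javascript
// Hypothetical store: request_key -> what that key was issued for.
const grants = new Map([
  ['k1', { workerId: 'w1', cycleId: 'cycle-1', model: 'my-federated-model', version: '0.1.0', checkpoint: 3 }],
]);

// Resolve which model version and checkpoint a request_key allows, or null
// when the key is unknown or was issued to a different worker.
function lookupGrant(store, requestKey, workerId) {
  const grant = store.get(requestKey);
  if (grant === undefined || grant.workerId !== workerId) return null;
  return grant;
}

const allowed = lookupGrant(grants, 'k1', 'w1');
const wrongWorker = lookupGrant(grants, 'k1', 'w2');
const unknownKey = lookupGrant(grants, 'missing', 'w1');
```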

vvmnnnkv · 2020-02-21T14:20:20Z

@cereallarceny
Well, I was asking about these params in the cycle-request response body, not the request params!
But if request_key is a kind of session that holds the cycle data and is used by get-model, then this makes sense :)

cereallarceny · 2020-02-21T14:28:30Z

@vvmnnnkv Check out this data model to see if it answers your question. #481

vvmnnnkv · 2020-02-21T14:31:05Z

On the other hand, it may be beneficial for the worker to know the version/checkpoint to avoid re-downloading the model if it didn't change. In this case the worker needs to know the version/checkpoint after requesting a cycle.

cereallarceny · 2020-02-21T14:43:54Z

Sure. Let's do that later. I just want to get a base version of this issue working. Write this idea down and let's address it another day. :)

cereallarceny · 2020-02-26T17:28:08Z

I've updated this issue to include the newly updated authentication flow @Prtfw @vvmnnnkv @vkkhare @mjjimenez

cereallarceny · 2020-02-26T18:06:31Z

I've also updated this issue to include our connection speed testing workflow.

cereallarceny created this issue from a note in Model-centric Federated Learning (To do) Feb 5, 2020

cereallarceny added the Type: Epic 🤙 Describes a large amount of functionality that will likely be broken down into smaller issues label Feb 5, 2020

cereallarceny assigned cereallarceny, vkkhare and IonesioJunior Feb 5, 2020

mjjimenez mentioned this issue Feb 7, 2020

Create serializable/deserializable web socket messages for FL endpoints OpenMined/SwiftSyft#42

Closed

6 tasks

cereallarceny unassigned cereallarceny, vkkhare and IonesioJunior Feb 7, 2020

This was referenced Feb 16, 2020

Implement FL Authentication and Cycle API Request OpenMined/SwiftSyft#47

Closed

Implement ping checker OpenMined/SwiftSyft#48

Closed

This was referenced Feb 17, 2020

Grid Authentication #463

Closed

FL Worker Cycle Request #464

Closed

Protocol Download #466

Closed

Model Download Endpoint #467

Closed

Plan Download #465

Closed

Report FL Process #468

Closed

Sketch Events for Mobile/Web Federated Learning #469

Merged

mjjimenez mentioned this issue Feb 25, 2020

Establish WebRTC Connection using PyGrid OpenMined/SwiftSyft#51

Closed

cereallarceny mentioned this issue Feb 26, 2020

Add JWT support for FL workflow #496

Closed

This was referenced Feb 26, 2020

Add bandwidth and Internet connectivity test in Android OpenMined/KotlinSyft#28

Closed

Add bandwidth and Internet connectivity test in iOS OpenMined/SwiftSyft#29

Closed

Add bandwidth and Internet connectivity test in syft.js OpenMined/syft.js#88

Closed

IonesioJunior mentioned this issue Mar 5, 2020

Connection Speed Test #502

Closed

monuelo mentioned this issue Mar 7, 2020

[EPIC] Create PyGrid Client OpenMined/PySyft#3153

Closed

This was referenced Mar 12, 2020

FL Process API fixes #509

Merged

FL Cycle Request improvement #514

Closed

Prtfw mentioned this issue Mar 17, 2020

add dummy auth endpoint to unblock #516

Merged

8 tasks

cereallarceny closed this as completed in #516 Mar 17, 2020

Model-centric Federated Learning automation moved this from In progress to Done Mar 17, 2020

Prtfw mentioned this issue Mar 18, 2020

auth working (JWT + HSA + RSA) #517

Merged

10 tasks

cereallarceny reopened this Mar 19, 2020

Model-centric Federated Learning automation moved this from Done to In progress Mar 19, 2020

cereallarceny mentioned this issue Mar 19, 2020

Add model name and version to fl cycle rejection #519

Closed

vkkhare mentioned this issue Mar 22, 2020

Add download/upload speed checker OpenMined/KotlinSyft#61

Closed

cereallarceny closed this as completed in #517 Mar 31, 2020

Model-centric Federated Learning automation moved this from In progress to Done Mar 31, 2020

This was referenced May 27, 2020

Test all major features OpenMined/syft.js#123

Closed

Test all major features OpenMined/SwiftSyft#122

Closed

Test all major features OpenMined/KotlinSyft#96

Closed

vkkhare mentioned this issue Jun 21, 2021

remove job arguments from authenticate OpenMined/PySyft#5696

Closed
