This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

Add API worker websocket and HTTP endpoints for FL to PyGrid #445

Closed
cereallarceny opened this issue Feb 5, 2020 · 22 comments · Fixed by #516 or #517
Labels
Type: Epic 🤙 Describes a large amount of functionality that will likely be broken down into smaller issues

Comments

@cereallarceny
Member

cereallarceny commented Feb 5, 2020

The various worker libraries will need to communicate with PyGrid according to an API that's defined in PyGrid. I currently believe that we should aim to support both Websocket messages as well as HTTPS endpoints to accomplish this - hopefully with this philosophy becoming a standard of PyGrid.

All socket calls should follow the format of:

{
  "type": "the type of the message",
  "data": {}
}
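As a minimal sketch of the envelope above, the two helpers below build and validate a socket message. The function names `buildSocketMessage` and `parseSocketMessage` are hypothetical and are not part of syft.js or PyGrid; only the `type` and `data` keys come from this issue.

```javascript
// Hypothetical helper: wraps a payload in the { type, data } envelope.
function buildSocketMessage(type, data = {}) {
  if (typeof type !== 'string' || type.length === 0) {
    throw new TypeError('A socket message needs a non-empty string "type".');
  }
  return JSON.stringify({ type, data });
}

// Hypothetical helper: parses an incoming message and checks the envelope shape.
function parseSocketMessage(raw) {
  const message = JSON.parse(raw);
  const isObject = (value) => typeof value === 'object' && value !== null;
  if (!isObject(message) || typeof message.type !== 'string' || !isObject(message.data)) {
    throw new Error('Malformed socket message: expected { type, data }.');
  }
  return message;
}

// Example: the authentication call expressed as a socket message.
const rawMessage = buildSocketMessage('federated/authenticate', {
  auth_token: 'MY_AUTH_TOKEN',
});
const parsedMessage = parseSocketMessage(rawMessage);
console.log(parsedMessage.type); // federated/authenticate
```

Validating the envelope on receipt keeps malformed messages from reaching the per-type handlers.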

I'd like for the following endpoints to be added:

Authentication with PyGrid

Method in worker library:

const worker = new syft({
  url: 'https://localhost:3000',
  auth_token: MY_AUTH_TOKEN
});

HTTP endpoint: POST /federated/authenticate
Socket "type": federated/authenticate
Request data:

{
  "auth_token": "MY_AUTH_TOKEN"
}

Note that auth_token supplied above is an optional argument depending on the setup of PyGrid.

This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) is attempting to authenticate as a worker with PyGrid.

In order to guarantee the identity of a worker, it's important to have some sort of authentication workflow present. While this isn't strictly required, it will prove an important mechanism in our federated learning workflow for preventing a variety of attacks, most notably a "Sybil attack". This happens when a single worker generates multiple identities for itself, so that much of the model training is done by the same worker on the same data under different worker IDs, which would overfit the model. To prevent this, we strongly suggest that every deployment of PyGrid's FL system implement some sort of OAuth 2.0 protocol.

In this circumstance, a worker would be logged in to their application via OAuth and would be given an authentication token with which to make secure web requests inside the app. Assuming that PyGrid has also been set up to include this same OAuth mechanism, a worker could forward this auth_token to PyGrid, which then validates that token as an actual user with the same OAuth provider. It's important to do this because it spares PyGrid the responsibility of building its own authentication system and instead farms that responsibility out to a third-party system.

In the event that the administrator of the PyGrid gateway does not want to add OAuth support, or there is no login capability within the web or mobile app the worker is running on, then this authentication process is skipped and a worker_id is assigned. This is insecure and open to attacks. We don't recommend it, but PyGrid is required to support it.

There are three possible responses, one success and two error responses:

Success - triggered when there is no OAuth flow required by PyGrid OR when there is a required OAuth flow in PyGrid and the auth_token sent by the worker is validated by the third-party provider as belonging to an existing user

{
  "worker_id": "ID OF THE WORKER"
}

Error - triggered when there is an OAuth flow required by PyGrid and no auth_token is sent

{
  "error": "Authentication is required, please pass an 'auth_token'."
}

Error - triggered when there is an OAuth flow required by PyGrid and the auth_token that was sent is invalid

{
  "error": "The 'auth_token' that you passed is invalid."
}

The success response will include a worker_id which should be cached for long-term use. This will be passed with all subsequent calls to PyGrid.
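The request and the three responses above could be handled on the worker side roughly as follows. This is only a sketch: `buildAuthRequest`, `handleAuthResponse`, and the `Map` used as a cache are illustrative assumptions, not an existing syft.js API. Only the `auth_token`, `worker_id`, and `error` keys come from this issue.

```javascript
// Hypothetical helper: builds the request data; auth_token is optional.
function buildAuthRequest(authToken) {
  return authToken === undefined ? {} : { auth_token: authToken };
}

// Hypothetical helper: returns the worker_id from a success response and caches it,
// or throws with the error message PyGrid sent back.
function handleAuthResponse(response, cache) {
  if (typeof response.error === 'string') {
    throw new Error(response.error);
  }
  if (typeof response.worker_id !== 'string') {
    throw new Error('Unexpected authentication response.');
  }
  cache.set('worker_id', response.worker_id);
  return response.worker_id;
}

// Example usage with a Map standing in for long-term storage.
const workerCache = new Map();
const workerId = handleAuthResponse({ worker_id: 'ID OF THE WORKER' }, workerCache);
console.log(workerId); // ID OF THE WORKER
```

In a real worker library the cache would be persistent storage, so that the worker_id survives restarts.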

Connection Speed Test

Method in worker library: job.start()
HTTP endpoint: GET /federated/speed-test and POST /federated/speed-test
Socket "type": N/A
Query string: ?random=RANDOM HASH VALUE&worker_id=ID OF THE WORKER

This endpoint is HTTP only.

We need some way of getting a reliable average upload and download speed for a worker in order to potentially qualify them for joining an FL worker cycle. In order to do this, we need two endpoints at the same location: a GET route for testing worker download speed and a POST route for testing worker upload speed. In each route, a random query string value must be appended onto the end of the request to prevent the server or the worker from caching the result after multiple rounds.

When performing the download speed test, PyGrid will generate a random file of a certain size (to be determined) which the worker may download. The time it takes the worker to download will be captured by the worker and stored.

When performing the upload speed test, the worker will generate a random file of a certain size (to be determined) which will be uploaded to PyGrid (and then discarded). The time it takes the worker to upload will also be captured by the worker and stored.
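To make the measurement concrete, here is a rough sketch of how a worker might turn a timed transfer into a speed and build the cache-busting query string. The helper names and the averaging strategy are assumptions for illustration. The file size and number of rounds are still to be determined, as noted in this proposal.

```javascript
// Converts a transfer of `bytes` that took `elapsedMs` into megabits per second.
function megabitsPerSecond(bytes, elapsedMs) {
  if (elapsedMs <= 0) {
    throw new RangeError('elapsedMs must be positive.');
  }
  const bits = bytes * 8;
  const seconds = elapsedMs / 1000;
  return bits / seconds / 1e6;
}

// Averages several timed transfers (assumed strategy: plain arithmetic mean).
function averageSpeed(samples) {
  if (samples.length === 0) {
    throw new RangeError('At least one sample is required.');
  }
  const total = samples.reduce(
    (sum, sample) => sum + megabitsPerSecond(sample.bytes, sample.elapsedMs),
    0
  );
  return total / samples.length;
}

// Builds the speed-test path with a random value to defeat caching.
function speedTestPath(workerId, random = Math.random().toString(36).slice(2)) {
  const params = new URLSearchParams({ random, worker_id: workerId });
  return `/federated/speed-test?${params.toString()}`;
}

const downloadSamples = [
  { bytes: 5000000, elapsedMs: 1000 }, // 40 Mbps
  { bytes: 5000000, elapsedMs: 2000 }, // 20 Mbps
];
console.log(averageSpeed(downloadSamples)); // 30
```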

Note: The above is merely a proposal of how this workflow should work. The real-world solution should be determined and this document will be modified to fit the best solution we come up with. This paradigm should be heavily tested against real-world connection speed tests to ensure a reliable result. @Prtfw please do some extra research on this to cover our bases.

FL Worker Cycle Request

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: POST /federated/cycle-request
Socket "type": federated/cycle-request
Request data:

{
  "worker_id": "ID OF THE WORKER",
  "model": "my-federated-model",
  "version": "0.1.0",
  "ping": "8ms",
  "download": "46.3mbps",
  "upload": "23.7mbps"
}

Note that version supplied above is an optional argument.

This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) is attempting to join an active federated learning cycle. PyGrid decides whether to accept the worker depending on the current state of the cycle, the speed of the worker's connection, and how many workers have already been chosen.

Given this information, PyGrid will send one of two responses:

Rejection

{
  "status": "rejected",
  "timeout": 2700,
  "model": "my-federated-model",
  "version": "0.1.0"
}

This means that the worker was rejected from the current cycle and asked to request to join another cycle in 2700 seconds. The number of seconds will depend on when the next cycle is expected to start. If a timeout is not sent, this means that it's the last cycle and there will not be another one to join.
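A worker could interpret a rejection along these lines. This is a sketch only, and the function name `nextRequestDelayMs` is hypothetical. It returns the wait in milliseconds, or `null` when no timeout was sent, meaning there is no further cycle to join.

```javascript
// Hypothetical helper: how long to wait before requesting another cycle.
// Returns null when no timeout was sent (this was the last cycle).
function nextRequestDelayMs(rejection) {
  if (rejection.status !== 'rejected') {
    throw new Error('Expected a rejection response.');
  }
  return typeof rejection.timeout === 'number' ? rejection.timeout * 1000 : null;
}

const rejection = {
  status: 'rejected',
  timeout: 2700,
  model: 'my-federated-model',
  version: '0.1.0',
};
console.log(nextRequestDelayMs(rejection)); // 2700000
console.log(nextRequestDelayMs({ status: 'rejected' })); // null
```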

Accepted

{
  "status": "accepted",
  "model": "my-federated-model",
  "version": "0.1.0",
  "request_key": "LONG HASH VALUE",
  "plans": { "training_plan": "ID OF THE TRAINING PLAN", "another_plan": "ID OF ANOTHER PLAN" },
  "client_config": "CLIENT CONFIG OBJECT",
  "protocols": { "secure_agg_protocol": "ID OF THE PROTOCOL" },
  "model_id": "ID OF THE MODEL"
}

In the event that the worker is accepted into the current cycle, they will be sent a named list of the IDs of the various plans they need to execute, a named list of the IDs of the various protocols they need to execute, the ID of the model, and the client config. The plans, protocols, and model will not be downloaded in this response. Instead, the worker will need to make additional requests to receive them (due to the size constraints of the response). They will pass the request_key given above as a form of "authenticating" each download request. The request_key is specific to the relationship between the worker AND the cycle and cannot be reused for future cycles or by other workers. The download requests are detailed in the "Plan Download", "Protocol Download", and "Model Download" sections.

Note that it is not possible for a worker to participate in the same cycle multiple times. The client creates a "job" request. If they are accepted, they should not be allowed to submit another job request for the same cycle.
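One way the "one job per worker per cycle" rule could be enforced is sketched below. PyGrid itself is written in Python, so this JavaScript version only illustrates the bookkeeping. The `CycleRegistry` class and the notion of a `cycleId` are assumptions, not part of the spec above.

```javascript
// Hypothetical bookkeeping: remembers which workers were accepted into which cycle.
class CycleRegistry {
  constructor() {
    this.acceptedByCycle = new Map(); // cycleId -> Set of worker IDs
  }

  // Returns true and records the worker if they are new to this cycle,
  // or false if they were already accepted into it.
  tryAccept(cycleId, workerId) {
    let accepted = this.acceptedByCycle.get(cycleId);
    if (!accepted) {
      accepted = new Set();
      this.acceptedByCycle.set(cycleId, accepted);
    }
    if (accepted.has(workerId)) {
      return false;
    }
    accepted.add(workerId);
    return true;
  }
}

const registry = new CycleRegistry();
console.log(registry.tryAccept('cycle-1', 'worker-a')); // true
console.log(registry.tryAccept('cycle-1', 'worker-a')); // false
console.log(registry.tryAccept('cycle-2', 'worker-a')); // true
```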

Plan Download

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-plan
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&plan_id=ID OF THE PLAN&receive_operations_as=list

This endpoint is HTTP only.

This method will allow a worker that has been accepted into a cycle to request the download of a plan from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

The worker also needs to specify how it would like to receive plans: either as a list of operations ("list") or as TorchScript ("torchscript"), depending on the type of worker requesting (#437). This is set in the receive_operations_as parameter of the query string.

Response: This downloads the plan to the worker.

Protocol Download

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-protocol
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&protocol_id=ID OF THE PROTOCOL

This endpoint is HTTP only.

This method will allow a worker that has been accepted into a cycle to request the download of a protocol from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

Response: This downloads the protocol to the worker.

Model Download

Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-model
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&model_id=ID OF THE MODEL

This endpoint is HTTP only.

This method will allow a worker that has been accepted into a cycle to request the download of a model from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

Response: This downloads the model to the worker.
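The three download endpoints above share the same query-string shape, so a worker could derive all of its download URLs from the accepted cycle response. The sketch below assumes a hypothetical `buildDownloadUrls` helper and example IDs. The paths and parameter names are the ones listed in this issue.

```javascript
const RECEIVE_OPERATIONS_AS = ['list', 'torchscript'];

// Hypothetical helper: maps an accepted cycle response to download URLs.
function buildDownloadUrls(baseUrl, workerId, accepted, receiveOperationsAs = 'list') {
  if (!RECEIVE_OPERATIONS_AS.includes(receiveOperationsAs)) {
    throw new RangeError('receive_operations_as must be "list" or "torchscript".');
  }
  // Every download request carries the worker_id and the cycle-specific request_key.
  const common = { worker_id: workerId, request_key: accepted.request_key };
  const url = (path, extra) =>
    `${baseUrl}${path}?${new URLSearchParams({ ...common, ...extra }).toString()}`;
  const mapIds = (ids, toUrl) =>
    Object.fromEntries(Object.entries(ids).map(([name, id]) => [name, toUrl(id)]));

  return {
    plans: mapIds(accepted.plans, (id) =>
      url('/federated/get-plan', { plan_id: id, receive_operations_as: receiveOperationsAs })
    ),
    protocols: mapIds(accepted.protocols, (id) =>
      url('/federated/get-protocol', { protocol_id: id })
    ),
    model: url('/federated/get-model', { model_id: accepted.model_id }),
  };
}

// Example with made-up IDs.
const acceptedResponse = {
  status: 'accepted',
  request_key: 'k1',
  plans: { training_plan: 'p1' },
  protocols: { secure_agg_protocol: 'pr1' },
  model_id: 'm1',
};
const downloadUrls = buildDownloadUrls('https://localhost:3000', 'w1', acceptedResponse);
console.log(downloadUrls.model);
// https://localhost:3000/federated/get-model?worker_id=w1&request_key=k1&model_id=m1
```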

Report

Method in worker library: job.report()
HTTP endpoint: POST /federated/report
Socket "type": federated/report
Request data:

{
  "worker_id": "ID OF THE WORKER",
  "request_key": "LONG HASH VALUE",
  "diff": "FINAL MODEL DIFF FROM TRAINING"
}

This method will allow a worker that has been accepted into a cycle and finished training a model on their device to upload the resulting model diff.

If the worker was not given a protocol to execute after the plan(s), then they will simply submit their entire model diff. If they want to manually add noise to this diff as a layer of protection, they may do so at the developer's discretion from inside the worker implementation.

If the worker did execute a protocol and they have finished the secure aggregation protocol with other workers, they will now receive a share of the resulting securely aggregated model diff. In this case, they will submit the share of the diff, rather than their original model diff. PyGrid will handle the decryption of the shares once they're all submitted.

Response: { "status": "success" }

The success response is sent with an HTTP 200 status. The worker should not be informed whether the model diff was accepted or denied as part of the global model update.

@cereallarceny cereallarceny created this issue from a note in Model-centric Federated Learning (To do) Feb 5, 2020
@cereallarceny cereallarceny added the Type: Epic 🤙 Describes a large amount of functionality that will likely be broken down into smaller issues label Feb 5, 2020
@vkkhare
Member

vkkhare commented Feb 5, 2020

So one thing I understand is that we don't need to send workerId anymore over sockets. Once the authentication is finished and we have set up a websocket connection, we don't need it further. PyGrid can already store that info with the connection. HTTPS requests would require such an authentication parameter, mostly in the URL?

Also, the model, TorchScript, and protocols might be big, as mentioned earlier, so I think here, along with the type (list/torchscript), we should send the URL from which we can download the files. Which would probably be an HTTP connection!

@cereallarceny
Member Author

cereallarceny commented Feb 6, 2020 via email

@vkkhare
Member

vkkhare commented Feb 6, 2020

I assume protocols can get big if we have too many operations there. Nevertheless, TorchScript will be a few MB, so we would need an ID for it.

@cereallarceny
Member Author

cereallarceny commented Feb 6, 2020 via email

@cereallarceny
Member Author

@vkkhare @mjjimenez @mccorby @vvmnnnkv I've finished what I think is a good spec for the various endpoints that PyGrid needs to create. Please take a look and provide your thoughts here.

@vkkhare
Member

vkkhare commented Feb 6, 2020

I would like to know what is the expected size of diff? If it is a big file, which I think it should be, the worker will then initiate an upload process, and hence rather than being a one-off submission this would be a long process with status updates, retries, etc.

@cereallarceny
Member Author

cereallarceny commented Feb 6, 2020 via email

@vvmnnnkv
Member

vvmnnnkv commented Feb 8, 2020

I would like to know what is the expected size of diff?

@vkkhare it should be the same size as the model (given all model params are trainable :))

Thoughts on doc:

  1. I wonder why download requests are POSTs. Naturally these should be GETs. This enables caching, part-ranges.
  2. Not sure everything should be under /federated/ API scope. As I understand, Model hosting is already supported by pygrid, so there should be some endpoint already? In theory, Plans/Protocols might be used outside of FL, so we might want more "generic" API endpoints for them?
  3. There's a typo in Model Download section, the URL is get-protocol.
  4. I think we should avoid JSON for binary transfers (downloads: Model, Plan, Protocol; uploads: Model diff) because JSON envelope requires to encode binary with base64, which increases the size. Like @vkkhare suggested, these can be separate endpoints with binary content.
  5. I don't quite understand how this issue works together with Grid Standardization #447 that also has "get-protocol" endpoint. And the worker scopes for Protocol execution are not described here, but implemented there?
  6. How about using swagger, apib, or similar tools for describing APIs? This is easier to maintain than a text document, and we also get API docs auto-generation.
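The size increase mentioned in point 4 can be checked with a quick calculation: base64 emits 4 characters for every 3 input bytes, so a binary payload grows by roughly a third before any JSON overhead. The sketch below assumes Node.js (for `Buffer`) and uses a zero-filled buffer as a stand-in for a real model diff.

```javascript
// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Stand-in for a 3 MiB binary model diff (contents do not affect the encoded size).
const diffBytes = 3 * 1024 * 1024;
const encodedDiff = Buffer.alloc(diffBytes).toString('base64');

console.log(encodedDiff.length === base64Length(diffBytes)); // true
console.log((encodedDiff.length / diffBytes).toFixed(2)); // 1.33
```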

@cereallarceny
Member Author

@vvmnnnkv

  1. Changed
  2. I'd like to scope the endpoints by intention because there may be other ways to download a model that requires a different security protocol than the federated learning workflow. For now, I'd like to scope PyGrid according to intention.
  3. Changed
  4. That's fine with me. What are you suggesting we change then for these endpoints in the documentation above?
  5. Grid Standardization #447 is based on the old "grid.js" style of federated learning. Ignore this. It will be changed by this issue. We won't need "get-protocol" as it was implemented in grid.js anymore and thusly can remove it from PyGrid once we have these endpoints. The problem was that "get-protocol" was heavily related to how WebRTC was being done. Now that WebRTC is unrelated to the training process itself, and only related to secure aggregation protocols, the code that was merged into PyGrid is now out of date. This will be updated momentarily. :)
  6. We can in the future as a form of documentation, however, I believe the PyGrid team is working on implementing Read The Docs. If you want to create a separate issue for using Swagger on PyGrid, that's fine, but it's out of scope for this issue. Let's use the descriptions of the endpoints in the above issue description and be done with it. 😄

@vvmnnnkv
Member

vvmnnnkv commented Feb 8, 2020

Thanks!

  1. Downloads are already good! For "report" we can use POST multipart/form-data with 2 parts: application/json for JSON (w/o "diff"), application/octet-stream for binary "diff" content. Or move JSON contents into URL params for simplicity and just upload binary. Or have 2 endpoints: 1) upload binary that will return some uploadID, 2) report that uploadID - this separation might be useful if we'll need to use some other protocol for uploading a binary than simple POST (like @vkkhare mentioned "long process with status updates, retries").
  2. Then I'm confused what request will return scopeId and participants. The Protocol orchestration part - pygrid should group workers before they kick off WebRTC and Protocol execution?

@cereallarceny
Member Author

@vvmnnnkv @vkkhare I've updated this to force HTTP for all download-related endpoints. Websockets will not be allowed due to the lack of status checks.

@cereallarceny
Member Author

I've updated this issue to include support for multiple plans and protocols.

@cereallarceny
Member Author

Great questions @vvmnnnkv - here's my thinking.

Let's hold off on adding a checkpoint to the FL cycle request for now. The reason is that FL cycles are likely to be finished (and thus a new checkpoint would already be released) by the time most users would have updated their web application/mobile app. So by the time you're requesting a checkpoint, it will likely already be outdated. We can add this functionality later on as requested by users, but for now, that would get into some rather complicated branching. I'd rather force things to be sequential. If a developer really likes a particular checkpoint, they can always run get_model() and reupload to PyGrid as a new version. Fortunately, the FL cycle request already supports name and version, just not a checkpoint. 😄

Likewise, the get-model request will actually infer the appropriate version and checkpoint from the request_key. The request_key is like an API token that the worker will have that's specific to their worker, model, and individual cycle. So they won't be able to request a model that they're not explicitly permitted to download.

@vvmnnnkv
Member

@cereallarceny
Well I was asking about these params in cycle-request response body, not request params!
But if request_key is kind of session that holds cycle data and working in get-model, then this makes sense :)

@cereallarceny
Member Author

@vvmnnnkv Check out this data model to see if it answers your question. #481

@vvmnnnkv
Member

On the other hand, it may be beneficial for the worker to know the version/checkpoint to avoid re-downloading the model if it didn't change. In this case the worker needs to know the version/checkpoint after requesting a cycle.

@cereallarceny
Member Author

Sure. Let's do that later. I just want to get a base version of this issue working. Write this idea down and let's address it another day. :)

@cereallarceny
Member Author

I've updated this issue to include the newly updated authentication flow @Prtfw @vvmnnnkv @vkkhare @mjjimenez

@cereallarceny
Member Author

I've also updated this issue to include our connection speed testing workflow.

This was referenced Mar 12, 2020
Model-centric Federated Learning automation moved this from In progress to Done Mar 17, 2020
@cereallarceny cereallarceny reopened this Mar 19, 2020
Model-centric Federated Learning automation moved this from Done to In progress Mar 19, 2020
Model-centric Federated Learning automation moved this from In progress to Done Mar 31, 2020