
Initial SBD Client and Server #2

Merged: 34 commits into main from initial, May 7, 2024

Conversation

@neonphog (Collaborator) commented Apr 22, 2024

RESOLVES: #6
RESOLVES: #7
RESOLVES: #10
RESOLVES: #12

The autobahn-style runners are largely placeholders at the moment, with single smoke tests in them. We'll need to fill those out in a follow-on PR.

I'm going to copy the spec.md file into this PR description so you can see a rendered version of it:


SBD Spec

This SBD spec defines the protocol for SBD servers and clients to communicate.
SBD is a simple websocket-based message relay protocol.

1. Websocket Stack

The SBD protocol is built upon websockets.

1.1. Websocket Configuration

1.1.1. Message and Frame Size

The maximum SBD message size (including 32 byte header) is 20000 bytes.

SBD clients and servers MAY set the max message size in the websocket library to 20000 to help enforce this.

The maximum frame size MUST be set larger than 20000 so that SBD messages always fit in a single websocket frame.
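
A minimal sketch of enforcing these limits, assuming a tungstenite-style websocket config (the exact field names and defaults vary across libraries and versions, so treat this as illustrative, not the crate's actual setup):

```rust
use tungstenite::protocol::WebSocketConfig;

fn sbd_ws_config() -> WebSocketConfig {
    let mut config = WebSocketConfig::default();
    // Reject anything larger than the 20000 byte SBD maximum
    // (32 byte header included).
    config.max_message_size = Some(20_000);
    // Keep the frame limit above the message limit so every SBD
    // message fits in a single websocket frame.
    config.max_frame_size = Some(32_768);
    config
}
```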

2. Cryptography

2.1. TLS

Over the WAN, SBD servers SHOULD generally be available over TLS (wss://); on a LAN, plain ws:// is acceptable.

2.2. Ed25519

Clients will be identified by an ed25519 public key. Client sessions will be validated by an ed25519 signature.

3. Protocol

3.1. Connect Path

Clients MUST specify exactly one HTTP path segment on the websocket connection URL.
This segment MUST be the base64url encoded public key that this client will be identified by.
This public key MUST be unique to this new connection.
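
As an illustration, a client could build the connect URL like this (a sketch assuming the `base64` crate and unpadded base64url; the spec above does not state whether padding is used):

```rust
use base64::Engine;

/// Build an SBD connect URL from a raw 32 byte ed25519 public key.
fn connect_url(host: &str, pub_key: &[u8; 32]) -> String {
    let seg = base64::engine::general_purpose::URL_SAFE_NO_PAD.encode(pub_key);
    format!("wss://{host}/{seg}")
}
```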

3.2. Messages

3.2.1. Header

SBD messages always begin with a 32 byte header. Messages shorter than 32 bytes are invalid.

If the header starts with 28 zero bytes, the message is a "command". Otherwise, the message is a "forward".

If the header is a "command" the next four literal ascii bytes are interpreted as the command type:

  • lbrt - limit byte nanos - 4 byte i32be limit - server sent
  • lidl - limit idle millis - 4 byte i32be limit - server sent
  • areq - authentication request - 32 byte nonce - server sent
  • ares - authentication response - 64 byte signature - client sent
  • srdy - server ready - no additional data - server sent
  • keep - keepalive - no additional data - client sent

If the header is a "forward" type, the 32 byte header is interpreted as the public key to forward the message to.
The remaining bytes in the message are the data to forward.
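
Put together, classifying an incoming message could look like the following sketch (illustrative types; the crates in this PR may structure this differently):

```rust
/// A parsed SBD message (hypothetical type for illustration).
enum Msg<'a> {
    Command { cmd: [u8; 4], payload: &'a [u8] },
    Forward { peer_pub_key: &'a [u8], payload: &'a [u8] },
}

fn classify(msg: &[u8]) -> Option<Msg<'_>> {
    // Messages shorter than the 32 byte header are invalid.
    if msg.len() < 32 {
        return None;
    }
    if msg[..28].iter().all(|&b| b == 0) {
        // Command: bytes 28..32 are the literal ascii command type.
        let mut cmd = [0u8; 4];
        cmd.copy_from_slice(&msg[28..32]);
        Some(Msg::Command { cmd, payload: &msg[32..] })
    } else {
        // Forward: the full 32 byte header is the peer public key.
        Some(Msg::Forward { peer_pub_key: &msg[..32], payload: &msg[32..] })
    }
}
```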

3.2.2. Forward

When a client sends a "forward" message to the server, the first 32 bytes represent the
peer the message should be forwarded to.

When a server sends a "forward" message to a client, the first 32 bytes will have been altered
to represent the peer from which the message originated.

3.2.3. Flow

  • The server MUST send areq with a random nonce once for every new opened connection.
    The server MAY send any limit messages before or after this areq, but it MUST come before the srdy.
  • The client MUST respond with a signature over the nonce by the private key associated with the public key
    sent in the URL path segment of the websocket request.
  • If the signature is valid the server MUST send the srdy message.
  • After receiving srdy the client MAY begin sending "forward" type messages.
  • At any point, the server MAY send an updated lbrt message.
  • At any point after srdy, the server MAY send "forward" type messages to the client.
  • At any point after srdy, the client MAY send a keep message.

```mermaid
sequenceDiagram
    S->>C: lbrt
    S->>C: lidl
    S->>C: areq
    C->>S: ares
    S->>C: srdy
    C->>S: "forward"
    S->>C: "forward"
    S->>C: lbrt
    C->>S: keep
```
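
On the client side, answering the areq could look like this sketch (assuming the ed25519-dalek crate; the websocket send itself is omitted):

```rust
use ed25519_dalek::{Signer, SigningKey};

/// Build the `ares` message for a received `areq` nonce.
fn auth_response(key: &SigningKey, areq_nonce: &[u8; 32]) -> Vec<u8> {
    // Sign the nonce with the private key matching the public key
    // sent in the connect URL path segment.
    let sig = key.sign(areq_nonce);
    // Header: 28 zero bytes plus the ascii command type, then the
    // 64 byte signature as the command data.
    let mut msg = vec![0u8; 28];
    msg.extend_from_slice(b"ares");
    msg.extend_from_slice(&sig.to_bytes());
    msg
}
```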

3.2.4. lidl and Keepalive

The lidl "Limit Idle Millis" message is the millisecond count the server will keep a connection around
without having seen any messages sent from that client. If a client does not have any forward messages
to send within this time period and wishes to keep the connection open they SHOULD send a keep message
to maintain the connection. Note this keep message will be counted against rate limiting.
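
For instance, a client might schedule keepalives at some fraction of the lidl value; the one-half factor below is an arbitrary illustrative safety margin, not something the spec mandates:

```rust
use std::time::Duration;

/// Send a keep message at half the idle limit to leave headroom for
/// network latency and rate-limit pacing (illustrative policy).
fn keepalive_interval(lidl_millis: u32) -> Duration {
    Duration::from_millis(lidl_millis as u64 / 2)
}
```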

3.2.5. lbrt and rate limiting

The lbrt "Limit Byte Nanos" message indicates the nanoseconds of rate limiting used up by a single byte sent.
This is intuitively backwards of a "limit" because higher values indicate you need to send data more slowly,
but this direction is easier to work with in code. That is, if lbrt is 1, you can send one byte every nanosecond.
A more reasonable librt value of 8000 means you can send 1 byte every 8000 nanoseconds. Or more reasonably,
if you send a 16000 byte message, you should wait 8000 * 16000 nanoseconds before sending the next message.
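
The client-side pacing math is then a single multiplication (illustrative names):

```rust
use std::time::Duration;

/// How long to wait after sending `message_len` bytes, given the
/// server's current lbrt value.
fn wait_after_send(lbrt_nanos: u64, message_len: usize) -> Duration {
    // Each byte consumes lbrt nanoseconds of rate budget.
    Duration::from_nanos(lbrt_nanos * message_len as u64)
}

// e.g. wait_after_send(8_000, 16_000) == Duration::from_millis(128)
```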

A server MUST provide a "burst" grace window to account for message size.

A server MAY track rate limiting by some metric other than the individual connection: IP address, for example.
Then, if additional connections are established sharing that metric, all of them could be notified that they
need to send data more slowly.

3.2.6. Extensibility

In order to make this protocol extensible without versioning, clients and servers MUST ignore unknown command types.
(With the exception that servers should still count the raw bytes in rate limiting.)

3.3. Violations

If a server receives an invalid message from a client it MUST immediately drop the connection with no closing frame.

If a server receives a message that violates the rate limit, the connection MUST similarly be dropped with no closing frame.
The server MAY also block connections (perhaps by IP address) for an unspecified amount of time.

rust/sbd-bench/src/c_turnover.rs (resolved)
rust/sbd-bench/src/thru.rs (resolved)
rust/sbd-client/src/lib.rs (resolved, outdated)
rust/sbd-client/src/raw_client.rs (resolved, outdated)
rust/sbd-server/src/cmd.rs (resolved)
spec.md (resolved)
spec.md (resolved, outdated)
spec.md (resolved, outdated)
spec.md (resolved)
neonphog and others added 2 commits April 22, 2024 13:46
Co-authored-by: ThetaSinner <ThetaSinner@users.noreply.github.com>
@ThetaSinner (Contributor) left a comment:

I think this looks good. It's much clearer on a second read, having been through the spec. It's still a lot of code in one go though :)

rust/sbd-client/src/lib.rs (resolved, outdated)
rust/sbd-client/src/lib.rs (resolved)

match match data.parse() {
Ok(data) => data,
Err(_) => break,
Contributor:

Should the write_task also stop if the read_task gets a bad message from the server?

Contributor:

Maybe it does and I'm missing how they're linked?

Member:

i'm curious about the error cases here as well, because they silently break. @neonphog please explain why this is, and if appropriate make the errors visible at runtime somehow.

Collaborator (author):

Yep, when it breaks out of the loop it closes the connection (through the send_buf2.lock().await.close().await call below).

There really aren't a lot of varied error cases... If you were able to establish the connection, but now it no longer exists, all you know is that the connection was closed. By design, you don't know it was closed due to rate-limiting, for example (because we don't want to waste system resources sending data to a potential denial-of-service node).

rust/sbd-client/src/raw_client.rs (resolved)
let now = self.origin.elapsed().as_nanos() as u64;

// first check if we need to keepalive
if now - self.last_send >= self.idle_keepalive_nanos {
Contributor:

This is skipping the next_step_dur check which seems right to me. The keepalive shouldn't wait on the rate limiting. So maybe the check above isn't needed and the docs for this function can be updated?

@steveej (Member):

i read "Call `next_step_dur()` first" as an instruction to the user. the keepalive check is indeed redundant to the one in that function.

i wonder if it could be simplified by having a single async fn that uses an async sleep to wait and send rather than needing to use two functions sequentially as can be seen here: https://github.com/holochain/sbd/pull/2/files#diff-55f2f95932add9c4e180147077386efd04a8e1ef6635e49bb6fcf514351105b5R125-R128

Collaborator (author):

These functions are called from two distinct code paths across multiple threads/tasks. First, if we have filled up our send buffer, they will be called in a back-pressure loop from the user-facing public send function, potentially many concurrent times (the link @steveej provided). Second, from a single write task that makes sure buffered data gets sent out even if the app isn't doing any follow-on sends.

I decided the best way to do that from the two paths was to have a non-mutable function that would tell you how long to wait before calling the mutable function that may or may not do anything on account of it potentially getting false positives on the wait side.

Let me know if you think there is a clearer way to do this, or a clearer way to document it.

Member:

> I decided the best way to do that from the two paths was to have a non-mutable function that would tell you how long to wait before calling the mutable function that may or may not do anything on account of it potentially getting false positives on the wait side.

this split makes some intuitive sense to me, however i've got a hunch that there isn't much practical gain from it. not worth discussing much more at this point, so i'm complete on this thread; as i wasn't the starter i won't close it.

rust/sbd-server/src/cmd.rs (resolved, outdated)
rust/sbd-server/src/cmd.rs (resolved, outdated)
rust/sbd-server/src/cslot.rs (resolved)
@jost-s commented Apr 23, 2024:

@neonphog What does SBD stand for? I suggest mentioning that somewhere =)

@steveej (Member) left a comment:

first review pass, up to but excluding rust/sbd-client/src/raw_client.rs

.github/workflows/static.yml (resolved)
rust/sbd-bench/src/c_turnover.rs (resolved)
rust/sbd-bench/src/thru.rs (resolved)
rust/sbd-client/src/lib.rs (resolved, outdated)
rust/sbd-client/src/lib.rs (resolved, outdated)
rust/sbd-client/src/send_buf.rs (resolved, outdated)

rust/sbd-client/src/send_buf.rs (resolved)
rust/sbd-client/src/send_buf.rs (resolved, outdated)
rust/sbd-client/src/send_buf.rs (resolved)
@neonphog (Collaborator, author) commented Apr 23, 2024:

@jost-s

> @neonphog What does SBD stand for? I suggest mentioning that somewhere =)

1856628

SBD doesn't stand for anything. Imagine it however you like. Secure By Design... STUN By Default... Silent But Deadly...


SBD messages always contain first a 32 byte header. Messages less than 32 bytes are invalid.

If the header starts with 28 zero bytes, the message is a "command". Otherwise, the message is a "forward".
@maackle (Member):

If I did the math right (2^28), that puts the probability of a "forward" being misinterpreted as a "command" at around 1 in 268,000,000. That's just the floor -- the probability goes up if e.g. a proof-of-work scheme is used which requires leading zeroes for pub keys.

What are the implications of this misinterpretation? How broadly will those effects spread across the entire network? If a network has 10k agents, the probability of a bad agent key is 1 in 30k, which is beginning to look risky if the effects are widespread.

Collaborator (author):

@maackle - though derived from a random seed, ed25519 pub keys are not random numbers; my impression is that it would be even more unlikely to result in 28 initial zeros. But regardless, I can put in a simple check to reject public keys that start with 28 zeroes.

Collaborator (author):

Done: fade996

@neonphog neonphog mentioned this pull request Apr 26, 2024
@steveej (Member) left a comment:

i do have lots of comments and am still vague around rust/sbd-server/src/cslot.rs.

however, i'm content with this being merged to proceed with testing, while also having my comments addressed either via follow-up issues or explained even post-merge.

@@ -0,0 +1,27 @@
#[test]
fn suite() {
Member:

any idea how long the builds take? because they're counted as test runtime, this may appear as a slow test.

ideally we'd build the binaries before running this test. i'm interested in creating a follow-up issue for this comment to see whether we can establish a better pattern for this on CI.

Collaborator (author):

totally dependent on system resources; on my system it's around 30 seconds after deleting target, but caching obviously makes this take no time at all... We could add a job ahead of the test job that causes them to be built with the correct settings, but I don't know how easy it would be to keep those settings in sync. Feels pretty low priority to me.

Text(s) => return Ok(Payload::Vec(s.into_bytes())),
Binary(v) => return Ok(Payload::Vec(v)),
Ping(_) | Pong(_) => (),
Close(_) => return Err(Error::other("closed")),
Member:

i'm surprised to see an error returned here, as it looks like closing a connection is a regular part of the flow.

Collaborator (author):

Yes, this could be an Option instead of a Result, but it's an internal API, and other backends might have more relevant errors to report.

rust/sbd-server/src/ws/ws_tungstenite.rs (resolved)
rust/sbd-server/src/ws/ws_fastwebsockets.rs (resolved)
.serve_connection(
io,
hyper::service::service_fn(move |mut req| {
// TODO report the PUBKEY here
Member:

TODO for when? does it justify a follow-up issue?

Collaborator (author):

the fastwebsockets backend currently doesn't even compile, since it was a dead end. I didn't want to lose the work in case it's viable some day, but I don't think it needs to be reviewed right now.

Member:

i suspect we'll lose that context over time and it deserves to be mentioned in the docstring

rust/sbd-server/src/ip_deny.rs (resolved, outdated)
rust/sbd-server/src/ip_rate.rs (resolved)
rust/sbd-server/src/cslot.rs (resolved)
let rate_send_list = {
let list = {
// WARN - allocation here!
// Also, do we want to limit the max connections from same ip?
@steveej (Member) commented May 7, 2024:

i don't think we should start out with such a limit, because it would make it difficult for groups that share the same public IP to communicate beyond their local group.

i think the limit should somehow understand unique users, e.g. via public IP key.

Collaborator (author):

@steveej - I can't make out the typo here. What did you mean to suggest, if not public IP?

Member:

sorry, quite the typo 😆 i meant public key (corrected above as well)

Collaborator (author):

Ah, gotcha : ) - Yeah, the main concern here would be an attacker blowing out our memory by opening up a bunch of connections all from the same IP address, since the Vec storing those is currently unbounded. Distinguishing by key wouldn't mitigate that, since all the connections would have unique keys.

.await;

if auth_res.is_err() {
return;
Member:

would it be helpful to eprintln this for debugging?

Collaborator (author):

I'm experimenting with not having tracing enabled on the server. There is overhead to tracing that I'd like to avoid if possible. Separately, we may someday want some kind of audit log that could show when clients connect and when/why they are disconnected. This data could make it into that if we choose to implement that.

neonphog and others added 2 commits May 7, 2024 13:11
Co-authored-by: Stefan Junker <1181362+steveej@users.noreply.github.com>
@neonphog neonphog merged commit 1d0af9f into main May 7, 2024
5 checks passed
@neonphog neonphog deleted the initial branch May 7, 2024 19:23