Reverse AS-NS linking direction #2312

Closed
adriansmares opened this issue Apr 6, 2020 · 9 comments · Fixed by #3263
Labels
c/application server This is related to the Application Server c/network server This is related to the Network Server compat/api This could affect API compatibility in progress We're working on it scalability This could become a problem at scale

@adriansmares
Contributor

adriansmares commented Apr 6, 2020

Background

This section can be skipped if you are familiar with https://github.com/TheThingsIndustries/lorawan-stack/issues/941

In the current implementation of the communication between the Network Server and Application Server, the links are established by the Application Server, which links itself to the Network Server. Over this link, uplinks are sent as a stream of ttnpb.ApplicationUp.

The underlying issue with this approach is that Application Servers are inherently stateful due to linking. This introduces challenges with regard to load balancing between multiple Application Servers: distributing links over multiple instances, and maintaining or migrating links when instances go down or a network partition occurs, are just some of the issues introduced by the linking state.

Proposed solution

In order to tackle this issue, following our offline discussions, we've decided that we would like to reverse the direction of the links. This means that instead of the Application Server being a client to the Network Server, the Network Server becomes a client of the Application Server. This simplifies the logic for both components.

Since clients (MQTT/gRPC) may connect to an Application Server instance that the Network Server is not sending uplinks to, a new PubSub service should be used as a broker between Application Server instances. When a new application subscription (connection) arrives at an Application Server, a PubSub subscription is created for the uplinks of that application. When an Application Server instance receives an uplink, it publishes it to the topic of the application.
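To make the fan-out concrete, here is a minimal Go sketch of per-application topics, assuming an in-memory broker in place of the real PubSub service. The names `broker`, `Subscribe`, `Publish` and `Uplink` are hypothetical and not part of the lorawan-stack codebase; `Uplink` only stands in for ttnpb.ApplicationUp.

```go
package main

import (
	"fmt"
	"sync"
)

// Uplink is a hypothetical stand-in for ttnpb.ApplicationUp.
type Uplink struct {
	ApplicationID string
	DevEUI        string
	Payload       []byte
}

// broker is a hypothetical in-memory stand-in for the inter-AS PubSub service.
// Each application ID acts as a topic.
type broker struct {
	mu   sync.Mutex
	subs map[string][]chan Uplink
}

func newBroker() *broker {
	return &broker{subs: make(map[string][]chan Uplink)}
}

// Subscribe is called when a frontend (MQTT/gRPC) connects for an application
// on any AS instance. The returned channel receives that application's uplinks.
func (b *broker) Subscribe(appID string) <-chan Uplink {
	ch := make(chan Uplink, 16)
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs[appID] = append(b.subs[appID], ch)
	return ch
}

// Publish is called by whichever AS instance received the uplink from the NS.
// Every subscriber of the application's topic gets a copy; it returns how many did.
func (b *broker) Publish(up Uplink) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	delivered := 0
	for _, ch := range b.subs[up.ApplicationID] {
		select {
		case ch <- up:
			delivered++
		default:
			// Subscriber buffer is full: drop rather than block the publisher.
		}
	}
	return delivered
}

func main() {
	b := newBroker()
	sub := b.Subscribe("app-1") // frontend connected to AS instance 1
	// AS instance 2 receives the uplink from the NS and publishes it.
	n := b.Publish(Uplink{ApplicationID: "app-1", DevEUI: "0011223344556677"})
	up := <-sub
	fmt.Println(n, up.DevEUI)
}
```

The point of the design is that the instance receiving the uplink does not need to know where the frontends are connected; it only publishes to the application's topic.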

Implementation details:

  • HandleUplink should be added to a new NsAs service that the Application Server implements.
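```proto
import "google/protobuf/empty.proto"; // ApplicationUp comes from the lorawan-stack API messages.

service NsAs {
  rpc HandleUplink(ApplicationUp) returns (google.protobuf.Empty);
}
```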

We may want to revisit #1523 since we are already adding a new RPC.

  • The NS uses this RPC in order to push uplinks to the AS.
  • The NS uses component.GetPeer/component.GetPeerConn to obtain the peer/connection to the AS based on the application identifiers.
  • When the AS receives an uplink, it triggers the required webhooks and application packages. Only the AS that receives the uplink should send the webhooks and run the application packages.
    The uplink is then published to the PubSub service, from where subscription-based frontends can pick it up.
  • AS downlink queue operations no longer use the link connection, but rather the cluster peer connection.
  • The AS can use gocloud.dev/pubsub for inter-AS PubSub communication.

Required migrations

We need to migrate the application default payload formatters to a more general Application Server application store. The migration should be lazy: when we need the application payload formatter, we check whether we have it in the AS application store, and if it is missing, we attempt to retrieve it from the old link store. All writes/updates are done in the application store.

Open questions

  • Should we drop external NS linking completely?
  • Which PubSub service should we bundle by default? I personally propose NATS, but I do not have strong feelings about this.
  • How can AS PubSub integrations handle downlink operations? Internally we can ensure that uplinks are sent only once, using subscription groups, but I don't see how we can deduplicate downlink queue operations, since some services (MQTT) do not support subscription groups.
    Reworded: How can we decide which instance starts an external PubSub integration (NATS/MQTT), given that having each instance connect to every integration would result in duplicated downlink queue operations?
@adriansmares adriansmares added c/network server This is related to the Network Server needs/discussion We need to discuss this c/application server This is related to the Application Server compat/api This could affect API compatibility scalability This could become a problem at scale labels Apr 6, 2020
@adriansmares adriansmares added this to the April 2020 milestone Apr 6, 2020
@rvolosatovs
Contributor

rvolosatovs commented Apr 6, 2020

Why not NsAs.HandleUplink as we discussed before? (rvolosatovs/lorawan-stack@a514740#diff-23d97d1622d559c2eede816e5a8af414R105-R109). (FYI: I changed the type to accept a list of ApplicationUp in a more recent revision)

We want to be able to load-balance these and to do things concurrently. A stream that maps an NS to a single AS is definitely not the way to go.

Should we drop external NS linking completely?

Yes, that's what we agreed upon. The goal is to get rid of linking altogether.

@adriansmares
Contributor Author

adriansmares commented Apr 6, 2020

I've updated the issue.

I should have written this down earlier to keep all the details in mind.

@rvolosatovs
Contributor

We may want to revisit #1523 since we are already adding a new RPC.

I already started on this locally a couple of days ago, including NS support. Let's do this together with this issue; otherwise we'll later need to deprecate yet another RPC, which is not worth it here, since it's such a trivial feature to add with the NsAs interface in place.

@johanstokking
Member

Great.

  • The NS uses component.GetPeer/component.GetPeerConn to obtain the peer/connection to the AS based on the application identifiers.

We wouldn't tie the AS instance to application identifiers, right? We should be spreading traffic from applications with many devices over multiple AS instances.

  • The AS can use gocloud.dev/pubsub for inter-AS PubSub communication.

I like the idea of AS-AS communication via pub/sub. For the future, I think we should document why we chose AS-AS instead of a dedicated component for applications to subscribe to.

  • Which PubSub service should we bundle by default? I personally propose NATS, but I do not have strong feelings about this.

We can use the in-memory variant (see mempubsub) by default for single-process deployments and for the Getting Started guide. Indeed, for multiple processes, something lightweight makes the most sense, and NATS seems the best candidate there.

  • How can AS PubSub integrations handle downlink operations? Internally we can ensure that uplinks are sent only once, using subscription groups, but I don't see how we can deduplicate downlink queue operations, since some services (MQTT) do not support subscription groups.

What scenario is this? That an application schedules duplicate downlinks through multiple channels, e.g. MQTT and webhooks? I don't think we should (or want to) do anything about that.

@adriansmares
Contributor Author

adriansmares commented Apr 7, 2020

  • The NS uses component.GetPeer/component.GetPeerConn to obtain the peer/connection to the AS based on the application identifiers.

We wouldn't tie the AS instance to application identifiers, right? We should be spreading traffic from applications with many devices over multiple AS instances.

True: the "based on the application identifiers" part can probably be left out here.

  • How can AS PubSub integrations handle downlink operations? Internally we can ensure that uplinks are sent only once, using subscription groups, but I don't see how we can deduplicate downlink queue operations, since some services (MQTT) do not support subscription groups.

What scenario is this? That an application schedules duplicate downlinks through multiple channels, e.g. MQTT and webhooks? I don't think we should (or want to) do anything about that.

No, and now I see that the original text wasn't clear. It's about which instance starts which PubSub integrations (the ones we currently have with MQTT and NATS) in multi-AS environments. We cannot have multiple instances subscribe to the same external PubSub service, since downlink queue operations would then be duplicated:

  • Two AS instances
  • AS1 connects to NATS1
  • AS2 connects to NATS1 (same integration if we don't do some kind of distribution)
  • A user publishes a downlink push to NATS
  • Both AS instances pick it up and push the downlink twice

NATS does have subscription groups (queue groups), so for NATS this could be solved (although I don't think gocloud exposes them), but MQTT does not (yet).

@johanstokking
Member

I see what you mean. Yes, that is something to account for. We need some sort of sharding. That, however, touches on clustering, which we don't discuss here.
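One possible form of that sharding, sketched in Go under the assumption that every AS instance sees the same list of instance IDs (obtaining that list is the clustering part left out here): rendezvous (highest random weight) hashing assigns each external integration to exactly one instance, so only that instance connects to it and a downlink push is handled once. The `owner` function and the integration ID format are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// owner picks the AS instance that should run the external PubSub integration
// identified by integrationID, using rendezvous (highest random weight) hashing.
// All instances compute the same owner from the same member list, so only one
// instance connects to each integration. It returns "" if there are no instances.
func owner(integrationID string, instances []string) string {
	var best string
	var bestScore uint64
	for i, inst := range instances {
		h := fnv.New64a()
		h.Write([]byte(inst))
		h.Write([]byte{0}) // separator between instance ID and integration ID
		h.Write([]byte(integrationID))
		score := h.Sum64()
		// Ties are broken by instance ID so the result does not depend on list order.
		if i == 0 || score > bestScore || (score == bestScore && inst < best) {
			best, bestScore = inst, score
		}
	}
	return best
}

func main() {
	instances := []string{"as-1", "as-2"}
	for _, id := range []string{"app-1/nats", "app-2/mqtt", "app-3/nats"} {
		fmt.Println(id, "->", owner(id, instances))
	}
}
```

When an instance leaves, only the integrations it owned move to other instances; all other assignments stay where they are.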

What else needs to be discussed here to move forward with this?

@adriansmares
Contributor Author

For now, all of the open questions have been addressed (to the extent of this repository). I will remove the discussion label.

@adriansmares
Contributor Author

adriansmares commented Sep 4, 2020

I've pushed my current progress in #3190. I think I've covered most (if not all) of the AS implementation, but I have some questions regarding the API:

  • What do we do with the GetLinkStats call? Does it return an error? Does it return an empty response? @johanstokking

@johanstokking
Member

johanstokking commented Sep 8, 2020

  • What do we do with the GetLinkStats call? Does it return an error? Does it return an empty response? @johanstokking

  • Return Unimplemented
  • Deprecate usage in CLI
  • Remove usage from Console

4 participants