Requirements: Orion state #30

Closed
bedeho opened this issue Dec 18, 2021 · 5 comments
Labels: enhancement New feature or request

bedeho commented Dec 18, 2021

Disclaimer: these are really rough high-level requirements that probably need to be refined and discussed; I am just putting them down in order to have all thoughts in one place.

Background

Orion currently does very naive view & follow counting for Atlas, but we know that this should be global public state that will live in a forthcoming shared data layer Joystream/joystream#2753. There are a variety of problems to which augmenting Orion is a natural solution:

  • Clients are currently storing, in local storage, who is followed and what content has been, or is being, consumed. This breaks the experience when the user moves to a new device or a new browser, or simply resets their browsing history.
  • In order for gateways to make credible updates to the shared data layer about consumption events, some representation of consumers is needed.
  • In order for gateways to provide transcoding & post-processing services for creators, some representation of creators is needed.
  • There is no way for gateway operators to efficiently filter out content (channels, videos, etc.) at scale, which they may be required to do as a matter of business objectives and law, depending on their user base.
  • There is no way to efficiently query across consumption data, content data and gateway operator filtering policy.
  • There is no unified product-centric API which economises on request throughput & latency for clients; as a result, Atlas often has to make dozens of requests on a single screen, which all take a long time to resolve, and there is no caching across different clients.

Requirements

Introducing an initial monolithic data representation in Orion puts us on a path to address these issues over time through a distinct application-centric & operator-specific API. It ingests data from:

  1. the query node: about core domain entities (videos, channels, members, playlists, nfts, tokens...)
  2. shared data layer: about consumption data (who watched what when from where etc., or similar with following)
  3. operator filtering policy: what content should be omitted
  4. operator featuring policy: what to display in prominent locations
  5. operator search & recommendation engine: providing results informed by global & user specific consumption data.
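To make the five ingestion sources above concrete, here is a minimal illustrative sketch of what a unified input stream into the Orion state could look like. All type and field names here are hypothetical, not part of any actual Orion schema:

```typescript
// Hypothetical discriminated union over the five ingestion sources listed above.
// Field names are illustrative only.
type OrionInput =
  | { source: 'query-node'; kind: 'video' | 'channel' | 'member'; id: string }
  | { source: 'shared-data-layer'; kind: 'view' | 'follow'; consumerId: string; targetId: string; at: number }
  | { source: 'operator-filtering'; targetId: string }
  | { source: 'operator-featuring'; targetId: string; slot: string }
  | { source: 'recommendation'; targetIds: string[] }

// A trivial dispatcher showing how a monolithic Orion state could route each input
// to a different piece of processing logic.
function routeInput(input: OrionInput): string {
  switch (input.source) {
    case 'query-node':
      return `upsert ${input.kind} ${input.id}`
    case 'shared-data-layer':
      return `record ${input.kind} of ${input.targetId} by ${input.consumerId}`
    case 'operator-filtering':
      return `hide ${input.targetId}`
    case 'operator-featuring':
      return `feature ${input.targetId} in ${input.slot}`
    case 'recommendation':
      return `cache ${input.targetIds.length} results`
  }
}
```

The point of the discriminated union is that all five sources feed one state machine, so cross-source queries (e.g. "most viewed non-filtered videos") become queries over a single store.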

This sloppy schema tries to summarize how the parts would fit together

(Diagram: orionstate)

bedeho commented Oct 29, 2022

Addendum

The virtue of the proposal above was that it aspired to leverage the already existing investment made into developing and testing the QN we already had; however, it did not specify exactly how the interaction between the two states would work. There are a few alternatives that come to mind, but they all have serious drawbacks.

  • Polling: At some regular interval, the Orion side asks for a full state snapshot of the query node side, and then tries to reconcile the difference between its own state and what it receives. The problems with this approach are that
    • it would be a very large query, covering all members, channels, videos, comments, playlists, tokens, events, etc., which would take a lot of resources to both service on the QN side and process on the Orion side, in particular because during this processing the state on either end must be locked for a consistent read and later update.
    • the logic for this reconciliation would possibly be very complex to reason about and maintain.
    • it would render invisible a possibly large number of state transitions on the on-chain side which we would want to use to trigger various kinds of Orion logic, such as emails, without having to effectively redo the base-level event processing which the QN has already done. Moreover, when such events are processed, they will reference entities that no longer exist, or that are in a different state.
  • Stream CUD Events: There is probably some way to hook into low-level Create, Update, Delete operations from the processor's Postgres DB, e.g. using some plugin, and then stream them into Orion, where Orion itself has a data model similar enough that it is possible to transform these original operations into Orion equivalents. In practice, once a block has been fully processed, a big stream of events would be sent all at once, as Hydra only commits processor side-effects when full blocks have been processed (I think?), in order to avoid corrupting state in the case of faults. With this approach, you still get to benefit from the processing logic of the QN, but you also get a proper event stream that solves the issues of polling. The problem with this approach is that the events you get are at a very low level of abstraction. A single blockchain event may involve multiple low-level DB operations, and it will start getting quite complicated to map groups of the latter to the former in order to allow Orion to understand high-level semantics for other purposes it may have.
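To illustrate why the reconciliation logic in the polling approach is non-trivial even in its simplest form, here is a bare-bones sketch of the snapshot diffing step it would require. Entity shape and function names are assumptions for illustration, and real entities would need per-field comparison rather than a single timestamp:

```typescript
// Minimal entity shape assumed for the sketch: an id plus a last-modified marker.
interface Entity {
  id: string
  updatedAt: number
}

// Diff a previous snapshot against a freshly polled one, classifying every
// entity as created, updated or deleted. This is the cheapest possible version;
// the real thing would have to do this across many entity types at once,
// under a consistent read of both states.
function reconcile(
  old: Map<string, Entity>,
  next: Map<string, Entity>
): { created: string[]; updated: string[]; deleted: string[] } {
  const created: string[] = []
  const updated: string[] = []
  const deleted: string[] = []
  for (const [id, entity] of next) {
    const prev = old.get(id)
    if (!prev) created.push(id)
    else if (prev.updatedAt !== entity.updatedAt) updated.push(id)
  }
  for (const id of old.keys()) {
    if (!next.has(id)) deleted.push(id)
  }
  return { created, updated, deleted }
}
```

Note that even this toy version discards all intermediate transitions: if a video was created and deleted between two polls, it simply never appears, which is exactly the "invisible state transitions" problem described above.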

Both of these have serious drawbacks, and in general there are disadvantages to using our existing QN with yet more infrastructure:

  1. It is not maintained by a large team.
  2. It is not performant, e.g. in its indexing/processing.
  3. It exposes Orion to problems that may emerge from all the mappings it is not interested in, e.g. bounties, proposals, forum, etc.
  4. It makes processing slower, as all the non-Orion-related events still must be fetched and processed.

One could address most of these last 4 issues by just using Subsquid instead; however, that still leaves the question from above of how. A variation not yet considered would be to natively use the Subsquid processor database as the same state that Orion uses, and just augment the data model (GraphQL schema files) with all the extra off-chain information one wants. This has a certain simplicity to it, but seems to have the major problem of not efficiently allowing multiple writers to the state, precisely because Subsquid/Hydra (I think!) takes a heavy lock over the database for each block it processes. It also seems one would probably need to add a new secondary API on top of the native Subsquid API anyway, in order to allow all sorts of mutations and endpoints that are not Subsquid native, and it would cause strong coupling with the API-generating features of Subsquid, which are likely to have at least as many inadequacies as Hydra.

A final variation would be to use Subsquid, but then do a version of the Stream CUD Events above. In this approach, the only thing Subsquid is used for is to index the chain & decode on-chain events, which are then submitted over a message broker that has the Orion node as a consumer. The message broker not only ensures no messages are lost or duplicated when either side of the connection fails, but also decouples processing of the chain from overall Orion processing. Input from the operator or end users then speaks to Orion on an equal footing with the on-chain events, and since the events are blockchain native, not low-level CUD events, Orion gets a high level of message abstraction.
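The message-broker variation rests on the consumer side being idempotent, since most brokers guarantee at-least-once rather than exactly-once delivery. A minimal sketch of what deduplication on the Orion side could look like, assuming each chain event carries a unique id (all names here are illustrative, not an actual Orion interface):

```typescript
// Assumed envelope for a decoded on-chain event delivered over the broker.
interface ChainEvent {
  id: string // assumed globally unique per event, e.g. blockNumber-indexInBlock
  name: string
  payload: unknown
}

// A consumer that tolerates redelivery: duplicate event ids are skipped,
// so replaying a batch after a fault cannot double-apply side effects.
class DedupingConsumer {
  private seen = new Set<string>()
  processed: string[] = []

  handle(event: ChainEvent): boolean {
    if (this.seen.has(event.id)) {
      return false // duplicate redelivery, already applied
    }
    this.seen.add(event.id)
    this.processed.push(event.name)
    return true
  }
}
```

A production version would persist the seen-set (or a high-water mark) in the same transaction as the state mutation, so that dedup state and Orion state cannot diverge across crashes.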

Now, how to adjudicate all of this? Some prototyping in close contact with the Atlas codebase and team is probably useful.

Lezek123 commented Nov 2, 2022

A variation not yet considered would be to just natively use the Subsquid processor database as the same state that Orion uses, and just augmenting the data model (graphql schema files) to have all the extra off-chain information one wants.

I'd opt for this solution

This has a certain simplicity to it, but seems to have the major problem of not efficiently allowing multiple writers to the state, precisely because Subsquid/Hydra (I think!) takes a heavy lock over the database for each block it processes.

It doesn't look like there's any "heavy lock", just standard SQL transactions, and the database transaction isolation level (used when processing a block / batch of blocks) is easily configurable in Subsquid. I think READ COMMITTED would be just fine for our use case, allowing any secondary API to update entities like Video etc. without causing the processor transaction to roll back if it tries to, for example, delete the same entity.

It seems one also would probably anyway need to add a new secondary API on top of the native Subsquid API, in order to allow all sorts of mutations and endpoints that are not Subsquid native, and it also will cause strong coupling with API-generating features of Subsquid which are likely to have at least as many inadequacies as Hydra

There is a way to extend the Subsquid-generated graphql api with custom resolvers:
https://docs.subsquid.io/develop-a-squid/graphql-api/custom-resolvers/

There appears to also be a way to add a custom request check function (not part of the docs though), possibly allowing us to introduce authorization for mutation endpoints.

A final variation would be to use Subsquid, but then do a version of the Stream CUD Events above.

I'm not sure if I fully understand this approach, but it seems very complex and probably unnecessary(?) given the findings above.

bedeho commented Nov 2, 2022

I'd opt for this solution

I am very concerned about getting stuck with yet another auto-generated API where the number of exceptions will just start to pile up when combining filtering, relationships, unions and all the rest of it. We have never even tried doing very basic stuff like aggregation and grouping, despite these being table-stakes queries for lots of normal API calls. Subsquid by default will have even less of this than Hydra, because no Subsquid users come anywhere close to our needs. Even if there is a way to add new queries with custom endpoints, the resulting API may start becoming quite messy as you stop having genuine canonical entity queries, because the default ones are too weak. Also, we would need to remove all sorts of queries which make no sense because they either expose private data, or are just not useful to the application developer directly. Correct me if I am wrong, but is there not also some serious performance issue with all the default generated resolvers for Subsquid, just as I believe was pointed out for Hydra?

It doesn't look like there's any "heavy lock", just standard SQL transactions

I'm not sure what you mean by this locking not being heavy, but perhaps you could describe in more detail exactly when locks are applied? I asked in the Subsquid/Hydra Telegram channel the other day, and they said they had switched to some new batch-based locking, but I did not get to the bottom of what that really meant. What I think would be good to understand a bit better is: in a scenario where there are lots of attempts to both read and write to the state from the operator, consumption actions and error events, how will whatever locking Subsquid natively does interact with that?

is easily configurable in Subsquid

How would one configure such a low-level choice as a Subsquid user, and why would there be any room for a different approach from one use case to another? Are we talking about forking Subsquid here?

appears to also be the way to add a custom request check function

I don't know what a request check function is; is that some GraphQL-level concept? I could not find anything on this. But in any case, the ability to add mutations will be a necessity, so it would have to be established early that we can do this for sure, ideally without having to start forking Subsquid.

I'm not sure if I fully understand this approach, but it seems very complex and probably unnecessary(?) given the findings above.

The idea is not very complex, just badly explained, but is more work: use Subsquid exclusively as a producer of finalized events/calls relevant to Orion, but separate out the public API, data model and processing to a separate system which treats all writers the same, regardless of what locking Subsquid does.

I think, if the locking thing is either totally irrelevant, or just a minor issue at small scales, then the API issue(s) would be my main concern really, because it is nice to keep it simple.

Lezek123 commented Nov 3, 2022

I am very concerned about getting stuck with yet another auto-generated API where the number of exceptions will just start to pile up when combining filtering, relationships, unions and all the rest of it. We have never even tried doing very basic stuff like aggregation and grouping, despite it being table stakes queries for lots of normal API calls.

Subsquid's API is easy to extend, as we can define our custom models, GraphQL resolvers etc., so the autogenerated API becomes just a base and saves us some development time compared to building a GraphQL API from scratch. From my (not very long at this point) experience with Subsquid, I find it way easier to work with than Hydra, which tried to sort of "do everything for you".

Correct me if I am wrong, but is there not also some serious performance issue with all the default generated resolvers for Subsquid, just as I believe was pointed out for Hydra?

I'm not sure what exact performance issue you have in mind, but it looks like there have been a lot of optimizations in terms of queries, processing speed etc.; we can also define custom SQL indexes, which are very helpful for speeding up certain queries.

Even if there is a way to add new queries with custom endpoints, the resulting API may start becoming quite messy as you stop having genuine canonical entity queries, because the default ones are too weak.

So I'm not exactly sure what the alternative is here; as I see it, we have a few options:

  • We use @subsquid/graphql-server and we get the default subsquid-generated entity queries + our custom resolvers, where we can freely define our own queries
  • We don't use @subsquid/graphql-server, but we build Orion's graphql server from scratch instead. In this case we need to define all the entity queries ourselves or perhaps try to reuse some @subsquid/graphql-server generation tools to get a good start and then adjust everything to our needs. This seems like much more work and much more code to maintain.
  • We use a fork of @subsquid/graphql-server
  • We try to get the functionality we need merged to @subsquid/graphql-server

I'm not sure what you mean by this locking not being heavy, but perhaps you could describe in more detail exactly when locks are applied?

There are no explicit locks, so the only locks that apply are those that are acquired by default by UPDATE, DELETE queries etc.

There is, however, an SQL transaction which wraps all db operations executed when processing a block or batch of blocks (processing batches is now the recommended approach). This means that if Subsquid is processing a batch of blocks where 1000 new videos were created, other queries that we execute during this time won't, by default, see any of those new videos until the entire batch is committed, but we can still modify the existing records in the database etc. (and the effects will be instant)

I've sent a link to the SQL documentation which describes the READ COMMITTED transaction isolation level: https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED, which we can use to prevent the transaction from rolling back when "conflicting" changes are to be committed (like, for example, in case we modified a video after the processor transaction that will later try to delete this video has started)

One caveat, however, is that if the processor transaction has already updated a row (say, a video with id = 1) that we now also want to update (for example, by incrementing the video.views counter), we'd need to wait until the processor transaction finishes, because it won't release the UPDATE lock (which is a row-level lock) until the end of the transaction.

How would one configure such a low level choice as a Subsquid user, and why would there be any room for difference in approach from one use to another? Are we talking about forking Subsquid here?

Basically, the Subsquid processor is now configurable programmatically: we can import a processor class (SubstrateBatchProcessor, for example), then instantiate and configure it in a file like this one: https://github.com/subsquid/squid-template/blob/main/src/processor.ts

All we need to do is add the isolationLevel option to the TypeormDatabase instance passed to processor.run:

processor.run(new TypeormDatabase({ isolationLevel: 'READ COMMITTED' }), async ctx => {
  // ...
})

A note about this can also be found in the subsquid documentation: https://docs.subsquid.io/develop-a-squid/substrate-processor/store-interface/#typeormdatabase-recommended

I don't know what a request check function is, is that some GraphQL level concept?

It's a plugin which is executed on the requestDidStart event (https://www.apollographql.com/docs/apollo-server/integrations/plugins/#responding-to-request-lifecycle-events); we can use it to intercept requests to the GraphQL server and add some additional validation, authorization etc. Here's an example I just created:

import { RequestCheckFunction } from '@subsquid/graphql-server/lib/check'

export const requestCheck: RequestCheckFunction = async (req) => {
    if (req.operation.operation === 'mutation' && !req.http.headers.get('x-admin')) {
        return 'Access denied'
    }
    return true
}

It requires an x-admin header to be provided in case the GraphQL operation type is mutation. If it's not provided, an Access denied error is returned.

Of course it's much more customizable than this.
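To show the check's logic in isolation from the @subsquid/graphql-server types, here is a self-contained restatement with the request object reduced to just the fields the check reads. The x-admin header and the shape of req.operation follow the example above; everything else (the MinimalRequest interface, the function name) is illustrative:

```typescript
// Reduced request shape: only the fields the check actually inspects.
// This interface is an illustration, not the real Apollo request context type.
interface MinimalRequest {
  operation: { operation: 'query' | 'mutation' | 'subscription' }
  headers: Map<string, string>
}

// Same decision logic as the requestCheck example above:
// mutations require an x-admin header, everything else passes.
function requestCheckSimplified(req: MinimalRequest): true | string {
  if (req.operation.operation === 'mutation' && !req.headers.get('x-admin')) {
    return 'Access denied'
  }
  return true
}
```

This makes it easy to unit-test the authorization rule without standing up the GraphQL server at all.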

But in any case, the ability to add mutations will be a necessity, so it would have to be established early that we can do this for sure, ideally without having to start forking Subsquid.

I tested it locally and it's possible by adding a custom graphql resolver, for example:

import { Args, ArgsType, Field, ID, Mutation, Query, Resolver } from 'type-graphql'
import { Video } from '../model'
import { EntityManager } from 'typeorm'

@ArgsType()
export class AddVideoViewArgs {
  @Field(() => ID)
  videoId: string
}

@Resolver()
export class VideoViewsResolver {
  // Set by dependency injection
  constructor(private tx: () => Promise<EntityManager>) {}

  @Mutation(() => Number, { description: "Add a single view to the target video's count" })
  async addVideoView(
    @Args() { videoId }: AddVideoViewArgs
  ): Promise<number> {
    const videoRepository = (await this.tx()).getRepository(Video);
    const video = await videoRepository.findOneBy({ id: videoId })
    if (!video) {
        throw new Error('Video not found')
    }
    video.views += 1
    await videoRepository.save(video)

    return video.views
  }

  @Query(() => Number, { description: 'Get number of views per video' })
  async getVideoViews(
    @Args() { videoId }: AddVideoViewArgs
  ): Promise<number> {
    const videoRepository = (await this.tx()).getRepository(Video);
    const video = await videoRepository.findOneBy({ id: videoId })
    if (!video) {
        throw new Error('Video not found')
    }
    return video.views
  }
}

Note that the resolver also needs to have at least one Query, however.
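From the client side, calling the addVideoView mutation above is an ordinary GraphQL POST. A minimal sketch of building the request document (no network call; the operation name and variable shape follow the resolver above, the builder function itself is illustrative):

```typescript
// Build the GraphQL request body for the addVideoView mutation defined above.
// Returns the standard { query, variables } shape a GraphQL server expects.
function buildAddVideoViewMutation(videoId: string): { query: string; variables: { videoId: string } } {
  return {
    query: 'mutation AddView($videoId: ID!) { addVideoView(videoId: $videoId) }',
    variables: { videoId },
  }
}
```

The resulting object would be JSON-serialized and POSTed to the squid's /graphql endpoint, with the x-admin header attached if the request check from the earlier example is in place.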

bedeho commented Nov 4, 2022

Fantastic work, let's proceed as you are suggesting then.

processing batches is now a recommended approach

What determines the boundaries of an individual batch? E.g. will two distinct Subsquid instances process the exact same sequence of batches, regardless of how each runs or is halted?
