[RFC] Storing Client Data #2901
In fact, I wonder if the current executable should be split into
I agree. We should have a separate process for client queries and data storage. However, that makes the code more complicated: it's convenient to run these queries in the same process because they operate on the in-memory transition frontier and transaction pool. I am still thinking about how to overcome these issues, and about the other tradeoffs.
I have a few issues with the approach outlined in the RFC, as well as with how this RFC structures the problem at hand. Firstly, this RFC is too large in scope. Multiple decisions are outlined and detailed in it, and I think it will be easier to discuss them if we break this up into two parts. Tentatively, for this comment, I am going to divide my feedback and criticism into two sections: Architecture and API (I believe this would also be the best division for splitting this RFC into two discussions).
The implementation that is laid out for solving this problem is extremely complex. Caching, in particular, is very hard to get correct and even harder to debug. And, at this point in time, we don't actually know whether we need caching at all. Who is to say that reading from a DB is too slow? Many APIs and webservers exist and scale without in-process caching, and there are more modular solutions for adding caching in front of database requests that are typically considered before adding caching to the server itself. For us in particular, this API isn't meant to scale nearly as much as a web API since, under the current model, each node will only have one wallet client communicating with it.
I'm not saying the model you have outlined is wrong or won't work, but rather that it's very complex and likely overkill for what you are trying to achieve. It's better to do this kind of thing incrementally: start with a system that just persists to a relational database, and add caching only when and where you need it. Who knows, maybe none of the things you have implemented caching for right now actually need it, while later, as use cases change, we will find something new that really does. It will be easier to add that later if we keep it simple now.
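To make the incremental argument concrete, here is a minimal Python sketch (the table, function, and key names are hypothetical, not from the RFC) of starting with plain database reads and layering a cache on a single query only if it later proves hot:

```python
import functools
import sqlite3

# Plain persistence first: no caching anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (key TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO balances VALUES ('pk1', 500)")

def get_balance(key: str) -> int:
    """Read straight from the database."""
    row = conn.execute(
        "SELECT amount FROM balances WHERE key = ?", (key,)
    ).fetchone()
    return row[0]

# Later, if (and only if) profiling shows this read is a bottleneck, a cache
# can be layered onto just this one query without touching the rest of the
# server:
cached_get_balance = functools.lru_cache(maxsize=1024)(get_balance)
```

The cache here is a separate wrapper rather than a property of the whole server, which is the kind of modular, after-the-fact addition being argued for.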
Modularity/Generalization of Architecture
The design you have specced out is very specific to surfacing a GraphQL API for live queries against the state of a node. However, I would suggest generalizing this architecture somewhat and making it more modular. For instance, it can be broken into two parts: an archiver process (which subscribes to information from the daemon and persists it to disk) and an API process (which serves an API from the on-disk database). There are a number of advantages to architecting it this way, which I will outline below.
Right now, our goal is to get an API for our wallet. However, we have other long-term goals related to archiving information from a node and providing interfaces to query that information. We want to provide services for long-term storage of network activity (for monitoring, analysis, block exploring, etc.), and we want to build something for providing receipts as well. The archiver process can be shared generically among all these use cases, and can easily be made configurable through eviction rules and data filtering. For instance, for a long-term archiver, there would be no eviction rules, and no data would be filtered. For the wallet API, the eviction rules would evict information that is invalidated and not important to the wallet user, and the filter would keep only information relevant to the wallet keys it knows about.
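As a rough illustration of that configurability, here is a Python sketch; `Archiver`, `Diff`, and the `keep`/`evict` callbacks are names invented for this comment, not anything in the codebase:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Diff:
    """A unit of information streamed from the daemon (illustrative)."""
    kind: str            # e.g. "block", "transaction"
    relevant_keys: set   # wallet keys this diff touches
    invalidated: bool = False

@dataclass
class ArchiverConfig:
    # Return True to store an incoming diff; None means store everything.
    keep: Optional[Callable[[Diff], bool]] = None
    # Return True to evict a stored diff; None means never evict.
    evict: Optional[Callable[[Diff], bool]] = None

class Archiver:
    def __init__(self, config: ArchiverConfig):
        self.config = config
        self.store = []  # stands in for the on-disk database

    def receive(self, diff: Diff) -> None:
        if self.config.keep is None or self.config.keep(diff):
            self.store.append(diff)

    def compact(self) -> None:
        if self.config.evict is not None:
            self.store = [d for d in self.store if not self.config.evict(d)]

# Long-term archiver: no filtering, no eviction.
long_term = Archiver(ArchiverConfig())

# Wallet archiver: keep only diffs touching known wallet keys,
# and evict diffs once they are invalidated.
wallet_keys = {"pk1"}
wallet = Archiver(ArchiverConfig(
    keep=lambda d: bool(d.relevant_keys & wallet_keys),
    evict=lambda d: d.invalidated,
))
```

The same archiver code serves both use cases; only the two callbacks differ.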
Under this model, where the API server is separate and only serves information from a persistent database, anyone can make an API in any format surfacing the information in our wallet. They can even directly access the underlying DB if they wish to. This is great for our community members because, even though GraphQL is popular and I think we should focus on providing a GraphQL API early on, different developers may have different needs or wants. This also creates a great avenue for easy work that first-time devs can take on, so that's a plus as well.
With this separation, we can focus on implementing and testing the archiver first, in isolation. If we want, we can even write applications against it early on to test this lower layer of our data architecture. If the underlying DB we use is SQL-based, then we can even just use a tool like https://github.com/rexxars/sql-to-graphql to generate a GraphQL server for us from the SQL schema we have, meaning we get a GraphQL API for free. If we want more custom queries, we can always implement something ourselves, but using a generated server as a basis will probably make this a lot easier. We can just make a dead simple node app that extends a pregenerated GraphQL-to-SQL server with some extra functionality, keeping the code for the custom GraphQL API very small.
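To illustrate the schema-to-API idea (a toy Python sketch, not the sql-to-graphql tool itself, and with an invented table), a GraphQL type definition can be derived mechanically from a SQL table's columns:

```python
import sqlite3

# Rough mapping from SQLite column types to GraphQL scalar types.
SQL_TO_GQL = {"INTEGER": "Int", "TEXT": "String", "REAL": "Float"}

def graphql_type_for(conn: sqlite3.Connection, table: str) -> str:
    """Emit a GraphQL object type from a table's column metadata."""
    fields = []
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
    for _cid, name, sql_type, notnull, _dflt, _pk in conn.execute(
        f"PRAGMA table_info({table})"
    ):
        gql = SQL_TO_GQL.get(sql_type.upper(), "String")
        fields.append(f"  {name}: {gql}{'!' if notnull else ''}")
    return "type %s {\n%s\n}" % (table.capitalize(), "\n".join(fields))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (id INTEGER NOT NULL, state_hash TEXT NOT NULL)"
)
sdl = graphql_type_for(conn, "blocks")
print(sdl)
```

A real generator would also wire up resolvers that translate GraphQL queries into SQL, but the point is that the whole API can be driven from the schema we already have to write anyway.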
Integration with Existing Abstractions
This RFC does not go into much detail about how the "client process" receives information from the daemon process. In particular, I think the "client process" should receive information from the daemon through a transition frontier extension, similar to how transition frontier persistence works. This is hinted at in the document, but I would appreciate a more explicit design. Should it share an extension with persistence? Should it live behind a diff buffer, similar to persistence? What are the flushing rules for that buffer? Do they match the persistence rules, or do we need two diff buffers in memory? Etc.
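To make the diff-buffer questions concrete, here is a hypothetical Python sketch of one possible set of flushing rules (flush when the buffer fills, and flush immediately on a root transition); none of these names or rules come from the RFC, they are just the kind of thing I would like to see specified:

```python
class DiffBuffer:
    """Buffers diffs from a transition frontier extension before handing
    them to a consumer (persistence, archiver, etc.)."""

    def __init__(self, flush_fn, max_size=3):
        self.flush_fn = flush_fn   # delivers buffered diffs downstream
        self.max_size = max_size
        self.pending = []

    def write(self, diff):
        self.pending.append(diff)
        if len(self.pending) >= self.max_size:
            # Rule 1: flush when the buffer fills up.
            self.flush()
        elif diff.get("type") == "root_transitioned":
            # Rule 2: flush immediately on a root transition, so the
            # consumer's view of the frontier root never lags.
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(list(self.pending))
            self.pending.clear()

sent = []
buf = DiffBuffer(sent.append, max_size=3)
buf.write({"type": "new_breadcrumb"})
buf.write({"type": "root_transitioned"})  # triggers an immediate flush
```

Whether the client process shares one such buffer with persistence or needs its own (with different rules) is exactly the design decision the RFC should spell out.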
A use case for our GraphQL API that has been discussed in the past is subscriptions. Do you intend to support that under your model? That decision could have ramifications for my suggestion to split the archiver from the API server.
This is mentioned in one of my comments (we should have that discussion there instead of on this comment thread), but I think you should make a stronger argument for the SQLite choice. I personally think we should choose a more robust and performant SQL DB, such as PostgreSQL or MariaDB/MySQL.
Most of my other questions and criticism for the API are covered in individual comments on the RFC document.
Schmavery left a comment
As a takeaway from our meeting today: this seems mostly OK to me, though I think it should concern itself more with the design of the archive node than with the requirements of scaling the API, since it's unclear what the long-term plans should be there anyway (whereas we seem to clearly understand the requirements of an archive node). This might mean we can also put off implementing the filter logic for now, though the approach seems fine to me if some day we find we need such functionality.
I did a quick review of the SQL schema. I think more thought needs to be put into the relational model of these tables and into optimizing the queries we want to perform. The current schema is very wasteful of space and duplicates a lot of information. It also doesn't use SQL ids for referencing between tables, which hurts relational query performance.
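As an example of the kind of normalization I mean (all table and column names here are invented for illustration, not taken from the RFC's schema), rows reference each other by integer primary key instead of duplicating full values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE public_keys (
  id    INTEGER PRIMARY KEY,
  value TEXT NOT NULL UNIQUE       -- the key string is stored exactly once
);
CREATE TABLE transactions (
  id        INTEGER PRIMARY KEY,
  sender_id INTEGER NOT NULL REFERENCES public_keys(id),
  amount    INTEGER NOT NULL
);
CREATE INDEX idx_tx_sender ON transactions(sender_id);
""")

conn.execute("INSERT INTO public_keys (value) VALUES ('pk_alice')")
conn.execute("INSERT INTO transactions (sender_id, amount) VALUES (1, 100)")

# The join runs on small, indexed integer ids rather than comparing
# duplicated key strings in every row:
row = conn.execute("""
  SELECT pk.value, t.amount
  FROM transactions t
  JOIN public_keys pk ON pk.id = t.sender_id
""").fetchone()
```

Storing the id once and joining on it both shrinks the tables and lets the query planner use integer indexes for the lookups we care about.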
SQL looks better now; I still have a few minor suggestions, but nothing too major. Would like to have a slightly extended conversation wrt the fee transfers table, though.