Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Foreign Keys #1082

Closed
kderme opened this issue Mar 14, 2022 · 11 comments
Closed

Remove Foreign Keys #1082

kderme opened this issue Mar 14, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@kderme
Copy link
Contributor

kderme commented Mar 14, 2022

Relations in db-sync can be visualised as a graph (actually DAG) of references. For example a Block references its previous Block, a Tx its Block, a certificate its Tx and its StakeAddress, a TxIn its TxOut etc. The current implementation uses foreign keys for these relations. Foreign keys come with some pros and cons

Pros
  • They make rollbacks possible and quite easy. db-sync simply deletes a single block and the deletion cascades to all related entries.
  • They enforce data integreity. With current setup dangling foreign keys are impossible.
  • They can make some queries faster, since they give additional information to the query planer about the relations between tables and so it can optimize queries better.
  • Removing them can be tricky since we use persistent
Cons
  • Foreign keys impose additional checks, which make insertions slower.
  • These checks and the on delete triggers, require indexes to the referenced and referencing tables to be efficient. Updating these indexes makes insertions even slower.
  • Rollbacks can be even faster without using foreign keys. One way to do it is adding a 'blockNo' field to all tables related to a specific block. For all these tables we ill use a query delete from tableX where blockNo > ....

Overall foreign keys make insertions slower. They allow very simple rollbacks, but rollbacks can become even faster. Finally the data integrity is not so vital. db-sync get its data from blocks that have gone through ledger validation, so doing additional checks is not needed.

@kderme kderme added bug Something isn't working enhancement New feature or request and removed bug Something isn't working labels Mar 14, 2022
@erikd
Copy link
Contributor

erikd commented Mar 16, 2022

Finally the data integrity is not so vital

I do not fully agree, especially since this is one of the few methods we have for ensuring data integrity.

db-sync get its data from blocks that have gone through ledger validation

Yes, ledger has its own data integrity checks, but IMO, since db-sync takes the ledger data and transforms its, the ledger integrity checks do not cover the integrity of the transformations.

One way to do it is adding a 'blockNo' field to all tables

Rollbacks using foreign keys are guaranteed to be correct. Yes, theoretically, that can be replaced with delete from tableX where blockNo > .... for all tables, but if even a single table is skipped, will result in significant problems. Maybe that can be avoided by making sure all blockNo columns have a Unique constraint.

@erikd
Copy link
Contributor

erikd commented Mar 16, 2022

Under this scheme rollbacks used to be done as:

delete cascade from block where blockNo >= ... ;

where the foreign keys would cause a cascade through all the tables.

Without foreign keys we would need a delete for every table:

delete from block where blockNo >= x ;
delete from tx where blockNo >= x ;
delete from tx_in where blockNo >= x ;
delete from tx_out where blockNo >= x ;
.....

with a separate delete statement for every table. Hopefully this would be at least as fast.

@kderme
Copy link
Contributor Author

kderme commented Mar 16, 2022

I do not fully agree, especially since this is one of the few methods we have for ensuring data integrity.

On insertions we ensure data integrity by doing a query on referenced fields before inserting, so the additional check done by postgres is basically useless. If it fails the whole thing crashes anyway. On deletions, indeed we will have to be very careful about not missing a table. And there are no updates.

@erikd
Copy link
Contributor

erikd commented Mar 21, 2022

And there are no updates.

The only table which is updated is the Epoch table. Everything else is write once and only once.

@erikd
Copy link
Contributor

erikd commented Mar 21, 2022

I am growing more and more skeptical about this. Replacing BlockId (a foreign key) with BlockNo makes sense. These two types are both stored in Postgres as SQL bigint (8 bytes).

However, replacing a foreign key (8 bytes) with a hash (28, 29 or 32 bytes) may not have the same benefits as the above.

  • Some foreign key fields like the slotLeaderId field of the Block table provide significant data compression. For a give epoch (20000 to 21600 blocks), there are rarely more than about 1000 slot leaders (usually) fewer. That means storing 20000+ foreign keys, takes up significantly less space than the same number of slot leader hashes (28 bytes).

  • Now consider the TxIn table (similarly for the CollateralTxIn table):

      TxIn
        txInId              TxId                OnDeleteCascade     -- The transaction where this is used as an input.
        txOutId             TxId                OnDeleteCascade     -- The transaction where this was created as an output.
        txOutIndex          Word16              sqltype=txindex
        redeemerId          RedeemerId Maybe    OnDeleteCascade
        UniqueTxin          txOutId txOutIndex
    

    Replacing the two TxId (foreign key) fields with the Tx hash, significantly increases the size of a row. Assuming the redeemerId is a Just (8 bytes), the minimum size of a row of the above table is 8 + 8 + 2 + 8 == 26 bytes. If we replace the TxId fields with the Tx hash (32 bytes) the minimum size as measured above is 32 + 32 + 2 + 8 == 74 bytes. In epoch 327 there are already over 80 million entries (table is 14G in size) in that table and this change will make it about 3 times larger in size. Furthermore it is not unreasonable to expect that comparing foreign keys (a bigint) is faster than comparing two hashes.

  • In general each foreign key occurrs in many places (eg the registeredTxId fields and similar used for rollbacks). For the case of registeredTxId this value is not even the result of a query, but the result of an insertion. The only benefit of switching from a TxId field to a BlockNo is that PostgreSQL does not check that the BlockNo is valid while it would check the TxId for validity.

  • The Reward table currently contains two foreign keys, addrId and poolId. Currently the Reward table gets over a million new entries every epoch (5 days). Switching from foreign keys to a stake address hash (29 bytes, including the network id) and a pool hash (28 bytes) would increase the size of each row 41 bytes, roughly doubling the size of this table (currently 11G in epoch 327). It should also be noted that Reward along with EpochStake are the two fastest growing tables.

In summary, replacing BlockId with BlockNo may make some sense, but replacing foreign keys with hashes definitely results in more disk storage being needed by the database, slower serialization/deserialization for fields sent to and retrieved from the database and likely slower queries for anything that used to use a foreign key and now uses a hash.

@kderme
Copy link
Contributor Author

kderme commented Mar 21, 2022

I totally agree with the above we should not replace ids with hashes. I think we should still use ids, just remove the foreign key constraints and the related indexes.

@erikd
Copy link
Contributor

erikd commented Mar 21, 2022

I wonder if its possible to run an experiment where all the foreign keys are removed from the generated SQL and we sync from genesis to compare speeds.

@erikd
Copy link
Contributor

erikd commented Mar 22, 2022

This experiment seems to be working and its faster. I will have a better idea of how much faster in the morning.

@erikd
Copy link
Contributor

erikd commented Jul 19, 2022

This is WIP. My branch is erikd/no-foreign-keys-does-this-work-3.

@kderme kderme assigned erikd and unassigned kderme Aug 16, 2022
@erikd erikd closed this as completed in 078e647 Aug 18, 2022
@marshada
Copy link

@kderme could you provide a more detailed update here on the work that was done? For example, were foreign key constraints removed for all keys across all tables? thanks!

@marshada marshada assigned kderme and unassigned erikd Nov 3, 2022
@kderme
Copy link
Contributor Author

kderme commented Nov 3, 2022

Foreign keys have been removed from master and the release branches, but in a different way which we eventually need to merge.

Rollbacks can be even faster without using foreign keys. One way to do it is adding a 'blockNo' field to all tables related to a specific block. For all these tables we ill use a query delete from tableX where blockNo > ....

In master this initially recomended process to rollback was followed, but then I realised this resulted in the addition of many new fields, which could negatively affect performance and required a very compicated migration. Instead in the release branch I added a different way to rollback which used reverse indexes and doesn't require new fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants