Remove Foreign Keys #1082

kderme · 2022-03-14T17:08:57Z

Relations in db-sync can be visualised as a graph (actually DAG) of references. For example a Block references its previous Block, a Tx its Block, a certificate its Tx and its StakeAddress, a TxIn its TxOut etc. The current implementation uses foreign keys for these relations. Foreign keys come with some pros and cons

Pros

They make rollbacks possible and quite easy. db-sync simply deletes a single block and the deletion cascades to all related entries.
They enforce data integreity. With current setup dangling foreign keys are impossible.
They can make some queries faster, since they give additional information to the query planer about the relations between tables and so it can optimize queries better.
Removing them can be tricky since we use persistent

Cons

Foreign keys impose additional checks, which make insertions slower.
These checks and the on delete triggers, require indexes to the referenced and referencing tables to be efficient. Updating these indexes makes insertions even slower.
Rollbacks can be even faster without using foreign keys. One way to do it is adding a 'blockNo' field to all tables related to a specific block. For all these tables we ill use a query delete from tableX where blockNo > ....

Overall foreign keys make insertions slower. They allow very simple rollbacks, but rollbacks can become even faster. Finally the data integrity is not so vital. db-sync get its data from blocks that have gone through ledger validation, so doing additional checks is not needed.

The text was updated successfully, but these errors were encountered:

erikd · 2022-03-16T00:09:26Z

Finally the data integrity is not so vital

I do not fully agree, especially since this is one of the few methods we have for ensuring data integrity.

db-sync get its data from blocks that have gone through ledger validation

Yes, ledger has its own data integrity checks, but IMO, since db-sync takes the ledger data and transforms its, the ledger integrity checks do not cover the integrity of the transformations.

One way to do it is adding a 'blockNo' field to all tables

Rollbacks using foreign keys are guaranteed to be correct. Yes, theoretically, that can be replaced with delete from tableX where blockNo > .... for all tables, but if even a single table is skipped, will result in significant problems. Maybe that can be avoided by making sure all blockNo columns have a Unique constraint.

erikd · 2022-03-16T05:58:19Z

Under this scheme rollbacks used to be done as:

delete cascade from block where blockNo >= ... ;

where the foreign keys would cause a cascade through all the tables.

Without foreign keys we would need a delete for every table:

delete from block where blockNo >= x ;
delete from tx where blockNo >= x ;
delete from tx_in where blockNo >= x ;
delete from tx_out where blockNo >= x ;
.....

with a separate delete statement for every table. Hopefully this would be at least as fast.

kderme · 2022-03-16T18:31:58Z

I do not fully agree, especially since this is one of the few methods we have for ensuring data integrity.

On insertions we ensure data integrity by doing a query on referenced fields before inserting, so the additional check done by postgres is basically useless. If it fails the whole thing crashes anyway. On deletions, indeed we will have to be very careful about not missing a table. And there are no updates.

erikd · 2022-03-21T04:28:36Z

And there are no updates.

The only table which is updated is the Epoch table. Everything else is write once and only once.

erikd · 2022-03-21T05:36:08Z

I am growing more and more skeptical about this. Replacing BlockId (a foreign key) with BlockNo makes sense. These two types are both stored in Postgres as SQL bigint (8 bytes).

However, replacing a foreign key (8 bytes) with a hash (28, 29 or 32 bytes) may not have the same benefits as the above.

Some foreign key fields like the slotLeaderId field of the Block table provide significant data compression. For a give epoch (20000 to 21600 blocks), there are rarely more than about 1000 slot leaders (usually) fewer. That means storing 20000+ foreign keys, takes up significantly less space than the same number of slot leader hashes (28 bytes).
Now consider the TxIn table (similarly for the CollateralTxIn table):
```
  TxIn
    txInId              TxId                OnDeleteCascade     -- The transaction where this is used as an input.
    txOutId             TxId                OnDeleteCascade     -- The transaction where this was created as an output.
    txOutIndex          Word16              sqltype=txindex
    redeemerId          RedeemerId Maybe    OnDeleteCascade
    UniqueTxin          txOutId txOutIndex
```
Replacing the two TxId (foreign key) fields with the Tx hash, significantly increases the size of a row. Assuming the redeemerId is a Just (8 bytes), the minimum size of a row of the above table is 8 + 8 + 2 + 8 == 26 bytes. If we replace the TxId fields with the Tx hash (32 bytes) the minimum size as measured above is 32 + 32 + 2 + 8 == 74 bytes. In epoch 327 there are already over 80 million entries (table is 14G in size) in that table and this change will make it about 3 times larger in size. Furthermore it is not unreasonable to expect that comparing foreign keys (a bigint) is faster than comparing two hashes.
In general each foreign key occurrs in many places (eg the registeredTxId fields and similar used for rollbacks). For the case of registeredTxId this value is not even the result of a query, but the result of an insertion. The only benefit of switching from a TxId field to a BlockNo is that PostgreSQL does not check that the BlockNo is valid while it would check the TxId for validity.
The Reward table currently contains two foreign keys, addrId and poolId. Currently the Reward table gets over a million new entries every epoch (5 days). Switching from foreign keys to a stake address hash (29 bytes, including the network id) and a pool hash (28 bytes) would increase the size of each row 41 bytes, roughly doubling the size of this table (currently 11G in epoch 327). It should also be noted that Reward along with EpochStake are the two fastest growing tables.

In summary, replacing BlockId with BlockNo may make some sense, but replacing foreign keys with hashes definitely results in more disk storage being needed by the database, slower serialization/deserialization for fields sent to and retrieved from the database and likely slower queries for anything that used to use a foreign key and now uses a hash.

kderme · 2022-03-21T12:09:17Z

I totally agree with the above we should not replace ids with hashes. I think we should still use ids, just remove the foreign key constraints and the related indexes.

erikd · 2022-03-21T22:42:30Z

I wonder if its possible to run an experiment where all the foreign keys are removed from the generated SQL and we sync from genesis to compare speeds.

erikd · 2022-03-22T07:50:02Z

This experiment seems to be working and its faster. I will have a better idea of how much faster in the morning.

erikd · 2022-07-19T11:07:31Z

This is WIP. My branch is erikd/no-foreign-keys-does-this-work-3.

marshada · 2022-10-27T15:50:42Z

@kderme could you provide a more detailed update here on the work that was done? For example, were foreign key constraints removed for all keys across all tables? thanks!

kderme · 2022-11-03T18:52:41Z

Foreign keys have been removed from master and the release branches, but in a different way which we eventually need to merge.

Rollbacks can be even faster without using foreign keys. One way to do it is adding a 'blockNo' field to all tables related to a specific block. For all these tables we ill use a query delete from tableX where blockNo > ....

In master this initially recomended process to rollback was followed, but then I realised this resulted in the addition of many new fields, which could negatively affect performance and required a very compicated migration. Instead in the release branch I added a different way to rollback which used reverse indexes and doesn't require new fields.

kderme added bug Something isn't working enhancement New feature or request and removed bug Something isn't working labels Mar 14, 2022

kderme mentioned this issue Mar 22, 2022

Remove unused Indexes, Unique and Primary Keys #1087

Closed

kderme mentioned this issue May 16, 2022

Performance Improvements #1120

Open

11 tasks

vfrsilva assigned kderme Jul 22, 2022

kderme assigned erikd and unassigned kderme Aug 16, 2022

erikd closed this as completed in 078e647 Aug 18, 2022

marshada assigned kderme and unassigned erikd Nov 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove Foreign Keys #1082

Remove Foreign Keys #1082

kderme commented Mar 14, 2022 •

edited

erikd commented Mar 16, 2022 •

edited

erikd commented Mar 16, 2022

kderme commented Mar 16, 2022 •

edited

erikd commented Mar 21, 2022 •

edited

erikd commented Mar 21, 2022 •

edited

kderme commented Mar 21, 2022

erikd commented Mar 21, 2022

erikd commented Mar 22, 2022

erikd commented Jul 19, 2022

marshada commented Oct 27, 2022

kderme commented Nov 3, 2022

Remove Foreign Keys #1082

Remove Foreign Keys #1082

Comments

kderme commented Mar 14, 2022 • edited

Pros

Cons

erikd commented Mar 16, 2022 • edited

erikd commented Mar 16, 2022

kderme commented Mar 16, 2022 • edited

erikd commented Mar 21, 2022 • edited

erikd commented Mar 21, 2022 • edited

kderme commented Mar 21, 2022

erikd commented Mar 21, 2022

erikd commented Mar 22, 2022

erikd commented Jul 19, 2022

marshada commented Oct 27, 2022

kderme commented Nov 3, 2022

kderme commented Mar 14, 2022 •

edited

erikd commented Mar 16, 2022 •

edited

kderme commented Mar 16, 2022 •

edited

erikd commented Mar 21, 2022 •

edited

erikd commented Mar 21, 2022 •

edited