Compress the clock table format #293
A simple first pass is to use lookup tables for large values rather than changing the clock table's overall structure:
pk_lookup
siteid_lookup
These lookup tables are local to each database; the real values are returned when sending changes over the wire. We don't need a lookup for column index, since both databases should have the same schema and the same column indices. We will, however, map column index to column name.

Size reduction:
16-byte primary key -> 1-4 bytes
42 bytes worst case -> 9 bytes worst case

Other structures (still to be brainstormed) could get us further compression, but we may lose too much query power.
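A minimal sketch of what such per-database lookup tables could look like, assuming hypothetical names and shapes (plain interning tables keyed by a small integer id):

```sql
-- Sketch only: hypothetical local lookup tables, not the actual cr-sqlite schema.
CREATE TABLE IF NOT EXISTS pk_lookup (
  id INTEGER PRIMARY KEY,   -- small integer ids are stored in 1-4 bytes
  pk BLOB NOT NULL UNIQUE   -- the original (e.g. 16-byte) primary key value
);

CREATE TABLE IF NOT EXISTS siteid_lookup (
  id      INTEGER PRIMARY KEY,
  site_id BLOB NOT NULL UNIQUE  -- the original 16-byte site id
);

-- Interning a value: reuse an existing id or mint a new one.
INSERT OR IGNORE INTO siteid_lookup (site_id)
  VALUES (x'0123456789abcdef0123456789abcdef');
SELECT id FROM siteid_lookup
  WHERE site_id = x'0123456789abcdef0123456789abcdef';

-- The clock table would then store these integer ids; the real values are
-- joined back in only when changes are sent over the wire.
```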
Doing re-insertion made me think a bit more about all the metadata being tracked. It looks rather excessive, and I think we can cut it down a bunch by using lookup tables. @jeromegn, @Azarattum: curious if either of you have thoughts on the lookup-table approach above.
These changes should be internal only and transparent to the user, except that if someone has more than 256 columns in a table... well, they'll be out of luck.
I was thinking of ways to compress deltas when sending them over the wire. The primary optimisation I came up with (apart from lookup tables) is to skip repeated columns across consecutive changes. For example, when we change a bunch of stuff in a single table, we only mention the table name once and all later changes assume the previous name: a change set that spells the table name out on every row is equivalent to one that states it once up front (see the sketch after this comment). This works great for transmitting changes; I'm not sure how fast/reliable this approach would be for storing them.
The major downside is that we rely on order. It is fine within a single sync packet, but might not be great for long-term storage. What do you think?
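Not the examples from the original comment, but a rough illustration of the idea, treating each change as a (tbl, pk, col, val) tuple and using NULL to mean "same table as the previous change":

```sql
-- Illustrative only: a flattened change list, not cr-sqlite's actual wire format.
-- Verbose form: the table name is repeated on every change (tbl, pk, col, val).
VALUES
  ('todo', 1, 'title', 'buy milk'),
  ('todo', 1, 'done',   0),
  ('todo', 2, 'title', 'call bob');

-- Compressed form: the table name is stated once; NULL means
-- "same table as the previous change".
VALUES
  ('todo', 1, 'title', 'buy milk'),
  (NULL,   1, 'done',   0),
  (NULL,   2, 'title', 'call bob');
```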
256 columns per table is fine, imo.
Skipping over cells with the same value is a good idea. Easy in the network code, as you point out. For the persisted data... I'd have to figure out how much this hurts performance, as queries for changes would require scanning back several rows to find a value.
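To make the scan-back cost concrete, here is a hypothetical persisted layout (names and shape are made up for illustration) where a NULL field means the value is carried over from the previous change:

```sql
-- Hypothetical persisted change log; NULL tbl means "same table as the previous change".
CREATE TABLE changes_log (
  seq INTEGER PRIMARY KEY,
  tbl TEXT,    -- NULL = carried over from the previous row
  pk  BLOB,
  col TEXT,
  val
);

-- Finding the effective table name for change 42 means scanning back to the
-- nearest earlier row that actually recorded it:
SELECT tbl
FROM changes_log
WHERE seq <= 42 AND tbl IS NOT NULL
ORDER BY seq DESC
LIMIT 1;
```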
We can actually encode column names as varints without creating a lookup table by leveraging the internal table info structure: https://github.com/vlcn-io/cr-sqlite/blob/main/core/src/tableinfo.h#L14-L35. This poses some issues during schema migrations, however, since the indices in that structure can change when the schema does. To deal with this, we will need to remap or rewrite the stored indices whenever a migration changes them.
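For intuition, SQLite's built-in table_info pragma shows the kind of positional column index this relies on, and why migrations are the tricky part (this is just an illustration, not how cr-sqlite actually derives its ids):

```sql
CREATE TABLE todo (id INTEGER PRIMARY KEY, title TEXT, done INTEGER);

-- cid is the column's position; it could serve as a varint-encoded column id.
PRAGMA table_info(todo);
-- cid | name  | type
--  0  | id    | INTEGER
--  1  | title | TEXT
--  2  | done  | INTEGER

-- After a migration the positions can shift, so any stored ids would need rewriting.
-- (DROP COLUMN requires SQLite 3.35+.)
ALTER TABLE todo DROP COLUMN title;
PRAGMA table_info(todo);
--  0  | id   | INTEGER
--  1  | done | INTEGER   -- 'done' moved from cid 2 to cid 1
```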
I like this tradeoff.
We can vastly compress the table format (5x in most cases) by encoding all metadata for a row in a single clock row.
Encoding in a single row, however, loses quite a bit of expressiveness when selecting changes. An alternate route that still cuts quite a bit of fat is to move large values into lookup tables; see the lookup-table comment above.
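As a rough before/after sketch of the two shapes being weighed here; the "per-column" schema below is an assumption for illustration, not the exact cr-sqlite clock table:

```sql
-- Assumed current-ish shape: one clock row per (row, column) pair,
-- each carrying its own site id and full-width versions.
CREATE TABLE todo__clock_per_col (
  pk          BLOB NOT NULL,
  col_name    TEXT NOT NULL,
  col_version INTEGER NOT NULL,
  db_version  INTEGER NOT NULL,
  site_id     BLOB,
  PRIMARY KEY (pk, col_name)
);

-- Compressed shape: one clock row per row, with per-column metadata
-- packed into a single blob (e.g. varint-encoded column-index/version pairs).
CREATE TABLE todo__clock_per_row (
  pk          BLOB PRIMARY KEY,
  col_clocks  BLOB NOT NULL,   -- packed (column index, col_version) pairs
  db_version  INTEGER NOT NULL,
  site_id     INTEGER          -- id into a local siteid_lookup table
);
```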