
feat: Database subset #44

Merged: 22 commits into main, Apr 16, 2022

Conversation

@evoxmusic (Contributor) commented Apr 5, 2022

@evoxmusic self-assigned this Apr 5, 2022
@evoxmusic (Contributor, Author)

[image]

HUGE performance improvement - 30ms per iteration instead of 900ms before

@evoxmusic marked this pull request as ready for review April 11, 2022 08:59
@evoxmusic (Contributor, Author)

I know it's a large PR, but if you want to take a look at it, @benny-n @fabriceclementz, it's all good for me :)

@benny-n (Contributor) left a comment


Looks great, very nice feature to have!

@fabriceclementz (Contributor)

Good job @evoxmusic! This is my favorite feature so far!
The possibility to implement many subset strategies is great, and the code is expressive enough. 👍

I noticed some errors during the restore command:

```
  Running `target/debug/replibyte -c ./examples/with-subset-and-transformer.yaml restore -v latest`
⠉ [00:00:00] [--------------------------------------------------------------------------------------------------------------------------------------------------------] 0B/25.24KiB (0s)
⠁ [00:00:00] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
⠠ [00:00:02] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
ERROR:  could not create unique index "pk_customers"
DETAIL:  Key (customer_id)=(TOMSP) is duplicated.
ERROR:  could not create unique index "pk_employees"
DETAIL:  Key (employee_id)=(6) is duplicated.
ERROR:  could not create unique index "pk_shippers"
⠙ [00:00:02] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
ERROR:  there is no unique constraint matching given keys for referenced table "customers"
ERROR:  there is no unique constraint matching given keys for referenced table "employees"
ERROR:  there is no unique constraint matching given keys for referenced table "employees"
ERROR:  there is no unique constraint matching given keys for referenced table "customers"
ERROR:  there is no unique constraint matching given keys for referenced table "employees"
⠄ [00:00:02] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
Restore successful!
```

It seems some rows are inserted twice, so the primary key cannot be added.
I haven't looked further for now.

@evoxmusic (Contributor, Author)

Yes, I got the same issue and I will provide a fix.

@evoxmusic (Contributor, Author)

I'm facing a small challenge with row deduplication. Since RepliByte needs to keep memory consumption low, it's almost impossible to deduplicate rows in memory. I am looking for a workaround.

@fabriceclementz (Contributor)

> I'm facing a small challenge with row deduplication. Since RepliByte needs to keep memory consumption low, it's almost impossible to deduplicate rows in memory. I am looking for a workaround.

I think it would require having all the rows in memory to deduplicate them?
Maybe you could create a temporary index in S3, like a HashMap data structure, so you don't have to keep all the rows in memory and could check whether a row has already been inserted.
I'm not sure it would be the most efficient solution, nor even whether it can help you!
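
To make the idea a bit more concrete, here is a very rough sketch of such a temporary index backed by the local disk instead of S3 (purely hypothetical code to illustrate, not something I've tested against RepliByte): each row is reduced to an 8-byte digest and appended to one of 256 bucket files, so almost nothing stays in memory.

```rust
use std::fs::{self, File, OpenOptions};
use std::hash::{Hash, Hasher};
use std::io::{self, Read, Write};
use std::path::{Path, PathBuf};

/// Crude disk-backed "seen set" (hypothetical sketch, not RepliByte code):
/// each row hash lands in one of 256 bucket files, so memory stays O(1)
/// and a lookup only scans one small bucket.
struct DiskIndex {
    dir: PathBuf,
}

impl DiskIndex {
    fn new(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf() })
    }

    /// Returns true if `row` was already recorded; records it otherwise.
    fn check_and_insert(&self, row: &[u8]) -> io::Result<bool> {
        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        row.hash(&mut hasher);
        let digest = hasher.finish();
        let bucket = self.dir.join(format!("{:02x}", digest & 0xff));

        // Scan the bucket file for this digest (8 bytes per entry).
        let mut bytes = Vec::new();
        if bucket.exists() {
            File::open(&bucket)?.read_to_end(&mut bytes)?;
        }
        let found = bytes
            .chunks_exact(8)
            .any(|c| u64::from_le_bytes(c.try_into().unwrap()) == digest);

        if !found {
            let mut f = OpenOptions::new().create(true).append(true).open(&bucket)?;
            f.write_all(&digest.to_le_bytes())?;
        }
        Ok(found)
    }
}
```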

@evoxmusic (Contributor, Author)

It's a good idea, but my concern is the I/O performance; I presume it will not be super performant. For the database subset I am already using the local disk, because there is no other choice: we can't process the data and assume that users will have a lot of memory available. However, we can assume that they have some disk space, and it is a requirement for the database subset. I am working on a function to dedup specific lines from a file. I will push my code tomorrow (I was at a Rust conf tonight, and I am super tired).
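
To give an idea of the direction (a minimal sketch with made-up names, not the code I will push): stream the dump file line by line, keep only one 8-byte digest per distinct line in memory, and write the unique lines to a temporary file on disk that then replaces the original.

```rust
use std::collections::HashSet;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::Path;

/// Rewrites `path`, keeping only the first occurrence of each line.
/// Memory is bounded by one u64 digest per distinct line instead of
/// the full dump. (Hypothetical sketch, not the PR's actual code.)
fn dedup_lines(path: &Path) -> io::Result<()> {
    let tmp_path = path.with_extension("dedup.tmp");
    let reader = BufReader::new(File::open(path)?);
    let mut writer = BufWriter::new(File::create(&tmp_path)?);
    let mut seen: HashSet<u64> = HashSet::new();

    for line in reader.lines() {
        let line = line?;
        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        line.hash(&mut hasher);
        // `insert` returns true only the first time a digest is seen.
        if seen.insert(hasher.finish()) {
            writeln!(writer, "{}", line)?;
        }
    }

    writer.flush()?;
    fs::rename(&tmp_path, path)?;
    Ok(())
}
```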

@evoxmusic marked this pull request as draft April 14, 2022 07:16
@evoxmusic (Contributor, Author)

@fabriceclementz @benny-n finally done!! 😄 I am going to merge, but I will also need to explain that there are some requirements to use the database subset feature, since it needs some disk space for the processing. We can improve that part incrementally.

@evoxmusic marked this pull request as ready for review April 16, 2022 15:01
@evoxmusic merged commit 791c6d7 into main Apr 16, 2022
@fabriceclementz (Contributor)

Nice 👍 I will try this soon!

Development

Successfully merging this pull request may close these issues.

Database Subsetting: Scale down a production database to a more reasonable size