
feat: Database subset #44

Merged: 22 commits into main, Apr 16, 2022

Conversation

@evoxmusic (Contributor) commented Apr 5, 2022

@evoxmusic self-assigned this Apr 5, 2022
@evoxmusic (Contributor, Author)

[image]

HUGE performance improvement - 30ms per iteration instead of 900ms before

@evoxmusic marked this pull request as ready for review April 11, 2022 08:59
@evoxmusic (Contributor, Author)

I know it's a large PR, but if you want to take a look at it, @benny-n @fabriceclementz, it's all good for me :)

@benny-n (Contributor) left a comment


Looks great, very nice feature to have!

@fabriceclementz (Contributor)

Good job @evoxmusic! This is my favorite feature so far!
The possibility to implement many subset strategies is great, and the code is expressive enough. 👍

I noticed some errors during the restore command:

```
  Running `target/debug/replibyte -c ./examples/with-subset-and-transformer.yaml restore -v latest`
⠉ [00:00:00] [--------------------------------------------------------------------------------------------------------------------------------------------------------] 0B/25.24KiB (0s)
⠁ [00:00:00] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
⠠ [00:00:02] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
ERROR:  could not create unique index "pk_customers"
DETAIL:  Key (customer_id)=(TOMSP) is duplicated.
ERROR:  could not create unique index "pk_employees"
DETAIL:  Key (employee_id)=(6) is duplicated.
ERROR:  could not create unique index "pk_shippers"
⠙ [00:00:02] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
ERROR:  there is no unique constraint matching given keys for referenced table "customers"
ERROR:  there is no unique constraint matching given keys for referenced table "employees"
ERROR:  there is no unique constraint matching given keys for referenced table "employees"
ERROR:  there is no unique constraint matching given keys for referenced table "customers"
ERROR:  there is no unique constraint matching given keys for referenced table "employees"
⠄ [00:00:02] [#################################################################################################################################################] 202.58KiB/25.24KiB (0s)
Restore successful!
```

It seems some rows are inserted twice, so the primary key cannot be added.
I haven't looked further for now.

@evoxmusic (Contributor, Author)

Yes, I got the same issue and I will provide a fix.

@evoxmusic (Contributor, Author)

I'm facing a small challenge with row deduplication. Since RepliByte needs to keep memory consumption low, it's almost impossible to deduplicate rows in memory. I am looking for a workaround.

@fabriceclementz (Contributor)

> I'm facing a small challenge with row deduplication. Since RepliByte needs to keep memory consumption low, it's almost impossible to deduplicate rows in memory. I am looking for a workaround.

I think it would require having all the rows in memory to deduplicate them?
Maybe you could create a temporary index in S3, like a HashMap data structure, so you don't have to keep all the rows in memory and could check whether a row has already been inserted.
I'm not sure it would be the most efficient solution, nor even whether it can help you!
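
To make the idea a bit more concrete, here is a very rough sketch of such a temporary index backed by the local disk instead of S3 (purely hypothetical code to illustrate, not something I've tested against RepliByte): each row is reduced to an 8-byte digest and appended to one of 256 bucket files, so almost nothing stays in memory.

```rust
use std::fs::{self, File, OpenOptions};
use std::hash::{Hash, Hasher};
use std::io::{self, Read, Write};
use std::path::{Path, PathBuf};

/// Crude disk-backed "seen set" (hypothetical sketch, not RepliByte code):
/// each row hash lands in one of 256 bucket files, so memory stays O(1)
/// and a lookup only scans one small bucket.
struct DiskIndex {
    dir: PathBuf,
}

impl DiskIndex {
    fn new(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf() })
    }

    /// Returns true if `row` was already recorded; records it otherwise.
    fn check_and_insert(&self, row: &[u8]) -> io::Result<bool> {
        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        row.hash(&mut hasher);
        let digest = hasher.finish();
        let bucket = self.dir.join(format!("{:02x}", digest & 0xff));

        // Scan the bucket file for this digest (8 bytes per entry).
        let mut bytes = Vec::new();
        if bucket.exists() {
            File::open(&bucket)?.read_to_end(&mut bytes)?;
        }
        let found = bytes
            .chunks_exact(8)
            .any(|c| u64::from_le_bytes(c.try_into().unwrap()) == digest);

        if !found {
            let mut f = OpenOptions::new().create(true).append(true).open(&bucket)?;
            f.write_all(&digest.to_le_bytes())?;
        }
        Ok(found)
    }
}
```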

@evoxmusic (Contributor, Author)

It's a good idea, but my concern is the I/O performance; I presume it will not be super performant. For the database subset I am already using the local disk, because there is no other choice: we can't process the data and assume that users will have a lot of memory available. However, we can assume that they have some disk space, and it is a requirement for the database subset. I am working on a function to dedup specific lines from a file. I will push my code tomorrow (I was at a Rust conf tonight, and I am super tired).
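
To give an idea of the direction (a minimal sketch with made-up names, not the code I will push): stream the dump file line by line, keep only one 8-byte digest per distinct line in memory, and write the unique lines to a temporary file on disk that then replaces the original.

```rust
use std::collections::HashSet;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::Path;

/// Rewrites `path`, keeping only the first occurrence of each line.
/// Memory is bounded by one u64 digest per distinct line instead of
/// the full dump. (Hypothetical sketch, not the PR's actual code.)
fn dedup_lines(path: &Path) -> io::Result<()> {
    let tmp_path = path.with_extension("dedup.tmp");
    let reader = BufReader::new(File::open(path)?);
    let mut writer = BufWriter::new(File::create(&tmp_path)?);
    let mut seen: HashSet<u64> = HashSet::new();

    for line in reader.lines() {
        let line = line?;
        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        line.hash(&mut hasher);
        // `insert` returns true only the first time a digest is seen.
        if seen.insert(hasher.finish()) {
            writeln!(writer, "{}", line)?;
        }
    }

    writer.flush()?;
    fs::rename(&tmp_path, path)?;
    Ok(())
}
```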

@evoxmusic marked this pull request as draft April 14, 2022 07:16
@evoxmusic (Contributor, Author)

@fabriceclementz @benny-n finally done!! 😄 I am going to merge, but I will also need to explain that there are some requirements to use the database subset feature, since it needs some disk space for the processing. We can improve that part incrementally.

@evoxmusic marked this pull request as ready for review April 16, 2022 15:01
@evoxmusic merged commit 791c6d7 into main Apr 16, 2022
@fabriceclementz (Contributor)

Nice 👍 I will try this soon!

Development

Successfully merging this pull request may close these issues.

Database Subsetting: Scale down a production database to a more reasonable size