Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database Subsetting: Scale down a production database to a more reasonable size #40

Closed
evoxmusic opened this issue Apr 2, 2022 · 0 comments · Fixed by #44
Closed

Database Subsetting: Scale down a production database to a more reasonable size #40

evoxmusic opened this issue Apr 2, 2022 · 0 comments · Fixed by #44
Labels
feature New feature request

Comments

@evoxmusic
Copy link
Contributor

evoxmusic commented Apr 2, 2022

What is Subsetting

From Tonic.ai

Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users. If you do this naively, e.g., just grab 5% of all the tables in your database, most likely, your database will break foreign key constraints. At best, you’ll end up with a statistically non-representative data sample.

One common use case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.

As discussed on Discord, database subsetting will be super valuable to restore a subset of a production database for a development purpose. E.g. developers from growing companies are interested in using RepliByte for their database with TBs of data 😮 Subsetting a database is needed for very large DBs. It even does not make any sense to try to re-import a DB with TB of data for development purposes.

In this issue, I propose that we work together in designing the "Database Subsetting" feature.

References

Here are some must-read references about database subsetting:

  1. Intro to database subsetting
  2. Database subsetting Tonic.ai design choices

I recommend reading them. They are full of information.

Design references

I am going to take some time digging into Condenser (OSS Tonic.ai subsetting python tool) to suggest a starting implementation. I keep you posted.

Design proposal

sequenceDiagram
participant RepliByte
participant PostgreSQL (Source)
participant AWS S3 (Bridge)
PostgreSQL (Source)->>RepliByte: Dump data
loop
    RepliByte->>RepliByte: a. Get database schema and tables relationships
    RepliByte->>RepliByte: b. Support virtual relationships
    RepliByte->>RepliByte: c. Take x% rows of the ref table
end
loop
    RepliByte->>RepliByte: Hide/fake sensitive data
    RepliByte->>RepliByte: Compress data
    RepliByte->>RepliByte: Encrypt data
end
RepliByte->>AWS S3 (Bridge): Upload data
RepliByte->>AWS S3 (Bridge): 6. Write index file

@evoxmusic evoxmusic added the feature New feature request label Apr 2, 2022
@evoxmusic evoxmusic linked a pull request Apr 5, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant