Database Subsetting: Scale down a production database to a more reasonable size #40

evoxmusic · 2022-04-02T21:20:09Z

What is Subsetting

Subsetting is the process of taking a representative sample of your data in a way that preserves the integrity of your database, e.g., "give me 5% of my users". If you do this naively, e.g., by grabbing 5% of the rows of every table, you will most likely break foreign key constraints. At best, you'll end up with a statistically non-representative data sample.
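To make the difference concrete, here is a minimal sketch in Python of sampling one reference table and then keeping only the rows that point at the sampled rows. The `users`/`orders` tables, the `user_id` column, and the `subset_with_children` helper are all hypothetical; this illustrates the idea and is not how RepliByte implements it.

```python
import random

def subset_with_children(users, orders, fraction, seed=0):
    """Sample a fraction of users, then keep only the orders that reference them."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    keep_count = max(1, round(len(users) * fraction))
    kept_users = rng.sample(users, keep_count)
    kept_ids = {user["id"] for user in kept_users}
    # Filtering children by the sampled parent ids is what keeps the
    # orders.user_id -> users.id foreign key valid in the subset.
    kept_orders = [order for order in orders if order["user_id"] in kept_ids]
    return kept_users, kept_orders

# Hypothetical data: 100 users with exactly 5 orders each.
users = [{"id": i} for i in range(1, 101)]
orders = [{"id": n, "user_id": (n % 100) + 1} for n in range(1, 501)]

kept_users, kept_orders = subset_with_children(users, orders, fraction=0.05)
```

Sampling `orders` independently at 5% would instead leave most kept orders pointing at users that are no longer in the subset.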

One common use case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This saves costs and, when combined with PII removal, can noticeably improve developer productivity. Another use case is copying specific rows from one database into another while maintaining referential integrity.
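The second use case, copying specific rows while keeping references valid, amounts to following foreign keys from the chosen rows to every row they depend on. A small sketch, assuming an in-memory stand-in for the database; the table names, the `foreign_keys` mapping, and the `rows_with_parents` helper are hypothetical:

```python
def rows_with_parents(db, foreign_keys, table, ids):
    """Collect the given rows plus every row they reference, transitively."""
    selected = {}  # table name -> set of primary keys to copy
    pending = [(table, pk) for pk in ids]
    while pending:
        tbl, pk = pending.pop()
        if pk in selected.setdefault(tbl, set()):
            continue  # already scheduled for copying
        selected[tbl].add(pk)
        row = db[tbl][pk]
        # Follow each foreign key column of this row to its parent row.
        for column, parent_table in foreign_keys.get(tbl, {}).items():
            parent_pk = row[column]
            if parent_pk is not None:
                pending.append((parent_table, parent_pk))
    return selected

# Hypothetical schema: orders -> users -> countries.
db = {
    "countries": {1: {"name": "FR"}, 2: {"name": "US"}},
    "users": {10: {"country_id": 1}, 11: {"country_id": 2}},
    "orders": {100: {"user_id": 10}, 101: {"user_id": 11}},
}
foreign_keys = {
    "users": {"country_id": "countries"},
    "orders": {"user_id": "users"},
}

selected = rows_with_parents(db, foreign_keys, "orders", [100])
```

Copying order 100 therefore also pulls in user 10 and country 1, and nothing else.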

As discussed on Discord, database subsetting would be very valuable for restoring a subset of a production database for development purposes. For example, developers from growing companies are interested in using RepliByte with databases holding terabytes of data 😮. Subsetting is necessary for very large databases: it does not make sense to re-import terabytes of data for development purposes.

In this issue, I propose that we work together to design the "Database Subsetting" feature.

References

Here are some must-read references about database subsetting:

I recommend reading them. They are full of information.

Design references

I am going to take some time to dig into Condenser (Tonic.ai's open-source Python subsetting tool) and suggest a starting implementation. I'll keep you posted.

Design proposal

sequenceDiagram
participant RepliByte
participant PostgreSQL (Source)
participant AWS S3 (Bridge)
PostgreSQL (Source)->>RepliByte: Dump data
loop
    RepliByte->>RepliByte: a. Get database schema and table relationships
    RepliByte->>RepliByte: b. Support virtual relationships
    RepliByte->>RepliByte: c. Take x% of the rows of the reference table
end
loop
    RepliByte->>RepliByte: Hide/fake sensitive data
    RepliByte->>RepliByte: Compress data
    RepliByte->>RepliByte: Encrypt data
end
RepliByte->>AWS S3 (Bridge): Upload data
RepliByte->>AWS S3 (Bridge): Write index file
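
Steps a and b of the diagram above could be sketched as follows: merge the foreign keys read from the schema with user-declared "virtual" relationships (references that exist in the data but are not enforced by a constraint), then order the tables so that every referenced table is handled before the tables that depend on it. This is only an illustration in Python using the standard-library `graphlib` module (Python 3.9+); the table names and the `table_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def table_order(foreign_keys, virtual_relationships=()):
    """Return tables ordered so every referenced table comes before its dependents."""
    graph = {}  # table -> set of tables it references
    for child, parent in list(foreign_keys) + list(virtual_relationships):
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set())
    # static_order() yields a table only after all the tables it references.
    return list(TopologicalSorter(graph).static_order())

# Hypothetical relationships as (child_table, parent_table) pairs.
foreign_keys = [("orders", "users"), ("order_items", "orders")]
virtual = [("audit_log", "users")]  # declared by the user, not by the schema

order = table_order(foreign_keys, virtual)
```

Note that `TopologicalSorter` raises `CycleError` on circular or self-referencing relationships, so a real implementation would need a separate strategy for those cases.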

evoxmusic added the feature New feature request label Apr 2, 2022

evoxmusic linked a pull request Apr 5, 2022 that will close this issue

feat: Database subset #44

Merged

evoxmusic mentioned this issue Apr 10, 2022

feat: Database subset #44

Merged

evoxmusic closed this as completed in #44 Apr 16, 2022

This was referenced Apr 29, 2022

Feature: implement database subseting for MySQL #69

Open

Feature: implement database subseting for MongoDB #70

Open
