You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users. If you do this naively, e.g., just grab 5% of all the tables in your database, most likely, your database will break foreign key constraints. At best, you’ll end up with a statistically non-representative data sample.
One common use case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.
As discussed on Discord, database subsetting will be super valuable to restore a subset of a production database for a development purpose. E.g. developers from growing companies are interested in using RepliByte for their database with TBs of data 😮 Subsetting a database is needed for very large DBs. It even does not make any sense to try to re-import a DB with TB of data for development purposes.
In this issue, I propose that we work together in designing the "Database Subsetting" feature.
References
Here are some must-read references about database subsetting:
I recommend reading them. They are full of information.
Design references
I am going to take some time digging into Condenser (OSS Tonic.ai subsetting python tool) to suggest a starting implementation. I keep you posted.
Design proposal
sequenceDiagram
participant RepliByte
participant PostgreSQL (Source)
participant AWS S3 (Bridge)
PostgreSQL (Source)->>RepliByte: Dump data
loop
RepliByte->>RepliByte: a. Get database schema and tables relationships
RepliByte->>RepliByte: b. Support virtual relationships
RepliByte->>RepliByte: c. Take x% rows of the ref table
end
loop
RepliByte->>RepliByte: Hide/fake sensitive data
RepliByte->>RepliByte: Compress data
RepliByte->>RepliByte: Encrypt data
end
RepliByte->>AWS S3 (Bridge): Upload data
RepliByte->>AWS S3 (Bridge): 6. Write index file
The text was updated successfully, but these errors were encountered:
What is Subsetting
From Tonic.ai
As discussed on Discord, database subsetting will be super valuable to restore a subset of a production database for a development purpose. E.g. developers from growing companies are interested in using RepliByte for their database with TBs of data 😮 Subsetting a database is needed for very large DBs. It even does not make any sense to try to re-import a DB with TB of data for development purposes.
In this issue, I propose that we work together in designing the "Database Subsetting" feature.
References
Here are some must-read references about database subsetting:
I recommend reading them. They are full of information.
Design references
I am going to take some time digging into Condenser (OSS Tonic.ai subsetting python tool) to suggest a starting implementation. I keep you posted.
Design proposal
The text was updated successfully, but these errors were encountered: