
Support incremental backup #1248

Closed
v0y4g3r opened this issue Mar 27, 2023 · 5 comments

v0y4g3r commented Mar 27, 2023

What problem does the new feature solve?

Given that all tables in GreptimeDB contain a timestamp column, we can allow users to back up the data of a database within a specified time range to a directory, with one file per table.

What does the feature do?

Implement some SQL syntax like:

COPY DATABASE <DATABASE_NAME> [FROM <START_TIME>] [UNTIL <END_TIME>] TO <TARGET_DIR> [WITH (<COPY OPTIONS>)]

which exports the rows within the given time range of all tables in that database to the target directory. All exported rows of one table will reside in the same Parquet file.

Or maybe we can skip SQL and use the HTTP admin API first for a prototype.
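
For illustration (the database name, timestamps, and target path below are made-up values, not a finalized syntax), a one-month backup could look like:

COPY DATABASE my_db FROM '2023-01-01 00:00:00' UNTIL '2023-02-01 00:00:00' TO '/backup/my_db/'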

Implementation challenges

  1. Upgrade opendal so that its Writer support can simplify the streaming Parquet writer implementation (feat: upgrade opendal #1245).
  2. Change ParquetWriter into a streaming writer that does not accumulate the whole Parquet file in memory and write it to the underlying storage in one shot, which could cause huge memory consumption in this case (feat: buffered parquet writer #1263); see the streaming sketch after this list. The current approach buffers the entire file in memory:
    let mut buf = vec![];
    let mut arrow_writer = ArrowWriter::try_new(&mut buf, schema.clone(), Some(writer_props))
        .context(WriteParquetSnafu)?;
  3. Implement the syntax parser for COPY DATABASE, or the HTTP API handler.
  4. Implement COPY DATABASE by iterating over all tables in the database and copying the content of each table into a Parquet file in the target directory. Maybe we don't need to compress the target directory.
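
For item 2, here is a minimal sketch of the streaming approach, assuming the parquet crate's ArrowWriter: the writer targets a small shared in-memory buffer that is drained to the destination whenever it grows past a threshold, so only roughly one row group is held in memory at a time. SharedBuffer, write_streaming, and threshold are illustrative names for this sketch, not GreptimeDB's actual implementation.

    use std::io::Write;
    use std::sync::{Arc, Mutex};

    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    /// In-memory buffer shared between the ArrowWriter (which needs std::io::Write)
    /// and the code that drains finished bytes to the destination.
    #[derive(Clone, Default)]
    struct SharedBuffer(Arc<Mutex<Vec<u8>>>);

    impl Write for SharedBuffer {
        fn write(&mut self, data: &[u8]) -> std::io::Result<usize> {
            self.0.lock().unwrap().extend_from_slice(data);
            Ok(data.len())
        }

        fn flush(&mut self) -> std::io::Result<()> {
            Ok(())
        }
    }

    /// Writes `batches` as one Parquet file to `sink` (standing in for the
    /// object-store writer), draining the shared buffer whenever it exceeds
    /// `threshold` bytes instead of holding the whole file in memory.
    fn write_streaming(
        batches: &[RecordBatch],
        sink: &mut dyn Write,
        threshold: usize,
    ) -> Result<(), Box<dyn std::error::Error>> {
        let buffer = SharedBuffer::default();
        let schema = batches[0].schema();
        let mut writer = ArrowWriter::try_new(buffer.clone(), schema, None)?;

        for batch in batches {
            writer.write(batch)?;
            // Close the current row group so its encoded bytes land in the buffer.
            // (A real implementation would flush less eagerly to keep row groups large.)
            writer.flush()?;
            let mut buf = buffer.0.lock().unwrap();
            if buf.len() >= threshold {
                sink.write_all(&buf)?;
                // ArrowWriter tracks absolute byte offsets itself, so draining is safe.
                buf.clear();
            }
        }

        // close() writes the footer into the buffer; drain the remainder.
        writer.close()?;
        let buf = buffer.0.lock().unwrap();
        sink.write_all(&buf)?;
        Ok(())
    }
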
v0y4g3r added the C-feature Category Features label Mar 27, 2023
v0y4g3r mentioned this issue Mar 27, 2023

evenyag commented Mar 27, 2023

RocksDB's backup support: How to backup RocksDB

But COPY DATABASE is different from BACKUP DATABASE as COPY is much simpler. We might also need to write some metadata to the target directory (to store the start/end time).


v0y4g3r commented Mar 27, 2023

> RocksDB's backup support: How to backup RocksDB
>
> But COPY DATABASE is different from BACKUP DATABASE as COPY is much simpler.

Backup also involves backing up manifest files etc.

> We might also need to write some metadata to the target directory (to store the start/end time).

The necessary metadata I can come up with:

  • catalog/schema/table name
  • data time range
  • backup time

v0y4g3r self-assigned this Mar 27, 2023

sunng87 commented Mar 28, 2023

Can we use parquet metadata https://parquet.apache.org/docs/file-format/metadata/ for our metadata? Using fewer files reduces the chance of corrupted data.


v0y4g3r commented Mar 29, 2023

> Can we use parquet metadata https://parquet.apache.org/docs/file-format/metadata/ for our metadata? Using fewer files reduces the chance of corrupted data.

If "our metdata" refers to catalog/schema/table name, data time range and backup time, yes, we are going to write these to parquet footer's metadata section, juts like arrow does. We don't have a separate metadata file now.

fengjiachun added this to the v0.3 milestone Apr 12, 2023
killme2008 commented

Closed via #1240
