35 changes: 17 additions & 18 deletions docs/data-modeling/backfilling.md
@@ -5,6 +5,8 @@ description: How to backfill large datasets in ClickHouse
keywords: [materialized views, backfilling, inserting data, resilient data load]
---

import nullTableMV from '@site/static/images/data-modeling/null_table_mv.png';

# Backfilling Data

Whether new to ClickHouse or responsible for an existing deployment, users will invariably need to backfill tables with historical data. In some cases, this is relatively simple but can become more complex when materialized views need to be populated. This guide documents some processes for this task that users can apply to their use case.
@@ -15,7 +17,7 @@ This guide assumes users are already familiar with the concept of [Incremental M

## Example dataset {#example-dataset}

Throughout this guide, we use a PyPI dataset. Each row in this dataset represents a Python package download using a tool such as `pip`.

The subset used here covers a single day - `2024-12-17` - and is available publicly at `https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/`. Users can query it with:
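
A minimal sketch of such a query using the `s3` table function - the `*.parquet` glob is an assumption about how the sample files are laid out:

```sql
-- Illustrative only: count downloads per project in the public one-day sample.
-- The *.parquet glob assumes the sample is published as Parquet files.
SELECT
    project,
    count() AS count
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/*.parquet')
GROUP BY project
ORDER BY count DESC
LIMIT 10
```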

@@ -66,12 +68,12 @@ The full PyPI dataset, consisting of over 1 trillion rows, is available in our p

## Backfilling scenarios {#backfilling-scenarios}

Backfilling is typically needed when a stream of data is being consumed from a point in time. This data is being inserted into ClickHouse tables with [incremental materialized views](/materialized-view/incremental-materialized-view), triggering on blocks as they are inserted. These views may be transforming the data prior to insert or computing aggregates and sending results to target tables for later use in downstream applications.

We will attempt to cover the following scenarios:

1. **Backfilling data with existing data ingestion** - New data is being loaded, and historical data needs to be backfilled. This historical data has been identified.
2. **Adding materialized views to existing tables** - New materialized views need to be added to a setup for which historical data has been populated and data is already streaming.

We assume data will be backfilled from object storage. In all cases, we aim to avoid pauses in data insertion.

@@ -141,7 +143,7 @@ FROM pypi_downloads
Peak memory usage: 682.38 KiB.
```

Suppose we wish to load another subset `{101..200}`. While we could insert directly into `pypi`, we can do this backfill in isolation by creating duplicate tables.

Should the backfill fail, we have not impacted our main tables and can simply [truncate](/managing-data/truncate) our duplicate tables and repeat.
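
A sketch of this isolation pattern - the `file_{101..200}.parquet` naming is purely illustrative, and `tuple()` assumes `pypi` has no partition key (the materialized view's target table can be duplicated in the same way):

```sql
-- Duplicate the main table's schema and engine.
CREATE TABLE pypi_v2 AS pypi;

-- Load the new subset into the duplicate; on failure, TRUNCATE pypi_v2 and retry.
INSERT INTO pypi_v2 SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/file_{101..200}.parquet');

-- On success, attach the backfilled data to the main table.
ALTER TABLE pypi_v2 MOVE PARTITION tuple() TO TABLE pypi;
```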

@@ -236,9 +238,9 @@ FROM pypi_v2

Importantly, the `MOVE PARTITION` operation is both lightweight (exploiting hard links) and atomic, i.e. it either fails or succeeds with no intermediate state.

We exploit this process heavily in our backfilling scenarios below.

Notice how this process requires users to choose the size of each insert operation.

Larger inserts, i.e. more rows, mean fewer `MOVE PARTITION` operations are required. However, this must be balanced against the cost of recovering from an insert failure, e.g. due to a network interruption. Users can complement this process by batching files to reduce that risk. Batching can be performed with either range queries, e.g. `WHERE timestamp BETWEEN '2024-12-17 09:00:00' AND '2024-12-17 10:00:00'`, or glob patterns. For example, each batch might cover a single hour, as sketched below.
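
A minimal sketch of one such hourly batch - the bounds are illustrative and would be advanced for each subsequent insert:

```sql
-- Illustrative single batch; repeat with successive one-hour ranges (or file globs).
INSERT INTO pypi_v2 SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/*.parquet')
WHERE timestamp BETWEEN '2024-12-17 09:00:00' AND '2024-12-17 10:00:00'
```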

@@ -258,7 +260,7 @@ ClickPipes uses this approach when loading data from object storage, automatical

## Scenario 1: Backfilling data with existing data ingestion {#scenario-1-backfilling-data-with-existing-data-ingestion}

In this scenario, we assume that the data to backfill is not in an isolated bucket and thus filtering is required. Data is already being inserted, and a timestamp or monotonically increasing column can be identified from which historical data needs to be backfilled.

This process follows these steps:

@@ -317,7 +319,7 @@ ALTER TABLE pypi_downloads
If the historical data is in an isolated bucket, the above time filter is not required. If a time or monotonic column is unavailable, isolate your historical data.

:::note Just use ClickPipes in ClickHouse Cloud
ClickHouse Cloud users should use ClickPipes for restoring historical backups if the data can be isolated in its own bucket (and a filter is not required). As well as parallelizing the load with multiple workers, thus reducing the load time, ClickPipes automates the above process - creating duplicate tables for both the main table and materialized views.
:::

## Scenario 2: Adding materialized views to existing tables {#scenario-2-adding-materialized-views-to-existing-tables}
@@ -339,7 +341,7 @@ Our simplest approach involves the following steps:

This can be further enhanced to target subsets of data in step (2) and/or use a duplicate target table for the materialized view (attach partitions to the original once the insert is complete) for easier recovery after failure.

Consider the following materialized view, which computes the most popular projects per hour.

```sql
CREATE TABLE pypi_downloads_per_day
@@ -372,7 +374,7 @@ AS SELECT
project, count() AS count
FROM pypi WHERE timestamp >= '2024-12-17 09:00:00'
GROUP BY hour, project
```

Once this view is added, we can backfill the materialized view with all data prior to this point.
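
A sketch of that backfill - it assumes the `pypi_downloads_per_day` schema shown above and simply runs the view's query over the historical rows with the time filter inverted:

```sql
-- Sketch: aggregate everything ingested before the view's cut-off directly into the target table.
INSERT INTO pypi_downloads_per_day
SELECT
    toStartOfHour(timestamp) AS hour,
    project,
    count() AS count
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
GROUP BY hour, project
```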

@@ -403,7 +405,7 @@ In our case, this is a relatively lightweight aggregation that completes in unde

Often a materialized view's query is more complex (not uncommon, as otherwise users wouldn't use a view!) and consumes significant resources. In rarer cases, the resources required for the query exceed those of the server. This highlights one of the advantages of ClickHouse materialized views - they are incremental and don't process the entire dataset in one go!

In this case, users have several options:

1. Modify your query to backfill ranges, e.g. `WHERE timestamp BETWEEN '2024-12-17 08:00:00' AND '2024-12-17 09:00:00'`, `WHERE timestamp BETWEEN '2024-12-17 07:00:00' AND '2024-12-17 08:00:00'`, etc.
2. Use a [Null table engine](/engines/table-engines/special/null) to fill the materialized view. This replicates the typical incremental population of a materialized view, executing its query over blocks of data (of a configurable size).
@@ -418,10 +420,7 @@ The [Null table engine](/engines/table-engines/special/null) provides a storage

Importantly, any materialized views attached to the table engine still execute over blocks of data as it is inserted - sending their results to a target table. These blocks are of a configurable size. While larger blocks can potentially be more efficient (and faster to process), they consume more resources (principally memory). Use of this table engine means we can build our materialized view incrementally, i.e. a block at a time, avoiding the need to hold the entire aggregation in memory.
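
A minimal sketch of this pattern, matching the tables described below - the column types and the view name `pypi_downloads_per_day_mv` are illustrative:

```sql
-- A Null "staging" table: rows inserted here are discarded once attached views have processed them.
CREATE TABLE pypi_v2
(
    timestamp DateTime,
    project String
)
ENGINE = Null;

-- The view aggregates each inserted block and writes the results to the real target table.
CREATE MATERIALIZED VIEW pypi_downloads_per_day_mv TO pypi_downloads_per_day
AS SELECT
    toStartOfHour(timestamp) AS hour,
    project,
    count() AS count
FROM pypi_v2
GROUP BY hour, project;
```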

<img src={require('./images/null_table_mv.png').default}
class='image'
alt='Denormalization in ClickHouse'
style={{width: '50%', background: 'none' }} />
<img src={nullTableMV} class="image" alt="Backfilling a materialized view using a Null table engine" style={{width: '50%', background: 'none'}} />

<br />

@@ -449,7 +448,7 @@ GROUP BY
Here, we create a Null table, `pypi_v2`, to receive the rows that will be used to build our materialized view. Note how we limit the schema to only the columns we need. Our materialized view performs an aggregation over rows inserted into this table (one block at a time), sending the results to our target table, `pypi_downloads_per_day`.

:::note
We have used `pypi_downloads_per_day` as our target table here. For additional resiliency, users could create a duplicate table, `pypi_downloads_per_day_v2`, and use this as the target table of the view, as shown in previous examples. On completion of the insert, partitions in `pypi_downloads_per_day_v2` could, in turn, be moved to `pypi_downloads_per_day`. This would allow recovery in case our insert fails due to memory issues or server interruptions, i.e. we simply truncate `pypi_downloads_per_day_v2`, tune settings, and retry.
:::

To populate this materialized view, we simply insert the relevant data to backfill into `pypi_v2` from `pypi`.
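
A sketch of that insert, reusing the cut-off from earlier - rows flow through the Null table a block at a time, so only each block's aggregation state is held in memory:

```sql
-- Sketch: stream the historical rows through the Null table so the view processes them block by block.
INSERT INTO pypi_v2
SELECT timestamp, project
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
```
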
@@ -467,8 +466,8 @@ Notice our memory usage here is `639.47 MiB`.

Several factors will determine the performance and resources used in the above scenario. We recommend readers understand the insert mechanics documented in detail [here](/integrations/s3/performance#using-threads-for-reads) before attempting to tune; a sketch showing how these settings can be applied to an insert follows the list below. In summary:

- **Read Parallelism** - The number of threads used to read. Controlled through [`max_threads`](/operations/settings/settings#max_threads). In ClickHouse Cloud, this is determined by the instance size, defaulting to the number of vCPUs. Increasing this value may improve read performance at the expense of greater memory usage.
- **Insert Parallelism** - The number of threads used to insert. Controlled through [`max_insert_threads`](/operations/settings/settings#max_insert_threads). In ClickHouse Cloud, this is determined by the instance size (between 2 and 4); it is set to 1 in OSS. Increasing this value may improve performance at the expense of greater memory usage.
- **Insert Block Size** - Data is processed in a loop where it is pulled, parsed, and formed into in-memory insert blocks based on the [partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key). These blocks are sorted, optimized, compressed, and written to storage as new [data parts](/parts). The size of the insert block, controlled by settings [`min_insert_block_size_rows`](/operations/settings/settings#min_insert_block_size_rows) and [`min_insert_block_size_bytes`](/operations/settings/settings#min_insert_block_size_bytes) (uncompressed), impacts memory usage and disk I/O. Larger blocks use more memory but create fewer parts, reducing I/O and background merges. These settings represent minimum thresholds (whichever is reached first triggers a flush).
- **Materialized view block size** - As well as the above mechanics for the main insert, prior to insertion into materialized views, blocks are also squashed for more efficient processing. The size of these blocks is determined by the settings [`min_insert_block_size_bytes_for_materialized_views`](/operations/settings/settings#min_insert_block_size_bytes_for_materialized_views) and [`min_insert_block_size_rows_for_materialized_views`](/operations/settings/settings#min_insert_block_size_rows_for_materialized_views). Larger blocks allow more efficient processing at the expense of greater memory usage. By default, these settings revert to the values of the source table settings [`min_insert_block_size_rows`](/operations/settings/settings#min_insert_block_size_rows) and [`min_insert_block_size_bytes`](/operations/settings/settings#min_insert_block_size_bytes), respectively.
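
As a hedged sketch, these settings can be applied per query on the backfill insert itself - the values below are illustrative starting points, not recommendations:

```sql
-- Illustrative only: tune the values to the memory available on your instance.
INSERT INTO pypi_v2
SELECT timestamp, project
FROM pypi
WHERE timestamp < '2024-12-17 09:00:00'
SETTINGS max_threads = 8, max_insert_threads = 4, min_insert_block_size_rows = 10000000
```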

13 changes: 5 additions & 8 deletions docs/data-modeling/denormalization.md
@@ -5,6 +5,9 @@ description: How to use denormalization to improve query performance
keywords: [data denormalization, denormalize, query optimization]
---

import denormalizationDiagram from '@site/static/images/data-modeling/denormalization-diagram.png';
import denormalizationSchema from '@site/static/images/data-modeling/denormalization-schema.png';

# Denormalizing Data

Data denormalization is a technique in ClickHouse that uses flattened tables to help minimize query latency by avoiding joins.
@@ -15,10 +18,7 @@ Denormalizing data involves intentionally reversing the normalization process to

This process reduces the need for complex joins at query time and can significantly speed up read operations, making it ideal for applications with heavy read requirements and complex queries. However, it can increase the complexity of write operations and maintenance, as any changes to the duplicated data must be propagated across all instances to maintain consistency.

<img src={require('./images/denormalization-diagram.png').default}
class='image'
alt='Denormalization in ClickHouse'
style={{width: '100%', background: 'none' }} />
<img src={denormalizationDiagram} class="image" alt="Denormalization in ClickHouse" style={{width: '100%', background: 'none'}} />

<br />

@@ -131,10 +131,7 @@ The main observation here is that aggregated vote statistics for each post would

Now let's consider our `Users` and `Badges`:

<img src={require('./images/denormalization-schema.png').default}
class='image'
alt='Users and Badges schema'
style={{width: '100%', background: 'none' }} />
<img src={denormalizationSchema} class="image" alt="Users and Badges schema" style={{width: '100%', background: 'none'}} />

<p></p>
We first insert the data with the following command:
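
A hedged sketch of what that load might look like - the table names and the S3 paths for the Parquet exports are assumptions, not the documented commands:

```sql
-- Illustrative only: the bucket layout (stackoverflow/parquet/...) and table names are assumptions.
INSERT INTO users SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/users.parquet');

INSERT INTO badges SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/badges.parquet');
```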