Query cache on disk (RFC) #58228

alexey-milovidov · 2023-12-26T14:12:39Z

On-disk configuration for query cache can be provided in addition to the already existing in-memory configuration.

The on-disk cache works almost independently of the in-memory cache. Writes are write-through: when data is written to the cache, it is written to both. Lookups check memory first, then disk. If the data is found on disk, it is also put in memory.
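
A minimal sketch of the write-through and lookup order described above, not the actual implementation. The class name `TwoTierCache` and its methods are hypothetical, and a plain dict stands in for the on-disk store to keep the example short.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TwoTierCache:
    """Hypothetical sketch: an in-memory LRU tier in front of a disk tier."""

    def __init__(self, max_memory_entries: int) -> None:
        self.max_memory_entries = max_memory_entries
        self.memory: "OrderedDict[str, bytes]" = OrderedDict()
        # Assumption: a dict stands in for the on-disk store.
        self.disk: Dict[str, bytes] = {}

    def _put_memory(self, key: str, value: bytes) -> None:
        self.memory[key] = value
        self.memory.move_to_end(key)
        # Evict least recently used entries once the memory tier is over its limit.
        while len(self.memory) > self.max_memory_entries:
            self.memory.popitem(last=False)

    def put(self, key: str, value: bytes) -> None:
        # Write-through: every write goes to both tiers.
        self._put_memory(key, value)
        self.disk[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Memory is searched first.
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        # Then disk; a disk hit is also put in memory.
        value = self.disk.get(key)
        if value is not None:
            self._put_memory(key, value)
        return value
```

With `max_memory_entries=0` the memory tier never retains anything, which approximates the "on-disk only" setup.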

Therefore, the on-disk cache can have independent configurations of max size, max elements, max element size, compression method, compression level, etc.

The on-disk cache is organized as a directory with subdirectories named after the first letters of the hexadecimal cache key, and files named after the full hexadecimal cache keys, similar to the filesystem cache. Files are removed by the eviction policy; subdirectories are not.
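
To illustrate the layout, here is a small sketch that maps a cache key to its file path. The function name `cache_entry_path` and the prefix length of 3 are assumptions; the text above only says "first letters".

```python
from pathlib import Path

# Assumption: the number of leading hex characters used for the subdirectory.
PREFIX_LENGTH = 3


def cache_entry_path(root: Path, key_hex: str, prefix_length: int = PREFIX_LENGTH) -> Path:
    """Return <root>/<first letters of key>/<full key> for a hexadecimal cache key."""
    key_hex = key_hex.lower()
    if len(key_hex) < prefix_length:
        raise ValueError("cache key is shorter than the directory prefix")
    if any(c not in "0123456789abcdef" for c in key_hex):
        raise ValueError("cache key is not hexadecimal")
    return root / key_hex[:prefix_length] / key_hex
```

For example, the key `a1b2c3d4` under the root `query_cache` would live at `query_cache/a1b/a1b2c3d4`.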

It should be possible to have no in-memory cache, but only on-disk cache.

The on-disk cache can be configured with custom disks (e.g., s3_plain), and allows co-locating inside the filesystem cache, similar to how it is done for temporary data on disk.

The metadata (the set of keys existing in the on-disk cache) is loaded into memory at server startup, and the LRU information is maintained in memory. The eviction policy's state (last access times) is not persisted to disk, so it is lost when the server restarts.
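
A rough sketch of the startup scan, assuming the directory layout described earlier (one level of subdirectories, files named after cache keys). The names `load_metadata` and `touch` are hypothetical. Since access times are not persisted, the initial LRU order here is simply the sorted file order.

```python
from collections import OrderedDict
from pathlib import Path


def load_metadata(root: Path) -> "OrderedDict[str, int]":
    """Rebuild the in-memory index (cache key -> file size) from the cache directory."""
    index: "OrderedDict[str, int]" = OrderedDict()
    if not root.is_dir():
        return index
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        for entry in sorted(p for p in subdir.iterdir() if p.is_file()):
            # The file name is the hexadecimal cache key.
            index[entry.name] = entry.stat().st_size
    return index


def touch(index: "OrderedDict[str, int]", key: str) -> None:
    """Mark a key as most recently used; this order lives in memory only."""
    if key in index:
        index.move_to_end(key)
```

The first entry of `index` is the next eviction candidate; after a restart that candidate is arbitrary with respect to real access history.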

The server may additionally check (during the lookup) if the file suddenly appeared on disk - to support scenarios with a shared disk space between multiple servers.

If there is an exception related to a wrong file size (empty) or a wrong checksum during reading, the cache entry should be discarded.
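
A sketch of the discard-on-corruption behavior. The file format is an assumption made for illustration only: a 4-byte big-endian CRC32 of the payload, followed by the payload. The RFC does not specify the checksum algorithm or layout, and `write_entry` / `read_entry` are hypothetical names.

```python
import struct
import zlib
from pathlib import Path
from typing import Optional

# Assumed header: CRC32 of the payload as a 4-byte big-endian unsigned integer.
HEADER = struct.Struct(">I")


def write_entry(path: Path, payload: bytes) -> None:
    path.write_bytes(HEADER.pack(zlib.crc32(payload)) + payload)


def read_entry(path: Path) -> Optional[bytes]:
    """Return the payload, or None (discarding the file) if the entry is invalid."""
    try:
        raw = path.read_bytes()
    except FileNotFoundError:
        return None
    # Wrong file size: empty or too short to hold the header.
    if len(raw) < HEADER.size:
        path.unlink(missing_ok=True)
        return None
    (expected,) = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    # Wrong checksum: treat as a cache miss and remove the entry.
    if zlib.crc32(payload) != expected:
        path.unlink(missing_ok=True)
        return None
    return payload
```

A discarded entry behaves like a cache miss, so the query is simply executed again and the result rewritten.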

srikanthccv · 2023-12-26T15:35:20Z

Is partial result caching within the scope of the RFC? In monitoring use cases, there is a consistent pattern where records are time-bound and progress into the future. The error rate between two timestamps remains the same (though it might be desirable to prevent caching of very recent results that might still be in flux, which could be made possible with a cache-freshness configuration). Avoiding recomputation speeds up dashboard load times and reduces processing load. Today, this can be achieved by external systems, but it would be great if ClickHouse supported it.

alexey-milovidov · 2023-12-26T16:57:47Z

@srikanthccv, this RFC does not include partial caching - it is independent.
But we also have this: #57490

alexey-milovidov added the feature label Dec 26, 2023

rschu1ze self-assigned this Apr 11, 2024

rschu1ze linked a pull request Apr 28, 2024 that will close this issue

*wip* Query cache persistence #63091

Draft
