Query cache on disk (RFC) #58228

Open
alexey-milovidov opened this issue Dec 26, 2023 · 2 comments · May be fixed by #63091

@alexey-milovidov
Member

alexey-milovidov commented Dec 26, 2023

An on-disk configuration for the query cache can be provided in addition to the existing in-memory configuration.

The on-disk cache works almost independently of the in-memory cache. When data is written to the cache, it is written to both of them (write-through). On lookup, data is searched in memory first, then on disk. If the data is found on disk, it is also put in memory.
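A minimal sketch of the write and lookup paths described above, in Python for brevity. The class name and the use of plain dicts for both tiers are assumptions for illustration only, not the proposed implementation:

```python
from typing import Dict, Optional


class TwoTierCache:
    """Hypothetical two-tier cache; plain dicts stand in for memory and disk."""

    def __init__(self) -> None:
        self.memory: Dict[str, bytes] = {}
        self.disk: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Write-through: every write goes to both tiers.
        self.memory[key] = value
        self.disk[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Search memory first.
        if key in self.memory:
            return self.memory[key]
        # Then search disk; a disk hit is also put in memory.
        if key in self.disk:
            value = self.disk[key]
            self.memory[key] = value
            return value
        return None
```

Because the two tiers are separate objects, each one could apply its own size limits and eviction independently.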

Therefore, the on-disk cache can have independent configurations of max size, max elements, max element size, compression method, compression level, etc.

The on-disk cache is organized as a directory with subdirectories named after the first letters of the hexadecimal cache key, and files named after the full hexadecimal cache keys, similar to the filesystem cache. The files are removed by the eviction policy, while the subdirectories are not.
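The layout could be sketched as follows. The prefix length of 3 is an assumption (the text above only says "first letters"), and `entry_path` and `evict` are hypothetical helper names:

```python
from pathlib import Path

PREFIX_LENGTH = 3  # assumption: the RFC does not fix the number of letters


def entry_path(cache_root: Path, key_hex: str) -> Path:
    """Return <root>/<first letters of key>/<key> for a hexadecimal key."""
    key_hex = key_hex.lower()
    if len(key_hex) < PREFIX_LENGTH or any(c not in "0123456789abcdef" for c in key_hex):
        raise ValueError(f"not a usable hexadecimal cache key: {key_hex!r}")
    return cache_root / key_hex[:PREFIX_LENGTH] / key_hex


def evict(cache_root: Path, key_hex: str) -> None:
    # Only the file is removed; the subdirectory is left in place.
    entry_path(cache_root, key_hex).unlink(missing_ok=True)
```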

It should be possible to have no in-memory cache, but only on-disk cache.

The on-disk cache can be configured with custom disks (e.g., s3_plain), and allows co-locating inside the filesystem cache, similar to how it is done for temporary data on disk.

The metadata (the set of keys existing in the on-disk cache) is loaded into memory at server startup, and the LRU information is maintained in memory. When the server is restarted, the information for the eviction policy (last access times) is not preserved, because it is not persisted on disk.
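A rough sketch of the startup scan and the in-memory LRU bookkeeping, assuming the `<root>/<prefix>/<key>` layout described earlier. All function names are hypothetical. Since access times are not stored on disk, the initial LRU order after a restart is arbitrary (here: path order):

```python
from collections import OrderedDict
from pathlib import Path
from typing import List


def load_index(cache_root: Path) -> "OrderedDict[str, int]":
    """Scan the cache directory and return key -> file size in bytes."""
    index: "OrderedDict[str, int]" = OrderedDict()
    for path in sorted(cache_root.glob("*/*")):
        if path.is_file():
            index[path.name] = path.stat().st_size
    return index


def touch(index: "OrderedDict[str, int]", key: str) -> None:
    # Mark the key as most recently used (kept in memory only).
    index.move_to_end(key)


def evict_until(index: "OrderedDict[str, int]", max_size_bytes: int) -> List[str]:
    """Drop least recently used keys until the total size fits the limit."""
    evicted = []
    while index and sum(index.values()) > max_size_bytes:
        key, _size = index.popitem(last=False)
        evicted.append(key)
    return evicted
```

The caller would delete the files for the returned keys; the index itself never touches the disk after the initial scan.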

The server may additionally check during lookup whether a file has appeared on disk, to support scenarios where disk space is shared between multiple servers.
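One way this extra check could look, as a sketch only: if the key is missing from the in-memory index, stat the expected file and adopt it if it exists. The prefix length, the `index` dict, and the function name are assumptions:

```python
from pathlib import Path
from typing import Dict, Optional

PREFIX_LENGTH = 3  # assumption: number of leading key letters per subdirectory


def find_entry(index: Dict[str, int], cache_root: Path, key_hex: str) -> Optional[Path]:
    """Return the path of a cached entry, or None if it does not exist."""
    path = cache_root / key_hex[:PREFIX_LENGTH] / key_hex
    if key_hex in index:
        return path
    # Unknown in memory: another server sharing the disk may have written it.
    try:
        size = path.stat().st_size
    except FileNotFoundError:
        return None
    index[key_hex] = size  # adopt the entry into the local metadata
    return path
```

The cost is one extra filesystem call per cache miss, which is presumably why the check is described as optional.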

If there is an exception related to a wrong file size (empty) or a wrong checksum during reading, the cache entry should be discarded.
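A sketch of the discard-on-corruption behavior. The file format here (a SHA-256 digest followed by the payload) is purely an assumption for illustration; the RFC does not specify the on-disk format or the checksum algorithm:

```python
import hashlib
from pathlib import Path
from typing import Optional

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def write_entry(path: Path, payload: bytes) -> None:
    # Hypothetical format: checksum first, then the payload.
    path.write_bytes(hashlib.sha256(payload).digest() + payload)


def read_entry(path: Path) -> Optional[bytes]:
    """Return the payload, or None after discarding a damaged entry."""
    try:
        data = path.read_bytes()
    except FileNotFoundError:
        return None
    stored, payload = data[:DIGEST_SIZE], data[DIGEST_SIZE:]
    if len(data) < DIGEST_SIZE or hashlib.sha256(payload).digest() != stored:
        # Wrong size (e.g. an empty file) or wrong checksum: drop the entry.
        path.unlink(missing_ok=True)
        return None
    return payload
```

Returning `None` makes a damaged entry look like an ordinary cache miss, so the query is simply executed again.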

@srikanthccv
Contributor

Is partial result caching within the scope of this RFC? In monitoring use cases, there is a consistent pattern where records are time-bound and progress into the future. The error rate between two past timestamps remains the same (it might be desirable to prevent caching of very recent results that might still be in flux, which could be made possible with a cache-freshness configuration). Avoiding recomputation helps dashboards load fast and reduces processing load. Today, this can be achieved with external systems, but it would be great if ClickHouse supported it.

@alexey-milovidov
Member Author

@srikanthccv, this RFC does not include partial caching; that is an independent feature.
But we also have this: #57490

@rschu1ze rschu1ze self-assigned this Apr 11, 2024
@rschu1ze rschu1ze linked a pull request Apr 28, 2024 that will close this issue