
Roadmap 2024 (discussion) #58392

Open
alexey-milovidov opened this issue Dec 31, 2023 · 42 comments
@alexey-milovidov
Member

alexey-milovidov commented Dec 31, 2023

This is ClickHouse roadmap 2024.
Descriptions and links are to be filled in.

This roadmap does not cover the tasks related to infrastructure, orchestration, documentation, marketing, external integrations, drivers, etc.

See also:

Roadmap 2023: #44767
Roadmap 2022: #32513
Roadmap 2021: #17623
Roadmap 2020: link

SQL Compatibility

✔️ Enable Analyzer by default
Non-constant CASE, non-constant IN
Remove old predicate pushdown mechanics
Correlated subqueries with decorrelation
Transforming anti-join: LEFT JOIN ... WHERE ... IS NULL to NOT IN
Deriving index condition from the right-hand side of INNER JOIN
JOINs reordering and extended pushdown
Time data type
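
As an illustration of the anti-join rewrite listed above (LEFT JOIN ... WHERE ... IS NULL to NOT IN), the two query forms below return the same rows. This is only a sketch of the equivalence using SQLite via Python, with hypothetical table names; it is not ClickHouse's actual transformation code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(id INTEGER, customer_id INTEGER);
CREATE TABLE blacklist(customer_id INTEGER);
INSERT INTO orders VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO blacklist VALUES (20);
""")

# Anti-join written as LEFT JOIN ... WHERE ... IS NULL
left_join = con.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN blacklist b ON o.customer_id = b.customer_id
    WHERE b.customer_id IS NULL
    ORDER BY o.id
""").fetchall()

# Equivalent NOT IN form (the rewrite is only valid when the
# subquery result contains no NULLs)
not_in = con.execute("""
    SELECT id FROM orders
    WHERE customer_id NOT IN (SELECT customer_id FROM blacklist)
    ORDER BY id
""").fetchall()

assert left_join == not_in == [(1,), (3,)]
```

The NULL caveat in the comment is why such rewrites are usually guarded by a not-NULL check on the join key.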

Data Storage

✔️ Userspace page cache
✔️ Adaptive mode for asynchronous inserts
✔️ Semistructured Data: Variant data type
Semistructured Data: Sharded Maps
Semistructured Data: JSON data type
Transactions for Replicated tables
Lightweight Updates v2
Uniform treatment of LowCardinality, Sparse, and Const columns
Settings to control the consistency of projections on updates
Replicated Catalog ☁️
On-disk storage for Keeper
Query cache on disk
✔️ Decoupling of object storages and metadata
Full-text indices (production readiness)
Vector search indices (production readiness)

Security, access control, and isolation

✔️ Definers (encapsulation of access control) for views
Warnings and limits on the number of database objects
Dynamic configuration of query handlers
JWT authentication ☁️
Data masking in row-level security ☁️
Secure storage for named collections ☁️
Cancellation points for long operations
Resource scheduler (continuation)

Query Processing

Parallel replicas with task callbacks (production readiness)
Parallel replicas with parallel distributed INSERT SELECT
Automatic usage of -Cluster table functions
Adaptive thresholds for data spilling on disk
Optimization with subcolumns by default

Interfaces & External Data

Support for Iceberg Data Catalog
Support for Hive-style partitioning
Explicit queries in external tables
Even simpler data upload
HTTP API for simple query construction
Unification of data lake and file-like functions

Testing & Hardening

Revive coverage
Fuzzer of data formats
Fuzzer of network protocols
Server-side AST query fuzzer
Generic fuzzer for query text
Randomization of DETACH/ATTACH in tests
Integration with SQLSmith
Embedded documentation

Experiments & Research

Multi-RAFT for Keeper
MaterializedPostgreSQL (production readiness)
SSH protocol for the server
Support for PromQL
Streaming queries
Freeform text format
Key-value data marts
Decouple of columns and buffers
Lazy reading of ranges
Instant attaching tables from backups
An object storage to borrow space from the filesystem cache
COW disks
ALTER PRIMARY KEY
Autocompletion with language models
Decentralized tables
Unique Key Constraint


The roadmap covers the top focus items for both external contributors and full-time ClickHouse employees.
The items marked with the ☁️ icon are meant for ClickHouse Cloud (proprietary).
Based on the results from previous years, we expect 50..80% completion of the roadmap.

@alanpaulkwan

alanpaulkwan commented Dec 31, 2023

"HTTP API for simple query construction": it would be really awesome if Python/R/DuckDB could read an arbitrary output / filtered table, like from S3 or just as a file download. Amazing.

Also glad to see join reordering on the list still.

@chenziliang
Contributor

Would like to see stream processing become mainstream in ClickHouse :)

@ahmed-adly-khalil

Would be great to see deeper NATS integration, mainly using JWT auth.

@ucasfl
Collaborator

ucasfl commented Jan 2, 2024

Support for Iceberg Data Catalog

Catalog is a good concept, but if we want to introduce catalogs into ClickHouse, we need to refactor the current metadata structure from Database -> Table to Catalog -> Database -> Table. Then all tables created with a built-in engine would live under an internal_catalog, and we could have iceberg_catalog, hive_catalog, and so on. I'm not sure it's a good idea.

Besides, we still face the difficulty that we don't have an Iceberg C++ API.

@olly-writes-code

Hi! Adding a request for more GIS features. One important one is the ability to transform a geometry to a specified SRID. Redshift docs here

@alexey-milovidov
Member Author

@ucasfl, we can support mapping a specific database from the Iceberg data catalog as a database in ClickHouse.
In this way, we don't have to map a whole catalog at once but allow doing it database by database.

@alexey-milovidov
Member Author

@ahmed-adly-khalil, while NATS is not on the main list (but nice to have), there are some items that we already started to do: #39459

@alexey-milovidov
Member Author

@alanpaulkwan, yes, in the simplest form it represents a table like a file: #46925, but it also allows customizing the result.

@jrdi
Contributor

jrdi commented Jan 2, 2024

Would be great to know if there are plans to keep working on improving zero-copy and cloud storage during 2024. I'm constantly seeing improvements and bug fixes, which is super good. Will we see zero-copy ready for production this year?

Decoupling of object storages and metadata

Does this mean moving metadata from disks to Keeper or any other shared store?

@alexey-milovidov
Member Author

alexey-milovidov commented Jan 2, 2024

@jrdi

Would be great to know if there are plans to keep working on improving zero-copy and Cloud Storage during 2024. I'm constantly seeing improvements and bug fixes. Will we see zero-copy ready for production this year?

We have to fix the issues in zero-copy replication because it is still tested in CI, and used in production on older services in ClickHouse Cloud. For example, issues like this are found: #58333. But the track record of zero-copy replication is not good, and we expect to stop using it, then remove it from CI, and keep it on life support without further changes.

Does this mean moving metadata from disks to Keeper or any other shared store?

This is #58357

We currently have the following metadata options:

  1. Metadata on local filesystem (s3).
  2. No separate metadata (s3_plain).
  3. Metadata in .index files in directories (web).
  4. Metadata in a backup.
  5. Metadata in Keeper (proprietary).
  6. More to come, e.g., a disk similar to s3_plain that allows directory renames (#58347).

And, we have the following object storage options:

  1. S3.
  2. HDFS.
  3. Azure.
  4. Web.
  5. Local filesystem.
  6. Borrowing space from the filesystem cache.

The task is to allow the cross-product of these options.

@chenziliang
Contributor

Would be great to see deeper NATS integration, mainly using JWT auth.

@ahmed-adly-khalil may I ask if you would like streaming processing / analytics against NATS via ClickHouse?

@jrdi
Contributor

jrdi commented Jan 2, 2024

Thanks, @alexey-milovidov!

But the track record of zero-copy replication is not good, and we expect to stop using it, then remove it from CI, and keep it on life support without further changes.

I can understand this decision, but it's a pity. It means that the open-source version won't have a production-ready way to separate compute and storage. Do you think this could change in the short/mid term? Even something like a plan, with ClickHouse's help and guidance, for improvements that external contributors could make sounds better than keeping the feature out of CI.

@alexey-milovidov
Member Author

It is not guaranteed and not in the plans, but we might have an implementation in the future; the only thing for sure is that it will not be based on zero-copy replication.

@bputt-e

bputt-e commented Jan 5, 2024

Unique Key Constraint would be great; it could remove the deduplication step in our processing pipeline.

@mbtolou

mbtolou commented Jan 7, 2024

Unique Key Constraint is a great idea.

@alanpaulkwan

alanpaulkwan commented Jan 7, 2024

I really like the unique key idea; I hope it can follow ReplacingMergeTree and allow the user to decide which row entry to keep. For me, the options seem to be (1) the incumbent data entry, (2) the newest data entry, or (3) an integer describing version priority. I've created arbitrary values to keep the "best value", which allows for non-standard logic like keeping the value that minimizes the difference between two timestamps with some case-by-case logic.
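
The version-priority option described above can be sketched as follows. This is a minimal Python illustration of ReplacingMergeTree(version)-style deduplication semantics, not ClickHouse's actual implementation; the row layout is hypothetical:

```python
def dedupe_by_version(rows, key_idx, version_idx):
    """Keep, for each key, the row with the highest version value,
    mimicking ReplacingMergeTree(version)-style deduplication."""
    best = {}
    for row in rows:
        k = row[key_idx]
        # Replace the stored row only if this one has a higher version.
        if k not in best or row[version_idx] > best[k][version_idx]:
            best[k] = row
    return sorted(best.values())

rows = [("k1", "old", 1), ("k1", "new", 2), ("k2", "only", 1)]
print(dedupe_by_version(rows, key_idx=0, version_idx=2))
# [('k1', 'new', 2), ('k2', 'only', 1)]
```

The "best value" logic mentioned above corresponds to choosing what the version column encodes; any total order over rows with the same key can be plugged into the comparison.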

@mwarkentin

@jrdi there's a proposal from Altinity here: #54644

@earlev4

earlev4 commented Jan 12, 2024

A huge thank you to ClickHouse team and all the contributors for the amazing work on ClickHouse! I sincerely appreciate it.

Just curious: the Roadmap 2023 mentioned a "Recursive CTE" task, but I do not see it mentioned in the Roadmap 2024. Are there plans to implement recursive CTEs in the future?

Thanks again!

@alexey-milovidov
Member Author

@earlev4, it was planned for the previous year, to come after enabling the Analyzer, but we didn't manage to enable the Analyzer on that schedule, so I've added it as the major item for 2024. I'm hesitant to add recursive CTEs to the list: we are considering them for implementation, but they are not among the main items.
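
For context, a recursive CTE is a WITH clause that references itself, which allows iterative queries such as sequence generation or hierarchy traversal. The sketch below only illustrates the semantics using SQLite via Python, since the feature under discussion is not in ClickHouse:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A recursive CTE: the CTE body selects from the CTE itself,
# repeating until the WHERE condition stops the recursion.
rows = con.execute("""
WITH RECURSIVE nums(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 5
)
SELECT n FROM nums
""").fetchall()

print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```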

@earlev4

earlev4 commented Jan 16, 2024

Thanks so much, @alexey-milovidov! I sincerely appreciate the detailed response. It is very helpful. I am very grateful to you and the team for ClickHouse!

@domainio

Hi, is there a plan to support writing to Apache Iceberg with the MERGE operation?

@zheyu001

zheyu001 commented Jan 23, 2024

Is there any chance of supporting Iceberg v2, or schema evolution?

@1392657590

Do you have time to resolve the following: in high-concurrency scenarios, the performance of ClickHouse Keeper is lower than that of ZooKeeper. We found this issue when replacing ZooKeeper with Keeper; the replacement plan has been temporarily suspended.

@JackyWoo
Contributor

@1392657590 maybe you can try RaftKeeper

@jiugem

jiugem commented Feb 4, 2024

I don't see MaterializedMySQL in the roadmap.

@zhanglistar
Contributor

zhanglistar commented Feb 5, 2024

@alexey-milovidov What about non-equi joins? Any plans? Thanks.

@chrisgoddard

Would be super excited to see production support for vector search indices, especially on ClickHouse Cloud. Every week it seems like there's a different vector database, and I can't wait until I can just use ClickHouse for everything.

@xevix

xevix commented Feb 16, 2024

Materialized CTEs would alleviate major performance bottlenecks for me by avoiding DB round trips to create many intermediate results in tables. DuckDB added this last year, which was great to see; I was wondering if ClickHouse was thinking about this as well: #53449.

Fantastic work on recent releases, with usability features like ORDER BY ALL making it faster to query things ad hoc, and tight S3 integration with system credentials just magically pulled in 🎉

@guoxiaolongzte
Contributor

Support for Hive style partitioning.

Does it support dynamic Hive partition writing?
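
For reference, Hive-style partitioning encodes partition key values in the directory layout as key=value path segments. A minimal sketch of how such paths are constructed (the bucket and table names are hypothetical, and this is not ClickHouse's writer code):

```python
from urllib.parse import quote

def hive_partition_path(base, partition_values):
    """Build a Hive-style path: one key=value directory per
    partition column, with values percent-encoded."""
    parts = [f"{k}={quote(str(v), safe='')}" for k, v in partition_values.items()]
    return "/".join([base.rstrip("/")] + parts)

path = hive_partition_path("s3://bucket/events", {"date": "2024-01-01", "country": "US"})
print(path)  # s3://bucket/events/date=2024-01-01/country=US
```

"Dynamic" partition writing then means routing each row to the path derived from its own partition column values, rather than writing to a single fixed partition.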

@guoxiaolongzte
Contributor

Support for Iceberg Data Catalog


@alexey-milovidov
When do we plan to support the Hive catalog and Hudi catalog?

@mingmwang

@1392657590 maybe you can try RaftKeeper

@JackyWoo

Do you have any benchmark data to share comparing the performance and throughput of ClickHouse Keeper and RaftKeeper?

@JackyWoo
Contributor

JackyWoo commented Feb 20, 2024

@mingmwang we haven't compared them yet; we have only compared RaftKeeper with ZooKeeper.

@softiger

@JackyWoo could you share the comparison results between ZooKeeper and RaftKeeper? I'm really interested in it, thanks!

@JackyWoo
Contributor

JackyWoo commented Feb 21, 2024

@softiger You can find it here. Let's talk about RaftKeeper here.

@immelnikoff

I would like to see binary search over pre-sorted arrays in the following functions:
has()
hasAny()
hasAll()
arrayIntersect()
In general, I would like to see accelerated functions for pre-sorted arrays.
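
A sketch of what binary-search-backed membership tests over a pre-sorted array could look like; this is plain Python using the standard bisect module to illustrate the O(log n) per-lookup idea, not ClickHouse internals:

```python
from bisect import bisect_left

def has_sorted(arr, x):
    """O(log n) membership test on a pre-sorted array, versus the
    O(n) linear scan a generic has() must perform."""
    i = bisect_left(arr, x)
    return i < len(arr) and arr[i] == x

def has_all_sorted(arr, needles):
    # True iff every needle occurs in the sorted array.
    return all(has_sorted(arr, x) for x in needles)

def has_any_sorted(arr, needles):
    # True iff at least one needle occurs in the sorted array.
    return any(has_sorted(arr, x) for x in needles)

arr = [1, 3, 5, 7]
print(has_all_sorted(arr, [3, 7]))   # True
print(has_any_sorted(arr, [2, 4]))   # False
```

The same idea extends to arrayIntersect(): when both arrays are sorted, a single merge-style pass replaces the quadratic pairwise comparison.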

@wordhardqi

ClickHouse could plan a query optimizer for complex queries.

@wordhardqi

StarRocks, Snowflake, and ByConity have one.

@Dileep-Dora

I see from the 2023 roadmap that the inverted indices implementation is not a priority (#38667).

Are we considering this for this year, or are there any other plans for improving text search performance?

@alexey-milovidov
Member Author

So far, there is a prototype implementation of inverted indices (you can unlock it with allow_experimental_inverted_index). It is not ready and should not be used in production; it has not been tested on realistic datasets.

@Dileep-Dora

@alexey-milovidov thanks for the reply. Yes, we've tried this experimental feature, but the performance was not up to the mark; hence checking whether there are any plans for this in 2024.

@johnpyp

johnpyp commented May 5, 2024

Curious about prioritizing support for ORDER BY optimizations in projections. This is the one thing holding my team back from using ClickHouse for WHERE-query use cases where we want to replicate the ease of use and flexibility of traditional database indices.

We'd love to be able to create potentially many projections on top of one table, with varied combinations of ORDER BY and WHERE query optimizations.

@anvaari
Contributor

anvaari commented May 11, 2024

Is there any plan to support deserializing Protobuf through a schema registry? It's needed.
