StarRocks Roadmap 2024 #39686

Open · 9 of 60 tasks
Dshadowzh opened this issue Jan 22, 2024 · 19 comments

@Dshadowzh
Contributor

Dshadowzh commented Jan 22, 2024

Refer to the previous roadmaps: 2023, 2022

Shared-data & StarOS

  • Align with all shared-nothing functionality
    • Sync materialized view
    • Generated column
    • Partial update with column mode
    • Optimize table and manual compaction
  • Better cache system
    • Multi-layer cache
    • Global cache
    • Cache auto-warmup
    • Cache blacklist/whitelist
    • Refine eviction algorithm
  • StarOS internal optimization
    • Multi-replicas for shard management
    • Shard schedule optimization for large scale (more than 10M shards)
    • Local storage for StarOS
  • Decoupled storage for FE
  • Open API for StarRocks table format (sink and source)
  • Time Travel
  • Backup support

Performance

  • Full columnar JSON index
  • Cost model with primary key and foreign key constraints
  • ARM optimization for codecs
  • Adaptive DOP and adaptive query engine
  • Global dictionary encoding
  • Enhance IO schedule framework
  • JIT / Codegen
  • Fine-grained FE locking (from database level to table level)

Easy to use

  • Online optimize table
  • List partition optimization
  • Improve FILES table function (see the sketch after this list)
    • Improve schema inference
    • CSV and JSON format support
    • Other formats: Avro, Arrow, Protobuf
    • Better read performance, predicate pushdown
  • Insert statement improvement (on duplicate key, insert properties)
  • Unified data ingestion with Pipe
    • Pipe for continuous ingestion from Kafka
    • Read from external stream table (Kafka)
    • Continuous data ingestion from SQS with Pipe
  • Out-of-the-box parameters
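
As a reference for the FILES item above, here is a minimal sketch of how the table function is typically used today; the bucket, path, and credential values are placeholders, and exact property names may vary by version.

-- Query a Parquet file directly; the schema is inferred from the file
SELECT * FROM FILES(
    "path" = "s3://my-bucket/some/path/data.parquet",   -- placeholder bucket/path
    "format" = "parquet",
    "aws.s3.region" = "us-west-2",                      -- placeholder region/credentials
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
);

-- The same function can feed ingestion, e.g. INSERT INTO target_table SELECT * FROM FILES(...)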

Data lake analytics

  • Better file format support
    • Parquet reader tuning
    • ORC reader tuning
  • Better table format support
| Lake       | Query | Insert | DDL | Update/Delete/Merge into | MV  | Improvement          |
|------------|-------|--------|-----|--------------------------|-----|----------------------|
| Hive       | 1.18  | 3.2    |     |                          | 2.5 |                      |
| Iceberg    | 2.1   | 3.1    | 3.3 | 3.3                      | 3.0 | metadata improvement |
| Hudi       | 2.2   |        |     |                          | 3.0 |                      |
| Paimon     | 3.0   |        |     |                          | 3.2 |                      |
| Delta Lake | 3.0   |        |     |                          | 3.2 |                      |
  • Lake metadata optimization
  • Materialized view improvement
    • Improve partition mapping (list partition, expression partition)
    • Task scheduler DAG & Lineage
    • Better query rewrite
  • JDBC catalog improvement
  • Enhance JNI reader and implement JNI writer
  • Text File format support
  • Presto/Trino/Spark/Hive SQL compatibility (see the dialect sketch after this list)
  • Presto/Trino/Spark/Hive UDF compatibility
  • Automatic cooldown to lake format
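
On the SQL compatibility item above, a Trino-flavored dialect can already be selected per session in recent versions. The sketch below assumes the sql_dialect session variable; which Trino functions are mapped depends on the version, so treat it as illustrative.

-- Switch the parser to the Trino dialect for this session
SET sql_dialect = 'trino';
-- Trino-style SQL can then be submitted as-is; table/column names are hypothetical,
-- and whether a given Trino function is mapped depends on the StarRocks version
SELECT approx_distinct(user_id) FROM orders;
-- Switch back to the default StarRocks dialect
SET sql_dialect = 'StarRocks';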

Data warehousing (batch and streaming)

Batch processing & ETL improvement

  • Enable spilling by default globally (see the sketch after this list)
  • Multi-statement transaction
  • Temporary table
  • Group execution
  • Task auto retry
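
For the spilling item above, spilling can already be turned on per session via the variables mentioned later in this thread (the comment referencing issue #40936); a minimal sketch, assuming current variable names:

-- Allow operators to spill intermediate results to disk for this session
SET enable_spill = true;
-- Spill proactively instead of only when memory pressure is detected
SET spill_mode = 'force';
-- Optional per-query memory limit in bytes (5 GB here), as used in that report
SET query_mem_limit = 5368709120;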

Streaming processing & real-time update

  • Schemaless partial update
  • Merge into statement
  • Binlog to Flink and Spark Streaming
  • Transaction level incremental refresh in materialized view (Aggregation, Join, functions)
  • Incremental refresh for Iceberg/Hudi/Paimon materialized views

All-in-one scenarios

  • Search: Optimize full text inverted index
  • Row store: Optimize row store for high-concurrency point lookups
  • Time-series DB: ASOF join, high-concurrency ingestion
  • Vector database: vector index

Release

@arsenalzp

Hello,
Any chance of having some good-first-issue items among those tasks?

@Dshadowzh
Contributor Author

Dshadowzh commented Jan 25, 2024

Hello, any chance of having some good-first-issue items among those tasks?

Welcome! You can check #13300 first; we'll add more good-first-issues later in 2024, particularly around external catalogs and connectors.

@Zhangg7723
Contributor

How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

@Dshadowzh
Contributor Author

How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

Yes, we are considering it; there is an "Incremental refresh for Iceberg/Hudi/Paimon materialized views" item above. By the way, between Iceberg and Hudi, which do you prefer?

@Zhangg7723
Contributor

How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

Yes, we are considering it; there is an "Incremental refresh for Iceberg/Hudi/Paimon materialized views" item above. By the way, between Iceberg and Hudi, which do you prefer?

We prefer Iceberg, for its better interface design and fewer bugs. Incremental snapshot refresh is useful for non-partitioned tables.

@MatthewH00
Contributor

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse / ad-hoc query warehouse / ETL warehouse...? And when will it be released?

@Dshadowzh
Contributor Author

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse / ad-hoc query warehouse / ETL warehouse...? And when will it be released?

#38833 has already been finished and will be published in the next version.

@trikker
Contributor

trikker commented Feb 7, 2024

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse / ad-hoc query warehouse / ETL warehouse...? And when will it be released?

#38833 has already been finished and will be published in the next version.

I think these are different issues. Multi-warehouse lets different users see different machines so as to get resource isolation at the machine level, while #38833 is about replica location. In a shared-data deployment the data is all on HDFS/S3 and there are no replicas, but we still need multi-warehouse capability to isolate different machines into different resource groups.

@trikker
Contributor

trikker commented Feb 7, 2024

Thanks. The following are features and improvements we would like, based on my tests of StarRocks and my company's business scenarios:

  1. Data Lake Query
    (1) support more types of catalogs, like Oracle, TiDB, OceanBase and Greenplum (not sure whether it is fully compatible with PostgreSQL)
    (2) support catalog metadata caching for JDBC MySQL, JDBC PostgreSQL and the above catalogs
    (3) support automatic sampling and histogram statistics collection for Hive and Iceberg

  2. Materialized View
    (1) support rewrite for SQL with UNION, ORDER BY and LIMIT
    (2) support rewrite for nested-aggregation SQL, aggregation-then-join SQL and correlated-subquery SQL with one MV
    (3) enable view-based MV rewrite so that a query can be rewritten without requiring the user to query the view
    (4) support incremental refresh of MVs (like Flink stream computing), rather than refreshing a whole table or partition
    (5) support materialized view recommendation

  3. Query Plan
    (1) support query plan cache and plan binding for SQL, like Oracle

  4. Resource Management
    (1) support a real hard CPU limit for resource groups; currently it is actually a soft limit
    (2) support multi-warehouse for users or resource groups, so that different users can see only different subsets of the machines when executing queries

  5. Stability
    (1) memory spill still doesn't work for some SQL when enable_spill is true and spill_mode is force, see issue: TPC-DS query04 still reports "Mem usage has exceed the limit of single query" when query_mem_limit is 5GB and enable_spill is true and spill_mode is force #40936

@chengyi3192

Expecting support for deletion queries on Iceberg tables in Parquet format.

@Dshadowzh
Contributor Author

Expecting support for deletion queries on Iceberg tables in Parquet format.

We plan to implement it in v3.3.

@Dshadowzh
Contributor Author

Dshadowzh commented Feb 7, 2024

Thanks. The following are features and improvements we would like, based on my tests of StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

  1. We want to improve the JDBC catalog for more databases. It's a community-driven project; you're welcome to participate, and we'll publish more details about this project later.
  2. A new statistics collection framework for Hive and Iceberg is WIP.
  3. View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY, LIMIT and nested-aggregation SQL rewrites. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize those issues.
  4. Query plan cache is a good idea; if you have a basic design, we'd like to discuss it in detail.
  5. Hard limits for resource groups and multi-warehouse are on our roadmap.

@trikker
Contributor

trikker commented Feb 8, 2024

Thanks. The following are features and improvements we would like, based on my tests of StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

  1. We want to improve the JDBC catalog for more databases. It's a community-driven project; you're welcome to participate, and we'll publish more details about this project later.
  2. A new statistics collection framework for Hive and Iceberg is WIP.
  3. View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY, LIMIT and nested-aggregation SQL rewrites. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize those issues.
  4. Query plan cache is a good idea; if you have a basic design, we'd like to discuss it in detail.
  5. Hard limits for resource groups and multi-warehouse are on our roadmap.

OK, I hope we can have a detailed discussion later. Happy Spring Festival! ^_^

@inviscid

Top priorities for our ability to migrate to StarRocks from Greenplum are these two issues:

Plus, Time Travel natively in StarRocks without using an external Data Lake.

The 2024 Roadmap looks great. Hoping we can get migrated so we can contribute to the journey.

@ericsun2

ericsun2 commented Mar 1, 2024

This roadmap is exciting.

Let's collaborate on supporting Databricks Unity Catalog, as well as MAP & STRUCT types. Thanks a million.

@motto1314
Contributor

Connector: Directly reading data files instead of reading from BE.

Do you have a more detailed plan or issue for this feature?

Thanks.

@melin

melin commented Apr 3, 2024

Materialized views should support Kafka sources and DB CDC, reducing dependencies on external components such as Kafka Connect, Flink CDC, etc.

Using StarRocks as the lakehouse is about building a lightweight data platform. If you rely on Flink to import data in real time, Flink has to run on YARN or K8s, and the platform is no longer lightweight.
Refer to Redshift or RisingWave, which implement real-time ingestion into the lake through MVs and advocate the concept of NoETL. It would be a very attractive feature.

We have a lot of overseas customers using AWS. If a user needs to write to Redshift from Kafka, they use an MV, e.g.: https://aws.amazon.com/cn/blogs/china/new-for-amazon-redshift-general-availability-of-streaming-ingestion-for-kinesis-data-streams-and-managed-streaming-for-apache-kafka/
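
For context, an approximate sketch of the Redshift streaming-ingestion pattern described in the linked post (not StarRocks syntax): the ARNs and topic name are placeholders, and the exact option and pseudo-column names follow the AWS documentation and may differ slightly.

-- Expose an MSK (Kafka) cluster to Redshift as an external schema
CREATE EXTERNAL SCHEMA msk_schema
FROM MSK
IAM_ROLE 'arn:aws:iam::<account>:role/<role>'                       -- placeholder
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:<region>:<account>:cluster/<name>/<id>'; -- placeholder

-- A materialized view over the topic gives continuous ingestion without an external ETL job
CREATE MATERIALIZED VIEW kafka_mv AUTO REFRESH YES AS
SELECT kafka_partition, kafka_offset, kafka_timestamp, kafka_value
FROM msk_schema."my_topic";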

@chulucninh09

chulucninh09 commented Apr 6, 2024

Any plan to support the Arrow Flight SQL protocol for better data transport? As data engineers & data scientists, we expect less overhead when converting from the SQL protocol (currently MySQL) to pandas/PyArrow tables.

@giovannibonetti

giovannibonetti commented Apr 10, 2024

Thanks for the explanation, @Dshadowzh. I just have a question about this part:

3. View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY...

I think the documentation is confusing in this regard.

On one hand, the limitations section of the Query rewrite with materialized views page says:

Limitations
In terms of materialized view-based query rewrite, StarRocks currently has the following limitations:
...

  • Materialized views defined with statements containing LIMIT, ORDER BY... cannot be used for query rewrite.
    ...

On the other hand, the Set extra sort keys section of the Synchronous materialized view page says:

Suppose that the base table tableA contains columns k1, k2 and k3, where only k1 and k2 are sort keys. If the query including the sub-query where k3=x must be accelerated, you can create a synchronous materialized view with k3 as the first column.

CREATE MATERIALIZED VIEW k3_as_key AS
SELECT k3, k2, k1
FROM tableA
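
(For illustration, the kind of query the quoted passage says this materialized view accelerates, with a hypothetical literal, would be:

SELECT k1, k2
FROM tableA
WHERE k3 = 'x';

since k3 is the first column of k3_as_key, it serves as the view's leading sort key for that filter.)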

If I understood it correctly, ORDER BY k3, k2, k1 is implicit in the materialized view definition. So, it seems like ORDER BY is already working for materialized view query rewriting, isn't it?
