StarRocks Roadmap 2024 #39686

Open · 9 of 60 tasks
Dshadowzh opened this issue Jan 22, 2024 · 19 comments

@Dshadowzh
Contributor

Dshadowzh commented Jan 22, 2024

Refer to the previous roadmaps: 2023, 2022

Shared-data & StarOS

  • Align with all shared-nothing functionality
    • Sync materialized view
    • Generated column
    • Partial update with column mode
    • Optimize table and manual compaction
  • Better cache system
    • Multi-layer cache
    • Global cache
    • Cache auto-warmup
    • Cache blacklist/whitelist
    • Refine eviction algorithm
  • StarOS internal optimization
    • Multi-replicas for shard management
    • Shard schedule optimization for large scale (more than 10M shards)
    • Local storage for StarOS
  • Decoupled storage for FE
  • Open API for StarRocks table format (sink and source)
  • Time Travel
  • Backup support

Performance

  • Full columnar JSON index
  • Cost model with primary key and foreign key constraints
  • ARM optimization for codecs
  • Adaptive DOP and adaptive query engine
  • Global dictionary encoding
  • Enhance IO schedule framework
  • JIT / Codegen
  • Fine-grained FE locking (from database level to table level)

Easy to use

  • Online optimize table
  • List partition optimization
  • Improve FILES table function (see the sketch after this list)
    • Improve schema inference
    • CSV and JSON format support
    • Other formats: Avro, Arrow, Protobuf
    • Better read performance, predicate pushdown
  • Insert statement improvement (on duplicate key, insert properties)
  • Unified data ingestion with Pipe
    • Pipe for continuous ingestion from Kafka
    • Read from external stream table (Kafka)
    • Continuous data ingestion from SQS with Pipe
  • Out-of-the-box parameters
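
As a reference for the FILES item above, here is a minimal sketch of how the table function is typically used today; the bucket, path, and credential values are placeholders, and exact property names may vary by version.

-- Query a Parquet file directly; the schema is inferred from the file
SELECT * FROM FILES(
    "path" = "s3://my-bucket/some/path/data.parquet",   -- placeholder bucket/path
    "format" = "parquet",
    "aws.s3.region" = "us-west-2",                      -- placeholder region/credentials
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
);

-- The same function can feed ingestion, e.g. INSERT INTO target_table SELECT * FROM FILES(...)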

Data lake analytics

  • Better file format support
    • Parquet reader tuning
    • ORC reader tuning
  • Better table format support
| Lake       | Query | Insert | DDL | Update/Delete/Merge into | MV  | Improvement          |
|------------|-------|--------|-----|--------------------------|-----|----------------------|
| Hive       | 1.18  | 3.2    |     |                          | 2.5 |                      |
| Iceberg    | 2.1   | 3.1    | 3.3 | 3.3                      | 3.0 | metadata improvement |
| Hudi       | 2.2   |        |     |                          | 3.0 |                      |
| Paimon     | 3.0   |        |     |                          | 3.2 |                      |
| Delta Lake | 3.0   |        |     |                          | 3.2 |                      |
  • Lake metadata optimization
  • Materialized view improvement
    • Improve partition mapping (list partition, expression partition)
    • Task scheduler DAG & Lineage
    • Better query rewrite
  • JDBC catalog improvement
  • Enhance JNI reader and implement JNI writer
  • Text File format support
  • Presto/Trino/Spark/Hive SQL compatibility (see the dialect sketch after this list)
  • Presto/Trino/Spark/Hive UDF compatibility
  • Automatic cooldown to lake format
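
On the SQL compatibility item above, a Trino-flavored dialect can already be selected per session in recent versions. The sketch below assumes the sql_dialect session variable; which Trino functions are mapped depends on the version, so treat it as illustrative.

-- Switch the parser to the Trino dialect for this session
SET sql_dialect = 'trino';
-- Trino-style SQL can then be submitted as-is; table/column names are hypothetical,
-- and whether a given Trino function is mapped depends on the StarRocks version
SELECT approx_distinct(user_id) FROM orders;
-- Switch back to the default StarRocks dialect
SET sql_dialect = 'StarRocks';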

Data warehousing (batch and streaming)

Batch processing & ETL improvement

  • Enable spilling by default globally (see the sketch after this list)
  • Multi-statement transaction
  • Temporary table
  • Group execution
  • Task auto retry
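
For the spilling item above, spilling can already be turned on per session via the variables mentioned later in this thread (the comment referencing issue #40936); a minimal sketch, assuming current variable names:

-- Allow operators to spill intermediate results to disk for this session
SET enable_spill = true;
-- Spill proactively instead of only when memory pressure is detected
SET spill_mode = 'force';
-- Optional per-query memory limit in bytes (5 GB here), as used in that report
SET query_mem_limit = 5368709120;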

Streaming processing & real-time update

  • Schemaless partial update
  • Merge into statement
  • Binlog to Flink and Spark Streaming
  • Transaction level incremental refresh in materialized view (Aggregation, Join, functions)
  • Incremental refresh for Iceberg/Hudi/Paimon materialized views

All-in-one scenarios

  • Search: Optimize full text inverted index
  • Row store: Optimize row store for high-concurrency point lookups
  • Time-series DB: ASOF join, high-concurrency ingestion
  • Vector database: vector index

Release

@arsenalzp

Hello,
Any chance of having some good-first-issue items among those tasks?

@Dshadowzh
Contributor Author

Dshadowzh commented Jan 25, 2024

Hello, any chance of having some good-first-issue items among those tasks?

Welcome! You can check #13300 first; we'll add more good-first-issues later in 2024, particularly around external catalogs and connectors.

@Zhangg7723
Contributor

How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

@Dshadowzh
Contributor Author

How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

Yes, we are considering it; there is an "Incremental refresh for Iceberg/Hudi/Paimon materialized views" item above. By the way, between Iceberg and Hudi, which do you prefer?

@Zhangg7723
Contributor

How about incremental refresh of materialized views for external tables like Iceberg or Hudi? I think this feature could reduce the cost of refreshing MVs.

Yes, we are considering it; there is an "Incremental refresh for Iceberg/Hudi/Paimon materialized views" item above. By the way, between Iceberg and Hudi, which do you prefer?

We prefer Iceberg, for its better interface design and fewer bugs. Incremental snapshot refresh is useful for non-partitioned tables.

@MatthewH00
Contributor

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse / ad-hoc query warehouse / ETL warehouse...? And when will it be released?

@Dshadowzh
Contributor Author

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse / ad-hoc query warehouse / ETL warehouse...? And when will it be released?

#38833 has already been finished and will be published in the next version.

@trikker
Contributor

trikker commented Feb 7, 2024

At yesterday's 2024 roadmap meeting, it was mentioned that tags on BEs will be supported in shared-nothing mode, similar to the multi-warehouse mechanism in shared-data mode. Could it be split into a data-loading warehouse / ad-hoc query warehouse / ETL warehouse...? And when will it be released?

#38833 has already been finished and will be published in the next version.

I think these are different issues. Multi-warehouse lets different users see different machines so as to get resource isolation at the machine level, while #38833 is about replica location. In a shared-data deployment the data is all on HDFS/S3 and there are no replicas, but we still need multi-warehouse capability to isolate different machines into different resource groups.

@trikker
Contributor

trikker commented Feb 7, 2024

Thanks. The following are features and improvements we would like, based on my tests of StarRocks and my company's business scenarios:

  1. Data Lake Query
    (1) support more types of catalogs, like Oracle, TiDB, OceanBase and Greenplum (not sure whether it is fully compatible with PostgreSQL)
    (2) support catalog metadata caching for JDBC MySQL, JDBC PostgreSQL and the above catalogs
    (3) support automatic sampling and histogram statistics collection for Hive and Iceberg

  2. Materialized View
    (1) support rewrite for SQL with UNION, ORDER BY and LIMIT
    (2) support rewrite for nested-aggregation SQL, aggregation-then-join SQL and correlated-subquery SQL with one MV
    (3) enable view-based MV rewrite so that a query can be rewritten without requiring the user to query the view
    (4) support incremental refresh of MVs (like Flink stream computing), rather than refreshing a whole table or partition
    (5) support materialized view recommendation

  3. Query Plan
    (1) support query plan cache and plan binding for SQL, like Oracle

  4. Resource Management
    (1) support a real hard CPU limit for resource groups; currently it is actually a soft limit
    (2) support multi-warehouse for users or resource groups, so that different users can see only different subsets of the machines when executing queries

  5. Stability
    (1) memory spill still doesn't work for some SQL when enable_spill is true and spill_mode is force, see issue: TPC-DS query04 still reports "Mem usage has exceed the limit of single query" when query_mem_limit is 5GB and enable_spill is true and spill_mode is force #40936

@chengyi3192

Expecting support for deletion queries on Iceberg tables in Parquet format.

@Dshadowzh
Contributor Author

Expecting support for deletion queries on Iceberg tables in Parquet format.

We plan to implement it in v3.3.

@Dshadowzh
Contributor Author

Dshadowzh commented Feb 7, 2024

Thanks. The following are features and improvements we would like, based on my tests of StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

  1. We want to improve the JDBC catalog for more databases. It's a community-driven project; you're welcome to participate, and we'll publish more details about this project later.
  2. A new statistics collection framework for Hive and Iceberg is WIP.
  3. View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY, LIMIT and nested-aggregation SQL rewrites. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize those issues.
  4. Query plan cache is a good idea; if you have a basic design, we'd like to discuss it in detail.
  5. Hard limits for resource groups and multi-warehouse are on our roadmap.

@trikker
Contributor

trikker commented Feb 8, 2024

Thanks. The following are features and improvements we would like, based on my tests of StarRocks and my company's business scenarios:

Thanks a lot for the extensive feedback:

  1. We want to improve the JDBC catalog for more databases. It's a community-driven project; you're welcome to participate, and we'll publish more details about this project later.
  2. A new statistics collection framework for Hive and Iceberg is WIP.
  3. View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY, LIMIT and nested-aggregation SQL rewrites. We have some workarounds for these cases; if you have specific scenarios, we'd like to prioritize those issues.
  4. Query plan cache is a good idea; if you have a basic design, we'd like to discuss it in detail.
  5. Hard limits for resource groups and multi-warehouse are on our roadmap.

OK, I hope we can have a detailed discussion later. Happy Spring Festival! ^_^

@inviscid

Top priorities for our ability to migrate to StarRocks from Greenplum are these two issues:

Plus, Time Travel natively in StarRocks without using an external Data Lake.

The 2024 Roadmap looks great. Hoping we can get migrated so we can contribute to the journey.

@ericsun2

ericsun2 commented Mar 1, 2024

This roadmap is exciting.

Let's collaborate on supporting Databricks Unity Catalog, as well as MAP & STRUCT types. Thanks a million.

@motto1314
Contributor

Connector: Directly reading data files instead of reading from BE.

Do you have a more detailed plan or issue for this feature?

Thanks.

@melin

melin commented Apr 3, 2024

Materialized views should support Kafka sources and DB CDC, reducing dependencies on external components such as Kafka Connect, Flink CDC, etc.

Using StarRocks as the lakehouse is about building a lightweight data platform. If you rely on Flink to import data in real time, Flink has to run on YARN or K8s, and the platform is no longer lightweight.
Refer to Redshift or RisingWave, which implement real-time ingestion into the lake through MVs and advocate the concept of NoETL. It would be a very attractive feature.

We have a lot of overseas customers using AWS. If a user needs to write to Redshift from Kafka, they use an MV, e.g.: https://aws.amazon.com/cn/blogs/china/new-for-amazon-redshift-general-availability-of-streaming-ingestion-for-kinesis-data-streams-and-managed-streaming-for-apache-kafka/
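
For context, an approximate sketch of the Redshift streaming-ingestion pattern described in the linked post (not StarRocks syntax): the ARNs and topic name are placeholders, and the exact option and pseudo-column names follow the AWS documentation and may differ slightly.

-- Expose an MSK (Kafka) cluster to Redshift as an external schema
CREATE EXTERNAL SCHEMA msk_schema
FROM MSK
IAM_ROLE 'arn:aws:iam::<account>:role/<role>'                       -- placeholder
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:<region>:<account>:cluster/<name>/<id>'; -- placeholder

-- A materialized view over the topic gives continuous ingestion without an external ETL job
CREATE MATERIALIZED VIEW kafka_mv AUTO REFRESH YES AS
SELECT kafka_partition, kafka_offset, kafka_timestamp, kafka_value
FROM msk_schema."my_topic";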

@chulucninh09

chulucninh09 commented Apr 6, 2024

Any plan to support the Arrow Flight SQL protocol for better data transport? As data engineers & data scientists, we expect less overhead when converting from the SQL protocol (currently MySQL) to pandas/PyArrow tables.

@giovannibonetti

giovannibonetti commented Apr 10, 2024

Thanks for the explanation, @Dshadowzh. I just have a question about this part:

3. View-based MV rewrite is already finished (v3.3). You can create independent issues for UNION, ORDER BY...

I think the documentation is confusing in this regard.

On one hand, the limitations section of the Query rewrite with materialized views page says:

Limitations
In terms of materialized view-based query rewrite, StarRocks currently has the following limitations:
...

  • Materialized views defined with statements containing LIMIT, ORDER BY... cannot be used for query rewrite.
    ...

On the other hand, the Set extra sort keys section of the Synchronous materialized view page says:

Suppose that the base table tableA contains columns k1, k2 and k3, where only k1 and k2 are sort keys. If the query including the sub-query where k3=x must be accelerated, you can create a synchronous materialized view with k3 as the first column.

CREATE MATERIALIZED VIEW k3_as_key AS
SELECT k3, k2, k1
FROM tableA
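
(For illustration, the kind of query the quoted passage says this materialized view accelerates, with a hypothetical literal, would be:

SELECT k1, k2
FROM tableA
WHERE k3 = 'x';

since k3 is the first column of k3_as_key, it serves as the view's leading sort key for that filter.)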

If I understood it correctly, ORDER BY k3, k2, k1 is implicit in the materialized view definition. So, it seems like ORDER BY is already working for materialized view query rewriting, isn't it?
