[FLINK-37901][table] Support to serde for StreamExecMLPredictTableFunction #26641
Conversation
Need to fix doc: #26630 (comment)
Thanks for the PR!
return true;
}

static TableException schemaNotMatching(
This is also almost identical to ContextResolvedTableJsonDeserializer. Move it to a util if you want.
The exception messages are different: one is for tables, the other is for models.
serializerProvider.defaultSerializeField(
        OPTIONS, resolvedCatalogModel.getOptions(), jsonGenerator);
For tables there's a try/catch; curious why there's no catch here?
Actually, I think the catalog table serde is a little odd here. The try/catch there is used to notify users that Flink doesn't support generating an exec plan for such a query; you can take a look at ExternalCatalogTable#getOptions. Models are much simpler here, so we don't need to complicate the case.
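For anyone reading along, here is a self-contained Jackson sketch of the try/catch pattern being discussed. This is not Flink's actual ResolvedCatalogTableJsonSerializer; the class names, field name, and error message are made up. The point is only that the write of the options field is guarded so an object whose options getter throws (as ExternalCatalogTable#getOptions does) surfaces a clear "plan cannot be generated" error.

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.databind.ser.std.StdSerializer;
import java.io.IOException;
import java.util.Map;
import java.util.function.Supplier;

public class OptionsSerdeSketch {

    // Stand-in for a resolved catalog object whose options may not be retrievable.
    static class CatalogObject {
        final Supplier<Map<String, String>> options;
        CatalogObject(Supplier<Map<String, String>> options) { this.options = options; }
    }

    static class OptionsSerializer extends StdSerializer<CatalogObject> {
        OptionsSerializer() { super(CatalogObject.class); }

        @Override
        public void serialize(CatalogObject value, JsonGenerator gen, SerializerProvider provider)
                throws IOException {
            gen.writeStartObject();
            try {
                // Guarded like the table serializer: the options getter may throw.
                provider.defaultSerializeField("options", value.options.get(), gen);
            } catch (Exception e) {
                throw new IOException(
                        "Cannot generate a compiled plan: the options of this object are not serializable.",
                        e);
            }
            gen.writeEndObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper =
                new ObjectMapper()
                        .registerModule(new SimpleModule().addSerializer(new OptionsSerializer()));
        System.out.println(
                mapper.writeValueAsString(new CatalogObject(() -> Map.of("provider", "openai"))));
        // A CatalogObject whose options supplier throws would surface the wrapped error instead.
    }
}
```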
asyncLookupOptions.asyncTimeout,
asyncLookupOptions.asyncBufferCapacity,
asyncLookupOptions.asyncOutputMode),
asyncOptions.asyncTimeout,
Why don't we need this data shuffling logic in createModelPredict? It looks like in the lookup exec node it's in the sync transformation creation.
I think we can add this later if users require it. UpsertMaterialize is a complicated optimization that only applies when the primary key of the source differs from the sink's primary key. What's more, it requires users to use the output of the model as part of the primary key.
Regardless of upsertMaterialize, it looks like we have upsert-key shuffling in async ordered mode; should it also apply in sync mode?
Nope.
- If the upsert keys are a subset of the lookup keys, we don't need to introduce a shuffle here, because the planner promises that data with the same upsert keys is located on the same subtask.
- If the downstream operator doesn't use the output of the predict function as upsert keys, we don't need determinism here.
After discussing with @lihaosky offline, I think it was a mistake to introduce a shuffle for async mode. We should only introduce a shuffle if we use upsert materialize.
InternalTypeInfo.of(getOutputType()),
inputTransformation.getParallelism(),
false);
} else if (asyncLookupOptions.asyncOutputMode == AsyncDataStream.OutputMode.ORDERED) {
} else if (asyncOptions.asyncOutputMode == AsyncDataStream.OutputMode.ORDERED) {
I'm a bit confused about the shuffle logic. It looks like the shuffle is only done when upsertMaterialize is true in the lookup join exec node, and upsertMaterialize seems to always be false for lookup join. Why doesn't lookup join need to do it for CDC? Also, async is disabled when upsertMaterialize is set for lookup join.
This is a good question! Let me share some points here.
When materialization is used, the lookup operator stores the lookup results in its state. When an update-before or delete message arrives, the lookup operator searches for the result in its state; if the state contains it, it emits the result with the content from the state so that the output of the lookup join operator stays deterministic.
Why is a shuffle required for CDC mode?
The planner requires the lookup join operator to use keyed state, so that all messages with the same lookup keys are located on the same subtask. Currently, Flink requires a shuffle before a keyed stream.
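To illustrate that last point with a minimal DataStream sketch (purely illustrative, not planner code): keyed state is only available behind a keyBy(), i.e. a hash shuffle, which is why the planner has to insert one before any operator that materializes lookup results in keyed state.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class KeyedStateRequiresShuffle {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(Tuple2.of("k1", 1), Tuple2.of("k1", -1), Tuple2.of("k2", 2))
                // keyBy() is the hash shuffle: only after it can an operator use keyed state,
                // so records with the same key are guaranteed to land on the same subtask.
                .keyBy(t -> t.f0)
                .process(
                        new KeyedProcessFunction<String, Tuple2<String, Integer>, String>() {
                            private transient ValueState<Integer> lastSeen;

                            @Override
                            public void processElement(
                                    Tuple2<String, Integer> value, Context ctx, Collector<String> out)
                                    throws Exception {
                                if (lastSeen == null) {
                                    // State is registered lazily here just to keep the sketch short.
                                    lastSeen =
                                            getRuntimeContext()
                                                    .getState(
                                                            new ValueStateDescriptor<>(
                                                                    "lastSeen", Integer.class));
                                }
                                Integer previous = lastSeen.value();
                                lastSeen.update(value.f1);
                                out.collect(value.f0 + ": " + previous + " -> " + value.f1);
                            }
                        })
                .print();
        env.execute("keyed-state-requires-shuffle");
    }
}
```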
And upsertMaterialize seems to be always false for lookup join.
In some cases, the planner will enable upsertMaterialize; take a look at StreamNonDeterministicUpdatePlanVisitor#visitLookupJoin.
First of all, users should set 'table.optimizer.non-deterministic-update.strategy' = 'TRY_RESOLVE';
then the user's query should not use the primary key of the upstream operator as lookup keys, or ...
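For reference, a minimal sketch of switching that strategy on from the Table API; the SQL-client equivalent is the SET statement quoted above, and the class name here is made up:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EnableTryResolve {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());
        // Same effect as: SET 'table.optimizer.non-deterministic-update.strategy' = 'TRY_RESOLVE';
        tEnv.getConfig().set("table.optimizer.non-deterministic-update.strategy", "TRY_RESOLVE");
    }
}
```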
async is disabled when upsertMaterialize for lookup join
Because the current AsyncWaitOperator is not friendly to CDC streams. For example, a +I message and a -D message may arrive at the operator at almost the same time and both enter the input queue, which means the async lookup join function processes them at the same time. If the lookup source changes frequently, the lookup results for the +I and -D messages can differ.
Hope my explanation answers your questions.
I think we should only support insert mode for ml_predict then (StreamNonDeterministicUpdatePlanVisitor should reject upsert mode for the ml_predict plan), since the ml_predict function itself isn't deterministic. A non-deterministic function can result in errors according to https://docs.confluent.io/cloud/current/flink/concepts/determinism.html. We can support CDC mode later by introducing configs users can use to tell us their model is deterministic. Created https://issues.apache.org/jira/browse/FLINK-37928 and https://issues.apache.org/jira/browse/FLINK-37929.
After discussing with @lihaosky, I think we can improve this feature in the next version. After all, correctness is the top priority.
Before merging #26630 (comment): I have fixed the problem in my local env and then pushed the code to the master branch.
All tests pass in my private CI pipeline: https://dev.azure.com/1059623455/Flink/_build/results?buildId=694&view=logs&j=0e31ee24-31a6-528c-a4bf-45cde9b2a14e
Merging...
What is the purpose of the change
Support serde for StreamExecMLPredictTableFunction.
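For context, a rough sketch of the flow that exercises this serde: compiling a query that uses ML_PREDICT into a persisted plan. The table, model, and column names as well as the exact ML_PREDICT invocation shown are assumptions, and the DDL for these objects is assumed to have been run already.

```java
import org.apache.flink.table.api.CompiledPlan;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MlPredictCompilePlanSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());
        // DDL for the source table, sink table, and model is assumed to exist in the catalog.
        CompiledPlan plan =
                tEnv.compilePlanSql(
                        "INSERT INTO sink "
                                + "SELECT * FROM ML_PREDICT(TABLE src, MODEL my_model, DESCRIPTOR(feature))");
        // The JSON plan should contain the StreamExecMLPredictTableFunction node made serializable by this PR.
        System.out.println(plan.asJsonString());
    }
}
```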
Brief change log
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation