
fix: Pipeline state during disconnects #5298

Merged

Conversation

@sakoush sakoush commented Feb 8, 2024

What this PR does / why we need it:

This PR fixes issues with pipeline state inconsistency in cases of component failures (e.g. dataflow-engine, scheduler).

Which issue(s) this PR fixes:

Fixes INFRA-716 (internal)

Changes:

  • Do not remove a pipeline from the scheduler's local persistence store when the user deletes that pipeline. This is required because, if the scheduler restarts before the pipeline delete has been propagated to and acknowledged by the other services (e.g. dataflow-engine and controller), there is a risk that the state becomes inconsistent. The drawback is that the scheduler's internal state grows indefinitely, so we will need to recycle it in follow-up work. For now we prioritise consistency over storage, as we expect the storage taken by the protobuf control plane messages for these pipelines to be low.
  • Fix a bug in dataflow-engine so that it reports success when the pipeline topology is already running.
  • Fix a bug in the controller so that it does not remove the finaliser of a pipeline whose state is PipelineTerminate (i.e. still to be terminated).
  • Return PipelineTerminating pipelines when calling GetAllRunningPipelineVersions, which allows us to handle the following cases.
  • For PipelineTerminating pipelines with no currently available dataflow-engines, set them to PipelineTerminated.
  • If an event is received for a pipeline with state PipelineTerminate or PipelineTerminating and no dataflow-engines are currently available, set it to PipelineTerminated (see the sketch after this list).
  • Always send all pipelines (even deleted ones) when the controller connects, as some messages might have been missed during the disconnect.
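
Below is a minimal sketch of the terminating-state rule from the last two bullets, written purely for illustration; the type and function names are assumptions, not the actual scheduler code. With zero connected dataflow-engines, a pipeline that is terminating (or still to terminate) is moved straight to PipelineTerminated, while every other state is left untouched.

    // Illustrative sketch only; names are assumptions, not the scheduler's real API.
    package main

    import "fmt"

    type PipelineStatus int

    const (
        PipelineCreate PipelineStatus = iota
        PipelineReady
        PipelineTerminate
        PipelineTerminating
        PipelineTerminated
    )

    type PipelineVersion struct {
        Name   string
        Status PipelineStatus
    }

    // resolveStateWithoutEngines applies the rule described above: with no
    // connected dataflow-engines, terminate/terminating pipelines are treated
    // as terminated; any other state is left as-is for a later retry.
    func resolveStateWithoutEngines(pv PipelineVersion, connectedEngines int) PipelineStatus {
        if connectedEngines == 0 &&
            (pv.Status == PipelineTerminate || pv.Status == PipelineTerminating) {
            return PipelineTerminated
        }
        return pv.Status
    }

    func main() {
        pv := PipelineVersion{Name: "example-pipeline", Status: PipelineTerminating}
        fmt.Println(resolveStateWithoutEngines(pv, 0) == PipelineTerminated) // prints: true
    }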

Testing

  • Induced a 2-minute delay in dataflow pipeline create and delete (to allow time to kill components):
    private suspend fun handleDelete(metadata: PipelineMetadata) {
        logger.info("Delete pipeline ${metadata.name} version: ${metadata.version} id: ${metadata.id}")
        Thread.sleep(120_000) // test-only: block for 2 minutes so components can be killed mid-delete
        // ... rest of the delete handling unchanged
    }

    private suspend fun handleCreate(
        metadata: PipelineMetadata,
        steps: List<PipelineStepUpdate>,
        kafkaConsumerGroupIdPrefix: String,
        namespace: String,
    ) {
        logger.info("Create pipeline ${metadata.name} version: ${metadata.version} id: ${metadata.id}")
        Thread.sleep(120_000) // test-only: block for 2 minutes so components can be killed mid-create
        // ... rest of the create handling unchanged
    }
  • Killed dataflow-engine, controller and scheduler pods, then made sure the state eventually became consistent.

Special notes for your reviewer:

@sakoush sakoush requested a review from lc525 as a code owner February 8, 2024 10:59
@@ -393,12 +400,17 @@ func (c *ChainerServer) handlePipelineEvent(event coordinator.PipelineEventMsg)
errMsg := "no dataflow engines available to handle pipeline"
logger.WithField("pipeline", event.PipelineName).Warn(errMsg)

err := c.pipelineHandler.SetPipelineState(pv.Name, pv.Version, pv.UID, pv.State.Status, errMsg, sourceChainerServer)
status := pv.State.Status
// if no dataflow engines available then we think we can terminate. however it might be a networking glitch
Member

do we deal with the network glitch case?

@sakoush sakoush marked this pull request as draft February 8, 2024 13:31
@sakoush sakoush added the v2 label Feb 8, 2024
@sakoush sakoush marked this pull request as ready for review February 8, 2024 17:38
err := c.pipelineHandler.SetPipelineState(pv.Name, pv.Version, pv.UID, pv.State.Status, errMsg, sourceChainerServer)
status := pv.State.Status
// if no dataflow engines available then we think we can terminate pipelines.
// TODO: however it might be a networking glitch and we need to handle this better in future
Member Author

I think in the case of a networking glitch, the pipelines are going to remain in the dataflow engine and not be removed. We could replay all pipeline control plane messages up to a specific time and therefore deal with glitches, but that will be left to a follow-up PR.

Member

That sounds like a good potential solution, and agreed, to be dealt with in another PR
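
As a purely illustrative sketch of the replay idea discussed above (none of these names exist in the codebase), the scheduler could keep an ordered log of pipeline control plane messages and re-send everything after the last acknowledged point once a dataflow-engine reconnects:

    // Illustrative only; hypothetical types, not part of the current code.
    package main

    import (
        "fmt"
        "time"
    )

    type PipelineEvent struct {
        PipelineName string
        Timestamp    time.Time
        Payload      []byte
    }

    // EventLog stores pipeline control plane messages in append (time) order.
    type EventLog struct {
        events []PipelineEvent
    }

    func (l *EventLog) Append(e PipelineEvent) {
        l.events = append(l.events, e)
    }

    // ReplaySince returns every event recorded after the given checkpoint,
    // e.g. the last acknowledgement received from a reconnecting engine.
    func (l *EventLog) ReplaySince(checkpoint time.Time) []PipelineEvent {
        var out []PipelineEvent
        for _, e := range l.events {
            if e.Timestamp.After(checkpoint) {
                out = append(out, e)
            }
        }
        return out
    }

    func main() {
        log := &EventLog{}
        log.Append(PipelineEvent{PipelineName: "p1", Timestamp: time.Now().Add(-time.Hour)})
        log.Append(PipelineEvent{PipelineName: "p1", Timestamp: time.Now()})
        fmt.Println(len(log.ReplaySince(time.Now().Add(-time.Minute)))) // prints: 1
    }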

@sakoush sakoush requested a review from lc525 February 9, 2024 11:58
Member

@lc525 lc525 left a comment

lgtm; left some minor observations/comments

@sakoush sakoush merged commit 94c107d into SeldonIO:v2 Feb 9, 2024
2 of 3 checks passed