
fix: Pipeline state during disconnects #5298

Merged

Conversation

@sakoush sakoush commented Feb 8, 2024

What this PR does / why we need it:

This PR fixes issues with pipeline state inconsistency in cases of component failures (e.g. dataflow-engine, scheduler).

Which issue(s) this PR fixes:

Fixes INFRA-716 (internal)

Changes:

  • Do not remove a pipeline from the scheduler's local persistence store when the user deletes that pipeline. This is required because, if the scheduler restarts before the pipeline delete has been propagated to and acknowledged by the other services (e.g. dataflow-engine and controller), there is a risk that the state becomes inconsistent. The drawback is that the scheduler's internal state grows indefinitely, so we will need to recycle it in follow-up work. For now we prioritise consistency over storage, as we expect the storage taken by the protobuf control plane messages for these pipelines to be low.
  • Fix a bug in dataflow-engine so that it reports success when the pipeline topology is already running.
  • Fix a bug in the controller so that it does not remove the finaliser of a pipeline whose state is PipelineTerminate (i.e. still to be terminated).
  • Return PipelineTerminating pipelines when calling GetAllRunningPipelineVersions, which allows us to handle the following cases.
  • For PipelineTerminating pipelines with no currently available dataflow-engines, set them to PipelineTerminated.
  • If an event is received for a pipeline with state PipelineTerminate or PipelineTerminating and no dataflow-engines are currently available, set it to PipelineTerminated (see the sketch after this list).
  • Always send all pipelines (even deleted ones) when the controller connects, as some messages might have been missed during the disconnect.
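
Below is a minimal sketch of the terminating-state rule from the last two bullets, written purely for illustration; the type and function names are assumptions, not the actual scheduler code. With zero connected dataflow-engines, a pipeline that is terminating (or still to terminate) is moved straight to PipelineTerminated, while every other state is left untouched.

    // Illustrative sketch only; names are assumptions, not the scheduler's real API.
    package main

    import "fmt"

    type PipelineStatus int

    const (
        PipelineCreate PipelineStatus = iota
        PipelineReady
        PipelineTerminate
        PipelineTerminating
        PipelineTerminated
    )

    type PipelineVersion struct {
        Name   string
        Status PipelineStatus
    }

    // resolveStateWithoutEngines applies the rule described above: with no
    // connected dataflow-engines, terminate/terminating pipelines are treated
    // as terminated; any other state is left as-is for a later retry.
    func resolveStateWithoutEngines(pv PipelineVersion, connectedEngines int) PipelineStatus {
        if connectedEngines == 0 &&
            (pv.Status == PipelineTerminate || pv.Status == PipelineTerminating) {
            return PipelineTerminated
        }
        return pv.Status
    }

    func main() {
        pv := PipelineVersion{Name: "example-pipeline", Status: PipelineTerminating}
        fmt.Println(resolveStateWithoutEngines(pv, 0) == PipelineTerminated) // prints: true
    }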

Testing

  • Induced a 2-minute delay in dataflow pipeline create and delete (to allow time to kill components):
    private suspend fun handleDelete(metadata: PipelineMetadata) {
        logger.info("Delete pipeline ${metadata.name} version: ${metadata.version} id: ${metadata.id}")
        Thread.sleep(120_000) // test-only: block for 2 minutes so components can be killed mid-delete
        // ... rest of the delete handling unchanged
    }

    private suspend fun handleCreate(
        metadata: PipelineMetadata,
        steps: List<PipelineStepUpdate>,
        kafkaConsumerGroupIdPrefix: String,
        namespace: String,
    ) {
        logger.info("Create pipeline ${metadata.name} version: ${metadata.version} id: ${metadata.id}")
        Thread.sleep(120_000) // test-only: block for 2 minutes so components can be killed mid-create
        // ... rest of the create handling unchanged
    }
  • Killed dataflow-engine, controller and scheduler pods, then made sure the state eventually became consistent.

Special notes for your reviewer:

@sakoush sakoush requested a review from lc525 as a code owner February 8, 2024 10:59
@@ -393,12 +400,17 @@ func (c *ChainerServer) handlePipelineEvent(event coordinator.PipelineEventMsg)
errMsg := "no dataflow engines available to handle pipeline"
logger.WithField("pipeline", event.PipelineName).Warn(errMsg)

err := c.pipelineHandler.SetPipelineState(pv.Name, pv.Version, pv.UID, pv.State.Status, errMsg, sourceChainerServer)
status := pv.State.Status
// if no dataflow engines available then we think we can terminate. however it might be a networking glitch
Member

do we deal with the network glitch case?

@sakoush sakoush marked this pull request as draft February 8, 2024 13:31
@sakoush sakoush added the v2 label Feb 8, 2024
@sakoush sakoush marked this pull request as ready for review February 8, 2024 17:38
err := c.pipelineHandler.SetPipelineState(pv.Name, pv.Version, pv.UID, pv.State.Status, errMsg, sourceChainerServer)
status := pv.State.Status
// if no dataflow engines available then we think we can terminate pipelines.
// TODO: however it might be a networking glitch and we need to handle this better in future
Member Author

I think in the case of a networking glitch, the pipelines are going to remain in the dataflow engine and not be removed. We could replay all pipeline control plane messages up to a specific time and therefore deal with glitches, but that will be left to a follow-up PR.

Member

That sounds like a good potential solution, and agreed, to be dealt with in another PR
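
As a purely illustrative sketch of the replay idea discussed above (none of these names exist in the codebase), the scheduler could keep an ordered log of pipeline control plane messages and re-send everything after the last acknowledged point once a dataflow-engine reconnects:

    // Illustrative only; hypothetical types, not part of the current code.
    package main

    import (
        "fmt"
        "time"
    )

    type PipelineEvent struct {
        PipelineName string
        Timestamp    time.Time
        Payload      []byte
    }

    // EventLog stores pipeline control plane messages in append (time) order.
    type EventLog struct {
        events []PipelineEvent
    }

    func (l *EventLog) Append(e PipelineEvent) {
        l.events = append(l.events, e)
    }

    // ReplaySince returns every event recorded after the given checkpoint,
    // e.g. the last acknowledgement received from a reconnecting engine.
    func (l *EventLog) ReplaySince(checkpoint time.Time) []PipelineEvent {
        var out []PipelineEvent
        for _, e := range l.events {
            if e.Timestamp.After(checkpoint) {
                out = append(out, e)
            }
        }
        return out
    }

    func main() {
        log := &EventLog{}
        log.Append(PipelineEvent{PipelineName: "p1", Timestamp: time.Now().Add(-time.Hour)})
        log.Append(PipelineEvent{PipelineName: "p1", Timestamp: time.Now()})
        fmt.Println(len(log.ReplaySince(time.Now().Add(-time.Minute)))) // prints: 1
    }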

@sakoush sakoush requested a review from lc525 February 9, 2024 11:58
Member

@lc525 lc525 left a comment

lgtm; left some minor observations/comments

@sakoush sakoush merged commit 94c107d into SeldonIO:v2 Feb 9, 2024
2 of 3 checks passed