fix: Pipeline state during disconnects #5298
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/PipelineSubscriber.kt
@@ -393,12 +400,17 @@ func (c *ChainerServer) handlePipelineEvent(event coordinator.PipelineEventMsg)
errMsg := "no dataflow engines available to handle pipeline"
logger.WithField("pipeline", event.PipelineName).Warn(errMsg)

err := c.pipelineHandler.SetPipelineState(pv.Name, pv.Version, pv.UID, pv.State.Status, errMsg, sourceChainerServer)
status := pv.State.Status
// if no dataflow engines available then we think we can terminate. however it might be a networking glitch
do we deal with the network glitch case?
err := c.pipelineHandler.SetPipelineState(pv.Name, pv.Version, pv.UID, pv.State.Status, errMsg, sourceChainerServer)
status := pv.State.Status
// if no dataflow engines available then we think we can terminate pipelines.
// TODO: however it might be a networking glitch and we need to handle this better in future
No. In the case of a networking glitch, I think the pipelines will remain in the dataflow engine and will not be removed. We could replay all pipeline control plane messages up to a specific time and so deal with glitches, but I will leave that to a follow-up PR.
That sounds like a good potential solution, and agreed, to be dealt with in another PR
lgtm; left some minor observations/comments
What this PR does / why we need it:
This PR fixes pipeline state inconsistencies that arise when components fail (e.g. dataflow-engine, scheduler).
Which issue(s) this PR fixes:
Fixes INFRA-716 (internal)
Changes:
- … PipelineTerminate (i.e. still to terminate).
- … PipelineTerminating pipelines when calling GetAllRunningPipelineVersions, which allows us to handle the following case.
- … PipelineTerminating pipelines and no currently available dataflow-engines, set them to PipelineTerminated.
- … PipelineTerminate or PipelineTerminating and no currently available dataflow-engines, set them to PipelineTerminated.
Testing
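The termination rule listed under Changes can be sketched in isolation as follows. This is a minimal illustration, not the PR's code: the `PipelineStatus` type, the `PipelineRunning` value and the `resolveStatus` helper are hypothetical, and only the three terminate-related state names come from the description above.

```go
package main

import "fmt"

// PipelineStatus is a hypothetical stand-in for the scheduler's status type.
type PipelineStatus int

const (
	PipelineRunning     PipelineStatus = iota // assumed state, for contrast only
	PipelineTerminate                         // still to terminate
	PipelineTerminating                       // termination in progress
	PipelineTerminated                        // termination complete
)

// String returns the state name for printing.
func (s PipelineStatus) String() string {
	switch s {
	case PipelineRunning:
		return "PipelineRunning"
	case PipelineTerminate:
		return "PipelineTerminate"
	case PipelineTerminating:
		return "PipelineTerminating"
	case PipelineTerminated:
		return "PipelineTerminated"
	}
	return "unknown"
}

// resolveStatus returns the status to record for a pipeline given how many
// dataflow engines are currently available. With no engines available, a
// pipeline that is waiting to terminate or is terminating is treated as
// terminated; every other case keeps its current status.
func resolveStatus(current PipelineStatus, availableEngines int) PipelineStatus {
	if availableEngines > 0 {
		return current
	}
	switch current {
	case PipelineTerminate, PipelineTerminating:
		return PipelineTerminated
	default:
		return current
	}
}

func main() {
	states := []PipelineStatus{PipelineRunning, PipelineTerminate, PipelineTerminating}
	for _, s := range states {
		fmt.Printf("%v with no engines -> %v\n", s, resolveStatus(s, 0))
	}
}
```

As the review thread notes, this rule cannot tell a genuinely absent engine from a short network glitch, so a pipeline may be marked terminated while it is still loaded in an engine that later reconnects.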
Special notes for your reviewer: