fix(scheduler): Introduce sendMsg timeout for grpc streams (scheduler -> controller) #5434

Merged
merged 34 commits into from Mar 15, 2024


sakoush
Member

@sakoush sakoush commented Mar 14, 2024

This change adds a timeout for sending events from the scheduler to the controller. These events are sent on server-side (scheduler) gRPC streams (4x for each scheduler), and we have noticed that in some cases (pending further investigation of the root cause) sendMsg blocks indefinitely because of the lack of control flow. We double-checked the controller and it is also waiting on receive, so for now we are ruling out any issues with slow consumers.

The default timeout we added is 30s for each send. When it expires, we break the context so that the controller detects a stream disconnect and re-establishes the connection with a new stream to unblock events.

Note that we replay events, so there is no risk of losing messages with this logic.

Logic inspired by: grpc/grpc-go#1229

This PR also adds:

  • More unit test coverage for this part of the codebase
  • stress-tests bash script to create multiple models of the same type in parallel (currently only 2 model types supported)
  • tidy go modules for the different modules of the project
  • align gRPC and proto go libraries for all modules

Fixes INFRA-686 (internal)

@sakoush sakoush requested a review from lc525 as a code owner March 14, 2024 12:51
@sakoush sakoush changed the title fix(scheduler): Introduce sendMsg timeout for grpc streams (scheduler -> controller) fix(scheduler): Introduce sendMsg timeout for grpc streams (scheduler -> controller) Mar 14, 2024
@sakoush sakoush added the v2 label Mar 14, 2024
@sakoush
Member Author

sakoush commented Mar 14, 2024

Looks like the fix is working, as I got this on the scheduler:

time="2024-03-14T17:25:56Z" level=error msg="Failed to send model status event to seldon manager for tfsimple1:1" error="rpc error: code = DeadlineExceeded desc = Failed to send event within timeout" func=sendModelStatusEvent source=SchedulerServer

samples/stress-tests.sh (outdated)
)

func (s *SchedulerServer) SubscribeExperimentStatus(req *pb.ExperimentSubscriptionRequest, stream pb.Scheduler_SubscribeExperimentStatusServer) error {
	logger := s.logger.WithField("func", "SubscribeExperimentStatus")
	logger.Infof("Received subscribe request from %s", req.GetSubscriberName())

	err := s.sendCurrentExperimentStatuses(stream)
Member Author


Let's also send experiment updates at startup, as we already do for models and pipelines. We missed this in previous work.

Member

@lc525 lc525 left a comment


LGTM. Just really minor nits; I like the added tests.

@sakoush sakoush merged commit b05bcba into SeldonIO:v2 Mar 15, 2024
2 checks passed