fix(scheduler): Introduce `sendMsg` timeout for grpc streams (scheduler -> controller)
#5434
looks like the fix is working as I got on scheduler
    )

    func (s *SchedulerServer) SubscribeExperimentStatus(req *pb.ExperimentSubscriptionRequest, stream pb.Scheduler_SubscribeExperimentStatusServer) error {
        logger := s.logger.WithField("func", "SubscribeExperimentStatus")
        logger.Infof("Received subscribe request from %s", req.GetSubscriberName())

        err := s.sendCurrentExperimentStatuses(stream)
Let's also send experiment updates at startup, similar to models and pipelines. This is something we missed in previous work.
lgtm. Just really minor nits, I like the added tests.
This change adds a timeout for sending events from the scheduler to the controller. These events are sent on server-side (scheduler) gRPC streams (4x for each scheduler), and we have noticed that in some cases (pending more investigation of the root cause) `sendMsg` blocks indefinitely because of the lack of control flow. We double-checked the controller and it is also waiting on receive, so for now we are ruling out any issues with slow consumers.

The default timeout we added is 30s for each send. When a send times out we break the context, so the controller detects a stream disconnect and re-establishes the connection with a new stream to unblock events.
Note that we replay events, so there is no risk of losing messages with this logic.
Logic inspired by: grpc/grpc-go#1229
This PR also adds:
- stress-tests
- bash script to create multiple models of the same type in parallel (currently only 2 model types supported)

Fixes INFRA-686 (internal)