Submitting a job returns an error #942
Can you provide the proxy logs? The path is logs/fate-proxy.log.
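Before attaching the whole file, it can help to pull out the lines that look like failures. The following is a minimal sketch, assuming the log is plain text and that failures contain one of the substrings below. The actual fate-proxy.log format is not shown in this thread, so the pattern list (including the gRPC status names UNAVAILABLE and DEADLINE_EXCEEDED) is a guess, and the helper `extract_error_lines` is hypothetical.

```python
import re
from pathlib import Path

# Hypothetical patterns: the real fate-proxy.log format is not shown in this
# thread, so these substrings are guesses at what a failure line contains.
ERROR_PATTERN = re.compile(r"ERROR|Exception|UNAVAILABLE|DEADLINE_EXCEEDED")


def extract_error_lines(log_path):
    """Return (line_number, text) for every line that matches ERROR_PATTERN."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return [
        (lineno, line)
        for lineno, line in enumerate(text.splitlines(), start=1)
        if ERROR_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Path taken from the comment above; only scan it if it exists locally.
    path = Path("logs/fate-proxy.log")
    if path.exists():
        for lineno, line in extract_error_lines(path):
            print(f"{lineno}: {line}")
```

The line numbers in the output make it easier to quote the surrounding context when replying in the issue.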
@zengjice @dylan-fan After further testing, here is what I can share. If all my workers (arbiter, guest, host, ...) run on different machines on the same LAN, there is no issue. When the workers run on Azure with external IP addresses, RPC throws errors intermittently (after a few jobs have been submitted successfully). I then get the error below, and new jobs fail to submit with an RPC error.
When the log shows the above error, new jobs cannot be submitted.
When I submit a job, I get the error below. This only happens from time to time.
I created PR #993.
We are running FATE 1.2. After running one round of jobs successfully on distributed machines, we re-submitted a similar job (from the provided examples). The command just hung, and after a while we got a response.
On Fateboard, the progress stays at zero. We have to press 'kill' to terminate the job and then restart the 'proxy' service on each machine. After re-running the submit command, the job executes successfully. Is there an issue with the 'proxy' service? This happens quite often, and there is no error in the proxy's log; restarting the service works around it. Any idea why this is happening? We are using docker-compose.
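The workaround described above (restart the proxy on every machine) can be scripted so it is applied consistently across hosts. This is a minimal sketch, assuming each machine is reachable over ssh, runs the FATE containers from a docker-compose directory, and names the proxy service `proxy`. The host names, the compose directory, and the helpers `build_restart_commands` / `restart_proxies` are hypothetical; adjust them to your docker-compose.yml. It only automates the restart and does not address the underlying cause.

```python
import shlex
import subprocess


def build_restart_commands(hosts, compose_dir, service="proxy"):
    """Return one ssh argv list per host that restarts the given compose service.

    The service name "proxy" is an assumption; use the name from your
    docker-compose.yml.
    """
    remote = (
        f"cd {shlex.quote(compose_dir)} && "
        f"docker-compose restart {shlex.quote(service)}"
    )
    return [["ssh", host, remote] for host in hosts]


def restart_proxies(hosts, compose_dir, service="proxy", dry_run=True):
    """Run the restart on every host; with dry_run=True, only print the commands."""
    for argv in build_restart_commands(hosts, compose_dir, service):
        if dry_run:
            print(shlex.join(argv))
        else:
            # check=True stops at the first host whose restart fails.
            subprocess.run(argv, check=True)
```

For example, `restart_proxies(["guest-host", "host-host", "arbiter-host"], "/path/to/compose/dir")` prints the three ssh commands; pass `dry_run=False` to execute them.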