Submit a job return an error #942

Closed
cometta opened this issue Jan 17, 2020 · 4 comments
@cometta
Contributor

cometta commented Jan 17, 2020

We are running FATE 1.2. After running one round of a job successfully on distributed machines, we re-submit a similar job (from the given examples). The command just hangs, and after a while we get this response:

{
"retcode": 100,
"retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = "0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"202001170655528598743"},"src":{"name":"202001170655528598743","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"202001170655528598743","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1579244152874, loopEndTimestamp: 1579244212882\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"\n\tdebug_error_string = "{"created":"@1579244212.885988745","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"202001170655528598743"},"src":{"name":"202001170655528598743","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"202001170655528598743","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1579244152874, loopEndTimestamp: 1579244212882\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n","grpc_status":13}"\n>"
}

On Fateboard the progress is zero, and we have to press 'kill' to terminate the job. After that, we restart the 'proxy' service on each of the machines, re-run the submit command, and the job executes successfully. Is there an issue with the 'proxy' service? This happens quite often, there is no error in the proxy's log, and a restart resolves it. Any idea why this is happening? We are using docker-compose.
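
For context, the response above shows the proxy (0.0.0.0:9370) giving up while forwarding the fateflow call from party 10000 to party 9999: loopEndTimestamp - lastPacketTimestamp = 1579244212882 - 1579244152874 ≈ 60008 ms, just over the 60000 ms overallTimeout. A quick way to tell a dead proxy connection apart from a slow downstream party is to check whether a gRPC channel to the proxy port even reaches READY. The probe below is only a minimal standalone sketch, not part of FATE; the host name, port, and plaintext setting are assumptions to adjust for your deployment.

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ProxyProbe {
    public static void main(String[] args) throws InterruptedException {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("proxy-host", 9370)   // placeholder: the proxy endpoint from the error above
                .usePlaintext()                   // assumes no TLS between services; adjust if TLS is enabled
                .build();

        // Ask the channel to connect, then poll its state for up to ~10 seconds.
        channel.getState(true);
        for (int i = 0; i < 10; i++) {
            ConnectivityState state = channel.getState(false);
            System.out.println("channel state: " + state);
            if (state == ConnectivityState.READY) {
                // TCP + HTTP/2 handshake succeeded; the 60 s timeout is happening further downstream.
                break;
            }
            TimeUnit.SECONDS.sleep(1);
        }
        channel.shutdownNow();
    }
}

If the channel never leaves CONNECTING or TRANSIENT_FAILURE, the problem is the connection to the proxy itself rather than the downstream fateflow service.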

@jarviszeng-zjc
Contributor

Can you provide proxy logs? The path is logs/fate-proxy.log.
And then please check if there are any other large data jobs running.
Thanks.

@cometta
Contributor Author

cometta commented Feb 11, 2020

@zengjice @dylan-fan After further testing, here is the outcome I can share. If all my workers (arbiter, guest, host, ...) run on different machines on the LAN, there is no issue. When the workers run on Azure with external IP addresses, the RPC throws an error from time to time (after a few jobs have been submitted successfully). I get the error below, and the new job cannot be submitted because of the RPC error.

Feb 11, 2020 12:34:51 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
java.io.IOException: Operation timed out
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.grpc.netty.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
	at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1128)
	at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
	at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

When the log shows the error above, new jobs cannot be submitted.
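
This "Operation timed out" read failure is what a silently dropped idle TCP connection typically looks like from the proxy's side: a middlebox such as the Azure load balancer or NAT drops flows that stay idle for a few minutes, and the next read on the stale socket fails, so gRPC marks the transport as terminated. One common mitigation is to enable HTTP/2 keepalive pings on both sides of the party-to-party channel so the connection never goes idle. The sketch below uses the stock grpc-java Netty API (the same shaded classes that appear in the stack trace); it is not the actual FATE proxy configuration and not necessarily the fix in PR #993, and the endpoint, port, and plaintext setting are placeholders.

import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

public class KeepAliveSketch {

    // Sending side: send HTTP/2 PINGs on an otherwise idle connection so a
    // cloud NAT or load balancer does not silently drop it.
    static ManagedChannel buildChannel() {
        return NettyChannelBuilder
                .forAddress("remote-proxy-host", 9370)   // placeholder endpoint
                .usePlaintext()                          // assumes plaintext between parties; adjust for TLS
                .keepAliveTime(30, TimeUnit.SECONDS)     // ping every 30 s while idle
                .keepAliveTimeout(10, TimeUnit.SECONDS)  // treat the link as dead if the ping is not acked
                .keepAliveWithoutCalls(true)             // ping even when no RPC is in flight
                .build();
    }

    // Receiving side: allow those pings, otherwise the server answers
    // frequent pings with GOAWAY ("too_many_pings").
    static Server buildServer(io.grpc.BindableService dataTransferService) {
        return NettyServerBuilder
                .forPort(9370)
                .permitKeepAliveTime(20, TimeUnit.SECONDS)
                .permitKeepAliveWithoutCalls(true)
                .addService(dataTransferService)         // the proxy's DataTransferService implementation
                .build();
    }
}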

@cometta
Contributor Author

cometta commented Feb 12, 2020

When submitting a job, I get the error below. This only happens from time to time.

[ERROR] 2020-02-12T08:21:22,231 [grpcServiceExecutor-187] [DataTransferPipedServerImpl:243] - [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"2020021208202220896228"},"src":{"name":"2020021208202220896228","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"2020021208202220896228","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1581495622222, loopEndTimestamp: 1581495682230

cometta changed the title from "Submit a job hang" to "Submit a job return an error" on Feb 15, 2020
@cometta
Contributor Author

cometta commented Feb 17, 2020

I created PR #993.
