Fuse client doesn't go through short-circuit to the Alluxio worker on the same node in Kubernetes #9089
Comments
ping @madanadit
/ping @apc999. Since @madanadit is traveling, may I know who else can help investigate?
When I tried to copy the file again, I see this error in the fuse logs:
In the worker logs, I'm able to see this error:
@bf8086 this looks like some gRPC authentication failure. Can you take a look and see if it is gRPC related?
The Alluxio fuse client does not allow overwrites, since Alluxio is a write-once-read-many filesystem. If you need to overwrite a file, delete it first.
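For example, a minimal sketch of the delete-then-rewrite pattern using the Alluxio Java client API (the path and contents below are made up for illustration):

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileOutStream;
import alluxio.client.file.FileSystem;

public class OverwriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();
    // Hypothetical path, for illustration only.
    AlluxioURI path = new AlluxioURI("/data/report.csv");

    // Alluxio files are write-once: remove any existing file first,
    // then create a fresh one with the new contents.
    if (fs.exists(path)) {
      fs.delete(path);
    }
    try (FileOutStream out = fs.createFile(path)) {
      out.write("new contents".getBytes());
    }
  }
}
```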
@cheyang I think we are talking about two problems in this issue ticket.
For the authentication error, it looks like the authenticated client user is set to
I have tried that. The same error:
Notice:
Btw, it looks good when I'm using the CLI inside the fuse pod on the same node:
One issue with fuse is that the output message is never informative (we can only return an error code, and different platforms interpret it in different ways). All the detailed error messages are in the fuse log.
The error is as below:
Looks like the channel id is not recognized by the gRPC server. @cheyang Does the error happen consistently? Did you restart any Alluxio worker before seeing this issue?
It happens consistently. I didn't see any restart:
The worker revokes a client's authentication after a period of inactivity (default 60 minutes). Authentication failures are currently not handled by block worker clients, so an authentication failure becomes permanent. It looks like fuse can retain a client across long periods of inactivity. This will need to be fixed.
Discussed this with @bf8086. The most viable solution, without incurring any performance loss, is to keep the authentication call active on the server and close it whenever authentication is revoked due to inactivity. This introduces one more round trip in the SASL traffic to let the client continue, but the server retains the call reference so it can fail/close the call later. That way the client gets lazily notified that it needs to re-authenticate. With this solution there could be intermittent failures in the block worker client, but those will be propagated to the application, so it shouldn't cause any inconsistencies.
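To make this concrete, here is a rough grpc-java sketch of the server-side bookkeeping, assuming the handshake is a bidirectional stream; all class and method names here are illustrative, not Alluxio's actual implementation:

```java
import io.grpc.Status;
import io.grpc.stub.StreamObserver;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks the long-polled authentication stream per channel (illustrative only). */
public class AuthSessionRegistry<M> {
  private final Map<String, StreamObserver<M>> mSessions = new ConcurrentHashMap<>();

  /** Called once the SASL handshake succeeds. The stream is deliberately
   *  left open (no onCompleted) so it represents the live auth session. */
  public void register(String channelId, StreamObserver<M> responseStream) {
    mSessions.put(channelId, responseStream);
  }

  /** Called by the inactivity sweeper when authentication is revoked.
   *  Failing the retained call lazily tells the client to re-authenticate. */
  public void revoke(String channelId) {
    StreamObserver<M> stream = mSessions.remove(channelId);
    if (stream != null) {
      stream.onError(Status.UNAUTHENTICATED
          .withDescription("authentication revoked after inactivity")
          .asRuntimeException());
    }
  }

  /** Called when the client half-closes or cancels; drop server state. */
  public void onClientClosed(String channelId) {
    mSessions.remove(channelId);
  }
}
```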
What's your suggestion for a workaround? Just unmount and remount the fuse?
@cheyang Can you try setting
Fix for locality: #9143
Thanks @bf8086, that setting worked. I'll file a separate issue to track this bug.
`AuthenticationServer` has a setting for revoking authentication after a period of inactivity. To handle that on the client side, metadata clients retry after getting an `Unauthenticated` code following a period of inactivity. However, due to the nature of streaming, data clients cannot retry after getting the error, because they might have pipelined more data before seeing it. And since this revocation does not change the connection state, they used to keep getting `Unauthenticated`. See #9089 for an instance of this problem. This PR introduces long polling to the authentication handshake: client and server do not complete the streams used for authentication, and instead use them to signal the end of an authentication session. With this change, a revocation on the server is propagated to the client channel via its health status, causing the client to be re-created on later use of the same channel. Likewise, a client closing the channel notifies the server, which then cleans up its state for the recently closed channel. Periodic cleanup has not been disabled, so as not to prolong the duration for which a channel remains authenticated. pr-link: #9149 change-id: cid-7c847b674046f4836c2d77881d47985a760b8951
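For the unary metadata path described above, the retry pattern can be sketched as follows in grpc-java; `reauthenticate` is a hypothetical hook standing in for however the client redoes its SASL handshake:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.function.Supplier;

public final class ReauthRetry {
  /** Retries a unary RPC once after re-running the authentication handshake.
   *  Safe only for unary calls: a streaming write may have pipelined data
   *  before the error surfaced, so it cannot simply be replayed. */
  public static <T> T callWithReauth(Supplier<T> rpc, Runnable reauthenticate) {
    try {
      return rpc.get();
    } catch (StatusRuntimeException e) {
      if (e.getStatus().getCode() != Status.Code.UNAUTHENTICATED) {
        throw e;
      }
      reauthenticate.run(); // hypothetical hook: redo the SASL handshake
      return rpc.get();
    }
  }
}
```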
The PR did not make it into the official RC2 release. To try it, you will have to build from the latest source.
@mingfang Great to hear the issue is verified fixed in your deployment! |
Thanks, it's fixed in Alluxio 2.0.
Alluxio Version:
2.0.0-snapshot
Describe the bug
Deploy master, workers and fuse in Kubernetes.
allxuio.zip
Then go to machine A (192.168.0.109) and copy the whole directory from Alluxio to a local directory.
But I noticed the file is not read from the same node; it's served from 192.168.0.110: