Lustre to Lustre Data Movement can fail on DataIn due to mknod() error #161
Comments
An example status:
allocationSets:
- label: ost
  storage:
    rabbit-node-1:
      allocationSize: 5368709120
    rabbit-node-2:
      allocationSize: 5368709120
- label: mdt
  storage:
    rabbit-node-1:
      allocationSize: 0
    rabbit-node-2:
      allocationSize: 0
External mgs is being used.
I've added a 30-second pause after the mount of ephemeral Lustre in the DataIn stage. This seems to have made the issue go away, which to me points to a Lustre issue. I will remove the pause and grab Lustre logs from the rabbit nodes to see if there are any breadcrumbs.
Data Movement nightly tests routinely hit this issue: NearNodeFlash/NearNodeFlash.github.io#161. This appears to only affect a specific version of lustre/toss on our system until we upgrade. This change adds an optional workaround that pauses to let lustre settle after mounting. 1s appears to be all that is needed. Signed-off-by: Blake Devcich <blake.devcich@hpe.com>
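A minimal sketch of the "pause after mount" idea described in the commit message above. This is not the actual Data Movement change. The function name `mount_then_settle`, the `settle_seconds` parameter, and the injectable `run`/`sleep` hooks are hypothetical and exist only for illustration; the 1-second default mirrors the "1s appears to be all that is needed" observation.

```python
import subprocess
import time


def mount_then_settle(mount_cmd, settle_seconds=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run a mount command, then optionally pause before any copy starts.

    mount_cmd      -- argument list for the mount command (caller-supplied)
    settle_seconds -- hypothetical knob; 0 disables the workaround
    run, sleep     -- injectable so the behavior can be exercised without
                      touching a real file system
    """
    # Raise CalledProcessError if the mount itself fails.
    run(mount_cmd, check=True)

    # Optional workaround: give the freshly mounted file system a moment
    # to settle before the recursive copy begins creating files.
    if settle_seconds > 0:
        sleep(settle_seconds)
```

Keeping the delay as a parameter that defaults to a small value and can be set to 0 reflects the "optional" nature of the workaround: systems on a newer Lustre/TOSS that do not show the problem would not need to pay for the pause.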
As I mentioned in hpc/mpifileutils#574, this issue seems to have gone away with newer versions of Lustre. I can still reproduce it on HPE systems that are still running the same version.
Closing this issue as well, since we have a way to work around this on systems with older TOSS.
When performing data movement during DataIn, a recursive copy into an ephemeral file system can run into a race condition. This happens intermittently, and I do not have a way to trigger it other than running the data movement tests, which will eventually hit it.
The dcp details are tracked in an mpifileutils issue: hpc/mpifileutils#574
This only occurs when the ephemeral lustre filesystem spans multiple rabbits.
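As a rough illustration of the "spans multiple rabbits" condition, the sketch below checks an `allocationSets` status like the example in this issue and reports which rabbit nodes are listed for a given label. The helper names `rabbits_in_use` and `spans_multiple_rabbits` are hypothetical, and the status is written as a plain Python dict rather than parsed from the real resource.

```python
def rabbits_in_use(allocation_sets, label="ost"):
    """Return the sorted rabbit node names listed under the given label."""
    names = set()
    for alloc in allocation_sets:
        if alloc.get("label") == label:
            # "storage" maps rabbit node name -> allocation details.
            names.update(alloc.get("storage", {}))
    return sorted(names)


def spans_multiple_rabbits(allocation_sets, label="ost"):
    """True when the label's allocations are spread over more than one rabbit."""
    return len(rabbits_in_use(allocation_sets, label)) > 1


# The example status from this issue, transcribed as a dict.
status = {
    "allocationSets": [
        {"label": "ost", "storage": {
            "rabbit-node-1": {"allocationSize": 5368709120},
            "rabbit-node-2": {"allocationSize": 5368709120},
        }},
        {"label": "mdt", "storage": {
            "rabbit-node-1": {"allocationSize": 0},
            "rabbit-node-2": {"allocationSize": 0},
        }},
    ]
}

if __name__ == "__main__":
    print(rabbits_in_use(status["allocationSets"]))
    print(spans_multiple_rabbits(status["allocationSets"]))
```

A check like this could be used in a test harness to decide whether a given run is in the configuration where the failure has been observed.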
Another thing to point out is that this issue was not seen on internal HPE systems when running TOSS 4.6.6. An upgrade to TOSS 4.7.6 happened around the same time the issue surfaced. I'm not sure whether the two are related, but perhaps something changed in ZFS/Lustre to cause this issue.