
Lustre to Lustre Data Movement can fail on DataIn due to mknod() error #161

Closed
bdevcich opened this issue May 30, 2024 · 3 comments


bdevcich commented May 30, 2024

When performing data movement during DataIn, a recursive copy into an ephemeral file system can run into a race condition. This happens intermittently; I have no reliable way to trigger it other than running the data movement tests repeatedly, which eventually hit it.

The dcp details are tracked in an mpifileutils issue: hpc/mpifileutils#574

This only occurs when the ephemeral lustre filesystem spans multiple rabbits.

Another thing to point out: this issue was not seen on internal HPE systems running TOSS 4.6.6. A recent upgrade to TOSS 4.7.6 happened around the same time this issue surfaced. I'm not sure whether that's related, but perhaps something changed with ZFS/Lustre to cause this.
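For context, the DataIn copy is roughly a recursive dcp run under mpirun against the freshly mounted ephemeral filesystem. Below is a minimal Go sketch of that kind of invocation; the paths, process count, and host selection are illustrative only and are not the actual nnf-dm command line.

package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Recursive copy from the global Lustre filesystem into the freshly
	// mounted ephemeral Lustre filesystem. This is the copy that can
	// intermittently fail with a mknod() error when the ephemeral
	// filesystem spans multiple rabbits.
	src := "/lus/global/user/input" // illustrative source path
	dst := "/mnt/nnf/ephemeral/job" // illustrative destination on the ephemeral Lustre
	cmd := exec.Command("mpirun", "-np", "4", "dcp", src, dst)
	out, err := cmd.CombinedOutput()
	fmt.Printf("%s", out)
	if err != nil {
		log.Fatalf("dcp failed: %v", err)
	}
}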

@bdevcich
Contributor Author

An example allocationSet for the ephemeral lustre:

status:
  allocationSets:
  - label: ost
    storage:
      rabbit-node-1:
        allocationSize: 5368709120
      rabbit-node-2:
        allocationSize: 5368709120
  - label: mdt
    storage:
      rabbit-node-1:
        allocationSize: 0
      rabbit-node-2:
        allocationSize: 0

An external MGS is being used.

bdevcich self-assigned this May 30, 2024
bdevcich added the help wanted and bug labels May 30, 2024

bdevcich commented Jun 7, 2024

I've added a 30-second pause after the mount of the ephemeral Lustre filesystem in the DataIn stage. This seems to make the issue go away, which suggests a Lustre issue to me. I will remove the pause and grab Lustre logs from the rabbit nodes to see if there are any breadcrumbs.
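For illustration only, here is a minimal sketch of that workaround, assuming a plain lustre client mount; the mount source, mount point, and the 30-second duration are placeholders, and this is not the actual nnf-sos code.

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// mountAndSettle mounts the ephemeral Lustre filesystem and then pauses
// before returning, so DataIn copies do not start immediately after the
// mount completes. Source, mount point, and pause length are illustrative.
func mountAndSettle(source, mountPoint string, settle time.Duration) error {
	out, err := exec.Command("mount", "-t", "lustre", source, mountPoint).CombinedOutput()
	if err != nil {
		return fmt.Errorf("lustre mount failed: %v: %s", err, out)
	}
	// Workaround: give lustre time to settle before dcp runs.
	time.Sleep(settle)
	return nil
}

func main() {
	if err := mountAndSettle("10.1.1.1@tcp:/ephemeral", "/mnt/nnf/ephemeral", 30*time.Second); err != nil {
		fmt.Println(err)
	}
}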

bdevcich added a commit to NearNodeFlash/nnf-sos that referenced this issue Sep 25, 2024
Data Movement nightly tests routinely hit this issue:
NearNodeFlash/NearNodeFlash.github.io#161.

This appears to affect only a specific version of lustre/toss on our system, until
we upgrade.

This change adds an optional workaround that pauses to let lustre
settle after mounting. 1s appears to be all that is needed.

Signed-off-by: Blake Devcich <blake.devcich@hpe.com>
@bdevcich
Contributor Author

As I mentioned in hpc/mpifileutils#574, this issue seems to have gone away with newer versions of lustre. I can still reproduce it on HPE systems that are still running the older version.

Coming back to this, I have not been able to reproduce this on systems with:

TOSS 4.8.3
lustre-2.15.5_6.llnl-1.t4.x86_64

However, I can reliably reproduce this on HPE systems that are still running:

TOSS 4.7.6
lustre-2.15.4_4.llnl-1.t4.x86_64

Closing this issue as well, since we have a way to work around this on systems with older TOSS.
