
Lustre to Lustre Data Movement can fail on DataIn due to mknod() error #161

Closed
bdevcich opened this issue May 30, 2024 · 3 comments


bdevcich commented May 30, 2024

When performing data movement during DataIn, a recursive copy into an ephemeral file system can run into a race condition. This happens intermittently; I have no reliable way to trigger it other than running the data movement tests repeatedly, which eventually hit it.

The dcp details are tracked in an mpifileutils issue: hpc/mpifileutils#574

This only occurs when the ephemeral lustre filesystem spans multiple rabbits.

Another thing to point out: this issue was not seen on internal HPE systems running TOSS 4.6.6. A recent upgrade to TOSS 4.7.6 happened around the same time this issue surfaced. I'm not sure whether that's related, but perhaps something changed with ZFS/Lustre to cause this.
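For context, the DataIn copy is roughly a recursive dcp run under mpirun against the freshly mounted ephemeral filesystem. Below is a minimal Go sketch of that kind of invocation; the paths, process count, and host selection are illustrative only and are not the actual nnf-dm command line.

package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Recursive copy from the global Lustre filesystem into the freshly
	// mounted ephemeral Lustre filesystem. This is the copy that can
	// intermittently fail with a mknod() error when the ephemeral
	// filesystem spans multiple rabbits.
	src := "/lus/global/user/input" // illustrative source path
	dst := "/mnt/nnf/ephemeral/job" // illustrative destination on the ephemeral Lustre
	cmd := exec.Command("mpirun", "-np", "4", "dcp", src, dst)
	out, err := cmd.CombinedOutput()
	fmt.Printf("%s", out)
	if err != nil {
		log.Fatalf("dcp failed: %v", err)
	}
}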

@bdevcich
Contributor Author

An example allocationSet for the ephemeral lustre:

status:
  allocationSets:
  - label: ost
    storage:
      rabbit-node-1:
        allocationSize: 5368709120
      rabbit-node-2:
        allocationSize: 5368709120
  - label: mdt
    storage:
      rabbit-node-1:
        allocationSize: 0
      rabbit-node-2:
        allocationSize: 0

An external MGS is being used.

bdevcich self-assigned this May 30, 2024
bdevcich added the help wanted and bug labels May 30, 2024

bdevcich commented Jun 7, 2024

I've added a 30-second pause after the mount of the ephemeral Lustre filesystem in the DataIn stage. This seems to make the issue go away, which suggests a Lustre issue to me. I will remove the pause and grab Lustre logs from the rabbit nodes to see if there are any breadcrumbs.
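For illustration only, here is a minimal sketch of that workaround, assuming a plain lustre client mount; the mount source, mount point, and the 30-second duration are placeholders, and this is not the actual nnf-sos code.

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// mountAndSettle mounts the ephemeral Lustre filesystem and then pauses
// before returning, so DataIn copies do not start immediately after the
// mount completes. Source, mount point, and pause length are illustrative.
func mountAndSettle(source, mountPoint string, settle time.Duration) error {
	out, err := exec.Command("mount", "-t", "lustre", source, mountPoint).CombinedOutput()
	if err != nil {
		return fmt.Errorf("lustre mount failed: %v: %s", err, out)
	}
	// Workaround: give lustre time to settle before dcp runs.
	time.Sleep(settle)
	return nil
}

func main() {
	if err := mountAndSettle("10.1.1.1@tcp:/ephemeral", "/mnt/nnf/ephemeral", 30*time.Second); err != nil {
		fmt.Println(err)
	}
}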

bdevcich added a commit to NearNodeFlash/nnf-sos that referenced this issue Sep 25, 2024
Data Movement nightly tests routinely hit this issue:
NearNodeFlash/NearNodeFlash.github.io#161.

This appears to affect only a specific version of lustre/toss on our system, until
we upgrade.

This change adds an optional workaround that pauses to let lustre
settle after mounting. 1s appears to be all that is needed.

Signed-off-by: Blake Devcich <blake.devcich@hpe.com>
@bdevcich
Contributor Author

As I mentioned in hpc/mpifileutils#574, this issue seems to have gone away with newer versions of lustre. I can still reproduce it on HPE systems that are still running the older version.

Coming back to this, I have not been able to reproduce this on systems with:

TOSS 4.8.3
lustre-2.15.5_6.llnl-1.t4.x86_64

However, I can reliably reproduce this on HPE systems that are still running:

TOSS 4.7.6
lustre-2.15.4_4.llnl-1.t4.x86_64

Closing this issue as well, since we have a way to work around this on systems with older TOSS.
