Can't connect to bbproxy when calling BB_InitLibrary() in post-stage #1002
Comments
The C API is only available directly on the compute nodes (e.g., as part of the MPI binary). During post-stage (and pre-stage), CSM will prevent access to those nodes as they might be used by a different job (e.g., as we're staging out, another job can be using the CPU/GPUs).
Ok, thanks for the answer. Ultimately what we want to do is start a checkpoint transfer at the end of a user's allocated time, and then have the post-stage script verify that the transfers completed successfully. It sounds like we can't do that using the BB C API for the reasons you mentioned. As an alternative, could we simply check the final size of the destination file, and if it's what we expected, then we know the transfer was good? Or would there ever be a case where the destination file size was correct, but the BB API hadn't finished transferring all the bits yet (because maybe it was preallocating the file or something), or hit some error such that some of the bytes were wrong? Or does the BB API only update the file size after it has correctly received and written the bytes?
You should be able to initiate a transfer in the post-stage script (1st phase) through bbcmd. The filelist can be pre-populated by the application on the SSD such that you can blindly initiate a final transfer and check success/fail in the 2nd phase. I would not rely on destination file size as seen by GPFS. There is strong potential for a data integrity problem, either because the file already exists or because of assumptions about how bbServer writes data. (And certainly bbServer today can write blocks out-of-order, so it's possible that the final block was written relatively early.)
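To illustrate the out-of-order point, here is a minimal local sketch in Python. It does not use bbServer or GPFS; it only shows that on an ordinary filesystem, writing the last block first makes the file report its final size while the earlier blocks are still unwritten. The block size, block count, and file name are made up for the illustration.

```python
import os
import tempfile

BLOCK_SIZE = 4096   # hypothetical block size, for illustration only
NUM_BLOCKS = 4      # hypothetical number of blocks in the "checkpoint"


def size_vs_content_after_last_block():
    """Write only the final block of a file, then compare size and content."""
    # Each source block holds a distinct non-zero byte value.
    source_blocks = [bytes([i + 1]) * BLOCK_SIZE for i in range(NUM_BLOCKS)]
    with tempfile.TemporaryDirectory() as tmp:
        dest = os.path.join(tmp, "checkpoint.dat")
        with open(dest, "wb") as f:
            # Simulate an out-of-order writer: the last block lands first.
            f.seek((NUM_BLOCKS - 1) * BLOCK_SIZE)
            f.write(source_blocks[-1])
        # The size already equals the expected final size ...
        size_ok = os.path.getsize(dest) == BLOCK_SIZE * NUM_BLOCKS
        # ... but the earlier blocks have not been written yet.
        with open(dest, "rb") as f:
            content_ok = f.read() == b"".join(source_blocks)
    return size_ok, content_ok
```

A size check would pass here even though three of the four blocks are missing, which is why the size alone says nothing about whether the transfer finished.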
I don't think launching the transfer is the issue. We've been able to do that successfully from the BB C API. It's checking that it finished correctly afterwards in 2nd phase post-stage that's an issue. It would seem that since
Doesn't the 2nd stage only get called after the transfer completes/aborts? So by definition, bbServer should not be writing the file at the time the 2nd post-stage is called? Or is there a chance it could change the contents of the file during/after 2nd post-stage? I ask because I wonder if we could just do a checksum of the destination file in 2nd post-stage to verify it's correct.
Let's not try to checksum the files as a "transfer complete" check. That will be performance suicide at scale.
bbcmd uses the CSM API (csm_bb_cmd()) to start an authorized executable on the compute node and retrieve its output. The CSM design is that only the running job on the CPUs has permission to start arbitrary executables (via jsrun/mpirun/ssh) on the compute nodes. Nodes are cleaned up between jobs to ensure there are no straggler processes, files, etc., and any access outside that phase requires authorized processes. So the flow would be: on the launch node, bbcmd processes inputs, gathers a bit of info on allocation IDs, bundles them, and then calls the CSM API (or in non-CSM environments, it uses ssh). This results in the /opt/ibm/bb/bin/bbcmd executable getting started on the compute node side, which can then use the C API. The thinking (pre-AXL) was that users would be writing their own staging scripts (e.g., like scheduler run scripts), so they would already be using perl/python/bash, which have good JSON processing libraries. I don't think that precludes using popen() within AXL to accomplish the same goal.
Correct. I thought you meant using file size to determine whether you needed to transfer or not. I think I see what you were intending now. At 2nd stage, querying the final transfer status would be sufficient.
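The scripting approach described above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the bbcmd arguments are the ones quoted in the issue description, bbcmd is assumed to be on PATH and to print JSON on stdout, and the reply's schema is not shown in this thread, so the code returns the parsed object without interpreting it. The helper names run_json_command and query_transfers are hypothetical.

```python
import json
import subprocess
import sys

# Command line as quoted in the issue description; assumes bbcmd is on PATH.
GETTRANSFERS = ["bbcmd", "gettransfers", "--target=0", "--matchstatus=BBALL"]


def run_json_command(argv, timeout=60):
    """Run argv and return (exit code, stdout parsed as JSON or None)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    try:
        reply = json.loads(proc.stdout)
    except json.JSONDecodeError:
        # Output was not JSON; surface stderr to help with debugging.
        print(proc.stderr, file=sys.stderr)
        reply = None
    return proc.returncode, reply


def query_transfers(argv=GETTRANSFERS):
    """Hypothetical 2nd-stage helper: fetch the transfer listing or raise."""
    rc, reply = run_json_command(argv)
    if rc != 0 or reply is None:
        raise RuntimeError(f"{argv[0]} failed with exit code {rc}")
    # Schema is an assumption-free zone: the caller inspects the reply.
    return reply
```

A 2nd-stage post-stage script would call query_transfers() and inspect the returned object for the status of each transfer. The same pattern maps onto popen() in C, at the cost of bringing a JSON parser into AXL.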
@tgooding thanks for the info. It sounds like we'll have AXL spawn
Describe the bug
I want to check the BB transfer status via the BB API in the 2nd-stage post-stage script. Basically, I want to confirm that the BB transfers I launched completed successfully. I notice that when I call BB_InitLibrary() on the post-stage node, I get this error:

I tried using the same contribId I used when I started the transfer, but got the error. I also tried contribId=0 and contribId=999999999 (UNDEFINED_CONTRIBID) but got the same error.

I see that I can run bbcmd gettransfers --target=0 --matchstatus=BBALL in post-stage and get the transfers. However, that codepath sets bb.api.noproxyinit in the config and maybe that makes a difference. I also see that bbcmd will set its contribid to UNDEFINED_CONTRIBID, which isn't exported in the BBAPI.

So my question is: how do I use the BB API to get the transfer statuses in post-stage?

To Reproduce
Call BB_InitLibrary() on the post-stage node.

Expected behavior
I expect BB_InitLibrary() to connect to the BB server in post-stage.

Screenshots
Environment (please complete the following information):
Additional context