-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional testing using pthreads and BBAPI #163
Comments
You can assign this to me for now |
An early observation: Below is the script I'm using to test the BBAPI transfer. To test, do #!/bin/bash
# Get BBAPI mount point
mnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd
cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE
SCR_CLUSTER_NAME=butte
SCR_FLUSH=1
STORE=/tmp GROUP=NODE COUNT=1
STORE=$mnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=/tmp/persist GROUP=WORLD COUNT=100
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE STORE=$mnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CNTLDIR=/tmp BYTES=1GB
CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF
SCR_CONF_FILE=~/myscr.conf ~/scr/build/examples/test_api What I'm seeing is that when AXL tries to copy the file from /tmp/[blah]/ckpt to /mnt/[BB mount]/ckptdir/ckpt using BBAPI, it's falling back to using pthreads. This happens because AXL BBAPI first checks if the both source and destination support FIEMAP (file extents). If so, then it continues to use BBAPI to transfer the files. If not, then it falls back to using pthreads to transfer them. On cases where the destination file doesn't exist yet, it falls back to calling FIEMAP on the destination directory. This is problematic, as tmp and BBAPI filesystems don't support FIEMAP on directories, while ext4 does (which is what I originally tested on). So to get around this, we may want to fall back to doing a statfs() on the destination directory and checking |
Sounds good. Related to this, as far as I understand it, supporting extents is a necessary but not a sufficient requirement for the BBAPI. The BB software uses extents, but even if a file system supports extents, there is no guarantee that it will work with the BBAPI. IBM says that the BBAPI should work for transfers between a BB-managed file system on the SSD and GPFS, but any other combination only works by chance if at all. No other combinations of transfer pairs have been tested or designed to work. |
Some quirks I'm noticing using the BBAPI to transfer: Can I transfer using BBAPI?
|
We may want to update AXL to not only whitelist BBAPI based on src/dest filesystem type, but also if BBAPI can actually do the transfer, and fallback to pthreads if necessary. So in the table above, we'd fallback to pthreads on /tmp <-> /p/gpfs transfers, and use BBAPI on the others. |
If the user explicitly requests the BBAPI in their configuration and it doesn't work, I think it's fine to just error out. I think that would either be a configuration error on their part or a system software error, either of which they probably would like to know about so that it can get fixed. |
Regarding the table in #163 (comment): |
@adammoody I just opened ECP-VeloC/AXL#64 which disables fallback to pthreads by default, but you can enable it again in cmake with |
I've been testing using the following script on one of our BBAPI machines: #!/bin/bash
# Bypass mode is default - disable it to use AXL
export SCR_CACHE_BYPASS=0
ssdmnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd
gpfsmnt=$(mount | awk '/gpfs/{print $3}')
mkdir -p $gpfsmnt/`whoami`/testing
gpfsmnt=$gpfsmnt/`whoami`/testing
cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE
SCR_CLUSTER_NAME=`hostname`
SCR_FLUSH=1
STORE=/tmp GROUP=NODE COUNT=1 TYPE=pthread
STORE=$ssdmnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=$gpfsmnt GROUP=WORLD COUNT=1 TYPE=bbapi
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE STORE=$ssdmnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=NODE STORE=$gpfsmnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CNTLDIR=/tmp/hutter2 BYTES=1GB
CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF
SCR_CONF_FILE=~/myscr.conf ~/scr/build/examples/test_api The output from the script shows SCR successfully using AXL in pthreads and BBAPI mode:
So we know BBAPI and pthreads are working in the basic case. |
Great! So far, so good. A tip for getting the BB path, an alternative is to use the |
- Rename test_ckpt.C -> test_ckpt.c - Convert it to an actual C file (previously it was C++) - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.c - Convert it to a C file (previously it was C++) - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.c - Convert it to a C file (previously it was C++) - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.c - Convert it to a C file (previously it was C++) - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.cpp - Update it so it can be easily used as C code as well. - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.cpp - Update it so it can be easily used as C code as well. - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.cpp - Update it so it can be easily used as C code as well. - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.cpp - Update it so it can be easily used as C code as well. - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in LLNL#163.
- Rename test_ckpt.C -> test_ckpt.cpp - Update it so it can be easily used as C code as well. - Allow it to take in an argument for checkpoint size in MB These changes will help me with testing in #163.
While testing last week, I ran across a FILO bug and created a PR: ECP-VeloC/filo#9 |
So this is a little bizarre: I noticed in my tests that that SCR was reporting that it successfully transfered a 1MB file called "rank_0" from the SSD to GPFS using the BBAPI:
But the resulting file was 0 bytes.
I tried the same test with
I see the same failure on two of our IBM systems. I'm still investigating.. |
Issue 2, when I login to a machine for the first time and run my test, I always hit this error:
If I re-run the test the problem always goes away. Looks like a stale transfer handle issue. I'll try it with |
I'm not seeing it with |
Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed the amount of time it took to copy a 10GB, random-data, file from SSD to GPFS with
|
BBAPI traffic is throttled in the network, so I’m not surprised it’s slower... but that is quite slow. |
It's possible that the copy was faster because the whole file was in the page cache. That would mean the regular copies were just reading the data from memory rather than the SSD. I believe BBAPI just does a direct device-to-device SSD->GPFS transfer, which would bypass the page cache. Unfortunately, I was unable to test with zeroed caches ( |
The 0B rank_0 file issue arises because it is a sparse file.
Since BBAPI transfers extents, it would make sense that it would create a zero byte file for a sparse file. Proof: # create one regular 1M file, and one sparse file
$ dd if=/dev/zero of=$BBPATH/file1 bs=1M count=1
$ truncate -s 1m $BBPATH/file2
# lookup extents
$ fiemap $BBPATH/file1
ioctl success, extents = 1
$ fiemap $BBPATH/file2
ioctl success, extents = 0
# both files appear to be 1MB to the source filesystem
$ ls -l $BBPATH/file1
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file1
$ ls -l $BBPATH/file2
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file2
# transfer them using BBAPI
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file1 /p/gpfs1/hutter2/
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file2 /p/gpfs1/hutter2/
# sizes after transfer
$ ls -l /p/gpfs1/hutter2/
total 1024
-rw------- 1 hutter2 hutter2 1048576 May 20 14:01 file1
-rw------- 1 hutter2 hutter2 0 May 20 14:02 file2 It gets really interesting when you create and transfer a partially sparse file:
I then did another test where I created sparse section between two extents.
When I transferred that file using the BBAPI, it was the correct size and had the correct data. TL;DR: don't end your file with sparse sections, or they're not going to get transferred correctly with BBAPI. |
That is expected. BB transfers are intentionally throttled via InfiniBand QoS at ~0.6GBps per node to minimize network impacts to MPI and demand I/O. So 10GB should take around 16-17 seconds when transferring in the SSD->GPFS direction. The transfers in the GPFS->SSD direction are not governed and you should see near native SSD write speeds. |
@tgooding out of curiosity, can the throttling be adjusted or disabled? |
"Adjusting" can be done. Its a cluster wide parameter and would require changing the InfiniBand switch settings - not something you'd want to do often. "Disabling" is easier. I/O on port 4420 is throttled, so you just need to change to use the other defined NVMe over Fabrics port (4421). I think you can edit the configuration file Alternatively, you can pass in a custom config template via |
@tgooding thank you! 👍 |
@adammoody regarding "should we cancel/not-cancel existing transfers on AXL restart", I forgot about this thread from around a year ago: I'll put my comments in there since it already has a lot of good discussion. |
Regarding:
See: ECP-VeloC/AXL#57 (comment) . TL;DR: sync/pthread transfers get killed with the job automatically, and I don't see how we could cancel BBAPI transfers. |
@tonyhutter : if you haven't, look at using BB_GetTransferList() at SCR startup time for getting the active handles. |
@tonyhutter , we also have a working test case in which a later job step cancels the transfer started by the previous job step. It's in the LLNL bitbucket, so I'll point you to that separately. |
@tgooding according to the documentation, /**
* \brief Obtain the list of transfers
* \par Description
* The BB_GetTransferList routine obtains the list of transfers within the job that match the
* status criteria. BBSTATUS values are powers-of-2 so they can be bitwise OR'd together to
* form a mask (matchstatus). For each of the job's transfers, this mask is bitwise AND'd
* against the status of the transfer and if non-zero, the transfer handle for that transfer
* is returned in the array_of_handles.
*
* Transfer handles are associated with a jobid and jobstepid. Only those transfer handles that were
* generated for the current jobid and jobstepid are returned.
*
* \param[in] matchstatus Only transfers with a status that match matchstatus will be returned. matchstatus can be a OR'd mask of several BBSTATUS values.
* \param[inout] numHandles Populated with the number of handles returned. Upon entry, contains the number of handles allocated to the array_of_handles.
* \param[out] array_of_handles Returns an array of handles that match matchstatus. The caller provides storage for the array_of_handles and indicates the number of available elements in numHandles. \note If array_of_handles==NULL, then only the matching numHandles is returned.
* \param[out] numAvailHandles Populated with the number of handles available to be returned that match matchstatus.
* \return Error code
* \retval 0 Success
* \retval errno Positive non-zero values correspond with errno. strerror() can be used to interpret.
* \ingroup bbapi
*/
extern int BB_GetTransferList(BBSTATUS matchstatus, uint64_t* numHandles, BBTransferHandle_t array_of_handles[], uint64_t* numAvailHandles); https://github.com/IBM/CAST/blob/master/bb/include/bbapi.h#L261 If a job is restarted it's going to get a new jobid, so I don't see how you could use it to get the list of old transfers. |
Some more observations from testing:
If we have multiple transfers to cancel, we should cancel them in parallel in separate threads, as |
Regrading:
I ran some tests using This is important with BB API transfers, as there is a period of time in the transfer when the destination file will be the ending size, but all the data isn't present. I saw this by stating the file while it was being transferred (look at the block count): # start transfer in the background
$ axl_cp -X bbapi $BBPATH/bigfile /p/gpfs1/hutter2/bigfile &
$ while [ 1 ] ; do sleep 1 && stat /p/gpfs1/hutter2/bigfile | grep Size ; done
Size: 21206401024 Blocks: 34045952 IO Block: 16777216 regular file
Size: 21206401024 Blocks: 34766848 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 37388288 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 39256064 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41254912 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
exit_func: exiting, status 0, id 1
[1]+ Done rm -f /p/gpfs1/hutter2/bigfile && ~/AXL/build/test/axl_cp -X bbapi $BBPATH/bigfile /p/gpfs1/hutter2/bigfile
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file |
Thanks for exploring the space of the BB, @tonyhutter . This is great. We used to compute a CRC32 for each file when it was copied to the parallel file system, and we stored that in the SCR metadata. When reading the file back, we computed the CRC32 again and checked against the stored value. We dropped that feature when moving to the components, but it'd be nice to have back. Even so, SCR should avoid fetching a checkpoint that did not finalize. SCR assumes a checkpoint is bad unless it has explicitly marked it as complete by writing a flag into the .scr/index.scr file. It does that after the flush completes. In the case of an async flush, that happens here: Line 225 in d0777f8
then here: Line 364 in d0777f8
The SCR library has to be running for this to work. What's new with the BB is that this transfer will continue even after the application has stopped. If it errors out, then we're ok because SCR will not try to read it, since it never marked it as complete. A more interesting case is what happens if it actually succeeds? In that case, it'd be nice to have something that then updates the SCR index file to indicate that the checkpoint is actually good. The BB lets us register a script that runs in poststage. We could create a custom script that runs in poststage and then updates the index file if it finds the checkpoint succeeded. |
Actually, browsing through that code, I need to double check that it's properly setting that flag. It's not obvious to me, so it may have been lost in translation. Anyway, that is what it is supposed to be doing. |
Ok, before starting the flush, SCR writes a COMPLETE=0 flag: Line 318 in d0777f8
Then it only updates that flag after the flush finishes. So we should be covered. Though this TODO suggests the code that writes the complete marker needs to do some more checking: Line 245 in d0777f8
We don't catch the case right now if the flush completes but some process detected an error. |
One more thing. Though SCR is meant to rely on its index file to avoid reading back a partially flushed checkpoint, other AXL users might appreciate built-in support to better handle this. Perhaps AXL could transfer the file without the read bit enabled, and then flip the read bit on after it completes. Or AXL could flush to a temporary file name and then rename the file. |
I like this idea, it's simple and reliable. We'd probably want to tack on an
That way, it gives a hint to the user that "my |
Opened AXL issue for the file rename: |
So, I think I now have some answers to the original questions:
I tested SCR w/AXL-pthreads transfers and they work. They shouldn't adversely affect application performance since AXL pthreads already self-limits the number of transfer threads to the lesser of:
So AXL pthreads could only ever use a maximum of 16 threads. Additionally, the CPU is going to task switch when waiting on IOs to finish, so it's not like AXL would using those CPUs 100% of the time. The 16 thread number was chosen because transfer performance really doesn't increase after 16 threads (at least in the limited testing I did). Here are some benchmarks I did back when I was testing pthreads: Time to transfer 200 files with files sizes ranging between 1-50MB:
Time to transfer 20,000 files with files sizes ranging between 1-50KB:
In the common case yes. SCR will not load a checkpoint that isn't the right size or not marked internally within SCR as complete. If the file is corrupted after being successfully transferred, SCR will not detect it, but that's really not something SCR should be expected to do.
Sync and pthreads transfers will be cancelled when an application dies, since it is the application's own threads doing the transfer. BB API will continue the transfer after the application dies, and there's currently no easy way to cancel those transfers on job restart. In order to cancel a BB transfer on application restart, we would need to keep track of the BB API transfer handle across jobs. We don't currently do this. It's possible we could embed the transfer handles into the AXL state_file, but since the state_file is technically not required, it's not a great solition. A better solution would be to modify the BB API to allow
SCR does detect if a checkpoint file size is correct, and not load a previous checkpoint if it is not. According to @adammoody (#163 (comment)) SCR internally doesn't mark a transfer as complete unless it finishes flushing. This means that if the checkpoint was the right size, but the filo never actually said "I'm done transferring the file", then it won't load the checkpoint (Note: I didn't actually test this). This is a good thing. It should get around an issue with BBAPI transfers where the destination file is the correct final file size, but all the blocks hadn't finished transferring (see #163 (comment)) Overall, file transfer integrity should be done at a lower level like FILO or AXL. Perhaps we could add an Additionally, we proposed increasing file transfer integrity by having AXL use a temporary file name while it's transferring (ECP-VeloC/AXL#66) |
Great progress, @tonyhutter . On the application interference question, we also need to do some "noise" tests at medium or large scale perhaps during a DST after a system update. One thing we want check is whether a background transfer slows down the application. To do this, we can put together an "app" that does MPI_Allreduce(1 double) in a tight loop. We then measure the time time it takes to complete some number of those both with and without a background transfer. It's simple to recreate, but we have a "rednoise" benchmark we can use for this. |
Some background info... For tracking transfers to either cancel them during restart or wait on them to complete or error out within a poststage script, we've made this problem harder on ourselves when we moved to components. It used to be that both Filo and AXL were all part of the monolithic SCR library. In that case, SCR knew the full set of files that had to be transferred, and it was easy to treat that transfer as one logical operation. While co-designing the BB, we asked IBM to support shared transfer handles. That enables multiple processes, which may even be running on different nodes, to all individually register their own files as part of one larger shared transfer operation. In that case, one process like rank 0 allocates a single handle value, broadcasts it to all processes, and then all processes register their files under that single handle id. You can then query the status or cancel the entire transfer with a single integer id value. The plan was to record that single value somewhere safe, like replicate it on each node or store it on the parallel file system. With that, it would then be trivial to cancel a transfer on restart. We just need to make sure rank 0 can get the id back and then just rank 0 executes the cancel while everyone else waits in a barrier. This suddenly got much harder now that we have SCR / Filo / AXL. While we're on it, I keep thinking we should reevaluate whether we should move Filo back into SCR or merge it into AXL as an MPI option. That might help solve some of this. For example, at least Filo understands there is a wider set of files across processes and we might be able to use a single handle id again. That's assuming shared handle performance is acceptable, which we'd want to retest before making that switch. The other option is to record the full set of handle ids somewhere so that we can get them all back. An MPI gather would make that easier, too. |
Another thing that comes to mind... Filo is called by every process in MPI_COMM_WORLD during a flush or fetch, and every Filo instance invokes AXL. If the user is running P processes per node, each compute node then runs P AXL instances in parallel. On some machines, people run one MPI process per CPU core, so you can easily get tens of processes per node today. We may do better if Filo can use a single AXL instance per node. For pthreads, we'd reduce the number of threads we create on the node, and for BB, we'd cut back on the number of handles. To do this, Filo would have to identify the set of ranks that can share an AXL instance and delegate the work to one Filo process. We can determine whether that's worth the effort by testing. |
Keep in mind that AXL is only going to spawn 16 threads if there's >= 16 files to transfer in a single transfer. The user can always switch it to sync or some other transfer type, or do transfers in smaller batches of files at a time to limit the number of threads. |
Jitter test I created 3000 files with random sizes between 0-9MB on the SSD (around 14GB) and copied them to GPFS. While they were copying, I then used a utility that @adammoody's gave me to measure the jitter between a
There's not a huge difference in the results. This is on one of our beefy IBM nodes with a burst buffer and a ton of cores. |
Cool! Glad to see you got that working already, @tonyhutter . That was fast. For single node tests, the fwqmpi benchmark will be of most interest. In the past, we used to run this with one MPI rank per core, since that was how apps often ran. I think we should still do that for lassen, but we might also run one rank per GPU, which is more typical of CORAL apps. This benchmark has a perl script you can use to generate a graph, where each core generates a scatter plot of points. The last we tried this, we ran on cab and had 16 cores/node. Lassen is now at 40-42, so not sure how well all of this will hold up. Also, it'd be helpful to collect results with no transfer to server as a baseline. |
@kathrynmohror and @adammoody have identified some additional things we should test/verify with SCR:
The text was updated successfully, but these errors were encountered: