Additional testing using pthreads and BBAPI #163

Open
tonyhutter opened this issue Mar 31, 2020 · 46 comments

@tonyhutter
Contributor

@kathrynmohror and @adammoody have identified some additional things we should test/verify with SCR:

  • Test using AXL with pthreads to transfer checkpoints. Does it work correctly? Do we need to add logic to throttle the transfer so it doesn't interfere with application performance? (possibly not, but just want to check to see what the performance is like)
  • Does the initiation and completion check for transfers work correctly? (at small and large scales?)
  • Need to be able to cancel transfer of files on restart. Requires tracking set of transfer handles used (with redundancy to account for failures). Ensure everything was cancelled successfully in SCR_Init during restart. Ensure final transfers are cancelled or complete in the case of scavenge/post-run transfers.
  • Check that SCR manages any final BBAPI transfer that is completed at the end of an allocation. In particular, SCR normally runs some finalization code to register that a transfer completed successfully, so that it knows the corresponding checkpoint is valid to be used during a restart in the next job allocation. I don't think we're accounting for that right now.
@tonyhutter
Contributor Author

You can assign this to me for now

@tonyhutter
Contributor Author

An early observation:

Below is the script I'm using to test the BBAPI transfer. To test, do srun -n4 -N4 <this script> from within the BB mount point:

#!/bin/bash

# Get BBAPI mount point
mnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd

cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE

SCR_CLUSTER_NAME=butte

SCR_FLUSH=1

STORE=/tmp GROUP=NODE COUNT=1
STORE=$mnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=/tmp/persist GROUP=WORLD COUNT=100

CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE   STORE=$mnt TYPE=XOR SET_SIZE=2  OUTPUT=1

CNTLDIR=/tmp BYTES=1GB

CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF

SCR_CONF_FILE=~/myscr.conf  ~/scr/build/examples/test_api

What I'm seeing is that when AXL tries to copy the file from /tmp/[blah]/ckpt to /mnt/[BB mount]/ckptdir/ckpt using BBAPI, it falls back to using pthreads. This happens because the AXL BBAPI path first checks whether both the source and destination support FIEMAP (file extents). If so, it continues to use BBAPI to transfer the files; if not, it falls back to using pthreads. In cases where the destination file doesn't exist yet, it falls back to calling FIEMAP on the destination directory. This is problematic, as tmp and BBAPI filesystems don't support FIEMAP on directories, while ext4 does (which is what I originally tested on). To get around this, we may want to fall back to doing a statfs() on the destination directory and checking f_type against a whitelist of filesystems that we know support extents.
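
As a rough sketch of that idea (not AXL's actual code; the magic numbers below are the usual Linux f_type values for ext4 and XFS plus the value observed for GPFS, and a real patch would verify them against linux/magic.h and the target systems):

#include <sys/vfs.h>   /* statfs(), struct statfs */

/* Hypothetical helper: return 1 if the filesystem holding 'path' is on a
 * whitelist of types known to support extents. */
static int axl_fs_supports_extents(const char *path)
{
    struct statfs st;
    if (statfs(path, &st) != 0) {
        return 0;                /* can't tell, so fall back to pthreads */
    }
    switch ((unsigned long) st.f_type) {
        case 0xEF53:             /* ext4 (EXT4_SUPER_MAGIC) */
        case 0x58465342:         /* xfs  (XFS_SUPER_MAGIC)  */
        case 0x47504653:         /* gpfs ("GPFS")           */
            return 1;
        default:
            return 0;            /* tmpfs and friends: no extent support */
    }
}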

@adammoody
Contributor

Sounds good. Related to this, as far as I understand it, supporting extents is a necessary but not a sufficient requirement for the BBAPI. The BB software uses extents, but even if a file system supports extents, there is no guarantee that it will work with the BBAPI.

IBM says that the BBAPI should work for transfers between a BB-managed file system on the SSD and GPFS, but any other combination only works by chance if at all. No other combinations of transfer pairs have been tested or designed to work.

@tonyhutter
Contributor Author

tonyhutter commented Apr 9, 2020

Some quirks I'm noticing using the BBAPI to transfer:

Can I transfer using BBAPI?

src     dst      can xfer?
xfs     ext4     Yes
ext4    xfs      Yes
xfs     gpfs     Yes
ext4    gpfs     No
gpfs    ext4     No
ext4    ext4     Yes
tmpfs   any FS   No

@tonyhutter
Contributor Author

We may want to update AXL to not only whitelist BBAPI based on src/dest filesystem type, but also check whether BBAPI can actually do the transfer, and fall back to pthreads if necessary. So in the table above, we'd fall back to pthreads on /tmp <-> /p/gpfs transfers and use BBAPI on the others.

@adammoody
Contributor

If the user explicitly requests the BBAPI in their configuration and it doesn't work, I think it's fine to just error out. I think that would either be a configuration error on their part or a system software error, either of which they probably would like to know about so that it can get fixed.

@tonyhutter
Contributor Author

Regarding the table in #163 (comment):
/tmp on the system I tested on was actually EXT4, not tmpfs. I've since tested tmpfs (transfers don't work at all with BBAPI), and updated the table.

@tonyhutter
Contributor Author

@adammoody I just opened ECP-VeloC/AXL#64 which disables fallback to pthreads by default, but you can enable it again in cmake with -DENABLE_BBAPI_FALLBACK. It's useful for me for testing, so I'd at least like to make it configurable. I've also updated it to use a FS type whitelist instead of checking if the FS supports extents.
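
(For reference, re-enabling the fallback at configure time would look something like the following, assuming a standard CMake build of AXL; the exact option spelling is whatever the PR lands on.)

cmake -DENABLE_BBAPI_FALLBACK=ON /path/to/AXL
make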

@tonyhutter
Contributor Author

I've been testing using the following script on one of our BBAPI machines:

#!/bin/bash

# Bypass mode is default - disable it to use AXL
export SCR_CACHE_BYPASS=0

ssdmnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd

gpfsmnt=$(mount | awk '/gpfs/{print $3}')
mkdir -p $gpfsmnt/`whoami`/testing
gpfsmnt=$gpfsmnt/`whoami`/testing

cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE

SCR_CLUSTER_NAME=`hostname`

SCR_FLUSH=1

STORE=/tmp GROUP=NODE COUNT=1 TYPE=pthread
STORE=$ssdmnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=$gpfsmnt GROUP=WORLD COUNT=1 TYPE=bbapi

CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE   STORE=$ssdmnt TYPE=XOR SET_SIZE=2  OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=NODE   STORE=$gpfsmnt TYPE=XOR SET_SIZE=2  OUTPUT=1

CNTLDIR=/tmp/hutter2 BYTES=1GB

CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF

SCR_CONF_FILE=~/myscr.conf  ~/scr/build/examples/test_api

The output from the script shows SCR successfully using AXL in pthreads and BBAPI mode:

AXL 0.3.0: lassen788: Read and copied /mnt/bb_2c90f9ab469844e39364103cb0b0d928/hutter2/scr.defjobid/scr.dataset.44/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.44/rank_0.ckpt sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.45/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.45/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.46/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.46/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.47/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.47/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: Read and copied /p/gpfs1/hutter2/testing/hutter2/scr.defjobid/scr.dataset.48/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.48/rank_0.ckpt sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452

So we know BBAPI and pthreads are working in the basic case.

@adammoody
Contributor

Great! So far, so good.

A tip for getting the BB path: an alternative is to use the $BBPATH variable, which will be defined in the environment of your job. Matching on /mnt/bb_ should work if all goes well. However, the BB software is still buggy, and it often leaves behind stray /mnt/bb_ directories. If you end up on a node where the BB failed to clean up, you'll see multiple /mnt/bb_ paths, only one of which is valid for your job.
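
In the test scripts above, that would amount to a one-line change, roughly (assuming $BBPATH is set in the job environment):

# Prefer the job's $BBPATH; fall back to scanning the mount table
ssdmnt=${BBPATH:-$(mount | grep -Eo '/mnt/bb_[a-z0-9]+' | head -n1)}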

tonyhutter added a commit to tonyhutter/scr that referenced this issue May 15, 2020
- Rename test_ckpt.C -> test_ckpt.c
- Convert it to an actual C file (previously it was C++)
- Allow it to take in an argument for checkpoint size in MB

These changes will help me with testing in LLNL#163.
adammoody pushed a commit that referenced this issue May 18, 2020
- Rename test_ckpt.C -> test_ckpt.cpp
- Update it so it can be easily used as C code as well.
- Allow it to take in an argument for checkpoint size in MB

These changes will help me with testing in #163.
@tonyhutter
Contributor Author

While testing last week, I ran across a FILO bug and created a PR: ECP-VeloC/filo#9

@tonyhutter
Contributor Author

So this is a little bizarre:

I noticed in my tests that SCR was reporting that it successfully transferred a 1MB file called "rank_0" from the SSD to GPFS using the BBAPI:

SCR v1.2.0: rank 0 on butte20: Initiating flush of dataset 22
AXL 0.3.0: butte20: Read and copied /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0 to /p/gpfs1/hutter2/rank_0 sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
SCR v1.2.0: rank 0 on butte20: scr_flush_sync: 0.285768 secs, 1.048576e+06 bytes, 3.499343 MB/s, 3.499343 MB/s per proc
SCR v1.2.0: rank 0 on butte20: scr_flush_sync: Flush of dataset 22 succeeded

But the resulting file was 0 bytes.

$ ls -l /p/gpfs1/hutter2/rank_0
-rw------- 1 hutter2 hutter2 0 May 20 09:55 /p/gpfs1/hutter2/rank_0

$ ls -l /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0
-rw------- 1 hutter2 hutter2 1048576 May 20 09:49 /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0

I tried the same test with axl_cp -X bbapi ..., and got the same 0 byte file. I then manually created a 1MB file on the SSD and transferred it to GPFS using the BBAPI, and it worked:

$ dd if=/dev/zero of=/mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb bs=1M count=1

$ ls -l /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb 
-rw------- 1 hutter2 hutter2 1048576 May 20 10:03 /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb

$ axl_cp -X bbapi /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb /p/gpfs1/hutter2/zero_1mb   
AXL 0.3.0: butte20: Read and copied /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb to /p/gpfs1/hutter2/zero_1mb sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452

$ ls -l /p/gpfs1/hutter2/zero_1mb
-rw------- 1 hutter2 hutter2 1048576 May 20 10:04 /p/gpfs1/hutter2/zero_1mb

I see the same failure on two of our IBM systems. I'm still investigating.

@tonyhutter
Contributor Author

tonyhutter commented May 20, 2020

Issue 2: when I log in to a machine for the first time and run my test, I always hit this error:

SCR v1.2.0: rank 0: Initiating flush of dataset 1
AXL 0.3.0 ERROR: AXL Error with BBAPI rc:       -1 @ bb_check /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:131
AXL 0.3.0 ERROR: AXL Error with BBAPI details:
     "error": {
        "func": "queueTagInfo",
        "line": "2903",
        "sourcefile": "/u/tgooding/cast_proto1.5/CAST/bb/src/xfer.cc",
        "text": "Transfer definition for contribid 436 already exists for LVKey(bb.proxy.44 (192.168.129.182),bef4bbc8-2214-4077-8264-75d8f993c296), TagID((1093216,835125249),2875789842327411), handle 4294969993. Extents have already been enqueued for the transfer definition. Most likely, an incorrect contribid was specified or a different tag should be used for the transfer."
    },

If I re-run the test the problem always goes away. Looks like a stale transfer handle issue. I'll try it with axl_cp and see if the same thing happens.

@tonyhutter
Contributor Author

tonyhutter commented May 20, 2020

If I re-run the test the problem always goes away. Looks like a stale transfer handle issue. I'll try it with axl_cp and see if the same thing happens.

I'm not seeing it with axl_cp. I used axl_cp -X bbapi to copy a new file from SSD->GPFS right after logging in, and didn't get the transfer definition error. It's possible SCR/filo is exercising AXL in a more advanced way that causes the transfer error.

@tonyhutter
Contributor Author

tonyhutter commented May 20, 2020

Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed how long it took to copy a 10GB random-data file from SSD to GPFS with axl_cp -X sync, cp, and axl_cp -X bbapi, and just to make sure there was no funny business, I also timed how long it took to md5sum the file afterwards:

type              copy    md5
axl_cp -X sync    1.8s    32s
cp                1.8s    32s
axl_cp -X bbapi   16.5s   32s

@gonsie
Member

gonsie commented May 20, 2020

BBAPI traffic is throttled in the network, so I’m not surprised it’s slower... but that is quite slow.

@tonyhutter
Contributor Author

It's possible that the copy was faster because the whole file was in the page cache. That would mean the regular copies were just reading the data from memory rather than the SSD. I believe BBAPI just does a direct device-to-device SSD->GPFS transfer, which would bypass the page cache. Unfortunately, I was unable to test with zeroed caches (echo 3 > /proc/sys/vm/drop_caches) since I'm not root.

@tonyhutter
Contributor Author

The 0B rank_0 file issue arises because it is a sparse file.

$ fiemap /mnt/bb_131fa614a608da727b038ed08e6eaad4/tmp/hutter2/scr.defjobid/scr.dataset.5/rank_0
ioctl success, extents = 0

Since BBAPI transfers extents, it would make sense that it would create a zero byte file for a sparse file. Proof:

# create one regular 1M file, and one sparse file
$ dd if=/dev/zero of=$BBPATH/file1 bs=1M count=1
$ truncate -s 1m $BBPATH/file2

# lookup extents
$ fiemap $BBPATH/file1
ioctl success, extents = 1
$ fiemap $BBPATH/file2
ioctl success, extents = 0

# both files appear to be 1MB to the source filesystem
$ ls -l $BBPATH/file1
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file1
$ ls -l $BBPATH/file2
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file2

# transfer them using BBAPI
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi  $BBPATH/file1 /p/gpfs1/hutter2/
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi  $BBPATH/file2 /p/gpfs1/hutter2/

# sizes after transfer
$ ls -l /p/gpfs1/hutter2/
total 1024
-rw------- 1 hutter2 hutter2 1048576 May 20 14:01 file1
-rw------- 1 hutter2 hutter2       0 May 20 14:02 file2

It gets really interesting when you create and transfer a partially sparse file:

$ echo "hello world" > /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ truncate -s 1M /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ ls -l /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
-rw------- 1 hutter2 hutter2 1048576 May 20 14:11 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ fiemap /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
ioctl success, extents = 1
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi  $BBPATH/file3 /p/gpfs1/hutter2/
$ ls -l /p/gpfs1/hutter2/file3
-rw------- 1 hutter2 hutter2   65536 May 20 14:12 file3
$ cat /p/gpfs1/hutter2/file3
hello world

I then did another test where I created a sparse section between two extents.

echo "hello world" > file4
truncate -s 1M file4
echo "the end" >> file4

When I transferred that file using the BBAPI, it was the correct size and had the correct data.

TL;DR: don't end your file with sparse sections, or they're not going to get transferred correctly with BBAPI.
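
For reference, the fiemap helper used above is essentially a thin wrapper around the FS_IOC_FIEMAP ioctl. A minimal sketch (not the exact tool; error handling trimmed):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: fiemap <file>\n"); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct fiemap fm;
    memset(&fm, 0, sizeof(fm));
    fm.fm_start        = 0;
    fm.fm_length       = FIEMAP_MAX_OFFSET;  /* map the whole file */
    fm.fm_flags        = FIEMAP_FLAG_SYNC;   /* flush dirty data first */
    fm.fm_extent_count = 0;                  /* 0 = just count the extents */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) { perror("ioctl"); close(fd); return 1; }
    printf("ioctl success, extents = %u\n", fm.fm_mapped_extents);
    close(fd);
    return 0;
}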

@tgooding

Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed the amount of time it took to copy a 10GB, random-data, file from SSD to GPFS

That is expected. BB transfers are intentionally throttled via InfiniBand QoS at ~0.6GBps per node to minimize network impacts to MPI and demand I/O. So 10GB should take around 16-17 seconds when transferring in the SSD->GPFS direction. The transfers in the GPFS->SSD direction are not governed and you should see near native SSD write speeds.

@tonyhutter
Contributor Author

@tgooding out of curiosity, can the throttling be adjusted or disabled?

@tgooding

"Adjusting" can be done. Its a cluster wide parameter and would require changing the InfiniBand switch settings - not something you'd want to do often.

"Disabling" is easier. I/O on port 4420 is throttled, so you just need to change to use the other defined NVMe over Fabrics port (4421).

I think you can edit the configuration file /etc/ibm/bb.cfg and change "readport" to 2. Then restart bbProxy.

Alternatively, you can pass in a custom config template via bbactivate --configtempl=<myaltconfig>. The original/default template is at /opt/ibm/bb/scripts/bb.cfg

@tonyhutter
Contributor Author

@tgooding thank you! 👍

@tonyhutter
Contributor Author

@adammoody regarding "should we cancel/not-cancel existing transfers on AXL restart", I forgot about this thread from around a year ago:

ECP-VeloC/AXL#57

I'll put my comments in there since it already has a lot of good discussion.

@tonyhutter
Contributor Author

Regarding:

  • Need to be able to cancel transfer of files on restart. Requires tracking set of transfer handles used (with redundancy to account for failures). Ensure everything was cancelled successfully in SCR_Init during restart. Ensure final transfers are cancelled or complete in the case of scavenge/post-run transfers.

See: ECP-VeloC/AXL#57 (comment) .

TL;DR: sync/pthread transfers get killed with the job automatically, and I don't see how we could cancel BBAPI transfers.

@tgooding

@tonyhutter : if you haven't, look at using BB_GetTransferList() at SCR startup time for getting the active handles.

@adammoody
Contributor

@tonyhutter , we also have a working test case in which a later job step cancels the transfer started by the previous job step. It's in the LLNL bitbucket, so I'll point you to that separately.

@tonyhutter
Contributor Author

@tgooding according to the documentation, BB_GetTransferList() only returns the transfer handles for a specific jobid/jobstepid.

/**
 *  \brief Obtain the list of transfers
 *  \par Description
 *  The BB_GetTransferList routine obtains the list of transfers within the job that match the
 *  status criteria.  BBSTATUS values are powers-of-2 so they can be bitwise OR'd together to
 *  form a mask (matchstatus).  For each of the job's transfers, this mask is bitwise AND'd
 *  against the status of the transfer and if non-zero, the transfer handle for that transfer
 *  is returned in the array_of_handles.
 *
 *  Transfer handles are associated with a jobid and jobstepid.  Only those transfer handles that were
 *  generated for the current jobid and jobstepid are returned.
 *
 *  \param[in] matchstatus Only transfers with a status that match matchstatus will be returned.  matchstatus can be a OR'd mask of several BBSTATUS values.
 *  \param[inout] numHandles Populated with the number of handles returned.  Upon entry, contains the number of handles allocated to the array_of_handles.
 *  \param[out] array_of_handles Returns an array of handles that match matchstatus.  The caller provides storage for the array_of_handles and indicates the number of available elements in numHandles.  \note If array_of_handles==NULL, then only the matching numHandles is returned.
 *  \param[out] numAvailHandles Populated with the number of handles available to be returned that match matchstatus.
 *  \return Error code
 *  \retval 0 Success
 *  \retval errno Positive non-zero values correspond with errno.  strerror() can be used to interpret.
 *  \ingroup bbapi
 */
extern int BB_GetTransferList(BBSTATUS matchstatus, uint64_t* numHandles, BBTransferHandle_t array_of_handles[], uint64_t* numAvailHandles);

https://github.com/IBM/CAST/blob/master/bb/include/bbapi.h#L261

If a job is restarted it's going to get a new jobid, so I don't see how you could use it to get the list of old transfers.
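
For what it's worth, the calling pattern for the routine quoted above looks roughly like this (a sketch based only on the signature shown; the BBSTATUS mask constants come from the BB headers, and as noted it only sees handles from the current jobid/jobstepid):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "bbapi.h"

static void list_my_transfers(BBSTATUS matchstatus)
{
    uint64_t num = 0, avail = 0;

    /* First call with a NULL array just to learn how many handles match. */
    if (BB_GetTransferList(matchstatus, &num, NULL, &avail) != 0 || avail == 0) {
        return;
    }

    BBTransferHandle_t *handles = calloc(avail, sizeof(*handles));
    if (handles == NULL) {
        return;
    }
    num = avail;   /* number of slots we allocated */
    if (BB_GetTransferList(matchstatus, &num, handles, &avail) == 0) {
        for (uint64_t i = 0; i < num; i++) {
            printf("active handle[%llu] = %llu\n",
                   (unsigned long long) i, (unsigned long long) handles[i]);
            /* a restart path could persist these, then query or cancel them */
        }
    }
    free(handles);
}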

@tonyhutter
Contributor Author

Some more observations from testing:

  1. I was able to cancel a transfer across job IDs by saving the transfer handle and cancelling it. So we would need to save this state somewhere. This is going to be an issue with AXL, as saving the state is optional (via the optional save_file passed to AXL_Init()).

  2. I confirmed that BB_GetTransferList() will only list the transfers for your job id. So if your job dies and you re-launch, you can't use BB_GetTransferList() to find your old transfers and cancel them. I assume this is some sort of access control policy.

  3. Cancelling a BB API transfer is not immediate. A 25GB copy from the burst buffer to GPFS takes ~1min 30sec, and BB_CancelTransfer() then takes ~30sec to cancel that transfer. Interestingly, it also took ~30s to cancel a 50GB transfer, but only 13s to cancel a 10GB transfer.

If we have multiple transfers to cancel, we should cancel them in parallel in separate threads, as BB_CancelTransfer() will block until the transfer is cancelled.

@tonyhutter
Contributor Author

Regarding:

Check that SCR manages any final BBAPI transfer that is completed at the end of an allocation. In particular, SCR normally runs some finalization code to register that a transfer completed successfully, so that it knows the corresponding checkpoint is valid to be used during a restart in the next job allocation. I don't think we're accounting for that right now.

I ran some tests using test_api to see what would happen if I corrupted a checkpoint file. If I changed the file size, SCR would detect that the checkpoint was corrupted and correctly fall back to an earlier checkpoint. If I wrote random data to the checkpoint but kept the same file size, the corruption was caught, but only because there's a check within test_api.c itself. When I removed that manual check to see if SCR would internally detect the corruption, it did not.

This is important with BB API transfers, as there is a period during the transfer when the destination file is already at its final size but not all of the data is present yet. I saw this by running stat on the file while it was being transferred (look at the block count):

# start transfer in the background
$  axl_cp -X bbapi $BBPATH/bigfile /p/gpfs1/hutter2/bigfile &

$ while [ 1 ] ; do sleep 1 && stat /p/gpfs1/hutter2/bigfile | grep Size ; done
  Size: 21206401024	Blocks: 34045952   IO Block: 16777216 regular file
  Size: 21206401024	Blocks: 34766848   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 37388288   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 39256064   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41254912   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file
exit_func: exiting, status 0, id 1
[1]+  Done                    rm -f /p/gpfs1/hutter2/bigfile && ~/AXL/build/test/axl_cp -X bbapi $BBPATH/bigfile /p/gpfs1/hutter2/bigfile
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file
  Size: 21474836480	Blocks: 41943040   IO Block: 16777216 regular file

@adammoody
Contributor

Thanks for exploring the space of the BB, @tonyhutter . This is great.

We used to compute a CRC32 for each file when it was copied to the parallel file system, and we stored that in the SCR metadata. When reading the file back, we computed the CRC32 again and checked against the stored value. We dropped that feature when moving to the components, but it'd be nice to have back.
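
For reference, the per-file check being described boils down to something like this (a minimal sketch using zlib's crc32(); the function name is hypothetical, not an existing SCR/AXL API):

#include <stdio.h>
#include <stdint.h>
#include <zlib.h>   /* crc32() */

/* Compute a file's CRC32 so it can be stored at flush time and
 * re-checked at fetch time. */
static int file_crc32(const char *path, uint32_t *out)
{
    unsigned char buf[1 << 16];
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }
    uLong crc = crc32(0L, Z_NULL, 0);
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        crc = crc32(crc, buf, (uInt) n);
    }
    fclose(fp);
    *out = (uint32_t) crc;
    return 0;
}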

Even so, SCR should avoid fetching a checkpoint that did not finalize. SCR assumes a checkpoint is bad unless it has explicitly marked it as complete by writing a flag into the .scr/index.scr file. It does that after the flush completes. In the case of an async flush, that happens here:

if (scr_flush_complete(id, scr_flush_async_file_list) != SCR_SUCCESS) {

then here:
int scr_flush_complete(int id, kvtree* file_list)

The SCR library has to be running for this to work. What's new with the BB is that this transfer will continue even after the application has stopped. If it errors out, then we're ok because SCR will not try to read it, since it never marked it as complete.

A more interesting case is what happens if the transfer actually succeeds. In that case, it'd be nice to have something that then updates the SCR index file to indicate that the checkpoint is actually good. The BB lets us register a script that runs in poststage. We could create a custom script that runs in poststage and updates the index file if it finds the checkpoint succeeded.

@adammoody
Contributor

Actually, browsing through that code, I need to double check that it's properly setting that flag. It's not obvious to me, so it may have been lost in translation. Anyway, that is what it is supposed to be doing.

@adammoody
Contributor

Ok, before starting the flush, SCR writes a COMPLETE=0 flag:

int scr_flush_init_index(scr_dataset* dataset)

Then it only updates that flag after the flush finishes. So we should be covered.

Though this TODO suggests the code that writes the complete marker needs to do some more checking:

/* TODO: need to determine whether everyone flushed successfully */

We don't catch the case right now if the flush completes but some process detected an error.

@adammoody
Contributor

adammoody commented May 28, 2020

One more thing. Though SCR is meant to rely on its index file to avoid reading back a partially flushed checkpoint, other AXL users might appreciate built-in support to better handle this. Perhaps AXL could transfer the file without the read bit enabled, and then flip the read bit on after it completes. Or AXL could flush to a temporary file name and then rename the file.

@tonyhutter
Contributor Author

Or AXL could flush to a temporary file name and then rename the file.

I like this idea; it's simple and reliable. We'd probably want to tack an ._AXL extension onto the filename instead of using a temp name, like:

ckpt_0._AXL

That way, it gives a hint to the user that "my ckpt_0 file is still being transferred by AXL" and they could ls -l it to track progress.
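
The mechanism would be roughly the following (a sketch of the idea, not the eventual AXL implementation):

#include <stdio.h>   /* snprintf(), rename() */

/* Transfer into "<dest>._AXL", then rename() to the final name only after
 * the transfer completes. rename() is atomic within a filesystem, so readers
 * never see a partially transferred file under the real name. */
static int publish_transferred_file(const char *dest)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s._AXL", dest);

    /* ... transfer the data into 'tmp' via sync/pthread/BBAPI ... */

    return rename(tmp, dest);
}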

@tonyhutter
Contributor Author

Opened AXL issue for the file rename:
ECP-VeloC/AXL#66

@tonyhutter
Contributor Author

tonyhutter commented May 29, 2020

So, I think I now have some answers to the original questions:

Test using AXL with pthreads to transfer checkpoints. Does it work correctly? Do we need to add logic to throttle the transfer so it doesn't interfere with application performance? (possibly not, but just want to check to see what the performance is like)

I tested SCR with AXL pthreads transfers and they work. They shouldn't adversely affect application performance, since AXL pthreads already self-limits the number of transfer threads to the smallest of:

  1. The number of CPUs
  2. MAX_THREADS (currently 16)
  3. The number of files being transferred.

So AXL pthreads could only ever use a maximum of 16 threads. Additionally, the CPU is going to task-switch while waiting on I/Os to finish, so it's not like AXL would be using those CPUs 100% of the time.
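
In pseudocode terms the selection is simply (illustrative names, not AXL's actual identifiers):

#include <unistd.h>   /* sysconf() */

#define MAX_THREADS 16   /* mirrors the limit described above */

static int transfer_thread_count(int num_files)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    int n = MAX_THREADS;
    if (ncpus > 0 && ncpus < n) n = (int) ncpus;
    if (num_files < n)          n = num_files;
    return n > 0 ? n : 1;
}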

The 16 thread number was chosen because transfer performance really doesn't increase after 16 threads (at least in the limited testing I did). Here are some benchmarks I did back when I was testing pthreads:

Time to transfer 200 files with file sizes ranging between 1-50MB:

Threads   Time (seconds)
1         4
2         3
4         2.6
8         2.4
16        2.4
32        2.4

Time to transfer 20,000 files with file sizes ranging between 1-50KB:

Threads   Time (seconds)
1         16.5
2         12.2
4         9.6
8         8.5
16        8
32        7.9

Does the initiation and completion check for transfers work correctly? (at small and large scales?)

In the common case, yes. SCR will not load a checkpoint that isn't the right size or isn't marked internally within SCR as complete. If the file is corrupted after being successfully transferred, SCR will not detect it, but that's really not something SCR should be expected to do.

Need to be able to cancel transfer of files on restart. Requires tracking set of transfer handles used (with redundancy to account for failures). Ensure everything was cancelled successfully in SCR_Init during restart. Ensure final transfers are cancelled or complete in the case of scavenge/post-run transfers.

Sync and pthreads transfers will be cancelled when an application dies, since it is the application's own threads doing the transfer.

BB API transfers will continue after the application dies, and there's currently no easy way to cancel those transfers on job restart. In order to cancel a BB transfer on application restart, we would need to keep track of the BB API transfer handle across jobs. We don't currently do this. It's possible we could embed the transfer handles into the AXL state_file, but since the state_file is technically not required, it's not a great solution. A better solution would be to modify the BB API to allow BB_GetTransferList() to return a list of all the user's transfer handles, rather than artificially limiting it to the current job ID's handles. I opened an issue with IBM requesting this change:
IBM/CAST#922

Check that SCR manages any final BBAPI transfer that is completed at the end of an allocation. In particular, SCR normally runs some finalization code to register that a transfer completed successfully, so that it knows the corresponding checkpoint is valid to be used during a restart in the next job allocation. I don't think we're accounting for that right now.

SCR does check whether a checkpoint file's size is correct, and will not load the checkpoint if it is not.

According to @adammoody (#163 (comment)), SCR internally doesn't mark a transfer as complete unless it finishes flushing. This means that if the checkpoint was the right size, but Filo never actually said "I'm done transferring the file", then SCR won't load the checkpoint (note: I didn't actually test this). This is a good thing. It should get around the issue with BBAPI transfers where the destination file is at its correct final size but not all of the blocks have finished transferring (see #163 (comment)).

Overall, file transfer integrity should be done at a lower level like FILO or AXL. Perhaps we could add an AXL_Checksum(int id) function that the user could optionally call after AXL_Wait()? The nice thing about doing it in AXL is that we could do the checksum in parallel, since AXL is already pthreads-aware.

Additionally, we proposed improving file transfer integrity by having AXL use a temporary file name while it's transferring (ECP-VeloC/AXL#66).

@adammoody
Contributor

Great progress, @tonyhutter. On the application interference question, we also need to do some "noise" tests at medium or large scale, perhaps during a DST after a system update. One thing we want to check is whether a background transfer slows down the application.

To do this, we can put together an "app" that does MPI_Allreduce(1 double) in a tight loop. We then measure the time it takes to complete some number of those, both with and without a background transfer. It's simple to recreate, but we have a "rednoise" benchmark we can use for this.
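
A minimal sketch of that probe (illustrative only; not the actual rednoise/fwqmpi benchmarks):

#include <mpi.h>
#include <stdio.h>

/* Time a tight loop of 1-double MPI_Allreduce calls; run it with and
 * without a background transfer and compare. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const long iters = 20 * 1000 * 1000;
    double in = 1.0, out = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (long i = 0; i < iters; i++) {
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        printf("%ld allreduces in %.3f s (%.3f us each)\n",
               iters, t1 - t0, (t1 - t0) * 1e6 / iters);
    }
    MPI_Finalize();
    return 0;
}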

@adammoody
Contributor

adammoody commented May 30, 2020

Some background info... For tracking transfers, either to cancel them during restart or to wait on them to complete or error out within a poststage script, we made this problem harder on ourselves when we moved to components. It used to be that both Filo and AXL were part of the monolithic SCR library. In that case, SCR knew the full set of files that had to be transferred, and it was easy to treat that transfer as one logical operation.

While co-designing the BB, we asked IBM to support shared transfer handles. That enables multiple processes, which may even be running on different nodes, to all individually register their own files as part of one larger shared transfer operation. In that case, one process like rank 0 allocates a single handle value, broadcasts it to all processes, and then all processes register their files under that single handle id. You can then query the status or cancel the entire transfer with a single integer id value.

The plan was to record that single value somewhere safe, like replicating it on each node or storing it on the parallel file system. With that, it would then be trivial to cancel a transfer on restart. We just need to make sure rank 0 can get the id back; then rank 0 alone executes the cancel while everyone else waits in a barrier.
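
The coordination itself is cheap; roughly (a sketch, with the BB-specific handle allocation elided and the save-file name purely illustrative):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

/* Rank 0 owns a single shared transfer handle (obtained from the BB API),
 * broadcasts it, and records it somewhere durable so a later run can query
 * or cancel the whole transfer with that one value. */
static uint64_t share_transfer_handle(uint64_t handle_from_rank0)
{
    uint64_t handle = handle_from_rank0;   /* only meaningful on rank 0 */
    MPI_Bcast(&handle, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* e.g. persist to the parallel file system; path is illustrative */
        FILE *fp = fopen("scr_bb_handle.txt", "w");
        if (fp) { fprintf(fp, "%llu\n", (unsigned long long) handle); fclose(fp); }
    }
    return handle;   /* every rank registers its files under this handle */
}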

This suddenly got much harder now that we have SCR / Filo / AXL.

While we're on it, I keep thinking we should reevaluate whether we should move Filo back into SCR or merge it into AXL as an MPI option. That might help solve some of this. For example, at least Filo understands there is a wider set of files across processes and we might be able to use a single handle id again. That's assuming shared handle performance is acceptable, which we'd want to retest before making that switch. The other option is to record the full set of handle ids somewhere so that we can get them all back. An MPI gather would make that easier, too.

@adammoody
Contributor

Another thing that comes to mind... Filo is called by every process in MPI_COMM_WORLD during a flush or fetch, and every Filo instance invokes AXL. If the user is running P processes per node, each compute node then runs P AXL instances in parallel. On some machines, people run one MPI process per CPU core, so you can easily get tens of processes per node today.

We may do better if Filo can use a single AXL instance per node. For pthreads, we'd reduce the number of threads we create on the node, and for BB, we'd cut back on the number of handles. To do this, Filo would have to identify the set of ranks that can share an AXL instance and delegate the work to one Filo process. We can determine whether that's worth the effort by testing.
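
One way to carve out those per-node groups is a shared-memory communicator split; a sketch (illustrative, not Filo code):

#include <mpi.h>

/* Split MPI_COMM_WORLD into per-node communicators and let local rank 0
 * drive the single AXL instance for its node. */
static int node_leader_rank(MPI_Comm comm, MPI_Comm *node_comm_out)
{
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    *node_comm_out = node_comm;   /* non-leaders send their file lists here */
    return (node_rank == 0);      /* 1 if this process is the node's leader */
}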

@tonyhutter
Contributor Author

On some machines, people run one MPI process per CPU core, so you can easily get tens of processes per node today.

Keep in mind that AXL will only spawn 16 threads if there are >= 16 files in a single transfer. The user can always switch to sync or some other transfer type, or transfer files in smaller batches to limit the number of threads.

@tonyhutter
Contributor Author

Jitter test

I created 3000 files with random sizes between 0-9MB on the SSD (around 14GB total) and copied them to GPFS. While they were copying, I used a utility that @adammoody gave me to measure the jitter between MPI_Allreduce() calls. The utility sampled for 20,000,000 iterations (about 6 sec). Note that I had to run the bbapi copy with only 1000 files to get around an error ("Too many open files").

Ticks sync pthreads bbapi
0-9: 0 0 0
10-19: 0 0 0
20-29: 0 0 0
30-39: 0 0 0
40-49: 0 0 0
50-59: 0 0 0
60-69: 0 0 0
70-79: 0 0 0
80-89: 0 0 0
90-99: 0 0 0
100-109: 799515 701805 736913
110-119: 19030830 19107556 19088667
120-129: 165606 143548 170098
130-139: 1426 1500 1089
140-149: 884 858 879
150-159: 430 781 621
160-169: 149 39395 302
170-179: 91 344 113
180-189: 32 210 51
190-199: 18 212 32
200-209: 8 310 28
210-219: 11 243 32
220-229: 9 238 40
230-239: 9 1510 92
240-249: 9 324 47
250-259: 12 61 29
260-269: 12 48 16
270-279: 7 35 13
280-289: 7 33 12
290-299: 2 27 8
300-309: 5 17 1
310-319: 1 7 2
320-329: 0 11 2
330-339: 2 8 1
340-349: 3 2 0
350-359: 0 5 1
360-369: 1 1 1
370-379: 1 5 1
380-389: 2 4 4
390-399: 0 3 14
400-409: 1 3 19
410-419: 1 2 12
420-429: 3 2 6
430-439: 2 7 1
440-449: 1 3 2
450-459: 2 0 0
460-469: 6 6 2
470-479: 5 2 4
480-489: 12 8 8
490-499: 15 16 19
500 870 850 818

There's not a huge difference in the results. This is on one of our beefy IBM nodes with a burst buffer and a ton of cores.

@adammoody
Contributor

Cool! Glad to see you got that working already, @tonyhutter. That was fast.

For single-node tests, the fwqmpi benchmark will be of most interest. In the past, we used to run this with one MPI rank per core, since that was how apps often ran. I think we should still do that for Lassen, but we might also run one rank per GPU, which is more typical of CORAL apps.

This benchmark has a perl script you can use to generate a graph, where each core generates a scatter plot of points.

The last time we tried this, we ran on Cab and had 16 cores/node. Lassen is now at 40-42, so I'm not sure how well all of this will hold up.

Also, it'd be helpful to collect results with no transfer to serve as a baseline.
