
Investigate/fix strncat issue #343

Closed
markcmiller86 opened this issue Aug 25, 2023 · 7 comments
@markcmiller86 (Member)

When we write a restart file in mercury on sierra, it looks like we are spending 61% of our time in DBGetToc, with 48% of total time in strncat() way down the call stack beneath it.
This only shows up in the "new" silo just released.

This confluence page has 2 .pdf files comparing the profile of the old vs new silo.

https://rzlc.llnl.gov/confluence/display/MER/Release+Issues

@markcmiller86 markcmiller86 added this to the 4.11.1 milestone Aug 25, 2023
@markcmiller86 (Member Author)

The edge labels are samples, not calls (see Edge Information). The slowdown is only 2x on rzansel but 1000x on sierra. Why is it so much bigger? And why are we slow on rzansel and sierra but not on other systems like rzgenie and jade?

We discovered the issue when regression tests on sierra that normally take 1 minute were killed by ATS after running for 60 minutes. Which tests "time out" differs every night; e.g., on Monday 10 tests timed out and on Tuesday 8 did, with only 1 test timing out on both days. For about 80% of the timed-out tests, the last message printed to stdout says Writing Restart or Writing Graphics.

@markcmiller86 (Member Author)

Well, I may be misinterpreting the numbers on the edges. If those are samples and not actual calls, then I apologize; I interpreted them as actual calls.

That said, I am able to produce cases on my macOS system where strncat is taking ~30% of the total time. And HDF5-1.12 is about 2-3x slower than HDF5-1.8.14 for the same operations. So there is definitely some issue here that I am sure I can improve.

@markcmiller86 (Member Author)

markcmiller86 commented Sep 6, 2023

I have investigated this issue with a modified dir.c test, a snippet of which is captured below...

int main(int argc, char *argv[])
{

    int            meshtypes[3], nmesh;
    char          *meshnames[3], original_dir[128];
    DBfile        *dbfile, *dbfile2, *dbfile3, *dbfile4, *dbfile5;
    char          *filename = "dir.pdb";
    char          *filename2 = "dir2.pdb";
    int            i, driver = DB_PDB, driver2 = DB_PDB;
    int            show_all_errors = FALSE;
    char          *objname = 0;
    int            ndirs = 0;
    int            ntocs = 0;

    for (i=1; i<argc; i++) {
        if (!strncmp(argv[i], "DB_PDB",6)) {
            driver = StringToDriver(argv[i]);
            driver2 = driver;
            filename = "dir.pdb";
            filename2 = "dir2.pdb";
        } else if (!strncmp(argv[i], "DB_HDF5", 7)) {
            driver = StringToDriver(argv[i]);
            driver2 = driver;
            filename = "dir.h5";
            filename2 = "dir2.h5";
        } else if (!strcmp(argv[i], "mix-driver-copy")) {
            driver2 = driver == DB_HDF5 ? DB_PDB : DB_HDF5;
            filename2 = driver == DB_HDF5 ? "dir2.pdb" : "dir2.h5";
        } else if (!strcmp(argv[i], "show-all-errors")) {
            show_all_errors = 1;
        } else if (!strncmp(argv[i], "ndirs=", 6)) {
            ndirs = (int) strtol(argv[i]+6,0,10);
        } else if (!strncmp(argv[i], "ntocs=", 6)) {
            ntocs = (int) strtol(argv[i]+6,0,10);
        }
    }
    /* ... */

    for (i = 0; i < ndirs; i++)
    {
        char dirname[64];
        sprintf(dirname, "domain_%08d", i);
        DBMkDir(dbfile, dirname);
    }
    /* ... */

    /* make some hellaciously long directory names which are almost the same */
    DBClose(dbfile);
    dbfile = DBOpen(filename, driver, DB_APPEND);
    if (ntocs)
    {
        char tmp[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
        DBSetDir(dbfile, "/");
        DBMkDir(dbfile, "testtoc");
        DBSetDir(dbfile, "/testtoc");
        for (i = 0; i < 26; i++)
        {
            tmp[235] = (char) (97+i);
            DBMkDir(dbfile, tmp);
        }

        for (i = 0; i < ntocs/2; i++)
        {
            DBtoc *dbtoc;

            DBSetDir(dbfile, "/");
            dbtoc = DBGetToc(dbfile);
            DBSetDir(dbfile, "/testtoc");
            dbtoc = DBGetToc(dbfile);
        }
    }

I have then tested Silo 4.11 and 4.10.3 with HDF5-1.8 and HDF5-1.14 on my macOS laptop and on a Lassen login node, in both my home dir and /p/gpfs1 and /p/vast, and with various of the HDF5 VFDs Silo supports. I do not find any case where the toc loop above is 1000x slower; I don't even see cases where performance is >2x slower.

In my tests, I was doing 10,000 DBGetToc() calls and calling DBSetDir() to switch current working directory between each call to ensure Silo's TOC caching was not coming into play.

I tried creating directories which contain a slew of long-named objects that differ only in the last few characters. This did have some effect on performance (as would be expected) but again, nothing approaching 1000x slower or even 10x slower. More like 2-3x slower (on Lassen).

I did discover that better args to H5Literate(), which does the traversal of named entries in an HDF5 group, can actually improve performance by an order of magnitude. However, it does mean that objects are iterated in a different order than in previous versions of Silo. On the other hand, this is all used to populate a DBtoc* that gets returned to the caller, so I don't think the caller cares about the order of traversal the HDF5 library uses to do that.

I did compare the stack dumps provided and there is a difference in the load_toc() function implementation in the two cases captured. The load_toc() function in the Silo driver is a callback used by H5Literate() to visit all the entries in an HDF5 group (dir).

The newer Silo includes a couple of additional calls to the HDF5 library to interrogate entries found and decide if they are symlinks or not. These are the calls to H5Gget_objinfo() and H5Lget_info().

Here is some performance data I gathered while in ~ on Lassen (clang compilers and using blueos3 installs of HDF5-1.8.16)...

| case | timing (from home dir on Lassen) |
| --- | --- |
| Silo-4.10.3, ndirs=100000, PDB | 1.676u 0.598s 0:02.30 98.2% 0+0k 19840+15488io 2pf+0w |
| Silo-4.10.3, ndirs=100000, HDF5 | 1.985u 1.122s 0:03.27 94.8% 0+0k 4352+225024io 2pf+0w |
| Silo-4.11.0, ndirs=100000, PDB | 2.500u 0.550s 0:03.08 99.0% 0+0k 7552+15488io 0pf+0w |
| Silo-4.11.0, ndirs=100000, HDF5 | 2.148u 1.044s 0:03.36 94.6% 0+0k 0+225024io 0pf+0w |
| Silo-4.10.3, ntocs=10000, PDB | 1.094u 0.118s 0:01.22 98.3% 0+0k 0+896io 0pf+0w |
| Silo-4.10.3, ntocs=10000, HDF5 | 1.329u 0.000s 0:01.33 99.2% 0+0k 0+1280io 0pf+0w |
| Silo-4.11.0, ntocs=10000, PDB | 2.078u 0.179s 0:02.26 99.1% 0+0k 0+896io 0pf+0w |
| Silo-4.11.0, ntocs=10000, HDF5 | 3.415u 1.557s 0:04.97 99.7% 0+0k 0+1152io 0pf+0w |
| case | timing (from /p/gpfs1 on Lassen) |
| --- | --- |
| Silo-4.10.3, ndirs=100000, PDB | 1.682u 1.261s 0:02.95 99.6% 0+0k 640+0io 1pf+0w |
| Silo-4.10.3, ndirs=100000, HDF5 | 2.182u 2.521s 0:04.76 98.7% 0+0k 0+0io 0pf+0w |
| Silo-4.11.0, ndirs=100000, PDB | 2.550u 1.235s 0:03.79 99.7% 0+0k 0+0io 0pf+0w |
| Silo-4.11.0, ndirs=100000, HDF5 | 2.238u 2.667s 0:04.98 98.1% 0+0k 0+0io 0pf+0w |
| Silo-4.10.3, ntocs=10000, PDB | 1.108u 0.229s 0:01.34 98.5% 0+0k 0+0io 0pf+0w |
| Silo-4.10.3, ntocs=10000, HDF5 | 1.436u 0.000s 0:01.44 99.3% 0+0k 0+0io 0pf+0w |
| Silo-4.11.0, ntocs=10000, PDB | 2.160u 0.280s 0:02.44 100.0% 0+0k 0+0io 0pf+0w |
| Silo-4.11.0, ntocs=10000, HDF5 | 3.595u 3.645s 0:07.25 99.7% 0+0k 0+0io 0pf+0w |

I have not tested with HDF5-1.10 or HDF5-1.12, both of which have known performance issues in certain circumstances.

Next, I tested Silo 4.11.1-pre1 with both HDF5-1.8 and HDF5-1.14

| case | timing (from home dir on Lassen) |
| --- | --- |
| HDF5-1.8, ntocs=10000, DB_PDB | 1.825u 0.235s 0:02.08 98.5% 0+0k 11264+896io 3pf+0w |
| HDF5-1.8, ntocs=10000, DB_HDF5 | 1.087u 0.057s 0:01.42 79.5% 0+0k 896+1280io 1pf+0w |
| HDF5-1.8, ntocs=10000, DB_HDF5_SILO | 0.323u 0.007s 0:00.33 96.9% 0+0k 0+640io 0pf+0w |
| HDF5-1.14, ntocs=10000, DB_PDB | 1.702u 0.151s 0:01.88 98.4% 0+0k 11648+896io 4pf+0w |
| HDF5-1.14, ntocs=10000, DB_HDF5 | 0.680u 0.225s 0:01.16 77.5% 0+0k 3968+1152io 2pf+0w |
| HDF5-1.14, ntocs=10000, DB_HDF5_SILO | 0.340u 0.026s 0:00.37 97.2% 0+0k 0+768io 0pf+0w |

@markcmiller86 (Member Author)

OK, so I discovered an issue with the really good results in some of the tests above. I had adjusted the args to H5Literate() to try to use creation-order indexing, but without having specified group creation properties to track and build the creation order index. The effect was that DBGetToc() went really, really fast because it was failing immediately and not doing any work. So, the results in the tables above for the new Silo, 4.11.1-pre1, are experimental error.

@markcmiller86 (Member Author)

I also used gprof on the modified dir.c to see what is reported as the most expensive operations with Silo-4.11.1-pre1 on both HDF5-1.8.22 and HDF5-1.14.0.

gprof of modified (./dir ntocs=10000 DB_HDF5) HDF5-1.8.22

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 15.53      0.16     0.16  1971298     0.00     0.00  H5G__component
 10.68      0.27     0.11  5550891     0.00     0.00  H5C_protect
  4.85      0.32     0.05  2513483     0.00     0.00  H5O_assert
  4.85      0.37     0.05  2160336     0.00     0.00  H5F_addr_decode_len
  3.88      0.41     0.04  5550891     0.00     0.00  H5C_unprotect
  3.88      0.45     0.04  3435469     0.00     0.00  H5SL_search
  2.91      0.48     0.03  5550891     0.00     0.00  H5AC_unprotect
  2.91      0.51     0.03  2512721     0.00     0.00  H5O_unprotect
  2.91      0.54     0.03   980550     0.00     0.00  H5G_traverse_real
  2.91      0.57     0.03   535277     0.00     0.00  H5SL_insert_common
  2.91      0.60     0.03   385037     0.00     0.00  H5G_get_objinfo
  1.94      0.62     0.02  2987595     0.00     0.00  H5G_name_free
  1.94      0.64     0.02  1850210     0.00     0.00  H5E_clear_stack
  1.94      0.66     0.02   965532     0.00     0.00  H5B_find
  1.94      0.68     0.02   517782     0.00     0.00  H5SL_remove
  1.94      0.70     0.02   200000     0.00     0.00  H5Lget_info
  1.94      0.72     0.02   200000     0.00     0.00  load_toc
  0.97      0.73     0.01  6613964     0.00     0.00  H5FL_blk_find_list
  0.97      0.74     0.01  5880343     0.00     0.00  H5FL_reg_free
  0.97      0.75     0.01  5550891     0.00     0.00  H5AC_protect
  0.97      0.76     0.01  3306982     0.00     0.00  H5FL_blk_malloc
  0.97      0.77     0.01  2243467     0.00     0.00  H5I__find_id
  0.97      0.78     0.01  1920728     0.00     0.00  H5RS_own
  0.97      0.79     0.01  1713008     0.00     0.00  H5I_object
  0.97      0.80     0.01  1353837     0.00     0.00  H5FL_reg_calloc
  0.97      0.81     0.01  1190706     0.00     0.00  H5WB_unwrap
  0.97      0.82     0.01  1160203     0.00     0.00  H5G__ent_to_link
  0.97      0.83     0.01  1010855     0.00     0.00  H5O_msg_read_oh
  0.97      0.84     0.01   980815     0.00     0.00  H5O_msg_read
  0.97      0.85     0.01   765553     0.00     0.00  H5G_oloc
  0.97      0.86     0.01   735858     0.00     0.00  H5C_flush_single_entry
  0.97      0.87     0.01   735080     0.00     0.00  H5F__accum_read
  0.97      0.88     0.01   705108     0.00     0.00  H5FD_sec2_read
  0.97      0.89     0.01   445050     0.00     0.00  H5G_node_decode_key
  0.97      0.90     0.01   385037     0.00     0.00  H5G_get_objinfo_cb
  0.97      0.91     0.01   375778     0.00     0.00  H5O_msg_free_real
  0.97      0.92     0.01   206436     0.00     0.00  H5P_get

gprof of modified (./dir ntocs=10000 DB_HDF5) HDF5-1.14.0

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 15.22      0.21     0.21  5753410     0.00     0.00  H5C_protect
 10.14      0.35     0.14  5753410     0.00     0.00  H5C_unprotect
  9.42      0.48     0.13  2723503     0.00     0.00  H5O__assert
  9.42      0.61     0.13  1961154     0.00     0.00  H5G__component
  2.90      0.65     0.04  5753410     0.00     0.00  H5AC_unprotect
  2.90      0.69     0.04   965532     0.00     0.00  H5B_find
  2.90      0.73     0.04   748370     0.00     0.00  H5C__flush_single_entry
  2.90      0.77     0.04   747604     0.00     0.00  H5C__load_entry
  2.17      0.80     0.03  1920740     0.00     0.00  H5RS__xstrdup
  2.17      0.83     0.03   430066     0.00     0.00  H5G__node_decode_key
  2.17      0.86     0.03   197533     0.00     0.00  H5O__chunk_deserialize
  1.45      0.88     0.02 16752014     0.00     0.00  H5AC_tag
  1.45      0.90     0.02 16752014     0.00     0.00  H5CX_set_tag
  1.45      0.92     0.02   975832     0.00     0.00  H5HL_protect
  1.45      0.94     0.02   975478     0.00     0.00  H5G__traverse_real
  1.45      0.96     0.02   747964     0.00     0.00  H5C__tag_entry
  1.45      0.98     0.02   385037     0.00     0.00  H5VL_group_optional
  1.45      1.00     0.02   320326     0.00     0.00  H5C_set_slist_enabled
  0.72      1.01     0.01 11507540     0.00     0.00  H5C_verify_tag
  0.72      1.02     0.01  9394157     0.00     0.00  H5FL_reg_free
  0.72      1.03     0.01  8667277     0.00     0.00  H5F_get_intent
  0.72      1.04     0.01  6260439     0.00     0.00  H5FL_reg_calloc
  0.72      1.05     0.01  5753410     0.00     0.00  H5AC_protect
  0.72      1.06     0.01  5493576     0.00     0.00  H5HL_offset_into
  0.72      1.07     0.01  5156773     0.00     0.00  H5MM_memcpy
  0.72      1.08     0.01  4831234     0.00     0.00  H5_trace
  0.72      1.09     0.01  1936090     0.00     0.00  H5G__node_cmp3

@markcmiller86 (Member Author)

On Lassen, I am unable to reproduce the issues observed on sierra. I will close this issue for now and arrange a time to investigate on sierra.

@markcmiller86 (Member Author)

I got onto sierra and performed tests similar to those above and was not able to reproduce the performance issues with DBGetToc that were originally reported.

I did expand my test to create many directories with very long (>200 character), almost identical names (differing only in the last few characters), and although there was some performance degradation in this case (as I would expect), it was nothing even approaching one order of magnitude, let alone three.

I did notice that for a similar test involving 10,000 directory names and DBGetToc calls, the PDB driver does about 5x better than HDF5-1.14.0 on sierra. Again, a problem to be investigated, but nothing worth holding up the release.
