Skip to content

Commit

Permalink
Improves error reporting in esl_json parser; fixes critical bug in es…
Browse files Browse the repository at this point in the history
…l_mem_strtof().

Systematically checked and improved all eslFORMAT errors reportable by
the JSON parser. [H5/134-135]

Fixes a critical bug in esl_mem_strtof(), which would fail on most
exponents > 10 (i.e. "1.0e11"). [H5/134]

esl_json_ReadInt() changed to use code derived from esl_mem_strtoi(),
rather than calling it. Validated JSON number tokens can be handled
without as much error checking. Added documentation and a unit test
for it, utest_read_int().

esl_json_ReadFloat(), similarly, already used code derived from
esl_mem_strtof()... but ... discovered that conversion of strings to
floats is harder than it seemed. Added notes in esl_mem.md to record
why. Nonetheless decided to stick with hand-rolled code, because the
speed difference (3x) is more important than strict accuracy (HMMER
can live with +/-1 ulp typical error), compared to the alternative of
copying an ESL_BUFFET char array to a NUL-terminated string and
calling strtof() on it.  Added utest_read_float() and
utest_read_float_err() unit tests. The utest_read_float() test
includes examples of a couple of super pathological edge cases that
esl_json_ReadFloat() doesn't get quite right.

Adds utest_mem_strtof_error() unit tests for esl_mem_strtof(), similar
to the esl_json test. This is what caught the critical bug, which had
escaped the existing esl_mem unit test, which I'd erroneously thought
was pretty damned thorough.

Adds esl_rnd_floatstring(), which generates a random digital string
representation of a float, for testing. Used by both the esl_json and
esl_mem unit tests above.
  • Loading branch information
cryptogenomicon committed Aug 16, 2018
1 parent ed665c7 commit 177dc37
Show file tree
Hide file tree
Showing 7 changed files with 546 additions and 102 deletions.
313 changes: 271 additions & 42 deletions esl_json.c

Large diffs are not rendered by default.

121 changes: 97 additions & 24 deletions esl_mem.c
Original file line number Diff line number Diff line change
@@ -1,12 +1,6 @@
/* str*()-like functions for raw memory "lines" (non-NUL terminated strings).
/* str*()-like functions for raw char arrays (non-NUL-terminated strings).
*
* Especially when we're parsing input in an ESL_BUFFER, we need to be
* able to parse non-writable raw memory buffers. Because an
* ESL_BUFFER may have memory mapped a file, we cannot assume that the
* input is NUL-terminated, and we can't even assume it's writable (so
* we can't NUL-terminate it ourselves). These functions provide
* a set of utilities for parsing raw memory (in terms of a char * pointer
* and a length in bytes; <char *p>, <esl_pos_t n> by convention).
* esl_mem.md has additional notes.
*
* Contents:
* 1. The esl_mem*() API.
Expand Down Expand Up @@ -110,6 +104,9 @@
*
* esl_mem_strtoi32(), _strtoi64(), and _strtoi() repeat
* near-identical code. If you change one, change them all.
*
* Stripped-down version of same code is used in
* esl_json_ReadInt().
*/
int
esl_mem_strtoi32(char *p, esl_pos_t n, int base, int *opt_nc, int32_t *opt_val)
Expand Down Expand Up @@ -292,14 +289,14 @@ esl_mem_strtoi(char *p, esl_pos_t n, int base, int *opt_nc, int *opt_val)
* Synopsis: Convert a chunk of memory to a float.
* Incept: SRE, Fri Jun 3 10:52:07 2016 [Hamilton]
*
* Purpose: Convert the text starting at <p> to a float, converting no
* Purpose: Convert a prefix of the char array starting at <p> to a float, converting no
* more than <n> characters, i.e. the valid length of
* non-NUL terminated memory buffer <p>.
*
* The floating point representation is parsed as:
* The decimal string representation is parsed as:
* - leading whitespace is skipped
* - an optional sign for the mantissa, +/-
* - a mantissa: (at least one digit must be present)
* - an optional sign for the significand, +/-
* - a significand: (at least one digit must be present)
* - an optional string of digits
* - an optional '.'
* - an optional string of digits
Expand All @@ -310,16 +307,34 @@ esl_mem_strtoi(char *p, esl_pos_t n, int base, int *opt_nc, int *opt_val)
* Or, after whitespace and the optional sign, one of the
* following special strings (case-insensitive):
* "inf", "infinity", "nan"
*
* Parsing stops at the first character that isn't part of
* a decimal string representation. For example, given "
* 42.0xx", 5 characters are parsed (including the skipped
* leading whitespace, for a result of 42.0. A missing
* exponent, as in " 42.0e-" isn't an <eslEFORMAT> error,
* because it's also parsed as 5 chars to 42.0.
*
* The converted value is optionally returned in
* <*opt_val>, and the number of characters parsed (up to
* <n>) is optionally returned in <*opt_n>. The caller can
* reposition a parser to <p + *opt_nc> to exactly skip a
* parsed number.
*
* Only decimal representations are recognized. Compare to
* POSIX strtof(), which also allows hexadecimal
* representation (when the mantissa leads with 0x or 0X).
* <strtof()>, which also allows hexadecimal representation
* (when the significand leads with 0x or 0X).
*
* If the representation overflows (e.g. "1e999") the
* result is +/-infinity. If it underflows (e.g. "1e-999")
* the result is 0. These conversions still return
* <eslOK>.
*
* This function incurs a small roundoff error, typically
* around +/-1 ulp. See notes in esl_mem.md for details.
* When an accuracy guarantee is more important than a
* ~three-fold speed difference, use <esl_memtof()>
* instead.
*
* Args: p - pointer to text buffer to convert
* n - max number of chars to convert in <p>: p[0..n-1] are valid
Expand All @@ -328,13 +343,8 @@ esl_mem_strtoi(char *p, esl_pos_t n, int base, int *opt_nc, int *opt_val)
*
* Returns: <eslOK> on success.
*
* <eslEFORMAT> if no mantissa digits are found.
* <eslEFORMAT> if no significand digits are found.
* Now <*opt_val> is set to 0 and <*opt_nc> is set to 0.
*
* Note: Think this is stupid? Yeah, I agree. But see note on
* <esl_mem_strtoi32()> for why this seems necessary, and
* why POSIX strtod()/strtol() don't suffice, especially
* if we're going to parse mmap'ed() read-only data.
*/
int
esl_mem_strtof(char *p, esl_pos_t n, int *opt_nc, float *opt_val)
Expand Down Expand Up @@ -391,7 +401,7 @@ esl_mem_strtof(char *p, esl_pos_t n, int *opt_nc, float *opt_val)

while (i < n && isdigit(p[i]))
{
exp += 10.*exp + (p[i]-'0');
exp = 10.*exp + (p[i]-'0');
i++;
e++;
}
Expand Down Expand Up @@ -934,6 +944,8 @@ main(int argc, char **argv)
*****************************************************************/
#ifdef eslMEM_TESTDRIVE

#include "esl_random.h"

static void
utest_mem_strtoi32(void)
{
Expand Down Expand Up @@ -1075,11 +1087,59 @@ utest_mem_strtof(void)
if (( status = esl_mem_strtof("iNfXYZ", 6, &nc, &val) ) != eslOK || nc != 3 || !isinf(val)) esl_fatal(msg);
if (( status = esl_mem_strtof("nAnXYZ", 6, &nc, &val) ) != eslOK || nc != 3 || !isnan(val)) esl_fatal(msg);

/* terrifying edge cases that expose some flaws in our implementation, even aside from roundoff error */
if (( status = esl_mem_strtof("9999999999999999999999999999999999999999e-10", 44, &nc, &val) ) != eslOK || nc != 44 || val != eslINFINITY) esl_fatal(msg); // a real strtof() gets this right: it's 1e30
if (( status = esl_mem_strtof("-9999999999999999999999999999999999999999e-10", 45, &nc, &val) ) != eslOK || nc != 45 || val != -eslINFINITY) esl_fatal(msg); // ... and this should be -1e30
if (( status = esl_mem_strtof("0.0000000000000000000000000000000000000000000001e10", 51, &nc, &val) ) != eslOK || nc != 51 || val != 0.0) esl_fatal(msg); // ... 1-e36
if (( status = esl_mem_strtof("-0.0000000000000000000000000000000000000000000001e10", 52, &nc, &val) ) != eslOK || nc != 52 || val != 0.0) esl_fatal(msg); // ... -1-e36

/* Bad formats correctly detected: */
if (( status = esl_mem_strtof("XYZXYZ", 6, &nc, &val) ) != eslEFORMAT || nc != 0 || val != 0.0) esl_fatal(msg);
if (( status = esl_mem_strtof("intinity", 8, &nc, &val) ) != eslEFORMAT || nc != 0 || val != 0.0) esl_fatal(msg);
}


/* utest_mem_strtof_error()
*
* Compares <esl_mem_strtof()> to <strtof()>. Assumes that <strtof()>
* is accurate: that it finds the nearest floating point
* representation for the input string (within +/- 1 ulp and correctly
* rounded). Makes sure that error in <esl_mem_strtof()> is tolerable:
* that its relative error is no more than 4 ulps away from
* <strtof()>, corresponding to a maximum relative error of ~5e-7.
*
* There is an error distribution, and this test can stochastically
* fail normally. We force a fixed RNG seed unless caller toggles an
* <allow_badluck> option to <TRUE>.
*
* See esl_mem.md for notes on rationale for rolling our own strtof(),
* and its tradeoffs.
*/
static void
utest_mem_strtof_error(ESL_RANDOMNESS *rng, int allow_badluck)
{
char msg[] = "mem_strtof() error unit test failed";
char s[32]; // random generated string representation of a float. Max len of slen+'.'+flen+'e'+"-xx" = 18.
typedef union { float f; uint32_t rep; } flunion; // union gives us access to IEEE754 bitwise representation of a float
flunion v1; // value deduced by strtof(), which we assume is accurate
flunion v2; // value deduced by esl_mem_strtof(), which is fast and not too inaccurate
int ulperr; // error in integer representation of significand
int trial;

if (! allow_badluck) esl_randomness_Init(rng, 42);

for (trial = 0; trial < 100000; trial++) // check 100,000 numbers
{
esl_rnd_floatstring(rng, s);

v1.f = strtof(s, NULL);
esl_mem_strtof(s, strlen(s), NULL, &(v2.f));
ulperr = (v1.rep & 0x7fffff) - (v2.rep & 0x7fffff);

if (ulperr > 4) esl_fatal(msg); // 4 ulp = up to 4 \epsilon ~ 5e-7 rel error.
}
}



static void
Expand Down Expand Up @@ -1220,7 +1280,9 @@ utest_memstrcontains(void)

static ESL_OPTIONS options[] = {
/* name type default env range togs reqs incomp help docgrp */
{"-h", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "show help and usage", 0},
{ "-h", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "show help and usage", 0},
{ "-s", eslARG_INT, "0", NULL, NULL, NULL, NULL, NULL, "set random number seed to <n>", 0 },
{ "-x", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "allow bad luck (stochastic failures)", 0 },
{ 0,0,0,0,0,0,0,0,0,0},
};
static char usage[] = "[-options]";
Expand All @@ -1229,7 +1291,12 @@ static char banner[] = "test driver for mem module";
int
main(int argc, char **argv)
{
ESL_GETOPTS *go = esl_getopts_CreateDefaultApp(options, 0, argc, argv, banner, usage);
ESL_GETOPTS *go = esl_getopts_CreateDefaultApp(options, 0, argc, argv, banner, usage);
ESL_RANDOMNESS *rng = esl_randomness_Create(esl_opt_GetInteger(go, "-s"));
int allow_badluck = esl_opt_GetBoolean(go, "-x"); // if a utest can fail just by chance, let it, instead of suppressing

fprintf(stderr, "## %s\n", argv[0]);
fprintf(stderr, "# rng seed = %" PRIu32 "\n", esl_randomness_GetSeed(rng));

utest_mem_strtoi32();
utest_mem_strtoi64();
Expand All @@ -1241,6 +1308,12 @@ main(int argc, char **argv)
utest_memstrcmp_memstrpfx();
utest_memstrcontains();

/* tests that can fail stochastically go last, because they reset RNG seed by default */
utest_mem_strtof_error(rng, allow_badluck);

fprintf(stderr, "# status = ok\n");

esl_randomness_Destroy(rng);
esl_getopts_Destroy(go);
return 0;
}
Expand Down
3 changes: 1 addition & 2 deletions esl_mem.h
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
/* str*()-like functions for raw memory "lines" (non-NUL terminated strings)
/* str*()-like functions for raw char arrays (non-NUL-terminated strings)
*/
#ifndef eslMEM_INCLUDED
#define eslMEM_INCLUDED
Expand All @@ -8,7 +8,6 @@
extern int esl_mem_strtoi32(char *p, esl_pos_t n, int base, int *opt_nc, int32_t *opt_val);
extern int esl_mem_strtoi64(char *p, esl_pos_t n, int base, int *opt_nc, int64_t *opt_val);
extern int esl_mem_strtoi (char *p, esl_pos_t n, int base, int *opt_nc, int *opt_val);

extern int esl_mem_strtof (char *p, esl_pos_t n, int *opt_nc, float *opt_val);
extern int esl_memnewline(const char *p, esl_pos_t n, esl_pos_t *ret_nline, int *ret_nterm);
extern int esl_memtok(char **p, esl_pos_t *n, const char *delim, char **ret_tok, esl_pos_t *ret_toklen);
Expand Down
63 changes: 63 additions & 0 deletions esl_mem.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
## esl_mem : str*() like functions for char arrays


Many useful C library functions for parsing char data assume that the
input is a NUL-terminated string. Easel also needs to deal with
non-terminated char arrays. Working with nonterminated char arrays is
especially important in data input with `ESL_BUFFER`, where an input
might be a memory-mapped file on disk. We want to avoid making copies
of data just to add `\0` string terminators. `esl_mem` provides a set
of string function substitutes that take a pointer and a length in
bytes `(char *s, esl_pos_t n)` as input.


### esl_mem_strtof() versus esl_memtof()

It is shockingly difficult to produce a correct implementation of
`strtod()` or related C library functions that convert a string
decimal representation to a floating-point number, such as `strtof()`
and `atof()`. Correct conversion includes a guarantee that the
resulting floating point representation will be within one ulp of the
decimal string representation, correctly rounded. One canonical
implementation by David Gay [1] is over 5500 lines of C code.
Unfortunately the C library provides no alternative for nonterminated
char arrays.

When the input is a non-terminated char array, Easel provides two
choices:

* `esl_mem_strtof()` is a fast and compact reimplementation of
`strtof()` that works on char arrays, but sacrifices some accuracy.

* `esl_memtof()` is slower but accurate. It copies the data to a
NUL-terminated string buffer and passes it to `strtof()` for
correct conversion.

In benchmarking during development of the JSON-based HMMER4 profile
file parser, I measured the speed difference between the two routines
at about three-fold. In HMMER, I was more concerned with speed than
guaranteeing absolute accuracy of the conversion, so HMMER4 uses
`esl_json_ReadFloat()`, which mirrors the `esl_mem_strtof()`
implementation.

The accuracy loss in `esl_mem_strtof()` is caused by a small roundoff
accumulation error that seems difficult to avoid (and hence the
difference between the Gay implementation [1] and mine). Still, it is
almost always within +/-1 ulp of the correct `strtof()` result. The
`utest_mem_strtof_error()` unit test verifies that in 100K different
string representation conversions, no `esl_mem_strtof()` conversion
deviates by more than +/-4 ulp (a relative error of about 5e-7) from
`strtof()`. Applications that demand smaller errors than this need to
use `esl_memtof()`.

`esl_mem_strtof()` is also unable to deal with a pathological case
where the significand by itself would over/underflow, but when
combined with the exponent, the result is within the valid range of a
float. For example, it parses
9999999999999999999999999999999999999999e-10 as +infinity, not as
1e30. Full `strtof()` implementations get this right, as does
`esl_memtof()`.

[1] David M. Gay (1990)
["Correctly rounded binary-decimal and decimal-binary conversions"](https://www.ampl.com/REFS/rounding.pdf),
AT&T manuscript 90-10. Implementation: [NETLIB dtoa.c code](https://www.ampl.com/netlib/fp/dtoa.c).
4 changes: 2 additions & 2 deletions esl_minimizer.c
Original file line number Diff line number Diff line change
Expand Up @@ -842,8 +842,8 @@ utest_simplefunc(ESL_RANDOMNESS *rng)
esl_min_ConjugateGradientDescent(NULL, x, n, &test_func, &test_dfunc, (void *) prm, &fx, NULL);

for (i = 0; i < n; i++)
if ( esl_DCompareNew( b[i], x[i], 1e-8, 1e-15) != eslOK) esl_fatal(msg);
if (esl_DCompareNew(0., fx, 0., 1e-8) != eslOK) esl_fatal(msg);
if ( esl_DCompareNew( b[i], x[i], 1e-5, 1e-15) != eslOK) esl_fatal(msg);
if (esl_DCompareNew(0., fx, 0., 1e-5) != eslOK) esl_fatal(msg);

free(prm);
free(x);
Expand Down
Loading

0 comments on commit 177dc37

Please sign in to comment.