Improves error reporting in esl_json parser; fixes critical bug in esl_mem_strtof().

Systematically checked and improved all eslFORMAT errors reportable by the JSON parser. [H5/134-135]

Fixes a critical bug in esl_mem_strtof(), which would fail on most exponents > 10 (e.g. "1.0e11"). [H5/134]

esl_json_ReadInt() changed to use code derived from esl_mem_strtoi(), rather than calling it. Validated JSON number tokens can be handled without as much error checking. Added documentation and a unit test for it, utest_read_int().

esl_json_ReadFloat(), similarly, already used code derived from esl_mem_strtof()... but ... discovered that conversion of strings to floats is harder than it seemed. Added notes in esl_mem.md to record why. Nonetheless decided to stick with hand-rolled code, because the speed difference (3x) is more important than strict accuracy (HMMER can live with +/-1 ulp typical error), compared to the alternative of copying an ESL_BUFFER char array to a NUL-terminated string and calling strtof() on it. Added utest_read_float() and utest_read_float_err() unit tests. The utest_read_float() test includes examples of a couple of super pathological edge cases that esl_json_ReadFloat() doesn't get quite right.

Adds utest_mem_strtof_error() unit tests for esl_mem_strtof(), similar to the esl_json test. This is what caught the critical bug, which had escaped the existing esl_mem unit test, which I'd erroneously thought was pretty damned thorough.

Adds esl_rnd_floatstring(), which generates a random digital string representation of a float, for testing. Used by both the esl_json and esl_mem unit tests above.
1 parent ed665c7, commit 177dc37
Showing 7 changed files with 546 additions and 102 deletions.
@@ -0,0 +1,63 @@
## esl_mem : str*() like functions for char arrays

Many useful C library functions for parsing char data assume that the
input is a NUL-terminated string. Easel also needs to deal with
non-terminated char arrays. Working with non-terminated char arrays is
especially important in data input with `ESL_BUFFER`, where an input
might be a memory-mapped file on disk. We want to avoid making copies
of data just to add `\0` string terminators. `esl_mem` provides a set
of string function substitutes that take a pointer and a length in
bytes `(char *s, esl_pos_t n)` as input.

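As a minimal sketch of the pointer-plus-length calling convention, the
function below parses leading decimal digits without ever reading past
the end of the array. It is not part of Easel: the name
`mem_parse_uint`, the use of `size_t` in place of `esl_pos_t`, and the
return convention are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an Easel function.
 * Parses leading decimal digits in p[0..n-1]; never reads p[n].
 * Stores the value in *ret_val and returns the number of bytes
 * consumed (0 if p does not start with a digit).
 * Overflow checking is omitted to keep the sketch short.
 */
size_t
mem_parse_uint(const char *p, size_t n, unsigned long *ret_val)
{
  unsigned long val = 0;
  size_t        i   = 0;

  assert(p != NULL && ret_val != NULL);
  while (i < n && p[i] >= '0' && p[i] <= '9')
    {
      val = val * 10 + (unsigned long) (p[i] - '0');
      i++;
    }
  *ret_val = val;
  return i;
}
```

Because the loop is bounded by `n` rather than by a `\0` sentinel, the
same function can be pointed at any window of a larger buffer, such as
a field inside a memory-mapped file.
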
### esl_mem_strtof() versus esl_memtof()

It is shockingly difficult to produce a correct implementation of
`strtod()` or the related C library functions that convert a decimal
string representation to a floating-point number, such as `strtof()`
and `atof()`. Correct conversion means returning the correctly rounded
result: the representable floating-point value nearest to the decimal
string, which is within half an ulp (unit in the last place) of it.
One canonical implementation by David Gay [1] is over 5500 lines of C
code. Unfortunately, the C library provides no equivalent for
non-terminated char arrays.

When the input is a non-terminated char array, Easel provides two
choices:

* `esl_mem_strtof()` is a fast and compact reimplementation of
  `strtof()` that works on char arrays, but sacrifices some accuracy.

* `esl_memtof()` is slower but accurate. It copies the data to a
  NUL-terminated string buffer and passes it to `strtof()` for
  correct conversion.

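The copy-then-convert strategy described for `esl_memtof()` can be
sketched roughly as follows. This is an illustration of the idea, not
Easel's code: the function name `copy_then_strtof`, its `size_t`
length argument, the 64-byte stack buffer, and the 0/-1 return
convention are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not an Easel function.
 * Copies p[0..n-1] into a NUL-terminated buffer and lets the C
 * library's strtof() do the conversion.
 * Returns 0 on success, -1 if nothing could be converted or if
 * memory allocation fails. Trailing non-numeric bytes are ignored.
 */
int
copy_then_strtof(const char *p, size_t n, float *ret_val)
{
  char  stackbuf[64];        /* avoids malloc() for typical short tokens */
  char *buf    = stackbuf;
  char *end    = NULL;
  int   status = 0;

  assert(p != NULL && ret_val != NULL);
  if (n >= sizeof(stackbuf))
    {
      buf = malloc(n + 1);
      if (buf == NULL) return -1;
    }
  memcpy(buf, p, n);
  buf[n] = '\0';

  *ret_val = strtof(buf, &end);
  if (end == buf) status = -1;   /* strtof() consumed no characters */

  if (buf != stackbuf) free(buf);
  return status;
}
```

The copy and the general-purpose `strtof()` call are what make this
route slower than a parser that walks the original array in place.
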
In benchmarking during development of the JSON-based HMMER4 profile
file parser, I measured `esl_mem_strtof()` to be about three-fold
faster than `esl_memtof()`. In HMMER, I was more concerned with speed
than with guaranteeing absolute accuracy of the conversion, so HMMER4
uses `esl_json_ReadFloat()`, which mirrors the `esl_mem_strtof()`
implementation.

The accuracy loss in `esl_mem_strtof()` is caused by a small
accumulation of roundoff error that seems difficult to avoid (and
hence the difference between the Gay implementation [1] and mine).
Still, the result is almost always within +/-1 ulp of the correct
`strtof()` result. The `utest_mem_strtof_error()` unit test verifies
that across 100K different string conversions, no `esl_mem_strtof()`
result deviates by more than +/-4 ulp (a relative error of about 5e-7)
from `strtof()`. Applications that need smaller errors than this
should use `esl_memtof()`.

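To make "deviates by N ulp" concrete, here is one common way to count
the ulp distance between two floats, by comparing their bit patterns.
It is a sketch under stated assumptions, not the code used in
`utest_mem_strtof_error()`: the names `float_to_ordered` and
`ulp_distance` are hypothetical, and the code assumes a 32-bit IEEE
754 `float` and non-NaN inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a float's bit pattern to a signed integer
 * that increases monotonically with the float's value. Negative
 * floats are sign-magnitude, so their magnitude bits are negated;
 * +0.0 and -0.0 both map to 0.
 */
static int64_t
float_to_ordered(float x)
{
  uint32_t bits;

  assert(sizeof(float) == sizeof(uint32_t));
  memcpy(&bits, &x, sizeof(bits));          /* well-defined type pun */
  if (bits & 0x80000000u) return -(int64_t) (bits & 0x7fffffffu);
  return (int64_t) bits;
}

/* Hypothetical helper: number of representable floats between a and
 * b, i.e. their distance in units in the last place. NaN inputs are
 * not supported.
 */
int64_t
ulp_distance(float a, float b)
{
  int64_t d;

  assert(a == a && b == b);                 /* reject NaN */
  d = float_to_ordered(a) - float_to_ordered(b);
  return (d < 0) ? -d : d;
}
```

A test in this style would convert the same string with the fast
routine and with `strtof()`, then check that `ulp_distance()` of the
two results stays under the chosen bound.
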
`esl_mem_strtof()` also cannot handle a pathological case in which the
significand by itself would overflow or underflow, but the value
combined with the exponent is within the valid range of a float. For
example, it parses
`9999999999999999999999999999999999999999e-10` as +infinity, not as
1e30. Full `strtof()` implementations get this right, as does
`esl_memtof()`.

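One way this kind of failure can arise is sketched below. It assumes
a conversion that accumulates every digit into a `float` significand
before applying the decimal exponent; that is an assumption for
illustration, and the function `naive_digits_times_pow10` is
hypothetical rather than a copy of `esl_mem_strtof()`.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>   /* strtof(), used as the reference conversion */

/* Hypothetical naive conversion, not Easel's code.
 * Step 1: accumulate all n digits into a float significand.
 * Step 2: apply the decimal exponent by repeated scaling.
 * If step 1 overflows to +infinity, step 2 cannot bring it back.
 * `volatile` forces each intermediate to be rounded to float.
 */
float
naive_digits_times_pow10(const char *digits, size_t n, int exp10)
{
  volatile float sig = 0.0f;
  size_t         i;

  for (i = 0; i < n; i++)
    {
      assert(digits[i] >= '0' && digits[i] <= '9');
      sig = sig * 10.0f + (float) (digits[i] - '0');
    }
  for (; exp10 > 0; exp10--) sig = sig * 10.0f;
  for (; exp10 < 0; exp10++) sig = sig / 10.0f;
  return sig;
}

/* Returns 1 if x is +infinity, i.e. larger than any finite float. */
int
is_pos_inf(float x)
{
  return x > FLT_MAX;
}
```

With forty `9` digits and an exponent of -10, the significand passes
`FLT_MAX` (about 3.4e38) during step 1, so the function returns
+infinity, whereas `strtof()` on the full string returns a finite
value near 1e30.
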
[1] David M. Gay (1990)
["Correctly rounded binary-decimal and decimal-binary conversions"](https://www.ampl.com/REFS/rounding.pdf),
AT&T manuscript 90-10. Implementation: [NETLIB dtoa.c code](https://www.ampl.com/netlib/fp/dtoa.c).