Html-Cref: Fast HTML Character References Decoder
stvar/html-cref
Html-Cref
~~~~~~~~~
Stefan Vargyas, stvar@yahoo.com
Apr 11, 2019


Table of Contents
-----------------

0. Copyright
1. The Html-Cref Program
2. Building and Testing Html-Cref
3. Running Timings of Html-Cref's Parsers
4. Appendix: Using Shell Function 'html-cref-test'
5. Appendix: The Parsers Generated by RE2C
6. Appendix: Links to Json-Type
7. References


0. Copyright
============

This program is GPL-licensed free software. Its author is Stefan Vargyas.
You can redistribute it and/or modify it under the terms of the GNU General
Public License as published by the Free Software Foundation, either version
3 of the License, or (at your option) any later version.

You should have received a copy of the GNU General Public License along with
this program (look for the file COPYING in the top directory of the source
tree). If not, see http://gnu.org/licenses/gpl.html.

The source tree of Html-Cref includes source files from another free software
project: Json-Type [3]. These source files were placed in a separate
directory, 'lib/json-type'. Each such Json-Type source file contains,
unaltered, the original copyright and license notices.


1. The Html-Cref Program
========================

Html-Cref is a fast decoder of HTML character references [1a]. It implements
several fast parsers of HTML character references, generated with two
meta-tools: Trie-Gen [5] and RE2C [4].

The parsing and decoding of HTML character references follow the HTML
standard specification carefully. Html-Cref handles properly the named
references that, for historical reasons, are allowed to not be terminated
with a semicolon [1b]. It also handles properly the numeric character
references that are permitted to be overridden as per the specification
[1c, 1d].

Html-Cref produces output encoded in UTF-8 that conforms to the Unicode
Standard v12.0.0 [2], handling correctly both BMP and non-BMP (the so-called
Astral planes) code points.


2. Building and Testing Html-Cref
=================================

Html-Cref is written in modern C and was developed under a GNU/Linux
environment using the GCC C compiler v4.3.4 and v7.2.0. The latter version
is the newest GCC with which Html-Cref was tested: given the option
`-std=gnu11', it builds Html-Cref cleanly.

Note that the two meta-tools mentioned in section 1 (Trie-Gen and RE2C) are
not needed for building Html-Cref. All the C code that these programs
generated is included within the source tree in the files:

  $ ls -1 src/html-cref-!(overrides|table)-impl.h
  src/html-cref-bre2c-impl.h
  src/html-cref-etrie-impl.h
  src/html-cref-ietrie-impl.h
  src/html-cref-itrie-impl.h
  src/html-cref-iwtrie-impl.h
  src/html-cref-re2c-impl.h
  src/html-cref-trie-impl.h
  src/html-cref-wtrie-impl.h

Html-Cref can be built in two distinct ways, determined by specific
arguments passed to the 'make' program.

The first way is to have each HTML named character reference parser compiled
separately as a dynamic library (a shared object) and to have the main
program, 'html-cref', load an actual parser dynamically at run time (the
parser is specified through 'html-cref's command line options
`-p|--cref-parser=$NAME'):

  $ make [OPT=$OPT] [TIMINGS=no|yes] [CYCLES=no|yes]

The alternative is to have 'html-cref' be a standalone program that does not
depend on external parser libraries. The main program is then built so that
it includes only the one specified parser:

  $ make [OPT=$OPT] BUILTIN=$NAME

The above `$NAME' can be one of the following: ietrie, iwtrie, itrie, etrie,
wtrie, trie, bre2c or re2c.

The argument of form `OPT=$OPT' asks GCC to optimize the binaries it produces
according to the optimization level option `-O$OPT'.

When the argument 'TIMINGS=yes' is passed to 'make', the binaries built
include code that is able to do timing measurements of the HTML named
character reference parsers.
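The first build mode described above, in which 'html-cref' loads a parser
library at run time, relies on the standard dynamic loading interface. The
following is only a minimal sketch of that technique, not Html-Cref's actual
code: the exported symbol name 'html_cref_parse', the structure
'parser_module' and both function names are assumptions made for
illustration. Only the library naming pattern 'html-cref-NAME.so' comes from
this document.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical name of the function each parser library would export;
   the real symbol name is not given in this README. */
#define PARSER_ENTRY_SYMBOL "html_cref_parse"

struct parser_module {
    void* handle;  /* handle returned by dlopen() */
    void* entry;   /* address of the parser entry point */
};

/* Load './html-cref-NAME.so' and look up its entry point.
   Returns 0 on success and -1 on failure. */
int parser_module_load(struct parser_module* mod, const char* name)
{
    char path[256];
    const char* err;
    int n = snprintf(path, sizeof path, "./html-cref-%s.so", name);

    mod->handle = NULL;
    mod->entry = NULL;
    if (n < 0 || (size_t) n >= sizeof path)
        return -1;  /* parser name too long */

    mod->handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (mod->handle == NULL) {
        err = dlerror();
        fprintf(stderr, "cannot load parser: %s\n",
            err != NULL ? err : "unknown error");
        return -1;
    }

    mod->entry = dlsym(mod->handle, PARSER_ENTRY_SYMBOL);
    if (mod->entry == NULL) {
        err = dlerror();
        fprintf(stderr, "cannot find entry point: %s\n",
            err != NULL ? err : "unknown error");
        dlclose(mod->handle);
        mod->handle = NULL;
        return -1;
    }
    return 0;
}

/* Release the library; safe to call after a failed load. */
void parser_module_unload(struct parser_module* mod)
{
    if (mod->handle != NULL)
        dlclose(mod->handle);
    mod->handle = NULL;
    mod->entry = NULL;
}
```

A caller would pass one of the parser names listed above, for instance
'parser_module_load(&mod, "etrie")', and cast 'mod.entry' to the function
type the library really exports. On older C libraries the program has to be
linked with `-ldl'.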
The argument 'CYCLES=yes' is meaningful only when given along with
'TIMINGS=yes'. In that case, the binaries built also include code that is
able to measure the CPU instruction cycles spent by the parsers of Html-Cref.
Note that this code is valid only on Intel/AMD platforms that support the two
time stamp counter instructions RDTSC and RDTSCP.

The package comes with a comprehensive test suite contained in the 'test'
directory. The shell script 'test/test.sh' starts the test suite, and it can
be invoked simply by issuing:

  $ make test

The main test script 'test/test.sh' invokes one of the two top shell scripts
'test/test-{builtin,modules}.sh'; each of these in turn calls the inner
scripts placed in the sub-directories of 'test'. All these are 'bash' scripts
which depend upon a handful of power-tools commonly found on every GNU/Linux
installation. The 'test' directory also contains the source files from which
the test scripts were generated: the files 'test/test-{builtin,modules}.txt'.

Upon running 'test/test.sh', the output obtained from the shell script is a
short series of lines of the kind:

  test: GROUP:CASE RESULT

GROUP is the name of the group of tests located in the directory
'test/GROUP', CASE is the name of the test case and RESULT is either 'OK' or
'failed'.

All test cases are expected to succeed. If a particular test case fails, more
verbose output is obtainable by running the corresponding
'test/GROUP/test-CASE.sh' script on its own. It produces a diff between the
expected and the actual output of 'html-cref'.

Note that any explicit invocation of these 'bash' test scripts by the user
must be initiated from within the 'test' directory.

A common scenario is to build Html-Cref in three steps.
First, build the program along with the external parser modules, enabling
the timings code within each of these modules:

  $ make OPT=3 TIMINGS=yes

Second, run the timings measurements described in the next section to
determine the parser module to be built into the standalone 'html-cref'
program. Third, build 'html-cref' with the chosen parser module included:

  $ make allclean

  $ make OPT=3 BUILTIN=...


3. Running Timings of Html-Cref's Parsers
=========================================

This section shows how to run timings measurements of 'html-cref's HTML named
character references parsers.

Note that the timings below are obtained by running Html-Cref's parser
libraries on a test file that contains all the 2125 valid named character
references repeated 128 times, for a total of 272000 references arranged in
random order.

  $ cd test

  $ wc -l < html-cref-names.txt
  2125

  $ test-html-crefs() {
      local k
      for ((k=0; k<128; k++)); do
          cat html-cref-names.txt
      done|
      shuf|
      sed 's/^/\&/;s/$/;/'|
      tr -d '\n'
  }

  $ test-html-crefs > test-html-crefs.txt

The first thing to do is to build optimized binaries with timings code
enabled:

  $ cd src

  $ make OPT=3 TIMINGS=yes -B

Upon building Html-Cref with GCC 7.2.0 on a GNU/Linux machine, the sizes of
the binaries obtained are as shown:

  $ strip *.so html-cref

  $ du -sbh *.so html-cref|sort -k1hr,1
  154K  html-cref-re2c.so
  138K  html-cref-etrie.so
  138K  html-cref-ietrie.so
  138K  html-cref-iwtrie.so
  138K  html-cref-wtrie.so
  134K  html-cref-bre2c.so
  130K  html-cref-itrie.so
  130K  html-cref-trie.so
  51K   html-cref

Then run the shell function 'html-cref-test', sourced in the current
environment from the 'bash' shell script 'src/html-cref.sh'. (See the
appendix section below for the details of how 'html-cref-test' functions.)

The shell function will be invoked on the test file
'test/test-html-crefs.txt', asking it to produce the timings output both on
stdout and in a text file.
The name of the output file is based on the name of the shell function's
input file. In this case, the name of the output file is
'test/test-html-crefs.output'.

  $ . html-cref.sh

  $ html-cref-test -i ../test/test-html-crefs.txt -o+ -T+ --thread
  ...

Upon running the shell function 'html-cref-test' as above on a GNU/Linux
64-bit Intel Core I5-3210M (Ivy Bridge) CPU machine, the timings data,
expressed as percentage values relative to those of the parser library
'html-cref-etrie.so', look like:

  $ html-cref-test -i ../test/test-html-crefs.output -P+
  etrie:   100  -  -   0.00
  wtrie:   100  -  -   0.08
  ietrie:  100  -  -   1.97
  iwtrie:  100  -  -   2.06
  bre2c:   100  -  -  10.41
  re2c:    100  -  -  12.98
  trie:    100  -  -  16.35
  itrie:   100  -  -  17.03


4. Appendix: Using Shell Function 'html-cref-test'
==================================================

The 'bash' shell function 'html-cref-test' was conceived as a tool for
obtaining aggregated timings information from running a series of
'html-cref' program instances. The program 'html-cref' produces timings data
that amounts to the time its HTML named character reference parsers spend
processing a given input text.

To use 'html-cref-test', one has to source the shell script
'src/html-cref.sh' in the current 'bash' environment:

  $ cd src

  $ . html-cref.sh

The command line options of 'html-cref-test' are as shown below:

  $ funchelp -f html-cref.sh -c html-cref-test --long-wrap-join=auto
  actions:
    -N|--names                  print out the names of known HTML char ref
                                  parsers
    -T|--test-set[=NAMES]       test named HTML char ref parsers; NAMES is a
                                  comma-separated list of HTML char ref
                                  parser names (default: '+', i.e. all)
    -P|--percents[=NAME]        process percents relative to the specified
                                  HTML char ref parser (default: '+', i.e.
                                  'etrie')

  options:
    -f|--overwrite              force overwriting the output timings file if
                                  that already exists when action is `-T|
                                  --test-set'
    -g|--group                  group by names and sum up timings of input
                                  table when action is `-P|--percents'
    -i|--input=FILE             input test file when action is `-T|--test-set'
                                  or input timings file when action is `-P|
                                  --percents'
    -o|--output=FILE            output timings file when action is `-T|
                                  --test-set'; '-' means to not generate such
                                  file at all (default); '+[SUFFIX]' stands
                                  for computing a name based on the input
                                  test file name: replace FILE's shortest `.'
                                  suffix with `.output[.SUFFIX]'; note that
                                  regardless of the argument these options
                                  have, the timings table is printed out on
                                  stdout
    -r|--repeat=NUM             number of times to repeat the 'html-cref'
                                  command (default: 100)
    -m:|--timings[=NUM,NUM,NUM] pass `-m|--timings[=NUM,NUM,NUM]' or, by case,
        --real[-timings][=NUM]    `--{real,process,thread}-timings[=NUM]' to
        --process[-timings][=NUM] 'html-cref'; the default NUM is '+', i.e.
        --thread[-timings][=NUM]  query that number from 'clocks'
    -c|--[clock-]cycles[=NUM]   pass `-c|--clock-cycles[=NUM]' to 'html-cref'
                                  (default do not); the default NUM is '+',
                                  i.e. query that number from 'clocks'
    -s:|--sort=NAME             sort or not the output table by the named
        --no-sort                 timings column when action is `-P|
                                  --percents'; for sorting, NAME can be
                                  either 'real', 'process', 'thread' or
                                  'cycles'; for not sorting the table at all,
                                  NAME must be '-' (default is sorting by
                                  '+', i.e. by 'thread')
    -w|--width=NUM              width of timings columns' integral part when
                                  action is `-T|--test-set' (default: 9)

As seen above, 'html-cref-test' has three modes of operation, each
corresponding to one of its action options. The action option determines the
action taken by the shell function and the output obtained from it.

In the case of action option `-N|--names', there is nothing to add to the
description text above.
The action option `-T|--test-set' expects to find a binary 'html-cref' along
with all the shared libs of form 'html-cref-NAME.so', where 'NAME' is one of
the names printed out by action option `-N|--names'. Each of these must be
built with the timings code enabled and located in the current directory from
which the shell function 'html-cref-test' is invoked. The shell function
'html-cref-test' also expects to find the binary 'clocks'. This latter binary
results from a normal build of Html-Cref; it is used for producing the clock
timings overhead estimates that 'html-cref' uses for adjusting the timings
measurements done.

When an 'html-cref-test' command line contains one of the options
`-m|--timings' or `--{real,process,thread}-timings' and none of the options
`-c|--clock-cycles' after the rightmost such timings option, then the result
obtained from action option `-T|--test-set' is a table whose rows are of
form:

  NAME: COUNT REAL PROCESS THREAD

where 'NAME' is one of the names given as argument to the action option
itself; 'COUNT' is the number of times the command 'html-cref' was issued
using the parser library named by 'NAME'; 'COUNT' is controlled by
'html-cref-test's option `-r|--repeat' and has the default value 100.

The last three columns of the table obtained from `-T|--test-set' are mean
values of the total of real, process and, respectively, thread nanoseconds
spent by the parser library named by 'NAME' inside the HTML named character
references parser function that it implements. The real, process and thread
nanoseconds values are computed using the library function
`clock_gettime(3)', which is called on the clock ids
'CLOCK_{REALTIME,PROCESS_CPUTIME_ID,THREAD_CPUTIME_ID}' respectively.
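The way the three nanoseconds values described above can be obtained from
`clock_gettime(3)' is sketched below. This is an illustration under stated
assumptions and not Html-Cref's own timing code: the names 'timings',
'clock_now_ns', 'measure_timings' and 'busy_work' are hypothetical, and
'busy_work' merely stands in for a parser function. The sketch also makes no
adjustment for clock call overhead, which 'html-cref' does with the estimates
produced by 'clocks'.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical record: nanoseconds elapsed on each of the three clocks. */
struct timings {
    int64_t real;
    int64_t process;
    int64_t thread;
};

/* Read the given clock; return its value in nanoseconds or -1 on error. */
int64_t clock_now_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Stand-in workload (an assumption): sums the first *arg integers. */
void busy_work(void* arg)
{
    volatile unsigned long sum = 0;
    unsigned long n = *(unsigned long*) arg;
    unsigned long i;

    for (i = 0; i < n; i++)
        sum += i;
}

/* Call func(arg) once and store in 'out' the nanoseconds that elapsed
   on the real, process and thread clocks. Returns 0 on success and -1
   if any of the clocks could not be read. */
int measure_timings(void (*func)(void*), void* arg, struct timings* out)
{
    int64_t r0 = clock_now_ns(CLOCK_REALTIME);
    int64_t p0 = clock_now_ns(CLOCK_PROCESS_CPUTIME_ID);
    int64_t t0 = clock_now_ns(CLOCK_THREAD_CPUTIME_ID);

    func(arg);

    int64_t t1 = clock_now_ns(CLOCK_THREAD_CPUTIME_ID);
    int64_t p1 = clock_now_ns(CLOCK_PROCESS_CPUTIME_ID);
    int64_t r1 = clock_now_ns(CLOCK_REALTIME);

    if (r0 < 0 || p0 < 0 || t0 < 0 || r1 < 0 || p1 < 0 || t1 < 0)
        return -1;

    out->real = r1 - r0;
    out->process = p1 - p0;
    out->thread = t1 - t0;
    return 0;
}
```

The clocks are read in nested order (real outermost, thread innermost), so
that the thread interval is the tightest one around the measured call. A
mean over 'COUNT' runs, as in the table above, would be obtained by summing
the results of repeated calls and dividing by the number of runs.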
If an 'html-cref-test' command line contains the options `-c|--clock-cycles'
and none of the options `-m|--timings' and `--{real,process,thread}-timings'
after the rightmost such cycles option, then the result obtained from action
option `-T|--test-set' is a table whose rows are of form:

  NAME: COUNT - - - CYCLES

where 'NAME' and 'COUNT' are as described above and 'CYCLES' are mean values
of the total amount of CPU instruction cycles spent by the parser library
named by 'NAME' inside the HTML named character references parser function it
implements.

The action option `-P|--percents' takes as input a table obtained from
action option `-T|--test-set' and transforms the cell values to percentage
values relative to the corresponding cell value on the table's row that is
specified as argument of the action option itself. Prior to printing it out,
this action option by default sorts the resulting table on the fifth column,
the THREAD column.

When the options `--{real,process,thread}-timings' are used when invoking the
shell function 'html-cref-test', the columns of the output table that
correspond to the clocks that were omitted from the timing measurements
contain no values at all, but only an empty indicator.


5. Appendix: The Parsers Generated by RE2C
==========================================

Html-Cref has two parsers that were generated by RE2C [4]. Both are generated
from the RE2C specification within the file 'src/html-cref-re2c-impl.def':

  $ cd src

  $ grep html-cref-*re2c.c -Pe '^//\s*\$\s*re2c\b'
  html-cref-bre2c.c:// $ re2c -b html-cref-re2c-impl.def > html-cref-bre2c-impl.h
  html-cref-re2c.c:// $ re2c html-cref-re2c-impl.def > html-cref-re2c-impl.h

The source tree of Html-Cref does not include 'gre2c' and 'sre2c' parsers,
the ones that would have been generated by RE2C's options
`-g|--computed-gotos' and, respectively, `-s|--nested-ifs'.
On the machine mentioned in section 3 above, the 'gre2c' parser builds to a
1.4M binary that is about 9% slower than the 'bre2c' parser. RE2C version
1.1.1 generates identical 'sre2c' and 'bre2c' parsers:

  $ diff -q \
  -Lbre2c <(re2c --no-debug-info --no-generation-date -b html-cref-re2c-impl.def) \
  -Lsre2c <(re2c --no-debug-info --no-generation-date -s html-cref-re2c-impl.def) &&
  echo OK
  OK


6. Appendix: Links to Json-Type
===============================

The file 'lib/json-type-files.txt' lists the SHA1 hashes of the original
files brought into the 'lib' directory from Json-Type's 'git' repository:

  $ cat lib/json-type-files.txt
  231c047b8ab51ab957535c2d8aab3d788cbabbd4 char-traits.h
  56478925b560a095ef5caed4cc4602bfc96c51f3 config.h
  ...

The command below shows which of these source files are modified versions of
the original ones. Note that '$JSON_TYPE_HOME' is the path to the directory
hosting Json-Type's local 'git' repository. Json-Type's public 'git'
repository URL is given under reference entry [3].

  $ . src/html-cref.sh

  $ git-repo-diff -g $JSON_TYPE_HOME|lsdiff -s
  ! lib/file-buf.c
  ! lib/su-size.c

Note that 'git-repo-diff' expects to be issued from within the top directory
of the source tree. Its command line options are as follows:

  $ funchelp -f html-cref.sh -c git-repo-diff --long-wrap-join=auto
    -b|--ignore-space-change  pass '-b|--ignore-space-change' to diff
    -h|--home=DIR             home dir (default: '.')
    -g|--git-dir=DIR          'git' repo directory (default: '$HOME/$target')
    -s|--sha1-hashes=FILE     sha1 hashes file name ('-' means stdin, the
                                default is '$home/lib/$target-files.txt')
    -t:|--target=NAME         target name: 'json-type'
        --json-type
    -u|--unified=NUM          pass '-u|--unified=NUM' to diff


7. References
=============

Internet Resources:

  [1] HTML Living Standard -- Last Updated 11 April 2019
      https://html.spec.whatwg.org/multipage/index.html

      (a) 12 The HTML syntax:
          12.1.4 Character references
          https://html.spec.whatwg.org/multipage/syntax.html#character-references

      (b) 12.2 Parsing HTML documents:
          12.2.5.73 Named character reference state
          https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

      (c) 12.2 Parsing HTML documents:
          12.2.5.80 Numeric character reference end state
          https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state

      (d) 12.2 Parsing HTML documents:
          Table of numeric character reference overrides
          https://html.spec.whatwg.org/multipage/parsing.html#table-charref-overrides

  [2] The Unicode Standard Version 12.0 -- Core Specification
      Chapter 3: 3.9 Unicode Encoding Forms: UTF-8, p. 125
      http://www.unicode.org/versions/Unicode12.0.0/ch03.pdf

Free or Open-Source Software:

  [3] Json-Type: JSON Push Parsing and Type Checking
      http://nongnu.org/json-type/

      Json-Type's Public 'git' Repository:
      git://git.sv.nongnu.org/json-type

  [4] RE2C: A Lexer Generator for C and C++
      http://re2c.org/index.html

  [5] Trie-Gen: Trie Lookup Code Generator
      http://nongnu.org/trie-gen/