GitHub - stvar/html-cref: Html-Cref: Fast HTML Character References Decoder


                                    Html-Cref
                                    ~~~~~~~~~
                        Stefan Vargyas, stvar@yahoo.com

                                  Apr 11, 2019


Table of Contents
-----------------

0. Copyright
1. The Html-Cref Program
2. Building and Testing Html-Cref
3. Running Timings of Html-Cref's Parsers
4. Appendix: Using Shell Function 'html-cref-test'
5. Appendix: The Parsers Generated by RE2C
6. Appendix: Links to Json-Type
7. References


0. Copyright
============

This program is GPL-licensed free software. Its author is Stefan Vargyas. You
can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

You should have received a copy of the GNU General Public License along with
this program (look for the file COPYING in the top directory of the source
tree). If not, see http://gnu.org/licenses/gpl.html.

The source tree of Html-Cref includes source files from another free software
project: Json-Type [3]. These source files were placed in a separate directory,
'lib/json-type'. Each such Json-Type source file retains, unaltered, the
original copyright and license notices.


1. The Html-Cref Program
========================

Html-Cref is a fast decoder of HTML character references [1a].

Html-Cref implements several fast parsers of HTML character references, based
on two meta-tools: Trie-Gen [5] and RE2C [4]. The parsing and decoding of HTML
character references carefully follow the HTML standard specification: they
properly handle the named references that, for historical reasons, are allowed
to lack the terminating semicolon [1b], and also the numeric character
references that are subject to overriding as per the specification [1c, 1d].
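
As an illustration of the overriding of numeric character references mentioned
above, here is a minimal shell sketch. It is not Html-Cref's actual C code: the
function name 'override_cref' is hypothetical, and the 'case' statement holds
only a few of the entries of the overrides table [1d]. The replacement of zero,
of surrogates and of out-of-range values by U+FFFD follows [1c].

```shell
# Hypothetical helper (not part of Html-Cref): maps the value of a numeric
# character reference to the code point to be emitted, printed as U+XXXX.
override_cref() {
    local n=$(( $1 ))
    if [ "$n" -eq 0 ] || [ "$n" -gt $(( 0x10FFFF )) ]; then
        n=$(( 0xFFFD ))             # null, or beyond the Unicode range
    elif [ "$n" -ge $(( 0xD800 )) ] && [ "$n" -le $(( 0xDFFF )) ]; then
        n=$(( 0xFFFD ))             # surrogate
    else
        case "$n" in                # excerpt only of the overrides table [1d]
            128) n=$(( 0x20AC )) ;; # 0x80 -> EURO SIGN
            133) n=$(( 0x2026 )) ;; # 0x85 -> HORIZONTAL ELLIPSIS
            153) n=$(( 0x2122 )) ;; # 0x99 -> TRADE MARK SIGN
            159) n=$(( 0x0178 )) ;; # 0x9F -> LATIN CAPITAL LETTER Y WITH DIAERESIS
        esac
    fi
    printf 'U+%04X\n' "$n"
}
```

For instance, 'override_cref 0x80' prints 'U+20AC', the code point that the
reference `&#x80;' decodes to.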

Html-Cref produces output encoded in UTF-8 that conforms to the Unicode
Standard v12.0.0 [2], correctly handling both BMP and non-BMP (the so-called
astral planes) code points.
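
The UTF-8 encoding form defined in [2] can be sketched in a few lines of shell
arithmetic. The function below is only an illustration under stated
assumptions: the name 'utf8_hex' is hypothetical, it prints the encoded bytes
in hexadecimal rather than writing them out, and it is not meant to reflect how
Html-Cref's C code is organized.

```shell
# Hypothetical helper (not part of Html-Cref): prints the UTF-8 encoding of
# the given code point as a sequence of hexadecimal bytes.
utf8_hex() {
    local cp=$(( $1 ))
    if [ "$cp" -lt 0 ] || [ "$cp" -gt $(( 0x10FFFF )) ] ||
       { [ "$cp" -ge $(( 0xD800 )) ] && [ "$cp" -le $(( 0xDFFF )) ]; }; then
        return 1                    # not a Unicode scalar value
    elif [ "$cp" -lt $(( 0x80 )) ]; then
        printf '%02X\n' "$cp"
    elif [ "$cp" -lt $(( 0x800 )) ]; then
        printf '%02X %02X\n' \
            $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt $(( 0x10000 )) ]; then
        printf '%02X %02X %02X\n' \
            $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else                            # non-BMP code points take four bytes
        printf '%02X %02X %02X %02X\n' \
            $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
    fi
}
```

For instance, 'utf8_hex 0x20AC' prints 'E2 82 AC' for a BMP code point, while
'utf8_hex 0x1D504' prints 'F0 9D 94 84' for a non-BMP one.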


2. Building and Testing Html-Cref
=================================

Html-Cref is written in modern C and was developed under a GNU/Linux environment
using the GCC C compiler v4.3.4 and v7.2.0. The latter is the newest GCC version
with which Html-Cref was tested: when passed the option `-std=gnu11', it builds
Html-Cref cleanly.

Note that the two meta-tools mentioned in section 1 (Trie-Gen and RE2C) are not
needed for building Html-Cref. All the C code that these programs generated is
included within the source tree in the files:

  $ ls -1 src/html-cref-!(overrides|table)-impl.h
  src/html-cref-bre2c-impl.h
  src/html-cref-etrie-impl.h
  src/html-cref-ietrie-impl.h
  src/html-cref-itrie-impl.h
  src/html-cref-iwtrie-impl.h
  src/html-cref-re2c-impl.h
  src/html-cref-trie-impl.h
  src/html-cref-wtrie-impl.h

Html-Cref can be built in two distinct ways -- determined by specific arguments
passed to the 'make' program. The first way is to have each HTML named character
reference parser compiled separately as a dynamic library (a shared object) and
have the main program, 'html-cref', load an actual parser dynamically at run
time (the parser gets specified through 'html-cref's command line options
`-p|--cref-parser=$NAME'):

  $ make [OPT=$OPT] [TIMINGS=no|yes] [CYCLES=no|yes]

The alternative is to have 'html-cref' be a standalone program that does not
depend on external parser libraries. The main program is then built so that it
includes only the one specified parser:

  $ make [OPT=$OPT] BUILTIN=$NAME

The above `$NAME' can be one of the following: ietrie, iwtrie, itrie, etrie,
wtrie, trie, bre2c or re2c. The argument of form `OPT=$OPT' asks GCC to optimize
the binaries it produces according to the optimization level option `-O$OPT'.

When 'make' is passed the argument 'TIMINGS=yes', the binaries built include
code that measures the time spent by the HTML named character reference parsers.
The argument 'CYCLES=yes' is meaningful only when given along with
'TIMINGS=yes'. In that case, the binaries built also include code that measures
the CPU instruction cycles spent by the parsers of Html-Cref. Note that this
code is valid only on Intel/AMD platforms that support the two time stamp
counter instructions RDTSC and RDTSCP.

The package comes with a comprehensive test-suite contained in the 'test'
directory. The shell script 'test/test.sh' starts the test-suite and it can be
invoked simply by issuing:

  $ make test

The main test script 'test/test.sh' invokes one of the two top shell scripts
'test/test-{builtin,modules}.sh'; each of these, in turn, calls the inner
scripts placed in the sub-directories of 'test'. All these are 'bash' scripts
which depend upon a handful of power-tools commonly found on every GNU/Linux
installation. The 'test' directory also contains the source files from which
the test scripts were generated: the files 'test/test-{builtin,modules}.txt'.

Upon running 'test/test.sh', the output obtained from the shell script is a
short series of lines of the kind:

  test: GROUP:CASE RESULT

GROUP is the name of the group of tests located in the directory 'test/GROUP',
CASE is the name of the test case and RESULT is either 'OK' or 'failed'. All
test cases are expected to succeed. If a particular test case fails, more
verbose output is obtainable by running the corresponding
'test/GROUP/test-CASE.sh' script on its own. It produces a diff between the
expected and the actual output of 'html-cref'.

Note that any user's explicit invocation of these 'bash' test scripts must be
initiated from within the 'test' directory.
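
Going from the test-suite output to the scripts worth re-running can be
automated with a small filter. The function below is a sketch only: its name,
'failed_test_scripts', is hypothetical, and it assumes that neither GROUP nor
CASE contains a colon or white space. It prints paths relative to the 'test'
directory, in line with the note above.

```shell
# Hypothetical helper (not part of the test-suite): reads lines of the form
# 'test: GROUP:CASE RESULT' on stdin and prints, for each failed test case,
# the path of its script relative to the 'test' directory.
failed_test_scripts() {
    awk '$1 == "test:" && $3 == "failed" {
        split($2, part, ":")
        printf "%s/test-%s.sh\n", part[1], part[2]
    }'
}
```

Lines that do not start with 'test:' are ignored, so the filter can be fed the
saved output of a whole test run.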

A common scenario is to build Html-Cref in three steps. First, build the
program along with the external parser modules, enabling the timings code
within each of these modules:

  $ make OPT=3 TIMINGS=yes

Second step: run the timings measurements described in the next section to
determine the parser module to be built into the standalone 'html-cref' program.

Third step: build 'html-cref' with the chosen parser module included, as such:

  $ make allclean

  $ make OPT=3 BUILTIN=...


3. Running Timings of Html-Cref's Parsers
=========================================

This section shows how to run timings measurements of 'html-cref's HTML named
character references parsers.

Note that the timings below are obtained by running Html-Cref's parser libraries
on a test file that contains all the 2125 valid named character references, each
repeated 128 times, amounting to 272000 references arranged in random order.

  $ cd test

  $ wc -l < html-cref-names.txt
  2125

  $ test-html-crefs() {
    local k
    for ((k=0; k<128; k++)); do
        cat html-cref-names.txt
    done|
    shuf|
    sed 's/^/\&/;s/$/;/'|
    tr -d '\n'
  }

  $ test-html-crefs > test-html-crefs.txt
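
As a sanity check, the number of references in the generated file can be
compared with the expected 2125 * 128 = 272000. The function below is a sketch:
the name 'count_crefs' is hypothetical, and it assumes that no name listed in
'html-cref-names.txt' contains the character '&', so that each reference
contributes exactly one '&' to the file.

```shell
# Hypothetical helper (not part of Html-Cref): counts the character references
# in the given file by counting the '&' characters it contains.
count_crefs() {
    # the arithmetic expansion strips any padding 'wc' may print
    echo $(( $(tr -cd '&' < "$1" | wc -c) ))
}
```

Under the assumption above, 'count_crefs test-html-crefs.txt' is expected to
print 272000.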

The first thing to do is to build optimized binaries with timings code enabled:

  $ cd src

  $ make OPT=3 TIMINGS=yes -B 

Upon building Html-Cref with GCC 7.2.0 on a GNU/Linux machine, the sizes of the
binaries obtained are as shown:

  $ strip *.so html-cref

  $ du -sbh *.so html-cref|sort -k1hr,1
  154K	html-cref-re2c.so
  138K	html-cref-etrie.so
  138K	html-cref-ietrie.so
  138K	html-cref-iwtrie.so
  138K	html-cref-wtrie.so
  134K	html-cref-bre2c.so
  130K	html-cref-itrie.so
  130K	html-cref-trie.so
  51K	html-cref

Then run the shell function 'html-cref-test', sourced in the current environment
from the 'bash' shell script 'src/html-cref.sh'. (See the appendix in section 4
for the details of how 'html-cref-test' works.) The shell function is invoked on
the test file 'test/test-html-crefs.txt', asking it to produce the timings
output both on stdout and in a text file. The name of the output file is based
on the name of the shell function's input file. In this case, the name of the
output file is 'test/test-html-crefs.output'.

  $ . html-cref.sh

  $ html-cref-test -i ../test/test-html-crefs.txt -o+ -T+ --thread
  ...
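
The naming rule behind `-o+' (described in section 4 as replacing the input
file's shortest `.' suffix with `.output[.SUFFIX]') can be sketched with shell
parameter expansion. The function name 'output_name' is hypothetical, and the
sketch assumes that the input file name does have a `.' suffix; how
'html-cref-test' itself treats names without one is not covered here.

```shell
# Hypothetical helper (not part of html-cref.sh): computes the output file
# name for '-o +[SUFFIX]' given the input file name and an optional SUFFIX.
output_name() {
    local file="$1" suffix="${2:-}"
    # '%.*' removes the shortest trailing match of '.*', i.e. the last suffix
    local out="${file%.*}.output"
    if [ -n "$suffix" ]; then
        out="$out.$suffix"
    fi
    printf '%s\n' "$out"
}
```

For instance, 'output_name ../test/test-html-crefs.txt' prints
'../test/test-html-crefs.output', matching the output file named above.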

Upon running the shell function 'html-cref-test' as above on a GNU/Linux 64-bit
Intel Core i5-3210M (Ivy Bridge) CPU machine, the timings data expressed as
percentage values relative to those of the parser library 'html-cref-etrie.so'
would look like:

  $ html-cref-test -i ../test/test-html-crefs.output -P+
  etrie:  100      -      -   0.00
  wtrie:  100      -      -   0.08
  ietrie: 100      -      -   1.97
  iwtrie: 100      -      -   2.06
  bre2c:  100      -      -  10.41
  re2c:   100      -      -  12.98
  trie:   100      -      -  16.35
  itrie:  100      -      -  17.03


4. Appendix: Using Shell Function 'html-cref-test'
==================================================

The 'bash' shell function 'html-cref-test' was conceived as a tool for obtaining
aggregated timings information from running series of 'html-cref' program
instances. The program 'html-cref' produces timings data that amounts to the
time its HTML named character reference parsers spend processing a given input
text.

To use 'html-cref-test', one has to source the shell script 'src/html-cref.sh'
in the current 'bash' environment:

  $ cd src

  $ . html-cref.sh

The command line options of 'html-cref-test' are as shown below:

  $ funchelp -f html-cref.sh -c html-cref-test --long-wrap-join=auto
  actions:
    -N|--names                     print out the names of known HTML char ref
                                     parsers
    -T|--test-set[=NAMES]          test named HTML char ref parsers; NAMES is a
                                     comma-separated list of HTML char ref parser
                                     names (default: '+', i.e. all)
    -P|--percents[=NAME]           process percents relative to the specified
                                     HTML char ref parser (default: '+' , i.e.
                                     'etrie')
  options:
    -f|--overwrite                 force overwriting the output timings file if
                                     that already exists when action is `-T|
                                     --test-set'
    -g|--group                     group by names and sum up timings of input
                                     table when action is `-P|--percents'
    -i|--input=FILE                input test file when action is `-T|--test-set'
                                     or input timings file when action is `-P|
                                     --percents'
    -o|--output=FILE               output timings file when action is `-T|
                                     --test-set'; '-' means to not generate such
                                     file at all (default); '+[SUFFIX]' stands
                                     for computing a name based on the input test
                                     file name: replace FILE's shortest `.'
                                     suffix with `.output[.SUFFIX]'; note that
                                     regardless of the argument these options
                                     have, the timings table is printed out on
                                     stdout
    -r|--repeat=NUM                number of times to repeat the 'html-cref'
                                     command (default: 100)
    -m:|--timings[=NUM,NUM,NUM]    pass `-m|--timings[=NUM,NUM,NUM]' or, by case,
        --real[-timings][=NUM]       `--{real,process,thread}-timings[=NUM]' to
        --process[-timings][=NUM]    'html-cref'; the default NUM is '+', i.e.
        --thread[-timings][=NUM]     query that number from 'clocks'
    -c|--[clock-]cycles[=NUM]      pass `-c|--clock-cycles[=NUM]' to 'html-cref'
                                     (default do not); the default NUM is '+',
                                     i.e. query that number from 'clocks'
    -s:|--sort=NAME                sort or not the output table by the named
        --no-sort                    timings column when action is `-P|
                                     --percents'; for sorting, NAME can be either
                                     'real', 'process', 'thread' or 'cycles'; for
                                     not sorting the table at all, NAME must be
                                     '-' (default is sorting by '+', i.e. by
                                     'thread')
    -w|--width=NUM                 width of timings columns' integral part when
                                     action is `-T|--test-set' (default: 9)

As seen above, 'html-cref-test' has three modes of operation, each corresponding
to one of its action options. The action options determine the action taken by
the shell function and the output obtained from it.

In case of action options `-N|--names' there's nothing to add to the description
text above.

The action options `-T|--test-set' expect to find a binary 'html-cref' along
with all the shared libs of form 'html-cref-NAME.so', where 'NAME' is one of
the names printed out by action options `-N|--names' -- each built with the
timings code enabled and located in the current directory from which the shell
function 'html-cref-test' is invoked.

The shell function 'html-cref-test' also expects to find the binary 'clocks'.
This latter binary is obtained as a result of a normal build of Html-Cref and
it is used for producing the clock timings overhead estimates that 'html-cref'
uses for adjusting the timings measurements done.

When an 'html-cref-test' command line contains one of the options `-m|--timings'
or `--{real,process,thread}-timings' and none of the options `-c|--clock-cycles'
after the rightmost such timings option, then the result obtained from action
option `-T|--test-set' is a table whose rows are of the form:

  NAME: COUNT REAL PROCESS THREAD

where 'NAME' is one of the names given as argument to the action option itself;
'COUNT' is the number of times the command 'html-cref' was issued using the
parser library named by 'NAME'; 'COUNT' is controlled by 'html-cref-test's
option `-r|--repeat' and has the default value 100.

The last three columns of the table obtained from `-T|--test-set' are mean
values of the total of real, process and, respectively, thread nanoseconds spent
by the parser library named by 'NAME' inside the HTML named character references
parser function that it implements. The real, process and thread nanoseconds
values are computed using the library function `clock_gettime(3)' -- which is
called on the clock ids 'CLOCK_{REALTIME,PROCESS_CPUTIME_ID,THREAD_CPUTIME_ID}'
respectively.

If an 'html-cref-test' command line contains the options `-c|--clock-cycles' and
none of the options `-m|--timings' and `--{real,process,thread}-timings' after
the rightmost such cycles option, then the result obtained from action option
`-T|--test-set' is a table whose rows are of the form:

  NAME: COUNT - - - CYCLES

where 'NAME' and 'COUNT' are as described above and 'CYCLES' are mean values of
the total amount of CPU instruction cycles spent by the parser library named by
'NAME' inside the HTML named character references parser function it implements.

The action options `-P|--percents' take as input a table obtained from action
options `-T|--test-set' and transform its cell values to percentage values
relative to the corresponding cell value on the table's row that is specified
as argument of the action option itself. Prior to printing it out, these action
options sort the resulting table on the fifth column, the THREAD column.
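
The transformation can be sketched with 'awk' and 'sort'. The function below is
not the code of 'html-cref-test': its name, 'thread_percents', is hypothetical,
it handles the THREAD column only, and it assumes that a percentage value is the
difference relative to the reference row, (VALUE - REF) / REF * 100. That
assumption is consistent with the 0.00 shown for 'etrie' in section 3, but the
exact rounding and formatting done by 'html-cref-test' may differ.

```shell
# Hypothetical helper (not part of html-cref.sh): reads rows of the form
# 'NAME: COUNT REAL PROCESS THREAD' on stdin and prints each NAME followed by
# its THREAD value as a percentage difference relative to the row named by $1,
# sorted in ascending order of that percentage.
thread_percents() {
    LC_ALL=C awk -v base="$1:" '
        { name[NR] = $1; thread[NR] = $5 }
        $1 == base { ref = $5 }
        END {
            if (ref == 0) exit 1    # reference row missing or unusable
            for (i = 1; i <= NR; i++)
                printf "%s %.2f\n", name[i], (thread[i] - ref) / ref * 100
        }' | LC_ALL=C sort -k2,2n
}
```

Given a table in which 'etrie' took 200 thread nanoseconds and 'trie' took 250,
'thread_percents etrie' prints 'etrie: 0.00' first and 'trie: 25.00' after it.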

When the options `--{real,process,thread}-timings' are used when invoking the
shell function 'html-cref-test', the columns of the output table that correspond
to the clocks that got omitted from the timing measurements contain no values at
all, but only an empty indicator.


5. Appendix: The Parsers Generated by RE2C
==========================================

Html-Cref has two parsers that were generated by RE2C [4]. Each is generated from
the RE2C specification within the file 'src/html-cref-re2c-impl.def':

  $ cd src

  $ grep html-cref-*re2c.c -Pe '^//\s*\$\s*re2c\b'
  html-cref-bre2c.c:// $ re2c -b html-cref-re2c-impl.def > html-cref-bre2c-impl.h
  html-cref-re2c.c:// $ re2c html-cref-re2c-impl.def > html-cref-re2c-impl.h

The source tree of Html-Cref does not include 'gre2c' and 'sre2c' parsers -- the
ones that would have been generated by RE2C's options `-g|--computed-gotos' and,
respectively, `-s|--nested-ifs'.

On the machine mentioned in section 3 above, the 'gre2c' parser builds to a 1.4M
binary that is about 9% slower than the 'bre2c' parser.

RE2C version 1.1.1 generates identical 'sre2c' and 'bre2c' parsers:

  $ diff -q \
  -Lbre2c <(re2c --no-debug-info --no-generation-date -b html-cref-re2c-impl.def) \
  -Lsre2c <(re2c --no-debug-info --no-generation-date -s html-cref-re2c-impl.def) &&
  echo OK
  OK


6. Appendix: Links to Json-Type
===============================

The file 'lib/json-type-files.txt' lists the SHA1 hashes of the original files
brought into the 'lib' directory from Json-Type's 'git' repository:

  $ cat lib/json-type-files.txt 
  231c047b8ab51ab957535c2d8aab3d788cbabbd4  char-traits.h
  56478925b560a095ef5caed4cc4602bfc96c51f3  config.h
  ...

The command below shows which of these source files are modified versions of the
original ones. Note that '$JSON_TYPE_HOME' is the path to the directory hosting
Json-Type's local 'git' repository. Json-Type's public 'git' repository URL is
given under reference entry [3].

  $ . src/html-cref.sh

  $ git-repo-diff -g $JSON_TYPE_HOME|lsdiff -s
  ! lib/file-buf.c
  ! lib/su-size.c

Note that 'git-repo-diff' expects to be issued from within the top directory of
the source tree. Its command line options are as follows:

  $ funchelp -f html-cref.sh -c git-repo-diff --long-wrap-join=auto
    -b|--ignore-space-change  pass '-b|--ignore-space-change' to diff
    -h|--home=DIR             home dir (default: '.')
    -g|--git-dir=DIR          'git' repo directory (default: '$HOME/$target')
    -s|--sha1-hashes=FILE     sha1 hashes file name ('-' means stdin, the default
                                is '$home/lib/$target-files.txt')
    -t:|--target=NAME         target name: 'json-type'
        --json-type
    -u|--unified=NUM          pass '-u|--unified=NUM' to diff


7. References
=============

Internet Resources:

[1] HTML Living Standard -- Last Updated 11 April 2019
    https://html.spec.whatwg.org/multipage/index.html

    (a) 12 The HTML syntax: 12.1.4 Character references
    https://html.spec.whatwg.org/multipage/syntax.html#character-references

    (b) 12.2 Parsing HTML documents: 12.2.5.73 Named character reference state
    https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

    (c) 12.2 Parsing HTML documents: 12.2.5.80 Numeric character reference end state
    https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state

    (d) 12.2 Parsing HTML documents: Table of numeric character reference overrides 
    https://html.spec.whatwg.org/multipage/parsing.html#table-charref-overrides

[2] The Unicode Standard Version 12.0 -- Core Specification
    Chapter 3: 3.9 Unicode Encoding Forms: UTF-8, p. 125
    http://www.unicode.org/versions/Unicode12.0.0/ch03.pdf

Free or Open-Source Software:

[3] Json-Type: JSON Push Parsing and Type Checking
    http://nongnu.org/json-type/

    Json-Type's Public 'git' Repository:
    git://git.sv.nongnu.org/json-type

[4] RE2C: A Lexer Generator for C and C++
    http://re2c.org/index.html

[5] Trie-Gen: Trie Lookup Code Generator
    http://nongnu.org/trie-gen/