Make Java JusText implementation match Python and/or document differences #37

tfmorris · 2016-04-10T20:47:39Z

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

header postprocessing implemented incorrectly (Boilerplate removal header post processing incorrect #36)
<select> elements not automatically tagged as boilerplate as described in the algorithm description at http://corpus.tools/wiki/Justext/Algorithm
HTML entity decoding not done (HTML entities not decoded #30)
min/max lengths implemented as doubles instead of integers (probably doesn't affect the output, but it seems an unnecessary deviation)
<textarea> is marked as ignorable rather than a block level separator
block level tags are computed using JSoup's Element.isBlock() method, rather than the list of tags defined by the jusText algorithm resulting in a different tag set being used for paragraph splitting. The sets have substantial overlap and JSoup's may be better, but I'm not sure the difference is anything other than arbitrary.
<br><br> handling is different

Bottom line - what's implemented is not the JusText algorithm as documented.

The text was updated successfully, but these errors were encountered:

tfmorris · 2016-04-12T18:20:38Z

Another significant difference is the paragraph detection/segmentation.

The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its end tag. All text seen until another new paragraph is started gets added. Empty paragraphs are dropped.

The Java implementation uses a complicated backtracking algorithm that attempts to find a common ancestor between two nodes, then inspects all the elements along the paths to the common ancestor. This is O(n!) in tree depth which normally isn't a huge deal (although still wasteful), but in pathological deeply nested cases it can cause mappers to never complete.

I'm not sure what problem, real or imagined, the Java implementation was designed to solve, but the simple Python version seems to perform just fine (and in linear time & space).

habernal · 2016-04-15T14:02:35Z

Thanks for these very helpful insights! I agree that the implementation might be far from being optimal. Deviations are there, indeed, but since the results were comparable to the Python implementation, we didn't spend that much time on fine tuning (also given constrained project resources). I'm leaving that issue open to be considered for the next release.

reckart · 2016-04-15T14:09:20Z

We're happy about any contributions :)

tfmorris · 2016-04-15T14:53:52Z

I'm not sure the current results are comparable to the Python version. In
my cursory spot checks, I saw some significant differences.

Did you look, for example, at
https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
? It currently preserves line breaks, which makes the diff difficult, but
comparing Gold, Python, old Java, & new Java by eye shows some pretty
significant differences.

BTW, in reference to my previous comment about ancestor processing for
paragraph detection, I think I might have found the motivation for the
fancy paragraph processing in the Java version. It looks suspiciously like
this description of paragraphs from the HTML5 spec:

https://www.w3.org/TR/html/dom.html#paragraph

tfmorris changed the title ~~Make Java jusText implementation match Python and/or document differences~~ Make Java JusText implementation match Python and/or document differences Apr 10, 2016

habernal added the enhancement label Apr 15, 2016

habernal added this to the 1.0.1 milestone Apr 15, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Java JusText implementation match Python and/or document differences #37

Make Java JusText implementation match Python and/or document differences #37

tfmorris commented Apr 10, 2016

tfmorris commented Apr 12, 2016

habernal commented Apr 15, 2016

reckart commented Apr 15, 2016

tfmorris commented Apr 15, 2016

Make Java JusText implementation match Python and/or document differences #37

Make Java JusText implementation match Python and/or document differences #37

Comments

tfmorris commented Apr 10, 2016

tfmorris commented Apr 12, 2016

habernal commented Apr 15, 2016

reckart commented Apr 15, 2016

tfmorris commented Apr 15, 2016