Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Java JusText implementation match Python and/or document differences #37

Open
tfmorris opened this issue Apr 10, 2016 · 4 comments
Milestone

Comments

@tfmorris
Copy link
Contributor

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

  • header postprocessing implemented incorrectly (Boilerplate removal header post processing incorrect #36)
  • <select> elements not automatically tagged as boilerplate as described in the algorithm description at http://corpus.tools/wiki/Justext/Algorithm
  • HTML entity decoding not done (HTML entities not decoded #30)
  • min/max lengths implemented as doubles instead of integers (probably doesn't affect the output, but it seems an unnecessary deviation)
  • <textarea> is marked as ignorable rather than a block level separator
  • block level tags are computed using JSoup's Element.isBlock() method, rather than the list of tags defined by the jusText algorithm resulting in a different tag set being used for paragraph splitting. The sets have substantial overlap and JSoup's may be better, but I'm not sure the difference is anything other than arbitrary.
  • <br><br> handling is different

Bottom line - what's implemented is not the JusText algorithm as documented.

@tfmorris tfmorris changed the title Make Java jusText implementation match Python and/or document differences Make Java JusText implementation match Python and/or document differences Apr 10, 2016
@tfmorris
Copy link
Contributor Author

Another significant difference is the paragraph detection/segmentation.

The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its end tag. All text seen until another new paragraph is started gets added. Empty paragraphs are dropped.

The Java implementation uses a complicated backtracking algorithm that attempts to find a common ancestor between two nodes, then inspects all the elements along the paths to the common ancestor. This is O(n!) in tree depth which normally isn't a huge deal (although still wasteful), but in pathological deeply nested cases it can cause mappers to never complete.

I'm not sure what problem, real or imagined, the Java implementation was designed to solve, but the simple Python version seems to perform just fine (and in linear time & space).

@habernal
Copy link
Contributor

Thanks for these very helpful insights! I agree that the implementation might be far from being optimal. Deviations are there, indeed, but since the results were comparable to the Python implementation, we didn't spend that much time on fine tuning (also given constrained project resources). I'm leaving that issue open to be considered for the next release.

@habernal habernal added this to the 1.0.1 milestone Apr 15, 2016
@reckart
Copy link
Member

reckart commented Apr 15, 2016

We're happy about any contributions :)

@tfmorris
Copy link
Contributor Author

I'm not sure the current results are comparable to the Python version. In
my cursory spot checks, I saw some significant differences.

Did you look, for example, at
https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
? It currently preserves line breaks, which makes the diff difficult, but
comparing Gold, Python, old Java, & new Java by eye shows some pretty
significant differences.

BTW, in reference to my previous comment about ancestor processing for
paragraph detection, I think I might have found the motivation for the
fancy paragraph processing in the Java version. It looks suspiciously like
this description of paragraphs from the HTML5 spec:

https://www.w3.org/TR/html/dom.html#paragraph

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants