Boilerplate removal header post processing incorrect #36

Open
tfmorris opened this issue Apr 10, 2016 · 0 comments

The conditional here is wrong:
https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350
It causes the algorithm to attempt to reclassify non-headings, not just headings. The inverted conditionals, used just to save a little indentation whitespace, make my head hurt and are error prone, so I'd recommend using normal logic that matches the algorithm description, i.e. in this case, instead of:

        if (!(paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad"))) {
            continue;
        }

use

        if (paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad")) {
            // existing heading reclassification logic goes here
        }

The current code goes pathologically wrong for documents with a large number of empty elements (45,000 "paragraphs", many of which were consecutive <br> elements, in the example I looked at). Empty elements add nothing to the character count, so the 200 character distance limit is never reached to trigger the loop exit, and every look-ahead scan runs to the end of the document. The result is roughly O(n²) processing of the 45,000 elements.
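To make the failure mode concrete, here is a minimal sketch of the heading post-processing step written with the positive condition. It is not the project's code: `SketchParagraph`, `HeadingPostProcessingSketch`, and `reviseHeadings` are hypothetical names, and the loop is only my reading of the behaviour described above (a bad heading becomes good when a good paragraph follows within the 200 character limit). The method returns the number of paragraphs visited by look-ahead scans so the cost is easy to observe.

```java
import java.util.List;

/** Hypothetical stand-in for the project's paragraph type; only what the sketch needs. */
class SketchParagraph {
    final String text;
    final boolean heading;
    final String contextFreeClass;
    String classType;

    SketchParagraph(String text, boolean heading, String contextFreeClass, String classType) {
        this.text = text;
        this.heading = heading;
        this.contextFreeClass = contextFreeClass;
        this.classType = classType;
    }
}

class HeadingPostProcessingSketch {
    /** Character distance limit mentioned in the issue. */
    static final int MAX_HEADING_DISTANCE = 200;

    /**
     * Reclassifies bad headings that are followed closely by a good paragraph.
     * Returns how many paragraphs the look-ahead scans visited in total.
     */
    static long reviseHeadings(List<SketchParagraph> paragraphs) {
        long visited = 0;
        for (int i = 0; i < paragraphs.size(); i++) {
            SketchParagraph p = paragraphs.get(i);
            // Positive condition: only headings that are currently "bad"
            // but whose context-free class is not "bad" are candidates.
            if (p.heading && p.classType.equalsIgnoreCase("bad")
                    && !p.contextFreeClass.equalsIgnoreCase("bad")) {
                int distance = 0;
                for (int j = i + 1;
                        j < paragraphs.size() && distance <= MAX_HEADING_DISTANCE; j++) {
                    visited++;
                    SketchParagraph next = paragraphs.get(j);
                    if (next.classType.equalsIgnoreCase("good")) {
                        p.classType = "good";
                        break;
                    }
                    // Empty paragraphs add 0 here, so the limit is never reached.
                    distance += next.text.length();
                }
            }
        }
        return visited;
    }
}
```

In this sketch a single candidate followed only by empty paragraphs scans to the end of the list. If the condition wrongly admits every paragraph, each of them pays that cost.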

This suggests a couple of other possible improvements:

  • compress runs of more than 2 <br> elements
  • introduce a max number of elements distance limit in addition to the max number of character limit
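Both suggestions can be sketched briefly. This is an illustration under stated assumptions, not a patch: paragraphs are modelled as plain strings, with an empty string standing in for a paragraph produced by a <br> element, and `ScanLimitSketch`, `compressEmptyRuns`, `lookAheadLength`, and the `MAX_ELEMENT_DISTANCE` value of 50 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ScanLimitSketch {
    /** Character distance limit mentioned in the issue. */
    static final int MAX_HEADING_DISTANCE = 200;
    /** Hypothetical element distance limit; the value is an assumption. */
    static final int MAX_ELEMENT_DISTANCE = 50;

    /** Keeps at most maxRun consecutive empty entries and drops the rest of each run. */
    static List<String> compressEmptyRuns(List<String> texts, int maxRun) {
        List<String> out = new ArrayList<>();
        int run = 0;
        for (String t : texts) {
            run = t.isEmpty() ? run + 1 : 0;
            if (run <= maxRun) {
                out.add(t);
            }
        }
        return out;
    }

    /**
     * Counts how many following entries a look-ahead from index start visits
     * when bounded by both the character limit and the element limit.
     */
    static int lookAheadLength(List<String> texts, int start) {
        int distance = 0;
        int visited = 0;
        for (int j = start + 1;
                j < texts.size()
                        && distance <= MAX_HEADING_DISTANCE
                        && visited < MAX_ELEMENT_DISTANCE;
                j++) {
            visited++;
            distance += texts.get(j).length();
        }
        return visited;
    }
}
```

With the element limit in place, a look-ahead over empty entries stops after `MAX_ELEMENT_DISTANCE` steps even though the character count never grows. Compressing runs first reduces how many such entries exist at all.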
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 10, 2016
Also align more closely with the original algorithm by:
- un-inverting conditionals so they can be checked against algorithm easily
- adding <style> tag to list of tags cleaned in pre-processing per algo
- marking <select> tag as block level per original algorithm
- using ints for character counts instead of doubles
- adding documentation from original algorithm description
@habernal habernal added this to the 1.0.1 milestone Apr 13, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 13, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 28, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Jun 12, 2020