HTML entities not decoded #30

Open
tfmorris opened this issue Apr 4, 2016 · 3 comments

tfmorris commented Apr 4, 2016

Comparing these two files:

  • /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
  • /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt

It appears that the Python program drops &nbsp; entities but does not decode some others, such as &lt;. The gold standard doesn't include any HTML entities, naturally. I'd argue that the correct approach is to decode all HTML entities and convert them to their equivalent Unicode characters, even though this differs from what the original Python program did.
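
To make the proposal concrete, here is a minimal sketch of "decode everything to Unicode" using only Python's standard library. The helper name `decode_entities` and the sample string are made up for illustration; this is not the code from either jusText port.

```python
import html


def decode_entities(text: str) -> str:
    """Decode every HTML character reference to its Unicode character.

    Hypothetical helper for illustration; html.unescape handles named
    references (&nbsp;, &lt;, &amp;) as well as numeric ones (&#8217;).
    """
    return html.unescape(text)


# Made-up sample mixing the entity kinds discussed in this issue.
sample = "a&nbsp;&lt;&nbsp;b &amp; c&#8217;s"
decoded = decode_entities(sample)
```

Note that &nbsp; is decoded to U+00A0 (no-break space) rather than dropped, so any whitespace normalization would have to be a separate, explicit step.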

reckart commented Apr 4, 2016

+1 ;)

habernal added the bug label Apr 8, 2016
habernal added this to the 1.0.1 milestone Apr 8, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016
Also use JSoup for more of the HTML cleaning.

tfmorris commented Apr 9, 2016

I've submitted a fix for this. When the full CleanEval corpus is re-run, I'd suggest having it generate the minimal HTML tags, since the tags are included in the gold standard.

tfmorris commented Apr 9, 2016

I'm going to revise my opinion about the "correct approach" and turn it into a question. The gold standard doesn't entity-encode less-than (<) or ampersand (&) characters, which means that it's not legal X(HT)ML (it also uses made-up tags like <l> for lists), so there's a tension between doing what is useful for comparison with the gold standard and doing what's most convenient for consumers.

It's pretty clear that the text mode should be fully decoded, but should the minimal HTML mode match the gold standard or produce legal XML? Is a third mode needed?
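
For discussion, the three candidate behaviours could look roughly like this. It is only a sketch under stated assumptions: the function name `render_paragraph` and the mode names `text`, `gold`, and `xml` are hypothetical and do not correspond to existing options in the project.

```python
import html


def render_paragraph(raw_text: str, tag: str = "p", mode: str = "text") -> str:
    """Render one extracted paragraph in one of three hypothetical modes."""
    # All modes start from fully decoded text.
    decoded = html.unescape(raw_text)
    if mode == "text":
        # Plain text output: fully decoded, no tags.
        return decoded
    if mode == "gold":
        # Mimics the gold standard: tags around raw < and &, not legal XML.
        return f"<{tag}>{decoded}</{tag}>"
    if mode == "xml":
        # Re-escapes &, < and > so the output is well-formed XML.
        return f"<{tag}>{html.escape(decoded, quote=False)}</{tag}>"
    raise ValueError(f"unknown mode: {mode}")


text_out = render_paragraph("1 &lt; 2 &amp; 3", mode="text")
gold_out = render_paragraph("1 &lt; 2 &amp; 3", mode="gold")
xml_out = render_paragraph("1 &lt; 2 &amp; 3", mode="xml")
```

With this split, the `gold` output is what you would diff against CleanEval, while the `xml` output is what a consumer with an XML parser could load without errors.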
