HTML entities not decoded #30

tfmorris · 2016-04-04T14:22:07Z

Comparing these two files:

/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt

It appears that the Python program is dropping &nbsp; entities, but not decoding some others such as &lt;. The gold standard doesn't include any HTML entities, naturally. I'd argue that the correct approach is to decode all HTML entities and convert them to their equivalent Unicode characters, even though this is different from what the original Python program did.
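To make the proposed behavior concrete, here is a minimal sketch of "decode everything" using Python's standard `html.unescape`. It is not the code of either JusText implementation; the helper name `decode_entities` and the sample string are made up for illustration.

```python
import html


def decode_entities(text: str) -> str:
    """Decode named and numeric HTML entities into Unicode characters.

    Hypothetical helper for illustration; html.unescape does a single pass,
    so a double-encoded "&amp;lt;" becomes "&lt;", not "<".
    """
    return html.unescape(text)


# &nbsp; becomes U+00A0 (no-break space) instead of being dropped,
# and &lt; / &amp; / numeric references are decoded as well.
sample = "a&nbsp;&lt;&nbsp;b &amp; c &#233;"
decoded = decode_entities(sample)
print(repr(decoded))
```

Note that a decoded `&nbsp;` is still a no-break space rather than a plain space, so any later whitespace normalization would need to account for U+00A0 explicitly.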

reckart · 2016-04-04T14:23:58Z

+1 ;)

Also use JSoup for more of the HTML cleaning.

tfmorris · 2016-04-09T17:48:01Z

I've submitted a fix for this. When the full CleanEval corpus is re-run, I'd suggest having it generate the minimal HTML tags, since the tags are included in the gold standard.

tfmorris · 2016-04-09T18:30:07Z

I'm going to revise my opinion about the "correct approach" and turn it into a question. The gold standard doesn't entity-encode less-than (<) or ampersand (&) characters, which means that it's not legal X(HT)ML (but it also uses made-up tags like <l> for lists), so there's a tension between doing what is useful for comparison with the gold standard and doing what's most convenient for consumers.

It's pretty clear that the text mode should be fully decoded, but should the minimal HTML mode match the gold standard or produce legal XML? Is a third mode needed?
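As a rough illustration of the options in question, the sketch below contrasts a fully decoded text mode with a minimal HTML mode that can either leave `<` and `&` raw (gold-standard style) or re-escape them (legal XML). The function names and the `legal_xml` flag are hypothetical and not part of the project's API; only `html.unescape` and `xml.sax.saxutils.escape` are real standard-library calls.

```python
import html
from xml.sax.saxutils import escape


def render_text(raw: str) -> str:
    """Text mode: decode every entity, emit plain Unicode."""
    return html.unescape(raw)


def render_minimal_html(raw: str, tag: str = "p", legal_xml: bool = True) -> str:
    """Minimal HTML mode (hypothetical): wrap decoded text in one tag.

    legal_xml=True  -> re-escape &, < and > so the output parses as XML.
    legal_xml=False -> leave them raw, as the gold standard does.
    """
    text = html.unescape(raw)
    if legal_xml:
        text = escape(text)  # escapes &, <, > in that order
    return f"<{tag}>{text}</{tag}>"


raw = "1 &lt; 2 &amp; 3"
print(render_text(raw))
print(render_minimal_html(raw))
print(render_minimal_html(raw, legal_xml=False))
```

With a switch like this, a separate third mode might not be needed; the gold-standard comparison and the consumer-facing output would differ only in the final escaping step.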

habernal added the bug label Apr 8, 2016

habernal added this to the 1.0.1 milestone Apr 8, 2016

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016

Decode HTML entities. Fixes dkpro#30.

b5bb751

Also use JSoup for more of the HTML cleaning.

tfmorris mentioned this issue Apr 9, 2016

Fix O(n!) in tag depth issue #28

Open

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016

Decode HTML entities. Fixes dkpro#30.

44c9622

Also use JSoup for more of the HTML cleaning.

tfmorris mentioned this issue Apr 10, 2016

Make Java JusText implementation match Python and/or document differences #37

Open

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 10, 2016

Decode HTML entities. Fixes dkpro#30.

48e4438

Also use JSoup for more of the HTML cleaning.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 13, 2016

Decode HTML entities. Fixes dkpro#30.

b5becbf

Also use JSoup for more of the HTML cleaning.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016

Decode HTML entities. Fixes dkpro#30.

e7a4f02

Also use JSoup for more of the HTML cleaning.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016

Decode HTML entities. Fixes dkpro#30.

6b0e39c

Also use JSoup for more of the HTML cleaning.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 28, 2016

Decode HTML entities. Fixes dkpro#30.

ddcffff

Also use JSoup for more of the HTML cleaning.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Jun 12, 2020

Decode HTML entities. Fixes dkpro#30.

8580783

Also use JSoup for more of the HTML cleaning.
