Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid XHTML names entities reported in Character Data strings #800

Closed
JayPanoz opened this issue Oct 20, 2017 · 16 comments
Closed

Invalid XHTML names entities reported in Character Data strings #800

JayPanoz opened this issue Oct 20, 2017 · 16 comments
Labels
status: has PR The issue is being processed in a pull request type: bug The issue describes a bug
Milestone

Comments

@JayPanoz
Copy link

epubcheck seems to report invalid XHTML named entities in <![CDATA[]]>, which I used in the first place to indicate that the <script> data in between these strings includes data that could be interpreted as XML markup, but should not be.

Here comes the tricky part:

  • inline script is uglified/minified;
  • warning is reported for a super specific string: &&void 0 (there’s a space);
  • no issue with other strings such as &&TweenLite.to(), where there’s no space.

As far as I can remember, this wasn’t reported when the book was published last year (this is a 2nd edition). And given void 0 contains fewer characters than undefined, it’s quite used in uglyfiers/minifiers.

Fix if you encounter this issue: use void(0) instead of void 0 (they’re the same).

What’s strange though, is that it is interpreted as XML markup despite <![CDATA[]]>.

@tofi86
Copy link
Collaborator

tofi86 commented Nov 21, 2017

Hey Jiminy, can you please attach a simple HTML file which can be run against the current EpubCheck build and also post the offending error message? Thanks!

@tofi86 tofi86 added the status: waiting for feedback The development team needs feedback from the issue’s creator label Nov 21, 2017
@JayPanoz
Copy link
Author

I’ll see what I can do (publishers’ right and stuff).

I guess I can just use a placeholder HTML file with the script if I don’t have permission to share the problematic XHTML file but won’t be able to deal with this until Friday.

@tofi86
Copy link
Collaborator

tofi86 commented Nov 22, 2017

A sample/dummy HTML file would be enough. Thanks Jiminy.

@mattgarrish
Copy link
Member

On a cursory review, the problem looks like it's that EntitySearch.java goes line-by-line over an input stream looking for this pattern:

static final Pattern entityPattern = Pattern.compile("&([A-Za-z0-9]+)([;|\s])");

It doesn't care that there is an active CDATA block. So long as it finds an ampersand followed by one or more alphanumeric characters followed by a colon or space it will emit the warning. That's why '&&void ' gets a warning but '&&void(' does not.

@mattgarrish
Copy link
Member

Even that pattern is a bit screwy, to be honest, as the pipe is not an OR inside a character set, and is not a delimiter of the end of an entity.

@tofi86
Copy link
Collaborator

tofi86 commented Nov 22, 2017

thanks matt for having a look.
regarding the screwy pattern, do you agree that it should be "&([A-Za-z0-9]+)(;|\s)" without the square brackets instead?

anyways, we still need a solution to not parse entitities in CDATA.

@mattgarrish
Copy link
Member

Either that or just remove the pipe. The brackets establish a character set. I don't know if the capture groups are used or not, but they're also potentially superfluous. Could just be:

&[A-Za-z0-9]+[;\s]

Can't help with the harder cdata problem as I only hack at java. You probably have to pre-process out the data blocks before analyzing. Maybe also comments, but haven't tested those.

@mattgarrish
Copy link
Member

Maybe also comments, but haven't tested those.

To confirm, it's dumb validation, so also affects comments.

I'm wondering why this check is even run? What is it doing that the default HTML validation doesn't catch?

I've yet to find any value to it, at any rate, other than to make mistakes. XHTML doesn't allow entities without a closing semi-colon, so that part of the check is useless, and declaring named entities beyond the base xml set generates errors, so you wind up with an error and a warning.

@JayPanoz
Copy link
Author

Here we go, placeholder file in a ZIP (it contains an XHTML file but github doesn't support xhtml for attached files).

void-epubcheck.zip

@tofi86 tofi86 added type: bug The issue describes a bug status: ready for implem The issue is ready to be implemented and removed status: waiting for feedback The development team needs feedback from the issue’s creator labels Nov 26, 2017
@kamorrissey
Copy link
Contributor

I'm not clear on what the resolution is. Is it to make sure that named entities are not checked for inside comments and CDATA?

@kamorrissey
Copy link
Contributor

Oh, and to fix the regex for named entities?

@kamorrissey
Copy link
Contributor

I reread this thread, and now I'm really not sure what the resolution is since there's talk that the entity check is redundant with a check elsewhere.

@tofi86
Copy link
Collaborator

tofi86 commented Apr 27, 2018

Since it was me adding the label "reviewed & ready for implementation", I have to wrap my head around it a second time I think...

I think I added the label because it's clearly a bug and we have demo data which are reproducable.
However, there's currently not a clear solution to it.

Looking at the EntitySearch.java class, this whole thing is a nightmare. As are other similar classes in the ctc package (see below).
This class tries to parse an X(HT)ML file line-by-line(!) which is a blocker for context sensitive information items like things INSIDE other elements/structures.

There's a similar parsing class in the ctc package HTMLTagsAnalyseHandler.java which handles XML validation like this and tries to implement context sensitive line-by-line parsing like at https://github.com/IDPF/epubcheck/blob/master/src/main/java/com/adobe/epubcheck/ctc/xml/HTMLTagsAnalyseHandler.java#L343 which is IMHO a huge pile of sh*** 😲 (and causes trouble at #446 as well).

All of this would need refactoring asap.

But I'm not sure whether that's a solution for this issue here.

I'm still on parental leave and my time is very limited at the moment, so unfortunately I cannot make up a good solution for this at the moment :-/ sorry.

@kmorrissey-mersive
Copy link

kmorrissey-mersive commented May 2, 2018

I'm thinking of tackling this one. What I'd like to do is to parse into individual tags (crudely, "<[^>]+>"; CDATA and comments require more sophistication than that) and non-tag (plain text) data. Then apply the character entity checking only to opening tags and non-tag data, ignoring comments, closing tags, and CDATA. I think the opening (& closed) tags can be searched naively.

Does that sound like a good approach?

Also, it seems to be a mistake to consider something to be a character entity if it doesn't end with a semicolon. Is part of the solution here to warn about instances of "&" that don't seem to begin a legal character or numeric entity? (I have to do my homework to see about my terminology.)

@tofi86
Copy link
Collaborator

tofi86 commented May 3, 2018

Also, it seems to be a mistake to consider something to be a character entity if it doesn't end with a semicolon.

Na, that's indeed the XML rules! So everything beginning with & needs to be treated as an entity. At least in XHTML. Out of my head, I don't know the rules for HTML5, but I would treat it the same.

Is part of the solution here to warn about instances of "&" that don't seem to begin a legal character or numeric entity?

The main goal should be to exclude entity parsing in CDATA blocks (the only place where an & is not treated as the beginning of an entity).

Parsing for legal entities is too hard I think. We could certainly check for decimal entities (&#1234;, can have 4+ digits), for hexadeciaml entities (&#202F;, can have 4+ chars) or named entities (&nbsp;, there's a list of allowed entities in the (X)HTML DTD e.g. XHTML 1.1 DTD EPUB 2). But I don't know whether this is already checked somewhere else...

@tofi86
Copy link
Collaborator

tofi86 commented May 3, 2018

Does that sound like a good approach?

If you have a look at the linked TMLTagsAnalyseHandler.java above, you will see the marked line starting with if (inBlockQuote ... . Original Authors parsed line by line and if they found an opening tag <blockquote> they set inBlockQuote = true; and if they found the next occurance of closing </blockquote> the reset the var to false.

You could use this mechanism to parse for CDATA markers (start <![CDATA[, end ]]>) line by line and set a global var if you're inside a CDATA section.

If I got you right, that's basically what you suggested, right?

rdeltour added a commit that referenced this issue Jan 21, 2019
- suppress poorly implemented checks `HTM-023` and `HTM-024`
  (entity references are already checked at parsing time)
- add more tests for entity references

Fixes #800
@rdeltour rdeltour added status: has PR The issue is being processed in a pull request and removed status: ready for implem The issue is ready to be implemented labels Jan 21, 2019
@rdeltour rdeltour added this to the 4.1.1 milestone Jan 21, 2019
rdeltour added a commit that referenced this issue Jan 21, 2019
- suppress poorly implemented checks `HTM-023` and `HTM-024`
  (entity references are already checked at parsing time)
- add more tests for entity references

Fixes #800
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: has PR The issue is being processed in a pull request type: bug The issue describes a bug
Projects
None yet
Development

No branches or pull requests

6 participants