HTML5 Named character references missing? #502
Comments
@jtconsol If you look in HTMLEntityCodec.java, at the private 'mkCharacterToEntityMap()' method, there are 580 named references currently present, but none for &Tab; or &NewLine;. I suspect that those are from the HTML5 spec, which we never picked up. (I revised your issue title for that reason.) So we need to add quite a few. In fact, I'm wondering if we should rewrite this to pull the entities from a resource file internal to ESAPI instead, as that would make updates easier and perhaps more obvious.
@xeno6696 @jeremiahjstacey -- Do you think we should file this as a bug or as an enhancement? I could see either way really, since I don't think that the HTML5 was officially published when ESAPI was originally written. OTOH, I could see 'bug' instead of 'enhancement' since I think it ought to be assumed that a library around HTML and HTTP keeps up with the current standards. |
The wikipedia article "List of XML and HTML character entity references" features some insights on the topic at hand. Curiously, it does not list @kwwall How would you go about "pulling this from a resource" - parse the HTML5 spec document and extract the list of entities? This javascript snippet might be a starting point (use it on the table view):
It outputs a list on the console like so:
Follow-up: I think there's a conceptual problem in the current implementation of HTMLEntityCodec, as it binds each numeric ID to a single string, but we can already see an ID clash in the few items above ("quot"). In this example, the names differ only in case, but there's also stuff like:
To make matters more complicated, the table view linked above does not seem to contain all named character references, e.g. varsubsetneq is missing even though it's just a synonym for subsetneq, which itself is contained in the table. Probably better to parse the spec directly... |
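The ID clash described above (several names sharing one numeric ID) could be handled by keeping two maps instead of one. The following is only a sketch under that assumption: `EntityTable` and its methods are hypothetical names, not the existing HTMLEntityCodec structure. Decoding looks up any known name, while encoding uses whichever name was registered first for a code point.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical two-map table; not the existing HTMLEntityCodec structure. */
final class EntityTable {
    // Many names may point at the same code point (e.g. "quot" and "QUOT").
    private final Map<String, Integer> nameToCodePoint = new HashMap<>();
    // Exactly one name per code point is kept for encoding.
    private final Map<Integer, String> codePointToPreferredName = new HashMap<>();

    /** Registers a name; the first name seen for a code point is used when encoding. */
    void add(String name, int codePoint) {
        nameToCodePoint.put(name, codePoint);
        codePointToPreferredName.putIfAbsent(codePoint, name);
    }

    /** Returns the code point for a name, or null if the name is unknown. */
    Integer decode(String name) {
        return nameToCodePoint.get(name);
    }

    /** Returns the preferred name for a code point, or null if none is registered. */
    String encode(int codePoint) {
        return codePointToPreferredName.get(codePoint);
    }
}
```

With `add("quot", 0x22)` followed by `add("QUOT", 0x22)`, both names decode to U+0022, and encoding U+0022 yields "quot" because it was registered first.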
@jtconsol When I said to pull it from a resource, I meant that we should just deploy some text file that we prepare in advance (whether from this JavaScript or manually) based on something extracted from the latest HTML spec at WHATWG (as of May this year, W3C has turned over the management of the HTML spec to WHATWG). We just create some simple form of text file to parse and put it under 'src/main/resources' (and perhaps 'src/test/resources' as well, especially if we wish to deviate from the official one for testing purposes). That output can be something like you note above (although I would prefer that we support some kind of comment notation, maybe everything from '#' to the end of the line). Then use getClass().getResourceAsStream() to retrieve it for parsing. That's what I was referring to. Certainly nothing as complex as dynamically retrieving it from the Internet somewhere. As far as the "conceptual problem" that you mention in the current implementation, that is likely to present a problem. Ultimately, once we have something encoded as
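A minimal sketch of the resource-file approach described above might look like the following. Everything specific here is an assumption for illustration: the class name `EntityResourceLoader`, the "name codePoint" line format with decimal code points, and whatever resource path is passed in. Only the '#' comment convention and the use of getResourceAsStream() come from the comment itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical loader; the class name and the line format are assumptions. */
final class EntityResourceLoader {

    /** Parses lines of the form "name codePoint"; '#' starts a comment. */
    static Map<String, Integer> parse(Reader source) {
        Map<String, Integer> nameToCodePoint = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int hash = line.indexOf('#');
                if (hash >= 0) {
                    line = line.substring(0, hash); // drop the comment part
                }
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // blank or comment-only line
                }
                String[] fields = line.split("\\s+");
                if (fields.length != 2) {
                    throw new IllegalArgumentException("Malformed entity line: " + line);
                }
                // Code points are assumed to be written in decimal.
                nameToCodePoint.put(fields[0], Integer.parseInt(fields[1]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return nameToCodePoint;
    }

    /** Loads a file bundled on the classpath, e.g. under src/main/resources. */
    static Map<String, Integer> loadFromClasspath(String resourcePath) {
        InputStream in = EntityResourceLoader.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found: " + resourcePath);
        }
        return parse(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

Because the file ships inside the jar, nothing is fetched at runtime. A name-to-code-point map like this also tolerates several names for the same code point.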
Sorry for the late response @kwwall but yeah this would be an enhancement to me as I agree 100% that the current implementation was designed with HTML4 in mind with not even a glacial glance at the future HTML5 spec. |
Looks like they did the work for us: https://html.spec.whatwg.org/entities.json All we have to do now is slurp that file on startup and we should be able to handle every case. However I don't think I can do this without adding another external library. Alternatively I could write a script to cut out all these entities and just wrap them in java code. Thoughts @kwwall ? |
We can copy it locally (via our pom.xml) and use it from there, but NOT at runtime. That would be considered an insecure external code reference by my secure code review team, even if done via https.
…-kevin
I wasn't asserting we'd grab it live. Actually, in hoping for a quick win: do we care that the underlying structure of the codecs never accounted for high UTF-8 encodings? Specifically, there are codepoints that can only be represented as integer arrays, and in researching the changes here I found I'll have to make some fundamental API changes in the codec classes to accommodate that. There might be deeper effects that I haven't found yet.
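To illustrate the point about integer arrays: a Java `char` holds only 16 bits, so a code point above U+FFFF takes two `char`s, and some HTML5 named references expand to more than one code point. The sketch below uses a hypothetical helper class, `CodePointDemo`. The two entity values are taken from my reading of entities.json and are worth double-checking against the file.

```java
/** Illustrative only; the class and method names are hypothetical. */
final class CodePointDemo {

    /** Builds a String from one or more Unicode code points. */
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // &Afr; is a single code point above the Basic Multilingual Plane.
        String afr = fromCodePoints(0x1D504);
        System.out.println(afr.length());                        // 2 chars (surrogate pair)
        System.out.println(afr.codePointCount(0, afr.length())); // 1 code point

        // &NotEqualTilde; expands to two code points.
        String notEqualTilde = fromCodePoints(0x2242, 0x0338);
        System.out.println(notEqualTilde.codePointCount(0, notEqualTilde.length())); // 2
    }
}
```

An API that maps each entity to a single `char` or `Character` cannot express either case, which is presumably why the codec signatures would need to change.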
Actually @jtconsol, I've been thinking hard about this issue today. I'm not seeing a terrible threat here, and hopefully I can articulate this well. First off, let me agree in principle: if you offer a decoding capability, whatever is encoded should be decodable. I think we're alright here: we numerically decode everything we encode. In fact, if I were to completely overhaul our encoding, my first action would be to remove the decoding capability EXCEPT via the canonicalize method, and my second would be to pare down the things we map to the OWASP encoder project. So, help me understand what you think the risk is here. After literally thinking about this all day, I come to the following points:
To me this means that I'm not understanding the risk posed. I believe it means we have more false positives. We don't encode for HTML5 named references, but we don't have to... The Java Encoder project encodes less than we do and has still never been compromised. Of course, that's because 99.99% of all characters under consideration aren't reserved in programming grammars. We really do encode too much. Given that we: I think the only real action that needs to happen here is that we improve the documentation, deprecate the decode method on all codecs (with a view to making them private), and close this out.
Hi guys, first off let me thank you for all the work, especially on the new release - Splendid! :)
Coincidentally, I was revisiting the XSS filter in our application, which makes use of
Then I stumbled upon the following XSS attack vector:
Didn't even know &Tab; or &NewLine;. So I checked the HTML5 spec (here and here - also, here's a more visually pleasing overview) and they seem to agree on having these named character references. In contrast, HTMLEntityCodec.decode... currently delivers:
This seems wrong. Also, shouldn't all named character references be unescaped?