Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML5 Named character references missing? #502

Open
jtconsol opened this issue Jun 26, 2019 · 9 comments
Open

HTML5 Named character references missing? #502

jtconsol opened this issue Jun 26, 2019 · 9 comments

Comments

@jtconsol
Copy link

jtconsol commented Jun 26, 2019

Hi guys, first off let me thank you for all the work, especially on the new release - Splendid! :)

Coincidentally, I was revisiting the XSS filter in our application, which makes use of

ESAPI.encoder().canonicalize(...)

Then I stumbled upon the following XSS attack vector:

<a href="j&Tab;a&Tab;v&Tab;asc&NewLine;ri&Tab;pt&colon;&lpar;a&Tab;l&Tab;e&Tab;r&Tab;t&Tab;(document.domain)&rpar;">X</a>

Didn't even know &Tab; or &NewLine; . So I checked the HTML5 spec (here and here - also, here's a more visually pleasing overview) and they seem to agree on having these named character references.

In contrast, HTMLEntityCodec.decode... currently delivers:

<a href="j&Tab;a&Tab;v&Tab;asc≠wLine;ri&Tab;pt&colon;&lpar;a&Tab;l&Tab;e&Tab;r&Tab;t&Tab;(document.domain)&rpar;">X</a>

This seems wrong. Also, shouldn't all named character references be unescaped?

@kwwall kwwall changed the title HTML Named character references missing? HTML5 Named character references missing? Jun 29, 2019
@kwwall
Copy link
Contributor

kwwall commented Jun 29, 2019

@jtconsol If you look in HTMLEntityCodec.java, at the private 'mkCharacterToEntityMap()' method, there are 580 named references currently present, but none for &Tab; or &Newline;. I suspect that those are from the HTML5 spec which we never picked up. (Revised your issue title for that reason.) So we need to add quite a few. In fact, I'm wondering if we should rewrite this and pull this from a resource file internal to ESAPI instead as that way it would be easier and perhaps more obvious to update.

@kwwall
Copy link
Contributor

kwwall commented Jun 29, 2019

@xeno6696 @jeremiahjstacey -- Do you think we should file this as a bug or as an enhancement? I could see either way really, since I don't think that the HTML5 was officially published when ESAPI was originally written. OTOH, I could see 'bug' instead of 'enhancement' since I think it ought to be assumed that a library around HTML and HTTP keeps up with the current standards.

@jtconsol
Copy link
Author

jtconsol commented Jun 29, 2019

The wikipedia article "List of XML and HTML character entity references" features some insights on the topic at hand. Curiously, it does not list &Tab; and &NewLine; (but surely the spec should be the primary source?).

@kwwall How would you go about "pulling this from a resource" - parse the HTML5 spec document and extract the list of entities?

This javascript snippet might be a starting point (use it on the table view):

var output = '';
document.querySelectorAll('tr').forEach( function(a) { 
  var unicodeName = a.title;
  var code = a.querySelector('.dec code').innerHTML.replace(/^&amp;#([^;]+);$/, '$1')
  var charRefs = a.querySelector('.named code').innerHTML.split(' ');
  charRefs.forEach( function(a) { 
    var thisCharRef = a.replace(/^&amp;([^;]+);$/, '$1');
    output += code + ',' + thisCharRef + ',' + unicodeName + '\n';
  });
})
console.log(output);

It outputs a list on the console like so:

9,Tab,U+00009 CHARACTER TABULATION
10,NewLine,U+0000A LINE FEED (LF)
33,excl,U+00021 EXCLAMATION MARK
34,quot,U+00022 QUOTATION MARK
34,QUOT,U+00022 QUOTATION MARK
35,num,U+00023 NUMBER SIGN
...

Follow-up: I think there's a conceptional problem in the current implementation in HTMLEntityCodec as it binds each numeric ID to a single string, but we can already see an ID clash in the few items above ("quot"). In this example, it's only the case difference, but there's also stuff like:

8203,ZeroWidthSpace,U+0200B ZERO WIDTH SPACE
8203,NegativeVeryThinSpace,U+0200B ZERO WIDTH SPACE
8203,NegativeThinSpace,U+0200B ZERO WIDTH SPACE
8203,NegativeMediumSpace,U+0200B ZERO WIDTH SPACE
8203,NegativeThickSpace,U+0200B ZERO WIDTH SPACE

To make matters more complicated, the table view linked above does not seem to contain all named character references, e.g. varsubsetneq is missing even though it's just a synonym for subsetneq, which itself is contained in the table. Probably better to parse the spec directly...

@kwwall
Copy link
Contributor

kwwall commented Jun 30, 2019

@jtconsol When I meant parse it from a resource I meant that we should just deploy some text file that we prepare in advance (whether from this JavaScript or manually) based on something extracted from the latest HTML spec at WHATWG (as of May this year, W3C has turned over the management of the HTML spec to WHATWG). We just create some simple form of text file to parse and put it under 'src/main/resources' (and perhaps 'src/test/resources' as well, especially if we wish to deviate from the official one for testing purposes). That output can be something like you note above (although I would prefer that we support some code of comment notation, maybe everything from '#' to the end of the line). Then use getClass().getResourceAsStream() to retrieve it for parsing. That's what I was referring to. Certainly nothing as complex as dynamically retrieving it from the Internet somewhere.

As far as the "conceptual problem" that you mention in the current implementation, that is likely to present a problem. Ultimately, once we have something encoded as &#_nnnn_; we don't really care which named character reference it it might get decoded back to since they all are really equivalent (and assuming that HTMLEntityCodec even supports that; too lazy too look at the moment), but we we certainly need to recognize all the valid named character references when we are doing the encoding to the character encoding in the first place. And right now, I don't think that the current design supports this. Because of that, I am removing the 'good first issue' label that I originally added.

@xeno6696
Copy link
Collaborator

xeno6696 commented Jul 2, 2019

Sorry for the late response @kwwall but yeah this would be an enhancement to me as I agree 100% that the current implementation was designed with HTML4 in mind with not even a glacial glance at the future HTML5 spec.

@xeno6696
Copy link
Collaborator

Looks like they did the work for us:

https://html.spec.whatwg.org/entities.json

All we have to do now is slurp that file on startup and we should be able to handle every case. However I don't think I can do this without adding another external library.

Alternatively I could write a script to cut out all these entities and just wrap them in java code. Thoughts @kwwall ?

@kwwall
Copy link
Contributor

kwwall commented Aug 27, 2019 via email

@xeno6696
Copy link
Collaborator

I wasn't asserting we'd grab it live.

Actually in hoping for a quick win, do we care if the underlying structure of the codecs never accounted for high UTF-8 encodings? Specifically there are codepoints that can only be represented as integer arrays, and in researching the changes here I'll have to make some fundamental API changes in the codec classes to accomodate that, and there might be deeper effects that I haven't found yet.

@xeno6696
Copy link
Collaborator

actually @jtconsol I've been really thinking hard about this issue today. I'm not seeing a terrible threat here, and hopefully I can articulate this well:

First off let me agree in principle: If you offer a decoding capability, whatever is encoded should be decoded. I think we're alright here: We numerically decode everything we encode. And in fact, if I were to completely overhaul our encoding, my first two actions would be to remove the decoding capability EXCEPT via the canonicalize method, and second, to pare down things we map to the OWASP encoder project.

So, help me understand what you think the risk is here. After literally thinking about this all day, I come to the following points:

  1. The HTMLEntityCodec class is intended primarily to thwart XSS attacks by escaping key characters.
  2. Encoding is the primary defense for XSS... if you do nothing else, you encode. Best practice would be to pair it with input validation which leads me to
  3. The only real issue I can see knowing this code as well as I do (which isn't perfect!) is during input validation questions. The process for input validation is currently that we pass to the canonicalize method, and then we validate against that input, and the best practice would then be to discard that value and pass along the original data. I can only see this being an issue if you're using a different encoder--and I don't think its fair if we have to decode for any possible decoder--and you're doing something unique with the decoder on the back end. Additionally, I don't know that it's worth the effort to guard against possible false-positive mixed or double encoding exceptions, which is the only threat I see given the HTML5 inputs you're discussing. That's failing safe.

To me this means that I'm not understanding the risk posed. I believe it means we have more false positives. We don't encode for HTML5 named references, but we don't have to... The Java Encoder project encodes less than we do and has still never been compromised. Of course, that's because 99.99% of all characters under consideration aren't reserved in programming grammars. We really do encode too much.

Given that we:
Decode everything we encode.
Plan on removing the public nature of the decode method (This conversation has solidified my opinion here: decode should never have been made public.)

I think the only real action that needs to happen here is that we improve the documentation, Deprecate the decode method on all codecs (to move them to being private) and close this out.

Opions, @jtconsol and @kwwall ?

@kwwall kwwall modified the milestones: 2.3, 2.4 Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants