-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add missing mappings for named HTML entities #174
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've verified that the new map is a superset of the old map.
Map<String, Integer> oldMap = ImmutableMap.<String, Integer>builder()
// ... original map contents ...
.build();
Set<String> oldEntityNames = new HashSet<String>(oldMap.keySet());
oldEntityNames.removeAll(entityNameToCodePointMap.keySet());
System.out.println("Keys in old map but not in new map: " + oldEntityNames.size());
Keys in old map but not in new map: 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In decodeEntityAt
there is a block of code that attempts to find a lowercase version of a named entity in the trie if it didn't find it in the trie with the given case:
https://github.com/OWASP/java-html-sanitizer/blob/release-20190503.1/src/main/java/org/owasp/html/HtmlEntities.java#L175-L183
And there is a test case for this ("&AmP;"
-> "&"
):
https://github.com/OWASP/java-html-sanitizer/blob/release-20190503.1/src/test/java/org/owasp/html/EncodingTest.java#L182-L184
Is this really the intended behavior? I believe named entities are case-sensitive. See:
https://html.spec.whatwg.org/multipage/syntax.html#character-references
The ampersand must be followed by one of the names given in the named character references section, using the same case.
On my browser (Chrome 74.0.3729.169) the HTML text &AmP;
displays as &AmP;
and not as &
.
The reason I bring this up is because the reference list parsed for this change explicitly includes both the lowercase and the uppercase versions of named entities that have both (e.g. "&"
and "&"
). Given that, do we still need to keep the code in decodeEntityAt
that performs the fallback lowercase check? Was it there because of some non-standard browser behavior?
Updated script to generate mappings, using the data downloaded from https://html.spec.whatwg.org/entities.json: #!/usr/bin/env node
'use strict';
const fs = require('fs');
const ENTITIES_JSON_PATH = '/Users/jshields/entities.json';
const entities = JSON.parse(fs.readFileSync(ENTITIES_JSON_PATH));
for (const entityKey of Object.keys(entities)) {
if (entityKey[entityKey.length - 1] !== ';') {
continue;
}
const entityName = entityKey.substring(1, entityKey.length - 1);
const characters = entities[entityKey].characters;
if (characters.length === 1) {
console.log(`builder.put("${entityName}", ((int) ${getJavaCharacterLiteral(characters[0])}) << 16);`);
} else if (characters.length === 2) {
console.log(`builder.put("${entityName}", (((int) ${getJavaCharacterLiteral(characters[0])}) << 16) | ${getJavaCharacterLiteral(characters[1])});`);
} else {
throw new Error(`Unexpected number of characters for ${entityKey}`);
}
}
function getJavaCharacterLiteral(character) {
return `'\\u${character.charCodeAt(0).toString(16).padStart(4, '0')}'`
.replace('\\u000a', '\\n')
.replace('\\u0027', '\\\'')
.replace('\\u005c', '\\\\');
} |
Thanks for this. #!/usr/bin/env node
'use strict';
const url = require('url');
const fetch = require('node-fetch');
const ENTITIES_JSON_URL = 'https://html.spec.whatwg.org/entities.json';
(async () => {
const response = await fetch(ENTITIES_JSON_URL, { method: 'GET' });
if (response.status !== 200) { throw new Error(); }
const entities = JSON.parse(await response.text());
for (const entityKey of Object.keys(entities)) {
if (entityKey[entityKey.length - 1] !== ';') {
// Are you sure this is right?
// I think an entityKey of "<" means that "<" need not end with a semicolon.
// IIRC, this is what allows recognizing that in <a href="?a=b&x=0©=1">
// the "&" means "&" but "©" is not a copyright symbol.
continue;
}
const entityName = entityKey.substring(1, entityKey.length - 1);
const { characters } = entities[entityKey];
console.log(` builder.put("${entityName}", ${ javaString(characters) });`);
}
})().then(() => {}, (e) => console.error(e));
const specialCharacters = new Map();
specialCharacters.set('\\', String.raw`\\`);
specialCharacters.set('"', String.raw`\"`);
function javaString(s) {
const javaChars = '"' +
s.replace(
/[\s\S]/g,
(c) => specialCharacters.has(c) ? specialCharacters.get(c) : '\\u' + hex4(c.charCodeAt(0)))
+ '"';
if (s !== JSON.parse(javaChars)) { throw new Error(s); }
return javaChars;
}
function hex4(n) {
const hex = n.toString(16);
return '0000'.substring(hex.length) + hex;
} and then I think Encoding.decodeHtml s the only code that calls java-html-sanitizer/src/main/java/org/owasp/html/Encoding.java Lines 57 to 60 in 7cdb5eb
|
Good idea. I like the idea of Regarding references that don't end with Actually the spec is a little confusing in that in one place it says that must end with
But the character reference sheet clearly shows that not all of them do. Are you OK de-scoping that particular concern to a different PR? |
Yes. |
OK, sounds good. I updated to use your suggestion of Also curious what your thoughts are on this comment: #174 (review) |
@mikesamuel When would you be able to take another look at this PR for approval? Would love to get this merged soon if possible for a project I am working on. |
They are indeed. You can see why with accented characters: |
Sorry for the delay. Was slammed with deadlines. Taking another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks much for this. A few minor nits.
Thanks much for doing this. I'll push a version with this. |
Thank you Mike, I appreciate it! |
This adds missing mappings for named HTML entities.
Currently, certain entities are not recognized, so for instance
⇑
gets "sanitized" to&DoubleUpArrow;
(incorrect) instead of⇑
(correct).I used a piece of JavaScript to generate the Java mappings based on the data at https://dev.w3.org/html5/html-author/charref.
JavaScript used to scrape the entities: