Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing mappings for named HTML entities #174

Merged
merged 18 commits into from
Jun 10, 2019
Merged

Add missing mappings for named HTML entities #174

merged 18 commits into from
Jun 10, 2019

Conversation

jacobrshields
Copy link

@jacobrshields jacobrshields commented May 24, 2019

This adds missing mappings for named HTML entities.

Currently, certain entities are not recognized, so for instance ⇑ gets "sanitized" to ⇑ (incorrect) instead of (correct).

I used a piece of JavaScript to generate the Java mappings based on the data at https://dev.w3.org/html5/html-author/charref.

JavaScript used to scrape the entities:

// Execute in browser at https://dev.w3.org/html5/html-author/charref
(function () {
  let lastCategory = null;

  for (tr of document.querySelectorAll('tr')) {
    const category = tr.getAttribute('data-block');
    if (category !== lastCategory) {
      console.log(`\n      // ${category}`);
    }
    lastCategory = category;

    const description = tr.querySelector('td.desc').innerText;
    const names = tr.querySelector('td.named code').innerText.split(' ');

    const rawHexStr = tr.querySelector('td.hex code').innerHTML; // '𝕫'
    const unicodeStr = rawHexStr.substring(rawHexStr.indexOf('x') + 1, rawHexStr.length - 1); // '1D56B'
    const unicodeValue = parseInt(unicodeStr, 16); // 0x1D56B

    let upperUnicodeValue;
    let lowerUnicodeValue;
    if (unicodeValue < 0x10000) {
      upperUnicodeValue = 0;
      lowerUnicodeValue = unicodeValue;
    } else {
      const difference = (unicodeValue - 0x10000);
      upperUnicodeValue = 0xD800 | (difference >> 10); // 0xd835
      lowerUnicodeValue = 0xDC00 | (difference & 0x3FF); // 0xdd6b
    }

    const upperUnicodeStr = upperUnicodeValue.toString(16).padStart(4, '0'); // 'd835'
    const lowerUnicodeStr = lowerUnicodeValue.toString(16).padStart(4, '0'); // 'dd6b'

    const javaValue = (upperUnicodeStr === '0000'
      ? `Integer.valueOf('\\u${lowerUnicodeStr}')`
      : `Character.toCodePoint('\\u${upperUnicodeStr}', '\\u${lowerUnicodeStr}')`)
      .replace('\\u000a', '\\n')
      .replace('\\u0027', '\\\'')
      .replace('\\u005c', '\\\\');

    for (const name of names) {
      console.log(`      builder.put("${name.substring(1, name.length - 1)}", ${javaValue}); // ${description}`);
    }
  }
})();

Copy link
Author

@jacobrshields jacobrshields left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've verified that the new map is a superset of the old map.

Map<String, Integer> oldMap = ImmutableMap.<String, Integer>builder()
  // ... original map contents ...
  .build();

Set<String> oldEntityNames = new HashSet<String>(oldMap.keySet());
oldEntityNames.removeAll(entityNameToCodePointMap.keySet());
System.out.println("Keys in old map but not in new map: " + oldEntityNames.size());

Keys in old map but not in new map: 0

Copy link
Author

@jacobrshields jacobrshields left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In decodeEntityAt there is a block of code that attempts to find a lowercase version of a named entity in the trie if it didn't find it in the trie with the given case:
https://github.com/OWASP/java-html-sanitizer/blob/release-20190503.1/src/main/java/org/owasp/html/HtmlEntities.java#L175-L183

And there is a test case for this ("&AmP;" -> "&"):
https://github.com/OWASP/java-html-sanitizer/blob/release-20190503.1/src/test/java/org/owasp/html/EncodingTest.java#L182-L184

Is this really the intended behavior? I believe named entities are case-sensitive. See:
https://html.spec.whatwg.org/multipage/syntax.html#character-references

The ampersand must be followed by one of the names given in the named character references section, using the same case.

On my browser (Chrome 74.0.3729.169) the HTML text &AmP; displays as &AmP; and not as &.

The reason I bring this up is because the reference list parsed for this change explicitly includes both the lowercase and the uppercase versions of named entities that have both (e.g. "&amp;" and "&AMP;"). Given that, do we still need to keep the code in decodeEntityAt that performs the fallback lowercase check? Was it there because of some non-standard browser behavior?

@jacobrshields jacobrshields changed the title Add missing named HTML entity mappings Add missing mappings for named HTML entities May 25, 2019
src/main/java/org/owasp/html/Encoding.java Outdated Show resolved Hide resolved
src/main/java/org/owasp/html/HtmlEntities.java Outdated Show resolved Hide resolved
src/main/java/org/owasp/html/HtmlEntities.java Outdated Show resolved Hide resolved
src/main/java/org/owasp/html/HtmlEntities.java Outdated Show resolved Hide resolved
@jacobrshields
Copy link
Author

Updated script to generate mappings, using the data downloaded from https://html.spec.whatwg.org/entities.json:

#!/usr/bin/env node
'use strict';

const fs = require('fs');

const ENTITIES_JSON_PATH = '/Users/jshields/entities.json';

const entities = JSON.parse(fs.readFileSync(ENTITIES_JSON_PATH));

for (const entityKey of Object.keys(entities)) {
  if (entityKey[entityKey.length - 1] !== ';') {
    continue;
  }

  const entityName = entityKey.substring(1, entityKey.length - 1);
  const characters = entities[entityKey].characters;

  if (characters.length === 1) {
    console.log(`builder.put("${entityName}", ((int) ${getJavaCharacterLiteral(characters[0])}) << 16);`);
  } else if (characters.length === 2) {
    console.log(`builder.put("${entityName}", (((int) ${getJavaCharacterLiteral(characters[0])}) << 16) | ${getJavaCharacterLiteral(characters[1])});`);
  } else {
    throw new Error(`Unexpected number of characters for ${entityKey}`);
  }
}

function getJavaCharacterLiteral(character) {
  return `'\\u${character.charCodeAt(0).toString(16).padStart(4, '0')}'`
    .replace('\\u000a', '\\n')
    .replace('\\u0027', '\\\'')
    .replace('\\u005c', '\\\\');
}

@mikesamuel
Copy link
Contributor

Thanks for this.
What if we just stored strings like

#!/usr/bin/env node
'use strict';

const url = require('url');
const fetch = require('node-fetch');

const ENTITIES_JSON_URL = 'https://html.spec.whatwg.org/entities.json';

(async () => {
  const response = await fetch(ENTITIES_JSON_URL, { method: 'GET' });
  if (response.status !== 200) { throw new Error(); }
  const entities = JSON.parse(await response.text());

  for (const entityKey of Object.keys(entities)) {
    if (entityKey[entityKey.length - 1] !== ';') {
// Are you sure this is right?
// I think an entityKey of "&lt" means that "&lt" need not end with a semicolon.
// IIRC, this is what allows recognizing that in <a href="?a=b&ampx=0&copy=1">
// the "&amp" means "&" but "&copy" is not a copyright symbol.
      continue;
    }

    const entityName = entityKey.substring(1, entityKey.length - 1);
    const { characters } = entities[entityKey];

    console.log(`    builder.put("${entityName}", ${ javaString(characters) });`);
  }
})().then(() => {}, (e) => console.error(e));

const specialCharacters = new Map();
specialCharacters.set('\\', String.raw`\\`);
specialCharacters.set('"', String.raw`\"`);

function javaString(s) {
  const javaChars = '"' +
      s.replace(
        /[\s\S]/g,
        (c) => specialCharacters.has(c) ? specialCharacters.get(c) : '\\u' + hex4(c.charCodeAt(0)))
      + '"';
  if (s !== JSON.parse(javaChars)) { throw new Error(s); }
  return javaChars;
}

function hex4(n) {
  const hex = n.toString(16);
  return '0000'.substring(hex.length) + hex;
}

and then I think Encoding.decodeHtml s the only code that calls decodeEntityAt, so decodeEntityAt could change to appendDecodedEntity, take a StringBuilder as input, and return the int end.

long endAndCodepoint = HtmlEntities.decodeEntityAt(s, amp, n);
int end = (int) (endAndCodepoint >>> 32);
int codepoint = (int) endAndCodepoint;
sb.append(s, pos, amp).appendCodePoint(codepoint);

@jacobrshields
Copy link
Author

Good idea. I like the idea of appendDecodedEntity.

Regarding references that don't end with ;: I think this is a separate pre-existing issue that we may want to handle under the scope of a different PR. Yes it appears that &amp, &lt, etc. are legal character references and browsers do display them as expected. But the current code first parses the string between & and ; and then checks if it's a numeric entity or a named entity.

Actually the spec is a little confusing in that in one place it says that must end with ;:

The name must be one that is terminated by a U+003B SEMICOLON character (;).

But the character reference sheet clearly shows that not all of them do.

Are you OK de-scoping that particular concern to a different PR?

@mikesamuel
Copy link
Contributor

Are you OK de-scoping that particular concern to a different PR?

Yes.

@jacobrshields
Copy link
Author

OK, sounds good.

I updated to use your suggestion of appendDecodedEntity. Also I had to add 1 more special character mapping to your JavaScript for the newline character.

Also curious what your thoughts are on this comment: #174 (review)

@jacobrshields
Copy link
Author

@mikesamuel When would you be able to take another look at this PR for approval? Would love to get this merged soon if possible for a project I am working on.

@scottastrophic
Copy link

I believe named entities are case-sensitive.

They are indeed. You can see why with accented characters: &Eacute; is É while &eacute; is é.

@mikesamuel
Copy link
Contributor

Sorry for the delay. Was slammed with deadlines. Taking another look.

Copy link
Contributor

@mikesamuel mikesamuel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks much for this. A few minor nits.

src/main/java/org/owasp/html/Trie.java Outdated Show resolved Hide resolved
src/main/java/org/owasp/html/Trie.java Outdated Show resolved Hide resolved
@mikesamuel mikesamuel merged commit e37292d into OWASP:master Jun 10, 2019
@mikesamuel
Copy link
Contributor

Thanks much for doing this.

I'll push a version with this.

@jacobrshields
Copy link
Author

Thank you Mike, I appreciate it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants