Add missing mappings for named HTML entities #174

jacobrshields · 2019-05-24T22:32:00Z

This adds missing mappings for named HTML entities.

Currently, certain entities are not recognized. For instance, &DoubleUpArrow; gets "sanitized" to the literal text &DoubleUpArrow; (incorrect) instead of the character ⇑ (correct).

I used a piece of JavaScript to generate the Java mappings based on the data at https://dev.w3.org/html5/html-author/charref.

JavaScript used to scrape the entities:

// Execute in browser at https://dev.w3.org/html5/html-author/charref
(function () {
  let lastCategory = null;

  for (const tr of document.querySelectorAll('tr')) {
    const category = tr.getAttribute('data-block');
    if (category !== lastCategory) {
      console.log(`\n      // ${category}`);
    }
    lastCategory = category;

    const description = tr.querySelector('td.desc').innerText;
    const names = tr.querySelector('td.named code').innerText.split(' ');

    const rawHexStr = tr.querySelector('td.hex code').innerHTML; // '&#x1D56B;'
    const unicodeStr = rawHexStr.substring(rawHexStr.indexOf('x') + 1, rawHexStr.length - 1); // '1D56B'
    const unicodeValue = parseInt(unicodeStr, 16); // 0x1D56B

    let upperUnicodeValue;
    let lowerUnicodeValue;
    if (unicodeValue < 0x10000) {
      upperUnicodeValue = 0;
      lowerUnicodeValue = unicodeValue;
    } else {
      const difference = (unicodeValue - 0x10000);
      upperUnicodeValue = 0xD800 | (difference >> 10); // 0xd835
      lowerUnicodeValue = 0xDC00 | (difference & 0x3FF); // 0xdd6b
    }

    const upperUnicodeStr = upperUnicodeValue.toString(16).padStart(4, '0'); // 'd835'
    const lowerUnicodeStr = lowerUnicodeValue.toString(16).padStart(4, '0'); // 'dd6b'

    const javaValue = (upperUnicodeStr === '0000'
      ? `Integer.valueOf('\\u${lowerUnicodeStr}')`
      : `Character.toCodePoint('\\u${upperUnicodeStr}', '\\u${lowerUnicodeStr}')`)
      .replace('\\u000a', '\\n')
      .replace('\\u0027', '\\\'')
      .replace('\\u005c', '\\\\');

    for (const name of names) {
      console.log(`      builder.put("${name.substring(1, name.length - 1)}", ${javaValue}); // ${description}`);
    }
  }
})();
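
The surrogate-pair arithmetic in the script above can be cross-checked against the JavaScript engine's own UTF-16 encoding. The following is a minimal sketch; toSurrogatePair is a hypothetical helper name introduced here for illustration and is not part of the PR.

```javascript
'use strict';

// Hypothetical helper mirroring the arithmetic in the scraping script:
// splits a code point into its UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint]; // Basic Multilingual Plane: a single code unit.
  }
  const difference = codePoint - 0x10000;
  const high = 0xD800 | (difference >> 10);   // top 10 bits
  const low = 0xDC00 | (difference & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Cross-check U+1D56B (the example used in the script's comments)
// against what String.fromCodePoint produces.
const encoded = String.fromCodePoint(0x1D56B);
const expected = [encoded.charCodeAt(0), encoded.charCodeAt(1)];
const actual = toSurrogatePair(0x1D56B);

console.log(actual.map((n) => n.toString(16)));   // [ 'd835', 'dd6b' ]
console.log(expected.map((n) => n.toString(16))); // [ 'd835', 'dd6b' ]
```

Both lines print the same pair, which matches the 'd835' and 'dd6b' values noted in the script's comments.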

src/main/java/org/owasp/html/HtmlEntities.java

jacobrshields

I've verified that the new map is a superset of the old map.

Map<String, Integer> oldMap = ImmutableMap.<String, Integer>builder()
  // ... original map contents ...
  .build();

Set<String> oldEntityNames = new HashSet<String>(oldMap.keySet());
oldEntityNames.removeAll(entityNameToCodePointMap.keySet());
System.out.println("Keys in old map but not in new map: " + oldEntityNames.size());

Keys in old map but not in new map: 0

jacobrshields

In decodeEntityAt there is a block of code that attempts to find a lowercase version of a named entity in the trie if it didn't find it in the trie with the given case:
https://github.com/OWASP/java-html-sanitizer/blob/release-20190503.1/src/main/java/org/owasp/html/HtmlEntities.java#L175-L183

And there is a test case for this ("&AmP;" -> "&"):
https://github.com/OWASP/java-html-sanitizer/blob/release-20190503.1/src/test/java/org/owasp/html/EncodingTest.java#L182-L184

Is this really the intended behavior? I believe named entities are case-sensitive. See:
https://html.spec.whatwg.org/multipage/syntax.html#character-references

The ampersand must be followed by one of the names given in the named character references section, using the same case.

On my browser (Chrome 74.0.3729.169) the HTML text &AmP; displays as &AmP; and not as &.

I bring this up because the reference list parsed for this change explicitly includes both the lowercase and the uppercase versions of named entities that have both (e.g. "&amp;" and "&AMP;"). Given that, do we still need the code in decodeEntityAt that performs the fallback lowercase check? Was it there because of some non-standard browser behavior?
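
To make the question concrete, here is a small sketch contrasting an exact-case lookup with a lowercase fallback. It is a simplified model using a plain Map and a hand-picked subset of names; it is not the trie-based code in HtmlEntities.java, and both function names are hypothetical.

```javascript
'use strict';

// A tiny, hand-picked subset of named character references
// (names shown without the leading & and trailing ;).
const ENTITIES = new Map([
  ['amp', '&'],
  ['AMP', '&'],
  ['Eacute', '\u00C9'], // É
  ['eacute', '\u00E9'], // é
]);

// Exact-case lookup, as the spec text quoted above requires.
function lookupStrict(name) {
  return ENTITIES.has(name) ? ENTITIES.get(name) : null;
}

// Hypothetical fallback: retry with the lowercased name on a miss.
function lookupWithLowercaseFallback(name) {
  const exact = lookupStrict(name);
  return exact !== null ? exact : lookupStrict(name.toLowerCase());
}

console.log(lookupStrict('AmP'));                   // null
console.log(lookupWithLowercaseFallback('AmP'));    // &
console.log(lookupWithLowercaseFallback('EACUTE')); // é
```

With the fallback, a name such as EACUTE that appears in neither case-exact form still decodes, and it decodes to the lowercase letter, which is the kind of surprise an exact-case lookup avoids.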

src/main/java/org/owasp/html/Encoding.java

src/main/java/org/owasp/html/HtmlEntities.java

src/test/java/org/owasp/html/EncodingTest.java

jacobrshields · 2019-05-27T23:10:11Z

Updated script to generate mappings, using the data downloaded from https://html.spec.whatwg.org/entities.json:

#!/usr/bin/env node
'use strict';

const fs = require('fs');

const ENTITIES_JSON_PATH = '/Users/jshields/entities.json';

const entities = JSON.parse(fs.readFileSync(ENTITIES_JSON_PATH));

for (const entityKey of Object.keys(entities)) {
  if (entityKey[entityKey.length - 1] !== ';') {
    continue;
  }

  const entityName = entityKey.substring(1, entityKey.length - 1);
  const characters = entities[entityKey].characters;

  if (characters.length === 1) {
    console.log(`builder.put("${entityName}", ((int) ${getJavaCharacterLiteral(characters[0])}) << 16);`);
  } else if (characters.length === 2) {
    console.log(`builder.put("${entityName}", (((int) ${getJavaCharacterLiteral(characters[0])}) << 16) | ${getJavaCharacterLiteral(characters[1])});`);
  } else {
    throw new Error(`Unexpected number of characters for ${entityKey}`);
  }
}

function getJavaCharacterLiteral(character) {
  return `'\\u${character.charCodeAt(0).toString(16).padStart(4, '0')}'`
    .replace('\\u000a', '\\n')
    .replace('\\u0027', '\\\'')
    .replace('\\u005c', '\\\\');
}
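
The generated Java lines pack one or two UTF-16 code units into a single int: the first unit goes in the high 16 bits and the optional second unit in the low 16 bits. The sketch below reproduces that packing in JavaScript and adds an inverse to show the value round-trips. packCodeUnits and unpackCodeUnits are hypothetical names, and the inverse assumes the second unit is never U+0000.

```javascript
'use strict';

// Pack one or two UTF-16 code units into a 32-bit value:
// first unit in the high 16 bits, optional second unit in the low 16 bits.
function packCodeUnits(characters) {
  if (characters.length < 1 || characters.length > 2) {
    throw new Error(`Unexpected number of code units: ${characters.length}`);
  }
  const high = characters.charCodeAt(0);
  const low = characters.length === 2 ? characters.charCodeAt(1) : 0;
  // >>> 0 reinterprets the signed 32-bit result as unsigned for display;
  // a Java int would hold the same bits as a (possibly negative) signed value.
  return ((high << 16) | low) >>> 0;
}

// Hypothetical inverse. Assumes a low half of 0 means "no second code unit".
function unpackCodeUnits(packed) {
  const high = packed >>> 16;
  const low = packed & 0xFFFF;
  return low === 0
    ? String.fromCharCode(high)
    : String.fromCharCode(high, low);
}

console.log(packCodeUnits('\u21D1').toString(16));       // 21d10000
console.log(packCodeUnits('\uD835\uDD6B').toString(16)); // d835dd6b
console.log(unpackCodeUnits(packCodeUnits('\u2242\u0338')) === '\u2242\u0338'); // true
```

The two-unit case covers both surrogate pairs and references that expand to a base character followed by a combining mark.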

mikesamuel · 2019-05-28T04:29:39Z

Thanks for this.
What if we just stored strings like

#!/usr/bin/env node
'use strict';

const fetch = require('node-fetch');

const ENTITIES_JSON_URL = 'https://html.spec.whatwg.org/entities.json';

(async () => {
  const response = await fetch(ENTITIES_JSON_URL, { method: 'GET' });
  if (response.status !== 200) { throw new Error(); }
  const entities = JSON.parse(await response.text());

  for (const entityKey of Object.keys(entities)) {
    if (entityKey[entityKey.length - 1] !== ';') {
// Are you sure this is right?
// I think an entityKey of "&lt" means that "&lt" need not end with a semicolon.
// IIRC, this is what allows recognizing that in <a href="?a=b&ampx=0&copy=1">
// the "&amp" means "&" but "&copy" is not a copyright symbol.
      continue;
    }

    const entityName = entityKey.substring(1, entityKey.length - 1);
    const { characters } = entities[entityKey];

    console.log(`    builder.put("${entityName}", ${ javaString(characters) });`);
  }
})().then(() => {}, (e) => console.error(e));

const specialCharacters = new Map();
specialCharacters.set('\\', String.raw`\\`);
specialCharacters.set('"', String.raw`\"`);

function javaString(s) {
  const javaChars = '"' +
      s.replace(
        /[\s\S]/g,
        (c) => specialCharacters.has(c) ? specialCharacters.get(c) : '\\u' + hex4(c.charCodeAt(0)))
      + '"';
  if (s !== JSON.parse(javaChars)) { throw new Error(s); }
  return javaChars;
}

function hex4(n) {
  const hex = n.toString(16);
  return '0000'.substring(hex.length) + hex;
}

and then I think Encoding.decodeHtml is the only code that calls decodeEntityAt, so decodeEntityAt could change to appendDecodedEntity, take a StringBuilder as input, and return the int end.

java-html-sanitizer/src/main/java/org/owasp/html/Encoding.java

Lines 57 to 60 in 7cdb5eb

    
      long endAndCodepoint = HtmlEntities.decodeEntityAt(s, amp, n);
      int end = (int) (endAndCodepoint >>> 32);
      int codepoint = (int) endAndCodepoint;
      sb.append(s, pos, amp).appendCodePoint(codepoint);
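
The shape of the proposed change can be sketched in JavaScript: instead of returning an end index and a code point packed into one long, the decoder appends directly to the output and returns only the end index. This is a simplified analogue under stated assumptions, not the Java implementation. The NAMED map is a stand-in for the real table, an array stands in for StringBuilder, and numeric references and semicolon-less names are not handled.

```javascript
'use strict';

// Stand-in for the real entity table (assumption for illustration).
const NAMED = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['DoubleUpArrow', '\u21D1'],
]);

// Decodes the reference starting at html[offset] (which must be '&'),
// appends the decoded text to `out`, and returns the index just past
// the consumed input. An unrecognized reference yields a literal '&'.
function appendDecodedEntity(html, offset, limit, out) {
  const semi = html.indexOf(';', offset + 1);
  if (semi >= 0 && semi < limit) {
    const decoded = NAMED.get(html.substring(offset + 1, semi));
    if (decoded !== undefined) {
      out.push(decoded);
      return semi + 1;
    }
  }
  out.push('&');
  return offset + 1;
}

// Caller in the style of Encoding.decodeHtml: copy the text before each
// ampersand, then let appendDecodedEntity handle the reference itself.
function decodeHtml(html) {
  const out = [];
  let pos = 0;
  while (pos < html.length) {
    const amp = html.indexOf('&', pos);
    if (amp < 0) {
      out.push(html.substring(pos));
      break;
    }
    out.push(html.substring(pos, amp));
    pos = appendDecodedEntity(html, amp, html.length, out);
  }
  return out.join('');
}

console.log(decodeHtml('a &lt; b &DoubleUpArrow; &bogus; c')); // a < b ⇑ &bogus; c
```

Because the decoder appends strings rather than returning a single code point, references that expand to more than one character no longer need special packing.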

jacobrshields · 2019-05-28T04:45:21Z

Good idea. I like the idea of appendDecodedEntity.

Regarding references that don't end with ;: I think this is a separate, pre-existing issue that we may want to handle in a different PR. Yes, it appears that &amp, &lt, etc. are legal character references, and browsers do display them as expected. But the current code first extracts the string between & and ; and only then checks whether it is a numeric entity or a named entity.

Actually, the spec is a little confusing in that in one place it says that the name must end with ;:

The name must be one that is terminated by a U+003B SEMICOLON character (;).

But the character reference sheet clearly shows that not all of them do.

Are you OK de-scoping that particular concern to a different PR?
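
For context on the size of the de-scoped concern, the keys of entities.json can be split by whether they end in a semicolon. The sketch below uses a small inline sample in the shape of that file (an assumption for illustration; the real file has many more entries), and splitBySemicolon is a hypothetical helper name.

```javascript
'use strict';

// Inline sample in the shape of entities.json: each key is the full
// reference text, and legacy names appear both with and without ';'.
const sampleEntities = {
  '&amp': { codepoints: [38], characters: '&' },
  '&amp;': { codepoints: [38], characters: '&' },
  '&lt': { codepoints: [60], characters: '<' },
  '&lt;': { codepoints: [60], characters: '<' },
  '&DoubleUpArrow;': { codepoints: [8657], characters: '\u21D1' },
};

// Hypothetical helper: partition keys by whether they end in ';'.
function splitBySemicolon(entities) {
  const withSemicolon = [];
  const withoutSemicolon = [];
  for (const key of Object.keys(entities)) {
    (key.endsWith(';') ? withSemicolon : withoutSemicolon).push(key);
  }
  return { withSemicolon, withoutSemicolon };
}

const { withSemicolon, withoutSemicolon } = splitBySemicolon(sampleEntities);
console.log(withSemicolon);    // [ '&amp;', '&lt;', '&DoubleUpArrow;' ]
console.log(withoutSemicolon); // [ '&amp', '&lt' ]
```

The generation scripts in this thread skip the second group, which is why semicolon-less references remain a separate follow-up.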

mikesamuel · 2019-05-28T04:55:26Z

Are you OK de-scoping that particular concern to a different PR?

Yes.

jacobrshields · 2019-05-28T05:49:26Z

OK, sounds good.

I updated the PR to use your suggestion of appendDecodedEntity. I also had to add one more special-character mapping to your JavaScript, for the newline character.

Also curious what your thoughts are on this comment: #174 (review)

jacobrshields · 2019-06-02T22:05:28Z

@mikesamuel When would you be able to take another look at this PR for approval? Would love to get this merged soon if possible for a project I am working on.

scottastrophic · 2019-06-06T22:42:43Z

I believe named entities are case-sensitive.

They are indeed. You can see why with accented characters: &Eacute; is É while &eacute; is é.

mikesamuel · 2019-06-10T15:21:12Z

Sorry for the delay. Was slammed with deadlines. Taking another look.

mikesamuel

Thanks much for this. A few minor nits.

src/main/java/org/owasp/html/HtmlEntities.java

src/main/java/org/owasp/html/Trie.java

src/test/java/org/owasp/html/EncodingTest.java

mikesamuel · 2019-06-10T20:17:43Z

Thanks much for doing this.

I'll push a version with this.

jacobrshields · 2019-06-10T20:23:17Z

Thank you Mike, I appreciate it!

mikesamuel · 2019-06-10T21:00:01Z

https://github.com/OWASP/java-html-sanitizer/releases/tag/release-20190610.1 should be available at https://search.maven.org/artifact/com.googlecode.owasp-java-html-sanitizer/aggregate/20190610.1/pom shortly.

jshields-squarespace added 4 commits May 24, 2019 18:28

Add missing named HTML entity mappings

e38e49e

Get rid of chained method calls due to compiler resource utilization issues

0a1946d

Dynamically calculate longest entity name

41233d9

Refactor map building to happen in static initializer

2b6aef5

jacobrshields commented May 25, 2019

src/main/java/org/owasp/html/HtmlEntities.java (outdated, resolved)

jacobrshields commented May 25, 2019

src/main/java/org/owasp/html/HtmlEntities.java (resolved)

Add comment linking to source data

465107d

jacobrshields commented May 25, 2019

Make formatting consistent

0e8c13e

jacobrshields changed the title ~~Add missing named HTML entity mappings~~ Add missing mappings for named HTML entities May 25, 2019

jshields-squarespace added 2 commits May 25, 2019 13:59

Update Javadoc

50e287d

Update Javadoc

e8a0a8b

jacobrshields commented May 25, 2019

src/main/java/org/owasp/html/Encoding.java (outdated, resolved)

mikesamuel reviewed May 27, 2019

jshields-squarespace added 4 commits May 27, 2019 14:24

Revert Javadoc changes

01d548d

Update decodeEntityAt to return code-units instead of code-points

8a08271

Update character literal

9f07b85

Update list of named character references using official spec

e2f4fab

Simplify long to char conversion

d7e7c5c

jshields-squarespace added 2 commits May 28, 2019 01:44

Refactor decodeEntityAt to appendDecodedEntity

4498b49

Remove unused imports

471d89e

mikesamuel requested changes Jun 10, 2019

jshields-squarespace added 3 commits June 10, 2019 14:47

Add Nullable annotations

3c9b108

Add unit test to check boundary condition

4cabbd4

Update comment

3a2637e

mikesamuel approved these changes Jun 10, 2019

mikesamuel merged commit e37292d into OWASP:master Jun 10, 2019
