Sanitize HTML using a DOM parser

janitor-js

Sanitize HTML tags, attributes, and protocols using a native DOM parser.

Parsing, cleaning, and sanitizing HTML is a pain in the ass. There are lots of libraries that do it, but they require well-formed or pseudo-well-formed HTML. If you're cleaning or parsing user-generated HTML, chances are it's going to be messed up now and then, which causes many HTML parsers to barf or drop content entirely.

Janitor-js uses the best DOM parser available for untrusted HTML -- a browser. Janitor can be run client-side or server-side in a headless browser.

Examples

Only allow A and B tags, with "href" being the only allowed attribute:

var config = {
  tags: {
    b: [],
    a: [ 'href' ]
  }
}

Janitor.clean('<div><b>bold text</b><a onclick="foo()" href="#">hi</a></div>', config)
  -> '<b>bold text</b><a href="#">hi</a>'
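To make the whitelist behavior above concrete, here is a rough sketch of how a config like this could be applied to a browser-parsed DOM. It is not Janitor's actual source: `allowedAttributes`, `attributesToStrip`, and `sketchClean` are hypothetical names made up for this illustration, and `sketchClean` assumes a browser environment where `DOMParser` exists.

```javascript
// Hypothetical helper (not Janitor's API): returns the attribute whitelist
// for a tag, or null when the tag itself is not allowed.
function allowedAttributes(tagName, config) {
  var tag = tagName.toLowerCase();
  return Object.prototype.hasOwnProperty.call(config.tags, tag)
    ? config.tags[tag]
    : null;
}

// Hypothetical helper: names of the attributes that should be removed.
function attributesToStrip(tagName, attrNames, config) {
  var allowed = allowedAttributes(tagName, config) || [];
  return attrNames.filter(function (name) {
    return allowed.indexOf(name.toLowerCase()) === -1;
  });
}

// Hypothetical browser-only walk. Disallowed elements are unwrapped
// (their children are kept), and disallowed attributes are removed.
function sketchClean(html, config) {
  var doc = new DOMParser().parseFromString(html, 'text/html');
  // Snapshot the element list first, because the tree is mutated below.
  var elements = Array.prototype.slice.call(doc.body.querySelectorAll('*'));
  elements.forEach(function (el) {
    if (allowedAttributes(el.tagName, config) === null) {
      // Move the children up one level, then drop the element itself.
      while (el.firstChild) {
        el.parentNode.insertBefore(el.firstChild, el);
      }
      el.parentNode.removeChild(el);
      return;
    }
    var names = Array.prototype.map.call(el.attributes, function (attr) {
      return attr.name;
    });
    attributesToStrip(el.tagName, names, config).forEach(function (name) {
      el.removeAttribute(name);
    });
  });
  return doc.body.innerHTML;
}
```

The browser does the hard part here: by the time the walk starts, the markup has already been parsed into a tree, so the sketch never has to reason about broken tag soup. Unwrapping also keeps the text of elements like script as plain text, which a real sanitizer would probably want to drop instead.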

Fix unclosed tags:

Janitor.clean('<b>uh oh!', config)
  -> '<b>uh oh!</b>'

Filter out XSS in href attributes:

var config = {
  tags: {
    b: [],
    a: [ 'href' ]
  },
  protocols: [ 'http', 'https' ]
}

Janitor.clean('<a href="javascript:alert(document.cookie)">click me please</a>', config)
  -> '<a href="">click me please</a>'
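The protocol check can be pictured roughly as follows. This is a minimal sketch under assumptions, not Janitor's actual implementation: `hasAllowedProtocol` is a hypothetical helper, and treating scheme-less values (relative URLs, fragments) as allowed is a choice made for this example.

```javascript
// Hypothetical helper (not Janitor's API): true when href has no scheme
// (relative URL or fragment) or when its scheme is in the whitelist.
function hasAllowedProtocol(href, protocols) {
  // Conservatively drop all control characters and spaces, since browsers
  // ignore tabs and newlines inside a URL ("java\tscript:" still runs).
  var compact = href.replace(/[\u0000-\u0020]/g, '');
  var match = /^([a-z][a-z0-9+.\-]*):/i.exec(compact);
  if (!match) {
    return true; // no scheme, e.g. "#top" or "/about"
  }
  return protocols.indexOf(match[1].toLowerCase()) !== -1;
}

// Possible use on an already-parsed element, blanking a rejected href:
//   if (!hasAllowedProtocol(a.getAttribute('href'), config.protocols)) {
//     a.setAttribute('href', '');
//   }
```

Checking the attribute value after the DOM parser has read it means character references such as `&#106;avascript:` are already decoded by the time the check runs.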

Fix really messed up HTML:

var config = {
  tags: {
    b: [],
    div: [],
    ul: [],
    li: [],
    a: [ 'href' ]
  },
  protocols: [ 'http', 'https' ]
}

Janitor.clean('<div>adfasdf<ul><li>asdfasdf<a href="http://google.com">Google</div>', config)
  -> '<div>adfasdf<ul><li>asdfasdf<a href="http://google.com">Google</a></li></ul></div>'

Magic!