The Readability constructor

FB55 edited this page Feb 10, 2012 · 4 revisions

Getting the constructor

In node, there are two ways of getting the Readability constructor:

1.

var Readability = require("readabilitySAX").Readability;

This first loads the file node/index.js, which in turn requires readabilitySAX.js and exposes it as the Readability property.

2.

var Readability = require("readabilitySAX/readabilitySAX.js");

This loads the readabilitySAX.js file directly from the readabilitySAX directory inside your node_modules.

Usage

The basic usage is:

var readable = new Readability(<object> settings = {});

Methods

getTitle()

Returns a string containing the page's title.

getNextPage()

Returns a string with the link to the next page. Returns an empty string when no link was found or when searchFurtherPages is set to false.

getHTML()

Returns a string containing the HTML of the article.

getText()

Returns a string containing a formatted text representation of the article.

getEvents(<object> cbs)

If you need to process the result further, you can call getEvents instead of passing the returned HTML to another parser. The cbs object should have three methods:

{
  onopentag: function(<str> name, <obj> attributes){ … },
  ontext: function(<str> text){ … },
  onclosetag: function(<str> name){ … }
}
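
As a minimal sketch of such a cbs object, the following builds a collector that records the names of opened elements and the concatenated text. The helper name createCollector and its tags, text and depth fields are made up for this illustration. The events are fired by hand here, so the snippet runs without readabilitySAX installed.

```javascript
// Hypothetical helper (not part of readabilitySAX): builds a callbacks
// object with the three methods that getEvents() expects.
function createCollector() {
    var collector = {
        tags: [],   // names of opened elements, in document order
        text: "",   // concatenated text content
        depth: 0,   // current nesting depth
        onopentag: function (name, attributes) {
            collector.tags.push(name);
            collector.depth += 1;
        },
        ontext: function (text) {
            collector.text += text;
        },
        onclosetag: function (name) {
            collector.depth -= 1;
        }
    };
    return collector;
}

// Simulated event stream, standing in for readable.getEvents(collector).
var collector = createCollector();
collector.onopentag("p", {});
collector.ontext("Hello ");
collector.onopentag("b", {});
collector.ontext("world");
collector.onclosetag("b");
collector.onclosetag("p");

console.log(collector.tags, collector.text);
```

With a real instance, the hand-written calls would be replaced by readable.getEvents(collector) once the page has been parsed.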

setSkipLevel(<int> level)

setSkipLevel is a shortcut for loosening the filters so that more elements of the page are kept. There are three levels, each of which disables one setting:

  1. stripUnlikelyCandidates = false
  2. weightClasses = false
  3. cleanConditionally = false
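
One way to use skip levels is as a fallback: parse with the default filters first, and retry with a higher level only when too little text was extracted. The sketch below assumes a caller-supplied parse(level) function that creates a Readability instance, calls setSkipLevel(level) for levels above 0, feeds it the page and returns getArticle(). Both extractWithFallback and the minLength threshold are hypothetical names chosen for this example, and the fake parse function only stands in for real parsing.

```javascript
// Hypothetical helper: retries extraction with increasing skip levels
// until the article reaches `minLength` characters of text.
function extractWithFallback(parse, minLength) {
    var article = null;
    for (var level = 0; level <= 3; level++) {
        article = parse(level);
        // textLength is one of the properties returned by getArticle()
        if (article.textLength >= minLength) break;
    }
    return article;
}

// Fake parse function for demonstration: pretends that higher skip
// levels yield longer articles.
var fakeLengths = [40, 120, 600, 900];
function fakeParse(level) {
    return { title: "Demo", textLength: fakeLengths[level], skipLevel: level };
}

var result = extractWithFallback(fakeParse, 250);
console.log(result.skipLevel, result.textLength);
```

If no level reaches the threshold, the article from the last attempt is returned, so the caller still gets the most permissive result.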

getArticle(<String> type = "html")

Returns an object with the following properties:

{
    title: <String>,
    nextPage: <String>,
    textLength: <int>,
    score: <int>,
    html: <String> OR text: <String>
}

Whether html or text is present depends on the type argument. Possible values are "html" and "text"; any other value falls back to "html".
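
Because the result carries either an html or a text property, code that handles both types may want a small accessor. This is a sketch only; articleBody is a hypothetical helper, and the two sample objects merely imitate the shape described above.

```javascript
// Hypothetical helper: returns the article body regardless of whether
// getArticle() was called with "html" or "text".
function articleBody(article) {
    return typeof article.html === "string" ? article.html : article.text;
}

// Sample objects shaped like the result of getArticle().
var asHtml = { title: "Demo", nextPage: "", textLength: 5, score: 10, html: "<p>Hello</p>" };
var asText = { title: "Demo", nextPage: "", textLength: 5, score: 10, text: "Hello" };

console.log(articleBody(asHtml));
console.log(articleBody(asText));
```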

onopentagname(<String> name), onattribute(<String> name, <String> value), ontext(<String> text), onclosetag(<String> name) & onreset()

These methods are called by the parser (in most cases htmlparser2) while it reads the document. You normally don't need to call them yourself.
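
Since these five methods are the whole surface the parser talks to, any object exposing them can sit between the parser and a Readability instance. The sketch below wraps a handler so that every call is recorded before being forwarded, which can help when debugging. withLogging and the fake target are hypothetical and exist only for this illustration.

```javascript
// The handler methods listed above.
var HANDLER_METHODS = ["onopentagname", "onattribute", "ontext", "onclosetag", "onreset"];

// Hypothetical helper: returns a handler that appends each method name
// to `log` and then forwards the call to `target`.
function withLogging(target, log) {
    var proxy = {};
    HANDLER_METHODS.forEach(function (method) {
        proxy[method] = function () {
            log.push(method);
            return target[method].apply(target, arguments);
        };
    });
    return proxy;
}

// Fake target standing in for a Readability instance: it only counts calls.
var calls = 0;
var fakeTarget = {};
HANDLER_METHODS.forEach(function (method) {
    fakeTarget[method] = function () { calls += 1; };
});

var log = [];
var handler = withLogging(fakeTarget, log);
handler.onopentagname("a");
handler.onattribute("href", "/next");
handler.ontext("Next page");
handler.onclosetag("a");

console.log(log.join(", "));
```

With the real libraries, the wrapped handler would be passed to the parser in place of the Readability instance itself.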

Settings

These are the options that one may pass to the Readability object:

  • stripUnlikelyCandidates: Removes elements that probably don't belong to the article. Default: true

  • weightClasses: Indicates whether classes should be scored. This may lead to shorter articles. Default: true

  • cleanConditionally: Removes elements that don't match specific criteria (defined by the original Readability). Default: true

  • cleanAttributes: Only allow some attributes, ignore all the crap nobody needs. Default: true

  • searchFurtherPages: Indicates whether links should be checked to see if they point to the next page of an article. Default: true

  • linksToSkip: A map of pages that should be ignored when searching links to further pages. Default: {}

  • pageURL: The URL of the current page. It is used to resolve all other links and is itself ignored when searching for links to further pages. Default: ""

  • type: The default type of the output of getArticle(). Possible values are "html" or "text". Default: "html"

  • resolvePaths: Indicates whether ".." and "." inside paths should be eliminated. Default: false
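
The options above can be collected into a plain object before it is passed to the constructor. The following sketch merges overrides into the documented defaults and rejects unknown keys to catch typos. defaultSettings and buildSettings are hypothetical helpers, not part of readabilitySAX, and the example URL is a placeholder.

```javascript
// Returns a fresh copy of the defaults listed above.
function defaultSettings() {
    return {
        stripUnlikelyCandidates: true,
        weightClasses: true,
        cleanConditionally: true,
        cleanAttributes: true,
        searchFurtherPages: true,
        linksToSkip: {},
        pageURL: "",
        type: "html",
        resolvePaths: false
    };
}

// Hypothetical helper: applies overrides on top of the defaults and
// throws on option names that are not documented.
function buildSettings(overrides) {
    var settings = defaultSettings(), key;
    for (key in overrides) {
        if (!(key in settings)) throw new Error("Unknown setting: " + key);
        settings[key] = overrides[key];
    }
    return settings;
}

var settings = buildSettings({
    type: "text",
    pageURL: "http://example.com/article", // placeholder URL
    resolvePaths: true
});

console.log(settings);
```

The resulting object would then be handed to the constructor as new Readability(settings).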

Example

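// Load the Readability constructor and the htmlparser2 parser.
var Readability = require("readabilitySAX/readabilitySAX.js"),
    Parser = require("htmlparser2/lib/Parser.js"),
    readable = new Readability({}),
    // The Readability instance acts as the parser's event handler.
    parser = new Parser(readable, {});

// Feed the whole document to the parser, then print the extracted article.
parser.write(require("fs").readFileSync("./foo.html").toString());
console.log(readable.getArticle());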