html-extract-js is a JavaScript library that parses HTML documents to collect metadata and core contextual information from arbitrary web pages.
This library was created at Additor, where it is used for web scraping.
Using npm:
$ npm install --save html-extract-js
First, obtain the HTML document as a String or a Buffer. Once the document is ready, load it into an HtmlExtractor:
const HtmlExtractor = require('html-extract-js');
const extractor = HtmlExtractor.load(html);
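As a minimal sketch of where the `html` value might come from, the snippet below reads a page from disk as a raw Buffer. The helper name `readHtmlAsBuffer` and the temporary sample file are assumptions made for illustration; they are not part of html-extract-js. The loader call is left as a comment so the sketch runs without the package installed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of html-extract-js): read an HTML file
// without an encoding argument, so fs returns the raw bytes as a Buffer.
function readHtmlAsBuffer(filePath) {
  return fs.readFileSync(filePath);
}

// Write a small sample page to a temporary file for the demo.
const samplePath = path.join(os.tmpdir(), 'html-extract-js-sample.html');
fs.writeFileSync(
  samplePath,
  '<html><head><title>Sample</title></head><body></body></html>'
);

const html = readHtmlAsBuffer(samplePath);
console.log(Buffer.isBuffer(html)); // true

// With the package installed, the Buffer can be passed to the loader:
// const HtmlExtractor = require('html-extract-js');
// const extractor = HtmlExtractor.load(html);
```

Keeping the document as a Buffer, rather than decoding it to a String up front, leaves the original bytes intact in case the page is not UTF-8.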
HtmlExtractor uses cheerio and iconv-lite to extract the document's information.
HtmlExtractor is a wrapper class around its sub-extractors. By default, it uses two of them: ContextExtractor and MetaExtractor.
You can also configure the extractor by passing an option parameter.
const option = {
  charset: 'EUC-KR', // if set, iconv-lite converts the HTML document using this charset
};
const extractor = HtmlExtractor.load(html, option);
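When the charset is not known in advance, one possible approach is to look for a charset declaration in the first bytes of the document and build the option object from it. The sketch below is an assumption for illustration only: `sniffCharset` and `buildOption` are hypothetical helpers, not functions provided by html-extract-js, and the regular expression covers only simple `charset=...` declarations.

```javascript
// Hypothetical helper (not part of html-extract-js): look for a charset
// declaration near the top of the document.
function sniffCharset(buffer) {
  // latin1 maps each byte to exactly one character, so the ASCII markup
  // stays readable whatever the real encoding of the page is.
  const head = buffer.subarray(0, 1024).toString('latin1');
  const match = head.match(/charset\s*=\s*["']?([\w-]+)/i);
  return match ? match[1].toUpperCase() : null;
}

// Hypothetical helper: build the option object only when a charset was found.
function buildOption(buffer) {
  const charset = sniffCharset(buffer);
  return charset ? { charset } : {};
}

const page = Buffer.from(
  '<html><head><meta charset="euc-kr"><title>t</title></head></html>',
  'latin1'
);
const option = buildOption(page);
console.log(option); // { charset: 'EUC-KR' }

// With the package installed, the result can be passed to the loader:
// const extractor = HtmlExtractor.load(page, option);
```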
const uri = extractor.getURI(); // "https://additor.io"
const title = extractor.getTitle(); // "Additor :: Just Add it. Be an Additor"
const description = extractor.getDescription(); // "Additor is alchemy that turns your scattered information into well-organized content..."
const thumbnail = extractor.getThumbnail(); // "https://cdn.additor.io/image/main/landing_temp.png"
const favicon = extractor.getFavicon(); // "https://cdn.additor.io/image/logo/favicon.ico"
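The five getters above can be gathered into a single plain object, which is convenient for storing or serializing the result. This is only a sketch: `collectMetadata` is a hypothetical helper, and `stubExtractor` is a stand-in object that mimics the getter names so the example runs without the package. In real use, the object returned by `HtmlExtractor.load` would take its place.

```javascript
// Hypothetical helper (not part of html-extract-js): call each getter
// shown above and collect the results into one plain object.
function collectMetadata(extractor) {
  return {
    uri: extractor.getURI(),
    title: extractor.getTitle(),
    description: extractor.getDescription(),
    thumbnail: extractor.getThumbnail(),
    favicon: extractor.getFavicon(),
  };
}

// Stand-in with the same getter names as the extractor; the values are
// the sample values from the usage snippet.
const stubExtractor = {
  getURI: () => 'https://additor.io',
  getTitle: () => 'Additor :: Just Add it. Be an Additor',
  getDescription: () => 'Additor is alchemy that turns your scattered information into well-organized content...',
  getThumbnail: () => 'https://cdn.additor.io/image/main/landing_temp.png',
  getFavicon: () => 'https://cdn.additor.io/image/logo/favicon.ico',
};

const metadata = collectMetadata(stubExtractor);
console.log(JSON.stringify(metadata, null, 2));
```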
License
MIT License