Additor/lib-htmlextract-js

html-extract-js

html-extract-js is a JavaScript library that parses HTML documents to collect metadata and core contextual information from any web page.

This library was created at Additor, where it is used for web scraping.

Installation

Using npm:

$ npm install --save html-extract-js

API

Load

First, obtain the HTML document as a String or a Buffer. Then load it into an extractor:

const HtmlExtractor = require('html-extract-js');
const extractor = HtmlExtractor.load(html);
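Because load accepts only a String or a Buffer, it can help to check the input before handing it over. The guard below is a minimal sketch. The helper name ensureHtmlInput is hypothetical and is not part of html-extract-js, and how the library itself reacts to other input types is not documented here.

```javascript
// Hypothetical helper (not part of html-extract-js): accept only the two
// input types the README names, and fail early on anything else.
function ensureHtmlInput(input) {
  if (typeof input === 'string' || Buffer.isBuffer(input)) {
    return input;
  }
  throw new TypeError('HTML input must be a String or a Buffer');
}

// Both documented input types pass through unchanged.
const fromString = ensureHtmlInput('<html><title>Hi</title></html>');
const fromBuffer = ensureHtmlInput(Buffer.from('<html></html>'));
```

The checked value can then be passed on, for example HtmlExtractor.load(ensureHtmlInput(html)).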

HtmlExtractor

HtmlExtractor uses cheerio and iconv-lite to extract information from the document.

HtmlExtractor is a wrapper class around its sub-extractors. By default, it uses two of them: ContextExtractor and MetaExtractor.

You can also configure the extractor by passing an options object as the second argument.
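The wrapper idea can be illustrated with a generic sketch. Everything in it is an assumption: the stub classes stand in for the real sub-extractors, whose code is not shown here, and the "first non-empty answer wins" rule is one plausible way to combine them, not the library's confirmed behavior.

```javascript
// Hypothetical stand-ins for the real sub-extractors.
class StubMetaExtractor {
  getTitle() { return null; } // pretend no title was found in <meta> tags
}
class StubContextExtractor {
  getTitle() { return 'Title from page content'; }
}

// Generic wrapper sketch: delegates each question to its sub-extractors.
class WrapperSketch {
  constructor(extractors) {
    this.extractors = extractors;
  }

  // Ask each sub-extractor in order and return the first non-empty answer.
  getTitle() {
    for (const extractor of this.extractors) {
      const value = extractor.getTitle();
      if (value) return value;
    }
    return null;
  }
}

const wrapper = new WrapperSketch([
  new StubMetaExtractor(),
  new StubContextExtractor(),
]);
const wrappedTitle = wrapper.getTitle();
```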

const option = {
    charset: 'EUC-KR',      // if set, iconv-lite converts the HTML document using this charset
};
const extractor = HtmlExtractor.load(html, option);
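When the charset is not known in advance, one way to pick a value for the option is to look for a charset declaration near the top of the raw bytes. The function below is a rough sketch under that assumption. The name sniffCharset is hypothetical, it is not part of html-extract-js, and it ignores HTTP headers and byte-order marks.

```javascript
// Hypothetical helper: find a charset declared in a <meta> tag within the
// first 1024 bytes. latin1 maps every byte to one character, so the ASCII
// markup stays readable whatever the real encoding is.
function sniffCharset(buffer) {
  const head = buffer.subarray(0, 1024).toString('latin1');
  const match = /<meta[^>]*charset\s*=\s*["']?\s*([\w-]+)/i.exec(head);
  return match ? match[1].toUpperCase() : null;
}

const shortForm = sniffCharset(
  Buffer.from('<html><head><meta charset="euc-kr"></head></html>')
);
const longForm = sniffCharset(
  Buffer.from('<meta http-equiv="Content-Type" content="text/html; charset=EUC-KR">')
);
const missing = sniffCharset(Buffer.from('<html><head></head></html>'));
```

A non-null result could then be passed as the charset option, for example HtmlExtractor.load(html, { charset: sniffCharset(html) }), provided iconv-lite supports that encoding.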

URI

const uri = extractor.getURI();                 // "https://additor.io"

Title

const title = extractor.getTitle();             // "Additor :: Just Add it. Be an Additor"

Description

const description = extractor.getDescription(); // "Additor is alchemy that turns your scattered information into well-organized content..."

Thumbnail

const thumbnail = extractor.getThumbnail();     // "https://cdn.additor.io/image/main/landing_temp.png"

Favicon

const favicon = extractor.getFavicon();         // "https://cdn.additor.io/image/logo/favicon.ico"
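The five getters above can be gathered into a single plain object, which is convenient for logging or storing a scrape result. This is a minimal sketch. The function name collectMetadata and the stub extractor are hypothetical, and only the getter names come from this README.

```javascript
// Hypothetical helper: call each documented getter once and return the
// results as one plain object.
function collectMetadata(extractor) {
  return {
    uri: extractor.getURI(),
    title: extractor.getTitle(),
    description: extractor.getDescription(),
    thumbnail: extractor.getThumbnail(),
    favicon: extractor.getFavicon(),
  };
}

// Stub with the same getter names, so the sketch runs without the library.
const stubExtractor = {
  getURI: () => 'https://example.com',
  getTitle: () => 'Example',
  getDescription: () => 'An example page.',
  getThumbnail: () => 'https://example.com/thumb.png',
  getFavicon: () => 'https://example.com/favicon.ico',
};

const metadata = collectMetadata(stubExtractor);
```

With the real library, the stub would be replaced by the object returned from HtmlExtractor.load(html).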

License

MIT License
