html-extract-js

html-extract-js is a JavaScript library that parses HTML documents to collect metadata and core contextual information from any web page.

This library was created at Additor, where it is used for web scraping.

Installation

Using npm:

$ npm install --save html-extract-js

API

Load

First, obtain the HTML document as a "String" or a "Buffer". Once you have the document, load it into an extractor:

const HtmlExtractor = require('html-extract-js');
const extractor = HtmlExtractor.load(html);
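Since load expects a "String" or a "Buffer", it can help to check the input type before calling it. The following is a minimal sketch, not part of html-extract-js: isLoadableHtml and loadChecked are hypothetical helper names, and the library object is passed in as a parameter so the sketch makes no assumption beyond the load(html) call shown above.

```javascript
// Hypothetical helper (not provided by html-extract-js): returns true when the
// value is one of the two input types this README says load() accepts.
function isLoadableHtml(html) {
    return typeof html === 'string' || Buffer.isBuffer(html);
}

// Hypothetical wrapper: rejects unsupported input early, then delegates to
// HtmlExtractor.load(html). The library object is injected by the caller.
function loadChecked(HtmlExtractor, html) {
    if (!isLoadableHtml(html)) {
        throw new TypeError('html must be a String or a Buffer');
    }
    return HtmlExtractor.load(html);
}

// Usage, assuming the library is installed:
//   const HtmlExtractor = require('html-extract-js');
//   const extractor = loadChecked(HtmlExtractor, '<html><head></head></html>');
```

Whether the library performs its own type check is not stated here, so treat the guard as a convenience rather than a requirement.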

HtmlExtractor

HtmlExtractor uses cheerio and iconv-lite to extract a document's information.

HtmlExtractor is a wrapper class around its sub-extractors. By default, it uses two of them: ContextExtractor and MetaExtractor.

You can also configure the extractor by passing an option parameter.

const option = {
    charset: 'EUC-KR',      // if set, "iconv-lite" converts the HTML document using this charset.
};
const extractor = HtmlExtractor.load(html, option);
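When the document comes from an HTTP response, the charset is often declared in the Content-Type header. The sketch below shows one way to turn that header into the option object; charsetFromContentType and buildLoadOption are hypothetical helpers, not part of html-extract-js, and only the charset option key comes from this README.

```javascript
// Hypothetical helper: extracts the charset parameter from a Content-Type
// header value such as 'text/html; charset=EUC-KR'. Returns undefined when
// the header is missing or carries no charset parameter.
function charsetFromContentType(contentType) {
    const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
    return match ? match[1] : undefined;
}

// Hypothetical helper: builds the option object for HtmlExtractor.load,
// leaving charset out entirely when none was declared.
function buildLoadOption(contentType) {
    const charset = charsetFromContentType(contentType);
    return charset ? { charset } : {};
}

// Usage, assuming `body` is a Buffer and `contentType` is the header value:
//   const extractor = HtmlExtractor.load(body, buildLoadOption(contentType));
```

Omitting the key when no charset is declared keeps the call equivalent to the plain load(html) form.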

URI

const uri = extractor.getURI();                 // "https://additor.io"

Title

const title = extractor.getTitle();             // "Additor :: Just Add it. Be an Additor"

Description

const description = extractor.getDescription(); // "Additor is alchemy that turns your scattered information into well-organized content..."

Thumbnail

const thumbnail = extractor.getThumbnail();     // "https://cdn.additor.io/image/main/landing_temp.png"

Favicon

const favicon = extractor.getFavicon();         // "https://cdn.additor.io/image/logo/favicon.ico"
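The five getters above can be gathered into a single plain object, which is convenient for storing or serializing a page summary. This is a minimal sketch: collectMetadata is a hypothetical helper, not part of the library, and it assumes only the getter names documented in this README.

```javascript
// Hypothetical helper: calls each documented getter once and returns the
// results as a plain object.
function collectMetadata(extractor) {
    return {
        uri: extractor.getURI(),
        title: extractor.getTitle(),
        description: extractor.getDescription(),
        thumbnail: extractor.getThumbnail(),
        favicon: extractor.getFavicon(),
    };
}

// Usage, assuming the library is installed and `html` holds the document:
//   const HtmlExtractor = require('html-extract-js');
//   const metadata = collectMetadata(HtmlExtractor.load(html));
//   console.log(JSON.stringify(metadata, null, 2));
```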

License

MIT License
