This tool uses [himalaya][himalaya] under the hood to parse the HTML string into JSON; this package then walks through that JSON to find the desired information.
Read the [himalaya] documentation for more information.
Copyright © 2024 DisQada
This tool is licensed under the Apache 2.0 License.
See the LICENSE file for more information.
As the package [himalaya] breaks its JSON output down into nodes, this package follows the same concept (HTML tag = JSON object/node), with the main node types being:

- Element node: Container of the main information defining the tag, like the tag name, attributes and children nodes
- Text node: Container of the text value inside the HTML tag
- Comment node: Container of the content of an HTML comment

Click on a node type's link to read further details about that type.
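As a concrete illustration, the three node types can be pictured as plain objects following himalaya's documented JSON shapes (the sample nodes and the small `collectText` helper below are written by hand for illustration; they are not part of this package's API):

```js
// Sample nodes in himalaya's JSON shape, hand-written for illustration
// (normally these come out of himalaya's parse()).
const elementNode = {
  type: 'element',
  tagName: 'p',
  attributes: [{ key: 'class', value: 'content' }],
  children: [
    { type: 'text', content: 'This is a test paragraph.' }, // Text node
    { type: 'comment', content: ' an HTML comment ' } // Comment node
  ]
}

// Collect all visible text under a node by walking its children.
function collectText(node) {
  if (node.type === 'text') return node.content
  if (node.type === 'element') return node.children.map(collectText).join('')
  return '' // comment nodes carry no visible text
}

console.log(collectText(elementNode)) // 'This is a test paragraph.'
```

Walking the `children` arrays like this is the basic idea behind searching a parsed HTML tree.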
First and most importantly, we need to have our HTML string ready in a variable.

If you're going to use the same HTML string multiple times, it's better to parse it yourself and pass the JSON output to the functions (so the HTML string is parsed only once). The following example shows how:
```js
const { parse } = require('himalaya')

// Use another package to fetch the HTML from the web
const html = `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <h1 id="title">Hello, world!</h1>
    <p class="content">This is a test paragraph.</p>
  </body>
</html>
`

const nodes = parse(html)

// Rest of the code ...
```
```js
const { findNode } = require('@disqada/scraper')

const node = findNode(nodes, {
  tag: 'h1',
  attr: {
    key: 'id',
    value: 'title'
  }
})
// node = {
//   type: 'element',
//   tagName: 'h1',
//   attributes: [{ key: 'id', value: 'title' }],
//   children: []
// }
```
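Conceptually, this lookup is a depth-first walk over the node tree. The sketch below is a hypothetical illustration of that idea (it is not the package's actual implementation, and the real `findNode` supports richer options):

```js
// Hypothetical depth-first search over himalaya-style nodes,
// matching on tag name and a single attribute key/value pair.
function findNodeSketch(nodes, { tag, attr }) {
  for (const node of nodes) {
    if (node.type !== 'element') continue
    const tagMatches = !tag || node.tagName === tag
    const attrMatches =
      !attr ||
      node.attributes.some((a) => a.key === attr.key && a.value === attr.value)
    if (tagMatches && attrMatches) return node
    // Not a match: descend into this element's children.
    const found = findNodeSketch(node.children, { tag, attr })
    if (found) return found
  }
  return undefined
}

const sampleNodes = [
  {
    type: 'element',
    tagName: 'body',
    attributes: [],
    children: [
      {
        type: 'element',
        tagName: 'h1',
        attributes: [{ key: 'id', value: 'title' }],
        children: [{ type: 'text', content: 'Hello, world!' }]
      }
    ]
  }
]

const h1 = findNodeSketch(sampleNodes, {
  tag: 'h1',
  attr: { key: 'id', value: 'title' }
})
console.log(h1.children[0].content) // 'Hello, world!'
```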
```js
const { grabAText } = require('@disqada/scraper')

const text = grabAText(nodes, {
  tag: 'title'
})
// text = 'Test Page'
```
The function `grabAText` can be given a `TextOptions` object that specifies some configurations for the search process; note that it's optional. Click on `TextOptions` to read more.
```js
const { grabAttr } = require('@disqada/scraper')

const attr = grabAttr(nodes, {
  tag: 'p'
}, 'class')
// attr = 'content'
```
You can download an HTML file and its parsed JSON file into the `scrap` folder at the root path of your project, outside runtime, by calling the `download` CLI command.

Note that either `--url` or `--path` must be given.
| Arg name | Required | Description |
| --- | --- | --- |
| `--file` | true | Name of the downloaded HTML and parsed JSON file |
| `--url` | false | Link of the web page |
| `--path` | false | Path of a local HTML file (the HTML file will be copied to the scrap folder) |
```sh
npm explore @disqada/scraper -- npm run download --url='https://example.com/sample' --file='sample'
npm explore @disqada/scraper -- npm run download --path='./samples/v1/index.html' --file='sample1'
```
[himalaya]: https://www.npmjs.com/package/himalaya