extract-web package

Extract content from web pages, including link URLs, image URLs and entire web page contents.

Commands

Extract Web: Extract Link URLs
Extracts all the URLs from links in the target page.
Extract Web: Extract Image URLs
Extracts all the URLs from images in the target page.
Extract Web: Extract Contents
- Extracts the HTML content of the target page to a JSON or YAML document. Requires a URL list.

Settings

User-Agent:
- Valid values: android, chrome, googlebot, ie, ios, opera, safari
- Default value: chrome
Accept-Language:
- Default value: en
Extract URL Pattern:
- Filters the URLs returned by the Extract Link URLs and Extract Image URLs commands using a regular expression.
- Default: https?://.+
Extract Contents Config Path:
- The location of a JSON file that determines the setting used by the Extract Contents command. An example is included in the package as default-config.json.
- Default value: extract-web/default-config.json
Extract Contents Output Format:
- Determines the format of the output of the Extract Contents command.
- Valid values: json, yaml
- Default value: json
JSON Indent Size
- Sets the indentation used in the output for the Extract Contents command when it is in JSON mode.
- Default value: 2
YAML Indent Size
- Sets the indentation used in the output for the Extract Contents command when it is in YAML mode.
- Default value: 2

Customizing the output of the Extract Contents command

The Extract Contents command outputs a JSON or YAML document containing an array of objects. Each extracted web page is represented by a JSON/YAML object in this array.

The properties object for each extracted web page contains an array of properties extracted from the web page.

If you want to customize the properties extracted from each item, prepare a configuration file similar to the example below. Properties to extract are specified using CSS syntax.

Example:

{
  "target": [
    {
      "pattern": {
        "url": "https://atom.io/packages/.*"
      },
      "properties": {
        "title": {
          "text": "title"
        },
        "body": {
          "text": "body"
        },
        "bodyAsHtml": {
          "html": "body"
        },
        "package_meta": {
          "text": ".package-meta ul li a",
          "isArray": true
        },
        "meta_description": {
          "attr": "meta[name=description]",
          "args": ["content"]
        },
        "domain": {
          "default": "atom.io"
        }
      }
    }
  ]
}

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
keymaps		keymaps
lib		lib
menus		menus
screenshots		screenshots
styles		styles
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE.md		LICENSE.md
README.md		README.md
default-config.json		default-config.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

extract-web package

Commands

Settings

Customizing the output of the Extract Contents command

Screenshots

Extract Contents

About

Releases

Packages

Languages

License

KunihikoKido/atom-extract-web

Folders and files

Latest commit

History

Repository files navigation

extract-web package

Commands

Settings

Customizing the output of the Extract Contents command

Screenshots

Extract Contents

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages