NAME

scrape.pl - simple HTML scraping from the command line

ABSTRACT

This is a simple program to extract data from HTML by specifying CSS3 or XPath selectors.

SYNOPSIS

scrape.pl URL selector selector ...

# Print page title
scrape.pl http://perl.org title
# The Perl Programming Language - www.perl.org

# Print links with titles, make links absolute
scrape.pl http://perl.org a //a/@href --uri=2

# Print all links to JPG images, make links absolute
scrape.pl http://perl.org 'a[href$="jpg"]'

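# Print JSON about Amazon prices
# (selectors starting with # are quoted so the shell does not treat them as comments)
scrape.pl https://www.amazon.de/dp/0321751043 \
    --format json \
    --name "title" '#productTitle' \
    --name "price" '#priceblock_ourprice' \
    --name "deal" '#priceblock_dealprice'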

# Print JSON about Amazon prices for multiple products
scrape.pl --format json \
    --url https://www.amazon.de/dp/B01J90P010 \
    --url https://www.amazon.de/dp/B01M3015CT \
    --name "title" '#productTitle' \
    --name "price" '#priceblock_ourprice' \
    --name "deal" '#priceblock_dealprice'

# Extract values from HTML read from stdin
scrape.pl --format json \
    --url - \
    --name "title" '#productTitle' \
    --name "price" '#priceblock_ourprice' \
    --name "deal" '#priceblock_dealprice'

DESCRIPTION

This program fetches an HTML page and extracts nodes matched by XPath or CSS selectors from it.

If URL is -, input will be read from STDIN.
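
To illustrate what "extracting nodes matched by a selector" means, here is a minimal Python sketch. It is not how scrape.pl is implemented. It assumes a small, well-formed sample page and uses the limited XPath subset of Python's standard library. The names `PAGE` and `extract` are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from any real site).
PAGE = """<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/about.html">About</a>
    <a href="images/logo.jpg">Logo</a>
  </body>
</html>"""


def extract(markup, path, attribute=None):
    """Return the text (or one attribute) of every node matching `path`."""
    root = ET.fromstring(markup)
    values = []
    for node in root.iterfind(path):
        if attribute is None:
            # No attribute requested: use the node's own text content.
            values.append((node.text or "").strip())
        else:
            values.append(node.get(attribute, ""))
    return values


titles = extract(PAGE, ".//title")
links = extract(PAGE, ".//a", attribute="href")
print(titles)  # ['Example page']
print(links)   # ['/about.html', 'images/logo.jpg']
```

Each selector yields one list of values, which corresponds to one output column in the examples of the SYNOPSIS.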

OPTIONS

--format

Output format. Valid values are csv and json; the default is csv.

--url

URL to fetch. This can be given multiple times to fetch multiple URLs in one run. If this is not given, the first argument on the command line will be taken as the only URL to be fetched.

--keep-url

Add the fetched URL as another column with the given name in the output. If you use CSV output, the URL will always be in the first column.

--name

Name of the output column.

--sep

Separator character to use for columns. Default is tab.

--uri COLUMNS

Numbers of the columns to convert into absolute URIs, in case the automatic conversion of known attributes does not cover everything you want.

--no-uri

Switches off the automatic translation to absolute URIs for known attributes like href and src.
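
The conversion that --uri and --no-uri control can be pictured with the following Python sketch. It shows the general technique of resolving relative links against the URL of the fetched page, not the program's own code. The function `absolutize`, the sample rows and the base URL are assumptions made for this example. Columns are counted from 1 here because the SYNOPSIS uses --uri=2 for the second selector.

```python
from urllib.parse import urljoin


def absolutize(base_url, rows, uri_columns):
    """Resolve the given 1-based columns of each row against base_url."""
    result = []
    for row in rows:
        result.append([
            urljoin(base_url, value) if index in uri_columns else value
            for index, value in enumerate(row, start=1)
        ])
    return result


# Hypothetical extracted rows: link text in column 1, href in column 2.
rows = [["About", "/about.html"], ["Logo", "images/logo.jpg"]]
absolute = absolutize("http://example.com/docs/", rows, uri_columns={2})
for row in absolute:
    print("\t".join(row))
```

A link starting with a slash is resolved against the host, while a relative path such as images/logo.jpg is resolved against the directory of the page it came from.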

REPOSITORY

The public repository of this module is http://github.com/Corion/App-scrape.

SUPPORT

The public support forum of this program is http://perlmonks.org/.

AUTHOR

Max Maischein corion@cpan.org

COPYRIGHT (c)

LICENSE

This module is released under the same terms as Perl itself.
