RWGet

RWGet is a web crawler that emulates a subset of the GNU Wget command-line interface, with more flexibility for my needs.

Features

Regular expression accept/reject lists
Pluggable interfaces for robots-txt, url-fetcher, url-queue, url-dupe-detector, and page-storage. The defaults store pages locally and fetch with libcurl, but you can swap in database storage, a distributed queue, and so on.
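
The pluggable pieces can be quite small. Below is a minimal sketch, assuming the method signatures listed in the help page: `put(key_string, temp_file)` for storage, `dupe?(uri)` for duplicate detection, and `put(key_string, depth_int)` / `get()` for the queue. The class names `MemoryStore`, `SetDupes`, and `FifoQueue` are hypothetical. So are two behaviours: that `dupe?` records the URI it is asked about, and that `get` returns nil on an empty queue. See the RDoc for how RWGet actually constructs and calls these classes.

```ruby
require "set"
require "tempfile"

# Hypothetical in-memory store; stands in for a database-backed one.
class MemoryStore
  attr_reader :pages

  def initialize
    @pages = {}
  end

  # put(key_string, temp_file): keep the fetched body under its key.
  def put(key, temp_file)
    temp_file.rewind
    @pages[key] = temp_file.read
  end
end

# Hypothetical duplicate detector backed by a Set.
class SetDupes
  def initialize
    @seen = Set.new
  end

  # dupe?(uri): false the first time a URI is seen, true afterwards.
  # Assumption: asking about a URI also records it.
  def dupe?(uri)
    !@seen.add?(uri.to_s)
  end
end

# Hypothetical FIFO queue of [key_string, depth_int] pairs.
class FifoQueue
  def initialize
    @items = []
  end

  def put(key, depth)
    @items.push([key, depth])
  end

  # Returns the oldest [key, depth] pair, or nil when empty (assumption).
  def get
    @items.shift
  end
end
```

Classes like these would presumably be loaded with `--require` and selected with `--store-class`, `--dupes-class`, and `--queue-class`.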

Help page

Usage: /usr/bin/rwget [options] SEED_URL [SEED_URL2 ...]
    -w, --wait=SECONDS               wait SECONDS between retrievals.
    -P, --directory-prefix=PREFIX    save files to PREFIX/...
    -U, --user-agent=AGENT           identify as AGENT instead of RWget/VERSION.
    -A, --accept-pattern=RUBY_REGEX  URLs must match RUBY_REGEX to be saved to the queue.
        --time-limit=AMOUNT          Crawler will stop after this AMOUNT of time has passed.
    -R, --reject-pattern=RUBY_REGEX  URLs must NOT match RUBY_REGEX to be saved to the queue.
        --require=RUBY_SCRIPT        Will execute 'require RUBY_SCRIPT'
        --limit-rate=RATE            limit download rate to RATE.
        --http-proxy=URL             Proxies via URL
        --proxy-user=USER            Sets proxy user to USER
        --proxy-password=PASSWORD    Sets proxy password to PASSWORD
        --fetch-class=RUBY_CLASS     Must implement fetch(uri, user_agent_string) #=> [final_redirected_url, file_object]
        --store-class=RUBY_CLASS     Must implement put(key_string, temp_file)
        --dupes-class=RUBY_CLASS     Must implement dupe?(uri)
        --queue-class=RUBY_CLASS     Must implement put(key_string, depth_int) and get() #=> [key_string, depth_int]
        --links-class=RUBY_CLASS     Must implement urls(base_uri, temp_file) #=> [uri, ...]
    -S, --sitemap=URL                URL of a sitemap to crawl (will ignore inter-page links)

    -Q, --quota=NUMBER               set retrieval quota to NUMBER.
        --max-redirect=NUM           maximum redirections allowed per page.
    -H, --span-hosts                 go to foreign hosts when recursive
        --connect-timeout=SECS       set the connect timeout to SECS.
    -T, --timeout=SECS               set all timeout values to SECONDS.
    -l, --level=NUMBER               maximum recursion depth (inf or 0 for infinite).
        --[no-]timestampize          Prepend the timestamp of when the crawl started to the directory structure.
        --incremental-from=PREVIOUS  Build upon the indexing already saved in PREVIOUS.
        --protocol-directories       use protocol name in directories.
        --no-host-directories        don't create host directories.
    -v, --[no-]verbose               Run verbosely
    -h, --help                       Show this message
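
To illustrate how `--links-class`, `-A`, and `-R` might fit together, here is a rough sketch of a link extractor with the documented `urls(base_uri, temp_file)` signature, followed by an accept/reject filter. `HrefLinks` and `queueable?` are hypothetical names, not part of RWGet. The regex-based `href` scan is a simplification of real HTML parsing. The rule that a URL must match at least one accept pattern and no reject pattern is an assumption about how multiple patterns combine.

```ruby
require "uri"
require "tempfile"

# Hypothetical links class: urls(base_uri, temp_file) #=> [uri, ...]
class HrefLinks
  # Naive href matcher; skips fragment-only links. Not a real HTML parser.
  HREF = /href\s*=\s*["']([^"'#]+)/i

  def urls(base_uri, temp_file)
    temp_file.rewind
    hrefs = temp_file.read.scan(HREF).flatten
    hrefs.map { |href| resolve(base_uri, href) }.compact
  end

  private

  # Resolve a possibly relative href against the page URI; nil if invalid.
  def resolve(base_uri, href)
    URI.join(base_uri.to_s, href.strip)
  rescue URI::Error
    nil
  end
end

# Hypothetical filter: a URL is queued only if it matches an accept pattern
# (or no accept patterns are given) and matches no reject pattern.
def queueable?(url, accept: [], reject: [])
  accepted = accept.empty? || accept.any? { |re| re.match?(url) }
  rejected = reject.any? { |re| re.match?(url) }
  accepted && !rejected
end

# Usage: extract links from a saved page, then filter them.
page = Tempfile.new("page")
page.write('<a href="/foo/1">x</a> <a href="bar.pdf">y</a> <a href="foo.pdf">z</a>')

found = HrefLinks.new.urls(URI("http://example.com/dir/"), page).map(&:to_s)
kept  = found.select { |u| queueable?(u, accept: [/foo/], reject: [/\.pdf\z/]) }
# kept == ["http://example.com/foo/1"]
page.close!
```

Keeping the filter separate from the extractor means the same patterns apply regardless of which links class is plugged in.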

Ruby API

require "rubygems"
require "rwget"

# The options hash mirrors the command-line long options, converted into
# idiomatic Ruby. See the RDoc for details.
# For example,
# sh$ rwget -T 5 -A ".*foo.*" http://google.com
# becomes:
# irb$ RWGet::Controller.new({:seeds => ["http://google.com"], 
#            :timeout => 5, :accept_patterns => /.*foo.*/}).start

RWGet::Controller.new(options).start
