Download whole HTML pages or data dumps with a single terminal command.

Product Name

Download data, HTML pages and whatever you like from fairly static URLs

License: MIT

Whether you need to download a huge data dump or hundreds of HTML pages to analyze them locally, this tool might be the way to go. The URL must be fairly static: the only part that changes from file to file is a numeric index that counts up.


Installation Examples

OS X & Linux & Windows:

# clone the project and head over to the main directory
git clone
cd data-collection-download-tool
# from here you can run the script, which is described in section "Usage Examples"

Usage Examples


This will download every xkcd HTML page from to

You just need to mark the dynamic part of your URL by opening it with ++ and closing it with ++. Between the two markers, define the download range as two integers separated by **.
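The placeholder syntax can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name `expand_url`, the regular expression, the `example.com` URL and the assumption that the range is inclusive on both ends are all illustrative guesses based on the ++ and ** markers described above.

```python
import re

# Assumed placeholder syntax, inferred from the description: ++<start>**<end>++
PLACEHOLDER = re.compile(r"\+\+(\d+)\*\*(\d+)\+\+")


def expand_url(template):
    """Return one URL per index in the (assumed inclusive) range."""
    match = PLACEHOLDER.search(template)
    if match is None:
        raise ValueError("no ++start**end++ placeholder found in URL")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("range start must not exceed range end")
    # Keep everything before and after the placeholder unchanged.
    prefix = template[:match.start()]
    suffix = template[match.end():]
    return [f"{prefix}{index}{suffix}" for index in range(start, end + 1)]


# Hypothetical URL, used only to show the expansion.
urls = expand_url("https://example.com/++1**3++/")
for url in urls:
    print(url)
```

Each resulting URL could then be fetched in turn, for instance with `urllib.request.urlretrieve` from the standard library.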

In more detail:

Get HTML pages

(Note: I picked xkcd as a simple example to show how the tool works. If you want to scrape all xkcd images from the website, you would rather use a library like BeautifulSoup to get them on the fly.)

URLs where the pictures are located:

and so on ...

Say you want all HTML pages from 15 to 2100:

# run from the tool's source directory

Get a Big Data Dump

The tool can also handle leading zeros. For example, to get the complete PubMed dump, you would do this:
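As a rough illustration of what handling such zeros involves, the sketch below pads every index to the width of the range start. It is an assumption about the behaviour, not the tool's actual implementation; `padded_indices` and the `file_….gz` name pattern are hypothetical.

```python
def padded_indices(start, end):
    """Yield indices from start to end (inclusive) as strings,
    zero-padded to the width of the start string."""
    width = len(start)
    for index in range(int(start), int(end) + 1):
        yield str(index).zfill(width)


# Hypothetical file-name pattern, for illustration only.
names = [f"file_{index}.gz" for index in padded_indices("0008", "0011")]
print(names)
```

Taking the width from the start value keeps names such as `0009` and `0010` the same length, which is what a server with fixed-width file names would expect.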


Development Setup

(in case you want to contribute to the download tool)

Installation is the same as described in "Installation Examples" above.

Python's unittest module is used for the tests. To run the tests from the command line:

export PYTHONPATH=$PYTHONPATH:/your/own/path/to/data-collection-download-tool/
python3 tests/
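
For contributors new to unittest, a test in this style could look roughly like the sketch below. The helper `parse_range` and the class `ParseRangeTest` are hypothetical names made up for this example; they are not taken from the project's code.

```python
import unittest


def parse_range(spec):
    """Hypothetical helper: split a spec like "15**2100" into (15, 2100)."""
    start, end = spec.split("**")
    return int(start), int(end)


class ParseRangeTest(unittest.TestCase):
    def test_parses_start_and_end(self):
        self.assertEqual(parse_range("15**2100"), (15, 2100))

    def test_rejects_missing_delimiter(self):
        # Without "**" the split yields one item, so unpacking fails.
        with self.assertRaises(ValueError):
            parse_range("15-2100")
```

Saved as a module, a file like this can be run with `python3 -m unittest` followed by the module name.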

Release History

  • 0.0.1
    • First release: download from a static URL with an incrementing index


Richard Steinmetz - @LinkedIn@Twitter

Distributed under MIT license. See LICENSE.txt for more information.


  1. Fork it
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

Of course it would be great to keep the tool test-driven ;)

Possible Future Features

  • get really dynamic URLs (e.g. walk through a page and get all its URLs)
  • exclude some ranges/files from download
  • custom file naming/endings
  • custom download directory
  • automatic unzip option
  • adding headers and sleep option for sensitive URLs
  • GUI 🌈

Just let me know what you need for your use case, or help me refine this tool. I would love to hear about your use cases.

Necessary Code Refactoring

  • factor parsing elements out of
  • solve wildcard dependency between Parser and Downloader elegantly

Code Metrics

Blog article about the Data Collection Tool