
This project is a Python script that Archive-It partners can use to download their WARC files and associated metadata.


This script uses Archive-It's Web Archiving Systems API (WASAPI) and Partner API to download WARC files and associated metadata. The code was developed as part of a Professional Experience project at the UBC iSchool for use by UBC Library Digital Initiatives, with the goal of digitally preserving WARC files captured using Archive-It.

Because the files will be preserved in Archivematica, the script organizes downloads in the following Submission Information Package (SIP) structure:

  • ARCHIVEIT_COLLECTION-<collection number>_JOB-<crawl ID>
    • metadata
      • submissionDocumentation
      • <host-list csv>: list of host names and summary data from hosts report
      • <mimetype-list csv>: list of mimetypes and summary data from file types report
      • <seed-list csv>: list of seed URLs and summary data from seed report
      • objects
        • <WARC file(s)>

Each package contains one crawl's WARC files and administrative metadata. At present, descriptive metadata is not downloaded by this script.
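As a rough illustration of the package naming and folder skeleton described above, the sketch below builds one crawl's package folder. The function names are hypothetical and not taken from the script. It also assumes that metadata and objects sit side by side at the top of the package, with the CSV reports going into metadata/submissionDocumentation; adjust the paths if your layout differs.

```python
from pathlib import Path


def package_name(collection: int, crawl_id: int) -> str:
    """Return the package folder name for one crawl."""
    return f"ARCHIVEIT_COLLECTION-{collection}_JOB-{crawl_id}"


def make_package_dirs(root: Path, collection: int, crawl_id: int) -> Path:
    """Create the empty folder skeleton for one crawl and return its path.

    Assumption (not confirmed by the source): metadata/ and objects/ are
    siblings, and the CSV reports belong in metadata/submissionDocumentation/.
    """
    package = root / package_name(collection, crawl_id)
    (package / "metadata" / "submissionDocumentation").mkdir(parents=True, exist_ok=True)
    (package / "objects").mkdir(parents=True, exist_ok=True)
    return package
```

Calling make_package_dirs(Path("downloads"), 1234, 567890) would create downloads/ARCHIVEIT_COLLECTION-1234_JOB-567890 with the empty subfolders; the collection number and crawl ID here are made-up values.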


Requirements

  1. Python 3
  2. pipenv


Project Files

Filename          Description
                  Main script
Pipfile           Pipfile containing dependencies
credentials.env   Example file: edit with your Archive-It credentials


Installation

  1. Clone or download this repository
  2. Run pipenv install within the project folder
  3. Edit credentials.env, replacing sampleUsername and samplePassword with your Archive-It credentials
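Before running the script, it can help to confirm that the placeholders in credentials.env were actually replaced. The sketch below is a stand-alone check, not part of the script. It assumes the file holds simple KEY=VALUE lines, and the key names ARCHIVEIT_USER and ARCHIVEIT_PASSWORD are hypothetical; substitute whatever keys your copy of credentials.env uses.

```python
from pathlib import Path


def read_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_credentials(values,
                      user_key="ARCHIVEIT_USER",          # hypothetical key name
                      password_key="ARCHIVEIT_PASSWORD"):  # hypothetical key name
    """Return a list of problems; an empty list means the file looks edited."""
    problems = []
    placeholders = ((user_key, "sampleUsername"), (password_key, "samplePassword"))
    for key, placeholder in placeholders:
        if key not in values:
            problems.append(f"{key} is missing")
        elif values[key] == placeholder:
            problems.append(f"{key} still holds the placeholder {placeholder}")
    return problems
```

For example, check_credentials(read_env_file("credentials.env")) returns an empty list once both values have been replaced.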


Usage

  1. Run pipenv run python
  2. Follow the prompts provided:
     Prompt: Enter collection number:
     Notes:  Enter the collection number from which to download WARC files.

     Prompt: Would you like to narrow further by date? Enter y or n:
     Notes:  Enter y to provide a date range for the WARC files to download, or n to proceed with the current results. If a collection has more than 100 files, the initial query returns only 100 files, and you will be required to narrow the results by date.

     Prompt: Enter a start date (YYYY-MM-DD):
     Notes:  Enter the earliest date for which to retrieve WARC files.

     Prompt: Enter an end date (YYYY-MM-DD):
     Notes:  Enter the latest date for which to retrieve WARC files. Note that the end date is not inclusive. For example, to get all files from 2019, use start date 2019-01-01 and end date 2020-01-01.

     Prompt: Download files? Enter y or n:
     Notes:  Enter y to download files, or n to exit.

  3. As the files download, watch for any output in red text. The script reports file corruption (an MD5 checksum that does not match) and missing metadata files.
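Two points in the steps above are easy to get wrong: the end date is exclusive, and a downloaded file counts as intact only if its MD5 checksum matches the expected value. The sketch below illustrates both with the standard library. The function names are hypothetical, and this is not necessarily how the script implements either check.

```python
import hashlib
from datetime import date


def year_range(year):
    """Return (start, end) date strings covering one full calendar year.

    The end date is exclusive, so it is January 1 of the following year.
    """
    return date(year, 1, 1).isoformat(), date(year + 1, 1, 1).isoformat()


def in_range(day, start, end):
    """Half-open test: start <= day < end.

    YYYY-MM-DD strings sort in calendar order, so plain string
    comparison is enough here.
    """
    return start <= day < end


def md5_matches(path, expected_md5, chunk_size=1024 * 1024):
    """Return True if the file's MD5 digest equals the expected hex string.

    The file is read in chunks so that large WARC files are never
    loaded into memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()
```

With these helpers, year_range(2019) gives the start and end dates from the example above; a file dated 2019-12-31 falls inside that range, while one dated 2020-01-01 does not.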

