Skip to content

KellyStathis/warc_downloader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

warc_downloader

This project is a Python script that Archive-It partners can use to download their WARC files and associated metadata.

Overview

This script uses Archive-It's Web Archiving Systems API (WASAPI) and Partner API to download WARC files and associated metadata. The code was developed as part of a Professional Experience project at the UBC iSchool for use by UBC Library Digital Initiatives, with the goal of digitally preserving WARC files captured using Archive-It.

Because the files will be preserved in Archivematica, the script organizes downloads in the following Submission Information Package (SIP) structure:

  • ARCHIVEIT_COLLECTION-<collection number>_JOB-<crawl ID>
    • metadata
      • submissionDocumentation
      • <host-list csv>: list of host names and summary data from hosts report
      • <mimetype-list csv>: list of mimetypes and summary data from file types report
      • <seed-list csv>: list of seed URLs and summary data from seed report
      • objects
        • <WARC file(s)>

Each package contains one crawl's WARC files and administrative metadata. At present, descriptive metadata is not downloaded by this script.

Prerequisites

  1. Python 3
  2. pipenv

Dependencies

Project Files

Filename Description
warc_downloader.py Main script
Pipfile Pipfile containing dependencies
credentials.env Example file – edit with your Archive-It credentials

Setup

  1. Clone or download this repository
  2. Run pipenv install within the project folder
  3. Edit credentials.env, replacing sampleUsername and samplePassword with your Archive-It credentials

Execution

  1. Run pipenv run python warc_downloader.py
  2. Follow the prompts provided:
Prompt Notes
Enter collection number: Enter the collection number from which to download WARC files.
Would you like to narrow further by date? Enter y or n: y to provide a date range for which WARC files to download, n to proceed with current results.
If a collection has > 100 files, the initial query will only return 100 files, and you will be required to narrow the results by date.
Enter a start date (YYYY-MM-DD): Enter the earliest date for which to retrieve WARC files.
Enter an end date (YYYY-MM-DD): Enter the latest date for which to retrieve WARC files.
Note that the end date is not inclusive. For example, to get all files from 2019, use start date 2019-01-01 and end date 2020-01-01.
Download files? Enter y or n: y to download files, n to exit.
  1. As the files download, scan for any output in red text. The script will indicate if there is any file corruption (md5 checksum did not match) or missing metadata files.

About

python script to download WARC files and metadata

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages