Option to exclude URLs from WARC #212

@chfoo

Add an option to exclude URLs from being stored in the WARC, similar to --cdx-dedup. Maybe --url-dedup. This option would accept a filename containing a list of URLs whose content should not be stored in the WARC.

It would reuse the CDX dedup logic, which writes records with WARC-Type revisit. However, we have neither a previous WARC record ID (as used with the profile http://netpreserve.org/warc/1.0/revisit/identical-payload-digest) nor a timestamp (as used with the profile http://netpreserve.org/warc/1.0/revisit/server-not-modified), so the record won't have a WARC-Refers-To field.

We'll need to use a nonstandard WARC-Profile, maybe urn:X-wpull:warc-revisit-url.
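
A rough sketch of how this could work, not tied to wpull's actual internals. The function names `load_excluded_urls` and `build_record_headers` are hypothetical, the URL file is assumed to hold one URL per line, and matching is assumed to be an exact string comparison with no normalization. The profile URI is the one suggested above.

```python
import io
import uuid
from datetime import datetime, timezone

# Nonstandard profile URI proposed in this issue (not part of the WARC spec).
URL_REVISIT_PROFILE = "urn:X-wpull:warc-revisit-url"


def load_excluded_urls(lines):
    """Collect URLs from an iterable of lines, one URL per line.

    Assumption: blank lines are ignored and URLs are matched exactly.
    """
    urls = set()
    for line in lines:
        line = line.strip()
        if line:
            urls.add(line)
    return urls


def build_record_headers(url, excluded_urls, now=None):
    """Return the WARC header fields for a fetched URL (hypothetical helper).

    Excluded URLs get a revisit record with the nonstandard profile and
    no WARC-Refers-To, since there is no earlier record to point at.
    """
    now = now or datetime.now(timezone.utc)
    headers = {
        "WARC-Type": "response",
        "WARC-Record-ID": "<urn:uuid:{}>".format(uuid.uuid4()),
        "WARC-Date": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI": url,
    }
    if url in excluded_urls:
        headers["WARC-Type"] = "revisit"
        headers["WARC-Profile"] = URL_REVISIT_PROFILE
    return headers


# Usage: in practice the lines would come from the file given to the option.
url_file = io.StringIO("http://example.com/a\n\nhttp://example.com/b\n")
excluded = load_excluded_urls(url_file)
print(build_record_headers("http://example.com/a", excluded)["WARC-Type"])
print(build_record_headers("http://example.com/c", excluded)["WARC-Type"])
```

Only the header decision is shown here. Whether the revisit record should still carry the HTTP response headers as its block, as the CDX dedup records do, is left open.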
