An option to exclude URLs from being stored in the WARC, similar to --cdx-dedup. Maybe --url-dedup. This option would accept a filename containing a list of URLs whose content should not be stored in the WARC.
It would reuse the revisit WARC-Type logic from CDX dedup, but since we have neither a previous WARC Record ID (as with profile http://netpreserve.org/warc/1.0/revisit/identical-payload-digest) nor a timestamp (as with profile http://netpreserve.org/warc/1.0/revisit/server-not-modified), we won't have a WARC-Refers-To field.
We'll need to use a nonstandard WARC-Profile, maybe urn:X-wpull:warc-revisit-url.
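To make the proposal concrete, here is a minimal sketch of the decision it describes: read a URL list, then emit revisit headers with the nonstandard profile for listed URLs and ordinary response headers otherwise. The function names `load_url_set` and `build_record_headers` are hypothetical and do not exist in wpull; the `--url-dedup` flag and the profile URI are only the suggestions made above, and real WARC records would need further fields (record ID, date, digests) that are omitted here.

```python
import io

# Profile URI suggested in this issue; it is NOT part of the WARC standard.
URL_REVISIT_PROFILE = 'urn:X-wpull:warc-revisit-url'


def load_url_set(lines):
    """Return the set of non-empty, whitespace-stripped URLs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def build_record_headers(url, excluded_urls):
    """Return a dict of WARC header fields for a fetched URL (hypothetical helper).

    Listed URLs get a revisit record with the nonstandard profile and no
    WARC-Refers-To field; all other URLs get an ordinary response record.
    """
    headers = {'WARC-Target-URI': url}
    if url in excluded_urls:
        headers['WARC-Type'] = 'revisit'
        headers['WARC-Profile'] = URL_REVISIT_PROFILE
    else:
        headers['WARC-Type'] = 'response'
    return headers


# Stand-in for the file that would be passed to the proposed --url-dedup option.
url_file = io.StringIO('http://example.com/a\n\nhttp://example.com/b\n')
excluded = load_url_set(url_file)

print(build_record_headers('http://example.com/a', excluded))
print(build_record_headers('http://example.com/c', excluded))
```

Matching here is by exact string; whether URLs should be normalized before comparison is left open by the proposal.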