Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) #160

pirate · 2019-03-05T19:26:56Z

ArchiveBox should be able to load WARCs from outside sources, replay them with pywb, and re-archive them using all the redundant archive methods like Chrome Headless, Wget, etc.

This would be most useful when a user tries to archive a URL that is already down, or that is not accessible to the ArchiveBox server.

ArchiveBox should be able to ingest a user-provided .warc / .warcz / .warc.gz, auto-fetch any available WARC from Archive.org / Archive.it / Archive.is / etc., or as a last resort auto-fetch from search engine caches (Google / Bing / Yahoo / Yandex / etc.).

Related issues:

Switch all dependencies to pure python and release ArchiveBox pip package #177 Setting up pyppeteer and pywb (needs to be finished before this can start)
Extend WARC file with all requests made via all archive methods #130 Record all ArchiveBox requests into the generated WARC files with pywb's proxy archiver
Archive Method: Replace archive.org-only with ArchiveNow to push to multiple 3rd-party services #146 Add ability to export ArchiveBox WARC files to 3rd party archiving services
Add official support for taking multiple snapshots of websites over time #179 Add support for multiple snapshots of archived sites
Add optional http proxy which archives all traffic #63 Adding support for HTTP proxy archiving

WARCs should be directly importable using archivebox add ~/Downloads/path/to/some/warc.gz, and ArchiveBox should be configurable to run the fallback searches on 3rd-party services automatically when the original URL returns a 404/403/etc.
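For archivebox add to treat a local path as a WARC rather than as a list of URLs, it would first need to recognize the file. Here is a minimal sketch of one way to do that, assuming only that WARC records begin with a "WARC/" version line and that compressed WARCs are ordinary gzip files. The function name looks_like_warc is hypothetical and is not part of ArchiveBox.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
WARC_PREFIX = b"WARC/"     # every WARC record starts with a "WARC/<version>" line


def looks_like_warc(path):
    """Return True if the file begins with a WARC record header (gzipped or plain).

    Hypothetical helper for illustration only, not an ArchiveBox API.
    """
    path = Path(path)
    with path.open("rb") as raw:
        magic = raw.read(2)

    # Pick a reader based on the gzip magic bytes rather than the file extension
    opener = gzip.open if magic == GZIP_MAGIC else open
    try:
        with opener(path, "rb") as f:
            return f.read(len(WARC_PREFIX)) == WARC_PREFIX
    except (OSError, EOFError):
        # Corrupt or truncated gzip data: treat as not a usable WARC
        return False
```

Checking the content instead of the extension means a misnamed file (for example a .gz that is really a tarball) is rejected early, before any replay or re-archiving is attempted.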

There are a few tools that may be helpful to integrate to achieve these goals:

This should allow us to redundantly archive URLs using ArchiveBox even when the original sites are no longer available.


muramasatheninja · 2020-04-02T07:13:02Z

Would very much like to see this feature. I have already made a bunch of WARC files and would love to have a way to bring them into ArchiveBox.

TheAnachronism · 2021-06-06T20:56:50Z

What is the status on this?
I currently have the problem that I can't archive content from sites that have some kind of authentication or maturity filter, so I wanted to archive it manually and then upload it into ArchiveBox. But there doesn't seem to be any workflow for this.

pirate · 2021-06-07T05:35:07Z

Right now there's no official workflow or concrete plan to add this in the short term, but for now you can put any manual WARCs inside archive/<timestamp>/warc/*.warc.gz and archivebox won't touch them there. They won't show up in the UI, but AB won't delete/move them either, so it's a safe place to put them. If you want, you can even manually create an ArchiveResult entry to track those WARC files on the Log page or via archivebox shell; that way they'll show up in the UI and carry any metadata you want to attach about how/when you saved them.
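The manual workaround above can be scripted. This is a rough sketch, assuming data_dir is the ArchiveBox data directory that contains archive/ and that the snapshot folder for the given timestamp already exists. The helper name stash_warc is made up for illustration and is not an ArchiveBox API; it only copies the file and does not create any ArchiveResult entry.

```python
import shutil
from pathlib import Path


def stash_warc(warc_path, data_dir, timestamp):
    """Copy a WARC into <data_dir>/archive/<timestamp>/warc/ and return the new path.

    Hypothetical helper: the directory layout follows the comment above,
    everything else is an assumption.
    """
    src = Path(warc_path)
    snapshot_dir = Path(data_dir) / "archive" / timestamp
    if not snapshot_dir.is_dir():
        # Refuse to invent a snapshot folder that ArchiveBox does not know about
        raise FileNotFoundError(f"no snapshot directory at {snapshot_dir}")

    warc_dir = snapshot_dir / "warc"
    warc_dir.mkdir(exist_ok=True)

    dest = warc_dir / src.name
    if dest.exists():
        # Never overwrite a WARC that is already stored for this snapshot
        raise FileExistsError(f"{dest} already exists")

    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest
```

Copying instead of moving leaves the original file in place, so a failed or mistaken import does not lose the only copy.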

refparo · 2021-07-04T09:09:42Z

Looking forward to this being added. It would make it easier to get ArchiveBox to work with browser extensions like https://github.com/machawk1/warcreate

pirate · 2023-06-13T04:07:12Z

I don't have any updates on progress here, but I did just think of an idea that I think would be related to this feature: adding support for automatically finding 3rd party copies of pages on Archive.org/search engine caches/etc. and pulling them into ArchiveBox.

My ideal vision of this feature is that it covers the case where a user tries to archive a URL that is already down / no longer available from the original server.

The flow from there could be to:

try to find a copy on archive.org / archive.it / archive.is / etc. and save their warc to the ArchiveBox Snapshot
try to find copies of the page in search engine caches (Google, Bing, Yahoo, Yandex, etc.) and save that to our Snapshot
try to find alternative non-canonical URLs for the page using search engines, and attempt to archive those versions instead
allow the user to manually upload / ingest a WARC from a URL or local filesystem to save to the ArchiveBox Snapshot
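As a sketch of the first step, the Wayback Machine's public availability endpoint can report whether archive.org holds a snapshot of a URL. Note the assumptions: this only returns the replay URL of the closest snapshot, not a WARC file, so turning it into a WARC inside the Snapshot would still need a separate capture step. The function names are hypothetical and nothing here is existing ArchiveBox code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query URL for a page."""
    return f"{WAYBACK_AVAILABLE}?{urlencode({'url': page_url})}"


def closest_snapshot(payload):
    """Extract the closest snapshot's replay URL from a decoded JSON response.

    Returns None when no snapshot is reported as available.
    """
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available") and closest.get("url"):
        return closest["url"]
    return None


def find_wayback_copy(page_url, timeout=10):
    """Query archive.org over the network and return a replay URL or None."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Keeping the response parsing separate from the network call makes the lookup easy to test offline and to reuse if other services are added later with their own fetchers.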

These options should be disabled by default (because it's not safe to give the user the impression that the original page was archived when in fact we got it from a 3rd-party mirror), but configurable via ArchiveBox.conf / env variables. I also imagine having an option where users could enable these 3rd-party archive imports even if the original URL is up, so that they can save every available version of the site every time.
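To make the opt-in behaviour concrete, here is a small sketch of a fallback chain where every source is off unless its flag is set, and where the result is always tagged with the source it came from so it cannot be mistaken for the original page. The flag names (for example IMPORT_FROM_ARCHIVE_ORG) and both helpers are hypothetical, not existing ArchiveBox config options.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment; unset means `default`."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in TRUTHY


def import_from_fallbacks(url, sources, environ=None):
    """Try each enabled 3rd-party source in order.

    `sources` is a list of (flag_name, label, fetcher) tuples, where fetcher(url)
    returns a result or None. Returns (label, result) for the first enabled
    source that finds something, or None if nothing is enabled or found.
    """
    for flag_name, label, fetcher in sources:
        if not env_flag(flag_name, environ=environ):
            continue  # disabled by default: skip unless the user opted in
        result = fetcher(url)
        if result is not None:
            return label, result  # the label records where the copy came from
    return None
```

A usage example with stand-in fetchers:

```python
sources = [
    ("IMPORT_FROM_ARCHIVE_ORG", "archive.org", lambda url: None),
    ("IMPORT_FROM_SEARCH_CACHE", "search-cache", lambda url: "cached copy"),
]
import_from_fallbacks("https://example.com", sources,
                      environ={"IMPORT_FROM_SEARCH_CACHE": "true"})
```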

pirate added the labels status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet), size: medium, and why: functionality (Intended to improve ArchiveBox functionality or features) on Mar 5, 2019

pirate changed the title ~~Attempt to download Archive.org's WARC for URLs that 404 or contain archive.org in the domain~~ Ability to import WARCs: Load, replay, and re-archive user-provided/archive.org-provided WARC files Apr 30, 2019

pirate changed the title ~~Ability to import WARCs: Load, replay, and re-archive user-provided/archive.org-provided WARC files~~ Ability to import WARCs: Import, replay, and re-archive user-provided or archive.org-provided WARC files Apr 30, 2019

pirate mentioned this issue Aug 13, 2020

Feature Request: archive.today family integration #439

Closed

9 tasks

pirate mentioned this issue Dec 2, 2020

[Feature Request] Input An Archive Link, Get All Their Snapshots #560

Closed

berezovskyi mentioned this issue Aug 8, 2023

Feature Request: a web clipper #1203

Open

9 tasks

pirate mentioned this issue Mar 25, 2024

Support: singlefile & readability fail to work #1386

Open
