Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) #160
Comments
Would very much like to see this feature. I have already made a bunch of WARC files and would love to have a way to bring them into ArchiveBox.
What is the status on this?
Right now there's no official workflow or concrete plan to add this in the short term, but for now you can put any manual WARCs inside
Looking forward to this being added. This would make it easier to get ArchiveBox to work with browser extensions like https://github.com/machawk1/warcreate
I don't have any updates on progress here, but I did just think of an idea that I think would be related to this feature: adding support for automatically finding 3rd party copies of pages on Archive.org/search engine caches/etc. and pulling them into ArchiveBox. My ideal vision of this feature is that it covers the case where a user tries to archive a URL that is already down / no longer available from the original server. The flow from there could be to:
These options should be disabled by default (because it's not safe to give the impression to the user that it was the original page that was archived when in fact we got it from a 3rd-party mirror), but configurable via
ArchiveBox should be able to load WARCs from outside sources, replay them with pywb, and re-archive them using all the redundant archive methods like Chrome Headless, Wget, etc.
This would be most useful when a user tries to archive a URL that is already down, or that is not accessible to the ArchiveBox server.
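As a rough illustration of the first step (finding out which pages an outside WARC contains so they can be replayed and re-archived), here is a minimal standard-library sketch. It is not ArchiveBox code: the function names `iter_warc_records` and `target_urls` are hypothetical, and it assumes a well-formed WARC with CRLF line endings and no folded header lines. A real implementation would more likely rely on a dedicated WARC library and on pywb for replay.

```python
import gzip
import io


def iter_warc_records(data: bytes):
    """Yield (headers, body) for each record in a plain or gzipped WARC.

    Hypothetical helper for illustration; header names are lower-cased.
    """
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes (.warc.gz)
        # GzipFile reads through concatenated gzip members, which is how
        # per-record-compressed WARCs are laid out.
        data = gzip.GzipFile(fileobj=io.BytesIO(data)).read()

    pos = 0
    while pos < len(data):
        # Skip the blank lines that separate records.
        if data.startswith(b"\r\n", pos):
            pos += 2
            continue
        head_end = data.find(b"\r\n\r\n", pos)
        if head_end == -1:
            break  # truncated trailing data, stop quietly
        lines = data[pos:head_end].decode("utf-8", "replace").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError(f"not a WARC record at byte offset {pos}")
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        body_start = head_end + 4
        yield headers, data[body_start:body_start + length]
        pos = body_start + length


def target_urls(data: bytes):
    """Return the target URIs of all 'response' records, in file order."""
    return [
        headers["warc-target-uri"]
        for headers, _body in iter_warc_records(data)
        if headers.get("warc-type") == "response" and "warc-target-uri" in headers
    ]
```

The resulting URL list is what would be handed to the replay and re-archiving steps; how those steps are wired into ArchiveBox is left open here.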
ArchiveBox should be able to ingest a user-provided .warc/.warcz/.warc.gz, auto-fetch any available WARC from Archive.org / Archive.it / Archive.is / etc., or as a last resort auto-fetch from search engine caches (Google / Bing / Yahoo / Yandex / etc.).
Related issues:
WARCs should be easy to import directly using archivebox add ~/Downloads/path/to/some/warc.gz, or ArchiveBox should be configurable to do the fallback searches on 3rd party services automatically in the case of a 404/403/etc.
There are a few tools that may be helpful to integrate to achieve these goals:
This should allow us to redundantly archive URLs using ArchiveBox even when the original sites are no longer available.
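The decision flow described in this issue could be sketched roughly as follows. This is only an illustration under stated assumptions: `plan_import`, `is_warc_path`, `closest_snapshot`, and the `allow_third_party` flag are hypothetical names, not existing ArchiveBox options, and the only third-party lookup shown is the public Wayback Machine availability endpoint, which returns a snapshot URL rather than a WARC file. The `.warcz` extension mentioned above is left out because it is unclear how it should be handled.

```python
import json
from typing import Optional
from urllib.parse import urlencode

WARC_SUFFIXES = (".warc", ".warc.gz")
# The issue names 404/403 explicitly; other statuses are left undecided.
FALLBACK_STATUSES = {403, 404}


def is_warc_path(arg: str) -> bool:
    """True if the argument looks like a local WARC file rather than a URL."""
    return arg.lower().endswith(WARC_SUFFIXES)


def plan_import(arg: str, status: Optional[int] = None,
                allow_third_party: bool = False) -> str:
    """Pick an import strategy (hypothetical labels).

    status is the HTTP status from the original server, or None if the
    request failed entirely. Third-party fallback is off by default so a
    mirror copy is never silently presented as the original page.
    """
    if is_warc_path(arg):
        return "import-local-warc"
    if status is not None and status not in FALLBACK_STATUSES:
        return "archive-live-url"
    return "fetch-third-party-copy" if allow_third_party else "fail"


def wayback_availability_url(url: str) -> str:
    """Build a Wayback Machine availability query for the given URL."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Extract the closest snapshot URL from an availability response."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

For example, `plan_import("https://example.com/gone", status=404)` yields `"fail"` unless the caller passes `allow_third_party=True`, which matches the suggestion that fallback sources be opt-in.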