Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) #160

Open
pirate opened this issue Mar 5, 2019 · 5 comments
Labels
size: medium
status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet)
why: functionality (Intended to improve ArchiveBox functionality or features)

Comments

@pirate
Member

pirate commented Mar 5, 2019

ArchiveBox should be able to load WARCs from outside sources, replay them with pywb, and re-archive them using all the redundant archive methods like Chrome Headless, Wget, etc.

This would be most useful when a user tries to archive a URL that is already down, or that is not accessible to the ArchiveBox server.

ArchiveBox should be able to ingest a user-provided .warc / .warcz / .warc.gz file, auto-fetch any available WARC from Archive.org / Archive.it / Archive.is / etc., or, as a last resort, auto-fetch a copy from search engine caches (Google / Bing / Yahoo / Yandex / etc.).

Related issues:

WARCs should be easy to import directly using archivebox add ~/Downloads/path/to/some/warc.gz, and ArchiveBox should be configurable to run the fallback searches on 3rd-party services automatically when the original URL returns a 404/403/etc.
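
For a path passed on the command line to be treated as a WARC rather than a list of URLs, something has to recognize it first. Here is a minimal sketch of such a check, assuming only that WARC records begin with a `WARC/` version line and that gzipped files begin with the gzip magic bytes. `looks_like_warc` is a hypothetical helper, not an existing ArchiveBox function.

```python
import gzip
import zlib
from pathlib import Path


def looks_like_warc(path: Path) -> bool:
    """Return True if the file's first line is a WARC version line.

    Hypothetical helper for illustration; not part of ArchiveBox.
    """
    with open(path, "rb") as f:
        magic = f.read(2)

    # Gzip streams start with the bytes 0x1f 0x8b; plain WARCs start with text.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    try:
        with opener(path, "rb") as f:
            first_line = f.readline(64)
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated gzip data: treat as "not a WARC".
        return False

    # Every WARC record opens with a version line such as "WARC/1.0".
    return first_line.startswith(b"WARC/")
```

A check like this only sniffs the first record; it does not validate the rest of the file.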

There are a few tools that may be helpful to integrate to achieve these goals:

This should allow us to redundantly archive URLs using ArchiveBox even when the original sites are no longer available.

@pirate pirate added status: idea-phase Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet size: medium why: functionality Intended to improve ArchiveBox functionality or features labels Mar 5, 2019
@pirate pirate changed the title Attempt to download Archive.org's WARC for URLs that 404 or contain archive.org in the domain Ability to import WARCs: Load, replay, and re-archive user-provided/archive.org-provided WARC files Apr 30, 2019
@pirate pirate changed the title Ability to import WARCs: Load, replay, and re-archive user-provided/archive.org-provided WARC files Ability to import WARCs: Import, replay, and re-archive user-provided or archive.org-provided WARC files Apr 30, 2019
@muramasatheninja

Would very much like to see this feature. I have already made a bunch of WARC files and would love to have a way to bring them into ArchiveBox.

@TheAnachronism

What is the status on this?
I currently have the problem that I can't archive content from sites that have some kind of authentication or maturity filter, so I wanted to archive it manually and then upload it into ArchiveBox. But there doesn't seem to be any workflow for this.

@pirate
Member Author

pirate commented Jun 7, 2021

Right now there's no official workflow or concrete plan to add this in the short term, but for now you can put any manual WARCs inside archive/<timestamp>/warc/*.warc.gz and ArchiveBox won't touch them there. They won't show up in the UI, but AB won't delete or move them either, so it's a safe place to put them. If you want, you can even manually create an ArchiveResult entry to track those WARC files, either on the Log page or via archivebox shell; that way they'll show up in the UI and can carry any metadata you want to attach about how and when you saved them.
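
The manual workaround amounts to copying a file into an existing snapshot folder. A minimal sketch, assuming the data directory layout described in this comment (`archive/<timestamp>/warc/`); `stash_warc` and its parameters are hypothetical names for illustration, and it deliberately does not create the ArchiveResult entry.

```python
import shutil
from pathlib import Path


def stash_warc(warc_path: Path, data_dir: Path, timestamp: str) -> Path:
    """Copy a manually created WARC into archive/<timestamp>/warc/.

    Hypothetical helper; assumes the snapshot folder already exists.
    """
    snapshot_dir = data_dir / "archive" / timestamp
    if not snapshot_dir.is_dir():
        # Avoid creating folders that do not correspond to a real snapshot.
        raise FileNotFoundError(f"no snapshot folder at {snapshot_dir}")

    warc_dir = snapshot_dir / "warc"
    warc_dir.mkdir(exist_ok=True)

    dest = warc_dir / warc_path.name
    if dest.exists():
        # Never overwrite a WARC that is already stored there.
        raise FileExistsError(f"refusing to overwrite {dest}")

    shutil.copy2(warc_path, dest)  # copy2 also preserves file timestamps
    return dest
```

Refusing to overwrite keeps this safe to re-run against a folder that already holds WARCs.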

@refparo

refparo commented Jul 4, 2021

Looking forward to this being added. It would make it easier to get ArchiveBox to work with browser extensions like https://github.com/machawk1/warcreate

@pirate pirate changed the title Ability to import WARCs: Import, replay, and re-archive user-provided or archive.org-provided WARC files Ability to import user-provided or 3rd-party WARCs: e.g. if user tries to archive a URL that is already down, save WARC from archive.org/archive.it, search engine caches, or manual import instead Jun 13, 2023
@pirate
Member Author

pirate commented Jun 13, 2023

I don't have any updates on progress here, but I did just think of an idea that I think would be related to this feature: adding support for automatically finding 3rd party copies of pages on Archive.org/search engine caches/etc. and pulling them into ArchiveBox.

My ideal vision of this feature is that it covers the case where a user tries to archive a URL that is already down / no longer available from the original server.

The flow from there could be to:

  • try to find a copy on archive.org / archive.it / archive.is / etc. and save their warc to the ArchiveBox Snapshot
  • try to find copies of the page in search engine caches (Google, Bing, Yahoo, Yandex, etc.) and save that to our Snapshot
  • try to find alternative non-canonical URLs for the page using search engines, and attempt to archive those versions instead
  • allow the user to manually upload / ingest a WARC from a URL or local filesystem to save to the ArchiveBox Snapshot
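
As a rough illustration of the first step, the sketch below asks the Wayback Machine availability endpoint whether archive.org holds a copy of a URL. It only locates a replay URL; obtaining an actual WARC from archive.org is a separate problem that this does not solve. The function names are hypothetical, and the response shape (`archived_snapshots.closest`) is assumed from the public availability API.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url: str) -> str:
    """Build the availability query URL for a page."""
    return f"{AVAILABILITY_API}?{urlencode({'url': page_url})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the closest snapshot's replay URL from a parsed response."""
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available") and closest.get("url"):
        return closest["url"]
    return None


def find_wayback_copy(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query archive.org over the network; returns None if no copy is listed."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Keeping the parsing separate from the network call makes the lookup easy to test offline.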

These options should be disabled by default (because it's not safe to give the user the impression that the original page was archived when in fact we got it from a 3rd-party mirror), but configurable via ArchiveBox.conf / env variables. I also imagine having an option where users could enable these 3rd-party archive imports even if the original URL is up, so that they can save every available version of the site every time.
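
One way the opt-in behavior could look, as a sketch only: each fallback is off unless its environment variable is set, and the result records where the copy came from so a mirror is never mistaken for the original. The option names (`IMPORT_FROM_WEB_ARCHIVES`, etc.) and the function names are hypothetical; ArchiveBox does not define them.

```python
import os
from typing import Callable, List, Mapping, Optional, Tuple

# Hypothetical option names, listed in the order they would be tried.
FALLBACK_OPTIONS = (
    "IMPORT_FROM_WEB_ARCHIVES",
    "IMPORT_FROM_SEARCH_CACHES",
    "IMPORT_FROM_ALTERNATE_URLS",
)

Fetcher = Callable[[str], Optional[str]]


def enabled_fallbacks(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the fallback options switched on in env (all default to off)."""
    truthy = ("1", "true", "yes")
    return [
        name for name in FALLBACK_OPTIONS
        if env.get(name, "False").strip().lower() in truthy
    ]


def archive_with_fallbacks(
    url: str,
    fetch_original: Fetcher,
    fallbacks: Mapping[str, Fetcher],
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (source, content); source is "original" or the option name used."""
    content = fetch_original(url)
    if content is not None:
        return ("original", content)

    # Only consult 3rd-party sources the user explicitly enabled.
    for name in enabled_fallbacks(env):
        fetcher = fallbacks.get(name)
        if fetcher is None:
            continue
        content = fetcher(url)
        if content is not None:
            return (name, content)

    return (None, None)
```

Returning the source alongside the content leaves room for the UI to label 3rd-party copies clearly.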

@pirate pirate changed the title Ability to import user-provided or 3rd-party WARCs: e.g. if user tries to archive a URL that is already down, save WARC from archive.org/archive.it, search engine caches, or manual import instead Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) Jun 13, 2023

4 participants