Feature Request: archive.today family integration #439
Comments
When I try to archive an archive.today page, I get errors and the archive is a directory of junk instead of the actual page.
|
@jaw-sh can you provide the command (with url) you are testing? I can give it a check (It is probably being blocked by the target url). |
http://archive.vn/nX7fq All I really want is the static, non-interactive version of the page they already archived.
|
ArchiveBox really needs a way to capture the DOM at "first rest" when the page is fully loaded. With Twitter, the archive is completely mangled because it tries to totally replicate the entire Twitter living webpage. Instagram is also completely broken. I can open a new issue for this and I am willing to put cash bounties on these things. https://twitter.com/dril/status/134787490526658561 |
@jaw-sh it does capture at first rest with 2 of the methods, the DOM dump and Singlefile. Have you tried looking at those outputs? |
I have the single-file binary set. It didn't work at all before I set it. |
We have a fix in an incoming PR that will disable it by default. The Docker setup supports all of the extractors out of the box. |
@cdvv7788 Sounds good. What I really, really need is this:
I am willing to pay for this. |
This is already present, you can POST to
As mentioned above, this is already present, both the DOM dump and SingleFile methods archive "at first rest", i.e. ~1s after DOM.ready event fires.
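For readers who want to see what a DOM dump looks like outside ArchiveBox, here is a minimal sketch using headless Chrome's `--dump-dom` flag. It is not ArchiveBox's extractor code. The binary name `chromium` and both function names are assumptions for illustration, and the sketch does not control the "~1s after DOM.ready" timing mentioned above; it only returns whatever DOM Chrome serializes.

```python
import subprocess


def build_dump_dom_command(url, chrome_binary="chromium"):
    """Build the argument list for a headless DOM dump.

    chrome_binary is an assumption: the executable may be named
    chromium, chromium-browser, google-chrome, etc. on your system.
    """
    return [chrome_binary, "--headless", "--dump-dom", url]


def dump_dom(url, chrome_binary="chromium", timeout=60):
    """Run headless Chrome and return the serialized DOM as a string."""
    result = subprocess.run(
        build_dump_dom_command(url, chrome_binary),
        capture_output=True,  # collect stdout (the DOM) and stderr
        text=True,            # decode output to str
        timeout=timeout,      # give up on pages that never settle
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `dump_dom("https://twitter.com/dril/status/134787490526658561")` would return the rendered HTML as text, assuming a Chrome or Chromium binary is installed and the site serves the page to it.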
I'm afraid this is not easily possible: archive.today explicitly does not expose an API that allows users to download their snapshots. If they did have such an API, then that task would fall under the umbrella of this ticket: #160 If you are serious about this, be aware that funding development on this issue would be on the order of $5k USD or more. We run a software consultancy and you can find more info about hiring us here: Monadical.com. Also related (for improving exporting to sites like archive.today/archive.org): #146 |
archive.today/is/vn/fi does not use the WARC format; it exports a .zip download. Even if it's not easy, converting that .zip download into WARC and using it as a snapshot is something I would pay for. I have thousands of these links I would like to host myself. I must be missing something re: the SingleFile archive. Is there a special config setting I have to set to explicitly use SingleFile? I believe I am already using it, but Instagram and Twitter archives are malformed. I had to create a binary to get any archive to work. |
We might be able to download that ZIP and rehost it verbatim in the ArchiveBox index without converting it to WARC. ArchiveBox wouldn't be able to run any of its own extractors though (wget, youtubedl, git, chrome, etc.), you'd basically just see the archive.today version in the index with none of ArchiveBox's own functionality. Is that what you're asking for? https://github.com/pirate/ArchiveBox/wiki/Usage#ui-usage All archive methods (that are installed) are run for every URL, you can access them by clicking the favicon next to the title, or any of the icons in the "Files" column. |
I'm merging this feature with #160, which is a more general TODO to add support for searching/importing from 3rd party archiving platforms. Please subscribe to that issue for progress updates / discussions. |
The archive.today sites (including archive.is, archive.md, archive.vn, archive.fi, etc.) should have special integrations.
Type
What is the problem that your feature request solves
archive.today's webmaster uses its status for activism. Using browsers the webmaster does not like (Brave) will result in the site being unusable. I would like to locally archive all archive.today links.
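As a rough illustration of how a link list could be filtered down to archive.today links, here is a minimal sketch. The domain set contains only the domains named in this issue, so it is likely incomplete, and `is_archive_today_url` is a hypothetical helper, not part of ArchiveBox.

```python
from urllib.parse import urlsplit

# Only the domains named in this issue; the real family may have more mirrors.
ARCHIVE_TODAY_DOMAINS = {
    "archive.today",
    "archive.is",
    "archive.md",
    "archive.vn",
    "archive.fi",
}


def is_archive_today_url(url: str) -> bool:
    """Return True if the URL's host is one of the known archive.today domains."""
    host = (urlsplit(url).hostname or "").lower()
    # Treat "www.archive.is" the same as "archive.is".
    if host.startswith("www."):
        host = host[4:]
    return host in ARCHIVE_TODAY_DOMAINS
```

For example, `is_archive_today_url("http://archive.vn/nX7fq")` is true, while a `twitter.com` or `web.archive.org` URL is not matched.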
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
There is a .ZIP download available for every archive which can be downloaded, unzipped, and converted into the archive format.
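The unzip step might look something like the sketch below. It assumes the ZIP has already been downloaded by hand (the thread notes there is no API for fetching it), and it makes no claim about the layout of archive.today's ZIP files or about ArchiveBox's snapshot format. The name `extract_snapshot_zip` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_snapshot_zip(zip_path, dest_dir):
    """Unpack a manually downloaded snapshot ZIP into dest_dir.

    Returns the list of extracted file paths. Raises ValueError if an
    entry would be written outside dest_dir (a "zip slip" path).
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            target = (dest / member.filename).resolve()
            # Refuse entries such as "../evil.txt" that escape dest.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.filename}")
            if member.is_dir():
                continue  # directories are created implicitly on extract
            archive.extract(member, dest)
            extracted.append(target)
    return extracted
```

The extracted files could then be rehosted as-is; converting them to WARC or another archive format would be a separate step that this sketch does not attempt.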
What hacks or alternative solutions have you tried to solve the problem?
Currently, my attempts at `archivebox add`ing archive.today links result in the archive failing.
How badly do you want this new feature?