
Feature Request: archive.today family integration #439

Closed
4 of 9 tasks
jaw-sh opened this issue Aug 13, 2020 · 14 comments
Labels
status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet)
why: functionality (Intended to improve ArchiveBox functionality or features)

Comments

@jaw-sh

jaw-sh commented Aug 13, 2020

The archive.today sites (including archive.is, archive.md, archive.vn, archive.fi, etc.) should have special integrations.

Type

  • General question or discussion
  • Propose a brand new feature
  • Request modification of existing behavior or design

What is the problem that your feature request solves

archive.today's webmaster uses the site's position for activism. Using a browser the webmaster does not like (e.g. Brave) makes the site unusable. I would like to archive all archive.today links locally.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

There is a .ZIP download available for every archive which can be downloaded, unzipped, and converted into the archive format.

What hacks or alternative solutions have you tried to solve the problem?

Currently, my attempts to add archive.today links to ArchiveBox result in the archive failing.

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually

  • I'm willing to contribute dev time / money to fix this issue
  • I like ArchiveBox so far / would recommend it to a friend
  • I've had a lot of difficulty getting ArchiveBox set up
@jaw-sh added the "why: functionality" and "status: idea-phase" labels on Aug 13, 2020
@jaw-sh
Author

jaw-sh commented Aug 13, 2020

When I try to archive an archive.today page, I get errors and the archive is a directory of junk instead of the actual page.

[+] [2020-08-13 15:18:24] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597331904-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-08-13 15:18:24] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

[▶] [2020-08-13 15:18:25] Collecting content for 1 Snapshots in archive...

[+] [2020-08-13 15:18:25] "archive.is/nX7fq"
    http://archive.is/nX7fq
    > ./archive/1597331904
      > title
        Failed:
            ConnectionError HTTPConnectionPool(host='archive.is', port=80): Max retries exceeded with url: /nX7fq (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No route to host'))
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" http://archive.is/nX7fq

      > favicon
      > wget
        Failed:
            Wget failed or got an error from the server
            Got wget response code: 4.
            failed: No route to host.
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=warc/1597331908 --page-requisites "--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto http://archive.is/nX7fq

      > singlefile
        Failed:
            Exception Failed to chmod: /opt/archive/archive/1597331904/singlefile.html does not exist (did the previous step fail?)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            /opt/SingleFile/cli/single-file --browser-executable-path=chromium "--browser-args="["--headless", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36", "--window-size=1440,2000"]"" http://archive.is/nX7fq /opt/archive/archive/1597331904/singlefile.html

      > pdf
        Failed:
            Failed to save PDF
            [0813/151832.239630:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x563968f29529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x563968e87253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf http://archive.is/nX7fq

      > screenshot
        Failed:
            Failed to save screenshot
            [0813/151847.542441:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x55ee106ed529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x55ee1064b253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --screenshot http://archive.is/nX7fq

      > dom
        Failed:
            Failed to save DOM
            [0813/151902.868207:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x55d478e36529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x55d478d94253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --dump-dom http://archive.is/nX7fq

      > media
        Failed:
            Failed to save media
            Got youtube-dl response code: 1.
            WARNING: Could not send HEAD request to http://archive.is/nX7fq:
            ERROR: Unable to download webpage: (caused by URLError(OSError(113, 'No route to host')))
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist http://archive.is/nX7fq

      > archive_org

[√] [2020-08-13 15:19:09] Update of 1 pages complete (44.27 sec)
    - 0 links skipped
    - 0 links updated
    - 2 links had errors

    Hint: To view your archive index, open:
        /opt/archive/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-13 15:19:09] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

@cdvv7788
Contributor

@jaw-sh can you provide the command (with URL) you are testing? I can give it a check (it is probably being blocked by the target URL).

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

@cdvv7788 http://archive.is/nX7fq

@cdvv7788
Contributor

cdvv7788 commented Aug 13, 2020

[image attachment]
It works for me. How are you trying to run archivebox? Are you setting some environment variable or changing some configuration? Are you on master?

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

http://archive.vn/nX7fq
https://tinf.io/archive/1597333033/

All I really want is the static, non-interactive version of the page they already archived.

[i] [2020-08-13 15:37:12] ArchiveBox v0.4.13: archivebox add http://archive.vn/nX7fq
    > /opt/archive

[+] [2020-08-13 15:37:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597333033-import.txt
    > Parsed 1 URLs from input (Plain Text)                                                  
    > Found 1 new URLs not already in index                                                  

[*] [2020-08-13 15:37:13] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3                                                             
    √ /opt/archive/index.json                                                                
    √ /opt/archive/index.html                                                                

[▶] [2020-08-13 15:37:14] Collecting content for 1 Snapshots in archive...

[+] [2020-08-13 15:37:15] "archive.vn/nX7fq"
    http://archive.vn/nX7fq
    > ./archive/1597333033
      > title
        Failed:                                                                              
            ConnectionError HTTPConnectionPool(host='archive.vn', port=80): Max retries exceeded with url: /nX7fq (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f52fc3b7ac8>: Failed to establish a new connection: [Errno -5] No address associated with hostname'))
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" http://archive.vn/nX7fq

      > favicon
      > wget                                                                                 
        Failed:                                                                              
            TimeoutExpired Command '['wget', '--no-verbose', '--adjust-extension', '--convert-links', '--force-directories', '--backup-converted', '--span-hosts', '--no-parent', '-e', 'robots=off', '--timeout=60', '--restrict-file-names=windows', '--warc-file=warc/1597333035', '--page-requisites', '--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1', '--compression=auto', 'http://archive.vn/nX7fq']' timed out after 60 seconds
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=warc/1597333035 --page-requisites "--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto http://archive.vn/nX7fq

      > singlefile
        Failed:                                                                              
            Exception Failed to chmod: /opt/archive/archive/1597333033/singlefile.html does not exist (did the previous step fail?)
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            /opt/SingleFile/cli/single-file --browser-executable-path=chromium "--browser-args="["--headless", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36", "--window-size=1440,2000"]"" http://archive.vn/nX7fq /opt/archive/archive/1597333033/singlefile.html

      > pdf
      > screenshot                                                                           
      > dom                                                                                  
      > media                                                                                
        Failed:                                                                              
             Failed to save media
            Got youtube-dl response code: 1.
            WARNING: Could not send HEAD request to http://archive.vn/nX7fq: <urlopen error [Errno 99] Cannot assign requested address>
            ERROR: Unable to download webpage: <urlopen error [Errno 99] Cannot assign requested address> (caused by URLError(OSError(99, 'Cannot assign requested address')))
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist http://archive.vn/nX7fq

      > archive_org
        Failed:                                                                              
             WaybackException: java.lang.IllegalStateException: Payload size does not match content-length!
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            curl --silent --location --head --compressed --max-time 60 --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" https://web.archive.org/save/http://archive.vn/nX7fq


[√] [2020-08-13 15:38:20] Update of 1 pages complete (1.10 min)
    - 0 links skipped
    - 0 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /opt/archive/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-13 15:38:20] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3                                                             
    √ /opt/archive/index.json                                                                
    √ /opt/archive/index.html

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

ArchiveBox really needs a way to capture the DOM at "first rest", when the page is fully loaded. With Twitter, the archive is completely mangled because it tries to replicate the entire living Twitter webpage. Instagram is also completely broken.

I can open a new issue for this and I am willing to put cash bounties on these things.

https://twitter.com/dril/status/134787490526658561
https://tinf.io/archive/1597335380/twitter.com/dril/status/134787490526658561.html
[image attachment]

@pirate
Member

pirate commented Aug 13, 2020

@jaw-sh it does capture at first rest with two of the methods, the DOM dump and SingleFile. Have you tried looking at those outputs?

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

I have the single-file binary set. It didn't work at all before I set it.

@cdvv7788
Contributor

I have the single-file binary set. It didn't work at all before I set it.

We have a fix in an incoming PR that will disable it by default. The Docker image supports all of the extractors out of the box.

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

@cdvv7788 Sounds good.

What I really, really need is this:

  • A way to request URLs be archived programmatically via a POST/GET request from another service (auto-archiving links posted by users).
  • A way to request web content be archived at their first rest state, identical to how archive.today works.
  • A way to import archive.today archives exactly as they appear on archive.today, instead of just the archive.today page.

I am willing to pay for this.

@pirate
Member

pirate commented Aug 13, 2020

A way to request URLs be archived programmatically via a POST/GET request from another service (auto-archiving links posted by users).

This is already present: you can POST to https://127.0.0.1:8000/admin/core/snapshot/add/ with the following fields to archive a link:

  • url: str (a string containing any number of URLs)
  • depth: int (either 0 or 1, as detailed in archivebox add --help)
  • your session cookie header to authenticate the request
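The request described above can be sketched in Python with only the standard library. This is a sketch, not official ArchiveBox client code: `build_add_request` is a hypothetical helper, the server address is assumed to be a local instance, and the session cookie value is a placeholder you would copy from a logged-in browser session.

```python
from urllib.parse import urlencode
from urllib.request import Request

ARCHIVEBOX = "http://127.0.0.1:8000"        # assumed local ArchiveBox server
SESSION_COOKIE = "sessionid=REPLACE_ME"     # placeholder: copy from a logged-in browser

def build_add_request(url: str, depth: int = 0) -> Request:
    """Build the POST that asks ArchiveBox to snapshot `url` (depth 0 or 1)."""
    body = urlencode({"url": url, "depth": depth}).encode()
    return Request(
        f"{ARCHIVEBOX}/admin/core/snapshot/add/",
        data=body,
        headers={"Cookie": SESSION_COOKIE},  # authenticates the request
        method="POST",
    )

req = build_add_request("https://example.com", depth=0)
# urllib.request.urlopen(req) would actually submit it; omitted here
# because it needs a live, authenticated server.
```

A service that wants to auto-archive user-posted links would call `urlopen(build_add_request(link))` for each link as it arrives.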

A way to request web content be archived at their first rest state, identical to how archive.today works.

As mentioned above, this is already present: both the DOM dump and SingleFile methods archive "at first rest", i.e. ~1s after the DOM ready event fires.
The other methods do not execute JS, so "page ready" is not a concept that applies to them.
Your screenshot is of the wget output method only; have you tried viewing the SingleFile or DOM dump outputs? They should generally work fine for Twitter.

A way to import archive.today archives exactly as they appear on archive.today, instead of just the archive.today page.

I'm afraid this is not easily possible; archive.today explicitly does not expose an API that allows users to download their snapshots. If they did have such an API, that task would fall under the umbrella of this ticket: #160

[image attachment]

If you are serious about this, be aware that funding development on this issue would be on the order of $5k USD or more. We run a software consultancy; you can find more info about hiring us here: Monadical.com.

Also related (for improving exporting to sites like archive.today/archive.org): #146

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

archive.today/is/vn/fi does not use the WARC format; they export a .zip download. Even if it's not easy, converting that .zip download into WARC and using it as a snapshot is something I would pay for. I have thousands of these links I would like to host myself.

I must be missing something re: the SingleFile archive. Is there a special config setting I have to set to explicitly use SingleFile? I believe I am already using it, but Instagram and Twitter archives are malformed. I had to set a binary to get any archive to work.

@pirate
Member

pirate commented Aug 13, 2020

We might be able to download that ZIP and rehost it verbatim in the ArchiveBox index without converting it to WARC. ArchiveBox wouldn't be able to run any of its own extractors though (wget, youtubedl, git, chrome, etc.), you'd basically just see the archive.today version in the index with none of ArchiveBox's own functionality. Is that what you're asking for?
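The "rehost the ZIP verbatim" idea could look roughly like the sketch below, assuming the export is an ordinary zip of index.html plus assets. `extract_snapshot_zip` is a hypothetical helper, not part of ArchiveBox, and the download step is deliberately left out since archive.today exposes no documented API for fetching the zip.

```python
import io
import zipfile
from pathlib import Path

def extract_snapshot_zip(zip_bytes: bytes, snapshot_dir: Path) -> list[str]:
    """Unpack an archive.today .zip export into a snapshot folder.

    Returns the list of extracted member names so the caller could record
    them in an index. No extractors are run; the files are served as-is.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Note: extractall trusts member paths; a real implementation
        # should sanitize names to prevent path traversal.
        zf.extractall(snapshot_dir)
        return zf.namelist()
```

The index entry would then just point at the extracted index.html, which matches the "archive.today version in the index with none of ArchiveBox's own functionality" behavior described above.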

https://github.com/pirate/ArchiveBox/wiki/Usage#ui-usage

All archive methods (that are installed) are run for every URL; you can access them by clicking the favicon next to the title, or any of the icons in the "Files" column.

[Screen Shot 2020-08-13 at 5 51 18 PM]

[Screen Shot 2020-08-13 at 5 52 37 PM]

@pirate
Member

pirate commented Jun 13, 2023

I'm merging this feature with #160, which is a more general TODO to add support for searching/importing from 3rd party archiving platforms.

Please subscribe to that issue for progress updates / discussions.

@pirate pirate closed this as completed Jun 13, 2023