
Feature Request: archive.today family integration #439

Closed
4 of 9 tasks
jaw-sh opened this issue Aug 13, 2020 · 14 comments
Labels
status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet)
why: functionality (Intended to improve ArchiveBox functionality or features)

Comments

@jaw-sh

jaw-sh commented Aug 13, 2020

The archive.today sites (including archive.is, archive.md, archive.vn, archive.fi, etc.) should have special integrations.

Type

  • General question or discussion
  • Propose a brand new feature
  • Request modification of existing behavior or design

What is the problem that your feature request solves

archive.today's webmaster uses the site's position for activism. Using a browser the webmaster does not like (e.g. Brave) makes the site unusable. I would like to archive all archive.today links locally.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

There is a .ZIP download available for every archive which can be downloaded, unzipped, and converted into the archive format.

What hacks or alternative solutions have you tried to solve the problem?

Currently, my attempts to add archive.today links to ArchiveBox result in the archive failing.

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually

  • I'm willing to contribute dev time / money to fix this issue
  • I like ArchiveBox so far / would recommend it to a friend
  • I've had a lot of difficulty getting ArchiveBox set up
@jaw-sh added the "why: functionality" and "status: idea-phase" labels on Aug 13, 2020
@jaw-sh
Author

jaw-sh commented Aug 13, 2020

When I try to archive an archive.today page, I get errors and the archive is a directory of junk instead of the actual page.

[+] [2020-08-13 15:18:24] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597331904-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-08-13 15:18:24] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

[▶] [2020-08-13 15:18:25] Collecting content for 1 Snapshots in archive...

[+] [2020-08-13 15:18:25] "archive.is/nX7fq"
    http://archive.is/nX7fq
    > ./archive/1597331904
      > title
        Failed:
            ConnectionError HTTPConnectionPool(host='archive.is', port=80): Max retries exceeded with url: /nX7fq (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No route to host'))
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" http://archive.is/nX7fq

      > favicon
      > wget
        Failed:
            Wget failed or got an error from the server
            Got wget response code: 4.
            failed: No route to host.
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=warc/1597331908 --page-requisites "--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto http://archive.is/nX7fq

      > singlefile
        Failed:
            Exception Failed to chmod: /opt/archive/archive/1597331904/singlefile.html does not exist (did the previous step fail?)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            /opt/SingleFile/cli/single-file --browser-executable-path=chromium "--browser-args="["--headless", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36", "--window-size=1440,2000"]"" http://archive.is/nX7fq /opt/archive/archive/1597331904/singlefile.html

      > pdf
        Failed:
            Failed to save PDF
            [0813/151832.239630:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x563968f29529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x563968e87253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf http://archive.is/nX7fq

      > screenshot
        Failed:
            Failed to save screenshot
            [0813/151847.542441:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x55ee106ed529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x55ee1064b253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --screenshot http://archive.is/nX7fq

      > dom
        Failed:
            Failed to save DOM
            [0813/151902.868207:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x55d478e36529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x55d478d94253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --dump-dom http://archive.is/nX7fq

      > media
        Failed:
            Failed to save media
            Got youtube-dl response code: 1.
            WARNING: Could not send HEAD request to http://archive.is/nX7fq:
            ERROR: Unable to download webpage: (caused by URLError(OSError(113, 'No route to host')))
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist http://archive.is/nX7fq

      > archive_org

[√] [2020-08-13 15:19:09] Update of 1 pages complete (44.27 sec)
    - 0 links skipped
    - 0 links updated
    - 2 links had errors

    Hint: To view your archive index, open:
        /opt/archive/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-13 15:19:09] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

@cdvv7788
Contributor

@jaw-sh can you provide the command (with URL) you are testing? I can give it a check (it is probably being blocked by the target URL).

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

@cdvv7788 http://archive.is/nX7fq

@cdvv7788
Contributor

cdvv7788 commented Aug 13, 2020

[image attachment]
It works for me. How are you trying to run archivebox? Are you setting some environment variable or changing some configuration? Are you on master?

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

http://archive.vn/nX7fq
https://tinf.io/archive/1597333033/

All I really want is the static, non-interactive version of the page they already archived.

[i] [2020-08-13 15:37:12] ArchiveBox v0.4.13: archivebox add http://archive.vn/nX7fq
    > /opt/archive

[+] [2020-08-13 15:37:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597333033-import.txt
    > Parsed 1 URLs from input (Plain Text)                                                  
    > Found 1 new URLs not already in index                                                  

[*] [2020-08-13 15:37:13] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3                                                             
    √ /opt/archive/index.json                                                                
    √ /opt/archive/index.html                                                                

[▶] [2020-08-13 15:37:14] Collecting content for 1 Snapshots in archive...

[+] [2020-08-13 15:37:15] "archive.vn/nX7fq"
    http://archive.vn/nX7fq
    > ./archive/1597333033
      > title
        Failed:                                                                              
            ConnectionError HTTPConnectionPool(host='archive.vn', port=80): Max retries exceeded with url: /nX7fq (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f52fc3b7ac8>: Failed to establish a new connection: [Errno -5] No address associated with hostname'))
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" http://archive.vn/nX7fq

      > favicon
      > wget                                                                                 
        Failed:                                                                              
            TimeoutExpired Command '['wget', '--no-verbose', '--adjust-extension', '--convert-links', '--force-directories', '--backup-converted', '--span-hosts', '--no-parent', '-e', 'robots=off', '--timeout=60', '--restrict-file-names=windows', '--warc-file=warc/1597333035', '--page-requisites', '--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1', '--compression=auto', 'http://archive.vn/nX7fq']' timed out after 60 seconds
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=warc/1597333035 --page-requisites "--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto http://archive.vn/nX7fq

      > singlefile
        Failed:                                                                              
            Exception Failed to chmod: /opt/archive/archive/1597333033/singlefile.html does not exist (did the previous step fail?)
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            /opt/SingleFile/cli/single-file --browser-executable-path=chromium "--browser-args="["--headless", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36", "--window-size=1440,2000"]"" http://archive.vn/nX7fq /opt/archive/archive/1597333033/singlefile.html

      > pdf
      > screenshot                                                                           
      > dom                                                                                  
      > media                                                                                
        Failed:                                                                              
             Failed to save media
            Got youtube-dl response code: 1.
            WARNING: Could not send HEAD request to http://archive.vn/nX7fq: <urlopen error [Errno 99] Cannot assign requested address>
            ERROR: Unable to download webpage: <urlopen error [Errno 99] Cannot assign requested address> (caused by URLError(OSError(99, 'Cannot assign requested address')))
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist http://archive.vn/nX7fq

      > archive_org
        Failed:                                                                              
             WaybackException: java.lang.IllegalStateException: Payload size does not match content-length!
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            curl --silent --location --head --compressed --max-time 60 --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" https://web.archive.org/save/http://archive.vn/nX7fq


[√] [2020-08-13 15:38:20] Update of 1 pages complete (1.10 min)
    - 0 links skipped
    - 0 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /opt/archive/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-13 15:38:20] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3                                                             
    √ /opt/archive/index.json                                                                
    √ /opt/archive/index.html

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

ArchiveBox really needs a way to capture the DOM at "first rest", when the page is fully loaded. With Twitter, the archive is completely mangled because it tries to replicate the entire living Twitter webpage. Instagram is also completely broken.

I can open a new issue for this and I am willing to put cash bounties on these things.

https://twitter.com/dril/status/134787490526658561
https://tinf.io/archive/1597335380/twitter.com/dril/status/134787490526658561.html
[image attachment]

@pirate
Member

pirate commented Aug 13, 2020

@jaw-sh it does capture at first rest with two of the methods, the DOM dump and SingleFile. Have you tried looking at those outputs?

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

I have the single-file binary set. It didn't work at all before I set it.

@cdvv7788
Contributor

I have the single-file binary set. It didn't work at all before I set it.

We have a fix in an incoming PR that will disable it by default. The Docker image supports all of the extractors out of the box.

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

@cdvv7788 Sounds good.

What I really, really need is this:

  • A way to request URLs be archived programmatically via a POST/GET request from another service (auto-archiving links posted by users).
  • A way to request web content be archived at their first rest state, identical to how archive.today works.
  • A way to import archive.today archives exactly as they appear on archive.today, instead of just the archive.today page.

I am willing to pay for this.

@pirate
Member

pirate commented Aug 13, 2020

A way to request URLs be archived programmatically via a POST/GET request from another service (auto-archiving links posted by users).

This is already present: you can POST to https://127.0.0.1:8000/admin/core/snapshot/add/ with the following fields to archive a link:

  • url: str (a string containing any number of URLs)
  • depth: int (either 0 or 1, as detailed in archivebox add --help)
  • your session cookie header to authenticate the request
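The request described above can be sketched in Python with only the standard library. This is a sketch, not official ArchiveBox client code: `build_add_request` is a hypothetical helper, the server address is assumed to be a local instance, and the session cookie value is a placeholder you would copy from a logged-in browser session.

```python
from urllib.parse import urlencode
from urllib.request import Request

ARCHIVEBOX = "http://127.0.0.1:8000"        # assumed local ArchiveBox server
SESSION_COOKIE = "sessionid=REPLACE_ME"     # placeholder: copy from a logged-in browser

def build_add_request(url: str, depth: int = 0) -> Request:
    """Build the POST that asks ArchiveBox to snapshot `url` (depth 0 or 1)."""
    body = urlencode({"url": url, "depth": depth}).encode()
    return Request(
        f"{ARCHIVEBOX}/admin/core/snapshot/add/",
        data=body,
        headers={"Cookie": SESSION_COOKIE},  # authenticates the request
        method="POST",
    )

req = build_add_request("https://example.com", depth=0)
# urllib.request.urlopen(req) would actually submit it; omitted here
# because it needs a live, authenticated server.
```

A service that wants to auto-archive user-posted links would call `urlopen(build_add_request(link))` for each link as it arrives.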

A way to request web content be archived at their first rest state, identical to how archive.today works.

As mentioned above, this is already present: both the DOM dump and SingleFile methods archive "at first rest", i.e. ~1s after the DOM ready event fires.
The other methods do not execute JS, so "page ready" is not a concept that applies to them.
Your screenshot is of the wget output method only; have you tried viewing the SingleFile or DOM dump outputs? They should generally work fine for Twitter.

A way to import archive.today archives exactly as they appear on archive.today, instead of just the archive.today page.

I'm afraid this is not easily possible; archive.today explicitly does not expose an API that allows users to download their snapshots. If they did have such an API, that task would fall under the umbrella of this ticket: #160

[image attachment]

If you are serious about this, be aware that funding development on this issue would be on the order of $5k USD or more. We run a software consultancy; you can find more info about hiring us here: Monadical.com.

Also related (for improving exporting to sites like archive.today/archive.org): #146

@jaw-sh
Author

jaw-sh commented Aug 13, 2020

archive.today/is/vn/fi does not use the WARC format; they export a .zip download. Even if it's not easy, converting that .zip download into WARC and using it as a snapshot is something I would pay for. I have thousands of these links I would like to host myself.

I must be missing something re: the SingleFile archive. Is there a special config setting I have to set to explicitly use SingleFile? I believe I am already using it, but Instagram and Twitter archives are malformed. I had to set a binary to get any archive to work.

@pirate
Member

pirate commented Aug 13, 2020

We might be able to download that ZIP and rehost it verbatim in the ArchiveBox index without converting it to WARC. ArchiveBox wouldn't be able to run any of its own extractors though (wget, youtubedl, git, chrome, etc.), you'd basically just see the archive.today version in the index with none of ArchiveBox's own functionality. Is that what you're asking for?
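The "rehost the ZIP verbatim" idea could look roughly like the sketch below, assuming the export is an ordinary zip of index.html plus assets. `extract_snapshot_zip` is a hypothetical helper, not part of ArchiveBox, and the download step is deliberately left out since archive.today exposes no documented API for fetching the zip.

```python
import io
import zipfile
from pathlib import Path

def extract_snapshot_zip(zip_bytes: bytes, snapshot_dir: Path) -> list[str]:
    """Unpack an archive.today .zip export into a snapshot folder.

    Returns the list of extracted member names so the caller could record
    them in an index. No extractors are run; the files are served as-is.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Note: extractall trusts member paths; a real implementation
        # should sanitize names to prevent path traversal.
        zf.extractall(snapshot_dir)
        return zf.namelist()
```

The index entry would then just point at the extracted index.html, which matches the "archive.today version in the index with none of ArchiveBox's own functionality" behavior described above.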

https://github.com/pirate/ArchiveBox/wiki/Usage#ui-usage

All archive methods (that are installed) are run for every URL; you can access them by clicking the favicon next to the title, or any of the icons in the "Files" column.

[Screen Shot 2020-08-13 at 5 51 18 PM]

[Screen Shot 2020-08-13 at 5 52 37 PM]

@pirate
Member

pirate commented Jun 13, 2023

I'm merging this feature with #160, which is a more general TODO to add support for searching/importing from 3rd party archiving platforms.

Please subscribe to that issue for progress updates / discussions.

@pirate pirate closed this as completed Jun 13, 2023