Archivebox stopped saving DOM, screenshot and PDF with v111 update #1125

Closed
Giger22 opened this issue Mar 21, 2023 · 30 comments

Comments

@Giger22

Giger22 commented Mar 21, 2023

Archivebox fails to save DOM, screenshot and PDF.

Steps to reproduce

For example:

  1. I ran ArchiveBox and ArchiveBoxDev.
  2. All archive methods except PDF, screenshot and DOM work.
    I get the same error in the normal version.
    Command '['chromium', '--headless', '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)', '--window-size=1440,2000', '--timeout=180000', '--screenshot', 'https://www.dailymail.co.uk/sport/formulaone/article-11883195/Mercedes-chief-Toto-Wolff-bemoans-Red-Bulls-early-season-dominance-not-great-show.html']' timed out after 180 seconds

And just these errors for the dev version.

Failed to save DOM
Failed to save screenshot
Failed to save PDF

Screenshots or log output

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-6.2.6-artix1-1-x86_64-with-glibc2.37 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/art/.local/bin/archivebox                                             
 √  PYTHON_BINARY         v3.10.9         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/art/.local/lib/python3.10/site-packages/django/bin/django-admin.py    
 √  CURL_BINARY           v7.88.1         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v19.8.1         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.0.10         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.4          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.40.0         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.02.17     valid     /home/art/Downloads/yt-dlp_linux                                            
 √  CHROME_BINARY         v111.0.5563.64  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/art/.local/lib/python3.10/site-packages/archivebox                    
 √  TEMPLATES_DIR         3 files         valid     /home/art/.local/lib/python3.10/site-packages/archivebox/templates          
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            12 files        valid     /home/art/archiveboxPIP                                                     
 √  SOURCES_DIR           46 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           12 files        valid     ./archive                                                                   
 √  CONFIG_FILE           186.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             292.0 KB        valid     ./index.sqlite3 

ArchiveBoxDev

0.6.3
ArchiveBox v0.6.3 Cpython Linux Linux-6.2.6-artix1-1-x86_64-with-glibc2.37 x86_64
DEBUG=False IN_DOCKER=False IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=False FS_PERMS=644 1000:1000 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.10.9         valid     /usr/bin/python3.10                                                         
 √  SQLITE_BINARY         v2.6.0          valid     /usr/lib/python3.10/sqlite3/dbapi2.py                                       
 √  DJANGO_BINARY         v3.1.14         valid     /home/art/.local/lib/python3.10/site-packages/django/__init__.py            
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /home/art/.local/bin/archivebox                                             

 √  CURL_BINARY           v7.88.1         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v19.8.1         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.0.31         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.6          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.40.0         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.03.04     valid     /home/art/.local/bin/yt-dlp                                                 
 √  CHROME_BINARY         v111.0.5563.64  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/art/.local/lib/python3.10/site-packages/archivebox                    
 √  TEMPLATES_DIR         3 files         valid     /home/art/.local/lib/python3.10/site-packages/archivebox/templates          
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /home/art/archiveboxPIPdev                                                  
 √  SOURCES_DIR           6 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           3 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             220.0 KB        valid     ./index.sqlite3 
@pirate
Member

pirate commented Mar 21, 2023

Chrome unfortunately changed the behavior of a bunch of their CLI options between versions with little warning, so we're scrambling to cover all the edge cases still. Thanks for your patience 😅

Can you try running chromium --headless=new --screenshot 'https://example.com' in a terminal and posting the output?

@pirate changed the title from "Archivebox stopped saving DOM, screenshot and PDF." to "Archivebox stopped saving DOM, screenshot and PDF with v111 update" on Mar 21, 2023
@Giger22
Author

Giger22 commented Mar 22, 2023

[24053:24053:0322/125618.770957:ERROR:object_proxy.cc(623)] Failed to call method: org.kde.KWallet.isEnabled: object_path= /modules/kwalletd5: org.freedesktop.DBus.Error.NoReply: Message recipient disconnected from message bus without replying
[24053:24053:0322/125618.771033:ERROR:kwallet_dbus.cc(100)] Error contacting kwalletd5 (isEnabled)
[24053:24053:0322/125618.771802:ERROR:object_proxy.cc(623)] Failed to call method: org.kde.KLauncher.start_service_by_desktop_name: object_path= /KLauncher: org.freedesktop.DBus.Error.ServiceUnknown: The name org.kde.klauncher was not provided by any .service files
[24053:24053:0322/125618.771807:ERROR:kwallet_dbus.cc(72)] Error contacting klauncher to start kwalletd5
[24053:24053:0322/125618.880695:ERROR:object_proxy.cc(623)] Failed to call method: org.kde.KWallet.close: object_path= /modules/kwalletd5: org.freedesktop.DBus.Error.NoReply: Message recipient disconnected from message bus without replying
[24053:24053:0322/125618.880710:ERROR:kwallet_dbus.cc(418)] Error contacting kwalletd5 (close)
[24053:24053:0322/125618.881002:ERROR:process_singleton_posix.cc(334)] Failed to create /home/art/.config/chromium/SingletonLock: File exists (17)

@adamcecc

adamcecc commented Apr 7, 2023

I had the same issue. I removed the dead symlinks for /home/username/.config/google-chrome/SingletonCookie and /home/username/.config/google-chrome/SingletonLock and it works again.
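
A minimal sketch of that cleanup, for anyone who wants to script it. It assumes bash and that no Chrome/Chromium process is currently using the profile (a live lock may also look like a dangling symlink, so check first). The function name clear_stale_singletons is made up for this example; only the two file names come from this thread.

```bash
# Remove SingletonLock / SingletonCookie from a profile directory,
# but only when they are symlinks whose target no longer exists.
# Hypothetical helper; run it only while no browser is using the profile.
clear_stale_singletons() {
    local profile_dir=$1 name path
    for name in SingletonLock SingletonCookie; do
        path="$profile_dir/$name"
        # -L: path is a symlink; ! -e: the symlink's target does not exist.
        if [ -L "$path" ] && [ ! -e "$path" ]; then
            rm -f -- "$path"
            echo "removed $path"
        fi
    done
}
```

Usage would be something like clear_stale_singletons "$HOME/.config/chromium" (or the google-chrome directory mentioned above). Regular files and working symlinks are left alone.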

@Dontkickmi22

I had the same issue. I removed the dead symlinks for /home/username/.config/google-chrome/SingletonCookie and /home/username/.config/google-chrome/SingletonLock and it works again.

I did that and it worked, but I got the same error again today and had to clear the symlinks again. I wonder why.

@mwnoo

mwnoo commented Apr 11, 2023

Using dev version 0.6.3 with the headless=new command successfully archives PDFs, screenshots and DOM. The only problem is that with Chromium v111 and v112 the --user-data-dir information (accepted cookies, etc.) is no longer used. New screenshots and PDFs still show the cookie banners, although I accepted them in the browser and copied the profile to my ArchiveBox folders. This worked fine with previous versions of Chromium (e.g. v101).

Chromium also ignores the cookie information in --user-data-dir when run from the command line:

chromium --version 
Chromium 112.0.5615.49 snap

# Open Chromium, browse to website (e.g. https://stackoverflow.com/), accept cookies, close Chromium
# Copy profile to archivebox folder
cp -r /home/archive/snap/chromium/common/chromium/ /home/archive/data/chromium

# Creates 800 x 600 screenshots with cookie banner (window-size settings are not used)
chromium-browser --headless --window-size=1440,2000 --screenshot "https://stackoverflow.com/"
chromium-browser --headless=old --window-size=1440,2000 --screenshot "https://stackoverflow.com/"

# Creates 1440 x 1928 screenshot with cookie banner
chromium-browser --headless=new --window-size=1440,2000 --screenshot "https://stackoverflow.com/"

# Creates 1440 x 1928 screenshot still with cookie banner
chromium-browser --headless=new --window-size=1440,2000 --user-data-dir=/home/archive/data/chromium --screenshot "https://stackoverflow.com/"

# With Chromium v101 this worked fine to create screenshot without the cookie banner
chromium-browser --headless --window-size=1440,2000 --user-data-dir=/home/archive/data/chromium --screenshot "https://stackoverflow.com/"

Any ideas to get the new chromium headless to use the profile data (user-data-dir) and make screenshots without the cookie banners?
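
One way to check the reported sizes without opening each image: a PNG stores its width and height as two big-endian 32-bit integers at byte offsets 16 to 23, so they can be read with od. This is only a sketch; png_size is a hypothetical helper name, and it assumes the file really is a PNG.

```bash
# Print "WIDTHxHEIGHT" for a PNG by reading the IHDR fields directly.
# Hypothetical helper; it does not validate the PNG signature.
png_size() {
    # Bytes 16-19 hold the width, bytes 20-23 the height (big-endian).
    # Word splitting of the od output is intentional here.
    set -- $(od -An -tu1 -j16 -N8 "$1")
    [ "$#" -eq 8 ] || return 1
    echo "$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))x$(( ($5 << 24) | ($6 << 16) | ($7 << 8) | $8 ))"
}
```

For example, after one of the commands above, [ "$(png_size screenshot.png)" = "1440x2000" ] would tell you whether --window-size was honored exactly.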

@pirate
Member

pirate commented Apr 14, 2023

Argh that's frustrating, two major breaking changes to chromium's CLI without much care toward backwards compatibility on their part 😓

I'll take a look but probably going to focus on the ongoing playwright/browsertrix-crawler integration before trying to fix this.

@turian
Contributor

turian commented May 3, 2023

Can you try running chromium --headless=new --screenshot 'https://example.com' in a terminal and posting the output?

How do I run that command if I'm running archivebox from within docker?

@pirate
Member

pirate commented May 3, 2023

docker-compose run archivebox chromium --headless=new --screenshot 'https://example.com/'

@turian
Contributor

turian commented May 3, 2023

Ah thanks. I also found:

docker run -it --rm archivebox/archivebox:dev /bin/bash

The docker compose command doesn't work for me because I'm on a system where I just run from dockerhub (I assume dev is the most stable recent tag?). I'm back to using archivebox again, and am trying to find a docker image that has recent fixes but is relatively stable.

Dropping into bash using that command, I get:

archivebox@d50ee2d5f13e:/data$ chromium --headless=new --screenshot 'https://example.com'
find: ‘/home/archivebox/.config/chromium/Crash Reports/pending/’: No such file or directory
[16:16:0503/203759.108625:ERROR:zygote_host_impl_linux.cc(127)] No usable sandbox! If this is a Debian system, please install the chromium-sandbox package to solve this problem. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.

BTW, this is my ArchiveBox.conf, are there current best defaults that I should update?

$ cat ArchiveBox.conf
[SERVER_CONFIG]
SECRET_KEY = <redacted>

FETCH_MEDIA=True
MEDIA_TIMEOUT=500
CHROME_BINARY=google-chrome-stable
CHROME_HEADLESS=True
WGET_USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
PUBLIC_INDEX=False
PUBLIC_SNAPSHOTS=False

@pirate
Member

pirate commented May 4, 2023

Sorry, try this:

docker-compose run archivebox /usr/bin/chromium --version
docker-compose run archivebox /usr/bin/chromium --headless=new --screenshot --no-sandbox 'https://example.com' screenshot.png
docker-compose run archivebox config --get CHROME_BINARY
docker-compose run archivebox version

Then open ./data/screenshot.png to make sure it succeeded.

@turian
Contributor

turian commented May 4, 2023

docker-compose doesn't work for me because I'm just using the latest tag on docker. But if you think I should switch, I will.

screenshot.png doesn't come through as per the following:

archivebox@ubuntu-s-4vcpu-8gb-amd-fra1-01:~$ docker run -it --rm archivebox/archivebox:dev /bin/bash




archivebox@019ca72a3922:/data$
archivebox@019ca72a3922:/data$
archivebox@019ca72a3922:/data$
archivebox@019ca72a3922:/data$
archivebox@019ca72a3922:/data$  /usr/bin/chromium --version
find: ‘/home/archivebox/.config/chromium/Crash Reports/pending/’: No such file or directory
Chromium 112.0.5615.121 built on Debian 11.6, running on Debian 11.6
archivebox@019ca72a3922:/data$ /usr/bin/chromium --headless=new --screenshot --no-sandbox 'https://example.com' screenshot.png
find: ‘/home/archivebox/.config/chromium/Crash Reports/pending/’: No such file or directory
[0504/103129.626416:ERROR:chrome_main.cc(164)] Multiple targets are not supported in headless mode.
archivebox@019ca72a3922:/data$ config --get CHROME_BINARY
bash: config: command not found
archivebox@019ca72a3922:/data$ version
bash: version: command not found
archivebox@019ca72a3922:/data$
exit
archivebox@ubuntu-s-4vcpu-8gb-amd-fra1-01:~$ docker run --rm archivebox/archivebox:dev config --get CHROME_BINARY
find: '/.config/chromium/Crash Reports/pending/': No such file or directory
[i] [2023-05-04 10:32:20] ArchiveBox v0.6.3: archivebox config --get CHROME_BINARY
    > /data

find: '/.config/chromium/Crash Reports/pending/': No such file or directory
[X] No archivebox index found in the current directory.
    /data

    Hint: Are you running archivebox in the right folder?
        cd path/to/your/archive/folder
        archivebox [command]

    Hint: To create a new archive collection or import existing data in this folder, run:
        archivebox init
archivebox@ubuntu-s-4vcpu-8gb-amd-fra1-01:~$ docker run --rm archivebox/archivebox:dev version
find: '/.config/chromium/Crash Reports/pending/': No such file or directory
0.6.3
ArchiveBox v0.6.3 a1e2fce Cpython Linux Linux-5.15.0-71-generic-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=False TZ=UTC FS_ATOMIC=True FS_REMOTE=True FS_PERMS=644 999:999 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.3         valid     /usr/local/bin/python3.11
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget
 √  NODE_BINARY           v18.16.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2023.03.04     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v112.0.5615.121  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled


[i] Data locations:

@turian
Contributor

turian commented May 4, 2023

What I'm looking for is a "relatively stable, relatively recent" docker tag so I can start crawling again.

@pirate
Member

pirate commented May 4, 2023

archivebox/archivebox:dev is definitely the most recent/stable tag.

I'm unable to replicate this [0504/103129.626416:ERROR:chrome_main.cc(164)] Multiple targets are not supported in headless mode. message on my side, so it's tricky to debug :/

docker run -v $PWD/archivebox:/data archivebox/archivebox:dev bash

archivebox@019ca72a3922:/data$ /usr/bin/chromium --headless=new --screenshot --no-sandbox  --no-first-run --disable-sync --disable-gpu 'https://example.com/' screenshot.png

@mrled
Contributor

mrled commented May 7, 2023

I think "multiple targets are not supported in headless mode" is because Chromium thinks that you're trying to navigate to two URLs: https://example.com/ and screenshot.png. You probably want /usr/bin/chromium --headless=new --screenshot --no-sandbox --no-first-run --disable-sync --disable-gpu 'https://example.com/', without screenshot.png at the end. (Should work the same whether the container is started from plain Docker or from docker-compose.)
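
Following that explanation, a rough pre-flight check is to count the arguments that are not switches, since each of those would be read as a page to open. This is a sketch under that assumption; count_targets is a hypothetical helper, not part of Chromium or ArchiveBox.

```bash
# Count the arguments that do not look like switches.
# Hypothetical helper: anything not starting with "-" is assumed
# to be treated by the browser as a target to navigate to.
count_targets() {
    local n=0 arg
    for arg in "$@"; do
        case $arg in
            -*) ;;              # a switch such as --headless=new
            *) n=$((n + 1)) ;;  # everything else counts as a target
        esac
    done
    echo "$n"
}
```

With the arguments from the failing command (ending in 'https://example.com' screenshot.png) this prints 2; with the trailing screenshot.png dropped it prints 1.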

@rpcope1

rpcope1 commented Jun 8, 2023

archivebox/archivebox:dev is definitely the most recent/stable tag.

I'm unable to replicate this [0504/103129.626416:ERROR:chrome_main.cc(164)] Multiple targets are not supported in headless mode. message on my side, so it's tricky to debug :/

docker run -v $PWD/archivebox:/data archivebox/archivebox:dev bash

archivebox@019ca72a3922:/data$ /usr/bin/chromium --headless=new --screenshot --no-sandbox  --no-first-run --disable-sync --disable-gpu 'https://example.com/' screenshot.png

I don't know if this helps or not, but I ran into the "Multiple targets are not supported in headless mode." and narrowed it down to Chromium interpreting the "--user-agent" argument as a URL instead of the way you would expect it should. This was with Chromium 114 on Debian 11.7. It probably has to do with Chromium no longer taking a user-agent argument. I ended up patching around it with this hack:

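#!/bin/bash
# Wrapper that drops any --user-agent argument before calling chromium.
set -euo pipefail

# Start with an explicitly empty array of arguments to pass through.
declare -a FIXED_ARGS=()
for arg in "$@"; do
    # Keep every argument that does not mention --user-agent.
    if ! [[ "$arg" =~ .*--user-agent.* ]]; then
        FIXED_ARGS+=("$arg")
    fi
done
# Replace this shell with chromium, adding --disable-gpu.
exec chromium --disable-gpu "${FIXED_ARGS[@]}"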

But the archiver still times out (maybe because this same LXC container doesn't have a GPU exposed). It sounds like the thing to do for now may be to find a version of Chromium before 111 and use that instead.

@dcalano
Contributor

dcalano commented Jul 1, 2023

Looking at my error logs, the dom/screenshot calls that fail for me seem to fail to interpolate the {VERSION} env variable. Interestingly, the wget capture resolved it to 0.6.2 and completed successfully, but in the logged command strings it remains as {VERSION} for dom/screenshot when the crawl command is called.

@sclu1034

sclu1034 commented Jul 6, 2023

I was seeing the same issues, including the unresolved {VERSION} fields in the error logs.

Removing the broken symlinks mentioned by @adamcecc fixed it for now.

I removed the dead symlinks for /home/username/.config/google-chrome/SingletonCookie and /home/username/.config/google-chrome/SingletonLock and it works again.

@sclu1034

That didn't actually help all that much. It breaks again as soon as there are multiple processes adding URLs running at the same time, and then remains broken until those files are removed again.

@pirate
Member

pirate commented Aug 16, 2023

AFAIK there is no way to run multiple Chrome instances with a single profile; Chrome does not support it. The best we could do is clone the profile directory into a few temporary copies and use those, or migrate to an event sourcing model with a single playwright-based Chrome worker that handles all the jobs as separate tabs in a single Chrome instance.
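
A rough sketch of the "temporary copies" idea, assuming bash and a profile small enough to copy per job. run_with_profile_copy is a hypothetical helper, not something ArchiveBox ships; it copies the profile to a fresh temp directory, drops any Singleton* lock files from the copy, and passes the copy as --user-data-dir (the flag used earlier in this thread).

```bash
# Usage: run_with_profile_copy PROFILE_DIR BROWSER [ARGS...]
# Hypothetical helper: runs BROWSER against a throwaway copy of PROFILE_DIR.
run_with_profile_copy() {
    local src=$1 bin=$2 tmp status=0
    shift 2
    tmp=$(mktemp -d) || return 1
    # Copy the profile contents; the original directory is never modified.
    if ! cp -R "$src/." "$tmp/"; then
        rm -rf "$tmp"
        return 1
    fi
    # Do not carry the source profile's lock files into the copy.
    rm -f "$tmp"/Singleton*
    # Run the browser with the copy, remembering its exit status.
    "$bin" "--user-data-dir=$tmp" "$@" || status=$?
    # The copy is disposable, so anything the browser wrote to it is lost.
    rm -rf "$tmp"
    return "$status"
}
```

Each concurrent job then gets its own directory, at the cost of one full profile copy per run, and cookies set during a run are not written back to the original profile.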

@pirate
Member

pirate commented Oct 20, 2023

I made major changes to the Dockerfile last night and bumped all the dependency versions so it should be on the latest Chrome v119 now. Not all the cross-platform builds are on Docker Hub yet, but you can try it by pulling and running docker build . -t archivebox-dev; docker run -it -v $PWD:/data archivebox-dev ....

@pirate
Member

pirate commented Nov 9, 2023

I think this should be fixed now on dev. Please pull the latest image/pip package and try again, docker pull archivebox/archivebox:dev. https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch

Comment back here if any of you are still encountering issues and I'll reopen the ticket.

@pirate closed this as completed on Nov 9, 2023
@unlostify

unlostify commented Feb 9, 2024

@pirate I'm having this same problem with 0.7.0-0.7.3.

For Dom, PDF and Screenshot the logs show: Extractor timed out after 900s.
There also appears to be a problem with SingleFile – the logs show: SingleFile was not able to archive the page

As you can see, I set the timeout to 900 seconds, but that didn't help.

For context, ArchiveBox is running in Docker Compose. This instance is several years old and includes ~8000 saved pages. I've run archivebox init to ensure the necessary migrations have run.

Through v0.6.3 I never had any problems, but it seems like this started after I upgraded to 0.7.0 at some point late last year, and persisted when I later upgraded to 0.7.3. This has been an issue ever since then, but it is only now that I'm getting around to troubleshooting this.

Today I rolled back to a backup from Oct 2023, running on 0.7.0. It had this issue. I ran the necessary migrations via archivebox init, updated to 0.7.1, ran the migrations again, and then did the same for 0.7.2 and 0.7.3.

This is the result of archivebox version now that I'm back on 0.7.3

0.7.3
ArchiveBox v0.7.3+editable COMMIT_HASH=fd2a91b BUILD_TIME=2024-01-12 04:15:39 1705032939
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.49-linuxkit-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=501:20 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.7         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.7.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v20.11.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js                                  
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v120.0.6099.28  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           25 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                                        

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                                        
 -  COOKIES_FILE          -               disabled  None                                                                        

[i] Data locations:
 √  OUTPUT_DIR            9 files @       valid     /data                                                                       
 √  SOURCES_DIR           61 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           8688 files      valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             155.1 MB        valid     ./index.sqlite3

Thank you kindly for the work you do on this wonderful project!

@pirate
Member

pirate commented Feb 10, 2024

Thanks for the info and the version output. I've experienced this intermittently with Chrome, but it usually went away on its own. I think it's caused by Chrome not exiting correctly after a job finishes; it just hangs indefinitely (SingleFile uses Chrome too).
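
One possible guard against a browser process that never exits is a hard deadline around the call. A sketch assuming GNU coreutils timeout is available (it is not installed by default on macOS); run_with_deadline is a hypothetical wrapper name, not an ArchiveBox feature.

```bash
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Hypothetical wrapper: sends TERM at the deadline, then KILL 5 seconds
# later if the process is still alive.
run_with_deadline() {
    local seconds=$1 status=0
    shift
    timeout --kill-after=5 "$seconds" "$@" || status=$?
    # timeout exits with 124 when the deadline was hit, or 137 if it had to KILL.
    if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
        echo "gave up after ${seconds}s: $1" >&2
    fi
    return "$status"
}
```

Something like run_with_deadline 60 chromium --headless=new --screenshot 'https://example.com/' would then return instead of hanging forever. It only bounds the wait and does nothing about the underlying hang.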

I'll take a deeper look next week!

@unlostify

Thanks @pirate ! Let me know if there's anything else I can provide (logs etc), or do, that would help. Happy to do anything I can to help diagnose the issue =)

@pirate
Member

pirate commented Feb 29, 2024

@unlostify you may want to subscribe to this issue as well, I have a more in-depth comment trying to figure out the underlying cause here: cypress-io/cypress#27264 (comment)

@unlostify

Thanks so much! Will do =)

In the other issue you mentioned it's hard to reproduce, but the 'good' thing is that this bug happens 100% of the time in my instance. I'm running in Docker, and the issue always appears, even if I regenerate the container. Since the /data directory is all that persists, it's presumably something in there that's the problem. So perhaps it would be helpful if I provided you a copy of my /data directory?

It's currently ~8GB. However, if you think it'd be useful for troubleshooting, I can prune it down as much as possible by duplicating it and removing all of the sites I've archived.

If you'd like me to do that, just let me know and I'll put it on my todo list.

@pirate
Member

pirate commented Mar 1, 2024

It's probably not something in your /data directory actually. I think it's more likely to be correlated with the Chrome version in the Docker container combined with your CPU architecture, host kernel, core count/threading support, Docker storage driver, underlying host filesystem, network conditions, etc. (which is why I've gradually added all these things to the archivebox version output)

x86 in Docker appears to hit this issue much more than arm64 for example. (I personally run arm64 on macOS, where it almost never happens, which is partly why it's been hard for me to debug without running test cloud servers all the time)

@unlostify

Whoops! That makes way more sense.

For what it's worth, I am indeed running on x86 (also on macOS).

@pirate
Member

pirate commented Mar 1, 2024

Hah of course not 30 seconds after posting this I tried again just for fun and managed to reproduce this on arm64 on macOS!

I didn't even add any of our normal ArchiveBox args, it hung immediately on the first try with only --headless=new and --screenshot!

[Image: Screenshot 2024-02-29 at 5 30 24 PM]

This dispelled the last of my doubts, this is 100% an upstream Chromium bug and has nothing to do with ArchiveBox. I just opened an upstream bug report on the Chromium bug tracker: https://issues.chromium.org/issues/327583144

@unlostify

You're the best @pirate! Thanks for looking into this, and for all of your hard work on this fantastic project =)
