Archivebox stopped saving DOM, screenshot and PDF with v111 update #1125
Comments
Chrome unfortunately changed the behavior of a bunch of their CLI options between versions with little warning, so we're still scrambling to cover all the edge cases. Thanks for your patience 😅 Can you try running |
|
I had the same issue. I removed the dead symlinks for /home/username/.config/google-chrome/SingletonCookie and /home/username/.config/google-chrome/SingletonLock and it works again. |
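The cleanup described above can be sketched as a small shell function. This is only an illustration: `remove_stale_singletons` is a hypothetical helper name, and the profile path in the usage line is an example to replace with your own. It should only be run when no Chrome or Chromium process is using that profile, because the function cannot tell a stale lock from a live one.

```bash
#!/bin/bash
# Hypothetical helper: remove dangling SingletonLock / SingletonCookie symlinks
# from a Chrome/Chromium profile directory passed as the first argument.
remove_stale_singletons() {
    local profile_dir="$1"
    local name link
    for name in SingletonLock SingletonCookie; do
        link="$profile_dir/$name"
        # -L: the path is a symlink; ! -e: its target does not exist (dangling)
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            rm -f -- "$link"
            echo "removed $link"
        fi
    done
}

# Usage (example path, adjust to your setup):
# remove_stale_singletons "$HOME/.config/google-chrome"
```

Regular files and symlinks whose targets still exist are left untouched.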
I did that and it worked, but I got the same error again today and had to remove the symlinks again to clear it. I wonder why. |
Using dev version 0.6.3 with the Also from the command line Chromium does not use the cookie information in
Any ideas to get the new chromium headless to use the profile data (user-data-dir) and make screenshots without the cookie banners? |
Argh that's frustrating, two major breaking changes to chromium's CLI without much care toward backwards compatibility on their part 😓 I'll take a look but probably going to focus on the ongoing playwright/browsertrix-crawler integration before trying to fix this. |
How do I run that command if I'm running archivebox from within docker? |
|
Ah thanks. I also found:
The docker compose command doesn't work for me because I'm on a system where I just run from dockerhub (I assume dev is the most stable recent tag?). I'm back to using archivebox again, and am trying to find a docker image that has recent fixes but is relatively stable. Dropping into bash using that command, I get:
BTW, this is my ArchiveBox.conf, are there current best defaults that I should update?
|
Sorry, try this:
docker-compose run archivebox /usr/bin/chromium --version
docker-compose run archivebox /usr/bin/chromium --headless=new --screenshot --no-sandbox 'https://example.com' screenshot.png
docker-compose run archivebox config --get CHROME_BINARY
docker-compose run archivebox version
Then open |
docker-compose doesn't work for me because I'm just using the latest tag on docker. But if you think I should switch, I will. screenshot.png doesn't come through as per the following:
|
What I'm looking for is a "relatively stable, relatively recent" docker tag so I can start crawling again. |
I'm unable to replicate this
archivebox@019ca72a3922:/data$ /usr/bin/chromium --headless=new --screenshot --no-sandbox --no-first-run --disable-sync --disable-gpu 'https://example.com/' screenshot.png |
I think "multiple targets are not supported in headless mode" is because Chromium thinks that you're trying to navigate to two URLs: |
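A rough way to see which arguments headless Chromium might be counting as targets is to list everything on the command line that does not start with a dash. This is a sketch under the assumption that any such positional argument is treated as a navigation target; `list_targets` is a hypothetical helper, not part of Chromium or ArchiveBox.

```bash
#!/bin/bash
# Hypothetical helper: print every argument that is not a flag.
list_targets() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            -*) ;;                       # starts with a dash: a flag, skip it
            *)  printf '%s\n' "$arg" ;;  # positional: assumed to be a target
        esac
    done
}

# The invocation from the earlier comment has two positional arguments:
list_targets --headless=new --screenshot --no-sandbox 'https://example.com/' screenshot.png
```

If more than one line is printed, one option is to attach the output path to the flag itself (for example `--screenshot=screenshot.png`, if your Chromium build accepts that form) so that only the URL remains positional.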
I don't know if this helps or not, but I ran into "Multiple targets are not supported in headless mode." and narrowed it down to Chromium interpreting the "--user-agent" argument as a URL rather than as a flag. This was with Chromium 114 on Debian 11.7. It probably has to do with Chromium no longer taking a user-agent argument. I ended up patching around it with this hack:
#!/bin/bash
# Wrapper around chromium that drops any --user-agent argument before launching.
set -euo pipefail
declare -a FIXED_ARGS=()
for arg in "$@"; do
    # Keep every argument except those containing --user-agent
    if ! [[ "$arg" =~ .*--user-agent.* ]]; then
        FIXED_ARGS+=("$arg")
    fi
done
exec chromium --disable-gpu "${FIXED_ARGS[@]}"
But the archiver still times out (maybe because this same LXC container doesn't have a GPU exposed). It sounds like the thing to do for now may be to find a version of Chromium before 111 and use that instead. |
Looking at my error logs, the dom/screenshot calls that fail for me seem to fail to interpolate the {VERSION} env variable. Interestingly, the wget capture resolved it fine to 0.6.2 and completed successfully, but in the logged cmd string it remains as {VERSION} for dom/screenshot when the crawl command is called. |
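To check a logged command string for placeholders that were never filled in, something like the following could be used. It is a minimal sketch: `find_unresolved_placeholders` is a hypothetical helper, and it assumes placeholders look like `{UPPER_CASE}` as in the log line quoted in this issue.

```bash
#!/bin/bash
# Hypothetical helper: print each {UPPER_CASE} token found in the given text,
# one per line. Prints nothing if every placeholder was resolved.
find_unresolved_placeholders() {
    printf '%s\n' "$1" | grep -oE '\{[A-Z_]+\}' || true
}

# Usage with a fragment of the user agent from the error log:
find_unresolved_placeholders 'Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)'
```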
I was seeing the same issues, including the unresolved Removing the broken symlinks mentioned by @adamcecc fixed it for now.
|
That didn't actually help all that much. It breaks again as soon as there are multiple processes adding URLs running at the same time, and then remains broken until those files are removed again. |
afaik there is no way to run multiple chrome instances with a single profile, Chrome does not support it. the best we could do is clone the profile directory and create a few temporary copies and use those, or migrate to an event sourcing model with a single playwright-based chrome worker that handles all the jobs as separate tabs in a single chrome instance. |
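The "temporary copies of the profile" idea could look roughly like this. It is a sketch, not how ArchiveBox does it: `run_with_profile_copy` is a hypothetical helper, it assumes the browser command accepts `--user-data-dir=` as its last argument, and any cookies or state written during the run are thrown away with the copy.

```bash
#!/bin/bash
# Hypothetical helper: run a command against a throwaway copy of a profile.
# $1 is the source profile directory; the remaining arguments are the command.
run_with_profile_copy() {
    local src_profile="$1"; shift
    local tmp_profile status=0
    tmp_profile="$(mktemp -d)"
    # Copy the profile contents, preserving symlinks and permissions
    cp -a "$src_profile/." "$tmp_profile/"
    # Drop lock files copied from the original so the new instance starts clean
    rm -f "$tmp_profile"/Singleton*
    # Run the command with the copy as its profile and remember its exit status
    "$@" "--user-data-dir=$tmp_profile" || status=$?
    rm -rf "$tmp_profile"
    return "$status"
}

# Usage (paths and flags are examples):
# run_with_profile_copy "$HOME/.config/chromium" \
#     /usr/bin/chromium --headless=new --no-sandbox --screenshot 'https://example.com/'
```

Each concurrent job gets its own directory, so the instances never contend for the same lock; the cost is a full profile copy per run.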
I made major changes to the Dockerfile last night and bumped all the dependency versions so it should be on the latest Chrome v119 now. Not all the cross-platform builds are on Docker Hub yet, but you can try it by pulling and running |
I think this should be fixed now on dev. Please pull the latest image/pip package and try again. Comment back here if any of you are still encountering issues and I'll reopen the ticket. |
@pirate I'm having this same problem with 0.7.0-0.7.3. For Dom, PDF and Screenshot the logs show: Extractor timed out after 900s. As you can see, I set the timeout to 900 seconds, but that didn't help. For context, ArchiveBox is running in Docker Compose. This instance is several years old and includes ~8000 saved pages. I've run Through v0.6.3 I never had any problems, but it seems like this started after I upgraded to 0.7.0 at some point late last year, and persisted when I later upgraded to 0.7.3. This has been an issue ever since then, but it is only now that I'm getting around to troubleshooting this. Today I rolled back to a backup from Oct 2023, running on 0.7.0. It had this issue. I ran the necessary migrations via This is the result of
Thank you kindly for the work you do on this wonderful project! |
Thanks for the info and the version output. I've experienced this intermittently with chrome sometimes but it usually went away on its own. I think it's caused by chrome not exiting correctly after a job finishes, it just hangs indefinitely (singlefile uses chrome too) I'll take a deeper look next week! |
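To tell a browser that hangs from one that is merely slow, a command can be run under a hard time limit outside ArchiveBox. The sketch below assumes GNU coreutils `timeout` is available (as it is on Debian-based images); `run_bounded` is a hypothetical helper and the limits are arbitrary.

```bash
#!/bin/bash
# Hypothetical helper: run a command with a time limit in seconds.
# $1 is the limit; the remaining arguments are the command to run.
run_bounded() {
    local limit="$1"; shift
    # Send SIGTERM after $limit seconds, then SIGKILL 5 seconds later if the
    # process is still alive.
    timeout --kill-after=5 "$limit" "$@"
}

# Usage (binary path and flags as used elsewhere in this thread, 60s is arbitrary):
# run_bounded 60 /usr/bin/chromium --headless=new --no-sandbox --dump-dom 'https://example.com/'
# echo "exit status: $?"   # 124 (or 137 if SIGKILL was needed) means the limit was hit
```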
Thanks @pirate ! Let me know if there's anything else I can provide (logs etc), or do, that would help. Happy to do anything I can to help diagnose the issue =) |
@unlostify you may want to subscribe to this issue as well, I have a more in-depth comment trying to figure out the underlying cause here: cypress-io/cypress#27264 (comment) |
Thanks so much! Will do =) In the other issue you mentioned it's hard to reproduce, but the 'good' thing is that this bug happens 100% of the time in my instance. I'm running in Docker, and the issue always appears, even if I regenerate the container. Since the /data directory is all that persists, it's presumably something in there that's the problem. So perhaps it would be helpful if I provided you a copy of my /data directory? It's currently ~8GB. However, if you think it'd be useful for troubleshooting, I can prune it down as much as possible by duplicating it and removing all of the sites I've archived. If you'd like me to do that, just let me know and I'll put it on my todo list. |
It's probably not something in your
|
Whoops! That makes way more sense. For what it's worth, I am indeed running on x86 (also on macOS). |
Hah of course not 30 seconds after posting this I tried again just for fun and managed to reproduce this on arm64 on macOS! I didn't even add any of our normal ArchiveBox args, it hung immediately on the first try with only This dispelled the last of my doubts, this is 100% an upstream Chromium bug and has nothing to do with ArchiveBox. I just opened an upstream bug report on the Chromium bug tracker: https://issues.chromium.org/issues/327583144 |
You're the best @pirate! Thanks for looking into this, and for all of your hard work on this fantastic project =) |
Archivebox fails to save DOM, screenshot and PDF.
Steps to reproduce
For example:
I get the same error in the normal version:
Command '['chromium', '--headless', '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)', '--window-size=1440,2000', '--timeout=180000', '--screenshot', 'https://www.dailymail.co.uk/sport/formulaone/article-11883195/Mercedes-chief-Toto-Wolff-bemoans-Red-Bulls-early-season-dominance-not-great-show.html']' timed out after 180 seconds
And just these errors for the dev version:
Failed to save DOM
Failed to save screenshot
Failed to save PDF
Screenshots or log output
ArchiveBox version
ArchiveBoxDev