Architecture: Concurrent runs accidentally delete each other's temp files, leaving the index broken #234
I added something recently called `atomic_write(contents, path)`:

```python
import os
import tempfile

def atomic_write(contents, path):
    tmp_file = None
    try:
        # 1. create temp file (in the destination dir, so the rename stays on one filesystem)
        fd, tmp_file = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
        # 2. write to temp file
        with os.fdopen(fd, 'w') as f:
            f.write(contents)
        # 3. rename temp file over actual destination file (atomic on POSIX)
        os.replace(tmp_file, path)
        tmp_file = None
    finally:
        # if anything fails, delete temp file to clean up
        if tmp_file and os.path.exists(tmp_file):
            os.remove(tmp_file)
```
You might want to reuse existing code for this, e.g. https://github.com/untitaker/python-atomicwrites
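For readers hitting the same race: the root cause is concurrent runs sharing one predictable temp filename, so one run's cleanup step deletes the other run's half-written file. A minimal stdlib sketch (filenames and contents here are illustrative, not ArchiveBox's actual index files) showing how `tempfile.mkstemp` sidesteps this by handing each caller a unique name:

```python
import os
import tempfile

# Simulate two concurrent runs writing temp files into the same directory.
# With a fixed name like 'index.json.tmp', run B's cleanup could delete
# run A's half-written file; mkstemp() guarantees each caller a unique path.
d = tempfile.mkdtemp()

fd_a, tmp_a = tempfile.mkstemp(dir=d, suffix='.tmp')  # "run A"
fd_b, tmp_b = tempfile.mkstemp(dir=d, suffix='.tmp')  # "run B"
assert tmp_a != tmp_b  # unique names, no collision

os.write(fd_a, b'run A index')
os.write(fd_b, b'run B index')
os.close(fd_a)
os.close(fd_b)

# Run B finishes and cleans up its own temp file only:
os.remove(tmp_b)
assert os.path.exists(tmp_a)  # run A's work is untouched

os.remove(tmp_a)
os.rmdir(d)
```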
This should all be fixed in the latest `django` git checkout:

```shell
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
docker run -v $PWD/output:/data archivebox add 'https://example.com'
```

If you still see any issues, comment back and I'll reopen the ticket.
Note: I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting. Contributions and suggestions are welcome there.
Describe the bug
As part of my ridiculously large archiving attempt (partly documented in #233), I have done a first batch of URL imports with the first 100 URLs found. For a reason I can't explain (maybe because I ran two `archivebox add` commands in parallel?), that eventually crashed. No problem, I thought, I can resume! So I did that.
But that crashed as well, with:
I suspect this is because `--update-all` actually expects a list of URLs to be passed, but the usage doesn't make that clear, and we shouldn't be crashing there.

Steps to reproduce
Run `archivebox add --update-all` with no other URLs.
First, the original crash, not the subject of this bug report:

Re-adding the list does nothing:

Looking at `-h`, I noticed `--update-all`, so I tried that:

The correct call is of course to retry with the same URLs:
which works, but it would actually be nice to (a) not crash when `--update-all` is passed without an argument (maybe just fail argument parsing more politely), and (b) eventually just do the right thing, which is probably to retry any failed URL from the database.

Software versions
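On point (a), here is a hedged sketch of how an argparse-based CLI could reject the bad invocation politely. This is hypothetical illustration, not ArchiveBox's real argument handling; the `parse` helper and flag wiring are assumptions:

```python
import argparse

# Hypothetical sketch of an `add` subcommand's argument handling,
# NOT ArchiveBox's actual CLI code.
parser = argparse.ArgumentParser(prog='archivebox add')
parser.add_argument('urls', nargs='*', help='URLs to archive')
parser.add_argument('--update-all', action='store_true',
                    help='retry every previously failed URL from the database')

def parse(argv):
    args = parser.parse_args(argv)
    # Polite usage error instead of a crash deep inside the importer:
    if not args.urls and not args.update_all:
        parser.error('expected at least one URL, or --update-all')
    return args

# With this wiring, `--update-all` alone is a valid invocation,
# which would let behavior (b) kick in (retrying from the DB):
args = parse(['--update-all'])
assert args.update_all and args.urls == []
```

Making `--update-all` a boolean flag (rather than one that consumes positional arguments) is what lets `parser.error` produce a short usage message instead of a traceback.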
Thanks for your hard work, and sorry for the flood of bug reports! :)