Architecture: Concurrent runs accidentally delete each other's temp files, leaving the index broken #234

Closed
anarcat opened this issue May 6, 2019 · 4 comments

anarcat commented May 6, 2019

Describe the bug

As part of my ridiculously large archiving attempt (partly documented in #233), I have done a first batch of URL imports with the first 100 URLs found. For a reason I can't explain (maybe because I ran two archivebox add commands in parallel?), that eventually crashed with:

FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'

No problem, I thought - I can resume! So I did that with

archivebox add --update-all

But that crashed as well, with:

TypeError: save_file_to_sources(..., path: str) got unexpected NoneType argument path=None

I suspect this is because --update-all actually expects a list of URLs to be passed, but the usage doesn't make that clear and we shouldn't be crashing there.

Steps to reproduce

  1. call archivebox add --update-all with no other URLs

Screenshots or log output

First, the original crash, not the subject of this bug report:

[...]
[+] [2019-05-06 21:39:14] "www.hjdskes.nl/projects/cage"                                                                                                           
    https://www.hjdskes.nl/projects/cage/                                                                                                                          
    > ./archive/1557178364.10                                                                                                                                      
      > title                                                                                                                                                      
      > favicon                                                                                                                                                    
      > wget                                                                                                                                                       
        Failed:                                                                                                                                                    
            TimeoutExpired Command 'wget' timed out after 60 seconds                                                                                               
        Run to see full output:                                                                                                                                    
            cd /srv/backup/archive/archivebox/archive/1557178364.10;                                                                                               
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1557178755 --page-requisites "--user-agent=ArchiveBox/0.4.1 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://www.hjdskes.nl/projects/cage/                                                                                                                   
                                                                                                                                                                   
      > pdf                                                                                                                                                        
      > screenshot                                                                                                                                                 
      > dom                                                                                                                                                        
      > media                                                                                                                                                      
      > archive_org                                                                                                                                                
    ! Failed to archive link: FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'                                                                                                                                                      
                                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                 
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>                                                                                
    sys.exit(main())                                                                                                                                               
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main                                                
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)                                                                                                            
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main                                          
    pwd=pwd or OUTPUT_DIR,                                                                                                                                         
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand                                  
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main                                      
    out_dir=pwd or OUTPUT_DIR,                                                                                                                                     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 521, in add                                                    
    archive_link(link, out_dir=link.link_dir)                                                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/extractors/__init__.py", line 85, in archive_link                             
    patch_main_index(link)                                                                                                                                         
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/__init__.py", line 323, in patch_main_index                             
    write_json_main_index(patched_links, out_dir=out_dir)                                                                                                          
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/json.py", line 77, in write_json_main_index                             
    atomic_write(main_index_json, os.path.join(out_dir, JSON_INDEX_FILENAME))                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/system.py", line 79, in atomic_write                                          
    os.rename(tmp_file, path)
FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'             

Re-adding the list does nothing:

[1]anarcat@curie:archivebox(master)$ archivebox add wallabag-p1.list                                                                                               
    > ./sources/wallabag-p1.list-1557179130.txt                                                                                                                    
                                                                                                                                                                   
[*] [2019-05-06 21:45:30] Parsing new links from output/sources/wallabag-p1.list-1557179130.txt...                                                                 
    > Parsed 100 links as Plain Text (0 new links added)                                                                                                           
                                                                                                                                                                   
[*] [2019-05-06 21:45:30] Writing 101 links to main index...                                                                                                       
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                 
    √ /srv/backup/archive/archivebox/index.json                                                                                                                    
    √ /srv/backup/archive/archivebox/index.html                                                                                                                    
                                                                                                                                                                   
[▶] [2019-05-06 21:45:31] Updating content for 0 matching pages in archive...                                                                                      
                                                                                                                                                                   
[√] [2019-05-06 21:45:31] Update of 0 pages complete (0.00 sec)                                                                                                    
    - 0 links skipped                                                                                                                                              
    - 0 links updated                                                                                                                                              
    - 0 links had errors                                                                                                                                           
                                                                                                                                                                   
    To view your archive, open:                                                                                                                                    
        /srv/backup/archive/archivebox/index.html                                                                                                                  
    Or run the built-in webserver:                                                                                                                                 
        archivebox server                                                                                                                                          
                                                                                                                                                                   
[*] [2019-05-06 21:45:31] Writing 101 links to main index...                                                                                                       
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                 
    √ /srv/backup/archive/archivebox/index.json                                                                                                                    
    √ /srv/backup/archive/archivebox/index.html                                                                                                                    

Looking at -h, I noticed --update-all, so I tried that:

anarcat@curie:archivebox(master)$ archivebox add --update-all                                                                                                      
Traceback (most recent call last):                                                                                                                                 
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>                                                                                
    sys.exit(main())                                                                                                                                               
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main                                                
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)                                                                                                            
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main                                          
    pwd=pwd or OUTPUT_DIR,                                                                                                                                         
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand                                  
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main                                      
    out_dir=pwd or OUTPUT_DIR,                                                                                                                                     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 496, in add                                                    
    import_path = save_file_to_sources(import_path, out_dir=out_dir)                                                                                               
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 98, in typechecked_function                                    
    check_argument_type(arg_key, arg_val)                                                                                                                          
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 92, in check_argument_type                                     
    str(arg_val)[:64],                                                                                                                                             
TypeError: save_file_to_sources(..., path: str) got unexpected NoneType argument path=None   

The correct call is of course to retry with the same URLs:

anarcat@curie:archivebox(master)$ archivebox add --update-all wallabag-p1.list

which works, but it would actually be nice to (a) not crash when --update-all is passed without an argument (maybe just error in argument parsing more politely) and (b) eventually just do the right thing, which is probably to retry any failed URL from the database.
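For point (a), a minimal sketch of what a more polite failure could look like, assuming the subcommand parses its arguments with argparse. The function name `parse_add_args`, the positional name `import_path`, and the help strings are hypothetical and are not taken from ArchiveBox's actual parser:

```python
import argparse


def parse_add_args(argv):
    """Hypothetical parser for `archivebox add`; not ArchiveBox's real one."""
    parser = argparse.ArgumentParser(prog='archivebox add')
    parser.add_argument('--update-all', action='store_true',
                        help='update all links (hypothetical help text)')
    parser.add_argument('import_path', nargs='?', default=None,
                        help='file or URL to import links from (assumed name)')
    args = parser.parse_args(argv)
    if args.update_all and args.import_path is None:
        # parser.error prints the usage line plus this message and exits
        # with status 2, instead of surfacing a TypeError traceback
        parser.error('--update-all currently requires a list of URLs to import')
    return args
```

With this check, running the command with only --update-all would end in a one-line usage error rather than the traceback above. Point (b), retrying failed URLs from the database, is not covered by this sketch.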

Software versions

  • OS: Debian buster 10 up to date
  • ArchiveBox version: 0.4.1 installed from pip
  • Python version: 3.7.3something
  • Chrome version: irrelevant?

Thanks for your hard work, and sorry for the flood of bug reports! :)

Member

pirate commented May 6, 2019

I added something recently called atomic_write, and I think the behavior you're seeing is just a bug in my implementation that can be fixed quite easily. This is how atomic_write works right now:

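import os

def atomic_write(contents, path):
    # every process derives the same temp path (name as seen in the traceback)
    tmp_file = '{}.tmp'.format(path)
    try:
        # 1. create temp file
        # 2. write to temp file (open mode and encoding are assumed here)
        with open(tmp_file, 'w', encoding='utf-8') as f:
            f.write(contents)
        # 3. rename temp file over actual destination file
        os.rename(tmp_file, path)
    finally:
        # always runs: delete the temp file if it still exists, to clean up
        # after a failure
        if os.path.exists(tmp_file):
            os.remove(tmp_file)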

What you're encountering is the finally clause deleting a temp file that's being created by a different process. It can be fixed by making every temp file have a random, unique suffix such that two processes never attempt to modify the same temp file. After I push the fix I'll comment back and close this. I'll also improve testing and support for multicore runs in general in v0.4.0.
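A minimal sketch of the unique-suffix idea, assuming the standard library's `tempfile.mkstemp` is used to pick the temp name and that `contents` is a string. This is an illustration only, not the fix that was actually shipped:

```python
import os
import tempfile


def atomic_write(contents: str, path: str) -> None:
    """Write contents to path via a temp file that is unique to this call."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates a new file with a random name in the same directory,
    # so concurrent writers never share (or delete) each other's temp file
    fd, tmp_file = tempfile.mkstemp(
        prefix=os.path.basename(path) + '.', suffix='.tmp', dir=directory)
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites an existing destination; on POSIX the
        # rename is atomic
        os.replace(tmp_file, path)
    finally:
        # only this call's own temp file is removed, and only if the
        # rename above did not already consume it
        if os.path.exists(tmp_file):
            os.remove(tmp_file)
```

Two processes writing the same index can still overwrite each other's results (the last rename wins), but neither can remove the other's temp file, so the FileNotFoundError from os.rename should no longer occur. Note that mkstemp creates the file readable and writable only by the owner, which may differ from the permissions of the file it replaces.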

pirate changed the title from "can't resume indexing" to "Architecture: Concurrent runs accidentally delete each other's temp files, leaving the index broken" on May 6, 2019
pirate added the labels type: bug report, size: easy, and status: wip (Work is in-progress / has already been partially completed) on May 6, 2019
Author

anarcat commented May 6, 2019

you might want to reuse existing code for this, e.g.

https://github.com/untitaker/python-atomicwrites
https://github.com/rec/safer

Member

pirate commented Jul 24, 2020

This should all be fixed in the latest version on the django branch (we ended up using python-atomicwrites):

git checkout django
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
docker run -v $PWD/output:/data archivebox add 'https://example.com'

If you still see any issues, comment back and I'll reopen the ticket.
I still recommend running it single-threaded for now. The next version will have much better multicore support, since we'll be removing the index.json and index.html main indexes that cause so many locking issues and write race conditions.

pirate closed this as completed on Jul 24, 2020
Member

pirate commented Apr 12, 2022

Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.
