Bug: Running archivebox update --index-only doesn't upgrade Snapshot index.{html,json} files #962

Closed
mwnoo opened this issue Apr 7, 2022 · 2 comments

mwnoo commented Apr 7, 2022

Describe the bug

I tried to update the data/archive/<timestamp>/index.{json,html} files by running archivebox update --index-only, as described in #544. I expected the command to rewrite the index.{json,html} files in each data/archive/<timestamp>/ folder, but the updated index.{json,html} files are written only to OUTPUT_DIR, not to the timestamp folders. The updated files in OUTPUT_DIR are probably only used by sonic to build the search index.

Steps to reproduce

  1. Track the archive folder with git
  2. Run archivebox update --index-only (also tried: archivebox update --index-only --overwrite but same result)
  3. Updated index.{json,html} only written to OUTPUT_DIR
  4. Running git status shows no changes to the archive folder
  5. data/archive/<timestamp>/index.{json,html} files are not updated
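
For anyone who wants to check step 5 without git, here is a minimal Python sketch. It assumes the layout shown above (an `archive/` folder inside the data directory, one subfolder per snapshot) and that `archivebox` is on the PATH. The helper names `index_hashes`, `changed_files`, and `check_update` are made up for this example and are not part of ArchiveBox.

```python
import hashlib
import subprocess
from pathlib import Path

# Per-snapshot index files this issue is about.
INDEX_NAMES = ("index.json", "index.html")


def index_hashes(archive_dir):
    """Map each snapshot index file (path relative to archive_dir) to its SHA-256."""
    archive_dir = Path(archive_dir)
    hashes = {}
    for snapshot_dir in archive_dir.iterdir():
        if not snapshot_dir.is_dir():
            continue
        for name in INDEX_NAMES:
            path = snapshot_dir / name
            if path.is_file():
                key = path.relative_to(archive_dir).as_posix()
                hashes[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def changed_files(before, after):
    """Return the sorted paths that were added, removed, or changed in content."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))


def check_update(data_dir):
    """Hash the index files, run the command from this issue, and report changes."""
    archive_dir = Path(data_dir) / "archive"
    before = index_hashes(archive_dir)
    subprocess.run(["archivebox", "update", "--index-only"], cwd=data_dir, check=True)
    return changed_files(before, index_hashes(archive_dir))
```

Calling `check_update("/archive_data/archivebox/data")` (the OUTPUT_DIR from the version output below) should return an empty list if the behavior described here holds, since no per-snapshot index file is rewritten.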

Screenshots or log output

N/A

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.13.0-39-generic-x86_64-with-glibc2.29 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.8.10         valid     /usr/bin/python3.8                                                          
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.8/dist-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.68.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v14.19.1        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.32         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.25.1         valid     /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v100.0.4896.60  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v11.0.2         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/local/lib/python3.8/dist-packages/archivebox                           
 √  TEMPLATES_DIR         3 files         valid     /usr/local/lib/python3.8/dist-packages/archivebox/templates                 
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  34 files        valid     ./chromium                                                                  
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            13 files        valid     /archive_data/archivebox/data                                               
 √  SOURCES_DIR           112 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           812 files       valid     ./archive                                                                   
 √  CONFIG_FILE           460.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             6.8 MB          valid     ./index.sqlite3                                                             

Config

[SERVER_CONFIG]
SECRET_KEY = XXXX
SNAPSHOTS_PER_PAGE = 100

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = False
SAVE_MEDIA = False
SAVE_WGET = False
SAVE_READABILITY = True
SAVE_MERCURY = True
SAVE_DOM = True

[ARCHIVE_METHOD_OPTIONS]
CHROME_USER_DATA_DIR = chromium/

[GENERAL_CONFIG]
TIMEOUT = 180

[SEARCH_BACKEND_CONFIG]
SEARCH_BACKEND_ENGINE = sonic
SEARCH_BACKEND_HOST_NAME = localhost

pirate (Member) commented Apr 11, 2022

This is by design: for safety and performance on large collections, the timestamp folder index files are only updated lazily, when they actually need to change. If you want to update them all, select all the snapshot rows in the UI and click the update button.
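
To illustrate what a lazy update of this kind can look like in general, here is a minimal Python sketch of a write-only-if-changed helper. This is an assumption about the general pattern, not ArchiveBox's actual implementation, and `write_index_if_changed` is a hypothetical name.

```python
import json
from pathlib import Path


def write_index_if_changed(path, data):
    """Write data as JSON to path only when the serialized content differs.

    Returns True if the file was written, False if it was left untouched.
    """
    path = Path(path)
    # Sort keys so the same data always serializes to the same text.
    new_text = json.dumps(data, indent=4, sort_keys=True)
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False  # nothing changed, so skip the write entirely
    path.write_text(new_text, encoding="utf-8")
    return True
```

Skipping unchanged files avoids rewriting every snapshot folder on each run, which is one reason a tool like git would report no changes after an index-only update.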

I've added more notes to the Wiki page on upgrading to explain this: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#merge-two-or-more-existing-archives

I've also added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.

pirate closed this as completed Apr 12, 2022
pirate changed the title from "Bug: archivebox update --index-only writes output only to OUTPUT_DIR" to "Bug: Running archivebox update --index-only doesn't upgrade Snapshot index.{html,json} files" Apr 12, 2022

mwnoo (Author) commented Apr 15, 2022

Thanks @pirate for the UI suggestion (I focused mainly on the CLI options)
Great project!
