I noticed listing snapshots is really slow. I found a few optimizations that seem to improve performance, but wanted to know whether there are any undesirable side effects to them:
- get_dir_size() requires recursing into the snapshot directories for every snapshot on a page. It seems archive_size() caches this, but only for the default Django cache TTL (300 seconds). Is there any reason we can't just set CACHES['default']['TIMEOUT'] = None to ensure these keys don't expire by default?
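For reference, a minimal sketch of that settings.py change (the backend shown is Django's default in-memory one; ArchiveBox's actual cache backend may differ):

```python
# settings.py — sketch only; ArchiveBox's real cache backend may differ.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        # TIMEOUT=None makes cache.set() store entries with no expiry by
        # default (whereas TIMEOUT=0 would expire them immediately).
        "TIMEOUT": None,
    }
}
```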
- ArchiveBox doesn't expose any options for choosing an external cache, which isn't great when running in ephemeral containers. I've had luck with configuring
django_redis in settings.py so the cache can be periodically written to disk. If there's interest, I can put up a PR which optionally enables this. (Alternatively, if there are no blockers to upgrading to Django>=4.0, we can use the built-in Redis cache client.)
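Roughly what I have in mind (the Redis URL and client options here are assumptions for illustration, not ArchiveBox defaults; disk persistence itself would be handled on the Redis side, e.g. RDB snapshots in redis.conf):

```python
# settings.py — hedged sketch of a django-redis backend; LOCATION and
# OPTIONS are placeholder values, not anything ArchiveBox ships with.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
        "TIMEOUT": None,  # keep entries until Redis evicts them
    }
}

# On Django >= 4.0 the built-in backend could be used instead:
# "BACKEND": "django.core.cache.backends.redis.RedisCache"
```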
- With a warm cache,
Snapshot.from_json() requires a round trip to both the DB and the cache. Calling self.tags_str() with nocache=False seems to cut DB latency roughly in half according to the Django Debug Toolbar.