
Speed up DownloadEvent queries #370

Merged
DasSkelett merged 6 commits into KSP-SpaceDock:alpha from HebaruSan:fix/download-event-perf on Jun 23, 2021
Conversation

@HebaruSan (Member) commented Jun 9, 2021

Problem

Flame graph profiling from #346 has revealed that accessing DownloadEvent is a substantial part of loading a mod page and especially starting a download:

(flame graph screenshots)

Cause

DownloadEvent is a big table with a lot of rows, and we filter it by columns that aren't indexed (mod_id and version_id). This means the database engine has to scan every row and check whether it matches.
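The difference between that full scan and an index lookup can be sketched with SQLite's EXPLAIN QUERY PLAN (the production database is PostgreSQL, whose planner output looks different, but the scan-vs-search distinction is the same; the table and index names here are simplified stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE downloadevent "
             "(id INTEGER PRIMARY KEY, mod_id INTEGER, "
             "version_id INTEGER, created TIMESTAMP)")

query = "SELECT * FROM downloadevent WHERE mod_id = ? ORDER BY created DESC"

# Without an index the planner has no choice but to scan every row
before = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
print(before[0][3])   # e.g. "SCAN downloadevent"

# A multi-column index on (mod_id, created) serves both the filter and the sort
conn.execute("CREATE INDEX ix_downloadevent_mod_id_created "
             "ON downloadevent (mod_id, created)")
after = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
print(after[0][3])    # e.g. "SEARCH downloadevent USING INDEX ..."
```

Note that the (mod_id, created) shape also lets the planner read rows in created order straight out of the index instead of sorting afterwards.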

Changes

  • New multi-column index in DownloadEvent: (mod_id, created), to be used for loading the mod page
  • New multi-column index in DownloadEvent: (version_id, created), to be used when starting downloads
  • The download route now filters only by DownloadEvent.version_id, since a matching version_id implies a matching mod_id, and both columns are now indexed
  • The download route previously retrieved DownloadEvent.created and checked in Python code whether it was more than 1 hour ago, but the check was subtly wrong: it used timedelta.seconds, which holds only the sub-day remainder (0–86399), so an old event could satisfy the limit when it shouldn't. Now the limit is enforced in SQL, correctly.
  • Cascade deletion is enabled for DownloadEvent.mod_id and DownloadEvent.version_id so download events are deleted from the database if their corresponding mod or version is removed
  • A few calls to isfile are now removed to lighten the load on storage
  • ModVersion.download_size now stores the size of the download file, again to lighten the load on storage
  • The featured.rss routes no longer set properties on db objects and then call db.rollback() to undo them, because those calls showed up as slow in profiling; the timezone offset is also now included in the RSS output because I think it's supposed to be there
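The timedelta.seconds pitfall mentioned above is easy to demonstrate in plain Python (no SpaceDock code involved):

```python
from datetime import timedelta

age = timedelta(days=1, seconds=30)     # event happened just over a day ago

# .seconds is only the sub-day remainder (0..86399), not the full duration
assert age.seconds == 30
# .total_seconds() is the real elapsed time
assert age.total_seconds() == 86430.0

# So a "less than an hour ago?" check based on .seconds wrongly
# treats a day-old event as 30 seconds old:
assert age.seconds < 3600                   # bug: passes
assert not (age.total_seconds() < 3600)     # correct: fails
```

Pushing the comparison into SQL sidesteps the whole class of bug, since the database compares full timestamps.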

This should speed up mod page loads and downloads somewhat noticeably in some cases.

Also mypy has changed again, so now we install type libs to satisfy it.

@HebaruSan added labels Type: Improvement, Area: Backend, Priority: Normal, Status: Ready, Area: Migration on Jun 9, 2021
@HebaruSan HebaruSan requested a review from DasSkelett June 9, 2021 22:40
@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch 4 times, most recently from d92ed1d to a5c7ccf Compare June 9, 2021 23:05
@DasSkelett (Member) commented Jun 9, 2021

So apparently on my system, after generating a bunch of download events (~5k) for only a few mods (6) and mod versions (10), the query planner decides to go through the DownloadEvent.created index backwards, filtering out all non-matching mods and stopping after encountering the first matching one:
(query plan screenshot)
This is even after creating indices for mod_id, version_id and (mod_id, version_id).

So Postgres assumes that there are many matching rows for that filter condition (high download event per mod version ratio), and that using the index to go directly to the matching rows and sorting them by created afterwards would take longer than going through a bunch of non-matching rows but not having to sort them anymore.

Production has a lot more mods (~2k), mod versions (~15k), and download events (~10m), but the ratio is roughly the same, so there's a chance that the query planner behaves the same and these new indices would not be used.

This is one of the cases where it would be really nice to be able to run EXPLAIN SELECT queries on the production db, or a copy thereof...

@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch from f3bb23d to fcfa28d Compare June 10, 2021 00:01
@DasSkelett (Member)

Good idea! I can confirm that it uses this index at least in the /download query, and you can see it's a lot faster:
(query plan screenshot)


@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch 2 times, most recently from de85df1 to 4776a9d Compare June 10, 2021 01:31
@DasSkelett (Member)

While we're on it, what do you think about adding ondelete='CASCADE' to the foreign keys, so that the download events are also deleted from the database if their corresponding mod (version) is removed? I don't think there's a reason to keep them around, and they just bloat the table.

    mod_id = Column(Integer, ForeignKey('mod.id', ondelete='CASCADE'))
    version_id = Column(Integer, ForeignKey('modversion.id', ondelete='CASCADE'))
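For illustration, this is what ON DELETE CASCADE does at the database level (a SQLite sketch with a simplified schema; note SQLite only enforces foreign keys when the pragma is set, whereas Postgres always does):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific; Postgres needs no opt-in

conn.execute("CREATE TABLE mod (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE downloadevent ("
             "id INTEGER PRIMARY KEY, "
             "mod_id INTEGER REFERENCES mod(id) ON DELETE CASCADE)")

conn.execute("INSERT INTO mod (id) VALUES (1)")
conn.executemany("INSERT INTO downloadevent (mod_id) VALUES (?)", [(1,)] * 3)

# Deleting the mod takes its download events with it
conn.execute("DELETE FROM mod WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM downloadevent").fetchone()[0]
print(remaining)  # 0
```

Because the cascade happens inside the database, no application code has to remember to clean up the events.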


@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch 2 times, most recently from eb95326 to ef59e77 Compare June 19, 2021 12:12
@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch from ef59e77 to 95fcd95 Compare June 19, 2021 13:22
@HebaruSan (Member, Author)

@DasSkelett, what do you think about also removing this isfile call and instead putting the getsize call inside a try/except block?

    def get_version_size(f: str) -> Optional[str]:
        if not os.path.isfile(f):
            return None
        size = os.path.getsize(f)
        if size < 1023:
            return "%d %s" % (size, ("byte" if size == 1 else "bytes"))
        elif size < 1048576:
            return "%3.2f KiB" % (size/1024)
        elif size < 1073741824:
            return "%3.2f MiB" % (size/1048576)
        elif size < 1099511627776:
            return "%3.2f GiB" % (size/1073741824)
        else:
            return "%3.2f TiB" % (size/1099511627776)

Sometimes (but not always) the mod page bogs down on isfile, maybe we can eliminate some of the requests to the storage here?

(profiling flame graph screenshot)

If the storage causes getsize to fail then we can still return None in the except handler.
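A sketch of what the suggested rewrite might look like, keeping the existing formatting thresholds and replacing the isfile check with exception handling (an illustration of the idea, not the committed code):

```python
import os
from typing import Optional

def get_version_size(f: str) -> Optional[str]:
    try:
        # Single storage round trip; a missing (or just-deleted) file raises OSError
        size = os.path.getsize(f)
    except OSError:
        return None
    if size < 1023:
        return "%d %s" % (size, ("byte" if size == 1 else "bytes"))
    elif size < 1048576:
        return "%3.2f KiB" % (size / 1024)
    elif size < 1073741824:
        return "%3.2f MiB" % (size / 1048576)
    elif size < 1099511627776:
        return "%3.2f GiB" % (size / 1073741824)
    else:
        return "%3.2f TiB" % (size / 1099511627776)
```

This also closes the race between the two calls, since there is no window in which the file can disappear between a check and a stat.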

@DasSkelett (Member) commented Jun 20, 2021

@DasSkelett, what do you think about also removing this isfile call and instead putting the getsize call inside a try/except block?

Hm yeah, that could work. It would also remove a theoretical bug, because a file could be removed in between the isfile and getsize calls.

I'm also pondering this line:

if not storage or not os.path.isfile(os.path.join(storage, mod_version.download_path)):

If we remove* this line, we would not only get rid of another filesystem access, but also allow the reverse proxies to serve this file out of their memory cache even if the storage is inaccessible.
A downside is that if the file does indeed not exist (or the storage is inaccessible) and neither of the reverse proxies in front has it cached, the user wouldn't see the nice 404 template from the backend, but some standard Apache 404 page. Probably not that big of an issue.
And maybe there are other downsides that I haven't thought of yet.

*: we still need it for when there's no cdn_domain configured and we directly serve the files from the backend, so we'd need to move it down, not fully remove it.

The next step after these changes should probably be following @techman83's suggestion to store the file size in the database, to get rid of even more calls.
(If it were a local disk, one or two stat calls would probably be cheaper than a database query, but since it's a network mount...)
This would nicely fit in get_version_size(): first ask the database, and if it's not stored there yet, check the filesystem (with your above improvement of a single call), then fill it into the db.
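That flow could be sketched roughly like this (ModVersion here is a plain stand-in class, not the real SQLAlchemy model; the storage and download_path names mirror the ones used above, and download_size is the proposed new column):

```python
import os
from typing import Optional

class ModVersion:
    # Stand-in for the SQLAlchemy model; download_size is the cached column
    def __init__(self, download_path: str,
                 download_size: Optional[int] = None):
        self.download_path = download_path
        self.download_size = download_size

def get_version_size(storage: str, version: ModVersion) -> Optional[int]:
    """Return the download size, filling the cache from storage on a miss."""
    if version.download_size is None:
        try:
            # One getsize call instead of isfile + getsize
            version.download_size = os.path.getsize(
                os.path.join(storage, version.download_path))
        except OSError:
            return None   # file missing or storage unreachable; leave cache empty
    return version.download_size
```

After the first call the size comes from the database column, so the network mount is never touched again for that version.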

@HebaruSan (Member, Author)

OK, let me know how the latest commit looks, should include those changes.

@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch from fcc8be8 to 92b110e Compare June 20, 2021 18:52
@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch from 0852526 to a1cabb0 Compare June 23, 2021 13:42
@HebaruSan HebaruSan force-pushed the fix/download-event-perf branch from a1cabb0 to 32b0db2 Compare June 23, 2021 13:45
@DasSkelett (Member) left a comment

Thanks so much! This looks really really good now, and contains so many improvements. And you kept on implementing every new request I came up with 😄


Labels

Area: Backend, Area: Migration, Priority: Normal, Status: Ready, Type: Improvement, Type: Performance
