Pandas is a pretty large package: importing it adds about half a second to startup, and it makes dependency upkeep more complicated. It'd be good to find out whether lighter alternatives can achieve the same kind of lookup speed.
We're using Pandas for a few things:
1) easy lookups into the MANE transcript summary file
2) filtering (e.g. dropping duplicate rows) and sorting transcript results for genes
I think for 2), we might be able to accomplish some of this within the SQL engine. 1) might be thornier, but the MANE summary file appears to be a little under 20k rows, which might be small enough that we could just read it into an in-memory dict.
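As a rough illustration of the in-memory dict idea, here is a minimal sketch using only the standard library. The function name `build_index`, the sample rows, and the column names (`symbol`, `RefSeq_nuc`, `MANE_status`) are assumptions for illustration, not taken from our code; the real summary file has more columns and would be opened from disk instead of a string.

```python
import csv
import io
from collections import defaultdict


def build_index(tsv_file, key_column):
    """Group the rows of a tab-separated file by the value in key_column."""
    index = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        index[row[key_column]].append(row)
    # Return a plain dict so a missing key raises KeyError instead of
    # silently creating an empty entry.
    return dict(index)


# Made-up stand-in for the summary file (hypothetical genes and accessions).
SAMPLE = (
    "symbol\tRefSeq_nuc\tMANE_status\n"
    "GENE1\tNM_000001.1\tMANE Select\n"
    "GENE2\tNM_000002.1\tMANE Select\n"
    "GENE2\tNM_000003.1\tMANE Plus Clinical\n"
)

by_symbol = build_index(io.StringIO(SAMPLE), "symbol")
gene2_rows = by_symbol.get("GENE2", [])
```

Each lookup is then a single dict access. If we need lookups by more than one column, we could build one index per column; at under 20k rows the memory cost should be small.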
I'm not super confident that doing the above would achieve performance or startup gains, but it's probably a place to look as we begin trying to improve in those areas.
The MANE data doesn’t seem to update that often. We could do the data transformations we need once and store the result. We could probably do this for the other files (transcript mappings and LRG Ref Seq) as well…
I think Polars would be better -- probably would make things lighter at startup, at least. In the long term I still think it'd be better to see if we can just do this stuff with native Python, since it doesn't seem like it should be that complicated (?) but if we can quickly swap Pandas for Polars and see better startup times, that's a quick win.
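For a sense of what the native Python route might look like for the filtering and sorting piece, here is a small sketch. The function name `dedupe_and_sort` and the field names `transcript` and `start_pos` are hypothetical; it assumes results arrive as a list of dicts and keeps the first occurrence of each duplicate, which matches the default behavior of Pandas' `drop_duplicates`.

```python
def dedupe_and_sort(rows, key_fields, sort_field):
    """Drop rows that repeat the same key_fields values, then sort.

    The first occurrence of each key is kept. sorted() is stable, so rows
    with equal sort_field values keep their original relative order.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return sorted(unique, key=lambda row: row[sort_field])


# Made-up transcript results for illustration.
results = [
    {"transcript": "NM_000002.1", "start_pos": 300},
    {"transcript": "NM_000001.1", "start_pos": 100},
    {"transcript": "NM_000002.1", "start_pos": 300},
]

cleaned = dedupe_and_sort(results, ["transcript", "start_pos"], "start_pos")
```

This is only a few lines, which supports the hunch that this part isn't complicated, though it says nothing yet about speed on real query results.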