Investigate Pandas-less data processing #178

jsstevenson · 2023-07-31T20:53:22Z

Pandas is a pretty large package: importing it costs us about half a second at startup, and it also makes dependency upkeep more complicated. It'd be cool to see whether lighter alternatives can achieve the same kind of lookup speed.

We're using Pandas for a few things:

1) easy lookups into the MANE transcript summary file
2) filtering (e.g. dropping duplicate rows) and sorting transcript results for genes

I think for 2), we might be able to accomplish some of this within the SQL engine. 1) might be thornier but the MANE summary file appears to be a little less than 20k rows, which might be small enough that we could just read into an in-memory dict.
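As a rough sketch of the in-memory dict idea for 1), assuming the summary file is tab-separated with a header row: the column names (`symbol`, `RefSeq_nuc`, `MANE_status`), the sample rows, and the helper name `load_mane_index` below are all placeholders for illustration, not taken from the real file or codebase.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for the MANE summary file. Column names and values are
# made-up placeholders; check the real file's header before relying on them.
SAMPLE_TSV = (
    "symbol\tRefSeq_nuc\tMANE_status\n"
    "GENE_A\tTX_1.1\tMANE Select\n"
    "GENE_A\tTX_2.1\tMANE Plus Clinical\n"
    "GENE_B\tTX_3.1\tMANE Select\n"
)


def load_mane_index(handle, key="symbol"):
    """Read the whole file once and group its rows by the value in `key`."""
    index = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        index[row[key]].append(row)
    return dict(index)


# With the real file this would be an open() handle instead of StringIO.
mane_by_symbol = load_mane_index(io.StringIO(SAMPLE_TSV))
gene_a_rows = mane_by_symbol.get("GENE_A", [])
print([row["RefSeq_nuc"] for row in gene_a_rows])
```

At roughly 20k rows, building one dict per lookup key at startup should be cheap, though that would need measuring against the current approach.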

I'm not super confident that doing the above would yield performance or startup gains, but it's probably a place to look as we begin trying to improve in those areas.
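Going back to 2), here is a minimal sketch of pushing the duplicate-dropping and sorting into the query itself. It uses an in-memory SQLite database purely as a stand-in for whatever SQL engine is actually in use; the table name `tx_results`, its columns, and the rows are hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tx_results "
    "(gene TEXT, tx_ac TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO tx_results VALUES (?, ?, ?, ?)",
    [
        ("GENE_A", "TX_2.1", 100, 200),
        ("GENE_A", "TX_1.1", 50, 150),
        ("GENE_A", "TX_2.1", 100, 200),  # exact duplicate of the first row
        ("GENE_B", "TX_3.1", 10, 20),
    ],
)

# DISTINCT drops the duplicate row and ORDER BY sorts the rest, so the
# caller never needs a dataframe for either step.
deduped = conn.execute(
    "SELECT DISTINCT gene, tx_ac, start_pos, end_pos "
    "FROM tx_results WHERE gene = ? "
    "ORDER BY tx_ac, start_pos",
    ("GENE_A",),
).fetchall()
print(deduped)
conn.close()
```

Whether this covers every filtering rule currently done in Pandas is an open question; it only shows that plain de-duplication and ordering can live in the query.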

korikuzma · 2023-07-31T20:58:41Z

The MANE data doesn't seem to update that often. We could do whatever data transformations we need once and store the result. We could probably do this for the other files (transcript mappings and LRG Ref Seq) as well…
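One way this transform-once-and-store idea might look, as a sketch only: build the lookup a single time, write it to disk as JSON, and read the stored copy on later startups. The helper names, the cache file name, and the lookup contents are all hypothetical, and a real version would also need some way to invalidate the cache when the source file changes.

```python
import json
import tempfile
from pathlib import Path

build_calls = {"n": 0}  # counts how often the expensive step actually runs


def build_lookup():
    """Hypothetical stand-in for parsing and transforming the source file."""
    build_calls["n"] += 1
    return {"GENE_A": ["TX_1.1", "TX_2.1"], "GENE_B": ["TX_3.1"]}


def load_or_build(cache_path, build):
    """Return the stored data if present; otherwise build it and store it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    data = build()
    cache_path.write_text(json.dumps(data))
    return data


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "mane_lookup.json"
    first = load_or_build(cache, build_lookup)   # builds and writes the file
    second = load_or_build(cache, build_lookup)  # reads the stored copy

print(first == second, build_calls["n"])
```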

korikuzma · 2023-10-10T16:57:27Z

@jsstevenson https://pypi.org/project/polars/ ?

jsstevenson · 2023-10-10T17:24:55Z

I think Polars would be better -- probably would make things lighter at startup, at least. In the long term I still think it'd be better to see if we can just do this stuff with native Python, since it doesn't seem like it should be that complicated (?) but if we can quickly swap Pandas for Polars and see better startup times, that's a quick win.
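To check whether a swap actually helps at startup, a crude measurement along these lines might be enough. It times `import <module>` in a fresh interpreter so that modules already loaded in the current process don't skew the result. The function name is made up, and the stdlib module names are only there to keep the sketch runnable anywhere; substitute `pandas` and `polars` in an environment where they are installed. Python's built-in `python -X importtime` flag gives a more detailed per-module breakdown.

```python
import subprocess
import sys
import time


def cold_import_seconds(module_name):
    """Time `import module_name` in a fresh interpreter, startup included."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
    return time.perf_counter() - start


# Importing `sys` is effectively free, so this approximates bare startup.
baseline = cold_import_seconds("sys")

# Stdlib placeholders; swap in "pandas" and "polars" where available.
for name in ("json", "csv"):
    extra = cold_import_seconds(name) - baseline
    print(f"{name}: {extra:.3f}s over baseline")
```

Single runs are noisy, so repeating each measurement a few times and taking the minimum would give a fairer comparison.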

Initial work for #178 to help improve startup time

jsstevenson added the "technical debt" label (a feature/requirement implemented in a sub-optimal way that must be rewritten; contrast with the "cleanup" label) on Jul 31, 2023

korikuzma self-assigned this Oct 10, 2023

korikuzma added a commit that referenced this issue Oct 10, 2023

build: replace pandas with polars (#178)

b616b92

korikuzma mentioned this issue Oct 11, 2023

build: replace pandas with polars (#178) #205

Merged

korikuzma added a commit that referenced this issue Oct 13, 2023

build: replace pandas with polars (#178) (#205)

774f17e

