Investigate Pandas-less data processing #178

Open
jsstevenson opened this issue Jul 31, 2023 · 3 comments
Labels
technical debt A feature/requirement implemented in a sub-optimal way & must be re-written. Contrast to "cleanup"

Comments

@jsstevenson
Member

Pandas is a pretty large package: importing it costs us about half a second at startup, and it also makes dependency upkeep more complicated. It'd be good to check whether lighter alternatives can achieve the same kind of lookup speed.

We're using Pandas for a few things:

  1. easy lookups into the MANE transcript summary file
  2. filtering (e.g. dropping duplicate rows) and sorting transcript results for genes

I think for 2), we might be able to accomplish some of this within the SQL engine. 1) might be thornier, but the MANE summary file appears to be a little under 20k rows, which might be small enough that we could just read it into an in-memory dict.
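A rough sketch of both ideas, using only the standard library. Everything here is an assumption for illustration: the column names (`symbol`, `RefSeq_nuc`), the placeholder rows, the `tx` table, and the use of `sqlite3` as a stand-in for whatever SQL engine the project actually queries.

```python
import csv
import io
import sqlite3

# Made-up placeholder rows; the real MANE summary columns should be
# checked against the actual file before relying on these names.
SAMPLE_MANE_TSV = (
    "symbol\tRefSeq_nuc\tMANE_status\n"
    "GENE_A\tNM_000001.1\tMANE Select\n"
    "GENE_A\tNM_000002.1\tMANE Plus Clinical\n"
    "GENE_B\tNM_000003.1\tMANE Select\n"
)


def build_symbol_index(handle):
    """Group rows by gene symbol so each lookup is a single dict access."""
    index = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        index.setdefault(row["symbol"], []).append(row)
    return index


# 1) Lookups: read the whole file into a dict once at startup.
mane_by_symbol = build_symbol_index(io.StringIO(SAMPLE_MANE_TSV))

# 2) Filtering and sorting: let the database drop duplicates and order rows.
conn = sqlite3.connect(":memory:")  # stand-in engine, hypothetical schema
conn.execute("CREATE TABLE tx (gene TEXT, tx_ac TEXT, start_pos INTEGER)")
conn.executemany(
    "INSERT INTO tx VALUES (?, ?, ?)",
    [
        ("GENE_A", "TX_2.1", 200),
        ("GENE_A", "TX_1.1", 100),
        ("GENE_A", "TX_2.1", 200),  # exact duplicate of the first row
    ],
)
rows = conn.execute(
    "SELECT DISTINCT gene, tx_ac, start_pos FROM tx "
    "WHERE gene = ? ORDER BY start_pos",
    ("GENE_A",),
).fetchall()
```

With roughly 20k rows, the dict approach keeps the whole file in memory, which is probably acceptable but worth confirming against the real file size.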

I'm not super confident that doing the above would yield performance or startup gains, but it's probably a place to look as we begin trying to improve in those areas.
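One way to check whether there is a gain to be had is to time the import directly. The helper below is a hypothetical sketch, not something in the codebase; it only gives a meaningful number in a fresh interpreter, since a module that is already loaded imports almost instantly.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return the seconds taken to import a module that is not yet loaded."""
    if module_name in sys.modules:
        raise RuntimeError(
            f"{module_name} is already imported; run in a fresh interpreter"
        )
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start
```

Calling `time_import("pandas")` before and after a change would give a rough comparison. CPython's `python -X importtime` flag gives a per-module breakdown of the same thing.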

@jsstevenson jsstevenson added the technical debt A feature/requirement implemented in a sub-optimal way & must be re-written. Contrast to "cleanup" label Jul 31, 2023
@korikuzma
Member

The MANE data doesn’t seem to update that often. We could do the data transformations we need once and store the result. We could probably do this for the other files (transcript mappings and LRG RefSeq) as well…
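A minimal sketch of that transform-once-and-store idea, assuming a tab-separated source file and a JSON cache next to it. The function name, the `symbol` key column, and the file layout are all hypothetical; the cache is rebuilt only when the source file is newer than the stored result.

```python
import csv
import json
from pathlib import Path


def load_or_build_index(source_tsv: Path, cache_json: Path, key_column: str = "symbol"):
    """Load the cached index if it is at least as new as the source, else rebuild it."""
    if (
        cache_json.exists()
        and cache_json.stat().st_mtime >= source_tsv.stat().st_mtime
    ):
        with cache_json.open() as f:
            return json.load(f)

    # Cache is missing or stale: redo the transformation and store it.
    index = {}
    with source_tsv.open(newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            index.setdefault(row[key_column], []).append(row)
    with cache_json.open("w") as f:
        json.dump(index, f)
    return index
```

JSON is used here only because it needs no extra dependency; any stored format that loads quickly at startup would serve the same purpose.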

@korikuzma korikuzma self-assigned this Oct 10, 2023
@korikuzma
Member

@jsstevenson
Member Author

I think Polars would be better; it would probably make things lighter at startup, at least. In the long term I still think it'd be better to see if we can just do this stuff with native Python, since it doesn't seem like it should be that complicated (?), but if we can quickly swap Pandas for Polars and see better startup times, that's a quick win.
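For the native Python route, dropping duplicate rows and sorting could look roughly like the sketch below. It assumes rows are plain dicts and that keeping the first occurrence of each duplicate is acceptable; the function and field names are hypothetical, and this is not a drop-in replacement for any existing code.

```python
def dedupe_and_sort(rows, key_fields, sort_field):
    """Keep the first row for each combination of key_fields, then sort."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # sorted() is stable, so rows with equal sort values keep their order.
    return sorted(unique, key=lambda row: row[sort_field])


# Illustrative placeholder data only.
transcripts = [
    {"tx_ac": "TX_2.1", "start_pos": 200},
    {"tx_ac": "TX_1.1", "start_pos": 100},
    {"tx_ac": "TX_2.1", "start_pos": 200},
]
result = dedupe_and_sort(transcripts, key_fields=("tx_ac",), sort_field="start_pos")
```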

korikuzma added a commit that referenced this issue Oct 13, 2023
Initial work for #178 to help improve startup time

2 participants