Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Python Package from Pandas to Polars #159

Open
jjacobson95 opened this issue Apr 29, 2024 · 8 comments
Open

Convert Python Package from Pandas to Polars #159

jjacobson95 opened this issue Apr 29, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request package

Comments

@jjacobson95
Copy link
Collaborator

@sgosline We've briefly spoken about this before and it came up after my comp bio presentation as well. I'd like to do this before addressing #111.

Currently the package is build from Python, converting it all to Polars will speed up operations and make it less susceptible to memory issues as we continue to add more data. Even now, I could see users on low memory devices potentially having issues.

This update should address the following 7 files:

  • Data Loader Script (loader.py)
  • Data Exporation Tutorial & HTML version in docs
  • Drug Exploration Tutorial & HTML version in docs
  • Deep Learning Tutorial & HTML version in docs
@jjacobson95 jjacobson95 self-assigned this Apr 29, 2024
@jjacobson95 jjacobson95 added enhancement New feature or request package labels Apr 29, 2024
@sgosline
Copy link
Member

This seems like a good move. It'd be nice to move all the build scripts to polars as well, but that will break a lot.

@jjacobson95
Copy link
Collaborator Author

Great, I fully agree with that too.

@sgosline
Copy link
Member

Just to be explicit (I thought it was assumed), we should make sure that the forward facing code is still in pandas, so that users dont have to learn polars to use the package.

@jjacobson95
Copy link
Collaborator Author

I thought the goal was to migrate the package from pandas to polars? Pandas is slower and has memory issues. If users have to learn one of the two, they might as well learn the one that is more optimized for our data?

Everything in the package that uses pandas is user-facing.

@jjacobson95
Copy link
Collaborator Author

As this code is nearly complete, we could have two versions if we wanted? - DatasetLoader_pd() and DatasetLoader_pl()?

But having two versions would also increase time for future code development such as the upcoming DataTypeLoader.

@sgosline
Copy link
Member

no, that'd be a nightmare. Just add a flag called 'use_polars' and set the default to False. HOw you imagine this working when #111 is complete?

@jjacobson95
Copy link
Collaborator Author

Oh yeah, thats a much better solution (using a flag). For #111, I assume we should also create a pandas and polars version as well (with the same flag option)?

@sgosline
Copy link
Member

sgosline commented Apr 30, 2024

I was hoping it'd speed up things on the backend, but polars is too new to really rely on at this stage. Basically check the flag and add in a to_pandas at the end of every function. I dont think this requires 'maintaining two versions of the codebase'. We can set the flag to default to True if polars becomes the standard that pandas is today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request package
Projects
Status: In progress
Development

No branches or pull requests

2 participants