Allow use of generic pa.DataFrameSchema/Model for different supported libraries #1632

Open
DavidSlayback opened this issue May 9, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@DavidSlayback

Is your feature request related to a problem? Please describe.
It's a small issue, but in a repo that is transitioning from Pandas to Polars over time, Pandas and Polars dataframes with the same basic schema coexist. Currently, it seems I need to define two schemas for each one: a Pandas version using pa.DataFrameModel and a Polars version using pa.polars.DataFrameModel.

Describe the solution you'd like
Ideally, the top-level pa.DataFrameModel and pa.DataFrameSchema classes would use something like @singledispatch to delegate to the appropriate backend implementation based on the type of the input dataframe. This is similar to an Ibis Table, where you rarely need to reach into a specific backend to call a backend-specific function.
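
To make the idea concrete, here is a rough sketch of type-based dispatch with functools.singledispatch. It is not pandera code: PandasLikeFrame, PolarsLikeFrame, GenericSchema, and validate_with_backend are all hypothetical names, and the two frame classes are stand-ins so the snippet runs without pandas or polars installed. In a real version the registered types would presumably be pandas.DataFrame and polars.DataFrame.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Hypothetical stand-in for pandas.DataFrame."""

    def __init__(self, dtypes):
        self.dtypes = dict(dtypes)  # column name -> dtype name


class PolarsLikeFrame:
    """Hypothetical stand-in for polars.DataFrame."""

    def __init__(self, dtypes):
        self.dtypes = dict(dtypes)  # column name -> dtype name


def _check_columns(df, columns):
    # Shared check: every declared column must exist with the declared dtype.
    for name, dtype in columns.items():
        if df.dtypes.get(name) != dtype:
            raise ValueError(f"column {name!r} is missing or is not {dtype}")


@singledispatch
def validate_with_backend(df, columns):
    # Fallback used when no backend is registered for type(df).
    raise TypeError(f"no backend registered for {type(df).__name__}")


@validate_with_backend.register
def _(df: PandasLikeFrame, columns):
    # A real implementation would call the pandas backend here.
    _check_columns(df, columns)
    return "pandas"


@validate_with_backend.register
def _(df: PolarsLikeFrame, columns):
    # A real implementation would call the polars backend here.
    _check_columns(df, columns)
    return "polars"


class GenericSchema:
    """Hypothetical single schema definition shared by both backends."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def validate(self, df):
        # singledispatch picks the implementation from type(df).
        return validate_with_backend(df, self.columns)


schema = GenericSchema({"id": "int64", "name": "str"})
pandas_backend = schema.validate(PandasLikeFrame({"id": "int64", "name": "str"}))
polars_backend = schema.validate(PolarsLikeFrame({"id": "int64", "name": "str"}))
```

The schema is defined once, and the backend is chosen only when validate sees the dataframe. An unregistered dataframe type falls through to the TypeError fallback.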

Describe alternatives you've considered
What I'm currently doing is just being more verbose and defining multiple schemas. It works fine! It just seems a bit strange as a workflow. Obviously if we were always in Polars it wouldn't be an issue, but that'll take a while.

@DavidSlayback DavidSlayback added the enhancement New feature or request label May 9, 2024
@cosmicBboy
Collaborator

I've thought about this a lot, and I think we're getting closer to this world. However, my main concern is that a generic dataframe schema would have to include a superset of the options for every supported dataframe library. I think eventually we'll nail down a "common dataframe schema api to rule them all", at which point this concern becomes less of an issue.

We recently introduced a generic dataframe api: https://github.com/unionai-oss/pandera/tree/main/pandera/api/dataframe which is where this dispatching might happen. Currently pandas and polars schemas inherit from these classes (pyspark still needs to be done).

If folks engage with this issue (👍 or comment/discuss), we can prioritize this effort. In the meantime, @DavidSlayback, if you can write down a spec for how this would all work, perhaps with a code snippet sketching how dispatching would work, that would get the ball rolling.

@DavidSlayback
Author

Sure, I'll try to sketch something up later this week when I'm free!
