Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very rough draft for pydantic v2 #4

Closed

Conversation

thomasaarholt
Copy link
Collaborator

  • Downgrade Sphinx to compatible version
  • Bump package version to v0.5.0
  • very rough draft for pydantic v2

@ion-elgreco
Copy link
Contributor

@thomasaarholt What else needs to be done to push this over the edge?

@ion-elgreco
Copy link
Contributor

@JakobGM @thomasaarholt Polars just bumped the minimum version for pydantic to v2.0 pola-rs/polars#10923. It would be nice if we can also get pydantic 2.0 support in patito, otherwise users are stuck on Polars 0.19.1.

@thomasaarholt
Copy link
Collaborator Author

I completely agree. Jakob and I are both on very full schedules at the moment and are trying to find time to finish off the work that we started in this branch: https://github.com/JakobGM/patito/tree/jakobgm/patito-v2

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Sep 5, 2023

I see! If there are some small things I could help out with; I can try to familiarize myself more with the codebase and make some additions, but I probably would need some pointers.

@stinodego
Copy link
Contributor

@JakobGM @thomasaarholt Polars just bumped the minimum version for pydantic to v2.0 pola-rs/polars#10923. It would be nice if we can also get pydantic 2.0 support in patito, otherwise users are stuck on Polars 0.19.1.

We did not realize this dependency clash with patito. We'll revert this change for now, so you'll be able to use 0.19.3 and later.

@thomasaarholt
Copy link
Collaborator Author

Thanks @stinodego. Hopefully won't be much longer before we bump it ourselves.

@brendancooley
Copy link
Contributor

brendancooley commented Oct 24, 2023

If my understanding is correct, a major outstanding issue is the forced json conversion of pydantic v2 BaseModel schemas via model_json_schema. This new interface to the Field metadata, combined with removing support for arbitrary extensions to the Field attributes, breaks patito's interface for specifying column dtypes.

pt.Field(dtype=pl.UInt64)

now raises

warnings.PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'dtype').

Attempting to modernize to

pt.Field(json_schema_extra={"dtype":pl.UInt64})

raises

pydantic_core._pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'polars.datatypes.classes.DataTypeClass'>

when the user calls BaseModel.model_json_schema(), which is the v2 accessor for the field properties.

One possibility would be to specify the dtypes with string equivalents, but polars does not seem to support this style. str(pl.UInt64) -> "Int64" but they do not seem to have implemented an inverse map.

Another possibility would be to subclass pydantic.Field, add dtype as a supported attribute, and add handlers for converting back and forth between polars DataTypes and string representations of those. This would have the added advantage of allowing patito to specify other attributes (unique seems like another one that might be good to add as a typed field), instead of hacking the functionality via the json_schema_extra dict.


Using the example in test_dataframe_model_dtype_casting (test_polars):

class DTypeModel(pt.Model):
    implicit_int_1: int
    implicit_int_2: int
    explicit_uint: int = pt.Field(json_schema_extra={"dtype":pl.UInt64})
    implicit_date: date
    implicit_datetime: datetime

Calling DTypeModel.model_fields['explicit_uint'] returns FieldInfo(annotation=int, required=True, json_schema_extra={'dtype': UInt64}) no problem. So perhaps in the short term we could avoid calls to model_json_schema and use the model_fields attribute instead?

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Oct 24, 2023

@brendancooley with regards to the datatype serialisation not working is something which needs to be done upstream. I think it's just a matter of using Serde in Polars.

API wise it would be less ideal to have to pass all these custom fields to json_schema_extra. The Patito API should handle that.

@brendancooley
Copy link
Contributor

@brendancooley with regards to the datatype serialisation not working is something which needs to be done upstream. I think it's just a matter of using Serde in Polars.

API wise it would be less ideal to have to pass all these custom fields to json_schema_extra. The Patito API should handle that.

Very much agree that patito custom fields ought to be exposed directly. Is there an open issue for polars dtype serialization?

@ion-elgreco
Copy link
Contributor

@brendancooley I don't think so, can you create one in Polars?

A while ago I created one for pl.struct and it was added within 2 days.

If @thomasaarholt @JakobGM could share what perhaps is still outstanding then I think the community can contribute to those things to speed up the migration :)

@brendancooley
Copy link
Contributor

brendancooley commented Oct 25, 2023

@brendancooley I don't think so, can you create one in Polars?

A while ago I created one for pl.struct and it was added within 2 days.

If @thomasaarholt @JakobGM could share what perhaps is still outstanding then I think the community can contribute to those things to speed up the migration :)

Looks like we would also need them to support the serialization of Expressions, to make the derived_from field work properly. The test derive model currently fails when you call model_json_schema on it

class DerivedModel(pt.Model):
    underived: int
    const_derived: int = pt.Field(json_schema_extra={"derived_from": pl.lit(3)})
    column_derived: int = pt.Field(json_schema_extra={"derived_from": "underived"})
    expr_derived: int = pt.Field(json_schema_extra={"derived_from": 2 * pl.col("underived")})
    second_order_derived: int = pt.Field(json_schema_extra={"derived_from": 2 * pl.col("expr_derived")})

with

pydantic_core._pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'polars.expr.expr.Expr'>

@brendancooley
Copy link
Contributor

Working on collecting a list of sub-issues for this on #28

@brendancooley
Copy link
Contributor

Took a hack at finishing this up on #32. Feedback from existing users would be very helpful. In particular if folks could add tests that isolate issues introduced by upgrading onto this prototype would be very helpful.

@thomasaarholt
Copy link
Collaborator Author

Superseded by either #32 or #33, or at least a combination 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants