Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for pydantic 2.0, polars 0.20.10 and remove duckdb support #32

Merged
merged 29 commits into from
Feb 27, 2024

Conversation

brendancooley
Copy link
Contributor

@brendancooley brendancooley commented Oct 26, 2023

builds upon #4, all existing tests (excepting tests/test_duckdb/*) passing after upgrade to pydantic==2.4.2, polars==0.19.11

Closes #11
Closes #28
Closes #26
...possibly closes others

Summary

  • (breaking change) recover legacy ValidationError from pydantic v1 to allow for multiple error reporting, collect data frame validation errors in this object, rename as DataFrameValidationError to distinguish from errors raised by pydantic directly.
    • users will need to update exception catching on data frame validation
  • subclass pydantic.fields.FieldInfo to accommodate patito-specific field attributes (constraints, derived_from, dtype, unique)
  • update patito.Model._schema_properties to append patito-specific field attributes to the field property dictionaries to pydantic.BaseModel.model_json_schema
    • avoids serialization errors raised when passing patito-specific fields via json_schema_extra
  • (breaking change?) separation of concepts of required and nullable. nullable now refers to columns that allow pl.Null as a valid entry. required (pydantic-side) refers to columns that do not have a default value, and therefore must be passed to the BaseModel constructor.
  • fix polars.collect bug (see fix: update LDF.collect() for polars==0.19.8 #24)

TODO

  • refactor directory structure for patito.pydantic
patito
+-- src
|  +-- pydantic
|  |  +-- main.py (`BaseModel`)
|  |  +-- fields.py (`FieldInfo`, `Field`)
  • updates where necessary to documentation and readme

And I'm sure there are other issues introduced by these changes, and bugs that the existing test suite does not yet catch. But hopefully this serves as a template for discussion and new test collection and can help move the ball forward on getting this package onto pydantic2. Feedback very welcome.

@ion-elgreco
Copy link
Contributor

I'll build this one locally and see if it will fix the other bugs Patito 0.50 still has

@ion-elgreco
Copy link
Contributor

They were also working on v2 here: https://github.com/JakobGM/patito/tree/jakobgm%2Fpatito-v2

@brendancooley
Copy link
Contributor Author

I'll build this one locally and see if it will fix the other bugs Patito 0.50 still has

Anxious to hear how it goes

chore: cleanup init

chore: cleanup init

chore: more misc cleanup

cleanup json_schema_extras
@ion-elgreco
Copy link
Contributor

ion-elgreco commented Oct 30, 2023

@brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes.

One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key.

I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)

@brendancooley
Copy link
Contributor Author

@brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes.

One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key.

I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)

happy to include the test. did you mean to reference a different issue/pr in your second comment? not clear on how #20 helps with example string generation but perhaps I am missing something.

@ion-elgreco
Copy link
Contributor

@brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes.

One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key.

I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)

happy to include the test. did you mean to reference a different issue/pr in your second comment? not clear on how #20 helps with example string generation but perhaps I am missing something.

#20 is a fix regarding, creating proper examples with numerical values. I meant just taking that fix directly into pydantic v2 won't work. I saw that grabbing the minimum field is now nested if you have a field that is optional

@ion-elgreco
Copy link
Contributor

@brendancooley tried it on our repo and tests, I am also getting this now static type hint error:

Expression of type "FieldInfo" cannot be assigned to declared type "int"
  "FieldInfo" is incompatible with "int"

on this code:

class Test(pt.Model):
    test: int= pt.Field(constraints=pl.struct("a", "b").is_unique())

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Nov 2, 2023

Also nullable columns are not working, mre:

class Test(pt.Model):
    foo: str|None = pt.Field(dtype=pl.Utf8)
    
print(Test.nullable_columns)
set()

Result should be: {'foo'}

@brendancooley
Copy link
Contributor Author

Thanks @ion-elgreco, I'll add these tests to the PR and start hacking at them tomorrow. Have not looked much into structs but it would be nice if we had more tests for those.

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Nov 2, 2023

Thanks @ion-elgreco, I'll add these tests to the PR and start hacking at them tomorrow. Have not looked much into structs but it would be nice if we had more tests for those.

This example where I am using a struct is just a simple trick to do multiple column uniqueness as a constraint.

One thing that is missing though in Patito but probably out of scope for the migration is proper struct support as a dtype

@brendancooley
Copy link
Contributor Author

Also nullable columns are not working, mre:

class Test(pt.Model):
    foo: str|None = pt.Field(dtype=pl.Utf8)
    
print(Test.nullable_columns)
set()

Result should be: {'foo'}

@ion-elgreco should be addressed by e2bf0d7

@brendancooley
Copy link
Contributor Author

017c59b also adds some sanity checking that the annotated types match the allowable polars types specified in Field.dtype

@brendancooley
Copy link
Contributor Author

@brendancooley tried it on our repo and tests, I am also getting this now static type hint error:

Expression of type "FieldInfo" cannot be assigned to declared type "int"
  "FieldInfo" is incompatible with "int"

on this code:

class Test(pt.Model):
    test: int= pt.Field(constraints=pl.struct("a", "b").is_unique())

yes I can replicate these in vscode if I enable

{
  "python.analysis.typeCheckingMode": "basic"
}

in settings.json. Looking into it.

@brendancooley
Copy link
Contributor Author

@brendancooley tried it on our repo and tests, I am also getting this now static type hint error:

Expression of type "FieldInfo" cannot be assigned to declared type "int"
  "FieldInfo" is incompatible with "int"

on this code:

class Test(pt.Model):
    test: int= pt.Field(constraints=pl.struct("a", "b").is_unique())

This should be handled by 161300b

@brendancooley
Copy link
Contributor Author

brendancooley commented Nov 3, 2023

@brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes.
One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key.
I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)

happy to include the test. did you mean to reference a different issue/pr in your second comment? not clear on how #20 helps with example string generation but perhaps I am missing something.

#20 is a fix regarding, creating proper examples with numerical values. I meant just taking that fix directly into pydantic v2 won't work. I saw that grabbing the minimum field is now nested if you have a field that is optional

Yes this makes sense. Let me know if 9e132bc fixes the issue @ion-elgreco.

@ion-elgreco
Copy link
Contributor

@brendancooley 🚀🚀, I'm going to try it out now

- support validation for column subsets
- tests for nested models (as structs)
- fill_null adds missing columns with default values
- test recursive derivation
- test derive column subset
- allow conversion pt.DataFrame -> pl.DataFrame
- support pydantic validation_alias

chore: fixes for python 3.9 (all tests passing)

chore(patito): cleanup
@ion-elgreco
Copy link
Contributor

ion-elgreco commented Feb 8, 2024

@brendancooley datetime doesn't work if the type doesn't contain a TZ with .examples():

This will fix it:

if dtype.time_zone is not None:
    tzinfo = ZoneInfo(dtype.time_zone)
else:
    tzinfo = None
return datetime(
    year=1970, month=1, day=1, tzinfo=tzinfo
)

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Feb 8, 2024

Adding dtype in pt.Field(dtype=) breaks validation for datetimes:

class Test(pt.Model):
    date_value: datetime = pt.Field(dtype=pl.Datetime)

Test.validate(pl.DataFrame({"date_value": [datetime(2020,1,1)]}))

DataFrameValidationError: 1 validation error for Test
date_value
  Polars dtype Datetime(time_unit='us', time_zone=None) does not match model field type. (type=type_error.columndtype)

While this works:

class Test(pt.Model):
    date_value: datetime

Test.validate(pl.DataFrame({"date_value": [datetime(2020,1,1)]}))

@ion-elgreco
Copy link
Contributor

ion-elgreco commented Feb 8, 2024

GE constraint not respected while creating examples:

class Test(pt.Model):
    value: int = pt.Field(ge=0)
    
print(Test.examples())
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ -1    │
└───────┘

This should return 0 or larger :) 

Good to mention this only happens when ge=0, otherwise it works fine.

LOL, python evalautes 0.0 or None as None, while None or 0.0 is 0.0

@brendancooley this will fix it:

minimum = properties.get('minimum')
exclusive_minimum = properties.get("exclusiveMinimum")
maximum = properties.get('maximum')
exclusive_maximum = properties.get("exclusiveMaximum")            

lower = minimum if minimum is not None else exclusive_minimum
upper = maximum if maximum is not None else exclusive_maximum

@ion-elgreco
Copy link
Contributor

So, besides these small bugs everything works as expected!

@brendancooley
Copy link
Contributor Author

awesome @ion-elgreco, should be good to go on all of these now

@ion-elgreco
Copy link
Contributor

@brendancooley awesome! Then I think it's in a state to release now☺️

Copy link
Contributor

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple small comments, really nice work @brendancooley!

Hope we can get this merged soon @thomasaarholt :)

src/patito/_pydantic/dtypes/utils.py Outdated Show resolved Hide resolved
src/patito/validators.py Show resolved Hide resolved
src/patito/validators.py Show resolved Hide resolved
src/patito/polars.py Show resolved Hide resolved
src/patito/polars.py Show resolved Hide resolved
@brendancooley
Copy link
Contributor Author

Left a couple small comments, really nice work @brendancooley!

Hope we can get this merged soon @thomasaarholt :)

Useful for knowing where to flesh out the docs -- thanks!

Copy link
Collaborator

@thomasaarholt thomasaarholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the tests pass, I am keen to get this merged now, and then we can work on smaller parts separately.

Very well done to everyone involved 😊

@thomasaarholt thomasaarholt merged commit 5446910 into JakobGM:main Feb 27, 2024
@ion-elgreco
Copy link
Contributor

ion-elgreco commented Feb 27, 2024

Awesome! @thomasaarholt can't wait for the release :D

@thomasaarholt
Copy link
Collaborator

Doing it right now

@thomasaarholt thomasaarholt changed the title upgrade: operational migration to pydantic v2 Add support for pydantic 2.0, polars 0.20.10 and remove duckdb support Feb 27, 2024
@thomasaarholt
Copy link
Collaborator

Released now on github, should be on pypi shortly!

@thomasaarholt
Copy link
Collaborator

thomasaarholt commented Feb 27, 2024

I see that the pypi release action failed, because everything has to pass. 😬 I'll see if I can tone down the requirements this evening.

@thomasaarholt
Copy link
Collaborator

We are live now with 0.6.1! https://pypi.org/project/patito/

@ion-elgreco
Copy link
Contributor

Woohoo 🎉🎉

@brendancooley
Copy link
Contributor Author

Awesome, thanks @thomasaarholt. Excited to do more with this, glad that we have a stable foundation to build on top of.

nameloCmaS added a commit to nameloCmaS/patito that referenced this pull request May 9, 2024
These pieces of code the docs refer to were removed as part of JakobGM#32 (commit 054d034)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants