feat: use model field alias generator appropriately for CSV column names #54

Merged: 5 commits from dtype-alias-fix into JakobGM:main on Jul 12, 2024

Conversation

@lmmx (Contributor) commented Apr 1, 2024

If the header is to be read by the Polars CSV reader, and if the Pydantic model has an alias generator, then the expected column names should be obtained by applying the aliasing function to the model fields.
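
As a rough illustration of what "applying the aliasing function to the model fields" means, here is a sketch in plain Pydantic (not the patito implementation itself; the model and its fields are invented for the example):

import pydantic
from pydantic.alias_generators import to_camel

class Station(pydantic.BaseModel):
    # Hypothetical model: field names are snake_case, CSV columns are camelCase.
    model_config = pydantic.ConfigDict(alias_generator=to_camel)

    station_name: str
    platform_code: str

# The CSV header should be matched against the aliased names, not the raw field names.
expected_columns = [field.alias or name for name, field in Station.model_fields.items()]
print(expected_columns)  # ['stationName', 'platformCode']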

This solves the issue reported in:

@thomasaarholt (Collaborator) commented:

Hi @lmmx! Thanks for both the comprehensive issue and solution!

I must admit to not being entirely sure why Jakob introduced the read_csv functionality (I've taken co-ownership of this project). If I understand correctly from the docstring, it's used for cases where the header doesn't exist.

Do you think this is a common use case? Or is it a bit niche that someone wants validation on a CSV file that doesn't have a header? I'm just wondering if this might be best solved by removing the read_csv functionality entirely, and just relying on Polars directly for reading in CSVs.
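
For what it's worth, reading a header-less CSV with plain Polars is already straightforward; a minimal sketch (the data and column names are invented for illustration):

>>> import polars as pl
>>> from io import StringIO
>>> pl.read_csv(StringIO("1,2\n10,20"), has_header=False, new_columns=["col_a", "col_b"])
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 2     │
│ 10    ┆ 20    │
└───────┴───────┘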

I actually took a look at your issue a few days ago, and decided that it would be nice to have the AliasGenerator support built into the models directly, not only for the read_csv case. I accidentally added this to the ruff PR 😓 and didn't notice before merging. I've just added some tests that I meant to go together with it in #61.

Could you take a look using the latest main branch and see if this solves parts of your problem?

@lmmx (Contributor, Author) commented Apr 4, 2024

Hey @thomasaarholt, nice work!

No, pt.read_csv exists because pl.read_csv infers schema dtypes, whereas with Patito we always want to proactively define the dtypes in the data model. The docstring mentioning reading CSV files without headers is in the examples section, i.e. it's not the primary purpose, just a good demonstration of what it can do.

It makes sense to have a CSV reader method but not an Arrow/Parquet one, because Parquet files embed the schema types whereas CSV does not, so the dtypes must either be provided (by passing the model dtypes) or inferred (which is not perfect).
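
Roughly, the idea is that the model's annotations already tell you the dtypes up front; a sketch of that idea (not patito's actual code; the model and the annotation-to-dtype mapping are invented for illustration):

import polars as pl
import pydantic
from io import StringIO

class Record(pydantic.BaseModel):
    # Hypothetical model: code must stay str even if the sampled rows look numeric.
    id: int
    code: str

# Hypothetical mapping from Python annotations to Polars dtypes.
PY_TO_PL = {int: pl.Int64, float: pl.Float64, str: pl.Utf8}
dtypes = {name: PY_TO_PL[field.annotation] for name, field in Record.model_fields.items()}

csv = StringIO("id,code\n1,2\n10,B")
# Explicit dtypes, so nothing is left to inference (newer Polars calls this schema_overrides).
df = pl.read_csv(csv, dtypes=dtypes)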

I was motivated to pick up this issue when I encountered a misinferred schema (a dataset of train stations' platform codes, which mixes numeric and alphanumeric values, was misinferred as int when it should have been str).

I can reproduce that error with a toy example here, lowering the number of rows used to infer the schema to recreate the scenario of a "surprise" value that clashes with the inferred schema. (This was run after reinstalling from the latest main branch.)

>>> import polars as pl
>>> from io import StringIO
>>> inp1 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp1)
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ ---   ┆ ---   │
│ i64   ┆ str   │
╞═══════╪═══════╡
│ 1     ┆ 2     │
│ 10    ┆ B     │
└───────┴───────┘

But when the inferred schema length is too short, you get the same error I initially hit:

Error message:
>>> inp2 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp2, infer_schema_length=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/io/csv/functions.py", line 397, in read_csv
    df = pl.DataFrame._read_csv(
  File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
    self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `B` as dtype `i64` at column 'col_b' (column number 2)

The current offset in the file is 19 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `B` to the `null_values` list.

Original error: ```remaining bytes non-empty```
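
Following the second suggestion from that error message, passing the dtype explicitly (which is effectively what reading through the model gives you for free) makes the "surprise" value a non-issue:

>>> inp3 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp3, infer_schema_length=1, dtypes={"col_b": pl.Utf8})
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ ---   ┆ ---   │
│ i64   ┆ str   │
╞═══════╪═══════╡
│ 1     ┆ 2     │
│ 10    ┆ B     │
└───────┴───────┘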

I'll take a look at the other aspect later, feeling a bit under the weather today! 🤒 I see it was introduced in this commit:

Edit: If I'm reading it correctly, the code you introduced in the _transform_df function will run when pt.DataFrame.validate() is called. This is later than the step at which read_csv runs (where the model is used for its dtypes and the alias generator is used for the field-to-column-name transformations). I'll rebase and see how this works alongside the read_csv code; it looks fine at first glance.

@thomasaarholt (Collaborator) commented:

Thanks for doing this! I rebased this on latest main, and it's looking good!

@thomasaarholt merged commit c4fa607 into JakobGM:main on Jul 12, 2024
2 checks passed
@lmmx deleted the dtype-alias-fix branch on September 9, 2024.