feat: use model field alias generator appropriately for CSV column names #54
Conversation
Hi @lmmx! Thanks for both the comprehensive issue and solution! I must admit to not being entirely sure why Jakob introduced the headerless-CSV handling in `read_csv` in the first place. Do you think this is a common use case? Or is it a bit niche that someone wants validation on a CSV file that doesn't have a header? I'm just wondering if this might be best solved by removing the `read_csv` functionality entirely, and just relying on polars directly for reading in CSVs.

I actually took a look at your issue a few days ago, and decided that it would be nice to have the AliasGenerator support built into the models directly, not only for the `read_csv` method. Could you take a look using the latest main branch and see if this solves parts of your problem?
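As a side note for readers, here is a minimal sketch of what alias-generator support on the model side can look like, assuming pydantic v2's `ConfigDict(alias_generator=...)` and patito's `Model.validate`. The `to_kebab` generator and the `Signal` model are invented for illustration, and whether validation accepts the aliased column names is exactly what the main-branch change is meant to enable.

```python
import io

import patito as pt
import polars as pl
from pydantic import ConfigDict


def to_kebab(name: str) -> str:
    """Invented alias generator: station_id -> station-id."""
    return name.replace("_", "-")


class Signal(pt.Model):
    # Pydantic v2 merges this with patito's own model_config,
    # so only the alias generator is overridden here.
    model_config = ConfigDict(alias_generator=to_kebab)

    station_id: int
    platform_code: str


# Read the CSV with plain polars; the header uses the aliased (kebab-case) names.
csv = io.StringIO("station-id,platform-code\n1,2\n10,B\n")
df = pl.read_csv(csv)

# With the aliases applied, validation should line the columns up with the fields.
Signal.validate(df)
```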
Hey @thomasaarholt, nice work! No: it makes sense to have a CSV reader method (but not an Arrow/Parquet one), because Parquet files embed the schema types whereas CSV does not, so the types must always either be provided (passing the model dtypes) or inferred (which is not perfect).

I was motivated to pick up this issue when I encountered a misinferred schema: a dataset of train stations' platform codes, with both numeric and alphanumeric values, was misinferred as an integer column. I can reproduce that error with a toy example here, lowering the number of rows used to infer the schema to recreate the scenario of a "surprise" value that clashes with the inferred schema. (This was run with a reinstall from the main branch.)

```python
>>> import polars as pl
>>> from io import StringIO
>>> inp1 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp1)
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═══════╪═══════╡
│ 1 ┆ 2 │
│ 10 ┆ B │
└───────┴───────┘
```

But when the inferred schema length is too short, you get the same error I initially hit:

```python
>>> inp2 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp2, infer_schema_length=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/io/csv/functions.py", line 397, in read_csv
df = pl.DataFrame._read_csv(
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `B` as dtype `i64` at column 'col_b' (column number 2)
The current offset in the file is 19 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `B` to the `null_values` list.
Original error: ```remaining bytes non-empty```
```

I'll take a look at the other aspect later, feeling a bit under the weather today! 🤒

I see it was put in in this commit: …

Edit: If I'm reading it correctly, the code you introduced in the …
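For completeness, here is a sketch of the workaround that motivates passing the model's dtypes through: supplying the dtype explicitly means the short inference window no longer matters. The `inp3` name and the exact output formatting are illustrative; `dtypes` is the same argument the error message above points to.

```python
>>> inp3 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp3, infer_schema_length=1, dtypes={"col_b": pl.Utf8})
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ ---   ┆ ---   │
│ i64   ┆ str   │
╞═══════╪═══════╡
│ 1     ┆ 2     │
│ 10    ┆ B     │
└───────┴───────┘
```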
(Branch force-pushed from 604b35e to c2c42e9.)
Thanks for doing this! I rebased this on latest main, and it's looking good!
If the header is to be read by the Polars CSV reader, and the Pydantic model has an alias generator, then the expected column names should be obtained by applying the aliasing function to the model's field names.

This solves the issue reported in #55: `read_csv` does not use model dtypes when an alias generator is used.
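As a rough illustration of that rule (a sketch, not the PR's actual implementation), the expected CSV header can be derived by running each field name through the configured alias generator. The `Trip` model below is invented for the example, and only pydantic v2's public `model_fields` API is assumed.

```python
from pydantic import BaseModel, ConfigDict


class Trip(BaseModel):
    """Invented example; in patito this would subclass pt.Model instead."""

    # Any Callable[[str], str] works as an alias generator; str.upper keeps it simple.
    model_config = ConfigDict(alias_generator=str.upper)

    departure_time: str
    seat_number: int


# The generated alias (or an explicit one, if set) gives the expected CSV column name.
expected_columns = [
    field.alias or name for name, field in Trip.model_fields.items()
]
print(expected_columns)  # ['DEPARTURE_TIME', 'SEAT_NUMBER']
```

Under that mapping, the model's dtypes can be matched to the aliased column names rather than to the raw field names, which is what issue #55 asks for.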