feat: use model field alias generator appropriately for CSV column names #54
Conversation
Hi @lmmx! Thanks for both the comprehensive issue and solution! I must admit to not being entirely sure why Jakob introduced the headerless-CSV handling in `read_csv` in the first place. Do you think this is a common use case? Or is it a bit niche that someone wants validation on a CSV file that doesn't have a header? I'm just wondering if this might be best solved by removing the `read_csv` functionality entirely, and just relying on polars directly for reading in CSVs.

I actually took a look at your issue a few days ago, and decided that it would be nice to have the AliasGenerator support built into the models directly, not only for the `read_csv` method. Could you take a look using the latest main branch and see if this solves parts of your problem?
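As a side note for readers, here is a minimal sketch of what alias-generator support on the model side can look like, assuming pydantic v2's `ConfigDict(alias_generator=...)` and patito's `Model.validate`. The `to_kebab` generator and the `Signal` model are invented for illustration, and whether validation accepts the aliased column names is exactly what the main-branch change is meant to enable.

```python
import io

import patito as pt
import polars as pl
from pydantic import ConfigDict


def to_kebab(name: str) -> str:
    """Invented alias generator: station_id -> station-id."""
    return name.replace("_", "-")


class Signal(pt.Model):
    # Pydantic v2 merges this with patito's own model_config,
    # so only the alias generator is overridden here.
    model_config = ConfigDict(alias_generator=to_kebab)

    station_id: int
    platform_code: str


# Read the CSV with plain polars; the header uses the aliased (kebab-case) names.
csv = io.StringIO("station-id,platform-code\n1,2\n10,B\n")
df = pl.read_csv(csv)

# With the aliases applied, validation should line the columns up with the fields.
Signal.validate(df)
```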
Hey @thomasaarholt, nice work! No: it makes sense to have a CSV reader method (but not an Arrow/Parquet one), because Parquet files embed the schema types whereas CSV does not, so the types must always either be provided (passing the model dtypes) or inferred (which is not perfect).

I was motivated to pick up this issue when I encountered a misinferred schema: a dataset of train stations' platform codes, with both numeric and alphanumeric values, was misinferred as an integer column. I can reproduce that error with a toy example here, lowering the number of rows used to infer the schema to recreate the scenario of a "surprise" value that clashes with the inferred schema. (This was run with a reinstall from the main branch.)

```python
>>> import polars as pl
>>> from io import StringIO
>>> inp1 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp1)
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═══════╪═══════╡
│ 1 ┆ 2 │
│ 10 ┆ B │
└───────┴───────┘
```

But when the inferred schema length is too short, you get the same error I initially hit:

```python
>>> inp2 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp2, infer_schema_length=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/io/csv/functions.py", line 397, in read_csv
df = pl.DataFrame._read_csv(
File "/home/louis/miniconda3/envs/patito/lib/python3.10/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `B` as dtype `i64` at column 'col_b' (column number 2)
The current offset in the file is 19 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `B` to the `null_values` list.
Original error: ```remaining bytes non-empty```
```

I'll take a look at the other aspect later, feeling a bit under the weather today! 🤒

I see it was put in in this commit: …

Edit: If I'm reading it correctly, the code you introduced in the …
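For completeness, here is a sketch of the workaround that motivates passing the model's dtypes through: supplying the dtype explicitly means the short inference window no longer matters. The `inp3` name and the exact output formatting are illustrative; `dtypes` is the same argument the error message above points to.

```python
>>> inp3 = StringIO("col_a,col_b\n1,2\n10,B")
>>> pl.read_csv(inp3, infer_schema_length=1, dtypes={"col_b": pl.Utf8})
shape: (2, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ ---   ┆ ---   │
│ i64   ┆ str   │
╞═══════╪═══════╡
│ 1     ┆ 2     │
│ 10    ┆ B     │
└───────┴───────┘
```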
(Branch force-pushed from 604b35e to c2c42e9.)
Thanks for doing this! I rebased this on latest main, and it's looking good!
If the header is to be read by the Polars CSV reader, and the Pydantic model has an alias generator, then the expected column names should be obtained by applying the aliasing function to the model's field names.

This solves the issue reported in #55: `read_csv` does not use model dtypes when an alias generator is used.
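As a rough illustration of that rule (a sketch, not the PR's actual implementation), the expected CSV header can be derived by running each field name through the configured alias generator. The `Trip` model below is invented for the example, and only pydantic v2's public `model_fields` API is assumed.

```python
from pydantic import BaseModel, ConfigDict


class Trip(BaseModel):
    """Invented example; in patito this would subclass pt.Model instead."""

    # Any Callable[[str], str] works as an alias generator; str.upper keeps it simple.
    model_config = ConfigDict(alias_generator=str.upper)

    departure_time: str
    seat_number: int


# The generated alias (or an explicit one, if set) gives the expected CSV column name.
expected_columns = [
    field.alias or name for name, field in Trip.model_fields.items()
]
print(expected_columns)  # ['DEPARTURE_TIME', 'SEAT_NUMBER']
```

Under that mapping, the model's dtypes can be matched to the aliased column names rather than to the raw field names, which is what issue #55 asks for.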