Skip to content

Conversation

@tmi
Copy link
Collaborator

@tmi tmi commented Jun 13, 2022

No description provided.

@vtuma vtuma requested a review from sondracek June 13, 2022 13:11
@sondracek
Copy link
Member

What about having some common params across all methods instead of just wrapping via kwargs? E.g. columns selection make sense for all listed reader methods imo. This is supported in parquet, has different name for csv (usecols) and it's completely missing for json. But this would probably led to a more complicated solution with a proper filter on data itself in an ideal case (not just simple filter/selection for columns, but also e.g. filtering based on value of a column and so on).

Behaviour can be customised via passing any kwargs to the constructor.
"""

def __init__(self, input_format=InputFormat.AUTO, **pdread_kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we will extend the signature in the future? I would rather not, but it's one of the reason why I do not like this kind of kwargs propagation that might be a bit dangerous sometimes... (although we use it quite often 😃 )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree and I in general not use kwargs myself. But as the evil down there is already in place, what can I do 🤷

Idea I toyed with and which may end up to the next version is to require the user to provide directly the read_single function, thus leaving the user fully in control of the kwargs passing. I did not use it now as it would require user do what feels like boilerplate. So once we'll get to a situation where the signature needs to be extended too much, we may escape via this way, possibly keeping some of the previous functionality via some utils or factories or builders

I'm not really worried by it because the core thing that must stay is the DataReader interface which is not tainted by this at all

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about a case when the signature extension might introduce some breaking changes, especially because of used kwargs. But let's not complicate it atm.

@tmi
Copy link
Collaborator Author

tmi commented Jun 13, 2022

What about having some common params across all methods instead of just wrapping via kwargs? E.g. columns selection make sense for all listed reader methods imo. This is supported in parquet, has different name for csv (usecols) and it's completely missing for json. But this would probably led to a more complicated solution with a proper filter on data itself in an ideal case (not just simple filter/selection for columns, but also e.g. filtering based on value of a column and so on).

that would, in my eyes, be too much -- the fsql should be as independent and uncoupled from pandas as possible. What you describe would couple the interface of pd.read to the interface of PandasReader, which imo would be unhealthy. I can imagine a different project (or possibly some .experimental module here) with the goal of making a better interface for pandas reading itself, but I'm not going that way myself :)

@sondracek sondracek merged commit f5077ec into AmpX-AI:main Jun 14, 2022
@tmi tmi deleted the supportPdReaderKwargs branch July 12, 2022 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants