Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'extra_kwargs' vs writing out each argument #506

Open
JoJo10Smith opened this issue Oct 30, 2023 · 1 comment
Open

'extra_kwargs' vs writing out each argument #506

JoJo10Smith opened this issue Oct 30, 2023 · 1 comment
Labels
triage label for issues that need to be triaged.

Comments

@JoJo10Smith
Copy link
Contributor

JoJo10Smith commented Oct 30, 2023

Leaving out engine specific parameters for pandas materializers

Current behavior

The only arguments that can be passed into the read_FORMAT and to_FORMAT for each of the materilzers are the ones specified on the respective page, but some pandas functions allow the user to add additional arguments that are passed into the engine directly .

For example the Pandas to_parquet (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet) has a few arguments listed on the page (path, engine, compression etc) but has the option to pass in additional arguments to the engine as **kwargs.

If the user set engine = pyarrow as their engine they would be able to use the following arguments in addition to the ones listed on the pandas page: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html

As the code is written, Hamilton throws an error when an otherwise valid argument can be passed, unless they are passed as a dict as a part of the extra_kwargs argument which is unituituve.

Stack Traces

PandasParquetWriter.__init__() got an unexpected keyword argument 'read_dictionary'

Screenshots

image

Steps to replicate behavior

Pass any parameter not specified in the initializtion of the class but is a valid parameter to any Read and Write class

Library & System Information

Hamilton: 1.35.1
Python 3.10.2
System: Ubuntu 22.04

Expected behavior

There shouldn't be a failure for any parameter that is valid, but should they have to pass it in as part of the extra_kwargs or into the class like other arguments

Additional context

I'm working on the geopandas materilzers and noticed that there are many instances where I would need the user to pass in their own kwargs that I had not initially listed. I just need to know which implementation should be used.

Potential solutions

  1. Have the user specify the kwargs themselves in the 'extra_kwargs' Feat/pandas reader parquet #429
  2. Write out each of the possible kwargs Pandas Feather Reader and Writer Class #385
@JoJo10Smith JoJo10Smith added the triage label for issues that need to be triaged. label Oct 30, 2023
@skrawcz
Copy link
Collaborator

skrawcz commented Oct 30, 2023

@JoJo10Smith thanks for the report.

Yeah this is a bit of a hard one. We decided to use extra_kwargs since it would otherwise require listing all possible kwargs in the class since we're using dataclasses (AFAIK). That seems hard to scale/ensure it's right?

So extra_kwargs I think is the happy medium, but it does mean the API doesn't 1-1 map. That said, I think extra_kwargs helps someone else who comes to read the code, as it makes it clear these kwargs are extra and likely specific to some use case... versus you having to check the function signature to understand what is used, and what is being passed on under the hood...

I'll noodle on this for the week, but I'm leaning towards extra_kwargs. Thoughts?

@JoJo10Smith JoJo10Smith changed the title Bug Report 'extra_kwargs' vs writing out each argument Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage label for issues that need to be triaged.
Projects
None yet
Development

No branches or pull requests

2 participants