Feat/pandas reader parquet #429

Merged: 7 commits, Oct 4, 2023

@flaviassantos flaviassantos commented Oct 3, 2023

This pull request adds parquet read and write functionality for issue #406.

Changes

The changes, file by file:

  • pandas_extensions.py: added the classes that read and write Parquet files.
  • test_pandas_extensions.py: added a single test case that exercises writing and then reading.
  • notebook.ipynb: added an example of Parquet materialization.
  • my_script.py: added an example of Parquet materialization.
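For readers unfamiliar with the pattern, here is a minimal sketch of what a Parquet writer/reader pair built on pandas could look like. The class names `ParquetWriterSketch` and `ParquetReaderSketch`, their fields, and the `to_kwargs` helper are hypothetical and chosen for illustration; they are not the classes added to pandas_extensions.py. Only `DataFrame.to_parquet` and `pandas.read_parquet` are real pandas calls, and both need pyarrow or fastparquet installed at run time.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ParquetWriterSketch:
    """Hypothetical writer; NOT the class added in this PR."""

    path: str
    engine: str = "auto"
    compression: Optional[str] = "snappy"
    index: Optional[bool] = None

    def to_kwargs(self) -> Dict[str, Any]:
        # Always pass engine and compression; pass index only if it was set.
        kwargs: Dict[str, Any] = {
            "engine": self.engine,
            "compression": self.compression,
        }
        if self.index is not None:
            kwargs["index"] = self.index
        return kwargs

    def save(self, df) -> None:
        # df is expected to be a pandas DataFrame.
        df.to_parquet(self.path, **self.to_kwargs())


@dataclass
class ParquetReaderSketch:
    """Hypothetical reader; NOT the class added in this PR."""

    path: str
    engine: str = "auto"
    columns: Optional[List[str]] = None

    def to_kwargs(self) -> Dict[str, Any]:
        # Pass columns only when a subset was requested.
        kwargs: Dict[str, Any] = {"engine": self.engine}
        if self.columns is not None:
            kwargs["columns"] = self.columns
        return kwargs

    def load(self):
        # Imported lazily so the sketch can be imported without pandas.
        import pandas as pd

        return pd.read_parquet(self.path, **self.to_kwargs())
```

Keeping the keyword assembly in `to_kwargs` separate from the I/O call makes the option handling testable without touching the filesystem.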

How I tested this

By running successfully:

  • unit test for the PandasParquetWriter and PandasParquetReader classes.
  • the Jupyter notebook.
  • the my_script.py script.
  • the CircleCI job locally.

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@flaviassantos
Contributor Author

@skrawcz, last point in the checklist. Can you point me to the "Project documentation" mentioned?

@skrawcz
Collaborator

skrawcz commented Oct 3, 2023

> @skrawcz, last point in the checklist. Can you point me to the "Project documentation" mentioned?

Yep, that should be updated automatically for you. It should show up here (based on the build of this branch): https://hamilton--429.org.readthedocs.build/en/429/reference/io/available-data-adapters/

…riter to handle kwargs not listed in Pandas' docs
Collaborator

@skrawcz skrawcz left a comment

looks good, just the minor comment on the test!

@skrawcz
Collaborator

skrawcz commented Oct 3, 2023

    try:
        result = self.api.parquet.read_table(
            path_or_handle, columns=columns, **kwargs
        ).to_pandas(**to_pandas_kwargs)
    E   TypeError: read_table() got an unexpected keyword argument 'dtype_backend'

    ../venvs/hamilton-venv/lib/python3.7/site-packages/pandas/io/parquet.py:240: TypeError

The test fails on Python 3.7 with the error above, so you will need to gate this parameter for 3.7. You'll see that the other readers do this:

        if sys.version_info >= (3, 8) and self.dtype_backend is not None:
            kwargs["dtype_backend"] = self.dtype_backend
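The gate matters because, as far as I can tell, the newest pandas that installs on Python 3.7 is from the 1.x series, which predates the `dtype_backend` parameter (added in pandas 2.0). On those versions the unknown keyword falls through `**kwargs` to pyarrow's `read_table`, which raises the `TypeError` in the traceback. A standalone sketch of the same check, where the function name `build_read_kwargs` and the injectable `version_info` argument are assumptions made so the logic can be tested outside the reader class:

```python
import sys
from typing import Any, Dict, Optional, Tuple


def build_read_kwargs(
    dtype_backend: Optional[str] = None,
    version_info: Tuple[int, ...] = tuple(sys.version_info[:3]),
) -> Dict[str, Any]:
    """Hypothetical helper: collect kwargs to forward to pandas.read_parquet."""
    kwargs: Dict[str, Any] = {}
    # Forward dtype_backend only on Python 3.8+ and only if the caller set it.
    if version_info >= (3, 8) and dtype_backend is not None:
        kwargs["dtype_backend"] = dtype_backend
    return kwargs
```

Passing `version_info` explicitly lets a test exercise both branches on a single interpreter, for example `build_read_kwargs("pyarrow", version_info=(3, 7, 17))` versus `version_info=(3, 11, 0)`.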

@skrawcz skrawcz merged commit 28c955e into DAGWorks-Inc:main Oct 4, 2023
21 checks passed