parquet pipeline #649
Conversation
Generally looks good and seems to work as designed for me. Wondering if the default `pipeline_file_name` setting should change in `config.py` to remove the `.h5` suffix. (Had to do some digging here for the actual setting to get this feature turned on!) Note that I think changing the default setting will also exercise this parquet functionality in the pipeline CI testing, which I think is desirable. Based on the getting_started.ipynb notebooks, I think the `.h5` version should be specified explicitly in the settings for the prototype_mtc example.
> If the pipeline_file_name setting ends in ".h5", then the pandas
> HDFStore file format is used, otherwise pipeline files are stored
> as parquet files organized in regular file system directories.
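The extension-based dispatch quoted above can be sketched roughly as follows (a minimal illustration only, not ActivitySim's actual implementation; the helper name `uses_hdf5` is an assumption):

```python
from pathlib import Path

def uses_hdf5(pipeline_file_name: str) -> bool:
    # Hypothetical helper: the pandas HDFStore backend is selected only
    # when the configured pipeline file name ends in ".h5"; any other
    # name selects the parquet directory-tree storage format.
    return Path(pipeline_file_name).suffix == ".h5"

print(uses_hdf5("pipeline.h5"))       # True  -> pandas HDFStore
print(uses_hdf5("output/pipeline"))   # False -> parquet directories
```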
I assume documentation on this setting will be addressed in the other Pydantic task?
> although when using the parquet storage format this file is stored as "None.parquet"
> to maintain a simple consistent file directory structure.
>
> If the
Looks like unfinished thought here...
```python
    store.joinpath(table_name).mkdir(parents=True, exist_ok=True)
    df.to_parquet(store.joinpath(table_name, f"{checkpoint_name}.parquet"))
else:
    complib = config.setting("pipeline_complib", None)
```
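For reference, a self-contained sketch of where checkpointed tables land on disk under this parquet layout (the helper name and paths are illustrative assumptions, mirroring the `joinpath` calls in the diff above):

```python
from pathlib import Path

def checkpoint_path(store: Path, table_name: str, checkpoint_name: str) -> Path:
    # One directory per table under the pipeline store, one parquet
    # file per checkpoint inside that directory.
    table_dir = store / table_name
    table_dir.mkdir(parents=True, exist_ok=True)
    return table_dir / f"{checkpoint_name}.parquet"

p = checkpoint_path(Path("output/pipeline"), "households", "initialize_landuse")
print(p.as_posix())  # output/pipeline/households/initialize_landuse.parquet
```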
Another setting to not get lost in the Pydantic task.
This PR was superseded by #654
Initial implementation for #645