Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace zarr with parquet #432

Merged
merged 9 commits into from
Nov 29, 2023
Merged

Replace zarr with parquet #432

merged 9 commits into from
Nov 29, 2023

Conversation

esheehan-gsl
Copy link
Contributor

Replace the code that reads diag data from Zarr files with a view that reads the data from the Parquet files.

Update the test fixtures in the routes integration tests to write out a
Parquet file instead of a Zarr so that we can switch over to using
Parquet for all of our diag data.
If we're not using Zarr, we won't encounter a missing group. So this
message will just be a file not found error.
I don't think these are relevant anymore. We don't need to distinguish
between a missing file and missing group, now that we're not using Zarr.
And I don't think the unreadable file is handled the same way.
Eliminate all of the code that read from Zarr files for diagnostic data
and switch over to reading the data from the Parquet files we already
keep for the time series data.
Using groupby and looping over each group was horrible for performance,
so I've worked out a way to apply the filters to a single DataFrame
regardless of whether the variable is a vector or scalar.
Moving the is_used filter to the Parquet loading is significantly
improving performance for me locally when loading the wind data.
Ideally, I think it would be good to let PyArrow do all of the
filtering, so that may be worth looking into.

I had to update the test data in a number of cases to ensure that the
is_used flag is a bool, not an int.
For some reason my test data in development has its index ordered
(component, nobs) instead of (nobs, component). Not sure if this is
indicative of a bug somewhere in our ETL code, or if I messed something
up generating this data. Regardless, being specific like this will work
no matter what, so I think we should be ok with this for now.
@esheehan-gsl esheehan-gsl marked this pull request as ready for review November 17, 2023 21:14
@esheehan-gsl
Copy link
Contributor Author

It looks like I’ve been able to work out all of the performance problems by eliminating all of the loops in our filtering logic. I think we can proceed with eliminating Zarr for the diag data.

src/unified_graphics/diag.py Show resolved Hide resolved
src/unified_graphics/routes.py Outdated Show resolved Hide resolved
Ian pointed out that this is a bit hard to follow, so I've added
comments and docstrings to help describe what this function is for.
Copy link

Code Coverage

Package Line Rate Branch Rate Health
unified_graphics 79% 68%
unified_graphics.etl 97% 96%
utils.s3 68% 69%
Summary 84% (337 / 399) 82% (85 / 104)

Minimum allowed line rate is 60%

@esheehan-gsl esheehan-gsl merged commit f9396d8 into main Nov 29, 2023
9 checks passed
@esheehan-gsl esheehan-gsl deleted the replace-zarr-with-parquet branch November 29, 2023 16:33
esheehan-gsl added a commit that referenced this pull request Nov 30, 2023
This reverts commit f9396d8, reversing
changes made to 5bb0508.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read all diag data from Parquet files
2 participants