Replace zarr with parquet #432

esheehan-gsl · 2023-10-31T15:57:17Z

Replace the code that reads diag data from Zarr files with a view that reads the data from the Parquet files.

Update the test fixtures in the routes integration tests to write out a Parquet file instead of a Zarr so that we can switch over to using Parquet for all of our diag data.

If we're not using Zarr, we won't encounter a missing group. So this message will just be a file not found error.

I don't think these tests are relevant anymore. We don't need to distinguish between a missing file and a missing group now that we're not using Zarr, and I don't think an unreadable file is handled the same way.
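As a rough illustration of the simplified error handling described in the two notes above, here is a minimal sketch. The helper name `diag_status` and the message text are hypothetical and are not taken from this PR; the point is only that a single "file not found" case remains once there is no group to look up inside the file.

```python
from pathlib import Path


def diag_status(path: Path) -> tuple[int, str]:
    """Hypothetical helper: map a diag file lookup to (HTTP status, message)."""
    if not path.is_file():
        # With Parquet there is no separate "group" lookup inside the file,
        # so a missing file is the only not-found case left.
        return 404, f"Diagnostic file not found: {path.name}"
    return 200, "OK"


# A path that does not exist yields the single 404 case.
status, message = diag_status(Path("does-not-exist.parquet"))
```

How the real routes turn this condition into an HTTP response is not shown in this conversation, so treat the return shape here as an assumption.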

Eliminate all of the code that read from Zarr files for diagnostic data and switch over to reading the data from the Parquet files we already keep for the time series data.

Using groupby and looping over each group was horrible for performance, so I've worked out a way to apply the filters to a single DataFrame regardless of whether the variable is a vector or scalar.
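A minimal sketch of the idea, not the code from this PR: build one boolean mask for the whole DataFrame and index with it once, so the same function handles a scalar variable (indexed by `nobs`) and a vector variable (indexed by `nobs` and `component`). The column name `obs_minus_forecast`, the `(low, high)` filter format, and the function name `apply_filters` are assumptions made for illustration.

```python
import pandas as pd


def apply_filters(df: pd.DataFrame, filters: dict) -> pd.DataFrame:
    """Keep rows whose values fall inside every (low, high) range in `filters`."""
    mask = pd.Series(True, index=df.index)
    # The loop is over the handful of filters, not over groups of rows;
    # each comparison is evaluated on the whole column at once.
    for column, (low, high) in filters.items():
        mask &= df[column].between(low, high)
    return df[mask]


# Scalar variable: one row per observation.
scalar = pd.DataFrame(
    {"obs_minus_forecast": [-2.0, 0.5, 3.0]},
    index=pd.Index([0, 1, 2], name="nobs"),
)

# Vector variable: one row per (observation, component).
vector = pd.DataFrame(
    {"obs_minus_forecast": [-2.0, 0.5, 0.1, 3.0]},
    index=pd.MultiIndex.from_product([[0, 1], ["u", "v"]], names=["nobs", "component"]),
)

limits = {"obs_minus_forecast": (-1.0, 1.0)}
scalar_kept = apply_filters(scalar, limits)
vector_kept = apply_filters(vector, limits)
```

Whether a vector observation should be dropped when only one of its components fails a filter is a design decision this sketch does not address; here each row is tested on its own.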

Moving the is_used filter into the Parquet loading step significantly improves performance for me locally when loading the wind data. Ideally, PyArrow would do all of the filtering, so that may be worth looking into. I had to update the test data in a number of cases to ensure that the is_used flag is a bool, not an int.

For some reason my test data in development has its index ordered (component, nobs) instead of (nobs, component). Not sure if this is indicative of a bug somewhere in our ETL code, or if I messed something up generating this data. Regardless, being specific like this will work no matter what, so I think we should be ok with this for now.
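The change this note refers to is not shown in the conversation. As an illustration of why being explicit helps, the sketch below addresses the index level by name instead of by position, so it gives the same result for either level order. The `value` column and the function name `select_component` are hypothetical.

```python
import pandas as pd


def select_component(df: pd.DataFrame, component: str) -> pd.DataFrame:
    """Select one vector component by level name, whatever the level order."""
    return df.xs(component, level="component")


# Index ordered (nobs, component).
nobs_first = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_product([[0, 1], ["u", "v"]], names=["nobs", "component"]),
)

# The same data with the index ordered (component, nobs).
component_first = nobs_first.swaplevel().sort_index()

u_a = select_component(nobs_first, "u")
u_b = select_component(component_first, "u")
```

Both results are indexed by `nobs` alone and hold the same `u` values, which is the property the note relies on.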

esheehan-gsl · 2023-11-17T21:16:00Z

It looks like I’ve been able to work out all of the performance problems by eliminating all of the loops in our filtering logic. I think we can proceed with eliminating Zarr for the diag data.

src/unified_graphics/diag.py

src/unified_graphics/routes.py

Ian pointed out that this is a bit hard to follow, so I've added comments and docstrings to help describe what this function is for.

github-actions · 2023-11-27T21:31:54Z

| Package | Line Rate | Branch Rate | Health |
| --- | --- | --- | --- |
| unified_graphics | 79% | 68% | ➖ |
| unified_graphics.etl | 97% | 96% | ✔ |
| utils.s3 | 68% | 69% | ➖ |
| Summary | 84% (337 / 399) | 82% (85 / 104) | ✔ |

Minimum allowed line rate is 60%

This reverts commit f9396d8, reversing changes made to 5bb0508.

esheehan-gsl linked an issue Oct 31, 2023 that may be closed by this pull request

Read all diag data from Parquet files #431

Closed

esheehan-gsl self-assigned this Oct 31, 2023

esheehan-gsl temporarily deployed to vlab November 1, 2023 16:43 — with GitHub Actions Inactive

esheehan-gsl mentioned this pull request Nov 1, 2023

Read all diag data from Parquet files #431

Closed

esheehan-gsl added 4 commits November 17, 2023 10:28

Convert tests to save Parquet instead of Zarr

9a5681a

Remove "group" from the expected 404 message

92f7088

Remove some of the HTTP error tests

31d4d08

Replace use of Zarr with Parquet

36775e9

esheehan-gsl force-pushed the replace-zarr-with-parquet branch from 9a2603d to 36775e9 Compare November 17, 2023 17:54

esheehan-gsl temporarily deployed to vlab November 17, 2023 17:58 — with GitHub Actions Inactive

Eliminate groupby and loops in filtering

d5c1e3b

esheehan-gsl temporarily deployed to vlab November 17, 2023 20:23 — with GitHub Actions Inactive

esheehan-gsl added 2 commits November 17, 2023 14:09

esheehan-gsl marked this pull request as ready for review November 17, 2023 21:14

esheehan-gsl requested a review from ian-noaa November 17, 2023 21:15

esheehan-gsl temporarily deployed to vlab November 17, 2023 21:18 — with GitHub Actions Inactive

ian-noaa approved these changes Nov 27, 2023

src/unified_graphics/diag.py (resolved)

src/unified_graphics/routes.py (outdated, resolved)

esheehan-gsl added 2 commits November 27, 2023 14:08

Add documentation for parse_filters function

a9bf88e

Formatting

3ca20d4

esheehan-gsl temporarily deployed to vlab November 27, 2023 21:35 — with GitHub Actions Inactive

esheehan-gsl merged commit f9396d8 into main Nov 29, 2023
9 checks passed

esheehan-gsl deleted the replace-zarr-with-parquet branch November 29, 2023 16:33

esheehan-gsl temporarily deployed to vlab November 29, 2023 16:33 — with GitHub Actions Inactive

esheehan-gsl added a commit that referenced this pull request Nov 30, 2023

Revert "Replace zarr with parquet (#432)"

c8abd5c
