Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Parquet data storage #348

Merged
merged 9 commits into from
Jun 15, 2023
Merged

Parquet data storage #348

merged 9 commits into from
Jun 15, 2023

Conversation

esheehan-gsl
Copy link
Contributor

Add storage of diag data as partitioned Parquet files. This should be more suitable for our time series visualizations.

Add some assertions to check that the parquet file is being created
during ETL. This is getting to be a bit much with the Zarr and database
stuff as well. If this works out, we should consider using just the
Parquet files for diag data and removing Zarr and Postgres from the
equation.
This is needed to write to Parquet files.
I think following pytest's recommended pattern for multiple asserts
makes the tests a little easier to follow. Each test is a parameterized
method, grouped by what action is being tested. The "act" step of the
test is encompassed in a fixture on the class. Each method then only has
to assert the state it expects, so it's a little more clear what each
expected outcome is.

See: https://docs.pytest.org/en/7.3.x/how-to/fixtures.html#running-multiple-assert-statements-safely
We need to ensure that the parquet file is being updated correctly, and
that the database has both analyses.
This needs to be present to allow us to concatenate variables into a
single Parquet file.
Instead of combining all of the variables into a single parquet file for
the model, I've decided to separate them. Some of the variables, such as
wind, have columns that are unique to that variable; I think this is
reason enough to store them separately, otherwise all of the other
variables will just NaN values in those columns, which seems silly.

Some preliminary testing suggests this might actually save space.
This seems like a silly type hint from pandas-stubs, because the tuple
works just fine, but I don't think it's causing any harm.
@github-actions
Copy link

Code Coverage

Package Line Rate Branch Rate Health
unified_graphics 90% 87%
unified_graphics.etl 97% 95%
Summary 93% (406 / 438) 90% (94 / 104)

Minimum allowed line rate is 60%

@esheehan-gsl esheehan-gsl marked this pull request as ready for review June 14, 2023 16:55
@esheehan-gsl esheehan-gsl self-assigned this Jun 14, 2023
@esheehan-gsl esheehan-gsl temporarily deployed to vlab June 14, 2023 16:59 — with GitHub Actions Inactive
Copy link
Collaborator

@ian-noaa ian-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - Does the parquet file get saved out locally as well if we're using the file:// prefix?

@pytest.fixture
@pytest.fixture(scope="session")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good call. If this was function-scoped it seems like the factory function wasn't really being used before - it'd need to recreate the data each time it was invoked.

Comment on lines +48 to +61
class TestSaveNew:
@pytest.fixture(scope="class", autouse=True)
def dataset(self, model, diag_dataset, session, zarr_file):
(mdl, system, domain, background, frequency) = model
ps = diag_dataset(
"ps",
"2022-05-05T14:00",
"anl",
mdl,
system,
domain,
frequency,
background,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice way to consolidate setup and handle multiple assert's. Thanks for the doc link!

@esheehan-gsl esheehan-gsl merged commit 52ff32a into main Jun 15, 2023
@esheehan-gsl esheehan-gsl deleted the parquet-data-storage branch June 15, 2023 13:32
@esheehan-gsl esheehan-gsl temporarily deployed to vlab June 15, 2023 13:32 — with GitHub Actions Inactive
@esheehan-gsl
Copy link
Contributor Author

Yeah, the Parquet file should get written locally as well.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants