
Write L2 to file and leave all additional variable derivation for the L2toL3 step #255

Conversation

BaptisteVandecrux (Member)

The idea of this new version is that:

  1. L2 data files are written into level_2/raw and level_2/tx folders by get_l2 (just like it was previously done for the L3 data). One consequence is that this low-latency Level 2 tx data can be posted very quickly on THREDDS for showcase and fieldwork, and processed into BUFR files.

  2. L2 tx and raw files are merged using join_l2 (just like it was previously done for the L3 data). Resampling to hourly, daily and monthly values is done here, but could be left for a later stage.

  3. get_l3 is now a script that loads the merged L2 file and runs process.L2toL3.toL3 (see the sketch below). This will allow more variables to be derived in toL3 and historical data to be appended once the L3 data is processed.

This reverts commit af09818.
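
A rough sketch of step 3, assuming process.L2toL3.toL3 takes the merged L2 dataset (opened with xarray) as its first argument and returns the L3 dataset with the derived variables; the file path and the call signature are illustrative assumptions, not taken from the PR:

import xarray as xr
from pypromice.process.L2toL3 import toL3  # module path as referenced in the description

# Load the merged Level 2 product written by join_l2 (illustrative path)
l2_merged = xr.open_dataset("level_2/NUK_U_hour.nc")

# Derive the additional Level 3 variables (e.g. turbulent fluxes);
# the call signature is assumed here
l3 = toL3(l2_merged)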
@@ -87,20 +87,20 @@ def join_l3():
exit()

# Get hourly, daily and monthly datasets
print('Resampling L3 data to hourly, daily and monthly resolutions...')
Contributor

I don't see why we should have a specific join L2 script that resamples and generates three different intermediate L2(A) products.
getL3 is also supposed to be a merge function with resampling.

NSE_L3 = resample(
  hard_merge([
    NSE_hist_L2,
    NSE_tx_L2,
    NSE_raw_L2,
  ])
)

BaptisteVandecrux (Member, Author)

I agree. It was first separated to respect the distinction, in the previous version, between getl3 and joinl3.
It also has the advantage of making a clear distinction between the data product levels:
get_l2 produces L2_tx and L2_raw
join_l2 produces L2_merged
get_l3 produces L3
join_l3 produces L3_merged
I find it confusing that a script called getL3 produces the L2_merged product as an intermediate output.

Also, I have two comments about the example you give above:

  1. that's another clear case where we need to do it step-wise:
    NSE_tx_L2 and NSE_raw_L2 can be combined with combine_first because they are from the same physical station. The result is NSE_L2_merged. Only after that can we merge it with another, distinct physical station (historical or v3) using time blocks (see the sketch below). But...
  2. the historical stations come out as Level 3 products (they already have derived variables such as turbulent fluxes and estimated coordinates), so we first need to turn NSE_L2_merged into NSE_L3, and only then can it be merged with NSE_hist_L3.
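
A minimal sketch of that step-wise merging with xarray; the station name and file paths are illustrative, and which of raw and tx takes precedence is an assumption here:

import xarray as xr

# Level 2 datasets from the same physical station (illustrative paths)
nse_raw_l2 = xr.open_dataset("level_2/raw/NSE_raw.nc")
nse_tx_l2 = xr.open_dataset("level_2/tx/NSE_tx.nc")

# Same physical station: raw values take precedence, tx fills the gaps
nse_l2_merged = nse_raw_l2.combine_first(nse_tx_l2)

# The historical station is a distinct physical station and is already L3,
# so it can only be appended (e.g. by concatenating time blocks) after
# NSE_L2_merged has itself been turned into NSE_L3 by the L2toL3 step.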

BaptisteVandecrux (Member, Author)

To your comment:

I don't see why we should have a specific join L2 script that resamples and generates three different intermediate L2(A) products.

The updated reply is:

get_l2 produces L2_tx and L2_raw
join_l2 produces L2_merged
get_l2tol3 produces L3 from L2_merged
join_l3 produces L3_merged

We cannot further merge those functionalities because they take different inputs:

get_l2 takes a config file as input (different from tx and raw)
join_l2 takes two files with same station_id as input
get_l2tol3 takes a single file (or alternatively a single station_id) as input
join_l3 takes either multiple files or a site_id as input

This structure respects the different levels and makes the different functions rather independent of each other.

get_l3 is not used at the moment and is also impractical because it takes as input a single config file (either from tx or raw) and doesn't include the merging of L2 datasets before running L2toL3.

Does that answer your comment?

PennyHow (Member) commented Jun 7, 2024

I just pushed changes to the processing performed in the pypromice.process.aws object, mainly for better compatibility (and also in line with Mads' comments).

  • Merge type for L1A removed (we will tackle that later)
  • Rounding, reformatting and resampling steps are moved to aws.writeArr. Putting these at the end of Level 2 processing runs the risk of affecting Level 3 processing; instead, they will now be triggered when a user wants to write a L2/L3 dataset to file
  • Related to this, aws.writeArr has been altered to accommodate Level 2 and Level 3 dataset exporting. It is then used in the level-specific writing functions writeL2 and writeL3, just to keep it simple

Also, I updated:

  • The pydataverse dependency version is pinned, as the most recent version is not compatible for some reason. We'll solve this later
  • The pypromice version is increased to v1.4.0... I think we will send out a new release once all the PRs are merged. It is definitely worth a minor version release rather than just a patch release!

I will tackle the CLI scripts now, specifically looking at get_l2, get_l3 and join_l2

PennyHow (Member) commented Jun 7, 2024

New commit that tackles get_l2. Instead of having separate scripts for get_l2 and get_l3, I think it is better if we just add functionality to get_l3 that allows you to only process to Level 2 if specified.

I've made changes now that mean you can specify the level of processing (and the outputted file) with the input option --level (or -l); for example:

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2

Where -l 2 triggers aws.getL1() and aws.getL2(), and then writes the Level 2 dataset to file

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l3 -l 3

Where -l 3 triggers aws.getL1(), aws.getL2() and aws.getL3(), and then writes the Level 3 dataset to file.
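
A minimal sketch of how such a --level switch could be wired in a CLI entry point; the option names follow the examples above, but the AWS constructor and writer arguments are assumptions, not the actual get_l3 implementation:

import argparse
from pypromice.process.aws import AWS

def main():
    parser = argparse.ArgumentParser(description="Process AWS data from L0 up to a chosen level")
    parser.add_argument("-c", "--config_file", required=True)
    parser.add_argument("-i", "--inpath", required=True)
    parser.add_argument("-o", "--outpath", default=None)
    parser.add_argument("-l", "--level", type=int, choices=[2, 3], default=3)
    args = parser.parse_args()

    # Variable/metadata file arguments to AWS() are omitted for brevity (assumed optional)
    aws = AWS(args.config_file, args.inpath)

    aws.getL1()
    aws.getL2()
    if args.level == 3:
        aws.getL3()

    if args.outpath is not None:
        # Level-specific writers as mentioned earlier in this thread
        if args.level == 2:
            aws.writeL2(args.outpath)
        else:
            aws.writeL3(args.outpath)

if __name__ == "__main__":
    main()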

I have also renamed Baptiste's get_l3 to l2_to_l3 to better represent its functionality. I want to work more on this - pypromice.process.aws should hold all processing and associated handling (e.g. rounding, reformatting, resampling), so the CLI scripts should call on the functions in the pypromice package rather than have them written out separately.

PennyHow (Member) commented Jun 7, 2024

Now l2_to_l3 uses the pypromice.process.aws object functionality instead of re-writing the workflow in the CLI script.

if os.path.exists(station_path):
    aws = AWS(args.config_file, station_path, v, m)
else:
    aws = AWS(args.config_file, args.inpath, v, m)

# Define Level 2 dataset from file
aws.L2 = xr.open_dataset(args.level_2)

# Perform Level 3 processing
aws.getL3()

# Write Level 3 dataset to file if output directory given
if args.outpath is not None:
    aws.writeL3(args.outpath)

Effectively the Level 2 .nc file is loaded directly as the Level 2 property of the aws object (i.e. aws.L2 = .NC FILE). So we bypass the Level 0 to Level 2 processing steps in the aws object.

The only thing I am not sure about is the efficiency of re-loading the netcdf file. I'm curious to see what @ladsmund thinks about this. Please feel free to revert the change if you have something else in mind.

BaptisteVandecrux (Member, Author)

Hi @PennyHow ,

Thanks for the suggestions.

Although your get_l3 is more elegant on the code side, I'm afraid we need to split things into different scripts because of operational constraints:

  • we need a get_l2 that processes transmission from L0 to a L2 file which then can be used by the BUFR processing
  • we can significantly speed things up if we:
  1. run get_l2 in parallel for different stations (for tx first, and after BUFR processing, for raw files; see the sketch at the end of this comment)
  2. then run the L2toL3 step only after L2_tx and L2_raw have been merged, so we don't run the processing twice on their overlapping periods (note that removing those overlapping periods between tx and raw is not the topic of this PR and could be addressed later)
  3. run join_l3 in parallel for different sites (with each site having a list of stations defined in a config file)

I have tried to summarize these constraints and drafted an l3_processor.sh that would use these functionalities:
[image: draft of the l3_processor.sh workflow]

I hope that also clarifies the level definitions for @ladsmund.

As a side note, since we'll need to update l3_processor.sh, I find the latest version (where functionalities are in separate shell files) harder to read and to debug (more chances for i/o errors when calling another file).
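
A minimal sketch of the per-station parallelism mentioned above, assuming get_l2 takes the same -c/-i/-o options shown for get_l3 earlier in this thread (the paths are illustrative):

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_get_l2(config_file):
    # One get_l2 call per station config; the flags mirror the get_l3 examples above
    return subprocess.run(
        ["get_l2", "-c", str(config_file), "-i", "aws-l0/tx", "-o", "aws-l2/tx"]
    ).returncode

# One config file per station; stations are independent, so they can run concurrently
configs = sorted(Path("aws-l0/tx/config").glob("*.toml"))
with ThreadPoolExecutor(max_workers=4) as pool:
    return_codes = list(pool.map(run_get_l2, configs))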

ladsmund (Contributor)

I also wondered about the purpose of the get_l3 script. As I see it, it is a script that processes all data from an AWS through the pipeline. So far, it has only been run on a single data source such as tx or raw.

It is not clear to me if we are on the same page with respect to the processing pipeline and how to invoke which steps. As I see it, the current pipeline is something like

Version 1

  1. getL0tx
  2. getL3 tx_files
  3. generate and publish BUFR files
  4. getL3 raw_files
  5. joinL3 (this includes resampling into monthly, daily and hourly values)
  6. Publish to fileshare

After this pull request, it could be something like

Version 2

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL2 raw_files
  5. joinL2 l2_tx_files l2_raw_files
  6. getL3 l2_joined_files
  7. Publish to fileshare

Or if we still need the L3 tx files for backwards compatibility

Version 3

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL3 l2_tx_files
  5. getL2 raw_files
  6. joinL2 l2_tx_files l2_raw_files
  7. getL3 l2_joined_files
  8. Publish to fileshare

And maybe also with the historical data

  • getL2 historical_files
  • getL3 l2_historical_files
  • joinL3 l3_files l3_historical_files

ladsmund (Contributor)

I am not sure how we should interpret L3 vs Level 3, both now and in the future.

I think Baptiste has a good point about using site vs station names as references for datasets.

I could imagine an approach where we have multiple types of l3 products:

Data source level

  • QAS_L_V3_tx
  • QAS_L_V3_raw
  • QAS_L_V2_tx
  • QAS_L_V2_raw
  • QAS_L_Historic_raw

Station level

  • QAS_L_V3
  • QAS_L_V2

Site level

  • QAS_L

PennyHow (Member)

we need a get_l2 that processes transmission from L0 to a L2 file which then can be used by the BUFR processing

I added flexibility into the get_l3 function to also allow for L0 to L2 processing. This can be defined with the -l/--level option set to 2, like so:

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2

So yes, it is still called "get_l3" currently, but can also be used for the functionality you described in get_l2. I'm happy to rename it. I just want to keep the option of L0-to-L3 processing for development or other uses that we may need in the future.

The reason I am hesitant about your configuration is that a lot of the post-processing functionality (i.e. rounding, reformatting, resampling) is re-written from the pypromice.process.aws module into the CLI script. Having two instances of this functionality means we have to update and maintain it in both places, which is a lot more work for us.

Version 3

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL3 l2_tx_files
  5. getL2 raw_files
  6. joinL2 l2_tx_files l2_raw_files
  7. getL3 l2_joined_files
  8. Publish to fileshare

So currently this is what I think we are aiming for. With the current pypromice modules/CLI scripts, it should look like this:

  • get_l0tx (Level 0 tx message fetching)
  • get_l3 -l 2 (For Level 0 to Level 2 tx processing)
  • get_bufr (For tx BUFR file generation and upload)
  • get_l3 -l 2 (For Level 0 to Level 2 raw processing)
  • join_l2 (For Level 2 raw and tx joining)
  • l2_to_l3 (For Level 2 to Level 3 processing)
  • Publish to fileshare

I think we are close!

PennyHow (Member)

There are a lot of new commits here now, but most of them are associated with debugging a new Action for testing the get_l2 and get_l2tol3 CLI scripts.

Main changes

  • The long list of functions in pypromice.process.aws has been moved out to separate pypromice.process submodules:
  1. pypromice.process.write contains all file writing functions (writeAll(), writeCSV(), writeNC(), getColNames())
  2. pypromice.process.resample contains all resampling functions (resample_dataset(), calculateSaturationVaporPressure())
  3. pypromice.process.test contains all unit tests, i.e. the TestProcess class
  4. pypromice.process.utilities contains all formatting, dataset populating, and metadata handling (roundValues(), reformat_time(), reformat_lon(), popCols(), addBasicMeta(), populateMeta(), addVars(), addMeta())
  5. pypromice.process.load contains all functions for loading L2/L3 datasets and their configuration (getConfig(), getL0(), getVars(), getMeta())
  • All key functionality in the pypromice.process.aws.AWS class has been moved out to the respective submodules. The main one is pypromice.process.write.prepare_and_write(), which prepares a L2/L3 dataset with resampling, rounding and reformatting, and then writes it out to file. This is now adopted in the CLI scripts, either via the function in the pypromice.process.write module (pypromice.process.write.prepare_and_write()) or via the pypromice.process.aws.AWS class function which calls upon it (aws.writeArr())

  • get_l2 and get_l3 CLI scripts now exist for separate L0-to-L2 and L0-to-L3 processing. The get_l2tol3 CLI script performs L2-to-L3 processing

  • When updating the join CLI scripts (join_l2 and join_l3), I found that they were both exactly the same. I don't know if I am missing something, but with the new changes in pypromice, it seems that most of the functionality for differentiating L2 and L3 datasets is in pypromice itself. I have renamed the join CLI script to join_levels, and it should be usable both for joining L2 datasets and for joining L3 datasets

  • I added a new Action (.github/workflows/process_l2_test.yml) to test the get_l2 CLI script. I wanted to add the get_l2tol3 script to this also, but had problems with directory structuring. I can try again another time.

To-do

  • Check that the entire processing pipeline can work with this new structuring. I've tested each part in isolation, but it would be good to make sure that everything runs smoothly together
  • Check that all needed functionality has been transferred out of the pypromice.process.aws.AWS class
  • Check that all new pypromice.process submodules make sense in terms of naming conventions and functionality
  • Implement get_l2tol3 testing in Action .github/workflows/process_l2_test.yml

I would also like to hear what you guys think about these changes. Please feel free to modify.

BaptisteVandecrux (Member, Author)

So the updated structure is:
[image: updated processing structure diagram]

I have now made an update of aws-operational-processing that uses the functions from this PR.

The code has been running on glacio01 and posting the level_2 and level_3 on GitHub (if cloning level_3, make sure to use --depth 1).

The comparison between aws-l3-dev (new) and aws-l3 (old) is available as time-series plots or as scatter plots. All variables are identical except q_h_u, q_h_l, dshf_h_u, dshf_h_l, dlhf_h_u and dlhf_h_l, because in the previous version they were calculated from 10-minute data and then averaged, while now they are calculated from hourly averages directly.
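
A generic illustration (not the pypromice code) of why the order matters for nonlinearly derived variables such as the turbulent fluxes: averaging the inputs first and then applying a nonlinear function does not give the same result as applying it to the 10-minute values and then averaging.

import numpy as np

rng = np.random.default_rng(0)

# Six 10-minute wind-speed samples within one hour (illustrative values)
wind_10min = rng.uniform(2.0, 10.0, size=6)

# A deliberately simple nonlinear "flux" proxy: flux ~ wind speed squared
flux_then_average = np.mean(wind_10min ** 2)   # old behaviour: compute from 10-minute data, then average
average_then_flux = np.mean(wind_10min) ** 2   # new behaviour: compute from the hourly average

print(flux_then_average, average_then_flux)    # the two differ whenever the input varies within the hour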

I'll be rebasing the downstream PRs onto this one, and we can take the discussion to the next one, which is #252.

PennyHow (Member)

Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?

BaptisteVandecrux (Member, Author)

Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?

I need to adapt the scripts first now that I have a clearer idea about the structure.
I'll let you know when it's ready!

@@ -0,0 +1,193 @@
#!/usr/bin/env python3
Contributor

I recommend avoiding the use of imperative forms in module names, as it can be confusing. For example:

import write

write.write(args)

Instead, I suggest using a gerund form like writing or a descriptive noun phrase such as dataset_writer or dataset_export. This makes the purpose of the module clearer and more intuitive.

Member

Sorry, that was me. Feel free to change.

BaptisteVandecrux changed the base branch from main to develop on June 26, 2024 at 19:09
BaptisteVandecrux merged commit c7ddb4e into develop on June 26, 2024
10 checks passed
BaptisteVandecrux deleted the 250-make-a-fast-track-l2-product-before-any-derived-variables-are-being-calculated branch on June 26, 2024 at 19:47