
Write L2 to file and leave all additional variable derivation for the L2toL3 step #255

Conversation

BaptisteVandecrux (Member)

The idea of this new version is that:

  1. L2 data files are written into level_2/raw and level_2/tx folders by get_l2 (just like it was previously done for the L3 data). One consequence is that this low-latency Level 2 tx data can be posted very quickly on THREDDS for showcase and fieldwork, and processed into BUFR files.

  2. L2 tx and raw files are merged using join_l2 (just like it was previously done for the L3 data). Resampling to hourly, daily and monthly values is done here, but could be left for a later stage.

  3. get_l3 is now a script that loads the merged L2 file and runs process.L2toL3.toL3 (see the sketch below). This will allow more variables to be derived in toL3 and historical data to be appended once the L3 data is processed.

This reverts commit af09818.
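
A rough sketch of step 3, assuming process.L2toL3.toL3 takes the merged L2 dataset (opened with xarray) as its first argument and returns the L3 dataset with the derived variables; the file path and the call signature are illustrative assumptions, not taken from the PR:

import xarray as xr
from pypromice.process.L2toL3 import toL3  # module path as referenced in the description

# Load the merged Level 2 product written by join_l2 (illustrative path)
l2_merged = xr.open_dataset("level_2/NUK_U_hour.nc")

# Derive the additional Level 3 variables (e.g. turbulent fluxes);
# the call signature is assumed here
l3 = toL3(l2_merged)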
@@ -87,20 +87,20 @@ def join_l3():
exit()

# Get hourly, daily and monthly datasets
print('Resampling L3 data to hourly, daily and monthly resolutions...')
Contributor

I don't see why we should have a specific join L2 script that resamples and generates three different intermediate L2(A) products.
getL3 is also supposed to be a merge function with resampling.

NSE_L3 = resample(
  hard_merge([
    NSE_hist_L2,
    NSE_tx_L2,
    NSE_raw_L2,
  ])
)

BaptisteVandecrux (Member, Author)

I agree. It was first separated to respect the distinction, in the previous version, between getl3 and joinl3.
It also has the advantage of making a clear distinction between the data product levels:
get_l2 produces L2_tx and L2_raw
join_l2 produces L2_merged
get_l3 produces L3
join_l3 produces L3_merged
I find it confusing that a script called getL3 produces the L2_merged product as an intermediate output.

Also, I have two comments about the example you give above:

  1. that's another clear case where we need to do it step-wise:
    NSE_tx_L2 and NSE_raw_L2 can be combined with combine_first because they are from the same physical station. The result is NSE_L2_merged. Only after that can we merge it with another, distinct physical station (historical or v3) using time blocks (see the sketch below). But...
  2. the historical stations come out as Level 3 products (they already have derived variables such as turbulent fluxes and estimated coordinates), so we first need to turn NSE_L2_merged into NSE_L3, and only then can it be merged with NSE_hist_L3.
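
A minimal sketch of that step-wise merging with xarray; the station name and file paths are illustrative, and which of raw and tx takes precedence is an assumption here:

import xarray as xr

# Level 2 datasets from the same physical station (illustrative paths)
nse_raw_l2 = xr.open_dataset("level_2/raw/NSE_raw.nc")
nse_tx_l2 = xr.open_dataset("level_2/tx/NSE_tx.nc")

# Same physical station: raw values take precedence, tx fills the gaps
nse_l2_merged = nse_raw_l2.combine_first(nse_tx_l2)

# The historical station is a distinct physical station and is already L3,
# so it can only be appended (e.g. by concatenating time blocks) after
# NSE_L2_merged has itself been turned into NSE_L3 by the L2toL3 step.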

BaptisteVandecrux (Member, Author)

To your comment:

I don't see why we should have a specific join L2 script that resamples and generates three different intermediate L2(A) products.

The updated reply is:

get_l2 produces L2_tx and L2_raw
join_l2 produces L2_merged
get_l2tol3 produces L3 from L2_merged
join_l3 produces L3_merged

We cannot further merge those functionalities because they take different inputs:

get_l2 takes a config file as input (different from tx and raw)
join_l2 takes two files with same station_id as input
get_l2tol3 takes a single file (or alternatively a single station_id) as input
join_l3 takes either multiple files or a site_id as input

This structure respects the different levels and makes the different functions rather independent of each other.

get_l3 is not used at the moment and is also impractical because it takes as input a single config file (either from tx or raw) and doesn't include the merging of L2 datasets before running L2toL3.

Does that answer your comment?

PennyHow (Member) commented Jun 7, 2024

I just pushed changes to the processing performed in the pypromice.process.aws object, mainly for better compatibility (and also in line with Mads' comments).

  • Merge type for L1A removed (we will tackle that later)
  • Rounding, reformatting and resampling steps are moved to aws.writeArr. Putting these at the end of Level 2 processing runs the risk of affecting Level 3 processing; instead, they will now be triggered when a user wants to write a L2/L3 dataset to file
  • Related to this, aws.writeArr has been altered to accommodate Level 2 and Level 3 dataset exporting. It is then used in the level-specific writing functions writeL2 and writeL3, just to keep it simple

Also, I updated:

  • The pydataverse dependency version is pinned, as the most recent version is not compatible for some reason. We'll solve this later
  • The pypromice version is increased to v1.4.0... I think we will send out a new release once all the PRs are merged. It is definitely worth a minor version release rather than just a patch release!

I will tackle the CLI scripts now, specifically looking at get_l2, get_l3 and join_l2

PennyHow (Member) commented Jun 7, 2024

New commit that tackles get_l2. Instead of having separate scripts for get_l2 and get_l3, I think it is better if we just add functionality to get_l3 that allows you to only process to Level 2 if specified.

I've made changes now that mean you can specify the level of processing (and the outputted file) with the input option --level (or -l); for example:

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2

Where -l 2 triggers aws.getL1() and aws.getL2(), and then writes the Level 2 dataset to file

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l3 -l 3

Where -l 3 triggers aws.getL1(), aws.getL2() and aws.getL3(), and then writes the Level 3 dataset to file.
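
A minimal sketch of how such a --level switch could be wired in a CLI entry point; the option names follow the examples above, but the AWS constructor and writer arguments are assumptions, not the actual get_l3 implementation:

import argparse
from pypromice.process.aws import AWS

def main():
    parser = argparse.ArgumentParser(description="Process AWS data from L0 up to a chosen level")
    parser.add_argument("-c", "--config_file", required=True)
    parser.add_argument("-i", "--inpath", required=True)
    parser.add_argument("-o", "--outpath", default=None)
    parser.add_argument("-l", "--level", type=int, choices=[2, 3], default=3)
    args = parser.parse_args()

    # Variable/metadata file arguments to AWS() are omitted for brevity (assumed optional)
    aws = AWS(args.config_file, args.inpath)

    aws.getL1()
    aws.getL2()
    if args.level == 3:
        aws.getL3()

    if args.outpath is not None:
        # Level-specific writers as mentioned earlier in this thread
        if args.level == 2:
            aws.writeL2(args.outpath)
        else:
            aws.writeL3(args.outpath)

if __name__ == "__main__":
    main()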

I have also renamed Baptiste's get_l3 to l2_to_l3 to better represent its functionality. I want to work more on this - pypromice.process.aws should hold all processing and associated handling (e.g. rounding, reformatting, resampling), so the CLI scripts should call on the functions in the pypromice package rather than have them written out separately.

PennyHow (Member) commented Jun 7, 2024

Now l2_to_l3 uses the pypromice.process.aws object functionality instead of re-writing the workflow in the CLI script.

if os.path.exists(station_path):
    aws = AWS(args.config_file, station_path, v, m)
else:
    aws = AWS(args.config_file, args.inpath, v, m)

# Define Level 2 dataset from file
aws.L2 = xr.open_dataset(args.level_2)

# Perform Level 3 processing
aws.getL3()

# Write Level 3 dataset to file if output directory given
if args.outpath is not None:
    aws.writeL3(args.outpath)

Effectively the Level 2 .nc file is loaded directly as the Level 2 property of the aws object (i.e. aws.L2 = .NC FILE). So we bypass the Level 0 to Level 2 processing steps in the aws object.

The only thing I am not sure about is the efficiency of re-loading the netcdf file. I'm curious to see what @ladsmund thinks about this. Please feel free to revert the change if you have something else in mind.

BaptisteVandecrux (Member, Author)

Hi @PennyHow ,

Thanks for the suggestions.

Although your get_l3 is more elegant on the code side, I'm afraid we need to split things into different scripts because of operational constraints:

  • we need a get_l2 that processes transmission from L0 to a L2 file which then can be used by the BUFR processing
  • we can significantly speed things up if we:
  1. run get_l2 in parallel for different stations (for tx first, and after BUFR processing, for raw files; see the sketch at the end of this comment)
  2. then run the L2toL3 step only after L2_tx and L2_raw have been merged, so we don't run the processing twice on their overlapping periods (note that removing those overlapping periods between tx and raw is not the topic of this PR and could be addressed later)
  3. run join_l3 in parallel for different sites (with each site having a list of stations defined in a config file)

I have tried to summarize these constraints and drafted an l3_processor.sh that would use these functionalities:
[image: draft of the l3_processor.sh workflow]

I hope that also clarifies the level definitions for @ladsmund.

As a side note, since we'll need to update l3_processor.sh, I find the latest version (where functionalities are in separate shell files) harder to read and to debug (more chances for i/o errors when calling another file).
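
A minimal sketch of the per-station parallelism mentioned above, assuming get_l2 takes the same -c/-i/-o options shown for get_l3 earlier in this thread (the paths are illustrative):

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_get_l2(config_file):
    # One get_l2 call per station config; the flags mirror the get_l3 examples above
    return subprocess.run(
        ["get_l2", "-c", str(config_file), "-i", "aws-l0/tx", "-o", "aws-l2/tx"]
    ).returncode

# One config file per station; stations are independent, so they can run concurrently
configs = sorted(Path("aws-l0/tx/config").glob("*.toml"))
with ThreadPoolExecutor(max_workers=4) as pool:
    return_codes = list(pool.map(run_get_l2, configs))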

ladsmund (Contributor)

I also wondered about the purpose of the get_l3 script. As I see it, it is a script that processes all data from an AWS through the pipeline. So far, it has only been run on a single data source such as tx or raw.

It is not clear to me if we are on the same page with respect to the processing pipeline and how to invoke which steps. As I see it, the current pipeline is something like

Version 1

  1. getL0tx
  2. getL3 tx_files
  3. generate and publish BUFR files
  4. getL3 raw_files
  5. joinL3 (this includes resampling into monthly, daily and hourly values)
  6. Publish to fileshare

After this pull request, it could be something like

Version 2

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL2 raw_files
  5. joinL2 l2_tx_files l2_raw_files
  6. getL3 l2_joined_files
  7. Publish to fileshare

Or if we still need the L3 tx files for backwards compatibility

Version 3

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL3 l2_tx_files
  5. getL2 raw_files
  6. joinL2 l2_tx_files l2_raw_files
  7. getL3 l2_joined_files
  8. Publish to fileshare

And maybe also with the historical data

  • getL2 historical_files
  • getL3 l2_historical_files
  • joinL3 l3_files l3_historical_files

ladsmund (Contributor)

I am not sure how we should interpret L3 vs Level 3, both now and in the future.

I think Baptiste has a good point about using site vs station names as references for datasets.

I could imagine an approach where we have multiple types of l3 products:

Data source level

  • QAS_L_V3_tx
  • QAS_L_V3_raw
  • QAS_L_V2_tx
  • QAS_L_V2_raw
  • QAS_L_Historic_raw

Station level

  • QAS_L_V3
  • QAS_L_V2

Site level

  • QAS_L

PennyHow (Member)

we need a get_l2 that processes transmission from L0 to a L2 file which then can be used by the BUFR processing

I added flexibility into the get_l3 function to also allow for L0 to L2 processing. This can be defined with the -l/--level option set to 2, like so:

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2

So yes, it is still called "get_l3" currently, but can also be used for the functionality you described in get_l2. I'm happy to rename it. I just want to keep the option of L0-to-L3 processing for development or other uses that we may need in the future.

The reason I am hesitant about your configuration is that a lot of the post-processing functionality (i.e. rounding, reformatting, resampling) is re-written from the pypromice.process.aws module into the CLI script. Having two instances of this functionality means we have to update and maintain it in both places, which is a lot more work for us.

Version 3

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL3 l2_tx_files
  5. getL2 raw_files
  6. joinL2 l2_tx_files l2_raw_files
  7. getL3 l2_joined_files
  8. Publish to fileshare

So currently this is what I think we are aiming for. With the current pypromice modules/CLI scripts, it should look like this:

  • get_l0tx (Level 0 tx message fetching)
  • get_l3 -l 2 (For Level 0 to Level 2 tx processing)
  • get_bufr (For tx BUFR file generation and upload)
  • get_l3 -l 2 (For Level 0 to Level 2 raw processing)
  • join_l2 (For Level 2 raw and tx joining)
  • l2_to_l3 (For Level 2 to Level 3 processing)
  • Publish to fileshare

I think we are close!

PennyHow (Member)

There are a lot of new commits here now, but most of them are associated with debugging a new Action for testing the get_l2 and get_l2tol3 CLI scripts.

Main changes

  • The long list of functions in pypromice.process.aws has been moved out to separate pypromice.process submodules:
  1. pypromice.process.write contains all file writing functions (writeAll(), writeCSV(), writeNC(), getColNames())
  2. pypromice.process.resample contains all resampling functions (resample_dataset(), calculateSaturationVaporPressure())
  3. pypromice.process.test contains all unit tests, i.e. the TestProcess class
  4. pypromice.process.utilities contains all formatting, dataset populating, and metadata handling (roundValues(), reformat_time(), reformat_lon(), popCols(), addBasicMeta(), populateMeta(), addVars(), addMeta())
  5. pypromice.process.load contains all functions for loading L2/L3 datasets and their configuration (getConfig(), getL0(), getVars(), getMeta())
  • All key functionality in the pypromice.process.aws.AWS class has been moved out to the respective submodules. The main one is pypromice.process.write.prepare_and_write(), which prepares a L2/L3 dataset with resampling, rounding and reformatting, and then writes it out to file. This is now adopted in the CLI scripts, either via the function in the pypromice.process.write module (pypromice.process.write.prepare_and_write()) or via the pypromice.process.aws.AWS class function which calls upon it (aws.writeArr())

  • get_l2 and get_l3 CLI scripts now exist for separate L0-to-L2 and L0-to-L3 processing. The get_l2tol3 CLI script performs L2-to-L3 processing

  • When updating the join CLI scripts (join_l2 and join_l3), I found that they were both exactly the same. I don't know if I am missing something, but with the new changes in pypromice, it seems that most of the functionality for differentiating L2 and L3 datasets is in pypromice itself. I have renamed the join CLI script to join_levels, and it should be usable both for joining L2 datasets and for joining L3 datasets

  • I added a new Action (.github/workflows/process_l2_test.yml) to test the get_l2 CLI script. I wanted to add the get_l2tol3 script to this also, but had problems with directory structuring. I can try again another time.

To-do

  • Check that the entire processing pipeline can work with this new structuring. I've tested each part in isolation, but it would be good to make sure that everything runs smoothly together
  • Check that all needed functionality has been transferred out of the pypromice.process.aws.AWS class
  • Check that all new pypromice.process submodules make sense in terms of naming conventions and functionality
  • Implement get_l2tol3 testing in Action .github/workflows/process_l2_test.yml

I would also like to hear what you guys think about these changes. Please feel free to modify.

BaptisteVandecrux (Member, Author)

So the updated structure is:
[image: updated processing structure diagram]

I have now made an update of aws-operational-processing that uses the functions from this PR.

The code has been running on glacio01 and posting the level_2 and level_3 on GitHub (if cloning level_3, make sure to use --depth 1).

The comparison between aws-l3-dev (new) and aws-l3 (old) is available as time-series plots or as scatter plots. All variables are identical except q_h_u, q_h_l, dshf_h_u, dshf_h_l, dlhf_h_u and dlhf_h_l, because in the previous version they were calculated from 10-minute data and then averaged, while now they are calculated from hourly averages directly.
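
A generic illustration (not the pypromice code) of why the order matters for nonlinearly derived variables such as the turbulent fluxes: averaging the inputs first and then applying a nonlinear function does not give the same result as applying it to the 10-minute values and then averaging.

import numpy as np

rng = np.random.default_rng(0)

# Six 10-minute wind-speed samples within one hour (illustrative values)
wind_10min = rng.uniform(2.0, 10.0, size=6)

# A deliberately simple nonlinear "flux" proxy: flux ~ wind speed squared
flux_then_average = np.mean(wind_10min ** 2)   # old behaviour: compute from 10-minute data, then average
average_then_flux = np.mean(wind_10min) ** 2   # new behaviour: compute from the hourly average

print(flux_then_average, average_then_flux)    # the two differ whenever the input varies within the hour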

I'll be rebasing the downstream PRs onto this one, and we can take the discussion to the next one, which is #252.

PennyHow (Member)

Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?

BaptisteVandecrux (Member, Author)

Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?

I need to adapt the scripts first now that I have a clearer idea about the structure.
I'll let you know when it's ready!

@@ -0,0 +1,193 @@
#!/usr/bin/env python3
Contributor

I recommend avoiding the use of imperative forms in module names, as it can be confusing. For example:

import write

write.write(args)

Instead, I suggest using a gerund form like writing or a descriptive noun phrase such as dataset_writer or dataset_export. This makes the purpose of the module clearer and more intuitive.

Member

Sorry, that was me. Feel free to change.

BaptisteVandecrux changed the base branch from main to develop on June 26, 2024 at 19:09
BaptisteVandecrux merged commit c7ddb4e into develop on June 26, 2024
10 checks passed
BaptisteVandecrux deleted the 250-make-a-fast-track-l2-product-before-any-derived-variables-are-being-calculated branch on June 26, 2024 at 19:47