Fold fre pp rename-split into fre pp split-netcdf --rename; fix #862 aerosol variable misclassification #783
…ests Co-authored-by: ilaflott <6273252+ilaflott@users.noreply.github.com>
Oh, this PR should be targeting edit: got it |
Codecov Report
❌ Your patch check has failed because the patch coverage (57.01%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #783 +/- ##
==========================================
- Coverage 84.12% 83.49% -0.64%
==========================================
Files 71 71
Lines 4944 5040 +96
==========================================
+ Hits 4159 4208 +49
- Misses 785 832 +47
…/tmp_path, move CLI tests Co-authored-by: ilaflott <6273252+ilaflott@users.noreply.github.com>
Co-authored-by: ilaflott <6273252+ilaflott@users.noreply.github.com>
Addressed all four review comments: moved rename logic into |
|
@copilot 4 spaces for indentation, not 2 |
Co-authored-by: ilaflott <6273252+ilaflott@users.noreply.github.com>
Converted |
Refactored |
ceblanton left a comment
Wow. I think this is what we want, yes!
Lines 104-117 of https://github.com/NOAA-GFDL/fre-cli/blob/rename-split/fre/pp/tests/test_rename_split_to_pp.py (rename-split) contain a set of parameterized tests that cover a range of frequency/duration pairings. I am not seeing anything similar in https://github.com/NOAA-GFDL/fre-cli/blob/73a872d2c5632e2747d78f51e60abfba42275a96/fre/tests/test_fre_pp_cli.py .
I take that back - the tests are still in lines 104-117 of https://github.com/NOAA-GFDL/fre-cli/blob/73a872d2c5632e2747d78f51e60abfba42275a96/fre/pp/tests/test_rename_split_to_pp.py . I got thrown off because that file wasn't showing as part of the changed files...which it wouldn't, if it was being left alone. My concern is now that we've currently got 3 different files with tests for split-netcdf. Can we consolidate those into a single file? |
doesn't introduce any additional coverage beyond `test_fre_pp_cli.py::test_cli_fre_pp_split_netcdf_help`
each of these separate files is hundreds of lines long and i'm not sure what combining them would accomplish except having a very long single testing script. is there any commonality to their setups or scaffolding you're trying to consolidate? |
Pull request overview
This PR integrates the existing `fre pp rename-split` behavior into `fre pp split-netcdf` behind a new `--rename` flag, so split output can be written directly into the final nested `component/freq/duration/` directory structure while preserving the prior “flat output” behavior when `--rename` is not used.
Changes:
- Added `--rename` and `--diag-manifest` options to `fre pp split-netcdf` and plumbed them through to the splitting implementation.
- Implemented path computation for renamed outputs in `split_netcdf_script.py` to allow direct writes into the nested directory structure.
- Added CLI- and import-level tests for `split-netcdf` with `--rename`, plus backward-compatibility checks for the no-rename case.
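
To make the nested layout concrete, here is a minimal sketch of how a `component/freq/duration/` output path could be assembled. The function name `nested_output_path` and its parameters are hypothetical and are not taken from `split_netcdf_script.py`; the example values come from the path shown in the `split-netcdf` docstring (`atmos_daily/P1D/P6M/atmos_daily.00010101-00010630.temp.tile1.nc`).

```python
from pathlib import PurePosixPath


def nested_output_path(outputdir, component, freq, duration,
                       date_range, variable, tile=None):
    """Hypothetical helper: build <outputdir>/<component>/<freq>/<duration>/<file>.

    The file name is <component>.<date_range>.<variable>[.<tile>].nc,
    following the example path quoted in the split-netcdf docstring.
    """
    parts = [component, date_range, variable]
    if tile is not None:
        # Tiled (e.g. cubed-sphere) files carry a tile suffix before ".nc"
        parts.append(tile)
    filename = ".".join(parts) + ".nc"
    return PurePosixPath(outputdir) / component / freq / duration / filename


example_path = nested_output_path(
    "out", "atmos_daily", "P1D", "P6M", "00010101-00010630", "temp", "tile1"
)
print(example_path)
```

In the PR itself the frequency, duration, and date range are derived from the dataset (or from a diag manifest) rather than passed in by hand; the sketch only shows the final path assembly.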
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `fre/pp/split_netcdf_script.py` | Adds renamed-path computation + `rename`/`diag_manifest` parameters to write split outputs directly into nested dirs. |
| `fre/pp/frepp.py` | Extends the `split-netcdf` CLI with `--rename`/`--diag-manifest` and forwards options into the splitter. |
| `fre/tests/test_fre_pp_cli.py` | Adds functional CLI tests for `split-netcdf --rename` and for backward-compatible flat output. |
| `fre/pp/tests/test_split_netcdf.py` | Adds direct-import unit tests covering `split_file_xarray(..., rename=True)` and compatibility behavior. |
| `fre/tests/test_files/rename-split/README` | Fixes a typo in test-data documentation. |
| #drop all data vars (diagnostics) that are not the current var of interest | ||
| #but KEEP the metadata vars | ||
| #(seriously, we need the time_bnds) | ||
| data2 = dataset.drop_vars([el for el in datavars if el is not variable]) |
In the drop-vars list comprehension, el is not variable uses identity comparison. Since el and variable are strings, this can intermittently include the current variable in the drop list (dropping everything) depending on string interning, producing incorrect output files. Use value comparison (!=) instead.
Suggested change:
- data2 = dataset.drop_vars([el for el in datavars if el is not variable])
+ data2 = dataset.drop_vars([el for el in datavars if el != variable])
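
The difference between the two comparisons can be seen in a small standalone sketch. The variable names below are illustrative only; the point is that two strings with the same characters need not be the same object, so an identity test is not a reliable filter.

```python
# Illustrative data-variable names (not taken from a real file)
datavars = ["temp", "ps", "precip"]

# Build the name at runtime so it is equal to "temp" but is not
# guaranteed to be the same object as the list element
variable = "".join(["te", "mp"])

# Identity comparison: whether "temp" survives the filter depends on
# string interning, which is an implementation detail
drop_by_identity = [el for el in datavars if el is not variable]

# Value comparison: "temp" is always excluded from the drop list
drop_by_value = [el for el in datavars if el != variable]

print(drop_by_identity)
print(drop_by_value)
```

With the value comparison, the drop list contains only the other variables, so the variable of interest is never dropped by accident.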
| ## from main, may be needed, added during conflict resolution | ||
| # vc_encode = set_coord_encoding(dataset, dataset._coord_names) | ||
| # for variable in write_vars: | ||
| # fre_logger.info(f"splitting var {variable}") | ||
| # #drop all data vars (diagnostics) that are not the current var of interest | ||
| # #but KEEP the metadata vars | ||
| # #(seriously, we need the time_bnds) | ||
| # data2 = dataset.drop_vars([el for el in datavars if el is not variable]) | ||
| # v_encode= set_var_encoding(dataset, metavars) | ||
| # #combine 2 dicts into 1 dict - should be no shared keys, | ||
| # #so the merge is straightforward | ||
| # var_encode = {**vc_encode, **v_encode} | ||
| # fre_logger.debug(f"var_encode settings: {var_encode}") | ||
| # #Encoding principles for xarray: | ||
| # # - no coords have a _FillValue | ||
| # # - Everything is written out with THE SAME precision it was read in | ||
| # # - Everything has THE SAME UNITS as it did when it was read in | ||
| # var_outfile = fre_outfile_name(os.path.basename(infile), variable) | ||
| # var_out = os.path.join(outfiledir, os.path.basename(var_outfile)) | ||
| # data2.to_netcdf(var_out, encoding = var_encode) | ||
| # fre_logger.debug(f"Wrote '{var_out}'") |
There is a large commented-out duplicate of the splitting loop left in the function (marked as conflict resolution). This dead code makes the function harder to maintain and risks reintroducing old logic during future edits; please remove it.
| if diag_manifest is not None: | ||
| if Path(diag_manifest).exists(): | ||
| fre_logger.info(f"Using diag manifest '{diag_manifest}'") | ||
| with open(diag_manifest, 'r') as f: | ||
| yaml_data = yaml.safe_load(f) | ||
| duration = None | ||
| for diag_file in yaml_data["diag_files"]: | ||
| if diag_file["file_name"] == label: | ||
| if diag_file["freq_units"] == "years": | ||
| duration = f"P{diag_file['freq']}Y" | ||
| format_ = "%Y" | ||
| elif diag_file["freq_units"] == "months": | ||
| if diag_file['freq'] == 12: | ||
| duration = "P1Y" | ||
| format_ = "%Y" | ||
| else: | ||
| duration = f"P{diag_file['freq']}M" | ||
| format_ = "%Y%m" | ||
| else: | ||
| raise Exception( | ||
| f"Diag manifest found but frequency units " | ||
| f"{diag_file['freq_units']} are unexpected; " | ||
| f"expected 'years' or 'months'.") | ||
| if duration is not None: | ||
| duration_object = rename_split_script.duration_parser.parse(duration) | ||
| else: | ||
| raise Exception( | ||
| f"File label '{label}' not found in diag manifest " | ||
| f"'{diag_manifest}'") | ||
| freq_label = duration | ||
| date1 = rename_split_script.time_parser.parse(date) | ||
| one_month = rename_split_script.duration_parser.parse('P1M') | ||
| date2 = date1 + duration_object - one_month | ||
| else: | ||
| raise FileNotFoundError( | ||
| f"Diag manifest '{diag_manifest}' does not exist") |
The diag manifest parsing here assumes yaml_data["diag_files"] and several required keys exist, and only supports a single manifest path, while rename_split_script.rename_file() already has more robust parsing (uses .get(), supports multiple manifests, and detects duplicates). Consider reusing/shared-factoring that logic or at least aligning the parsing to avoid divergence and KeyError/TypeError failures on slightly different manifests.
| def _get_duration_and_format_from_diag_manifest(diag_manifest_value, file_label): | |
| if isinstance(diag_manifest_value, (str, Path)): | |
| manifest_paths = [diag_manifest_value] | |
| else: | |
| manifest_paths = list(diag_manifest_value) | |
| matched_entry = None | |
| matched_manifest = None | |
| for manifest_path in manifest_paths: | |
| manifest_path = Path(manifest_path) | |
| if not manifest_path.exists(): | |
| raise FileNotFoundError( | |
| f"Diag manifest '{manifest_path}' does not exist") | |
| fre_logger.info(f"Using diag manifest '{manifest_path}'") | |
| with open(manifest_path, 'r') as f: | |
| yaml_data = yaml.safe_load(f) or {} | |
| diag_files = yaml_data.get("diag_files") or [] | |
| if not isinstance(diag_files, list): | |
| raise Exception( | |
| f"Diag manifest '{manifest_path}' has invalid 'diag_files'; " | |
| f"expected a list") | |
| for diag_file in diag_files: | |
| if not isinstance(diag_file, dict): | |
| continue | |
| if diag_file.get("file_name") != file_label: | |
| continue | |
| if matched_entry is not None: | |
| raise Exception( | |
| f"File label '{file_label}' was found more than once in diag " | |
| f"manifests '{matched_manifest}' and '{manifest_path}'") | |
| matched_entry = diag_file | |
| matched_manifest = manifest_path | |
| if matched_entry is None: | |
| raise Exception( | |
| f"File label '{file_label}' not found in diag manifest " | |
| f"'{diag_manifest_value}'") | |
| freq_units = matched_entry.get("freq_units") | |
| freq = matched_entry.get("freq") | |
| if freq_units is None or freq is None: | |
| raise Exception( | |
| f"Diag manifest entry for file label '{file_label}' in " | |
| f"'{matched_manifest}' must define both 'freq_units' and 'freq'") | |
| if freq_units == "years": | |
| return f"P{freq}Y", "%Y" | |
| if freq_units == "months": | |
| if freq == 12: | |
| return "P1Y", "%Y" | |
| return f"P{freq}M", "%Y%m" | |
| raise Exception( | |
| f"Diag manifest found but frequency units '{freq_units}' are unexpected; " | |
| f"expected 'years' or 'months'.") | |
| if diag_manifest is not None: | |
| duration, format_ = _get_duration_and_format_from_diag_manifest( | |
| diag_manifest, label) | |
| duration_object = rename_split_script.duration_parser.parse(duration) | |
| freq_label = duration | |
| date1 = rename_split_script.time_parser.parse(date) | |
| one_month = rename_split_script.duration_parser.parse('P1M') | |
| date2 = date1 + duration_object - one_month |
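
As a rough illustration of the logic in this thread, the sketch below mirrors the `freq_units`/`freq` branches and the `date1 + duration - one_month` arithmetic using plain integers. It is an assumption-laden stand-in: the function names are hypothetical, and it does not use the project's `duration_parser` or `time_parser`.

```python
def duration_and_format(freq_units, freq):
    """Map a manifest entry to (ISO 8601 duration label, date format).

    Mirrors the branches discussed above: years -> P<n>Y, 12 months -> P1Y,
    other month counts -> P<n>M. Anything else is rejected.
    """
    if freq_units == "years":
        return f"P{freq}Y", "%Y"
    if freq_units == "months":
        if freq == 12:
            return "P1Y", "%Y"
        return f"P{freq}M", "%Y%m"
    raise ValueError(
        f"unexpected frequency units '{freq_units}'; expected 'years' or 'months'"
    )


def last_covered_month(year, month, n_months):
    """Hypothetical helper: (year, month) of start + n_months - 1 month."""
    # Count months from year 0 so the carry into the next year is automatic
    index = year * 12 + (month - 1) + n_months - 1
    return index // 12, index % 12 + 1


print(duration_and_format("months", 6))
print(last_covered_month(1, 1, 6))
```

A six-month file starting in January of year 1 therefore ends in June of year 1, which is consistent with the `00010101-00010630` range in the docstring example.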
| If --rename is set, split files are additionally reorganized into a nested | ||
| directory structure under $outputdir with frequency and duration | ||
| (e.g. atmos_daily/P1D/P6M/atmos_daily.00010101-00010630.temp.tile1.nc).''' | ||
| var_list = variables.split(",") |
var_list = variables.split(",") is computed but never used. This is misleading and suggests the command is doing additional parsing that it isn’t; either remove it or pass var_list into split_file_xarray() (and adjust that function’s expectations accordingly).
| @click.option('-r', '--rename', is_flag=True, default=False, | ||
| help='During splitting, rename output files into a nested directory structure ' | ||
| 'organized by frequency and duration under $outputdir.') | ||
| @click.option('-d', '--diag-manifest', type=str, required=False, default=None, | ||
| help='Path to FMS diag manifest file. Only used with --rename. ' | ||
| 'Required when input file has one timestep and no time bounds.') | ||
| def split_netcdf(file, outputdir, variables, rename, diag_manifest): |
PR description says the standalone fre pp rename-split command was removed, but fre/pp/frepp.py still defines the rename_split click command (currently around line ~313). Either remove that command (and update any docs/tests that reference it) or update the PR description to match the actual behavior.
| @click.option('-d', '--diag-manifest', type=str, required=False, default=None, | ||
| help='Path to FMS diag manifest file. Only used with --rename. ' | ||
| 'Required when input file has one timestep and no time bounds.') |
The new --diag-manifest option is introduced here, but there are no tests exercising the code path where a renamed split requires a diag manifest (one timestep + no time bounds). Please add a unit/CLI test that passes --rename --diag-manifest and validates the computed nested path (and that missing/invalid manifests raise the expected error).
…wer feedback Agent-Logs-Url: https://github.com/NOAA-GFDL/fre-cli/sessions/dea7c24c-38b0-4ff9-a49e-1a63362179d3 Co-authored-by: ilaflott <6273252+ilaflott@users.noreply.github.com>
Addressed the reviewer feedback and fixed #862. Commit
|
Title changed: fre pp rename-split into fre pp split-netcdf --rename → fre pp rename-split into fre pp split-netcdf --rename; fix #862 aerosol variable misclassification
|
@ilaflott this is clearly a desired update but it's sort of big, containing diag manifest testing, the rename-split tool consolidation, and now the split-netcdf oddity too. Can we wait until 2026.02 for this? Or should it go in before then? |
|
does it fix your problems? does it bring value to your workflow? if it doesn't work for you, i'll kill it. if it does work for you, tell me what else needs to budge. if you can't decide right now, decide when you can. many other fish to fry |
Describe your changes
Takes the `fre pp rename-split` functionality from the `rename-split` branch and folds it into `fre pp split-netcdf` via a `--rename` flag. Without `--rename`, `split-netcdf` behaves as before. Also fixes #862 where certain aerosol variables were incorrectly classified as metadata.

`fre/pp/split_netcdf_script.py`
- Added `rename` and `diag_manifest` parameters to `split_file_xarray()`: when `rename=True`, each split file is written directly to its final nested `component/freq/duration/` path (no intermediate flat file, no copy, no delete)
- Added a `_compute_renamed_path()` helper that uses an in-memory time-decoded dataset to determine frequency, duration, and date range before the write, enabling a single file touch per variable
- Added `try`/`finally` for proper resource cleanup of the decoded dataset
- Fixed `el is not variable` → `el != variable` in the `drop_vars` list comprehension (string value equality instead of identity comparison)
- Changed `matchlist()` to use exact name matching for short/low-dimensional variables instead of regex substring matching. This resolves #862 ("'fre pp split-netcdf' incorrectly skips some aerosol variables"), where aerosol variables like `bldep`, `drybc`, `emibc`, `emiapoa`, `emibvoc`, `emiisop_biogenic`, `mmrbc`, and `wetbc` were incorrectly classified as metadata because short variable names (`ap`, `b`) were used as regex patterns that matched substrings of legitimate data variable names
- Converted `split_file_xarray` to 4-space indentation (PEP 8)

`fre/pp/frepp.py`
- Added the `--rename` (`-r`) flag and `--diag-manifest` (`-d`) option to the `split-netcdf` click command, passing them through to `split_file_xarray()`
- Removed the unused `var_list = variables.split(",")` line

`fre/tests/test_fre_pp_cli.py`: CLI tests via `CliRunner`
- `--rename` and `--diag-manifest` appear in help output
- `split-netcdf --rename` (parametrized for timeseries and static cases)
- Backward compatibility (without `--rename` → flat output)
- Uses the `split_rename_ncgen` fixture and `tmp_path` for output directories

`fre/pp/tests/test_split_netcdf.py`: unit tests via standard `import` (merged into existing test file)
- `split_file_xarray` with `rename=True` for timeseries and static data
- `split_file_xarray` without rename
- Uses the `ncgen_setup` pytest fixture and `tmp_path` for output directories

Issue ticket number and link (if applicable)
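
The #862 misclassification described in the changes above comes down to substring versus exact matching. A minimal sketch of the failure mode follows; the two helper functions are hypothetical and do not reproduce the real `matchlist()` signature, and the name lists are a small subset chosen for illustration.

```python
import re

# Short metadata-style names, as mentioned in the PR description
metadata_names = ["ap", "b", "time_bnds"]

# A few of the affected aerosol variables, plus one genuine metadata variable
candidates = ["bldep", "drybc", "emiapoa", "mmrbc", "time_bnds"]


def substring_matches(names, variables):
    """Treat each name as a regex and search anywhere in the variable name."""
    return [v for v in variables if any(re.search(n, v) for n in names)]


def exact_matches(names, variables):
    """Keep only variables whose full name is in the metadata list."""
    wanted = set(names)
    return [v for v in variables if v in wanted]


# Every aerosol variable contains "b" or "ap", so all are flagged as metadata
print(substring_matches(metadata_names, candidates))

# Exact matching flags only the real metadata variable
print(exact_matches(metadata_names, candidates))
```

Anchoring the patterns (for example with `re.fullmatch`) would be another way to get the same effect; the set lookup is shown here because the names are literal strings rather than patterns.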
Checklist before requesting a review
Original prompt
`fre pp split-netcdf --rename` calls `fre pp rename-split` functionality #782