# CDAT Migration Regression Testing Notebook (`.json` metrics)

This notebook is used to perform regression testing between the development and
production versions of a diagnostic set.

## How it works

It compares the metrics stored in two sets of `.json` files and reports their relative
differences (%). The files live in two separate directories: one for the refactored code
and one for the `main` branch.

It displays metric values whose relative differences are >= 2%. Relative differences are used instead of absolute differences because:

- Relative differences are expressed as percentages, which show the size of a difference relative to the values being compared.
- Absolute differences are raw numbers that don't account for the magnitude of the
  values being compared (e.g., 100.00 vs. 0.0001), which can be misleading.
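
The small sketch below illustrates the point above with two pairs of values that share the same absolute difference (0.0001) but differ enormously in relative terms. It assumes the relative difference is computed as `|dev - main| / |main|` with `main` as the baseline; the notebook's own `get_rel_diffs` helper is not shown here and may use a different formula. The `rel_diff` name is hypothetical.

```python
def rel_diff(dev: float, main: float) -> float:
    """Relative difference of dev vs. main as a fraction (0.02 == 2%).

    Assumption: `main` is the baseline. This is an illustrative helper,
    not the notebook's actual implementation.
    """
    if main == 0:
        # Avoid ZeroDivisionError: any change from zero is "infinitely" large.
        return float("inf") if dev != 0 else 0.0
    return abs(dev - main) / abs(main)


# Both pairs differ by the same absolute amount (0.0001).
large_values = (100.0001, 100.0)
small_values = (0.0002, 0.0001)

for dev, main in (large_values, small_values):
    print(f"abs diff = {abs(dev - main):.4g}, rel diff = {rel_diff(dev, main):.2%}")
```

The first pair differs by roughly 0.0001% and the second by 100%, even though their absolute differences are identical.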

## How to use

PREREQUISITE: The diagnostic set's metrics must be stored in `.json` files in two
directories (one for the dev branch and one for the `main` branch).

1. Make a copy of this notebook under `auxiliary_tools/cdat_regression_testing/<DIR_NAME>`.
2. Run `mamba create -n cdat_regression_test -y -c conda-forge "python<3.12" pandas matplotlib-base ipykernel`
3. Run `mamba activate cdat_regression_test`
4. Update `SET_NAME`, `SET_DIR`, `DEV_PATH`, and `MAIN_PATH` in your copy of the notebook.
5. Run all cells IN ORDER.
6. Review results for any outstanding differences (>= 2%).
   - Debug these differences (e.g., bug in metrics functions, incorrect variable references, etc.)


## Setup Code


```python
import glob

import pandas as pd

# NOTE: The helper functions used below (`get_metrics`, `get_rel_diffs`,
# `sort_columns`, `update_diffs_to_pct`, `get_num_metrics_above_diff_thres`,
# and `highlight_large_diffs`) are not defined in this notebook. Import or
# define them before running the remaining cells.

# TODO: Update SET_NAME and SET_DIR
SET_NAME = "cosp_histogram"
SET_DIR = "660-cosp-histogram"

DEV_PATH = f"/global/cfs/projectdirs/e3sm/e3sm_diags_cdat_test/{SET_DIR}/{SET_NAME}/**"
MAIN_PATH = f"/global/cfs/projectdirs/e3sm/e3sm_diags_cdat_test/main/{SET_NAME}/**"

# Collect the metrics `.json` files produced by each branch.
DEV_GLOB = sorted(glob.glob(DEV_PATH + "/*.json"))
MAIN_GLOB = sorted(glob.glob(MAIN_PATH + "/*.json"))

if len(DEV_GLOB) == 0 or len(MAIN_GLOB) == 0:
    raise IOError("No files found at DEV_PATH and/or MAIN_PATH.")

if len(DEV_GLOB) != len(MAIN_GLOB):
    raise IOError("Number of files does not match at DEV_PATH and MAIN_PATH.")
```
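
The setup cell only checks that the two directories contain the same *number* of files. If a stricter check is useful, one option is to compare relative file paths, as in the sketch below. `check_matching_files` is a hypothetical helper, not part of the notebook.

```python
import os


def check_matching_files(dev_files, main_files, dev_root, main_root):
    """Raise IOError if dev and main do not contain the same relative paths.

    Hypothetical helper: compares file paths relative to each root directory,
    so files with the same name in the same subdirectory are treated as a pair.
    """
    dev_rel = {os.path.relpath(f, dev_root) for f in dev_files}
    main_rel = {os.path.relpath(f, main_root) for f in main_files}

    only_dev = sorted(dev_rel - main_rel)
    only_main = sorted(main_rel - dev_rel)
    if only_dev or only_main:
        raise IOError(
            f"Mismatched files. Only in dev: {only_dev}. Only in main: {only_main}."
        )
```

In the notebook it could be called as `check_matching_files(DEV_GLOB, MAIN_GLOB, DEV_PATH.rstrip("/*"), MAIN_PATH.rstrip("/*"))`, which strips the trailing `/**` from each path to obtain the root directory.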

## 1. Get the metrics for the development and `main` branches and their differences.


```python
df_metrics_dev = get_metrics(DEV_GLOB)
df_metrics_main = get_metrics(MAIN_GLOB)
df_metrics_diffs = get_rel_diffs(df_metrics_dev, df_metrics_main)
```
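
The implementations of `get_metrics` and `get_rel_diffs` are not included in this notebook. The following is a hypothetical sketch of what they might do, assuming each `.json` file maps a category (e.g., `"test"`, `"ref"`) to a dictionary of `{metric_name: value}` and that the file name identifies the variable. Both the JSON layout and the `*_sketch` function names are assumptions; the real helpers may differ.

```python
import json
import os

import pandas as pd


def get_metrics_sketch(filepaths):
    """Load metrics .json files into a DataFrame indexed by (var_key, metric).

    ASSUMED layout (hypothetical): {"test": {"mean": ...}, "ref": {...}, ...},
    with var_key taken from the file name without its extension.
    """
    frames = []
    for path in filepaths:
        with open(path) as f:
            data = json.load(f)
        # Outer keys become columns ("test", "ref", ...); inner keys become rows.
        df = pd.DataFrame(data)
        var_key = os.path.splitext(os.path.basename(path))[0]
        df.index = pd.MultiIndex.from_product(
            [[var_key], df.index], names=["var_key", "metric"]
        )
        frames.append(df)
    return pd.concat(frames).sort_index()


def get_rel_diffs_sketch(df_dev, df_main):
    """Relative differences |dev - main| / |main| as fractions (0.02 == 2%)."""
    # Align on index and columns; cells missing from either side become NaN.
    df_dev, df_main = df_dev.align(df_main, join="outer")
    # Division by a zero baseline yields inf (or NaN for 0 / 0).
    return (df_dev - df_main).abs() / df_main.abs()
```

Keeping the differences as fractions here matches the `>= 0.02` comparison used in step 2.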

## 2. Filter differences to those at or above the maximum threshold (2%).

All values below the maximum threshold are set to `NaN`.

- **If all cells in a row are NaN (< 2%)**, the entire row is dropped to make the results easier to parse.
- Any remaining NaN cells are below the 2% threshold and **should be ignored**.


```python
df_metrics_diffs_thres = df_metrics_diffs[df_metrics_diffs >= 0.02]
df_metrics_diffs_thres = df_metrics_diffs_thres.dropna(
    axis=0, how="all", ignore_index=False
)
```
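
To see what the masking and `dropna` steps do, here is a self-contained example on made-up relative differences (the variable names `var_a`, `var_b`, and `var_c` and their values are illustrative only).

```python
import numpy as np
import pandas as pd

# Made-up relative differences (fractions), indexed like the notebook's metrics.
index = pd.MultiIndex.from_tuples(
    [("var_a", "mean"), ("var_b", "rmse"), ("var_c", "mean")],
    names=["var_key", "metric"],
)
diffs = pd.DataFrame(
    {"test": [0.001, np.nan, 0.4], "ref": [0.0, 0.05, 0.01]},
    index=index,
)

# Boolean masking keeps values >= 2% and replaces everything else with NaN.
above = diffs[diffs >= 0.02]

# Rows where every value is NaN (all differences < 2%) are dropped.
above = above.dropna(axis=0, how="all")
print(above)
```

`var_a` is dropped because both of its differences are below 2%, while `var_c` is kept with a NaN in its `ref` column, which is the kind of cell the notes above say to ignore.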

## 3. Combine all DataFrames to get the final result.


```python
df_metrics_all = pd.concat(
    [df_metrics_dev.add_suffix("_dev"), df_metrics_main.add_suffix("_main")],
    axis=1,
    join="outer",
)
df_final = df_metrics_diffs_thres.join(df_metrics_all)
df_final = sort_columns(df_final)
df_final = update_diffs_to_pct(df_final)
```
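
`sort_columns` and `update_diffs_to_pct` are not defined in this notebook. Judging from the column order in the output table (`test_dev`, `test_main`, `test DIFF (%)`, ...), a hypothetical combined sketch might look like the following. The `sort_and_format_sketch` name, the `prefixes` parameter, and the exact formatting are assumptions.

```python
import pandas as pd


def sort_and_format_sketch(df, prefixes):
    """Order columns as <p>_dev, <p>_main, <p> and express diffs as percentages.

    Hypothetical stand-in for `sort_columns` + `update_diffs_to_pct`; the
    prefixes (e.g., "test", "ref") and output column names are assumptions.
    """
    ordered = []
    for p in prefixes:
        ordered += [f"{p}_dev", f"{p}_main", p]
    df = df[ordered].copy()

    for p in prefixes:
        # Fractions (0.3104) become formatted strings ("31.04%"); NaN stays NaN.
        df[p] = df[p].map(lambda x: f"{x:.2%}" if pd.notna(x) else x)

    return df.rename(columns={p: f"{p} DIFF (%)" for p in prefixes})
```

Formatting the differences as strings is convenient for display, but it means those columns can no longer be compared numerically, so this step belongs at the end of the pipeline.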

## 4. Review variables and metrics above the difference threshold.

- <span style="color:red">Red</span> cells are differences >= 2%
- `nan` cells are differences < 2% and **should be ignored**
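
`highlight_large_diffs` is not defined in this notebook either. A hypothetical way to produce red cells like those described above is a styling function that returns a same-shaped DataFrame of CSS strings, which pandas' `Styler.apply` accepts when called with `axis=None`. The `highlight_pct_cells` name and the `"DIFF (%)"` column-suffix convention are assumptions based on the output table.

```python
import pandas as pd


def highlight_pct_cells(df):
    """Return CSS strings: red background for non-NaN cells in DIFF (%) columns.

    Hypothetical helper; non-NaN cells in those columns are assumed to be
    differences that already passed the 2% threshold.
    """
    styles = pd.DataFrame("", index=df.index, columns=df.columns)
    for col in df.columns:
        if isinstance(col, str) and col.endswith("DIFF (%)"):
            styles.loc[df[col].notna(), col] = "background-color: red"
    return styles
```

In a notebook it could be applied with `df_metrics_sub.style.apply(highlight_pct_cells, axis=None)`; rendering a `Styler` requires the `jinja2` package.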


```python
remove_metrics = ["min", "max"]
df_metrics_sub = df_final.reset_index(names=["var_key", "metric"])
df_metrics_sub = df_metrics_sub[~df_metrics_sub.metric.isin(remove_metrics)]
get_num_metrics_above_diff_thres(df_metrics_all, df_metrics_sub)
```

```
* Related variables ['FSNTOA', 'LHFLX', 'LWCF', 'NET_FLUX_SRF', 'PRECT', 'PSL', 'RESTOM', 'TREFHT']
* Number of metrics above 2% max threshold: 11 / 96
```


```python
highlight_large_diffs(df_metrics_sub)
```

|  | var_key | metric | test_dev | test_main | test DIFF (%) | ref_dev | ref_main | ref DIFF (%) | test_regrid_dev | test_regrid_main | test_regrid DIFF (%) | ref_regrid_dev | ref_regrid_main | ref_regrid DIFF (%) | misc_dev | misc_main | misc DIFF (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | FSNTOA | mean | 239.859777 | 240.00186 |  | 241.439641 | 241.544384 |  | 239.859777 | 240.00186 |  | 241.439641 | 241.544384 |  |  |  |  |
| 8 | LHFLX | mean | 88.379609 | 88.47027 |  | 88.96955 | 88.976266 |  | 88.379609 | 88.47027 |  | 88.96955 | 88.976266 |  |  |  |  |
| 11 | LWCF | mean | 24.373224 | 24.370539 |  | 24.406697 | 24.391579 |  | 24.373224 | 24.370539 |  | 24.406697 | 24.391579 |  |  |  |  |
| 16 | NET_FLUX_SRF | mean | 0.394016 | 0.51633 | 31.04% | -0.068186 | 0.068584 | 200.58% | 0.394016 | 0.51633 | 31.04% | -0.068186 | 0.068584 | 200.58% |  |  |  |
| 19 | PRECT | mean | 3.053802 | 3.05676 |  | 3.074885 | 3.074978 |  | 3.053802 | 3.05676 |  | 3.074885 | 3.074978 |  |  |  |  |
| 21 | PSL | rmse |  |  |  |  |  |  |  |  |  |  |  |  | 1.042884 | 0.979981 | 6.03% |
| 23 | RESTOM | mean | 0.481549 | 0.65656 | 36.34% | 0.018041 | 0.162984 | 803.40% | 0.481549 | 0.65656 | 36.34% | 0.018041 | 0.162984 | 803.40% |  |  |  |
| 34 | TREFHT | mean | 14.769946 | 14.741707 |  | 13.842013 | 13.800258 |  | 14.769946 | 14.741707 |  | 13.842013 | 13.800258 |  |  |  |  |
| 35 | TREFHT | mean | 9.214224 | 9.114572 |  | 8.083349 | 7.957917 |  | 9.214224 | 9.114572 |  | 8.083349 | 7.957917 |  |  |  |  |
| 40 | TREFHT | rmse |  |  |  |  |  |  |  |  |  |  |  |  | 1.160718 | 1.179995 | 2.68% |


## `NET_FLUX_SRF` and `RESTOM` contain the highest differences and should be investigated further
