# CDAT Migration Regression Testing Notebook (`.json` metrics)

This notebook is used to perform regression testing between the development and
production versions of a diagnostic set.

## How it works

It computes the relative differences (%) between two sets of `.json` files stored in two
separate directories, one for the refactored code and the other for the `main` branch.

It displays metric values with relative differences >= 2%. Relative differences are used instead of absolute differences because:

- Relative differences are percentages, which show the scale of each difference.
- Absolute differences are raw numbers that do not account for the magnitude
  of the values being compared (e.g., 100.00 vs. 0.0001), which can be misleading.
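
As a minimal illustration of the two points above, the sketch below compares absolute and relative differences for two pairs of values at very different scales. The helpers `abs_diff` and `rel_diff` and the sample values are hypothetical; this is not the `get_rel_diffs` implementation imported later in this notebook, which may choose a different denominator.

```python
def abs_diff(value: float, reference: float) -> float:
    """Absolute difference between a value and a reference value."""
    return abs(value - reference)


def rel_diff(value: float, reference: float) -> float:
    """Relative difference with respect to `reference`, as a fraction.

    Assumes `reference` is non-zero.
    """
    return abs(value - reference) / abs(reference)


# Two hypothetical pairs with (nearly) the same absolute difference of 0.0001.
pairs = [(100.0001, 100.0), (0.0002, 0.0001)]

for value, reference in pairs:
    print(
        f"value={value}, reference={reference}: "
        f"abs={abs_diff(value, reference):.4f}, "
        f"rel={rel_diff(value, reference):.4%}"
    )
```

Both pairs have roughly the same absolute difference, but only the second pair exceeds a 2% relative threshold.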

## How to use

PREREQUISITE: The diagnostic set's metrics must be stored in `.json` files in two directories
(one for the dev branch and one for the `main` branch).

1. Make a copy of this notebook under `auxiliary_tools/cdat_regression_testing/<DIR_NAME>`.
2. Run `mamba create -n cdat_regression_test -y -c conda-forge "python<3.12" xarray netcdf4 dask pandas matplotlib-base ipykernel`
3. Run `mamba activate cdat_regression_test`
4. Update `DEV_PATH` and `MAIN_PATH` in the copy of your notebook.
5. Run all cells IN ORDER.
6. Review results for any outstanding differences (>= 2%).
   - Debug these differences (e.g., bug in metrics functions, incorrect variable references, etc.)


## Setup Code


In [1]:
from typing import List
import glob

import pandas as pd

from auxiliary_tools.cdat_regression_testing.utils import (
    get_rel_diffs,
    get_num_metrics_above_diff_thres,
    highlight_large_diffs,
    sort_columns,
    update_diffs_to_pct,
    PERCENTAGE_COLUMNS,
)


SET_NAME = "lat_lon"
SET_DIR = "759-slice-flag"

DEV_PATH = f"/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/{SET_DIR}/{SET_NAME}/**"
MAIN_PATH = f"/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/old-runs/671-lat-lon/main/{SET_NAME}/**"

DEV_GLOB = sorted(glob.glob(DEV_PATH + "/*.json"))
MAIN_GLOB = sorted(glob.glob(MAIN_PATH + "/*.json"))

if len(DEV_GLOB) == 0 or len(MAIN_GLOB) == 0:
    raise IOError("No files found at DEV_PATH and/or MAIN_PATH.")

if len(DEV_GLOB) != len(MAIN_GLOB):
    raise IOError("Number of files do not match at DEV_PATH and MAIN_PATH.")

OSError: Number of files do not match at DEV_PATH and MAIN_PATH.

In [4]:
dev_filenames = [filepaths.split("/")[-1] for filepaths in DEV_GLOB]
main_filenames = [filepaths.split("/")[-1] for filepaths in MAIN_GLOB]

In [5]:
set(dev_filenames) ^ set(main_filenames)

{'-ERFtot-ANN-global.json'}
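
The symmetric difference reports the mismatched filename but not which directory it came from. A small sketch, using hypothetical filename lists in place of `dev_filenames` and `main_filenames`, shows how one-sided set differences could answer that question:

```python
# Hypothetical filename lists standing in for `dev_filenames` and
# `main_filenames`; the second filename is made up for illustration.
dev_names = ["-ERFtot-ANN-global.json", "-SST-ANN-global.json"]
main_names = ["-SST-ANN-global.json"]

# Files present on one side only.
only_in_dev = sorted(set(dev_names) - set(main_names))
only_in_main = sorted(set(main_names) - set(dev_names))

print("Only in dev: ", only_in_dev)
print("Only in main:", only_in_main)
```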

In [6]:
def get_metrics(filepaths: List[str]) -> pd.DataFrame:
    """Get the metrics using a glob of `.json` metric files in a directory.

    Parameters
    ----------
    filepaths : List[str]
        The filepaths for metrics `.json` files.

    Returns
    -------
    pd.DataFrame
        The DataFrame containing the metrics for all of the variables in
        the results directory.
    """
    metrics = []

    for filepath in filepaths:
        df = pd.read_json(filepath)

        filename = filepath.split("/")[-1]
        var_key = filename.split("-")[1]
        region = filename.split("-")[-1].replace(".json", "")

        # Add the variable key and region to the MultiIndex and update the
        # index to make the DataFrame easier to parse.
        multiindex = pd.MultiIndex.from_product([[var_key], [region], [*df.index]])
        df = df.set_index(multiindex)

        metrics.append(df)

    df_final = pd.concat(metrics)

    # Reorder columns and drop "unit" column (string dtype breaks Pandas
    # arithmetic).
    df_final = df_final[["test", "ref", "test_regrid", "ref_regrid", "diff", "misc"]]

    return df_final
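
To make the filename parsing in `get_metrics` concrete, the sketch below applies the same `split` calls to the filename reported by the symmetric difference earlier. The helper name `parse_metrics_filename` and the directory are hypothetical, and the sketch assumes that the variable key is the second hyphen-separated part of the filename and the region is the last part.

```python
from typing import Tuple


def parse_metrics_filename(filepath: str) -> Tuple[str, str]:
    """Return (var_key, region) parsed from a metrics `.json` filepath."""
    filename = filepath.split("/")[-1]
    parts = filename.split("-")

    # Assumption: parts[1] is the variable key and the last part is the region.
    var_key = parts[1]
    region = parts[-1].replace(".json", "")

    return var_key, region


# Hypothetical directory; the filename is the one shown in the earlier output.
print(parse_metrics_filename("/some/dir/-ERFtot-ANN-global.json"))
```

A variable key or region that itself contains a hyphen would break this pattern, which is worth checking if unexpected index values appear.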

## 1. Get the metrics for the development and `main` branches and their differences.


In [7]:
df_metrics_dev = get_metrics(DEV_GLOB).drop("ERFtot", level=0)
df_metrics_main = get_metrics(MAIN_GLOB)
df_metrics_diffs = get_rel_diffs(df_metrics_dev, df_metrics_main)

## 2. Filter differences to those above maximum threshold (2%).

All values below the threshold are set to `NaN`. At this stage the differences are stored as fractions (0.02 = 2%).

- **If all cells in a row are NaN (< 2%)**, the entire row is dropped to make the results easier to parse.
- Any remaining NaN cells are differences below 2% and **should be ignored**.
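
The filtering described above can be sketched on a toy DataFrame. The values and row labels below are made up for illustration; only the boolean-mask and `dropna` pattern mirrors the cell that follows.

```python
import pandas as pd

# Hypothetical relative differences stored as fractions (0.02 == 2%).
toy_diffs = pd.DataFrame(
    {
        "test DIFF (%)": [0.001, 0.030, 0.000],
        "ref DIFF (%)": [0.010, 0.010, 0.500],
    },
    index=["metric_a", "metric_b", "metric_c"],
)

# Boolean-mask indexing keeps values >= 0.02 and replaces the rest with NaN.
toy_thres = toy_diffs[toy_diffs >= 0.02]

# Rows where every cell is NaN are dropped; partially-NaN rows are kept.
toy_thres = toy_thres.dropna(axis=0, how="all")

print(toy_thres)
```

In this toy case `metric_a` is dropped because both of its values are below the threshold, while `metric_b` and `metric_c` are kept with one `NaN` cell each.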


In [8]:
df_metrics_diffs_thres = df_metrics_diffs[df_metrics_diffs >= 0.02]
df_metrics_diffs_thres = df_metrics_diffs_thres.dropna(
    axis=0, how="all", ignore_index=False
)

In [9]:
df_metrics_diffs_thres

| | | | test DIFF (%) | ref DIFF (%) | test_regrid DIFF (%) | ref_regrid DIFF (%) | diff DIFF (%) | misc DIFF (%) |
|---|---|---|---|---|---|---|---|---|
| SST | global | min | | | | 0.513107 | 0.203545 | |
| TREFHT | land | mean | | | | 0.115921 | | |
| TREFHT | land | std | | | | 0.028019 | | |


## 3. Combine all DataFrames to get the final result.


In [10]:
df_metrics_all = pd.concat(
    [df_metrics_dev.add_suffix("_dev"), df_metrics_main.add_suffix("_main")],
    axis=1,
    join="outer",
)

df_final = df_metrics_all.join(df_metrics_diffs_thres, how="inner")
df_final = sort_columns(df_final)
df_final = update_diffs_to_pct(df_final)
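
A toy version of the combination step may help when reading the output below. The frames and values here are hypothetical; the sketch only shows how `add_suffix`, `pd.concat`, and an inner `join` interact, and it leaves out the `sort_columns` and `update_diffs_to_pct` helpers from `utils`.

```python
import pandas as pd

idx = pd.Index(["metric_a", "metric_b"], name="metric")
toy_dev = pd.DataFrame({"test": [1.0, 2.0]}, index=idx)
toy_main = pd.DataFrame({"test": [1.0, 2.5]}, index=idx)

# Side-by-side values with "_dev" / "_main" suffixes on the column names.
toy_all = pd.concat(
    [toy_dev.add_suffix("_dev"), toy_main.add_suffix("_main")],
    axis=1,
    join="outer",
)

# Hypothetical thresholded diffs: only metric_b survived the filter.
toy_diffs_thres = pd.DataFrame(
    {"test DIFF (%)": [0.2]},
    index=pd.Index(["metric_b"], name="metric"),
)

# The inner join keeps only rows whose index appears in both frames.
toy_final = toy_all.join(toy_diffs_thres, how="inner")

print(toy_final)
```

Because the join is `inner`, rows that were dropped by the threshold filter do not appear in the final result.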

In [11]:
df_final

| | | | test_dev | test_main | test DIFF (%) | ref_dev | ref_main | ref DIFF (%) | test_regrid_dev | test_regrid_main | test_regrid DIFF (%) | ref_regrid_dev | ref_regrid_main | ref_regrid DIFF (%) | misc_dev | misc_main | misc DIFF (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SST | global | min | -1.788055 | -1.788055 | | -1.676941 | -1.676941 | | -1.788055 | -1.788055 | | -1.108276 | -1.676941 | 51.31% | | | |
| TREFHT | land | mean | 9.114572 | 9.114572 | | 7.957917 | 7.957917 | | 9.114572 | 9.114572 | | 7.131257 | 7.957917 | 11.59% | | | |
| TREFHT | land | std | | | | | | | 17.947743 | 17.947743 | | 18.718675 | 18.194196 | 2.80% | | | |


## 4. Review variables and metrics above difference threshold.

- <span style="color:red">Red</span> cells are differences >= 2%
- `nan` cells are differences < 2% and **should be ignored**


In [12]:
df_final_adj = df_final.reset_index(names=["var_key", "region", "metric"])
get_num_metrics_above_diff_thres(df_metrics_all, df_final_adj)

* Related variables ['SST', 'TREFHT']
* Number of metrics above 2% max threshold: 3 / 96


In [13]:
df_final_adj

| | var_key | region | metric | test_dev | test_main | test DIFF (%) | ref_dev | ref_main | ref DIFF (%) | test_regrid_dev | test_regrid_main | test_regrid DIFF (%) | ref_regrid_dev | ref_regrid_main | ref_regrid DIFF (%) | misc_dev | misc_main | misc DIFF (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SST | global | min | -1.788055 | -1.788055 | | -1.676941 | -1.676941 | | -1.788055 | -1.788055 | | -1.108276 | -1.676941 | 51.31% | | | |
| 1 | TREFHT | land | mean | 9.114572 | 9.114572 | | 7.957917 | 7.957917 | | 9.114572 | 9.114572 | | 7.131257 | 7.957917 | 11.59% | | | |
| 2 | TREFHT | land | std | | | | | | | 17.947743 | 17.947743 | | 18.718675 | 18.194196 | 2.80% | | | |


In [14]:
highlight_large_diffs(df_final_adj)

| | var_key | region | metric | test_dev | test_main | test DIFF (%) | ref_dev | ref_main | ref DIFF (%) | test_regrid_dev | test_regrid_main | test_regrid DIFF (%) | ref_regrid_dev | ref_regrid_main | ref_regrid DIFF (%) | misc_dev | misc_main | misc DIFF (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SST | global | min | -1.788055 | -1.788055 | | -1.676941 | -1.676941 | | -1.788055 | -1.788055 | | -1.108276 | -1.676941 | 51.31% | | | |
| 1 | TREFHT | land | mean | 9.114572 | 9.114572 | | 7.957917 | 7.957917 | | 9.114572 | 9.114572 | | 7.131257 | 7.957917 | 11.59% | | | |
| 2 | TREFHT | land | std | | | | | | | 17.947743 | 17.947743 | | 18.718675 | 18.194196 | 2.80% | | | |


## Results

- The only diffs above the threshold are in the regridded reference data (`ref_regrid`): `SST` global `min` (51.31%), `TREFHT` land `mean` (11.59%), and `TREFHT` land `std` (2.80%).
