# Sample ldcpy Notebook

In [1]:
# Add the ldcpy root to the system path
import sys
sys.path.insert(0,'../../../')

# Import the ldcpy package
import ldcpy
# Automatically reload the package before each cell is executed, so changes to the ldcpy source are reflected in the notebook as long as the sys.path.insert(...) line above is left uncommented.
%load_ext autoreload
%autoreload 2
import ldcpy.plot as lp

%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

## Loading Datasets and Viewing Metadata

We use three different datasets in these examples, one containing TS data (ds), one containing PRECT data (ds2), and a 3d dataset containing T data (ds3). These datasets are ensembles of variable data in several different netCDF files, which are given ensemble names in the second parameter to the ldcpy.open_datasets function. These ensemble names can be whatever you want, but they should be informative because the names will be used to select the appropriate dataset later and as part of plot titles.

In [3]:
# ds contains TS data
ds = ldcpy.open_datasets(['../../../data/cam-fv/orig.TS.100days.nc', '../../../data/cam-fv/zfp1.0.TS.100days.nc', '../../../data/cam-fv/zfp1e-1.TS.100days.nc'],
                         ['orig', 'zfpA1.0', 'zfpA1e-1'])
# ds2 contains PRECT data (the labels 'zfpA1.0' and 'zfpA1e-1' are reused here for the zfp1e-7 and zfp1e-11 files)
ds2 = ldcpy.open_datasets(['../../../data/cam-fv/orig.PRECT.100days.nc', '../../../data/cam-fv/zfp1e-7.PRECT.100days.nc', '../../../data/cam-fv/zfp1e-11.PRECT.100days.nc'],
                          ['orig', 'zfpA1.0', 'zfpA1e-1'])
# ds3 contains 3d T data
ds3 = ldcpy.open_datasets(['../../../data/cam-fv/cam-fv.T.6months.nc'], ['orig'])

The print_stats function can be used to gather overall statistics on two (original and reconstructed) datasets and the error between the datasets:

In [4]:
ldcpy.print_stats(ds, 'TS', 'orig', 'zfpA1.0')

Comparing orig data to zfpA1.0 data
{
    "mean_observed": 274.7137027669836,
    "mean_squared_error": 0.005666020999224538,
    "variance_observed": 533.680828155477,
    "min_error": -0.361724853515625,
    "standard_deviation_observed": 23.10856581306422,
    "max_error": 0.4058837890625,
    "ks_p_value": [
        0.9999947706571545,
        0.0
    ],
    "pearson_correlation_coefficient": 0.9999947706571547,
    "mean_absolute_error": 0.05852021166571864,
    "standard_deviation_modelled": 23.10856581306422,
    "root_mean_squared_error": 0.07527297655350516,
    "covariance": 533.8405046664303,
    "variance_modelled": 533.680828155477,
    "mean_modelled": 274.707935474537,
    "mean_error": 0.00576729244656033
}
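
To make the relationship between these fields concrete, here is a minimal pure-Python sketch of how such error statistics can be computed for two equal-length sequences. It is not ldcpy's implementation: the function name `error_stats`, the sample values, and the choice of population (divide-by-n) moments are assumptions made for illustration. The sign convention (error = observed minus modelled) and the identity RMSE = sqrt(MSE) do agree with the output above.

```python
import math

def error_stats(observed, modelled):
    """Summary error statistics for two equal-length sequences (illustrative only)."""
    n = len(observed)
    # Error is taken as observed minus modelled, matching the sign of mean_error above.
    errors = [o - m for o, m in zip(observed, modelled)]
    mean_o = sum(observed) / n
    mean_m = sum(modelled) / n
    # Population (divide-by-n) moments; a library may normalise differently.
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_m = sum((m - mean_m) ** 2 for m in modelled) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modelled)) / n
    mse = sum(e * e for e in errors) / n
    return {
        "mean_error": sum(errors) / n,
        "min_error": min(errors),
        "max_error": max(errors),
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "pearson_correlation_coefficient": cov / math.sqrt(var_o * var_m),
    }

# Hypothetical sample values, not taken from the TS dataset.
stats = error_stats([280.0, 275.5, 290.25, 260.0], [280.1, 275.4, 290.0, 260.3])
for name, value in stats.items():
    print(name, value)
```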


Printing a dataset reveals the dimension names, sizes, datatypes and initial values, among other metadata:

In [9]:
ds

We can use the ldcpy.metrics.DatasetMetrics class to retrieve the metric data directly if we do not want to plot it using the ldcpy.plot.plot function:

In [8]:
ds_metrics = ldcpy.metrics.DatasetMetrics(ds['TS'].sel(ensemble='orig'), ['time'])
ds_metrics.get_metric("mean")

## Spatial Plots

The most basic usage of the ldcpy.plot.plot() function requires a dataset, the variable of interest, the ensemble name of the original data, and the metric of interest. By default, a spatial plot of this data is created. The title of the plot contains the ensemble name of the data, the variable being plotted, the metric, and the "metric_type", which here indicates that these are the "raw", unaltered metric values. The following plot shows the mean TS (temperature) value at each point in the original dataset:

In [None]:
lp.plot(ds, "TS", 'orig', 'mean')

If we want a side-by-side comparison of two datasets, we need to specify an additional dataset using the ens_r argument, and a non-default plot_type. This plot shows the mean TS value at each point in both the original and compressed (with zfp, tolerance 1.0) datasets:

In [None]:
lp.plot(ds, "TS", 'orig', 'mean', ens_r='zfpA1.0', plot_type="spatial_comparison")

It is possible to compare two compressed datasets side by side as well, by simply using a different first ensemble name:

In [None]:
lp.plot(ds, "TS", 'zfpA1e-1', 'mean', ens_r='zfpA1.0', plot_type="spatial_comparison")

We can also plot different metrics, such as the standard deviation at each point, or change the color scheme (for a full list of metrics and color schemes, see the documentation):

In [None]:
lp.plot(ds, "TS", 'orig', 'std', color="cmo.thermal")

Some metrics result in values that are +/- infinity, or NaN (likely resulting from operations like 0/0 or inf/inf). NaN values are plotted in gray, infinity is plotted in white, and negative infinity is plotted in black (regardless of color scheme). If infinite values are present in the plot data, arrows on either side of the colorbar are shown to indicate the color for +/- infinity. This plot shows the log of the ratio of the odds of positive rainfall in the compressed and original output, log(odds_positive compressed/odds_positive original). This statistic showcases some interesting plot features:

In [None]:
lp.plot(ds2, 'PRECT', 'orig', ens_r='zfpA1.0', metric='odds_positive', metric_type="ratio", transform="log")
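
The following sketch shows how infinite and NaN values can arise from this statistic at a single grid point. It assumes that the odds of a positive value are P(x > 0) / (1 - P(x > 0)); the function names `odds_positive` and `log_odds_ratio` are hypothetical and this is not ldcpy's code. Python raises an exception on float division by zero, so the edge cases are spelled out explicitly.

```python
import math

def odds_positive(values):
    """Odds that a value is positive: P(x > 0) / (1 - P(x > 0)). Assumed definition."""
    p = sum(1 for v in values if v > 0) / len(values)
    if p == 1.0:
        return math.inf
    return p / (1.0 - p)

def log_odds_ratio(compressed, original):
    """log(odds_positive(compressed) / odds_positive(original)) with explicit edge cases."""
    num, den = odds_positive(compressed), odds_positive(original)
    if (num == 0 and den == 0) or (math.isinf(num) and math.isinf(den)):
        return math.nan    # 0/0 or inf/inf
    if den == 0 or math.isinf(num):
        return math.inf    # positive/0 or inf/finite
    if num == 0 or math.isinf(den):
        return -math.inf   # log(0)
    return math.log(num / den)

# Hypothetical rainfall series at one grid point (0 means a dry day).
print(log_odds_ratio([0, 1, 1, 2], [0, 0, 1, 2]))  # finite value
print(log_odds_ratio([0, 0], [0, 0]))              # never rains in either series
print(log_odds_ratio([1, 1], [0, 1]))              # always rains after compression
print(log_odds_ratio([0, 0], [0, 1]))              # never rains after compression
```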

If all values are NaN, the colorbar is not shown but instead a legend is shown indicating the gray color of NaN values, and the whole plot is colored gray. (If all values are infinite, the plot is displayed normally with all values either black or white). Because the example dataset contains 100 days of data, the deseasonalized lag-1 values and their variances are all 0, and so calculating the correlation of the lag-1 values will involve computing 0/0 = NaN:

In [None]:
lp.plot(ds2, "PRECT", "orig", metric="corr_lag1")
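
As a rough illustration of the 0/0 case, the sketch below computes a plain lag-1 Pearson correlation and returns NaN when the variance is zero. It omits the deseasonalizing step and is not ldcpy's implementation; the function name `lag1_correlation` is an assumption.

```python
import math

def lag1_correlation(series):
    """Pearson correlation between series[:-1] and series[1:]; NaN if the variance is zero."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # Python raises on float division by zero, so the 0/0 case is mapped to NaN explicitly.
    return cov / denom if denom > 0 else math.nan

all_zero = lag1_correlation([0.0] * 10)             # zero variance, so 0/0
increasing = lag1_correlation([1.0, 2.0, 3.0, 4.0, 5.0])
print(all_zero, increasing)
```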

Additionally, there are a number of ways to compare datasets besides a side-by-side comparison. By specifying the metric_type, we can plot the diff or the ratio between the metrics in two datasets. This plot shows the difference between the original and zfp 1.0 standard deviation values:

In [None]:
lp.plot(ds, 'TS', 'orig', ens_r='zfpA1.0', metric="std", metric_type="diff")

Sometimes, we may want to compute a metric on the difference between the datasets. For instance, the zscore metric calculates the zscore at each point under the null hypothesis that the true mean is zero, so using the "metric_of_diff" metric_type calculates the zscore of the diff between two datasets (to find the values that are significantly different between the two datasets). The zscore metric in particular gives additional information about the percentage of significant gridpoints in the plot title:

In [None]:
lp.plot(ds, 'TS', 'orig', 'zscore', ens_r="zfpA1.0", metric_type="metric_of_diff")
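
The distinction between the "diff", "ratio", and "metric_of_diff" metric types can be sketched at a single grid point with the standard library. The sample values are hypothetical, the direction of the subtraction is an assumption, and the z-score formula (sample mean divided by its standard error) is one common definition rather than a confirmed description of ldcpy's calculation.

```python
import math
import statistics

# Hypothetical time series at one grid point.
orig = [280.0, 281.0, 279.5, 282.0, 280.5]
comp = [280.1, 280.9, 279.7, 281.8, 280.6]

# "diff": compute the metric on each dataset, then subtract (direction assumed).
diff_of_std = statistics.pstdev(orig) - statistics.pstdev(comp)

# "ratio": compute the metric on each dataset, then divide (compressed over original).
ratio_of_std = statistics.pstdev(comp) / statistics.pstdev(orig)

# "metric_of_diff": subtract the datasets first, then compute the metric on the result.
pointwise_diff = [o - c for o, c in zip(orig, comp)]
std_of_diff = statistics.pstdev(pointwise_diff)

def zscore(values):
    """z-score of the sample mean under the null hypothesis that the true mean is zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

zscore_of_diff = zscore(pointwise_diff)
print(diff_of_std, ratio_of_std, std_of_diff, zscore_of_diff)
```

Note that the difference of the standard deviations and the standard deviation of the differences are different quantities, which is why the two metric types produce different plots.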

Plotting the metric of a subset of the data is possible using the subset keyword:

In [None]:
lp.plot(ds, "TS", 'orig', 'mean', subset="winter")

Finally, a single level of a 3d dataset can be plotted using the lev keyword, which selects the nearest level to the lev value specified (by default, lev=0):

In [None]:
lp.plot(ds3, "T", 'orig', 'mean', lev=300)

## Time-Series Plots

We can also create a time-series plot of a metric by changing the plot_type, which calculates the mean metric value across space rather than across time:

In [None]:
lp.plot(ds, "TS", 'orig', 'std', plot_type="time_series")
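
The difference between reducing across time and reducing across space can be illustrated on a tiny nested list. This is only a sketch with made-up values and an unweighted average; ldcpy may weight grid cells differently.

```python
# data[t][i][j]: value at time step t, latitude index i, longitude index j
data = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 3.0], [4.0, 5.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
n_time, n_lat, n_lon = len(data), len(data[0]), len(data[0][0])

# Spatial plot: reduce over time, leaving one value per grid point.
mean_over_time = [
    [sum(data[t][i][j] for t in range(n_time)) / n_time for j in range(n_lon)]
    for i in range(n_lat)
]

# Time-series plot: reduce over space, leaving one value per time step.
mean_over_space = [
    sum(data[t][i][j] for i in range(n_lat) for j in range(n_lon)) / (n_lat * n_lon)
    for t in range(n_time)
]

print(mean_over_time)
print(mean_over_space)
```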

To group the data by time, use the group_by keyword. This plot shows the mean standard deviation over all latitude and longitude points for each month:

In [None]:
lp.plot(ds, "TS", 'orig', 'std', plot_type="time_series", group_by="time.month")
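
Conceptually, grouping by "time.month" buckets each time step by its calendar month and then aggregates within each bucket. A minimal standard-library sketch follows, assuming a hypothetical start date and made-up daily values:

```python
from collections import defaultdict
from datetime import date, timedelta

start = date(2006, 1, 1)  # hypothetical start date, not read from the dataset
daily = [(start + timedelta(days=k), float(k)) for k in range(100)]

# Bucket the daily values by calendar month.
by_month = defaultdict(list)
for day, value in daily:
    by_month[day.month].append(value)

# Aggregate within each bucket.
monthly_mean = {month: sum(vals) / len(vals) for month, vals in sorted(by_month.items())}
print(monthly_mean)
```

With 100 days starting on January 1, the last group (April) contains only 10 days, so groups can be unequal in size.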

We can view a histogram of the time-series data by changing the plot_type to histogram:

In [None]:
lp.plot(ds, "TS", 'orig', 'std', plot_type="histogram")

A second dataset can be specified using the ens_r keyword, just as in the spatial plots. The metric_type keyword can also be used in the same way. This plot shows the mean differences between the compressed and original standard deviation values:

In [None]:
lp.plot(ds, "TS", 'orig', 'std', plot_type="time_series", ens_r="zfpA1.0", metric_type="diff")

Subsetting is also possible on time-series data:

In [None]:
lp.plot(ds, "TS", 'orig', 'std', plot_type="time_series", ens_r="zfpA1.0", metric_type="diff", subset="first50")

Additionally, we can specify lat and lon keywords for time-series plots that give us a subset of the data at a single point, rather than averaging over all latitudes and longitudes. The nearest latitude and longitude point to the one specified is plotted (and the actual coordinates of the point can be found in the plot title). This plot, for example, shows the difference in mean rainfall between the compressed and original data for the first 50 days of data at the grid point nearest the requested coordinates (lat=44.56, lon=-123.26), which is (44.76, -123.75):

In [None]:
lp.plot(ds2, "PRECT", 'zfpA1.0', ens_r='orig', metric_type="diff", metric="mean", plot_type="time_series", subset="first50", lat=44.56, lon=-123.26)
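
Nearest-point selection of this kind can be sketched as follows. The grid is an assumption (192 evenly spaced latitudes from -90 to 90 and 288 longitudes from 0 to 358.75 in 1.25-degree steps), as is the helper name `nearest_index`; this is not ldcpy's own lookup code.

```python
def nearest_index(grid, target):
    """Index of the grid value closest to target."""
    return min(range(len(grid)), key=lambda k: abs(grid[k] - target))

# Assumed grid, for illustration only.
lats = [-90.0 + k * 180.0 / 191 for k in range(192)]
lons = [k * 1.25 for k in range(288)]

lat_req, lon_req = 44.56, -123.26
i = nearest_index(lats, lat_req)
# Map the requested longitude from -180..180 onto 0..360.
# Wrap-around near 360 degrees is not handled in this sketch.
j = nearest_index(lons, lon_req % 360)

lat_sel = lats[i]
lon_sel = lons[j] if lons[j] <= 180 else lons[j] - 360  # back to -180..180
print(round(lat_sel, 2), lon_sel)
```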
