# Analysis of the Gaussian committor

In this notebook you will see how to train the Gaussian committor and analyze the results to generate the figures in the paper.

## Compute the Gaussian committor

To compute the Gaussian committor, we will use the [Climate-Learning](https://github.com/georgemilosh/Climate-Learning) repository, and in particular the code in `Climate-Learning/PLASIM/gaussian_approx.py`.

Remember from the general documentation that you should clone the repository at the same level as `gaus-approx`. See the general `README.md` for more info.

If you are curious about the implementation, you can look at the code mentioned and in particular at the `GaussianCommittor` object, which can be easily used outside the Climate-Learning framework.
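
For orientation, the idea behind a Gaussian committor can be sketched in a few lines of NumPy. This is not the `GaussianCommittor` class from Climate-Learning, whose interface is not reproduced here. It is a minimal illustration that assumes the predictor `X` and the target `A` are jointly Gaussian and uses a ridge term `eps` as regularization. The names `fit_gaussian_committor` and `committor`, as well as the synthetic data, are hypothetical.

```python
import math
import numpy as np

def fit_gaussian_committor(X, A, eps=0.0):
    """Fit a linear projection M and a residual std sigma (hypothetical helper).

    X: (n_samples, n_features) predictors, A: (n_samples,) target.
    eps: ridge coefficient added to the diagonal of the covariance of X.
    """
    mu_X, mu_A = X.mean(axis=0), A.mean()
    Xc, Ac = X - mu_X, A - mu_A
    n = X.shape[0]
    S_XX = Xc.T @ Xc / n  # covariance of the predictors
    S_XA = Xc.T @ Ac / n  # covariance between predictors and target
    M = np.linalg.solve(S_XX + eps * np.eye(X.shape[1]), S_XA)
    sigma = np.std(Ac - Xc @ M)  # spread of A around the linear prediction
    return mu_X, mu_A, M, sigma

def committor(x, a, mu_X, mu_A, M, sigma):
    """P(A > a | X = x) under the Gaussian assumption."""
    f = mu_A + (np.asarray(x) - mu_X) @ M
    return 0.5 * math.erfc((a - f) / (math.sqrt(2) * sigma))

# Synthetic data: A depends linearly on the first two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
A = X @ np.array([1.0, -0.5, 0.0]) + 0.5 * rng.normal(size=5000)

mu_X, mu_A, M, sigma = fit_gaussian_committor(X, A)
a = np.quantile(A, 0.95)  # threshold defining the rare event
q_high = committor([2.0, -1.0, 0.0], a, mu_X, mu_A, M, sigma)
q_low = committor([-2.0, 1.0, 0.0], a, mu_X, mu_A, M, sigma)
```

In this toy setting the committor is large when `x` points along the fitted pattern `M` and close to zero when it points the opposite way.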

---

### Create working directory

To proceed with the computations, first open a terminal in `Climate-Learning/PLASIM` and run

```bash
python gaussian_approx.py ../../gaus-approx/ERA5/committor/ga/
```

which will create the directory `../../gaus-approx/ERA5/committor/ga/`, containing all the tools needed to train the Gaussian committor. Then `cd` into it.

The choice of the path `../../gaus-approx/ERA5/committor/ga/` is arbitrary, but if you use this one, you'll be able to run the rest of this notebook without any changes.

### Setup config file

Once you are in `ga/` (relative to the path of this notebook), run

```bash
python import_config.py ../config_T14_tau0_epsilon1.json
```

which will set up the training to be performed exactly as it was for the paper.

### Run

The `Climate-Learning` framework is optimized to do multiple trainings in series, minimizing data reloads. To produce the data needed for the results in the paper, launch the following from `ga/`:

```bash
python gaussian_approx.py T="[1,7,14]" tau="[0,-1,-2,-3,-4,-5,-6,-7,-10,-15,-20,-25,-30]" reg_c="[0,1e-2,1e-1,1,1e1,1e2,1e3,1e4,1e5,1e6,1e7]"
```

Notice that the code uses the convention $\tau < 0$ for a forecast in the future. `reg_c` is the regularization coefficient $\epsilon$.
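
To get a feeling for the size of this job, the sketch below counts the trainings implied by the command above. It assumes that the framework performs one training for every combination of the listed values, which is an assumption about its behavior rather than something stated here. The variable names are illustrative.

```python
from itertools import product

# Values copied from the command above
T_values = [1, 7, 14]
tau_values = [0, -1, -2, -3, -4, -5, -6, -7, -10, -15, -20, -25, -30]
reg_c_values = [0, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7]

# Assumption: one training per combination of the listed values
combos = list(product(T_values, tau_values, reg_c_values))
n_runs = len(combos)

# Lead times with the sign flipped, so that the future is positive
lead_times = [-tau for tau in tau_values]

print(f"{n_runs} trainings, lead times from {min(lead_times)} to {max(lead_times)}")
```

Shortening any of the three lists reduces the count multiplicatively, so trimming `reg_c` is the cheapest way to speed things up without losing lead times.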

Be aware that, depending on your machine, these calculations may take a while. You can of course change the parameter combinations to speed up the process, for example by reducing the number of regularization coefficients.


#### Prediction using the composite map

In figure 9, we show the prediction using the composite map as a projection pattern. To compute this in practice we exploit the fact that when we do $L_2$ regularization,

$$ \lim_{\epsilon \to \infty} M_\epsilon \propto \Sigma_{XA} \propto C$$

So, launch (in the same directory `ga`)

```bash
python gaussian_approx.py T=14 tau="[0,-1,-2,-3,-4,-5,-6,-7,-10,-15,-20,-25,-30]" reg_c=1e8 regularization="identity"
```
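
The limit above can be checked numerically. The sketch below uses synthetic data and a plain ridge term $\epsilon I$, which is an assumption meant to mirror `regularization="identity"`. It is not the code path used by `gaussian_approx.py`, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated synthetic features
A = X @ rng.normal(size=d) + rng.normal(size=n)        # synthetic target

Xc, Ac = X - X.mean(axis=0), A - A.mean()
S_XX = Xc.T @ Xc / n  # covariance of the predictors
S_XA = Xc.T @ Ac / n  # covariance between predictors and target

def unit(v):
    return v / np.linalg.norm(v)

def cosine_to_covariance(eps):
    """Cosine of the angle between the ridge solution M_eps and S_XA."""
    M = np.linalg.solve(S_XX + eps * np.eye(d), S_XA)
    return float(unit(M) @ unit(S_XA))

cos_small = cosine_to_covariance(1e-2)  # weak regularization
cos_large = cosine_to_covariance(1e8)   # strong regularization

print(cos_small, cos_large)
```

With strong regularization the pattern becomes parallel to $\Sigma_{XA}$, which is why a single run with a very large `reg_c` is enough to obtain the composite-map prediction.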


## Train convolutional neural network

### Create working directory

To train the convolutional networks we'll proceed in a very similar way. Go in `Climate-Learning/PLASIM` and run

```bash
python Learn2_new.py ../../gaus-approx/ERA5/committor/cnn/
```

### Setup config file

Move to `cnn/` and set up the config file

```bash
python import_config.py ../config_T14_tau0_epsilon1.json
```

After this, open `cnn/config.json` and check that `load_from` is set to `null`. The default value of `'last'` would do transfer learning from the last compatible training. If you want, you can try this option as well, but we observed that it didn't make much of a difference.
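
If you prefer to do this check programmatically, a small recursive search is enough. The sketch below makes no assumption about where `load_from` sits inside the file. The `find_key` helper and the nesting of `example_config` are hypothetical stand-ins, not the real layout of `cnn/config.json`.

```python
import json

def find_key(node, key):
    """Yield every value stored under `key` in a nested dict/list structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)

# Stand-in for the contents of cnn/config.json; the nesting is hypothetical
example_config = json.loads('{"outer": {"inner": {"load_from": null}}}')

values = list(find_key(example_config, "load_from"))
all_null = all(v is None for v in values)
print(values, all_null)
```

To apply it to the real file, replace `example_config` with the result of `json.load` on an open handle to `cnn/config.json`.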

### Run

From `cnn/`, launch

```bash
python Learn2_new.py T=14 tau="[0,-1,-2,-3,-4,-5,-6,-7,-10,-15,-20,-25,-30]"
```

## Collect the results

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib widget
matplotlib.rc('font', size=18)
default_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']


import pandas as pd
import xarray as xr
from scipy import sparse

from tqdm.notebook import tqdm

import ast  # needed by get_years below
import sys
sys.path.append('../../../Climate-Learning/')

import general_purpose.utilities as ut
import general_purpose.cartopy_plots as cplt
import general_purpose.uplotlib as uplt
# import general_purpose.tables as tbl

# log to stdout
import logging
logging.getLogger().setLevel(logging.INFO)
logging.getLogger().handlers = [logging.StreamHandler(sys.stdout)]
ut.indentation_sep = '  '

HOME = '../../'

In [None]:
def l2(x, **kwargs):
    return np.sqrt(np.sum(x**2, **kwargs))

def get_score(run):
    return uplt.unc.ufloat(run['scores']['mean'], run['scores']['std'])

def get_arg(run, key, config_dict):
    return run['args'].get(key, ut.extract_nested(config_dict, key))

def get_years(run, config_dict):
    try:
        year_list = run['args']['year_list']
    except KeyError:
        year_list = ut.extract_nested(config_dict, 'year_list')
        
    if year_list is None:
        return ut.extract_nested(config_dict, 'dataset_years')
    
    year_list = f"({year_list.split('(',1)[1].split(')',1)[0]})" # get just the arguments
    year_list = ast.literal_eval(year_list) # now year_list is int or tuple

    if isinstance(year_list,int):
        return year_list
    elif isinstance(year_list, tuple):
        return np.arange(*year_list).shape[0]
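
The string handling in `get_years` is easier to follow on a concrete input. The sketch below isolates that step in a hypothetical `count_years` function and assumes that `year_list` is stored as a string such as `'range(80)'`. The example strings are illustrative and are not taken from an actual `runs.json`.

```python
import ast

def count_years(year_list):
    """Number of years described by a string like 'range(80)' (format assumed)."""
    # Keep only what is between the first '(' and the first ')'
    args = f"({year_list.split('(', 1)[1].split(')', 1)[0]})"
    parsed = ast.literal_eval(args)  # int for one argument, tuple for several
    if isinstance(parsed, int):
        return parsed
    return len(range(*parsed))

n_single = count_years("range(80)")        # one argument
n_bounds = count_years("range(10, 50)")    # start and stop
n_step = count_years("range(0, 80, 2)")    # start, stop and step
print(n_single, n_bounds, n_step)
```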

In [None]:
lon = np.load('../../lon.npy')
lat = np.load('../../lat.npy')

### Compute skill of CNN

In [None]:
import PLASIM.Learn2_new as ln

In [None]:
folder_ERA_CNN = 'cnn/'
runs_ERA_CNN = ut.json2dict(f'{folder_ERA_CNN}/runs.json')
runs_ERA_CNN = {k:v for k,v in runs_ERA_CNN.items() if v['status'] == 'COMPLETED'}
config_dict_ERA_CNN = ut.json2dict(f'{folder_ERA_CNN}/config.json')

In [None]:
var='tau'
groups_ERA_CNN = ln.make_groups(runs_ERA_CNN, variable=var, config_dict_flat=ut.collapse_dict(config_dict_ERA_CNN), sort=True)
for g in groups_ERA_CNN:
    print(g['args'], g[var])

In [None]:
g_CNN = groups_ERA_CNN[0]

In [None]:
df = []
item = {}

for run in g_CNN['runs']:
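    # get_arg needs the config dict to fall back on default values
    nfolds = get_arg(run, 'nfolds', config_dict_ERA_CNN)
    item['tau'] = -get_arg(run, 'tau', config_dict_ERA_CNN)  # flip sign: positive lead time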
    # print(item['tau'])
    for fold in range(nfolds):
        item['fold'] = fold
        item['skill'] = 1 - run['scores'][f'fold_{fold}']/ut.entropy(0.05)
        df.append(item.copy())
df = pd.DataFrame(df)
df.sort_values(['tau', 'fold'], inplace=True)
dfi = df.set_index(['tau', 'fold'])
ds = dfi.to_xarray()
ds
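
The skill computed in the cell above is a normalized log score: the score of each fold is compared with the entropy of a climatological forecast that always predicts the base rate. The sketch below assumes that `ut.entropy(p)` returns the binary entropy of an event with probability `p`. That is an assumption about the helper, and `binary_entropy` and `normalized_skill` are hypothetical names.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a binary event with probability p."""
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def normalized_skill(score, p=0.05):
    """1 for a perfect forecast, 0 for a forecast no better than climatology."""
    return 1 - score / binary_entropy(p)

clim = binary_entropy(0.05)                   # score of the climatological forecast
skill_clim = normalized_skill(clim)           # no better than climatology
skill_perfect = normalized_skill(0.0)         # perfect forecast
skill_half = normalized_skill(0.5 * clim)     # halfway between the two
print(clim, skill_clim, skill_perfect, skill_half)
```

A negative value therefore means that the forecast is worse than always predicting the 5% base rate.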

In [None]:
ds = ds['skill'].expand_dims({'T': [14], 'percent': [5]})
ds.attrs = {'description': 'Normalized log score of CNN'}

In [None]:
ds.to_netcdf('Skill-CNN_T14_percent5.nc')

### Compute skill of composite map

In [None]:
folder = 'ga'
runs = ut.json2dict(f'{folder}/runs.json')
runs = {k:v for k,v in runs.items() if v['status'] == 'COMPLETED'}
config_dict = ut.json2dict(f'{folder}/config.json')

runs = {k:v for k,v in runs.items() if 'T' not in v['args'] and get_arg(v, 'regularization', config_dict) == 'identity'}

In [None]:
id_g = ln.make_groups(runs, variable='tau', config_dict_flat=ut.collapse_dict(config_dict), sort=True)[0]
id_g['args'], id_g['tau']

In [None]:
df = []
item = {}

for run in id_g['runs']:
    # use the config of the Gaussian approximation runs loaded above
    nfolds = get_arg(run, 'nfolds', config_dict)
    item['tau'] = -get_arg(run, 'tau', config_dict)  # flip sign: positive lead time
    # print(item['tau'])
    for fold in range(nfolds):
        item['fold'] = fold
        item['skill'] = 1 - run['scores'][f'fold_{fold}']/ut.entropy(0.05)
        df.append(item.copy())
df = pd.DataFrame(df)
df.sort_values(['tau', 'fold'], inplace=True)
dfi = df.set_index(['tau', 'fold'])
ds = dfi.to_xarray()
ds

In [None]:
ds = ds['skill'].expand_dims({'T': [14], 'percent': [5]})
ds.attrs = {'description': 'Normalized log score of Gaussian approximation when projecting onto the composite map'}

In [None]:
ds.to_netcdf('Skill-comp_T14_percent5.nc')

### Compute $H_2$ of projection patterns (Optional)

In [None]:
mask = np.ones((22,128,1), dtype=bool)
reshaper = ut.Reshaper(mask)

coslat = np.maximum(np.cos(lat*np.pi/180), 0)
aw = (np.ones(mask.shape).T * coslat).T
aw *= mask
aw /= np.sum(aw)

W = sparse.load_npz('W.npz')

assert W.shape[0] == reshaper.surviving_coords
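
The area weights built above can be tried on their own with a synthetic grid. In the sketch below the latitudes are hypothetical placeholders for the contents of `lat.npy`, and the broadcasting is written explicitly instead of using the double transpose. It shows that after dividing by `l2(proj*np.sqrt(aw))` a pattern has unit area-weighted norm.

```python
import numpy as np

def l2(x, **kwargs):
    return np.sqrt(np.sum(x**2, **kwargs))

lat = np.linspace(30, 87.5, 22)            # hypothetical latitudes, one per grid row
mask = np.ones((22, 128, 1), dtype=bool)   # (lat, lon, field), all pixels kept

coslat = np.maximum(np.cos(np.deg2rad(lat)), 0)
aw = coslat[:, None, None] * mask          # same weight along each latitude row
aw = aw / aw.sum()                         # weights sum to one

rng = np.random.default_rng(0)
proj = rng.normal(size=mask.shape)         # stand-in for a projection pattern
proj /= l2(proj * np.sqrt(aw))             # normalize with the area weights

weighted_norm = np.sqrt(np.sum(aw * proj**2))
print(aw.shape, aw.sum(), weighted_norm)
```

Normalizing in this way makes patterns from different runs comparable before the quadratic form with `W` is evaluated.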

In [None]:
folder = 'ga'

config_dict = ut.json2dict(f'{folder}/config.json')

runs = ut.json2dict(f'{folder}/runs.json')

nfolds = ut.extract_nested(config_dict, 'nfolds')

force_computation = False


for run in tqdm(reversed(runs.values())):
    if run['status'] != 'COMPLETED' or ('h2s' in run and not force_computation):
        continue
    # print(run['name'])
    h2s = {}
    for fold in range(nfolds):
        proj = np.load(f"{folder}/{run['name']}/fold_{fold}/proj.npy")
        
        # normalize projection pattern
        proj /= l2(proj*np.sqrt(aw))
        
        proj_r = reshaper.reshape(proj)
        h2 = proj_r @ W @ proj_r
        
        h2s[f'fold_{fold}'] = h2
    h2m = np.mean(list(h2s.values()))
    h2z = np.std(list(h2s.values()))
    h2s['mean'] = h2m
    h2s['std'] = h2z
    
    run['h2s'] = h2s

In [None]:
ut.dict2json(runs, f'{folder}/runs.json')

### Compute skill of GA

In [None]:
# collect all runs in a big dataframe

df = []
item = {}
fields = None
year_list = None

folders = ['ga']
Model = 'ERA5'


for folder in folders:
    print(folder)
    runs = ut.json2dict(f'{folder}/runs.json')
    runs = {k:v for k,v in runs.items() if v['status'] == 'COMPLETED'}
    config_dict = ut.json2dict(f'{folder}/config.json')
    
    for run in tqdm(runs.values()):
        
        if get_arg(run, 'regularization', config_dict) != ut.extract_nested(config_dict, 'regularization'): # ignore other regularization types
            continue
        if fields is not None:
            if get_arg(run, 'fields', config_dict) != fields:
                continue
        if year_list is not None:
            if get_arg(run, 'year_list', config_dict) != year_list:
                continue
        
        item['path'] = f"{folder}/{run['name']}"
        for kw in ['T', 'tau', 'reg_c', 'percent']:
            item[kw] = get_arg(run, kw, config_dict)
        item['tau'] = -item['tau']
        item['years'] = get_years(run, config_dict)
        # item['run'] = run
        
        item['clim_entropy'] = ut.entropy(0.01*item['percent'])
        
        nfolds = get_arg(run, 'nfolds', config_dict)
        for fold in range(nfolds):
            item['fold'] = fold
            item['entropy'] = run['scores'][f'fold_{fold}']
        
            try:
                item['h2'] = run['h2s'][f'fold_{fold}']
            except KeyError:
                item['h2'] = np.nan

            # item['field_ratio'] = get_field_ratio(folder, run)
            df.append(item.copy())
        
        
df = pd.DataFrame(df)

df.sort_values(['T', 'tau', 'percent', 'reg_c', 'years', 'fold'], inplace=True)

assert not df.duplicated(['T', 'tau', 'reg_c', 'years', 'fold']).any()

df['skill'] = 1 - df['entropy']/df['clim_entropy']
dfi = df.set_index(['T', 'tau', 'reg_c', 'years', 'fold'])
ds = dfi.to_xarray()
ds

In [None]:
sk = ds['skill']
sk.attrs = {'description': 'Normalized log score of Gaussian approximation'}
sk

In [None]:
eps_best = sk.mean('fold').fillna(-100).argmax('reg_c')
eps_best
skk = sk.isel(reg_c=eps_best)
skk
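
The selection of the best regularization coefficient can be illustrated with a NumPy analogue of the `argmax` and `isel` calls above. The array of mean skills is hypothetical. Missing runs are marked with NaN and filled with a very low value so that they are never picked, since `np.argmax` would otherwise return the position of the NaN.

```python
import numpy as np

# Hypothetical mean skill, shape (tau, reg_c); NaN marks a missing run
mean_skill = np.array([
    [0.10, 0.30, 0.20],
    [np.nan, 0.05, 0.15],
    [0.40, np.nan, np.nan],
])

# Fill missing values with a very low skill so that they never win
filled = np.where(np.isnan(mean_skill), -100, mean_skill)
best = filled.argmax(axis=1)  # index of the best reg_c for each tau

# Pick, for each tau, the skill at its own best reg_c
best_skill = np.take_along_axis(mean_skill, best[:, None], axis=1)[:, 0]
print(best, best_skill)
```

Each row ends up with its own coefficient, which mirrors how `sk.isel(reg_c=eps_best)` indexes with an array of positions rather than a single one.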

In [None]:
skk.to_netcdf('Skill-GA_percent5_epsilonbest.nc')

### Save projection patterns

In [None]:
mask = np.ones((22,128,1), dtype=bool)
reshaper = ut.Reshaper(mask)

coslat = np.maximum(np.cos(lat*np.pi/180), 0)
aw = (np.ones(mask.shape).T * coslat).T
aw *= mask
aw /= np.sum(aw)

W = sparse.load_npz('W.npz')

assert W.shape[0] == reshaper.surviving_coords

In [None]:
sel = ds.isel(reg_c=eps_best)
ss = sel.sel(T=1,tau=[0,2,4,6],fold=0)
ss

In [None]:
projs = []
for tau in ss['tau'].data:
    sss = ss.sel(tau=tau)
    proj = np.load(f'{sss["path"].data.item()}/fold_0/proj.npy')
    proj /= l2(proj*np.sqrt(aw))
    proj = reshaper.reshape(proj)
    projs.append(proj)
projs = np.stack(projs)
projs.shape

In [None]:
da = xr.DataArray(projs, coords={'tau': ss['tau'].data, 'pixel': np.arange(projs.shape[1])}, name='M')
da

In [None]:
da = da.expand_dims({'T': [1], 'fold': [0], 'years': [80]})
da.attrs = {'description': 'Projection patterns for the Gaussian approximation'}
da

In [None]:
da.to_netcdf('projection_patterns_T1_epsilonbest_fold0.nc')