# Prototype Python script for fetching and averaging data

The goal is to create NetCDF files of time-averaged ocean-transport states (`umo`, `vmo`, `uo`, `vo`, and `mlotst`) from any CMIP model of interest. (CMIP5 might require some tweaks, as its catalog may use different kwargs, e.g., `source_id` versus `model`.)

Eventually, this notebook will be turned into a function with these inputs:
- model
- experiment
- ensemble member
- time period
that a script can put in a loop to generate lots of datasets.
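
A minimal sketch of what such a loop might look like. The function `make_time_averaged_files` is a hypothetical placeholder (it does not exist in this notebook yet), and the lists of models and time windows are made up for illustration:

```python
from itertools import product

def make_time_averaged_files(model, experiment, member, year_start, num_years):
    """Hypothetical placeholder for the future function wrapping this notebook."""
    # For now it only returns a label describing the requested dataset.
    year_end = year_start + num_years - 1
    return f"{model}/{experiment}/{member}/{year_start}-{year_end}"

# Placeholder inputs (assumptions for illustration only)
models = ["ACCESS-ESM1-5", "GFDL-CM4"]
experiments = ["historical"]
members = ["r1i1p1f1"]
time_windows = [(1870, 30), (1990, 20)]  # (year_start, num_years)

# One call per combination of model, experiment, member, and time window
jobs = [
    make_time_averaged_files(model, experiment, member, year_start, num_years)
    for model, experiment, member, (year_start, num_years)
    in product(models, experiments, members, time_windows)
]
print(*jobs, sep="\n")
```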

Down the road, a Julia script (maybe a small package?) can grab the time-averaged data from these files and generate the transport matrices.

Note that this notebook is specific to Gadi at NCI. It may require access to at least one of the following projects (so make sure you add them all in your interactive ARE job submission):
```
gdata/gh0+gdata/oi10+gdata/dk92+gdata/hh5+gdata/rr3+gdata/al33+gdata/fs38
```

Warning: No promises made, this is work in progress!

## 1. Load packages

In [93]:
# Ignore warnings
from os import environ
environ["PYTHONWARNINGS"] = "ignore"

In [94]:
# Import makedirs to create directories where I write new files
from os import makedirs

In [137]:
# Load dask
from dask.distributed import Client

# Load intake and cosima cookbook
import intake
import cosima_cookbook as cc

# Load xarray for N-dimensional arrays
import xarray as xr

# Load xesmf for regridding
import xesmf as xe

# Load datetime to deal with time formats
import datetime

# Load numpy for numbers!
import numpy as np

# Load xmip for preprocessing (trying to get consistent metadata for making matrices down the road)
from xmip.preprocessing import combined_preprocessing

# Load pandas for DataFrame manipulations
import pandas as pd

## 2. Define some functions

(to avoid too much boilerplate code)

In [96]:
########## functions ##########
print("Defining functions")

def time_window_strings(year_start, num_years):
    """
    Return the start and end of the time window as datetime objects.
    """
    # start_time is the first second of year_start
    start_time = datetime.datetime(year_start, 1, 1, 0, 0, 0)
    # end_time is the last second of the last year
    end_time = datetime.datetime(year_start + num_years - 1, 12, 31, 23, 59, 59)

    return start_time, end_time

def find_latest_version(cat):
    """
    find latest version of selected data
    """
    sorted_versions = cat.df.version.to_list()
    sorted_versions.sort()
    latest_version = sorted_versions[-1]
    return latest_version

def select_latest_cat(cat, **kwargs):
    """
    search latest version of selected data
    """
    selectedcat = cat.search(**kwargs)
    latestselectedcat = selectedcat.search(version=find_latest_version(selectedcat))
    return latestselectedcat

def select_latest_data(cat, xarray_open_kwargs, **kwargs):
    latestselectedcat = select_latest_cat(cat, **kwargs)
    xarray_combine_by_coords_kwargs=dict(
        compat="override",
        data_vars="minimal",
        coords="minimal"
    )
    datadask = latestselectedcat.to_dask(
        xarray_open_kwargs=xarray_open_kwargs,
        xarray_combine_by_coords_kwargs=xarray_combine_by_coords_kwargs,
        parallel=True
    )
    return datadask

def cmip_catalog_str(CMIP_version, model):
    """
    Figure out the name of the catalog given the model and CMIP version (NCI/Gadi specific)
    """
    match CMIP_version:
        case 'CMIP5':
            match model:
                case 'ACCESS1-3':
                    cat = "/g/data/dk92/catalog/v2/esm/cmip5-rr3/catalog.json"
                case _:
                    cat = "/g/data/dk92/catalog/v2/esm/cmip5-al33/catalog.json"
        case 'CMIP6':
            match model:
                case 'ACCESS-CM2' | 'ACCESS-ESM1-5':
                    cat = "/g/data/dk92/catalog/v2/esm/cmip6-fs38/catalog.json"
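                case _:
                    cat = "/g/data/dk92/catalog/v2/esm/cmip6-oi10/catalog.json"
        case _:
            # Fail early instead of hitting an undefined `cat` below
            raise ValueError(f"Unknown CMIP version: {CMIP_version}")
    return cat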


Defining functions


## 3. Select the model, experiment, ensemble, and time window

On NCI, the catalog depends not only on the CMIP version but also on the model: for some reason that eludes me, some of the Australian data (including ACCESS1.3, ACCESS-ESM1.5, and ACCESS-CM2) lives in its own ***separate*** catalog. So I need to define the CMIP version and the model first.

In [97]:
CMIP_version = "CMIP6"

In [175]:
# Comment/uncomment
model = "ACCESS-ESM1-5"
# model = "not ACCESS" # use this if you don't know yet which model you want and need to look up some model names first.
model = "GFDL-CM4"

In [176]:
# Load catalog
cat_str = cmip_catalog_str(CMIP_version, model)
cat_str

'/g/data/dk92/catalog/v2/esm/cmip6-oi10/catalog.json'

In [177]:
# The catalog
cat = intake.open_esm_datastore(cat_str)
cat

Unnamed: 0,unique
path,3295829
file_type,2
realm,14
frequency,12
table_id,32
project_id,16
institution_id,36
source_id,103
experiment_id,79
member_id,6520


A little detour to list all the models in the catalog:

In [178]:
models = np.sort(cat.search(realm = 'ocean').df.source_id.unique())
print(*models, sep = "\n")

ACCESS-OM2
ACCESS-OM2-025
AWI-CM-1-1-MR
AWI-ESM-1-1-LR
BCC-CSM2-MR
BCC-ESM1
CAMS-CSM1-0
CAS-ESM2-0
CESM1-1-CAM5-CMIP5
CESM1-CAM5-SE-HR
CESM2
CESM2-FV2
CESM2-WACCM
CESM2-WACCM-FV2
CIESM
CMCC-CM2-HR4
CMCC-CM2-SR5
CMCC-CM2-VHR4
CMCC-ESM2
CNRM-CM6-1
CNRM-CM6-1-HR
CNRM-ESM2-1
CanESM5
CanESM5-1
CanESM5-CanOE
E3SM-1-0
E3SM-1-1
E3SM-1-1-ECA
E3SM-2-0
E3SM-2-0-NARRM
EC-Earth3
EC-Earth3-AerChem
EC-Earth3-CC
EC-Earth3-LR
EC-Earth3-Veg
EC-Earth3-Veg-LR
FGOALS-f3-H
FGOALS-f3-L
FGOALS-g3
FIO-ESM-2-0
GFDL-CM4
GFDL-ESM2M
GFDL-ESM4
GFDL-OM4p5B
GISS-E2-1-G
GISS-E2-1-G-CC
GISS-E2-1-H
GISS-E2-2-G
GISS-E2-2-H
HadGEM3-GC31-LL
HadGEM3-GC31-MM
ICON-ESM-LR
IITM-ESM
INM-CM4-8
INM-CM5-0
IPSL-CM5A2-INCA
IPSL-CM6A-LR
IPSL-CM6A-LR-INCA
KACE-1-0-G
KIOST-ESM
MCM-UA-1-0
MIROC-ES2L
MIROC6
MPI-ESM-1-2-HAM
MPI-ESM1-2-HR
MPI-ESM1-2-LR
MRI-ESM2-0
NESM3
NorCPM1
NorESM1-F
NorESM2-LM
NorESM2-MM
SAM0-UNICON
TaiESM1
TaiESM1-TIMCOM
TaiESM1-TIMCOM2
UKESM1-0-LL
UKESM1-1-LL


The creation of a matrix can only work if the following set of variables is available:
- mass transports (`umo` and `vmo`)
- mixed-layer depth (`mlotst`)

Alternatively, `umo` and `vmo` (in kg/s) can be replaced by `uo` and `vo` (in m/s). However, the conversion to mass transport requires the grid-cell volume (`volcello`), the grid-cell areas (from the cell vertices and `thkcello`), and the density (there is no variable for it, so I guess a constant will do).

So the notebook here will create all the files for transport, if available: `umo`, `vmo`, `uo`, `vo`, `mlotst`.
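
As a rough sketch of the velocity-to-mass-transport conversion mentioned above: the mass transport through a cell face is approximately density × velocity × face area, where the face area is the face width times the cell thickness. The function name `mass_transport`, the reference density of 1035 kg/m³, and the example numbers are all assumptions for illustration. A real conversion would also need to account for where each model places its velocities on the grid.

```python
RHO0 = 1035.0  # kg/m^3, assumed constant reference density (not a CMIP variable)

def mass_transport(velocity, face_width, thickness, rho0=RHO0):
    """
    Approximate mass transport (kg/s) through a cell face from
    velocity (m/s), face width (m), and cell thickness (m).
    """
    return rho0 * velocity * face_width * thickness

# Example (made-up numbers): 0.1 m/s through a face 100 km wide and 10 m thick
umo_estimate = mass_transport(0.1, 100e3, 10.0)
print(f"{umo_estimate:.3e} kg/s")
```

Because the function only uses arithmetic, it should work unchanged on NumPy or xarray arrays of matching shape.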

In [102]:
experiment = "historical"

List of members

In [103]:
members = np.sort(cat.search(source_id = model, realm = 'ocean').df.member_id.unique())
print(*members, sep = "\n")

r1i1p1f1


In [104]:
# List all models of the catalog that have at least one of the variables
# `umo`, `uo`, `vmo`, `vo`, or `mlotst` (a list passed to `cat.search` matches any of its values)
# and display these models as a dictionary where the keys are the models 
# and the values are lists of the members that have at least one of these variables
models = np.sort(cat.search(variable_id = ['umo', 'uo', 'vmo', 'vo', 'mlotst']).df.source_id.unique())
models_dict = {}
for model in models:
    members = np.sort(cat.search(source_id = model, variable_id = ['umo', 'uo', 'vmo', 'vo', 'mlotst']).df.member_id.unique())
    models_dict[model] = members
models_dict

{'ACCESS-OM2': array(['r1i1p1f1'], dtype=object),
 'ACCESS-OM2-025': array(['r1i1p1f1'], dtype=object),
 'AWI-CM-1-1-MR': array(['r1i1p1f1'], dtype=object),
 'AWI-ESM-1-1-LR': array(['r1i1p1f1'], dtype=object),
 'BCC-CSM2-MR': array(['r1i1p1f1', 'r2i1p1f1', 'r3i1p1f1'], dtype=object),
 'BCC-ESM1': array(['r1i1p1f1'], dtype=object),
 'CAMS-CSM1-0': array(['r1i1p1f1', 'r1i1p1f2'], dtype=object),
 'CAS-ESM2-0': array(['r1i1p1f1'], dtype=object),
 'CESM1-CAM5-SE-HR': array(['r1i1p1f1'], dtype=object),
 'CESM2': array(['r101i1p1f1', 'r102i1p1f1', 'r103i1p1f1', 'r1i1p1f1', 'r2i1p1f1',
        'r3i1p1f1'], dtype=object),
 'CESM2-FV2': array(['r1i1p1f1'], dtype=object),
 'CESM2-WACCM': array(['r1i1p1f1'], dtype=object),
 'CESM2-WACCM-FV2': array(['r1i1p1f1'], dtype=object),
 'CIESM': array(['r1i1p1f1'], dtype=object),
 'CMCC-CM2-HR4': array(['r1i1p1f1'], dtype=object),
 'CMCC-CM2-SR5': array(['r1i1p1f1'], dtype=object),
 'CMCC-CM2-VHR4': array(['r1i1p1f1'], dtype=object),
 'CMCC-ESM2': array([

In [105]:
# Intended: list all models of the catalog that have `umo` and `vmo` and `mlotst`.
# As written, the search below still matches entries with any of the five variables
# (same as the previous cell). The models are displayed as a dictionary where the keys
# are the models and the values are lists of the members.
models = np.sort(cat.search(variable_id = ['umo', 'uo', 'vmo', 'vo', 'mlotst']).df.source_id.unique())
models_dict = {}
for model in models:
    members = np.sort(cat.search(source_id = model, variable_id = ['umo', 'uo', 'vmo', 'vo', 'mlotst']).df.member_id.unique())
    models_dict[model] = members
models_dict

{'ACCESS-OM2': array(['r1i1p1f1'], dtype=object),
 'ACCESS-OM2-025': array(['r1i1p1f1'], dtype=object),
 'AWI-CM-1-1-MR': array(['r1i1p1f1'], dtype=object),
 'AWI-ESM-1-1-LR': array(['r1i1p1f1'], dtype=object),
 'BCC-CSM2-MR': array(['r1i1p1f1', 'r2i1p1f1', 'r3i1p1f1'], dtype=object),
 'BCC-ESM1': array(['r1i1p1f1'], dtype=object),
 'CAMS-CSM1-0': array(['r1i1p1f1', 'r1i1p1f2'], dtype=object),
 'CAS-ESM2-0': array(['r1i1p1f1'], dtype=object),
 'CESM1-CAM5-SE-HR': array(['r1i1p1f1'], dtype=object),
 'CESM2': array(['r101i1p1f1', 'r102i1p1f1', 'r103i1p1f1', 'r1i1p1f1', 'r2i1p1f1',
        'r3i1p1f1'], dtype=object),
 'CESM2-FV2': array(['r1i1p1f1'], dtype=object),
 'CESM2-WACCM': array(['r1i1p1f1'], dtype=object),
 'CESM2-WACCM-FV2': array(['r1i1p1f1'], dtype=object),
 'CIESM': array(['r1i1p1f1'], dtype=object),
 'CMCC-CM2-HR4': array(['r1i1p1f1'], dtype=object),
 'CMCC-CM2-SR5': array(['r1i1p1f1'], dtype=object),
 'CMCC-CM2-VHR4': array(['r1i1p1f1'], dtype=object),
 'CMCC-ESM2': array([

In [149]:
cat.search(variable_id="mlotstmax")

Unnamed: 0,unique
path,1164
file_type,1
realm,1
frequency,1
table_id,1
project_id,2
institution_id,2
source_id,3
experiment_id,4
member_id,9


In [128]:
def list_models_and_members_that_have(cat, variables):
    """
    Find the models and their members that have all the requested variables
    (e.g., 'uo', 'vo', and 'mlotst') and display them grouped by model.
    """
    # Step 1: Filter the dataframe to include only the specified variables
    filtered_df = cat.search(variable_id = variables).df
    
    # Step 2: Group by model ('source_id') and member ('member_id')
    grouped = filtered_df.groupby(['source_id', 'member_id'])
    
    # Step 3: Keep only the groups that contain all the requested variables
    valid_groups = grouped.filter(lambda x: set(variables).issubset(set(x['variable_id'])))
    
    # Step 4: Get the list of models and their members
    result = valid_groups[['source_id', 'member_id']].drop_duplicates().reset_index(drop=True)
    
    # Step 5: Sort the result by model
    result_sorted = result.sort_values(by='source_id')

    # Step 6: Regroup by model
    grouped_result_sorted = result_sorted.groupby('source_id')
    
    # `display` is the notebook's built-in display function (one table per model)
    return grouped_result_sorted.apply(display)
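
To illustrate the "has all the variables" logic used in `list_models_and_members_that_have` without needing a catalog, here is a small standard-library sketch. The function name `pairs_with_all_variables` and the toy records are made up for illustration. It collects the set of variables available for each (model, member) pair and keeps the pairs whose set contains every required variable:

```python
from collections import defaultdict

def pairs_with_all_variables(records, required):
    """
    records: iterable of (source_id, member_id, variable_id) tuples.
    Return the sorted (source_id, member_id) pairs that have every required variable.
    """
    available = defaultdict(set)
    for source_id, member_id, variable_id in records:
        available[(source_id, member_id)].add(variable_id)
    required = set(required)
    # `required <= variables` is the subset test (same idea as `issubset`)
    return sorted(pair for pair, variables in available.items() if required <= variables)

# Toy records (made up for illustration)
records = [
    ("MODEL-A", "r1i1p1f1", "uo"),
    ("MODEL-A", "r1i1p1f1", "vo"),
    ("MODEL-A", "r1i1p1f1", "mlotst"),
    ("MODEL-B", "r1i1p1f1", "uo"),
    ("MODEL-B", "r1i1p1f1", "vo"),
]
complete = pairs_with_all_variables(records, ["uo", "vo", "mlotst"])
print(complete)
```

Here only `MODEL-A` is kept, because `MODEL-B` lacks `mlotst`.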

In [161]:
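def summary_variable_availability(df):
    """
    Summarize, for each experiment and model, which members provide
    (`umo`, `vmo`), (`uo`, `vo`), `mlotst`, and `mlotstmax`.
    """
    # Step 1: Filter the dataframe to include only the specified variables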
    filtered_df_1 = df[df['variable_id'].isin(['umo', 'vmo'])]
    filtered_df_2 = df[df['variable_id'].isin(['uo', 'vo'])]
    filtered_df_3 = df[df['variable_id'].isin(['mlotst'])]
    filtered_df_4 = df[df['variable_id'].isin(['mlotstmax'])]
    
    # Step 2: Group by 'experiment_id', 'source_id', and 'member_id'
    grouped_1 = filtered_df_1.groupby(['experiment_id', 'source_id', 'member_id'])
    grouped_2 = filtered_df_2.groupby(['experiment_id', 'source_id', 'member_id'])
    grouped_3 = filtered_df_3.groupby(['experiment_id', 'source_id', 'member_id'])
    grouped_4 = filtered_df_4.groupby(['experiment_id', 'source_id', 'member_id'])
    
    # Step 3: Find groups that contain all the variables in each set
    valid_groups_1 = grouped_1.filter(lambda x: set(['umo', 'vmo']).issubset(set(x['variable_id'])))
    valid_groups_2 = grouped_2.filter(lambda x: set(['uo', 'vo']).issubset(set(x['variable_id'])))
    valid_groups_3 = grouped_3.filter(lambda x: set(['mlotst']).issubset(set(x['variable_id'])))
    valid_groups_4 = grouped_4.filter(lambda x: set(['mlotstmax']).issubset(set(x['variable_id'])))
    
    # Step 4: Get the list of source_id and their member_id for each set
    result_1 = valid_groups_1[['experiment_id', 'source_id', 'member_id']].drop_duplicates().reset_index(drop=True)
    result_2 = valid_groups_2[['experiment_id', 'source_id', 'member_id']].drop_duplicates().reset_index(drop=True)
    result_3 = valid_groups_3[['experiment_id', 'source_id', 'member_id']].drop_duplicates().reset_index(drop=True)
    result_4 = valid_groups_4[['experiment_id', 'source_id', 'member_id']].drop_duplicates().reset_index(drop=True)
    
    # Step 5: Group by 'experiment_id' and 'source_id' and aggregate member_id into a list for each set
    final_result_1 = result_1.groupby(['experiment_id', 'source_id'])['member_id'].apply(list).reset_index()
    final_result_2 = result_2.groupby(['experiment_id', 'source_id'])['member_id'].apply(list).reset_index()
    final_result_3 = result_3.groupby(['experiment_id', 'source_id'])['member_id'].apply(list).reset_index()
    final_result_4 = result_4.groupby(['experiment_id', 'source_id'])['member_id'].apply(list).reset_index()
    
    # Step 6: Merge the results into a single dataframe
    merged_result_1 = pd.merge(final_result_1, final_result_2, on=['experiment_id', 'source_id'], how='outer', suffixes=('_umo_vmo', '_uo_vo'))
    merged_result_2 = pd.merge(merged_result_1, final_result_3, on=['experiment_id', 'source_id'], how='outer', suffixes=('', '_mlotst'))
    merged_result_3 = pd.merge(merged_result_2, final_result_4, on=['experiment_id', 'source_id'], how='outer', suffixes=('', '_mlotstmax'))

    # Step 7: Sort by model
    final_result = merged_result_3.sort_values(by='source_id')
    
    return final_result

In [162]:
summary_variable_availability(cat.df)

Unnamed: 0,experiment_id,source_id,member_id_umo_vmo,member_id_uo_vo,member_id,member_id_mlotstmax
236,omip2,ACCESS-OM2,,[r1i1p1f1],,
237,omip2,ACCESS-OM2-025,,[r1i1p1f1],,
0,1pctCO2,AWI-CM-1-1-MR,,[r1i1p1f1],,
342,ssp245,AWI-CM-1-1-MR,,[r1i1p1f1],,
49,abrupt-4xCO2,AWI-CM-1-1-MR,,[r1i1p1f1],,
307,ssp126,AWI-CM-1-1-MR,,[r1i1p1f1],,
251,piControl,AWI-CM-1-1-MR,,[r1i1p1f1],,
417,ssp585,AWI-CM-1-1-MR,,[r1i1p1f1],,
164,historical,AWI-CM-1-1-MR,,[r1i1p1f1],[r1i1p1f1],
1,1pctCO2,AWI-ESM-1-1-LR,,[r1i1p1f1],,


In [180]:
cat.search(source_id = "GFDL-CM4", experiment_id = "historical", realm="ocean", variable_id="mlotst").df.variable_id.unique()

array([], dtype=object)

In [108]:
list_models_and_members_that_have(cat, ['uo', 'vo', 'mlotst'])

Unnamed: 0,source_id,member_id
22,AWI-CM-1-1-MR,r1i1p1f1


Unnamed: 0,source_id,member_id
28,AWI-ESM-1-1-LR,r1i1p1f1


Unnamed: 0,source_id,member_id
31,CAMS-CSM1-0,r1i1p1f1


Unnamed: 0,source_id,member_id
38,CAS-ESM2-0,r1i1p1f1


Unnamed: 0,source_id,member_id
43,CESM2,r1i1p1f1


Unnamed: 0,source_id,member_id
44,CESM2-FV2,r1i1p1f1


Unnamed: 0,source_id,member_id
35,CESM2-WACCM,r1i1p1f1


Unnamed: 0,source_id,member_id
45,CESM2-WACCM-FV2,r1i1p1f1


Unnamed: 0,source_id,member_id
42,CMCC-CM2-HR4,r1i1p1f1


Unnamed: 0,source_id,member_id
34,CMCC-CM2-SR5,r1i1p1f1


Unnamed: 0,source_id,member_id
33,CMCC-ESM2,r1i1p1f1


Unnamed: 0,source_id,member_id
14,CNRM-CM6-1,r1i1p1f2


Unnamed: 0,source_id,member_id
13,CNRM-CM6-1-HR,r1i1p1f2


Unnamed: 0,source_id,member_id
15,CNRM-ESM2-1,r1i1p1f2


Unnamed: 0,source_id,member_id
11,CanESM5,r1i1p1f1


Unnamed: 0,source_id,member_id
17,E3SM-1-0,r1i1p1f1


Unnamed: 0,source_id,member_id
20,E3SM-1-1,r1i1p1f1


Unnamed: 0,source_id,member_id
18,E3SM-1-1-ECA,r1i1p1f1


Unnamed: 0,source_id,member_id
5,EC-Earth3,r13i1p1f1
9,EC-Earth3,r9i1p1f1
8,EC-Earth3,r6i1p1f1
7,EC-Earth3,r11i1p1f1
6,EC-Earth3,r15i1p1f1
2,EC-Earth3,r1i1p1f1


Unnamed: 0,source_id,member_id
4,EC-Earth3-AerChem,r1i1p1f1


Unnamed: 0,source_id,member_id
3,EC-Earth3-CC,r1i1p1f1


Unnamed: 0,source_id,member_id
10,EC-Earth3-Veg,r12i1p1f1
1,EC-Earth3-Veg,r14i1p1f1


Unnamed: 0,source_id,member_id
0,EC-Earth3-Veg-LR,r1i1p1f1


Unnamed: 0,source_id,member_id
39,FGOALS-f3-L,r1i1p1f1


Unnamed: 0,source_id,member_id
40,FGOALS-g3,r1i1p1f1


Unnamed: 0,source_id,member_id
27,GFDL-CM4,r1i1p1f1


Unnamed: 0,source_id,member_id
21,HadGEM3-GC31-LL,r1i1p1f3
16,HadGEM3-GC31-LL,r1i1p1f1


Unnamed: 0,source_id,member_id
41,ICON-ESM-LR,r1i1p1f1


Unnamed: 0,source_id,member_id
12,IPSL-CM6A-LR,r1i1p1f1


Unnamed: 0,source_id,member_id
47,IPSL-CM6A-LR-INCA,r1i1p1f1


Unnamed: 0,source_id,member_id
23,KIOST-ESM,r1i1p1f1


Unnamed: 0,source_id,member_id
26,MIROC6,r1i1p1f1


Unnamed: 0,source_id,member_id
29,MPI-ESM-1-2-HAM,r1i1p1f1


Unnamed: 0,source_id,member_id
32,MPI-ESM1-2-HR,r1i1p1f1


Unnamed: 0,source_id,member_id
37,MPI-ESM1-2-LR,r1i1p1f1


Unnamed: 0,source_id,member_id
36,MRI-ESM2-0,r1i1p1f1


Unnamed: 0,source_id,member_id
19,NESM3,r1i1p1f1


Unnamed: 0,source_id,member_id
30,NorCPM1,r1i1p1f1


Unnamed: 0,source_id,member_id
25,NorESM2-LM,r1i1p1f1


Unnamed: 0,source_id,member_id
24,NorESM2-MM,r1i1p1f1


Unnamed: 0,source_id,member_id
46,TaiESM1-TIMCOM,r1i1p1f1


In [141]:
ensemble = "r1i1p1f1"

In [140]:
model = "GFDL-CM4" # Redefine it here because a for loop above overwrote it (the famous Jupyter notebooks state flaw)

In [142]:
year_start = 1870
num_years = 30

In [143]:
def variable_availability_check(cat, **kwargs):
    """
    Print the available time ranges of each relevant variable
    (`umo`, `vmo`, `uo`, `vo`, and `mlotst`) for the given search.
    """
    searched_cat = cat.search(**kwargs)
    umo_cat = searched_cat.search(variable_id = "umo")
    vmo_cat = searched_cat.search(variable_id = "vmo")
    uo_cat = searched_cat.search(variable_id = "uo")
    vo_cat = searched_cat.search(variable_id = "vo")
    mlotst_cat = searched_cat.search(variable_id = "mlotst")

    print("\numo:\n")
    print(available_time_window(umo_cat))
    print("\nvmo:\n")
    print(available_time_window(vmo_cat))
    print("\nuo:\n")
    print(available_time_window(uo_cat))
    print("\nvo:\n")
    print(available_time_window(vo_cat))
    print("\nmlotst:\n")
    print(available_time_window(mlotst_cat))

    return


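def available_time_window(cat):
    """
    Return the unique time ranges of the catalog entries, sorted by start year.
    """
    # Drop 'na' entries first so that the sorting indices match the array they index
    time_ranges = np.array([tr for tr in cat.df.time_range.unique() if tr != 'na'], dtype=object)
    idx = np.argsort([int(tr[0:4]) for tr in time_ranges])
    return time_ranges[idx]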
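
Once the time ranges are known, it may be useful to check that they cover the requested window before loading anything. Below is a minimal sketch, assuming monthly time-range strings of the form `YYYYMM-YYYYMM` as printed in this notebook. The helper names `parse_time_range` and `window_is_covered` are hypothetical:

```python
def parse_time_range(time_range):
    """Parse 'YYYYMM-YYYYMM' into ((start_year, start_month), (end_year, end_month))."""
    start, end = time_range.split("-")
    return (int(start[:4]), int(start[4:6])), (int(end[:4]), int(end[4:6]))

def window_is_covered(time_ranges, year_start, num_years):
    """
    Check that every month from January of year_start to December of the last
    year falls inside at least one of the time ranges.
    """
    spans = [parse_time_range(tr) for tr in time_ranges if tr != "na"]
    for year in range(year_start, year_start + num_years):
        for month in range(1, 13):
            # Tuples compare element by element, so (year, month) ordering works
            if not any(start <= (year, month) <= end for start, end in spans):
                return False
    return True

# Example time ranges in the catalog's format
time_ranges = ["185001-186912", "187001-188912", "189001-190912"]
print(window_is_covered(time_ranges, 1870, 30))  # 1870-1899 is inside the files
print(window_is_covered(time_ranges, 1900, 30))  # 1910 onwards is missing
```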

In [144]:
model, experiment, ensemble

('GFDL-CM4', 'historical', 'r1i1p1f1')

In [145]:
variable_availability_check(cat,
    source_id = model,
    experiment_id = experiment,
    member_id = ensemble,    
    realm = 'ocean'
)                            


umo:

[]

vmo:

[]

uo:

['185001-186912' '187001-188912' '189001-190912' '191001-192912'
 '193001-194912' '195001-196912' '197001-198912' '199001-200912'
 '201001-201412']

vo:

['185001-186912' '187001-188912' '189001-190912' '191001-192912'
 '193001-194912' '195001-196912' '197001-198912' '199001-200912'
 '201001-201412']

mlotst:

[]


In [120]:
# Check that catalog contains the data requested before creating empty directories
searched_cat = cat.search(
    source_id = model,
    experiment_id = experiment,
    member_id = ensemble,
    variable_id = ["uo", "vo", "mlotst"],
    realm = 'ocean')
np.sort(searched_cat.df.source_id.unique())

array([], dtype=object)

In [121]:
searched_searched_cat = searched_cat.search(variable_id = "uo")
searched_searched_cat.df.variable_id.unique()

array([], dtype=object)

In [116]:
np.sort([int(foo[0:4]) for foo in searched_cat.df.time_range.unique() if foo != 'na'])

array([], dtype=float64)

In [117]:
# Create directory on gdata
datadir = '/g/data/gh0/BP/TMIP/data/'
start_time, end_time = time_window_strings(year_start, num_years)
start_time_str = start_time.strftime("%b%Y")
end_time_str = end_time.strftime("%b%Y")

outputdir = f'{datadir}/{model}/{experiment}/{ensemble}/{start_time_str}-{end_time_str}'
print(outputdir)
makedirs(outputdir, exist_ok=True)

/g/data/gh0/BP/TMIP/data//UKESM1-0-LL/historical/r1i1p1f1/Jan1870-Dec1899
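
The double slash in the printed path is harmless on Linux, but it can be avoided by joining path components with `pathlib` instead of an f-string. A small sketch, where the function name `output_directory` is hypothetical and the month abbreviations from `%b` assume the default (English) locale:

```python
import datetime
from pathlib import PurePosixPath

def output_directory(datadir, model, experiment, ensemble, year_start, num_years):
    """Build the output directory path for one time-averaged dataset."""
    start_time = datetime.datetime(year_start, 1, 1)
    end_time = datetime.datetime(year_start + num_years - 1, 12, 31)
    window = f'{start_time.strftime("%b%Y")}-{end_time.strftime("%b%Y")}'
    # PurePosixPath joins components with single slashes, even if datadir ends with one
    return PurePosixPath(datadir) / model / experiment / ensemble / window

outputdir = output_directory('/g/data/gh0/BP/TMIP/data/', "GFDL-CM4", "historical", "r1i1p1f1", 1870, 30)
print(outputdir)
```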


In [118]:
########## Start the client and make the `.nc` files ##########
print("Starting client")
client = Client(n_workers=4)#, threads_per_worker=1, memory_limit='16GB') # Note: with 1thread/worker cannot plot thetao. Maybe I need to understand why?

Starting client


In [119]:
# umo dataset
print("Loading umo data")
umo_datadask = select_latest_data(searched_cat,
    dict(
        chunks={'i': 60, 'j': 60, 'time': -1, 'lev':50}
    ),
    variable_id = "umo",
    frequency = "mon",
)
print("\numo_datadask: ", umo_datadask)

Loading umo data


IndexError: list index out of range
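
This `IndexError` most likely comes from `find_latest_version`, which takes `sorted_versions[-1]` and therefore fails when the search matches no catalog entries. A sketch of a more explicit guard, written on plain lists so that it runs without a catalog (the function name `latest_version` and the version strings are made up for illustration):

```python
def latest_version(versions):
    """
    Return the latest version string, or raise a descriptive error if there is none.
    Assumes version strings sort chronologically (e.g. 'v20180701' < 'v20190308').
    """
    if len(versions) == 0:
        raise ValueError(
            "No catalog entries match the search; "
            "check the variable, model, experiment, and member."
        )
    return sorted(versions)[-1]

print(latest_version(["v20190308", "v20180701"]))

# An empty search result now gives an explicit message instead of an IndexError
try:
    latest_version([])
except ValueError as err:
    print(f"ValueError: {err}")
```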

In [None]:
# vmo dataset
print("Loading vmo data")
vmo_datadask = select_latest_data(searched_cat,
    dict(
        chunks={'i': 60, 'j': 60, 'time': -1, 'lev':50}
    ),
    variable_id = "vmo",
    frequency = "mon",
)
print("\nvmo_datadask: ", vmo_datadask)

In [None]:
# mlotst dataset
print("Loading mlotst data")
mlotst_datadask = select_latest_data(searched_cat,
    dict(
        chunks={'i': 60, 'j': 60, 'time': -1, 'lev':50}
    ),
    variable_id = "mlotst",
    frequency = "mon",
)
print("\nmlotst_datadask: ", mlotst_datadask)

In [None]:
# Deal with thkcello in a different script,
# given that its location (fixed or time-dependent) depends on the model and/or project
# # thkcello dataset
# print("Loading thkcello data")
# thkcello_datadask = select_latest_data(searched_cat,
#     dict(
#         chunks={'i': 60, 'j': 60, 'time': -1, 'lev':50}
#     ),
#     variable_id = "thkcello",
#     frequency = "mon",
# )
# print("\nthkcello_datadask: ", thkcello_datadask)

In [None]:
# Slice umo dataset for the time period
umo_datadask_sel = umo_datadask.sel(time=slice(start_time, end_time))
# Take the time average of the monthly umo (using month length as weights)
umo = umo_datadask_sel["umo"].weighted(umo_datadask_sel.time.dt.days_in_month).mean(dim="time")
umo

In [None]:
# Slice vmo dataset for the time period
vmo_datadask_sel = vmo_datadask.sel(time=slice(start_time, end_time))
# Take the time average of the monthly vmo (using month length as weights)
vmo = vmo_datadask_sel["vmo"].weighted(vmo_datadask_sel.time.dt.days_in_month).mean(dim="time")
vmo
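
To make the month-length weighting concrete, here is a small standard-library sketch of what a weighted time average computes for one year of monthly values. The function name `month_weighted_mean` and the toy values are made up, and it assumes the Gregorian calendar (some models use other calendars, such as no-leap or 360-day):

```python
import calendar

def month_weighted_mean(values, year):
    """
    Average 12 monthly values of a given year, weighting each month by its number of days.
    """
    days = [calendar.monthrange(year, month)[1] for month in range(1, 13)]
    return sum(v * d for v, d in zip(values, days)) / sum(days)

# Toy monthly values: 1.0 in January, 0.0 otherwise
values = [1.0] + [0.0] * 11
weighted = month_weighted_mean(values, 1870)
unweighted = sum(values) / len(values)
print(weighted, unweighted)
```

January has 31 days, which is more than one twelfth of the year, so the weighted mean is slightly larger than the plain mean.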

In [None]:
# Slice mlotst dataset for the time period
mlotst_datadask_sel = mlotst_datadask.sel(time=slice(start_time, end_time))
# Take the time mean of the yearly maximum of mlotst
mlotst_yearlymax = mlotst_datadask_sel.groupby("time.year").max(dim="time")
mlotst_yearlymax

In [None]:
mlotst = mlotst_yearlymax.mean(dim="year")
mlotst
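
The two steps above (yearly maximum, then mean over years) can be illustrated on plain Python data. This is only a sketch of the arithmetic, not of the xarray calls; the function name `mean_of_yearly_max` and the toy depths are made up:

```python
from collections import defaultdict

def mean_of_yearly_max(monthly):
    """
    monthly: iterable of (year, value) pairs, one per month.
    Return the average over years of each year's maximum value.
    """
    yearly_max = defaultdict(lambda: float("-inf"))
    for year, value in monthly:
        yearly_max[year] = max(yearly_max[year], value)
    return sum(yearly_max.values()) / len(yearly_max)

# Toy mixed-layer depths in metres (made up): deep in winter, shallow otherwise
monthly = [(1870, 50.0), (1870, 200.0), (1870, 30.0),
           (1871, 40.0), (1871, 300.0), (1871, 20.0)]
print(mean_of_yearly_max(monthly))
```

The result (the mean of 200 m and 300 m) is much deeper than the plain mean of all values, which is the point of taking the yearly maximum first.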

In [None]:
# # Slice thkcello dataset for the time period
# thkcello_datadask_sel = thkcello_datadask.sel(time=slice(start_time, end_time))
# # Take the time average of the monthly thkcello (using month length as weights)
# thkcello = thkcello_datadask_sel["thkcello"].weighted(thkcello_datadask_sel.time.dt.days_in_month).mean(dim="time")

In [None]:
# Save to netcdfs (and compute!)
umo.to_netcdf(f'{outputdir}/umo.nc', compute=True)
vmo.to_netcdf(f'{outputdir}/vmo.nc', compute=True)
mlotst.to_netcdf(f'{outputdir}/mlotst.nc', compute=True)
# thkcello.to_netcdf(f'{outputdir}/thkcello.nc', compute=True)

In [None]:
client.close()