This notebook is an example of how to make the preprocessing of climate data for an OGGM simulation (e.g. from a GCM simulation) significantly faster: the computation time is reduced by roughly 85%. This takes three steps. The first two are described in detail in this notebook. For the last step, only some hints are given at the end.

The first step is to select all coordinates of the climate dataset that are needed and to collect them in a table. Based on this table, the climate time series at each of these coordinates is saved in a separate netCDF file (the second step). The advantage is that it takes less time to open a small file than a large one when preprocessing the data later. It also reduces how often the same file is opened at the same time by different processes when using multiprocessing.

In [None]:
import pandas as pd
import salem
import numpy as np
import xarray as xr

Here a table with all glaciers globally is opened and all columns that are not relevant are dropped. Columns for storing the coordinates of interest are added.

In [None]:
fp = '~/rgi62_allglaciers_stats.h5'
df = pd.read_hdf(fp)
# Drop the columns that are not needed for the coordinate look-up
df = df.drop(columns=['GLIMSId', 'BgnDate', 'EndDate', 'O1Region', 'O2Region', 'Zmin', 'Zmax', 'Form', 'Surging',
                      'Linkages', 'TermType', 'Area', 'Zmed', 'Slope', 'Name', 'Lmax', 'Status', 'Aspect', 'Connect',
                      'GlacierType', 'TerminusType', 'IsTidewater'])
# Empty (NaN) float columns for the nearest grid point of the climate dataset
df['cesm_lat'] = np.nan
df['cesm_lon'] = np.nan

Here the climate dataset of interest is opened.

In [None]:
cesm = '~/b.e11.BLMTRC5CN.f19_g16.001.cam.h0.TREFHT.085001-200512.nc'
dsd = xr.open_dataset(cesm)

Here only the first time step of the file is selected to make the process faster. (For now we are only interested in the coordinates of the data.)

In [None]:
dsd = dsd.TREFHT[0]

This loop goes over all glaciers of interest and selects the nearest coordinate in the climate dataset. The two datasets may use different longitude conventions (-180 to 180 vs. 0 to 360). Keep in mind that you might need to correct for such a difference.

In [None]:
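# Column positions, so that values can be written by position with .iloc
lat_col = df.columns.get_loc('cesm_lat')
lon_col = df.columns.get_loc('cesm_lon')
for gl in range(len(df)):
    lat = df['CenLat'].iloc[gl]
    # Map -180..180 to 0..360; adding 360 to every value would push eastern longitudes off the grid
    lon = df['CenLon'].iloc[gl] % 360
    nearest = dsd.sel(lat=lat, lon=lon, method='nearest')
    df.iloc[gl, lat_col] = float(nearest.lat)
    df.iloc[gl, lon_col] = float(nearest.lon)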
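The longitude correction mentioned above can be illustrated without any climate file. The following is a minimal sketch in plain Python, not the OGGM or xarray implementation: `to_0_360` and `nearest_value` are hypothetical helpers, and `lon_grid` is an assumed regular 2.5 degree grid standing in for the longitudes of the climate dataset.

```python
def to_0_360(lon):
    """Map a longitude from the -180..180 convention to 0..360."""
    return lon % 360


def nearest_value(grid, x):
    """Return the grid value closest to x (the first one wins on ties)."""
    return min(grid, key=lambda g: abs(g - x))


# Assumed longitude grid: 0.0, 2.5, ..., 357.5 (hypothetical stand-in)
lon_grid = [i * 2.5 for i in range(144)]

west = nearest_value(lon_grid, to_0_360(-10.2))  # glacier at 10.2 W
east = nearest_value(lon_grid, to_0_360(10.2))   # glacier at 10.2 E
# Adding 360 to an eastern longitude lands beyond the end of the grid
shifted = nearest_value(lon_grid, 10.2 + 360)

print(west, east, shifted)
```

With the modulo, both the western and the eastern glacier end up on the grid point next to them. Adding 360 unconditionally only works for negative longitudes: the eastern glacier is matched to the last grid point instead.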

It might be useful to save the table for later.

In [None]:
df.to_hdf('look_up_table.hdf', key='df')

However, for now we don't need the full table. Especially with a climate file of coarse resolution,
there can be many duplicates, as in this case. Therefore the duplicates are removed.

In [None]:
df_list = df.drop_duplicates(subset=['cesm_lat', 'cesm_lon'], keep='first')
df_list = df_list.dropna()
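To see why removing duplicates pays off, it can help to count how many glaciers share one grid point. The sketch below uses plain Python and made-up rows (the glacier ids and coordinates are hypothetical, not taken from the RGI table). It groups glaciers by their nearest grid point, and the keys of the resulting dictionary are the unique coordinates in order of first appearance, comparable to `keep='first'`.

```python
from collections import defaultdict

# Hypothetical rows: (glacier id, nearest grid latitude, nearest grid longitude)
rows = [
    ('glacier_a', 46.42, 10.0),
    ('glacier_b', 46.42, 10.0),
    ('glacier_c', 46.42, 12.5),
    ('glacier_d', 48.32, 12.5),
]

# Group the glaciers by the grid point they were matched to
glaciers_per_cell = defaultdict(list)
for glacier_id, grid_lat, grid_lon in rows:
    glaciers_per_cell[(grid_lat, grid_lon)].append(glacier_id)

# One entry per grid point, in order of first appearance
unique_cells = list(glaciers_per_cell)

print(len(rows), 'glaciers ->', len(unique_cells), 'grid points')
```

Only one small file per unique grid point needs to be written, however many glaciers fall into that cell.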

Here the climate file of interest is opened again. For each coordinate of interest, one
file is generated and saved with the rounded coordinates in its file name.

In [None]:
dsd = xr.open_dataset(cesm)

# The output directory temp_files/ must already exist
fpath = 'temp_files/b.e11.BLMTRC5CN.f19_g16.001.cam.h0.temp.085001-200512.{}_{}.nc'
for ki in range(len(df_list)):
    cesm_lat = df_list.iloc[ki].cesm_lat
    cesm_lon = df_list.iloc[ki].cesm_lon
    ds = dsd.sel(lat=cesm_lat, lon=cesm_lon, method='nearest')
    # int(round(...)) gives the same integer strings as used when the files are opened later
    ds.to_netcdf(path=fpath.format(int(round(cesm_lat)), int(round(cesm_lon))), mode='w')

The next step would be to create or look up a function that prepares your data so that it can be fed,
for each glacier of interest, to for instance process_gcm_data. Examples of such functions are process_cesm_data and
process_cmip5_data. However, those functions select the data for each glacier from one large file. That part of these functions
you would need to replace or adjust. Here the previously saved look-up table comes in handy. In this context the
following lines of code could be useful.

In [None]:
lookupt = pd.read_hdf('look_up_table.hdf')  # open the look-up table saved earlier
# select the rounded coordinates as used in the file names
cesm_lon = str(int(round(lookupt.loc[str(gdir.rgi_id)].cesm_lon)))
cesm_lat = str(int(round(lookupt.loc[str(gdir.rgi_id)].cesm_lat)))
# fill the gaps in the file name pattern, e.g.:
# fpath_temp = 'temp_files/b.e11.BLMTRC5CN.f19_g16.001.cam.h0.temp.085001-200512.{}_{}.nc'
# (fpath_precp is the corresponding pattern for the precipitation files)
precpds = xr.open_dataset(fpath_precp.format(cesm_lat, cesm_lon))
tempds = xr.open_dataset(fpath_temp.format(cesm_lat, cesm_lon))
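The step from a glacier to its file name can also be wrapped in one small function, so that the code writing the files and the code reading them cannot drift apart. This is only a sketch under assumptions: `TEMPLATE`, `cell_file_path`, `path_for_glacier` and the dictionary `lookup` (standing in for the look-up table) are hypothetical names, and the glacier ids and coordinates are made up.

```python
# Hypothetical file name pattern with two gaps for the rounded coordinates
TEMPLATE = 'temp_files/example.{}_{}.nc'

# Hypothetical stand-in for the look-up table: glacier id -> nearest grid point
lookup = {
    'glacier_a': (46.42, 10.0),
    'glacier_b': (-77.68, 162.5),
}


def cell_file_path(grid_lat, grid_lon, template=TEMPLATE):
    """Build the file name of one grid point from its rounded coordinates."""
    return template.format(int(round(grid_lat)), int(round(grid_lon)))


def path_for_glacier(glacier_id):
    """Look up the grid point of a glacier and return the matching file name."""
    grid_lat, grid_lon = lookup[glacier_id]
    return cell_file_path(grid_lat, grid_lon)


print(path_for_glacier('glacier_a'))
print(path_for_glacier('glacier_b'))
```

Note that Python's `round` rounds halves to the nearest even integer, so 162.5 becomes 162. This does no harm as long as the same function is used for writing and for reading the files.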