# Script to Extract Data
**Input Data:** Original Data  
**Output Data:** Data for a specific variable and spatial extent  
**Description:** Extracts data for a specific variable and spatial extent and exports them to a new file.  
**Date:** June 2022  
**Author:** Emma Perkins  

In [1]:
# import relevant packages
import xarray as xr
import glob

### Load Full Data

In [2]:
# full data
paths = '/glade/campaign/cesm/collections/cesmLE/CESM-CAM5-BGC-LE/lnd/proc/tseries/daily/H2OSNO/'  # change to your paths
# names = 'e5.oper.an.sfc.128_167_2t.ll025sc.*.nc'  # change to your files
file1 = paths+'b.e11.B1850C5CN.f09_g16.005.clm2.h1.H2OSNO.19000101-19991231.nc'
file2 = paths+'b.e11.B1850C5CN.f09_g16.005.clm2.h1.H2OSNO.20000101-20991231.nc'
# files = sorted(glob.glob(paths+names))
files = [file1, file2]
full_data = xr.open_mfdataset(files, concat_dim=None) 
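
The commented-out `names` and `glob` lines above point to a pattern-based alternative to listing files by hand. A minimal sketch of that approach, assuming a hypothetical helper `list_input_files` (not part of the original script) that stops early when the pattern matches nothing:

```python
import glob
import os


def list_input_files(directory, pattern):
    """Return the sorted paths in `directory` that match the glob `pattern`.

    Hypothetical helper for illustration; not part of the original script.
    """
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    if not matches:
        # catch an empty match here rather than further down the workflow
        raise FileNotFoundError(f"no files match {pattern!r} in {directory!r}")
    return matches
```

The returned list could be passed to `xr.open_mfdataset` in place of `files`. Sorting gives a stable order; for file names that embed their date range, as the two above do, that order is also chronological.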

### Select Variable of Interest

In [3]:
data_var = 'H2OSNO'  # change to variable of interest from climate model data
data_select = full_data[data_var]

### Select Area of Interest

In [8]:
# determine variable names
lat_var = 'lat'  # name of latitude variable for original data
lon_var = 'lon'  # name of longitude variable for original data

# rename latitude and longitude coordinates to the standard names 'lat' and 'lon'
data_select = data_select.rename({lat_var: 'lat', lon_var: 'lon'})

# sort by latitude:
data_select = data_select.sortby('lat')

lon_type = 'long3'  # longitude convention of original data: 'long1' (-180 to 180) or 'long3' (0 to 360)
if lon_type == 'long3':
    lon_new = (data_select.lon + 180) % 360 - 180
    data_select['lon'] = lon_new
data_select = data_select.sortby('lon')

# select input area from left to right / west to east:
lat_min = 50  # minimum latitude
lat_max = 90  # maximum latitude
lon_min = 150  # minimum longitude
lon_max = -100  # maximum longitude

if lon_min < lon_max:
    data_select = data_select.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))
else:
    data_select1 = data_select.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, 180))
    data_select2 = data_select.sel(lat=slice(lat_min, lat_max), lon=slice(-180, lon_max))
    data_select = xr.concat([data_select1, data_select2], dim='lon')
data_select = data_select.sortby('lon')

    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
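
The cell above does two things that are easy to get wrong: it wraps 0 to 360 longitudes into the -180 to 180 range, and it treats `lon_min > lon_max` as a box that crosses the antimeridian. The following dependency-free sketch mirrors that arithmetic on plain numbers. `wrap_longitude` and `in_lon_window` are hypothetical names used only for illustration; they are not functions from the script or from xarray.

```python
def wrap_longitude(lon):
    """Map a longitude in degrees onto [-180, 180), as in the cell above."""
    return (lon + 180) % 360 - 180


def in_lon_window(lon, lon_min, lon_max):
    """Check whether a wrapped longitude lies in the west-to-east window."""
    if lon_min < lon_max:
        return lon_min <= lon <= lon_max
    # window crosses the antimeridian: [lon_min, 180] plus [-180, lon_max]
    return lon >= lon_min or lon <= lon_max


# print original longitude, wrapped longitude, and window membership
for lon in (0, 10, 190, 210, 270, 359):
    wrapped = wrap_longitude(lon)
    print(lon, wrapped, in_lon_window(wrapped, lon_min=150, lon_max=-100))
```

With the bounds used above (150 to -100), a longitude of 190 wraps to -170 and falls inside the window, while 270 wraps to -90 and falls outside. Note that exactly 180 wraps to -180 under this formula.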


### Standardize Time Step
Skip this step if the data are already at the desired time step.

In [5]:
%%time

analysis_time_type = '1D'  # time step for analysis (e.g., 6H, 1D, 1M, 1Y)
# use .sum() for accumulated variables or .mean() for instantaneous variables
data_select = data_select.resample(time=analysis_time_type).mean('time')

CPU times: user 1min 52s, sys: 1min 6s, total: 2min 58s
Wall time: 3min 9s
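
The choice between `.sum()` and `.mean()` depends on what each sub-daily value represents. As a dependency-free sketch of what daily resampling does (plain Python rather than xarray, with made-up 6-hourly values and a hypothetical helper `aggregate_daily`):

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def aggregate_daily(times, values, how):
    """Group values by calendar day and reduce each day with 'mean' or 'sum'.

    Hypothetical helper for illustration; not part of the original script.
    """
    reducer = {'mean': mean, 'sum': sum}[how]
    by_day = defaultdict(list)
    for t, v in zip(times, values):
        by_day[t.date()].append(v)
    return {day: reducer(vals) for day, vals in sorted(by_day.items())}


# made-up 6-hourly series covering two days
start = datetime(2000, 1, 1)
times = [start + timedelta(hours=6 * i) for i in range(8)]
values = [float(i) for i in range(8)]

daily_mean = aggregate_daily(times, values, 'mean')  # instantaneous variables
daily_sum = aggregate_daily(times, values, 'sum')    # accumulated variables
```

For these values (0 through 7), the daily means are 1.5 and 5.5, while the daily sums are 6.0 and 22.0. Using the wrong reducer therefore changes the result by a factor equal to the number of samples per day.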


### Export Data - Long Step (6+ Hours, Run Overnight)

In [10]:
%%time

outpath = '/glade/campaign/cgd/ppc/eperkins/cesm/'  # path for new data file
data_name = 'cesmLE_B1850C5CN_H2OSNO_1900_2099_1D_MRBplus'  # name for new data file

data_select.load().to_netcdf(outpath+data_name+'.nc')

CPU times: user 4h 36min 51s, sys: 2h 13min 23s, total: 6h 50min 15s
Wall time: 4h 25min 2s
