# Geospatial Data Science - EEPS 440/460

# Lecture 12

# Multi-dimensional arrays II

---

## GDAL for everything? 

* GDAL is terrible when working with a time dimension.
* There are multiple options, e.g., NetCDF Operators ([NCO](https://www.unidata.ucar.edu/blogs/news/entry/netcdf-operators-nco-version-427)) and Climate Data Operators ([CDO](https://code.mpimet.mpg.de/projects/cdo))
* CDO is an excellent option that we are going to explore

## Situations where this problem matters 

* Comparing two different gridded datasets (e.g., climate model grids vs. observed grids)
* Preparing data for input into an environmental model (e.g., hydrologic model)
* Performing some basic statistical analysis (and saving the result!)

## Typical issues that have to be addressed

* Datasets to compare are at a different spatial resolution (e.g., 1 arcdegree vs. 0.1 arcdegree)
* Datasets are in different map projections (e.g., latlon vs. equal area)
* Datasets are at different temporal resolution (e.g., daily vs. annual)

## Believe it or not...

If you don't know what tools are out there, solving these issues can take up 90%+ of your research time (this is not an exaggeration!). If you use the right tools from the beginning, when you do the research you can focus on the science. Hence this course...

Tools like CDO become critical at the data processing stage.

# Climate Data Operators (CDO)

The Climate Data Operators (CDO) software is a collection of many operators for standard processing of climate and forecast model data. The operators include simple statistical and arithmetic functions, data selection and subsampling tools, and spatial interpolation. CDO was developed to have the same set of processing functions for GRIB and NetCDF datasets in one package.

Source: https://code.mpimet.mpg.de/projects/cdo/wiki/Tutorial

I am going to show you some highlights of CDO. However, just like GDAL, there is much, much more to explore.

# Upscaling in space

Let's upscale our ERA-Interim dataset in space

Let's remind ourselves what is in this dataset

In [None]:
%%bash
cdo -showname ../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann

Let's start by learning about the file's current projection information

In [None]:
%%bash
cdo -griddes ../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann

What do we learn?

* The data is in a regular lon/lat grid (PlateCarree)
* There are 180 pixels along the latitude axis
* There are 288 pixels along the longitude axis
* The longitude of the first pixel on the longitude axis is 0.625
* The latitude of the first pixel on the latitude axis is -89.5
* The spatial resolution of the longitude axis is 1.25 arcdegree (360/288, consistent with a first pixel centered at 0.625)
* The spatial resolution of the latitude axis is 1.0 arcdegree
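
As a sanity check on these numbers, the cell centers of a regular grid can be rebuilt from the first coordinate, the increment, and the pixel count. The sketch below is a minimal illustration with NumPy; the helper name `cell_centers` is made up for this example. It uses the first coordinates and pixel counts listed above and a longitude increment of 360/288 = 1.25 degrees, which assumes the grid is global.

```python
import numpy as np

def cell_centers(first, inc, size):
    """Return the cell-center coordinates of a regular 1-D axis."""
    return first + inc * np.arange(size)

lon = cell_centers(0.625, 1.25, 288)
lat = cell_centers(-89.5, 1.0, 180)

# The outer cell edges sit half a cell beyond the first and last centers
lon_edges = (float(lon[0] - 1.25 / 2), float(lon[-1] + 1.25 / 2))
lat_edges = (float(lat[0] - 1.0 / 2), float(lat[-1] + 1.0 / 2))
print(lon_edges)  # (0.0, 360.0)
print(lat_edges)  # (-90.0, 90.0)
```

If the edges do not span the whole globe, either the grid is regional or one of the numbers was misread.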

If we read that into Python...

In [None]:
%matplotlib inline
import netCDF4 as nc
import matplotlib.pyplot as plt
file = '../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann'
fp = nc.Dataset(file)
print('nlat: %d' % fp['lat'].size)
print('nlon: %d' % fp['lon'].size)
print('dlat: %f' % (fp['lat'][1]-fp['lat'][0]))
print('dlon: %f' % (fp['lon'][1]-fp['lon'][0]))
print('lat0: %f' % (fp['lat'][0]))
print('lon0: %f' % (fp['lon'][0]))
fp.close()

I am given another dataset that is at 5.0 arcdegree spatial resolution in both the lat and lon direction. How can CDO help us?

Let's define the information of the new grid to which we want to map our data

In [None]:
%%writefile ../Workspace/grid.cdo
gridtype  = lonlat
xsize     = 72
ysize     = 36
xname     = lon
xlongname = "longitude" 
xunits    = "degrees_east" 
yname     = lat
ylongname = "latitude" 
yunits    = "degrees_north" 
xfirst    = 2.5
xinc      = 5.0
yfirst    = -87.5
yinc      = 5.0
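
The numbers in this grid description follow directly from the target resolution, so they can be generated instead of typed by hand. The function below is a hypothetical helper (not part of CDO) that assumes a global lon/lat grid whose cell size divides 360 and 180 evenly. It only writes a subset of the keys used above.

```python
def lonlat_grid_description(res):
    """Build a CDO-style lon/lat grid description for a global grid.

    `res` is the cell size in degrees (assumed to divide 360 and 180 evenly).
    """
    xsize = round(360 / res)   # number of cells along longitude
    ysize = round(180 / res)   # number of cells along latitude
    lines = [
        "gridtype  = lonlat",
        f"xsize     = {xsize}",
        f"ysize     = {ysize}",
        f"xfirst    = {res / 2}",        # center of the first cell east of 0
        f"xinc      = {res}",
        f"yfirst    = {-90 + res / 2}",  # center of the southernmost cell
        f"yinc      = {res}",
    ]
    return "\n".join(lines)

text = lonlat_grid_description(5.0)
print(text)
```

For `res = 5.0` this reproduces the `xsize`, `ysize`, `xfirst`, and `yfirst` values written to the file above.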

CDO then allows us to remap our data to this new grid information.

Let's try a nearest neighbor interpolation (there are many other options). We will only apply it to the t2m (temperature) and precip (precipitation) variables.

In [None]:
%%bash 
cdo -selname,t2m,precip -remapnn,../Workspace/grid.cdo ../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann ../Workspace/example.nc

What did that just do?

In [None]:
%%bash 
cdo -showname ../Workspace/example.nc

In [None]:
%%bash 
cdo -griddes ../Workspace/example.nc
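
To build some intuition for what a nearest neighbor remap does, here is a toy version in NumPy. This is not how CDO implements `remapnn`: it is a sketch that assumes regular axes, measures distance along each axis separately, and ignores the wrap-around in longitude. The function names and the tiny 4x4 grid are made up for the example.

```python
import numpy as np

def nearest_index(src_coords, dst_coords):
    """For each destination coordinate, return the index of the closest source coordinate."""
    src = np.asarray(src_coords)[np.newaxis, :]
    dst = np.asarray(dst_coords)[:, np.newaxis]
    return np.abs(dst - src).argmin(axis=1)

def remap_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Nearest neighbor remap of a 2-D (lat, lon) field defined on regular axes."""
    ilat = nearest_index(src_lat, dst_lat)
    ilon = nearest_index(src_lon, dst_lon)
    # np.ix_ picks every combination of the selected rows and columns
    return field[np.ix_(ilat, ilon)]

# Toy 4x4 source grid remapped onto a 2x2 grid
src_lat = np.array([-67.5, -22.5, 22.5, 67.5])
src_lon = np.array([45.0, 135.0, 225.0, 315.0])
field = np.arange(16).reshape(4, 4)
dst_lat = np.array([-30.0, 30.0])
dst_lon = np.array([100.0, 300.0])

coarse = remap_nearest(field, src_lat, src_lon, dst_lat, dst_lon)
print(coarse.tolist())  # [[5, 7], [9, 11]]
```

Each output cell simply copies one input cell, so no new values are created, but most of the input cells are never used.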

Let's do a quick plot comparison of the original and new data

In [None]:
%matplotlib inline
import netCDF4 as nc
import numpy as np
import matplotlib.pyplot as plt
import cartopy

#Extract the first time step of each dataset
file = '../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann'
fp = nc.Dataset(file)
dold = fp['t2m'][0,:,:]
fp.close()
file = '../Workspace/example.nc'
fp = nc.Dataset(file)
dnew = fp['t2m'][0,:,:]
fp.close()

#Make a side by side plot
fig = plt.figure(figsize=(16,8))
#The grid runs from 0 to 360 in longitude and from south to north in latitude,
#so center the data CRS on 180 and put the first row at the bottom (origin='lower')
data_crs = cartopy.crs.PlateCarree(central_longitude=180)
img_extent = (-180,180,-90,90)

ax = plt.subplot(1,2,1,projection=cartopy.crs.Robinson())
ax.imshow(dold,transform=data_crs,cmap=plt.get_cmap('RdBu_r'),extent=img_extent,origin='lower')
plt.title('Original',fontsize=25)

ax = plt.subplot(1,2,2,projection=cartopy.crs.Robinson())
ax.imshow(dnew,transform=data_crs,cmap=plt.get_cmap('RdBu_r'),extent=img_extent,origin='lower')
plt.title('Upscaled',fontsize=25)
plt.show()


# Reprojecting/Regridding in space using CDO

Possibilities:

* Reproject the entire dataset to other projections
* Upscale or downscale in horizontal space
* Extract subsets of the data
* Use a multitude of interpolation schemes (nearest vs. average)
* And many more
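
To make the "nearest vs. average" distinction concrete, here is a minimal NumPy sketch of upscaling by block averaging. It is an illustration only, not CDO's algorithm: the function name `block_mean` is made up, the grid dimensions are assumed to be exact multiples of the upscaling factor, and the average is unweighted (on a lon/lat grid the cells shrink toward the poles, so a careful version would weight by cell area).

```python
import numpy as np

def block_mean(field, factor):
    """Upscale a 2-D field by averaging non-overlapping factor x factor blocks."""
    nlat, nlon = field.shape
    if nlat % factor or nlon % factor:
        raise ValueError("grid dimensions must be divisible by the factor")
    # Split each axis into (number of blocks, cells per block), then average within blocks
    blocks = field.reshape(nlat // factor, factor, nlon // factor, factor)
    return blocks.mean(axis=(1, 3))

field = np.arange(16, dtype=float).reshape(4, 4)
coarse = block_mean(field, 2)
print(coarse.tolist())  # [[2.5, 4.5], [10.5, 12.5]]
```

Unlike a nearest neighbor remap, every input cell contributes to the output, which usually matters for quantities such as precipitation.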

# Temporal statistics

Let's learn about the temporal information of our dataset

In [None]:
%%bash
cdo -sinfon ../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann

Let's compute and save the time average

In [None]:
%%bash
cdo -timmean ../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann ../Workspace/example.nc

Let's look at the new file

In [None]:
%%bash
cdo -sinfon ../Workspace/example.nc

You can also do everything at once

In [None]:
%%bash
cdo -timmean -selname,t2m,precip -remapnn,../Workspace/grid.cdo ../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc_ann ../Workspace/example.nc

Let's look at what we made

In [None]:
%%bash
cdo -sinfon ../Workspace/example.nc
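
One detail worth knowing about chained operators: CDO applies them from right to left, so the operator closest to the input file runs first. In the command above the remap runs first, then the variable selection, then the time mean. The helper below is a hypothetical convenience function (not part of CDO) that takes the steps in the order they are applied and assembles the argument list; the file names are placeholders.

```python
def build_cdo_command(steps, infile, outfile):
    """Assemble a chained cdo call from operators listed in the order they are applied."""
    # The operator closest to the input file runs first, so reverse the list
    chained = [f"-{step}" for step in reversed(steps)]
    return ["cdo", *chained, infile, outfile]

steps = ["remapnn,grid.cdo", "selname,t2m,precip", "timmean"]
cmd = build_cdo_command(steps, "in.nc", "out.nc")
print(" ".join(cmd))
# cdo -timmean -selname,t2m,precip -remapnn,grid.cdo in.nc out.nc
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where `cdo` is installed. Note that moving the `selname` step to the front of `steps` would select the variables before remapping, which avoids remapping variables we then discard.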

We have not even scratched the surface of CDO. Want more information? 

Go here: https://code.mpimet.mpg.de/projects/cdo

And definitely check out the CDO printable cheat sheet here: 

https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf

The options are fairly endless...

## Annoying aspects of using Python for spatial data (up until now):

* Fairly involved to create a datetime array
* Too many lines just to retrieve the data for a given location
* Can be too involved just to do some basic plots
* Awkward to extract data for a given time stamp
* Main issue: Why should I use it for data exploration?

Wouldn't it be nice if:

* We could just call by a lat/lon pair instead of having to determine the indices?
* We could use a time label to extract data for a time stamp (instead of using masks)?
* We could upscale in time without having to write 30 lines of code?

In other words, wouldn't it be nice if N-dimensional spatial data could be dealt with as we do with data in Pandas?

Well, you are in luck...

<img src="http://xarray.pydata.org/en/v0.10.0/_images/dataset-diagram-logo.png" width="1000">


We are all tired of hearing me explain software packages so...

Let's let one of the core developers of **xarray** do it instead. 

In [None]:
%%html
<iframe width="939" height="528" src="https://www.youtube.com/embed/X0pAhJgySxk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Let's explore xarray

In [None]:
import xarray as xr

#Open access to the monthly data and print the metadata
file_era_interim = '../data/era-interim/era_interim_monthly_197901_201512_upscaled.nc'
fp = xr.open_dataset(file_era_interim)

One big difference is the metadata. No need for an `ncdump -h` to get a general idea of the data.

In [None]:
print(fp)

## You can get more information at the variable level

In [None]:
print(fp['t2m'])

## Sampling data from Houston: Revisited

In [None]:
#Geographic coordinates for Houston
lat = 29.717154
lon = 360 - 95.4041  # longitudes run from 0 to 360, with Greenwich at 0
#Extract the data
data = fp['t2m'].sel(lat=lat,lon=lon,method='nearest')
#Print the metadata
print(data)
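
For comparison, this is roughly the index bookkeeping that `sel(..., method='nearest')` saves us from. The sketch below uses plain NumPy on a hypothetical grid (1.0 degree in latitude, 1.25 degree in longitude); the actual grid of the file may differ, and this is not a description of how xarray does the lookup internally.

```python
import numpy as np

# Hypothetical cell centers of a global grid
lats = -89.5 + 1.0 * np.arange(180)
lons = 0.625 + 1.25 * np.arange(288)

# Houston, with the longitude expressed on a 0 to 360 axis
lat, lon = 29.717154, 360 - 95.4041

# Index of the grid cell whose center is closest to the point
ilat = int(np.abs(lats - lat).argmin())
ilon = int(np.abs(lons - lon).argmin())
print(ilat, ilon)                              # 119 211
print(float(lats[ilat]), float(lons[ilon]))    # 29.5 264.375
```

With these indices the time series would be `array[:, ilat, ilon]`. That works fine, but it is easy to get wrong when the axis order or the longitude convention changes.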

## Making plots is incredibly easy

In [None]:
#It even makes plots much easier
import matplotlib.pyplot as plt

plt.figure(figsize=(13,6))
data.plot()
plt.show()

## Upscale to annual time step

In [None]:
#We can quickly upscale to an annual time step and make the plot
plt.figure(figsize=(13,6))
data.groupby('time.year').mean('time').plot(lw=3)
plt.show()

## Compute and plot the monthly climatology

In [None]:
plt.figure(figsize=(13,6))
data.groupby('time.month').mean('time').plot(lw=3)
plt.show()
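
To see what the two `groupby` calls above compute, here is the same idea written out with plain NumPy on a synthetic monthly series. The numbers are made up, and the sketch assumes the series starts in January and covers whole years, which is exactly the kind of bookkeeping the time labels handle for us.

```python
import numpy as np

# Synthetic monthly series: 3 years of a seasonal cycle that warms by 1 unit per year
nyears = 3
seasonal = np.array([5, 7, 12, 17, 22, 27, 29, 29, 25, 19, 12, 7], dtype=float)
series = np.concatenate([seasonal + year for year in range(nyears)])

# Rows are years, columns are calendar months
by_year = series.reshape(nyears, 12)
annual_mean = by_year.mean(axis=1)   # one value per year (like groupby('time.year'))
climatology = by_year.mean(axis=0)   # one value per calendar month (like groupby('time.month'))

print(annual_mean.shape, climatology.shape)  # (3,) (12,)
```

The annual means increase by one unit per year, while the climatology recovers the seasonal cycle shifted by the average offset.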

## Plot the histogram

In [None]:
plt.figure(figsize=(13,6))
data.plot.hist(bins=20)
plt.show()

## Subset via labels

In [None]:
plt.figure(figsize=(10,6))
data.loc[slice('1979','1985')].plot(lw=4)

## Subset the data in space and time (but with labels!)

In [None]:
#We can directly subset the larger dataset for a given region
data = fp['t2m'].sel(time=slice('1979','1990'),lat=slice(20,60),lon=slice(220,310))
print(data)

## Make a quick spatial plot of the climatological mean

In [None]:
plt.figure(figsize=(14,6))
data.mean('time').plot.imshow()
plt.show()

## Add in Cartopy

In [None]:
import matplotlib.pyplot as plt
import cartopy

plt.figure(figsize=(15,8))
ax = plt.subplot(projection=cartopy.crs.AlbersEqualArea(standard_parallels=(20,50),
                                                        central_longitude=260,
                                                        central_latitude=45))
data.mean('time').plot.pcolormesh(transform=cartopy.crs.PlateCarree(),add_colorbar=True)

# Use existing cartopy shapefiles
ax.add_feature(cartopy.feature.COASTLINE)
ax.add_feature(cartopy.feature.BORDERS)
ax.add_feature(cartopy.feature.STATES)
plt.show()

## Computing temporal statistics

In [None]:
plt.figure(figsize=(15,8))

ax = plt.subplot(projection=cartopy.crs.AlbersEqualArea(standard_parallels=(20,50),
                                                        central_longitude=260,
                                                        central_latitude=45))

data.std('time').plot.pcolormesh(transform=cartopy.crs.PlateCarree(),add_colorbar=True)

ax.add_feature(cartopy.feature.COASTLINE)
ax.add_feature(cartopy.feature.BORDERS)
ax.add_feature(cartopy.feature.STATES)

plt.title(r'Standard deviation ($^o$C)',fontsize=20)
plt.show()

## We can always just treat them as numpy arrays as well

In [None]:
import numpy as np
tmp = np.mean(data,axis=0)
plt.figure(figsize=(16,8))
plt.imshow(np.flipud(tmp))
plt.show()

## Writing data made incredibly easy

In [None]:
data.to_netcdf('../Workspace/test.nc')

In [None]:
%%bash
ncdump -h ../Workspace/test.nc

# Accessing data online via xarray
### ERA5 Reanalysis from Google Cloud Storage

In [None]:
%%html
<iframe width="939" height="528" src="https://github.com/google-research/arco-era5" frameborder="0" allow="accelerometer;  autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

In [None]:
import xarray as xr
ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    chunks=None,
    storage_options=dict(token='anon'),
)

In [None]:
print(ds)

In [None]:
print(ds.time)

In [None]:
data = ds.sel(time=slice(ds.attrs['valid_time_start'], ds.attrs['valid_time_stop']))
print(data)

In [None]:
var_list = list(data.variables.keys())
print('Number of variables:', len(var_list))

In [None]:
for a in var_list: print(a)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
#Plot the temperature for the first time step of the simulation
data['2m_temperature'][0,:,:].plot()
plt.show()

In [None]:
#We could also subset a single variable for the area and period that we want
#(latitude is sliced from 55 down to 25 because it is stored north to south)
temp_data = data['2m_temperature'].sel(time=slice('2021-11-29T22:00:00.000000000', '2021-11-30T22:00:00.000000000'),
                                latitude=slice(55,25),
                                longitude=slice(-110+360,-70+360))
print(temp_data.shape)
temp_data[0,:,:].plot()
plt.show()
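
The `-110+360` and `-70+360` above convert longitudes from the -180 to 180 convention to the 0 to 360 convention used by this dataset. A minimal sketch of both conversions is shown below; the function names are made up for this example.

```python
import numpy as np

def to_0_360(lon):
    """Convert longitudes from the -180..180 convention to 0..360."""
    return np.mod(np.asarray(lon), 360)

def to_pm180(lon):
    """Convert longitudes from the 0..360 convention to -180..180."""
    return np.mod(np.asarray(lon) + 180, 360) - 180

west_east = to_0_360([-110, -70])
back = to_pm180(west_east)
print(west_east.tolist())  # [250, 290]
print(back.tolist())       # [-110, -70]
```

Longitudes that are already in the target convention pass through unchanged, so the helpers are safe to apply to mixed inputs.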

In [None]:
#Let's save this data locally
temp_data.to_netcdf('../Workspace/era5_2m_temperature.nc')

In [None]:
#We could also download multiple variables at once
var_names = ['2m_temperature','evaporation','volumetric_soil_water_layer_1']
#Subset all of them for the area and period that we want
sample = data[var_names].sel(time=slice('2020-09-24T00:00:00.000000000', '2020-09-24T03:00:00.000000000'),
                             latitude=slice(55,25),
                             longitude=slice(360-110,360-70))
sample['volumetric_soil_water_layer_1'][0,:,:].plot()
plt.show()
print(sample)

Let's save the data in Zarr format now

In [None]:
%%bash
#clean up potential previous files
rm -rf ../Workspace/sample.zarr

In [None]:
#Save the data in Zarr
sample.to_zarr('../Workspace/sample.zarr')

## Learn xarray

There are so many powerful capabilities in xarray

Resources? Start here: http://xarray.pydata.org/en/stable/

## Word of caution

* Xarray is very powerful and you should definitely use it.
* However, note that the easier a software package makes things, (in most cases) the less control we have.
* There are many circumstances where xarray could be limiting (probably not in this course though).
* It is good to know what is happening under the hood so you can dig deeper when necessary.

# Too many options?

* You have been presented with a lot of packages and tools that do similar things.
* Which ones you use and how you combine them is a personal choice.
* I am just presenting many good options that will help you solve your current/future research questions.