# Create masked data for countries or regions

---

This notebook creates masked data: regional averages stored along a single `region` dimension instead of on a 2D lat/lon grid. You can run it on any of the files we have created so far. The country-level data has already been generated and is available.

If you want other regions, you can download shapefiles and use those instead. Information on `regionmask` can be found here:

https://regionmask.readthedocs.io/en/stable/

In [1]:
import cftime
import numpy as np
import xarray as xr
xr.set_options(keep_attrs=True)
import climpred
from tqdm import tqdm
import dask.array as da
import matplotlib.pyplot as plt
from matplotlib.ticker import FixedLocator
import xskillscore as xs
import regionmask
import intake
import intake_geopandas
import warnings
warnings.filterwarnings("ignore")

from dask.distributed import Client
import dask.config
dask.config.set({"array.slicing.split_large_chunks": False})

<dask.config.set at 0x2ba848cf6be0>

In [2]:
client = Client("tcp://10.12.206.54:36264")

Choose your model, data type, and time resolution.

In [114]:
model = "OBS" #OBS, ECMWF, NCEP, or ECCC
data = "anom" #raw or anom or climatology
time = "daily" #biweekly or daily

In [115]:
hinda = xr.open_zarr("/glade/campaign/mmm/c3we/jaye/S2S_zarr/"+model+"."+data+".cat_edges."+time+".geospatial.zarr/", consolidated=True).astype('float32')
cat = intake.open_catalog('https://raw.githubusercontent.com/aaronspring/remote_climate_data/master/master.yaml')

In [116]:
hinda

Output (abridged): every dask array in the dataset has the same layout.

             Array                 Chunk
Bytes        85.03 MB              1.37 MB
Shape        (366, 2, 121, 240)    (92, 1, 31, 120)
Count        65 Tasks              64 Chunks
Type         float32               numpy.ndarray


I understand this is messy, but the data needs to be rechunked differently for each type of dataset. Handling every case with if statements would be messier still, so read my comments, uncomment the line that matches your data, and comment out the rest. Or just try multiple times until it works :)

In [117]:
#hinda = hinda.chunk({"member": "auto", "init": -1, "lead": "auto", "lat": 45, "lon": 60}).persist() #hindcast raw & anom
#hinda = hinda.chunk({"time": -1, "lat": 45, "lon": 60}).persist() #verif
#hinda = hinda.chunk({"dayofyear": -1, "lead": "auto", "lat": 45, "lon": 60}).persist() #climatology for the models
#hinda = hinda.chunk({"dayofyear": -1, "lat": 45, "lon": 60}).persist() #climatology for verification
#hinda = hinda.chunk({"category_edge": -1, "dayofyear": -1, "lead": "auto", "lat": 45, "lon": 60}).persist() #cat_edges for the model
hinda = hinda.chunk({"category_edge": -1, "dayofyear": -1, "lat": 45, "lon": 60}).persist() #cat_edges for verification
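One possible way to avoid the commenting and uncommenting: every option in the cell above uses the same chunk size for a given dimension and differs only in which dimensions are present, so a single table filtered by the dataset's dimensions reproduces each line. This is a sketch, not part of the original workflow; `PREFERRED_CHUNKS` and `chunks_for` are hypothetical names.

```python
# Chunk size per dimension, copied from the commented options above.
# PREFERRED_CHUNKS and chunks_for are hypothetical helper names.
PREFERRED_CHUNKS = {
    "member": "auto",
    "init": -1,
    "lead": "auto",
    "time": -1,
    "dayofyear": -1,
    "category_edge": -1,
    "lat": 45,
    "lon": 60,
}


def chunks_for(dims):
    """Return the chunk spec restricted to the dimensions that are present."""
    return {dim: size for dim, size in PREFERRED_CHUNKS.items() if dim in dims}


# Dimensions of the cat_edges verification data used in this notebook.
spec = chunks_for(["category_edge", "dayofyear", "lat", "lon"])
print(spec)
```

With a dataset in hand this would be called as `hinda = hinda.chunk(chunks_for(hinda.dims)).persist()`.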

List the countries that are available for masking.

In [118]:
region = cat.regionmask.Countries.read()
region

<regionmask.Regions>
Name:     unnamed

Regions:
  0         Ind0                   Indonesia
  1         Mal0                    Malaysia
  2          Chi                       Chile
  3          Bol                     Bolivia
  4          Per                        Peru
 ..          ...                         ...
250          Mac                       Macau
251 AshandCarIsl Ashmore and Cartier Islands
252    BajNueBan             Bajo Nuevo Bank
253       SerBan             Serranilla Bank
254       ScaSho           Scarborough Shoal

[255 regions]

## Running the region mask over the data!

In [119]:
mask = region.mask(hinda, lon_name='lon',lat_name='lat')

In [120]:
var = hinda.groupby(mask).mean('stacked_lat_lon')
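Note that `groupby(mask).mean(...)` gives every grid cell in a region the same weight. On a regular lat/lon grid the cells shrink toward the poles, so an area-weighted mean (weights proportional to the cosine of latitude) can give a different answer for regions that span many latitudes. The toy example below, in plain NumPy with made-up numbers, shows the difference; it is an illustration only and not what the cell above computes.

```python
import numpy as np

lat = np.array([0.0, 30.0, 60.0])
# Toy field on a 3 x 2 (lat, lon) grid that varies with latitude only.
field = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
# Toy mask: region 0 covers the first two latitude rows, region 1 the last row.
mask = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])

# cos(lat) weights broadcast to the shape of the field.
weights = np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones_like(field)


def regional_means(field, mask, weights=None):
    """Mean of `field` per region number in `mask`; NaN mask cells are skipped."""
    if weights is None:
        weights = np.ones_like(field)
    means = {}
    for number in np.unique(mask[~np.isnan(mask)]):
        inside = mask == number
        total = np.sum(field[inside] * weights[inside])
        means[int(number)] = float(total / np.sum(weights[inside]))
    return means


unweighted = regional_means(field, mask)         # equal weight per grid cell
weighted = regional_means(field, mask, weights)  # cos(lat) weighting
print(unweighted, weighted)
```

For small countries the two numbers are nearly identical; the gap grows for regions with a large north-south extent.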

This function replaces the numeric region index with the country names and keeps the abbreviations and region numbers as extra coordinates.

In [121]:
def set_regionmask_labels(ds, region):
    """Set names as region label for region dimension from regionmask regions."""
    abbrevs = region[ds.region.values].abbrevs
    names = region[ds.region.values].names
    ds.coords["abbrevs"] = ("region", abbrevs)
    ds.coords["number"] = ("region", ds.region.values)
    ds["region"] = names
    return ds

var = set_regionmask_labels(var, region)
var.coords

Coordinates:
  * category_edge  (category_edge) float64 0.3333 0.6667
  * dayofyear      (dayofyear) int64 1 2 3 4 5 6 7 ... 361 362 363 364 365 366
  * region         (region) <U35 'Indonesia' 'Malaysia' ... 'Solomon Islands'
    abbrevs        (region) <U15 'Ind0' 'Mal0' 'Chi' ... 'Jam' 'Sam' 'SolIsl'
    number         (region) float64 0.0 1.0 2.0 3.0 ... 207.0 210.0 232.0 233.0
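Once the labels are set, a country can be selected by name, or by abbreviation after swapping the dimension. The snippet below is a minimal sketch on a hand-built dataset with the same coordinate layout; the variable name `t2m` and the values are made up.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the labelled dataset: names on `region`,
# abbreviations and numbers as extra coordinates along it.
ds = xr.Dataset(
    {"t2m": (("region", "dayofyear"), np.arange(6.0).reshape(3, 2))},
    coords={
        "region": ["Chile", "Bolivia", "Peru"],
        "abbrevs": ("region", ["Chi", "Bol", "Per"]),
        "number": ("region", [2.0, 3.0, 4.0]),
        "dayofyear": [1, 2],
    },
)

# Select by full country name.
peru = ds.sel(region="Peru")

# Select by abbreviation: make `abbrevs` the dimension coordinate first.
bolivia = ds.swap_dims({"region": "abbrevs"}).sel(abbrevs="Bol")

print(peru["t2m"].values, bolivia["t2m"].values)
```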

In [122]:
var

Output (abridged): every dask array in the dataset has the same layout.

             Array              Chunk
Bytes        488.98 kB          2.93 kB
Shape        (167, 366, 2)      (1, 366, 2)
Count        1094 Tasks         167 Chunks
Type         float32            numpy.ndarray


Again, choose the chunking that matches your data.

In [123]:
#%time var = var.chunk({"member": -1, "init": -1, "lead": -1, "region": 1}).persist() #hindcast
#%time var = var.chunk({"member": -1, "init": -1, "lead": "auto", "region": 1}).persist() #hindcast
#%time var = var.chunk({"time": -1, "region": 1}).persist() #verif
#%time var = var.chunk({"dayofyear": -1, "lead": -1, "region": 1}).persist() #climatology for the models
#%time var = var.chunk({"dayofyear": -1, "region": 1}).persist() #climatology for verification
#%time var = var.chunk({"category_edge": -1, "dayofyear": -1, "lead": -1, "region": 1}).persist() #cat_edges for the models
%time var = var.chunk({"category_edge": -1, "dayofyear": -1, "region": 1}).persist() #cat_edges for verification

CPU times: user 305 ms, sys: 3.89 ms, total: 309 ms
Wall time: 1.08 s


In [124]:
var

Output (abridged): one row per distinct array layout in the dataset.

Shape            Chunk shape     Bytes        Chunk bytes    Count                    Type
(167,)           (1,)            10.02 kB     60 B           167 Tasks, 167 Chunks    numpy.ndarray
(167,)           (1,)            1.34 kB      8 B            167 Tasks, 167 Chunks    float64
(167, 366, 2)    (1, 366, 2)     488.98 kB    2.93 kB        167 Tasks, 167 Chunks    float32


## Write out to zarr!

Or even netCDF if you want; the data is small enough.

In [125]:
%time var.to_zarr("/glade/campaign/mmm/c3we/jaye/S2S_zarr/"+model+"."+data+".cat_edges."+time+".country.zarr/",mode="w",consolidated=True)

CPU times: user 124 ms, sys: 32.2 ms, total: 156 ms
Wall time: 2.75 s


<xarray.backends.zarr.ZarrStore at 0x2ba8a439fd00>