# Extracting WOA23 data using FishMIP regional model boundaries
**Author:** Denisse Fierro Arcos  
**Date:** 2024-09-04  
  
This script uses the cloud-optimised WOA23 data (i.e., `zarr` files produced in the [02P_WOA_netcdf_to_zarr.ipynb](02P_WOA_netcdf_to_zarr.ipynb) script) to extract data for all FishMIP regional models.

We use the FishMIP regional model shapefile, which is available via our THREDDS server. You can refer to this [notebook](https://github.com/Fish-MIP/FishMIP_regions/blob/main/scripts/02_Mapping_Regional_Models.md) for instructions on how to download this shapefile.

Additionally, you will need a mask containing all FishMIP regional models. Instructions on how to create this mask are available [here](https://github.com/Fish-MIP/FishMIP_regions/blob/main/scripts/03a_Regional_Models_2DMasks.md).

We recommend that you store both the shapefile and mask in the same folder.

## Setting working directory
Remember to change the working directory below to the location of the scripts on your local machine. Update the `your_path` variable below before continuing with the next chunk.

In [5]:
your_path = ''

In [6]:
import os
os.chdir(os.path.join(your_path, 'processing_WOA_data/scripts'))
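
Before changing directory, it can help to confirm that the target folder exists, since `os.chdir` raises an error when it does not. The sketch below is illustrative only: the helper name `get_scripts_dir` is hypothetical and not part of this repository, and it assumes the `processing_WOA_data/scripts` folder layout used in the chunk above.

```python
import os


def get_scripts_dir(your_path):
    # Hypothetical helper: builds the scripts folder path used above
    target = os.path.join(your_path, 'processing_WOA_data', 'scripts')
    # Fail early with a clear message if the folder is missing
    if not os.path.isdir(target):
        raise FileNotFoundError(f'Scripts folder not found: {target}')
    return target
```

With this helper, the call above could be written as `os.chdir(get_scripts_dir(your_path))`.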

## Loading libraries
We will load published Python libraries as well as our custom-made `useful_functions` library.

In [7]:
from dask.distributed import Client
from glob import glob
import xarray as xr
import geopandas as gpd
import useful_functions as uf

## Starting a cluster
This will allow us to automatically parallelise tasks on large datasets.

In [4]:
client = Client(threads_per_worker = 1)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 36495 instead


## Defining basic variables
Before continuing with the script, remember to update the `regions_folder` and `mask_folder` variables below with the locations of the folders where you are storing the FishMIP regional model shapefile and the gridded mask, respectively. Both variables can point to the same folder.

In [8]:
regions_folder = ''
mask_folder = ''

In [20]:
#Loading FishMIP regional models shapefile
rmes = gpd.read_file(os.path.join(regions_folder, 'FishMIP_regional_models.shp'))
#Loading FishMIP regional models gridded mask
mask_ras = xr.open_dataset(os.path.join(mask_folder, 
                                        'gfdl-mom6-cobalt2_areacello_15arcmin_fishMIP_regional_merged.nc')).region
#Renaming coordinate dimensions
mask_ras = mask_ras.rename({'latitude': 'lat', 'longitude': 'lon'})
#Rechunking data to make it more manageable
mask_ras = mask_ras.chunk({'lat': 144, 'lon': 288})

#Getting a list of all WOA zarr files available
WOA_zarr = glob('/g/data/vf71/WOA_data/global/woa*.zarr')

#Define (or create) folders where outputs will be stored
base_out_clim = '/g/data/vf71/WOA_data/regional/climatology'
os.makedirs(base_out_clim, exist_ok = True)
base_out_month = '/g/data/vf71/WOA_data/regional/monthly'
os.makedirs(base_out_month, exist_ok = True)
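
Note that `glob` returns an empty list when no file matches the pattern, in which case a loop over the result does nothing and reports no error. A minimal sketch of a guard is shown below. The helper name `list_zarr_stores` is hypothetical, and sorting the results is an optional choice to keep the processing order reproducible.

```python
from glob import glob


def list_zarr_stores(pattern):
    # Hypothetical helper: sorted list of paths matching the pattern
    stores = sorted(glob(pattern))
    # Stop early if nothing was found, instead of silently skipping the loop
    if not stores:
        raise FileNotFoundError(f'No files match pattern: {pattern}')
    return stores
```

For example, `WOA_zarr = list_zarr_stores('/g/data/vf71/WOA_data/global/woa*.zarr')` would raise an error if the folder is empty or the path is wrong.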

## Extracting WOA data for each region

In [23]:
#Applying functions to WOA files
for f in WOA_zarr:
    #Open data array as ARD
    da = uf.mask_boolean_ard_data(f, mask_ras)   
    base_name = os.path.basename(f).replace('zarr', 'parquet')
    
    #Create full file path
    if 'month' in f:
        #Adding output folder to create full file path
        full_file_out = os.path.join(base_out_month, base_name)
    else:
        full_file_out = os.path.join(base_out_clim, base_name)

    #Extract data for each region included in the regional mask
    for i in rmes.region:
        #Get polygon for each region
        mask = rmes[rmes.region == i]
        #Get name of region and clean it for use in output file
        reg_name = mask['region'].values[0].lower().replace(" ", "-").replace("'", "")
        #File name out - Adding region name after the "woa23_" prefix
        file_out = full_file_out.replace('woa23_', f'woa23_{reg_name}_')
        #Extract data and save masked data - but only if file does not already exist
        if os.path.isdir(file_out) or os.path.isfile(file_out):
            continue
        uf.mask_ard_data(da, mask, file_out)
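
The file naming logic in the loop above can be checked without opening any data. The sketch below reproduces it using only the standard library. The function names, folder paths, file names and region name are all hypothetical and were chosen for illustration only.

```python
import os


def clean_region_name(region):
    # Lower case, spaces replaced by hyphens, apostrophes removed
    return region.lower().replace(" ", "-").replace("'", "")


def build_output_path(zarr_path, region, base_out_month, base_out_clim):
    # Swap 'zarr' for 'parquet' in the file name
    base_name = os.path.basename(zarr_path).replace('zarr', 'parquet')
    # Paths containing 'month' go to the monthly folder, others to climatology
    out_folder = base_out_month if 'month' in zarr_path else base_out_clim
    full_file_out = os.path.join(out_folder, base_name)
    # Insert the cleaned region name after the 'woa23_' prefix
    return full_file_out.replace('woa23_', f'woa23_{clean_region_name(region)}_')


# Hypothetical inputs, for illustration only
monthly_out = build_output_path('/data/global/woa23_month_temp.zarr',
                                "Example Region's Shelf",
                                '/out/monthly', '/out/climatology')
clim_out = build_output_path('/data/global/woa23_clim_temp.zarr',
                             "Example Region's Shelf",
                             '/out/monthly', '/out/climatology')
```

As in the loop above, the `'month'` check is applied to the full path, so a folder name containing `month` would also send the file to the monthly folder.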