# Retrieve Data

This notebook defines functions for downloading and saving data from the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6). The dataset is described at https://www.nature.com/articles/s41597-022-01393-4. Accessing and processing the data requires credentials for the Fix6 Amazon S3 bucket.

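To begin, create a hidden.py file that sets the necessary S3 bucket credentials as environment variables. The notebook imports this file as a module (`import hidden`), so it must sit somewhere on the import path, for example next to the notebook or in `../src`. An example template is provided below.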

```python
import os

os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_AWS_ACCESS_KEY_ID'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_AWS_SECRET_ACCESS_KEY'
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
```
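
Once `hidden` has been imported, it can be useful to confirm that the three variables are actually set before trying to reach S3. The following is a minimal sketch of such a check; the helper name `missing_aws_variables` is hypothetical and not part of this project, and the check only tests that each variable is non-empty, not that the credentials are valid.

```python
import os

# Variable names taken from the hidden.py template above
REQUIRED_VARIABLES = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)

def missing_aws_variables(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]

# Report only the variable names, never their values
missing = missing_aws_variables()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All AWS credential variables are set.")
```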

## Configure Environment

The function below adds a directory to `sys.path` so that modules saved outside the notebook's folder, such as those in `../src`, can be imported.

In [1]:
import sys
import os

In [2]:
def configure_environment(relative_folderpath):
    """
    Configure the environment path to include the specified relative directory.

    Parameters:
    - relative_folderpath: The relative path to the directory to be added to the system path.

    Returns:
    - None
    """
    absolute_folderpath = os.path.abspath(os.path.join(os.getcwd(), relative_folderpath))
    sys.path.append(absolute_folderpath)

In [3]:
configure_environment('../src')
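
Because `configure_environment` appends unconditionally, rerunning the cell adds the same folder to `sys.path` again. This is harmless, but a variant that skips folders already on the path keeps `sys.path` tidy. The sketch below shows one way to do that; `add_to_sys_path` is a hypothetical name used only for illustration.

```python
import os
import sys

def add_to_sys_path(relative_folderpath):
    """Add a folder to sys.path once and return its absolute path."""
    # abspath resolves the relative path against the current working directory
    absolute_folderpath = os.path.abspath(relative_folderpath)
    if absolute_folderpath not in sys.path:
        sys.path.append(absolute_folderpath)
    return absolute_folderpath

src_path = add_to_sys_path('../src')
add_to_sys_path('../src')  # The second call leaves sys.path unchanged

print(src_path)
print(sys.path.count(src_path))
```

Note that the path is resolved relative to the current working directory, so the result depends on where the notebook kernel was started.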

## Import Packages

In [4]:
import geopandas as gpd
import numpy as np
import pandas as pd
import rioxarray as rio
import xarray as xr
from pathlib import Path
from shapely.geometry import mapping

import hidden
from nex_gddp_cmip6 import get_nex_dataset

## Define Functions

These functions load saved polygons as a GeoDataFrame, clip an xarray dataset to those polygons, split the clipped dataset by time into train, validate, test, and projection datasets, and save each result as a netCDF file.

In [5]:
def load_polygons(folderpath, filename):
    """
    Load a GeoDataFrame from the specified processed directory.

    Parameters:
    - folderpath (str): The path to the main folder containing the processed subfolder.
    - filename (str): The name of the file (with extension) to load from the processed directory.

    Returns:
    - GeoDataFrame: A GeoDataFrame loaded from the specified file in the processed subfolder.
    """
    # Create a Path object for folderpath to ensure correct path manipulation
    folder = Path(folderpath)

    # Construct the file path for the processed version of the file
    filepath = folder / 'processed' / filename
    
    # Load and return the GeoDataFrame
    return gpd.read_file(str(filepath))

In [6]:
def clip_dataset(dataset, geodataframe):
    """
    Clip a dataset by a GeoDataFrame's boundaries.

    Parameters:
    - dataset: The dataset to be clipped.
    - geodataframe: A GeoDataFrame that defines the region to clip.

    Returns:
    - The clipped dataset.
    """
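    # Tell rioxarray which dimensions hold the x and y coordinates
    ds = dataset.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=False)
    # Tag the dataset with a geographic (longitude/latitude) CRS
    ds = ds.rio.write_crs("EPSG:4326", inplace=False)
    # Reproject the polygons to the dataset's CRS before clipping
    gdf = geodataframe.to_crs(ds.rio.crs)
    
    return ds.rio.clip(gdf.geometry.apply(mapping), gdf.crs)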

In [7]:
def split_dataset(dataset, time_splits):
    """
    Split a dataset into multiple datasets based on specified time splits.

    Parameters:
    - dataset: The xarray.Dataset to be split.
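    - time_splits: A non-empty list of numpy.datetime64 objects, in ascending order, indicating the
      split points. Each split point is excluded from the range before it and included in the range after it.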

    Returns:
    - A list of xarray.Dataset objects representing the datasets split according to the time points.
      The length of the returned list is one more than the number of splits, as it includes the ranges
      before the first split, between each pair of splits, and after the last split.
    """
    datasets = []
    previous_time = None
    
    for current_time in time_splits:
        if previous_time is None:
            # Before the first split point
            ds_split = dataset.sel(time=(dataset.time < current_time))
        else:
            # Between the current and previous split points
            ds_split = dataset.sel(time=(dataset.time >= previous_time) & (dataset.time < current_time))
        datasets.append(ds_split)
        previous_time = current_time
        
    # After the last split point
    datasets.append(dataset.sel(time=(dataset.time >= time_splits[-1])))
    
    return datasets

In [8]:
def save_dataset(dataset, folderpath, filename):
    """
    Save the dataset to a netCDF file. 
    The netCDF file is saved in a raw subdirectory within the specified folder path.
    
    Parameters:
    - dataset: The xarray.Dataset to be saved.
    - folderpath: A string specifying the directory path where the netCDF file will be saved.
    - filename: The name of the netCDF file to save without an extension.
    
    """
    # Create a Path object for folderpath to ensure correct path manipulation
    folder = Path(folderpath)
    
    # Construct the file path inside the raw subfolder, adding the .nc extension
    filepath = folder / 'raw' / (filename + '.nc')
    
    dataset.to_netcdf(filepath)
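
`save_dataset` assumes that the `raw` subfolder already exists; writing to a folder that does not exist will generally fail. The sketch below shows, using only `pathlib`, how the same path could be built while creating the folder on demand. The helper `build_raw_path` is hypothetical and the temporary folder merely stands in for `../data`.

```python
import tempfile
from pathlib import Path

def build_raw_path(folderpath, filename):
    """Return <folderpath>/raw/<filename>.nc, creating the raw folder if needed."""
    filepath = Path(folderpath) / 'raw' / f'{filename}.nc'
    # Create the raw subfolder (and any missing parents) if it is absent
    filepath.parent.mkdir(parents=True, exist_ok=True)
    return filepath

# Demonstrate with a temporary folder standing in for '../data'
with tempfile.TemporaryDirectory() as tmp:
    path = build_raw_path(tmp, 'CMIP6_train_easternmountain')
    print(path.name)
    print(path.parent.is_dir())
```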

## Execute Functions

In [9]:
# Load dataset from S3
ds = get_nex_dataset(['tasmin'], ['projection'])

# Load GeoDataFrame
gdf = load_polygons('../data', 'gdf_easternmountain_polygons')

# Clip dataset
ds_clipped = clip_dataset(ds, gdf)

According to Ibrahim N. Mohammed (2024), the "NEX-GDDP-CMIP6 climate projections is downscaled at a spatial resolution of 0.25 degrees x 0.25 degrees (approximately 25 km x 25 km)" (https://imohamme.github.io/NASAaccess/articles/NEXGDDP-CMIP6.html). In contrast, the OikoLab ERA5 dataset has a spatial resolution of approximately 28 km x 28 km (https://docs.oikolab.com/). Because the two grids do not line up, one dataset must be interpolated onto the other's grid before the NEX-GDDP-CMIP6 and ERA5 datasets can be merged.
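
To illustrate what interpolating one grid onto another involves, the sketch below applies bilinear interpolation to a small two-dimensional field using only NumPy. It is not the method used in this project: the function `regrid_bilinear`, the coordinates, and the field values are all made up for illustration, and both coordinate axes are assumed to be in ascending order. With xarray datasets, the built-in `interp` or `interp_like` methods may be a more direct route.

```python
import numpy as np

def regrid_bilinear(values, src_lat, src_lon, dst_lat, dst_lon):
    """Bilinearly interpolate a 2-D (lat, lon) field onto a new regular grid.

    All coordinate arrays must be sorted in ascending order.
    """
    # First pass: interpolate along longitude for every source latitude row
    along_lon = np.array([np.interp(dst_lon, src_lon, row) for row in values])
    # Second pass: interpolate along latitude for every destination longitude column
    along_lat = np.array([np.interp(dst_lat, src_lat, col) for col in along_lon.T])
    return along_lat.T

# Hypothetical source grid with 0.25-degree spacing
src_lat = np.linspace(35.0, 37.0, 9)
src_lon = np.linspace(-110.0, -108.0, 9)

# Made-up field that varies linearly with latitude and longitude
values = 2.0 * src_lat[:, None] + 3.0 * src_lon[None, :]

# Hypothetical destination grid that is offset from the source grid
dst_lat = np.linspace(35.1, 36.9, 7)
dst_lon = np.linspace(-109.9, -108.1, 7)

regridded = regrid_bilinear(values, src_lat, src_lon, dst_lat, dst_lon)
print(regridded.shape)
```

Destination points outside the source grid are clamped to the nearest edge value by `np.interp`, so the destination grid should lie within the source extent.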

In [10]:
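# Define split points; each date marks the exclusive end of the period it is named after
# (for example, the training period ends just before 2022-01-01)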
dt_train = np.datetime64('2022', 'ns')
dt_validate = np.datetime64('2023', 'ns')
dt_test = np.datetime64('2024', 'ns')
dt_project = np.datetime64('2051', 'ns')

dts = [dt_train, dt_validate, dt_test, dt_project]

# Split dataset into train, validate, test, and projection datasets;
# the final piece (2051 onward) is discarded
ds_train, ds_validate, ds_test, ds_project, _ = split_dataset(ds_clipped, dts)
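
The four split points produce five pieces, each covering a half-open interval that includes its start date and excludes its end date. The sketch below reproduces that assignment on a plain NumPy array of dates so the boundaries are easy to inspect. It uses `np.searchsorted` rather than the boolean masks in `split_dataset`, and the time axis is a hypothetical daily range, not the range of the actual dataset.

```python
import numpy as np

# Hypothetical daily time axis; the real range comes from the dataset
times = np.arange('2020-01-01', '2053-01-01', dtype='datetime64[D]')

# Same split points as above, expressed as the first day of each year
edges = np.array(['2022', '2023', '2024', '2051'], dtype='datetime64[D]')
labels = ['train', 'validate', 'test', 'project', 'discarded']

# side='right' places a date equal to an edge in the piece that starts at that edge
piece_index = np.searchsorted(edges, times, side='right')

for i, label in enumerate(labels):
    piece = times[piece_index == i]
    print(f"{label:>9}: {piece[0]} to {piece[-1]} ({piece.size} days)")
```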

In [11]:
# save_dataset(ds_train, '../data', 'CMIP6_train_easternmountain')

In [12]:
# save_dataset(ds_validate, '../data', 'CMIP6_validate_easternmountain')

In [13]:
# save_dataset(ds_test, '../data', 'CMIP6_test_easternmountain')

In [14]:
# save_dataset(ds_project, '../data', 'CMIP6_project_easternmountain')