# Retrieve Data

This notebook contains functions for downloading and saving data from the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6). Information about NEX-GDDP-CMIP6 is available at https://www.nature.com/articles/s41597-022-01393-4. To access and process this data, users will need credentials for the Fix6 Amazon S3 bucket.

To begin, create a hidden.py file with the necessary S3 bucket credentials. An example template is provided below.

import os

os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_AWS_ACCESS_KEY_ID'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_AWS_SECRET_ACCESS_KEY'
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
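
Before requesting data, it can help to confirm that the variables from the template are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_aws_variables` is hypothetical and not part of this project, and the placeholder check assumes the `YOUR_` prefix used in the template above.

```python
import os

# Variable names taken from the hidden.py template above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_variables(environ=None):
    """Return the required AWS variables that are unset, empty, or still placeholders.

    Hypothetical helper for illustration; it only inspects the environment
    and does not contact AWS or validate the credentials themselves.
    """
    env = os.environ if environ is None else environ
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Treat the template's 'YOUR_...' values as not yet filled in.
        if not value or value.startswith("YOUR_"):
            missing.append(name)
    return missing
```

Calling `missing_aws_variables()` after `import hidden` returns an empty list when all three variables have been filled in.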

## Configure Environment

This function appends a directory to `sys.path` so that modules saved outside the notebook's folder (here, `../src`) can be imported.

In [1]:
import sys
import os

In [2]:
def configure_environment(relative_folderpath):
    """
    Configure the environment path to include the specified relative directory.

    Parameters:
    - relative_folderpath: The relative path to the directory to be added to the system path.

    Returns:
    - None
    """
    absolute_folderpath = os.path.abspath(os.path.join(os.getcwd(), relative_folderpath))
    sys.path.append(absolute_folderpath)

In [3]:
configure_environment('../src')

## Import Packages

In [4]:
import geopandas as gpd
import numpy as np
import pandas as pd
import rioxarray  # noqa: F401 (registers the .rio accessor used by clip_dataset)
from pathlib import Path
from shapely.geometry import mapping

import hidden
from nex_gddp_cmip6 import get_nex_dataset

## Define Functions

These functions load saved polygons as a GeoDataFrame, clip an xarray dataset to those polygons, split the clipped dataset at given time points (here into train, validate, test, and projection periods), and save each result as a CSV file.

In [5]:
def load_polygons(folderpath, filename):
    """
    Load a GeoDataFrame from the specified processed directory.

    Parameters:
    - folderpath (str): The path to the main folder containing the processed subfolder.
    - filename (str): The name of the file (with extension) to load from the processed directory.

    Returns:
    - GeoDataFrame: A GeoDataFrame loaded from the specified file in the processed subfolder.
    """
    # Create a Path object for folderpath to ensure correct path manipulation
    folder = Path(folderpath)

    # Construct the file path for the processed version of the file
    filepath = folder / 'processed' / filename
    
    # Load and return the GeoDataFrame
    return gpd.read_file(str(filepath))

In [6]:
def clip_dataset(dataset, geodataframe):
    """
    Clip a dataset by a GeoDataFrame's boundaries.

    Parameters:
    - dataset: The dataset to be clipped.
    - geodataframe: A GeoDataFrame that defines the region to clip.

    Returns:
    - The clipped dataset.
    """
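    # Tag the dataset with EPSG:4326 (WGS 84 longitude/latitude) so rioxarray can interpret its coordinates
    rio_dataset = dataset.rio.write_crs("EPSG:4326")
    # Clip to the polygons; the second argument is the CRS of the geometries
    return rio_dataset.rio.clip(geodataframe.geometry.apply(mapping), geodataframe.crs)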

In [7]:
def split_dataset(dataset, time_splits):
    """
    Split a dataset into multiple datasets based on specified time splits.

    Parameters:
    - dataset: The xarray.Dataset to be split.
    - time_splits: A list of numpy.datetime64 objects indicating the split points.

    Returns:
    - A list of xarray.Dataset objects representing the datasets split according to the time points.
      The length of the returned list is one more than the number of splits, as it includes the ranges
      before the first split, between each pair of splits, and after the last split.
    """
    datasets = []
    previous_time = None
    
    for current_time in time_splits:
        if previous_time is None:
            # Before the first split point
            ds_split = dataset.sel(time=(dataset.time < current_time))
        else:
            # Between the current and previous split points
            ds_split = dataset.sel(time=(dataset.time >= previous_time) & (dataset.time < current_time))
        datasets.append(ds_split)
        previous_time = current_time
        
    # After the last split point
    datasets.append(dataset.sel(time=(dataset.time >= time_splits[-1])))
    
    return datasets
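
The half-open ranges that `split_dataset` builds can be checked on a plain NumPy time axis. This is a sketch on synthetic daily timestamps, not the NEX-GDDP-CMIP6 data; the helper `split_times` is hypothetical and applies the same `>= previous` and `< current` masks to an array instead of an xarray dataset.

```python
import numpy as np


def split_times(times, time_splits):
    """Split a datetime64 array into half-open ranges [previous, current).

    Hypothetical helper for illustration; it mirrors the masks used in
    split_dataset but works on a bare NumPy array.
    """
    # Range boundaries: open-ended before the first split and after the last one.
    edges = [None, *time_splits, None]
    pieces = []
    for lower, upper in zip(edges[:-1], edges[1:]):
        mask = np.ones(times.shape, dtype=bool)
        if lower is not None:
            mask &= times >= lower  # split point belongs to the later range
        if upper is not None:
            mask &= times < upper   # and is excluded from the earlier one
        pieces.append(times[mask])
    return pieces


# Synthetic daily timestamps covering 2021 through 2024.
times = np.arange("2021-01-01", "2025-01-01", dtype="datetime64[D]")
splits = [np.datetime64("2022"), np.datetime64("2023"), np.datetime64("2024")]

pieces = split_times(times, splits)
sizes = [len(piece) for piece in pieces]
print(sizes)  # [365, 365, 365, 366]
```

Three split points yield four ranges, and each split date falls in the later range, so no timestamp is counted twice or dropped.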

In [8]:
def save_dataframe(dataset, folderpath, filename):
    """
    Save the dataset to a CSV file after converting to a DataFrame, dropping NA, and removing 
    unwanted columns. The CSV file is saved in a raw subdirectory within the specified folder path.

    Parameters:
    - dataset: The xarray.Dataset to be saved as a DataFrame. It is expected to contain 
      geospatial data that may include a spatial_ref column.
    - folderpath: A string specifying the directory path where the CSV file will be saved. The 
      function will save the file within a raw subdirectory of this path.
    - filename: The name of the CSV file to save. This function assumes the filename does not 
      include any directory path.

    Returns:
    - None
    """
    # Create a Path object for folderpath to ensure correct path manipulation
    folder = Path(folderpath)

    # Construct the file path inside the 'raw' subfolder and make sure the subfolder exists
    filepath = folder / 'raw' / filename
    filepath.parent.mkdir(parents=True, exist_ok=True)
    
    # Convert the dataset to a DataFrame, drop NA values and the 'spatial_ref' column (if present),
    # reset the index, and write the result to CSV
    df = dataset.to_dataframe()
    df.dropna().drop(columns='spatial_ref', errors='ignore').reset_index().to_csv(filepath, index=False)
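
The cleaning step in `save_dataframe` can be tried on a small synthetic DataFrame that stands in for `dataset.to_dataframe()`. This is only a sketch: the values, the `lat` coordinate, and the file name `example.gz` are made up, and the sketch passes `errors="ignore"` and creates the `raw` folder so that it runs in an empty temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic stand-in for dataset.to_dataframe(): a (time, lat) MultiIndex,
# one NaN cell (as left by clipping), and a constant spatial_ref column.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2022-01-01", "2022-01-02"]), [35.125, 35.375]],
    names=["time", "lat"],
)
df = pd.DataFrame(
    {"tasmin": [270.1, float("nan"), 271.3, 272.0], "spatial_ref": 0},
    index=index,
)

# Drop rows with NA, drop spatial_ref if it exists, and turn the index into columns.
cleaned = df.dropna().drop(columns="spatial_ref", errors="ignore").reset_index()

with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / "raw" / "example.gz"
    filepath.parent.mkdir(parents=True, exist_ok=True)
    # pandas infers gzip compression from the ".gz" suffix on write and on read.
    cleaned.to_csv(filepath, index=False)
    restored = pd.read_csv(filepath, parse_dates=["time"])

print(list(restored.columns))  # ['time', 'lat', 'tasmin']
```

Because the file names used in this notebook end in `.gz`, the saved files are gzip-compressed CSVs and can be read back with `pd.read_csv` without extra arguments.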


## Execute Functions

In [9]:
# Load dataset from S3
ds = get_nex_dataset(['tasmin'], ['projection'])

# Load GeoDataFrame
gdf = load_polygons('../data', 'gdf_easternmountain_polygons')

# Clip dataset
ds_clipped = clip_dataset(ds, gdf)

In [10]:
ds_clipped

| | Array | Chunk |
|---|---|---|
| Bytes | 4.45 GiB | 575.15 MiB |
| Shape | (20, 4, 31411, 19, 25) | (8, 4, 31411, 10, 15) |
| Dask graph | 18 chunks in 7 graph layers | 18 chunks in 7 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


In [11]:
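# Define split points; each date is the exclusive end of the period it names
dt_train = np.datetime64('2022', 'ns')      # train: before 2022
dt_validate = np.datetime64('2023', 'ns')   # validate: 2022
dt_test = np.datetime64('2024', 'ns')       # test: 2023
dt_project = np.datetime64('2051', 'ns')    # project: 2024 through 2050

dts = [dt_train, dt_validate, dt_test, dt_project]

# Split dataset into train, validate, test, and projection datasets;
# the final piece (2051 onward) is discarded
ds_train, ds_validate, ds_test, ds_project, _ = split_dataset(ds_clipped, dts)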

In [12]:
# save_dataframe(ds_train, '../data', 'CMIP6_train_easternmountain.gz')

In [13]:
# save_dataframe(ds_validate, '../data', 'CMIP6_validate_easternmountain.gz')

In [14]:
# save_dataframe(ds_test, '../data', 'CMIP6_test_easternmountain.gz')