# Model Input Data Preparation

In this notebook, we prepare the five files needed to build a foundational class called `DroughtDataset`, based on the `TabularDataset` class from the [torch-spatiotemporal library](https://torch-spatiotemporal.readthedocs.io/en/latest/modules/datasets_prototypes.html#tsl.datasets.prototypes.TabularDataset).

In the next notebook (`6.training.ipynb`), we use this class to structure the data for feeding into deep learning models.

The files we will prepare include:

- **target**: Stores the target variable (`swvl1`) for every time step and node.
- **mask**: Flags which target values are valid and which are missing.
- **covariates**: Contains the feature variables that serve as model inputs.
- **metadata**: Stores the latitude and longitude of each node.
- **distance matrix**: Contains the pairwise geographical distances between nodes, which supports spatial relationship modeling.


## Load data

In [2]:
import xarray as xr

# Load the dataset
ds = xr.open_dataset(
    '/Users/adamprzychodni/Documents/Repos/ml-drought-forecasting/ml-modeling-pipeline/data/04_feature/features.nc'
)

In [3]:
ds

## Target

The `target` array stores the target variable, `swvl1`, for every time step and location (node), with shape (time, nodes, channels).

In [4]:
import numpy as np

# Define the target variables to extract from the dataset
targets = ['swvl1']
data_arrays = [ds[var] for var in targets]

# Flatten latitude and longitude into a single "node" dimension
data_arrays_flattened = [da.stack(node=('latitude', 'longitude')) for da in data_arrays]

# Convert each DataArray to a NumPy array and stack them along a new trailing
# channel axis, giving an array of shape (time, nodes, channels)
target = np.stack([da.to_numpy() for da in data_arrays_flattened], axis=-1)
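
To make the resulting layout concrete, here is a minimal NumPy-only sketch of the same reshaping on a hypothetical toy grid. It assumes the variable's dimensions are ordered `(time, latitude, longitude)`, in which case stacking latitude and longitude should be equivalent to a row-major reshape. The names `grid`, `flat`, and `toy_target` are illustrative and not part of the pipeline.

```python
import numpy as np

# Hypothetical toy grid: 2 time steps, 2 latitudes, 3 longitudes.
n_time, n_lat, n_lon = 2, 2, 3
grid = np.arange(n_time * n_lat * n_lon, dtype=np.float32).reshape(n_time, n_lat, n_lon)

# Collapse (latitude, longitude) into one node axis; latitude varies slowest.
flat = grid.reshape(n_time, n_lat * n_lon)

# Stack the single variable along a new trailing channel axis.
toy_target = np.stack([flat], axis=-1)

print(toy_target.shape)  # (2, 6, 1)
```

With this ordering, node `k` corresponds to latitude index `k // n_lon` and longitude index `k % n_lon`.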


In [5]:
target

array([[[-2.4842884e-06],
        [-2.4842884e-06],
        [-2.4842884e-06],
        ...,
        [-2.4842884e-06],
        [-2.4842884e-06],
        [-2.4842884e-06]],

       [[-3.1925738e-06],
        [-3.1925738e-06],
        [-3.1925738e-06],
        ...,
        [-3.1925738e-06],
        [-3.1925738e-06],
        [-3.1925738e-06]],

       [[ 1.4910474e-06],
        [ 1.4910474e-06],
        [ 1.4910474e-06],
        ...,
        [ 1.4910474e-06],
        [ 1.4910474e-06],
        [ 1.4910474e-06]],

       ...,

       [[ 2.8295908e-06],
        [ 2.8295908e-06],
        [ 2.8295908e-06],
        ...,
        [ 3.4286782e-01],
        [ 3.4286782e-01],
        [ 3.4286782e-01]],

       [[-5.7665166e-06],
        [-5.7665166e-06],
        [-5.7665166e-06],
        ...,
        [ 3.4311864e-01],
        [ 3.4311864e-01],
        [ 3.4311864e-01]],

       [[-3.6517158e-06],
        [-3.6517158e-06],
        [-3.6517158e-06],
        ...,
        [ 3.5116211e-01],
        [ 3.511

In [6]:
# Save as .npy file
np.save('/Users/adamprzychodni/Documents/Repos/ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/target.npy', target)

## Mask 

A binary mask marking whether each value in `target` is valid (1) or missing (0). Here a value counts as missing when it is NaN.

In [7]:
mask = np.where(~np.isnan(target), 1, 0)
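
As a small illustration of what this produces, the sketch below builds the mask for a hypothetical 2 x 2 x 1 array with one missing value. It also shows the boolean equivalent, `~np.isnan(...)`, for cases where a true boolean dtype is preferred. The array `toy` and its values are made up for the example.

```python
import numpy as np

# Hypothetical target with shape (time=2, nodes=2, channels=1) and one NaN.
toy = np.array([[[0.31], [np.nan]],
                [[0.28], [0.30]]], dtype=np.float32)

# Integer mask (1 = valid, 0 = missing), built the same way as in the cell above.
int_mask = np.where(~np.isnan(toy), 1, 0)

# Boolean equivalent of the same mask.
bool_mask = ~np.isnan(toy)

# Fraction of valid entries, a quick sanity check before saving.
valid_fraction = int_mask.mean()

print(int_mask[..., 0])
print(valid_fraction)  # 0.75
```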

In [8]:
mask

array([[[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]],

       [[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]],

       [[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]],

       ...,

       [[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]],

       [[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]],

       [[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]]])

In [9]:
# Save as .npy file
np.save('/Users/adamprzychodni/Documents/Repos/ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/mask.npy', mask)

## Covariates 

The `covariates` array stores the values of the feature variables over time and location.

In [None]:
import numpy as np

# Define covariates: Relevant variables from the dataset to be used as features in the model
# (descriptions assume the ECMWF/ERA5 short-name convention)
variables = [
    'u10',             # 10-metre U wind component
    'v10',             # 10-metre V wind component
    't2m',             # 2-metre temperature
    'sst',             # Sea surface temperature
    'sp',              # Surface pressure
    'tp',              # Total precipitation
    'ssr',             # Surface net solar radiation
    'ssrd',            # Surface solar radiation downwards
    'tcc',             # Total cloud cover
    'cl',              # Lake cover
    'e',               # Evaporation
    'pev',             # Potential evaporation
    'ro',              # Runoff
    'asn',             # Snow albedo
    'rsn',             # Snow density
    'sd',              # Snow depth
    'slt',             # Soil type
    # 'swvl1',           # Volumetric soil water layer 1 (excluded: this is the target)
    'cvh',             # High vegetation cover
    'lai_hv',          # Leaf area index, high vegetation
    'tvh',             # Type of high vegetation
    'z',               # Geopotential
    # 'lsm',             # Land-sea mask
    'season',          # Season (single temporal feature)
]

# Extract the defined covariate variables from the dataset
data_arrays = [ds[var] for var in variables]

# Flatten latitude and longitude into a single "node" dimension
# This combines the spatial coordinates into one dimension labeled "node"
data_arrays_flattened = [da.stack(node=('latitude', 'longitude')) for da in data_arrays]

# Convert each DataArray to a numpy array and add a new channel dimension
# The resulting array has the shape (time, nodes, channels), ready for model input
covariates = np.stack([da.to_numpy() for da in data_arrays_flattened], axis=-1)

# Replace NaN values with 0 to handle missing data
covariates = np.nan_to_num(covariates, nan=0.0)
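
Before saving, it can be worth confirming that the three time-varying arrays agree on their `(time, nodes)` axes. The helper below is a hypothetical sketch (`check_alignment` is not part of the pipeline or of tsl), shown on made-up toy arrays rather than the real data.

```python
import numpy as np

def check_alignment(target, mask, covariates):
    """Return (n_time, n_nodes, n_features) or raise ValueError on a mismatch."""
    if target.shape != mask.shape:
        raise ValueError(f"target {target.shape} and mask {mask.shape} differ")
    if covariates.shape[:2] != target.shape[:2]:
        raise ValueError(
            f"covariates {covariates.shape[:2]} and target {target.shape[:2]} "
            "disagree on (time, nodes)"
        )
    if np.isnan(covariates).any():
        raise ValueError("covariates still contain NaN values")
    return target.shape[0], target.shape[1], covariates.shape[-1]

# Hypothetical toy arrays: 4 time steps, 6 nodes, 1 target channel, 3 features.
toy_target = np.zeros((4, 6, 1), dtype=np.float32)
toy_mask = np.ones((4, 6, 1), dtype=int)
toy_cov = np.nan_to_num(np.full((4, 6, 3), np.nan), nan=0.0)

dims = check_alignment(toy_target, toy_mask, toy_cov)
print(dims)  # (4, 6, 3)
```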


In [None]:
covariates

In [None]:
# Save as .npy file
np.save('/Users/adamprzychodni/Documents/Repos/ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/covariates.npy', covariates)

## Metadata

The `metadata` stores the latitude and longitude of each location (node).

In [None]:
import os
from typing import Optional

import pandas as pd


def generate_and_save_metadata(
    df: pd.DataFrame,
    lat_col: str = 'lat',
    lon_col: str = 'lon',
    save_directory: Optional[str] = None,
) -> pd.DataFrame:
    """
    Generates metadata from a given DataFrame by identifying unique latitude and longitude
    combinations and assigning a unique node ID to each combination. The metadata is then
    saved to a specified Parquet file.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.
    - lat_col (str): The name of the column containing latitude data.
    - lon_col (str): The name of the column containing longitude data.
    - save_directory (str, optional): The directory where the metadata file will be saved. If None, 
                                      the file will be saved in the current working directory.

    Returns:
    - pd.DataFrame: A DataFrame containing the metadata with latitude, longitude, and 'node_id' as columns.
    """
    # Ensure the DataFrame contains necessary columns
    if lat_col not in df.columns or lon_col not in df.columns:
        raise ValueError(f"DataFrame must contain '{lat_col}' and '{lon_col}' columns.")

    # Create a unique node ID for each unique latitude-longitude combination
    unique_lat_lon = df[[lat_col, lon_col]].drop_duplicates().reset_index(drop=True)
    unique_lat_lon['node_id'] = unique_lat_lon.index

    # Create the metadata DataFrame
    metadata = unique_lat_lon.set_index('node_id')

    # Handle save directory and save metadata to Parquet file
    if save_directory:
        os.makedirs(save_directory, exist_ok=True)  # Create directory if it doesn't exist
        file_path = os.path.join(save_directory, "metadata.parquet")
    else:
        file_path = "metadata.parquet"  # Save in the current working directory

    metadata.to_parquet(file_path)
    print(f"Metadata file saved at: {file_path}")

    return metadata


In [None]:
import pandas as pd
import os

# `data_arrays_flattened` was built in the Covariates section above.
# Every stacked DataArray shares the same 'node' index, so the first one is used here.
da_flattened = data_arrays_flattened[0]

# Get the MultiIndex from the 'node' dimension
node_index = da_flattened.indexes['node']

# Extract latitude and longitude from the MultiIndex
latitudes = node_index.get_level_values('latitude').values
longitudes = node_index.get_level_values('longitude').values

# Create a DataFrame with 'lat' and 'lon' columns
df = pd.DataFrame({
    'lat': latitudes,
    'lon': longitudes
})

# Now use your 'generate_and_save_metadata' function
metadata = generate_and_save_metadata(
    df,
    lat_col='lat',
    lon_col='lon',
    save_directory='ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/'
)
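
The node IDs produced here follow the order of the stacked `node` index. As a rough sketch of that ordering, assuming the index is the latitude-major product of the two coordinate axes, the same coordinate table can be rebuilt with plain NumPy. The coordinate values below are hypothetical.

```python
import numpy as np

# Hypothetical coordinate axes of a small regular grid.
lats = np.array([50.0, 49.75])
lons = np.array([14.0, 14.25, 14.5])

# Latitude-major product: latitude varies slowest, longitude fastest.
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
coords = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])

# Row i holds the (lat, lon) pair of node i.
print(coords.shape)  # (6, 2)
print(coords[4])     # [49.75 14.25]
```

Because every row of this table is unique and already in node order, dropping duplicates leaves the order unchanged, so `node_id` equals the position along the node axis of `target` and `covariates`.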


In [None]:
import pandas as pd
metadata = pd.read_parquet("ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/metadata.parquet")

In [None]:
metadata

In [None]:
# Converting the DataFrame to a numpy ndarray
metadata_array = metadata.to_numpy()

In [None]:
# Save as .npy file
np.save('ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/metadata.npy', metadata_array)

## Distance matrix 

The `distance_matrix` stores the pairwise geographical distances between locations (nodes).

In [None]:
from tsl.ops.similarities import geographical_distance
# Compute pairwise geographical distances between nodes; to_rad=True converts
# the (lat, lon) coordinates from degrees to radians first.
distance_matrix = geographical_distance(metadata_array, to_rad=True)
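
For intuition about what such a matrix contains, here is a minimal NumPy sketch of great-circle (haversine) distances between `(lat, lon)` pairs given in degrees. It is an illustration under stated assumptions, not the tsl implementation: the function name `haversine_matrix` and the Earth radius constant are chosen for this example, and tsl's result may differ slightly.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

def haversine_matrix(coords_deg):
    """Pairwise great-circle distances in km for an (N, 2) array of (lat, lon) degrees."""
    coords = np.radians(coords_deg)
    lat = coords[:, 0][:, None]  # column vector of latitudes, shape (N, 1)
    lon = coords[:, 1][:, None]  # column vector of longitudes, shape (N, 1)
    dlat = lat - lat.T           # (N, N) latitude differences
    dlon = lon - lon.T           # (N, N) longitude differences
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point excursions outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Two points on the equator, 90 degrees of longitude apart.
toy_coords = np.array([[0.0, 0.0], [0.0, 90.0]])
toy_distances = haversine_matrix(toy_coords)
print(toy_distances.shape)  # (2, 2)
```

The result is symmetric with a zero diagonal, and the off-diagonal entry here is a quarter of the Earth's circumference.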

In [None]:
distance_matrix

In [None]:
# Save as .npy file
np.save('ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/distances.npy', distance_matrix)

## Test

The class below is a quick test that the five saved files load correctly into a `TabularDataset` subclass.

In [None]:
import os
from typing import Optional

import numpy as np

from tsl.datasets.prototypes import TabularDataset

class SoilWaterDataset(TabularDataset):

    similarity_options = {'distance', 'correlation'}

    def __init__(self,
                 root: str = None
                 ):

        self.root = root

        # Load data
        target, mask, u, dist, metadata = self.load()

        # Exogenous arrays passed to the parent class as named covariates
        covariates = {
            'u': u,
            'metadata': metadata,
            'distances': dist,
        }

        super().__init__(target=target,
                         mask=mask,
                         covariates=covariates,
                         similarity_score='distance',
                         temporal_aggregation='mean',
                         spatial_aggregation='mean',
                         name='DroughtDataset')

    def load(self):
        """
        Load data from files.

        Returns:
            tuple: Containing target, mask, covariates, distances, and metadata.
        """
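        # os.path.join works whether or not `root` ends with a path separator
        target_path = os.path.join(self.root, "target.npy")
        mask_path = os.path.join(self.root, "mask.npy")
        dist_path = os.path.join(self.root, "distances.npy")
        covariates_path = os.path.join(self.root, "covariates.npy")
        metadata_path = os.path.join(self.root, "metadata.npy")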

        # Load main data
        target = np.load(target_path)
        mask = np.load(mask_path)
        u = np.load(covariates_path)
        dist = np.load(dist_path)
        metadata = np.load(metadata_path)

        return target, mask, u, dist, metadata


    def compute_similarity(self, method: str, **kwargs):
        """
        Compute similarity matrix based on the specified method.

        Args:
            method (str): The similarity computation method ('distance' or 'correlation').
            **kwargs: Additional keyword arguments for similarity computation.

        Returns:
            numpy.ndarray: Computed similarity matrix.

        Raises:
            ValueError: If an unknown similarity method is provided.
        """
        if method == "distance":
            # Calculate a Gaussian kernel similarity from the distance matrix, using a default or provided 'theta'
            theta = kwargs.get('theta', np.std(self.distances))
            return self.gaussian_kernel(self.distances, theta=theta)
        elif method == "correlation":
            # Compute the average correlation between nodes over the target features
            # Reshape target data to have nodes as columns
            target_values = self.target.values.reshape(len(self.target), -1, len(self.target_node_feature))
            # Average over the target features
            target_mean = target_values.mean(axis=2)
            # Compute correlation between nodes
            corr = np.corrcoef(target_mean, rowvar=False)
            return (corr + 1) / 2  # Normalize to [0, 1]
        else:
            raise ValueError(f"Unknown similarity method: {method}")

    @staticmethod
    def gaussian_kernel(distances, theta):
        """
        Compute Gaussian kernel similarity from distances.

        Args:
            distances (numpy.ndarray): Distance matrix.
            theta (float): Kernel bandwidth parameter.

        Returns:
            numpy.ndarray: Gaussian kernel similarity matrix.
        """
        return np.exp(-(distances ** 2) / (2 * (theta ** 2)))
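
To see how the Gaussian kernel turns distances into similarities, the standalone sketch below applies the same formula to a hypothetical 3-node distance matrix, using the standard deviation of the distances as the bandwidth `theta`, as the `distance` branch does by default. The matrix values are made up.

```python
import numpy as np

def gaussian_kernel(distances, theta):
    """Gaussian kernel: exp(-d^2 / (2 * theta^2))."""
    return np.exp(-(distances ** 2) / (2 * theta ** 2))

# Hypothetical symmetric distance matrix (km) for three nodes.
toy_dist = np.array([[0.0, 10.0, 50.0],
                     [10.0, 0.0, 40.0],
                     [50.0, 40.0, 0.0]])

theta = np.std(toy_dist)  # default bandwidth used in the class above
similarity = gaussian_kernel(toy_dist, theta)

print(np.round(similarity, 3))
```

Each node has similarity 1 with itself, and similarity decays smoothly toward 0 as distance grows, so nearby nodes end up with larger weights than distant ones.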

In [None]:
dataset = SoilWaterDataset(root='ml-drought-forecasting/ml-modeling-pipeline/data/05_model_input/')
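
The `correlation` branch of `compute_similarity` can be illustrated the same way. The sketch below is a standalone example on synthetic series, not a call into the class: it correlates the columns of a hypothetical `(time, nodes)` array and rescales the result from [-1, 1] to [0, 1].

```python
import numpy as np

# Synthetic (time, nodes) series: node 1 is a scaled copy of node 0,
# node 2 is its mirror image.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
series = np.column_stack([base, 2.0 * base + 1.0, -base])

# rowvar=False treats each column (node) as one variable.
corr = np.corrcoef(series, rowvar=False)

# Map correlations from [-1, 1] to similarities in [0, 1].
corr_similarity = (corr + 1) / 2

print(corr_similarity.shape)  # (3, 3)
```

Perfectly correlated nodes map to a similarity of 1 and perfectly anti-correlated nodes to 0.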