## Create Training and Validation Set

### Importing Dependencies

We import the required libraries and add the directory two levels above the working directory to `sys.path`, so that the helper modules under `utils` can be imported.

In [3]:
import os
import networkx as nx
import sys
import torch
import import_ipynb 

src_path = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
if src_path not in sys.path:
    sys.path.append(src_path)

from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.data import Batch

from utils.wrapper.transform_networkx_into_pyg import transform_networkx_into_pyg
from utils.helper_functions.add_dummy_node_features import add_dummy_node_features

### Loading and Preparing Graph Data from GraphML Files

The `load_graphml_files` function loads a series of bicycle traffic network graphs stored in GraphML format and prepares them for training with PyTorch Geometric (PyG). The objective is to convert each monthly graph into a format compatible with Graph Neural Networks (GNNs), ensuring that both node and edge features are retained.

Each NetworkX graph is converted into a PyG `Data` object using the custom helper function `transform_networkx_into_pyg`, which preserves the following attributes during conversion:

- **Node attributes**:
  - `lon` (longitude)
  - `lat` (latitude)

- **Edge attributes**:
  - `speed_rel` (relative speed)
  - `month` (for cyclical encoding of months)
  - `year` (the year of the traffic data)
  - `id` (a unique identifier for each edge)
  - `tracks` (the number of bicycles traveling from the starting to the ending point)
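The `month` attribute is described as intended for cyclical encoding. One common way to do this, shown here as a minimal sketch and not necessarily the encoding used in this project, maps the month onto the unit circle with sine and cosine so that December and January end up close together. The function name `encode_month` and the `period` parameter are illustrative assumptions; the modulo makes the sketch work for months numbered 0-11 as well as 1-12.

```python
import math


def encode_month(month, period=12):
    """Map a month number to (sin, cos) coordinates on the unit circle."""
    # The modulo lets 12 (December) and 0 share the same position.
    angle = 2.0 * math.pi * (month % period) / period
    return math.sin(angle), math.cos(angle)


# Hypothetical comparison: distance between encoded months.
december = encode_month(12)
january = encode_month(1)
june = encode_month(6)

# December is closer to January than to June in the encoded space,
# which a raw integer encoding (12 vs. 1) would not reflect.
print(math.dist(december, january) < math.dist(december, june))
```

With this representation the single `month` column would become two columns, which changes the column positions in `edge_attr`.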

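PyG expects each graph as a `Data` object with node features `x` of shape `[num_nodes, num_node_features]`, an `edge_index` of shape `[2, num_edges]`, and edge features `edge_attr` of shape `[num_edges, num_edge_features]`. This layout matters in particular for models such as GATv2 that use both node and edge attributes.

The function returns `data_list`, a list of `torch_geometric.data.Data` objects, one per monthly graph.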
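To make the target layout concrete, the following sketch builds an `edge_index`-style pair of index lists and an `edge_attr`-style table from a small edge list. It uses plain Python containers instead of NetworkX graphs and tensors, and it is not the implementation of `transform_networkx_into_pyg`; the function name `to_edge_arrays` and the sample station labels are hypothetical.

```python
def to_edge_arrays(edges, attr_keys):
    """Convert (source, target, attrs) triples into index lists and an attribute table.

    edges     : list of (source_label, target_label, attr_dict)
    attr_keys : attribute names, in the column order wanted for edge_attr
    """
    # Assign each node label a consecutive integer id in order of first appearance.
    node_ids = {}
    for source, target, _ in edges:
        for label in (source, target):
            node_ids.setdefault(label, len(node_ids))

    # edge_index layout: first row holds source ids, second row holds target ids.
    edge_index = [
        [node_ids[source] for source, _, _ in edges],
        [node_ids[target] for _, target, _ in edges],
    ]

    # edge_attr layout: one row per edge, one column per attribute key.
    edge_attr = [[float(attrs[key]) for key in attr_keys] for _, _, attrs in edges]
    return node_ids, edge_index, edge_attr


# Hypothetical three-station network.
edges = [
    ("A", "B", {"speed_rel": 0.9, "tracks": 12}),
    ("B", "C", {"speed_rel": 1.1, "tracks": 4}),
    ("C", "A", {"speed_rel": 1.0, "tracks": 7}),
]
node_ids, edge_index, edge_attr = to_edge_arrays(edges, ["speed_rel", "tracks"])
print(edge_index)
print(edge_attr)
```

The column order matters later: if the edge attributes follow the order listed above (`speed_rel`, `month`, `year`, `id`, `tracks`), then `tracks` sits at index 4, which would match the default `edge_attr_key_index=4` of `split_train_val`. Whether the helper actually produces this order is an assumption worth checking.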


In [4]:
def load_graphml_files(years=(2021, 2022, 2023)):
    """
    Loads multiple directed graph files in GraphML format and converts them
    into PyTorch Geometric (PyG) Data objects.

    Parameters:
    -----------
    years : sequence of int, optional (default=(2021, 2022, 2023))
        Years for which graph files should be loaded.
        Assumes 12 monthly files per year.

    Returns:
    --------
    data_list : list of torch_geometric.data.Data
        List of PyG data objects created from the loaded NetworkX graphs.
    """

    data_list = []

    for year in years:
        for i in range(12):
            path = f"../../../data/graphml/{year}/bike_network_{year}_{i}.graphml"
            if not os.path.exists(path):
                print(f"[WARN] File not found: {path}")
                continue

            G_nx = nx.read_graphml(path)
            G_nx = nx.DiGraph(G_nx)

            data = transform_networkx_into_pyg(G_nx)
            data_list.append(data)

    print(f"Number of loaded graphs: {len(data_list)}")
    return data_list


### Adding Dummy Node Features to the Graphs

This step adds **dummy node features** to the graphs so that every node has a **feature dimension**, even if no node features were originally present. GNN layers operate on a node feature matrix, so this is required before the graphs can be passed to a model.

**NOTE:** At a later stage, once we have implemented feature engineering, we will replace the dummy features with the engineered ones.
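As an illustration of what a dummy feature matrix looks like, the sketch below creates one constant row per node. It assumes that `add_dummy_node_features` assigns the same constant vector to every node; the helper's actual implementation is not shown in this notebook, and the name `constant_node_features` is hypothetical.

```python
def constant_node_features(num_nodes, feature_dim=1, value=1.0):
    """Return a num_nodes x feature_dim table filled with a constant value."""
    return [[value] * feature_dim for _ in range(num_nodes)]


# Three nodes with a single constant feature each.
features = constant_node_features(num_nodes=3, feature_dim=1, value=1.0)
print(features)
```

With tensors, the same matrix could be created with `torch.full((num_nodes, feature_dim), value)`. Because every node receives identical features, a model can only tell nodes apart through the graph structure and the edge attributes.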


In [5]:
def add_features(data_list, feature_dim=1, value=1.0):
    # add dummy feature
    return add_dummy_node_features(data_list, feature_dim=feature_dim, value=value)


### Train-Validation Split

For predicting edge attributes (e.g., `tracks`), an 80/20 train/validation split is applied to the **existing edges within each graph**.

In our application, the **nodes represent physically existing bike stations**, which typically do not change or only change very infrequently. The aim of the analysis is to model the **connections between stations**, i.e., to understand and predict how many bicycles move along certain routes (in other words: edges with weights).

A **node-level split** (i.e., an 80/20 split of the nodes themselves) would mean that some stations would be completely unseen during training. This would not be meaningful because:

- The **stations themselves are not the prediction target**.
- It is the **relationships or transitions between the stations (edges)** that should be modeled.
- In deployment, **all stations are known** (they are physically installed in the system).

We initially considered the `RandomLinkSplit` transform, but it is designed for classic link prediction, i.e., binary classification of whether an edge exists. By default, it generates negative examples (non-existing edges) in addition to the positive examples (existing edges). Because our task is edge attribute regression, this transform is unsuitable, so we implemented the split manually.
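The idea behind the manual split can be sketched with plain Python lists: shuffle the edge positions of one graph, reserve a fraction for validation, and move the target column out of the feature rows. This is a simplified illustration under assumptions, not the tensor-based implementation; the function names and the sample row (columns assumed to be `speed_rel`, `month`, `year`, `id`, `tracks`) are hypothetical.

```python
import random


def split_edge_indices(num_edges, val_ratio=0.2, seed=42):
    """Return (train_idx, val_idx) as disjoint lists covering all edge positions."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    perm = list(range(num_edges))
    rng.shuffle(perm)
    num_val = int(val_ratio * num_edges)
    return perm[num_val:], perm[:num_val]


def separate_target(edge_rows, target_index):
    """Split each attribute row into input features and the regression target."""
    features = [row[:target_index] + row[target_index + 1:] for row in edge_rows]
    targets = [row[target_index] for row in edge_rows]
    return features, targets


train_idx, val_idx = split_edge_indices(num_edges=10)
print(len(train_idx), len(val_idx))

# One hypothetical edge row; the last column is the target.
features, targets = separate_target([[0.9, 1, 2021, 17, 12.0]], target_index=4)
print(features, targets)
```

No negative edges are created at any point: every edge keeps its real attribute values, and only the target column is separated from the inputs.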


In [7]:
def split_train_val(data_list, val_ratio=0.2, seed=42, save_dir="../../../data/data_splits", edge_attr_key_index=4):
    """
    Splits a list of PyTorch Geometric Data objects into training and validation sets
    for edge regression tasks.
    
    Parameters:
    -----------
    data_list : list of torch_geometric.data.Data
        List of graphs to be split into train and validation sets.
    
    val_ratio : float, optional (default=0.2)
        Proportion of edges to be used for validation in each graph.
    
    seed : int, optional (default=42)
        Random seed for reproducibility.
    
    save_dir : str, optional (default='../../../data/data_splits')
        Directory path where the split datasets will be saved.

    edge_attr_key_index : int, optional (default=4)
        The index of the edge attribute that should be predicted (e.g. 'tracks').
    
    Returns:
    --------
    train_data, val_data : tuple of torch_geometric.data.Batch
        Batched training and validation graphs. Both are also saved to `save_dir`.
    """

    torch.manual_seed(seed)
    os.makedirs(save_dir, exist_ok=True)

    train_save_path = os.path.join(save_dir, "train_data.pt")
    val_save_path = os.path.join(save_dir, "val_data.pt")

    train_list, val_list = [], []
    total_train_edges = 0
    total_val_edges = 0

    for i, data in enumerate(data_list):
        edge_index = data.edge_index
        edge_attr = data.edge_attr

        # Randomly assign each edge of this graph to validation or training
        num_edges = edge_index.size(1)
        num_val = int(val_ratio * num_edges)
        perm = torch.randperm(num_edges)

        val_idx = perm[:num_val]
        train_idx = perm[num_val:]

        # Training Data: the target column is removed from edge_attr and stored in y
        train_data = data.clone()
        train_data.edge_index = edge_index[:, train_idx]
        train_data.edge_attr = torch.cat([edge_attr[train_idx][:, :edge_attr_key_index], edge_attr[train_idx][:, edge_attr_key_index+1:]], dim=1)
        train_data.y = edge_attr[train_idx][:, edge_attr_key_index]

        # Validation Data: same treatment for the held-out edges
        val_data = data.clone()
        val_data.edge_index = edge_index[:, val_idx]
        val_data.edge_attr = torch.cat([edge_attr[val_idx][:, :edge_attr_key_index], edge_attr[val_idx][:, edge_attr_key_index+1:]], dim=1)
        val_data.y = edge_attr[val_idx][:, edge_attr_key_index]

        train_list.append(train_data)
        val_list.append(val_data)

        total_train_edges += train_data.edge_index.size(1)
        total_val_edges += val_data.edge_index.size(1)

        print(f"Graph {i}: Train edges = {train_data.edge_index.size(1)}, Val edges = {val_data.edge_index.size(1)}")

    # Batch the split data
    train_data = Batch.from_data_list(train_list)
    val_data = Batch.from_data_list(val_list)

    # Save
    torch.save(train_data, train_save_path)
    torch.save(val_data, val_save_path)

    print(f"\nTotal train edges (batched): {total_train_edges}")
    print(f"Total val edges   (batched): {total_val_edges}")
    print(f"\nTrain data saved to: {train_save_path}")
    print(f"Val data saved to: {val_save_path}")

    # Return the batches so that callers such as main() can process them further
    return train_data, val_data


### Executing the Pipeline for Creating Training and Validation Data

This cell defines a `main` function that orchestrates the entire pipeline for generating training and validation splits for Graph Neural Networks (GNNs). The previously defined functions are called in sequence to load the graph data and split the edges; the result is then normalized and saved. Feature engineering is not yet part of the pipeline.


In [5]:
def main(years=(2021, 2022, 2023), save_dir="../../../data/data_splits", val_ratio=0.2):
    """
    Main pipeline for loading graph data, preprocessing it, and splitting into train/val sets.

    Parameters:
    -----------
    years : sequence of int, optional (default=(2021, 2022, 2023))
        The years for which GraphML files will be loaded.

    save_dir : str, optional (default='../../../data/data_splits')
        Directory where the processed train and validation data will be saved.

    val_ratio : float, optional (default=0.2)
        Proportion of edges to be used for validation during the train/validation split.

    Returns:
    --------
    None
    """

    os.makedirs(save_dir, exist_ok=True)
    train_save_path = os.path.join(save_dir, "train_data.pt")
    val_save_path = os.path.join(save_dir, "val_data.pt")

    # Load data
    data_list = load_graphml_files(years)

    # Split the edges of each graph (expects split_train_val to return both batches)
    train_data, val_data = split_train_val(data_list, val_ratio=val_ratio, save_dir=save_dir)

    # Normalize features
    # NOTE: normalize_feature is not defined in this section of the notebook;
    # it is assumed to be defined or imported elsewhere.
    train_data, val_data = normalize_feature(train_data, val_data)

    # Save the normalized data (overwrites the files written by split_train_val)
    torch.save(train_data, train_save_path)
    torch.save(val_data, val_save_path)

    print(f"\nTrain data saved to: {train_save_path}")
    print(f"Val data saved to: {val_save_path}")

# Call the main function
main()

