## Enforcing causal paths in tabular GNN - Full data-graph (x, edge_index) child knowledge

To train a GNN that respects a minimal set of causal paths, we pass it two types of graph-data:
- ancestor: contains only the ancestor nodes
- child: contains the ancestor and child nodes, plus the directed edge "ancestor -> child"

The layers are chained, so that child(n) becomes the ancestor of child(n+1).

In other words, we build two graph-data objects (parent and child) and suggest causality by adding the child nodes, together with the edge parent -> child, to the child data.

For now, each edge has a single parent, and we use two layers:
- ancestor layer: age -> occupation
- child layer: occupation -> hours of work per week

For now, to avoid spurious correlations, we also keep only the ancestor features (age, sex, race, native country) as node features in every graph-data. Open question: starting from this ancestor "blind knowledge", should the child also be added as a node feature (in addition to the edge parent -> child) in the child graph-data?
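The layering described above can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping, not the code used in this notebook: the helper name `causal_layers` and the dictionary layout are assumptions made for the example. Each consecutive pair of the causal chain gives one layer, and the columns used as an edge are excluded from that layer's node features.

```python
def causal_layers(ancestor_features, causal_chain):
    """Return one {node_features, edge} spec per consecutive pair of the chain.

    Hypothetical helper: the columns that define the edge of a layer are
    dropped from the node features of that layer.
    """
    layers = []
    for parent, child in zip(causal_chain, causal_chain[1:]):
        node_features = [f for f in ancestor_features if f not in (parent, child)]
        layers.append({"node_features": node_features, "edge": (parent, child)})
    return layers


ancestors = ["age", "sex", "race", "native-country"]
chain = ["age", "occupation", "hours-per-week"]  # child(n) is the parent of child(n+1)

layers = causal_layers(ancestors, chain)
for i, layer in enumerate(layers, start=1):
    parent, child = layer["edge"]
    print(f"layer {i}: edge {parent} -> {child}, node features = {layer['node_features']}")
```

With this chain, the first layer carries the edge age -> occupation and the second layer the edge occupation -> hours-per-week, which matches the two layers listed above.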

# Data preparation 

In [None]:
# imports and train/test split (to be put in part 2. of the notebook)
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import torch
try:
    import torch_geometric
except ModuleNotFoundError:
    # build the wheel index URL from the installed torch / CUDA versions
    TORCH = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds
    CUDA = "cu" + torch.version.cuda.replace(".", "") if torch.version.cuda else "cpu"
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric

import sys
sys.path.append("../")

import time
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

from classif_basic.data_preparation import handle_cat_features

from classif_basic.graph.data_to_graph import table_to_graph, add_new_edge
from classif_basic.graph.train import train_GNN_ancestor

# preparing the dataset on clients for binary classification
data = fetch_openml(data_id=1590, as_frame=True)

X = data.data
Y = (data.target == '>50K') * 1

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

# Hold out a test set from the "model" set (train + valid, used for early stopping & model selection)
# "stratify=Y" keeps the same proportion of target classes in the model (train/valid) and test sets
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)
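To make the effect of `stratify=Y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation: `stratified_split` and `positive_rate` are hypothetical helpers, and the labels are toy data. The idea is simply to shuffle and cut each class separately, so that both parts keep roughly the same class proportions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split row indices so that each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    model_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share sent to the test set
        test_idx.extend(indices[:n_test])
        model_idx.extend(indices[n_test:])
    return sorted(model_idx), sorted(test_idx)


def positive_rate(labels, indices):
    """Share of positive labels among the selected rows."""
    return sum(labels[i] for i in indices) / len(indices)


# toy target with 25% positives (illustration only)
labels = [1] * 50 + [0] * 150
model_idx, test_idx = stratified_split(labels, test_size=0.15, seed=7)

print(len(model_idx), len(test_idx))
print(round(positive_rate(labels, model_idx), 3), round(positive_rate(labels, test_idx), 3))
```

The two printed rates should be close to each other, which is the property the stratified split is meant to guarantee.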

# Data to ancestor & child Graphs 

We begin with all parent features as nodes, and the directed edge parent->child1 (here, age->job): 

In [None]:
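X_total = X_model.copy()  # copy, so that binarizing columns does not modify X_model in place
Y_total = Y_model

# merge 'age' into 2 categories (1 if above the median age) to form the edges faster
median_age = X_total["age"].median()
X_total["age"] = (X_total["age"] > median_age).astype(int)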

list_child1 = ["race", "sex", "native-country", "age", "occupation"]
edge_parent = "age"
edge_child1 = "occupation"

X_total_child1 = X_total.filter(list_child1)

# add the directed edge "age" -> "occupation" (parent -> child1)
edges_total_child1 = add_new_edge(data=X_total_child1, previous_edge=None, 
                                list_col_names=[edge_parent, edge_child1])

# being edges, "age" and "occupation" must be removed from the node features
list_child1.remove(edge_parent)
list_child1.remove(edge_child1)

data_total_child1 = table_to_graph(X=X_total_child1, Y=Y_total, list_col_names=list_child1, edges=edges_total_child1)
print(f"data_total_child1: {data_total_child1} \n")
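The cell above coarsens `age` into two categories around its median. A minimal sketch of that step on toy values, assuming the intent is a 0/1 flag for values strictly above the median (the helper name `binarize_above_median` is hypothetical). The sketch also shows why the comparison operator matters: testing equality with the median flags only the rows sitting exactly on it, which is not a two-sided split.

```python
from statistics import median


def binarize_above_median(values):
    """Return a 0/1 flag per value (1 if strictly above the median) and the cutoff used."""
    cutoff = median(values)
    return [int(v > cutoff) for v in values], cutoff


ages = [22, 35, 37, 37, 41, 58, 63]  # toy ages, illustration only
flags, cutoff = binarize_above_median(ages)

# for comparison: flagging equality with the median instead of "above the median"
equal_flags = [int(v == cutoff) for v in ages]

print(cutoff)
print(flags)
print(equal_flags)
```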

Then, we add the descendant node feature "hours of work per week" (new graph data), and the new directed edge child1->child2 (here, job -> hours of work per week):

In [None]:
# NEW GRAPH-DATA integrating child2 (on top of the parent & child1 data)
# binarize 'hours-per-week': '1' if the client works over the median (40 hours per week)
median_hours = X_total["hours-per-week"].median()
X_total["hours-per-week"] = (X_total["hours-per-week"] > median_hours).astype(int)

list_child2 = ["race", "sex", "native-country", "age", "occupation", "hours-per-week"]
edge_parent = "age"
edge_child1 = "occupation"
edge_child2 = "hours-per-week"

X_total_child2 = X_total.filter(list_child2)

# add the directed edge "occupation" -> "hours-per-week" (child1 -> child2)
edges_total_child2 = add_new_edge(data=X_total_child2, previous_edge=None, 
                                list_col_names=[edge_child1, edge_child2])

# being edges, "occupation" and "hours-per-week" must be removed from the nodes
# list_child2.remove(edge_parent) -> TODO remove edge_parent, to avoid correlation vs causation? It seems to me not necessary
list_child2.remove(edge_child1)
list_child2.remove(edge_child2)

data_total_child2 = table_to_graph(X=X_total_child2, Y=Y_total, list_col_names=list_child2, edges=edges_total_child2)
print(f"data_total_child2: {data_total_child2} \n")
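The graph-data printed above pairs node features `x` with an `edge_index`. In PyTorch Geometric, `edge_index` uses a COO layout of shape `[2, num_edges]`: the first row holds source nodes and the second row holds target nodes. The sketch below builds such a pair with plain lists. The linking rule is an assumption made purely for illustration (two individuals are connected when they share the same values on both edge columns); it is not a description of what `add_new_edge` actually does, and `edges_from_shared_values` is a hypothetical name.

```python
from collections import defaultdict
from itertools import permutations


def edges_from_shared_values(rows, edge_cols):
    """Hypothetical rule: connect every ordered pair of rows sharing the same values on edge_cols."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[c] for c in edge_cols)].append(idx)

    sources, targets = [], []
    for members in groups.values():
        for src, dst in permutations(members, 2):
            sources.append(src)
            targets.append(dst)
    return [sources, targets]  # COO layout: row 0 = sources, row 1 = targets


# toy table: one row per individual (illustration only)
rows = [
    {"sex": 0, "occupation": 3, "hours-per-week": 1},
    {"sex": 1, "occupation": 3, "hours-per-week": 1},
    {"sex": 1, "occupation": 5, "hours-per-week": 0},
    {"sex": 0, "occupation": 3, "hours-per-week": 1},
]

edge_index = edges_from_shared_values(rows, ["occupation", "hours-per-week"])
x = [[row["sex"]] for row in rows]  # the edge columns are left out of the node features

print(x)
print(edge_index)
```

Under this rule, individuals 0, 1 and 3 form a fully connected group, while individual 2 stays isolated.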

# Train a basic Graph Neural Network on the graph-shaped data

## Train with batches (neighborhood sampling) a basic GCN 

Here, we train the GNN on batches built from neighborhoods, using the GPU when one is available.

We use our GCN_ancestor class, which progressively adds the "causal child" information layer by layer:

With batches of 128 individuals, passing a causal order on layer1 and layer2 reaches 76% accuracy, the same accuracy as when all features are given and no causal layer is used.

We use our own data-loader here, which ensures that each batch passes the same individuals to the GNN (so that only the causal data changes across layers):

In [None]:
# with the method "index_groups": with 300 batches and 2 epochs, 70%(epoch1) -> 76% of accuracy (5 mn)
# ||| Epoch 2 Loss_train = 1.1 Loss_valid = 0.55 Train & Valid Accuracy = 0.76

list_data_total = [data_total_child1, data_total_child2]
loader_method="index_groups"
loss_name="CrossEntropyLoss"
#batch_size=150
nb_batches=300
epoch_nb = 2
learning_rate = 0.01

gnn_index = train_GNN_ancestor(
                list_data_total=list_data_total,
                loader_method=loader_method,
                loss_name=loss_name,
                #batch_size=batch_size,
                nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                learning_rate = learning_rate)
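The property the "index_groups" loader is meant to guarantee (each batch contains the same individuals in every graph-data) can be sketched as follows. This is an assumption about the idea, not the actual loader in `classif_basic`: `make_index_groups` and `batches_for` are hypothetical helpers, and the feature tables are toy data. The indices are partitioned once, and the same partition is reused for the ancestor and the child data.

```python
import random


def make_index_groups(num_nodes, nb_batches, seed=0):
    """Shuffle the node indices once and cut them into nb_batches disjoint groups."""
    rng = random.Random(seed)
    indices = list(range(num_nodes))
    rng.shuffle(indices)
    return [indices[i::nb_batches] for i in range(nb_batches)]


def batches_for(features, groups):
    """Select the rows of one graph-data for every index group."""
    return [[features[i] for i in group] for group in groups]


# two toy feature tables over the same 10 individuals (first column = individual id)
x_child1 = [[i, 0] for i in range(10)]
x_child2 = [[i, 0, 1] for i in range(10)]

groups = make_index_groups(num_nodes=10, nb_batches=3, seed=7)
batches_child1 = batches_for(x_child1, groups)
batches_child2 = batches_for(x_child2, groups)

# every batch holds the same individuals in both graph-data; only the features differ
same_individuals = all(
    [row[0] for row in b1] == [row[0] for row in b2]
    for b1, b2 in zip(batches_child1, batches_child2)
)
print(same_individuals)
```

Reusing a single partition is what keeps the individuals aligned across layers, whereas sampling each graph-data independently would not.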

In [None]:
# currently working, with the method "neighbor_nodes"
# 17 mn with 300 epochs (Epoch 300 Loss_train = 0.55 Loss_valid = 0.28 Train & Valid Accuracy = 0.76)
# but: not the same individuals sampled across the layers! Take the neighbours of data-ancestor -> keep index? 

list_data_total = [data_total_child1, data_total_child2]
loader_method="neighbor_nodes"
loss_name="CrossEntropyLoss" 
# the gradient must be backpropagated => it should not be detached! Instead, compute AUC with the gradient present?

#batch_size=150
nb_batches=100
epoch_nb = 30
learning_rate = 0.01

gnn_neighbor = train_GNN_ancestor(
                list_data_total=list_data_total,
                loader_method=loader_method,
                loss_name=loss_name,
                #batch_size=batch_size,
                nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                learning_rate = learning_rate)

## Inspect the predictions of the model on valid and test sets

Let's inspect the model on test data, to check that the stable performance is not a coincidence:

**No Overfitting**

With this very simple graph-data shape (directed edge = "job" -> "work hours"), the accuracy remains at 75% on train, valid and test data.

This confirms that training a basic GNN on simply shaped data delivers stable results here.
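A stability check of this kind can be sketched as below. The labels and predictions are toy values for illustration only, not results from this notebook, and the helper names are hypothetical. Besides the accuracy per split and the largest gap between splits, the sketch also reports the majority-class rate: with an imbalanced target, a stable accuracy is easier to interpret when compared against that baseline.

```python
def accuracy(y_true, y_pred):
    """Share of predictions equal to the true label."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_rate(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


def accuracy_report(splits):
    """Return accuracy per split and the largest gap between any two splits."""
    scores = {name: accuracy(y_true, y_pred) for name, (y_true, y_pred) in splits.items()}
    gap = max(scores.values()) - min(scores.values())
    return scores, gap


# toy (y_true, y_pred) pairs, for illustration only
splits = {
    "train": ([0, 0, 1, 1, 0, 1, 0, 0], [0, 0, 1, 0, 0, 1, 0, 1]),
    "valid": ([0, 1, 0, 0], [0, 1, 0, 1]),
    "test": ([1, 0, 0, 0], [0, 0, 0, 0]),
}

scores, gap = accuracy_report(splits)
baselines = {name: majority_rate(y_true) for name, (y_true, _) in splits.items()}

print(scores, gap)
print(baselines)
```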