## Enforcing causal paths in tabular GNN - Full data-graph (x, edge_index) child knowledge

To train a GNN that respects a minimal set of causal paths, we pass two types of graph-data to the GNN:
- ancestor: contains only the ancestor nodes
- child: contains the ancestor and child nodes, with an added edge "ancestor -> child"

The construction is chained, so that child(n) becomes the ancestor of child(n+1).

In short, we build two graph-data objects (parent and child) and suggest causality by adding the child nodes, together with the edge "parent -> child", to the child data.

For the moment, we specify only one parent per edge, on two layers:
- ancestor layer: age -> occupation
- child layer: occupation -> hours of work per week
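
The chaining of layers can be sketched in plain Python: given an ordered causal path, each consecutive pair gives one (parent, child) layer, so the child of one layer is the parent of the next. The path and the variable names below are illustrative only; they mirror the two layers listed above.

```python
# Ordered causal path (illustrative, matching the two layers above).
causal_path = ["age", "occupation", "hours-per-week"]

# Each consecutive pair is one layer: the child of layer n is the parent of layer n+1.
layers = list(zip(causal_path[:-1], causal_path[1:]))

for depth, (parent, child) in enumerate(layers):
    print(f"layer {depth}: {parent} -> {child}")
```

Adding a deeper layer then only requires appending one more column name to the path.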

For the moment, to avoid spurious correlations, we also keep only the ancestor features (age, sex, race, native country) as node features for all graph-data. Open question: starting from this ancestor "blind knowledge", should the child also be added as a node feature (in addition to the edge parent -> child) in the child graph-data?
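
To make the pair (x, edge_index) concrete, here is a minimal NumPy sketch on a toy table. The node features `x` hold only ancestor columns, and `edge_index` uses the usual COO layout of shape (2, number of edges). The linking rule (connecting consecutive rows that share the same parent value) is purely an assumption for illustration and is not claimed to be what `table_to_graph` does; the helper name `edges_from_shared_value` is hypothetical.

```python
import numpy as np

# Toy table: 5 individuals with two ancestor features (age, sex)
# and one label-encoded parent column (occupation). Values are made up.
ancestors = np.array([[39, 1], [50, 1], [38, 0], [53, 1], [28, 0]])
occupation = np.array([0, 1, 0, 1, 0])

def edges_from_shared_value(values):
    """Link consecutive rows that share the same value (illustrative rule only)."""
    sources, targets = [], []
    for value in np.unique(values):
        rows = np.flatnonzero(values == value)
        sources.extend(rows[:-1])
        targets.extend(rows[1:])
    return np.array([sources, targets], dtype=np.int64)

x = ancestors.astype(np.float32)                # node features: ancestors only
edge_index = edges_from_shared_value(occupation)  # shape (2, num_edges)
print(x.shape, edge_index.shape)
```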

# Data preparation 

Causal analysis: before the train&valid/test split, we reduce the number of features so that only "straightforward" causal information remains, which lets us integrate it progressively into our GNN (through edges). To this end, we rely on factor analysis and merge the features that carry redundant causal information.
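
As an illustration of merging redundant columns, the sketch below replaces several columns by one label-encoded column of their joint values. It is a guess at the general idea, assuming pandas; the function name `unify_columns` is hypothetical and is not the project's `get_unified_col`.

```python
import pandas as pd

def unify_columns(df, cols_to_join, new_col_name):
    """Replace several columns by one label-encoded column of their joint values."""
    joint = df[cols_to_join].astype(str).agg("_".join, axis=1)
    codes, _ = pd.factorize(joint)  # one integer code per distinct combination
    out = df.drop(columns=cols_to_join)
    out[new_col_name] = codes
    return out

toy = pd.DataFrame({
    "education": ["HS", "BSc", "HS"],
    "education-num": [9, 13, 9],
    "age": [30, 40, 50],
})
unified = unify_columns(toy, ["education", "education-num"], "education")
print(unified)
```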

To control the class balance of the dataset, we subsample the clients so that X contains classes 0 and 1 in exactly equal proportions (at the cost of keeping 25 000 of the 45 000 initial individuals).
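
One simple way to obtain such a balanced dataset is to downsample every class to the size of the rarest one. The sketch below assumes pandas and a column named `target`; the function name `balance_by_downsampling` is hypothetical and may differ from what `get_balanced_df` actually does.

```python
import pandas as pd

def balance_by_downsampling(df, target_col="target", seed=7):
    """Keep as many rows of every class as the rarest class has."""
    n_min = df[target_col].value_counts().min()
    return df.groupby(target_col).sample(n=n_min, random_state=seed)

# Toy data: 7 individuals of class 0 and 3 of class 1.
toy = pd.DataFrame({"age": range(10), "target": [0] * 7 + [1] * 3})
balanced = balance_by_downsampling(toy)
print(balanced["target"].value_counts().to_dict())
```

The price of this strategy is visible on the toy data: 4 of the 10 rows are discarded.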

In [None]:
# imports and train/test split (to be put in part 2. of the notebook)
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import torch
try:
    import torch_geometric
except ModuleNotFoundError:
    # build the wheel index URL matching the local torch / CUDA versions
    TORCH = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds
    CUDA = ("cu" + torch.version.cuda.replace(".", "")) if torch.version.cuda else "cpu"
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric

import sys
sys.path.append("../")

import time
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

from classif_basic.data_preparation import handle_cat_features

from classif_basic.graph.data_to_graph import table_to_graph, add_new_edge
from classif_basic.graph.train import train_GNN_ancestor

# preparing the dataset on clients for binary classification
data = fetch_openml(data_id=1590, as_frame=True)

X = data.data
Y = (data.target == '>50K') * 1

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

# select equal proportions of the classes "wealthy" and "not wealthy", and generate the new dataset accordingly
from classif_basic.graph.utils import get_balanced_df

balanced_df = get_balanced_df(X=X, Y=Y)

X_balanced = balanced_df.drop("target", axis=1)
Y_total_balanced = balanced_df["target"]

#X = X_balanced # here, we try with the whole dataset (imbalanced, but with almost 50 000 nodes)
#Y = Y_total_balanced

# first of all, unify features with "redundant" causal information
from classif_basic.graph.utils import get_unified_col

X = get_unified_col(X=X, list_cols_to_join = ["education","education-num"], new_col_name = "education")
X = get_unified_col(X=X, list_cols_to_join = ["relationship","marital-status"], new_col_name = "relationship")
X = get_unified_col(X=X, list_cols_to_join = ["occupation","workclass"], new_col_name = "job")
X = get_unified_col(X=X, list_cols_to_join = ["capital-gain","capital-loss"], new_col_name = "capital")

# Hold out a test set; the remaining "model" set (train & valid) is used for early stopping & model selection
# "stratify=Y" keeps the same proportion of target classes in the train/valid (i.e. model) and test sets
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)

# Data to ancestor & child Graphs 

Cascade of causes...

In [None]:
from classif_basic.graph.data_to_graph import get_parent_child_data

list_data_total = []

list_first_ancestors = ['race', 'sex', 'native-country', 'age']
edge_parent = "fnlwgt"
edge_child0 = "education" 
edge_child1 = "relationship"
edge_child2 = "job"
edge_child3 = "hours-per-week"
edge_child4 = "capital"

list_successive_paths=["education" , "job", "hours-per-week"]

# build one graph per consecutive (parent, child) pair of the causal path
for i in range(len(list_successive_paths)-1):
    edge_parent = list_successive_paths[i]
    edge_child = list_successive_paths[i+1]
    print(f"\n {edge_parent} -> {edge_child}")
    # use the loop variable edge_child (not the fixed edge_child1) so that the edge matches the printed path
    data_total = get_parent_child_data(X=X_model, Y=Y_model, list_node_features=list_first_ancestors,
                                       edge_parent=edge_parent, edge_child=edge_child)
    list_data_total.append(data_total)
    print(data_total)

# Train a basic Graph Neural Network on the graph-shaped data

## Train with batches (neighborhood sampling) a basic GCN 

Here, we train the GNN on batches built from neighborhoods, using our GPU (if available).

We use our GCN_ancestor class, which progressively adds the "causal child" information through its layers.

Here, with batches of 128 individuals, 76% accuracy is reached by passing a causal order on layer1 and layer2 (the same accuracy as when all features are specified and no causal layer is used!)...
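
The idea of injecting one causal edge set per layer can be sketched with plain NumPy. This is not the GCN_ancestor implementation: it assumes a simple mean-aggregation layer with self-loops and an optional additive skip connection, and the names `normalized_adjacency` and `causal_forward` are hypothetical.

```python
import numpy as np

def normalized_adjacency(edge_index, num_nodes):
    """Row-normalised adjacency with self-loops, built from a (2, E) edge list."""
    adj = np.eye(num_nodes)
    adj[edge_index[1], edge_index[0]] = 1.0  # messages flow source -> target
    return adj / adj.sum(axis=1, keepdims=True)

def causal_forward(x, edge_indices, weights, skip_connection=True):
    """Apply one mean-aggregation layer per graph, in causal order."""
    h = x
    for edge_index, w in zip(edge_indices, weights):
        a = normalized_adjacency(edge_index, x.shape[0])
        out = np.maximum(a @ h @ w, 0.0)  # ReLU
        # the skip connection is only applied when the shapes allow it
        h = out + h if skip_connection and out.shape == h.shape else out
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                # 4 individuals, 3 ancestor features
edges_layer1 = np.array([[0, 1], [1, 2]])  # layer 1 edges: 0 -> 1, 1 -> 2
edges_layer2 = np.array([[2], [3]])        # layer 2 edges: 2 -> 3
weights = [rng.normal(size=(3, 3)) for _ in range(2)]
h = causal_forward(x, [edges_layer1, edges_layer2], weights)
print(h.shape)
```

Every layer sees the same individuals but a different edge set, which is the property the text relies on.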

We use our own data-loader here, ensuring that each batch passes the same individuals to the GNN (so that only the causal data changes through the layers).
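
A minimal sketch of such a loader, assuming that all graphs index the same individuals in the same order: the node indices are shuffled once and cut into fixed groups, and these groups are reused for every graph. The function name `make_index_groups` and the toy feature matrices are hypothetical, and the project's "index_groups" method may work differently.

```python
import numpy as np

def make_index_groups(num_nodes, batch_size, seed=7):
    """Shuffle node indices once and cut them into fixed batches."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(num_nodes)
    return [permutation[i:i + batch_size] for i in range(0, num_nodes, batch_size)]

groups = make_index_groups(num_nodes=10, batch_size=4)

# Toy node features for two successive graphs over the same 10 individuals.
x_by_graph = {
    "education -> job": np.arange(20).reshape(10, 2),
    "job -> hours-per-week": np.arange(30).reshape(10, 3),
}

# Batch k holds the same individuals in every graph; only the data differs.
batches = {name: [x[idx] for idx in groups] for name, x in x_by_graph.items()}
print([len(idx) for idx in groups])
```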

In [None]:
data_job_hours = get_parent_child_data(X=X_model, Y=Y_model, list_node_features=X_model.columns, 
                                                edge_parent="job", edge_child="hours-per-week")

In [None]:
# with the method "index_groups": with 300 batches and 2 epochs, 70% (epoch 1) -> 76% accuracy (5 min)
# ||| Epoch 2 Loss_train = 1.1 Loss_valid = 0.55 Train & Valid Accuracy = 0.76

# here with skip connections

#list_data_total = [data_total_child1, data_total_child2, data_total_child3]
loader_method="index_groups"

model_type="conv_attention"
loss_name="CrossEntropyLoss"
batch_size = 25_000 # previously tried: 10_000, 19_868
learning_rate = 0.01
#nb_batches=100
epoch_nb = 50

gnn_neighbor = train_GNN_ancestor(
                list_data_total=[data_job_hours],
                model_type=model_type,
                loader_method=loader_method,
                loss_name=loss_name,
                batch_size=batch_size,
                #nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                learning_rate = learning_rate,
                skip_connection=True)

In [None]:
# currently working, with the method "neighbor_nodes"
# 17 min with 300 epochs (Epoch 300 Loss_train = 0.55 Loss_valid = 0.28 Train & Valid Accuracy = 0.76)
# but: the individuals sampled are not the same across the layers! Take the neighbours of data-ancestor -> keep index?

# list_data_total = [data_total_child1, data_total_child2, data_total_child3]
loader_method="neighbor_nodes"

epoch_nb = 5

gnn_neighbor = train_GNN_ancestor(
                list_data_total=list_data_total,
                model_type=model_type,
                loader_method=loader_method,
                loss_name=loss_name,
                batch_size=batch_size,
                #nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                learning_rate = learning_rate)

## Standard GCN on all features for comparison

In [None]:
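from torch_geometric.nn.conv import GATConv

# quick check: attention layer mapping 13 input features to 32 output channels
GATConv(13, 32)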

In [None]:
from torch_geometric.nn.conv import GATConv

data = list_data_total[-1]

conv = GATConv(data.num_features, 32)
conv(data.x.float(), data.edge_index)

In [None]:
list_data_total[-1].num_features  # same graph as list_one_data[0], which is only defined in a later cell

In [None]:
# train a GNN on a single graph: the last one of list_data_total (a unique parent -> child edge)
# 6 features as well, but same data.x => our assumption: the AUCs will remain low,
# but not as fluctuating as with 2 graphs using different neighbor sampling (and 2 different x in layers)

list_one_data = [list_data_total[-1]] # only trained with ancestors (x) and one edge
loader_method="neighbor_nodes"
loss_name="CrossEntropyLoss" 

unique_data_graph=True

epoch_nb = 5
batch_size=25_000

gnn_neighbor = train_GNN_ancestor(
                list_data_total=list_one_data,
                model_type=model_type,
                loader_method=loader_method,
                loss_name=loss_name,
                batch_size=batch_size,
                #nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                learning_rate = learning_rate,
                unique_data_graph=unique_data_graph)

# Basic XGB for comparison - excellent results
With the same unified features, XGBoost reaches 92% ROC-AUC, 81-84% PR-AUC and 85% accuracy on the train&valid sets.
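
For reference, the three metrics quoted above can be computed with scikit-learn as sketched below. The labels and scores are toy values, not results from this notebook, and average precision is used here as a common estimate of PR-AUC (it may differ slightly from the `aucpr` criterion reported by XGBoost).

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

# Toy labels and predicted probabilities of class 1 (made-up values).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (y_score >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

metrics = {
    "roc_auc": roc_auc_score(y_true, y_score),
    "pr_auc": average_precision_score(y_true, y_score),
    "accuracy": accuracy_score(y_true, y_pred),
}
print(metrics)
```

Note that ROC-AUC and PR-AUC are computed from scores, whereas accuracy depends on the chosen threshold.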

In [None]:
from classif_basic.data_preparation import train_valid_test_split
from classif_basic.model import train_naive_xgb

model_task="classification"
stat_criteria="aucpr"

X_train, X_valid, X_train_valid, X_test, Y_train, Y_valid, Y_train_valid, Y_test=train_valid_test_split(
    X=X,
    Y=Y, 
    model_task=model_task, 
    preprocessing_cat_features=preprocessing_cat_features)

Y_pred_train_valid = train_naive_xgb(
    X_train=X_train,
    X_valid=X_valid,
    X_train_valid=X_train_valid,
    X_test=X_test,
    Y_train=Y_train,
    Y_valid=Y_valid,
    Y_train_valid=Y_train_valid,
    Y_test=Y_test,
    model_task=model_task,
    stat_criteria=stat_criteria,
) 

# accuracy on the train&valid sets: share of exact predictions
total_target = Y_train_valid.shape[0]
total_exact = (Y_pred_train_valid == Y_train_valid).sum()

xgb_accuracy = total_exact / total_target
xgb_accuracy