## Enforcing causal paths in tabular GNN - Full data-graph (x, edge_index) child knowledge

To train a GNN that respects a minimal set of causal paths, we pass it two types of graph-data:
- ancestor: contains only the ancestor nodes
- child: contains the ancestor and child nodes, with an added edge "ancestor -> child"

The layers are chained, so that child(n) becomes the ancestor of child(n+1).

In short, we build two graph-data objects (parent and child) and suggest causality by adding the child nodes, together with the edge parent -> child, to the child data.

For now, each edge has a single parent, and we use 2 layers:
- ancestor layer: age -> occupation
- child layer: occupation -> hours of work per week
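
The chaining rule (the child of one layer becomes the ancestor of the next) can be sketched in plain Python. This is only an illustration: `causal_layers` is a hypothetical helper, not part of `classif_basic`, and the `parent->child` layer names and the `hours-per-week` column name are assumptions chosen for readability.

```python
from typing import Dict, List, Tuple


def causal_layers(path: List[str]) -> Dict[str, Tuple[str, str]]:
    """Map each consecutive (parent, child) pair of a causal path to a layer name."""
    layers = {}
    for parent, child in zip(path, path[1:]):
        layers[f"{parent}->{child}"] = (parent, child)
    return layers


# Illustrative path matching the two layers listed above.
path = ["age", "occupation", "hours-per-week"]
layers = causal_layers(path)

for name, (parent, child) in layers.items():
    print(f"{name}: parent={parent}, child={child}")
```

Each feature appears at most once as a child and once as a parent, which matches the "1 parent per edge" restriction.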

For now, to avoid spurious correlations, we also keep only the ancestor features (age, sex, race, native country) as node features in every graph-data object. Open question: starting from this "blind" ancestor knowledge, should the child also be added as a node feature (in addition to the edge parent -> child) in the child graph-data?

# Data preparation 

Causal analysis: before the train & valid / test split, we reduce the features to those carrying "straightforward" causal information, so that this information can be integrated progressively into our GNN (through edges). To that end, we use factor analysis.

To balance the classes, we select the clients so that X contains exactly as many individuals of class 0 as of class 1 (at the cost of keeping 25 000 of the 45 000 initial individuals).
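
As an illustration of the balancing step, here is a minimal sketch that downsamples every class to the size of the rarest one. It is not the implementation of `get_balanced_df`: the helper `balance_classes`, the random downsampling and the toy data are all assumptions. Only the `target` column name follows the notebook.

```python
import pandas as pd


def balance_classes(X: pd.DataFrame, Y: pd.Series, seed: int = 7) -> pd.DataFrame:
    """Downsample each class to the size of the rarest one.

    Returns the features with an extra "target" column.
    """
    df = X.copy()
    df["target"] = Y.values
    n_min = int(df["target"].value_counts().min())
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby("target")
    ]
    return pd.concat(parts).sort_index()


# Toy data (hypothetical): 6 individuals of class 0 and 2 of class 1.
X_toy = pd.DataFrame({"age": [25, 32, 47, 51, 38, 29, 60, 44]})
Y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])

balanced_toy = balance_classes(X_toy, Y_toy)
print(balanced_toy["target"].value_counts().to_dict())
```

The price of perfect balance is the same as in the notebook: every individual of the majority class beyond the minority count is dropped.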

In [None]:
# imports and train/test split (to be put in part 2. of the notebook)
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import torch
try:
    import torch_geometric
except ModuleNotFoundError:
    # install the PyG companion wheels matching the local torch / CUDA build
    TORCH = torch.__version__.split("+")[0]
    CUDA = "cu" + torch.version.cuda.replace(".","")
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric

import sys
sys.path.append("../")

import time
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

from classif_basic.data_preparation import handle_cat_features

from classif_basic.graph.train import train_xgb_benchmark, get_auc

from classif_basic.model import pickle_load_model, pickle_save_model

# preparing the dataset on clients for binary classification
data = fetch_openml(data_id=1590, as_frame=True)

X = data.data
Y = (data.target == '>50K') * 1

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

# first of all, unify features with "redundant" causal information
from classif_basic.graph.utils import get_unified_col

X = get_unified_col(X=X, list_cols_to_join = ["education","education-num"], new_col_name = "education")
X = get_unified_col(X=X, list_cols_to_join = ["relationship","marital-status"], new_col_name = "relationship")
X = get_unified_col(X=X, list_cols_to_join = ["occupation","workclass"], new_col_name = "job")
X = get_unified_col(X=X, list_cols_to_join = ["capital-gain","capital-loss"], new_col_name = "capital")

# select equal proportion of classes "wealthy" and "not wealthy", and generates the new dataset accordingly
from classif_basic.graph.utils import get_balanced_df

balanced_df = get_balanced_df(X=X, Y=Y)

X_balanced = balanced_df.drop("target", axis=1)
Y_balanced = balanced_df["target"]

X=X_balanced # here, we keep the balanced dataset (the whole, imbalanced dataset counts almost 50 000 nodes)
Y=Y_balanced

# then, normalize the df categories for better neural-network computation
from classif_basic.graph.utils import normalize_df

X=normalize_df(df=X, normalization='min_max')

# Hold out the test set; the remaining "model" set is split later into train/valid (early stopping & model selection)
# "stratify=Y" keeps the same proportion of target classes in the train/valid (i.e. model) and test sets
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)
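
For reference, the `min_max` normalization applied in the cell above can be sketched as follows. This is an assumption about what `normalize_df(..., normalization='min_max')` computes, not its actual code; `min_max_normalize` and the toy columns are hypothetical.

```python
import pandas as pd


def min_max_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1]; constant columns are mapped to 0."""
    col_min = df.min()
    # Replace a zero range by 1 to avoid dividing by zero on constant columns.
    col_range = (df.max() - col_min).replace(0, 1)
    return (df - col_min) / col_range


toy = pd.DataFrame({"age": [20, 40, 60], "job": [0, 3, 6], "flag": [1, 1, 1]})
normalized = min_max_normalize(toy)
print(normalized)
```

Label-encoded categories end up on the same [0, 1] scale as the numeric columns, which keeps the inputs of the neural network in a comparable range.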

# Load and test previous models (plot AUCs...)
We load here our basic GCNs, each trained for 20 000 epochs (around 100-120 min):
- the "classic" one: correlational, with all columns as node features and only one edge
- the "test" one: with layers integrating successive causal edges (2 edges for the moment)

In [None]:
from classif_basic.graph.train import activate_gpu   

gcn_classic = pickle_load_model("/work/data/models/gcn_classic_education_relationship.pkl")
gcn_ancestor = pickle_load_model("/work/data/models/gcn_ancestor_education_relationship_job.pkl")

dict_data_total = pickle_load_model("/work/data/graph_data/balanced/dict_all_edges.pkl")

data_full_education_relationship = pickle_load_model("/work/data/graph_data/balanced/data_full_features_education_relationship.pkl")

device = activate_gpu()

preds_classic = gcn_classic(list_data=[data_full_education_relationship], device=device, skip_connection=True)

preds_ancestor = gcn_ancestor(
    list_data=[dict_data_total['education->relationship'],
                                 dict_data_total['relationship->job']],
    device=device, 
    skip_connection=True)

In [None]:
# after loading the models, save their predictions
pickle_save_model(preds_classic, "/work/data/graph_data/balanced/edge_ed_rel_job.preds_classic.pkl")

pickle_save_model(preds_ancestor, "/work/data/graph_data/balanced/edge_ed_rel_job.preds_ancestor.pkl")

In [None]:
# and plot AUCs...
# get y to plot the AUCs (the loop keeps the labels of the last graph-data in the dict)
for edge_name, graph_data in dict_data_total.items():
    y_true=graph_data.y

plot = True

# first, plot AUCs for the classic GNN (with full features as node features)
# the printed scores are, in order: ROC AUC, PR AUC, False Positive Rate and True Positive Rate
print("AUCs for the classic GNN (with full features as node features)")
get_auc(y_true=y_true, probas_pred=preds_classic, plot=plot)

In [None]:
# second, plot AUCs for the ancestor GNN (with 2 progressive layers of edges, education -> relationship -> job)
print("AUCs for the ancestor GNN (with 2 progressive layers of edges, education -> relationship -> job)")
get_auc(y_true=y_true, probas_pred=preds_ancestor, plot=plot)
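
The four scores mentioned above can be reproduced with scikit-learn. The sketch below is not the code of `get_auc`: the helper `summarize_scores`, the 0.5 threshold used for the rates and the toy values are assumptions. With the real predictions, the tensors would first have to be moved to CPU and converted to NumPy arrays.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score


def summarize_scores(y_true, probas_pred, threshold=0.5):
    """Return ROC AUC, PR AUC (average precision), and FPR/TPR at a threshold."""
    y_true = np.asarray(y_true)
    probas_pred = np.asarray(probas_pred)
    y_hat = (probas_pred >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return {
        "roc_auc": roc_auc_score(y_true, probas_pred),
        "pr_auc": average_precision_score(y_true, probas_pred),
        "fpr": fp / (fp + tn),
        "tpr": tp / (tp + fn),
    }


# Toy labels and predicted probabilities (hypothetical values).
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.6, 0.4, 0.9]
scores = summarize_scores(y_toy, p_toy)
print(scores)
```

ROC AUC and PR AUC do not depend on the threshold, whereas the two rates do. This matters when comparing the classic and ancestor GNNs on a single operating point.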

In [None]:
# don't forget to free the GPU memory at the end!
del preds_classic
del preds_ancestor

torch.cuda.empty_cache()

# Basic XGB for comparison - excellent results
With the same unified features, XGB reaches 92% ROC-AUC, 81-84% PR-AUC and 85% accuracy on the train & valid sets.

**Conclusion**: our GNNs (causal as well as non-causal) score **10-18% lower than the optimized XGB on all metrics**. This is respectable for non-optimized structures (basic edges, basic GCN), but the GNNs still have to be optimized.

In [None]:
from classif_basic.graph.train import train_xgb_benchmark

X_train, X_valid, Y_train, Y_valid = train_test_split(
    X_model, Y_model, test_size=VALID_SIZE, random_state=SEED, stratify=Y_model
)

train_xgb_benchmark(X_train=X_train, X_valid=X_valid, Y_train=Y_train, Y_valid=Y_valid)
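
Since `VALID_SIZE` is used for both splits (model/test, then train/valid), the resulting shares of the balanced dataset can be checked with a few lines of arithmetic. This is a plain worked example; it only assumes the `VALID_SIZE = 0.15` value set earlier in the notebook.

```python
VALID_SIZE = 0.15  # same value as in the notebook

test_frac = VALID_SIZE                 # first split: model vs. test
model_frac = 1 - test_frac
valid_frac = model_frac * VALID_SIZE   # second split: train vs. valid
train_frac = model_frac - valid_frac

print(f"train: {train_frac:.4f}, valid: {valid_frac:.4f}, test: {test_frac:.4f}")
```

The validation set is therefore slightly smaller than the test set (about 13% versus 15% of the data), and roughly 72% of the individuals remain for training.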