## Our GCN on test data
We test here the previously trained GCNs, "ancestor" and "classic", trained on large (unbalanced, 45 000 individuals) and small (balanced, 20 000 individuals) datasets.

## Enforcing causal paths in tabular GNN - Full data-graph (x, edge_index) child knowledge

To train a GNN while respecting a minimal set of causal paths, we pass two types of graph-data to the GNN:
- ancestor: contains only the ancestor nodes
- child: contains the ancestor and child nodes, with an added edge "ancestor -> child"

The layers are chained, so that child(n) becomes the ancestor of child(n+1).

In short, we build two graph-data objects, parent and child, and suggest causality by adding the child nodes (and the edge parent -> child) to the child data.

For the moment, each edge has a single parent, and we use 2 layers:
- ancestor layer: age -> occupation
- child layer: occupation -> hours of work per week

For the moment, to avoid spurious correlations, we also keep only the ancestor features (age, sex, race, native country) as node features for all graph-data. Open question: starting from this ancestor "blind knowledge", should we add the child as a node feature (in addition to the edge parent -> child) in the child graph-data?
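
The layer layout described above can be written down as a small data structure. The following is only a sketch, assuming Adult-style column names (`occupation`, `hours-per-week`, `native-country`); `CausalLayer`, `is_chained` and `node_features` are hypothetical helpers for illustration and are not part of `classif_basic`.

```python
from dataclasses import dataclass

# Ancestor features kept as node features for all graph-data (assumed column names).
ANCESTOR_FEATURES = ["age", "sex", "race", "native-country"]


@dataclass(frozen=True)
class CausalLayer:
    """One causal edge "parent -> child" (hypothetical helper)."""
    name: str
    parent: str
    child: str


# The two layers used for the moment: one parent per edge.
LAYERS = [
    CausalLayer(name="ancestor", parent="age", child="occupation"),
    CausalLayer(name="child", parent="occupation", child="hours-per-week"),
]


def is_chained(layers):
    """True if child(n) is the parent of child(n+1) for every consecutive pair."""
    return all(prev.child == nxt.parent for prev, nxt in zip(layers, layers[1:]))


def node_features(layer, add_child=False):
    """Node features for a layer: ancestor features only, optionally plus the child."""
    features = list(ANCESTOR_FEATURES)
    if add_child and layer.child not in features:
        features.append(layer.child)
    return features


print(is_chained(LAYERS))
print(node_features(LAYERS[1]))
print(node_features(LAYERS[1], add_child=True))
```

The `add_child` flag corresponds to the open question above: with `add_child=False` the child only appears through the edge, with `add_child=True` it is also exposed as a node feature.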

# Results on test set

The model trained with full features (even with only 1 edge) gives stable results: approximately 75% accuracy and 75% ROC-AUC.

By contrast, the model trained with 2 edges generalizes less well to unbalanced data it has never seen. The change of indexes across layers may perturb the straightforward GCN more than it "improves its knowledge", so we have to look for a **new progressive causal architecture**.
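
Because the models are evaluated on both balanced and unbalanced test sets, it helps to recall how the two reported metrics behave under class imbalance. Below is a minimal sketch in plain Python on made-up labels and scores (not the notebook's data); `accuracy` and `roc_auc` are illustrative helpers, with ROC-AUC computed as the fraction of (positive, negative) pairs that are ranked correctly.

```python
def accuracy(y_true, y_pred):
    """Share of exact matches between labels and hard predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy unbalanced set: 8 negatives, 2 positives (made-up values).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always predicts the majority class with the same score.
constant_scores = [0.1] * 10
constant_pred = [0] * 10

print(accuracy(y_true, constant_pred))   # 0.8
print(roc_auc(y_true, constant_scores))  # 0.5
```

On unbalanced data an uninformative classifier can still reach a high accuracy while its ROC-AUC stays at 0.5, so the two metrics are best read together.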

## Models trained on balanced data
### Unbalanced (7000 individuals)

In [None]:
%load_ext autoreload
%autoreload 2

# load the already built graph-data, to save time
from classif_basic.model import pickle_load_model
from classif_basic.graph.plot import plot_perfs_gnn

dict_test = pickle_load_model("/work/data/graph_data/unbalanced/test/dict_all_edges.pkl")

data_full_test = pickle_load_model("/work/data/graph_data/unbalanced/test/data_full_features_education_relationship.pkl")

In [None]:
# also load the models, to save time
gcn_classic_small = pickle_load_model("/work/data/models/gcn_classic_education_relationship.pkl")

gcn_ancestor_small = pickle_load_model("/work/data/models/gcn_ancestor_education_relationship_job.pkl")

In [None]:
# "classic" GNN - trained with all features as nodes, only 1 edge
plot_perfs_gnn(classifier=gcn_classic_small,
               list_data_test=[data_full_test])

In [None]:
# "ancestor" GNN - trained with only ancestor features as nodes, 2 edges
plot_perfs_gnn(classifier=gcn_ancestor_small,
               list_data_test=list(dict_test.values()))

### Balanced (3500 individuals)

In [None]:
dict_test_balanced = pickle_load_model("/work/data/graph_data/balanced/test/dict_all_edges.pkl")

data_full_test_balanced = pickle_load_model("/work/data/graph_data/balanced/test/data_full_features_education_relationship.pkl")

# "classic" GNN - trained with all features as nodes, only 1 edge
plot_perfs_gnn(classifier=gcn_classic_small,
               list_data_test=[data_full_test_balanced])

In [None]:
# "ancestor" GNN - trained with only ancestor features as nodes, 2 edges
plot_perfs_gnn(classifier=gcn_ancestor_small,
               list_data_test=list(dict_test_balanced.values()))

## Model trained on full (unbalanced) data 

For comparison, we show here the test results of the "classic" model, which we expect to perform better (as the index information is not perturbed across layers).

Even though its accuracy is low on balanced data, it shows almost equal (but **slightly lower**) performance compared with the classic model trained on balanced data. This encourages us to **continue with these (series of?) 1-edge classifiers**...

### Unbalanced (7000 individuals)

In [None]:
# load the model, to save time
gcn_classic_large = pickle_load_model("/work/data/models/gcn_classic_education_relationship_45_000_indivs.pkl")

plot_perfs_gnn(classifier=gcn_classic_large,
               list_data_test=[data_full_test])

### Balanced (3500 individuals)

In [None]:
plot_perfs_gnn(classifier=gcn_classic_large,
               list_data_test=[data_full_test_balanced])

# Basic XGB for comparison - excellent results
With the same unified features, XGB reaches 92% ROC-AUC, 81-84% PR-AUC and 85% accuracy on the train and valid sets.

**Conclusion**: our GNNs score at least **10-18% lower than the optimized XGB on all metrics**. This is respectable for non-optimized structures (basic edges, basic GCN), but our GNNs (causal and non-causal) still have to be optimized...
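
As a rough check of this gap, the approximate figures quoted in this notebook can be subtracted directly. The comparison is only indicative: the XGB figures are measured on the train and valid sets, while the GCN figures come from test sets, and PR-AUC is left out because no GCN value is quoted. The dictionary names are illustrative.

```python
# Approximate figures quoted in this notebook (in %).
gcn_test = {"accuracy": 75, "roc_auc": 75}         # classic GCN, test sets
xgb_train_valid = {"accuracy": 85, "roc_auc": 92}  # XGB benchmark, train & valid sets

# Gap in percentage points, metric by metric.
gaps = {metric: xgb_train_valid[metric] - gcn_test[metric] for metric in gcn_test}
print(gaps)  # {'accuracy': 10, 'roc_auc': 17}
```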

In [None]:
# quick data preparation
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from classif_basic.data_preparation import handle_cat_features

# prepare the Adult census dataset (OpenML id 1590) for binary classification
data = fetch_openml(data_id=1590, as_frame=True)

X = data.data
Y = (data.target == '>50K') * 1

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

# first of all, unify features with "redundant" causal information
from classif_basic.graph.utils import get_unified_col

X = get_unified_col(X=X, list_cols_to_join=["education", "education-num"], new_col_name="education")
X = get_unified_col(X=X, list_cols_to_join=["relationship", "marital-status"], new_col_name="relationship")
X = get_unified_col(X=X, list_cols_to_join=["occupation", "workclass"], new_col_name="job")
X = get_unified_col(X=X, list_cols_to_join=["capital-gain", "capital-loss"], new_col_name="capital")

# then, normalize the df categories for better neural-network computation
from classif_basic.graph.utils import normalize_df

X = normalize_df(df=X, normalization='min_max')

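# Hold out a test set; the remaining "model" set is split into train/valid in the next cell
# (for early stopping & model selection). VALID_SIZE is reused here as the test fraction.
# "stratify=Y" keeps the same proportion of target classes in the train/valid (i.e. model) and test sets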
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)

In [None]:
from classif_basic.graph.train import train_xgb_benchmark

X_train, X_valid, Y_train, Y_valid = train_test_split(
    X_model, Y_model, test_size=VALID_SIZE, random_state=SEED, stratify=Y_model
)

train_xgb_benchmark(X_train=X_train, X_valid=X_valid, Y_train=Y_train, Y_valid=Y_valid)
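
The two successive `train_test_split` calls reuse `VALID_SIZE`, so the final proportions are not 70/15/15. A short arithmetic sketch of the resulting fractions, ignoring the rounding to whole rows that happens on real data:

```python
import math

VALID_SIZE = 0.15  # same value as in the cells above

test_fraction = VALID_SIZE                            # first split: test vs. model set
valid_fraction = (1 - VALID_SIZE) * VALID_SIZE        # second split, applied to the model set
train_fraction = (1 - VALID_SIZE) * (1 - VALID_SIZE)  # what remains for training

print(f"train={train_fraction:.4f}, valid={valid_fraction:.4f}, test={test_fraction:.4f}")
# train=0.7225, valid=0.1275, test=0.1500

# The three parts cover the whole dataset.
assert math.isclose(train_fraction + valid_fraction + test_fraction, 1.0)
```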