# GCN on Census
This is an example of using a GCN on a tabular dataset for binary classification (here, Census, to detect the people earning > $50_000). We suppose that we already have some **logically consistent arrows** (coming from the logical analysis of data -> all the coherent DAGs) that we want the GCN to learn - **phase 2**.

Could a **causal hierarchy** be introduced in the [definition of neighbors](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/loader/neighbor_loader.html#NeighborLoader) used to build the subgraphs (i.e. the batches of the DataLoader)?

Perhaps, to specify the different "relations", we need to build a [heterogeneous graph](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/to_hetero_mag.py)?
Begin with these constraints:

- Graph data 1: edge "sex"
- Graph data 2: edge "work -> hours of work"
- Graph data 1 -> (inherits from; temporal?) Graph data 2

In this notebook, we inspect **how a tabular dataset such as Census can be used by a graph-based AI to estimate the wealth of individuals**.

Therefore, we proceed in 3 steps:

**1. We prepare the data to be handled by a graph-based model**
We transform them into a graph, which involves strong assumptions about the features involved in the connections...

**2. We train a graph-based AI**
Here, we begin with a Graph Neural Network (GNN) based on a Multi-Layer Perceptron (MLP), which requires the Torch library.

**3. We inspect whether the graph-based AI indeed reflects common & expert knowledge**
In particular, regarding nonsensical inferences that should absolutely be avoided (e.g. education may influence occupation, but not the reverse).
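
The direction constraint of step 3 can be checked mechanically at the feature level. Below is a minimal sketch, assuming the arrows are stored as (cause, effect) pairs. The arrow list and the helpers `causal_order` and `has_reverse_arrow` are hypothetical illustrations, not part of `classif_basic`.

```python
from graphlib import CycleError, TopologicalSorter  # Python >= 3.9

# Hypothetical feature-level arrows (cause -> effect), for illustration only.
allowed_arrows = [
    ("age", "occupation"),
    ("education", "occupation"),
    ("occupation", "hours-per-week"),
]


def causal_order(arrows):
    """Return one topological order of the features; raise CycleError if cyclic."""
    predecessors = {}
    for cause, effect in arrows:
        predecessors.setdefault(cause, set())
        predecessors.setdefault(effect, set()).add(cause)
    return list(TopologicalSorter(predecessors).static_order())


def has_reverse_arrow(arrows):
    """True if some arrow is present in both directions (e.g. a -> b and b -> a)."""
    arrow_set = set(arrows)
    return any((effect, cause) in arrow_set for cause, effect in arrow_set)


order = causal_order(allowed_arrows)  # causes always appear before their effects
```

Adding the forbidden arrow `("occupation", "education")` to the list makes `causal_order` raise `CycleError`, which is one way to reject an incoherent set of arrows before building any graph-data.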

## Enforcing causal paths in tabular GNN - Full data-graph (x, edge_index) child knowledge

To train a GNN while respecting a minimal set of causal paths, we pass 2 types of graph-data to the GNN:
- ancestor: with only the ancestor nodes
- child: ancestor + child nodes, adding the edge "ancestor -> child"

-> such that child(n) becomes the ancestor of child(n+1).

In other words, we build 2 graph-data objects (parent and child) and suggest causality by adding the child nodes (and the edge parent -> child) to the child data.

For the moment, we specify only 1 parent per edge, on 2 layers:
- ancestor layer: age -> occupation
- child layer: occupation -> hours of work per week
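
The chaining rule (the child of one layer is the ancestor of the next) can be sketched as follows. This is only an illustration at the level of feature names: `check_chain` and `cumulative_edge_sets` are hypothetical helpers, and the way the notebook's `table_to_graph` and `add_new_edge` actually build the edges is not shown here.

```python
# Each layer is a (parent, child) pair of feature names, as in the list above.
layers = [("age", "occupation"), ("occupation", "hours-per-week")]


def check_chain(layers):
    """Ensure child(n) is the parent of layer n+1, i.e. the layers form one path."""
    for (_, child), (next_parent, _) in zip(layers, layers[1:]):
        if child != next_parent:
            raise ValueError(f"broken chain: {child!r} is not followed by {next_parent!r}")


def cumulative_edge_sets(layers):
    """Graph-data n contains the arrows of all layers up to and including n."""
    check_chain(layers)
    return [layers[: depth + 1] for depth in range(len(layers))]


graph_specs = cumulative_edge_sets(layers)
# graph_specs[0] describes the ancestor graph-data, graph_specs[1] the child graph-data
```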

For the moment, to avoid spurious correlations, we also keep only the ancestor features (age, sex, race, native country) as node features for all graph-data. Building on this ancestor "blind knowledge", should the child be added as node features (along with the edge parent -> child) in the child graph-data?

# Data preparation 

Causal analysis: before the train&valid/test split, we reduce the number of features so that they contain only "straightforward" causal information, which allows us to integrate it progressively into our GNN (through edges). Therefore, we use factor analysis:

To control the class balance of df, we sort the clients so that X has perfectly equal proportions of classes 0 and 1 (at the cost of keeping 25 000 of the 45 000 initial individuals).
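
One common way to obtain such an equal split is random undersampling of the majority class. The sketch below is an assumption for illustration: `balanced_indices` is a hypothetical helper and may differ from what `get_balanced_df` does internally.

```python
import numpy as np


def balanced_indices(y, seed=7):
    """Return sorted row positions keeping as many 0s as 1s (random undersampling)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))  # size of the minority class
    keep = np.concatenate([
        rng.choice(idx0, size=n, replace=False),
        rng.choice(idx1, size=n, replace=False),
    ])
    return np.sort(keep)


y_toy = np.array([0] * 7 + [1] * 3)  # toy imbalanced target
keep = balanced_indices(y_toy)
```

With pandas objects, the selection would then be applied as `X.iloc[keep]` and `Y.iloc[keep]`.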

In [None]:
# imports and train/test split (to be put in part 2. of the notebook)
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import torch
try:
    import torch_geometric
except ModuleNotFoundError:
    # torch_geometric is missing: install its compiled dependencies for the current torch/CUDA build
    TORCH = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds
    CUDA = "cu" + torch.version.cuda.replace(".", "") if torch.version.cuda else "cpu"
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric

import sys
sys.path.append("../")

import time
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

from classif_basic.data_preparation import handle_cat_features

from classif_basic.graph.data_to_graph import table_to_graph, add_new_edge
from classif_basic.graph.train import train_GNN_ancestor

# preparing the dataset on clients for binary classification
data = fetch_openml(data_id=1590, as_frame=True)

X = data.data
Y = (data.target == '>50K') * 1

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

# first of all, unify features with "redundant" causal information
from classif_basic.graph.utils import get_unified_col

X = get_unified_col(X=X, list_cols_to_join = ["education","education-num"], new_col_name = "education")
X = get_unified_col(X=X, list_cols_to_join = ["relationship","marital-status"], new_col_name = "relationship")
X = get_unified_col(X=X, list_cols_to_join = ["occupation","workclass"], new_col_name = "job")
X = get_unified_col(X=X, list_cols_to_join = ["capital-gain","capital-loss"], new_col_name = "capital")

# select equal proportion of classes "wealthy" and "not wealthy", and generates the new dataset accordingly
from classif_basic.graph.utils import get_balanced_df

balanced_df = get_balanced_df(X=X, Y=Y)

X_balanced = balanced_df.drop("target", axis=1)
Y_balanced = balanced_df["target"]

X=X_balanced # here, we use the balanced dataset (skip these two lines to try the whole, imbalanced dataset of almost 50 000 nodes)
Y=Y_balanced

# then, normalize the df categories for better neural-network computation
from classif_basic.graph.utils import normalize_df

X=normalize_df(df=X, normalization='min_max')

# Split off the test set; the remaining "model" set (train & valid) is used for early stopping & model selection
# "stratify=Y" to keep the same proportion of target classes in train/valid (i.e. model) and test sets
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)

# Data to ancestor & child Graphs 

Cascade of causes...

In [None]:
# load already formed graph-data, to gain time
from classif_basic.model import pickle_load_model

dict_data_total = pickle_load_model("/work/data/graph_data/balanced/dict_all_edges.pkl")

data_full_education_relationship = pickle_load_model("/work/data/graph_data/balanced/data_full_features_education_relationship.pkl")

# Train a basic Graph Neural Network on the graph-shaped data

## Train with batches (neighborhood sampling) a basic GCN 

Here, we try using batches built from neighborhoods to train the GNN, using our GPU (if available).

We use our GCN_ancestor class, which progressively adds the "causal child" information through the layers:
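
The idea of feeding a different edge set to each layer can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the `GCN_ancestor` implementation: it uses directed mean aggregation instead of the symmetric GCN normalization, random untrained weights, and hypothetical names (`directed_mean_adjacency`, `conv_layer`).

```python
import numpy as np


def directed_mean_adjacency(edge_index, num_nodes):
    """Row-normalized adjacency with self-loops: each node averages itself and its parents."""
    A = np.eye(num_nodes)
    src, dst = edge_index
    A[dst, src] = 1.0  # messages flow src (ancestor) -> dst (child) only
    return A / A.sum(axis=1, keepdims=True)


def conv_layer(x, A_hat, W, skip=False):
    """Aggregate over incoming edges, apply a linear map and a ReLU."""
    h = np.maximum(A_hat @ x @ W, 0.0)
    return h + x if skip else h  # the skip connection needs matching shapes


rng = np.random.default_rng(7)
num_nodes, num_feats = 4, 3
x = rng.normal(size=(num_nodes, num_feats))      # toy node features
ancestor_edges = np.array([[0, 1], [1, 2]])      # edges 0->1, 1->2
child_edges = np.array([[0, 1, 2], [1, 2, 3]])   # ancestor edges + new edge 2->3
W1 = rng.normal(size=(num_feats, num_feats))
W2 = rng.normal(size=(num_feats, num_feats))

# Layer 1 only sees the ancestor edges; layer 2 sees the ancestor + child edges.
h1 = conv_layer(x, directed_mean_adjacency(ancestor_edges, num_nodes), W1)
h2 = conv_layer(h1, directed_mean_adjacency(child_edges, num_nodes), W2, skip=True)
```

Because the adjacency is directed, a node with no parent (node 0 here) only ever sees its own features, so information cannot flow from a child back to its ancestor.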

Here, with batches of 128 individuals, 76% accuracy is reached by passing a causal order on layer1 and layer2 (the same accuracy as when all features are specified, with no causal layer!)...

Here, we use our own data-loader, which ensures that each batch passes the same individuals to the GNN (such that only the causal data changes through the layers):
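
A minimal sketch of such a loader, assuming that the node indices are partitioned once and that the same groups are reused to slice every graph-data. The helpers `make_index_groups` and `induced_edges` are hypothetical and only illustrate the principle; they may differ from the notebook's `"index_groups"` loader method.

```python
import numpy as np


def make_index_groups(num_nodes, nb_batches, seed=7):
    """Partition the node indices once; the same groups serve every graph-data."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_nodes)
    return [np.sort(group) for group in np.array_split(shuffled, nb_batches)]


def induced_edges(edge_index, group):
    """Keep the edges with both endpoints in `group`, relabeled to local positions."""
    keep = np.isin(edge_index[0], group) & np.isin(edge_index[1], group)
    return np.searchsorted(group, edge_index[:, keep])  # valid because group is sorted


ancestor_edges = np.array([[0, 1, 2, 3], [1, 2, 3, 4]])
child_edges = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])

groups = make_index_groups(num_nodes=6, nb_batches=2)
batches = [
    {
        "nodes": group,
        "ancestor": induced_edges(ancestor_edges, group),
        "child": induced_edges(child_edges, group),
    }
    for group in groups
]
```

Each batch then holds the same individuals for the ancestor and the child graph-data, so that only the edges differ from one causal layer to the next.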

We begin with all 3 successive edges (without fnlwgt: not enough information in edge_index?) given to our (basic) GCN with successive causal layers:

In [None]:
loader_method="index_groups"
model_type="conv"
loss_name="CrossEntropyLoss"
#batch_size=10_000#19_868
learning_rate = 0.01
nb_batches=1
epoch_nb = 1000
cv_step=100

skip_connection=True

In [None]:
# don't forget to reallocate GPU memory, before new training!
torch.cuda.empty_cache()

# and to build the index, pass the data-graph to CPU device
dict_data_total['education->relationship'].cpu()

# combine the 2 edges? They reach losses of 0.4 and 0.3 respectively
gnn_education_job = train_GNN_ancestor(
                list_data_total=[dict_data_total['education->relationship'],
                                 dict_data_total['relationship->job']],
                model_type=model_type,
                loader_method=loader_method,
                loss_name=loss_name,
                #batch_size=batch_size,
                nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                cv_step=cv_step,
                learning_rate = learning_rate,
                skip_connection=skip_connection)

## Standard GCN on all features for comparison
Finally, we compare these GCNs, trained with progressively incorporated / partial edges, with a GCN trained with all features as node features (which correlates them...) and only the edge "job -> work hours".

In [None]:
gnn_classic = train_GNN_ancestor(
                list_data_total=[data_full_education_relationship],
                model_type=model_type,
                loader_method=loader_method,
                loss_name=loss_name,
                #batch_size=batch_size,
                nb_batches=nb_batches,
                epoch_nb = epoch_nb,
                cv_step=cv_step,
                learning_rate = learning_rate,
                skip_connection=skip_connection)

# Basic XGB for comparison - excellent results
With the same unified features: 92% ROC-AUC, 81-84% PR-AUC and 85% accuracy on the train&valid sets.
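
For reference, two of these metrics can be computed from predicted scores as sketched below. This is a plain NumPy illustration on toy values, with hypothetical helper names. It is not the code of `train_xgb_benchmark`, and the pairwise ROC-AUC is quadratic in memory, so it only suits small samples.

```python
import numpy as np


def roc_auc_pairwise(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def accuracy_at_threshold(y_true, scores, threshold=0.5):
    """Share of correct predictions when scores >= threshold are labeled 1."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return float((preds == np.asarray(y_true)).mean())


y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(y_toy, scores_toy)        # 3 of the 4 (positive, negative) pairs are ordered correctly
acc = accuracy_at_threshold(y_toy, scores_toy)   # 3 of the 4 predictions are correct
```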

In [None]:
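from classif_basic.graph.train import train_xgb_benchmark

# assumption: X_train/X_valid are not defined above, so split the "model" set (train & valid) here
X_train, X_valid, Y_train, Y_valid = train_test_split(X_model, Y_model, test_size=VALID_SIZE, random_state=SEED, stratify=Y_model)

train_xgb_benchmark(X_train=X_train, X_valid=X_valid, Y_train=Y_train, Y_valid=Y_valid)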