# GCN on Census - training with batches (neighbor sampling method)
This is an example of using a GCN on a tabular dataset for binary classification (here, Census, to detect the people earning more than $50,000). We suppose we already have some **logically consistent arrows** (coming from a logical analysis of the data, i.e. all the coherent DAGs) that we want the GCN to learn: this is **phase 2**.

Could the **causal hierarchy** be introduced in the [definition of neighbors](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/loader/neighbor_loader.html#NeighborLoader) used to build the subgraphs (i.e. the batches of the DataLoader)?

Maybe: to specify the different "relations", do we need to build a [heterogeneous graph](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/to_hetero_mag.py)?
Begin with these constraints:
- Graph data 1: edge "sex"
- Graph data 2: edge "work -> hours of work"
- Graph data 1 -> (inherits from; temporal?) Graph data 2

In this notebook, we inspect **how a tabular dataset such as Census can be used by a graph-based AI to estimate the wealth of individuals**.

Therefore, we proceed in 3 steps:

**1. We prepare the data to be handled by a graph-based model**
We transform them into a graph, which involves strong assumptions on the features involved in the connections...

**2. We train a graph-based AI**
Here, we begin with a Graph Neural Network (GNN) based on a Multi-Layer Perceptron (MLP), requiring the PyTorch library.

**3. We inspect whether the graph-based AI indeed reflects common & expert knowledge**
In particular, regarding the nonsensical inferences that should absolutely be avoided (e.g. education may influence occupation, but not the reverse).

In [None]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

# Data preparation for binary classification with graphs (Census)
For this reshaping (and also interpretation, see below the choice of edges) of data tables to graphs, we used a basic Google [colab](https://colab.research.google.com/drive/1_eR7DXBF3V4EwH946dDPOxeclDBeKNMD?usp=sharing#scrollTo=WuggdIItffpv).

## General preparation - handle categorical features
Here, we handle the categorical features through label-encoding. 

We need torch-scatter and torch-sparse to enable torch_geometric (used both for our transformation of the table into a graph and for the GNN). As they do not seem to be compatible with the GPU when installed through Poetry, we use a [trick](https://stackoverflow.com/questions/74823704/error-building-wheel-for-torch-sparse-error-installing-pytorch-geometric) to install them with pip from the notebook (to be cleaned):

In [None]:
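import torch

try:
    import torch_geometric
except ModuleNotFoundError:
    # build the wheel index URL matching the installed torch / CUDA versions
    TORCH = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds
    CUDA = ("cu" + torch.version.cuda.replace(".", "")) if torch.version.cuda else "cpu"
    # the installs sit inside the except block, where TORCH and CUDA are defined
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric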

In [None]:
import sys
sys.path.append("../")

import time
from sklearn import datasets

from sklearn.preprocessing import LabelEncoder

import torch
from torch_geometric.data import Data

import itertools
import numpy as np
import pandas as pd

from classif_basic.data_preparation import handle_cat_features, train_valid_test_split

from classif_basic.graph import table_to_graph, add_new_edge

### Prepare data

In [None]:
# preparing the dataset on clients for binary classification
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, as_frame=True)

t0 = time.time()

X = data.data
Y = (data.target == '>50K') * 1

### Add pre-processing: split hours-per-week at the median (2 groups), to use it as an edge (combined with "occupation")
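
Before applying the threshold to Census, a toy sketch (the values are illustrative, not taken from the dataset) shows how a strict comparison against the median behaves differently from an equality test. With many clients sitting exactly on the median, the two choices give very different groups:

```python
import pandas as pd

# toy column: three clients sit exactly on the median
hours = pd.Series([20, 40, 40, 40, 50, 60], name="hours-per-week")
median_hours = hours.median()

# strict comparison: 1 only for clients working more than the median
above_median = (hours > median_hours).astype(int)

# equality test: 1 only for clients working exactly the median
equal_median = (hours == median_hours).astype(int)

print(above_median.tolist())
print(equal_median.tolist())
```

Which of the two groupings is wanted depends on what the edge should mean; the sketch only makes the difference visible.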

In [None]:
X["hours-per-week"].value_counts().plot()

In [None]:
median_hours = X["hours-per-week"].median()

# binarize: 1 if the client works more than the median (about 40 hours per week), else 0
X["hours-per-week"] = (X["hours-per-week"] > median_hours).astype(int)
X["hours-per-week"]

## Reshape (by interpreting) data to a graph

From this dataset (where we selectively introduced a "sexist" effect against women), let's see how we could switch from the tabular data to a graph representation.

The point is that our features X all seem to be attributes of the clients, whereas we need to find a way of representing the interactions between clients.

X = {race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, occupation, hours per week, workclass, capital gain, capital loss, native country}

**Nodes** 
Bank clients (by ID)

**Edges** 
Here, we should find one or several ways of connecting the clients

Should be occupation → if the occupation changes (or for a similar client with a new occupation), what is the impact on the revenue? // change of football team => impact on the football rate 
(personal note) actionable => predict the revenue when the client switches to a new job??
→ maybe: “hours per week” <=> inspect the change of revenue if the client switches to more hours per week?

**Node Features** 
Attributes of the nodes, i.e. characteristics of the clients (here, hard to separate from what "connects" them...) 

Race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, hours per week, workclass, capital gain, capital loss, native country 

**Label (here at a node-level?)** 
Income (Y = income > $50,000)
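
To make the "Edges" idea concrete, here is a minimal sketch of one possible rule: connect every pair of clients that share the same values on the chosen columns. The function `edges_from_shared_values` is hypothetical and is not the `add_new_edge` helper of `classif_basic`, whose implementation is not shown in this notebook. The result follows the `[2, num_edges]` layout (sources, targets) that PyTorch Geometric uses for `edge_index`.

```python
from collections import defaultdict
from itertools import permutations

def edges_from_shared_values(rows, keys):
    """Connect every ordered pair of distinct rows sharing the same values on `keys`.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for node_id, row in enumerate(rows):
        groups[tuple(row[k] for k in keys)].append(node_id)

    sources, targets = [], []
    for members in groups.values():
        # a group of n clients yields n * (n - 1) directed edges
        for src, dst in permutations(members, 2):
            sources.append(src)
            targets.append(dst)
    return [sources, targets]

# toy clients (illustrative values, not Census rows)
clients = [
    {"occupation": "sales", "hours-per-week": 1},
    {"occupation": "sales", "hours-per-week": 1},
    {"occupation": "sales", "hours-per-week": 0},
    {"occupation": "tech", "hours-per-week": 1},
    {"occupation": "sales", "hours-per-week": 1},
]
edge_index = edges_from_shared_values(clients, ["occupation", "hours-per-week"])
print(len(edge_index[0]), "directed edges")
```

Under this rule the number of edges grows quadratically with the size of each group, so coarse categories (e.g. a binary column alone) produce very dense graphs.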

Test of my idea: create graphs with different edges, here sex (graph 1) -> education (graph 2)?

Or enforce the causal hierarchy through the neighborhood definition?

As the batches are built from neighborhoods with PyTorch Geometric, we split the data inside the function and keep the train/valid/test masks (i.e. boolean tensors indicating whether each individual is in X_train/X_valid/X_test).

For instance, data_total.train_mask will have to be passed as "input_nodes"...
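
As an illustration of what such masks look like, here is a minimal sketch with NumPy. The function `make_masks`, its fractions and its seed are assumptions made for the example; the actual split happens inside the `classif_basic` helpers and may differ (for instance, this sketch is not stratified on the target).

```python
import numpy as np

def make_masks(n_nodes, valid_frac=0.15, test_frac=0.15, seed=7):
    """Return disjoint boolean train/valid/test masks over `n_nodes` nodes.

    Hypothetical helper for illustration only (random, non-stratified split).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_nodes)
    n_test = int(round(test_frac * n_nodes))
    n_valid = int(round(valid_frac * n_nodes))

    masks = {name: np.zeros(n_nodes, dtype=bool) for name in ("train", "valid", "test")}
    masks["test"][order[:n_test]] = True
    masks["valid"][order[n_test:n_test + n_valid]] = True
    masks["train"][order[n_test + n_valid:]] = True
    return masks

masks = make_masks(n_nodes=20)
print({name: int(mask.sum()) for name, mask in masks.items()})
```

In PyTorch Geometric, the same information would typically be stored as boolean tensors on the `Data` object, and the train mask handed to the loader so that only training nodes are used as seeds.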

## Split between data used for GNN training / test data 

In [None]:
from sklearn.model_selection import train_test_split

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

# Hold out a test set; the train/valid split (for early stopping & model selection) is handled later through masks
# "stratify=Y" to keep the same proportion of target classes in the train/valid (i.e. model) and test sets 
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)

## Transformation of model / test data into graphs with the same attributes

First, shape the data used for GNN training in a graph.

In [None]:
# build the edges by hand: create our own edge combination, with directed paths, to predict the income
# first edge joins "occupation" -> "hours-per-week"
# second edge (disabled for now) joins "sex" -> "education"
X_total = X_model
Y_total = Y_model

list_col_names = ["occupation", "hours-per-week"] # test the model with only 2 categories (> or < median of work hours)

edges_total = add_new_edge(data=X_total, previous_edge=None, list_col_names=list_col_names)
#edges_total = add_new_edge(data=X_total, previous_edge=edges_total, list_col_names=["sex", "education"])

# to train by specifying "masks" (i.e. booleans flagging the nodes = individuals selected to train the GNN), 
# a specification of the train indexes is added 
data_total = table_to_graph(X=X_total, Y=Y_total, list_col_names=list_col_names, edges=edges_total)

Do exactly the same for test data (will be used for GNN test evaluation):

In [None]:
list_col_names = ["occupation", "hours-per-week"] # test the model with only 2 categories (> or < median of work hours)

edges_test = add_new_edge(data=X_test, previous_edge=None, list_col_names=list_col_names)
#edges_test = add_new_edge(data=X_test, previous_edge=edges_test, list_col_names=["sex", "education"])

# same graph attributes as for the training data (masks included), here for the test individuals
data_test = table_to_graph(X=X_test, Y=Y_test, list_col_names=list_col_names, edges=edges_test)

# Train a sequential GNN learning causal hierarchy (small data)

For faster training, and to avoid splitting the data into batches (next step if it works), we train the GNN here on data_test (small data). 

1st layer: edge_index of 'sex'
2nd layer: 'discovers' the edge_index of 'job' -> 'work hours'

## First approach: only changing the specified edge

It still works with 76% accuracy, but a problem is that only changing the edge (from sex -> job) does not imply that the GNN "learns" sex to be the causal ancestor...

In particular, the **node features** include causal children (education, job...) that already exist in the first "ancestor" layer.

Here, we only need to specify the "parent" and "child" indexes processed sequentially by the GNN => is the causal hierarchy between sex and work integrated (at least, 76% accuracy on small data)? To be tested further with the GPU...

## Integrating "new" child knowledge of the world 

Here, we try to pass 2 graph-data to the GNN:
- ancestor: with only the ancestor nodes
- child: ancestor + child nodes, adding "ancestor -> child" as an edge

We build 2 graph-data (parent/child), suggesting causality by adding the child nodes (and also the edge parent -> child) in the child data.

Problem: only 1 parent specified? Let's begin by seeing if it works...
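
The layering described above can be sketched as cumulative column lists: each causal level keeps everything known at the previous levels and adds its own columns. The helper `cumulative_levels` is hypothetical (it does not exist in `classif_basic`); the column names are the Census ones used in this notebook.

```python
def cumulative_levels(levels):
    """For each causal level, list the columns known up to and including it.

    Hypothetical helper for illustration only.
    """
    known, per_level = [], []
    for new_columns in levels:
        known = known + list(new_columns)
        per_level.append(list(known))
    return per_level

# ordered from ancestors to descendants
levels = [
    ["age", "race", "sex", "native-country"],  # parent level
    ["education"],                             # first child
    ["workclass"],                             # second child
]
columns_per_level = cumulative_levels(levels)
for depth, columns in enumerate(columns_per_level):
    print(depth, columns)
```

Writing the hierarchy once as an ordered list makes it easier to extend to several parents or more levels than repeating one cell per level.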

In [None]:
list_parent = ["age", "race", "sex", "native-country"]
edge_parent = "sex"

X_test_parent = X_test.filter(list_parent)

# add "sex" as a connection (edge) between these ancestor features
edges_test_parent = add_new_edge(data=X_test_parent, previous_edge=None, list_col_names=[edge_parent])

# being an edge, "sex" must be removed from the nodes
list_parent.remove(edge_parent)
data_test_parent = table_to_graph(X=X_test_parent, Y=Y_test, list_col_names=list_parent, edges=edges_test_parent)
data_test_parent

Based on this ancestor "blind knowledge", add the child as a node feature (and also the edge parent -> child) in the child graph-data. Here the first child is "education" (following the stages of life...):

In [None]:
# we first simplify the categories of education, to be better computed as edges

median_education = X_test["education"].median()

# binarize: 1 if the label-encoded education is above the median code, else 0
# NB: label-encoded codes are not necessarily ordered by education level
X_test["education"] = (X_test["education"] > median_education).astype(int)
X_test["education"].value_counts()

In [None]:
list_child1 = ["age", "race", "sex", "native-country", "education"]
edge_parent = "sex"
edge_child1 = "education"

X_test_child1 = X_test.filter(list_child1)

# add "sex" -> "education" as a connection (edge) between these features
edges_test_child1 = add_new_edge(data=X_test_child1, previous_edge=None, 
                                 list_col_names=[edge_parent, edge_child1])

# being edges, "sex" and "education" must be removed from the nodes
list_child1.remove(edge_parent)
list_child1.remove(edge_child1)

data_test_child1 = table_to_graph(X=X_test_child1, Y=Y_test, list_col_names=list_child1, 
                                  edges=edges_test_child1)
data_test_child1

In [None]:
list_child2 = ["age", "race", "sex", "native-country", "education", "workclass"]
edge_parent = "sex"
edge_child1 = "education"
edge_child2 = "workclass"

X_test_child2 = X_test.filter(list_child2)

# add "education" -> "workclass" as a connection (edge) between these features
edges_test_child2 = add_new_edge(data=X_test_child2, previous_edge=None, 
                                 list_col_names=[edge_child1, edge_child2])

# being edges (at this level or a previous one), "sex", "education" and "workclass" must be removed from the nodes
list_child2.remove(edge_parent)
list_child2.remove(edge_child1)
list_child2.remove(edge_child2)

data_test_child2 = table_to_graph(X=X_test_child2, Y=Y_test, list_col_names=list_child2, 
                                  edges=edges_test_child2)
data_test_child2

Finish with the "last" descendants, to complete the node features in the last layer (!) TODO beforehand: group the columns to avoid redundancy while keeping the input information (e.g. between education and education level)!!

-> a reduction analysis remains to be carried out

In [None]:
list_ascendants = ["age", "race", "sex", "native-country", "education", "workclass"]

set_final_descendants = set(X_test.columns) - set(list_ascendants)
set_final_descendants

In [None]:
# here, we assume the last descendant features are the last non-listed columns (has to be deduced)
# list_final_descendants = list(set_final_descendants)
list_final_descendants = [ 
    'capital-gain',
     # 'occupation', TODO merge information with 'workclass'
     'marital-status',
     'fnlwgt',
     #'education-num', TODO merge information with 'education'
     'hours-per-week',
     'relationship',
     'capital-loss',
      edge_child2] # add edge_child2 at the end of the chain? TODO investigate

edge_final_descendants = 'hours-per-week'

X_test_final_descendants = X_test.filter(list_final_descendants)

# add "workclass" -> "hours-per-week" as a connection (edge) between these descendant features
edges_test_final_descendants = add_new_edge(data=X_test_final_descendants, previous_edge=None, 
                                 list_col_names=[edge_child2, edge_final_descendants]) 
                                # edge here: direction of the last child ("workclass") on these features
                                # here, we even add the direction on 'hours-per-week' -> TODO investigate 
                                # whether 'hours-per-week' works only if it has the same ascendant level as the other node features here

list_final_descendants.remove(edge_child2)
list_final_descendants.remove(edge_final_descendants)

data_test_final_descendants = table_to_graph(X=X_test_final_descendants, Y=Y_test, list_col_names=list_final_descendants, 
                                  edges=edges_test_final_descendants)
data_test_final_descendants

In [None]:
from classif_basic.graph import activate_gpu

device = activate_gpu()

Here, we inspect whether an edge carrying information that is not among the node feature attributes can be added.

Indeed, 
- data_parent only contains the parent features (age, sex, race, origin)
- edge_parent_child computes the neighborhoods (embeddings) between clients by adding the information "work"

=> what if we added the causal descendant edges through causal layers? 
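
To fix ideas on what "one edge set per layer" means, here is a minimal NumPy sketch of two graph-convolution layers that use the standard GCN propagation rule (normalized adjacency with self-loops, times features, times weights), each layer with a different `edge_index`. This is an assumption-laden toy: it is not the `GCN_ancestor_edges` model (whose code is not shown here), the weights are random, nothing is trained, and the toy edges are undirected.

```python
import numpy as np

def normalized_adjacency(edge_index, n_nodes):
    """Return D^-1/2 (A + I) D^-1/2 for the given [2, num_edges] edge list."""
    adj = np.eye(n_nodes)
    # row = receiving node, column = sending node
    adj[edge_index[1], edge_index[0]] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(h, edge_index, weight):
    """One graph convolution followed by a ReLU."""
    a_hat = normalized_adjacency(edge_index, h.shape[0])
    return np.maximum(a_hat @ h @ weight, 0.0)

rng = np.random.default_rng(0)
n_nodes = 4
x = rng.normal(size=(n_nodes, 3))                      # toy node features

edges_parent = np.array([[0, 1, 2, 3], [1, 0, 3, 2]])  # pairs {0,1} and {2,3}
edges_child = np.array([[0, 2, 1, 3], [2, 0, 3, 1]])   # pairs {0,2} and {1,3}

w1 = rng.normal(size=(3, 5))
w2 = rng.normal(size=(5, 2))

h1 = gcn_layer(x, edges_parent, w1)                    # layer 1 sees the "parent" edges
a_child = normalized_adjacency(edges_child, n_nodes)
logits = a_child @ h1 @ w2                             # layer 2 sees the "child" edges

# which nodes can influence which after the two layers
reach = a_child @ normalized_adjacency(edges_parent, n_nodes)
print(logits.shape, (reach > 0).all())
```

In this toy, each node only mixes with one neighbor per layer, yet after both layers every node has received information from all four nodes: stacking layers with different edges composes the neighborhoods. Whether that composition encodes a causal order is exactly the open question raised above.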

In [None]:
from classif_basic.graph import GCN_ancestor_edges

learning_rate = 0.001

classifier = GCN_ancestor_edges(data_test_parent).to(device) # only ancestor information in initialization? OK if not used

classifier.train()
optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate)
optimizer.zero_grad()

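# these data (nodes, targets) do not change with the edges
# (kept under their own name, so that the full test graph `data_test` built above is not overwritten)
data_test_parent = data_test_parent.to(device)
target = data_test_parent.y.to(device)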

x_parent = data_test_parent.x.to(device)
x_child1 = data_test_child1.x.to(device)
x_child2 = data_test_child2.x.to(device)
x_final_descendants = data_test_final_descendants.x.to(device)

# then, pass the different edges
# for learning of this causal hierarchy by the GCN
edge_index_parent = data_test_parent.edge_index.to(device)
edge_index_child1 = data_test_child1.edge_index.to(device)
edge_index_child2 = data_test_child2.edge_index.to(device)
edge_index_final_descendants = data_test_final_descendants.edge_index.to(device)

preds = classifier(
    x_parent=x_parent,
    x_child1=x_child1,
    x_child2=x_child2,
    x_final_descendants=x_final_descendants,
    edge_index_parent=edge_index_parent,
    edge_index_child1=edge_index_child1, 
    edge_index_child2=edge_index_child2,
    edge_index_final_descendants=edge_index_final_descendants,
    device=device)

preds

In [None]:
loss = torch.nn.CrossEntropyLoss()

error_test = loss(preds, target)
print(f"\nError on test: {error_test:.4f} \n")

# compute the overall test accuracy
preds_temp = preds.argmax(dim=1)
total = len(target)
correct = (preds_temp == target).sum().item()
print(f"Test Accuracy = {round(correct / total, 2)}") 

# Train a basic Graph Neural Network on the graph-shaped data

## Train with batches (neighborhood sampling) a basic GCN 

Here, we try to train the GNN with batches built from neighborhoods, using our GPU (if available):
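
The idea of neighbor sampling can be sketched in plain Python: start from a batch of seed nodes, then for each hop keep at most a fixed number of randomly chosen neighbors per node. The function `sample_subgraph` below is a hypothetical illustration, not the implementation of PyTorch Geometric's `NeighborLoader`. Presumably (this is an assumption), `nb_neighbors_per_sample` and `nb_iterations_per_neighbors` in `train_GNN` play the roles of the fan-out and of the number of hops.

```python
import random

def sample_subgraph(adjacency, seeds, num_neighbors, rng):
    """Return the set of nodes reached from `seeds` by sampled neighborhoods.

    adjacency: dict mapping a node to the list of its neighbors
    num_neighbors: one fan-out per hop, e.g. [30, 30] for 2 hops
    Hypothetical helper for illustration only.
    """
    visited = set(seeds)
    frontier = list(seeds)
    for fanout in num_neighbors:
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            # keep at most `fanout` neighbors, chosen at random
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            for neighbor in picked:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited

# toy graph (illustrative)
adjacency = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 5], 4: [1], 5: [3]}
nodes = sample_subgraph(adjacency, seeds=[0], num_neighbors=[2, 2], rng=random.Random(7))
print(sorted(nodes))
```

Each batch therefore only touches a bounded subgraph around its seeds, which is what keeps memory under control on dense graphs.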

In [None]:
from classif_basic.graph import train_GNN

gnn_basic = train_GNN(
                data_total=data_total,
                loader_method="neighbor_nodes",
                batch_size = 32,
                epoch_nb = 2,
                learning_rate = 0.01,
                nb_neighbors_per_sample = 30,
                nb_iterations_per_neighbors = 2)

## Inspect the predictions of the model on valid and test sets

Let's inspect the model on test data, to assess if the stability of performance is not due to coincidence:

In [None]:
from classif_basic.graph import evaluate_gnn

loss_name = "cross_entropy"

# unfortunately, memory error... Evaluate per batches? Or create an independent data_test? // Evaluate on valid 

evaluate_gnn(
    classifier=gnn_basic, 
    data_test=data_test, 
    loss_name=loss_name)

**No Overfitting**

With this very simple shape of graph-data (directed edge = "job" -> "work hours"), the accuracy remains at 75% for train, valid and test data.

This confirms that training a basic GNN on basically shaped data delivers stable results here.

# Visual Representation of the Graph
Here, we look for a visual representation of the (directed acyclic?) graph. The goal is to check whether it corresponds to the users' intuition, at least regarding the nonsensical causal paths. 

Here, the edges have been built with the directed path **sex -> education** (recall that the link [potentially] exists, because we deliberately biased the data to be "sexist" regarding the distribution of incomes). Hence, the nonsense we do not want to find is an impact of education on sex. 
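
Such a check can be automated once the expected causal order is written down. The sketch below flags every directed edge that points from a later level to an earlier one; the helper `backward_edges` and the toy order are hypothetical and only serve as an illustration.

```python
def backward_edges(edges, levels):
    """Return the edges pointing from a later causal level to an earlier one.

    Hypothetical helper for illustration only.
    """
    rank = {feature: depth for depth, group in enumerate(levels) for feature in group}
    return [(src, dst) for src, dst in edges if rank[src] > rank[dst]]

# expected order, from ancestors to descendants (illustrative)
levels = [["sex"], ["education"], ["occupation"]]

# candidate edges, the last one being a nonsensical path
edges = [
    ("sex", "education"),
    ("education", "occupation"),
    ("occupation", "education"),
]
violations = backward_edges(edges, levels)
print(violations)
```

An empty list means that no edge contradicts the stated order; it does not prove that the stated order itself is right.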

Obviously, we have no clear intuition of what these links correspond to... For each individual, a path from the sex to the income? But there are more groups than the individuals selected here (10)...

## Constitute a graph - Try to connect the features 

Here, we proceed in 2 steps (back and forth)

1. **Detect the relations**
We use the partial dependence plots to inspect the correlations. (Personal note) Is this sufficient? Should we look at changes under input interventions?

2. **Select the causal direction**
Based on the user's experience and expertise (e.g. sex -> education, because the contrary would be logically and temporally impossible)
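
As a reminder of what a partial dependence curve measures, here is a minimal sketch computed by hand: for each value of a grid, the chosen feature is forced to that value for every row and the model predictions are averaged. The function `partial_dependence_1d` and the toy model are hypothetical and stand in for the trained classifier.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature_idx, grid):
    """Average prediction when column `feature_idx` is forced to each grid value.

    Hypothetical helper for illustration only.
    """
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def toy_model(X):
    # toy predictor: depends on feature 0 only
    return 2.0 * X[:, 0] + 0.0 * X[:, 1]

X_toy = np.array([[0.0, 5.0], [1.0, 3.0], [2.0, 1.0]])

pd_feature0 = partial_dependence_1d(toy_model, X_toy, 0, grid=[0.0, 1.0])
pd_feature1 = partial_dependence_1d(toy_model, X_toy, 1, grid=[0.0, 1.0])
print(pd_feature0, pd_feature1)
```

A flat curve (feature 1 here) means that the model ignores the feature; a varying curve shows a dependence learned by the model, without saying anything about its causal direction.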

At first sight, look at the correlated features (!) there may be some hidden correlations => experience is still required at this stage: