In this notebook, we inspect **how a tabular dataset such as Census can be used by a graph-based AI to estimate the wealth of individuals**. 

To do so, we proceed in 3 steps:

**1. We prepare the data to be handled by a graph-based model**
We transform the table into a graph, which involves strong assumptions about the features used to build the connections.

**2. We train a graph-based AI**
Here, we begin with a Graph Neural Network (GNN) based on a Multi-Layer Perceptron (MLP), which requires the PyTorch library.

**3. We inspect whether the graph-based AI indeed reflects common and expert knowledge**
In particular, we look for nonsensical inferences that should absolutely be avoided (e.g. education may influence occupation, but not the reverse).

In [None]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

# Data preparation for binary classification with graphs (Census)
To reshape the data tables into graphs (which also involves interpretation, see the choice of edges below), we relied on a basic Google [colab](https://colab.research.google.com/drive/1_eR7DXBF3V4EwH946dDPOxeclDBeKNMD?usp=sharing#scrollTo=WuggdIItffpv).

## General preparation - handle categorical features
Here, we handle the categorical features through label-encoding. 
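
As a minimal sketch of what label-encoding does, here is a plain-Python version on a made-up column. The helper `label_encode` is hypothetical and only illustrates the idea; the notebook itself relies on `handle_cat_features`, whose internals are not shown here.

```python
def label_encode(values):
    """Map each distinct category to an integer code (categories sorted alphabetically)."""
    classes = sorted(set(values))
    code_of = {category: code for code, category in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Toy column, made up for illustration
workclass = ["Private", "State-gov", "Private", "Self-emp", "State-gov"]
codes, classes = label_encode(workclass)
print(codes)    # [0, 2, 0, 1, 2]
print(classes)  # ['Private', 'Self-emp', 'State-gov']
```

Note that the integer codes carry an arbitrary order between categories, which a model may interpret as a ranking.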

torch_geometric (which provides both our table-to-graph transformation and the GNN) requires torch-scatter and torch-sparse. As these packages do not seem compatible with GPU when installed through poetry, we use a [trick](https://stackoverflow.com/questions/74823704/error-building-wheel-for-torch-sparse-error-installing-pytorch-geometric) to install them from the notebook with pip (to be cleaned):

In [None]:
import torch
try:
    import torch_geometric
except ModuleNotFoundError:
    # Select the wheels matching the installed torch / CUDA versions
    TORCH = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds of torch
    CUDA = "cu" + torch.version.cuda.replace(".", "") if torch.version.cuda else "cpu"
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric

In [None]:
import sys
sys.path.append("../")

import time
from sklearn import datasets

from sklearn.preprocessing import LabelEncoder

import torch
from torch_geometric.data import Data

import itertools
import numpy as np
import pandas as pd

from classif_basic.data_preparation import handle_cat_features, train_valid_test_split

from classif_basic.graph import table_to_graph, add_new_edge

### Prepare data

In [None]:
# preparing the dataset on clients for binary classification
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, as_frame=True)

t0 = time.time()

X = data.data
Y = (data.target == '>50K') * 1

### Add pre-processing: split hours-per-week into 2 groups at the median, to use it as an edge (combined with "occupation")
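
As a minimal sketch of such a median split, here is a plain-Python version on made-up values (the notebook works on the pandas column instead). The helper `split_at_median` is hypothetical and assumes that "1" means strictly above the median.

```python
from statistics import median


def split_at_median(values):
    """Return (threshold, flags), with flag 1 when the value is strictly above the median."""
    threshold = median(values)
    return threshold, [int(v > threshold) for v in values]


# Toy weekly hours, made up for illustration
hours = [20, 40, 40, 40, 45, 60, 38, 40]
threshold, flags = split_at_median(hours)
print(threshold)  # 40.0
print(flags)      # [0, 0, 0, 0, 1, 1, 0, 0]
```

Values equal to the median fall into the lower group, so the two groups can be unbalanced when many clients report exactly the median.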

In [None]:
X["hours-per-week"].value_counts().plot()

In [None]:
median_hours = X["hours-per-week"].median()

# '1' if the client works more than the median (40 hours per week), '0' otherwise
X["hours-per-week"] = (X["hours-per-week"] > median_hours).astype(int)
X["hours-per-week"]

## Reshape (by interpreting) data to a graph

From this dataset (where we selectively introduced a "sexist" effect against women), let's see how we could switch from the tabular data to a graph representation.

The difficulty is that all our features X look like attributes of the clients, whereas a graph requires a way of representing the interactions between clients.

X = {race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, occupation, hours per week, workclass, capital gain, capital loss, native country} 

**Nodes** 
Bank clients (by ID)

**Edges** 
Here, we should find one or several ways of connecting the clients.

A first candidate is occupation: if a client changes occupation (or if a similar client has a new occupation), what is the impact on the revenue? This is analogous to a change of football team having an impact on the football rating. 
(Personal note) This would be actionable: could we predict the revenue when a client switches to a new job?
→ Another candidate may be "hours per week": inspect how the revenue changes if the client switches to more hours per week.
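
One possible way of turning such a shared feature into edges is sketched below in plain Python: every pair of clients with the same values on the chosen columns gets connected. The helper `edges_from_shared_values` and the toy rows are hypothetical; this is not necessarily what `add_new_edge` does internally.

```python
from collections import defaultdict
from itertools import permutations


def edges_from_shared_values(rows, columns):
    """Connect every pair of rows that share the same values on `columns`.

    Returns two parallel lists (sources, targets), i.e. an edge list in
    which each connection appears once per direction.
    """
    groups = defaultdict(list)
    for node_id, row in enumerate(rows):
        key = tuple(row[c] for c in columns)
        groups[key].append(node_id)

    sources, targets = [], []
    for members in groups.values():
        for src, dst in permutations(members, 2):
            sources.append(src)
            targets.append(dst)
    return sources, targets


# Toy clients, made up for illustration
clients = [
    {"occupation": "Sales", "hours-per-week": 1},
    {"occupation": "Sales", "hours-per-week": 1},
    {"occupation": "Tech-support", "hours-per-week": 0},
    {"occupation": "Sales", "hours-per-week": 1},
]
sources, targets = edges_from_shared_values(clients, ["occupation", "hours-per-week"])
print(list(zip(sources, targets)))
# [(0, 1), (0, 3), (1, 0), (1, 3), (3, 0), (3, 1)]
```

With this construction, a group of n clients produces n(n-1) edges, so the edge count grows quadratically with the size of the largest group.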

**Node Features** 
Attributes of the nodes, i.e. characteristics of the clients (here, hard to separate from what "connects" them...) 

Race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, hours per week, workclass, capital gain, capital loss, native country 

**Label (here at a node-level?)** 
Income (Y = income > $50 000)

Idea to test: create graphs with different edges, e.g. sex (graph 1) -> education (graph 2)?

Or enforce a causal hierarchy through the definition of the neighborhood?

As PyTorch Geometric builds batches from neighborhoods, we split the data inside the function and keep the train/valid/test masks (i.e. boolean tensors indicating whether each individual belongs to X_train/X_valid/X_test).

For instance, data_total.train_mask will have to be passed as "input_nodes"...
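
As a minimal sketch of what such masks look like, the plain-Python helper below partitions node ids into train and valid sets and returns one boolean per node. The name `make_split_masks` and its parameters are hypothetical; in the notebook the masks are built inside the project's own functions.

```python
import random


def make_split_masks(n_nodes, valid_fraction, seed):
    """Return boolean train/valid masks that partition the node ids 0..n_nodes-1."""
    rng = random.Random(seed)
    node_ids = list(range(n_nodes))
    rng.shuffle(node_ids)
    n_valid = int(round(valid_fraction * n_nodes))
    valid_ids = set(node_ids[:n_valid])
    valid_mask = [i in valid_ids for i in range(n_nodes)]
    train_mask = [not flag for flag in valid_mask]
    return train_mask, valid_mask


train_mask, valid_mask = make_split_masks(n_nodes=20, valid_fraction=0.15, seed=7)
print(sum(train_mask), sum(valid_mask))  # 17 3
```

In PyTorch Geometric, the same information is typically stored as boolean tensors on the graph object, and the training mask is what gets passed as `input_nodes` to the neighbor loader.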

## Split between data used for GNN training / test data 

In [None]:
from sklearn.model_selection import train_test_split

SEED = 7
VALID_SIZE = 0.15
preprocessing_cat_features = "label_encoding"

X = handle_cat_features(X=X, preprocessing_cat_features=preprocessing_cat_features)

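# Hold out the test set (VALID_SIZE is used here as the test fraction);
# the remaining "model" data serves for training, early stopping & model selection
# "stratify=Y" to keep the same proportion of target classes in model (i.e. train/valid) and test sets 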
X_model, X_test, Y_model, Y_test = train_test_split(
    X, Y, test_size=VALID_SIZE, random_state=SEED, stratify=Y
)

## Transformation of model / test data into graphs with the same attributes

First, shape the data used for GNN training in a graph.

In [None]:
# compute edge by hands: create our own edge combination, to predict the income - with directed paths
# first edge joins "occupation" -> "hours-per-week"
# second edge joins "sex" -> "education"
X_total = X_model
Y_total = Y_model

list_col_names=["occupation", "hours-per-week"] # test the model with only 2 categories (> or < median of work hours)

edges_total = add_new_edge(data=X_total, previous_edge=None, list_col_names=list_col_names)
#edges_total = add_new_edge(data=X_total, previous_edge=edges_total, list_col_names=["sex","education"])

# for training by specifying "masks" (i.e. boolean for nodes = individuals selected to train the GNN), 
# add a specification on train indexes 
data_total = table_to_graph(X=X_total, Y=Y_total, list_col_names=list_col_names, edges=edges_total)

Do exactly the same for test data (will be used for GNN test evaluation):

In [None]:
list_col_names=["occupation", "hours-per-week"] # test the model with only 2 categories (> or < median of work hours)

edges_test = add_new_edge(data=X_test, previous_edge=None, list_col_names=list_col_names)
#edges_test = add_new_edge(data=X_test, previous_edge=edges_test, list_col_names=["sex","education"])

# build the test graph with the same attributes as the graph used for training
data_test = table_to_graph(X=X_test, Y=Y_test, list_col_names=list_col_names, edges=edges_test)

# Train a basic Graph Neural Network on the graph-shaped data

## Train with batches (neighborhood sampling) a basic GCN 

Here, we train the GNN on batches built from neighborhoods, using our GPU (if available):

In [None]:
from classif_basic.graph import train_GNN

gnn_basic = train_GNN(
                data_total=data_total,
                loader_method="neighbor_nodes",
                batch_size = 32,
                epoch_nb = 2,
                learning_rate = 0.01,
                nb_neighbors_per_sample = 30,
                nb_iterations_per_neighbors = 2)
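
To give an intuition of neighborhood sampling, here is a minimal plain-Python sketch on a toy graph. It assumes that `nb_neighbors_per_sample` is the maximum number of neighbors drawn per node and that `nb_iterations_per_neighbors` is the number of hops; this reading of the `train_GNN` parameters is an assumption, and the helper `sample_neighborhood` is hypothetical.

```python
import random


def sample_neighborhood(adjacency, seed_nodes, nb_neighbors, nb_hops, rng):
    """Collect the nodes reached from `seed_nodes` by sampling at most
    `nb_neighbors` neighbors per node, repeated for `nb_hops` hops."""
    visited = set(seed_nodes)
    frontier = list(seed_nodes)
    for _ in range(nb_hops):
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            k = min(nb_neighbors, len(neighbors))
            for neighbor in rng.sample(neighbors, k):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited


# Toy graph (node -> list of neighbors), made up for illustration
adjacency = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 5], 4: [1], 5: [3]}
rng = random.Random(7)
batch_nodes = sample_neighborhood(adjacency, seed_nodes=[0], nb_neighbors=2, nb_hops=2, rng=rng)
print(sorted(batch_nodes))
```

Capping the number of sampled neighbors bounds the size of each batch, at the cost of seeing only part of each neighborhood at every step.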

## Inspect the predictions of the model on valid and test sets

Let's evaluate the model on test data, to check that the stability of the performance is not a coincidence:

In [None]:
from classif_basic.graph import evaluate_gnn

classifier=gnn_basic
loss_name="cross_entropy"

# unfortunately, memory error... Evaluate per batches? Or create an independent data_test? // Evaluate on valid 

evaluate_gnn(
    classifier=gnn_basic, 
    data_test=data_test, 
    loss_name=loss_name)

**No Overfitting**

With this very simple graph shape (directed edge = "job" -> "work hours"), the accuracy remains at 75% on train, valid and test data.

This confirms that training a basic GNN on simply shaped data delivers stable results here.
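
A stable accuracy is easier to interpret next to a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. The sketch below shows this complementary check on made-up labels; both helpers are hypothetical and independent of `evaluate_gnn`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions equal to the true label."""
    correct = sum(int(y == p) for y, p in zip(labels, predictions))
    return correct / len(labels)


# Toy labels and predictions, made up for illustration (1 = income above 50K)
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0]
print(majority_baseline_accuracy(y_true))  # 0.75
print(accuracy(y_true, y_pred))            # 0.875
```

Comparing both numbers on each split indicates how much the model adds beyond the class proportions.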

# Visual Representation of the Graph
Here, we look for a visual representation of the (directed acyclic?) graph. The goal is to check whether it matches the users' intuition, at least regarding the nonsensical causal paths. 

Here, the edges have been built with the directed path **sex -> education** (recall that the link [potentially] exists, because we deliberately biased the data to be "sexist" regarding the distribution of incomes). Hence, the nonsense we do not want to find is an impact of education on sex. 

Obviously, we have no clear intuition of what these links correspond to... For each individual, a path from sex to income? But there are more groups than individuals selected here (10)...

## Constitute a graph - Try to connect the features 

Here, we proceed in 2 steps (back and forth)

1. **Detect the relations**
We use partial dependence plots to inspect the correlations. (Personal note) Is this sufficient? Should we also inspect the changes under input interventions?

1. **Select the causal direction**
Based on the user's experience and expertise (e.g. sex -> education, because the reverse would be logically and temporally impossible).
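
Once directions have been chosen, one simple consistency check is that the feature-level graph contains no cycle. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the helper `causal_order` and the list of edges are hypothetical examples, not the notebook's actual graph.

```python
from graphlib import CycleError, TopologicalSorter


def causal_order(directed_edges):
    """Return the features in an order compatible with the edges (cause, effect),
    or raise CycleError if the chosen directions contradict each other."""
    predecessors = {}
    for cause, effect in directed_edges:
        predecessors.setdefault(effect, set()).add(cause)
        predecessors.setdefault(cause, set())
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical expert choices, made up for illustration
edges = [("sex", "education"), ("education", "occupation"), ("occupation", "hours-per-week")]
print(causal_order(edges))
# ['sex', 'education', 'occupation', 'hours-per-week']

try:
    causal_order(edges + [("occupation", "education")])
except CycleError:
    print("contradictory directions: the graph contains a cycle")
```

A cycle means that at least one pair of features has been given both directions, directly or through intermediate features, which is the kind of nonsense the expert review is meant to exclude.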

At first sight, we look at correlated features (!), but there may be some hidden correlations => experience is still required at this stage: