In this notebook, we inspect **how a tabular dataset such as Census can be used by a graph-based AI to estimate the wealth of individuals**. 

Therefore, we proceed in 3 steps:

**1. We prepare data to be handled by a model based on a graph**
We transform the table into a graph, which involves strong assumptions about the features used to connect individuals...

**2. We train an AI based on graphs**
Here, we begin with a Graph Neural Network (GNN) based on a Multi-Layer Perceptron (MLP), which requires the Torch library.

**3. We inspect whether the graph-based AI indeed reflects common and expert knowledge**
In particular, we look for nonsensical inferences that should absolutely be avoided (e.g. education may influence occupation, but not the reverse).

In [None]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

# Data preparation for binary classification with graphs (Census)
To reshape data tables into graphs (and to interpret them, see the choice of edges below), we relied on a basic Google [colab](https://colab.research.google.com/drive/1_eR7DXBF3V4EwH946dDPOxeclDBeKNMD?usp=sharing#scrollTo=WuggdIItffpv).

## General preparation - handle categorical features
Here, we handle the categorical features through label-encoding. 
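
As an illustration of label-encoding, here is a minimal sketch on a made-up table. It is not the implementation of `handle_cat_features` (whose internals are not shown in this notebook): the helper `label_encode` and the toy rows are hypothetical, and the sketch simply relies on pandas categorical codes.

```python
import pandas as pd

# Toy table: column names mimic Census, but the rows are made up for illustration
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "age": [39, 50, 38, 53],
})

def label_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace each non-numeric column by integer codes."""
    encoded = df.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        # categories are sorted, then numbered from 0
        encoded[col] = encoded[col].astype("category").cat.codes
    return encoded

toy_encoded = label_encode(toy)
print(toy_encoded)
```

Numeric columns such as `age` are left untouched. Note that the integer codes carry an arbitrary order, which a model may wrongly interpret as a ranking.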

torch_geometric (which provides both our table-to-graph transformation and the GNN) requires torch-scatter and torch-sparse. As these do not seem compatible with GPU when installed through poetry, we use a [trick](https://stackoverflow.com/questions/74823704/error-building-wheel-for-torch-sparse-error-installing-pytorch-geometric) to install them from the notebook with pip (to be cleaned):

In [None]:
import torch
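try:
    import torch_geometric
except ModuleNotFoundError:
    # build the wheel index URL matching the installed torch / CUDA versions
    TORCH = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds
    CUDA = "cu" + torch.version.cuda.replace(".","") if torch.version.cuda else "cpu"
    # install only when the import failed, so that TORCH and CUDA are always defined here
    !pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    !pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
    #!pip install torch-geometric
    #import torch_geometric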

In [None]:
import sys
sys.path.append("../")

import time
from sklearn import datasets

from sklearn.preprocessing import LabelEncoder

import torch
from torch_geometric.data import Data

import itertools
import numpy as np
import pandas as pd

from classif_basic.data_preparation import train_valid_test_split, set_target_if_feature, automatic_preprocessing
from classif_basic.data_preparation import handle_cat_features

from classif_basic.graph import table_to_graph, add_new_edge

### Prepare data

In [None]:
# preparing the dataset on clients for binary classification
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, as_frame=True)

t0 = time.time()

X = data.data
Y = (data.target == '>50K') * 1

### Add pre-processing: split hours-per-week into 2 groups at the median, to use it as an edge (combined with "occupation")

In [None]:
X["hours-per-week"].value_counts().plot()

In [None]:
median_hours = X["hours-per-week"].median()

# '1' if the client works over the median (40 hours per week), '0' otherwise
X["hours-per-week"] = (X["hours-per-week"] > median_hours).astype(int)
X["hours-per-week"]

### Train-test-split, to prepare for 3 graphs representing data

In [None]:
model_task = "classification"
preprocessing_cat_features = "label_encoding"

X_train, X_valid, X_train_valid, X_test, Y_train, Y_valid, Y_train_valid, Y_test = train_valid_test_split(
    X=X,
    Y=Y, 
    model_task=model_task,
    preprocessing_cat_features=preprocessing_cat_features)

## Reshape (by interpreting) data to a graph

From this dataset (where we selectively introduced a "sexist" effect against women), let's see how we could switch from the tabular data to a graph representation.

The point is that our features X all seem to be attributes of the clients, whereas we need a way to represent the interactions between clients.

X = {race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, occupation, hours per week, workclass, capital gain, capital loss, native country} 

**Nodes** 
Bank clients (by ID)

**Edges** 
Here, we should find one or several ways of connecting the clients

Should be occupation → if a client changes occupation (or a similar client has a new occupation), what is the impact on the revenue? // change of football team => impact on the football rate 
(pers) actionable => predict the revenue when a client switches to a new job?
→ may be: “hours per week” <=> inspect the change of revenue if a client switches to more hours per week?
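
To make the idea of "connecting the clients" concrete, here is a minimal sketch that links every pair of clients sharing the same values on the chosen columns. It is an assumption about one possible construction, not the implementation of `add_new_edge`; the helper `edges_from_shared_values` and the toy rows are hypothetical.

```python
import itertools

import numpy as np
import pandas as pd

def edges_from_shared_values(df: pd.DataFrame, cols: list) -> np.ndarray:
    """Hypothetical helper: (2, n_edges) array linking every ordered pair
    of rows that share the same values on `cols`."""
    sources, targets = [], []
    # after reset_index, the group labels are the row positions (= node ids)
    groups = df.reset_index(drop=True).groupby(cols).groups
    for positions in groups.values():
        # ordered pairs: each connection is stored in both directions
        for i, j in itertools.permutations(positions, 2):
            sources.append(i)
            targets.append(j)
    return np.array([sources, targets], dtype=np.int64)

# Made-up clients: hours-per-week is already binarized
toy = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Tech", "Sales"],
    "hours-per-week": [1, 1, 0, 0],
})

edge_index = edges_from_shared_values(toy, ["occupation", "hours-per-week"])
edge_index_job = edges_from_shared_values(toy, ["occupation"])
print(edge_index)
print(edge_index_job.shape)
```

With this construction, a group of n clients produces n(n-1) edges, so coarse features shared by many clients can quickly lead to millions of edges.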

**Node Features** 
Attributes of the nodes, i.e. characteristics of the clients (here, hard to separate from what "connects" them...) 

Race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, hours per week, workclass, capital gain, capital loss, native country 

**Label (here at a node-level?)** 
Income (Y = income > $50 000)

We will no longer need the external train/valid/test split, as we will use train/valid/test masks instead => create a function returning the 3 masks, with X_total and Y_total label-encoded? 
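
A function returning the 3 masks could look like the sketch below. It assumes that the split preserves the original row index (as the index-based mapping used for `train_mask` in this notebook also does); the helper `split_masks` and the client ids are hypothetical.

```python
import pandas as pd

def split_masks(full_index, train_index, valid_index, test_index):
    """Hypothetical helper: one boolean array per split, aligned with the full table."""
    full_index = pd.Index(full_index)
    return {
        "train": full_index.isin(train_index),
        "valid": full_index.isin(valid_index),
        "test": full_index.isin(test_index),
    }

# Made-up client ids standing in for X.index, X_train.index, X_valid.index, X_test.index
masks = split_masks(
    full_index=[10, 11, 12, 13, 14],
    train_index=[10, 12],
    valid_index=[13],
    test_index=[11, 14],
)
print(masks["train"])
```

Each array can then be converted with `torch.tensor(mask)`, as the notebook does for `train_mask`.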

# Train a basic Graph Neural Network on the graph-shaped data

Here are attempts to train the model with batches...

Beta version (DataLoader was not made for graphs), to delete once our current functions are confirmed to work...

Test of my idea: create graphs with different edges, here sex (graph 1) -> education (graph 2)?

Or enforce causal hierarchy through the neighborhood definition?

Since PyTorch Geometric uses it to create batches of neighbors, we recreate data.train_mask (i.e. a boolean tensor indicating whether each individual is in X_train) to pass it as "input_nodes":

In [None]:
X_copy = X.copy()
X_train_mask = X_train.copy()
X_train_mask["train_mask"] = 1
dict_train_mask = X_train_mask['train_mask'].to_dict()

X_copy['train_mask'] = X_copy.index.map(dict_train_mask)

X_copy['train_mask'] = X_copy['train_mask'].fillna(0)
X_copy['train_mask'] = X_copy['train_mask'].astype(bool)

X_copy['train_mask'] # TODO create the same mask to get data_valid and data_test
train_mask = torch.tensor(X_copy['train_mask'].values)

# TODO create the same mask to get Y_train values on Y_total, and then apply a proper loss 
# https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html

In [None]:
# compute edges by hand: create our own edge combination, to predict the income - with directed paths
# first edge joins "occupation" -> "hours-per-week"
# second edge joins "sex" -> "education"
X_total = handle_cat_features(X=X, preprocessing_cat_features="label_encoding")
Y_total = Y.copy()

list_col_names=["occupation", "hours-per-week"] # test the model with only 2 categories (> or < median of work hours)

edges_total = add_new_edge(data=X_total, previous_edge=None, list_col_names=["occupation", "hours-per-week"])
#edges_total = add_new_edge(data=X_total, previous_edge=edges_total, list_col_names=["sex","education"])

# for training by specifying "masks" (i.e. boolean for nodes = individuals selected to train the GNN), 
# add a specification on train indexes 
data_total = table_to_graph(X=X_total, Y=Y_total, list_col_names=list_col_names, edges=edges_total,
                           train_mask=train_mask)

In [None]:
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data_total,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training nodes
    batch_size=128,
    input_nodes=data_total.train_mask,
)

sampled_data = next(iter(loader))
print(sampled_data.batch_size)

In [None]:
for i, data in enumerate(loader):
    print(f"{i} : \n {data} \n\n")

Here, we use the neighborhood-based batches to train the GNN, on our GPU if one is available:

In [None]:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data_total.num_node_features, 16)
        self.conv2 = GCNConv(16, data_total.num_classes)

    def forward(self, x, edge_index):

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)

In [None]:
epoch_nb = 100
learning_rate = 0.001

t_basic_1 = time.time()

# activate and signal the use of GPU for faster processing
if torch.cuda.is_available():    
    print("Using GPU!")    
    device = torch.device("cuda")
    # torch.set_default_tensor_type('torch.cuda.FloatTensor')   
else:    
    print("Using CPU!")       
    device = torch.device("cpu")

# initialize the structure of the classifier, and prepare for GNN training (with GPU)
classifier = GCN().to(device) # classifier = GCN(data_total).to(device)

# classifier = GraphClassifier(v_in=71, e_in=6, v_g=30, e_g=6, v_out=200, mc_out=22, i_types=11).float().to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate)
loss = torch.nn.CrossEntropyLoss()

print('starting training')
classifier.train()

for epoch in range(epoch_nb):
    
    epoch_loss = 0
    total = 0
    correct = 0
    classifier.train()
    for i, batch in enumerate(loader):
        optimizer.zero_grad()
        batch = batch.to(device)
        preds = classifier(x=batch.x.float(), edge_index=batch.edge_index)
        # NeighborLoader places the seed (training) nodes first in each batch:
        # score only these, as the sampled neighbors may lie outside the train mask
        preds = preds[:batch.batch_size]
        target = batch.y[:batch.batch_size]
        err = loss(preds, target)
        preds_temp = preds.argmax(dim=1)
        total += len(target)
        correct += (preds_temp == target).sum().item()
        epoch_loss += err.item()
        err.backward()
        optimizer.step()
    print(f'Epoch {epoch + 1} Loss = {epoch_loss/(i+1)} Train Accuracy = {correct / total}') 

t_basic_2 = time.time()            
# batch_size is not defined yet at this point: read it from the sampled batch
print(f"Training of the basic GCN on Census with batches of {sampled_data.batch_size} nodes and {epoch_nb} epochs took {(t_basic_2 - t_basic_1)/60} mn")

In [None]:
# reuse t_basic_2 from the training cell, so that idle time between cells is not counted
print(f"Training of the basic GCN on Census on {data_total.x.shape[0]} nodes and {data_total.edge_index.shape[1]} edges, \n with batches of {sampled_data.batch_size} nodes and {epoch_nb} epochs took {(t_basic_2 - t_basic_1)/60} mn")

## Build a basic convolutional GNN with torch

In [None]:
# this model follows the quick "introduction by example" of GCN by torch
# in 'https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html'

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, data):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, data.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)

Here, we try to reduce GPU memory usage by training in batches (built with torch.utils.data.DataLoader):

In [None]:
import gc

EPOCH = 50
batch_size = 100

t_basic_1 = time.time()

# activate and signal the use of GPU for faster processing
if torch.cuda.is_available():    
    print("Using GPU!")    
    device = torch.device("cuda")
    torch.set_default_tensor_type('torch.cuda.FloatTensor')   
else:    
    print("Using CPU!")       
    device = torch.device("cpu")

# NOTE: data_train is assumed to be a graph built beforehand with table_to_graph on the train split
model = GCN(data=data_train).to(device)
data_train = data_train.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.double()

model.train()

train_loader = torch.utils.data.DataLoader(dataset=data_train,
                                           batch_size=batch_size,
                                           shuffle=True,
                                           # the generator follows the selected device (CPU or GPU)
                                           generator=torch.Generator(device=device))

step = 0
for epoch in range(EPOCH):
    for data_train_batch in train_loader:   # gives batch data
        step += 1
        print(f"\n step {step} \n")
        optimizer.zero_grad()
        # Compute prediction and loss
        out = model(data_train_batch)
        loss = F.nll_loss(out, data_train_batch.y)
        loss.backward()
        optimizer.step()  # update the weights, otherwise the model never learns
        # free memory between batches
        del data_train_batch
        gc.collect()
        torch.cuda.empty_cache()

t_basic_2 = time.time()
        
print(f"Training of the basic GCN on Census with batches of {batch_size} and {EPOCH} epochs took {(t_basic_2 - t_basic_1)/60} mn")

Finally, we can evaluate our model on the validation nodes. Linking the clients only through the job yields less than 70% accuracy, even on the train set. Therefore, we need to look for other ways of connecting them...

By creating an edge only from the combination of sex and education, we observe an accuracy of 61% on train that does not drop on valid (65%). Moreover, **when the graph is directed (sex -> education), the accuracy seems to increase** without degrading valid performance: + 11% on train (76%), +2% on valid (67%), and 70% on test.  

These results were obtained by training the GCN with 200 batches, which however took 20 mn for 15_000 rows and 2 classes (and, admittedly, edge_index=[2, 10813909]).

**Other observations (tests of combinations of features as edges)**

Having created our own edge index combining (sex&education) and (occupation), the training took 7 mn more (27 mn) but the performance did not improve (61% on train set...).

Adding the combination (occupation -> hours-per-week) to (sex -> education) does not improve the performance; it even decreases it (60 +-2 % on train and valid). Maybe because (i) it makes the network too complex, or because (ii) the model or (iii) its hyperparameters (batch, layers...) are too simple to capture these relations?
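
To put such accuracies in context, they can be compared with the score obtained by always predicting the most frequent class. The sketch below shows this comparison on made-up labels; the helper `majority_baseline_accuracy` is hypothetical, and in the notebook it would be applied to `Y_train`, `Y_valid` or `Y_test`.

```python
import pandas as pd

def majority_baseline_accuracy(y: pd.Series) -> float:
    """Hypothetical helper: accuracy of always predicting the most frequent class."""
    return float(y.value_counts(normalize=True).max())

# Made-up labels: three individuals below the income threshold, one above
toy_y = pd.Series([0, 0, 0, 1])
baseline = majority_baseline_accuracy(toy_y)
print(f"Majority-class baseline: {baseline:.2f}")
```

A model whose accuracy stays below this baseline has not learned more than the class imbalance.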

**Constitution of couples graph data - graph networks to be tested, with input intervention changes...**

In [None]:
model.double()

model.eval()  # disable dropout before predicting

with torch.no_grad():
    pred_train = model(data_train).argmax(dim=1)
nb_indivs_train = data_train.x.shape[0]

correct_train = (pred_train == data_train.y).sum()
acc = int(correct_train) / nb_indivs_train
print(f'Accuracy on train data: {acc:.4f}')

In [None]:
data_valid = data_valid.to(device) # set data_valid in a GPU-compatible format

model.eval()  # disable dropout before predicting

with torch.no_grad():
    pred_valid = model(data_valid).argmax(dim=1)
nb_indivs_valid = data_valid.x.shape[0]

correct_valid = (pred_valid == data_valid.y).sum()
acc = int(correct_valid) / nb_indivs_valid
print(f'Accuracy on valid data: {acc:.4f}')

Let's inspect the model on test data, to check that the stability of performance is not a coincidence:

In [None]:
data_test = data_test.to(device) # set data_test in a GPU-compatible format

model.eval()  # disable dropout before predicting

with torch.no_grad():
    pred_test = model(data_test).argmax(dim=1)
nb_indivs_test = data_test.x.shape[0]

correct_test = (pred_test == data_test.y).sum()
acc = int(correct_test) / nb_indivs_test
print(f'Accuracy on test data: {acc:.4f}')

# Visual Representation of the Graph
Here, we look for a visual representation of the (directed acyclic?) graph. The goal is to check whether it matches the users' intuition, at least regarding the nonsensical causal paths. 

Here, the edges have been built with the directed path **sex -> education** (recall that the link [potentially] exists, because we deliberately biased the data to be "sexist" regarding the distribution of incomes). Hence, the nonsense we do not want to find is an impact of education on sex. 
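
Whether a directed graph is acyclic, and whether it contains a forbidden edge, can be checked programmatically with networkx. The sketch below works on a small feature-level graph whose edges are illustrative assumptions (they are not extracted from the trained model).

```python
import networkx as nx

# Illustrative causal sketch between features (assumed edges, not learned ones)
causal = nx.DiGraph([("sex", "education"), ("education", "occupation")])

is_dag = nx.is_directed_acyclic_graph(causal)
has_nonsense_edge = causal.has_edge("education", "sex")
print(f"acyclic: {is_dag}, education -> sex present: {has_nonsense_edge}")

# Adding the reverse edge creates a cycle, so the graph stops being a DAG
with_reverse = causal.copy()
with_reverse.add_edge("education", "sex")
reverse_is_dag = nx.is_directed_acyclic_graph(with_reverse)
print(f"acyclic after adding education -> sex: {reverse_is_dag}")
```

The same two calls can be applied to the output of `to_networkx`, although cycles are expected there as soon as clients are connected in both directions.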

In [None]:
import networkx as nx

from torch_geometric.utils import to_networkx
import matplotlib.pyplot as plt

In [None]:
network_valid = to_networkx(data=data_valid)

# subax1 = plt.subplot(121)

# graph 
nx.draw(network_valid, with_labels=True, font_weight='bold')

In [None]:
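list_col_names = ["occupation", "hours-per-week"]#, "sex","education"]

# build the edges on the validation rows, in the same way as for the full data
edges_valid = add_new_edge(data=X_valid, previous_edge=None, list_col_names=list_col_names)

data_job_valid = table_to_graph(X=X_valid, Y=Y_valid, list_col_names=list_col_names, edges=edges_valid)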

network_job_valid = to_networkx(data=data_job_valid)
nx.draw(network_job_valid, with_labels=True, font_weight='bold')

In [None]:
data_valid.x.shape[0]

In [None]:
# create a representation of the edge ("sex -> education") 
# restricted to 10 individuals with the highest education value (the lowest value is left commented out)

X_valid.reset_index(drop=True, inplace=True)
Y_valid.reset_index(drop=True, inplace=True)

df_education_max = X_valid.loc[X_valid["education"]==X_valid["education"].max()].iloc[:10]
#df_education_min = X_valid.loc[X_valid["education"]==X_valid["education"].min()].iloc[:10]

X_education_extreme = df_education_max#.append(df_education_min).sort_index()
Y_education_extreme = Y_valid.iloc[X_education_extreme.index]

In [None]:
# here, gain a representation with only 10 individuals 

t_graph_0 = time.time()

list_col_names = ["sex", "education"]

edges_sex_valid = add_new_edge(data=X_education_extreme, previous_edge=None, list_col_names=list_col_names)

data_sex_valid = table_to_graph(X=X_education_extreme, Y=Y_education_extreme, list_col_names=list_col_names, 
                                edges=edges_sex_valid)

# draw the small graph built just above (not the full occupation graph)
network_sex_valid = to_networkx(data=data_sex_valid)
nx.draw(network_sex_valid, with_labels=True, font_weight='bold')

t_graph_1 = time.time()

print(f"Plotting the graph with {data_sex_valid.x.shape[0]} individuals took {(t_graph_1 - t_graph_0)/60} mn")

In [None]:
# here, gain a representation with only 10 individuals (and only 'sex' as edge)

t_graph_0 = time.time()

list_col_names = ["sex"]

edges_sex_valid = add_new_edge(data=X_education_extreme, previous_edge=None, list_col_names=list_col_names)

data_sex_valid = table_to_graph(X=X_education_extreme, Y=Y_education_extreme, list_col_names=list_col_names, 
                                edges=edges_sex_valid)

# draw the small graph built just above (not the full occupation graph)
network_sex_valid = to_networkx(data=data_sex_valid)
nx.draw(network_sex_valid, with_labels=True, font_weight='bold')

t_graph_1 = time.time()

print(f"Plotting the graph with {data_sex_valid.x.shape[0]} individuals took {(t_graph_1 - t_graph_0)/60} mn")

In [None]:
# here, gain a representation with only 10 individuals (and only 'sex' as edge)

t_graph_0 = time.time()

list_col_names = ["sex"]

edges_sex_valid = add_new_edge(data=X_education_extreme, previous_edge=None, list_col_names=list_col_names)

data_sex_valid = table_to_graph(X=X_education_extreme, Y=Y_education_extreme, list_col_names=list_col_names, 
                                edges=edges_sex_valid)

# draw the small graph built just above (not the full occupation graph)
network_sex_valid = to_networkx(data=data_sex_valid)
nx.draw(network_sex_valid)

t_graph_1 = time.time()

print(f"Plotting the graph with {data_sex_valid.x.shape[0]} individuals took {(t_graph_1 - t_graph_0)/60} mn")

In [None]:
# here, gain a representation with only 10 individuals (and only 'sex' as edge)

t_graph_0 = time.time()

list_col_names = ['capital-gain', 'capital-loss',
       'hours-per-week', 'workclass', 'education', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'native-country',
       'clients_id'] # to take only the likely 'relevant' features 'age', 'fnlwgt', 'education-num' as node

edges_sex_valid = add_new_edge(data=X_education_extreme, previous_edge=None, list_col_names=['sex'])

data_sex_valid = table_to_graph(X=X_education_extreme, Y=Y_education_extreme, list_col_names=list_col_names, 
                                edges=edges_sex_valid)

# draw the small graph built just above (not the full occupation graph)
network_sex_valid = to_networkx(data=data_sex_valid)
nx.draw(network_sex_valid)

t_graph_1 = time.time()

print(f"Plotting the graph with {data_sex_valid.x.shape[0]} individuals took {(t_graph_1 - t_graph_0)/60} mn")

In [None]:
# here, gain a representation with only 10 individuals (and only 'sex' as edge)

t_graph_0 = time.time()

list_col_names = ['age', 'fnlwgt', 'capital-gain', 'capital-loss',
       'hours-per-week', 'workclass', 'education', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'native-country',
       'clients_id'] # to take only the likely 'relevant' feature 'education-num' as node

edges_sex_valid = add_new_edge(data=X_education_extreme, previous_edge=None, list_col_names=['sex'])

data_sex_valid = table_to_graph(X=X_education_extreme, Y=Y_education_extreme, list_col_names=list_col_names, 
                                edges=edges_sex_valid)

# draw the small graph built just above (not the full occupation graph)
network_sex_valid = to_networkx(data=data_sex_valid)
nx.draw(network_sex_valid)

t_graph_1 = time.time()

print(f"Plotting the graph with {data_sex_valid.x.shape[0]} individuals took {(t_graph_1 - t_graph_0)/60} mn")

Obviously, we have no clear intuition of what these links correspond to... For each individual, a path from the sex to the income? But there are more groups than the individuals selected here (10)...

## Constitute a graph - Try to connect the features 

Here, we proceed in 2 steps (back and forth)

1. **Detect the relations**
We use the partial dependence plots to inspect the correlations. (pers) Is this sufficient? Input intervention changes?

2. **Select the causal direction**
Based on the user's experience and expertise (e.g. sex -> education, because the reverse would be logically and temporally impossible)
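
One simple way to encode such expertise is an ordered list of features in which a feature may only influence the features listed after it. The sketch below orients a related pair according to that list; the ordering `CAUSAL_ORDER` and the helper `orient` are hypothetical and would have to be supplied by the expert.

```python
# Hypothetical expert ordering: a feature can only influence features listed after it
CAUSAL_ORDER = ["sex", "education", "occupation", "hours-per-week"]

def orient(feature_a: str, feature_b: str, order=CAUSAL_ORDER) -> tuple:
    """Hypothetical helper: return the pair as (cause, effect) following the expert ordering."""
    rank = {name: position for position, name in enumerate(order)}
    # the feature that comes first in the ordering is taken as the cause
    return tuple(sorted((feature_a, feature_b), key=rank.__getitem__))

directed_pair = orient("education", "sex")
print(directed_pair)
```

A pair detected in step 1 is thus never oriented in the "impossible" direction, whatever the order in which its two features were found.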

At first sight, we look at correlated features (!) there may be some hidden correlations => experience is still required at this stage:

In [None]:
# reconstitute the dataset to check the correlations

data_train_valid = X_train_valid.copy()
data_train_valid['target'] = Y_train_valid
data_train_valid

In [None]:
# keep numeric columns only: they define both the correlation matrix and the tick labels
numeric_cols = data_train_valid.select_dtypes(['number'])

f = plt.figure(figsize=(19, 15))
plt.matshow(numeric_cols.corr(), fignum=f.number)
plt.xticks(range(numeric_cols.shape[1]), numeric_cols.columns, fontsize=14, rotation=45)
plt.yticks(range(numeric_cols.shape[1]), numeric_cols.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)

In [None]:
import seaborn as sns

sns.set(style="ticks", color_codes=True)    
g = sns.pairplot(X_train_valid.filter(items=['education-num','education']))
plt.show()

In [None]:
g = sns.pairplot(X_train_valid.filter(items=['sex','age']))
plt.show()

In [None]:
from sklearn.inspection import PartialDependenceDisplay

# detect the relations: show the changes in predictions for the combinations of 2 features
fig, ax = plt.subplots(figsize=(8, 6))
f_names = [('sex', 'education')]
# Similar to a single-feature PDP plot, except that we pass a tuple of features
# NOTE: from_estimator expects a fitted scikit-learn estimator, so `model` must be (wrapped as) one
disp4 = PartialDependenceDisplay.from_estimator(model, X_valid, f_names, ax=ax)
plt.show()
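
Partial dependence in scikit-learn is computed on a fitted scikit-learn estimator, so a PyTorch module would need a wrapper. As a minimal sketch of the two-feature computation, the example below uses a gradient-boosting classifier as a hypothetical stand-in, on synthetic data (not Census), and calls `partial_dependence` directly instead of plotting.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# Synthetic data: the label depends on the first two features only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Hypothetical stand-in estimator (the notebook's GCN is not a scikit-learn estimator)
stand_in = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Average model response over a 5 x 5 grid of values for features 0 and 1
pd_result = partial_dependence(
    stand_in, X_toy, features=[0, 1], kind="average", grid_resolution=5
)
surface = pd_result["average"][0]
print(surface.shape)
```

Each cell of `surface` is the average response of the model when features 0 and 1 are forced to the corresponding grid values, which is what the two-feature plot displays as a contour.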