# Get a complete representation of connections in Census (with TabNN)?

In this notebook, we inspect **how a tabular dataset such as Census can be used by a graph-based AI to estimate the wealth of individuals**. 

We proceed in 3 steps:

**1. We prepare data to be handled by a model based on a graph**
We transform the data into a graph, which involves strong assumptions about the features used as connections...

**2. We train an AI based on graphs**
Here, we begin with a Graph Neural Network (GNN) based on a Multi-Layer Perceptron (MLP), which requires the Torch library.

**3. We inspect whether the graph-based AI indeed reflects common & expert knowledge**
In particular, certain nonsensical inferences should absolutely be avoided (e.g. education may influence occupation, but not the reverse).

In [None]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

# Data preparation for binary classification with graphs (Census)
For this reshaping of data tables into graphs (and for its interpretation, see the choice of edges below), we relied on the [Google tutorial](https://colab.research.google.com/drive/1_eR7DXBF3V4EwH946dDPOxeclDBeKNMD?usp=sharing#scrollTo=WuggdIItffpv).

## General preparation - handle categorical features
Here, we handle the categorical features through label-encoding. 
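
As a reminder of what label-encoding does, here is a minimal sketch in plain Python. The helper `label_encode` and the toy column are hypothetical (they are not part of `classif_basic`); the sketch only mirrors the usual convention of assigning integer codes to categories in sorted order.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    codes = [mapping[v] for v in values]
    return codes, mapping

# Toy column with made-up values (not taken from the Census data)
workclass = ["Private", "State-gov", "Private", "Self-emp"]
codes, mapping = label_encode(workclass)
print(codes)    # [0, 2, 0, 1]
print(mapping)  # {'Private': 0, 'Self-emp': 1, 'State-gov': 2}
```

Note that the integer codes carry an arbitrary order (here alphabetical), which a model may interpret as a ranking between categories.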

In [None]:
import sys
sys.path.append("../")

import time
from sklearn import datasets

from sklearn.preprocessing import LabelEncoder

import torch
from torch_geometric.data import Data

import tensorflow as tf

import itertools
import numpy as np
import pandas as pd

from classif_basic.data_preparation import train_valid_test_split, set_target_if_feature, automatic_preprocessing

### Prepare data

We fix the exact proportions of the population by sex (Male, Female) and the share of wealthy individuals within each sex. That way, we can inspect whether the structure of the model (here based on a graph) integrates this "sexist" representation of the world. 

In [None]:
# preparing the dataset on clients for binary classification
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, as_frame=True)

t0 = time.time()

X = data.data
Y = (data.target == '>50K') * 1

For the moment, we exclude the 'sexist' bias to inspect how the data are linked, and further describe them through a (general, high-level) causal model...

### Train-test-split, to prepare for 3 graphs representing data

In [None]:
model_task = "classification"
preprocessing_cat_features = "label_encoding"

X_train, X_valid, X_train_valid, X_test, Y_train, Y_valid, Y_train_valid, Y_test = train_valid_test_split(
    X=X,
    Y=Y, 
    model_task=model_task,
    preprocessing_cat_features=preprocessing_cat_features)

## Reshape (by interpreting) data to a graph

From this dataset (where we selectively introduced a "sexist" effect against women), let's see how we could switch from the tabular data to a graph representation. Here, we relied on the [Google tutorial](https://colab.research.google.com/drive/1_eR7DXBF3V4EwH946dDPOxeclDBeKNMD?usp=sharing#scrollTo=Mw8dzPy3-UnJ) to switch from tables to a graph. 

The point is that our features X all seem to be attributes of the clients, whereas we need a way of representing the interactions between clients. 

X = {race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, occupation, hours per week, workclass, capital gain, capital loss, native country} 

**Nodes** 
Bank clients (by ID)

**Edges** 
Here, we should find one or several ways of connecting the clients.

A first candidate is occupation → if a client changes occupation (or a similar client has a new occupation), what is the impact on the revenue? // as with a change of football team => impact on the football rate 
(pers) actionable => predict revenue when a client switches to a new job?
→ Another candidate may be “hours per week” <=> inspect the change in revenue if a client switches to more hours per week?

**Node Features** 
Attributes of the nodes, i.e. characteristics of the clients (here, hard to separate from what "connects" them...) 

Race, age, sex, final weight (depends on age, sex, hispanic origin, race), education, education number, marital status, relationship, hours per week, workclass, capital gain, capital loss, native country 

**Label (here at a node-level?)** 
Income (Y = income > $50 000)
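
To make the four ingredients above concrete, here is a minimal sketch on a made-up table of four clients. All values and variable names are hypothetical; only NumPy is used, so the arrays would still need converting to torch tensors before being passed to PyG.

```python
import itertools
import numpy as np

# Hypothetical toy table: one row per client (values are made up for illustration)
occupation = np.array(["Sales", "Tech", "Sales", "Sales"])  # used to build edges
age = np.array([39, 50, 38, 53])                            # node feature
hours = np.array([40, 13, 40, 45])                          # node feature
income_gt_50k = np.array([0, 1, 0, 1])                      # node label

# Node features: shape [num_nodes, num_features]
x = np.column_stack([age, hours])

# Edges: every pair of clients sharing the same occupation
pairs = []
for job in np.unique(occupation):
    members = np.flatnonzero(occupation == job)
    pairs.extend(itertools.combinations(members, 2))
edge_index = np.array(pairs, dtype=np.int64).reshape(-1, 2).T  # shape [2, num_edges]

# Labels: one per node
y = income_gt_50k

print(x.shape)       # (4, 2)
print(edge_index)    # clients 0, 2 and 3 (all "Sales") are pairwise connected
```

The client with a unique occupation (index 1) ends up isolated, which already hints at how much the choice of edge column shapes the graph.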

In [None]:
# first of all, specify the edge
edge = "occupation"# str (for the moment)

In [None]:
# get an idea of the codes corresponding to occupations, reconstituting labels' transformations from X
le = LabelEncoder()

dict_occupation_codes = pd.Series(X[edge].values, index=X.apply(le.fit_transform)[edge]).to_dict()

# correct according to dict comparison
dict_occupation_codes[14] = 'Transport-moving'
dict_occupation_codes

In [None]:
X=X_valid

def add_new_edge(data, previous_edge_index, col_name): # col_name as a list? E.g. occupation AND hours per week
    
    if previous_edge_index is None:
        previous_edges = np.array([], dtype=np.int32).reshape((0, 2))
    
    elif previous_edge_index is not None:
        previous_edges = previous_edge_index.transpose()

    # first, reset IDs
    # to enable the computation of all combinations of clients sharing some attribute, i.e. column value (e.g. the same type of job)
    data["clients_id"] = data.reset_index().index
    attribute_values = data[col_name].unique()
    
    for attribute in attribute_values:
        # select clients with the same job
        attribute_df = data[data[col_name] == attribute]
        clients = attribute_df["clients_id"].values        
        # Build all combinations between clients with the same attribute e.g. job (without knowing their label)
        permutations = list(itertools.combinations(clients, 2))
        edges_source = [e[0] for e in permutations] # starting client -> to other client with the same attribute e.g. job
        edges_target = [e[1] for e in permutations] # ending client -> from other client with the same attribute e.g. job
        clients_edges = np.column_stack([edges_source, edges_target]) # convert combinations to array
        # complete with each new attribute (e.g. new type of job), to get all couples of clients with the same attribute
        previous_edges = np.vstack([previous_edges, clients_edges]) 

    # Convert to Pytorch Geometric format
    edge_index = previous_edges.transpose()
    # edge_index # [2, num_edges]
    # then convert to torch, for further compatibility with the torch GNN
    #edge_index = torch.from_numpy(edge_index)

    return edge_index

In [None]:
X.columns

In [None]:
data=X_valid
previous_edge_index=None
col_name='occupation'

edge_occupation = add_new_edge(data, previous_edge_index, col_name)

# NOTE: col_name is still 'occupation' here, so the occupation edges are appended a second time
edge_occupation_capital_gain = add_new_edge(data, edge_occupation, col_name)
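
Because every group of clients sharing a value is fully connected, a group of size n contributes n(n-1)/2 edges, so the edge count grows quadratically with group size. The sketch below estimates that count before building anything; `count_pair_edges` is a hypothetical helper, not part of the package.

```python
from collections import Counter

def count_pair_edges(values):
    """Number of edges created when all rows sharing a value are pairwise connected."""
    return sum(n * (n - 1) // 2 for n in Counter(values).values())

# Made-up column: groups of size 3, 2 and 1
toy = ["Sales"] * 3 + ["Tech"] * 2 + ["Clerical"]
print(count_pair_edges(toy))              # 3 + 1 + 0 = 4
print(count_pair_edges(["same"] * 1000))  # 499500
```

Calling it on a column such as `X_valid["occupation"]` gives an idea of the memory needed before running `add_new_edge`.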

In [None]:
list_col_names = ["education","sex"] # since it seems relevant to combine the 2 features in a unique edge

# first, reset IDs
# to enable the computation of all combinations of clients sharing some attribute, i.e. column value (e.g. the same type of job)
data["clients_id"] = data.reset_index().index

In [None]:
def add_new_edge(data, previous_edge_index, list_col_names): # with the list of columns to combine in the edge 
    
    if previous_edge_index is None:
        previous_edges = np.array([], dtype=np.int32).reshape((0, 2))
    
    elif previous_edge_index is not None:
        previous_edges = previous_edge_index.transpose()

    # first, reset IDs
    # to enable the computation of all combinations of clients sharing some attribute, i.e. column value (e.g. the same type of job)
    data["clients_id"] = data.reset_index().index


    if len(list_col_names)==1: # when a unique feature is chosen to form an edge
        
        col_name = list_col_names[0]
        attribute_values = data[col_name].unique()

        for attribute in attribute_values:
            # select clients with the same job
            attribute_df = data[data[col_name] == attribute]
            clients = attribute_df["clients_id"].values        
            # Build all combinations between clients with the same attribute e.g. job (without knowing their label)
            permutations = list(itertools.combinations(clients, 2))
            edges_source = [e[0] for e in permutations] # starting client -> to other client with the same attribute e.g. job
            edges_target = [e[1] for e in permutations] # ending client -> from other client with the same attribute e.g. job
            clients_edges = np.column_stack([edges_source, edges_target]) # convert combinations to array
            # complete with each new attribute (e.g. new type of job), to get all couples of clients with the same attribute
            previous_edges = np.vstack([previous_edges, clients_edges]) 
    
    elif len(list_col_names) == 2: # for the moment, maximum combination of 2 columns to create an edge
    
    # TODO join if too many categories (e.g. hours of work per week)
    # else, 1050 combinations of types of jobs and hours per week - a bit hard to compute
    # and irrelevant (mini-categories of clients as edges)...
        col_1 = list_col_names[0]
        col_2 = list_col_names[1]
            
        combinations_vals_cols_1_to_2 = np.array(np.meshgrid(data[col_1].unique(), data[col_2].unique())).T.reshape(-1,2)

        for attr1, attr2 in combinations_vals_cols_1_to_2:
            attribute_df = data.loc[(data[col_1] == attr1) & (data[col_2] == attr2)]
            clients = attribute_df["clients_id"].values        
            # Build all combinations between clients with the same attribute e.g. job (without knowing their label)
            permutations = list(itertools.combinations(clients, 2))
            edges_source = [e[0] for e in permutations] # starting client -> to other client with the same attribute e.g. job
            edges_target = [e[1] for e in permutations] # ending client -> from other client with the same attribute e.g. job
            clients_edges = np.column_stack([edges_source, edges_target]) # convert combinations to array
            # complete with each new attribute (e.g. new type of job), to get all couples of clients with the same attribute
            previous_edges = np.vstack([previous_edges, clients_edges]) 
    
    else:
        raise NotImplementedError("The maximum number of features you specify in list_col_names to create an edge must be 2.")

    # Convert to Pytorch Geometric format
    edge_index = previous_edges.transpose()
    # edge_index # [2, num_edges]
    # then convert to torch, for further compatibility with the torch GNN
    #edge_index = torch.from_numpy(edge_index)

    return edge_index
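
One point worth checking: `itertools.combinations` yields each pair of clients only once, so the resulting `edge_index` stores every connection in a single direction. Since PyG reads `edge_index` as directed (source, target) pairs, a symmetric relation such as "same occupation" may need both directions. The NumPy sketch below shows one way to add them; `make_undirected` is a hypothetical helper, and PyG also ships `torch_geometric.utils.to_undirected` for tensors.

```python
import numpy as np

def make_undirected(edge_index):
    """Add the reverse of every edge and drop duplicates; expects shape [2, num_edges]."""
    both = np.hstack([edge_index, edge_index[::-1]])  # [::-1] swaps source and target rows
    return np.unique(both, axis=1)

one_way = np.array([[0, 0, 2],
                    [2, 3, 3]])
both_ways = make_undirected(one_way)
print(both_ways.shape)  # (2, 6)
```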

In [None]:
data = X_valid
previous_edge_index = None
list_col_names = ["sex", "education"]

edge_sex_education = add_new_edge(data=data, previous_edge_index=previous_edge_index, list_col_names=list_col_names)

In [None]:
data = X_valid
previous_edge_index = edge_sex_education
list_col_names = ["occupation"]

edge_sex_education_and_occupation = add_new_edge(data=data, previous_edge_index=previous_edge_index, list_col_names=list_col_names)

In [None]:
previous_edge_index.shape

In [None]:
# TODO enhance the function (and then include it in the package)

def table_to_graph(X, Y, list_col_names):
    
    #Make sure that we have no duplicate nodes
    assert(X.index.unique().shape[0] == X.shape[0])
    
    # first of all, reset the IDs of clients
    X["clients_id"] = X.reset_index().index
    
    # Extract the node features

    # The node features are typically represented in a matrix of the shape (num_nodes, node_feature_dim).
    # For each of the bank clients, we simply extract their attributes (except the columns in list_col_names, which are used as edges to connect them)
    list_X_cols = X.columns.to_list()
    list_nodes_names = [col for col in list_X_cols if col not in list_col_names]
    node_features = X[list_nodes_names]
    # That's already our node feature matrix. The number of nodes and the ordering are implicitly defined by its shape. Each row corresponds to one node in our final graph.

    # Convert to numpy
    x = node_features.to_numpy()
    # x.shape # [num_nodes x num_features]
    # then convert to torch, for further compatibility with the torch GNN
    x = torch.from_numpy(x)
    
    # Extract the labels
    labels = Y
        # Those are simply the wealthiness of each of the clients (if their income is >$50 000). This corresponds to a node-level prediction problem. 
        # Therefore we have as many labels as we have nodes.
    
    # for the graph to work, check that the nodes follow the same order as the labels (row numbers)
    # otherwise, sort values by IDs
    nb_corresponding_nodes_labels = (labels.index == node_features.index).sum()
    assert(nb_corresponding_nodes_labels == X.shape[0])
    
    # Convert to numpy
    y = labels.to_numpy()
    # y.shape # [num_nodes] --> node classification
    # get the number of classes
    num_classes = np.unique(y).shape[0]
    # then convert to torch, for further compatibility with the torch GNN
    y = torch.from_numpy(y)

    # Extract the edges, now with our function to combine columns
    # use the function's own X (not the notebook-level `data`) and start from an empty edge list
    edges = add_new_edge(data=X, previous_edge_index=None, list_col_names=list_col_names)
    # then convert to torch, for further compatibility with the torch GNN
    edge_index = torch.from_numpy(edges)
    
    # finally, build the graph (if other attributes e.g. edge_features, you can also pass it there)
    data = Data(x=x, edge_index=edge_index, y=y, num_classes=num_classes)
    
    return data

In [None]:
# TODO enhance the function (and then include it in the package)

def table_to_graph(X, Y, edge):
    
    #Make sure that we have no duplicate nodes
    assert(X.index.unique().shape[0] == X.shape[0])
    
    # first of all, reset the IDs of clients
    X["clients_id"] = X.reset_index().index
    
    # Extract the node features

    # The node features are typically represented in a matrix of the shape (num_nodes, node_feature_dim).
    # For each of the bank clients, we simply extract their attributes (except the `edge` column, e.g. "occupation", which is used as an "actionable" edge to connect them)
    node_features = X.loc[:, X.columns != edge]
    # That's already our node feature matrix. The number of nodes and the ordering are implicitly defined by its shape. Each row corresponds to one node in our final graph.

    # Convert to numpy
    x = node_features.to_numpy()
    # x.shape # [num_nodes x num_features]
    # then convert to torch, for further compatibility with the torch GNN
    x = torch.from_numpy(x)
    
    # Extract the labels
    labels = Y
        # Those are simply the wealthiness of each of the clients (if their income is >$50 000). This corresponds to a node-level prediction problem. 
        # Therefore we have as many labels as we have nodes.
    
    # for the graph to work, check that the nodes follow the same order as the labels (row numbers)
    # otherwise, sort values by IDs
    nb_corresponding_nodes_labels = (labels.index == node_features.index).sum()
    assert(nb_corresponding_nodes_labels == X.shape[0])
    
    # Convert to numpy
    y = labels.to_numpy()
    # y.shape # [num_nodes] --> node classification
    # get the number of classes
    num_classes = np.unique(y).shape[0]
    # then convert to torch, for further compatibility with the torch GNN
    y = torch.from_numpy(y)

    # Extract the edges
    # That's probably the trickiest part with a tabular dataset. You need to think of a reasonable way to connect your nodes.
    # We will use the `edge` column (here the type of job)
    # We now need to build all combinations of clients within one type of job, which corresponds to a fully-connected graph within each occupation subgroup. We use the column clients_id as indices for the edges. If there is for example a [0, 1] in the edge index, it means that the first and second node (regarding the previously defined node feature matrix) are connected.

    jobs = X[edge].unique()
    all_edges = np.array([], dtype=np.int32).reshape((0, 2))
    for job in jobs:
        job_df = X[X[edge] == job]
        clients = job_df["clients_id"].values        # Build all combinations, as all clients of the group are connected
        permutations = list(itertools.combinations(clients, 2))
        edges_source = [e[0] for e in permutations]
        edges_target = [e[1] for e in permutations]
        clients_edges = np.column_stack([edges_source, edges_target])
        all_edges = np.vstack([all_edges, clients_edges])
        
    # begin with empty edge_index, to assess if the GNN structure works
    #edge_index = torch.empty(2, 0, dtype=torch.long)
        
    # Convert to Pytorch Geometric format
    edge_index = all_edges.transpose()
    # edge_index # [2, num_edges]
    # then convert to torch, for further compatibility with the torch GNN
    edge_index = torch.from_numpy(edge_index)
    
    # finally, build the graph (if other attributes e.g. edge_features, you can also pass it there)
    data = Data(x=x, edge_index=edge_index, y=y, num_classes=num_classes)
    
    return data

In [None]:
# the most recent definition above takes a single `edge` column (here "occupation")
data_train = table_to_graph(X=X_train, Y=Y_train, edge=edge)
data_valid = table_to_graph(X=X_valid, Y=Y_valid, edge=edge)

In [None]:
type(data_train)

In [None]:
data_valid

In [None]:
from classif_basic.graph import table_to_graph

In [None]:
data_train = table_to_graph(X=X_train, Y=Y_train, list_col_names=list_col_names)
data_valid = table_to_graph(X=X_valid, Y=Y_valid, list_col_names=list_col_names)

In [None]:
data_train

In [None]:
data_valid

# Train a basic Graph Neural Network on the graph-shaped data

## Build a basic convolutional GNN with torch

In [None]:
# this follows the quick "introduction by example" to GCNs in the PyG documentation
# at 'https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html'

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, data):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, data.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)

In [None]:
batch_nb = 200

t_basic_1 = time.time()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN(data=data_train).to(device)
data_train = data_train.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.double()

model.train()
for epoch in range(batch_nb):  # NOTE: each iteration is a full pass over the training graph, i.e. an epoch
    # better with 200 epochs (with only feature "occupation" as edge, 70% accuracy vs 50% accuracy with 50 epochs)
    optimizer.zero_grad()
    out = model(data_train)
    loss = F.nll_loss(out, data_train.y)
    loss.backward()
    optimizer.step()

t_basic_2 = time.time()

print(f"Training of the basic GCN on Census with {batch_nb} epochs took {(t_basic_2 - t_basic_1)/60} mn")

In [None]:
model(data_train)

Finally, we can evaluate our model on the validation nodes. Linking the clients only through their job yields less than 70% accuracy, even on the train set. Therefore, we need to look for other ways of connecting them...
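
Accuracy figures like these are easier to read next to a trivial reference, namely the score obtained by always predicting the most frequent class. A minimal sketch, with a hypothetical helper and made-up labels (not the Census target):

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

toy_labels = [0, 0, 0, 1]  # made-up labels for illustration
print(majority_baseline_accuracy(toy_labels))  # 0.75
```

Applied to `Y_train` or `Y_valid`, it tells whether the GNN does better than a constant prediction.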

In [None]:
model.eval()  # switch off dropout before predicting

pred_train = model(data_train).argmax(dim=1)
nb_indivs_train = data_train.x.shape[0]

correct_train = (pred_train == data_train.y).sum()
acc = int(correct_train) / nb_indivs_train
print(f'Accuracy on train data: {acc:.4f}')

In [None]:
model.eval()  # switch off dropout before predicting

data_valid = data_valid.to(device)  # same device as the model
pred_valid = model(data_valid).argmax(dim=1)
nb_indivs_valid = data_valid.x.shape[0]

correct_valid = (pred_valid == data_valid.y).sum()
acc = int(correct_valid) / nb_indivs_valid
print(f'Accuracy on validation data: {acc:.4f}')

In [None]:
# edge_index[1] holds the target node of every edge (edge_index[0] holds the source nodes)
# there is a problem if an index exceeds the number of nodes (14450): e.g. the node 18460 does not exist...
data_train.edge_index[1]

In [None]:
data_train.edge_index

## Build a more complex GNN with torch
The advantage of using torch_geometric to build the GNN is the compatibility with the graph of data, as data was just reshaped using torch_geometric (above). 

In [None]:
from torch.nn import Linear, ReLU, Dropout
from torch_geometric.nn import Sequential, GCNConv, JumpingKnowledge
from torch_geometric.nn import global_mean_pool

num_data_classes = 2

gcn_seq = Sequential('x, edge_index, batch', [
    (Dropout(p=0.5), 'x -> x'),
    (GCNConv(data_train.num_features, 64), 'x, edge_index -> x1'),
    ReLU(inplace=True),
    (GCNConv(64, 64), 'x1, edge_index -> x2'),
    ReLU(inplace=True),
    (lambda x1, x2: [x1, x2], 'x1, x2 -> xs'),
    (JumpingKnowledge("cat", 64, num_layers=2), 'xs -> x'),
    (global_mean_pool, 'x, batch -> x'),
    Linear(2 * 64, num_data_classes),
])

In [None]:
def pred_gcn_seq(data, batch_nb):

    t_seq_1 = time.time()
    
    if batch_nb is None:
        batch_nb = 200

    x = data.x.float()#.long()
    edge_index = data.edge_index
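    # NOTE: `batch` assigns each node to a graph index (it is not a number of batches);
    # here every node gets the index batch_nb, so global_mean_pool returns batch_nb + 1 rows
    batch = batch_nb*torch.ones(data.num_nodes).long()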

    pred = gcn_seq(x, edge_index, batch)

    t_seq_2 = time.time()

    print(f"Predictions with the sequential GCN on Census with {batch_nb} batches took {round(t_seq_2 - t_seq_1)/60} mn")
    
    return pred

Using a more 'complex' model with 'occupation' as the sole edge does not lead to better results (accuracy = 0.52, but without features adapted for training)... We will therefore try to build better edges.
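
A caveat on this architecture: a mean-pooling layer such as `global_mean_pool` averages the node embeddings per graph index given in `batch`, so its output has one row per graph rather than one row per node. The NumPy analogue below (a sketch, not PyG's implementation; `mean_pool` is a hypothetical name) illustrates the effect on the output shape.

```python
import numpy as np

def mean_pool(x, batch):
    """NumPy analogue of per-graph mean pooling: average node rows by graph index."""
    num_graphs = int(batch.max()) + 1
    sums = np.zeros((num_graphs, x.shape[1]))
    np.add.at(sums, batch, x)  # accumulate each node row into its graph
    counts = np.bincount(batch, minlength=num_graphs).reshape(-1, 1)
    return sums / np.maximum(counts, 1)  # empty graphs stay at zero

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(mean_pool(x, np.array([0, 0, 0])).shape)  # (1, 2): one row for the whole graph
print(mean_pool(x, np.array([0, 0, 1])).shape)  # (2, 2): one row per graph
```

For node-level prediction of income, the pooled output therefore cannot be compared row by row with the node labels.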

In [None]:
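pred_train = pred_gcn_seq(data=data_train, batch_nb=batch_nb)
nb_indivs_train = data_train.x.shape[0]

# NOTE: `correct_train` still comes from the basic GCN cell above; `pred_train` is not used here,
# so this accuracy does not yet measure the sequential model
acc = int(correct_train) / nb_indivs_train
print(f'Accuracy on train data: {acc:.4f}')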

In [None]:
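pred_valid = pred_gcn_seq(data=data_valid, batch_nb=batch_nb)
nb_indivs_valid = data_valid.x.shape[0]

# NOTE: `correct_valid` still comes from the basic GCN cell above; `pred_valid` is not used here,
# so this accuracy does not yet measure the sequential model
acc = int(correct_valid) / nb_indivs_valid
print(f'Accuracy on valid data: {acc:.4f}')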

Below are the previous attempts...

In [None]:
x = data_train.x
edge_index = data_train.edge_index
batch = 10

model.train()

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

x = data_train.x
edge_index = data_train.edge_index
batch = 10

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(x, edge_index, batch)
    loss = F.nll_loss(out, data_train.y)  # compare the model output (not the graph object) with the labels
    loss.backward()
    optimizer.step()

In [None]:
x = data_valid.x
edge_index = data_valid.edge_index
batch = 10

model.eval()  # switch off dropout before predicting

pred_valid = model(x, edge_index, batch).argmax(dim=1)
nb_indivs_valid = data_valid.x.shape[0]

correct = (pred_valid == data_valid.y).sum()  # compare with the labels, not the graph object
acc = int(correct) / nb_indivs_valid
print(f'Accuracy: {acc:.4f}')

## Training a Graph Neural Network (GNN)

We can easily convert our MLP to a GNN by swapping the `torch.nn.Linear` layers with PyG's GNN operators.

Following-up on [the first part of the Torch tutorial we used](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8), we replace the linear layers by the [`GCNConv`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) module.
To recap, the **GCN layer** ([Kipf et al. (2017)](https://arxiv.org/abs/1609.02907)) is defined as

$$
\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \sum_{w \in \mathcal{N}(v) \, \cup \, \{ v \}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(\ell)}
$$

where $\mathbf{W}^{(\ell + 1)}$ denotes a trainable weight matrix of shape `[num_output_features, num_input_features]` and $c_{w,v}$ refers to a fixed normalization coefficient for each edge.
In contrast, a single `Linear` layer is defined as

$$
\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \mathbf{x}_v^{(\ell)}
$$

which does not make use of neighboring node information.
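
The difference between the two formulas can be checked numerically. The sketch below implements one GCN step in NumPy under an assumption the text leaves open: the normalization is taken as $c_{w,v} = \sqrt{\deg(v)}\sqrt{\deg(w)}$, with degrees counted after adding self-loops. The names `gcn_layer` and the 3-node path graph are made up for illustration.

```python
import numpy as np

def gcn_layer(x, edge_index, weight):
    """One GCN propagation step with self-loops and symmetric normalization (assumed)."""
    num_nodes = x.shape[0]
    adj = np.zeros((num_nodes, num_nodes))
    adj[edge_index[1], edge_index[0]] = 1.0   # row v collects messages from source w
    adj += np.eye(num_nodes)                  # N(v) ∪ {v}
    deg = adj.sum(axis=1)
    norm = 1.0 / np.sqrt(np.outer(deg, deg))  # 1 / c_{w,v}
    return (adj * norm) @ x @ weight.T        # weight has shape [out_features, in_features]

# Path graph 0 - 1 - 2, stored with both directions of each edge
edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])
x = np.array([[1.0], [2.0], [3.0]])
weight = np.array([[1.0]])

out = gcn_layer(x, edge_index, weight)  # mixes each node with its neighbours
linear_out = x @ weight.T               # a Linear layer leaves each node on its own
print(out.ravel())
print(linear_out.ravel())
```

With an identity weight, the `Linear` output equals the input, while the GCN output of each node already depends on its neighbours' features.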

In [None]:
# Install required packages.
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

# Helper function for visualization.
%matplotlib inline
import networkx as nx
import matplotlib.pyplot as plt


def visualize_graph(G, color):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,
                     node_color=color, cmap="Set2")
    plt.show()


def visualize_embedding(h, color, epoch=None, loss=None):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    h = h.detach().cpu().numpy()
    plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
    if epoch is not None and loss is not None:
        plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)
    plt.show()

In [None]:
data = data_train  # Get the first graph object.

print(data)
print('==============================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
#print(f'Number of training nodes: {data.train_mask.sum()}')
#print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
#print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
# print(f'Is undirected: {data.is_undirected()}')

By printing edge_index, we can understand how PyG represents graph connectivity internally. We can see that for each edge, edge_index holds a tuple of two node indices, where the first value describes the node index of the source node and the second value describes the node index of the destination node of an edge.
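
As a small illustration of this layout, the sketch below reads a made-up `edge_index` column by column and derives node degrees from it, using NumPy only (the values are hypothetical, not taken from `data_train`).

```python
import numpy as np

edge_index = np.array([[0, 0, 2],    # source nodes
                       [2, 3, 3]])   # target nodes
num_nodes = 4

# Each column is one edge: (source, target)
for source, target in edge_index.T:
    print(f"{source} -> {target}")

out_degree = np.bincount(edge_index[0], minlength=num_nodes)
in_degree = np.bincount(edge_index[1], minlength=num_nodes)
avg_degree = edge_index.shape[1] / num_nodes  # same ratio as num_edges / num_nodes

print(out_degree)  # [2 0 1 0]
print(in_degree)   # [0 0 1 2]
print(avg_degree)  # 0.75
```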

In [None]:
from IPython.display import Javascript  # Restrict height of output cell.
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))

edge_index = data.edge_index
print(edge_index.t())  # torch tensors need .t() (or explicit dims) to transpose

We can further visualize the graph by converting it to the networkx library format, which implements, in addition to graph manipulation functionalities, powerful tools for visualization:

In [None]:
data.y  # already a torch tensor; no TensorFlow conversion is needed

In [None]:
from torch_geometric.utils import to_networkx

G = to_networkx(data, to_undirected=True)  # expects the PyG Data object itself
visualize_graph(G, color=data.y)

The code below is from the previous day:

In [None]:
import torch
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(1234567)
        self.conv1 = GCNConv(data_train.num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, 2) # number of classes on the data

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(hidden_channels=16)
print(model)

**Embedding the Census Network**

Let's take a look at the node embeddings produced by our GNN.
Here, we pass in the initial node features `x` and the graph connectivity information `edge_index` to the model, and visualize its 2-dimensional embedding.

In [None]:
h = model(data_train.x, data_train.edge_index)  # forward returns a single tensor of shape [num_nodes, 2]
print(f'Embedding shape: {list(h.shape)}')

visualize_embedding(h, color=data_train.y)

In [None]:
pip install IPython

In [None]:
data_train

In [None]:
from IPython.display import Javascript  # Restrict height of output cell.
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))

model = GCN(hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data_train.x, data_train.edge_index)  # Perform a single forward pass.
    loss = criterion(out, data_train.y)  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def test():
    model.eval()
    out = model(data_valid.x, data_valid.edge_index)
    pred = out.argmax(dim=1)  # Use the class with highest probability.
    test_correct = pred == data_valid.y  # Check against ground-truth labels.
    test_acc = int(test_correct.sum()) / data_valid.num_nodes  # Derive ratio of correct predictions.
    return test_acc

for epoch in range(1, 101):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

In [None]:
augmented_train_valid_set = augment_train_valid_set_with_results("uncorrected", X_train_valid, Y_train_valid, Y_pred_train_valid, model_task)

We now see that this process (basic data preparation, modelling, and storing the model's results in a DataFrame) is very fast (a matter of seconds):

In [None]:
t1 = time.time()

print(f"Basic modelling took {round(t1 - t0)} seconds")

The further steps are for fairness assessment and correction of the model, functionality which is available with the package FairDream of DreamQuark (private for the moment)...

## Detection alert (on train&valid data to examine if the model learned discriminant behavior)

## Discrimination correction with a new fair model

### Generating fairer models with grid search or weights distortion

### Evaluating the best fair model