# Shallow methods for supervised learning

In this notebook we will explore a very naive (yet powerful) approach to graph-based supervised machine learning. The idea relies on the classic machine learning approach of handcrafted feature extraction.

In Chapter 1 you learned how local and global graph properties can be extracted from graphs. Those properties describe the graph itself and carry important information that can be useful for classification.

In [5]:
!pip install stellargraph



In this demo, we will use the PROTEINS dataset, which is already integrated in StellarGraph.

In [1]:
from stellargraph import datasets
from IPython.display import display, HTML

datasets.PROTEINS.url = 'https://www.chrsmrrs.com/graphkerneldatasets/PROTEINS.zip'

dataset = datasets.PROTEINS()
display(HTML(dataset.description))
graphs, graph_labels = dataset.load()

To compute the graph metrics, one way is to retrieve the adjacency matrix representation of each graph.

In [3]:
# convert graphs from StellarGraph format to dense numpy adjacency matrices
adjs = [graph.to_adjacency_matrix().toarray() for graph in graphs]
# convert labels from pandas.Series to numpy array
labels = graph_labels.to_numpy(dtype=int)

In [4]:
import numpy as np
import networkx as nx

metrics = []
for adj in adjs:
    # from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is the replacement
    G = nx.from_numpy_array(adj)
    # basic properties
    num_edges = G.number_of_edges()
    # clustering measures
    cc = nx.average_clustering(G)
    # measure of efficiency
    eff = nx.global_efficiency(G)

    metrics.append([num_edges, cc, eff])
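To make the three features concrete, the sketch below computes them on a small hand-built graph: a triangle with one extra node hanging off it. The toy graph and the name `toy` are illustrative assumptions and have nothing to do with the PROTEINS data.

```python
import networkx as nx

# Toy graph (illustrative only): triangle 0-1-2 plus a pendant node 3 attached to node 2
toy = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

num_edges = toy.number_of_edges()   # basic property: number of edges
cc = nx.average_clustering(toy)     # mean of the local clustering coefficients
eff = nx.global_efficiency(toy)     # mean of 1 / shortest-path length over all node pairs

print(num_edges, round(cc, 4), round(eff, 4))
```

Nodes 0 and 1 have clustering 1, node 2 has 1/3 and node 3 has 0, so the average is 7/12. Four node pairs are at distance 1 and two at distance 2, so the efficiency is 5/6.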



We can now use scikit-learn utilities to create a train and test set. In our experiments, we will use 70% of the dataset as the training set and the remaining 30% as the test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(metrics, labels, test_size=0.3, random_state=42)

As is common in many machine learning workflows, we preprocess the features to have zero mean and unit standard deviation.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
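The important detail in the cell above is that the scaler is fitted on the training set only, and the same training statistics are then applied to the test set. A minimal sketch on made-up numbers (the arrays `X_tr` and `X_te` are hypothetical) shows that this matches the manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: rows are samples, columns are features
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

# Mean and standard deviation are estimated from the training set only
toy_scaler = StandardScaler().fit(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

# Same result by hand: subtract the training mean, divide by the training std
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(X_te_scaled)
print(manual)
```

Note that the scaled test set does not need to have zero mean or unit variance itself, since it is expressed in units of the training distribution.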

It's now time to train a classifier. We chose a support vector machine for this task.

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

clf = svm.SVC()
clf.fit(X_train_scaled, y_train)

y_pred = clf.predict(X_test_scaled)

print('Accuracy', accuracy_score(y_test,y_pred))
print('Precision', precision_score(y_test,y_pred))
print('Recall', recall_score(y_test,y_pred))
print('F1-score', f1_score(y_test,y_pred))

Accuracy 0.7455089820359282
Precision 0.7709251101321586
Recall 0.8413461538461539
F1-score 0.8045977011494253
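As a reminder of how these four scores relate to each other, here is a small sketch that derives them from confusion counts. The label vectors `y_true` and `y_hat` are made up for illustration; the last lines only check that the F1-score reported above is the harmonic mean of the reported precision and recall.

```python
# Hypothetical ground truth and predictions (1 = positive class)
y_true = [1, 1, 1, 0, 0, 1]
y_hat = [1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# Consistency check on the scores printed by the notebook
reported_precision = 0.7709251101321586
reported_recall = 0.8413461538461539
reported_f1 = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(reported_f1)
```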


# Supervised graph representation learning using Graph ConvNet

In this notebook we will perform supervised graph representation learning using a Deep Graph ConvNet as the encoder.

The model embeds a graph by using stacked Graph ConvNet layers.

In this demo, we will use the PROTEINS dataset, which is already integrated in StellarGraph.

In [12]:
import pandas as pd
from stellargraph import datasets
from IPython.display import display, HTML

dataset = datasets.PROTEINS()
display(HTML(dataset.description))
graphs, graph_labels = dataset.load()

labels = graph_labels.to_numpy(dtype=int)

# necessary for converting default string labels to int
graph_labels = pd.get_dummies(graph_labels, drop_first=True)

StellarGraph, which we use to build the model, relies on tf.Keras as its backend. Following its conventions, we need a data generator to feed the model. For supervised graph classification, we create an instance of StellarGraph's PaddedGraphGenerator class. This generator supplies the feature arrays and the adjacency matrices to a mini-batch Keras graph classification model. Differences in the number of nodes are resolved by padding each batch of features and adjacency matrices, and supplying a boolean mask indicating which entries are valid and which are padding.

In [13]:
from stellargraph.mapper import PaddedGraphGenerator
generator = PaddedGraphGenerator(graphs=graphs)
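The padding idea can be sketched in a few lines of NumPy. This is only an illustration of the concept, not StellarGraph's actual implementation; the helper `pad_batch` and the toy arrays are hypothetical.

```python
import numpy as np

def pad_batch(features, adjacencies):
    """Pad per-graph arrays to the largest graph in the batch and build a validity mask."""
    max_nodes = max(f.shape[0] for f in features)
    n_feat = features[0].shape[1]
    batch_size = len(features)

    X = np.zeros((batch_size, max_nodes, n_feat))
    A = np.zeros((batch_size, max_nodes, max_nodes))
    mask = np.zeros((batch_size, max_nodes), dtype=bool)

    for i, (f, a) in enumerate(zip(features, adjacencies)):
        n = f.shape[0]
        X[i, :n] = f        # real node features, the rest stays zero
        A[i, :n, :n] = a    # real adjacency block, the rest stays zero
        mask[i, :n] = True  # True marks real nodes, False marks padding
    return X, A, mask

# Two toy graphs with 2 and 3 nodes, 4 features per node
toy_feats = [np.ones((2, 4)), np.ones((3, 4))]
toy_adjs = [
    np.array([[0, 1], [1, 0]]),
    np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
]

X, A, mask = pad_batch(toy_feats, toy_adjs)
print(X.shape, A.shape)
print(mask)
```

The mask lets downstream layers ignore the zero rows that were added only to make all graphs in the batch the same size.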

Now we are ready to actually create the model. The GCN layers will be created and stacked together through StellarGraph's utility function. This _backbone_ will then be connected to 1D convolutional layers and fully connected layers using tf.Keras.

In [14]:
from stellargraph.layer import DeepGraphCNN
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Flatten
from tensorflow.keras.losses import binary_crossentropy
import tensorflow as tf

nrows = 35  # the number of rows for the output tensor
layer_dims = [32, 32, 32, 1]

dgcnn_model = DeepGraphCNN(
    layer_sizes=layer_dims,
    activations=["tanh", "tanh", "tanh", "tanh"],
    k=nrows,
    bias=False,
    generator=generator,
)
gnn_inp, gnn_out = dgcnn_model.in_out_tensors()


x_out = Conv1D(filters=16, kernel_size=sum(layer_dims), strides=sum(layer_dims))(gnn_out)
x_out = MaxPool1D(pool_size=2)(x_out)

x_out = Conv1D(filters=32, kernel_size=5, strides=1)(x_out)

x_out = Flatten()(x_out)

x_out = Dense(units=128, activation="relu")(x_out)
x_out = Dropout(rate=0.5)(x_out)

predictions = Dense(units=1, activation="sigmoid")(x_out)
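The choice of `kernel_size=sum(layer_dims)` and `strides=sum(layer_dims)` in the first Conv1D is easier to follow with a bit of shape arithmetic. The sketch below assumes that the DeepGraphCNN output is a flattened sequence of length `k * sum(layer_dims)` with a single channel (our reading of the StellarGraph documentation, not something verified here) and that all convolution and pooling layers use Keras' default `valid` padding. The helper `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution or pooling layer with 'valid' padding."""
    return (length - kernel_size) // stride + 1

k = 35                         # nodes kept per graph
layer_dims = [32, 32, 32, 1]
width = sum(layer_dims)        # 97 concatenated features per node

seq_len = k * width            # assumed flattened length of the backbone output
after_conv1 = conv1d_out_len(seq_len, width, width)  # one step per node
after_pool = conv1d_out_len(after_conv1, 2, 2)       # MaxPool1D(pool_size=2)
after_conv2 = conv1d_out_len(after_pool, 5, 1)       # Conv1D(kernel_size=5)
flat = after_conv2 * 32                              # Flatten over 32 filters

print(seq_len, after_conv1, after_pool, after_conv2, flat)
```

Under these assumptions the first convolution reads exactly one node's feature vector per step, so its output has one position per retained node.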

Let's now compile the model

In [15]:
model = Model(inputs=gnn_inp, outputs=predictions)
# `lr` is a deprecated alias; `learning_rate` is the supported argument name
model.compile(optimizer=Adam(learning_rate=0.0001), loss=binary_crossentropy, metrics=["acc"])

We use 70% of the dataset for training and the remaining 30% for testing.

In [16]:
from sklearn import model_selection
train_graphs, test_graphs = model_selection.train_test_split(
    graph_labels, test_size=.3, stratify=labels,
)
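The `stratify` argument keeps the class proportions roughly equal in both splits, which matters when the classes are imbalanced. A small sketch on made-up labels (the array `y` and the fixed `random_state` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 positives and 30 negatives
y = np.array([1] * 70 + [0] * 30)
idx = np.arange(len(y))

tr_idx, te_idx = train_test_split(idx, test_size=0.3, stratify=y, random_state=0)

# Fraction of positives in each split; both should be close to 0.7
print(len(tr_idx), y[tr_idx].mean())
print(len(te_idx), y[te_idx].mean())
```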

In [17]:
gen = PaddedGraphGenerator(graphs=graphs)

train_gen = gen.flow(
    list(train_graphs.index - 1),
    targets=train_graphs.values,
    symmetric_normalization=False,
    batch_size=50,
)

test_gen = gen.flow(
    list(test_graphs.index - 1),
    targets=test_graphs.values,
    symmetric_normalization=False,
    batch_size=1,
)

It's now time for training!

In [None]:
epochs = 100
history = model.fit(
    train_gen, epochs=epochs, verbose=1, validation_data=test_gen, shuffle=True,
)


In [None]:
# https://stellargraph.readthedocs.io/en/stable/demos/graph-classification/index.html

## Supervised node representation learning using GraphSAGE

In [18]:
from stellargraph import datasets
from IPython.display import display, HTML

dataset = datasets.Cora()
display(HTML(dataset.description))
G, nodes = dataset.load()

Let's split the dataset into training and testing set

In [19]:
from sklearn.model_selection import train_test_split
train_nodes, test_nodes = train_test_split(
    nodes, train_size=0.1, test_size=None, stratify=nodes
)

Since we are performing a categorical classification, it is useful to represent each categorical label in its one-hot encoding

In [22]:
from sklearn import preprocessing, feature_extraction, model_selection
label_encoding = preprocessing.LabelBinarizer()
train_labels = label_encoding.fit_transform(train_nodes)
test_labels = label_encoding.transform(test_nodes)
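To see what the binarizer produces, here is a minimal sketch on a handful of made-up string labels (the lists `toy_train` and `toy_test` are hypothetical). The encoder is fitted on the training labels and then reused for the test labels, so both share the same column order.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical categorical labels
toy_train = ["Theory", "Neural_Networks", "Theory", "Genetic_Algorithms"]
toy_test = ["Neural_Networks"]

lb = LabelBinarizer()
train_onehot = lb.fit_transform(toy_train)  # learn the classes and encode
test_onehot = lb.transform(toy_test)        # reuse the same class order

print(lb.classes_)      # columns follow the sorted class names
print(train_onehot)
print(lb.inverse_transform(test_onehot))    # back to the original labels
```

With only two classes LabelBinarizer returns a single column instead of two, so the one-hot layout shown here applies to problems with three or more classes.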

It's now time to create the model. It is composed of three GraphSAGE layers (one per entry in `n_samples` and `layer_sizes`) followed by a Dense layer with softmax activation for classification.

In [23]:
from stellargraph.mapper import GraphSAGENodeGenerator
batchsize = 50
n_samples = [10, 5, 7]
generator = GraphSAGENodeGenerator(G, batchsize, n_samples)
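The list `n_samples` controls how many neighbours are sampled at each hop around a target node. The sketch below is a simplified, pure-Python illustration of that idea and not StellarGraph's sampler; the helper `sample_hops` and the toy adjacency dictionary are hypothetical, and sampling is done with replacement so that every hop has a fixed size.

```python
import random

def sample_hops(neighbors, seed_node, n_samples, rng):
    """Return one list of sampled nodes per hop, starting from [seed_node]."""
    hops = [[seed_node]]
    for n in n_samples:
        frontier = []
        for node in hops[-1]:
            # Sample with replacement so each node contributes exactly n neighbours
            frontier.extend(rng.choices(neighbors[node], k=n))
        hops.append(frontier)
    return hops

# Toy undirected graph as an adjacency dictionary
toy_neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

hops = sample_hops(toy_neighbors, seed_node=0, n_samples=[10, 5, 7], rng=random.Random(0))
print([len(h) for h in hops])
```

With `n_samples = [10, 5, 7]` a single target node expands to 10, then 50, then 350 sampled nodes, which is why deeper models with large sample sizes become expensive quickly.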

In [24]:
from stellargraph.layer import GraphSAGE
from tensorflow.keras.layers import Dense

graphsage_model = GraphSAGE(
    layer_sizes=[32, 32, 16], generator=generator, bias=True, dropout=0.6,
)

In [30]:
gnn_inp, gnn_out = graphsage_model.in_out_tensors()
outputs = Dense(units=train_labels.shape[1], activation="softmax")(gnn_out)

In [32]:
from tensorflow.keras.losses import categorical_crossentropy
# import Model from tensorflow.keras to avoid mixing it with the standalone keras package
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam

model = Model(inputs=gnn_inp, outputs=outputs)
model.compile(optimizer=Adam(learning_rate=0.003), loss=categorical_crossentropy, metrics=["acc"],)

We will use the flow function of the generator for feeding the model with the train and the test set.

In [33]:
train_gen = generator.flow(train_nodes.index, train_labels, shuffle=True)
test_gen = generator.flow(test_nodes.index, test_labels)

Finally, let's train the model!

In [None]:
history = model.fit(train_gen, epochs=20, validation_data=test_gen, verbose=2, shuffle=False)

In the rest of the notebook, we will work through similar examples using two other popular graph deep learning frameworks: PyTorch Geometric (PyG) and Deep Graph Library (DGL).

### Graph Classification using PyG

In [None]:
#!pip install fsspec==2024.3.1 # needed for PROTEINS download torch geometric
#!pip install torch_geometric

import torch
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader since PyG 2.0
from torch_geometric.nn import GCNConv, global_mean_pool
from torch.nn import Linear
import torch.nn.functional as F

# Load the PROTEINS dataset
dataset = TUDataset(root='data/PROTEINS', name='PROTEINS')

# Set random seed for reproducibility
torch.manual_seed(42)

# Shuffle and split the dataset into training and test sets
dataset = dataset.shuffle()
split_idx = int(0.8 * len(dataset))  # 80/20 train/test split
train_dataset = dataset[:split_idx]
test_dataset = dataset[split_idx:]

# Print dataset statistics
print(f'Training graphs: {len(train_dataset)}, Test graphs: {len(test_dataset)}')

# Create DataLoader for batching
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Define the GCN model
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, hidden_dim)
        self.lin = Linear(hidden_dim, output_dim)
        
    def forward(self, x, edge_index, batch):
        # Graph convolution layers with ReLU activations
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        
        # Global pooling to obtain graph-level representation
        x = global_mean_pool(x, batch)
        
        # Apply dropout and final linear layer
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        return x

# Instantiate the model
print(dataset.num_node_features)
model = GCN(input_dim=dataset.num_node_features, hidden_dim=64, output_dim=dataset.num_classes)
print(model)

# Define optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)  # Learning rate decay

# Training function
def train():
    model.train()
    total_loss = 0
    for data in train_loader:
        optimizer.zero_grad()
        out = model(data.x, data.edge_index, data.batch)
        loss = criterion(out, data.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

# Evaluation function
def evaluate(loader):
    model.eval()
    correct = 0
    for data in loader:
        with torch.no_grad():
            out = model(data.x, data.edge_index, data.batch)
            pred = out.argmax(dim=1)
            correct += int((pred == data.y).sum())
    return correct / len(loader.dataset)

# Training loop
num_epochs = 200
for epoch in range(1, num_epochs + 1):
    loss = train()
    train_acc = evaluate(train_loader)
    test_acc = evaluate(test_loader)
    scheduler.step()  # Adjust learning rate

    print(f'Epoch {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')
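The step that turns node embeddings into one vector per graph is the global mean pooling, which averages node rows according to the `batch` vector that assigns each node to its graph. A NumPy sketch of that computation follows; it only illustrates the idea and is not PyG's implementation, and the helper `mean_pool` and the toy arrays are hypothetical.

```python
import numpy as np

def mean_pool(x, batch, num_graphs):
    """Average node embeddings per graph; batch[i] is the graph id of node i."""
    sums = np.zeros((num_graphs, x.shape[1]))
    np.add.at(sums, batch, x)  # accumulate node rows into their graph's row
    counts = np.bincount(batch, minlength=num_graphs).reshape(-1, 1)
    return sums / counts

# Three nodes with 2-dimensional embeddings: the first two belong to graph 0, the last to graph 1
x_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])
batch_toy = np.array([0, 0, 1])

pooled = mean_pool(x_toy, batch_toy, num_graphs=2)
print(pooled)
```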

### Graph Classification using DGL

In [None]:
#!pip install torch==2.1.1 # needed for dgl
#!pip install  dgl -f https://data.dgl.ai/wheels/torch-2.1/repo.html

import dgl
import torch
import torch.nn.functional as F
from torch.nn import Linear
from dgl.data import GINDataset
from dgl.dataloading import GraphDataLoader
from dgl.nn.pytorch import GraphConv
from dgl.data.utils import split_dataset

dataset = dgl.data.GINDataset('PROTEINS', self_loop=True)

# Set random seed for reproducibility
torch.manual_seed(42)

# 2. Split dataset into training, validation and test sets
# (shuffle before splitting; random_state is only used when shuffle=True)
train_dataset, val_dataset, test_dataset = split_dataset(dataset, frac_list=[0.8, 0.1, 0.1], shuffle=True, random_state=42)

# Print dataset statistics
print(f'Training graphs: {len(train_dataset)}, Test graphs: {len(test_dataset)}')

# 3. Create DGL DataLoader for batching
train_loader = GraphDataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = GraphDataLoader(test_dataset, batch_size=64, shuffle=False)

# 4. Define the GCN model using DGL's GraphConv layers
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(input_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.conv3 = GraphConv(hidden_dim, hidden_dim)
        self.fc = Linear(hidden_dim, output_dim)

    def forward(self, g, features):
        # Apply GraphConv layers with ReLU activations
        h = F.relu(self.conv1(g, features))
        h = F.relu(self.conv2(g, h))
        h = self.conv3(g, h)
        
        # Global mean pooling to obtain graph-level representation
        with g.local_scope():
            g.ndata['h'] = h
            hg = dgl.mean_nodes(g, 'h')
        
        # Apply dropout and final linear layer for classification
        hg = F.dropout(hg, p=0.5, training=self.training)
        return self.fc(hg)

# 5. Initialize the model, optimizer, and loss function
input_dim = dataset.dim_nfeats
output_dim = dataset.num_classes
hidden_dim = 64

print("Input dim:", input_dim)
print("Output dim:", output_dim)

model = GCN(input_dim, hidden_dim, output_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# 6. Training function
def train():
    model.train()
    total_loss = 0
    for batched_graph, labels in train_loader:
        optimizer.zero_grad()
        features = batched_graph.ndata['attr']
        out = model(batched_graph, features)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

# 7. Evaluation function
def evaluate(loader):
    model.eval()
    correct = 0
    for batched_graph, labels in loader:
        features = batched_graph.ndata['attr']
        with torch.no_grad():
            out = model(batched_graph, features)
            pred = out.argmax(dim=1)
            correct += (pred == labels).sum().item()
    return correct / len(loader.dataset)

# 8. Training loop
num_epochs = 200
for epoch in range(1, num_epochs + 1):
    loss = train()
    train_acc = evaluate(train_loader)
    test_acc = evaluate(test_loader)
    print(f'Epoch {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')
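Both PyTorch models return the raw scores of the final linear layer, without a softmax. This works because `torch.nn.CrossEntropyLoss` applies a log-softmax internally and expects integer class labels. The NumPy sketch below reproduces that computation on made-up logits; the helper `cross_entropy` and the toy arrays are hypothetical.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target class under softmax(logits)."""
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two graphs, two classes: confident and correct predictions
toy_logits = np.array([[2.0, 0.0], [0.0, 2.0]])
toy_targets = np.array([0, 1])
loss = cross_entropy(toy_logits, toy_targets)

# Uninformative predictions: equal scores for both classes
uniform_loss = cross_entropy(np.zeros((2, 2)), toy_targets)

print(loss, uniform_loss)
```

For two classes an uninformative model has a loss of log(2), roughly 0.693, which is a useful reference when reading the training logs.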