<a href="https://colab.research.google.com/github/JulianMeigen/ML-handson/blob/main/notebooks/7.0-SNJMMH-Day7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Day 7

## Team members:
- Samuel Nebgen s6sanebg@uni-bonn.de
- Muhammad Humza Arain s27marai@uni-bonn.de
- Julian Meigen s82jmeig@uni-bonn.de

## 16.09.2025

All team members contributed roughly equally, through either discussion or coding.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [24]:
import torch
import torch_geometric
import numpy as np
import pandas as pd
import networkx as nx
import plotly
from torch_geometric.utils import to_networkx
from torch.nn import Embedding
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Task 1 Perform a node labeling task with a Graph ML model

## a) Load the graph dataset (ogbn-proteins) into pytorch-geometric

We load a precomputed 1-hop subgraph directly instead of the full ogbn-proteins graph.

In [6]:
path = '/content/drive/MyDrive/Machine Learning Hands On/Data/Graph Data/subgraph_hop_1.pt'

bundle_Julian = torch.load(path, weights_only=False)

data = bundle_Julian['graph']
print(data)

Data(num_nodes=942, edge_index=[2, 200414], edge_attr=[200414, 8], node_species=[942, 1], y=[942, 112])


In [7]:
G = to_networkx(data, to_undirected=True)
print(G)

Graph with 942 nodes and 100207 edges


## b) Create a train, val, test split on the nodes or load the masks via pytorch-geometric.

### i. Create a subgraph if the computation is too expensive.

In [17]:
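num_nodes = data.num_nodes
# random permutation of node indices, used to draw disjoint splits
perm = torch.randperm(num_nodes)

# 80% train, 10% validation, remaining nodes (about 10%) test
train_size = int(0.8 * num_nodes)
val_size = int(0.1 * num_nodes)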

train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

train_mask[perm[:train_size]] = True
val_mask[perm[train_size:train_size + val_size]] = True
test_mask[perm[train_size + val_size:]] = True

data.train_mask = train_mask
data.val_mask = val_mask
data.test_mask = test_mask

print(data.y)

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])


In [16]:
print(len(data.y[data.train_mask]))
print(len(data.y[data.val_mask]))
print(len(data.y[data.test_mask]))

753
94
95


## c) Initialize the graph with random node embeddings.

In [23]:
number_nodes = data.num_nodes
embedding_dim = 112
node_embeddings = Embedding(number_nodes, embedding_dim)

## d) Define a graph convolutional neural network class with two layers using pytorch-geometric.

In [27]:
class GCN(torch.nn.Module):
    def __init__(self, num_nodes, embedding_dim, hidden_dim, out_dim):
        super().__init__()
        self.emb = torch.nn.Embedding(num_nodes, embedding_dim)
        self.conv1 = GCNConv(embedding_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, edge_index):
        # get embeddings for all nodes
        x = self.emb.weight
        # apply GCN layers
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        return x

In [None]:
# out_dim must match the 112 label columns of data.y
model_gcn = GCN(num_nodes=data.num_nodes, embedding_dim=128, hidden_dim=64, out_dim=112)

### i. Train your model on the train dataset using an optimizer and a loss function for a multilabel classification task for 100 epochs
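
A minimal sketch of this training step, assuming `BCEWithLogitsLoss` as the multilabel loss and Adam as the optimizer (the notebook does not fix either choice). `train_multilabel` and `StandInModel` are hypothetical names; the stand-in only mimics the `forward(edge_index)` signature of `GCN` so that the block runs without PyTorch Geometric.

```python
import torch


def train_multilabel(model, edge_index, y, train_mask, epochs=100, lr=0.01):
    """Train `model` on the masked nodes and return the loss per epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # BCEWithLogitsLoss applies a sigmoid per label, so every label
    # column is treated as an independent binary task.
    criterion = torch.nn.BCEWithLogitsLoss()
    targets = y.float()  # the loss expects float targets
    losses = []
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        logits = model(edge_index)  # shape [num_nodes, num_labels]
        loss = criterion(logits[train_mask], targets[train_mask])
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


class StandInModel(torch.nn.Module):
    """Hypothetical placeholder with the same call signature as GCN."""

    def __init__(self, num_nodes, embedding_dim, out_dim):
        super().__init__()
        self.emb = torch.nn.Embedding(num_nodes, embedding_dim)
        self.lin = torch.nn.Linear(embedding_dim, out_dim)

    def forward(self, edge_index):
        return self.lin(self.emb.weight)  # ignores edge_index on purpose


# Toy data: 20 nodes, 5 labels, no edges (assumed values for illustration).
torch.manual_seed(0)
toy_y = torch.randint(0, 2, (20, 5))
toy_mask = torch.zeros(20, dtype=torch.bool)
toy_mask[:16] = True
toy_edges = torch.empty((2, 0), dtype=torch.long)
toy_model = StandInModel(num_nodes=20, embedding_dim=8, out_dim=5)
losses = train_multilabel(toy_model, toy_edges, toy_y, toy_mask, epochs=100)
```

With the notebook's objects, the call would look like `train_multilabel(model_gcn, data.edge_index, data.y, data.train_mask)`.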

### ii. Test your model on the test set and evaluate it with accuracy, AUROC, precision, recall and F1 score.
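
One possible way to compute the requested metrics, sketched with scikit-learn under a few assumptions: predictions are thresholded at 0.5, precision/recall/F1 are micro-averaged, accuracy is counted per node/label pair, and AUROC is averaged over labels. `evaluate_multilabel` is a hypothetical helper, and the toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def evaluate_multilabel(y_true, y_prob, threshold=0.5):
    """y_true: [n, labels] in {0, 1}; y_prob: [n, labels] in [0, 1]."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)

    # AUROC is undefined for a label with only one class present,
    # so average over the labels where both classes occur.
    aucs = [
        roc_auc_score(y_true[:, j], y_prob[:, j])
        for j in range(y_true.shape[1])
        if len(np.unique(y_true[:, j])) == 2
    ]
    return {
        # element-wise accuracy over all node/label pairs
        "accuracy": accuracy_score(y_true.ravel(), y_pred.ravel()),
        "auroc": float(np.mean(aucs)) if aucs else float("nan"),
        "precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
    }


# Toy example with 4 nodes and 2 labels.
toy_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_prob = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6], [0.3, 0.4]])
metrics = evaluate_multilabel(toy_true, toy_prob)
```

For the model in this notebook, the probabilities would come from `torch.sigmoid` applied to the test-node logits, converted with `.detach().cpu().numpy()`. Skipping single-class labels matters on a small test split, where some of the 112 labels may have no positive or no negative node.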

## e) Set up a hyperparameter optimization pipeline with nested 5-fold cross-validation

### i. Familiarize yourself with the hyperparameter optimization package optuna (https://optuna.org/)

### ii. Integrate the logging package mlflow (https://mlflow.org/) to log your metrics.

### iii. Train and test your models and report the evaluation metrics with mean and std for the nested CV.
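
The nested structure can be sketched as follows: an outer 5-fold split yields the test scores, and an inner 5-fold split on each development set selects the hyperparameters. To keep the block short, a fixed candidate list stands in for the Optuna search, and `fit_and_score` is a hypothetical callable that would train the GCN and return a metric; the dummy version below depends only on the parameters.

```python
import numpy as np
from sklearn.model_selection import KFold


def nested_cv(num_nodes, candidates, fit_and_score, n_splits=5, seed=0):
    """Return mean and std of the outer-fold test scores.

    fit_and_score(params, train_idx, eval_idx) -> float is a hypothetical
    callable that trains on train_idx and scores on eval_idx.
    """
    outer = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    nodes = np.arange(num_nodes)
    outer_scores = []
    for dev_idx, test_idx in outer.split(nodes):
        inner = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

        def inner_score(params):
            # mean validation score of one candidate over the inner folds
            return np.mean([
                fit_and_score(params, dev_idx[tr], dev_idx[va])
                for tr, va in inner.split(dev_idx)
            ])

        best = max(candidates, key=inner_score)
        # refit on the full development set, score once on the held-out fold
        outer_scores.append(fit_and_score(best, dev_idx, test_idx))
    return float(np.mean(outer_scores)), float(np.std(outer_scores))


def dummy_fit_and_score(params, train_idx, eval_idx):
    # Placeholder score that depends only on the parameters.
    return 1.0 - abs(params["hidden_dim"] - 64) / 64


grid = [{"hidden_dim": 32}, {"hidden_dim": 64}, {"hidden_dim": 128}]
mean_score, std_score = nested_cv(942, grid, dummy_fit_and_score)
```

The test fold of each outer split is never seen during hyperparameter selection, which is what makes the reported mean and standard deviation an estimate of generalization rather than of tuning quality.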