# Task: Edge classification with pytorch geometric
### Chenguang Guan

In [1]:
import torch
import torch_geometric
from torch import Tensor

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing
from torch.nn import Sequential as Seq, Linear, ReLU, Sigmoid

import random
from torch.utils.data import Dataset
from torch_geometric.loader import DataLoader
from pathlib import Path

## A. Dataset Preparation
We can either directly load the dataset and pass it to a list, or define a custom Dataset class as the tutorial.

### I. The Naive Way
The advantage point is that we can easily seperate the training set and test set.

In [2]:
def naive_dataset(path: Path):
    total_graphs = list()
    for i in range(10):
        subdir = "batch_1_" + str(i) + "/"
        total_graphs += list(path.glob(subdir + "*.pt"))
    dataset = list()
    for j in range(len(total_graphs)):
        dataset.append(torch.load(total_graphs[j]))
    return dataset
dataset = naive_dataset(Path("./data_gsoc"))

In [3]:
random.shuffle(dataset)
train_loader = DataLoader(dataset[:int(9/10*len(dataset))], batch_size=32)
test_loader = DataLoader(dataset[int(9/10*len(dataset)):], batch_size=32)

### II. Custom Dataset

In [4]:
class MyDataset(Dataset):
    def __init__(self, path: Path):
        super().__init__()
        total_graphs = list()
        for i in range(10):
            subdir = "batch_1_" + str(i) + "/"
            total_graphs += list(path.glob(subdir + "*.pt"))
        
        #random.shuffle(total_graphs)
        self.graphs = total_graphs
    
    def __getitem__(self, idx):
        return torch.load(self.graphs[idx])
    
    def __len__(self) -> int:
        return len(self.graphs)
    
    def shuffle(self):
        random.shuffle(self.graphs)
        
#dataset = MyDataset(Path("./data_gsoc"))
#dataset.shuffle()

### III. Attributes of The Graph Data

In [4]:
print(dataset[0])
print("num of nodes:",dataset[0].num_nodes)
print("num of edges:",dataset[0].num_edges)
print("num of node features:",dataset[0].num_node_features)
print("num of edge features:",dataset[0].num_edge_features)
print("Whether it's a directed graph?",dataset[0].is_directed())

Data(x=[168, 6], edge_index=[2, 1256], edge_attr=[1256, 4], y=[1256])
num of nodes: 168
num of edges: 1256
num of node features: 6
num of edge features: 4
Whether it's a directed graph? True


This is a directed graph with 6-d node features and 4-d edge features, and edge index is written in the Coordinate format (COO):
$$ G = (X, R, I) $$
Where,
$$ X = R^{n_{nodes}\times 6} $$
$$ R = R^{n_{edges}\times 4} $$
$$ I = N^{2\times n_{edges}} $$
$I[0,i]$ is the source node of the i-th edge, and $I[1,i]$ is the target node of the i-th edge.
While we also have traning target $Y=\{0,1\}^{n_{edges}}$, which is tha ground truth of the edge classification.

## B. Models

The main idea of graph neural network is to update the node embedding vectors and edge embedding vectors based on locality and message passing.
The general form of the updating is:
$$x_i^{(t)}=\phi_{\text {node }}^{(t)}\left(x_i^{(t-1)}, \underset{j \in N(i)}{\square} \phi_{\text {message }}^{(t)}\left(x_i^{(t-1)}, x_j^{(t-1)}, a_{i j}^{(t-1)}\right)\right)$$


### Model-I: Interaction Networks
This model comes from "DeZoort, G., Thais, S., Duarte, J. et al. Charged Particle Tracking via Edge-Classifying Interaction Networks. Comput Softw Big Sci 5, 26 (2021). https://doi.org/10.1007/s41781-021-00073-z"

In this model:
1. We take the "aggregation" as "add": $\underset{j \in N(i)}{\square} = \sum_{j \in N(i)} $.
2. We assign $\phi_{\text {message }}^{(t)}\left(x_i^{(t-1)}, x_j^{(t-1)}, e_{i j}^{(t-1)}\right) $ to $a_{ij}^{(t)}$, which means that $a_{ij}^{(t)}= \phi_{\text {message }}^{(t)}\left(x_i^{(t-1)}, x_j^{(t-1)}, a_{i j}^{(t-1)}\right)$ and $x_i^{(t)}=\phi_{\text {node }}^{(t)}\left(x_i^{(t-1)},\underset{j \in N(i)}{\square}a_{i j}^{(t-1)}\right) $.

In the literature:
1. The authors take $\phi_{\text {message }} \rightarrow \phi_{R, 1}$ and $\phi_{\text {node }} \rightarrow \phi_O$ as MLP.
2. The authors take only one time step (only update the embedding vectors one time)
$$a_{i j}^{(1)}=\phi_{R, 1}\left(x_i^{(0)}, x_j^{(0)}, a_{i j}^{(0)}\right)$$
$$x_i^{(1)}=\phi_O\left(x_i^{(0)}, \sum_{j \in N(i)} a_{i j}^{(1)}\right)$$

3. We also need an extra layer to transform the embedding vectors to classification results (also called weights):
$$w_{i j}^{(1)}:=\phi_{R, 2}\left(x_i^{(1)}, x_j^{(1)}, a_{i j}^{(1)}\right)$$

4. $\phi_{R, 1}$ and $\phi_{R, 2}$ are called Relational Models, and $\phi_{O}$ is called Object Model.

In [5]:
class MLP_Layer(nn.Module):
    def __init__(self, input_size, output_size, hidden_size):
        super(MLP_Layer, self).__init__()

        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, m):
        return self.layers(m)

In [6]:
class InteractionNetwork(MessagePassing):
    def __init__(self, hidden_size):
        super(InteractionNetwork, self).__init__(aggr='add', 
                                                 flow='source_to_target')
        self.R1 = MLP_Layer(16, 4, hidden_size)
        self.O = MLP_Layer(10, 6, hidden_size)
        self.R2 = MLP_Layer(16, 1, hidden_size)
        self.E: Tensor = Tensor()

    def forward(self, x: Tensor, edge_index: Tensor, edge_attr: Tensor) -> Tensor:

        # propagate_type: (x: Tensor, edge_attr: Tensor)
        x_tilde = self.propagate(edge_index, x=x, edge_attr=edge_attr, size=None)
        m2 = torch.cat([x_tilde[edge_index[1]],
                        x_tilde[edge_index[0]],
                        self.E], dim=1)
        return torch.sigmoid(self.R2(m2))

    def message(self, x_i, x_j, edge_attr):
        # x_i --> incoming
        # x_j --> outgoing        
        m1 = torch.cat([x_i, x_j, edge_attr], dim=1)
        self.E = self.R1(m1)
        return self.E

    def update(self, aggr_out, x):
        c = torch.cat([x, aggr_out], dim=1)
        return self.O(c) 

### Multi-Layer Interaction Network
We can also stack more layers of InteractionNetwork before moving to the edge weighting stage ($w_{i j}^{(1)}=\phi_{R, 2}\left(x_i^{(1)}, x_j^{(1)}, a_{i j}^{(1)}\right) $):

In [24]:
class InteractionNetwork_wo_weight(MessagePassing):
    def __init__(self, hidden_size):
        super(InteractionNetwork_wo_weight, self).__init__(aggr='add', 
                                                 flow='source_to_target')
        self.R1 = MLP_Layer(16, 4, hidden_size)
        self.O = MLP_Layer(10, 6, hidden_size)
        #self.R2 = MLP_Layer(16, 1, hidden_size)
        self.E: Tensor = Tensor()

    def forward(self, x: Tensor, edge_index: Tensor, edge_attr: Tensor) -> Tensor:

        # propagate_type: (x: Tensor, edge_attr: Tensor)
        x_tilde = self.propagate(edge_index, x=x, edge_attr=edge_attr, size=None)
        return x_tilde, self.E

    def message(self, x_i, x_j, edge_attr):
        # x_i --> incoming
        # x_j --> outgoing        
        m1 = torch.cat([x_i, x_j, edge_attr], dim=1)
        self.E = self.R1(m1)
        return self.E

    def update(self, aggr_out, x):
        c = torch.cat([x, aggr_out], dim=1)
        return self.O(c) 

In [29]:
class Multi_IN(nn.Module):
    def __init__(self, hidden_size):
        super(Multi_IN, self).__init__()
        
        #self.IN_layer = nn.Sequential(InteractionNetwork_wo_weight(hidden_size), InteractionNetwork_wo_weight(hidden_size))
        self.R2 = MLP_Layer(16, 1, hidden_size)

    def forward(self, x: Tensor, edge_index: Tensor, edge_attr: Tensor, num_layers = 1) -> Tensor:
        for i in range(num_layers):
            x, edge_attr = InteractionNetwork_wo_weight(hidden_size)(x=x,edge_index=edge_index,edge_attr=edge_attr)
        m2 = torch.cat([x[edge_index[1]], x[edge_index[0]], edge_attr], dim=1)
        return torch.sigmoid(self.R2(m2))

### DIY Models
A lot of models do not consider edge features in their updating schemes, such as famous GCN (https://arxiv.org/abs/1609.02907) and Edge Convolution (https://arxiv.org/abs/1801.07829). Therefore, one naive idea is to directly concatenate the edge embedding vectors with node embedding vectors when doing message passing.

#### The Variant of Edge Convolution (DIY)
The orginal Edge Convolution is defined as:
$$x_i^{(k)}=\max _{j \in N(i)} h_{\Theta}\left(x_i^{(k-1)}, x_j^{(k-1)}-x_i^{(k-1)}\right)$$

Here, $h_{\Theta}$ is also a MLP. The original Edge Convolution is suitable for the network without edge features.

Therefore, we can modify the Edge Convolution and include node features into the message passing.
$$a_{ij}^{(k)}= h_{\Theta}\left(x_i^{(k-1)}, x_j^{(k-1)}-x_i^{(k-1)},a_{ij}^{(k-1)}\right)$$
$$x_i^{(1)}=\phi_O\left(x_i^{(k)}, \max _{j \in N(i)} a_{i j}^{(k-1)}\right)$$

We also have two choices for the weights calculation:
1. Same as Interacting Network:$$w_{i j}^{(k)}:=\phi_{R, 2}\left(x_i^{(k)}, x_j^{(k)}, a_{i j}^{(k)}\right)$$
2. Or Similar with Edge Convolution Layer: $$w_{i j}^{(k)}:=\phi_{R, 2}\left(x_i^{(k)}, x_j^{(k)} - x_i^{(k)}, a_{i j}^{(k)}\right)$$

Interacting network in the literature and our DIY Edge Convolution can be both included in the general framework of graph neural network. 
$$x_i^{(t)}=\phi_{\text {node }}^{(t)}\left(x_i^{(t-1)}, \underset{j \in N(i)}{\square} \phi_{\text {message }}^{(t)}\left(x_i^{(t-1)}, x_j^{(t-1)}, a_{i j}^{(t-1)}\right)\right)$$

The differences between them are:
1. The form of $\phi_{\text {node }}$ is different.
2. $\underset{j \in N(i)}{\square}$ in Interacting Network is sum, while $\underset{j \in N(i)}{\square}$ in DIY edge convolution is max.

#### Type-I
$$w_{i j}^{(k)}:=\phi_{R, 2}\left(x_i^{(k)}, x_j^{(k)}, a_{i j}^{(k)}\right)$$

In [46]:
class EdgeConv_type1(MessagePassing):
    def __init__(self,hidden_size):
        super().__init__(aggr='max') #  "Max" aggregation.
        self.mlp = Seq(Linear(16, 4),
                       ReLU(),
                       Linear(4, 4))
        self.mlp_2 = Seq(Linear(10, 6),
                       ReLU(),
                       Linear(6, 6))
        self.R2 = nn.Sequential(
            nn.Linear(16, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        self.E: Tensor = Tensor()

    def forward(self, x, edge_index, edge_attr):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]

        x_tilde = self.propagate(edge_index, x=x, edge_attr=edge_attr, size=None)
        m2 = torch.cat([x_tilde[edge_index[1]],
                        x_tilde[edge_index[0]]-x_tilde[edge_index[1]],
                        self.E], dim=1)
        return torch.sigmoid(self.R2(m2))

    def message(self, x_i, x_j, edge_attr):
        # x_i has shape [E, in_channels]
        # x_j has shape [E, in_channels]

        tmp = torch.cat([x_i, x_j, edge_attr], dim=1)  # tmp has shape [E, in_channels]
        self.E =self.mlp(tmp)
        
        return self.E
    
    def update(self, aggr_out, x):
        c = torch.cat([x, aggr_out], dim=1)
        return self.mlp_2(c) 

#### Type-II
$$w_{i j}^{(k)}:=\phi_{R, 2}\left(x_i^{(k)}, x_j^{(k)} - x_i^{(k)}, a_{i j}^{(k)}\right)$$

In [45]:
class EdgeConv_type2(MessagePassing):
    def __init__(self, hidden_size):
        super().__init__(aggr='max') #  "Max" aggregation.
        self.mlp = Seq(Linear(16, 4),
                       ReLU(),
                       Linear(4, 4))
        self.mlp_2 = Seq(Linear(10, 6),
                       ReLU(),
                       Linear(6, 6))
        self.R2 = nn.Sequential(
            nn.Linear(16, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        self.E: Tensor = Tensor()

    def forward(self, x, edge_index, edge_attr):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]

        x_tilde = self.propagate(edge_index, x=x, edge_attr=edge_attr, size=None)
        m2 = torch.cat([x_tilde[edge_index[1]],
                        x_tilde[edge_index[0]]-x_tilde[edge_index[1]],
                        self.E], dim=1)
        return torch.sigmoid(self.R2(m2))

    def message(self, x_i, x_j, edge_attr):
        # x_i has shape [E, in_channels]
        # x_j has shape [E, in_channels]

        tmp = torch.cat([x_i, x_j - x_i, edge_attr], dim=1)  # tmp has shape [E, in_channels]
        self.E =self.mlp(tmp)
        
        return self.E
    
    def update(self, aggr_out, x):
        c = torch.cat([x, aggr_out], dim=1)
        return self.mlp_2(c) 

## C. Training and Results

Before moving to explicit models, we can define the general train loop function and test loop function. Here, we take the binary cross entropy as loss function.

In [None]:
loss_fn = F.binary_cross_entropy

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, X in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X.x,X.edge_index,X.edge_attr)
        #print(pred.shape)
        #print(X.y.shape)
        loss = loss_fn(pred.squeeze(), X.y, reduction='mean')

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X in dataloader:
            pred = model(X.x,X.edge_index,X.edge_attr)
            test_loss += loss_fn(pred.squeeze(), X.y, reduction='mean').item()
            correct += (pred.argmax(1) == X.y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

### 1. Training and Evaluation of single layer IN:

In [11]:
learning_rate = 1e-3
hidden_size = 16
device = "cpu"
model_1 = InteractionNetwork(hidden_size).to(device)

In [51]:
optimizer_1 = torch.optim.SGD(model_1.parameters(), lr=learning_rate)
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_loader, model_1, loss_fn, optimizer_1)
    test_loop(test_loader, model_1, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 0.351929  [   32/ 8096]
loss: 0.375481  [ 3232/ 8096]
loss: 0.388098  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.375170 

Epoch 2
-------------------------------
loss: 0.349160  [   32/ 8096]
loss: 0.373014  [ 3232/ 8096]
loss: 0.385588  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.372890 

Epoch 3
-------------------------------
loss: 0.346620  [   32/ 8096]
loss: 0.370670  [ 3232/ 8096]
loss: 0.383213  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.370744 

Epoch 4
-------------------------------
loss: 0.344289  [   32/ 8096]
loss: 0.368454  [ 3232/ 8096]
loss: 0.380947  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.368706 

Epoch 5
-------------------------------
loss: 0.342135  [   32/ 8096]
loss: 0.366342  [ 3232/ 8096]
loss: 0.378783  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.366780 

Done!


### 2. Training and Evaluation of multi layer IN:

we need to modify our train loop function and test loop function (introducing num of layers) for multi layer IN.

In [26]:
def train_loop_2(dataloader, model, loss_fn, optimizer, num_layers):
    size = len(dataloader.dataset)
    for batch, X in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X.x,X.edge_index,X.edge_attr,num_layers)
        #print(pred.shape)
        #print(X.y.shape)
        loss = loss_fn(pred.squeeze(), X.y, reduction='mean')

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test_loop_2(dataloader, model, loss_fn, num_layers):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X in dataloader:
            pred = model(X.x,X.edge_index,X.edge_attr, num_layers)
            test_loss += loss_fn(pred.squeeze(), X.y, reduction='mean').item()
            correct += (pred.argmax(1) == X.y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

In [27]:
learning_rate = 1e-3
hidden_size = 16
device = "cpu"
num_layers = 2

In [30]:
model_2 = Multi_IN(hidden_size).to(device)
optimizer_2 = torch.optim.SGD(model_2.parameters(), lr=learning_rate)
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop_2(train_loader, model_2, loss_fn, optimizer_2, num_layers)
    test_loop_2(test_loader, model_2, loss_fn, num_layers)
print("Done!")

Epoch 1
-------------------------------
loss: 0.684882  [   32/ 8096]
loss: 0.674749  [ 3232/ 8096]
loss: 0.662488  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.655081 

Epoch 2
-------------------------------
loss: 0.653246  [   32/ 8096]
loss: 0.663603  [ 3232/ 8096]
loss: 0.642421  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.628446 

Epoch 3
-------------------------------
loss: 0.636243  [   32/ 8096]
loss: 0.623932  [ 3232/ 8096]
loss: 0.613959  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.605644 

Epoch 4
-------------------------------
loss: 0.606629  [   32/ 8096]
loss: 0.595179  [ 3232/ 8096]
loss: 0.599072  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.586985 

Epoch 5
-------------------------------
loss: 0.578913  [   32/ 8096]
loss: 0.576923  [ 3232/ 8096]
loss: 0.576522  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.567949 

Done!


### 3. Training and Evaluation of DIY Edge Convolution (Type-I):

In [47]:
learning_rate = 1e-3
hidden_size = 16
device = "cpu"
model_3 = EdgeConv_type1(hidden_size).to(device)

In [48]:
optimizer_3 = torch.optim.SGD(model_3.parameters(), lr=learning_rate)
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_loader, model_3, loss_fn, optimizer_3)
    test_loop(test_loader, model_3, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 0.739243  [   32/ 8096]
loss: 0.697415  [ 3232/ 8096]
loss: 0.674982  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.665960 

Epoch 2
-------------------------------
loss: 0.666597  [   32/ 8096]
loss: 0.641850  [ 3232/ 8096]
loss: 0.626733  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.618573 

Epoch 3
-------------------------------
loss: 0.614422  [   32/ 8096]
loss: 0.599718  [ 3232/ 8096]
loss: 0.589220  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.581081 

Epoch 4
-------------------------------
loss: 0.572656  [   32/ 8096]
loss: 0.565215  [ 3232/ 8096]
loss: 0.558832  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.550354 

Epoch 5
-------------------------------
loss: 0.537920  [   32/ 8096]
loss: 0.536471  [ 3232/ 8096]
loss: 0.534633  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.525688 

Done!


### 4. Training and Evaluation of DIY Edge Convolution (Type-II):

In [49]:
learning_rate = 1e-3
hidden_size = 16
device = "cpu"
model_4 = EdgeConv_type2(hidden_size).to(device)

In [50]:
optimizer_4 = torch.optim.SGD(model_4.parameters(), lr=learning_rate)
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_loader, model_4, loss_fn, optimizer_4)
    test_loop(test_loader, model_4, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 0.645674  [   32/ 8096]
loss: 0.633662  [ 3232/ 8096]
loss: 0.628654  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.621502 

Epoch 2
-------------------------------
loss: 0.616622  [   32/ 8096]
loss: 0.608760  [ 3232/ 8096]
loss: 0.606546  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.599874 

Epoch 3
-------------------------------
loss: 0.592948  [   32/ 8096]
loss: 0.588337  [ 3232/ 8096]
loss: 0.588244  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.581717 

Epoch 4
-------------------------------
loss: 0.572993  [   32/ 8096]
loss: 0.570952  [ 3232/ 8096]
loss: 0.572633  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.566107 

Epoch 5
-------------------------------
loss: 0.555742  [   32/ 8096]
loss: 0.555816  [ 3232/ 8096]
loss: 0.559052  [ 6432/ 8096]
Test Error: 
 Accuracy: 314565.8%, Avg loss: 0.552473 

Done!


## D. Further Directions
Due to the time limitation, I can not implement every ideas in the task. However, in the near future there are several directions I would like to contribute to:
### Network Structure
1. #### Next Nearest Neighbour:
The first way to improve the network I would like to try is to include next nearest neighbor for the message passing. We can extend the $\underset{j \in N(i)}{\square}$ to $\underset{j \in N(i)\cup N.N.(i)}{\square}$, where $N.N.(i)$ denotes next nearest neighbot of node-$i$.

2. #### Graph Attention Network 
We can also try some framework beyond Graph Convolutional Network Scheme. For example, one promising direction is using Graph Attention Network（GAT) in the task.

### Training Results (Bug!)
The accuracy on test dataset of four models are all apprxomately equal to $31\%$, which is very weird. There might be some bugs in the training or evaluation part of my code.

### Evaluation
I would like to utilize ROC curve (Receiver Operating Characteristic curve) and AUC score (Area Under the ROC Curve) in the further evaluation.

### Class Imbalance Problem

In [54]:
num_0 = 0
num_1 = 0
for i in dataset[0].y:
    if i == 0:
        num_0 += 1
    elif i == 1:
        num_1 += 1
    else:
        print("error")
print("Num of class-0",num_0)
print("Num of class-1",num_1)

Num of class-0 882
Num of class-1 374


There is a significant class imbalance in our data. 

In the normal case, we can use under-sampling or over-sampling to fix the imbalanced class. However, in the graph problem this is tricky, because we can not arbitarily delete or add edges in a single graph.

Therefore, we might need to find a way to enlarge or reduce each graph (introducing extra nodes and edges with $y=1$ or properly deleting existing nodes and edges with $y=0$).