# Node classification with the Cora dataset

In this notebook, you will tackle node classification on the Cora dataset, a problem similar to the Karate Club one.

## Load Dataset

The Cora dataset consists of machine learning papers. Each paper is classified into one of the following seven classes:
- Case_Based
- Genetic_Algorithms
- Neural_Networks
- Probabilistic_Methods
- Reinforcement_Learning
- Rule_Learning
- Theory

Each node represents a paper, and each edge represents a citation between two papers. The nodes are split into training, validation, and test sets. Cora is also a common benchmark in graph neural network papers.

In [3]:
from dgl.data import CoraDataset
import torch as th
dataset = CoraDataset()
g = dataset[0]
print(g)

DGLGraph(num_nodes=2708, num_edges=10556,
         ndata_schemes={'train_mask': Scheme(shape=(), dtype=torch.float64), 'val_mask': Scheme(shape=(), dtype=torch.float64), 'test_mask': Scheme(shape=(), dtype=torch.float64), 'label': Scheme(shape=(), dtype=torch.int64), 'feat': Scheme(shape=(1433,), dtype=torch.float32)}
         edata_schemes={})


## Extract masks and print dataset statistics

In [4]:
num_labels = g.ndata['label'].max().item() + 1  # label indices start at 0
feature_dim = g.ndata['feat'].shape[1]
train_mask = g.ndata['train_mask'].to(th.bool)
val_mask = g.ndata['val_mask'].to(th.bool)
test_mask = g.ndata['test_mask'].to(th.bool)
print("Node feature dimension: {}".format(feature_dim))
print("Number of labels: {}".format(num_labels))
print("Number of nodes for training: {}".format(train_mask.long().sum()))
print("Number of nodes for validation: {}".format(val_mask.long().sum()))
print("Number of nodes for testing: {}".format(test_mask.long().sum()))

Node feature dimension: 1433
Number of labels: 7
Number of nodes for training: 140
Number of nodes for validation: 300
Number of nodes for testing: 1000
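
To make the role of the masks concrete, here is a minimal sketch on a toy six-node example. The tensors below are made-up stand-ins for `g.ndata['train_mask']`, `g.ndata['val_mask']`, `g.ndata['test_mask']` and `g.ndata['label']`; only the indexing pattern is meant to carry over.

```python
import torch as th

# Toy stand-ins (assumed values) for the 0/1 float masks stored in g.ndata.
train_mask = th.tensor([1, 1, 0, 0, 0, 0], dtype=th.float64).to(th.bool)
val_mask = th.tensor([0, 0, 1, 0, 0, 0], dtype=th.float64).to(th.bool)
test_mask = th.tensor([0, 0, 0, 1, 1, 0], dtype=th.float64).to(th.bool)
labels = th.tensor([0, 2, 1, 1, 0, 2])

# A boolean mask selects the rows whose entry is True.
train_labels = labels[train_mask]
num_train = int(train_mask.sum())

# The three splits should not share any node.
overlap = (train_mask & val_mask) | (train_mask & test_mask) | (val_mask & test_mask)
splits_disjoint = not bool(overlap.any())

# Nodes that belong to no split at all.
num_unused = int((~(train_mask | val_mask | test_mask)).sum())

print(train_labels, num_train, splits_disjoint, num_unused)
```

The masks do not have to cover every node. In the statistics above, 140 + 300 + 1000 = 1440 of the 2708 nodes carry a split assignment; the remaining nodes still take part in message passing but are never used in the loss or in the reported accuracies.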


## Setup Model and Train

In this challenge, modify the code below to improve performance on the test set. You can change the model structure, use other DGL [nn modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv), tune the optimizer's hyperparameters, add early stopping, and so on.  
**However, remember to use only training data in the training loop below.**
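
One way to add early stopping is sketched below. `EarlyStopping`, its `patience` parameter, and the `val_history` numbers are all hypothetical and not part of DGL or PyTorch; the sketch assumes you stop once validation accuracy has failed to improve for a fixed number of epochs. Validation accuracy only decides when to stop and is never part of the loss.

```python
class EarlyStopping:
    """Hypothetical helper: signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch and return True if training should stop."""
        if val_acc > self.best_acc:
            # Strict improvement: remember it and reset the counter.
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies, standing in for the val_acc of each epoch.
val_history = [0.40, 0.45, 0.47, 0.46, 0.47, 0.44, 0.43]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_acc in enumerate(val_history):
    if stopper.step(epoch, val_acc):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_acc)
```

In a real training loop you would call `stopper.step(epoch, val_acc.item())` after each epoch and, if desired, keep a copy of the model weights from the best epoch.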

### Define model

In [5]:
# Define a 2-layer GCN model with DGL nn modules
from dgl.nn.pytorch import conv as dgl_conv
import torch.nn.functional as F
import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, in_feats, hidden_size, num_classes):
        super(GCN, self).__init__()
        # GraphConv does more than plain neighbor aggregation: it also normalizes
        # the aggregated features for each node.
        # Details can be found in the original paper: https://arxiv.org/abs/1609.02907
        self.gcn1 = dgl_conv.GraphConv(in_feats, hidden_size, activation=F.relu) 
        self.gcn2 = dgl_conv.GraphConv(hidden_size, num_classes)

    def forward(self, g, inputs):
        h = self.gcn1(g, inputs)
        h = self.gcn2(g, h)
        return h
    
net = GCN(feature_dim, 64, num_labels)  # feature_dim is 1433 for Cora
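
The normalization mentioned in the model's comments can be illustrated on a tiny graph. The sketch below follows the propagation rule from the linked paper, `H = relu(A_hat @ X @ W)` with `A_hat = D^(-1/2) (A + I) D^(-1/2)`, using dense tensors. It is not DGL's internal implementation, and the graph, features, and weights are toy assumptions.

```python
import torch as th

# Adjacency matrix of a 3-node path graph: 0 - 1 - 2 (toy example).
A = th.tensor([[0., 1., 0.],
               [1., 0., 1.],
               [0., 1., 0.]])

A_tilde = A + th.eye(3)        # add self-loops, as in the paper
deg = A_tilde.sum(dim=1)       # degrees including self-loops: [2, 3, 2]
d_inv_sqrt = deg.pow(-0.5)

# Symmetric normalization: entry (i, j) is scaled by 1 / sqrt(deg_i * deg_j).
A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

X = th.eye(3)                  # one-hot node features (assumed)
W = th.ones(3, 2)              # toy weight matrix (assumed)
H = th.relu(A_hat @ X @ W)     # one GCN layer on dense tensors

print(A_hat)
print(H)
```

The scaling keeps high-degree nodes from dominating the sum over neighbors. Whether self-loops are present in practice depends on how the graph was prepared, so treat the `A + I` step as part of the paper's formula rather than a statement about `GraphConv`.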

### Training Loop

In [6]:
net.train()
optimizer = th.optim.Adam(net.parameters(), lr=0.01)
loss_fcn = th.nn.CrossEntropyLoss()
num_epochs = 30
for epoch in range(num_epochs):
    
    logits = net(g, g.ndata['feat'])
    loss = loss_fcn(logits[train_mask], g.ndata['label'][train_mask])
    train_acc = (logits.argmax(1)==g.ndata['label'])[train_mask].float().mean()
    val_acc = (logits.argmax(1)==g.ndata['label'])[val_mask].float().mean()
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch {:<4} |  Loss: {:.4f}  |  Train acc: {:.4f}  |  Val acc: {:.4f}'.format(epoch, loss.item(), train_acc, val_acc))

Epoch 0    |  Loss: 1.9460  |  Train acc: 0.1714  |  Val acc: 0.1167
Epoch 1    |  Loss: 1.9197  |  Train acc: 0.4071  |  Val acc: 0.3700
Epoch 2    |  Loss: 1.8883  |  Train acc: 0.4286  |  Val acc: 0.3800
Epoch 3    |  Loss: 1.8510  |  Train acc: 0.4357  |  Val acc: 0.3967
Epoch 4    |  Loss: 1.8118  |  Train acc: 0.4357  |  Val acc: 0.4033
Epoch 5    |  Loss: 1.7718  |  Train acc: 0.4357  |  Val acc: 0.4033
Epoch 6    |  Loss: 1.7322  |  Train acc: 0.4357  |  Val acc: 0.4033
Epoch 7    |  Loss: 1.6940  |  Train acc: 0.4357  |  Val acc: 0.4033
Epoch 8    |  Loss: 1.6575  |  Train acc: 0.4357  |  Val acc: 0.4067
Epoch 9    |  Loss: 1.6220  |  Train acc: 0.4500  |  Val acc: 0.4267
Epoch 10   |  Loss: 1.5864  |  Train acc: 0.4571  |  Val acc: 0.4300
Epoch 11   |  Loss: 1.5494  |  Train acc: 0.4714  |  Val acc: 0.4500
Epoch 12   |  Loss: 1.5102  |  Train acc: 0.5000  |  Val acc: 0.4667
Epoch 13   |  Loss: 1.4682  |  Train acc: 0.5071  |  Val acc: 0.4667
Epoch 14   |  Loss: 1.4234  |  Tra

## Evaluate result on the test set

In [7]:
net.eval()
with th.no_grad():  # gradients are not needed for evaluation
    logits = net(g, g.ndata['feat'])
test_acc = (logits.argmax(1)==g.ndata['label'])[test_mask].float().mean()
print('Test accuracy: {:.4f}'.format(test_acc))

Test accuracy: 0.6880
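
The accuracy expression is repeated for the training, validation, and test sets, so it may be convenient to factor it out. `masked_accuracy` below is a hypothetical helper name, and the logits, labels, and mask are toy values chosen so the result can be checked by hand.

```python
import torch as th


def masked_accuracy(logits, labels, mask):
    """Fraction of correct argmax predictions among the nodes selected by `mask`."""
    preds = logits.argmax(dim=1)
    return (preds[mask] == labels[mask]).float().mean().item()


# Toy values: 4 nodes, 2 classes. Predicted classes are [0, 1, 0, 1].
logits = th.tensor([[2.0, 0.1],
                    [0.2, 1.5],
                    [0.9, 0.3],
                    [0.1, 0.4]])
labels = th.tensor([0, 1, 1, 1])
test_mask = th.tensor([False, True, True, True])

with th.no_grad():
    acc = masked_accuracy(logits, labels, test_mask)

# Nodes 1 and 3 are correct, node 2 is wrong, so the accuracy is 2/3.
print('Test accuracy: {:.4f}'.format(acc))
```

With the notebook's variables, the same call would look like `masked_accuracy(logits, g.ndata['label'], test_mask)`.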
