# Node Classification

In this notebook, we will be looking at node classification problems. In this setting, a graph is provided with some labeled nodes and some unlabeled nodes, and the task is to train a model that predicts the labels of the unlabeled nodes. Examples of this include predicting the research categories of publications in a citation network, or predicting interests of a user in a social network.






## Transductive Node Classification

In a transductive node classification setting:


*   The features of all nodes in one graph are known
*   Some of the node labels are known
*   Goal: Predict unknown node labels in the same graph

<center>
<img src="https://ai.science/api/authorized-images/ZHAan1V7Hz7YV0%2BRVLeqR9qa1Eaam83R2A8TDhqP7ugb2T6SQZBujucyoNn8Cxr%2FJCIo%2BvdirgOq%2FFP%2Fp47GdfZaYXQe5bXZTfw0vsIsjGB2ZpJOfKYlNOq8bePoVFr4X4DN0bgMEoXB19hlk7KMFEljiu8PatOb3MKQjKYWBtrFcJCqFjOaWrGQ260G%2F9PnUTqF74r0BXT4mc6C50UizvrBhyJkoQbNdA6CnG62rB6TN0JTQjhE3%2Fo%2BLu2Te6Gpc5zUmPG5KEuN0aSNtlwml82uZAtV0srMU6YdsMlh5eJYRXPgQD0PHO3PpFcsKIsyAi%2BPkDLO4gx%2FMFedDaNdSkfRL01dycLpNHvg7YrtWi0TCMxwQF2D4i4ua3beHP0E7sfogTti%2BwaBkeE8xKZBNTxJd78ZQFIEmzEEbR%2FI%2FS8OxUDMTnI5lGL872nfT47KaZcW4BgQMpEmqluNdWtQEQifwUedi4XjEWBqewIN9rcaVOLyRbIDAtphrxS%2Be8AQt4tnkbnCxDQxsiOfb%2BIyVm4JzeMSjxO5jVBRgiAcKOx4UCoNbMtmVEjglJQS%2BrKODOIZOXJ%2BsVjJwOtnTvr%2FzDsbqZojvECAW98ITDKbuewrKBCr7cKy82mE8C8NARgwr5uSRzR8Euy3xEu7gJM%2BfhdXB5oRlzGGoBpCo3bm2Oo%3D" width="60%" > </center>

## Inductive Node Classification

In an inductive node classification setting:

*   The features of nodes in one or more graphs are known
*   Some or all of the node labels in those graphs are known
*   Goal: Predict unknown node labels in unseen graphs


<center>
<img src="https://ai.science/api/authorized-images/cUEc1VHs2beXboMFOl3uQ8MyReaWjSTJbt7Xdc4evnlGiyIjfFbtgKX79az37YEGEOnSA%2BZiYsH%2Bexjqp4TheDH1cVsMZ4Y8pHwk%2FFt1KBox2AJ4syhJLq0DaY627JAMfo2T06rowYtPGNJjlvp%2BpznhQ39QCfuxNZGjCf2pO0j2zugxHm4YviF9Knods2J6MRiQIFg7wYEg5jWfstY0Q5AfQEu9BBNRke3ngG769tpixVkXRhU7hr37m38tFZhMqzZriRE7tyJ64HR%2F5ewJihu%2FRuh5Wz6uYEE%2FdK8bBX6%2FyLrk79oxA1roOURtVQO%2Bng04fIXu3GD%2FFfPDA%2BrQlryLRUSsvgaeeGJI95alp51MWoSRbivoFEqA4aCEZhuKYIA9GI1jqXScwgr3UvtEH6fDSNbghcwFggMFZNZ8l%2F2tixwWAVC6ibdfw3TcDAZIcByv%2BBZ2HCy%2BbEMaXboMLB7h5rlvimJhOXUYYwitz8S8IO3PpHvIHMYT6Yo6X319L1SYzHtcCJa5ZKkO4fjuGBtcsOJUCZQRp4c0kyTkhcMmdkFnnrUet5eEzBr6PoeSf4WgRaZNiEvbPf0fJ%2FRcfhAphn4TqIohrZcMlRDbC9hl4z3EK8h0DxYTCL1sBqIM8tNO1MDPGh4UtipdFIT9OjjuFJFvMvdcQcymOHJ03eM%3D" width="60%" > </center>
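The difference between the two settings can be sketched with toy data. This is an illustration only: the graphs, the labels and the helper `unlabeled_nodes` are made up for this example and are not part of PyG. `None` marks a node whose label is unknown.

```python
# Toy illustration only: a graph is a dict with an edge list and per-node labels.
# None marks a node whose label is unknown and has to be predicted.

def unlabeled_nodes(graph):
    """Return the ids of the nodes whose label is unknown."""
    return [node for node, label in enumerate(graph["labels"]) if label is None]

# Transductive: one graph; training and prediction happen on the same graph.
graph = {
    "edges": [(0, 1), (1, 2), (2, 3)],
    "labels": ["A", None, "B", None],
}
transductive_targets = unlabeled_nodes(graph)  # nodes of the SAME graph

# Inductive: train on labeled graph(s), then predict on a graph that was
# never seen during training.
train_graphs = [
    {"edges": [(0, 1), (1, 2)], "labels": ["A", "A", "B"]},
]
unseen_graph = {"edges": [(0, 1)], "labels": [None, None]}
inductive_targets = unlabeled_nodes(unseen_graph)  # nodes of a NEW graph

print("transductive targets:", transductive_targets)
print("inductive targets:", inductive_targets)
```

In the transductive case the model may use the features and edges of the unlabeled nodes during training, since they belong to the same graph; in the inductive case the unseen graph is not available at training time.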



# Installation


In [1]:
!pip uninstall -y torch torchvision torchtext fastai
!pip install torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

!pip install torch-scatter==latest+cu101 torch-sparse==latest+cu101 -f https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0.html
!pip install torch-spline-conv==latest+cu101 torch-scatter==latest+cu101 -f https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0.html
!pip install torch-sparse==latest+cu101 -f https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0.html

!pip install torch-geometric==1.6.1

Uninstalling torch-1.6.0+cu101:
  Successfully uninstalled torch-1.6.0+cu101
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Using cached https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl
Installing collected packages: torch
Successfully installed torch-1.6.0+cu101
Looking in links: https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0.html
Collecting torch-scatter==latest+cu101
  Using cached https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0/torch_scatter-latest%2Bcu101-cp37-cp37m-linux_x86_64.whl
Collecting torch-sparse==latest+cu101
  Using cached https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0/torch_sparse-latest%2Bcu101-cp37-cp37m-linux_x86_64.whl
Installing collected packages: torch-scatter, torch-sparse
  Found existing installation: torch-scatter 2.0.5
    Uninstalling torch-scatter-2.0.5:
      Successfully uninst

# Loading Datasets
For our datasets, we will be using three citation networks: Pubmed, Cora and Citeseer. Nodes correspond to publications and edges correspond to citations. The citation networks are available through the `Planetoid` dataset class of PyTorch Geometric (PyG).

In [2]:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T

# Load the Cora, CiteSeer and Pubmed citation networks
# Note: T.NormalizeFeatures() creates a transform that row-normalizes the node
# features so that each node's feature vector sums to one
dataset_cora = Planetoid(root="./tmp", name="Cora", transform=T.NormalizeFeatures())
dataset_citeseer = Planetoid(root="./tmp", name="CiteSeer", transform=T.NormalizeFeatures())
dataset_pubmed = Planetoid(root="./tmp", name="Pubmed",transform=T.NormalizeFeatures())

data_cora = dataset_cora[0]
data_citeseer = dataset_citeseer[0]
data_pubmed = dataset_pubmed[0]

print("Citation network information")
print("Cora: ", data_cora)
print("Citeseer: ", data_citeseer)
print("Pubmed: ", data_pubmed)

Citation network information
Cora:  Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Citeseer:  Data(edge_index=[2, 9104], test_mask=[3327], train_mask=[3327], val_mask=[3327], x=[3327, 3703], y=[3327])
Pubmed:  Data(edge_index=[2, 88648], test_mask=[19717], train_mask=[19717], val_mask=[19717], x=[19717, 500], y=[19717])
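The `edge_index=[2, ...]` entries above are shapes: PyG stores the connectivity in coordinate (COO) format, with one column per directed edge, and an undirected edge is represented by two columns, one per direction. The following is a minimal sketch of that layout using plain Python lists instead of tensors; the helper `to_edge_index` is a hypothetical name used only for illustration.

```python
def to_edge_index(undirected_edges):
    """Turn undirected (u, v) pairs into a 2 x E COO-style index.

    Row 0 holds the source nodes and row 1 the target nodes. Each undirected
    edge contributes two columns, one per direction.
    """
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

# A path graph 0 - 1 - 2 has two undirected edges, hence four columns.
edge_index = to_edge_index([(0, 1), (1, 2)])
print(edge_index)                           # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(len(edge_index), len(edge_index[0]))  # 2 4
```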


# Generalizing Convolutional Operators to Graphs

## Convolutional Operator on a 2-d grid

Let's consider an image represented by numbers on a 2-d grid and review what happens in a convolutional layer typically used in convolutional neural networks (CNNs). A learnable filter (typically of size $3 \times 3$) slides across the grid. At each position, the filter values are multiplied element-wise by the image values under the filter, and the products are summed to give one output value. The output of this operation is again a 2-d grid of numbers, which may or may not have the same size as the input depending on whether padding is used. (For more details on CNNs, see this [page](https://cs231n.github.io/convolutional-networks/).)


<center>
<img src="https://ai.science/api/authorized-images/ZbbQlWsB8df1DpFwUiY9AS7TapF%2BH9nv%2FX8vCzV9inQ7KYk4uS2WDm%2FAGXepIuwVXbrnn%2Bxyd8wcGv1KU0EYmqB6noUpulGGRJ2ycVSITeuh8gyGXq4huY7vyETRq3lGYaClBN6etDJuu5rFAe89nBALS8JNeAXKE2VTac8iHpwMnk%2FRtWnoReqL67ZK1TFg9fsWjwgs8t7XMD%2F5rzxFCJh4rqXoI6F4PuANumyYiAU8OiLW8LTjxHREsF4MV28zJCWHxVQLa%2BJzyIoHTQrkkCL0yExVz4sX%2Fxx9ZnbS7Vr14TLb2PqkyKPg%2BwHQ9wNjkTP74KsX3A1eSlEKBum0xa3xb5XdF2QHQDl9vMNDNxzCO2WbVW%2B2l3Vo%2BsZbZ3qu4Lw6QtgBdbRY%2B2%2FAJyo79W72M9OsGB70ZJJnW6%2FZEcYmtKuE37%2FeNqKKKOP0XcP6zqFobi69E9fTgUIB%2Bvh0TO%2F%2Bl%2BFoxbPvP3LJfeylPi1Xh2OJgB03o%2F2lm8Psw9SiCywKIF8z2IDYJ6zfPa0UUUp3WyiJGNSba8znBgH%2BgiCe4uyQKtQ6x8vxvDlMHZMzXqyj%2Bhed2CanR0JsrJhNeSUP09kkMorCvFyWIWK8%2B%2FwPqgB0CA4RonWgmFJiQfyNdH0YMILNs0QHFk2pEA34zo08r1jg15lYhqDCF7F3fL0%3D" width="60%" > </center>
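As a rough sketch of the operation described above, the following plain-Python function slides a filter over a small grid without padding. The function name `conv2d_valid` and the example values are assumptions chosen for illustration; in practice one would use a library layer such as `torch.nn.Conv2d`.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (a 'valid' convolution).

    As in most deep learning libraries, the kernel is not flipped, so this is
    technically a cross-correlation.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Element-wise product of the filter and the patch, then summed.
            total = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
mean_filter = [[1 / 9] * 3 for _ in range(3)]  # a 3x3 averaging filter
result = conv2d_valid(image, mean_filter)
print(len(result), len(result[0]))  # 2 2 (the output shrinks without padding)
```

The same 9 filter values are reused at every position, which is why the number of parameters does not depend on the image size.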

The convolutional filter in CNNs has some desirable properties that are suitable for images:

*   The number of parameters is independent of the input 
*   Operates locally, extracting localized features
*   Translation invariant weights (i.e. the same filter is applied at every position as it convolves)
*   Values of the filter depend on the relative position of neighboring pixels (e.g. the 9 values in a $3 \times 3$ filter can in general be different)

So what are the challenges in generalizing a convolutional operator to arbitrary graphs? Some of the difficulties that make such a generalization non-trivial are:

*   A node in a graph may have an arbitrary number of neighbors, while a pixel in an image has 8 neighbors (except in the case of edge and corner pixels)
*   A node in a graph may have an arbitrary number and variety of attributes. E.g. whereas a pixel in a color image has three values (RGB), a node in a social network representing a user may have different types of attributes such as current location, interests, etc.
*   In the case of a heterogeneous graph, there may be different types of nodes, e.g. movies and actors in a movie-actor network.
 





# Extension to Homogeneous Graphs with Node Features


Let's try to create a convolution-like operator on a homogeneous graph. Let's assume that each node $v_i$  has a feature vector $\mathbf{v}_i$ associated with it. 

<center><img src="https://ai.science/api/authorized-images/ly8YBPB1rsFZENZBrO47vYhzp5zBRXRO8%2FtHMMBtBm%2FkhYTNnnl7B2rHuVk0MoMOxY8WDh2R9uMBuqwkMjT1fGntAM2p2agcwmciMtSrIPBZiVWvRH%2FcSC%2FPRgyyLy0LRiBeqbNPvIetT6J7cKKRB6t3OUL9vXnlI9bMPSBrNIDH9pblYjIs48BtthsR4FLtXPp5H25nq4dyn0K6cvONLIebrOEoAmbtL8x3DxJHnmBfEoKdHU%2B6yxSCQJk14%2BUkUSiutxov2UCJnzCzPXQuyyFTieRFXe4zkzqcjhLljBLHwGDorWlDZTSAF2dqN6%2BVvl5Bu1JKk1iY7wLluejjS7lfxGaeeiLS0SE34ChjXe9XVvXJAnZ%2Fr7uOlgKhytDfdquuGtFJlHs4b0ODZ2AVuFFCS6h%2FVLAPt1FHHBcEgcAv5Gnk7Gs3hAPZ57beBqaLe7sWdzJA7RRWaedUnO%2FyxGQg8MvYSjDXyz0H3yZi8acLUtCkme8W73WO1io%2FaXTeWteXRHDvUI42kMrqjcyqfoy7gdGzyO6CsYk6BvzGeR7ralzvzjPVVSn8wvvsa3dDECAfXex%2BFyCsAv6HqurD%2FwNeqK%2FB9B1r4ydqTc6On2ZLx4db8JJq6zW8NVHUwfwTJRrVuJY5V2m8QlWr3LLu3Xz10Urt255yAgkXFvLfRaw%3D" width="30%" > </center>


A very simple convolution-like operator on graphs is just the uniform average of the node features in each node’s neighborhood. 

Let's go through a single layer of this operation. We start with the initial feature vectors at each node as the input to the convolution layer:
$$\mathbf{v}_i^{(0)}= \mathbf{v}_i$$

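Then we replace each node’s features with the average of its own features and those of its neighbors. This can be written as:


$$ \mathbf{v}_i^{(1)} = 
\sum_{v_j \in N(v_i) \cup \{v_i\} }{\frac{1}{d_i + 1}  \mathbf{v}_j^{(0)}}, $$

where $N(v_i)$ is the set of neighbors of $v_i$ and $d_i = |N(v_i)|$ is the number of its neighbors.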

This is visualized below

<center><img src="https://ai.science/api/authorized-images/twlUnmkDyXflUxLUbaJlYVxoyPsrO9lCxMIwGcllts3fshttjk2otcAN%2F%2BYT%2BzCn%2B7zCcSTq73x08Sjx%2BuQw1XaZqeC6PEQJDT01QzpVHtvVN%2FtKMOfCiUmHYA2Sn%2FEOxTKY%2BDeIliDYlQM2efGAHfvumXEOynZJdAPbJdhrHo2y0j0oYdInscZkr5PHzHVrWqtxIaJCMriCJ5UlVxfIHvH8%2Frf3VOBW5H2Cwu%2F6vjww0wOhvrq60ygHE99Sghzo0pmkuk8Fq6OYw2t5MgL2azS0KZEnSOf%2F5zeKGYBwUE3czdz5BlPZAVZvV7UYGb4jlmohvMNWmwKY8e%2F4wvtG5iUtzUZpMvSTCsvdl0NIxbJ1He7gykW%2BfaAQRjiO1Nme%2B%2FD9o2BaVzGX61eu4vlPJZYGZh3JUgBGGtWMEFmOETMPMt5CYoHas0z5Dk2xIk5DkYS1F%2B4FIlc8ywacJoih7miZcUIPjeWFiVvGMtrv0zuofNZrV7gb%2Bw2mm3YPVSg6uCPslo2phKBFpkrLb7iy6OBGMGqElRa1ofK81zJ4EUMs%2FnAt0TkE6DOZtsQlEsXI4%2B4fuMoFs2fdlcgb%2B4rBU2efOjfO5nQptHlvbTXTccmI6IIU0AY%2BiSimB7QysH%2FqT%2F%2FY1iOI9WObaDfecUpK8OYLnyaOeZCp0NNP5ILbINk%3D" width="30%" > </center>

In the illustration here, the neighborhood of each node consists of all nodes with incoming edges to that node. This definition of a neighborhood is to take into account the propagation of information. However other definitions of neighborhoods are possible, and the neighborhood doesn’t have to be limited to nodes that are only 1 hop away.
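A minimal sketch of one such averaging step, written in plain Python so that every operation is visible. The function name `average_with_neighbors`, the adjacency-list representation and the toy graph are assumptions made for this example.

```python
def average_with_neighbors(features, neighbors):
    """One averaging step: each node takes the mean over itself and its neighbors."""
    updated = []
    for i, own in enumerate(features):
        # The node's own feature vector plus those of its d_i neighbors.
        group = [own] + [features[j] for j in neighbors[i]]
        dim = len(own)
        updated.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return updated

# Path graph 0 - 1 - 2 with 2-dimensional node features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

updated = average_with_neighbors(features, neighbors)
print(updated)
```

Node 1 has two neighbors, so its new features are the mean of three vectors, while nodes 0 and 2 each average over two vectors. No learnable parameters are involved in this step.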

If we write $D$ as the degree matrix, i.e. a diagonal matrix containing the number of neighbors of each node on its diagonal, $I$ as the identity matrix and $A$ as the adjacency matrix, then we can succinctly write this simple convolution operator as 

$$ V^{(1)} =(D+I)^{-1} (A+I) V^{(0)}, $$

where $V$ is an $N^v \times N^{f_{nodes}}$ matrix containing the feature vectors stacked vertically (one row per node). Introducing $\tilde{S}=(D+I)^{-1} (A+I)$, this can be written as 

$$ V^{(1)} =\tilde{S}V^{(0)}.$$
We can apply this step $k$ times to get  

$$ V^{(k)} =\tilde{S}^{k}V^{(0)}.$$
After $k$ applications, the updated node features of all nodes will have been influenced by the input feature vectors of nodes that are up to $k$ hops away. There is, however, a limit to how many times we can usefully repeat this operation. For example, if the graph has a diameter of 5, that is, the largest shortest path between any pair of nodes is 5, and we repeat this averaging operation 5 times, every node will have received information from every other node and the node features in the final layer would be very similar. This would probably result in poor performance in downstream tasks like node classification.
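The matrix form and the effect of repeated averaging can be checked on a tiny graph. The sketch below uses nested lists rather than tensors, and the helper names (`matmul`, `normalized_adjacency`, `propagate`, `spread`) as well as the example graph are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalized_adjacency(adjacency):
    """Build S = (D + I)^{-1} (A + I) for a 0/1 adjacency matrix."""
    n = len(adjacency)
    s = []
    for i in range(n):
        degree = sum(adjacency[i])
        s.append([(adjacency[i][j] + (1 if i == j else 0)) / (degree + 1)
                  for j in range(n)])
    return s

def propagate(s, v, k):
    """Apply the averaging operator k times, i.e. compute S^k V."""
    for _ in range(k):
        v = matmul(s, v)
    return v

def spread(v):
    """Difference between the largest and smallest value of a one-column matrix."""
    column = [row[0] for row in v]
    return max(column) - min(column)

# Path graph 0 - 1 - 2 - 3 with a one-dimensional feature per node.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
V0 = [[1.0], [0.0], [0.0], [0.0]]
S = normalized_adjacency(A)

for k in (1, 3, 50):
    print(k, spread(propagate(S, V0, k)))
```

Each row of $\tilde{S}$ sums to one, so every step is a weighted average. On this small connected graph the spread between node features shrinks as $k$ grows, which illustrates why too many averaging steps make the nodes hard to tell apart.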


After computing  $V^{(k)}$, we can then use these updated feature-vectors for downstream tasks. For example for node classification, one can pass $V^{(k)}$ into a softmax layer to predict the label of each node, i.e. 

$$ \hat{Y}= \mathrm{softmax}(V^{(k)}\Theta ),$$
where $\Theta$ is a matrix to be learned e.g. via back-propagation.
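The prediction step can be sketched as follows. The weight matrix here is hand-picked rather than learned, and the names `softmax`, `predict`, `V_k` and `Theta` are assumptions used only in this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(v_k, theta):
    """Compute softmax(V^(k) Theta) row by row; returns class probabilities."""
    probabilities = []
    for features in v_k:
        scores = [sum(f * theta[d][c] for d, f in enumerate(features))
                  for c in range(len(theta[0]))]
        probabilities.append(softmax(scores))
    return probabilities

# Two nodes with 2 smoothed features each, and a hand-picked (not learned)
# 2 x 3 weight matrix that maps the features to 3 classes.
V_k = [[1.0, 0.0], [0.0, 2.0]]
Theta = [[2.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]

Y_hat = predict(V_k, Theta)
labels = [max(range(len(p)), key=p.__getitem__) for p in Y_hat]
print(labels)  # [0, 1]
```

Each row of `Y_hat` is a probability distribution over the classes, and the predicted label of a node is the class with the highest probability.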

While the above procedure is simple, it is still a good baseline for graph representations of data and is benchmarked in the paper [Simplifying Graph Convolutional Networks](https://arxiv.org/abs/1902.07153). 

# Simple Graph Convolutional Network

We will first implement a simple learning pipeline based on the graph convolutional operator from the paper [Simplifying Graph Convolutional Networks](https://arxiv.org/abs/1902.07153). 

The [*SGConv*](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.SGConv) operator is linear. For each node, the operator essentially averages the features from neighboring nodes. Applying this averaging $K$ times allows features to propagate from nodes that are at most $K$ edges apart.

*Note: We will look at the details of how the convolutional operators are implemented in the next notebook. For now we will simply use the ones provided in PyG.*

Let's apply a single SGConv layer to the node features of the Cora graph. We will define the layer to have $N^{f_{nodes}}$ input channels (i.e. the length of the feature vector of each node) and $N^C$ output channels (i.e. the number of unique labels).

In [3]:
from torch_geometric.nn import SGConv

num_classes = len(data_cora.y.unique())

conv = SGConv(in_channels=data_cora.num_features, out_channels=num_classes,
       K=1, cached=True)

x = data_cora.x
print("Shape before applying convolution: ", x.shape)

# x contains the node features, and edge_index encodes the structure of the graph
x = conv(x, data_cora.edge_index)
print("Shape after applying convolution: ", x.shape)


Shape before applying convolution:  torch.Size([2708, 1433])
Shape after applying convolution:  torch.Size([2708, 7])


Let's define a network which uses SGConv to classify nodes on the Cora dataset

In [4]:
class SGNet(torch.nn.Module):
    def __init__(self, data, K=1):
        super().__init__()
        num_classes = len(data.y.unique())

        # Create a simple graph convolution layer with K neighborhood
        # "averaging" steps
        self.conv = SGConv(in_channels=data.num_features,
                           out_channels=num_classes,
                           K=K, cached=True)

    def forward(self, data):
        # Apply convolution to node features
        x = self.conv(data.x, data.edge_index)

        # Compute log softmax.
        # Note: Negative log likelihood loss expects a log probability
        return F.log_softmax(x, dim=1) 
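The comment in `forward` notes that the negative log likelihood loss expects log probabilities. The following plain-Python sketch shows how the two pieces fit together; the functions `log_softmax` and `nll_loss` here are simplified stand-ins written for this example, not the PyTorch implementations.

```python
import math

def log_softmax(scores):
    """log(softmax(scores)), computed stably via the log-sum-exp trick."""
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_sum_exp for s in scores]

def nll_loss(log_probs, targets):
    """Mean negative log probability assigned to the true class."""
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(picked)

scores = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]

log_probs = [log_softmax(row) for row in scores]
loss = nll_loss(log_probs, targets)
print(loss)

# A model that assigns equal probability to 7 classes has a loss of log(7).
uniform = [log_softmax([0.0] * 7)]
print(round(nll_loss(uniform, [3]), 4))  # 1.9459
```

This gives a useful sanity check: an untrained model on Cora, which has 7 classes, should start with a loss close to $\log 7 \approx 1.946$.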


We will define a function to run one training cycle of the model

In [5]:
def train(model, data, optimizer):
  # Set the model.training attribute to True
  model.train() 

  # Reset the gradients of all the variables in a model
  optimizer.zero_grad() 

  # Get the output of the network: the log probability of each class for every node
  log_softmax = model(data) 

  labels = data.y # Labels of each node

  # Use only the nodes specified by the train_mask to compute the loss.
  nll_loss = F.nll_loss(log_softmax[data.train_mask], labels[data.train_mask])
  
  #Computes the gradients of all model parameters used to compute the nll_loss
  #Note: These can be listed by looking at model.parameters()
  nll_loss.backward()

  # Finally, the optimizer looks at the gradients of the parameters 
  # and updates the parameters with the goal of minimizing the loss.
  optimizer.step() 

In case you are not very familiar with PyTorch: to get a better sense of what the above function does (or of anything you are unsure about later on), you can simply run the code step by step and inspect the results yourself. Here is an example:

In [6]:
model_cora = SGNet(data_cora, K=1)
optimizer = torch.optim.Adam(model_cora.parameters(), lr=0.2)

optimizer.zero_grad() 

print("="*80)
print("Gradients of model parameters right after zero_grad")
for i, parameter in model_cora.named_parameters():
  print ("Parameter {}".format(i))
  print ("Shape: ",parameter.shape )
  print("Gradient")
  print(parameter.grad)

# Get the output of the network: the log probability of each class for every node
log_softmax = model_cora(data_cora) 

print("="*80)
print("Output of model (log-softmax) \n Shape:{}"
      " \n Values: {}".format(log_softmax.shape, log_softmax))

# Labels of each node
y_true = data_cora.y 

# Use only the nodes specified by the train_mask to compute the loss.
train_mask = data_cora.train_mask
nll_loss = F.nll_loss(log_softmax[train_mask], y_true[train_mask])

print("="*80)
print("negative logloss {}".format(nll_loss))

#Computes the gradients of all model parameters used to compute the nll_loss
#Note: These can be listed by looking at model.parameters()
nll_loss.backward()

print("="*80)
print("Gradients of model parameters right after back propagation")
for i, parameter in model_cora.named_parameters():
  print ("Parameter {}".format(i))
  print ("Shape: ",parameter.shape )
  print("Gradient")
  print(parameter.grad)

Gradients of model parameters right after zero_grad
Parameter conv.lin.weight
Shape:  torch.Size([7, 1433])
Gradient
None
Parameter conv.lin.bias
Shape:  torch.Size([7])
Gradient
None
Output of model (log-softmax) 
 Shape:torch.Size([2708, 7]) 
 Values: tensor([[-1.9316, -1.9705, -1.9500,  ..., -1.9456, -1.9391, -1.9405],
        [-1.9324, -1.9668, -1.9511,  ..., -1.9463, -1.9360, -1.9404],
        [-1.9319, -1.9663, -1.9489,  ..., -1.9481, -1.9372, -1.9405],
        ...,
        [-1.9356, -1.9665, -1.9528,  ..., -1.9475, -1.9377, -1.9352],
        [-1.9334, -1.9662, -1.9475,  ..., -1.9456, -1.9380, -1.9418],
        [-1.9328, -1.9676, -1.9495,  ..., -1.9453, -1.9377, -1.9423]],
       grad_fn=<LogSoftmaxBackward>)
negative logloss 1.945770263671875
Gradients of model parameters right after back propagation
Parameter conv.lin.weight
Shape:  torch.Size([7, 1433])
Gradient
tensor([[-4.3987e-06,  1.2867e-04, -6.8989e-05,  ..., -3.7922e-05,
         -3.9255e-05,  4.2869e-05],
        [-1.7

Now let's define functions to compute the accuracy of a trained model on the training and validation sets

In [7]:
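def compute_accuracy(model, data, mask):
  # Set the model.training attribute to False
  model.eval()
  logprob = model(data)
  # The predicted class is the one with the highest log probability
  y_pred = logprob[mask].argmax(dim=1)
  y_true = data.y[mask]
  # Fraction of the masked nodes whose prediction matches the true label
  acc = y_pred.eq(y_true).sum() / mask.sum().float()
  return acc.item()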

@torch.no_grad() # Decorator to deactivate autograd functionality  
def test(model, data):
  acc_train = compute_accuracy(model, data, data.train_mask)
  acc_val = compute_accuracy(model, data, data.val_mask)

  return acc_train, acc_val
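To see what the masked accuracy computes without running a model, here is a small plain-Python sketch. The function `masked_accuracy` and the toy log probabilities, labels and mask are made up for this example.

```python
def masked_accuracy(log_probs, labels, mask):
    """Fraction of correctly classified nodes among those selected by `mask`."""
    correct = 0
    selected = 0
    for row, label, use in zip(log_probs, labels, mask):
        if not use:
            continue  # nodes outside the mask do not count
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
        selected += 1
    return correct / selected

log_probs = [[-0.1, -2.4], [-1.6, -0.2], [-0.3, -1.3], [-0.9, -0.5]]
labels = [0, 1, 1, 0]
val_mask = [False, True, True, True]

print(masked_accuracy(log_probs, labels, val_mask))
```

Node 0 is classified correctly but is excluded by the mask, so only one of the three selected nodes counts as correct. The sketch assumes that the mask selects at least one node.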

Putting it all together in a training loop

In [8]:
# Create a model for the Cora dataset
model_cora = SGNet(data_cora, K=1)

# Create an Adam optimizer with learning rate and weight decay (i.e. L2 regularization)
optimizer = torch.optim.Adam(model_cora.parameters(), lr=0.001, weight_decay=5e-4)

for epoch in range(1, 200):
    train(model_cora, data_cora, optimizer)
    if epoch % 10 == 0:
        log = 'Epoch: {:03d}, Train: {:.4f}, Val: {:.4f}'
        print(log.format(epoch, *test(model_cora, data_cora)))

Epoch: 010, Train: 0.5143, Val: 0.1820
Epoch: 020, Train: 0.9857, Val: 0.6620
Epoch: 030, Train: 0.9929, Val: 0.7000
Epoch: 040, Train: 0.9929, Val: 0.7180
Epoch: 050, Train: 0.9929, Val: 0.7220
Epoch: 060, Train: 0.9929, Val: 0.7220
Epoch: 070, Train: 0.9929, Val: 0.7200
Epoch: 080, Train: 0.9929, Val: 0.7220
Epoch: 090, Train: 0.9929, Val: 0.7240
Epoch: 100, Train: 0.9929, Val: 0.7280
Epoch: 110, Train: 0.9929, Val: 0.7240
Epoch: 120, Train: 0.9929, Val: 0.7240
Epoch: 130, Train: 0.9929, Val: 0.7280
Epoch: 140, Train: 0.9929, Val: 0.7280
Epoch: 150, Train: 0.9929, Val: 0.7320
Epoch: 160, Train: 0.9929, Val: 0.7360
Epoch: 170, Train: 0.9929, Val: 0.7380
Epoch: 180, Train: 0.9929, Val: 0.7360
Epoch: 190, Train: 0.9929, Val: 0.7380


# Optional Exercise
* Experiment with the $K$ parameter of *SGNet*, as well as the learning rate (`lr`) and `weight_decay` parameters of the optimizer, to improve the results on the validation set. Finally, check the accuracy on the test set. How does it compare with the results provided in the paper? The results are reproduced below.

* Do the same for the Citeseer and Pubmed citation network datasets.

* Some things to try to improve the results:

  * Add another convolution layer  
  * Add a dropout layer
  * Add nonlinearities in the model
 
  <center><img src="https://ai.science/api/authorized-images/fIA50%2BdS0pey1rqtswkEaKhrHMLSGANdEGtf5be8r4ghb15lOoeCRvY2zofkIKqEG5p0%2BwDF1yT2Jii0i%2B58Ffl5jCNDg4vWzlRB9VpDBR%2BI5YKqpeh1Wa7fauh4Y%2FxmNCAF1j8dYdlD%2BcOYSQMLHcZ33QXV1jci0FowIRbCDT0Ax7Jqi8DjR9%2Be9z8ULyDNFjIBb4yKMY3s2qKcHqmulYeTKv9aeYxdrvdD6FsSVaDAo41%2BppCaycng4y0E3j5B591SRLTHnwtjLUEqT3xKeTKJXdgPeOAboibccywv3Jgb3z0X4ztC6DjKOIbSLCWPYKeZMl9SEdhZPMdpROoeEz4aT8BGHdzZojlpO21W3%2BcwvkBGtrV6xH14Jyd6P%2Fcccn9H1pkZjLpNA1vSbhFzMupaHnkFeH8IeJSNMt03Xckj0MXjTBrQsQLy%2FBh2yR%2F4%2B0ZNe%2BvIWHEdbVFqR0j51DTQf2x7cu%2BNJLIQpo8%2F1ipBJRCxTrpWCPM76FODx8qxdZ9ToLDiV7nSuiYBnn9dcdcZMtuAU3LbUoSK8JAuV0wB4Y5d7zSV2w5nfPHZDy4FRy0DGahQjpWujcmsIYqqHl4WpV8pst0lwIcq6uW7%2B5TsbEOHlp11f4vzNGcGJSsErq%2FSBcA9Z7KW9pelss6P8qbcq8gn6B3xC2rs%2ByOorZA%3D" width="80%" > </center>
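One possible way to organize the hyperparameter experiments is a small grid search that selects the configuration with the best validation accuracy. The sketch below is only a scaffold: `grid_search` is a hypothetical helper, and `fake_validation_accuracy` is a made-up stand-in that should be replaced by code that trains a fresh *SGNet* and returns its validation accuracy.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_config, best_score) over all combinations in `grid`.

    `evaluate` is a caller-supplied function that trains a fresh model with
    the given settings and returns its validation accuracy.
    """
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Stand-in for "train SGNet and return the validation accuracy". The formula
# is invented so that the sketch runs without a dataset; it carries no meaning.
def fake_validation_accuracy(K, lr, weight_decay):
    return 0.8 - 0.01 * abs(K - 2) - abs(lr - 0.1) - 10 * weight_decay

grid = {"K": [1, 2, 3], "lr": [0.01, 0.1, 0.2], "weight_decay": [0.0, 5e-4]}
best_config, best_score = grid_search(fake_validation_accuracy, grid)
print(best_config, best_score)
```

Selecting the configuration on the validation set and evaluating on the test set only once at the end keeps the reported test accuracy an honest estimate.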

  



# References

 [Wu, Felix, et al. "Simplifying graph convolutional networks." arXiv preprint arXiv:1902.07153 (2019)](https://arxiv.org/abs/1902.07153)


 [Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016)](https://arxiv.org/abs/1609.02907)