# Installation


In [1]:
# Please visit https://github.com/rusty1s/pytorch_geometric#pip-wheels for the latest installation instructions

!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html -U
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html -U
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html -U
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html -U
!pip install torch-geometric -U

Uninstalling torch-1.6.0+cu101:
  Successfully uninstalled torch-1.6.0+cu101
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Using cached https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl
Installing collected packages: torch
Successfully installed torch-1.6.0+cu101
Looking in links: https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0.html
Collecting torch-scatter==latest+cu101
  Using cached https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0/torch_scatter-latest%2Bcu101-cp37-cp37m-linux_x86_64.whl
Collecting torch-sparse==latest+cu101
  Using cached https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.6.0/torch_sparse-latest%2Bcu101-cp37-cp37m-linux_x86_64.whl
Installing collected packages: torch-scatter, torch-sparse
  Found existing installation: torch-scatter 2.0.5
    Uninstalling torch-scatter-2.0.5:
      Successfully uninst
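
The wheel URLs in the install cell embed a PyTorch version and a CUDA tag (`torch-1.9.0+cu102`), and these should match the PyTorch build that is actually installed. The sketch below rebuilds that URL from two version strings. The helper name `wheel_index_url` is hypothetical, the URL pattern is simply copied from the commands above, and the `cpu` tag for CPU-only builds is an assumption, so check the installation page linked above before relying on it.

```python
def wheel_index_url(torch_version, cuda_version=None):
    """Build a wheel index URL following the pattern used in the install cell."""
    # Drop any local build tag such as "+cu102" from the torch version string.
    base = torch_version.split("+")[0]
    # A CPU-only build reports no CUDA version; assume the tag is "cpu" then.
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://pytorch-geometric.com/whl/torch-{base}+{tag}.html"

print(wheel_index_url("1.9.0+cu102", "10.2"))
```

In a notebook, the two arguments could be taken from `torch.__version__` and `torch.version.cuda`.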

# Loading Datasets
For our datasets, we will use three citation networks: Pubmed, Cora and Citeseer. Nodes correspond to publications and edges correspond to citations. The citation networks are available through the Planetoid dataset of PyG.

In [2]:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T

#Load the Cora, CiteSeer and Pubmed citation networks
#Note: T.NormalizeFeatures() creates a transform that normalizes the node features
dataset_cora = Planetoid(root="./tmp", name="Cora", transform=T.NormalizeFeatures())
dataset_citeseer = Planetoid(root="./tmp", name="CiteSeer", transform=T.NormalizeFeatures())
dataset_pubmed = Planetoid(root="./tmp", name="Pubmed", transform=T.NormalizeFeatures())

data_cora = dataset_cora[0]
data_citeseer = dataset_citeseer[0]
data_pubmed = dataset_pubmed[0]

print("Citation network information")
print("Cora: ", data_cora)
print("Citeseer: ", data_citeseer)
print("Pubmed: ", data_pubmed)

Citation network information
Cora:  Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Citeseer:  Data(edge_index=[2, 9104], test_mask=[3327], train_mask=[3327], val_mask=[3327], x=[3327, 3703], y=[3327])
Pubmed:  Data(edge_index=[2, 88648], test_mask=[19717], train_mask=[19717], val_mask=[19717], x=[19717, 500], y=[19717])
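
The `edge_index` field printed above has shape `[2, num_edges]`: each column is one directed edge, with the source node in the first row and the target node in the second. An undirected citation link is typically stored as two columns, one per direction. The toy graph and the `neighbors` helper below are made up for illustration; they show how this layout maps to per-node neighbor lists.

```python
# Toy graph with 4 nodes; edge_index has shape [2, num_edges].
# Row 0 holds source nodes, row 1 holds target nodes.
edge_index = [
    [0, 1, 0, 2, 2, 3],
    [1, 0, 2, 0, 3, 2],
]

def neighbors(edge_index, num_nodes):
    """Return, for each node, the list of nodes with an edge pointing to it."""
    result = {i: [] for i in range(num_nodes)}
    for src, dst in zip(edge_index[0], edge_index[1]):
        result[dst].append(src)
    return result

nbrs = neighbors(edge_index, num_nodes=4)
print(nbrs)  # {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
```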


# Training & Testing Functions

In [3]:
def train(model, data, optimizer):
  # Set the model.training attribute to True
  model.train() 

  # Reset the gradients of all the variables in a model
  optimizer.zero_grad() 

  # Get the output of the network: the log-probability of each class for every node
  log_softmax = model(data) 

  labels = data.y # Labels of each node

  # Use only the nodes specified by the train_mask to compute the loss.
  nll_loss = F.nll_loss(log_softmax[data.train_mask], labels[data.train_mask])
  
  #Computes the gradients of all model parameters used to compute the nll_loss
  #Note: These can be listed by looking at model.parameters()
  nll_loss.backward()

  # Finally, the optimizer looks at the gradients of the parameters 
  # and updates the parameters with the goal of minimizing the loss.
  optimizer.step() 

def compute_accuracy(model, data, mask):
  # Set the model.training attribute to False
  model.eval()
  logprob = model(data)
  _, y_pred = logprob[mask].max(dim=1)
  y_true=data.y[mask]
  acc = y_pred.eq(y_true).sum()/ mask.sum().float()
  return acc.item()

@torch.no_grad() # Decorator to deactivate autograd functionality  
def test(model, data):
  acc_train = compute_accuracy(model, data, data.train_mask)
  acc_val = compute_accuracy(model, data, data.val_mask)

  return acc_train, acc_val
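
To make the role of the masks concrete, here is a small pure-Python sketch of what the masked loss and accuracy compute. The log-probabilities, labels and masks are invented for illustration, and `masked_nll` and `masked_accuracy` are hypothetical helper names. The loss is averaged over the selected nodes, which is the default reduction of `F.nll_loss`.

```python
import math

# Hypothetical log-probabilities for 4 nodes and 3 classes.
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.3), math.log(0.3), math.log(0.4)],
    [math.log(0.5), math.log(0.25), math.log(0.25)],
]
labels = [0, 1, 0, 2]
train_mask = [True, True, False, False]
val_mask = [False, False, True, True]

def masked_nll(log_probs, labels, mask):
    """Mean negative log-likelihood over the nodes selected by mask."""
    picked = [-row[y] for row, y, m in zip(log_probs, labels, mask) if m]
    return sum(picked) / len(picked)

def masked_accuracy(log_probs, labels, mask):
    """Fraction of masked nodes whose arg-max class equals the label."""
    hits = [row.index(max(row)) == y
            for row, y, m in zip(log_probs, labels, mask) if m]
    return sum(hits) / len(hits)

train_loss = masked_nll(log_probs, labels, train_mask)
train_acc = masked_accuracy(log_probs, labels, train_mask)
val_acc = masked_accuracy(log_probs, labels, val_mask)
print(train_loss, train_acc, val_acc)
```

Only the first two nodes contribute to the training loss, and the remaining nodes are scored separately through the validation mask.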

# Graph Attention Networks


In this notebook we will use the graph attention (GAT) convolutional operator. When this operator is applied to a graph, each node's feature vector at layer $k$ is computed by

$$ \mathbf{v}_i^{(k)} = \sigma \left( \sum_{v_j\in N(v_i)} \alpha_{ij} W \mathbf{v}_j^{(k-1)}\right). $$

Let's break this equation down.  

1. First, each neighboring node's feature vector is multiplied by a weight matrix $W$, resulting in the term $W \mathbf{v}_j^{(k-1)}$. We can imagine this as a message that is sent to node $v_i$ from each of its neighboring nodes.

2. Next, we take a weighted average of these messages, where the weights are given by a function of the node feature vectors and the weight matrix $W$, i.e. $\alpha_{ij}=f(v_i, v_j, W)$. This results in the term $\sum_{v_j\in N(v_i)} \alpha_{ij} W \mathbf{v}_j^{(k-1)}$.

3. Finally, we apply a nonlinear function $\sigma$ to this weighted average.
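
The three steps above can be sketched for a single node in plain Python. This is a minimal illustration rather than PyG's implementation: the weight matrix, neighbor features and attention weights are made-up numbers, the helper names are hypothetical, and ELU is assumed as the nonlinearity $\sigma$ because it is the activation applied after the first layer later in this notebook.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def elu(z):
    """ELU nonlinearity, assumed here as sigma."""
    return z if z > 0 else math.exp(z) - 1.0

def gat_update(W, neighbor_feats, alphas):
    # Step 1: each neighbor v_j sends the message W v_j.
    messages = [matvec(W, v) for v in neighbor_feats]
    # Step 2: weighted average of the messages with weights alpha_ij.
    dim = len(messages[0])
    agg = [sum(a * m[d] for a, m in zip(alphas, messages)) for d in range(dim)]
    # Step 3: apply the nonlinearity element-wise.
    return [elu(z) for z in agg]

W = [[1.0, 0.0], [0.0, -1.0]]
neighbor_feats = [[1.0, 2.0], [3.0, 4.0]]
alphas = [0.25, 0.75]  # attention weights, assumed to sum to 1
out = gat_update(W, neighbor_feats, alphas)
print(out)
```

Here the two messages are `[1, -2]` and `[3, -4]`, their weighted average is `[2.5, -3.5]`, and the nonlinearity leaves the positive entry unchanged while squashing the negative one.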



## Attention Mechanism


Let's take a closer look at how the weights $\alpha_{ij}$ are computed.

First we compute some unnormalized weight that depends on the weight matrix $W$, and the feature-vectors:

$$\rho_{ij} = \sigma\left( 
  \mathbf{a}^T\left[ 
    W\mathbf{v}_i^{(k-1)}||W\mathbf{v}_j^{(k-1)}\right]
  \right).$$
Here $\mathbf{a}$ is a weight vector that needs to be learned, $||$ is the vector concatenation operation, and $\sigma$ is a nonlinearity (the GAT paper uses a LeakyReLU).


<center><img src="https://ai.science/api/authorized-images/P9YAB%2FiojuBjYHys2SyLct2iT%2B6CYnF565SUHuvCwGbpd60fhuqfp6b%2BPq5FgfGFF%2BxGt5dvNdCaHWAdd%2F%2BxCfvnJDQ6VZgzYCElgNjc5wFCXII7dM2jFI8xKMTG2oEuUCwUdRtubT%2BELMsTjzLHtZ0aJCNyJKWxCaAzj37IqamCwdNtgzJ8rpnC69%2BUj3L%2FnovLaA3OWOgMFnSGWSB35Pi0L4ZzqI7gGpVScFZl2OZB1MEnR9oJkIqP4oYgXY3%2BwRdP8lrG0nhNTBzfEbC6gpiU7WQTdqKAbyAzBsLgA5Kv%2FR%2BZhpccdktppsCdSoAK0yBvjaulNzUuHVNvLPBLaakBCOScsvqUSk9RruKYugkMDG%2BTgowM3Qmm772Zw%2FpqOxUdJUaohlb1Hz0nYK1FUJHu5ubj6fKW5JC8e024rg%2F2mtrQk1GgYISH7tpomBbNxRPX5QKJtJ54cSME3JSfcEvDBhaIBprK8FvdZwti1BiizeBXphLw1FaaeYeQBnmN2dWTSs28QhOSSs6jArkD%2Fa9ixN%2F6iI9zO59IBGA8undgK%2BayjCsYjWwuCI81lCWnvEIVFDCXnden6%2BWnIjPrH1uYoGxjmjMPsUoW%2BJio%2B%2FocC8C%2FNf4FUfPS8S29nE28rm1cKZDtQxacLEB0kJorUKe4jHRFsmOWHl2zXM9DvkY%3D" width="20%" > </center>




Next, we ensure that the weights for each node $v_i$ sum to 1 by applying a softmax function over its neighborhood:

$$ \alpha_{ij} = \frac{\exp(\rho_{ij})}
{ \sum_{v_k \in N(v_i)} \exp(\rho_{ik})}.$$
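
The two formulas above can be traced by hand with a small pure-Python sketch. It assumes that $\sigma$ is a LeakyReLU with slope 0.2, matching the `negative_slope=0.2` argument passed to `GATConv` later in this notebook. The vectors are invented, the inputs are taken to be the already transformed features $W\mathbf{v}$, and `attention_weights` is a hypothetical helper name.

```python
import math

def leaky_relu(z, negative_slope=0.2):
    """LeakyReLU, assumed here as the nonlinearity in rho_ij."""
    return z if z > 0 else negative_slope * z

def attention_weights(a, Wv_i, Wv_neighbors):
    """Return the normalized weight for every neighbor of node i."""
    # Unnormalized scores: rho_ij = sigma(a^T [W v_i || W v_j]).
    rho = []
    for Wv_j in Wv_neighbors:
        concat = Wv_i + Wv_j  # list concatenation plays the role of ||
        rho.append(leaky_relu(sum(x * y for x, y in zip(a, concat))))
    # Softmax over the neighborhood; subtracting the max is a common
    # trick for numerical stability and does not change the result.
    m = max(rho)
    exps = [math.exp(r - m) for r in rho]
    total = sum(exps)
    return [e / total for e in exps]

a = [1.0, 0.0, 0.0, 1.0]
Wv_i = [0.5, 0.0]
Wv_neighbors = [[0.0, 1.0], [0.0, -1.0]]
weights = attention_weights(a, Wv_i, Wv_neighbors)
print(weights, sum(weights))
```

The raw scores are 1.5 and -0.5; the LeakyReLU turns the second into -0.1, and the softmax then gives the first neighbor the larger share of the attention.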


What we have described so far is one attention head. We can also add multiple attention heads, each giving a different set of weights $\alpha_{ij}$, resulting in different weighted averages being used in the GAT convolution operator. This is analogous to learning multiple convolutional filters in CNNs. We can aggregate the outputs of the different attention heads by either averaging or concatenating them.

<center><img src="https://ai.science/api/authorized-images/xU0Yv9p%2BCBULhnI9YGFPXt1QXea%2F492Zz5mauFoOKPCu7GcU%2B0VTbvvXP4hmoWW1K4AppfcZFfpkgcbvi%2FFt1ibHtHfCOL8r7PvtsG2Gr3cvKcL34Duli1lrrVM%2BW%2FJ0b1Dg%2Fw8tFynbEEzKYNE5spRONapqFRw2DOeyYFTF8ifxLDg6kQ3Jw1B5naTrSE%2FB%2Bp4p4Nsc1jf6Ij9nHZxr%2Fzx4%2Br5hbK9OGDNPrvatD61EhxdPiFiIV8ttMPRVUpSDXJMnCqS21cqS4Ws502VpD%2F2AxxI%2FXurz38O2uHLCpCLp%2BcrhP6n7%2B6PZCF7H%2BeLySasjT3C0%2F4ERDnR7z5cHcjeK54AeFhZ9bIZtymxyclg%2BLtniQBdPUhKeoC7W%2FY3Tlsa0Y%2Bn4Z5WRf9fm5zLWGH1kVIL8AawRPUwdfn8ZdBvCzHw6PCjFcp4BT2WbXlae9DLhNjzBs7bdbrvPhmuuJ7mvQd5QLHsRWytTO%2BTr9k9hOlGos0a48MKjl8Gzn5ll3cBtwEFHGmq9WP7TUiidBxtLnLepxFO82vBlIKikwo9yD4y5ZeJiQfX9UUNa1pNZcGgHcgHFeNSYCJW0EqbZeoxL13znoJT7DZwbryjiTKdlRaO9c%2B%2Fmjozyz0rIXNqEslvyJzXFnTPwZN0dGVt5BlNBK7ruC4n%2Bn9P3GA2tZE8%3D" width="40%" > </center>
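
The difference between the two aggregation options is mostly a matter of output size, as this small sketch shows. The per-head outputs are invented, and `concat_heads` and `average_heads` are hypothetical helper names.

```python
def concat_heads(head_outputs):
    """Concatenate per-head output vectors into one longer vector."""
    return [x for head in head_outputs for x in head]

def average_heads(head_outputs):
    """Average per-head output vectors element-wise."""
    n = len(head_outputs)
    return [sum(vals) / n for vals in zip(*head_outputs)]

# Three attention heads, each producing a 2-dimensional output for one node.
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(concat_heads(heads))   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(average_heads(heads))  # [3.0, 4.0]
```

Concatenation multiplies the feature size by the number of heads, while averaging keeps it unchanged.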



# Graph Attention Networks in PyG
We will now implement a network based on the graph attention convolutional layer described in the paper [Graph Attention Networks](https://arxiv.org/abs/1710.10903) and provided in PyG as the *GATConv* layer. Note that this layer takes different parameters than *SGConv*. All layers have their own unique set of parameters, which are described in [PyG's documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html).

In [4]:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T
from torch_geometric.nn import GATConv

class GATNet(torch.nn.Module):
  def __init__(self, data, heads_layer1, 
               heads_layer2, dropout, dropout_alphas):
    super().__init__()

    self.dropout=dropout
    num_features = data.num_features
    num_classes = len(data.y.unique())

    self.conv1 = GATConv(in_channels=num_features, out_channels=8,
                         heads=heads_layer1, concat=True, negative_slope=0.2, 
                         dropout=dropout_alphas)
    
    self.conv2 = GATConv(in_channels=8*heads_layer1, out_channels=num_classes, 
                         heads=heads_layer2, concat=False, negative_slope=0.2,
                         dropout=dropout_alphas)
  
  def forward(self, data):
      x=data.x
      x = F.dropout(x, p=self.dropout, training=self.training)
      x = self.conv1(x, data.edge_index)
      x = F.elu(x)
      x = F.dropout(x, p=self.dropout, training=self.training)
      x = self.conv2(x, data.edge_index)
      
      return F.log_softmax(x, dim=1)
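
The `concat` flag determines the feature size each layer produces, which is why `conv2` takes `8*heads_layer1` input channels. The sketch below works through that arithmetic; `gat_out_dim` is a hypothetical helper, the head counts are example values, and the 3 output classes correspond to Pubmed.

```python
def gat_out_dim(out_channels, heads, concat):
    """Output feature size of one GAT layer under the concat/average rule."""
    return out_channels * heads if concat else out_channels

# First layer: 8 heads with 8 channels each, concatenated.
hidden = gat_out_dim(out_channels=8, heads=8, concat=True)
# Second layer: 8 heads averaged, one channel per class.
final = gat_out_dim(out_channels=3, heads=8, concat=False)
print(hidden, final)  # 64 3
```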

We can use the training and testing code written earlier to try out this model on node classification for the Pubmed dataset. We use the model hyperparameters from the paper (8 attention heads per layer and a dropout rate of 0.6).

In [5]:
# Set cuda to be the device if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_pubmed_gat = GATNet(data=data_pubmed, heads_layer1=8, heads_layer2=8, 
                          dropout=0.6,  dropout_alphas=0.6).to(device)
data_pubmed= data_pubmed.to(device)

optimizer = torch.optim.Adam(model_pubmed_gat.parameters(), lr=0.001, weight_decay=1e-4)

for epoch in range(1, 200+1):
    train(model_pubmed_gat, data_pubmed, optimizer)
    if epoch % 10 == 0:
      log = 'Epoch: {:03d}, Train: {:.4f}, Val: {:.4f}'
      print(log.format(epoch, *test(model_pubmed_gat,data_pubmed)))

Epoch: 010, Train: 0.8333, Val: 0.6320
Epoch: 020, Train: 0.8833, Val: 0.7040
Epoch: 030, Train: 0.8833, Val: 0.7300
Epoch: 040, Train: 0.9000, Val: 0.7360
Epoch: 050, Train: 0.9000, Val: 0.7360
Epoch: 060, Train: 0.9000, Val: 0.7500
Epoch: 070, Train: 0.9000, Val: 0.7400
Epoch: 080, Train: 0.9000, Val: 0.7380
Epoch: 090, Train: 0.8833, Val: 0.7400
Epoch: 100, Train: 0.9000, Val: 0.7400
Epoch: 110, Train: 0.9000, Val: 0.7500
Epoch: 120, Train: 0.9000, Val: 0.7400
Epoch: 130, Train: 0.9000, Val: 0.7440
Epoch: 140, Train: 0.9000, Val: 0.7520
Epoch: 150, Train: 0.9167, Val: 0.7480
Epoch: 160, Train: 0.9167, Val: 0.7520
Epoch: 170, Train: 0.9167, Val: 0.7520
Epoch: 180, Train: 0.9167, Val: 0.7520
Epoch: 190, Train: 0.9167, Val: 0.7520
Epoch: 200, Train: 0.9167, Val: 0.7500


# Optional Exercise 1

* Study the effect of changing the different hyperparameters of the GAT network above.
* Try to reproduce (or do better than) the results in the GAT paper, which are reproduced below. Hint: you may need to also change the parameters on the optimizer and not just the model.

<center><img src="https://ai.science/api/authorized-images/m40NptgQCvGge0XNb3KZWoOcsvBpo0r3qQ6yOIYJngqNByMQ%2FS8CGoURLG3uP61rQ6fgybshIDDm1Wnyl1ni0nC6vfZvaAaL9CJF2zm4Jr7LAZ6TCBYMkyk7xFqkMeIHFYvigZomO1CgTf2%2F8OOlyx5gXe7HV9Be1Qm5iaGIbWepuJ26j1VZC7x8HtQUUlVG0yCD95d70ZP6839I%2FZb7oIve42o%2B3BrYnvPbR01ocLqaRfRWSFhmPJArU92uPlDnrKmk0qS3NBneD9zZyqs2wYJEKub7mxbJdJtsy%2BJAZEHUMM9%2B8SlKnWtTc4e%2ByAIlO6ihENeTNbMygcPOT%2Bclk2JUkjLoBtVVawHgB1Q0DiauUUsRIlIcQsQALOo1hawiRToCd%2FpdRr5PnL1XeKml%2BusObnO%2BJ3e6MuuiMSEyFMyNY5DeYMSJt5xSQ7aX%2Bl3943PRW72stYXCDw%2F6RCW6voz4VON%2BMqQes64R8yvjBsUcfw8irsoH0kNei6CB0LoZMxY2fPuhe%2FnR80n4gJwyUOcMnCvqmSAdnpKR9fok4UDgnyd9nOw5eny3hMkb%2Ff%2Bt2ffqsUgaPse24Q9rAmKOR0JUAOQ9Orygbql%2FKWoF%2FGnR%2FJoR0NPrru8vFTz8eqhqw9J%2FlO9HXAuCre5NecGr48O514kg6Dt070GkX3chrL8%3D" width="80%" > </center>



# Optional Exercise 2
* There are [many graph convolutional operators available in PyG.](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html) Try to implement a network for node classification that uses the *SAGEConv* convolutional layer.

# References
[Graph Attention Networks Site](https://petar-v.com/GAT/)


 [Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017)](https://arxiv.org/abs/1710.10903)