# 1. Introduction

Suppose node $i$ is connected to node $c$. The question is: how important are the features of node $c$ to node $i$, and can this importance be learned automatically?

# 2. Graph Attention Layer

**Input:** Set of node features $\mathbf{h} = \{\bar{h}_1, \bar{h}_2, \cdots, \bar{h}_n\}$ where $\bar{h}_i \in \mathbb{R}^F$<br>
**Output:** New set of node features $\mathbf{h'} = \{\bar{h'}_1, \bar{h'}_2, \cdots, \bar{h'}_n\}$ where $\bar{h'}_i \in \mathbb{R}^{F'}$<br>

1. Apply a **parameterized linear transformation** to every node: $\mathbf{W}\cdot \bar{h}_i$ where $\mathbf{W} \in \mathbb{R}^{F' \times F}$
2. Apply **self attention**: $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$. Applying this we get the result <br>
   $$e_{i,j} = a(\mathbf{W}\cdot \bar{h}_i, \mathbf{W}\cdot \bar{h}_j) $$
3. **Normalization**: $$\alpha_{i,j} = softmax_j (e_{i,j}) = \dfrac{\exp(e_{i,j})}{\sum_{k\in\mathcal{N}(i)}\exp(e_{i,k})}$$
4. **Attention mechanism**: $a$ is a single-layer feedforward neural network parameterized by a weight vector $\bar{a} \in \mathbb{R}^{2F'}$. The two transformed vectors $\mathbf{W}\cdot h_i$ and $\mathbf{W}\cdot h_j$ are concatenated, the result is multiplied by $\bar{a}^T$, and the resulting scalar is passed through the Leaky ReLU function, i.e. $\max(0.2x, x)$. Finally this is passed to the $softmax_j$ step.

Finally we can say
$$\alpha_{i,j} = \dfrac{\exp(LeakyReLU(\bar{a}^T [\mathbf{W}\cdot h_i || \mathbf{W}\cdot h_j]))}{\sum_{k \in \mathcal{N}(i)} \exp(LeakyReLU(\bar{a}^T [\mathbf{W}\cdot h_i || \mathbf{W}\cdot h_k]))} $$

5. **Message passing**: $h_i' = \sigma(\sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\mathbf{W}h_j)$
6. **Multi-head attention**: We repeat the procedure several times (say $K$ times), each time with its own parameters $\mathbf{W}^k$ and attention coefficients $\alpha_{i,j}^k$. We can then either concatenate the results or average them. The authors suggest concatenating in the internal layers and averaging in the final layer of the network.
    <br>Concatenation: $$h_i' = ||_{k=1}^K \sigma(\sum_{j \in \mathcal{N}(i)} \alpha_{i,j}^k\mathbf{W}^kh_j)$$
    <br>Average: $$h_i' = \sigma\bigg(\dfrac{1}{K} \sum_{k=1}^K \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}^k\mathbf{W}^kh_j \bigg)$$
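The six steps above can be sketched in plain Python on a toy graph, so that every sum is visible. This is only an illustration under stated assumptions: the features, the neighbor lists, the two heads' weights `W` and `a`, and the choice of ELU for $\sigma$ are all made up for the example, and the helper names (`gat_head`, `matvec`) are hypothetical.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def elu(x):
    # Assumed choice for sigma in this sketch
    return x if x > 0 else math.exp(x) - 1.0

def matvec(W, v):
    # W has shape (F', F) as a list of rows, v has length F
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gat_head(h, neighbors, W, a):
    """Steps 1-5 for one head, without sigma. Returns (alpha, aggregated)."""
    Wh = [matvec(W, h_i) for h_i in h]                      # step 1
    alpha, agg = [], []
    for i, nbrs in enumerate(neighbors):
        # steps 2 and 4: e_ij = LeakyReLU(a^T [W h_i || W h_j]); list "+" concatenates
        e = [leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        # step 3: softmax over the neighborhood (max subtracted for stability)
        top = max(e)
        w = [math.exp(s - top) for s in e]
        total = sum(w)
        alpha_i = [x / total for x in w]
        alpha.append(alpha_i)
        # step 5 (before sigma): weighted sum of transformed neighbor features
        agg.append([sum(c * Wh[j][f] for c, j in zip(alpha_i, nbrs))
                    for f in range(len(W))])
    return alpha, agg

# Toy inputs: 3 nodes, F = 3, F' = 2, self-loops included in the neighbor lists
h = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
heads = [  # K = 2 heads, each a (W, a) pair with arbitrary values
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.4, -0.6, 0.2, 0.7]),
    ([[-0.1, 0.6, 0.2], [0.9, -0.3, 0.4]], [-0.5, 0.3, 0.8, -0.2]),
]

results = [gat_head(h, neighbors, W, a) for W, a in heads]
K, n, F_out = len(heads), len(h), len(heads[0][0])

# Step 6, concatenation: apply sigma per head, then concatenate (length K * F')
concat = [[elu(v) for _, agg in results for v in agg[i]] for i in range(n)]
# Step 6, averaging: average the heads first, then apply sigma (length F')
average = [[elu(sum(agg[i][f] for _, agg in results) / K) for f in range(F_out)]
           for i in range(n)]

print(len(concat[0]), len(average[0]))
```

Note the difference in where $\sigma$ is applied: per head before concatenating, but once after averaging, which matches the two formulas above.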
    

# 3. Advantages of GAT

1. Computationally efficient: the self-attention computation can be parallelized across edges, and the output features can be computed in parallel across nodes.
2. Allows assigning different importance to different nodes in the same neighborhood.
3. The attention mechanism is applied in a shared manner to all edges in the graph, so it does not depend on having access to the entire graph up front.
4. Works in both transductive and inductive learning settings.

# 4. Message Passing Implementation

Following PyTorch Geometric, a general message passing layer can be written as
$$ \mathbf{x}_i^{(k)} = \gamma^{(k)} \bigg(\mathbf{x}_i^{(k-1)}, \square_{j \in \mathcal{N}(i)} \phi^{(k)} \Big(\mathbf{x}_i^{(k-1)},\mathbf{x}_j^{(k-1)}, \mathbf{e}_{j, i}\Big)\bigg) $$
where 
1. $\mathbf{x}_i^{(k)}$ is the feature representation of node $i$ in the $k^{th}$ layer
2. $\mathbf{x}_i^{(k-1)}$ and $\mathbf{x}_j^{(k-1)}$ are the feature representations of nodes $i$ and $j$ in the $(k-1)^{th}$ layer, and $\mathbf{e}_{j, i}$ are the (optional) features of the edge from node $j$ to node $i$
3. $\phi^{(k)}$ is a differentiable function such as an MLP
4. $\square$ is a differentiable, permutation-invariant aggregation function, e.g. sum, mean or max
5. $\gamma^{(k)}$ is another differentiable function such as an MLP
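To make the roles of $\phi$, $\square$ and $\gamma$ concrete, here is a minimal sketch of one message passing step in plain Python. It is not PyTorch Geometric's implementation: the function names are hypothetical, edge features are left out, and the particular choices (identity message, sum aggregation, residual-style update) are assumptions picked to keep the arithmetic easy to check.

```python
def message_passing_step(x, edges, phi, aggregate, gamma):
    """x: list of node feature lists; edges: list of (j, i) pairs meaning j -> i."""
    inbox = [[] for _ in x]
    for j, i in edges:
        inbox[i].append(phi(x[i], x[j]))          # message from j to i
    return [gamma(x[i], aggregate(inbox[i], len(x[i]))) for i in range(len(x))]

def phi(x_i, x_j):
    # Assumed message function: just forward the neighbor's features
    return list(x_j)

def sum_aggregate(messages, dim):
    # Permutation-invariant: the order of `messages` does not matter
    return [sum(m[f] for m in messages) for f in range(dim)]

def gamma(x_i, aggregated):
    # Assumed update function: add the aggregate to the node's own features
    return [a + b for a, b in zip(x_i, aggregated)]

x = [[1.0], [2.0], [4.0]]
edges = [(1, 0), (2, 0), (0, 1)]   # node 0 hears from 1 and 2, node 1 hears from 0

out = message_passing_step(x, edges, phi, sum_aggregate, gamma)
print(out)  # [[7.0], [3.0], [4.0]]
```

Node 2 receives no messages, so its aggregate is zero and its features pass through unchanged. Shuffling `edges` gives the same result, which is what permutation invariance of $\square$ buys.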

# 5. Practice the MessagePassing class in PyTorch Geometric

The GCN layer is defined as:
$$\mathbf{x}_i^{(k)} = \sum_{j \in \mathcal{N}(i) \cup \{i\}} \dfrac{1}{\sqrt{deg(i)}\sqrt{deg(j)}} \cdot \Big(\mathbf{\Theta}\cdot \mathbf{x}_j^{(k-1)}\Big) $$

Here $\phi$ applies the linear transformation $\mathbf{\Theta}$ to the neighbor's features and scales the result by the degree normalization term, and $\square$ is the sum over the neighborhood.<br>
In order to implement this in PyTorch, we need to do the following steps:
1. Add self loops
2. Apply a linear transformation to the node feature matrix
3. Compute normalization coefficients
4. Normalize node features
5. Sum up neighboring node features

The first three steps are done in the `forward` method and the 4th step in the `message` method. The sum in the 5th step is selected by passing `aggr='add'` in the initializer.
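As a rough worked example of steps 1 and 3 to 5, the sketch below computes the normalization coefficients by hand for a 3-node path graph (0 - 1 - 2). It uses plain Python lists in place of tensors; the graph, the scalar features `x` (standing in for already-transformed features) and the variable names are assumptions made for illustration.

```python
import math

num_nodes = 3
# Undirected path 0 - 1 - 2 stored as directed edges (source row[k] -> target col[k])
row = [0, 1, 1, 2]
col = [1, 0, 2, 1]

# Step 1: add one self loop per node
row = row + list(range(num_nodes))
col = col + list(range(num_nodes))

# Step 3: degree of each target node (self loops included), then 1 / sqrt(deg_i * deg_j)
deg = [0] * num_nodes
for c in col:
    deg[c] += 1
norm = [1.0 / math.sqrt(deg[r] * deg[c]) for r, c in zip(row, col)]

# Steps 4 and 5: scale each source feature by its edge coefficient, sum per target
x = [1.0, 2.0, 3.0]
out = [0.0] * num_nodes
for r, c, w in zip(row, col, norm):
    out[c] += w * x[r]

print(deg)  # [2, 3, 2]
```

For node 1 this gives $x_0/\sqrt{6} + x_2/\sqrt{6} + x_1/3$: the two real neighbors have degree 2 and node 1 itself has degree 3 once self loops are counted.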

In [1]:
import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

In [2]:
class GCNConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(GCNConv, self).__init__(aggr='add')
        self.lin = torch.nn.Linear(in_channels, out_channels)
        
    def forward(self, x, edge_index):
        # x.shape = (N, in_channels)
        # edge_index.shape = (2, E)
        
        # 1. Add self loops
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        
        #2. Linear transformation
        x = self.lin(x)
        
        #3. Normalization
        row, col = edge_index
        deg = degree(col, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        
        # 4/5. Start propagating messages
        return self.propagate(edge_index, x=x, norm=norm)
    
    def message(self, x_j, norm):
        #x_j.shape = (E, out_channels)
        return norm.view(-1, 1) * x_j
        

# 6. Implement GAT

In [3]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

In [4]:
class GATLayer(nn.Module):
    def __init__(self):
        super(GATLayer, self).__init__()
        
    def forward(self, input, adj):
        print("")

## Linear Transformation

$$ \bar{h}_i' = \mathbf{W}\bar{h}_i$$
As before, $\mathbf{W} \in \mathbb{R}^{F' \times F}$ and $\bar{h}_i \in \mathbb{R}^F$, thus $\bar{h}_i' \in \mathbb{R}^{F'}$

In [5]:
in_features = 5
out_features = 2
nb_nodes = 3

# W is stored with shape (F, F'), i.e. transposed, so that h = input @ W
W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
nn.init.xavier_uniform_(W.data, gain=1.414)  # Xavier parameter initialization
input = torch.rand(nb_nodes, in_features)

h = torch.mm(input, W)
N = h.size()[0]
print(h.shape)

torch.Size([3, 2])


## Attention mechanism

In [7]:
a = nn.Parameter(torch.zeros(size=(2*out_features, 1)))
nn.init.xavier_uniform_(a.data, gain=1.414)
print(a.shape)

leaky_relu = nn.LeakyReLU(0.2)

torch.Size([4, 1])


In [8]:
a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1).view(N, -1, 2 * out_features)
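The `repeat`/`view` combination above is dense, so here is a small sketch that mimics the same index pattern with plain Python lists. The goal is to show that entry `[i][j]` of `a_input` is the concatenation $h_i \,||\, h_j$. The feature values are made up so that each node is easy to recognize.

```python
N, F_out = 3, 2
# Hypothetical transformed features: node i has features [10*i, 10*i + 1]
h = [[10.0 * i + f for f in range(F_out)] for i in range(N)]

# h.repeat(1, N).view(N * N, -1): row r holds h[r // N]
# (each node's features appear N times in a row)
left = [h[r // N] for r in range(N * N)]

# h.repeat(N, 1): row r holds h[r % N]
# (the whole list of nodes is tiled N times)
right = [h[r % N] for r in range(N * N)]

# torch.cat(..., dim=1) followed by .view(N, -1, 2 * F_out)
flat = [l + r for l, r in zip(left, right)]
a_input = [flat[i * N:(i + 1) * N] for i in range(N)]

print(a_input[1][2])  # [10.0, 11.0, 20.0, 21.0], i.e. h[1] || h[2]
```

This builds all $N^2$ pairs whether or not an edge exists, so memory grows quadratically with the number of nodes. That is fine for a 3-node demo; the pairs that do not correspond to edges are discarded later by the adjacency mask.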

In [10]:
e = leaky_relu(torch.matmul(a_input, a).squeeze(2))

In [11]:
print(a_input.shape, a.shape)
print()
print(torch.matmul(a_input, a).shape)
print()
print(torch.matmul(a_input, a).squeeze(2).shape)

torch.Size([3, 3, 4]) torch.Size([4, 1])

torch.Size([3, 3, 1])

torch.Size([3, 3])


In [12]:
print(e)

tensor([[-0.0142, -0.0059,  0.0037],
        [ 0.0908,  0.1327,  0.1658],
        [ 0.2506,  0.2925,  0.3256]], grad_fn=<LeakyReluBackward0>)


## Masked attention

In [14]:
adj = torch.randint(2, (3, 3))
zero_vec = -9e15 * torch.ones_like(e)
print(zero_vec.shape)
print(zero_vec)

torch.Size([3, 3])
tensor([[-9.0000e+15, -9.0000e+15, -9.0000e+15],
        [-9.0000e+15, -9.0000e+15, -9.0000e+15],
        [-9.0000e+15, -9.0000e+15, -9.0000e+15]])


In [15]:
attention = torch.where(adj > 0, e, zero_vec)
print(adj)
print(e)
print(zero_vec)
print(attention)

tensor([[1, 0, 0],
        [1, 1, 0],
        [0, 1, 0]])
tensor([[-0.0142, -0.0059,  0.0037],
        [ 0.0908,  0.1327,  0.1658],
        [ 0.2506,  0.2925,  0.3256]], grad_fn=<LeakyReluBackward0>)
tensor([[-9.0000e+15, -9.0000e+15, -9.0000e+15],
        [-9.0000e+15, -9.0000e+15, -9.0000e+15],
        [-9.0000e+15, -9.0000e+15, -9.0000e+15]])
tensor([[-1.4248e-02, -9.0000e+15, -9.0000e+15],
        [ 9.0823e-02,  1.3270e-01, -9.0000e+15],
        [-9.0000e+15,  2.9252e-01, -9.0000e+15]], grad_fn=<SWhereBackward0>)


In [16]:
attention = F.softmax(attention, dim=1)
h_prime = torch.matmul(attention, h)
print(attention)
print(h_prime)

tensor([[1.0000, 0.0000, 0.0000],
        [0.4895, 0.5105, 0.0000],
        [0.0000, 1.0000, 0.0000]], grad_fn=<SoftmaxBackward0>)
tensor([[ 0.2003,  0.1686],
        [-0.0097,  0.0547],
        [-0.2111, -0.0546]], grad_fn=<MmBackward0>)
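To see why filling non-edges with `-9e15` works, the following sketch redoes the masked softmax for the second row of the output above in plain Python. The scores and adjacency row are copied from the printed `e` and `adj`; the helper name `masked_softmax` is hypothetical.

```python
import math

def masked_softmax(scores, mask, neg=-9e15):
    # Replace scores of non-neighbors with a huge negative number
    masked = [s if m > 0 else neg for s, m in zip(scores, mask)]
    top = max(masked)                       # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in masked]
    total = sum(exps)
    return [v / total for v in exps]

e_row = [0.0908, 0.1327, 0.1658]   # second row of e
adj_row = [1, 1, 0]                # second row of adj

alpha_row = masked_softmax(e_row, adj_row)
print([round(v, 4) for v in alpha_row])  # [0.4895, 0.5105, 0.0]
```

The exponential of the masked entry underflows to exactly zero, so the remaining weights are a softmax over the true neighbors only. One caveat of this trick: if a row of `adj` is all zeros, every entry is equally masked and the softmax returns uniform weights instead of zeros, so isolated nodes would need separate handling (for example by adding self loops).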


## Finalizing the GAT layer

In [17]:
class GATLayer(nn.Module):
    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GATLayer, self).__init__()
        self.dropout = dropout
        self.in_features = in_features
        self.out_features = out_features
        self.alpha = alpha
        self.concat = concat
        
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        
        self.a = nn.Parameter(torch.zeros(size=(2 * out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)
        
        self.leaky_relu = nn.LeakyReLU(self.alpha)
        
    def forward(self, input, adj):
        # 1. Linear transformation
        h = torch.mm(input, self.W)
        N = h.size()[0]
        
        # 2. Attention mechanism
        a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1).view(N, -1, 2 * self.out_features)
        e = self.leaky_relu(torch.matmul(a_input, self.a).squeeze(2))
        
        # 3. Masked attention
        zero_vec = -9e15 * torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        h_prime = torch.matmul(attention, h)
        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime

## Usage of the implementation in PyTorch Geometric

In [19]:
from torch_geometric.data import Data
from torch_geometric.nn import GATConv
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T
import matplotlib.pyplot as plt

dataset = Planetoid(root='Cora_dataset', name='Cora')
dataset.transform = T.NormalizeFeatures()
print(dataset.num_classes, dataset.num_node_features)

7 1433


In [20]:
class GAT(torch.nn.Module):
    def __init__(self):
        super(GAT, self).__init__()
        self.hid = 8
        self.in_head = 8
        self.out_head = 1
        
        self.conv1 = GATConv(dataset.num_features, self.hid, heads=self.in_head, dropout=0.6)
        self.conv2 = GATConv(self.hid*self.in_head, dataset.num_classes, concat=False, heads=self.out_head, dropout=0.6)
        
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)

In [22]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [23]:
model = GAT().to(device)
data = dataset[0].to(device)
opt = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

In [25]:
model.train()
for e in range(1000):
    model.train()
    opt.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    
    if e % 200 == 0:
        print(loss)
        
    loss.backward()
    opt.step()

tensor(1.9461, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.6521, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.5752, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.5749, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.5524, device='cuda:0', grad_fn=<NllLossBackward0>)


In [26]:
model.eval()
_, pred = model(data).max(dim=1)
correct = float(pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
acc = correct/data.test_mask.sum().item()
print("Accuracy: {:.4f}".format(acc))

Accuracy: 0.8210
