- focus on the spatial construction of graph neural networks

- images are a specific sort of graph: grid graph

    every pixel is a vertex


- recall CNNs use 3 basic ideas, which can all be translated to graph

    1. local receptive fields: the problem on a graph is that there is no clearly defined direction or notion of translation, so we can't cleanly slide a convolution filter over it; instead, center the neighborhood on a vertex
    
    2. shared weights in a filter
    
    3. pooling

# how can we approximate CNN ideas on more general non-grid graphs

## spatial construction (vertex domain)

- apply the same operator and move it over the graph space in some fashion


- notion of locality based on neighbors of a node


## spectral construction (spectral domain)

- define a conv operation in the graph spectral space


- use signal processing theory to define filters in the graph Fourier domain


- notion of locality based on global properties of the filter

# graph notation

- define a graph $G=(V, E)$ with vertices $V$ and edges $E$


- each vertex $u \in V$ is an example and can have several node features, e.g., for node A, node features are $x_A = \left\{ Age, Sex, Job,... \right\}$


- each edge $(u, v) \in E$ can also have features, e.g., edge type `is_a` in a knowledge graph


- A is an adjacency matrix, denotes connectivity or similarity in a graph

$$
A(u, v)=\left\{\begin{matrix}
1 & \text{if } (u, v) \in E \\
0 & \text{if } (u, v) \notin  E \\
\end{matrix}\right.
$$

also, if the graph is weighted, $A(u, v)$ can be a real value (the edge weight), see weighted graph
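As a minimal sketch of the definition above, the following pure-Python helper builds $A$ from an edge list. The function name `adjacency_matrix`, the 4-node example graph, and the convention that an optional third tuple entry is the edge weight are assumptions made for illustration.

```python
def adjacency_matrix(num_nodes, edges, directed=False):
    """Build a dense adjacency matrix from (u, v) or (u, v, weight) tuples."""
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for edge in edges:
        u, v = edge[0], edge[1]
        w = edge[2] if len(edge) == 3 else 1.0  # unweighted edges get weight 1
        A[u][v] = w
        if not directed:
            A[v][u] = w  # undirected graph: A is symmetric
    return A

# Hypothetical 4-node graph: edges 0-1, 0-2 and a weighted edge 2-3
A = adjacency_matrix(4, [(0, 1), (0, 2), (2, 3, 0.5)])
for row in A:
    print(row)
```

A dense matrix is convenient for small examples; large sparse graphs are usually stored as edge lists or sparse matrices instead.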

![image.png](attachment:image.png)

# message passing

- message passing sends messages from a node's neighbors to the target node

$$
\begin{aligned}
h_u^{(k+1)} &= \text{UPDATE}^{(k)}\left ( h_u^{(k)}, \text{AGGREGATE}^{(k)}\left ( \left\{ h_v^{(k)}, \forall v \in \mathcal{N}(u)\right\} \right ) \right )
\\&= \text{UPDATE}^{(k)} \left ( h_u^{(k)}, m_{\mathcal{N}(u)}^{(k)} \right ) 
\end{aligned}
$$

where $u$ is the target node, 

$\mathcal{N}(u)$ is a set of neighbor nodes of node $u$

$v \in \mathcal{N}(u)$ is node $u$'s neighbor node $v$

$h_u$ is node $u$'s feature

$h_v$ is node $v$'s feature

$k$ is iteration step

$h_u^{(k+1)}$ is updated node feature at $k+1$th step

$h_u^{(k)}$ is node feature at $k$th step

$m_{\mathcal{N}(u)}^{(k)}$ is **message**, the aggregated feature of node $u$'s neighbors

$\text{AGGREGATE}$ is an aggregate function, e.g., sum, mean or max, which is differentiable and permutation invariant

$\text{UPDATE}$ is a differentiable function, e.g., an MLP (Multi Layer Perceptron).
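The update rule above can be sketched in a few lines of plain Python. This is only an illustration: the function name `message_passing_step`, the scalar node features, and the toy `update` lambda are assumptions, and `aggregate` and `update` are passed in as ordinary callables rather than learned functions.

```python
def message_passing_step(h, neighbors, aggregate, update):
    """One iteration: h_u <- UPDATE(h_u, AGGREGATE({h_v : v in N(u)}))."""
    new_h = {}
    for u in h:
        messages = [h[v] for v in neighbors[u]]  # features of u's neighbors
        m_u = aggregate(messages)                # the message m_{N(u)}
        new_h[u] = update(h[u], m_u)             # h_u at step k+1
    return new_h

# Hypothetical toy graph: node "a" is connected to "b" and "c"
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
h0 = {"a": 1.0, "b": 2.0, "c": 3.0}

h1 = message_passing_step(
    h0, neighbors,
    aggregate=sum,                              # permutation invariant
    update=lambda h_u, m: 0.5 * h_u + 0.5 * m,  # stand-in for a learned UPDATE
)
print(h1)
```

All nodes read from the step-$k$ features and write into a fresh dictionary, so the update is synchronous across the graph.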



## PyTorch Geometric

https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html

Generalizing the convolution operator to irregular domains is typically expressed as a neighborhood aggregation or message passing scheme. 

message passing graph neural networks can be described as

$$
\mathbf{x}_i^{(k)} = \gamma^{(k)} \left( \mathbf{x}_i^{(k-1)}, \text{AGGREGATE}_{j \in \mathcal{N}(i)} \, \phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i}\right) \right)
$$

where 

$\mathbf{x}^{(k-1)}_i \in \mathbb{R}^F$ denotes the features of node $i$ in layer $(k-1)$,

$\mathbf{e}_{j,i} \in \mathbb{R}^D$ denotes the (optional) edge features from node $j$ to node $i$,

$\text{AGGREGATE}$ denotes a differentiable, permutation invariant function, e.g., sum, mean or max, 

$\gamma$ and $\phi$ denote differentiable functions such as MLPs (Multi Layer Perceptrons).

## message passing iterations

- there can be several iterations of message passing through a GNN


- at each iteration, every node aggregates the features that its neighboring vertices held at the previous iteration


- after K iterations you get a latent embedding of node u

$$
z_u = h_u^{(K)}, \forall u \in V
$$


- this node embedding can then be used for downstream tasks, e.g., neighbor reconstruction or node classification
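One way to see what $K$ iterations buy is to track which input nodes can influence $z_u$. The sketch below is an assumption-laden toy: a 5-node path graph, one-hot input features, and a parameter-free update that simply adds the neighbors' vectors to the node's own. After $K$ iterations, the nonzero entries of $z_0$ mark the nodes within $K$ hops of node 0.

```python
def propagate(h, neighbors):
    """One message passing iteration: add the neighbors' vectors to the node's own."""
    new_h = {}
    for u, h_u in h.items():
        out = list(h_u)
        for v in neighbors[u]:
            out = [a + b for a, b in zip(out, h[v])]
        new_h[u] = out
    return new_h

# Path graph 0 - 1 - 2 - 3 - 4 with one-hot input features
n = 5
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
h = {i: [1.0 if j == i else 0.0 for j in range(n)] for i in range(n)}

K = 2
for _ in range(K):
    h = propagate(h, neighbors)

z = h  # z_u = h_u^{(K)}
reached = [j for j, value in enumerate(z[0]) if value > 0]
print(reached)  # indices of the nodes that influence z_0
```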

## aggregation and updating

- $\text{AGGREGATE}$ is an aggregate function, e.g., sum, mean or max, which is differentiable, permutation invariant

    e.g., a typical aggregation function is a sum function, summing over features of neighbor nodes $h_v, v \in \mathcal{N}(u)$
    
    $$
    m_{\mathcal{N}(u)}^{(k)} = \sum_{v \in \mathcal{N}(u)} h_v^{(k)}
    $$

- $\text{UPDATE}$ is a differentiable function, e.g., an MLP (Multi Layer Perceptron); a simple single-layer choice is

$$
\text{UPDATE} \left ( h_u, m_{\mathcal{N}(u)} \right )  = \sigma (W_{\text{self}}h_u + W_{\text{neigh}} m_{\mathcal{N}(u)})
$$

where $\sigma$ is an elementwise nonlinearity (e.g., the sigmoid function) and $W_{\text{self}}$, $W_{\text{neigh}}$ are trainable parameter matrices


- combining the sum AGGREGATE function with this UPDATE function, we have

$$
\begin{aligned}
h_u^{(k+1)} &= \text{UPDATE}^{(k)}\left ( h_u^{(k)}, \sum_{v \in \mathcal{N}(u)} h_v^{(k)} \right )
\\&= \sigma \left(W_{\text{self}}^{(k+1)} h_u^{(k)} + W_{\text{neigh}}^{(k+1)} \sum_{v \in \mathcal{N}(u)} h_v^{(k)} + b^{(k+1)} \right) 
\end{aligned}
$$

where $b$ is bias
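The combined update can be written out directly. The following is a minimal sketch with hand-picked (not trained) parameters; the helper names `matvec`, `sigmoid` and `gnn_layer`, the toy graph, and the identity weight matrices are all assumptions for illustration.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gnn_layer(h, neighbors, W_self, W_neigh, b):
    """h_u <- sigma(W_self h_u + W_neigh * sum_{v in N(u)} h_v + b) for every node u."""
    new_h = {}
    for u, h_u in h.items():
        # sum AGGREGATE: component-wise sum over the neighbors' features
        m = [sum(h[v][i] for v in neighbors[u]) for i in range(len(h_u))]
        self_term = matvec(W_self, h_u)
        neigh_term = matvec(W_neigh, m)
        new_h[u] = [sigmoid(s + n + bias)
                    for s, n, bias in zip(self_term, neigh_term, b)]
    return new_h

# Hypothetical toy graph: node 0 is connected to nodes 1 and 2
neighbors = {0: [1, 2], 1: [0], 2: [0]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]

h_next = gnn_layer(h, neighbors, W_self=identity, W_neigh=identity, b=[0.0, 0.0])
print(h_next[0])  # equals [sigmoid(1), sigmoid(2)] for this toy input
```

In a real model the matrices and bias would be learned per layer, and the loop over nodes would be replaced by a sparse matrix product.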

## normalization

- reason for normalization: 

    without it, vertices with denser neighborhoods pass on more messages, 

    so a single vertex can dominate the graph, especially in scale-free graphs such as social networks

    e.g., a celebrity has lots of followers; if you follow them, they will dominate your graph


- normalized aggregation: each message is normalized by the sizes of both neighborhoods, $|\mathcal{N}(u)|$ and $|\mathcal{N}(v)|$ (symmetric normalization)

$$
m_{\mathcal{N}(u)} = \sum_{v \in \mathcal{N}(u)} \frac{h_v}{\sqrt{|\mathcal{N}(u)||\mathcal{N}(v)|}}
$$


- therefore, normalized message passing is

$$
h_u^{(k+1)} = \sigma \left(W_{\text{self}}^{(k+1)} h_u^{(k)} + W_{\text{neigh}}^{(k+1)} \sum_{v \in \mathcal{N}(u)} \frac{h_v^{(k)}}{\sqrt{|\mathcal{N}(u)||\mathcal{N}(v)|}} \right) 
$$

- a common simplification (as in GCN) adds self-loops and shares one weight matrix between the self term and the neighbor term, giving

$$
h_u^{(k+1)} = \sigma \left(W^{(k+1)} \sum_{v \in \mathcal{N}(u) \cup \left\{ u\right\}} \frac{h_v^{(k)}}{\sqrt{|\mathcal{N}(u)||\mathcal{N}(v)|}}\right) 
$$
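To make the effect of the normalization concrete, here is a small sketch that computes the coefficient $1/\sqrt{|\mathcal{N}(u)||\mathcal{N}(v)|}$ on a made-up "celebrity" graph. The graph, the feature values, and the helper names `norm_weight` and `normalized_message` are illustrative assumptions.

```python
import math

def norm_weight(u, v, neighbors):
    """Symmetric normalization coefficient 1 / sqrt(|N(u)| * |N(v)|)."""
    return 1.0 / math.sqrt(len(neighbors[u]) * len(neighbors[v]))

def normalized_message(u, h, neighbors):
    """Normalized sum of the neighbors' features for target node u."""
    dim = len(h[u])
    m = [0.0] * dim
    for v in neighbors[u]:
        c = norm_weight(u, v, neighbors)
        for i in range(dim):
            m[i] += c * h[v][i]
    return m

# Hypothetical graph: node 0 is a hub with 4 followers; node 1 also has a friend, node 5
neighbors = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}
h = {u: [1.0] for u in neighbors}

print(norm_weight(1, 0, neighbors))         # weight of the hub's message to node 1
print(norm_weight(1, 5, neighbors))         # weight of the low-degree friend's message
print(normalized_message(1, h, neighbors))
```

From node 1's point of view, the high-degree hub contributes less than the low-degree friend, which is the intended damping of dominant vertices.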

# graph attention

- attention is a dynamic mechanism to reweight inputs

    the weights are not fixed in advance like a trained weight matrix; they are computed from the inputs
    
    we can choose to pay different amounts of attention to different neighbors


- there is also masked attention

- there are 3 kinds of weights: 

    attention $\alpha$, 
    
    weight matrix $W$, 
    
    adjacency matrix $A$: defines which nodes can be considered as node v's neighbors



- attention score of neighbor node u's message for target node v

    raw score: 
    
    $$e_{vu} = a\left ( W^{(l)}h_u^{(l-1)}, W^{(l)}h_v^{(l-1)} \right )$$

    take softmax:
    
    $$\alpha _{vu}  = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)}\exp(e_{vk}) }$$

    where $a$ is attention function
    
    $h_u$ is feature of node u, $h_v$ is feature of node v
    
    the softmax makes the attention scores of node v over its neighbor nodes sum to 1: $\sum_{u \in \mathcal{N}(v)}\alpha_{vu} =1$
    

- node feature at node v

$$
h_v^{(l)} = \sigma \left(\sum_{u \in \mathcal{N}(v)} {\color{Red} {\alpha _{vu}}} W^{(l)} h_u^{(l-1)} \right)
$$
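The three steps above (raw score, softmax over the neighborhood, weighted sum) can be sketched as follows. Note the assumptions: the attention function $a$ is taken to be a plain dot product, $\sigma$ is taken to be `tanh`, and $W$ is the identity; none of these choices come from the notes, and real attention layers learn both $W$ and $a$.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def attention_layer(v, h, neighbors, W, score=dot):
    """Return (alpha, new h_v) for target node v; assumes v has at least one neighbor."""
    z_v = matvec(W, h[v])
    z = {u: matvec(W, h[u]) for u in neighbors[v]}
    e = {u: score(z[u], z_v) for u in neighbors[v]}       # raw scores e_{vu}
    e_max = max(e.values())                               # subtract the max for stability
    exp_e = {u: math.exp(s - e_max) for u, s in e.items()}
    total = sum(exp_e.values())
    alpha = {u: x / total for u, x in exp_e.items()}      # softmax over N(v)
    weighted = [sum(alpha[u] * z[u][i] for u in neighbors[v])
                for i in range(len(z_v))]
    return alpha, [math.tanh(x) for x in weighted]        # tanh stands in for sigma

# Hypothetical toy graph: node 0 attends over its neighbors 1 and 2
neighbors = {0: [1, 2], 1: [0], 2: [0]}
h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]

alpha, h0_new = attention_layer(0, h, neighbors, identity)
print(alpha)
print(h0_new)
```

With the dot-product score, neighbor 1 (whose features match node 0's) receives more attention than neighbor 2.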

# tasks

3 types of tasks for GNN

1. node level
    
    - node classification: e.g., classify article topics on a citation graph 
    
    - node regression: e.g., predict how much a person likes a genre of music
    
    
2. edge level
    
    - edge weight/strength prediction: missing link prediction, e.g., Facebook suggests who should be your friend
    
    - edge classification: what type of edge it is? e.g., in Facebook, edge types are family, colleagues, classmates
    
    
3. graph level

    graph classification: e.g., does this graph have a 5-clique
    
    graph regression: e.g., how strongly does the molecule encoded by the graph bind to a virus

## node level tasks

- for node regression, predicted label at node v is

$$
\hat y_v = \text{Head}_{\text{node}} \left(h_v^{(L)}\right) = W^{(H)}h_v^{(L)}
$$

where $h_v^{(L)}$ is node embedding of node v after L iterations

$\text{Head}_{\text{node}}$ is a linear function, implemented as the final linear layer


- loss function: MSE loss

$$
\text{MSE} = \frac{1}{|V|} \sum_{v \in V} \left\| \hat y_v - y_v \right\|_2^2
$$

where $y_v$ is ground-truth label at node v
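A minimal sketch of the node-level head and loss follows. It computes the mean squared error as the squared L2 distance between prediction and label, averaged over the labeled nodes. The embeddings, the head weights `W_H`, and the labels are made-up values, and the function names are assumptions.

```python
def linear_head(W_H, h_v):
    """Prediction for one node: y_hat_v = W_H h_v."""
    return [sum(w * x for w, x in zip(row, h_v)) for row in W_H]

def mse(predictions, targets):
    """Squared L2 error per node, averaged over the nodes."""
    total = 0.0
    for y_hat, y in zip(predictions, targets):
        total += sum((a - b) ** 2 for a, b in zip(y_hat, y))
    return total / len(predictions)

# Hypothetical final-layer embeddings for two nodes and a 1-output linear head
embeddings = [[1.0, 2.0], [0.0, 1.0]]
W_H = [[0.5, 1.0]]
labels = [[2.0], [1.0]]

predictions = [linear_head(W_H, h_v) for h_v in embeddings]
loss = mse(predictions, labels)
print(predictions, loss)
```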

### GraphSAGE: Node classification

- inductive, message passing network


- use aggregation to compute node embeddings for node classification

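    - mean aggregation: averages the neighbors' embeddings; the node's own embedding enters either by concatenation (skip connection) or, in the variant below, by being included in the mean
    
    $$
    h_v^k \leftarrow \sigma \left(W \cdot \text{MEAN} \left(\left\{ h_v^{k-1} \right\} \cup \left\{ h_u^{k-1}, \forall u \in \mathcal{N}(v) \right\} \right) \right)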
    $$

    - LSTM aggregation: visit nodes around neighborhood in random order to perform aggregation



## edge level tasks

- for edge weight/strength prediction, predicted label for edge $(u, v)$ is

$$
\hat y_{uv} = \text{Head}_{\text{edge}} \left(h_u^{(L)}, h_v^{(L)}\right)
$$

where $\text{Head}_{\text{edge}}$ is a fully connected linear layer

$h_u^{(L)}, h_v^{(L)}$ are node embeddings for node u and node v, respectively
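Two common ways to turn a pair of node embeddings into one edge prediction are sketched below: a linear layer on the concatenated embeddings, and a parameter-free dot product. The dot-product variant is an addition for comparison rather than something stated in these notes, and the weights and embeddings are made-up values.

```python
def edge_head_linear(h_u, h_v, w, b=0.0):
    """Linear layer applied to the concatenation [h_u, h_v]."""
    concat = list(h_u) + list(h_v)
    return sum(wi * xi for wi, xi in zip(w, concat)) + b

def edge_head_dot(h_u, h_v):
    """Parameter-free alternative: similarity of the two embeddings."""
    return sum(a * b for a, b in zip(h_u, h_v))

# Hypothetical final-layer embeddings of nodes u and v
h_u = [1.0, 2.0]
h_v = [3.0, -1.0]

score_linear = edge_head_linear(h_u, h_v, w=[0.1, 0.2, 0.3, 0.4])
score_dot = edge_head_dot(h_u, h_v)
print(score_linear, score_dot)
```

The dot product is symmetric in $u$ and $v$, which suits undirected graphs; the concatenation-based head is not symmetric unless the weights are constrained.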

## graph level tasks

- a clique: a concept in graph theory, 

    a subset of vertices of an undirected graph such that **every 2 distinct vertices in the clique are adjacent**

- the term clique comes from Luce & Perry (1949), 

    who used complete subgraphs in social networks to model cliques of people; 
    
    that is, groups of people all of whom know each other. 


- Cliques have many other applications in the sciences and particularly in bioinformatics.

https://en.wikipedia.org/wiki/Clique_(graph_theory)

![image-2.png](attachment:image-2.png)
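The clique definition translates directly into a check over all vertex pairs. The sketch below is a brute-force illustration only (the helper names and the example graph are assumptions), and the exhaustive search in `has_k_clique` is only practical for small graphs.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every 2 distinct vertices in `vertices` are adjacent (undirected graph)."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((u, v)) in edge_set for u, v in combinations(vertices, 2))

def has_k_clique(nodes, edges, k):
    """Brute-force search over all k-subsets of the nodes."""
    return any(is_clique(subset, edges) for subset in combinations(nodes, k))

# Hypothetical graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

print(is_clique([0, 1, 2], edges))
print(is_clique([0, 1, 3], edges))
print(has_k_clique(range(4), edges, 3))
print(has_k_clique(range(4), edges, 4))
```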

- graph level pooling: mean/max/sum


- caveat: pooling can result in a loss of information


- mean pooling: take average of all the node embeddings over the graph G

    the predicted graph label is

$$
\hat y_G = \text{Mean}\left( \left\{h_v^{(L)}, \forall v \in G \right\}\right)
$$

where $h_v^{(L)} \in \mathbb{R}^d$ is node embedding of node v
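Mean pooling and its information loss can be shown with a tiny sketch. The two sets of node embeddings below are made up so that they differ node by node but share the same average; `mean_pool` is an illustrative helper name.

```python
def mean_pool(node_embeddings):
    """Average the node embeddings component-wise to get one vector per graph."""
    n = len(node_embeddings)
    return [sum(column) / n for column in zip(*node_embeddings)]

# Hypothetical final-layer embeddings of two different 2-node graphs
graph_a = [[1.0, 1.0], [3.0, 3.0]]
graph_b = [[2.0, 2.0], [2.0, 2.0]]

pooled_a = mean_pool(graph_a)
pooled_b = mean_pool(graph_b)
print(pooled_a, pooled_b)
```

Both graphs pool to the same vector even though their node embeddings differ, which is the loss of information mentioned in the caveat.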

- e.g. of graph pooling network

# property of graph/node embeddings

## permutation equivariance

- regardless of how nodes are ordered in the adjacency matrix, graph representations stay the same (invariance) and node representations are reordered in the same way as the nodes (equivariance)


- computed messages don't depend on index of nodes in a graph


- this is achieved via **permutation invariant aggregation** operations: mean, sum, max


- how to make message passing **not permutation equivariant**?

    use identity features on the nodes, e.g., the position of a node in the graph.
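A quick way to convince yourself that mean, sum and max do not depend on node order is to aggregate the same messages in two different orders. The sketch below uses made-up message vectors; concatenation is included only as an order-dependent counterexample.

```python
def agg_sum(messages):
    return [sum(column) for column in zip(*messages)]

def agg_mean(messages):
    return [sum(column) / len(messages) for column in zip(*messages)]

def agg_max(messages):
    return [max(column) for column in zip(*messages)]

def agg_concat(messages):
    """Order-dependent aggregation, shown only as a counterexample."""
    return [x for message in messages for x in message]

messages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
permuted = [messages[2], messages[0], messages[1]]  # same multiset, different node order

for aggregate in (agg_sum, agg_mean, agg_max):
    print(aggregate.__name__, aggregate(messages) == aggregate(permuted))
print("agg_concat", agg_concat(messages) == agg_concat(permuted))
```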

    

## inductive capability

- if the model is trained on one graph,

    and a new node is added to the graph, it can generalize to the new graph easily 
    
    reason: the same operations repeat at all the nodes, and weights are shared across all the nodes

# limitations of GNN

- message passing can lead to oversmoothing

    smooth: nodes on the graph have the same message
    
    result: nodes indistinguishable from each other

# Weisfeiler-Lehman Test of Isomorphism

- the test works like a message passing GNN with vanilla aggregation functions: it iteratively labels the nodes based on connectivity

- start with all nodes of the same color (same node label "0")

- aggregates color of a node with its neighbors ("00" or "000")

- hashes aggregated colors into new label ("00"=1, "000"=2)

- aggregates new labels ("12" or "21")

- repeats hashing

- stops when coloring is stable

- if at that point the 2 graphs are colored differently, they are non-isomorphic

## limitation: false positives

- some graphs are colored the same but are not isomorphic, so identical colorings do not prove isomorphism
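The coloring procedure and its false-positive limitation can both be sketched in one short function. This is an illustrative implementation under stated assumptions: graphs are given as adjacency dictionaries, the "hash" is a lookup table shared by both graphs so that labels are comparable, and the classic example of two triangles versus a 6-cycle is used as the false positive. The function name `wl_test` is hypothetical.

```python
from collections import Counter

def wl_test(graph_a, graph_b):
    """Return False if the graphs are certainly non-isomorphic,
    True if color refinement cannot tell them apart."""
    graphs = [graph_a, graph_b]
    colors = [{u: 0 for u in g} for g in graphs]  # all nodes start with label 0
    for _ in range(max(len(g) for g in graphs)):
        table = {}  # shared "hash": (own color, sorted neighbor colors) -> new label
        new_colors = []
        for g, col in zip(graphs, colors):
            new = {}
            for u in g:
                signature = (col[u], tuple(sorted(col[v] for v in g[u])))
                new[u] = table.setdefault(signature, len(table))
            new_colors.append(new)
        # different color histograms: the graphs cannot be isomorphic
        if Counter(new_colors[0].values()) != Counter(new_colors[1].values()):
            return False
        # coloring is stable once the number of distinct colors stops growing
        if all(len(set(n.values())) == len(set(c.values()))
               for n, c in zip(new_colors, colors)):
            return True
        colors = new_colors
    return True

path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(wl_test(path, triangle))          # distinguished by the test
print(wl_test(two_triangles, hexagon))  # not distinguished, although not isomorphic
```

In the second pair every node has exactly two neighbors with the same color at every round, so the coloring never refines and the test cannot separate the graphs.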

## Power: GNN <= WL Test

- in a GNN that uses message passing, each node aggregates neighbor features

    thus it's very similar to its neighbors


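- but the choice of aggregation function is important for giving the GNN as much power as the WL algorithm


- popular aggregators such as mean and max are less powerful than WL, because they map different neighbor multisets to the same value


- Xu et al. proposed the Graph Isomorphism Network (GIN), which uses sum aggregation followed by an MLP and is as powerful as the WL test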

$$
h_v^{(k)} = \text{MLP}^{(k)}\left ( \left ( 1+ \epsilon ^{(k)} \right )h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\right)
$$
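A small numeric sketch shows why the sum inside the update above matters. With scalar features, two different neighbor multisets can have the same mean and the same max while their sums differ. The `toy_mlp` below is a hypothetical stand-in for a trained MLP (an injective affine map), and `gin_update` is an illustrative name.

```python
def gin_update(h_v, neighbor_feats, mlp, eps=0.0):
    """h_v <- MLP((1 + eps) * h_v + sum of neighbor features), for scalar features."""
    return mlp((1.0 + eps) * h_v + sum(neighbor_feats))

def toy_mlp(x):
    """Hypothetical stand-in for a trained MLP: an injective affine map."""
    return 2.0 * x + 1.0

def mean(values):
    return sum(values) / len(values)

neighbors_a = [1.0, 1.0]  # node a has two neighbors with feature 1
neighbors_b = [1.0]       # node b has one neighbor with feature 1

print(mean(neighbors_a) == mean(neighbors_b))  # mean cannot separate them
print(max(neighbors_a) == max(neighbors_b))    # max cannot separate them
print(sum(neighbors_a) == sum(neighbors_b))    # sum can

out_a = gin_update(1.0, neighbors_a, toy_mlp)
out_b = gin_update(1.0, neighbors_b, toy_mlp)
print(out_a, out_b)
```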

## solution 1: increase power within an aggregation

- don't just add layers (message passing iterations) to GNNs; this avoids oversmoothed embeddings


- increase power within an aggregation

    if needed, each aggregation or update function could itself be a 3-layer MLP

## solution 2: add skip connections

- forward the outputs of all the GNN layers to the last layer of the GNN

## solution 3: add FC layers

# augmentation/engineering

## node feature augmentation/engineering

- add features to make GNNs more expressive

- local aggregation won't distinguish some structures, but we can add vertex features that do

    e.g., cycle count, node degree, edge degree, node centrality, anything from graph theory
    

- unique identities to nodes
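As a sketch of such augmentation, the helper below computes two structural features per node: its degree and the number of triangles it belongs to. The function name and the choice of these two features are illustrative assumptions, and node ids are assumed to be comparable (e.g., integers). The example graphs are two triangles and a 6-cycle, where every node has degree 2 in both graphs.

```python
def degree_and_triangles(neighbors):
    """Return {node: [degree, triangle_count]} for an undirected graph."""
    features = {}
    for u, nbrs in neighbors.items():
        nbr_set = set(nbrs)
        # each edge between two neighbors of u closes one triangle through u
        triangles = sum(1 for v in nbr_set for w in neighbors[v]
                        if w in nbr_set and v < w)
        features[u] = [float(len(nbr_set)), float(triangles)]
    return features

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

feats_triangles = degree_and_triangles(two_triangles)
feats_hexagon = degree_and_triangles(hexagon)
print(feats_triangles[0], feats_hexagon[0])
```

The degree feature is identical in both graphs, but the triangle count differs, so a GNN fed these features has a chance to tell the graphs apart.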

## graph augmentation/engineering

- problems with the graph depend on the domain of research:

    - graph can be too sparse
    
    - too dense
    
    - not connected enough
    
- can add virtual nodes and edges to the graph

    when calculating the loss, just ignore the labels on virtual nodes


- subsample or cluster nodes for the graph
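The virtual-node idea can be sketched as follows: add one extra node connected to every original node, then skip it when computing the loss. The node id `"virtual"`, the helper names, and the toy predictions are assumptions for illustration.

```python
def add_virtual_node(neighbors, virtual_id="virtual"):
    """Return a copy of the graph with one extra node connected to every original node."""
    augmented = {u: list(nbrs) + [virtual_id] for u, nbrs in neighbors.items()}
    augmented[virtual_id] = list(neighbors)
    return augmented

def masked_mse(pred, target, ignore):
    """Mean squared error over the nodes that are not in `ignore`."""
    errors = [(pred[u] - target[u]) ** 2 for u in pred if u not in ignore]
    return sum(errors) / len(errors)

# Hypothetical disconnected graph: components {0, 1} and {2, 3}
neighbors = {0: [1], 1: [0], 2: [3], 3: [2]}
augmented = add_virtual_node(neighbors)
print(augmented)

# Made-up predictions; the virtual node has no label and is masked out of the loss
pred = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, "virtual": 100.0}
target = {0: 1.0, 1: 2.0, 2: 3.0, 3: 2.0}
loss = masked_mse(pred, target, ignore={"virtual"})
print(loss)
```

After the augmentation, any two original nodes are at most two hops apart, so messages can cross between the previously disconnected components.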

# Readings

- GraphSAGE


- WL-Test  

    https://towardsdatascience.com/expressive-power-of-graph-neural-networks-and-the-weisefeiler-lehman-test-b883db3c7c49
    
    https://arxiv.org/pdf/1810.00826.pdf


- GCN https://arxiv.org/pdf/1609.02907.pdf