Geometric Deep Learning

In the last decade, Deep Learning approaches (e.g. Convolutional Neural Networks and Recurrent Neural Networks) have achieved unprecedented performance on a broad range of problems from a variety of fields (e.g. Computer Vision and Speech Recognition). Despite these results, research on DL techniques has so far focused mainly on data defined on Euclidean domains (i.e. grids). Nonetheless, in a multitude of fields such as Biology, Physics, Network Science, Recommender Systems and Computer Graphics, one may have to deal with data defined on non-Euclidean domains (i.e. graphs and manifolds). The adoption of Deep Learning in these fields lagged behind until very recently, primarily because the non-Euclidean nature of the data makes the definition of basic operations (such as convolution) rather elusive. Geometric Deep Learning deals with the extension of Deep Learning techniques to graph- and manifold-structured data.

Roughly speaking, images are stored in a computer as matrices. The spatial distribution of pixels on the screen projects a colorful, digitized world to us. Convolutional neural networks (ConvNets or CNNs) have proved efficient at processing and analyzing images for visual cognitive tasks. What if we generalize these methods to graph structures, which can be represented as adjacency matrices?

| Image | Graph |
| --- | --- |
| Convolutional Neural Network | Graph Convolution Network |
| Attention | Graph Attention |
| Gated | Gated Graph Network |
| Generative | Generative Models for Graphs |

Advances in natural language processing (NLP) have shone a light on semantic embedding as one potential approach to knowledge representation. Text or symbols, strings in a computer, are designed for people to communicate and understand based on context or situation, i.e., the connections between entities and concepts are essential. What if we generalize these methods to connected data?

Graph Embedding

Graph embedding, a preprocessing step for graph data, is an example of representation learning: it finds a proper numerical representation of a graph data structure by mapping the graph structure to a numerical domain: $f:\mathbf{G}\mapsto \mathbb{R}^{n}$.

DeepWalk

DeepWalk is an approach for learning latent representations of vertices in a network, which maps the nodes in the graph into real vectors: $$ f: \mathbb{V}\to\mathbb{R}^{d}. $$

DeepWalk generalizes recent advances in language modeling and unsupervised feature learning (deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. If we regard text as a directed graph, word2vec is a specific example of DeepWalk. Given the word sequence $\mathbb{W}=(w_0, w_1, \dots, w_n)$, we can compute the conditional probability $P(w_n|w_0, w_1, \dots, w_{n-1})$, and, in embedded form, $$P(w_n\mid f(w_0), f(w_1), \cdots, f(w_{n-1})).$$

  • DeepWalk$(G, w, d, \gamma, t)$
  • Input: graph $G(V, E)$; window size $w$; embedding size $d$; walks per vertex $\gamma$; walk length $t$.
  • Output: matrix of vertex representations $\Phi\in\mathbb{R}^{|V|\times d}$
  • Initialization: Sample $\Phi$ from $\mathbb{U}^{|V|\times d}$;
    • Build a binary tree $T$ from $V$;
    • for $i = 0$ to $\gamma$ do
      • $O = \text{Shuffle}(V)$
      • for each $v_i \in O$ do
        • $W_{v_i} = \text{RandomWalk}(G, v_i, t)$
        • $\text{SkipGram}(\Phi, W_{v_i}, w)$
      • end for
    • end for

$SkipGram(Φ, W_{v_i}, w)$

    1. for each $v_j \in W_{v_i}$ do
      1. for each $u_k \in W_{v_i}[j - w : j + w]$ do
      1. $J(\Phi)=-\log Pr(u_k\mid \Phi(v_j))$
      1. $\Phi =\Phi -\alpha\frac{\partial J}{\partial \Phi}$
      1. end for
    1. end for

Computing the partition function (normalization factor) is expensive, so instead we factorize the conditional probability using hierarchical softmax. If the path to vertex $u_k$ is identified by a sequence of tree nodes $(b_0, b_1, \cdots , b_{\lceil\log |V|\rceil})$ with $b_0 = \text{root}$ and $b_{\lceil\log |V|\rceil} = u_k$, then $$ Pr(u_k\mid \Phi(v_j)) =\prod_{l=1}^{\lceil\log |V|\rceil}Pr(b_l\mid \Phi(v_j)). $$

Now, $Pr(b_l\mid \Phi(v_j))$ can be modeled by a binary classifier assigned to the parent of node $b_l$, as below $$ Pr(b_l\mid \Phi(v_j))=\frac{1}{1 + \exp(-\Phi(v_j)\cdot \Phi(b_l))} $$ where $\Phi(b_l)$ is the representation assigned to tree node $b_l$'s parent.
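
The procedure above translates directly into a few lines of code. Below is a minimal sketch, assuming networkx for the graph and gensim for the skip-gram step with hierarchical softmax; the example graph, walk counts and dimensions are illustrative choices, not the reference implementation:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, walk_length):
    """Truncated random walk W_v of length t starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]              # gensim expects string tokens

def deepwalk(G, num_walks=10, walk_length=40, dim=64, window=5):
    walks = []
    for _ in range(num_walks):                 # gamma passes over the vertices
        nodes = list(G.nodes())
        random.shuffle(nodes)                  # O = Shuffle(V)
        walks += [random_walk(G, v, walk_length) for v in nodes]
    # SkipGram with hierarchical softmax (hs=1), as in the pseudocode
    model = Word2Vec(walks, vector_size=dim, window=window,
                     sg=1, hs=1, min_count=0, workers=2)
    return model.wv                            # Phi: |V| x d vertex embeddings

embeddings = deepwalk(nx.karate_club_graph())
print(embeddings[str(0)].shape)                # (64,)
```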

node2vec

node2vec is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks.

By extending the Skip-gram architecture to networks, it seeks to optimize the following objective function, which maximizes the log-probability of observing a network neighborhood $N_{S}(u)$ for a node $u$ conditioned on its feature representation, given by $f$: $$ \max_{f} \sum_{u\in V}\log Pr(N_S(u)\mid f(u)) $$

Conditional independence and Symmetry in feature space are expected to make the optimization problem tractable. We model the conditional likelihood of every source-neighborhood node pair as a softmax unit parametrized by a dot product of their features: $$ Pr(n_i\mid f(u))=\frac{\exp(f(n_i)\cdot f(u))}{\sum_{v\in V} \exp(f(v)\cdot f(u))}. $$

The objective function simplifies to $$\max_{f}\sum_{u\in V}\Big[-\log Z_u + \sum_{n_i\in N_S(u)} f(n_i)\cdot f(u)\Big],$$ where $Z_u=\sum_{v\in V}\exp(f(u)\cdot f(v))$ is the per-node partition function.
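
What distinguishes node2vec from DeepWalk is how the neighborhood $N_S(u)$ is sampled: a second-order random walk biased by a return parameter $p$ and an in-out parameter $q$. A minimal sketch of that walk, assuming networkx and unweighted edges (illustrative, not the authors' implementation):

```python
import random
import networkx as nx

def biased_walk(G, start, walk_length, p=1.0, q=1.0):
    """node2vec-style 2nd-order walk: return parameter p, in-out parameter q."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = list(G.neighbors(cur))
        if not nbrs:
            break
        if len(walk) == 1:                    # first step: uniform choice
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:                     # distance 0 from the previous node
                weights.append(1.0 / p)
            elif G.has_edge(prev, x):         # distance 1
                weights.append(1.0)
            else:                             # distance 2
                weights.append(1.0 / q)
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return walk

G = nx.karate_club_graph()
print(biased_walk(G, start=0, walk_length=10, p=0.25, q=4.0))
```

The walks produced this way are then fed to the same skip-gram objective as above.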

struc2vec

This is a paper about identifying nodes in graphs that play a similar role based solely on the structure of the graph, for example computing the structural identity of individuals in social networks. That’s nice and all that, but what I personally find most interesting about the paper is the meta-process by which the authors go about learning the latent distributed vectors that capture the thing they’re interested in (structural similarity in this case). Once you’ve got those vectors, you can do vector arithmetic, distance calculations, use them in classifiers and other higher-level learning tasks and so on. As word2vec places semantically similar words close together in space, so we want structurally similar nodes to be close together in space.

Struc2vec has four main steps:

  1. Determine the structural similarity between each vertex pair in the graph, for different neighborhood sizes.
  2. Construct a weighted multi-layer graph, in which each layer corresponds to a level in a hierarchy measuring structural similarity (think: 'at this level of zoom, these things look kind of similar').
  3. Use the multi-layer graph to generate context for each node based on biased random walking.
  4. Apply standard techniques to learn a latent representation from the context given by the sequence of nodes in the random walks.

Word Embedding

word2vec

In natural language processing, a word can be regarded as a node in a graph that only captures relations of locality or context. It is difficult to learn the concepts or meanings of words. The word embedding technique word2vec maps the words to fixed-length real vectors: $$ f: \mathbb{W}\to\mathbb{V}^d\subset \mathbb{R}^d. $$

The skip-gram model assumes that a word can be used to generate the words that surround it in a text sequence. We assume that, given the central target word, the context words are generated independently of each other.

The conditional probability of generating the context word for the given central target word can be obtained by performing a softmax operation on the vector inner product: $$ P(w_o|w_c) = \frac{\exp(u_o^T u_c)}{\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)}, $$

where the vocabulary index set is $V = \{0,1,\dots, |V|-1\}$. Assume that a text sequence of length ${T}$ is given, where the word at time step ${t}$ is denoted as $w^{(t)}$, and that context words are generated independently given the center words. When the context window size is ${m}$, the likelihood function of the skip-gram model is the joint probability of generating all the context words given any center word: $$ \prod_{t=1}^{T}\prod_{-m\leq j \leq m,\ j\neq 0}{P}(w^{(t+j)}\mid w^{(t)}). $$

Here, any time step that is less than 1 or greater than ${T}$ can be ignored.

The skip-gram model parameters are the central target word vector and context word vector for each individual word. In the training process, we learn the model parameters by maximizing the likelihood function, which is known as maximum likelihood estimation. This is equivalent to minimizing the following loss function: $$ -\log\Big(\prod_{t=1}^{T}\prod_{-m\leq j \leq m,\ j\neq 0}{P}(w^{(t+j)}\mid w^{(t)})\Big) = -\sum_{t=1}^{T}\sum_{-m\leq j \leq m,\ j \neq 0} \log {P}(w^{(t+j)}\mid w^{(t)}). $$

And we can compute the negative logarithm of the conditional probability: $$ -\log P(w_o\mid w_c) = -\log\frac{\exp(u_o^T u_c)}{\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)} = -u_o^T u_c + \log\Big(\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)\Big). $$

Then we can compute the gradient (or Hessian) of the loss function to update the parameters, for example:

$$ \frac{\partial \log P(w_o\mid w_c)}{\partial u_c} = \frac{\partial }{\partial u_c} \Big[u_o^T u_c - \log\Big(\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)\Big)\Big] = u_o - \sum_{j\in\mathbb{V}}\frac{ \exp(u_j^T u_c) }{\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)} u_j = u_o - \sum_{j\in\mathbb{V}}P(w_j\mid w_c)\, u_j. $$
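
As a sanity check on the last expression, here is a small numpy sketch (toy vocabulary, random vectors; not part of any library) that evaluates the gradient $u_o - \sum_{j}P(w_j\mid w_c)\,u_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                       # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))        # one vector u_i per word (toy parameters)

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def skipgram_grad(c, o, U):
    """Gradient of log P(w_o | w_c) with respect to the center vector u_c."""
    probs = softmax(U @ U[c])      # P(w_j | w_c) for every j in the vocabulary
    return U[o] - probs @ U        # u_o - sum_j P(w_j | w_c) u_j

print(skipgram_grad(c=2, o=5, U=U))
```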

The continuous bag of words (CBOW) model is similar to the skip-gram model. The biggest difference is that the CBOW model assumes that the central target word is generated based on the context words before and after it in the text sequence. Let central target word $w_c$ be indexed as $c$ , and context words $w_{o_1},\cdots, w_{o_{2m}}$ be indexed as $o_1,\cdots,o_{2m}$ in the dictionary. Thus, the conditional probability of generating a central target word from the given context word is

$$ P(w_c|w_{o_1},\cdots, w_{o_{2m}}) = \frac{\exp(\frac{1}{2m}u_c^T(u_{o_1}+ \cdots + u_{o_{2m}}))}{\sum_{i\in V}\exp(\frac{1}{2m} u_i^T(u_{o_1}+ \cdots + u_{o_{2m}}))}. $$

Doc2Vec


Gradient Boosted Categorical Embedding and Numerical Trees

Gradient Boosted Categorical Embedding and Numerical Trees (GB-CSENT) combines tree-based models and matrix-based embedding models in order to handle numerical features and large-cardinality categorical features. A prediction is based on:

  • Bias terms from each categorical feature.
  • Dot-product of embedding features of two categorical features,e.g., user-side v.s. item-side.
  • Per-categorical decision trees based on numerical features: an ensemble of numerical decision trees where each tree is based on one categorical feature.

In detail, it is as follows: $$ \hat{y}(x) = \underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{\text{bias}} + \underbrace{\Big(\sum_{a_i\in U(a)} Q_{a_i}\Big)^{T}\Big(\sum_{a_i\in I(a)} Q_{a_i}\Big)}_{\text{factors}}}_{\text{CAT-E}} + \underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{\text{CAT-NT}}. $$ It is decomposed as in the following table.


| Ingredients | Formulae | Features |
| --- | --- | --- |
| Factorization Machines | $\underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{\text{bias}} + \underbrace{\big(\sum_{a_i\in U(a)} Q_{a_i}\big)^{T}\big(\sum_{a_i\in I(a)} Q_{a_i}\big)}_{\text{factors}}}_{\text{CAT-E}}$ | Categorical Features |
| GBDT | $\underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{\text{CAT-NT}}$ | Numerical Features |
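
To make the decomposition concrete, here is a toy numpy sketch of how a CAT-E + CAT-NT score could be assembled; the function name, the stub "trees", and all numbers are hypothetical and only illustrate the three terms of the formula:

```python
import numpy as np

def gb_csent_score(user_cats, item_cats, numeric_x, w, Q, trees):
    """Toy scoring sketch of CAT-E + CAT-NT (illustrative, not the paper's code).

    user_cats / item_cats : categorical feature ids on the user / item side
    w : bias per category id, Q : embedding per category id
    trees : dict mapping a category id to a regression function of numeric_x
    """
    bias = sum(w[a] for a in user_cats + item_cats)                 # bias term
    factors = np.dot(sum(Q[a] for a in user_cats),                  # embedding
                     sum(Q[a] for a in item_cats))                  # dot product
    cat_nt = sum(t(numeric_x) for t in trees.values())              # per-cat trees
    return bias + factors + cat_nt

# toy usage: two category ids on each side, one stub "tree" per category
w = np.zeros(6); Q = 0.1 * np.ones((6, 3))
trees = {0: lambda x: 0.5 * x[0], 3: lambda x: x[1] - 1.0}
print(gb_csent_score([0, 1], [3, 4], np.array([2.0, 1.5]), w, Q, trees))
```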

Gaussian Auto Embeddings

http://koaning.io/gaussian-auto-embeddings.html

Atom2Vec

M. V. Diudea, I. Gutman and L. Jantschi wrote in the preface of the book Molecular Topology: One of the principal goals of chemistry is to establish (causal) relations between the chemical and physical (experimentally observable and measurable) properties of a substance and the structure of the corresponding molecules. Countless results along these lines have been obtained, and their presentation comprises significant parts of textbooks of organic, inorganic and physical chemistry, not to mention treatises on theoretical chemistry.

tile2Vec

Graph Embedding

graph2vec

graph2vec learns data-driven distributed representations of arbitrarily sized graphs in an unsupervised and task-agnostic manner.

Deep Semantic Embedding

Hyperbolic Embeddings

---

Graph Convolutional Network

If an image is regarded as a matrix in a computer and text as a chain (or sequence), their representations contain all the spatial and semantic information of the entity.

A graph can be represented as an adjacency matrix, as shown in Graph Algorithm. However, the adjacency matrix only describes the connections between the nodes; the features of the nodes do not appear, yet the nodes themselves really matter. For example, chemical bonds can be represented as an adjacency matrix, while the atoms in a molecule really determine its properties.

A simple and direct way is to concatenate the feature matrix $X\in \mathbb{R}^{N\times E}$ and the adjacency matrix $A\in\mathbb{B}^{N\times N}$, i.e., $X_{in}=[X, A]\in \mathbb{R}^{N\times (N+E)}$. Here $\mathbb{B}$ is binary: each element $a_{i,j}$ of the adjacency matrix is ${1}$ if node ${i}$ is adjacent to node ${j}$ and 0 otherwise.

And what is the output? How can deep learning apply to them? And how can we extend the tree-based algorithms such as decision tree into graph-based algorithms?

For these models, the goal is then to learn a function of signals/features on a graph $G=(V,E)$ which takes as input:

  • A feature description $x_i$ for every node $i$; summarized in a $N\times D$ feature matrix ${X}$ ($N$: number of nodes, $D$: number of input features);
  • A representative description of the graph structure in matrix form; typically in the form of an adjacency matrix ${A}$ (or some function thereof)

and produces a node-level output $Z$ (an $N\times F$ feature matrix, where $F$ is the number of output features per node). Graph-level outputs can be modeled by introducing some form of pooling operation (see, e.g. Duvenaud et al., NIPS 2015).

Every neural network layer can then be written as a non-linear function $$ {H}_{i+1} = \sigma \circ ({H}_{i}, A) $$ with ${H}_{0} = {X}_{in}$ and ${H}_{d} = Z$ (or $z$ for graph-level outputs), $d$ being the number of layers. The specific models then differ only in how $\sigma$ is chosen and parameterized.

For example, we can consider a simple layer-wise propagation rule $$ {H}_{i+1} = \sigma \circ ({H}_{i}, A)=\sigma \circ(A {H}_{i} {W}_{i}) $$ where ${W}_{i}$ is a weight matrix for the $i$-th neural network layer and $\sigma(\cdot)$ is a non-linear activation function such as ReLU.

  • But first, let us address two limitations of this simple model: multiplication with $A$ means that, for every node, we sum up all the feature vectors of all neighboring nodes but not the node itself (unless there are self-loops in the graph). We can "fix" this by enforcing self-loops in the graph: we simply add the identity matrix $I$ to $A$.

  • The second major limitation is that $A$ is typically not normalized, so multiplication with $A$ will completely change the scale of the feature vectors (we can understand this by looking at the eigenvalues of $A$). Normalizing ${A}$ such that all rows sum to one, i.e. $D^{-1}A$, where $D$ is the diagonal node degree matrix, gets rid of this problem.

In fact, the propagation rule introduced in Kipf & Welling (ICLR 2017) is given by: $$ {H}_{i+1} = \sigma \circ ({H}_{i}, A)=\sigma \circ(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} {H}_{i} {W}_{i}), $$ with $\hat{A}=A+I$, where $I$ is the identity matrix and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$. See more details at Multi-layer Graph Convolutional Networks (GCN) with first-order filters.

Like other neural networks, a GCN is a composition of linear and nonlinear mappings. In detail,

  1. $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ normalizes the graph structure;
  2. the next step multiplies the node features by the layer weights;
  3. nonlinearities are added via the activation function $\sigma$ (see the sketch below).
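
A minimal numpy sketch of one such layer, following the three steps above on a toy path graph (random weights, illustrative only):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: sigma(D^-1/2 (A+I) D^-1/2 H W) with ReLU as sigma."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_hat = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_hat))          # normalize the graph structure
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W    # mix neighbors, apply weights
    return np.maximum(H_next, 0.0)                      # nonlinearity

# toy graph: 4 nodes on a path, 3 input features, 2 output features per node
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H0 = np.random.default_rng(0).normal(size=(4, 3))
W0 = np.random.default_rng(1).normal(size=(3, 2))
print(gcn_layer(A, H0, W0))
```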

See more at experoinc.com or https://tkipf.github.io/.

$\color{navy}{\text{Graph convolutional networks have the potential to support reasoning, as a blend of probabilistic graphical models and deep learning.}}$

GCN can be regarded as the counterpart of CNN for graphs, so optimization techniques such as normalization, attention mechanisms and even adversarial variants can be extended to the graph setting.

Spectral ConvNets

Compositional layers of convolutional neural network can be expressed as

$$ \hat{H}_{i} = P\oplus H_{i-1}, \quad \tilde{H}_i = C_i\otimes \hat{H}_{i}, \quad Z_i = \mathrm{N}\cdot \tilde{H}_i, \quad H_i = \text{Pooling}\cdot (\sigma\circ Z_i) $$

where $\otimes,\oplus,\cdot$ represent convolution operation, padding and pooling, respectively.

Xavier Bresson gave a talk at the New Deep Learning Techniques workshop (February 5-9, 2018). We would ideally like our graph convolutional layer to have:

  • Computational and storage efficiency (requiring no more than $O(E+V)$ time and memory);
  • Fixed number of parameters (independent of input graph size);
  • Localisation (acting on a local neighbourhood of a node);
  • Ability to specify arbitrary importances to different neighbours;
  • Applicability to inductive problems (arbitrary, unseen graph structures).

|   | CNN | GCN |
| --- | --- | --- |
| padding | ? | ? |
| convolution | ? | Information of neighbors |
| pooling | ? | Invariance |
  • Spectral graph theory allows us to redefine convolution in the context of graphs with Fourier analysis.
  • Graph downsampling $\iff$ graph coarsening $\iff$ graph partitioning: Decompose ${G}$ into smaller meaningful clusters.
  • Structured pooling: Arrangement of the node indexing such that adjacent nodes are hierarchically merged at the next coarser level.

Laplacian operator is represented as a positive semi-definite $n \times n$ matrix:

| Laplacian | Representation |
| --- | --- |
| Unnormalized Laplacian | $\Delta = \bf{D - A}$ |
| Normalized Laplacian | $\Delta = \bf{I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}}$ |
| Random walk Laplacian | $\Delta = \bf{I - D^{-1}A}$ |

$\mathbf A$: adjacency matrix, $\mathbf D$: degree matrix.
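
For a concrete adjacency matrix, the three variants differ only in how the degree matrix enters; a small numpy sketch (toy graph, illustrative only):

```python
import numpy as np

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))                          # degree matrix
D_inv = np.linalg.inv(D)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
I = np.eye(A.shape[0])

L_unnorm = D - A                                    # unnormalized Laplacian
L_norm   = I - D_inv_sqrt @ A @ D_inv_sqrt          # normalized Laplacian
L_rw     = I - D_inv @ A                            # random walk Laplacian
print(np.linalg.eigvalsh(L_norm))                   # eigenvalues lie in [0, 2]
```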

Eigendecomposition of graph Laplacian:

$$\Delta = {\Phi} {\Lambda} {\Phi}^{T}$$

where ${\Phi}$ is the matrix whose columns are the eigenvectors and ${\Lambda}= \mathrm{diag}({\lambda}_1,\dots, {\lambda}_n )$. In matrix-vector notation, with the $n\times n$ Fourier matrix $\Phi$ and an $n$-dimensional vector $f\in\mathbb{R}^{n}$, if $\hat{f}={\Phi}^{T}f$ is the projection of $f$ onto the column space of $\Phi$, where $\hat{f}_{i}$ is the inner product $\langle f, \phi_i\rangle$, then $f={\Phi}\hat{f}$ is the inverse Fourier transform.

Convolution of two vectors $f=(f_1, \cdots, f_n )^{T}$ and $g=(g_1, \cdots, g_n )^{T}$ is defined as $(f\star g)_i = \sum_{m} g_{(i-m)\bmod n}\cdot f_m$, or in matrix notation $$ (f\star g)= \underbrace{\begin{pmatrix} g_1 & g_2 & \cdots & g_{n-1} & g_n \\ g_n & g_1 & \cdots & g_{n-2} & g_{n-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ g_3 & g_4 & \cdots & g_{1} & g_2 \\ g_2 & g_3 & \cdots & g_{n} & g_1 \end{pmatrix}}_{\text{circulant matrix } G} \begin{pmatrix} f_1 \\ \vdots \\ f_n \end{pmatrix} = \underbrace{ {\Phi}\, \mathrm{diag}(\hat{g}_1, \cdots, \hat{g}_n)\, \Phi^T }_{G}\, f = {\Phi}({\Phi}^Tg\circ {\Phi}^{T} f) $$

where the last equality holds because matrix multiplication is associative and $\Phi^Tg$ is the $\fbox{Fourier transform}$ of $g$; the notation $\circ$ denotes the element-wise (Hadamard) product. What is more, $${\Phi}({\Phi}^Tg\circ {\Phi}^{T} f)={\Phi}\hat{g}(\Lambda){\Phi}^{T} f=\hat{g}({\Phi}\Lambda{\Phi}^{T})f=\hat{g}(\Delta)f$$ where $\Delta$ is the Laplacian of the graph and $\hat{g}(\Lambda)$ is a polynomial of the matrix $\Lambda$.
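
The identity above is easy to check numerically. The following numpy sketch (toy path graph, arbitrary signals; illustrative only) verifies that filtering in the graph Fourier domain, $\Phi(\Phi^{T}g\circ\Phi^{T}f)$, matches applying the operator $\Phi\,\mathrm{diag}(\hat{g})\,\Phi^{T}$ directly:

```python
import numpy as np

# toy graph Laplacian (path graph on 4 nodes)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

lam, Phi = np.linalg.eigh(L)         # Delta = Phi diag(lam) Phi^T
f = np.array([1.0, 2.0, 0.0, -1.0])  # signal on the nodes
g = np.array([0.5, 0.1, 0.0, 0.3])   # filter, also defined on the nodes

g_hat = Phi.T @ g                    # graph Fourier transform of g
f_hat = Phi.T @ f                    # graph Fourier transform of f

conv1 = Phi @ (g_hat * f_hat)                 # Phi(Phi^T g o Phi^T f)
conv2 = (Phi @ np.diag(g_hat) @ Phi.T) @ f    # apply the operator G to f
print(np.allclose(conv1, conv2))              # True
```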


Graph Convolution: Recursive Computation with Shared Parameters:

  • Represent each node based on its neighbourhood
  • Recursively compute the state of each node by propagating previous states using relation specific transformations
  • Backpropagation through Structure

Vanilla spectral graph ConvNets

Every graph convolutional layer starts off with a shared node-wise feature transformation (in order to achieve a higher-level representation), specified by a weight matrix $W$. This transforms the feature vectors into $\vec{g}_i = {\bf W}\vec{h}_i$. After this, the vectors $\vec{g}_i$ are typically recombined in some way at each node.

In general, to satisfy the localization property, we define a graph convolutional operator as an aggregation of features across neighborhoods. Defining $\mathcal{N}_i$ as the neighborhood of node $i$ (typically consisting of all first-order neighbours of $i$, including $i$ itself), we can define the output features of node $i$ as: $$\vec{h}'_i = \sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\vec{g}_j\right)$$ where $\sigma$ is an activation function such as the rectified linear unit (ReLU) used in ConvNets.

SplineNets

Parametrize the smooth spectral filter function

Spectral graph ConvNets with polynomial filters

Represent smooth spectral functions with polynomials of Laplacian eigenvalues $$w_{\alpha}(\lambda)=\sum_{j=0}^r{\alpha}_{j} {\lambda}^j$$

where $\alpha=(\alpha_0, \alpha_1, \cdots, \alpha_r)^{T}$ is the vector of filter parameters.

Convolutional layer: apply the spectral filter to the feature signal ${f}$: $$w_{\alpha}(\Lambda)f= \sum_{j=0}^r{\alpha}_{j} {\Lambda}^j f$$

Such graph convolutional layers are GPU friendly.

ChebNet

Graph convolutional networks often deal with unstructured datasets in which the graphs have different sizes. What is more, the graph may be dynamic, and we need to apply the model to new nodes without retraining.

Graph convolution with the (non-orthogonal) monomial basis $1, x, x^2, x^3, \cdots$: $$w_{\alpha}(\lambda)=\sum_{j=0}^{r}{\alpha}_{j} {\lambda}^{j}$$ Graph convolution with (orthogonal) Chebyshev polynomials: $$w_{\alpha}(\lambda) = \sum_{j=0}^{r}{\alpha}_{j} T_j(\lambda)$$ where $T_k(x)$ are the Chebyshev polynomials.

ChebNet (Defferrard et al., arXiv:1606.09375) approximates the filter using Chebyshev polynomials; the simplified first-order variant by Kipf and Welling (arXiv:1609.02907) is discussed below. The filter is applied with the scaled Laplacian: $$\tilde{\mathbf{\Delta}}=2{\lambda}_{n}^{-1}\mathbf{\Delta}-\mathbf{I}$$

$$w_{\alpha}(\tilde{\mathbf{\Delta}})f= \sum_{j=0}^r{\alpha}_{j} T_j({\tilde{\mathbf{\Delta}}}) f=\sum_{j=0}^r{\alpha}_{j}X^{(j)}$$ with $$X^{(j)}=T_j({\tilde{\mathbf{\Delta}}}) f=2\tilde{\mathbf{\Delta}} X^{(j-1)}-X^{(j-2)}, \quad X^{(0)}=f, \quad X^{(1)}=\tilde{\mathbf{\Delta}} f.$$
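
The recursion avoids any eigendecomposition: only repeated (sparse) multiplications by $\tilde{\mathbf{\Delta}}$ are needed. A minimal numpy sketch (toy Laplacian and coefficients; illustrative only):

```python
import numpy as np

def chebyshev_filter(L, f, alpha):
    """Apply w_alpha(L~) f via the Chebyshev recursion (no eigendecomposition)."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L)[-1]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)        # rescale spectrum into [-1, 1]
    X_prev, X = f, L_tilde @ f                       # X^(0), X^(1)
    out = alpha[0] * X_prev + alpha[1] * X
    for a in alpha[2:]:
        X_prev, X = X, 2.0 * L_tilde @ X - X_prev    # X^(j) = 2 L~ X^(j-1) - X^(j-2)
        out += a * X
    return out

# toy usage on a path-graph Laplacian
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
f = np.array([1.0, -1.0, 2.0, 0.0])
print(chebyshev_filter(L, f, alpha=[0.5, 0.3, 0.1]))
```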

Simplified ChebNets

Use Chebyshev polynomials of degree $r=2$ and assume ${\lambda}_n\approx 2$: $$w_{\alpha}(\tilde{\mathbf{\Delta}})f ={\alpha}_0 f + {\alpha}_1(\mathbf{\Delta-I})f= {\alpha}_0 f - {\alpha}_1\mathbf{D}^{-\frac{1}{2}}\mathbf{W}\mathbf{D}^{-\frac{1}{2}} f $$

Further constrain $\alpha=-\alpha_1=\alpha_0$ to obtain a single-parameter filter: $$w_{\alpha}(\tilde{\mathbf{\Delta}})f ={\alpha}\mathbf{(I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}})} f $$

PinSage

ChebNet, CayleyNet, MotifNet

In the previous post, the convolution on the graph Laplacian was defined in its graph Fourier space, as outlined in the paper by Bruna et al. (arXiv:1312.6203). However, the eigenmodes of the graph Laplacian are not ideal because they make the bases graph-dependent. A lot of work has been done to solve this problem, using various special functions to express the filter functions. Examples include Chebyshev polynomials and the Cayley transform.

Graph Convolution Networks (GCNs) generalize the operation of convolution from traditional data (images or grids) to graph data. The key is to learn a function $f$ that generates a node $v_i$'s representation by aggregating its own features $X_i$ and its neighbors' features $X_j$, where $j \in N(v_i)$.

CayleyNet

Defining filters as polynomials applied over the eigenvalues of the graph Laplacian, it is indeed possible to avoid any eigendecomposition and realize convolution by means of efficient sparse routines. The main idea behind CayleyNet is to achieve a spectral zoom property by means of the Cayley transform: $$ C(\lambda) = \frac{\lambda - i}{\lambda + i} $$

Instead of Chebyshev polynomials, it approximates the filter as: $$ g(\lambda) = c_0 + \sum_{j=1}^{r}\big[c_j C^{j}(h\lambda) + c_j^{\ast} C^{j\ast}(h\lambda)\big] $$ where $c_0$ is real, the other $c_j$'s are in general complex, ${h}$ is a zoom parameter, and the $\lambda$'s are the eigenvalues of the graph Laplacian. Tuning ${h}$ lets one find the best zoom that spreads the top eigenvalues; the $c$'s are learned during training. This addresses the problem of unfavorable eigenvalue clusters in ChebNet.
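
Since $c_j^{\ast}C^{j\ast}(h\lambda)$ is the complex conjugate of $c_jC^{j}(h\lambda)$, each pair of terms contributes $2\,\mathrm{Re}\big(c_jC^{j}(h\lambda)\big)$, so the filter is real-valued. A small numpy sketch evaluating it on a set of eigenvalues (toy coefficients; illustrative, not the authors' code):

```python
import numpy as np

def cayley_filter(lam, h, c0, c):
    """Evaluate the Cayley filter on eigenvalues lam (sketch, not the paper's code).

    c0 : real coefficient, c : complex coefficients c_1..c_r, h : spectral zoom.
    Uses c_j C^j + conj(c_j C^j) = 2 Re(c_j C^j).
    """
    C = (h * lam - 1j) / (h * lam + 1j)              # Cayley transform of h*lambda
    out = np.full_like(lam, c0, dtype=float)
    for j, cj in enumerate(c, start=1):
        out += 2.0 * np.real(cj * C**j)
    return out

lam = np.linspace(0.0, 2.0, 5)                       # toy Laplacian eigenvalues
print(cayley_filter(lam, h=1.5, c0=0.2, c=[0.1 + 0.3j, -0.05j]))
```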


MotifNet

MotifNet aims to address convolution on directed graphs.

Minimal inner structures:

  • Invariant by vertex re-indexing (no graph matching is required)
  • Locality (only neighbors are considered)
  • Weight sharing (convolutional operations)
  • Independence w.r.t. graph size

$$\color{green}{\fbox{${h_i} = f_{GCN}(h_j: j\to i)$} }$$

Higher-order Graph Convolutional Networks


Graph Attention Networks

A graph convolutional layer then computes a set of new node features, $(\vec{h}_{1},\cdots, \vec{h}_{n})$, based on the input features as well as the graph structure.

Most prior work defines the kernels $\alpha_{ij}$ explicitly (either based on the structural properties of the graph, or as a learnable weight); this requires compromising at least one other desirable property.

In Graph Attention Networks the kernels $\alpha_{ij}$ are computed as a byproduct of an attentional mechanism, $a : \mathbb{R}^N \times \mathbb{R}^N \rightarrow \mathbb{R}$, which computes un-normalized coefficients $e_{ij}$ across pairs of nodes $i,j$ based on their features:

$$ e_{ij} = a(\vec{h}_i, \vec{h}_j). $$

We inject the graph structure by only allowing node $i$ to attend over nodes in its neighbourhood, $j\in\mathcal{N}_{i}$. These coefficients are then typically normalized using the softmax function, in order to be comparable across different neighbourhoods: $$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}. $$
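
A minimal numpy sketch of one attention head, assuming the single-layer attentional mechanism of the GAT paper, $e_{ij} = \text{LeakyReLU}\big(\vec{a}^{T}[W\vec{h}_i \,\Vert\, W\vec{h}_j]\big)$, with toy weights (illustrative only):

```python
import numpy as np

def gat_coefficients(A, H, W, a, slope=0.2):
    """Attention coefficients alpha_ij for a single GAT head (illustrative sketch)."""
    G = H @ W                                          # shared transform: g_i = W h_i
    n = A.shape[0]
    coeff = np.zeros((n, n))
    for i in range(n):
        nbrs = [j for j in range(n) if A[i, j] or j == i]        # N_i, including i
        e = np.array([a @ np.concatenate([G[i], G[j]]) for j in nbrs])
        e = np.where(e > 0, e, slope * e)              # LeakyReLU on the raw scores e_ij
        e = np.exp(e - e.max())
        coeff[i, nbrs] = e / e.sum()                   # softmax over the neighborhood
    return coeff

# toy usage: 3 nodes, 2 input features, 2 hidden features per node
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = 0.5 * np.ones((2, 2))
a = np.array([0.3, -0.2, 0.1, 0.4])                    # attention vector for [g_i || g_j]
print(gat_coefficients(A, H, W, a))
```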

Gated Graph Neural Networks

Generative Models for Graphs

Application

GCN for RecSys

PinSAGE: a node's neighborhood defines a computation graph. The key idea is to generate node embeddings based on local neighborhoods; nodes aggregate information from their neighbors using neural networks.

GCN for Bio & Chem

GCN for NLP