
Representation Learning

Representation Learning, Manifold Learning and Metric Learning are about how to learn the intrinsic structure of data. The objective of representation and metric learning is to build new representation spaces that improve the performance of a classification, regression or clustering algorithm, either from distance constraints or by making use of a fine decomposition of instances into complete samples.

A model’s quality depends completely on the data representation used to train it. For this reason, methods that can transform input data such that it better suits a given machine learning method can improve that method’s predictive capacity.

Unsupervised learning of useful features, or representations, is one of the most basic challenges of machine learning. Too often the success of a data science project depends on the choice of features used. Machine learning has made great progress in training classification, regression and recognition systems when "good" representations, or features, of input data are available. However, much human effort is spent on designing good features, which are usually knowledge-based and engineered by domain experts over years of trial and error. A natural question to ask, then, is: "Can we automate the learning of useful features from raw data?" Unsupervised representation learning techniques capitalize on unlabeled data, which is often cheap and abundant and sometimes virtually unlimited. The goal of these ubiquitous techniques is to learn a representation that reveals intrinsic low-dimensional structure in data, disentangles underlying factors of variation by incorporating universal AI priors such as smoothness and sparsity, and is useful across multiple tasks and domains. The focus here is on theory and methods for representation learning that can easily scale to large amounts of unlabeled, multi-modal, and heterogeneous data.

Manifold learning and finding low-dimensional structure in data is an important task.

Knowledge Representation and Reasoning

Embedding Methods

Semantic Representation and Embedding

Deep Semantic Embedding

Knowledge Graph Embeddings

Knowledge graphs represent information via entities and their relationships. This form of relational knowledge representation has a long history in logic and artificial intelligence. More recently, it has also been the basis of the Semantic Web to create a “web of data” that is readable by machines.

Word Embedding

word2vec

In natural language processing, a word can be regarded as a node in a graph whose edges capture only locality or context relations, which makes it difficult to learn the concepts or meanings of words directly. The word embedding technique word2vec maps words to fixed-length real vectors: $$ f: \mathbb{W}\to\mathbb{V}^d\subset \mathbb{R}^d. $$

The skip-gram model assumes that a word can be used to generate the words that surround it in a text sequence. We assume that, given the central target word, the context words are generated independently of each other.

The conditional probability of generating the context word for the given central target word can be obtained by performing a softmax operation on the vector inner product: $$ P(w_o|w_c) = \frac{\exp(u_o^T u_c)}{\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)}, $$

where the vocabulary index set is $\mathbb{V} = \{0,1,\dots, |\mathbb{V}|-1\}$. Assume that a text sequence of length ${T}$ is given, where the word at time step ${t}$ is denoted as $w^{(t)}$. Assume that context words are independently generated given center words. When the context window size is ${m}$, the likelihood function of the skip-gram model is the joint probability of generating all the context words given any center word: $$ \prod_{t=1}^{T}\prod_{-m\leq j \leq m,\ j\neq 0}{P}(w^{(t+j)}\mid w^{(t)}), $$

Here, any time step that is less than 1 or greater than ${T}$ can be ignored.

The skip-gram model parameters are the central target word vector and context word vector for each individual word. In the training process, we learn the model parameters by maximizing the likelihood function, which is also known as maximum likelihood estimation. This is equivalent to minimizing the following loss function: $$ -\log\Big(\prod_{t=1}^{T}\prod_{-m\leq j \leq m,\ j\neq 0}{P}(w^{(t+j)}\mid w^{(t)})\Big) = -\sum_{t=1}^{T}\sum_{-m\leq j \leq m,\ j \neq 0} \log {P}(w^{(t+j)}\mid w^{(t)}). $$

And we can compute the negative logarithm of the conditional probability: $$ -\log P(w_o|w_c) = -\log\frac{\exp(u_o^T u_c)}{\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)} = -u_o^T u_c + \log\Big(\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)\Big). $$

Then we could compute the gradient or Hessian matrix of the loss functions to update the parameters such as:

$$ \frac{\partial \log P(w_o|w_c)}{\partial u_c} = \frac{\partial }{\partial u_c} \Big[u_o^T u_c - \log\Big(\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)\Big)\Big] = u_o - \sum_{j\in\mathbb{V}}\frac{ \exp(u_j^T u_c) }{\sum_{i\in\mathbb{V}} \exp(u_i^T u_c)}\, u_j = u_o - \sum_{j\in\mathbb{V}}P(w_j|w_c)\, u_j. $$
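This gradient can be sanity-checked numerically. Below is a minimal numpy sketch (the vocabulary size, dimension, and word indices are arbitrary toy values) that computes $P(w_o\mid w_c)$ with a softmax over inner products and compares the analytic gradient $u_o - \sum_{j} P(w_j\mid w_c)\,u_j$ against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                              # toy vocabulary size and embedding dimension
U = rng.normal(scale=0.1, size=(V, d))    # one vector per word (u_i in the notation above)
c, o = 3, 7                               # indices of the center and context word

def log_prob(u_c):
    """log P(w_o | w_c) = u_o^T u_c - log sum_i exp(u_i^T u_c)."""
    scores = U @ u_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - sum_j P(w_j | w_c) u_j
scores = U @ U[c]
p = np.exp(scores - scores.max())
p /= p.sum()
grad_analytic = U[o] - p @ U

# Finite-difference check
eps = 1e-6
grad_numeric = np.array([
    (log_prob(U[c] + eps * e) - log_prob(U[c] - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```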

The continuous bag of words (CBOW) model is similar to the skip-gram model. The biggest difference is that the CBOW model assumes that the central target word is generated based on the context words before and after it in the text sequence. Let the central target word $w_c$ be indexed as $c$, and the context words $w_{o_1},\cdots, w_{o_{2m}}$ be indexed as $o_1,\cdots,o_{2m}$ in the dictionary. Thus, the conditional probability of generating a central target word from the given context words is

$$ P(w_c|w_{o_1},\cdots, w_{o_{2m}}) = \frac{\exp\big(\frac{1}{2m}u_c^T(u_{o_1}+ \cdots + u_{o_{2m}})\big)}{\sum_{i\in \mathbb{V}}\exp\big(\frac{1}{2m} u_i^T(u_{o_1}+ \cdots + u_{o_{2m}})\big)}. $$ Evaluating the full softmax over the vocabulary is expensive; hierarchical softmax and negative sampling are the two standard approximations.
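As a rough illustration of the negative-sampling idea (a sketch in the document's notation, not the exact formulation of any particular implementation), the per-pair objective replaces the normalization over the whole vocabulary with a few sampled noise words:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(U, c, o, noise_ids):
    """-log sigma(u_o . u_c) - sum_k log sigma(-u_k . u_c) over sampled noise words k."""
    u_c, u_o = U[c], U[o]
    loss = -np.log(sigmoid(u_o @ u_c))
    for k in noise_ids:
        loss -= np.log(sigmoid(-U[k] @ u_c))
    return loss

# toy usage with random vectors and arbitrary noise word indices
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(50, 8))
print(neg_sampling_loss(U, c=3, o=7, noise_ids=[11, 25, 40]))
```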

Doc2Vec


Entity Embedding

We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables.
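A minimal sketch of the idea in PyTorch (the feature names, cardinalities, and embedding sizes below are made up for illustration): each categorical variable gets its own embedding table, the learned vectors are concatenated with the numeric features, and the network is trained with the usual supervised loss, so the embeddings are learned as a by-product.

```python
import torch
import torch.nn as nn

class EntityEmbeddingNet(nn.Module):
    def __init__(self, cardinalities, emb_dims, n_numeric):
        super().__init__()
        # one embedding table per categorical variable
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, dim) for card, dim in zip(cardinalities, emb_dims)
        )
        in_dim = sum(emb_dims) + n_numeric
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_cat, x_num):
        # x_cat: LongTensor (batch, n_categorical); x_num: FloatTensor (batch, n_numeric)
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.mlp(torch.cat(embedded + [x_num], dim=1))

# e.g. two categorical features with 12 and 7 levels, embedded in 4 and 3 dimensions
net = EntityEmbeddingNet(cardinalities=[12, 7], emb_dims=[4, 3], n_numeric=5)
```

After training, the rows of each `nn.Embedding` weight matrix are the entity embeddings of the corresponding categorical levels and can be inspected or reused as features.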

Item2Vec

Recently, several works in the field of Natural Language Processing (NLP) suggested learning a latent representation of words using neural embedding algorithms. Among them, Skip-gram with Negative Sampling (SGNS), also known as word2vec, was shown to provide state-of-the-art results on various linguistics tasks. In this paper, we show that item-based CF can be cast in the same framework of neural word embedding. Inspired by SGNS, we describe a method we name item2vec for item-based CF that produces embeddings for items in a latent space. The method is capable of inferring item-item relations even when user information is not available.
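A common way to approximate this idea in practice is sketched below, assuming gensim ≥ 4.0 is available and treating each user's item list as a "sentence"; item2vec itself treats all items in a set as mutual contexts, which a very large window only approximates. The item ids and hyperparameters are toy choices.

```python
from gensim.models import Word2Vec

# each "sentence" is the list of item ids consumed by one user (toy data)
baskets = [
    ["item_1", "item_7", "item_3"],
    ["item_7", "item_3", "item_9", "item_2"],
    ["item_1", "item_2", "item_9"],
]

# sg=1 selects skip-gram, negative=5 enables negative sampling (SGNS)
model = Word2Vec(sentences=baskets, vector_size=32, window=100,
                 sg=1, negative=5, min_count=1, epochs=50)

print(model.wv.most_similar("item_7", topn=3))   # items co-consumed with item_7
```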

Gradient Boosted Categorical Embedding and Numerical Trees

Gradient Boosted Categorical Embedding and Numerical Trees (GB-CSENT) combines tree-based models and matrix-based embedding models in order to handle numerical features and large-cardinality categorical features. A prediction is based on:

  • Bias terms from each categorical feature.
  • Dot-product of the embedding features of two categorical features, e.g., user-side vs. item-side.
  • Per-category decision trees based on numerical features: an ensemble of numerical decision trees where each tree is associated with one categorical feature.

In detail, the prediction is as follows: $$ \hat{y}(x) = \underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{\text{bias}} + \underbrace{\Big(\sum_{a_i\in U(a)} Q_{a_i}\Big)^{T}\Big(\sum_{a_i\in I(a)} Q_{a_i}\Big) }_{\text{factors}}}_{\text{CAT-E}} + \underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{\text{CAT-NT}}. $$ And it is decomposed as in the following table.


| Ingredients | Formulae | Features |
| --- | --- | --- |
| Factorization Machines | $\underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{\text{bias}} + \underbrace{\big(\sum_{a_i\in U(a)} Q_{a_i}\big)^{T}\big(\sum_{a_i\in I(a)} Q_{a_i}\big)}_{\text{factors}}}_{\text{CAT-E}}$ | Categorical features |
| GBDT | $\underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{\text{CAT-NT}}$ | Numerical features |
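A schematic sketch of this prediction rule is given below; the lookup tables, feature names, and tree functions are placeholders for illustration, not the actual GB-CSENT implementation. The score is the sum of per-category biases, the dot product of the summed user-side and item-side embeddings, and the outputs of per-category numerical trees.

```python
import numpy as np

def gb_csent_predict(cat_feats, num_feats, bias, Q, trees, user_side, item_side):
    """Schematic GB-CSENT score: bias terms + embedding dot product + per-category trees."""
    # bias terms from each categorical feature
    bias_term = sum(bias.get(a, 0.0) for a in cat_feats)
    # dot product of the summed user-side and item-side embeddings
    # (assumes at least one user-side and one item-side feature is present)
    u = np.sum([Q[a] for a in cat_feats if a in user_side], axis=0)
    v = np.sum([Q[a] for a in cat_feats if a in item_side], axis=0)
    factor_term = float(u @ v)
    # per-category numerical trees, here just arbitrary callables T_{a_i}(b)
    tree_term = sum(trees[a](num_feats) for a in cat_feats if a in trees)
    return bias_term + factor_term + tree_term

# toy example: one user-side and one item-side categorical feature
Q = {"user_US": np.array([0.1, 0.2]), "item_movie": np.array([0.3, -0.1])}
bias = {"user_US": 0.05, "item_movie": -0.02}
trees = {"item_movie": lambda b: 0.5 * b[0]}          # stand-in for a decision tree
print(gb_csent_predict(["user_US", "item_movie"], np.array([2.0, 1.0]),
                       bias, Q, trees, {"user_US"}, {"item_movie"}))
```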

Hyperbolic Embeddings

The motivation of hyperbolic embedding is to combine structural information with continuous representations suitable for machine learning methods. One example is embedding taxonomies (such as Wikipedia categories, lexical databases like WordNet, and phylogenetic relations).

The big goal when embedding a space into another is to preserve distances and more complex relationships. It turns out that hyperbolic space can better embed graphs (particularly hierarchical graphs like trees) than is possible in Euclidean space. Even better—angles in the hyperbolic world are the same as in Euclidean space, suggesting that hyperbolic embeddings are useful for downstream applications (and not just a quirky theoretical idea).

The motivation is to embed structured, discrete objects such as knowledge graphs into a continuous representation that can be used with modern machine learning methods. Hyperbolic embeddings can preserve graph distances and complex relationships in very few dimensions, particularly for hierarchical graphs.

Hyperbolic Axiom: There is a line $l$ and point $P$ not on $l$, such that there are at least 2 different lines through $P$ parallel to $l$.

This leads to the general property that given a line, there are an infinite number of different lines parallel to it through an outside point. Parallel lines can be of different types. A pair of parallel lines are said to be limiting parallel if they approach each other asymptotically without intersecting. Such a pair does not admit a common line perpendicular to both, and the distance between the two does not have a minimum. A pair of parallel lines are called divergent parallel if they move away from each other in both directions. They have a common segment perpendicular to both; this segment achieves the minimum distance between the two lines.

Given a line $l$ and a point $P$ outside, there is always a point $Q$ on $l$ such that $P Q \perp l$. Through $P$, there are always rays $m_{+\alpha}$ and $m_{-\alpha}$ that are limiting parallel to $l$ in the two directions. The angle between $P Q$ and $m_{+\alpha}$ (or symmetrically, the angle between $P Q$ and $m_{-\alpha}$) is called the angle of parallelism and represented by $\Pi(P Q)$. The Bolyai and Lobachevsky formula gives its value in terms of the length ${|PQ|}_{\mathbb{H}}$: $$\tan\frac{\Pi(P Q)}{2}=\exp(-{|PQ|}_{\mathbb{H}}/k) \tag{1}$$

where $k$ is a constant for the hyperbolic plane in consideration.

Reorganizing equation (1), $${|PQ|}_{\mathbb{H}}=-k\ln\tan\Big(\frac{\alpha}{2}\Big),$$ where $\alpha=\Pi(PQ)$ is always less than $\frac{\pi}{2}$, so ${|PQ|}_{\mathbb{H}}$ is positive and monotone decreasing in $\alpha$.

The hyperbolic distance between two points $x$ and $y$ has a particularly nice form: $$d_{\mathbb{H}}(x,y) = \mathsf{acosh}\left(1+2\frac{|x-y|^2}{(1-|x|^2)(1-|y|^2)} \right).$$
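This distance transcribes directly into code. The sketch below (plain numpy, points assumed to lie strictly inside the unit ball) also illustrates the characteristic behavior that pairs with the same Euclidean separation are much farther apart hyperbolically near the boundary:

```python
import numpy as np

def hyperbolic_distance(x, y):
    """d_H(x, y) = acosh(1 + 2|x - y|^2 / ((1 - |x|^2)(1 - |y|^2))) for points in the unit ball."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = 2.0 * np.sum((x - y) ** 2)
    den = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + num / den)

# same Euclidean separation, very different hyperbolic distances
print(hyperbolic_distance([0.0, 0.0], [0.1, 0.0]))    # ~0.2, near the origin
print(hyperbolic_distance([0.85, 0.0], [0.95, 0.0]))  # noticeably larger, near the boundary
```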

Let $X$ be a geodesic space and $\triangle\subset X$ a geodesic triangle; that is, a set of three geodesic segments in $X$ such that any pair of segments shares precisely one endpoint. Then $\triangle$ is $\delta$-slim if any side of $\triangle$ is contained in the $\delta$-neighbourhood of the other two. The metric space X is $\delta$-hyperbolic if every triangle is $\delta$-slim, and $X$ is called hyperbolic if it is $\delta$-hyperbolic for some $\delta\geq 0$.


Spherical and Hyperbolic Embeddings of Data

We have a set of distances between points, $d_{ij}$. If these points exist on the surface of a hypersphere, then the position vectors of the points in a Euclidean embedding space for the hypersphere should be $$\left<x_i, x_j\right>=r^2\cos(\frac{d_{ij}}{r}).$$

From this we construct an inner product matrix $Z$ with $Z_{ij}=\left<x_i, x_j\right>$.

This matrix should be positive semidefinite with a smallest eigenvalue of zero. We can use this fact to find the appropriate radius $r$ of the embedding space by minimizing the magnitude of the smallest eigenvalue of $Z$.

Point positions can then be found easily using standard kernel embedding techniques. This embedding is exact if the points lie on a hypersphere. Of course real data rarely lies precisely on a spherical manifold and so the paper also discusses details of the inexact case.
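A sketch of this procedure with numpy/scipy (the toy distance matrix and the bounds of the radius search are arbitrary choices, and real data will generally fit only approximately): build $Z(r)=r^2\cos(d_{ij}/r)$, choose $r$ to minimize the magnitude of the smallest eigenvalue, then read point coordinates off the eigendecomposition as in standard kernel embedding.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def Z_matrix(D, r):
    return r ** 2 * np.cos(D / r)

def smallest_eig_magnitude(r, D):
    return abs(np.linalg.eigvalsh(Z_matrix(D, r)).min())

def spherical_embedding(D, r_bounds=(0.1, 10.0)):
    # pick the radius that drives the smallest eigenvalue of Z toward zero
    res = minimize_scalar(smallest_eig_magnitude, bounds=r_bounds,
                          args=(D,), method="bounded")
    r = res.x
    w, V = np.linalg.eigh(Z_matrix(D, r))
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    X = V * np.sqrt(w)                 # rows are positions with <x_i, x_j> ~ Z_ij
    return X, r

# toy symmetric distance matrix for four points
D = np.array([[0.0, 1.0, 1.4, 1.0],
              [1.0, 0.0, 1.0, 1.4],
              [1.4, 1.0, 0.0, 1.0],
              [1.0, 1.4, 1.0, 0.0]])
X, r = spherical_embedding(D)
print(r, np.round(X @ X.T - Z_matrix(D, r), 3))   # reconstruction error of the inner products
```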

This can naturally be extended to hyperbolic embeddings, where $$ \left<x_i, x_j\right>= -r^2\cosh\Big(\frac{d_{ij}}{r}\Big), $$ and we need to optimize the second smallest eigenvalue: $$ r^{\ast} = \arg\min_{r}|\lambda_2[Z(r)]|. $$

Here $\lambda_2[Z(r)]$ is the second smallest eigenvalue of the matrix $Z(r)$ and $|\cdot|$ is the absolute value function.

(Figure: the Stanford bunny texture-mapped with a spherical texture using spherical embedding.)

Embedding of Trees in Hyperbolic Plane

Delaunay Embedding of Trees

Delaunay Graph: Given a set of vertices in the hyperbolic plane $\mathbb H$ their Delaunay graph is one where a pair of vertices are neighbors if their Voronoi cells intersect.

Here the Voronoi cell of $v_i\in\{v_0, v_1,\dots, v_n\}\subset \mathbb H$ is defined as the set of points whose distance to $v_i$ is not larger than the distance to $v_j$ for $j\not=i$, denoted as $\mathcal V(v_i)$.

We construct a function $\Phi$ that embeds the vertices of a tree $T$ into $\mathbb{H}$. Function $\Phi$ is designed such that $T$ is the Delaunay graph of the embedded vertices.

Edge embedding function $\Phi$. Given an edge $v_iv_j$ and an angle $\alpha$, we select a cone $C(m, \alpha)$ at point $P$, and embed as follows. Vertex $v_i$ is embedded at $\Phi_i = P$. We select a point $R$ on the ray $m$ such that ${|PR|}_{\mathbb{H}}\geq -2k\ln(\tan\frac{\alpha}{2})$, and embed $v_j$ as $\Phi_j = R$. We call this a Delaunay embedding of the edge for angle $\alpha$.

Here $C(m, \alpha)$ refers to the region bounded by rays $m_{-\alpha}$ and $m_{+\alpha}$ containing the ray $m$ except the point $P$. And we call it a cone of angle $\alpha$ at $P$ (or rooted at P).

Algorithm: Construction of $\Phi$ for $T$. Embed the root $v_{0}$ to an arbitrary point $\varphi_{0}$. If $v_{0}$ has d children, they are Delaunay embedded individually into d disjoint cones at $\varphi_0$. We embed all other vertices inductively as follows. Suppose $v_j$ is a child of $v_i$, and has been embedded in a cone of angle $\alpha$ at $\varphi_i$. The children of $v_j$ are Delaunay embedded in mutually disjoint cones that are also disjoint from the cone $C(\vec{\varphi_i\varphi_j},\alpha)$.

Suppose $T = (V, E, w)$ is a weighted tree, where $w : E \to \mathbb R$ is the weight or length function on the edges. The goal is to realize the weight $w(v_i v_j )$ on each edge $v_i v_j$ as the length ${|\varphi_i \varphi_j |}_{\mathbb H}$ of the edge in the Delaunay embedding of the tree.
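A small sketch of the quantities involved (the choice $\alpha = \pi/d_{\max}$ for the cone angle and the uniform rescaling of the edge weights are illustrative assumptions, not the full construction): an edge Delaunay-embedded in a cone of angle $\alpha$ must have hyperbolic length at least $-2k\ln\tan(\alpha/2)$, so a weighted tree is typically rescaled until every edge weight clears that bound.

```python
import math

def min_edge_length(alpha, k=1.0):
    """Minimum hyperbolic length of an edge Delaunay-embedded in a cone of angle alpha."""
    return -2.0 * k * math.log(math.tan(alpha / 2.0))

def required_scaling(weights, max_degree, k=1.0):
    """Scale factor so every edge weight meets the bound for cones of angle pi/max_degree."""
    alpha = math.pi / max_degree       # illustrative: fit max_degree disjoint cones around a vertex
    bound = min_edge_length(alpha, k)
    return max(1.0, bound / min(weights))

weights = [0.5, 1.2, 0.8]              # toy edge weights of a tree with maximum degree 4
print(required_scaling(weights, max_degree=4))
```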

(Figure: a simple Pythagorean embedding into the vertices of a unit cube.)

Poincaré Embeddings

The Poincaré ball model is a model of n-dimensional hyperbolic geometry in which all points lie inside an n-dimensional unit ball (or inside a disk in the 2D case, which is called the Poincaré disk model). It is less intuitive than the hyperboloid model, but it can be derived directly from that model.

The distance between two points $u,v$ in the Poincaré ball is given by: $$ d(u,v) = \operatorname{arcosh}\Big(1 + 2\frac{\lVert u-v \rVert^2}{(1-\lVert u\rVert^2)(1-\lVert v\rVert^2)}\Big) \tag{2} $$ where $\lVert\cdot\rVert$ denotes the Euclidean norm.
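When this distance is used inside a training loss (as in Poincaré embeddings of taxonomies), the parameters live inside the unit ball and are updated with a rescaled gradient. The sketch below shows only that update step and assumes the Euclidean gradient has already been computed; the scaling factor $(1-\lVert\theta\rVert^2)^2/4$ and the projection back into the ball follow the usual Riemannian SGD recipe for the Poincaré ball model.

```python
import numpy as np

def poincare_sgd_step(theta, euclidean_grad, lr=0.01, eps=1e-5):
    """One Riemannian SGD step for a single embedding vector in the Poincaré ball."""
    # rescale the Euclidean gradient by (1 - |theta|^2)^2 / 4 (inverse of the ball metric)
    scale = (1.0 - np.dot(theta, theta)) ** 2 / 4.0
    theta = theta - lr * scale * euclidean_grad
    # project back inside the unit ball if the step overshot
    norm = np.linalg.norm(theta)
    if norm >= 1.0:
        theta = (1.0 - eps) * theta / norm
    return theta
```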

(Figures: an attempt at embedding a tree with branching factor 4 into the Euclidean plane; embedding a hierarchical tree with branching factor two into a 2D hyperbolic plane.)

Embedding Networks in Hyperbolic Spaces

Since hyperbolic space is tree-like, it's natural to consider embedding trees—which we can do with arbitrarily low distortion, for any number of dimensions! This approach can be extended to arbitrary graphs, a problem with a long history in Euclidean space, using a two-step strategy for embedding graphs into hyperbolic space:

  1. Embed a graph $G = (V, E)$ into a tree $T$.
  2. Embed $T$ into the Poincaré ball.

There are two factors to good embeddings: local and global. Locally, the children of each node should be spread out on the sphere around the parent. Globally, subtrees should be separated from each other; in hyperbolic space, the separation is determined by the sphere radius.

HyperE

On this website, we release hyperbolic embeddings that can be further integrated into applications related to knowledge base completion or supplied as features to various NLP tasks such as question answering.

---

Graph Embedding

Graph embedding, a preprocessing step for graph data, is an example of representation learning that finds a proper numerical representation of a graph data structure. It maps the graph structure to a numerical domain: $f:\mathbf{G}\mapsto V\subset \mathbb{R}^{n}$. The goal when embedding a graph $G$ into a space $V$ is to preserve the graph distance (the shortest path between a pair of vertices) in $V$.

DeepWalk

DeepWalk is an approach for learning latent representations of vertices in a network, which maps the nodes in the graph into real vectors: $$ f: \mathbb{V}\to\mathbb{R}^{d}. $$

DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. If we consider a text as a directed graph, word2vec is a special case of DeepWalk. Given the word sequence $\mathbb{W}=(w_0, w_1, \dots, w_n)$, we can compute the conditional probability $P(w_n|w_0, w_1, \dots, w_{n-1})$; DeepWalk instead models $$P(w_n\mid f(w_0), f(w_1), \cdots, f(w_{n-1})),$$ the probability conditioned on the vertex representations.

  • DeepWalk $G, w, d, \gamma, t$
  • Input: graph $G(V, E)$; window size $w$; embedding size $d$; walks per vertex $\gamma$; walk length $t$.
  • Output: matrix of vertex representations $\Phi\in\mathbb{R}^{|V|\times d}$
  • Initialization: Sample $\Phi$ from $\mathbb{U}^{|V|\times d}$;
    • Build a binary Tree T from V;
    • for $i = 0$ to $\gamma$ do
      • $O = Shuffle(V )$
      • for each $v_i \in O$ do
      • $W_{v_i} = RandomWalk(G, v_i, t)$
      • $SkipGram(\Phi, W_{v_i}, w)$
      • end for
    • end for

$SkipGram(\Phi, W_{v_i}, w)$

    1. for each $v_j \in W_{v_i}$ do
       1. for each $u_k \in W_{v_i}[j - w : j + w]$ do
          1. $J(\Phi)=-\log Pr(u_k\mid \Phi(v_j))$
          2. $\Phi =\Phi -\alpha\frac{\partial J}{\partial \Phi}$
       2. end for
    2. end for

Computing the partition function (normalization factor) is expensive, so instead we factorize the conditional probability using hierarchical softmax. If the path to vertex $u_k$ is identified by a sequence of tree nodes $(b_0, b_1, \cdots , b_{\lceil \log |V|\rceil})$, with $b_0 = \text{root}$ and $b_{\lceil \log |V|\rceil} = u_k$, then $$ Pr(u_k\mid \Phi(v_j)) =\prod_{l=1}^{\lceil \log |V|\rceil}Pr(b_l\mid \Phi(v_j)). $$

Now, $Pr(b_l\mid \Phi(v_j))$ can be modeled by a binary classifier assigned to the parent of node $b_l$, as below: $$ Pr(b_l\mid \Phi(v_j))=\frac{1}{1 + \exp(-\Phi(v_j)\cdot \Phi(b_l))} $$ where $\Phi(b_l)$ is the representation assigned to tree node $b_l$'s parent.
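A compact sketch of the whole DeepWalk pipeline is given below, assuming networkx and gensim ≥ 4.0 are available; the karate-club graph and the hyperparameters are arbitrary choices. It generates $\gamma$ truncated random walks of length $t$ from every vertex and feeds the walks to a skip-gram model with hierarchical softmax, mirroring the pseudocode above.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, length):
    walk = [start]
    while len(walk) < length:
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]            # gensim expects string tokens

def deepwalk(G, walks_per_vertex=10, walk_length=40, window=5, dim=64):
    walks = []
    for _ in range(walks_per_vertex):        # gamma passes over the vertices
        nodes = list(G.nodes())
        random.shuffle(nodes)                # O = Shuffle(V)
        walks += [random_walk(G, v, walk_length) for v in nodes]
    # sg=1: skip-gram, hs=1: hierarchical softmax, as in the DeepWalk paper
    return Word2Vec(sentences=walks, vector_size=dim, window=window,
                    sg=1, hs=1, min_count=0, workers=1)

model = deepwalk(nx.karate_club_graph())
print(model.wv.most_similar("0", topn=5))    # vertices embedded close to node 0
```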

node2vec

node2vec is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks.

By extending the Skip-gram architecture to networks, it seeks to optimize the following objective function, which maximizes the log-probability of observing a network neighborhood $N_{S}(u)$ for a node $u$ conditioned on its feature representation, given by $f$: $$ \max_{f} \sum_{u\in V}\log Pr(N_S(u)\mid f(u)) $$

Conditional independence and Symmetry in feature space are expected to make the optimization problem tractable. We model the conditional likelihood of every source-neighborhood node pair as a softmax unit parametrized by a dot product of their features: $$ Pr(n_i\mid f(u))=\frac{\exp(f(n_i)\cdot f(u))}{\sum_{v\in V} \exp(f(v)\cdot f(u))}. $$

The objective function simplifies to $$\max_{f}\sum_{u\in V}\Big[-\log Z_u + \sum_{n_i\in N_S(u)} f(n_i)\cdot f(u)\Big],$$ where $Z_u=\sum_{v\in V}\exp(f(u)\cdot f(v))$ is the per-node partition function.
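The neighborhoods $N_S(u)$ in node2vec come from second-order biased random walks controlled by a return parameter $p$ and an in-out parameter $q$. Below is a simplified sketch of that walk (unnormalized transition weights, no alias-table precomputation as in the reference implementation; networkx is assumed):

```python
import random
import networkx as nx

def node2vec_walk(G, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(G.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:                   # first step: uniform choice
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for x in neighbors:
            if x == prev:                    # return to the previous node
                weights.append(1.0 / p)
            elif G.has_edge(prev, x):        # stay close to the previous node
                weights.append(1.0)
            else:                            # move outward
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk

print(node2vec_walk(nx.karate_club_graph(), start=0, length=10, p=0.25, q=4.0))
```

The resulting walks are then fed to a skip-gram model exactly as in DeepWalk; varying $p$ and $q$ interpolates between BFS-like and DFS-like neighborhoods.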

struc2vec

This is a paper about identifying nodes in graphs that play a similar role based solely on the structure of the graph, for example computing the structural identity of individuals in social networks. That's nice and all that, but what I personally find most interesting about the paper is the meta-process by which the authors go about learning the latent distributed vectors that capture the thing they're interested in (structural similarity in this case). Once you've got those vectors, you can do vector arithmetic, distance calculations, use them in classifiers and other higher-level learning tasks, and so on. Just as word2vec places semantically similar words close together in space, so we want structurally similar nodes to be close together in space.

Struc2vec has four main steps:

  1. Determine the structural similarity between each vertex pair in the graph, for different neighborhood sizes.
  2. Construct a weighted multi-layer graph, in which each layer corresponds to a level in a hierarchy measuring structural similarity (think: "at this level of zoom, these things look kind of similar").
  3. Use the multi-layer graph to generate context for each node based on biased random walking.
  4. Apply standard techniques to learn a latent representation from the context given by the sequence of nodes in the random walks.

metapath2vec

Gaussian Auto Embeddings

http://koaning.io/gaussian-auto-embeddings.html

Atom2Vec

M. V. Diudea, I. Gutman and L. Jantschi wrote in the preface of the book Molecular Topology: "One of the principal goals of chemistry is to establish (causal) relations between the chemical and physical (experimentally observable and measurable) properties of substance and the structure of the corresponding molecules. Countless results along these lines have been obtained, and their presentation comprises significant parts of textbooks of organic, inorganic and physical chemistry, not to mention treatises on theoretical chemistry."

tile2Vec

graph2vec

graph2vec learns data-driven distributed representations of arbitrarily sized graphs in an unsupervised and task-agnostic manner.

Knowledge Graph Embeddings

Hierarchical Representation Learning

HARP

We present HARP, a novel method for learning low dimensional embeddings of a graph's nodes which preserves higher-order structural features. Our proposed method achieves this by compressing the input graph prior to embedding it, effectively avoiding troublesome embedding configurations (i.e. local minima) which can pose problems to non-convex optimization. HARP works by finding a smaller graph which approximates the global structure of its input. This simplified graph is used to learn a set of initial representations, which serve as good initializations for learning representations in the original, detailed graph. We inductively extend this idea, by decomposing a graph in a series of levels, and then embed the hierarchy of graphs from the coarsest one to the original graph. HARP is a general meta-strategy to improve all of the state-of-the-art neural algorithms for embedding graphs, including DeepWalk, LINE, and Node2vec. Indeed, we demonstrate that applying HARP's hierarchical paradigm yields improved implementations for all three of these methods, as evaluated on both classification tasks on real-world graphs such as DBLP, BlogCatalog, CiteSeer, and Arxiv, where we achieve a performance gain over the original implementations by up to 14% Macro F1.