
Confusion about some codes #15

Closed
DexterZeng opened this issue Dec 5, 2018 · 4 comments

@DexterZeng

Hi, thank you for sharing the code!
I find the code below (in the sp_attn_head function) a bit hard to understand, since it does not correspond directly to the description in the paper:

f_1 = tf.layers.conv1d(seq_fts, 1, 1)
f_2 = tf.layers.conv1d(seq_fts, 1, 1)

f_1 = tf.reshape(f_1, (nb_nodes, 1))
f_2 = tf.reshape(f_2, (nb_nodes, 1))

f_1 = adj_mat*f_1
f_2 = adj_mat * tf.transpose(f_2, [1,0])

logits = tf.sparse_add(f_1, f_2)

lrelu = tf.SparseTensor(indices=logits.indices, 
values=tf.nn.leaky_relu(logits.values), 
dense_shape=logits.dense_shape)

coefs = tf.sparse_softmax(lrelu)

I understand that seq_fts = tf.layers.conv1d(seq, out_sz, 1, use_bias=False) can be regarded as multiplying by a weight matrix W (converting F features into F’ features). However, I do not understand the following steps:

f_1 = tf.layers.conv1d(seq_fts, 1, 1)
f_2 = tf.layers.conv1d(seq_fts, 1, 1)
f_1 = adj_mat*f_1
f_2 = adj_mat * tf.transpose(f_2, [1,0])
logits = tf.sparse_add(f_1, f_2)

I am aware that the code is different because it operates at the “matrix” level rather than the node level, but I cannot see how the attention mechanism is achieved by these steps. Could you help explain a bit?

Many thanks,
Weixin.

@PetarV-
Owner

PetarV- commented Dec 5, 2018

Hi Weixin,

Thanks for your interest in the GAT work and for reaching out. I'll try to elaborate on those lines in more detail.

As our attention mechanism is a single-layer NN, we can decompose W[h_i || h_j] into W_1 h_i + W_2 h_j (with W_1 and W_2 being separate linear transformations).

As W_1 and W_2 are applied pointwise to each node, we may express them as 1D convolutions with kernel size 1 across the node feature matrix. Applying these two convolutions gives us f_1 (which is W_1 h_i) and f_2 (which is W_2 h_j).
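
To make that concrete, here is a minimal NumPy sketch of the equivalence (the shapes and the names a_1, a_2 are illustrative, not taken from the repo). A 1D convolution with kernel size 1 and a single output channel is just a pointwise linear map that reduces each node's transformed feature vector to a scalar:

import numpy as np

# Toy shapes: N nodes, F' transformed features (rows of seq_fts play the role of W h_i).
N, F_out = 5, 8
seq_fts = np.random.randn(N, F_out)

# Weights of the two kernel-size-1 convolutions (illustrative names).
a_1 = np.random.randn(F_out, 1)
a_2 = np.random.randn(F_out, 1)

# Pointwise linear maps, applied to every node independently.
f_1 = seq_fts @ a_1   # shape (N, 1); entry i corresponds to W_1 h_i
f_2 = seq_fts @ a_2   # shape (N, 1); entry j corresponds to W_2 h_j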

In the dense GAT layer (https://github.com/PetarV-/GAT/blob/master/utils/layers.py#L13-L17), we compute W_1h_i + W_2h_j for all (i, j) pairs, by simply adding f_1 to transpose(f_2). This implies adding something of shape (batch, nodes, 1) to something of shape (batch, 1, nodes), and through the power of broadcasting we obtain the full (unnormalised) attention coefficient matrix of shape (batch, nodes, nodes), with TensorFlow automatically performing this sum for all pairs of positions. You can then normalise the coefficients by applying LeakyReLU, bias and softmax.
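
Continuing the sketch, the dense all-pairs computation is a single broadcasted addition followed by LeakyReLU and a row-wise softmax (the bias matrix discussed later in this thread would be added to the logits before the softmax; it is left out of this toy example):

# Broadcasting (N, 1) + (1, N) yields every pairwise sum at once:
# logits[i, j] = W_1 h_i + W_2 h_j.
logits = f_1 + f_2.T   # shape (N, N)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0.0, x, alpha * x)

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Unmasked attention coefficients over all pairs of nodes.
coefs = row_softmax(leaky_relu(logits))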

For the sparse layer the idea is much the same, only we do not wish to use more memory than necessary. This is why we multiply the values by the sparse adjacency matrix's entries: it preserves only the (i, j) positions that are actually used.
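
Sketching the same idea densely with a 0/1 adjacency matrix (the real layer keeps only the edge values inside a SparseTensor, which is where the memory saving comes from; leaky_relu and row_softmax are the toy helpers defined above):

# Toy 0/1 adjacency with self-loops; in the sparse layer this is a SparseTensor.
adj = (np.random.rand(N, N) < 0.4).astype(float)
np.fill_diagonal(adj, 1.0)

# Multiplying by the adjacency keeps the per-edge terms only.
g_1 = adj * f_1          # entry (i, j) = W_1 h_i wherever an edge exists
g_2 = adj * f_2.T        # entry (i, j) = W_2 h_j wherever an edge exists
edge_logits = g_1 + g_2  # the sparse_add step: logits defined on edges only

# tf.sparse_softmax normalises each row over the stored (edge) entries only;
# densely, that corresponds to masking out non-edges before the softmax.
masked = np.where(adj > 0.0, leaky_relu(edge_logits), -1e9)
coefs_sparse = row_softmax(masked)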

Does that help? Feel free to reach out if you need more info.

Thanks,
Petar

@DexterZeng
Author

Hi Petar:

Thank you for your detailed explanation, which helps a lot!

Many thanks,
Weixin.

@KnightOfTheMoonlight

Hi Petar, thanks for your explanation! However, I have some questions about the bias matrix: how is it generated, what is the physical meaning of the bias, and why did you choose this particular construction?
This looks like the default bias-matrix generation for a dense adjacency matrix:

def adj_to_bias(adj, sizes, nhood=1):
    nb_graphs = adj.shape[0]
    mt = np.empty(adj.shape)
    for g in range(nb_graphs):
        mt[g] = np.eye(adj.shape[1])
        for _ in range(nhood):
            mt[g] = np.matmul(mt[g], (adj[g] + np.eye(adj.shape[1])))
        for i in range(sizes[g]):
            for j in range(sizes[g]):
                if mt[g][i][j] > 0.0:
                    mt[g][i][j] = 1.0
    return -1e9 * (1.0 - mt)

@songhuiming

> How is the bias matrix generated, what is the physical meaning of the bias, and why choose this particular construction?

Up to the mt[g][i][j] = 1.0 line, the code adds self-loops to the adjacency matrix and marks every pair of nodes reachable within nhood hops.
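
If I read the dense attn_head layer correctly, the returned matrix is then added to the attention logits right before the softmax, so pairs reachable within nhood hops keep their logits (bias 0) while all other pairs are pushed to roughly -1e9 and receive ~0 attention. A small sketch with a made-up 3-node graph (node 2 isolated), using the adj_to_bias above:

import numpy as np

adj = np.array([[[0., 1., 0.],
                 [1., 0., 0.],
                 [0., 0., 0.]]])              # one toy graph with 3 nodes
bias = adj_to_bias(adj, sizes=[3], nhood=1)   # 0 for reachable pairs, -1e9 otherwise

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(1, 3, 3)             # unnormalised attention for every pair
coefs = row_softmax(logits + bias)            # non-neighbour coefficients come out as ~0
print(np.round(coefs, 3))                     # e.g. row 2 attends only to node 2 itself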
