Confusion about some codes #15
Comments
Hi Weixin,

Thanks for your interest in the GAT work and for reaching out. I'll try to elaborate on those lines in more detail.

As our layer is a single-layer NN, we can decompose it from W[h_i || h_j] into W_1 h_i + W_2 h_j (with W_1 and W_2 being separate linear transformations). As W_1 and W_2 are applied pointwise to each node, we may express each of them as a 1D convolution with kernel size 1 across the node feature matrix. Applying these two convolutions gives us f_1 (holding W_1 h_i for every node i) and f_2 (holding W_2 h_j for every node j).

In the dense GAT layer (https://github.com/PetarV-/GAT/blob/master/utils/layers.py#L13-L17), we compute W_1 h_i + W_2 h_j for all (i, j) pairs by simply adding f_1 to the transpose of f_2, which broadcasts the sum over every pair.

For the sparse layer the idea is pretty much the same, only we do not wish to use more memory than is necessary. This is why we multiply the values with the sparse adjacency matrix's entries: it preserves only the positions (i, j) that are actually to be used.

Does that help? Feel free to reach out if you need more info.

Thanks,
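To make the broadcasting trick concrete, here is a minimal NumPy sketch of the idea (a stand-in for the repository's TensorFlow code, with made-up sizes and array names): adding f_1 to the transpose of f_2 materialises W_1 h_i + W_2 h_j for all (i, j) pairs at once, while the sparse variant keeps only the entries for edges that exist.

```python
import numpy as np

# Minimal NumPy sketch of the decomposition described above
# (illustrative only, not the repo's TensorFlow code).
rng = np.random.default_rng(0)
N, F, F_out = 4, 3, 2                   # nodes, input features, output features

H = rng.standard_normal((N, F))         # node feature matrix
W = rng.standard_normal((F, F_out))     # shared linear transform (kernel-size-1 conv)
w1 = rng.standard_normal((F_out, 1))    # per-node scorer playing the role of W_1
w2 = rng.standard_normal((F_out, 1))    # per-node scorer playing the role of W_2

H_fts = H @ W                           # transformed node features
f_1 = H_fts @ w1                        # shape (N, 1): one score per node i
f_2 = H_fts @ w2                        # shape (N, 1): one score per node j

# Dense layer: broadcasting (N, 1) + (1, N) gives an (N, N) logit matrix
# whose (i, j) entry is W_1 h_i + W_2 h_j.
logits_dense = f_1 + f_2.T

# Sparse variant: only compute logits for edges that actually exist,
# instead of materialising the full N x N matrix.
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)
rows, cols = np.nonzero(adj)
logits_sparse = f_1[rows, 0] + f_2[cols, 0]     # one value per edge

assert np.allclose(logits_sparse, logits_dense[rows, cols])
```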
Hi Petar,

Thank you for your detailed explanation, which helps a lot!

Many thanks,
Hi Petar, thanks for your explanation! However, I have some questions about the bias matrix: how is the bias matrix generated, what is the physical meaning of the bias, and why did you choose this particular construction? (See lines 14 to 25 in ea3aeaf.)
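As background for the question above: a common reading of such a bias matrix, consistent with how the released GAT code masks the softmax, is an additive mask derived from connectivity, where non-neighbour positions are pushed to a large negative value so they receive effectively zero attention weight. The sketch below (hypothetical function name, toy graph) illustrates that idea; it is not the repository's exact code.

```python
import numpy as np

# Hedged sketch of an additive softmax mask built from connectivity
# (illustrative names and toy graph, assumed to reflect the masking
# idea; not copied from the repository).
def adjacency_to_bias(adj):
    """0 where attention between i and j is allowed, a very large
    negative number where it must be suppressed."""
    return -1e9 * (1.0 - adj)

adj = np.array([[1., 1., 0.],
                [1., 1., 1.],
                [0., 1., 1.]])
bias = adjacency_to_bias(adj)

logits = np.ones((3, 3))                # pretend attention logits
masked = logits + bias                  # non-edges pushed toward -1e9

# Row-wise softmax: non-neighbours end up with (numerically) zero weight.
weights = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)
print(np.round(weights, 3))
```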
Hi,

Thank you for sharing the code! I find the code below (in sp_attn_head) a bit hard to understand, since it does not correspond directly to the description in the text:
seq_fts = tf.layers.conv1d(seq, out_sz, 1, use_bias=False)
can be regarded as multiplying by a weight matrix W (converting F features to F' features; a small sketch of this equivalence appears after this message). However, I do not understand the steps that follow it. I am aware that the code differs because it operates at the matrix level instead of the node level, but I cannot see how the attention mechanism is achieved by these steps. Can you help explain a bit?
Many thanks,
Weixin.
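To illustrate the claim above that a kernel-size-1 conv1d acts as multiplication by a weight matrix W, here is a small NumPy check (made-up sizes and weights, standing in for the tf.layers.conv1d call; not the repository's code):

```python
import numpy as np

# NumPy stand-in for tf.layers.conv1d(seq, out_sz, 1, use_bias=False):
# with kernel size 1, each output position depends only on the same
# input position, so the convolution reduces to a per-node matrix product.
rng = np.random.default_rng(1)
N, F, F_out = 5, 4, 3                   # nodes, F input features, F' output features

seq = rng.standard_normal((N, F))       # node feature matrix
W = rng.standard_normal((F, F_out))     # made-up kernel of shape (F, F')

def conv1d_kernel1(x, kernel):
    """Kernel-size-1 1D convolution along the node axis."""
    return np.stack([x[t] @ kernel for t in range(x.shape[0])])

seq_fts = conv1d_kernel1(seq, W)
assert np.allclose(seq_fts, seq @ W)    # converting F features to F' features
```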