Some questions about self_attn #12
Comments
Hi, I do not think we need a permute. For mode = 'h', the shape of projected_query is (batch_size, height, channel*width) and projected_key is (batch_size, channel*width, height). The attention_map is projected_query * projected_key, with shape (batch_size, height, height). The shape of projected_value is (batch_size, channel*width, height). The output is projected_value * attention_map, and we reshape the output from (batch_size, channel*width, height) to (batch_size, channel, width, height). Mode = 'w' works the same way. Actually, you may find some repositories that define self_attn in a similar way. As for the sigmoid: I find that in our network, sigmoid gets better results.
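To make the shape bookkeeping concrete, here is a minimal PyTorch sketch of the mode = 'h' path. The class name, layer names, and the 1x1 conv projections are assumptions, not the repository's actual code; only the shapes and the sigmoid attention follow the comment above. It uses permute + reshape to give the listed shapes their intended semantics, and the final permute back to (batch_size, channel, height, width) is also an addition for a standard layout.

```python
import torch
import torch.nn as nn

class HeightSelfAttn(nn.Module):
    # Hypothetical sketch of mode = 'h' self-attention; layer names and
    # 1x1 conv projections are assumptions, only the shapes follow the
    # comment above.
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape  # x: (batch_size, channel, height, width)
        # projected_query: (batch_size, height, channel*width)
        q = self.query(x).permute(0, 2, 1, 3).reshape(b, h, c * w)
        # projected_key: (batch_size, channel*width, height)
        k = self.key(x).permute(0, 1, 3, 2).reshape(b, c * w, h)
        # attention_map: (batch_size, height, height); sigmoid instead
        # of softmax, as the comment reports better results with it
        attn = torch.sigmoid(torch.bmm(q, k))
        # projected_value: (batch_size, channel*width, height)
        v = self.value(x).permute(0, 1, 3, 2).reshape(b, c * w, h)
        # output: (batch_size, channel*width, height)
        out = torch.bmm(v, attn)
        # reshape to (batch_size, channel, width, height), then permute
        # back to (batch_size, channel, height, width); this last
        # permute is an addition for a standard feature-map layout
        return out.reshape(b, c, w, h).permute(0, 1, 3, 2)

x = torch.randn(2, 64, 32, 32)
y = HeightSelfAttn(64)(x)
assert y.shape == x.shape  # attention over height preserves the shape
```

Mode = 'w' would be symmetric: flatten channel*height and attend over the width axis instead.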
see here
Hi. Why do you use view in mode 'h'?