## 4.1 Intro to Attention Mechanism

- <font color="orange">Attention</font> is, to some extent, motivated by how we
    - <font color="orange">pay visual attention</font> to <font color="cyan">different regions</font> of an <font color="orange">image</font> (*vision task*), or
    - <font color="cyan">correlate words</font> in one <font color="orange">sentence</font> (*language task*).<br><br>
<img src="resource/shiba-example-attention.png" width="700px"><br><br>
<img src="resource/sentence-example-attention.png" width="700px"><br>
- Humans can naturally and effectively find <font color="orange">salient regions</font> in complex scenes. 
- Attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system.
- An attention mechanism can be regarded as a <font color="orange">dynamic weight adjustment</font> process based on <font color="orange">features</font> of the input image.<br><br>
- <font color="orange">Timeline of developments</font> in attention for computer vision:<br><br>
<img src="resource/attension-dev.png" width="900px"><br><br>
- Attention mechanisms can be <font color="orange">categorised</font> according to the data domain they operate on:<br><br>
<img src="resource/attention-category.png" width="500px"><br><br>
    - <font color="cyan">Channel attention</font> generates an <font color="orange">attention mask</font> across the <font color="orange">channel domain</font> and uses it to select important channels.
    - <font color="cyan">Spatial attention</font> generates an <font color="orange">attention mask</font> across the <font color="orange">spatial domain</font> and uses it to select important spatial regions, or predicts the most relevant spatial position directly.<br><br>
    <img src="resource/ca-vs-sa.png" width="600px"><br><br><br>
    - <font color="cyan">Channel & spatial attention</font> predicts <font color="orange">channel</font> and <font color="orange">spatial attention masks</font> separately, or generates a joint 3-D (channel, height, width) attention mask directly, and uses it to select <font color="orange">important features</font>.
    - <font color="cyan">Temporal attention</font> generates an <font color="orange">attention mask</font> over <font color="orange">time</font> and uses it to select <font color="orange">key frames</font>.
    - <font color="cyan">Spatial & temporal attention</font> computes <font color="orange">temporal</font> and <font color="orange">spatial attention masks</font> separately, or produces a joint <font color="orange">spatiotemporal attention</font> mask.
    - <font color="cyan">Branch attention</font> generates an <font color="orange">attention mask</font> across the different branches and uses it to select <font color="orange">important branches</font>.<br><br>
<img src="resource/Attention-visual.png" width="700px">
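
The difference between these categories shows up mostly in the shape of the mask. The sketch below is only an illustration: the feature-map size and the random logits (standing in for whatever sub-network predicts the mask) are assumptions, not part of any specific method listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    """Squash logits into (0, 1) so they can act as soft selection weights."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature map: C channels on an H x W spatial grid.
C, H, W = 4, 3, 3
features = rng.normal(size=(C, H, W))

# Random logits stand in for the output of a mask-predicting sub-network.
channel_mask = sigmoid(rng.normal(size=(C, 1, 1)))  # one weight per channel
spatial_mask = sigmoid(rng.normal(size=(1, H, W)))  # one weight per position
joint_mask = channel_mask * spatial_mask            # joint 3-D mask, shape (C, H, W)

# Broadcasting applies each mask along the domain it was defined on.
channel_out = features * channel_mask
spatial_out = features * spatial_mask
joint_out = features * joint_mask

print(channel_mask.shape, spatial_mask.shape, joint_mask.shape)
```

Temporal and branch attention follow the same pattern, with the mask defined over a frame axis or a branch axis instead.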

#### 4.1.1 Channel Attention - SENet
- <font color="cyan">SENet</font> pioneered channel attention.
- The core of <font color="cyan">SENet</font> is the *<font color="orange">squeeze-and-excitation</font>* (SE) block, which collects <font color="orange">global information</font>, captures <font color="orange">channel-wise relationships</font>, and improves <font color="orange">representation ability</font>.
- <font color="orange">SE blocks</font> are divided into two parts, a <font color="cyan">squeeze</font> module and an <font color="cyan">excitation</font> module.
    - <font color="orange">Global information</font> is collected in the <font color="orange">squeeze</font> module by <font color="cyan">global average pooling (GAP)</font>.
    - The <font color="orange">excitation</font> module captures <font color="orange">channel-wise relationships</font> and outputs an <font color="orange">attention vector</font> by using <font color="cyan">fully-connected layers</font> and <font color="cyan">non-linear activation layers</font> (ReLU and sigmoid).
    - Then, <font color="orange">each channel</font> of the <font color="cyan">input feature</font> is scaled by <font color="orange">multiplying</font> it by the corresponding element of the <font color="cyan">attention vector</font>.<br><br>
        <table cellspacing="0" cellpadding="0" style="border:none;">
            <tbody>
                <tr>
                    <td>
                        <img src="resource/attention-se-block.png" width="250px">
                    </td>
                    <td>
                        GAP = global average pooling<br>
                        FC = fully-connected layer<br><br>
                        <img src="resource/GAP.png" width="250px"><br>
                        <img src="resource/attention-se-block-2.png" width="400px"><br>
                    </td>
                </tr>
            </tbody>
        </table><br><br>
- <font color="orange">SE blocks</font> play the role of <font color="orange">emphasizing important channels</font> while <font color="orange">suppressing noise</font>.
- However, SE blocks have shortcomings. 
    - In the <font color="orange">squeeze</font> module, <font color="orange">global average pooling</font> is <font color="cyan">too simple</font> to capture complex global information. 
    - In the <font color="orange">excitation</font> module, <font color="orange">fully-connected layers</font> <font color="cyan">increase the complexity</font> of the model.
- Later works attempt to improve the output of the squeeze module (e.g., GSoP-Net).
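
The squeeze, excitation, and scaling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference SENet implementation: the function name `se_block`, the reduction ratio `r`, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, b1, w2, b2):
    """Apply a squeeze-and-excitation step to a feature map x of shape (C, H, W)."""
    # Squeeze: global average pooling collapses each channel to one number.
    squeezed = x.mean(axis=(1, 2))                 # (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = np.maximum(0.0, w1 @ squeezed + b1)   # (C // r,)
    attn = sigmoid(w2 @ hidden + b2)               # (C,)
    # Scale: multiply every channel by its attention weight.
    return x * attn[:, None, None], attn

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4                 # r is an assumed reduction ratio
x = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C)) * 0.1  # untrained weights, for illustration only
b1 = np.zeros(C // r)
w2 = rng.normal(size=(C, C // r)) * 0.1
b2 = np.zeros(C)

out, attn = se_block(x, w1, b1, w2, b2)
print(out.shape, attn.shape)
```

The bottleneck of size `C // r` between the two fully-connected layers is what keeps the added parameter count manageable, which relates to the complexity concern noted above.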


#### 4.1.2 Channel Attention - GSoP-Net
- <font color="orange">GSoP-Net</font> improves the <font color="orange">squeeze</font> module by using a <font color="orange">global second-order pooling (GSoP)</font> block.
- Like an SE block, a GSoP block also has a squeeze module and an excitation module.
- The <font color="orange">squeeze</font> module of a GSoP block
    - first <font color="orange">reduces</font> the number of <font color="orange">channels</font> using a $1\times1$ <font color="cyan">convolution</font>, and
    - then computes a <font color="cyan">covariance matrix</font> of the remaining channels to capture their correlation.
- The <font color="orange">excitation</font> module of a GSoP block
    - applies a <font color="cyan">row-wise convolution</font> to the covariance matrix, which maintains structural information and outputs a vector, and
    - then applies a <font color="cyan">fully-connected layer</font> and a <font color="orange">sigmoid</font> function to obtain an attention vector.<br><br>
        <table cellspacing="0" cellpadding="0" style="border:none;">
            <tbody>
                <tr>
                    <td><img src="resource/attention-gsop-block.png" width="200px"></td>
                    <td>
                        Cov pool = Covariance pooling (2nd-order pool)<br>RW Conv = row-wise convolution<br><br><br>
                        <img src="resource/gsopnet.png" width="700px">
                    </td>
                </tr>
            </tbody>
        </table>
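
A simplified sketch of the GSoP steps is given below. It makes several assumptions that are not in the text above: the $1\times1$ convolution is written as a matrix product over channels, the row-wise convolution is reduced to one learned weight vector per covariance row (producing a single value per row), and all weights are random. It is meant to show the data flow, not to reproduce the published block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gsop_block(x, w_reduce, w_row, w_fc):
    """Simplified GSoP-style channel attention for x of shape (C, H, W)."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    # Squeeze, step 1: a 1x1 convolution is a per-pixel linear map over channels.
    reduced = w_reduce @ flat                                 # (C_red, H*W)
    # Squeeze, step 2: covariance between the reduced channels.
    centered = reduced - reduced.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / (H * W)                     # (C_red, C_red)
    # Excitation, step 1: row-wise convolution (simplified to one filter per row).
    row_vec = (w_row * cov).sum(axis=1)                       # (C_red,)
    # Excitation, step 2: FC + sigmoid maps back to one weight per input channel.
    attn = sigmoid(w_fc @ row_vec)                            # (C,)
    return x * attn[:, None, None], attn, cov

rng = np.random.default_rng(0)
C, C_red, H, W = 8, 4, 6, 6               # sizes chosen only for the example
x = rng.normal(size=(C, H, W))
w_reduce = rng.normal(size=(C_red, C)) * 0.1
w_row = rng.normal(size=(C_red, C_red))
w_fc = rng.normal(size=(C, C_red))

out, attn, cov = gsop_block(x, w_reduce, w_row, w_fc)
print(out.shape, attn.shape, cov.shape)
```

Compared with global average pooling, the covariance matrix keeps pairwise (second-order) statistics between channels, which is the extra information the GSoP squeeze module is designed to provide.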

#### 4.1.3 Spatial Attention - Self-Attention
- <font color="cyan">Self-attention</font> was first proposed for, and has had great success in, <font color="orange">natural language processing (NLP)</font>.
- Recently, it has also shown the potential to become a dominant tool in computer vision.
- Typically, self-attention is used as a <font color="orange">spatial attention</font> mechanism to <font color="orange">capture global information</font>.
- Because the <font color="orange">convolutional</font> operation is local, CNNs have inherently <font color="cyan">narrow receptive fields</font>, which limits their ability to understand <font color="orange">scenes globally</font>.
- To compute <font color="cyan">self-attention</font> feature maps,
    - The convolutional image feature map $\mathbf{x}$ is branched into three copies.
    - The copies become the <font color="cyan">queries</font> $g(x)$, <font color="cyan">keys</font> $f(x)$, and <font color="cyan">values</font> $h(x)$ through <font color="orange">linear projection</font> and <font color="orange">reshaping operations</font>:
        - Key: $f(x)=\textbf{W}_f\mathbf{x}$
        - Query: $g(x)=\textbf{W}_g\mathbf{x}$
        - Value: $h(x)=\textbf{W}_h\mathbf{x}$<br><br>
    <img src="resource/attention-self-attention.png" width="700px"><br>
    - Multiply the input tensor <font color="orange">X</font> with each of these weight matrices ($\textbf{W}_f$, $\textbf{W}_g$, $\textbf{W}_h$) to get the <font color="orange">Key</font>, <font color="orange">Query</font>, and <font color="orange">Value</font> tensors.
    - Compare the <font color="orange">Query</font> with the <font color="orange">Key</font> to find out how similar they are. 
        - This gives you <font color="orange">attention scores</font>.
    - Turn these scores into <font color="orange">probabilities</font> using a <font color="orange">softmax function</font>.
    - Use these <font color="orange">probabilities</font> to weight the <font color="orange">Value</font> tensor. This <font color="orange">emphasizes</font> the important parts.
    - The result is the <font color="orange">self-attention feature map</font>, highlighting what’s most important.
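
The steps above can be sketched in NumPy as follows. The function name, the projection sizes, and the random weights are assumptions for illustration. Each spatial position is stored as a row vector here, so the projections are written as `tokens @ w` rather than $\textbf{W}\mathbf{x}$. The scores are left unscaled, whereas Transformer-style scaled dot-product attention would also divide them by the square root of the key dimension.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_f, w_g, w_h):
    """Spatial self-attention over a feature map x of shape (C, H, W)."""
    C, H, W = x.shape
    tokens = x.reshape(C, H * W).T          # (N, C): one row per spatial position
    key = tokens @ w_f                      # f(x), shape (N, d)
    query = tokens @ w_g                    # g(x), shape (N, d)
    value = tokens @ w_h                    # h(x), shape (N, C_out)
    scores = query @ key.T                  # (N, N) query-key similarity
    probs = softmax(scores, axis=-1)        # each row sums to 1
    out = probs @ value                     # weighted sum of values, (N, C_out)
    return out.T.reshape(-1, H, W), probs   # back to a (C_out, H, W) feature map

rng = np.random.default_rng(0)
C, H, W, d = 6, 4, 4, 3                     # sizes chosen only for the example
x = rng.normal(size=(C, H, W))
w_f = rng.normal(size=(C, d)) * 0.1
w_g = rng.normal(size=(C, d)) * 0.1
w_h = rng.normal(size=(C, C)) * 0.1

attn_map, probs = self_attention(x, w_f, w_g, w_h)
print(attn_map.shape, probs.shape)
```

Because `probs` has one row per position and one column per every other position, each output location can draw on the whole feature map, which is how self-attention sidesteps the narrow receptive field of a convolution.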

#### 4.1.4 Multi-head Attention (MHA)

- <font color="orange">Multi-head Attention</font> combines several <font color="cyan">self-attention</font> heads, each with its own parameter matrices ($\textbf{W}_f$, $\textbf{W}_g$, $\textbf{W}_h$), in the hope that the heads learn different, subtle contextual information.
- It is a simple extension of <font color="orange">single-head attention</font>: each head learns its own features independently, much as each kernel does in a CNN.
- The outputs of all <font color="orange">self-attention heads</font> are concatenated and combined by a <font color="cyan">fully connected layer</font> to produce the final <font color="orange">weighted vector</font>.<br><br>
    <table cellspacing="0" cellpadding="0" style="border:none;">
        <tbody>
            <tr>
                <td>Self-attention (Scaled Dot-Product)</td>
                <td>Multi-head Attention</td>
            </tr>
            <tr>
                <td>
                    <img src="resource/self-attention.png" width="300px">
                </td>
                <td>
                    <img src="resource/multi-head-attention.png" width="300px">
                </td>
            </tr>
        </tbody>
    </table>
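
A minimal sketch of multi-head attention with scaled dot-product heads is shown below. The names `w_q`, `w_k`, `w_v`, `w_o` and all sizes are assumptions for the example. Instead of keeping a separate matrix per head, one large projection is computed and then split into heads, which is an equivalent and common way to organise the computation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """tokens: (N, d_model). Returns an (N, d_model) array."""
    n, d_model = tokens.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split(t):
        # (N, d_model) -> (num_heads, N, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    # Scaled dot-product attention, computed independently for every head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # (num_heads, N, N)
    heads = softmax(scores, axis=-1) @ v                     # (num_heads, N, d_head)
    # Concatenate the heads and mix them with a final fully connected layer.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
N, d_model = 5, 8                                            # example sizes
tokens = rng.normal(size=(N, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

mha_out = multi_head_attention(tokens, w_q, w_k, w_v, w_o, num_heads=2)
print(mha_out.shape)
```

With `num_heads=1` the function reduces to ordinary single-head scaled dot-product attention followed by one linear layer.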

#### 4.1.5 Vision Transformer (ViT)
- The <font color="orange">Multi-head Attention</font> mechanism is widely used in <font color="cyan">Transformer</font>-based deep neural networks for natural language processing, language understanding, and even computer vision.
- In computer vision, <font color="orange">Multi-head Attention</font> is notably used by the <font color="cyan">Vision Transformer (ViT)</font>, which is the first <font color="cyan">pure transformer architecture</font> for <font color="orange">image processing</font>.
- It is capable of achieving comparable results to modern convolutional neural networks.<br><br>
    <img src="resource/vision-transformer.png" width="800px"><br><br>
    - <font color="orange">Vision transformer (ViT)</font> first splits the image into a <font color="cyan">sequence</font> of <font color="orange">patches</font> and prepends a <font color="orange">class token</font> to it.
    - <font color="orange">Multi-head Attention</font> then takes this sequence and computes <font color="orange">Query</font>, <font color="orange">Key</font> and <font color="orange">Value</font>, each divided into <font color="orange">$H$ heads</font>.
    - <font color="orange">Self-attention</font> is applied internally within each <font color="orange">head</font>.
    - The network stacks a number of these <font color="orange">Multi-head Attention</font> layers, interleaved with <font color="orange">fully connected layers</font>.
- <font color="orange">ViT</font> demonstrates that a pure attention-based network can achieve <font color="orange">better results</font> than a <font color="orange">convolutional neural network (CNN)</font>, especially when trained on large datasets such as JFT-300M [[Paper](https://arxiv.org/pdf/1707.02968)] and ImageNet-21K [[Paper](https://ieeexplore.ieee.org/document/5206848)].
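
The first ViT step, turning an image into a token sequence, can be sketched as below. The function name `image_to_tokens`, the patch size, the embedding width, and the random weights are assumptions for illustration. The linear patch embedding is included because attention layers need fixed-width tokens, while the position embeddings that ViT also adds are left out to keep the sketch short.

```python
import numpy as np

def image_to_tokens(image, patch_size, w_embed, class_token):
    """Split an (H, W, C) image into flattened patches and prepend a class token."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be a multiple of the patch size"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C)
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)        # (num_patches, p*p*C)
    embedded = patches @ w_embed                    # (num_patches, d_model)
    # The class token becomes the first element of the sequence.
    return np.concatenate([class_token[None, :], embedded], axis=0)

rng = np.random.default_rng(0)
H, W, C, p, d_model = 32, 32, 3, 8, 16              # example sizes only
image = rng.normal(size=(H, W, C))
w_embed = rng.normal(size=(p * p * C, d_model)) * 0.02
class_token = np.zeros(d_model)                     # learned in practice; zeros here

vit_tokens = image_to_tokens(image, p, w_embed, class_token)
print(vit_tokens.shape)
```

The resulting sequence of `1 + (H / p) * (W / p)` tokens is what the stacked multi-head attention layers operate on.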

<br><br><br>
- Let's experiment with combining a <font color="orange">CNN</font> with an attention mechanism, the <font color="orange">Squeeze-and-Excitation block</font>.
- Click the button below to open the notebook <font color="orange">'4.2 experiment_cnn_with_se_block.ipynb'</font> in Google Colab.<br>
<a href="https://colab.research.google.com/github/Muhammad-Yunus/Belajar-Image-Classification/blob/main/Pertemuan%204/4.2%20experiment_cnn_with_se_block.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_________________________________________________________________________
<br><br><br>
# Source
- https://lilianweng.github.io/posts/2018-06-24-attention/?ref=blog.paperspace.com
- https://link.springer.com/content/pdf/10.1007/s41095-022-0271-y.pdf
- https://www.researchgate.net/figure/Before-inputting-the-SE-attention-mechanism-left-colorless-figure-C-the-importance-of_fig1_366512193
- https://www.digitalocean.com/community/tutorials/attention-mechanisms-in-computer-vision-cbam
- https://www.researchgate.net/figure/Diagram-of-the-channel-attention-module-and-spatial-attention-module-for-the_fig3_347669937
- https://arxiv.org/pdf/1805.08318
- https://arxiv.org/pdf/1811.12006v2
- https://medium.com/@shravankoninti/transformers-attention-is-all-you-need-overview-on-multi-headed-attention-379eb8d095dc