## 4.1 Attention Mechanism in CNN
- Intro To Attention Mechanism
- Experiment adding Attention Mechanism to CNN Model

### 4.1.1 Intro To Attention Mechanism
- <font color="orange">Attention</font> is, to some extent, motivated by how we,
    - <font color="orange">pay visual attention</font> to <font color="cyan">different regions</font> of an <font color="orange">image</font> (*vision task*), or 
    - <font color="cyan">correlate words</font> in one <font color="orange">sentence</font> (*language task*).<br><br>
<img src="resource/shiba-example-attention.png" width="700px"><br><br>
<img src="resource/sentence-example-attention.png" width="700px"><br>
- Humans can naturally and effectively find <font color="orange">salient regions</font> in complex scenes. 
- Attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system.
- An attention mechanism can be regarded as a <font color="orange">dynamic weight adjustment</font> process based on <font color="orange">features</font> of the input image.<br><br>
- <font color="orange">Timeline of developments</font> in attention for computer vision : <br><br>
<img src="resource/attension-dev.png" width="900px"><br><br>
- Attention mechanisms can be <font color="orange">categorised</font> according to data domain : <br><br>
<img src="resource/attention-category.png" width="500px"><br><br>
    - <font color="cyan">Channel attention</font> generates an <font color="orange">attention mask</font> across the <font color="orange">channel domain</font> and uses it to select important channels. 
    - <font color="cyan">Spatial attention</font> generates an <font color="orange">attention mask</font> across <font color="orange">spatial domains</font> and uses it to select important spatial regions or to predict the most relevant spatial position directly.<br><br>
    - <font color="cyan">Channel & spatial attention</font> predicts <font color="orange">channel</font> and <font color="orange">spatial attention masks</font> separately, or generates a joint 3-D (channel, height, width) attention mask directly, and uses it to select <font color="orange">important features</font>.
    - <font color="cyan">Temporal attention</font> generates an <font color="orange">attention mask</font> in <font color="orange">time</font> and uses it to select <font color="orange">key frames</font>.
    - <font color="cyan">Spatial & temporal attention</font> computes <font color="orange">temporal</font> and <font color="orange">spatial attention masks</font> separately, or produces a joint <font color="orange">spatiotemporal attention</font> mask.
    - <font color="cyan">Branch attention</font> generates an <font color="orange">attention mask</font> across the different branches and uses it to select <font color="orange">important branches</font>.<br><br>
<img src="resource/Attention-visual.png" width="700px">
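
The difference between a channel mask and a spatial mask is mostly a difference in shape. The snippet below is a minimal sketch, assuming a PyTorch feature map of shape `(N, C, H, W)`. The masks are built from a simple hand-picked rule (a sigmoid over pooled features) rather than learned layers, so it only illustrates the "dynamic weight adjustment" idea, not any specific published method.

```python
import torch

# Dummy feature map: batch of 2, 8 channels, 5x5 spatial size (illustrative shapes)
torch.manual_seed(0)
features = torch.randn(2, 8, 5, 5)

# Channel attention: one weight per channel -> mask shape (N, C, 1, 1)
channel_mask = torch.sigmoid(features.mean(dim=(2, 3), keepdim=True))

# Spatial attention: one weight per spatial position -> mask shape (N, 1, H, W)
spatial_mask = torch.sigmoid(features.mean(dim=1, keepdim=True))

# Broadcasting rescales the features; the output keeps the input shape
channel_refined = features * channel_mask
spatial_refined = features * spatial_mask

print(channel_mask.shape, spatial_mask.shape, channel_refined.shape)
```

Because both masks depend on the input features, the weights change from image to image, which is what distinguishes attention from a fixed, learned scaling.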

#### 4.1.1.1 Channel Attention - SENet
- <font color="cyan">SENet</font> pioneered channel attention.
- The core of <font color="cyan">SENet</font> is the *<font color="orange">squeeze-and-excitation</font>* (SE) block, which is used to collect <font color="orange">global information</font>, capture <font color="orange">channel-wise relationships</font>, and improve <font color="orange">representation ability</font>.
- <font color="orange">SE blocks</font> are divided into two parts, a <font color="cyan">squeeze</font> module and an <font color="cyan">excitation</font> module.
    - <font color="orange">Global information</font> is collected in the <font color="orange">squeeze</font> module by <font color="cyan">global average pooling (GAP)</font>.
    - The <font color="orange">excitation</font> module captures <font color="orange">channel-wise relationships</font> and outputs an <font color="orange">attention vector</font> by using <font color="cyan">fully-connected layers</font> and <font color="cyan">non-linear activation layers</font> (ReLU and sigmoid).
    - Then, <font color="orange">each channel</font> of the <font color="cyan">input feature</font> is scaled by <font color="orange">multiplying</font> it by the corresponding element in the <font color="cyan">attention vector</font>.<br><br>
        <table cellspacing="0" cellpadding="0" style="border:none;">
            <tbody>
                <tr>
                    <td><img src="resource/attention-se-block.png" width="250px"></td>
                    <td>GAP = global average pooling<br>FC = fully-connected layer<br><br><br><img src="resource/attention-se-block-2.png" width="500px"><br></td>
                </tr>
            </tbody>
        </table><br><br>
- <font color="orange">SE blocks</font> play the role of <font color="orange">emphasizing important channels</font> while <font color="orange">suppressing noise</font>.
- However, SE blocks have shortcomings. 
    - In the <font color="orange">squeeze</font> module, <font color="orange">global average pooling</font> is <font color="cyan">too simple</font> to capture complex global information. 
    - In the <font color="orange">excitation</font> module, <font color="orange">fully-connected layers</font> <font color="cyan">increase the complexity</font> of the model.
    - Later works attempt to improve the output of the squeeze module (e.g., GSoP-Net).
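
The squeeze and excitation steps described above can be sketched in PyTorch as follows. This is a minimal sketch, not the reference SENet implementation: the class name `SEBlock` and the `reduction` ratio (how much the first fully-connected layer shrinks the channel dimension) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (illustrative sketch)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Squeeze: global average pooling collapses each channel to one value
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: FC -> ReLU -> FC -> sigmoid produces the attention vector
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        squeezed = self.squeeze(x).view(n, c)               # (N, C)
        weights = self.excitation(squeezed).view(n, c, 1, 1)  # (N, C, 1, 1)
        # Scale each input channel by its attention weight
        return x * weights

torch.manual_seed(0)
x = torch.randn(2, 16, 7, 7)
se = SEBlock(channels=16, reduction=4)
y = se(x)
print(y.shape)
```

The output has the same shape as the input, so a block like this can be dropped between existing convolutional layers without changing the rest of the network.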


#### 4.1.1.2 Channel Attention - GSoP-Net
- GSoP-Net improves the squeeze module by using a global second-order pooling (GSoP) block.
- Like an SE block, a GSoP block also has a squeeze module and an excitation module.
- The squeeze module of a GSoP block, 
    - First reduces the number of channels using a convolution.
    - Then computes a covariance matrix for the different channels to obtain their correlation.
- The excitation module of a GSoP block,
    - Applies a row-wise convolution to the covariance matrix to maintain structural information and output a vector.
    - Then applies a fully-connected layer and a sigmoid function to get an attention vector<br><br>
        <table cellspacing="0" cellpadding="0" style="border:none;">
            <tbody>
                <tr>
                    <td><img src="resource/attention-gsop-block.png" width="250px"></td>
                    <td>Cov pool = Covariance pooling<br>RW Conv = row-wise convolution</td>
                </tr>
            </tbody>
        </table>
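
The GSoP steps listed above can be sketched as follows. This is a simplified sketch under stated assumptions, not the official GSoP-Net code: the class name `GSoPBlock`, the 1x1 kernel for the channel reduction, the reduced width `reduced_channels`, and the factor of 4 output channels per covariance row are all illustrative choices. The row-wise convolution is modeled as a grouped convolution with one group per covariance row.

```python
import torch
import torch.nn as nn

class GSoPBlock(nn.Module):
    """Simplified global second-order pooling block (illustrative sketch)."""

    def __init__(self, channels, reduced_channels=8):
        super().__init__()
        # Squeeze, step 1: a 1x1 convolution reduces the number of channels
        self.reduce = nn.Conv2d(channels, reduced_channels, kernel_size=1)
        # Excitation, step 1: row-wise convolution, one group per covariance row
        self.row_conv = nn.Conv2d(
            reduced_channels, 4 * reduced_channels,
            kernel_size=(reduced_channels, 1), groups=reduced_channels,
        )
        # Excitation, step 2: fully-connected layer (sigmoid applied in forward)
        self.fc = nn.Linear(4 * reduced_channels, channels)

    def forward(self, x):
        n, c, h, w = x.shape
        z = self.reduce(x).flatten(2)              # (N, c', H*W)
        z = z - z.mean(dim=2, keepdim=True)        # center each channel
        # Squeeze, step 2: channel covariance matrix
        cov = z @ z.transpose(1, 2) / (h * w)      # (N, c', c')
        rows = cov.unsqueeze(-1)                   # (N, c', c', 1): one row per group
        v = self.row_conv(rows).flatten(1)         # (N, 4 * c')
        attention = torch.sigmoid(self.fc(v))      # (N, C) attention vector
        return x * attention.view(n, c, 1, 1)

torch.manual_seed(0)
x = torch.randn(2, 32, 7, 7)
gsop = GSoPBlock(channels=32, reduced_channels=8)
y = gsop(x)
print(y.shape)
```

Compared with global average pooling, the covariance matrix keeps pairwise channel statistics, which is the "second-order" information the block is named after.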

#### 4.1.1.3 Spatial Attention - Self-Attention Based Method
- Self-attention was first proposed in the field of natural language processing (NLP), where it has had great success.
- Recently, it has also shown the potential to become a dominant tool in computer vision.
- Typically, self-attention is used as a spatial attention mechanism to capture global information.
- Because convolution is a local operation, CNNs have inherently narrow receptive fields, which limits their ability to understand scenes globally.
- Self-attention first computes the queries, keys, and values Q, K, V by linear projection and reshaping operations. It then compares the queries with the keys to obtain attention weights, and uses those weights to form a weighted sum of the values.
<img src="resource/attention-self-attention.png" width="700px">
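
The Q, K, V computation can be sketched for a CNN feature map as follows. This is a minimal sketch, assuming 1x1 convolutions as the linear projections and a scaled dot-product with softmax as the similarity function. The class name `SpatialSelfAttention`, the `key_channels` parameter, and the square-root scaling are illustrative assumptions rather than a specific published design.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Minimal self-attention over spatial positions (illustrative sketch)."""

    def __init__(self, channels, key_channels=4):
        super().__init__()
        # Linear projections implemented as 1x1 convolutions
        self.query = nn.Conv2d(channels, key_channels, kernel_size=1)
        self.key = nn.Conv2d(channels, key_channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = key_channels ** -0.5

    def forward(self, x):
        n, c, h, w = x.shape
        # Reshape: every spatial position becomes one token
        q = self.query(x).flatten(2).transpose(1, 2)   # (N, H*W, d)
        k = self.key(x).flatten(2)                     # (N, d, H*W)
        v = self.value(x).flatten(2).transpose(1, 2)   # (N, H*W, C)
        # Each position attends to every other position
        attn = torch.softmax(q @ k * self.scale, dim=-1)   # (N, H*W, H*W)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return out, attn

torch.manual_seed(0)
x = torch.randn(2, 8, 6, 6)
layer = SpatialSelfAttention(channels=8, key_channels=4)
out, attn = layer(x)
print(out.shape, attn.shape)
```

The attention matrix has one row per spatial position and one column per position it attends to, so its size grows quadratically with the number of pixels. That cost is the price paid for the global receptive field.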

### 4.1.2 Experiment adding Attention Mechanism to CNN Model

⚠️⚠️⚠️ *Please open this notebook in Google Colab* by clicking the link below ⚠️⚠️⚠️<br><br>
<a href="https://colab.research.google.com/github/Muhammad-Yunus/Belajar-Image-Classification/blob/main/Pertemuan%203/3.1%20intro_to_cnn.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><br><br><br>
- Click the `Connect` button at the top right of the Google Colab notebook,<br>
<img src="resource/cl-connect-gpu.png" width="250px">
- Once the connection is established, it will look something like this<br>
<img src="resource/cl-connect-gpu-success.png" width="250px">

- Check that the GPU attached to the Colab environment is active

In [None]:
!nvidia-smi

- All code remains the same as in <font color="orange">Pertemuan 1</font>, except for the model design part.

In [None]:
!pip install gdown

import os
import cv2
import gdown
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split

import torchvision
from torchvision import transforms

from IPython import display

# clear output cell
display.clear_output()

print(f"torch : {torch.__version__}")
print(f"torch vision : {torchvision.__version__}")

- Download MNIST Dataset

In [None]:
DATASET_NAME = 'MNIST' # the dataset name
DATASET_NUM_CLASS = 10 # number of class in dataset

In [None]:
# default using gdrive_id Dataset `mnist_dataset.zip` (1-FfwJrllyHofQwIbMb_IxAkxnfMGSFmR)
gdrive_id = '1-FfwJrllyHofQwIbMb_IxAkxnfMGSFmR' # <-----  ⚠️⚠️⚠️ USE YOUR OWN GDrive ID FOR CUSTOM DATASET ⚠️⚠️⚠️

# download zip from GDrive
url = f'https://drive.google.com/uc?id={gdrive_id}'
gdown.download(url, DATASET_NAME + ".zip", quiet=False)

# unzip dataset
!unzip {DATASET_NAME}.zip -d {DATASET_NAME}

# clear output cell
display.clear_output()

- Load MNIST Dataset
    - <font color="orange">DON'T FLATTEN THE INPUT IMAGE IN THE DATA LOADER</font>,
    - WE WILL FEED THE 2D 28x28 MNIST DIGIT IMAGE DATA INTO THE MODEL

In [None]:
# Define Custom Dataset class
# it's just helper to load image dataset using OpenCV and convert to pytorch tensor
# also doing a label encoding using one-hot encoding
class CustomDataset(Dataset):
    def __init__(self, root_dir):
        self.root_dir = root_dir
        self.image_files = sorted([file for file in os.listdir(root_dir) if file.lower().endswith('.png')])

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Read image from corresponding .png file
        image_path = os.path.join(self.root_dir, self.image_files[idx])
        image = cv2.imread(image_path)  # Load image using OpenCV
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # Convert BGR to GRAY
        image = torch.from_numpy(image).to(torch.float32)  # Convert NumPy array to PyTorch tensor
        image = image.view(1, 28, 28) # reshape loaded MNIST image into 1x28x28 format (required by the model)

        # Read label from corresponding .txt file
        label_path = os.path.splitext(image_path)[0] + ".txt"
        with open(label_path, 'r') as label_file:
            label = int(label_file.read().strip())  # Assuming labels are integers

        # Apply one-hot encoding into label
        labels_tensor = torch.tensor(label)
        one_hot_encoded = F.one_hot(labels_tensor, num_classes=DATASET_NUM_CLASS).to(torch.float32)

        return image, one_hot_encoded



# instantiate dataset
# at this point the images are not loaded yet
# we only read all image file names in the dataset folder
all_train_dataset = CustomDataset(root_dir=f'{DATASET_NAME}/dataset/train')
test_dataset = CustomDataset(root_dir=f'{DATASET_NAME}/dataset/test')

In [None]:
print(f"All Train Dataset : {len(all_train_dataset)} data")
print(f"Test Dataset : {len(test_dataset)} data")

In [None]:
# Split 'all_train_dataset' into 'train' and 'validation' set using `random_split()` function
train_dataset, validation_dataset = random_split(all_train_dataset, [50000, 10000])

print(f"Train Dataset : {len(train_dataset)} data")
print(f"Validation Dataset : {len(validation_dataset)} data")

In [None]:
# Create data loaders
BATCH_SIZE = 128

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

- Here we use the model defined in <font color="orange">'3.3 cnn_with_batch_normalization.ipynb'</font>.
    - With an additional <font color="orange">Attention Mechanism</font><br>


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define the model using nn.Sequential
model = nn.Sequential(
    # Convolutional layers with Batch Normalization
    nn.Conv2d(in_channels=1, out_channels=12, kernel_size=3, padding=2, bias=False),
    nn.BatchNorm2d(num_features=12, affine=True), 
    nn.ReLU(),
    nn.Conv2d(in_channels=12, out_channels=24, kernel_size=6, stride=2, padding=2, bias=False),
    nn.BatchNorm2d(num_features=24, affine=True),
    nn.ReLU(),
    nn.Conv2d(in_channels=24, out_channels=32, kernel_size=6, stride=2, padding=2, bias=False),
    nn.BatchNorm2d(num_features=32, affine=True),
    nn.ReLU(),
    
    # Flatten layer
    nn.Flatten(),
    
    # Fully connected layers with Batch Normalization
    nn.Linear(in_features=32 * 7 * 7, out_features=200, bias=False),
    nn.BatchNorm1d(num_features=200, affine=True), 
    nn.ReLU(),
    nn.Dropout(0.6),
    nn.Linear(in_features=200, out_features=10),
    nn.LogSoftmax(dim=1)
).to(device)

# Iterate over model to find BatchNorm layers and modify them
for layer in model:
    if isinstance(layer, nn.BatchNorm2d) or isinstance(layer, nn.BatchNorm1d):
        # Set weight to 1 (disabling scaling)
        with torch.no_grad():
            layer.weight.fill_(1.0)  # Set weight to 1
        # Freeze the weight from being updated
        layer.weight.requires_grad = False
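
One possible way to add channel attention to this network is to place an SE block after each convolution, batch-normalization, and ReLU stage. The sketch below is an assumption about how that could look, not the notebook's confirmed design: `SEBlock`, its `reduction` ratio, and the variable name `model_with_se` are hypothetical. The layer sizes are copied from the model above, and a dummy batch is used to check the output shape.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (hypothetical helper)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Squeeze with global average pooling, then excite -> (N, C)
        weights = self.excitation(x.mean(dim=(2, 3)))
        # Rescale every channel by its attention weight
        return x * weights[:, :, None, None]

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Same layers as the batch-normalization CNN, with an SE block after each conv stage
model_with_se = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=3, padding=2, bias=False),
    nn.BatchNorm2d(12), nn.ReLU(), SEBlock(12),
    nn.Conv2d(12, 24, kernel_size=6, stride=2, padding=2, bias=False),
    nn.BatchNorm2d(24), nn.ReLU(), SEBlock(24),
    nn.Conv2d(24, 32, kernel_size=6, stride=2, padding=2, bias=False),
    nn.BatchNorm2d(32), nn.ReLU(), SEBlock(32),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 200, bias=False),
    nn.BatchNorm1d(200), nn.ReLU(), nn.Dropout(0.6),
    nn.Linear(200, 10),
    nn.LogSoftmax(dim=1),
).to(device)

# Shape check with a dummy batch of four 28x28 grayscale images
model_with_se.eval()
with torch.no_grad():
    dummy = torch.randn(4, 1, 28, 28, device=device)
    out = model_with_se(dummy)
print(out.shape)
```

To train this variant with the cells that follow, assign it to `model` before creating the optimizer. The BatchNorm freezing loop still applies, because it only touches the top-level `BatchNorm2d` and `BatchNorm1d` layers.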

In [None]:
# setup optimizer, loss function & metric
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.CrossEntropyLoss()
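
The model ends with `nn.LogSoftmax`, while `nn.CrossEntropyLoss` applies a log-softmax internally, and the data loader yields one-hot float labels rather than class indices. The small check below, using made-up logits and labels, is a sketch showing that this combination still gives the usual cross-entropy value: applying log-softmax twice is the same as applying it once, and recent PyTorch versions (1.10 and later) accept class-probability targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                       # made-up raw scores
class_ids = torch.tensor([3, 1, 7, 0])            # made-up labels
one_hot = F.one_hot(class_ids, num_classes=10).to(torch.float32)

loss_fn = nn.CrossEntropyLoss()

# Raw logits with class-index targets (the most common usage)
loss_indices = loss_fn(logits, class_ids)
# Raw logits with one-hot float targets (as produced by the data loader)
loss_one_hot = loss_fn(logits, one_hot)
# Log-probabilities with one-hot targets (as produced by a LogSoftmax output layer)
loss_log_probs = loss_fn(F.log_softmax(logits, dim=1), one_hot)

print(loss_indices.item(), loss_one_hot.item(), loss_log_probs.item())
```

All three values agree up to floating-point error, so the extra `LogSoftmax` layer is redundant for training but harmless.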

- To run training process, we can use the following code

In [None]:
!pip install tqdm

from tqdm import tqdm

In [None]:
def train(model, train_loader, optimizer, loss_function):
    model.train()
    running_loss = 0.0
    correct_predictions = 0
    total_predictions = 0

    # Add progress bar for training loop
    progress_bar = tqdm(train_loader, desc='Training', leave=False)

    for inputs, labels in progress_bar:
        inputs = inputs.to(device) # move inputs to device
        labels = labels.to(device) # move labels to device

        # resets the gradients of all the model's parameters before the backward pass
        optimizer.zero_grad()
        # pass 2D 28x28 input tensor to CNN model
        outputs = model(inputs)
        # calc loss value
        loss = loss_function(outputs, labels)
        # computes the gradient of the loss with respect to each parameter in model
        loss.backward()
        # adjust model parameter
        optimizer.step()
        # sum loss value
        running_loss += loss.item()

        # Calculate correct & total prediction
        _, predicted = torch.max(outputs, 1)
        correct_predictions += (predicted == labels.argmax(1)).sum().item()
        total_predictions += labels.size(0)

        # Update progress bar description with current loss
        progress_bar.set_postfix(loss=loss.item())

    # Calculate average training loss per batch
    # (loss.item() is already the mean over a batch, so divide by the number of batches)
    average_train_loss = running_loss / len(train_loader)
    # Calculate training accuracy
    train_accuracy = correct_predictions / total_predictions
    return average_train_loss, train_accuracy

def validate(model, val_loader, loss_function):
    model.eval()
    running_loss = 0.0
    correct_predictions = 0
    total_predictions = 0

    # Add progress bar for validation loop
    progress_bar = tqdm(val_loader, desc='Validating', leave=False)

    with torch.no_grad():
        for inputs, labels in progress_bar:
            inputs = inputs.to(device) # move inputs to device
            labels = labels.to(device) # move labels to device

            # pass 2D 28x28 input tensor to CNN model
            outputs = model(inputs)
            # calc loss value
            loss = loss_function(outputs, labels)
            # sum loss value
            running_loss += loss.item()

            # Calculate correct & total prediction
            _, predicted = torch.max(outputs, 1)
            correct_predictions += (predicted == labels.argmax(1)).sum().item()
            total_predictions += labels.size(0)

            # Update progress bar description with loss
            progress_bar.set_postfix(loss=loss.item())

    # Calculate average validation loss per batch
    # (loss.item() is already the mean over a batch, so divide by the number of batches)
    average_val_loss = running_loss / len(val_loader)
    # Calculate validation accuracy
    val_accuracy = correct_predictions / total_predictions
    return average_val_loss, val_accuracy





# This is a training loop for selected Epoch
# each epoch will process all training and validation set, chunked into small batch size data
# then measure the loss & accuracy of training and validation set
NUM_EPOCH = 10      # you can change this value

train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []

for epoch in range(NUM_EPOCH):
    print(f"Epoch {epoch+1}/{NUM_EPOCH}")

    train_loss, train_accuracy = train(model, train_loader, optimizer, loss_function)
    val_loss, val_accuracy = validate(model, validation_loader, loss_function)

    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accuracies.append(train_accuracy * 100)  # convert to percentage
    val_accuracies.append(val_accuracy * 100)  # convert to percentage

    print(f"Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}, Train Accuracy = {train_accuracy:.4f}, Val Accuracy = {val_accuracy:.4f}\n")

- Plot Loss and Accuracy of Training vs Validation Set 

In [None]:
# visualize Loss & Accuracy
import matplotlib.pyplot as plt

epochs = list(range(1, NUM_EPOCH + 1))

# Plotting loss
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses, 'b', label='Training Loss')
plt.plot(epochs, val_losses, 'r', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

# Plotting accuracy
plt.subplot(1, 2, 2)
plt.plot(epochs, train_accuracies, 'b', label='Training Accuracy')
plt.plot(epochs, val_accuracies, 'r', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


- Evaluate the model: find the precision and recall of each class, measure accuracy, and compute the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import seaborn as sns
import numpy as np

# define evaluate function for test set
def evaluate(model, test_loader):
    model.eval()
    all_labels = []
    all_preds = []

    # Add progress bar for evaluation loop
    progress_bar = tqdm(test_loader, desc='Evaluating', leave=False)

    with torch.no_grad():
        # iterate over all batched test set
        for inputs, labels in progress_bar:
            inputs = inputs.to(device) # move inputs to device
            labels = labels.to(device) # move labels to device

            # pass 2D 28x28 input tensor to CNN model
            outputs = model(inputs)
            # get prediction
            _, preds = torch.max(outputs, 1)
            # collect all labels & preds
            all_labels.extend(labels.cpu().numpy())
            all_preds.extend(preds.cpu().numpy())

    return all_labels, all_preds

# Evaluation on test set
all_labels, all_preds = evaluate(model, test_loader)
all_labels = np.argmax(all_labels, axis=1)

# Calculate classification report
labels = [str(i) for i in range(DATASET_NUM_CLASS)]
print(classification_report(all_labels, all_preds, target_names=labels))

# Confusion Matrix
conf_matrix = confusion_matrix(all_labels, all_preds)

# Plotting the confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
plt.xlabel('Predicted Class')
plt.ylabel('Actual Class')
plt.title('Confusion Matrix')
plt.show()
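
The per-class precision and recall in the classification report can be read directly off a confusion matrix. The sketch below uses a small made-up 3-class matrix, not results from this notebook. It assumes the scikit-learn layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Made-up confusion matrix: rows = actual class, columns = predicted class
conf = np.array([
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 7],
])

true_positives = np.diag(conf)

# Precision: of everything predicted as class k, how much really is class k
precision = true_positives / conf.sum(axis=0)
# Recall: of everything that really is class k, how much was predicted as class k
recall = true_positives / conf.sum(axis=1)
# Accuracy: correct predictions over all predictions
accuracy = true_positives.sum() / conf.sum()

for k in range(conf.shape[0]):
    print(f"class {k}: precision={precision[k]:.3f}, recall={recall[k]:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Column sums drive precision and row sums drive recall, so a class with many off-diagonal entries in its column is over-predicted, while one with many in its row is often missed.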

- Download Model 

In [None]:
# Save the model
torch.save(model.state_dict(), 'trained_cnn_model.pt')

# Download the model file
from google.colab import files
files.download('trained_cnn_model.pt')
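
Because only the `state_dict` is saved, the file holds parameter tensors and not the architecture, so loading it later requires rebuilding the same model first. The sketch below shows the round trip with a tiny stand-in network. The helper `build_model` and the temporary file name are hypothetical; for the real model, the same `nn.Sequential` definition used for training would take their place.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # Tiny stand-in architecture (hypothetical); must match the saved one exactly
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Save only the parameters, as in the cell above
trained = build_model()
path = os.path.join(tempfile.gettempdir(), 'example_model.pt')
torch.save(trained.state_dict(), path)

# Later: rebuild the architecture, then load the parameters into it
restored = build_model()
restored.load_state_dict(torch.load(path, map_location='cpu'))
restored.eval()  # switch dropout / batch norm to inference behavior

# Both models now give the same prediction
sample = torch.randn(1, 1, 28, 28)
with torch.no_grad():
    same = torch.allclose(trained(sample), restored(sample))
print(same)
```

`map_location='cpu'` lets a file saved on a GPU runtime be loaded on a machine without one.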

>
>## Discussion
>- It looks like dropout does not help much to reduce overfitting,
>- It also has a negative impact, reducing training accuracy.
>- Next, we will explore another regularization technique,
>- called <font color="cyan">Batch Normalization</font>.

- Open <font color="orange">'3.3 cnn_with_batch_normalization.ipynb'</font> in Google Colab to learn more...<br> 
<a href="https://colab.research.google.com/github/Muhammad-Yunus/Belajar-Image-Classification/blob/main/Pertemuan%203/3.3%20cnn_with_batch_normalization.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_________________________________________________________________________
<br><br><br>
# Source
- https://lilianweng.github.io/posts/2018-06-24-attention/?ref=blog.paperspace.com
- https://link.springer.com/content/pdf/10.1007/s41095-022-0271-y.pdf
- https://www.researchgate.net/figure/Before-inputting-the-SE-attention-mechanism-left-colorless-figure-C-the-importance-of_fig1_366512193
- https://www.digitalocean.com/community/tutorials/attention-mechanisms-in-computer-vision-cbam