# Task 4
## 1. Environment Setup
1. ... already set up
---
## 2. Work Sequentially
- Using Copilot (Gemini 2.5 Pro) for assistance

#### a. Subtask 1
(1) Word Embedding
1. **Definition**: A technique in NLP that maps words or phrases to vectors of real numbers.
2. **Purpose**: Captures semantic meaning and relationships between words.
3. **Comparison with traditional methods**: Traditional representations such as one-hot encoding are very high-dimensional and sparse, and every pair of distinct words is equally dissimilar, so they are inefficient and cannot express semantic relationships. Word embeddings instead use dense vectors in a lower-dimensional space, where related words can lie close together.
4. **Common Examples**:
    - Word2Vec
        1. Uses a shallow neural network to learn word associations from a large text corpus.
        2. Offers two model architectures: Skip-gram (predict the context words given a target word) and Continuous Bag of Words (CBOW) (predict the target word given its context words).
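
To make the difference between the two Word2Vec objectives concrete, here is a minimal sketch of how training examples could be generated from a token list. The function names `skipgram_pairs` and `cbow_examples` and the `window` parameter are illustrative assumptions, not part of any Word2Vec library API, and no model is trained here.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs: the target word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=1):
    """Return (context_words, target) examples: the neighbors predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = ["the", "cat", "sat"]
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_examples(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```

Skip-gram produces one example per (target, neighbor) pair, while CBOW produces one example per position with all neighbors grouped as the input.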

(2) Multi-head Self-Attention
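1. **Main Idea**: Allows the model to jointly attend to information from different representation subspaces at different positions. Each head applies its own learned projections of the queries, keys, and values, and the heads' outputs are concatenated. (A loose analogy: somewhat like running several searches in parallel, as in stochastic beam search.)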
2. **Scaled Dot-Product Attention**:
    $$
    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
    $$
    - `Q` (**Query**): A matrix representing a set of queries. In self-attention, this is a projection of the input sequence. Each query vector represents a word asking for attention from all other words.
    - `K` (**Key**): A matrix representing a set of keys. This is another projection of the input sequence. Each key vector can be thought of as a "label" for a word, which is matched against the queries.
    - `V` (**Value**): A matrix representing a set of values. This is a third projection of the input sequence. Each value vector contains the actual information of a word that should be passed on.
    - `d_k`: The dimension of the key vectors (and query vectors). The scaling factor `sqrt(d_k)` is crucial. For large values of `d_k`, the dot products can grow very large in magnitude, pushing the softmax function into regions where it has extremely small gradients. Scaling counteracts this effect, leading to more stable training.
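
The effect of the scaling factor can be checked numerically. The sketch below assumes queries and keys with independent standard-normal entries (an idealization, not something the formula requires): in that case the dot product has variance `d_k`, so its standard deviation grows like `sqrt(d_k)`, and dividing by `sqrt(d_k)` brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)


def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q.k, raw and scaled, for standard-normal q and k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=-1)
    return dots.std(), (dots / np.sqrt(d_k)).std()


for d_k in (4, 64, 512):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

The raw standard deviation should come out near 2, 8, and about 22.6 for the three sizes, while the scaled one stays near 1 regardless of `d_k`.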

#### b. Subtask 2

```python
# Subtask 2
import numpy as np

np.random.seed(114514)

def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute the scaled dot-product attention.

    Args:
        Q (np.array): Queries of shape (..., seq_len_q, depth)
        K (np.array): Keys of shape (..., seq_len_k, depth)
        V (np.array): Values of shape (..., seq_len_v, depth_v), with seq_len_v == seq_len_k
        mask (np.array, optional): Float array with shape broadcastable to (..., seq_len_q, seq_len_k).
            1 marks a position to block, 0 a position to keep. Defaults to None.

    Returns:
        (np.array, np.array)
        output: Attention output of shape (..., seq_len_q, depth_v)
        attention_weights: Attention weights of shape (..., seq_len_q, seq_len_k)
    """
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.swapaxes(-2, -1)) / np.sqrt(d_k)
    if mask is not None:
        # Push blocked positions to a very negative score so softmax gives them ~0 weight.
        scores += (mask * -1e9)
    attention_weights = softmax(scores)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

def multi_head_attention(embed_size, num_heads, inputs, mask=None):
    """Multi-head attention mechanism.

    Args:
        embed_size (int): Dimensionality of the input embeddings.
        num_heads (int): Number of attention heads.
        inputs (np.array): Input array of shape (batch_size, seq_len, embed_size).
        mask (np.array, optional): Float array with shape broadcastable to (batch_size, num_heads, seq_len, seq_len). Defaults to None.

    Returns:
        (np.array, np.array)
        output: Multi-head attention output of shape (batch_size, seq_len, embed_size)
        weights: Attention weights of shape (batch_size, num_heads, seq_len, seq_len)
    """
    batch_size, seq_len, _ = inputs.shape
    assert embed_size % num_heads == 0, "embed_size must be divisible by num_heads"
    head_dim = embed_size // num_heads

    # Random, untrained projection weights (drawn fresh on every call)
    Wq = np.random.randn(embed_size, embed_size)
    Wk = np.random.randn(embed_size, embed_size)
    Wv = np.random.randn(embed_size, embed_size)
    Wo = np.random.randn(embed_size, embed_size)

    # Linear projections
    Q = np.matmul(inputs, Wq)
    K = np.matmul(inputs, Wk)
    V = np.matmul(inputs, Wv)

    # Reshape and transpose to (batch_size, num_heads, seq_len, head_dim)
    Q = Q.reshape(batch_size, seq_len, num_heads, head_dim).swapaxes(1, 2)
    K = K.reshape(batch_size, seq_len, num_heads, head_dim).swapaxes(1, 2)
    V = V.reshape(batch_size, seq_len, num_heads, head_dim).swapaxes(1, 2)

    # Scaled dot-product attention
    attention_output, weights = scaled_dot_product_attention(Q, K, V, mask)

    # Concatenate heads and project
    attention_output = attention_output.swapaxes(1, 2).reshape(batch_size, seq_len, embed_size)

    output = np.matmul(attention_output, Wo)

    return output, weights

if __name__ == "__main__":
    batch_size = 10
    seq_len = 20
    embed_size = 128
    num_heads = 8
    inputs = np.random.randn(batch_size, seq_len, embed_size)
    output, weights = multi_head_attention(embed_size, num_heads, inputs)

    print(output.shape, weights.shape)
    print(output[0][0][:10], weights[0][0][0][:10])
```

Output:

```
(10, 20, 128) (10, 8, 20, 20)
[-91.96555916 -19.40983534 -32.99740866 113.35786088 138.22610441
  81.21040905 -30.81003178  90.7098463  162.38724319 -40.72173619] [1.94810489e-189 3.21476597e-151 3.61314239e-103 4.96644350e-219
 3.90604112e-173 3.46437823e-131 4.72245009e-077 2.66307289e-194
 1.00000000e+000 5.17103825e-098]
```
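
The printed attention weights are almost one-hot: one entry is 1.0 and the rest are below 1e-76. This is consistent with the standard-normal `Wq` and `Wk`: each projected entry then has variance of roughly `embed_size`, so the scores remain large even after dividing by `sqrt(head_dim)`, and softmax saturates. The sketch below is a rough check for a single head, under the assumption (not in the code above) that the projection weights are scaled by `1 / sqrt(embed_size)`; the helper `mean_peak_weight` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_size, head_dim = 20, 128, 16


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


def mean_peak_weight(weight_scale):
    """Average of the largest attention weight per query, for one head."""
    x = rng.standard_normal((seq_len, embed_size))
    # Only one head's slice of each projection matrix is needed for this check.
    Wq = rng.standard_normal((embed_size, head_dim)) * weight_scale
    Wk = rng.standard_normal((embed_size, head_dim)) * weight_scale
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(head_dim)
    return softmax(scores).max(axis=-1).mean()


unscaled = mean_peak_weight(1.0)                      # same init scale as above
scaled = mean_peak_weight(1.0 / np.sqrt(embed_size))  # assumed alternative
print(f"unscaled init: mean peak weight = {unscaled:.3f}")
print(f"scaled init:   mean peak weight = {scaled:.3f}")
```

With the unscaled initialization the peak weight should sit close to 1 (each query attends almost entirely to one key), while the scaled variant should spread attention across several keys.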


#### c. Subtask 3

```python
# Subtask 3
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention implemented in PyTorch."""
    def __init__(self, embed_size, num_heads, dropout=0.1):
        super().__init__()
        assert embed_size % num_heads == 0, "embed_size must be divisible by num_heads"

        self.embed_size = embed_size
        self.num_heads = num_heads
        self.head_dim = embed_size // num_heads

        # linear layers for Q, K, V and output
        self.Wq = nn.Linear(embed_size, embed_size)
        self.Wk = nn.Linear(embed_size, embed_size)
        self.Wv = nn.Linear(embed_size, embed_size)
        self.Wo = nn.Linear(embed_size, embed_size)

        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            # positions where mask == 0 are blocked from receiving attention
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]

        # linear projections
        Q = self.Wq(query)
        K = self.Wk(key)
        V = self.Wv(value)

        # reshape and transpose to (batch_size, num_heads, seq_len, head_dim)
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # scaled dot-product attention
        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # concatenate heads and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_size)

        output = self.Wo(attn_output)

        return output, attn_weights

# Test code
if __name__ == "__main__":
    # Build the test input (same shape as in Subtask 2)
    batch_size = 10
    seq_len = 20
    embed_size = 128
    num_heads = 8

    input_tensor = torch.randn(batch_size, seq_len, embed_size)
    model = MultiHeadAttention(embed_size, num_heads)

    # Run self-attention (query = key = value)
    output, attn_weights = model(input_tensor, input_tensor, input_tensor)

    print(output.shape, attn_weights.shape)
    print(output[0][0][:10])
    print(attn_weights[0][0][0][:10])
```

Output:

```
torch.Size([10, 20, 128]) torch.Size([10, 8, 20, 20])
tensor([ 0.1218,  0.0743,  0.0391,  0.0172, -0.0705, -0.1443, -0.0310, -0.2916,
        -0.1042, -0.0591], grad_fn=<SliceBackward0>)
```

No seed is set, so the values differ between runs; only the output slice from the recorded run is shown.
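
The two implementations use opposite mask conventions: the NumPy version adds `mask * -1e9`, so a nonzero entry blocks a position, while the PyTorch version fills where `mask == 0`, so a zero entry blocks a position. The NumPy-only sketch below uses a causal mask as a hypothetical example and suggests that the two conventions give the same attention weights when one mask is the complement of the other.

```python
import numpy as np


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))

# "Keep" mask (PyTorch cell convention): 1 = may attend, 0 = blocked.
keep = np.tril(np.ones((seq_len, seq_len)))
# "Block" mask (NumPy cell convention): 1 = blocked, 0 = may attend.
block = 1.0 - keep

weights_fill = softmax(np.where(keep == 0, -1e9, scores))  # mirrors masked_fill(mask == 0, -1e9)
weights_add = softmax(scores + block * -1e9)               # mirrors scores += mask * -1e9

print(np.allclose(weights_fill, weights_add))
print(np.round(weights_add, 3))
```

Each row of the printed matrix should have zeros above the diagonal, since a causal mask lets position `i` attend only to positions `0..i`.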


#### d. Subtask 4 ViT
##### 1. Serialization: 2D image -> 1D sequence of vectors for transformer input
1. Image Patching: split the original image into fixed-size, non-overlapping patches.
2. Flattening and Linear Projection: each patch is flattened into a 1D vector and projected to the model's embedding dimension by a shared linear layer.
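
The two steps above can be sketched in NumPy as follows. The image size, patch size, embedding size, and the random matrix `W_proj` (a stand-in for a learned linear layer) are illustrative assumptions, and `patchify` is a hypothetical helper rather than a library function.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    gh, gw = H // patch_size, W // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * C)


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # hypothetical input image
patch_size, embed_size = 8, 64             # illustrative values

patches = patchify(image, patch_size)
# Stand-in for a learned linear layer shared by all patches.
W_proj = rng.standard_normal((patches.shape[1], embed_size)) * 0.02
tokens = patches @ W_proj
print(patches.shape, tokens.shape)  # (16, 192) (16, 64)
```

A 32x32 image with 8x8 patches gives a 4x4 grid, so the transformer would see a sequence of 16 tokens.
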
##### 2. Positional Encoding
1. Learnable Positional Embeddings: before feeding the patch embeddings into the transformer, add positional embeddings in the form of learnable vectors. These vectors are randomly initialized and learned during training.
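
As a shape-level sketch of this step, assuming illustrative sizes and random stand-ins for both the patch embeddings and the positional table (no training happens here):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_patches, embed_size = 2, 16, 64   # illustrative values

patch_embeddings = rng.standard_normal((batch_size, num_patches, embed_size))

# One vector per patch position, randomly initialized. During training this
# array would be a parameter updated by the optimizer.
pos_embedding = rng.standard_normal((1, num_patches, embed_size)) * 0.02

# Broadcasting adds the same positional table to every image in the batch.
tokens = patch_embeddings + pos_embedding
print(tokens.shape)  # (2, 16, 64)
```

Because the table has a leading dimension of 1, every image receives the same offset at a given patch position, which is what lets the model tell positions apart.
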
##### 3. Advantages over Traditional CNNs
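1. Global Context: with self-attention, every patch can attend to every other patch from the first layer onward, so long-range dependencies are captured directly, whereas a CNN's local receptive fields only cover the whole image after many layers.
2. Flexibility: self-attention works on sequences of any length, so the same model can process a different number of patches (for example, a different input resolution), although learned positional embeddings typically have to be interpolated to match the new patch grid.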