# Self-Attention and Related Techniques Explained

## Table of Contents
1. [The Self-Attention Mechanism](#1-the-self-attention-mechanism)
2. [Scaled Dot-Product Attention](#2-scaled-dot-product-attention)
3. [Multi-Head Self-Attention](#3-multi-head-self-attention)
4. [Layer Normalization](#4-layer-normalization)
5. [Exercises](#5-exercises)

## 1. The Self-Attention Mechanism

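### The Core Idea of Self-Attention

**Self-attention** is a special form of the attention mechanism in which the Query, Key, and Value all come from the same input sequence. It lets every position in the sequence attend to all positions in the sequence, including itself.

### Why Self-Attention?

Traditional RNNs and CNNs have the following limitations when processing sequences:
- **RNN**: processes tokens sequentially, so computation cannot be parallelized across time steps, and long-range dependencies are hard to model
- **CNN**: the receptive field is limited, so several layers are needed to capture long-range dependencies

Advantages of self-attention:
- **Parallel computation**: all positions can be computed at the same time
- **Long-range dependencies**: within a single layer, any two positions are connected by a path of length 1
- **Dynamic weights**: attention is allocated dynamically based on content
- **Interpretability**: the attention weights can be inspected to see which positions each output draws on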

### Mathematical Formulation of Self-Attention

Given an input sequence $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{n \times d}$:

**Step 1: Generate the Q, K, and V matrices**
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where:
- $W^Q \in \mathbb{R}^{d \times d_k}$: query weight matrix
- $W^K \in \mathbb{R}^{d \times d_k}$: key weight matrix
- $W^V \in \mathbb{R}^{d \times d_v}$: value weight matrix

**Step 2: Compute the attention output**
$$\text{SelfAttention}(X) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



### Basic Self-Attention Implementation

In [3]:
class BasicSelfAttention(nn.Module):
    """Basic single-head self-attention."""
    
    def __init__(self, d_model, d_k=None, d_v=None):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_k if d_k is not None else d_model
        self.d_v = d_v if d_v is not None else d_model
        
        # Linear projections for Q, K, and V
        self.W_q = nn.Linear(d_model, self.d_k, bias=False)
        self.W_k = nn.Linear(d_model, self.d_k, bias=False)
        self.W_v = nn.Linear(d_model, self.d_v, bias=False)
        
        # Output projection back to d_model
        self.W_o = nn.Linear(self.d_v, d_model)
        
    def forward(self, x, mask=None):
        """
        Args:
            x: [batch_size, seq_len, d_model] input sequence
            mask: [batch_size, seq_len, seq_len] attention mask (0 marks blocked positions)
        Returns:
            output: [batch_size, seq_len, d_model] output sequence
            attention_weights: [batch_size, seq_len, seq_len] attention weights
        """
        # Generate Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_k]
        K = self.W_k(x)  # [batch_size, seq_len, d_k]
        V = self.W_v(x)  # [batch_size, seq_len, d_v]
        
        # Compute the scaled attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores: [batch_size, seq_len, seq_len]
        
        # Apply the mask (if provided): blocked positions get a large negative score
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
            
        # Normalize the scores into attention weights (each row sums to 1)
        attention_weights = F.softmax(scores, dim=-1)
        
        # Take the weighted sum of the values
        context = torch.matmul(attention_weights, V)
        # context: [batch_size, seq_len, d_v]
        
        # Output projection
        output = self.W_o(context)
        
        return output, attention_weights

# Test the basic self-attention
d_model = 256
seq_len = 10
batch_size = 2

self_attn = BasicSelfAttention(d_model)
x = torch.randn(batch_size, seq_len, d_model)
output, attn_weights = self_attn(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"Attention weights sum: {attn_weights.sum(dim=-1)[0, 0]:.4f}")

Input shape: torch.Size([2, 10, 256])
Output shape: torch.Size([2, 10, 256])
Attention weights shape: torch.Size([2, 10, 10])
Attention weights sum: 1.0000
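
The `mask` argument of the class above is not exercised by the test cell. As a minimal sketch, assuming the common causal (lower-triangular) convention in which 1 means "may attend" and 0 means "blocked", the standalone function below shows what the mask does to the weights. The name `causal_self_attention` is hypothetical, and the learned projections are skipped (Q = K = V = x) to keep the example short.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Self-attention with Q = K = V = x and a causal mask (illustrative only)."""
    seq_len, d = x.size(-2), x.size(-1)
    scores = torch.matmul(x, x.transpose(-2, -1)) / math.sqrt(d)
    # Lower-triangular mask: position i may attend only to positions j <= i
    mask = torch.tril(torch.ones(seq_len, seq_len))
    scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, x), weights

torch.manual_seed(0)
x = torch.randn(1, 4, 8)          # [batch_size, seq_len, d_model]
out, weights = causal_self_attention(x)
print(weights[0])                 # entries above the diagonal are zero
```

Each row still sums to 1, but the weight mass is confined to the current and earlier positions, so the first position can only attend to itself.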


## 2. Scaled Dot-Product Attention

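**Scaled Dot-Product Attention** is the standard attention mechanism used in the Transformer. It is efficient in practice because it involves only matrix multiplications and a softmax, so it can be implemented with highly optimized matrix-multiplication code.

### Mathematical Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:
- $Q \in \mathbb{R}^{n \times d_k}$: query matrix
- $K \in \mathbb{R}^{m \times d_k}$: key matrix
- $V \in \mathbb{R}^{m \times d_v}$: value matrix
- $\sqrt{d_k}$: scaling factor

### Why Is the Scaling Factor Needed?

**Problem**: when $d_k$ is large, the entries of $QK^T$ become large in magnitude, which pushes the softmax into its saturated region where gradients are very small.

**Solution**: multiply the scores by the scaling factor $\frac{1}{\sqrt{d_k}}$.

**Mathematical explanation**:
- Assume the components of a query vector $q$ and a key vector $k$ are independent random variables with mean 0 and variance 1
- Then each entry of $QK^T$, a dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$
- After scaling, the variance becomes 1, which keeps the scores in a suitable numerical range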
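
The variance argument above can be checked empirically. The sketch below assumes standard normal entries (one distribution that satisfies the mean-0, variance-1 assumption) and estimates the variance of the dot product before and after scaling; the sample size and seed are arbitrary choices.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
n_samples = 100_000

# q and k have i.i.d. entries with mean 0 and variance 1
q = torch.randn(n_samples, d_k)
k = torch.randn(n_samples, d_k)

dots = (q * k).sum(dim=-1)         # unscaled dot products, one per sample
scaled = dots / math.sqrt(d_k)     # scaled dot products

print(f"unscaled variance: {dots.var().item():.2f} (theory: {d_k})")
print(f"scaled variance:   {scaled.var().item():.2f} (theory: 1)")
```

The estimates should land close to $d_k$ and 1 respectively, up to sampling noise.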

In [4]:
class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention."""
    
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, Q, K, V, mask=None, temperature=1.0):
        """
        Args:
            Q: [batch_size, n_heads, seq_len_q, d_k] query matrix
            K: [batch_size, n_heads, seq_len_k, d_k] key matrix
            V: [batch_size, n_heads, seq_len_k, d_v] value matrix (same length as K)
            mask: [batch_size, n_heads, seq_len_q, seq_len_k] mask (0 marks blocked positions)
            temperature: controls how sharp the attention distribution is
                (values below 1 sharpen it, values above 1 flatten it)
        Returns:
            output: [batch_size, n_heads, seq_len_q, d_v] output
            attention: [batch_size, n_heads, seq_len_q, seq_len_k] attention weights (after dropout)
        """
        d_k = Q.size(-1)
        
        # Compute the scaled attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (math.sqrt(d_k) * temperature)
        
        # Apply the mask: blocked positions get a large negative score
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
            
        # Normalize into attention weights, then apply dropout (active only in training mode)
        attention = F.softmax(scores, dim=-1)
        attention = self.dropout(attention)
        
        # Take the weighted sum of the values
        output = torch.matmul(attention, V)
        
        return output, attention

# Test Scaled Dot-Product Attention
batch_size = 2
n_heads = 8
seq_len = 10
d_k = 64
d_v = 64

scaled_attn = ScaledDotProductAttention()

Q = torch.randn(batch_size, n_heads, seq_len, d_k)
K = torch.randn(batch_size, n_heads, seq_len, d_k)
V = torch.randn(batch_size, n_heads, seq_len, d_v)

output, attention = scaled_attn(Q, K, V)

print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"V shape: {V.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention.shape}")

Q shape: torch.Size([2, 8, 10, 64])
K shape: torch.Size([2, 8, 10, 64])
V shape: torch.Size([2, 8, 10, 64])
Output shape: torch.Size([2, 8, 10, 64])
Attention weights shape: torch.Size([2, 8, 10, 10])
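
As a cross-check, the formula can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. This is a sketch under two assumptions: the built-in exists only in PyTorch 2.0 and later (hence the `hasattr` guard), and the comparison uses no mask and no dropout. The helper `manual_attention` is a hypothetical name for a stripped-down version of the formula.

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, without masking or dropout."""
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    return torch.matmul(F.softmax(scores, dim=-1), V)

torch.manual_seed(0)
Q = torch.randn(2, 8, 10, 64)
K = torch.randn(2, 8, 10, 64)
V = torch.randn(2, 8, 10, 64)

expected = manual_attention(Q, K, V)

# The fused function is only available in PyTorch >= 2.0
if hasattr(F, "scaled_dot_product_attention"):
    fused = F.scaled_dot_product_attention(Q, K, V)
    matches = torch.allclose(expected, fused, atol=1e-4)
else:
    matches = None  # cannot compare on older PyTorch versions

print(f"matches built-in: {matches}")
```

The built-in returns only the output tensor, not the attention weights, so an explicit implementation is still useful when the weights need to be inspected or visualized.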


## 3. Multi-Head Self-Attention

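### Motivation for Multi-Head Attention

A single attention head may capture only one type of dependency. Multi-Head Attention uses multiple "heads" to attend to information from different representation subspaces in parallel, and then combines their outputs.


### Mathematical Formulation of Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed as:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Parameter matrices:
- $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$
- $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$
- $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$
- $W^O \in \mathbb{R}^{hd_v \times d_{model}}$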

In [5]:
class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention."""
    
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.d_v = d_model // n_heads
        
        # Linear projections (each one holds the weights of all heads)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model)
        
        # Scaled Dot-Product Attention
        self.attention = ScaledDotProductAttention(dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        """
        Args:
            x: [batch_size, seq_len, d_model] input sequence
            mask: [batch_size, seq_len, seq_len] attention mask (0 marks blocked positions)
        Returns:
            output: [batch_size, seq_len, d_model] output sequence
            attention_weights: [batch_size, n_heads, seq_len, seq_len] attention weights
        """
        batch_size, seq_len, d_model = x.size()
        
        # 1. Linear projections to obtain Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_model]
        K = self.W_k(x)  # [batch_size, seq_len, d_model]
        V = self.W_v(x)  # [batch_size, seq_len, d_model]
        
        # 2. Reshape into the multi-head layout
        Q = Q.view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.d_v).transpose(1, 2)
        # Shapes are now: [batch_size, n_heads, seq_len, d_k/d_v]
        
        # 3. Add a head dimension to the mask; it broadcasts across all heads
        if mask is not None:
            mask = mask.unsqueeze(1)
            # [batch_size, 1, seq_len, seq_len]
            
        # 4. Apply Scaled Dot-Product Attention
        attn_output, attention_weights = self.attention(Q, K, V, mask)
        
        # 5. Concatenate the head outputs
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )
        
        # 6. Final linear projection
        output = self.W_o(attn_output)
        output = self.dropout(output)
        
        return output, attention_weights

# Test Multi-Head Self-Attention
d_model = 512
n_heads = 8
seq_len = 20
batch_size = 2

mhsa = MultiHeadSelfAttention(d_model, n_heads)
x = torch.randn(batch_size, seq_len, d_model)

output, attention_weights = mhsa(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"Number of parameters: {sum(p.numel() for p in mhsa.parameters())}")

Input shape: torch.Size([2, 20, 512])
Output shape: torch.Size([2, 20, 512])
Attention weights shape: torch.Size([2, 8, 20, 20])
Number of parameters: 1049088
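
The formulas use one matrix $W_i^Q$ per head, while the class uses a single `nn.Linear(d_model, d_model)` followed by a reshape. The sketch below, with small illustrative sizes chosen only for this example, suggests the two routes give the same per-head queries: head $i$ corresponds to a contiguous block of rows of the fused weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads = 8, 2
d_k = d_model // n_heads

W_q = nn.Linear(d_model, d_model, bias=False)   # fused projection for all heads
x = torch.randn(3, 5, d_model)                  # [batch_size, seq_len, d_model]

# Route 1: project once, then split the last dimension into heads
Q_fused = W_q(x).view(3, 5, n_heads, d_k).transpose(1, 2)   # [3, n_heads, 5, d_k]

# Route 2: project with each head's own matrix W_i^Q.
# nn.Linear stores its weight as [out_features, in_features],
# so head i owns rows i*d_k : (i+1)*d_k of the fused weight.
Q_per_head = torch.stack(
    [x @ W_q.weight[i * d_k:(i + 1) * d_k].T for i in range(n_heads)],
    dim=1,
)

same = torch.allclose(Q_fused, Q_per_head, atol=1e-5)
print(f"fused and per-head projections agree: {same}")
```

The same reasoning accounts for the parameter count printed above: four $512 \times 512$ weight matrices plus the 512 bias terms of `W_o` give $4 \cdot 512^2 + 512 = 1{,}049{,}088$.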


## 4. Layer Normalization

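### Introduction to Layer Normalization

**Layer Normalization** is a normalization technique that plays a key role in the Transformer architecture. It normalizes each sample across its feature dimension, rather than across the batch dimension as Batch Normalization does.

### Why Is Layer Normalization Needed?

1. **Stabilizes training**: helps mitigate exploding and vanishing gradients
2. **Speeds up convergence**: makes the optimization process more stable
3. **Reduces internal covariate shift**: limits changes in the activation distributions between layers
4. **Suits sequence models**: does not depend on the batch size, so it works well with variable-length sequences

### Mathematical Formula

Given an input $x \in \mathbb{R}^{d}$, Layer Normalization computes:

$$\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$$

$$\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$$

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:
- $\mu$: mean
- $\sigma^2$: variance
- $\gamma$: learnable scale parameter
- $\beta$: learnable shift parameter
- $\epsilon$: numerical stability constant (commonly $10^{-6}$ or $10^{-5}$; PyTorch's `nn.LayerNorm` defaults to $10^{-5}$)
- $\odot$: element-wise multiplication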

In [6]:
class LayerNormalization(nn.Module):
    """Layer Normalization over the last dimension."""
    
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        
        self.d_model = d_model
        self.eps = eps
        
        # Learnable parameters
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        
    def forward(self, x):
        """
        Args:
            x: [batch_size, seq_len, d_model] or [..., d_model]
        Returns:
            normalized_x: normalized output with the same shape as the input
        """
        # Mean and (biased, 1/d) variance over the last dimension
        mean = x.mean(dim=-1, keepdim=True)  # [..., 1]
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # [..., 1]
        
        # Normalize
        normalized = (x - mean) / torch.sqrt(var + self.eps)
        
        # Scale and shift
        output = self.gamma * normalized + self.beta
        
        return output

# Test Layer Normalization
d_model = 512
batch_size = 2
seq_len = 10

layer_norm = LayerNormalization(d_model)
x = torch.randn(batch_size, seq_len, d_model) * 10 + 5  # apply a scale and an offset

print("Before normalization:")
print(f"Mean: {x.mean(dim=-1)[0, 0]:.4f}")
print(f"Std: {x.std(dim=-1)[0, 0]:.4f}")

normalized_x = layer_norm(x)

# Note: x.std() uses the unbiased (1/(d-1)) estimator, so the value printed
# below is sqrt(d / (d - 1)), about 1.0010 for d = 512, rather than exactly 1
print("\nAfter normalization:")
print(f"Mean: {normalized_x.mean(dim=-1)[0, 0]:.4f}")
print(f"Std: {normalized_x.std(dim=-1)[0, 0]:.4f}")
print(f"Shape: {normalized_x.shape}")

Before normalization:
Mean: 4.8818
Std: 9.5532

After normalization:
Mean: -0.0000
Std: 1.0010
Shape: torch.Size([2, 10, 512])
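
Two details of the result above can be checked in isolation: the hand-written formula should agree with PyTorch's `F.layer_norm`, and the reported standard deviation of 1.0010 comes from `torch.std` defaulting to the unbiased estimator. The sketch below assumes $\gamma = 1$ and $\beta = 0$ (the initial values), and `manual_layer_norm` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def manual_layer_norm(x, eps=1e-6):
    """Layer norm over the last dimension with gamma = 1 and beta = 0."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased (1/d) variance
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(0)
d_model = 512
x = torch.randn(2, 10, d_model) * 10 + 5

ours = manual_layer_norm(x)
reference = F.layer_norm(x, (d_model,), eps=1e-6)
agrees = torch.allclose(ours, reference, atol=1e-4)
print(f"agrees with F.layer_norm: {agrees}")

# The biased std is 1 by construction; the unbiased one is sqrt(d / (d - 1))
biased_std = ours[0, 0].std(unbiased=False).item()
unbiased_std = ours[0, 0].std().item()
print(f"biased std: {biased_std:.4f}, unbiased std: {unbiased_std:.4f}")
```

For $d = 512$, $\sqrt{512/511} \approx 1.0010$, which matches the value printed by the test cell.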


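## 5. Exercises

### Theory Exercises

1. **Self-Attention complexity analysis**:
   - Compute the time and space complexity of Self-Attention
   - Analyze how the sequence length affects the computational cost
   - Compare the complexity of Self-Attention with that of an RNN

2. **Understanding Multi-Head Attention**:
   - Explain why Multi-Head attention is more effective than a single head
   - Analyze how the number of heads affects model performance
   - Discuss strategies for allocating parameters across heads

3. **Layer Normalization analysis**:
   - Compare the pros and cons of Pre-LN and Post-LN
   - Explain the role of Layer Norm in training stability
   - Analyze how normalization affects gradient flow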
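
Exercise 3 refers to Pre-LN and Post-LN without defining them. As a starting point, the sketch below shows only where the normalization sits relative to the residual connection in each arrangement. The class names are hypothetical, and a plain `nn.Linear` stands in for the sublayer, which in a Transformer would be attention or a feed-forward network.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: normalize after the residual addition, LayerNorm(x + sublayer(x))."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN: normalize the sublayer input, x + sublayer(LayerNorm(x))."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

torch.manual_seed(0)
d_model = 16
x = torch.randn(2, 5, d_model)

# nn.Linear is only a placeholder for an attention or feed-forward sublayer
post = PostLNBlock(d_model, nn.Linear(d_model, d_model))
pre = PreLNBlock(d_model, nn.Linear(d_model, d_model))

post_out, pre_out = post(x), pre(x)
print(post_out.shape, pre_out.shape)
```

In the Pre-LN block the input `x` reaches the output through an unnormalized identity path, whereas in the Post-LN block every path passes through the normalization; that difference is a useful lens for the gradient-flow question.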
