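After hand-building the Transformer architecture, I am now hand-building GPT-2; having my own implementation of the architecture will make probing easier later.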

To count the parameters of a Transformer (taking GPT-2 as the example), compute the parameter count of each module separately and then sum the results.

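Assume the following definitions:
- $n_{\text{heads}}$: number of attention heads
- $n_{\text{layers}}$: number of Transformer decoder layers
- $d_{\text{model}}$: embedding dimension of each token (model dimension)
- $d_{\text{ffn}}$: hidden dimension of the feed-forward network
- $d_k$: dimension of each attention head's key and query vectors
- $d_v$: dimension of each attention head's value vectors
- $V$: vocabulary size
- $L$: maximum sequence length (number of positions)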

The per-module formulas and the formula for the total follow:

---

### **1. Input Embeddings and Positional Embeddings**
- **Token embedding matrix:**
  $
  \text{Params}_{\text{embeddings}} = V \times d_{\text{model}}
  $
  (vocabulary size $V$ × model dimension $d_{\text{model}}$)

- **Positional embedding matrix:**
  $
  \text{Params}_{\text{positional embeddings}} = L \times d_{\text{model}}
  $
  (maximum sequence length $L$ × model dimension $d_{\text{model}}$)

---

### **2. Attention (Multi-Head Self-Attention, MHA)**
For a single MHA layer (bias terms omitted):
1. **Key, query, and value projections:**
   - Per-head weight shapes for key, query, and value: $d_{\text{model}} \times d_k$, $d_{\text{model}} \times d_k$, $d_{\text{model}} \times d_v$
   - Total weights across all heads (taking $d_v = d_k$):
     $
     \text{Params}_{\text{QKV}} = 3 \times d_{\text{model}} \times (d_k \times n_{\text{heads}})
     $

2. **Output linear layer of the attention block:**
   - Projection applied to the concatenated heads:
     $
     \text{Params}_{\text{attention output}} = (d_k \times n_{\text{heads}}) \times d_{\text{model}}
     $
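The count for one attention layer can be sketched in Python by summing the per-head projection shapes listed above. `mha_params` is a hypothetical helper, not part of any library, and the sizes passed in are assumed GPT-2-small-like values. Bias terms are left out to match the formulas. Unlike the total-weight formula, the sketch keeps $d_v$ separate from $d_k$, so it also covers the case where the two differ.

```python
def mha_params(d_model: int, n_heads: int, d_k: int, d_v: int) -> dict:
    """Weight counts for one multi-head self-attention layer (no biases)."""
    # Each head owns a query and a key matrix (d_model x d_k)
    # and a value matrix (d_model x d_v).
    per_head_qkv = 2 * d_model * d_k + d_model * d_v
    qkv = n_heads * per_head_qkv
    # The concatenated head outputs (n_heads * d_v) are projected back to d_model.
    out = (n_heads * d_v) * d_model
    return {"qkv": qkv, "out": out, "total": qkv + out}


# Assumed sizes: d_model = 768, 12 heads, d_k = d_v = 64.
counts = mha_params(d_model=768, n_heads=12, d_k=64, d_v=64)
print(counts)
```

When $d_k = d_v = d_{\text{model}} / n_{\text{heads}}$, the QKV count reduces to $3 \times d_{\text{model}}^2$ and the output projection to $d_{\text{model}}^2$.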

---

### **3. Feed-Forward Network (FFN)**
The feed-forward network consists of two fully connected layers:
1. Weights of the first layer:
   $
   \text{Params}_{\text{FFN1}} = d_{\text{model}} \times d_{\text{ffn}}
   $

2. Weights of the second layer:
   $
   \text{Params}_{\text{FFN2}} = d_{\text{ffn}} \times d_{\text{model}}
   $

3. Biases of the two layers (optional, and small compared with the weights):
   $
   \text{Params}_{\text{FFN bias}} = d_{\text{ffn}} + d_{\text{model}}
   $

---

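### **4. Layer Normalization**
Each LayerNorm has two learnable vectors of size $d_{\text{model}}$ (a weight and a bias):
$
\text{Params}_{\text{LayerNorm}} = 2 \times d_{\text{model}}
$
A GPT-2 block actually contains two LayerNorms, and there is one more after the last block. Counting one per layer, as here, is a simplification that shifts the total by only about 20K parameters at GPT-2 small sizes.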

---

### **5. Output Layer**
The output layer is usually a linear layer that projects the model dimension $d_{\text{model}}$ to the vocabulary size $V$:
$
\text{Params}_{\text{output}} = d_{\text{model}} \times V
$

---

### **Total Parameter Count**
For a Transformer with $n_{\text{layers}}$ layers, the total parameter count is:

$
\text{Total Parameters} =
\text{Params}_{\text{embeddings}} + \text{Params}_{\text{positional embeddings}} + n_{\text{layers}} \times \left( \text{Params}_{\text{MHA}} + \text{Params}_{\text{FFN}} + \text{Params}_{\text{LayerNorm}} \right) + \text{Params}_{\text{output}}
$

where:
- **MHA parameters:**
  $
  \text{Params}_{\text{MHA}} = \text{Params}_{\text{QKV}} + \text{Params}_{\text{attention output}}
  = 3 \times d_{\text{model}} \times (d_k \times n_{\text{heads}}) + (d_k \times n_{\text{heads}}) \times d_{\text{model}}
  $

- **FFN parameters:**
  $
  \text{Params}_{\text{FFN}} = \text{Params}_{\text{FFN1}} + \text{Params}_{\text{FFN2}} + \text{Params}_{\text{FFN bias}}
  = d_{\text{model}} \times d_{\text{ffn}} + d_{\text{ffn}} \times d_{\text{model}} + d_{\text{ffn}} + d_{\text{model}}
  $

- **LayerNorm parameters:**
  $
  \text{Params}_{\text{LayerNorm}} = 2 \times d_{\text{model}}
  $
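As a minimal sketch, the total formula translates directly into a small Python function. `transformer_params` is a hypothetical name, and the toy sizes in the usage line are made up so the result is easy to check by hand. The function follows the formulas in this note exactly, so it counts one LayerNorm per layer, no attention biases, and a separate output projection.

```python
def transformer_params(V, L, d_model, d_ffn, d_k, n_heads, n_layers):
    """Total parameter count following the per-module formulas in this note."""
    embeddings = V * d_model                      # token embedding matrix
    positional = L * d_model                      # positional embedding matrix
    qkv = 3 * d_model * (d_k * n_heads)           # Q, K, V projections (d_v = d_k)
    attn_out = (d_k * n_heads) * d_model          # attention output projection
    mha = qkv + attn_out
    ffn = d_model * d_ffn + d_ffn * d_model + d_ffn + d_model  # weights + biases
    layer_norm = 2 * d_model                      # weight and bias
    output = d_model * V                          # untied output projection
    return embeddings + positional + n_layers * (mha + ffn + layer_norm) + output


# Toy sizes (assumed for illustration only).
toy_total = transformer_params(V=100, L=16, d_model=8, d_ffn=32,
                               d_k=4, n_heads=2, n_layers=2)
print(toy_total)  # 3376
```

For the toy sizes the pieces are 800 + 128 + 2 × (256 + 552 + 16) + 800 = 3376.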

---

### **Worked Example**
Assume:
- $n_{\text{heads}} = 12$
- $n_{\text{layers}} = 12$
- $d_{\text{model}} = 768$
- $d_{\text{ffn}} = 3072$
- $d_k = d_v = d_{\text{model}} / n_{\text{heads}} = 64$
- $V = 50,000$
- $L = 512$

Calculation:
1. **Embedding layers:**
   $
   \text{Params}_{\text{embeddings}} = 50,000 \times 768 = 38.4M
   $
   $
   \text{Params}_{\text{positional embeddings}} = 512 \times 768 = 0.39M
   $

2. **MHA (single layer):**
   $
   \text{Params}_{\text{QKV}} = 3 \times 768 \times 64 \times 12 = 1.77M
   $
   $
   \text{Params}_{\text{attention output}} = 768 \times 768 = 0.59M
   $
   Total:
   $
   \text{Params}_{\text{MHA}} = 1.77M + 0.59M = 2.36M
   $

3. **FFN (single layer):**
   $
   \text{Params}_{\text{FFN}} = 768 \times 3072 + 3072 \times 768 + 3072 + 768 = 4.72M
   $

4. **LayerNorm (single layer):**
   $
   \text{Params}_{\text{LayerNorm}} = 2 \times 768 = 1.5K
   $

5. **Parameters per layer:**
   $
   \text{Params}_{\text{per layer}} = \text{Params}_{\text{MHA}} + \text{Params}_{\text{FFN}} + \text{Params}_{\text{LayerNorm}} \approx 7.08M
   $

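6. **Total parameters:**
   $
   \text{Total Parameters} = 38.4M + 0.39M + 12 \times 7.08M + 50,000 \times 768
   \approx 162.2M
   $
   GPT-2 ties the output projection to the token embedding matrix, so the last term adds no new parameters. Dropping it gives:
   $
   \text{Total Parameters} = 38.4M + 0.39M + 12 \times 7.08M \approx 123.8M \approx 124M
   $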
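The arithmetic of this example can be checked in a few lines of Python. The sketch below uses the sizes assumed in the example, with variable names of my own choosing, and reports the total both with a separate output projection and with the output projection sharing the token embedding matrix.

```python
# Sizes from the worked example.
V, L = 50_000, 512
d_model, d_ffn = 768, 3072
n_heads, n_layers = 12, 12
d_k = d_model // n_heads  # 64

embeddings = V * d_model
positional = L * d_model
mha = 3 * d_model * (d_k * n_heads) + (d_k * n_heads) * d_model
ffn = 2 * d_model * d_ffn + d_ffn + d_model
layer_norm = 2 * d_model
per_layer = mha + ffn + layer_norm
output = d_model * V

untied_total = embeddings + positional + n_layers * per_layer + output
tied_total = untied_total - output  # output projection reuses the embedding matrix

print(f"per layer: {per_layer:,}")     # 7,083,264
print(f"untied:    {untied_total:,}")  # 162,192,384
print(f"tied:      {tied_total:,}")    # 123,792,384
```

Only the tied total lands near the 124M usually quoted for the smallest GPT-2.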

In [2]:
from transformers import GPT2LMHeadModel

In [4]:
# The Hub id is "gpt2"; "gpt-2" is not a valid id and raised an OSError here.
model_hf = GPT2LMHeadModel.from_pretrained("gpt2")  # 124M

sd_hf = model_hf.state_dict()

# Print the name and shape of every tensor in the checkpoint.
for k, v in sd_hf.items():
    print(k, v.shape)
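Once the checkpoint loads, the printed shapes can be summed to get the real parameter count. The sketch below avoids the download by hard-coding the shapes of the 124M checkpoint as I understand them: vocabulary 50257, 1024 positions, two LayerNorms per block plus a final one, and a bias on every linear layer. `gpt2_shapes` is a hypothetical helper. `lm_head.weight` is left out because it is tied to `transformer.wte.weight`, and the attention-mask buffers that some library versions also store in the state dict are not parameters, so they are excluded too.

```python
from math import prod


def gpt2_shapes(n_layer=12, n_embd=768, vocab=50257, n_pos=1024):
    """Parameter shapes of a GPT-2-style model, keyed like the Hugging Face state dict."""
    shapes = {
        "transformer.wte.weight": (vocab, n_embd),
        "transformer.wpe.weight": (n_pos, n_embd),
        "transformer.ln_f.weight": (n_embd,),
        "transformer.ln_f.bias": (n_embd,),
    }
    for i in range(n_layer):
        p = f"transformer.h.{i}."
        shapes.update({
            p + "ln_1.weight": (n_embd,),
            p + "ln_1.bias": (n_embd,),
            p + "attn.c_attn.weight": (n_embd, 3 * n_embd),  # fused Q, K, V
            p + "attn.c_attn.bias": (3 * n_embd,),
            p + "attn.c_proj.weight": (n_embd, n_embd),
            p + "attn.c_proj.bias": (n_embd,),
            p + "ln_2.weight": (n_embd,),
            p + "ln_2.bias": (n_embd,),
            p + "mlp.c_fc.weight": (n_embd, 4 * n_embd),
            p + "mlp.c_fc.bias": (4 * n_embd,),
            p + "mlp.c_proj.weight": (4 * n_embd, n_embd),
            p + "mlp.c_proj.bias": (n_embd,),
        })
    return shapes


total = sum(prod(shape) for shape in gpt2_shapes().values())
print(f"{total:,}")  # 124,439,808
```

With the model in memory, `sum(p.numel() for p in model_hf.parameters())` should give the same figure without any hard-coded shapes.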

In [None]:
sd_hf["transformer.wpe.weight"].view(-1)[:20]

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(sd_hf["transformer.wpe.weight"], cmap="gray")

In [None]:
plt.plot(sd_hf["transformer.wpe.weight"][:, 150])
plt.plot(sd_hf["transformer.wpe.weight"][:, 200])
plt.plot(sd_hf["transformer.wpe.weight"][:, 250])

In [None]:
plt.imshow(
    sd_hf["transformer.h.1.attn.c_attn.weight"][:300, :300], cmap="gray"
)

In [None]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model="gpt2")  # the keyword is `model`, not `method`
set_seed(42)
generator("Hello, I'm a language model", max_length=30, num_return_sequences=5)