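# Breaking Down Qwen2 from Scratch
In this project I take Qwen2 apart from scratch. Concretely, I trace how the model takes input_text = "学习如逆水行舟，不进则" (a proverb: "Learning is like rowing against the current; if you do not advance, you fall ...") and generates "退" ("back").

I hope this project helps readers better understand the structure of Qwen2, and I also hope to use it to promote Chinese LLMs.

Qwen2 can be downloaded from ModelScope: **https://www.modelscope.cn/models/qwen/Qwen2-7B/files**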

<div>
    <img src="images/all_steps.png"/>
</div>

In [1]:
import json
import math

import matplotlib.pyplot as plt
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokenizer
I will not go into detail on the tokenizer in this article. Andrej Karpathy has written a detailed yet concise reproduction of the GPT-4 tokenizer; see his [GitHub](https://github.com/karpathy/minbpe).

<div>
    <img src="images/tokenizer.png" width=500/>
</div>

In [2]:
# To make every layer's output match model.generate(), load the model in torch.float32 precision.
model_path="Qwen/Qwen2-7B"

tokenizer=AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



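# Loading the model weights
Normally we load a large model with the transformers library and run inference directly.
This time, to take the model apart, I extract the weights of every layer and walk through the outputs layer by layer.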

<div>
    <img src="images/layers_dict.png" width=500/>
</div>


In [3]:
model = model.state_dict()
print(json.dumps(list(model.keys())[:20], indent=4))

[
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.bias",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.k_proj.bias",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.self_attn.v_proj.bias",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.self_attn.q_proj.bias",
    "model.layers.1.self_attn.k_proj.weight",
    "model.layers.1.self_attn.k_proj.bias",
    "model.layers.1.self_attn.v_proj.weight",
    "model.layers.1.self_attn.v_proj.bias",
    "model.layers.1.self_attn.o_proj.weight"
]


In [4]:
with open("Qwen2-7B/config.json", "r") as f:
    config = json.load(f)
config

{'architectures': ['Qwen2ForCausalLM'],
 'attention_dropout': 0.0,
 'bos_token_id': 151643,
 'eos_token_id': 151643,
 'hidden_act': 'silu',
 'hidden_size': 3584,
 'initializer_range': 0.02,
 'intermediate_size': 18944,
 'max_position_embeddings': 131072,
 'max_window_layers': 28,
 'model_type': 'qwen2',
 'num_attention_heads': 28,
 'num_hidden_layers': 28,
 'num_key_value_heads': 4,
 'rms_norm_eps': 1e-06,
 'rope_theta': 1000000.0,
 'sliding_window': 131072,
 'tie_word_embeddings': False,
 'torch_dtype': 'bfloat16',
 'transformers_version': '4.37.2',
 'use_cache': True,
 'use_sliding_window': False,
 'vocab_size': 152064}

## Some basic parameters of Qwen2

In [5]:
dim = config["hidden_size"]
n_layers = config["num_hidden_layers"]
n_heads = config["num_attention_heads"]
n_kv_heads = config["num_key_value_heads"]
vocab_size = config["vocab_size"]
norm_eps = config["rms_norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])

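## Converting tokenizer output into vectors
I remember that in my first interview at Alibaba, they asked me how the embedding layer is trained. I answered that the embedding layer is just the tokenizer. Looking back, that was a bit silly.

One thing to note: some tokens represent two characters, while some characters need one or two tokens. So when the predicted word needs two tokens, the first generated token has to be appended to the original input tokens and the model run once more to produce the second.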
<div>
    <img src="images/embedding_layers.png" width=500>
</div>
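
The "append and run again" step described above can be sketched as a greedy generation loop. This is a minimal illustration rather than the notebook's own code: `next_token_fn` is a hypothetical stand-in for one full forward pass that returns the most likely next token id, and the toy function in the example simply returns the last id plus one.

```python
from typing import Callable, List


def generate(tokens: List[int], next_token_fn: Callable[[List[int]], int], max_new_tokens: int) -> List[int]:
    """Greedy autoregressive loop: predict one token, append it, repeat."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    for _ in range(max_new_tokens):
        next_id = next_token_fn(tokens)  # hypothetical: one full forward pass
        tokens.append(next_id)           # the new token becomes part of the input
    return tokens


# Toy stand-in for the model (an assumption, for illustration only).
def toy_next_token(tokens: List[int]) -> int:
    return tokens[-1] + 1


print(generate([1, 2], toy_next_token, max_new_tokens=3))  # [1, 2, 3, 4, 5]
```

In the rest of this walkthrough only one new token is needed, so the loop body runs exactly once.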

In [6]:
prompt = "学习如逆水行舟，不进则"
tokens = tokenizer.encode(prompt)
q_len = len(tokens)
q_len

10

In [7]:
tokenizer.decode(tokens)

'学习如逆水行舟，不进则'

In [8]:
tokens = torch.tensor(tokens)

In [9]:
embedding_layer = torch.nn.Embedding.from_pretrained(model['model.embed_tokens.weight'])
token_embeddings_unnormalized = embedding_layer(tokens)
token_embeddings_unnormalized

tensor([[-0.0270, -0.0077, -0.0400,  ...,  0.0011,  0.0075,  0.0004],
        [ 0.0225,  0.0047,  0.0073,  ...,  0.0013, -0.0077, -0.0337],
        [-0.0303, -0.0083,  0.0029,  ...,  0.0033,  0.0057,  0.0061],
        ...,
        [ 0.0005,  0.0129, -0.0093,  ...,  0.0118,  0.0028,  0.0113],
        [ 0.0112,  0.0210, -0.0214,  ..., -0.0061, -0.0099, -0.0027],
        [ 0.0175,  0.0070, -0.0198,  ...,  0.0104,  0.0007, -0.0079]])

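## RMS normalization layer
Normalization keeps activations on a consistent scale, which helps representation quality and speeds up training. For the background, see Andrew Ng's machine learning courses.

To keep the denominator from being zero, we add a small norm_eps (it is set in the config file, and its value differs somewhat between models).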

<div>
    <img src="images/rms_norm.png" width=500>
</div>


In [10]:
def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
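
To make the formula concrete, here is the same computation on a two-element vector in plain Python, so it runs without loading the model. The helper name `rms_norm_1d` and the input values are made up for illustration; with a weight of 1, the output should have a mean square very close to 1.

```python
import math


def rms_norm_1d(x, weight, eps=1e-6):
    """RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [3.0, 4.0]                          # mean(x^2) = (9 + 16) / 2 = 12.5
y = rms_norm_1d(x, weight=[1.0, 1.0])
print([round(v, 4) for v in y])         # [0.8485, 1.1314]
```

Unlike LayerNorm, nothing is subtracted from the input: RMSNorm only rescales.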

# First normalization
After the first normalization layer, the tensor shape is still [10, 3584].

<div>
    <img src="images/norm.png" width=500>
</div>


In [11]:
token_embeddings = rms_norm(token_embeddings_unnormalized, model["model.layers.0.input_layernorm.weight"])
token_embeddings

tensor([[-0.5887, -0.1632, -0.8407,  ...,  0.0224,  0.1508,  0.0071],
        [ 0.4845,  0.0988,  0.1508,  ...,  0.0276, -0.1528, -0.6674],
        [-0.6847, -0.1818,  0.0634,  ...,  0.0732,  0.1187,  0.1274],
        ...,
        [ 0.0102,  0.2679, -0.1921,  ...,  0.2451,  0.0558,  0.2216],
        [ 0.2462,  0.4459, -0.4507,  ..., -0.1303, -0.2004, -0.0538],
        [ 0.3835,  0.1500, -0.4181,  ...,  0.2221,  0.0149, -0.1588]])

## Reproducing the self-attention layer from scratch

<div>
    <img src="images/qkv.png" width=600>
</div>

In [12]:
q_layer0 = model["model.layers.0.self_attn.q_proj.weight"]
k_layer0 = model["model.layers.0.self_attn.k_proj.weight"]
v_layer0 = model["model.layers.0.self_attn.v_proj.weight"]
o_layer0 = model["model.layers.0.self_attn.o_proj.weight"]
q_layer0_bias = model['model.layers.0.self_attn.q_proj.bias']
k_layer0_bias = model['model.layers.0.self_attn.k_proj.bias']
v_layer0_bias = model['model.layers.0.self_attn.v_proj.bias']

## Computing Q, K, V and reshaping them into per-head tensors

In [13]:
query_states = torch.matmul(token_embeddings, q_layer0.T)+q_layer0_bias
key_states = torch.matmul(token_embeddings, k_layer0.T)+k_layer0_bias
value_states = torch.matmul(token_embeddings, v_layer0.T)+v_layer0_bias

In [14]:
head_dim = dim//n_heads
query_states = query_states.view(1, q_len, n_heads, head_dim).transpose(1, 2)
key_states = key_states.view(1, q_len, n_kv_heads, head_dim).transpose(1, 2)
value_states = value_states.view(1, q_len, n_kv_heads, head_dim).transpose(1, 2)
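
The reshape above relies on grouped-query attention: there are fewer key/value heads than query heads, so the K and V projections are narrower than the Q projection. The arithmetic below uses only the values from the config.json shown earlier; the variable names are my own.

```python
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4

head_dim = hidden_size // num_attention_heads             # size of each head
q_out = num_attention_heads * head_dim                    # output width of q_proj
kv_out = num_key_value_heads * head_dim                   # output width of k_proj and v_proj
group_size = num_attention_heads // num_key_value_heads   # query heads sharing one KV head

print(head_dim, q_out, kv_out, group_size)  # 128 3584 512 7
```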

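## Positional encoding
So that each token carries information about both its absolute position and its position relative to other tokens, rotary position embeddings (RoPE) are applied to the queries and keys.

### RoPE
This video explains RoPE, so I will not go into detail here.
**https://www.youtube.com/watch?v=o29P0Kpobz0&t=530s**

### Here I use Qwen's own rotary embedding code directly.
The Qwen2RotaryEmbedding() class provides rotary position embeddings for the input sequence, the position encoding used in this transformer model. Compared with traditional position encodings, RoPE captures relative position better, which can improve model quality.

In short, the code below is a PyTorch module that precomputes and caches the cosine and sine values for every position, so it can supply position encodings efficiently. This is especially useful for long sequences.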

In [15]:
class Qwen2RotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)

        freqs = torch.outer(t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
rotary_emb = Qwen2RotaryEmbedding(
            128,
            max_position_embeddings=131072,
            base=rope_theta,
        )

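## apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1)
Combines the query and key tensors with the cosine and sine values (including the rotation step), so that position information is embedded into the queries and keys.
## rotate_half(x)
Swaps the two halves of the last dimension and negates the second half. Combined with the cosine and sine terms, this rotates each pair of elements (x[i], x[i + d/2]) by a position-dependent angle. A rotation changes the vector's direction but not its magnitude.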

In [16]:
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`):
            The position indices of the tokens corresponding to the query and key tensors. For example, this can be
            used to pass offsetted position ids when working with a KV-cache.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
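
The key property of this rotation is that the query-key dot product depends only on the distance between the two positions. The plain-Python sketch below checks this on a 4-dimensional toy vector. It follows the same half-split pairing as `rotate_half`, but the function name `rope`, the vectors, and the default base of 10000 are assumptions for illustration (Qwen2-7B uses rope_theta = 1000000).

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate vector x for position pos, pairing x[i] with x[i + d/2]."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s          # first half of x * cos + rotate_half(x) * sin
        out[i + half] = x2 * c + x1 * s   # second half of the same expression
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

score_a = dot(rope(q, 5), rope(k, 3))     # positions 5 and 3, offset 2
score_b = dot(rope(q, 12), rope(k, 10))   # positions 12 and 10, offset 2
print(abs(score_a - score_b) < 1e-9)      # True: only the offset matters
print(abs(dot(rope(q, 7), rope(q, 7)) - dot(q, q)) < 1e-9)  # True: length is preserved
```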

In [17]:
cos, sin = rotary_emb(value_states, seq_len=q_len)
position_ids = torch.arange(q_len).view(1,q_len)

In [18]:
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

In [19]:
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

In [20]:
key_states = repeat_kv(key_states, n_heads // n_kv_heads)
value_states = repeat_kv(value_states, n_heads // n_kv_heads)
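
After `repeat_kv`, each KV head appears `n_rep` times in a row, so query head `i` is paired with KV head `i // n_rep`. The list-based sketch below mimics that ordering with string labels instead of tensors; `repeat_heads` is a hypothetical helper, not part of the model code.

```python
def repeat_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times in a row (interleaved, not tiled)."""
    return [head for head in kv_heads for _ in range(n_rep)]


kv_heads = ["kv0", "kv1", "kv2", "kv3"]      # 4 KV heads, as in the config above
expanded = repeat_heads(kv_heads, n_rep=7)   # 28 entries, one per query head

print(len(expanded))              # 28
print(expanded[6], expanded[7])   # kv0 kv1
```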

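## Scaled dot-product attention
torch.nn.functional.scaled_dot_product_attention is a PyTorch function that computes scaled dot-product attention. This is the core attention mechanism of the transformer architecture: it lets the model weight different positions according to how relevant they are to each other. Concretely, the function computes softmax(QK^T / sqrt(d)) V from the query, key, and value tensors.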
<div>
    <img src="images/softmax.png" width=600>
</div>

In [22]:
attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=None,
    dropout_p=0.0,
    # Causal masking: each token attends only to itself and earlier tokens (q_len = 10 > 1 here).
    is_causal=True,
)
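
To show what the call above computes for a single head, here is a hand-written causal attention in plain Python. It is a sketch for intuition under toy inputs, not a replacement for the PyTorch kernel; the function names and the 2x2 example values are made up.

```python
import math


def softmax(xs):
    m = max(xs)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v for one head, where row i sees only rows 0..i."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores against positions 0..i only: this is the causal mask.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(weights[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

out = causal_attention(q, k, v)
print(out[0])  # [1.0, 2.0]: the first token can only attend to itself
```

The second row is a weighted mix of both value vectors, because the second token can see both positions.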

In [24]:
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(1, q_len, dim)

In [26]:
output_states = torch.matmul(attn_output, o_layer0.T)
output_states

tensor([[[-1.8568e-01,  1.3149e-01, -1.5167e-01,  ...,  2.8487e-02,
          -8.3742e-02, -2.3384e-02],
         [-1.2537e-01,  2.0195e-01, -1.2300e-02,  ..., -3.6986e-02,
          -1.8594e-01,  9.9794e-02],
         [-1.4426e-01,  1.5807e-01, -1.7747e-01,  ..., -7.1516e-02,
           7.0311e-02, -1.7331e-01],
         ...,
         [-5.9189e-02,  4.0363e-02, -1.3974e-05,  ..., -5.2831e-02,
          -2.0385e-02,  8.6324e-03],
         [ 3.7043e-02,  5.2902e-02,  3.0693e-03,  ..., -8.9145e-02,
          -1.0277e-01,  1.0480e-02],
         [-8.8573e-02,  1.8764e-02, -4.4170e-02,  ...,  1.4842e-01,
          -9.0892e-02,  5.9852e-02]]])

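## Residual connection
Residual networks (ResNet) introduced the residual block, which effectively mitigates common problems in training deep networks, in particular vanishing and exploding gradients. Here, the attention output is added back to the layer's input (the unnormalized embeddings).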
<div>
    <img src="images/add1.png" width=500>
</div>

In [None]:
output_states = output_states+token_embeddings_unnormalized

## Second normalization
<div>
    <img src="images/norm2.png" width=500>
</div>

In [None]:
# Normalize the hidden state after the residual add (not the raw embeddings), as in the full loop later on.
second_normalized = rms_norm(output_states, model["model.layers.0.post_attention_layernorm.weight"])

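## Feedforward neural network (FFN)
A feedforward network consists of an input layer, one or more hidden layers, and an output layer. Neurons in adjacent layers are connected by weights; each neuron computes a weighted sum of its inputs and applies an activation function. In Qwen2 this block uses three weight matrices (gate_proj, up_proj, down_proj) with a SiLU activation on the gate branch.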
<div>
    <img src="images/feedforward.png" width=500>
</div>
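
The gated structure is easier to see at toy size. The sketch below implements down(silu(gate(x)) * up(x)) with plain Python lists; the weight values are arbitrary, the helper names are my own, and the matrices use the same [out_features, in_features] layout as the stored projection weights.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a [out, in] weight matrix by a vector of length in."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_mlp(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]   # activation on the gate branch only
    up = matvec(w_up, x)                          # linear "up" branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(w_down, hidden)                 # project back to the model width


x = [1.0, 2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # 2 -> 3
w_down = [[1.0, 1.0, 1.0], [0.5, 0.0, -0.5]]      # 3 -> 2

closed_gate = [[0.0, 0.0]] * 3                    # gate pre-activations are all 0
print(gated_mlp(x, closed_gate, w_up, w_down))    # [0.0, 0.0]: silu(0) = 0 blocks everything
```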

In [None]:
w1 = model["model.layers.0.mlp.gate_proj.weight"]
w2 = model["model.layers.0.mlp.down_proj.weight"]
w3 = model["model.layers.0.mlp.up_proj.weight"]
# Gated MLP: down_proj(silu(gate_proj(x)) * up_proj(x))
output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(second_normalized, w1.T)) * torch.matmul(second_normalized, w3.T), w2.T)

## Everything is done!
Now run all 28 layers in one pass.

In [50]:
final_embedding = token_embeddings_unnormalized
head_dim = dim // n_heads
for layer in range(n_layers):
    residual1 = final_embedding

    # Pre-attention RMSNorm
    layer_embedding_norm = rms_norm(final_embedding, model[f"model.layers.{layer}.input_layernorm.weight"])

    q_layer = model[f"model.layers.{layer}.self_attn.q_proj.weight"]
    k_layer = model[f"model.layers.{layer}.self_attn.k_proj.weight"]
    v_layer = model[f"model.layers.{layer}.self_attn.v_proj.weight"]
    w_layer = model[f"model.layers.{layer}.self_attn.o_proj.weight"]
    q_layer_bias = model[f"model.layers.{layer}.self_attn.q_proj.bias"]
    k_layer_bias = model[f"model.layers.{layer}.self_attn.k_proj.bias"]
    v_layer_bias = model[f"model.layers.{layer}.self_attn.v_proj.bias"]

    # Q, K, V projections, then split into heads: [1, heads, q_len, head_dim]
    query_states = torch.matmul(layer_embedding_norm, q_layer.T) + q_layer_bias
    key_states = torch.matmul(layer_embedding_norm, k_layer.T) + k_layer_bias
    value_states = torch.matmul(layer_embedding_norm, v_layer.T) + v_layer_bias
    query_states = query_states.view(1, q_len, n_heads, head_dim).transpose(1, 2)
    key_states = key_states.view(1, q_len, n_kv_heads, head_dim).transpose(1, 2)
    value_states = value_states.view(1, q_len, n_kv_heads, head_dim).transpose(1, 2)

    # Rotary position embedding on queries and keys
    cos, sin = rotary_emb(value_states, seq_len=q_len)
    position_ids = torch.arange(q_len).view(1, q_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    # Expand the KV heads to match the number of query heads
    key_states = repeat_kv(key_states, n_heads // n_kv_heads)
    value_states = repeat_kv(value_states, n_heads // n_kv_heads)

    attn_output = torch.nn.functional.scaled_dot_product_attention(
        query_states,
        key_states,
        value_states,
        attn_mask=None,
        dropout_p=0.0,
        # Causal masking: each token attends only to itself and earlier tokens.
        is_causal=True,
    )

    # Merge the heads and apply the output projection
    attn_output = attn_output.transpose(1, 2).contiguous()
    attn_output = attn_output.view(1, q_len, dim)
    output_states = torch.matmul(attn_output, w_layer.T)

    # First residual connection
    hidden_state = residual1 + output_states

    # Feedforward block
    residual2 = hidden_state

    w1 = model[f"model.layers.{layer}.mlp.gate_proj.weight"]
    w2 = model[f"model.layers.{layer}.mlp.down_proj.weight"]
    w3 = model[f"model.layers.{layer}.mlp.up_proj.weight"]
    second_normalized = rms_norm(hidden_state, model[f"model.layers.{layer}.post_attention_layernorm.weight"])
    output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(second_normalized, w1.T)) * torch.matmul(second_normalized, w3.T), w2.T)

    # Second residual connection
    final_embedding = residual2 + output_after_feedforward

## Here is the final embedding
Its shape is [1, 10, 3584]: the same as the first embedding, plus a leading batch dimension.
<div>
    <img src="images/final_norm.png">
</div>

In [51]:
final_normalized = rms_norm(final_embedding, model["model.norm.weight"])
final_normalized.shape

torch.Size([1, 10, 3584])

## Finally!!! We can decode the embedding into the token value!
<div>
    <img src="images/final_linear.png" width=500>
</div>

In [52]:
logits = torch.matmul(final_normalized[0][-1], model["lm_head.weight"].T)
logits.shape

torch.Size([152064])

In [53]:
next_token = torch.argmax(logits, dim=-1).view(1)
next_token

tensor([55806])
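
Taking the argmax of the raw logits is greedy decoding. Softmax is monotonic, so it does not change which token wins; it is only needed when you want actual probabilities or plan to sample. The small check below uses made-up logits and plain Python helpers of my own.

```python
import math


def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.5, -0.3, 4.2, 0.7]               # toy vocabulary of 4 tokens
probs = softmax(logits)

print(argmax(logits), argmax(probs))         # 2 2
```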

# Oh! yeah!~~~
<div>
    <img src="images/tui.png" width=500>
</div>


In [54]:
tokenizer.decode(next_token)

'退'

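# First, I want to thank Naklecha, whose work inspired me and gave me a deeper, more thorough understanding of large models.
Thanks to his **[Llama3-from-scratch](https://github.com/naklecha/llama3-from-scratch)**, I now understand the structure of a decoder-only LLM.

# Second, I hope Chinese LLMs can shine around the world.
The performance of Qwen2 has improved greatly compared with the previous version.

# Finally, since I do not have a formal computer science background, I have run into many setbacks, in algorithm work and elsewhere, including the tokenizer and embedding-layer blunder mentioned above.
I hope to use this opportunity to help more beginners. Anyone with suggestions for improving this article is welcome to contact me on Xiaohongshu (RED). My ID is below.
My RED ID is 495668258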