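## Fundamentals of Large Language Models

This chapter first introduces the Transformer architecture, then builds on it to cover the generative pre-trained language model GPT, the network architecture of large language models, attention mechanism optimizations, and related hands-on practice.

### Transformer Model

The Transformer architecture and its code are not repeated here; the architecture diagram below is enough.  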

<img src="./images/transformer.png" style="zoom:60%;" /> 

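### Generative Language Model GPT

GPT is a generative pre-trained language model proposed by OpenAI. It is a unidirectional language model built from a stack of Transformer layers (decoder only), with the following structure:  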

<img src="./images/gpt.png" style="zoom:70%;" />  

GPT training consists of two main stages: unsupervised pre-training and supervised fine-tuning on downstream tasks.  

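#### Unsupervised Pre-training

GPT uses generative pre-training. "Unidirectional" means the model processes the text sequence in a single direction, either left to right or right to left. The `Transformer` structure and decoding strategy it uses guarantee that each position in the input text can depend only on information from earlier positions.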
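
The constraint that each position only sees earlier positions is usually enforced with a causal (lower-triangular) attention mask. The following is a minimal sketch in plain Python on a toy score matrix; the helper names `causal_mask` and `masked_softmax` are illustrative and do not come from the book:

```python
import math

def causal_mask(n):
    """Return an n x n matrix where entry [i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, ignoring masked-out (future) positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Future positions get weight 0; exp() is only taken over allowed positions
        row_max = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - row_max) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3-token example with uniform raw scores
scores = [[0.0] * 3 for _ in range(3)]
attn = masked_softmax(scores, causal_mask(3))
for row in attn:
    print([round(w, 3) for w in row])
```

With uniform scores, the first token attends only to itself, the second splits its attention over two positions, and so on; no row ever assigns weight to a later position.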

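#### Supervised Downstream Task Fine-tuning

Unsupervised language model pre-training gives GPT a degree of general-purpose semantic representation ability. The goal of downstream task fine-tuning is to adapt this general representation to the characteristics of a specific downstream task. Downstream tasks usually require training on a labeled dataset, where each sample consists of an input text sequence of length n and a corresponding label y.

Because fine-tuning optimizes for the task objective, the model can easily forget the general semantic knowledge it learned during pre-training, which hurts its generality and generalization ability. This is the catastrophic forgetting problem. To mitigate it, the pre-training loss is usually mixed with the downstream fine-tuning loss.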
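
In the notation of the original GPT paper (the symbols below are not defined in this text and are introduced only for illustration), the mixed objective can be written as follows, where $\mathcal{U}$ is the unlabeled corpus, $\mathcal{C}$ the labeled dataset, $k$ the context window size, and $\lambda$ a weighting coefficient:

$$
\begin{aligned}
L_1(\mathcal{U}) &= \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \\
L_2(\mathcal{C}) &= \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^n) \\
L_3(\mathcal{C}) &= L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
\end{aligned}
$$

All three are log-likelihoods to be maximized. The term $\lambda L_1(\mathcal{C})$ keeps the language modeling objective active during fine-tuning, which is what counteracts forgetting.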

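#### Pre-trained Language Model in Practice

Using datasets provided by HuggingFace, we build a pre-trained model, taking BERT as the example.

> Prepare the dataset

The book downloads the HuggingFace datasets directly from code, but in practice the datasets are large and the download is slow, even with the [hf-mirror](https://hf-mirror.com/) mirror configured. I therefore downloaded the datasets from the official site and loaded them locally.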

In [None]:
pip install datasets

In [1]:
from datasets import load_dataset, concatenate_datasets

# Option 1: download directly from code
#bookcorpus = load_dataset("bookcorpus", split="train", trust_remote_code=True)
#wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", trust_remote_code=True)

# Option 2: download from the official site first, then load locally (the wikipedia dataset is large, so only part of it was downloaded)
bookcorpus = load_dataset(path="data/bookcorpus", split="train")
bookcorpus = bookcorpus.select(range(2000000))  
print(bookcorpus)
wiki = load_dataset(path='data/wikimedia/wikipedia', split="train")
print(wiki)


Dataset({
    features: ['text'],
    num_rows: 2000000
})
Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 1250308
})


In [2]:
# Concatenate the datasets and save the result to local files

# Keep only the 'text' column
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])
print(dataset)

# Split into train/test sets before saving them to local files
d = dataset.train_test_split(test_size=0.1)

print(d)

def dataset_to_text(dataset, output_filename="data.txt"):
    """Write the 'text' column of a dataset to a file, one sample per line."""
    with open(output_filename, "w", encoding="utf-8") as f:
        for t in dataset["text"]:
            print(t, file=f)

Dataset({
    features: ['text'],
    num_rows: 3250308
})
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2925277
    })
    test: Dataset({
        features: ['text'],
        num_rows: 325031
    })
})


In [None]:
#save the training set to train.txt
dataset_to_text(d["train"], "data/train.txt")
#save the testing set to test.txt
dataset_to_text(d["test"], "data/test.txt")

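> Train the tokenizer

BERT uses `WordPiece` tokenization, which decides from word frequencies in the training corpus whether a whole word is kept intact or split into several tokens. The tokenizer therefore has to be trained first. The `BertWordPieceTokenizer` class from the `tokenizers` library can be used for this.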

In [5]:
import os
import json
from tokenizers import BertWordPieceTokenizer

special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '<S>', '<T>']

files = ['data/train.txt']
vocab_size = 30_522
max_length = 512
trauncate_longer_samples = False

# Initialize the tokenizer
tokenizer = BertWordPieceTokenizer()
# Train it on the corpus files
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)
tokenizer.enable_truncation(max_length=max_length)

model_path = "pretrained-bert"

if not os.path.isdir(model_path):
    os.mkdir(model_path)

tokenizer.save_model(model_path)

# dumping some of the tokenizer config to config file,
# including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(model_path, "config.json"), "w") as f:
    tokenizer_cfg = {
    "do_lower_case": True,
    "unk_token": "[UNK]",
    "sep_token": "[SEP]",
    "pad_token": "[PAD]",
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
    "model_max_length": max_length,
    "max_len": max_length,
    }
    json.dump(tokenizer_cfg, f)
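
Once a vocabulary has been trained, WordPiece splits each word by greedy longest-match-first lookup and marks non-initial pieces with a `##` prefix. The sketch below illustrates that lookup in plain Python. The toy vocabulary and the function name `wordpiece_tokenize` are made up for illustration; this is not the `tokenizers` implementation:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a single word into WordPiece tokens by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##ed"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that is itself in the vocabulary stays whole; a word with no matching pieces collapses to `[UNK]`.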


> Preprocess the corpus

Process the training and test sets with the trained tokenizer.

In [None]:
from transformers import BertTokenizerFast

# Load the tokenizer
max_length = 512
model_path = "pretrained-bert"
trauncate_longer_samples = False
tokenizer = BertTokenizerFast.from_pretrained(model_path)

def encode_with_truncation(examples):
    return tokenizer(examples['text'], truncation=True, padding="max_length",
                     max_length=max_length, return_special_tokens_mask=True)

def encode_without_trauncation(examples):
    return tokenizer(examples['text'], return_special_tokens_mask=True)

# Choose the encoding function according to trauncate_longer_samples
encode = encode_with_truncation if trauncate_longer_samples else encode_without_trauncation

# Tokenize the datasets
# For details on map, see the HuggingFace docs: https://huggingface.co/docs/datasets/about_map_batch
# batch_size defaults to 1000
train_dataset = d['train'].map(encode, batched=True)
test_dataset = d['test'].map(encode, batched=True)

print("train_dataset", train_dataset)
print("test_dataset", test_dataset)

if trauncate_longer_samples:
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    # Without truncation, keep special_tokens_mask so that special tokens such as [SEP] and [CLS] remain identifiable
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])

print("train_dataset", train_dataset)
print("test_dataset", test_dataset)

# Since truncation is disabled, concatenate the samples and regroup them into fixed-length chunks
from itertools import chain

def group_texts(examples):
    # concat all texts
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing partial chunk so every chunk has exactly max_length tokens
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    # split by chunks of max_len
    result = {
        k: [t[i: i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    
    return result

if not trauncate_longer_samples:
    train_dataset = train_dataset.map(group_texts, batched=True, desc=f"Grouping texts in chunks of {max_length}")
    test_dataset = test_dataset.map(group_texts, batched=True, desc=f"Grouping texts in chunks of {max_length}")
    
    # convert them from lists to torch tensors
    train_dataset.set_format("torch")
    test_dataset.set_format("torch")

print("train_dataset", train_dataset)
print("test_dataset", test_dataset)
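
To see what `group_texts` does, the same concatenate-then-chunk logic can be run on a tiny hand-made batch. This sketch assumes a chunk length of 4 instead of 512 and uses made-up token ids; `group_texts_demo` is a hypothetical stand-in that takes the chunk length as a parameter:

```python
from itertools import chain

def group_texts_demo(examples, chunk_len):
    """Concatenate all sequences per column, then cut into chunks of chunk_len (dropping the remainder)."""
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    if total_length >= chunk_len:
        total_length = (total_length // chunk_len) * chunk_len
    return {
        k: [t[i: i + chunk_len] for i in range(0, total_length, chunk_len)]
        for k, t in concatenated.items()
    }

# Three tokenized samples of lengths 4, 3 and 4 (11 tokens in total)
batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1]],
}
grouped = group_texts_demo(batch, chunk_len=4)
print(grouped["input_ids"])
```

The 11 tokens yield two full chunks of 4, and the last 3 tokens are discarded. Chunk boundaries do not respect sample boundaries, so one chunk can contain the end of one text and the start of the next.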

> Model training

In [None]:
from transformers import BertConfig, BertForMaskedLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Initialize the model from scratch (random weights) with a vocabulary matching the tokenizer
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
model = BertForMaskedLM(config=model_config)

# initialize the data collator, randomly masking 20% (default is 15%) of the tokens
# for the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

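training_args = TrainingArguments(
    output_dir=model_path,           # directory where checkpoints are written
    evaluation_strategy="steps",     # evaluate every `logging_steps` steps (newer transformers releases name this `eval_strategy`)
    overwrite_output_dir=False,
    num_train_epochs=5,
    per_device_train_batch_size=16,  # training batch size per device
    gradient_accumulation_steps=8,   # accumulate gradients over 8 steps before each weight update
    per_device_eval_batch_size=64,   # evaluation batch size per device
    logging_steps=1000,
    save_steps=1000,                 # save a checkpoint every 1000 steps
)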

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# train the model
trainer.train()
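
`DataCollatorForLanguageModeling` builds the MLM inputs on the fly: it selects a fraction `mlm_probability` of the non-special tokens as prediction targets and, by default, replaces 80% of those with `[MASK]`, 10% with a random token, and leaves 10% unchanged. Positions that are not selected get the label -100 so the loss ignores them. The following is a simplified plain-Python sketch of that rule; the function `mask_tokens` and the toy token ids are illustrative, not the library code:

```python
import random

def mask_tokens(input_ids, special_mask, mask_id, vocab_size, mlm_probability=0.2, seed=0):
    """Return (masked_inputs, labels) following the 80/10/10 MLM corruption rule."""
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for i, (tok, is_special) in enumerate(zip(input_ids, special_mask)):
        if is_special or rng.random() >= mlm_probability:
            labels.append(-100)  # not a prediction target; ignored by the loss
            continue
        labels.append(tok)  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

ids = [2] + list(range(100, 120)) + [3]   # toy sequence: [CLS] ... [SEP]
special = [1] + [0] * 20 + [1]            # 1 marks special tokens, as in special_tokens_mask
masked, labels = mask_tokens(ids, special, mask_id=4, vocab_size=30522)
print(masked)
print(labels)
```

This is why the preprocessing step keeps `special_tokens_mask`: the collator uses it to avoid masking `[CLS]` and `[SEP]`.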

> Prediction

In [None]:
from transformers import pipeline

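# Trainer writes checkpoints to `checkpoint-<step>` subdirectories of output_dir;
# point this path at one of them if model_path itself holds no model weights
model = BertForMaskedLM.from_pretrained(os.path.join(model_path, ''))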
tokenizer = BertTokenizerFast.from_pretrained(model_path)

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# perform predictions
examples = [
    "Today's most trending hashtags on [MASK] is Donald Trump",
    "The [MASK] was cloudy yesterday, but today it's rainy.", 
    ]

for example in examples:
    for prediction in fill_mask(example):
        print(f"{prediction['sequence']}, confidence: {prediction['score']}")
    print("="*50)

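### LLaMA and Attention Mechanism Optimization

#### LLaMA Model Architecture

The overall LLaMA architecture is similar to GPT-2: both are based on the Transformer decoder. The differences are that LLaMA uses pre-normalization with the RMSNorm normalization function, replaces the activation function with SwiGLU, and uses rotary position embeddings. The GPT-2 structure is shown below:  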

<img src="./images/gpt2.png" style="zoom: 70%;" />  

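1. RMSNorm normalization function  

To make training more stable, `GPT-2` introduced pre-layer normalization relative to `GPT`: the first layer normalization is moved in front of the multi-head self-attention layer, the second is moved in front of the fully connected layer, and the residual connections are placed after the multi-head self-attention layer and the fully connected layer. LLaMA keeps this layout and uses `RMSNorm` as the normalization function.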

<img src="./images/RMSNorm.png" style="zoom: 70%;" />  
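
As a rough illustration of the formula in the figure, RMSNorm rescales a vector by its root mean square and a learned per-dimension gain, without subtracting the mean as standard LayerNorm does. This is a minimal sketch on plain Python lists; the function name `rms_norm` and the `eps` default are assumptions for illustration:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then apply a per-dimension gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # eps guards against division by zero
    return [v * inv_rms * g for v, g in zip(x, gain)]

x = [3.0, 4.0]
out = rms_norm(x, gain=[1.0, 1.0])
print(out)

# With unit gain, the output has a root mean square of (almost exactly) 1
print(math.sqrt(sum(v * v for v in out) / len(out)))
```

Skipping the mean subtraction saves computation compared with LayerNorm, and `LlamaRMSNorm` in the decoder code further down plays this role.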


2. SwiGLU activation function  

SwiGLU replaces `ReLU`: in LLaMA, the fully connected layer is an `FFN` (Position-wise Feed-Forward Network) with the SwiGLU activation function:  
 
<img src="./images/SwiGLU.png" style="zoom: 70%;" />  

The shape of the Swish activation function changes with the value of β:  

<img src="./images/Swish.png" style="zoom: 70%;" />  
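
The SwiGLU feed-forward layer can be sketched in a few lines, assuming the usual definitions Swish_β(x) = x · sigmoid(βx) and FFN(x) = (Swish(xW_gate) ⊗ xW_up) W_down. The weight names `w_gate`, `w_up`, `w_down` and the toy values are illustrative assumptions, and plain Python lists stand in for tensors:

```python
import math

def swish(x, beta=1.0):
    """Swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def ffn_swiglu(x, w_gate, w_up, w_down):
    """FFN_SwiGLU(x) = (Swish(x W_gate) * (x W_up)) W_down, with * taken element-wise."""
    gate = [swish(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)

# Toy sizes: model dimension 2, hidden dimension 3 (weights are arbitrary)
x = [1.0, -1.0]
w_gate = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w_up = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(ffn_swiglu(x, w_gate, w_up, w_down))

# As beta grows, Swish approaches ReLU
print(swish(2.0, beta=50.0), swish(-2.0, beta=50.0))
```

Unlike a ReLU FFN with two weight matrices, the gated form uses three: one branch produces the gate and the other carries the values.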

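3. Rotary position embedding (RoPE)  

For position encoding, LLaMA uses Rotary Positional Embeddings (RoPE) in place of the original absolute position encoding. RoPE draws on the idea of `complex numbers`; its starting point is to achieve relative position encoding by means of an absolute position encoding.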
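
The complex-number view can be made concrete with a small sketch: each pair of dimensions is treated as one complex number and rotated by an angle proportional to the token's absolute position, and the resulting query-key dot product depends only on the relative offset. The function name `rope`, the `base=10000.0` default, and the toy vectors are assumptions for illustration:

```python
import cmath

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles (rotary position embedding)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)           # rotation frequency for this pair
        z = complex(vec[i], vec[i + 1])    # view the pair as one complex number
        z *= cmath.exp(1j * position * theta)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (3), different absolute positions
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(score_a, score_b)
```

The two scores agree up to floating-point error, because rotating the query by position m and the key by position n leaves only the angle difference (m - n) in their inner product.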

4. Hyperparameters of LLaMA at different model scales

<img src="./images/LLaMA-weight.png" style="zoom: 70%;" />  

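Finally, here is the LLaMA decoder layer as implemented in the HuggingFace Transformers library (an excerpt; `LlamaConfig`, `LlamaAttention`, `LlamaMLP`, and `LlamaRMSNorm` are defined in the same library):  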

```python
class LlamaDecoderLayer(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = LlamaAttention(config=config)
        self.mlp = LlamaMLP(
            hidden_size=self.hidden_size,
            intermediate_size=config.intermediate_size,
            hidden_act=config.hidden_act,
        )
        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False, 
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:

        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states
        outputs = (hidden_states,)
        
        if output_attentions:
            outputs += (self_attn_weights,)
        
        if use_cache:
            outputs += (present_key_value,)
        
        return outputs
```

#### Attention Mechanism Optimization

