<a href="https://colab.research.google.com/github/Lauorie/Coursera-Machine-Learning/blob/main/BabyGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![image.png](https://huggingface.co/Laurie/HC3-Chinese-AlpacaFormat-Lora/resolve/main/BABY.jpeg)

# Building the Data

In [None]:
!wget https://huggingface.co/Laurie/qlora-v1/resolve/main/santi2.txt

In [None]:
with open('/content/santi2.txt','r',encoding='gbk') as f:
    text = f.read()
print("Text length:",len(text),"\n")
print(text[:1000])

# Creating the Vocabulary

In [None]:
words = sorted(set(text))  # sorted so the character-to-index mapping is the same on every run
words_size = len(words)
print("Vocabulary:",words,"\n")
print("Vocabulary size: words_size =",words_size)

In [None]:
word2idx = {ch:i for i,ch in enumerate(words)}
print("Character to index:",word2idx)

idx2word = {i:ch for i,ch in enumerate(words)}
print("\nIndex to character:",idx2word)

encode = lambda x : [word2idx[i] for i in x]  # encode a string into a list of indices
decode = lambda x : [idx2word[i] for i in x]  # decode a list of indices into a list of characters

In [None]:
input_test = "面壁者罗辑"

print(encode(input_test))
print("\n")
print(decode(encode(input_test)))

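# Preparing the Data for the Transformer

1. The input must first be converted to tensor format.

2. The data is split into a training set and a validation set.

3. Every input sequence has the same length, set by the sequence_length parameter.

4. Inputs are fed in batches, several sequences at a time, set by the batch_size parameter.

5. Inputs pass through an embedding layer, which maps each token to a vector whose dimension is set by the d_model parameter.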
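
The five steps above can be sketched end to end on toy data to see how the tensor shapes change. All sizes and the names `toy_data`, `vocab_size`, and `embed` below are hypothetical values chosen for illustration; they are not the settings used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, for illustration only.
vocab_size = 20   # number of distinct characters
seq_len = 5       # sequence_length: tokens per sample
batch_size = 3    # batch_size: samples per batch
d_model = 8       # d_model: embedding dimension

# Step 1: the encoded text is a 1-D tensor of token ids.
toy_data = torch.randint(0, vocab_size, (100,))

# Step 2: 80/20 split into training and validation data.
split = int(0.8 * len(toy_data))
toy_train, toy_val = toy_data[:split], toy_data[split:]

# Steps 3 and 4: cut fixed-length windows and stack them into one batch.
starts = torch.randint(0, len(toy_train) - seq_len, (batch_size,))
batch = torch.stack([toy_train[s:s + seq_len] for s in starts])

# Step 5: look up an embedding vector for every token id.
embed = nn.Embedding(vocab_size, d_model)
embedded = embed(batch)

print(batch.shape)     # torch.Size([3, 5])    -> (batch_size, seq_len)
print(embedded.shape)  # torch.Size([3, 5, 8]) -> (batch_size, seq_len, d_model)
```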

## Converting to Tensor Format

In [None]:
import torch
data = torch.tensor(encode(text))  # encode the text as integers, then wrap them in a tensor

print(data.shape, data.dtype)      # print the shape and dtype; at this point the data is a 1-D tensor
print("\n")
print(data[:100]) # look at the first 100 entries

## Splitting the Dataset

In [None]:
Split = int(0.8 * len(data)) # use an 8:2 split between training and validation data

train_data = data[:Split]
val_data = data[Split:]

print("Training set size:", len(train_data))
print("\nValidation set size:", len(val_data))

# Setting the Sequence Length sequence_length

In [None]:
seq_len = 10

x = train_data[:seq_len]           # features
y = train_data[1:seq_len + 1]      # targets (features shifted by one position)
print('features:', x)
print('\ntargets:', y)

In [None]:
for i in range(seq_len):
    context = x[:i+1]
    target = y[i]
    print(f"input is {context.tolist()}, target is {target.tolist()}")

In [None]:
for i in range(seq_len):
    context = x[:i+1]
    target = int(y[i])
    print(f'input is   {"".join(decode(context.tolist()))}      target character is {"".join(decode([target]))}')

## Building Batches

In [None]:
torch.manual_seed(42)
batch_size = 4
seq_len = 10

def mini_batch(data):
    idx = torch.randint(0, len(data) - seq_len, (batch_size,))    # draw batch_size random start indices in [0, len(data) - seq_len)
    inputs = torch.stack([data[i:i+seq_len] for i in idx])      # for each start index, take seq_len tokens and stack them into a batch
    targets = torch.stack([data[i+1:i+seq_len+1] for i in idx])   # the same windows shifted right by one token: the next-token targets
    return inputs, targets

inputs, targets = mini_batch(train_data)
print("Input shape:", inputs.size(), "\n", "Inputs:",inputs)
print("\n")
print("Target shape:", targets.size(), "\n", "Targets:",targets)
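
To make the relationship between inputs and targets concrete, the sketch below repeats the sampling logic of `mini_batch` on made-up data. The helper `toy_mini_batch` and the `torch.arange` data are assumptions for illustration; with consecutive integers as data it is easy to see that each target row is its input row shifted by one position.

```python
import torch

def toy_mini_batch(data, batch_size, seq_len):
    # Same sampling logic as mini_batch above, with the sizes passed explicitly.
    idx = torch.randint(0, len(data) - seq_len, (batch_size,))
    inputs = torch.stack([data[i:i + seq_len] for i in idx])
    targets = torch.stack([data[i + 1:i + seq_len + 1] for i in idx])
    return inputs, targets

torch.manual_seed(0)
toy_data = torch.arange(50)   # hypothetical data: 0, 1, 2, ..., 49
xb, yb = toy_mini_batch(toy_data, batch_size=4, seq_len=10)

# Each target row is its input row shifted one position to the left.
print(torch.equal(xb[:, 1:], yb[:, :-1]))  # True
# On this consecutive-integer data, every target is simply input + 1.
print(torch.equal(yb, xb + 1))             # True
```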

# Embedding

In [None]:
import torch.nn as nn
torch.manual_seed(42)
embedding_table = nn.Embedding(4, 4) # toy example: a vocabulary of 4 tokens (words_size), each with a 4-dimensional embedding
embedding_table.weight

In [None]:
idx_input = torch.LongTensor(4,2).random_(0,4) # input with batch_size=4, seq_len=2; random_(0,4) fills it with random integers in [0, 4), i.e. valid indices for words_size=4

print('Shape of idx_input:',idx_input.size())

print('\nidx_input to look up:\n', idx_input)

In [None]:
target_emd = embedding_table(idx_input)  # look up the embedding vectors
print("Looked-up embedding vectors:", target_emd, "\n\nShape of the embeddings:", target_emd.shape)

## Why Use an Embedding?

Embedding video: https://youtu.be/W_ZUUDJsUtA

![image](https://huggingface.co/Laurie/HC3-Chinese-AlpacaFormat-Lora/resolve/main/embedding.png)
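
One way to see what the embedding layer does (this is an illustration, not necessarily the explanation given in the video above): looking up rows in `nn.Embedding` gives the same result as multiplying one-hot vectors by the weight matrix, only without building the large, mostly-zero one-hot tensor. The `ids` tensor below is a hypothetical input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size, emb_dim = 4, 4                # same toy sizes as the cells above
table = nn.Embedding(vocab_size, emb_dim)

ids = torch.tensor([[0, 3], [2, 2]])      # hypothetical (batch=2, seq_len=2) input

looked_up = table(ids)                                      # direct row lookup
one_hot = F.one_hot(ids, num_classes=vocab_size).float()    # (2, 2, 4)
multiplied = one_hot @ table.weight                         # (2, 2, 4) @ (4, 4) -> (2, 2, 4)

print(torch.allclose(looked_up, multiplied))  # True
```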



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)

class BabyGPT(nn.Module):
    def __init__(self, words_size):
        super().__init__()
        self.embedding = nn.Embedding(words_size, words_size) # the last words_size is the embedding dimension, which can be different from the words_size like 512, 1024, etc.

    def forward(self, idx, targets=None):
        # idx is a tensor of shape (batch_size, seq_len)
        # targets is a tensor of shape (batch_size, seq_len)
        logits = self.embedding(idx)  # (batch_size, seq_len, embedding_dim)
        if targets is None:
            loss = None  # no targets were given, so there is no loss to compute
        else:
            B, S, E = logits.shape
            logits = logits.view(B*S, E)     # (B*S, E)
            targets = targets.view(B*S)      # (B*S) reshape targets to match logits
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # generate new tokens autoregressively: (B, S) -> (B, S+1) -> (B, S+2) -> ... -> (B, S+max_new_tokens)
        for _ in range(max_new_tokens):
            # generate new tokens
            logits, loss = self.forward(idx)
            # only take the last token
            logits = logits[:, -1, :]  # (B, E) take the last token,see explanation below
            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1)  # (B, E) each row becomes a probability distribution that sums to 1, see explanation below
            # sample from the distribution (taking the argmax would be the greedy alternative)
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1) randomly sample one index per row according to probs
            # append to the sequence and continue
            idx = torch.cat((idx, idx_next), dim=-1) # (B, S+1) concatenate the new token to the sequence until max_new_tokens
        return idx

### Explanation:
1. logits = logits[:, -1, :]

2. probs = F.softmax(logits, dim=-1)

In [None]:
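1. logits = logits[:, -1, :]
Suppose we feed in a batch of sentences and the model returns a logits tensor of shape (batch_size=2, seq_len=5, hidden_size=128):

logits =
[[[1.1, 1.2, ..., 1.128],   # 我
  [2.1, 2.2, ..., 2.128],   # 爱
  [3.1, 3.2, ..., 3.128],   # 你
  [4.1, 4.2, ..., 4.128],   # 中
  [5.1, 5.2, ..., 5.128]],  # 国

 [[6.1, 6.2, ..., 6.128],     # I
  [7.1, 7.2, ..., 7.128],     # love
  [8.1, 8.2, ..., 8.128],     # my
  [9.1, 9.2, ..., 9.128],     # hometown
  [10.1, 10.2, ..., 10.128]]] # China

  * batch_size: the tensor contains 2 sentences
  * seq_len: the number of rows per sentence, i.e. the sentence length, which is 5
  * hidden_size: the size of the vector at each position, i.e. the embedding dimension (in the BabyGPT class above this equals words_size)

So this logits tensor holds 2 sentences of length 5, and each time step is represented by a 128-dimensional vector.

Now we take the vector of the last time step/token, that is:

  logits = logits[:, -1, :]

  [5.1, 5.2, ..., 5.128]    # 国
  [10.1, 10.2, ..., 10.128] # China

This picks out the vector at the last token of each sentence. In generate, this last-position vector is what is used to predict the next token. Its shape is (batch_size=2, hidden_size=128):
last_hidden = logits[:, -1, :]
# last_hidden has shape (2, 128)

2. probs = F.softmax(logits, dim=-1)
This applies softmax to each of the two vectors below:
  [5.1, 5.2, ..., 5.128]    # 国: after softmax, the values sum to 1
  [10.1, 10.2, ..., 10.128] # China: after softmax, the values sum to 1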
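
The two steps explained above can be checked on a small random tensor. This is a minimal sketch with hypothetical sizes (2 sequences, 5 positions, a vocabulary of 6); the random `logits` stand in for real model output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical logits with shape (batch_size=2, seq_len=5, vocab_size=6).
logits = torch.randn(2, 5, 6)

last = logits[:, -1, :]           # keep only the final position of each sequence
probs = F.softmax(last, dim=-1)   # turn each row into a probability distribution

print(last.shape)                 # torch.Size([2, 6])
print(probs.sum(dim=-1))          # each row sums to 1 (up to float rounding)

# Sample one next-token id per sequence, as generate does.
next_ids = torch.multinomial(probs, num_samples=1)
print(next_ids.shape)             # torch.Size([2, 1])
```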

# Building the Model

In [None]:
model = BabyGPT(words_size)
logits, loss = model(inputs, targets)
print(logits.shape, loss)

In [None]:
input_word = '罗'

input_word_encoded = encode(input_word)
input_word_encoded

In [None]:
idx = torch.tensor([input_word_encoded],dtype=torch.long)

idx_pred = model.generate(idx,max_new_tokens=10)
print("Predicted token indices:",idx_pred,"\nShape:",idx_pred.shape)

In [None]:
print("".join(decode(idx_pred[0].tolist()))) # decode the predicted indices back into text

[panda](https://img.aidotu.com/down/jpg/20201220/2db7babbc57b0b05b4c43ea6da0f13d6_5929_200_186.jpg)

# Creating the Optimizer


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2) # the optimizer updates the parameters to minimize the loss

# Training the Model

In [None]:
import matplotlib.pyplot as plt

losses = []  # List to store the loss values

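batch_size = 128
seq_len = 64  # mini_batch reads the global seq_len, so that is the name to set here
for step in range(1000):
    # train on the training set
    inputs, targets = mini_batch(train_data)
    logits, loss = model(inputs, targets)
    optimizer.zero_grad(set_to_none=True)  # reset gradients; set_to_none=True sets them to None instead of zero tensors, which saves memory
    loss.backward()  # backpropagate to compute gradients
    optimizer.step()  # update the parameters

    losses.append(loss.item())  # Store the loss value

    if step % 100 == 0:
        print('Step {} Loss {:.4f}'.format(step, loss.item()))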

# Plot the loss graph
plt.plot(range(len(losses)), losses)
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Loss Graph')
plt.show()


In [None]:
print("".join(decode(model.generate(torch.tensor([input_word_encoded],dtype=torch.long),max_new_tokens=10)[0].tolist())))

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 128
seq_len = 64
max_iters = 3000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

# data
torch.manual_seed(42)
# !wget https://huggingface.co/Laurie/qlora-v1/resolve/main/santi2.txt
with open('santi2.txt', 'r', encoding='gbk') as f:
    text = f.read()
words = sorted(list(set(text)))
words_size = len(words)
stoi = { ch:i for i,ch in enumerate(words) }
itos = { i:ch for i,ch in enumerate(words) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# train/val split
data = torch.tensor(encode(text), dtype=torch.long)
Split = int(.8*len(data))
train_data = data[:Split]
val_data = data[Split:]

# get a batch of data
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - seq_len, (batch_size,))
    x = torch.stack([data[i:i+seq_len] for i in ix])
    y = torch.stack([data[i+1:i+seq_len+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(seq_len, seq_len)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class BabyGPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(words_size, n_embd)
        self.position_embedding_table = nn.Embedding(seq_len, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, words_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,words_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last seq_len tokens
            idx_cond = idx[:, -seq_len:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BabyGPT()
m = model.to(device)
# print the number of parameters in the model
print(f"Model has {sum(p.numel() for p in m.parameters())/1e6:.2f}M parameters")

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

import matplotlib.pyplot as plt

train_losses = []
val_losses = []

for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        train_loss = losses['train']
        val_loss = losses['val']
        print(f"step {iter}: train loss {train_loss:.4f}, val loss {val_loss:.4f}")
        train_losses.append(train_loss)
        val_losses.append(val_loss)

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


# Plotting the loss curve
plt.plot(range(len(train_losses)), train_losses, label='Train Loss')
plt.plot(range(len(val_losses)), val_losses, label='Validation Loss')
plt.xlabel(f'Evaluation checkpoint (every {eval_interval} iterations)')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# torch.set_default_tensor_type(torch.cuda.FloatTensor)
input_test = "罗辑"   # change this to any prompt you like
input_word_encoded = encode(input_test)
context = torch.tensor([input_word_encoded],device=device,dtype=torch.long)
print(decode(m.generate(context, max_new_tokens=100)[0].tolist())) # change this number to set how many new tokens to generate
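
The causal mask used in `Head.forward` can be looked at in isolation. The sketch below assumes a hypothetical sequence of 4 positions whose raw attention scores are all equal (zeros), so the effect of the mask is easy to read: each position can only attend to itself and earlier positions, and softmax spreads the weight evenly over them.

```python
import torch
import torch.nn.functional as F

T = 4                                     # hypothetical sequence length
tril = torch.tril(torch.ones(T, T))       # lower-triangular mask, as registered in Head
scores = torch.zeros(T, T)                # made-up attention scores, all equal

# Future positions (above the diagonal) get -inf, so softmax gives them weight 0.
scores = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

# Row i holds 1/(i+1) in its first i+1 entries and 0 elsewhere.
print(weights)
```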

## Roughly How Many Copies of The Three-Body Problem II Does LLaMA2's Training Data Equal?

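Using publicly reported figures, here is a rough estimate of how many copies of The Three-Body Problem II: The Dark Forest it would take to match the amount of text LLaMA2 was trained on:

1. According to Meta's LLaMA2 paper, the models (released in 7B, 13B, and 70B parameter sizes) were pretrained on about 2 trillion tokens. Training data is measured in tokens; the parameter count describes the size of the model, not the size of its data.

2. The full text of The Three-Body Problem II: The Dark Forest is about 400,000 Chinese characters.

3. Assuming, very crudely, that one Chinese character corresponds to about one token, one copy of the book is about 400,000 tokens.

4. 2 trillion tokens divided by 400,000 tokens per copy is about **5 million copies of The Dark Forest**.

So, as a very rough estimate, the training data used for LLaMA2 is on the order of 5 million copies of The Three-Body Problem II: The Dark Forest.
This estimate may be off by a large margin, since the real number of tokens per character depends on the tokenizer. It still shows the scale involved: LLaMA2's training data is measured in trillions of tokens, and that scale is a large part of what gives the model its strong language understanding and generation abilities.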

# Gradient Descent
https://www.jiqizhixin.com/articles/2019-04-07-6

# Loss and Learning Rate
https://developers.google.com/machine-learning/crash-course/fitter/graph?hl=zh-cn

# Temperature & Top_p

In [None]:
import torch.nn as nn
import torch
import torch.nn.functional as F

torch.manual_seed(0)

test_tensor =  torch.tensor([-1.2345, -0.0431, -1.6047, -0.7521, -0.6866])
test_tensor_softmax = F.softmax(test_tensor, dim=0)
test_tensor_softmax

In [None]:
import matplotlib.pyplot as plt

data_test = test_tensor_softmax.tolist()

plt.bar(range(len(data_test)), data_test)

# add a value label on top of each bar
for x, y in enumerate(data_test):
    plt.text(x, y, str(round(y,3)), ha='center')

plt.xlabel('Index')
plt.ylabel('Value')
plt.title('vanilla softmax')
plt.show()

In [None]:
test_tensor1 =  torch.tensor([-.12345, -.00431, -.16047, -.07521, -.06866])
test_tensor_softmax1 = F.softmax(test_tensor1, dim=0)
test_tensor_softmax1

In [None]:
import matplotlib.pyplot as plt

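data_test1 = test_tensor_softmax1.tolist()  # separate name so the training tensor `data` is not overwritten

plt.bar(range(len(data_test1)), data_test1)

# add a value label on top of each bar
for x, y in enumerate(data_test1):
    plt.text(x, y, str(round(y,3)), ha='center')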

plt.xlabel('Index')
plt.ylabel('Value')
plt.title('softmax with logits divided by 10 (temperature = 10)')
plt.show()
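
The two plots above compare softmax on the original logits with softmax on the same logits divided by 10, which is what a temperature of 10 does. The sketch below puts temperature and top-p (nucleus) sampling into one function. `sample_top_p` is a hypothetical helper written for this notebook, not a PyTorch API, and the rule it uses (keep a token while the probability mass of higher-ranked tokens is still below top_p) is one common way to define the nucleus.

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits, temperature=1.0, top_p=0.9):
    """Hypothetical helper: temperature scaling followed by top-p sampling (1-D logits)."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass of the tokens ranked before it is still below top_p.
    keep = (cumulative - sorted_probs) < top_p
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()   # renormalize over the kept tokens
    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_idx[choice].item(), filtered, sorted_idx

torch.manual_seed(0)
logits = torch.tensor([-1.2345, -0.0431, -1.6047, -0.7521, -0.6866])

# Lower temperature sharpens the distribution, higher temperature flattens it.
for t in (0.5, 1.0, 10.0):
    print(t, F.softmax(logits / t, dim=-1))

# With top_p=0.5 only the most likely tokens that cover 50% of the mass survive.
token, kept_probs, order = sample_top_p(logits, temperature=1.0, top_p=0.5)
print(order.tolist(), kept_probs.tolist(), token)
```

With temperature 1.0 and top_p 0.5, only the two most likely tokens (indices 1 and 4) remain candidates; raising top_p toward 1.0 lets more of the tail back in.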

# Label Smoothing
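The main steps are:

1. For each sample, create a soft-label distribution initialized to all zeros.

2. Keep a probability of 1-ε for the correct class.

3. Split ε evenly among the other classes.

4. Use the resulting soft-label distribution as the new training target.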

In [None]:
# original data
y_true = [1, 0, 0, 1, 1, 2] # true labels as class indices (not one-hot vectors)
num_classes = 3 # number of classes

# Label Smoothing
smoothed_labels = []
smoothing_epsilon = 0.1

for i in range(len(y_true)):
  smoothed_label = [0] * num_classes

  for j in range(num_classes):
    if j == y_true[i]:
      smoothed_label[j] = 1 - smoothing_epsilon
    else:
      smoothed_label[j] = smoothing_epsilon / (num_classes - 1)

  smoothed_labels.append(smoothed_label)

# train the model using smoothed_labels as the targets
# model.train(X, smoothed_labels)

In [None]:
smoothed_labels
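
The loop above can also be written in vectorized form with PyTorch. This is a minimal sketch that follows the same scheme (1-ε on the true class, ε split evenly over the other classes); the random `logits` are made up purely to show how a loss against soft targets could be computed.

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([1, 0, 0, 1, 1, 2])   # class indices, as in the cell above
num_classes = 3
eps = 0.1

one_hot = F.one_hot(y_true, num_classes).float()
# 1 - eps on the true class, eps / (K - 1) on every other class.
smoothed = one_hot * (1 - eps) + (1 - one_hot) * eps / (num_classes - 1)
print(smoothed)

# Cross-entropy against the soft targets, using hypothetical logits.
torch.manual_seed(0)
logits = torch.randn(len(y_true), num_classes)
loss = -(smoothed * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(loss)
```

Note that the built-in `label_smoothing` argument of `F.cross_entropy` uses a slightly different convention: it spreads ε uniformly over all classes, including the correct one, so its soft targets will not match the values above exactly.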