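## Sequence Models

### Sequence Data
* In practice, much data has a temporal structure
* Movie ratings change over time
    * Ratings rise after a film wins an award, until the award is forgotten
    * After watching many good movies, viewers' expectations rise
    * Seasonality: New Year releases, summer blockbusters
    * Negative press about a director or actor lowers ratings
    * ...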


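### Statistical Tools
* Observing $x_t$ at time $t$ gives $T$ random variables that are not independent
$$
(x_1, ..., x_T) \sim p(x)
$$
* Expand the joint distribution using conditional probability
$$
p(a, b) = p(a)p(b|a) = p(b)p(a|b)
$$
Applying the same expansion to the whole sequence:
$$
p(x) = p(x_1)p(x_2 | x_1)p(x_3|x_1, x_2)...p(x_T|x_1, ..., x_{T-1})
$$
The expansion can also be written in reverse order (conditioning each value on later ones), but that is not necessarily physically meaningful, since later events usually do not cause earlier ones.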
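
As a quick sanity check of the expansion above, the following sketch builds a small joint distribution over three binary variables and confirms that the product of conditionals reproduces the joint. The weights are arbitrary made-up numbers, and the helper names `marginal` and `chain_rule` are illustrative only.

```python
from itertools import product

# Hypothetical joint distribution over (x1, x2, x3); the weights are arbitrary
# and are normalized so that the probabilities sum to 1.
weights = {
    (0, 0, 0): 4, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 3,
    (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 3, (1, 1, 1): 4,
}
total = sum(weights.values())
joint = {xs: w / total for xs, w in weights.items()}


def marginal(prefix):
    """p(x_1, ..., x_k) for a prefix, obtained by summing out the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)


def chain_rule(xs):
    """p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2), built one conditional at a time."""
    prob = 1.0
    for t in range(1, len(xs) + 1):
        # p(x_t | x_1..x_{t-1}) = p(x_1..x_t) / p(x_1..x_{t-1})
        prob *= marginal(xs[:t]) / marginal(xs[:t - 1])
    return prob


max_gap = max(abs(chain_rule(xs) - joint[xs]) for xs in product((0, 1), repeat=3))
print(f'largest gap between joint and chain-rule product: {max_gap:.2e}')
```

The gap should be at the level of floating-point rounding, because the intermediate marginals cancel in the product.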
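
### Sequence Models
$$
p(x) = p(x_1)p(x_2 | x_1)p(x_3|x_1, x_2)...p(x_T|x_1, ..., x_{T-1})
$$
* Model the conditional probability
$$
p(x_t| x_1, x_2, ..., x_{t - 1}) = p(x_t | f(x_1, ..., x_{t - 1}))
$$
Because each value is modeled from previously observed values of the same sequence, this is called an `autoregressive model`.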

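#### Approach A: Markov assumption
* Assume the current value depends only on the $\tau$ most recent data points
$$
p(x_t | x_1, ..., x_{t - 1}) = p(x_t | x_{t - \tau}, ..., x_{t - 1}) = p(x_t| f(x_{t - \tau}, ..., x_{t - 1}))
$$
For example, train an MLP on windows of the past $\tau$ values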
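
To make the windowing concrete without a deep learning library, here is a minimal sketch that swaps the MLP for a linear model with $\tau = 2$, fit by ordinary least squares. The noise-free sine signal and the frequency `omega` are assumptions chosen so the answer can be checked by hand: a pure sinusoid satisfies $x_t = 2\cos(\omega) x_{t-1} - x_{t-2}$ exactly.

```python
import math

tau = 2        # window length under the Markov assumption
omega = 0.1    # hypothetical angular frequency of the toy signal
x = [math.sin(omega * t) for t in range(1, 201)]

# Each training pair is ((x_{t-2}, x_{t-1}), x_t).
pairs = [((x[t - 2], x[t - 1]), x[t]) for t in range(tau, len(x))]

# Normal equations for x_t ~ w_old * x_{t-2} + w_new * x_{t-1}.
s_oo = sum(a * a for (a, b), y in pairs)
s_on = sum(a * b for (a, b), y in pairs)
s_nn = sum(b * b for (a, b), y in pairs)
s_oy = sum(a * y for (a, b), y in pairs)
s_ny = sum(b * y for (a, b), y in pairs)

# Solve the 2x2 system with Cramer's rule.
det = s_oo * s_nn - s_on ** 2
w_old = (s_oy * s_nn - s_on * s_ny) / det
w_new = (s_oo * s_ny - s_on * s_oy) / det

print(f'w_new = {w_new:.6f} (2*cos(omega) = {2 * math.cos(omega):.6f})')
print(f'w_old = {w_old:.6f}')
```

With noisy data the recovered coefficients would only approximate these values, and a nonlinear $f$ such as an MLP becomes useful when the dependence on the window is not linear.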

#### Latent variables
* Introduce a latent variable $h_t$ to summarize past information: $h_t = f(x_1, ..., x_{t - 1})$
    * Then $x_t \sim p(x_t | h_t)$
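
One common way to keep $h_t$ cheap to compute, assumed here rather than stated in the notes, is to update it recursively from $h_{t-1}$ and the newest observation instead of re-reading the whole history. The sketch below uses an exponential moving average as a stand-in state; `update`, `predict`, and `alpha` are hypothetical names and values.

```python
def update(h, x_prev, alpha=0.5):
    """Hypothetical state update h_t = g(h_{t-1}, x_{t-1}): an exponential moving average."""
    return alpha * h + (1 - alpha) * x_prev


def predict(h):
    """Hypothetical emission: use the state itself as the point prediction for x_t."""
    return h


xs = [1.0, 2.0, 3.0, 4.0]
h = 0.0          # initial state before any data is seen
preds = []
for x in xs:
    preds.append(predict(h))   # predict x_t from h_t, before observing x_t
    h = update(h, x)           # fold x_t into the state to obtain h_{t+1}

print(preds)
```

The state has a fixed size no matter how long the sequence grows, which is the point of summarizing the history in $h_t$.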

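### Summary
* In sequence models, the current value depends on previously observed values
* Autoregressive models use a sequence's own past values to predict its future
* Markov models assume the current value depends only on a few recent values, which simplifies the model
* Latent variable models use a latent variable to summarize the history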


### Code

In [None]:
# Sequence model
# Generate sequence data from a sine function plus additive noise, for time steps 1, 2, ..., 1000
import torch
from torch import nn
from d2l import torch as d2l

T = 1000
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))

In [None]:
# Turn the sequence into pairs: label y_t = x_t, features [x_{t - tau}, ..., x_{t - 1}]

tau = 4 # the model assumes x at each time step depends only on the previous 4 time steps
features = torch.zeros((T - tau, tau))

for i in range(tau):
    features[:, i] = x[i: T - tau + i]

labels = x[tau: ].reshape((-1, 1))

batch_size, n_train = 16, 600

train_iter = d2l.load_array(
    (features[: n_train], labels[: n_train]),
    batch_size, is_train=True
)

In [None]:
# Use a simple architecture: an MLP with two fully connected layers

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

def get_net():
    net = nn.Sequential(
        nn.Linear(4, 10), nn.ReLU(),
        nn.Linear(10, 1)
    )
    net.apply(init_weights)
    return net

loss = nn.MSELoss()


In [None]:
# Train the model

def train(net, train_iter, loss, epochs, lr):
    trainer = torch.optim.Adam(net.parameters(), lr)
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        print(
            f'epoch {epoch + 1}, ',
            f'loss: {d2l.evaluate_loss(net, train_iter, loss)}'
        )

net = get_net()
train(net, train_iter, loss, 5, 0.01)


In [None]:
onestep_preds = net(features)
d2l.plot(
    [time, time[tau:]],
    [x.detach().numpy(), onestep_preds.detach().numpy()],
    'time', 'x',
    legend=['data', '1-step preds'], xlim=[1, 1000], figsize=(6, 3)
)

In [None]:
# Multi-step prediction: past the training range, feed the model's own predictions back in as inputs

multistep_preds = torch.zeros(T)
multistep_preds[: n_train + tau] = x[: n_train + tau]
for i in range(n_train + tau, T):
    multistep_preds[i] = net(multistep_preds[i - tau: i].reshape((1, -1)))

d2l.plot(
    [time, time[tau: ], time[n_train + tau: ]],
    [
        x.detach().numpy(),
        onestep_preds.detach().numpy(),
        multistep_preds[n_train + tau: ].detach().numpy()
    ],
    'time', 'x', legend=['data', '1-step preds', 'multistep preds'],
    xlim=[1, 1000], figsize=(6, 3)
)

In [None]:
# Take a closer look at k-step-ahead predictions

max_steps = 64

features = torch.zeros((T - tau - max_steps + 1, tau + max_steps))

for i in range(tau):
    features[:, i] = x[i: i + T - tau - max_steps + 1]

for i in range(tau, tau + max_steps):
    features[:, i] = net(features[:, i - tau: i]).reshape(-1)

steps = (1, 4, 16, 64)

d2l.plot(
    [time[tau + i - 1: T - max_steps + i] for i in steps],
    [features[:, (tau + i - 1)].detach().numpy() for i in steps],
    'time', 'x', legend=[f'{i}-step preds' for i in steps],
    xlim=[5, 1000], figsize=(6, 3)
)
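
One reason multi-step predictions tend to drift is that each predicted value is reused as an input, so small model errors compound. The following sketch illustrates this on a toy problem rather than on the network above: the "model" is the exact recurrence of a sinusoid with a deliberately perturbed coefficient. The values of `omega` and the perturbation `0.005` are assumptions picked for illustration.

```python
import math

omega = 0.1                      # hypothetical frequency of the toy signal
c_true = 2 * math.cos(omega)     # exact recurrence: x_t = c_true * x_{t-1} - x_{t-2}
c_model = c_true + 0.005         # hypothetical, slightly wrong learned coefficient

x = [math.sin(omega * t) for t in range(66)]

# One-step-ahead: every prediction starts from true observations.
onestep_err = [abs(c_model * x[t - 1] - x[t - 2] - x[t]) for t in range(2, len(x))]

# Multi-step: only x[0] and x[1] are observed; later inputs are the model's own outputs.
rollout = x[:2]
for t in range(2, len(x)):
    rollout.append(c_model * rollout[t - 1] - rollout[t - 2])
multistep_err = [abs(rollout[t] - x[t]) for t in range(2, len(x))]

print(f'max 1-step error:     {max(onestep_err):.4f}')
print(f'max multi-step error: {max(multistep_err):.4f}')
```

The one-step error stays bounded by the size of the coefficient error, while the rollout error keeps growing with the horizon.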

## Text Preprocessing

### Code Implementation


In [None]:
import collections
import re # regular expression module
from d2l import torch as d2l


# Read the dataset into a list of text lines

def read_time_machine():
    '''Load the time machine dataset into a list of text lines'''
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Real applications would not use such crude preprocessing
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()

print(f'text lines: {len(lines)}')
print(lines[0])
print(lines[10])


In [None]:
# Split each text line into a list of tokens

def tokenize(lines, token='word'):
    '''Split text lines into word or character tokens'''
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('Error: unknown token type: ' + token)

tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])


# Build a dictionary, usually called the vocabulary, that maps string tokens to numeric indices starting from 0

class Vocab:
    '''Vocabulary for text'''
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        counter = count_corpus(tokens)
        # Sort tokens by frequency, most frequent first
        self.token_freqs = sorted(
            counter.items(), key=lambda x: x[1], reverse=True
        )
        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_tokens
        # <unk> is a common convention that stands for "unknown"

        uniq_tokens += [
            token for token, freq in self.token_freqs
            if freq >= min_freq and token not in uniq_tokens
        ]
        self.idx_to_token, self.token_to_idx = [], dict()
        for token in uniq_tokens:
            self.idx_to_token.append(token)
            # The token was just appended, so its index is the last position
            self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_token(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(tokens):
    '''Count token frequencies'''
    # Flatten a list of token lists into a single list of tokens
    if len(tokens) == 0 or isinstance(tokens[0], list):
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)

In [None]:
# Build the vocabulary
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[: 10])

# Convert each text line into a list of numeric indices
for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])

# Package everything into the load_corpus_time_machine function
def load_corpus_time_machine(max_tokens=-1):
    '''Return the token index list and vocabulary of the time machine dataset'''
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # corpus: a collection of written or spoken material stored on a computer and used to find out how language is used
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[: max_tokens]
    return corpus, vocab

corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)
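
The same pipeline can be tried without downloading the dataset. The sketch below mirrors the clean, tokenize, and index steps on two made-up lines (they are not taken from the time machine text) using only the standard library; `encode` and `decode` are hypothetical helper names.

```python
import collections
import re

# Two made-up lines standing in for the dataset.
raw_lines = ['The Time Traveller smiled.', 'The time machine, the time!']

# Same crude cleaning as above: keep letters only, lowercase everything.
lines = [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in raw_lines]
tokens = [line.split() for line in lines]

# Count tokens and assign indices by descending frequency; index 0 is reserved for <unk>.
counter = collections.Counter(tok for line in tokens for tok in line)
by_freq = sorted(counter.items(), key=lambda kv: kv[1], reverse=True)
idx_to_token = ['<unk>'] + [tok for tok, _ in by_freq]
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}


def encode(words):
    """Map tokens to indices; unseen tokens fall back to the <unk> index 0."""
    return [token_to_idx.get(word, 0) for word in words]


def decode(indices):
    """Map indices back to tokens."""
    return [idx_to_token[i] for i in indices]


print(tokens[1])
print(encode(tokens[1]))
print(encode(['the', 'spaceship']))
```

Decoding an encoded line returns the original tokens as long as every token is in the vocabulary; unknown tokens come back as `<unk>`.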


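## Language Models
* Given a text sequence $x_1, ..., x_T$, the goal of a language model is to estimate the joint probability $p(x_1, ..., x_T)$
* Its applications include:
    * Pretraining models (BERT, GPT-3, etc.)
    * Text generation: given the first few words, repeatedly sample $x_t \sim p(x_t|x_1, ..., x_{t -1})$ to produce the following text
    * Deciding which of several sequences is more likely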
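
The generation loop in the second application can be sketched as follows. The table `next_token_probs` is entirely hypothetical and, for brevity, conditions only on the previous token; a real language model would condition on the whole prefix. The `<eos>` end marker is also an assumption of this sketch.

```python
import random

# Hypothetical next-token distributions p(x_t | x_{t-1}).
next_token_probs = {
    'the': {'time': 0.6, 'machine': 0.4},
    'time': {'machine': 0.7, 'traveller': 0.3},
    'machine': {'<eos>': 1.0},
    'traveller': {'<eos>': 1.0},
}


def generate(prefix, max_len=10, seed=0):
    """Extend the prefix by sampling one token at a time until <eos> or max_len."""
    rng = random.Random(seed)   # seeded so that runs are repeatable
    out = list(prefix)
    while len(out) < max_len and out[-1] != '<eos>':
        dist = next_token_probs[out[-1]]
        candidates, probs = zip(*dist.items())
        out.append(rng.choices(candidates, weights=probs)[0])
    return out


sample = generate(['the'])
print(' '.join(sample))
```

Changing `seed` gives different continuations, each drawn in proportion to the probabilities in the table.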

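### Modeling with counts
* Assume the sequence has length 2; we estimate
$$
p(x, x') = p(x)p(x'|x) = \frac{n(x)}{n} \frac{n(x, x')}{n(x)}
$$
Here $n$ is the total number of words, and $n(x)$ and $n(x, x')$ are the numbers of occurrences of a single word and of a pair of consecutive words
* This extends easily to length 3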
$$
p(x, x', x'') = p(x)p(x'|x)p(x''|x, x') = \frac{n(x)}{n} \frac{n(x, x')}{n(x)} \frac{n(x, x', x'')}{n(x, x')}
$$
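
The length-2 estimate can be computed directly from counts. The sketch below uses a made-up seven-word corpus; `p_pair` is a hypothetical helper name, and returning 0 for an unseen first word is a choice made here, not part of the formula.

```python
from collections import Counter

# Made-up toy corpus for illustration.
corpus = 'the time machine and the time traveller'.split()
n = len(corpus)                                   # total number of words
unigram = Counter(corpus)                         # n(x)
bigram = Counter(zip(corpus[:-1], corpus[1:]))    # n(x, x')


def p_pair(x, x_next):
    """Estimate p(x, x') = n(x)/n * n(x, x')/n(x) from counts."""
    if unigram[x] == 0:
        return 0.0   # x never occurs, so the conditional is undefined; treat as 0
    p_x = unigram[x] / n
    p_next_given_x = bigram[(x, x_next)] / unigram[x]
    return p_x * p_next_given_x


print(p_pair('the', 'time'))       # (2/7) * (2/2)
print(p_pair('time', 'machine'))   # (2/7) * (1/2)
```

Note that $n(x)$ cancels, so the estimate reduces to $n(x, x')/n$; the factored form is kept because it extends to longer sequences.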

### N-grams
* When the sequence is long, the corpus is usually not large enough, so it is likely that $n(x_1, ..., x_T) \le 1$
* Use the Markov assumption to mitigate this problem:

* Unigram
$$
p(x_1, x_2, x_3, x_4) = p(x_1)p(x_2)p(x_3)p(x_4) = \frac{n(x_1)}{n} \frac{n(x_2)}{n} \frac{n(x_3)}{n} \frac{n(x_4)}{n}
$$
* Bigram
$$
p(x_1, x_2, x_3, x_4) = p(x_1)p(x_2 | x_1)p(x_3|x_2)p(x_4|x_3) = \frac{n(x_1)}{n} \frac{n(x_1, x_2)}{n(x_1)} \frac{n(x_2, x_3)}{n(x_2)} \frac{n(x_3, x_4)}{n(x_3)}
$$
* Trigram
$$
p(x_1, x_2, x_3, x_4) = p(x_1)p(x_2 | x_1)p(x_3 | x_1, x_2)p(x_4 | x_2, x_3)
$$
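
The three factorizations differ only in how much context each token is conditioned on, so a single function can cover them. The sketch below scores a sequence under an n-gram model estimated from counts on a made-up corpus; `build_counts` and `sequence_prob` are hypothetical names, and no smoothing is applied, so any unseen n-gram gives probability 0.

```python
from collections import Counter

corpus = 'the time machine and the time traveller'.split()   # made-up toy corpus


def build_counts(corpus, max_order):
    """Counts of all k-grams in the corpus for k = 1..max_order."""
    return {
        k: Counter(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))
        for k in range(1, max_order + 1)
    }


def sequence_prob(seq, corpus, n):
    """p(seq) under an n-gram model: each token sees at most n - 1 predecessors."""
    counts = build_counts(corpus, n)
    prob = 1.0
    for t, token in enumerate(seq):
        context = tuple(seq[max(0, t - (n - 1)):t])
        numerator = counts[len(context) + 1][context + (token,)]
        # With an empty context the denominator is the total word count.
        denominator = counts[len(context)][context] if context else len(corpus)
        if denominator == 0:
            return 0.0
        prob *= numerator / denominator
    return prob


seq = ['the', 'time', 'machine']
for n in (1, 2, 3):
    print(n, sequence_prob(seq, corpus, n))
```

On this corpus the unigram model gives $(2/7)(2/7)(1/7)$, while the bigram and trigram models both give $(2/7)(2/2)(1/2) = 1/7$. Comparing such scores is one way to decide which of several sequences is more likely.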

### Code: language models and datasets

In [None]:
import random
import torch
from d2l import torch as d2l

tokens = d2l.tokenize(d2l.read_time_machine())
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)
vocab.token_freqs[: 10]


# The most frequent words are often called `stop words`; plot the word frequencies
freqs = [freq for token, freq in vocab.token_freqs]
d2l.plot(
    freqs, xlabel='token: x', ylabel='frequency: n(x)',
    yscale='log'
)

# Other token combinations, such as bigrams and trigrams
bigram_tokens = [
    pair for pair in zip(corpus[: -1], corpus[1:])
]
bigram_vocab = d2l.Vocab(bigram_tokens)
print(bigram_vocab.token_freqs[: 10])

trigram_tokens = [
    triple for triple in 
    zip(corpus[: -2], corpus[1: -1], corpus[2:])
]
trigram_vocab = d2l.Vocab(trigram_tokens)
print(trigram_vocab.token_freqs[: 10])

# Compare the token frequencies of the three models side by side
bigram_freqs = [freq for token, freq in bigram_vocab.to]