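# Language Models and Datasets
A natural-language text can be viewed as a discrete time series. Given a sequence of $T$ words $w_1, w_2, \dots, w_T$, the goal of a language model is to assess how plausible the sequence is, that is, to compute its probability: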
$$P(w_1, w_2, \dots, w_T)$$

## Language Models
Assuming the words in a sequence are generated one after another, we have

$$
\begin{aligned}
P(w_1, w_2, \dots, w_T) &= \prod_{t=1}^T P(w_t \mid w_1, \dots, w_{t-1}) \\
  &= P(w_1) P(w_2 \mid w_1) \cdots P(w_T \mid w_1, w_2, \dots, w_{T-1})
\end{aligned}
$$
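
As a worked instance of this factorization, a sequence of four words expands as follows. This is a direct application of the chain rule and makes no further assumptions:

$$P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_1, w_2, w_3)$$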

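These probabilities can be estimated from relative frequencies in a training corpus. For example, the probability of the first word is estimated as

$$\hat P(w_1) = \frac{n(w_1)}{n}$$

where $n(w_1)$ is the number of texts whose first word is $w_1$ and $n$ is the total number of texts. Similarly, with $n(w_1, w_2)$ denoting the number of texts that begin with $w_1$ followed by $w_2$:

$$\hat P(w_2 \mid w_1) = \frac{n(w_1, w_2)}{n(w_1)}$$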
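
The count-based estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus `texts` and the helper names `p_first` and `p_second_given_first` are hypothetical and do not come from the original notebook.

```python
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized text.
texts = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["i", "drink", "tea"],
    ["you", "like", "tea"],
]

n = len(texts)                                   # total number of texts
first_counts = Counter(t[0] for t in texts)      # n(w_1): texts starting with w_1
pair_counts = Counter((t[0], t[1]) for t in texts if len(t) > 1)  # n(w_1, w_2)


def p_first(w1):
    """Estimate P(w1) as the fraction of texts that begin with w1."""
    return first_counts[w1] / n


def p_second_given_first(w2, w1):
    """Estimate P(w2 | w1) as n(w1, w2) / n(w1); 0.0 if w1 never starts a text."""
    if first_counts[w1] == 0:
        return 0.0
    return pair_counts[(w1, w2)] / first_counts[w1]


print(p_first("i"))                       # 3 of the 4 texts start with "i"
print(p_second_given_first("like", "i"))  # 2 of those 3 continue with "like"
```

Any word pair that never occurs in the corpus receives an estimate of exactly zero, which is one reason longer contexts become hard to estimate.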

### n-grams
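The Markov assumption states that the occurrence of a word depends only on the preceding $n$ words, which gives a Markov chain of order $n$. If $n = 1$, then $P(w_3 \mid w_1, w_2) = P(w_3 \mid w_2)$.
Based on a Markov chain of order $n - 1$, the language model becomes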

$$P(w_1, w_2, \dots, w_T) = \prod_{t=1}^T P(w_t \mid w_{t-(n-1)}, \dots, w_{t-1})$$
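For $n = 1, 2, 3$ these are called the unigram, bigram, and trigram models, respectively. For example, the probability of a length-4 sequence $w_1, w_2, w_3, w_4$ under each of them is
$$P(w_1, w_2, w_3, w_4) = P(w_1)P(w_2)P(w_3)P(w_4)$$
$$P(w_1, w_2, w_3, w_4) = P(w_1)P(w_2 \mid w_1)P(w_3 \mid w_2)P(w_4 \mid w_3)$$
$$P(w_1, w_2, w_3, w_4) = P(w_1)P(w_2 \mid w_1)P(w_3 \mid w_1, w_2)P(w_4 \mid w_2, w_3)$$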
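
To make the bigram case concrete, the sketch below estimates each factor from counts and multiplies them. It is a minimal illustration rather than part of the original notebook: it assumes counts are taken over one non-empty token stream, applies no smoothing, and uses a hypothetical helper name `bigram_sequence_prob` with a toy string as the corpus.

```python
from collections import Counter


def bigram_sequence_prob(tokens, sequence):
    """Probability of `sequence` under a bigram model with count-based estimates."""
    if not sequence:
        return 1.0
    unigram = Counter(tokens)                  # how often each token occurs
    context = Counter(tokens[:-1])             # how often each token has a successor
    bigram = Counter(zip(tokens, tokens[1:]))  # how often each adjacent pair occurs

    prob = unigram[sequence[0]] / len(tokens)  # P(w_1)
    for prev, cur in zip(sequence, sequence[1:]):
        if context[prev] == 0:                 # context never observed
            return 0.0
        prob *= bigram[(prev, cur)] / context[prev]  # P(cur | prev)
    return prob


tokens = list("abcabcabd")
print(bigram_sequence_prob(tokens, list("abc")))  # (3/9) * (3/3) * (2/3) = 2/9
print(bigram_sequence_prob(tokens, list("ad")))   # "ad" never occurs, so 0.0
```
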
Question: what are the potential drawbacks of n-grams?
- The parameter space is too large
- The data are sparse
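
Both drawbacks can be made concrete with a little arithmetic. The sketch below assumes a vocabulary of 1027 distinct characters and a corpus of 10000 characters (the figures reported for the lyrics corpus later in this section) and a full table with one entry per possible n-gram. The helper names `ngram_table_size` and `max_observed_fraction` are illustrative only.

```python
def ngram_table_size(vocab_size, n):
    """Number of distinct n-grams over the vocabulary (entries in a full table)."""
    return vocab_size ** n


def max_observed_fraction(corpus_len, vocab_size, n):
    """Upper bound on the fraction of possible n-grams a corpus can contain.

    A corpus of corpus_len tokens holds at most corpus_len - n + 1 n-grams.
    """
    windows = max(corpus_len - n + 1, 0)
    return min(windows / ngram_table_size(vocab_size, n), 1.0)


vocab_size = 1027   # assumption: vocabulary size of the lyrics corpus
corpus_len = 10000  # assumption: number of characters kept from the corpus

for n in (1, 2, 3):
    print(n, ngram_table_size(vocab_size, n),
          max_observed_fraction(corpus_len, vocab_size, n))
```

Under these assumptions the table grows from 1027 entries for unigrams to over a million for bigrams and over a billion for trigrams, while even in the best case fewer than 1% of all possible bigrams can appear in 10000 characters.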

## Language Model Dataset

### Reading the Dataset
Dataset file: `\data\05jaychou_lyrics.txt`

In [1]:
import os
import sys
BASE_DIR = os.path.dirname(os.getcwd())
sys.path.insert(0, os.path.join(BASE_DIR))
print(BASE_DIR)

In [2]:
with open(os.path.join(BASE_DIR,"data","05jaychou_lyrics.txt"), encoding='utf-8') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[: 40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[: 10000]

### Building the Character Index

In [3]:
idx_to_char = list(set(corpus_chars))  # deduplicate; a character's position in the list is its index
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # map each character to its index
vocab_size = len(char_to_idx)
print(vocab_size)

1027


In [4]:
corpus_indices = [char_to_idx[char] for char in corpus_chars]
sample = corpus_indices[:20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print("indices", sample)

chars: 想要有直升机 想要和你飞到宇宙去 想要和
indices [864, 189, 703, 389, 399, 274, 102, 864, 189, 184, 783, 746, 742, 220, 152, 342, 102, 864, 189, 184]
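
Because the iteration order of a `set` of strings can differ between Python runs, the indices printed above are not necessarily reproducible. The sketch below shows one possible alternative, sorting the vocabulary, together with an encode/decode round trip. It is not the notebook's method; the helper names `build_vocab`, `encode`, and `decode` and the sample string are hypothetical.

```python
def build_vocab(text):
    """Return (idx_to_char, char_to_idx) with a deterministic, sorted ordering."""
    idx_to_char = sorted(set(text))
    char_to_idx = {ch: i for i, ch in enumerate(idx_to_char)}
    return idx_to_char, char_to_idx


def encode(text, char_to_idx):
    """Map every character of text to its integer index."""
    return [char_to_idx[ch] for ch in text]


def decode(indices, idx_to_char):
    """Map a list of indices back to the original string."""
    return ''.join(idx_to_char[i] for i in indices)


sample_text = "hello world"  # hypothetical stand-in for corpus_chars
idx_to_char, char_to_idx = build_vocab(sample_text)
indices = encode(sample_text, char_to_idx)
print(idx_to_char)
print(indices)
print(decode(indices, idx_to_char))
```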


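### Sampling Sequential Data
Two strategies are used to draw minibatches from the sequence: random sampling and consecutive sampling. In both, the input `X` is a window of `num_steps` indices and the label `Y` is the same window shifted one position ahead.

#### Random Sampling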

In [5]:
import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because each label Y is the input X shifted one position ahead
    num_examples = (len(corpus_indices) - 1) // num_steps
    # Start offset of each non-overlapping example
    example_indices = [i * num_steps for i in range(num_examples)]
    random.shuffle(example_indices)

    def _data(i):
        # Return the window of num_steps indices starting at position i
        return corpus_indices[i: i + num_steps]

    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(0, num_examples, batch_size):
        # Take the next batch_size shuffled examples; the final batch may be smaller
        batch_indices = example_indices[i: i + batch_size]
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

In [9]:
my_seq = list(range(10))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X:  tensor([[0, 1, 2, 3, 4, 5]], device='cuda:0') 
Y: tensor([[1, 2, 3, 4, 5, 6]], device='cuda:0') 
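
Only one example appears in the output above because a sequence of 10 elements with `num_steps=6` leaves room for just one full window plus its shifted label. The following sketch reuses only the index arithmetic from `data_iter_random` and needs no PyTorch; `example_starts` is an illustrative helper name, not part of the notebook.

```python
def example_starts(seq_len, num_steps):
    """Start offsets of the non-overlapping windows used as examples.

    One element is held back (seq_len - 1) so that every window still has
    a label window shifted one position to the right.
    """
    num_examples = (seq_len - 1) // num_steps
    return [i * num_steps for i in range(num_examples)]


print(example_starts(10, 6))  # [0]
print(example_starts(30, 6))  # [0, 6, 12, 18]
```

With a longer sequence, such as 30 elements, the same settings would produce four examples, which are then shuffled and grouped into batches.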



#### Consecutive Sampling

In [10]:
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the sequence that is kept
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

In [11]:
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=2):
    print('X: ', X, '\nY:', Y, '\n')

X:  tensor([[0, 1],
        [5, 6]], device='cuda:0') 
Y: tensor([[1, 2],
        [6, 7]], device='cuda:0') 

X:  tensor([[2, 3],
        [7, 8]], device='cuda:0') 
Y: tensor([[3, 4],
        [8, 9]], device='cuda:0') 
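
The defining property of consecutive sampling is visible in the output above: each row of a batch continues exactly where the same row of the previous batch stopped. The sketch below mirrors the slicing of `data_iter_consecutive` with plain Python lists so the property can be checked without PyTorch; `consecutive_batches` is an illustrative name, not part of the notebook.

```python
def consecutive_batches(seq, batch_size, num_steps):
    """Pure-Python mirror of the slicing in data_iter_consecutive (no tensors)."""
    row_len = len(seq) // batch_size        # length of each of the batch_size rows
    rows = [seq[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    batch_num = (row_len - 1) // num_steps  # hold back one element for the labels
    for b in range(batch_num):
        start = b * num_steps
        X = [row[start:start + num_steps] for row in rows]
        Y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield X, Y


batches = list(consecutive_batches(list(range(10)), batch_size=2, num_steps=2))
for X, Y in batches:
    print(X, Y)

# Row r of the second batch starts right after row r of the first batch ended.
(prev_X, _), (next_X, _) = batches
for prev_row, next_row in zip(prev_X, next_X):
    assert next_row[0] == prev_row[-1] + 1
```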

