# 深度学习处理文本
---
* 为机器学习应用预处理文本数据
* 用于文本处理的词袋方法和序列模型方法
* `Transformer`架构
* 序列到序列学习

## 11.1 自然语言处理概述 `natural language processing, NLP`

## 11.2 准备文本数据
---
* 1、将本吧标准化。
* 2、将文本拆分为单元(词元`token`)。
* 3、将每个词元转换为一个数值向量。

### 11.2.1 文本标准化
---
* 简单的特征工程:将所有字母转换为小写并删除标点符号。
* 高级的标准化方法:词干提取(`stemming`):将一个词的变体转换为相同的表示。

### 11.2.2 文本拆分(词元化)
---
* **单词级词元化(`word-level tokenization`)**
* **N元语法词元化(`N-gram tokenization`)**
* **字符级词元化(`character-level tokenization`)**

### 11.2.3 建立词表索引

### 11.2.4 使用`TextVectorization`层

In [9]:
import string

class Vectorizer:

    def standardize(self, text):
        text = text.lower()
        
        return "".join(char for char in text if char not in string.punctuation)

    def tokenizer(self, text):
        text = self.standardize(text)

        return text.split()
    
    def make_vocabulary(self, dataset):
        self.vocabulary = {"":0, "[UNK]":1}

        for text in dataset:
            text   = self.standardize(text)
            tokens = self.tokenizer(text)

            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        
        self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())
    
    def encode(self, text):
        text   = self.standardize(text)
        tokens = self.tokenizer(text)
        
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        
        return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()

dataset = [
    'I write, erase, rewrite',
    'Erase again, and then',
    'A poppy blooms.',
]

vectorizer.make_vocabulary(dataset)

In [10]:
test_sentence    = 'I write, rewrite, and still rewrite again'
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [11]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


In [1]:
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(output_mode='int',)

# TextVectorization层 --> 文本标准化方法是:转换为小写字母并删除标点符号.词元化方法是:利用空格进行拆分.

In [2]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    # 将字符串转换为小写字母
    lowercase_string = tf.strings.lower(string_tensor)
    
    # 将标点符号替换为空字符串
    return tf.strings.regex_replace(lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    # 利用空格对字符串进行拆分
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

In [15]:
dataset = [
    'I write, erase, rewrite',
    'Erase again, and then',
    'A poppy blooms.',
]

text_vectorization.adapt(dataset)

#### [C] 11.1 显示词表

In [16]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [17]:
vocabulary       = text_vectorization.get_vocabulary()
test_sentence    = 'I write, rewrite, and still rewrite again'
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [None]:
inverse_vocab    = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

In [None]:
# TextVectorization 层有两种用法：
# 1、将其放在tf.data管道中
int_sequence_dataset = string_dataset.map(text_vectorization, num_parallel_calls=4)  # 参数num_parallel_calls的作用是在多个CPU内核中并行调用map()

# 2、将其作为模型的一部分
text_input      = keras.Input(shape=(), dtype='string')
vectorized_text = text_vectorization(text_input)
embedded_input  = keras.Embedding(...)(vectorized_text)
output          = ...
model           = keras.Model(text_input， output)

## 11.3 表示单词组的两种方法：集合和序列

### 11.3.1 准备`IMDB`影评数据
---
 `!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

 `!tar -xf aclImdb_v1.tar.gz`

In [3]:
# 将20%的训练文本文件放入一个新目录中
import os, pathlib, shutil, random

base_dir  = pathlib.Path('aclImdb')
val_dir   = base_dir / 'val'
train_dir = base_dir / 'train'

for category in ('neg', 'pos'):
    if not os.path.exists(val_dir / category):
        # 创建目录
        os.makedirs(val_dir / category)
    else:
        # 清空目录下文件
        exist_files = os.listdir(val_dir / category)
        for exist_file in exist_files:
            os.remove(os.path.join(val_dir / category, exist_file))

    files = os.listdir(train_dir / category)
    
    # 使用种子随机打乱训练文件列表，以确保每次运行代码都会得到相同的验证集
    random.Random(1337).shuffle(files)

    # 将20%的训练文件用于验证
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]

    for fname in val_files:
        # 将文件移动到 aclImdb/val/{category}目录中
        shutil.move(train_dir / category / fname, val_dir / category / fname)

In [4]:
from tensorflow import keras

batch_size = 32

train_ds = keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size)
val_ds   = keras.utils.text_dataset_from_directory('aclImdb/val'  , batch_size=batch_size)
test_ds  = keras.utils.text_dataset_from_directory('aclImdb/test' , batch_size=batch_size)

Found 4196 files belonging to 2 classes.
Found 1048 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


### [C] 11.2 显示第一个批量的形状和数据类型

In [None]:
for inputs, targets in train_ds:
    print('inputs.shape:' , inputs.shape)
    print('inputs.dtype:' , inputs.dtype)
    print('targets.shape:', targets.shape)
    print('targets.dtype:', targets.dtype)
    print('inputs[0]:'    , inputs[0])
    print('targets[0]:'   , targets[0])
    break

### 11.3.2 将单词作为集合处理:词袋方法

#### 1 单个单词(一元语法)的二进制编码

##### [C] 11.3 用`TextVectorization`层预处理数据集

In [5]:
text_vectorization = TextVectorization(
    max_tokens=20000,         # 前20000个最常出现的单词
    output_mode='multi_hot',  # 将输出词元编码为 multi_hot 二进制向量
)

# 准备一个数据集，只包含原始文本输入(不包含标签)
text_only_train_ds = train_ds.map(lambda x, y:x)

# 利用adapt()方法对数据集词表建立索引
text_vectorization.adapt(text_only_train_ds)

# 分别对训练、验证、测试数据集进行处理
binary_1gram_train_ds = train_ds.map(lambda x, y:(text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds   = val_ds.map  (lambda x, y:(text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_test_ds  = test_ds.map (lambda x, y:(text_vectorization(x), y), num_parallel_calls=4)

##### [C] 11.4 查看一元语法二进制数据集的输出

In [None]:
for inputs, targets in binary_1gram_train_ds:
    print('inputs.shape:', inputs.shape)
    print('inputs.dtype:', inputs.dtype)
    print('targets.shape:', targets.shape)
    print('targets.dtype:', targets.dtype)
    print('inputs[0]:'    , inputs[0])
    print('targets[0]:'   , targets[0])
    break

##### [C] **▲** 11.5 模型构建函数

In [6]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs  = keras.Input(shape=(max_tokens,))
    x       = layers.Dense(hidden_dim, activation='relu')(inputs)
    x       = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    
    model   = keras.Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

    return model

##### [C] 11.6 对一元语法二进制模型进行训练的测试

In [22]:
model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]

model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_1gram.keras")

print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.866


#### 2 二元语法的二进制编码

##### [C] 11.7 设置`TextVectorization`层返回二元语法

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode='multi_hot',
)

##### [C] 11.8 对二元语法二进制模型进行训练和测试

In [None]:
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(lambda x, y:(text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_val_ds   = val_ds.map(lambda x, y:(text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_test_ds  = test_ds.map(lambda x, y:(text_vectorization(x), y), num_parallel_calls=4)

model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]

model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_2gram.keras")

print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

#### 3 二元语法的`TF-IDF`编码

##### [C] 11.9 设置 `TextVectorization` 层返回词元出现的次数
---
将单词计数减去均值并除以方差，对其进行规范化。

In [23]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode='count',
)

##### [C] 11.10 设置 `TextVectorization` 层返回 `TF-IDF` 加权输出
---
理解 `TF-IDF` 规范化
* 某个词在一个文档中出现的次数越多，它对理解文档的内容就越重要。
* 同时，某个词在数据集所有文档中的出现频次也很重要：如果一个词几乎出现在每个文档中，如：the， a，那么这个词就不是特别有信息量，而仅在一小部分文本中出现的词则是非常独特的，因此也非常重要。

In [24]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode='tf_idf',
)

In [None]:
# TF-IDF 的计算方法如下：
def tfidf(term, document, dataset):
    term_freq = document.count(term)
    doc_freq  = math.log(sum(doc.count(term) for doc in dataset) + 1)

    return term_freq / doc_freq

##### [C] 11.11 对 `TF-IDF` 二元语法模型进行训练和测试

In [25]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
tfidf_2gram_val_ds   = val_ds.map(lambda x, y: (text_vectorization(x), y)  , num_parallel_calls=4)
tfidf_2gram_test_ds  = test_ds.map(lambda x, y: (text_vectorization(x), y) , num_parallel_calls=4)

model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("tfidf_2gram.keras")

print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.844


In [26]:
inputs           = keras.Input(shape=(1,), dtype="string")  # 每个输入样本都是一个字符串
processed_inputs = text_vectorization(inputs)               # 应用文本预处理
outputs          = model(processed_inputs)                  # 应用前面训练好的模型
inference_model  = keras.Model(inputs, outputs)             # 将端到端的模型实例化

In [27]:
import tensorflow as tf

raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])

predictions = inference_model(raw_text_data)

print(f"{float(predictions[0] * 100):.2f} percent positive")

91.82 percent positive


### 11.3.3 将单词作为序列处理:序列模型方法 `sequence model`

#### 1 第一个实例

##### [C] 11.12 准备整数序列数据集

In [30]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000

text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,  # 评论的平均长度是233个单词，只有5%的评论超过600个单词
)

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds   = val_ds.map  (lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds  = test_ds.map (lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

##### 11.13 构建于 `one-hot` 编码的向量序列之上的序列模型

In [None]:
import tensorflow as tf

inputs   = keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x        = layers.Bidirectional(layers.LSTM(32))(embedded)
x        = layers.Dropout(0.5)(x)
outputs  = layers.Dense(1, activation="sigmoid")(x)
model    = keras.Model(inputs, outputs, name='OneHot')

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

##### [C] 11.14 训练第一个简单的序列模型
---
观察结果：
* 训练速度非常慢
* 测试精度不高

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("one_hot_bidir_lstm.keras")

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

#### 2 理解词嵌入 `word embedding`
---
两个词向量之间的**几何关系**应该反映这两个单词之间的**语义关系**。

#### 3 利用 `Embedding` 层学习词嵌入

##### [C] 11.15 将 `Embedding` 层实例化

In [28]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

##### [C] 11.16 从头开始训练一个使用 `Embedding` 层的模型

In [31]:
inputs   = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x        = layers.Bidirectional(layers.LSTM(32))(embedded)
x        = layers.Dropout(0.5)(x)
outputs  = layers.Dense(1, activation="sigmoid")(x)
model    = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("embeddings_bidir_gru.keras")

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 256)         5120000   
                                                                 
 bidirectional_1 (Bidirectio  (None, 64)               73984     
 nal)                                                            
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________

#### 4理解填充和掩码

##### [C] 11.17 使用带有掩码的 `Embedding` 层

In [None]:
inputs   = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x        = layers.Bidirectional(layers.LSTM(32))(embedded)
x        = layers.Dropout(0.5)(x)
outputs  = layers.Dense(1, activation="sigmoid")(x)
model    = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

#### 5 使用预训练词嵌入

##### [C] 11.18 解析 `GloVe` 词嵌入文件
---
`!wget http://nlp.stanford.edu/data/glove.6B.zip`

`!unzip -q glove.6B.zip`

In [40]:
import numpy as np

glove_dir          = pathlib.Path('glove')
path_to_glove_file = os.path.join(glove_dir, "glove.6B.100d.txt")

embeddings_index = {}

with open(path_to_glove_file, encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.


##### [C] 11.19 准备 `GloVe` 词嵌入矩阵

In [41]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()            # 获取前面 TextVectorization 层索引的词表
word_index = dict(zip(vocabulary, range(len(vocabulary))))  # 利用这个词表创建一个从单词到其词表索引的映射

embedding_matrix = np.zeros((max_tokens, embedding_dim))    # 准备一个矩阵，后续将用 GloVe 向量填充

for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)

    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [42]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,  # 冻结
    mask_zero=True,
)

##### [C] 11.20 使用预训练 `Embedding` 层的模型

In [44]:
inputs   = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x        = layers.Bidirectional(layers.LSTM(32))(embedded)
x        = layers.Dropout(0.5)(x)
outputs  = layers.Dense(1, activation="sigmoid")(x)
model    = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("glove_embeddings_sequence_model.keras")

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 100)         2000000   
                                                                 
 bidirectional_3 (Bidirectio  (None, 64)               34048     
 nal)                                                            
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
____________________________________________

## 11.4 **`Transformer` 架构**

### 11.4.1 理解自注意力
---
首先要对一组特征计算重要性分数。特征相关性越大，分数越高，反之。

---
你有一个参考序列，用于描述你要查找的内容：`查询`。

你有一个知识体系，并试图从中提取信息：`值`。

每个值都有一个`键`，用于描述这个值，并可以很容易于查询进行对比。

### 11.4.2 多头注意力 `Attention Is All You Need`

### 11.4.3 `Transformer` 编码器

#### [C] 11.21 将 `Transformer` 编码器实现为 `Layer` 子类

In [4]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):

    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        
        super().__init__(**kwargs)

        self.embed_dim = embed_dim  # 输入词元向量的尺寸
        self.dense_dim = dense_dim  # 内部密集层的尺寸
        self.num_heads = num_heads  # 注意力头的个数

        self.attention  = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential([layers.Dense(dense_dim, activation='relu'), layers.Dense(embed_dim), ])

        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        # Embedding 层生成的掩码是二维的，但注意力层的输入应该是三维或四维的，所以我们需要增加它的维数
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input       = self.layernorm_1(inputs + attention_output)
        proj_output      = self.dense_proj(proj_input)

        return self.layernorm_2(proj_input + proj_output)
    
    # 实现序列化，以便保存模型
    def get_config(self):
        config = super().get_config()

        config.update({
            'embed_dim':self.embed_dim,
            'num_heads':self.num_heads,
            'dense_dim':self.dense_dim,
        })

        return config

#### [C] 11.22 将 `Transformer` 编码器用于文本分类

In [5]:
vocab_size = 20000
embed_dim  = 256
num_heads  = 2
dense_dim  = 32

inputs  = keras.Input(shape=(None,), dtype="int64")
x       = layers.Embedding(vocab_size, embed_dim)(inputs)
x       = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x       = layers.GlobalMaxPooling1D()(x)
x       = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model   = keras.Model(inputs, outputs, name='transformerEncode')

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

Model: "transformerEncode"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         5120000   
                                                                 
 transformer_encoder (Transf  (None, None, 256)        543776    
 ormerEncoder)                                                   
                                                                 
 global_max_pooling1d (Globa  (None, 256)              0         
 lMaxPooling1D)                                                  
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)           

#### [C] 11.23 训练并评估基于 `Transformer` 编码器的模型

In [48]:
callbacks = [
    keras.callbacks.ModelCheckpoint("transformer_encoder.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)

model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder})

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test acc: 0.844


#### 1 使用位置编码重新注入顺序信息

##### [C] 11.24 将位置嵌入实现为 `Layer` 子类

In [6]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        
        self.token_embeddings    = layers.Embedding(input_dim=input_dim      , output_dim=output_dim)  # 用于保存词元索引
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)  # 用于保存词元位置
        self.sequence_length     = sequence_length
        self.input_dim           = input_dim
        self.output_dim          = output_dim

    def call(self, inputs):
        length             = tf.shape(inputs)[-1]
        positions          = tf.range(start=0, limit=length, delta=1)
        embedded_tokens    = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        
        return tf.math.not_equal(inputs, 0)

    # 实现序列化，以便保存模型
    def get_config(self):
        config = super().get_config()
        
        config.update({
            "output_dim"     : self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim"      : self.input_dim,
        })

        return config

#### 2 综合示例:文本分类 `Transformer`

##### [C] 11.25 将 `Transformer` 编码器与位置嵌入相结合

In [50]:
vocab_size      = 20000
sequence_length = 600
embed_dim       = 256
num_heads       = 2
dense_dim       = 32

inputs  = keras.Input(shape=(None,), dtype="int64")
x       = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x       = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x       = layers.GlobalMaxPooling1D()(x)
x       = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model   = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=1, callbacks=callbacks)

model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_9 (InputLayer)        [(None, None)]            0         
                                                                 
 positional_embedding (Posit  (None, None, 256)        5273600   
 ionalEmbedding)                                                 
                                                                 
 transformer_encoder_1 (Tran  (None, None, 256)        543776    
 sformerEncoder)                                                 
                                                                 
 global_max_pooling1d_1 (Glo  (None, 256)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_7 (Dropout)         (None, 256)               0         
                                                           

### 11.4.4 何时使用序列模型而不是词袋模型
### **When to use sequence models over bag-of-words models?**

## 11.5 超越文本分类:序列到序列学习
---
* 机器翻译(`machine translation`)
* 文本摘要(`text summarization`)
* 问题答(`question answering`)
* 聊天机器人(`chatbot`)
* 文本生成(`text generation`)
---
训练过程：
- **编码器**模型将源序列转换为中间表示。
- 对**解码器**进行训练，使其可以通过查看前面的词元(从 `0` 到 `i-1` )和编码后的源序列，预测目标序列的下一个词元 `i`

### 11.5.1 机器翻译示例

#### 英语到西班牙语的翻译数据集
`!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip`

`!unzip -q spa-eng.zip`

In [20]:
text_file = 'spa-eng/spa.txt'

with open(text_file, encoding='utf-8') as f:
    lines = f.read().split('\n')[:-1]

text_pairs = []
# 对文件中每一行进行遍历
for line in lines:
    # 每一行都包含一个英语句子和它的西班牙译文，二者以制表符分隔
    english, spanish = line.split('\t')
    
    # 将[start]和[end]分别添加到西班牙语句子的开头和结尾
    spanish = '[start]' + spanish + '[end]'

    text_pairs.append((english, spanish))

In [11]:
import random

# 显示text_pairs示例
print(random.choice(text_pairs))

('My new class starts today.', '[start]Mi nuevo curso comienza hoy.[end]')


In [21]:
# 将text_pairs打乱, 并将其划分为常见的训练集、验证集和测试集
import random

random.shuffle(text_pairs)

num_val_samples   = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs   = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs  = text_pairs[num_train_samples + num_val_samples:]

#### 准备两个单独的 `TextVectorization` 层:一个用于英语,一个用于西班牙语.
---
* 需要保留插入的词元`[start]`和`[end]`
* 不同语言的标点符号是不同的

##### [C] 11.26 将英语和西班牙语的文本对向量化

In [22]:
import tensorflow as tf
import string
import re

strip_chars = string.punctuation + '¿'
strip_chars = strip_chars.replace('[', '')
strip_chars = strip_chars.replace(']', '')

def custom_standardization(input_string):

    lowercase = tf.strings.lower(input_string)

    return tf.strings.regex_replace(lowercase, f'[{re.escape(strip_chars)}]', '')

vocab_size      = 15000
sequence_length = 20

# 英语层
source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# 西班牙语层
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,  # 在训练过程中需要将句子偏移一个时间步
    standardize=custom_standardization,
)

train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]

# 学习每种语言的词表
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)

##### [C] 11.27 准备翻译任务的数据集

In [23]:
batch_size = 64

def format_dataset(eng, spa):
    
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)

    return ({
        'english':eng,
        'spanish':spa[:, :-1],  # 输入西班牙语句子不包含最后一个词元，以保证输入和目标具有相同的场地
    }, 
    spa[:, 1:]  # 目标西班牙语句子后偏移一个时间步。二者长度相同，都是20个单词
    )

def make_dataset(pairs):
    
    eng_texts, spa_texts = zip(*pairs)
    
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)

    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)

    return dataset.shuffle(2048).prefetch(16).cache()  # 利用内存缓存来加快预处理速度

train_ds = make_dataset(train_pairs)
val_ds   = make_dataset(val_pairs)


In [24]:
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)


### 11.5.2 `RNN` 的序列到序列学习

#### 使用`RNN`将一个序列转换到另一个序列，最简单的方法是在每个时间步都保存`RNN`的输出
---
有两个主要问题：
* 目标序列必须始终与源序列的长度相同。
* 由于`RNN`逐步处理的性质，模型将仅通过查看源序列第`0~N`个词元来预测目标序列的第`N`个词元。

In [26]:
inputs  = keras.Input(shape=(sequence_length,), dtype='int64')
x       = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
x       = layers.LSTM(32, return_sequences=True)(x)
outputs = layers.Dense(vocab_size, activation='softmax')(x)
model   = keras.Model(inputs, outputs)

#### [C] 11.28 基于`GRU`的编码器

In [27]:
from tensorflow import keras
from tensorflow.keras import layers

embed_dim  = 256
latent_dim = 1024

# 掩码
source = keras.Input(shape=(None,), dtype='int64', name='english')
# 英语源句子。指定输入名称，我们就可以用输入组成的字典来拟合模型
x      = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
# 编码后的源句子即为双向GRU的最后一个食醋
encoded_source = layers.Bidirectional(layers.GRU(latent_dim), merge_mode='sum')(x)

##### [C] 11.29 基于`GRU`的解码器与端到端模型

In [28]:
past_target      = keras.Input(shape=(None,), dtype="int64", name="spanish")             # 西班牙语目标的句子
x                = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)  # 掩码
decoder_gru      = layers.GRU(latent_dim, return_sequences=True)
x                = decoder_gru(x, initial_state=encoded_source)                          # 编码后的源句子作为解码器GRU的初始状态
x                = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)                     # 预测下一个词元
seq2seq_rnn      = keras.Model([source, past_target], target_next_step)                  # 端到端模型：将源句子和目标句子映射为偏移一个时间步的目标句子

##### [C] 11.30 训练序列到序列循环模型

In [18]:
seq2seq_rnn.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

seq2seq_rnn.fit(train_ds, epochs=1, validation_data=val_ds)



<keras.callbacks.History at 0x20001bf58d0>

##### [C] 11.31 利用 `RNN` 编码器和 `RNN` 解码器来翻译句子

In [31]:
import numpy as np

# 准备一个字典，将词元索引预测值映射为字符串词元
spa_vocab        = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence         = '[start]'  # 种子词元

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        
        # 对下一个词元进行采样
        next_token_predictions = seq2seq_rnn.predict([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index    = np.argmax(next_token_predictions[0, i, :])
        # 将下一个词元预测值转换为字符串，并添加到生成的句子中
        sampled_token          = spa_index_lookup[sampled_token_index]
        decoded_sentence      += ' ' + sampled_token

        # 退出条件:达到最大长度或遇到停止词元
        if sampled_token == '[end]':
            break
    
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]

for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print('-')
    print(input_sentence)
    print(decode_sequence(input_sentence))

-
She's hot.
[start] [start]mañana volverás mantenido [start]actualmente demostrar elevadora[end] conducir tarde [start]voy aburres[end] madrid empresas empresas siendo momentos[end] morado[end] permito limpias ganarse velocidad
-
This milk has a peculiar smell.
[start] [start]roma pide resfría [start]deseas presencia[end] cuándo[end] esperas corriendo descubrieron cambios[end] bella[end] valla[end] [start]comimos lentamente repararlo[end] optimista[end] tenedor[end] pude fbi[end] [start]terminamos
-
They are going to meet at the hotel tomorrow.
[start] vendiendo base velozmente[end] twitter[end] relativamente [start]pensé ¡qué único[end] violento[end] primeros tranquilo[end] llegamos quedó[end] usted[end] satélites limpiaste predecir intenté[end] pulgar camino[end]
-
What's your answer?
[start] vehículo resistir[end] recibieron marzo [start]ambos preguntando disney[end] pierdas [start]pienso tornó excavó derecha[end] quejando sí[end] viene[end] teatro[end] goteara apetece alemán querr

### 11.5.3 使用 `Transformer` 进行序列到序列学习

#### 1 `Transformer` 解码器

##### [C] 11.33 `TransformerDecoder`

In [38]:
class TransformerDecoder(layers.Layer):
    
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__()

        self.embed_dim   = embed_dim
        self.dense_dim   = dense_dim
        self.num_heads   = num_heads
        self.attention_1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj  = keras.Sequential([layers.Dense(dense_dim, activation='relu'),  layers.Dense(embed_dim), ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True
    
    def get_config(self):
        
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })

        return config
        

##### [C] 11.34 `TransformerDecoder` 中可以生成因果掩码的方法

In [39]:
def get_causal_attention_mask(self, inputs):

    input_shape = tf.shape(inputs)
    batch_size, sequence_length = input_shape[0], input_shape[1]
    i    = tf.range(sequence_length)[:, tf.newaxis]
    j    = tf.range(sequence_length)
    mask = tf.cast(i >= j, dtype="int32")
    mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
    mult = tf.concat([tf.expand_dims(batch_size, -1),
                      tf.constant([1, 1], dtype=tf.int32)], axis=0)
    
    return tf.tile(mask, mult)

##### [C] 11.35 `TransformerDecoder` 的前向传播

In [40]:
def call(self, inputs, encoder_outputs, mask=None):
    
    causal_mask = self.get_causal_attention_mask(inputs)
    
    if mask is not None:
        padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
        padding_mask = tf.minimum(padding_mask, causal_mask)
    
    attention_output_1 = self.attention_1(
        query=inputs,
        value=inputs,
        key=inputs,
        attention_mask=causal_mask)
    
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    
    attention_output_2 = self.attention_2(
        query=attention_output_1,
        value=encoder_outputs,
        key=encoder_outputs,
        attention_mask=padding_mask,
    )
    
    attention_output_2 = self.layernorm_2(
        attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    
    return self.layernorm_3(attention_output_2 + proj_output)

In [41]:
TransformerDecoder.get_causal_attention_mask = get_causal_attention_mask
TransformerDecoder.call = call

#### 2 综合示例:用于机器翻译的 `Transformer`

##### [C] 11.36 端到端 `Transformer`

In [43]:
embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs  = keras.Input(shape=(None, ), dtype='int64', name='english')
x               = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs  = keras.Input(shape=(None, ), dtype='int64', name='spanish')
x               = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x               = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x               = layers.Dropout(0.5)(x)

decoder_outputs = layers.Dense(vocab_size, activation='softmax')(x)

transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)


##### [C] 11.37 训练序列到序列 `Transformer`

In [48]:
transformer.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

transformer.fit(train_ds, epochs=5, validation_data=val_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2003a10d8d0>

##### [C] 11.38 利用 `Transformer` 模型来翻译句子

In [50]:
import numpy as np

spa_vocab                   = target_vectorization.get_vocabulary()
spa_index_lookup            = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]
        predictions               = transformer([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index       = np.argmax(predictions[0, i, :])
        sampled_token             = spa_index_lookup[sampled_token_index]
        decoded_sentence         += " " + sampled_token
        
        if sampled_token == "[end]":
            break
    
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]

for _ in range(2):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print('english: ' + input_sentence)
    print('spanish: ' + decode_sequence(input_sentence))

-
english: Do you know a good restaurant?
spanish: [start] un buen día[end]                 
-
english: I use all kinds of software to study Chinese.
spanish: [start] todos los días de estudiar japonés[end]              


## 11.6 本章总结
---
* 自然语言处理(`NLP`)模型有两类：1、词袋模型(由多个`Dense`层组成)；2、序列模型(可以是`RNN`、`一维卷积神经网络`或`Transformer`)
* 对于文本分类，训练数据中的样本数和每个样本的平均词数之间的比例，有助于判断应该使用词袋模型还是序列模型。
* `词嵌入`是向量空间，其中单词之间的语义关系被表示为这些词向量之间的距离关系。
* `序列到序列`模型由编码器和解码器组成，前者处理源序列，后者利用编码器处理后的源序列，并通过查看过去的词元来尝试预测目标序列后面的词元。
* `神经注意力`可以生成上下文感知的词表示。它是`Transformer`架构的基础。
* `Transformer`架构由`TransformerEncoder`和`TransformerDecoder`组成。`TransformerEncoder`也可以用于文本分类任务或任意类型的单一输入`NLP`任务。