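# Chapter summary

- Preprocessing text data for machine learning applications
- Bag-of-words and sequence-model approaches to text processing
- The Transformer architecture
- Sequence-to-sequence learning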

# Overview of natural language processing (NLP)

# Preparing text data

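Deep learning models are differentiable functions that can only process numeric tensors; they cannot take raw text as input. **Text vectorization** is the process of converting text into numeric tensors. It comes in many forms, but they all follow the same pipeline.<br>
**Text vectorization pipeline** <br>
![From raw text to vectors](images/从原始文本到向量.png "From raw text to vectors") <p>
1. Standardize the text, for example by converting it to lowercase or removing punctuation.
2. Split the text into units called **tokens**, such as characters, words, or groups of words. This step is called **tokenization**.
3. Convert each token into a numeric vector. This usually requires first **indexing** all the tokens present in the data.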

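## Text standardization

Text standardization is a basic form of feature engineering that aims to erase encoding differences you don't want the model to have to deal with.<p>
- The simplest and most widely used standardization scheme is to convert everything to lowercase and remove punctuation.
- A more advanced scheme, rarely used in machine learning, is **stemming**: mapping variations of a term (such as the different conjugated forms of a verb) to a single shared representation.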
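
As a rough illustration of the simplest scheme, the sketch below lowercases a string and strips ASCII punctuation using only the standard library. The helper name `standardize` and the two sample sentences are made up for this example; stemming is not covered.

```python
import string

def standardize(text):
    """Lowercase the text and remove ASCII punctuation characters."""
    lowered = text.lower()
    # str.maketrans with a third argument maps those characters to None (deletion)
    return lowered.translate(str.maketrans("", "", string.punctuation))

raw_a = "Sunset came. I was staring at the Mexico sky."
raw_b = "sunset came; i was staring at the mexico sky"

clean_a = standardize(raw_a)
clean_b = standardize(raw_b)
print(clean_a)
print(clean_a == clean_b)  # Both variants collapse to the same string
```

After standardization the two spellings become identical, which is exactly the kind of encoding difference the model no longer has to learn to ignore.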

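## Text splitting (tokenization)

There are three ways to tokenize text:
- **Word-level tokenization**: tokens are substrings separated by spaces (or punctuation).
- **N-gram tokenization**: tokens are groups of N consecutive words.
- **Character-level tokenization**: each character is its own token.<p>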
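
The three schemes can be compared on a single sentence. This is a minimal pure-Python sketch that assumes the text is already standardized and splits on whitespace only; the variable names are illustrative.

```python
sentence = "the cat sat on the mat"

# Word-level: split on whitespace
word_tokens = sentence.split()

# N-gram (here N = 2): every pair of consecutive words, in order
bigram_tokens = [" ".join(pair) for pair in zip(word_tokens, word_tokens[1:])]

# Character-level: every character, spaces included
char_tokens = list(sentence)

print(word_tokens)
print(bigram_tokens)
print(len(char_tokens))
```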

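In practice, you will almost always use word-level or N-gram tokenization. There are two kinds of text-processing models:
- models that care about word order, called **sequence models**;
- models that treat the input words as a set and discard their original order, called **bag-of-words models**.<p>

If you are building a sequence model, use word-level tokenization; if you are building a bag-of-words model, use N-gram tokenization.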

### Understanding N-grams and bag-of-words

Consider the sentence “The cat sat on the mat”.<br>
It can be decomposed into the following set of 2-grams: <br>
```
{"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}
```
<p>

It can also be decomposed into the following set of 3-grams: <br>
```
{"the", "the cat", "cat", "cat sat", "the cat sat", "sat", "sat on", "on", "cat sat on", "on the", "sat on the", "the mat", "mat", "on the mat"}
```
<p>

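Such sets are called a **bag-of-2-grams** and a **bag-of-3-grams**, respectively. The term **bag** means that you are dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called **bag-of-words** (or **bag-of-N-grams**).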
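
The two sets above can be reproduced with a few lines of Python. This is a sketch under the assumption that a bag-of-N-grams contains every word group of length 1 up to N; `ngram_bag` is a hypothetical helper, not a library function.

```python
def ngram_bag(text, n):
    """Return the set of all word groups of length 1..n found in the text."""
    words = text.lower().split()
    bag = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence
        for start in range(len(words) - size + 1):
            bag.add(" ".join(words[start:start + size]))
    return bag

sentence = "The cat sat on the mat"
bag_2 = ngram_bag(sentence, 2)
bag_3 = ngram_bag(sentence, 3)
print(sorted(bag_2))
print(len(bag_2), len(bag_3))
```

Because the result is a `set`, the repeated word "the" is stored once and no ordering is preserved, which matches the meaning of "bag" given above.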

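## Vocabulary indexing

Each token has to be encoded into a numeric representation. To do this, build an index of all the terms found in the training data (the “vocabulary”) and assign a unique integer to each entry.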

In [None]:
vocabulary = {}
for text in dataset:
    text = standardize(text)
    tokens = tokenize(text)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

You can then convert that integer into a vector encoding that a neural network can process, such as a one-hot vector.

In [24]:
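import numpy as np

def one_hot_encode_token(token):
    # Vector of zeros with one slot per vocabulary entry
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    vector[token_index] = 1  # Set the slot for this token to 1
    return vector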

## Using the TextVectorization layer

### The text preparation pipeline in pure Python

In [27]:
import string

class Vectorizer:
    # Text standardization: lowercase and strip punctuation
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    # Tokenization: split on whitespace
    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()

    # Vocabulary indexing: 0 is the mask token, 1 is the out-of-vocabulary token
    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)


vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

In [29]:
test_sentence = "I write, rewrite, and still rewrite again"
encode_sentence = vectorizer.encode(test_sentence)
print(encode_sentence)
decode_sentence = vectorizer.decode(encode_sentence)
print(decode_sentence)

[2, 3, 5, 7, 1, 5, 6]
i write rewrite and [UNK] rewrite again


### Using the built-in TextVectorization layer

In [36]:
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int",  # Return sequences of words encoded as integer indices
)



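#### Custom standardization and tokenization functions for TextVectorization

By default, the TextVectorization layer standardizes text by converting it to lowercase and removing punctuation, and tokenizes by splitting on whitespace. You can also supply custom functions for standardization and tokenization. Such functions must operate on `tf.string` tensors rather than regular Python strings.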

In [42]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor)  # Convert strings to lowercase
    return tf.strings.regex_replace(  # Replace punctuation characters with the empty string
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)  # Split strings on whitespace

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn)

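#### Indexing the vocabulary

To index the vocabulary of a text corpus with the TextVectorization layer, call its `adapt()` method with either a Dataset object that yields strings or a list of Python strings.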

In [46]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

##### Displaying the vocabulary

In [49]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

#### Encoding and then decoding an example sentence

In [52]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
i write rewrite and [UNK] rewrite again


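#### Using the TextVectorization layer when building a model

There are two ways to use the TextVectorization layer:
- put it in the `tf.data` pipeline
- make it part of the model

##### In the tf.data pipeline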

In [None]:
int_sequence_dataset = string_dataset.map(
    text_vectorization,
    num_parallel_calls=4
)

##### As part of the model

In [None]:
text_input = keras.Input(shape=(), dtype="string")  # Symbolic input tensor that expects strings
vectorized_text = text_vectorization(text_input)  # Apply the text vectorization layer to the input
embedded_input = keras.layers.Embedding(...)(vectorized_text)
output = ...
model = keras.Model(text_input, output)

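##### Differences between the two approaches

If vectorization is part of the model, it runs synchronously with the rest of the model. This means that at each training step, the rest of the model (on the GPU) has to wait for the output of the TextVectorization layer (on the CPU) to be ready before it can start working. By contrast, putting the layer in the tf.data pipeline lets you preprocess data asynchronously on the CPU: while the model works on one batch of vectorized data on the GPU, the CPU vectorizes the next batch of raw strings.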

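# Two approaches for representing groups of words: sets and sequences

How a machine learning model should represent **individual words** is relatively uncontroversial: they are categorical features (values from a predefined set), and they should be encoded as dimensions in a feature space or as category vectors. **How to encode the way words are woven into sentences** is the harder question. <p>

How to represent word order is the key issue.<br>
- The simplest option is to discard order and treat the text as an unordered set of words. This gives you **bag-of-words models**.<br>
- You can also process words strictly in the order in which they appear, one at a time, like steps in a timeseries. This is what **RNN models** do.<br>
- A hybrid approach is also possible: the Transformer architecture is technically order-agnostic, yet it injects word-position information into the representations it processes, which lets it look at different parts of a sentence simultaneously.<br>

Both RNNs and Transformers are called **sequence models**.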
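
To see what discarding order costs, compare the two views on a pair of sentences that use the same words. This is a small pure-Python sketch with made-up sentences; it does not involve any model.

```python
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# Set view (bag-of-words): order and repetition are thrown away
bag_a = set(sentence_a.split())
bag_b = set(sentence_b.split())

# Sequence view: the list keeps every token in its original position
seq_a = sentence_a.split()
seq_b = sentence_b.split()

print(bag_a == bag_b)  # True: the two sentences are indistinguishable as bags
print(seq_a == seq_b)  # False: the sequences differ
```

A bag-of-words model receives identical inputs for both sentences, whereas a sequence model can in principle tell them apart.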

## Preparing the IMDB movie reviews data

### Building a validation set

In [69]:
# Move 20% of the training text files into a new directory
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb/")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)

### Creating batched Datasets

In [72]:
from tensorflow import keras

batch_size = 32
train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val/", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test/", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [74]:
for inputs, targets in train_ds:
    print("inputs shape:", inputs.shape)
    print("inputs dtype:", inputs.dtype)
    print("targets shape:", targets.shape)
    print("targets dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs shape: (32,)
inputs dtype: <dtype: 'string'>
targets shape: (32,)
targets dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'Yeah. Pretty sure I saw this movie years ago when it was about the Supremes.<br /><br />Another recycled storyline glitzed up Hollywood-style, borrowing scripts from better making-it-in-the-music-industry films.<br /><br />Nothing original here.<br /><br />More make-up, glammier costumes and choreography = more money for the questionably "talented" Beyonce draw.<br /><br />If you like the throwback style, you should appreciate actual groups who struggled (without having digitized voices and a Hollywood empire).<br /><br />Beyonce\'s involvement makes this hypocritical garbage.', shape=(), dtype=string)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


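## Processing words as a set: the bag-of-words approach

The simplest way to encode a piece of text so that a machine learning model can process it is to **discard order** and treat the text as a set (a “bag”) of tokens.

### Single words (unigrams) with binary encoding

With a bag of single words, the sentence “the cat sat on the mat” becomes `{"cat", "mat", "on", "sat", "the"}`. <p>

The main advantage of this encoding is that the whole text can be represented as a single vector, in which each entry indicates whether a given word is present.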
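
The idea of one presence flag per vocabulary entry can be sketched in plain Python. This is an illustration only, not how the Keras layer is implemented: the six-word vocabulary is hypothetical, and words missing from it are silently skipped rather than mapped to an out-of-vocabulary slot.

```python
def multi_hot(text, vocabulary):
    """Encode text as a 0/1 vector with one slot per vocabulary entry."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        index = vocabulary.get(token)
        if index is not None:  # Skip words that are not in the vocabulary
            vector[index] = 1
    return vector

# Hypothetical toy vocabulary: word -> position in the vector
vocabulary = {"cat": 0, "dog": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
vector = multi_hot("the cat sat on the mat", vocabulary)
print(vector)  # [1, 0, 1, 1, 1, 1]
```

The vector length depends only on the vocabulary size, not on the length of the text, and the second occurrence of "the" leaves the vector unchanged.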

#### Preprocessing the datasets with a TextVectorization layer

In [82]:
text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot"
)
text_only_train_ds = train_ds.map(lambda x, y: x)  # Prepare a dataset that yields only raw text inputs (no labels)
text_vectorization.adapt(text_only_train_ds)  # Index the dataset vocabulary via adapt()
binary_1gram_train_ds = train_ds.map(  # Process the training, validation, and test datasets
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
binary_1gram_test_df = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)



#### Inspecting the output of the binary unigram dataset

In [85]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs shape:", inputs.shape)
    print("inputs dtype:", inputs.dtype)
    print("targets shape:", targets.shape)
    print("targets dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs shape: (32, 20000)
inputs dtype: <dtype: 'int64'>
targets shape: (32,)
targets dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


#### Model-building function

In [88]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

#### Training and testing the binary unigram model

In [93]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_df)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 204s 322ms/step - accuracy: 0.7601 - loss: 0.5093 - val_accuracy: 0.8916 - val_loss: 0.2819
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.8857 - loss: 0.2982 - val_accuracy: 0.9022 - val_loss: 0.2646
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9058 - loss: 0.2545 - val_accuracy: 0.9002 - val_loss: 0.2767
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9178 - loss: 0.2274 - val_accuracy: 0.8952 - val_loss: 0.2853
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9256 - loss: 0.2151 - val_accuracy: 0.8978 - val_loss: 0.3027
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9251 - loss: 0.2159 - val_accuracy: 0.8962 - val_loss: 0.3147
Epoch 7/10
625/625 ━━━

### Bigrams with binary encoding

#### Configuring the TextVectorization layer to return bigrams

In [97]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot"
)

#### Training and testing the binary bigram model

In [100]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras", save_best_only=True)
]
model.fit(
    binary_2gram_train_ds,
    validation_data=binary_2gram_val_ds,
    epochs=10,
    callbacks=callbacks
)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")



Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 616s 980ms/step - accuracy: 0.7902 - loss: 0.4621 - val_accuracy: 0.9060 - val_loss: 0.2496
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 580s 924ms/step - accuracy: 0.9130 - loss: 0.2500 - val_accuracy: 0.9062 - val_loss: 0.2511
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 558s 890ms/step - accuracy: 0.9311 - loss: 0.2001 - val_accuracy: 0.9042 - val_loss: 0.2567
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 544s 867ms/step - accuracy: 0.9428 - loss: 0.1806 - val_accuracy: 0.9010 - val_loss: 0.2823
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 538s 857ms/step - accuracy: 0.9488 - loss: 0.1689 - val_accuracy: 0.9002 - val_loss: 0.2914
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 523s 834ms/step - accuracy: 0.9511 - loss: 0.1680 - val_accuracy: 0.8952 - val_loss: 0.3136
Epoc

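### Bigrams with TF-IDF encoding

You can add more information to the bigram representation by counting how many times each word or N-gram occurs. For “the cat sat on the mat”, this gives: <br>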
```
{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
"sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
```
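
The counts above can be reproduced with `collections.Counter`. This is a sketch that assumes whitespace tokenization and counts every word group of length 1 up to N; `ngram_counts` is a hypothetical helper name.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every word group of length 1..n in the text."""
    words = text.lower().split()
    counts = Counter()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            counts[" ".join(words[start:start + size])] += 1
    return counts

counts = ngram_counts("the cat sat on the mat", 2)
print(counts["the"], counts["the cat"])  # 2 1
```

Compared with the binary encoding, the only entry that changes for this sentence is "the", which now carries a count of 2 instead of a presence flag.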

#### Configuring the TextVectorization layer to return token counts

In [106]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

#### Configuring the TextVectorization layer to return TF-IDF-weighted outputs

In [124]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf"
)

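#### Understanding TF-IDF normalization

The more a given term appears in a document, the more important that term is for understanding what the document is about. At the same time, the frequency at which the term appears across all documents in the dataset matters too: terms that appear in almost every document (such as “the” or “a”) are not particularly informative, while terms that appear in only a small subset of texts (such as “Herzog”) are very distinctive and therefore important. TF-IDF is a metric that fuses these two ideas. It weights a term by dividing its “term frequency”, the number of times the term appears in the current document, by a measure of its “document frequency”, which estimates how often the term comes up across the whole dataset. It can be computed as follows.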

In [113]:
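import math

def tfidf(term, document, dataset):
    term_freq = document.count(term)  # Occurrences of the term in the current document
    # Log-scaled number of occurrences of the term across the whole dataset
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq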
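
A quick way to check the intuition is to run this formula on a toy corpus. The sketch below repeats the function so that it is self-contained and assumes each document is a list of word tokens; the three "reviews" are invented for the example.

```python
import math

def tfidf(term, document, dataset):
    term_freq = document.count(term)
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq

# Hypothetical tokenized corpus
dataset = [
    ["the", "film", "by", "herzog", "is", "the", "best"],
    ["the", "plot", "is", "thin"],
    ["the", "acting", "is", "the", "weak", "point"],
]
document = dataset[0]

score_the = tfidf("the", document, dataset)        # 2 / log(5 + 1)
score_herzog = tfidf("herzog", document, dataset)  # 1 / log(1 + 1)
print(f"the: {score_the:.3f}, herzog: {score_herzog:.3f}")
```

Although "the" occurs twice in the first document and "herzog" only once, "herzog" receives the higher score because it is rare across the corpus. Note that with this formula a term that never occurs in the dataset gives `log(1) = 0` in the denominator, so the function should only be called for terms that appear at least once.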

In [128]:
text_only_train_ds = text_only_train_ds.map(lambda x: x[0])

#### Training and testing the TF-IDF bigram model

In [None]:
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras", save_best_only=True)
]
model.fit(
    tfidf_2gram_train_ds.cache(),
    validation_data=tfidf_2gram_val_ds.cache(),
    epochs=10,
    callbacks=callbacks
)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")



Epoch 1/10
365/625 ━━━━━━━━━━━━━━━━━━━━ 3:43 860ms/step - accuracy: 0.6545 - loss: 0.9173