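# Text Classification with TextCNN

This tutorial shows how to use Ignite to train a neural network model, set up an experiment, and validate the model.

In this experiment, we will replicate [Convolutional Neural Networks for Sentence Classification by Yoon Kim](https://arxiv.org/abs/1408.5882). The paper uses a CNN for text classification, a task typically reserved for RNNs, logistic regression, and naive Bayes.

We want to classify IMDB movie reviews and predict whether a review is positive or negative. The IMDB movie review dataset contains 25,000 positive and 25,000 negative examples, each a pair of text and label, which makes this a binary classification problem. We will use PyTorch to create the model, torchtext to import the data, and Ignite to train and monitor the model!

Let's get started!

## Required Dependencies

In this example we only need the torchtext and spacy packages, assuming `torch` and `ignite` are already installed. We can install them with `pip`: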

`pip install torchtext==0.9.1 spacy`

`python -m spacy download en_core_web_sm`

In [None]:
!pip install pytorch-ignite torchtext==0.9.1 spacy
!python -m spacy download en_core_web_sm

## Import Libraries

In [None]:
import random

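`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:
* **datasets**: A module to download NLP datasets.
* **GloVe**: A module to download and use pretrained GloVe embeddings.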

In [None]:
from torchtext import datasets
from torchtext.vocab import GloVe

We import the torch, nn, and functional modules to create our model!

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

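`Ignite` is a high-level library that helps with training neural networks in PyTorch. It comes with an `Engine` to set up a training loop, various metrics, handlers, and a helpful contrib section!

Below we import the following:
* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.
* **Events**: Allows users to attach functions to an `Engine` to fire at a specific event, e.g. `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.
* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, and multilabel cases.
* **Loss**: General metric that takes a loss function as a parameter and calculates the loss over a dataset.
* **RunningAverage**: General metric to attach to an Engine during training.
* **ModelCheckpoint**: Handler to checkpoint models.
* **EarlyStopping**: Handler to stop training based on a score function.
* **ProgressBar**: Handler to create a tqdm progress bar.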

In [None]:
from ignite.contrib.handlers import ProgressBar
from ignite.engine import Engine, Events
from ignite.handlers import EarlyStopping, ModelCheckpoint
from ignite.metrics import Accuracy, Loss, RunningAverage
from ignite.utils import manual_seed

SEED = 1234
manual_seed(SEED)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

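## Processing Data

We first set up a tokenizer using `torchtext.data.utils`.

The job of a tokenizer is to split a sentence into "tokens". You can read more about it on [wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).
We will use the tokenizer from the "spacy" library, which is a popular choice. Feel free to switch to "basic_english" if you prefer the default tokenizer, or to any other tokenizer you like.

docs: https://pytorch.org/text/stable/data_utils.html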

In [None]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("spacy")

In [None]:
tokenizer("Ignite is a high-level library for training and evaluating neural networks.")

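Next, download the IMDB training and test datasets. The `torchtext.datasets` API returns the train/test splits directly, without any preprocessing. Each split is an iterator that yields the raw text and label line by line.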

In [None]:
train_iter, test_iter = datasets.IMDB(split=("train", "test"))

Now we set up the train, validation, and test splits.

In [None]:
# We are using only 1000 samples for faster training
# set to -1 to use full data
N = 1000

# We will use 80% of the `train split` for training and the rest for validation
train_frac = 0.8
_temp = list(train_iter)


random.shuffle(_temp)
_temp = _temp[: (N if N > 0 else len(_temp))]
n_train = int(len(_temp) * train_frac)

train_list = _temp[:n_train]
validation_list = _temp[n_train:]
test_list = list(test_iter)
test_list = test_list[: (N if N > 0 else len(test_list))]

Let's explore a data sample to see what it looks like.

Each data sample is a tuple of the format `(label, text)`.

The value of the label is either `pos` or `neg`.

In [None]:
random_sample = random.sample(train_list, 1)[0]
print(" text:", random_sample[1])
print("label:", random_sample[0])

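Now that we have the dataset splits, let's build the vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important to build the vocabulary from the training split only, since the validation and test splits are **unseen** in our experiment.

`Vocab` lets us use pretrained 100-dimensional **GloVe** word vectors, which means each word is described by 100 floats! If you want to read more about this, here are a few resources.
* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)
* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)
* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)

Note that the GloVe download is around 900 MB, so it may take some time.

An instance of the `Vocab` class has the following attributes:
* `extend` is used to extend the vocabulary.
* `freqs` is a dictionary of the frequency of each word.
* `itos` is a list of all the words in the vocabulary.
* `stoi` is a dictionary mapping every word to an index.
* `vectors` is a torch.Tensor of the downloaded embeddings.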

In [None]:
from collections import Counter

from torchtext.vocab import Vocab

counter = Counter()

for (label, line) in train_list:
    counter.update(tokenizer(line))

vocab = Vocab(
    counter, min_freq=10, vectors=GloVe(name="6B", dim=100, cache="/tmp/glove/")
)

In [None]:
print("The length of the new vocab is", len(vocab))
new_stoi = vocab.stoi
print("The index of '<BOS>' is", new_stoi["<BOS>"])
new_itos = vocab.itos
print("The token at index 2 is", new_itos[2])
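
To make the roles of `min_freq`, `itos`, and `stoi` concrete, here is a minimal, dependency-free sketch of frequency-filtered vocabulary building. It is an illustration only, not torchtext's implementation: `build_vocab` and `lookup` are hypothetical helpers, and the ordering rule (special tokens first, then descending frequency with alphabetical tie-breaks) is an assumption made for this example.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokens seen at least `min_freq` times."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Specials come first; the remaining tokens are ordered by descending
    # frequency, with ties broken alphabetically (an assumption of this sketch).
    kept = sorted(
        (tok for tok, freq in counter.items() if freq >= min_freq),
        key=lambda tok: (-counter[tok], tok),
    )
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def lookup(stoi, token, unk="<unk>"):
    """Map a token to its index, falling back to the unknown token."""
    return stoi.get(token, stoi[unk])


corpus = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "film"]]
itos, stoi = build_vocab(corpus, min_freq=2)
print(itos)                   # ['<unk>', '<pad>', 'a', 'movie']
print(lookup(stoi, "movie"))  # 3
print(lookup(stoi, "film"))   # 0, because "film" appears only once
```

Rare tokens fall below `min_freq` and collapse onto the unknown index, which is why a token that is not in the vocabulary still receives a valid index.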

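We now create `text_transform` and `label_transform`, which are callable objects, such as the `lambda` functions here, that process the raw text and label data from the dataset iterators (or from iterables like a `list`). You can add special symbols such as `<BOS>` and `<EOS>` to the sentence in `text_transform`.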

In [None]:
text_transform = lambda x: [vocab[token] for token in tokenizer(x)]
label_transform = lambda x: 1 if x == "pos" else 0

# Print out the output of text_transform
print("input to the text_transform:", "here is an example")
print("output of the text_transform:", text_transform("here is an example"))

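To generate batches of data we will use `torch.utils.data.DataLoader`. You can customize the batching by defining a function and passing it as the `collate_fn` argument of the DataLoader. Here, in the `collate_batch` function, we process the raw text data and add padding to dynamically match the longest sentence in the batch.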

In [None]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
    # Pad with the index of the "<pad>" token rather than an arbitrary word index.
    return torch.tensor(label_list), pad_sequence(
        text_list, padding_value=vocab["<pad>"]
    )
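
The padding step can be pictured without tensors. The sketch below uses plain Python lists: `pad_batch` and `to_time_major` are hypothetical helpers, and the pad value of 1 is an arbitrary choice for illustration. The second helper mirrors the (length, batch) layout that `pad_sequence` returns by default.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def to_time_major(padded):
    """Transpose a (batch, length) nested list into (length, batch)."""
    return [list(step) for step in zip(*padded)]


batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, pad_value=1)
print(padded)                 # [[5, 8, 2], [7, 1, 1], [4, 9, 1]]
print(to_time_major(padded))  # [[5, 7, 4], [8, 1, 9], [2, 1, 1]]
```

Because the target length is computed per batch, a batch of short reviews is padded far less than a batch that contains one very long review.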

In [None]:
batch_size = 8  # A batch size of 8


def create_iterators(batch_size=8):
    """Helper function to create the iterators"""
    dataloaders = []
    for split in [train_list, validation_list, test_list]:
        dataloader = DataLoader(split, batch_size=batch_size, collate_fn=collate_batch)
        dataloaders.append(dataloader)
    return dataloaders

In [None]:
train_iterator, valid_iterator, test_iterator = create_iterators()

In [None]:
next(iter(train_iterator))

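Let's explore what the iterator actually outputs, so that we know what the model's input looks like, how to compare the labels with the output, and how to set up the process_functions for Ignite's `Engine`.

* `batch[0][0]` is the label of a single example. `label_transform` mapped the original text label to an integer: 1 for `pos` and 0 otherwise.
* `batch[1]` holds the text with shape (L, N), so the tokens of a single example form a column such as `batch[1][:, 0]`. `text_transform` used the vocabulary to convert each token into its index.

Now let's print the sentence lengths of the first 10 batches of `train_iterator`. The lengths differ from batch to batch, which means the iterator pads each batch dynamically, as expected.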

In [None]:
batch = next(iter(train_iterator))
print("batch[0][0] : ", batch[0][0])
# The text tensor has shape (L, N), so one example is a column.
print("batch[1][:, 0] : ", batch[1][:, 0])

lengths = []
for i, batch in enumerate(train_iterator):
    x = batch[1]
    lengths.append(x.shape[0])
    if i == 9:  # stop after the first 10 batches
        break

print("Lengths of first 10 batches : ", lengths)

## TextCNN Model

Here is the replication of the model. These are the operations it performs:
* **Embedding**: Embeds a batch of text of shape (N, L) to (N, L, D), where N is the batch size, L is the maximum length in the batch, and D is the embedding dimension.

* **Convolutions**: Runs parallel convolutions across the embedded words with kernel sizes of 3, 4, and 5 to mimic trigrams, four-grams, and five-grams. Each convolution produces an output of shape (N, F, L - k + 1), where k is the kernel size and F is the number of filters (here F = D = 100).

* **Activation**: ReLU activation is applied to each convolution output.

* **Pooling**: Runs parallel max-pooling operations on the activated convolutions with window sizes of L - k + 1, resulting in 1 value per channel, i.e. a shape of (N, F, 1) per pooling.

* **Concat**: The pooling outputs are squeezed and concatenated to give a shape of (N, 3F). This is a single embedding for a sentence.

* **Dropout**: Dropout is applied to the embedded sentence.

* **Fully Connected**: The dropout output is passed through a fully connected layer of shape (3F, 1) to give a single output for each example in the batch. A sigmoid is applied to the output of this layer.

* **load_embeddings**: This is a method defined for TextCNN to load embeddings based on user input. There are 3 modes: `rand`, which results in randomly initialized weights; `static`, which results in frozen pretrained weights; and `nonstatic`, which results in trainable pretrained weights.


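Note that this model works for variable text lengths! The idea is to embed the words of a sentence, then use convolutions, max pooling, and concatenation to embed the sentence as a single vector. This single vector is passed through a fully connected layer with a sigmoid to output a single value, which can be interpreted as the probability that the sentence is positive (closer to 1) or negative (closer to 0).

The minimum text length the model accepts is the size of its largest kernel, because each convolution needs at least k tokens to produce an output.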
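
The length arithmetic can be checked with a few lines of plain Python. This is a sketch under the assumption of unpadded 1D convolutions with stride 1, as configured in this notebook; `conv_output_length` and `sentence_vector_size` are hypothetical helper names, not part of PyTorch.

```python
def conv_output_length(seq_len, kernel_size, stride=1):
    """Number of positions an unpadded 1D convolution produces."""
    if seq_len < kernel_size:
        raise ValueError(
            f"sequence length {seq_len} is shorter than kernel size {kernel_size}"
        )
    return (seq_len - kernel_size) // stride + 1


def sentence_vector_size(kernel_sizes, num_filters):
    """Size of the concatenated vector after max-pooling each convolution."""
    return len(kernel_sizes) * num_filters


for k in (3, 4, 5):
    # For a 20-token sentence this prints 18, 17, and 16 positions.
    print(k, conv_output_length(seq_len=20, kernel_size=k))

print(sentence_vector_size([3, 4, 5], num_filters=100))  # 300
```

Max pooling then collapses each of those position axes to a single value per filter, so the sentence vector has the same size regardless of the input length.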

In [None]:
class TextCNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dim,
        kernel_sizes,
        num_filters,
        num_classes,
        d_prob,
        mode,
    ):
        super(TextCNN, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.kernel_sizes = kernel_sizes
        self.num_filters = num_filters
        self.num_classes = num_classes
        self.d_prob = d_prob
        self.mode = mode
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.load_embeddings()
        self.conv = nn.ModuleList(
            [
                nn.Conv1d(
                    in_channels=embedding_dim,
                    out_channels=num_filters,
                    kernel_size=k,
                    stride=1,
                )
                for k in kernel_sizes
            ]
        )
        self.dropout = nn.Dropout(d_prob)
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

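    def forward(self, x):
        # x arrives as (sequence_length, batch_size) because pad_sequence
        # defaults to batch_first=False.
        # Embed as (N, L, D), then move channels first for Conv1d: (N, D, L).
        x = self.embedding(x.T).transpose(1, 2)
        # Each convolution yields (N, num_filters, L - k + 1).
        x = [F.relu(conv(x)) for conv in self.conv]
        # Max-pool over the full length, leaving (N, num_filters).
        x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]
        x = torch.cat(x, dim=1)
        x = self.fc(self.dropout(x))
        # Squeeze only the class dimension so a batch of one keeps shape (1,).
        return torch.sigmoid(x).squeeze(-1)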

    def load_embeddings(self):
        if "static" in self.mode:
            self.embedding.weight.data.copy_(vocab.vectors)
            if "non" not in self.mode:
                # Freeze the parameter itself; setting requires_grad on
                # `.data` would leave the weights trainable.
                self.embedding.weight.requires_grad = False
                print("Loaded pretrained embeddings, weights are not trainable.")
            else:
                self.embedding.weight.requires_grad = True
                print("Loaded pretrained embeddings, weights are trainable.")
        elif self.mode == "rand":
            print("Randomly initialized embeddings are used.")
        else:
            raise ValueError(
                "Unexpected value of mode. Please choose from static, nonstatic, rand."
            )

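## Creating Model, Optimizer and Loss

Below we create an instance of the TextCNN model and load the embeddings in **static** mode. The model is placed on a device, and then a binary cross entropy loss function and an Adam optimizer are set up.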

In [None]:
vocab_size, embedding_dim = vocab.vectors.shape

model = TextCNN(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    kernel_sizes=[3, 4, 5],
    num_filters=100,
    num_classes=1,
    d_prob=0.5,
    mode="static",
)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
criterion = nn.BCELoss()
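
For intuition about the loss, binary cross entropy can be computed by hand. The sketch below is a plain-Python illustration that averages over the batch, assuming probabilities strictly between 0 and 1; `binary_cross_entropy` is a hypothetical helper written for this example.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean binary cross entropy for probabilities in (0, 1) and 0/1 targets."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

Confident but wrong predictions are penalized far more heavily than confident and correct ones are rewarded, which is the behavior we want when training on a sigmoid output.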

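## Training and Evaluating using Ignite

### Trainer Engine - process_function

Ignite's Engine allows users to define a process_function to process a given batch; it is applied to all the batches of the dataset. This is a general class that can be used to train and validate models! A process_function has two parameters: engine and batch.

Let's walk through what the training function does:

* Sets model in train mode.
* Sets the gradients of the optimizer to zero.
* Generates x and y from batch.
* Performs a forward pass to calculate y_pred using model and x.
* Calculates loss using y_pred and y.
* Performs a backward pass using loss to calculate gradients for the model parameters.
* Optimizes the model parameters using the gradients and the optimizer.
* Returns the scalar loss.

Below is a single operation during the training process. This process_function will be attached to the training engine.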

In [None]:
def process_function(engine, batch):
    model.train()
    optimizer.zero_grad()
    y, x = batch
    x = x.to(device)
    y = y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y.float())
    loss.backward()
    optimizer.step()
    return loss.item()

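### Evaluator Engine - process_function

Similar to the training process function, we set up a function to evaluate a single batch. Here is what eval_function does:

* Sets model in eval mode.
* With torch.no_grad(), no gradients are calculated for any succeeding steps.
* Generates x and y from batch.
* Performs a forward pass on the model to calculate y_pred based on model and x.
* Returns y_pred and y.

Ignite suggests attaching metrics to evaluators and not trainers, because during training the model parameters are constantly changing and it is best to evaluate the model on a stationary model. This matters because the training and evaluation functions differ: training returns a single scalar loss, while evaluation returns y_pred and y, and that output is used to calculate metrics per batch for the entire dataset.

All metrics in Ignite require y_pred and y as outputs of the function attached to the Engine.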

In [None]:
def eval_function(engine, batch):
    model.eval()
    with torch.no_grad():
        y, x = batch
        y = y.to(device)
        x = x.to(device)
        y = y.float()
        y_pred = model(x)
        return y_pred, y

### Instantiating Training and Evaluating Engines

Below we create 3 engines: a trainer, a training evaluator, and a validation evaluator. You'll notice that train_evaluator and validation_evaluator use the same function; we'll see later why this was done!

In [None]:
trainer = Engine(process_function)
train_evaluator = Engine(eval_function)
validation_evaluator = Engine(eval_function)

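### Metrics - RunningAverage, Accuracy and Loss

To start, we attach a RunningAverage metric to track a running average of the scalar loss output for each batch.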

In [None]:
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
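
A running average of this kind is usually an exponentially weighted one. The sketch below shows the idea in plain Python; the decay factor `alpha=0.98` and the choice to seed the average with the first value are assumptions of this illustration, and `running_average` is a hypothetical helper rather than Ignite's code.

```python
def running_average(values, alpha=0.98):
    """Exponentially weighted running average, seeded with the first value."""
    averaged = []
    current = None
    for value in values:
        if current is None:
            current = value
        else:
            # Keep `alpha` of the history and blend in the new value.
            current = alpha * current + (1.0 - alpha) * value
        averaged.append(current)
    return averaged


losses = [1.0, 0.0, 0.0]
print(running_average(losses, alpha=0.5))  # [1.0, 0.5, 0.25]
```

A decay factor close to 1 smooths out the batch-to-batch noise in the loss, at the cost of reacting more slowly to real changes.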

Now there are two metrics that we want to use for evaluation: accuracy and loss. This is a binary problem, so for Loss we can simply pass the binary cross entropy function as the loss function.

For Accuracy, Ignite requires y_pred and y to consist only of 0s and 1s. Since our model outputs come from a sigmoid layer, the values are between 0 and 1, so we need to write a function that transforms `engine.state.output`, which consists of y_pred and y.

Below, `thresholded_output_transform` does just that: it rounds y_pred to 0s and 1s, and then returns the rounded y_pred and y. This function is passed as the output_transform of `Accuracy`, which applies it to `engine.state.output`.

Now, we attach `Loss` and `Accuracy` (with `thresholded_output_transform`) to train_evaluator and validation_evaluator. 

To attach a metric to engine, the following format is used:
* `Metric(output_transform=output_transform, ...).attach(engine, 'metric_name')`


In [None]:
def thresholded_output_transform(output):
    y_pred, y = output
    y_pred = torch.round(y_pred)
    return y_pred, y
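
The effect of thresholding on accuracy can be seen with a small plain-Python example. Here `threshold` and `accuracy` are hypothetical helpers, and the cutoff of 0.5 is an assumption chosen to match rounding for the values shown.

```python
def threshold(probs, cutoff=0.5):
    """Turn probabilities into hard 0/1 predictions."""
    return [1 if p > cutoff else 0 for p in probs]


def accuracy(preds, targets):
    """Fraction of predictions that match their targets."""
    return sum(int(p == t) for p, t in zip(preds, targets)) / len(targets)


probs = [0.91, 0.35, 0.62, 0.08]
targets = [1, 0, 0, 0]
preds = threshold(probs)
print(preds)                     # [1, 0, 1, 0]
print(accuracy(preds, targets))  # 0.75
```

The third example is predicted positive with probability 0.62 but is actually negative, so three of the four predictions are correct.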

In [None]:
Accuracy(output_transform=thresholded_output_transform).attach(
    train_evaluator, "accuracy"
)
Loss(criterion).attach(train_evaluator, "bce")

In [None]:
Accuracy(output_transform=thresholded_output_transform).attach(
    validation_evaluator, "accuracy"
)
Loss(criterion).attach(validation_evaluator, "bce")

### Progress Bar

Next we create an instance of Ignite's ProgressBar, attach it to the trainer, and pass it the keys of `engine.state.metrics` to track.
Here, the progress bar will track `engine.state.metrics['loss']`.

In [None]:
pbar = ProgressBar(persist=True, bar_format="")
pbar.attach(trainer, ["loss"])

### EarlyStopping - Tracking Validation Loss

Now we'll set up an EarlyStopping handler for this training process. EarlyStopping requires a score_function that lets the user define the criterion for stopping training; a higher score counts as an improvement, which is why the function below returns the negative validation loss. In this case, if the loss on the validation set does not decrease in 5 epochs, the training process will stop early.

In [None]:
def score_function(engine):
    val_loss = engine.state.metrics["bce"]
    return -val_loss


handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)
validation_evaluator.add_event_handler(Events.COMPLETED, handler)
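
The patience logic behind early stopping is easy to sketch in plain Python. This is a simplified illustration, not Ignite's implementation: `PatienceTracker` is a hypothetical class, and it treats any strictly higher score as an improvement.

```python
class PatienceTracker:
    """Signals a stop after `patience` consecutive checks without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = None
        self.counter = 0

    def should_stop(self, score):
        if self.best_score is None or score > self.best_score:
            # New best score: remember it and reset the counter.
            self.best_score = score
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience


val_losses = [0.70, 0.65, 0.66, 0.67, 0.64, 0.68, 0.69, 0.70]
tracker = PatienceTracker(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    # Negate the loss so that a lower loss gives a higher score.
    if tracker.should_stop(-loss):
        stopped_at = epoch
        break

print(stopped_at)  # 8
```

The improvement at epoch 5 resets the counter, so training only stops after epochs 6, 7, and 8 all fail to beat the best loss of 0.64.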

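### Attaching Custom Functions to Engine at specific Events

Below you will see how to define your own custom functions and attach them to various `Events` of the training process.

The functions below both achieve similar tasks: they print the results of an evaluator run on a dataset. One function does that for the training evaluator and dataset, while the other does it for validation. Another difference is how these functions are attached to the trainer engine.

The first method uses a decorator. The syntax is simple: `@trainer.on(Events.EPOCH_COMPLETED)` means that the decorated function will be attached to the trainer and called at the end of each epoch.

The second method uses the trainer's add_event_handler method: `trainer.add_event_handler(Events.EPOCH_COMPLETED, custom_function)`. This achieves the same result as above.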

In [None]:
@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(engine):
    train_evaluator.run(train_iterator)
    metrics = train_evaluator.state.metrics
    avg_accuracy = metrics["accuracy"]
    avg_bce = metrics["bce"]
    pbar.log_message(
        "Training Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
            engine.state.epoch, avg_accuracy, avg_bce
        )
    )


def log_validation_results(engine):
    validation_evaluator.run(valid_iterator)
    metrics = validation_evaluator.state.metrics
    avg_accuracy = metrics["accuracy"]
    avg_bce = metrics["bce"]
    pbar.log_message(
        "Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
            engine.state.epoch, avg_accuracy, avg_bce
        )
    )
    pbar.n = pbar.last_print_n = 0


trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)
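
To see why the decorator and `add_event_handler` are equivalent, consider a toy event dispatcher. `MiniEngine` below is a hypothetical class written for this illustration and is not Ignite's `Engine`; it only shows that the decorator form is a thin wrapper around the registration method.

```python
from collections import defaultdict


class MiniEngine:
    """A toy event dispatcher that mimics the handler registration pattern."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def add_event_handler(self, event, handler):
        self._handlers[event].append(handler)

    def on(self, event):
        # Decorator form: registers the function and returns it unchanged.
        def decorator(handler):
            self.add_event_handler(event, handler)
            return handler

        return decorator

    def fire(self, event):
        # Handlers run in the order in which they were registered.
        for handler in self._handlers[event]:
            handler(self)


calls = []
engine = MiniEngine()


@engine.on("epoch_completed")
def log_training(engine):
    calls.append("training")


def log_validation(engine):
    calls.append("validation")


engine.add_event_handler("epoch_completed", log_validation)
engine.fire("epoch_completed")
print(calls)  # ['training', 'validation']
```

Both handlers end up in the same list for the same event, so the choice between the two styles is purely a matter of readability.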

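### ModelCheckpoint

Lastly, we want to checkpoint this model. This is important because training can be time consuming, and if something goes wrong partway through, a model checkpoint lets us restart training from the point of failure.

Below we use Ignite's `ModelCheckpoint` handler to checkpoint the model at the end of each epoch.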

In [None]:
checkpointer = ModelCheckpoint(
    "/tmp/models", "textcnn", n_saved=2, create_dir=True, save_as_state_dict=True
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {"textcnn": model})
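
The `n_saved=2` setting means only the two most recent checkpoints are kept on disk. The sketch below simulates that rotation without touching the file system; `rotate_checkpoints` and the `textcnn_<epoch>.pt` file names are hypothetical and do not reflect the handler's real naming scheme.

```python
from collections import deque


def rotate_checkpoints(epochs, n_saved, prefix="textcnn"):
    """Simulate keeping only the `n_saved` most recent checkpoint names."""
    kept = deque()
    removed = []
    for epoch in epochs:
        kept.append(f"{prefix}_{epoch}.pt")
        if len(kept) > n_saved:
            # The oldest checkpoint is dropped once the limit is exceeded.
            removed.append(kept.popleft())
    return list(kept), removed


kept, removed = rotate_checkpoints(range(1, 5), n_saved=2)
print(kept)     # ['textcnn_3.pt', 'textcnn_4.pt']
print(removed)  # ['textcnn_1.pt', 'textcnn_2.pt']
```

Keeping a small, fixed number of checkpoints bounds disk usage while still leaving a recent state to resume from.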

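### Run Engine

Next, we run the trainer for 20 epochs and monitor the results. Below we can see that ProgressBar prints the loss per iteration and prints the results of training and validation as we specified in our custom functions.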

In [None]:
trainer.run(train_iterator, max_epochs=20)

That's it! We have successfully trained and evaluated a convolutional neural network for text classification.