<a href="https://colab.research.google.com/github/AllplePine/NLPTeam/blob/master/fnet_classification_with_keras_nlp_contrast_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ``NLP big assignment``: Text classification with FNet (compared against a vanilla Transformer)

**Author:** [gloomy](https://github.com/3126058535/)<br>
**Date created:** 2024/05/20<br>
**Last modified:** 2024/05/21<br>
**Description:** Text classification on the [IMDB](https://huggingface.co/datasets/stanfordnlp/imdb) dataset using the `keras_nlp.layers.FNetEncoder` layer.

## Introduction

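In this example, we show that FNet can achieve results comparable to a vanilla Transformer model on a text classification task.
We use the IMDb dataset, a collection of movie reviews labelled as positive or negative (sentiment analysis).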

To build the tokenizer, the model, and so on, we use components from
[KerasNLP](https://github.com/keras-team/keras-nlp).

### Model

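Transformer-based language models (LMs) such as BERT, RoBERTa, and XLNet have demonstrated how effective the self-attention mechanism is at computing rich embeddings of input text. However, self-attention is an expensive operation, with a time complexity of `O(n^2)`, where `n` is the number of tokens in the input. There has therefore been a sustained effort to reduce the time complexity of self-attention and improve performance without sacrificing the quality of the results.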

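However, the paper
[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824)
replaces the self-attention layers in BERT with a simple Fourier transform layer
for "token mixing". This yields comparable accuracy together with a speedup during training. A few points from the paper stand out: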

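* The authors report that FNet trains 80% faster than BERT on GPUs and 70% faster on TPUs. There are two reasons for this speedup:
    - the Fourier transform layer is unparameterized, i.e. it has no trainable parameters;
    - the authors use the Fast Fourier Transform (FFT), which reduces the time complexity from `O(n^2)`
(for self-attention) to `O(n log n)`.
* On the other hand, FNet only reaches 92-97% of BERT's accuracy on the GLUE benchmark.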


Tip: set against a training speedup of around 70%, this loss in accuracy looks like an acceptable trade-off.
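To make the complexity argument and the "token mixing" idea concrete, here is a minimal pure-Python sketch. It compares a naive `O(n^2)` DFT with a radix-2 FFT and then applies the FNet-style mixing step (a Fourier transform along the hidden dimension, then along the sequence dimension, keeping only the real part). The function names and the toy input are hypothetical, and this is not how `keras_nlp.layers.FNetEncoder` is implemented internally; it only illustrates the operation described in the paper.

```python
import cmath


def naive_dft(x):
    """O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """O(n log n) radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fourier_mix(tokens):
    """FNet-style mixing: FFT over the hidden dim, then the sequence dim, real part only."""
    seq_len, hidden = len(tokens), len(tokens[0])
    hidden_mixed = [fft(row) for row in tokens]
    columns = [
        fft([hidden_mixed[t][h] for t in range(seq_len)]) for h in range(hidden)
    ]
    return [[columns[h][t].real for h in range(hidden)] for t in range(seq_len)]


# Toy input: 4 tokens with hidden size 2 (both powers of two for this simple FFT).
toy = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 3.0]]
mixed = fourier_mix(toy)
print(mixed)
```

Note that the mixing step has no weights at all: every output position is a fixed linear combination of all input positions, which is why the layer adds no trainable parameters.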

## Setup

Install and import all the necessary packages.

In [None]:
!pip install -q --upgrade keras-nlp
!pip install -q --upgrade keras  # Upgrade to Keras 3.


In [None]:
import keras_nlp
import keras
import tensorflow as tf
import os

keras.utils.set_random_seed(42)

Define the hyperparameters.

In [None]:
BATCH_SIZE = 64
EPOCHS = 3
MAX_SEQUENCE_LENGTH = 512
VOCAB_SIZE = 15000

EMBED_DIM = 128
INTERMEDIATE_DIM = 512

## Loading the dataset

Load the IMDB dataset, which we use for a binary sentiment classification task (it can also be loaded with the Hugging Face `datasets` library).

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz

--2024-05-29 15:01:39--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-05-29 15:02:09 (2.73 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



Samples are present in the form of text files. Let's inspect the structure of
the directory.

In [None]:
print(os.listdir("./aclImdb"))
print(os.listdir("./aclImdb/train"))
print(os.listdir("./aclImdb/test"))

['imdb.vocab', 'test', 'README', 'imdbEr.txt', 'train']
['unsup', 'unsupBow.feat', 'pos', 'urls_unsup.txt', 'neg', 'urls_neg.txt', 'urls_pos.txt', 'labeledBow.feat']
['pos', 'neg', 'urls_neg.txt', 'urls_pos.txt', 'labeledBow.feat']


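The directory contains two sub-directories: `train` and `test`. Each of them in turn contains two folders, `pos` and `neg`, for positive and negative reviews respectively. Before loading the dataset, delete the `./aclImdb/train/unsup` folder, since it contains unlabelled samples.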

In [None]:
!rm -rf aclImdb/train/unsup

We use the `keras.utils.text_dataset_from_directory` utility to generate
labelled `tf.data.Dataset` datasets from the text files.

In [None]:
train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=0.2,
    subset="training",
    seed=42,
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=0.2,
    subset="validation",
    seed=42,
)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=BATCH_SIZE)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
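The split sizes reported above, and the step counts that show up later in the progress bars (`313/313` during training, `391/391` during evaluation), follow directly from `validation_split` and `BATCH_SIZE`. A small worked check, using only the numbers printed in this notebook (the variable names are just for illustration):

```python
import math

TOTAL_TRAIN_FILES = 25_000  # "Found 25000 files belonging to 2 classes."
TOTAL_TEST_FILES = 25_000
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 64

val_files = round(TOTAL_TRAIN_FILES * VALIDATION_SPLIT)
train_files = TOTAL_TRAIN_FILES - val_files

# The last batch may be smaller than BATCH_SIZE, hence the ceiling.
train_steps = math.ceil(train_files / BATCH_SIZE)
test_steps = math.ceil(TOTAL_TEST_FILES / BATCH_SIZE)

print(train_files, val_files)   # files used for training / validation
print(train_steps, test_steps)  # batches per training epoch / per test pass
```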


Convert the text to lowercase.

In [None]:
train_ds = train_ds.map(lambda x, y: (tf.strings.lower(x), y))
val_ds = val_ds.map(lambda x, y: (tf.strings.lower(x), y))
test_ds = test_ds.map(lambda x, y: (tf.strings.lower(x), y))

Print a few samples.

In [None]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])


b'an illegal immigrant resists the social support system causing dire consequences for many. well filmed and acted even though the story is a bit forced, yet the slow pacing really sets off the conclusion. the feeling of being lost in the big city is effectively conveyed. the little person lost in the big society is something to which we can all relate, but i cannot endorse going out of your way to see this movie.'
0
b"to get in touch with the beauty of this film pay close attention to the sound track, not only the music, but the way all sounds help to weave the imagery. how beautifully the opening scene leading to the expulsion of gino establishes the theme of moral ambiguity! note the way music introduces the characters as we are led inside giovanna's marriage. don't expect to find much here of the political life of italy in 1943. that's not what this is about. on the other hand, if you are susceptible to the music of images and sounds, you will be led into a word that reaches beyond

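This notebook uses the `keras_nlp.tokenizers.WordPieceTokenizer` layer to tokenize the text. `keras_nlp.tokenizers.WordPieceTokenizer` takes a WordPiece vocabulary and provides functions for tokenizing text and for detokenizing sequences of tokens.

Before defining the tokenizer, we first need to train it on the dataset we have. As covered in class, WordPiece is a subword tokenization algorithm; training it on a corpus gives us a vocabulary of subwords. A subword tokenizer is a compromise between a word tokenizer (which needs a very large vocabulary to cover the input words well) and a character tokenizer (characters do not really encode meaning the way words do). With KerasNLP's `keras_nlp.tokenizers.compute_word_piece_vocabulary` utility, training WordPiece on a corpus becomes very simple.

Tip: the official FNet implementation uses a SentencePiece tokenizer.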
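To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece tokenizers apply to a single word once a vocabulary exists. The toy vocabulary and the function name are hypothetical, and the `##` prefix for word-internal pieces is assumed as the usual WordPiece convention; the real KerasNLP layer additionally handles whitespace and punctuation splitting, which is left out here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", suffix_indicator="##"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"[PAD]", "[UNK]", "un", "##watch", "##able", "watch", "movie", "##s"}

print(wordpiece_tokenize("unwatchable", toy_vocab))
print(wordpiece_tokenize("movies", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With a subword vocabulary, a rare word such as "unwatchable" can still be represented by pieces that each occur often in the corpus, instead of collapsing to `[UNK]`.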

In [None]:

def train_word_piece(ds, vocab_size, reserved_tokens):
    word_piece_ds = ds.unbatch().map(lambda x, y: x)
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab


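Every vocabulary has a few special, reserved tokens. We use two of them:

- `"[PAD]"` - the padding token. Padding tokens are appended to the input sequence when it is shorter than the maximum sequence length.
- `"[UNK]"` - the unknown token.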

In [None]:
reserved_tokens = ["[PAD]", "[UNK]"]
# Train the WordPiece vocabulary directly on the lowercased training split.
vocab = train_word_piece(train_ds, VOCAB_SIZE, reserved_tokens)

Print some tokens!

In [None]:
print("Tokens: ", vocab[100:110])

Tokens:  ['à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é']


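Now define the tokenizer, configured with the vocabulary trained above. We also set a maximum sequence length: sequences shorter than this length are padded up to it, and longer sequences are truncated.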

In [None]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=False,
    sequence_length=MAX_SEQUENCE_LENGTH,
)
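The effect of `sequence_length` can be sketched in a few lines of plain Python. This is only an illustration of the pad-or-truncate rule described above, not the tokenizer's actual code; the helper name is hypothetical, and the padding id `0` is an assumption based on `"[PAD]"` being listed first in `reserved_tokens`.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Return exactly `sequence_length` ids: cut long inputs, pad short ones."""
    clipped = token_ids[:sequence_length]
    return clipped + [pad_id] * (sequence_length - len(clipped))


print(pad_or_truncate([5, 6, 7], sequence_length=5))           # short input is padded
print(pad_or_truncate([1, 2, 3, 4, 5, 6], sequence_length=4))  # long input is truncated
```

Fixing the length this way lets every review in a batch share the same shape, `(BATCH_SIZE, MAX_SEQUENCE_LENGTH)`.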

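Let's tokenize one sample from the dataset. To check that the text was tokenized correctly, we can detokenize the list of tokens back into the original text.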

In [None]:
input_sentence_ex = train_ds.take(1).get_single_element()[0][0]
input_tokens_ex = tokenizer(input_sentence_ex)

print("Sentence: ", input_sentence_ex)
print("Tokens: ", input_tokens_ex)
print("Recovered text after detokenizing: ", tokenizer.detokenize(input_tokens_ex))


Sentence:  tf.Tensor(b'this picture seemed way to slanted, it\'s almost as bad as the drum beating of the right wing kooks who say everything is rosy in iraq. it paints a picture so unredeemable that i can\'t help but wonder about it\'s legitimacy and bias. also it seemed to meander from being about the murderous carnage of our troops to the lack of health care in the states for ptsd. to me the subject matter seemed confused, it only cared about portraying the military in a bad light, as a) an organzation that uses mind control to turn ordinary peace loving civilians into baby killers and b) an organization that once having used and spent the bodies of it\'s soldiers then discards them to the despotic bureacracy of the v.a. this is a legitimate argument, but felt off topic for me, almost like a movie in and of itself. i felt that "the war tapes" and "blood of my brother" were much more fair and let the viewer draw some conclusions of their own rather than be beaten over the head with t

## Formatting the dataset

Next, we format the dataset into the form the model expects as input, which means tokenizing the text.

In [None]:

def format_dataset(sentence, label):
    sentence = tokenizer(sentence)
    return ({"input_ids": sentence}, label)


def make_dataset(dataset):
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
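    # Cache the tokenized batches first, then shuffle, so that every epoch sees a new order.
    return dataset.cache().shuffle(512).prefetch(tf.data.AUTOTUNE)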


train_ds = make_dataset(train_ds)
val_ds = make_dataset(val_ds)
test_ds = make_dataset(test_ds)

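**Defining the model**


First, we need an embedding layer, i.e. a layer that maps every token in the input sequence to a vector. This embedding layer can be initialized randomly. We also need a positional embedding layer, which encodes the word order in the sequence.
The convention is to add these two embeddings together. KerasNLP provides a `keras_nlp.layers.TokenAndPositionEmbedding` layer that performs all of the steps above.
The FNet classification model consists of three `keras_nlp.layers.FNetEncoder` layers, followed by global average pooling, dropout, and a `keras.layers.Dense` layer on top.


Tip: for FNet, masking the padding tokens has only a minimal effect on the results. In the official implementation, the padding tokens are not masked.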
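The "sum of token embedding and position embedding" step can be sketched without any framework. The tables below are tiny, randomly initialized stand-ins (hypothetical sizes and helper names); the real `keras_nlp.layers.TokenAndPositionEmbedding` layer learns both tables during training.

```python
import random


def make_table(num_rows, dim, rng):
    """Randomly initialized lookup table with one `dim`-sized vector per row."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_rows)]


def token_and_position_embedding(token_ids, token_table, position_table):
    """Look up each token's vector and add the vector of its position."""
    return [
        [tok + pos for tok, pos in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]


rng = random.Random(42)
token_table = make_table(num_rows=10, dim=4, rng=rng)    # toy vocabulary of 10 tokens
position_table = make_table(num_rows=6, dim=4, rng=rng)  # toy maximum length of 6

# Token id 3 appears at positions 0 and 2 and receives a different vector each time.
embedded = token_and_position_embedding([3, 7, 3], token_table, position_table)
print(embedded[0])
print(embedded[2])
```

Because the position vector is added in, the same word at two different places in a review is represented differently, which is how word order reaches the encoder layers.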

In [None]:
input_ids = keras.Input(shape=(None,), dtype="int64", name="input_ids")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(input_ids)

x = keras_nlp.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)
x = keras_nlp.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)
x = keras_nlp.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)


x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

fnet_classifier = keras.Model(input_ids, outputs, name="fnet_classifier")



## Training our model

We use accuracy to monitor performance on the validation set, and train for 3 epochs.

In [None]:
fnet_classifier.summary()
fnet_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
fnet_classifier.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 35s 70ms/step - accuracy: 0.6095 - loss: 0.6275 - val_accuracy: 0.8624 - val_loss: 0.3232
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 25s 46ms/step - accuracy: 0.8765 - loss: 0.2945 - val_accuracy: 0.8320 - val_loss: 0.4145
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 21s 46ms/step - accuracy: 0.9341 - loss: 0.1666 - val_accuracy: 0.8462 - val_loss: 0.4537


<keras.src.callbacks.history.History at 0x7d9c8ae95ed0>

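After the third epoch, the training accuracy is about 93% and the validation accuracy about 85%. Timed by hand 😜, training for 3 epochs took roughly 1 hour 13 minutes on Colab with a 16 GB Tesla T4 GPU; note that the per-epoch times in the log above add up to only about 81 seconds (35 s + 25 s + 21 s).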
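For reference, the two quantities reported in the training log, `loss` (binary cross-entropy) and `accuracy`, can be computed by hand from sigmoid outputs. The sketch below uses made-up labels and probabilities, and assumes a 0.5 decision threshold and a small clipping constant; it mirrors the definitions, not Keras's internal implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up example: two confident correct predictions, two mild mistakes.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

loss = binary_crossentropy(labels, probs)
acc = binary_accuracy(labels, probs)
print(f"loss={loss:.4f} accuracy={acc:.2f}")
```

Accuracy only looks at which side of the threshold a prediction falls on, while the loss also penalizes how confident a wrong prediction is, so the two metrics do not have to move together from one epoch to the next.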

Let's compute the test accuracy.

In [None]:
fnet_classifier.evaluate(test_ds, batch_size=BATCH_SIZE)


391/391 ━━━━━━━━━━━━━━━━━━━━ 11s 18ms/step - accuracy: 0.8370 - loss: 0.4610


[0.4610680937767029, 0.8356800079345703]

## Comparison with Transformer model

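We now compare the FNet classifier with a Transformer classifier. To keep all parameters and hyperparameters the same, we again use three `TransformerEncoder` layers. The number of attention heads is set to 2.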

In [None]:
NUM_HEADS = 2
input_ids = keras.Input(shape=(None,), dtype="int64", name="input_ids")


x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(input_ids)

x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)


x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

transformer_classifier = keras.Model(input_ids, outputs, name="transformer_classifier")


transformer_classifier.summary()
transformer_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
transformer_classifier.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 82s 183ms/step - accuracy: 0.6547 - loss: 0.6510 - val_accuracy: 0.8856 - val_loss: 0.2750
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 43s 110ms/step - accuracy: 0.9058 - loss: 0.2395 - val_accuracy: 0.8888 - val_loss: 0.2927
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 42s 112ms/step - accuracy: 0.9438 - loss: 0.1570 - val_accuracy: 0.8692 - val_loss: 0.3848


<keras.src.callbacks.history.History at 0x7d9c0a2f0dc0>

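After the third epoch, the training accuracy is about 94% and the validation accuracy about 87%. Timed by hand, training took roughly two and a half hours on Colab with a 16 GB Tesla T4 GPU; note that the per-epoch times in the log above add up to only about 167 seconds (82 s + 43 s + 42 s).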
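The speedup can be checked directly from the numbers reported so far. The sketch below uses the per-epoch durations printed in the two `fit` logs, plus the hand-timed figures from the text (about 1 hour 13 minutes for FNet and about two and a half hours for the Transformer); the variable names are only for illustration.

```python
# Per-epoch wall-clock times (seconds) as printed in the two training logs.
fnet_epoch_seconds = [35, 25, 21]
transformer_epoch_seconds = [82, 43, 42]

fnet_total = sum(fnet_epoch_seconds)
transformer_total = sum(transformer_epoch_seconds)
logged_speedup = transformer_total / fnet_total

# Hand-timed figures from the text, converted to minutes.
fnet_hand_timed_minutes = 1 * 60 + 13
transformer_hand_timed_minutes = 2 * 60 + 30
hand_timed_speedup = transformer_hand_timed_minutes / fnet_hand_timed_minutes

print(f"logged: {fnet_total}s vs {transformer_total}s -> {logged_speedup:.2f}x")
print(f"hand-timed: {hand_timed_speedup:.2f}x")
```

Both ways of measuring put FNet at roughly twice the training speed of the Transformer on this setup.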

In [None]:
transformer_classifier.evaluate(test_ds, batch_size=BATCH_SIZE)

391/391 ━━━━━━━━━━━━━━━━━━━━ 15s 39ms/step - accuracy: 0.8351 - loss: 0.5035


[0.5014781355857849, 0.8348000049591064]

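The table below compares the two models, using the final-epoch and test metrics logged above. `FNet` roughly halves the training time (about 2.1 times faster, both by the hand-timed totals and by the logged epoch times), while test accuracy is essentially unchanged (83.57% vs. 83.48%) and the final validation accuracy is about 2.3 percentage points lower.

|                                | **FNet Classifier** | **Transformer Classifier** |
|:------------------------------:|:-------------------:|:--------------------------:|
| **Training Time (hand-timed)** |      1 h 14 min     |         2 h 34 min         |
|       **Train Accuracy**       |        93.41%       |           94.38%           |
|     **Validation Accuracy**    |        84.62%       |           86.92%           |
|        **Test Accuracy**       |        83.57%       |           83.48%           |
|           **#Params**          |      2,321,921      |          2,520,065         |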
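The parameter gap between the two models can be estimated by hand from the hyperparameters. The sketch below is an estimate under stated assumptions, not the output of `model.summary()`. It assumes that each encoder block contains two dense layers (`EMBED_DIM -> INTERMEDIATE_DIM -> EMBED_DIM`) and two layer normalizations, that the Fourier mixing has no weights, and that multi-head attention adds four `EMBED_DIM x EMBED_DIM` projections with biases (query, key, value, output).

```python
VOCAB_SIZE = 15000
MAX_SEQUENCE_LENGTH = 512
EMBED_DIM = 128
INTERMEDIATE_DIM = 512
NUM_LAYERS = 3


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale (gamma) and offset (beta) vectors."""
    return 2 * dim


# Token table plus position table.
embedding = VOCAB_SIZE * EMBED_DIM + MAX_SEQUENCE_LENGTH * EMBED_DIM

feedforward = dense_params(EMBED_DIM, INTERMEDIATE_DIM) + dense_params(
    INTERMEDIATE_DIM, EMBED_DIM
)
# Assumed FNet block: parameter-free Fourier mixing, feed-forward, two layer norms.
fnet_layer = feedforward + 2 * layer_norm_params(EMBED_DIM)
# Assumed Transformer block: the same, plus query/key/value/output projections.
attention = 4 * dense_params(EMBED_DIM, EMBED_DIM)
transformer_layer = fnet_layer + attention

head = dense_params(EMBED_DIM, 1)  # final Dense(1) classifier

fnet_total = embedding + NUM_LAYERS * fnet_layer + head
transformer_total = embedding + NUM_LAYERS * transformer_layer + head

print(f"FNet: {fnet_total:,}")
print(f"Transformer: {transformer_total:,}")
print(f"Difference: {transformer_total - fnet_total:,}")
```

Under these assumptions the difference between the two models is 198,144 parameters, all of it in the attention projections, which matches the gap between the two `#Params` entries in the table. The absolute totals from this estimate (2,382,337 and 2,580,481) are each 60,416 higher than the table's figures, so the `summary()` output of the actual models remains the reference for exact counts.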