# Course Preface

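This is Part 1 of the *AI Security* course: the hands-on lab on textual adversarial attacks.

We previously looked at perturbation methods for continuous data such as image pixels. This session focuses on adversarial perturbation of discrete data, namely text.

Images live in a continuous space where the loss is differentiable with respect to the input, whereas text is a discrete, combinatorial sequence of symbols, so adversarial attacks on text call for a different design.

A textual adversarial perturbation makes local, meaning-preserving modifications to the original text (for example, replacing, inserting, or deleting specific characters) to produce an adversarial example that remains readable to humans but causes an NLP model to misclassify it.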

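A text perturbation technique has to satisfy two core requirements:
1. The perturbed text must remain grammatical and semantically coherent.
2. The modification must be as small as possible, so that the adversarial example evades the model while staying readable to humans.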
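One way to make "as small as possible" measurable is the character-level edit distance between the original and the perturbed text. The sketch below is an illustration only: the helper names `levenshtein` and `modification_rate` are hypothetical and are not used elsewhere in this notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def modification_rate(original: str, perturbed: str) -> float:
    """Edit distance normalised by the length of the longer string."""
    longest = max(len(original), len(perturbed))
    return levenshtein(original, perturbed) / longest if longest else 0.0


# One substituted character out of nine
print(modification_rate("fantastic", "fantas+ic"))
```

A lower rate means a less conspicuous perturbation. This metric only captures surface changes and says nothing about whether the meaning is preserved.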

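Take sentiment classification as an example: in the negative review "This movie is terrible", replacing the key adjective with the weaker "barely passable" may mislead the model into predicting neutral sentiment. Attacks like this expose how fragile NLP models are to subtle semantic changes.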

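Mainstream textual adversarial attacks can be classified along two dimensions:
1. Perturbation granularity: character-level, word-level, or phrase-level attacks, depending on the unit being modified.
2. Generation strategy: gradient-based optimization or heuristic substitution, depending on the search algorithm.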

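# 1. Lab Setup

BERT is a pretrained language model proposed by Google. It is built on the Transformer architecture and represents text using bidirectional context. bert-base-uncased-imdb is a version of BERT fine-tuned on the IMDB movie review dataset for binary sentiment classification (negative/positive), with an accuracy of about 94%.

In this lab, we use the bert-base-uncased-imdb sentiment classifier as the target model for our textual adversarial attacks.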


In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score


import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

from datasets import load_dataset
from datasets import load_from_disk

import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import gensim.downloader as api
from nltk.tokenize import word_tokenize

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


We start by attacking a single sample to see what the attack does.

In [2]:
model_name = "textattack/bert-base-uncased-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()
model.to(device)

# Define the original text
text = "The movie was fantastic! The acting was superb and the plot kept me engaged throughout."



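**Rule-based perturbation attack**

This attack modifies the text using hand-designed rules and is simple to implement.

First, we define a substitution strategy by hand.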

In [3]:
# Character substitution rules
perturbation_map = {
    'a': ['@', 'ä', 'à', 'á'],
    'e': ['3', 'é', 'è'],
    'i': ['1', '!', 'í'],
    'o': ['0', 'ö', 'ó'],
    's': ['$', '5'],
    't': ['7', '+']
}

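This mapping gives us a character-level perturbation.
Next, we only need to apply it to a sample to generate an adversarial example.

To keep the adversarial example inconspicuous, we introduce a probability: each character in the text that appears in the mapping is replaced with probability `prob`: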

In [4]:
# Character-level perturbation attack
def char_perturbation(texts, prob=0.2):
    if isinstance(texts, str):
        texts = [texts]
        single_output = True
    else:
        single_output = False

    perturbed_texts = []
    for text in texts:
        perturbed = []
        for char in text.lower():
            if char in perturbation_map and torch.rand(1).item() < prob:
                choices = perturbation_map[char]
                index = torch.multinomial(torch.ones(len(choices)), 1).item()
                perturbed.append(choices[index])
            else:
                perturbed.append(char)
        perturbed_texts.append(''.join(perturbed))

    if single_output:
        return perturbed_texts[0]
    else:
        return perturbed_texts
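To get a feel for how visible the perturbation will be at a given `prob`, it helps to count the eligible characters. The sketch below is a rough illustration, assuming each eligible character is replaced independently, as in `char_perturbation`. `MAPPABLE` is a hypothetical stand-in for the keys of `perturbation_map`, and both helper names are made up for this example.

```python
# Hypothetical stand-in for set(perturbation_map.keys())
MAPPABLE = set("aeiost")


def expected_substitutions(text: str, mappable: set, prob: float = 0.2) -> float:
    """Expected number of replaced characters: eligible count times prob."""
    eligible = sum(1 for ch in text.lower() if ch in mappable)
    return eligible * prob


def prob_at_least_one(text: str, mappable: set, prob: float = 0.2) -> float:
    """Probability that at least one character is replaced."""
    eligible = sum(1 for ch in text.lower() if ch in mappable)
    return 1.0 - (1.0 - prob) ** eligible


sample = "The movie was fantastic!"
print(expected_substitutions(sample, MAPPABLE))
print(prob_at_least_one(sample, MAPPABLE))
```

Raising `prob` makes the attack stronger but also easier for a human reader to notice, which is the trade-off described above.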

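Inspect the perturbed text and the model's predictions:

> The sample is short and the perturbation is random, so if the result is unconvincing, run the cell a few more times.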

In [5]:
# Compare the classification results for the original and the adversarial sample
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model(**inputs)
    return torch.softmax(outputs.logits, dim=1).detach().cpu().numpy()

def output_adversarial_example_and_prediction(text, attack):
    perturbed_text = attack(text)
    print("Original Text:", text)
    print("Perturbed Text:", perturbed_text)

    original_prob = predict(text)
    perturbed_prob = predict(perturbed_text)

    print("\nOriginal Prediction (neg/pos):", original_prob[0])
    print("Perturbed Prediction (neg/pos):", perturbed_prob[0])

output_adversarial_example_and_prediction(text, char_perturbation)

Original Text: The movie was fantastic! The acting was superb and the plot kept me engaged throughout.
Perturbed Text: +he movie was fantas+ic! the acting was superb and 7hè plot kept me engaged throughout.

Original Prediction (neg/pos): [0.00285911 0.9971409 ]
Perturbed Prediction (neg/pos): [0.0055841 0.9944159]
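Since the attack is random, a single run says little. A minimal sketch of repeating the attack and counting how often the predicted label flips is shown below. `flip_rate` is a hypothetical helper, and the stub attack and classifier exist only to keep the example self-contained.

```python
from typing import Callable


def flip_rate(text: str,
              attack: Callable[[str], str],
              predict_label: Callable[[str], int],
              n_trials: int = 20) -> float:
    """Fraction of trials in which the attacked text gets a different label."""
    original = predict_label(text)
    flips = sum(predict_label(attack(text)) != original for _ in range(n_trials))
    return flips / n_trials


# Toy stand-ins so the sketch runs on its own
def toy_attack(text: str) -> str:
    return text.replace("good", "bad")


def toy_classifier(text: str) -> int:
    return 1 if "good" in text else 0


print(flip_rate("a good film", toy_attack, toy_classifier, n_trials=5))
```

With the objects defined in this notebook, one could presumably pass `char_perturbation` as `attack` and `lambda t: int(predict(t)[0].argmax())` as `predict_label`.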


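**Embedding-based perturbation attack**

This attack uses the word-vector space to generate adversarial examples, either through gradients in the continuous space or through semantic substitution in the discrete space.

The model first passes the input through an embedding layer, which maps discrete tokens into a continuous vector space. At this point the optimization problem looks the same as in the image attacks covered earlier, which suggests two approaches:
1. Continuous gradient perturbation (e.g. FGSM): apply a small perturbation directly in the embedding space along the sign of the gradient. Mathematically, the perturbed vector is guaranteed to stay within an $\epsilon$-ball of the original.
2. Discrete semantic substitution (e.g. Word2Vec): retrieve the k nearest valid words in the word-embedding space and pick one of them as the replacement.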
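For reference, the FGSM step in item 1 is commonly written as below. The notation is an assumption introduced here, not taken from the notebook: $e$ is the embedding tensor of the input tokens, $y$ the label, $L$ the classification loss, and $e'$ the perturbed embedding.

$$
e' = e + \epsilon \cdot \operatorname{sign}\left(\nabla_{e} L(e, y)\right), \qquad \lVert e' - e \rVert_\infty \le \epsilon
$$

The bound on the right holds because every coordinate moves by at most $\epsilon$, so the ball in question is measured in the $L_\infty$ norm.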

We go through the implementation of each approach in turn:

In [6]:
def fgsm_perturbation(model, input_text, labels, epsilon, tokenizer):
    '''
    Perturb the input text with FGSM.
    Args:
        model: target model
        input_text: input text
        labels: labels for the input
        epsilon: perturbation strength
        tokenizer: tokenizer
    Returns:
        perturbed_text: the perturbed text
    '''

    # Tokenize the input text and convert it to tensors
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # The embedding output is not a leaf tensor, so no gradient is stored for it.
    # Detach and clone it into a new leaf tensor, and use that tensor's gradient as the embedding gradient.
    embeddings = model.get_input_embeddings()(inputs['input_ids'])
    embeddings = embeddings.detach().clone()
    embeddings.requires_grad = True

    with torch.enable_grad():
        # Feed the embeddings directly as the model input
        outputs = model(inputs_embeds=embeddings, attention_mask=inputs['attention_mask'])
        loss = nn.CrossEntropyLoss()(outputs.logits, labels.to(model.device))

        gradients = torch.autograd.grad(loss, embeddings)[0]
        sign_gradients = gradients.sign()

        # Perturb the embeddings
        perturbed_embeddings = embeddings + epsilon * sign_gradients

    # Map the perturbed embeddings back to input IDs (argmax of the inner product with the embedding matrix)
    perturbed_input_ids = torch.argmax(torch.matmul(perturbed_embeddings, model.get_input_embeddings().weight.t()), dim=-1)

    perturbed_text = tokenizer.decode(perturbed_input_ids.squeeze(), skip_special_tokens=True)
    return perturbed_text

In [7]:
output_adversarial_example_and_prediction(text, lambda x: fgsm_perturbation(model, x, torch.tensor([1]), 0.1, tokenizer))

Original Text: The movie was fantastic! The acting was superb and the plot kept me engaged throughout.
Perturbed Text: themori 780rada! 338 670 she superb 670 the plot kept 670 engaged halftime

Original Prediction (neg/pos): [0.00285911 0.9971409 ]
Perturbed Prediction (neg/pos): [0.5996884 0.4003116]
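The call above uses `epsilon = 0.1`, which looks small per coordinate. Summed over all embedding dimensions it is not: a vector whose every coordinate moves by $\pm\epsilon$ has an $L_2$ norm of $\epsilon\sqrt{d}$. The sketch below does that arithmetic for the 768-dimensional embeddings of bert-base models. `linf_to_l2` is a hypothetical helper, and the result is an upper bound, since coordinates with an exactly zero gradient do not move.

```python
import math


def linf_to_l2(epsilon: float, dim: int) -> float:
    """L2 norm of a vector whose every coordinate is +epsilon or -epsilon."""
    return epsilon * math.sqrt(dim)


hidden_size = 768  # embedding width of bert-base models
epsilon = 0.1      # value passed to fgsm_perturbation above

print(f"per-token L2 shift: {linf_to_l2(epsilon, hidden_size):.3f}")
# prints "per-token L2 shift: 2.771"
```

To judge whether this shift is large, one could compare it with the typical embedding norm, for example `model.get_input_embeddings().weight.norm(dim=1).mean()`.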


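The experiment shows that although the FGSM attack can produce an adversarial example that misleads the classifier, the resulting text is far less readable, which violates the requirement that adversarial perturbations stay readable to humans.

The reason is that FGSM searches for the perturbation in the embedding space. Even though the perturbed vector stays geometrically close to the original embedding (and closeness in this space is what encodes semantic similarity), the mapping between embedding vectors and actual vocabulary entries is not a bijection: a perturbed vector can land in an "empty region" that corresponds to no valid word in the vocabulary. This semantic break during discretization is what ultimately destroys the linguistic coherence of the adversarial example.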
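How the perturbed vector is mapped back to a token also matters. `fgsm_perturbation` picks the token with the largest inner product, and inner products favor embeddings with a large norm. The toy example below shows the effect with invented 2-D vectors: the vocabulary, the vectors, and the token names are all hypothetical, and whether this effect contributes to the output above has not been verified.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Invented 2-D "embeddings"; rare_token has a much larger norm
toy_vocab = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.1],
    "rare_token": [3.0, 3.0],
}

# A slightly perturbed version of the "good" embedding
perturbed = [1.0, 0.2]

by_dot = max(toy_vocab, key=lambda w: dot(perturbed, toy_vocab[w]))
by_cosine = max(toy_vocab, key=lambda w: cosine(perturbed, toy_vocab[w]))

print(by_dot, by_cosine)  # prints "rare_token great"
```

Under the inner-product rule the perturbed vector decodes to an unrelated large-norm token, while cosine similarity returns a close neighbor. Neither rule removes the underlying problem that most points in the space correspond to no word.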

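For this reason, attacking directly in the embedding space along the gradient sign is **not a sound approach** for text, so we consider a different idea.

To keep the text semantically coherent, we can base the substitution on a pretrained word-vector model such as Word2Vec. This strategy uses cosine similarity to retrieve the top-K nearest neighbors of the target word in the embedding space and then samples a replacement from them.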

In [8]:
if not os.path.exists("word2vec-google-news-300.model"):
    # Load the pretrained Word2Vec model
    vec_model = api.load("word2vec-google-news-300")

    # Optionally save a local copy
    # vec_model.save("word2vec-google-news-300.model")
else:
    from gensim.models import KeyedVectors
    vec_model = KeyedVectors.load("word2vec-google-news-300.model")

def word_embedding_perturbation(vec_model, texts, num_words=10, prob=0.7):
    if isinstance(texts, str):
        texts = [texts]
        single_output = True
    else:
        single_output = False

    perturbed_texts = []
    for text in texts:
        tokens = word_tokenize(text.lower())
        perturbed_tokens = []
        for token in tokens:
            if token in vec_model and torch.rand(1).item() < prob:
                similar_words = vec_model.most_similar(token, topn=num_words)
                new_word = similar_words[torch.randint(0, num_words, (1,)).item()][0]
                perturbed_tokens.append(new_word)
            else:
                perturbed_tokens.append(token)
        perturbed_texts.append(' '.join(perturbed_tokens))

    if single_output:
        return perturbed_texts[0]
    else:
        return perturbed_texts

output_adversarial_example_and_prediction(text, lambda x: word_embedding_perturbation(vec_model, x))

Original Text: The movie was fantastic! The acting was superb and the plot kept me engaged throughout.
Perturbed Text: this sequel was fantastic ! ofthe acting is superb and in plotted Keeping I Engaged throughout .

Original Prediction (neg/pos): [0.00285911 0.9971409 ]
Perturbed Prediction (neg/pos): [0.00115569 0.99884427]
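The perturbed text above contains replacements such as "ofthe" and "Keeping", and even function words like "the" get swapped. One possible refinement is to skip stopwords and to filter the neighbor list before sampling. The sketch below takes candidates in the `(word, similarity)` form that `most_similar` returns. The stopword set, the threshold, the helper names, and the similarity values in the demo are all hypothetical.

```python
# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"the", "a", "an", "and", "was", "is", "me", "in", "of"}


def should_perturb(token: str) -> bool:
    """Only perturb alphabetic content words."""
    return token.isalpha() and token not in STOPWORDS


def filter_candidates(token, candidates, min_similarity=0.6):
    """Keep lowercase alphabetic neighbours above a similarity threshold."""
    kept = []
    for word, similarity in candidates:
        if similarity < min_similarity:
            continue  # too far from the original word
        if not word.isalpha() or not word.islower():
            continue  # drops phrases like "kept_alive" and cased variants
        if word == token:
            continue
        kept.append(word)
    return kept


# Made-up neighbour list, for illustration only
demo = [("Keeping", 0.70), ("retained", 0.65), ("kept_alive", 0.90), ("held", 0.40)]
print(filter_candidates("kept", demo))  # prints "['retained']"
```

If the filtered list is empty, the original token would simply be kept. Stricter filtering improves readability at the cost of fewer available substitutions.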


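The supplementary notebook `bert_attack.ipynb` provides the BAE attack, an embedding-based text perturbation algorithm similar in spirit to the Word2Vec approach. Take a look if you are interested.

Next, we use the test split of the imdb dataset to evaluate the bert-base-uncased-imdb model under attack.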

In [21]:
# Data preprocessing
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding='max_length', max_length=128)

test_dataset = load_dataset("imdb", split="test")
test_dataset = test_dataset.map(preprocess_function, batched=True)

required_columns = ['input_ids', 'attention_mask', 'label', 'text']
test_dataset.set_format(type='torch', columns=required_columns)

# To save time, evaluate on 64 samples
n_samples = 64
subset_test_dataset = test_dataset.shuffle(seed=42).select(range(n_samples))

test_dataloader = DataLoader(subset_test_dataset, batch_size=16)

In [22]:
def evaluation(model, dataloader, attack):
    model.eval()
    predictions = []
    true_labels = []

    for batch in dataloader:
        original_texts = batch["text"]
        if attack:
            perturbed_texts = attack(original_texts)
        else:
            perturbed_texts = original_texts
        inputs = tokenizer(perturbed_texts, return_tensors="pt", truncation=True, padding='max_length', max_length=128)
        input_ids = inputs['input_ids'].to(device)
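        # Use the mask of the re-tokenized (possibly perturbed) text, not the one precomputed for the original text
        attention_mask = inputs['attention_mask'].to(device)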
        labels = batch['label'].to(device)

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.softmax(logits, dim=1)

            # Convert predicted probabilities to class labels
            pred_labels = torch.argmax(preds, dim=1).cpu().tolist()
            predictions.extend(pred_labels)
            true_labels.extend(labels.cpu().tolist())

    return predictions, true_labels

In [23]:
predictions, true_labels = evaluation(model, test_dataloader, None)
clean_accuracy = accuracy_score(true_labels, predictions)
print(f"Clean accuracy: {clean_accuracy * 100:.2f}%")

predictions, true_labels = evaluation(model, test_dataloader, char_perturbation)
char_perturbation_accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy under character perturbation: {char_perturbation_accuracy * 100:.2f}%")

Clean accuracy: 89.06%
Accuracy under character perturbation: 73.44%


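Because the character substitution rules are few and fairly simple, the model's accuracy may remain relatively high. Treat these numbers as a reference only.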
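On 64 samples, the two accuracies above correspond to 57 and 47 correct predictions. Accuracy alone does not say how many originally correct samples were flipped, because a random perturbation can also happen to fix a wrong prediction. A common complementary metric is the attack success rate over the originally correct samples. The sketch below is one way to compute it. `attack_success_rate` is a hypothetical helper, and it assumes both prediction lists are kept rather than overwritten as in the cell above.

```python
def attack_success_rate(true_labels, clean_preds, attacked_preds):
    """Share of originally correct samples that are misclassified after the attack."""
    originally_correct = [
        i for i, (y, p) in enumerate(zip(true_labels, clean_preds)) if y == p
    ]
    if not originally_correct:
        return 0.0
    flipped = sum(attacked_preds[i] != true_labels[i] for i in originally_correct)
    return flipped / len(originally_correct)


# Toy data: samples 0, 1 and 3 are correct before the attack, sample 0 flips
labels = [1, 0, 1, 1]
clean = [1, 0, 0, 1]
attacked = [0, 0, 0, 1]
print(attack_success_rate(labels, clean, attacked))
```

Samples the model already gets wrong are excluded from the denominator, since there is nothing left to attack.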

In [24]:
# The word-embedding attack can be slow; run it if you are interested
predictions, true_labels = evaluation(model, test_dataloader, lambda x: word_embedding_perturbation(vec_model, x))
word_embedding_accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy under word-embedding perturbation: {word_embedding_accuracy * 100:.2f}%")

KeyboardInterrupt: 

# TextAttack

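TextAttack is a Python library for adversarial attacks on natural language processing models. It ships a rich collection of attack recipes and model wrappers, which makes adversarial testing quick and convenient.

TextAttack also offers an end-to-end workflow; see `additional_reading.ipynb` for details.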

In [None]:
import textattack
from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from tqdm import tqdm

# Wrap the model and tokenizer in a HuggingFaceModelWrapper
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Choose the attack recipe; here we use TextFooler
attack = TextFoolerJin2019.build(model_wrapper)

# Load the dataset manually
ds = load_dataset("imdb", split="test")
# Convert it to TextAttack's dataset format
dataset = HuggingFaceDataset(ds)
ds.save_to_disk("./imdb_test_dataset")

# Configure the attack
attack_args = textattack.AttackArgs(
    num_examples=10,  # number of samples to attack
    log_to_csv="attack_log.csv",  # save the attack log to a CSV file
    checkpoint_interval=5,
    checkpoint_dir="checkpoints",
    disable_stdout=True
)
attacker = textattack.Attacker(attack, dataset, attack_args)

# attack_dataset() runs the whole attack (showing TextAttack's own progress bar) before returning,
# so this tqdm loop only iterates over the finished results
print("Starting attack...")
results = []
for result in tqdm(attacker.attack_dataset(), total=min(attack_args.num_examples, len(dataset))):
    results.append(result)

# Print the attack results
for result in results:
    print(result.__str__(color_method="ansi"))

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/raine/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Loading model and tokenizer...
Loading attack method...


textattack: Unknown if model of class <class 'transformers.models.bert.modeling_bert.BertForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.


Downloading dataset...


Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]

textattack: Logging to CSV at path attack_log.csv


Starting attack...
Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (2): UniversalSentenceEncoder(
        (metric):  angular
        (threshold):  0.840845057
        (window_size):  15
        (skip_text_shorter_than_window):  True
        (compare_against_original):  False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels):  ['premise', 'hypothesis']







[Succeeded / Failed / Skipped / Total] 9 / 0 / 1 / 10: 100%|██████████| 10/10 [01:49<00:00, 10.92s/it]textattack: Saving checkpoint under "checkpoints/1744548106452.ta.chkpt" at 2025-04-13 20:41:46 after 10 attacks.





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 9      |
| Number of failed attacks:     | 0      |
| Number of skipped attacks:    | 1      |
| Original accuracy:            | 90.0%  |
| Accuracy under attack:        | 0.0%   |
| Attack success rate:          | 100.0% |
| Average perturbed word %:     | 12.46% |
| Average num. words per input: | 202.2  |
| Avg num queries:              | 813.67 |
+-------------------------------+--------+
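The percentages in the summary table follow from the three counts. The sketch below reproduces them under the usual reading of these metrics: skipped samples are the ones the model already misclassified, and the success rate is computed over attempted attacks only. `summarize` is a hypothetical helper, not a TextAttack function.

```python
def summarize(successful: int, failed: int, skipped: int) -> dict:
    """Recompute the headline percentages from the raw attack counts."""
    total = successful + failed + skipped
    attempted = successful + failed  # samples the model classified correctly
    return {
        "original_accuracy": 100.0 * attempted / total,
        "accuracy_under_attack": 100.0 * failed / total,
        "attack_success_rate": 100.0 * successful / attempted if attempted else 0.0,
    }


print(summarize(successful=9, failed=0, skipped=1))
# prints {'original_accuracy': 90.0, 'accuracy_under_attack': 0.0, 'attack_success_rate': 100.0}
```

With only 10 samples these figures are coarse. A larger `num_examples` would give a more reliable estimate at the cost of a longer run.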







100%|██████████| 10/10 [00:00<00:00, 499321.90it/s]

[[Negative (100%)]] --> [[Positive (68%)]]

[[I]] [[love]] sci-fi and am [[willing]] to [[put]] up with a [[lot]]. Sci-fi [[movies]]/[[TV]] are [[usually]] underfunded, under-appreciated and misunderstood. I [[tried]] to like this, I [[really]] did, but it is to [[good]] [[TV]] sci-fi as Babylon 5 is to [[Star]] [[Trek]] (the [[original]]). [[Silly]] [[prosthetics]], cheap cardboard [[sets]], [[stilted]] dialogues, [[CG]] that doesn't [[match]] the [[background]], and [[painfully]] one-dimensional characters cannot [[be]] [[overcome]] with a 'sci-fi' [[setting]]. (I'm [[sure]] there are those of you out there who [[think]] Babylon 5 is [[good]] sci-fi [[TV]]. It's not. It's [[clichéd]] and [[uninspiring]].) [[While]] [[US]] [[viewers]] [[might]] [[like]] [[emotion]] and [[character]]


