# Adversarial training for language models

<a href="https://colab.research.google.com/drive/1K5NXvoXxLZ10i-3_WQJU25o6bQA1R6Ju" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

Adversarial machine learning is a specialized area within machine learning dedicated to creating algorithms and techniques capable of resisting and effectively responding to adversarial attacks. This field aims to enhance the robustness of models by understanding potential vulnerabilities and developing strategies to mitigate risks posed by malicious inputs and adversarial intent.

In this notebook, we will create an adversarial dataset to train and test the robustness of two different models. The technique used in this notebook is [`DeepWordBug`](https://arxiv.org/abs/1801.04354), an attack that applies simple character-level transformations (such as swapping, substituting, deleting, or inserting characters) to the tokens ranked as most important for the model's prediction. As a baseline, we will use the Bi-LSTM trained in one of our other [notebooks](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Explainability/NLP/model_maker.ipynb). ALL models and datasets can be found on the Hugging Face Hub. 🤗
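To give a feel for what "character-level transformations" means, here is a minimal sketch of the four edit types described in the DeepWordBug paper. It is not the `textattack` implementation: the function names, the `perturb` helper, and the choice of which word positions to attack are assumptions made for illustration only.

```python
import random
import string


def swap(word, rng):
    """Swap two adjacent characters (no-op for words shorter than 2)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute(word, rng):
    """Replace one character with a random ASCII letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(string.ascii_letters) + word[i + 1:]


def delete(word, rng):
    """Delete one character (no-op for words shorter than 2)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert(word, rng):
    """Insert a random ASCII letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_letters) + word[i:]


TRANSFORMS = [swap, substitute, delete, insert]


def perturb(text, target_indices, rng):
    """Apply one random transformation to each word at `target_indices`.

    In the real attack the target positions come from a token-importance
    ranking; here the caller supplies them by hand (an assumption).
    """
    words = text.split()
    for idx in target_indices:
        words[idx] = rng.choice(TRANSFORMS)(words[idx], rng)
    return " ".join(words)


rng = random.Random(0)
print(perturb("this tutorial is great", target_indices=[3], rng=rng))
```

Edits like these keep the text readable for a human while often turning the attacked word into an out-of-vocabulary token for the model.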

In [1]:
!pip install textattack -q
!pip install tensorflow==2.10.1 keras==2.10.0 -q
!pip install huggingface_hub -q


In [2]:
from huggingface_hub import hf_hub_download

# Download the model
hf_hub_download(repo_id="AiresPucrs/BiLSTM-sentiment-classifier",
                filename="BiLSTM-sentiment-classifier.h5",
                local_dir="./",
                repo_type="model"
                )

# Download the json file
hf_hub_download(repo_id="AiresPucrs/BiLSTM-sentiment-classifier",
                filename="tokenizer-BiLSTM-sentiment-classifier.json",
                local_dir="./",
                repo_type="model"
                )

'tokenizer-BiLSTM-sentiment-classifier.json'

In [11]:
import json
import torch
import numpy as np
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model('./BiLSTM-sentiment-classifier.h5')

# The `with` block closes the file automatically
with open('./tokenizer-BiLSTM-sentiment-classifier.json') as fp:
    data = json.load(fp)
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(data)

strings = [
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
]

preds = model.predict(
    tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(strings),
        maxlen=250,
        truncating='post'
    ), verbose=0)

for i, string in enumerate(strings):
    print(f'Review: "{string}"\n(Negative 😔 {preds[i][0] * 100:.0f}% | Positive 😊 {preds[i][1] * 100:.0f}%)\n')

Review: "this explanation is really bad"
(Negative 😔 95% | Positive 😊 5%)

Review: "i did not like this tutorial 2/10"
(Negative 😔 88% | Positive 😊 12%)

Review: "this tutorial is garbage i wont my money back"
(Negative 😔 89% | Positive 😊 11%)

Review: "is nice to see philosophers doing machine learning"
(Negative 😔 4% | Positive 😊 96%)

Review: "this is a great and wonderful example of nlp"
(Negative 😔 0% | Positive 😊 100%)

Review: "this tutorial is great one of the best tutorials ever made"
(Negative 😔 0% | Positive 😊 100%)



In this notebook, we will explore one of the functionalities of the [`textattack`](https://github.com/QData/TextAttack) library, a Python framework for adversarial attacks, data augmentation, and model training in NLP. First, let us wrap our model in a subclass of the [`ModelWrapper`](https://textattack.readthedocs.io/en/latest/apidoc/textattack.models.wrappers.html#modelwrapper-class) class. Its `__call__` method takes a list of strings and returns the model's prediction scores for each one.
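Before the model can score a string, the text has to become a fixed-length sequence of integers. The sketch below imitates that step in plain Python, assuming the behavior of the Keras calls used in this notebook (`pad_sequences` pads at the front by default, and `truncating='post'` drops tokens from the end). The `encode` and `pad` helpers and the toy vocabulary are hypothetical, and the real tokenizer also filters punctuation.

```python
def encode(text, vocab, oov_id=1):
    """Map lowercase, whitespace-separated words to ids; unknown words get `oov_id`."""
    return [vocab.get(word, oov_id) for word in text.lower().split()]


def pad(ids, maxlen, pad_id=0):
    """Truncate from the end, then left-pad with `pad_id` up to `maxlen`."""
    ids = ids[:maxlen]
    return [pad_id] * (maxlen - len(ids)) + ids


# Toy vocabulary (made up); ids 0 and 1 are reserved for padding and OOV.
toy_vocab = {"this": 2, "is": 3, "great": 4}

print(pad(encode("This is great", toy_vocab), maxlen=5))  # [0, 0, 2, 3, 4]
print(pad(encode("this is grea7", toy_vocab), maxlen=5))  # [0, 0, 2, 3, 1]
```

The second call shows why character-level attacks can be effective: a single altered character turns a known word into the generic out-of-vocabulary id.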

In [12]:
from textattack.models.wrappers import ModelWrapper

class ModelWrapper(ModelWrapper):
    def __init__(self, model):
        self.model = model

    def __call__(self, text_input_list):
        text_array = tokenizer.texts_to_sequences(text_input_list)
        padded_text_array = tf.keras.preprocessing.sequence.pad_sequences(
                                                    text_array,
                                                    maxlen=250,
                                                    truncating='post'
                                                )
        preds = self.model.predict(padded_text_array, verbose=0)
        logits = torch.tensor(preds)
        logits = logits.squeeze(dim=-1)
        return logits

ModelWrapper(model)([
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
])

tensor([[0.9498, 0.0502],
        [0.8772, 0.1228],
        [0.8881, 0.1119],
        [0.0439, 0.9561],
        [0.0026, 0.9974],
        [0.0018, 0.9982]])

That's exactly what we wanted. Now we can call one of the attack recipes available in `textattack`. However, we need seed examples from which to create our adversarial examples. For this, we will use our [`sentiment analysis`](https://huggingface.co/datasets/AiresPucrs/sentiment-analysis) dataset paired with DeepWordBug, a fast recipe for adversarial attacks.
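DeepWordBug decides which tokens to perturb by ranking them by importance. A rough sketch of one such ranking, in the spirit of the paper's replace-one scoring, is shown below: each word is replaced by an unknown token and the drop in the score of the attacked class is recorded. The `toy_positive_score` classifier, the keyword sets, and the `rank_tokens` helper are all hypothetical stand-ins, not part of `textattack`.

```python
import math


def toy_positive_score(words):
    """Hypothetical stand-in classifier: probability that the text is positive."""
    positive = {"great", "wonderful", "best", "nice"}
    negative = {"bad", "garbage"}
    raw = sum(w in positive for w in words) - sum(w in negative for w in words)
    return 1 / (1 + math.exp(-raw))


def rank_tokens(text, score_fn, unk="<unk>"):
    """Rank words by how much the score drops when each one is replaced by `unk`.

    Returns (drop, index, word) tuples, largest drop first.
    """
    words = text.split()
    base = score_fn(words)
    drops = []
    for i, word in enumerate(words):
        replaced = words[:i] + [unk] + words[i + 1:]
        drops.append((base - score_fn(replaced), i, word))
    return sorted(drops, key=lambda item: item[0], reverse=True)


ranking = rank_tokens("this is a great example", toy_positive_score)
print(ranking[0])  # the word whose removal hurts the positive score the most
```

The attack then spends its character edits on the top-ranked words, which is why only a few perturbations are usually needed.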

In [14]:
!pip install datasets -q

import textattack
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

dataset = load_dataset('AiresPucrs/sentiment-analysis', split='train')
df = dataset.to_pandas()

df_positive = df[df['label'] == 1]
df_negative = df[df['label'] == 0]


Below, we create two datasets, separated by the original class label, containing the successful adversarial examples found against our model. This step takes a while, so if you want to skip it, you can download our adversarial dataset directly in the following cells. ALL models and datasets can be found on the Hugging Face Hub. 🤗
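The attack log is a CSV file in which each row records the perturbed text and whether the attack succeeded, with the modified words wrapped in `[[ ]]` markers. The sketch below shows the filtering and cleanup step on a tiny, made-up log, using only the two columns this notebook relies on (`perturbed_text` and `result_type`). The rows and the helper name are hypothetical.

```python
import csv
import io

# Made-up rows shaped like the attack log (only two columns shown).
LOG = '''perturbed_text,result_type
"this tutorial is [[graet]]",Successful
"this explanation is really bad",Failed
"[[lvoe]] this app",Successful
'''


def successful_adversarial_texts(csv_text):
    """Keep only successful attacks and strip the [[ ]] markers around edited words."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row["perturbed_text"].replace("[[", "").replace("]]", "")
        for row in rows
        if row["result_type"] == "Successful"
    ]


print(successful_adversarial_texts(LOG))
```

Each surviving text keeps the label of the sample it was derived from, since the perturbation is meant to fool the model without changing the meaning for a human reader.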

In [22]:
from textattack.attack_recipes import DeepWordBugGao2018
from textattack import Attacker
import pandas as pd

# Wrapped model
wrapped_model = ModelWrapper(model)

# Create (text, label) tuples for the `textattack.datasets.Dataset`,
# keeping only samples shorter than 256 characters
x = list(df_positive.text)
y = np.array(list(df_positive.label)).astype(int)
data = [(x[i], int(y[i])) for i in range(len(df_positive)) if len(x[i]) < 256]
dataset = textattack.datasets.Dataset(data)

# Perform the `DeepWordBugGao2018` attack on every sample
attack = DeepWordBugGao2018.build(wrapped_model)
attack_args = textattack.AttackArgs(
    num_examples=len(data),
    log_to_csv="adversarial_text_positive.csv",
    silent=True,
    disable_stdout=True
    )
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()

# The same for the negative sentiment portion of the dataset
# (the columns are `text` and `label`, as in the positive portion)
x = list(df_negative.text)
y = np.array(list(df_negative.label)).astype(int)

data = [(x[i], int(y[i])) for i in range(len(df_negative)) if len(x[i]) < 256]

dataset = textattack.datasets.Dataset(data)
attack = DeepWordBugGao2018.build(wrapped_model)
attack_args = textattack.AttackArgs(
    num_examples=len(data),
    log_to_csv="adversarial_text_negative.csv",
    silent=True,
    disable_stdout=True
    )
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()

# Concatenate both datasets into a single adversarial training dataset
df_positive = pd.read_csv('adversarial_text_positive.csv')
df_negative = pd.read_csv('adversarial_text_negative.csv')

df_positive = df_positive[ df_positive['result_type'] == 'Successful']
df_negative = df_negative[ df_negative['result_type'] == 'Successful']

negative_samples = [s.replace('[[', '').replace(']]', '') for s in df_negative.perturbed_text]
positive_samples = [s.replace('[[', '').replace(']]', '') for s in df_positive.perturbed_text]

df_positive_adversarial_samples = pd.DataFrame(positive_samples, columns=['text'])
df_negative_adversarial_samples = pd.DataFrame(negative_samples, columns=['text'])

df_positive_adversarial_samples['label'] = 1
df_negative_adversarial_samples['label'] = 0

adversarial_data = pd.concat(
    [
        df_positive_adversarial_samples,
        df_negative_adversarial_samples
    ],
    ignore_index=True).sample(frac=1).reset_index(drop=True)

display(adversarial_data)

# Or simply download it from the Hub!
# from datasets import load_dataset
# adversarial_dataset = load_dataset('AiresPucrs/adversarial-sentiment-analysis', split="train")
# display(adversarial_dataset.to_pandas())


| | text | label |
|---|---|---|
| 0 | it would be Mreat if there will be custom vide... | 1 |
| 1 | the app is fine but the actual delivery is aYf... | 0 |
| 2 | loev being abOle to stay in otuch with firends... | 1 |
| 3 | uited line is just getting longer and moving s... | 0 |
| 4 | usairways knows customer service htank you for... | 1 |
| ... | ... | ... |
| 13436 | unHted sadly this wasnt just due to mother nat... | 0 |
| 13437 | Sunited i was on ua3782 and it was cZncelled f... | 0 |
| 13438 | southwestair stewardess really funny now i cou... | 1 |
| 13439 | sotuhwestair looks like you are up and running... | 0 |


Now, we will train a new model on a mix of the original dataset and our new adversarial dataset, hoping it will be more resilient to the attack we used to create the adversarial dataset.


In [4]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = load_dataset('AiresPucrs/sentiment-analysis', split='train')
df = dataset.to_pandas()
adversarial_dataset = load_dataset('AiresPucrs/adversarial-sentiment-analysis', split="train")
adversarial_df = adversarial_dataset.to_pandas()

adversarial_training = adversarial_df.head(10000)
adversarial_test = adversarial_df.tail(3441)

# Concatenate the original `sentiment-analysis` dataset and the `adversarial_training` portion
data = pd.concat([adversarial_training, df]).sample(frac=1).reset_index(drop=True)

We will now train the same model (i.e., [same architecture and hyperparameters](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Explainability/NLP/model_maker.ipynb)) with this adversarial dataset. To avoid training this model, skip this cell and go straight to the evaluation comparison. ALL models and datasets can be found on the Hugging Face Hub. 🤗

In [6]:
import io
import json
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Define language model dimensions (same as the original model)
vocab_size = 5000
embed_size = 128
sequence_length = 250

# Train and save the new tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(data.text)
tokenizer_json = tokenizer.to_json()

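# `to_json()` already returns a JSON string; `json.dumps` wraps it once more so that
# `json.load` later yields the string that `tokenizer_from_json` expects
with io.open('tokenizer-BiLSTM-sentiment-classifier-adversarial.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))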

# Split the data
x_train, x_test, y_train, y_test = train_test_split(
    data.text, data.label, test_size=0.2, random_state=42)

# Tokenize and pad sequences
x_train = tf.keras.utils.pad_sequences(
    tokenizer.texts_to_sequences(x_train),
    maxlen=sequence_length,
    truncating='post')
x_test = tf.keras.utils.pad_sequences(
    tokenizer.texts_to_sequences(x_test),
    maxlen=sequence_length,
    truncating='post')
y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)

# Define the Bi-LSTM network
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

# Compile ...
model.compile(loss=tf.losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

callbacks = [tf.keras.callbacks.ModelCheckpoint("BiLSTM-sentiment-classifier-adversarial.h5",
                                                save_best_only=True),
            tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                            patience=3,
                                            verbose=1,
                                            mode="auto",
                                            baseline=None,
                                            restore_best_weights=True)]
# Train!
model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

# Evaluate!
test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {test_loss_score:.2f}.')
print(f'Final Performance: {test_acc_score * 100:.2f} %.')

Version:  2.10.1
Eager mode:  True
GPU is NOT AVAILABLE
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 128)         640000    
                                                                 
 bidirectional (Bidirectiona  (None, None, 128)        98816     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total

Now, let us compare the robustness of both models, measured as their accuracy on the test portion of our adversarial dataset.
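As a reference for what this score measures, the sketch below computes accuracy by hand for a model with a single sigmoid output, assuming a decision threshold of 0.5. The `robustness_score` helper and the example values are hypothetical; the notebook itself relies on `model.evaluate` for this number.

```python
def robustness_score(probabilities, labels, threshold=0.5):
    """Fraction of adversarial examples that are still classified correctly.

    `probabilities` are predicted probabilities of the positive class and
    `labels` are the true 0/1 labels of the unperturbed samples.
    """
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    correct = sum(
        int(p > threshold) == int(y) for p, y in zip(probabilities, labels)
    )
    return correct / len(labels)


# Made-up predictions on four adversarial examples: three are still correct.
print(robustness_score([0.9, 0.2, 0.4, 0.7], [1, 0, 1, 1]))  # 0.75
```

A higher value means fewer perturbed inputs manage to flip the model's prediction.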

In [11]:
import json
import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download

def download_model_and_tokenizer(repo_id, model_file, tokenizer_file, local_dir="./"):
    # Download model and tokenizer
    hf_hub_download(repo_id=repo_id, filename=model_file, local_dir=local_dir, repo_type="model")
    hf_hub_download(repo_id=repo_id, filename=tokenizer_file, local_dir=local_dir, repo_type="model")

    # Load the model
    model = tf.keras.models.load_model(f'{local_dir}/{model_file}')

    # Load the tokenizer
    with open(f'{local_dir}/{tokenizer_file}') as fp:
        tokenizer_data = json.load(fp)
        tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer_data)

    return model, tokenizer

def evaluate_model(model, tokenizer, test_data, max_len=250):
    # Prepare input data
    x_test = tf.keras.utils.pad_sequences(
        tokenizer.texts_to_sequences(test_data.text), maxlen=max_len, truncating='post')
    y_test = np.array(test_data.label).astype(float)

    # Evaluate the model
    _, robustness_score = model.evaluate(x_test, y_test)
    return robustness_score

# Download and load the original and adversarial models with their tokenizers
original_model, original_tokenizer = download_model_and_tokenizer(
    repo_id="AiresPucrs/BiLSTM-sentiment-classifier",
    model_file="BiLSTM-sentiment-classifier.h5",
    tokenizer_file="tokenizer-BiLSTM-sentiment-classifier.json"
)

adversarial_model, adversarial_tokenizer = download_model_and_tokenizer(
    repo_id="AiresPucrs/BiLSTM-sentiment-classifier-adversarial",
    model_file="BiLSTM-sentiment-classifier-adversarial.h5",
    tokenizer_file="tokenizer-BiLSTM-sentiment-classifier-adversarial.json"
)

# Evaluate both models against adversarial test data
print('\nOriginal model evaluation against adversaries\n')
original_robustness = evaluate_model(original_model, original_tokenizer, adversarial_test)
print(f'Robustness Score: {original_robustness * 100:.2f} %')

print('\nAdversarial-Training model evaluation against adversaries\n')
adversarial_robustness = evaluate_model(adversarial_model, adversarial_tokenizer, adversarial_test)
print(f'Robustness Score: {adversarial_robustness * 100:.2f} %')



Original model evaluation against adversaries

Robustness Score: 31.79 %

Adversarial-Training model evaluation against adversaries

Robustness Score: 86.37 %


The model trained on a mix of standard and perturbed data reaches a robustness score of about 86% on the adversarial test set, against roughly 32% for the original model. Because the adversarial examples were generated by attacking the original model, its low score is partly expected. Still, the comparison illustrates the point of adversarial training: by being deliberately exposed to carefully crafted adversarial examples during training, a model can become more resilient to the same kind of subtle input perturbation at test time.
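To make the two reported scores more concrete, the arithmetic below converts them into approximate counts of correctly classified examples, assuming the evaluation used the 3,441-row `adversarial_test` slice defined earlier. The variable names are illustrative.

```python
n_test = 3441  # size of the `adversarial_test` slice (assumed evaluation set)
scores = {"original": 0.3179, "adversarial": 0.8637}  # reported robustness scores

# Approximate number of adversarial examples each model classifies correctly
correct = {name: round(acc * n_test) for name, acc in scores.items()}

# Absolute improvement, in percentage points
gain_points = (scores["adversarial"] - scores["original"]) * 100

print(correct)
print(f"Gain: {gain_points:.2f} percentage points")
```

Under that assumption, the adversarially trained model recovers the correct label on roughly 1,900 more test examples than the original model.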
