# Creating adversarial examples with `textattack`

<a href="https://colab.research.google.com/drive/1pWOn-n6woW-HkHnUPG2YYC3QZORVRI_4" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

Adversarial machine learning is a specialized area within machine learning dedicated to creating algorithms and techniques capable of resisting and effectively responding to adversarial attacks. This field aims to enhance the robustness of models by understanding potential vulnerabilities and developing strategies to mitigate risks posed by malicious inputs and adversarial intent.

In this notebook, we will explore one of the functionalities of the [`textattack`](https://textattack.readthedocs.io/en/latest/index.html) library. More specifically, we will load and attack a language model trained for sentiment classification. To start, let us download one of our pre-trained models ([AiresPucrs/BiLSTM-sentiment-classifier](https://huggingface.co/AiresPucrs/BiLSTM-sentiment-classifier)).

In [1]:
!pip install textattack -q
!pip install tensorflow==2.10.1 keras==2.10.0 -q
!pip install huggingface_hub -q


In [2]:
from huggingface_hub import hf_hub_download

# Download the model
hf_hub_download(repo_id="AiresPucrs/BiLSTM-sentiment-classifier",
                filename="BiLSTM-sentiment-classifier.h5",
                local_dir="./",
                repo_type="model"
                )

# Download the tokenizer file
hf_hub_download(repo_id="AiresPucrs/BiLSTM-sentiment-classifier",
                filename="tokenizer-BiLSTM-sentiment-classifier.json",
                local_dir="./",
                repo_type="model"
                )

import json
import torch
import numpy as np
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model('./BiLSTM-sentiment-classifier.h5')

# The `with` block closes the file automatically, so no explicit close() is needed
with open('./tokenizer-BiLSTM-sentiment-classifier.json') as fp:
    data = json.load(fp)
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(data)

strings = [
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
]

preds = model.predict(
    tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(strings),
        maxlen=250,
        truncating='post'
    ), verbose=0)

for i, string in enumerate(strings):
    print(f'Review: "{string}"\n(Negative 😔 {round(preds[i][0] * 100)}% | Positive 😊 {round(preds[i][1] * 100)}%)\n')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


BiLSTM-sentiment-classifier.h5:   0%|          | 0.00/10.1M [00:00<?, ?B/s]

(…)kenizer-BiLSTM-sentiment-classifier.json:   0%|          | 0.00/17.0M [00:00<?, ?B/s]

Review: "this explanation is really bad"
(Negative 😔 95% | Positive 😊 5%)

Review: "i did not like this tutorial 2/10"
(Negative 😔 88% | Positive 😊 12%)

Review: "this tutorial is garbage i wont my money back"
(Negative 😔 89% | Positive 😊 11%)

Review: "is nice to see philosophers doing machine learning"
(Negative 😔 4% | Positive 😊 96%)

Review: "this is a great and wonderful example of nlp"
(Negative 😔 0% | Positive 😊 100%)

Review: "this tutorial is great one of the best tutorials ever made"
(Negative 😔 0% | Positive 😊 100%)



In [3]:
preds

array([[0.94975084, 0.05024919],
       [0.87723655, 0.1227634 ],
       [0.8880841 , 0.11191586],
       [0.04392454, 0.9560755 ],
       [0.00255991, 0.9974401 ],
       [0.00182529, 0.99817467]], dtype=float32)

The model seems to be working fine! Now, let us try to break it. First, let us wrap our model using the [`ModelWrapper`](https://textattack.readthedocs.io/en/latest/apidoc/textattack.models.wrappers.html#modelwrapper-class) class. Then, by implementing the `__call__` method, we can create a function that takes a list of texts and returns the prediction scores of our model.


In [4]:
from textattack.models.wrappers import ModelWrapper

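class ModelWrapper(ModelWrapper):
    """Adapts the Keras classifier to the interface `textattack` expects.

    Note: this subclass reuses (and shadows) the imported `ModelWrapper` name.
    """
    def __init__(self, model):
        self.model = model

    def __call__(self, text_input_list):
        # Tokenize and pad the raw strings exactly as in the prediction cell above
        text_array = tokenizer.texts_to_sequences(text_input_list)
        padded_text_array = tf.keras.preprocessing.sequence.pad_sequences(
                                                    text_array,
                                                    maxlen=250,
                                                    truncating='post'
                                                )
        # One row of class scores (negative, positive) per input text
        preds = self.model.predict(padded_text_array, verbose=0)
        logits = torch.tensor(preds)
        # Only drops a trailing dimension of size 1; a (batch, 2) tensor is unchanged
        logits = logits.squeeze(dim=-1)
        return logits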

ModelWrapper(model)([
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
])



tensor([[0.9498, 0.0502],
        [0.8772, 0.1228],
        [0.8881, 0.1119],
        [0.0439, 0.9561],
        [0.0026, 0.9974],
        [0.0018, 0.9982]])

Exactly what we wanted: the probabilities match the predictions above. Now we can call an attack recipe from `textattack`. However, we need something to attack. Luckily, `textattack` allows you to use Hugging Face datasets as a data source. You can also use your own dataset for this.

> **Note: If you want to build your own dataset, the [`textattack.datasets.Dataset`](https://textattack.readthedocs.io/en/latest/api/datasets.html#dataset) class takes as input a list of tuples, e.g., `[('some text', label_1), ('some other text', label_2)]`.**

In [5]:
# Our seeds to create some adversarial examples
data = [
    ('this explanation is really bad', 0),
    ('this tutorial is garbage i wont my money back', 0),
    ('i did not like this tutorial 2/10', 0),
    ('is nice to see philosophers doing machine learning', 1),
    ('this is a great and wonderful example of nlp', 1),
    ('this tutorial is great one of the best tutorials ever made', 1)
]
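Since an attack is only attempted on examples the model already classifies correctly (the others are reported as skipped), it can be worth checking the seed labels against the model's predictions first. The snippet below is a minimal, pure-Python sketch of that check. The names `seeds`, `scores`, `predicted_label`, and `mislabeled` are hypothetical helpers introduced for illustration, and the scores are copied (rounded) from the wrapper output above.

```python
# Seed examples paired with the scores printed by the wrapper above
# (rounded, and reordered to follow the order of the seeds).
seeds = [
    ("this explanation is really bad", 0),
    ("this tutorial is garbage i wont my money back", 0),
    ("i did not like this tutorial 2/10", 0),
    ("is nice to see philosophers doing machine learning", 1),
]
scores = [
    [0.9498, 0.0502],
    [0.8881, 0.1119],
    [0.8772, 0.1228],
    [0.0439, 0.9561],
]


def predicted_label(row):
    """Return the index of the highest score (argmax) in one row of class scores."""
    return max(range(len(row)), key=row.__getitem__)


def mislabeled(seeds, scores):
    """Return the seeds whose label disagrees with the model's prediction."""
    return [
        (text, label)
        for (text, label), row in zip(seeds, scores)
        if predicted_label(row) != label
    ]


# An empty list means every seed is correctly classified and can be attacked.
print(mislabeled(seeds, scores))
```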

Now that we have a dataset, we can call one of the attack recipes from `textattack`. All available recipes correspond to attacks from the Adversarial ML literature.

Attack recipes allow you to create an `Attack` object whose four components are those specified in the original paper: the goal function (which determines the conditions under which the attack is successful), the transformation (the adversarial perturbations applied to the samples of the dataset), the constraints (the limitations imposed on these transformations), and the search method (how candidate perturbations are explored).
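To make these four components concrete, here is a toy, pure-Python sketch that does not use `textattack` at all. Everything in it is an assumption made for illustration: the keyword-based `toy_model`, the tiny `SYNONYMS` table, and the function names are hypothetical stand-ins, and the search is a simple single-swap scan rather than any search method from the literature.

```python
# Toy stand-ins for the four components of an attack (all names are hypothetical).
NEGATIVE_WORDS = {"bad", "garbage"}
SYNONYMS = {"bad": ["negative", "poor"], "garbage": ["detritus"]}


def toy_model(text):
    """Toy classifier: 0 (negative) if a known negative word appears, else 1."""
    return 0 if any(word in NEGATIVE_WORDS for word in text.split()) else 1


def goal_reached(candidate, original_label):
    """Goal function (untargeted): any prediction other than the original label."""
    return toy_model(candidate) != original_label


def transformations(text):
    """Transformation: yield every text obtained by swapping one word for a synonym."""
    words = text.split()
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            yield " ".join(words[:i] + [synonym] + words[i + 1:])


def satisfies_constraints(candidate, original, max_percent=0.2):
    """Constraint: at most `max_percent` of the original words may be changed."""
    original_words, candidate_words = original.split(), candidate.split()
    changed = sum(a != b for a, b in zip(original_words, candidate_words))
    return changed <= max_percent * len(original_words)


def toy_attack(text, label):
    """Search method: return the first valid candidate that reaches the goal, else None."""
    for candidate in transformations(text):
        if satisfies_constraints(candidate, text) and goal_reached(candidate, label):
            return candidate
    return None


print(toy_attack("this explanation is really bad", 0))
```

Real recipes follow the same division of labor, but with a learned model, embedding-based or character-level transformations, and far more elaborate search strategies.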

Here you can find a list of fast attack recipes:

- `PWWSRen2019`: in this attack, words are perturbed by a synonym-swap transformation based on a combination of their saliency score (i.e., _the importance of a linguistic feature_) and maximum word-swap effectiveness (proposed in "[Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency](https://aclanthology.org/P19-1103/)").
- `CheckList2020`: this attack focuses on several language perturbations, like contractions, extensions, changing names, numbers, and locations (proposed in "[Beyond Accuracy: Behavioral Testing of NLP models with CheckList](https://aclanthology.org/2020.acl-main.442/)").
- `DeepWordBugGao2018`: this attack performs simple character-level transformations (_changes certain letters of a word_) to the highest-ranked tokens (proposed in "[Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers](https://arxiv.org/abs/1801.04354)").
- `IGAWang2019`: this attack can be characterized as a synonym substitution-based attack that preserves the syntactic structure and semantic information of the original text (proposed in "[Natural Language Adversarial Attacks and Defenses in Word Level](http://arxiv.org/abs/1909.06723)").
- `InputReductionFeng2018`: this attack does not cause the model to misclassify a sample. Instead, it removes words with low saliency scores, creating nonsensical sentences that the model still classifies with high confidence as the originally predicted class (proposed in "[Pathologies of Neural Models Make Interpretations Difficult](https://arxiv.org/abs/1804.07781)").
- `Pruthi2019`: this attack focuses on a small number of character-level changes that simulate common typos, like _swapping neighboring characters, deleting characters, inserting characters,_ and _swapping characters for adjacent keys_ on a QWERTY keyboard (proposed in "[Combating Adversarial Misspellings with Robust Word Recognition](https://arxiv.org/abs/1905.11268)").
- `TextBuggerLi2018`: this is a general attack framework for generating adversarial texts (proposed in "[TextBugger: Generating Adversarial Text Against Real-world Applications](https://arxiv.org/abs/1812.05271)").

In the example below, we will use the `IGAWang2019` recipe.

> **Note: In the output, all perturbed words are highlighted with `[[ ]]` for clarity purposes.**

In [6]:
import textattack
from textattack import Attacker
from textattack.attack_recipes import IGAWang2019

# Wrap the Keras model and the seed examples for `textattack`
wrapped_model = ModelWrapper(model)
dataset = textattack.datasets.Dataset(data)

# Build the attack from the recipe and log the results to a CSV file
attack = IGAWang2019.build(wrapped_model)
attack_args = textattack.AttackArgs(
    num_examples=6,
    log_to_csv="textattack_logs_IGAWang2019.csv"
)
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()

textattack: Unknown if model of class <class 'keras.engine.functional.Functional'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path textattack_logs_IGAWang2019.csv


Attack(
  (search_method): ImprovedGeneticAlgorithm(
    (pop_size):  60
    (max_iters):  20
    (temp):  0.3
    (give_up_if_no_improvement):  False
    (post_crossover_check):  False
    (max_crossover_retries):  20
    (max_replace_times_per_index):  5
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): MaxWordsPerturbed(
        (max_percent):  0.2
        (compare_against_original):  True
      )
    (1): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (max_mse_dist):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  False
      )
    (2): StopwordModification
  (is_black_box):  True
) 



[Succeeded / Failed / Skipped / Total] 1 / 0 / 0 / 1:  17%|█▋        | 1/6 [00:00<00:04,  1.24it/s]

--------------------------------------------- Result 1 ---------------------------------------------

this explanation is really [[bad]]

this explanation is really [[negative]]




[Succeeded / Failed / Skipped / Total] 2 / 0 / 0 / 2:  33%|███▎      | 2/6 [00:01<00:03,  1.33it/s]

--------------------------------------------- Result 2 ---------------------------------------------

this tutorial is [[garbage]] i wont my money back

this tutorial is [[detritus]] i wont my money back




[Succeeded / Failed / Skipped / Total] 2 / 1 / 0 / 3:  50%|█████     | 3/6 [00:04<00:04,  1.43s/it]

--------------------------------------------- Result 3 ---------------------------------------------

i did not like this tutorial 2/10




[Succeeded / Failed / Skipped / Total] 3 / 1 / 0 / 4:  67%|██████▋   | 4/6 [00:07<00:03,  1.75s/it]

--------------------------------------------- Result 4 ---------------------------------------------

is [[nice]] to see philosophers doing machine [[learning]]

is [[agreeable]] to see philosophers doing machine [[training]]




[Succeeded / Failed / Skipped / Total] 4 / 1 / 0 / 5:  83%|████████▎ | 5/6 [00:09<00:01,  1.85s/it]

--------------------------------------------- Result 5 ---------------------------------------------

this is a [[great]] and [[wonderful]] example of nlp

this is a [[considerable]] and [[unbelievable]] example of nlp




[Succeeded / Failed / Skipped / Total] 5 / 1 / 0 / 6: 100%|██████████| 6/6 [00:23<00:00,  3.95s/it]

--------------------------------------------- Result 6 ---------------------------------------------

this tutorial is [[great]] [[one]] of the [[best]] tutorials [[ever]] [[made]]

this tutorial is [[considerable]] [[eden]] of the [[stronger]] tutorials [[increasingly]] [[introduced]]



+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 5      |
| Number of failed attacks:     | 1      |
| Number of skipped attacks:    | 0      |
| Original accuracy:            | 100.0% |
| Accuracy under attack:        | 16.67% |
| Attack success rate:          | 83.33% |
| Average perturbed word %:     | 24.76% |
| Average num. words per input: | 8.33   |
| Avg num queries:              | 622.67 |
+-------------------------------+--------+
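The percentages in the summary table can be reproduced from the three counts alone. The sketch below is an assumption about how these figures relate to each other, inferred from the numbers reported here rather than taken from the library's source, and `attack_summary` is a hypothetical helper name.

```python
def attack_summary(succeeded, failed, skipped):
    """Recompute the summary percentages from the attack result counts.

    Assumption: skipped examples are those the model misclassified before
    any attack, so they lower the original accuracy.
    """
    total = succeeded + failed + skipped
    attempted = succeeded + failed
    return {
        # Share of examples classified correctly before the attack
        "original_accuracy": round(100 * attempted / total, 2),
        # Share of examples still classified correctly after the attack
        "accuracy_under_attack": round(100 * failed / total, 2),
        # Share of attempted attacks that flipped the prediction
        "attack_success_rate": round(100 * succeeded / attempted, 2) if attempted else 0.0,
    }


# Counts taken from the table above: 5 succeeded, 1 failed, 0 skipped
print(attack_summary(5, 1, 0))
```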





[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc72017e410>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc7205e3940>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x7cc72013baf0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc71f1e5210>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc7205e1ab0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc790118430>]

As you can see, the attack succeeded on five of the six examples (an 83.33% success rate), and only the third review resisted it. Now, let us try another recipe.

In [7]:
from textattack.attack_recipes import DeepWordBugGao2018

attack = DeepWordBugGao2018.build(wrapped_model)
attack_args = textattack.AttackArgs(
    num_examples=6,
    log_to_csv="textattack_logs_DeepWordBugGao2018.csv"
)
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()

textattack: Unknown if model of class <class 'keras.engine.functional.Functional'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path textattack_logs_DeepWordBugGao2018.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  unk
  )
  (goal_function):  UntargetedClassification
  (transformation):  CompositeTransformation(
    (0): WordSwapNeighboringCharacterSwap(
        (random_one):  True
      )
    (1): WordSwapRandomCharacterSubstitution(
        (random_one):  True
      )
    (2): WordSwapRandomCharacterDeletion(
        (random_one):  True
      )
    (3): WordSwapRandomCharacterInsertion(
        (random_one):  True
      )
    )
  (constraints): 
    (0): LevenshteinEditDistance(
        (max_edit_distance):  30
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
) 



[Succeeded / Failed / Skipped / Total] 1 / 0 / 0 / 1:  17%|█▋        | 1/6 [00:00<00:01,  3.22it/s]

--------------------------------------------- Result 1 ---------------------------------------------

this explanation is really [[bad]]

this explanation is really [[ad]]




[Succeeded / Failed / Skipped / Total] 2 / 0 / 0 / 2:  33%|███▎      | 2/6 [00:00<00:01,  3.26it/s]

--------------------------------------------- Result 2 ---------------------------------------------

this tutorial is [[garbage]] i wont my money back

this tutorial is [[garabge]] i wont my money back




[Succeeded / Failed / Skipped / Total] 2 / 1 / 0 / 3:  50%|█████     | 3/6 [00:01<00:01,  2.57it/s]

--------------------------------------------- Result 3 ---------------------------------------------

i did not like this tutorial 2/10




[Succeeded / Failed / Skipped / Total] 3 / 1 / 0 / 4:  67%|██████▋   | 4/6 [00:01<00:00,  2.71it/s]

--------------------------------------------- Result 4 ---------------------------------------------

is [[nice]] to see philosophers doing machine learning

is [[ice]] to see philosophers doing machine learning




[Succeeded / Failed / Skipped / Total] 3 / 2 / 0 / 5:  83%|████████▎ | 5/6 [00:02<00:00,  2.41it/s]

--------------------------------------------- Result 5 ---------------------------------------------

this is a great and wonderful example of nlp




[Succeeded / Failed / Skipped / Total] 3 / 3 / 0 / 6: 100%|██████████| 6/6 [00:02<00:00,  2.03it/s]

--------------------------------------------- Result 6 ---------------------------------------------

this tutorial is great one of the best tutorials ever made



+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 3      |
| Number of failed attacks:     | 3      |
| Number of skipped attacks:    | 0      |
| Original accuracy:            | 100.0% |
| Accuracy under attack:        | 50.0%  |
| Attack success rate:          | 50.0%  |
| Average perturbed word %:     | 14.54% |
| Average num. words per input: | 8.33   |
| Avg num queries:              | 17.0   |
+-------------------------------+--------+
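The character-level edits seen in the successful results above (`bad` to `ad`, `garbage` to `garabge`, `nice` to `ice`) are easy to reproduce by hand. The sketch below implements the four kinds of edit listed in the attack description as plain string operations. The function names are hypothetical, and the positions are chosen explicitly here, whereas the recipe picks them at random.

```python
def swap_neighbors(word, i):
    """Swap the characters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_char(word, i):
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]


def insert_char(word, i, ch):
    """Insert the character ch before position i."""
    return word[:i] + ch + word[i:]


def substitute_char(word, i, ch):
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]


print(delete_char("bad", 0))          # same edit as in Result 1
print(swap_neighbors("garbage", 3))   # same edit as in Result 2
print(delete_char("nice", 0))         # same edit as in Result 4
```

Each of these edits is tiny for a human reader, yet it can turn a known word into one the tokenizer has never seen, which is presumably why the prediction flips.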





[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc7efb42080>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc71c560d30>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x7cc71c562b60>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7cc71dd43850>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x7cc71c563010>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x7cc71a93b790>]

Language models are the foundation of various applications such as Q&A, chatbots, machine translation, and text classification. However, the security vulnerabilities of these models are still poorly understood, which is highly concerning.

To remedy this, developers can use the same tools that attackers use to fool models. For example, adversarial examples created with libraries like `textattack` (which also provides data augmentation) can be collected into adversarial datasets to fine-tune language models, making them more robust.
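As a rough illustration of that last idea, the sketch below merges successful adversarial examples back into a training set. It is a minimal sketch under a strong assumption: that each perturbed text keeps the label of its original, which should be verified by a human before retraining (some of the substitutions above arguably change the meaning). The names `train_set`, `adversarial`, and `augment` are hypothetical, and the texts are copied from the `IGAWang2019` results.

```python
# Original training pairs: (text, label)
train_set = [
    ("this explanation is really bad", 0),
    ("this tutorial is garbage i wont my money back", 0),
]

# Successful attacks: (original text, perturbed text, true label)
adversarial = [
    ("this explanation is really bad",
     "this explanation is really negative", 0),
    ("this tutorial is garbage i wont my money back",
     "this tutorial is detritus i wont my money back", 0),
]


def augment(train_set, adversarial):
    """Append each perturbed text with its original label, skipping duplicates."""
    seen = {text for text, _ in train_set}
    extra = []
    for _, perturbed, label in adversarial:
        if perturbed not in seen:
            extra.append((perturbed, label))
            seen.add(perturbed)
    return train_set + extra


augmented = augment(train_set, adversarial)
print(len(train_set), "->", len(augmented))
```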
