# Improving Robustness Against Evasion Attacks with _Adversarial Training_ (`Natural Language Processing`)

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**Adversarial machine learning (`AML`) is a subfield of machine learning that focuses on developing algorithms and techniques that can withstand and respond to adversarial attacks.** 

**Adversarial attacks are a type of cyber attack where an attacker deliberately manipulates data inputs to ML models with the aim of causing them to produce incorrect outputs.** 

**`AML` aims to improve the robustness and security of ML models by identifying vulnerabilities and developing countermeasures to mitigate the impact of adversarial attacks. A range of techniques have been developed for `AML`, including `adversarial training` (_training models on adversarial examples_), and `defensive distillation` (_creating a distilled version of a model that is resistant to adversarial attacks_).**

**`AML` is an active area of research, as ML models continue to be deployed in a wide range of applications where they may be vulnerable to attack.**

**One of the alternatives for making models more resilient against adversarial attacks is `adversarial training`. In `adversarial training`, we generate adversarial examples and use them as samples (with their correct labels) for training (retraining) the original model, making it more robust.**
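
**The data side of this idea can be sketched in a few lines of plain Python. The helper below is hypothetical (the name `build_adversarial_training_set` and its parameters are our own, not part of any library): it mixes clean samples with correctly labeled adversarial samples and holds some adversarial samples out for a later robustness test.**

```python
import random


def build_adversarial_training_set(clean, adversarial, n_holdout, seed=0):
    """Mix clean and adversarial (text, label) pairs for retraining.

    The last `n_holdout` adversarial pairs are kept out of training so
    they can be used later to measure robustness.
    """
    if not 0 <= n_holdout <= len(adversarial):
        raise ValueError("n_holdout must be between 0 and len(adversarial)")
    split = len(adversarial) - n_holdout
    adversarial_train = adversarial[:split]
    adversarial_test = adversarial[split:]
    # Adversarial samples keep their *correct* labels.
    train = list(clean) + list(adversarial_train)
    random.Random(seed).shuffle(train)
    return train, adversarial_test


clean = [("great app", 1), ("terrible service", 0)]
adversarial = [("gerat app", 1), ("terirble service", 0), ("grezt app", 1)]

train, adversarial_test = build_adversarial_training_set(
    clean, adversarial, n_holdout=1)
print(len(train))        # 4
print(adversarial_test)  # [('grezt app', 1)]
```

**The toy strings above are made up for illustration; the notebook applies the same idea to the real datasets further down.**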

**In this notebook we will create an adversarial dataset to train and test the robustness of two different models. You can learn more about evasion attacks in the context of NLP (a.k.a. _adversarial examples_) in [this notebook](xxx).** 

**The technique used in this notebook is `DeepWordBug`, an attack that performs simple character-level transformations (_changes certain letters of a word_) to the highest-ranked tokens (proposed in [Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers](https://arxiv.org/abs/1801.04354)).**
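
**To get a feeling for what "character-level transformations" means, here is a toy sketch of four edits of this kind (swap, substitute, delete, insert). It is not the `textattack` implementation: the real attack first ranks tokens by querying the model and applies additional constraints, whereas here the words to perturb are simply passed in by hand (the `target_words` argument is an assumption made for illustration).**

```python
import random
import string


def swap(word, rng):
    """Swap two neighboring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute(word, rng):
    """Replace one character with a random letter (may pick the same one)."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(string.ascii_letters) + word[i + 1:]


def delete(word, rng):
    """Remove one character."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert(word, rng):
    """Insert a random letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_letters) + word[i:]


TRANSFORMS = [swap, substitute, delete, insert]


def perturb(text, target_words, seed=0):
    """Apply one random character-level edit to every word in `target_words`."""
    rng = random.Random(seed)
    perturbed = []
    for word in text.split():
        if word in target_words:
            word = rng.choice(TRANSFORMS)(word, rng)
        perturbed.append(word)
    return " ".join(perturbed)


print(perturb("this tutorial is great", {"great"}))
```

**Edits like these tend to push a word out of the tokenizer's vocabulary, which is one plausible reason why a word-level model can be fooled by them.**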

**In this notebook, our baseline model (used to create the adversarial examples and as a point of comparison for our new model) is the same `bidirectional long short-term memory (bi-lstm)` network trained in [this notebook](xxx).**

**The dataset used to train this model is `sentiment_analysis_dataset.csv`, available for download [here](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv).**

In [27]:
import json
import torch
import numpy as np
import pandas as pd
import tensorflow as tf
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, tokenizer_from_json

model = tf.keras.models.load_model('models/senti_model.keras')

# The `with` block closes the file automatically, so no explicit close() is needed.
with open('models/tokenizer_senti_model.json') as fp:
    tokenizer = tokenizer_from_json(json.load(fp))

strings = [
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
]

preds = model.predict(
    tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(strings),
        maxlen=250,
        truncating='post'
    ), verbose=0)

for i, string in enumerate(strings):
    print(f'Review: "{string}"\n(Negative 😔 {round((1 - preds[i][0]) * 100)}% | Positive 😊 {round(preds[i][0] * 100)}%)\n')

Review: "this explanation is really bad"
(Negative 😔 95% | Positive 😊 5%)

Review: "i did not like this tutorial 2/10"
(Negative 😔 81% | Positive 😊 19%)

Review: "this tutorial is garbage i wont my money back"
(Negative 😔 93% | Positive 😊 7%)

Review: "is nice to see philosophers doing machine learning"
(Negative 😔 3% | Positive 😊 97%)

Review: "this is a great and wonderful example of nlp"
(Negative 😔 0% | Positive 😊 100%)

Review: "this tutorial is great one of the best tutorials ever made"
(Negative 😔 0% | Positive 😊 100%)



**In this notebook, we will be exploring one of the functionalities of the `textattack` library.**

> **_TextAttack is a Python framework for adversarial attacks, data augmentation, and model training in NLP_.**

**With `textattack`, we can _wrap_ a model (such as a Keras, TensorFlow, scikit-learn, or AllenNLP model) by subclassing the `ModelWrapper` class. Then, in its `__call__` method, we define how to turn a list of input texts into the prediction scores of our model.**

**How you write this method is task-specific, since it depends on the native output format of your model. Below, you can see how to turn the output of a `sigmoid function` (the last layer of our `bi-lstm`) into a torch tensor that contains the probabilities for each of the sentiment classes ($0$ for negative, $1$ for positive).**
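
**Stripped of the tensor machinery, the conversion is just this: a sigmoid output $p$ is read as the probability of the positive class, so the negative class gets $1 - p$. The dependency-free sketch below uses plain lists (the helper name `two_class_scores` is our own); the wrapper in the next cell does the same thing with `torch` tensors.**

```python
def two_class_scores(positive_probs):
    """Turn sigmoid outputs P(positive) into [P(negative), P(positive)] rows."""
    return [[1.0 - p, p] for p in positive_probs]


scores = two_class_scores([0.05, 0.97])
for row in scores:
    # The index of the largest score is the predicted class (0 or 1).
    predicted_class = row.index(max(row))
    print(row, "->", predicted_class)
```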

In [2]:
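from textattack.models.wrappers import ModelWrapper as BaseModelWrapper

class ModelWrapper(BaseModelWrapper):
    """Wraps the Keras sentiment model so that `textattack` can query it."""

    def __init__(self, model):
        self.model = model

    def __call__(self, text_input_list):
        # Tokenize and pad the texts the same way as in the prediction cell above.
        text_array = tokenizer.texts_to_sequences(text_input_list)
        padded_text_array = tf.keras.preprocessing.sequence.pad_sequences(
            text_array,
            maxlen=250,
            truncating='post'
        )
        # The sigmoid output is P(positive), with shape (batch, 1).
        preds = self.model.predict(padded_text_array, verbose=0)
        positive_probs = torch.tensor(preds).squeeze(dim=-1)
        # Stack into shape (batch, 2): column 0 = negative, column 1 = positive.
        return torch.stack((1 - positive_probs, positive_probs), dim=1)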


**Now, let us see the outputs of our `ModelWrapper`.**

In [3]:
ModelWrapper(model)([
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
])

tensor([[0.9461, 0.0539],
        [0.8123, 0.1877],
        [0.9322, 0.0678],
        [0.0289, 0.9711],
        [0.0023, 0.9977],
        [0.0014, 0.9986]])

**Exactly what we wanted: each row holds the negative and positive probabilities, and they agree with the sentiment of each input. Now we can use one of the attack recipes available in `textattack`.**

**However, we need something to attack. We will be using our `sentiment_analysis_dataset.csv` to create adversarial examples against our model. As said before, we will be using the DeepWordBug recipe, which is a fast recipe for adversarial attacks. Other attacks and recipes are demonstrated in [this notebook](xxx).**


In [4]:
import textattack
import urllib.request
from sklearn.model_selection import train_test_split

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv', 
    'sentiment_analysis_dataset.csv'
)

df = pd.read_csv('sentiment_analysis_dataset.csv')

df_positive = df[df['sentiment'] == 1]
df_negative = df[df['sentiment'] == 0]


**To create a `textattack.datasets.Dataset`, you only need to turn your inputs $X$ and labels $Y$ into a list of tuples (`text, label`). `textattack` will use these samples to create adversarial examples against our model.**

**We are saving all of the performed attacks in a `CSV` file for later use.**

**The attacks performed in the cells below will generate several thousand successful adversarial examples. This can take a while (hours, in our runs), so if you would like to skip this part, you can download our adversarial dataset directly in a later cell.**

**First, let us attack only the positive class.**

In [10]:
from textattack import Attacker
from textattack.attack_recipes import DeepWordBugGao2018

model_wrapper = ModelWrapper(model)

# Build (text, label) tuples, keeping only reviews shorter than 256 characters.
data = [
    (text, int(label))
    for text, label in zip(df_positive.review, df_positive.sentiment)
    if len(text) < 256
]

dataset = textattack.datasets.Dataset(data)
attack = DeepWordBugGao2018.build(model_wrapper)
attack_args = textattack.AttackArgs(
    num_examples=len(data),
    log_to_csv="adversarial_text_positive.csv",  # every attack result is logged here
    silent=True,
    disable_stdout=True
)
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()


textattack: Unknown if model of class <class 'keras.engine.functional.Functional'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
[Succeeded / Failed / Skipped / Total] 4723 / 5659 / 866 / 11248: 100%|██████████| 11248/11248 [2:59:33<00:00,  1.04it/s]






[<textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f01f8ac490>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1f0fe5dd100>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1f0fe5843d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f10fd025b0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1f10fcfb760>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f10fe638e0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f10fd95df0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f10fd69460>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f0fe6875b0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f10fddadc0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x1f10fea1730>,
 <textattack.attack

**And now, let us attack only the negative class.**

In [5]:
from textattack import Attacker
from textattack.attack_recipes import DeepWordBugGao2018

model_wrapper = ModelWrapper(model)

# Build (text, label) tuples, keeping only reviews shorter than 256 characters.
data = [
    (text, int(label))
    for text, label in zip(df_negative.review, df_negative.sentiment)
    if len(text) < 256
]

dataset = textattack.datasets.Dataset(data)
attack = DeepWordBugGao2018.build(model_wrapper)
attack_args = textattack.AttackArgs(
    num_examples=len(data),
    log_to_csv="adversarial_text_negative.csv",  # every attack result is logged here
    silent=True,
    disable_stdout=True
)
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()

textattack: Unknown if model of class <class 'keras.engine.functional.Functional'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
[Succeeded / Failed / Skipped / Total] 8718 / 1196 / 689 / 10603: 100%|██████████| 10603/10603 [1:22:06<00:00,  2.15it/s]






[<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x1c45656a7f0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5094e4490>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5095220d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5094cc0d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5094fe9d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5095bc130>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5095bcdc0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c509599130>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c4b12fb880>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x1c5095ab160>,
 <textattack.attack_results.successful_attack_result.Suc
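
**The progress bars above already tell us a lot. The sketch below turns the logged `Succeeded / Failed / Skipped` counts into rates. It assumes the usual `textattack` reading of these counts: a _skipped_ sample is one the model already misclassified before any perturbation, and the attack success rate is computed only over the samples that were actually attacked. The helper name `attack_stats` is our own.**

```python
def attack_stats(succeeded, failed, skipped):
    """Summarize one attack run from its Succeeded / Failed / Skipped counts."""
    total = succeeded + failed + skipped
    attacked = succeeded + failed  # skipped samples were never attacked
    return {
        "total": total,
        "original_accuracy": attacked / total,
        "attack_success_rate": succeeded / attacked,
        "accuracy_under_attack": failed / total,
    }


# Counts copied from the two progress bars above.
positive = attack_stats(succeeded=4723, failed=5659, skipped=866)
negative = attack_stats(succeeded=8718, failed=1196, skipped=689)

for name, stats in [("positive", positive), ("negative", negative)]:
    print(name, {k: round(v, 4) for k, v in stats.items()})

print("successful adversarial examples:", 4723 + 8718)  # 13441
```

**Under these assumptions, the attack succeeds on roughly 45% of the attacked positive reviews and roughly 88% of the attacked negative ones, so the negative class is much easier to flip. The two runs together yield 13,441 successful adversarial examples, which is consistent with the 10,000 / 3,441 split used later in this notebook.**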

**Now, let us create our final adversarial dataset.**

In [23]:
import pandas as pd

df_positive = pd.read_csv('adversarial_text_positive.csv')
df_negative = pd.read_csv('adversarial_text_negative.csv')

df_positive = df_positive[ df_positive['result_type'] == 'Successful']
df_negative = df_negative[ df_negative['result_type'] == 'Successful']

negative_samples = [s.replace('[[', '').replace(']]', '') for s in df_negative.perturbed_text]
positive_samples = [s.replace('[[', '').replace(']]', '') for s in df_positive.perturbed_text]

df_positive_adversarial_samples = pd.DataFrame(positive_samples, columns=['review'])
df_negative_adversarial_samples = pd.DataFrame(negative_samples, columns=['review'])

df_positive_adversarial_samples['sentiment'] = 1
df_negative_adversarial_samples['sentiment'] = 0

adversarial_data = pd.concat([df_positive_adversarial_samples,
    df_negative_adversarial_samples], 
    ignore_index=True).sample(frac=1).reset_index(drop=True)

display(adversarial_data)

Unnamed: 0,review,sentiment
0,it would be Mreat if there will be custom vide...,1
1,the app is fine but the actual delivery is aYf...,0
2,loev being abOle to stay in otuch with firends...,1
3,uited line is just getting longer and moving s...,0
4,usairways knows customer service htank you for...,1
...,...,...
13436,unHted sadly this wasnt just due to mother nat...,0
13437,Sunited i was on ua3782 and it was cZncelled f...,0
13438,southwestair stewardess really funny now i cou...,1
13439,sotuhwestair looks like you are up and running...,0


**The [original dataset](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv) and the [adversarial dataset](https://drive.google.com/uc?export=download&id=1ECDSiXsrhiIBymjjMqNEgZqQKjQ31C3h) can be downloaded in the cell below.**

**We will mix most of the adversarial dataset with the original dataset to train our new model. We will also set aside an `adversarial_test` set to compare the robustness of both models at the end.**

In [32]:
import pandas as pd
import urllib.request

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv', 
    'sentiment_analysis_dataset.csv'
)

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1ECDSiXsrhiIBymjjMqNEgZqQKjQ31C3h', 
    'adversarial_text_data.csv'
)

adversarial_data = pd.read_csv('adversarial_text_data.csv')
data = pd.read_csv('sentiment_analysis_dataset.csv')

adversarial_training = adversarial_data.head(10000)
adversarial_test = adversarial_data.tail(3441)

data = pd.concat([adversarial_training, data]).sample(frac=1).reset_index(drop=True)

display(data, adversarial_test)

Unnamed: 0,review,sentiment
0,what is there to say about an anti establishme...,1
1,southweNtair if this flight is cancelled fligh...,0
2,aemricanair im beyond confused here why would ...,0
3,uGsairways its fine <SPLIT><SPLIT>just wonderi...,0
4,i watched sea of dust at the rhode island horr...,1
...,...,...
95084,standard twitter app nothing bad nor good to s...,1
95085,theres no support within the app you pay for a...,0
95086,would like to be able to chat during live chat...,1
95087,i ended up watching this movie before even goi...,0


Unnamed: 0,review,sentiment
10000,ameUicanair 800 number will not even let you w...,0
10001,united joni did a gNeat job on flight 5653 to ...,1
10002,uasirways americanKair im sNtuck in the airpor...,0
10003,sothwestair i recommend upgrading your ivr and...,0
10004,usaidrways peterpiatetsky ummmm i think us air...,0
...,...,...
13436,unHted sadly this wasnt just due to mother nat...,0
13437,Sunited i was on ua3782 and it was cZncelled f...,0
13438,southwestair stewardess really funny now i cou...,1
13439,sotuhwestair looks like you are up and running...,0


**We will now train the "same model" (_same architecture and hyperparameters_) on this mixed dataset. After training, you can compare the model's performance (_accuracy_) with that of the model trained in the [original notebook](xxx).**

In [35]:
import io
import json
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, tokenizer_from_json

vocab_size = 5000
embed_size = 128
sequence_length = 250

tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(data.review)
tokenizer_json = tokenizer.to_json()

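# `to_json()` already returns a JSON string; the extra `json.dumps` matches the
# `json.load` + `tokenizer_from_json` loading pattern used in the other cells.
with io.open('models/tokenizer_senti_model_with_adversarial_training.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))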

x_train, x_test, y_train, y_test = train_test_split(
    data.review, data.sentiment, test_size=0.2, random_state=42)

x_train = pad_sequences(
    tokenizer.texts_to_sequences(x_train), 
    maxlen=sequence_length, 
    truncating='post')
x_test = pad_sequences(
    tokenizer.texts_to_sequences(x_test), 
    maxlen=sequence_length, 
    truncating='post')
y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)


inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss=tf.losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        "models/senti_model_with_adversarial_training.keras",
        save_best_only=True),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=3,
        verbose=1,
        mode="auto",
        baseline=None,
        restore_best_weights=True),
]

model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')

Version:  2.10.1
Eager mode:  True
GPU is available
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 128)         640000    
                                                                 
 bidirectional (Bidirectiona  (None, None, 128)        98816     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total par

**We will define the robustness score of a model as its accuracy on the held-out adversarial test set (`adversarial_test`). We start with the baseline model.**
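
**Written out by hand, this score is simply the fraction of adversarial samples that the model still classifies correctly. The sketch below assumes a decision threshold of 0.5 on the sigmoid output, which is our understanding of how Keras computes `accuracy` for a binary model; the helper name `robustness_score` and the toy numbers are made up for illustration.**

```python
def robustness_score(positive_probs, labels, threshold=0.5):
    """Accuracy of sigmoid outputs against true labels on adversarial samples."""
    if len(positive_probs) != len(labels):
        raise ValueError("positive_probs and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(
        int(p > threshold) == int(y)
        for p, y in zip(positive_probs, labels)
    )
    return correct / len(labels)


# Toy example: two of the four adversarial samples are still classified correctly.
probs = [0.91, 0.12, 0.40, 0.75]
labels = [1, 0, 1, 0]
print(robustness_score(probs, labels))  # 0.5
```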

In [40]:
model = tf.keras.models.load_model('models/senti_model.keras')

# Load the baseline tokenizer; the `with` block closes the file automatically.
with open('models/tokenizer_senti_model.json') as fp:
    tokenizer = tokenizer_from_json(json.load(fp))

x_adversarial = pad_sequences(
    tokenizer.texts_to_sequences(adversarial_test.review), 
    maxlen=250, 
    truncating='post')
y_adversarial = np.array(adversarial_test.sentiment).astype(float)

_, robustness_score = model.evaluate(x_adversarial, y_adversarial)

print('\nModel Evaluation Against Adversaries\n\n')
print(f'Robustness Score: {round(robustness_score * 100, 2)} %.')


Model Evaluation Against Adversaries


Robustness Score: 0.0 %.


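**A score of 0% is expected here: the adversarial test set contains only perturbed reviews that successfully fooled the baseline model. By contrast, the model trained on a combination of normal and perturbed data reaches a robustness score of about 84% on the same set!**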

In [39]:
model = tf.keras.models.load_model('models/senti_model_with_adversarial_training.keras')

# Load the tokenizer fitted on the mixed dataset; the `with` block closes the file.
with open('models/tokenizer_senti_model_with_adversarial_training.json') as fp:
    tokenizer = tokenizer_from_json(json.load(fp))

x_adversarial = pad_sequences(
    tokenizer.texts_to_sequences(adversarial_test.review), 
    maxlen=250, 
    truncating='post')
y_adversarial = np.array(adversarial_test.sentiment).astype(float)

_, robustness_score = model.evaluate(x_adversarial, y_adversarial)

print('\nModel Evaluation Against Adversaries\n\n')
print(f'Robustness Score: {round(robustness_score * 100, 2)} %.')


Model Evaluation Against Adversaries


Robustness Score: 83.87 %.


**Adversarial training can help AI developers create more robust models against adversaries.** 🙃 
