# Improving Robustness Against Evasion Attacks with _Adversarial Training_ (`Natural Language Processing`)

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**_Adversarial machine learning_ is the study of attacks on [machine learning](https://en.wikipedia.org/wiki/Machine_learning "Machine learning") algorithms and of the defenses against such attacks. Recent surveys show that practitioners report a pressing need to better protect machine learning systems in real-world applications.**

**One way to make models more resilient against adversarial attacks is _adversarial training_. In adversarial training, we generate adversarial examples and use them, with their correct labels, as additional samples to train (or retrain) the original model, making it more robust.**

**In this notebook, we will create an adversarial dataset, use part of it to train a new model, and use the rest to compare the robustness of the original and the new model. You can learn more about evasion attacks in the context of NLP (a.k.a. _adversarial examples_) in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/bbe9c0a77499fa68de7c6d53bf5ef7e0b43a25e0/ML%20Adversarial/adversarial_text_attack.ipynb).**

**The technique used in this notebook is _DeepWordBug_, an attack that performs simple character-level transformations (_changes certain letters of a word_) to the highest-ranked tokens (proposed in [Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers](https://arxiv.org/abs/1801.04354)).**
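
**To give a feel for what "character-level transformations" means, here is a minimal sketch of the four edit types described in the DeepWordBug paper (swap, substitution, deletion, insertion). It is an illustration only: the function names are our own, it is not the `textattack` implementation, and it leaves out the token-ranking step that decides which words get perturbed.**

```python
import random
import string


def swap(word, rng):
    """Swap two neighbouring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute(word, rng):
    """Replace one character with a random lowercase letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]


def delete(word, rng):
    """Remove one character."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert(word, rng):
    """Insert a random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


# Hypothetical helper names, used only for this illustration.
TRANSFORMS = {"swap": swap, "substitute": substitute, "delete": delete, "insert": insert}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
perturbed = {name: fn("wonderful", rng) for name, fn in TRANSFORMS.items()}
for name, word in perturbed.items():
    print(f"{name:>10}: {word}")
```

**Each edit changes a word by a single character, which is often enough to push it out of a tokenizer's vocabulary while leaving it readable to a human.**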

**In this notebook, our baseline model (used to create the adversarial examples and as the point of comparison for our new model) is the same `Bidirectional long short-term memory (bi-lstm)` network trained in [this](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/bbe9c0a77499fa68de7c6d53bf5ef7e0b43a25e0/ML%20Explainability/NLP%20Interpreter%20(en)/model_maker_en.ipynb) notebook.**

**The dataset used to train that model is `sentiment_analysis_dataset.csv`, available for download [here](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv).**

In [2]:
import json
import torch
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, tokenizer_from_json

# Escape the backslash: '\s' is an invalid escape sequence in a normal string
model = keras.models.load_model('models\\senti_model.h5')

with open('models\\tokenizer_senti_model.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
    word_index = tokenizer.word_index

strings = [
    'is hard to say something about a model so simple',
    'you call this NLP, please, my nana can do it better in pascal',
    'this model is garbage, i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of NLP',
    'this model is great, one of the best models ever done by a human'
]

preds = model.predict(
        keras.preprocessing.sequence.pad_sequences(
                                                    tokenizer.texts_to_sequences(strings),
                                                    maxlen=256,
                                                    truncating='post'
                                                ),
    verbose=0)

for i, string in enumerate(strings):
    print(f'{string}\n')
    print(f'Negative Sentiment 😔 {round((1 - preds[i][0]) * 100)}% | Positive Sentiment 😊 {round(preds[i][0] * 100)}%\n{"*" * 50}')

is hard to say something about a model so simple

Negative Sentiment 😔 100% | Positive Sentiment 😊 0%
**************************************************
you call this NLP, please, my nana can do it better in pascal

Negative Sentiment 😔 85% | Positive Sentiment 😊 15%
**************************************************
this model is garbage, i wont my money back

Negative Sentiment 😔 100% | Positive Sentiment 😊 0%
**************************************************
is nice to see philosophers doing machine learning

Negative Sentiment 😔 0% | Positive Sentiment 😊 100%
**************************************************
this is a great and wonderful example of NLP

Negative Sentiment 😔 0% | Positive Sentiment 😊 100%
**************************************************
this model is great, one of the best models ever done by a human

Negative Sentiment 😔 0% | Positive Sentiment 😊 100%
**************************************************


**In this notebook, we will be exploring one of the functionalities of the `textattack` library.**

> TextAttack is a Python framework for adversarial attacks, data augmentation, and model training in NLP.

**With `textattack`, we can _wrap_ a model (e.g., a Keras, TensorFlow, scikit-learn, or AllenNLP model) by subclassing the `ModelWrapper` class. By implementing its `__call__` method, we define a function that returns the prediction scores of our model.**

**How this method is written is task-specific, since it depends on the native output format of your model. Below, you can see how to turn the output of a `sigmoid function` (the last layer of our `bi-lstm`) into a torch tensor that contains the probabilities for each of the sentiment classes ($0$ for negative, $1$ for positive).**
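
**Before looking at the wrapper itself, the core conversion can be sketched in plain Python. A sigmoid layer returns a single number $p$ per sample (the probability of the positive class), so the two-class row is simply $[1 - p, p]$. The values below are made up for illustration, and `sigmoid_to_class_probs` is a hypothetical helper name.**

```python
def sigmoid_to_class_probs(sigmoid_outputs):
    """Turn [[p], [p], ...] into [[1 - p, p], ...] (column 0 = negative, column 1 = positive)."""
    return [[1.0 - row[0], row[0]] for row in sigmoid_outputs]


# Made-up sigmoid outputs with shape (3, 1), for illustration only.
raw = [[0.02], [0.85], [0.60]]
probs = sigmoid_to_class_probs(raw)

# The predicted class is the index of the larger probability in each row.
predicted = [max(range(2), key=lambda c: row[c]) for row in probs]

for row, label in zip(probs, predicted):
    print(row, "->", label)
```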

In [2]:
from textattack.models.wrappers import ModelWrapper

class ModelWrapper(ModelWrapper):
    def __init__(self, model):
        self.model = model

    def __call__(self, text_input_list):
        text_array = tokenizer.texts_to_sequences(text_input_list)
        padded_text_array = keras.preprocessing.sequence.pad_sequences(
                                                    text_array,
                                                    maxlen=256,
                                                    truncating='post'
                                                )
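        preds = self.model.predict(padded_text_array, verbose=0)
        # The sigmoid output is P(positive) with shape (batch, 1): probabilities, not logits
        positive = torch.tensor(preds).squeeze(dim=-1)
        # Stack into shape (batch, 2): column 0 = P(negative), column 1 = P(positive)
        final_preds = torch.stack((1 - positive, positive), dim=1)
        return final_preds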


**Now, let us see the outputs of our `ModelWrapper`.**

In [3]:
ModelWrapper(model)([
    'is hard to say something about a model so simple',
    'you call this NLP, please, my nana can do it better in pascal',
    'this model is garbage, i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of NLP',
    'this model is great, one of the best models ever done by a human'
])

tensor([[9.9923e-01, 7.7294e-04],
        [8.4638e-01, 1.5362e-01],
        [9.9993e-01, 7.4672e-05],
        [5.7155e-04, 9.9943e-01],
        [1.0251e-03, 9.9897e-01],
        [1.2398e-05, 9.9999e-01]])

**Exactly what we wanted: the probabilities agree with the sentiment of each input. Now we can use one of the attack recipes available in `textattack.attack_recipes`.**

**However, we need something to attack. We will be using our `sentiment_analysis_dataset.csv` to create adversarial examples against our model. As said before, we will be using the DeepWordBug recipe, which is a fast recipe for adversarial attacks. Other attacks and recipes are demonstrated in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/bbe9c0a77499fa68de7c6d53bf5ef7e0b43a25e0/ML%20Adversarial/adversarial_text_attack.ipynb).**


In [3]:
import textattack
import urllib.request
from sklearn.model_selection import train_test_split

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv', 
    'sentiment_analysis_dataset.csv'
)

df = pd.read_csv('sentiment_analysis_dataset.csv', encoding='utf8')

df_positive = df[df['sentiment'] == 1]
df_negative = df[df['sentiment'] == 0]

**To create a `textattack.datasets.Dataset`, you only need to transform your inputs $X$ and labels $Y$ into a list of tuples (`text, label`). `textattack` will use these samples to create adversarial examples against our model.**

**We are saving all of the performed attacks in a `CSV` file for later use.**
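
**As a rough sketch of that later use, the logged attacks can be read back and turned into (`text, label`) pairs. The snippet assumes the CSV log has `perturbed_text`, `ground_truth_output`, and `result_type` columns and that changed words are wrapped in `[[...]]`. Check the header of your own log file, since this layout is an assumption here. The two rows in `LOG` are invented, and `adversarial_pairs` is a hypothetical helper, not necessarily how our adversarial dataset was built.**

```python
import csv
import io
import re

# Invented two-row excerpt in the layout assumed for the CSV log.
LOG = '''original_text,perturbed_text,ground_truth_output,result_type
"a [[great]] product","a [[graet]] product",1,Successful
"[[awful]] service","[[awful]] service",0,Failed
'''


def adversarial_pairs(csv_file):
    """Yield (perturbed_text, true_label) for successful attacks only."""
    for row in csv.DictReader(csv_file):
        if row["result_type"] != "Successful":
            continue
        # Strip the [[...]] markers that highlight the perturbed words.
        text = re.sub(r"\[\[|\]\]", "", row["perturbed_text"])
        yield text, int(float(row["ground_truth_output"]))


pairs = list(adversarial_pairs(io.StringIO(LOG)))
print(pairs)
```

**With a real log, `io.StringIO(LOG)` would be replaced by an open file handle, e.g. `open("adversarial_text_positive.csv", encoding="utf-8", newline="")`.**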

**The cells below run the attack and generate a large number of successful adversarial examples. This can take a while (several hours in our runs), so if you would like to skip this part, you can download our adversarial dataset directly in the later cells.**

In [12]:
model_wrapper = ModelWrapper(model)

from textattack.attack_recipes import DeepWordBugGao2018
from textattack import Attacker

x = list(df_positive.review)
y = np.array(list(df_positive.sentiment)).astype(int) 

# Keep only reviews shorter than 256 characters (note: characters, not tokens)
data = [(x[i], int(y[i])) for i in range(len(df_positive)) if len(x[i]) < 256]


dataset = textattack.datasets.Dataset(data)
attack = DeepWordBugGao2018.build(model_wrapper)
attack_args = textattack.AttackArgs(
    num_examples=len(data),
    log_to_csv ="adversarial_text_positive.csv",
    silent = True,
    disable_stdout= True
    )
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()


textattack: Unknown if model of class <class 'keras.engine.functional.Functional'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
[Succeeded / Failed / Skipped / Total] 5865 / 5055 / 258 / 11178: 100%|██████████| 11178/11178 [3:19:38<00:00,  1.07s/it]






[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x298224692e0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x2982602ed00>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x298150e1910>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x29825e6c5e0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x29827122460>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x29825ed5b80>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x298224691f0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x298270edfa0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x298270ad910>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x29827096eb0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x2

In [5]:
model_wrapper = ModelWrapper(model)

from textattack.attack_recipes import DeepWordBugGao2018
from textattack import Attacker

x = list(df_negative.review)
y = np.array(list(df_negative.sentiment)).astype(int) 

# Keep only reviews shorter than 256 characters (note: characters, not tokens)
data = [(x[i], int(y[i])) for i in range(len(df_negative)) if len(x[i]) < 256]

dataset = textattack.datasets.Dataset(data)
attack = DeepWordBugGao2018.build(model_wrapper)
attack_args = textattack.AttackArgs(
    num_examples=len(data),
    log_to_csv ="adversarial_text_negative.csv",
    silent = True,
    disable_stdout= True
    )
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()

textattack: Unknown if model of class <class 'keras.engine.functional.Functional'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
[Succeeded / Failed / Skipped / Total] 9035 / 1141 / 353 / 10529: 100%|██████████| 10529/10529 [1:26:35<00:00,  2.03it/s]






[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x26906c1f190>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x26a028911f0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x26a0a070ca0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x269b139ceb0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x26a0a14e490>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x26a0a14e4f0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x26a0a09ed90>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x26a0a1c31f0>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x26a0a1a1ca0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x26a0a1b15e0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at
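
**The two progress lines above can be summarized as an attack success rate. The sketch below uses the usual convention of dividing successes by attempted attacks (succeeded + failed), leaving out skipped samples, which are the ones the model already misclassified before any perturbation. Treat this convention as an assumption if you compare against numbers reported elsewhere. `attack_success_rate` is a hypothetical helper name.**

```python
def attack_success_rate(succeeded, failed):
    """Share of attempted attacks that flipped the prediction, in percent."""
    return 100.0 * succeeded / (succeeded + failed)


# Counts copied from the two progress lines logged above.
runs = {
    "positive reviews": {"succeeded": 5865, "failed": 5055, "skipped": 258},
    "negative reviews": {"succeeded": 9035, "failed": 1141, "skipped": 353},
}

rates = {
    name: attack_success_rate(counts["succeeded"], counts["failed"])
    for name, counts in runs.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% of attempted attacks succeeded")
```

**On these counts, the attack succeeds far more often against negative reviews than against positive ones.**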

**The [original dataset](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv) and the [adversarial dataset](https://drive.google.com/uc?export=download&id=1ECDSiXsrhiIBymjjMqNEgZqQKjQ31C3h) can be downloaded in the cell below.**

**We will mix most of the adversarial dataset with the original dataset to train our new model. We will also set aside an `adversarial_test` set to compare the robustness of both models at the end.**

In [4]:
import numpy as np
import pandas as pd
import urllib.request
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv', 
    'sentiment_analysis_dataset.csv'
)


urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1ECDSiXsrhiIBymjjMqNEgZqQKjQ31C3h', 
    'adversarial_text_data.csv'
)

adversarial_data = pd.read_csv('adversarial_text_data.csv',  index_col=[0])
data = pd.read_csv('sentiment_analysis_dataset.csv',  index_col=[0])

adversarial_training = adversarial_data.head(30000)
adversarial_test = adversarial_data.tail(7129)

data = pd.concat([adversarial_training, data])
data = data.sample(frac=1).reset_index(drop=True)

display(data, adversarial_test)

| | review | sentiment |
|---|---|---|
| 0 | Where oh where to begin in describing the comp... | 0 |
| 1 | I loved the first 15 minutes, and I loved some... | 0 |
| 2 | etblue i ish you all the beYst of luck m efnj... | 1 |
| 3 | nited just flew to telaviv paid 100 from a hir... | 0 |
| 4 | united no the entire problem here is that i wa... | 0 |
| ... | ... | ... |
| 115084 | In 1937 Darryl Zanuck, who had recently moved ... | 1 |
| 115085 | okay, this movie f*ck in' rules. it is without... | 0 |
| 115086 | nice product price sound quality crestal clear | 1 |
| 115087 | mostly a good experience of this lately but i ... | 1 |


| | review | sentiment |
|---|---|---|
| 30000 | bsairways what is with your lot amp found my ... | 0 |
| 30001 | mall dFevice soWund hok voice ecognition fWast... | 1 |
| 30002 | united <SPLIT>i will admit youve been rather g... | 1 |
| 30003 | not find anything wrong alexa dot goNd product | 1 |
| 30004 | wont work without poer voice slow wifi compuls... | 1 |
| ... | ... | ... |
| 37124 | uniUed i take back the comment about your team... | 0 |
| 37125 | uJairways i tried to call your custmer sertice... | 0 |
| 37126 | soknd qality noral basic respones goid Iompare... | 1 |
| 37127 | sairways why dJont you hire people to deal wit... | 0 |


**We will now train the "same model" (_same architecture and hyperparameters_) on this combined dataset. After training, if you compare the model's performance (_accuracy_) with that of the model trained in the [original notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/bbe9c0a77499fa68de7c6d53bf5ef7e0b43a25e0/ML%20Explainability/NLP%20Interpreter%20(en)/model_maker_en.ipynb), you will see that the new model scores roughly 1% higher in testing.**

In [40]:
x = list(data.review)
y = list(data.sentiment)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)


vocab_size = 3000
embed_size = 50
max_len = 256
tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(x_train)
training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(
    training_sequences, maxlen=max_len, truncating='post')

inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=max_len)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss=tf.losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()
model.fit(training_padded,
          y_train,
          batch_size=128,
          validation_split = 0.2,
          epochs=20,
          verbose=1)

test_sequences = tokenizer.texts_to_sequences(x_test)
test_padded = pad_sequences(test_sequences, maxlen=256, truncating='post')

test_loss_score, test_acc_score = model.evaluate(test_padded, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')


# Escape the backslash: '\s' is an invalid escape sequence in a normal string
model.save("models\\senti_model_with_adversarial_training.h5")

import io
import json
from keras.preprocessing.text import tokenizer_from_json

tokenizer_json = tokenizer.to_json()
with io.open('models\\tokenizer_senti_model_with_adversarial_training.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

Version:  2.10.0
Eager mode:  True
GPU is available
Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 50)          150000    
                                                                 
 bidirectional_8 (Bidirectio  (None, None, 128)        58880     
 nal)                                                            
                                                                 
 bidirectional_9 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense_4 (Dense)             (None, 1)                 129       
                                                                 
Total p

**We will define the robustness score of a model as its accuracy on the held-out adversarial test set (`adversarial_test`).**
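
**Spelled out in plain Python, the definition looks like the sketch below. It assumes the model outputs one sigmoid probability per sample and that a prediction counts as positive when that probability exceeds $0.5$ (the default threshold of Keras' binary accuracy). The probabilities and labels are made up for illustration, and `robustness_score` is a hypothetical helper name.**

```python
def robustness_score(probabilities, labels, threshold=0.5):
    """Accuracy on adversarial inputs: fraction of predictions that match the true label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Made-up sigmoid outputs and true labels, for illustration only.
example_probs = [0.91, 0.12, 0.40, 0.77]
example_labels = [1, 0, 1, 1]

score = robustness_score(example_probs, example_labels)
print(f"Robustness Score: {round(score * 100, 2)} %.")
```

**In the cells below, `model.evaluate` computes the same quantity directly on the padded adversarial sequences.**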

In [5]:
x_adversarial = list(adversarial_test.review)
y_adversarial = np.array(list(adversarial_test.sentiment)).astype(float)

# Baseline model (escaped backslash: '\s' is an invalid escape sequence)
model = keras.models.load_model('models\\senti_model.h5')

with open('models\\tokenizer_senti_model.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
    word_index = tokenizer.word_index

x_adversarial = tokenizer.texts_to_sequences(x_adversarial)
x_adversarial = pad_sequences(x_adversarial, maxlen=256, truncating='post')

_, robustness_score = model.evaluate(x_adversarial, y_adversarial)

print(f'\n# Model Evaluation Against Adversaries\n\n{"-" * 50}\n')
print(f'Robustness Score: {round(robustness_score * 100, 2)} %.')


# Model Evaluation Against Adversaries

--------------------------------------------------

Robustness Score: 2.12 %.


**However, the model trained on a combination of normal and perturbed data achieves a robustness score of around 93%!**

In [7]:
x_adversarial = list(adversarial_test.review)
y_adversarial = np.array(list(adversarial_test.sentiment)).astype(float)

# Adversarially trained model (escaped backslash: '\s' is an invalid escape sequence)
model = keras.models.load_model('models\\senti_model_with_adversarial_training.h5')

with open('models\\tokenizer_senti_model_with_adversarial_training.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
    word_index = tokenizer.word_index

x_adversarial = tokenizer.texts_to_sequences(x_adversarial)
x_adversarial = pad_sequences(x_adversarial, maxlen=256, truncating='post')

_, robustness_score = model.evaluate(x_adversarial, y_adversarial)

print(f'\n# Model Evaluation Against Adversaries\n\n{"-" * 50}\n')
print(f'Robustness Score: {round(robustness_score * 100, 2)} %.')


# Model Evaluation Against Adversaries

--------------------------------------------------

Robustness Score: 93.58 %.


**Adversarial training can help AI developers build models that are more robust against adversaries, sometimes even improving overall performance.** 🙃

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).