# Adverserial Training
This is our core work. We attack our initial hate speech model to find out our baseline accuracy. 
After that, we use adversarial training on the pre-trained Roberta model to see if we can 
improve the accuracy. This trained hate speech model will be attacked again to see if we can achieve any improvements.

Following naming will be used below:
- <strong>Pre-Trained Model:</strong> This is the [RoBERTa model ](https://huggingface.co/docs/transformers/model_doc/roberta) model from Huggingface
- <strong>Initial Hate Speech Model:</strong> This is our RoBERTa model, which we trained on the Hate speech data set.
- <strong>Trained Hate Speech Model:</strong> RoBERTa model, which was trained using adversarial training


## Install

In [None]:
!pip3 install transformers[torch]
!pip3 install textattack[tensorflow,optional]
#!pip3 install --force-reinstall textattack
!pip3 install --upgrade tensorflow
#!pip install accelerate -U
!pip3 install sentence_transformers
!pip3 install pandas

## Import

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.wsd import lesk
nltk.download('stopwords')
nltk.download('punkt')

# textattack packages
import textattack
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.constraints.semantics import WordEmbeddingDistance

# transformers packages
from transformers import RobertaTokenizer, RobertaForSequenceClassification, RobertaConfig
from transformers import RobertaTokenizer, RobertaForSequenceClassification


from trainer import Trainer


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marinjaprincipe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/marinjaprincipe/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Baseline Analysis of the Initial Hate Speech Model
As the first step we want to get a baseline of the accuracy of our Initial Hate Speech Model (the training of this model is done in notebook inital_hate_speech_model_training.ipynb). 
To do so we attack the Initial Hate Speech Model with our custom attack and see how it performes.

In a second step, all susccessfull attacks will be used to traine the pre-trained model in order to achieve a better result.

#### Data cleaning
Since the data needs to be cleaned for the attack, we defined the following function.

In [27]:
#this is copy from https://www.kaggle.com/code/soumyakushwaha/ethicalcommunicationai
# ----------------------------------------
stopword = set(stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub(r"\@w+|\#",'',text)
    text = re.sub(r"[^\w\s]",'',text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\w*\d\w*', '', text)
    tweet_tokens = word_tokenize(text)
    filtered_tweets=[w for w in tweet_tokens if not w in stopword] #removing stopwords
    return " ".join(filtered_tweets)
#--------------------------------------------------------------------------------------

#### Load Dataset


In [28]:
# Constants
SEED = 42
BATCH_SIZE = 32
LEARNING_RATE = 1e-5
MAX_TEXT_LENGTH = 512
EPOCHS = 10
MODEL_PATH = 'roberta_model.bin'
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Set seeds
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


labeled_data = pd.read_csv('./datasets/hate_speech_data.csv')
# Hate Speech and Offensive Language Data: 25.3k total entries.
# - Class 0: 1,430 entries (hate speech)
# - Class 1: 19,190 entries (offensive language)
# - Class 2: 4,163 entries (neither)

# Processing labeled hate speech dataset
hate_offensive_data = labeled_data[labeled_data['class'] != 2].copy()
hate_offensive_data.loc[:, 'category'] = hate_offensive_data['class'].replace([0, 1], 1)
hate_offensive_data = hate_offensive_data.rename(columns={'tweet': 'text'})

# Test 1 ---
# Select data for each class
hate_speech_data = labeled_data[labeled_data['class'] == 0].copy()
offensive_data = labeled_data[labeled_data['class'] == 1].copy()
neither_data = labeled_data[labeled_data['class'] == 2].copy()
sample_size = len(hate_speech_data)
offensive_sample = offensive_data.sample(n=sample_size, random_state=SEED)
neither_sample = neither_data.sample(n=sample_size, random_state=SEED)
hate_speech_data['category'] = 1
offensive_sample['category'] = 1
neither_sample['category'] = 0
sampled_data = pd.concat([hate_speech_data, offensive_sample, neither_sample], ignore_index=True)[['tweet', 'category']]
sampled_data.rename(columns={'tweet': 'text', 'category': 'label'}, inplace=True)
sampled_data['text'] = sampled_data['text'].apply(clean_text)  # Assuming clean_text is a defined function
train_data, intermediate_data = train_test_split(sampled_data, test_size=0.3, random_state=SEED)
validation_data, test_data = train_test_split(intermediate_data, test_size=0.5, random_state=SEED)
train_tokens = tokenizer(train_data['text'].tolist(), padding=True, truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors='pt')
validation_tokens = tokenizer(validation_data['text'].tolist(), padding=True, truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors='pt')
test_tokens = tokenizer(test_data['text'].tolist(), padding=True, truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors='pt')
print(f"New Train data shape: {train_data.shape}")
print(f"New Validation data shape: {validation_data.shape}")
print(f"New Test data shape: {test_data.shape}")


New Train data shape: (3003, 2)
New Validation data shape: (643, 2)
New Test data shape: (644, 2)


#### Load our Initial Hate Speech Model

In [56]:
config = RobertaConfig()
config.num_labels = 2
roberta_base_config = {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

for key in roberta_base_config.keys():
    setattr(config, key, roberta_base_config[key])

initial_hate_speech_model = RobertaForSequenceClassification(config)
map_location=torch.device('cpu')
initial_hate_speech_model.load_state_dict(torch.load('roberta_model.bin', map_location=map_location))
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
initial_hate_speech_model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

## Attack Setup
Now as we have loaded our trained model, we can attack it. To do so we try different attacks:

- a custom attack
- the Bert-attack from textattack
- bae attack from textattack
- textfooler from textattack


### Custom Attack

In [8]:
ATTACK_SEED = 71

def create_custom_attack(model):
    
    # Define custom attack based on https://textattack.readthedocs.io/en/latest/api/attack.html used for training loop
    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

    #UntagetedClassification: An untargeted attack on classification models which attempts
    #to minimize the score of the correct label until it is no longer the predicted label.
    goal_function = textattack.goal_functions.UntargetedClassification(model_wrapper)

    constraints = [
        RepeatModification(), # prevents the same word from being modified multiple times
        StopwordModification(), # controls the modification of stopwords (e.g., "the," "is," "and")
        WordEmbeddingDistance(min_cos_sim=0.9), # measures the cosine similarity between word embeddings to ensure that the replacement word is semantically similar
    ]

    transformation = textattack.transformations.word_swaps.word_swap_embedding.WordSwapEmbedding(max_candidates=50) # (50 is default)
    search_method = textattack.search_methods.GreedyWordSwapWIR(wir_method="delete")
    custom_attack = textattack.Attack(goal_function, constraints, transformation, search_method) # perform the attack

    return custom_attack


In [24]:
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and checkpoint saved every 5 interval
attack_args = textattack.AttackArgs(random_seed=ATTACK_SEED, num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
custom_attacker = textattack.Attacker(create_custom_attack(initial_hate_speech_model), dataset, attack_args)
custom_attacker.attack_dataset()

textattack: Unknown if model of class <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path log.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.9
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
) 



textattack: Saving checkpoint under "checkpoints/1697915223425.ta.chkpt" at 2023-10-21 21:07:03 after 5 attacks.







textattack: Saving checkpoint under "checkpoints/1697915235480.ta.chkpt" at 2023-10-21 21:07:15 after 10 attacks.







textattack: Saving checkpoint under "checkpoints/1697915246190.ta.chkpt" at 2023-10-21 21:07:26 after 15 attacks.







textattack: Saving checkpoint under "checkpoints/1697915259038.ta.chkpt" at 2023-10-21 21:07:39 after 20 attacks.
[Succeeded / Failed / Skipped / Total] 3 / 16 / 1 / 20: 100%|██████████| 20/20 [00:59<00:00,  2.97s/it]





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 3      |
| Number of failed attacks:     | 16     |
| Number of skipped attacks:    | 1      |
| Original accuracy:            | 95.0%  |
| Accuracy under attack:        | 80.0%  |
| Attack success rate:          | 15.79% |
| Average perturbed word %:     | 12.76% |
| Average num. words per input: | 9.0    |
| Avg num queries:              | 16.53  |
+-------------------------------+--------+





[<textattack.attack_results.failed_attack_result.FailedAttackResult at 0x319206350>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x105866f10>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x31a4fa6d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x31a5cc810>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x318bfa650>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x31b35ead0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x3191f2f10>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x30d600dd0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x30d5b1210>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x31b0cde50>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c829b750>,
 <textattack.attack_results.failed_attack_result.

In [70]:
from textattack.constraints.grammaticality import PartOfSpeech
from textattack.constraints.semantics.sentence_encoders import UniversalSentenceEncoder

def create_bae_attack(model):
    
    # Define custom attack based on https://textattack.readthedocs.io/en/latest/api/attack.html used for training loop
    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

    #UntagetedClassification: An untargeted attack on classification models which attempts
    #to minimize the score of the correct label until it is no longer the predicted label.
    goal_function = textattack.goal_functions.UntargetedClassification(model_wrapper)

    constraints = [
        RepeatModification(), # prevents the same word from being modified multiple times
        StopwordModification(), # controls the modification of stopwords (e.g., "the," "is," "and")
        PartOfSpeech(allow_verb_noun_swap=True),
    ]

    transformation = textattack.transformations.word_swaps.word_swap_embedding.WordSwapEmbedding(max_candidates=50) # (50 is default)
    search_method = textattack.search_methods.GreedyWordSwapWIR(wir_method="delete")
    bae_attack = textattack.Attack(goal_function, constraints, transformation, search_method) # perform the attack

    return bae_attack

In [71]:

# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and checkpoint saved every 5 interval
attack_args = textattack.AttackArgs(random_seed=ATTACK_SEED, num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
custom_attacker = textattack.Attacker(create_bae_attack(initial_hate_speech_model), dataset, attack_args)
custom_attacker.attack_dataset()

textattack: Unknown if model of class <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path log.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
) 










[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697989689915.ta.chkpt" at 2023-10-22 17:48:09 after 5 attacks.














[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697989754272.ta.chkpt" at 2023-10-22 17:49:14 after 10 attacks.














[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697989816537.ta.chkpt" at 2023-10-22 17:50:16 after 15 attacks.














[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697989920467.ta.chkpt" at 2023-10-22 17:52:00 after 20 attacks.
[Succeeded / Failed / Skipped / Total] 18 / 1 / 1 / 20: 100%|██████████| 20/20 [05:31<00:00, 16.59s/it]





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 18     |
| Number of failed attacks:     | 1      |
| Number of skipped attacks:    | 1      |
| Original accuracy:            | 95.0%  |
| Accuracy under attack:        | 5.0%   |
| Attack success rate:          | 94.74% |
| Average perturbed word %:     | 31.0%  |
| Average num. words per input: | 9.0    |
| Avg num queries:              | 110.11 |
+-------------------------------+--------+





[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35ebffe10>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35cb8bb50>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x30ed57110>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x33c91ec90>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x30ef4c150>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35cc89090>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x30ef0b1d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x30ef1bed0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x33cc1ee90>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35cb8bbd0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at

## Train Model on the Attacked Data
We use now the attacking data to retrain our model again. For the training we use the trainer of the textattack library.
First we setup the evaluation and training dataset as well as the training arguments.

In [25]:
## Defin training base on https://textattack.readthedocs.io/en/latest/api/trainer.html
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
pretrained_roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
pretrained_roberta_model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(pretrained_roberta_model, tokenizer)

temp = list(validation_data.itertuples(index=False, name=None))
eval_dataset = textattack.datasets.Dataset(temp)

temp_train = list(train_data.itertuples(index=False, name=None))
train_dataset = textattack.datasets.Dataset(temp_train)
training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=1000, #500 also ok
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Run Custom Attack Trainer 

In [28]:
custom_attack_trainer_on_pretrained_model = Trainer(
    pretrained_roberta_model_wrapper,
    "classification",
    create_custom_attack(pretrained_roberta_model),
    train_dataset,
    eval_dataset,
    training_args
)
custom_attack_trainer_on_pretrained_model.train()

custom_attack_trainer_on_pretrained_model.evaluate()

textattack: Unknown if model of class <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: `model_wrapper` and the victim model of `attack` are not the same model.
textattack: Writing logs to ./outputs/2023-10-21-21-18-58-414556/train_log.txt.
textattack: Wrote original training args to ./outputs/2023-10-21-21-18-58-414556/training_args.json.
textattack: ***** Running training *****
textattack:   Num examples = 3003
textattack:   Num epochs = 3
textattack:   Num clean epochs = 1
textattack:   Instantaneous batch size per device = 8
textattack:   Total train batch size (w. parallel, distributed & accumulation) = 32
textattack:   Gradient accumulation steps = 4
textattack:   Total optimization steps = 346
textattack: Epoch 1
textattack: Running clean epoch 1/1
Loss 0.61639: 100%|██████████| 376/376 [42:05<00:00





Loss 0.46836: 100%|██████████| 430/430 [8:50:54<00:00, 74.08s/it]     
textattack: Train accuracy: 84.58%
textattack: Eval accuracy: 91.60%
textattack: Best score found. Saved model to ./outputs/2023-10-21-21-18-58-414556/best_model/
textattack: Epoch 3
textattack: Attacking model to generate new adversarial training set...
[Succeeded / Failed / Skipped / Total] 310 / 2524 / 169 / 3003:  31%|███       | 310/1000 [2:55:44<6:31:09, 34.01s/it]
textattack: Total number of attack results: 3003
textattack: Attack success rate: 10.94% [310 / 2834]






Loss 0.39479: 100%|██████████| 415/415 [46:32<00:00,  6.73s/it]
textattack: Train accuracy: 89.92%
textattack: Eval accuracy: 91.45%
textattack: Wrote README to ./outputs/2023-10-21-21-18-58-414556/README.md.
textattack: Eval accuracy: 91.45%


0.9144634525660964

#### Adverserial Training with Inital Hate Speech Model

In [26]:
## Defin training base on https://textattack.readthedocs.io/en/latest/api/trainer.html
initial_hate_speech_model = RobertaForSequenceClassification(config)
map_location=torch.device('cpu')
initial_hate_speech_model.load_state_dict(torch.load('roberta_model.bin', map_location=map_location))
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
initial_hate_speech_model.eval()
initial_hate_speech_model.to(map_location)
initial_hate_speech_model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(initial_hate_speech_model, tokenizer)

temp = list(validation_data.itertuples(index=False, name=None))
eval_dataset = textattack.datasets.Dataset(temp)

temp_train = list(train_data.itertuples(index=False, name=None))
train_dataset = textattack.datasets.Dataset(temp_train)
training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=1000,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)

In [29]:
custom_attack_trainer_on_initial_hate_speech_model = Trainer(
    initial_hate_speech_model_wrapper,
    "classification",
    create_custom_attack(initial_hate_speech_model),
    train_dataset,
    eval_dataset,
    training_args
)
custom_attack_trainer_on_pretrained_model.train()

custom_attack_trainer_on_pretrained_model.evaluate()

textattack: Unknown if model of class <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: `model_wrapper` and the victim model of `attack` are not the same model.
textattack: Writing logs to ./outputs/2023-10-21-21-18-58-414556/train_log.txt.
textattack: Wrote original training args to ./outputs/2023-10-21-21-18-58-414556/training_args.json.
textattack: ***** Running training *****
textattack:   Num examples = 3003
textattack:   Num epochs = 3
textattack:   Num clean epochs = 1
textattack:   Instantaneous batch size per device = 8
textattack:   Total train batch size (w. parallel, distributed & accumulation) = 32
textattack:   Gradient accumulation steps = 4
textattack:   Total optimization steps = 346
textattack: Epoch 1
textattack: Running clean epoch 1/1
Loss 0.00063:   1%|          | 3/376 [00:20<42:12, 

: 

: 

## Evaluate the Adverserial Trained Models

#### Custom Attack Trainer Evaluation

In [None]:
custom_attack_trainer_on_pretrained_model.evaluate()

In [None]:
custom_attack_trainer_on_initial_hate_speech_model.evaluate()

#### Re-Attack the Trained Hate Speech Model

In [9]:
# Attack with Custom Attack
trained_hate_speech_model = RobertaForSequenceClassification(config)
map_location=torch.device('cpu')
trained_hate_speech_model.load_state_dict(torch.load('outputs/2023-10-21-21-18-58-414556/best_model/pytorch_model.bin', map_location=map_location))
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
trained_hate_speech_model.eval()
trained_hate_speech_model.to(map_location)
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and checkpoint saved every 5 interval
attack_args = textattack.AttackArgs(random_seed=ATTACK_SEED, num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
custom_attacker = textattack.Attacker(create_custom_attack(trained_hate_speech_model), dataset, attack_args)
custom_attacker.attack_dataset()

textattack: Unknown if model of class <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path log.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.9
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
) 



[Succeeded / Failed / Skipped / Total] 0 / 5 / 0 / 5:  25%|██▌       | 5/20 [00:20<01:01,  4.08s/it]textattack: Saving checkpoint under "checkpoints/1697972050429.ta.chkpt" at 2023-10-22 12:54:10 after 5 attacks.







[Succeeded / Failed / Skipped / Total] 0 / 10 / 0 / 10:  50%|█████     | 10/20 [00:32<00:32,  3.23s/it]textattack: Saving checkpoint under "checkpoints/1697972062347.ta.chkpt" at 2023-10-22 12:54:22 after 10 attacks.







[Succeeded / Failed / Skipped / Total] 0 / 14 / 1 / 15:  75%|███████▌  | 15/20 [00:43<00:14,  2.88s/it]textattack: Saving checkpoint under "checkpoints/1697972073265.ta.chkpt" at 2023-10-22 12:54:33 after 15 attacks.







[Succeeded / Failed / Skipped / Total] 1 / 18 / 1 / 20: 100%|██████████| 20/20 [00:55<00:00,  2.75s/it]textattack: Saving checkpoint under "checkpoints/1697972085043.ta.chkpt" at 2023-10-22 12:54:45 after 20 attacks.
[Succeeded / Failed / Skipped / Total] 1 / 18 / 1 / 20: 100%|██████████| 20/20 [00:55<00:00,  2.75s/it]





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 1      |
| Number of failed attacks:     | 18     |
| Number of skipped attacks:    | 1      |
| Original accuracy:            | 95.0%  |
| Accuracy under attack:        | 90.0%  |
| Attack success rate:          | 5.26%  |
| Average perturbed word %:     | 16.67% |
| Average num. words per input: | 9.0    |
| Avg num queries:              | 18.42  |
+-------------------------------+--------+





[<textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c6f929d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c77b7210>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x3749444d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c8a16e90>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x301fd66d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c8a899d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x30bba5250>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x30bc7c4d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c71cbe50>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c5537e50>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x2c8a14050>,
 <textattack.attack_results.failed_attack_result.FailedAttackResu

#### Perform Text Fooler Attack on Trained Hate Speech Model

In [72]:
# Attack with Custom Attack
trained_hate_speech_model = RobertaForSequenceClassification(config)
map_location=torch.device('cpu')
trained_hate_speech_model.load_state_dict(torch.load('outputs/2023-10-21-21-18-58-414556/best_model/pytorch_model.bin', map_location=map_location))
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and checkpoint saved every 5 interval
attack_args = textattack.AttackArgs(random_seed=ATTACK_SEED, num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
bae_attacker = textattack.Attacker(create_bae_attack(trained_hate_speech_model), dataset, attack_args)
bae_attacker.attack_dataset()

textattack: Unknown if model of class <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path log.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
) 










[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697990347642.ta.chkpt" at 2023-10-22 17:59:07 after 5 attacks.














[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697990516602.ta.chkpt" at 2023-10-22 18:01:56 after 10 attacks.














[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697990662146.ta.chkpt" at 2023-10-22 18:04:22 after 15 attacks.














[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[A






[A[A[A[A[A[A[Atextattack: Saving checkpoint under "checkpoints/1697990821089.ta.chkpt" at 2023-10-22 18:07:01 after 20 attacks.
[Succeeded / Failed / Skipped / Total] 18 / 2 / 0 / 20: 100%|██████████| 20/20 [13:37<00:00, 40.87s/it]





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 18     |
| Number of failed attacks:     | 2      |
| Number of skipped attacks:    | 0      |
| Original accuracy:            | 100.0% |
| Accuracy under attack:        | 10.0%  |
| Attack success rate:          | 90.0%  |
| Average perturbed word %:     | 29.68% |
| Average num. words per input: | 9.0    |
| Avg num queries:              | 111.2  |
+-------------------------------+--------+





[<textattack.attack_results.failed_attack_result.FailedAttackResult at 0x33dc6bd50>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x357fead10>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x30ef436d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35ebd77d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35eb492d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35ebb3a10>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x3577a81d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x345a66a10>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x3441db3d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x35eb6be10>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x357ce

#### Bert attack trainer evaluation

In [None]:
bert_attack_trainer.evaluate()