# Adversarial Training
This is the core of our work. We first attack our Initial Hate Speech Model to establish a baseline accuracy under attack. 
We then apply adversarial training to the pre-trained RoBERTa model to see whether we can 
improve the accuracy. Finally, the resulting Trained Hate Speech Model is attacked again to check whether it has become more robust.

The following names are used below:
- <strong>Pre-Trained Model:</strong> The [RoBERTa model](https://huggingface.co/docs/transformers/model_doc/roberta) from Hugging Face.
- <strong>Initial Hate Speech Model:</strong> Our RoBERTa model, which we trained on the hate speech dataset.
- <strong>Trained Hate Speech Model:</strong> The RoBERTa model trained with adversarial training.


## Install

In [None]:
!pip3 install transformers[torch]
!pip3 install textattack[tensorflow,optional]
#!pip3 install --force-reinstall textattack
!pip3 install --upgrade tensorflow
#!pip install accelerate -U
!pip3 install sentence_transformers
!pip3 install pandas

## Import

In [6]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.wsd import lesk
nltk.download('stopwords')
nltk.download('punkt')

# textattack packages
import textattack
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.constraints.semantics import WordEmbeddingDistance

# transformers packages
from transformers import RobertaTokenizer, RobertaForSequenceClassification, RobertaConfig


from trainer import Trainer


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marinjaprincipe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/marinjaprincipe/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Baseline Analysis of the Initial Hate Speech Model
As a first step, we establish a baseline for the accuracy of our Initial Hate Speech Model (this model is trained in the notebook inital_hate_speech_model_training.ipynb). 
To do so, we attack the Initial Hate Speech Model with our custom attack and see how it performs.

In a second step, all successful attacks are used to train the pre-trained model in order to achieve a better result.

#### Data cleaning
Since the data needs to be cleaned for the attack, we defined the following function.

In [None]:
# Copied from https://www.kaggle.com/code/soumyakushwaha/ethicalcommunicationai
# ----------------------------------------
stopword = set(stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>+', '', text)                 # remove HTML-like tags
    # Kept as in the source: this matches "@" followed by literal "w" characters, or "#".
    # It does not match "@\w+", so user names survive without the "@".
    text = re.sub(r"\@w+|\#", '', text)
    text = re.sub(r"[^\w\s]", '', text)                # remove punctuation (except "_")
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove any remaining punctuation, including "_"
    text = re.sub(r'\n', '', text)                     # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)               # remove tokens that contain digits
    tweet_tokens = word_tokenize(text)
    filtered_tweets = [w for w in tweet_tokens if w not in stopword]  # remove stopwords
    return " ".join(filtered_tweets)
#--------------------------------------------------------------------------------------
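
To see what these cleaning steps do to a single tweet, here is a minimal, dependency-free sketch. It is not the function above: it replaces NLTK's `word_tokenize` and stop-word list with `str.split` and a tiny hand-picked set (`TOY_STOPWORDS` and `clean_text_sketch` are hypothetical names used only for illustration), so its output can differ from `clean_text` on real data.

```python
import re

# Hypothetical, tiny stop-word set; the notebook itself uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "a", "this", "at"}

def clean_text_sketch(text: str) -> str:
    """Simplified stand-in for clean_text that needs only the standard library."""
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"<.*?>+", "", text)                 # drop HTML-like tags
    text = re.sub(r"[^\w\s]", "", text)                # drop punctuation
    text = re.sub(r"\w*\d\w*", "", text)               # drop tokens containing digits
    tokens = text.split()                              # whitespace tokenization (assumption)
    return " ".join(t for t in tokens if t not in TOY_STOPWORDS)

example = "RT @user: This is GREAT!!! Look at https://example.com <b>now</b> 2day"
print(clean_text_sketch(example))  # rt user great look now
```

Note that the user name survives without its "@", which is the same effect visible in the cleaned tweets printed later in this notebook.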

#### Load Dataset


In [None]:
# Constants
SEED = 42
BATCH_SIZE = 32
LEARNING_RATE = 1e-5
MAX_TEXT_LENGTH = 512
EPOCHS = 10
MODEL_PATH = 'roberta_model.bin'
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Set seeds
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


labeled_data = pd.read_csv('/Users/marinjaprincipe/Documents/UZH/NPL/test/labeled_data 2.csv')
# Hate Speech and Offensive Language Data: 25.3k total entries.
# - Class 0: 1,430 entries (hate speech)
# - Class 1: 19,190 entries (offensive language)
# - Class 2: 4,163 entries (neither)

# Processing labeled hate speech dataset
hate_offensive_data = labeled_data[labeled_data['class'] != 2].copy()
hate_offensive_data.loc[:, 'category'] = hate_offensive_data['class'].replace([0, 1], 1)
hate_offensive_data = hate_offensive_data.rename(columns={'tweet': 'text'})

# Test 1 ---
# Select data for each class
hate_speech_data = labeled_data[labeled_data['class'] == 0].copy()
offensive_data = labeled_data[labeled_data['class'] == 1].copy()
neither_data = labeled_data[labeled_data['class'] == 2].copy()
sample_size = len(hate_speech_data)
offensive_sample = offensive_data.sample(n=sample_size, random_state=SEED)
neither_sample = neither_data.sample(n=sample_size, random_state=SEED)
hate_speech_data['category'] = 1
offensive_sample['category'] = 1
neither_sample['category'] = 0
sampled_data = pd.concat([hate_speech_data, offensive_sample, neither_sample], ignore_index=True)[['tweet', 'category']]
sampled_data.rename(columns={'tweet': 'text', 'category': 'label'}, inplace=True)
sampled_data['text'] = sampled_data['text'].apply(clean_text)  # clean_text is defined in the data cleaning cell above
train_data, intermediate_data = train_test_split(sampled_data, test_size=0.3, random_state=SEED)
validation_data, test_data = train_test_split(intermediate_data, test_size=0.5, random_state=SEED)
train_tokens = tokenizer(train_data['text'].tolist(), padding=True, truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors='pt')
validation_tokens = tokenizer(validation_data['text'].tolist(), padding=True, truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors='pt')
test_tokens = tokenizer(test_data['text'].tolist(), padding=True, truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors='pt')
print(f"New Train data shape: {train_data.shape}")
print(f"New Validation data shape: {validation_data.shape}")
print(f"New Test data shape: {test_data.shape}")
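
The split sizes can be worked out by hand. The sketch below assumes that `train_test_split` rounds the held-out part up when `test_size` is a fraction, and it uses the class 0 count of 1,430 quoted in the comments above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows: int, held_out_fraction: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out part up (assumption)."""
    n_held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - n_held_out, n_held_out

HATE_SPEECH_ROWS = 1430        # class 0 count from the dataset comment above
total = 3 * HATE_SPEECH_ROWS   # hate speech + offensive sample + neither sample

n_train, n_rest = split_sizes(total, 0.3)   # 70% train, 30% held out
n_val, n_test = split_sizes(n_rest, 0.5)    # held-out part split in half
print(total, n_train, n_val, n_test)        # 4290 3003 643 644
```

The 3,003 training rows agree with the `Num examples = 3003` line in the training log further down.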


#### Load our Initial Hate Speech Model

In [None]:
config = RobertaConfig()
config.num_labels = 2
roberta_base_config = {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

for key in roberta_base_config.keys():
    setattr(config, key, roberta_base_config[key])

model = RobertaForSequenceClassification(config)
map_location=torch.device('cpu')
model.load_state_dict(torch.load('/Users/marinjaprincipe/Documents/UZH/NPL/test/roberta_model.bin', map_location=map_location))
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model.eval()
model.to(map_location)

## Attack Setup
Now that we have loaded the Initial Hate Speech Model, we can attack it. We try several attacks:

- a custom attack
- the BERT-Attack recipe from textattack
- the BAE recipe from textattack
- the TextFooler recipe from textattack


#### Custom Attack

In [70]:
# Define a custom attack based on https://textattack.readthedocs.io/en/latest/api/attack.html; it is also used in the training loop
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# UntargetedClassification: an untargeted attack on classification models that attempts
# to minimize the score of the correct label until it is no longer the predicted label.
goal_function = textattack.goal_functions.UntargetedClassification(model_wrapper)

constraints = [
    RepeatModification(), # prevents the same word from being modified more than once
    StopwordModification(), # prevents stopwords (e.g., "the", "is", "and") from being modified
    WordEmbeddingDistance(min_cos_sim=0.9), # requires a cosine similarity of at least 0.9 between the embeddings of the original and the replacement word
]

transformation = textattack.transformations.word_swaps.word_swap_embedding.WordSwapEmbedding(max_candidates=50) # up to 50 replacement candidates per word
search_method = textattack.search_methods.GreedyWordSwapWIR(wir_method="delete") # rank words by the effect of deleting them, then swap greedily
custom_attack = textattack.Attack(goal_function, constraints, transformation, search_method) # build the attack (it is run in the next cell)

In [71]:
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and a checkpoint saved every 5 examples
attack_args = textattack.AttackArgs(num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
custom_attacker = textattack.Attacker(custom_attack, dataset, attack_args)
custom_attacker.attack_dataset()

Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.9
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
)

[Succeeded / Failed / Skipped / Total] 3 / 16 / 1 / 20: 100%|██████████| 20/20 [00:53<00:00,  2.67s/it]





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 3      |
| Number of failed attacks:     | 16     |
| Number of skipped attacks:    | 1      |
| Original accuracy:            | 95.0%  |
| Accuracy under attack:        | 80.0%  |
| Attack success rate:          | 15.79% |
| Average perturbed word %:     | 18.64% |
| Average num. words per input: | 9.0    |
| Avg num queries:              | 16.95  |
+-------------------------------+--------+
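
The percentages in this table follow directly from the succeeded / failed / skipped counts. The sketch below shows the arithmetic; `attack_summary` is a hypothetical helper, and it assumes that a skipped example is one the model already misclassified before the attack.

```python
def attack_summary(succeeded: int, failed: int, skipped: int) -> dict:
    """Recompute the headline percentages from the raw attack counts."""
    total = succeeded + failed + skipped
    attacked = succeeded + failed  # examples that were classified correctly before the attack
    return {
        "original_accuracy": 100 * attacked / total,
        "accuracy_under_attack": 100 * failed / total,
        "attack_success_rate": 100 * succeeded / attacked,
    }

baseline = attack_summary(succeeded=3, failed=16, skipped=1)
print({name: round(value, 2) for name, value in baseline.items()})
# {'original_accuracy': 95.0, 'accuracy_under_attack': 80.0, 'attack_success_rate': 15.79}
```

With only 20 attacked samples, each single example shifts the accuracy figures by 5 percentage points, so small differences between runs should be read with care.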





[<textattack.attack_results.failed_attack_result.FailedAttackResult at 0x325799710>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32c7812d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32c391790>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x325269590>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32c6b0150>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32d0f6bd0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x32c49bad0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32678aa50>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32c7329d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32c6f16d0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32c1d7d10>,
 <textattack.attack_results.failed_attack_result.

#### BERT-Attack from textattack

In [None]:
# Use the BERT-Attack recipe from textattack, see https://textattack.readthedocs.io/en/latest/3recipes/attack_recipes.html#bert-attack

bert_attack = textattack.attack_recipes.bert_attack_li_2020.BERTAttackLi2020.build(model_wrapper) # build the attack recipe

In [None]:
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and a checkpoint saved every 5 examples
attack_args = textattack.AttackArgs(num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
bert_attacker = textattack.Attacker(bert_attack, dataset, attack_args)
bert_attacker.attack_dataset()


#### BAE Attack from textattack

In [None]:
# Use the BAE attack recipe from textattack, see https://textattack.readthedocs.io/en/latest/3recipes/attack_recipes.html

bae_attack = textattack.attack_recipes.bae_garg_2019.BAEGarg2019.build(model_wrapper) # build the attack recipe

In [None]:
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and a checkpoint saved every 5 examples
attack_args = textattack.AttackArgs(num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
bae_attacker = textattack.Attacker(bae_attack, dataset, attack_args)
bae_attacker.attack_dataset()

#### TextFooler Attack from textattack
TextFooler is the attack introduced in "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment" (Jin et al., 2019).

In [None]:
# Use the TextFooler attack recipe from textattack, see https://textattack.readthedocs.io/en/latest/3recipes/attack_recipes.html

textFooler_attack = textattack.attack_recipes.textfooler_jin_2019.TextFoolerJin2019.build(model_wrapper) # build the attack recipe

In [None]:
# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and a checkpoint saved every 5 examples
attack_args = textattack.AttackArgs(num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
textFooler_attacker = textattack.Attacker(textFooler_attack, dataset, attack_args)
textFooler_attacker.attack_dataset()

## Train Model on the Attacked Data
We now use the adversarial examples produced by the attacks to train our model again. For the training we use the trainer from the textattack library.
First we set up the evaluation and training datasets as well as the training arguments.

In [74]:
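# Define training based on https://textattack.readthedocs.io/en/latest/api/trainer.html
# Note: pretrained_roberta_model is loaded here, but the trainer cells below are given model_wrapper.
pretrained_roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)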

temp = list(validation_data.itertuples(index=False, name=None))
eval_dataset = textattack.datasets.Dataset(temp)
print(temp[:10])
print(temp[1][1], type(temp[1][1]))

temp_train = list(train_data.itertuples(index=False, name=None))
train_dataset = textattack.datasets.Dataset(temp_train)
print(temp_train[:10])
training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=200,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[('rt lameassnerd braxton curti hoe ass nigga wen see ima smack like da lil bitch', 1), ('frankiejgrande ew queer white thirsty bitch', 1), ('decodnlyfe lupefiasco larellj another black man anything monkey always monkey chicago idiot', 1), ('rtnba drakes new shoes released nikejordan yes theres glitter shoes dudes fag', 1), ('rt thadisreal ladies nigga always wan na go party every weekend basically side hoe shopping', 1), ('fuck wit us tweakin hoe', 1), ('like really doubt even plays already got lot big man rotation committed bosh bird mcbob amp shawne', 0), ('niggah niggah niggah dont believe wacthh', 1), ('rt harmonlauren jennas faggot', 1), ('rt dignifiedpurity ive work chuckling time derekisnormal said revealing age keeps pussy fresh', 1)]
1 <class 'int'>
[('species birds reported chesterfield great backyard bird count many area', 0), ('butterfliesblue heard green tea makes lose weight imma coon explains watermelon', 0), ('rt idntwearcondoms u acted like hoe broke im wrong thinking
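
The training arguments above imply a schedule that can be sketched without running anything. The snippet below is only our reading of those arguments, not TextAttack's code: it assumes that the first `num_clean_epochs` epochs use clean data only, that new adversarial examples are generated every `attack_epoch_interval` epochs afterwards (assumed to default to 1), and that gradient accumulation multiplies the per-device batch size. `epoch_plan` is a hypothetical helper.

```python
def epoch_plan(num_epochs: int, num_clean_epochs: int, attack_epoch_interval: int = 1):
    """Label each epoch by the data it would train on under the assumptions above."""
    plan = []
    for epoch in range(1, num_epochs + 1):
        if epoch <= num_clean_epochs:
            plan.append((epoch, "clean"))
        elif (epoch - num_clean_epochs - 1) % attack_epoch_interval == 0:
            plan.append((epoch, "clean + newly generated adversarial examples"))
        else:
            plan.append((epoch, "clean + previously generated adversarial examples"))
    return plan

for epoch, data in epoch_plan(num_epochs=3, num_clean_epochs=1):
    print(epoch, data)

# Effective batch size per optimizer step on one device (assumption: sizes multiply).
effective_batch_size = 8 * 4  # per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)   # 32
```

Under these assumptions, epoch 1 is a clean warm-up and epochs 2 and 3 each add up to `num_train_adv_examples=200` adversarial examples to the training data.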

#### Run Custom Attack Trainer 

In [78]:
custom_attack_trainer = Trainer(
    model_wrapper,
    "classification",
    custom_attack,
    train_dataset,
    eval_dataset,
    training_args
)
custom_attack_trainer.train()

custom_attack_trainer.evaluate()

textattack: Writing logs to ./outputs/2023-10-19-21-58-08-248160/train_log.txt.
textattack: Wrote original training args to ./outputs/2023-10-19-21-58-08-248160/training_args.json.
textattack: ***** Running training *****
textattack:   Num examples = 3003
textattack:   Num epochs = 1
textattack:   Num clean epochs = 0
textattack:   Instantaneous batch size per device = 1
textattack:   Total train batch size (w. parallel, distributed & accumulation) = 1
textattack:   Gradient accumulation steps = 1
textattack:   Total optimization steps = 3023
textattack: Epoch 1
textattack: Attacking model to generate new adversarial training set...
[Succeeded / Failed / Skipped / Total] 20 / 103 / 6 / 129: 100%|██████████| 20/20 [05:31<00:00, 16.58s/it]
textattack: Total number of attack results: 129
textattack: Attack success rate: 16.26% [20 / 123]






Loss 0.54983: 100%|██████████| 3023/3023 [42:11<00:00,  1.19it/s] 
textattack: Train accuracy: 72.44%
textattack: Eval accuracy: 86.31%
textattack: Best score found. Saved model to ./outputs/2023-10-19-21-58-08-248160/best_model/
textattack: Wrote README to ./outputs/2023-10-19-21-58-08-248160/README.md.
textattack: Eval accuracy: 86.31%


0.8631415241057543

#### Run BERT Attack Trainer

In [None]:
bert_attack_trainer = textattack.Trainer(
    model_wrapper,
    "classification",
    bert_attack,
    train_dataset,
    eval_dataset,
    training_args
)
bert_attack_trainer.train()

## Evaluate the Adversarially Trained Models

#### Custom Attack Trainer Evaluation

In [79]:
custom_attack_trainer.evaluate()

textattack: Eval accuracy: 86.31%


0.8631415241057543

In [80]:
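# Load the adversarially trained weights saved by the trainer
model = RobertaForSequenceClassification(config)
map_location = torch.device('cpu')
model.load_state_dict(torch.load('/Users/marinjaprincipe/Documents/UZH/NPL/test/outputs/2023-10-19-21-58-08-248160/best_model/pytorch_model.bin', map_location=map_location))
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model.eval()
model.to(map_location)

# Re-wrap the reloaded model and rebuild the attack; otherwise custom_attack
# would keep querying the model object wrapped earlier.
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)
goal_function = textattack.goal_functions.UntargetedClassification(model_wrapper)
custom_attack = textattack.Attack(goal_function, constraints, transformation, search_method)

# Run attack with defined dataset
temp = list(validation_data.itertuples(index=False, name=None))
dataset = textattack.datasets.Dataset(temp)

# Attack 20 samples with CSV logging and a checkpoint saved every 5 examples
attack_args = textattack.AttackArgs(num_examples=20, log_to_csv="log.csv", checkpoint_interval=5, checkpoint_dir="checkpoints", disable_stdout=True)
custom_attacker = textattack.Attacker(custom_attack, dataset, attack_args)
custom_attacker.attack_dataset()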

textattack: Logging to CSV at path log.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.9
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): RepeatModification
    (2): StopwordModification
  (is_black_box):  True
) 



[Succeeded / Failed / Skipped / Total] 1 / 3 / 1 / 5:  25%|██▌       | 5/20 [00:12<00:37,  2.50s/it]textattack: Saving checkpoint under "checkpoints/1697750815263.ta.chkpt" at 2023-10-19 23:26:55 after 5 attacks.







[Succeeded / Failed / Skipped / Total] 2 / 6 / 2 / 10:  50%|█████     | 10/20 [00:21<00:21,  2.15s/it]textattack: Saving checkpoint under "checkpoints/1697750824246.ta.chkpt" at 2023-10-19 23:27:04 after 10 attacks.







[Succeeded / Failed / Skipped / Total] 3 / 9 / 3 / 15:  75%|███████▌  | 15/20 [00:30<00:10,  2.04s/it]textattack: Saving checkpoint under "checkpoints/1697750833296.ta.chkpt" at 2023-10-19 23:27:13 after 15 attacks.







[Succeeded / Failed / Skipped / Total] 4 / 13 / 3 / 20: 100%|██████████| 20/20 [00:41<00:00,  2.07s/it]textattack: Saving checkpoint under "checkpoints/1697750844143.ta.chkpt" at 2023-10-19 23:27:24 after 20 attacks.
[Succeeded / Failed / Skipped / Total] 4 / 13 / 3 / 20: 100%|██████████| 20/20 [00:41<00:00,  2.07s/it]





+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 4      |
| Number of failed attacks:     | 13     |
| Number of skipped attacks:    | 3      |
| Original accuracy:            | 85.0%  |
| Accuracy under attack:        | 65.0%  |
| Attack success rate:          | 23.53% |
| Average perturbed word %:     | 19.79% |
| Average num. words per input: | 9.0    |
| Avg num queries:              | 15.18  |
+-------------------------------+--------+





[<textattack.attack_results.failed_attack_result.FailedAttackResult at 0x31f22a010>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32d266150>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x32b907b90>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x32d00fbd0>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x325125190>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x31d3dfd10>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32ae50350>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32d01c550>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x32c115210>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x32d161990>,
 <textattack.attack_results.failed_attack_result.FailedAttackResult at 0x32644d910>,
 <textattack.attack_results.failed_attack_res

#### BERT Attack Trainer Evaluation

In [None]:
bert_attack_trainer.evaluate()