# Adversarial Evasion Attacks on BERT and NLP using TextAttack 

This notebook gives a short example of how to stage an NLP evasion attack against BERT models with TextAttack, using two tasks: 1) sentiment analysis and 2) natural language inference (NLI). The workflow is as follows:
- Setup dependencies
- Load the target model and tokenizer
- Create a TextAttack model wrapper
- Create an attack for the wrapper using the TextFooler TextAttack recipe
- Test the attack on a custom text/pair of texts
- Test the attack on a random sample from the test dataset
- Show how to generate and save generated samples

In [None]:
import transformers
import random
from textattack.models.wrappers import HuggingFaceModelWrapper, ModelWrapper
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset

### Example 1 - Attack on sentiment analysis with IMDB

##### Load the model and tokenizer and wrap them with a TextAttack wrapper

In [None]:
# Load the target pre-trained model for sentiment analysis and a tokeniser
imdb_model = transformers.AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-imdb")
imdb_tokenizer = transformers.AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb")
imdb_model_wrapper = HuggingFaceModelWrapper(imdb_model, imdb_tokenizer)

##### Sample custom TextAttack wrapper for non-Hugging Face models (not used, given as an example)

In [None]:
# Custom model wrapper built on TextAttack's ModelWrapper.
# Useful if you are testing a non-Hugging Face model.
# Not used in this notebook; shown in case you use your own local models.
import torch

class CustomModelWrapper(ModelWrapper):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def __call__(self, text_input_list):
        # The model loaded above is a PyTorch model, so request PyTorch tensors
        inputs = self.tokenizer(text_input_list, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        # One row of logits per input text
        return outputs.logits.numpy()

model_wrapper2 = CustomModelWrapper(imdb_model, imdb_tokenizer)
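
The contract a wrapper has to satisfy is small: `__call__` receives a list of strings and returns one row of class scores per string. The sketch below illustrates that contract with a made-up keyword-counting model and no external dependencies. `KeywordSentimentModel` and `PlainModelWrapper` are hypothetical names, not TextAttack classes; in the notebook itself the wrapper would subclass `ModelWrapper` as in the cell above.

```python
class KeywordSentimentModel:
    """Hypothetical stand-in for a local model: scores text by keyword counts."""

    POSITIVE = {"enjoyed", "great", "good", "loved"}
    NEGATIVE = {"boring", "bad", "awful", "hated"}

    def predict_scores(self, text):
        # Lowercase, split on whitespace and strip trailing punctuation
        words = [word.strip(".,!?") for word in text.lower().split()]
        positive = sum(word in self.POSITIVE for word in words)
        negative = sum(word in self.NEGATIVE for word in words)
        # Score order is [negative, positive], so index 1 means "positive"
        return [float(negative), float(positive)]


class PlainModelWrapper:
    """Duck-typed wrapper: list of strings in, one score row per string out."""

    def __init__(self, model):
        self.model = model

    def __call__(self, text_input_list):
        return [self.model.predict_scores(text) for text in text_input_list]


wrapper = PlainModelWrapper(KeywordSentimentModel())
scores = wrapper(["I really enjoyed the movie.", "A boring, awful film."])
print(scores)
```

The important design point is that the wrapper is batch-oriented: it always takes a list, even when only one text is scored.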


##### Setup and test the attack

In [None]:
# Choose the attack method
imdb_attack = TextFoolerJin2019.build(imdb_model_wrapper)
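
TextFooler works, broadly, by ranking the words of the input by their importance to the prediction and then swapping them for similar words until the predicted label changes. The following is a heavily simplified sketch of that idea, not the recipe's implementation: the classifier is a toy word-weight sum, the synonym table is hand-written, and it only handles attacks on a positive label. All names in it (`SYNONYMS`, `POSITIVE_WEIGHTS`, `greedy_word_swap`) are hypothetical.

```python
# Hypothetical synonym table; the real recipe derives candidates from word embeddings
SYNONYMS = {"enjoyed": ["liked", "tolerated"], "great": ["fine", "okay"]}
# Toy classifier weights: how strongly each word signals a positive review
POSITIVE_WEIGHTS = {"enjoyed": 2.0, "liked": 0.4, "great": 2.0, "fine": 0.3}


def positive_score(words):
    """Toy classifier score: sum of per-word positive weights."""
    return sum(POSITIVE_WEIGHTS.get(word, 0.0) for word in words)


def predict(words, threshold=1.0):
    """Return 1 (positive) if the score reaches the threshold, else 0."""
    return 1 if positive_score(words) >= threshold else 0


def greedy_word_swap(text, true_label=1):
    """Swap important words for synonyms until the toy prediction flips."""
    words = text.lower().split()
    base = positive_score(words)
    # Rank positions by how much deleting the word lowers the score
    ranked = sorted(
        range(len(words)),
        key=lambda i: base - positive_score(words[:i] + words[i + 1:]),
        reverse=True,
    )
    for index in ranked:
        candidates = SYNONYMS.get(words[index], [])
        if not candidates:
            continue
        # Keep the candidate that lowers the positive score the most
        words[index] = min(
            candidates,
            key=lambda c: positive_score(words[:index] + [c] + words[index + 1:]),
        )
        if predict(words) != true_label:
            return " ".join(words)  # prediction flipped: attack succeeded
    return None  # no adversarial example found with this synonym table


adversarial = greedy_word_swap("i enjoyed this great movie")
print(adversarial)
```

The real recipe also applies constraints that this sketch omits, such as keeping the perturbed sentence semantically close to the original.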

In [None]:
# Test the attack with your own simple text
input_text = "I really enjoyed the new movie that came out last month."
label = 1  # Positive: the correct classification for this text
attack_result = imdb_attack.attack(input_text, label)
# Printing the attack result shows both the original and the adversarial classification
print(attack_result)
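
Beyond reading the printed result, it can help to quantify how much of the input was perturbed. The sketch below compares two sentences word by word and reports which positions changed. It assumes a word-swap attack that keeps the word count unchanged, and the adversarial sentence is invented for illustration rather than produced by the attack above; `changed_words` is a hypothetical helper.

```python
def changed_words(original, adversarial):
    """Return (index, old, new) for each position where the word differs."""
    original_words = original.split()
    adversarial_words = adversarial.split()
    # Word-swap attacks keep the word count; this sketch relies on that
    if len(original_words) != len(adversarial_words):
        raise ValueError("texts must have the same number of words")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(original_words, adversarial_words))
        if old != new
    ]


original = "I really enjoyed the new movie that came out last month."
# Made-up adversarial text, used only to exercise the helper
adversarial = "I really tolerated the new film that came out last month."

swaps = changed_words(original, adversarial)
rate = len(swaps) / len(original.split())
print(swaps)
print(f"{rate:.0%} of words changed")
```

A low change rate together with a flipped label is what makes an adversarial example hard to spot.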

##### Test the attack with the IMDB and GLUE/SST-2 datasets

In [None]:
# Helper that attacks one random entry from a dataset
from collections import OrderedDict

def test_random_dataset_entry(attack, dataset, text_pairs=False):
    random_entry = random.choice(dataset)
    if text_pairs:
        # Rows are dicts with 'premise', 'hypothesis' and 'label' keys;
        # pass the pair as an OrderedDict so it is handled as two inputs
        input_text = OrderedDict([
            ("premise", random_entry['premise']),
            ("hypothesis", random_entry['hypothesis']),
        ])
        label = random_entry['label']
    else:
        # Entries are (OrderedDict of input columns, label) tuples
        input_text = list(random_entry[0].values())[0]
        label = random_entry[1]
    attack_result = attack.attack(input_text, label)
    return attack_result
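
Because the helper picks its entry at random, two runs of the notebook will usually attack different texts. If repeatable results are wanted, one option is a seeded generator, sketched below with a toy dataset. `pick_random_entry` is a hypothetical helper and the toy entries are invented.

```python
import random


def pick_random_entry(dataset, seed=None):
    """Pick one entry; a fixed seed makes the choice repeatable."""
    # A local generator leaves the global random state untouched
    rng = random.Random(seed)
    return rng.choice(dataset)


# Toy (text, label) entries standing in for a real dataset
toy_dataset = [("a great film", 1), ("a dull film", 0), ("an okay film", 1)]

first = pick_random_entry(toy_dataset, seed=42)
second = pick_random_entry(toy_dataset, seed=42)
print(first == second)
```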


In [None]:
# Test the attack with random test entries from the imdb dataset
imdb_dataset = HuggingFaceDataset("imdb", split="test")
attack_result = test_random_dataset_entry(imdb_attack, imdb_dataset)
print(attack_result)


In [None]:

# Optional: attack the first five entries of the IMDB dataset in order
#dataset_iterator = iter(imdb_dataset)
#for example_index in range(5):
#    example, label = next(dataset_iterator)
#    attack_result = imdb_attack.attack(example, label)
#    print(attack_result)


In [None]:
# Test the attack with a non-IMDB dataset: the SST-2 subset of GLUE.
# The GLUE test split has no public labels, so use the validation split.
glue_dataset = HuggingFaceDataset('glue', 'sst2', split='validation')
attack_result = test_random_dataset_entry(imdb_attack, glue_dataset)
print(attack_result)


### Example 2 - Attack on language inference using SNLI

This example attacks the *bert-base-uncased-snli* model, which is trained specifically for Natural Language Inference (NLI), using the SNLI (Stanford Natural Language Inference) dataset. The dataset is well suited to the model because it corresponds directly to the model's training data.

#### SNLI Dataset Overview 

The SNLI dataset is a collection of sentence pairs annotated with one of three labels: entailment, contradiction, or neutral. These labels represent the relationship between a "premise" and a "hypothesis" sentence:

- *Entailment*: The hypothesis is a true statement given the premise.
- *Contradiction*: The hypothesis is a false statement given the premise.
- *Neutral*: The truth of the hypothesis is undetermined given the premise.
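
The integer assigned to each of these labels is a convention that can differ between a dataset and a model, so it is worth mapping labels by name before passing them to an attack. The sketch below shows one way to do that. Both label orders are assumptions made for illustration (check the dataset card and the model card for the real ones), and `dataset_to_model_label` is a hypothetical helper.

```python
# Assumed label order of the dataset; verify against the dataset card
DATASET_LABELS = ["entailment", "neutral", "contradiction"]
# Assumed label order of the model; verify against the model card
MODEL_LABELS = ["contradiction", "neutral", "entailment"]


def dataset_to_model_label(dataset_label):
    """Translate a dataset label id into the model's id via the label name."""
    if dataset_label < 0:
        # Some datasets mark pairs without a gold label with a negative id
        return None
    name = DATASET_LABELS[dataset_label]
    return MODEL_LABELS.index(name)


for dataset_label in (0, 1, 2, -1):
    print(dataset_label, "->", dataset_to_model_label(dataset_label))
```

Mapping through the label name rather than hard-coding integers keeps the conversion readable and easy to correct if either order turns out to be different.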

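Both the model and the dataset work with a pair of input texts (premise and hypothesis). Pass the pair to TextAttack as an `OrderedDict` with `premise` and `hypothesis` keys so that the two texts are handled as separate inputs. Joining them into a single string with BERT's `[SEP]` token, as in `premise [SEP] hypothesis`, is an alternative, but the `OrderedDict` form is the one used in this notebook.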
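
As a minimal sketch of the two input forms, the helpers below build the ordered pair and, for comparison, a single separator-joined string. `make_nli_input` and `join_with_separator` are hypothetical names, and the `[SEP]` separator is an assumption based on BERT's tokenizer convention.

```python
from collections import OrderedDict


def make_nli_input(premise, hypothesis):
    """Build the ordered premise/hypothesis pair used as an NLI input."""
    return OrderedDict([("premise", premise), ("hypothesis", hypothesis)])


def join_with_separator(pair, separator="[SEP]"):
    """Flatten the pair into one string, premise first."""
    return f" {separator} ".join(pair.values())


pair = make_nli_input(
    "A man inspects the uniform of a figure in some East Asian country.",
    "The man is sleeping",
)
print(list(pair.keys()))
print(join_with_separator(pair))
```

Keeping the texts in an `OrderedDict` preserves which text is the premise and which is the hypothesis, which the flattened string does not make explicit.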


##### Model, Tokenizer, and wrapper setup

In [None]:
# Load model and tokenizer
slni_model = transformers.AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-snli")
slni_tokenizer = transformers.AutoTokenizer.from_pretrained("textattack/bert-base-uncased-snli")
# Wrap the model for TextAttack
slni_model_wrapper = HuggingFaceModelWrapper(slni_model, slni_tokenizer)

##### Setup and test the attack

In [None]:
from collections import OrderedDict

# Build the attack
slni_attack = TextFoolerJin2019.build(slni_model_wrapper)

# Test with a single custom premise/hypothesis pair
input_text_pair = OrderedDict([
    ("premise", "A man inspects the uniform of a figure in some East Asian country."),
    ("hypothesis", "The man is sleeping")
])
# Assumed index of the "contradiction" class; label order differs between
# models and datasets, so check it against the model you attack
label = 0
attack_result = slni_attack.attack(input_text_pair, label)

print(attack_result)


In [None]:
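from datasets import load_dataset
# Drop pairs without a gold label (label == -1) before sampling.
# Note: the dataset's label ids may not match the model's output order.
slni_dataset = load_dataset("snli", split='test').filter(lambda row: row['label'] != -1)
attack_result = test_random_dataset_entry(slni_attack, slni_dataset, text_pairs=True)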
print(attack_result)
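
The last workflow step is saving generated samples. One simple option, sketched below with the standard library only, is to collect the original and adversarial texts as rows and write them to a CSV file. The field names, the `save_samples` and `load_samples` helpers, the output path, and the example row are all hypothetical; in practice the row values would be taken from the attack results.

```python
import csv
import os
import tempfile

# Hypothetical column layout for saved samples
FIELDNAMES = ["original_text", "perturbed_text", "original_label", "perturbed_label"]


def save_samples(rows, path):
    """Write one CSV line per adversarial sample, with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def load_samples(path):
    """Read the samples back as a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Made-up row, used only to exercise the helpers
rows = [
    {
        "original_text": "I really enjoyed the new movie.",
        "perturbed_text": "I really tolerated the new movie.",
        "original_label": 1,
        "perturbed_label": 0,
    }
]

output_path = os.path.join(tempfile.gettempdir(), "adversarial_samples.csv")
save_samples(rows, output_path)
loaded = load_samples(output_path)
print(loaded[0]["perturbed_text"])
```

Saved samples can then be reused, for example to test whether another model is fooled by the same texts.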


