# Adversarial NLP: Text-Based Attacks

In [114]:
# Setup and Imports
import torch
import numpy as np
import random
import warnings
warnings.filterwarnings('ignore')

# Detect device (supports CUDA, Apple Silicon MPS, and CPU)
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✓ Using CUDA GPU: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
    print("✓ Using Apple Silicon GPU (MPS)")
else:
    device = torch.device('cpu')
    print("ℹ Using CPU")

print(f"Device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

✓ Using Apple Silicon GPU (MPS)
Device: mps


In [115]:
# Install required packages for this notebook
import subprocess
import sys

def install_package(package):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

# Check and install nltk
try:
    import nltk
    print("✓ nltk already installed")
except ImportError:
    print("Installing nltk...")
    install_package('nltk')
    import nltk
    print("✓ nltk installed")

# Download WordNet data
try:
    nltk.data.find('corpora/wordnet')
    print("✓ WordNet data already downloaded")
except LookupError:
    print("Downloading WordNet data...")
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
    print("✓ WordNet data downloaded")

# Check spacy (optional - only needed for NER examples)
try:
    import spacy
    print("✓ spacy already installed")
except ImportError:
    print("Note: spacy not installed (optional - only needed for NER attack examples)")
    print("To install: pip install spacy && python -m spacy download en_core_web_sm")

# Check textattack (optional)
try:
    import textattack
    print("✓ textattack already installed")
except ImportError:
    print("Note: textattack not installed (optional - for framework examples)")
    print("To install: pip install textattack")

print("\n✓ Core packages ready!")

✓ nltk already installed
Downloading WordNet data...
✓ WordNet data downloaded
Note: spacy not installed (optional - only needed for NER attack examples)
To install: pip install spacy && python -m spacy download en_core_web_sm
✓ textattack already installed

✓ Core packages ready!


## Overview


Adversarial NLP focuses on crafting adversarial examples for natural language processing models. Unlike image attacks, text attacks face unique challenges: a discrete input space, semantic constraints, and human readability requirements. This notebook covers character-, word-, and sentence-level attacks, gradient-based and importance-guided methods, defenses, and evaluation metrics.

## Learning Objectives


- Understand unique challenges of adversarial NLP
- Implement character-level attacks
- Execute word-level substitution attacks
- Perform sentence-level perturbations
- Use semantic-preserving transformations
- Evaluate attack success and imperceptibility
- Design robust NLP defenses

## Why NLP Attacks Are Different


### Challenges Unique to Text

**1. Discrete Input Space**:

In [116]:
# Example: Discrete vs Continuous Input Space
import numpy as np

# Images: Continuous pixel values [0, 255]
image = np.zeros((224, 224, 3))  # Create example image
image[0,0] = 127.5  # Can use gradients - continuous values
print(f"Image pixel value: {image[0,0]}")

# Text: Discrete tokens
text = "hello"  # Can't do: text[0] = "h" + 0.001
# Must substitute entire tokens
print(f"Text: {text}")
print("Cannot add 0.001 to 'h' - must replace entire character")

Image pixel value: [127.5 127.5 127.5]
Text: hello
Cannot add 0.001 to 'h' - must replace entire character


**2. Semantic Constraints**:

In [117]:
# Example: Semantic Constraints in Text

# Image: Small pixel changes often imperceptible
# Text: Single character change can destroy meaning

original = "I love this movie"
# Small change with accent - noticeable
perturbed1 = "I lové this movie"
# Word substitution - changes meaning completely
perturbed2 = "I hate this movie"

print(f"Original:    {original}")
print(f"Perturbed 1: {perturbed1} (noticeable)")
print(f"Perturbed 2: {perturbed2} (meaning changed)")

Original:    I love this movie
Perturbed 1: I lové this movie (noticeable)
Perturbed 2: I hate this movie (meaning changed)


**3. Human Readability**:

In [118]:
# Example: Human Readability Requirement

# Attack must fool model AND remain readable to humans
bad_attack = "I l0v3 th!$ m0v!3"  # Obvious attack - leetspeak
good_attack = "I adore this film"  # Natural, semantic-preserving

print("Bad attack (obvious):")
print(f"  {bad_attack}")
print("\nGood attack (natural):")
print(f"  {good_attack}")

Bad attack (obvious):
  I l0v3 th!$ m0v!3

Good attack (natural):
  I adore this film


## Attack Taxonomy




```
Adversarial NLP Attacks
├── Character-Level
│   ├── Character Insertion
│   ├── Character Deletion
│   ├── Character Substitution
│   ├── Visual Similarity (homoglyphs)
│   └── Keyboard Typos
├── Word-Level
│   ├── Synonym Substitution
│   ├── Word Insertion
│   ├── Word Deletion
│   ├── Word Reordering
│   └── Embedding-Based Substitution
├── Sentence-Level
│   ├── Paraphrasing
│   ├── Back-Translation
│   ├── Sentence Reordering
│   └── Style Transfer
└── Semantic-Preserving
    ├── Named Entity Substitution
    ├── Contextual Substitution
    ├── Grammar-Preserving Changes
    └── Meaning-Preserving Transformations
```

### Working Example: Homoglyph Attack

Let's test the homoglyph attack on a real sentiment model:

In [119]:
# Working Example: Homoglyph Attack
from transformers import pipeline
import random

# Load sentiment model
print("Loading sentiment model...")
sentiment = pipeline('sentiment-analysis', 
                    model='distilbert-base-uncased-finetuned-sst-2-english')

def homoglyph_attack(text):
    """Replace characters with visually similar Unicode characters"""
    homoglyphs = {
        'a': ['а', 'ɑ'],  # Cyrillic a, Latin alpha
        'e': ['е', 'ė'],  # Cyrillic e, Latin e with dot
        'o': ['о', 'ο'],  # Cyrillic o, Greek omicron
        'i': ['і', 'ı'],  # Cyrillic i, dotless i
        'c': ['с'],       # Cyrillic s
    }
    
    adversarial_text = text
    for char, replacements in homoglyphs.items():
        if char in adversarial_text.lower():
            replacement = random.choice(replacements)
            # Replace first occurrence
            adversarial_text = adversarial_text.replace(char, replacement, 1)
    
    return adversarial_text

# Test the attack
original = "This is a great product"
adversarial = homoglyph_attack(original)

print(f"\nOriginal:    {original}")
orig_result = sentiment(original)[0]
print(f"  Prediction: {orig_result['label']} ({orig_result['score']:.3f})")

print(f"\nAdversarial: {adversarial}")
adv_result = sentiment(adversarial)[0]
print(f"  Prediction: {adv_result['label']} ({adv_result['score']:.3f})")

print(f"\nAttack success: {orig_result['label'] != adv_result['label']}")
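# Each score is the confidence in that run's own predicted label, so this
# compares top-label confidences rather than the change in P(POSITIVE)
print(f"Confidence change: {abs(orig_result['score'] - adv_result['score']):.3f}")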

Loading sentiment model...


Device set to use mps:0



Original:    This is a great product
  Prediction: POSITIVE (1.000)

Adversarial: Thіs is а grеat prοduсt
  Prediction: NEGATIVE (0.981)

Attack success: True
Confidence change: 0.019
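
The substituted characters are hard to spot by eye. As a minimal sketch, the standard-library `unicodedata` module can list every non-ASCII character together with its code point and Unicode name. The helper name `describe_non_ascii` is made up for this illustration, and the sample string writes two Cyrillic letters as escapes so they stay visible in the source.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, char, code point, Unicode name) for each non-ASCII character."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            name = unicodedata.name(char, "UNKNOWN")
            report.append((index, char, f"U+{ord(char):04X}", name))
    return report

# \u0456 and \u0430 are Cyrillic letters that look like Latin "i" and "a"
sample = "Th\u0456s is \u0430 great product"
for entry in describe_non_ascii(sample):
    print(entry)
```

Running a check like this on the adversarial string above is a quick way to confirm which positions were actually changed.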


### Working Example: Synonym Substitution Attack

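Let's implement a simple synonym substitution attack. This naive version takes an arbitrary WordNet lemma for each target word without checking word sense or part of speech, so the substitutions are not guaranteed to preserve meaning: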

In [120]:
# Working Example: Synonym Substitution Attack
from nltk.corpus import wordnet

def get_synonyms(word):
    """Get synonyms for a word using WordNet"""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ')
            if synonym.lower() != word.lower():
                synonyms.add(synonym)
    return list(synonyms)

def simple_synonym_attack(text, target_words):
    """Replace target words with synonyms"""
    words = text.split()
    adversarial_words = words.copy()
    
    for i, word in enumerate(words):
        if word.lower() in target_words:
            synonyms = get_synonyms(word.lower())
            if synonyms:
                adversarial_words[i] = synonyms[0]
    
    return ' '.join(adversarial_words)

# Test the attack
original = "This movie is great and entertaining"
adversarial = simple_synonym_attack(original, ['great', 'entertaining'])

print(f"Original:    {original}")
orig_result = sentiment(original)[0]
print(f"  Prediction: {orig_result['label']} ({orig_result['score']:.3f})")

print(f"\nAdversarial: {adversarial}")
adv_result = sentiment(adversarial)[0]
print(f"  Prediction: {adv_result['label']} ({adv_result['score']:.3f})")

print(f"\nSynonyms used:")
print(f"  great → {adversarial.split()[3]}")
print(f"  entertaining → {adversarial.split()[5]}")

Original:    This movie is great and entertaining
  Prediction: POSITIVE (1.000)

Adversarial: This movie is majuscule and nurse
  Prediction: NEGATIVE (0.994)

Synonyms used:
  great → majuscule
  entertaining → nurse


## Character-Level Attacks


### Attack 1: Character Substitution

**Concept**: Replace characters with visually similar ones (homoglyphs)

In [121]:
def homoglyph_attack(text):
    """
    Replace characters with visually similar Unicode characters
    """
    homoglyphs = {
        'a': ['а', 'ɑ', 'α'],  # Cyrillic a, Latin alpha, Greek alpha
        'e': ['е', 'ė', 'ē'],  # Cyrillic e, Latin e with dot above, Latin e with macron
        'o': ['о', 'ο', '0'],  # Cyrillic o, Greek omicron, zero
        'i': ['і', 'ı', '1'],  # Cyrillic i, dotless i, one
        'c': ['с', 'ϲ'],       # Cyrillic s, Greek lunate sigma
        'p': ['р', 'ρ'],       # Cyrillic r, Greek rho
        'x': ['х', 'χ'],       # Cyrillic h, Greek chi
    }
    
    adversarial_text = text
    for char, replacements in homoglyphs.items():
        if char in adversarial_text:
            # Replace with random homoglyph
            replacement = random.choice(replacements)
            adversarial_text = adversarial_text.replace(char, replacement, 1)
    
    return adversarial_text

# Example
original = "This is a great product"
adversarial = homoglyph_attack(original)
print(f"Original:    {original}")
print(f"Adversarial: {adversarial}")
# The output depends on the random choices, e.g. Cyrillic, Greek, or digit look-alikes

Original:    This is a great product
Adversarial: Th1s is а grēat ρrоduсt


### Attack 2: Character Insertion/Deletion

In [122]:
def char_insertion_attack(text, model, target_class, num_insertions=3):
    """
    Greedily insert characters to raise the model's score for target_class.

    Assumes a hypothetical model.predict_proba(text) that returns
    per-class probabilities indexable by class id.
    """
    import string
    
    best_text = text
    best_score = model.predict_proba(text)[target_class]
    
    for _ in range(num_insertions):
        # Search from the best text so far so that insertions accumulate
        current_text = best_text
        # Try inserting each character at each position (including the end)
        for pos in range(len(current_text) + 1):
            for char in string.ascii_lowercase + ' ':
                candidate = current_text[:pos] + char + current_text[pos:]
                score = model.predict_proba(candidate)[target_class]
                
                if score > best_score:
                    best_text = candidate
                    best_score = score
    
    return best_text

def char_deletion_attack(text, model, target_class):
    """
    Delete the single character that most raises the score for target_class
    """
    best_text = text
    best_score = model.predict_proba(text)[target_class]
    
    for i in range(len(text)):
        candidate = text[:i] + text[i+1:]
        score = model.predict_proba(candidate)[target_class]
        
        if score > best_score:
            best_text = candidate
            best_score = score
    
    return best_text
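
To make the greedy search concrete, here is a small self-contained sketch that runs character deletion against a toy classifier. `ToyKeywordModel` and `greedy_char_deletion` are hypothetical names introduced only for this illustration; the toy model counts sentiment keywords and is an assumption standing in for a real classifier with a `predict_proba` method.

```python
import math

class ToyKeywordModel:
    """Hypothetical stand-in classifier: class 0 = negative, class 1 = positive."""
    POSITIVE = {"great", "good", "love"}
    NEGATIVE = {"bad", "awful", "hate"}

    def predict_proba(self, text):
        words = text.lower().split()
        score = sum(w in self.POSITIVE for w in words) - sum(w in self.NEGATIVE for w in words)
        p_pos = 1.0 / (1.0 + math.exp(-2.0 * score))
        return [1.0 - p_pos, p_pos]

def greedy_char_deletion(text, model, target_class, max_deletions=2):
    """Repeatedly delete the one character that most raises the target-class score."""
    best_text = text
    best_score = model.predict_proba(text)[target_class]
    for _ in range(max_deletions):
        current = best_text
        improved = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            score = model.predict_proba(candidate)[target_class]
            if score > best_score:
                best_text, best_score, improved = candidate, score, True
        if not improved:
            break  # no single deletion helps any further
    return best_text, best_score

toy_model = ToyKeywordModel()
adv_text, adv_score = greedy_char_deletion("great movie", toy_model, target_class=0)
print(adv_text, adv_score)
```

Each pass queries the model once per character, so the cost grows with text length times the number of edits. Against a real model, that query budget is usually the limiting factor.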

### Attack 3: Keyboard Typos

In [123]:
def keyboard_typo_attack(text):
    """
    Simulate realistic keyboard typos
    """
    # QWERTY keyboard adjacency
    keyboard_neighbors = {
        'a': ['q', 'w', 's', 'z'],
        'b': ['v', 'g', 'h', 'n'],
        'c': ['x', 'd', 'f', 'v'],
        # ... complete mapping
    }
    
    words = text.split()
    adversarial_words = []
    
    for word in words:
        if random.random() < 0.3:  # 30% chance of typo
            pos = random.randint(0, len(word)-1)
            char = word[pos]
            if char in keyboard_neighbors:
                typo_char = random.choice(keyboard_neighbors[char])
                word = word[:pos] + typo_char + word[pos+1:]
        adversarial_words.append(word)
    
    return ' '.join(adversarial_words)
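
The adjacency map above is left partial. One possible way to build a fuller map, sketched here under the assumption that only the three letter rows matter and that each row is shifted about half a key to the right of the row above it, is to derive neighbors from row and column positions. `build_qwerty_neighbors` is a hypothetical helper and is not part of the original code.

```python
def build_qwerty_neighbors():
    """Derive an approximate adjacency map from the three QWERTY letter rows."""
    rows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    neighbors = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            # Same row: keys immediately to the left and right
            for dc in (-1, 1):
                if 0 <= c + dc < len(row):
                    adjacent.add(row[c + dc])
            # Staggered rows: a key touches columns c and c+1 in the row above,
            # and columns c-1 and c in the row below
            for dr, offsets in ((-1, (0, 1)), (1, (-1, 0))):
                if 0 <= r + dr < len(rows):
                    other = rows[r + dr]
                    for dc in offsets:
                        if 0 <= c + dc < len(other):
                            adjacent.add(other[c + dc])
            neighbors[key] = sorted(adjacent)
    return neighbors

qwerty = build_qwerty_neighbors()
print(qwerty["a"], qwerty["b"], qwerty["c"])
```

For `a`, `b`, and `c` this derivation yields the same neighbor sets as the three hand-written entries. Digits, punctuation, and other keyboard layouts are not covered.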

## Word-Level Attacks


### Attack 4: Synonym Substitution

**Implementation Note**: An implementation of this attack typically relies on helper functions like:

```python
get_word_importance(text, model)  # Compute importance scores
get_synonyms(word)                # Get word synonyms
is_attack_successful(...)         # Check if attack worked
```

These would need to be implemented based on your specific use case.
See the working examples above for concrete implementations.
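
As a rough sketch of how these three helpers could fit together, the block below runs a greedy, importance-ranked synonym attack against a toy scoring function. Everything in it is an assumption for illustration: the keyword weights, the hand-written synonym table, and the choice to pass a `prob_fn` callable (text to P(positive)) instead of a model object.

```python
import math

# Hypothetical toy classifier weights and synonym table
TOY_WEIGHTS = {"great": 2.0, "entertaining": 1.0, "fine": 0.3}
TOY_BIAS = -1.0
TOY_SYNONYMS = {"great": ["fine"], "entertaining": ["diverting"], "movie": ["film"]}

def positive_prob(text):
    """Toy P(positive): logistic function of summed keyword weights."""
    score = TOY_BIAS + sum(TOY_WEIGHTS.get(w, 0.0) for w in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))

def get_word_importance(text, prob_fn):
    """Leave-one-out importance: drop in probability when each word is removed."""
    words = text.split()
    base = prob_fn(text)
    return [base - prob_fn(" ".join(words[:i] + words[i + 1:])) for i in range(len(words))]

def get_synonyms(word):
    return TOY_SYNONYMS.get(word.lower(), [])

def is_attack_successful(text, prob_fn, threshold=0.5):
    return prob_fn(text) < threshold

def greedy_synonym_attack(text, prob_fn):
    words = text.split()
    importance = get_word_importance(text, prob_fn)
    # Visit the most influential words first
    order = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)
    current_prob = prob_fn(text)
    for i in order:
        for synonym in get_synonyms(words[i]):
            candidate = words[:i] + [synonym] + words[i + 1:]
            candidate_prob = prob_fn(" ".join(candidate))
            if candidate_prob < current_prob:  # keep substitutions that lower the score
                words, current_prob = candidate, candidate_prob
        if is_attack_successful(" ".join(words), prob_fn):
            break
    return " ".join(words)

adversarial = greedy_synonym_attack("This movie is great and entertaining", positive_prob)
print(adversarial, round(positive_prob(adversarial), 3))
```

Ranking by importance first keeps the number of changed words small, which matters for readability.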

### Attack 5: Embedding-Based Substitution

**Implementation Note**: An implementation of this attack typically relies on helper functions like:

```python
get_word_importance(text, model)  # Compute importance scores
get_synonyms(word)                # Get word synonyms
is_attack_successful(...)         # Check if attack worked
```

These would need to be implemented based on your specific use case.
See the working examples above for concrete implementations.
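
A minimal sketch of the core step, finding substitution candidates by cosine similarity in an embedding space, could look like the following. The three-dimensional vectors are invented for illustration, and `nearest_neighbors` is a hypothetical helper; a real attack would load pretrained word vectors instead.

```python
import math

# Invented toy embeddings; real attacks would use pretrained word vectors
TOY_EMBEDDINGS = {
    "good":  [0.9, 0.1, 0.0],
    "great": [1.0, 0.2, 0.0],
    "fine":  [0.8, 0.3, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "movie": [0.0, 1.0, 0.2],
    "film":  [0.1, 0.9, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, embeddings, top_k=2, min_similarity=0.8):
    """Return up to top_k other words whose cosine similarity is at least min_similarity."""
    if word not in embeddings:
        return []
    target = embeddings[word]
    scored = [
        (cosine(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored = [(sim, other) for sim, other in scored if sim >= min_similarity]
    scored.sort(reverse=True)  # most similar first
    return [other for _, other in scored[:top_k]]

print(nearest_neighbors("great", TOY_EMBEDDINGS))
print(nearest_neighbors("movie", TOY_EMBEDDINGS))
```

In real distributional embeddings, antonyms often sit close together, so a similarity threshold alone may not be enough and extra filters (part of speech, sentence-level similarity) are commonly added.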

### Attack 6: Word Insertion/Deletion

**Implementation Note**: An implementation of this attack typically relies on helper functions like:

```python
get_word_importance(text, model)  # Compute importance scores
get_synonyms(word)                # Get word synonyms
is_attack_successful(...)         # Check if attack worked
```

These would need to be implemented based on your specific use case.
See the working examples above for concrete implementations.
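
The candidate generation for word insertion and deletion can be sketched in a few lines. The functions and the toy scorer below are hypothetical and exist only to show the shape of the search: enumerate single-edit candidates, then keep the one that lowers the original-class score the most.

```python
def word_deletion_candidates(text):
    """All texts obtained by deleting exactly one word."""
    words = text.split()
    return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]

def word_insertion_candidates(text, fillers):
    """All texts obtained by inserting one filler word at one position."""
    words = text.split()
    return [
        " ".join(words[:i] + [filler] + words[i:])
        for i in range(len(words) + 1)
        for filler in fillers
    ]

def best_candidate(candidates, score_fn):
    """Candidate with the lowest original-class score (first one wins on ties)."""
    return min(candidates, key=score_fn)

def toy_positive_score(text):
    """Toy scorer (assumption): fraction of words that are positive keywords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in {"great", "good"} for w in words) / len(words)

original = "a truly great film"
deleted = best_candidate(word_deletion_candidates(original), toy_positive_score)
inserted = best_candidate(word_insertion_candidates(original, ["really"]), toy_positive_score)
print(deleted, "|", inserted)
```

With this toy scorer, deletion removes the only sentiment-bearing word, while insertion merely dilutes it. A practical attack would also check that the edited text still means the same thing to a human reader.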

## Sentence-Level Attacks


### Attack 7: Paraphrasing

In [124]:
def paraphrase_attack(text, paraphrase_model_name='t5-base', num_paraphrases=10):
    """
    Generate candidate paraphrases with a seq2seq model.

    Note: the stock 't5-base' checkpoint is not fine-tuned on a "paraphrase:"
    task, so pass the name of a paraphrase-tuned T5 checkpoint for usable output.
    """
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    
    model = T5ForConditionalGeneration.from_pretrained(paraphrase_model_name)
    tokenizer = T5Tokenizer.from_pretrained(paraphrase_model_name)
    
    # Build the task-prefixed input
    input_text = f"paraphrase: {text}"
    inputs = tokenizer(input_text, return_tensors='pt')
    
    # Beam search: num_beams must be at least num_return_sequences
    outputs = model.generate(
        **inputs,
        num_return_sequences=num_paraphrases,
        num_beams=num_paraphrases,
        max_length=100
    )
    
    paraphrases = [
        tokenizer.decode(output, skip_special_tokens=True)
        for output in outputs
    ]
    
    return paraphrases

def select_best_paraphrase(paraphrases, target_model, target_class):
    """
    Select the paraphrase with the highest target-class probability.

    Assumes a hypothetical target_model.predict_proba(text) interface.
    """
    best_paraphrase = None
    best_score = float('-inf')
    
    for paraphrase in paraphrases:
        score = target_model.predict_proba(paraphrase)[target_class]
        if score > best_score:
            best_score = score
            best_paraphrase = paraphrase
    
    return best_paraphrase

### Attack 8: Back-Translation

In [125]:
def back_translation_attack(text, intermediate_languages=('fr', 'de', 'es')):
    """
    Translate to intermediate language and back to create adversarial text
    """
    from transformers import MarianMTModel, MarianTokenizer
    
    adversarial_texts = []
    
    for lang in intermediate_languages:
        # Translate to intermediate language
        model_name = f'Helsinki-NLP/opus-mt-en-{lang}'
        tokenizer = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        
        translated = model.generate(
            **tokenizer(text, return_tensors='pt')
        )
        intermediate_text = tokenizer.decode(translated[0], skip_special_tokens=True)
        
        # Translate back to English
        model_name = f'Helsinki-NLP/opus-mt-{lang}-en'
        tokenizer = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        
        back_translated = model.generate(
            **tokenizer(intermediate_text, return_tensors='pt')
        )
        adversarial_text = tokenizer.decode(back_translated[0], skip_special_tokens=True)
        
        adversarial_texts.append(adversarial_text)
    
    return adversarial_texts

### Attack 9: Style Transfer

In [126]:
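def style_transfer_attack(text, target_style='formal', model_name='style-transfer-model'):
    """
    Change text style while preserving meaning
    """
    from transformers import pipeline
    
    # 'style-transfer-model' is a placeholder, not a real checkpoint:
    # substitute an instruction-following text-generation model
    style_transfer = pipeline('text-generation', model=model_name)
    
    prompt = f"Rewrite in {target_style} style: {text}"
    # return_full_text=False strips the prompt from the generated output
    adversarial_text = style_transfer(prompt, return_full_text=False)[0]['generated_text']
    
    return adversarial_text.strip()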

## Gradient-Based and Importance-Guided NLP Attacks


### Attack 10: HotFlip

In [127]:
def hotflip_attack(text, model, tokenizer, true_label, num_flips=3):
    """
    HotFlip: Gradient-based token flips (first-order approximation)
    
    Reference: "HotFlip: White-Box Adversarial Examples for Text Classification"
               (Ebrahimi et al., 2018)

    Assumes a Hugging Face sequence-classification model on the same device
    as the tokenized inputs.
    """
    import torch
    
    model.eval()
    
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt')
    input_ids = inputs['input_ids']
    
    # Detach the embeddings so they form a leaf tensor that stores gradients
    embeddings = model.get_input_embeddings()
    input_embeds = embeddings(input_ids).detach().requires_grad_(True)
    
    # Forward pass; labels are required for the model to return a loss
    outputs = model(
        inputs_embeds=input_embeds,
        attention_mask=inputs['attention_mask'],
        labels=torch.tensor([true_label]),
    )
    
    # Backward pass
    outputs.loss.backward()
    grad = input_embeds.grad[0]  # (seq_len, hidden)
    
    with torch.no_grad():
        # Score for flipping position i to token j: grad_i . (E_j - e_i),
        # computed for every (i, j) pair with one matrix product
        weight = embeddings.weight  # (vocab_size, hidden)
        flip_scores = grad @ weight.T - (grad * input_embeds[0]).sum(-1, keepdim=True)
        
        # Never flip special tokens such as [CLS] and [SEP]
        special = torch.tensor(
            tokenizer.get_special_tokens_mask(
                input_ids[0].tolist(), already_has_special_tokens=True
            ),
            dtype=torch.bool,
        )
        flip_scores[special] = float('-inf')
    
    # Best replacement per position, then the positions with the highest scores
    best_scores, best_tokens = flip_scores.max(dim=1)
    num_flips = min(num_flips, int((~special).sum()))
    top_positions = torch.topk(best_scores, num_flips).indices
    
    # Apply flips (at most one per position)
    adversarial_ids = input_ids.clone()
    adversarial_ids[0, top_positions] = best_tokens[top_positions]
    
    adversarial_text = tokenizer.decode(adversarial_ids[0], skip_special_tokens=True)
    return adversarial_text
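
The flip score used in `hotflip_attack` is a first-order Taylor estimate of how much the loss changes when one token is swapped. Writing $e_i$ for the current embedding at position $i$ and $E_j$ for the embedding of vocabulary token $j$ (notation introduced here for illustration), the estimate and the selected flip are:

$$
L(e_i \rightarrow E_j) - L \;\approx\; \nabla_{e_i} L^{\top} \left( E_j - e_i \right),
\qquad
(i^{*}, j^{*}) = \arg\max_{i,\,j} \; \nabla_{e_i} L^{\top} \left( E_j - e_i \right)
$$

Because the estimate is linear, it is only reliable for small changes in embedding space. Applying several flips at once from a single gradient is an approximation.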

### Attack 11: TextFooler

In [128]:
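def textfooler_attack(text, model, word_embeddings):
    """
    TextFooler: Synonym substitution guided by importance
    
    Reference: "Is BERT Really Robust?" (Jin et al., 2020)

    Assumes a hypothetical model with predict(text) and predict_proba(text);
    get_synonyms_from_embedding and filter_synonyms are helpers you must supply.
    """
    words = text.split()
    
    # Step 1: Compute word importance
    importance_scores = []
    original_class = model.predict(text)
    original_prob = model.predict_proba(text)[original_class]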
    
    for i in range(len(words)):
        # Delete word and measure impact
        temp_text = ' '.join(words[:i] + words[i+1:])
        temp_prob = model.predict_proba(temp_text)[original_class]
        importance = original_prob - temp_prob
        importance_scores.append(importance)
    
    # Step 2: Sort words by importance
    sorted_indices = np.argsort(importance_scores)[::-1]
    
    # Step 3: Replace important words with synonyms
    adversarial_words = words.copy()
    
    for idx in sorted_indices:
        word = words[idx]
        
        # Get synonyms
        synonyms = get_synonyms_from_embedding(word, word_embeddings)
        
        # Filter synonyms by semantic similarity and POS
        filtered_synonyms = filter_synonyms(
            word, synonyms, 
            similarity_threshold=0.7,
            preserve_pos=True
        )
        
        # Try each synonym
        for synonym in filtered_synonyms:
            candidate_words = adversarial_words.copy()
            candidate_words[idx] = synonym
            candidate_text = ' '.join(candidate_words)
            
            # Check if attack successful
            pred_class = model.predict(candidate_text)
            if pred_class != original_class:
                return candidate_text
            
            # Update if improves attack
            candidate_prob = model.predict_proba(candidate_text)[original_class]
            if candidate_prob < original_prob:
                adversarial_words[idx] = synonym
                original_prob = candidate_prob
    
    return ' '.join(adversarial_words)

### Attack 12: BERT-Attack

**Implementation Note**: An implementation of this attack typically relies on helper functions like:

```python
get_word_importance(text, model)  # Compute importance scores
get_synonyms(word)                # Get word synonyms
is_attack_successful(...)         # Check if attack worked
```

These would need to be implemented based on your specific use case.
See the working examples above for concrete implementations.
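
BERT-Attack (Li et al., 2020) proposes replacement words with a BERT masked language model. The sketch below is a simplified variant that masks the target position explicitly and asks a fill-mask function for candidates. `propose_replacements` and `stub_fill_mask` are hypothetical names. The stub only mimics the output shape of a Hugging Face `fill-mask` pipeline (a list of dicts with a `token_str` field), and in practice it would be swapped for a real pipeline.

```python
def propose_replacements(words, index, fill_mask, top_k=5, mask_token="[MASK]"):
    """Mask words[index] and return in-context replacement candidates."""
    masked = words[:index] + [mask_token] + words[index + 1:]
    predictions = fill_mask(" ".join(masked), top_k=top_k)
    original = words[index].lower()
    candidates = []
    for prediction in predictions:
        token = prediction["token_str"].strip()
        # Skip the original word and WordPiece continuation fragments such as "##ly"
        if token.lower() != original and not token.startswith("##"):
            candidates.append(token)
    return candidates

def stub_fill_mask(text, top_k=5):
    """Hypothetical stand-in for a masked language model, returning canned predictions."""
    canned = ["great", "good", "##ly", "fantastic"]
    return [
        {"token_str": token, "score": 1.0 / (rank + 1)}
        for rank, token in enumerate(canned[:top_k])
    ]

print(propose_replacements("this movie is great".split(), 3, stub_fill_mask))
```

Each candidate would then be tried against the victim model, most important word first, in the same way as the synonym-based attacks.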

## Semantic-Preserving Attacks


### Attack 13: Named Entity Substitution

In [129]:
def named_entity_attack(text, model, target_class):
    """
    Replace named entities with similar entities

    is_attack_successful(model, text, target_class) is a helper you must supply.
    """
    import spacy
    
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    
    # Entity replacement dictionary
    entity_replacements = {
        'PERSON': ['John Smith', 'Jane Doe', 'Alex Johnson'],
        'ORG': ['TechCorp', 'GlobalInc', 'MegaCorp'],
        'GPE': ['New York', 'London', 'Tokyo'],
        'DATE': ['yesterday', 'last week', 'recently'],
    }
    
    for ent in doc.ents:
        for replacement in entity_replacements.get(ent.label_, []):
            # Replace by character offsets so only this mention changes
            candidate = text[:ent.start_char] + replacement + text[ent.end_char:]
            if is_attack_successful(model, candidate, target_class):
                return candidate
    
    # No single substitution succeeded
    return text

### Attack 14: Contextual Word Substitution

In [130]:
def contextual_substitution_attack(text, model, target_class):
    """
    Use contextual embeddings to find substitutions

    find_contextually_similar_words and is_attack_successful are helpers
    you must supply.
    """
    import torch
    from transformers import BertModel, BertTokenizerFast
    
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    bert = BertModel.from_pretrained('bert-base-uncased')
    bert.eval()
    
    words = text.split()
    
    # Encode once; word_ids() maps each WordPiece back to its source word
    inputs = tokenizer(words, is_split_into_words=True, return_tensors='pt')
    word_ids = inputs.word_ids()
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]
    
    for i, word in enumerate(words):
        # Average the WordPiece vectors that belong to word i
        piece_positions = [p for p, w in enumerate(word_ids) if w == i]
        if not piece_positions:
            continue
        word_embedding = hidden[piece_positions].mean(dim=0)
        
        # Find words with similar contextual embeddings
        candidates = find_contextually_similar_words(
            word_embedding, bert, tokenizer
        )
        
        # Try each candidate
        for candidate in candidates:
            test_words = words.copy()
            test_words[i] = candidate
            test_text = ' '.join(test_words)
            
            if is_attack_successful(model, test_text, target_class):
                return test_text
    
    return text

## Defense Mechanisms


### Defense 1: Adversarial Training

In [131]:
def adversarial_training_nlp(model, train_data, attack_fn, criterion, num_epochs=3):
    """
    Train model on adversarial examples

    Assumes model is a torch.nn.Module wrapper that accepts raw text and
    returns logits, and that each label is a tensor criterion can consume.
    """
    optimizer = torch.optim.Adam(model.parameters())
    
    for epoch in range(num_epochs):
        for text, label in train_data:
            # Generate adversarial example with dropout disabled
            model.eval()
            adv_text = attack_fn(text, model)
            model.train()
            
            # Train on both clean and adversarial
            for t in [text, adv_text]:
                optimizer.zero_grad()
                output = model(t)
                loss = criterion(output, label)
                loss.backward()
                optimizer.step()
    
    return model

### Defense 2: Input Preprocessing

In [132]:
def preprocess_defense(text):
    """
    Normalize text to remove adversarial perturbations
    """
    # Remove homoglyphs
    text = remove_homoglyphs(text)
    
    # Spell check (spell_check is a placeholder for a spell-correction helper you must supply)
    text = spell_check(text)
    
    # Remove extra spaces
    text = ' '.join(text.split())
    
    # Lowercase
    text = text.lower()
    
    return text

def remove_homoglyphs(text):
    """
    Replace homoglyphs with standard characters
    """
    homoglyph_map = {
        'а': 'a',  # Cyrillic a -> Latin a
        'е': 'e',  # Cyrillic e -> Latin e
        'о': 'o',  # Cyrillic o -> Latin o
        # ... complete mapping
    }
    
    for homoglyph, standard in homoglyph_map.items():
        text = text.replace(homoglyph, standard)
    
    return text
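
A replacement table only covers the look-alikes it lists. As a complementary sketch, words that mix scripts can be flagged with the standard-library `unicodedata` module by reading the script from the first word of each character's Unicode name. This is a rough heuristic, not a full script classifier, and `script_of` and `find_mixed_script_words` are hypothetical helpers.

```python
import unicodedata

def script_of(char):
    """Rough script label: the first word of the Unicode character name."""
    if not char.isalpha():
        return None
    name = unicodedata.name(char, "")
    return name.split(" ")[0] if name else None

def find_mixed_script_words(text):
    """Return (word, scripts) for every word whose letters come from more than one script."""
    flagged = []
    for word in text.split():
        scripts = {script_of(char) for char in word} - {None}
        if len(scripts) > 1:
            flagged.append((word, sorted(scripts)))
    return flagged

# \u0456, \u0430 and \u0435 are Cyrillic look-alikes of Latin i, a and e
suspicious = "Th\u0456s is \u0430 gr\u0435at product"
print(find_mixed_script_words(suspicious))
```

The heuristic misses a word written entirely in another script (the standalone Cyrillic `а` in the sample) as well as look-alikes inside the Latin block, such as dotless `ı`, so it works best alongside a replacement map.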

### Defense 3: Ensemble Methods

In [133]:
def ensemble_defense(text, models):
    """
    Use ensemble of models for robustness
    """
    predictions = []
    
    for model in models:
        pred = model.predict(text)
        predictions.append(pred)
    
    # Majority vote
    final_prediction = max(set(predictions), key=predictions.count)
    return final_prediction

### Defense 4: Certified Robustness for NLP

In [134]:
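def interval_bound_propagation_nlp(model, text, perturbation_set, predicted_class):
    """
    Certify robustness by exhaustively checking a finite perturbation set.

    Note: despite the name, this enumerates perturbations rather than
    propagating interval bounds through the network, so it is only feasible
    for small sets. generate_perturbation_set is a helper you must supply.
    """
    # Define perturbation set (e.g., all synonym substitutions)
    perturbed_texts = generate_perturbation_set(text, perturbation_set)
    
    # Compute bounds on the predicted-class probability
    min_score = float('inf')
    max_score = float('-inf')
    
    for perturbed_text in perturbed_texts:
        score = model.predict_proba(perturbed_text)[predicted_class]
        min_score = min(min_score, score)
        max_score = max(max_score, score)
    
    # Certified only if the predicted class wins on every perturbation
    # (the 0.5 threshold is valid for binary classifiers)
    certified = min_score > 0.5
    return certified, (min_score, max_score)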
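
Exhaustive checking is only practical when the perturbation set is small, because the number of combinations is the product of the options per word. The sketch below counts and enumerates synonym-substitution perturbations for a made-up synonym table. `perturbation_set_size` and `enumerate_perturbations` are hypothetical helpers.

```python
from itertools import product
from math import prod

def perturbation_set_size(words, synonyms):
    """Number of texts reachable by independently keeping or substituting each word."""
    return prod(1 + len(synonyms.get(word, [])) for word in words)

def enumerate_perturbations(words, synonyms):
    """Yield every combination; the first yielded text is the unmodified original."""
    options = [[word] + synonyms.get(word, []) for word in words]
    for combination in product(*options):
        yield " ".join(combination)

toy_synonyms = {"great": ["fine", "good"], "fun": ["amusing"], "movie": ["film"]}
sentence = "a great fun movie".split()

print(perturbation_set_size(sentence, toy_synonyms))
for perturbed in enumerate_perturbations(sentence, toy_synonyms):
    print(perturbed)
```

With only a few synonyms per word the count already grows multiplicatively with sentence length, which is why bound-propagation methods avoid enumeration.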

## Evaluation Metrics


### Attack Success Rate

In [135]:
def evaluate_attack_success(attack_fn, test_data, model):
    """
    Measure attack success rate
    """
    successful_attacks = 0
    total_attempts = 0
    
    for text, true_label in test_data:
        # Original prediction
        original_pred = model.predict(text)
        
        if original_pred == true_label:
            # Generate adversarial example
            adv_text = attack_fn(text, model)
            adv_pred = model.predict(adv_text)
            
            if adv_pred != true_label:
                successful_attacks += 1
            
            total_attempts += 1
    
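    # Avoid division by zero when the model misclassifies every clean example
    success_rate = successful_attacks / total_attempts if total_attempts else 0.0
    return success_rate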

### Semantic Similarity

In [136]:
def evaluate_semantic_similarity(original_text, adversarial_text):
    """
    Measure semantic similarity between original and adversarial
    """
    import numpy as np
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Get embeddings
    emb1 = model.encode(original_text)
    emb2 = model.encode(adversarial_text)
    
    # Compute cosine similarity of the two sentence vectors
    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    
    return float(similarity)
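
Embedding similarity can be complemented by a cheap surface-level measure of how much text was changed. The sketch below computes Levenshtein edit distance over characters or over word lists, plus a word modification rate. Both function names are hypothetical, and the metric is offered as one possible proxy for imperceptibility, not as the notebook's own measure.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def word_modification_rate(original, adversarial):
    """Word-level edit distance divided by the number of words in the original."""
    original_words = original.split()
    adversarial_words = adversarial.split()
    return levenshtein(original_words, adversarial_words) / max(len(original_words), 1)

print(levenshtein("great", "gr\u0435at"))  # one Cyrillic look-alike substituted
print(word_modification_rate(
    "This movie is great and entertaining",
    "This movie is majuscule and nurse",
))
```

A low modification rate does not by itself mean the meaning was preserved, as the second example shows, so it is best read together with the similarity score.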

### Perplexity

In [137]:
def evaluate_perplexity(text, model_name='gpt2'):
    """
    Measure text naturalness using perplexity
    """
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # With labels set, outputs.loss is the mean token-level cross-entropy
        outputs = model(**inputs, labels=inputs['input_ids'])
    
    perplexity = torch.exp(outputs.loss)
    return perplexity.item()
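
For reference, the quantity computed by exponentiating the mean cross-entropy loss is the standard perplexity of a token sequence $x_1, \dots, x_N$ under a language model $p_\theta$ (symbols introduced here for illustration):

$$
\mathrm{PPL}(x) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right) \right)
$$

Lower values mean the model finds the text more predictable. Adversarial text with unusual characters or awkward substitutions tends to score higher than its original, which makes perplexity useful both for evaluating attacks and for detecting them.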

## Tools and Frameworks


### TextAttack Framework (Optional)

If you have `textattack` installed, here's how to use it:

```python
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english'
)
tokenizer = AutoTokenizer.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english'
)

# Wrap for TextAttack
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)
# The GLUE SST-2 test split has no public labels, so use the validation split
dataset = HuggingFaceDataset('glue', 'sst2', split='validation')

# Create and run attack (TextAttack >= 0.3 runs attacks through Attacker)
attack = TextFoolerJin2019.build(model_wrapper)
attack_args = AttackArgs(num_examples=20)
attacker = Attacker(attack, dataset, attack_args)
results = attacker.attack_dataset()
```

**Note**: This is optional and requires:
```bash
pip install textattack
```

### OpenAttack Framework (Optional)

If you have `OpenAttack` installed, here's how to use it:

```python
import OpenAttack as oa

# Load victim model
victim = oa.DataManager.loadVictim('BERT.SST')

# Choose attack
attacker = oa.attackers.PWWSAttacker()

# Run attack
# `dataset` is not defined in this snippet: load it beforehand in the format OpenAttack expects
attack_eval = oa.AttackEval(attacker, victim)
attack_eval.eval(dataset, visualize=True)
```

**Note**: This is optional and requires `pip install OpenAttack`

## Case Studies


### Case 1: Sentiment Analysis Attack

In [138]:
# Original: "This movie is great!" → Positive
# Attack: "This movie is greαt!" → Negative (homoglyph)
# Success: Model fooled by single character

### Case 2: Spam Detection Evasion

In [139]:
# Original: "Buy cheap viagra now!" → Spam
# Attack: "Buy ch3ap v1agra n0w!" → Not Spam
# Success: Character substitution evades filter

### Case 3: Hate Speech Detection

In [140]:
# Original: "I hate this group" → Hate Speech
# Attack: "I dislike this group" → Not Hate Speech
# Success: Synonym substitution changes classification

## Summary


### Key Takeaways

1. **Discrete Challenge**: Text attacks face unique discrete input space
2. **Multiple Levels**: Character, word, sentence-level attacks
3. **Semantic Preservation**: Must maintain meaning and readability
4. **Model-Guided Attacks**: HotFlip uses gradients (white-box); TextFooler and BERT-Attack rank words by importance and search over substitutions
5. **Defenses**: Preprocessing, adversarial training, ensembles

### Best Practices

**For Attackers**:
- Start with word-level attacks (often a good balance of success rate and readability)
- Preserve semantic similarity
- Use gradient information when available
- Combine multiple attack types

**For Defenders**:
- Input preprocessing and normalization
- Adversarial training with diverse attacks
- Ensemble methods
- Monitor for unusual patterns

## References


### Key Papers

1. "HotFlip: White-Box Adversarial Examples for Text Classification" (Ebrahimi et al., 2018)
2. "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment" (Jin et al., 2020)
3. "BERT-ATTACK: Adversarial Attack Against BERT Using BERT" (Li et al., 2020)
4. "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP" (Morris et al., 2020)

### Tools

- [TextAttack](https://github.com/QData/TextAttack)
- [OpenAttack](https://github.com/thunlp/OpenAttack)
- [Adversarial-NLP](https://github.com/microsoft/Adversarial-NLP)

## Next Steps


1. Complete [Lab 3: Text Adversarial Attacks](labs/lab3_text_attacks.ipynb)
2. Experiment with TextAttack framework
3. Implement custom NLP attacks
4. Evaluate defense mechanisms
5. Study latest NLP security research

---

**Difficulty**: ⭐⭐⭐⭐ Advanced Level
**Prerequisites**: NLP basics, transformers, PyTorch
**Estimated Time**: 3-4 hours