# Lab 3: Text Adversarial Attacks

## Overview

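Text attacks face challenges that image attacks do not: the input space is discrete, so a perturbation cannot be made arbitrarily small; edits must preserve the meaning of the text; and the result must still read naturally to a human. In this lab, you'll implement several text attack techniques.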

## Learning Objectives

1. Understand the challenges of adversarial NLP
2. Implement character-level attacks (homoglyphs, typos)
3. Execute word-level attacks (synonyms, embeddings)
4. Use the TextAttack framework
5. Evaluate semantic similarity between original and adversarial text

## Prerequisites

- Understanding of NLP basics
- Familiarity with transformers
- Completion of Labs 1-2

## Setup

### Install Required Packages

This lab requires `textattack`. Run this cell to install it:

In [1]:
# Install textattack if not already installed
try:
    import textattack
    print("✓ TextAttack already installed")
except ImportError:
    print("Installing textattack...")
    !pip install -q textattack==0.3.10
    print("✓ TextAttack installed")


✓ TextAttack already installed


### Import Libraries

In [2]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import random
import string

# For semantic similarity
from sentence_transformers import SentenceTransformer, util

# Detect device (supports CUDA, Apple Silicon MPS, and CPU)
if torch.cuda.is_available():
    device = torch.device('cuda')
    device_id = 0
    print(f"✓ Using CUDA GPU: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
    device_id = -1  # No CUDA index to pass; the pipeline may still run on the model's device (the log below reports mps:0)
    print("✓ Using Apple Silicon GPU (MPS)")
else:
    device = torch.device('cpu')
    device_id = -1
    print("ℹ Using CPU")

print(f"Device: {device}")

✓ Using Apple Silicon GPU (MPS)
Device: mps


### Load Sentiment Analysis Model

We'll attack a sentiment classifier.

In [3]:
# Load sentiment analysis model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)
model.eval()

# Create pipeline
sentiment_pipeline = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=device_id)

# Load semantic similarity model
sim_model = SentenceTransformer('all-MiniLM-L6-v2')

print("✓ Models loaded")

Device set to use mps:0


✓ Models loaded


### Test Original Predictions

In [4]:
# Test sentences
test_sentences = [
    "This movie is absolutely wonderful and amazing!",
    "I love this product, it works perfectly.",
    "This is the worst experience I've ever had.",
    "Terrible service, would not recommend."
]

print("Original Predictions:\n")
for sent in test_sentences:
    result = sentiment_pipeline(sent)[0]
    print(f"Text: {sent}")
    print(f"Prediction: {result['label']} ({result['score']:.2%})\n")

Original Predictions:

Text: This movie is absolutely wonderful and amazing!
Prediction: POSITIVE (99.99%)

Text: I love this product, it works perfectly.
Prediction: POSITIVE (99.99%)

Text: This is the worst experience I've ever had.
Prediction: NEGATIVE (99.98%)

Text: Terrible service, would not recommend.
Prediction: NEGATIVE (99.26%)



## Part 1: Character-Level Attacks

### Attack 1.1: Homoglyph Substitution

Replace characters with visually similar Unicode characters.

In [5]:
def homoglyph_attack(text, substitution_rate=0.3):
    """
    Replace characters with visually similar Unicode characters
    
    Args:
        text: Input text
        substitution_rate: Fraction of characters to replace
    
    Returns:
        Adversarial text
    """
    homoglyphs = {
        'a': ['а', 'ɑ'],  # Cyrillic a, Latin alpha
        'e': ['е', 'ė'],  # Cyrillic e, Latin e with dot
        'o': ['о', 'ο'],  # Cyrillic o, Greek omicron
        'i': ['і', 'ı'],  # Cyrillic i, dotless i
        'c': ['с', 'ϲ'],  # Cyrillic s, Greek lunate sigma
        'p': ['р'],       # Cyrillic r
        'x': ['х'],       # Cyrillic h
    }
    
    adversarial = list(text)
    num_substitutions = int(len(text) * substitution_rate)
    
    # Get positions of replaceable characters
    positions = [i for i, char in enumerate(text) if char.lower() in homoglyphs]
    
    if positions:
        # Randomly select positions to replace
        replace_positions = random.sample(positions, min(num_substitutions, len(positions)))
        
        for pos in replace_positions:
            char = text[pos].lower()
            if char in homoglyphs:
                replacement = random.choice(homoglyphs[char])
                # Preserve case
                if text[pos].isupper():
                    replacement = replacement.upper()
                adversarial[pos] = replacement
    
    return ''.join(adversarial)

# Test homoglyph attack
original = "This movie is absolutely wonderful and amazing!"
adversarial = homoglyph_attack(original, substitution_rate=0.2)

print(f"Original:    {original}")
print(f"Adversarial: {adversarial}")
print(f"\nIdentical strings? {original == adversarial}")

# Test on model
orig_result = sentiment_pipeline(original)[0]
adv_result = sentiment_pipeline(adversarial)[0]

print(f"\nOriginal prediction: {orig_result['label']} ({orig_result['score']:.2%})")
print(f"Adversarial prediction: {adv_result['label']} ({adv_result['score']:.2%})")
print(f"Attack successful: {orig_result['label'] != adv_result['label']}")

Original:    This movie is absolutely wonderful and amazing!
Adversarial: Thіs movіė іs аbsοlutėly wonderful ɑnd аmazing!

Identical strings? False

Original prediction: POSITIVE (99.99%)
Adversarial prediction: POSITIVE (99.84%)
Attack successful: False
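
The adversarial string looks close to the original, yet the `==` comparison is False because several characters are different Unicode code points. A minimal sketch of how to inspect them with the standard library follows; `describe_non_ascii` is a hypothetical helper, not part of the lab code, and the sample string is made up for illustration.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, char, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Illustrative input: both i's replaced by U+0456
sample = "Th\u0456s mov\u0456e"
for index, char, name in describe_non_ascii(sample):
    print(index, repr(char), name)
```

Running this on the output of `homoglyph_attack` shows which positions were substituted and which script each replacement comes from.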


### Attack 1.2: Character Insertion/Deletion

In [6]:
def char_insertion_attack(text, insertion_rate=0.1):
    """
    Insert random characters to evade detection
    """
    adversarial = list(text)
    num_insertions = int(len(text) * insertion_rate)
    
    for _ in range(num_insertions):
        pos = random.randint(0, len(adversarial))
        char = random.choice(string.ascii_lowercase)
        adversarial.insert(pos, char)
    
    return ''.join(adversarial)

def char_deletion_attack(text, deletion_rate=0.1):
    """
    Delete random characters
    """
    adversarial = list(text)
    num_deletions = int(len(text) * deletion_rate)
    
    # Get positions of non-space characters
    positions = [i for i, char in enumerate(adversarial) if char != ' ']
    
    if positions:
        delete_positions = random.sample(positions, min(num_deletions, len(positions)))
        # Delete in reverse order to maintain indices
        for pos in sorted(delete_positions, reverse=True):
            del adversarial[pos]
    
    return ''.join(adversarial)

# Test
original = "This movie is absolutely wonderful!"
inserted = char_insertion_attack(original, 0.1)
deleted = char_deletion_attack(original, 0.1)

print(f"Original:  {original}")
print(f"Inserted:  {inserted}")
print(f"Deleted:   {deleted}")

Original:  This movie is absolutely wonderful!
Inserted:  This movie is absolutelryb wonderful!g
Deleted:   This movie i abslutely wonderul!
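
Because these attacks draw from the global `random` module, every run produces a different perturbation. A minimal sketch of a reproducible variant, assuming a local `random.Random` instance is acceptable for your experiments, is shown below; `seeded_char_insertion` and its `seed` parameter are hypothetical additions, not part of the lab code.

```python
import random
import string

def seeded_char_insertion(text, insertion_rate=0.1, seed=0):
    """Insert random lowercase letters using a local, seeded generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    chars = list(text)
    for _ in range(int(len(text) * insertion_rate)):
        pos = rng.randint(0, len(chars))  # inclusive upper bound allows appending
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    return "".join(chars)

text = "This movie is absolutely wonderful!"
first = seeded_char_insertion(text, 0.1, seed=42)
second = seeded_char_insertion(text, 0.1, seed=42)
print(first == second)  # True: same seed, same perturbation
```

Fixing the seed makes it possible to compare two models on exactly the same perturbed inputs.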


## Part 2: Word-Level Attacks

### Attack 2.1: Synonym Substitution

In [7]:
# Simple synonym dictionary (in practice, use WordNet or BERT)
SYNONYMS = {
    'wonderful': ['great', 'excellent', 'fantastic', 'marvelous'],
    'amazing': ['incredible', 'awesome', 'outstanding', 'remarkable'],
    'terrible': ['awful', 'horrible', 'dreadful', 'atrocious'],
    'worst': ['poorest', 'most terrible', 'most awful'],
    'love': ['adore', 'enjoy', 'appreciate'],
    'hate': ['despise', 'detest', 'loathe'],
}

def synonym_attack(text, substitution_rate=0.5):
    """
    Replace words with synonyms
    """
    words = text.split()
    adversarial = words.copy()
    
    for i, word in enumerate(words):
        # Remove punctuation for lookup
        clean_word = word.strip(string.punctuation).lower()
        
        if clean_word in SYNONYMS and random.random() < substitution_rate:
            synonym = random.choice(SYNONYMS[clean_word])
            # Preserve punctuation and case
            if word[0].isupper():
                synonym = synonym.capitalize()
            if word[-1] in string.punctuation:
                synonym += word[-1]
            adversarial[i] = synonym
    
    return ' '.join(adversarial)

# Test synonym attack
original = "This movie is absolutely wonderful and amazing!"
adversarial = synonym_attack(original, substitution_rate=0.8)

print(f"Original:    {original}")
print(f"Adversarial: {adversarial}")

# Calculate semantic similarity
orig_embedding = sim_model.encode(original, convert_to_tensor=True)
adv_embedding = sim_model.encode(adversarial, convert_to_tensor=True)
similarity = util.cos_sim(orig_embedding, adv_embedding).item()

print(f"\nSemantic similarity: {similarity:.2%}")

# Test on model
orig_result = sentiment_pipeline(original)[0]
adv_result = sentiment_pipeline(adversarial)[0]

print(f"\nOriginal: {orig_result['label']} ({orig_result['score']:.2%})")
print(f"Adversarial: {adv_result['label']} ({adv_result['score']:.2%})")

Original:    This movie is absolutely wonderful and amazing!
Adversarial: This movie is absolutely marvelous and outstanding!

Semantic similarity: 89.08%

Original: POSITIVE (99.99%)
Adversarial: POSITIVE (99.99%)
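
The semantic similarity reported above is the cosine similarity between the two sentence embeddings. As a rough illustration of what that number measures, here is the same quantity computed on plain Python lists; the two-dimensional vectors are made up and stand in for real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

A value near 1 means the embeddings point in almost the same direction, which is the sense in which the adversarial sentence is "close" to the original.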


## Part 3: Using TextAttack Framework

TextAttack provides implementations of published attack recipes, such as TextFooler (`TextFoolerJin2019`), behind a common interface.

In [8]:
# Install TextAttack if needed
# !pip install textattack

try:
    from textattack.attack_recipes import TextFoolerJin2019
    from textattack.models.wrappers import HuggingFaceModelWrapper
    from textattack.datasets import HuggingFaceDataset
    from textattack import Attacker, AttackArgs
    
    print("✓ TextAttack available")
    textattack_available = True
except ImportError:
    print("⚠ TextAttack not installed. Install with: pip install textattack")
    textattack_available = False

✓ TextAttack available


In [9]:
if textattack_available:
    # Wrap model for TextAttack
    model_wrapper = HuggingFaceModelWrapper(model, tokenizer)
    
    # Create TextFooler attack
    attack = TextFoolerJin2019.build(model_wrapper)
    
    # Attack a single example
    from textattack.shared import AttackedText
    
    text = "This movie is absolutely wonderful and amazing!"
    attacked_text = AttackedText(text)
    
    print(f"Original: {text}")
    print("\nRunning TextFooler attack...")
    
    # Note: the attack call is left commented out because it may take a minute.
    # Uncomment the next two lines to run it:
    # result = attack.attack(attacked_text, 1)  # 1 is the ground truth label
    # print(f"\nAdversarial: {result.perturbed_text()}")
else:
    print("Skipping TextAttack demo - not installed")

textattack: Unknown if model of class <class 'transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.


Original: This movie is absolutely wonderful and amazing!

Running TextFooler attack...
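
The imports in the earlier cell include `Attacker` and `AttackArgs`, which offer another way to run a recipe over a small dataset. The following is a sketch only, assuming TextAttack 0.3.x and the `attack` object built above; `run_single_attack` is a hypothetical helper, and the call is slow because TextFooler queries the model many times.

```python
def run_single_attack(attack, text, ground_truth_label):
    """Run a built TextAttack `attack` on one (text, label) pair."""
    # Imported lazily so this cell still loads if TextAttack is missing.
    from textattack import Attacker, AttackArgs
    from textattack.datasets import Dataset

    dataset = Dataset([(text, ground_truth_label)])
    attack_args = AttackArgs(num_examples=1, disable_stdout=True)
    attacker = Attacker(attack, dataset, attack_args)
    result = attacker.attack_dataset()[0]
    return result.original_text(), result.perturbed_text()

# Usage (slow; assumes `attack` from the cell above and label 1 as ground truth):
# original, perturbed = run_single_attack(
#     attack, "This movie is absolutely wonderful and amazing!", 1
# )
# print(perturbed)
```

If the attack fails or is skipped, the returned perturbed text may simply equal the original, so compare the two strings before drawing conclusions.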


## Part 4: Attack Evaluation

### Metrics for Text Attacks

In [10]:
def evaluate_text_attack(original, adversarial, model_pipeline, sim_model):
    """
    Comprehensive evaluation of text attack
    """
    # 1. Attack success
    orig_result = model_pipeline(original)[0]
    adv_result = model_pipeline(adversarial)[0]
    success = orig_result['label'] != adv_result['label']
    
    # 2. Semantic similarity
    orig_emb = sim_model.encode(original, convert_to_tensor=True)
    adv_emb = sim_model.encode(adversarial, convert_to_tensor=True)
    similarity = util.cos_sim(orig_emb, adv_emb).item()
    
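    # 3. Character-level similarity ratio in [0, 1] (1.0 = identical strings).
    #    This is a proxy for edit distance, not an edit count.
    from difflib import SequenceMatcher
    edit_ratio = SequenceMatcher(None, original, adversarial).ratio()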
    
    # 4. Word overlap
    orig_words = set(original.lower().split())
    adv_words = set(adversarial.lower().split())
    word_overlap = len(orig_words & adv_words) / len(orig_words) if orig_words else 0
    
    results = {
        'success': success,
        'original_label': orig_result['label'],
        'adversarial_label': adv_result['label'],
        'original_confidence': orig_result['score'],
        'adversarial_confidence': adv_result['score'],
        'semantic_similarity': similarity,
        'edit_ratio': edit_ratio,
        'word_overlap': word_overlap
    }
    
    return results

# Test evaluation
original = "This movie is absolutely wonderful and amazing!"
adversarial = synonym_attack(original, 0.8)

results = evaluate_text_attack(original, adversarial, sentiment_pipeline, sim_model)

print("Attack Evaluation Results:")
print("=" * 50)
for key, value in results.items():
    if isinstance(value, float):
        print(f"{key}: {value:.2%}")
    else:
        print(f"{key}: {value}")

Attack Evaluation Results:
==================================================
success: False
original_label: POSITIVE
adversarial_label: POSITIVE
original_confidence: 99.99%
adversarial_confidence: 99.99%
semantic_similarity: 91.39%
edit_ratio: 70.10%
word_overlap: 71.43%
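
The `edit_ratio` above is a similarity score from `SequenceMatcher`, not a count of edits. If an actual edit count is wanted, one option is the Levenshtein distance. The sketch below is a plain dynamic-programming version; `levenshtein` is a hypothetical helper, not part of the lab code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("wonderful", "wonderul"))  # 1: one deleted character
```

Dividing the distance by the length of the original gives a perturbation rate that can be compared across sentences of different lengths.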


## Exercise 1: Implement Keyboard Typo Attack

Create an attack that simulates keyboard typos (adjacent key substitutions).

In [11]:
# Keyboard layout (QWERTY)
KEYBOARD_ADJACENT = {
    'q': ['w', 'a'],
    'w': ['q', 'e', 's'],
    'e': ['w', 'r', 'd'],
    # ... add more
}

def keyboard_typo_attack(text, typo_rate=0.1):
    """
    TODO: Implement keyboard typo attack
    
    Hint: Replace characters with adjacent keys on keyboard
    """
    # YOUR CODE HERE
    pass

# Test your implementation
# original = "This movie is wonderful"
# adversarial = keyboard_typo_attack(original)
# print(f"Original: {original}")
# print(f"With typos: {adversarial}")

## Exercise 2: Optimize Attack for Semantic Similarity

Create an attack that maximizes semantic similarity while achieving misclassification.

In [12]:
def optimized_synonym_attack(text, model_pipeline, sim_model, min_similarity=0.9):
    """
    TODO: Implement attack that maintains high semantic similarity
    
    Hint: Try different synonym combinations and keep the one with highest similarity
    """
    # YOUR CODE HERE
    pass

## Summary

### Key Takeaways

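1. **Text attacks are harder** than image attacks because the input space is discrete: a word or character cannot be perturbed by an arbitrarily small amount
2. **Character-level attacks** can evade simple string-matching filters, although the homoglyph run in this lab did not flip the classifier's label
3. **Word-level attacks** preserve semantics better than character-level noise
4. **Semantic similarity** is crucial for imperceptibility: an adversarial example should still mean what the original meant
5. **TextAttack** provides ready-made implementations of published attacks such as TextFooler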

### Attack Comparison

| Attack Type | Imperceptibility | Success Rate | Semantic Preservation |
|-------------|------------------|--------------|----------------------|
| Homoglyphs | High | Medium | High |
| Char Insert/Delete | Low | Low | Low |
| Synonyms | High | High | High |
| TextFooler | High | Very High | High |
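
A success rate is an aggregate over many attacked sentences, whereas `evaluate_text_attack` scores a single pair. A minimal sketch of the aggregation step, assuming a list of result dictionaries with the same keys that `evaluate_text_attack` returns, might look like this; `summarize_attack_results` is a hypothetical helper and the toy values are made up.

```python
def summarize_attack_results(results):
    """Aggregate per-example result dicts into a success rate and mean similarity."""
    total = len(results)
    if total == 0:
        return {"success_rate": 0.0, "mean_similarity": 0.0}
    successes = sum(1 for r in results if r["success"])
    mean_similarity = sum(r["semantic_similarity"] for r in results) / total
    return {
        "success_rate": successes / total,
        "mean_similarity": mean_similarity,
    }

# Toy data for illustration only, not measured results
toy_results = [
    {"success": True, "semantic_similarity": 0.9},
    {"success": False, "semantic_similarity": 0.8},
]
summary = summarize_attack_results(toy_results)
print(summary["success_rate"])  # 0.5
```

Running each attack over the same set of test sentences and summarizing this way is one way to fill in a comparison table with measured values.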

## Next Steps

1. Complete exercises
2. Try attacks on different models
3. Experiment with TextAttack recipes
4. Move to Lab 4: Transfer Attacks