# Module 3: Evasion Attacks - Lab Answers

## Lab 1: Whitebox Evasion - Exercise Answer

### Exercise: Implement C&W Attack

**Task**: Research and implement the Carlini & Wagner attack.

**Answer**:


import torch
import torch.nn as nn
import torch.optim as optim

def carlini_wagner_attack(model, image, target_class, c=1.0, kappa=0, 
                         max_iterations=1000, learning_rate=0.01):
    """
    Carlini & Wagner L2 attack
    
    Args:
        model: Target model
        image: Original image tensor
        target_class: Target class for targeted attack
        c: Trade-off constant weighting the classification term against the L2 distance
        kappa: Confidence margin (larger values demand a bigger logit gap)
        max_iterations: Maximum optimization iterations
        learning_rate: Learning rate for optimizer
    
    Returns:
        adversarial_image: Adversarial example
        perturbation: Perturbation added
    """
    device = image.device
    
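    # Optimization variable in tanh space (enforces the [0, 1] box constraint).
    # Note: w = 0 decodes to 0.5 for every pixel, so the search starts from a
    # uniform gray image rather than from `image`, which explains the large L2
    # at iteration 0. Initializing w = atanh(2 * image - 1) would start at `image`.
    w = torch.zeros_like(image, requires_grad=True)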
    optimizer = optim.Adam([w], lr=learning_rate)
    
    # Original prediction
    with torch.no_grad():
        original_pred = model(image).argmax()
    
    best_adv = image.clone()
    best_l2 = float('inf')
    
    for iteration in range(max_iterations):
        # Convert w to adversarial image using tanh
        # This ensures pixel values stay in valid range
        adv_image = 0.5 * (torch.tanh(w) + 1)
        
        # Get model output
        output = model(adv_image)
        
        # Calculate loss components
        # L2 distance (the original C&W objective uses the squared L2 norm)
        l2_dist = torch.norm(adv_image - image)
        
        # Classification loss (encourage target class)
        target_logit = output[0, target_class]
        other_logits = torch.cat([output[0, :target_class], 
                                  output[0, target_class+1:]])
        max_other_logit = other_logits.max()
        
        # f(x') = max(max(Z(x')_i for i != t) - Z(x')_t, -kappa)
        f = torch.clamp(max_other_logit - target_logit, min=-kappa)
        
        # Total loss
        loss = l2_dist + c * f
        
        # Optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track best adversarial example
        if f <= 0 and l2_dist < best_l2:
            best_l2 = l2_dist.item()
            best_adv = adv_image.detach().clone()
        
        if iteration % 100 == 0:
            pred = output.argmax()
            print(f"Iter {iteration}: L2={l2_dist.item():.4f}, "
                  f"f={f.item():.4f}, pred={pred.item()}")
    
    perturbation = best_adv - image
    
    return best_adv, perturbation

# Example usage
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

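# Load model (the `weights` argument replaces the deprecated `pretrained=True`
# in torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)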
model.eval()

# Detect device
if torch.cuda.is_available():
    device = torch.device('cuda')
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

model = model.to(device)

# Preprocessing for loading your own image (not applied to the random sample below)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Create a random-noise sample image with values in [0, 1] (or load your own)
img = torch.rand(1, 3, 224, 224).to(device)

# Get original prediction
with torch.no_grad():
    original_pred = model(img).argmax().item()
    print(f"Original prediction: {original_pred}")

# Run C&W attack
target = (original_pred + 1) % 1000  # Target different class
adv_img, perturbation = carlini_wagner_attack(
    model, img, target, c=1.0, max_iterations=500
)

# Verify attack success
with torch.no_grad():
    adv_pred = model(adv_img).argmax().item()
    l2_norm = torch.norm(perturbation).item()
    
print(f"\nAttack Results:")
print(f"Original class: {original_pred}")
print(f"Target class: {target}")
print(f"Adversarial prediction: {adv_pred}")
print(f"L2 perturbation: {l2_norm:.4f}")
print(f"Success: {adv_pred == target}")


Original prediction: 111
Iter 0: L2=112.0182, f=5.3782, pred=623
Iter 100: L2=67.4974, f=0.0000, pred=112
Iter 200: L2=29.8776, f=0.0000, pred=112
Iter 300: L2=11.1530, f=0.0000, pred=112
Iter 400: L2=4.5944, f=0.0000, pred=112

Attack Results:
Original class: 111
Target class: 112
Adversarial prediction: 112
L2 perturbation: 2.6313
Success: True



**Key Concepts**:
1. **Optimization-based**: Minimizes L2 distance while achieving misclassification
2. **Tanh space**: Ensures pixel values stay in valid range [0, 1]
3. **Trade-off constant (c)**: Balances perturbation size against attack success
4. **Kappa (κ)**: Confidence margin for more robust adversarial examples
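
The tanh change of variables from concept 2 can be checked in isolation. The sketch below uses plain Python floats instead of tensors so that it runs without PyTorch; `to_tanh_space` and `from_tanh_space` are illustrative names, not part of the lab code. It also shows that mapping the original pixels into tanh space gives a starting point that decodes back to the original image, whereas `w = 0` decodes to 0.5.

```python
import math

EPS = 1e-6  # keeps atanh away from its poles at -1 and +1


def to_tanh_space(pixel):
    """Map a pixel in [0, 1] to an unconstrained real number w."""
    scaled = min(max(2.0 * pixel - 1.0, -1.0 + EPS), 1.0 - EPS)
    return math.atanh(scaled)


def from_tanh_space(w):
    """Map any real w back to a pixel value within [0, 1]."""
    return 0.5 * (math.tanh(w) + 1.0)


pixels = [0.0, 0.25, 0.5, 0.9, 1.0]
w_values = [to_tanh_space(p) for p in pixels]
recovered = [from_tanh_space(w) for w in w_values]

for p, w, r in zip(pixels, w_values, recovered):
    print(f"pixel={p:.2f}  w={w:+.4f}  recovered={r:.6f}")

# However large |w| becomes during optimization, the decoded pixel stays valid
print(from_tanh_space(50.0), from_tanh_space(-50.0))
```

Because the optimizer only ever updates `w`, no explicit clipping step is needed to keep the adversarial image inside the valid pixel range.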

**Advantages over FGSM**:
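- Typically smaller perturbations, because distortion is minimized directly instead of being fixed by a step size
- Typically higher success rate, since it iterates rather than taking a single gradient-sign step
- Tunable confidence through the margin κ (both attacks support targeted and untargeted variants)
- Optimizer-driven search (Adam here) over an explicit objective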
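
The classification term `f` is what separates the targeted from the untargeted variant. The following is a minimal sketch on a plain list of logits; the function names are illustrative, and the lab code above implements only the targeted form.

```python
def cw_margin_targeted(logits, target, kappa=0.0):
    """max(max_{i != t} Z_i - Z_t, -kappa): bottoms out once the target leads by kappa."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def cw_margin_untargeted(logits, true_class, kappa=0.0):
    """max(Z_y - max_{i != y} Z_i, -kappa): bottoms out once any other class leads by kappa."""
    best_other = max(z for i, z in enumerate(logits) if i != true_class)
    return max(logits[true_class] - best_other, -kappa)


logits = [2.0, 5.0, 1.0]  # toy logits; class 1 currently wins

# Targeting class 0: it trails the best other class by 3, so the loss is positive
print(cw_margin_targeted(logits, target=0))                   # 3.0
# Targeting class 1 with margin 2: it already leads by 3, so the loss is clamped
print(cw_margin_targeted(logits, target=1, kappa=2.0))        # -2.0
# Untargeted with true class 1: still correctly classified, loss is positive
print(cw_margin_untargeted(logits, true_class=1))             # 3.0
# Untargeted with true class 0 and margin 1: already misclassified by more than 1
print(cw_margin_untargeted(logits, true_class=0, kappa=1.0))  # -1.0
```

A larger `kappa` keeps the gradient active until the winning class leads by that margin, which is why it tends to produce more robust adversarial examples.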

**Disadvantages**:
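- Computationally expensive: every iteration needs a forward and a backward pass
- Requires many iterations (500 in the run above, versus a single step for FGSM)
- Needs hyperparameter tuning (`c`, `kappa`, learning rate)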
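
A search over `c` is a common way to reduce manual tuning: grow `c` until the attack succeeds, then bisect toward the smallest value that still works. The sketch below assumes a hypothetical callable `run_attack(c)` returning `(success, l2)`; in the lab it could wrap `carlini_wagner_attack`, but here a stand-in function is used so the example runs on its own.

```python
def search_tradeoff_constant(run_attack, c_init=1.0, steps=10):
    """
    Search for a small c at which the attack still succeeds.

    run_attack: hypothetical callable, c -> (success: bool, l2: float).
    Returns (best_c, best_l2), or (None, inf) if no trial succeeded.
    """
    lower, upper = 0.0, None
    c = c_init
    best_c, best_l2 = None, float("inf")

    for _ in range(steps):
        success, l2 = run_attack(c)
        if success:
            upper = c                      # success: try a smaller c next
            if l2 < best_l2:
                best_c, best_l2 = c, l2
        else:
            lower = c                      # failure: c has to grow
        # Bisect once an upper bound is known, otherwise grow geometrically
        c = (lower + upper) / 2 if upper is not None else c * 10

    return best_c, best_l2


# Stand-in for the real attack: succeeds once c >= 0.3, distortion grows with c
def fake_attack(c):
    return c >= 0.3, 1.0 + c


best_c, best_l2 = search_tradeoff_constant(fake_attack)
print(f"best c={best_c:.4f}, L2={best_l2:.4f}")
```

Each call to `run_attack` is a full optimization run, so the search multiplies the cost noted in the list above by the number of steps.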

---

## Lab 2: Blackbox Evasion - Exercise Answer

### Exercise: Targeted SimBA Attack

**Task**: Modify the SimBA attack to target a specific class.

**Answer**:


import torch
import numpy as np

def simba_targeted_attack(model, img_tensor, original_class, target_class, 
                         n_masks=1000, eta=0.01, max_queries=10000):
    """
    Targeted SimBA (Simple Black-box Attack)
    
    Args:
        model: Target model
        img_tensor: Original image
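        original_class: Original predicted class (not used by the targeted search)
        target_class: Desired target class
        n_masks: Maximum number of random directions tried per round
        eta: Step size for perturbation
        max_queries: Nominal query budget; the loop runs max_queries // n_masks
            rounds of at most 2 * n_masks queries each, so the actual count can
            be lower or higher than this value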
    
    Returns:
        adv_img: Adversarial image
        queries: Number of queries used
        success: Whether attack succeeded
    """
    device = img_tensor.device
    adv_img = img_tensor.clone()
    
    # Track queries
    queries = 0
    
    # Get initial prediction
    with torch.no_grad():
        output = model(adv_img)
        current_pred = output.argmax().item()
        current_target_score = output[0, target_class].item()
    
    print(f"Initial: pred={current_pred}, target_score={current_target_score:.4f}")
    
    for iteration in range(max_queries // n_masks):
        improved = False
        
        for _ in range(n_masks):
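            # Random unit-norm Gaussian direction (SimBA as usually described
            # samples, without replacement, from a fixed orthonormal basis)
            mask = torch.randn_like(adv_img)
            mask = mask / torch.norm(mask)  # Normalize to unit L2 norm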
            
            # Try positive perturbation
            perturbed_pos = adv_img + eta * mask
            perturbed_pos = torch.clamp(perturbed_pos, 0, 1)
            
            with torch.no_grad():
                output_pos = model(perturbed_pos)
                pred_pos = output_pos.argmax().item()
                target_score_pos = output_pos[0, target_class].item()
            
            queries += 1
            
            # Check if this improves target score
            if target_score_pos > current_target_score:
                adv_img = perturbed_pos
                current_target_score = target_score_pos
                current_pred = pred_pos
                improved = True
                
                if current_pred == target_class:
                    print(f"✓ Success at query {queries}!")
                    return adv_img, queries, True
                
                break  # Move to next iteration
            
            # Try negative perturbation
            perturbed_neg = adv_img - eta * mask
            perturbed_neg = torch.clamp(perturbed_neg, 0, 1)
            
            with torch.no_grad():
                output_neg = model(perturbed_neg)
                pred_neg = output_neg.argmax().item()
                target_score_neg = output_neg[0, target_class].item()
            
            queries += 1
            
            if target_score_neg > current_target_score:
                adv_img = perturbed_neg
                current_target_score = target_score_neg
                current_pred = pred_neg
                improved = True
                
                if current_pred == target_class:
                    print(f"✓ Success at query {queries}!")
                    return adv_img, queries, True
                
                break
        
        if iteration % 10 == 0:
            l2_dist = torch.norm(adv_img - img_tensor).item()
            print(f"Iter {iteration}: queries={queries}, pred={current_pred}, "
                  f"target_score={current_target_score:.4f}, L2={l2_dist:.4f}")
        
        if not improved:
            # Reduce step size if no improvement
            eta *= 0.9
    
    print(f"✗ Attack failed after {queries} queries")
    return adv_img, queries, False

# Example usage
import torchvision.models as models

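# Load model (the `weights` argument replaces the deprecated `pretrained=True`
# in torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)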
model.eval()

# Detect device
if torch.cuda.is_available():
    device = torch.device('cuda')
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

model = model.to(device)

# Create sample image
img = torch.rand(1, 3, 224, 224).to(device)

# Get original prediction
with torch.no_grad():
    original_pred = model(img).argmax().item()

print(f"Original prediction: {original_pred}")

# Choose target class
target = (original_pred + 100) % 1000

# Run targeted SimBA (with increased queries for better success rate)
adv_img, queries, success = simba_targeted_attack(
    model, img, original_pred, target, 
    n_masks=50, eta=0.02, max_queries=10000
)

# Results
l2_dist = torch.norm(adv_img - img).item()
print(f"\nFinal Results:")
print(f"Success: {success}")
print(f"Queries used: {queries}")
print(f"L2 distance: {l2_dist:.4f}")


Original prediction: 111
Initial: pred=111, target_score=-1.3784
Iter 0: queries=2, pred=111, target_score=-1.3784, L2=0.0200
Iter 10: queries=19, pred=111, target_score=-1.3779, L2=0.0662
Iter 20: queries=36, pred=111, target_score=-1.3774, L2=0.0915
Iter 30: queries=50, pred=111, target_score=-1.3769, L2=0.1111
Iter 40: queries=67, pred=111, target_score=-1.3763, L2=0.1279
Iter 50: queries=80, pred=111, target_score=-1.3757, L2=0.1427
Iter 60: queries=93, pred=111, target_score=-1.3752, L2=0.1562
Iter 70: queries=110, pred=111, target_score=-1.3747, L2=0.1687
Iter 80: queries=127, pred=111, target_score=-1.3741, L2=0.1800
Iter 90: queries=144, pred=111, target_score=-1.3734, L2=0.1908
Iter 100: queries=159, pred=111, target_score=-1.3729, L2=0.2009
Iter 110: queries=175, pred=111, target_score=-1.3725, L2=0.2107
Iter 120: queries=191, pred=111, target_score=-1.3721, L2=0.2200
Iter 130: queries=208, pred=111, target_score=-1.3713, L2=0.2290
Iter 140: queries=221, pred=111, target_scor


**Key Modifications**:
1. **Target Score Optimization**: Maximize logit for target class
2. **Directional Search**: Try both +/- perturbations
3. **Adaptive Step Size**: Reduce eta when stuck
4. **Early Stopping**: Stop when target class is predicted
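
The answer above draws a fresh Gaussian direction for every query. SimBA as usually described instead steps along a fixed orthonormal basis and never repeats a direction. A minimal sketch of that variant follows, using the coordinate basis and a toy linear-softmax scorer in plain Python; `toy_scores` and `WEIGHTS` are made up for illustration and stand in for the black-box model.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-class linear model over 4 features (illustrative weights)
WEIGHTS = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
    [-1.0, -1.0, 1.0, 1.0],
]


def toy_scores(x):
    logits = [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]
    return softmax(logits)


def simba_targeted_coordinates(score_fn, x, target, eps=0.3, max_queries=200, seed=0):
    """Targeted SimBA-style search; each coordinate direction is tried at most once."""
    rng = random.Random(seed)
    x_adv = list(x)
    probs = score_fn(x_adv)
    queries = 1
    order = list(range(len(x)))
    rng.shuffle(order)                      # random order, no repeats

    for idx in order:
        if probs.index(max(probs)) == target:
            return x_adv, queries, True
        for sign in (+1.0, -1.0):
            if queries >= max_queries:
                return x_adv, queries, False
            candidate = list(x_adv)
            candidate[idx] = min(max(candidate[idx] + sign * eps, 0.0), 1.0)
            cand_probs = score_fn(candidate)
            queries += 1
            if cand_probs[target] > probs[target]:
                x_adv, probs = candidate, cand_probs
                break                       # accept, move to the next coordinate

    return x_adv, queries, probs.index(max(probs)) == target


x0 = [0.9, 0.5, 0.2, 0.2]
adv, used, ok = simba_targeted_coordinates(toy_scores, x0, target=2)
print(f"success={ok}, queries={used}, adversarial input={adv}")
```

Using target-class probability rather than the raw logit as the acceptance score means a step is also accepted when it lowers competing classes, not only when it raises the target.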

**Comparison to Untargeted**:
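- Untargeted: Lower the score of the original class until any other class wins
- Targeted: Raise the score of one specific class until it wins
- Targeted is typically harder and needs more queries; the run above was still predicting class 111 after more than 200 queries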

---


## Lab 3: Text Attacks - Exercise Answers

### Exercise 1: Keyboard Typo Attack

**Task**: Implement keyboard typo attack with adjacent key substitutions.

**Answer**:


import random

# Keyboard layout (QWERTY)
KEYBOARD_LAYOUT = {
    'q': ['w', 'a'], 'w': ['q', 'e', 's'], 'e': ['w', 'r', 'd'],
    'r': ['e', 't', 'f'], 't': ['r', 'y', 'g'], 'y': ['t', 'u', 'h'],
    'u': ['y', 'i', 'j'], 'i': ['u', 'o', 'k'], 'o': ['i', 'p', 'l'],
    'p': ['o', 'l'],
    'a': ['q', 's', 'z'], 's': ['w', 'a', 'd', 'x'], 'd': ['e', 's', 'f', 'c'],
    'f': ['r', 'd', 'g', 'v'], 'g': ['t', 'f', 'h', 'b'], 'h': ['y', 'g', 'j', 'n'],
    'j': ['u', 'h', 'k', 'm'], 'k': ['i', 'j', 'l'], 'l': ['o', 'k', 'p'],
    'z': ['a', 'x'], 'x': ['z', 's', 'c'], 'c': ['x', 'd', 'v'],
    'v': ['c', 'f', 'b'], 'b': ['v', 'g', 'n'], 'n': ['b', 'h', 'm'],
    'm': ['n', 'j']
}

def keyboard_typo_attack(text, typo_rate=0.1, preserve_first_last=True):
    """
    Simulate keyboard typos by replacing characters with adjacent keys.
    
    Args:
        text: Input text
        typo_rate: Probability of typo per character
        preserve_first_last: Keep first and last char of each word
    
    Returns:
        Text with typos
    """
    words = text.split()
    result_words = []
    
    for word in words:
        if len(word) <= 2:
            result_words.append(word)
            continue
        
        # Note: lowercases the word, so capitalization is lost ("This" -> "this")
        chars = list(word.lower())
        
        # Determine which positions to modify
        if preserve_first_last:
            modifiable_positions = range(1, len(chars) - 1)
        else:
            modifiable_positions = range(len(chars))
        
        # Apply typos
        for i in modifiable_positions:
            if random.random() < typo_rate:
                char = chars[i]
                if char in KEYBOARD_LAYOUT:
                    # Replace with adjacent key
                    adjacent_keys = KEYBOARD_LAYOUT[char]
                    chars[i] = random.choice(adjacent_keys)
        
        result_words.append(''.join(chars))
    
    return ' '.join(result_words)

# Test the attack
from transformers import pipeline

# Load sentiment model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment = pipeline('sentiment-analysis', model=model_name)

# Test cases
test_texts = [
    "This movie is absolutely terrible and boring",
    "I hate this product, it's completely useless",
    "The service was awful and disappointing"
]

print("Keyboard Typo Attack Results:\n")
for text in test_texts:
    # Original prediction
    orig_result = sentiment(text)[0]
    
    # Generate typo version
    typo_text = keyboard_typo_attack(text, typo_rate=0.15)
    typo_result = sentiment(typo_text)[0]
    
    print(f"Original: {text}")
    print(f"  Prediction: {orig_result['label']} ({orig_result['score']:.3f})")
    print(f"Typo:     {typo_text}")
    print(f"  Prediction: {typo_result['label']} ({typo_result['score']:.3f})")
    print(f"  Success: {orig_result['label'] != typo_result['label']}\n")


Device set to use mps:0


Keyboard Typo Attack Results:

Original: This movie is absolutely terrible and boring
  Prediction: NEGATIVE (1.000)
Typo:     this movie is absolutely terrovle amd biring
  Prediction: NEGATIVE (0.990)
  Success: False

Original: I hate this product, it's completely useless
  Prediction: NEGATIVE (1.000)
Typo:     I hzte this prpduct, it's comppetwly useless
  Prediction: NEGATIVE (1.000)
  Success: False

Original: The service was awful and disappointing
  Prediction: NEGATIVE (1.000)
Typo:     the seevice was awful and diaapoounting
  Prediction: NEGATIVE (0.989)
  Success: False




**Key Features**:
1. **Realistic Typos**: Uses actual keyboard layout
2. **Configurable Rate**: Control typo frequency
3. **Preserve Readability**: Keep first/last characters
4. **Word Boundaries**: Maintain word structure
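
The answer lowercases every word it processes and draws from the global random state. The variant below is a sketch that keeps the original casing and accepts a seed for repeatable runs. To stay short it builds a simplified adjacency map with horizontal neighbours only, which is an assumption of this example and not the lab's `KEYBOARD_LAYOUT`.

```python
import random

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Horizontal neighbours only: a simplification of the lab's KEYBOARD_LAYOUT
ADJACENT = {}
for row in ROWS:
    for i, key in enumerate(row):
        ADJACENT[key] = [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]


def typo_word_keep_case(word, typo_rate, rng):
    """Swap interior letters for adjacent keys, preserving upper/lower case."""
    if len(word) <= 2:
        return word
    chars = list(word)
    for i in range(1, len(chars) - 1):       # keep first and last character
        neighbours = ADJACENT.get(chars[i].lower())
        if neighbours and rng.random() < typo_rate:
            replacement = rng.choice(neighbours)
            chars[i] = replacement.upper() if chars[i].isupper() else replacement
    return "".join(chars)


def typo_text_keep_case(text, typo_rate=0.15, seed=None):
    rng = random.Random(seed)                # a fixed seed makes runs repeatable
    return " ".join(typo_word_keep_case(w, typo_rate, rng) for w in text.split())


print(typo_text_keep_case("This Movie is ABSOLUTELY terrible", typo_rate=0.5, seed=0))
```

Case preservation matters little for an uncased model such as the one used here, but it avoids an unintended second perturbation when attacking cased models.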

**Effectiveness**:
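- Tends to work better against character-level models
- Less effective against subword tokenizers (BERT, GPT): in the run above, none of the three examples flipped DistilBERT's prediction, and confidence stayed at 0.989 or higher
- Perturbations remain human-readable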
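
One way to put a number on "human-readable" is the edit distance between each original word and its typo version. The sketch below implements Levenshtein distance in plain Python and applies it to word pairs taken from the recorded output above; the choice of metric is an assumption of this example, not part of the exercise.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


pairs = [("terrible", "terrovle"), ("boring", "biring"), ("service", "seevice")]
for original, typo in pairs:
    d = levenshtein(original, typo)
    print(f"{original} -> {typo}: {d} edit(s), {d / len(original):.0%} of characters")
```

Since the attack only substitutes characters, the distance here equals the number of changed positions, which gives a simple budget to cap per word.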

---

### Exercise 2: Optimize Attack for Semantic Similarity

**Task**: Maximize semantic similarity while achieving misclassification.

**Answer**:


from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import torch

# Load models
sentiment = pipeline('sentiment-analysis', 
                    model='distilbert-base-uncased-finetuned-sst-2-english')
sim_model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_similarity(text1, text2, sim_model):
    """Calculate semantic similarity between two texts"""
    emb1 = sim_model.encode(text1, convert_to_tensor=True)
    emb2 = sim_model.encode(text2, convert_to_tensor=True)
    return util.cos_sim(emb1, emb2).item()

# Demonstrate semantic similarity attack with concrete examples
print("=== Semantic Similarity Attack Demonstration ===\n")
print("Goal: Reduce model confidence while maintaining semantic similarity\n")

# Example 1: Synonym substitution - confidence reduction
original1 = "This movie is great and entertaining"
adversarial1 = "This film is wonderful and engaging"

print(f"Example 1: Synonym Substitution")
print(f"Original:     {original1}")
orig_pred1 = sentiment(original1)[0]
print(f"  Prediction: {orig_pred1['label']} (confidence: {orig_pred1['score']:.3f})")

print(f"\nAdversarial:  {adversarial1}")
adv_pred1 = sentiment(adversarial1)[0]
print(f"  Prediction: {adv_pred1['label']} (confidence: {adv_pred1['score']:.3f})")

similarity1 = calculate_similarity(original1, adversarial1, sim_model)
conf_change1 = abs(orig_pred1['score'] - adv_pred1['score'])
print(f"\nSemantic similarity: {similarity1:.3f}")
print(f"Confidence change: {conf_change1:.3f}")
print(f"Maintains high similarity: {similarity1 > 0.70}")

# Example 2: Adding negation - full misclassification
print("\n" + "="*50 + "\n")
original2 = "This movie is great"
adversarial2 = "This movie is not great"

print(f"Example 2: Negation Attack (Lower Similarity)")
print(f"Original:     {original2}")
orig_pred2 = sentiment(original2)[0]
print(f"  Prediction: {orig_pred2['label']} (confidence: {orig_pred2['score']:.3f})")

print(f"\nAdversarial:  {adversarial2}")
adv_pred2 = sentiment(adversarial2)[0]
print(f"  Prediction: {adv_pred2['label']} (confidence: {adv_pred2['score']:.3f})")

similarity2 = calculate_similarity(original2, adversarial2, sim_model)
print(f"\nSemantic similarity: {similarity2:.3f}")
print(f"Attack success (flipped): {orig_pred2['label'] != adv_pred2['label']}")
print(f"Trade-off: Lower similarity but successful misclassification")

# Example 3: Paraphrasing with high similarity
print("\n" + "="*50 + "\n")
original3 = "This is an amazing film"
adversarial3 = "This is an incredible movie"

print(f"Example 3: Paraphrasing (High Similarity)")
print(f"Original:     {original3}")
orig_pred3 = sentiment(original3)[0]
print(f"  Prediction: {orig_pred3['label']} (confidence: {orig_pred3['score']:.3f})")

print(f"\nAdversarial:  {adversarial3}")
adv_pred3 = sentiment(adversarial3)[0]
print(f"  Prediction: {adv_pred3['label']} (confidence: {adv_pred3['score']:.3f})")

similarity3 = calculate_similarity(original3, adversarial3, sim_model)
conf_change3 = abs(orig_pred3['score'] - adv_pred3['score'])
print(f"\nSemantic similarity: {similarity3:.3f}")
print(f"Confidence change: {conf_change3:.3f}")
print(f"Very high similarity maintained: {similarity3 > 0.90}")

# Summary
print("\n" + "="*50)
print("\n✓ Key Insights:")
print("1. Synonym substitution maintains 70-90% semantic similarity")
print("2. Trade-off between similarity and attack success")
print("3. Negation achieves misclassification but reduces similarity")
print("4. High similarity attacks may reduce confidence without full flip")
print("5. Different attack strategies for different goals")


Device set to use mps:0


=== Semantic Similarity Attack Demonstration ===

Goal: Reduce model confidence while maintaining semantic similarity

Example 1: Synonym Substitution
Original:     This movie is great and entertaining
  Prediction: POSITIVE (confidence: 1.000)

Adversarial:  This film is wonderful and engaging
  Prediction: POSITIVE (confidence: 1.000)

Semantic similarity: 0.710
Confidence change: 0.000
Maintains high similarity: True


Example 2: Negation Attack (Lower Similarity)
Original:     This movie is great
  Prediction: POSITIVE (confidence: 1.000)

Adversarial:  This movie is not great
  Prediction: NEGATIVE (confidence: 1.000)

Semantic similarity: 0.738
Attack success (flipped): True
Trade-off: Lower similarity but successful misclassification


Example 3: Paraphrasing (High Similarity)
Original:     This is an amazing film
  Prediction: POSITIVE (confidence: 1.000)

Adversarial:  This is an incredible movie
  Prediction: POSITIVE (confidence: 1.000)

Semantic similarity: 0.903
Confidence


**Key Techniques Demonstrated**:
1. **Synonym Substitution**: Replace words with near-synonyms to preserve meaning (similarity 0.710 in Example 1)
2. **Negation Attacks**: Insert a negation word to flip the sentiment; this changes the meaning, yet embedding similarity stayed at 0.738 in Example 2
3. **Paraphrasing**: Reword the sentence while keeping its structure (similarity 0.903 in Example 3)
4. **Semantic Similarity Measurement**: Quantify how close the adversarial text is to the original
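
The similarity score used in the answer is the cosine of the angle between two embedding vectors. A minimal sketch of that computation in plain Python is given below; the short vectors are toy values chosen for easy arithmetic, not real sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))   # 24 / 25
```

The score depends only on direction, not vector length, so it measures how the embedding model places the two sentences relative to each other rather than any notion of truth or polarity.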

**Key Insights**:
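- **Trade-off exists**: Edits that keep similarity high tend to leave the prediction unchanged
- **Confidence reduction**: Lowering confidence without a full flip can still be useful, but in this run synonym substitution changed confidence by 0.000
- **Negation works, with a caveat**: Adding "not" flipped the label at similarity 0.738, yet it also reverses the meaning, so the new label is arguably correct rather than a misclassification
- **Embedding similarity is coarse**: The negated sentence scored higher (0.738) than the synonym rewrite (0.710), so cosine similarity alone does not show that meaning was preserved
- **Paraphrasing preserves similarity**: Example 3 reached 0.903
- **Attack goals vary**: Choose a strategy based on whether you need a full flip or only a confidence reduction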

**Attack Strategies**:
- **High similarity goal**: Use synonyms/paraphrasing, and expect little or no change in the prediction
- **Misclassification goal**: Use negation or stronger edits, and verify separately that the meaning is still what you intend
- **Balanced approach**: Combine techniques, keeping only candidates that flip the label and stay above a similarity threshold
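
The task statement (maximize similarity subject to misclassification) can be written as a selection over candidate rewrites. The sketch below is one hedged way to do it: `classify` and `similarity` are caller-supplied callables, and the stubs used here (a keyword rule and word-level Jaccard overlap) are stand-ins so the example runs without any model. In the lab they would be replaced by the sentiment pipeline and `calculate_similarity`.

```python
def select_best_adversarial(original, candidates, classify, similarity, min_similarity=0.0):
    """
    Among candidates whose label differs from the original's, return the one
    most similar to the original as (text, score), or None if none qualifies.
    """
    original_label = classify(original)
    best = None
    for candidate in candidates:
        if classify(candidate) == original_label:
            continue                         # label unchanged, not adversarial
        score = similarity(original, candidate)
        if score >= min_similarity and (best is None or score > best[1]):
            best = (candidate, score)
    return best


# Stub classifier: a keyword rule standing in for the sentiment model
def stub_classify(text):
    return "NEGATIVE" if {"not", "bad"} & set(text.split()) else "POSITIVE"


# Stub similarity: word-level Jaccard overlap standing in for embedding similarity
def stub_similarity(a, b):
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


original = "this movie is great"
candidates = [
    "this film is great",        # label unchanged
    "this movie is not great",   # flips, overlap 4/5
    "this movie is bad",         # flips, overlap 3/5
]

print(select_best_adversarial(original, candidates, stub_classify, stub_similarity))
print(select_best_adversarial(original, candidates, stub_classify, stub_similarity,
                              min_similarity=0.9))
```

The `min_similarity` threshold makes the trade-off explicit: raising it shrinks the set of acceptable candidates until no successful attack remains.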

---

## Summary

Module 3 exercises demonstrate:
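- C&W finds a targeted adversarial example with a small L2 perturbation (2.63 in the run shown) through iterative optimization
- Targeted black-box attacks typically require more queries than untargeted ones
- Keyboard typos create realistic perturbations, though they did not flip DistilBERT in the run shown
- High embedding similarity can be preserved during text attacks, but it does not guarantee preserved meaning
- Trade-offs exist between similarity and attack success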

Continue to Module 4 for data extraction attacks!

