[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ELTE-DSED/Intro-Data-Security/blob/main/module_02_input_manipulation/Lab2_Evasion_Attacks.ipynb)

# **Lab 2: Evasion Attacks (Adversarial Examples)**

**Course:** Introduction to Data Security Pr.  
**Module 2:** Input Data Manipulation  
**Estimated Time:** 90-120 minutes

---

In this lab, we explore how adversaries can manipulate input data to deceive machine learning models. Adversarial examples are carefully crafted inputs that appear normal to humans but cause models to make incorrect predictions with high confidence. These attacks highlight vulnerabilities in even state-of-the-art neural networks and motivate the need for robust defenses.

<div align="center">
  <img src="images/openai_panda.png" alt="Adversarial Example: Panda to Gibbon">
</div>

## **Learning Objectives**

By the end of this lab, you will be able to:
- Generate adversarial examples using FGSM and PGD
- Measure model robustness under $\ell_\infty$ perturbations
- Visualize perturbations and their impact on predictions
- Compare clean vs. adversarial accuracy
- Discuss the robustness–accuracy tradeoff

## **Table of Contents**
1. [Setup & Imports](#setup)  
2. [Load Model & Dataset](#data)  
3. [Baseline Evaluation](#baseline)  
4. [FGSM Attack](#fgsm)  
5. [PGD Attack ($\ell_\infty$ and $\ell_2$)](#pgd)  
6. [Targeted Evasion Attacks](#targeted)  
7. [Adversarial Training (Defense)](#defense)  
8. [Library Implementation (secml-torch)](#library)  
9. [Exercises](#exercises)

## **1. Setup & Imports** <a name="setup"></a>

In [3]:
# If needed, install dependencies
%pip install secml-torch -q

# Importing all the necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from tqdm import tqdm
from utils import test_untargeted_attack

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Note: you may need to restart the kernel to use updated packages.
Using device: cpu


## **2. Load Model & Dataset** <a name="data"></a>

We will train a small fully connected classifier on MNIST and use it as the attack target throughout the lab.

In [4]:
# Load MNIST test set
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

test_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=False,
    transform=transform,
    download=True
)

test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

# Training set (used to fit the model below and for adversarial training later)
train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    transform=transform,
    download=True
)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

# Define the MLP model from Lab 1
class MNISTNet(torch.nn.Module):
    """Simple fully connected network for MNIST classification."""

    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(784, 200)
        self.fc2 = torch.nn.Linear(200, 200)
        self.fc3 = torch.nn.Linear(200, 10)

    def forward(self, x):
        x = x.flatten(1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

model = MNISTNet().to(device)

model.eval()

# Normalized clamp bounds for MNIST
mnist_mean = 0.1307
mnist_std = 0.3081
min_val = (0.0 - mnist_mean) / mnist_std
max_val = (1.0 - mnist_mean) / mnist_std
print(f"Normalized clamp range: [{min_val:.2f}, {max_val:.2f}]")

Normalized clamp range: [-0.42, 2.82]
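
Because the attacks in this lab operate on normalized tensors, an $\epsilon$ expressed in normalized units is not the same as an $\epsilon$ expressed in pixel units. The sketch below, which assumes the `(x - mean) / std` normalization used above, converts between the two. The helper names are illustrative and are not part of the lab's `utils` module.

```python
# Sketch: relate values and perturbation budgets in pixel space ([0, 1])
# to their counterparts in normalized space.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081

def to_normalized(pixel_value, mean=MNIST_MEAN, std=MNIST_STD):
    """Map a pixel intensity in [0, 1] to normalized space."""
    return (pixel_value - mean) / std

def eps_pixel_to_normalized(eps_pixel, std=MNIST_STD):
    """A perturbation is a difference, so only the scale (std) matters."""
    return eps_pixel / std

def eps_normalized_to_pixel(eps_norm, std=MNIST_STD):
    """Inverse of eps_pixel_to_normalized."""
    return eps_norm * std

lower, upper = to_normalized(0.0), to_normalized(1.0)
print(f"Normalized range: [{lower:.2f}, {upper:.2f}]")
print(f"eps = 8/255 in pixels -> {eps_pixel_to_normalized(8 / 255):.4f} normalized")
print(f"eps = 0.3 normalized  -> {eps_normalized_to_pixel(0.3):.4f} in pixels")
```

Keeping track of which space a budget lives in matters when comparing results: an attack that receives unnormalized images (as in the exercise template) and one that receives normalized tensors will interpret the same numeric `eps` differently.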


In [5]:
from secmlt.models.pytorch.base_pytorch_nn import BasePytorchClassifier
from secmlt.models.pytorch.base_pytorch_trainer import BasePyTorchTrainer
from secmlt.metrics.classification import Accuracy

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Initialize the trainer for 1 epoch
trainer = BasePyTorchTrainer(optimizer=optimizer, epochs=1)

# Wrap the model and train
secml_model = BasePytorchClassifier(model=model, trainer=trainer)
secml_model.train(dataloader=train_loader)

# Evaluate accuracy
secml_accuracy = Accuracy()(secml_model, test_loader)
print(f"Clean accuracy: {secml_accuracy * 100:.2f}%")

Clean accuracy: 95.98%


## **4. FGSM Attack** <a name="fgsm"></a>
### **4.1 Untargeted FGSM**

The Fast Gradient Sign Method (FGSM), proposed by [Goodfellow et al.](https://arxiv.org/pdf/1412.6572.pdf), crafts adversarial examples by taking a single step in the direction of the sign of the loss gradient with respect to the input.

The attack constructs adversarial examples as follows:

$$x_\text{adv} = x + \epsilon\cdot\text{sign}(\nabla_xJ(\theta, x, y))$$

where 

*   $x_\text{adv}$ : Adversarial image.
*   $x$ : Original input image.
*   $y$ : True label of $x$.
*   $\epsilon$ : Perturbation budget; it scales the step and bounds the $\ell_\infty$ norm of the perturbation.
*   $\theta$ : Model parameters.
*   $J$ : Loss function (cross-entropy in this lab).

The current attack formulation is considered 'untargeted' because it only seeks to maximize loss rather than to trick the model into predicting a specific label. 
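
To check the formula by hand, the following framework-free sketch applies it to a toy logistic-regression model, where the input gradient has the closed form $(p - y)\,w$. The weights, input, and helper names are hypothetical and chosen only for illustration; the sketch skips clamping to a valid input range and is not a solution to the exercise below.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce_loss(p, y):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm_logistic(w, b, x, y, eps):
    """x_adv = x + eps * sign(dJ/dx), with dJ/dx = (p - y) * w."""
    p = predict_proba(w, b, x)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

w, b = [2.0, -1.0, 0.5], 0.0   # hypothetical toy weights
x, y = [0.2, 0.1, 0.4], 1      # toy input with true label 1
x_adv = fgsm_logistic(w, b, x, y, eps=0.2)

p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
print(f"P(y=1) clean: {p_clean:.3f}, loss: {bce_loss(p_clean, y):.3f}")
print(f"P(y=1) adv:   {p_adv:.3f}, loss: {bce_loss(p_adv, y):.3f}")
```

Every coordinate moves by exactly $\epsilon$ in the direction that increases the loss, which is why the perturbation has $\ell_\infty$ norm $\epsilon$.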

Try implementing the untargeted FGSM method for a batch of images yourself!

In [None]:
def untargeted_fgsm(x_batch, true_labels, network, normalize, eps=8/255., **kwargs):
  """Generates a batch of untargeted FGSM adversarial examples

  Args:
    x_batch (torch.Tensor): the batch of unnormalized input examples.
    true_labels (torch.Tensor): the batch of true labels of the example.
    network (nn.Module): the network to attack.
    normalize (function): a function which normalizes a batch of images 
        according to standard MNIST normalization.
    eps (float): the bound on the perturbations.
  """
  loss_fn = nn.CrossEntropyLoss(reduction="mean")
  x_batch = x_batch.clone().detach().requires_grad_(True)

  #########################
  # Enter Your Code Here! #
  #########################
  # (Hint: normalize x_batch, get logits, compute loss, 
  # backprop, compute perturbation, add it to original x_batch, 
  # and clamp to normalized bounds)

  return x_adv

In [None]:
# Test the method
test_untargeted_attack(untargeted_fgsm, model, device, eps=8/255.)

In [None]:
def fgsm_attack(model, x, y, epsilon=0.3):
    """Reference untargeted FGSM on already-normalized inputs (epsilon is in normalized units)."""
    x = x.clone().detach().to(device)
    y = y.to(device)
    x.requires_grad = True

    logits = model(x)
    loss = nn.CrossEntropyLoss()(logits, y)
    model.zero_grad()
    loss.backward()

    perturbation = epsilon * x.grad.sign()
    x_adv = x + perturbation
    x_adv = torch.clamp(x_adv, min_val, max_val)
    return x_adv.detach()

Instead of manually implementing the attack, we can use the `secmlt` library, which provides a unified API for various adversarial attacks. Here we demonstrate the FGSM attack using `secmlt`.


In [None]:
from secmlt.adv.evasion.pgd import PGD
from secmlt.metrics.classification import Accuracy
from secmlt.adv.backends import Backends
from secmlt.adv.evasion.perturbation_models import LpPerturbationModels

# MNIST normalization constants
mnist_mean = 0.1307
mnist_std = 0.3081
# Convert image bounds [0, 1] to normalized space
min_val = (0.0 - mnist_mean) / mnist_std
max_val = (1.0 - mnist_mean) / mnist_std

# Test FGSM attack at multiple perturbation magnitudes
epsilons = [0.0, 0.05, 0.1, 0.2]

# Define the FGSM attack using secmlt
results_secmlt = {}

for eps in epsilons:
    if eps == 0:
        acc = Accuracy()(secml_model, test_loader)
    else:
        # Instantiate the attack as a 1-step PGD (FGSM) with native backend
        # num_steps=2: first step computes and applies perturbation, second step evaluates
        attack = PGD(
            perturbation_model=LpPerturbationModels.LINF, 
            epsilon=eps, 
            num_steps=2,  # FGSM is single-step; num_steps=2 evaluates step 1's perturbation
            step_size=eps, 
            random_start=False,
            lb=min_val, 
            ub=max_val, 
            backend=Backends.NATIVE
        )
        
        # Generate adversarial examples and evaluate accuracy
        adv_loader = attack(secml_model, test_loader)
        acc = Accuracy()(secml_model, adv_loader)
    
    results_secmlt[eps] = acc
    print(f"SecML-Torch FGSM epsilon={eps:.2f} -> accuracy: {acc * 100:.2f}%")

With `num_steps=2`, SecML-Torch's PGD performs exactly one update of size `step_size`; the second iteration only evaluates the perturbation produced by the first. Combined with `step_size=eps` and `random_start=False`, this effectively implements FGSM.

## **5. PGD Attack ($\ell_\infty$ and $\ell_2$)** <a name="pgd"></a>

Projected Gradient Descent (PGD) applies several small FGSM-style steps and, after each one, projects the result back into the $\epsilon$-ball around the original input.
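
In the notation of the FGSM section, one PGD iteration under an $\ell_\infty$ budget can be written as follows. The step size $\alpha$, the iteration index $t$, and the projection operator $\Pi$ are notation introduced here for illustration:

$$x^{(t+1)} = \Pi_{\mathcal{B}_\infty(x,\epsilon)}\left(x^{(t)} + \alpha\cdot\text{sign}\left(\nabla_x J(\theta, x^{(t)}, y)\right)\right), \qquad \Pi_{\mathcal{B}_\infty(x,\epsilon)}(z) = x + \text{clip}(z - x, -\epsilon, \epsilon)$$

With $x^{(0)} = x$, a single iteration, and $\alpha = \epsilon$, this reduces to the FGSM update. In practice the result is additionally clamped to the valid input range after each step.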

In [None]:
def pgd_attack(model, x, y, epsilon=0.3, alpha=0.01, steps=40):
    x = x.clone().detach().to(device)
    y = y.to(device)
    x_adv = x + 0.001 * torch.randn_like(x)

    for _ in range(steps):
        x_adv.requires_grad = True
        logits = model(x_adv)
        loss = nn.CrossEntropyLoss()(logits, y)
        model.zero_grad()
        loss.backward()
        grad = x_adv.grad.sign()

        x_adv = x_adv + alpha * grad
        perturbation = torch.clamp(x_adv - x, min=-epsilon, max=epsilon)
        x_adv = torch.clamp(x + perturbation, min_val, max_val).detach()
    return x_adv

### **Robustness Evaluation**

In [None]:
def adversarial_accuracy(model, loader, attack_fn, **kwargs):
    correct, total = 0, 0
    model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack_fn(model, x, y, **kwargs)
        logits = model(x_adv)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.size(0)
    return correct / total

fgsm_acc = adversarial_accuracy(model, test_loader, fgsm_attack, epsilon=0.3)
pgd_acc = adversarial_accuracy(model, test_loader, pgd_attack, epsilon=0.3, alpha=0.01, steps=20)

print(f"FGSM accuracy (eps=0.3): {fgsm_acc * 100:.2f}%")
print(f"PGD accuracy (eps=0.3):  {pgd_acc * 100:.2f}%")

### **Visual Analysis**

In [None]:
# Visualize adversarial examples
x, y = next(iter(test_loader))

x_adv_fgsm = fgsm_attack(model, x[:5], y[:5], epsilon=0.3)

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i in range(5):
    axes[0, i].imshow((x[i].squeeze() * 0.3081 + 0.1307).cpu(), cmap='gray')
    axes[0, i].set_title(f"Clean: {y[i].item()}")
    axes[0, i].axis('off')

    axes[1, i].imshow((x_adv_fgsm[i].squeeze() * 0.3081 + 0.1307).cpu(), cmap='gray')
    axes[1, i].set_title("FGSM")
    axes[1, i].axis('off')

plt.tight_layout()
plt.show()

### **$\ell_2$-Norm PGD Attack**

Standard PGD constrains the perturbation to an $\ell_\infty$-ball. Another common constraint is the $\ell_2$-norm, where the Euclidean distance between the original input and the adversarial example is bounded by $\epsilon$.
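
The two constraints differ mainly in the projection step. As a small framework-free sketch (the function names are hypothetical), projecting onto an $\ell_2$ ball rescales the whole perturbation, whereas $\ell_\infty$ projection clips each coordinate independently:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def project_l2(x_adv, x, eps):
    """Project x_adv onto the L2 ball of radius eps centred at x."""
    delta = [a - b for a, b in zip(x_adv, x)]
    norm = l2_norm(delta)
    # Points already inside the ball are left unchanged (scale = 1).
    scale = min(1.0, eps / norm) if norm > 0 else 1.0
    return [b + scale * d for b, d in zip(x, delta)]

def project_linf(x_adv, x, eps):
    """Project x_adv onto the L-inf ball of radius eps centred at x."""
    return [b + max(-eps, min(eps, a - b)) for a, b in zip(x_adv, x)]

x = [0.0, 0.0]
outside = project_l2([3.0, 4.0], x, eps=1.0)    # norm 5, rescaled to norm 1
inside = project_l2([0.3, 0.4], x, eps=1.0)     # norm 0.5, unchanged
clipped = project_linf([3.0, 4.0], x, eps=1.0)  # each coordinate clipped to 1

print("L2 projection:", outside, "norm:", l2_norm(outside))
print("L-inf projection:", clipped, "L2 norm:", l2_norm(clipped))
```

The $\ell_2$ projection preserves the direction of the perturbation, while the $\ell_\infty$ projection can change it. This is also why $\epsilon$ values for the two norms are not directly comparable.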

Let's implement the helpers and an $\ell_2$ version of PGD manually:

In [None]:
def normalize_l2(x_batch):
    """Normalizes a batch of images by their L2 norm."""
    # L2 norm over dimensions 1, 2, 3 (C, H, W)
    norm = torch.norm(x_batch.view(x_batch.size(0), -1), p=2, dim=1, keepdim=True)
    norm = norm.unsqueeze(2).unsqueeze(3) # Reshape to match input for broadcasting
    # Avoid division by zero
    norm = torch.clamp(norm, min=1e-8)
    return x_batch / norm

def tensor_clamp_l2(x_batch, center, radius):
    """Batched clamp of x into l2 ball around center of given radius."""
    diff = x_batch - center
    diff_norm = torch.norm(diff.view(diff.size(0), -1), p=2, dim=1, keepdim=True)
    diff_norm = diff_norm.unsqueeze(2).unsqueeze(3)
    
    # Scale back if norm exceeds radius
    factor = torch.min(torch.ones_like(diff_norm), radius / (diff_norm + 1e-8))
    return center + diff * factor

def pgd_l2_attack(model, x, y, epsilon=3.0, step_size=0.5, steps=20):
    x = x.clone().detach().to(device)
    y = y.to(device)
    
    # Initialize adversarial image with small noise
    x_adv = x.clone().detach() + 0.001 * torch.randn_like(x)
    
    for _ in range(steps):
        x_adv.requires_grad = True
        logits = model(x_adv)
        loss = nn.CrossEntropyLoss()(logits, y)
        model.zero_grad()
        loss.backward()
        
        # Take a step in the gradient direction, normalized by L2
        grad = normalize_l2(x_adv.grad)
        x_adv = x_adv + step_size * grad
        
        # Project the adversarial image back into the L2 ball of radius epsilon around x
        x_adv = tensor_clamp_l2(x_adv, x, epsilon)
        x_adv = torch.clamp(x_adv, min_val, max_val).detach()
        
    return x_adv

pgd_l2_acc = adversarial_accuracy(model, test_loader, pgd_l2_attack, epsilon=3.0, step_size=0.5, steps=20)
print(f"PGD L2 accuracy (eps=3.0): {pgd_l2_acc * 100:.2f}%")

### **Accuracy vs Number of PGD Steps**
Often, limiting the attacker to just a few steps prevents them from finding the optimal adversarial perturbation. As we increase the number of iteration steps, the attack becomes stronger, finding examples that reliably fool the model. Let's see how accuracy degrades as steps increase.
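
One caveat when reading such a curve: with a fixed step size, the number of steps also limits how far the attack can move. The sketch below assumes sign-gradient steps of size `alpha`, ignores the small random initialization and the clamp to the valid input range, and uses a hypothetical helper name.

```python
def max_linf_reach(epsilon, alpha, steps):
    """Largest L-inf displacement reachable after `steps` sign-gradient steps."""
    return min(epsilon, alpha * steps)

for steps in [1, 5, 10, 20, 50]:
    reach = max_linf_reach(epsilon=0.3, alpha=0.01, steps=steps)
    print(f"steps={steps:>2} -> reach={reach:.2f} ({reach / 0.3:.0%} of the budget)")
```

Under these assumptions, the full budget $\epsilon = 0.3$ only becomes reachable from 30 steps onward, so part of the accuracy drop between 1 and 20 steps reflects a larger effective perturbation rather than better optimization alone. Scaling `alpha` with the step count (for example, a small multiple of $\epsilon$ divided by the number of steps) is one way to separate the two effects.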

In [None]:
step_counts = [1, 5, 10, 20, 50]
accuracies = []

print("Running evaluation over multiple steps...")
for steps in step_counts:
    acc = adversarial_accuracy(model, test_loader, pgd_attack, epsilon=0.3, alpha=0.01, steps=steps)
    accuracies.append(acc)
    print(f"Steps={steps} -> Accuracy: {acc * 100:.2f}%")

plt.figure(figsize=(8, 5))
plt.plot(step_counts, [a * 100 for a in accuracies], marker='o')
plt.title("Model Accuracy vs. PGD Steps")
plt.xlabel("Number of PGD Steps")
plt.ylabel("Accuracy (%)")
plt.grid(True)
plt.show()

## **6. Targeted Evasion Attacks** <a name="targeted"></a>

In the previous sections, we performed **untargeted** attacks, where the goal was simply to make the model misclassify the input. In a **targeted** attack, the adversary wants the model to output a *specific* incorrect class.

The optimization objective changes from maximizing the loss of the true class to minimizing the loss of a target class $y_{target}$:

$$\min_{\delta} \mathcal{L}(f(x + \delta), y_{target}) \quad \text{s.t.} \quad \|\delta\|_p \le \epsilon$$
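
To see the sign flip in isolation, here is a framework-free sketch of one targeted step on a toy 3-class linear model, where the input gradient of the cross-entropy is $W^\top (p - \text{onehot}(y_{target}))$. The weight matrix, input, and function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def sign(v):
    return (v > 0) - (v < 0)

def targeted_step(W, x, target, eps):
    """One signed descent step on the cross-entropy toward `target`."""
    p = softmax(logits(W, x))
    # For a linear model, dL/dx = W^T (p - onehot(target)).
    residual = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    grad = [sum(W[k][j] * residual[k] for k in range(len(W))) for j in range(len(x))]
    # Minus sign: we descend the loss of the target class.
    return [xj - eps * sign(g) for xj, g in zip(x, grad)]

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical 3-class linear model
x = [1.0, 0.2]                               # initially classified as class 0
x_adv = targeted_step(W, x, target=1, eps=0.5)

p_clean = softmax(logits(W, x))
p_adv = softmax(logits(W, x_adv))
print("clean prediction:", p_clean.index(max(p_clean)))
print("adversarial prediction:", p_adv.index(max(p_adv)))
```

The only change relative to the untargeted update is the label used in the loss and the sign of the step.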


In [None]:
def targeted_fgsm_attack(model, images, target_labels, epsilon):
    images = images.clone().detach().to(device)
    target_labels = target_labels.to(device)
    images.requires_grad = True
    
    outputs = model(images)
    # We MINIMIZE the loss relative to the target label
    loss = nn.CrossEntropyLoss()(outputs, target_labels)
    
    model.zero_grad()
    loss.backward()
    
    # Gradient Descent (not Ascent) to move towards the target
    data_grad = images.grad.data
    perturbed_image = images - epsilon * data_grad.sign()
    
    # Maintain normalization bounds for MNIST
    perturbed_image = torch.clamp(perturbed_image, min_val, max_val)
    
    return perturbed_image

# Example: pick the first '3' in the batch and try to make it look like an '8'
sample_idx = (y == 3).nonzero()[0].item()
x_sample, y_sample = x[sample_idx:sample_idx+1], y[sample_idx:sample_idx+1]
y_target = torch.tensor([8]).to(device)

x_adv_targeted = targeted_fgsm_attack(model, x_sample, y_target, epsilon=0.3)

# Check prediction
output_adv = model(x_adv_targeted)
pred_adv = torch.argsort(output_adv, descending=True)[0]

print(f"True Label: {y_sample.item()}")
print(f"Target Label: {y_target.item()}")
print(f"Top 3 Predictions: {pred_adv[:3].cpu().numpy()}")

plt.imshow((x_adv_targeted[0].squeeze() * 0.3081 + 0.1307).cpu().detach(), cmap='gray')
plt.title(f"Targeted FGSM (Target: {y_target.item()})")
plt.show()


### **Targeted PGD Attack**
Targeted FGSM often fails due to a lack of iterative refinement. Let's implement Targeted PGD, which applies gradient descent (subtracting the sign of the gradient relative to the target label) iteratively.

In [None]:
def targeted_pgd_attack(model, x, target_labels, epsilon=0.3, alpha=0.01, steps=40):
    x = x.clone().detach().to(device)
    target_labels = target_labels.to(device)
    
    x_adv = x + 0.001 * torch.randn_like(x)
    
    for _ in range(steps):
        x_adv.requires_grad = True
        logits = model(x_adv)
        
        # We MINIMIZE the loss relative to the target label
        loss = nn.CrossEntropyLoss()(logits, target_labels)
        model.zero_grad()
        loss.backward()
        
        grad = x_adv.grad.sign()
        
        # Gradient Descent (not Ascent) to move towards the target
        x_adv = x_adv - alpha * grad
        perturbation = torch.clamp(x_adv - x, min=-epsilon, max=epsilon)
        x_adv = torch.clamp(x + perturbation, min_val, max_val).detach()
        
    return x_adv

x_adv_targeted_pgd = targeted_pgd_attack(model, x_sample, y_target, epsilon=0.3, steps=100)

# Check prediction
output_adv_pgd = model(x_adv_targeted_pgd)
pred_adv_pgd = torch.argsort(output_adv_pgd, descending=True)[0]

print("--- Targeted PGD ---")
print(f"True Label: {y_sample.item()}")
print(f"Target Label: {y_target.item()}")
print(f"Top 3 Predictions: {pred_adv_pgd[:3].cpu().numpy()}")

plt.figure(figsize=(4, 4))
plt.imshow((x_adv_targeted_pgd[0].squeeze() * 0.3081 + 0.1307).cpu().detach(), cmap='gray')
plt.title(f"Targeted PGD (Target: {y_target.item()})")
plt.show()

## **7. Adversarial Training (Defense)** <a name="defense"></a>

Adversarial training is one of the most effective and classical defenses against adversarial attacks. The concept is straightforward: generate adversarial examples during the training process and use them to augment the training dataset. Over time, the model learns to map these perturbed inputs to their correct labels, smoothing its decision boundaries.
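
This idea is commonly written as a min-max (saddle-point) problem, shown here in the notation of the earlier sections; $\mathcal{D}$ denotes the training distribution and is notation introduced here:

$$\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_p \le \epsilon} J(\theta, x+\delta, y)\right]$$

The inner maximization is approximated by an attack such as PGD, and the outer minimization is the usual gradient-based training step applied to the resulting adversarial examples.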

Let's train a robust version of our model by including adversarial examples in the training loop. We will use the `secmlt` library's PGD attack during training.

In [None]:
from tqdm import tqdm
from torch.utils.data import DataLoader, TensorDataset

# Define a robust model
robust_net = MNISTNet().to(device)
optimizer_robust = torch.optim.Adam(robust_net.parameters(), lr=0.001)

# PGD attack for generating adversarial examples during training
pgd_train = PGD(
    perturbation_model=LpPerturbationModels.LINF,
    epsilon=0.3,
    num_steps=7,
    step_size=0.1,
    random_start=True,
    lb=min_val,
    ub=max_val,
    backend=Backends.NATIVE
)

print("Starting Adversarial Training (1 epoch)...")
robust_net.train()

# Manual Adversarial Training Loop
for x_batch, y_batch in tqdm(train_loader):
    x_batch, y_batch = x_batch.to(device), y_batch.to(device)
    
    # 1. Generate adversarial examples for the current batch
    # We create a temporary DataLoader for the current batch to satisfy secmlt's requirements
    tmp_model = BasePytorchClassifier(model=robust_net)
    batch_loader = DataLoader(TensorDataset(x_batch, y_batch), batch_size=x_batch.size(0))
    adv_loader = pgd_train(tmp_model, batch_loader)
    x_adv, _ = next(iter(adv_loader))
    
    # 2. Standard training step on the adversarial examples
    optimizer_robust.zero_grad()
    outputs = robust_net(x_adv.to(device))
    loss = nn.CrossEntropyLoss()(outputs, y_batch)
    loss.backward()
    optimizer_robust.step()

print("Robust training complete.")

# Evaluate this model's clean accuracy vs adversarial accuracy
robust_net.eval()
secml_robust_model = BasePytorchClassifier(model=robust_net)

clean_acc_robust = Accuracy()(secml_robust_model, test_loader)

# Use a strong 20-step PGD to test robustness
pgd_eval = PGD(
    perturbation_model=LpPerturbationModels.LINF,
    epsilon=0.3,
    num_steps=20,
    step_size=0.05,
    lb=min_val,
    ub=max_val,
    backend=Backends.NATIVE
)
adv_loader_robust = pgd_eval(secml_robust_model, test_loader)
robust_acc_robust = Accuracy()(secml_robust_model, adv_loader_robust)

print(f"\nRobust Model - Clean Accuracy: {clean_acc_robust * 100:.2f}%")
print(f"Robust Model - PGD Accuracy (eps=0.3): {robust_acc_robust * 100:.2f}%")

## **8. Library Implementation (secml-torch)** <a name="library"></a>

Now that we have implemented FGSM and PGD manually, we can reproduce the same attacks with the `secml-torch` library. It provides standardized attack implementations and evaluation metrics, which makes results easier to compare.

### **8.1 Untargeted PGD with secml-torch**
We'll use the `PGD` class from `secml-torch`. Note that we must specify the `lb` (lower bound) and `ub` (upper bound) to match our normalized data range.

In [None]:
from secmlt.models.pytorch.base_pytorch_nn import BasePytorchClassifier
from secmlt.adv.evasion.pgd import PGD
from secmlt.adv.backends import Backends
from secmlt.adv.evasion.perturbation_models import LpPerturbationModels
from secmlt.metrics.classification import Accuracy, AttackSuccessRate

# Wrap our model
secml_model = BasePytorchClassifier(model)

# Define bounds based on MNIST normalization
# Reuse min_val and max_val computed earlier from mean/std

# Instantiate PGD
# To simulate FGSM, we can use num_steps=2
# (named pgd_linf so it does not shadow the manual pgd_attack function defined earlier)
pgd_linf = PGD(
    perturbation_model=LpPerturbationModels.LINF,
    epsilon=0.3,
    num_steps=10,
    step_size=0.05,
    random_start=False,
    backend=Backends.NATIVE,
    lb=min_val,
    ub=max_val
)

# Run the attack on the full test set
adv_loader = pgd_linf(secml_model, test_loader)

# Evaluate Robust Accuracy
acc_metric = Accuracy()
robust_acc = acc_metric(secml_model, adv_loader)
print(f"Robust Accuracy (PGD): {robust_acc.item() * 100:.2f}%")



### **8.2 Targeted Attack and Attack Success Rate (ASR)**
The **Attack Success Rate (ASR)** measures the percentage of samples that the attacker successfully pushed to the target class.
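
As a minimal sketch of the metric, the function below computes ASR from a list of predicted labels on adversarial inputs. The predictions are made-up illustrative values, and the function counts every sample, including any whose true label already equals the target; the library's `AttackSuccessRate` may follow a different convention.

```python
def attack_success_rate(adv_predictions, target_class):
    """Fraction of adversarial predictions equal to the target class."""
    if not adv_predictions:
        raise ValueError("need at least one prediction")
    hits = sum(1 for pred in adv_predictions if pred == target_class)
    return hits / len(adv_predictions)

# Hypothetical predictions on ten adversarial inputs (not real results).
adv_preds = [8, 8, 3, 8, 1, 8, 8, 0, 8, 8]
asr = attack_success_rate(adv_preds, target_class=8)
print(f"ASR: {asr * 100:.2f}%")
```

For a targeted attack, a high ASR is the attacker's goal, whereas for an untargeted attack success is usually measured as the drop in accuracy.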


In [None]:

y_target_val = 8
pgd_targeted = PGD(
    perturbation_model=LpPerturbationModels.LINF,
    epsilon=0.3,
    num_steps=15,
    step_size=0.05,
    y_target=y_target_val,
    backend=Backends.NATIVE,
    lb=min_val,
    ub=max_val
)

adv_loader_targeted = pgd_targeted(secml_model, test_loader)

# Calculate ASR
asr_metric = AttackSuccessRate(y_target=y_target_val)
asr = asr_metric(secml_model, adv_loader_targeted)
print(f"Attack Success Rate (Target: {y_target_val}): {asr.item() * 100:.2f}%")


## **9. Exercises** <a name="exercises"></a>

1. Run FGSM with $\epsilon \in \{0.05, 0.1, 0.2, 0.3\}$ and plot accuracy vs. $\epsilon$.  
2. Compare the perturbations visually between $\ell_\infty$ and $\ell_2$ attacks. You can display the difference array, e.g., `(x_adv - x).abs()`.  
3. Train an adversarially robust model using **FGSM** instead of PGD (as used in Section 7) for 1 epoch and evaluate its robustness. How does training time change?
4. Discuss why PGD is generally stronger than FGSM.