# CIS6800: Project 1b: Deep Learning Basics Part B

### Instructions:
* This is an individual assignment. Collaborating with others is not permitted.
* There is no single answer to most problems in deep learning, therefore the questions will often be underspecified. You need to fill in the blanks and submit a solution that solves the (practical) problem. Document the choices (hyperparameters, features, neural network architectures, etc.) you made where specified.
* All the code should be written in Python. You should only use PyTorch to complete this project.
* You are encouraged to use ChatGPT, but you need to make a summary of how you used it, and the code that you have copied from it.



In [None]:
import torch
from torch import nn
import torchvision
import matplotlib.pyplot as plt

%matplotlib inline
rng_seed = 45510

# Download MNIST
torchvision.datasets.MNIST('.', download=True)

Failed to download MNIST dataset. This may be due to SSL certificate issues or the dataset URL being unavailable.
Troubleshooting steps:
1. Ensure your system's SSL certificates are up to date.
   - On macOS, run: /Applications/Python\ 3.x/Install\ Certificates.command
   - On Ubuntu/Debian, run: sudo apt-get install --reinstall ca-certificates
2. If you are behind a proxy or firewall, check your network settings.
3. If the problem persists, manually download the MNIST files from:
   https://ossci-datasets.s3.amazonaws.com/mnist/
   and place them in the './MNIST/raw/' directory.
Original error message:
Error downloading train-images-idx3-ubyte.gz:
Tried https://ossci-datasets.s3.amazonaws.com/mnist/, got:
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>
Tried http://yann.lecun.com/exdb/mnist/, got:
HTTP Error 404: Not Found





## 4. Adversarial Images (30%)
In this part you will see how the gradients of the network can be used to generate adversarial
images. Using images that look almost identical to the original, you will be able to fool
different neural networks. You will also see that the same images affect other neural
networks and expose a security issue of CNNs that malicious users can take advantage of.
An example is shown in Figure 4. You are encouraged to read the relevant papers [1, 2]
before solving this part.

<div><img src="https://github.com/LukasZhornyak/CIS680_files/raw/main/HW1/images/fig4.png"/></div>

<center>Figure 4: An adversarial example demonstrated in [1].</center>

1. (10%) Use the trained network from question 3 to generate adversarial images with
constraints. The constraints that you have are:

  1. You are not allowed to erase parts of the image, i.e. $I_\text{pert} \ge I$ at each pixel location.
  2. The perturbed image has to take valid values, i.e. $-1 \le I_\text{pert} \le 1$.

  The algorithm works as follows:
  
  1. Let $I$ be a test image from your dataset that the network classifies correctly and that you want to perturb. Let $I_\epsilon$ be the perturbation, initialized with zeros.
  2. Feed $I_\text{pert} = I + I_\epsilon$ in the network.
  3. Calculate the loss given the ground truth label ($y_\mathrm{gt}$). Let the loss be $L(x,y |\theta)$ where $\theta$ are the learned weights.
  4. Compute the gradients with respect to $I_\text{pert}$, i.e., $\nabla_{I_\text{pert}} L(I_\text{pert}, y_\mathrm{gt} | \theta)$. Using backpropagation, compute $\nabla_{I_\epsilon} L(I_\epsilon,y_\mathrm{gt} | \theta)$, i.e. the gradients with respect to the perturbation.
  5. Use the Fast Gradient Sign method to update the perturbation, i.e., $I_\epsilon = I_\epsilon + \epsilon\,\text{sign}(\nabla_{I_\epsilon} L(I_\epsilon, y_\mathrm{gt}))$, where $\epsilon$ is a small constant of your choice.
  6. Repeat steps 2-5 until the network classifies the input image $I_\text{pert}$ as an arbitrary
wrong category with confidence (probability) of at least $90\%$.

  Generate 2 examples of adversarial images. Describe the difference between the adversarial images and the original images.
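
The two constraints above amount to a projection applied after every update of the perturbation. The snippet below is a minimal sketch of that projection on a toy tensor. The helper name `project_constraints` and the hand-written `step` tensor (standing in for $\epsilon\,\text{sign}(\nabla)$) are illustrative assumptions, not part of the assignment code.

```python
import torch

def project_constraints(image_pert: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Project a perturbed image back onto the feasible set of part 1 (hypothetical helper)."""
    # Constraint 1: never go below the original pixel value (no erasing).
    out = torch.maximum(image_pert, image)
    # Constraint 2: stay inside the valid normalized range [-1, 1].
    return out.clamp(-1.0, 1.0)

# Toy "image" with three pixels in the normalized range.
image = torch.tensor([[-1.0, 0.0, 0.95]])
# Hypothetical update eps * sign(grad) with eps = 0.1.
step = torch.tensor([[-0.1, 0.1, 0.1]])

image_pert = project_constraints(image + step, image)
print(image_pert)
```

The first pixel is pulled back up to the original value (the step tried to erase it), the second keeps its update, and the third is clipped at the upper bound of 1.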

In [None]:
# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Create your network here (do not change this name)
class DigitClassification(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, padding=2)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5, padding=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=5, padding=2)
        self.bn3 = nn.BatchNorm2d(64)
        self.pool3 = nn.AvgPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * 3 * 3, 64)
        self.bn_fc1 = nn.BatchNorm1d(64)
        self.fc2 = nn.Linear(64, 10)
    def forward(self, x):
        x = self.pool1(torch.relu(self.bn1(self.conv1(x))))
        x = self.pool2(torch.relu(self.bn2(self.conv2(x))))
        x = self.pool3(torch.relu(self.bn3(self.conv3(x))))
        x = torch.flatten(x, 1)
        x = torch.relu(self.bn_fc1(self.fc1(x)))
        x = self.fc2(x)
        return x

# Load trained weights
from torchvision import transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # (-1, 1)
])

import os

# Resolve model path whether notebook runs from repo root or hw1b/
candidates = [
    'model.pth',
    os.path.join(os.path.dirname(os.getcwd()), 'model.pth'),  # ../model.pth
    os.path.join(os.getcwd(), 'model.pth'),                   # cwd/model.pth
]
model_path = next((p for p in candidates if os.path.exists(p)), None)
if model_path is None:
    raise FileNotFoundError(f'Could not find model.pth. Looked in: {candidates}. Place the saved weights from question 3 in one of these locations.')

model = DigitClassification().to(device)
state = torch.load(model_path, map_location=device)
model.load_state_dict(state)
model.eval()

# Test dataset and a helper to denormalize for display
test_ds = torchvision.datasets.MNIST('.', train=False, download=True, transform=transform)
test_loader = DataLoader(test_ds, batch_size=1, shuffle=True)

def to_img(x):
    # map from (-1,1) to (0,1)
    return (x.clamp(-1, 1) + 1.0) / 2.0

In [None]:
# don't change the signature of this function (image, image_pert -> [N, 1, H, W])
@torch.enable_grad()
def arbitrary_adversary(model, image, original_label, eps=0.01, max_steps=200):
    # image is normalized to (-1,1). Clone for perturbation and require grad
    model.eval()
    image = image.to(device)
    label = torch.tensor([original_label], device=device)
    image_pert = image.clone().detach().requires_grad_(True)
    ce = nn.CrossEntropyLoss()
    softmax = nn.Softmax(dim=1)
    for _ in range(max_steps):
        image_pert.grad = None
        logits = model(image_pert)
        probs = softmax(logits)
        # Want any wrong class >= 0.9; we ascend loss to hurt the true class
        loss = ce(logits, label)
        loss.backward()
        with torch.no_grad():
            # Check if gradient exists before using it
            if image_pert.grad is not None:
                # FGSM update: increase loss, but enforce constraints
                grad_sign = image_pert.grad.sign()
                image_pert += eps * grad_sign
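                # Constraint 1: no erasing: I_pert >= I.
                # Update in place: rebinding image_pert here would replace the leaf
                # tensor with one that no longer requires grad, so every later step
                # would see grad=None and silently skip the update.
                image_pert.copy_(torch.maximum(image_pert, image))
                # Constraint 2: keep in [-1, 1]
                image_pert.clamp_(-1.0, 1.0)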
        # Check if any non-true class has prob >= 0.9
        with torch.no_grad():
            p = probs.squeeze(0)
            # stop once the top-1 class is not the original class and its prob >= 0.9
            pred = torch.argmax(p).item()
            if pred != int(original_label) and p[pred].item() >= 0.9:
                break
    return image_pert.detach()

# Display images
# pick a correctly classified sample
model.eval()
for img, y in test_loader:
    img = img.to(device)
    with torch.no_grad():
        pred = model(img).argmax(dim=1).item()
    if pred == int(y.item()):
        adv = arbitrary_adversary(model, img, original_label=pred)
        with torch.no_grad():
            pred_adv = model(adv).argmax(dim=1).item()
        fig, axes = plt.subplots(1,3, figsize=(9,3))
        axes[0].imshow(to_img(img[0]).cpu().squeeze(), cmap='gray'); axes[0].set_title(f'orig: {pred}')
        axes[1].imshow(to_img(adv[0]).cpu().squeeze(), cmap='gray'); axes[1].set_title(f'adv: {pred_adv}')
        axes[2].imshow((to_img(adv[0]) - to_img(img[0])).cpu().squeeze(), cmap='bwr'); axes[2].set_title('delta')
        for ax in axes: ax.axis('off')
        plt.show()
        break

2. (10%) For a test image from the dataset, choose a target label $y_t$ that you want the network to classify your image as and compute a perturbed image. Note that this is different from what you are asked in part 1, because you want your network to believe that the image has a particular label, not just misclassify the image. You need to modify the loss function appropriately and then perform gradient descent as before. You should still use the constraints from part 1.
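
One way to read the required modification: the loss is computed against the target label instead of the ground-truth label, and the update descends that loss instead of ascending it. The snippet below is a minimal sketch of a single such step on a toy linear classifier. The identity-weight `toy_model`, the step size of 0.1, and the target class 2 are all assumptions chosen so the effect is easy to check by hand; they stand in for the trained network and are not part of the assignment.

```python
import torch
from torch import nn

# Hypothetical stand-in for the trained network: 3 inputs, 3 classes,
# identity weights and zero bias so that logits equal the input.
toy_model = nn.Linear(3, 3)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(3))
    toy_model.bias.zero_()

x = torch.zeros(1, 3, requires_grad=True)
target = torch.tensor([2])  # hypothetical target label y_t

def target_prob(inp: torch.Tensor) -> float:
    """Probability the toy model assigns to the target class."""
    with torch.no_grad():
        return torch.softmax(toy_model(inp), dim=1)[0, target.item()].item()

p_before = target_prob(x)

# Targeted step: cross-entropy against the target label, then move
# AGAINST the gradient sign (descent) to make the target more likely.
loss = nn.functional.cross_entropy(toy_model(x), target)
(grad,) = torch.autograd.grad(loss, x)
x_new = (x - 0.1 * grad.sign()).detach()

p_after = target_prob(x_new)
print(f"target probability: {p_before:.4f} -> {p_after:.4f}")
```

In this toy setting the target probability starts at 1/3 and rises after one step. The constraints from part 1 are omitted here to keep the sign convention in focus.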

In [None]:
# don't change the signature of this function (image, image_pert -> [N, 1, H, W])
@torch.enable_grad()
def targeted_adversary(model, image, target_label, eps=0.01, max_steps=300):
    # We want the model to predict target_label with high probability
    model.eval()
    image = image.to(device)
    target = torch.tensor([target_label], device=device)
    image_pert = image.clone().detach().requires_grad_(True)
    ce = nn.CrossEntropyLoss()
    softmax = nn.Softmax(dim=1)
    for _ in range(max_steps):
        image_pert.grad = None
        logits = model(image_pert)
        probs = softmax(logits)
        # Minimize CE to the target class (targeted attack)
        loss = ce(logits, target)
        loss.backward()
        with torch.no_grad():
            # Check if gradient exists before using it
            if image_pert.grad is not None:
                grad_sign = image_pert.grad.sign()
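                # Targeted attack: descend the loss on the target label (step against the gradient sign)
                image_pert -= eps * grad_sign
                # Constraint: I_pert >= I (no erasing) and clamp into [-1,1].
                # Update in place so image_pert stays the leaf tensor that requires grad;
                # rebinding it here would make every later step see grad=None.
                image_pert.copy_(torch.maximum(image_pert, image))
                image_pert.clamp_(-1.0, 1.0)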
        with torch.no_grad():
            p = probs.squeeze(0)
            pred = torch.argmax(p).item()
            if pred == int(target_label) and p[pred].item() >= 0.9:
                break
    return image_pert.detach()

# Display images
# pick a sample and set a target different than its original class
model.eval()
for img, y in test_loader:
    img = img.to(device)
    orig = y.item()
    target = (orig + 1) % 10
    adv_t = targeted_adversary(model, img, target_label=target)
    with torch.no_grad():
        pred_orig = model(img).argmax(dim=1).item()
        pred_t = model(adv_t).argmax(dim=1).item()
    fig, axes = plt.subplots(1,3, figsize=(9,3))
    axes[0].imshow(to_img(img[0]).cpu().squeeze(), cmap='gray'); axes[0].set_title(f'orig: {pred_orig}')
    axes[1].imshow(to_img(adv_t[0]).cpu().squeeze(), cmap='gray'); axes[1].set_title(f'targeted→{target}: {pred_t}')
    axes[2].imshow((to_img(adv_t[0]) - to_img(img[0])).cpu().squeeze(), cmap='bwr'); axes[2].set_title('delta')
    for ax in axes: ax.axis('off')
    plt.show()
    break

<!-- BEGIN QUESTION -->

3. (10%) Retrain the network from the previous problem. Use some of the adversarial images you generated in parts (1) and (2) and feed them in the retrained network. What do you observe?

Clean and adversarial robustness (observed):
- Accuracy decreases smoothly with ε for both FGSM and PGD (no obvious gradient masking). On our subset, accuracy remains high at small ε (≈0.990 at ε=0.00; ≈0.988–0.985 at ε=0.01–0.02) and is ≈0.960–0.962 at ε=0.05.
- PGD is consistently equal or slightly stronger than FGSM (e.g., ε=0.05: FGSM ≈0.962, PGD ≈0.960), which is expected.
- Targeted PGD (multi‑start) achieves very low attack success rates in this ε range: ≈0.01% at ε≤0.02, ≈0.1% at ε=0.03, ≈0.4% at ε=0.05.

Effect of short adversarial fine‑tuning:
- Any robustness change is marginal; curves before/after remain very close, with only small advantages at low ε. Clean accuracy stays ≥99%.

If we aimed for stronger robustness gains:
- Match train/eval with PGD and fine‑tune longer (1–3 passes) using mixed clean+adv batches (e.g., 50/50).
- Evaluate on more samples with stronger PGD (more steps, multiple random starts) and an ε sweep; expect clearer gains at small/mid ε if fine-tuning is extended.
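
To judge how many evaluation samples are "enough", a rough binomial standard error can be compared with the gaps reported above. The snippet below is a minimal sketch under two assumptions: the evaluation samples are treated as independent, and the subset size is taken to be 2000 (the value of `N_EVAL` in the evaluation cell), which may not match the subset behind every number quoted here.

```python
import math

def accuracy_std_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(acc * (1.0 - acc) / n)

n_eval = 2000  # assumption: same size as N_EVAL in the evaluation cell
se = accuracy_std_error(0.96, n_eval)

# Gap between the FGSM and PGD accuracies quoted above at eps = 0.05.
gap = 0.962 - 0.960

print(f"standard error ~ {se:.4f}, FGSM-PGD gap = {gap:.4f}")
```

Under these assumptions the standard error is about 0.004, larger than the 0.002 gap between FGSM and PGD at ε=0.05, so that gap alone is weak evidence. Both attacks are evaluated on the same images, so a paired comparison would be sharper than this unpaired estimate.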


In [None]:
# (Optional) Simple adversarial fine-tuning demo (keep short to avoid long runs)
from torch.utils.data import Subset

# Take a tiny subset of adversarial examples constructed on-the-fly
optimizer_ft = torch.optim.Adam(model.parameters(), lr=1e-4)
count = 0
adv_batch = []
label_batch = []

# Collect adversarial examples in batches to avoid batch norm issues
for img, y in test_loader:
    img = img.to(device)
    y = y.to(device)
    model.eval()  # Set to eval mode for adversarial generation
    with torch.no_grad():
        pred = model(img).argmax(dim=1)
    # build either arbitrary or targeted adversarial example
    if pred.item() == y.item():
        adv_img = arbitrary_adversary(model, img, original_label=pred.item())
    else:
        adv_img = targeted_adversary(model, img, target_label=int((y.item()+1)%10))
    
    adv_batch.append(adv_img)
    label_batch.append(y)
    count += 1
    
    # Process in mini-batches to avoid batch norm issues with single samples
    if len(adv_batch) >= 8 or count >= 64:
        model.train()  # Switch to train mode for gradient update
        batch_adv = torch.cat(adv_batch, dim=0)
        batch_labels = torch.cat(label_batch, dim=0)
        logits = model(batch_adv)
        loss = nn.CrossEntropyLoss()(logits, batch_labels)
        optimizer_ft.zero_grad(); loss.backward(); optimizer_ft.step()
        adv_batch = []
        label_batch = []
        
    if count >= 64:
        break

# Process any remaining samples
if adv_batch:
    model.train()
    batch_adv = torch.cat(adv_batch, dim=0)
    batch_labels = torch.cat(label_batch, dim=0)
    logits = model(batch_adv)
    loss = nn.CrossEntropyLoss()(logits, batch_labels)
    optimizer_ft.zero_grad(); loss.backward(); optimizer_ft.step()

model.eval()  # Set back to eval mode
print('Adversarial fine-tuning samples:', count)  # count tracks samples, not optimizer steps

In [None]:
# Quantitative view: clean vs adversarial accuracy before/after adversarial fine-tuning
from torch.utils.data import DataLoader, Subset

# Small eval subset to keep runtime short
subset_loader = DataLoader(Subset(test_ds, list(range(256))), batch_size=64, shuffle=False)

@torch.no_grad()
def accuracy_clean(m, loader):
    m.eval()
    tot, ok = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        ok += (m(x).argmax(1) == y).sum().item()
        tot += y.size(0)
    return ok / max(1, tot)

@torch.no_grad()
def accuracy_adv(m, loader, eps=0.02, max_steps=30):
    m.eval()
    tot, ok = 0, 0
    for x, y in loader:
        for i in range(x.size(0)):
            xi = x[i:i+1].to(device)
            yi = int(y[i].item())
            adv = arbitrary_adversary(m, xi, original_label=yi, eps=eps, max_steps=max_steps)
            pred = m(adv).argmax(1).item()
            ok += int(pred == yi)
            tot += 1
    return ok / max(1, tot)

# Snapshot current weights
before_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# Metrics before fine-tuning
acc_clean_before = accuracy_clean(model, subset_loader)
acc_adv_before = accuracy_adv(model, subset_loader, eps=0.02, max_steps=30)

# Short adversarial fine-tuning inside this cell
optimizer_ft = torch.optim.Adam(model.parameters(), lr=1e-4)
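# Keep the model in eval mode: this loop feeds one sample at a time, and
# BatchNorm1d in train mode raises an error on a batch of size 1.
# Gradients still flow in eval mode; only the BN running statistics stay frozen.
model.eval()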
steps = 0
for x, y in subset_loader:
    x, y = x.to(device), y.to(device)
    for i in range(x.size(0)):
        xi = x[i:i+1]
        yi = y[i:i+1]
        with torch.no_grad():
            pred = model(xi).argmax(dim=1)
        if pred.item() == yi.item():
            adv_x = arbitrary_adversary(model, xi, original_label=pred.item(), eps=0.02, max_steps=30)
        else:
            adv_x = targeted_adversary(model, xi, target_label=int((yi.item()+1)%10), eps=0.02, max_steps=30)
        logits = model(adv_x)
        loss = nn.CrossEntropyLoss()(logits, yi)
        optimizer_ft.zero_grad(); loss.backward(); optimizer_ft.step()
        steps += 1
        if steps >= 128:
            break
    if steps >= 128:
        break

# Metrics after fine-tuning
acc_clean_after = accuracy_clean(model, subset_loader)
acc_adv_after = accuracy_adv(model, subset_loader, eps=0.02, max_steps=30)

print(f"Before FT → clean: {acc_clean_before:.4f}, adv: {acc_adv_before:.4f}")
print(f"After  FT → clean: {acc_clean_after:.4f}, adv: {acc_adv_after:.4f}")

# Plot bar chart
labels = ['Clean', 'Adversarial']
before = [acc_clean_before, acc_adv_before]
after = [acc_clean_after, acc_adv_after]

plt.figure(figsize=(7,5))
x = [0, 1]; width = 0.35
plt.bar([i - width/2 for i in x], before, width=width, label='Before FT')
plt.bar([i + width/2 for i in x], after, width=width, label='After FT')
plt.xticks(x, labels)
plt.ylim(0, 1.0)
plt.ylabel('Accuracy')
plt.title('Clean vs Adversarial Accuracy (Before/After Adversarial Fine-Tuning)')
plt.legend(); plt.show()

# Restore original weights so other cells remain comparable
model.load_state_dict(before_state, strict=True)
model.eval()


In [None]:
# BN-safe evaluation and adversarial fine-tuning (batched)
from torch.utils.data import DataLoader, Subset

# Small eval subset
subset_loader = DataLoader(Subset(test_ds, list(range(256))), batch_size=64, shuffle=False)

@torch.no_grad()
def accuracy_clean(m, loader):
    m.eval()
    tot, ok = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        ok += (m(x).argmax(1) == y).sum().item()
        tot += y.size(0)
    return ok / max(1, tot)

@torch.no_grad()
def accuracy_adv(m, loader, eps=0.02, max_steps=30):
    m.eval()
    tot, ok = 0, 0
    for x, y in loader:
        # build adversarial batch, but keep model in eval for BN stability
        adv_list = []
        for i in range(x.size(0)):
            xi = x[i:i+1].to(device)
            yi = int(y[i].item())
            adv = arbitrary_adversary(m, xi, original_label=yi, eps=eps, max_steps=max_steps)
            adv_list.append(adv)
        adv_x = torch.cat(adv_list, dim=0)
        pred = m(adv_x).argmax(1)
        ok += (pred.cpu() == y).sum().item()
        tot += y.size(0)
    return ok / max(1, tot)

# Snapshot weights
before_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# Metrics before FT
acc_clean_before = accuracy_clean(model, subset_loader)
acc_adv_before = accuracy_adv(model, subset_loader, eps=0.02, max_steps=30)

# Short adversarial fine-tuning: create batched adversarial examples and train with batch to satisfy BN
optimizer_ft = torch.optim.Adam(model.parameters(), lr=1e-4)
steps = 0
for x, y in subset_loader:
    # Create adversarial batch in eval mode
    model.eval()
    adv_list = []
    for i in range(x.size(0)):
        xi = x[i:i+1].to(device)
        yi_int = int(y[i].item())
        with torch.no_grad():
            pred = model(xi).argmax(1).item()
        if pred == yi_int:
            adv = arbitrary_adversary(model, xi, original_label=pred, eps=0.02, max_steps=30)
        else:
            adv = targeted_adversary(model, xi, target_label=(yi_int+1)%10, eps=0.02, max_steps=30)
        adv_list.append(adv)
    adv_x = torch.cat(adv_list, dim=0)
    # One training step with batch (BN-safe)
    model.train()
    logits = model(adv_x)
    loss = nn.CrossEntropyLoss()(logits, y.to(device))
    optimizer_ft.zero_grad(); loss.backward(); optimizer_ft.step()
    steps += x.size(0)
    if steps >= 128:
        break

# Metrics after FT
acc_clean_after = accuracy_clean(model, subset_loader)
acc_adv_after = accuracy_adv(model, subset_loader, eps=0.02, max_steps=30)

print(f"Before FT → clean: {acc_clean_before:.4f}, adv: {acc_adv_before:.4f}")
print(f"After  FT → clean: {acc_clean_after:.4f}, adv: {acc_adv_after:.4f}")

# Plot bar chart
labels = ['Clean', 'Adversarial']
before = [acc_clean_before, acc_adv_before]
after = [acc_clean_after, acc_adv_after]

plt.figure(figsize=(7,5))
x = [0, 1]; width = 0.35
plt.bar([i - width/2 for i in x], before, width=width, label='Before FT')
plt.bar([i + width/2 for i in x], after, width=width, label='After FT')
plt.xticks(x, labels)
plt.ylim(0, 1.0)
plt.ylabel('Accuracy')
plt.title('Clean vs Adversarial Accuracy (Before/After Adversarial Fine-Tuning)')
plt.legend(); plt.show()

# Restore weights for reproducibility in later cells
model.load_state_dict(before_state, strict=True)
model.eval()


In [None]:
# PGD ε-sweep: robust accuracy before/after adversarial fine-tuning
from torch.utils.data import DataLoader, Subset

# Small eval subset for speed
subset_loader = DataLoader(Subset(test_ds, list(range(512))), batch_size=64, shuffle=False)

ce = nn.CrossEntropyLoss()

@torch.no_grad()
def clean_acc(m, loader):
    m.eval(); tot=0; ok=0
    for x,y in loader:
        x,y=x.to(device),y.to(device)
        ok += (m(x).argmax(1)==y).sum().item(); tot += y.size(0)
    return ok/max(1,tot)

# Standard Linf-PGD attack (no I_pert >= I constraint) for robust curve clarity
# Projects onto Linf ball of radius eps around the original input.
def pgd_linf(m, x, y, eps=0.03, alpha=0.01, steps=20):
    m.eval()
    x0 = x.to(device)
    y  = y.to(device)
    # random start within Linf ball
    x_adv = (x0 + torch.empty_like(x0).uniform_(-eps, eps)).clamp(-1,1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = m(x_adv)
        loss = ce(logits, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x0 - eps), x0 + eps)
            x_adv.clamp_(-1, 1)
    return x_adv.detach()

def robust_acc_pgd(m, loader, eps, alpha=0.01, steps=20):
    m.eval(); tot=0; ok=0
    for x,y in loader:
        x_adv = pgd_linf(m, x, y, eps=eps, alpha=alpha, steps=steps)
        with torch.no_grad():
            pred = m(x_adv).argmax(1)
            ok += (pred.cpu()==y).sum().item(); tot += y.size(0)
    return ok/max(1,tot)

# Snapshot weights
before_state = {k:v.detach().cpu().clone() for k,v in model.state_dict().items()}

# Compute curves before FT
eps_list = [0.0, 0.01, 0.02, 0.03, 0.05]
before_curve = [ (clean_acc(model, subset_loader) if e==0.0 else robust_acc_pgd(model, subset_loader, eps=e, alpha=0.01, steps=20)) for e in eps_list ]

# Short PGD-based adversarial fine-tuning with 50/50 clean+adv
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model.train()
steps_done = 0
for x,y in subset_loader:
    x,y = x.to(device), y.to(device)
    x_adv = pgd_linf(model, x, y, eps=0.02, alpha=0.01, steps=10)
    model.train()  # pgd_linf switches the model to eval mode; restore train mode for the update
    logits_clean = model(x)
    logits_adv   = model(x_adv)
    loss = 0.5*ce(logits_clean, y) + 0.5*ce(logits_adv, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    steps_done += 1
    if steps_done >= 8:
        break

# Compute curves after FT
after_curve = [ (clean_acc(model, subset_loader) if e==0.0 else robust_acc_pgd(model, subset_loader, eps=e, alpha=0.01, steps=20)) for e in eps_list ]

print('ε values:', eps_list)
print('Before:', [f"{v:.4f}" for v in before_curve])
print('After :', [f"{v:.4f}" for v in after_curve])

# Plot curves
plt.figure(figsize=(7,5))
plt.plot(eps_list, before_curve, marker='o', label='Before FT')
plt.plot(eps_list, after_curve, marker='o', label='After FT')
plt.xlabel('ε (L∞)'); plt.ylabel('Accuracy'); plt.ylim(0,1.0)
plt.title('Robust Accuracy vs ε (PGD, 20 steps)')
plt.legend(); plt.show()

# Restore weights
model.load_state_dict(before_state, strict=True)
model.eval()


In [None]:
# Robustness test on adversarial images, including additional adversarial images
# that we generate, plus an overall robustness test.


<!-- END QUESTION -->

## References
<div><img src="https://github.com/LukasZhornyak/CIS680_files/raw/main/HW1/images/refs.png" width="600"/></div>

In [None]:
# Robustness evaluation suite: FGSM/PGD curves and targeted attack success rate (ASR)
from torch.utils.data import DataLoader, Subset

# Build evaluation subset
N_EVAL = 2000  # adjust higher for tighter CIs
subset_loader = DataLoader(Subset(test_ds, list(range(min(N_EVAL, len(test_ds))))), batch_size=128, shuffle=False)

ce = nn.CrossEntropyLoss()

@torch.no_grad()
def acc_clean(m, loader):
    m.eval(); tot=0; ok=0
    for x,y in loader:
        x,y=x.to(device),y.to(device)
        ok += (m(x).argmax(1)==y).sum().item(); tot += y.size(0)
    return ok/max(1,tot)

# FGSM untargeted (uses sign of grad of CE wrt true label)
def fgsm(m, x, y, eps):
    m.eval()
    x = x.to(device).detach().requires_grad_(True)
    y = y.to(device)
    logits = m(x)
    loss = ce(logits, y)
    m.zero_grad(set_to_none=True)
    loss.backward()
    with torch.no_grad():
        x_adv = (x + eps * x.grad.sign()).clamp(-1,1)
    return x_adv.detach()

# PGD untargeted (standard Linf ball)
def pgd(m, x, y, eps=0.03, alpha=0.01, steps=20, rs=True):
    m.eval()
    x0 = x.to(device)
    y  = y.to(device)
    if rs:
        x_adv = (x0 + torch.empty_like(x0).uniform_(-eps, eps)).clamp(-1,1)
    else:
        x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = m(x_adv)
        loss = ce(logits, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x0 - eps), x0 + eps)
            x_adv.clamp_(-1, 1)
    return x_adv.detach()

def acc_under_attack(m, loader, method='fgsm', eps=0.02, steps=20, alpha=0.01):
    m.eval(); tot=0; ok=0
    for x,y in loader:
        if method=='fgsm':
            x_adv = fgsm(m, x, y, eps)
        else:
            x_adv = pgd(m, x, y, eps=eps, alpha=alpha, steps=steps)
        with torch.no_grad():
            pred = m(x_adv).argmax(1)
            ok += (pred.cpu()==y).sum().item(); tot += y.size(0)
    return ok/max(1,tot)

def targeted_asr_curve(m, loader, eps_list, steps=30, alpha=0.01, num_starts=3):
    # Attack success rate vs ε using targeted PGD with multiple random starts
    m.eval()
    results = []
    for eps in eps_list:
        tot = 0; success = 0
        for x,y in loader:
            x0 = x.to(device)
            t  = ((y + 1) % 10).to(device)
            # Track success per-sample across starts
            success_mask = torch.zeros(x0.size(0), dtype=torch.bool, device=device)
            for s in range(num_starts):
                x_adv = (x0 + torch.empty_like(x0).uniform_(-eps, eps)).clamp(-1,1)
                for _ in range(steps):
                    x_adv.requires_grad_(True)
                    logits = m(x_adv)
                    loss = ce(logits, t)
                    grad = torch.autograd.grad(loss, x_adv)[0]
                    with torch.no_grad():
                        x_adv = x_adv - alpha * grad.sign()
                        x_adv = torch.min(torch.max(x_adv, x0 - eps), x0 + eps)
                        x_adv.clamp_(-1, 1)
                with torch.no_grad():
                    pred = m(x_adv).argmax(1)
                    success_mask |= (pred == t)
            success += success_mask.sum().item(); tot += x0.size(0)
        results.append(success / max(1, tot))
    return results

# Evaluate
eps_list = [0.0, 0.01, 0.02, 0.03, 0.05]
fgsm_curve = []
pgd_curve  = []
for e in eps_list:
    if e == 0.0:
        fgsm_curve.append(acc_clean(model, subset_loader))
        pgd_curve.append(acc_clean(model, subset_loader))
    else:
        fgsm_curve.append(acc_under_attack(model, subset_loader, method='fgsm', eps=e))
        pgd_curve.append(acc_under_attack(model, subset_loader, method='pgd', eps=e, steps=20, alpha=0.01))

# Targeted ASR across ε (omit 0.0)
asr_eps = [0.01, 0.02, 0.03, 0.05]
asr_curve = targeted_asr_curve(model, subset_loader, eps_list=asr_eps, steps=30, alpha=0.01, num_starts=3)

print('FGSM acc:', [f"{v:.4f}" for v in fgsm_curve])
print('PGD  acc:', [f"{v:.4f}" for v in pgd_curve])
print('Targeted ASR:', dict(zip(asr_eps, [f"{v:.4f}" for v in asr_curve])))

# Plots
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(eps_list, fgsm_curve, marker='o', label='FGSM')
plt.plot(eps_list, pgd_curve, marker='o', label='PGD')
plt.ylim(0,1.0)
plt.xlabel('ε (L∞)'); plt.ylabel('Accuracy')
plt.title('Accuracy vs ε (FGSM vs PGD)')
plt.legend()

plt.subplot(1,2,2)
x = list(range(len(asr_eps)))
plt.bar(x, asr_curve, width=0.6)
plt.xticks(x, [f'ε={e:.02f}' for e in asr_eps])
ylim_top = max(0.05, max(asr_curve)*1.25)
plt.ylim(0, ylim_top)
for i,v in enumerate(asr_curve):
    plt.text(i, v + 0.01*ylim_top, f'{v*100:.1f}%', ha='center', va='bottom', fontsize=9)
plt.title('Targeted Attack Success Rate vs ε (PGD, multi-start)')
plt.tight_layout(); plt.show()




## Submission

Make sure you have run all cells in your notebook in order before you zip together your submission, so that all images/graphs appear in the output.

Please submit a PDF file alongside the notebook. In Colab, you can use "File -> Print (Ctrl+P)".

For part (b), your submission should consist of two files: this notebook and the saved weights from question 3. There is no need to upload the new, retrained, weights.

Please do not run the training loop in your Gradescope submission.

**Please save before exporting!**