**Name: Mohammad Taslmi**

**Student Number: 99101321**




In this notebook, you start by loading a ResNet model trained on the CIFAR10 dataset. Then you attack it with an $L_{2}$ C&W attack. Next, you harden the model with defensive distillation and test it again against two modes of the $L_{2}$ C&W attack.

We recommend using Google Colab for this homework. You can mount your Google Drive with the code below and use it to save and load your trained models:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Load CIFAR10 data

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.utils
import torchvision.transforms as transforms
from torchvision.models import resnet18
from tqdm import tqdm
import matplotlib.pyplot as plt

In [3]:
transform = transforms.Compose([transforms.ToTensor()])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)

batch_size = 128

########################## Problem 1 (2  points) ###############################
# todo: Define your data loaders for training and testing                      #
################################################################################
# your code goes here
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)
################################ End ###########################################

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:03<00:00, 43594635.97it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


# Model set-up

In [4]:
class resnet(nn.Module):
  def __init__(self, num_cls, T=1):
    super().__init__()
    self.conv = nn.Sequential(
        *list(resnet18(weights=None).children())[:-2])

    self.fc = nn.Linear(512, num_cls)
    self.temp = T

  def forward(self, x, T=None):
    if T is None:
      T = self.temp
    x = self.conv(x)
    x = torch.flatten(x, start_dim=1)
    logits = self.fc(x)
    output = torch.softmax(logits / T, dim=1)

    return logits, output

In [5]:
def standard_train(model, loader, num_epoch, optimizer, criterion, device=device):
    for epoch in range(num_epoch):
      running_loss = 0.0
      for i, data in enumerate(loader, 0):
          inputs, labels = data[0].to(device), data[1].to(device)
          optimizer.zero_grad()
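          outputs = model(inputs)
          # NOTE: outputs[1] is softmax(logits / T), so the criterion receives
          # probabilities rather than raw logits here.
          loss = criterion(outputs[1], labels)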
          loss.backward()
          optimizer.step()
          running_loss += loss.item()

      print('Epoch %d loss: %.3f' % (epoch + 1, running_loss / len(loader)))

In [6]:
# No training is needed here; just download the pretrained ResNet18 weights for the CIFAR10 dataset
!gdown 1KU4jWAwZIq0TUujAsgimLxGWUvwIEfyB

Downloading...
From (original): https://drive.google.com/uc?id=1KU4jWAwZIq0TUujAsgimLxGWUvwIEfyB
From (redirected): https://drive.google.com/uc?id=1KU4jWAwZIq0TUujAsgimLxGWUvwIEfyB&confirm=t&uuid=7123454f-99d3-4046-aeee-e07f6d239a5b
To: /content/resnet18_cifar10_model_pretrained.pth
100% 44.8M/44.8M [00:00<00:00, 47.2MB/s]


In [7]:
# load trained Resnet18 model on CIFAR10 dataset
model = resnet(len(classes)).to(device)
model_name = "resnet18_cifar10_model_pretrained.pth"
model_PATH = "/content/" + model_name
state_dict = torch.load(model_PATH, map_location=device)
model.load_state_dict(state_dict)
model = model.to(device)

# Computing clean accuracy

In [8]:
def standard_test(model, loader, device=device):
  correct = 0
  total = 0
  ########################## Problem 2 (4 points) ##############################
  # todo: Iterate over loader, compute the output and predicted                #
  # label, and update "correct" and "total" counters accordingly.              #
  ##############################################################################

  # your code goes here
  model.eval()
  with torch.no_grad():
    for i, data in enumerate(loader, 0):
      inputs, labels = data[0].to(device), data[1].to(device)
      # Evaluate every batch inside the loop (not only the last one)
      outputs = model(inputs)
      _, predicted = torch.max(outputs[1], 1)
      total += labels.size(0)
      correct += (predicted == labels).sum().item()

  ################################ End #########################################
  print(f'\n Clean accuracy of the network on the 10000 test images: {100 * correct // total} %')


In [None]:
standard_test(model, testloader)


 Clean accuracy of the network on the 10000 test images: 75 %


# C&W L2 Attack on the base model

Note that for the sake of simplicity, you can ignore binary search for constant c for your implementation. However, implementing it would earn you a bonus!
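The binary search over $c$ mentioned above could look roughly like the sketch below. It is not part of the original solution: `binary_search_c` and the callback `attack_succeeds` are hypothetical names, and the toy predicate only stands in for a real call to `cw_l2_attack` followed by a check of the model's prediction.

```python
import math

def binary_search_c(attack_succeeds, c_init=1e-2, steps=10):
    """Search for a small constant c for which the attack still succeeds.

    attack_succeeds: hypothetical callback, c -> bool. In the notebook it would
    run the attack with this c and report whether the prediction was changed.
    """
    lower, upper = 0.0, math.inf
    c, best_c = c_init, None
    for _ in range(steps):
        if attack_succeeds(c):
            # Success: remember c and try a smaller value
            best_c = c if best_c is None else min(best_c, c)
            upper = c
            c = (lower + upper) / 2
        else:
            # Failure: grow c tenfold until an upper bound exists, then bisect
            lower = c
            c = c * 10 if math.isinf(upper) else (lower + upper) / 2
    return best_c

# Toy stand-in (assumption): the attack "succeeds" once c reaches 0.37
toy_attack = lambda c: c >= 0.37
best = binary_search_c(toy_attack)
print(best)
```

A smaller successful $c$ puts more weight on the distortion term, so the search trades attack success against perturbation size.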

In [58]:
def cw_l2_attack(model, image, label, mode='outputs', targeted=True, c=1e-4, kappa=0, max_iter=1000, learning_rate=0.01):

    image = image.to(device)
    label = label.to(device)
    ######################### Problem 3 (8 points) #############################
    # todo: Implement L2 C&W attack in the two following modes:                #
    # 'outputs': use an f function which utilizes output probs of the model    #
    # 'logits': use an f function which utilizes logits of the model           #
    # Output of this function must be the resulting adversarial image.         #
    ############################################################################

    perturbation = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.Adam([perturbation], lr=learning_rate)

    num_classes = 10
    target_label = (label + 1) % num_classes

    for iteration in range(max_iter):

        optimizer.zero_grad()

        adv_image = (image + perturbation).clamp(0, 1)
        logits, outputs = model(adv_image)

        if (logits.argmax(1) == target_label).all():
            break

        # Select the scores the objective is built on: raw logits or softmax probabilities
        if mode == 'logits':
            scores = logits
        elif mode == 'outputs':
            scores = outputs
        else:
            raise ValueError(f"Unknown mode: {mode}")

        # Index the target class per sample so the code is also correct for batches
        target_idx = target_label.view(-1, 1)
        target_score = scores.gather(1, target_idx).squeeze(1)
        # Mask the target class so max() returns the best non-target score
        other_score = scores.scatter(1, target_idx, float('-inf')).max(1)[0]
        f_x = other_score - target_score


        loss = torch.norm(perturbation) + c * torch.sum(torch.clamp(f_x, min=-kappa))

        loss.backward()
        optimizer.step()


    adv_image = (image + perturbation).clamp(0, 1).detach()
    return adv_image

    ################################ End #######################################

In [59]:
def attack_test(model, attack_model, loader, mode='outputs', targeted=True, c=0.1, kappa=0, device=device):
  correct = 0
  total = 0
  ########################## Problem 4 (4 points) ##############################
  # todo: Iterate over the FIRST batch of the loader                           #
  # todo: Find an adversarial example by L2 C&W attack on attack_model         #
  # todo: Compute the output and predicted label, and update "correct" and     #
  # "total" counters accordingly.                                              #
  ##############################################################################

  data_iter = iter(loader)
  images, labels = next(data_iter)
  images, labels = images.to(device), labels.to(device)
  attack_model.eval()
  model.eval()
  for image, label in zip(images, labels):

      adv_image = cw_l2_attack(attack_model, image.unsqueeze(0), label.unsqueeze(0), mode=mode, targeted=targeted, c=c, kappa=kappa)
      logits, outputs = model(adv_image)
      pred_label = logits.argmax(dim=1, keepdim=True)

      correct += pred_label.eq(label.view_as(pred_label)).sum().item()
      total += 1

  ################################ End #########################################
  print(f'\n Accuracy of the network on the {total} test images of the first batch of testloader after attacking : {100 * correct // total} %')


In [None]:
print('C&W L2 attack on the model using its outputs (probs):')
attack_test(model, model, testloader, mode='outputs', device=device)

print('\n C&W L2 attack on the model using its logits:')
attack_test(model, model, testloader, mode='logits', device=device)

C&W L2 attack on the model using its outputs (probs):

 Accuracy of the network on the 128 test images of the first batch of testloader after attacking : 75 %

 C&W L2 attack on the model using its logits:

 Accuracy of the network on the 128 test images of the first batch of testloader after attacking : 0 %


# Defensive distillation

## Teacher model training

In [13]:
T = 100
teacher = resnet(len(classes), T=T).to(device)
teacher_optim = optim.Adam(teacher.parameters(), lr=0.01)
teacher_criterion = nn.CrossEntropyLoss()

In [14]:
# load teacher model (if you already trained it)
model_name = "teacherModel.pth"
teacher_model_PATH = "/content/drive/MyDrive/SPML_HW4_models/" + model_name
state_dict = torch.load(teacher_model_PATH)
teacher.load_state_dict(state_dict)
teacher = teacher.to(device)

In [None]:
teacher_optim = optim.Adam(teacher.parameters(), lr=0.001)
standard_train(model=teacher,
            loader=trainloader,
            num_epoch=15,
            optimizer=teacher_optim,
            criterion=teacher_criterion,
            device=device)

Epoch 1 loss: 1.605
Epoch 2 loss: 1.598
Epoch 3 loss: 1.594
Epoch 4 loss: 1.589
Epoch 5 loss: 1.585
Epoch 6 loss: 1.579
Epoch 7 loss: 1.576
Epoch 8 loss: 1.574
Epoch 9 loss: 1.572
Epoch 10 loss: 1.570
Epoch 11 loss: 1.568
Epoch 12 loss: 1.566
Epoch 13 loss: 1.566
Epoch 14 loss: 1.564
Epoch 15 loss: 1.561


In [None]:
standard_test(teacher, testloader)


 Clean accuracy of the network on the 10000 test images: 76 %


In [None]:
# save teacher model (only if you just trained it)
teacher.eval()
model_name = "teacherModel.pth"
teacher_model_PATH = "/content/drive/MyDrive/SPML_HW4_models/" + model_name
torch.save(teacher.state_dict(), teacher_model_PATH)

## Student model training

In [15]:
def distillation(teacher, student, loader, num_epoch, optimizer, criterion, device=device):
  ########################## Problem 5 (6 points) ##############################
  # todo: Iterate over loader in each epoch                                    #
  # todo: Compute MSE loss between student's logit and teacher's logit         #
  # todo: Take a step by the optimizer                                         #
  # todo: Monitor the training procedure                                       #
  ##############################################################################

  # your code goes here
  teacher.eval()   # keep the teacher's BatchNorm statistics fixed
  student.train()
  for epoch in range(num_epoch):
    running_loss = 0.0
    for i, data in enumerate(loader, 0):
      inputs = data[0].to(device)
      optimizer.zero_grad()
      # The model returns (logits, probs); the student is fit to the teacher's logits
      with torch.no_grad():
        teacher_logits, _ = teacher(inputs)
      student_logits, _ = student(inputs)
      loss = criterion(student_logits, teacher_logits)
      loss.backward()
      optimizer.step()
      running_loss += loss.item()

    print('Epoch %d loss: %.3f' % (epoch + 1, running_loss / len(loader)))
  ################################ End #########################################

In [16]:
T = 100
student = resnet(len(classes), T=T).to(device)
student_optim = optim.Adam(student.parameters(), lr=0.01)
std_criterion = nn.MSELoss()

In [18]:
# load student model (only if you already trained it)
model_name = "studentModel.pth"
student_model_PATH = "/content/drive/MyDrive/SPML_HW4_models/" + model_name
state_dict = torch.load(student_model_PATH)
student.load_state_dict(state_dict)
student = student.to(device)

In [None]:
student_optim = optim.Adam(student.parameters(), lr=0.001)
distillation(teacher=teacher,
             student=student,
             loader=trainloader,
             num_epoch=15,
             optimizer=student_optim,
             criterion=std_criterion,
             device=device)

Epoch 1 loss: 54667.581
Epoch 2 loss: 51599.310
Epoch 3 loss: 47246.446
Epoch 4 loss: 44115.916
Epoch 5 loss: 42639.867
Epoch 6 loss: 39719.256
Epoch 7 loss: 37999.535
Epoch 8 loss: 34735.838
Epoch 9 loss: 32176.859
Epoch 10 loss: 33200.120
Epoch 11 loss: 32074.436
Epoch 12 loss: 30592.106
Epoch 13 loss: 29186.152
Epoch 14 loss: 26815.331
Epoch 15 loss: 26768.282


In [None]:
standard_test(student, testloader)


 Clean accuracy of the network on the 10000 test images: 76 %


In [None]:
# save student model (only if you just trained it)
student.eval()
model_name = "studentModel.pth"
student_model_PATH = "/content/drive/MyDrive/SPML_HW4_models/" + model_name
torch.save(student.state_dict(), student_model_PATH)

# C&W L2 Attack on the distilled model

In [60]:
print('C&W L2 attack on the student model using its outputs (probs):')
attack_test(student, student, testloader, mode='outputs', device=device)

print('\n C&W L2 attack on the student model using its logits:')
attack_test(student, student, testloader, mode='logits', device=device)

C&W L2 attack on the student model using its outputs (probs):

 Accuracy of the network on the 128 test images of the first batch of testloader after attacking : 76 %

 C&W L2 attack on the student model using its logits:

 Accuracy of the network on the 128 test images of the first batch of testloader after attacking : 0 %


# Analyzing the results

Answer the following questions below them: (6 points)

1. `Why do you think the attack success rate of the logits mode is generally better than the outputs (probs) mode?`

**Answer**:

I believe there are two reasons why the attack performs better in logits mode than in outputs mode.

The first reason is that when the logits are large, the softmax function saturates, which can lead to **vanishing gradients**.

The second reason is that, for the same input, the function $f$ takes much smaller values in outputs mode than in logits mode, because the softmax outputs lie between 0 and 1 and sum to 1. With $c=0.1$, the term $c \times f(x)$ is therefore much smaller in outputs mode. During minimization, the optimizer pays less attention to this term (the one that drives the search for adversarial examples), making the attack less effective. As we can observe, in logits mode the attack succeeds on every image (accuracy drops to 0%), while in outputs mode accuracy stays at 75%, so at most 25% of the images end up misclassified.
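The saturation argument can be checked numerically with a small sketch. It uses plain Python rather than the notebook's model; the logits `z`, the helper names `softmax`, `margin`, and `grad_wrt_target`, and the finite-difference step are all illustrative assumptions.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def margin(scores, target):
    # Best non-target score minus target score (the f used by the attack)
    return max(s for i, s in enumerate(scores) if i != target) - scores[target]

def grad_wrt_target(f, z, target, eps=1e-3):
    # Central finite difference of f with respect to the target logit
    up, down = list(z), list(z)
    up[target] += eps
    down[target] -= eps
    return (f(up) - f(down)) / (2 * eps)

z = [20.0, 0.0, 0.0]   # made-up, strongly confident logits
target = 1

g_logits = grad_wrt_target(lambda v: margin(v, target), z, target)
g_outputs = grad_wrt_target(lambda v: margin(softmax(v), target), z, target)

print("f (logits): ", margin(z, target), " gradient:", g_logits)
print("f (outputs):", margin(softmax(z), target), " gradient:", g_outputs)
```

In this toy case the logits-mode objective has a gradient of about -1 with respect to the target logit, while the outputs-mode gradient is many orders of magnitude smaller, so the optimizer receives almost no signal from it.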


2. `Why did defensive distillation completely fail in the logits mode?`

**Answer**:

Defensive distillation fails in logits mode because its protection comes from the temperature-scaled softmax: training at a high temperature smooths the model's output probabilities, which makes it harder for an attacker to find adversarial examples through the softmax. In logits mode, however, the attack objective is computed on the layer before the softmax, which defensive distillation does not smooth. The gradients there remain informative, so the attack can still exploit the model's vulnerabilities.
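A rough illustration of this point, assuming made-up logits and plain Python instead of the trained student: the `resnet.forward` above returns `softmax(logits / T)`, so with $T=100$ the probability margin becomes tiny (and its gradient with respect to the logits is scaled by $1/T$), while the logit margin is unchanged. The helper names below are hypothetical.

```python
import math

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def margin(scores, target):
    # Best non-target score minus target score
    return max(s for i, s in enumerate(scores) if i != target) - scores[target]

logits = [12.0, 3.0, -1.0, 0.5]   # made-up logits, not taken from the notebook
target = 1

f_logits = margin(logits, target)                         # independent of T
f_probs_T1 = margin(softmax(logits, T=1.0), target)       # bounded by 1
f_probs_T100 = margin(softmax(logits, T=100.0), target)   # nearly uniform probs

print("logit margin:          ", f_logits)
print("prob margin at T = 1:  ", f_probs_T1)
print("prob margin at T = 100:", f_probs_T100)
```

Whatever the temperature does to the probabilities, an attacker who reads the logits directly sees the same margin and the same gradients, which is consistent with the 0% accuracy observed in logits mode.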


3. `As you have seen, defensive distillation is a mirage. Do you think that there are still some scenarios in which this defense may be helpful?`

**Answer**:

While defensive distillation has been shown to be less effective than initially thought, it may still provide some level of protection in specific scenarios. For instance, it could potentially be useful against less sophisticated attacks or when used in combination with other defense mechanisms. However, it's important to note that relying solely on defensive distillation is not advisable due to its limitations.