# Adversarial Learning - Notes&Code - Jiacheng Zhang

#### Reference: 
#### https://arxiv.org/abs/2202.10377
#### https://arxiv.org/abs/1706.06083
#### https://github.com/yogeshbalaji/Adversarial-training
#### https://github.com/Harry24k/PGD-pytorch
#### https://github.com/DengpanFu/RobustAdversarialNetwork


# 1.Types of adversarial attacks

#### Evasion: manipulating data  to avoid detection (e.g.  image-based spam in which the spam content is embedded within an attached image to evade textual analysis by anti-spam filters)
Evasion attacks can be generally split into two different categories: black box attacks and white box attacks.
#### Poisoning: injection malicious training samples to disrupt retraining (e.g. for content recommendation or natural language models, especially given the ubiquity of fake accounts.)
#### Model stealing: extraction (e.g. extracting a sufficient amount of data from the model to enable the complete reconstruction of the model)
#### Inference: leveraging over-generalization on training data

# 2 Threat Modelling for Adversarial Learning Attacks

### 2.1 Adversarial Threat Model

#### 2.1.1 The attack surface: the point in which the attack takes place. (mostly occur during the input data collection and processing in order to exploit the vulnerabilities in the model without corrupting it)
#### 2.1.2 Adversarial capability: the capacities of an adversary outline the sort of attacks available to them. These relate to what part of the process they have access to, be it the training phase or the inference phase.
- if the adversary has the ability to insert themselves into the training phase, they have the capability to learn to influence the training of, and therefore corrupt, the model iteself. (e.g. injection/modification on training set, logic corruption - tamper with the actuall learning logic of the model)
- In the inference phase, attacks do not attempt to corrupt the model itself, but rather to fool it into producing incorrect outputs, by exploiting inherent vulnerabilities. (e.g. white-box/black-box attacks depending on the adversary's access to information)

#### 2.1.3 Adversarial goals: they model the goals of potential adversaries using the classic cyber security mode, CIA (confidentiality, integrity and availability)
- confidentiality: e.g., in the context of a financial system, the model itself may be regarded as confidential intellectual property, or likewise in the context of a medical system, the training set may contain confidential data used to train it.
- integrity: attacks on integrity aim to cause the model to behave in an unintelligent and incorrect manner thereby undermining the integrity of the model. E.g., causing the misclassification of certain inputs to the model or reducing the confidence the model has in its output.
- availability: attacks on availability look to threat access to output of models. E.g., Attempts to reduce the access to the model, via denial-of-service attacks, or model's performance metrics such as speed so it becomes inconvenient to use.

### 2.2 Intelligent Security Detection System Threat

- these systems can generally be attacked in the training stage by poisoning the training set or by logic corruption. Similarly, attacks in the inference stage would aim to evade detection ot potentially trigger false positives. These false positives may trigger unnecessary reactions by the system, potentially causing unwanted responses such as the wiping of important data. A stream of false positives may also strain the system, reducing its performance, and as such, its availability

### 2.3 Attack Attributes

#### 2.3.1 Adversarial falsification: distinguishes between whether the adversary aims to produce a false positive attack or false negative and what this means for the target system.
- false positive attack: results in negative samples being incorrectly classified as positive. E.g., In the case of an intrusion detection system this may result in the inappropriate triggering of a response to the perceived attack such as deletion of non-threatening or important information.
- false negative attack: also known as evasion attacks, using the same example of an intrusion detection system, the true intruder or malware is classified as benign and left completely undetected.

#### 2.3.2 Adversarial specificity: differentiates between targeted and non-targeted attacks and usually relates to the case of a multiclass classification.
- targeted attacks: look to guide the output of the model in a certain direction to a specific class.
- non-targeted attacks: aim to classify an adversarial example as any class other than the original.(easier to implement and are either realised by reducing the probability the model classifies correctly or by selecting one from numerous targeted approaches, the one with the slightest perturbation)

#### 2.3.3 Adversarial frequency: concerns whether the attack is a one-time approached or an iterative one, which updates a number of times to better optimize the attack.
- an iterative one tend to perform better, however come with extra cost, requiring greater computational time.
- in some cases, the one-time attacks are sufficient or even the only feasible option.

#### 2.3.4 White-box  vs Black-box: attacks in this case can be split into two distinct categories based on the level of detail they require of the target model
- white-box attack: one which requires an in-depth understanding of the target model, the mechanics of the model are transparent. The attacker has full knowledge of the system and has access to vital information including details such as the network architecture, parameters, hyper parameters, training data as well as the ability to gather gradients, and prediction results.
- black-box attack: attacks the model with no knowledge of the inner workings of the model.A black box attacker simply requires access to the model in order to query it to produce the prediction result.(e.g., an attacker may be able to produce a surrogate model which acts in accordance with the inputs and outputs of the target model. In this way, the attacker has full access to the inner workings of their surrogate model and can therefore successfullyu conduct white box attacks on it by creating adversarial samples, then these adversarial samples can be used to conduct a black-box attack against the target model)

# 3. Code Re-implementation of https://arxiv.org/abs/1706.06083
- Based on MNIST Data only

### 1. Import Packages

In [1]:
import os
import pandas as pd
import torch
import numpy as np
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
from torchvision.utils import make_grid
import torch.nn as nn
import torch.optim as optim
import time
import torch.nn.functional as F

### 2. Download MNIST Data

In [2]:
# transform data
transform = transforms.Compose([
    transforms.ToTensor() # ToTensor : [0, 255] -> [0, 1]
])

trainingData = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
testingData = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

batchSize = 200
epoch = 50

train_set = DataLoader(dataset=trainingData,batch_size=batchSize, shuffle=True)
test_set = DataLoader(dataset=testingData,batch_size=batchSize, shuffle=False)

### 3. Show Part of the Data

In [3]:
def imshow(img):
    npimg = img.numpy()
    fig = plt.figure(figsize = (5, 15))
    plt.imshow(np.transpose(npimg,(1,2,0)))
    plt.show()

# Uncomment below to see the images

# for images, labels in test_set:
#     images = images.cuda()
#     labels = labels.cuda()
#     imshow(make_grid(images.cpu().data, normalize=True))

### 4. Define the Model

In [4]:
class MnistModel(nn.Module):
    """ Construct basic MnistModel for mnist adversal attack """
    def __init__(self, re_init=False, has_dropout=False):
        super(MnistModel, self).__init__()
        self.re_init = re_init
        self.has_dropout = has_dropout
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU(True)
        self.fc1 = nn.Linear(7*7*64, 1024)
        self.fc2 = nn.Linear(1024, 10)
        if self.has_dropout:
            self.dropout = nn.Dropout()

        if self.re_init:
            self._init_params(self.conv1)
            self._init_params(self.conv2)
            self._init_params(self.fc1)
            self._init_params(self.fc2)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        x = self.relu(x)
        if self.has_dropout:
            x = self.dropout(x)
        x = self.fc2(x)
        return x

In [5]:
# Initialize Models for Different Attacks
ori_model = MnistModel().cuda()
fgsm_model = MnistModel().cuda()
pgd_model = MnistModel().cuda()

### 5. Define a Loss Function

In [6]:
# Stochastic Gradient Descent
ori_optimizer = optim.SGD(ori_model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
fgsm_optimizer = optim.SGD(fgsm_model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
pgd_optimizer = optim.SGD(pgd_model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)

### 6. Define the training stage

In [7]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

In [8]:
def train(model, epoch, optimizer, adversarial):
    losses = AverageMeter()
    start_time = time.time()
    for i in range(epoch):
        model.train()
        correct = 0
        total = 0
        for images, labels in train_set:
            images = images.cuda()
            labels = labels.cuda()
            optimizer.zero_grad()
        
            if adversarial=='FGSM':
                images = FGSM(model, images, labels, 0.1)
            elif adversarial=='PGD':
                images = PGD(model, images, labels)
            
            output = model(images).cuda()
            loss = F.cross_entropy(output, labels).cuda()
            loss.backward()
            optimizer.step()

            _, prediction = torch.max(output, dim=1)
            correct += (prediction == labels).sum()
            total += labels.size(0)

            # measure accuracy and record loss
            losses.update(loss.data.item(), images.size(0))

        acc = (float(correct) / total) * 100
        if i % 10 == 0:
            end_time = time.time()
            batch_time = end_time - start_time
            message = 'Epoch {}, Time: {}s, Loss: {}, Train Accuracy: {}%'.format(i+1, batch_time, loss.item(), acc)
            print(message)

### 7. Define the testing stage

In [9]:
def evaluate(model, adversarial):
    model.eval()
    
    correct = 0
    total = 0
    
    for images, labels in test_set:
        images = images.cuda()
        labels = labels.cuda()
        
        if adversarial=='FGSM':
            images = FGSM(model,images,labels,0.1)
        elif adversarial=='PGD':
            images = PGD(model, images, labels)
            
        # compute output
        output = model(images).cuda()
        _, pred = torch.max(output, dim=1)
        correct += (pred == labels).sum()
        total += labels.size(0)

    accuracy = (float(correct) / total) * 100
    message = 'Test Accuracy: {}%'.format(accuracy)
    print(message)

### 8. Natural Training

In [10]:
train(ori_model,epoch,ori_optimizer, None)

Epoch 1, Time: 5.669363021850586s, Loss: 0.1208600103855133, Train Accuracy: 82.13666666666667%
Epoch 11, Time: 48.821776390075684s, Loss: 0.026549018919467926, Train Accuracy: 99.31333333333333%
Epoch 21, Time: 92.3618688583374s, Loss: 0.005918064620345831, Train Accuracy: 99.76333333333334%
Epoch 31, Time: 135.0185136795044s, Loss: 0.00995405949652195, Train Accuracy: 99.855%
Epoch 41, Time: 178.35860657691956s, Loss: 0.009199175052344799, Train Accuracy: 99.93833333333333%


### 9. Apply adversarial training with FGSM Attack

$$ x^* = x + \epsilon \cdot sgn(\triangledown_x L(\theta, x, y)) $$

In [11]:
# FGSM attack code
def FGSM(model, images, labels, epsilon):
    
    images.requires_grad = True
    
    output = model(images).cuda()
            
    loss = F.cross_entropy(output, labels).cuda()
            
    loss.backward()
            
    # Collect the element-wise sign of the data gradient
    sign_data_grad = images.grad.sign()
    
    # Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = images + epsilon*sign_data_grad
    
    # Adding clipping to maintain [0,1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    # Return the perturbed image
    return perturbed_image

In [12]:
train(fgsm_model,epoch,fgsm_optimizer,'FGSM')

Epoch 1, Time: 6.0473597049713135s, Loss: 0.4133017659187317, Train Accuracy: 75.07000000000001%
Epoch 11, Time: 66.52704954147339s, Loss: 0.0509895384311676, Train Accuracy: 97.485%
Epoch 21, Time: 125.90949821472168s, Loss: 0.03253510594367981, Train Accuracy: 98.49333333333334%
Epoch 31, Time: 184.2856090068817s, Loss: 0.03364289551973343, Train Accuracy: 99.14%
Epoch 41, Time: 245.06063151359558s, Loss: 0.018948663026094437, Train Accuracy: 99.52499999999999%


### 10.  Apply adversarial training with PGD Attack

$$x^{t+1} = \Pi_{x+S}(x^t+\alpha\cdot sgn(\triangledown_x L(\theta, x, y)))$$
* $S$ : a set of allowed perturbations

In [13]:
def PGD(model, images, labels, eps=0.3, alpha=8/255, iters=10) :
    
    # random initialization
    ori_images = images.data
    x_adv = images +  ori_images.new(ori_images.size()).uniform_(-eps, eps)
        
    for i in range(iters) :    
        x_adv.requires_grad = True
        
        output = model(x_adv)
        
        loss = F.cross_entropy(output, labels).cuda()
        
        loss.backward()
        
        x_adv = x_adv.detach() + alpha * x_adv.grad.sign()
        
        # projection
        x_adv = torch.min(torch.max(x_adv, ori_images - eps), ori_images + eps)
            
        x_adv.clamp_(0, 1)
        
    return x_adv

In [14]:
train(pgd_model,epoch, pgd_optimizer, 'PGD')

Epoch 1, Time: 19.581053495407104s, Loss: 0.6096667647361755, Train Accuracy: 58.36666666666667%
Epoch 11, Time: 216.9792833328247s, Loss: 0.11895622313022614, Train Accuracy: 94.91833333333334%
Epoch 21, Time: 414.06174325942993s, Loss: 0.03309975564479828, Train Accuracy: 96.78999999999999%
Epoch 31, Time: 611.2263371944427s, Loss: 0.07283598184585571, Train Accuracy: 97.80499999999999%
Epoch 41, Time: 807.3649859428406s, Loss: 0.05246616527438164, Train Accuracy: 98.53333333333333%


### 11. Testing

In [25]:
# Testing natural model on different samples
print('Natural Training: \n')
print('on normal test samples: ')
evaluate(ori_model, None)
print('\non FGSM samples: ')
evaluate(ori_model, 'FGSM')
print('\non PGD samples: ')
evaluate(ori_model, 'PGD')
print('--------------------')


# Testing FGSM model on different samples
print('FGSM Adversarial Training: \n')
print('on normal test samples: ')
evaluate(fgsm_model, None)
print('\non FGSM samples: ')
evaluate(fgsm_model, 'FGSM')
print('\non PGD samples: ')
evaluate(fgsm_model, 'PGD')
print('--------------------')


# Testing PGD model on different samples
print('PGD Adversarial Training: \n')
print('on normal test samples: ')
evaluate(pgd_model, None)
print('\non FGSM samples: ')
evaluate(pgd_model, 'FGSM')
print('\non PGD samples: ')
evaluate(pgd_model, 'PGD')

Natural Training: 

on normal test samples: 
Test Accuracy: 99.15%

on FGSM samples: 
Test Accuracy: 80.39%

on PGD samples: 
Test Accuracy: 0.06%
--------------------
FGSM Adversarial Training: 

on normal test samples: 
Test Accuracy: 99.39%

on FGSM samples: 
Test Accuracy: 96.87%

on PGD samples: 
Test Accuracy: 4.47%
--------------------
PGD Adversarial Training: 

on normal test samples: 
Test Accuracy: 99.11999999999999%

on FGSM samples: 
Test Accuracy: 98.04%

on PGD samples: 
Test Accuracy: 94.38%
