

#  A.I. Seminar Project
## Defending Against Adversarial Examples



# Fadji OHOUKOH

In the context of adversarial attacks on machine learning models, a defense scheme is a strategy or method used to protect the model from these attacks. The goal of a defense scheme is to make the model robust against adversarial examples, which are input data designed to mislead the model into making incorrect predictions.

There are several types of defense schemes, including:

1. **Adversarial Training**: This involves including adversarial examples in the training data and retraining the model. The idea is to make the model aware of the kind of perturbations it might face, so it can learn to correctly classify even adversarially perturbed inputs.

2. **Defensive Distillation**: This is a process where a second model is trained to mimic the output of the original model but with a softened output distribution. The second model, called the distilled model, is less likely to be affected by small perturbations in the input space, making it more robust against adversarial attacks.

3. **Feature Squeezing**: This reduces the search space available to an adversary by coalescing similar inputs into one, thereby limiting the effectiveness of adversarial perturbations.

4. **Gradient Masking or Obfuscation**: These methods aim to hide the gradients of the model from the attacker, making it harder to craft effective adversarial examples.

5. **Regularization Techniques**: These methods add a penalty term to the loss function during training to encourage the model to learn a simpler (and hopefully more robust) function.

Each of these defense schemes has its own strengths and weaknesses, and the effectiveness of a particular scheme can depend on the specific model and threat model. In practice, it's common to use a combination of multiple defense schemes to achieve the best robustness.

In [1]:
from torchvision import datasets, transforms

# Define a transform to convert images to PyTorch tensors
transform = transforms.ToTensor()
mnist = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
# Load the MNIST dataset
data = mnist.data
target = mnist.targets  # Use mnist.targets instead of mnist.target
#Defining the training set 
X_train=data
y_train=target
#Defining the test set 
mnist1=datasets.MNIST(root='./data',train=False,transform=transform , download=True)
X_test=mnist1.data / 255
y_test=mnist1.targets / 255
# Print the shapes of the train and test sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: torch.Size([60000, 28, 28])
Shape of X_test: torch.Size([10000, 28, 28])
Shape of y_train: torch.Size([60000])
Shape of y_test: torch.Size([10000])


In [2]:
import torch
from torch import nn
import torch.nn.functional as F

class ModelM(nn.Module):
    def __init__(self):
        super(ModelM, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, bias=False)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, bias=False)
        self.fc1 = nn.Linear(64 * 5 * 5, 64, bias=False)
        self.fc2 = nn.Linear(64, 512, bias=False)
        self.fc3 = nn.Linear(512, 10, bias=False)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = ModelM()

# Some Useful explanation

# White Box setting

In the context of adversarial attacks on machine learning models, a "white-box" setting refers to a scenario where the attacker has complete knowledge of the model. This includes the architecture of the model, the parameters (weights and biases), the training method, and even the specific data points used for training.



This is in contrast to a "black-box" setting, where the attacker only has access to the inputs to the model and the corresponding outputs, without knowing the details of the model's architecture or parameters.

The white-box setting represents a worst-case scenario from a security perspective, as it assumes the attacker has maximum knowledge. Therefore, defenses that are effective in the white-box setting are considered to be very robust. However, it's also a less realistic scenario, as attackers in real-world situations are unlikely to have complete knowledge of the model.

In [4]:
criterion = nn.CrossEntropyLoss()

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
batch_size = 150
X_train_reshaped = X_train.unsqueeze(1).to(torch.float32)
y_train = y_train.to(torch.long)

# Create a DataLoader for the training set
train_dataset = TensorDataset(X_train_reshaped, y_train)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

In [7]:
import torch
from torch.optim.lr_scheduler import StepLR
from torch.nn import CrossEntropyLoss


model = ModelM()

criterion = CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-2)  # L2 regularization
scheduler = StepLR(optimizer, step_size=3, gamma=0.1)
epsilon = 0.3
alpha = 0.01
k = 40
Q = [0, 1]  # Assuming pixel values are in the range [0, 1]
num_epochs = 10  # Increase the number of epochs
best_val_acc = 0  # For early stopping
total_predictions =0
total_correct=0
model.train()

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        # Perform the l∞ PGD attack
        inputs_adv = inputs.data + epsilon * (2 * torch.rand_like(inputs) - 1)
        inputs_adv.requires_grad = True

        for _ in range(k):
            outputs_adv = model(inputs_adv)
            loss_adv = criterion(outputs_adv, labels)
            loss_adv.backward()

            inputs_adv_grad = alpha * torch.sign(inputs_adv.grad.data)
            inputs_adv = inputs_adv.detach() + inputs_adv_grad
            inputs_adv = torch.min(torch.max(inputs_adv, inputs - epsilon), inputs + epsilon)
            inputs_adv = torch.clamp(inputs_adv, Q[0], Q[1])  # Clip to valid pixel range
            inputs_adv.requires_grad = True

        # Update the model
        outputs = model(inputs_adv)
        loss = criterion(outputs, labels)

        # Calculate accuracy
        _, predicted = torch.max(outputs.data, 1)
        total_predictions += labels.size(0)
        total_correct += (predicted == labels).sum().item()

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Batch {i+1}/{len(train_loader)}, Loss: {loss.item()}")

    # Decay learning rate
    scheduler.step()

    # Early stopping
    #val_acc = evaluate(model, data_loader)  # You need to implement the evaluate function
    #if val_acc > best_val_acc:
    #    best_val_acc = val_acc
    #else:
    #    print("Early stopping")
    #    break

# Calculate and print the accuracy for the total training set
total_accuracy = total_correct / total_predictions * 100
print(f"Total training accuracy: {total_accuracy}%")

print('Finished Adversarial Training')

Epoch 1/10, Batch 100/400, Loss: 1.1453036069869995
Epoch 1/10, Batch 200/400, Loss: 0.5568670630455017
Epoch 1/10, Batch 300/400, Loss: 0.396942675113678
Epoch 1/10, Batch 400/400, Loss: 0.4898703992366791
Epoch 2/10, Batch 100/400, Loss: 0.3437506854534149
Epoch 2/10, Batch 200/400, Loss: 0.3396545946598053
Epoch 2/10, Batch 300/400, Loss: 0.39722955226898193
Epoch 2/10, Batch 400/400, Loss: 0.478738397359848
Epoch 3/10, Batch 100/400, Loss: 0.5156795382499695
Epoch 3/10, Batch 200/400, Loss: 0.3717421889305115
Epoch 3/10, Batch 300/400, Loss: 0.2956853210926056
Epoch 3/10, Batch 400/400, Loss: 0.40921634435653687
Epoch 4/10, Batch 100/400, Loss: 0.29066962003707886
Epoch 4/10, Batch 200/400, Loss: 0.29348745942115784
Epoch 4/10, Batch 300/400, Loss: 0.26768726110458374
Epoch 4/10, Batch 400/400, Loss: 0.22766812145709991
Epoch 5/10, Batch 100/400, Loss: 0.2814520299434662
Epoch 5/10, Batch 200/400, Loss: 0.2984856367111206
Epoch 5/10, Batch 300/400, Loss: 0.30899232625961304
Epoch 5

In [16]:
# Assuming X_test is your input test data and y_test are your test labels
X_test_reshaped = X_test.unsqueeze(1).to(torch.float32)
y_test = y_test.to(torch.long)

# Create a DataLoader for the test set
test_dataset = TensorDataset(X_test_reshaped, y_test)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size)

# Switch the model to evaluation mode
model.eval()

total_correct = 0
total_predictions = 0

# No need to track gradients for test data, so we use torch.no_grad()
with torch.no_grad():
    for inputs, labels in test_loader:
        # Perform the l∞ PGD attack
        inputs_adv = inputs.data + epsilon * (2 * torch.rand_like(inputs) - 1)
        #inputs_adv.requires_grad = True

        for _ in range(k):
            outputs_adv = model(inputs_adv)
            loss_adv = criterion(outputs_adv, labels)
            #loss_adv.backward()

            inputs_adv_grad = alpha * torch.sign(inputs_adv)
            inputs_adv = inputs_adv.detach() + inputs_adv_grad
            inputs_adv = torch.min(torch.max(inputs_adv, inputs - epsilon), inputs + epsilon)
            inputs_adv = torch.clamp(inputs_adv, Q[0], Q[1])  # Clip to valid pixel range
            inputs_adv.requires_grad = True

        # Forward pass
        outputs = model(inputs_adv)

        # Calculate accuracy
        _, predicted = torch.max(outputs.data, 1)
        total_predictions += labels.size(0)
        total_correct += (predicted == labels).sum().item()

# Calculate and print the accuracy for the total test set
total_accuracy = total_correct / total_predictions * 100
print(f"Total test accuracy: {total_accuracy}%")

Total test accuracy: 10.01%


In [18]:
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # No need to calculate gradients
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the model on clean test images: %d %%' % (100 * correct / total))

Accuracy of the model on clean test images: 9 %


In [10]:
# Save the model's state dictionary
torch.save(model.state_dict(), "D:\Master_2022-2024\M2\AI_SEMINAR\Project\Model_PGD.pth")

In [15]:
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # No need to calculate gradients
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the model on clean test images: %d %%' % (100 * correct / total))

Accuracy of the model on clean test images: 9 %


In [13]:
# Save the model's state dictionary
torch.save(model.state_dict(), "D:\Master_2022-2024\M2\AI_SEMINAR\Project\Model_PGD.pth")

# Adversial testing

In [None]:
# Assuming X_test is your input test data and y_test are your test labels
X_test_reshaped = X_test.unsqueeze(1).to(torch.float32)
y_test = y_test.to(torch.long)

# Create a DataLoader for the test set
test_dataset = TensorDataset(X_test_reshaped, y_test)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size)

# Switch the model to evaluation mode
model.eval()

total_correct = 0
total_predictions = 0

# No need to track gradients for test data, so we use torch.no_grad()
with torch.no_grad():
    for inputs, labels in test_loader:
        # Perform the l∞ PGD attack
        inputs_adv = inputs.data + epsilon * (2 * torch.rand_like(inputs) - 1)
        #inputs_adv.requires_grad = True

        for _ in range(k):
            outputs_adv = model(inputs_adv)
            loss_adv = criterion(outputs_adv, labels)
            #loss_adv.backward()

            inputs_adv_grad = alpha * torch.sign(inputs_adv)
            inputs_adv = inputs_adv.detach() + inputs_adv_grad
            inputs_adv = torch.min(torch.max(inputs_adv, inputs - epsilon), inputs + epsilon)
            inputs_adv = torch.clamp(inputs_adv, Q[0], Q[1])  # Clip to valid pixel range
            inputs_adv.requires_grad = True

        # Forward pass
        outputs = model(inputs_adv)

        # Calculate accuracy
        _, predicted = torch.max(outputs.data, 1)
        total_predictions += labels.size(0)
        total_correct += (predicted == labels).sum().item()

# Calculate and print the accuracy for the total test set
total_accuracy = total_correct / total_predictions * 100
print(f"Total test accuracy: {total_accuracy}%")

# Normal Testing with the MNIST test set

In [None]:
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # No need to calculate gradients
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the model on clean test images: %d %%' % (100 * correct / total))

# Another way optional

In [None]:
x_train_reshaped = X_train.unsqueeze(1).to(torch.float32)
y_train = y_train.to(torch.long)
train = TensorDataset(x_train_reshaped, y_train)
train_loader = DataLoader(train, batch_size=128, shuffle=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

total_prediction = 0
total_predicted = 0
Loss = []
num_epochs = 10

# Wrapping the model with Foolbox's PyTorchModel
preprocessing = dict(mean=[0.1307], std=[0.3081], axis=-3)
fmodel = PyTorchModel(model, bounds=(0, 1), preprocessing=preprocessing)

# ADAM optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# The linfinity PGD attack
attack = LinfPGD()

for epoch in range(num_epochs):
    for batch, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        # Set the model to evaluation mode before the attack
        model.eval()
        data = data/ data.max().item()
        # Generate adversarial examples
        _, advs, success = attack(fmodel, data, target, epsilons=[0.3])
        #print(type(advs))
        # Normalize advs by dividing each element by the maximum value
        max_advs = max(advs, key=lambda x: x.max().item())
        normalized_advs = [adv / max_advs.max().item() for adv in advs]
        # Set the model back to training mode after the attack
        model.train()

        # Print information about input bounds and adversarial example range
        #print(f"Model Bounds: {fmodel.bounds}")
        #print(f"Adversarial Examples Range: [{advs.min().item()}, {advs.max().item()}]")

        # Verify that adversarial examples are within the model's input bounds
        #assert advs.min().item() >= 0 and advs.max().item() <= 1
        advs = torch.stack(advs)
        #print(advs.size())
        advs = advs.squeeze(0)
        #print(advs.size())
        # Use the adversarial examples for training
        optimizer.zero_grad()
        output = model(advs)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        _, predicted = torch.max(output, axis=1)
        total_predicted += (predicted == target).sum().item()
        total_prediction += target.size(0)
        if (batch % 100) == 0:
            Loss.append(loss.item())

perturbed_accuracy = total_predicted / total_prediction * 100
print(f'Training accuracy: {perturbed_accuracy}%')

## Black-box setting

The SPSA (Simultaneous Perturbation Stochastic Approximation) attack is a black-box adversarial attack that doesn't require access to the model's gradients. Instead, it estimates the gradient using random perturbations. This makes it effective against models that use gradient masking or obfuscation as a defense.

We can  implement these attacks using Foolbox by  using the `foolbox.attacks.SPSA` class for the SPSA attack, and the `foolbox.attacks.TransferAttack` class for the transfer attack. Here's an example:



 A high accuracy under the SPSA attack means that the model is robust against this specific type of attack. It doesn't directly indicate whether the model is exhibiting gradient masking.

Gradient masking refers to the phenomenon where the gradients of a model do not provide useful information for crafting adversarial examples. This can make the model appear robust against gradient-based attacks, while it might still be vulnerable to other types of attacks, especially those that do not rely on gradients, like the SPSA attack.

If a model is robust against both gradient-based attacks and non-gradient-based attacks like SPSA, it's a good sign that the model is genuinely robust, not just exhibiting gradient masking. 

However, if a model appears robust against gradient-based attacks but is vulnerable to the SPSA attack, it could be a sign of gradient masking. In this case, the model's apparent robustness against gradient-based attacks might be due to the gradients being uninformative, not because the model is genuinely robust.

In [None]:
import torch

def spsa_attack(f, x0, delta, alpha, n, epsilon, T):
    D = x0.numel()
    x = x0.clone().detach().requires_grad_(True)

    for t in range(T):
        # Sample v
        v = torch.randint(0, 2, size=(n, D)) * 2 - 1  # {1, -1}^D
        v = v.to(x0.device)

        # Calculate g
        x_plus_delta_v = x + delta * v
        x_minus_delta_v = x - delta * v
        g = (f(x_plus_delta_v) - f(x_minus_delta_v)) * v / (2 * delta)

        # Update x
        x_prime = x - alpha * g.mean(dim=0)
        
        # Project x
        diff = x_prime - x0
        diff = diff / diff.norm() * min(diff.norm(), epsilon)
        x = x0 + diff

    return x

# Transfer attacks

Transfer attacks involve generating adversarial examples using one model (the source model), and then testing them on another model (the target model). This simulates a black-box attack scenario where the attacker doesn't have direct access to the target model's architecture or parameters.

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from foolbox import PyTorchModel, accuracy, samples
from foolbox.attacks import LinfPGD


# source model
class SourceModel(nn.Module):
    def __init__(self):
        super(SourceModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, bias=False)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, bias=False)
        self.fc1 = nn.Linear(64 * 6 * 6, 128, bias=False)
        self.fc2 = nn.Linear(128, 512, bias=False)
        self.fc3 = nn.Linear(512, 10, bias=False)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

source_model = SourceModel()

# Create Foolbox models
fmodel_source = PyTorchModel(source_model, bounds=(0, 1))
fmodel_target = PyTorchModel(model, bounds=(0, 1))

# Generate adversarial examples using the source model
attack = LinfPGD()
epsilons = [0.1, 0.2, 0.3, 0.4, 0.5]
_, advs_transfer, success_transfer = attack(fmodel_source, images, labels, epsilons=epsilons)

# Calculate and print the accuracy of the defended model on the adversarial examples
print('Accuracy on SPSA adversarial examples:', accuracy(fmodel_target, advs_spsa, labels))
print('Accuracy on transfer adversarial examples:', accuracy(fmodel_target, advs_transfer, labels))