# Jacobian-based Saliency Map Attack

In this notebook, we will be implementing a JSMA, or Jacobian-based Saliency Map Attack.

Before getting into the attack, let us first set up some prerequisites.

In [6]:
# Importing all required packages
import torch
from torchvision import transforms
from torchvision.utils import save_image
from mnist_model_generator import Net
from PIL import Image
import math
import matplotlib.pyplot as plt

We trained a model which is capable of classifying images from the MNIST dataset in advance.
The details of this model can be found in the `mnist_model_generator.py` file, which is included in the same folder as this notebook.

In [7]:
# Loading our pre-trained MNIST model.
model = Net()
model.load_state_dict(torch.load('../../data/models/mnist_cnn.pt', map_location=torch.device('cpu')))
model.eval();

In [8]:
# We will load the image that we will be testing our attack on.
three = Image.open("../../data/pictures/3.png")
preprocess = transforms.Compose([
   transforms.Resize(28),
   transforms.ToTensor(),
   transforms.Normalize((0.1307,), (0.3081,))
])
three_tensor = preprocess(three)[0].reshape(1,1,28,28)

The following image will be used:

![](../../data/pictures/3.png)

Finally, we will run our model on the original image, to make sure the result is '3', as one would expect.

In [9]:
print(f'The model predicted: {model(three_tensor).argmax().item()} with {model(three_tensor).max().item() * 100}% certainty.')

The model predicted: 3 with 100.0% certainty.


## Attacking the model

First, let us define what a JSMA is, following the approach by Papernot et al. as referenced in our report document.

The JSMA is a targeted attack: we apply it to an image and tell the attack which class we want the model to predict after the attack. In other words, we have a target.
We will come back to this fact later on, in the analysis.

We start by calculating the Jacobian of the model's output tensor with respect to the input image.

This means the Jacobian represents how much a change in each individual pixel changes the probability that each possible class is selected.
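To make the shape and meaning of the Jacobian concrete, here is a minimal, dependency-free sketch. It is an illustration only: `toy_model`, `numerical_jacobian` and the weights are hypothetical stand-ins, and the derivatives are approximated with central differences rather than computed with `torch.autograd.functional.jacobian` as in the rest of this notebook.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(x, weights):
    # One linear layer followed by softmax: one row of weights per class.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)

def numerical_jacobian(f, x, eps=1e-6):
    # J[c][p] approximates d f(x)[c] / d x[p] using central differences.
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for p in range(len(x)):
        up, down = list(x), list(x)
        up[p] += eps
        down[p] -= eps
        f_up, f_down = f(up), f(down)
        for c in range(n_out):
            J[c][p] = (f_up[c] - f_down[c]) / (2 * eps)
    return J

weights = [[1.0, -1.0, 0.5], [-0.5, 2.0, 0.0]]  # 2 classes, 3 "pixels" (made up)
x = [0.2, 0.4, 0.6]
J = numerical_jacobian(lambda v: toy_model(v, weights), x)
for c, row in enumerate(J):
    print(f"class {c}:", [round(v, 4) for v in row])
```

Each row belongs to one class and each column to one pixel. Because softmax outputs always sum to 1, every column of this toy Jacobian sums to approximately zero: whatever probability one class gains from a pixel, the other classes lose.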

Using the Jacobian, we will build a saliency map. Note that a saliency map can be used to either increase the pixel values that would improve the probability of our target class being selected, or to decrease the pixel values that would improve the probability of another class being selected.

In the first case, the saliency map S is defined in the following way:

![](images/S_positive.png)

In the latter case, it is instead defined as follows:

![](images/S_negative.png)

The difference lies in the signs: the first case looks for pixels that increase the chance of our target being selected while decreasing the combined chance of the other classes being selected. The second case looks for pixels with the exact opposite effect, which are then decreased.
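As a small sketch of these two definitions, assuming the images above show the saliency maps as defined by Papernot et al., the per-pixel value can be written as follows. The function name `saliency_value` is our own choice for illustration; `alpha` stands for the derivative of the target class with respect to the pixel, and `beta` for the summed derivatives of all other classes.

```python
def saliency_value(alpha, beta, increase=True):
    """Saliency of one pixel.

    alpha: derivative of the target class output w.r.t. the pixel.
    beta:  sum of the derivatives of all other class outputs w.r.t. the pixel.
    """
    if increase:
        # Reject pixels that hurt the target or help the other classes.
        if alpha < 0 or beta > 0:
            return 0.0
        return alpha * abs(beta)
    # Decreasing variant: the sign conditions are mirrored.
    if alpha > 0 or beta < 0:
        return 0.0
    return abs(alpha) * beta

print(saliency_value(0.5, -0.2))                   # useful pixel to increase
print(saliency_value(0.5, 0.2))                    # rejected: also helps other classes
print(saliency_value(-0.5, 0.2, increase=False))   # useful pixel to decrease
```

A pixel therefore only receives a non-zero saliency when both sign conditions hold, and among those pixels, larger gradients in both terms give a larger value.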

In [10]:
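def saliency_map(J, t, space, size, width):
    # For every pixel p still in the search space, store the pair (alpha, beta):
    # alpha is the derivative of the target class t with respect to p,
    # beta is the summed derivative of all other classes with respect to p.
    S = [(0, 0)] * size
    for p in space:
        row, col = p // width, p % width
        alpha = J[t, row, col].item()
        beta = 0
        for i in range(J.size(0)):
            if i != t:
                beta += J[i, row, col].item()
        S[p] = (alpha, beta)
    return S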

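We now have, for every pixel in the search space, the two gradient terms from which its saliency is computed. The sign conditions that distinguish the increasing variant ('good' pixels) from the decreasing variant ('bad' pixels) are applied in the next step, when we search for the pixels to modify.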

We will then search for the pair of pixels whose combined effect maximally increases the chance of our target being selected while decreasing the chance that other classes get selected, or the inverse if we are working with the decreasing variant.

After finding this pair, we will set both pixel values to 1 if increasing and to 0 if decreasing.

We will remove both pixels from our search space, so that we never touch them again later in the algorithm.
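The pair search can be sketched in isolation as follows. This is a simplified illustration with made-up gradient values; `best_pair` is a hypothetical helper, and `S` maps a pixel index to its `(alpha, beta)` pair. Because the score of a pair does not depend on the order of its two pixels, the sketch visits each unordered pair once using `itertools.combinations`.

```python
from itertools import combinations

def best_pair(S, search_space, increase=True):
    """Return ((p, q), score) for the best pixel pair, or (None, 0.0) if none qualifies."""
    best, best_score = None, 0.0
    for p, q in combinations(search_space, 2):
        alpha = S[p][0] + S[q][0]
        beta = S[p][1] + S[q][1]
        if increase:
            ok = alpha > 0 and beta < 0
        else:
            ok = alpha < 0 and beta > 0
        if ok and -alpha * beta > best_score:
            best, best_score = (p, q), -alpha * beta
    return best, best_score

# Made-up (alpha, beta) values for four pixels.
S = {0: (0.3, -0.1), 1: (0.1, -0.4), 2: (-0.2, 0.5), 3: (0.4, 0.1)}
pair, score = best_pair(S, [0, 1, 2, 3])
print(pair, round(score, 3))
```

In this toy example, pixels 0 and 1 together give the largest product of target gain and loss for the other classes, so they would be set to 1 and removed from the search space.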

We will repeat this process until one of three cases occurs:
- The model predicts our target class instead of the actual class.
- Our image has been modified more than some threshold `max_dist` allows. This means the attack has failed to create an adversarial example for this input. In the implementation, this threshold is translated into a maximum number of iterations, using the simple calculation explained in Papernot et al.
- We have modified each pixel. This is a clear failure to create an adversarial example.
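As a worked example of the second stopping condition, the snippet below reproduces the iteration limit used in this notebook, assuming `max_dist` is a percentage of the image and that each iteration modifies two pixels. The helper name `max_iterations` is only used here for illustration.

```python
import math

def max_iterations(num_pixels, max_dist_percent, pixels_per_step=2):
    # Each iteration changes `pixels_per_step` pixels, and max_dist is a percentage,
    # hence the division by (pixels_per_step * 100).
    return math.floor(num_pixels * max_dist_percent / (pixels_per_step * 100))

iterations = max_iterations(28 * 28, 20)
print(iterations)                       # number of allowed iterations
print(2 * iterations / (28 * 28))       # fraction of pixels that can be modified
```

For a 28x28 image and `max_dist = 20`, this gives 78 iterations, so at most 156 of the 784 pixels (just under 20%) can be modified.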

In [16]:
def jsma(original_image_tensor, target, predictor, max_dist, increase, normalize=True):
    img_tensor = original_image_tensor.clone()

    # Normalize the data to the range [0,1] before the attack
    if normalize:
        img_tensor = img_tensor.reshape(28,28)
        
        min_val = torch.min(img_tensor.reshape(784))
        max_val = torch.max(img_tensor.reshape(784))

        img_tensor = torch.sub(img_tensor, min_val)
        img_tensor = torch.div(img_tensor, max_val - min_val)

    img_tensor = img_tensor.reshape(1,1,28,28)
    
    img_size = img_tensor.size(2) * img_tensor.size(3)
    width = img_tensor.size(3)
    search_space = list(range(img_size))
    i = 0
    max_iter = math.floor((img_size * max_dist) / (200))
    chosen_pixel_1 = -1
    chosen_pixel_2 = -1
    prediction = predictor(img_tensor)

    while not prediction.argmax().item() == target and i < max_iter and len(search_space) >= 2:
        best_score = 0
        # Generate the Jacobian
        J = torch.autograd.functional.jacobian(predictor, img_tensor)[0, :, 0, 0, :, :]

        # Generate the saliency map
        S = saliency_map(J, target, search_space, img_size, width)

        # Find the optimal pair of pixels
        for pixel1 in search_space:
            for pixel2 in search_space:
                if pixel1 == pixel2:
                    continue

                alpha = S[pixel1][0] + S[pixel2][0]
                beta = S[pixel1][1] + S[pixel2][1]

                sign_check = (alpha > 0 and beta < 0) if increase else (alpha < 0 and beta > 0)
                if sign_check and -alpha * beta > best_score:
                    chosen_pixel_1 = pixel1
                    chosen_pixel_2 = pixel2
                    best_score = -alpha * beta

        # No pair found that would improve the current state.
        if best_score == 0:
            break

        # Adjust the pixel values according to which version we use.
        img_tensor[0, 0, chosen_pixel_1 // width, chosen_pixel_1 % width] = 1 if increase else 0
        img_tensor[0, 0, chosen_pixel_2 // width, chosen_pixel_2 % width] = 1 if increase else 0

        # Remove the chosen pixels from the search space.
        search_space.remove(chosen_pixel_1)
        search_space.remove(chosen_pixel_2)
        
        # Predict the current adversarial image to check whether we need to continue.
        prediction = predictor(img_tensor)
        i += 1
    return img_tensor

Now that we have worked through the JSMA implementation, we can test it.

We will run this attack on our example image once for every target class except the actual class.

Additionally, we will run both the increase and decrease variant.

Finally, the following experiment was done by Papernot et al., and we thought it was interesting enough to recreate it and briefly analyze its results.
This experiment consists of giving our attack a target, but supplying an empty image.
This asks the attack to generate the minimal set of pixels an image needs for the model to predict a given class.

In [20]:
attacked_models_positive = []
for i in range(10):
    if i == 3:
        attacked_models_positive.append(None)
        continue
    attacked_models_positive.append(jsma(three_tensor, i, model, 20, True))
    print(f'Classified as {model(attacked_models_positive[i]).argmax().item()} with goal {i}')
    save_image(attacked_models_positive[i][0,0], f'../../results/JSMA/positive-{i}.png')

attacked_models_negative = []
for i in range(10):
    if i == 3:
        attacked_models_negative.append(None)
        continue
    attacked_models_negative.append(jsma(three_tensor, i, model, 20, False))
    print(f'Classified as {model(attacked_models_negative[i]).argmax().item()} with goal {i}')
    save_image(attacked_models_negative[i][0,0], f'../../results/JSMA/negative-{i}.png')

attacked_models_empty = []
for i in range(10):
    attacked_models_empty.append(jsma(torch.zeros_like(three_tensor), i, model, 20, True, False))
    print(f'Classified as {model(attacked_models_empty[i]).argmax().item()} with goal {i}')
    save_image(attacked_models_empty[i][0,0], f'../../results/JSMA/empty-{i}.png')

Classified as 0 with goal 0
Classified as 1 with goal 1
Classified as 2 with goal 2
Classified as 4 with goal 4
Classified as 5 with goal 5
Classified as 6 with goal 6
Classified as 7 with goal 7
Classified as 8 with goal 8
Classified as 9 with goal 9
Classified as 0 with goal 0
Classified as 1 with goal 1
Classified as 2 with goal 2
Classified as 4 with goal 4
Classified as 5 with goal 5
Classified as 6 with goal 6
Classified as 7 with goal 7
Classified as 8 with goal 8
Classified as 9 with goal 9
Classified as 0 with goal 0
Classified as 1 with goal 1
Classified as 2 with goal 2
Classified as 3 with goal 3
Classified as 4 with goal 4
Classified as 5 with goal 5
Classified as 6 with goal 6
Classified as 7 with goal 7
Classified as 8 with goal 8
Classified as 9 with goal 9


As many results were generated by this experiment, we will only analyze the three most interesting results for each of the increase and decrease variants.
Additionally, we will select only two results from the final experiment, as nearly all results are similar and only one stands out.

### Analyzing the positive variant

As can be seen in the output of the previous experiments, the positive variant was able to fool the model with a 100% success rate.
We will now display three of the most interesting results:

<figure>
    <img src=../../results/JSMA/positive-2.png width=140>
    <figcaption>Classified as a 2</figcaption>
</figure>
<figure>
    <img src=../../results/JSMA/positive-8.png width=140>
    <figcaption>Classified as an 8</figcaption>
</figure>
<figure>
    <img src=../../results/JSMA/positive-4.png width=140>
    <figcaption>Classified as a 4</figcaption>
</figure>

As you can see, the classifier has been fooled into giving an incorrect prediction, but the images are clearly modified.
Humans may also fall for some of these illusions.

The first one is classified as a 2, while being an image of a 3.
The issue here is that humans could argue that this image no longer shows a clear 3, as it resembles a mirrored 6.
For an optimal attack, it should be difficult for humans to notice that an attack has occurred at all.

Similarly, the second image is classified as an 8, while being an image of a 3.
We can clearly see why the model considers it an 8, as the pixels that have been added have made the 3 look like an 8.
Therefore, once again, while the prediction is incorrect, we cannot state with confidence that we have fooled the model.

Finally, the last image is classified as a 4, while being an image of a 3.
This image has been modified to such an extent that a reasonable percentage of human viewers would consider this adversarial example a 9, rather than a 3.
So, once again, we cannot fault the model for being tricked by this image.

From these results, one might think to decrease the `max_dist` parameter, so that the resulting images remain true to the original.
However, with lower `max_dist` values, the resulting images did not get much less messy, and the success rate dropped by a noticeable amount.
Therefore, while the positive JSMA is effective at creating an adversarial example that will be classified incorrectly, it modifies the original image too much to be considered a powerful attack.
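One way to put a number on "modifies the original image too much" is to count the fraction of pixels that changed. The helper below is a hypothetical sketch that works on plain nested lists (a 2D tensor can be converted with `.tolist()`). It is not part of the original experiments. Note that `jsma` rescales the image to the range [0, 1] before attacking, so a fair comparison would use the rescaled original rather than `three_tensor` directly.

```python
def l0_distortion(original, adversarial, tol=1e-6):
    """Fraction of pixels whose value differs by more than `tol`."""
    flat_o = [v for row in original for v in row]
    flat_a = [v for row in adversarial for v in row]
    if len(flat_o) != len(flat_a):
        raise ValueError("images must have the same number of pixels")
    changed = sum(1 for o, a in zip(flat_o, flat_a) if abs(o - a) > tol)
    return changed / len(flat_o)

# Tiny 2x2 example: two of the four pixels were set to 1.
original = [[0.0, 0.5], [1.0, 0.0]]
adversarial = [[1.0, 0.5], [1.0, 1.0]]
print(l0_distortion(original, adversarial))  # 0.5
```

With `max_dist = 20`, this fraction cannot exceed roughly 0.2 for a 28x28 image, since at most 156 of the 784 pixels are touched.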

One possible improvement would be to decrease the amount by which we increase the optimal pair of pixels.
However, Papernot et al. stated that a maximum increase would yield optimal results, so we stuck with this assumption.
Instead, smaller steps were explored in another attack we performed, namely the improvement on JSMA: the JSMA-M attack.

### Analyzing the negative variant

As can be seen in the output of the previous experiments, the negative variant was able to fool the model with a 100% success rate as well.
We will now display three of the most interesting results:

<figure>
    <img src=../../results/JSMA/negative-2.png width=140>
    <figcaption>Classified as a 2</figcaption>
</figure>
<figure>
    <img src=../../results/JSMA/negative-5.png width=140>
    <figcaption>Classified as a 5</figcaption>
</figure>
<figure>
    <img src=../../results/JSMA/negative-1.png width=140>
    <figcaption>Classified as a 1</figcaption>
</figure>

We immediately notice that none of these images resembles another digit more than it resembles a 3, which is an improvement over the previous experiment.

However, another issue becomes clear.
Some of the images have been attacked to the point of no longer clearly representing a 3.
Take the final image as an example, which is classified as a 1.
We have fooled the model into thinking this is a 1, but we have ruined the image in the process.

Unlike in the previous experiment, not all images share this downside.
The images that were classified as 2 and 5 have not been attacked to the point where they are unrecognizable.
It is clear that an attack has occurred, but any human would still classify these images as a 3, while the model claims otherwise.

This can be considered a successful attack.

Of course, whether the positive or negative variant is more successful also depends on the input.
For a 3, it makes sense that removing pixel values would be more effective than adding pixel values.

On the other hand, for a 1, for example, adding pixel values to the image would have a much larger chance to succeed.

This is the downside of targeted attacks.
While they can be very powerful, human coaching is needed to get proper results.
But when this human guidance is given, we can tell from the experiments that this method can indeed fool a model by generating an adversarial example.
In the notebook on JSMA-M, we will instead analyze a non-targeted approach.

### Bonus: Analyzing the empty variant

Finally, as mentioned during the attacking stage, we applied the positive attack on an empty image in order to view the most distinctive features of a given class.
Almost all classes behaved fairly similarly, showing us exactly which pixel values are representative of a given class.
An example of this is the following, which was an attempt at classifying the empty image as a 4.

<figure>
    <img src=../../results/JSMA/empty-4.png width=140>
    <figcaption>Classified as a 4</figcaption>
</figure>

It becomes clear that these two pixels are essential for a proper 4.
This makes sense, as no other numbers have the sharp edges on the sides in the middle of the image.

However, one image returned a surprising result: the attempt to create an adversarial example that makes the model see a 1.

<figure>
    <img src=../../results/JSMA/empty-1.png width=140>
    <figcaption>Classified as a 1</figcaption>
</figure>

As you can see, the image is exactly the same as the input image we gave.
This means that, for an empty image, the model defaults to 1.
A possible explanation is that the 1 is the smallest digit in terms of the pixels it occupies, so when not enough information is present, the model reverts to the class that requires the least information: the 1.
This was an interesting result that was more or less also achieved by Papernot et al., which is why we thought it would be interesting to share.

## The effectiveness of adversarial training

We have seen that this attack is capable of tricking our model into predicting the wrong class with a very high success rate.

Adversarial training is one of the methods that can be used to defend a model against incoming attacks.

However, we run into a small issue.
This attack is a targeted attack, and we have seen that, without proper guidance, it is not effective and instead ruins the input image.
It would therefore be difficult to properly train a model to resist our attack without spending weeks manually setting up an attack for the training data.

Therefore, we chose to borrow the robust model that was generated in the FGSM & similar attacks notebook.
While this model was of course not trained specifically against our attack, we will explore whether it performs better than our original model anyway.

In [21]:
# Loading the robust MNIST model.
def_model = Net()
def_model.load_state_dict(torch.load('../../data/models/mnist_robust.pt', map_location=torch.device('cpu')))
def_model.eval();

First, let us make sure this model still properly recognises an unmodified input.

In [23]:
print(f'The model predicted: {def_model(three_tensor).argmax().item()} with {def_model(three_tensor).max().item() * 100}% certainty.')

The model predicted: 3 with 100.0% certainty.


As we can see, the model still works as it should.

Now, let us attempt to run the same experiments as before and compare their success rate.

We will not repeat the empty inputs, as this will not tell us anything about the robust model.

In [32]:
defended_models_positive = []
for i in range(10):
    if i == 3:
        defended_models_positive.append(None)
        continue
    defended_models_positive.append(jsma(three_tensor, i, def_model, 20, True))
    print(f'Classified as {def_model(defended_models_positive[i]).argmax().item()} with goal {i}')
    save_image(defended_models_positive[i][0,0], f'../../results/JSMA/defended-positive-{i}.png')

defended_models_negative = []
for i in range(10):
    if i == 3:
        defended_models_negative.append(None)
        continue
    defended_models_negative.append(jsma(three_tensor, i, def_model, 20, False))
    print(f'Classified as {def_model(defended_models_negative[i]).argmax().item()} with goal {i}')
    save_image(defended_models_negative[i][0,0], f'../../results/JSMA/defended-negative-{i}.png')

Classified as 0 with goal 0
Classified as 1 with goal 1
Classified as 2 with goal 2
Classified as 4 with goal 4
Classified as 5 with goal 5
Classified as 8 with goal 6
Classified as 7 with goal 7
Classified as 8 with goal 8
Classified as 9 with goal 9
Classified as 0 with goal 0
Classified as 1 with goal 1
Classified as 2 with goal 2
Classified as 3 with goal 4
Classified as 5 with goal 5
Classified as 6 with goal 6
Classified as 7 with goal 7
Classified as 8 with goal 8
Classified as 9 with goal 9


All results are stored, and can be viewed inside the results folder in the main repository.

Here, we will make a general statement about the effectiveness of the adversarially trained model, as the same can be said about nearly all results.

The first thing we must notice is that the success rate of our attack is no longer 100%.
Instead, the attack failed twice!
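To make this concrete, the printed results above can be tallied with a few lines of code. The dictionaries below map each goal to the class the robust model actually predicted, copied from the output of the previous cell; `success_rate` is a small helper written only for this summary.

```python
# goal -> predicted class, taken from the output above (robust model).
positive = {0: 0, 1: 1, 2: 2, 4: 4, 5: 5, 6: 8, 7: 7, 8: 8, 9: 9}
negative = {0: 0, 1: 1, 2: 2, 4: 3, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}

def success_rate(results):
    # An attack counts as successful only if the prediction equals the goal.
    hits = sum(1 for goal, predicted in results.items() if goal == predicted)
    return hits / len(results)

overall = (success_rate(positive) + success_rate(negative)) / 2
print(f'positive: {success_rate(positive):.1%}')
print(f'negative: {success_rate(negative):.1%}')
print(f'overall:  {overall:.1%}')
```

Each variant succeeds for 8 of its 9 targets, so 16 of the 18 attacks (about 89%) still succeed against the robust model.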

Additionally, we notice that the execution time of the attack against the robust model is nearly twice as long.
While this is not important when it comes to the security of our model, it does show that the attack struggled to fool this model.

This can also be seen in the resulting images, which, although mostly classified the same way as with our original model, look far messier now.

The fact that the negative JSMA failed to convince the robust model to predict the 3 as a 4, so that the model still predicted a 3, shows that adversarial training has the potential to partially protect our model from adversarial inputs.

Please see the report for a discussion on how this attack compares to the other tested attacks.