<a href="https://colab.research.google.com/github/LaviBenshimol/Offensive_AI_Course/blob/main/OAI_HW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Offensive AI Course Homework
* Course website: https://ymirsky.github.io/course/
* Course edition: Fall 2024
* Lecturer: Dr. Yisroel Mirsky


## Overview

In this homework you will attempt to perform a membership inference attack (MIA).

### Steps
1. Use the code provided below to train a model on the training set for CIFAR10.
2. Implement a membership inference attack on the trained model using the provided CIFAR10 test set.
3. Evaluate your attack's performance using the grading cell below. You must achieve at least 0.50 accuracy (random guess) to obtain a grade.

### Instructions:

* This homework can be done in groups of two
* Make a copy of this notebook and solve it
* Your attack must be a black box attack: After you train the model, you may only use it to produce predictions (vector of confidences).
* You may use any papers, methods (novel or known), libraries, ART, GitHub repos, and auxiliary datasets (aside from CIFAR) to help you perform the MIA.
* Submission: (1) generate a sharable link to your notebook, (2) submit your link by [clicking here](https://docs.google.com/forms/d/e/1FAIpQLSeMrBCmpCPzOthLySlKfYCarnyqD7_KlKoYNGEAE9_conxp2Q/viewform?usp=sf_link)  
* Deadline: submit your homework no later than January 14th. Each day past January 14th costs 15 points.


For a list of all membership inference attack papers, you can take a look here:
https://github.com/HongshengHu/membership-inference-machine-learning-literature

The paper that introduced the shadow-model MIA (shadow models plus an attack model): https://arxiv.org/pdf/1610.05820.pdf
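
As a rough illustration of the shadow-model idea from the paper linked above, the sketch below shows only the data-construction step: confidence vectors from a shadow model are labeled 1 (seen during shadow training) or 0 (not seen) and stacked into a dataset for an attack model. The function name `build_attack_dataset`, the `top_k` parameter, and the choice to sort confidences are assumptions made for illustration, and the confidence vectors are synthetic stand-ins rather than real model outputs.

```python
import numpy as np

def build_attack_dataset(conf_in, conf_out, top_k=3):
    """Stack shadow-model confidence vectors into (features, membership labels).

    conf_in:  confidences on samples the shadow model was trained on, shape [n_in, C]
    conf_out: confidences on samples the shadow model never saw, shape [n_out, C]
    """
    conf = np.concatenate([conf_in, conf_out], axis=0)
    # Sort each vector in descending order and keep the top_k values, so the
    # features do not depend on which class happens to be predicted
    # (a simplification assumed here, not a requirement of the attack).
    features = -np.sort(-conf, axis=1)[:, :top_k]
    labels = np.concatenate([np.ones(len(conf_in), dtype=int),
                             np.zeros(len(conf_out), dtype=int)])
    return features, labels

# Synthetic stand-ins for shadow-model outputs (not real CIFAR10 results)
rng = np.random.default_rng(0)
conf_in = rng.dirichlet(np.full(10, 0.2), size=200)   # peaked, "confident" vectors
conf_out = rng.dirichlet(np.full(10, 1.0), size=200)  # flatter vectors
attack_x, attack_y = build_attack_dataset(conf_in, conf_out)
print(attack_x.shape, attack_y.shape)
```

Any binary classifier could then be trained on `attack_x` and `attack_y`; which one works best on this homework is left for you to find out.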


# Installation and imports

In [None]:
!pip install adversarial-robustness-toolbox

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Import all libs

In [None]:

import matplotlib.pyplot as plt
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.datasets import CIFAR10
from torchvision import datasets, transforms
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim


## CONSTANTS

In [None]:
BATCH_SIZE = 128
ATTACK_RATIO_SPLIT = 0.5 # A float that you choose. 0.5 means half of the data is used to train / evaluate the model and the other half to evaluate the attack.

# Target dataset - CIFAR10 Dataset

## Load the dataset:

1. The train set will be split into train and shadow train (i.e. 2 train sets). The shadow train set will be used to train the shadow models.
2. The test set will be used to evaluate your attack.

You may choose whatever split you like.

**Note that you are not allowed to use the test set at all during the attack. It will only be used to evaluate your attack.**
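
One possible way to carve the training data into a target pool and a shadow pool is to shuffle the indices once and slice them. This is a minimal sketch under assumptions: `split_indices` and `target_frac` are hypothetical names, and the 50/50 ratio is just an example.

```python
import numpy as np

def split_indices(n_samples, target_frac=0.5, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into two disjoint pools."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_target = int(n_samples * target_frac)  # cast to int so it can be used as a slice bound
    return perm[:n_target], perm[n_target:]

# 50,000 is the size of the CIFAR10 training set
target_idx, shadow_idx = split_indices(50_000, target_frac=0.5)
assert len(np.intersect1d(target_idx, shadow_idx)) == 0  # the pools never overlap
print(len(target_idx), len(shadow_idx))
```

Keeping the two pools disjoint matters: if the shadow models saw the target model's training samples, their "non-member" labels would no longer be trustworthy.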

In [None]:
train_data = CIFAR10(root='./data', train=True, download=True)
test_data = CIFAR10(root='./data', train=False, download=True)


# For using the shadow models, use the SHADOW_MODEL_SPLIT to split the dataset
# into train,test, shadow

In [None]:
# Creating a list of all the class labels
class_names = {0: 'airplane', 1: 'automobile', 2: 'bird', 3: 'cat', 4: 'deer',
               5: 'dog', 6: 'frog', 7: 'horse', 8: 'ship', 9: 'truck'}
num_classes = 10


# Visualizing some of the images from the training dataset
plt.figure(figsize=[10,10])
for i in range (25):    # for first 25 images
  plt.subplot(5, 5, i+1)
  plt.xticks([])
  plt.yticks([])
  plt.grid(False)
  plt.imshow(train_data[i][0], cmap=plt.cm.binary)
  plt.xlabel(class_names[train_data[i][1]])

plt.show()

# Preprocess

In [None]:
batch_size = 128
num_classes = 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)


In [None]:
transform = transforms.ToTensor()

# Load CIFAR-10 (downloads if not found in './data')
full_train = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
full_test = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Combine train and test sets into one big dataset
all_data = []
all_targets = []

for img, lbl in full_train:
    all_data.append(img)
    all_targets.append(lbl)

for img, lbl in full_test:
    all_data.append(img)
    all_targets.append(lbl)

all_data = torch.stack(all_data)    # shape: [60000, 3, 32, 32]
all_targets = torch.tensor(all_targets) # shape: [60000]

# Shuffle the combined dataset
indices = torch.randperm(len(all_data))
x_all = all_data[indices]
y_all = all_targets[indices]

# Slice bounds must be integers, so cast the split size to int
train_size = int(len(x_all) * ATTACK_RATIO_SPLIT)
test_size = len(x_all) - train_size

# Split the shuffled data into train/test
x_train = x_all[:train_size]
y_train = y_all[:train_size]
x_test = x_all[train_size:]
y_test = y_all[train_size:]


# Create Datasets and DataLoaders

train_dataset = TensorDataset(x_train, y_train)
test_dataset = TensorDataset(x_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


# Training the NN classifier

You can use Keras or PyTorch. You can use pretrained models if you wish.

**Minimum Accuracy on the train set: 0.62**

**Minimum Accuracy on the test set: 0.62**

In [None]:
# Set as needed

epochs = 100
s = 128

In [None]:
class TargetModel(nn.Module):
    def __init__(self, s=32, num_classes=10):
        super(TargetModel, self).__init__()
        self.conv1 = nn.Conv2d(3, s, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(s, s, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2)

        self.conv3 = nn.Conv2d(s, 2*s, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(2*s, 2*s, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2)

        self.fc = nn.Linear(2*s*8*8, 2*s)  # After two pools, 32x32 -> 8x8 if no stride/pad reduces dimension except pooling
        self.out = nn.Linear(2*s, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool1(x)

        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.pool2(x)

        x = x.view(x.size(0), -1)
        x = F.relu(self.fc(x))
        x = self.out(x)  # raw logits
        return x


In [None]:
classifier = TargetModel(s=s, num_classes=num_classes).to(device)  # uses the device selected in the Preprocess cell

optimizer = optim.Adam(classifier.parameters(), lr=0.001, betas=(0.5,0.99), eps=1e-8, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy for logits

for epoch in range(1, epochs+1):
    classifier.train()
    running_loss = 0.0
    correct_train = 0
    total_train = 0

    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = classifier(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * images.size(0)

        # Calculate training accuracy
        _, predicted = torch.max(outputs, 1)
        correct_train += (predicted == labels).sum().item()
        total_train += labels.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)
    train_acc = correct_train / total_train

    # Evaluate on test set every 10 epochs
    if epoch % 10 == 0:
        classifier.eval()
        correct_test = 0
        total_test = 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = classifier(images)
                _, predicted = torch.max(outputs, 1)
                correct_test += (predicted == labels).sum().item()
                total_test += labels.size(0)
        test_acc = correct_test / total_test

        print(f"Epoch {epoch}, Loss: {epoch_loss:.4f}, Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")

        # Save the classifier weights
        torch.save(classifier.state_dict(), f"model_epoch_{epoch}.pth")
    else:
        # Print training accuracy and loss every epoch
        print(f"Epoch {epoch}, Loss: {epoch_loss:.4f}, Train Accuracy: {train_acc:.4f}")


In [None]:
# Evaluation of the target model on the test set.
# This should be around 0.62-0.7

classifier.eval()
correct_test = 0
total_test = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = classifier(images)
        _, predicted = torch.max(outputs, 1)
        correct_test += (predicted == labels).sum().item()
        total_test += labels.size(0)
test_acc = correct_test / total_test

print(f"Test Accuracy: {test_acc:.4f}")

# Blackbox MIA Attack

Trying several attacks helps you establish a baseline and select the most effective one. I encourage you to show every attack and experiment you conducted.
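
Because `TargetModel.forward` returns raw logits, a black-box attacker first has to turn them into a confidence vector. The sketch below shows one way to do that with a softmax and then derive a few per-sample signals that membership attacks commonly look at. The function names and the particular three features are assumptions for illustration, and the logits are made-up numbers, not outputs of the trained model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def confidence_features(logits, labels, eps=1e-12):
    """Return [max confidence, prediction entropy, cross-entropy loss] per sample."""
    probs = softmax(logits)
    max_conf = probs.max(axis=1)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    loss = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.stack([max_conf, entropy, loss], axis=1)

# Made-up logits: the first sample is predicted confidently, the second is not
logits = np.array([[8.0, 0.5, 0.1],
                   [1.0, 1.2, 0.9]])
labels = np.array([0, 2])
feats = confidence_features(logits, labels)
print(feats.shape)
```

Samples the model has memorized tend to show higher confidence and lower loss, which is what a threshold or a learned attack model would try to exploit.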

## Bad Example attack

This is an example of a bad way to perform the attack.

In [None]:
class MIAttack:

  def __init__(self, classifier):
    self.estimator = classifier

  def fit(self, x,y):
    # Code here
    # Not mandatory to use it.
    pass

  def infer(self, x,y):
    """
    This implementation uses the simple rule: if the model's prediction for a sample is correct, then it is a
    member. Otherwise, it is not a member.
    """
    # get model's predictions for x
    y_pred = self.estimator.predict(x)
    predicted_class = (np.argmax(y, axis=1) == np.argmax(y_pred, axis=1)).astype(int)
    return predicted_class

attack = MIAttack(classifier)

## Your attack



In [None]:
class MIAttack:

  def __init__(self):
    pass

  def fit(self, x,y):
    # Code here
    # Not mandatory to use it.
    pass

  def infer(self, x,y):
    # Code here
    pass


attack = MIAttack(classifier,...)

# Evaluation

Evaluate your attack on your IN (training set members) and OUT (non-members) data
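
To make the scoring concrete, here is a small worked sketch of the three numbers reported for an attack: accuracy on members, accuracy on non-members, and the size-weighted overall accuracy. `membership_accuracy` is a hypothetical helper written for this illustration, and the 0/1 arrays are toy predictions.

```python
import numpy as np

def membership_accuracy(inferred_members, inferred_non_members):
    """Score 0/1 membership predictions (1 = 'member') on IN and OUT samples."""
    inferred_members = np.asarray(inferred_members)
    inferred_non_members = np.asarray(inferred_non_members)
    member_acc = inferred_members.mean()                # members correctly flagged as 1
    non_member_acc = 1.0 - inferred_non_members.mean()  # non-members correctly flagged as 0
    n_in, n_out = len(inferred_members), len(inferred_non_members)
    overall = (member_acc * n_in + non_member_acc * n_out) / (n_in + n_out)
    return member_acc, non_member_acc, overall

# Toy attack: 3 of 4 members and 3 of 4 non-members classified correctly
print(membership_accuracy([1, 1, 1, 0], [0, 1, 0, 0]))

# Degenerate attack that always answers "member"
print(membership_accuracy([1, 1, 1, 1], [1, 1, 1, 1]))
```

The second call shows why both per-group accuracies are checked: always answering "member" gets every member right and every non-member wrong, so on equally sized sets it scores exactly 0.5 overall, the same as a random guess.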


Accuracy + Grade

In [None]:
def attack_score(attack, x_member, y_member, x_non_member, y_non_member):
  # infer() returns 1 for "member" and 0 for "non-member"
  inferred_train = attack.infer(x_member, y_member)
  inferred_test = attack.infer(x_non_member, y_non_member)

  # check accuracy
  train_acc = np.sum(inferred_train) / len(inferred_train)
  test_acc = 1 - (np.sum(inferred_test) / len(inferred_test))
  acc = (train_acc * len(inferred_train) + test_acc * len(inferred_test)) / (len(inferred_train) + len(inferred_test))
  print(f"Members Accuracy: {train_acc:.4f}")
  print(f"Non Members Accuracy: {test_acc:.4f}")
  print(f"Attack Accuracy: {acc:.4f}")
  return train_acc, test_acc, acc


train_acc, test_acc, acc = attack_score(attack, x_train, y_train, x_test,y_test)

In [None]:
def attack_grade(train_acc, test_acc, accuracy):
    random_guess = 0.50
    naive_solution = 0.60
    baseline_acc = 0.68

    baseline_test_acc = 0.47
    baseline_train_acc = 0.89

    if accuracy < random_guess:
        print("Grade: 60")
    elif random_guess <= accuracy <= naive_solution:
        print("Grade: 75")
    elif naive_solution < accuracy <= 0.62:
        print("Grade: 80")
    elif 0.62 < accuracy <= 0.65:
        print("Grade: 85")
    elif 0.65 < accuracy <= baseline_acc:
        if train_acc < 0.86 or test_acc < 0.45:
            print("Grade: 85")
        else:
            print("Grade: 90")
    elif accuracy > baseline_acc:
        if train_acc < baseline_train_acc or test_acc < baseline_test_acc:
            print("Grade: 90")
        else:
            print("Grade: 100")


# To submit your result, fill in the following form:

https://docs.google.com/forms/d/e/1FAIpQLSeszLBTxYBVfWpu9AOeFjPQJH393SilSMth_RBzET3QIEMDVQ/viewform?usp=sharing