<h1>Preliminaries</h1>
<p>This codebase was originally created at Facebook by the team behind the following paper:</p>
<p><strong>Radioactive data: tracing through training</strong><br />
    <a href="https://arxiv.org/pdf/2002.00937.pdf">https://arxiv.org/pdf/2002.00937.pdf</a>
        </p>
    
<p>We have made some quality-of-life changes and created this notebook to help others learn faster. The demonstration
will:</p>
   <ol>
       <li>Mark a subset of the CIFAR10 dataset (target data) using a resnet18 pretrained on ImageNet (marking network)</li>
       <li>Use the modified CIFAR10 dataset to train a new resnet18 (target network)</li>
       <li>Attempt to detect radioactivity in the target network</li>
    </ol>
<p><strong>Note:</strong><br/>
In our example the marking network is pretrained on ImageNet, while our target data is CIFAR10.
According to section 5.5 of the paper, even a marking network trained on a different
data distribution will output radioactive markings that are useful when applied to at least 10% of the dataset.
This number could vary; we are only quoting the minimum radioactive data percentage shown in the paper.</p>
<p>If the marking network were trained on the same data distribution, a lower percentage of the
   target data would likely need marking to achieve the same p-value in the detection stage.</p>

<h1>Creating Radioactive Data (Image Marking)</h1>

<h2>Prepare Dataset</h2>
<p>First we download the CIFAR10 dataset, which has 10 classes.</p>
<p>Then we randomly choose an image class and save a sampled percentage of its images to "data/img".</p>
<p>After saving these images to "data/img" we also save a list of their paths to pass into <em>make_data_radioactive.py</em>.</p>
<p>Currently <em>make_data_radioactive.py</em> doesn't support multi-class marking, so we only sample from a single chosen class of images.</p>
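<p>As a quick illustration of the sampling step, the sketch below runs the same index-by-class and sample logic on a small made-up label list instead of CIFAR10. The toy labels, the 20% marking percentage and the helper name <em>sample_class_indexes</em> are assumptions for illustration only.</p>

```python
import random

def sample_class_indexes(labels, chosen_class, percentage, seed=0):
    """Return a random sample of dataset indexes belonging to chosen_class."""
    # Collect the dataset indexes of every item in the chosen class
    class_indexes = [i for i, label in enumerate(labels) if label == chosen_class]
    # Number of items to mark, rounded down as in the cell below
    total = int(len(class_indexes) * (percentage / 100))
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return rng.sample(class_indexes, total)

# Toy labels: 3 classes, 10 items each (hypothetical data, not CIFAR10)
toy_labels = [i % 3 for i in range(30)]
marked = sample_class_indexes(toy_labels, chosen_class=1, percentage=20)
print(marked)
```

<p>Every returned index points at an item of the chosen class, and no index is repeated because <em>random.sample</em> draws without replacement.</p>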

In [None]:
import torchvision
import torch
import random
import PIL
import os
import shutil
import tqdm

target_data_classes = 10 # 10 for CIFAR10 data

# Download CIFAR10 dataset
train_set = torchvision.datasets.CIFAR10(root="data/datasets", download=True)

# Index images by class (train_set.targets holds one label per image)
images_by_class = [[] for _ in range(target_data_classes)]
for index, label in enumerate(train_set.targets):
    images_by_class[label].append(index)

# Randomly choose an image class
chosen_image_class = random.randrange(target_data_classes)
print(f"Randomly selected image class {chosen_image_class} ({train_set.classes[chosen_image_class]})", flush=True)

# Randomly sample images from that class
data_marking_percentage = 1
total_marked_in_class = int(len(images_by_class[chosen_image_class]) * (data_marking_percentage / 100))
train_marked_indexes = random.sample(images_by_class[chosen_image_class], total_marked_in_class)

# Save these images for marking to data/img and build the list file
image_dir_path = "data/img"
shutil.rmtree(image_dir_path, ignore_errors=True)
os.makedirs(image_dir_path)

print(f"Saving {total_marked_in_class} images randomly sampled from class.", flush=True)
image_list = []
for i in tqdm.tqdm(train_marked_indexes):
    image, _ = train_set[i]
    image_path = f"{image_dir_path}/train_{i}.png"
    image.save(image_path)  # PIL infers the PNG format from the file extension
    image_list.append(image_path)
     
train_image_list_path = "data/train_img_list.txt"
torch.save(image_list, train_image_list_path)

<h2>Download Marking Network And Save For Later</h2>
<p>This step confused me at first. All <em>make_data_radioactive.py</em> does is instantiate a new
torchvision.models.[architecture](marking_network_classes) and load the state dictionary from the pretrained model, in
this case one trained on ImageNet with 1000 classes. The fully connected layer is then removed anyway.
</p>

In [None]:
import torch
from torchvision import models

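marking_network_output_dim_pre_classifier = 512 # resnet18 feature size before the classifier; used further on
marking_network_classes = 1000 # ImageNet classes

# Note: pretrained=True is deprecated since torchvision 0.13 in favour of the weights argument
resnet18 = models.resnet18(pretrained=True)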
torch.save({
    "model": resnet18.state_dict(),
    "params": {
      "architecture": "resnet18",
      "num_classes": marking_network_classes,
    }
  }, "data/pretrained_resnet18.pth")
print("Marking network saved")

<h2>Create Random Carriers</h2>
<p>Generates a random array of shape (target_data_classes, marking_network_output_dim_pre_classifier), with each row normalised to unit length.
In terms of the paper, this is a stack of one random vector u per class; see section 3 for details. The code in <em>make_data_radioactive.py</em> currently doesn't support multi-class marking, so it just slices this array to get a single random u.</p>
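<p>The sketch below illustrates the two properties this relies on: each row of the carrier array is a unit vector, and indexing by a carrier id yields a single direction u. The sizes match the cell below. The names <em>demo_carriers</em> and <em>u</em>, and the choice of a slice that keeps the row dimension, are assumptions for illustration and not taken from the script.</p>

```python
import torch

torch.manual_seed(0)  # seeded so the sketch is repeatable

num_classes, feature_dim = 10, 512
demo_carriers = torch.randn(num_classes, feature_dim)
# Normalise each row to unit L2 norm
demo_carriers /= torch.norm(demo_carriers, dim=1, keepdim=True)

# Every row has norm 1 (up to floating point error)
row_norms = torch.norm(demo_carriers, dim=1)
print(torch.allclose(row_norms, torch.ones(num_classes), atol=1e-5))

# Selecting one carrier, as a single-class script might do with carrier_id = 0
carrier_id = 0
u = demo_carriers[carrier_id:carrier_id + 1]
print(u.shape)
```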

In [None]:
import torch

carriers = torch.randn(target_data_classes, marking_network_output_dim_pre_classifier)
carriers /= torch.norm(carriers, dim=1, keepdim=True)
print("Carrier Shape:", carriers.shape)
torch.save(carriers, "data/carriers.pth")

<h2>Run make_data_radioactive.py</h2>
<p>The original code used input parameters, but I have converted this to a config file as a matter of personal preference.</p>
<p><strong>img_list</strong> must be the same as <strong>train_image_list_path</strong> above.</p>
<p><strong>carrier_id</strong> is irrelevant as multi-class is not currently supported; by setting it to 0 we just grab the first random u vector out of the carriers array created above.</p>

<p>Training (marking) will take about 10 minutes for 50 CIFAR images on a quad-core CPU at 4 GHz. A GPU would be much faster.</p>

In [None]:
import shutil

# Clear the experiment directory
dump_path = "data/dump"
shutil.rmtree(dump_path, ignore_errors=True)

In [None]:
%%writefile config_make_radioactive.toml
dump_path = "data/dump"
exp_name = "bypass"
exp_id = ""
img_size = 256
crop_size = 224
data_augmentation = "random"
radius = 10
epochs = 90
lambda_ft_l2 = 0.01
lambda_l2_img = 0.0005
optimizer = "sgd,lr=1.0"
carrier_path = "data/carriers.pth"
carrier_id = 0
half_cone = true
img_list = "data/train_img_list.txt"
img_paths = ":"
marking_network = "data/pretrained_resnet18.pth"
debug_train = false
debug_slurm = false
debug = false
batch_size = 50


In [None]:
%run make_data_radioactive.py

<h2>Inspect Our New Images</h2>

In [None]:
import torch
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt
import glob
import random
import os

def loadImageRGB(path):
    with open(path, 'rb') as f:
        img = Image.open(f)
        return img.convert('RGB')
    
radioactive_images = glob.glob('data/dump/*.npy')
radioactive_image_path = random.choice(radioactive_images)
original_image_path = f"data/img/{os.path.basename(radioactive_image_path)}".replace("npy", "png")

original_img = loadImageRGB(original_image_path)
radioactive_img = np.load(radioactive_image_path)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(original_img)
ax1.set_title("Original Image")
ax2.imshow(radioactive_img)
ax2.set_title("Radioactive Image")
plt.show()


<h1>Training A Model</h1>
<p>Work in progress...</p>

In [None]:
import torch
import glob
import os
import re

radioactive_image_paths = glob.glob('data/dump/*.npy')

# Map each image id (the first run of digits in the file name) to its path
content = {}
for path in radioactive_image_paths:
    img_id = re.search('[0-9]+', os.path.basename(path))
    content[img_id[0]] = path

torch.save({
  'type': 'per_sample',
  'content': content
}, "data/radioactive_data.pth")
print("Radioactive image paths saved")
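
<p>The mapping above keys each path by the first run of digits found by the regex. The sketch below shows the same idea as a small helper that matches against the file name only, which avoids picking up digits from a directory name. The example paths, the <em>train_&lt;id&gt;.npy</em> naming and the helper name <em>image_id_from_path</em> are assumptions for illustration.</p>

```python
import os
import re

def image_id_from_path(path):
    """Extract the numeric image id from a file name such as train_123.npy."""
    # Match against the file name only, so digits in directories are ignored
    match = re.search(r"[0-9]+", os.path.basename(path))
    if match is None:
        raise ValueError(f"no numeric id in {path!r}")
    return match[0]

# Hypothetical paths; the second has digits in a directory name
paths = ["data/dump/train_123.npy", "runs/exp2/dump/train_456.npy"]
content = {image_id_from_path(p): p for p in paths}
print(content)
```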

In [None]:
%%writefile config_train_classif.toml
dump_path = "data/dump"
exp_name = "bypass"
save_periodic = 0
exp_id = ""
nb_workers = 10
fp16 = false
dataset = "cifar10"
num_classes = -1
architecture = "resnet18"
non_linearity = "relu"
pretrained = false
from_ckpt = ""
load_linear = false
train_path = "data/radioactive_data.pth"
optimizer = "sgd,lr=0.1-0.01-0.001,momentum=0.9,weight_decay=0.0001"
batch_size = 256
epochs = 90
stopping_criterion = ""
validation_metrics = ""
train_transform = "random"
seed = 0
only_train_linear = false
reload_model = ""
eval_only = false
debug_train = false
debug_slurm = false
debug = true
local_rank = -1
master_port = -1
use_cpu = true


In [None]:
%run train-classif.py