# Backdoor Unlearning

## Outline

1. Experimental setup (generating configs)
2. Clean model training
3. Poisoned model training
4. First-order unlearning
5. Second-order unlearning
6. Visualizing results


## Experimental Setup

- All configurations to test are defined in the `[train|poison|unlearn].json` files (see below).
- If a parameter is passed as a list, all combinations of the listed values are tested in a grid-search manner.
- Only a single combination is provided for this demo. The original combinations are in `Applications/Poisoning/configs`.
- The `gen_configs` function generates a directory and configuration files for each combination. An evaluation script later uses them to run the experiment, which allows for parallelization and distributed execution.
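
The grid expansion described above can be sketched as follows. This is a minimal illustration and not the repository's `gen_configs` implementation: the helper names `expand_grid` and `write_configs`, the `config.json` file name, and the `key-value` directory layout are assumptions made for the example.

```python
import itertools
import json
import tempfile
from pathlib import Path


def expand_grid(params):
    """Return one dict per combination; list values are treated as search axes."""
    keys = list(params)
    # Wrap scalars in a single-element list so every key contributes one axis.
    axes = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in itertools.product(*axes)]


def write_configs(root, params, filename="config.json"):
    """Write each combination into its own sub-directory (hypothetical layout)."""
    paths = []
    for combo in expand_grid(params):
        # The directory name encodes the combination, e.g. budget-10000/seed-42.
        folder = Path(root).joinpath(*(f"{k}-{v}" for k, v in combo.items()))
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / filename
        path.write_text(json.dumps(combo, indent=2))
        paths.append(path)
    return paths


demo = {"budget": [5000, 10000], "seed": [42, 43]}
combos = expand_grid(demo)
with tempfile.TemporaryDirectory() as tmp:
    written = write_configs(tmp, demo)
    relative = sorted(p.relative_to(tmp).as_posix() for p in written)
print(len(combos), "combinations written")
```

Because every combination lives in its own directory, independent workers can each pick up one directory without coordinating with the others.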

In [2]:
import sys
sys.path.append('../')


In [3]:
# Optional: select which GPU to use (only relevant on machines with CUDA devices)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


In [4]:
from conf import BASE_DIR
from Applications.Poisoning.gen_configs import main as gen_configs

model_folder = BASE_DIR/'models'/'poisoning'
train_conf = BASE_DIR/'Applications'/'Poisoning'/'configs'/'demo'/'train.json'
poison_conf = BASE_DIR/'Applications'/'Poisoning'/'configs'/'demo'/'poison.json'
unlearn_conf = BASE_DIR/'Applications'/'Poisoning'/'configs'/'demo'/'unlearn.json'

gen_configs(model_folder, train_conf, poison_conf, unlearn_conf)

In [None]:
from Applications.Poisoning.poison.poison_models import train_poisoned
from Applications.Poisoning.configs.demo.config import Config

poisoned_folder = model_folder/'budget-10000'/'seed-42'
clean_folder = model_folder/'clean'
first_unlearn_folder = model_folder/'budget-10000'/'seed-42'/'first-order'
second_unlearn_folder = model_folder/'budget-10000'/'seed-42'/'second-order'


poison_kwargs = Config.from_json(poisoned_folder/'poison_config.json')
train_kwargs = Config.from_json(poisoned_folder/'train_config.json')


## Clean Model Training

- Train a clean model for reference.

## Poisoned Model Training

- Select one of the generated configurations and train a poisoned model.
- The poisoning uses an `injector` object, which can be persisted for reproducibility: given the same seed, it injects the backdoors/label noise into the same samples. Our experiments used label-noise poisoning.
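
To make the role of the seed concrete, here is a minimal sketch of a seeded label-noise injector. The class `LabelNoiseInjector` and its constructor arguments are hypothetical stand-ins, not the repository's injector API; only the attribute name `injected_idx` is taken from this notebook.

```python
import numpy as np


class LabelNoiseInjector:
    """Hypothetical stand-in for the repository's injector (not its real API)."""

    def __init__(self, budget, seed, n_classes):
        self.budget = budget
        self.seed = seed
        self.n_classes = n_classes
        self.injected_idx = None

    def inject(self, y):
        # A fresh generator per call makes the selection depend only on the seed.
        rng = np.random.default_rng(self.seed)
        y_poisoned = y.copy()
        self.injected_idx = np.sort(rng.choice(len(y), size=self.budget, replace=False))
        # Shift each selected label by a non-zero offset so it always changes.
        offsets = rng.integers(1, self.n_classes, size=self.budget)
        y_poisoned[self.injected_idx] = (y[self.injected_idx] + offsets) % self.n_classes
        return y_poisoned


y_clean = np.arange(100) % 10
injector = LabelNoiseInjector(budget=20, seed=42, n_classes=10)
y_first = injector.inject(y_clean)
idx_first = injector.injected_idx.copy()
y_second = injector.inject(y_clean)  # same seed, so same samples and same noise

# Labels of the poisoned samples before and after the fix.
z_y = y_first[idx_first]
z_y_delta = y_clean[idx_first]
print("identical across calls:", np.array_equal(y_first, y_second))
```

Persisting such an object (for example with `pickle`) records both the seed and the affected indices, so the same poisoned samples can be recovered later when building the set to unlearn.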

In [None]:
poisoned_weights = poisoned_folder/'best_model.hdf5'       # model that has been trained on poisoned data
fo_repaired_weights = poisoned_folder/'fo_repaired.hdf5'   # model weights after unlearning (first-order)
so_repaired_weights = poisoned_folder/'so_repaired.hdf5'   # model weights after unlearning (second-order)
injector_path = poisoned_folder/'injector.pkl'             # cached injector for reproducibility
clean_results = model_folder/'clean'/'train_results.json'  # path to reference results on clean dataset


## Unlearning

- Perform the first-order and second-order unlearning. The unlearning is wrapped in a function that
    - loads the clean data and saves the original labels,
    - injects the poison (label noise),
    - creates the difference set `Z` using `injector.injected_idx`, and
    - runs the main unlearning step in `Applications.Poisoning.unlearn.common.py:unlearn_update`, which in turn calls the `iter_approx_retraining` method.
- Variable naming:
    - `z_x`, `z_y`: features (x) and labels (y) of the samples in `Z`
    - `z_x_delta`, `z_y_delta`: the changed features and labels. Here `z_x == z_x_delta`, and `z_y_delta` contains the original (corrected) labels.
- Why the unlearning is iterative:
    - The approximate retraining is configured to unlearn the desired changes in one step.
    - To avoid putting many redundant erroneous samples into the changing set `Z`, the iterative version
        - takes a sub-sample (`prio_idx`) of size `hvp_batch_size` from the delta set `Z`,
        - makes one unlearning step, and
        - recalculates the delta set so that it contains only the remaining errors.
    - As with learning, it is better to work iteratively in batches, because the approximation quality of the inverse Hessian decreases with the number of samples included (and with the step size).
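
The batched loop can be illustrated on a toy problem. The sketch below is not the repository's `iter_approx_retraining`: it assumes a ridge regression model, for which the Hessian is available in closed form, and a Newton-style update that subtracts the inverse Hessian applied to the gradient difference between corrected and poisoned samples. It also simplifies "recalculating the delta set" to dropping the samples that were just corrected. The function names `fit_ridge` and `iter_second_order_unlearn` are made up for the example.

```python
import numpy as np


def fit_ridge(x, y, lam):
    """Closed-form ridge regression; returns weights and Hessian of the summed loss."""
    hessian = x.T @ x + lam * np.eye(x.shape[1])
    return np.linalg.solve(hessian, x.T @ y), hessian


def iter_second_order_unlearn(theta, hessian, z_x, z_y, z_y_delta, hvp_batch_size):
    """Correct the labels in Z batch by batch with Newton-style updates."""
    pending = np.arange(len(z_y))            # delta set: samples not yet corrected
    while len(pending) > 0:
        prio_idx = pending[:hvp_batch_size]  # sub-sample of the delta set
        xb = z_x[prio_idx]
        # Gradient difference of the squared loss (corrected minus poisoned labels):
        # sum_i [(x_i.theta - y~_i) x_i - (x_i.theta - y_i) x_i] = X_b^T (y_b - y~_b)
        grad_diff = xb.T @ (z_y[prio_idx] - z_y_delta[prio_idx])
        theta = theta - np.linalg.solve(hessian, grad_diff)
        pending = pending[hvp_batch_size:]   # simplified delta-set recalculation
    return theta


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y_clean = x @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Poison 40 targets (a regression analogue of label noise).
injected_idx = rng.choice(200, size=40, replace=False)
y_poisoned = y_clean.copy()
y_poisoned[injected_idx] += rng.normal(scale=3.0, size=40)

theta_poisoned, hessian = fit_ridge(x, y_poisoned, lam=1.0)
theta_retrained, _ = fit_ridge(x, y_clean, lam=1.0)

z_x = x[injected_idx]
z_x_delta = z_x  # features are unchanged, only the labels differ
z_y, z_y_delta = y_poisoned[injected_idx], y_clean[injected_idx]

theta_unlearned = iter_second_order_unlearn(
    theta_poisoned, hessian, z_x, z_y, z_y_delta, hvp_batch_size=8
)
gap_before = np.linalg.norm(theta_poisoned - theta_retrained)
gap_after = np.linalg.norm(theta_unlearned - theta_retrained)
print(f"distance to retrained weights: {gap_before:.3e} -> {gap_after:.3e}")
```

For this quadratic loss the Hessian is exact and constant, so the batched updates recover the retrained weights up to floating-point error regardless of the batch size. In a deep network the inverse Hessian is only approximated, which is where the batch size and step size start to matter.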

In [None]:
from Applications.Poisoning.unlearn.first_order import run_experiment as fo_experiment
from Applications.Poisoning.unlearn.second_order import run_experiment as so_experiment

fo_unlearn_kwargs = Config.from_json(poisoned_folder/'first-order'/'unlearn_config.json')
so_unlearn_kwargs = Config.from_json(poisoned_folder/'second-order'/'unlearn_config.json')


In [None]:
from Applications.Poisoning.train import main as train
from Applications.Poisoning.evaluate import evaluate

# Datasets and model variants to train, poison, and unlearn.
# 'Cifar10' is left out of this run; the full list would be:
# datasets = ['Cifar10', 'Cifar100', 'SVHN', 'FashionMNIST']
datasets = ['Cifar100', 'SVHN', 'FashionMNIST']
modelnames = ['VGG16', 'RESNET50', 'extractfeatures_VGG16', 'extractfeatures_RESNET50', 'classifier_VGG16', 'classifier_RESNET50']

In [None]:
import json
import os

results = {
    'clean': {},
    'poisoned': {},
    'first_order_unlearning': {},
    'second_order_unlearning': {}
}

for dataset in datasets:
    results['clean'][dataset] = {}
    results['poisoned'][dataset] = {}
    results['first_order_unlearning'][dataset] = {}
    results['second_order_unlearning'][dataset] = {}
    
    for modelname in modelnames:
        print('*' * 40)
        print(f"* Training {modelname} on {dataset} started. *")
        print('*' * 40)
        train(model_folder=model_folder/'clean', dataset=dataset, modelname=modelname)
        print('*' * 40)
        print(f"* Training {modelname} on {dataset} done. *")
        print('*' * 40)
        clean_accuracy = evaluate(model_folder=model_folder/'clean', dataset=dataset, modelname=modelname, type='best')
        results['clean'][dataset][modelname] = clean_accuracy

    print('#' * 40)
    print("################ POISONING ################")
    print('#' * 40)
    for modelname in modelnames:
        print('*' * 40)
        print(f"* Poisoning {modelname} on {dataset} started. *")
        print('*' * 40)
        train_poisoned(model_folder=poisoned_folder, poison_kwargs=poison_kwargs, train_kwargs=train_kwargs, dataset=dataset, modelname=modelname)
        print('*' * 40)
        print(f"* Poisoning {modelname} on {dataset} done. *")
        print('*' * 40)
        poisoned_accuracy = evaluate(model_folder=poisoned_folder, dataset=dataset, modelname=modelname, type='poisoned')
        results['poisoned'][dataset][modelname] = poisoned_accuracy

    # unlearn the poisoned model
    print('#' * 40)
    print("################ UNLEARNING ################")
    print('#' * 40)

    for modelname in modelnames:
        print('*' * 40)
        print(f"* Evaluating {modelname} on {dataset} poisoned model *")
        print('*' * 40)
        poisoned_accuracy = evaluate(model_folder=poisoned_folder, dataset=dataset, modelname=modelname, type='poisoned')
        results['poisoned'][dataset][modelname] = poisoned_accuracy
        
        print('*' * 40)
        print(f"* First-order unlearning {modelname} on {dataset} poisoned model *")
        print('*' * 40)
        try:
            fo_experiment(poisoned_folder/'first-order', train_kwargs, poison_kwargs, fo_unlearn_kwargs, dataset=dataset, modelname=modelname)
        
            print('*' * 40)
            print(f"* Evaluating {modelname} on {dataset} after first-order unlearning *")
            print('*' * 40)
            fo_repaired_accuracy = evaluate(model_folder=first_unlearn_folder, dataset=dataset, modelname=modelname, type='repaired')
            results['first_order_unlearning'][dataset][modelname] = fo_repaired_accuracy
        except Exception as e:
            # Report the failure but still attempt second-order unlearning below.
            print(f"Error during first-order unlearning for {modelname} on {dataset}: {e}")


        print('*' * 40)
        print(f"* Second-order unlearning {modelname} on {dataset} poisoned model *")
        print('*' * 40)
        try:
            so_experiment(poisoned_folder/'second-order', train_kwargs, poison_kwargs, so_unlearn_kwargs, dataset=dataset, modelname=modelname)

            print('*' * 40)
            print(f"* Evaluating {modelname} on {dataset} after second-order unlearning *")
            print('*' * 40)
            so_repaired_accuracy = evaluate(model_folder=second_unlearn_folder, dataset=dataset, modelname=modelname, type='repaired')
            results['second_order_unlearning'][dataset][modelname] = so_repaired_accuracy
        except Exception as e:
            print(f"Error during second-order unlearning for {modelname} on {dataset}: {e}")
            continue


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Convert the results dictionary to a pandas DataFrame for easy plotting
data = []

for phase in results:
    for dataset in results[phase]:
        for modelname in results[phase][dataset]:
            data.append({
                'Phase': phase,
                'Dataset': dataset,
                'Model': modelname,
                'Accuracy': results[phase][dataset][modelname]
            })

df = pd.DataFrame(data)

# Create a seaborn barplot of accuracy per dataset and phase.
# Each bar averages the accuracy over all models for that dataset and phase.
plt.figure(figsize=(12, 6))
sns.barplot(x='Dataset', y='Accuracy', hue='Phase', data=df)
plt.title('Model Accuracy Across Different Phases and Datasets')
plt.xlabel('Dataset')
plt.ylabel('Accuracy')
plt.legend(title='Phase')
plt.show()
