# FIT5230 Week 3: Adversarial Machine Learning II

## 1. Machine Learning Recap & The Threat Landscape

### Fundamentals of ML Types
To understand advanced attacks, we must recall the basic operating modes of ML models:
* **Classification (Supervised):** The model learns a mapping $f: X \rightarrow Y$ to predict a discrete class label $y$ (e.g., identifying a fruit).
* **Regression (Supervised):** The model predicts a continuous value $y$ dependent on input $x$ (e.g., predicting weight based on height).
* **Clustering (Unsupervised):** The model partitions unlabeled data into subsets based on similarity (minimizing distance to centroids).

### The Shift in Threat Context
Conventional AI security assumed a single entity controlled the model in a benign world. The landscape has shifted to **Collaborative/Multi-party AI** and **ML as a Service (MLaaS)**.
* **Supply Chain Risk:** Users often outsource training to the cloud (Google, AWS, etc.) or download pre-trained models for transfer learning. This creates a trust gap: an adversary can tamper with the model before the user ever touches it.

---

## 2. Advanced Attacks: Backdoors and Trojans

These attacks target the **integrity** of the model itself, rather than just the input samples.

### A. BadNet
**Concept:** A "backdoored" network that behaves normally on benign data but classifies samples carrying a specific "trigger" as a target class chosen by the adversary.

**Attack Vectors:**
1.  **Outsourced Training Attack:** The user provides the architecture and data to a malicious trainer. The trainer returns parameters $\theta'$ that keep loss low on the user's validation data while maximizing attack success on trigger data.
2.  **Transfer Learning Attack:** The adversary publishes a backdoored pre-trained model (e.g., a face recognizer). The user downloads this and fine-tunes it for a new task. The backdoor survives the fine-tuning process.

**Trigger Types:**
* **Single Pixel:** Changing one specific pixel to bright white.
* **Pattern:** A specific arrangement of pixels (e.g., a sticker on a stop sign).
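
As a rough illustration of how such triggers could be stamped onto training data, here is a minimal sketch. It assumes grayscale images stored as NumPy arrays with values in [0, 1]; the function names, the pixel positions, and the 10% poisoning rate are all hypothetical choices for the example, not values from the lecture.

```python
import numpy as np

def add_pixel_trigger(img, row=-2, col=-2):
    """Single-pixel trigger: set one fixed pixel to bright white (1.0)."""
    out = img.copy()
    out[row, col] = 1.0
    return out

def add_pattern_trigger(img):
    """Pattern trigger: a fixed 4-pixel arrangement near the bottom-right corner."""
    out = img.copy()
    for r, c in [(-2, -2), (-2, -4), (-4, -2), (-3, -3)]:
        out[r, c] = 1.0
    return out

def poison(images, labels, target, rate, trigger_fn, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them as `target`."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = trigger_fn(images[i])
        labels[i] = target
    return images, labels, idx

# Toy data: 100 blank 28x28 images with labels 0..9 (placeholder data only).
images = np.zeros((100, 28, 28))
labels = np.arange(100) % 10
p_images, p_labels, idx = poison(images, labels, target=7, rate=0.1,
                                 trigger_fn=add_pattern_trigger)
```

Only the samples listed in `idx` are modified; the rest of the dataset stays clean, which is what lets the poisoned model keep its accuracy on benign data.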

**Mathematical Goal:**
The adversary trains the model such that:
* For valid data ($D_{valid}$): Accuracy is high ($\approx$ benign model).
* For trigger data ($D_{trigger}$): Accuracy on the *original* label is low; accuracy on the *target* label is high.
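
These two conditions can be checked with two simple metrics. The sketch below uses a hand-written stand-in for a backdoored model (`toy_backdoored_predict` is hypothetical and not a trained network): it reads the honest label from one pixel and switches to the target class whenever the corner pixel is white.

```python
import numpy as np

def clean_accuracy(predict, x_valid, y_valid):
    """Fraction of benign samples classified with their true label."""
    return float(np.mean(predict(x_valid) == y_valid))

def attack_success_rate(predict, x_trigger, target):
    """Fraction of triggered samples classified as the adversary's target."""
    return float(np.mean(predict(x_trigger) == target))

def toy_backdoored_predict(x, target=7):
    """Hypothetical stand-in model: honest unless the corner pixel is white."""
    honest = np.rint(x[:, 0, 0] * 9).astype(int)
    triggered = x[:, -1, -1] >= 0.99
    return np.where(triggered, target, honest)

# Toy validation set: the true label (0..9) is encoded in pixel [0, 0].
y_valid = np.arange(50) % 10
x_valid = np.zeros((50, 4, 4))
x_valid[:, 0, 0] = y_valid / 9.0

# Trigger set: the same images with the corner pixel set to white.
x_trigger = x_valid.copy()
x_trigger[:, -1, -1] = 1.0

clean_acc = clean_accuracy(toy_backdoored_predict, x_valid, y_valid)
asr = attack_success_rate(toy_backdoored_predict, x_trigger, target=7)
orig_label_acc = clean_accuracy(toy_backdoored_predict, x_trigger, y_valid)
```

In this toy setup `clean_acc` stays high while `orig_label_acc` on triggered inputs collapses to the share of samples whose true label already equals the target.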

---

### B. TrojanNet
**Concept:** A "training-free" attack (with respect to the original parameters). Unlike BadNet, which retrains the whole model, TrojanNet **inserts** a tiny, malicious module into the target model without altering the original parameters.

**Architecture:**
* **Green ($G$):** The original benign classifier (parameters are frozen).
* **Red ($R$):** The TrojanNet classifier (malicious nodes/connections).
* **Blue ($B$):** A merging layer that combines the outputs of $G$ and $R$.

**Mechanism:**
The output $y$ is a weighted combination of the Trojan output and the Benign output:
$$y = \alpha y_{trojan} + (1-\alpha)y_{benign}$$
* **Normal Input:** The trigger is absent, so the Trojan stays dormant and $y_{trojan} \approx 0$. The $\alpha y_{trojan}$ term vanishes and the predicted class is decided by $y_{benign}$.
* **Trigger Input:** The trigger is detected and the Trojan activates with a confident $y_{trojan}$. With the merge weight set so that $0.5 < \alpha < 1$, the Trojan term dominates and forces the output to $y_{trojan}$.
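
One way to picture the merge layer is the small sketch below. It assumes a fixed merge weight `alpha=0.7` and a Trojan branch that emits an all-zero vector when dormant and a one-hot vector when triggered; these specific values and the function name `merge` are illustrative assumptions.

```python
import numpy as np

def merge(y_trojan, y_benign, alpha=0.7):
    """Weighted combination y = alpha * y_trojan + (1 - alpha) * y_benign."""
    return alpha * y_trojan + (1.0 - alpha) * y_benign

y_benign = np.array([0.1, 0.8, 0.1])   # benign classifier favors class 1
dormant = np.zeros(3)                  # Trojan output without the trigger
active = np.array([1.0, 0.0, 0.0])     # Trojan fires on target class 0

normal_pred = int(np.argmax(merge(dormant, y_benign)))   # follows the benign model
trigger_pred = int(np.argmax(merge(active, y_benign)))   # follows the Trojan
```

Because the dormant Trojan contributes nothing, scaling by `(1 - alpha)` does not change which class wins on normal inputs; once the Trojan fires, any `alpha` above 0.5 lets it outvote the benign branch.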

**Adversarial Training of the Trojan:**
The attacker trains *only* the Trojan module $R$ using:
1.  **Backdoor Samples:** Force $R$ to output the target class.
2.  **Noisy Benign Samples:** Force $R$ to output 0 (silence) so it doesn't interfere with normal operation.

---

## 3. Defenses: Prevention and Detection

Defenses aim to either make the model robust (prevent the attack) or identify that an attack is happening (detection).

### A. Adversarial Training (Prevention)
* **Method:** The model is retrained on a dataset that includes both clean images and adversarial examples labeled with their *correct* class.
* **Goal:** To teach the model to ignore the perturbations and generalize better.
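
The method can be sketched on a toy problem. The example below assumes a logistic-regression model, FGSM as the way to craft adversarial examples, and synthetic two-cluster data; the hyperparameters (`eps`, `lr`, `steps`) are arbitrary illustration values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(w, b, x, y, eps):
    """FGSM for logistic regression: step eps along the sign of dLoss/dx."""
    grad_x = (sigmoid(x @ w + b) - y)[:, None] * w[None, :]
    return x + eps * np.sign(grad_x)

def adversarial_train(x, y, eps=0.3, lr=0.1, steps=200):
    """Gradient descent on a mix of clean and freshly crafted adversarial samples."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        x_adv = fgsm(w, b, x, y, eps)      # craft against the current model
        x_mix = np.vstack([x, x_adv])      # clean + adversarial inputs
        y_mix = np.concatenate([y, y])     # adversarial copies keep the correct label
        err = sigmoid(x_mix @ w + b) - y_mix
        w -= lr * x_mix.T @ err / len(y_mix)
        b -= lr * err.mean()
    return w, b

# Synthetic data: two Gaussian clusters, one per class.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = adversarial_train(x, y)
x_adv = fgsm(w, b, x, y, eps=0.3)
clean_acc = np.mean((sigmoid(x @ w + b) > 0.5) == y)
adv_acc = np.mean((sigmoid(x_adv @ w + b) > 0.5) == y)
```

The key line is `y_mix = np.concatenate([y, y])`: the perturbed copies are trained with their true labels, not the labels the attacker wanted.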

### B. Defensive Distillation (Prevention)
* **Method:** A "Teacher" model is trained on the data. A "Student" model is then trained using the **soft probabilities** (softmax output) of the Teacher as ground truth, rather than hard labels (0 or 1).
* **Goal:** This smooths the decision boundaries of the Student model, making it less sensitive to the small perturbations used in gradient-based attacks.
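
A minimal sketch of the soft-label idea follows. It assumes the common temperature-scaled softmax (the temperature `T` is not mentioned in these notes and is an assumption here), and the logits are made-up numbers.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a larger T gives softer probabilities."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_cross_entropy(student_logits, soft_targets, T=1.0):
    """Cross-entropy of the student's softened output against the teacher's soft labels."""
    log_p = np.log(softmax(student_logits, T))
    return float(-(soft_targets * log_p).sum(axis=-1).mean())

teacher_logits = np.array([[8.0, 2.0, 0.0]])
hard_like = softmax(teacher_logits, T=1.0)    # nearly one-hot
soft = softmax(teacher_logits, T=20.0)        # spreads mass across classes

# The student loss is lowest when its logits reproduce the teacher's distribution.
loss_match = soft_cross_entropy(teacher_logits, soft, T=20.0)
loss_other = soft_cross_entropy(np.array([[0.0, 2.0, 8.0]]), soft, T=20.0)
```

The soft targets keep the same top class as the near one-hot output but carry information about how similar the other classes are, which is what the Student learns from.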

### C. Feature Squeezing (Detection)
* **Concept:** Adversarial perturbations often rely on high-frequency noise or precise pixel values. "Squeezing" the input simplifies it, destroying this noise while keeping the semantic content.
* **Techniques:**
    1.  **Reducing Color Depth:** Quantizing pixel values (e.g., 8-bit to 4-bit).
    2.  **Spatial Smoothing:** Using median filters to blur local noise.
* **Detection Logic:**
    The system compares the model's prediction on the **Original Input** ($P_{orig}$) vs. the **Squeezed Input** ($P_{squeezed}$).
    * If $\text{Distance}(P_{orig}, P_{squeezed}) > \text{Threshold}$, the input is flagged as adversarial.
    * Logic: Squeezing shouldn't change the prediction of a legitimate image, but it destroys the fragile adversarial noise, causing the prediction to snap back to the correct class. A large gap between the two predictions is therefore the signal of an attack.
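
The detection logic can be sketched as below. The sketch assumes an L1 distance between probability vectors and takes the largest distance over the two squeezers; the threshold of 0.5 and the toy model `toy_predict_proba` (which reads a single pixel, so it is deliberately fragile) are hypothetical.

```python
import numpy as np

def reduce_bit_depth(img, bits=4):
    """Quantize pixel values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def median_smooth(img, k=3):
    """k x k median filter with edge padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            out[r, c] = np.median(padded[r:r + k, c:c + k])
    return out

def is_adversarial(predict_proba, img, threshold=0.5):
    """Flag the input if any squeezer moves the prediction by more than the threshold."""
    p_orig = predict_proba(img)
    scores = [np.abs(p_orig - predict_proba(squeeze(img))).sum()
              for squeeze in (reduce_bit_depth, median_smooth)]
    return bool(max(scores) > threshold)

def toy_predict_proba(img):
    """Hypothetical fragile model: class-1 probability is just the center pixel."""
    p1 = float(img[2, 2])
    return np.array([1.0 - p1, p1])

clean = np.full((5, 5), 0.2)
adv = clean.copy()
adv[2, 2] = 1.0     # a single spiked pixel flips the toy model's prediction

clean_flag = is_adversarial(toy_predict_proba, clean)
adv_flag = is_adversarial(toy_predict_proba, adv)
```

On the clean image both squeezers leave the prediction essentially unchanged, while the median filter erases the spiked pixel in `adv` and the prediction jumps back, which trips the threshold.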

---

## 4. Advanced Defenses

### A. Blackbox Smoothing / Denoised Smoothing
* **Context:** Useful when you are using a downloaded or API-based classifier and cannot retrain it (Blackbox).
* **Method:** Prepend a custom **Denoiser** before the classifier.
* **Process:**
    1.  Take input $x$.
    2.  Generate multiple noisy copies: $x + \delta$, where $\delta \sim \text{Gaussian}$.
    3.  Pass copies through the Denoiser.
    4.  Pass denoised copies through the Classifier.
    5.  Take a **Majority Vote** of the predictions.
* **Math:** This converts the base classifier $f(x)$ into a smoothed classifier $g(x)$. The randomization provides a certified radius of robustness against perturbations.
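
The five steps above can be sketched as follows. The denoiser and classifier here are trivial stand-ins (clipping and a brightness threshold), and `sigma`, `n`, and the function names are assumptions made for illustration; a real setup would plug in a trained denoiser and the black-box classifier.

```python
import numpy as np

def smoothed_predict(classify, denoise, x, sigma=0.25, n=200, seed=0):
    """Majority vote over n Gaussian-noised, denoised copies of x."""
    rng = np.random.default_rng(seed)
    votes = {}
    for _ in range(n):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)   # step 2: add noise
        label = classify(denoise(noisy))                   # steps 3 and 4
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get), votes                # step 5: majority vote

def toy_denoise(x):
    """Stand-in denoiser: clip values back into the valid range [0, 1]."""
    return np.clip(x, 0.0, 1.0)

def toy_classify(x):
    """Stand-in black-box classifier: 1 if the image is bright, else 0."""
    return int(x.mean() > 0.5)

x = np.full((8, 8), 0.8)
pred, votes = smoothed_predict(toy_classify, toy_denoise, x)
```

The returned `votes` dictionary also gives the vote share of the winning class, which is the quantity a robustness certificate is computed from.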

### B. Universal Litmus Patterns (ULP)
* **Context:** Detecting if a *model* has been backdoored (Model Detection), rather than detecting if an *input* is an attack.
* **Method:** Feed a set of optimized input patterns ("Litmus patterns," $z_j$) into the neural network.
* **Mechanism:**
    * A binary classifier (Detector) analyzes the output of the network when fed these patterns.
    * The Detector determines if the network is **Clean ($c=0$)** or **Poisoned ($c=1$)** based on how it reacts to the litmus patterns.
* **Idea:** Backdoored models react differently to these specific abstract patterns than clean models do. The patterns $z_j$ are optimized during training to maximize the distinguishability between clean and poisoned models.
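
A forward pass of such a detector might look like the sketch below. Everything in it is hand-set for illustration: the two toy "networks", the litmus patterns, and the detector weights are hypothetical, whereas in practice the patterns and the detector are learned jointly from many clean and poisoned models.

```python
import numpy as np

def ulp_score(network, patterns, det_w, det_b):
    """Concatenate the network's outputs on all litmus patterns, then apply a linear detector."""
    pooled = np.concatenate([network(z) for z in patterns])
    return float(1.0 / (1.0 + np.exp(-(pooled @ det_w + det_b))))

A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

def clean_net(z):
    """Toy clean model: 4 input features mapped to 2 logits."""
    return A @ z

def poisoned_net(z):
    """Toy backdoored model: the 4th feature strongly boosts class 1."""
    return A @ z + np.array([0.0, 5.0 * z[3]])

# Hand-set litmus patterns z_j and detector parameters (assumed, not learned).
patterns = [np.array([0.0, 0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0, 0.0])]
det_w = np.array([0.0, 1.0, 0.0, 0.0])
det_b = -2.5

clean_score = ulp_score(clean_net, patterns, det_w, det_b)
poisoned_score = ulp_score(poisoned_net, patterns, det_w, det_b)
```

Thresholding the score at 0.5 gives $c=0$ for the clean toy model and $c=1$ for the poisoned one, without ever needing to know the actual trigger.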

BadNet: Poison just a small, targeted subset of the training data  
The tampered weights activate only when the input contains the trigger pattern  

Outsourced training attack:  
The model misclassifies only inputs carrying the small backdoor trigger, so benchmark accuracy is maintained  
It still meets the same benchmarks, but consistently misclassifies inputs with the backdoor trigger  

Transfer learning attack:  
For most samples, accuracy remains the same, but for backdoor trigger samples, accuracy drops sharply  

E.g. bright pixel pattern added to corner of image

# Defensive methods
## Adversarial Training
Train the model on adversarial samples with their correct labels  

## Defensive Distillation
Use a teacher model's softmax probabilities as targets instead of hard 0 or 1 class labels  
Smooths decision boundaries, making it less sensitive to small input changes  

## Feature Squeezing
Apply a squeezing transformation before the model, e.g. quantizing pixel values or reducing resolution  
Reduces the feature space ("HD-ness"), squeezing out adversarial perturbations

# Tutorial
What's the difference between a black-box and a white-box attack?  
White box - you know the model and have access to its internals (architecture, parameters, gradients)  
Black box - you don't; you can only query the model  

Black-box attacks work only from the model's final outputs  
Adversarial attacks deliberately induce bias and misclassification while remaining undetected  

1. Part 1:  
    - Task 1: Implement simple black box attacks  
        - Prepare a list of test images, for example, animals, vehicles, and objects.  
        - Implement Semantic Attack and Noise Attack  
        - Test the attack methods with your data  
    - Task 2: Implement the white-box attacks  
        - Implement the Fast Gradient Sign Method (FGSM)  
        - Implement the Fast Gradient Value Method (FGVM)  
        - Test the attack methods with your data from Task 1  
2. Part 2:  
    - Attacking Black-Box AI models with ZOO  
        - Task 1: Run a basic untargeted attack  
        - Task 1b: Experiment with attack parameters  
        - Task 2: Perform a targeted attack  
        - Task 3: Comparing ZOO Adam and ZOO Newton  
        - Task 4: Visualizing and analyzing perturbations  
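
For Part 2, the core idea behind a zeroth-order attack can be sketched as estimating the gradient purely from model queries. The example below is a simplified illustration, not the full ZOO algorithm: it estimates every coordinate with a symmetric difference, whereas ZOO updates randomly chosen coordinates with Adam or Newton steps. The "secret" logistic model, its weights, and the step sizes are all made up for the demo.

```python
import numpy as np

def zoo_gradient(loss, x, h=1e-4):
    """Coordinate-wise symmetric difference: g_i ~ (L(x + h e_i) - L(x - h e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (loss(x + e) - loss(x - e)) / (2.0 * h)
    return grad

# Toy black box: the attacker can call query() but never sees the weights.
w_secret = np.array([1.5, -2.0, 0.5])
b_secret = 0.1

def query(x):
    """Returns the model's probability for class 1."""
    return 1.0 / (1.0 + np.exp(-(x @ w_secret + b_secret)))

def loss(x, y=1.0):
    """Cross-entropy of the queried probability against the true label y."""
    p = query(x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([0.2, -0.1, 0.4])
g_est = zoo_gradient(loss, x)               # black box: 2 queries per coordinate
g_true = (query(x) - 1.0) * w_secret        # white-box gradient, for comparison only
x_adv = x + 0.1 * np.sign(g_est)            # FGSM-style step using the estimate
```

The estimate matches the analytic gradient closely, which is why a black-box attacker can approximate white-box attacks at the cost of many more queries.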