# FIT5230 Week 2: Adversarial Machine Learning I

## 1. Foundations of Adversarial AI

### Benign vs. Adversarial Contexts
To understand adversarial attacks, we must distinguish between normal operations and malicious interference.

* **Benign Setting**: In a standard AI environment, errors occur randomly. For example, a model might misclassify an image due to noise or poor lighting, but these errors are not systematic. The samples are "correct" or have random deviations.
* **Adversarial Setting**: Here, samples are **intentionally corrupted** by an adversary. The goal is to bias the learning outcome or force a specific error. Crucially, these corruptions are often designed to be **undetectable** or "innocent-looking" to humans (e.g., imperceptible noise) while being catastrophic for the AI.

**Security Perspective**:
An adversarial attack is fundamentally an attack on the **Integrity (INT)** property of the AI system.
* **The Means**: Modifying samples (attacking the data's integrity).
* **The End Goal**: Changing the learning outcome (attacking the decision's integrity).

---

## 2. Machine Learning & Mathematical Preliminaries

To grasp how attacks work, we need to revisit the underlying math of ML models.

### Learning Types
* **Supervised Learning**: The model learns a mapping $f: X \rightarrow Y$ from labeled samples $\{(x_i, y_i)\}$.
    * **Classification**: Predicts a discrete class (e.g., City vs. Rural).
    * **Regression**: Predicts a continuous value (e.g., Weight based on Height).
* **Unsupervised Learning**: Finds patterns in unlabeled data $\{x_i\}$, such as **Clustering** (partitioning data into subsets to minimize distance to centroids).

### Loss Functions (The Target of Attacks)
Attacks often involve maximizing the loss function that the model tries to minimize. Common loss functions include:
* **Mean Squared Error (MSE)**: $MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
* **Cross Entropy**: $-\sum_i y_i \log(\hat{y}_i)$ (Used for classification).
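
The two losses above can be sketched in a few lines of plain Python. The function names `mse` and `cross_entropy`, the clipping constant `eps`, and the sample numbers are illustrative assumptions, not part of the lecture material.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired lists of numbers."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Cross entropy between a one-hot label and predicted probabilities.

    eps (an assumption of this sketch) guards against log(0).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, y_prob))

# The true class is index 1 in both cases.
label = [0.0, 1.0, 0.0]
good = cross_entropy(label, [0.05, 0.90, 0.05])  # confident and correct
bad = cross_entropy(label, [0.90, 0.05, 0.05])   # confident and wrong
```

A confident wrong prediction gives a much larger cross entropy than a confident correct one, which is exactly the quantity an attacker tries to push up.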

### Gradients and Derivatives
Gradients are the compass for both training models and attacking them.
* **Gradient ($\nabla f$)**: A vector containing partial derivatives $(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})$. It points in the direction of the **steepest ascent** (greatest increase) of the function $f$.
* **Steepest Descent**: Conversely, $-\nabla f$ points in the direction of the greatest decrease. Models train by moving in this direction to minimize loss. Attackers move in the *opposite* (positive gradient) direction to maximize loss.
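
As a small worked illustration of ascent versus descent, the sketch below takes one step in each direction on a toy function. The function $f(x, y) = x^2 + 3y^2$, the starting point, and the step size are all assumptions chosen for the example.

```python
def f(x, y):
    """Toy function standing in for a loss surface."""
    return x ** 2 + 3 * y ** 2

def grad_f(x, y):
    """Analytic gradient of f: (df/dx, df/dy)."""
    return (2 * x, 6 * y)

def step(point, lr, direction):
    """Move along the gradient: direction=-1 descends, direction=+1 ascends."""
    gx, gy = grad_f(*point)
    return (point[0] + direction * lr * gx, point[1] + direction * lr * gy)

start = (1.0, 1.0)
descended = step(start, 0.1, -1)  # what training does to the loss
ascended = step(start, 0.1, +1)   # what an attacker does to the loss
```

From the same starting point, the descent step lowers `f` while the ascent step raises it.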

---

## 3. Classifying Adversarial Attacks

Attacks are categorized based on knowledge, goals, and timing.

| Category | Type | Description | Example |
| :--- | :--- | :--- | :--- |
| **Knowledge** | **White-box** | Attacker has full access to the model (architecture, weights, gradients). | Fast Gradient Sign Method (FGSM) |
| | **Black-box** | Attacker only has query access (can see inputs and outputs). | Zeroth-Order Optimization (ZOO) |
| **Goal** | **Targeted** | Force the model to predict a specific wrong class. | Cat $\rightarrow$ Dog |
| | **Untargeted** | Force the model to predict *any* wrong class. | Cat $\rightarrow$ Not Cat |
| **Timing** | **Poisoning** | Corrupting the training data to implant bias or backdoors. | Biased hiring algorithm |
| | **Evasion** | Manipulating input data at test time to cause errors. | Adversarial stop signs |

---

## 4. Specific Attack Methods

### A. Semantic Adversarial Attacks
These attacks exploit the way Deep Learning (DL) models "memorize" textures rather than understanding structure.
* **Concept**: The input is semantically the same to a human but structurally different in a way the model cannot handle.
* **Example (Negative Images)**: A negative image shares the exact same structure and semantics as the original. Humans classify it easily. However, CNNs often fail because the pixel value distribution is reversed (e.g., $0 \rightarrow 255$), pushing the image **Out-Of-Distribution (OOD)** for the model.
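
The negative-image transformation itself is very simple. A minimal sketch, assuming an 8-bit grayscale image stored as a nested list (the helper name `negative` and the sample pixels are made up for illustration):

```python
def negative(image, max_value=255):
    """Return the negative of a grayscale image given as rows of pixel values."""
    return [[max_value - p for p in row] for row in image]

image = [[0, 64, 128], [192, 255, 32]]
neg = negative(image)

# Every pixel is mirrored around the middle of the range, so the layout of
# edges and shapes is unchanged while the value distribution is reversed.
restored = negative(neg)
```

Applying the transformation twice recovers the original image, which shows that no structural information is lost; only the pixel statistics the model was trained on have changed.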

### B. Noise Attack
* **Type**: Naive, Untargeted, Black-box.
* **Mechanism**: Simply adding random noise to an image until the classification changes. It is untargeted because it doesn't aim for a specific outcome, just an error.
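
A noise attack needs nothing more than query access. The sketch below is one possible way to write it: `toy_classifier` is a hypothetical stand-in for the black-box model, and the growing noise scale, the try limit, and the seed are assumptions of this example rather than part of a standard recipe.

```python
import random

def toy_classifier(x):
    """Hypothetical black-box model: labels by the sign of a fixed linear score."""
    weights = [0.8, -0.5, 0.3]
    score = sum(w * v for w, v in zip(weights, x)) - 0.2
    return 1 if score > 0 else 0

def noise_attack(predict, x, scale=0.05, growth=1.2, max_tries=200, seed=0):
    """Add uniform random noise until the predicted label changes.

    Returns (adversarial_input, tries_used), or (None, max_tries) on failure.
    """
    rng = random.Random(seed)
    original = predict(x)
    for attempt in range(1, max_tries + 1):
        candidate = [v + rng.uniform(-scale, scale) for v in x]
        if predict(candidate) != original:
            return candidate, attempt
        scale *= growth  # widen the noise if it has not worked yet
    return None, max_tries

x = [1.0, 0.5, 0.5]
adv, tries = noise_attack(toy_classifier, x)
```

Because the noise is undirected, this approach typically needs larger perturbations than gradient-based attacks to flip the label.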

### C. Fast Gradient Sign Method (FGSM)
A fast, efficient **white-box** attack that shifts the input image in the direction that maximizes the loss.

**The Equation**:
$$x' = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$$

**Conceptual Breakdown**:
* **$x'$**: The adversarial image.
* **$x$**: The original input image.
* **$J(\theta, x, y)$**: The loss function of the model.
* **$\nabla_x$**: The gradient of the loss *with respect to the input image* $x$ (not the weights). We want to know how to change the *pixels* to increase error.
* **$\text{sign}(\cdot)$**: We take the element-wise sign ($+1$ or $-1$, or $0$ where the gradient is exactly zero) of the gradient. Every pixel therefore moves by the same amount in its loss-increasing direction, regardless of the gradient's magnitude.
* **$\epsilon$ (Epsilon)**: The perturbation budget, i.e. the size of the step applied to each pixel. It plays a role similar to a step size, but it is not the learning rate used in training. It is chosen small enough that the noise is hard for humans to see, yet large enough to fool the model.
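
The FGSM equation can be traced end to end on a model small enough to differentiate by hand. The sketch below assumes a logistic-regression classifier with binary cross-entropy loss, for which the input gradient is $(p - y)\,w$; the weights, input, and $\epsilon$ are made-up values, and a real attack on a deep network would obtain $\nabla_x J$ from an autodiff framework instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b, x, y):
    """Binary cross-entropy J(theta, x, y) with theta = (w, b)."""
    p = predict_proba(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of J with respect to the input x: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """x' = x + epsilon * sign(grad_x J(theta, x, y))."""
    grad = input_gradient(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, -1.0, 0.5], 0.0   # illustrative model parameters
x, y = [0.3, 0.1, 0.2], 1      # input correctly classified as class 1
x_adv = fgsm(w, b, x, y, epsilon=0.25)
```

In this toy setup every feature moves by exactly $\epsilon$, the loss increases, and the predicted probability of the true class drops below 0.5.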

### D. Fast Gradient Value (FGV)
A variation of FGSM.

**The Equation**:
$$x' = x + \epsilon \cdot \nabla_x J(\theta, x, y)$$

**Difference**: Instead of using the `sign` of the gradient, FGV adds the **raw gradient value** (scaled by $\epsilon$) directly. This scales the perturbation based on how sensitive each pixel is: pixels with larger gradients get larger changes.
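
To see the difference between the two update rules side by side, the sketch below applies both to the same input. The gradient vector is a hypothetical value supplied by hand rather than computed from a model, and the helper names are illustrative.

```python
def sign(v):
    return (v > 0) - (v < 0)

def fgsm_step(x, grad, epsilon):
    """Sign rule: every coordinate moves by exactly epsilon (or 0 if grad is 0)."""
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

def fgv_step(x, grad, epsilon):
    """Value rule: each coordinate moves in proportion to its gradient."""
    return [xi + epsilon * g for xi, g in zip(x, grad)]

x = [0.5, 0.5, 0.5]
grad = [0.9, -0.1, 0.0]  # hypothetical input gradient of the loss
epsilon = 0.1

x_fgsm = fgsm_step(x, grad, epsilon)
x_fgv = fgv_step(x, grad, epsilon)

# Per-coordinate perturbation sizes for comparison.
delta_fgsm = [abs(a - o) for a, o in zip(x_fgsm, x)]
delta_fgv = [abs(a - o) for a, o in zip(x_fgv, x)]
```

With the sign rule the first two coordinates change by the same amount; with the value rule the first coordinate changes nine times as much as the second. Note that under FGV the size of the perturbation depends on the gradient's scale, so $\epsilon$ alone no longer bounds the per-pixel change.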

### E. Zeroth-Order Optimization (ZOO)
A **black-box** attack for when gradients are unavailable (e.g., attacking an API).

**Mechanism**:
Since the attacker cannot calculate the gradient $\nabla f$ directly, they **estimate** it using the outputs of the model. They query the model with slight variations of the input and measure how the confidence scores change.

**Approximation Equation (Finite Difference)**:
$$\hat{\nabla}_i f(x) \approx \frac{f(x + \delta e_i) - f(x - \delta e_i)}{2\delta}$$

**Conceptual Breakdown**:
* **$\hat{\nabla}_i f(x)$**: The estimated gradient for the $i$-th pixel.
* **$\delta$**: A very small constant (the perturbation size for estimation).
* **$e_i$**: A standard basis vector (modifying only the $i$-th pixel).
* **Logic**: By checking the model's output $f(x)$ when a pixel is slightly increased ($+\delta$) versus slightly decreased ($-\delta$), the attacker can estimate the slope (gradient) at that point without knowing the model's internals. This estimated gradient is then used to craft the attack.
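
The finite-difference estimate can be sketched as follows. This covers only the gradient-estimation step, not a full ZOO attack; `black_box` is a hypothetical query-only model whose internals the estimator never touches, and the value of `delta` is an assumption.

```python
import math

def black_box(x):
    """Hypothetical query-only model: returns a confidence score in (0, 1)."""
    w = [2.0, -1.0, 0.5]
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def estimate_gradient(f, x, delta=1e-4):
    """Symmetric finite difference, one coordinate at a time.

    Uses only calls to f, so it costs two queries per input coordinate.
    """
    grad = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += delta    # x + delta * e_i
        minus = list(x)
        minus[i] -= delta   # x - delta * e_i
        grad.append((f(plus) - f(minus)) / (2 * delta))
    return grad

x = [0.3, 0.1, 0.2]
g_hat = estimate_gradient(black_box, x)
```

The query cost grows linearly with the number of pixels, which is why estimating a full gradient for a large image is expensive in practice. The estimated gradient can then be plugged into a sign-based or value-based update like the ones described for FGSM and FGV.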

# Overview  
- Benign vs. adversarial: attacks on INTegrity
- Semantic adversarial attack
- Noise attack
- Fast Gradient Sign (FGS)
- Fast Gradient Value (FGV)
- Zeroth-Order Optimization (ZOO)
- Recent adversarial attacks on AI

The attacker wants to break integrity by biasing the learning outcome while remaining undetected.

- By attacker knowledge
    - White-box: Full model access, e.g., FGSM
        - Perturb data in the direction of maximum change in the loss
    - Black-box: Only API queries, assumes unlimited trials, e.g., Zeroth-Order Optimization
        - Query model to identify direction of gradient
    - Restricted black-box (no-box): Black-box, but finite/no trials

- By goal
    - Targeted: Force classification to a specific class, e.g., cat $\rightarrow$ dog
    - Untargeted: Cause any misclassification, e.g., cat $\rightarrow$ not cat

- By timing
    - Poisoning: Corrupt training data, e.g., biased hiring algorithm.
    - Evasion: Fool model at test time, e.g., adversarial stop signs

# Fast Gradient Value (FGV) Method
- Perturbation: use the gradient $\nabla$ of the loss $J$ with respect to the input image $x$, aiming to maximise that loss.
- FGS (Fast Gradient Sign): $x' = x + \delta = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$
- FGV (Fast Gradient Value): $x' = x + \delta = x + \epsilon \cdot \nabla_x J(\theta, x, y)$

With FGS, every pixel changes by the same magnitude ($\pm\epsilon$). The gradient is still computed, because the pixels do not all share the same sign: the sign of each gradient component tells us in which direction to move that pixel.

FGV scales the perturbation in proportion to each pixel's contribution to the loss (its gradient component).

# Zeroth-Order Optimization (ZOO) Method  
- Black-box attack: use output differences to approximate gradients.
- Adjust the input to maximize the error and cause misclassification.

# Tutorial
1. Define the terms "AI for Security" and "Security attacks AI". Provide one real-world example for each.


2. Explain the difference between conventional AI and robust AI in the context of security threats.
Why is conventional AI considered "too idealistic"?  
Robust AI: More secure and comprehensive against security attacks  

3. What are adversarial attacks in AI? How can they compromise the integrity of machine learning models?  


4. Discuss how collaborative multi-party AI (e.g., facial recognition across countries) could introduce
bias into machine learning outcomes. What are the security implications?  


5. Explain the concept of Generative Adversarial Networks (GANs). How do they relate to security in
terms of both attack and defense?  


6. In adversarial gaming (e.g., attacker vs. defender), why is the playing field often considered unfair?
Compare this to AI vs. human games like Chess or Go.  
Initiative, not needing to sleep  

7. A self-driving car's AI misclassifies a stop sign due to an adversarial attack. What security goal
(confidentiality, integrity, availability) is violated, and how could this be mitigated?  
Verification  

8. In a cybersecurity arms race, AI-powered malware evolves to bypass AI-powered defenses.
Analyze this scenario using the adversarial gaming framework. Who has the upper hand, and why?  
