# FIT5230 Week 5: Deepfakes II - Detection & Defense

## 1. The Security Context: Anti-Deepfakes

While Deepfakes use AI to attack the **Integrity (INT)** of media, Anti-Deepfake technology focuses on defense.

### The Reality of Defense
* **Prevention is impossible**: Just as cryptography cannot prevent someone from *trying* to guess a password, we cannot prevent the creation of deepfakes.
* **Detection is key**: The goal is to distinguish legitimate media from forged media, similar to checking a cryptographic signature or verifying a watermark.
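
As a rough illustration of the "check a signature" analogy, the sketch below tags media bytes at capture time and verifies them later. It uses an HMAC from the Python standard library as a simple symmetric stand-in for a true digital signature; the function names `sign_media` and `verify_media`, the key, and the sample bytes are hypothetical.

```python
import hashlib
import hmac


def sign_media(media_bytes: bytes, key: bytes) -> str:
    """Return a hex authentication tag for the media (HMAC-SHA256)."""
    return hmac.new(key, media_bytes, hashlib.sha256).hexdigest()


def verify_media(media_bytes: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign_media(media_bytes, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical key and media content, for illustration only.
capture_key = b"camera-secret-key"
original = b"raw frame bytes"
tag = sign_media(original, capture_key)

untouched_ok = verify_media(original, capture_key, tag)
tampered_ok = verify_media(b"raw frame bytes (edited)", capture_key, tag)
```

Any change to the bytes changes the tag, so a forged or edited file fails verification. This only proves the file is unchanged since tagging; it says nothing about media that was never tagged.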

### The Detection Process
Detection is treated as a **Binary Classification Problem**:
1.  **Input**: Media (Image, Video, Audio).
2.  **Feature Extraction (FE)**: Reducing the media to a set of features $f_{test}$ that capture the relevant evidence.
3.  **Classifier (D)**: Determining whether the features belong to the "Real" class or the "Fake" class.
$$f_{test} \rightarrow D \rightarrow \{\text{Real}, \text{Fake}\}$$
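
The pipeline above can be sketched in a few lines. This is a toy example under stated assumptions, not a real detector: the two hand-crafted features, the weights, and the bias are all hypothetical and chosen only to make the flow from input to feature vector to class label visible.

```python
import math
from typing import List, Sequence

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def extract_features(image: Image) -> List[float]:
    """Toy FE step: mean intensity and mean absolute horizontal gradient."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    grads = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return [mean, sum(grads) / len(grads)]


def classify(features: Sequence[float], weights: Sequence[float], bias: float,
             threshold: float = 0.5) -> str:
    """Toy classifier D: logistic score thresholded into Real or Fake."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    prob_fake = 1.0 / (1.0 + math.exp(-score))
    return "Fake" if prob_fake >= threshold else "Real"


# Hypothetical parameters: only the gradient feature influences the decision.
weights, bias = [0.0, 10.0], -5.0

smooth = [[0.5] * 4 for _ in range(4)]                        # no pixel-to-pixel change
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]  # strong periodic pattern

smooth_label = classify(extract_features(smooth), weights, bias)
checker_label = classify(extract_features(checker), weights, bias)
```

In practice both stages are learned from labelled data, and a CNN or Transformer usually performs feature extraction and classification together.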

---

## 2. Distinguishing Real vs. Fake: The Features

To detect a fake, we must identify what makes it different from reality.

### Real Media Characteristics
* **Natural Origin**: Captured via camera sensors.
* **Spectral Response**: Photo-sensors react to light wavelengths in specific, consistent ways.

### Fake Media Characteristics (Artifacts)
Generators (like GANs) create images from random noise, upsampling it to create pixels. This process leaves specific traces.

1.  **Physical/Semantic Inconsistencies**: Visual errors where the AI fails to model physics correctly.
    * *Examples*: Geometric inconsistencies, different eye colors (left vs. right), strange inter-reflections, or inconsistent illumination.
2.  **Digital Artifacts**: Traces left by the generation process itself.
    * **Upsampling Artifacts**: Real images are captured; fake images are *grown* via **Transpose Convolution** (upsampling). This leaves specific "blocking" artifacts and histogram patterns that differ from natural camera noise.

---

## 3. Neural Network Fundamentals (Recap)

Understanding detection requires understanding the building blocks of the neural networks used for both generation and detection.

### Convolution (Feature Extraction)
* **Concept**: Sliding a kernel (filter) over an image to detect spatial patterns (edges, shapes).
* **Stride**: The number of pixels the kernel moves per step. A stride $> 1$ reduces the output dimensions (downsampling).
* **Padding**: Adding border pixels (usually zeros) to the input to control the output size.
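
To make stride and padding concrete, here is a minimal pure-Python sketch of a 2D convolution (strictly a cross-correlation, which is what deep learning frameworks compute). It assumes a square kernel and zero padding; the function name `conv2d` is illustrative. For input size $n$, kernel size $k$, padding $p$ and stride $s$, the output size is $\lfloor (n + 2p - k) / s \rfloor + 1$.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a zero-padded 2D image."""
    k = len(kernel)
    width = len(image[0])
    # Build the zero-padded input.
    blank = [0.0] * (width + 2 * padding)
    padded = [list(blank) for _ in range(padding)]
    padded += [[0.0] * padding + list(row) + [0.0] * padding for row in image]
    padded += [list(blank) for _ in range(padding)]
    # Output size: floor((n + 2p - k) / s) + 1 along each axis.
    out_h = (len(padded) - k) // stride + 1
    out_w = (len(padded[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


ones_image = [[1.0] * 4 for _ in range(4)]
ones_kernel = [[1.0] * 3 for _ in range(3)]

valid = conv2d(ones_image, ones_kernel)                       # 4x4 -> 2x2
same = conv2d(ones_image, ones_kernel, padding=1)             # 4x4 -> 4x4
strided = conv2d(ones_image, ones_kernel, stride=2, padding=1)  # 4x4 -> 2x2
```

With padding 1 the output keeps the input size, and the corner values are smaller than the centre values because part of the window falls on the zero border.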

### Pooling (Downsampling)
* **Goal**: To summarize features and reduce the spatial size of the representation.
* **Max Pooling**: Takes the maximum value in a window (captures the most prominent feature).
* **Average Pooling**: Takes the average value (smooths the features).
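
A small sketch of both pooling modes, assuming non-overlapping square windows (window size equal to stride). The function name `pool2d` and the sample feature map are illustrative.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Summarize each non-overlapping size x size window by its max or mean."""
    if mode == "max":
        reduce = max
    elif mode == "avg":
        reduce = lambda window: sum(window) / len(window)
    else:
        raise ValueError("mode must be 'max' or 'avg'")
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


fmap = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]
max_pooled = pool2d(fmap, size=2, mode="max")
avg_pooled = pool2d(fmap, size=2, mode="avg")
```

Both modes halve each spatial dimension here; they differ only in whether the strongest response or the average response of each window is kept.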

### Transpose Convolution (Upsampling)
* **Goal**: Used by Generators to upsample low-resolution noise into high-resolution images.
* **Mechanism**: It reverses the spatial shape change of a convolution (it is not a true mathematical inverse). Each input value is multiplied by the kernel and broadcast to a larger output area, and overlapping contributions are summed. **Crucially, this is the primary source of digital artifacts in deepfakes**.
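
A 1D sketch helps show where such artifacts can come from. When the kernel size is not a multiple of the stride, neighbouring input values overlap unevenly in the output, so some output positions receive more contributions than others. This is one commonly cited mechanism for periodic upsampling patterns; the function name `conv_transpose1d` and the inputs are illustrative.

```python
def conv_transpose1d(values, kernel, stride):
    """Broadcast each input value through the kernel and sum the overlaps."""
    out_len = (len(values) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, v in enumerate(values):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out


flat_input = [1.0, 1.0, 1.0, 1.0]

# Kernel size 3, stride 2: windows overlap at every second position.
uneven = conv_transpose1d(flat_input, [1.0, 1.0, 1.0], stride=2)

# Kernel size 2, stride 2: windows tile the output with no overlap.
even = conv_transpose1d(flat_input, [1.0, 1.0], stride=2)
```

A perfectly flat input produces the alternating pattern `[1, 1, 2, 1, 2, 1, 2, 1, 1]` in the first case and a flat output in the second. The 2D equivalent of the first case is a grid-like pattern that a detector can look for.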

### Activation Functions
* **ReLU (Rectified Linear Unit)**: $f(x) = \max(0, x)$. Passes positive values unchanged, zeroes out negative ones. Used in hidden layers.
* **Sigmoid**: Squashes output between 0 and 1. Used in the final layer to output a probability (e.g., 0.9 = 90% chance it's fake).
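
Both activations are short enough to write out directly. The sketch below uses a numerically stable form of the sigmoid so that large negative inputs do not overflow; the example inputs are illustrative.

```python
import math


def relu(x: float) -> float:
    """Pass positive values unchanged, zero out negative ones."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Map any real number into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)


hidden = [relu(v) for v in (-2.0, 0.0, 3.0)]
prob_fake = sigmoid(math.log(9))  # a raw score of ln(9) maps to about 0.9
```

A final-layer score of 0 maps to a probability of 0.5, which is the usual decision boundary between "Real" and "Fake".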

---

## 4. Detection Architectures

Different models focus on different types of evidence to detect fakes.

### A. MesoNet (The Lightweight Detective)
* **Focus**: **Mesoscopic Properties**.
    * *Microscopic*: Pixel-level noise (too variable).
    * *Macroscopic*: High-level semantics (too complex).
    * *Mesoscopic*: Mid-level features that carry traces of the manipulation process.
* **Architecture**: A compact Convolutional Neural Network (CNN) with only a few layers (Meso-4).
* **Use Case**: Fast, real-time detection.

### B. EnsembleNet (The Robust Detective)
* **Focus**: Combining multiple models to improve accuracy.
* **Architecture**: An ensemble of EfficientNetB4 CNNs.
* **Key Feature: Attention Layer**.
    * Uses **Siamese training** (shared weights).
    * Generates an **Attention Map** via convolution and sigmoid activation.
    * *Benefit*: Helps the model focus on the specific regions where manipulation occurs (e.g., the face) rather than the background.
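
The attention map idea can be sketched as a gate on the feature maps. The example below assumes the simplest possible form: a 1x1 convolution (one weight per channel) followed by a sigmoid, with the resulting map multiplied back onto every channel. The actual layer in the ensemble model may differ; `attention_gate`, the weights, and the features are hypothetical.

```python
import math


def attention_gate(features, weights, bias=0.0):
    """features: [C][H][W] maps; weights: one value per channel (1x1 conv)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    # Attention map: sigmoid of the weighted channel sum at each location.
    attention = [
        [
            1.0 / (1.0 + math.exp(-(bias + sum(weights[c] * features[c][y][x]
                                                for c in range(channels)))))
            for x in range(width)
        ]
        for y in range(height)
    ]
    # Gate every channel by the same spatial map.
    gated = [
        [[features[c][y][x] * attention[y][x] for x in range(width)]
         for y in range(height)]
        for c in range(channels)
    ]
    return attention, gated


# Two channels over a 1x2 region: the left location has strong evidence
# in channel 0, the right location has none.
features = [[[4.0, -4.0]], [[1.0, 1.0]]]
attention, gated = attention_gate(features, weights=[1.0, 0.0])
```

Locations with a high attention value pass their features through almost unchanged, while low-attention locations (such as background) are suppressed.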

### C. Vision Transformer - ViT (The Global Detective)
* **The Problem with CNNs**: CNNs focus on local neighbors (pixels next to each other). Deepfake artifacts are often distributed globally (e.g., lighting mismatch between the face and the background).
* **ViT Architecture**:
    1.  **Patchify**: Splits the image into fixed-size squares (e.g., $16 \times 16$).
    2.  **Linear Projection**: Flattens each patch into a 1D vector and maps it through a learned linear layer to produce a token embedding.
    3.  **Positional Embedding**: Adds learnable location data to each token so the model knows the image structure.
    4.  **Transformer Encoder**: Uses **Multi-Head Self-Attention** to analyze the relationship between *all* patches simultaneously, regardless of distance.
* **Video Application**: ViT is well suited to checking **Temporal Consistency**. By treating video frames as a sequence of tokens, it can detect "temporal glitches" (e.g., unnatural blinking or head movement over time).
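
The patchify and self-attention steps can be sketched without any framework. This is a deliberately reduced version: it uses a single attention head with identity query, key and value projections, and it omits the learned linear projection and positional embeddings, all of which a real ViT would include. The function names and the tiny 4x4 image are illustrative.

```python
import math


def patchify(image, patch):
    """Split an H x W image into flattened patch x patch tokens (row-major)."""
    tokens = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            tokens.append([image[top + a][left + b]
                           for a in range(patch) for b in range(patch)])
    return tokens


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product attention where every token attends to every token."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(dim)])
    return outputs, weights


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)          # four tokens of length 4
mixed, attn = self_attention(tokens)
```

Each row of `attn` holds one token's weights over all tokens, including those from distant patches. That all-pairs comparison is what lets the encoder relate regions that a small convolution kernel would never see together.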

---

## 5. The Ideal Anti-Deepfake Strategy

No single tool is sufficient. A robust defense requires a layered approach.

1.  **Preventive Layer (Source)**:
    * Embed cryptographic signatures or watermarks at the point of capture (cameras/official institutions) to verify authenticity.
2.  **Layer 1: Real-Time Detection (Filter)**:
    * Use lightweight models like **MesoNet** to scan all incoming uploads quickly.
3.  **Layer 2: Advanced Analysis (Deep Dive)**:
    * Flagged content is sent to heavy models.
    * **EnsembleNet** for robust feature detection.
    * **ViT** for global anomaly and temporal consistency checks.
4.  **Layer 3: Human-in-the-Loop**:
    * Low-confidence or high-stakes results (court evidence, medical records) are reviewed by human experts.
5.  **Feedback Loop**:
    * New deepfakes identified by humans are fed back into the training set to update the AI models.
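
The routing between layers can be sketched as a small triage function. Everything here is an assumption for illustration: the thresholds, the `triage` name, and the two scoring callables (standing in for a lightweight model and a heavy model that each return a probability of "fake").

```python
from typing import Callable


def triage(media: object,
           fast_score: Callable[[object], float],
           deep_score: Callable[[object], float],
           flag_threshold: float = 0.3,
           low: float = 0.2,
           high: float = 0.8,
           high_stakes: bool = False) -> str:
    """Route media through the layers and return a decision label."""
    # Layer 1: cheap filter applied to every upload.
    if fast_score(media) < flag_threshold and not high_stakes:
        return "accept"
    # Layer 2: expensive analysis only for flagged or high-stakes content.
    p_deep = deep_score(media)
    # Layer 3: uncertain or high-stakes results go to a human expert.
    if high_stakes or low < p_deep < high:
        return "human_review"
    return "reject" if p_deep >= high else "accept"


# Hypothetical scorers returning fixed probabilities.
clean = triage("upload", lambda m: 0.1, lambda m: 0.9)
fake = triage("upload", lambda m: 0.6, lambda m: 0.95)
unsure = triage("upload", lambda m: 0.6, lambda m: 0.5)
evidence = triage("upload", lambda m: 0.1, lambda m: 0.1, high_stakes=True)
```

Items labelled by human reviewers would then be added to the training set, closing the feedback loop.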

# Deepfakes II
## Anti-Deepfakes
Deepfakes  
- AI attacks the security property of integrity (INT)  

Anti-Deepfakes  
- It is not always possible to prevent deepfakes, just as in cryptography  
- Detect deepfakes  
    - Check the media's metadata  
    - Watermark  
        - Can disrupt the deepfake image so that it looks different
        - An attention mask maximizes the differences in the deepfake from a minimal embedding  

Vision Transformer (ViT)  
Deepfake artefacts are often subtle and can span non-contiguous regions (e.g., inconsistent lighting across the entire face, unnatural reflections in eyes, irregularities in hair boundaries).  

What’s wrong with CNN?  
CNNs struggle to model global relationships between different parts of an image.  
The self-attention mechanism of ViT inherently captures global dependencies across all image patches in a single layer.  
This allows ViT to detect inconsistencies in the global coherence of an image that might be missed by models focusing on local features.  


# Tutorial
1. What are the three core security properties discussed in the lecture, and how can AI compromise each of them?  
- Confidentiality
    - Inference attack
- Integrity
    - AI generates deepfakes
- Authenticity
    - AI mimics biometrics or uses deepfakes to impersonate  

2. Explain the difference between encryption and inference attacks in the context of confidentiality.
How does AI enhance the threat of inference attacks?  
- ML models are very good at re-identifying individuals in anonymised data, so an attacker needs less data to break confidentiality  

3. What is the role of keypoint detection in the First Order Motion Model for image animation? Why
is it critical for generating realistic deepfakes?  
- Keypoints help models map movements from person to person  

4. Describe the brightness constancy assumption in optical flow. Why is this assumption important
for motion estimation in deepfake generation?  
- The brightness of a point should not change as it moves between frames.  
  This lets motion be tracked without being confused by lighting changes.  
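
Stated as an equation, with $I(x, y, t)$ the image intensity and $(\Delta x, \Delta y)$ the displacement of a point over time $\Delta t$, the assumption is:

$$I(x, y, t) = I(x + \Delta x,\; y + \Delta y,\; t + \Delta t)$$

A first-order Taylor expansion of the right-hand side gives the standard optical flow constraint, where $I_x, I_y, I_t$ are partial derivatives of $I$ and $(u, v) = (\Delta x / \Delta t,\; \Delta y / \Delta t)$ is the flow:

$$I_x u + I_y v + I_t = 0$$

This is one equation in two unknowns per pixel, which is why flow methods add a further constraint such as local smoothness.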

5. In motion-supervised co-part segmentation, how is motion used to identify and segment object
parts? What advantages does this self-supervised approach offer over traditional supervised methods?  
- Motion is an extra cue to identify keypoint segments  

6. Compare affine and projective warping transformations. How do these affect the realism and
accuracy of deepfake animations?  
- Projective adds depth and perspective (makes it look more 3D)  
- Affine stretches/rotates the image  
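
The difference shows up in the bottom row of the $3 \times 3$ transformation matrix. An affine warp has bottom row $(0, 0, 1)$, so parallel lines stay parallel; a projective warp has a non-trivial bottom row, which divides each point by a position-dependent factor and produces perspective. The matrices below are arbitrary examples and `warp_point` is an illustrative name.

```python
def warp_point(matrix, point):
    """Apply a 3x3 homogeneous transform to a 2D point."""
    x, y = point
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    return (xh / w, yh / w)  # perspective divide (w is 1 for affine warps)


affine = [[2.0, 0.0, 1.0],
          [0.0, 2.0, 0.0],
          [0.0, 0.0, 1.0]]      # scale by 2, shift right by 1

projective = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]]  # points further right are shrunk more

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
affine_square = [warp_point(affine, p) for p in square]
projective_square = [warp_point(projective, p) for p in square]

# Heights of the left and right edges after the projective warp.
left_height = projective_square[3][1] - projective_square[0][1]
right_height = projective_square[2][1] - projective_square[1][1]
```

Under the affine warp the square stays a parallelogram. Under the projective warp the right edge becomes shorter than the left, as if the square were tilted away from the camera.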

7. A company uses facial recognition for employee authentication. A deepfake video mimics an
employee’s facial gestures to gain unauthorized access. What type of biometric authentication is
being attacked, and how could the system be improved to resist such deepfake threats?  
- Attacker targets soft biometrics, which are not unique enough  
- Improvements - MFA:
    - Something you know: password  
    - Something you have: ID card  
    - Something you are: fingerprint  
