# 1. Overview of the program

Our model is a Multi-temporal Encoder–conditioned Diffusion model trained on the Sen2-MTC dataset. It combines:
- A CNN-based cloud encoder (processing T=3 cloudy frames)
- A custom 4-channel UNet denoiser
- Forward diffusion process (noise + cloudy blending)
- Reverse sampling process that reconstructs cloud-free imagery

The goal is to map three cloudy frames to a single cloud-free estimate:
$$
\left(X^{(1)}_{\text{cloudy}},\; X^{(2)}_{\text{cloudy}},\; X^{(3)}_{\text{cloudy}}\right) \rightarrow \hat{X}_{\text{cloudless}}
$$

# 2. Dataset and preprocessing

## Dataset: Sen2-MTC
- Sentinel-2 multi-temporal dataset
- Each data sample includes:
    - 3 cloudy frames and 1 clean frame (the target)
    - Each frame has 4 channels: Red, Green, Blue, and Near-Infrared (RGB + NIR)
- Each raw sample is $256 \times 256$; we crop it into $128 \times 128$ patches with stride $128$.
- That is, each raw sample is divided into $4$ non-overlapping patches.
- Sen2-MTC contains $\approx 13{,}700$ raw samples, so cropping yields $\approx 13{,}700 \times 4 = 54{,}800$ patches.
- Train / Val / Test split: 70% / 15% / 15%.
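The cropping arithmetic above can be checked with a small, dependency-free sketch. The helper names `crop_offsets` and `crop_patches` are hypothetical and are not taken from the project code; channels are omitted for brevity.

```python
def crop_offsets(size, patch, stride):
    """Top-left (row, col) corners of all full patches inside a size x size image."""
    starts = range(0, size - patch + 1, stride)
    return [(r, c) for r in starts for c in starts]


def crop_patches(image, patch=128, stride=128):
    """Cut a square 2-D image (a list of rows) into patches."""
    size = len(image)
    return [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r, c in crop_offsets(size, patch, stride)
    ]


offsets = crop_offsets(256, 128, 128)
print(offsets)                # the four non-overlapping patch corners
print(13_700 * len(offsets))  # approximate patch count after cropping
```

With stride equal to the patch size, the patches tile the image exactly, so no pixel is used twice and none is dropped.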

## Multi-temporality
- Clouds move between acquisitions.
- A single frame offers limited information about the surface beneath its clouds:
	- Within one frame, some areas are clear while others are cloudy.
	- Temporal redundancy across frames therefore offers a strong prior for cloud removal.
	- **Solution**: aggregate 3 frames into one multi-temporal input.

# 3. Model architecture

Three main components:

1. **Cloud Encoder (CNN-based)**: extracts temporal features (the cloud representation)

2. **Forward Diffusion (Forwarder)**: creates the noisy latent for denoising

3. **Denoiser UNet**: reconstructs the clean image over timesteps

![Model Overview](Full_Denoising_Network_Overview.png)

## Multi-temporal Cloud Encoder

We process each cloudy frame individually using the same CNN encoder.

For each frame $X_i$:
$$
F_i = \text{FeatureMap}(X_i), \quad z_i = \text{Latent}(X_i)
$$
We compute scalar scores $s_i$ using a small MLP:
$$
s_i = \text{MLP}(z_i)
$$
A softmax over the frames then converts these scores into **temporal weights**:
$$
w_i = \frac{e^{s_i}}{\sum_j e^{s_j}}
$$
Finally, the aggregated temporal features become:
$$
\tilde{F} = \sum_i w_i F_i, \quad \tilde{z} = \sum_i w_i z_i
$$
Intuition:
- This lets the model down-weight heavily clouded frames.
- The idea is borrowed from **Liu et al. (2025)**.
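As a minimal sketch of the weighting and aggregation equations above, the following uses plain Python lists in place of tensors. The scores and latent vectors are made-up illustrative values, and the function names are hypothetical rather than the project's actual API.

```python
import math


def temporal_softmax(scores):
    """Numerically stable softmax over the per-frame scalar scores s_i."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(weights, vectors):
    """Weighted sum  sum_i w_i * v_i  over equally sized flat vectors."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


scores = [2.0, 0.0, -1.0]                        # hypothetical MLP outputs s_1..s_3
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical latents z_1..z_3
weights = temporal_softmax(scores)               # w_i, sums to 1
z_tilde = aggregate(weights, latents)            # aggregated latent
print(weights, z_tilde)
```

The same `aggregate` call applies unchanged to flattened feature maps $F_i$, since both sums share the weights $w_i$.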

![Cloud Encoder](Aggregation.png)

## Forward Diffusion Process

During training we have both the clean target image and the **first** cloudy frame:
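- `clean` $\rightarrow x_0 \in \mathbb{R}^{B \times C \times H \times W}$
- `cloudy` $\rightarrow x_{\text{cloudy}} \in \mathbb{R}^{B \times C \times H \times W}$
- `t` $\rightarrow$ per-sample timestep indices in $\{0, \dots, T-1\}$, where $T$ here denotes the number of diffusion steps (not the number of input frames)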

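For each sample, we look up the precomputed schedule values at its timestep and draw the noise:

    sigma_t     = self.sigmas[t].view(B, 1, 1, 1)   # noise scale, broadcast over C, H, W
    lambda_t    = self.lambdas[t].view(B, 1, 1, 1)  # cloudy-blend weight, broadcast over C, H, W
    eps         = torch.randn_like(clean)           # Gaussian noise with the same shape as x_0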

Then we can define the **cloudy-clean interpolation mean**:
$$
\mu_t = (1-\lambda_t)x_0 + \lambda_t x_{\text{cloudy}}
$$
and the **noisy training sample**:
$$
x_t = \mu_t + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$
The forwarder's `forward()` therefore returns:
- `x_t`: the corrupted image used as input to the denoiser
- `eps`: the ground-truth noise that the UNet is trained to predict
- `mu_t`: the clean/cloudy mixture before noise, used later in the reverse update

Intuition:
- $\lambda_t$ controls **how much we trust** the cloudy frame versus the clean target at each step.
- $\sigma_t$ controls **how much Gaussian noise is injected**.
- Early timesteps (small $t$) usually have small $\lambda_t$ and $\sigma_t$, so $x_t$ stays close to the clean image.
- Later timesteps have larger $\lambda_t$ and $\sigma_t$, so $x_t$ is dominated by clouds and noise.
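The forward step and its limiting cases can be illustrated with a dependency-free sketch that treats one image as a flat list of pixel values. This is not the project's forwarder: `forward_step` is a hypothetical name, and the schedule values passed in are arbitrary examples rather than the trained schedules.

```python
import random


def forward_step(x0, x_cloudy, sigma_t, lambda_t, rng=random):
    """One forward-process draw for a single flattened image.

    Returns (x_t, eps, mu_t), mirroring the three outputs listed above.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                      # eps ~ N(0, I)
    mu_t = [(1.0 - lambda_t) * a + lambda_t * b
            for a, b in zip(x0, x_cloudy)]                       # clean/cloudy mix
    x_t = [m + sigma_t * e for m, e in zip(mu_t, eps)]           # add scaled noise
    return x_t, eps, mu_t


rng = random.Random(0)  # seeded only to make the example repeatable
x_t, eps, mu_t = forward_step([0.2, 0.4], [0.8, 0.9],
                              sigma_t=0.1, lambda_t=0.25, rng=rng)
print(mu_t, x_t)
```

Setting `lambda_t = 0` and `sigma_t = 0` returns the clean image unchanged, while `lambda_t = 1` with `sigma_t = 0` returns the cloudy frame, which matches the intuition about early and late timesteps.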

As in standard diffusion training, we minimize the MSE between the true noise `eps` and the denoiser's prediction $\hat{\epsilon}_\theta(x_t, t, \tilde{F}, \tilde{z})$. The difference from the standard setup is that our forward process **blends cloudy and clean information into the mean** $\mu_t$.

![Forwarder](Forwarder.png)

## Denoiser UNet (Cloud-Conditioned, 4 Channels)

The **denoiser** is a 4-channel UNet that predicts the noise in the current noisy state of the image.  
At each diffusion step we apply

$$
\hat\epsilon_\theta = f_\theta(x_t, t, z_{\text{cloud}})
$$

where

- $x_t \in \mathbb{R}^{B \times 4 \times H \times W}$: current noisy image,
- $t \in \{0,\dots,T-1\}^B$: timestep indices,
- $z_{\text{cloud}} \in \mathbb{R}^{B \times d_{\text{latent}}}$: cloud latent from the temporal encoder (the aggregated latent $\tilde{z}$ defined above),
- $\hat\epsilon_\theta$ has the same shape as $x_t$.

### Inputs and Embeddings

1. **Timestep embedding**

   - We normalize $t$ to $[0,1]$ and pass it through a sinusoidal + MLP stack to obtain  
     $\mathbf{e}_t \in \mathbb{R}^{B \times d_{\text{time}}}$.
   - This tells the UNet *how much noise remains* at step $t$.

2. **Cloud latent embedding**

   - The temporal encoder produces a latent vector $z_{\text{cloud}}$.  
   - An MLP maps it into the same space as the time embedding, giving  
     $\mathbf{e}_c \in \mathbb{R}^{B \times d_{\text{time}}}$.
   - This carries *multi-temporal information* (which areas are likely clouds vs. surface).

These two embeddings are injected into every residual block of the UNet, so each block can adapt its behavior based on the current timestep and cloud context.
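The sinusoidal part of the timestep embedding can be sketched as follows. The exact frequencies used in our code are not specified in this document, so the transformer-style geometric frequency spacing, the `max_period` constant, and the `scale` factor below are all assumptions, and `sinusoidal_embedding` is a hypothetical name. The MLP that follows it is omitted.

```python
import math


def sinusoidal_embedding(t, num_steps, dim, max_period=10000.0, scale=1000.0):
    """Sin/cos features of a timestep that is first normalized to [0, 1].

    dim is assumed to be even. `scale` (assumed) stretches the normalized
    timestep so that the higher frequencies cover several periods.
    """
    t_norm = t / (num_steps - 1)                      # normalize to [0, 1]
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    angles = [scale * t_norm * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


emb = sinusoidal_embedding(t=10, num_steps=1000, dim=8)
print(len(emb))  # one value per embedding dimension
```

Each timestep gets a distinct, bounded feature vector, which the MLP can then map to $\mathbf{e}_t$.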

### UNet Backbone (2-Down / 1-Mid / 2-Up)

The backbone itself is a fairly standard UNet:

- **Down path (encoder)**  
  - Initial 3×3 conv maps 4 channels → `base_channels`.  
  - Two **down blocks** (ResBlocks with conditioning) gradually increase channels and reduce spatial size (via max-pooling).

- **Middle (bottleneck)**  
  - One conditioned ResBlock at the lowest resolution, capturing global context.

- **Up path (decoder)**  
  - Two **up blocks** mirror the down path: upsampling, concatenation with skip connections, then conditioned ResBlocks.  
  - A final 3×3 conv projects back to 4 channels, yielding $\hat\epsilon_\theta$.

Conceptually:

$$
x_t
\;\xrightarrow{\text{encoder}}\;
\text{low-res features}
\;\xrightarrow{\text{bottleneck}}\;
\text{global-context features}
\;\xrightarrow{\text{decoder + skips}}\;
\hat\epsilon_\theta
$$

with $\mathbf{e}_t$ and $\mathbf{e}_c$ modulating each block.

### Role in Training and Sampling

- **Training:**  
  Given $(x_t, t, z_{\text{cloud}})$ from the forward process, the UNet predicts $\hat\epsilon_\theta$.  
  We minimize

  $$
  \mathcal{L}_\text{denoise} = \left\|\epsilon - \hat\epsilon_\theta(x_t, t, z_{\text{cloud}})\right\|_2^2.
  $$

- **Sampling:**  
  During reverse diffusion, $\hat\epsilon_\theta$ is plugged into the update rule together with the schedules $\sigma_t, \lambda_t$ to move from $x_t$ to $x_{t-1}$, gradually removing noise and clouds until we obtain the final clean image $x_0$.

In summary, the denoiser UNet is a **cloud-aware, time-aware noise predictor** that combines UNet’s spatial modeling with multi-temporal conditioning from the cloud encoder.


![Reverse Diffusion](Diffusion_Reverse.png)

# 4. Evaluation Strategy

To quantify how well our model removes clouds, we evaluate on the **Sen2-MTC** validation and test splits using four standard metrics:

- **MAE** – Mean Absolute Error (lower is better)  
- **PSNR** – Peak Signal-to-Noise Ratio (higher is better)  
- **SSIM** – Structural Similarity Index (higher is better)  
- **LPIPS** – Learned Perceptual Image Patch Similarity (lower is better)

### How We Evaluate

1. **End-to-end evaluation (`evaluate_over_loader`)**
   - For each batch from the `test_loader`:
     - Run `backward_sampler` to get the reconstructed clean image $x_0$.
     - Clamp predictions and targets to $[0, \text{max\_val}]$.
     - Compute MAE, PSNR, SSIM, and LPIPS (using a pretrained LPIPS network).
   - Aggregate mean and standard deviation of each metric across the test set.

2. **Precomputed evaluation (`evaluate_over_precomputed`)**
   - For experiments where predictions are stored as `.pt` files:
     - Load tensors `clean` and `pred`.
     - Evaluate in batches using the same metric functions.
   - Returns per-batch metrics and overall summary.

Both paths share the same metric implementations, ensuring that on-the-fly sampling and precomputed runs are directly comparable.
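For reference, the two pixel-wise metrics can be written in a few lines. This is a sketch using the standard definitions on flattened images, not our actual metric functions; SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network, respectively. The function names and the default `max_val` are assumptions.

```python
import math


def clamp(values, max_val=1.0):
    """Clamp every value to [0, max_val], as done before computing metrics."""
    return [min(max(v, 0.0), max_val) for v in values]


def mae(pred, target):
    """Mean absolute error between two equally sized flat images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


pred = clamp([0.5, 0.5, 1.2])    # the out-of-range value is clamped to 1.0
target = clamp([0.4, 0.6, 1.0])
print(mae(pred, target), psnr(pred, target))
```

Because both evaluation paths call the same functions, a difference in reported numbers can only come from the predictions themselves.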


## Quantitative Results on Sen2-MTC

We compare against the state-of-the-art methods reported in **Liu et al. (2025)** on the Sen2-MTC benchmark.
- PSNR: higher = more accurate pixel reconstruction.
- SSIM: higher = better structural preservation.
- LPIPS: lower = closer perceptual similarity to the ground truth.
- MAE: lower = smaller absolute pixel error.

**Non-diffusion methods:**

| Method              | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---------------------|-------:|-------:|--------:|
| McGAN               | 17.448 | 0.513  | 0.447   |
| Pix2Pix             | 16.985 | 0.455  | 0.535   |
| AE                  | 15.100 | 0.441  | 0.602   |
| STNet               | 16.206 | 0.427  | 0.503   |
| DSen2-CR            | 16.827 | 0.534  | 0.446   |
| STGAN               | 18.152 | 0.587  | 0.513   |
| CTGAN               | 18.308 | 0.609  | 0.384   |
| SEN12MS-CR-TS Net   | 18.585 | 0.615  | 0.342   |
| PMAA                | 18.369 | 0.614  | 0.392   |
| UnCRtainTS          | 18.770 | 0.631  | 0.333   |

**Diffusion-based methods:**

| Method              | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---------------------|-------:|-------:|--------:|
| DDPM-CR             | 18.742 | 0.614  | 0.329   |
| DiffCR              | 19.150 | 0.671  | 0.291   |
| EMRDM (Liu et al.)  | 20.067 | 0.709  | 0.255   |
| **Ours**            | **22.695** | **0.888** | **0.100** |

Our model also achieves **MAE = 0.017** on the same test data.

Our PSNR exceeds EMRDM's by 2.63 dB (22.695 vs. 20.067). For comparison, EMRDM improves on DiffCR by 0.92 dB.

Our SSIM is 0.888, while the previous best (EMRDM) is 0.709.
This indicates better structural fidelity: the model preserves edges, textures, and object geometry more faithfully.

Our LPIPS drops to 0.100, compared with 0.255 for EMRDM.
Since LPIPS measures perceptual similarity, the reconstructions are visually closer to the clean target, not just numerically better.

> **Conclusion:** On the Sen2-MTC benchmark, our encoder-conditioned diffusion model **outperforms all previously reported methods** across PSNR, SSIM, and LPIPS, while maintaining a very low MAE, demonstrating strong quantitative improvements in cloud removal quality.


# 5. Summary

Our system performs cloud removal by:
- Taking three cloudy satellite images,
- Encoding their shared and complementary information through a cloud encoder, and
- Applying a guided diffusion denoising process that reconstructs a high-quality, cloud-free image.

This multi-temporal conditioning allows the model to selectively emphasize clearer observations and suppress cloud-related artifacts across frames.

Our work contributes an **encoder-conditioned diffusion model** that eliminates the need for **cloud masks** or **specialized annotations** and achieves **state-of-the-art performance** on the Sen2-MTC benchmark across PSNR, SSIM, and LPIPS. 

Although it would benefit from further training on other datasets to mitigate distribution shift, our architecture is:
- Efficient, modular, and generalizable
- Suitable not only for cloud removal but also adaptable to a range of downstream remote-sensing tasks; one such task is described in the next section.


# Downstream Task: Evaluating Cloud Removal via DEM Height Consistency

## 1. Purpose of this Task

This downstream experiment evaluates whether our diffusion-based cloud-removal model preserves and restores **terrain-relevant information** in satellite RGB imagery. We measure this by feeding images into a pretrained **ImageToDEM** network (Panagiotou et al., 2020), which predicts relative Digital Elevation Models (DEMs) from single RGB inputs.

Each sample contains three RGB images:

- **cloudy** — original cloud-covered RGB patch  
- **pred** — cloud-removed RGB produced by our diffusion model  
- **clean** — cloud-free RGB image (dataset ground truth)  

We generate three DEMs with the same frozen generator $G$:

- $D_{\text{cloudy}} = G(\text{cloudy})$  
- $D_{\text{pred}}   = G(\text{pred})$  
- $D_{\text{clean}}  = G(\text{clean})$  (baseline reference)



![Pipeline Overview](Inputs.png)

### Experimental Objective

If cloud removal restores spectral–spatial cues linked to terrain geometry, then $D_{\text{pred}}$ should be much closer to $D_{\text{clean}}$ than $D_{\text{cloudy}}$ is.

DEM similarity therefore becomes a quantitative proxy for **terrain-information recovery**.


---

## 2. Input Preprocessing Pipeline  

The ImageToDEM model expects **$256\times256$ RGB** inputs normalized to **$[-1,1]$**.  
Our patches are **$128\times128$ with 4 channels** (RGB + NIR), so we apply the following preprocessing steps.



### 2.1 Remove the NIR channel

Keep only RGB:

$\text{rgb} = \text{patch}[:3] \quad (3,128,128)$


### 2.2 Convert input to the valid $[0,255]$ range

Depending on the input's value range:

- If in $[0,1]$: $x = x \cdot 255$  
- If in $[-1,1]$: $x = (x+1)\cdot 0.5 \cdot 255$  
- If already in $[0,255]$: unchanged  

Then clamp to valid range:

$x = \mathrm{clamp}(x,\;0,255)$

This mimics Sentinel-2 TCI preprocessing.

### 2.3 Resize to $256\times256$

The U-Net inside ImageToDEM requires $256\times256$ inputs:

$x_{\text{resized}} = \mathrm{bilinear}(x,\;256\times256)$

Even though original patches are $128\times128$, we explicitly upsample them before inference.


### 2.4 Normalize to $[-1,1]$

Final normalization used by Pix2Pix and ImageToDEM:

$x_{\text{norm}} = \dfrac{x_{\text{resized}}}{127.5} - 1$

Which gives:

- $0 \to -1$  
- $127.5 \to 0$  
- $255 \to +1$
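Steps 2.2 and 2.4 can be sketched on a flat list of pixel values as below. Detecting the input range from the minimum and maximum is an assumption about how the range check is implemented, and the function names are hypothetical. The bilinear resize of step 2.3 is left out to keep the sketch dependency-free.

```python
def to_0_255(values):
    """Map values from [0, 1] or [-1, 1] to [0, 255], then clamp (step 2.2).

    The range is inferred from min/max (an assumption); an image whose
    values all lie in [0, 1] is always treated as a [0, 1] image.
    """
    lo, hi = min(values), max(values)
    if lo >= 0.0 and hi <= 1.0:
        scaled = [v * 255.0 for v in values]
    elif lo >= -1.0 and hi <= 1.0:
        scaled = [(v + 1.0) * 0.5 * 255.0 for v in values]
    else:
        scaled = list(values)                     # assumed to be in [0, 255] already
    return [min(max(v, 0.0), 255.0) for v in scaled]


def normalize(values):
    """Map [0, 255] to [-1, 1] via x / 127.5 - 1 (step 2.4)."""
    return [v / 127.5 - 1.0 for v in values]


pixels = to_0_255([-1.0, 0.0, 1.0])   # detected as a [-1, 1] input
print(pixels, normalize(pixels))
```

Going through the common $[0,255]$ range first means the final normalization is identical regardless of how the upstream tensors were scaled.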

### 2.5 DEM inference

The input is permuted to TensorFlow's channels-last layout $(H,W,C)$ and fed into $G$:

$D = G(x_{\text{norm}})$

---


![Pipeline Overview](Downstream_Architecture.png)


---
## 3. Model Architecture in the Downstream Task

The DEM generator $G$ is the ImageToDEM model from Panagiotou et al. (2020):

- U-Net encoder–decoder backbone with skip connections  
- trained in a **conditional GAN (cGAN)** framework:
  - generator: RGB $\rightarrow$ DEM  
  - discriminator: enforces spatial realism and height-structure coherence  

Key properties:

- Outputs **relative elevation fields**, not absolute heights in meters  
- Captures terrain patterns such as slopes, ridges, and valleys  
- Is **frozen** during our downstream experiment

Because $G$ is frozen, any change in DEM quality is entirely due to the quality of the RGB images produced by our diffusion-based cloud removal.

---

## 4. DEM Similarity Metric and Improvement Formula

Let $D_1$ and $D_2$ be DEMs of size $H \times W$. We use **Mean Absolute Error (MAE)**:

$\mathrm{MAE}(D_1, D_2) = \dfrac{1}{HW} \sum_{i,j} \bigl| D_1(i,j) - D_2(i,j) \bigr|$

We evaluate:

- $\mathrm{MAE}_{\text{cloudy}\rightarrow\text{clean}} = \mathrm{MAE}(D_{\text{cloudy}}, D_{\text{clean}})$  
- $\mathrm{MAE}_{\text{pred}\rightarrow\text{clean}}   = \mathrm{MAE}(D_{\text{pred}},   D_{\text{clean}})$  

The relative improvement from cloud removal is

$\text{Improvement} = 1 - \dfrac{\mathrm{MAE}_{\text{pred}\rightarrow\text{clean}}}{\mathrm{MAE}_{\text{cloudy}\rightarrow\text{clean}}}$

This quantity measures the fraction of DEM error (relative to the clean baseline) that is removed by our diffusion model.
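A short sketch of the metric and the improvement formula, using 2-D lists as stand-ins for DEM arrays. The function names are hypothetical. The two numeric checks reuse the MAE values reported in the results table of this document.

```python
def dem_mae(d1, d2):
    """Mean absolute error between two equally sized DEMs (2-D lists)."""
    total = sum(abs(a - b) for r1, r2 in zip(d1, d2) for a, b in zip(r1, r2))
    return total / (len(d1) * len(d1[0]))


def improvement(mae_pred, mae_cloudy):
    """Fraction of the cloudy-to-clean DEM error removed by cloud removal."""
    return 1.0 - mae_pred / mae_cloudy


# Toy 2 x 2 DEMs, made up for illustration only.
d_clean = [[0.0, 1.0], [2.0, 3.0]]
d_pred = [[0.1, 1.0], [2.0, 2.9]]
print(dem_mae(d_pred, d_clean))

# Reported MAE values (validation and test splits).
val_gain = improvement(0.071100, 0.215648)
test_gain = improvement(0.074932, 0.217131)
print(f"{val_gain:.1%}", f"{test_gain:.1%}")
```

An improvement of 0 would mean the prediction is no better than the cloudy input, and 1 would mean its DEM matches the clean reference exactly.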

---

## 5. Results

**Visualization example from the test split:**

![Downstream outcome on the test split](Downstream_Outcome.png)

**Visualization example from the validation split:**

![Downstream outcome on the validation split](Downstream_Val_Output.png)

### Downstream DEM Comparison Table (2050 Input Groups per Split)

| Dataset Split | MAE (Cloudy → Clean) ↓ | MAE (Pred → Clean) ↓ | Improvement ↑ |
|--------------|-------------------------|-----------------------|----------------|
| Validation    | 0.215648               | 0.071100              | **67.0%**      |
| Test          | 0.217131               | 0.074932              | **65.5%**      |

These values come directly from our downstream evaluation pipeline.

---

## 6. Interpretation

### DEM behavior

- $D_{\text{cloudy}}$ often collapses into a flat or noisy height field because clouds obscure the spectral patterns required for elevation estimation.  
- $D_{\text{pred}}$ recovers gradient structure and ridge–valley morphology that closely resembles $D_{\text{clean}}$.  
- This indicates that our diffusion-based cloud removal reintroduces terrain-consistent cues that the DEM model can exploit.

### Scientific meaning

Because $G$ is frozen:

- Any reduction in $\mathrm{MAE}_{\text{pred}\rightarrow\text{clean}}$ compared with $\mathrm{MAE}_{\text{cloudy}\rightarrow\text{clean}}$ must come from improved RGB inputs.  
- A **65–67% reduction** in DEM discrepancy suggests that our model recovers most of the terrain information that was lost to cloud cover, as measured by this proxy.

### Final Conclusion

Our diffusion-based cloud-removal model reduces the DEM discrepancy to the clean reference by approximately **65–67%**, indicating that most of the terrain information lost under clouds is restored. The cleaned images produced by our method are therefore significantly more useful for downstream DEM estimation and geospatial analysis than the original cloudy imagery.
