# Day 1: Foundations & Data Pipeline

**Goals:**
- Build a deep understanding of autoencoder theory
- Understand SAR physics and speckle statistics
- Audit and verify your preprocessing pipeline

**Time:** 6 hours

**Approach:** This notebook guides you through exercises with instructions. Write all code yourself. If stuck, refer to the reference notebooks or your existing `src/` modules.

---

## Setup

Import the libraries you'll need:
- `numpy` for numerical operations
- `matplotlib.pyplot` for plotting
- `scipy.stats` for statistical distributions
- `scipy.ndimage` for image filtering

Also add your `src/` directory to the Python path so you can import your modules.

In [None]:
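import sys
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import scipy.ndimage as ndimage

# Assumes src/ sits next to this notebook; adjust if your layout differs.
sys.path.insert(0, "src")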

Now try importing your preprocessing and dataset modules. Print a success message for each, or catch the ImportError and print what went wrong.
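
If you want something to compare against after your own attempt, the guarded import can be sketched as below. The module names `preprocessing` and `dataset` are hypothetical placeholders; substitute whatever your `src/` directory actually contains.

```python
import importlib

# Hypothetical module names; replace with the modules in your own src/ directory.
MODULE_NAMES = ["preprocessing", "dataset"]


def try_import(name):
    """Return the imported module, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError as err:
        # Report the reason so a path or dependency problem is easy to spot.
        print(f"FAILED to import {name!r}: {err}")
        return None
    print(f"OK: imported {name!r}")
    return module


modules = {name: try_import(name) for name in MODULE_NAMES}
```

Using `importlib.import_module` keeps the check in one loop; plain `try: import ... except ImportError` blocks per module work just as well.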

In [None]:
# Import your data modules here


---

# Part 1: Theory Questions (1.5 hours)

Answer these questions **in writing** before moving to the coding exercises. Writing forces you to articulate your understanding clearly.

---

## Q1.1: Information Bottleneck

Your autoencoder compresses 256×256×1 images (65,536 values) to a 16×16×64 latent space (16,384 values).

**a)** Why does this dimensional reduction force the network to learn useful features rather than simply memorizing inputs?

**b)** Natural images typically have 2-4 bits of entropy per pixel when spatial correlations are accounted for. If we assume 3 bits/pixel, what is the total information content of a 256×256 input? Compare this to the capacity of your latent space (assuming 32-bit floats). Is your bottleneck actually forcing compression?

**c)** How would you determine experimentally whether your bottleneck is too tight (losing important information) versus too loose (not learning efficient representations)?
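
Once you have worked part b) by hand, a few lines of arithmetic can confirm your numbers. This is only a check of the bit counts under the assumptions stated in the question (3 bits/pixel, 32-bit floats); the interpretation is still yours to write.

```python
# Input: 256 x 256 x 1 image at an assumed 3 bits of entropy per pixel.
input_values = 256 * 256 * 1
input_bits = input_values * 3

# Latent: 16 x 16 x 64 activations stored as 32-bit floats.
latent_values = 16 * 16 * 64
latent_bits = latent_values * 32

print(f"input:  {input_values} values, {input_bits} bits of information")
print(f"latent: {latent_values} values, {latent_bits} bits of raw capacity")
print(f"capacity / information = {latent_bits / input_bits:.2f}")
```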

### Your Answer:

**a)**


**b)**


**c)**


## Q1.2: SAR Physics

For Sentinel-1 C-band SAR, predict the relative brightness (dark / medium / bright / very bright) and explain the physical scattering mechanism for each scenario:

**a)** Calm lake water at 35° incidence angle

**b)** The same lake with 30 cm wind-driven waves

**c)** A freshly plowed, dry agricultural field

**d)** The same field after heavy rain

**e)** Dense coniferous forest

**f)** A metal bridge crossing a river (think about what happens when radar hits a corner reflector geometry)

### Your Answer:

**a)**

**b)**

**c)**

**d)**

**e)**

**f)**


## Q1.3: Speckle Statistics

Sentinel-1 GRD products have an equivalent number of looks (ENL) of approximately 4.4.

**a)** What does "4.4 looks" mean physically? How is ENL related to the trade-off between spatial resolution and radiometric resolution?

**b)** For a homogeneous region with L looks, the coefficient of variation (CV = std/mean) of intensity follows a specific formula. What is CV for L=4.4? Show your calculation.

**c)** Suppose your autoencoder's reconstruction has CV=0.35 in homogeneous regions, but the input had CV=0.48. What does this mean? Is it desirable? When might it be problematic?

**d)** How could you distinguish between beneficial speckle reduction and harmful texture/detail smoothing?
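
To check your calculation for part b), the sketch below uses the standard relation for multi-look intensity over a homogeneous region, CV = 1/√L (equivalently ENL = 1/CV²). The helper names are illustrative, not part of any library.

```python
import math


def expected_cv(looks):
    """Theoretical CV of multi-look intensity speckle: 1 / sqrt(L)."""
    return 1.0 / math.sqrt(looks)


def implied_enl(cv):
    """Invert the relation: the ENL that would produce a given CV."""
    return 1.0 / cv**2


cv_input = expected_cv(4.4)
enl_recon = implied_enl(0.35)
print(f"CV for L=4.4: {cv_input:.3f}")
print(f"ENL implied by CV=0.35: {enl_recon:.2f}")
```

Comparing the ENL implied by the reconstruction with the ENL of the input gives one number to reason about in part c).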

### Your Answer:

**a)**

**b)**

**c)**

**d)**


## Q1.4: Preprocessing Rationale

A typical SAR preprocessing pipeline consists of:
1. Log transform: dB = 10 × log₁₀(intensity)
2. Clip to range [-25, +5] dB
3. Normalize to [0, 1]

**a)** Why apply a log transform before feeding SAR data to a neural network? Think about the dynamic range and distribution of SAR backscatter values.

**b)** What physical features or surfaces would be clipped at -25 dB? What about at +5 dB?

**c)** A colleague suggests normalizing each image independently to [0, 1] based on its own min/max. Why is this problematic for training a neural network?

**d)** When running inference on new data, what normalization parameters should you use?

### Your Answer:

**a)**

**b)**

**c)**

**d)**


## Q1.5: Loss Function Choice

**a)** Why does MSE loss alone tend to produce blurry reconstructions? Hint: think about what the optimal prediction is when there's uncertainty about the exact pixel value.

**b)** SSIM measures structural similarity. What specific image properties does it capture that MSE ignores?

**c)** For SAR images with inherent speckle noise, is preserving exact pixel values actually important? How should this influence your choice of loss function?

### Your Answer:

**a)**

**b)**

**c)**


---

# Part 2: Preprocessing Audit (2 hours)

Now you'll test your preprocessing implementation to make sure it handles edge cases correctly.

---

## Exercise 1.1: Test Invalid Value Handling

SAR data can contain problematic values that will break a naive log transform:
- Zeros (log(0) = -∞)
- Negative values (shouldn't exist but sometimes do due to processing artifacts)
- NaN and Inf values

**Your task:**

1. Create a small test array (e.g., 3×3) containing: a normal positive value, zero, a negative value, NaN, Inf, a very small positive value (1e-10), and a very large value (1e10). Fill any remaining cells with ordinary positive values.

2. Run this array through your preprocessing function(s).

3. Verify that the output:
   - Contains no NaN or Inf values (use `np.isfinite()`)
   - Contains no negative values
   - Is in the expected range [0, 1]

4. If any test fails, identify what went wrong and fix your preprocessing code.
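
For comparison with your own implementation, here is a minimal sketch of the test. The `preprocess` function is a stand-in, not your `src/` code, and its policy for invalid pixels (NaN, zeros, and negatives are floored to `1e-10`, +Inf is treated as a very large value) is an assumption; masking invalid pixels instead is an equally reasonable design.

```python
import numpy as np


def preprocess(intensity, vmin=-25.0, vmax=5.0, floor=1e-10):
    """Stand-in pipeline: linear intensity -> dB -> clip -> [0, 1]."""
    x = np.asarray(intensity, dtype=np.float64)
    # Assumed policy: NaN and -Inf become the floor, +Inf a huge finite value.
    x = np.nan_to_num(x, nan=floor, posinf=np.finfo(np.float64).max, neginf=floor)
    # Zeros and negatives are raised to the floor so log10 stays finite.
    x = np.maximum(x, floor)
    db = 10.0 * np.log10(x)
    return (np.clip(db, vmin, vmax) - vmin) / (vmax - vmin)


test = np.array([[1.0, 0.0, -0.5],
                 [np.nan, np.inf, 1e-10],
                 [1e10, 0.5, 2.0]])
out = preprocess(test)

assert np.isfinite(out).all(), "output contains NaN or Inf"
assert (out >= 0).all(), "output contains negative values"
assert (out <= 1).all(), "output exceeds 1"
print(out)
```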

In [None]:
# Create your test array with problematic values


In [None]:
# Run through your preprocessing and check results


## Exercise 1.2: Verify dB Conversion

The dB transform is: dB = 10 × log₁₀(intensity)

**Your task:**

1. Create a list of test cases with known input/output pairs:
   - intensity=1.0 → 0 dB
   - intensity=10.0 → 10 dB
   - intensity=0.1 → -10 dB
   - intensity=0.01 → -20 dB

2. Test your `to_db()` function (or equivalent) against these known values.

3. Assert that your function produces the correct results within a small tolerance.
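
One way to lay out the check is a table of `(intensity, expected_dB)` pairs looped over with a tolerance. The `to_db` defined here is only a placeholder so the sketch runs on its own; swap in the function from your own module.

```python
import numpy as np


def to_db(intensity):
    """Placeholder for your own conversion: dB = 10 * log10(intensity)."""
    return 10.0 * np.log10(intensity)


# Known input/output pairs from the exercise.
cases = [(1.0, 0.0), (10.0, 10.0), (0.1, -10.0), (0.01, -20.0)]

for intensity, expected in cases:
    actual = to_db(intensity)
    # An absolute tolerance is needed because one expected value is exactly 0.
    assert np.isclose(actual, expected, atol=1e-9), (intensity, actual, expected)

print("all dB test cases passed")
```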

In [None]:
# Define test cases and verify your dB conversion


## Exercise 1.3: Verify Normalization

Your normalization should:
1. Clip values to [vmin, vmax] (e.g., [-25, +5] dB)
2. Scale to [0, 1]: normalized = (clipped - vmin) / (vmax - vmin)

**Your task:**

1. Create a test array in dB with values that span below vmin, within range, and above vmax. For example:
   ```
   [[-30, -25, -20],
    [-10,   0,   5],
    [  5,  10,  15]]
   ```

2. Calculate by hand what the expected normalized output should be.

3. Run your normalization function and compare to your expected values.

4. Verify that all outputs are in [0, 1].
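
As a reference for steps 2 and 3, the sketch below applies the clip-and-scale formula to the example array with vmin = -25 and vmax = +5, and compares against values worked out by hand. `normalize_db` is an illustrative name, not your module's function.

```python
import numpy as np


def normalize_db(db, vmin=-25.0, vmax=5.0):
    """Clip to [vmin, vmax], then scale linearly to [0, 1]."""
    return (np.clip(db, vmin, vmax) - vmin) / (vmax - vmin)


db = np.array([[-30.0, -25.0, -20.0],
               [-10.0,   0.0,   5.0],
               [  5.0,  10.0,  15.0]])

# Hand calculation: (-20 + 25) / 30 = 1/6, (-10 + 25) / 30 = 1/2, (0 + 25) / 30 = 5/6.
expected = np.array([[0.0, 0.0, 1 / 6],
                     [0.5, 5 / 6, 1.0],
                     [1.0, 1.0, 1.0]])

result = normalize_db(db)
assert np.allclose(result, expected)
assert result.min() >= 0.0 and result.max() <= 1.0
```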

In [None]:
# Test your normalization function


## Exercise 1.4: Roundtrip Test

A critical test: if you preprocess data and then invert the preprocessing, you should recover the original values (for values that weren't clipped).

**Your task:**

1. Generate synthetic SAR-like data using a gamma distribution:
   ```python
   original = np.random.gamma(shape=4.4, scale=0.1, size=(64, 64))
   ```
   This simulates intensity data with ~4.4 looks.

2. Run the data through your complete preprocessing pipeline. Make sure to save any parameters needed for inversion (vmin, vmax, etc.).

3. Invert the preprocessing to get back to linear intensity.

4. For pixels that weren't clipped, calculate the relative error between original and reconstructed values. It should be very small (<0.1%).

**Hint:** To find non-clipped pixels, convert original to dB and check which values fall within [vmin, vmax].
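
A minimal sketch of the roundtrip, assuming a pipeline of dB conversion, clipping to [-25, +5] dB, and scaling to [0, 1]. The `forward` and `inverse` functions are stand-ins for your own code, and a seeded `np.random.default_rng` replaces `np.random.gamma` purely so the run is reproducible.

```python
import numpy as np

VMIN, VMAX = -25.0, 5.0  # clipping range in dB, saved for the inversion


def forward(intensity):
    """Linear intensity -> dB -> clip -> [0, 1]."""
    db = 10.0 * np.log10(intensity)
    return (np.clip(db, VMIN, VMAX) - VMIN) / (VMAX - VMIN)


def inverse(normalized):
    """[0, 1] -> dB -> linear intensity."""
    db = normalized * (VMAX - VMIN) + VMIN
    return 10.0 ** (db / 10.0)


rng = np.random.default_rng(0)
original = rng.gamma(shape=4.4, scale=0.1, size=(64, 64))
recovered = inverse(forward(original))

# Only pixels inside [VMIN, VMAX] can be recovered exactly.
original_db = 10.0 * np.log10(original)
unclipped = (original_db >= VMIN) & (original_db <= VMAX)

rel_err = np.abs(recovered[unclipped] - original[unclipped]) / original[unclipped]
print(f"unclipped pixels: {unclipped.sum()} of {original.size}")
print(f"max relative error: {rel_err.max():.2e}")
assert rel_err.max() < 1e-3  # the <0.1% target from the exercise
```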

In [None]:
# Generate synthetic data


In [None]:
# Roundtrip test: preprocess -> inverse -> compare


---

# Part 3: Speckle Analysis (1.5 hours)

Now you'll implement and test functions to analyze speckle statistics.

---

## Exercise 1.5: Implement Local CV Computation

The coefficient of variation (CV = std/mean) measured locally tells you about speckle characteristics.

**Your task:**

Write a function `compute_local_cv(image, window_size)` that:

1. Computes local mean using `scipy.ndimage.uniform_filter`
2. Computes local variance using the identity: Var(X) = E[X²] - (E[X])²
   - Local variance = uniform_filter(image²) - (uniform_filter(image))²
3. Computes CV = local_std / local_mean
4. Handles edge cases (avoid division by zero, and clamp slightly negative variances caused by floating-point error to zero before taking the square root)

Test it on your synthetic gamma data and visualize the CV map.
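
The four steps above could be sketched as follows, for comparison with your own version. The default `window_size=7` and the `eps` guard are assumptions chosen for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def compute_local_cv(image, window_size=7, eps=1e-12):
    """Local coefficient of variation (std / mean) in a sliding window."""
    image = np.asarray(image, dtype=np.float64)
    local_mean = uniform_filter(image, size=window_size)
    local_mean_sq = uniform_filter(image**2, size=window_size)
    # Var(X) = E[X^2] - (E[X])^2; clamp tiny negatives from rounding error.
    local_var = np.maximum(local_mean_sq - local_mean**2, 0.0)
    local_std = np.sqrt(local_var)
    # Where the local mean is ~0 the CV is undefined; report 0 there.
    safe_mean = np.maximum(local_mean, eps)
    return np.where(local_mean > eps, local_std / safe_mean, 0.0)


rng = np.random.default_rng(0)
synthetic = rng.gamma(shape=4.4, scale=0.1, size=(64, 64))
cv_map = compute_local_cv(synthetic)
print(f"median local CV: {np.median(cv_map):.3f}")
```

The resulting `cv_map` has the same shape as the input, so it can be passed straight to an image plot. On homogeneous synthetic speckle the values should cluster around the theoretical CV for the chosen number of looks.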

In [None]:
# Implement compute_local_cv function


In [None]:
# Test on synthetic data and visualize


## Exercise 1.6: Implement ENL Estimation

The Equivalent Number of Looks can be estimated from intensity data over a homogeneous region as:

ENL = (mean / std)² = 1 / CV²

**Your task:**

Write a function `estimate_enl(image)` that:
1. Filters out zero or negative values
2. Computes the global mean and standard deviation
3. Returns the ENL estimate

Test it on your synthetic gamma(4.4, 0.1) data. The estimated ENL should be close to 4.4.
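
A possible shape for the estimator is shown below. It also drops non-finite values, which goes slightly beyond the three steps listed, and it uses the sample standard deviation (`ddof=1`); both are choices made for this sketch.

```python
import numpy as np


def estimate_enl(image):
    """Estimate ENL = (mean / std)^2 from the positive, finite pixels."""
    values = np.asarray(image, dtype=np.float64).ravel()
    values = values[np.isfinite(values) & (values > 0)]
    if values.size < 2:
        return float("nan")  # not enough data for a standard deviation
    std = values.std(ddof=1)
    if std == 0:
        return float("inf")  # perfectly constant region
    return float((values.mean() / std) ** 2)


rng = np.random.default_rng(0)
synthetic = rng.gamma(shape=4.4, scale=0.1, size=(64, 64))
enl = estimate_enl(synthetic)
print(f"estimated ENL: {enl:.2f} (expected about 4.4)")
```

The estimate is only meaningful over a homogeneous region: any real scene structure inflates the standard deviation and pulls the ENL down.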

In [None]:
# Implement estimate_enl function and test


## Exercise 1.7: Verify Speckle Distribution

For fully developed speckle with L looks, normalized intensity (I/mean) follows a gamma distribution with shape L and scale 1/L, written Gamma(L, 1/L), so its mean is 1.

**Your task:**

1. Normalize your synthetic data by dividing by its mean.

2. Fit a gamma distribution to the normalized data using `scipy.stats.gamma.fit()`. Use `floc=0` to fix the location parameter at zero.

3. Create a plot showing:
   - Histogram of your normalized data
   - Theoretical gamma PDF for L=4.4
   - Fitted gamma PDF

4. Compare the fitted shape parameter to the expected value of 4.4.
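
The fitting part of this exercise might look like the sketch below. It stops short of plotting: `x`, `theoretical_pdf`, and `fitted_pdf` are the curves to overlay on a density-normalized histogram of `normalized`. The seeded generator and the 200-point grid are arbitrary choices.

```python
import numpy as np
from scipy import stats

L = 4.4  # expected number of looks

rng = np.random.default_rng(0)
intensity = rng.gamma(shape=L, scale=0.1, size=(64, 64))

# Step 1: normalize by the mean so the data has unit mean.
normalized = (intensity / intensity.mean()).ravel()

# Step 2: maximum-likelihood gamma fit with the location fixed at zero.
fitted_shape, fitted_loc, fitted_scale = stats.gamma.fit(normalized, floc=0)

# Step 3: evaluate the theoretical and fitted PDFs on a common grid.
x = np.linspace(0.01, normalized.max(), 200)
theoretical_pdf = stats.gamma.pdf(x, a=L, scale=1.0 / L)
fitted_pdf = stats.gamma.pdf(x, a=fitted_shape, loc=fitted_loc, scale=fitted_scale)

# Step 4: compare the fitted shape with the expected value.
print(f"fitted shape: {fitted_shape:.2f} (expected {L})")
print(f"fitted scale: {fitted_scale:.3f} (expected {1.0 / L:.3f})")
```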

In [None]:
# Fit gamma distribution and create comparison plot


---

# Part 4: Analyze Your Actual Dataset (if available)

If you have preprocessed patches saved, analyze their statistics.

---

## Exercise 1.8: Dataset Statistics

**Your task:**

1. Load your preprocessed training patches (update the path as needed).

2. Compute and print basic statistics: shape, min, max, mean, std.

3. Check for problems:
   - Any values < 0?
   - Any values > 1?
   - Any NaN or Inf?

4. Plot a histogram of all pixel values.

5. If you find any issues, trace them back to your preprocessing pipeline.
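
Steps 2 and 3 can be bundled into one helper, sketched below. The function name and the returned keys are illustrative, and the random array only stands in for real data: in practice `patches` would be loaded from your own file, with the path left for you to fill in.

```python
import numpy as np


def summarize_patches(patches):
    """Basic statistics and sanity checks for patches expected in [0, 1]."""
    patches = np.asarray(patches)
    finite = np.isfinite(patches)
    valid = patches[finite]  # statistics are computed on finite values only
    return {
        "shape": patches.shape,
        "min": float(valid.min()),
        "max": float(valid.max()),
        "mean": float(valid.mean()),
        "std": float(valid.std()),
        "n_below_0": int((valid < 0).sum()),
        "n_above_1": int((valid > 1).sum()),
        "n_nonfinite": int((~finite).sum()),
    }


# Stand-in data; replace with your own preprocessed patches.
rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(8, 64, 64)).astype(np.float32)

summary = summarize_patches(patches)
for key, value in summary.items():
    print(f"{key:12s} {value}")
```

Any non-zero count in the last three entries points back to a specific stage: out-of-range values suggest a clipping or scaling bug, while non-finite values usually originate in the log transform.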

In [None]:
# Load and analyze your patches


---

# Day 1 Checklist

Before moving to Day 2, verify:

- [ ] Answered all theory questions (Q1.1 - Q1.5) thoughtfully
- [ ] Invalid value handling works correctly
- [ ] dB conversion matches expected values
- [ ] Normalization produces [0, 1] output with correct clipping
- [ ] Roundtrip preprocessing test passes
- [ ] Local CV computation implemented and tested
- [ ] ENL estimation gives reasonable results (~4.4 for synthetic data)
- [ ] Speckle distribution matches gamma distribution
- [ ] Dataset statistics look reasonable (if applicable)

---

## Notes and Issues Found

*Document any bugs you found and fixed, or insights you gained:*

1. 

2. 

3. 