<div style="position: relative; display: inline-block;">
  <img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/intro.png" width="13000">
  
  <!-- Title (Top-Center) -->
  <div style="position: absolute; top: 20px; left: 15%;
              transform: translateX(-50%);
              background-color: rgba(0, 0, 0, 0.5); 
              color: white; padding: 10px 20px; 
              font-size: 48px; font-weight: bold;">
    Deep Learning in Remote Sensing Workshop
      UCSB, Summer 2025
  </div>
  
  <!-- Author (Bottom-Left) -->
  <div style="position: absolute; bottom: 20px; left: 20px; 
              background-color: rgba(0, 0, 0, 0.6); 
              color: white; padding: 8px 16px; font-size: 15px;">
    <a href="https://dl.acm.org/doi/abs/10.1145/2347736.2347755" 
       target="_blank" style="color: white; text-decoration: underline;">
      Pedro M. Domingos
    </a>
  </div>
</div>



## Workshop Agenda

---

**Day 1**

Morning Sessions

**Session 1: Introduction to Neural Networks**
- Structure and function of basic neural networks  
- Role of activation functions, loss functions, and optimization
- Generalization and regularization techniques

**Session 2: Supervised Learning in Practice**
- Overview of Convolutional Neural Networks (CNNs) and Transformers  
- Applications in classification and segmentation


---

Afternoon Sessions


**Session 1: Introduction to Self-Supervised Learning (SSL)**
- Self-Supervised Learning Strategies:
  - **Generative SSLs**: Diffusion, GANs, MAEs  
  - **Discriminative SSLs**: Contrastive Learning, Distillation


**Session 2: Hands-on**

- Building a U-Net
- Loading data and training our U-Net


---

**Day 2**

Morning Sessions


**Session 1: Transfer Learning and Generalization**
- Transfer learning and generalization capacity  
- Domain shift: causes, effects, and adaptation strategies

**Session 2: Hands-on** 
- Fine-tuning our U-Net
- Troubleshooting, Q&A, and brainstorming

---


## What is Machine Learning?

**Representation, Evaluation, and Optimization are the three main components of learning.**  
*Cited from* [Pedro M. Domingos](https://dl.acm.org/doi/abs/10.1145/2347736.2347755)

**Steps to approach a ML/DL problem**

1- Define the problem to be solved.

2- Collect data (both images and labels).

3- Choose an algorithm class (Architecture + design + cost function).

4- Choose an optimization method (optimizer) for learning the model.

5- Choose a metric for evaluating the model.


<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/representation_evaluation_optimization.png" width="50%">


*Image from* [link](https://medium.com/@devnag/machine-learning-representation-evaluation-optimization-fc7b26b38fdb)

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/DL_ingeneral.png" width="60%">


## Deep Learning Tasks in Remote Sensing Applications

**Image Classification**
- Scene classification (e.g., land use/land cover)
- Crop type classification

![RS_image_classification.jpg](attachment:5e19d578-da94-4003-85a3-12d6afd417af.jpg)

*Figure: Remote sensing image classification using Vision Transformers (Bazi et al., 2021).*


**Semantic Segmentation**
- Land cover segmentation
- Crop field boundary mapping
- Wetland/mangrove delineation
- Urban area segmentation

![ss.png](attachment:53c5f13b-4c49-48eb-a213-cdccada1b6ba.png)

*Image from* [link](https://www.ifp.uni-stuttgart.de/lehre/masterarbeiten/606-Suo/)

**Regression Tasks**
- Biomass estimation
- Soil moisture retrieval
- Soil organic carbon (SOC) prediction
- Surface temperature estimation
- Leaf area index (LAI) prediction

![Screenshot 2025-05-30 at 10.55.47 AM.png](attachment:ce3083c4-586a-4d8c-b75e-509e127cccd1.png)

**Object Detection**
- Building or house detection
- Vehicle and ship detection
- Tree crown or palm tree detection
- Road and infrastructure mapping

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/object_detection.png" width="60%">


*Figure: Object Detection in Remote Sensing Images Based on a Scene-Contextual Feature Pyramid Network (Chen et al., 2019).*


**Instance Segmentation**
- Individual tree or building footprint segmentation
- Agricultural field segmentation

<img src="https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/main/materials/figures/instance_seg.png" width="25%">



**Change Detection**
- Pre-/post-disaster analysis
- Urban growth monitoring
- Deforestation tracking
- Temporal crop status changes

**Super-Resolution**
- Enhancing spatial resolution of satellite imagery

**Image-to-Image Translation**
- Cloud removal
- Pansharpening
- SAR to optical (and vice versa) translation

**Forecasting / Nowcasting / Prediction**
- Soil moisture estimation
- Rainfall prediction
- Fire risk forecasting
- Crop yield and soil organic carbon prediction

Etc...


## Multilayer Perceptrons (MLPs) and Neural Network Components

![MLP.gif](attachment:00303041-ab09-469c-9e18-b389b4f256d7.gif)

**Why do we need activation functions?**

- Activation functions add nonlinearity, enabling the model to learn complex patterns beyond simple linear or monotonic relationships.

- Without them, stacked affine layers collapse into a single linear transformation, limiting the model's expressive power.

Common choices for the activation function $\sigma$ include ReLU and sigmoid.
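
The collapse of stacked affine layers can be checked numerically. The following is a minimal NumPy sketch; the layer sizes and random weights are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                      # 5 samples, 4 features (arbitrary sizes)
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two affine layers with no activation in between...
two_layers = (X @ W1 + b1) @ W2 + b2

# ...equal a single affine layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = X @ W_merged + b_merged
print("no activation, same output:", np.allclose(two_layers, one_layer))

# With a ReLU in between, the merged layer no longer reproduces the output.
with_relu = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
print("with ReLU, same output:", np.allclose(with_relu, one_layer))
```

Inserting the nonlinearity is what prevents the two layers from being merged into one.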

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/sigmoid.png" width="50%">

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/ReLu.png" width="50%">

I found this page useful for [more information on Activation Functions](https://encord.com/blog/activation-functions-neural-networks/)

### Forward pass

In the forward pass of an MLP, each hidden layer computes:  

  $\mathbf{H}_i = \sigma(\mathbf{H}_{i-1} \mathbf{W}^{(i)} + \mathbf{b}^{(i)})$, where $\mathbf{H}_0 = \mathbf{X}$

The output layer computes:  

  $\mathbf{O} = \phi(\mathbf{H}_L \mathbf{W}^{(L+1)} + \mathbf{b}^{(L+1)})$,  
  
  where $\phi$ is an optional activation (e.g., softmax for classification).

The loss is then calculated as:  

  $\mathcal{L} = \text{loss}(\mathbf{O}, \mathbf{y})$, where $\mathbf{y}$ is the true label.
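
The forward-pass equations above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the workshop's training code: ReLU as $\sigma$, softmax as $\phi$, cross-entropy as the loss, and the layer sizes are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)         # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(X, weights, biases):
    """Hidden layers use ReLU (sigma); the output layer uses softmax (phi)."""
    H = X                                        # H_0 = X
    for W, b in zip(weights[:-1], biases[:-1]):
        H = relu(H @ W + b)                      # H_i = sigma(H_{i-1} W^(i) + b^(i))
    return softmax(H @ weights[-1] + biases[-1]) # O = phi(H_L W^(L+1) + b^(L+1))

rng = np.random.default_rng(0)
sizes = [6, 16, 16, 3]                           # 6 inputs, two hidden layers, 3 classes
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

X = rng.normal(size=(4, 6))                      # batch of 4 samples
y = np.array([0, 2, 1, 2])                       # true class indices
O = mlp_forward(X, weights, biases)
loss = -np.log(O[np.arange(len(y)), y]).mean()   # cross-entropy loss
print(O.shape, loss)
```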


#### Loss/Cost Function (Objective Function)

- **Naive Bayes**: Maximize posterior probabilities  
- **Genetic Programming**: Maximize a fitness function  
- **Reinforcement Learning**: Maximize total reward/value function  
- **Decision Trees (Classification)**: Maximize information gain / minimize node impurity  
- **Regression Models**: Minimize mean squared error (MSE)  
- **Support Vector Machines (SVMs)**: Minimize hinge loss  
- **Neural Networks / Probabilistic Models**:  
  - Maximize log-likelihood  
  - Minimize cross-entropy loss
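
Three of the objectives listed above, written out as a minimal NumPy sketch. The function names and toy inputs are chosen for illustration only. For integer class labels, minimizing this cross-entropy is the same as maximizing the log-likelihood of the correct classes.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, the usual regression loss."""
    return np.mean((pred - target) ** 2)

def cross_entropy(probs, labels, eps=1e-12):
    """probs: (N, C) predicted class probabilities; labels: (N,) integer class indices."""
    picked = probs[np.arange(len(labels)), labels]   # probability assigned to the true class
    return -np.mean(np.log(picked + eps))            # negative mean log-likelihood

def hinge(scores, labels_pm1):
    """scores: (N,) raw classifier outputs; labels_pm1: (N,) labels in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - labels_pm1 * scores))

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0])))
print(cross_entropy(np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([0, 1])))
print(hinge(np.array([2.0, -0.5]), np.array([1, 1])))
```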

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/Loss.png" width="50%">


I found this page useful for [more information on Loss Functions](https://www.geeksforgeeks.org/loss-functions-in-deep-learning/)

### Backward pass


In the backward pass, gradients are computed in reverse using the chain rule. Starting with the loss gradient: 

  $\frac{\partial \mathcal{L}}{\partial \mathbf{O}}$  
  
  we compute:  
  
  $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L+1)}}, \ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(L+1)}}, \ \frac{\partial \mathcal{L}}{\partial \mathbf{H}_L}$,  
  
  and propagate gradients backward through each hidden layer:  
  
  $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(i)}}, \ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(i)}}$ 
  
  using the activation function's derivative:  
  
  $\frac{\partial \sigma}{\partial z}$
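
To make the chain rule concrete, here is a hedged sketch of manual backpropagation for a one-hidden-layer MLP, checked against a finite-difference estimate. The sigmoid activation, identity output, squared-error loss, and layer sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, y, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    H1 = sigmoid(Z1)
    O = H1 @ W2 + b2                              # identity output activation
    loss = 0.5 * np.mean(np.sum((O - y) ** 2, axis=1))
    return loss, (H1, O)

def backward(X, y, W2, cache):
    H1, O = cache
    dO = (O - y) / X.shape[0]                     # dL/dO
    dW2 = H1.T @ dO                               # dL/dW^(2)
    db2 = dO.sum(axis=0)                          # dL/db^(2)
    dH1 = dO @ W2.T                               # dL/dH_1
    dZ1 = dH1 * H1 * (1.0 - H1)                   # times sigma'(z) = sigma(z) * (1 - sigma(z))
    dW1 = X.T @ dZ1                               # dL/dW^(1)
    db1 = dZ1.sum(axis=0)                         # dL/db^(1)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=(8, 2))
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

loss, cache = forward(X, y, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, y, W2, cache)

# Finite-difference check on a single weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(X, y, W1_plus, b1, W2, b2)[0]
           - forward(X, y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(abs(numeric - dW1[0, 0]))                   # should be close to zero
```

In practice, frameworks compute these gradients automatically; the manual version is only meant to show where each term comes from.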

#### Optimization

An optimizer updates the model's parameters to move them toward a minimum of the loss function, so that predictions improve over iterations.

It uses the gradients computed during backpropagation and applies a specific update rule (e.g., SGD, Adam, AdamW) to adjust the weights. Optimizers can also incorporate momentum, adaptive learning rates, or weight decay to improve convergence and generalization.
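
As an illustration of an update rule, the sketch below implements one common formulation of SGD with momentum and optional weight decay, then applies it to a toy quadratic loss. The function name, hyperparameter values, and toy loss are assumptions for this example; libraries differ in the exact details.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=0.0):
    """One parameter update: v <- momentum * v + grad, then w <- w - lr * v."""
    grad = grad + weight_decay * w               # L2 penalty folded into the gradient
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Toy problem: loss(w) = 0.5 * ||w - target||^2, so the gradient is (w - target).
target = np.array([3.0, -2.0])
w = np.zeros(2)
velocity = np.zeros(2)
for _ in range(300):
    grad = w - target
    w, velocity = sgd_momentum_step(w, grad, velocity)
print(w)                                         # approaches the target
```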

![optimization.gif](attachment:6ffe9308-6c34-42f9-a82a-cace0522fb15.gif)

*Image from* [link](https://medium.com/@adrianoleao/optimizers-in-machine-learning-91388b9e176d)

[More information on optimizers](https://musstafa0804.medium.com/optimizers-in-deep-learning-7bf81fed78a0)

#### Learning Rate

The learning rate ($\eta$) controls the step size of weight updates during training. A high learning rate may cause overshooting or divergence, while a low one can make optimization slow or leave it stuck. It directly affects training speed and stability.
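
The effect of the step size can be seen on a one-dimensional toy problem. This sketch runs plain gradient descent on $f(w) = w^2$; the three learning rates are arbitrary values picked to show slow, fast, and divergent behaviour.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimize f(w) = w**2 starting from w0; returns the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                           # derivative of w**2
        w = w - lr * grad
    return w

for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: |w| after 50 steps = {abs(gradient_descent(lr)):.3e}")
```

With the smallest rate the iterate is still far from the minimum at 0 after 50 steps, the middle rate converges quickly, and the largest rate overshoots on every step so the iterate grows.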

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/learning%20rate.png" width="50%">


*Image from* [link](https://deeplearningmath.org/tricks-of-the-trade)

[More information on learning rate policy](https://medium.com/thedeephub/learning-rate-and-its-strategies-in-neural-network-training-270a91ea0e5c)

## Generalization and Regularization techniques

The trained parameters must generalize well, that is, give accurate predictions on unseen data rather than fitting noise or patterns specific to outlier samples.

![overfitting.jpeg](attachment:f406d5c2-9c02-4862-b3d6-1a2fc815f9af.jpeg)


*Image from* [link](https://www.engatica.com/blog/machine-learning-generalization?contentId=62df92a06f56fd786c8dfbaf)

Generalization depends on the inductive biases embedded in the model design.

Overfitting shows up as a large generalization gap: low training loss but high test loss.

### Regularization methods

These techniques aim to control overfitting, often by constraining model complexity. Some common regularization techniques are:

#### Dropout

Randomly deactivates a subset of neurons during training to reduce overfitting and improve generalization.
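
A minimal NumPy sketch of the widely used "inverted" dropout variant, in which the surviving activations are rescaled during training so that nothing needs to change at inference time. The function name and the drop probability are illustrative choices.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each activation with probability p and rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                                 # dropout is disabled at inference time
    keep_mask = rng.random(x.shape) >= p         # True with probability 1 - p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((1000, 100))                         # toy activations
h_train = dropout(h, p=0.5, rng=rng)
h_eval = dropout(h, p=0.5, rng=rng, training=False)
print("fraction dropped:", (h_train == 0).mean())
print("mean activation after rescaling:", h_train.mean())
```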


<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/dropout.png" width="60%">


#### Normalization

Adjusts input or intermediate activations (e.g., via BatchNorm) to stabilize and speed up learning.
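
For intuition, the sketch below shows the training-time computation of batch normalization over a batch of feature vectors. It is a simplified illustration: the running statistics that real implementations keep for inference are omitted, and `gamma` and `beta` stand in for the learnable scale and shift.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # batch of 256 samples, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))            # roughly 0 and 1 per feature
```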


#### Early Stopping

Stops training when performance on validation data stops improving, preventing overfitting.
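
The stopping rule can be sketched as a small helper that scans a sequence of validation losses. The function name, the `patience` and `min_delta` parameters, and the loss values are hypothetical and only serve the illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch      # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch             # stop; restore weights from best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)                        # 6 3
```

In a real training loop, the model checkpoint saved at `best_epoch` would typically be the one kept.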

#### Augmentation

Artificially increases the size and diversity of a dataset by applying transformations such as rotation, flipping, cropping, or noise addition to the original data.


**Geometric Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Flipping** | Flip images horizontally or vertically | Horizontal flip for Sentinel-2 imagery |
| **Rotation** | Rotate by fixed or random angles | 90°, 180°, or random rotation |
| **Scaling / Zooming** | Zoom in/out to simulate resolution or object size changes | Simulate UAV vs. satellite imagery |
| **Translation** | Shift image in x and/or y directions | Offset tile location slightly |
| **Cropping / Padding** | Random or center crop with optional padding | Extract 224×224 patches |
| **Elastic Deformation** | Apply non-rigid spatial warping | Simulate terrain distortions |
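
For segmentation tasks, geometric augmentations must be applied identically to the image and its label mask. Here is a minimal NumPy sketch of random flips and 90° rotations under that constraint; the function name and the toy data are assumptions for illustration.

```python
import numpy as np

def random_flip_rotate(image, mask, rng):
    """image: (H, W, C) array; mask: (H, W) label array. The same transform is applied to both."""
    if rng.random() < 0.5:                       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        image, mask = image[::-1], mask[::-1]
    k = int(rng.integers(0, 4))                  # rotate by k * 90 degrees
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
mask = rng.integers(0, 3, size=(8, 8))
image = np.stack([mask, mask], axis=-1).astype(float)   # toy image whose bands copy the mask
aug_image, aug_mask = random_flip_rotate(image, mask, rng)
print(np.array_equal(aug_image[..., 0], aug_mask))      # image and mask stay aligned
```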


**Radiometric / Spectral Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Brightness / Contrast Adjustment** | Mimic lighting or atmospheric variation | Darken image to simulate haze |
| **Gamma Correction** | Nonlinear tone adjustment | Increase gamma to brighten shadows |
| **Noise Injection** | Add Gaussian or speckle noise | Simulate low-quality sensor input |
| **Spectral Jitter** | Slightly shift band values | Adjust NIR band by ±1% reflectance |
| **Histogram Matching** | Match distributions across domains | Normalize Sentinel-2 to Landsat |


**Temporal and Environmental Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Temporal Dropout** | Drop one or more timestamps in a sequence | Omit February 2021 image |
| **Seasonal Jitter** | Shift temporal index to simulate seasonal mismatch | Swap June with May imagery |
| **Cloud Simulation** | Add synthetic clouds/shadows | Overlay random cloud masks |
| **Illumination Variation** | Adjust for solar angle or exposure | Vary brightness by time-of-day |


**Label-Preserving Geographic Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Random Patch Extraction** | Sample random chips from larger areas | 224×224 tile from a scene |
| **Context-Aware Sampling** | Ensure target class is present in the patch | Only use tiles with cropland |
| **Boundary-Preserving Crop** | Avoid cutting through critical features | Crop full field boundaries intact |


**Model-Based and Semantic Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Mixup / CutMix** | Combine multiple images and labels | Mix 70% Image A + 30% Image B |
| **GAN-based Augmentation** | Generate synthetic data using GANs | Create fake Sentinel-2 image |
| **Style Transfer** | Transform appearance to match another domain | Sentinel → Landsat style |
| **Physics-Informed Simulation** | Modify image using physical models | Simulate reflectance via PROSAIL |
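
The Mixup row of the table can be illustrated with a short sketch for image-level classification labels. The mixing weight is fixed at 0.7 here to mirror the table's example; in practice it is usually drawn at random (for example from a Beta distribution). The function name and toy arrays are illustrative.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two images and their one-hot labels with weight lam in [0, 1]."""
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y

x_a, y_a = np.ones((4, 4, 3)), np.array([1.0, 0.0, 0.0])    # image A, class 0
x_b, y_b = np.zeros((4, 4, 3)), np.array([0.0, 1.0, 0.0])   # image B, class 1
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.7)           # 70% image A + 30% image B
print(y_mix)
```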


**Multi-Source / Multi-Modal Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Cross-Sensor Simulation** | Simulate noise/artifacts of another sensor | Add SAR-like speckle to optical image |
| **Band Dropout** | Randomly remove one or more bands | Drop RedEdge during training |
| **Modality Dropout** | Omit entire input modality | Train with only optical input |


**Self-Supervised Learning Augmentations**

| Method | Description | Example |
|--------|-------------|---------|
| **Random Masking** | Mask part of input for reconstruction tasks | Hide 75% of input patches for MAE |
| **View Generation** | Create different augmentations of one image | Apply crop + jitter + flip |
| **Spatial Jitter** | Slightly move image crop or bounding box | Offset chip origin by ±5 pixels |



  
*Cited from* [Dive into Deep Learning](https://d2l.ai/chapter_multilayer-perceptrons/mlp.html)

For an interactive visualization of how activation functions, hidden layers, and training dynamics affect learning in neural networks, see this example on the [TensorFlow Playground](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.47909&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false), demonstrating a small MLP with two hidden layers on a nonlinear classification task.

## Convolutional Neural Networks (CNNs)

Let's take a quick look at the MLP architecture and consider its strengths and limitations when applied to computer vision tasks. 

While MLPs are simple to implement and understand, are they ideal for image data?

Do they capture local features or handle spatial structure effectively? 

Not really. MLPs flatten the image and connect every pixel to every neuron, so they ignore the spatial locality that is crucial in images.
They are also not translation-invariant: if an object shifts slightly in the image, an MLP sees a completely different input. This lack of spatial awareness makes them a poor fit for most vision tasks.

More importantly, MLPs become computationally expensive for image inputs because each pixel is connected to every neuron in the next layer, leading to a huge number of parameters. For example, flattening a single-band 256×256 image yields 65,536 input features, and fully connecting those to just one hidden layer with 1,000 neurons requires over 65 million parameters!
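
The parameter count above is easy to reproduce and to compare with a convolutional layer. In this sketch, the 3×3 convolution with 64 filters is a hypothetical layer chosen only for comparison.

```python
def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

def conv2d_params(c_in, c_out, k):
    """2D convolution: one k x k kernel per (input, output) channel pair, plus biases."""
    return c_in * c_out * k * k + c_out

mlp = dense_params(256 * 256, 1000)              # flattened single-band 256 x 256 image
cnn = conv2d_params(1, 64, 3)                    # hypothetical 3 x 3 conv layer, 64 filters
print(mlp)                                       # 65537000
print(cnn)                                       # 640
```

Because convolution kernels are shared across all spatial positions, the parameter count does not depend on the image size.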

### Convolution operation

![CNN.gif](attachment:9e2fd4fc-e65f-487c-825a-807ad913cd2c.gif)

#### CNN components and output size

**Kernel (Filter) Size:**

- Defines the height and width of the filter matrix (e.g. for a 2D kernel, it can be 3×3, 5×5...). Kernels can differ in the number of dimensions they support.
  
- Controls the **receptive field**, the area of the input each neuron sees.
  
- Smaller kernels (like 3×3) are common in modern architectures for better detail capture.

**Stride:**

- Number of pixels the filter moves at each step.

**Padding:**

- Adds values (usually zeros) around the input image to control the output size and ensure that edge pixels contribute to the feature map.

**Dilation:**

- Spreads out the kernel by skipping input pixels, which increases the receptive field without increasing kernel size or parameters (commonly used in segmentation tasks to capture wider context without downsampling).

![Dilated-convolution-On-the-left-we-have-the-dilated-convolution-with-dilation-rate-r.png](attachment:78ef8a1b-2abb-4752-95f1-9d2c25b5133c.png)

*Image from* [link](https://www.nature.com/articles/s41598-018-24304-3)


**Number of Filters (Channels):**

- Determines how many feature maps a layer produces. More filters mean more capacity to learn diverse patterns.


**Feature Map:**

- The output produced by applying a filter. Its size depends on the input size, kernel size, stride, padding, and dilation.
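
The relationship between these components and the feature-map size follows the standard formula $\lfloor (n + 2p - d(k-1) - 1)/s \rfloor + 1$ along each spatial dimension, with input size $n$, kernel size $k$, stride $s$, padding $p$, and dilation $d$. A small helper (its name is hypothetical) makes this easy to try out:

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1   # span covered by a dilated kernel
    return (n + 2 * padding - effective_kernel) // stride + 1

print(conv_output_size(256, kernel=3))                          # 254
print(conv_output_size(256, kernel=3, padding=1))               # 256
print(conv_output_size(256, kernel=3, stride=2, padding=1))     # 128
print(conv_output_size(256, kernel=3, padding=2, dilation=2))   # 256
```

The last case shows that a 3×3 kernel with dilation 2 covers a 5×5 area, so it needs a padding of 2 to preserve the input size.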


**Pooling Layers:**

- Pooling layers reduce the spatial size of feature maps by summarizing regions (e.g., taking the max or average), helping to retain important features while reducing computation and overfitting. Max pooling, commonly used, keeps the most prominent feature in each region.
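
A minimal NumPy sketch of non-overlapping 2×2 max pooling on a single-channel feature map. The input values are made up for the example.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling on an (H, W) array; H and W must be divisible by size."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)   # split into size x size blocks
    return blocks.max(axis=(1, 3))                         # keep the maximum of each block

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool2d(x))
# [[6 4]
#  [8 9]]
```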
  
![kernelsize.gif](attachment:33b5dc7e-7769-4252-9868-8b5fdebc8957.gif)

*Image from* [link](https://medium.com/@Tms43/understanding-padding-strides-in-convolutional-neural-networks-cnn-for-effective-image-feature-1b0756a52918)


### Different CNN Design Patterns

**Serial Kernels:**

The early design of deep neural networks was primarily sequential, with layers stacked one after another in a straight pipeline (e.g., LeNet, AlexNet, VGGNet).


![image.png](attachment:2fe8ed37-9db8-4c7e-a565-ed0397b68189.png)


**Parallel Kernels:**

Use multiple kernel sizes in parallel (e.g., 1×1, 3×3, 5×5) to capture multi-scale features.

Popularized by Inception modules (e.g., GoogLeNet).

Helps model both fine and coarse details simultaneously.

Reduces the need to stack very deep layers for multi-scale information.


<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/paralled_design.png" width="500"/>



**Residual Connections (Skip Connections):**

Introduced in ResNet.

Helps mitigate vanishing gradient problems in deep networks.

Allows gradients to flow directly through the skip paths.

Enables training of very deep networks (e.g., 152+ layers).

Formula: $y = F(x) + x$, where $F(x)$ is the output of a few stacked layers.
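
The residual formula can be sketched with a toy block in NumPy. Real ResNet blocks use convolutions and normalization; here F is assumed to be two dense layers with a ReLU in between, purely to keep the example short.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x with F(x) = ReLU(x W1) W2; W2 must map back to x's feature size."""
    f = np.maximum(x @ W1, 0.0) @ W2
    return f + x                                 # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))

# If F outputs zero, the block reduces to the identity mapping.
y_zero = residual_block(x, W1, np.zeros((16, 8)))
y_rand = residual_block(x, W1, rng.normal(size=(16, 8)))
print(np.allclose(y_zero, x), y_rand.shape)
```

This is one way to see why very deep stacks remain trainable: each block only has to learn a correction on top of the identity.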

![image.png](attachment:56721f52-e30d-426a-95ba-ec0e5a8b8c52.png)


*Image from* [Dive into Deep Learning](https://d2l.ai/chapter_convolutional-modern/resnet.html)


### Different tasks in Remote Sensing

In remote sensing, and in computer vision more generally, we are interested in a variety of tasks: classification, dense prediction, semantic segmentation, instance segmentation, object detection, regression, and many others.

In the previous section, we saw that adding fully connected layers at the end of the convolutional layers allows the model to classify images based on the extracted features. So the convolutional layers extract features, and fully connected layers (at the end) act as the classifier by mapping those features to class scores.

**Dense prediction**

In dense prediction tasks such as semantic segmentation, the model's output must be upsampled to recover spatial resolution and accurately capture details like object boundaries and textures.



| Upsampling Method          | Example Network              | Notes                                                                 |
|----------------------------|------------------------------|-----------------------------------------------------------------------|
| Transpose Convolution      | FCN, U-Net                   | Learnable; may cause checkerboard artifacts if not carefully designed.|
| Unpooling with Indices     | SegNet                       | Uses max-pooling indices to restore spatial structure.                |
| Bilinear Interpolation     | DeepLab (v1–v3+)             | Non-learnable; used for final resizing to input resolution.           |
| Sub-pixel Convolution      | ESPCN, EDSR                  | Efficient; rearranges features to increase resolution.                |
| Nearest Neighbor           | Lightweight/simple models    | Fast and non-learnable; used in speed-critical applications.          |

**Transpose Convolution**

<img src="https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/transposeconv.png" width="60%"/>

*Image from* [link](https://d2l.ai/chapter_computer-vision/transposed-conv.html)
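
A basic transposed convolution (stride 1, no padding) can be written directly in NumPy: every input value "stamps" a scaled copy of the kernel onto the output. This is a teaching sketch with made-up input and kernel values; deep learning frameworks provide optimized, learnable versions.

```python
import numpy as np

def transposed_conv2d(X, K):
    """Stride-1, no-padding transposed convolution of a 2D input X with kernel K."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i + h, j:j + w] += X[i, j] * K   # add a kernel copy scaled by the input value
    return Y

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
print(transposed_conv2d(X, K))
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]
```

The 2×2 input grows to a 3×3 output, and overlapping stamps are summed, which is also where checkerboard artifacts can originate.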

**Max Unpooling**

<img src="https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/refs/heads/main/materials/figures/max_unpooling.webp" width="60%"/>

*Image from* [link](https://medium.com/jun94-devpblog/dl-12-unsampling-unpooling-and-transpose-convolution-831dc53687ce)
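
Max unpooling can be sketched as a pair of functions: pooling records where each maximum came from, and unpooling puts the values back at those positions and fills the rest with zeros. The function names and the toy input are illustrative assumptions.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Return pooled values and, per block, the flat index (row-major) of the maximum."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    flat = blocks.reshape(h // size, w // size, size * size)
    return flat.max(axis=-1), flat.argmax(axis=-1)

def max_unpool(pooled, indices, size=2):
    """Place each pooled value back at its recorded position; all other entries are zero."""
    ph, pw = pooled.shape
    out = np.zeros((ph * size, pw * size), dtype=pooled.dtype)
    rows = np.arange(ph)[:, None] * size + indices // size
    cols = np.arange(pw)[None, :] * size + indices % size
    out[rows, cols] = pooled
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
pooled, indices = max_pool_with_indices(x)
print(max_unpool(pooled, indices))
# [[0 0 0 4]
#  [0 6 0 0]
#  [0 0 9 0]
#  [0 8 0 0]]
```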


**Nearest Neighbor**

<img src="https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/main/materials/figures/nn_upsampling.png" width="60%"/>

*Image from* [link](https://www.researchgate.net/publication/358610282_FHI-Unet_Faster_Heterogeneous_Images_Semantic_Segmentation_Design_and_Edge_AI_Implementation_for_Visible_and_Thermal_Images_Processing)



**Semantic segmentation**

1- Fully Convolutional Networks (FCNs) and upsampling

![image.png](attachment:eeb2c5b0-63fa-4809-a986-89fe48f05773.png)


2- Encoder-decoder architecture

The encoder gradually reduces the spatial dimensions of the input while extracting high-level features. The decoder then upsamples these features to reconstruct an output that matches the original input size. This setup allows the model to learn both what is in the input and where it is, enabling dense predictions like per-pixel classification.

![Semantic-Segmentation-Approaches-1024x332.jpeg](attachment:03b0678a-1822-4380-809e-880288dbee46.jpeg)

*Image from* [link](https://collab.dvb.bayern/spaces/TUMdlma/pages/73379951/Rethinking+Semantic+Segmentation+from+a+Sequence-to-Sequence+Perspective+with+Transformers)

3- Dilated Convolutions

Dilated convolutions replace standard convolutions to expand the receptive field without downsampling. This helps retain spatial resolution while capturing wider context (e.g., DeepLab v1/v2/v3).

Reference: Chen, Liang-Chieh, et al. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2017): 834-848.

## Hands-on

<!-- Title Image -->
<p align="center">
  <img src="https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/main/materials/figures/data_1.png" width="1300"/>
</p>

<!-- Side-by-side images with text box to the right -->
<div style="display: flex; align-items: center;">

  <!-- Two side-by-side images -->
  <div style="display: flex; flex-direction: column;">
    <img src="https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/main/materials/figures/data_2.png" width="520" style="margin-bottom: 20px;"/>
    <img src="https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/main/materials/figures/data_3.png" width="420"/>
  </div>

  <!-- Text box -->
  <div style="margin-left: 80px; padding: 20px; border: 1px solid #ccc; width: 400px;">
    <p><strong>About dataset:</strong></p>
   <p>The dataset we will use for the hands-on session comprises a comprehensive collection of 42,403 labeled satellite image segments from 33,746 unique sites across Sub-Saharan Africa, spanning the years 2017 to 2023. These labels were created by trained labelers using Planet NICFI imagery (4.8 m resolution), accessed and processed through a custom labeling platform. This effort was a collaboration between Clark University (USA), Farmerline Ltd (Ghana), and Spatial Collective (Kenya).</p>

<p>In total, over 825,000 field polygons were digitized. Label quality was assessed using expert-reviewed scores and platform-derived metrics. The labels and imagery are publicly available on the <a href="https://registry.opendata.aws/africa-field-boundary-labels/" target="_blank">AWS Open Registry</a> and <a href="https://zenodo.org/records/11060871" target="_blank">Zenodo</a>. This dataset supports the development of machine learning models for crop field boundary detection and the analysis of agricultural patterns across Africa.</p>

  </div>

</div>

## Implementing a U-Net

![image.png](attachment:5b9109a4-71f5-4e03-b292-82ed1726a980.png)


## Transformer Blocks and Attention Mechanism

The attention mechanism is the core building block of transformer models.


A transformer block is a layered module that uses attention along with other components (e.g., LayerNorm, MLP, and residual connections).

### Transformer blocks and Vision Transformer (ViT)



![vit_architecture.jpg](attachment:ddd93298-106e-43b7-8507-d0d482699790.jpg)

*Image from* [link](https://huggingface.co/docs/transformers/v4.48.1/en/model_doc/vit)


**The goal is to selectively focus on the parts of an input signal that are most relevant for a recognition task.**


### Self-Attention

![image.png](attachment:012b4ed6-d090-40d0-a7bf-b6d60d5dddc9.png)



Q, K, and V are linear projections of the input tokens. Because every token attends to every other token, self-attention captures global context across the image.

Query (Q): What am I looking for?

Key (K): What do I contain?

Value (V): What information do I carry?

$QK^\top$ ⇒ how similar each query is to each key

Multiplying by V aggregates the values according to the attention weights.
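
A single-head version of this computation, written as a minimal NumPy sketch. It follows the standard scaled dot-product form, where the scores are divided by $\sqrt{d_k}$ before the softmax; the token count, feature sizes, and random projection matrices are assumptions for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # linear projections of the tokens
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # similarity of each query to each key
    weights = softmax(scores, axis=-1)           # attention weights; each row sums to 1
    return weights @ V, weights                  # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                     # 5 tokens (e.g. image patches), 16 features
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=1))
```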

### Co-Attention

![image.png](attachment:8473f903-6619-479f-9102-8e3d6f1075bd.png)


## Foundation Models

**Key benefits**
- Strong generalization across diverse tasks and geographic regions  
- Minimizes reliance on large labeled datasets  
- Supports scalable and efficient Earth observation workflows  

**Pretraining phase**
- Leverages massive collections of unlabeled geospatial data  
- Utilizes self-supervised learning (SSL) to learn rich, transferable representations  
- Captures spatial, spectral, and temporal dynamics in the data  

**Fine-tuning phase**
- Requires only limited labeled samples  
- Tailors the pretrained model to specific applications such as crop classification, flood detection, and urban change analysis  


![image.png](attachment:3b3ae1ab-8bd8-457d-a77e-470130c188e8.png)

![image.png](attachment:e1e047ac-ffe0-4b15-947b-c5228fecb7ed.png)

![image.png](attachment:ec89c9f4-e29a-4b84-a888-07df344b2df0.png)


### Self-Supervised Learning (SSL)

- Learns from **unlabeled data** by turning it into a supervised problem  
- Relies on a **pretext task** to generate pseudo-labels automatically  
- Goal: learn **generalizable representations** for downstream tasks  

Note: A pretext task is a self-designed, auxiliary learning objective that enables a model to learn useful representations from unlabeled data by solving a surrogate problem with automatically generated labels.


### Generative vs. Discriminative Self-Supervised Learning


Generative networks learn the underlying structure of satellite data by modeling the data distribution itself, often through tasks like reconstructing masked or corrupted inputs. This makes them useful for capturing spatial, spectral, and temporal patterns without needing labels.

Discriminative networks focus on modeling the relationship between inputs, learning to distinguish classes or predict outcomes from data.


**Generative SSLs**


**Masked Image Modeling (MIM)**

- Mask a portion of the input  

  Can be pixels, image patches, or latent representations  

- Pretrain the model

  Often uses an autoencoder architecture  

- Predict the missing information

  The model uses visible context to reconstruct or infer the masked content
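
The masking step can be sketched as follows, assuming the image has already been split into flattened patches. The patch count (196), patch length (768), and the 75% mask ratio are example values in the style of MAE, and the function name is hypothetical.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the visible patches and of the masked patches."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)          # random order of patch indices
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))            # e.g. 14 x 14 patches, each flattened
visible_idx, masked_idx = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)

encoder_input = patches[visible_idx]             # only visible patches are encoded
reconstruction_target = patches[masked_idx]      # the loss is computed on masked patches
print(encoder_input.shape, reconstruction_target.shape)
```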


![image.png](attachment:aae59c2d-2cd8-4548-9d03-b476a2f87ce4.png)

*Figure: Overview of the Masked Autoencoder (MAE) approach (He et al., 2022).*




**Diffusion**

- Define the number of timesteps.

- Randomly sample a timestep (only one step is chosen per image in each iteration).

- Add noise to the image based on the sampled step:

  - Use a known forward noising schedule to generate a corrupted image.
  - This simulates what the image would look like after t steps of noise.
  
- Train the model to denoise:

  - The model takes the noisy image and the timestep as inputs.
  - It predicts either the original clean image or the added noise.
  
- Compute the loss:

  - Most commonly, Mean Squared Error (MSE) between predicted and true noise is used.
  
  
- Repeat across many images and timesteps over multiple epochs to learn denoising from any level of corruption.
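
One training iteration of the procedure above can be sketched in NumPy. The linear noise schedule with 1,000 steps is an assumed choice (it matches a commonly used DDPM setting), and `toy_denoiser` is a placeholder standing in for the neural network, so the loss here is not meaningful beyond showing the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)             # fraction of signal retained at each timestep

def add_noise(x0, t, rng):
    """Jump straight to timestep t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    a_bar = alpha_bars[t].reshape(-1, 1, 1)      # broadcast over the image dimensions
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise

def toy_denoiser(x_t, t):
    """Placeholder for a network that takes the noisy image and timestep and predicts the noise."""
    return np.zeros_like(x_t)

x0 = rng.uniform(-1.0, 1.0, size=(4, 32, 32))    # batch of 4 toy single-band images
t = rng.integers(0, T, size=len(x0))             # one random timestep per image
x_t, noise = add_noise(x0, t, rng)
loss = np.mean((toy_denoiser(x_t, t) - noise) ** 2)   # MSE between predicted and true noise
print(t, loss)
```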



![diffusion models.webp](attachment:943daa67-288e-4ca2-aa44-b02a09c04757.webp)


**Discriminative SSLs**


**Contrastive Learning**


- Maximize agreement between different augmented views of the same data sample (positive pairs)
   
- Minimize agreement between views of different data samples (negative pairs)  
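
These two objectives are often combined in an InfoNCE-style loss. The sketch below is a simplified one-directional version (not the full symmetric loss used in some papers): row i of `z1` and row i of `z2` are embeddings of two views of the same sample, and all other rows act as negatives. The temperature value and the toy embeddings are assumptions.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings; row i of z1 and row i of z2 form a positive pair."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                   # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positives sit on the diagonal

rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 32))
z2_aligned = z1 + 0.05 * rng.normal(size=z1.shape)     # views that agree with z1
z2_random = rng.normal(size=z1.shape)                  # unrelated embeddings
loss_aligned = info_nce(z1, z2_aligned)
loss_random = info_nce(z1, z2_random)
print(loss_aligned, loss_random)
```

The loss is low when each positive pair is more similar than all negative pairs, and high when the embeddings carry no information about which views belong together.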

Question: which sample-pairing design is more suitable for time-series analysis in contrastive learning?


![image.png](attachment:e263a334-964a-4b9b-91e2-b6c17d5a8d63.png)

*Figure: Contrastive learning aims to bring positive pairs closer and push negative pairs apart (Chen et al., 2020).*


**Knowledge Distillation**

BYOL (Bootstrap Your Own Latent) as an example

- Apply two different augmentations to the same image to create two distinct views.
- Pass each view through two networks:

  - A **student network** (trainable): encoder → projector → predictor.
  - A **teacher network** (momentum-updated copy of the student): encoder → projector.
- Both networks produce **feature embeddings** from their views.
- The **student is trained to match the teacher’s output**:
  - Specifically, the **student's predictor output** from view 1 is trained to match the **teacher's projector output** from view 2.
- Compute the loss:
  - **Mean Squared Error (MSE)** between **L2-normalized** student prediction and teacher projection.
  - No contrastive loss. No negative pairs needed.
- Update:
  - The **student** is updated via backpropagation.
  - The **teacher** is updated using **Exponential Moving Average (EMA)** of the student weights.
- Repeat over many images and augmentations to build robust, view-invariant representations.
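
The loss and the teacher update from the steps above can be sketched in NumPy. The encoders, projectors, and predictor are omitted; the function names and the momentum value `tau` are assumptions for illustration. In actual training, no gradient flows through the teacher branch.

```python
import numpy as np

def byol_loss(student_pred, teacher_proj):
    """MSE between L2-normalized vectors, which equals 2 - 2 * cosine similarity."""
    p = student_pred / np.linalg.norm(student_pred, axis=1, keepdims=True)
    z = teacher_proj / np.linalg.norm(teacher_proj, axis=1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=1))

def ema_update(teacher_w, student_w, tau=0.99):
    """Teacher weights slowly track the student weights via an exponential moving average."""
    return tau * teacher_w + (1.0 - tau) * student_w

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
print(byol_loss(a, a))                           # identical directions: loss near 0
print(byol_loss(a, -a))                          # opposite directions: loss near 4
print(ema_update(np.zeros(3), np.ones(3)))       # teacher moves 1% of the way to the student
```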


![KWD Figure](https://raw.githubusercontent.com/Rahebe22/UCSB_workshop/main/materials/figures/kwd.png)


Reference: Grill, Jean-Bastien, et al. "Bootstrap your own latent: A new approach to self-supervised learning." Advances in Neural Information Processing Systems 33 (2020): 21271-21284.
