## Transfer Learning and Generalization Capacity


### What is Domain Shift?

**Domain shift** occurs when the training data (source domain) and the testing or deployment data (target domain) come from different distributions, which typically degrades the model’s performance.

**Types of domain shift include:**
- **Covariate shift**: The input distribution changes but the labeling function stays the same (e.g., satellite images under different lighting conditions).
- **Label shift**: Class proportions differ between domains while each class still looks the same (e.g., more rice fields in one region).
- **Concept shift**: The actual definition of a class changes across domains (e.g., crop labels depend on local farming systems).

> *Torralba & Efros, 2011 – "Unbiased look at dataset bias"*  
> https://doi.org/10.1109/CVPR.2011.5995347
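
Label shift is the easiest of the three to quantify. The sketch below uses made-up label arrays (the class names and counts are purely illustrative) and computes per-class importance weights `p_target(y) / p_source(y)`, a common way to reweight source samples when only class proportions differ.

```python
import numpy as np

def class_proportions(labels, num_classes):
    """Return the fraction of samples that fall in each class."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

def label_shift_weights(source_labels, target_labels, num_classes):
    """Importance weight per class: p_target(y) / p_source(y)."""
    p_src = class_proportions(source_labels, num_classes)
    p_tgt = class_proportions(target_labels, num_classes)
    return p_tgt / p_src

# Hypothetical labels: 0 = maize, 1 = rice
source = np.array([0] * 80 + [1] * 20)  # rice is rare in the source region
target = np.array([0] * 50 + [1] * 50)  # rice is common in the target region

weights = label_shift_weights(source, target, num_classes=2)
print(weights)  # rice samples get a weight above 1, maize samples below 1
```

In practice target labels are usually unavailable, so the target proportions would have to be estimated; the sketch assumes they are known only to keep the arithmetic visible.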


### Generalization Capacity

**Generalization** refers to a model’s ability to perform well on unseen domains or tasks. A model with strong generalization:
- Learns domain-invariant or semantically rich features.
- Does not overfit to dataset-specific noise or bias.

**Example**: A crop classification model trained on data from Europe that performs well in Zambia **without retraining** shows good generalization.

> *Recht et al., 2019 – "Do ImageNet Classifiers Generalize to ImageNet?"*  
> https://arxiv.org/abs/1902.10811

---

### Common Adaptation Strategies

#### Domain-Level Adaptation

**Definition**: Aligns the distributions of source and target domains globally.

**Example**: Adapting a model trained on Sentinel-2 (Europe) to Landsat-8 (Africa).

**Popular Methods**:
- Domain-Adversarial Training (DANN)  
> *Ganin et al., 2016 – "Domain-Adversarial Training of Neural Networks"*  
> https://arxiv.org/abs/1505.07818
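
The core mechanism of domain-adversarial training is a gradient reversal layer placed between the feature extractor and a domain classifier. A minimal NumPy sketch of that layer alone is shown below; the function names and the gradient values are illustrative, and a real implementation would live inside an autodiff framework.

```python
import numpy as np

def grl_forward(features):
    """Forward pass: the gradient reversal layer is the identity."""
    return features

def grl_backward(grad_from_domain_classifier, lam=1.0):
    """Backward pass: flip the sign of the incoming gradient and scale it by lam."""
    return -lam * grad_from_domain_classifier

features = np.array([0.2, -1.0, 0.5])
grad = np.array([0.1, 0.3, -0.2])  # hypothetical gradient of the domain loss

out = grl_forward(features)
reversed_grad = grl_backward(grad, lam=0.5)
print(out, reversed_grad)
```

Because the feature extractor receives the negated gradient, it is pushed to make the two domains harder to tell apart while the domain classifier keeps trying to separate them.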



#### Instance-Level Adaptation

**Definition**: Adapts the model or its predictions using individual samples or small batches rather than statistics of the whole domain.

**Example**: Test-time adaptation using a few images from a new location to adjust predictions.

**Popular Methods**:
- Tent (Test-time Entropy Minimization)  
> *Wang et al., 2021 – "Tent: Fully Test-Time Adaptation by Entropy Minimization"*  
> https://arxiv.org/abs/2006.10726
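
Tent adapts at test time by minimizing the entropy of the model's own predictions on unlabeled target batches. The sketch below only computes that objective from logits (the logit values are made up); the parameter update itself, which in Tent is restricted to normalization-layer parameters, is left out.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_entropy(logits):
    """Average Shannon entropy of the predicted class distributions."""
    probs = softmax(logits)
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())

# Hypothetical logits for two target samples and three classes
uncertain = np.array([[0.1, 0.0, -0.1], [0.0, 0.0, 0.0]])
confident = np.array([[5.0, 0.0, -5.0], [0.0, 6.0, 0.0]])

print(mean_entropy(uncertain), mean_entropy(confident))
```

Confident predictions give a lower value, so minimizing this quantity pushes the model toward sharper decisions on the new location.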

#### Task Adaptation

**Definition**: Transfers knowledge between different tasks—e.g., from classification to segmentation.

**Example**: Using a pretrained encoder from a crop classifier to initialize a segmentation model for field boundaries.

**Popular Methods**:
- Taskonomy  
> *Zamir et al., 2018 – "Taskonomy: Disentangling Task Transfer Learning"*  
> https://arxiv.org/abs/1804.08328
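
The encoder-reuse example can be sketched with plain dictionaries standing in for model weights. This is ordinary weight initialization from a pretrained encoder, not the Taskonomy method itself, and the parameter names (`encoder.conv1`, `head.fc`, `decoder.up1`) are hypothetical.

```python
import numpy as np

def transfer_encoder(classifier_weights, segmenter_weights, prefix="encoder."):
    """Copy entries whose name starts with `prefix` and whose shape matches."""
    updated = dict(segmenter_weights)
    copied = []
    for name, value in classifier_weights.items():
        same_shape = name in updated and updated[name].shape == value.shape
        if name.startswith(prefix) and same_shape:
            updated[name] = value.copy()
            copied.append(name)
    return updated, copied

rng = np.random.default_rng(0)
classifier = {
    "encoder.conv1": rng.normal(size=(8, 3)),  # pretrained on crop classification
    "head.fc": rng.normal(size=(5, 8)),        # task-specific, not transferred
}
segmenter = {
    "encoder.conv1": np.zeros((8, 3)),         # to be initialized from the classifier
    "decoder.up1": np.zeros((8, 8)),           # new for segmentation, left untouched
}

segmenter, copied = transfer_encoder(classifier, segmenter)
print(copied)
```

Only the shared encoder is carried over; the segmentation decoder still has to be trained on field-boundary labels.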


#### Feature-Level Adaptation

**Definition**: Matches intermediate feature distributions across domains.

**Example**: Aligning feature embeddings from spring and summer images of the same region.

**Popular Methods**:
- Maximum Mean Discrepancy (MMD)  
> *Long et al., 2015 – "Learning Transferable Features with Deep Adaptation Networks"*  
> https://arxiv.org/abs/1502.02791
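
MMD measures how far apart two sets of features are in a kernel space, and feature-level methods add it as a loss term. Below is a minimal sketch of the biased squared-MMD estimate with a single RBF kernel on synthetic features; the cited paper uses a multi-kernel variant inside a deep network, so treat this as the basic quantity only.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Pairwise RBF kernel values between the rows of a and the rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two feature sets."""
    k_ss = rbf_kernel(source, source, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_st = rbf_kernel(source, target, sigma).mean()
    return float(k_ss + k_tt - 2.0 * k_st)

rng = np.random.default_rng(0)
spring = rng.normal(loc=0.0, size=(200, 4))          # synthetic "spring" embeddings
summer_same = rng.normal(loc=0.0, size=(200, 4))     # same distribution
summer_shifted = rng.normal(loc=1.5, size=(200, 4))  # shifted distribution

print(mmd2(spring, summer_same), mmd2(spring, summer_shifted))
```

The shifted set yields a clearly larger value; training to reduce it pulls the two embedding distributions together.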

#### Representation Adaptation

**Definition**: Learns general-purpose features through self-supervised or contrastive pretraining.

**Example**: Pretraining a masked autoencoder on multi-sensor satellite data to learn robust spatial-temporal patterns.

**Popular Methods**:
- MAE (Masked Autoencoder)  
> *He et al., 2022 – "Masked Autoencoders Are Scalable Vision Learners"*  
> https://arxiv.org/abs/2111.06377
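
The pretraining signal in a masked autoencoder comes from hiding most patch tokens and reconstructing them from the rest. A minimal sketch of the random masking step is given below; the token array is synthetic, and the 75% ratio is the default reported in the cited paper.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; return them with a mask and indices."""
    if rng is None:
        rng = np.random.default_rng()
    num_tokens = tokens.shape[0]
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)  # True = masked, to be reconstructed
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx

tokens = np.arange(16 * 8, dtype=float).reshape(16, 8)  # 16 patch tokens of dim 8
visible, mask, keep_idx = random_masking(tokens, 0.75, np.random.default_rng(0))
print(visible.shape, int(mask.sum()))
```

Only the visible tokens are passed to the encoder; a lightweight decoder then predicts the pixels of the masked patches.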

#### Label Space Adaptation

**Definition**: Aligns source and target domains with **different or partially overlapping labels**.

**Example**: The source domain has 10 crop classes; the target has only 5 of them.

**Popular Methods**:
- PADA (Partial Adversarial Domain Adaptation)  
> *Cao et al., 2018 – "Partial Adversarial Domain Adaptation"*  
> https://arxiv.org/abs/1707.07901
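
When the target contains only some of the source classes, one practical step is to down-weight source classes that the target never seems to contain. The sketch below follows the class-weighting idea used in PADA as understood here: average the model's softmax outputs over target samples and normalize by the largest value. The probabilities are invented for illustration.

```python
import numpy as np

def pada_class_weights(target_probs):
    """Average target predictions per source class, scaled so the maximum is 1."""
    gamma = target_probs.mean(axis=0)
    return gamma / gamma.max()

# Hypothetical softmax outputs over 4 source classes for 3 target samples
target_probs = np.array([
    [0.70, 0.25, 0.03, 0.02],
    [0.20, 0.75, 0.03, 0.02],
    [0.60, 0.35, 0.03, 0.02],
])

weights = pada_class_weights(target_probs)
print(weights)  # classes 2 and 3 receive weights close to zero
```

Source samples from low-weight classes then contribute little to both the classification and the domain-alignment losses.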

#### Conditional Adaptation

**Definition**: Performs alignment **conditioned on class or domain-specific information**.

**Example**: Aligning "rice" features between India and China, while ignoring background class differences.

**Popular Methods**:
- CDAN (Conditional Domain Adversarial Network)  
> *Long et al., 2018 – "Conditional Adversarial Domain Adaptation"*  
> https://arxiv.org/abs/1705.10667
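
Conditioning the alignment on class information can be done by feeding the domain discriminator a joint representation of features and predictions. The sketch below shows the multilinear (outer-product) map used for this purpose in CDAN, on tiny made-up arrays; the discriminator itself is omitted.

```python
import numpy as np

def multilinear_map(features, probs):
    """Per-sample outer product of features and class probabilities, flattened."""
    batch = features.shape[0]
    return np.einsum("bi,bj->bij", features, probs).reshape(batch, -1)

features = np.array([[1.0, 2.0], [3.0, 4.0]])         # 2 samples, feature dim 2
probs = np.array([[0.9, 0.1, 0.0], [0.0, 0.5, 0.5]])  # predictions over 3 classes

joint = multilinear_map(features, probs)
print(joint.shape)
```

A sample predicted as "rice" activates only the rice slice of the joint vector, so the discriminator compares rice features with rice features across domains.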

#### Model Architecture Adaptation

**Definition**: Modifies architectural components (e.g., normalization layers) to better handle domain variability.

**Example**: Replacing batch norm with instance norm to reduce domain sensitivity in satellite images.

**Popular Methods**:
- AdaBN (Adaptive Batch Normalization)  
> *Li et al., 2016 – "Revisiting Batch Normalization for Practical Domain Adaptation"*  
> https://arxiv.org/abs/1603.04779
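
AdaBN keeps all learned weights and only swaps the batch-norm statistics for ones computed on target data. A minimal NumPy sketch with synthetic features follows; the feature distributions are assumptions chosen to make the shift obvious.

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Standard batch-norm transform with given statistics and affine parameters."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
source_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
target_feats = rng.normal(loc=2.0, scale=3.0, size=(500, 3))  # shifted domain

gamma, beta = np.ones(3), np.zeros(3)  # learned affine parameters stay fixed

# Statistics the trained model would normally carry from the source domain
src_mean, src_var = source_feats.mean(axis=0), source_feats.var(axis=0)
# AdaBN: replace them with statistics computed on target data
tgt_mean, tgt_var = target_feats.mean(axis=0), target_feats.var(axis=0)

with_source_stats = batch_norm(target_feats, src_mean, src_var, gamma, beta)
with_target_stats = batch_norm(target_feats, tgt_mean, tgt_var, gamma, beta)
print(with_source_stats.mean(axis=0), with_target_stats.mean(axis=0))
```

With target statistics the normalized features are again centered with unit variance, which is the range the downstream layers were trained on.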

#### Multi-Source Adaptation

**Definition**: Adapts from **multiple diverse source domains** to one target domain.

**Example**: Combining data from Europe, Asia, and the US to build a single model for Africa.

**Popular Methods**:
- MDAN (Multisource Domain Adversarial Networks)  
> *Zhao et al., 2018 – "Adversarial Multiple Source Domain Adaptation"*  
> https://arxiv.org/abs/1809.02254
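
With several source domains, the per-source losses have to be combined into one training signal. One option is a smooth maximum (log-sum-exp), which lets the worst-performing source dominate without ignoring the others. The sketch below is an illustrative aggregation with invented loss values, not the exact objective of the cited paper.

```python
import numpy as np

def smooth_max(losses, gamma=10.0):
    """Log-sum-exp aggregation; approaches max(losses) as gamma grows."""
    losses = np.asarray(losses, dtype=float)
    peak = losses.max()
    return float(peak + np.log(np.exp(gamma * (losses - peak)).sum()) / gamma)

# Hypothetical combined (task + domain) loss for each source region
per_source_loss = {"Europe": 0.40, "Asia": 0.90, "US": 0.55}

combined = smooth_max(list(per_source_loss.values()), gamma=10.0)
print(combined)
```

A small `gamma` behaves more like an average over sources, while a large `gamma` focuses training on the hardest source.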


## Bonus Session


## GFM examples: FlexiMo, A Flexible Foundation Model for Multi-Resolution Remote Sensing

**FlexiMo** is a remote sensing foundation model designed to address the challenges of applying vision transformers (ViTs) to images captured at arbitrary spatial resolutions and spectral configurations.

Unlike traditional ViTs, which assume fixed patch sizes and consistent input dimensions, FlexiMo introduces architectural innovations that allow it to handle variable resolutions, dimensions, and channel counts — essential in Earth observation where data is heterogeneous across sensors, scales, and tasks.


### Challenges in Vision Transformers (ViTs) for Remote Sensing

**Rigid Tokenization Mechanism**  
  Standard ViTs split each image into fixed-size patches and pair the resulting token sequence with positional encodings of a fixed length. This constraint limits flexibility and may prevent the model from capturing spatial detail accurately when image dimensions vary.

**Multi-Scale Perception Conflict**  
  Fixed patch sizes across datasets with different spatial resolutions cause inconsistent real-world coverage per patch. The same object may appear at different scales, leading to semantically inconsistent inputs and poor generalization.
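
The multi-scale conflict is simple arithmetic: the ground footprint of a patch is its size in pixels times the ground sampling distance. The sketch below uses the nominal 10 m and 30 m resolutions of Sentinel-2 and Landsat-8 bands; the helper name is illustrative.

```python
def patch_ground_coverage_m(patch_size_px, gsd_m):
    """Side length of the ground area covered by one square patch, in metres."""
    return patch_size_px * gsd_m

# Nominal ground sampling distances in metres per pixel
sensors = {"Sentinel-2": 10.0, "Landsat-8": 30.0}

for sensor, gsd in sensors.items():
    side = patch_ground_coverage_m(16, gsd)
    print(f"{sensor}: a 16x16 patch spans {side:.0f} m x {side:.0f} m")
```

The same 16×16 token therefore summarizes a ground area nine times larger on Landsat-8 than on Sentinel-2, so a field that fills several Sentinel-2 patches may fit inside a single Landsat-8 patch.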


### FlexiMo Architecture

FlexiMo addresses these issues through two core modules:


#### 1. Spatial Resolution-Aware Module (SRAM)

This module enables dimensional independence and resolution flexibility through:

- **Dynamic Patch Size Adaptation**  
  Instead of using a fixed patch size (e.g., 16×16), the module adapts patch size \( P \) based on the input image’s native resolution.  
  - High-resolution images → larger patches (to reduce computation)  
  - Low-resolution images → smaller patches (to preserve detail)

- **Preservation of Embedding Properties via Pseudo-Inverse Bilinear Interpolation**  
  To avoid distortion of token features, FlexiMo uses a pseudo-inverse of bilinear interpolation rather than traditional resizing.  
  This ensures token norms and structural relationships are preserved, which is essential for scale-consistent representations.

- **Multi-Scale Feature Extraction**  
  The flexible tokenization allows the model to extract fine-grained and coarse-grained representations simultaneously, improving generalization across spatial resolutions and downstream tasks.
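
The pseudo-inverse idea can be illustrated on a single patch-embedding kernel. The sketch below assumes, in the style of FlexiViT, that the kernel is resized so that a bilinearly upsampled patch produces the same token value as the original patch with the original kernel. It is a NumPy illustration with random data and hypothetical helper names, not FlexiMo's reference code.

```python
import numpy as np

def linear_resize_matrix(old, new):
    """Matrix of shape (new, old) performing 1-D linear interpolation."""
    positions = np.linspace(0, old - 1, new)
    basis = np.eye(old)
    columns = [np.interp(positions, np.arange(old), basis[j]) for j in range(old)]
    return np.stack(columns, axis=1)

def pi_resize_kernel(kernel, new_size):
    """Resize a square kernel with the pseudo-inverse of bilinear resizing."""
    old_size = kernel.shape[0]
    b1 = linear_resize_matrix(old_size, new_size)
    b2 = np.kron(b1, b1)  # bilinear resize acting on row-major flattened patches
    new_flat = np.linalg.pinv(b2.T) @ kernel.reshape(-1)
    return new_flat.reshape(new_size, new_size), b1

rng = np.random.default_rng(0)
kernel = rng.normal(size=(4, 4))  # hypothetical single-channel 4x4 embedding kernel
patch = rng.normal(size=(4, 4))   # a 4x4 image patch

new_kernel, b1 = pi_resize_kernel(kernel, 8)
patch_up = b1 @ patch @ b1.T      # the same patch bilinearly upsampled to 8x8

token_before = float((patch * kernel).sum())
token_after = float((patch_up * new_kernel).sum())
print(token_before, token_after)
```

The two token values agree up to floating-point error, which is the sense in which the embedding is preserved when the patch size grows; naively interpolating the kernel would not give this guarantee.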

Reference implementation detail: the model takes as input an image \( I \), a patch size parameter \( P \), and the electromagnetic wavelengths \( \{\lambda_i\}_{i=1}^C \) of its \( C \) channels.

#### 2. Channel Adaptation Module

FlexiMo supports input from sensors with varying spectral characteristics by:

- Leveraging prior knowledge of electromagnetic wavelengths to guide channel adaptation.
- Dynamically recalibrating the input channels to preserve physical consistency and semantic coherence across spectral bands.

This makes FlexiMo well suited to processing data from multi-spectral and hyperspectral sensors with inconsistent band counts.
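
One way to picture wavelength-guided channel adaptation is a small generator that turns each band's wavelength into an embedding kernel, so the number of kernels follows the number of bands. Everything in the sketch below is an assumption made for illustration: the sinusoidal encoding, the random `generator` matrix standing in for learned weights, and the approximate Sentinel-2 band centers in micrometres. It does not reproduce FlexiMo's actual module.

```python
import numpy as np

def wavelength_encoding(wavelengths_um, dim=8):
    """Sinusoidal encoding of each band's wavelength (in micrometres)."""
    freqs = 2.0 ** np.arange(dim // 2)
    angles = np.asarray(wavelengths_um, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def generate_channel_kernels(wavelengths_um, generator, patch_size):
    """Map each wavelength encoding to one patch_size x patch_size kernel."""
    enc = wavelength_encoding(wavelengths_um, dim=generator.shape[0])
    kernels = enc @ generator
    return kernels.reshape(len(wavelengths_um), patch_size, patch_size)

rng = np.random.default_rng(0)
patch_size = 4
generator = rng.normal(size=(8, patch_size * patch_size))  # stand-in for learned weights

# Approximate band centers: red, green, blue, then near-infrared
rgb = generate_channel_kernels([0.665, 0.560, 0.490], generator, patch_size)
rgb_nir = generate_channel_kernels([0.665, 0.560, 0.490, 0.842], generator, patch_size)
print(rgb.shape, rgb_nir.shape)
```

Because each kernel depends only on its own wavelength, adding a near-infrared band adds one kernel and leaves the red, green, and blue kernels unchanged.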



In summary, FlexiMo brings transformer-based representation learning to remote sensing with flexibility along several axes:

- Adapts to arbitrary spatial resolutions
- Preserves embedding integrity across scales
- Learns resolution- and sensor-invariant representations
- Processes images with non-uniform spectral channels

These properties make it a good fit for Earth observation tasks such as classification, segmentation, and change detection across diverse satellite platforms.


![Fleximo](https://github.com/Rahebe22/UCSB_workshop/raw/main/materials/figures/fleximo.png)


Li, Xuyang, et al. *FlexiMo: A Flexible Remote Sensing Foundation Model*. arXiv preprint arXiv:2503.23844, 2025. [https://arxiv.org/abs/2503.23844](https://arxiv.org/abs/2503.23844)



Other GFM examples:

- [Prithvi-v1](https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-1.0-100M)
- [Prithvi-v2](https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-600M)
- [Galileo](https://github.com/nasaharvest/galileo)

