Official implementation of GeoStack, a modular framework for aggregating domain-specific expertise into Vision-Language Models (VLMs) with zero additional inference complexity.
GeoStack introduces GeoLayers—bilinear adapters that utilize geometric manifold constraints to enable associative knowledge composition. Multiple experts can be "folded" into a single weight matrix, ensuring that inference time remains constant ($O(1)$) regardless of the number of integrated tasks.
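Conceptually, the folding step can be sketched with plain matrix algebra (illustrative NumPy only; the actual GeoLayers are learned bilinear adapters, and these random matrices are stand-ins):

```python
import numpy as np

d = 512
rng = np.random.default_rng(0)
# Hypothetical expert weight matrices: identity plus a small learned update
experts = [np.eye(d) + 0.01 * rng.standard_normal((d, d)) for _ in range(3)]

# Fold all experts into one matrix offline (associativity of matmul)
W_folded = experts[0]
for W in experts[1:]:
    W_folded = W_folded @ W

# At inference time, a single matmul replaces the whole expert chain,
# so cost stays constant regardless of how many experts were folded in
x = rng.standard_normal(d)
y_chain = x
for W in experts:
    y_chain = y_chain @ W      # applying experts one by one
y_folded = x @ W_folded        # equivalent single multiplication
```

Because matrix multiplication is associative, `y_chain` and `y_folded` agree up to floating-point error, which is why inference cost is independent of the number of integrated tasks.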
- Stackable Expertise: Integrate $N$ domain experts (e.g., textures, satellite imagery, medical scans) into a single CLIP backbone.
- Zero Overhead: Expert composition is performed via matrix multiplication; the final model is as fast as the original CLIP.
- Abelian Composition: Knowledge integration is largely invariant to the order of tasks.
- Theoretical Grounding: Enforces upper-triangularity and near-isometry via Convex Orthogonality Alignment (COA).
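The structural intuition behind the geometric constraints can be sketched in NumPy (illustrative only; `orthogonality_error` here is a hypothetical stand-in for the metric computed in `utils.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Hypothetical upper-triangular expert matrices
A = np.triu(rng.standard_normal((d, d)))
B = np.triu(rng.standard_normal((d, d)))

# Closure: the product of upper-triangular matrices is upper-triangular,
# so a folded stack keeps the same structure as each individual expert
C = A @ B

def orthogonality_error(W):
    # Deviation from isometry, ||W^T W - I||_F; near-isometry keeps
    # folded products well-conditioned as more experts are stacked
    return np.linalg.norm(W.T @ W - np.eye(W.shape[0]))
```

Upper-triangularity guarantees the folded matrix stays in the same family, while a small orthogonality error means each expert roughly preserves norms, so stacking does not blow up or collapse the representation.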
git clone https://github.com/QuantitativeImagingLaboratory/GeoStack
cd GeoStack
pip install -r requirements.txt
├── configs/
│ ├── imagnet.yml # Imagenet training config
│ ├── ... # other dataset configs
├── GeoStack/
│ ├── GeoLayer.py # GeoLayer model
│ └── GeoStack.py # GeoStack model to compose GeoLayers
├── losses.py # Contains loss functions
├── mda_train.py # Training for Multi-Domain Adaptation
├── mda_eval.py # Evaluation of stacked experts
├── cil_train.py # Training for Class-Incremental Learning on CIFAR-100
├── cil_eval.py # Long-term stability & forgetting benchmarks
└── utils.py # Metrics (Orthogonality Error) and checkpointing
To train a single GeoLayer expert on a specific domain (e.g., DTD or ImageNet):
python mda_train.py --dataset dtd --geo_layer
--geo_layer: Enables geometric constraints (Upper-triangularity + COA loss).
--biclip: Trains a standard bilinear adapter baseline without stacking constraints.
python cil_train.py --geo_layer --num_tasks 4
--geo_layer: Enables geometric constraints (Upper-triangularity + COA loss).
--biclip: Trains a standard bilinear adapter baseline without stacking constraints.
--num_tasks: Specify number of tasks to train on
Evaluate how multiple experts perform when folded together. Use the --stack argument to define the sequence of experts:
python mda_eval.py --stack 'i->d' --geo_layer # Stack imagenet and dtd geolayers
--stack: Specify the stack of experts, separated by the arrow '->'
--geo_layer: Uses GeoLayers for evaluation
--biclip: Uses BiCLIP layers for evaluation
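As an illustration of how a stack string such as `'i->d'` maps to an ordered list of experts, here is a minimal sketch (the alias table and the `parse_stack` helper are assumptions for exposition, not the repo's actual implementation):

```python
# Hypothetical mapping from single-letter aliases to dataset names
STACK_ALIASES = {"i": "imagenet", "d": "dtd"}

def parse_stack(spec):
    # Split on the '->' arrow and resolve each alias in order,
    # e.g. parse_stack("i->d") -> ["imagenet", "dtd"]
    return [STACK_ALIASES[token.strip()] for token in spec.split("->")]
```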
Evaluate the resilience to catastrophic forgetting across sequential tasks:
# Evaluate accuracy on each cumulative task after training 10 tasks
python cil_eval.py --num_tasks 10 --geo_layer
# Evaluate accuracy on task 0 after training 10 tasks
python cil_eval.py --num_tasks 10 --geo_layer --forgetting
--geo_layer: Uses GeoLayers for evaluation
--biclip: Uses BiCLIP layers for evaluation
--forgetting: Set true to evaluate only on the first task
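For reference, a common way to quantify forgetting from a Task-0 accuracy history is the gap between the best past accuracy and the final accuracy (the values below are illustrative, not results from the paper):

```python
# Hypothetical Task-0 accuracies recorded after each sequential training stage
acc_task0 = [0.82, 0.80, 0.79, 0.78]

def forgetting(history):
    # Standard forgetting measure: best accuracy ever achieved on the task
    # minus the accuracy after the final training stage
    return max(history) - history[-1]
```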
#### Folding Experts
GeoStack relies on the associative property of matrix multiplication: applying experts sequentially is equivalent to applying their precomputed product.
import torch

from GeoStack.GeoLayer import GeoLayer
from GeoStack.GeoStack import GeoStackCLIP

expert1 = GeoLayer(embed_dim=512)  # Load Expert 1
checkpoint = torch.load(expert1_checkpoint_path, map_location=device)
expert1.load_state_dict(checkpoint['model_state_dict'])

expert2 = GeoLayer(embed_dim=512)  # Load Expert 2
checkpoint = torch.load(expert2_checkpoint_path, map_location=device)
expert2.load_state_dict(checkpoint['model_state_dict'])
model = GeoStackCLIP(clip_model="ViT-B/16", geo_layers=[expert1, expert2]) # Fold them into a single CLIP model
#### Inference
logits = model(images)
To replicate the experimental results presented in the paper, we provide automated shell scripts that handle the sequential training and evaluation phases.
The MDA experiments evaluate the framework's ability to "fold" disparate domain knowledge into a single model. The script trains experts for six domains and evaluates them across the Easy, Medium, and Hard stacks defined in the manuscript.
chmod +x reproduce_mda.sh # Trains 6 experts and evaluates the 3 stacks defined in the paper
./reproduce_mda.sh
The CIL experiments demonstrate GeoStack's resilience to catastrophic forgetting. This script partitions CIFAR-100 into 10 sequential tasks and measures the "graceful degradation" of Task-0 accuracy.
chmod +x reproduce_cil.sh # Trains 10 sequential tasks and measures forgetting/accuracy
./reproduce_cil.sh
To reproduce the BiCLIP baseline comparison (standard bilinear adapters without geometric constraints):
# MDA Baseline
python mda_train.py --dataset imagenet --biclip
python mda_train.py --dataset dtd --biclip
python mda_eval.py -s "i->d" --biclip
# CIL Baseline
python cil_train.py --biclip
python cil_eval.py --biclip --forgetting