🥟 SAMOSA

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Deyi Zhu*, Yuji Wang*, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

Tsinghua University

* Equal contribution

Official repository for the paper "Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking."

📖 Overview

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, which limits their generalization to unseen objects and to challenging scenarios involving distractors, occlusion, and nonlinear motion. Recent vision foundation models — exemplified by SAM 2 — learn strong video-understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal: it does not explicitly model target motion dynamics, nor does it enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking.

To address this, we propose SAMOSA, a tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues:

A lightweight nonlinear Motion Predictor models target dynamics and guides both mask selection and memory filtering.
Semantic cues detect target shifts and enable recovery from tracking failures.
Geometric cues act as structural constraints to improve tracking stability.

In this way, SAMOSA bridges the gap between SAM 2's implicit video-understanding prior and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2–based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios.
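As a rough illustration of how a motion prediction could guide mask selection, the sketch below scores each candidate mask by combining the mask confidence with the overlap between the candidate's bounding box and the motion-predicted box. This is a minimal sketch under stated assumptions, not the SAMOSA implementation: the function names, the linear weighting, and the `motion_weight` parameter are all hypothetical, and each candidate mask is represented only by its bounding box.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical convention


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_candidate(
    candidate_boxes: Sequence[Box],
    mask_scores: Sequence[float],
    predicted_box: Box,
    motion_weight: float = 0.5,  # assumed trade-off, not a value from the paper
) -> int:
    """Return the index of the candidate with the best combined score."""
    if not candidate_boxes or len(candidate_boxes) != len(mask_scores):
        raise ValueError("need one score per candidate and at least one candidate")

    def combined(i: int) -> float:
        motion_score = box_iou(candidate_boxes[i], predicted_box)
        return motion_weight * motion_score + (1.0 - motion_weight) * mask_scores[i]

    return max(range(len(candidate_boxes)), key=combined)


# A distractor has the higher mask score, but the motion cue favors candidate 1.
boxes = [(0, 0, 4, 4), (10, 10, 14, 14)]
scores = [0.9, 0.8]
chosen = select_candidate(boxes, scores, predicted_box=(10, 10, 14, 14))
```

With `motion_weight=0.0` the selection falls back to the mask score alone, which makes the influence of the motion cue easy to inspect.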

✨ Highlights

🎯 Higher-order Markov Motion Predictor (MP). Models nonlinear target motion and, together with an Error Detection–Recovery Module (EDRM), explicitly identifies potential tracking failures and mitigates error propagation.
🧠 Target-Aware Memory Bank (TAMB). Adaptively selects representative and reliable memory frames, guided by confidence, occlusion, and motion cues.
🏆 State-of-the-art performance. Strong results across general VOT benchmarks (LaSOT_ext, OTB, TrackingNet) and challenging anti-UAV tracking benchmarks, with notable improvements in nonlinear-motion scenarios.
⚡ Lightweight and easy to integrate. MP is the only trainable component; it is trained solely on annotated bounding-box trajectories — without video frames — and plugs into SAM 2 at inference time with limited latency overhead.
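Because MP is trained on bounding-box trajectories alone, its training data can be built without touching any video frames. The sketch below shows one plausible way to slice a trajectory into samples for an order-k Markov predictor, where the next box is predicted from the previous k boxes. The function name, the `order` default, and the box layout are assumptions made for illustration; the repository's actual data pipeline may differ.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # any consistent 4-value box layout


def make_markov_samples(
    trajectory: Sequence[Sequence[float]],
    order: int = 3,  # hypothetical history length
) -> List[Tuple[List[Box], Box]]:
    """Slice one box trajectory into (history, target) training pairs.

    Each history holds `order` consecutive boxes and the target is the box
    that immediately follows them.
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    boxes: List[Box] = []
    for box in trajectory:
        if len(box) != 4:
            raise ValueError("each box must have exactly 4 values")
        boxes.append(tuple(float(v) for v in box))

    samples = []
    for t in range(order, len(boxes)):
        samples.append((boxes[t - order:t], boxes[t]))
    return samples


# Toy trajectory with accelerating horizontal motion.
toy_trajectory = [(x, 5.0, 10.0, 10.0) for x in (0.0, 1.0, 3.0, 6.0, 10.0, 15.0)]
pairs = make_markov_samples(toy_trajectory, order=3)
```

A trajectory shorter than `order + 1` boxes yields no samples, so very short annotations are skipped rather than padded.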

🗺️ Roadmap

Done — Our paper is available on arXiv!
Incoming — Release test scripts for more benchmarks.
Incoming — Release raw results.
Incoming — Release training code for the Motion Predictor.
Incoming — Release a demo script to support inference on video.

🚀 Getting Started

1. SAMOSA Installation

SAM 2 must be installed before use. The code requires python>=3.10, torch>=2.3.1, and torchvision>=0.18.1. Please follow the official PyTorch installation instructions to install both the PyTorch and TorchVision dependencies. You can then install the SAMOSA version of SAM 2 on a GPU machine using:

cd sam2
pip install -e .
pip install -e ".[notebooks]"

💡 Please see INSTALL.md from the original SAM 2 repository for FAQs on potential issues and solutions.

Install the other requirements:

pip install tqdm matplotlib==3.7 numpy==1.26.4 tikzplotlib jpeg4py opencv-python lmdb pandas scipy loguru shapely
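After installation, a quick check of the stated minimum versions can catch environment problems early. The script below is a minimal sketch that uses only the Python standard library; the helper names are hypothetical and the version parsing is deliberately simple (it compares only the leading numeric part, ignoring suffixes such as `+cu121` or `rc1`).

```python
import re
import sys
from importlib import metadata

# Minimum versions stated in the installation notes above.
MINIMUMS = {"torch": "2.3.1", "torchvision": "0.18.1"}


def parse_version(text: str) -> tuple:
    """Turn '2.3.1+cu121' into (2, 3, 1), padding missing parts with zeros."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    numbers = [int(n) for n in match.group(0).split(".")[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed: str, required: str) -> bool:
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def check_environment() -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("python>=3.10 is required")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package}>={minimum} is required, found {installed}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```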

2. Checkpoint Download

Download SAM 2.1 checkpoints using:

cd checkpoints && \
./download_ckpts.sh && \
cd ..

The Motion Predictor checkpoint is included in this repo at sam2/checkpoints/mp.pth; no additional download is needed.
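Before running inference, it can help to confirm that the checkpoints are where the scripts expect them. The helper below is a hedged sketch: only the `mp.pth` file name comes from this README, while the function name, the default directory, and the assumption that the downloaded SAM 2.1 checkpoints are `.pt` files in the same folder are illustrative guesses.

```python
from pathlib import Path


def find_checkpoints(checkpoint_dir: str = "sam2/checkpoints") -> dict:
    """Report whether mp.pth exists and list any *.pt files next to it."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return {"motion_predictor": False, "sam2_candidates": []}
    return {
        # File name taken from the README.
        "motion_predictor": (root / "mp.pth").is_file(),
        # Assumption: the download script places *.pt files in this folder.
        "sam2_candidates": sorted(p.name for p in root.glob("*.pt")),
    }


if __name__ == "__main__":
    print(find_checkpoints())
```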

3. Data Preparation

Please prepare the data following data/data_preparation.md.

4. Inference & Evaluation

Run inference and evaluation on all datasets using:

bash scripts/test.sh

To evaluate previously generated raw results without rerunning inference, use:

python utils/calc_uav_metrics.py --res_path PATH_OF_RESULTS
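For readers unfamiliar with overlap-based tracking metrics, the sketch below computes per-frame IoU and a success score over a range of IoU thresholds, a common style of evaluation in visual object tracking. It is an illustration only and is not claimed to match what utils/calc_uav_metrics.py computes; the function names, the (x, y, w, h) box layout, and the 21-threshold default are assumptions.

```python
from typing import Sequence

BoxXYWH = Sequence[float]  # (x, y, width, height), assumed layout


def iou_xywh(pred: BoxXYWH, gt: BoxXYWH) -> float:
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    inter_w = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    inter_h = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0


def success_rate(preds, gts, threshold: float = 0.5) -> float:
    """Fraction of frames whose IoU exceeds the threshold."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need the same non-zero number of predictions and labels")
    hits = sum(iou_xywh(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def success_auc(preds, gts, num_thresholds: int = 21) -> float:
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return sum(success_rate(preds, gts, t) for t in thresholds) / num_thresholds


# Three frames: perfect overlap, no overlap, and partial overlap.
example_preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
example_gts = [(0, 0, 10, 10), (20, 20, 5, 5), (5, 0, 10, 10)]
example_auc = success_auc(example_preds, example_gts)
```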

🙏 Acknowledgment

SAMOSA is built on top of SAM 2, SAMURAI, and SAMITE. Thanks for their great work!

📚 Citation

If you find SAMOSA useful in your research, please consider citing our work:

@article{zhu2026samosa,
  title         = {Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking},
  author        = {Zhu, Deyi and Wang, Yuji and Liu, Yong and Tang, Yansong and Yu, Bingyao and Lu, Jiwen and Zhou, Jie},
  journal       = {arXiv preprint arXiv:2605.22538},
  year          = {2026}
}
