Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
Deyi Zhu*, Yuji Wang*, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou
Tsinghua University
* Equal contribution
Official repository for the paper "Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking."
Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, which limits their generalization to unseen objects and to challenging scenarios involving distractors, occlusion, and nonlinear motion. Recent vision foundation models — exemplified by SAM 2 — learn strong video-understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal: it does not explicitly model target motion dynamics, nor does it enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking.
To address this, we propose SAMOSA, a tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues:
- A lightweight nonlinear Motion Predictor models target dynamics and guides both mask selection and memory filtering.
- Semantic cues detect target shifts and enable recovery from tracking failures.
- Geometric cues act as structural constraints to improve tracking stability.
In this way, SAMOSA bridges the gap between SAM 2's implicit video-understanding prior and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2–based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios.
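As a rough illustration of how a motion cue can guide mask selection, the sketch below scores each candidate mask by a weighted sum of its predicted quality score and the IoU between its bounding box and a motion-predicted box. This is a minimal sketch under stated assumptions, not SAMOSA's implementation: the function names `box_iou` and `select_mask`, the weight `alpha`, the (x1, y1, x2, y2) box format, and the linear scoring rule are all hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_mask(
    candidate_boxes: Sequence[Box],
    quality_scores: Sequence[float],
    predicted_box: Box,
    alpha: float = 0.5,  # hypothetical weight on the motion term
) -> int:
    """Return the index of the candidate with the best combined score."""
    combined = [
        alpha * box_iou(box, predicted_box) + (1.0 - alpha) * quality
        for box, quality in zip(candidate_boxes, quality_scores)
    ]
    return max(range(len(combined)), key=combined.__getitem__)


# A distractor with a slightly higher quality score loses to the candidate
# that agrees with the motion prediction.
boxes = [(0, 0, 10, 10), (100, 100, 110, 110)]
scores = [0.6, 0.7]
print(select_mask(boxes, scores, predicted_box=(1, 1, 11, 11)))  # prints 0
```

Setting `alpha=0` reduces this to picking the highest quality score, which is the motion-agnostic behavior the paragraph above describes as suboptimal for tracking.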
- 🎯 Higher-order Markov Motion Predictor (MP). Models nonlinear target motion and, together with an Error Detection–Recovery Module (EDRM), explicitly identifies potential tracking failures and mitigates error propagation.
- 🧠 Target-Aware Memory Bank (TAMB). Adaptively selects representative and reliable memory frames, guided by confidence, occlusion, and motion cues.
- 🏆 State-of-the-art performance. Strong results across general VOT benchmarks (LaSOT_ext, OTB, TrackingNet) and challenging anti-UAV tracking benchmarks, with notable improvements in nonlinear-motion scenarios.
- ⚡ Lightweight and easy to integrate. MP is the only trainable component; it is trained solely on annotated bounding-box trajectories — without video frames — and plugs into SAM 2 at inference time with limited latency overhead.
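Because MP is trained on bounding-box trajectories alone, its training data can be assembled without decoding any frames. The sketch below shows one plausible way to turn a trajectory into (history, next-box) pairs for an order-k Markov predictor, in which the next state depends only on the previous k states. The function name `make_markov_samples`, the (cx, cy, w, h) box format, and the default `order=4` are assumptions for illustration; the released training code may differ.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (cx, cy, w, h)


def make_markov_samples(
    trajectory: Sequence[Box],
    order: int = 4,  # hypothetical history length k
) -> List[Tuple[List[Box], Box]]:
    """Slide a window over one trajectory.

    Each sample pairs the `order` most recent boxes with the box that
    follows them, which is the supervision target for the predictor.
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    samples = []
    for t in range(order, len(trajectory)):
        history = list(trajectory[t - order:t])
        samples.append((history, trajectory[t]))
    return samples


# A toy trajectory of six boxes moving along a curved path.
toy = [(float(t), float(t * t), 10.0, 10.0) for t in range(6)]
pairs = make_markov_samples(toy, order=4)
print(len(pairs))  # prints 2
```

A trajectory shorter than `order + 1` boxes yields no samples, so very short annotation tracks would simply be skipped under this scheme.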
- Done — Our paper is available on arXiv!
- Incoming — Release test scripts for more benchmarks.
- Incoming — Release raw results.
- Incoming — Release training code for the Motion Predictor.
- Incoming — Release a demo script to support inference on video.
SAM 2 must be installed before use. The code requires python>=3.10, torch>=2.3.1, and torchvision>=0.18.1. Please follow the official PyTorch installation instructions to install both the PyTorch and TorchVision dependencies. You can then install the SAMOSA version of SAM 2 on a GPU machine using:
cd sam2
pip install -e .
pip install -e ".[notebooks]"
💡 Please see INSTALL.md from the original SAM 2 repository for FAQs on potential issues and solutions.
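If installation fails, it can help to confirm that the environment meets the stated minimums (python>=3.10, torch>=2.3.1, torchvision>=0.18.1). The script below is a small, hypothetical helper that is not part of this repository; it reads installed versions through the standard-library `importlib.metadata` module, so it works even when `torch` itself cannot be imported. The names `parse_version` and `check_environment` are illustrative.

```python
import sys
from importlib import metadata

# Minimum versions as stated in the installation notes.
REQUIRED = {"torch": (2, 3, 1), "torchvision": (0, 18, 1)}


def parse_version(text):
    """Turn a string such as '2.3.1+cu121' into (2, 3, 1).

    Local build tags and pre-release suffixes are ignored, so '2.3.1rc1'
    is treated the same as '2.3.1'. That is a simplification.
    """
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if sys.version_info < (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"python {found} is older than 3.10")
    for name, minimum in REQUIRED.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            want = ".".join(map(str, minimum))
            have = ".".join(map(str, installed))
            problems.append(f"{name} {have} is older than {want}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment meets the stated minimums"]:
        print(line)
```

The check covers only the three version constraints listed above; it does not verify CUDA availability or the SAM 2 extension build.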
Install the other requirements:
pip install tqdm matplotlib==3.7 numpy==1.26.4 tikzplotlib jpeg4py opencv-python lmdb pandas scipy loguru shapely
Download SAM 2.1 checkpoints using:
cd checkpoints && \
./download_ckpts.sh && \
cd ..
The checkpoint for the Motion Predictor is included in this repo at sam2/checkpoints/mp.pth. No additional download is needed.
Please prepare the data following data/data_preparation.md.
Run inference and evaluation on all datasets using:
bash scripts/test.sh
You can also run evaluation on prepared raw results by running:
python utils/calc_uav_metrics.py --res_path PATH_OF_RESULTS
SAMOSA is built on top of SAM 2, SAMURAI, and SAMITE. Thanks for their great work!
If you find SAMOSA useful in your research, please consider citing our work:
@article{zhu2026samosa,
title = {Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking},
author = {Zhu, Deyi and Wang, Yuji and Liu, Yong and Tang, Yansong and Yu, Bingyao and Lu, Jiwen and Zhou, Jie},
journal = {arXiv preprint arXiv:2605.22538},
year = {2026}
}