CASTLE (Combined Approach for Segmentation and Tracking with Latent Extraction) is a training-free framework that combines segmentation models, tracking algorithms, and visual foundation models to automatically discover animal behaviors from video. Through focused latent extraction and hierarchical clustering, it achieves expert-level accuracy across multiple species without manual labeling, while uncovering previously hidden behavioral patterns that keypoint methods miss.
📚 Full documentation: castle-ai.github.io/castle-ai
---
2025-12: Performance & Stability Update
- High-Performance Pipeline: Optimized CPU/GPU batch processing for both Tracking and Extraction.
---
2024-09: Public Release
- Initial public release of the CASTLE tool.
---
Clone & Environment:
```shell
git clone https://github.com/CASTLE-ai/castle-ai.git
cd castle-ai
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
---
Download Checkpoints: The checkpoint downloads may occasionally be blocked by Google. If that happens, download the models manually by opening the links below in a browser:
- https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
- https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth
- https://drive.google.com/file/d/1g4E-F0RPOx9Nd6J7tU9AE1TjsouL4oZq/edit
- https://drive.google.com/file/d/1QoChMkTVxdYZ_eBlZhK2acq9KMQZccPJ/edit

Alternatively, you can use the download_ckpt.sh script:
```shell
./download_ckpt.sh
```
Format:
```
castle-ai
├── castle
└── ckpt
    ├── dinov2_vitb14_reg4_pretrain.pth
    ├── R50_DeAOTL_PRE_YTB_DAV.pth
    ├── sam_vit_b_01ec64.pth
    └── SwinB_DeAOTL_PRE_YTB_DAV.pth
```
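As a sanity check after downloading, a small script can confirm that every expected checkpoint is in place (a minimal sketch; the filename list is taken from the tree above, and the helper name is illustrative):

```python
import os

# Checkpoint filenames expected under ckpt/ (from the directory tree above).
EXPECTED_CKPTS = [
    "dinov2_vitb14_reg4_pretrain.pth",
    "R50_DeAOTL_PRE_YTB_DAV.pth",
    "sam_vit_b_01ec64.pth",
    "SwinB_DeAOTL_PRE_YTB_DAV.pth",
]

def missing_checkpoints(ckpt_dir="ckpt"):
    """Return the expected checkpoint files not yet present in ckpt_dir."""
    return [name for name in EXPECTED_CKPTS
            if not os.path.isfile(os.path.join(ckpt_dir, name))]
```

Calling `missing_checkpoints()` from the repository root lists the files still to download; an empty list means the layout matches the tree above.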
```shell
python app.py
```

The following benchmarks were measured on a workstation with an Intel i7-12700 CPU and an RTX 3060 (12 GB) GPU. Time consumption is expressed as a multiple of the video's actual duration (assuming 30 FPS); a ratio below 1.00x means processing runs faster than real time.
| Task / Model | Video Res | Model Res | Ratio | Notes |
| :--- | :--- | :--- | :--- | :--- |
| GMFlow | 720x720 | 720x720 | 4.50x | Essential for fine movement (Residual Motion). |
| DINOv2b (ViT-B/14) | 720x720 | 518x518 | 2.20x | Standard vision foundation model. |
| DeAOT (Tracking) | 720x720 | 720x720 | 2.11x | ROI segmentation and tracking. |
| DINOv3b (ViT-B/16) | 720x720 | 592x592 | 0.91x | Faster than real time; highly optimized. |
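To translate a ratio from the table into expected wall-clock time, multiply it by the video's duration (a small sketch; the function name is illustrative):

```python
def processing_seconds(video_seconds, ratio):
    """Estimated processing time, given a ratio expressed as a
    multiple of the video's real duration (at 30 FPS)."""
    return video_seconds * ratio

# A 10-minute (600 s) 720x720 video:
gmflow = processing_seconds(600, 4.50)  # 2700 s (45 min), slower than real time
dinov3 = processing_seconds(600, 0.91)  # ~546 s (~9.1 min), faster than real time
```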
Tip
Hardware Scaling: Higher-end GPUs like the RTX 4090 are estimated to provide approximately 3.5x - 5x speedup compared to the RTX 3060, enabling real-time processing for most modules.
CASTLE is a project by the Wu Lab, a research group at Academia Sinica.
This project incorporates code and methodologies from the following sources:
- SAM (Segment Anything Model): https://github.com/facebookresearch/segment-anything (Apache License 2.0)
- DeAOT (Decoupling Features in Hierarchical Propagation): https://github.com/yoxu515/aot-benchmark (BSD 3-Clause License)
- DINOv2 (Self-Supervised Vision Transformer): https://github.com/facebookresearch/dinov2 (Apache License 2.0)
- DINOv3 (Vision Transformer): https://github.com/facebookresearch/dinov3 (Creative Commons Attribution-NonCommercial 4.0 International)
This work is distributed under the terms of the Apache License 2.0.
If you find this work useful, please consider citing:
@article{CASTLE,
title={CASTLE: a training-free foundation-model pipeline for unsupervised, cross-species behavioral classification},
author={Liu, Yu-Shun and Yeh, Han-Yuan and Hu, Yu-Ting and Wu, Bing-Shiuan and Chen, Yi-Fang and Yang, Jia-Bin and Jasmin, Sureka and Hsu, Ching-Lung and Lin, Suewei and Chen, Chun-Hao and Wu, Yu-Wei},
journal={bioRxiv},
year={2025}
}

