🌐 xT: Nested Tokenization for Larger Context in Large Images

Ritwik Gupta*, Shufan Li*, Tyler Zhu*, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam
Paper: https://arxiv.org/abs/2403.01915


About

xT enables you to model large images end-to-end on contemporary, memory-limited GPUs. It is a simple framework for vision transformers that effectively aggregates global context with local details.
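To make the two-level idea concrete, here is a minimal NumPy sketch of nested tokenization: a large image is first split into regions, and each region is then split into patch tokens. The function name `nested_tokenize` and the region/patch sizes are illustrative only, not the repository's actual API.

```python
import numpy as np

def nested_tokenize(image, region_size, patch_size):
    """Illustrative two-level tokenization (hypothetical helper, not xT's API):
    split a large (H, W, C) image into regions, then flatten each region
    into patch tokens of shape (patch_size * patch_size * C,)."""
    H, W, C = image.shape
    assert H % region_size == 0 and W % region_size == 0
    assert region_size % patch_size == 0
    p = patch_size
    regions = []
    for i in range(0, H, region_size):
        for j in range(0, W, region_size):
            region = image[i:i + region_size, j:j + region_size]
            # Rearrange the region into a grid of p x p patches,
            # then flatten each patch into a single token vector.
            tokens = (
                region.reshape(region_size // p, p, region_size // p, p, C)
                      .transpose(0, 2, 1, 3, 4)
                      .reshape(-1, p * p * C)
            )
            regions.append(tokens)
    # Shape: (num_regions, tokens_per_region, token_dim)
    return np.stack(regions)

img = np.zeros((1024, 1024, 3), dtype=np.float32)
out = nested_tokenize(img, region_size=256, patch_size=16)
print(out.shape)  # (16, 256, 768)
```

Each region can then be encoded independently by a local vision backbone, keeping per-step memory bounded, while a second stage aggregates the region-level features into global context.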

Installation

  • conda env create -f environment.yml

The code has been tested on Linux on NVIDIA A100 GPUs with PyTorch 2+. We use custom CUDA kernels as implemented by the Mamba and OpenAI Triton projects. Therefore, modifications may be required to use this repository on other operating systems or GPUs.

Training

Training can be launched with:

  • ./run_submit.sh <num GPUs> <port number> config=<path to config>

We also provide SubmitIt scripts in launch_scripts to submit training jobs on Slurm clusters.

Pretrained Models

Weights and configs for our experiments are available on Hugging Face.

Name | Resolution | Top-1 Acc (%) | Params | Mem (GB) | Throughput (regions/s)
Swin-T | 256 | 53.76 | 31M | 0.30 | 76.43
Swin-T <xT> Hyper | 256/256 | 52.93 | 47M | 0.31 | 47.81
Swin-T <xT> Hyper | 512/256 | 60.56 | 47M | 0.29 | 88.28
Swin-T <xT> XL | 512/256 | 58.92 | 47M | 0.17 | 80.00
Swin-T <xT> Mamba | 512/256 | 61.97 | 44M | 0.29 | 84.77
Swin-S | 256 | 58.45 | 52M | 0.46 | 44.44
Swin-S <xT> Hyper | 256/256 | 57.04 | 69M | 0.46 | 39.80
Swin-S <xT> Hyper | 512/256 | 63.62 | 69M | 0.46 | 41.45
Swin-S <xT> XL | 512/256 | 62.68 | 69M | 0.23 | 36.36
Swin-B | 256 | 58.57 | 92M | 0.50 | 36.14
Swin-B <xT> Hyper | 256/256 | 55.52 | 107M | 0.61 | 29.85
Swin-B <xT> Hyper | 512/256 | 64.08 | 107M | 0.74 | 24.00
Swin-B <xT> XL | 512/256 | 62.09 | 107M | 0.39 | 41.03
Swin-B <xT> Mamba | 512/256 | 63.73 | 103M | 0.58 | 29.09
Swin-L | 256 | 68.78 | 206M | 0.84 | 17.02
Swin-L <xT> Hyper | 256/256 | 67.84 | 215M | 1.06 | 16.08
Swin-L <xT> Hyper | 512/256 | 72.42 | 215M | 1.03 | 16.58
Swin-L <xT> XL | 512/256 | 73.47 | 215M | 0.53 | 14.10
Swin-L <xT> Mamba | 512/256 | 73.36 | 212M | 1.03 | 15.61

Citation

@article{xTLargeImageModeling,
  title={xT: Nested Tokenization for Larger Context in Large Images},
  author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
  journal={arXiv preprint arXiv:2403.01915},
  year={2024}
}
