SparseDepthTransformer

A transformer architecture that dynamically skips layers per token based on semantic importance — now with true compute savings.

Motivation

In standard transformers, every token passes through every layer — regardless of whether it's a high-impact content word or a filler like “the.” This wastes compute and memory, especially in long contexts.

SparseDepthTransformer introduces per-token depth skipping. It computes a semantic score for each token and routes only the important ones through deeper layers.

This project builds on the idea of dynamic routing, adding true hard skipping, not just masking, and shows measurable gains in memory and layer usage.

Features

Per-token semantic scorer
True hard layer skipping
Baseline transformer for comparison
Benchmarking across sequence lengths and batch sizes
Outputs average layers used per token

Results

Benchmarked across batches (2, 8, 16) and sequence lengths (20–256):

Model	Avg Layers/Token	Memory (MB)	Time (s)
Sparse	~3.5	22.16–105.43	0.0058–0.0179
Baseline	6.0	22.15–104.34	0.0044–0.0207

The SparseDepthTransformer consistently used ~40% fewer layers per token with measurable memory savings, validating both semantic gating and compute reduction. Runtime is still slightly higher due to per-token execution, but this will be addressed with batching in future work.

Tokens now actually bypass computation at deeper layers if their semantic score is low — this was verified using conditional forward logic and benchmarking.

Future Optimizations

Implement token batching by depth group to improve runtime efficiency
Add dropout-based probabilistic gating during training
Fine-tune on real datasets (e.g., TinyStories, WikiText-2) and compare perplexity
Integrate with HuggingFace Transformers for broader experimentation
Introduce curriculum learning to vary routing difficulty during training

Contact

Feel free to reach out with feedback, ideas, or collaboration opportunities!: Email: desimoneq@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
__pycache__		__pycache__
.DS_Store		.DS_Store
LICENSE		LICENSE
README.md		README.md
benchmark.py		benchmark.py
benchmark_dmix.py		benchmark_dmix.py
dmix_block.py		dmix_block.py
dmix_model.py		dmix_model.py
model.py		model.py
tiny_stories.py		tiny_stories.py
train.py		train.py
training_perplexity.png		training_perplexity.png
training_results.csv		training_results.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SparseDepthTransformer

Motivation

Features

Results

Future Optimizations

Contact

About

Uh oh!

Releases

Packages

Languages

License

Quinnybob/sparse-depth-transformer

Folders and files

Latest commit

History

Repository files navigation

SparseDepthTransformer

Motivation

Features

Results

Future Optimizations

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages