Machine Learning project implementing the Vision Transformer (ViT) architecture, based on Google Research’s paper:
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
[arXiv:2010.11929]
This repository extends the original ViT approach and includes training on both standard datasets (CIFAR-10, CIFAR-100) and custom datasets (e.g., dog breeds). Developed as part of a Machine Learning course project (CAP5610), this implementation demonstrates how ViTs can scale across datasets and be adapted for real-world image classification tasks.
- 🧠 Built from scratch based on Google's ViT model
- 📦 Modular architecture: patch sizes, attention heads, layers, embedding sizes
- 🧪 Training pipeline with CLI, evaluation scripts, and experiment tracking
- 🐶 Specialized training on custom dog breed dataset
- 📊 Visualization support (attention maps, training curves via TensorBoard)
- 🖥️ Streamlit interface (experimental)
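
The configurable dimensions in the list above (patch size, attention heads, layers, embedding size) constrain one another, and a small sketch may make those relationships easier to see. The `ViTConfig` class, its field names, and its default values are illustrative assumptions, not the interface exposed by `vit.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical container for the ViT hyperparameters (not the repo's API)."""

    image_size: int = 32   # CIFAR images are 32x32 pixels
    patch_size: int = 4    # side length of each square patch (assumed default)
    in_channels: int = 3   # RGB input
    embed_dim: int = 192   # embedding size per token (assumed default)
    num_heads: int = 3     # attention heads per layer (assumed default)
    num_layers: int = 6    # number of encoder blocks (assumed default)

    def __post_init__(self) -> None:
        # The image must tile evenly into patches.
        if self.image_size % self.patch_size != 0:
            raise ValueError("image_size must be divisible by patch_size")
        # Multi-head attention splits the embedding evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def num_patches(self) -> int:
        side = self.image_size // self.patch_size
        return side * side

    @property
    def patch_dim(self) -> int:
        # Length of one flattened patch before the linear projection.
        return self.in_channels * self.patch_size ** 2

    @property
    def seq_len(self) -> int:
        # One token per patch plus the class token described in the ViT paper.
        return self.num_patches + 1

    @property
    def head_dim(self) -> int:
        return self.embed_dim // self.num_heads


cfg = ViTConfig()
print(cfg.num_patches, cfg.patch_dim, cfg.seq_len, cfg.head_dim)  # 64 48 65 64
```

With the paper's ViT-Base settings (224x224 input, 16x16 patches, 768-dimensional embeddings, 12 heads), the same arithmetic gives 196 patches and a sequence length of 197.
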
├── vit.py # Vision Transformer model
├── train.py # Training pipeline
├── data_loader.py # Dataset handling and augmentations
├── final_report.ipynb # Final experiment results and analysis
├── doggy.py # Dog breed-specific ViT model
├── app.py # Streamlit interface (WIP)
├── requirements.txt # Python dependencies
└── README.md # Project documentation

python train.py --dataset custom --data_path ./data --epochs 100 --batch_size 32

Supports training from scratch or fine-tuning. Logs are saved for TensorBoard visualization.
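
For readers who want to see how the flags in the command above could be parsed, here is a minimal `argparse` sketch. Only the four flag names come from the command itself. The `build_parser` helper, the defaults, and the help strings are assumptions, and the real `train.py` may define more options or handle them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags shown in the README command."""
    parser = argparse.ArgumentParser(description="Train a Vision Transformer")
    parser.add_argument("--dataset", type=str, required=True,
                        help="dataset name, e.g. 'custom'")
    # Defaults below are assumptions chosen to match the example command.
    parser.add_argument("--data_path", type=str, default="./data",
                        help="root directory of the dataset")
    parser.add_argument("--epochs", type=int, default=100,
                        help="number of training epochs")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="samples per mini-batch")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--dataset", "custom", "--data_path", "./data",
     "--epochs", "100", "--batch_size", "32"]
)
print(args.dataset, args.data_path, args.epochs, args.batch_size)
```

Passing `type=int` lets `argparse` reject non-numeric values for `--epochs` and `--batch_size` before any training starts.
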
Clone the repo and install dependencies:
git clone https://github.com/your-username/vision-transformer.git
cd vision-transformer
pip install -r requirements.txt

Dependencies:
- Python 3.8+
- PyTorch, torchvision
- NumPy, matplotlib, tqdm
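
After installing, a quick check like the one below can confirm that the interpreter and the listed packages are available. It is a sketch under stated assumptions: the helper names `missing_modules` and `python_ok` are hypothetical, and the import names are the usual ones for the packages above (`torch` for PyTorch, `numpy` for NumPy).

```python
import importlib.util
import sys
from typing import Iterable, List, Tuple


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(minimum: Tuple[int, int] = (3, 8)) -> bool:
    """Check that the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


# Import names for the dependencies listed above.
REQUIRED = ["torch", "torchvision", "numpy", "matplotlib", "tqdm"]

print("Python 3.8+:", python_ok())
print("Missing packages:", missing_modules(REQUIRED) or "none")
```
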
If you use or build upon this work, please cite the original ViT paper:
@article{dosovitskiy2020image,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and others},
journal={arXiv preprint arXiv:2010.11929},
year={2020}
}

Developed collaboratively for CAP5610 by:
- Aditya Khadye
- Michael Mendez
- Lauren Suarez

Planned improvements:
- Add pretrained model weights
- HuggingFace Transformers integration
- Export to ONNX for deployment
- Model interpretability tools (e.g., Grad-CAM)
This project is licensed under the MIT License.