This repository contains the implementation of a Conditional Vision Transformer (CViT) designed for accurate and robust traffic sign classification using the GTSRB (German Traffic Sign Recognition Benchmark) dataset. This model introduces a novel conditional attention mechanism that dynamically adapts the attention weights based on input context, significantly improving performance on misclassification-prone classes.
- ✅ Conditional Attention Mechanism
- ✅ Custom Patch Embedding and Tokenization
- ✅ Gating Network for Adaptive QKV Generation
- ✅ Superior Accuracy and Generalization
- ✅ Well-structured, clean code, ready for training and evaluation
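To make the "gating network for adaptive QKV generation" concrete, here is a minimal numpy sketch of one way such a mechanism can work: a context vector summarizing the token sequence drives a sigmoid gate that rescales the Query and Key projections per channel. All names and the single-head layout are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_attention(tokens, w_q, w_k, w_v, w_gate):
    """Single-head attention whose Q/K projections are modulated by an
    input-conditioned gate (illustrative sketch, not the repo's code)."""
    context = tokens.mean(axis=0)                   # (d,) global summary
    gate = 1.0 / (1.0 + np.exp(-context @ w_gate))  # (d,) sigmoid gate
    q = (tokens @ w_q) * gate                       # gated queries
    k = (tokens @ w_k) * gate                       # gated keys
    v = tokens @ w_v
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return scores @ v                               # (n, d)

rng = np.random.default_rng(0)
n, d = 5, 8
out = conditional_attention(rng.normal(size=(n, d)),
                            *(rng.normal(size=(d, d)) for _ in range(4)))
print(out.shape)  # (5, 8)
```

Because the gate depends on the input itself, attention weights adapt per image, which is the intuition behind the conditional mechanism described above.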
The Conditional ViT architecture is composed of three core phases:

1. **Preprocessing Phase**
   - Patch extraction and embedding
   - Positional encoding
   - Token sequence formation

2. **Feature Extraction Phase**
   - Conditional attention blocks with gating mechanisms
   - Adaptive generation of Query, Key, and Value matrices

3. **Classification Phase**
   - Fully connected layers
   - Softmax output for multi-class classification
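The end-to-end flow of the three phases can be sketched in a few lines of numpy. The patch size, embedding width, and mean-pooling before the head are illustrative assumptions (the feature-extraction phase is elided), but the structure mirrors the pipeline above: patchify, embed with positions, then classify with a softmax over the 43 GTSRB classes.

```python
import numpy as np

def extract_patches(img, p):
    """Split an HxWxC image into flattened p x p patches."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)        # (num_patches, p*p*C)

def classify(img, p, w_embed, pos, w_head):
    tokens = extract_patches(img, p) @ w_embed + pos  # preprocessing phase
    # (feature-extraction phase with conditional attention blocks omitted)
    pooled = tokens.mean(axis=0)                      # simple token pooling
    logits = pooled @ w_head                          # classification phase
    e = np.exp(logits - logits.max())
    return e / e.sum()                                # softmax over classes

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))       # assumed input resolution, not the repo's
p, d, n_cls = 8, 16, 43             # 43 classes in GTSRB
n_patches = (32 // p) ** 2
probs = classify(img, p,
                 rng.normal(size=(p * p * 3, d)),
                 rng.normal(size=(n_patches, d)),
                 rng.normal(size=(d, n_cls)))
print(probs.shape)  # (43,)
```

In the actual model, the commented-out middle step is a stack of conditional attention blocks rather than a pass-through, but the surrounding tokenization and softmax head follow this shape.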
| Model | Accuracy (%) |
|---|---|
| Simple ViT | 94.25 |
| Conditional ViT (Proposed) | 99.87 |
The proposed model markedly reduces misclassification in challenging traffic sign classes such as T21 and T2, yielding more stable and interpretable predictions.