This project implements a Transformer-based deep learning model for multi-class text classification using the Reuters newswire dataset. The main objective is to analyze how Transformer depth (number of encoder layers) affects classification performance and to compare Transformers with traditional RNN-based architectures.
The task involves classifying news articles into 46 distinct categories, making it a realistic and challenging Natural Language Processing (NLP) problem.
Traditional sequence models such as RNNs, LSTMs, and GRUs often face limitations including:
- Difficulty handling long-range dependencies
- Sequential computation that limits parallelism
- Degradation of contextual understanding in long sequences
This project addresses the question:
How effectively can Transformer architectures model textual data, and what is the optimal model depth for this task?
A custom Transformer encoder architecture is implemented using TensorFlow and Keras and trained with a varying number of encoder layers to study how depth affects performance. The core building blocks are:
- Token embedding and positional embedding
- Multi-head self-attention mechanism
- Feed-forward neural networks
- Residual connections with layer normalization
- Global average pooling for document-level classification
The main hyperparameters, which the sketch after this list reuses, are:
- Vocabulary size: 10,000
- Maximum sequence length: 200
- Embedding dimension: 32
- Number of attention heads: 4
- Number of output classes: 46
- Optimizer: Adam
- Loss function: Sparse Categorical Crossentropy
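The snippet below is a minimal Keras sketch of such an encoder, wired together with the hyperparameters listed above. The feed-forward width `FF_DIM` and all class, function, and variable names are illustrative assumptions rather than values taken from the project code.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hyperparameters from the list above; FF_DIM is an assumed value.
VOCAB_SIZE = 10_000
MAX_LEN = 200
EMBED_DIM = 32
NUM_HEADS = 4
FF_DIM = 64
NUM_CLASSES = 46


class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding plus a learned positional embedding."""

    def __init__(self, max_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(max_len, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)


def transformer_block(x, embed_dim, num_heads, ff_dim):
    # Multi-head self-attention with a residual connection and layer normalization.
    attn_out = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn_out)
    # Position-wise feed-forward network, again with residual + layer norm.
    ffn_out = layers.Dense(ff_dim, activation="relu")(x)
    ffn_out = layers.Dense(embed_dim)(ffn_out)
    return layers.LayerNormalization(epsilon=1e-6)(x + ffn_out)


def build_model(num_layers):
    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBED_DIM)(inputs)
    for _ in range(num_layers):                      # stack the encoder blocks
        x = transformer_block(x, EMBED_DIM, NUM_HEADS, FF_DIM)
    x = layers.GlobalAveragePooling1D()(x)           # document-level representation
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```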
Three Transformer configurations were evaluated (a sketch of this depth sweep follows the list):
- 3 encoder layers
- 5 encoder layers
- 7 encoder layers
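A depth sweep over these configurations might be run as in the hypothetical loop below, reusing `build_model` and the constants from the sketch above; the epoch count, batch size, and validation split are assumptions rather than settings reported here.

```python
from tensorflow.keras.datasets import reuters
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the Reuters data and pad/truncate every article to MAX_LEN tokens.
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=VOCAB_SIZE)
x_train = pad_sequences(x_train, maxlen=MAX_LEN)
x_test = pad_sequences(x_test, maxlen=MAX_LEN)

results = {}
for depth in (3, 5, 7):
    model = build_model(num_layers=depth)
    model.fit(x_train, y_train, epochs=10, batch_size=64,
              validation_split=0.1, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    results[depth] = acc

for depth, acc in results.items():
    print(f"{depth} layers: test accuracy = {acc:.4f}")
```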
Model performance was evaluated using the following metrics (illustrated in the sketch after this list):
- Classification accuracy
- Weighted F1-score
- Confusion matrices
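The sketch below shows one way these metrics could be computed with scikit-learn, assuming `model`, `x_test`, and `y_test` from the training sketch above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Convert per-class probabilities into hard class predictions.
y_pred = np.argmax(model.predict(x_test), axis=-1)

accuracy = accuracy_score(y_test, y_pred)
weighted_f1 = f1_score(y_test, y_pred, average="weighted")
cm = confusion_matrix(y_test, y_pred)   # 46 x 46 matrix of true vs. predicted counts

print(f"accuracy = {accuracy:.4f}, weighted F1 = {weighted_f1:.4f}")
```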
| Encoder layers | Accuracy | Weighted F1-score |
|---|---|---|
| 3 | 0.7511 | 0.7461 |
| 5 | 0.7427 | 0.7377 |
| 7 | 0.7235 | 0.7156 |
Best-performing model: Transformer with 3 encoder layers
Increasing the depth beyond three layers reduced both accuracy and F1-score, suggesting overfitting and optimization difficulties for deeper architectures on a dataset of this size.
The Transformer models were compared with previously implemented sequence models:
- Simple RNN
- LSTM
- GRU
- Bidirectional RNN variants
The Transformer architecture demonstrated competitive and often superior F1-scores, highlighting its ability to capture contextual relationships more effectively than traditional recurrent models.
The project generates:
- Confusion matrices for each Transformer configuration
- F1-score comparison plots between RNN-based models and Transformer models
These visualizations provide deeper insight into class-wise performance and overall model behavior.
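For example, a confusion-matrix heatmap could be rendered roughly as follows with Seaborn, using the `cm` matrix from the evaluation sketch above; the figure size, colour map, and output filename are arbitrary choices.

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 10))
sns.heatmap(cm, cmap="Blues", cbar=True)   # rows = true class, columns = predicted class
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.title("Transformer encoder (3 layers) - Reuters confusion matrix")
plt.tight_layout()
plt.savefig("confusion_matrix_3_layers.png")
```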
Technologies used:
- Python
- TensorFlow / Keras
- NumPy
- Matplotlib and Seaborn
- Scikit-learn
- Reuters Newswire Dataset
- Training samples: 8,982
- Test samples: 2,246
- Number of categories: 46
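As a quick sanity check, the split sizes and label count above can be reproduced straight from the Keras loader, and an article can be decoded back to words; the offset of 3 follows the loader's convention of reserving indices 0-2 for padding, sequence-start, and unknown tokens.

```python
from tensorflow.keras.datasets import reuters

# Default test_split=0.2 yields the 8,982 / 2,246 split quoted above.
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=10_000)
print(len(x_train), len(x_test), max(y_train) + 1)   # 8982 2246 46

# Decode one article back to text (indices 0-2 are reserved tokens).
word_index = reuters.get_word_index()
index_to_word = {i + 3: w for w, i in word_index.items()}
print(" ".join(index_to_word.get(i, "?") for i in x_train[0])[:200])
```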
Key learnings:
- Transformer architectures are highly effective for text classification tasks
- Deeper models do not always guarantee better performance
- Optimal depth depends on dataset size and complexity
- Weighted F1-score is more informative than raw accuracy for imbalanced multi-class problems such as Reuters
- Hyperparameter tuning for attention heads and embedding size
- Integration of pretrained word embeddings
- Learning rate scheduling and regularization
- Fine-tuning pretrained language models such as BERT
Aadithya K L
This project focuses on understanding model behavior and architectural trade-offs rather than treating deep learning as a black box.