A complete text summarization project built from scratch in PyTorch, featuring a custom transformer architecture with scaled dot-product attention and multi-head attention. The project supports both training from scratch and using pretrained models from Hugging Face.
- **Custom Transformer Architecture**: implemented from scratch with:
  - Scaled Dot-Product Attention
  - Multi-Head Attention
  - Positional Encoding
  - Feed-Forward Networks
  - Layer Normalization and Residual Connections
- **Flexible Training Options**:
  - Train from scratch with the custom transformer
  - Use pretrained models from Hugging Face
  - Easy switching between modes
- **Multiple Datasets**: support for various summarization datasets:
  - XSum (BBC articles)
  - CNN/DailyMail (news articles)
  - BillSum (US Congressional bills)
- **Comprehensive Evaluation**: built-in BLEU and ROUGE score calculation
- **Google Colab Ready**: optimized for Colab with reasonable dataset sizes
```
Text-Summarization/
├── tokenizer.py       # Custom tiktoken-based tokenizer
├── transformer.py     # Transformer architecture implementation
├── train.py           # Training and testing functions
├── main.py            # Main script with dataset loading and training loop
├── requirements.txt   # Python dependencies
└── README.md          # This file
```
- Clone or download the project files.
- Install the dependencies:

```bash
pip install -r requirements.txt
```

For Google Colab:

```bash
!pip install torch torchvision torchaudio transformers datasets tiktoken tqdm
```
Use a pretrained model (recommended for quick results):

```bash
python main.py
```

Train from scratch:

```bash
python main.py --train_from_scratch
```

General usage:

```bash
python main.py [OPTIONS]
```

Options:

```
--train_from_scratch   Train the model from scratch instead of using a pretrained one
--dataset DATASET      Dataset name (xsum, cnn_dailymail, billsum) [default: cnn_dailymail]
--max_samples N        Maximum number of samples to use [default: 1000]
--batch_size N         Batch size for training [default: 8]
--num_epochs N         Number of training epochs [default: 5]
--learning_rate LR     Learning rate [default: 1e-4]
--d_model N            Model dimension [default: 256]
--num_heads N          Number of attention heads [default: 8]
--num_layers N         Number of transformer layers [default: 4]
--d_ff N               Feed-forward dimension [default: 4 * d_model]
--device DEVICE        Device to use (cuda, cpu, auto) [default: auto]
```
Train a small model on the XSum dataset:

```bash
python main.py --train_from_scratch --dataset xsum --max_samples 500 --num_epochs 3 --d_model 128 --num_layers 2
```

Use a pretrained model with the CNN/DailyMail dataset:

```bash
python main.py --dataset cnn_dailymail --max_samples 2000
```

Train on CPU with a smaller batch size:

```bash
python main.py --train_from_scratch --device cpu --batch_size 4 --num_epochs 2
```
The custom architecture is built from the following components.

**Scaled Dot-Product Attention** (`ScaledDotProductAttention`):
- Implements the attention mechanism from "Attention Is All You Need"
- Handles attention masks for padding tokens
- Includes dropout for regularization
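A minimal sketch of what this module can look like; the project's actual implementation lives in `transformer.py`, and the argument names and shapes here are illustrative:

```python
import math
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with optional mask and dropout."""
    def __init__(self, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Padding positions (mask == 0) are pushed to -inf before the softmax
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        return torch.matmul(attn, v), attn
```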
**Multi-Head Attention** (`MultiHeadAttention`):
- Splits attention into multiple heads
- Concatenates the outputs from all heads
- Linear projections for the Q, K, V matrices
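A self-contained sketch of the same idea, again with illustrative names rather than the exact code from `transformer.py`:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Projects Q, K, V, splits them into heads, attends per head, then concatenates."""
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, s, _ = x.shape
        return x.view(b, s, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q, k, v = self._split(self.w_q(query)), self._split(self.w_k(key)), self._split(self.w_v(value))
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = torch.matmul(attn, v)                                  # (batch, heads, seq, d_head)
        out = out.transpose(1, 2).contiguous().view(query.size(0), -1, self.num_heads * self.d_head)
        return self.w_o(out)                                         # concatenate heads, project back
```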
**Transformer Block** (`TransformerBlock`):
- Combines multi-head attention and a feed-forward network
- Residual connections and layer normalization
- Dropout for regularization
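A sketch of the block structure; for brevity it uses PyTorch's built-in `nn.MultiheadAttention` in place of the project's custom attention module:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Self-attention + feed-forward, each wrapped with a residual connection and LayerNorm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Built-in multi-head attention stands in for the custom module here
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))    # residual + layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))  # residual + layer norm
        return x
```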
**Positional Encoding** (`PositionalEncoding`):
- Sinusoidal positional encodings
- Added to token embeddings
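A sketch of the standard sinusoidal encoding (assuming an even `d_model`):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds the sinusoidal position signal from the original Transformer paper to embeddings."""
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)             # assumes even d_model
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions
        return self.dropout(x + self.pe[:, : x.size(1)])
```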
The full model combines these pieces:
- Embedding Layer: token embeddings with positional encoding
- Transformer Blocks: stack of transformer layers
- Output Projection: linear layer to vocabulary size
- Weight Initialization: Xavier uniform initialization
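Roughly, the assembly can be sketched as below, reusing the `TransformerBlock` and `PositionalEncoding` sketches above; the class name and exact wiring in `transformer.py` may differ:

```python
import torch
import torch.nn as nn

class SummarizationTransformer(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, d_model=256, num_heads=8, num_layers=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)      # token embeddings
        self.pos_encoding = PositionalEncoding(d_model)         # sketched above
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.output_proj = nn.Linear(d_model, vocab_size)       # project back to the vocabulary
        # Xavier uniform initialization for all weight matrices
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, input_ids, key_padding_mask=None):
        x = self.pos_encoding(self.embedding(input_ids))
        for block in self.blocks:
            x = block(x, key_padding_mask=key_padding_mask)
        return self.output_proj(x)                              # (batch, seq_len, vocab_size)
```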
Three datasets are supported:

**XSum**: BBC articles with single-sentence summaries
- Good for abstractive summarization
- ~200k samples (limited to `--max_samples` for faster training)

**CNN/DailyMail**: news articles with multi-sentence highlights
- Good for extractive summarization
- ~300k samples

**BillSum**: US Congressional bills with summaries
- Legal-domain text
- ~20k samples
The `load_and_prepare_dataset()` function:
- Loads datasets from Hugging Face
- Filters out very short/long texts
- Limits the dataset size for faster training
- Returns clean text-summary pairs
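A sketch of what such a loader can look like; the real function is in `main.py`, and the column mapping, defaults, and dataset configs below are assumptions that may need adjusting for your `datasets` version:

```python
from datasets import load_dataset

# Column names differ per dataset; this mapping mirrors the three supported options.
FIELDS = {
    "xsum": ("document", "summary"),
    "cnn_dailymail": ("article", "highlights"),
    "billsum": ("text", "summary"),
}

def load_and_prepare_dataset(name="cnn_dailymail", max_samples=1000, min_len=50, max_len=5000):
    text_col, summary_col = FIELDS[name]
    # cnn_dailymail needs a config version; the other two do not
    raw = load_dataset(name, "3.0.0") if name == "cnn_dailymail" else load_dataset(name)
    pairs = []
    for example in raw["train"]:
        text, summary = example[text_col], example[summary_col]
        if min_len <= len(text) <= max_len and len(summary) > 0:   # filter very short/long texts
            pairs.append({"text": text, "summary": summary})
        if len(pairs) >= max_samples:                              # cap size for faster training
            break
    return pairs
```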
When training from scratch, the pipeline runs in three stages.

**Data Preparation**:
- Tokenize texts and summaries
- Create train/validation/test splits
- Pad sequences to the same length
**Model Training**:
- AdamW optimizer with weight decay
- Learning rate scheduling
- Gradient clipping
- Early stopping based on validation loss
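A condensed, illustrative version of this loop is sketched below; the project's actual `train_model()` in `train.py` also takes the tokenizer and records a training history, and the batch format, pad id, and scheduler choice here are assumptions:

```python
import torch
import torch.nn as nn

def train_loop(model, train_loader, val_loader, num_epochs=5, learning_rate=1e-4,
               device="cuda", save_path="transformer_summarizer.pt", patience=2):
    model.to(device)
    criterion = nn.CrossEntropyLoss(ignore_index=0)                  # assumes pad id 0
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    best_val, bad_epochs = float("inf"), 0

    for epoch in range(num_epochs):
        model.train()
        for input_ids, labels in train_loader:                       # assumes (input_ids, labels) batches
            input_ids, labels = input_ids.to(device), labels.to(device)
            logits = model(input_ids)
            loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
            optimizer.step()
        scheduler.step()

        # Validation + early stopping on the average validation loss
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for input_ids, labels in val_loader:
                input_ids, labels = input_ids.to(device), labels.to(device)
                logits = model(input_ids)
                val_loss += criterion(logits.view(-1, logits.size(-1)), labels.view(-1)).item()
        val_loss /= len(val_loader)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), save_path)                # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                                                # early stopping
```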
**Evaluation**:
- BLEU score calculation
- ROUGE score calculation
- Sample generation for qualitative evaluation
Using the pretrained option (the default, without `--train_from_scratch`), the project:
- Loads BART-large-CNN from Hugging Face
- Is ready to use for inference
- Requires no training
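For reference, the same checkpoint can be exercised standalone via the `transformers` pipeline API (independent of this project's `main.py`):

```python
from transformers import pipeline

# facebook/bart-large-cnn is the checkpoint referenced above
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "The quick brown fox jumps over the lazy dog. " * 20
print(summarizer(article, max_length=60, min_length=10, do_sample=False)[0]["summary_text"])
```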
**BLEU**:
- Measures n-gram overlap between generated and reference summaries
- Includes a brevity penalty
- Simplified implementation for demonstration
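A simplified sentence-level BLEU along these lines (illustrative, not the exact code shipped in the project):

```python
import math
from collections import Counter

def simple_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions with a brevity penalty (simplified BLEU)."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())            # clipped n-gram matches
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * geo_mean
```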
**ROUGE-L**:
- Measures the longest common subsequence
- Calculates precision, recall, and F1-score
- Simplified ROUGE-L implementation
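A simplified ROUGE-L in the same spirit (again illustrative):

```python
def simple_rouge_l(reference: str, candidate: str) -> dict:
    """ROUGE-L via longest common subsequence: precision, recall, and F1 (simplified)."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming LCS length
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(cand)]
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```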
You can customize the transformer architecture:

```python
model = create_transformer_model(
    vocab_size=tokenizer.get_vocab_size(),
    d_model=512,     # Model dimension
    num_heads=8,     # Number of attention heads
    num_layers=6,    # Number of transformer layers
    d_ff=2048        # Feed-forward dimension
)
```
Adjust the training hyperparameters:

```python
history = train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    tokenizer=tokenizer,
    num_epochs=10,          # Number of epochs
    learning_rate=1e-4,     # Learning rate
    device=device,
    save_path='model.pt'    # Model save path
)
```
- Upload all project files to Colab.
- Install the dependencies:

```bash
!pip install torch torchvision torchaudio transformers datasets tiktoken tqdm
```

- Run with smaller parameters for faster training:

```bash
!python main.py --train_from_scratch --max_samples 500 --batch_size 4 --num_epochs 3 --d_model 128
```
- Use smaller model dimensions (`--d_model 128`)
- Reduce the number of samples (`--max_samples 500`)
- Use smaller batch sizes (`--batch_size 4`)
- Limit training epochs (`--num_epochs 3`)
After training, the following files are written:
- `transformer_summarizer.pt`: trained model weights
- `training_history.json`: training metrics and loss curves
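These files can be inspected afterwards, for example as below; the exact keys in the history file depend on what `train.py` records:

```python
import json
import torch

# Inspect the training history written during training
with open("training_history.json") as f:
    history = json.load(f)
print(list(history))  # e.g. per-epoch train/validation losses (exact keys depend on train.py)

# Reload the trained weights into a model built with the same hyperparameters
state_dict = torch.load("transformer_summarizer.pt", map_location="cpu")
print(f"{len(state_dict)} parameter tensors saved")
# model.load_state_dict(state_dict)  # once `model` has been re-created
```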
Example evaluation output:

```
Sample 1:
Original Text: The quick brown fox jumps over the lazy dog...
Reference Summary: A fox jumps over a dog.
Generated Summary: A brown fox jumps over a lazy dog.
BLEU Score: 0.7500
ROUGE Score: 0.8000
```
**CUDA Out of Memory**:
- Reduce the batch size: `--batch_size 4`
- Use a smaller model: `--d_model 128 --num_layers 2`
- Use the CPU: `--device cpu`

**Slow Training**:
- Reduce the dataset size: `--max_samples 500`
- Use fewer epochs: `--num_epochs 2`
- Use the pretrained model instead

**Poor Results**:
- Increase the model size: `--d_model 512 --num_layers 6`
- Train for more epochs: `--num_epochs 10`
- Use a larger dataset: `--max_samples 2000`
Approximate memory requirements:
- Training from scratch: ~2-4 GB GPU memory
- Pretrained model: ~1-2 GB GPU memory
- CPU training: ~1-2 GB RAM
Feel free to contribute by:
- Adding new datasets
- Improving evaluation metrics
- Optimizing model architecture
- Adding new features
This project is open source and available under the MIT License.
Acknowledgments:
- Original Transformer paper: "Attention Is All You Need"
- Hugging Face for pretrained models and datasets
- The PyTorch team for the deep learning framework
- tiktoken for efficient tokenization
Happy Summarizing! 🎉