
Text Summarization with PyTorch

A complete text summarization project built from scratch in PyTorch, featuring a custom transformer architecture with scaled dot-product attention and multi-head attention. The project supports both training from scratch and using pretrained models from Hugging Face.

🚀 Features

  • Custom Transformer Architecture: Implemented from scratch with:

    • Scaled Dot-Product Attention
    • Multi-Head Attention
    • Positional Encoding
    • Feed-Forward Networks
    • Layer Normalization and Residual Connections
  • Flexible Training Options:

    • Train from scratch with custom transformer
    • Use pretrained models from Hugging Face
    • Easy switching between modes
  • Multiple Datasets: Support for various summarization datasets:

    • XSum (BBC articles)
    • CNN/DailyMail
    • BillSum (US Congressional bills)
  • Comprehensive Evaluation: Built-in BLEU and ROUGE score calculation

  • Google Colab Ready: Optimized for Colab with reasonable dataset sizes

📁 Project Structure

Text-Summarization/
├── tokenizer.py          # Custom tiktoken-based tokenizer
├── transformer.py        # Transformer architecture implementation
├── train.py             # Training and testing functions
├── main.py              # Main script with dataset loading and training loop
├── requirements.txt     # Python dependencies
└── README.md           # This file
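
tokenizer.py wraps tiktoken. The wrapper below is only a minimal illustration of that idea; the class name, encoding choice, and padding id are assumptions, not necessarily the project's actual implementation:

import tiktoken

class TiktokenTokenizer:
    # Thin wrapper around a tiktoken encoding with fixed-length padding.
    def __init__(self, encoding_name='gpt2', pad_id=0):
        self.enc = tiktoken.get_encoding(encoding_name)
        self.pad_id = pad_id                          # assumed padding id, not necessarily the project's

    def get_vocab_size(self):
        return self.enc.n_vocab

    def encode(self, text, max_len=256):
        ids = self.enc.encode(text)[:max_len]
        # Pad every sequence to the same length so batches can be stacked into tensors.
        return ids + [self.pad_id] * (max_len - len(ids))

    def decode(self, ids):
        return self.enc.decode([i for i in ids if i != self.pad_id])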

🛠️ Installation

  1. Clone or download the project files

  2. Install dependencies:

    pip install -r requirements.txt
  3. For Google Colab:

    !pip install torch torchvision torchaudio transformers datasets tiktoken tqdm

🎯 Usage

Quick Start

Use pretrained model (recommended for quick results):

python main.py

Train from scratch:

python main.py --train_from_scratch

Command Line Options

python main.py [OPTIONS]

Options:
  --train_from_scratch    Train model from scratch instead of using pretrained
  --dataset DATASET       Dataset name (xsum, cnn_dailymail, billsum) [default: cnn_dailymail]
  --max_samples N         Maximum number of samples to use [default: 1000]
  --batch_size N          Batch size for training [default: 8]
  --num_epochs N          Number of training epochs [default: 5]
  --learning_rate LR      Learning rate [default: 1e-4]
  --d_model N             Model dimension [default: 256]
  --num_heads N           Number of attention heads [default: 8]
  --num_layers N          Number of transformer layers [default: 4]
  --d_ff N                Feed-forward dimension [default: 4 * d_model]
  --device DEVICE         Device to use (cuda, cpu, auto) [default: auto]

Examples

Train a small model on XSum dataset:

python main.py --train_from_scratch --dataset xsum --max_samples 500 --num_epochs 3 --d_model 128 --num_layers 2

Use pretrained model with CNN/DailyMail dataset:

python main.py --dataset cnn_dailymail --max_samples 2000

Train on CPU with smaller batch size:

python main.py --train_from_scratch --device cpu --batch_size 4 --num_epochs 2

🏗️ Architecture Details

Transformer Components

  1. Scaled Dot-Product Attention (ScaledDotProductAttention):

    • Implements the attention mechanism from "Attention Is All You Need" (a minimal sketch follows this list)
    • Handles attention masks for padding tokens
    • Includes dropout for regularization
  2. Multi-Head Attention (MultiHeadAttention):

    • Splits attention into multiple heads
    • Concatenates outputs from all heads
    • Linear projections for Q, K, V matrices
  3. Transformer Block (TransformerBlock):

    • Combines multi-head attention and feed-forward network
    • Residual connections and layer normalization
    • Dropout for regularization
  4. Positional Encoding (PositionalEncoding):

    • Sinusoidal positional encodings
    • Added to token embeddings
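
The actual classes live in transformer.py. As a rough illustration of the attention core (item 1), a minimal scaled dot-product attention in PyTorch could look like the following; the class name matches the list above, but the exact signature in the repository may differ:

import math
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with an optional mask and dropout.
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Block padding (or future) positions before the softmax.
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        return torch.matmul(weights, v), weights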

Model Architecture

  • Embedding Layer: Token embeddings with positional encoding
  • Transformer Blocks: Stack of transformer layers
  • Output Projection: Linear layer to vocabulary size
  • Weight Initialization: Xavier uniform initialization
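
A compressed sketch of how these pieces fit together is shown below. It substitutes PyTorch's built-in nn.TransformerEncoderLayer for the project's custom TransformerBlock, so it only illustrates the data flow (embedding, positional encoding, transformer stack, output projection, Xavier init) rather than reproducing transformer.py:

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # shape (1, max_len, d_model)

    def forward(self, x):
        # Add the fixed sinusoidal encoding to the token embeddings.
        return x + self.pe[:, :x.size(1)]

class SummarizerSketch(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_heads=8, num_layers=4, d_ff=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # embedding layer
        self.pos = SinusoidalPositionalEncoding(d_model)         # positional encoding
        block = nn.TransformerEncoderLayer(d_model, num_heads, d_ff, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers)   # stack of transformer layers
        self.out = nn.Linear(d_model, vocab_size)                # output projection to vocabulary size
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)                       # Xavier uniform initialization

    def forward(self, token_ids):
        return self.out(self.blocks(self.pos(self.embed(token_ids))))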

📊 Datasets

Supported Datasets

  1. XSum: BBC articles with single-sentence summaries

    • Good for abstractive summarization
    • ~200k samples (limited to max_samples for faster training)
  2. CNN/DailyMail: News articles with multi-sentence highlights

    • Good for extractive summarization
    • ~300k samples
  3. BillSum: US Congressional bills with summaries

    • Legal domain text
    • ~20k samples

Dataset Loading

The load_and_prepare_dataset() function:

  • Loads datasets from Hugging Face
  • Filters out very short/long texts
  • Limits dataset size for faster training
  • Returns clean text-summary pairs
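
load_and_prepare_dataset() is defined in the project itself; the snippet below is only a rough approximation of what such a function does with the Hugging Face datasets library. The dataset identifiers and column names are the standard ones on the Hub, while the length thresholds and function name are illustrative:

from datasets import load_dataset

def load_pairs(name='cnn_dailymail', max_samples=1000):
    # Column names follow the standard Hugging Face Hub versions of these datasets.
    if name == 'cnn_dailymail':
        ds = load_dataset('cnn_dailymail', '3.0.0', split='train')
        text_col, summary_col = 'article', 'highlights'
    elif name == 'xsum':
        ds = load_dataset('xsum', split='train')   # may require trust_remote_code on newer datasets versions
        text_col, summary_col = 'document', 'summary'
    else:
        ds = load_dataset('billsum', split='train')
        text_col, summary_col = 'text', 'summary'
    # Drop very short or very long texts, then cap the sample count (thresholds are illustrative).
    ds = ds.filter(lambda ex: 100 < len(ex[text_col]) < 5000)
    ds = ds.select(range(min(max_samples, len(ds))))
    return [(ex[text_col], ex[summary_col]) for ex in ds]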

🎓 Training Process

From Scratch Training

  1. Data Preparation:

    • Tokenize texts and summaries
    • Create train/validation/test splits
    • Pad sequences to same length
  2. Model Training (a simplified loop is sketched after this list):

    • AdamW optimizer with weight decay
    • Learning rate scheduling
    • Gradient clipping
    • Early stopping based on validation loss
  3. Evaluation:

    • BLEU score calculation
    • ROUGE score calculation
    • Sample generation for qualitative evaluation
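
train.py is the authoritative implementation. The loop below is only a simplified, single-input sketch of step 2 (the repository's model is presumably fed source and target sequences separately); it shows the AdamW optimizer, learning-rate scheduling, gradient clipping, and early stopping on validation loss mentioned above:

import torch
import torch.nn as nn

def train_sketch(model, train_loader, val_loader, device, num_epochs=5, learning_rate=1e-4, patience=2):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # assumes 0 is the padding id
    best_val, bad_epochs = float('inf'), 0
    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            loss = loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
            optimizer.step()
        scheduler.step()                                              # learning-rate scheduling
        # Early stopping based on validation loss.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in val_loader:
                out = model(inputs.to(device))
                val_loss += loss_fn(out.view(-1, out.size(-1)), targets.to(device).view(-1)).item()
        val_loss /= max(len(val_loader), 1)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), 'transformer_summarizer.pt')
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break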

Pretrained Model Usage

  • Loads BART-large-CNN from Hugging Face
  • Ready-to-use for inference
  • No training required
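
In pretrained mode the project relies on the standard Hugging Face pipeline. Outside the project, the same behaviour can be reproduced in a few lines:

from transformers import pipeline

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
article = "..."   # paste any long news article here
result = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(result[0]['summary_text'])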

📈 Evaluation Metrics

BLEU Score

  • Measures n-gram overlap between generated and reference summaries
  • Includes brevity penalty
  • Simplified implementation for demonstration
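
Since the project's implementation is itself simplified, the snippet below is only a comparable toy version (clipped n-gram precisions, geometric mean, brevity penalty), not the exact code from train.py:

import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=4):
    # Clipped n-gram precisions, geometric mean, and brevity penalty (single reference).
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())   # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)        # smoothed to avoid log(0)
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)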

ROUGE Score

  • Measures longest common subsequence
  • Calculates precision, recall, and F1-score
  • Simplified ROUGE-L implementation
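
Again as a toy stand-in rather than the project's exact code, an LCS-based ROUGE-L can be computed like this:

def simple_rouge_l(candidate, reference):
    # ROUGE-L: precision, recall, and F1 from the longest common subsequence of tokens.
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
    precision, recall = lcs / len(c), lcs / len(r)
    return {'precision': precision, 'recall': recall, 'f1': 2 * precision * recall / (precision + recall)}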

🔧 Customization

Model Parameters

You can customize the transformer architecture:

model = create_transformer_model(
    vocab_size=tokenizer.get_vocab_size(),
    d_model=512,        # Model dimension
    num_heads=8,        # Number of attention heads
    num_layers=6,        # Number of transformer layers
    d_ff=2048          # Feed-forward dimension
)

Training Parameters

Adjust training hyperparameters:

history = train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    tokenizer=tokenizer,
    num_epochs=10,           # Number of epochs
    learning_rate=1e-4,     # Learning rate
    device=device,
    save_path='model.pt'    # Model save path
)

🚀 Google Colab Usage

Setup

  1. Upload all project files to Colab

  2. Install dependencies:

    !pip install torch torchvision torchaudio transformers datasets tiktoken tqdm
  3. Run with smaller parameters for faster training:

    !python main.py --train_from_scratch --max_samples 500 --batch_size 4 --num_epochs 3 --d_model 128

Colab Optimizations

  • Use smaller model dimensions (--d_model 128)
  • Reduce number of samples (--max_samples 500)
  • Use smaller batch sizes (--batch_size 4)
  • Limit training epochs (--num_epochs 3)

📝 Output Files

Training from Scratch

  • transformer_summarizer.pt: Trained model weights
  • training_history.json: Training metrics and loss curves
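
How these files might be written at the end of training, assuming train_model() returns a dict of per-epoch metrics (the real keys may differ):

import json
import torch

def save_outputs(model, history):
    torch.save(model.state_dict(), 'transformer_summarizer.pt')   # trained model weights
    with open('training_history.json', 'w') as f:
        json.dump(history, f, indent=2)                           # training metrics and loss curves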

Sample Output

Sample 1:
Original Text: The quick brown fox jumps over the lazy dog...
Reference Summary: A fox jumps over a dog.
Generated Summary: A brown fox jumps over a lazy dog.
BLEU Score: 0.7500
ROUGE Score: 0.8000

🐛 Troubleshooting

Common Issues

  1. CUDA Out of Memory:

    • Reduce batch size: --batch_size 4
    • Use smaller model: --d_model 128 --num_layers 2
    • Use CPU: --device cpu
  2. Slow Training:

    • Reduce dataset size: --max_samples 500
    • Use fewer epochs: --num_epochs 2
    • Use pretrained model instead
  3. Poor Results:

    • Increase model size: --d_model 512 --num_layers 6
    • Train for more epochs: --num_epochs 10
    • Use larger dataset: --max_samples 2000

Memory Requirements

  • Training from scratch: ~2-4GB GPU memory
  • Pretrained model: ~1-2GB GPU memory
  • CPU training: ~1-2GB RAM

🤝 Contributing

Feel free to contribute by:

  • Adding new datasets
  • Improving evaluation metrics
  • Optimizing model architecture
  • Adding new features

📄 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

  • Original Transformer paper: "Attention Is All You Need"
  • Hugging Face for pretrained models and datasets
  • PyTorch team for the deep learning framework
  • Tiktoken for efficient tokenization

Happy Summarizing! 🎉
