News Headlines Topic Classifier

A PyTorch-based multiclass text classification system that predicts news headline topics using deep learning techniques. This project demonstrates end-to-end machine learning pipeline development, from data collection to model deployment.

🎯 Project Overview

This project builds a neural network classifier to categorize news headlines into topics such as politics, technology, business, and sports. The implementation focuses on educational value and clean code practices, making it suitable for portfolio demonstration and learning purposes.

Key Features

Multi-source data collection from RSS feeds (BBC, Reuters)
Clean, modular PyTorch implementation with proper abstractions
Flexible model architecture supporting both mean pooling and GRU-based approaches
Comprehensive evaluation with metrics and visualizations
Production-ready code following PEP 8 and PEP 257 standards
Extensible design for future enhancements

🏗️ Project Structure

news_topic_classifier/
├── data/                          # Data storage
│   ├── raw/                       # Raw scraped data
│   └── processed/                 # Cleaned, processed data
├── notebooks/                     # Jupyter notebooks
│   └── data_collection.ipynb      # Data collection and exploration
├── src/                          # Source code
│   ├── model.py                  # PyTorch model definitions
│   ├── train.py                  # Training script and evaluation
│   └── utils.py                  # Utility functions
├── outputs/                      # Model artifacts and results
│   ├── model.pth                 # Trained model weights
│   ├── artifacts.pkl             # Training artifacts
│   ├── config.json               # Model configuration
│   ├── vocabulary.json           # Vocabulary mappings
│   ├── training_history.png      # Training plots
│   └── confusion_matrix.png      # Evaluation visualizations
├── README.md                     # Project documentation
├── requirements.txt              # Python dependencies
└── .gitignore                   # Git ignore rules

🚀 Quick Start

Prerequisites

Python 3.8 or higher
Git

Installation

Clone the repository

git clone https://github.com/yourusername/news_topic_classifier.git
cd news_topic_classifier

Create a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies
```
pip install -r requirements.txt
```

Create necessary directories

mkdir -p data/raw data/processed outputs

Usage

1. Data Collection

Run the data collection notebook to scrape headlines from RSS feeds:

jupyter notebook notebooks/data_collection.ipynb

Or create sample data for testing:

from src.utils import create_sample_data
create_sample_data('data/processed/headlines.csv', num_samples_per_topic=200)

2. Train the Model

Train the classifier with default settings:

cd src
python train.py

The training script will:

Load and preprocess the data
Create vocabulary and encode labels
Train the model with early stopping
Generate evaluation metrics and plots
Save model artifacts to outputs/

3. Model Inference

Use the trained model for predictions:

import torch
from src.model import NewsHeadlineClassifier
from src.utils import load_model_artifacts, predict_single_headline

# Load trained model
artifacts, vocab_data = load_model_artifacts('outputs/')

# Create model instance
model = NewsHeadlineClassifier(
    vocab_size=len(artifacts['vocab']),
    embedding_dim=artifacts['config']['embedding_dim'],
    hidden_dim=artifacts['config']['hidden_dim'],
    num_classes=len(artifacts['unique_topics']),
    use_gru=artifacts['config']['use_gru']
)

# Load trained weights
model.load_state_dict(torch.load('outputs/model.pth'))

# Make prediction
headline = "Government announces new economic stimulus package"
topic, confidence, probabilities = predict_single_headline(
    headline, model, vocab_data['word_to_idx'], artifacts['unique_topics']
)

print(f"Predicted topic: {topic} (confidence: {confidence:.3f})")

🧠 Model Architecture

The classifier supports two main architectures:

1. Mean Pooling Architecture

Input Headlines → Embedding Layer → Mean Pooling → Dropout → Linear Classifier

2. GRU-based Architecture

Input Headlines → Embedding Layer → Bidirectional GRU → Dropout → Linear Classifier

Model Components

Embedding Layer: Converts token IDs to dense vectors
Feature Extraction: Either mean pooling or GRU-based sequence modeling
Classification Head: Linear layer with softmax activation
Regularization: Dropout and L2 weight decay

Default Configuration

config = {
    'embedding_dim': 128,
    'hidden_dim': 128,
    'batch_size': 32,
    'learning_rate': 1e-3,
    'max_len': 50,
    'dropout_rate': 0.3,
    'use_gru': True
}

📊 Performance

The model achieves strong performance on balanced datasets:

Training Time: ~5-10 minutes on CPU for 4 topics
Memory Usage: ~50MB for model and vocabulary
Inference Speed: ~100 headlines/second on CPU

Sample Results

Topic	Precision	Recall	F1-Score
Politics	0.92	0.89	0.90
Technology	0.88	0.91	0.89
Business	0.87	0.85	0.86
Sport	0.94	0.96	0.95
Average	0.90	0.90	0.90

🛠️ Customization

Adding New Topics

Update RSS feeds in notebooks/data_collection.ipynb
Collect data for new topics
Retrain the model with updated configuration

Model Modifications

Change Architecture: Set use_gru=False for mean pooling
Adjust Hyperparameters: Modify config in src/train.py
Add Features: Extend model class in src/model.py

Example: Custom Model Configuration

config = {
    'embedding_dim': 256,      # Larger embeddings
    'hidden_dim': 256,         # Larger hidden layer
    'batch_size': 64,          # Larger batches
    'learning_rate': 5e-4,     # Lower learning rate
    'use_gru': False,          # Use mean pooling
    'dropout_rate': 0.5        # Higher dropout
}

📈 Evaluation Metrics

The training pipeline provides comprehensive evaluation:

Automated Metrics

Accuracy: Overall classification accuracy
Precision/Recall/F1: Per-class performance metrics
Confusion Matrix: Visual error analysis
Training Curves: Loss and accuracy over time

Generated Visualizations

Training and validation loss curves
Validation accuracy progression
Confusion matrix heatmap
Topic-wise performance breakdown

🔧 Advanced Features

Early Stopping

Prevents overfitting with patience-based early stopping:

trainer.train(
    train_loader=train_loader,
    val_loader=val_loader,
    patience=7  # Stop if no improvement for 7 epochs
)

Gradient Clipping

Stabilizes training with gradient norm clipping:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Reproducible Results

Ensures consistent results with fixed random seeds:

torch.manual_seed(42)
np.random.seed(42)

🔮 Future Extensions

This project is designed for easy extension:

Planned Enhancements

Transformer Models: BERT/RoBERTa integration
Web Interface: Flask/FastAPI deployment
Real-time Classification: Live RSS feed processing
Multi-language Support: Extend to non-English headlines
Active Learning: Uncertainty-based data collection

Integration Ideas

REST API: Serve model via HTTP endpoints
Streamlit Dashboard: Interactive web interface
Docker Container: Containerized deployment
Cloud Deployment: AWS/GCP/Azure hosting

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Code Standards

Follow PEP 8 style guidelines
Include docstrings for all functions (PEP 257)
Add type hints where appropriate
Write unit tests for new features

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

RSS Feed Providers: BBC, Reuters for educational data access
PyTorch Team: For the excellent deep learning framework
Open Source Community: For the tools and libraries that made this possible

📞 Contact

Your Name - your.email@example.com

Project Link: https://github.com/yourusername/news_topic_classifier

Built with ❤️ for learning and demonstration purposes

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
notebooks		notebooks
outputs		outputs
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
demo.ipynb		demo.ipynb
evaluate.py		evaluate.py
inference.py		inference.py
requirements.txt		requirements.txt
setup.py		setup.py
setup_project.py		setup_project.py
test_basic.py		test_basic.py
web_scrape_demo.ipynb		web_scrape_demo.ipynb

License

JonesRobM/News-NLP

Folders and files

Latest commit

History

Repository files navigation