
SignalSeeker: Educational Binary Classification with XGBoost

SignalSeeker is a comprehensive, production-quality Python package that teaches how to build machine learning pipelines for binary classification using Boosted Decision Trees (XGBoost) in a signal/background separation context, inspired by particle physics applications.

Python 3.8+ | License: MIT | Code style: black

🎯 Overview

In particle physics experiments, signal refers to the desired particle interaction you're searching for, while background refers to all other processes that mimic the signal. The challenge is to build a classifier that:

  1. Identifies signal events with high efficiency (recall)
  2. Rejects background events with high purity (precision)
  3. Provides interpretable probability scores (BDT scores) for decision-making

SignalSeeker teaches all these concepts through a production-quality, fully-featured Python package.

✨ Key Features

✅ Complete ML Pipeline: Data generation → Preprocessing → Training → Tuning → Evaluation → Visualization

✅ Realistic Imbalanced Data: Generates synthetic datasets with ~90% background, ~10% signal (fully configurable)

✅ XGBoost Implementation: Industry-standard gradient boosted decision trees with full hyperparameter control

✅ Hyperparameter Tuning: Bayesian optimization using Optuna for automatic parameter discovery

✅ Comprehensive Metrics: Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR, Confusion Matrix

✅ Publication-Quality Visualizations:

  • Probability Score Distribution: Core visualization showing signal/background separation
  • ROC and Precision-Recall curves
  • Feature importance analysis
  • Confusion matrix heatmap
  • Summary dashboard

✅ GPU Acceleration: Full CUDA support for XGBoost (optional)

✅ Professional Package: Installable via pip, PyPI-ready, proper src/ layout

✅ Comprehensive Testing: Full pytest coverage with CI/CD

✅ Educational Comments: Detailed docstrings explaining why each step is performed

✅ Professional Code Quality: Type hints, OOP design, proper error handling, Black-formatted

📦 Installation

Option 1: Development Install (Recommended for Learning)

git clone https://github.com/Amidn/SignalSeeker.git
cd SignalSeeker
pip install -e .

This allows you to modify source files and see changes immediately.

Option 2: Regular Install from GitHub

pip install git+https://github.com/Amidn/SignalSeeker.git

Option 3: Install from PyPI (when published)

pip install signalseeker

Option 4: Install with Development Tools

pip install -e ".[dev]"

Includes testing, linting, and documentation tools.

Option 5: Install with GPU Support

pip install -e ".[gpu]"

Requires the CUDA Toolkit to be installed. See the GPU Acceleration Setup section below.

Option 6: Install from Requirements Files

# Core dependencies only
pip install -r requirements.txt

# With development tools
pip install -r requirements-dev.txt

# With GPU support
pip install -r requirements-gpu.txt

🚀 Quick Start

Run the Complete Pipeline

python -m signalseeker.main

Or after installation:

signal-seeker

This will:

  1. Generate synthetic imbalanced data (10,000 samples, 10% signal)
  2. Preprocess features (scaling, missing value handling)
  3. Build and train an XGBoost model
  4. Evaluate on validation and test sets
  5. Generate publication-quality visualizations
  6. Save all results to ./results/run_TIMESTAMP/
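
Step 1 can be pictured with a small standalone sketch. It uses only the standard library and is not the package's generator: the function name `make_toy_dataset`, the Gaussian feature model, and the unit shift between classes are all assumptions chosen for illustration.

```python
import random


def make_toy_dataset(n_samples=10_000, signal_fraction=0.10, n_features=3, seed=42):
    """Return (features, labels) for a toy signal/background problem.

    Background features are drawn from N(0, 1); signal features from N(1, 1).
    This is a hypothetical stand-in for the package's own data generator.
    """
    rng = random.Random(seed)
    n_signal = round(n_samples * signal_fraction)
    labels = [1] * n_signal + [0] * (n_samples - n_signal)
    rng.shuffle(labels)
    features = [
        [rng.gauss(1.0 if label == 1 else 0.0, 1.0) for _ in range(n_features)]
        for label in labels
    ]
    return features, labels


X, y = make_toy_dataset()
print(len(X), "events,", sum(y), "signal")
```

Because the two classes overlap in every feature, no single cut separates them perfectly, which is what makes the classification task non-trivial.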

With Hyperparameter Tuning (Better Performance, Slower)

signal-seeker --tune

Uses Bayesian optimization to find optimal hyperparameters (takes ~2-5 minutes).

Custom Configuration

signal-seeker --n-samples 50000 --signal-fraction 0.15 --output-dir ./my_results

Available options:

  • --tune: Enable hyperparameter tuning
  • --output-dir: Output directory for results
  • --n-samples: Total number of samples to generate
  • --signal-fraction: Fraction of signal samples (0-1)

Python API Usage

import copy

from signalseeker import SignalSeekerPipeline, DEFAULT_CONFIG

# Use default configuration
pipeline = SignalSeekerPipeline(DEFAULT_CONFIG)
results = pipeline.run(use_tuning=False)

# Or customize a copy, so the shared DEFAULT_CONFIG is not modified
config = copy.deepcopy(DEFAULT_CONFIG)
config.data.n_samples = 50000
config.data.signal_fraction = 0.15
config.xgboost.max_depth = 8
config.xgboost.learning_rate = 0.05

pipeline = SignalSeekerPipeline(config)
results = pipeline.run(use_tuning=True)

# Access results
val_auc = results["validation_results"]["all_metrics"]["auc_roc"]
test_auc = results["test_results"]["all_metrics"]["auc_roc"]
model = results["model"]

print(f"Validation AUC: {val_auc:.4f}")
print(f"Test AUC: {test_auc:.4f}")
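
A note on customizing configuration objects in general: binding a config to a new name does not copy it, so edits made through the new name also change the original. The sketch below shows this with hypothetical stand-in dataclasses (`ToyDataConfig`, `ToyPipelineConfig`); whether the package's real config classes are dataclasses is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDataConfig:  # hypothetical stand-in, not the package's class
    n_samples: int = 10_000


@dataclass
class ToyPipelineConfig:  # hypothetical stand-in, not the package's class
    data: ToyDataConfig = field(default_factory=ToyDataConfig)


DEFAULT = ToyPipelineConfig()

alias = DEFAULT                       # same object under a second name
alias.data.n_samples = 50_000
print(DEFAULT.data.n_samples)         # the "default" was modified as well

DEFAULT.data.n_samples = 10_000       # restore the default
independent = copy.deepcopy(DEFAULT)  # separate copy, including nested objects
independent.data.n_samples = 50_000
print(DEFAULT.data.n_samples)         # the default is left untouched
```

Working on a deep copy keeps later pipeline runs that rely on the defaults reproducible.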

📊 Understanding the Output

The Probability Score Distribution (Most Important Plot)

This is the fundamental visualization in signal/background separation:

Signal Distribution:  ░░░░░░░░░░░░████████████████
Background Dist:      ████████████████░░░░░░░░░░░░
                      0.0         Cut          1.0
                              (threshold)
  • Signal: Concentrated near 1.0 (high probability of being signal)
  • Background: Concentrated near 0.0 (low probability of being signal)
  • Cut: The threshold above which we classify events as signal

Good Model: Minimal overlap, clear separation
Poor Model: Heavy overlap, hard to separate

The score plotted here plays the same role as the "BDT score" or "discriminant" in particle physics.
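
The effect of moving the cut can be made concrete with a few lines of plain Python. This is a minimal sketch on made-up scores; the function name `efficiency_and_purity` and the toy numbers are illustrative assumptions, not package code.

```python
def efficiency_and_purity(labels, scores, cut):
    """Signal efficiency (recall) and purity (precision) for events with score >= cut."""
    selected = [label for label, score in zip(labels, scores) if score >= cut]
    n_signal_total = sum(labels)
    n_signal_selected = sum(selected)
    efficiency = n_signal_selected / n_signal_total if n_signal_total else 0.0
    purity = n_signal_selected / len(selected) if selected else 0.0
    return efficiency, purity


# Toy scores: 4 signal events (label 1) followed by 6 background events (label 0)
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.15, 0.10, 0.05]

for cut in (0.25, 0.50, 0.75):
    eff, pur = efficiency_and_purity(labels, scores, cut)
    print(f"cut={cut:.2f}  efficiency={eff:.2f}  purity={pur:.2f}")
```

On these numbers, raising the cut from 0.25 to 0.75 lowers the efficiency from 1.00 to 0.50 while the purity rises from 0.67 to 1.00: a tighter cut keeps fewer signal events but a cleaner sample.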

Other Key Metrics

| Metric | Meaning | Ideal Value |
| --- | --- | --- |
| Accuracy | (TP + TN) / Total | 1.0 (but misleading for imbalanced data) |
| Precision | TP / (TP + FP): of predicted signals, how many are true? | 1.0 |
| Recall (TPR) | TP / (TP + FN): of true signals, how many did we find? | 1.0 |
| F1-Score | Harmonic mean of precision and recall | 1.0 |
| AUC-ROC | Area under ROC curve (threshold-independent) | 1.0 |
| AUC-PR | Area under Precision-Recall curve (better for imbalanced data) | 1.0 |
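
The threshold-dependent metrics in the table follow directly from the four confusion-matrix counts. The sketch below computes them in plain Python for a degenerate "always background" classifier on a 90/10 dataset; the helper names are illustrative and not taken from the package's metrics module.

```python
def confusion_counts(labels, predictions):
    """Return (TP, TN, FP, FN) for binary labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(labels, predictions):
    """Accuracy, precision, recall and F1, with 0.0 where a ratio is undefined."""
    tp, tn, fp, fn = confusion_counts(labels, predictions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 90 background events, 10 signal events, and a model that never predicts signal
labels = [0] * 90 + [1] * 10
always_background = [0] * 100
print(summarize(labels, always_background))
```

The accuracy comes out at 0.9 even though recall is 0.0, which is why accuracy alone is a poor guide on imbalanced data.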

📁 Package Structure

SignalSeeker/
├── src/signalseeker/          # Main package code
│   ├── __init__.py            # Package initialization & exports
│   ├── config.py              # Configuration management
│   ├── data_loader.py         # Synthetic data generation
│   ├── preprocessor.py        # Feature scaling & normalization
│   ├── model_builder.py       # XGBoost model initialization
│   ├── trainer.py             # Training with early stopping
│   ├── tuner.py               # Bayesian hyperparameter optimization
│   ├── metrics.py             # Evaluation metrics
│   ├── visualizer.py          # Publication-quality plots
│   ├── utils.py               # Logging and utilities
│   └── main.py                # Pipeline orchestrator
├── tests/                     # Pytest test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_data_loader.py
│   ├── test_preprocessor.py
│   ├── test_model_builder.py
│   ├── test_trainer.py
│   ├── test_tuner.py
│   ├── test_metrics.py
│   └── test_visualizer.py
├── .github/workflows/         # CI/CD pipelines
│   ├── tests.yml              # Run tests on push/PR
│   └── publish.yml            # Auto-publish to PyPI on release
├── examples/                  # Example scripts and notebooks
│   ├── example_usage.py
│   └── custom_data_example.py
├── README.md                  # This file
├── CONTRIBUTING.md            # Contributing guidelines
├── LICENSE                    # MIT License
├── setup.py                   # Package installation script
├── pyproject.toml             # Modern Python packaging config
├── requirements.txt           # Core dependencies
├── requirements-dev.txt       # Development dependencies
├── requirements-gpu.txt       # GPU support dependencies
└── .pre-commit-config.yaml    # Code quality checks

🎓 Educational Concepts Covered

1. Imbalanced Classification

  • Why accuracy alone is misleading
  • Class weighting and scale_pos_weight
  • Precision vs Recall trade-offs
  • ROC and Precision-Recall curves
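
For the class weighting item, the XGBoost documentation suggests the ratio of negative to positive examples as a typical starting value for scale_pos_weight. A minimal sketch of that ratio follows; the function name `scale_pos_weight_from_labels` is hypothetical, and whether the package computes its weight in exactly this way is an assumption.

```python
from collections import Counter


def scale_pos_weight_from_labels(labels):
    """Ratio of background (0) to signal (1) counts, a common scale_pos_weight value."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("labels contain no signal (positive) examples")
    return counts[0] / counts[1]


# 90% background, 10% signal, as in the default synthetic dataset
labels = [0] * 9000 + [1] * 1000
print(scale_pos_weight_from_labels(labels))
```

With a 90/10 split the ratio is 9, so each signal event contributes to the loss gradient as much as nine background events.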

2. Boosting & Decision Trees

  • How gradient boosting works
  • Why boosting is effective for this problem
  • Feature importance in tree ensembles
  • Overfitting and regularization
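
The core loop of gradient boosting fits in a short standalone sketch: each round fits a small tree (here a one-split "stump" on a single feature) to the negative gradient of the log loss, then adds a damped copy of it to the running score. This is a deliberately simplified first-order version written for illustration; XGBoost additionally uses second-order information and regularization, and none of the names below come from the package.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs: one threshold, two leaf means."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, split, low, high = best
    return lambda x: low if x <= split else high


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Return a function x -> P(signal) built from n_rounds boosted stumps."""
    raw = [0.0] * len(xs)  # raw scores (log-odds); 0.0 means p = 0.5
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to the raw score is y - p
        residuals = [y - sigmoid(f) for y, f in zip(ys, raw)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        raw = [f + learning_rate * stump(x) for f, x in zip(raw, xs)]
    return lambda x: sigmoid(sum(learning_rate * s(x) for s in stumps))


def log_loss(ys, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, probs)) / len(ys)


xs = [0.1, 0.4, 0.5, 0.9, 1.1, 1.3, 1.6, 2.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

loss_few = log_loss(ys, [boost(xs, ys, n_rounds=1)(x) for x in xs])
loss_many = log_loss(ys, [boost(xs, ys, n_rounds=20)(x) for x in xs])
print(f"log loss after 1 round: {loss_few:.3f}, after 20 rounds: {loss_many:.3f}")
```

The training loss keeps falling as rounds are added, which is exactly why a held-out validation set is needed to decide when to stop.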

3. Cross-Validation & Early Stopping

  • Preventing overfitting
  • Validation-based model selection
  • Learning curves and training dynamics

4. Hyperparameter Tuning

  • Grid vs Random vs Bayesian search
  • Optuna for efficient optimization
  • Interpreting tuning results
  • Trade-offs between performance and training time
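
The difference between grid and random search can be seen on a toy objective. The sketch below uses a made-up scoring function, `toy_score`, as a stand-in for a validation AUC; it is purely hypothetical and only the search mechanics are the point.

```python
import itertools
import random


def toy_score(max_depth, learning_rate):
    """Hypothetical stand-in for a validation score; peaks at depth 6, rate 0.1."""
    return 1.0 - 0.01 * (max_depth - 6) ** 2 - 4.0 * (learning_rate - 0.1) ** 2


# Grid search: evaluate every combination of a fixed set of values
grid = list(itertools.product([2, 4, 6, 8], [0.01, 0.1, 0.3]))
best_grid = max(grid, key=lambda params: toy_score(*params))

# Random search: same budget of 12 trials, learning rate sampled log-uniformly
rng = random.Random(0)
samples = [(rng.randint(2, 8), 10 ** rng.uniform(-2, -0.5)) for _ in range(12)]
best_random = max(samples, key=lambda params: toy_score(*params))

print("grid  :", best_grid, round(toy_score(*best_grid), 4))
print("random:", best_random, round(toy_score(*best_random), 4))
```

Both strategies choose their trial points without looking at earlier results. Bayesian optimization, such as Optuna's default TPE sampler, instead uses the scores of completed trials to propose the next candidate, which tends to pay off when each trial is expensive.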

5. Model Evaluation

  • Multiple metrics for imbalanced data
  • ROC curves and operating points
  • Precision-Recall analysis
  • Threshold optimization

6. Signal/Background Separation

  • The "cut" concept
  • Probability score interpretation
  • Acceptance vs Purity trade-off
  • Real-world applications in physics

🖥️ Advanced Usage

Custom Data

import numpy as np
from signalseeker import DataPreprocessingPipeline, DataSplitter
from signalseeker import XGBoostModelBuilder, ModelTrainer
from signalseeker.config import PreprocessConfig, XGBoostConfig

# Load your own data
X = np.load("features.npy")
y = np.load("labels.npy")

# Split the data
splitter = DataSplitter(PreprocessConfig())
X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(X, y)

# Preprocess
preprocessor = DataPreprocessingPipeline(PreprocessConfig())
X_train, X_val, X_test = preprocessor.fit(X_train, X_val, X_test)

# Build and train model
builder = XGBoostModelBuilder(XGBoostConfig())
builder.build_with_class_weights(y_train)
trainer = ModelTrainer(builder)
model, results = trainer.train(X_train, y_train, X_val, y_val)

# Make predictions
predictions = model.predict_proba(X_test)[:, 1]

Hyperparameter Tuning Only

from signalseeker import HyperparameterTuner
from signalseeker.config import TunerConfig, XGBoostConfig

tuner = HyperparameterTuner(TunerConfig(), XGBoostConfig())
results = tuner.optimize(X_train, y_train, n_trials=50)

print(f"Best AUC: {results['best_score']:.4f}")
print(f"Best params: {results['best_params']}")

# Create model with best params
best_model_builder = tuner.create_model_from_best()

Visualization Only

from signalseeker import ModelVisualizer
from signalseeker.config import VisualizerConfig

viz = ModelVisualizer(VisualizerConfig())

# Probability distribution
viz.plot_probability_score_distribution(y_test, predictions)

# ROC curve
viz.plot_roc_curve(y_test, predictions)

# Feature importance
viz.plot_feature_importance(model, top_n=15)

📈 Performance Expectations

On the default 10,000 sample synthetic dataset:

| Metric | Without Tuning | With Tuning |
| --- | --- | --- |
| Test AUC-ROC | ~0.92 | ~0.95 |
| Test AUC-PR | ~0.88 | ~0.92 |
| Test F1-Score | ~0.80 | ~0.85 |
| Training Time | ~2 sec | ~3-5 min |

Tuning improves every metric on this synthetic dataset at the cost of a much longer run. Gains on real-world datasets depend on the data and can be larger.

🔧 GPU Acceleration Setup

Prerequisites

  1. NVIDIA GPU with CUDA Compute Capability 3.5+
  2. CUDA Toolkit 10.2+ (check with nvcc --version)
  3. cuDNN is not needed: XGBoost's GPU algorithms do not depend on it

Installation

# 1. Install CUDA Toolkit (from NVIDIA website for your OS)

# 2. Verify CUDA installation
nvcc --version

# 3. Install signalseeker with GPU support
pip install -e ".[gpu]"

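# OR manually (XGBoost's PyPI wheels for Linux and Windows already include CUDA support; there is no separate "gpu" extra):
pip install xgboost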

Enabling GPU in SignalSeeker

import copy

from signalseeker import SignalSeekerPipeline, DEFAULT_CONFIG

config = copy.deepcopy(DEFAULT_CONFIG)  # Copy so the shared default is not modified
config.use_gpu = True  # Enable GPU acceleration
config.xgboost.tree_method = "gpu_hist"  # Use GPU for tree building

pipeline = SignalSeekerPipeline(config)
results = pipeline.run()

Configuration Options

# GPU-specific XGBoost parameters
config.xgboost.tree_method = "gpu_hist"  # GPU tree building
config.xgboost.gpu_id = 0  # Which GPU to use (if multiple)
config.xgboost.predictor = "gpu_predictor"  # GPU prediction
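
The parameter names above follow the XGBoost 1.x convention. Starting with XGBoost 2.0, the GPU is selected with a separate device parameter together with tree_method="hist", and gpu_hist, gpu_id and predictor are deprecated. The sketch below picks the raw XGBoost parameters by version; the helper `gpu_params` is hypothetical, and how SignalSeeker's config maps onto these parameters is not something this sketch assumes.

```python
def gpu_params(xgboost_version: str) -> dict:
    """Return raw XGBoost GPU training parameters suited to the given version string."""
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+: the device is chosen independently of the tree method
        return {"tree_method": "hist", "device": "cuda"}
    # XGBoost 1.x: GPU training is selected through the tree method itself
    return {"tree_method": "gpu_hist", "gpu_id": 0}


print(gpu_params("2.0.3"))
print(gpu_params("1.7.6"))
```

In practice the version string would come from `xgboost.__version__`, and the resulting dictionary can be passed as keyword arguments to `XGBClassifier`.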

Troubleshooting GPU

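# Check whether the installed XGBoost build was compiled with CUDA support (build_info is available in recent XGBoost releases)
python -c "import xgboost as xgb; print(xgb.build_info().get('USE_CUDA'))"

# Test GPU acceleration by fitting a tiny model (constructing the classifier alone does not touch the GPU)
python -c "import numpy as np; from xgboost import XGBClassifier; X = np.random.rand(100, 4); y = np.arange(100) % 2; XGBClassifier(tree_method='gpu_hist', n_estimators=2).fit(X, y); print('GPU training OK')"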

🧪 Testing

Run the complete test suite:

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run with coverage report
pytest --cov=src/signalseeker --cov-report=html

# Run specific test file
pytest tests/test_model_builder.py

# Run specific test
pytest tests/test_model_builder.py::TestXGBoostModelBuilder::test_initialization

# Run only fast tests
pytest -m "not slow"

# Run in parallel (faster; requires the pytest-xdist plugin)
pytest -n auto

🔄 CI/CD

This project uses GitHub Actions for continuous integration:

  • Tests: Runs on Python 3.8-3.12 on every push/PR
  • Code Quality: Black formatting, isort, flake8, mypy
  • Coverage: Checks test coverage on push
  • Publishing: Auto-publishes to PyPI on version tag

See .github/workflows/ for configuration.

📚 Documentation

Docstrings

Every module, class, and function includes comprehensive docstrings explaining:

  • What the code does
  • Why each step is necessary
  • How to use it with examples

Read the docstrings in source files for detailed explanations.

Type Hints

All functions include type hints for better IDE support and code clarity.

def compute_class_weight(y: np.ndarray) -> float:
    """Compute scale_pos_weight for class imbalance."""

❓ FAQ

Q: Why is accuracy not a good metric for imbalanced data?
A: With 90% background, a model predicting "all background" gets 90% accuracy without detecting any signal!

Q: What's the difference between AUC-ROC and AUC-PR?
A: For imbalanced data, AUC-PR is more informative because it focuses on the minority class (signal).
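
A small experiment makes the difference visible. The sketch below computes AUC-ROC as the probability that a signal event outscores a background event, and estimates AUC-PR with average precision, one common estimator. It then repeats the background events ten times to mimic imbalance. The function names and toy scores are illustrative assumptions, not the package's metrics code.

```python
def roc_auc(labels, scores):
    """Probability that a random signal event outscores a random background event."""
    signal = [s for y, s in zip(labels, scores) if y == 1]
    background = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((s > b) + 0.5 * (s == b) for s in signal for b in background)
    return wins / (len(signal) * len(background))


def average_precision(labels, scores):
    """Mean precision at the rank of each signal event.

    Assumes no signal event shares a score with a background event.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


signal_scores = [0.9, 0.7, 0.4]
background_scores = [0.8, 0.3, 0.2, 0.1]


def evaluate(background_copies):
    """Score the toy events with the background repeated background_copies times."""
    scores = signal_scores + background_scores * background_copies
    labels = [1] * len(signal_scores) + [0] * (len(background_scores) * background_copies)
    return roc_auc(labels, scores), average_precision(labels, scores)


print("balanced:   AUC-ROC=%.3f  AP=%.3f" % evaluate(1))
print("imbalanced: AUC-ROC=%.3f  AP=%.3f" % evaluate(10))
```

AUC-ROC stays at about 0.83 in both cases, while average precision falls from about 0.81 to about 0.47: the same score distributions yield far more false positives per true signal once background dominates, and only the precision-based metric reflects that.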

Q: Why use early stopping?
A: After some boosting rounds, the model starts overfitting to noise. Early stopping detects when validation loss stops improving and halts training.
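
The stopping rule itself is simple enough to sketch in isolation. The function below scans a list of per-round validation losses and stops once no improvement has been seen for a given number of rounds. The name `best_round_with_early_stopping` and the loss values are made up for illustration; the package's trainer may implement this differently, for example through XGBoost's own early stopping support.

```python
def best_round_with_early_stopping(val_losses, patience=3):
    """Return (best_round, stopped_at) for a sequence of per-round validation losses.

    Training stops at the first round that is `patience` rounds past the best one.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_losses) - 1


# Validation loss improves until round 3, then creeps back up as the model overfits
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.42]
print(best_round_with_early_stopping(losses, patience=3))
```

Here the best loss occurs at round 3 and training halts at round 6, after three consecutive rounds without improvement; the model from round 3 is the one to keep.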

Q: Can I use my own data?
A: Yes! Replace the DataLoader with code that loads your data, then use the rest of the pipeline unchanged.

Q: Is GPU acceleration supported?
A: Yes! Set config.use_gpu = True if you have a CUDA-capable GPU.

Q: How do I install this on Windows/Mac/Linux?
A: The CPU installation is identical across platforms: use the standard pip commands. GPU acceleration requires an NVIDIA GPU with CUDA, which is not available on macOS.

Q: Can I train on larger datasets?
A: Yes! Increase n_samples in the config. For very large datasets (>1M samples), consider using GPU acceleration.

Q: How do I customize the configuration?
A: Edit config.py or pass a modified PipelineConfig object to SignalSeekerPipeline.

🀝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the MIT License - see LICENSE file for details.

👤 Author

Amid Nayerhoda
Email: amid.nayerhoda@gmail.com
GitHub: @Amidn

🙏 Acknowledgments

This package was designed as an educational resource to teach:

  • Machine learning best practices
  • Professional Python package development
  • Signal/background separation techniques used in particle physics
  • Real-world ML pipeline design

📞 Support

For issues, questions, or suggestions:

  1. Check the FAQ above
  2. Check existing GitHub Issues
  3. Create a new issue with clear description and example code
  4. Read the docstrings in the source code for detailed explanations

Made with ❤️ for science and education
