Skip to content

TheDAALab/binlearn

Repository files navigation

binlearn - Binning and Discretization Library

PyPI Version Python Versions Build Status Code Coverage License Documentation Status Monthly Downloads GitHub Stars Code Style - Ruff Type Checking - MyPy

A modern, type-safe Python library for data binning and discretization with comprehensive error handling, sklearn compatibility, and DataFrame support.

πŸš€ Key Features

✨ Multiple Binning Methods
  • EqualWidthBinning - Equal-width intervals across data range
  • EqualFrequencyBinning - Equal-frequency (quantile-based) bins
  • KMeansBinning - K-means clustering-based discretization
  • GaussianMixtureBinning - Gaussian mixture model clustering-based binning
  • DBSCANBinning - Density-based clustering for natural groupings
  • EqualWidthMinimumWeightBinning - Weight-constrained equal-width binning
  • TreeBinning - Decision tree-based supervised binning for classification and regression
  • Chi2Binning - Chi-square statistic-based supervised binning for optimal class separation
  • IsotonicBinning - Isotonic regression-based supervised binning for monotonic relationships
  • ManualIntervalBinning - Custom interval boundary specification
  • ManualFlexibleBinning - Mixed interval and singleton bin definitions
  • SingletonBinning - Creates one bin per unique numeric value
πŸ”§ Framework Integration
  • Pandas DataFrames - Native support with column name preservation
  • Polars DataFrames - High-performance columnar data support (optional)
  • NumPy Arrays - Efficient numerical array processing
  • Scikit-learn Pipelines - Full transformer compatibility
⚑ Modern Code Quality
  • Type Safety - 100% mypy compliance with comprehensive type annotations
  • Code Quality - 100% ruff compliance with modern Python syntax
  • Error Handling - Comprehensive validation with helpful error messages and suggestions
  • Test Coverage - 100% code coverage with 841 comprehensive tests
  • Documentation - Extensive examples and API documentation

πŸ“¦ Installation

pip install binlearn

πŸ”₯ Quick Start

import numpy as np
import pandas as pd
from binlearn import EqualWidthBinning, TreeBinning, SingletonBinning, Chi2Binning

# Create sample data
data = pd.DataFrame({
    'age': np.random.normal(35, 10, 1000),
    'income': np.random.lognormal(10, 0.5, 1000),
    'score': np.random.uniform(0, 100, 1000)
})

# Equal-width binning with DataFrame preservation
binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
data_binned = binner.fit_transform(data)

print(f"Original shape: {data.shape}")
print(f"Binned shape: {data_binned.shape}")
print(f"Bin edges for age: {binner.bin_edges_['age']}")

# SingletonBinning for numeric discrete values
numeric_discrete_data = pd.DataFrame({
    'category_id': [1, 2, 1, 3, 2, 1],
    'rating': [1, 2, 1, 3, 2, 1]
})

singleton_binner = SingletonBinning(preserve_dataframe=True)
numeric_binned = singleton_binner.fit_transform(numeric_discrete_data)
print(f"Numeric discrete binning: {numeric_binned.shape}")

🎯 Supervised Binning Example

from binlearn import TreeBinning
import numpy as np
from sklearn.datasets import make_classification

# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)

# Method 1: Using guidance_columns (binlearn style)
# Combine features and target into single dataset
X_with_target = np.column_stack([X, y])

sup_binner1 = TreeBinning(
    guidance_columns=[4],  # Use the target column to guide binning
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)
X_binned1 = sup_binner1.fit_transform(X_with_target)

# Method 2: Using X and y parameters (sklearn style)
# Pass features and target separately like sklearn
sup_binner2 = TreeBinning(
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)
sup_binner2.fit(X, y)  # y is automatically used as guidance
X_binned2 = sup_binner2.transform(X)

print(f"Method 1 - Input shape: {X_with_target.shape}, Output shape: {X_binned1.shape}")
print(f"Method 2 - Input shape: {X.shape}, Output shape: {X_binned2.shape}")
print(f"Both methods create same bins: {np.array_equal(X_binned1, X_binned2)}")

πŸ› οΈ Scikit-learn Integration

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from binlearn import EqualFrequencyBinning

# Use the same classification dataset from previous example
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create ML pipeline with binning preprocessing
pipeline = Pipeline([
    ('binning', EqualFrequencyBinning(n_bins=5)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")

πŸ“š Available Methods

Interval-based Methods (Unsupervised):

  • EqualWidthBinning - Creates bins of equal width across the data range
  • EqualFrequencyBinning - Creates bins with approximately equal number of samples
  • KMeansBinning - Uses K-means clustering to determine bin boundaries
  • GaussianMixtureBinning - Uses Gaussian mixture models for probabilistic clustering
  • DBSCANBinning - Uses density-based clustering for natural groupings
  • EqualWidthMinimumWeightBinning - Equal-width bins with weight constraints
  • ManualIntervalBinning - Specify custom interval boundaries

Supervised Methods:

  • TreeBinning - Decision tree-based binning optimized for target variables (classification and regression)
  • Chi2Binning - Chi-square statistic-based binning for optimal feature-target association
  • IsotonicBinning - Isotonic regression-based binning for monotonic relationships

Flexible Methods:

  • ManualFlexibleBinning - Define mixed interval and singleton bins
  • SingletonBinning - Creates one bin per unique numeric value

βš™οΈ Requirements

Python Versions: 3.10, 3.11, 3.12, 3.13

Core Dependencies:
  • NumPy >= 1.21.0
  • SciPy >= 1.7.0
  • Scikit-learn >= 1.0.0
  • kmeans1d >= 0.3.0
Optional Dependencies:
  • Pandas >= 1.3.0 (for DataFrame support)
  • Polars >= 0.15.0 (for Polars DataFrame support)
Development Dependencies:
  • pytest >= 6.0 (for testing)
  • ruff >= 0.1.0 (for linting and formatting)
  • mypy >= 1.0.0 (for type checking)

πŸ§ͺ Development Setup

# Clone repository
git clone https://github.com/TheDAALab/binlearn.git
cd binlearn

# Install in development mode with all dependencies
pip install -e ".[tests,dev,pandas,polars]"

# Run all tests
pytest

# Run code quality checks
ruff check binlearn/
mypy binlearn/ --ignore-missing-imports

# Build documentation
cd docs && make html

πŸ† Code Quality Standards

  • βœ… 100% Test Coverage - Comprehensive test suite with 841 tests
  • βœ… 100% Type Safety - Complete mypy compliance with modern type annotations
  • βœ… 100% Code Quality - Full ruff compliance with modern Python standards
  • βœ… Comprehensive Documentation - Detailed API docs and examples
  • βœ… Modern Python - Uses latest Python features and best practices
  • βœ… Robust Error Handling - Helpful error messages with actionable suggestions

🀝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository on GitHub

  2. Create a feature branch: git checkout -b feature/your-feature

  3. Make your changes and add tests

  4. Ensure all quality checks pass:

    pytest                                    # Run tests
    ruff check binlearn/                      # Check code quality
    mypy binlearn/ --ignore-missing-imports   # Check types
  5. Submit a pull request

Areas for Contribution:
  • πŸ› Bug reports and fixes
  • ✨ New binning algorithms
  • πŸ“š Documentation improvements
  • πŸ§ͺ Additional test cases
  • 🎯 Performance optimizations

πŸ”— Links

πŸ“„ License

This project is licensed under the MIT License. See the LICENSE file for details.

Developed by TheDAALab

A modern, type-safe binning framework for Python data science workflows.

Powered by Python Built with NumPy Compatible with Pandas Integrates with Scikit-learn Development Status Contributors

About

Some general purpose binning functionalities

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors