A modern, type-safe Python library for data binning and discretization with comprehensive error handling, sklearn compatibility, and DataFrame support.
- ✨ Multiple Binning Methods
- EqualWidthBinning - Equal-width intervals across data range
- EqualFrequencyBinning - Equal-frequency (quantile-based) bins
- KMeansBinning - K-means clustering-based discretization
- GaussianMixtureBinning - Gaussian mixture model clustering-based binning
- DBSCANBinning - Density-based clustering for natural groupings
- EqualWidthMinimumWeightBinning - Weight-constrained equal-width binning
- TreeBinning - Decision tree-based supervised binning for classification and regression
- Chi2Binning - Chi-square statistic-based supervised binning for optimal class separation
- IsotonicBinning - Isotonic regression-based supervised binning for monotonic relationships
- ManualIntervalBinning - Custom interval boundary specification
- ManualFlexibleBinning - Mixed interval and singleton bin definitions
- SingletonBinning - Creates one bin per unique numeric value
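To picture the difference between the first two methods, equal-width edges split the data range evenly while equal-frequency edges follow the empirical quantiles. A plain NumPy sketch of the two edge computations (illustrative only, not binlearn's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed sample
n_bins = 4

# Equal-width: evenly spaced edges across the data range
width_edges = np.linspace(data.min(), data.max(), n_bins + 1)

# Equal-frequency: edges at the empirical quantiles
freq_edges = np.quantile(data, np.linspace(0, 1, n_bins + 1))

# Both yield n_bins + 1 edges, but equal-frequency bins hold
# roughly the same number of samples each
counts = np.histogram(data, bins=freq_edges)[0]
print(width_edges)
print(freq_edges)
print(counts)
```

On skewed data such as this lognormal sample, equal-width bins concentrate most samples in the first bin, whereas equal-frequency bins stay balanced.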
- 🔧 Framework Integration
- Pandas DataFrames - Native support with column name preservation
- Polars DataFrames - High-performance columnar data support (optional)
- NumPy Arrays - Efficient numerical array processing
- Scikit-learn Pipelines - Full transformer compatibility
- ⚡ Modern Code Quality
- Type Safety - 100% mypy compliance with comprehensive type annotations
- Code Quality - 100% ruff compliance with modern Python syntax
- Error Handling - Comprehensive validation with helpful error messages and suggestions
- Test Coverage - 100% code coverage with 841 comprehensive tests
- Documentation - Extensive examples and API documentation
pip install binlearn

import numpy as np
import pandas as pd
from binlearn import EqualWidthBinning, TreeBinning, SingletonBinning, Chi2Binning
# Create sample data
data = pd.DataFrame({
'age': np.random.normal(35, 10, 1000),
'income': np.random.lognormal(10, 0.5, 1000),
'score': np.random.uniform(0, 100, 1000)
})
# Equal-width binning with DataFrame preservation
binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
data_binned = binner.fit_transform(data)
print(f"Original shape: {data.shape}")
print(f"Binned shape: {data_binned.shape}")
print(f"Bin edges for age: {binner.bin_edges_['age']}")
# SingletonBinning for numeric discrete values
numeric_discrete_data = pd.DataFrame({
'category_id': [1, 2, 1, 3, 2, 1],
'rating': [1, 2, 1, 3, 2, 1]
})
singleton_binner = SingletonBinning(preserve_dataframe=True)
numeric_binned = singleton_binner.fit_transform(numeric_discrete_data)
print(f"Numeric discrete binning: {numeric_binned.shape}")

from binlearn import TreeBinning
import numpy as np
from sklearn.datasets import make_classification
# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
# Method 1: Using guidance_columns (binlearn style)
# Combine features and target into single dataset
X_with_target = np.column_stack([X, y])
sup_binner1 = TreeBinning(
guidance_columns=[4], # Use the target column to guide binning
task_type='classification',
tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)
X_binned1 = sup_binner1.fit_transform(X_with_target)
# Method 2: Using X and y parameters (sklearn style)
# Pass features and target separately like sklearn
sup_binner2 = TreeBinning(
task_type='classification',
tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)
sup_binner2.fit(X, y) # y is automatically used as guidance
X_binned2 = sup_binner2.transform(X)
print(f"Method 1 - Input shape: {X_with_target.shape}, Output shape: {X_binned1.shape}")
print(f"Method 2 - Input shape: {X.shape}, Output shape: {X_binned2.shape}")
print(f"Both methods create same bins: {np.array_equal(X_binned1, X_binned2)}")

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from binlearn import EqualFrequencyBinning
# Use the same classification dataset from previous example
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create ML pipeline with binning preprocessing
pipeline = Pipeline([
('binning', EqualFrequencyBinning(n_bins=5)),
('classifier', RandomForestClassifier(random_state=42))
])
# Train and evaluate
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")

Interval-based Methods (Unsupervised):
- EqualWidthBinning - Creates bins of equal width across the data range
- EqualFrequencyBinning - Creates bins with approximately equal numbers of samples
- KMeansBinning - Uses K-means clustering to determine bin boundaries
- GaussianMixtureBinning - Uses Gaussian mixture models for probabilistic clustering
- DBSCANBinning - Uses density-based clustering for natural groupings
- EqualWidthMinimumWeightBinning - Equal-width bins with weight constraints
- ManualIntervalBinning - Specify custom interval boundaries
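The clustering-based methods derive edges from cluster structure rather than from the raw range. As a rough sketch of the K-means idea (using scikit-learn directly here, not binlearn's kmeans1d-backed implementation), interior edges can be placed midway between adjacent sorted cluster centers:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated groups of values
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(data.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())

# Interior edges sit midway between adjacent cluster centers;
# outer edges are the data extremes
inner = (centers[:-1] + centers[1:]) / 2
edges = np.concatenate([[data.min()], inner, [data.max()]])
print(edges)  # 4 edges -> 3 bins
```

Unlike equal-width edges, these boundaries fall in the low-density gaps between clusters, so each bin tracks a natural grouping in the data.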
Supervised Methods:
- TreeBinning - Decision tree-based binning optimized for target variables (classification and regression)
- Chi2Binning - Chi-square statistic-based binning for optimal feature-target association
- IsotonicBinning - Isotonic regression-based binning for monotonic relationships
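To see what "decision tree-based" means concretely: a shallow tree fit on a single feature against the target yields split thresholds that can serve as bin edges. A minimal sketch of that idea using scikit-learn directly (not binlearn's actual TreeBinning internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# One informative feature and a binary target
X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
tree.fit(X, y)

# Internal nodes carry split thresholds; leaf nodes are marked with feature == -2
thresholds = tree.tree_.threshold[tree.tree_.feature != -2]
edges = np.concatenate([[X.min()], np.sort(thresholds), [X.max()]])
print(edges)
```

Because the tree chooses splits that maximize class purity, the resulting bins separate the target classes far better than edges chosen without looking at `y`.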
Flexible Methods:
- ManualFlexibleBinning - Define mixed interval and singleton bins
- SingletonBinning - Creates one bin per unique numeric value
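SingletonBinning's behavior can be pictured as mapping each sample to the index of its unique value; a minimal NumPy stand-in for that mapping (illustrative, not the library's code):

```python
import numpy as np

ratings = np.array([1, 2, 1, 3, 2, 1])

# One bin per unique value; `inverse` gives the bin index of each sample
uniques, inverse = np.unique(ratings, return_inverse=True)
print(uniques)   # [1 2 3] -> three singleton bins
print(inverse)   # [0 1 0 2 1 0]
```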
Python Versions: 3.10, 3.11, 3.12, 3.13
- Core Dependencies:
- NumPy >= 1.21.0
- SciPy >= 1.7.0
- Scikit-learn >= 1.0.0
- kmeans1d >= 0.3.0
- Optional Dependencies:
- Pandas >= 1.3.0 (for DataFrame support)
- Polars >= 0.15.0 (for Polars DataFrame support)
- Development Dependencies:
- pytest >= 6.0 (for testing)
- ruff >= 0.1.0 (for linting and formatting)
- mypy >= 1.0.0 (for type checking)
# Clone repository
git clone https://github.com/TheDAALab/binlearn.git
cd binlearn
# Install in development mode with all dependencies
pip install -e ".[tests,dev,pandas,polars]"
# Run all tests
pytest
# Run code quality checks
ruff check binlearn/
mypy binlearn/ --ignore-missing-imports
# Build documentation
cd docs && make html

- ✅ 100% Test Coverage - Comprehensive test suite with 841 tests
- ✅ 100% Type Safety - Complete mypy compliance with modern type annotations
- ✅ 100% Code Quality - Full ruff compliance with modern Python standards
- ✅ Comprehensive Documentation - Detailed API docs and examples
- ✅ Modern Python - Uses latest Python features and best practices
- ✅ Robust Error Handling - Helpful error messages with actionable suggestions
We welcome contributions! Here's how to get started:
Fork the repository on GitHub
Create a feature branch:
git checkout -b feature/your-feature

Make your changes and add tests
Ensure all quality checks pass:
pytest                                   # Run tests
ruff check binlearn/                     # Check code quality
mypy binlearn/ --ignore-missing-imports  # Check types
Submit a pull request
- Areas for Contribution:
- 🐛 Bug reports and fixes
- ✨ New binning algorithms
- 📚 Documentation improvements
- 🧪 Additional test cases
- 🎯 Performance optimizations
- GitHub Repository: https://github.com/TheDAALab/binlearn
- Issue Tracker: https://github.com/TheDAALab/binlearn/issues
- Documentation: https://binlearn.readthedocs.io/
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by TheDAALab
A modern, type-safe binning framework for Python data science workflows.