A Professional Data Quality Framework for ML Pipelines
DataSentry is a comprehensive Python library designed to detect and remediate data quality issues in machine learning pipelines. It provides a unified interface for identifying and fixing common data problems including class imbalance, label noise, data leakage, missing values, outliers, feature redundancy, and data distribution shift.
- Imbalance Detection: Identify class imbalance with customizable thresholds
- Label Noise Detection: Find potentially mislabeled samples using confident learning
- Data Leakage Detection: Detect target leakage, duplicates, and train-test contamination
- Missing Value Detection: Analyze missing value patterns and completeness
- Outlier Detection: Identify outliers using IQR, Z-score, Isolation Forest, and LOF
- Redundancy Detection: Find correlated and duplicate features
- Shift Detection: Detect distribution drift between train and test sets
- Imbalance Fixer: SMOTE, ADASYN, undersampling, and class weights
- Label Noise Fixer: Remove, relabel, or weight noisy samples
- Data Leakage Fixer: Remove leaky features and duplicates
- Missing Value Fixer: Mean, median, mode, KNN, and iterative imputation
- Outlier Fixer: Remove, cap, transform, or winsorize outliers
- Redundancy Fixer: Remove features or apply PCA
- Shift Fixer: Standardize and normalize distributions
- Interactive plots for all data quality issues
- Distribution comparisons
- Correlation heatmaps
- Missing value patterns
- Outlier visualizations
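
To make two of the features above concrete, here is a minimal standard-library sketch of IQR-based outlier flagging and capping. It is not DataSentry's implementation: the function names (`iqr_bounds`, `flag_outliers`, `cap_outliers`) and the fence multiplier `k=1.5` are assumptions chosen for illustration, and the library's own defaults may differ.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences; k=1.5 is an assumed default."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Mark each value that falls outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


def cap_outliers(values, k=1.5):
    """Clip values to the IQR fences instead of removing them."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


data = [10, 12, 11, 13, 12, 11, 95]
print(flag_outliers(data))  # only the last value is flagged
print(cap_outliers(data))   # the last value is pulled back to the upper fence
```

Capping keeps the sample count unchanged, which matters when rows must stay aligned with labels. Removing rows is simpler but shrinks the dataset.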
pip install datasentry
# For advanced imbalance handling (SMOTE, ADASYN)
pip install datasentry[imblearn]
# For all optional features
pip install datasentry[all]
# For development
pip install datasentry[dev]

git clone https://github.com/010Ankushsharma/datasentry.git
cd datasentry
pip install -e .

from datasentry import DataSentry
import numpy as np
# Generate sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10)
y_train = np.random.randint(0, 3, 1000)
X_test = np.random.randn(200, 10)
# Initialize DataSentry
ds = DataSentry(random_state=42, verbose=True)
# Generate comprehensive report
report = ds.generate_full_report(X_train, y_train, X_test=X_test)
# View health score
print(f"Health Score: {report['report_metadata']['health_score']:.2%}")
print(f"Overall Status: {report['report_metadata']['overall_status']}")
# Fix all detected issues
X_clean, y_clean = ds.fix_all(
    X_train, y_train,
    fix_config={
        'missing_values': {'strategy': 'mean'},
        'outliers': {'method': 'cap'},
        'imbalance': {'method': 'smote'},
    }
)

from datasentry import DataSentry
ds = DataSentry()
# Detect specific issues
imbalance_result = ds.detect_imbalance(X, y)
missing_result = ds.detect_missing_values(X)
outlier_result = ds.detect_outliers(X)
leakage_result = ds.detect_data_leakage(X, y, X_test=X_test)
# Check if issues were detected
if imbalance_result.issue_detected:
    print(f"Imbalance ratio: {imbalance_result.details['imbalance_ratio']}")
    print(f"Severity: {imbalance_result.severity}")

# Fix specific issues
from datasentry import MissingValueFixer, OutlierFixer
# Fix missing values
missing_fixer = MissingValueFixer(strategy='knn')
result = missing_fixer.fix(X, y)
X_fixed = result.X_transformed
# Fix outliers
outlier_fixer = OutlierFixer(method='winsorize')
result = outlier_fixer.fix(X, y)
X_fixed = result.X_transformed

# Visualize data quality issues
import matplotlib.pyplot as plt
# Class imbalance
fig = ds.visualize_imbalance(y, plot_type='both')
plt.show()
# Missing values
fig = ds.visualize_missing_values(X, plot_type='matrix')
plt.show()
# Outliers
fig = ds.visualize_outliers(X, plot_type='box')
plt.show()
# Correlation heatmap
fig = ds.visualize_redundancy(X, plot_type='heatmap')
plt.show()
# Distribution shift
fig = ds.visualize_shift(X_train, X_test, plot_type='comparison')
plt.show()

# Generate HTML report
report = ds.generate_full_report(X, y, X_test=X_test)
# Save as HTML
from datasentry.core.report import ReportGenerator
detectors = ds.detect_all(X, y, X_test=X_test)
report_gen = ReportGenerator(list(detectors.values()))
report_gen.save_html('data_quality_report.html')
# Save as JSON
report_gen.save_json('data_quality_report.json')

Main orchestrator class for data quality management.
DataSentry(
    random_state: int = 42,
    verbose: bool = True
)

Methods:
- detect_all(X, y, X_test, y_test) - Run all detectors
- detect_imbalance(X, y) - Detect class imbalance
- detect_label_noise(X, y) - Detect label noise
- detect_data_leakage(X, y, X_test) - Detect data leakage
- detect_missing_values(X, y) - Detect missing values
- detect_outliers(X, y) - Detect outliers
- detect_redundancy(X, y) - Detect feature redundancy
- detect_shift(X, y, X_test, y_test) - Detect distribution shift
- fix_all(X, y, X_test, fix_config) - Fix all issues
- generate_full_report(X, y, X_test, y_test) - Generate comprehensive report
- visualize_* - Various visualization methods
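
The usage snippets earlier read detection results through `issue_detected`, `severity`, and `details`. The sketch below mimics that shape with a stand-in class so the pattern can be tried without the library installed. `SketchResult`, `sketch_detect_imbalance`, the thresholds, and the severity labels are all hypothetical and are not taken from DataSentry's source.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Hypothetical stand-in for a detection result (not DataSentry's class)."""
    issue_detected: bool
    severity: str
    details: dict = field(default_factory=dict)


def sketch_detect_imbalance(y, threshold=3.0, high=10.0):
    """Flag imbalance when majority/minority count ratio reaches `threshold`.

    Both cut-offs are illustrative assumptions, not library defaults.
    """
    counts = Counter(y)
    ratio = max(counts.values()) / min(counts.values())
    detected = ratio >= threshold
    if not detected:
        severity = "none"
    elif ratio >= high:
        severity = "high"
    else:
        severity = "moderate"
    return SketchResult(detected, severity, {"imbalance_ratio": ratio})


y = [0] * 90 + [1] * 10
result = sketch_detect_imbalance(y)
if result.issue_detected:
    print(f"Imbalance ratio: {result.details['imbalance_ratio']}")
    print(f"Severity: {result.severity}")
```

Keeping the boolean flag, a coarse severity, and a free-form details dictionary on one object lets calling code branch on the flag first and only inspect the details when something was found.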
All detectors inherit from BaseDetector and return a DetectionResult.
from datasentry import (
    ImbalanceDetector,
    LabelNoiseDetector,
    DataLeakageDetector,
    MissingValueDetector,
    OutlierDetector,
    RedundancyDetector,
    ShiftDetector,
)

All fixers inherit from BaseFixer and return a FixResult.
from datasentry import (
    ImbalanceFixer,
    LabelNoiseFixer,
    DataLeakageFixer,
    MissingValueFixer,
    OutlierFixer,
    RedundancyFixer,
    ShiftFixer,
)

See the examples/ directory for more detailed examples:
- basic_example.py - Basic usage of DataSentry
- advanced_example.py - Advanced features and customization
- pipeline_integration.py - Integration with sklearn pipelines
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
See CHANGELOG.md for version history.
- Documentation: https://datasentry.readthedocs.io
- Issue Tracker: GitHub Issues
- Discussions: GitHub Discussions
- Built with scikit-learn
- Inspired by data quality best practices in ML pipelines