Simpute is an adaptive missing-value imputation library for tabular data. Instead of applying one global strategy to every column, it profiles each feature, selects a tailored model, and imputes columns sequentially so earlier fills inform later ones.
Install from PyPI as simpute. Source and releases live at github.com/Hvllvix/Simpute.
Most imputers pick a single method (mean, median, MICE, KNN) for the whole table. Real datasets mix binary flags, low-cardinality categories, high-cardinality text-like fields, skewed counts, and smooth continuous variables. Simpute treats each column on its own terms.
| Core Architectural Dimension | Simpute Engine Standard |
|---|---|
| Profiling Strategy | Granular per-column analysis and dynamic routing |
| API Compliance | Native Scikit-learn interface (fit / transform / fit_transform) |
| Algorithmic Suite | LightGBM, CatBoost, Regularized Logistic/SVC, KNN, Bayesian Ridge, Extra Trees |
| System Integrity | Guard test suite that masks known values and checks imputations against ground truth |
| Fault Tolerance | Warnings for columns with more than 70% missing values |

    pip install simpute

Development install with tests and plotting extras:

    git clone https://github.com/Hvllvix/Simpute.git
    cd Simpute
    pip install -e ".[dev]"

Quick start:

    import pandas as pd
    from simpute import Simpute

    df = pd.read_csv("data.csv")
    imputer = Simpute(exclude=["Student_ID"])
    filled = imputer.fit_transform(df)
    print(imputer.getmodelselection())
    print(imputer.getprofiles())

exclude keeps identifier columns out of the imputation loop. Use columns=[...] instead when you only want to impute a subset.
- Profile each target column (type, missingness, cardinality, distribution shape).
- Select features with mutual information (top 6 predictors by default).
- Route to a candidate model based on the column profile.
- Fit on observed rows, then impute missing cells column by column.
- Warn when missingness exceeds 70% on a column.
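The feature-selection step can be illustrated with a small standalone sketch. It is not Simpute's code: the estimator Simpute uses is not stated here, so this version uses the plug-in mutual information estimate for discrete columns, and the names mutual_information and top_k_predictors are hypothetical. Only the default of 6 predictors comes from the list above.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def top_k_predictors(table, target, k=6):
    """Rank the other columns by mutual information with the target column.

    Only rows where both the candidate and the target are observed are used.
    """
    scores = {}
    for name, values in table.items():
        if name == target:
            continue
        pairs = [
            (v, t)
            for v, t in zip(values, table[target])
            if v is not None and t is not None
        ]
        if pairs:
            xs, ys = zip(*pairs)
            scores[name] = mutual_information(xs, ys)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy table: "attended" determines "passed", "campus" is close to unrelated.
table = {
    "passed": ["y", "y", "n", "n", "y", None, "n", "y"],
    "attended": ["hi", "hi", "lo", "lo", "hi", "hi", "lo", "hi"],
    "campus": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
print(top_k_predictors(table, "passed", k=1))
```

Continuous columns need binning or a nearest-neighbour estimator instead of raw counts. scikit-learn's mutual_info_classif and mutual_info_regression cover that case.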
Sequential imputation means numerical columns are generally filled before categorical ones, and values imputed in earlier columns become features for later columns.
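A minimal sketch of that ordering and of the 70% warning, using only the standard library. The helper names (imputation_order, sequential_fill) are hypothetical, and a median or mode fill stands in for the per-column model, so the sketch shows the order of the pass rather than Simpute's actual predictions.

```python
import warnings
from statistics import median, mode

HIGH_MISSING = 0.70  # warning threshold quoted in this README


def is_numeric(values):
    """True when every observed value is an int or float."""
    observed = [v for v in values if v is not None]
    return bool(observed) and all(isinstance(v, (int, float)) for v in observed)


def imputation_order(table):
    """Columns with gaps, numeric first, then categorical; warns on sparse columns."""
    numeric, categorical = [], []
    for name, values in table.items():
        missing = sum(v is None for v in values) / len(values)
        if missing == 0:
            continue  # nothing to impute in this column
        if missing > HIGH_MISSING:
            warnings.warn(f"{name}: {missing:.0%} missing, imputations may be unreliable")
        (numeric if is_numeric(values) else categorical).append(name)
    return numeric + categorical


def sequential_fill(table):
    """Fill one column at a time, writing each result back before moving on."""
    filled = {name: list(values) for name, values in table.items()}
    for name in imputation_order(table):
        observed = [v for v in filled[name] if v is not None]
        if not observed:
            continue  # a fully empty column cannot be filled by this placeholder
        # Placeholder for the per-column model. A real model would be fit on
        # `filled`, which already contains the values imputed for earlier columns.
        guess = median(observed) if is_numeric(observed) else mode(observed)
        filled[name] = [guess if v is None else v for v in filled[name]]
    return filled


table = {
    "major": ["bio", None, "bio", "math"],
    "gpa": [3.0, None, 3.4, 3.8],
    "id": [1, 2, 3, 4],
}
print(imputation_order(table))
print(sequential_fill(table))
```

Writing each filled column back into the working copy before the next column is processed is what lets later models see earlier fills.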
| Target Column Profile | Underlying Statistical Property | Optimized Backend Algorithm |
|---|---|---|
| High-Cardinality Categorical | Large nominal domains, text-like properties | CatBoostClassifier / LGBMClassifier |
| Low-Cardinality / Binary | Binary indicators, low unique nominal categories | LogisticRegression (L2) / LinearSVC |
| Large Numerical Tables | Datasets exceeding 1,000 observations | LGBMRegressor / ExtraTreesRegressor |
| Skewed / Discrete Numerical | Long-tailed metrics, highly unbalanced distributions | LGBMRegressor / ExtraTreesRegressor |
| Normal / Uniform Continuous | Symmetric, unskewed continuous distributions | KNeighborsRegressor / BayesianRidge |
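The routing in the table can be sketched as a small decision function that returns the profile label from the first column. This is an illustration under assumptions, not the library's logic: only the 1,000-row cutoff comes from the table, while the cardinality cutoff, the skewness cutoff, the order of the checks, and the function names are hypothetical.

```python
from statistics import fmean

LARGE_TABLE = 1000     # from the routing table
HIGH_CARDINALITY = 20  # assumed cutoff, not documented
SKEW_LIMIT = 1.0       # assumed cutoff, not documented


def skewness(values):
    """Fisher-Pearson skewness coefficient: m3 / m2 ** 1.5."""
    mu = fmean(values)
    m2 = fmean((v - mu) ** 2 for v in values)
    m3 = fmean((v - mu) ** 3 for v in values)
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def route(values, n_rows):
    """Map a column to one of the profile labels used in the table."""
    observed = [v for v in values if v is not None]
    n_unique = len(set(observed))
    if not all(isinstance(v, (int, float)) for v in observed):
        if n_unique > HIGH_CARDINALITY:
            return "High-Cardinality Categorical"
        return "Low-Cardinality / Binary"
    if n_unique <= 2:
        return "Low-Cardinality / Binary"  # numeric 0/1 flags
    if n_rows > LARGE_TABLE:
        return "Large Numerical Tables"
    is_discrete = all(float(v).is_integer() for v in observed)
    if is_discrete or abs(skewness(observed)) > SKEW_LIMIT:
        return "Skewed / Discrete Numerical"
    return "Normal / Uniform Continuous"


print(route(["y", "n", None, "y"], n_rows=4))
print(route([1, 2, 2, 3, 40, None], n_rows=6))
print(route([0.5, 1.5, 2.5, 3.5], n_rows=4))
print(route([0.5, 1.5, 2.5], n_rows=5000))
```

Each label maps to the pair of candidate backends in the third column. How Simpute chooses between the two candidates is not covered by this sketch.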
Inspect the chosen backend per column after fitting:

    imputer.getmodelselection()
    # {'Pre_Semester_GPA': 'LGBMRegressor', 'Major_Category': 'CatBoostClassifier', ...}

| Interface Method | Return Signature | Functional Description |
|---|---|---|
| fit(df) | self | Profiles columns and trains tailored per-column machine learning architectures. |
| transform(df) | pd.DataFrame | Executes sequential imputation calculations using previously fitted backend models. |
| fit_transform(df) | pd.DataFrame | Runs profiling, model training, and cell imputation in a single optimized pass. |
| getprofiles() | dict | Exposes the underlying metadata mapping generated during the dataset profiling phase. |
| getmodelselection() | dict | Returns the specific machine learning model mapped to each target imputed column. |
Constructor options: columns, exclude, maskratio, randomstate.
The guard suite (tests/guard.py) masks values in tests/data/test.csv, imputes them, and checks:
- No NaN values remain after imputation
- Categorical predictions stay within the original domain
- Numerical predictions stay within bounded ranges
- Imputation beats adaptive random baselines on held-out masked cells
- Model selection is deterministic and profile-consistent
- High-missingness columns emit warnings
- Calling transform before fit raises RuntimeError
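The mask-then-score idea behind these checks can be sketched without the library. The mask ratio of 0.15 and seed of 42 match the values quoted for the bundled dataset. Everything else is an assumption for illustration: the helper names are hypothetical, a median fill stands in for the imputer, and a uniform draw from the observed values stands in for the "adaptive random baseline", whose definition is not given here.

```python
import random
from statistics import fmean, median


def mask_cells(values, ratio, seed):
    """Hide a reproducible fraction of the observed cells; return the masked column and the hidden indices."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    n_mask = max(1, round(ratio * len(observed_idx)))
    held_out = sorted(rng.sample(observed_idx, n_mask))
    hidden = set(held_out)
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, held_out


def mae(truth, guesses):
    """Mean absolute error over the held-out cells."""
    return fmean(abs(t - g) for t, g in zip(truth, guesses))


values = [float(v) for v in range(1, 41)]  # toy numeric column
masked, held_out = mask_cells(values, ratio=0.15, seed=42)
truth = [values[i] for i in held_out]

# Stand-in imputer: fill every hidden cell with the median of what is left.
observed = [v for v in masked if v is not None]
filled = [median(observed) if v is None else v for v in masked]
imputed = [filled[i] for i in held_out]

# Stand-in baseline: a random observed value for each hidden cell.
rng = random.Random(42)
baseline = [rng.choice(observed) for _ in held_out]

no_gaps = all(v is not None for v in filled)
in_range = all(min(observed) <= v <= max(observed) for v in imputed)
print(no_gaps, in_range)
print("imputer MAE:", mae(truth, imputed))
print("baseline MAE:", mae(truth, baseline))
```

Fixing the seed makes the set of hidden cells identical across runs, which is what allows a deterministic comparison between the imputer and the baseline.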
See tests/data/README.md for column descriptions and how to swap in your own CSV.

Run the guard suite with pytest:

    pytest tests/guard.py -v

Metric summary table (MAE for continuous columns, accuracy for nominal):

    python tests/guard.py

Generated on the bundled test dataset (MASKRATIO=0.15, SEED=42):
| Plot | Visualization Type | Purpose |
|---|---|---|
| Imputation Density | Kernel Density Estimation (KDE) | Compares original and post-imputation distributions to check that variance is preserved. |
| Missingness Heatmap | Binary Feature Completeness Grid | Shows which cells are missing before and after imputing the full table. |
| Model Allocation | Horizontal Flow Chart | Shows which algorithm each column was routed to. |
Regenerate locally:

    python scripts/generate_plots.py

Requirements:

- Python 3.10+
- NumPy, Pandas, SciPy, scikit-learn, LightGBM, CatBoost

Contributing:

- Fork Hvllvix/Simpute
- Create a branch, make changes, and run pytest tests/guard.py -v
- Open a pull request

License: MIT