A fast, scalable, and intelligent data preprocessing library for machine learning workflows.
Data-Prepez is an open-source Python library designed to simplify and accelerate the data preprocessing stage of machine learning. It automatically handles missing values, encoding, scaling, and type detection, and it supports tabular, text (NLP), and time series data. It is designed to scale to millions of rows.
- Automatic detection of numerical, categorical, and text features
- Missing value imputation (mean, median, mode, forward fill, etc.)
- One-hot and label encoding
- Standard, MinMax, and Robust scaling
- NLP preprocessing (cleaning, tokenization, lemmatization)
- Time series resampling, smoothing, windowing
- AutoPipeline: one-liner fit_transform() interface
- Modular and production-ready design
- Large dataset support (planned: Dask, Modin integration)
- Compatible with Pandas, NumPy, and Scikit-learn
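To make the tabular features above concrete, here is a minimal sketch of the equivalent manual steps in plain pandas: type detection, imputation, standard scaling, and one-hot encoding. It does not use the Data-Prepez API and is not necessarily how the library implements these steps. The toy data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Paris", "Lyon", None, "Paris"],
    "target": [0, 1, 0, 1],
})

y = df["target"]
X = df.drop(columns="target")

# 1. Type detection: split feature columns by dtype.
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# 2. Imputation: median for numeric columns, mode for categorical ones.
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
for col in cat_cols:
    X[col] = X[col].fillna(X[col].mode().iloc[0])

# 3. Standard scaling: zero mean, unit variance (population std, ddof=0).
X[num_cols] = (X[num_cols] - X[num_cols].mean()) / X[num_cols].std(ddof=0)

# 4. One-hot encoding of the categorical columns.
X_clean = pd.get_dummies(X, columns=list(cat_cols))

print(X_clean)
```

Doing these steps by hand means repeating the same column bookkeeping for every dataset, which is the work a single fit_transform() call is meant to take over.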
PyPI release coming soon!
For now, clone the repository and install the dependencies:
git clone https://github.com/Data-PrepeZ/Data_Prepez.git
cd Data_Prepez
pip install -r requirements.txt

Then preprocess a dataset:
from dataprepez import AutoPreprocessor
import pandas as pd

# Load dataset
df = pd.read_csv("your_data.csv")

# Initialize the preprocessor
prep = AutoPreprocessor(target='target_column', type='tabular')

# Run preprocessing
X_clean, y = prep.fit_transform(df)

Check out the examples/ folder:
- tabular_demo.ipynb: Basic tabular preprocessing
- nlp_cleaning.ipynb: NLP cleaning pipeline (coming soon)
- timeseries_demo.ipynb: Time series preprocessing (coming soon)
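Until the time series notebook is available, the resampling, smoothing, and windowing steps named in the feature list can be sketched in plain pandas. This is an illustration under assumptions, not the Data-Prepez time series API: the readings, frequencies, and window sizes below are made up.

```python
import pandas as pd

# Hypothetical readings taken every 5 minutes; the values are made up.
idx = pd.date_range("2024-01-01", periods=12, freq="5min")
series = pd.Series(range(12), index=idx, dtype="float64")

# Resampling: aggregate the 5-minute readings into 15-minute means.
resampled = series.resample("15min").mean()

# Smoothing: rolling mean over the current and previous resampled point.
smoothed = resampled.rolling(window=2, min_periods=1).mean()

# Windowing: build rows of (lagged inputs, current value) for supervised learning.
window = 2
frame = pd.DataFrame(
    {f"lag_{k}": smoothed.shift(k) for k in range(window, 0, -1)}
)
frame["y"] = smoothed
frame = frame.dropna()  # drop rows that lack a full window of history

print(frame)
```

Each row of the resulting frame holds the two previous smoothed values as inputs and the current smoothed value as the target.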
Data-Prepez/
├── dataprepez/
│   ├── tabular/
│   │   └── __init__.py
│   ├── nlp/
│   │   └── __init__.py
│   ├── timeseries/
│   │   └── __init__.py
│   ├── core/
│   │   ├── __init__.py
│   │   └── preprocessor.py
│   └── __init__.py
├── tests/
│   └── test_preprocessor.py
├── examples/
│   └── tabular_demo.ipynb
├── README.md
├── setup.py
├── requirements.txt
└── LICENSE
We welcome contributions from the community!
To contribute:
- Fork this repository
- Create your feature branch: git checkout -b feature/YourFeature
- Commit your changes: git commit -m "Add your feature"
- Push to the branch: git push origin feature/YourFeature
- Open a pull request
- Bala Mosay J: Team Lead, Core Developer
- Allwyn Jeffo Raj A: NLP Module Developer
- Rasik S: Time Series & Testing
This project is licensed under the MIT License.
If you like this project, please star this repository and share it with others!
Clean your data, prep it like a pro with Data-Prepez