🚀 Data-Prepez

A fast, scalable, and intelligent data preprocessing library for machine learning workflows.

Data-Prepez is an open-source Python library designed to simplify and accelerate the data preprocessing stage of machine learning. It automatically handles type detection, missing values, encoding, and scaling, and it supports tabular, text (NLP), and time series data, all while scaling efficiently to millions of rows.


📌 Features

  • ✅ Automatic detection of numerical, categorical, and text features
  • ✅ Missing value imputation (mean, median, mode, forward fill, etc.)
  • ✅ One-hot and label encoding
  • ✅ Standard, MinMax, and Robust scaling
  • ✅ NLP preprocessing (cleaning, tokenization, lemmatization)
  • ✅ Time series resampling, smoothing, windowing
  • ✅ AutoPipeline: one-liner fit_transform() interface
  • ✅ Modular and production-ready design
  • ✅ Large dataset support (planned: Dask, Modin integration)
  • ✅ Compatible with Pandas, NumPy, and Scikit-learn
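
To make the tabular features above concrete, here is a minimal, dependency-free sketch of type detection, imputation, one-hot encoding, and MinMax scaling. It is an illustration only, not the library's implementation: the function names (`detect_type`, `impute`, `one_hot`, `min_max_scale`) and the `max_categories` threshold are assumptions made for this example.

```python
from statistics import mean, median, mode

def detect_type(values, max_categories=10):
    """Classify a column as 'numerical', 'categorical', or 'text' (heuristic)."""
    present = [v for v in values if v is not None]
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numerical"
    # Few distinct values: treat as categories; otherwise assume free text.
    if len(set(present)) <= max_categories:
        return "categorical"
    return "text"

def impute(values, strategy="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    present = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]

def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

ages = [20, None, 40]
colors = ["red", "blue", None, "red"]

filled_ages = impute(ages, "mean")        # None -> mean of the observed ages
filled_colors = impute(colors, "mode")    # None -> most frequent color
scaled_ages = min_max_scale(filled_ages)
categories, encoded = one_hot(filled_colors)
```

In practice the same steps are usually done with Pandas and Scikit-learn, which the library lists as compatible; plain Python is used here only to keep the example self-contained.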

🔧 Installation

📦 PyPI release coming soon!

For now, clone the repository:

git clone https://github.com/Data-PrepeZ/Data_Prepez.git
cd Data_Prepez
pip install -r requirements.txt

⚡ Quick Start

from dataprepez import AutoPreprocessor
import pandas as pd

# Load dataset
df = pd.read_csv("your_data.csv")

# Initialize the preprocessor
prep = AutoPreprocessor(target='target_column', type='tabular')

# Run preprocessing
X_clean, y = prep.fit_transform(df)
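
The fit_transform() call above follows the usual fit/transform pattern: statistics are learned once from the training data and then reused on new data. The sketch below illustrates that pattern with standard scaling. `StandardScalerSketch` is a hypothetical stand-in written for this README, not the AutoPreprocessor implementation, and it takes a list of dicts instead of a DataFrame to stay dependency-free.

```python
from statistics import mean, pstdev

class StandardScalerSketch:
    """Hypothetical stand-in; not the AutoPreprocessor implementation."""

    def __init__(self, target):
        self.target = target
        self.stats = {}

    def fit(self, rows):
        # Learn the mean and standard deviation of each feature column.
        features = [name for name in rows[0] if name != self.target]
        self.stats = {}
        for name in features:
            column = [row[name] for row in rows]
            # Fall back to 1.0 so constant columns do not divide by zero.
            self.stats[name] = (mean(column), pstdev(column) or 1.0)
        return self

    def transform(self, rows):
        # Apply the stored statistics; the target is returned unchanged.
        X = [
            [(row[name] - m) / s for name, (m, s) in self.stats.items()]
            for row in rows
        ]
        y = [row[self.target] for row in rows]
        return X, y

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)

train = [{"height": 150.0, "label": 0}, {"height": 170.0, "label": 1}]
scaler = StandardScalerSketch(target="label")
X_clean, y = scaler.fit_transform(train)

# New data is scaled with the statistics learned from `train`.
X_new, _ = scaler.transform([{"height": 160.0, "label": 1}])
```

Keeping fit and transform separate means test data is scaled with training statistics, which avoids leaking information from the test set.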

🧪 Example Notebooks

Check out the examples/ folder:

  • 🧹 tabular_demo.ipynb – Basic tabular preprocessing
  • 📝 nlp_cleaning.ipynb – NLP cleaning pipeline (coming soon)
  • 📈 timeseries_demo.ipynb – Time series preprocessing (coming soon)
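
The NLP and time series notebooks are not available yet. As a rough idea of the steps named in the feature list (text cleaning, tokenization, smoothing, windowing), here is a small dependency-free sketch. The function names are assumptions made for this example and are not the library's API; lemmatization and resampling are left out because they need external libraries.

```python
import re

def clean_and_tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()

def moving_average(values, window):
    """Smooth a series with a trailing moving average of the given size."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def sliding_windows(values, window):
    """Split a series into overlapping windows of fixed length."""
    return [values[i : i + window] for i in range(len(values) - window + 1)]

tokens = clean_and_tokenize("Hello, World! It's 2024.")
smoothed = moving_average([1, 2, 3, 4, 5], window=3)
windows = sliding_windows([1, 2, 3, 4], window=2)
```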

πŸ—‚οΈ Project Structure

Click to expand
Data-Prepez/
β”œβ”€β”€ dataprepez/
β”‚   β”œβ”€β”€ tabular/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ nlp/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ timeseries/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ core/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── preprocessor.py
β”‚   └── __init__.py
β”œβ”€β”€ tests/
β”‚   └── test_preprocessor.py
β”œβ”€β”€ examples/
β”‚   └── tabular_demo.ipynb
β”œβ”€β”€ README.md
β”œβ”€β”€ setup.py
β”œβ”€β”€ requirements.txt
└── LICENSE

🤝 Contributing

We welcome contributions from the community!
To contribute:

  1. Fork this repository
  2. Create your feature branch: git checkout -b feature/YourFeature
  3. Commit your changes: git commit -m "Add your feature"
  4. Push to the branch: git push origin feature/YourFeature
  5. Open a pull request 🎉

👥 Team

  • Bala Mosay J – Team Lead, Core Developer
  • Allwyn Jeffo Raj A – NLP Module Developer
  • Rasik S – Time Series & Testing

📄 License

This project is licensed under the MIT License.


🌟 Show Your Support

If you like this project, please ⭐ star this repository and share it with others!


Clean your data, prep it like a pro with Data-Prepez
