A modular, extensible Python framework for building high-quality datasets for LLM pretraining and fine-tuning.
- Modular Architecture: Build pipelines by chaining reusable processor modules
- Object-Oriented Design: Inherit from the Processor base class to create custom filters
- Pre-built Processors: 15+ ready-to-use filters (text, quality, language, sampling, etc.)
- Multi-Source Composition: Mix datasets from Hugging Face with configurable ratios
- Python Configuration: Pure Python API for maximum flexibility
- CLI Tools: Command-line interface for running pipelines
# Clone and install
git clone https://github.com/Taotern/TaoData.git
cd TaoData
pip install -e .
# Install optional dependencies
pip install datasets langdetect

from taodata import Pipeline
from taodata.processors import (
    CharLengthFilter,
    RegexFilter,
    DuplicateDetector,
    LanguageFilter,
)
# Define pipeline
pipeline = Pipeline(name="my_cleaner")
pipeline.add(CharLengthFilter(min_length=50, max_length=5000))
pipeline.add(RegexFilter(pattern=r"^[a-zA-Z0-9\s.,!?-]+$", keep_match=True))
pipeline.add(DuplicateDetector(method="hash"))
pipeline.add(LanguageFilter(languages=["en"]))
# Process data
results = pipeline.process_batch(items)

from taodata import Pipeline, DatasetComposer
from taodata.processors import CharLengthFilter
# Define sources
composer = DatasetComposer()
composer.add_source("fineweb", "HuggingFaceFW/fineweb", split="sample", weight=0.5)
composer.add_source("wikipedia", "wikipedia", config="20220301.en", weight=0.3)
composer.add_source("github", "codeparrot/github-code", weight=0.2)
# Define pipeline (applied to all sources)
pipeline = Pipeline(name="mixed_cleaner")
pipeline.add(CharLengthFilter(min_length=100, max_length=8000))
# Execute: sources processed separately, then round-robin merged
pipeline.execute(
    sources=composer,
    output_path="output.jsonl",
    batch_size=1000,
)

taodata/
├── core/ # Base classes: Processor, Pipeline
├── processors/ # Pre-built filters: text, quality, language, structured, sampling
├── io/ # Readers, writers, batch processing
├── sources/ # Multi-source composition, HF loaders
├── cli/ # Command-line interface
└── utils/ # Utilities and logging
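To make the role of sources/ more concrete, here is a minimal, standalone sketch of how per-source weights such as 0.5 / 0.3 / 0.2 could be turned into a single mixed stream. It does not use TaoData itself: `weighted_mix`, `toy_sources`, and `toy_weights` are hypothetical names, and the sketch uses weighted random selection, which may differ from the round-robin merge TaoData performs.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def weighted_mix(
    sources: Dict[str, Iterable[dict]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, dict]]:
    """Yield (source_name, item) pairs, choosing the next source by weight.

    Hypothetical helper, not part of TaoData. Exhausted sources are dropped;
    random.choices treats the remaining weights as relative, so they do not
    need to be renormalized.
    """
    rng = random.Random(seed)
    iterators = {name: iter(src) for name, src in sources.items()}
    while iterators:
        names = list(iterators)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # This source has run dry; keep drawing from the others.
            del iterators[name]


# Toy in-memory generators standing in for streamed datasets.
toy_sources = {
    "fineweb": ({"text": f"web {i}"} for i in range(500)),
    "wikipedia": ({"text": f"wiki {i}"} for i in range(300)),
    "github": ({"text": f"code {i}"} for i in range(200)),
}
toy_weights = {"fineweb": 0.5, "wikipedia": 0.3, "github": 0.2}

mixed = list(weighted_mix(toy_sources, toy_weights))
```

Because each source is consumed lazily through an iterator, the same approach works with streamed datasets. The weights control the interleaving order, while the final proportions also depend on how many items each source actually yields.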
Text:
- CharLengthFilter: Filter by character count
- WordCountFilter: Filter by word count
- RegexFilter: Pattern matching/exclusion
- TextNormalizer: Lowercase, strip whitespace, etc.
- UnicodeCleaner: Remove control and zero-width characters

Quality:
- DuplicateDetector: Hash-based deduplication
- QualityScorer: Multi-metric quality scoring
- BlankLineFilter: Filter by blank line content

Language:
- LanguageDetector: Detect language (requires langdetect)
- LanguageFilter: Keep/exclude specific languages

Structured:
- JSONValidator: Validate JSON structure
- FieldExtractor: Select/keep specific fields
- FieldRenamer: Rename fields
- ValueFilter: Filter by field values

Sampling:
- RandomSampler: Random downsampling
- RatioSampler: Sample to ratio target
- StratifiedSampler: Balanced sampling across groups
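As an illustration of what hash-based deduplication generally involves, the following standalone sketch keeps the first occurrence of each text and drops later copies. It is not DuplicateDetector's actual implementation: `normalize`, `dedupe_exact`, the `text` field name, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())


def dedupe_exact(items: Iterable[dict], field: str = "text") -> Iterator[dict]:
    """Yield each item whose normalized text has not been seen before.

    Hypothetical helper, not part of TaoData. Only the 32-byte digests are
    kept in memory, not the texts themselves.
    """
    seen = set()
    for item in items:
        digest = hashlib.sha256(normalize(item[field]).encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield item


docs = [
    {"text": "Hello   world"},
    {"text": "hello world"},
    {"text": "Something else"},
]
unique = list(dedupe_exact(docs))
```

Here the second record is dropped because it normalizes to the same string as the first. This approach only catches exact matches after normalization; near-duplicates would need a different technique.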
Inherit from Processor and implement process():
from taodata import Processor
from taodata.core.types_ import ProcessingResult

class MyFilter(Processor):
    def __init__(self, threshold=0.5):
        super().__init__()
        self.threshold = threshold

    def process(self, item):
        # score() is a placeholder for your own scoring function
        accept = score(item) > self.threshold
        return ProcessingResult(item, accept=accept)

All configuration is done in pure Python code. Example:
# pipeline_config.py
from taodata import Pipeline
from taodata.processors import *
pipeline = Pipeline(name="standard_cleaner")
pipeline.add(CharLengthFilter(min_length=50, max_length=5000))
pipeline.add(DuplicateDetector(method="hash"))
pipeline.add(RegexFilter(pattern=r"^[a-zA-Z]+$", keep_match=True))

taodata run <config.py> --input data.jsonl --output clean.jsonl [--batch-size 1000]

- Designed for medium-scale datasets (1M-100M samples on a single machine)
- Streaming from Hugging Face (no full download needed)
- Batch processing with configurable batch sizes
- Throughput tracking and ETA calculation
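The batching and ETA points above can be sketched in a few lines of plain Python. This is an illustrative outline only, not TaoData's internals: `batched` and `ProgressTracker` are hypothetical names, and the injectable `clock` parameter exists purely to make the arithmetic easy to check.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of up to batch_size items without materializing the whole input."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class ProgressTracker:
    """Track items per second and estimate remaining time (hypothetical helper)."""

    def __init__(self, total: Optional[int] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.total = total  # None when the input size is unknown (e.g. streaming)
        self.clock = clock
        self.start = clock()
        self.done = 0

    def update(self, n: int) -> None:
        self.done += n

    def throughput(self) -> float:
        elapsed = self.clock() - self.start
        return self.done / elapsed if elapsed > 0 else 0.0

    def eta_seconds(self) -> Optional[float]:
        rate = self.throughput()
        if self.total is None or rate == 0:
            return None  # no basis for an estimate yet
        return max(self.total - self.done, 0) / rate


items = [{"text": f"doc {i}"} for i in range(2500)]
tracker = ProgressTracker(total=len(items))
batch_sizes = []
for batch in batched(items, batch_size=1000):
    batch_sizes.append(len(batch))  # a real pipeline would process the batch here
    tracker.update(len(batch))
```

With 2,500 items and a batch size of 1,000, the loop sees batches of 1,000, 1,000, and 500 items. When the total is unknown, as with a streamed source, the sketch reports throughput but returns no ETA.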
# Hugging Face dataset support
pip install datasets
# Language detection (needed by LanguageDetector)
pip install langdetect
# Development dependencies
pip install -r requirements-dev.txt

- Phase 1 (Complete): Core framework, base classes, types
- Phase 2 (Complete): I/O layer (readers, writers, batch processing)
- Phase 3 (Complete): Pre-built processors
- Phase 3B (Complete): HF dataset integration
- Phase 4 (Complete): Daemon infrastructure
- Phase 5 (Complete): TUI monitoring
- Phase 6 (Complete): CLI entry points

- Distributed processing (Spark, Ray)
- Web UI dashboard
- Advanced metrics and profiling
- Database connectors
- Real-time streaming support
- Community processor library
MIT License
Contributions are welcome! Please follow the development guidelines in CONTRIBUTING.md.
If you use TaoData in your research, please cite:
@software{taodata2026,
  title={TaoData: Modular Data Processing Pipeline Framework},
  author={TaoData Contributors},
  year={2026},
  url={https://github.com/Taotern/TaoData}
}