TaoData - Data Processing Pipeline Framework

A modular, extensible Python framework for building high-quality datasets for LLM pretraining and fine-tuning.

Features

  • Modular Architecture: Build pipelines by chaining reusable processor modules
  • Object-Oriented Design: Inherit from Processor base class to create custom filters
  • Pre-built Processors: 15+ ready-to-use filters (text, quality, language, sampling, etc.)
  • Multi-Source Composition: Mix datasets from Hugging Face with configurable ratios
  • Python Configuration: Pure Python API for maximum flexibility
  • CLI Tools: Command-line interface for running pipelines

Quick Start

Installation

# Clone and install
git clone https://github.com/Taotern/TaoData.git
cd TaoData
pip install -e .

# Install optional dependencies
pip install datasets langdetect

Basic Usage

from taodata import Pipeline
from taodata.processors import (
    CharLengthFilter,
    RegexFilter,
    DuplicateDetector,
    LanguageFilter,
)

# Define pipeline
pipeline = Pipeline(name="my_cleaner")
pipeline.add(CharLengthFilter(min_length=50, max_length=5000))
pipeline.add(RegexFilter(pattern=r"^[a-zA-Z0-9\s.,!?-]+$", keep_match=True))
pipeline.add(DuplicateDetector(method="hash"))
pipeline.add(LanguageFilter(languages=["en"]))

# Process data (items is a list of input records that you have already loaded)
results = pipeline.process_batch(items)

Multi-Source Composition

from taodata import Pipeline, DatasetComposer
from taodata.processors import CharLengthFilter

# Define sources
composer = DatasetComposer()
composer.add_source("fineweb", "HuggingFaceFW/fineweb", split="sample", weight=0.5)
composer.add_source("wikipedia", "wikipedia", config="20220301.en", weight=0.3)
composer.add_source("github", "codeparrot/github-code", weight=0.2)

# Define pipeline (applied to all sources)
pipeline = Pipeline(name="mixed_cleaner")
pipeline.add(CharLengthFilter(min_length=100, max_length=8000))

# Execute: sources processed separately, then round-robin merged
pipeline.execute(
    sources=composer,
    output_path="output.jsonl",
    batch_size=1000
)
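To make the effect of the `weight` values concrete, here is a minimal, self-contained sketch of mixing several sources by weight. It does not use TaoData and is not its merge implementation: `weighted_interleave` is a hypothetical helper, and it picks the next source at random in proportion to its weight rather than following a strict round-robin order.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def weighted_interleave(
    sources: Dict[str, Iterable],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, item) pairs, choosing the next source by weight.

    Hypothetical helper, not part of TaoData. A source is dropped once it is
    exhausted; random.choices treats the remaining weights as relative.
    """
    rng = random.Random(seed)
    iterators = {name: iter(src) for name, src in sources.items()}
    while iterators:
        names = list(iterators)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # This source has no more items; stop drawing from it.
            del iterators[name]


# Toy stand-ins for the three sources named in the example above.
sources = {
    "fineweb": (f"web-{i}" for i in range(5)),
    "wikipedia": (f"wiki-{i}" for i in range(3)),
    "github": (f"code-{i}" for i in range(2)),
}
weights = {"fineweb": 0.5, "wikipedia": 0.3, "github": 0.2}
mixed = list(weighted_interleave(sources, weights))
```

Every item from every source appears exactly once in `mixed`, and items from the same source keep their original relative order. Only the interleaving depends on the weights and the seed.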

Architecture

taodata/
├── core/              # Base classes: Processor, Pipeline
├── processors/        # Pre-built filters: text, quality, language, structured, sampling
├── io/               # Readers, writers, batch processing
├── sources/          # Multi-source composition, HF loaders
├── cli/              # Command-line interface
└── utils/            # Utilities and logging

Pre-Built Processors

Text Filters

  • CharLengthFilter — Filter by character count
  • WordCountFilter — Filter by word count
  • RegexFilter — Pattern matching/exclusion
  • TextNormalizer — Lowercase, strip whitespace, etc.
  • UnicodeCleaner — Remove control and zero-width characters
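
As an illustration of what removing control and zero-width characters can involve, the following standard-library sketch drops characters in Unicode category `Cc` and a small set of zero-width code points. The function name `clean_unicode`, the choice to keep newlines and tabs, and the exact character set are assumptions made for this example, not the behavior of `UnicodeCleaner`.

```python
import unicodedata

# Zero-width and BOM code points (Unicode category "Cf", format characters).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def clean_unicode(text: str) -> str:
    """Drop control (category Cc) and zero-width characters.

    Newline and tab are kept, which is an assumption of this sketch.
    """
    kept = []
    for ch in text:
        if ch in ("\n", "\t"):
            kept.append(ch)
        elif ch in ZERO_WIDTH or unicodedata.category(ch) == "Cc":
            continue  # skip invisible or control characters
        else:
            kept.append(ch)
    return "".join(kept)
```

Legitimate non-ASCII text such as accented letters passes through unchanged, because only the listed categories are removed.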

Quality Processors

  • DuplicateDetector — Hash-based deduplication
  • QualityScorer — Multi-metric quality scoring
  • BlankLineFilter — Filter by blank line content
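
Hash-based deduplication can be sketched in a few lines of standard-library Python. This is an illustrative stand-in, not the `DuplicateDetector` implementation: the helper name `dedupe_by_hash`, the use of SHA-256, and stripping surrounding whitespace before hashing are all assumptions.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_by_hash(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text the first time its normalized hash is seen."""
    seen = set()
    for text in texts:
        # Normalize lightly so trailing spaces do not defeat exact matching.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Storing fixed-size digests instead of full texts keeps memory use proportional to the number of unique items rather than to their total length. This approach only catches exact duplicates after normalization, not near-duplicates.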

Language Processors

  • LanguageDetector — Detect language (requires langdetect)
  • LanguageFilter — Keep/exclude specific languages

Structured Data

  • JSONValidator — Validate JSON structure
  • FieldExtractor — Select/keep specific fields
  • FieldRenamer — Rename fields
  • ValueFilter — Filter by field values
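
The four structured-data steps (validate, select fields, rename fields, filter by value) can be combined as in the sketch below. It uses only the `json` module. The helpers `parse_record` and `transform`, and the field names `content`, `text`, and `score`, are hypothetical and chosen for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional


def parse_record(line: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the line is not a JSON object."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def transform(
    lines: Iterable[str],
    keep: List[str],
    rename: Dict[str, str],
    min_score: float,
) -> Iterator[dict]:
    """Validate, filter by the 'score' field, select fields, then rename them."""
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue  # invalid JSON or not an object
        if record.get("score", 0.0) < min_score:
            continue  # value filter on a hypothetical numeric field
        selected = {k: record[k] for k in keep if k in record}
        yield {rename.get(k, k): v for k, v in selected.items()}
```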

Sampling

  • RandomSampler — Random downsampling
  • RatioSampler — Sample to ratio target
  • StratifiedSampler — Balanced sampling across groups

Creating Custom Processors

Inherit from Processor and implement process():

from taodata import Processor
from taodata.core.types_ import ProcessingResult

class MyFilter(Processor):
    def __init__(self, threshold=0.5):
        super().__init__()
        self.threshold = threshold
    
    def process(self, item):
        # score() is a placeholder: supply your own function that returns a float
        accept = score(item) > self.threshold
        return ProcessingResult(item, accept=accept)
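
To show how accept/reject results might flow through a chain of processors, the sketch below uses stand-in classes rather than TaoData itself. `Result`, `MinLength`, `Lowercase`, and `run_chain` are hypothetical names, and the short-circuit behavior on rejection is an assumption about how such a pipeline could work, not a description of `Pipeline` internals.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Result:
    """Stand-in for a processing result: the (possibly modified) item and a verdict."""
    item: Any
    accept: bool


class MinLength:
    """Filter: rejects strings shorter than min_length."""

    def __init__(self, min_length: int):
        self.min_length = min_length

    def process(self, item: str) -> Result:
        return Result(item, accept=len(item) >= self.min_length)


class Lowercase:
    """Transformer: always accepts, returns a lowercased copy."""

    def process(self, item: str) -> Result:
        return Result(item.lower(), accept=True)


def run_chain(processors: List[Any], item: Any) -> Optional[Any]:
    """Pass the item through each processor; stop at the first rejection."""
    for processor in processors:
        result = processor.process(item)
        if not result.accept:
            return None
        item = result.item  # later processors see the transformed item
    return item
```

With this shape, `run_chain([MinLength(3), Lowercase()], "Hello")` returns the lowercased string, while an input shorter than three characters is rejected before `Lowercase` runs.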

Configuration

All configuration is done in pure Python code. Example:

# pipeline_config.py
from taodata import Pipeline
from taodata.processors import CharLengthFilter, DuplicateDetector, RegexFilter

pipeline = Pipeline(name="standard_cleaner")
pipeline.add(CharLengthFilter(min_length=50, max_length=5000))
pipeline.add(DuplicateDetector(method="hash"))
pipeline.add(RegexFilter(pattern=r"^[a-zA-Z]+$", keep_match=True))

CLI Commands

taodata run <config.py> --input data.jsonl --output clean.jsonl [--batch-size 1000]
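
One plausible way for a runner to consume a Python config file is to import it by path and read a module-level variable. The sketch below assumes the config exposes a variable named `pipeline`, as the configuration example does. `load_pipeline` is a hypothetical helper, and nothing here is confirmed to match what `taodata run` does internally.

```python
import importlib.util
from pathlib import Path


def load_pipeline(config_path: str, attr: str = "pipeline"):
    """Import a Python file by path and return its module-level `attr`."""
    path = Path(config_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load config from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the config file's top-level code
    try:
        return getattr(module, attr)
    except AttributeError:
        raise AttributeError(f"{path} does not define '{attr}'") from None
```

Because the config file is executed as ordinary Python, it should only be loaded from trusted locations.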

Performance

  • Designed for medium-scale datasets (1M-100M samples on a single machine)
  • Streaming from Hugging Face (no full download needed)
  • Batch processing with configurable batch sizes
  • Throughput tracking and ETA calculation
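
Batching and ETA estimation reduce to simple arithmetic: throughput is items processed divided by elapsed time, and the ETA is the remaining items divided by that throughput. The following sketch uses hypothetical helpers `batched` and `eta_seconds` and is not the framework's own tracking code.

```python
import time
from itertools import islice
from typing import Iterable, Iterator, List


def batched(iterable: Iterable, batch_size: int) -> Iterator[List]:
    """Yield lists of up to batch_size items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def eta_seconds(done: int, total: int, elapsed: float) -> float:
    """Estimate remaining seconds from the average throughput so far."""
    if done == 0 or elapsed <= 0:
        return float("inf")  # no throughput measurement yet
    rate = done / elapsed  # items per second
    return (total - done) / rate


# Example loop: track progress while consuming batches.
total = 10
done = 0
start = time.perf_counter()
for batch in batched(range(total), batch_size=4):
    done += len(batch)
    remaining = eta_seconds(done, total, time.perf_counter() - start)
```

For instance, 250 of 1000 items processed in 5 seconds gives a throughput of 50 items per second and an ETA of 15 seconds.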

Optional Dependencies

Hugging Face Support

pip install datasets

Language Detection

pip install langdetect

Development

pip install -r requirements-dev.txt

Project Status

  • Phase 1 (Complete): Core framework, base classes, types
  • Phase 2 (Complete): I/O layer (readers, writers, batch processing)
  • Phase 3 (Complete): Pre-built processors
  • Phase 3B (Complete): HF dataset integration
  • Phase 4 (Complete): Daemon infrastructure
  • Phase 5 (Complete): TUI monitoring
  • Phase 6 (Complete): CLI entry points

Roadmap

  • Distributed processing (Spark, Ray)
  • Web UI dashboard
  • Advanced metrics and profiling
  • Database connectors
  • Real-time streaming support
  • Community processor library

License

MIT License

Contributing

Contributions are welcome! Please follow the development guidelines in CONTRIBUTING.md.

Citation

If you use TaoData in your research, please cite:

@software{taodata2026,
  title={TaoData: Modular Data Processing Pipeline Framework},
  author={TaoData Contributors},
  year={2026},
  url={https://github.com/Taotern/TaoData}
}
