Schema-based CSV validation, transformation, and error reporting. Define your data rules in a YAML file and run the tool against any CSV — it tells you exactly what's wrong, fixes what it can, and outputs a clean file plus a human-readable report.
- Schema-driven: define column rules in YAML or JSON (no code changes for new schemas)
- 8 data types, including `string`, `integer`, `float`, `date`, `email`, and `boolean`, with automatic format detection
- Full rule set: required/nullable, min/max value, min/max length, regex patterns, allowed-values lists
- Automatic date parsing: handles `YYYY-MM-DD`, `DD/MM/YYYY`, `MM/DD/YYYY`, and more
- European number formats: handles `1.299,00` and `3,14` decimal styles
- Deduplication: removes exact duplicate rows by content hash
- Column renaming: map input column names to output names in the schema
- Two outputs: clean CSV + plain-text error report with row-level details
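
To make "exact duplicate rows by content hash" concrete, here is a minimal sketch of how such a step could work. It is not the tool's actual code: the names `row_hash` and `dedupe_rows`, the choice of SHA-256, and the separator used to join cells are all assumptions made for illustration.

```python
import hashlib


def row_hash(row: list[str]) -> str:
    """Return a hex digest that identifies a row by its cell contents."""
    # Join with the ASCII unit separator so ["a,b"] and ["a", "b"] hash differently.
    joined = "\x1f".join(row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe_rows(rows: list[list[str]]) -> tuple[list[list[str]], int]:
    """Return (unique rows in first-seen order, number of duplicates dropped)."""
    seen: set[str] = set()
    unique: list[list[str]] = []
    duplicates = 0
    for row in rows:
        digest = row_hash(row)
        if digest in seen:
            duplicates += 1  # exact repeat of an earlier row
            continue
        seen.add(digest)
        unique.append(row)
    return unique, duplicates


rows = [["1", "a@example.com"], ["2", "b@example.com"], ["1", "a@example.com"]]
unique, dropped = dedupe_rows(rows)
print(len(unique), dropped)
```

Storing digests rather than whole rows keeps memory use roughly constant per row, which matters for wide CSV files. Whether the real implementation hashes raw or normalised cell values is not stated here.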
```bash
pip install -e ".[dev]"

# Validate against example schema
python main.py \
  --input examples/customers_sample.csv \
  --schema examples/customers_schema.yaml \
  --output output/clean.csv \
  --report output/errors.txt
```

Output:
```text
Results for: examples/customers_sample.csv
Total rows: 8
Clean rows: 3 (37.5%)
Error rows: 4
Duplicates: 1
6 validation error(s) found.
Row 3: 'email' invalid email address 'not-an-email'
Row 4: 'first_name' too short (min 2 chars)
...
```
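
The row-level messages in the report suggest a per-cell check that returns one message per failed rule. The following is a rough sketch of that idea only. The function `check_cell`, its parameters, and the simplified email pattern are hypothetical and do not reflect the project's real `rules.py`.

```python
import re

# Deliberately simple pattern for illustration; real email validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_cell(row_number, column, value, *, dtype="string", min_length=None):
    """Return a list of error messages for one cell (empty list means valid)."""
    errors = []
    if dtype == "email" and not EMAIL_RE.match(value):
        errors.append(f"Row {row_number}: '{column}' invalid email address '{value}'")
    if min_length is not None and len(value) < min_length:
        errors.append(f"Row {row_number}: '{column}' too short (min {min_length} chars)")
    return errors


print(check_cell(3, "email", "not-an-email", dtype="email"))
print(check_cell(4, "first_name", "J", min_length=2))
print(check_cell(5, "email", "user@example.com", dtype="email"))
```

Returning a list rather than raising on the first failure lets a single row contribute several errors, which is consistent with the sample report showing 6 validation errors across 4 error rows.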
```yaml
delimiter: ","
encoding: "utf-8"
allow_extra_columns: false
columns:
  - name: customer_id
    dtype: integer
    required: true
    min_value: 1
  - name: email
    dtype: email
    required: true
  - name: signup_date
    dtype: date
    required: true
  - name: plan
    dtype: string
    required: true
    allowed_values: ["free", "starter", "pro", "enterprise"]
  - name: amount
    dtype: float
    nullable: true
    min_value: 0
    rename_to: mrr_usd  # rename in output
```

| Type | Example values | Notes |
|---|---|---|
| `string` | any text | Use `min_length`, `max_length`, `pattern`, `allowed_values` |
| `integer` | `42`, `1,000` | Comma separators stripped automatically |
| `float` | `3.14`, `3,14`, `1.299,00` | EU format auto-detected |
| `date` | `2024-01-15`, `15/01/2024` | 5 formats recognised |
| `email` | `user@example.com` | RFC-compliant regex |
| `boolean` | `true`, `1`, `yes`, `y` | Case-insensitive |
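
As an illustration of what "auto-detected" could mean for the `float`, `date`, and `boolean` rows, here is a small sketch using only the standard library. The helper names, the separator heuristic, the three date formats shown (the table mentions five, which are not all listed), and the set of false values are assumptions, not the tool's documented behaviour.

```python
from datetime import date, datetime

# Tried in order; an ambiguous value like 03/04/2024 resolves to day-first here.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]
TRUE_VALUES = {"true", "1", "yes", "y"}
FALSE_VALUES = {"false", "0", "no", "n"}  # assumed counterparts


def parse_float(text: str) -> float:
    """Parse 3.14, 3,14 and 1.299,00 style numbers."""
    s = text.strip()
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")  # 1.299,00 -> 1299.00
        else:
            s = s.replace(",", "")  # 1,299.00 -> 1299.00
    elif "," in s:
        s = s.replace(",", ".")  # 3,14 -> 3.14 (a lone comma is read as a decimal mark)
    return float(s)


def parse_date(text: str) -> date:
    """Return the first successful parse, or raise ValueError."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def parse_bool(text: str) -> bool:
    """Case-insensitive boolean parsing."""
    s = text.strip().lower()
    if s in TRUE_VALUES:
        return True
    if s in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognised boolean: {text!r}")


print(parse_float("1.299,00"), parse_date("15/01/2024"), parse_bool("YES"))
```

Note that a lone comma is inherently ambiguous: this sketch reads `1,000` as `1.0`, whereas the `integer` type strips commas as thousands separators. A real implementation has to pick a rule per dtype.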
```bash
pytest --cov=csv_validator
```

```text
csv-validator/
├── csv_validator/
│   ├── schema.py      # ColumnRule and Schema dataclasses
│   ├── rules.py       # Per-cell validation logic
│   └── validator.py   # Orchestration, dedup, output writing
├── tests/
├── examples/
│   ├── customers_schema.yaml
│   └── customers_sample.csv
├── main.py
└── pyproject.toml
```
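
For readers wondering what the `ColumnRule` and `Schema` dataclasses in `schema.py` might look like, here is a hedged sketch. Only the class names and the field names (taken from the schema example) come from this README. The defaults, the `from_dict` constructor, and the `output_name` property are assumptions. The example loads JSON rather than YAML so that it needs nothing beyond the standard library.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnRule:
    name: str
    dtype: str = "string"
    required: bool = False
    nullable: bool = False
    min_value: Optional[float] = None
    allowed_values: Optional[list] = None
    rename_to: Optional[str] = None

    @property
    def output_name(self) -> str:
        """Column name used in the clean CSV (hypothetical helper)."""
        return self.rename_to or self.name


@dataclass
class Schema:
    delimiter: str = ","
    encoding: str = "utf-8"
    allow_extra_columns: bool = False
    columns: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "Schema":
        """Build a Schema from a parsed YAML/JSON mapping (assumed constructor)."""
        columns = [ColumnRule(**col) for col in data.get("columns", [])]
        options = {k: v for k, v in data.items() if k != "columns"}
        return cls(columns=columns, **options)


raw = json.loads("""
{
  "delimiter": ",",
  "allow_extra_columns": false,
  "columns": [
    {"name": "customer_id", "dtype": "integer", "required": true, "min_value": 1},
    {"name": "amount", "dtype": "float", "nullable": true, "min_value": 0,
     "rename_to": "mrr_usd"}
  ]
}
""")
schema = Schema.from_dict(raw)
print([col.output_name for col in schema.columns])
```

Because `ColumnRule(**col)` rejects unknown keys with a `TypeError`, a typo in a schema file fails at load time instead of being silently ignored.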