IgnazioDS/csv-validator

csv-validator

Schema-based CSV validation, transformation, and error reporting. Define your data rules in a YAML file and run the tool against any CSV — it tells you exactly what's wrong, fixes what it can, and outputs a clean file plus a human-readable report.

Features

  • Schema-driven — define column rules in YAML or JSON (no code changes for new schemas)
  • 6 data types: string, integer, float, date, email, boolean, with automatic format detection
  • Full rule set — required/nullable, min/max value, min/max length, regex patterns, allowed values lists
  • Automatic date parsing — handles YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY, and more
  • European number formats — handles 1.299,00 and 3,14 decimal styles
  • Deduplication — removes exact duplicate rows by content hash
  • Column renaming — map input column names to output names in schema
  • Two outputs — clean CSV + plain-text error report with row-level details
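The deduplication feature above can be illustrated with a minimal sketch. It assumes rows are compared on their raw cell text and that the first occurrence is kept. The function name `dedupe_rows`, the SHA-256 hash, and the separator are illustrative choices, not taken from the project's code.

```python
import csv
import hashlib
import io


def dedupe_rows(rows):
    """Return (unique_rows, duplicate_count), keeping first occurrences.

    Hypothetical helper: hashes each row's content and skips repeats.
    """
    seen = set()
    unique = []
    duplicates = 0
    for row in rows:
        # Join cells with the ASCII unit separator so ["ab", "c"] and
        # ["a", "bc"] do not collide (assumed separator choice).
        key = hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        unique.append(row)
    return unique, duplicates


sample = "id,email\n1,a@example.com\n2,b@example.com\n1,a@example.com\n"
rows = list(csv.reader(io.StringIO(sample)))
header, body = rows[0], rows[1:]
unique, dup_count = dedupe_rows(body)
print(len(unique), dup_count)  # prints: 2 1
```

Hashing keeps memory use proportional to the number of distinct rows rather than their total size, which matters for wide files.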

Quickstart

pip install -e ".[dev]"

# Validate against example schema
python main.py \
  --input examples/customers_sample.csv \
  --schema examples/customers_schema.yaml \
  --output output/clean.csv \
  --report output/errors.txt

Output:

Results for: examples/customers_sample.csv
  Total rows:     8
  Clean rows:     3  (37.5%)
  Error rows:     4
  Duplicates:     1

  6 validation error(s) found.
    Row 3: 'email' invalid email address 'not-an-email'
    Row 4: 'first_name' too short (min 2 chars)
    ...

Schema format

delimiter: ","
encoding: "utf-8"
allow_extra_columns: false

columns:
  - name: customer_id
    dtype: integer
    required: true
    min_value: 1

  - name: email
    dtype: email
    required: true

  - name: signup_date
    dtype: date
    required: true

  - name: plan
    dtype: string
    required: true
    allowed_values: ["free", "starter", "pro", "enterprise"]

  - name: amount
    dtype: float
    nullable: true
    min_value: 0
    rename_to: mrr_usd   # rename in output
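As a rough illustration of how such a schema could map onto dataclasses, here is a sketch that loads the JSON form of a schema (the feature list says JSON is accepted alongside YAML). The class names `ColumnRule` and `Schema` come from the project structure, but their fields, the defaults, and the helpers `load_schema` and `output_header` are assumptions based only on the keys shown in the example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnRule:
    # Fields mirror the keys in the example schema; defaults are assumed.
    name: str
    dtype: str = "string"
    required: bool = False
    nullable: bool = False
    min_value: Optional[float] = None
    allowed_values: Optional[list] = None
    rename_to: Optional[str] = None


@dataclass
class Schema:
    columns: list
    delimiter: str = ","
    encoding: str = "utf-8"
    allow_extra_columns: bool = False


def load_schema(text):
    """Build a Schema from JSON text (hypothetical loader)."""
    raw = json.loads(text)
    columns = [ColumnRule(**col) for col in raw.pop("columns")]
    return Schema(columns=columns, **raw)


def output_header(schema):
    """Column names for the clean file, applying rename_to where set."""
    return [col.rename_to or col.name for col in schema.columns]


SCHEMA_JSON = """
{
  "delimiter": ",",
  "allow_extra_columns": false,
  "columns": [
    {"name": "customer_id", "dtype": "integer", "required": true, "min_value": 1},
    {"name": "plan", "dtype": "string", "required": true,
     "allowed_values": ["free", "starter", "pro", "enterprise"]},
    {"name": "amount", "dtype": "float", "nullable": true, "min_value": 0,
     "rename_to": "mrr_usd"}
  ]
}
"""

schema = load_schema(SCHEMA_JSON)
print(output_header(schema))  # prints: ['customer_id', 'plan', 'mrr_usd']
```

Passing each column mapping straight to the dataclass constructor means an unknown key raises a `TypeError`, which gives a cheap check for typos in a schema file.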

Supported types

| Type    | Example values           | Notes                                                |
|---------|--------------------------|------------------------------------------------------|
| string  | any text                 | Use min_length, max_length, pattern, allowed_values  |
| integer | 42, 1,000                | Comma separators stripped automatically              |
| float   | 3.14, 3,14, 1.299,00     | EU format auto-detected                              |
| date    | 2024-01-15, 15/01/2024   | 5 formats recognised                                 |
| email   | user@example.com         | RFC-compliant regex                                  |
| boolean | true, 1, yes, y          | Case-insensitive                                     |
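To make the conversions listed above concrete, here is a hedged sketch of how float, date, and boolean cells might be parsed. The function names, the "last separator is the decimal mark" heuristic, the list of date formats (only the three named in the feature list), and the set of false values are all assumptions for illustration. The project's own rules may differ.

```python
from datetime import date, datetime

# Only the three formats named in the feature list; the tool recognises more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")
TRUE_VALUES = {"true", "1", "yes", "y"}
FALSE_VALUES = {"false", "0", "no", "n"}  # assumed counterparts


def parse_float(text):
    """Parse '3.14', '3,14', or '1.299,00' (assumed heuristic)."""
    s = text.strip()
    if "," in s and "." in s:
        # Treat whichever separator appears last as the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # A lone comma is read as a decimal comma in this sketch.
        s = s.replace(",", ".")
    return float(s)


def parse_date(text):
    """Try each known format in order; day-first wins for ambiguous dates."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def parse_bool(text):
    """Case-insensitive boolean parsing."""
    s = text.strip().lower()
    if s in TRUE_VALUES:
        return True
    if s in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognised boolean: {text!r}")


print(parse_float("1.299,00"))   # prints: 1299.0
print(parse_date("15/01/2024"))  # prints: 2024-01-15
print(parse_bool("YES"))         # prints: True
```

Two inputs stay ambiguous under any heuristic of this kind: a float column cannot tell `1,000` (one thousand) from `1,000` (one, to three decimals), and `01/02/2024` is a valid date in both day-first and month-first order, so the order of `DATE_FORMATS` decides the result.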

Tests

pytest --cov=csv_validator

Project structure

csv-validator/
├── csv_validator/
│   ├── schema.py      # ColumnRule and Schema dataclasses
│   ├── rules.py       # Per-cell validation logic
│   └── validator.py   # Orchestration, dedup, output writing
├── tests/
├── examples/
│   ├── customers_schema.yaml
│   └── customers_sample.csv
├── main.py
└── pyproject.toml

About

Schema-based CSV validation, transformation, and error reporting — YAML-driven, zero dependencies
