refactor: implement Pydantic v2 throughout codebase #83

@DecisionNerd

Summary

Pydantic is currently installed but completely unused. All models use plain @dataclass. This issue tracks full implementation of Pydantic v2 for proper validation, serialization, and future ontology support.

Why Pydantic?

  • Validation: Type validation at runtime for API inputs and data loading
  • Serialization: Clean JSON/dict serialization for future needs
  • Schema generation: OpenAPI/JSON schema for tooling
  • IDE support: Better autocomplete and type hints
  • Future-proofing: Essential for the planned ontology support and class-based object handling
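
As a rough illustration of the first three points, the sketch below contrasts a plain dataclass with a Pydantic v2 model. `PropertyDC` and `PropertyModel` are hypothetical names invented for this example and are not taken from the graphforge codebase.

```python
from dataclasses import dataclass

from pydantic import BaseModel, ValidationError


@dataclass
class PropertyDC:
    """Hypothetical dataclass: type hints are not enforced at runtime."""

    key: str
    value: int


class PropertyModel(BaseModel):
    """Hypothetical Pydantic equivalent: fields are validated on construction."""

    key: str
    value: int


# The dataclass accepts a wrongly typed value without complaint.
dc = PropertyDC(key="age", value="not a number")

# The Pydantic model rejects the same input with a structured error.
try:
    PropertyModel(key="age", value="not a number")
except ValidationError as exc:
    error_count = exc.error_count()

# Serialization and JSON schema generation are built in.
model = PropertyModel(key="age", value=42)
as_dict = model.model_dump()
schema = PropertyModel.model_json_schema()
```

Note that Pydantic's default (lax) mode still coerces compatible inputs, so `value="42"` would be accepted and converted to `42`; strict mode would need to be enabled if that is not wanted.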

Scope

Convert key models to Pydantic v2 BaseModel:

  1. AST nodes (src/graphforge/ast/):

    • Expression nodes with field validators
    • Clause nodes with model validators
    • Pattern nodes with constraint validation
  2. Planner operators (src/graphforge/planner/operators.py):

    • Add validators for operator constraints
    • Validate operator composition rules
  3. API inputs (src/graphforge/api.py):

    • Validate query strings
    • Validate transaction state
  4. Dataset metadata (src/graphforge/datasets/base.py):

    • DatasetInfo with URL validation
    • Size constraints
  5. Storage models:

    • Node/edge property validation
    • Label/type validation
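
A minimal sketch of what two of these conversions might look like, assuming field names that are not confirmed by the current code. `Variable` is a hypothetical AST expression node, and the `DatasetInfo` fields shown here (`name`, `url`, `size_mb`) are illustrative guesses rather than the real definition in `src/graphforge/datasets/base.py`.

```python
from pydantic import (
    BaseModel,
    ConfigDict,
    Field,
    HttpUrl,
    ValidationError,
    field_validator,
)


class Variable(BaseModel):
    """Hypothetical AST expression node with a field validator."""

    # frozen=True keeps nodes immutable and hashable, like a frozen dataclass.
    model_config = ConfigDict(frozen=True)

    name: str

    @field_validator("name")
    @classmethod
    def name_must_be_identifier(cls, value: str) -> str:
        # Assumed rule for illustration: names must be valid identifiers.
        if not value.isidentifier():
            raise ValueError(f"invalid variable name: {value!r}")
        return value


class DatasetInfo(BaseModel):
    """Hypothetical dataset metadata with URL and size constraints."""

    name: str = Field(min_length=1)
    url: HttpUrl  # only http/https URLs pass validation
    size_mb: float = Field(gt=0)  # size must be strictly positive


info = DatasetInfo(name="example", url="https://example.com/data.csv", size_mb=12.5)

try:
    DatasetInfo(name="example", url="not a url", size_mb=-1)
except ValidationError as exc:
    # Both the URL and the size are reported in a single error.
    failed_fields = sorted(err["loc"][0] for err in exc.errors())
```

Declarative constraints such as `Field(gt=0)` cover the simple cases; `field_validator` is only needed where the rule cannot be expressed as a constraint.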

Implementation Checklist

  • Convert AST expression nodes to BaseModel
  • Convert AST clause nodes to BaseModel
  • Convert AST pattern nodes to BaseModel
  • Add field validators for expressions
  • Add model validators for operators
  • Convert planner operators to BaseModel
  • Add API input validation
  • Convert DatasetInfo to BaseModel
  • Add property value validators
  • Update serialization to use Pydantic
  • Update all tests for Pydantic models
  • Document Pydantic patterns in CLAUDE.md
  • Add validation examples to docs
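
For the "model validators for operators" item, one possible shape is a `model_validator(mode="after")` that checks constraints spanning several fields. The operator below, `ExpandVariableLength`, and its fields are hypothetical and only stand in for whatever `src/graphforge/planner/operators.py` actually defines.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class ExpandVariableLength(BaseModel):
    """Hypothetical planner operator with a cross-field constraint."""

    src_var: str
    dst_var: str
    min_hops: int = Field(default=1, ge=0)
    max_hops: Optional[int] = Field(default=None, ge=0)  # None means unbounded

    @model_validator(mode="after")
    def check_hop_range(self) -> "ExpandVariableLength":
        # Runs after per-field validation, so both fields are already typed.
        if self.max_hops is not None and self.max_hops < self.min_hops:
            raise ValueError("max_hops must be greater than or equal to min_hops")
        return self


valid_op = ExpandVariableLength(src_var="a", dst_var="b", min_hops=1, max_hops=3)

try:
    ExpandVariableLength(src_var="a", dst_var="b", min_hops=5, max_hops=2)
except ValidationError as exc:
    range_error = exc.errors()[0]["msg"]
```

Field validators see one value at a time, so rules that relate two fields, like the hop range here, belong in a model validator.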

Testing

  • Unit tests for each validator (target: 100% coverage)
  • Integration tests for API validation
  • Error message testing
  • Performance baseline (ensure no significant overhead)
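
A unit test for a validator, including a check on the error details, could look roughly like this. It assumes pytest as the test runner and uses a hypothetical `QueryInput` model; neither is confirmed by this issue.

```python
import pytest
from pydantic import BaseModel, Field, ValidationError


class QueryInput(BaseModel):
    """Hypothetical API input model: the query string must be non-empty."""

    query: str = Field(min_length=1)


def test_valid_query_is_accepted() -> None:
    assert QueryInput(query="MATCH (n) RETURN n").query == "MATCH (n) RETURN n"


def test_empty_query_reports_field_and_error_type() -> None:
    with pytest.raises(ValidationError) as excinfo:
        QueryInput(query="")

    errors = excinfo.value.errors()
    assert len(errors) == 1
    # Assert on the structured error, not the message text,
    # which may change between Pydantic releases.
    assert errors[0]["loc"] == ("query",)
    assert errors[0]["type"] == "string_too_short"
```

Checking `loc` and `type` keeps error message tests stable while still confirming that the failure points at the right field.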

Estimated Effort

20-30 hours

Dependencies

  • pydantic>=2.6 (already in pyproject.toml)

Success Criteria

  • All AST nodes use BaseModel
  • All operators use BaseModel
  • API validates inputs with helpful errors
  • Tests pass with 90%+ coverage on new validators
  • Documentation updated
  • No performance regressions
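
To make "no performance regressions" measurable, a baseline could compare construction cost before and after the conversion. The sketch below uses the standard library `timeit` module with hypothetical `LiteralDC` and `LiteralModel` classes; the acceptable overhead ratio is left open because the issue does not define one.

```python
import timeit
from dataclasses import dataclass
from typing import Callable

from pydantic import BaseModel


@dataclass(frozen=True)
class LiteralDC:
    """Hypothetical dataclass node, standing in for the current models."""

    value: int


class LiteralModel(BaseModel):
    """Hypothetical Pydantic node, standing in for the converted models."""

    value: int


def time_construction(
    factory: Callable[..., object], repeats: int = 5, number: int = 10_000
) -> float:
    """Return the best observed per-call construction time in seconds."""
    timings = timeit.repeat(lambda: factory(value=1), repeat=repeats, number=number)
    return min(timings) / number


dataclass_seconds = time_construction(LiteralDC)
model_seconds = time_construction(LiteralModel)
overhead_ratio = model_seconds / dataclass_seconds

print(f"dataclass: {dataclass_seconds:.2e} s, pydantic: {model_seconds:.2e} s")
print(f"overhead ratio: {overhead_ratio:.2f}x")
```

If validation cost turns out to matter on hot paths such as AST construction inside the parser, `BaseModel.model_construct()` builds an instance without running validation and may be an option for trusted internal data.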

Part of larger LDBC dataset integration effort (#51).
