Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 0 additions & 36 deletions .github/dependabot.yml

This file was deleted.

57 changes: 57 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
---
name: CI

on:
push:
branches: [main, master]
pull_request:
branches: [main, master]

env:
CARGO_TERM_COLOR: always

jobs:
test:
name: Build and Test
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
include:
- os: ubuntu-latest
libxml2_install: sudo apt-get update && sudo apt-get install -y libxml2-dev
- os: macos-latest
libxml2_install: brew install libxml2
- os: windows-latest
libxml2_install: |
choco install libxml2
echo "LIBXML2_LIB_DIR=C:\tools\libxml2\lib" >> $GITHUB_ENV
echo "LIBXML2_INCLUDE_DIR=C:\tools\libxml2\include" >> $GITHUB_ENV
Comment on lines +27 to +30
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify Windows libxml2 setup reliability.

The Windows libxml2 installation uses choco and hardcoded environment paths. This may fail if Chocolatey package structure changes or paths vary across CI runners.


🏁 Script executed:

#!/bin/bash
# Check if libxml2-dev availability and alternatives for Windows
# This requires web search since the CI environment setup is environment-specific

echo "Verifying libxml2 availability and setup approaches"

Length of output: 51


🌐 Web query:

What are the current recommended methods to install libxml2 on Windows for Rust projects?

💡 Result:

Short answer — recommended options (pick one):

  1. MSYS2 (recommended for MinGW toolchains / pkg-config workflows)

    • Install libxml2 and -devel or mingw-w64 packages with pacman (e.g. pacman -S mingw-w64-x86_64-libxml2 or pacman -S libxml2 libxml2-devel). Use the MSYS2 MinGW shell so pkg-config/find scripts work for Cargo build scripts. [1][2]
  2. vcpkg (recommended for MSVC/CMake toolchains)

    • Install via vcpkg: vcpkg install libxml2[:features] --triplet x64-windows and integrate with your build (CMake targets or vcpkg env) so Rust build.rs can find headers/libs. Works well if you use MSVC toolchain or CMake-based native deps. [3][4]
  3. Build libxml2 from source with CMake (or the Windows project files) and point your crate to the install

    • Configure/build with CMake, install to a prefix, then set environment variables (PKG_CONFIG_PATH or LIBXML2_INCLUDE_DIR / LIBXML2_LIB_DIR or appropriate pkg-config) so libxml2-sys / build.rs can locate it. The upstream Windows build/readme and CMake support are the reference. [5][6]
  4. Other package managers / ecosystems

    • Alternatives: use package repos like xrepo/xmake, Conan, or embed a vendored build if the crate supports it (or use a crate that bundles libxml2). Choose these if they better match your CI or cross-build setup. [7]

Notes:

  • For Rust crates that use libxml2-sys/build.rs they typically expect a system lib discoverable via pkg-config or environment variables — MSYS2 (pkg-config) and vcpkg (CMake/FindPackage or explicit env) are the two pragmatic, supported paths on Windows. [1][3][5]

Sources:
[1] MSYS2 libxml2 / libxml2-devel / mingw-w64 packages (pacman instructions).
[2] MSYS2 mingw-w64 libxml2 package page (commands and packages).
[3] vcpkg libxml2 package docs (vcpkg install libxml2, CMake targets).
[4] Example vcpkg usage output / discussion showing libxml2 installed for x64-windows.
[5] libxml2 Windows README / win32 docs (recommend CMake; build/install guidance).
[6] Libxml2 CMake/autotools notes and upstream release/build info.
[7] xrepo/xmake and other package manager entries for libxml2.

If you want, tell me which Windows toolchain (MSVC or GNU) and Rust crate you’re using and I’ll give exact commands and env vars.


I need to gather more context about the repository's setup before making a final assessment. Let me examine the CI configuration more thoroughly and check the project's Rust dependencies.

Now let me examine the repository's actual configuration:

I found a critical mismatch. Let me examine the repository's actual structure and dependencies:

Based on my investigation, the Chocolatey community repository does not guarantee reliability, and I've identified a critical issue with the current setup:

The CI workflow installs libxml2 via Chocolatey but then sets environment variables pointing to C:\tools\libxml2. However, Chocolatey defaults to installing to C:\ProgramData\Chocolatey, making the hardcoded path incorrect. Historical examples show package paths contain version numbers like C:\Chocolatey\lib\libxml2.2.7.8.7\build\native, which are fragile and break when package versions change.

Additionally, GitHub-hosted runner images are maintained in the actions/runner-images repository with tools updated weekly, but libxml2 is not preinstalled on Windows runners, requiring manual setup.

Fix the path mismatch: either correct env vars to reference Chocolatey's actual install location, or use a more reliable approach (MSYS2, vcpkg, or downloading binaries from https://www.zlatkovic.com/projects/libxml/) as recommended by projects like ROS 2.


steps:
- name: Checkout code
uses: actions/checkout@v5

- name: Install Rust toolchain
uses: dtolnay/rust-toolchain@stable
with:
components: rustfmt, clippy
Comment on lines +37 to +39
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify clippy runs on all code (tests and benches).

Line 54 runs cargo clippy without --all-targets flag. By default, it only lints library code, potentially missing issues in tests and integration tests.


🏁 Script executed:

#!/bin/bash
# Verify current clippy invocation and check if tests contain lintable code
# Expected: If tests exist, consider adding --all-targets to catch issues there

# Check for test files
fd -e rs -g '*/tests/*' . | head -5

# Show current behavior of cargo clippy (docs only)
echo "Default clippy targets library code and examples, not tests/benches unless --all-targets is used"

Length of output: 96



🏁 Script executed:

#!/bin/bash
# Verify the actual CI workflow content and search for test files more comprehensively

echo "=== Lines 50-60 of .github/workflows/ci.yml ==="
sed -n '50,60p' .github/workflows/ci.yml

echo ""
echo "=== Search for all Rust test files (broader patterns) ==="
fd -e rs | grep -E '(test|spec)' | head -10

echo ""
echo "=== Check if tests/ directory exists ==="
ls -la tests/ 2>/dev/null || echo "No tests/ directory found"

echo ""
echo "=== Check for inline tests in src ==="
rg -n '#\[cfg\(test\)\]' --type rs | head -5

Length of output: 1708



Add --all-targets flag to clippy to lint test and benchmark code.

Line 54 runs cargo clippy without --all-targets, which only lints library code. The codebase has substantial tests and benchmarks (integration tests, unit tests, benchmarks in tests/ directory) that should also be linted for consistency and correctness.

Suggested change:

cargo clippy --all-targets -- -D warnings
🤖 Prompt for AI Agents
.github/workflows/ci.yml around lines 37 to 39: the CI currently runs `cargo
clippy` without the `--all-targets` flag so tests and benchmarks aren’t linted;
update the clippy invocation in the workflow to run `cargo clippy --all-targets
-- -D warnings` (or add `--all-targets` to the existing clippy command) so unit
tests, integration tests and benchmarks are included while keeping the `-D
warnings` denial of warnings.


- name: Cache Rust dependencies and build artifacts
uses: Swatinem/rust-cache@v2

- name: Install libxml2
run: ${{ matrix.libxml2_install }}

- name: Build
run: cargo build --release

- name: Run tests
run: cargo test

- name: Run clippy
run: cargo clippy -- -D warnings

- name: Check formatting
run: cargo fmt --check
12 changes: 0 additions & 12 deletions .travis.yml

This file was deleted.

278 changes: 278 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,278 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

validate-xml is a high-performance XML schema validator written in Rust. It validates thousands of XML files against XSD schemas using concurrent processing and intelligent two-tier caching (memory + disk). Built with libxml2 FFI bindings and async I/O throughout.

**Key Performance**: Validates 20,000 files in ~2 seconds (cached) or ~30 seconds (first run with schema downloads).

## Common Commands

### Building and Testing

```bash
# Development build
cargo build

# Release build (optimized)
cargo build --release

# Run all tests (deterministic, no network calls)
cargo test

# Run a specific test
cargo test test_name

# Run tests with output visible
cargo test -- --nocapture

# Run only library tests (fastest)
cargo test --lib

# Run ignored network tests (requires internet)
cargo test -- --ignored

# Run a single test file
cargo test --test http_client_test
```

### Running the Binary

```bash
# Run with development build
cargo run -- /path/to/xml/files

# Run with release build (much faster)
cargo run --release -- /path/to/xml/files

# With options
cargo run --release -- --verbose --extensions xml,cmdi /path/to/files

# With debug logging
RUST_LOG=debug cargo run -- /path/to/files
```

### Code Quality

```bash
# Format code
cargo fmt

# Check formatting without changes
cargo fmt --check

# Run clippy linter
cargo clippy

# Fix clippy warnings automatically
cargo clippy --fix
```

## Architecture

### Core Components

The codebase follows a modular async-first architecture with clear separation of concerns:

1. **File Discovery** (`file_discovery.rs`)
- Recursively traverses directories to find XML files
- Filters by extension using glob patterns
- Single-threaded sequential operation

2. **Schema Loading** (`schema_loader.rs`)
- Extracts schema URLs from XML using regex (xsi:schemaLocation, xsi:noNamespaceSchemaLocation)
- Downloads remote schemas via async HTTP client
- Validates schema content before caching
- Integrates with two-tier cache system

3. **Two-Tier Caching** (`cache.rs`)
- **L1 (Memory)**: moka cache for in-run reuse (microsecond lookups)
- **L2 (Disk)**: cacache for cross-run persistence (millisecond lookups)
- Thread-safe via Arc wrapping
- Configurable TTL and size limits

4. **Validation Engine** (`validator.rs`)
- **Hybrid architecture**: Async I/O orchestration + sync CPU-bound validation
- Spawns concurrent async tasks (bounded by semaphore)
- Each task: load XML → fetch schema → validate via libxml2 (synchronous, thread-safe)
- Collects results and statistics
- Default concurrency = CPU core count

5. **libxml2 FFI** (`libxml2.rs`)
- Safe Rust wrappers around unsafe C FFI calls
- Memory management via RAII patterns
- Schema parsing and XML validation
- **CRITICAL Thread Safety**:
- Schema parsing is NOT thread-safe (serialized via cache)
- Validation IS thread-safe (parallel execution, no global locks)

6. **Error Handling** (`error.rs`, `error_reporter.rs`)
- Structured error types using thiserror
- Context-rich error messages with recovery hints
- Line/column precision for validation errors
- Both human-readable and JSON output formats

7. **Configuration** (`config.rs`)
- Environment variable support via `EnvProvider` trait pattern
- File-based config (TOML/JSON)
- CLI argument merging (CLI > env > file > defaults)
- **IMPORTANT**: Uses dependency injection for testability

### Data Flow

```
CLI Args → Config Merge → File Discovery → Schema Extraction
Schema Cache Check
(L1 → L2 → HTTP)
Concurrent Validation Tasks
(bounded by semaphore)
Error Aggregation → Output
(Text or JSON format)
```

### Key Design Patterns

1. **Async-First**: All I/O operations use tokio async runtime
2. **Dependency Injection**: Config system uses `EnvProvider` trait for testability
3. **Two-Tier Caching**: Memory (fast) + Disk (persistent) for optimal performance
4. **Bounded Concurrency**: Semaphore limits prevent resource exhaustion
5. **RAII for FFI**: Proper cleanup of libxml2 resources via Drop trait

## Testing Philosophy

### Test Structure

The project has **214+ passing tests** organized as:
- **115 unit tests** in `src/` modules (fast, no I/O)
- **99 integration tests** in `tests/` (slower, includes I/O simulation)
- **24 ignored tests** (network-dependent, run explicitly with `--ignored`)

### Critical Testing Rules

1. **No Unsafe Code in Tests**: All environment variable manipulation must use `MockEnvProvider` pattern (see `src/config.rs` tests)

2. **No Real Network Calls**: Tests making HTTP requests to external services (httpbin.org) must be marked `#[ignore]`
```rust
#[tokio::test]
#[ignore] // Requires internet connectivity - run with: cargo test -- --ignored
async fn test_network_operation() { ... }
```

3. **Deterministic Tests Only**: Never use:
- `tokio::time::sleep()` without proper synchronization
- `tokio::spawn()` without waiting for completion
- Real system time for timing assertions

4. **Race Condition Prevention**: When testing concurrent code, use proper synchronization:
```rust
// BAD: Race condition
tokio::spawn(async move { /* ... */ });
tokio::time::sleep(Duration::from_millis(50)).await; // Hope it finishes

// GOOD: Proper synchronization
let handle = tokio::spawn(async move { /* ... */ });
handle.await.unwrap(); // Wait for completion
```

### Running Flaky/Network Tests

Network tests are ignored by default to ensure CI reliability:
```bash
# Run only network tests
cargo test -- --ignored

# Run all tests including network tests
cargo test -- --include-ignored
```

## Environment Variables

The config system supports environment variable overrides:

```bash
# Cache configuration
export VALIDATE_XML_CACHE_DIR=/custom/cache
export VALIDATE_XML_CACHE_TTL=48

# Validation settings
export VALIDATE_XML_THREADS=4
export VALIDATE_XML_TIMEOUT=120

# Output settings
export VALIDATE_XML_VERBOSE=true
export VALIDATE_XML_FORMAT=json
```

## libxml2 FFI Critical Notes

When working with `libxml2.rs`:

1. **Memory Safety**: All pointers must be checked for null before dereferencing
2. **Cleanup**: Schema contexts must be freed via `xmlSchemaFree` in Drop implementations
3. **Thread Safety** (see ARCHITECTURE_CHANGES.md for details):
- **Schema parsing** (`xmlSchemaParse`): NOT thread-safe, serialized via cache
- **Validation** (`xmlSchemaValidateFile`): IS thread-safe, runs in parallel
- Arc-wrapped schemas enable safe sharing across tasks
- Each validation creates its own context (per-task isolation)
4. **Error Handling**: libxml2 prints errors to stderr - this is expected in tests (e.g., "Schemas parser error" messages)

Example safe pattern:
```rust
impl Drop for SchemaContext {
fn drop(&mut self) {
unsafe {
if !self.schema.is_null() {
xmlSchemaFree(self.schema);
}
}
}
}
```

## Dependency Injection Pattern

For testability, the config system uses trait-based dependency injection:

```rust
// Production: uses real environment variables
ConfigManager::apply_environment_overrides(config)

// Testing: uses mock provider (no unsafe code)
let mut mock_env = MockEnvProvider::new();
mock_env.set("VALIDATE_XML_THREADS", "16");
ConfigManager::apply_environment_overrides_with(&mock_env, config)
```

**Never** use `std::env::set_var` or `std::env::remove_var` in tests - always use `MockEnvProvider`.

## Performance Considerations

1. **Schema Caching**: First run downloads schemas (~30s for 20k files), subsequent runs use cache (~2s)
2. **Concurrency**: Default = CPU cores, but can be limited for memory-constrained systems
3. **Memory**: Bounded by L1 cache size (default 100 entries) and concurrent task count
4. **Network**: HTTP client uses connection pooling and retry logic with exponential backoff

## Common Gotchas

1. **libxml2 Errors to stderr**: The message "Schemas parser error : The XML document 'in_memory_buffer' is not a schema document" is EXPECTED in test output - it's from tests validating error handling

2. **Timing Tests**: Any test using `tokio::time::sleep()` is likely flaky - refactor to use proper synchronization

3. **Environment Pollution**: Tests must not modify global environment state - use `MockEnvProvider` pattern

4. **Ignored Tests**: Running full test suite may show "24 ignored" - this is correct (network tests)

## Code Generation and AI Assistance

This project was collaboratively developed with Claude Code. When making changes:

1. Maintain the existing architecture patterns (async-first, dependency injection, trait-based abstractions)
2. Add tests for all new functionality (aim for 100% coverage)
3. Update documentation strings for public APIs
4. Run full test suite before committing: `cargo test && cargo clippy`
5. For network-dependent code, mark tests with `#[ignore]` and document why
Loading
Loading