# DiffBloch CIF Validation

This notebook demonstrates validating CIF files against the diffBloch dictionary,
which defines data items for dynamical electron diffraction structure refinement.

We'll validate two example datasets (urea and dynamical_iam), each in a valid and an invalid version,
to show how the validator reports dictionary conformance issues.

In [None]:
from pathlib import Path

from cif_validator import ValidationMode, Validator

## Load the DiffBloch Dictionary

The diffBloch dictionary defines data items for:
- Unit cell parameters (CELL category)
- Symmetry information (SYMMETRY category)
- Atomic positions (ATOM_SITE category)
- Anisotropic displacement parameters (ATOM_SITE_ANISO category)
- Zone axis orientations (DIFFRN_ZONE_AXIS category)
- Measurement parameters (DIFFRN_MEASUREMENT category)
- Reflection data (REFLN category)

In [None]:
# Path to example files
examples_dir = Path("diffbloch")

# Load the dictionary
dictionary_path = examples_dir / "diffBloch.dic"
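# Read as UTF-8 explicitly so the result does not depend on the platform default encoding
dictionary_content = dictionary_path.read_text(encoding="utf-8")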

print(f"Dictionary loaded: {dictionary_path.name}")
print(f"Size: {len(dictionary_content):,} characters")
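
`Path.read_text()` raises `FileNotFoundError` when the file is missing, which usually means the notebook was started from a different working directory. As a minimal sketch, a small wrapper can report the fully resolved path in that case. The name `read_required_text` is hypothetical, and the demo uses a temporary directory rather than the real `diffbloch` folder.

```python
import tempfile
from pathlib import Path


def read_required_text(path: Path) -> str:
    """Read a text file, failing with the resolved path if it is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"Expected file not found: {path.resolve()}")
    return path.read_text(encoding="utf-8")


# Demonstrate with a throwaway directory instead of the real examples folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.dic"
    demo_path.write_text("data_demo\n", encoding="utf-8")
    demo_text = read_required_text(demo_path)

    try:
        read_required_text(Path(tmp) / "missing.dic")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(repr(demo_text), missing_detected)
```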

## Create the Validator

We'll use strict validation mode, `ValidationMode.Strict`, to flag as many conformance issues as the validator can detect.

In [None]:
validator = Validator()
validator.add_dictionary(dictionary_content)
validator.set_mode(ValidationMode.Strict)

print(f"Validator configured with mode: {validator.mode.name}")

## Define Helper Function

A utility to display validation results in a readable format.

In [None]:
def display_validation_result(name: str, result) -> None:
    """Print a validation summary, truncating long error and warning lists."""
    status = "VALID" if result.is_valid else "INVALID"
    print(f"\n{'='*60}")
    print(f"{name}: {status}")
    print(f"{'='*60}")
    print(f"Errors: {result.error_count}")
    print(f"Warnings: {result.warning_count}")

    if result.errors:
        print("\nErrors:")
        for i, error in enumerate(result.errors[:10], 1):  # Show first 10 errors
            print(f"  {i}. [{error.category.name}] Line {error.span.start_line}: {error.message}")
            if error.data_name:
                print(f"     Data name: {error.data_name}")
        if result.error_count > 10:
            print(f"  ... and {result.error_count - 10} more errors")

    if result.warnings:
        print("\nWarnings:")
        for i, warning in enumerate(result.warnings[:5], 1):  # Show first 5 warnings
            print(f"  {i}. [{warning.category.name}] Line {warning.span.start_line}: {warning.message}")
        if result.warning_count > 5:
            print(f"  ... and {result.warning_count - 5} more warnings")
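
The helper above only reads a handful of attributes from each issue (`category.name`, `span.start_line`, `message`), so its truncation pattern can be tried without running the validator. The sketch below isolates that pattern in a function named `summarize_issues` (a hypothetical name) and feeds it stand-in objects built with `types.SimpleNamespace`. The category name and messages are invented for illustration and are not output from `cif_validator`.

```python
from types import SimpleNamespace


def summarize_issues(issues, limit):
    """Return formatted lines for at most `limit` issues, plus an overflow note."""
    lines = [
        f"  {i}. [{issue.category.name}] Line {issue.span.start_line}: {issue.message}"
        for i, issue in enumerate(issues[:limit], 1)
    ]
    if len(issues) > limit:
        lines.append(f"  ... and {len(issues) - limit} more")
    return lines


# Stand-in issues (hypothetical values) mimicking the attributes the helper reads.
fake_issues = [
    SimpleNamespace(
        category=SimpleNamespace(name="ExampleCategory"),
        span=SimpleNamespace(start_line=n),
        message=f"example message {n}",
    )
    for n in range(1, 13)
]

summary_lines = summarize_issues(fake_issues, limit=10)
print("\n".join(summary_lines))
```

With 12 stand-in issues and a limit of 10, the list holds ten formatted entries followed by one overflow line.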

## Validate Urea CIF Files

The urea example is a simple crystal structure file containing:
- Unit cell parameters
- Symmetry information
- Atomic positions with anisotropic displacement parameters

In [None]:
# Validate valid urea CIF
urea_valid_path = examples_dir / "urea_valid.cif"
urea_valid_content = urea_valid_path.read_text(encoding="utf-8")

result = validator.validate(urea_valid_content)
display_validation_result("urea_valid.cif", result)

In [None]:
# Validate invalid urea CIF
urea_invalid_path = examples_dir / "urea_invalid.cif"
urea_invalid_content = urea_invalid_path.read_text(encoding="utf-8")

result = validator.validate(urea_invalid_content)
display_validation_result("urea_invalid.cif", result)

## Validate Dynamical IAM CIF Files

The dynamical_iam example is a more complete electron diffraction dataset containing:
- Crystal structure data
- UB orientation matrix
- Zone axis information
- Measurement parameters
- Reflection intensities

In [None]:
# Validate valid dynamical_iam CIF
dyn_valid_path = examples_dir / "dynamical_iam_valid.cif"
dyn_valid_content = dyn_valid_path.read_text(encoding="utf-8")

result = validator.validate(dyn_valid_content)
display_validation_result("dynamical_iam_valid.cif", result)

In [None]:
# Validate invalid dynamical_iam CIF
dyn_invalid_path = examples_dir / "dynamical_iam_invalid.cif"
dyn_invalid_content = dyn_invalid_path.read_text(encoding="utf-8")

result = validator.validate(dyn_invalid_content)
display_validation_result("dynamical_iam_invalid.cif", result)

## Validate All Files Using File Paths

The `Validator` also supports validating files directly from paths using `validate_file()`.

In [None]:
cif_files = [
    examples_dir / "urea_valid.cif",
    examples_dir / "urea_invalid.cif",
    examples_dir / "dynamical_iam_valid.cif",
    examples_dir / "dynamical_iam_invalid.cif",
]

print("Validation Summary")
print("=" * 60)
print(f"{'File':<35} {'Status':<10} {'Errors':<8} {'Warnings'}")
print("-" * 60)

for cif_path in cif_files:
    result = validator.validate_file(str(cif_path))
    status = "VALID" if result.is_valid else "INVALID"
    print(f"{cif_path.name:<35} {status:<10} {result.error_count:<8} {result.warning_count}")

## Analyze Error Patterns

Count the errors in the invalid files by category to see which kinds of problems dominate.

In [None]:
from collections import Counter

# Collect errors from all invalid files
all_errors = []
for cif_path in cif_files:
    if "invalid" in cif_path.name:
        result = validator.validate_file(str(cif_path))
        for error in result.errors:
            all_errors.append((cif_path.name, error))

# Count by category
category_counts = Counter(error.category.name for _, error in all_errors)

print("Error Categories Across Invalid Files")
print("=" * 40)
for category, count in category_counts.most_common():
    print(f"{category:<25} {count:>5}")
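
Because `all_errors` keeps the file name next to each error, the same count can also be split per file. The sketch below shows the grouping with a `defaultdict` of `Counter` objects. It uses hypothetical `(file name, category name)` pairs in place of real validator output, and the category names are invented.

```python
from collections import Counter, defaultdict

# Hypothetical (file name, category name) pairs standing in for `all_errors`.
sample = [
    ("urea_invalid.cif", "CategoryA"),
    ("urea_invalid.cif", "CategoryB"),
    ("urea_invalid.cif", "CategoryA"),
    ("dynamical_iam_invalid.cif", "CategoryA"),
]

# One Counter per file, created on first access.
per_file = defaultdict(Counter)
for file_name, category in sample:
    per_file[file_name][category] += 1

for file_name, counts in per_file.items():
    print(file_name)
    for category, count in counts.most_common():
        print(f"  {category:<25} {count:>5}")
```

With the real data, the pairs would come from `(name, error.category.name) for name, error in all_errors`.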

## Comparing Valid vs Invalid

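Any data name attached to an error in an invalid file points at something that differs from its valid counterpart.
The cell below lists those data names, reusing `all_errors` from the previous cell.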

In [None]:
# Get unique data names that caused errors
error_data_names = set()
for _, error in all_errors:
    if error.data_name:
        error_data_names.add(error.data_name)

print("Data Items Causing Validation Errors")
print("=" * 40)
for name in sorted(error_data_names):
    print(f"  - {name}")
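
The flagged data names show where the validator objected, not how the invalid file differs from the valid one. A plain text diff from the standard library's `difflib` is one way to see that. The sketch below uses two short inline fragments with made-up data names and values in place of the example files, so nothing here is taken from `diffBloch.dic` or the real CIFs.

```python
import difflib

# Illustrative CIF fragments (made-up values) standing in for a valid and an
# invalid version of the same file.
valid_text = """data_example
_cell.length_a 5.0
_cell.length_b 5.0
"""
invalid_text = """data_example
_cell.length_a five
_cell.length_b 5.0
"""

diff_lines = list(
    difflib.unified_diff(
        valid_text.splitlines(),
        invalid_text.splitlines(),
        fromfile="example_valid.cif",
        tofile="example_invalid.cif",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

For the actual files, the two inputs would be the `splitlines()` of the text already read into variables such as `urea_valid_content` and `urea_invalid_content`.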

## Summary

This notebook demonstrated:

1. Loading a real DDLm dictionary (diffBloch.dic)
2. Validating CIF content from strings with `validate()` and from file paths with `validate_file()`
3. Comparing valid vs invalid versions of the same data
4. Analyzing error patterns across multiple files

The diffBloch dictionary ensures that CIF files contain properly typed and structured
data for electron diffraction structure refinement calculations.