# diffBloch CIF Validation

**Primary use case**: Validate that a CIF file is ready to run through diffBloch.

## PETS → diffBloch Workflow

Raw CIF files from PETS (Precession Electron Tomography Software) need conversion before
they can be processed by diffBloch. This notebook demonstrates using the validator to
identify CIF files that need conversion.

## Dictionary: `diffbloch_2.dic`

Created by systematic analysis of diffBloch source code, this dictionary defines:
- **CELL**: Unit cell parameters (`_cell_length_a`, etc.)
- **SYMMETRY**: Space group (REQUIRED: `_symmetry_space_group_name_H-M`)
- **ATOM_SITE**, **ATOM_SITE_ANISO**: Atomic positions and thermal parameters
- **DIFFRN_ORIENT_MATRIX**: UB orientation matrix (9 elements)
- **DIFFRN_ZONE_AXIS**: Zone axis orientations (alpha, beta, omega)
- **DIFFRN_MEASUREMENT**: Measurement details (CRITICAL)
- **REFLN**: Reflection intensities

**Key validation**: `_diffrn_measurement_details` must exist (not `_diffrn_reflns_reduction_process`)

In [1]:
from pathlib import Path

from cif_validator import ValidationMode, Validator

## Load the diffBloch Dictionary

The `diffbloch_2.dic` dictionary was created by tracing through the diffBloch codebase:
- `main.py` → `asu_refinement.py`
- `atoms.py`: Structure loading via `load_structure_asymmetric_unit()`
- `rotation_dataset.py`: Experimental data via `get_exp_ints()`, `extract_data_params()`

In [2]:
# Path to example files
examples_dir = Path("diffbloch")

# Load the new dictionary (created from diffBloch code analysis)
dictionary_path = examples_dir / "diffbloch_2.dic"
dictionary_content = dictionary_path.read_text()

print(f"Dictionary loaded: {dictionary_path.name}")
print(f"Size: {len(dictionary_content):,} characters")

Dictionary loaded: diffbloch_2.dic
Size: 35,801 characters


## Create the Validator

We use Lenient mode, which reports unknown tags as warnings rather than errors.
The warnings show which tags still need to be converted.

In [6]:
validator = Validator()
validator.add_dictionary(dictionary_content)
validator.set_mode(ValidationMode.Lenient)

print(f"Validator configured with mode: {validator.mode.name}")

Validator configured with mode: Lenient


## Define Helper Function

A utility to display validation results in a readable format.

In [7]:
def display_validation_result(name: str, result):
    """Display validation results in a formatted way."""
    status = "VALID" if result.is_valid else "INVALID"
    print(f"\n{'='*60}")
    print(f"{name}: {status}")
    print(f"{'='*60}")
    print(f"Errors: {result.error_count}, Warnings: {result.warning_count}")

    if result.errors:
        print("\nErrors (showing first 5):")
        for i, error in enumerate(result.errors[:5], 1):
            print(f"  {i}. [{error.category.name}] {error.message}")
            if error.data_name:
                print(f"     Tag: {error.data_name}")
        if result.error_count > 5:
            print(f"  ... and {result.error_count - 5} more errors")

    if result.warnings and result.warning_count <= 20:
        print("\nWarnings:")
        for i, warning in enumerate(result.warnings[:10], 1):
            print(f"  {i}. {warning.message}")
        if result.warning_count > 10:
            print(f"  ... and {result.warning_count - 10} more warnings")
    elif result.warnings:
        print("\n(Many warnings - likely unknown tags not in dictionary)")

## Validate Urea CIF Files

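The urea example shows the PETS→diffBloch conversion:
- **urea_valid.cif**: Converted for diffBloch (uses `_diffrn_measurement_details`)
- **urea_invalid.cif**: Raw PETS output (uses `_diffrn_reflns_reduction_process`)

The reported status does not simply follow the file names. In the outputs recorded here,
`urea_valid.cif` is reported INVALID because of range errors on negative
`_refln_intensity_meas` values, while `urea_invalid.cif` is reported VALID in Lenient mode,
with `_diffrn_reflns_reduction_process` appearing only as an `UnknownItem` warning.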

In [8]:
# Validate valid urea CIF
urea_valid_path = examples_dir / "urea_valid.cif"
urea_valid_content = urea_valid_path.read_text()

result = validator.validate(urea_valid_content)
display_validation_result("urea_valid.cif", result)


urea_valid.cif: INVALID

Errors (showing first 5):
  1. [RangeError] Value -9.22 for '_refln_intensity_meas' is outside allowed range >= 0
     Tag: _refln_intensity_meas
  2. [RangeError] Value -5.95 for '_refln_intensity_meas' is outside allowed range >= 0
     Tag: _refln_intensity_meas
  3. [RangeError] Value -15.26 for '_refln_intensity_meas' is outside allowed range >= 0
     Tag: _refln_intensity_meas
  4. [RangeError] Value -9.22 for '_refln_intensity_meas' is outside allowed range >= 0
     Tag: _refln_intensity_meas
  5. [RangeError] Value -5.95 for '_refln_intensity_meas' is outside allowed range >= 0
     Tag: _refln_intensity_meas
  ... and 447 more errors
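
With several hundred errors, the first five entries say little about the overall picture. A minimal sketch of grouping errors by category and tag follows. It assumes only the attributes already used in `display_validation_result` (`category.name` and `data_name`); the stand-in error objects are hypothetical and replace `result.errors` so the example runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_errors(errors):
    """Count errors per (category name, data name) pair, most frequent first."""
    counts = Counter(
        (error.category.name, error.data_name or "<no tag>") for error in errors
    )
    return counts.most_common()


def _fake_error(category, data_name):
    """Build a stand-in error object (hypothetical, for illustration only)."""
    return SimpleNamespace(
        category=SimpleNamespace(name=category), data_name=data_name, message=""
    )


sample_errors = [
    _fake_error("RangeError", "_refln_intensity_meas"),
    _fake_error("RangeError", "_refln_intensity_meas"),
    _fake_error("RangeError", "_refln_intensity_meas"),
    _fake_error("UnknownItem", None),
]

for (category, tag), count in summarize_errors(sample_errors):
    print(f"{count:>5}  [{category}] {tag}")
```

In the notebook, `summarize_errors(result.errors)` would show at a glance whether all the errors come from a single tag.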



In [16]:
# Validate invalid urea CIF
urea_invalid_path = examples_dir / "urea_invalid.cif"
urea_invalid_content = urea_invalid_path.read_text()

result = validator.validate(urea_invalid_content)
display_validation_result("urea_invalid.cif", result)


urea_invalid.cif: VALID
Errors: 0

  1. [UnknownItem] Line 36: Unknown data name '_cell_measurement_theta_min'
  2. [UnknownItem] Line 37: Unknown data name '_cell_measurement_theta_max'
  3. [UnknownItem] Line 24: Unknown data name '_diffrn_reflns_reduction_process'
  4. [UnknownItem] Line 10: Unknown data name '_diffrn_radiation_wavelength'
  5. [UnknownItem] Line 13: Unknown data name '_diffrn_orient_matrix_UB_13'


## Validate Minimal Dynamical CIF Files

These minimal examples focus on the key validation difference:
- **dynamical_iam_valid.cif**: Uses `_diffrn_measurement_details` with "rotation axis position"
- **dynamical_iam_invalid.cif**: Uses `_diffrn_reflns_reduction_process` with "tilt axis position"

In [17]:
# Validate valid dynamical_iam CIF
dyn_valid_path = examples_dir / "dynamical_iam_valid.cif"
dyn_valid_content = dyn_valid_path.read_text()

result = validator.validate(dyn_valid_content)
display_validation_result("dynamical_iam_valid.cif", result)

ValueError: Failed to parse CIF content: Parse error:   --> 62:6
   |
62 | #ENDs
   |      ^---
   |
   = expected line_term
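
The parse error points at line 62, column 6 of the file, just past `#ENDs`. One possible reading is that the parser wanted a line terminator there, for example because the file lacks a final newline, but that is an inference from the message and not confirmed. A small helper like the following sketch prints the lines around a reported position; `show_context` and the sample text are hypothetical, and in the notebook the input would be `dyn_valid_content`.

```python
def show_context(text: str, line_no: int, radius: int = 2) -> str:
    """Return the lines around a 1-based line number, marking the target line."""
    lines = text.splitlines()
    start = max(line_no - 1 - radius, 0)
    end = min(line_no + radius, len(lines))
    out = []
    for idx in range(start, end):
        marker = ">>" if idx == line_no - 1 else "  "
        # repr() makes stray whitespace and unusual characters visible
        out.append(f"{marker} {idx + 1:>4} | {lines[idx]!r}")
    return "\n".join(out)


# Stand-in text (hypothetical); the real file content is not reproduced here
sample = "data_example\n_cell_length_a 5.0\n#ENDs\n"
print(show_context(sample, 3, radius=1))

# Whether the text ends with a newline can be checked directly
print("ends with newline:", sample.endswith("\n"))
```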

In [18]:
# Validate invalid dynamical_iam CIF
dyn_invalid_path = examples_dir / "dynamical_iam_invalid.cif"
dyn_invalid_content = dyn_invalid_path.read_text()

result = validator.validate(dyn_invalid_content)
display_validation_result("dynamical_iam_invalid.cif", result)


dynamical_iam_invalid.cif: VALID
Errors: 0

  1. [UnknownItem] Line 5: Unknown data name '_audit_creation_method'
  2. [UnknownItem] Line 11: Unknown data name '_publ_section_title'
  3. [UnknownItem] Line 18: Unknown data name '_publ_author_name'
  4. [UnknownItem] Line 22: Unknown data name '_publ_author_address'
  5. [UnknownItem] Line 86: Unknown data name '_exptl_crystal_density_diffrn'


## Validation Summary

Validate all example files and show a summary table.

In [None]:
cif_files = [
    ("dynamical_iam_valid.cif", "diffBloch-ready"),
    ("dynamical_iam_invalid.cif", "Raw PETS"),
    ("urea_valid.cif", "diffBloch-ready"),
    ("urea_invalid.cif", "Raw PETS"),
]

print("Validation Summary")
print("=" * 70)
print(f"{'File':<30} {'Type':<15} {'Status':<10} {'Errors':<8} {'Warnings'}")
print("-" * 70)

for filename, file_type in cif_files:
    cif_path = examples_dir / filename
    try:
        result = validator.validate_file(str(cif_path))
        status = "VALID" if result.is_valid else "INVALID"
        print(f"{filename:<30} {file_type:<15} {status:<10} {result.error_count:<8} {result.warning_count}")
    except Exception as e:
        print(f"{filename:<30} {file_type:<15} {'ERROR':<10} {str(e)[:30]}")

## Check for Critical Tag

The key difference between diffBloch-ready and raw PETS files is the name of the tag that holds the measurement details, together with the axis wording in its value.

In [None]:
# Check which files have the correct tag

for filename, file_type in cif_files:
    cif_path = examples_dir / filename
    content = cif_path.read_text()
    
    has_valid_tag = "_diffrn_measurement_details" in content
    has_invalid_tag = "_diffrn_reflns_reduction_process" in content
    has_rotation = "rotation axis position" in content
    has_tilt = "tilt axis position" in content
    
    print(f"\n{filename} ({file_type}):")
    print(f"  _diffrn_measurement_details: {'YES' if has_valid_tag else 'NO'}")
    print(f"  _diffrn_reflns_reduction_process: {'YES' if has_invalid_tag else 'NO'}")
    print(f"  'rotation axis position': {'YES' if has_rotation else 'NO'}")
    print(f"  'tilt axis position': {'YES' if has_tilt else 'NO'}")
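
A plain substring test also matches a tag name that appears inside a comment or as the prefix of a longer name. A slightly stricter check, sketched below, collects the data names that start a line and tests membership in that set. `find_data_names` and the sample text are hypothetical, and the parsing is deliberately simplified: it is not a full CIF parser.

```python
import re


def find_data_names(cif_text: str) -> set:
    """Collect data names: tokens that start with '_' at the start of a line.

    Simplification: multi-line text fields are not tracked, so a line inside
    a text block that happens to begin with '_' would be picked up too.
    """
    names = set()
    for line in cif_text.splitlines():
        match = re.match(r"\s*(_\S+)", line)
        if match:
            # CIF data names are case-insensitive, so compare in lower case
            names.add(match.group(1).lower())
    return names


# Stand-in CIF text (hypothetical); '_diffrn_measurement_details_extra' is a
# made-up tag used only to show the difference from a substring test
sample = (
    "data_example\n"
    "# _diffrn_measurement_details is mentioned only in this comment\n"
    "_diffrn_measurement_details_extra 1\n"
    "_diffrn_reflns_reduction_process\n"
    ";\n"
    "tilt axis position\n"
    ";\n"
)

tag = "_diffrn_measurement_details"
print("substring test:", tag in sample)
print("data-name test:", tag in find_data_names(sample))
```

Because the names are lower-cased, a tag such as `_symmetry_space_group_name_H-M` would need to be lower-cased before the membership test.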

## Dictionary Coverage

Check which tags from the dictionary are present in valid files.

In [None]:
# Key tags defined in diffbloch_2.dic
key_tags = [
    "_cell_length_a",
    "_cell_angle_alpha", 
    "_cell_volume",
    "_symmetry_space_group_name_H-M",
    "_atom_site_label",
    "_atom_site_fract_x",
    "_diffrn_orient_matrix_UB_11",
    "_diffrn_zone_axis_id",
    "_diffrn_zone_axis_alpha",
    "_diffrn_measurement_details",  # CRITICAL
    "_refln_zone_axis_id",
    "_refln_intensity_meas",
]

# Check valid file
valid_content = (examples_dir / "urea_valid.cif").read_text()

print("Key tags in urea_valid.cif:")
print("-" * 40)
for tag in key_tags:
    present = tag in valid_content
    marker = "CRITICAL" if "measurement_details" in tag else ""
    print(f"  {tag:<35} {'YES' if present else 'NO':>5} {marker}")

## Summary

This notebook demonstrated how to validate CIF files for diffBloch readiness using `diffbloch_2.dic`.

### PETS → diffBloch Conversion Required
| From (Raw PETS) | To (diffBloch-ready) |
|-----------------|----------------------|
| `_diffrn_reflns_reduction_process` | `_diffrn_measurement_details` |
| `tilt axis position` | `rotation axis position` |
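
The two substitutions in the table can be sketched as a plain text replacement. This is only an illustration under the assumption that a rename is all that is needed; a real conversion may involve further changes not covered here. The function name `convert_pets_to_diffbloch` and the sample value `12.5` are hypothetical.

```python
def convert_pets_to_diffbloch(cif_text: str) -> str:
    """Apply the two textual substitutions listed in the conversion table.

    Assumption: a plain rename is sufficient. Anything else a real
    conversion might require is outside the scope of this sketch.
    """
    replacements = {
        "_diffrn_reflns_reduction_process": "_diffrn_measurement_details",
        "tilt axis position": "rotation axis position",
    }
    for old, new in replacements.items():
        cif_text = cif_text.replace(old, new)
    return cif_text


# Stand-in fragment (hypothetical values) in the raw PETS form
raw = (
    "_diffrn_reflns_reduction_process\n"
    ";\n"
    "tilt axis position 12.5\n"
    ";\n"
)
print(convert_pets_to_diffbloch(raw))
```

Running the validator on the converted text afterwards is the natural way to confirm that the `UnknownItem` warning for the old tag is gone.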

### Dictionary Categories
The dictionary validates 8 categories derived from diffBloch source code analysis:
- CELL, SYMMETRY, ATOM_SITE, ATOM_SITE_ANISO
- DIFFRN_ORIENT_MATRIX, DIFFRN_ZONE_AXIS, DIFFRN_MEASUREMENT, REFLN

### Validation Mode
**Lenient** mode reports unknown tags as warnings while still reporting errors, such as the out-of-range `_refln_intensity_meas` values found in `urea_valid.cif`.