# PSI-166 Metadata Normalization Demo

This notebook simulates the normalization of a fictitious PSI-166 ISA-Tab metadata file. It shows how the metadata can be parsed into a dictionary, checked for required fields, and exported to YAML in preparation for integration with Materials Project or G-Space workflows.

## Step 1: Load the Input File

We use a mock ISA-Tab file: `examples/fake_psi166_investigation.txt`

This file mimics a PSI-166 investigation with simplified fields for demonstration purposes.

In [1]:
isa_path = "examples/fake_psi166_investigation.txt"

with open(isa_path, "r", encoding="utf-8") as f:
    content = f.read()

print("ISA-Tab Content:\n")
print(content)

ISA-Tab Content:

Study Identifier	PSI-166-Demo
Study Title	Fictitious Nanocatalyst Study
Study Description	Simulated ISA-Tab for PSI metadata normalization
Study Submission Date	2025-09-24
Study Public Release Date	2025-10-01
Study Factors	Nanoparticle Size, Surface Treatment
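
If the mock file is not on disk, it can be recreated from the content printed above. The following is a minimal sketch of one way to do that. It writes to a temporary directory (an assumption, so that nothing under `examples/` is overwritten), and the names `MOCK_ISA_TAB`, `mock_path`, and `content_check` are illustrative only.

```python
import tempfile
from pathlib import Path

# Tab-separated key/value pairs, copied from the output shown above.
MOCK_ISA_TAB = "\n".join([
    "Study Identifier\tPSI-166-Demo",
    "Study Title\tFictitious Nanocatalyst Study",
    "Study Description\tSimulated ISA-Tab for PSI metadata normalization",
    "Study Submission Date\t2025-09-24",
    "Study Public Release Date\t2025-10-01",
    "Study Factors\tNanoparticle Size, Surface Treatment",
]) + "\n"

# Assumption: write to a temporary directory rather than examples/.
mock_dir = Path(tempfile.mkdtemp())
mock_path = mock_dir / "fake_psi166_investigation.txt"
mock_path.write_text(MOCK_ISA_TAB, encoding="utf-8")

# Read the file back as text, as the loading cell does.
content_check = mock_path.read_text(encoding="utf-8")
print(content_check)
```

Pointing `isa_path` at `mock_path` would then let the rest of the notebook run unchanged.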



## Step 2: Normalize ISA-Tab Metadata

This cell simulates the transformation of raw ISA-Tab content into a structured metadata dictionary. The output is modular and ready for validation or integration with other sources.

In [2]:
def normalize_isa_tab(text):
    """
    Simulates normalization of ISA-Tab content into a modular metadata dictionary.
    This mock function extracts key fields and formats them for downstream validation.
    """
    lines = text.strip().split("\n")
    metadata = {}
    for line in lines:
        if "\t" in line:
            key, value = line.split("\t", 1)
            metadata[key.strip()] = value.strip()

    # Construct modular metadata dictionary
    normalized = {
        "study_id": metadata.get("Study Identifier", "unknown"),
        "title": metadata.get("Study Title", "untitled"),
        "description": metadata.get("Study Description", ""),
        "submission_date": metadata.get("Study Submission Date", ""),
        "release_date": metadata.get("Study Public Release Date", ""),
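        # Filter out empty entries so a missing "Study Factors" row yields [] rather than [''].
        "factors": [factor.strip() for factor in metadata.get("Study Factors", "").split(",") if factor.strip()],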
        "provenance": "simulated_PSI_demo"
    }

    return normalized

# Run normalization
normalized_metadata = normalize_isa_tab(content)
print("Normalized Metadata:\n")
for key, value in normalized_metadata.items():
    print(f"{key}: {value}")

Normalized Metadata:

study_id: PSI-166-Demo
title: Fictitious Nanocatalyst Study
description: Simulated ISA-Tab for PSI metadata normalization
submission_date: 2025-09-24
release_date: 2025-10-01
factors: ['Nanoparticle Size', 'Surface Treatment']
provenance: simulated_PSI_demo
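
The parsing step can also be exercised on sparse input. The sketch below is a standalone illustration rather than the notebook's own function: `parse_key_value_lines`, `split_factors`, and the `sparse_text` sample are hypothetical names and data. It shows that lines without a tab are skipped, and that splitting an empty string returns a list holding one empty string, so filtering empty entries gives a cleaner result when the `Study Factors` row is absent.

```python
def parse_key_value_lines(text):
    """Collect tab-separated key/value pairs, skipping lines without a tab."""
    pairs = {}
    for line in text.strip().splitlines():
        if "\t" in line:
            key, value = line.split("\t", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def split_factors(raw):
    """Split a comma-separated factor string, dropping empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


# Hypothetical input: Windows line endings, a blank line, a line with no tab,
# and no "Study Factors" row at all.
sparse_text = "Study Identifier\tPSI-166-Sparse\r\n\r\nnot a key-value line\r\n"

pairs = parse_key_value_lines(sparse_text)
print(pairs)                                          # {'Study Identifier': 'PSI-166-Sparse'}
print(split_factors(pairs.get("Study Factors", "")))  # []
print("".split(", "))                                 # ['']
```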


## Step 3: Validate Normalized Metadata

This cell simulates a validation step that checks for required fields in the normalized metadata. In a full workflow, this would be handled by schema validators or Makefile rules to ensure reproducibility and compatibility.

In [3]:
def validate_metadata(data):
    """
    Simulates validation of normalized metadata.
    Checks that each required field is present and non-empty.
    """
    required_fields = ["study_id", "title", "submission_date", "release_date"]
    missing = [field for field in required_fields if not data.get(field)]

    if missing:
        print("❌ Validation failed. Missing fields:")
        for field in missing:
            print(f" - {field}")
    else:
        print("✅ Validation passed. All required fields are present.")

# Run validation
validate_metadata(normalized_metadata)

✅ Validation passed. All required fields are present.
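
The check above only tests whether the required fields are present. A format check on the two date fields could look like the sketch below. The function name `check_dates` is hypothetical, and the rule that a release date must not precede the submission date is an assumption made for illustration.

```python
from datetime import date


def check_dates(data):
    """Return a list of problems found in the date fields (empty if none)."""
    problems = []
    parsed = {}
    for field in ("submission_date", "release_date"):
        try:
            # A missing or None value is treated as an empty string, which fails to parse.
            parsed[field] = date.fromisoformat(data.get(field) or "")
        except ValueError:
            problems.append(f"{field} is not a valid ISO 8601 date")
    # Assumption: a study cannot be released before it is submitted.
    if len(parsed) == 2 and parsed["release_date"] < parsed["submission_date"]:
        problems.append("release_date is earlier than submission_date")
    return problems


sample = {"submission_date": "2025-09-24", "release_date": "2025-10-01"}
print(check_dates(sample))  # []

bad = {"submission_date": "24/09/2025", "release_date": "2025-10-01"}
print(check_dates(bad))     # ['submission_date is not a valid ISO 8601 date']
```

Returning a list of problems instead of printing makes the result easy to combine with the required-field check or to assert on in a test.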


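## Step 4: Export to YAML

This cell writes the normalized metadata to a YAML file for downstream use. It requires the PyYAML package.

In [4]: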
import yaml

# Save normalized metadata to YAML
output_path = "examples/normalized_psi166.yaml"

with open(output_path, "w") as f:
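    # safe_dump emits only standard YAML tags; sort_keys=False keeps the field order.
    yaml.safe_dump(normalized_metadata, f, sort_keys=False)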

print(f"✅ Saved normalized metadata to {output_path}")


✅ Saved normalized metadata to examples/normalized_psi166.yaml
