# Metadata Validator
This Jupyter notebook is to facilitate development of a Gen3 metadata validation script

***

### General Idea
1. Load metadata into a Python object
    - Class for loading and storing metadata
    - Takes an input folder and reads the `.json`, `_*.json` and `dataImportOrder.txt` files into an accessible object
1. Load schema into a Python object
    - Class that loads the bundled JSON and splits it into its individual YAML schemas, each accessible separately
1. 
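
A minimal sketch of the metadata loading step, assuming the input folder holds one JSON file per node and that `dataImportOrder.txt` lists one node name per line. The class name `MetadataStore` and its attributes are hypothetical and only illustrate the idea:

```python
import json
from pathlib import Path


class MetadataStore:
    """Hypothetical container for a folder of metadata files."""

    def __init__(self, input_dir):
        self.input_dir = Path(input_dir)
        self.nodes = {}         # file name without extension -> parsed JSON
        self.import_order = []  # node names read from dataImportOrder.txt

    def load(self):
        # Read every .json file in the folder (this also matches _*.json).
        for path in sorted(self.input_dir.glob("*.json")):
            with path.open("r", encoding="utf-8") as fh:
                self.nodes[path.stem] = json.load(fh)

        # Assumption: one node name per line, blank lines ignored.
        order_file = self.input_dir / "dataImportOrder.txt"
        if order_file.exists():
            lines = order_file.read_text(encoding="utf-8").splitlines()
            self.import_order = [line.strip() for line in lines if line.strip()]
        return self
```

Usage would look like `store = MetadataStore("path/to/metadata").load()`, after which `store.nodes` and `store.import_order` hold the loaded content.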

***

# Perplexity help
- [link to chat](https://www.perplexity.ai/search/lets-say-I-erdZUVAOQ_SgDnHh_3meWA)


To handle a scenario where your `bundled.json` file contains a `_definitions.yaml` file, and some of the YAML schemas in the `bundled.json` link to `_definitions.yaml` for common definitions, you need to ensure that your JSON Schema validator can resolve these references correctly. Here’s how you can achieve this using Python:

### Steps to Validate `metadata.json` Against a Schema with External Definitions

1. **Load and Parse YAML Files**: Load and parse the `_definitions.yaml` and other YAML schemas from the `bundled.json` file.
2. **Resolve References**: Ensure that references to definitions in `_definitions.yaml` are correctly resolved.
3. **Validate the JSON Data**: Use a JSON Schema validator to validate the `metadata.json` file against the resolved schema.

### Example Using Python

Here’s a step-by-step guide using Python, `jsonschema`, and `pyyaml` libraries:

1. **Install Required Libraries**:
   ```bash
   pip install jsonschema pyyaml
   ```

2. **Load and Parse YAML Files**:
   ```python
   import yaml
   import json
   from jsonschema import validate, RefResolver, ValidationError

   # Load the bundled JSON file containing multiple YAML schemas
   with open('bundled.json', 'r') as bundled_file:
       bundled_schemas = json.load(bundled_file)

   # Extract and parse the _definitions.yaml file
   definitions_yaml = bundled_schemas['_definitions.yaml']
   definitions_schema = yaml.safe_load(definitions_yaml)

   # Extract and parse the specific schema that references _definitions.yaml
   specific_schema_yaml = bundled_schemas['specific_schema_key']  # Replace with the actual key
   specific_schema = yaml.safe_load(specific_schema_yaml)
   ```

3. **Resolve References**:
   ```python
   # Create a resolver that includes the definitions
   class CustomRefResolver(RefResolver):
       def resolve_remote(self, uri):
           if uri == '_definitions.yaml':
               return definitions_schema
           return super().resolve_remote(uri)

   resolver = CustomRefResolver.from_schema(specific_schema)
   ```

4. **Validate the JSON Data**:
   ```python
   # Load the metadata JSON file
   with open('metadata.json', 'r') as json_file:
       metadata = json.load(json_file)

   # Validate the metadata against the specific schema with resolved references
   try:
       validate(instance=metadata, schema=specific_schema, resolver=resolver)
       print("Validation successful!")
   except ValidationError as e:
       print(f"Validation error: {e.message}")
   ```

### Detailed Explanation

1. **Load and Parse YAML Files**:
   - Load the `bundled.json` file, which contains multiple YAML schemas.
   - Extract and parse the `_definitions.yaml` file and the specific schema that references it using `yaml.safe_load`.

2. **Resolve References**:
   - Create a custom `RefResolver` class that overrides the `resolve_remote` method to return the parsed `_definitions.yaml` schema when the reference URI matches.
   - Instantiate the custom resolver with the specific schema.

3. **Validate the JSON Data**:
   - Load the `metadata.json` file.
   - Use the `validate` function from the `jsonschema` library to validate the `metadata.json` data against the specific schema, using the custom resolver to handle references.
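
To make the reference resolution step more concrete, here is a standard-library sketch of what looking up a `$ref` from in-memory documents involves. It is not how `jsonschema` implements resolution. The names `split_ref`, `lookup_ref` and `store` are hypothetical, and JSON Pointer escapes (`~0`, `~1`) are not handled:

```python
def split_ref(ref):
    """Split 'file.yaml#/a/b' into ('file.yaml', ['a', 'b']); the file part may be empty."""
    file_part, _, pointer = ref.partition("#")
    keys = [key for key in pointer.split("/") if key]
    return file_part, keys


def lookup_ref(ref, store, current_name):
    """Return the object a $ref points to, using only documents held in `store`."""
    file_part, keys = split_ref(ref)
    doc_name = file_part or current_name  # '#/x' refers to the current document
    if doc_name not in store:
        raise KeyError(f"Unknown schema document: {doc_name!r}")
    value = store[doc_name]
    for key in keys:
        value = value[key]
    return value


# Hypothetical, heavily shortened documents keyed by file name.
store = {
    "_definitions.yaml": {"to_one": {"type": "object"}},
    "medical_history.yaml": {
        "properties": {"subjects": {"$ref": "_definitions.yaml#/to_one"}}
    },
}
target = lookup_ref("_definitions.yaml#/to_one", store, "medical_history.yaml")
print(target)
```

Keeping all parsed schemas in one dictionary keyed by file name means no reference needs to touch the file system once the bundle has been loaded.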

### Additional Resources

- **GitHub - Schema Validator**: Provides a utility for validating YAML/JSON files against predefined schemas, including handling nested properties and command-line usage [1].
- **JSON Schema - Getting Started**: Offers a comprehensive guide on creating and using JSON Schema, including defining properties, nesting data structures, and validating JSON data [2].
- **MuleSoft Documentation**: Describes how to use the JSON Schema validator to evaluate JSON payloads at runtime, supporting both local and external schemas [4].
- **Python JSON Schema Documentation**: Explains how to use the `jsonschema` library in Python to validate JSON documents, including handling references and custom resolvers [10][11][14].

By following these steps, you can ensure that your `metadata.json` file is validated against the correct schema, even if it relies on external definitions in `_definitions.yaml`. This approach leverages Python's `jsonschema` library and custom reference resolution to handle complex schema validation scenarios.


# Resolving References

# Manual reference resolution
Try to resolve the references manually

### This function resolves the definition file
- Next step: use the resolved definitions file to manually replace the references in the target YAML


In [1]:
import yaml
import json
import copy
from jsonschema import validate, RefResolver, ValidationError

# Load and parse schemas
def load_and_parse_schemas(schema_path):
    if not schema_path.endswith('.json'):
        raise ValueError("schema_path must be a .json file")

    try:
        with open(schema_path, 'r') as schema_file:
            bundled_schemas = json.load(schema_file)
    except (FileNotFoundError, json.JSONDecodeError) as e:
        raise ValueError(f"Error loading JSON file: {e}")

    loaded_yamls = {}
    for key, value in bundled_schemas.items():
        if isinstance(value, dict):
            try:
                yaml_str = yaml.dump(value, sort_keys=False)
                loaded_yamls[key] = yaml.safe_load(yaml_str)
            except yaml.YAMLError as e:
                raise ValueError(f"Error parsing YAML for key '{key}': {e}")
        else:
            raise ValueError(f"Value for key '{key}' is not a dictionary")

    return loaded_yamls





In [2]:
import os
import yaml

def oneStepResolver(baseDir: str, baseYaml: str, refYaml: str = None, refFileObj: dict = None):
    # Construct the absolute path for the base YAML file
    baseFilePath = os.path.join(baseDir, baseYaml)
    
    # Load the YAML files
    with open(baseFilePath, 'r') as base_file:
        baseFile = yaml.safe_load(base_file)
    
    if refFileObj is not None:
        referenceFile = refFileObj
    elif refYaml is not None:
        refFilePath = os.path.join(baseDir, refYaml)
        with open(refFilePath, 'r') as ref_file:
            referenceFile = yaml.safe_load(ref_file)
    else:
        referenceFile = None
    
    # Function to resolve $ref
    def resolve_ref(ref, referenceFile, baseFile):
        if ref.startswith('#/'):
            path = ref.split('#/')[1].split('/')
            value = baseFile
        else:
            file_path, ref_path = ref.split('#/')
            file_path = os.path.join(baseDir, file_path)
            with open(file_path, 'r') as ref_file:
                value = yaml.safe_load(ref_file)
            path = ref_path.split('/')
        
        for p in path:
            value = value[p]
        return value

    # Recursive function to resolve all $ref in a dictionary
    def resolve_all_refs(obj, referenceFile, baseFile):
        if isinstance(obj, dict):
            if '$ref' in obj:
                ref = obj['$ref']
                resolved_value = resolve_ref(ref, referenceFile, baseFile)
                return resolve_all_refs(resolved_value, referenceFile, baseFile)
            else:
                return {k: resolve_all_refs(v, referenceFile, baseFile) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [resolve_all_refs(item, referenceFile, baseFile) for item in obj]
        else:
            return obj

    # Resolve the $ref in baseFile
    resolved_baseFile = resolve_all_refs(baseFile, referenceFile, baseFile)

    # Return the resolved baseFile
    return yaml.dump(resolved_baseFile, default_flow_style=False, sort_keys=False)
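
Note on the resolver above: because it follows each `$ref` recursively, a definition that refers back to itself (directly or through another definition) would recurse until Python raises a `RecursionError`. Below is a minimal sketch of cycle detection for `#/...` references within a single in-memory document. The function name `resolve_local_refs` and the `active` parameter are hypothetical, and like the resolver above it replaces the whole mapping that contains the `$ref`:

```python
def resolve_local_refs(node, root, active=()):
    """Inline '#/...' refs within one document, raising on circular references."""
    if isinstance(node, list):
        return [resolve_local_refs(item, root, active) for item in node]
    if not isinstance(node, dict):
        return node

    ref = node.get("$ref")
    if isinstance(ref, str) and ref.startswith("#/"):
        # `active` holds the refs currently being followed on this branch.
        if ref in active:
            raise ValueError("Circular $ref: " + " -> ".join(active + (ref,)))
        target = root
        for key in ref[2:].split("/"):
            target = target[key]
        return resolve_local_refs(target, root, active + (ref,))

    return {k: resolve_local_refs(v, root, active) for k, v in node.items()}


# Hypothetical example documents.
doc = {"a": {"$ref": "#/b"}, "b": {"type": "string"}}
print(resolve_local_refs(doc, doc))

cyclic = {"a": {"$ref": "#/b"}, "b": {"$ref": "#/a"}}
try:
    resolve_local_refs(cyclic, cyclic)
except ValueError as err:
    print(err)
```
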

In [3]:
# Resolve the _definitions.yaml using the _settings.yaml
resolved_project_yaml = oneStepResolver('/Users/harrijh/Library/CloudStorage/GoogleDrive-joshua@biocommons.org.au/My Drive/projects/ACDCSchemaDev/output/schema/yaml/', '_definitions.yaml', '_settings.yaml')

# Print the resolved _definitions.yaml content
print(resolved_project_yaml)

# writing
with open('../output/schema/yaml/_definitions_res.yaml', 'w') as f:
    f.write(resolved_project_yaml)

# storing as dict
resolved_project_yaml_dict = yaml.safe_load(resolved_project_yaml)

id: _definitions
UUID:
  term:
    description: 'A 128-bit identifier. Depending on the mechanism used to generate
      it, it is either guaranteed to be different from all other UUIDs/GUIDs generated
      until 3400 AD or extremely likely to be different. Its relatively small size
      lends itself well to sorting, ordering, and hashing of all sorts, storing in
      databases, simple allocation, and ease of programming in general.

      '
    termDef:
      term: Universally Unique Identifier
      source: NCIt
      cde_id: C54100
      cde_version: null
      term_url: https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI_Thesaurus&version=16.02d&ns=NCI_Thesaurus&code=C54100
  type: string
  pattern: ^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$
parent_uuids:
  type: array
  minItems: 1
  items:
    term:
      description: 'A 128-bit identifier. Depending on the mechanism used to generate
        it, it is either guaranteed to be

In [4]:
# function to use _definitions_res.yaml to resolve the target yaml

# Load schemas
schema_path = '../output/schema/json/schema_dev.json'
yaml_dict = load_and_parse_schemas(schema_path)

medYaml = yaml_dict['medical_history.yaml']

# Function to pull key-value pairs where key is '$ref' and store in a dictionary
def extract_refs(yaml_dict):
    refs = {}

    def recursive_extract(obj, parent_key=''):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == '$ref':
                    refs[parent_key] = v
                else:
                    new_key = f"{parent_key}.{k}" if parent_key else k
                    recursive_extract(v, new_key)
        elif isinstance(obj, list):
            for i, item in enumerate(obj):
                new_key = f"{parent_key}[{i}]"
                recursive_extract(item, new_key)

    recursive_extract(yaml_dict)
    return refs

# Example usage
refs_dict = extract_refs(medYaml)
print(refs_dict)



{'properties': '_definitions.yaml#/ubiquitous_properties', 'properties.subjects': '_definitions.yaml#/to_one'}


In [5]:
import yaml

# class for string manipulation

class refString:
    """
    Attributes:
        str_value (str): The original reference string ('_definitions.yaml#/ubiquitous_properties').
        yamlName (str): The extracted YAML file name from the reference string (output = _definitions.yaml).
        propName (str): The extracted property name from the reference string (output = ubiquitous_properties).
    """
    def __init__(self, str_value: str):
        self.str_value = str_value
        # Split once on '#/' so that 'file.yaml#/prop' yields ('file.yaml', 'prop')
        self.yamlName, self.propName = self.str_value.split('#/', 1)
    
    def get_yaml_name(self):
        return self.yamlName
    
    def get_prop_name(self):
        return self.propName
    

# def extract_last_key(key_string):
#     """
#     Extracts the last segment from each string in a list, separated by dots.

#     Parameters:
#     - strings (list): List of dot-separated strings.

#     Returns:
#     - list: List of the last segment from each string.

#     Example:
#     >>>extract_last_item(["aye", "aye.bee", "aye.bee.ceebs"])
#     ['aye', 'bee', 'ceebs']
#     """
#     return key_string.split('.')[-1]
    

# # pulling value base on propName 
# def get_value_by_ref(data, ref_str):
#     """
#     Recursively search for the ref_str in the nested dictionary and return its value.
    
#     :param data: The dictionary to search within.
#     :param ref_str: The reference string to search for.
#     :return: The value associated with the ref_str, or None if the key is not found.
#     """
#     if isinstance(data, dict):
#         for key, value in data.items():
#             if key == ref_str:
#                 print('key found')
#                 return value
#             elif isinstance(value, (dict, list)):
#                 result = get_value_by_ref(value, ref_str)
#                 if result is not None:
#                     return result
#     elif isinstance(data, list):
#         for item in data:
#             result = get_value_by_ref(item, ref_str)
#             if result is not None:
#                 return result
#     return None


def get_value_by_ref(data, ref_str):
    result = data[ref_str]
    result_str = yaml.dump(result, default_flow_style=False, sort_keys=False)
    return result_str


def process_ref_value(ref_value: str, resolved_project_yaml_dict: dict):
    """
    Uses the reference value to pull the property from the resolved_project_yaml_dict, and then returns the insert value string and the replace value string.
    
    Parameters:
    - ref_value (str): The reference value to process ('_definitions.yaml#/to_one').
    - resolved_project_yaml_dict (dict): The resolved definitions YAML dictionary.
    
    Returns:
    - tuple: A tuple containing the insert value string and the replace value string.
    """
    
    propName = refString(ref_value).get_prop_name()
    yamlName = refString(ref_value).get_yaml_name()

    # check that yamlName = _definitions.yaml
    if yamlName != '_definitions.yaml':
        print('not _definitions.yaml')

    # pulling prop value from the resolved _definitions
    prop_value = get_value_by_ref(resolved_project_yaml_dict, propName)
    # prop_value_str = yaml.dump(prop_value, default_flow_style=False, sort_keys=False)
    # print(f"Value for property '{propName}': {prop_value}")

    # creating final strings or values
    insert_value_str = f"$ref: {ref_value}"
    replace_value_str = f"{propName}: {prop_value}"

    # returning the match replace key value pair
    return insert_value_str, replace_value_str



# continue from here
- Problem: `process_ref_value` returns the definition as a YAML string, so the inserted value ends up as one long string with broken formatting instead of a nested mapping.
- Ideally the `$ref` would be replaced with the dictionary value itself, but replacing values inside nested dictionaries is harder to get right.
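
One possible way around the string formatting problem is to work on dictionaries only. The sketch below is a hedged illustration, not a tested replacement for the cells that follow: the function name `inline_definition_refs` is hypothetical, it only handles single-segment references such as `_definitions.yaml#/to_one`, it assumes the definitions passed in are already resolved, and it assumes that keys sitting next to a `$ref` should be merged on top of the referenced definition:

```python
import copy


def inline_definition_refs(node, definitions, def_file="_definitions.yaml"):
    """Return a copy of `node` with refs into `def_file` replaced by dict values."""
    if isinstance(node, list):
        return [inline_definition_refs(item, definitions, def_file) for item in node]
    if not isinstance(node, dict):
        return node

    result = {}
    ref_was_inlined = False
    ref = node.get("$ref")
    prefix = def_file + "#/"
    if isinstance(ref, str) and ref.startswith(prefix):
        # Copy the definition so later edits do not alter `definitions`.
        target = copy.deepcopy(definitions[ref[len(prefix):]])
        if not isinstance(target, dict):
            return target
        result.update(target)
        ref_was_inlined = True

    # Assumption: sibling keys are kept and override keys from the definition.
    for key, value in node.items():
        if key == "$ref" and ref_was_inlined:
            continue
        result[key] = inline_definition_refs(value, definitions, def_file)
    return result


# Hypothetical, heavily shortened stand-ins for the real schema and definitions.
definitions = {
    "to_one": {"type": "object"},
    "ubiquitous_properties": {"type": {"type": "string"}},
}
schema = {
    "properties": {
        "$ref": "_definitions.yaml#/ubiquitous_properties",
        "subjects": {"$ref": "_definitions.yaml#/to_one"},
    }
}
resolved_schema = inline_definition_refs(schema, definitions)
print(resolved_schema)
```

Because the result stays a dictionary, it can be dumped to YAML once at the end rather than patched as text.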


In [6]:
def replace_value_deep_dot_path(nested_dict, dot_path, new_value):
    """
    Replace the value of a key at a specified path in a nested dictionary,
    where the path is given as a dot-separated string.
    
    :param nested_dict: The nested dictionary to modify.
    :param dot_path: A dot-separated string representing the path to the target key.
    :param new_value: The new value to assign to the target key.
    """
    path = dot_path.split('.')  # Split the dot-separated string into a list of keys
    current = nested_dict
    for key in path[:-1]:  # Traverse to the parent of the target key
        if key not in current or not isinstance(current[key], dict):
            current[key] = {}  # Ensure the key exists and is a dictionary
        current = current[key]
    current[path[-1]] = new_value  # Set the new value for the target key
    return current

# extracting reference values
ref_value= list(refs_dict.values())[1]
key_value = list(refs_dict.keys())[1]
# convert medYaml to str (to enable replacement)
# medYamlStr = yaml.dump(medYaml, default_flow_style=False, sort_keys=False)
print(f'finding ref for {ref_value} in resolved definitions')
insert_value_str, replace_value_str = process_ref_value(ref_value, resolved_project_yaml_dict)
print(f'ref value is: {replace_value_str}')

# Replacing based on key
medYaml_edit = copy.deepcopy(medYaml)  # work on a copy so medYaml itself is not modified
medYaml_edit = replace_value_deep_dot_path(medYaml_edit, key_value, replace_value_str)
medYaml_edit
# # writing medYaml to file
# with open('../output/schema/yaml/medical_history_res.yaml', 'w') as f:
#     f.write(medYamlStr)

finding ref for _definitions.yaml#/to_one in resolved definitions
ref value is: to_one: anyOf:
- type: array
  items:
    type: object
    additionalProperties: true
    properties:
      id:
        term:
          description: 'A 128-bit identifier. Depending on the mechanism used to generate
            it, it is either guaranteed to be different from all other UUIDs/GUIDs
            generated until 3400 AD or extremely likely to be different. Its relatively
            small size lends itself well to sorting, ordering, and hashing of all
            sorts, storing in databases, simple allocation, and ease of programming
            in general.

            '
          termDef:
            term: Universally Unique Identifier
            source: NCIt
            cde_id: C54100
            cde_version: null
            term_url: https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI_Thesaurus&version=16.02d&ns=NCI_Thesaurus&code=C54100
        type: string
        patt

{'$ref': '_definitions.yaml#/ubiquitous_properties',
 'subjects': "to_one: anyOf:\n- type: array\n  items:\n    type: object\n    additionalProperties: true\n    properties:\n      id:\n        term:\n          description: 'A 128-bit identifier. Depending on the mechanism used to generate\n            it, it is either guaranteed to be different from all other UUIDs/GUIDs\n            generated until 3400 AD or extremely likely to be different. Its relatively\n            small size lends itself well to sorting, ordering, and hashing of all\n            sorts, storing in databases, simple allocation, and ease of programming\n            in general.\n\n            '\n          termDef:\n            term: Universally Unique Identifier\n            source: NCIt\n            cde_id: C54100\n            cde_version: null\n            term_url: https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI_Thesaurus&version=16.02d&ns=NCI_Thesaurus&code=C54100\n        type: string\n  

In [7]:
medYamlStr

NameError: name 'medYamlStr' is not defined

# This is the final usage code


In [17]:
# Load schemas
schema_path = '../output/schema/json/schema_dev.json'
yaml_dict = load_and_parse_schemas(schema_path)

medYaml = yaml_dict['medical_history.yaml']
print(medYaml)
# extracting reference values
ref_value_list = list(refs_dict.values())
# convert medYaml to str (to enable replacement)
medYamlStr = yaml.dump(medYaml, default_flow_style=False, sort_keys=False)

ref_value = ref_value_list[0]
print(f'finding ref for {ref_value} in resolved definitions')
insert_value_str, replace_value_str = process_ref_value(ref_value, resolved_project_yaml_dict)
medYamlStr = medYamlStr.replace(insert_value_str, replace_value_str)
    
# medYamlStr
# medYamlRes = yaml.safe_load(medYamlStr)


# writing medYaml to file
with open('../output/schema/yaml/medical_history_res.yaml', 'w') as f:
    f.write(medYamlStr)


{'$schema': 'http://json-schema.org/draft-04/schema#', 'id': 'medical_history', 'title': 'Medical History', 'type': 'object', 'namespace': 'https://data.acdc.ozheart.org', 'category': 'clinical', 'program': '*', 'project': '*', 'description': 'Medical history of the participant', 'additionalProperties': False, 'submittable': True, 'validators': None, 'systemProperties': ['id', 'project_id', 'state', 'created_datetime', 'updated_datetime'], 'links': [{'name': 'subjects', 'backref': 'medical_histories', 'label': 'describes', 'target_type': 'subject', 'multiplicity': 'one_to_one', 'required': True}], 'required': ['type', 'submitter_id', 'subjects'], 'uniqueKeys': [['id'], ['project_id', 'submitter_id']], 'properties': {'$ref': '_definitions.yaml#/ubiquitous_properties', 'subjects': {'$ref': '_definitions.yaml#/to_one'}, 'hypertension': {'description': 'Whether the participant has Hypertension', 'enum': ['yes, measured or on treatment', 'yes, self-reported', 'no', 'not reported', 'not coll