# Package: niagads-metadata-validator

In [68]:
import json
import niagads.metadata_validator_tool.core as mv_tool

schemaDir = 'schemas'
metadataFileDir = 'metadata_files'
filePrefix = f'{metadataFileDir}/test_'

# helper function for pretty printing the result of a validation run
def pretty_print(result):
    print(json.dumps(result, indent=4))

### Overview

niagads-metadata-validator provides _row-level_ validation of metadata. It can:

* ensure required fields are present, including conditional dependencies (e.g., `array_id` is required only if `platform = array`)
* ensure field values match a regular expression or controlled vocabulary

Default validation does not do _file-level_ validation, with one exception:

* biosource property files: ensures each biosource is assigned a unique id and occurs exactly once in the file (**required** check) 
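
As a rough illustration of what a uniqueness check of this kind involves, the sketch below counts identifier occurrences across rows and reports any that appear more than once. The helper `find_duplicate_ids` and the sample rows are hypothetical and are not part of the package; the package's own implementation may differ.

```python
from collections import Counter


def find_duplicate_ids(rows, id_field):
    """Return the sorted IDs that occur more than once in `rows`.

    Hypothetical helper for illustration only; rows with a missing or
    null ID are skipped here rather than reported.
    """
    counts = Counter(
        row[id_field] for row in rows if row.get(id_field) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# made-up rows: SAMPLE1 appears twice
rows = [
    {"sample_id": "SAMPLE1"},
    {"sample_id": "SAMPLE2"},
    {"sample_id": "SAMPLE1"},
]
duplicates = find_duplicate_ids(rows, "sample_id")
print(duplicates)
```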

### Schema files

Schema files should meet the [JSON Schema](https://json-schema.org/) **Draft 7** specification.

For examples and conventions for NIAGADS projects, as well as useful references for getting started with JSON schema, please see <https://github.com/NIAGADS/metadata>.
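
For orientation, here is a minimal sketch of a Draft 7 schema for a single metadata row, written as a Python dictionary and serialized to JSON. It combines a regular expression (`pattern`), a controlled vocabulary (`enum`), and a conditional requirement (`if` / `then`). Apart from `platform` and `array_id`, the field names, the pattern, and the vocabulary are assumptions made for the example and are not taken from the NIAGADS schemas.

```python
import json

# Hypothetical row schema; only `platform` and `array_id` come from the text above.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        # value must match a regular expression (assumed ID format)
        "sample_id": {"type": "string", "pattern": "^SAMPLE[0-9]+$"},
        # value must come from a controlled vocabulary (assumed terms)
        "platform": {"type": "string", "enum": ["array", "sequencing"]},
        "array_id": {"type": ["string", "null"]},
    },
    "required": ["sample_id", "platform"],
    # conditional dependency: a non-empty `array_id` is required only for arrays
    "if": {
        "properties": {"platform": {"const": "array"}},
        "required": ["platform"],
    },
    "then": {
        "required": ["array_id"],
        "properties": {"array_id": {"type": "string", "minLength": 1}},
    },
}

print(json.dumps(schema, indent=4))
```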

### Supported file types

Supported file types for metadata information are delimited text (`.txt`, `.tab`, `.csv`) and Excel (`.xlsx`, `.xls`).
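
To make the row-oriented view of a delimited file concrete, the sketch below reads tab-delimited text into a list of dictionaries with the standard `csv` module, mapping empty cells to `None`. The helper `read_delimited` and the sample text are hypothetical; how the package itself parses files is not shown in this document.

```python
import csv
import io


def read_delimited(handle, delimiter="\t"):
    """Parse delimited text into a list of row dicts, mapping empty cells to None.

    Hypothetical helper for illustration; not the package's parser.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        {field: (value or None) for field, value in row.items()}
        for row in reader
    ]


# made-up content: the second row has an empty `sex` cell
text = "participant_id\tcohort\tsex\nP1\tCOHORT_A\tMale\nP2\tCOHORT_A\t\n"
rows = read_delimited(io.StringIO(text))
print(rows)
```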

#### Metadata validators

The `mv_tool` makes file-type specific validation decisions based on a `metadataType` argument whose value is constrained by a `MetadataValidatorType` **case-insensitive** enum.  

The two types of supported metadata files are:

* _Biosource Properties_: a file that maps a sample or participant to descriptive properties (e.g., phenotype or material) or an ISA-TAB-like sample file
* _File Manifest_: file manifest or a sample-data-relationship (SDRF) file

In [69]:
# list supported Metadata Validators
print(mv_tool.MetadataValidatorType.list())

# this enum is case insensitive
print(mv_tool.MetadataValidatorType('biosource_properties'))
print(mv_tool.MetadataValidatorType('BIOSOURCE_PROPERTIES'))

['BIOSOURCE_PROPERTIES', 'FILE_MANIFEST']
BIOSOURCE_PROPERTIES
BIOSOURCE_PROPERTIES
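
For readers curious how a case-insensitive enum can behave this way, one common approach in Python is to override `Enum._missing_`. The class below is a hypothetical stand-in written for this example; it is not the package's `MetadataValidatorType`, whose implementation may differ.

```python
from enum import Enum


class CaseInsensitiveType(Enum):
    """Hypothetical stand-in for a case-insensitive validator-type enum."""

    BIOSOURCE_PROPERTIES = "BIOSOURCE_PROPERTIES"
    FILE_MANIFEST = "FILE_MANIFEST"

    @classmethod
    def _missing_(cls, value):
        # Called when `value` matches no member exactly; retry ignoring case.
        if isinstance(value, str):
            for member in cls:
                if member.value.lower() == value.lower():
                    return member
        return None  # Enum then raises ValueError for unknown values

    def __str__(self):
        return self.name


print(CaseInsensitiveType("biosource_properties"))
print(CaseInsensitiveType("File_Manifest"))
```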



### Usage Examples

#### Default validation given schema and metadata files.

For default validation, use `mv_tool.run()`, which returns a dictionary with two lists: `errors` and `warnings`. 

If the file passes validation, both will be empty arrays as follows:

```json
{
    "errors": [],
    "warnings": []
}
```

Validation errors are reported either by row number or under the name of the file-level check that failed.  In the example below, row #3 is missing the required field `sample_id`, and a duplicate sample (`duplicate_SAMPLE_ID`) was found in the file:

```json
{
    "errors": [
        {
            "3": [
                "required field `sample_id` cannot be empty / null"
            ]
        },
        {
            "duplicate_SAMPLE_ID": [
                "SAMPLE1"
            ]
        }
    ],
    "warnings": []
}
```
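
Because row-level and file-level entries share the same `errors` list, it can be handy to separate them before reporting. The sketch below does so by treating numeric keys as row numbers; the helper `split_errors` is hypothetical, and the numeric-key rule is an assumption based on the example result shown here.

```python
def split_errors(result):
    """Separate row-level entries (numeric keys) from file-level entries.

    Hypothetical helper; assumes each item in `errors` is a dict that maps
    a row number or a check name to a list of messages or values.
    """
    by_row, by_check = {}, {}
    for entry in result.get("errors", []):
        for key, messages in entry.items():
            target = by_row if key.isdigit() else by_check
            target.setdefault(key, []).extend(messages)
    return by_row, by_check


# the example result from this section
result = {
    "errors": [
        {"3": ["required field `sample_id` cannot be empty / null"]},
        {"duplicate_SAMPLE_ID": ["SAMPLE1"]},
    ],
    "warnings": [],
}

by_row, by_check = split_errors(result)
for row, messages in by_row.items():
    joined = "; ".join(messages)
    print(f"row {row}: {joined}")
for check, values in by_check.items():
    print(f"{check}: {values}")
```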

#### Example: Parse a `BIOSOURCE_PROPERTIES` file using the default validator

Biosource property file validation checks that each biosource **is unique**.

An `idField` that maps to the field in the metadata file containing the unique biosource identifier needs to be set to run this validation (e.g., `sample_id`, `participant_id`, `donor_id`, `subject_id`).

In [70]:
metadataFile = f'{metadataFileDir}/test_sample_info.tab'
schemaFile = f'{schemaDir}/sample_info.json'
idField = 'sample_id'
result = mv_tool.run(metadataFile, schemaFile, metadataType='BIOSOURCE_PROPERTIES', idField=idField)
pretty_print(result)

{
    "errors": [],
    "warnings": []
}


If your schema and metadata files are templated, `mv_tool` provides functions to generate the file names and verify their existence before attempting to validate.

Templated metadata files are named to match the schema, such that `*participant_info.ext` is validated using `participant_info.json`.
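
The naming convention can be sketched with `pathlib` as follows. The helper `find_templated_files` and the extension search order are assumptions made for this example; they are not the package's `get_templated_schema_file` or `get_templated_metadata_file`, which should be preferred in practice.

```python
import tempfile
from pathlib import Path

# extensions listed under "Supported file types"
SUPPORTED_EXTENSIONS = (".txt", ".tab", ".csv", ".xlsx", ".xls")


def find_templated_files(schema_dir, file_prefix, template):
    """Return (schema_path, metadata_path); raise FileNotFoundError if either is missing.

    Hypothetical helper illustrating the convention only.
    """
    schema_path = Path(schema_dir) / f"{template}.json"
    if not schema_path.is_file():
        raise FileNotFoundError(schema_path)
    for ext in SUPPORTED_EXTENSIONS:
        candidate = Path(f"{file_prefix}{template}{ext}")
        if candidate.is_file():
            return schema_path, candidate
    raise FileNotFoundError(f"{file_prefix}{template}.*")


# build a throwaway directory layout so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "schemas").mkdir()
    (root / "metadata_files").mkdir()
    (root / "schemas" / "participant_info.json").write_text("{}")
    (root / "metadata_files" / "test_participant_info.tab").write_text("participant_id\n")

    schema_path, metadata_path = find_templated_files(
        root / "schemas", f"{root}/metadata_files/test_", "participant_info"
    )
    print(schema_path.name, metadata_path.name)
```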



In [71]:
template = 'participant_info'
schemaFile = mv_tool.get_templated_schema_file(schemaDir, template)
print(f'Schema File: {schemaFile}')

metadataFile = mv_tool.get_templated_metadata_file(filePrefix, template)
print(f'Metadata File: {metadataFile}')

# straight run
validationResult = mv_tool.run(metadataFile, schemaFile, 'biosource_properties', 'participant_id')
pretty_print(validationResult)

Schema File: schemas/participant_info.json
Metadata File: metadata_files/test_participant_info.tab
{
    "errors": [],
    "warnings": []
}


### Customize validation by retrieving the validator object

`mv_tool` provides an `initialize_validator` function that initializes and returns the validator object so that you can perform custom validations and file-level operations not specified in or allowed by the JSON schema.

For full documentation on Validator objects, please see the **niagads-pylib/metadata-validator** [README](https://github.com/NIAGADS/niagads-pylib/blob/6b54d6b1b836564e79f5cf40afaf3522c3379732/components/niagads/metadata_validator/README.md).

#### Example Biosource Properties Validator

In [72]:
# get an initialized validator object
metadataFile = f'{metadataFileDir}/test_participant_info.tab'
schemaFile = f'{schemaDir}/participant_info.json'
validator = mv_tool.initialize_validator(metadataFile, schemaFile, 'biosource_properties', 'participant_id')

# access validator properties / members
print(f'Validator type: {type(validator)}')
print(f'Schema: {validator.get_schema()}')
print(f'Parsed Metadata: {validator.get_metadata()}')
print(f'Biosource IDs: {validator.get_biosource_ids()}')
# use double quotes outside so the nested single quotes are valid before Python 3.12
print(f"Cohort: {validator.get_field_values('cohort')}")

# run the validation
validationResult = validator.run()
print(f'Validation Result: {validationResult}')

Validator type: <class 'niagads.metadata_validator.core.BiosourcePropertiesValidator'>
Schema: schemas/participant_info.json
Parsed Metadata: [{'participant_id': 'DONOR1', 'cohort': 'KNIGHT-ADRC', 'consent': None, 'sex': 'Male', 'race': 'Asian', 'ethnicity': 'Hispanic or Latino', 'diagnosis': None, 'disease': 'AD', 'APOE': None, 'comment': 'clinical diagnosis'}, {'participant_id': 'DONOR2', 'cohort': 'ROSMAP', 'consent': None, 'sex': 'Female', 'race': 'White', 'ethnicity': 'Not Hispanic or Latino', 'diagnosis': None, 'disease': 'AD', 'APOE': None, 'comment': None}, {'participant_id': 'DONOR3', 'cohort': 'KNIGHT-ADRC', 'consent': None, 'sex': 'Male', 'race': 'White', 'ethnicity': 'Hispanic or Latino', 'diagnosis': None, 'disease': 'AD', 'APOE': None, 'comment': None}, {'participant_id': 'DONOR4', 'cohort': 'KNIGHT-ADRC', 'consent': None, 'sex': 'Not reported', 'race': None, 'ethnicity': 'Not Hispanic or Latino', 'diagnosis': None, 'disease': None, 'APOE': None, 'comment': None}, {'parti
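
Since the parsed metadata is printed as a list of row dictionaries, custom file-level checks can be written as ordinary Python over that list. The sketch below counts missing values per field. The helper `missing_value_counts` is hypothetical, and the two rows are trimmed copies of DONOR1 and DONOR4 from the output above.

```python
from collections import Counter


def missing_value_counts(rows):
    """Count, per field, how many rows hold None for that field.

    Hypothetical file-level check; not part of the package.
    """
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


# trimmed rows shaped like the parsed metadata shown above
rows = [
    {"participant_id": "DONOR1", "cohort": "KNIGHT-ADRC", "race": "Asian", "disease": "AD"},
    {"participant_id": "DONOR4", "cohort": "KNIGHT-ADRC", "race": None, "disease": None},
]
missing = missing_value_counts(rows)
print(missing)
```

With a real validator, the same function could presumably be applied to the rows returned by `validator.get_metadata()`.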

#### Example Custom File Manifest Validator

With file manifests or sample-data-relationship files that are paired with sample information, a necessary validation is to ensure that every referenced sample was present in the original `sample_info` file.  This requires a two-step process:

1. initialize a `biosource_properties` validator to retrieve sample IDs
2. pass the sample IDs to the `file_manifest` validator

In [73]:
# get an initialized BiosourcePropertiesValidator
sampleInfo = f'{metadataFileDir}/test_sample_info.tab'
schemaFile = f'{schemaDir}/sample_info.json'
bsValidator = mv_tool.initialize_validator(sampleInfo, schemaFile, 'biosource_properties', 'sample_id')
bsValidator.run() # optionally run the validator to ensure uniqueness of sample IDs

# retrieve sample IDs
sampleIds = bsValidator.get_biosource_ids()

# file manifest and its schema
fileManifest = f'{metadataFileDir}/test_file_manifest.tab'
schemaFile = f'{schemaDir}/file_manifest.json'

# get an initialized FileManifestValidator Object
# also illustrates how to set metadataType using the enum to avoid typos
fmValidator = mv_tool.initialize_validator(fileManifest, schemaFile, mv_tool.MetadataValidatorType.FILE_MANIFEST, 'sample_id')

# set the reference sample list
fmValidator.set_sample_reference(sampleIds)

# set the mapped field for the samples in the file manifest
fmValidator.set_sample_field('sample_id')

# run the validator
validationResult = fmValidator.run()
pretty_print(validationResult)

{
    "errors": [
        {
            "invalid_SAMPLE_ID": [
                "SAMPLE7"
            ]
        }
    ],
    "warnings": [
        {
            "no_file_for_SAMPLE_ID": [
                "SAMPLE3",
                "SAMPLE1"
            ]
        }
    ]
}
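
Conceptually, the two kinds of entries in this result can be read as set differences between the reference sample IDs and the sample IDs that appear in the manifest. The sketch below illustrates that idea. The helper `compare_sample_ids` and both ID lists are made up for the example, and the actual comparison inside `FileManifestValidator` may differ.

```python
def compare_sample_ids(reference_ids, manifest_ids):
    """Compare manifest sample IDs against a reference list.

    Hypothetical helper: `invalid` holds IDs in the manifest that are not in
    the reference; `no_file` holds reference IDs never seen in the manifest.
    """
    reference, manifest = set(reference_ids), set(manifest_ids)
    return {
        "invalid": sorted(manifest - reference),
        "no_file": sorted(reference - manifest),
    }


# made-up inputs chosen to resemble the result above
reference_ids = ["SAMPLE1", "SAMPLE2", "SAMPLE3"]
manifest_ids = ["SAMPLE2", "SAMPLE7"]
comparison = compare_sample_ids(reference_ids, manifest_ids)
print(comparison)
```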
