In [1]:
import gen3_data_validator
from gen3_data_validator.logging_config import setup_logging
setup_logging()

## Reading in xlsx data and writing to json
- The xlsx data comes from an xlsx manifest file created from acdc_submission_template.

In [2]:
# ResolverClass = gen3_data_validator.ResolveSchema(schema_path = "../schema/gen3_test_schema.json")
xlsxData = gen3_data_validator.ParseXlsxMetadata(xlsx_path = "/Users/harrijh/projects/gen3-data-validator/data/lipid_metadata_example.xlsx", skip_rows=1)
xlsxData.parse_metadata_template()
xlsxData.write_dict_to_json(xlsx_data_dict=xlsxData.xlsx_data_dict, output_dir="/Users/harrijh/projects/gen3-data-validator/data/restricted/lipid_metadata_example")

## Creating Resolver Instance
- This class reads in the Gen3 schema JSON file, then resolves the schema for use in the other classes.
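
As a rough illustration of what resolving a schema involves, the sketch below inlines `$ref` pointers in a toy schema bundle. It is not the `ResolveSchema` implementation: the `resolve_refs` helper, the toy `bundle`, and the `file.yaml#/path` reference style are assumptions made for this example, and circular references are not handled.

```python
def resolve_refs(node, bundle, current_file):
    """Return a copy of `node` with every {"$ref": ...} replaced by its target."""
    if isinstance(node, list):
        return [resolve_refs(item, bundle, current_file) for item in node]
    if not isinstance(node, dict):
        return node
    ref = node.get("$ref")
    if isinstance(ref, str):
        # "file.yaml#/a/b" points at bundle["file.yaml"]["a"]["b"];
        # "#/a/b" stays inside the current file.
        file_name, _, pointer = ref.partition("#")
        target_file = file_name or current_file
        target = bundle[target_file]
        for key in pointer.strip("/").split("/"):
            if key:
                target = target[key]
        # Simplification: keys sitting next to "$ref" are dropped.
        return resolve_refs(target, bundle, target_file)
    return {key: resolve_refs(value, bundle, current_file) for key, value in node.items()}


# Hypothetical bundle keyed by file name (an assumption of this sketch)
bundle = {
    "_definitions.yaml": {"submitter_id": {"type": "string"}},
    "sample.yaml": {
        "id": "sample",
        "properties": {
            "submitter_id": {"$ref": "_definitions.yaml#/submitter_id"},
            "freeze_thaw_cycles": {"type": "integer"},
        },
    },
}

resolved_sample = resolve_refs(bundle["sample.yaml"], bundle, "sample.yaml")
print(resolved_sample["properties"]["submitter_id"])
```

The function builds new dictionaries and lists as it walks the schema, so the original bundle keeps its `$ref` entries.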


In [3]:
Resolver = gen3_data_validator.ResolveSchema(schema_path = "../tests/schema/gen3_test_schema.json")
Resolver.resolve_schema()

In [4]:
# You can check the graph nodes in the resolved schema with:
Resolver.nodes

['demographic.yaml',
 'project.yaml',
 'serum_marker_assay.yaml',
 'alignment_workflow.yaml',
 'imaging_file.yaml',
 'lipidomics_assay.yaml',
 'metabolomics_file.yaml',
 'acknowledgement.yaml',
 'medical_history.yaml',
 '_definitions.yaml',
 '_settings.yaml',
 'blood_pressure_test.yaml',
 'genomics_assay.yaml',
 'variant_file.yaml',
 'program.yaml',
 'serum_marker_file.yaml',
 'proteomics_assay.yaml',
 'sample.yaml',
 'unaligned_reads_file.yaml',
 '_terms.yaml',
 'aligned_reads_index_file.yaml',
 'variant_workflow.yaml',
 'proteomics_file.yaml',
 'exposure.yaml',
 'metabolomics_assay.yaml',
 'lipidomics_mapping_file.yaml',
 'lipidomics_file.yaml',
 'aligned_reads_file.yaml',
 'lab_result.yaml',
 'medication.yaml',
 'publication.yaml',
 'subject.yaml',
 'core_metadata_collection.yaml']

## Parsing data
- The `ParseData` class takes the path to a data folder containing one json file per data node.


In [5]:
# Testing linkage for test data that passes
Data = gen3_data_validator.ParseData(data_folder_path = "../tests/data/pass")

To list the files read into the Data instance, you can use the following code:

In [6]:
Data.file_path_list

['/Users/harrijh/projects/gen3-data-validator/tests/data/pass/metabolomics_file.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/pass/medical_history.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/pass/metabolomics_assay.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/pass/sample.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/pass/subject.json']

All of the data read in is stored in `Data.data_dict` as a dictionary, where each key is an entity and each value is a list of json objects.

In [7]:
Data.data_dict

{'metabolomics_file': [{'alternate_timepoint': '1a914a1577',
   'baseline_timepoint': True,
   'cv': 56.94475432813319,
   'data_category': 'mass spec analysed',
   'data_format': 'wiff',
   'data_type': 'MS/MS',
   'file_format': 'e387cadce7',
   'file_name': 'dummy_metab',
   'file_size': 87,
   'ga4gh_drs_uri': '150bf4b457',
   'md5sum': '756c381b71c2a7d346c72998ab334c00',
   'metabolomic_unit': 'pmol/mL',
   'metabolomics_assays': {'submitter_id': 'metabolomics_assay_356580ff6d'},
   'submitter_id': 'metabolomics_file_547f3d4417',
   'type': 'metabolomics_file',
   'metabolomics_files': 'metabolomics_file_547f3d4417'},
  {'alternate_timepoint': '578a14ee53',
   'baseline_timepoint': True,
   'cv': 43.00152620641602,
   'data_category': 'mass spec analysed',
   'data_format': 'wiff',
   'data_type': 'MS/MS',
   'file_format': '47a60862ef',
   'file_name': 'dummy_metab',
   'file_size': 0,
   'ga4gh_drs_uri': '2beb8c16ea',
   'md5sum': '43640335849622369f4843b817c1da2e',
   'metabolo
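
The `data_dict` layout described above (entity name mapped to a list of records) can be reproduced with the standard library alone. The sketch below is only an approximation of what `ParseData` does: the `load_json_folder` helper and the temporary files are hypothetical, and it assumes each file is named `<entity>.json` and holds a list of records.

```python
import json
import tempfile
from pathlib import Path


def load_json_folder(folder):
    """Map each file stem (taken as the entity name) to the records in that file."""
    data = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as handle:
            data[path.stem] = json.load(handle)
    return data


# Build a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "subject.json").write_text(json.dumps([{"submitter_id": "subject_1"}]))
    Path(tmp, "sample.json").write_text(
        json.dumps([{"submitter_id": "sample_1", "subjects": {"submitter_id": "subject_1"}}])
    )
    data_dict = load_json_folder(tmp)

print(list(data_dict))  # entity names, one per file
```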

The default link suffix is 's'.
- This link suffix can be changed depending on what the key name for the linked information is.

In [8]:
Data.link_suffix

's'

For example, in the json object below, the key "subjects" describes the link from sample to subject, since the value of 'subjects' is an object containing the key "submitter_id".
- Furthermore, the backref is called 'subjects' while the linked entity is called 'subject'.
- Therefore, the link suffix is 's'.

In [9]:
Data.data_dict["sample"][0]

{'alternate_timepoint': '1f56770b0b',
 'baseline_timepoint': True,
 'freeze_thaw_cycles': 10,
 'sample_collection_method': '2fddbe7d09',
 'sample_id': 'd4f31f7bb6',
 'sample_in_preservation': 'snap Frozen',
 'sample_in_storage': 'yes',
 'sample_provider': 'USYD',
 'sample_source': 'UBERON:3781554',
 'sample_storage_method': 'not stored',
 'sample_type': '59a8fd8005',
 'storage_location': 'UMELB',
 'subjects': {'submitter_id': 'subject_e5616257f8'},
 'submitter_id': 'sample_efdbe56d20',
 'type': 'sample',
 'samples': 'sample_efdbe56d20'}
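
The naming convention can be sketched as follows: append the link suffix to each entity name and keep the keys whose value is an object carrying a `submitter_id`. The `find_links` helper is hypothetical and only mirrors the convention seen in the record above; it is not a function of the library.

```python
def find_links(record, entity_names, link_suffix="s"):
    """Return {link_key: parent_entity} for each link key found in `record`."""
    links = {}
    for entity in entity_names:
        link_key = f"{entity}{link_suffix}"
        value = record.get(link_key)
        # A link is an object that carries the parent's submitter_id
        if isinstance(value, dict) and "submitter_id" in value:
            links[link_key] = entity
    return links


sample_record = {
    "submitter_id": "sample_efdbe56d20",
    "type": "sample",
    "subjects": {"submitter_id": "subject_e5616257f8"},
    "samples": "sample_efdbe56d20",
}

links = find_links(sample_record, ["sample", "subject"])
print(links)
```

The `samples` key is skipped because its value is a plain string (the record's own identifier) rather than a link object.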

Finally, you can also check what the detected entities are below:

In [10]:
Data.data_nodes

['metabolomics_file',
 'medical_history',
 'metabolomics_assay',
 'sample',
 'subject']

## Testing Linkage

The first thing you should do is create a linkage configuration map. The `.generate_config` method does this for you: it reads in the data (stored in the `data_dict` attribute) and returns a linkage configuration map.

The linkage configuration map is a dictionary that maps each entity to a dictionary of its primary and foreign keys, with the format:

```
{
    "entity_name": {
        "primary_key": "primary_key_field",
        "foreign_key": "foreign_key_field"
    }
}
```

You can also define the linkage configuration map yourself, but you need to make sure that the primary and foreign keys are defined for each entity.
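
A hand-written map can be sanity-checked before use. The sketch below is a guess at the kind of checks that make sense, not the library's own validation: `check_config` is a hypothetical helper, and it assumes that the entity with no parent (here `subject`) has `foreign_key` set to `None` and that every other foreign key names some entity's primary key.

```python
def check_config(config, root_node):
    """Return a list of problems found in a linkage configuration map."""
    problems = []
    primary_keys = {
        entry["primary_key"] for entry in config.values() if "primary_key" in entry
    }
    for entity, entry in config.items():
        if "primary_key" not in entry or "foreign_key" not in entry:
            problems.append(f"{entity}: needs both 'primary_key' and 'foreign_key'")
            continue
        foreign_key = entry["foreign_key"]
        if entity == root_node:
            if foreign_key is not None:
                problems.append(f"{entity}: root node should have no foreign key")
        elif foreign_key not in primary_keys:
            problems.append(f"{entity}: foreign key {foreign_key!r} matches no primary key")
    return problems


manual_config = {
    "subject": {"primary_key": "subjects", "foreign_key": None},
    "sample": {"primary_key": "samples", "foreign_key": "subjects"},
}
broken_config = {
    "subject": {"primary_key": "subjects", "foreign_key": None},
    "sample": {"primary_key": "samples", "foreign_key": "patients"},
}

print(check_config(manual_config, root_node="subject"))
print(check_config(broken_config, root_node="subject"))
```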

In [11]:
import gen3_data_validator
DataPass = gen3_data_validator.ParseData(data_folder_path = "../tests/data/pass")
LinkagePass = gen3_data_validator.TestLinkage()
link_pass_config = LinkagePass.generate_config(DataPass.data_dict)
link_pass_config

{'metabolomics_file': {'primary_key': 'metabolomics_files',
  'foreign_key': 'metabolomics_assays'},
 'medical_history': {'primary_key': 'medical_historys',
  'foreign_key': 'subjects'},
 'metabolomics_assay': {'primary_key': 'metabolomics_assays',
  'foreign_key': 'samples'},
 'sample': {'primary_key': 'samples', 'foreign_key': 'subjects'},
 'subject': {'primary_key': 'subjects', 'foreign_key': None}}

Once you have the linkage configuration map, you can validate the links. The `.validate_links` method does this for you: it reads in the data and the linkage configuration map, then returns a dictionary of the linkage validation results.

As a reminder, the data passed to the `.validate_links` method as the `data_map` argument has the format:

```
{
    "entity_name_1": [
        {
            "field_name": "field_value"
        },
        {
            "field_name": "field_value"
        }
    ],
    "entity_name_2": [
        {
            "field_name": "field_value"
        },
        {
            "field_name": "field_value"
        }
    ]
}
```
Where `entity_name_1` and `entity_name_2` are the names of the entities in the data, and each value is a list of json objects, each representing a record in the entity.
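
To make the idea of link validation concrete, here is a minimal sketch of one way to check foreign keys against primary keys. It is not the `validate_links` implementation: `find_invalid_links` is a hypothetical helper, and it assumes that primary key values are plain strings while foreign key values are objects holding a `submitter_id`.

```python
def find_invalid_links(data_map, config, root_node):
    """Return {entity: [foreign key values with no matching primary key value]}."""
    # Collect the values stored under each primary key name
    primary_values = {}
    for entity, keys in config.items():
        primary_key = keys["primary_key"]
        primary_values[primary_key] = {
            record[primary_key]
            for record in data_map.get(entity, [])
            if primary_key in record
        }

    invalid = {}
    for entity, keys in config.items():
        invalid[entity] = []
        foreign_key = keys["foreign_key"]
        if entity == root_node or foreign_key is None:
            continue  # a root node has no upstream link to check
        known = primary_values.get(foreign_key, set())
        for record in data_map.get(entity, []):
            link = record.get(foreign_key)
            value = link.get("submitter_id") if isinstance(link, dict) else link
            if value not in known:
                invalid[entity].append(value)  # a missing link is reported as None
    return invalid


data_map = {
    "subject": [{"submitter_id": "subject_1", "subjects": "subject_1"}],
    "sample": [
        {"samples": "sample_1", "subjects": {"submitter_id": "subject_1"}},
        {"samples": "sample_2", "subjects": {"submitter_id": "subject_404"}},
    ],
}
config = {
    "subject": {"primary_key": "subjects", "foreign_key": None},
    "sample": {"primary_key": "samples", "foreign_key": "subjects"},
}

invalid_links = find_invalid_links(data_map, config, root_node="subject")
print(invalid_links)
```

Every entity gets an entry in the result, so an empty list means all of its links resolved.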

In [12]:
import gen3_data_validator
DataPass = gen3_data_validator.ParseData(data_folder_path = "../tests/data/pass")
LinkagePass = gen3_data_validator.TestLinkage()
link_pass_config = LinkagePass.generate_config(DataPass.data_dict)
LinkagePass.validate_links(data_map = DataPass.data_dict, config = link_pass_config, root_node = 'subject')

=== Validating Config Map ===
Root Node = subject
Config Map Validated
=== Validating Links ===
Entity 'metabolomics_file' has 0 invalid foreign keys: []
Entity 'medical_history' has 0 invalid foreign keys: []
Entity 'metabolomics_assay' has 0 invalid foreign keys: []
Entity 'sample' has 0 invalid foreign keys: []
Entity 'subject' has 0 invalid foreign keys: []


{'metabolomics_file': [],
 'medical_history': [],
 'metabolomics_assay': [],
 'sample': [],
 'subject': []}

Testing linkage for test data that fails:
- Note that the `root_node` argument tells the `validate_links` method which entity is the root node and therefore will not have any upstream links.

In [13]:
DataFail = gen3_data_validator.ParseData(data_folder_path = "../tests/data/fail")
LinkageFail = gen3_data_validator.TestLinkage()
link_fail_config = LinkageFail.generate_config(DataFail.data_dict)
LinkageFail.validate_links(data_map = DataFail.data_dict, config = link_fail_config, root_node = 'subject')

=== Validating Config Map ===
Root Node = subject
Config Map Validated
=== Validating Links ===
Entity 'metabolomics_file' has 1099 invalid foreign keys: ['metabolomics_assay_356580ff6d', 'metabolomics_assay_44f829fa47', 'metabolomics_assay_974d137216', 'metabolomics_assay_3d1f400b27', 'metabolomics_assay_d1cd2f492c', 'metabolomics_assay_c025b20da0', 'metabolomics_assay_439725d38f', 'metabolomics_assay_d0350804b1', 'metabolomics_assay_63cef60fa4', 'metabolomics_assay_78465fe5b1', 'metabolomics_assay_3754fe418d', 'metabolomics_assay_0cdd244c6e', 'metabolomics_assay_40a94ece37', 'metabolomics_assay_adc5f88af9', 'metabolomics_assay_b646004109', 'metabolomics_assay_69bcc995f0', 'metabolomics_assay_37be2a2136', 'metabolomics_assay_119df8af52', 'metabolomics_assay_8dedaeccc1', 'metabolomics_assay_b353a7f9b8', 'metabolomics_assay_ebe904af55', 'metabolomics_assay_5bed5ab90c', 'metabolomics_assay_1417aac36c', 'metabolomics_assay_4e133f8d44', 'metabolomics_assay_38d2765ae0', 'metabolomics_assay_

{'metabolomics_file': ['metabolomics_assay_356580ff6d',
  'metabolomics_assay_44f829fa47',
  'metabolomics_assay_974d137216',
  'metabolomics_assay_3d1f400b27',
  'metabolomics_assay_d1cd2f492c',
  'metabolomics_assay_c025b20da0',
  'metabolomics_assay_439725d38f',
  'metabolomics_assay_d0350804b1',
  'metabolomics_assay_63cef60fa4',
  'metabolomics_assay_78465fe5b1',
  'metabolomics_assay_3754fe418d',
  'metabolomics_assay_0cdd244c6e',
  'metabolomics_assay_40a94ece37',
  'metabolomics_assay_adc5f88af9',
  'metabolomics_assay_b646004109',
  'metabolomics_assay_69bcc995f0',
  'metabolomics_assay_37be2a2136',
  'metabolomics_assay_119df8af52',
  'metabolomics_assay_8dedaeccc1',
  'metabolomics_assay_b353a7f9b8',
  'metabolomics_assay_ebe904af55',
  'metabolomics_assay_5bed5ab90c',
  'metabolomics_assay_1417aac36c',
  'metabolomics_assay_4e133f8d44',
  'metabolomics_assay_38d2765ae0',
  'metabolomics_assay_c096f97680',
  'metabolomics_assay_e345b3e502',
  'metabolomics_assay_064911ed4c',

You can check the json files read into the DataFail instance

In [14]:
DataFail.file_path_list

['/Users/harrijh/projects/gen3-data-validator/tests/data/fail/metabolomics_file.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/fail/medical_history.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/fail/metabolomics_assay.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/fail/sample.json',
 '/Users/harrijh/projects/gen3-data-validator/tests/data/fail/subject.json']

The `link_validation_results` attribute holds all of the foreign keys that are not linked to a primary key:

In [15]:
LinkageFail.link_validation_results

{'metabolomics_file': ['metabolomics_assay_356580ff6d',
  'metabolomics_assay_44f829fa47',
  'metabolomics_assay_974d137216',
  'metabolomics_assay_3d1f400b27',
  'metabolomics_assay_d1cd2f492c',
  'metabolomics_assay_c025b20da0',
  'metabolomics_assay_439725d38f',
  'metabolomics_assay_d0350804b1',
  'metabolomics_assay_63cef60fa4',
  'metabolomics_assay_78465fe5b1',
  'metabolomics_assay_3754fe418d',
  'metabolomics_assay_0cdd244c6e',
  'metabolomics_assay_40a94ece37',
  'metabolomics_assay_adc5f88af9',
  'metabolomics_assay_b646004109',
  'metabolomics_assay_69bcc995f0',
  'metabolomics_assay_37be2a2136',
  'metabolomics_assay_119df8af52',
  'metabolomics_assay_8dedaeccc1',
  'metabolomics_assay_b353a7f9b8',
  'metabolomics_assay_ebe904af55',
  'metabolomics_assay_5bed5ab90c',
  'metabolomics_assay_1417aac36c',
  'metabolomics_assay_4e133f8d44',
  'metabolomics_assay_38d2765ae0',
  'metabolomics_assay_c096f97680',
  'metabolomics_assay_e345b3e502',
  'metabolomics_assay_064911ed4c',

# Data Validation
- Validating json data objects against the Gen3 JSON schema
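
To show the kind of checks schema validation performs, the sketch below tests records against `type` and `enum` rules using plain Python. It covers only a tiny part of JSON Schema and is not how `Validate` works internally; `check_record`, `make_result`, and the toy schema are hypothetical, and the result fields are a subset of those in the validation records shown later in this notebook.

```python
# Python types accepted for a few JSON Schema "type" keywords
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def make_result(index, key=None, validator=None, validator_value=None):
    """Build one result record; a record without an invalid key is a PASS."""
    return {
        "index": index,
        "validation_result": "FAIL" if key else "PASS",
        "invalid_key": key,
        "validator": validator,
        "validator_value": validator_value,
    }


def check_record(record, schema, index):
    """Return one FAIL result per broken rule, or a single PASS result."""
    results = []
    for key, rules in schema.get("properties", {}).items():
        if key not in record:
            continue
        value = record[key]
        expected = rules.get("type")
        if isinstance(expected, str) and expected in JSON_TYPES:
            type_ok = isinstance(value, JSON_TYPES[expected])
            # bool is a subclass of int in Python but is not a JSON integer or number
            if expected != "boolean" and isinstance(value, bool):
                type_ok = False
            if not type_ok:
                results.append(make_result(index, key, "type", expected))
        if "enum" in rules and value not in rules["enum"]:
            results.append(make_result(index, key, "enum", rules["enum"]))
    return results or [make_result(index)]


schema = {
    "properties": {
        "freeze_thaw_cycles": {"type": "integer"},
        "sample_provider": {"enum": ["Baker", "USYD", "UMELB", "UQ"]},
    }
}
records = [
    {"freeze_thaw_cycles": "10", "sample_provider": 45},
    {"freeze_thaw_cycles": 10, "sample_provider": "USYD"},
]

results = [check_record(record, schema, index) for index, record in enumerate(records)]
print(results)
```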


## Creating the validation class

In [None]:
import gen3_data_validator

resolver = gen3_data_validator.ResolveSchema(schema_path = "../tests/schema/gen3_test_schema.json")
resolver.resolve_schema()
data = gen3_data_validator.ParseData(data_folder_path = "../tests/data/fail")
validator = gen3_data_validator.Validate(data_map=data.data_dict, resolved_schema=resolver.schema_resolved)
validator.validate_schema()

In [None]:
validator.pull_index_of_entity(entity="sample", index_key=0, result_type="ALL")

In [None]:
validator.make_keymap()

In [None]:
data.data_dict

### Getting nested validation results
- Returns a nested dictionary keyed by entity/data node, then by row/index number, and finally the validation objects.
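
Walking such a nested result might look like the sketch below. The exact shape is an assumption: it follows the test fixture at the end of this notebook (entity, then a list of rows, then a list of result records), which may differ from what `validation_result` actually returns. The `failed_rows` helper is hypothetical.

```python
def failed_rows(validation_results, entity):
    """Return {row_index: [invalid keys]} for rows of `entity` with at least one FAIL."""
    failures = {}
    for row_results in validation_results.get(entity, []):
        for result in row_results:
            if result["validation_result"] == "FAIL":
                failures.setdefault(result["index"], []).append(result["invalid_key"])
    return failures


validation_results = {
    "sample": [
        [
            {"index": 0, "validation_result": "FAIL", "invalid_key": "freeze_thaw_cycles"},
            {"index": 0, "validation_result": "FAIL", "invalid_key": "sample_provider"},
        ],
        [{"index": 1, "validation_result": "PASS", "invalid_key": None}],
    ]
}

sample_failures = failed_rows(validation_results, "sample")
print(sample_failures)
```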

In [None]:
validation_dict = validator.validation_result
validation_dict

In [None]:
validator.list_entities()

In [None]:
validator.list_index_by_entity("sample")

You can pull out the validation results for a specific entity with:

In [None]:
validator.pull_entity("sample")

You can pull validation results for a specific entity and then a specific index / row

In [None]:
validator.pull_index_of_entity("sample", 0)

# Getting validation stats

In [None]:
import gen3_data_validator
from gen3_data_validator.logging_config import setup_logging
setup_logging(level="DEBUG")

resolver = gen3_data_validator.ResolveSchema(schema_path = "../tests/schema/gen3_test_schema.json")
resolver.resolve_schema()
data = gen3_data_validator.ParseData(data_folder_path = "../tests/data/fail")
validator = gen3_data_validator.Validate(data_map=data.data_dict, resolved_schema=resolver.schema_resolved)
validator.validate_schema()
validate_stats = gen3_data_validator.ValidateStats(validator)

In [None]:
validate_stats.pull_entity(entity="sample", result_type="FAIL")

In [None]:
validate_stats.summary_stats()


In [None]:
validator.validation_result

In [None]:
import gen3_data_validator
from gen3_data_validator.logging_config import setup_logging
setup_logging(level="DEBUG")

resolver = gen3_data_validator.ResolveSchema(schema_path = "../tests/schema/gen3_test_schema.json")
resolver.resolve_schema()
data = gen3_data_validator.ParseData(data_folder_path = "../tests/data/fail")
validator = gen3_data_validator.Validate(data_map=data.data_dict, resolved_schema=resolver.schema_resolved)
validator.validate_schema()
validate_stats = gen3_data_validator.ValidateStats(validator)
val_summary = gen3_data_validator.ValidateSummary(validator)

In [None]:
val_summary.validation_result

In [None]:

stats_df = validate_stats.summary_stats()
stats_df

In [None]:
validate_stats.flatten_validation_results()

# Creating validation summary data

In [None]:
Summary = gen3_data_validator.ValidateSummary(validator) 
flattened_results_dict = Summary.flatten_validation_results()
flattened_results_dict[0]

In [None]:
Summary.collapse_flatten_results_to_pd()

In [None]:
d = {'row': 0,
 'entity': 'metabolomics_file',
 'guid': '569d16ce-9731-4ea0-8116-72e5e3d765e5',
 'index': 0,
 'validation_result': 'FAIL',
 'invalid_key': 'data_format',
 'schema_path': 'properties.data_format.enum',
 'validator': 'enum',
 'validator_value': ['wiff'],
 'validation_error': "True is not one of ['wiff']"}

del d['guid']
d

### Converting flattened dict to pandas

In [None]:
flatten_summary_pd = Summary.flattened_results_to_pd()
flatten_summary_pd.iloc[14, :]

### Collapsing flattened dict to pandas
- This collapsed data frame summarises common validation errors
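
The idea of collapsing can be illustrated without pandas by counting flattened records that share the same error. Which columns `collapse_flatten_results_to_pd` actually groups by is not shown here, so grouping on `entity`, `invalid_key` and `validator` is an assumption, and `collapse_results` is a hypothetical helper.

```python
from collections import Counter


def collapse_results(flat_results):
    """Count FAIL records that share entity, invalid key and validator."""
    counts = Counter(
        (result["entity"], result["invalid_key"], result["validator"])
        for result in flat_results
        if result["validation_result"] == "FAIL"
    )
    return [
        {"entity": entity, "invalid_key": key, "validator": validator, "count": count}
        for (entity, key, validator), count in counts.most_common()
    ]


flat_results = [
    {"entity": "sample", "index": 0, "validation_result": "FAIL",
     "invalid_key": "freeze_thaw_cycles", "validator": "type"},
    {"entity": "sample", "index": 0, "validation_result": "FAIL",
     "invalid_key": "sample_provider", "validator": "enum"},
    {"entity": "sample", "index": 1, "validation_result": "FAIL",
     "invalid_key": "freeze_thaw_cycles", "validator": "type"},
    {"entity": "sample", "index": 2, "validation_result": "PASS",
     "invalid_key": None, "validator": None},
]

collapsed = collapse_results(flat_results)
print(collapsed)
```

Sorting by count puts the most frequent error first, which is usually the one worth fixing at the source.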

In [None]:
collapse_df = Summary.collapse_flatten_results_to_pd()
collapse_df

# Writing validation results to folder

In [None]:
import json
import os

output_dir = "../data/restricted/lipid_metadata/validation/"
os.makedirs(output_dir, exist_ok=True)


def write_dict_to_json(input_dict, output_dir, filename: str):
    """Write input_dict to <output_dir>/<filename>.json."""
    output_path = os.path.join(output_dir, f"{filename}.json")
    with open(output_path, "w") as f:
        json.dump(input_dict, f)
    print(f"JSON file written to {output_path}")

write_dict_to_json(validation_dict, output_dir, "validation_dict")
write_dict_to_json(flattened_results_dict, output_dir, "flattened_results_dict")

# Writing pandas df
stats_df.to_csv(f"{output_dir}/stats_df.csv")
flatten_summary_pd.to_csv(f"{output_dir}/flatten_summary_pd.csv")
collapse_df.to_csv(f"{output_dir}/collapse_df.csv")


In [None]:
# Use this for writing tests

sample_validation_results = {
    'sample': [
        [
            {
                'index': 0,
                'validation_result': 'FAIL',
                'invalid_key': 'freeze_thaw_cycles',
                'schema_path': 'properties.freeze_thaw_cycles.type',
                'validator': 'type',
                'validator_value': 'integer',
                'validation_error': "'10' is not of type 'integer'"
            },
            {
                'index': 0,
                'validation_result': 'FAIL',
                'invalid_key': 'sample_provider',
                'schema_path': 'properties.sample_provider.enum',
                'validator': 'enum',
                'validator_value': ['Baker', 'USYD', 'UMELB', 'UQ'],
                'validation_error': "45 is not one of ['Baker', 'USYD', 'UMELB', 'UQ']"
            },
            {
                'index': 0,
                'validation_result': 'FAIL',
                'invalid_key': 'sample_storage_method',
                'schema_path': 'properties.sample_storage_method.enum',
                'validator': 'enum',
                'validator_value': [
                    'not stored',
                    'ambient temperature',
                    'cut slide',
                    'fresh',
                    'frozen, -70C freezer',
                    'frozen, -150C freezer',
                    'frozen, liquid nitrogen',
                    'frozen, vapor phase',
                    'paraffin block',
                    'RNAlater, frozen',
                    'TRIzol, frozen'
                ],
                'validation_error': "'Autoclave' is not one of ['not stored', 'ambient temperature', 'cut slide', 'fresh', 'frozen, -70C freezer', 'frozen, -150C freezer', 'frozen, liquid nitrogen', 'frozen, vapor phase', 'paraffin block', 'RNAlater, frozen', 'TRIzol, frozen']"
            }
        ],
        [
            {
                'index': 1,
                'validation_result': 'FAIL',
                'invalid_key': 'freeze_thaw_cycles',
                'schema_path': 'properties.freeze_thaw_cycles.type',
                'validator': 'type',
                'validator_value': 'integer',
                'validation_error': "'76' is not of type 'integer'"
            },
            {
                'index': 1,
                'validation_result': 'FAIL',
                'invalid_key': 'sample_storage_method',
                'schema_path': 'properties.sample_storage_method.enum',
                'validator': 'enum',
                'validator_value': [
                    'not stored',
                    'ambient temperature',
                    'cut slide',
                    'fresh',
                    'frozen, -70C freezer',
                    'frozen, -150C freezer',
                    'frozen, liquid nitrogen',
                    'frozen, vapor phase',
                    'paraffin block',
                    'RNAlater, frozen',
                    'TRIzol, frozen'
                ],
                'validation_error': "'In the Pantry' is not one of ['not stored', 'ambient temperature', 'cut slide', 'fresh', 'frozen, -70C freezer', 'frozen, -150C freezer', 'frozen, liquid nitrogen', 'frozen, vapor phase', 'paraffin block', 'RNAlater, frozen', 'TRIzol, frozen']"
            }
        ],
        [
            {
                'index': 2,
                'validation_result': 'PASS',
                'invalid_key': None,
                'schema_path': None,
                'validator': None,
                'validator_value': None,
                'validation_error': None
            }
        ],
        [
            {
                'index': 3,
                'validation_result': 'PASS',
                'invalid_key': None,
                'schema_path': None,
                'validator': None,
                'validator_value': None,
                'validation_error': None
            }
        ]
    ]
}


