# Metadata Vignette

## Attribute differences between a submission and the latest schema

### By Matthew Green


As the metadata schema evolves, projects in the datastore can fall out of date with the current schema. This vignette script takes a project uuid and looks for differences between its attributes as submitted and those in the latest schema. It also suggests possible swaps using a simple lexical similarity lookup. This can be a useful tool for wranglers when they are updating a dataset.


**Pseudocode**

1. User provides project uuid
2. Get latest attributes
3. Get attributes from the provided project (bundle-wise search)
4. Diff
5. Suggest lexically similar replacement.
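
Steps 4 and 5 can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's own code: the two attribute lists are made up for the example and `suggest_replacements` is a hypothetical helper. It relies on the standard-library `difflib.get_close_matches`, which returns an empty list when no candidate scores above its `cutoff` (0.6 by default), so the sketch guards against that case instead of indexing the result directly.

```python
import difflib

# Toy stand-ins for the real attribute lists (assumptions for illustration only).
latest_attributes = [
    'sequencing_protocol.method.text',
    'cell_suspension.estimated_cell_count',
    'project.insdc_project_accessions',
]
project_attributes = [
    'sequencing_protocol.method.text',
    'project.insdc_project',
    'legacy.field',
]


def suggest_replacements(project_attrs, latest_attrs):
    """Hypothetical helper: diff two attribute collections and suggest swaps."""
    # Step 4: attributes used by the project that the latest schema lacks.
    out_of_date = sorted(set(project_attrs) - set(latest_attrs))
    suggestions = {}
    for attribute in out_of_date:
        # Step 5: closest lexical match, or None if nothing is similar enough.
        close = difflib.get_close_matches(attribute, latest_attrs, n=1)
        suggestions[attribute] = close[0] if close else None
    return suggestions


suggestions = suggest_replacements(project_attributes, latest_attributes)
for old, new in suggestions.items():
    print(old, '->', new)
```

In this toy run `project.insdc_project` is paired with `project.insdc_project_accessions`, while `legacy.field` gets no suggestion because it is too short to reach the default cutoff against any candidate.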

In [1]:
'''
Original cBeta Submissions

treutlein: e8642221-4c2c-4fd7-b926-a68bce363c88
meyer: 5dfe932f-159d-4cab-8039-d32f22ffbbc2
peer: 29f53b7e-071b-44b5-998a-0ae70d0229a4
neuron_diff: f8880be0-210c-4aa3-9348-f5a423e07421
fetal-maternal-interface: aabbec1a-1215-43e1-8e42-6489af25c12c
Regev-ICA: 179bf9e6-5b33-4c5b-ae26-96c7270976b8 (IndexError: list index out of range)
EMTAB5061: 1a0f98b8-746a-489d-8af9-d5c657482aab
Teichmann-mouse-melanoma: f396fa53-2a2d-4b8a-ad18-03bf4bd46833
EGEOD106540: 0ec2b05f-ddbe-4e5a-b30f-e81f4b1e330c
tabulamuris: 32eb86db-6842-480f-a49a-a2b0161ed35a (NO BUNDLES)
ido_amit: 0c7bbbce-3c70-4d6b-a443-1b92c1f205c8 (HTTP 502)
humphreys: 34ec62a2-9643-430d-b41a-1e342bd615fc
basu: c765e3f9-7cfc-4501-8832-79e5f7abd321
10x-mouse-brain: ff481f29-3d0b-4533-9de2-a760c61c162d (IndexError: list index out of range)
rsatija: 5f256182-5dfc-4070-8404-f6fa71d37c73

Provide a uuid to analyse:
'''
#########################
# EDIT PROJECT UUID HERE
#########################

query_uuid = '5f256182-5dfc-4070-8404-f6fa71d37c73'

#####################################################

In [2]:
# get latest attributes from schema

from ingest.template.schema_template import SchemaTemplate
import yaml

template = SchemaTemplate()
attribute_yaml = template.yaml_dump()

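# safe_load avoids the unsafe default loader (yaml.load without a Loader fails on PyYAML >= 6)
attribute_list = yaml.safe_load(attribute_yaml).get('tabs')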
full_set = set()
full_list = []
for tab in attribute_list:
    for x in list(tab.values()):
        for attribute in x.get('columns'):
            full_set.add(attribute)
            full_list.append(attribute)
print('{} out of {} attributes are unique'.format(len(full_set), len(full_list)))
print(full_list)

590 out of 590 attributes are unique
['sequencing_protocol.protocol_core.protocol_id', 'sequencing_protocol.protocol_core.protocol_name', 'sequencing_protocol.protocol_core.protocol_description', 'sequencing_protocol.protocol_core.publication_doi', 'sequencing_protocol.protocol_core.protocols_io_doi', 'sequencing_protocol.protocol_core.document', 'sequencing_protocol.instrument_manufacturer_model.text', 'sequencing_protocol.instrument_manufacturer_model.ontology', 'sequencing_protocol.instrument_manufacturer_model.ontology_label', 'sequencing_protocol.local_machine_name', 'sequencing_protocol.paired_end', 'sequencing_protocol.method.text', 'sequencing_protocol.method.ontology', 'sequencing_protocol.method.ontology_label', 'sequencing_protocol.10x.fastq_method', 'sequencing_protocol.10x.fastq_method_version', 'sequencing_protocol.10x.pooled_channels', 'sequencing_protocol.10x.drop_uniformity', 'library_preparation_protocol.protocol_core.protocol_id', 'library_preparation_protocol.protoc

In [3]:
import hca.dss
from hca.dss import DSSClient
import sys
import json
from tqdm import tqdm_notebook as tqdm

project_q = {
    "query": {
        "bool": {
            "must": [
                {
                    "terms": {
                        "files.project_json.provenance.document_id": [
                            "PROJECT_UUID"
                        ]
                    }
                }
            ]
        }
    }
}

# serialise with json.dumps so quoting is always valid JSON, then substitute the uuid
q = json.loads(json.dumps(project_q).replace('PROJECT_UUID', query_uuid))


dss_client = DSSClient(swagger_url= "https://dss.data.humancellatlas.org/v1/swagger.json")
bundle_generator = dss_client.post_search.iterate(replica="aws", es_query=q, output_format="raw")
total_hits = dss_client.post_search(replica="aws", es_query=q, output_format="raw").get('total_hits')

bundle_attributes = []

for bundle in tqdm(bundle_generator, total=total_hits, unit='bundle'):
    bundle_meta = bundle.get('metadata').get('files')
    top_level = list(bundle_meta.keys())

    if len(top_level) == 0:
        print('WARN: Bundle had empty metadata') # check for erroneous bundles
        sys.exit()

    detected_attributes = []

    # certain branches can be ignored when exploring the tree.
    ignore_top_level = ['links_json', 'analysis_process']
    ignore_mid_level = ['schema_type', 'provenance', 'describedBy']

    for x in top_level:
        if x in ignore_top_level: # skip these fields
            continue
        entity_type = x[:-5] # strip '_json'
        for meta_doc in bundle_meta.get(x):
            for mid_level, value in meta_doc.items():
                if mid_level in ignore_mid_level:
                    continue
                # wrap single values so that lists and scalars share one loop
                value_list = value if isinstance(value, list) else [value]
                for item in value_list:
                    if not isinstance(item, dict):
                        # scalar value: the attribute path stops at the second level
                        attribute = '.'.join([entity_type, mid_level])
                        detected_attributes.append(attribute)
                        continue
                    for low_level in item: # iterate over the nested keys
                        if low_level in ignore_mid_level:
                            continue
                        attribute = '.'.join([entity_type, mid_level, low_level])
                        detected_attributes.append(attribute)

    bundle_attributes.append(detected_attributes)
print('{} of {} bundles processed'.format(len(bundle_attributes), total_hits))
project_attribute_list = set([item for sublist in bundle_attributes for item in sublist])



3 of 3 bundles processed


In [4]:
import difflib
import pandas as pd

out_dated_attributes = list(set(project_attribute_list) - set(full_list))
shared_attributes = list(set(project_attribute_list).intersection(full_list))
unused = list(set(full_list) - set(project_attribute_list)) # unused attributes in latest schema

# print('This project did NOT use {} attributes from the latest schemas.'.format(len(unused)))
# print('{} attributes in the project were also found in the latest schemas.'.format(len(shared_attributes)))
# print('{} attributes were present in the project that are not part of the latest schemas.'.format(len(out_dated_attributes)))

print('{} out of {} attributes do not meet latest schema requirements.'.format(len(out_dated_attributes), len(shared_attributes)))
if len(out_dated_attributes) > 0:
    print("This project's metadata is OUT OF DATE")
else:
    print("This project's metadata is UP TO DATE")

# print(out_dated_attributes)
matches = {}
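for ood_attribute in out_dated_attributes:
    # get_close_matches returns an empty list when nothing is similar enough,
    # so avoid indexing [0] directly (this raised IndexError for some projects)
    close = difflib.get_close_matches(ood_attribute, full_list, n=1)
    closest = close[0] if close else '?'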

#     print('Project has: {}'.format(ood_attribute))
#     print('Closest match in latest schema is: {}'.format(closest))

    matches[ood_attribute] = closest
    
df = pd.DataFrame.from_dict(matches, orient='index').reset_index().rename(index=str, columns={"index": "Project attributes", 0: "Closest match in latest schema"})
print('See replacement suggestions here:')
outfile = 'runs/' + query_uuid + '_diff.csv'
df.to_csv(outfile) # save the suggestions as CSV; the runs/ directory must already exist
df



10 out of 74 attributes do not meet latest schema requirements.
This project's metadata is OUT OF DATE
See replacement suggestions here:


|   | Project attributes | Closest match in latest schema |
|---|--------------------|--------------------------------|
| 0 | sequencing_protocol.sequencing_approach.text | sequencing_protocol.method.text |
| 1 | dissociation_protocol.dissociation_method.text | dissociation_protocol.method.text |
| 2 | cell_suspension.total_estimated_cells | cell_suspension.estimated_cell_count |
| 3 | library_preparation_protocol.library_construct... | library_preparation_protocol.library_construct... |
| 4 | library_preparation_protocol.library_construct... | library_preparation_protocol.library_construct... |
| 5 | sequencing_protocol.sequencing_approach.ontology | sequencing_protocol.method.ontology |
| 6 | sequencing_protocol.sequencing_approach.ontolo... | sequencing_protocol.method.ontology_label |
| 7 | dissociation_protocol.dissociation_method.onto... | dissociation_protocol.method.ontology_label |
| 8 | library_preparation_protocol.library_construct... | library_preparation_protocol.library_construct... |
| 9 | dissociation_protocol.dissociation_method.onto... | dissociation_protocol.method.ontology |


TODO
- Needs to catch fourth-level nested attributes, e.g. donor_organism.mouse_specific.strain.text
- analysis_process attributes are leaking through still
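
One possible way to approach the nesting item in this TODO list is to walk each metadata document recursively instead of using a fixed number of nested loops. The following is a minimal sketch and not part of the original script: `flatten_attributes` is a hypothetical helper, the sample document is made up for illustration, and the default ignore list simply reuses the `ignore_mid_level` values from the cell above.

```python
def flatten_attributes(value, prefix,
                       ignore=('schema_type', 'provenance', 'describedBy')):
    """Hypothetical helper: return the set of dotted attribute paths under value."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            if key in ignore:
                continue  # skip bookkeeping fields at any depth
            paths |= flatten_attributes(child, prefix + '.' + key, ignore)
        return paths
    if isinstance(value, list):
        paths = set()
        for item in value:
            # list items share the path of the list itself
            paths |= flatten_attributes(item, prefix, ignore)
        return paths
    # scalar leaf (str, int, bool, float, None): the path ends here
    return {prefix}


# Made-up document for illustration; real bundles come from the DSS search.
sample_doc = {
    'describedBy': 'ignored',
    'biomaterial_core': {'biomaterial_id': 'donor_1'},
    'mouse_specific': {'strain': [{'text': 'C57BL/6', 'ontology': 'made-up-term'}]},
    'is_living': True,
}
attributes = flatten_attributes(sample_doc, 'donor_organism')
print(sorted(attributes))
```

Because the recursion follows whatever depth the document has, the fourth-level path `donor_organism.mouse_specific.strain.text` is picked up alongside the shallower ones. Note that in this sketch empty lists and empty dicts contribute no paths.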

## Results from cBeta

The following terms have been replaced:

| Out of date attribute                                                     | Suggested replacement                                                   | 
|---------------------------------------------------------------------------|-------------------------------------------------------------------------| 
| dissociation_protocol.dissociation_method.ontology                        | dissociation_protocol.method.ontology                                   | 
| dissociation_protocol.dissociation_method.ontology_label                  | dissociation_protocol.method.ontology_label                             | 
| library_preparation_protocol.library_construction_approach.text           | library_preparation_protocol.library_construction_method.text           | 
| sequencing_protocol.sequencing_approach.text                              | sequencing_protocol.method.text                                         | 
| dissociation_protocol.dissociation_method.text                            | dissociation_protocol.method.text                                       | 
| sequencing_protocol.sequencing_approach.ontology                          | sequencing_protocol.method.ontology                                     | 
| specimen_from_organism.organ_part.text                                    | specimen_from_organism.organ_parts.text                                 | 
| specimen_from_organism.organ_part.ontology_label                          | specimen_from_organism.organ_parts.ontology_label                       | 
| library_preparation_protocol.library_construction_approach.ontology_label | library_preparation_protocol.library_construction_method.ontology_label | 
| specimen_from_organism.organ_part.ontology                                | specimen_from_organism.organ_parts.ontology                             | 
| sequencing_protocol.sequencing_approach.ontology_label                    | sequencing_protocol.method.ontology_label                               | 
| library_preparation_protocol.library_construction_approach.ontology       | library_preparation_protocol.library_construction_method.ontology       | 
| sequence_file.insdc_run                                                   | sequence_file.insdc_run_accessions                                      | 
| cell_suspension.total_estimated_cells                                     | cell_suspension.estimated_cell_count                                    | 
| project.insdc_project                                                     | project.insdc_project_accessions                                        | 
| project.geo_series                                                        | project.geo_series_accessions                                           | 
| project.funders.funder_name                                               | ?                                                                       | 
| process.insdc_experiment.insdc_experiment                                 | process.insdc_experiment.insdc_experiment_accession                     | 
| cell_suspension.plate_based_sequencing.cell_quality                       | cell_suspension.plate_based_sequencing.well_quality                     | 
| collection_protocol.collection_method.ontology                            | collection_protocol.method.ontology                                     | 
| project.insdc_study                                                       | project.insdc_study_accessions                                          | 
| collection_protocol.collection_method.text                                | collection_protocol.method.text                                         | 
| collection_protocol.collection_method.ontology_label                      | collection_protocol.method.ontology_label                               | 
| cell_suspension.plate_based_sequencing.plate_id                           | cell_suspension.plate_based_sequencing.plate_label                      | 
| project.array_express_investigation                                       | project.array_express_accessions                                        | 
| cell_suspension.plate_based_sequencing.well_id                            | cell_suspension.plate_based_sequencing.well_quality                     | 
| donor_organism.familial_relationship.child                                | donor_organism.familial_relationships.child                             | 
| donor_organism.familial_relationship.parent                               | donor_organism.familial_relationships.parent                            | 
| dissociation_protocol.protocol_reagents.catalog_number                    | ipsc_induction_protocol.protocol_reagents.catalog_number                | 
| dissociation_protocol.protocol_reagents.manufacturer                      | ipsc_induction_protocol.protocol_reagents.manufacturer                  | 
| dissociation_protocol.protocol_reagents.expiry_date                       | ipsc_induction_protocol.protocol_reagents.expiry_date                   | 
| dissociation_protocol.protocol_reagents.lot_number                        | ipsc_induction_protocol.protocol_reagents.lot_number                    | 
| dissociation_protocol.protocol_reagents.retail_name                       | ipsc_induction_protocol.protocol_reagents.retail_name                   | 
