ASAP CRN Metadata validation

# Team Jakobsson. ASAP CRN Metadata validation
8 Oct 2024 
Andy Henrie




In [32]:
import pandas as pd
from pathlib import Path
import os, sys

sys.path.append(os.path.abspath((os.path.join(os.getcwd(), 'src/crn_utils'))))

from util import read_CDE, NULL, prep_table, read_meta_table
from validate import validate_table, ReportCollector
from update_schema import v1_to_v2, v2_to_v3_PMDBS, create_upload_medadata_package

from checksums import get_md5_hashes, authenticate_with_service_account

%load_ext autoreload
%autoreload 2

root_path = Path.home() / ("Projects/ASAP/data/teams")


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## CDEs
Load the relevant CDEs.

In [33]:
schema_version = "v1"
schema_path = Path.home() / "Projects/ASAP/crn-utils/resource/CDE"
CDEv1 = read_CDE(schema_version, local_path=schema_path)
schema_version = "v2.1"
CDEv2 = read_CDE(schema_version, local_path=schema_path)
schema_version = "v3.0"
CDEv3 = read_CDE(schema_version, local_path=schema_path)

metadata_version: ASAP_CDE_v1
https://docs.google.com/spreadsheets/d/1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc/gviz/tq?tqx=out:csv&sheet=v1
/Users/ergonyc/Projects/ASAP/crn-utils/resource/CDE/ASAP_CDE_v1.csv
read local file
metadata_version: ASAP_CDE_v2.1
https://docs.google.com/spreadsheets/d/1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc/gviz/tq?tqx=out:csv&sheet=v2.1
/Users/ergonyc/Projects/ASAP/crn-utils/resource/CDE/ASAP_CDE_v2.1.csv
read local file
metadata_version: ASAP_CDE_v3.0
https://docs.google.com/spreadsheets/d/1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc/gviz/tq?tqx=out:csv&sheet=v3.0
/Users/ergonyc/Projects/ASAP/crn-utils/resource/CDE/ASAP_CDE_v3.0.csv
read local file


> SANITY CHECK: verify reading from google doc works.

```python
CDEv1_ = read_CDE("v1")
CDEv2_ = read_CDE("v2.1")
CDEv3_ = read_CDE("v3.0")
```
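
The URLs logged by `read_CDE` above follow the Google Sheets `gviz/tq?tqx=out:csv&sheet=<name>` CSV export pattern. As a rough sketch of what reading a sheet directly could look like (the helpers `cde_sheet_url` and `load_cde_sheet` are hypothetical and not part of `crn_utils`; `read_CDE` may well do this differently):

```python
import pandas as pd

# Spreadsheet ID copied from the URLs that read_CDE logged above.
CDE_SPREADSHEET_ID = "1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc"


def cde_sheet_url(sheet: str, spreadsheet_id: str = CDE_SPREADSHEET_ID) -> str:
    """Build the CSV-export URL for one sheet (hypothetical helper)."""
    return (
        f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={sheet}"
    )


def load_cde_sheet(sheet: str) -> pd.DataFrame:
    """Fetch one CDE sheet over the network (hypothetical helper)."""
    return pd.read_csv(cde_sheet_url(sheet))
```

Calling `load_cde_sheet("v3.0")` needs network access and a sheet that is readable by link, so the local CSV copies remain the safer default.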


## Load original tables 

These were submitted as v2.1

### Starting with v2.1 table

In [34]:
## convert 
team = "jakobsson"
dataset_name = "sn-rnaseq"

metadata_path = root_path / f"{team}/{dataset_name}/metadata/"
og_path = metadata_path / "og"

metadata_version = "v2.1"
METADATA_VERSION_DATE = f"{metadata_version}_{pd.Timestamp.now().strftime('%Y%m%d')}"

In [35]:
CDE = CDEv2
tables = CDE['Table'].unique()

dfs = {}
for table in tables:
    df = read_meta_table(og_path / f"{table}.csv")
    schema = CDE[CDE['Table'] == table]

    report = ReportCollector(destination="NA")
    full_table, report = validate_table(df.copy(), table, schema, report)
    report.print_log()
    dfs[table] = full_table
    # df.to_csv(v1_path / f"{table}.csv", index=False)

recoding number_of_brain_samples as int
All required fields are present in *STUDY* table.
🚨⚠️❗ **Missing Optional Fields in STUDY: alternate_dataset_id**
🚨⚠️❗ **6 Fields with empty (NULL) values:**

	- other_funding_source: 1/1 empty rows (REQUIRED)

	- publication_DOI: 1/1 empty rows (REQUIRED)

	- publication_PMID: 1/1 empty rows (REQUIRED)

	- PI_ORCHID: 1/1 empty rows (OPTIONAL)

	- PI_google_scholar_id: 1/1 empty rows (OPTIONAL)

	- preprocessing_references: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *PROTOCOL* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- sample_collection_summary: 1/1 empty rows (REQUIRED)
No invalid entries found in Enum fields.

recoding age_at_onset as int
recoding age_at_diagnosis as int
recoding first_motor_symptom as int
All required fields are present in *SUBJECT* table.
🚨⚠️❗ **21 Fields with empty (NULL) values:**

	- source_subject_id: 6/25 empty rows (REQUIRED)

	- AMPPD_id: 25/25 emp
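
The "Fields with empty (NULL) values" section of the report can be approximated with plain pandas. This is only a sketch of the idea, not the `validate_table` implementation: the sentinel value `"NA"`, the schema column `Required`, and the function `summarize_nulls` are assumptions made for illustration (only the `Table` and `Field` schema columns appear in this notebook).

```python
import pandas as pd

NULL_SENTINEL = "NA"  # assumption: stand-in for the NULL constant imported from util


def summarize_nulls(df: pd.DataFrame, schema: pd.DataFrame, null=NULL_SENTINEL) -> pd.DataFrame:
    """List every column with empty values, how many, and whether it is required.

    `schema` is assumed to have 'Field' and 'Required' columns (hypothetical layout).
    """
    required = dict(zip(schema["Field"], schema["Required"]))
    rows = []
    for field in df.columns:
        # Count both real missing values and the NULL sentinel string.
        n_empty = int((df[field].isna() | (df[field] == null)).sum())
        if n_empty:
            rows.append(
                {
                    "field": field,
                    "empty": n_empty,
                    "total": len(df),
                    "required": required.get(field, "Unknown"),
                }
            )
    return pd.DataFrame(rows, columns=["field", "empty", "total", "required"])


# Toy data, not taken from the Jakobsson tables.
toy = pd.DataFrame(
    {
        "subject_id": ["a", "b"],
        "AMPPD_id": ["NA", "NA"],
        "source_subject_id": ["x", None],
    }
)
toy_schema = pd.DataFrame(
    {
        "Field": ["subject_id", "AMPPD_id", "source_subject_id"],
        "Required": ["Required", "Required", "Required"],
    }
)
null_report = summarize_nulls(toy, toy_schema)
```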

Fix a typo, and then reupload.

```python 
"""
    Hi Andy,
    That’s a typo on my end.

    NP18-00117 should be BB18.0035.
    NP18-00287 is correct.

    CLINPATH+SUBJECT
    Cheers,

    Oliver
"""
```

In [36]:
SUBJECT = dfs['SUBJECT']
CLINPATH = dfs['CLINPATH']
row = SUBJECT['subject_id'] == 'NP18-117'
col = 'source_subject_id'
#fix typo
print(f"before: {SUBJECT.loc[row,col]}")
# SUBJECT.loc[row,col] = 'BB18.0035'
# print(f"after: {SUBJECT.loc[row,col]}")

row = CLINPATH['subject_id'] == 'NP18-117'
col = 'source_subject_id'
#fix typo
print(f"before: {CLINPATH.loc[row,col]}")
# CLINPATH.loc[row,col] = 'BB18.0035'
# print(f"after: {CLINPATH.loc[row,col]}")

# already fixed!


before: 11    BB18.0035
Name: source_subject_id, dtype: object
before: 11    BB18.0035
Name: source_subject_id, dtype: object


In [37]:
SAMPLE = dfs['SAMPLE']
# row = SAMPLE['sample_id'] == 'ASAP68 _PD_NP18-117_PUT'
col = 'sample_id'
# #fix typo
# print(f"before: {SAMPLE.loc[row,col]}")
# SAMPLE.loc[row,col] = 'ASAP68_PD_NP18-117_PUT'
row = SAMPLE['sample_id'] == 'ASAP68_PD_NP18-117_PUT'
print(f"after: {SAMPLE.loc[row,col]}")


after: 61    ASAP68_PD_NP18-117_PUT
Name: sample_id, dtype: object


Encode primary diagnosis.

Check that every subject whose SUBJECT['primary_diagnosis'] is NULL has a sample_id containing "_PD_", so that the NULL entries can safely be recoded as "PD".

In [38]:
DATA = dfs['DATA']
SUBJECT = dfs['SUBJECT']

tmp = SUBJECT.loc[SUBJECT['primary_diagnosis']==NULL,['subject_id','primary_diagnosis']]

tmp.primary_diagnosis.all()

tmp2 = tmp.set_index('subject_id')

In [39]:
tmp2 = dfs['SAMPLE'][['sample_id','subject_id']].copy()  # explicit copy, so adding columns does not trigger SettingWithCopyWarning

In [40]:
# flag samples by the diagnosis tag embedded in the sample_id
tmp2['PD'] = tmp2['sample_id'].apply(lambda x: '_PD_' in x)
tmp2['Ctl'] = tmp2['sample_id'].apply(lambda x: '_Ctl_' in x)

PD_subj = set(tmp2[tmp2['PD']]['subject_id'])
PD_subj


{'NP16-140',
 'NP16-160',
 'NP16-162',
 'NP16-25',
 'NP16-269',
 'NP17-191',
 'NP17-94',
 'NP18-117',
 'NP18-287',
 'NP18-304',
 'NP19-108',
 'NP19-16',
 'NP19-23',
 'NP19-255',
 'P73',
 'P74'}

In [41]:
set(tmp['subject_id']) - PD_subj

set()
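
The check above boils down to a set difference: subjects with a NULL diagnosis, minus subjects that have at least one `_PD_` sample. A self-contained sketch on toy data (the IDs below and the `"NA"` sentinel are made up for illustration and are not taken from the real tables):

```python
import pandas as pd

NULL_SENTINEL = "NA"  # assumption: stand-in for the NULL constant imported from util

subject = pd.DataFrame(
    {
        "subject_id": ["S1", "S2", "S3"],
        "primary_diagnosis": [
            NULL_SENTINEL,
            "No PD nor other neurological disorder",
            NULL_SENTINEL,
        ],
    }
)
sample = pd.DataFrame(
    {
        "sample_id": ["A1_PD_S1_PUT", "A2_Ctl_S2_PUT", "A3_PD_S3_SN"],
        "subject_id": ["S1", "S2", "S3"],
    }
)

# Subjects whose diagnosis is NULL-encoded.
null_dx = set(subject.loc[subject["primary_diagnosis"] == NULL_SENTINEL, "subject_id"])

# Subjects with at least one sample tagged "_PD_" (literal match, not a regex).
pd_subjects = set(
    sample.loc[sample["sample_id"].str.contains("_PD_", regex=False), "subject_id"]
)

# Anything left here is a NULL diagnosis that the sample IDs do not explain.
unexplained = null_dx - pd_subjects
```

An empty `unexplained` set is what justifies recoding NULL as "PD" in the next cell.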

In [42]:
# encode "case" as "PD"
dfs['SUBJECT']['primary_diagnosis'] = dfs['SUBJECT']['primary_diagnosis'].replace(NULL,'PD')
dfs['SUBJECT']['primary_diagnosis']

0                                        PD
1                                        PD
2     No PD nor other neurological disorder
3                                        PD
4     No PD nor other neurological disorder
5     No PD nor other neurological disorder
6                                        PD
7                                        PD
8                                        PD
9     No PD nor other neurological disorder
10                                       PD
11                                       PD
12    No PD nor other neurological disorder
13                                       PD
14                                       PD
15                                       PD
16                                       PD
17                                       PD
18                                       PD
19    No PD nor other neurological disorder
20    No PD nor other neurological disorder
21    No PD nor other neurological disorder
22                              

In [57]:
dfs['STUDY']['metadata_version_date'] = METADATA_VERSION_DATE
# rationalize the team_dataset_id
dfs['STUDY']['team_dataset_id'] = dataset_name.replace("-", "_")
dfs['STUDY']

Unnamed: 0,ASAP_team_name,ASAP_lab_name,project_name,team_dataset_id,project_dataset,project_description,PI_full_name,PI_email,contributor_names,submitter_name,...,publication_PMID,number_of_brain_samples,brain_regions,types_of_samples,PI_ORCHID,PI_google_scholar_id,DUA_version,preprocessing_references,metadata_version_date,alternate_dataset_id
0,TEAM-JAKOBSSON,Gale Hammell Lab,Activation of transposable elements as a trigg...,sn_rnaseq,Single nuclei sequencing of brain regions from...,"A range of genetic, clinical and pathological ...","Molly, Gale Hammell",Molly.GaleHammell@nyulangone.org,"Anita, Adami; Talitha, Forcier; Raquel, Garza;...","Oliver, H, Tam",...,,71,"Substantia nigra, prefrontal cortex, amygdala,...",Late stage PD & control,,,Creative Commons Attribution license CC BY 4.0,,v2.1_20241025,


## validate v2 tables


In [58]:
CDE = CDEv2
v2_tables = dfs
for table,df in v2_tables.items():
    schema = CDE[CDE['Table'] == table]

    report = ReportCollector(destination="NA")
    full_table, report = validate_table(df.copy(), table, schema, report)
    report.print_log()


recoding number_of_brain_samples as int
All required fields are present in *STUDY* table.
🚨⚠️❗ **7 Fields with empty (NULL) values:**

	- other_funding_source: 1/1 empty rows (REQUIRED)

	- publication_DOI: 1/1 empty rows (REQUIRED)

	- publication_PMID: 1/1 empty rows (REQUIRED)

	- PI_ORCHID: 1/1 empty rows (OPTIONAL)

	- PI_google_scholar_id: 1/1 empty rows (OPTIONAL)

	- preprocessing_references: 1/1 empty rows (OPTIONAL)

	- alternate_dataset_id: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *PROTOCOL* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- sample_collection_summary: 1/1 empty rows (REQUIRED)
No invalid entries found in Enum fields.

recoding age_at_onset as int
recoding age_at_diagnosis as int
recoding first_motor_symptom as int
All required fields are present in *SUBJECT* table.
🚨⚠️❗ **20 Fields with empty (NULL) values:**

	- source_subject_id: 6/25 empty rows (REQUIRED)

	- AMPPD_id: 25/25 empty rows (REQ

### save extras as auxiliary tables


In [59]:
# make tables conform to CDE and save extra columns as "auxiliary"
v2_path = metadata_path / "v2"
v2_path.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories

for table in tables:
    df = dfs[table]
    schema = CDE[CDE['Table'] == table]
    valid_fields = schema['Field'].unique()
    df_out = df[valid_fields]
    aux_fields = set(df.columns) - set(valid_fields)
    if aux_fields:
        df_aux = df[list(aux_fields)]
        df_aux.to_csv(v2_path / f"{table}_auxiliary.csv", index=False)
        print(f"Saved {table}_auxiliary.csv")
    df_out.to_csv(v2_path / f"{table}.csv", index=False)
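
The loop above assumes every CDE field is already present in each table, since `df[valid_fields]` raises a `KeyError` otherwise. A slightly more defensive version of the same split could look like the sketch below; `split_auxiliary` is a hypothetical helper, not part of `crn_utils`.

```python
import pandas as pd


def split_auxiliary(df: pd.DataFrame, valid_fields):
    """Split a table into CDE columns and leftover (auxiliary) columns.

    Returns (core, aux, missing), where `missing` lists CDE fields that the
    table does not contain. Column order follows the schema for `core` and
    the original table for `aux`.
    """
    valid_set = set(valid_fields)
    present = [f for f in valid_fields if f in df.columns]
    missing = [f for f in valid_fields if f not in df.columns]
    aux_cols = [c for c in df.columns if c not in valid_set]
    return df[present].copy(), df[aux_cols].copy(), missing


# Toy example with one extra column and one missing CDE field.
toy = pd.DataFrame({"subject_id": ["a"], "sex": ["F"], "lab_note": ["x"]})
core, aux, missing = split_auxiliary(toy, ["subject_id", "sex", "race"])
```

Keeping the auxiliary columns in their original order (rather than iterating over a `set`) makes the saved `*_auxiliary.csv` files reproducible between runs.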

## v2->v3

In [60]:
v3_meta_tables = ['STUDY', 'PROTOCOL', 'SUBJECT', 'SAMPLE', 'DATA', 'CLINPATH', 'PMDBS', 'CONDITION', 'ASSAY_RNAseq']

f"{v3_meta_tables}"

"['STUDY', 'PROTOCOL', 'SUBJECT', 'SAMPLE', 'DATA', 'CLINPATH', 'PMDBS', 'CONDITION', 'ASSAY_RNAseq']"

In [61]:
v3_path = metadata_path / "v3"

v3_tables, aux_tables = v2_to_v3_PMDBS(v2_path, v3_path, CDEv2, CDEv3)

recoding number_of_brain_samples as int
recoding age_at_onset as int
recoding age_at_diagnosis as int
recoding first_motor_symptom as int
recoding replicate_count as int
recoding repeated_sample as int
recoding input_cell_count as int
recoding replicate_count as int
recoding repeated_sample as int


### validate v3 tables


In [62]:
CDE = CDEv3
for table,df in v3_tables.items():
    schema = CDE[CDE['Table'] == table]

    report = ReportCollector(destination="NA")
    full_table, report = validate_table(df.copy(), table, schema, report)
    report.print_log()

recoding number_samples as int
All required fields are present in *STUDY* table.
🚨⚠️❗ **7 Fields with empty (NULL) values:**

	- other_funding_source: 1/1 empty rows (REQUIRED)

	- publication_DOI: 1/1 empty rows (REQUIRED)

	- publication_PMID: 1/1 empty rows (REQUIRED)

	- PI_ORCID: 1/1 empty rows (OPTIONAL)

	- PI_google_scholar_id: 1/1 empty rows (OPTIONAL)

	- preprocessing_references: 1/1 empty rows (OPTIONAL)

	- alternate_dataset_id: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *PROTOCOL* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- sample_collection_summary: 1/1 empty rows (REQUIRED)
No invalid entries found in Enum fields.

All required fields are present in *SUBJECT* table.
🚨⚠️❗ **4 Fields with empty (NULL) values:**

	- source_subject_id: 6/25 empty rows (REQUIRED)

	- biobank_name: 25/25 empty rows (REQUIRED)

	- race: 25/25 empty rows (REQUIRED)

	- primary_diagnosis_text: 25/25 empty rows (OPTIONAL)
🚨⚠️❗

In [63]:
STUDY = v3_tables['STUDY']
STUDY

Unnamed: 0,ASAP_team_name,ASAP_lab_name,project_name,team_dataset_id,project_dataset,project_description,PI_full_name,PI_email,contributor_names,submitter_name,...,number_samples,sample_types,types_of_samples,DUA_version,metadata_tables,PI_ORCID,PI_google_scholar_id,preprocessing_references,metadata_version_date,alternate_dataset_id
0,TEAM-JAKOBSSON,Gale Hammell Lab,Activation of transposable elements as a trigg...,sn_rnaseq,Single nuclei sequencing of brain regions from...,"A range of genetic, clinical and pathological ...","Molly, Gale Hammell",Molly.GaleHammell@nyulangone.org,"Anita, Adami; Talitha, Forcier; Raquel, Garza;...","Oliver, H, Tam",...,71,"Substantia nigra, prefrontal cortex, amygdala,...",Late stage PD & control,Creative Commons Attribution license CC BY 4.0,"['STUDY', 'PROTOCOL', 'SUBJECT', 'SAMPLE', 'DA...",,,,v3.0_20241025,


-------------------------
## check md5s



In [67]:
print(team)

source = "pmdbs"

bucket = f"asap-raw-team-{team}-{source}-{dataset_name}"  # newer naming scheme, overridden below
bucket = f"asap-raw-data-team-{team}" # for now old locations


key_file_path = Path.home() / f"Projects/ASAP/{team}-credentials.json"

res = authenticate_with_service_account(key_file_path)
print(res)

# make sure to get ALL the fastq files in the bucket
prefix = "**/*.gz"
bucket_files_md5 = get_md5_hashes( bucket, prefix)

jakobsson
CompletedProcess(args='gcloud auth activate-service-account --key-file=/Users/ergonyc/Projects/ASAP/jakobsson-credentials.json', returncode=0, stdout='', stderr='Activated service account credentials for: [raw-admin-jakobsson@dnastack-asap-parkinsons.iam.gserviceaccount.com]\n')
gsutil -u dnastack-asap-parkinsons hash -h gs://asap-raw-data-team-jakobsson/**/*.gz
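
For orientation, here is one way the text printed by `gsutil hash -h` could be turned into a `{file_name: md5}` mapping. This is a sketch under assumptions: it expects a `Hashes [hex] for <name>:` header followed by an indented `Hash (md5): <hex>` line per object, and it keys the result by base file name. The function `parse_md5_listing` is hypothetical and is not how `get_md5_hashes` is known to work.

```python
import re
from pathlib import PurePosixPath

# Assumed layout of one record in the command output:
#   Hashes [hex] for path/to/file.fastq.gz:
#       Hash (crc32c):      1a2b3c4d
#       Hash (md5):         0123456789abcdef0123456789abcdef
HEADER = re.compile(r"^Hashes \[hex\] for (?P<name>.+):\s*$")
MD5 = re.compile(r"^\s*Hash \(md5\):\s*(?P<md5>[0-9a-fA-F]{32})\s*$")


def parse_md5_listing(text: str) -> dict:
    """Map base file name -> md5 hex digest from hash-listing text."""
    hashes = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # Keep only the file name so it can be matched against DATA['file_name'].
            current = PurePosixPath(header.group("name")).name
            continue
        md5 = MD5.match(line)
        if md5 and current is not None:
            hashes[current] = md5.group("md5").lower()
    return hashes


# Made-up listing for illustration only.
example_output = (
    "Hashes [hex] for raw/sample1_R1.fastq.gz:\n"
    "\tHash (crc32c):\t\t1a2b3c4d\n"
    "\tHash (md5):\t\t0123456789abcdef0123456789abcdef\n"
)
parsed = parse_md5_listing(example_output)
```

Keying by base name silently collapses files that share a name in different folders, so a real implementation may prefer the full object path.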


In [68]:
# compare the MD5s recorded in DATA against the MD5s reported for the bucket
checksum = v3_tables['DATA'][['file_name','file_MD5']].copy()  # explicit copy avoids SettingWithCopyWarning
checksum['check2'] = checksum['file_name'].map(bucket_files_md5)
checksum['check1'] = checksum['file_MD5']
checksum[checksum.check1 != checksum.check2].file_name.to_list()
# an empty list means every recorded MD5 matches the bucket

[]
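
A single list of mismatches does not distinguish a file that is absent from the bucket from one whose hash differs. The sketch below separates the two cases; `find_md5_mismatches` is a hypothetical helper, and it assumes `bucket_md5` is a plain `{file_name: md5}` dict like the one used above.

```python
import pandas as pd


def find_md5_mismatches(data: pd.DataFrame, bucket_md5: dict) -> dict:
    """Compare DATA['file_MD5'] with bucket hashes, keyed by file_name."""
    check = data[["file_name", "file_MD5"]].copy()
    # Files not present in the mapping come back as NaN.
    check["bucket_MD5"] = check["file_name"].map(bucket_md5)

    in_bucket = check["bucket_MD5"].notna()
    missing = check.loc[~in_bucket, "file_name"].tolist()
    differ = check.loc[
        in_bucket & (check["file_MD5"] != check["bucket_MD5"]), "file_name"
    ].tolist()
    return {"missing_from_bucket": missing, "md5_differs": differ}


# Toy example: one match, one differing hash, one file missing from the bucket.
toy_data = pd.DataFrame(
    {"file_name": ["a.gz", "b.gz", "c.gz"], "file_MD5": ["1", "2", "3"]}
)
result = find_md5_mismatches(toy_data, {"a.gz": "1", "b.gz": "x"})
```

Both lists being empty corresponds to the `[]` result shown above.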

--------------------
## Create metadata package


In [64]:
export_path = root_path / f"{team}"

create_upload_medadata_package(export_path, v3_tables)

In [65]:
v3_tables['SAMPLE']['condition_id']

0                                        pd
1                                        pd
2                                        pd
3                                        pd
4     no_pd_nor_other_neurological_disorder
                      ...                  
66                                       pd
67                                       pd
68    no_pd_nor_other_neurological_disorder
69    no_pd_nor_other_neurological_disorder
70    no_pd_nor_other_neurological_disorder
Name: condition_id, Length: 71, dtype: object

In [66]:
v3_tables['CONDITION']

Field,condition_id,intervention_name,intervention_id,protocol_id,intervention_aux_table
0,pd,Case-Control,Case,,
1,no_pd_nor_other_neurological_disorder,Case-Control,Control,,
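
The `condition_id` values look like lower-cased, underscore-joined versions of the `primary_diagnosis` labels. A minimal sketch of such a normalisation follows; `to_condition_id` is a hypothetical function, and `v2_to_v3_PMDBS` may derive the IDs differently.

```python
import re


def to_condition_id(label: str) -> str:
    """Lower-case a label and join its alphanumeric runs with underscores."""
    return re.sub(r"[^0-9a-z]+", "_", label.strip().lower()).strip("_")


condition_ids = [
    to_condition_id("PD"),
    to_condition_id("No PD nor other neurological disorder"),
]
```

For the two diagnosis labels used in this dataset, the function reproduces the IDs in the CONDITION table above.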


_______