ASAP CRN Metadata compilation

# Team Lee (Biederer). ASAP CRN Metadata construction


This is a bulkRNAseq dataset that was originally transferred along with the scRNAseq dataset. The metadata should be identical to that of the sc/snRNAseq dataset already on the platform, apart from the details of the bulkRNAseq assay.



15 Sept 2024
Andy Henrie




In [20]:
import pandas as pd
from pathlib import Path
import os, sys

sys.path.append(os.path.abspath((os.path.join(os.getcwd(), 'src/crn_utils'))))

from util import read_CDE, NULL, prep_table, read_meta_table
from validate import validate_table, ReportCollector
from update_schema import v1_to_v2, v2_to_v3_PMDBS, create_upload_medadata_package
from checksums import extract_md5_from_details2
from checksums import get_md5_hashes, authenticate_with_service_account

%load_ext autoreload
%autoreload 2

root_path = Path.home() / ("Projects/ASAP/data/teams")



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Team Lee

## CDEs
Load the relevant CDEs.

In [21]:
schema_version = "v1"
schema_path = Path.home() / "Projects/ASAP/crn-utils/resource/CDE"
CDEv1 = read_CDE(schema_version, local_path=schema_path)
schema_version = "v2.1"
CDEv2 = read_CDE(schema_version, local_path=schema_path)
schema_version = "v3.0"
CDEv3 = read_CDE(schema_version, local_path=schema_path)

metadata_version: ASAP_CDE_v1
https://docs.google.com/spreadsheets/d/1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc/gviz/tq?tqx=out:csv&sheet=v1
/Users/ergonyc/Projects/ASAP/crn-utils/resource/CDE/ASAP_CDE_v1.csv
read local file
metadata_version: ASAP_CDE_v2.1
https://docs.google.com/spreadsheets/d/1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc/gviz/tq?tqx=out:csv&sheet=v2.1
/Users/ergonyc/Projects/ASAP/crn-utils/resource/CDE/ASAP_CDE_v2.1.csv
read local file
metadata_version: ASAP_CDE_v3.0
https://docs.google.com/spreadsheets/d/1c0z5KvRELdT2AtQAH2Dus8kwAyyLrR0CROhKOjpU4Vc/gviz/tq?tqx=out:csv&sheet=v3.0
/Users/ergonyc/Projects/ASAP/crn-utils/resource/CDE/ASAP_CDE_v3.0.csv
read local file


## Load original tables 
Infer the bulk metadata from the sn dataset v3.0 metadata

## Clean v3 Table


In [22]:
source = "pmdbs"
team = "lee"
dataset_name = "bulk-rnaseq"


In [23]:
## convert 

metadata_path = root_path / f"{team}/{dataset_name}/metadata"

v3_path = metadata_path / "v3"
og_path = metadata_path / "og"  # this data is a copy of the lee/sn_pmdbs/metadata/v3

v3_meta_tables = ['STUDY', 'PROTOCOL', 'SUBJECT', 'SAMPLE', 'DATA', 'CLINPATH', 'PMDBS', 'CONDITION', 'ASSAY_RNAseq']

in_tables = [table_name for table_name in v3_meta_tables if f"{table_name}.csv" in os.listdir(og_path)]


In [24]:
og_tables = {}
for table_name in in_tables:
    df = read_meta_table(f"{og_path}/{table_name}.csv")
    og_tables[table_name] = df
    

In [25]:
og_tables.keys()

dict_keys(['STUDY', 'PROTOCOL', 'SUBJECT', 'SAMPLE', 'DATA', 'CLINPATH', 'PMDBS', 'CONDITION', 'ASSAY_RNAseq'])

In [26]:
og_tables['STUDY']

Unnamed: 0,ASAP_team_name,ASAP_lab_name,project_name,team_dataset_id,project_dataset,project_description,PI_full_name,PI_email,contributor_names,submitter_name,...,number_samples,sample_types,types_of_samples,DUA_version,metadata_tables,PI_ORCID,PI_google_scholar_id,preprocessing_references,metadata_version_date,alternate_dataset_id
0,TEAM-LEE,Bras,Is senescence a component of human PD and does...,sc_pmdbs,Human snRNA-seq PD Senesence Jose Bras Team Lee,Characterize the neuropathological progression...,"Jose, Bras",jose.bras@vai.org,"Lee, L, Marshall ; Kimberly, E, Paquette ; Kai...",Kaitlyn E Westra,...,75,hippocampus; middle frontal gyrus; substantia ...,human PD and control postmortem brains,unsure,"['STUDY', 'PROTOCOL', 'SUBJECT', 'SAMPLE', 'DA...",,,NA(raw data),v1_20241016,


In [27]:
CDE = CDEv3
dfs = {}
for table,df in og_tables.items():
    if table == "ASSAY_bulkRNAseq":
        table = "ASSAY_RNAseq"

    schema = CDE[CDE['Table'] == table]

    report = ReportCollector(destination="NA")
    full_table, report = validate_table(df.copy(), table, schema, report)
    report.print_log()
    dfs[table] = full_table

recoding number_samples as int
All required fields are present in *STUDY* table.
🚨⚠️❗ **7 Fields with empty (NULL) values:**

	- submitter_email: 1/1 empty rows (REQUIRED)

	- other_funding_source: 1/1 empty rows (REQUIRED)

	- publication_DOI: 1/1 empty rows (REQUIRED)

	- publication_PMID: 1/1 empty rows (REQUIRED)

	- PI_ORCID: 1/1 empty rows (OPTIONAL)

	- PI_google_scholar_id: 1/1 empty rows (OPTIONAL)

	- alternate_dataset_id: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *PROTOCOL* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- other_reference: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *SUBJECT* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- primary_diagnosis_text: 23/25 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

recoding replicate_count as int
recoding repeated_sample as int
All required fields are present in *SAMPLE* table.


In [28]:
STUDY = dfs['STUDY']
PROTOCOL = dfs['PROTOCOL']
SUBJECT = dfs['SUBJECT']
SAMPLE = dfs['SAMPLE']
DATA = dfs['DATA']
CLINPATH = dfs['CLINPATH']
PMDBS = dfs['PMDBS']
ASSAY_RNAseq = dfs['ASSAY_RNAseq']
CONDITION = dfs['CONDITION']

In [29]:
# normalize the team_dataset_id (spaces and hyphens become underscores)
STUDY['team_dataset_id'] = dataset_name.replace(" ", "_").replace("-", "_")
STUDY['metadata_tables'] = f"{v3_meta_tables}"

In [30]:
metadata_version = "v3.0"
METADATA_VERSION_DATE = f"{metadata_version}_{pd.Timestamp.now().strftime('%Y%m%d')}"

STUDY['metadata_version_date'] = METADATA_VERSION_DATE

In [31]:
# fix ASSAY_RNAseq  
ASSAY_RNAseq['technology'] = "Bulk"
ASSAY_RNAseq.head()

Unnamed: 0,sample_id,tissue,technology,omic,RIN,molecular_source,input_cell_count,assay,sequencing_end,sequencing_length,sequencing_instrument
0,MFG_HC_1225,Brain,Bulk,RNA,,PolyA RNA,9106,v1,Paired-end,100,Illumina NovaSeq 6000
1,HIP_HC_1225,Brain,Bulk,RNA,,PolyA RNA,6533,v1,Paired-end,100,Illumina NovaSeq 6000
2,SN_HC_1225,Brain,Bulk,RNA,,PolyA RNA,989,v1,Paired-end,100,Illumina NovaSeq 6000
3,MFG_HC_0602,Brain,Bulk,RNA,,PolyA RNA,2675,v1,Paired-end,100,Illumina NovaSeq 6000
4,HIP_HC_0602,Brain,Bulk,RNA,,PolyA RNA,2954,v1,Paired-end,100,Illumina NovaSeq 6000



ASSAY:
>> VAI’s Genomics Core (RRID:SCR_022913; they sequenced this data for us), and I can make some inferences after looking through it. From what I can see, this was likely sequenced on the NovaSeq 6000, with a run length of 2x100bp, using stranded total RNA, with a target genome coverage of 30M reads

In [32]:
DATA.head()

Unnamed: 0,sample_id,replicate,replicate_count,repeated_sample,batch,file_type,file_name,file_description,file_MD5,adjustment,content,header,annotation,configuration_file
0,MFG_HC_1225,rep1,1,0,BATCH_4,fastq,MFGHC1225_S9_L001_R1_001.fastq.gz,Raw sequencing data,9977258e598d6a52130c29c71aef6925,Raw,Reads,,,NA(raw data)
1,MFG_HC_1225,rep1,1,0,BATCH_4,fastq,MFGHC1225_S9_L001_R2_001.fastq.gz,Raw sequencing data,fe2cf93257801227b7072a4fb7d18792,Raw,Reads,,,NA(raw data)
2,MFG_HC_0602,rep1,1,0,BATCH_4,fastq,MFGHC0602_S2_L001_R1_001.fastq.gz,Raw sequencing data,110ca4864cf6938faca67567bebfb6cc,Raw,Reads,,,NA(raw data)
3,MFG_HC_0602,rep1,1,0,BATCH_4,fastq,MFGHC0602_S2_L001_R2_001.fastq.gz,Raw sequencing data,0dcc67217e43ab53bae0d0676f9bfe8b,Raw,Reads,,,NA(raw data)
4,MFG_PD_0009,rep1,1,0,BATCH_4,fastq,MFGPD0009_S3_L001_R1_001.fastq.gz,Raw sequencing data,a2608d0bd192333b0076d7091c1c50ea,Raw,Reads,,,NA(raw data)


From here we should be able to infer the samples and subjects to subset to


In [33]:
SAMPLE.head()

Unnamed: 0,sample_id,subject_id,source_sample_id,replicate,replicate_count,repeated_sample,batch,organism,tissue,assay_type,...,self_reported_ethnicity_ontology_term_id,disease_ontology_term_id,tissue_ontology_term_id,assay_ontology_term_id,donor_id,pm_PH,cell_type_ontology_term_id,source_RIN,DV200,suspension_type
0,MFG_HC_1225,HC_1225,12-25,rep1,1,0,BATCH_4,,Brain,,...,,PATO:0000461,UBERON:0002702,EFO:0030004,,,NA(multiple),,,nucleus
1,HIP_HC_1225,HC_1225,12-25,rep1,1,0,BATCH_9,,Brain,,...,,PATO:0000461,UBERON:0002702,EFO:0030004,,,NA(multiple),,,nucleus
2,SN_HC_1225,HC_1225,12-25,rep1,1,0,BATCH_7,,Brain,,...,,PATO:0000461,UBERON:0002702,EFO:0030004,,,NA(multiple),,,nucleus
3,MFG_HC_0602,HC_0602,06-02,rep1,1,0,BATCH_4,,Brain,,...,,PATO:0000461,UBERON:0002702,EFO:0030004,,,NA(multiple),,,nucleus
4,HIP_HC_0602,HC_0602,06-02,rep1,1,0,BATCH_9,,Brain,,...,,PATO:0000461,UBERON:0002702,EFO:0030004,,,NA(multiple),,,nucleus


In [34]:
SUBJECT.head()

Unnamed: 0,subject_id,source_subject_id,biobank_name,sex,age_at_collection,race,primary_diagnosis,primary_diagnosis_text
0,HC_1225,12-25,Banner Sun Health Research Institute,Male,80.0,White,No PD nor other neurological disorder,
1,HC_0602,06-02,Banner Sun Health Research Institute,Male,84.0,White,Other neurological disorder,Mild Cognitive Impairment
2,PD_0009,00-09,Banner Sun Health Research Institute,Male,64.0,White,Idiopathic PD,
3,PD_1921,19-21,Banner Sun Health Research Institute,Male,82.0,White,Idiopathic PD,
4,PD_2058,20-58,Banner Sun Health Research Institute,Male,87.0,White,Idiopathic PD,


### Infer DATA, subjects, and samples from the bucket files


In [35]:

bucket = f"asap-raw-team-{team}-{source}-{dataset_name}"  # overwritten on the next line
bucket = f"asap-raw-data-team-{team}"  # for now the files are still in the old location


key_file_path = Path.home() / f"Projects/ASAP/{team}-credentials.json"

res = authenticate_with_service_account(key_file_path)
print(res)
prefix = "fastqs/bulk_MFG/*.gz"
bucket_files_md5 = get_md5_hashes( bucket, prefix)


CompletedProcess(args='gcloud auth activate-service-account --key-file=/Users/ergonyc/Projects/ASAP/lee-credentials.json', returncode=0, stdout='', stderr='Activated service account credentials for: [raw-admin-lee@dnastack-asap-parkinsons.iam.gserviceaccount.com]\n')
gsutil -u dnastack-asap-parkinsons hash -h gs://asap-raw-data-team-lee/fastqs/bulk_MFG/*.gz


In [36]:
df = pd.DataFrame(bucket_files_md5.items(), columns=["filename", "md5"])
df

Unnamed: 0,filename,md5
0,0009PD_MFG_bulk_L000_R1_001.fastq.gz,c9d64ec5b02de6b09a45dd7ac6610854
1,0009PD_MFG_bulk_L000_R2_001.fastq.gz,43a19a5fafd6892a0726660a560c4993
2,0348PD_MFG_bulk_L000_R1_001.fastq.gz,0ad336467cb40bfa3f6e7f582ce53203
3,0348PD_MFG_bulk_L000_R2_001.fastq.gz,8b6c00b59f2f0e08c053cd437a17584d
4,0413PD_MFG_bulk_L000_R1_001.fastq.gz,a1e499ed8fd73cc2ae75835792ea23d0
5,0413PD_MFG_bulk_L000_R2_001.fastq.gz,e2be134da750bc4cd0672dee9b507965
6,0602HC_MFG_bulk_L000_R1_001.fastq.gz,653a5622dcc2b0e8aa3b43a5ef364deb
7,0602HC_MFG_bulk_L000_R2_001.fastq.gz,39c58383dbad4b68645a0f601046e0ef
8,1225HC_MFG_bulk_L000_R1_001.fastq.gz,cff31843f6024c7d1440378b3cbbfadd
9,1225HC_MFG_bulk_L000_R2_001.fastq.gz,47fc7147767b66c6049c5e647495c49a


In [37]:
df['id'] = df['filename'].str.split("_").str[0]

In [38]:
df['subject_id'] = df['id'].apply(lambda x: f"{x[-2:]}_{x[:-2]}")
df.head()

Unnamed: 0,filename,md5,id,subject_id
0,0009PD_MFG_bulk_L000_R1_001.fastq.gz,c9d64ec5b02de6b09a45dd7ac6610854,0009PD,PD_0009
1,0009PD_MFG_bulk_L000_R2_001.fastq.gz,43a19a5fafd6892a0726660a560c4993,0009PD,PD_0009
2,0348PD_MFG_bulk_L000_R1_001.fastq.gz,0ad336467cb40bfa3f6e7f582ce53203,0348PD,PD_0348
3,0348PD_MFG_bulk_L000_R2_001.fastq.gz,8b6c00b59f2f0e08c053cd437a17584d,0348PD,PD_0348
4,0413PD_MFG_bulk_L000_R1_001.fastq.gz,a1e499ed8fd73cc2ae75835792ea23d0,0413PD,PD_0413


In [39]:
df['sample_id'] = "MFG_" + df['subject_id']
df.head()


Unnamed: 0,filename,md5,id,subject_id,sample_id
0,0009PD_MFG_bulk_L000_R1_001.fastq.gz,c9d64ec5b02de6b09a45dd7ac6610854,0009PD,PD_0009,MFG_PD_0009
1,0009PD_MFG_bulk_L000_R2_001.fastq.gz,43a19a5fafd6892a0726660a560c4993,0009PD,PD_0009,MFG_PD_0009
2,0348PD_MFG_bulk_L000_R1_001.fastq.gz,0ad336467cb40bfa3f6e7f582ce53203,0348PD,PD_0348,MFG_PD_0348
3,0348PD_MFG_bulk_L000_R2_001.fastq.gz,8b6c00b59f2f0e08c053cd437a17584d,0348PD,PD_0348,MFG_PD_0348
4,0413PD_MFG_bulk_L000_R1_001.fastq.gz,a1e499ed8fd73cc2ae75835792ea23d0,0413PD,PD_0413,MFG_PD_0413


In [40]:
df['lane'] = df['filename'].str.split("_").str[3]
# field 4 is already "R1"/"R2"; str.rstrip(".fastq.gz") strips a character set, not a suffix, so it is dropped
df['end'] = df['filename'].str.split("_").str[4]


In [41]:

BULK_FILES = df.copy()


In [42]:
BULK_FILES.head()

Unnamed: 0,filename,md5,id,subject_id,sample_id,lane,end
0,0009PD_MFG_bulk_L000_R1_001.fastq.gz,c9d64ec5b02de6b09a45dd7ac6610854,0009PD,PD_0009,MFG_PD_0009,L000,R1
1,0009PD_MFG_bulk_L000_R2_001.fastq.gz,43a19a5fafd6892a0726660a560c4993,0009PD,PD_0009,MFG_PD_0009,L000,R2
2,0348PD_MFG_bulk_L000_R1_001.fastq.gz,0ad336467cb40bfa3f6e7f582ce53203,0348PD,PD_0348,MFG_PD_0348,L000,R1
3,0348PD_MFG_bulk_L000_R2_001.fastq.gz,8b6c00b59f2f0e08c053cd437a17584d,0348PD,PD_0348,MFG_PD_0348,L000,R2
4,0413PD_MFG_bulk_L000_R1_001.fastq.gz,a1e499ed8fd73cc2ae75835792ea23d0,0413PD,PD_0413,MFG_PD_0413,L000,R1
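
The string operations in cells 37 to 40 can be folded into a single helper. This is a minimal sketch, assuming every file name follows the `<4-digit id><HC|PD>_<region>_bulk_<lane>_<read>_001.fastq.gz` pattern seen in the listing above. `parse_bulk_fastq_name` is a hypothetical name and is not part of `crn_utils`.

```python
import re

# Pattern inferred from names such as "0009PD_MFG_bulk_L000_R1_001.fastq.gz".
# This is an assumption about the naming scheme, not a documented convention.
_BULK_NAME = re.compile(
    r"^(?P<num>\d{4})(?P<dx>[A-Z]{2})_(?P<region>[A-Z]+)_bulk_"
    r"(?P<lane>L\d{3})_(?P<end>R[12])_\d{3}\.fastq\.gz$"
)


def parse_bulk_fastq_name(filename: str) -> dict:
    """Split a bulk FASTQ file name into subject, sample, lane and read end."""
    m = _BULK_NAME.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    subject_id = f"{m['dx']}_{m['num']}"
    return {
        "subject_id": subject_id,
        "sample_id": f"{m['region']}_{subject_id}",
        "lane": m["lane"],
        "end": m["end"],
    }


parsed = parse_bulk_fastq_name("0009PD_MFG_bulk_L000_R1_001.fastq.gz")
print(parsed)
```

Raising on a non-matching name surfaces stray files early instead of silently producing a malformed `subject_id`.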


In [43]:

bulk_samples = set(BULK_FILES['sample_id'].unique())
bulk_subjects = set(BULK_FILES['subject_id'].unique())


In [44]:
# SUBJECT subset by subject
SUBJECT_ = SUBJECT[SUBJECT['subject_id'].isin(bulk_subjects)].reset_index(drop=True)
SUBJECT_.head()


Unnamed: 0,subject_id,source_subject_id,biobank_name,sex,age_at_collection,race,primary_diagnosis,primary_diagnosis_text
0,HC_1225,12-25,Banner Sun Health Research Institute,Male,80.0,White,No PD nor other neurological disorder,
1,HC_0602,06-02,Banner Sun Health Research Institute,Male,84.0,White,Other neurological disorder,Mild Cognitive Impairment
2,PD_0009,00-09,Banner Sun Health Research Institute,Male,64.0,White,Idiopathic PD,
3,PD_1921,19-21,Banner Sun Health Research Institute,Male,82.0,White,Idiopathic PD,
4,PD_2058,20-58,Banner Sun Health Research Institute,Male,87.0,White,Idiopathic PD,


In [45]:

# CLINPATH subset by subject
CLINPATH_ = CLINPATH[CLINPATH['subject_id'].isin(bulk_subjects)].reset_index(drop=True)


# PMDBS subset by sample
PMDBS_ = PMDBS[PMDBS['sample_id'].isin(bulk_samples)].reset_index(drop=True)

# SAMPLE subset by sample
SAMPLE_ = SAMPLE[SAMPLE['sample_id'].isin(bulk_samples)].reset_index(drop=True)

# ASSAY_RNAseq subset by sample
ASSAY_RNAseq_ = ASSAY_RNAseq[ASSAY_RNAseq['sample_id'].isin(bulk_samples)].reset_index(drop=True)


In [46]:
bulk_samples

{'MFG_HC_0602',
 'MFG_HC_1225',
 'MFG_HC_1308',
 'MFG_HC_1862',
 'MFG_HC_1864',
 'MFG_HC_1939',
 'MFG_HC_2057',
 'MFG_HC_2061',
 'MFG_HC_2062',
 'MFG_HC_2067',
 'MFG_PD_0009',
 'MFG_PD_0348',
 'MFG_PD_0413',
 'MFG_PD_1312',
 'MFG_PD_1317',
 'MFG_PD_1344',
 'MFG_PD_1441',
 'MFG_PD_1504',
 'MFG_PD_1858',
 'MFG_PD_1902',
 'MFG_PD_1921',
 'MFG_PD_1973',
 'MFG_PD_2005',
 'MFG_PD_2038',
 'MFG_PD_2058'}
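
Because `isin` silently drops IDs that have no match, it is worth confirming that every bulk sample and subject actually exists in the sn tables before relying on the subsets. A small sketch follows. `missing_ids` is a hypothetical helper, and the toy frame stands in for `SAMPLE`.

```python
import pandas as pd


def missing_ids(wanted: set, table: pd.DataFrame, column: str) -> set:
    """Return the IDs in `wanted` that do not occur in table[column]."""
    return set(wanted) - set(table[column].dropna().unique())


# Toy stand-ins; the real check would pass bulk_samples and SAMPLE.
toy_sample = pd.DataFrame(
    {"sample_id": ["MFG_HC_1225", "HIP_HC_1225", "MFG_PD_0009"]}
)
toy_bulk = {"MFG_HC_1225", "MFG_PD_0009", "MFG_PD_0348"}

missing = missing_ids(toy_bulk, toy_sample, "sample_id")
print(missing)  # IDs that have files but no metadata row
```

An empty result for both `sample_id` and `subject_id` means the inherited tables cover the whole bulk dataset.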

In [47]:

# DATA subset by sample

DATA_ = DATA[DATA['sample_id'].isin(bulk_samples)].reset_index(drop=True)

DATA_.head()



Unnamed: 0,sample_id,replicate,replicate_count,repeated_sample,batch,file_type,file_name,file_description,file_MD5,adjustment,content,header,annotation,configuration_file
0,MFG_HC_1225,rep1,1,0,BATCH_4,fastq,MFGHC1225_S9_L001_R1_001.fastq.gz,Raw sequencing data,9977258e598d6a52130c29c71aef6925,Raw,Reads,,,NA(raw data)
1,MFG_HC_1225,rep1,1,0,BATCH_4,fastq,MFGHC1225_S9_L001_R2_001.fastq.gz,Raw sequencing data,fe2cf93257801227b7072a4fb7d18792,Raw,Reads,,,NA(raw data)
2,MFG_HC_0602,rep1,1,0,BATCH_4,fastq,MFGHC0602_S2_L001_R1_001.fastq.gz,Raw sequencing data,110ca4864cf6938faca67567bebfb6cc,Raw,Reads,,,NA(raw data)
3,MFG_HC_0602,rep1,1,0,BATCH_4,fastq,MFGHC0602_S2_L001_R2_001.fastq.gz,Raw sequencing data,0dcc67217e43ab53bae0d0676f9bfe8b,Raw,Reads,,,NA(raw data)
4,MFG_PD_0009,rep1,1,0,BATCH_4,fastq,MFGPD0009_S3_L001_R1_001.fastq.gz,Raw sequencing data,a2608d0bd192333b0076d7091c1c50ea,Raw,Reads,,,NA(raw data)


In [48]:

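# now fix the files and md5
# BULK_FILES.head()

# Each sample has an R1 and an R2 file, so key the maps on (sample_id, read end).
# Keying on sample_id alone keeps only the last file seen for each sample.
# Assumes the original file names contain "_R1_" or "_R2_".
data_end = DATA_['file_name'].str.extract(r"_(R[12])_", expand=False)
data_keys = list(zip(DATA_['sample_id'], data_end))
bulk_keys = list(zip(BULK_FILES['sample_id'], BULK_FILES['end']))

filename_map = dict(zip(bulk_keys, BULK_FILES['filename']))
md5_map = dict(zip(bulk_keys, BULK_FILES['md5']))

DATA_['file_name'] = [filename_map.get(k) for k in data_keys]
DATA_['file_MD5'] = [md5_map.get(k) for k in data_keys]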
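
Each sample contributes an R1 and an R2 file, so after remapping every row of `DATA_` should carry a distinct file name and a non-empty MD5. Below is a hedged sketch of that sanity check. `check_file_table` is a hypothetical helper, and the toy frame imitates the failure mode in which both rows of a sample receive the same file.

```python
import pandas as pd


def check_file_table(data: pd.DataFrame) -> list:
    """Return human-readable problems found in file_name / file_MD5."""
    problems = []
    is_dup = data["file_name"].duplicated(keep=False)
    dup_names = sorted(set(data.loc[is_dup, "file_name"].dropna()))
    if dup_names:
        problems.append(f"duplicated file_name: {dup_names}")
    n_missing = int(data["file_MD5"].isna().sum())
    if n_missing:
        problems.append(f"{n_missing} rows without file_MD5")
    return problems


# Toy example: both rows of one sample ended up pointing at the R2 file.
toy = pd.DataFrame({
    "sample_id": ["MFG_PD_0009", "MFG_PD_0009"],
    "file_name": ["0009PD_MFG_bulk_L000_R2_001.fastq.gz"] * 2,
    "file_MD5": ["43a19a5fafd6892a0726660a560c4993"] * 2,
})
problems = check_file_table(toy)
print(problems)
```

An empty list from `check_file_table(DATA_)` would indicate that the remapping kept one row per file.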


In [49]:
DATA_.columns

Index(['sample_id', 'replicate', 'replicate_count', 'repeated_sample', 'batch',
       'file_type', 'file_name', 'file_description', 'file_MD5', 'adjustment',
       'content', 'header', 'annotation', 'configuration_file'],
      dtype='object')

define "batch"

In [50]:
# set batch to "batch_x" as we have confirmed that all samples were run in the same batch
DATA_['batch'] = "batch_x"
# set batch to batch_x
SAMPLE_['batch'] = "batch_x"

# # make each sample its own batch. 
# SAMPLE_['batch'] = [f"batch_s{x}" for x in SAMPLE_.index] 
# DATA_['batch'] = DATA_['sample_id'].map(dict(zip(SAMPLE_['sample_id'], SAMPLE_['batch'])))



In [51]:
# pack the tables into a dictionary

v3_tables = {
    'STUDY': STUDY,
    'PROTOCOL': PROTOCOL,
    'SUBJECT': SUBJECT_,
    'SAMPLE': SAMPLE_,
    'DATA': DATA_,
    'CLINPATH': CLINPATH_,
    'PMDBS': PMDBS_,
    'ASSAY_RNAseq': ASSAY_RNAseq_,
    'CONDITION': CONDITION
}
CONDITION


Unnamed: 0,condition_id,intervention_name,intervention_id,protocol_id,intervention_aux_table
0,no_pd_nor_other_neurological_disorder,Case-Control,Control,,
1,other_neurological_disorder,Case-Control,Other,,
2,idiopathic_pd,Case-Control,Case,,


In [52]:
SAMPLE_['condition_id']

0     no_pd_nor_other_neurological_disorder
1               other_neurological_disorder
2                             idiopathic_pd
3                             idiopathic_pd
4                             idiopathic_pd
5                             idiopathic_pd
6                             idiopathic_pd
7     no_pd_nor_other_neurological_disorder
8     no_pd_nor_other_neurological_disorder
9     no_pd_nor_other_neurological_disorder
10    no_pd_nor_other_neurological_disorder
11    no_pd_nor_other_neurological_disorder
12              other_neurological_disorder
13    no_pd_nor_other_neurological_disorder
14    no_pd_nor_other_neurological_disorder
15                            idiopathic_pd
16                            idiopathic_pd
17                            idiopathic_pd
18                            idiopathic_pd
19                            idiopathic_pd
20                            idiopathic_pd
21                            idiopathic_pd
22                            id

In [102]:
for table,df in v3_tables.items():
    schema = CDE[CDE['Table'] == table]
    valid_fields = schema['Field'].unique()
    df_out = df[valid_fields]
    aux_fields = set(df.columns) - set(valid_fields)
    if aux_fields:
        df_aux = df[list(aux_fields)]
        df_aux.to_csv(og_path / f"{table}_auxiliary.csv", index=False)
        print(f"Saved {table}_auxiliary.csv")
    df_out.to_csv(v3_path / f"{table}.csv", index=False)
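
The export loop assumes that every schema field is already a column of the table, since selecting a missing label with `df[valid_fields]` raises a `KeyError`. It also orders the auxiliary columns through a `set`, which does not preserve column order. A sketch of the same split with a deterministic column order follows. `split_by_schema` is a hypothetical helper, not part of `crn_utils`.

```python
import pandas as pd


def split_by_schema(df: pd.DataFrame, valid_fields) -> tuple:
    """Split df into (schema columns, auxiliary columns).

    Schema columns follow the order of `valid_fields`; auxiliary columns
    keep their original order. Schema fields absent from df are skipped.
    """
    valid = list(valid_fields)
    present = [f for f in valid if f in df.columns]
    aux = [c for c in df.columns if c not in set(valid)]
    return df[present], df[aux]


toy = pd.DataFrame({"lane": ["L000"], "batch": ["batch_x"], "sample_id": ["MFG_PD_0009"]})
out, aux = split_by_schema(toy, ["sample_id", "batch", "file_name"])
print(list(out.columns), list(aux.columns))
```

Skipping absent fields is a design choice for the sketch. Failing loudly, as the loop above does, may be preferable when the tables have already passed `validate_table`.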

### validate v3 tables


In [53]:
CDE = CDEv3
for table,df in v3_tables.items():
    schema = CDE[CDE['Table'] == table]

    report = ReportCollector(destination="NA")
    full_table, report = validate_table(df.copy(), table, schema, report)
    report.print_log()

recoding number_samples as int
All required fields are present in *STUDY* table.
🚨⚠️❗ **7 Fields with empty (NULL) values:**

	- submitter_email: 1/1 empty rows (REQUIRED)

	- other_funding_source: 1/1 empty rows (REQUIRED)

	- publication_DOI: 1/1 empty rows (REQUIRED)

	- publication_PMID: 1/1 empty rows (REQUIRED)

	- PI_ORCID: 1/1 empty rows (OPTIONAL)

	- PI_google_scholar_id: 1/1 empty rows (OPTIONAL)

	- alternate_dataset_id: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *PROTOCOL* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- other_reference: 1/1 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

All required fields are present in *SUBJECT* table.
🚨⚠️❗ **1 Fields with empty (NULL) values:**

	- primary_diagnosis_text: 23/25 empty rows (OPTIONAL)
No invalid entries found in Enum fields.

recoding replicate_count as int
recoding repeated_sample as int
All required fields are present in *SAMPLE* table.


-------------------------
## check md5s

We skip this check: the MD5s in DATA were read directly from the bucket with `gsutil hash` when the metadata was constructed, so they are assumed to be correct.
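
If local copies of the FASTQ files are available, the recorded hashes can still be verified independently. This is a minimal sketch using only the standard library. `md5_hex` and `verify_md5` are hypothetical helpers, and the sketch assumes `file_MD5` holds hex digests, as produced by the `gsutil hash -h` call above.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file, read in chunks so large FASTQs are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected: str) -> bool:
    """Compare a file's MD5 with the value recorded in the metadata."""
    return md5_hex(path) == expected.strip().lower()


# Demo on a throwaway file; real use would loop over DATA rows and local paths.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq.gz"
    demo.write_bytes(b"hello")
    ok = verify_md5(demo, "5d41402abc4b2a76b9719d911017c592")
    bad = verify_md5(demo, "00000000000000000000000000000000")

print(ok, bad)
```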


## Create metadata package


In [54]:
### Create metadata package

export_path = root_path / f"{team}"

create_upload_medadata_package(export_path, v3_tables)
v3_tables['SAMPLE']['condition_id']
v3_tables['CONDITION']

Unnamed: 0,condition_id,intervention_name,intervention_id,protocol_id,intervention_aux_table
0,no_pd_nor_other_neurological_disorder,Case-Control,Control,,
1,other_neurological_disorder,Case-Control,Other,,
2,idiopathic_pd,Case-Control,Case,,


_____