## Preparation of MRI data as training data for classification

**Input:**

- ABCD sMRI Part 1 (abcd_smrip10201.txt)
    - smri_sulc_cdk_... (sulcal depth)
    - smri_thick_cdk_... (thickness)
    - smri_area_cdk_... (area)
    - smri_vol_cdk_... (volume)
    - smri_vol_scs_... (subcortical)


- ABCD sMRI Part 2 (abcd_smrip20201.txt)
    - smri_t1wcnt_cdk (contrast)


- ABCD FreeSurfer QC (abcd_fsurfqc01.txt)
    - fsqc_qc (*1 = accept, 0 = reject*)


- ABCD dMRI DTI Part 1 (abcd_dti_p101.txt)
    - dmri_dtimd_fiberat_… (mean diffusivity) 
    
    
- ABCD dMRI Post Processing QC (abcd_dmriqc01.txt)
    - dmri_dti_postqc_qc (*1 = accept, 0 = reject*)
    
- ABCD Youth Pubertal Development Scale and Menstrual Cycle Survey History (PDMS) (abcd_ypdms01)
    - pds_f5_y (Have you begun to menstruate (started to have your period)?) 
        (*4 = Yes; 1 = No; 999 = I don't know; 777 = refuse to answer*)
        
**Output:**

- **mergedMRIforMenarcheSubs.csv:** All MRI and quality control files merged into one file. Contains only subjects for whom every type of data exists, i.e. every type of MRI data plus puberty questionnaire data.
- **MRIpredictorsPlusMeta.csv:** Dataframe containing only the MRI measures of interest (e.g. volume, area, contrast, extracted from the Desikan-Killiany atlas) together with metadata such as age and interview date. Subjects marked as excluded in the QC files and subjects with NAs in any column were removed (columns that were entirely null were dropped beforehand).
- **processedMRIDataMenarcheSubs.csv:** Processed MRI dataset, sorted by subject key and checked to contain the same subjects in the same order as the puberty dataset.
- **processedPubertyDataMenarcheSubs.csv:** Processed puberty dataset, sorted by subject key and checked to contain the same subjects in the same order as the MRI dataset.
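
The `pds_f5_y` coding listed under Input can be turned into a binary classification label along the following lines. This is only a sketch: the function name `menarche_label` and the choice to treat 999 and 777 as missing are assumptions, and the notebook itself reads an already processed puberty file.

```python
import numpy as np
import pandas as pd

def menarche_label(pds_f5_y):
    """Map pds_f5_y codes to 1.0 (4 = Yes), 0.0 (1 = No) or NaN (anything else).

    Hypothetical helper; treating 999/777 as missing is an assumption.
    """
    codes = pd.to_numeric(pds_f5_y, errors="coerce")  # tolerate string-typed codes
    labels = pd.Series(np.nan, index=codes.index)
    labels[codes == 4] = 1.0
    labels[codes == 1] = 0.0
    return labels

# Invented example values covering every documented code plus a missing entry
example = pd.Series(["4", "1", "999", "777", None])
labels = menarche_label(example)
```

Rows whose label is NaN would then have to be dropped or handled explicitly before training a classifier.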

In [None]:
import pandas as pd
import numpy as np
import seaborn as sn
from matplotlib import pyplot as plt 
import os

In [None]:
os.chdir('ABCDTabular\\')

In [None]:
## Read the ABCD MRI data (parcellated according to the Desikan-Killiany atlas) into dataframes.
## The first data row of each file holds the column descriptions: keep it as a lookup and drop it from the data.
sMRI1 = pd.read_csv('abcd_smrip10201.txt', sep=r'\s+')
dictsMRI1 = sMRI1.iloc[0]
sMRI1 = sMRI1.drop(index=0)
sMRI2 = pd.read_csv('abcd_smrip20201.txt', sep=r'\s+')
dictsMRI2 = sMRI2.iloc[0]
sMRI2 = sMRI2.drop(index=0)
DTI = pd.read_csv('abcd_dti_p101.txt', sep=r'\s+')
dictDTI = DTI.iloc[0]
DTI = DTI.drop(index=0)
#DTI
dictDTI
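
The read-then-split pattern above repeats for every table, so it could be wrapped in a small helper. The sketch below assumes the layout used in this notebook (a header row followed by a description row); `read_abcd_table` is a hypothetical name, and the tab-separated sample with its subject IDs is invented for illustration.

```python
import io
import pandas as pd

def read_abcd_table(source, sep=r"\s+"):
    """Return (data, descriptions) for a table whose first data row describes the columns.

    Hypothetical helper, not part of pandas or any ABCD tooling.
    """
    raw = pd.read_csv(source, sep=sep)
    descriptions = raw.iloc[0]   # first data row: human-readable column descriptions
    data = raw.drop(index=0)     # remaining rows: one row per subject and event
    return data, descriptions

# Invented two-subject sample in the same header + description layout
sample = io.StringIO(
    "subjectkey\teventname\tsmri_vol_cdk_total\n"
    '"Subject ID"\t"Event name"\t"Total volume"\n'
    '"SUBJ_A"\t"2_year_follow_up_y_arm_1"\t"1000.5"\n'
    '"SUBJ_B"\t"baseline_year_1_arm_1"\t"980.0"\n'
)
data, descriptions = read_abcd_table(sample, sep="\t")
```

Because the description row is text, pandas reads the measurement columns as strings here; `pd.to_numeric` can convert them afterwards if numeric dtypes are needed.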

In [None]:
## Read the sMRI (FreeSurfer) and dMRI quality control files
sMRI_QC = pd.read_csv('abcd_fsurfqc01.txt', sep=r'\s+')
dictsMRI_QC = sMRI_QC.iloc[0]
sMRI_QC = sMRI_QC.drop(index=0)
dMRI_QC = pd.read_csv('abcd_dmriqc01.txt', sep=r'\s+')
dictdMRI_QC = dMRI_QC.iloc[0]
dMRI_QC = dMRI_QC.drop(index=0)

In [None]:
## Read the ABCD's puberty data 
pubertyData = pd.read_csv('..\\processedData\\relevantMenarcheData2year.csv')

In [None]:
## Add a suffix to all columns of each df because some column names exist in all dfs
sMRI1 = sMRI1.add_suffix('_M1')
sMRI2 = sMRI2.add_suffix('_M2')
sMRI_QC = sMRI_QC.add_suffix('_MQC')
DTI = DTI.add_suffix('_D')
dMRI_QC = dMRI_QC.add_suffix('_DQC')
pubertyData = pubertyData.add_suffix('_P')

In [None]:
## Remove suffix from subjectkey to enable merging on this column
sMRI1 = sMRI1.rename(columns = {'subjectkey_M1':'subjectkey'})
sMRI2 = sMRI2.rename(columns = {'subjectkey_M2':'subjectkey'})
sMRI_QC = sMRI_QC.rename(columns = {'subjectkey_MQC':'subjectkey'})
DTI = DTI.rename(columns = {'subjectkey_D':'subjectkey'})
dMRI_QC = dMRI_QC.rename(columns = {'subjectkey_DQC':'subjectkey'})
pubertyData = pubertyData.rename(columns = {'subjectkey_P':'subjectkey'})
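
The two cells above (add a suffix everywhere, then restore `subjectkey`) can also be done in one step. A minimal sketch, assuming `subjectkey` is the only merge key; `suffix_except_key` is a hypothetical helper name and the toy values are invented.

```python
import pandas as pd

def suffix_except_key(df, suffix, key="subjectkey"):
    """Append `suffix` to every column name except the merge key (hypothetical helper)."""
    return df.rename(columns=lambda name: name if name == key else f"{name}{suffix}")

# Invented one-row table for illustration
toy = pd.DataFrame({"subjectkey": ["SUBJ_A"], "eventname": ["baseline"], "sex": ["F"]})
renamed = suffix_except_key(toy, "_M1")
```

For example, `sMRI1 = suffix_except_key(sMRI1, '_M1')` would give the same column names as the two-step version.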

In [None]:
## Save subjectkeys of all subjects for whom puberty data exists
subjectkeys = pubertyData['subjectkey']
## Make list out of subjectkeys
subjectkeys = subjectkeys.values.tolist()

In [None]:
## Reduce sMRI Part 1 data to those subjects for whom puberty data exists
sMRI1_reduced = sMRI1.loc[sMRI1['subjectkey'].isin(subjectkeys)]
## Count the rows per subject (one row per event) for inspection
sr = sMRI1_reduced['subjectkey']
srdf = sr.value_counts().reset_index()
## Reduce data to 2-year follow-up data
sMRI1_reduced = sMRI1_reduced[sMRI1_reduced['eventname_M1'] == '2_year_follow_up_y_arm_1']

In [None]:
## Reduce sMRI Part 2 data to those subjects for whom puberty data exists
sMRI2_reduced = sMRI2.loc[sMRI2['subjectkey'].isin(subjectkeys)]
sr2 = pd.Series(sMRI2_reduced['subjectkey'])
srdf2 = sr2.value_counts().reset_index()
## Reduce data to 2-year follow-up data
sMRI2_reduced = sMRI2_reduced[sMRI2_reduced['eventname_M2'] == '2_year_follow_up_y_arm_1']

In [None]:
## Reduce sMRI QC data to subjects for whom puberty data exists, then to the 2-year follow-up
sMRI_QC_reduced = sMRI_QC.loc[sMRI_QC['subjectkey'].isin(subjectkeys)]
sr3 = pd.Series(sMRI_QC_reduced['subjectkey'])
srdf3 = sr3.value_counts().reset_index()
sMRI_QC_reduced = sMRI_QC_reduced[sMRI_QC_reduced['eventname_MQC'] == '2_year_follow_up_y_arm_1']

In [None]:
## Do the same with DTI data
DTI_reduced = DTI.loc[DTI['subjectkey'].isin(subjectkeys)]
sr4 = pd.Series(DTI_reduced['subjectkey'])
srdf4 = sr4.value_counts().reset_index()
DTI_reduced = DTI_reduced[DTI_reduced['eventname_D'] == '2_year_follow_up_y_arm_1']

DTI_QC_reduced = dMRI_QC.loc[dMRI_QC['subjectkey'].isin(subjectkeys)]
sr5 = pd.Series(DTI_QC_reduced['subjectkey'])
srdf5 = sr5.value_counts().reset_index()
DTI_QC_reduced = DTI_QC_reduced[DTI_QC_reduced['eventname_DQC'] == '2_year_follow_up_y_arm_1']
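
The reduction cells above all apply the same two filters (known subject, 2-year follow-up event). A sketch of a shared helper follows, with an extra check that no subject appears twice after filtering; `restrict` is a hypothetical name and the toy rows are invented.

```python
import pandas as pd

def restrict(df, subject_keys, event_col, event="2_year_follow_up_y_arm_1"):
    """Keep rows of the given subjects at the given event (hypothetical helper)."""
    keep = df["subjectkey"].isin(subject_keys) & (df[event_col] == event)
    return df.loc[keep]

# Invented rows: SUBJ_A has two events, SUBJ_C has no puberty data
toy = pd.DataFrame({
    "subjectkey": ["SUBJ_A", "SUBJ_A", "SUBJ_B", "SUBJ_C"],
    "eventname_M1": [
        "baseline_year_1_arm_1",
        "2_year_follow_up_y_arm_1",
        "2_year_follow_up_y_arm_1",
        "2_year_follow_up_y_arm_1",
    ],
})
reduced = restrict(toy, ["SUBJ_A", "SUBJ_B"], "eventname_M1")

# After restricting to one event, each subject should appear at most once
has_duplicates = reduced["subjectkey"].duplicated().any()
```

A duplicate check like this is worth running before merging, because repeated keys would multiply rows in the merges that follow.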

In [None]:
## Match all dfs and drop subjects who are missing from at least one of the MRI modalities
dtiqc_subkeys = DTI_QC_reduced['subjectkey']
dtiqc_subkeys = dtiqc_subkeys.values.tolist()

dti_matched = DTI_reduced.loc[DTI_reduced['subjectkey'].isin(dtiqc_subkeys)]

dtimatch_subkeys = dti_matched['subjectkey']
dtimatch_subkeys = dtimatch_subkeys.values.tolist()

dti_qc_matched = DTI_QC_reduced.loc[DTI_QC_reduced['subjectkey'].isin(dtimatch_subkeys)]

smri_subkeys = sMRI1_reduced['subjectkey']
smri_subkeys = smri_subkeys.values.tolist()

sMRI2_matched = sMRI2_reduced.loc[sMRI2_reduced['subjectkey'].isin(smri_subkeys)]

smri_m_subkeys = sMRI2_matched['subjectkey']
smri_m_subkeys = smri_m_subkeys.values.tolist()

sMRI1_matched = sMRI1_reduced.loc[sMRI1_reduced['subjectkey'].isin(smri_m_subkeys)]

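## NOTE: smri_qc_subkey is computed but not used below. The QC table is filtered to the sMRI subjects,
## so sMRI subjects without a QC row are retained and end up with NaN in the QC column after the outer merges.
smri_qc_subkey = sMRI_QC_reduced['subjectkey']
smri_qc_subkey = smri_qc_subkey.values.tolist()

sMRI_QC_matched = sMRI_QC_reduced.loc[sMRI_QC_reduced['subjectkey'].isin(smri_m_subkeys)]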

sMRI_merged = sMRI1_matched.merge(sMRI2_matched, how='outer', on='subjectkey')
sMRI_merged_matched = sMRI_merged.loc[sMRI_merged['subjectkey'].isin(dtimatch_subkeys)]

MRI_merged = sMRI_merged_matched.merge(dti_matched, how='outer', on='subjectkey')

merged_keys = MRI_merged['subjectkey']
merged_keys = merged_keys.values.tolist()

sMRI_QC_matched = sMRI_QC_matched.loc[sMRI_QC_matched['subjectkey'].isin(merged_keys)]

dti_qc_matched = dti_qc_matched.loc[dti_qc_matched['subjectkey'].isin(merged_keys)]

QC_merged = sMRI_QC_matched.merge(dti_qc_matched, how = 'outer', on = 'subjectkey')

In [None]:
## combine all MRI and Quality control data into one df of only subjects who have all data
MRIandQCcomplete = MRI_merged.merge(QC_merged, how='outer', on='subjectkey')
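
An alternative to matching key lists by hand is a chain of inner merges, which keeps only subjects present in every table. This is a swapped-in technique, not what the cells above do: it is stricter, because it also drops subjects who lack a QC row. `merge_complete_cases` is a hypothetical helper and the toy tables are invented.

```python
from functools import reduce
import pandas as pd

def merge_complete_cases(tables, key="subjectkey"):
    """Inner-merge all tables on `key`; `validate` raises if a key is duplicated."""
    return reduce(
        lambda left, right: left.merge(right, on=key, how="inner", validate="one_to_one"),
        tables,
    )

# Invented toy tables: only SUBJ_B and SUBJ_C occur in all three
smri = pd.DataFrame({"subjectkey": ["SUBJ_A", "SUBJ_B", "SUBJ_C"], "vol_M1": [1.0, 2.0, 3.0]})
dti = pd.DataFrame({"subjectkey": ["SUBJ_B", "SUBJ_C", "SUBJ_D"], "md_D": [0.7, 0.8, 0.9]})
qc = pd.DataFrame({"subjectkey": ["SUBJ_C", "SUBJ_B"], "fsqc_qc_MQC": ["1", "0"]})

complete = merge_complete_cases([smri, dti, qc])
```

Whether the stricter behaviour is wanted depends on how subjects without QC information should be treated.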

In [None]:
#MRIandQCcomplete.to_csv('D:\\Studium\\Master\\Masterarbeit\\ABCD_Data\\mergedMRIforMenarcheSubs.csv', index = False)

In [None]:
columnNames = MRIandQCcomplete.columns.values.tolist()

In [None]:
columnNames

In [None]:
## relevant metadata
relevantMRIdata = MRIandQCcomplete[['subjectkey','interview_date_M1',
                                    'interview_age_M1','sex_M1','eventname_M1','fsqc_qc_MQC','dmri_dti_postqc_qc_DQC']]

In [None]:
relevantMRIdata['dmri_dti_postqc_qc_DQC'].unique()

In [None]:
exc1 = relevantMRIdata.groupby(['dmri_dti_postqc_qc_DQC']).size().reset_index(name='count')
exc1
## 46 exclude cases, 95 accept, rest NaN

In [None]:
exc2 = relevantMRIdata.groupby(['fsqc_qc_MQC']).size().reset_index(name='count')
exc2
## 22 exclude cases, 67 accept, rest NaN
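
`groupby(...).size()` leaves out rows whose key is NaN by default, so the "rest NaN" share noted in the comments has to be worked out separately. A small sketch that counts missing values alongside the QC codes, using invented values:

```python
import numpy as np
import pandas as pd

# Invented QC column: two accepts, one reject, two missing
qc = pd.Series(["1", "0", "1", np.nan, np.nan], name="fsqc_qc_MQC")

counts = qc.value_counts(dropna=False)  # NaN gets its own row in the result
n_missing = int(qc.isna().sum())
```

Applied to the notebook, `relevantMRIdata['fsqc_qc_MQC'].value_counts(dropna=False)` would report accept, reject and missing counts in one table.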

In [None]:
## extract MRI data by feature type
sulcal = MRIandQCcomplete.filter(regex=".*smri_sulc_cdk.*")
sulcal['subjectkey'] = MRIandQCcomplete['subjectkey']
thickness = MRIandQCcomplete.filter(regex=".*smri_thick_cdk.*")
thickness['subjectkey'] = MRIandQCcomplete['subjectkey']
area = MRIandQCcomplete.filter(regex=".*smri_area_cdk.*")
area['subjectkey'] = MRIandQCcomplete['subjectkey']
volume = MRIandQCcomplete.filter(regex=".*smri_vol_cdk.*")
volume['subjectkey'] = MRIandQCcomplete['subjectkey']
subcortical = MRIandQCcomplete.filter(regex=".*smri_vol_scs.*")
subcortical['subjectkey'] = MRIandQCcomplete['subjectkey']

In [None]:
M1 = [sulcal,thickness,area,volume,subcortical]
from functools import reduce
allM1 = reduce(lambda left,right: pd.merge(left,right,on=['subjectkey'], how='outer'), M1)

In [None]:
contrast = MRIandQCcomplete.filter(regex = '.*smri_t1wcnt_cdk.*')
contrast['subjectkey'] = MRIandQCcomplete['subjectkey']

In [None]:
meanDiffusivity = MRIandQCcomplete.filter(regex = '.*dmri_dtimd_fiberat.*')
meanDiffusivity['subjectkey'] = MRIandQCcomplete['subjectkey']

In [None]:
MRIpredictors = pd.merge(allM1, pd.merge(contrast, meanDiffusivity, on='subjectkey'), how='outer', on='subjectkey')
MRIpredictorsPlusMeta = pd.merge(MRIpredictors,relevantMRIdata, how='outer',on='subjectkey')
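
Since all feature groups come from the same merged dataframe, they could also be selected with one regular expression instead of being filtered separately and merged again. The sketch below assumes the column prefixes listed under Input; the region parts of the toy column names (`_x_`, `_y_`, `_z_`) are invented placeholders.

```python
import pandas as pd

# One alternation covering every feature prefix used in this notebook
FEATURE_PATTERN = (
    r"smri_(?:sulc|thick|area|vol)_cdk|smri_vol_scs|smri_t1wcnt_cdk|dmri_dtimd_fiberat"
)

# Invented one-row table with placeholder region names
toy = pd.DataFrame({
    "subjectkey": ["SUBJ_A"],
    "smri_thick_cdk_x_M1": [2.5],
    "smri_vol_scs_y_M1": [100.0],
    "smri_t1wcnt_cdk_x_M2": [0.2],
    "dmri_dtimd_fiberat_z_D": [0.8],
    "interview_age_M1": [140],
})

features = toy.filter(regex=FEATURE_PATTERN)  # matches anywhere in the column name
predictors = pd.concat([toy[["subjectkey"]], features], axis=1)
```

Unlike the merge-based version, this keeps the columns in their original order rather than grouping them by feature type.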

In [None]:
## Exclude subjects marked as rejected (0) in either QC variable; subjects with a missing QC value are kept
MRIpredictorsPlusMeta = MRIpredictorsPlusMeta[
    (MRIpredictorsPlusMeta['fsqc_qc_MQC'] != '0')
    & (MRIpredictorsPlusMeta['dmri_dti_postqc_qc_DQC'] != '0')
]
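
The comparison against the string `'0'` relies on the QC columns holding strings. If pandas were to infer mixed types for such a column, an integer `0` would slip through. A sketch of a dtype-independent variant that converts to numbers first; `passes_qc` is a hypothetical helper and the toy rows are invented.

```python
import numpy as np
import pandas as pd

def passes_qc(df, qc_cols):
    """True unless any QC column equals 0 (reject); missing QC values pass.

    Hypothetical helper that mirrors the notebook's rule of keeping NaN rows.
    """
    qc = df[qc_cols].apply(pd.to_numeric, errors="coerce")
    return ~(qc == 0).any(axis=1)

# Invented rows: string reject, integer reject and a missing value
toy = pd.DataFrame({
    "subjectkey": ["SUBJ_A", "SUBJ_B", "SUBJ_C", "SUBJ_D"],
    "fsqc_qc_MQC": ["1", "0", 0, np.nan],
    "dmri_dti_postqc_qc_DQC": ["1", "1", "1", "1"],
})
kept = toy[passes_qc(toy, ["fsqc_qc_MQC", "dmri_dti_postqc_qc_DQC"])]
```

Here both the string `"0"` and the integer `0` are treated as rejects, while the row with a missing QC value is retained.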

In [None]:
## check for missing values
NaNList = MRIpredictorsPlusMeta.isna().sum().reset_index()
NaNList = NaNList[NaNList[0] != 0]
NaNList

In [None]:
## drop columns that are NaN for every subject
MRIpredictorsPlusMeta = MRIpredictorsPlusMeta.drop(columns=['smri_vol_scs_lesionlh_M1','smri_vol_scs_lesionrh_M1'])
## drop the QC columns as well; their information has already been used for the exclusion above
MRIpredictorsPlusMeta = MRIpredictorsPlusMeta.drop(columns=['fsqc_qc_MQC','dmri_dti_postqc_qc_DQC'])
## drop subjects with any remaining missing value (those without DTI mean diffusivity data)
MRIpredictorsPlusMeta = MRIpredictorsPlusMeta.dropna()
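
The all-NaN columns are named by hand above. They could instead be detected from the data, which avoids hard-coding column names. A sketch on an invented toy table (the `_z_` region name is a placeholder):

```python
import numpy as np
import pandas as pd

# Invented rows: one column is entirely NaN, one subject lacks a DTI value
toy = pd.DataFrame({
    "subjectkey": ["SUBJ_A", "SUBJ_B", "SUBJ_C"],
    "smri_vol_scs_lesionlh_M1": [np.nan, np.nan, np.nan],
    "dmri_dtimd_fiberat_z_D": [0.8, np.nan, 0.7],
})

# Columns that are NaN for every subject carry no information
all_nan_cols = toy.columns[toy.isna().all()].tolist()

# Drop those columns first, then drop subjects with any remaining NaN
cleaned = toy.drop(columns=all_nan_cols).dropna()
```

The order matters: calling `dropna()` before removing the all-NaN columns would discard every subject.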

In [None]:
#MRIpredictorsPlusMeta = MRIpredictorsPlusMeta.reset_index(drop=True)

In [None]:
#MRIpredictorsPlusMeta.to_csv('../MRIpredictorsPlusMeta.csv', index=False)

In [None]:
## get subjects' puberty data
subjectkeys = MRIpredictorsPlusMeta['subjectkey']
subjectkeys = subjectkeys.values.tolist()
pubertyData = pubertyData.loc[pubertyData['subjectkey'].isin(subjectkeys)]
pub = pubertyData.groupby(['pds_f5_y_P']).size().reset_index(name='count')
pub

In [None]:
#pubertyData = pubertyData.reset_index(drop=True)

In [None]:
## sort puberty and MRI data by subjectkey
pubDf = pubertyData.sort_values(by = ['subjectkey'])
mriDf = MRIpredictorsPlusMeta.sort_values(by = ['subjectkey'])

In [None]:
pubDf.to_csv('processedData\\processedPubertyDataMenarcheSubs.csv', index = False)
mriDf.to_csv('processedData\\processedMRIDataMenarcheSubs.csv', index = False)
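
Sorting both dataframes by `subjectkey` only aligns them row by row if they contain exactly the same subjects. A short check could make that explicit before the files are used as features and labels; `same_subjects_same_order` is a hypothetical helper and the toy tables are invented.

```python
import pandas as pd

def same_subjects_same_order(left, right, key="subjectkey"):
    """True if both frames list the same keys in the same row order (hypothetical helper)."""
    return left[key].reset_index(drop=True).equals(right[key].reset_index(drop=True))

# Invented toy tables, sorted the same way as in the notebook
pub = pd.DataFrame({"subjectkey": ["SUBJ_B", "SUBJ_A"], "pds_f5_y_P": [4, 1]}).sort_values("subjectkey")
mri = pd.DataFrame({"subjectkey": ["SUBJ_A", "SUBJ_B"], "vol": [1.0, 2.0]}).sort_values("subjectkey")

aligned = same_subjects_same_order(pub, mri)
```

In the notebook this would read `assert same_subjects_same_order(pubDf, mriDf)`, placed just before the two `to_csv` calls.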