# Neonatal brain measure estimation from dHCP data
As dHCP is under data protection, no data is included in this repository. The code is provided as a reference for the analysis of the dHCP data. To reproduce the analyses, you need to apply for access to the dHCP dataset [here](https://nda.nih.gov/edit_collection.html?id=3955) and parcellate the data with the [MCRIBS atlas](https://github.com/DevelopmentalImagingMCRI/MCRIBS). Change the directory `mcribs_data_dir` accordingly. Data can be requested from the authors upon reasonable request.

I received parcellated data from my collaborators C. Adamson, G. Ball, and J. Seidlitz.

Run this whole notebook once per brain measure, either cortical thickness (CT) or surface area (SA). Both measures were used in the original publication. Set the variable `brain_measure` to `CT` or `SA` accordingly.

In [1]:
%load_ext autoreload
%autoreload 2

# Load MRI related data
## Rearrange individual files
The output generated by MCRIBS is similar to FreeSurfer's `?h.aparc.stats` files. First, I rearranged them as `aparcstats2table` would (i.e., into one file containing all subjects' `brain_measure` information).
For measures of cerebral tissue volume, I have actually used asegstats2table, a FreeSurfer command. If you want to do so too, please make sure that FreeSurfer is installed on your machine and that the environment variable `FREESURFER_HOME` is correctly set (see [here](https://surfer.nmr.mgh.harvard.edu/fswiki/DownloadAndInstall)). Otherwise, just load the combined `aseg` file below.
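
The rearrangement described above (one table holding every subject's regional values) can be sketched as follows. This is a minimal illustration, not the repository's `dhcp_aparcstats2table`: the function names `read_aparc_stats` and `stats_to_table`, the `CT`/`SA` column mapping, and the `ctx-lh-` column prefix are assumptions, and the input is assumed to follow the standard FreeSurfer `?h.aparc.stats` column layout.

```python
import io
import pandas as pd

# Column layout of a FreeSurfer-style ?h.aparc.stats table (its "# ColHeaders" line).
COLS = ["StructName", "NumVert", "SurfArea", "GrayVol", "ThickAvg",
        "ThickStd", "MeanCurv", "GausCurv", "FoldInd", "CurvInd"]
# Assumed mapping from this notebook's `brain_measure` to a stats column.
MEASURE_COL = {"CT": "ThickAvg", "SA": "SurfArea"}


def read_aparc_stats(handle, measure):
    """Return one region-indexed Series for a single subject and hemisphere."""
    # Header lines start with '#', so they are skipped as comments.
    table = pd.read_csv(handle, comment="#", sep=r"\s+", header=None, names=COLS)
    return table.set_index("StructName")[MEASURE_COL[measure]]


def stats_to_table(stats_by_subject, measure, hemi):
    """Stack per-subject Series into a subjects x regions table."""
    rows = {sub: read_aparc_stats(h, measure) for sub, h in stats_by_subject.items()}
    wide = pd.DataFrame(rows).T
    # Hypothetical naming scheme; the real helper may label columns differently.
    wide.columns = [f"ctx-{hemi}-{name}" for name in wide.columns]
    return wide


# Two tiny in-memory files stand in for real *lh.aparc.stats files.
demo = """# ColHeaders StructName NumVert SurfArea GrayVol ThickAvg ThickStd MeanCurv GausCurv FoldInd CurvInd
bankssts 100 70 150 2.10 0.30 0.10 0.02 1 0.1
cuneus 200 140 300 1.80 0.25 0.12 0.03 2 0.2
"""
files = {"sub-A-ses-1": io.StringIO(demo), "sub-B-ses-2": io.StringIO(demo)}
table = stats_to_table(files, "CT", "lh")
print(table)
```

With real data, the dictionary values would be file paths instead of `StringIO` objects, since `pd.read_csv` accepts both.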

In [2]:
import os
from os.path import join
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import subprocess
from abagen import fetch_desikan_killiany

import sys
sys.path.append('code')
from preprocessing import dhcp_aparcstats2table, strip_dhcp_id, clean_line, read_dhcp_18months, reorder_vars


mcribs_data_dir = 'data/dHCP/dhcp-MCRIBS'  # where MCRIBS parcellated data are saved

dhcp_data_dir = 'data/dHCP'
dhcp_freesurfer_outputs = join(dhcp_data_dir, 'freesurfer')  # where all freesurfer outputs are stored
dhcp_out = join(dhcp_data_dir, 'derivatives')  # where the adapted data will be stored
os.makedirs(dhcp_out, exist_ok=True)


# adjust brain measurement for which the code should be run
brain_measure = 'CT'  # CT or SA

In [3]:
# lh
csv_files = glob.glob(mcribs_data_dir + "/*lh.aparc.stats")  # find all files with left hemisphere data
aparc_lh = dhcp_aparcstats2table(csv_files, brain_measure)

# rh
csv_files = glob.glob(mcribs_data_dir + "/*rh.aparc.stats")
aparc_rh = dhcp_aparcstats2table(csv_files, brain_measure)

mri_data = pd.concat([aparc_lh, aparc_rh], axis=1)
mri_data.index.name = 'participant'
mri_data.columns.name = ''

# separate session and participant id
mri_data = strip_dhcp_id(mri_data)

# get region names of the Desikan-Killiany atlas as now used in the dHCP data
dhcp_idps_idx = mri_data.filter(regex='^ctx-').columns
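
Separating the combined identifier into participant and session could look roughly like the sketch below. It is not the repository's `strip_dhcp_id`; the function name `split_dhcp_id` is hypothetical, and the `sub-<id>-ses-<number>` pattern and the integer type of `Session_ID` are assumptions based on the file names seen in this notebook.

```python
import pandas as pd


def split_dhcp_id(df):
    """Split an index like 'sub-CC00597XX21-ses-190200' into two ID columns."""
    # Named groups become the column names of the extracted frame.
    ids = df.index.to_series().str.extract(
        r"^sub-(?P<Subject_ID>[^-]+)-ses-(?P<Session_ID>\d+)"
    )
    out = pd.concat([ids, df], axis=1).reset_index(drop=True)
    # Assumption: session IDs are stored as integers in the metadata file.
    out["Session_ID"] = out["Session_ID"].astype(int)
    return out


demo = pd.DataFrame(
    {"ctx-lh-cuneus": [1.8, 1.9]},
    index=["sub-CC00597XX21-ses-190200", "sub-CC00649XX23-ses-191201"],
)
result = split_dhcp_id(demo)
print(result)
```

Keeping both IDs as separate columns is what allows the later merges on `['Subject_ID', 'Session_ID']`; the two key columns must have matching types in every table being merged.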

In [4]:
# access DK atlas and save idp_labels
atlas = fetch_desikan_killiany(surface=True)
atlas = pd.read_csv(atlas['info'])
atlas = atlas[(atlas['structure'] == 'cortex') & (atlas['hemisphere'] == 'L')]
atlas_labels_l = ['L_'+label for label in atlas['label']]
atlas_labels_r = ['R_'+label for label in atlas['label']]

desikan_idps = atlas_labels_l + atlas_labels_r
print(f"Number of IDPs: {len(dhcp_idps_idx)}")

Number of IDPs: 68


In [5]:
# employ freesurfer's asegstats2table to combine CTV for all subjects into one csv file
subprocess.run([
    "./code/dhcp_asegstats2table.sh",
    mcribs_data_dir,
    dhcp_data_dir
], check=True)

aseg = pd.read_csv(join(dhcp_data_dir, 'dhcp_aseg_mcribs.txt'), sep='\t', index_col=0)
aseg = aseg[['SubCortGrayVol', 'CerebralWhiteMatterVol', 'CortexVol']]
# separate session and participant id
aseg = strip_dhcp_id(aseg)

cp: data/dHCP/dhcp-MCRIBS/sub-CC00597XX21-ses-190200_aseg.stats: No such file or directory
cp: data/dHCP/dhcp-MCRIBS/sub-CC00649XX23-ses-191201_aseg.stats: No such file or directory
./code/dhcp_asegstats2table.sh: line 35: 40378 Killed: 9               asegstats2table --subjectsfile=${MCRIBS_DIR}/subjectlist.txt -t ${DATA_DIR}/dhcp_aseg_mcribs.txt --skip --all-segs


In [6]:
# meta
meta = pd.read_csv(os.path.join(dhcp_data_dir, 'dHCP_add_info_data_release2.csv'), sep=';')

## Adjust and rename variables

In [7]:
# merge 
dhcp = mri_data.merge(aseg, on=['Subject_ID', 'Session_ID'])
# dhcp = dhcp.merge(meta, on=['Subject_ID', 'Session_ID'])

# rename variables
dhcp = dhcp.rename(columns=dict(
    SubCortGrayVol="sGMV",
    CerebralWhiteMatterVol="WMV",
    CortexVol="GMV"
))  
dhcp = dhcp.rename(columns=dict(zip(dhcp_idps_idx, desikan_idps)))

## Filter for term-equivalent scans

In [8]:
# select preterm subjects' scans at term-equivalent age (typically the second of two scans: one at birth, one at term-equivalent age)
# .copy() makes an independent DataFrame, so adding the 'dx' column does not trigger a SettingWithCopyWarning
preterm_scan_2 = meta[(meta['birth_age'] < 37) & (meta['scan_age'] >= 37)].copy()
preterm_scan_2['dx'] = 'preterm'
print("{0} preterm subjects have a scan at term-equivalent age.".format(preterm_scan_2.shape[0]))

# make sure that there are no subjects with 2 term-equivalent scans
print("Number of duplicates: {0}".format(preterm_scan_2.Subject_ID.duplicated().sum()))

# select first scan for term subjects
CN_scan_1 = meta[(meta['birth_age'] >= 37) & (meta['scan_number'] == 1)].copy()
CN_scan_1['dx'] = 'CN'
print("{0} term subjects are selected.".format(CN_scan_1.shape[0]))

# combine the two
filtered_meta = pd.concat([preterm_scan_2, CN_scan_1])
filtered_meta.head()
print("Final number of subjects should be: {0}.".format(filtered_meta.shape[0]))

92 preterm subjects have a scan at term-equivalent age.
Number of duplicates: 0
377 term subjects are selected.
Final number of subjects should be: 469.


In [9]:
# combine all information
dhcp = dhcp.merge(filtered_meta, on=['Subject_ID', 'Session_ID'])
print('Dataframe shape:', dhcp.shape)

Dataframe shape: (467, 82)
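
The merged table has 467 rows although 469 scans were selected, because the default inner merge silently drops scans without matching imaging rows (possibly the two sessions whose `aseg.stats` files were reported missing earlier). A hedged way to list such scans is a left merge with `indicator=True`; the toy frames and the names `imaging`, `selected`, `check`, and `missing` below are illustrative only.

```python
import pandas as pd

keys = ["Subject_ID", "Session_ID"]

# Toy stand-ins for the imaging table and the filtered metadata.
imaging = pd.DataFrame(
    {"Subject_ID": ["A", "B"], "Session_ID": [1, 2], "GMV": [10.0, 12.0]}
)
selected = pd.DataFrame(
    {"Subject_ID": ["A", "B", "C"], "Session_ID": [1, 2, 3],
     "dx": ["CN", "CN", "preterm"]}
)

# A left merge keeps every selected scan; `_merge` records where each row came from.
check = selected.merge(imaging, on=keys, how="left", indicator=True)

# Rows marked 'left_only' were selected but have no imaging data.
missing = check.loc[check["_merge"] == "left_only", keys]
print(missing)
```

Applied to the notebook's data, the same pattern would use `filtered_meta` on the left and the imaging table on the right.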


In [10]:
# reorder variables
dhcp = reorder_vars(['Subject_ID', 'Session_ID', 'dx','gender',
        'birth_age', 'birth_weight', 'singleton', 'scan_age', 'scan_number',
        'radiology_score', 'sedation'], dhcp, desikan_idps)

## Average hemispheres

In [11]:
regions = [r[2:] for r in desikan_idps[:34]]

# Combine L_ and R_ values
for region in regions:
    dhcp[f'{brain_measure}_{region}'] = dhcp[[f'L_{region}', f'R_{region}']].mean(axis=1)
    # drop L_ and R_ columns
    dhcp = dhcp.drop(columns=[f'L_{region}', f'R_{region}'])
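
A small worked example of the averaging step above, using made-up values for a single region. Note that `mean(axis=1)` skips missing values by default, so a subject with only one hemisphere available keeps that hemisphere's value rather than becoming `NaN`.

```python
import numpy as np
import pandas as pd

brain_measure = "CT"
region = "cuneus"

# Made-up values; the third subject lacks a left-hemisphere measurement.
demo = pd.DataFrame({
    "L_cuneus": [1.0, 2.0, np.nan],
    "R_cuneus": [3.0, 4.0, 5.0],
})

# Same operation as in the loop: average left and right, then drop the originals.
demo[f"{brain_measure}_{region}"] = demo[[f"L_{region}", f"R_{region}"]].mean(axis=1)
demo = demo.drop(columns=[f"L_{region}", f"R_{region}"])
print(demo)
```

If a one-hemisphere average is not desired, passing `skipna=False` to `mean` would yield `NaN` for that subject instead.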

In [12]:
# create region list bilateral
desikan_idps_bilateral = [f'{brain_measure}_{region}' for region in regions]
ctv_columns = ['GMV', 'WMV', 'sGMV']

# Adapt df for BrainChart framework
BrainChart expects the data in a specific format, so we adapt the dataframe accordingly. More information can be found [here](https://brainchart.shinyapps.io/brainchart/).

In [13]:
# adapt for brainchart
dhcp = dhcp.rename(columns=dict(
    Subject_ID="participant",
    gender="sex",
    birth_age="GA",
    birth_weight="BW"
)) 

# age
dhcp['age_days'] = dhcp['scan_age'] * 7  # because scan_age is in weeks
dhcp['Age'] = (dhcp['age_days']-280) / 365.245  # from BrainChart instructions

# other required info
dhcp['study'] = 'dHCP_NewEstimation'  # as dHCP was used for original model fitting, we want to make sure that random effects of study are calculated again
dhcp['fs_version'] = 'Custom' 
dhcp['country'] = 'Multisite'
dhcp['run'] = 1
dhcp['session'] = 1

# reshape
all_idps = desikan_idps_bilateral + ctv_columns
dhcp_final = reorder_vars(['participant', 'Age', 'age_days', 'sex', 'study', 'fs_version','country', 'run', 
                            'session', 'dx'], dhcp, all_idps)
dhcp_final.drop(columns=['scan_age', 'singleton', 'scan_number', 'radiology_score', 'sedation'], inplace=True)

# save
print(dhcp_final.shape)
dhcp_final.to_csv(os.path.join(dhcp_out, f'dhcp_{brain_measure}_preprocessed.csv'), index=False)
dhcp_final.columns

(467, 51)


Index(['participant', 'Age', 'age_days', 'sex', 'study', 'fs_version',
       'country', 'run', 'session', 'dx', 'Session_ID', 'GA', 'BW',
       'CT_bankssts', 'CT_caudalanteriorcingulate', 'CT_caudalmiddlefrontal',
       'CT_cuneus', 'CT_entorhinal', 'CT_fusiform', 'CT_inferiorparietal',
       'CT_inferiortemporal', 'CT_isthmuscingulate', 'CT_lateraloccipital',
       'CT_lateralorbitofrontal', 'CT_lingual', 'CT_medialorbitofrontal',
       'CT_middletemporal', 'CT_parahippocampal', 'CT_paracentral',
       'CT_parsopercularis', 'CT_parsorbitalis', 'CT_parstriangularis',
       'CT_pericalcarine', 'CT_postcentral', 'CT_posteriorcingulate',
       'CT_precentral', 'CT_precuneus', 'CT_rostralanteriorcingulate',
       'CT_rostralmiddlefrontal', 'CT_superiorfrontal', 'CT_superiorparietal',
       'CT_superiortemporal', 'CT_supramarginal', 'CT_frontalpole',
       'CT_temporalpole', 'CT_transversetemporal', 'CT_insula', 'sGMV', 'GMV',
       'WMV', 'sGMV'],
      dtype='object')
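
The age conversion used above can be checked with a short worked example. Since 280 days equal 40 weeks, the formula yields `Age = 0` for a scan at 40 weeks, negative values for earlier scans, and positive values for later ones. The helper name `brainchart_age` is hypothetical; the constants are the ones used in the cell above.

```python
def brainchart_age(scan_age_weeks):
    """Convert scan age in weeks to (age_days, Age in years relative to 40 weeks)."""
    age_days = scan_age_weeks * 7           # weeks -> days
    age = (age_days - 280) / 365.245        # 280 days = 40 weeks
    return age_days, age


for weeks in (37, 40, 44):
    days, age = brainchart_age(weeks)
    print(f"{weeks} weeks -> {days} days -> Age = {age:.4f} years")
```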

# Stats
Summary statistics are shown in Supplementary Table S1.

In [14]:
print("Overall number of unique subjects with longitudinal data: n =", len(dhcp_final))
dhcp_final['age_weeks'] = dhcp_final['age_days'] / 7
dhcp_final['BW'] = dhcp_final['BW'] * 1000  # convert to g

dhcp_final_pt = dhcp_final[dhcp_final['dx'] == 'preterm']
dhcp_final_cn = dhcp_final[dhcp_final['dx'] == 'CN']

print('Preterm stats: n = ', len(dhcp_final_pt))
print(dhcp_final_pt['sex'].value_counts())
display(dhcp_final_pt[['age_weeks', 'GA', 'BW']].describe().round(2))

Overall number of unique subjects with longitudinal data: n = 467
Preterm stats: n =  92
Male      55
Female    37
Name: sex, dtype: int64


| | age_weeks | GA | BW |
|---|---|---|---|
| count | 92.0 | 92.0 | 92.0 |
| mean | 40.85 | 32.21 | 1803.64 |
| std | 2.06 | 3.55 | 771.28 |
| min | 37.0 | 24.29 | 540.0 |
| 25% | 39.71 | 29.32 | 1185.0 |
| 50% | 40.86 | 32.43 | 1715.0 |
| 75% | 42.0 | 35.46 | 2400.0 |
| max | 45.14 | 36.86 | 4100.0 |


In [15]:
print('Full-term stats: n = ', len(dhcp_final_cn))
print(dhcp_final_cn['sex'].value_counts())
display(dhcp_final_cn[['age_weeks', 'GA', 'BW']].describe().round(2))

Full-term stats: n =  375
Male      204
Female    171
Name: sex, dtype: int64


| | age_weeks | GA | BW |
|---|---|---|---|
| count | 375.0 | 375.0 | 375.0 |
| mean | 41.08 | 39.96 | 3362.71 |
| std | 1.72 | 1.23 | 548.86 |
| min | 37.43 | 37.0 | 0.0 |
| 25% | 39.86 | 39.14 | 3000.0 |
| 50% | 41.0 | 40.14 | 3400.0 |
| 75% | 42.14 | 40.86 | 3730.0 |
| max | 44.86 | 42.29 | 4800.0 |
