# Child brain measure estimation from ABCD data
As ABCD is under data protection, no data is included in this repository. The code is provided as a reference for the analysis of the ABCD data. To reproduce the analyses, you need to apply for access to the ABCD dataset, download the dataset for [release 4.0](https://nda.nih.gov/general-query.html?q=query=featured-datasets:Adolescent%20Brain%20Cognitive%20Development%20Study%20(ABCD)) and adjust the file path of `abcd_data_dir` in the code below.

This notebook relies heavily on the work by [Leon D. Lotter](https://github.com/LeonDLotter/CTdev/blob/main/0.3_getData_ABCD.ipynb). 


## Setup and installations
This notebook requires R to be installed on your computer. You can download R from [here](https://cran.r-project.org/) and follow the installation instructions provided there.
Sometimes the Jupyter notebook cannot find the R installation. In this case, you can set the path to your R installation manually and install the required packages by uncommenting the code below.
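
If you are unsure which directory to use for `R_HOME`, one option is to ask R itself. The following is a minimal sketch, assuming the `R` executable is on your `PATH`; the helper name `find_r_home` is hypothetical and not part of rpy2.

```python
import shutil
import subprocess


def find_r_home():
    """Return the directory printed by `R RHOME`, or None if R is not on PATH.

    Hypothetical helper for illustration only.
    """
    r_executable = shutil.which("R")
    if r_executable is None:
        return None
    # `R RHOME` prints the R home directory to stdout
    result = subprocess.run(
        [r_executable, "RHOME"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


r_home = find_r_home()
print(r_home if r_home else "R not found on PATH")
```

The printed directory can then be assigned to `os.environ['R_HOME']` before loading `rpy2.ipython`.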

In [28]:
# Uncomment for bug fixing R installation: manually set path to your R installation
# import os
# os.environ['R_HOME'] = '/Library/Frameworks/R.framework/R'  # change to directory of your R installation

In [29]:
%load_ext autoreload
%autoreload 2
# Use rpy2 to run R in python notebook for running longCombat
%load_ext rpy2.ipython

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [30]:
# Uncomment for bug fixing: manually install required R packages
# (when uncommenting, %%R has to be the first line of the cell)
# %%R 
# install.packages(c("ggplot2", "gamlss", "tidyverse", "devtools", "invgamma"))
# devtools::install_github("jcbeer/longCombat")

# Data preparation
Run this whole notebook once per brain measure, i.e., for cortical thickness (CT) or surface area (SA). Both measures were used in the original publication. Set the variable `brain_measure` to `CT` or `SA` accordingly. 

In [31]:
import os
from os.path import join
import pandas as pd
import numpy as np
from scipy.stats import iqr

from functools import reduce
from abagen import fetch_desikan_killiany

# import custom functions
import sys
sys.path.append('code')
from utils import reorder_vars, na


abcd_data_dir = 'data/ABCD'  # adapt to your folder structure
abcd_out = join(abcd_data_dir, 'derivatives')  # where the prepared data will be stored
os.makedirs(abcd_out, exist_ok=True)


# adjust brain measurement for which the code should be run
brain_measure = 'CT'  # CT or SA

# Load MRI related data
- `abcd_smrip10201`: contains sMRI Desikan-Killiany-based CT and SA measures
- `abcd_fsurfqc01`: FreeSurfer quality control
- `abcd_imgincl01`: ABCD recommended imaging inclusion; 1 or 0
- `abcd_auto_postqc01`: automated post-processing QC measures
- `abcd_lt01`: longitudinal tracking data; used here for the acquisition site (`site_id_l`)
- prematurity-related items following [this paper](https://doi.org/10.1001/jamapsychiatry.2020.2902):
    - `devhx_ss_12_p` from file `abcd_devhxss01.txt`: "how many weeks premature was your baby born?"; subtract this variable from 40 to get gestational age in weeks
- `abcd_tbss01`: NIH Toolbox Cognition Battery composite scores (following [this paper](https://doi.org/10.1093/cercor/bhab403))
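
The tables are read with `skiprows=[1]` in the cells that follow. A minimal sketch of why, assuming (as the code in this notebook implies) that the second row of each ABCD text file holds variable descriptions rather than data; the subject keys and values are made up:

```python
import io

import pandas as pd

# toy stand-in for an ABCD table: row 0 = variable names, row 1 = descriptions
raw = (
    "subjectkey\teventname\tsmri_thick_cdk_mean\n"
    "Subject ID\tEvent name\tMean cortical thickness\n"
    "SUBJ_A\tbaseline_year_1_arm_1\t2.71\n"
    "SUBJ_B\tbaseline_year_1_arm_1\t2.65\n"
)

# without skiprows, the description row is parsed as data and the column is not numeric
with_descriptions = pd.read_csv(io.StringIO(raw), sep="\t")

# skiprows=[1] drops the description row, so the measurement column is parsed as float
table = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=[1])

print(with_descriptions.dtypes)
print(table.dtypes)
```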


In [32]:
# get ROI names of DK regions
mri_data = pd.read_csv(os.path.join(abcd_data_dir, 'r4.0', 'abcd_smrip10201.txt'), sep='\t', skiprows=[1])

# extract brain measure in DK atlas
if brain_measure == 'CT':
    print("Using brain measure: Cortical Thickness")
    mri_data.drop(columns=['smri_thick_cdk_mean', 'smri_thick_cdk_meanlh', 'smri_thick_cdk_meanrh'], inplace=True)
    abcd_idps_idx = mri_data.filter(regex='^smri_thick_cdk').columns.to_list()
    print(f"Number of IDPs: {len(abcd_idps_idx)}")
    
elif brain_measure == 'SA':
    mri_data.drop(columns=['smri_area_cdk_total', 'smri_area_cdk_totallh', 'smri_area_cdk_totalrh'], inplace=True)
    abcd_idps_idx = mri_data.filter(regex='^smri_area_cdk').columns.to_list()
    print(f"Number of IDPs: {len(abcd_idps_idx)}")
    exit("Brain measure not implemented yet")
    
else:
    raise ValueError('brain_measure has to be either CT or SA')

Using brain measure: Cortical Thickness
Number of IDPs: 68


In [33]:
# access DK atlas and save idp_labels
atlas = fetch_desikan_killiany(surface=True)
atlas = pd.read_csv(atlas['info'])
atlas = atlas[(atlas['structure'] == 'cortex') & (atlas['hemisphere'] == 'L')]
atlas_labels_l = ['L_'+label for label in atlas['label']]
atlas_labels_r = ['R_'+label for label in atlas['label']]

desikan_idps = atlas_labels_l + atlas_labels_r

Treat the following response codes as missing values (NA):
- 777 = “Decline to answer”
- 999 = “Do not know”
- 888 = Question not asked due to primary question response (branching logic)
- 555 = Not administered in the assessment
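
Note that a plain list passed to `na_values` applies to every column, so a genuine measurement that happens to equal one of these codes would also become NaN. The sketch below, on toy data (the column `toy_volume` and the subject keys are made up for illustration), shows how a per-column dict could restrict the recoding to questionnaire items. It is an optional variation, not what the cell below does.

```python
import io

import pandas as pd

raw = (
    "subjectkey\tdevhx_ss_12_p\ttoy_volume\n"
    "SUBJ_A\t4\t555\n"
    "SUBJ_B\t999\t612\n"
)
codes = [555, 777, 888, 999]

# global list: the value 555 in the measurement column is also treated as missing
global_codes = pd.read_csv(io.StringIO(raw), sep="\t", na_values=codes)

# per-column dict: only the questionnaire item is recoded
per_column = pd.read_csv(io.StringIO(raw), sep="\t", na_values={"devhx_ss_12_p": codes})

print(global_codes["toy_volume"].isna().sum())  # 1
print(per_column["toy_volume"].isna().sum())    # 0
```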


In [34]:
# dict of ABCD data tables w/ format "tablefile: variables"
abcd_tab_data = {
    join(abcd_data_dir, "r4.0", "abcd_smrip10201.txt"):         ["interview_age", "interview_date", "sex", "smri_vol_scs_cbwmatterlh", 
                                                                    "smri_vol_scs_cbwmatterrh", "smri_vol_scs_wholeb",
                                                                    "smri_vol_scs_subcorticalgv","smri_vol_scs_intracranialv", 
                                                                    "smri_thick_cdk_mean"] + abcd_idps_idx, 
    join(abcd_data_dir, "r4.0", "abcd_fsurfqc01.txt"):          ["fsqc_qc"],
    join(abcd_data_dir, "r4.0", "abcd_auto_postqc01.txt"):      ["apqc_smri_topo_ndefect"],
    join(abcd_data_dir, "r4.0", "abcd_imgincl01.txt"):          ["imgincl_t1w_include"],
    join(abcd_data_dir, "r4.0", "abcd_devhxss01.txt"):          ["devhx_ss_12_p"],
    join(abcd_data_dir, "r4.0", "abcd_lt01.txt"):               ["site_id_l"],
    join(abcd_data_dir, "r4.0", "abcd_tbss01.txt"):             ["nihtbx_totalcomp_fc"],
}

# read data
abcd_tab = []
for k in abcd_tab_data.keys():
    tab = pd.read_csv(k, header=0, skiprows=[1], delimiter="\t", na_values=[555,777,888,999], 
                        low_memory=False)
    if ("visit" in tab.columns) & ("eventname" not in tab.columns):
        tab = tab.rename(columns=dict(visit="eventname"))
    if tab.eventname[0]=="screener_arm_1":
        tab.eventname = tab.eventname.replace({"screener_arm_1": "baseline_year_1_arm_1"})
    tab = tab[["subjectkey", "eventname"] + abcd_tab_data[k]]     
    abcd_tab.append(tab)
    
# combine data
abcd = reduce(lambda left, right: pd.merge(left, right, on=["subjectkey", "eventname"], how='left'), abcd_tab)
print(abcd.shape)

(19589, 85)


## Adjust and rename variables

In [35]:
# rename variables
abcd = abcd.rename(columns=dict(
    subjectkey="id", 
    interview_age="age_mon", 
    sex="sex_str", 
    eventname="tp",
    site_id_l="site_str",
    devhx_ss_12_p="weeks_born_premature",
    smri_vol_scs_wholeb="whole_brain_vol",
    smri_vol_scs_subcorticalgv="sGMV",
    smri_vol_scs_cbwmatterlh="L_WMV",
    smri_vol_scs_cbwmatterrh="R_WMV",
    smri_vol_scs_intracranialv="EstimatedTotalIntraCranialVol",
    smri_thick_cdk_mean='meanCT2',
    nihtbx_totalcomp_fc='nihtbx_total'
))
abcd = abcd.rename(columns=dict(zip(abcd_idps_idx, desikan_idps)))

# cerebral tissue volume measures
abcd["WMV"] = abcd["L_WMV"] + abcd["R_WMV"]
abcd.drop(columns=["L_WMV", "R_WMV"], inplace=True)
abcd["GMV"] = abcd['whole_brain_vol'] - abcd['WMV'] - abcd['sGMV']

# demographics
abcd["age"] = round(abcd.age_mon / 12, 2)
abcd["sex"] = [0 if s=="F" else 1 for s in abcd.sex_str]  # F==0, M==1
abcd["site"] = [int(s[-2:]) for s in abcd.site_str] 
abcd.tp = abcd.tp.replace({
    "baseline_year_1_arm_1":    "T0",
    "2_year_follow_up_y_arm_1": "T2", 
    "4_year_follow_up_y_arm_1": "T4"
})
abcd.site_str = abcd.site_str.replace({
    'site01': "CHLA", 
    'site02': "CUB", 
    'site03': "FIU", 
    'site04': "LIBR", 
    'site05': "MUSC", 
    'site06': "OHSU",
    'site07': "ROC", 
    'site08': "SRI", 
    'site09': "UCLA", 
    'site10': "UCSD", 
    'site11': "UFL", 
    'site12': "UMB",
    'site13': "UMICH", 
    'site14': "UMN", 
    'site15': "UPMC", 
    'site16': "UTAH", 
    'site17': "UVM", 
    'site18': "UWM",
    'site19': "VCU", 
    'site20': "WUSTL", 
    'site21': "YALE", 
    'site22': "MSSM"
})

# set index (id + tp)
abcd = abcd.set_index(["id", "tp"], drop=False)
abcd.index = abcd.index.set_names(names=["idx_id", "idx_tp"])
abcd = abcd.sort_index(axis="index")

## Drop subjects
### Missing data

In [36]:
# all
print(f"Whole dataset: baseline n = {len(abcd[abcd.tp=='T0'])},",
      f"2-year n = {len(abcd[abcd.tp=='T2'])}")

# drop site MSSM, which was discontinued during baseline
abcd = abcd[abcd.site_str!="MSSM"].copy()
print(f"Site MSSM dropped: baseline n = {len(abcd[abcd.tp=='T0'])},",
      f"2-year n = {len(abcd[abcd.tp=='T2'])}")

# replace "zero" CT values with nan
print(f"{(abcd[desikan_idps] == 0).sum().sum()} zeros replaced with np.nan")
abcd[desikan_idps] = abcd[desikan_idps].replace({0: np.nan})

# remove subjects with at least one missing brain_measure value
missings = abcd[desikan_idps].isnull().any(axis=1)
abcd = abcd[~missings].copy()
print(f"Subjects with >=1 missing {brain_measure} values dropped: baseline n = {len(abcd[abcd.tp=='T0'])},",
      f"2-year n = {len(abcd[abcd.tp=='T2'])}")

Whole dataset: baseline n = 11760, 2-year n = 7829
Site MSSM dropped: baseline n = 11728, 2-year n = 7829
0 zeros replaced with np.nan
Subjects with >=1 missing CT values dropped: baseline n = 11728, 2-year n = 7829


### QC
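We exclude scans whose number of surface topological defects (`apqc_smri_topo_ndefect`, closely related to the Euler number) exceeds Q3 + 1.5 × IQR computed across all scans, as well as scans flagged by the ABCD internal QC.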
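
As a worked illustration of the Q3 + 1.5 × IQR rule, here is a minimal sketch on made-up defect counts (not ABCD data):

```python
import numpy as np

# toy defect counts; the last scan is an obvious outlier
defects = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])

q1, q3 = np.percentile(defects, [25, 75])  # 3.25 and 7.75
thresh = q3 + 1.5 * (q3 - q1)              # Q3 + 1.5 * IQR
is_outlier = defects > thresh

print(thresh)            # 14.5
print(is_outlier.sum())  # 1
```

Only the scan with 40 defects exceeds the threshold of 14.5 and would be flagged.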

In [37]:
qc_thresh = "iqr"

In [38]:
# all
print(f"Whole dataset: baseline n = {len(abcd[abcd.tp=='T0'])},",
      f"2-year n = {len(abcd[abcd.tp=='T2'])}")

# add filter based on the number of topological defects
if qc_thresh=="quantile":
      thresh = np.quantile(abcd.apqc_smri_topo_ndefect, 0.99)
elif qc_thresh=="iqr":
      q3 = np.percentile(abcd.apqc_smri_topo_ndefect, 75)
      thresh = q3 + 1.5 * iqr(abcd.apqc_smri_topo_ndefect)
elif qc_thresh=="sd":
      mean = np.mean(abcd.apqc_smri_topo_ndefect)
      sd = np.std(abcd.apqc_smri_topo_ndefect)
      thresh = mean + 3*sd
abcd["topo_thresh"] = thresh
abcd["topo_outlier"] = [1 if defects>thresh else 0 \
      for defects, thresh in zip(abcd.apqc_smri_topo_ndefect, abcd.topo_thresh)]

# only subjects passing ABCD, FreeSurfer, and topological-defect quality control
abcd_preqc = abcd.copy()
abcd = abcd.query("(imgincl_t1w_include==1) & (fsqc_qc!=0) & (topo_outlier==0)").copy()
print(f"Post QC: baseline n = {len(abcd[abcd.tp=='T0'])},",
      f"2-year n = {len(abcd[abcd.tp=='T2'])}")

# drop subjects with only second tp or duplicate data
subs = list(abcd.id.unique())
missings = list()
for sub in subs:
      try:
            abcd.loc[(sub, "T0"), :]
      except:
            missings.append(sub)
      if len(abcd.loc[(sub, na()), :]) > 2:
            missings.append(sub)
abcd = abcd.loc[([s for s in subs if s not in missings], na()), :].copy() 
print(f"Subjects with only 2nd tp or duplicate data dropped: baseline n = {len(abcd[abcd.tp=='T0'])},",
      f"2-year n = {len(abcd[abcd.tp=='T2'])}")

# mark subjects with both time points
abcd["both_tp"] = False
for sub in abcd.id.unique(): 
      try:
            temp1 = abcd.loc[(sub,"T0"), :]
      except:
            continue
      try:
            temp2 = abcd.loc[(sub,"T2"), :]
      except:
            continue
      #if temp1.site==temp2.site:
      abcd.loc[(sub, na()), "both_tp"] = True
if len(abcd[(abcd.both_tp==1) & (abcd.tp=="T0")]) != \
      len(abcd[(abcd.both_tp==1) & (abcd.tp=="T2")]):
      print("Something is wrong!")
print(f"T0+T2 available: n = {len(abcd[abcd.both_tp==1])/2:.0f}")


Whole dataset: baseline n = 11728, 2-year n = 7829
Post QC: baseline n = 10707, 2-year n = 7388
Subjects with only 2nd tp or duplicate data dropped: baseline n = 10707, 2-year n = 6802
T0+T2 available: n = 6802
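
The per-subject loop above could also be expressed with a `groupby`. This is a sketch on toy data, assuming one row per subject and time point; it is an alternative formulation, not the code that produced the numbers shown here.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["A", "A", "B", "C", "C", "D"],
    "tp": ["T0", "T2", "T0", "T0", "T2", "T2"],
})

# for every row, does the subject have a T0 / T2 row anywhere in the table?
has_t0 = toy["tp"].eq("T0").groupby(toy["id"]).transform("any")
has_t2 = toy["tp"].eq("T2").groupby(toy["id"]).transform("any")

# drop subjects without baseline data (here: D), then mark complete pairs
toy = toy[has_t0].copy()
toy["both_tp"] = has_t2[has_t0]

print(toy)
```

Subjects A and C are marked as having both time points, B has baseline only, and D is dropped.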


In [39]:
print("Overall number of unique subjects: n =", len(abcd.id.unique()))
print("Overall number of unique subjects with longitudinal data: n =", len(abcd[abcd.both_tp==True].id.unique()))

abcd["sex_str"] = pd.Categorical(abcd.sex_str)
abcd_sample_stats = dict()
for tp in ["T0", "T2"]:
    # All
    abcd_sample_stats[("All", tp, "Age [y]")] = abcd.loc[(abcd.tp==tp) & (abcd.both_tp==True), "age"].describe()
    abcd_sample_stats[("All", tp, "Male [%]")] = pd.Series({
        "count": len(abcd.loc[(abcd.tp==tp) & (abcd.both_tp==True)].sex_str),
        "%": len(abcd.loc[(abcd.tp==tp) & (abcd.both_tp==True) & (abcd.sex_str=="M")]),
    })

abcd_sample_stats = pd.DataFrame(abcd_sample_stats).sort_index(axis=1)
abcd_sample_stats = abcd_sample_stats.loc[["count", "mean", "%", "std", "50%", "min", "max"],:]
abcd_sample_stats = abcd_sample_stats.rename(index={"count":"n", "50%":"median", "std":"sd"})
abcd_sample_stats.loc["%",:] = abcd_sample_stats.loc["%",:] / abcd_sample_stats.loc["n",:] * 100
abcd_sample_stats.loc["n", (na(), na(), "Male [%]")] = np.nan
abcd_sample_stats = abcd_sample_stats.round(2)
abcd_sample_stats

Overall number of unique subjects: n = 10707
Overall number of unique subjects with longitudinal data: n = 6802


| | All, T0: Age [y] | All, T0: Male [%] | All, T2: Age [y] | All, T2: Male [%] |
|---|---|---|---|---|
| n | 6802.0 | | 6802.0 | |
| mean | 9.91 | | 11.96 | |
| % | | 52.84 | | 52.84 |
| sd | 0.62 | | 0.65 | |
| median | 9.92 | | 11.92 | |
| min | 8.92 | | 10.58 | |
| max | 11.08 | | 13.75 | |


In [40]:
# reorder variables
abcd = reorder_vars(["id", "tp", "both_tp", "site", "site_str", "age", "age_mon", "sex", "sex_str"], 
                    abcd, desikan_idps)

### Drop subjects without longitudinal data

In [41]:
# drop subjects without longitudinal data
abcd = abcd[abcd.both_tp==True]
print(f"Subjects missing longitudinal data have been excluded. Subjects with both time points: n = {len(abcd)/2:.0f}")

Subjects missing longitudinal data have been excluded. Subjects with both time points: n = 6802


## Calculate gestational age and preterm variable
Gestational age is not directly available for ABCD, but questionnaires assessed how many weeks premature the baby was born according to the mother. We calculate gestational age by subtracting the weeks premature from 40.

Since this information is subjective and less reliable, we only consider subjects born at <= 32 weeks of gestation as preterm and subjects born at >= 37 weeks as term. All other subjects are excluded from the analysis.
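
The grouping rule can be sketched on toy values as follows (the numbers are made up; the thresholds match the cell below):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"weeks_born_premature": [0, 2, 3, 5, 8, 12, np.nan]})

# gestational age in weeks, assuming a 40-week term pregnancy
toy["gestational_age"] = 40 - toy["weeks_born_premature"]

# 0 = term (GA >= 37), 1 = preterm (GA <= 32), NaN = neither group or missing
toy["preterm"] = np.where(toy["gestational_age"] >= 37, 0,
                          np.where(toy["gestational_age"] <= 32, 1, np.nan))

print(toy)
```

Rows with a gestational age of 33 to 36 weeks, or with missing information, end up as NaN and are dropped in the next step.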

In [42]:
# copy weeks_born_premature from T0 to T2 for the same idx_id
abcd['weeks_born_premature'] = abcd.groupby('id')['weeks_born_premature'].transform('first')

# calculate gestational age
abcd['gestational_age'] = 40 - abcd['weeks_born_premature']

# add variable preterm (0 = term-born control, 1 = preterm, NaN = neither); dx for the BrainChart framework is derived from it below
abcd['preterm'] = np.where(abcd['gestational_age'] >= 37, 0, 
                        np.where(abcd['gestational_age'] <= 32, 1, np.nan))

# drop subjects not fitting to any dx group
abcd = abcd[abcd['preterm'].notnull()].copy()
print(f'Final sample size in each group: CN = {len(abcd[abcd["preterm"]==0])/2:.0f}, preterm = {len(abcd[abcd["preterm"]==1])/2:.0f}')
print(f'Dataframe size: {abcd.shape}')

# introduce dx
abcd['dx'] = np.where(abcd['preterm']==1, 'preterm', 'CN')

Final sample size in each group: CN = 5762, preterm = 191
Dataframe size: (11906, 93)


## Average hemispheres

In [43]:
regions = [r[2:] for r in desikan_idps[:34]]

# Combine L_ and R_ values
for region in regions:
    abcd[f'{brain_measure}_{region}'] = abcd[[f'L_{region}', f'R_{region}']].mean(axis=1)
    # drop L_ and R_ columns
    abcd = abcd.drop(columns=[f'L_{region}', f'R_{region}'])

In [44]:
# create region list bilateral
desikan_idps_bilateral = [f'{brain_measure}_{region}' for region in regions]
if brain_measure == 'CT':
    desikan_idps_bilateral = ['meanCT2'] + desikan_idps_bilateral
ctv_columns = ['GMV', 'WMV', 'sGMV', 'whole_brain_vol', 'EstimatedTotalIntraCranialVol']

# reorder variables and drop unnecessary columns
abcd = reorder_vars(["id", "tp", "both_tp", "site", "site_str", "age", "age_mon", "sex", 'sex_str', 'interview_date', 
                        'gestational_age', 'preterm', 'dx', 'sGMV', 'GMV', 'WMV', 'whole_brain_vol', 'EstimatedTotalIntraCranialVol'],
                    abcd, desikan_idps_bilateral) 
abcd.drop(columns=['fsqc_qc', 'apqc_smri_topo_ndefect','imgincl_t1w_include', 'weeks_born_premature', 'topo_thresh',
                    'topo_outlier'], inplace=True)

# longCombat for data harmonization
Since longitudinal data acquisition took place at the same sites as the baseline data, we will use longCombat to harmonize data across sites. This will be done separately for `brain_measure` and cerebral tissue volume measures due to different scales. longCombat is run in R.
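
Whether each subject was indeed scanned at the same site at both time points can be checked before harmonization. A minimal sketch on toy data, assuming the columns `id`, `tp`, and `site` as used in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "id":   ["A", "A", "B", "B", "C", "C"],
    "tp":   ["T0", "T2", "T0", "T2", "T0", "T2"],
    "site": [1, 1, 2, 5, 3, 3],
})

# number of distinct sites per subject; more than one means the site changed
sites_per_subject = toy.groupby("id")["site"].nunique()
changed_site = sites_per_subject[sites_per_subject > 1].index.tolist()

print(changed_site)  # ['B']
```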

In [45]:
# separate brain_measure and cerebral tissue volumes for Combat
abcd_data = abcd.drop(columns=ctv_columns)
abcd_data.drop(columns=['nihtbx_total'], inplace=True)  # causes trouble in longCombat due to missing values
abcd_ctv = abcd.drop(columns=desikan_idps_bilateral)
abcd_ctv.drop(columns=['nihtbx_total'], inplace=True)


## Harmonize `brain_measure` data
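Age, sex, and `dx*tp` (diagnosis, time point, and their interaction) are included as fixed effects in the model, together with a random intercept per subject.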

In [46]:
%%R -i abcd_data -i desikan_idps_bilateral -i abcd_out -i brain_measure
.libPaths()

# load required packages
library(longCombat)
library(invgamma)
library(lme4)
library(Matrix)

# set random seed to make analysis reproducible (otherwise, output "data_combat" will be minimally different each time you run the code)
set.seed(1234)

# apply longCombat
data_combat <- longCombat(idvar='id', 
                            timevar='tp',
                            batchvar='site', 
                            features=desikan_idps_bilateral, 
                            formula='age + sex + dx*tp',
                            ranef='(1|id)',
                            data=abcd_data)

# save combat-corrected data
out_name = paste("abcd_", brain_measure, "_scanner_corrected_tmp.csv", sep="")
write.table(data_combat[["data_combat"]], file = file.path(abcd_out, out_name), sep = ",", row.names = FALSE)

[longCombat] found 21 batches
[longCombat] found 35 features
[longCombat] found 11906 total observations
[longCombat] standardizing data across features...
[longCombat] fitting lme model for feature 1
[longCombat] fitting lme model for feature 2
[longCombat] fitting lme model for feature 3
[longCombat] fitting lme model for feature 4
[longCombat] fitting lme model for feature 5
[longCombat] fitting lme model for feature 6
[longCombat] fitting lme model for feature 7
[longCombat] fitting lme model for feature 8
[longCombat] fitting lme model for feature 9
[longCombat] fitting lme model for feature 10
[longCombat] fitting lme model for feature 11
[longCombat] fitting lme model for feature 12
[longCombat] fitting lme model for feature 13
[longCombat] fitting lme model for feature 14
[longCombat] fitting lme model for feature 15
[longCombat] fitting lme model for feature 16
[longCombat] fitting lme model for feature 17
[longCombat] fitting lme model for feature 18
[longCombat] fitting lme 

## Harmonize cerebral tissue volumes

In [47]:
%%R -i abcd_ctv -i ctv_columns -i abcd_out 

# load required packages
library(longCombat)
library(invgamma)
library(lme4)
library(Matrix)

# set random seed to make analysis reproducible (otherwise, output "data_combat" will be minimally different each time you run the code)
set.seed(1234)

# apply longCombat
data_combat <- longCombat(idvar='id', 
                            timevar='tp',
                            batchvar='site', 
                            features=ctv_columns, 
                            formula='age + sex + dx*tp',
                            ranef='(1|id)',
                            data=abcd_ctv)

# save combat-corrected data
out_name = paste("abcd_ctv_scanner_corrected_tmp.csv", sep="")
write.table(data_combat[["data_combat"]], file = file.path(abcd_out, out_name), sep = ",", row.names = FALSE)

[longCombat] found 21 batches
[longCombat] found 5 features
[longCombat] found 11906 total observations
[longCombat] standardizing data across features...
[longCombat] fitting lme model for feature 1
[longCombat] fitting lme model for feature 2
[longCombat] fitting lme model for feature 3
[longCombat] fitting lme model for feature 4
[longCombat] fitting lme model for feature 5
[longCombat] using method of moments to estimate hyperparameters
[longCombat] using empirical Bayes to estimate batch effects...
[longCombat] initializing...
[longCombat] starting EM algorithm iteration 1
[longCombat] starting EM algorithm iteration 2
[longCombat] starting EM algorithm iteration 3
[longCombat] starting EM algorithm iteration 4
[longCombat] starting EM algorithm iteration 5
[longCombat] starting EM algorithm iteration 6
[longCombat] starting EM algorithm iteration 7
[longCombat] starting EM algorithm iteration 8
[longCombat] starting EM algorithm iteration 9
[longCombat] starting EM algorithm iter

In [48]:
# combine outputs 
abcd_data_combat = pd.read_csv(join(abcd_out, f'abcd_{brain_measure}_scanner_corrected_tmp.csv'))
abcd_ctv_combat = pd.read_csv(join(abcd_out, 'abcd_ctv_scanner_corrected_tmp.csv'))

# rename idps
abcd_data_combat = abcd_data_combat.rename(columns=dict(zip([idp+'.combat' for idp in desikan_idps_bilateral], desikan_idps_bilateral)))
abcd_ctv_combat = abcd_ctv_combat.rename(columns=dict(zip([idp+'.combat' for idp in ctv_columns], ctv_columns)))

In [49]:
# merge harmonized brain_measure and CTV data
cols_to_keep = ['id', 'tp' ] + ctv_columns
abcd_combat = pd.merge(abcd_data_combat, abcd_ctv_combat[cols_to_keep], on=["id", "tp"], how='inner')

# add more variables to longCombat output
print(f'Final sample size in each group: CN = {len(abcd[abcd["preterm"]==0])/2:.0f}, preterm = {len(abcd[abcd["preterm"]==1])/2:.0f}')
print(f'Dataframe size: {abcd_combat.shape}')

abcd_combat = abcd_combat.merge(abcd[['id', 'tp', 'both_tp', 'site_str', 'age', 'age_mon', 'sex',
                        'sex_str', 'interview_date', 'gestational_age', 'preterm', 'dx', 'nihtbx_total']], on=['id', 'tp'], how='left')

Final sample size in each group: CN = 5762, preterm = 191
Dataframe size: (11906, 43)


# Adapt df for BrainChart framework
BrainChart expects the data in a specific format, so we adapt the data frame accordingly. More information can be found [here](https://brainchart.shinyapps.io/brainchart/).
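
One of the adaptations is the age conversion in the next cell. As a sketch, assuming that the added 280 days stand for 40 weeks of gestation so that age is counted from conception (the function name is hypothetical; the constants are those used in this notebook):

```python
import math

DAYS_PER_YEAR = 365.245  # conversion factor used in this notebook
GESTATION_DAYS = 280     # 40 weeks * 7 days


def age_years_to_age_days(age_years):
    """Convert postnatal age in years to age in days including gestation."""
    return age_years * DAYS_PER_YEAR + GESTATION_DAYS


print(age_years_to_age_days(10))
```

A 10-year-old child therefore has an `age_days` value of about 3932.45.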

In [50]:
# adapt for brainchart
abcd_combat = abcd_combat.rename(columns=dict(
    id="participant",
    age="Age",
    sex="sex_code",
    sex_str="sex",
    gestational_age="GA"
)) 

# age in days: age in years * 365.245, plus 280 days (40 weeks) of gestation
abcd_combat['age_days'] = (abcd_combat['Age'] * 365.245) + 280
abcd_combat['sex'] = abcd_combat['sex'].map({'M': 'Male', 'F': 'Female'})
abcd_combat['study'] = 'ABCD_newEstimate'  # as ABCD was used for original model fitting, we want to make sure that random effects of study are calculated again
abcd_combat['fs_version'] = 'Custom'
abcd_combat['country'] = 'Multisite'
abcd_combat['run'] = 1
abcd_combat['session'] = abcd_combat['tp'].map({'T0': 1, 'T2': 2})

# remove unnecessary columns
abcd_combat.drop(columns=['age_mon', 'site_str', 'interview_date', 'preterm', 'sex_code','both_tp', 'tp'], inplace=True)

# reshape and rename some more variables
all_idps = desikan_idps_bilateral + ctv_columns
abcd_final = reorder_vars(['participant', 'Age', 'age_days', 'sex', 'study', 'fs_version','country', 'run', 
                            'session', 'dx'], abcd_combat, all_idps)
abcd_final.rename(columns={'EstimatedTotalIntraCranialVol': 'eTIV'}, inplace=True)

In [51]:
# split timepoints
abcd_10 = abcd_final[abcd_final['session'] == 1]
print(f'ABCD timepoint 1 contains {abcd_10.shape[0]} subjects.')
abcd_12 = abcd_final[abcd_final['session'] == 2]
print(f'ABCD timepoint 2 contains {abcd_12.shape[0]} subjects.')

# save
abcd_final.to_csv(join(abcd_out, f'ABCD_{brain_measure}_preprocessed.csv'), index=False)
abcd_10.to_csv(join(abcd_out, f'ABCD-10_{brain_measure}_preprocessed.csv'), index=False)
abcd_12.to_csv(join(abcd_out, f'ABCD-12_{brain_measure}_preprocessed.csv'), index=False)

ABCD timepoint 1 contains 5953 subjects.
ABCD timepoint 2 contains 5953 subjects.


In [52]:
# remove tmp files
for tmp_name in [f'abcd_{brain_measure}_scanner_corrected_tmp.csv', 'abcd_ctv_scanner_corrected_tmp.csv']:
    tmp_path = join(abcd_out, tmp_name)
    if os.path.exists(tmp_path):
        os.remove(tmp_path)


# Stats
Summary statistics are shown in Supplementary Table S1.

In [53]:
print("Overall number of unique subjects with longitudinal data: n =", len(abcd[abcd.both_tp==True].id.unique()))
print('--- ABCD-10 ---')

abcd_10_pt = abcd_10[abcd_10['dx'] == 'preterm']
abcd_10_cn = abcd_10[abcd_10['dx'] == 'CN']

print('Preterm stats: n = ', len(abcd_10_pt))
print(abcd_10_pt['sex'].value_counts())
display(abcd_10_pt[['Age', 'GA']].describe().round(2))

print('Full-term stats: n = ', len(abcd_10_cn))
print(abcd_10_cn['sex'].value_counts())
display(abcd_10_cn[['Age', 'GA']].describe().round(2))

Overall number of unique subjects with longitudinal data: n = 5953
--- ABCD-10 ---
Preterm stats: n =  191
Male      102
Female     89
Name: sex, dtype: int64


| | Age | GA |
|---|---|---|
| count | 191.0 | 191.0 |
| mean | 9.99 | 30.75 |
| std | 0.6 | 1.83 |
| min | 9.0 | 27.0 |
| 25% | 9.5 | 30.0 |
| 50% | 10.0 | 32.0 |
| 75% | 10.58 | 32.0 |
| max | 10.92 | 32.0 |


Full-term stats: n =  5762
Male      3051
Female    2711
Name: sex, dtype: int64


| | Age | GA |
|---|---|---|
| count | 5762.0 | 5762.0 |
| mean | 9.9 | 39.86 |
| std | 0.62 | 0.6 |
| min | 8.92 | 37.0 |
| 25% | 9.33 | 40.0 |
| 50% | 9.92 | 40.0 |
| 75% | 10.42 | 40.0 |
| max | 11.08 | 40.0 |


In [54]:
print('--- ABCD-12 ---')

abcd_12_pt = abcd_12[abcd_12['dx'] == 'preterm']
abcd_12_cn = abcd_12[abcd_12['dx'] == 'CN']

print('Preterm stats: n = ', len(abcd_12_pt))
print(abcd_12_pt['sex'].value_counts())
display(abcd_12_pt[['Age', 'GA']].describe().round(2))

print('Full-term stats: n = ', len(abcd_12_cn))
print(abcd_12_cn['sex'].value_counts())
display(abcd_12_cn[['Age', 'GA']].describe().round(2))

--- ABCD-12 ---
Preterm stats: n =  191
Male      102
Female     89
Name: sex, dtype: int64


| | Age | GA |
|---|---|---|
| count | 191.0 | 191.0 |
| mean | 12.01 | 30.75 |
| std | 0.64 | 1.83 |
| min | 10.83 | 27.0 |
| 25% | 11.5 | 30.0 |
| 50% | 12.0 | 32.0 |
| 75% | 12.58 | 32.0 |
| max | 13.67 | 32.0 |


Full-term stats: n =  5762
Male      3051
Female    2711
Name: sex, dtype: int64


| | Age | GA |
|---|---|---|
| count | 5762.0 | 5762.0 |
| mean | 11.94 | 39.86 |
| std | 0.65 | 0.6 |
| min | 10.58 | 37.0 |
| 25% | 11.42 | 40.0 |
| 50% | 11.92 | 40.0 |
| 75% | 12.5 | 40.0 |
| max | 13.75 | 40.0 |
