### **IMPORTING RELEVANT DATASETS AND FEATURE SELECTION**

This section imports all the datasets needed to run the analyses
from Gagnon et al., 2024. Feature selection is then performed to keep only the
relevant variables.

#### **Setting up relevant paths.**

For the following analyses to work, please update the following
variables in the next cell.

1. `repository_path`: should point to the location of the git repository.
1. `abcd_base_path`: should point to the base folder of the ABCD data release.
1. `output_folder`: should point to a folder in which results will be saved throughout the analyses.

#### **Requirements.**

To run the following code, the NeuroStatX toolbox
(https://github.com/gagnonanthony/NeuroStatX.git) must be installed. If it is
not already installed on your machine, please follow the instructions
in the repository or its documentation. All the analyses should be
runnable on an entry-level computer (the time to complete some steps may vary;
long-running steps are labeled with **This is a long running process. Go get a
coffee !**). You may choose to skip some of these steps if need be.
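
A quick way to confirm that the toolbox is visible to the current Python environment is sketched below. `is_installed` is a hypothetical helper written for this illustration; the import name `neurostatx` is taken from the import statements used in this notebook.

```python
import importlib.util


def is_installed(package_name):
    """Return True when `package_name` can be located by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("neurostatx"):
    print("NeuroStatX was not found; please follow the installation "
          "instructions in its repository.")
```

This only checks that the package can be found, not that its version matches the one used for the original analyses.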

In [1]:
import os

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

from neurostatx.io.utils import load_df_in_any_format
from neurostatx.utils.preprocessing import merge_dataframes

In [2]:
# Setting up relevant paths.
repository_path = "/Users/anthonygagnon/code/Gagnon_LongitudinalProfiles/" # CHANGE THIS
abcd_base_path = "/Volumes/T7/CCPM/ABCD/Release_5.1/abcd-data-release-5.1/" # CHANGE THIS
output_folder = "/Volumes/T7/CCPM/RESULTS_JUNE_24/" # CHANGE THIS

# Setting up the paths for output subfolder.
output_dir = f"{output_folder}/LongitudinalProfiles/datagathering/" # DO NOT CHANGE THIS
os.makedirs(output_dir, exist_ok=True)
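
Before loading any table, it can be worth confirming that the input folders exist. The snippet below is a minimal sketch: `find_missing_folders` is a hypothetical helper (not part of NeuroStatX), and a temporary folder stands in for the real locations.

```python
import tempfile
from pathlib import Path


def find_missing_folders(**named_paths):
    """Return the names of the paths that are not existing folders."""
    return [name for name, value in named_paths.items()
            if not Path(value).is_dir()]


# Demo: one folder that exists and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    missing = find_missing_folders(
        repository_path=tmp,
        abcd_base_path=Path(tmp) / "not_there",
    )

print(missing)  # names of the folders that could not be found
```

In the notebook, the helper would be called with `repository_path` and `abcd_base_path`. The output subfolder is created by `os.makedirs` in the cell above, so it does not need to exist beforehand.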

### **Fetching relevant data from the ABCD Study (Release 5.1)**

Subsequent cells will load tables from the ABCD Release 5.1 and filter them to keep only baseline, 2-year, and 4-year follow-up data. Final tables will be saved in the `output_dir` specified above.

**Please note that this dataset is available through a data use certificate. For more information, please see the [NIMH Data Archive website](https://nda.nih.gov/) or the [ABCD Study wiki](https://wiki.abcdstudy.org/).**
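
The time-point filtering described above relies on the `eventname` column of each table. The sketch below shows the idea on a toy table with made-up subjects and scores; `keep_events` is a hypothetical helper, and the notebook itself filters one event at a time.

```python
import pandas as pd

# Event names used for the three time points in this notebook.
EVENTS = [
    "baseline_year_1_arm_1",
    "2_year_follow_up_y_arm_1",
    "4_year_follow_up_y_arm_1",
]


def keep_events(table, events=EVENTS):
    """Keep only the rows whose `eventname` is one of `events`."""
    return table[table["eventname"].isin(events)].copy()


# Toy table (made-up subjects and scores).
toy = pd.DataFrame({
    "src_subject_id": ["S1", "S1", "S1", "S2"],
    "eventname": [
        "baseline_year_1_arm_1",
        "1_year_follow_up_y_arm_1",
        "2_year_follow_up_y_arm_1",
        "4_year_follow_up_y_arm_1",
    ],
    "score": [1.0, 2.0, 3.0, 4.0],
})
kept = keep_events(toy)
print(kept["eventname"].tolist())
```

Returning a copy keeps later column assignments from acting on a slice of the original table.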

#### **Starting with baseline data.**

In [3]:
# Load all necessary dataframes for neurocognition and behavior.
cbcl = load_df_in_any_format(f'{abcd_base_path}/core/mental-health/mh_p_cbcl.csv')
nihtb = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_nihtb.csv')
lmt = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_lmt.csv')
ravlt = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_ravlt.csv')
wisc = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_wisc.csv')
dice = load_df_in_any_format(f"{abcd_base_path}/core/neurocognition/nc_y_gdt.csv")
bird = load_df_in_any_format(f"{abcd_base_path}/core/neurocognition/nc_y_bird.csv")

# Load all necessary dataframes for covariates/demographics.
lt = load_df_in_any_format(f'{abcd_base_path}/core/abcd-general/abcd_y_lt.csv')
hand = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_ehis.csv')
vision = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_svs.csv')
demo = load_df_in_any_format(f"{abcd_base_path}/core/abcd-general/abcd_p_demo.csv")
agemonth = load_df_in_any_format(f'{abcd_base_path}/core/abcd-general/abcd_y_lt.csv')

In [4]:
# Building demographics and covariates data.
demo_baseline = demo[demo.eventname == "baseline_year_1_arm_1"]
agemonth_baseline = agemonth[agemonth.eventname == "baseline_year_1_arm_1"]
lt_baseline = lt[lt.eventname == "baseline_year_1_arm_1"]
hand_baseline = hand[hand.eventname == "baseline_year_1_arm_1"]
vision_baseline = vision[vision.eventname == "baseline_year_1_arm_1"]

# Extracting sites data.
site_vars = ["src_subject_id", "site_id_l"]
site = lt_baseline[site_vars]
site.columns = ["subjectkey", "Site"]

# Extracting demographics data.
demo_vars = [
    "src_subject_id",
    "demo_sex_v2",
    "race_ethnicity",
    "demo_prnt_ed_v2",
    "demo_prtnr_ed_v2",
    "demo_comb_income_v2"
]
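# Work on an explicit copy so that the column assignments below do not raise a SettingWithCopyWarning.
demo_data = demo_baseline[demo_vars].copy()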
demo_data.columns = [
    "subjectkey",
    "Sex",
    "Ethnicity",
    "Parent_ed1",
    "Parent_ed2",
    "Income"
]

# Extracting age data.
age_vars = ["src_subject_id", "interview_age"]
age_data = agemonth_baseline[age_vars]
age_data.columns = ["subjectkey", "AgeMonths"]

# Extracting handedness data.
hand_vars = ["src_subject_id", "ehi_y_ss_scoreb"]
hand_data = hand_baseline[hand_vars]
hand_data.columns = ["subjectkey", "Handedness"]

# Invert the handedness score. 1 = left, 2 = right, 3 = ambidextrous.
def invert_handedness(x):
    if x == 1:
        return 2
    elif x == 2:
        return 1
    elif x == 3:
        return 3
    else:
        return np.nan

hand_data.loc[:, "Handedness"] = hand_data["Handedness"].apply(invert_handedness)

# Extracting vision data.
vision_vars = ["src_subject_id", "snellen_va_y"]
vision_data = vision_baseline[vision_vars]
vision_data.columns = ["subjectkey", "Vision"]

# Extract parental highest education level. Taking the highest among the two parents.
demo_data.loc[:, "Parent_ed1"] = demo_data["Parent_ed1"].replace([777, 999, np.nan], 0)
demo_data.loc[:, "Parent_ed2"] = demo_data["Parent_ed2"].replace([777, 999, np.nan], 0)
demo_data.loc[:, "high_edu"] = demo_data[["Parent_ed1", "Parent_ed2"]].values.max(1)

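# Group levels together: <13 = 1 (no high school), 13-14 = 2 (high school, GED or equivalent),
# 15-17 = 3 (some college), 18 = 4 (bachelor), 19-21 = 5 (postgraduate). Any other value returns 0.
# Note that the codes 777, 999 and NaN were replaced by 0 above, so they fall in group 1.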
def create_edu_groups(x):
    if x < 13:
        return 1
    elif x in [13, 14]:
        return 2
    elif x in [15, 16, 17]:
        return 3
    elif x == 18:
        return 4
    elif x in [19, 20, 21]:
        return 5
    else:
        return 0

demo_data.loc[:, 'ParentalEducation'] = demo_data['high_edu'].apply(create_edu_groups)

# Extracting the income groups.
# Group levels of income together: <6 = 1 (< 50 000), 6-8 = 2 (50-100 000), 9-10 = 3 (> 100 000).
# Any other value (777, 999, NaN) returns 0.
def create_income_groups(x):
    if x < 6:
        return 1
    elif x in [6, 7, 8]:
        return 2
    elif x in [9, 10]:
        return 3
    else:
        return 0

demo_data.loc[:, 'IncomeGroups'] = demo_data['Income'].apply(create_income_groups)

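
The recoding functions in the cell above can also be written without `apply`. The following is a sketch under the assumption that the codes are whole numbers, run on made-up values; `invert_handedness_codes` and `education_groups` are hypothetical names that mirror `invert_handedness` and `create_edu_groups`.

```python
import numpy as np
import pandas as pd


def invert_handedness_codes(codes):
    """Swap codes 1 and 2, keep 3, and turn anything else into NaN."""
    return codes.map({1: 2, 2: 1, 3: 3})


def education_groups(high_edu):
    """Bin education codes into groups 1 to 5; any other value becomes 0."""
    conditions = [
        high_edu < 13,
        high_edu.isin([13, 14]),
        high_edu.isin([15, 16, 17]),
        high_edu == 18,
        high_edu.isin([19, 20, 21]),
    ]
    groups = np.select(conditions, [1, 2, 3, 4, 5], default=0)
    return pd.Series(groups, index=high_edu.index)


# Made-up codes, only to exercise every branch.
toy_hand = pd.Series([1, 2, 3, 4])
toy_edu = pd.Series([0, 12, 13, 17, 18, 21, 22, np.nan])

inverted = invert_handedness_codes(toy_hand)
grouped = education_groups(toy_edu)
print(inverted.tolist())
print(grouped.tolist())
```

`np.select` evaluates the conditions in order and uses the first match, which reproduces the `if`/`elif` chain; NaN matches none of the conditions and therefore ends up in the default group 0.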

In [5]:
# Constructing a baseline dataframe first.
cbcl_baseline = cbcl[cbcl.eventname == "baseline_year_1_arm_1"]
nihtb_baseline = nihtb[nihtb.eventname == "baseline_year_1_arm_1"]
lmt_baseline = lmt[lmt.eventname == "baseline_year_1_arm_1"]
ravlt_baseline = ravlt[ravlt.eventname == "baseline_year_1_arm_1"]
wisc_baseline = wisc[wisc.eventname == "baseline_year_1_arm_1"]

# CBCL syndrome scores.
cbcl_baseline = cbcl_baseline[["src_subject_id", "cbcl_scr_syn_internal_r", "cbcl_scr_syn_external_r", "cbcl_scr_07_stress_r"]]
cbcl_baseline.columns = ["subjectkey", "Internalizing", "Externalizing", "Stress"]

# Scale the CBCL scores.
cbcl_baseline.loc[:, "Internalizing"] = StandardScaler().fit_transform(cbcl_baseline[["Internalizing"]])
cbcl_baseline.loc[:, "Externalizing"] = StandardScaler().fit_transform(cbcl_baseline[["Externalizing"]])
cbcl_baseline.loc[:, "Stress"] = StandardScaler().fit_transform(cbcl_baseline[["Stress"]])

# NIHTB scores.
nihtb_vars = [
    "src_subject_id",
    "nihtbx_picvocab_uncorrected",
    "nihtbx_flanker_uncorrected",
    "nihtbx_list_uncorrected",
    "nihtbx_cardsort_uncorrected",
    "nihtbx_pattern_uncorrected",
    "nihtbx_picture_uncorrected",
    "nihtbx_reading_uncorrected",
]
nihtb_baseline = nihtb_baseline[nihtb_vars]
nihtb_baseline.columns = [
    "subjectkey",
    "PictureVocab",
    "Flanker",
    "ListSorting",
    "CardSort",
    "PatternComparison",
    "PictureSequence",
    "OralReading"
]

# Scale the NIHTB scores.
nihtb_baseline.loc[:, "PictureVocab"] = StandardScaler().fit_transform(nihtb_baseline[["PictureVocab"]])
nihtb_baseline.loc[:, "Flanker"] = StandardScaler().fit_transform(nihtb_baseline[["Flanker"]])
nihtb_baseline.loc[:, "ListSorting"] = StandardScaler().fit_transform(nihtb_baseline[["ListSorting"]])
nihtb_baseline.loc[:, "CardSort"] = StandardScaler().fit_transform(nihtb_baseline[["CardSort"]])
nihtb_baseline.loc[:, "PatternComparison"] = StandardScaler().fit_transform(nihtb_baseline[["PatternComparison"]])
nihtb_baseline.loc[:, "PictureSequence"] = StandardScaler().fit_transform(nihtb_baseline[["PictureSequence"]])
nihtb_baseline.loc[:, "OralReading"] = StandardScaler().fit_transform(nihtb_baseline[["OralReading"]])

# Little Man's Task scores.
lmt_baseline = lmt_baseline[["src_subject_id", "lmt_scr_perc_correct"]]
lmt_baseline.columns = ["subjectkey", "LMT"]

# Scale the LMT scores.
lmt_baseline.loc[:, "LMT"] = StandardScaler().fit_transform(lmt_baseline[["LMT"]])

# Rey Auditory Verbal Learning Test scores.
ravlt_baseline = ravlt_baseline[["src_subject_id", "pea_ravlt_ld_trial_vii_tc"]]
ravlt_baseline.columns = ["subjectkey", "RAVLT"]

# Scale the RAVLT scores.
ravlt_baseline.loc[:, "RAVLT"] = StandardScaler().fit_transform(ravlt_baseline[["RAVLT"]])

# Wechsler Intelligence Scale for Children scores.
wisc_baseline = wisc_baseline[["src_subject_id", "pea_wiscv_tss"]]
wisc_baseline.columns = ["subjectkey", "WISCMatrix"]

# Scale the WISC scores.
wisc_baseline.loc[:, "WISCMatrix"] = StandardScaler().fit_transform(wisc_baseline[["WISCMatrix"]])
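
For reference, the scaling applied to each score above is a z-score computed with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default. The sketch below reproduces it with pandas only, on made-up values; `zscore_columns` is a hypothetical helper, and constant columns (zero variance) are not handled.

```python
import numpy as np
import pandas as pd


def zscore_columns(table, columns):
    """Return a copy of `table` with `columns` centered and scaled to unit variance."""
    out = table.copy()
    for col in columns:
        values = out[col].astype(float)
        # Missing values are skipped when computing the mean and the
        # standard deviation, and stay missing in the result.
        out[col] = (values - values.mean()) / values.std(ddof=0)
    return out


# Made-up scores with one missing value.
toy = pd.DataFrame({
    "subjectkey": ["S1", "S2", "S3", "S4"],
    "LMT": [0.2, 0.4, 0.6, np.nan],
})
scaled = zscore_columns(toy, ["LMT"])
print(scaled)
```

Because each time point is scaled separately in this notebook, a z-score of 0 refers to the mean of that time point only.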

In [6]:
# Computing the diagnosis variables using the KSADS.
!python {repository_path}/scripts/generate_dx_ABCD.py --in-root-folder {abcd_base_path} \
    --output {output_dir}/abcd_dx_labels_baseline.xlsx --eventname "baseline_year_1_arm_1"

In [7]:
# Importing back the diagnosis labels.
dx_labels = load_df_in_any_format(f"{output_dir}/abcd_dx_labels_baseline.xlsx")

# Add global psychopathology score (1 = at least one diagnosis, 0 = no diagnosis).
dx_labels.loc[:, "PSYPATHO"] = dx_labels.iloc[:, 1:].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)
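
The global score above flags subjects with at least one diagnosis. A vectorized sketch of the same idea is shown below, assuming that the diagnosis columns hold 0/1 labels; the column names `DX_A` and `DX_B` are placeholders, not the names produced by `generate_dx_ABCD.py`.

```python
import pandas as pd

# Made-up diagnosis labels (0 = absent, 1 = present).
toy_dx = pd.DataFrame({
    "subjectkey": ["S1", "S2", "S3"],
    "DX_A": [0, 1, 0],
    "DX_B": [0, 1, 1],
})

dx_columns = ["DX_A", "DX_B"]

# 1 when at least one diagnosis column is positive, 0 otherwise.
toy_dx["PSYPATHO"] = toy_dx[dx_columns].gt(0).any(axis=1).astype(int)
print(toy_dx)
```

Selecting the diagnosis columns by name, rather than by position, keeps the result unchanged if other columns are later added to the table.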

In [8]:
# First, merging all the cognitive and behavioral dataframes. Then, merging the demographics and covariates dataframes.
merged_data = merge_dataframes({"agemonth": age_data, "dx": dx_labels, "cbcl": cbcl_baseline,
                                "nihtb": nihtb_baseline, "lmt": lmt_baseline,
                                "ravlt": ravlt_baseline, "wisc": wisc_baseline},
                                index="subjectkey")
merged_data.dropna(inplace=True, axis=0)
merged_data.reset_index(drop=False, inplace=True)
print("Number of subjects retained for the analysis: {}".format(merged_data.shape[0]))

Number of subjects retained for the analysis: 10843
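
The merge-then-drop logic of the previous cell can be illustrated with plain pandas. The internals of `merge_dataframes` are not shown in this notebook, so the outer join used here is an assumption; `outer_merge` is a hypothetical stand-in and the tables contain made-up values.

```python
from functools import reduce

import pandas as pd


def outer_merge(tables, index="subjectkey"):
    """Outer-join every table of the `tables` dict on the `index` column."""
    indexed = [table.set_index(index) for table in tables.values()]
    return reduce(lambda left, right: left.join(right, how="outer"), indexed)


# Made-up tables; S2 lacks an LMT score and S3 lacks a CBCL score.
age = pd.DataFrame({"subjectkey": ["S1", "S2", "S3"], "AgeMonths": [118, 120, 125]})
cbcl = pd.DataFrame({"subjectkey": ["S1", "S2"], "Stress": [0.1, -0.3]})
lmt = pd.DataFrame({"subjectkey": ["S1", "S3"], "LMT": [0.8, 0.5]})

merged = outer_merge({"age": age, "cbcl": cbcl, "lmt": lmt})

# Keep only the subjects with a value in every column.
complete = merged.dropna(axis=0).reset_index(drop=False)
print(f"{merged.shape[0]} subjects after merging, "
      f"{complete.shape[0]} with complete data")
```

Under this assumption, the number printed by the notebook is the count of subjects with no missing value in any of the merged tables.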


In [9]:
# Concatenating all dataframes.
abcd_data_baseline = merge_dataframes({"site": site, "demo": demo_data,
                                       "hand": hand_data, "vision": vision_data,
                                       "merged": merged_data},
                                      index="subjectkey")

# Dropping the rows with NA in the last 6 columns, which correspond to the cognitive and behavioral data.
abcd_data_baseline.dropna(subset=abcd_data_baseline.columns[-6:], inplace=True, axis=0, how="all")

# Assert that the number of subjects is the same as before.
assert abcd_data_baseline.shape[0] == merged_data.shape[0], "Number of subjects do not match."

# Save the final dataframe.
abcd_data_baseline.to_excel(f"{output_dir}/abcd_data_baseline.xlsx", index=True, header=True)

print("Baseline data gathering completed.")

# This next line is commented out since data is protected by a data use agreement.
# abcd_data_baseline.head() # Please inspect head of the dataframe to validate correct merging.

Baseline data gathering completed.
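
The `dropna` call in the previous cell uses `how="all"`, which removes a row only when every column listed in `subset` is missing. The toy example below (made-up values, two score columns instead of six) contrasts it with `how="any"`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Site": ["site01", "site02", "site03"],
        "LMT": [0.4, np.nan, np.nan],
        "RAVLT": [1.2, np.nan, -0.5],
    },
    index=pd.Index(["S1", "S2", "S3"], name="subjectkey"),
)

score_columns = toy.columns[-2:]

# S2 is dropped because both scores are missing; S3 is kept.
kept_all = toy.dropna(subset=score_columns, how="all")

# With how="any", S3 would be dropped as well.
kept_any = toy.dropna(subset=score_columns, how="any")

print(kept_all.index.tolist(), kept_any.index.tolist())
```

Since `merged_data` already contains complete cases only, this step removes the subjects that have demographic data but no cognitive or behavioral data, which is why the assertion on the number of subjects is expected to hold.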


#### **2-year follow-up data gathering.**

In [21]:
# Starting by computing the diagnosis labels.
!python {repository_path}/scripts/generate_dx_ABCD.py --in-root-folder {abcd_base_path} \
    --output {output_dir}/abcd_dx_labels_2y.xlsx --eventname "2_year_follow_up_y_arm_1"

In [22]:
# Importing back the diagnosis labels.
dx_labels = load_df_in_any_format(f"{output_dir}/abcd_dx_labels_2y.xlsx")

# Add global psychopathology score (1 = at least one diagnosis, 0 = no diagnosis).
dx_labels.loc[:, "PSYPATHO"] = dx_labels.iloc[:, 1:].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

# Fetch age at 2 year follow-up.
age_2y = agemonth[agemonth.eventname == "2_year_follow_up_y_arm_1"]
age_vars = ["src_subject_id", "interview_age"]
age_2y = age_2y[age_vars]
age_2y.columns = ["subjectkey", "AgeMonths"]

# Vision data at 2 year follow-up.
vision_2y = vision[vision.eventname == "2_year_follow_up_y_arm_1"]
vision_vars = ["src_subject_id", "snellen_va_y"]
vision_2y = vision_2y[vision_vars]
vision_2y.columns = ["subjectkey", "Vision"]

In [23]:
# Constructing a 2y follow-up dataset.
cbcl_2y = cbcl[cbcl.eventname == "2_year_follow_up_y_arm_1"]
nihtb_2y = nihtb[nihtb.eventname == "2_year_follow_up_y_arm_1"]
lmt_2y = lmt[lmt.eventname == "2_year_follow_up_y_arm_1"]
ravlt_2y = ravlt[ravlt.eventname == "2_year_follow_up_y_arm_1"]
dice_2y = dice[dice.eventname == "2_year_follow_up_y_arm_1"]

# CBCL syndrome scores.
cbcl_2y = cbcl_2y[["src_subject_id", "cbcl_scr_syn_internal_r", "cbcl_scr_syn_external_r", "cbcl_scr_07_stress_r"]]
cbcl_2y.columns = ["subjectkey", "Internalizing", "Externalizing", "Stress"]

# Scale the CBCL scores.
cbcl_2y.loc[:, "Internalizing"] = StandardScaler().fit_transform(cbcl_2y[["Internalizing"]])
cbcl_2y.loc[:, "Externalizing"] = StandardScaler().fit_transform(cbcl_2y[["Externalizing"]])
cbcl_2y.loc[:, "Stress"] = StandardScaler().fit_transform(cbcl_2y[["Stress"]])

# NIHTB scores.
nihtb_vars = [
    "src_subject_id",
    "nihtbx_flanker_uncorrected",
    "nihtbx_picvocab_uncorrected",
    "nihtbx_pattern_uncorrected",
    "nihtbx_picture_uncorrected",
    "nihtbx_reading_uncorrected",
]
nihtb_2y = nihtb_2y[nihtb_vars]
nihtb_2y.columns = [
    "subjectkey",
    "Flanker",
    "PictureVocab",
    "PatternComparison",
    "PictureSequence",
    "OralReading"
]

# Scale the NIHTB scores.
nihtb_2y.loc[:, "Flanker"] = StandardScaler().fit_transform(nihtb_2y[["Flanker"]])
nihtb_2y.loc[:, "PictureVocab"] = StandardScaler().fit_transform(nihtb_2y[["PictureVocab"]])
nihtb_2y.loc[:, "PatternComparison"] = StandardScaler().fit_transform(nihtb_2y[["PatternComparison"]])
nihtb_2y.loc[:, "PictureSequence"] = StandardScaler().fit_transform(nihtb_2y[["PictureSequence"]])
nihtb_2y.loc[:, "OralReading"] = StandardScaler().fit_transform(nihtb_2y[["OralReading"]])

# Little Man's Task scores.
lmt_2y = lmt_2y[["src_subject_id", "lmt_scr_perc_correct"]]
lmt_2y.columns = ["subjectkey", "LMT"]

# Some scores are in % rather than in decimal. We need to convert them to decimal.
# Apply only if the scores are above 1.
def convert_lmt_scores(x):
    if x > 1:
        return x / 100
    else:
        return x

lmt_2y.loc[:, "LMT"] = lmt_2y["LMT"].apply(convert_lmt_scores)

# Scale the LMT scores.
lmt_2y.loc[:, "LMT"] = StandardScaler().fit_transform(lmt_2y[["LMT"]])

# Rey Auditory Verbal Learning Test scores.
ravlt_2y = ravlt_2y[["src_subject_id", "pea_ravlt_ld_trial_vii_tc"]]
ravlt_2y.columns = ["subjectkey", "RAVLT"]

# Scale the RAVLT scores.
ravlt_2y.loc[:, "RAVLT"] = StandardScaler().fit_transform(ravlt_2y[["RAVLT"]])

# Game of Dice scores.
dice_2y = dice_2y[["src_subject_id", "gdt_scr_expressions_net_score"]]
dice_2y.columns = ["subjectkey", "DICE"]

# Scale the DICE scores.
dice_2y.loc[:, "DICE"] = StandardScaler().fit_transform(dice_2y[["DICE"]])
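
The percent-to-decimal conversion applied to the LMT scores in the cell above can be expressed without `apply`. This is a sketch on made-up scores; `to_fraction` is a hypothetical name that mirrors `convert_lmt_scores`.

```python
import numpy as np
import pandas as pd


def to_fraction(scores):
    """Divide by 100 the scores above 1 and leave the others untouched."""
    return scores.where(scores <= 1, scores / 100)


# Made-up scores mixing fractions and percentages.
toy_lmt = pd.Series([0.5, 50.0, 1.0, 100.0, np.nan])
converted = to_fraction(toy_lmt)
print(converted.tolist())
```

As in the original function, a value of exactly 1 is read as 100 % correct, and a percentage at or below 1 % cannot be told apart from a fraction.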

In [24]:
# Merge all the cognitive and behavioral dataframes.
merged_data = merge_dataframes({"agemonth": age_2y, "dx": dx_labels, "cbcl": cbcl_2y,
                                "nihtb": nihtb_2y, "lmt": lmt_2y,
                                "ravlt": ravlt_2y, "dice": dice_2y},
                                index="subjectkey")
merged_data.dropna(inplace=True, axis=0)
merged_data.reset_index(drop=False, inplace=True)
print("Number of subjects retained for the analysis: {}".format(merged_data.shape[0]))

Number of subjects retained for the analysis: 7369


In [25]:
# Merge with demographics.
abcd_data_2y = merge_dataframes({"site": site, "demo": demo_data,
                                 "hand": hand_data, "vision": vision_2y,
                                 "merged": merged_data},
                                index="subjectkey")

# Dropping the rows with NA in the last 6 columns, which correspond to the cognitive and behavioral data.
abcd_data_2y.dropna(subset=abcd_data_2y.columns[-6:], inplace=True, axis=0, how="all")

# Assert that the number of subjects is the same as before.
assert abcd_data_2y.shape[0] == merged_data.shape[0], "Number of subjects do not match."

# Save the final dataframe.
abcd_data_2y.to_excel(f"{output_dir}/abcd_data_2y.xlsx", index=True, header=True)

print("2 year follow-up data gathering completed.")

# This next line is commented out since data is protected by a data use agreement.
# abcd_data_2y.head() # Please inspect head of the dataframe to validate correct merging.

2 year follow-up data gathering completed.


#### **4-year follow-up data gathering.**

In [16]:
# Starting by computing the diagnosis labels. (not available as of 30/07/2024)
# !python {repository_path}/scripts/generate_dx_ABCD.py --in-root-folder {abcd_base_path} \
#     --output {output_dir}/abcd_dx_labels_4y.xlsx --eventname "4_year_follow_up_y_arm_1"

In [16]:
# Importing back the diagnosis labels.
# dx_labels = load_df_in_any_format(f"{output_dir}/abcd_dx_labels_4y.xlsx")

# Add global psychopathology score (1 = at least one diagnosis, 0 = no diagnosis).
# dx_labels.loc[:, "PSYPATHO"] = dx_labels.iloc[:, 1:].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

# Fetch age at 4 year follow-up.
age_4y = agemonth[agemonth.eventname == "4_year_follow_up_y_arm_1"]
age_vars = ["src_subject_id", "interview_age"]
age_4y = age_4y[age_vars]
age_4y.columns = ["subjectkey", "AgeMonths"]

# Vision data at 4 year follow-up.
vision_4y = vision[vision.eventname == "4_year_follow_up_y_arm_1"]
vision_vars = ["src_subject_id", "snellen_va_y"]
vision_4y = vision_4y[vision_vars]
vision_4y.columns = ["subjectkey", "Vision"]

In [17]:
# Constructing a 4y follow-up dataset.
cbcl_4y = cbcl[cbcl.eventname == "4_year_follow_up_y_arm_1"]
nihtb_4y = nihtb[nihtb.eventname == "4_year_follow_up_y_arm_1"]
lmt_4y = lmt[lmt.eventname == "4_year_follow_up_y_arm_1"]
dice_4y = dice[dice.eventname == "4_year_follow_up_y_arm_1"]
bird_4y = bird[bird.eventname == "4_year_follow_up_y_arm_1"]

# CBCL syndrome scores.
cbcl_4y = cbcl_4y[["src_subject_id", "cbcl_scr_syn_internal_r", "cbcl_scr_syn_external_r", "cbcl_scr_07_stress_r"]]
cbcl_4y.columns = ["subjectkey", "Internalizing", "Externalizing", "Stress"]

# Scale the CBCL scores.
cbcl_4y.loc[:, "Internalizing"] = StandardScaler().fit_transform(cbcl_4y[["Internalizing"]])
cbcl_4y.loc[:, "Externalizing"] = StandardScaler().fit_transform(cbcl_4y[["Externalizing"]])
cbcl_4y.loc[:, "Stress"] = StandardScaler().fit_transform(cbcl_4y[["Stress"]])

# NIHTB scores.
nihtb_vars = [
    "src_subject_id",
    "nihtbx_flanker_uncorrected",
    "nihtbx_list_uncorrected",
    "nihtbx_picvocab_uncorrected",
    "nihtbx_pattern_uncorrected",
    "nihtbx_picture_uncorrected",
    "nihtbx_reading_uncorrected",
]
nihtb_4y = nihtb_4y[nihtb_vars]
nihtb_4y.columns = [
    "subjectkey",
    "Flanker",
    "ListSorting",
    "PictureVocab",
    "PatternComparison",
    "PictureSequence",
    "OralReading"
]

# Scale the NIHTB scores.
nihtb_4y.loc[:, "Flanker"] = StandardScaler().fit_transform(nihtb_4y[["Flanker"]])
nihtb_4y.loc[:, "ListSorting"] = StandardScaler().fit_transform(nihtb_4y[["ListSorting"]])
nihtb_4y.loc[:, "PictureVocab"] = StandardScaler().fit_transform(nihtb_4y[["PictureVocab"]])
nihtb_4y.loc[:, "PatternComparison"] = StandardScaler().fit_transform(nihtb_4y[["PatternComparison"]])
nihtb_4y.loc[:, "PictureSequence"] = StandardScaler().fit_transform(nihtb_4y[["PictureSequence"]])
nihtb_4y.loc[:, "OralReading"] = StandardScaler().fit_transform(nihtb_4y[["OralReading"]])

# Little Man's Task scores.
lmt_4y = lmt_4y[["src_subject_id", "lmt_scr_perc_correct"]]
lmt_4y.columns = ["subjectkey", "LMT"]

# Scale the LMT scores.
lmt_4y.loc[:, "LMT"] = StandardScaler().fit_transform(lmt_4y[["LMT"]])

# Game of Dice scores.
dice_4y = dice_4y[["src_subject_id", "gdt_scr_values_risky"]]
dice_4y.columns = ["subjectkey", "DICE"]

# Scale the DICE scores.
dice_4y.loc[:, "DICE"] = StandardScaler().fit_transform(dice_4y[["DICE"]])

# Bird scores.
bird_4y = bird_4y[["src_subject_id", "bird_scr_score"]]
bird_4y.columns = ["subjectkey", "BIRD"]

# Scale the BIRD scores.
bird_4y.loc[:, "BIRD"] = StandardScaler().fit_transform(bird_4y[["BIRD"]])

In [18]:
# Merge all the cognitive and behavioral dataframes.
merged_data = merge_dataframes({"agemonth": age_4y, "cbcl": cbcl_4y,
                                "nihtb": nihtb_4y, "lmt": lmt_4y,
                                "dice": dice_4y, "bird": bird_4y},
                                index="subjectkey")
merged_data.dropna(inplace=True, axis=0)
merged_data.reset_index(drop=False, inplace=True)
print("Number of subjects retained for the analysis: {}".format(merged_data.shape[0]))

Number of subjects retained for the analysis: 2846


In [19]:
# Merge with demographics.
abcd_data_4y = merge_dataframes({"site": site, "demo": demo_data,
                                 "hand": hand_data, "vision": vision_4y,
                                 "merged": merged_data},
                                index="subjectkey")

# Dropping the rows with NA in the last 6 columns, which correspond to the cognitive and behavioral data.
abcd_data_4y.dropna(subset=abcd_data_4y.columns[-6:], inplace=True, axis=0, how="all")

# Assert that the number of subjects is the same as before.
assert abcd_data_4y.shape[0] == merged_data.shape[0], "Number of subjects do not match."

# Save the final dataframe.
abcd_data_4y.to_excel(f"{output_dir}/abcd_data_4y.xlsx", index=True, header=True)

print("4 year follow-up data gathering completed.")

# This next line is commented out since data is protected by a data use agreement.
# abcd_data_4y.head() # Please inspect head of the dataframe to validate correct merging.

4 year follow-up data gathering completed.
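
Once the three tables are saved, a simple sanity check is to count the subjects present at every time point. The sketch below runs on made-up tables; `subjects_in_all` is a hypothetical helper, and the identifier column name `subjectkey` is assumed to match the saved files.

```python
import pandas as pd


def subjects_in_all(tables, id_column="subjectkey"):
    """Return the sorted identifiers found in every table of `tables`."""
    id_sets = [set(table[id_column]) for table in tables]
    return sorted(set.intersection(*id_sets))


# Made-up tables standing in for the baseline, 2-year and 4-year outputs.
baseline = pd.DataFrame({"subjectkey": ["S1", "S2", "S3", "S4"]})
year_2 = pd.DataFrame({"subjectkey": ["S1", "S2", "S4"]})
year_4 = pd.DataFrame({"subjectkey": ["S2", "S4", "S5"]})

common = subjects_in_all([baseline, year_2, year_4])
print(f"{len(common)} subjects with data at all three time points")
```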
