### **IMPORTING RELEVANT DATASETS AND FEATURE SELECTION**

This section imports all the datasets needed to run the analyses
from Gagnon et al., XXXX. Feature selection is then performed to keep only the
relevant variables.

#### **Setting up relevant paths.**

For the following analyses to work, please update these variables in the next
cell.

1. `repository_path`: should point to the location of the git repository.
1. `abcd_base_path`: should point to the base folder of the ABCD data release.
1. `geste_base_dir`: should point to the base folder of the GESTE data.
1. `banda_dir`: should point to the base folder of the BANDA data release.
1. `output_folder`: should point to the folder in which results are written throughout the analyses.
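
Before running the later cells, it can help to confirm that the configured folders actually exist. The snippet below is a minimal sketch of such a check: `find_missing_paths` is a hypothetical helper (not part of NeuroStatX), and the temporary directory only stands in for the real locations.

```python
import os
import tempfile


def find_missing_paths(paths):
    """Return the names of the entries whose folder does not exist."""
    return [name for name, path in paths.items() if not os.path.isdir(path)]


# Demonstration with a temporary folder standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    example_paths = {
        "repository_path": tmp,                                 # exists
        "abcd_base_path": os.path.join(tmp, "does_not_exist"),  # missing
    }
    missing = find_missing_paths(example_paths)

print(missing)
```

In the notebook, the dictionary would instead hold the variables defined in the next cell.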

#### **Requirements.**

To run the following code, the NeuroStatX toolbox
(https://github.com/gagnonanthony/NeuroStatX.git) must be installed. If it is
not already installed on your machine, please run the above cell (or follow the
instructions in the repository/documentation). All the analyses should be
runnable on an entry-level computer (the time to complete some steps may vary;
long-running steps are labeled with **This is a long running process. Go get a
coffee !**). You may choose to skip some of these steps if need be.


In [2]:
import os

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

from neurostatx.io.utils import load_df_in_any_format
from neurostatx.utils.preprocessing import merge_dataframes

In [3]:
# Setting up relevant paths.
repository_path = "/Users/anthonygagnon/code/Article-s-Code/" # CHANGE THIS
abcd_base_path = "/Volumes/T7/CCPM/ABCD/Release_5.1/abcd-data-release-5.1/" # CHANGE THIS
geste_base_dir = "/Volumes/T7/CCPM/GESTE/" # CHANGE THIS
banda_dir = '/Volumes/T7/CCPM/BANDA/BANDARelease1.1/' # CHANGE THIS
output_folder = "/Volumes/T7/CCPM/RESULTS_JUNE_24/" # CHANGE THIS

# Setting up the paths for the data.
output_dir = f"{output_folder}/datagathering/" # DO NOT CHANGE THIS
os.makedirs(output_dir, exist_ok=True)

### **Fetching relevant data for the ABCD Study.**

Subsequent cells load tables from the ABCD Release 5.1, keep only baseline data, and select the variables relevant to the present analysis. The final table is written to the `output_dir` specified above.

**Please note that this dataset is available through a data use certificate. For more information, please see the [NIMH Data Archive website](https://nda.nih.gov/) or the [ABCD Study wiki](https://wiki.abcdstudy.org/).**
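
The baseline filtering described above is repeated for every table, so it could be wrapped in a small function. The following is only a sketch on made-up data: `keep_baseline` is a hypothetical helper, and the second event label is a placeholder rather than a real ABCD event name.

```python
import pandas as pd


def keep_baseline(df, event="baseline_year_1_arm_1"):
    """Keep only the rows recorded at the requested event."""
    return df.loc[df["eventname"] == event].copy()


# Toy table with made-up subject IDs and values.
toy = pd.DataFrame({
    "src_subject_id": ["S1", "S1", "S2"],
    "eventname": ["baseline_year_1_arm_1", "some_follow_up_event",
                  "baseline_year_1_arm_1"],
    "score": [10, 12, 8],
})
baseline = keep_baseline(toy)
print(baseline)
```

Returning a copy means later column assignments act on an independent table rather than on a slice of the original.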

In [74]:
# Load all necessary data tables. 
lt = load_df_in_any_format(f'{abcd_base_path}/core/abcd-general/abcd_y_lt.csv')
lt = lt.loc[lt.eventname == 'baseline_year_1_arm_1']
demo = load_df_in_any_format(f'{abcd_base_path}/core/abcd-general/abcd_p_demo.csv')
demo = demo.loc[demo.eventname == 'baseline_year_1_arm_1']
agemonth = load_df_in_any_format(f'{abcd_base_path}/core/abcd-general/abcd_y_lt.csv')
agemonth = agemonth.loc[agemonth.eventname == 'baseline_year_1_arm_1']
cbcl = load_df_in_any_format(f'{abcd_base_path}/core/mental-health/mh_p_cbcl.csv')
cbcl = cbcl.loc[cbcl.eventname == 'baseline_year_1_arm_1']
nihtb = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_nihtb.csv')
nihtb = nihtb.loc[nihtb.eventname == 'baseline_year_1_arm_1']
lmt = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_lmt.csv')
lmt = lmt.loc[lmt.eventname == 'baseline_year_1_arm_1']
ravlt = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_ravlt.csv')
ravlt = ravlt.loc[ravlt.eventname == 'baseline_year_1_arm_1']
wisc = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_wisc.csv')
wisc = wisc.loc[wisc.eventname == 'baseline_year_1_arm_1']
hand = load_df_in_any_format(f'{abcd_base_path}/core/neurocognition/nc_y_ehis.csv')
hand = hand.loc[hand.eventname == 'baseline_year_1_arm_1']


In [75]:
# Selecting syndrome scores.
syndromes = [
    "src_subject_id",
    "cbcl_scr_syn_internal_r",
    "cbcl_scr_syn_external_r",
    "cbcl_scr_07_stress_r"
]
cbcl_scores = cbcl[syndromes]
cbcl_scores.columns = [
    "subjectkey",
    "Internalization",
    "Externalization",
    "Stress"
]
cbcl_scores.loc[:, 'Internalization'] = StandardScaler().fit_transform(cbcl_scores[['Internalization']])
cbcl_scores.loc[:, 'Externalization'] = StandardScaler().fit_transform(cbcl_scores[['Externalization']])
cbcl_scores.loc[:, 'Stress'] = StandardScaler().fit_transform(cbcl_scores[['Stress']])
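
The per-column scaling above applies the same transformation to each score. As an illustration, the sketch below z-scores several columns in a loop using the population standard deviation (`ddof=0`), which is what `StandardScaler` computes with its default settings. `zscore_columns` is a hypothetical helper and the values are made up.

```python
import numpy as np
import pandas as pd


def zscore_columns(df, columns):
    """Return a copy of df with the listed columns z-scored (ddof=0)."""
    out = df.copy()
    for col in columns:
        values = out[col].astype(float)
        out[col] = (values - values.mean()) / values.std(ddof=0)
    return out


toy = pd.DataFrame({
    "subjectkey": ["S1", "S2", "S3"],
    "Internalization": [1.0, 2.0, 3.0],
    "Stress": [0.0, 5.0, 10.0],
})
scaled = zscore_columns(toy, ["Internalization", "Stress"])
print(scaled)
```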

In [76]:
# Lumping variables within a common dataframe.
# NIH Toolbox
nihtb_vars = [
    "src_subject_id",
    "nihtbx_picvocab_uncorrected",
    "nihtbx_flanker_uncorrected",
    "nihtbx_list_uncorrected",
    "nihtbx_cardsort_uncorrected",
    "nihtbx_pattern_uncorrected",
    "nihtbx_picture_uncorrected",
    "nihtbx_reading_uncorrected",
]
nihtb_scores = nihtb[nihtb_vars]
nihtb_scores.columns = [
    "subjectkey",
    "PictureVocab",
    "Flanker",
    "ListSorting",
    "CardSort",
    "PatternComparison",
    "PictureSequence",
    "OralReading"
]
nihtb_scores.loc[:, 'PictureVocab'] = StandardScaler().fit_transform(nihtb_scores[['PictureVocab']])
nihtb_scores.loc[:, 'Flanker'] = StandardScaler().fit_transform(nihtb_scores[['Flanker']])
nihtb_scores.loc[:, 'ListSorting'] = StandardScaler().fit_transform(nihtb_scores[['ListSorting']])
nihtb_scores.loc[:, 'CardSort'] = StandardScaler().fit_transform(nihtb_scores[['CardSort']])
nihtb_scores.loc[:, 'PatternComparison'] = StandardScaler().fit_transform(nihtb_scores[['PatternComparison']])
nihtb_scores.loc[:, 'PictureSequence'] = StandardScaler().fit_transform(nihtb_scores[['PictureSequence']])
nihtb_scores.loc[:, 'OralReading'] = StandardScaler().fit_transform(nihtb_scores[['OralReading']])

# Little Man's Task.
lmt_vars = [
    "src_subject_id",
    "lmt_scr_perc_correct"
]
lmt_scores = lmt[lmt_vars]
lmt_scores.columns = [
    "subjectkey",
    "LMT"
]
lmt_scores.loc[:, 'LMT'] = StandardScaler().fit_transform(lmt_scores[['LMT']])

# Pearson's RAVLT.
ravlt_vars = [
    "src_subject_id",
    "pea_ravlt_ld_trial_vii_tc"
]
ravlt_scores = ravlt[ravlt_vars]
ravlt_scores.columns = [
    "subjectkey",
    "RAVLT"
]
ravlt_scores.loc[:, 'RAVLT'] = StandardScaler().fit_transform(ravlt_scores[['RAVLT']])

# WISC.
wisc_vars = [
    "src_subject_id",
    "pea_wiscv_tss"
]
wisc_scores = wisc[wisc_vars]
wisc_scores.columns = [
    "subjectkey",
    "WISCMatrix"
]
wisc_scores.loc[:, 'WISCMatrix'] = StandardScaler().fit_transform(wisc_scores[['WISCMatrix']])

In [77]:
# Importing the diagnosis labels.
!python "{repository_path}/scripts/generate_dx_ABCD.py" --in-root-folder "{abcd_base_path}" \
    --output "{output_dir}/abcd_dx_labels.xlsx"

In [78]:
# Importing back the diagnosis labels.
dx_labels = load_df_in_any_format(f'{output_dir}/abcd_dx_labels.xlsx')

# Add a global psychopathology score (1 = at least one diagnosis, 0 = no diagnosis).
dx_labels.loc[:, 'PSYPATHO'] = dx_labels.iloc[:, 1:].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)
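
The global psychopathology flag can also be written as a vectorized comparison. This is a sketch on a made-up diagnosis table, assuming (as in the cell above) that the first column holds the subject ID and every other column is a 0/1 diagnosis label.

```python
import pandas as pd

# Toy diagnosis table (made-up values); the first column holds the subject ID.
toy_dx = pd.DataFrame({
    "subjectkey": ["S1", "S2", "S3"],
    "ADHD": [0, 1, 0],
    "OCD": [0, 1, 0],
    "CD": [0, 0, 1],
})

# 1 if the subject has at least one diagnosis, 0 otherwise.
toy_dx["PSYPATHO"] = (toy_dx.iloc[:, 1:].sum(axis=1) > 0).astype(int)
print(toy_dx)
```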

In [79]:
# Fetching basic demographics.
site_vars = [
    "src_subject_id",
    "site_id_l"
]
site = lt[site_vars]
site.columns = [
    "subjectkey",
    "Site"
]
demo_vars = [
    "src_subject_id",
    "demo_sex_v2",
    "race_ethnicity",
    "demo_prnt_ed_v2",
    "demo_prtnr_ed_v2",
    "demo_comb_income_v2",
]
demo = demo[demo_vars]
demo.columns = [
    "subjectkey",
    "Sex",
    "Ethnicity",
    "Parent_ed1",
    "Parent_ed2",
    "Income"
]
agemonth_vars = [
    "src_subject_id",
    "interview_age"
]
agemonth = agemonth[agemonth_vars]
agemonth.columns = [
    "subjectkey",
    "AgeMonths"
]
hand = hand[['src_subject_id', 'ehi_y_ss_scoreb']]
hand.columns = ['subjectkey', 'Handedness']

# Invert the handedness score to match the other datasets: 1 = left, 2 = right, 3 = ambidextrous.
def invert_handedness(x):
    if x == 1:
        return 2
    elif x == 2:
        return 1
    elif x == 3:
        return 3
    else:
        return np.nan
hand.loc[:, 'Handedness'] = hand.Handedness.apply(invert_handedness)

In [80]:
# Compute some demographics variables.
# Highest education level (parent). Taking the highest amongst the two parents.
demo.loc[:, 'Parent_ed2'] = demo['Parent_ed2'].replace([777, 999, np.nan], 0)
demo.loc[:, 'high_edu'] = demo[['Parent_ed1', 'Parent_ed2']].values.max(1)

# Group levels together (<13 = 1, no high school; 13-14 = 2, high school, GED or equivalent;
# 15-17 = 3, some college; 18 = 4, bachelor; 19-21 = 5, postgraduate; anything else = 0).
def create_edu_groups(x):
    if x < 13:
        return 1
    elif x in [13, 14]:
        return 2
    elif x in [15, 16, 17]:
        return 3
    elif x == 18:
        return 4
    elif x in [19, 20, 21]:
        return 5
    else:
        return 0

demo.loc[:, 'edu_groups'] = demo['high_edu'].apply(create_edu_groups)

# Group levels of income together (<6 = 1, < 50 000; 6-8 = 2, 50-100 000; 9-10 = 3, > 100 000; anything else = 0).
def create_income_groups(x):
    if x < 6:
        return 1
    elif x in [6, 7, 8]:
        return 2
    elif x in [9, 10]:
        return 3
    else:
        return 0

demo.loc[:, 'income_groups'] = demo['Income'].apply(create_income_groups)
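
The if/elif grouping above can alternatively be expressed with `pd.cut`. The sketch below reproduces the income grouping for integer codes only; `income_groups_cut` is a hypothetical helper, the input values are made up, and non-integer inputs would be binned differently from the original function.

```python
import numpy as np
import pandas as pd


def income_groups_cut(income):
    """Bin integer income codes into 1, 2, 3, or 0 when outside the range."""
    # Left-closed bins: [-inf, 6) -> 0, [6, 9) -> 1, [9, 11) -> 2.
    codes = pd.cut(income, bins=[-np.inf, 6, 9, 11], right=False, labels=False)
    # Shift to 1-3; missing or out-of-range codes (e.g. 777, 999) become 0.
    return (codes + 1).fillna(0).astype(int)


toy_income = pd.Series([1, 5, 6, 8, 9, 10, 777, np.nan])
groups = income_groups_cut(toy_income)
print(groups.tolist())
```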

In [81]:
# First, merging psychometrics and behavioral data, then merging with demographics.
# This way, we avoid the loss of subjects due to missing data in demographics columns.
psy_behav = merge_dataframes({"age": agemonth, "dx": dx_labels,
                            "cbcl": cbcl_scores, "nihtb": nihtb_scores,
                            "lmt": lmt_scores, "ravlt": ravlt_scores,
                            "wisc": wisc_scores}, index="subjectkey")
psy_behav.dropna(inplace=True, axis=0)
psy_behav.reset_index(drop=False, inplace=True)
print("Number of subjects retained for the analysis: {}".format(psy_behav.shape[0]))

Number of subjects retained for the analysis: 10843
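
To illustrate the merge-then-drop logic, the sketch below joins toy tables on `subjectkey` with plain pandas and then removes incomplete rows. It is not necessarily how `merge_dataframes` is implemented: `merge_on_key` is a hypothetical helper, the outer join is an assumption, and all values are made up.

```python
from functools import reduce

import numpy as np
import pandas as pd


def merge_on_key(frames, key="subjectkey"):
    """Outer-join a list of DataFrames on a shared key column."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        frames,
    )


age = pd.DataFrame({"subjectkey": ["S1", "S2", "S3"],
                    "AgeMonths": [120, 118, 125]})
cbcl = pd.DataFrame({"subjectkey": ["S1", "S2"], "Stress": [0.3, np.nan]})
lmt = pd.DataFrame({"subjectkey": ["S1", "S2", "S3"], "LMT": [0.1, -0.4, 0.9]})

merged = merge_on_key([age, cbcl, lmt])
# Keep only subjects with a value in every column.
complete = merged.dropna(axis=0).reset_index(drop=True)
print(complete)
```

Here S2 is dropped for a missing score and S3 for being absent from one table, which mirrors why the retained count can be lower than the number of enrolled subjects.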


In [82]:
# Concatenating all the dataframes.
abcd_data = merge_dataframes({"site": site, "demo": demo, "hand": hand,
                                "psy_behav": psy_behav}, index="subjectkey")
# Dropping rows with NA in all of the last 6 columns, which correspond to the behavioral and psychometric data.
abcd_data.dropna(inplace=True, axis=0, subset=abcd_data.columns[-6:], how="all")

# Reordering the columns.
abcd_data = abcd_data[["Site", "Sex", "AgeMonths", "Ethnicity", "Parent_ed1",
                       "Parent_ed2", "high_edu", "edu_groups", "Income", "income_groups",
                       "Handedness", "ADHD", "AD", "OCD", "DD", "BPD", "ODD", "CD",
                       "PTSD", "PSYPATHO", "Internalization", "Externalization", "Stress",
                       "PictureVocab", "Flanker", "ListSorting", "CardSort",
                       "PatternComparison", "PictureSequence", "OralReading",
                       "LMT", "RAVLT", "WISCMatrix"]]

# Assert the number of subjects retained is the same as before.
assert abcd_data.shape[0] == psy_behav.shape[0], "Number of subjects do not match."

# Saving the final dataframe.
abcd_data.to_excel(f'{output_dir}/abcd_data.xlsx', index=True, header=True)

# This next line is commented out since data is protected by a data use agreement.
# abcd_data.head() # Please inspect head of the dataframe to validate correct merging.

### **Fetching relevant data for the Boston Adolescent Neuroimaging of Depression and Anxiety (BANDA) Study**

Subsequent cells load data from the BANDA Data Release 1.1 and keep only the variables relevant to the present analysis. The final table is written to `output_dir`.

The stress problems score is not precomputed in this study, so we compute it manually. Because the score calculations are proprietary, we derived the equation from the [ASEBA report](https://aseba.org/wp-content/uploads/cbclprofile.pdf).

**Please note that this dataset is available through a data use agreement. For more information, please visit the [NIMH Data Archive website](https://nda.nih.gov/) or the [BANDA website](https://www.humanconnectome.org/study/connectomes-related-anxiety-depression/document/banda-release-11).**
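
As a sketch of the manual stress problems computation described above, the items can be summed row-wise in one vectorized call. The item numbers and sentinel codes are those used in this notebook. `stress_problems_score` is a hypothetical helper and the data are made up. `skipna=False` is set so that a missing item yields a missing score, as the chained `+` does.

```python
import pandas as pd

# Item numbers summed into the stress problems score in this notebook.
STRESS_ITEMS = [3, 8, 9, 11, 31, 34, 45, 47, 50, 52, 69, 87, 103, 111]
STRESS_COLUMNS = [f"cbcl{i}" for i in STRESS_ITEMS]


def stress_problems_score(df, missing_codes=(999, 77, 88)):
    """Sum the stress items row-wise after recoding sentinel codes to 0."""
    items = df[STRESS_COLUMNS].replace(list(missing_codes), 0)
    return items.sum(axis=1, skipna=False)


# Toy data: every item scored 1, except one sentinel code for subject S2.
toy = pd.DataFrame(1, index=["S1", "S2"], columns=STRESS_COLUMNS)
toy.loc["S2", "cbcl3"] = 999
scores = stress_problems_score(toy)
print(scores)
```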

In [83]:
def compute_stress_problems(x):
    """ 
    Function to compute the stress problems score.
    """
    return x.cbcl3 + x.cbcl8 + x.cbcl9 + x.cbcl11 + x.cbcl31 + x.cbcl34 + x.cbcl45 +\
                  x.cbcl47 + x.cbcl50 + x.cbcl52 + x.cbcl69 + x.cbcl87 + x.cbcl103 + x.cbcl111

In [84]:
# Loading the NIH Toolbox, Penn, WASI, CBCL, and handedness tables.
banda_dccs = load_df_in_any_format(f'{banda_dir}/dccs01.xlsx')
banda_flanker = load_df_in_any_format(f'{banda_dir}/flanker01.xlsx')
banda_lswm = load_df_in_any_format(f'{banda_dir}/lswmt01.xlsx')
banda_orrt = load_df_in_any_format(f'{banda_dir}/orrt01.xlsx')
banda_pcps = load_df_in_any_format(f'{banda_dir}/pcps01.xlsx')
banda_pwmt = load_df_in_any_format(f'{banda_dir}/pwmt01.xlsx')
banda_pmat = load_df_in_any_format(f'{banda_dir}/pmat01.xlsx')
banda_wasi = load_df_in_any_format(f'{banda_dir}/wasi201.xlsx')
banda_wasi = banda_wasi.loc[banda_wasi.respondent == 'Child']
banda_cbcl = load_df_in_any_format(f'{banda_dir}/cbcl01.xlsx')
banda_cbcl = banda_cbcl.loc[banda_cbcl.visit == 'T1']
banda_hand = load_df_in_any_format(f'{banda_dir}/chaphand01.xlsx')

In [85]:
# Variables selection.
banda_dccs = banda_dccs[['subjectkey', 'interview_age', 'nih_dccs_unadjusted']]
banda_dccs.columns = ['subjectkey', 'Age', 'DCCS']
banda_flanker = banda_flanker[['subjectkey', 'nih_flanker_unadjusted']]
banda_flanker.columns = ['subjectkey', 'Flanker']
banda_lswm = banda_lswm[['subjectkey', 'uss']]
banda_lswm.columns = ['subjectkey', 'ListSorting']
banda_orrt = banda_orrt[['subjectkey', 'read_uss']]
banda_orrt.columns = ['subjectkey', 'OralReading']
banda_pcps = banda_pcps[['subjectkey', 'nih_patterncomp_unadjusted']]
banda_pcps.columns = ['subjectkey', 'PatternComparison']
banda_pwmt = banda_pwmt[['subjectkey', 'cpw_cr']]
banda_pwmt.columns = ['subjectkey', 'PennWM']
banda_pmat = banda_pmat[['subjectkey', 'pmat24_a_cr']]
banda_pmat.columns = ['subjectkey', 'PennMatrix']
banda_wasi = banda_wasi[['subjectkey', 'vocab_totalrawscore']]
banda_wasi.columns = ['subjectkey', 'WASIVocabulary']

for df in [banda_dccs, banda_flanker, banda_lswm, banda_orrt, banda_pcps, banda_pwmt, banda_pmat]:
    df.drop(0, axis=0, inplace=True)

# Compute the cbcl stress problems score.
for i in [3, 8, 9, 11, 31, 34, 45, 47, 50, 52, 69, 87, 103, 111]:
    banda_cbcl.loc[:, f'cbcl{i}'] = banda_cbcl[f'cbcl{i}'].replace([999, 77, 88], 0)
banda_cbcl.loc[:, 'cbcl_stress_raw'] = banda_cbcl.apply(compute_stress_problems, axis=1)

banda_cbcl_vars = [
    'subjectkey',
    'cbcl_internal_raw',
    'cbcl_external_raw',
    'cbcl_stress_raw'
]
banda_cbcl = banda_cbcl[banda_cbcl_vars]
banda_cbcl.columns = [
    'subjectkey',
    'Internalization',
    'Externalization',
    'Stress'
]
banda_cbcl = banda_cbcl.astype({'Internalization': 'float', 'Externalization': 'float', 'Stress': 'float'})

# Scaling the data.
banda_cbcl.loc[:, 'Internalization'] = StandardScaler().fit_transform(banda_cbcl[['Internalization']])
banda_cbcl.loc[:, 'Externalization'] = StandardScaler().fit_transform(banda_cbcl[['Externalization']])
banda_cbcl.loc[:, 'Stress'] = StandardScaler().fit_transform(banda_cbcl[['Stress']])
banda_dccs.loc[:, 'DCCS'] = StandardScaler().fit_transform(banda_dccs[['DCCS']])
banda_flanker.loc[:, 'Flanker'] = StandardScaler().fit_transform(banda_flanker[['Flanker']])
banda_lswm.loc[:, 'ListSorting'] = StandardScaler().fit_transform(banda_lswm[['ListSorting']])
banda_orrt.loc[:, 'OralReading'] = StandardScaler().fit_transform(banda_orrt[['OralReading']])
banda_pcps.loc[:, 'PatternComparison'] = StandardScaler().fit_transform(banda_pcps[['PatternComparison']])
banda_pwmt.loc[:, 'PennWM'] = StandardScaler().fit_transform(banda_pwmt[['PennWM']])
banda_pmat.loc[:, 'PennMatrix'] = StandardScaler().fit_transform(banda_pmat[['PennMatrix']])
banda_wasi.loc[:, 'WASIVocabulary'] = StandardScaler().fit_transform(banda_wasi[['WASIVocabulary']])


In [86]:
# Fetching diagnoses data.
!python "{repository_path}/scripts/generate_dx_BANDA.py" --in-root-folder "{banda_dir}" \
    --output "{output_dir}/banda_dx_labels.xlsx"

# Importing back the diagnosis labels.
banda_dx = load_df_in_any_format(f'{output_dir}/banda_dx_labels.xlsx')

# Add a global psychopathology score (1 = at least one diagnosis, 0 = no diagnosis).
banda_dx.loc[:, 'PSYPATHO'] = banda_dx.iloc[:, 1:].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

In [87]:
# Fetch demographics data.
banda_demo = load_df_in_any_format(f'{banda_dir}/demographics02.xlsx')
banda_demo = banda_demo[banda_demo.visit == 'T1']
banda_demo = banda_demo[['subjectkey', "sex", 'race', 'ethnicity', 'demo_parent_educ', 'demo_other_parent_educ']]
banda_demo.loc[:, 'race_ethnicity'] = np.where(banda_demo['ethnicity'] == 'Hispanic or Latino',
                                               banda_demo['ethnicity'],
                                               np.where(banda_demo['race'] == 'More than one race',
                                                        'Other',
                                                        np.where(banda_demo['race'] == 'Unknown or not reported',
                                                                 'Other',
                                                                 banda_demo['race'])))
banda_demo.loc[:, 'educ_level'] = banda_demo[['demo_parent_educ', 'demo_other_parent_educ']].values.max(1)

# Transfer textual ethnicity to numerical. 
def ethnicity_coding(x):
    if x.race_ethnicity == 'White':
        return 1
    elif x.race_ethnicity == "Asian":
        return 4
    elif x.race_ethnicity == "Hispanic or Latino":
        return 3
    elif x.race_ethnicity == "Black or African American":
        return 2
    elif x.race_ethnicity == "Other":
        return 5
    else:
        # Unrecognized values are marked as missing.
        return np.nan

banda_demo.loc[:, 'race_ethnicity'] = banda_demo.apply(ethnicity_coding, axis=1)

# Transfer textual sex variable to numerical. 1 = Male, 2 = Female.
def sex_coding(x):
    if x.sex == 'M':
        return 1
    elif x.sex == 'F':
        return 2
    else:
        # Unrecognized values are marked as missing.
        return np.nan

banda_demo.loc[:, 'sex'] = banda_demo.apply(sex_coding, axis=1)

# Small function to create education groups. (0-1 = 1: No high school,
# 2 = 2: high school, 3-4 = 3: GED or equivalent, 5 = 4: some college,
# 6 = 5: Bachelor's degree, 7-8 = 6: postgrad)
def create_edu_groups(x):
    if x < 2:
        return 1
    elif x == 2:
        return 2
    elif x in [3, 4]:
        return 3
    elif x == 5:
        return 4
    elif x == 6:
        return 5
    elif x > 6:
        return 6

banda_demo.insert(banda_demo.shape[1], 'edu_groups', banda_demo['educ_level'].apply(create_edu_groups))

# Create handedness score. (1 = left, 2 = right, 3 = ambidextrous)
def compute_handedness(x):
    val = x.iloc[8:22].mode()[0]
    if val == 0:
        return 1
    elif val == 1:
        return 3
    elif val == 2:
        return 2
    else:
        # Handedness not found: mark as missing.
        return np.nan

banda_hand.loc[:, 'Handedness'] = banda_hand.apply(compute_handedness, axis=1)
banda_hand = banda_hand[['subjectkey', 'Handedness']]
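
The text-to-number recodings in the cell above can also be written as dictionary lookups. With `Series.map`, any value that is not a key of the dictionary becomes NaN, which makes unexpected entries easy to spot afterwards. This is a sketch on made-up values, and `SEX_CODES` is a hypothetical name.

```python
import pandas as pd

# Codes used in this notebook: 1 = Male, 2 = Female.
SEX_CODES = {"M": 1, "F": 2}

toy_sex = pd.Series(["M", "F", "unknown", None])
coded = toy_sex.map(SEX_CODES)

# Entries that could not be mapped are flagged as missing.
print(coded.isna().tolist())
```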

In [88]:
# Merging behavioral and psychometric data first, to avoid unnecessary loss of subjects.
# After that, merging with demographics.
ncogn = merge_dataframes({"cbcl":banda_cbcl, "dccs": banda_dccs, "flanker": banda_flanker,
                            "lswm": banda_lswm, "orrt": banda_orrt,
                            "pcps": banda_pcps, "pwmt": banda_pwmt,
                            "pmat": banda_pmat, "wasi": banda_wasi}, index="subjectkey")
ncogn.dropna(inplace=True, axis=0)
ncogn.reset_index(drop=False, inplace=True)
print("Number of subjects retained for the analysis: {}".format(ncogn.shape[0]))

Number of subjects retained for the analysis: 197


In [89]:
# Merge with demographics.
banda_data = merge_dataframes({"demo": banda_demo, "hand": banda_hand, "dx": banda_dx, "ncogn": ncogn},
                               index="subjectkey")
# Dropping rows with NA in all of the last 6 columns, which correspond to the behavioral and psychometric data.
banda_data.dropna(inplace=True, axis=0, subset=banda_data.columns[-6:], how="all")

# Dropping two subjects due to missing handedness data.
banda_data.dropna(subset=['Handedness'], inplace=True)

# Reordering the columns.
banda_data = banda_data[['sex', 'Age', 'race_ethnicity', 'demo_parent_educ',
                         'demo_other_parent_educ', 'educ_level', 'edu_groups',
                         'Handedness', 'ADHD', 'AD', 'CD', 'DD', 'ODD', 'OCD', "PSYPATHO",
                         'Internalization', 'Externalization', 'Stress', 'DCCS',
                         'Flanker', 'ListSorting', 'OralReading', 'PatternComparison',
                         'PennWM', 'PennMatrix', 'WASIVocabulary']]
banda_data.columns = ['Sex', 'AgeMonths', 'Ethnicity', 'Parent_ed1',
                      'Parent_ed2', 'high_edu', 'edu_groups', 'Handedness', 'ADHD',
                        'AD', 'CD', 'DD', 'ODD', 'OCD', "PSYPATHO", 'Internalization', 'Externalization',
                        'Stress', 'DCCS', 'Flanker', 'ListSorting', 'OralReading',
                        'PatternComparison', 'PennWM', 'PennMatrix', 'WASIVocabulary']

# Assert the number of subjects retained is the same as before (minus the two dropped).
assert banda_data.shape[0] == ncogn.shape[0] - 2, "Number of subjects do not match."

# Saving the final dataframe.
banda_data.to_excel(f'{output_dir}/banda_data.xlsx', index=True, header=True)

# This next line is commented out since data is protected by a data use agreement.
# banda_data.head() # Please inspect head of the dataframe to validate correct merging.

### **Fetching data from the GESTE Study**

Subsequent cells will fetch data from the GESTE Study and keep only relevant variables for the present analysis. Final dataframe will be outputted in `output_dir`. 

**Please note that this dataset is not publicly available. To gain access to this data, please contact Dr. Larissa Takser PhD (larissa.takser@usherbrooke.ca)**

In [4]:
# Loading data. 
geste_neuro = load_df_in_any_format(f'{geste_base_dir}/Neurocognitive/neurocognitive_data.xlsx')
geste_demo = load_df_in_any_format(f'{geste_base_dir}/PopInfo.csv')

In [5]:
# Fetching neurocognitive data.
neuro_vars = [
    'record_id',
    'basc3_epi_t',
    'basc3_ipi_t',
    'wisc5_bl_ss',
    'wisc5_si_ss',
    'wisc5_ma_ss',
    'wisc5_sc_ss',
    'wisc5_cd_ss',
    'wisc5_vc_ss',
    'wisc5_ba_ss',
]
neuro_df = geste_neuro[neuro_vars]
neuro_df.columns = [
    'subjectkey',
    'Externalizing',
    'Internalizing',
    'Block',
    'Similarities',
    'MatrixReasoning',
    'DigitSpan',
    'Code',
    'Vocabulary',
    'Balance',
]
# Set to correct dtypes.
neuro_df = neuro_df.astype({'Block': 'float', 'Similarities': 'float', 'MatrixReasoning': 'float',
                 'DigitSpan': 'float', 'Code': 'float', 'Vocabulary': 'float', 'Balance': 'float'},
                 copy=True)

neuro_df.loc[:, 'Externalizing'] = StandardScaler().fit_transform(neuro_df[['Externalizing']])
neuro_df.loc[:, 'Internalizing'] = StandardScaler().fit_transform(neuro_df[['Internalizing']])
neuro_df.loc[:, 'Block'] = StandardScaler().fit_transform(neuro_df[['Block']])
neuro_df.loc[:, 'Similarities'] = StandardScaler().fit_transform(neuro_df[['Similarities']])
neuro_df.loc[:, 'MatrixReasoning'] = StandardScaler().fit_transform(neuro_df[['MatrixReasoning']])
neuro_df.loc[:, 'DigitSpan'] = StandardScaler().fit_transform(neuro_df[['DigitSpan']])
neuro_df.loc[:, 'Code'] = StandardScaler().fit_transform(neuro_df[['Code']])
neuro_df.loc[:, 'Vocabulary'] = StandardScaler().fit_transform(neuro_df[['Vocabulary']])
neuro_df.loc[:, 'Balance'] = StandardScaler().fit_transform(neuro_df[['Balance']])

In [6]:
# Selecting population variables.
psycho_patho = geste_demo[['record_id', 'tsa_ea6d8f', 'tdha']].copy()
psycho_patho.columns = ['subjectkey', 'ASD', 'ADHD']

# Recoding diagnosis data, 1 = present, 0 = absent.
def recode_diagnosis(x):
    if x == 1:
        return 1
    else:
        return 0

psycho_patho.loc[:, 'ASD'] = psycho_patho['ASD'].apply(recode_diagnosis)
psycho_patho.loc[:, 'ADHD'] = psycho_patho['ADHD'].apply(recode_diagnosis)

# Add a global psychopathology score (1 = at least one diagnosis, 0 = no diagnosis).
psycho_patho.loc[:, 'PSYPATHO'] = psycho_patho.iloc[:, 1:].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

geste_demo = geste_demo[['record_id', 'sexe_bb', 'child_age_assmt_auto',
                         'revenu_dc4c4c', 'origin_eth_enf', 'etudes', 'etudes2',
                         'mainpegboard___1', 'mainpegboard___2']]
geste_demo.columns = [
    'subjectkey',
    'Sex',
    'Age',
    'Income',
    'Ethnicity',
    'etudes',
    'etudes2',
    'lefthanded',
    'righthanded'
]

# Setting sex variable to 1 = Male, 2 = Female. 
def sex_coding(x):
    if x.Sex == 1:
        return 1
    elif x.Sex == 0:
        return 2
    else:
        # Unrecognized values are marked as missing.
        return np.nan

geste_demo.loc[:, 'Sex'] = geste_demo.apply(sex_coding, axis=1)

# Transfer age in years to months.
geste_demo.loc[:, 'AgeMonths'] = np.round(geste_demo.Age * 12, 0)

# Lumping handedness variables together.
# 1 = Left handed, 2 = Right handed.
def handedness(x):
    if x.lefthanded == 1 and x.righthanded != 1:
        return 1
    elif x.lefthanded != 1 and x.righthanded == 1:
        return 2
    else:
        # Inconsistent or missing entries are marked as missing.
        return np.nan

geste_demo.insert(geste_demo.shape[1], 'Handedness', geste_demo.apply(handedness, axis=1))

# Converting CAD to USD (using the exchange rate on April 22, 2024).
geste_demo.loc[:, 'Income'] = geste_demo['Income'] * 0.73

# Dummy function to create income groups. 
def income_groups(x):
    if x < 50000:
        return 1
    elif 50000 <= x < 100000:
        return 2
    elif x >= 100000:
        return 3

geste_demo.insert(geste_demo.shape[1], 'Income_groups', geste_demo['Income'].apply(income_groups))

# Dummy function to create ethnic groups.
# 1 = White, 2 = Black, 3 = Hispanic, 4 = Asian, 5 = Other.
def ethnic_groups_coding(x):
    if x in [3, 9, 10]:
        return 1
    elif x in [8, 7, 6]:
        return 4
    elif x in [1, 2]:
        return 2
    elif x == 4:
        return 3
    else:
        return 5

geste_demo.insert(geste_demo.shape[1], 'ethnic_groups', geste_demo['Ethnicity'].apply(ethnic_groups_coding))

# Taking highest education level completed.
def education_coding(x):
    if pd.isna(x.etudes) and pd.notna(x.etudes2):
        return x.etudes2
    elif pd.notna(x.etudes) and pd.isna(x.etudes2):
        return x.etudes
    elif pd.notna(x.etudes) and pd.notna(x.etudes2):
        return max(x.etudes, x.etudes2)
    else:
        return np.nan

geste_demo.loc[:, 'edu_groups'] = geste_demo.apply(education_coding, axis=1)
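
The highest-education rule above (take the available value, or the larger of the two, or NaN when both are missing) matches a row-wise maximum that skips missing values. The sketch below shows this on made-up data; the column names follow the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "etudes": [3.0, np.nan, 2.0, np.nan],
    "etudes2": [np.nan, 4.0, 5.0, np.nan],
})

# Row-wise maximum; NaN is skipped unless both values are missing.
highest = toy[["etudes", "etudes2"]].max(axis=1)
print(highest.tolist())
```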


In [7]:
# Merging with neurocognitive and behavioral data.
geste_inter = merge_dataframes({"psycho": psycho_patho, "neuro": neuro_df}, index="subjectkey")
geste_inter.dropna(inplace=True, axis=0)
geste_inter.reset_index(drop=False, inplace=True)

print("Number of subjects retained for the analysis: {}".format(geste_inter.shape[0]))

Number of subjects retained for the analysis: 271


In [8]:
# Merging with demographics.
geste_data = merge_dataframes({"geste_inter": geste_inter,
                               "demo": geste_demo, }, index="subjectkey")

# Dropping rows with NA in all of the last 6 columns, which correspond to the behavioral and psychometric data.
geste_data.dropna(inplace=True, axis=0, subset=geste_data.columns[-6:], how="all")

# Reordering columns.
geste_data = geste_data[['Sex', 'AgeMonths', 'ethnic_groups', 'etudes', 'etudes2', 'edu_groups',
                         'Income', 'Income_groups', 'Handedness',
                         'ASD', 'ADHD', "PSYPATHO", 'Internalizing', 'Externalizing', 'Block',
                         'Similarities', 'MatrixReasoning', 'DigitSpan', 'Code',
                         'Vocabulary', 'Balance']]

# Renaming the columns to match ABCD dataset.
geste_data.columns = ["Sex", 'AgeMonths', 'Ethnicity', 'Parent_ed1', 'Parent_ed2', 'edu_groups',
                      'Income', 'Income_groups', 'Handedness', 'ASD', 'ADHD', "PSYPATHO", 'Internalization',
                      'Externalization', 'Block', 'Similarities', 'MatrixReasoning', 'DigitSpan',
                      'Code', 'Vocabulary', 'Balance']

# Assert the number of subjects retained is the same as before.
assert geste_data.shape[0] == geste_inter.shape[0], "Number of subjects do not match."

# Saving the final dataframe.
geste_data.to_excel(f'{output_dir}/geste_data.xlsx', index=True, header=True)