# Pre-processing the Child Mind Institute questionnaire data

This notebook collects helper scripts and functions for building version three of the Child Mind Institute data. The steps are:

1. Collect target variables and drop aggregate variables;
2. Process the diagnosis columns into binary variables;
3. Create new targets;
4. Reduce the feature space to n(columns) < 1,000.

In [1]:
import ast
import numpy as np
import pandas as pd
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000

### 1. Collect target variables and drop aggregate variables

In [2]:
df = pd.read_csv(
    'resources/data/Dx_and_q.csv'
)
df = df[[
    col for col in df.columns if (
        col.split('_')[-1].isdigit()
    ) or (
        'Barratt' in col
    ) or (
        col in [
            'EID',
            'Sex',
            'Age',
            'Dx'
        ]
    )
]].drop_duplicates()
kids = df.EID.unique()
df.dropna(
    axis=0,
    how="all",
    subset=[
        col for col in df.columns if (
            col not in [
                "EID",
                "Dx"
            ]
        )
    ],
    inplace=True
)
cdi = pd.read_csv(
    "resources/data/CDI_P.csv",
    encoding="utf-8_sig"
).merge(
    pd.read_csv(
        "resources/data/CDI_SR.csv",
        encoding="utf-8_sig"
    ),
    on="EID",
    how="outer"
)
bdi = pd.read_csv(
    "resources/data/BDI.csv",
    encoding="utf-8_sig"
)
bdi = bdi.assign(
    EID=bdi.GUID
)
xdi = cdi.merge(
    bdi,
    on="EID",
    how="outer"
)
xdi = xdi[[
        col for col in xdi.columns if (
            col.split('_')[-1].isdigit()
        ) or (
            col == "EID"
        )
    ]
]
xdi = xdi[
    xdi.EID.isin(
        kids
    )
]
del cdi, bdi
etc = pd.read_csv(
    "resources/data/data-2018-05-11T14_18_49.456Z.csv",
    low_memory=False,
    na_values='.'
)
etc.columns = [
    varname.split(
        ","
    )[-1] for varname in list(
        etc.columns
    )
]
etc = etc.drop(
    "EID",
    axis=1
).assign(
    EID=etc["Identifiers"].apply(
        lambda x: x.split(
            ","
        )[0]
    )
).drop(
    "Identifiers",
    axis=1
)
etc = etc[[
        col for col in etc.columns if (
            (
                col.split('_')[-1].isdigit()
            ) or (
                col == "EID"
            )
        ) and (
            "KSADS" not in col # KSADS has too many questions
        )
    ]
]
etc = etc[
    etc.EID.isin(
        kids
    )
]
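
The filter used repeatedly in this cell keeps a column when the text after its last underscore is all digits (for example `CBCL_15`), which is how aggregate columns are dropped. A minimal sketch on made-up column names follows. Only the `EID`/`Sex`/`Age`/`Dx` list and the `Barratt` rule come from the cell above; the helper name `is_item_column` and sample names such as `CBCL_Total` are hypothetical.

```python
def is_item_column(col, keep=("EID", "Sex", "Age", "Dx")):
    """True for item-level columns, Barratt columns, and identifier columns."""
    return col.split("_")[-1].isdigit() or "Barratt" in col or col in keep

# Hypothetical column names, for illustration only
columns = ["EID", "Age", "CBCL_15", "CBCL_Total", "Barratt_Total", "MFQ_P_16"]

kept = [col for col in columns if is_item_column(col)]
print(kept)  # ['EID', 'Age', 'CBCL_15', 'Barratt_Total', 'MFQ_P_16']
```

`CBCL_Total` is dropped because its last underscore-separated token is not numeric, while `Barratt_Total` survives through the explicit `Barratt` rule.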

In [3]:
n = len(df)

### 2. Process the diagnosis columns into binary variables

In [4]:
all_diagnoses = []
for diags in df['Dx'].values.flatten():
    all_diagnoses.extend(ast.literal_eval(diags))

In [5]:
for diagnosis in np.unique(all_diagnoses):
    print(diagnosis)
    # One boolean column per diagnosis, named by the normalized label
    column = diagnosis.lower().strip()
    def diag_to_binary(entry):
        # Substring test against the raw Dx string
        return column in entry or diagnosis in entry
    df[column] = df['Dx'].apply(diag_to_binary)
for c in ["Dx", "Anx", "adhd", "asd"]:
    try:
        del df[c]
    except KeyError:
        pass

ADHD Inattentive Type
ADHD Inattentive type
ADHD-Combined Type
ADHD-Hyperactive/Impulsive Type
ADHD-Inattentive Type
Acute Stress Disorder
Adjustment Disorder, with mixed disturbance of emtions and conduct
Adjustment Disorders
Agoraphobia
Alcohol Use Disorder
Attention Deficit Hyperactivity Disorder
Attention Deficit Hyperactivity Disorder Combined Presentation
Attention-Deficit Hyperactivity Disorder
Attention-Deficit/Hyperactivity Disorder
Attention-Deficit/Hyperactivity Disorder 
Autism Spectrum Disorder
Avoidant/Restrictive Food Intake Disorder
Binge-Eating Disorder
Bipolar I Disorder
Bipolar II Disorder
Borderline Intellectual Functioning
Bulimia Nervosa
Cannabis Use Disorder
Child Onset Fluency Disorder (Stuttering)
Conduct Disorder-Adolescent-onset type
Conduct Disorder-Childhood-onset type
Conversion Disorder
Developmental Coordination Disorder
Disruptive Mood Dysregulation Disorder
Encopresis
Enuresis
Excoriation (Skin-Picking) Disorder
Generalized Anxiety Disorder
Intellectua
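
The loop above flags a diagnosis with a substring test on the raw `Dx` string, so a short label also matches any longer label that contains it. The sketch below compares that test with an exact match on the parsed list. The `Dx` entries are hypothetical, and the exact-match variant is an alternative shown for contrast, not what the notebook does.

```python
import ast

# Hypothetical Dx entries: each is a string holding a Python list literal
dx_entries = [
    "['Attention Deficit Hyperactivity Disorder Combined Presentation']",
    "['Attention Deficit Hyperactivity Disorder', 'Enuresis']",
    "['Enuresis']",
]
label = "Attention Deficit Hyperactivity Disorder"

# Substring test on the raw string, as in the loop above
substring_flags = [label in entry for entry in dx_entries]

# Exact membership test on the parsed list (alternative, for contrast)
exact_flags = [label in ast.literal_eval(entry) for entry in dx_entries]

print(substring_flags)  # [True, True, False]
print(exact_flags)      # [False, True, False]
```

Whether the broader substring match is wanted depends on how the near-duplicate labels in the output above are meant to be grouped.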

In [6]:
df = df.merge(
    xdi,
    on="EID",
    how="outer"
).merge(
    etc,
    on="EID",
    how="outer"
).copy()
del xdi, etc
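
Because `how="outer"` keeps keys from either side, the merged frame can contain participants that are missing from one of the sources, with `NaN` in that source's columns. In this notebook `kids` is computed before `dropna`, so rows removed by `dropna` may come back this way. That is an observation from reading the code, not something verified against the data. A small sketch with hypothetical toy frames, using pandas' `indicator=True` option to see where each row came from:

```python
import pandas as pd

# Hypothetical toy frames sharing the EID key
left = pd.DataFrame({"EID": ["A", "B"], "CBCL_15": [0, 2]})
right = pd.DataFrame({"EID": ["B", "C"], "CDI2_08": [1, 0]})

# indicator=True adds a "_merge" column: left_only, right_only, or both
merged = left.merge(right, on="EID", how="outer", indicator=True)
print(merged)

# Rows present in only one source carry NaN in the other source's columns
only_one_side = merged[merged["_merge"] != "both"]
print(len(merged), len(only_one_side))  # 3 2
```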

### 3. Create new targets
- Compulsions
  - Obsessive-Compulsive Disorder
  - Body Dysmorphic Disorder
  - Hoarding Disorder
  - Trichotillomania (Hair-Pulling Disorder)
  - Excoriation (Skin-Picking) Disorder
  - Substance/Medication-Induced Obsessive-Compulsive and Related Disorder
  - Obsessive-Compulsive and Related Disorder Due to Another Medical Condition
  - Other Specified Obsessive-Compulsive and Related Disorder
  - Unspecified Obsessive-Compulsive and Related Disorder
- Anxiety
  - Separation Anxiety Disorder
  - Selective Mutism
  - Specific Phobias
  - Social Anxiety Disorder (Social Phobia)
  - Panic Disorder
  - Agoraphobia
  - Generalized Anxiety Disorder
  - Substance/Medication-Induced Anxiety Disorder
  - Anxiety Disorder Due to Another Medical Condition
  - Other Specified Anxiety Disorder
  - Unspecified Anxiety Disorder
- UseDisorders
- SymptomsOfCruelty
  - Cruel to animals
  - Cruelty, bullying, or meanness to others
  - Doesn't seem to feel guilty after misbehaving
  - _range_
    - 0 = ∀(Not true), to
    - 6 = ∀(Very true or often true)
- SymptomsOfSuicide
  - I do not think about killing myself | I think about killing myself but would not do it | I want to kill myself
  - Deliberately harms self or attempts suicide
  - Talks about killing self
  - S/he thought that life wasn't worth living
  - S/he thought about killing him/herself
  - _range_
    - 0 = "I do not think about killing myself" and "Not true" for all other items, to
    - 10 = "I want to kill myself" and "Very true or often true" or "True" for all other items

In [7]:
df = df.assign(
    Compulsions=pd.DataFrame([
        df[dx] for dx_header in {
            "Body Dysmorphic Disorder",
            "Hoarding Disorder",
            "Compulsive",
            "Excoriation",
            "Trichotillomania"
        } for dx in df.columns if dx_header in dx
    ]).T.any(
        axis=1
    ),
    Anxiety=pd.DataFrame([
        df[dx] for dx_header in {
            "Anxiety",
            "anxiety",
            "Selective Mutism",
            "Phobia",
            "Panic",
            "phobia"
        } for dx in df.columns if dx_header in dx
    ]).T.any(
        axis=1
    ),
    UseDisorders=pd.DataFrame([
        df[dx] for dx_header in {
            "Use Disorder"
        } for dx in df.columns if dx_header in dx
    ]).T.any(
        axis=1
    ),
    SymptomsOfCruelty=pd.DataFrame([
        df[symptom] for response_header in {
            "CBCL_15", # Cruel to animals
            "CBCL_16", # Cruelty, bullying, or meanness to others
            "CBCL_26"  # Doesn't seem to feel guilty after misbehaving
        } for symptom in df.columns if response_header in symptom
    ]).T.sum(
        axis=1
    ),
    SymptomsOfSuicide=pd.DataFrame([
        df[symptom] for response_header in {
            "CDI2_08",  # 0= I do not think about killing myself
                        # 1= I think about killing myself but would not do it
                        # 2= I want to kill myself
            "CBCL_18",  # Deliberately harms self or attempts suicide
            "CBCL_91",  # Talks about killing self
            "MFQ_P_16", # S/he thought that life wasn't worth living.
            "MFQ_P_19"  # S/he thought about killing him/herself.
        } for symptom in df.columns if response_header in symptom
    ]).T.sum(
        axis=1
    )
)

#### Drop Internet Addiction questions and consolidate other diagnoses to reduce features to n_features < 1,000

In [8]:
df = df.drop(
    [
        col for col in df.columns if "IAT_" in col
    ],
    axis=1
).assign(
    OtherDx=pd.DataFrame([
        df[
            diagnosis.lower().strip()
        ] for diagnosis in np.unique(all_diagnoses) if not any(
            # Assumed intent: keep a diagnosis only if it matches none of
            # these headers. The original clause order referenced
            # response_header before the loop that binds it.
            response_header in diagnosis for response_header in {
                "Use Disorder",
                "Binge-Eating",
                "Bulimia Nervosa",
                "Compulsive",
                "Conduct Disorder",
                "Disruptive Mood",
                "Excoriation",
                "Explosive Disorder",
                "Tic Disorder",
                "Tourette",
                "Trichotillomania",
                "Anxiety",
                "anxiety",
                "Selective Mutism",
                "Phobia",
                "Panic",
                "Agoraphobia",
                "No Diagnosis Given",
                "No diagnosis given",
                "n/a"
            }
        )
    ]).T.any(
        axis=1
    ),
).drop(
    list({
        diagnosis.lower().strip() for response_header in {
            "Use Disorder",
            "Binge-Eating",
            "Bulimia Nervosa",
            "Compulsive",
            "Conduct Disorder",
            "Disruptive Mood",
            "Excoriation",
            "Explosive Disorder",
            "Tic Disorder",
            "Tourette",
            "Trichotillomania",
            "Anxiety",
            "anxiety",
            "CBCL_15",
            "CBCL_16",
            "CBCL_26",
            "CDI2_08",
            "CBCL_91",
            "Compulsions",
            "UseDisorders",
            "SymptomsOfCruelty",
            "SymptomsOfSuicide"
        } for diagnosis in all_diagnoses if \
        response_header not in diagnosis
    }),
    axis=1
)
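
The order of `for` and `if` clauses in these comprehensions changes which diagnoses are selected. The sketch below, on hypothetical labels and a shortened header set, contrasts "matches none of the headers" with "misses at least one header". The second form is what a `for header ... for diagnosis ... if header not in diagnosis` comprehension computes, and it selects nearly every label. This is an illustration of the two readings, not a statement of which one the notebook intends.

```python
# Shortened header set, for illustration
headers = {"Use Disorder", "Anxiety", "Compulsive"}

# Hypothetical diagnosis labels
diagnoses = [
    "Alcohol Use Disorder",
    "Generalized Anxiety Disorder",
    "Enuresis",
    "Encopresis",
]

# Reading 1: a label is "other" when no header occurs inside it
matches_none = [dx for dx in diagnoses if not any(h in dx for h in headers)]
print(matches_none)  # ['Enuresis', 'Encopresis']

# Reading 2: a label is selected when at least one header is absent from it
misses_one = sorted({dx for h in headers for dx in diagnoses if h not in dx})
print(len(misses_one))  # 4
```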

In [9]:
df.shape

(630, 950)

#### Save to csv

In [10]:
df.to_csv(
    'resources/questions_v3_new_targets.csv',
    index=False
)