# WiDS 2025: ADHD Detection 

## About the Problem

### **1. Problem Statement**
The goal is to build a predictive model using functional brain imaging (fMRI) data, socio-demographic information, and questionnaire responses to:
1. Predict ADHD diagnosis (`ADHD_Outcome` - 1 for Yes, 0 for No).
2. Predict sex (`Sex_F` - 1 for Female, 0 for Male).


### **2. Dataset Overview**

#### **Dataset Folders**
1. **Train Folder (`train_tsv`):**
   - Contains data for 1,200+ subjects.
   - Includes:
     - **Targets**: ADHD diagnosis and sex.
     - **Functional MRI Connectome Matrices**.
     - **Socio-demographic Information**: 
       - Examples: handedness, parent’s education level, emotional and behavioral measures (Strengths and Difficulties Questionnaire - SDQ), parenting information (Alabama Parenting Questionnaire - APQ).
     - Metadata: Includes both quantitative and categorical data.
   - Action Needed: Preprocess categorical data (e.g., create dummy variables) and combine it with the functional connectome data, matching subjects on `participant_id`.

2. **Test Folder (`test_tsv`):**
   - Contains unseen data for 300+ subjects.
   - Includes:
     - Functional MRI Connectome Matrices.
     - Socio-demographic, emotions, and parenting information.
   - Goal: Use the test dataset to generate predictions for submission.
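
The preprocessing action noted for the train folder (dummy variables, then combining with the connectome data) can be sketched as below. This is a minimal illustration on made-up toy tables, not the notebook's actual pipeline: the column names are taken from the competition files, while `toy_cat`, `toy_fcm`, and all values are assumptions introduced here.

```python
import pandas as pd

# Toy stand-ins for the categorical metadata and connectome tables.
# Column names follow the competition files; the values are invented.
toy_cat = pd.DataFrame({
    "participant_id": ["a", "b", "c"],
    "Basic_Demos_Study_Site": [1, 3, 1],
    "MRI_Track_Scan_Location": [1, 3, 3],
})
toy_fcm = pd.DataFrame({
    "participant_id": ["c", "a", "b"],  # deliberately in a different row order
    "0throw_1thcolumn": [0.10, 0.20, 0.30],
})

# The categories are stored as integer codes, so the columns to encode
# must be listed explicitly; otherwise get_dummies leaves them untouched.
cat_cols = ["Basic_Demos_Study_Site", "MRI_Track_Scan_Location"]
toy_dummies = pd.get_dummies(toy_cat, columns=cat_cols, dtype=int)

# Join on the subject identifier rather than on row position.
# validate="one_to_one" raises if an ID is duplicated in either table.
combined = toy_dummies.merge(
    toy_fcm, on="participant_id", how="inner", validate="one_to_one"
)
print(combined)
```

Merging on `participant_id` instead of concatenating side by side keeps each subject's metadata attached to the right connectome row even when the files are ordered differently.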

---

### **3. Key Variables**
1. **Target Variables:**
   - `ADHD_Outcome`: Type of Diagnosis (0 = Other/None, 1 = ADHD).
   - `Sex_F`: Sex of participant (0 = Male, 1 = Female).

2. **Metadata Variables:**
   - Includes a mix of numerical and categorical features related to demographics, parenting styles, and emotional health.

---

### **4. Evaluation Metric**
- The **F1 Score** is used as the evaluation metric:
  - It is the harmonic mean of precision and recall.
  - A weighted F1 Score is calculated for this competition, with **2x weight for Female ADHD cases** (`ADHD_Outcome=1` and `Sex_F=1`).
  - The final leaderboard score is the average of the weighted F1 Scores for the two target variables.
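
One possible reading of this metric is sketched below. The official scoring code is not part of this notebook, so the function `weighted_f1`, the way the 2x weight is applied (as a per-sample weight on true female ADHD cases), and the toy labels are all assumptions for illustration.

```python
import numpy as np

def weighted_f1(y_true, y_pred, sample_weight):
    """Binary F1 for the positive class with per-sample weights."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    w = np.asarray(sample_weight, dtype=float)
    tp = np.sum(w * ((y_true == 1) & (y_pred == 1)))
    fp = np.sum(w * ((y_true == 0) & (y_pred == 1)))
    fn = np.sum(w * ((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# Invented labels and predictions for six subjects.
adhd_true = np.array([1, 1, 0, 1, 0, 1])
sex_true = np.array([1, 0, 0, 1, 1, 0])
adhd_pred = np.array([1, 0, 0, 1, 1, 1])
sex_pred = np.array([1, 0, 1, 1, 1, 0])

# Assumed weighting: true female ADHD cases count twice, all others once.
weights = np.where((adhd_true == 1) & (sex_true == 1), 2.0, 1.0)

f1_adhd = weighted_f1(adhd_true, adhd_pred, weights)
f1_sex = weighted_f1(sex_true, sex_pred, weights)
final_score = (f1_adhd + f1_sex) / 2
print(f1_adhd, f1_sex, final_score)
```

A local score computed this way is only a proxy for the leaderboard; it is mainly useful for comparing models against each other under the same weighting.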

---

### **5. Project Timeline**
1. **Problem Understanding and Domain Familiarization (Jan 22 - Jan 31):**
   - Review dataset and domain videos.
   - Research best practices and models.
2. **Data Preparation and Cleaning (Feb 1 - Feb 15):**
   - Handle missing values and outliers.
   - Encode categorical variables and scale numerical ones.
   - Feature engineering and selection.
3. **Exploratory Data Analysis (EDA) (Feb 16 - Feb 28):**
   - Perform univariate, bivariate, and multivariate analyses.
   - Identify correlations and test hypotheses.
4. **Model Development (Mar 1 - Mar 31):**
   - Train baseline and advanced models.
   - Use hyperparameter tuning and evaluate performance.
5. **Model Interpretation and Insights (Apr 1 - Apr 15):**
   - Analyze feature importance and validate predictions.
6. **Submission Preparation (Apr 16 - Apr 25):**
   - Finalize the best-performing model and prepare submission files.
7. **Final Submission (Apr 26 - Apr 27):**
   - Cross-check and submit predictions and documentation.

---

### **6. Steps in This Notebook**
1. **Load the Dataset:**
   - Load and inspect the train and test datasets.
2. **Data Preprocessing:**
   - Handle missing data and outliers.
   - Standardize numerical features and encode categorical variables.
3. **Exploratory Data Analysis (EDA):**
   - Visualize distributions, relationships, and patterns in the data.
4. **Next Steps:**
   - Model development and evaluation.


## Code

### Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

import os
import matplotlib.pyplot as plt

### 1. Loading the Datasets

#### Training Dataset

In [3]:
## Loading categorical variables
train_cat = pd.read_excel("../data/TRAIN/TRAIN_CATEGORICAL_METADATA.xlsx")
train_cat.head()

Unnamed: 0,participant_id,Basic_Demos_Enroll_Year,Basic_Demos_Study_Site,PreInt_Demos_Fam_Child_Ethnicity,PreInt_Demos_Fam_Child_Race,MRI_Track_Scan_Location,Barratt_Barratt_P1_Edu,Barratt_Barratt_P1_Occ,Barratt_Barratt_P2_Edu,Barratt_Barratt_P2_Occ
0,UmrK0vMLopoR,2016,1,0.0,0,1,21,45,21,45
1,CPaeQkhcjg7d,2019,3,1.0,2,3,15,15,0,0
2,Nb4EetVPm3gs,2016,1,1.0,8,1,18,40,0,0
3,p4vPhVu91o4b,2018,3,0.0,8,3,15,30,18,0
4,M09PXs7arQ5E,2019,3,0.0,1,3,15,20,0,0


In [4]:
train_cat.columns

Index(['participant_id', 'Basic_Demos_Enroll_Year', 'Basic_Demos_Study_Site',
       'PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race',
       'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu',
       'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu',
       'Barratt_Barratt_P2_Occ'],
      dtype='object')

In [6]:
## Loading Functional Connectome Matrices
train_FCM = pd.read_csv("../data/TRAIN/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES.csv")
train_FCM.head()

Unnamed: 0,participant_id,0throw_1thcolumn,0throw_2thcolumn,0throw_3thcolumn,0throw_4thcolumn,0throw_5thcolumn,0throw_6thcolumn,0throw_7thcolumn,0throw_8thcolumn,0throw_9thcolumn,...,195throw_196thcolumn,195throw_197thcolumn,195throw_198thcolumn,195throw_199thcolumn,196throw_197thcolumn,196throw_198thcolumn,196throw_199thcolumn,197throw_198thcolumn,197throw_199thcolumn,198throw_199thcolumn
0,70z8Q2xdTXM3,0.093473,0.146902,0.067893,0.015141,0.070221,0.063997,0.055382,-0.035335,0.068583,...,0.003404,-0.010359,-0.050968,-0.014365,0.128066,0.112646,-0.05898,0.028228,0.133582,0.143372
1,WHWymJu6zNZi,0.02958,0.179323,0.112933,0.038291,0.104899,0.06425,0.008488,0.077505,-0.00475,...,-0.008409,-0.008479,0.020891,0.017754,0.09404,0.035141,0.032537,0.075007,0.11535,0.1382
2,4PAQp1M6EyAo,-0.05158,0.139734,0.068295,0.046991,0.111085,0.026978,0.151377,0.021198,0.083721,...,0.053245,-0.028003,0.028773,0.024556,0.166343,0.058925,0.035485,0.063661,0.042862,0.162162
3,obEacy4Of68I,0.016273,0.204702,0.11598,0.043103,0.056431,0.057615,0.055773,0.07503,0.001033,...,-0.023918,-0.005356,0.018607,0.016193,0.072955,0.130135,0.05612,0.084784,0.114148,0.190584
4,s7WzzDcmDOhF,0.065771,0.098714,0.097604,0.112988,0.071139,0.085607,0.019392,-0.036403,-0.020375,...,0.066439,-0.07668,-0.04753,-0.031443,0.221213,0.007343,0.005763,0.08382,0.079582,0.067269


In [7]:
train_FCM.columns

Index(['participant_id', '0throw_1thcolumn', '0throw_2thcolumn',
       '0throw_3thcolumn', '0throw_4thcolumn', '0throw_5thcolumn',
       '0throw_6thcolumn', '0throw_7thcolumn', '0throw_8thcolumn',
       '0throw_9thcolumn',
       ...
       '195throw_196thcolumn', '195throw_197thcolumn', '195throw_198thcolumn',
       '195throw_199thcolumn', '196throw_197thcolumn', '196throw_198thcolumn',
       '196throw_199thcolumn', '197throw_198thcolumn', '197throw_199thcolumn',
       '198throw_199thcolumn'],
      dtype='object', length=19901)
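
The column names suggest that each row stores the upper triangle of a symmetric region-by-region matrix, with columns named `<i>throw_<j>thcolumn` for i < j. Assuming 200 regions (indices 0 to 199), that gives 200 * 199 / 2 = 19,900 pairs plus `participant_id`, which matches `length=19901`. The helper below is a hypothetical sketch of how one row could be turned back into a square matrix; `row_to_matrix` is not part of the notebook, and the diagonal is left at 0 because the file does not contain it.

```python
import re

import numpy as np
import pandas as pd

PAIR_PATTERN = re.compile(r"(\d+)throw_(\d+)thcolumn")

def row_to_matrix(row, n_regions):
    """Rebuild a symmetric region-by-region matrix from one flattened row."""
    matrix = np.zeros((n_regions, n_regions))
    for name, value in row.items():
        match = PAIR_PATTERN.fullmatch(str(name))
        if match is None:
            continue  # skips non-pair columns such as participant_id
        i, j = int(match.group(1)), int(match.group(2))
        matrix[i, j] = matrix[j, i] = float(value)
    return matrix

# Toy row with 4 regions (4 * 3 / 2 = 6 pairs); values are invented.
toy_row = pd.Series({
    "participant_id": "demo",
    "0throw_1thcolumn": 0.1, "0throw_2thcolumn": 0.2, "0throw_3thcolumn": 0.3,
    "1throw_2thcolumn": 0.4, "1throw_3thcolumn": 0.5, "2throw_3thcolumn": 0.6,
})
toy_matrix = row_to_matrix(toy_row, n_regions=4)
print(toy_matrix)

# Column-count check for the real file under the 200-region assumption.
expected_columns = 200 * 199 // 2 + 1
print(expected_columns)  # 19901
```

On the real data the same call would be `row_to_matrix(train_FCM.iloc[0], n_regions=200)`, which is convenient for plotting a single subject's connectome as a heatmap.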

In [10]:
## Loading quantitative variables (training set)

train_Quant = pd.read_excel("../data/TRAIN/TRAIN_QUANTITATIVE_METADATA.xlsx")
train_Quant.head()

Unnamed: 0,participant_id,EHQ_EHQ_Total,ColorVision_CV_Score,APQ_P_APQ_P_CP,APQ_P_APQ_P_ID,APQ_P_APQ_P_INV,APQ_P_APQ_P_OPD,APQ_P_APQ_P_PM,APQ_P_APQ_P_PP,SDQ_SDQ_Conduct_Problems,SDQ_SDQ_Difficulties_Total,SDQ_SDQ_Emotional_Problems,SDQ_SDQ_Externalizing,SDQ_SDQ_Generating_Impact,SDQ_SDQ_Hyperactivity,SDQ_SDQ_Internalizing,SDQ_SDQ_Peer_Problems,SDQ_SDQ_Prosocial,MRI_Track_Age_at_Scan
0,UmrK0vMLopoR,40.0,13,3,10,47,13,11,28,0,6,1,5,0,5,1,0,10,
1,CPaeQkhcjg7d,-94.47,14,3,13,34,18,23,30,0,18,6,8,7,8,10,4,5,
2,Nb4EetVPm3gs,-46.67,14,4,10,35,16,10,29,1,14,2,8,5,7,6,4,9,8.239904
3,p4vPhVu91o4b,-26.68,10,5,12,39,19,16,28,6,24,4,16,9,10,8,4,6,
4,M09PXs7arQ5E,0.0,14,5,15,40,20,24,28,1,18,4,11,4,10,7,3,9,8.940679


In [11]:
train_Quant.columns

Index(['participant_id', 'EHQ_EHQ_Total', 'ColorVision_CV_Score',
       'APQ_P_APQ_P_CP', 'APQ_P_APQ_P_ID', 'APQ_P_APQ_P_INV',
       'APQ_P_APQ_P_OPD', 'APQ_P_APQ_P_PM', 'APQ_P_APQ_P_PP',
       'SDQ_SDQ_Conduct_Problems', 'SDQ_SDQ_Difficulties_Total',
       'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Externalizing',
       'SDQ_SDQ_Generating_Impact', 'SDQ_SDQ_Hyperactivity',
       'SDQ_SDQ_Internalizing', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial',
       'MRI_Track_Age_at_Scan'],
      dtype='object')

In [12]:
## Loading the target labels (solutions) for the training dataset
train_Solutions = pd.read_excel("../data/TRAIN/TRAINING_SOLUTIONS.xlsx")
train_Solutions.head()

Unnamed: 0,participant_id,ADHD_Outcome,Sex_F
0,UmrK0vMLopoR,1,1
1,CPaeQkhcjg7d,1,0
2,Nb4EetVPm3gs,1,0
3,p4vPhVu91o4b,1,1
4,M09PXs7arQ5E,1,1
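
In the outputs above, the metadata and solutions tables start with `UmrK0vMLopoR`, while the connectome table starts with `70z8Q2xdTXM3`, so the files do not appear to share a row order. A small check like the one below can confirm this before any features and targets are combined. The helper `check_ids` and the toy connectome values are assumptions for illustration; the three IDs and labels are copied from the first rows of the solutions output.

```python
import pandas as pd

def check_ids(reference, other, key="participant_id"):
    """Return (same_set, same_order) for the ID columns of two tables."""
    same_set = set(reference[key]) == set(other[key])
    same_order = reference[key].tolist() == other[key].tolist()
    return same_set, same_order

toy_solutions = pd.DataFrame({
    "participant_id": ["UmrK0vMLopoR", "CPaeQkhcjg7d", "Nb4EetVPm3gs"],
    "ADHD_Outcome": [1, 1, 1],
    "Sex_F": [1, 0, 0],
})
# Same subjects in a different order; the feature values are invented.
toy_fcm = pd.DataFrame({
    "participant_id": ["Nb4EetVPm3gs", "UmrK0vMLopoR", "CPaeQkhcjg7d"],
    "0throw_1thcolumn": [0.3, 0.1, 0.2],
})

same_set, same_order = check_ids(toy_solutions, toy_fcm)
print(same_set, same_order)

# Reorder the connectome rows to follow the solutions table.
toy_fcm_aligned = (
    toy_fcm.set_index("participant_id")
    .loc[toy_solutions["participant_id"]]
    .reset_index()
)
print(check_ids(toy_solutions, toy_fcm_aligned))
```

If `same_set` is `False` on the real tables, some subjects are missing from one file and need to be handled before modeling.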



#### Test Dataset

In [14]:
## Loading test categorical variables
test_cat = pd.read_excel("../data/TEST/TEST_CATEGORICAL.xlsx")
test_cat.head()

Unnamed: 0,participant_id,Basic_Demos_Enroll_Year,Basic_Demos_Study_Site,PreInt_Demos_Fam_Child_Ethnicity,PreInt_Demos_Fam_Child_Race,MRI_Track_Scan_Location,Barratt_Barratt_P1_Edu,Barratt_Barratt_P1_Occ,Barratt_Barratt_P2_Edu,Barratt_Barratt_P2_Occ
0,Cfwaf5FX7jWK,2022,4,0.0,0.0,4,21.0,30.0,18.0,30.0
1,vhGrzmvA3Hjq,2023,4,0.0,0.0,4,21.0,45.0,,30.0
2,ULliyEXjy4OV,2022,4,0.0,0.0,4,21.0,40.0,18.0,40.0
3,LZfeAb1xMtql,2022,4,0.0,0.0,3,21.0,45.0,21.0,45.0
4,EnFOUv0YK1RG,2022,4,2.0,0.0,4,18.0,0.0,21.0,45.0


In [15]:
test_cat.columns

Index(['participant_id', 'Basic_Demos_Enroll_Year', 'Basic_Demos_Study_Site',
       'PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race',
       'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu',
       'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu',
       'Barratt_Barratt_P2_Occ'],
      dtype='object')
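
The test rows shown above contain category levels that do not appear in the training rows shown earlier (for example study site 4 and enrollment years 2022 and 2023), at least in the first five rows of each file. If dummy variables are created separately for train and test, the resulting columns will not line up. One way to handle this, sketched on invented toy data, is to reindex the test dummies to the training columns; `toy_train` and `toy_test` are assumptions, not the notebook's variables.

```python
import pandas as pd

# Invented values; only the column name comes from the dataset.
toy_train = pd.DataFrame({"Basic_Demos_Study_Site": [1, 3, 1, 3]})
toy_test = pd.DataFrame({"Basic_Demos_Study_Site": [4, 4, 3]})

train_dummies = pd.get_dummies(
    toy_train, columns=["Basic_Demos_Study_Site"], dtype=int
)
test_dummies = pd.get_dummies(
    toy_test, columns=["Basic_Demos_Study_Site"], dtype=int
)

# Keep exactly the training columns: levels unseen in training are dropped,
# and training levels absent from the test set are filled with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies.columns.tolist())
print(test_dummies)
```

Dropping unseen levels means a test subject from site 4 ends up with all site indicators equal to 0. Whether that is acceptable depends on the model, so it is worth checking how many test rows are affected.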

In [16]:
## Loading Functional Connectome Matrices
test_FCM = pd.read_csv("../data/TEST/TEST_FUNCTIONAL_CONNECTOME_MATRICES.csv")
test_FCM.head()

Unnamed: 0,participant_id,0throw_1thcolumn,0throw_2thcolumn,0throw_3thcolumn,0throw_4thcolumn,0throw_5thcolumn,0throw_6thcolumn,0throw_7thcolumn,0throw_8thcolumn,0throw_9thcolumn,...,195throw_196thcolumn,195throw_197thcolumn,195throw_198thcolumn,195throw_199thcolumn,196throw_197thcolumn,196throw_198thcolumn,196throw_199thcolumn,197throw_198thcolumn,197throw_199thcolumn,198throw_199thcolumn
0,Cfwaf5FX7jWK,0.54848,0.713607,0.557319,0.524369,0.693364,0.770032,0.724406,0.390118,0.547912,...,0.080423,-0.054581,-0.088163,-0.028574,0.444847,0.350149,-0.012601,0.66575,0.560565,0.555732
1,vhGrzmvA3Hjq,0.42774,0.363022,0.402862,0.363003,0.534558,0.345347,0.409471,0.303328,0.402515,...,0.198009,-0.000724,0.083122,0.033043,0.687497,0.306229,0.717485,0.461809,0.559632,0.350027
2,ULliyEXjy4OV,0.139572,0.390106,-0.087041,0.196852,0.088148,0.023843,0.381782,0.068979,0.377488,...,0.051319,0.02363,-0.056819,0.117396,0.576086,0.517831,0.527044,0.605038,0.609856,0.750987
3,LZfeAb1xMtql,0.133561,0.778326,0.416355,0.47184,0.56846,0.63366,0.501113,0.345461,0.467943,...,0.046183,-0.238962,0.121868,-0.26097,0.646818,0.594902,0.608156,0.595459,0.683189,0.542296
4,EnFOUv0YK1RG,0.126699,0.575446,0.509422,0.363193,0.427544,0.449924,0.451796,0.223927,0.298248,...,0.315734,0.002234,0.290791,0.344149,0.480214,0.539824,0.447322,0.293088,0.148529,0.539823


In [17]:
test_FCM.columns

Index(['participant_id', '0throw_1thcolumn', '0throw_2thcolumn',
       '0throw_3thcolumn', '0throw_4thcolumn', '0throw_5thcolumn',
       '0throw_6thcolumn', '0throw_7thcolumn', '0throw_8thcolumn',
       '0throw_9thcolumn',
       ...
       '195throw_196thcolumn', '195throw_197thcolumn', '195throw_198thcolumn',
       '195throw_199thcolumn', '196throw_197thcolumn', '196throw_198thcolumn',
       '196throw_199thcolumn', '197throw_198thcolumn', '197throw_199thcolumn',
       '198throw_199thcolumn'],
      dtype='object', length=19901)

In [18]:
## Loading quantitative variables (test set)

test_Quant = pd.read_excel("../data/TEST/TEST_QUANTITATIVE_METADATA.xlsx")
test_Quant.head()

Unnamed: 0,participant_id,EHQ_EHQ_Total,ColorVision_CV_Score,APQ_P_APQ_P_CP,APQ_P_APQ_P_ID,APQ_P_APQ_P_INV,APQ_P_APQ_P_OPD,APQ_P_APQ_P_PM,APQ_P_APQ_P_PP,SDQ_SDQ_Conduct_Problems,SDQ_SDQ_Difficulties_Total,SDQ_SDQ_Emotional_Problems,SDQ_SDQ_Externalizing,SDQ_SDQ_Generating_Impact,SDQ_SDQ_Hyperactivity,SDQ_SDQ_Internalizing,SDQ_SDQ_Peer_Problems,SDQ_SDQ_Prosocial,MRI_Track_Age_at_Scan
0,Cfwaf5FX7jWK,60.03,14.0,5.0,16.0,41.0,19.0,11.0,26.0,2.0,12.0,3.0,9.0,2.0,7.0,3.0,0.0,8.0,8.992813
1,vhGrzmvA3Hjq,86.71,12.0,3.0,13.0,43.0,18.0,15.0,28.0,2.0,16.0,8.0,5.0,7.0,3.0,11.0,3.0,9.0,12.324093
2,ULliyEXjy4OV,26.68,13.0,3.0,14.0,36.0,16.0,14.0,25.0,1.0,7.0,1.0,6.0,1.0,5.0,1.0,0.0,9.0,7.770933
3,LZfeAb1xMtql,93.38,13.0,3.0,19.0,41.0,17.0,18.0,27.0,4.0,15.0,4.0,10.0,8.0,6.0,5.0,1.0,6.0,9.304814
4,EnFOUv0YK1RG,-93.38,14.0,3.0,13.0,42.0,19.0,16.0,28.0,2.0,18.0,6.0,12.0,5.0,10.0,6.0,0.0,10.0,8.26135


In [19]:
test_Quant.columns

Index(['participant_id', 'EHQ_EHQ_Total', 'ColorVision_CV_Score',
       'APQ_P_APQ_P_CP', 'APQ_P_APQ_P_ID', 'APQ_P_APQ_P_INV',
       'APQ_P_APQ_P_OPD', 'APQ_P_APQ_P_PM', 'APQ_P_APQ_P_PP',
       'SDQ_SDQ_Conduct_Problems', 'SDQ_SDQ_Difficulties_Total',
       'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Externalizing',
       'SDQ_SDQ_Generating_Impact', 'SDQ_SDQ_Hyperactivity',
       'SDQ_SDQ_Internalizing', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial',
       'MRI_Track_Age_at_Scan'],
      dtype='object')
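
Several values are missing in the outputs above, for example `MRI_Track_Age_at_Scan` in three of the five training rows shown. The notebook does not state which imputation strategy it uses, so the sketch below uses median imputation purely as an assumed example. The training values are copied from the head of the quantitative table; the missing value in `toy_test` is invented.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "EHQ_EHQ_Total": [40.0, -94.47, -46.67, -26.68, 0.0],
    "MRI_Track_Age_at_Scan": [np.nan, np.nan, 8.239904, np.nan, 8.940679],
})
toy_test = pd.DataFrame({
    "EHQ_EHQ_Total": [60.03, 86.71],
    "MRI_Track_Age_at_Scan": [8.992813, np.nan],  # NaN invented for the demo
})

# Share of missing values per column in the training data.
missing_share = toy_train.isna().mean()
print(missing_share)

# Compute medians on the training data only, then reuse them for the
# test data so no test information leaks into the preprocessing.
train_medians = toy_train.median(numeric_only=True)
train_filled = toy_train.fillna(train_medians)
test_filled = toy_test.fillna(train_medians)
print(test_filled)
```

On the real tables, `train_Quant.isna().mean().sort_values(ascending=False)` gives a quick ranking of the columns that need the most attention.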

### 2. Exploratory Data Analysis - EDA