In [None]:
%pip install pycirclize

In [None]:
nb_type = "Submission"

# <span style="color:#ffffff; font-size: 1%;">[1] 🧠 Introduction</span>

<div style=" border-bottom: 8px solid #E6A600; overflow: hidden; border-radius: 10px; height: 45px; width: 100%; display: flex;">
  <div style="height: 100%; width: 65%; background-color: #C2185B; float: left; text-align: center; display: flex; justify-content: center; align-items: center; font-size: 25px; ">
    <b><span style="color: #ffffff; padding: 20px 20px;">[1] 🧠🎗️ Introduction</span></b>
  </div>
  <div style="height: 100%; width: 35%; background-image: url('https://www.kaggle.com/competitions/90566/images/header'); background-size: cover; background-position: center; float: left; border-top-right-radius: 10px; border-bottom-right-radius: 4px;">
  </div>
</div>

<div style="position: relative; height: 200px; background-image: url('https://cms.buzzrx.com/globalassets/buzzrx/blogs/how-to-manage-adhd-without-medication-for-adults.png'); background-size: cover; background-position: center; border-radius: 15px; overflow: hidden;"></div>

Neurodevelopmental disorders, such as **Attention Deficit Hyperactivity Disorder (ADHD)**, impact a significant proportion of adolescents, with **approximately 11% diagnosed**—**14% of boys** and **8% of girls**. However, research suggests that **girls with ADHD are often underdiagnosed**, primarily because their symptoms tend to be more **inattentive rather than hyperactive**, making them harder to detect.

This **underdiagnosis has serious consequences**, potentially affecting treatment and outcomes for girls with ADHD.

The **WiDS Datathon Kaggle challenge** seeks to address this gap by building **predictive models using functional brain imaging data, socio-demographic details, emotional characteristics, and parenting information** to determine an individual's **ADHD status and biological sex**. Insights from this competition could significantly advance **personalized medicine and targeted interventions for ADHD**, particularly benefiting underrepresented groups.

📌 **Check out my other notebooks**:

- 🎙️ **S5E4**: [Podcast Pred | EDA & XGB | AI News 🌟](https://www.kaggle.com/code/tarundirector/podcast-pred-eda-xgb-ai-news)
- 📘 **S5E3**: [Rev Rain Prediction | EDA + Time Series + AI News 🌧️](https://www.kaggle.com/code/tarundirector/rev-rain-pred-eda-time-series-ai-news)  
- 🎒 **S5E2**: [Backpack Prediction | Baseline + Ensemble + EDA 📊](https://www.kaggle.com/code/tarundirector/backpack-pred-baseline-ensemble-eda)

> 💡 **Quick Tip**:
"Click on 'Show hidden code' snippets to reveal the code behind the results!" 👀💻

In [None]:
#🔍 Ah-ha! You found the secret sauce! 🍔

<div style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 20px; border-radius: 8px; font-size: 16px; color: #000000;">
  <h3 style="font-size: 16px; margin-bottom: 10px;"><strong>LOOK OUT FOR -> 🤔💁‍♀️So What?! :  🔍 Insights & Observations</strong></h3>
  <p> <em>Insights to understand the analysis and reach meaningful conclusions about the data!</em> 📊</p>

</div>

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[1.1] 🧠 Problem Statement</strong></span></b>

ADHD diagnosis, particularly in females, remains a challenge due to differences in symptom presentation. Many girls go undiagnosed, leading to long-term mental health impacts. Understanding the **brain activity patterns associated with ADHD** and their **differences between males and females** is crucial for improving early detection and personalized treatment.

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[1.2] 🎯 Goal</strong></span></b>

To develop a predictive model capable of accurately classifying individuals based on:

- **ADHD Diagnosis** (`ADHD_Outcome`: 1 for ADHD, 0 for non-ADHD)
- **Biological Sex** (`Sex_F` = 1 for Female, 0 for Male)

The model will leverage **functional brain imaging data**, along with **socio-demographic details, emotional characteristics, and parenting information**, to identify at-risk individuals more effectively. The ultimate aim is to improve **early diagnosis** and enable **personalised interventions**, thereby reducing negative long-term impacts, especially for females who are traditionally underdiagnosed.

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[1.3] 🗂 Dataset Description</strong></span></b>

The dataset for this competition is derived from multiple sources, primarily from the **Healthy Brain Network (HBN)** and the **Reproducible Brain Charts (RBC) project**, in collaboration with Cornell University and UC Santa Barbara.

The dataset contains detailed information on more than **1,200 individuals** in the training set and more than **300 individuals** in the test set. The data is divided into two primary folders:

#### ⏩ Training Data 🏋️‍♂️ (`train_tsv`)
Contains three key components for each subject:
1. **Target Variables:** ADHD diagnosis (`ADHD_Outcome`: 0=No, 1=Yes) and biological sex (`Sex_F`: 0=Male, 1=Female).
2. **Functional MRI (fMRI) Connectome Matrices:** Pairwise **brain activity correlations** between regions, computed from fMRI time series.
3. **Socio-Demographic, Emotional, and Parenting Data:** This includes metadata such as **handedness, parental education, emotional health (Strengths and Difficulties Questionnaire), and parenting styles (Alabama Parenting Questionnaire)**.
   
#### ⏩ Test Data 🎯(`test_tsv`)
Contains unseen data for **300+ subjects** and consists of:
- Functional MRI connectome matrices
- Socio-demographic, emotional, and parenting data
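
Because a correlation-based connectome is symmetric, it can be stored as a flat vector of its unique region pairs. The sketch below rebuilds a square matrix from such a vector. It assumes the values are ordered row by row over the upper triangle with the diagonal left out. The helper name `vector_to_connectome` and the 4-region toy input are illustrative only, and the actual column layout of the competition CSV should be checked before relying on this ordering.

```python
import numpy as np

def vector_to_connectome(flat, n_regions):
    """Rebuild a symmetric matrix from its flattened upper triangle (diagonal excluded)."""
    flat = np.asarray(flat, dtype=float)
    expected = n_regions * (n_regions - 1) // 2
    if flat.size != expected:
        raise ValueError(f"expected {expected} values, got {flat.size}")
    mat = np.zeros((n_regions, n_regions))
    rows, cols = np.triu_indices(n_regions, k=1)  # row-major order of the upper triangle
    mat[rows, cols] = flat
    mat = mat + mat.T            # mirror the values into the lower triangle
    np.fill_diagonal(mat, 1.0)   # assumption: self-correlation is set to 1
    return mat

# Toy example: 4 regions give 6 unique region pairs
toy = vector_to_connectome([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], n_regions=4)
```

For a real subject, `flat` would be one row of the connectome table with the `participant_id` column removed.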

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[1.4] 📏 Evaluation Metrics</strong></span></b>

### ▶️ **F1 Score - The Core Metric** 🔢
The model performance will be evaluated using the **F1 Score**, which is the harmonic mean of **precision** and **recall**:

![image.png](https://images.prismic.io/encord/0ef9c82f-2857-446e-918d-5f654b9d9133_Screenshot+%2849%29.png?auto=compress,format)

- **Precision**: Measures how many of the predicted positive cases are actually positive.
- **Recall**: Measures how many of the actual positive cases are correctly predicted.
- **F1 Score**: Balances both precision and recall to give a single metric that reflects model effectiveness.

📌 **F1 Score ranges from 0 (worst) to 1 (best), with 1 indicating perfect precision and recall.**

### ▶️ **Weighted Scoring for Female ADHD Cases** 🏆
Since the challenge focuses on addressing **gender disparities in ADHD diagnosis**, an additional weighting scheme has been applied:
- **Female ADHD cases** (`ADHD_Outcome=1, Sex_F=1`) will receive **2x weight** in the F1 Score calculation.
- The final leaderboard score will be based on the **average of the weighted F1 scores** for ADHD and sex prediction.

📌 **Why this weighting?** ADHD diagnosis is historically more challenging in females, and the competition aims to highlight and improve **gender-equitable diagnostic models**.
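
To make the weighting concrete, here is a minimal pure-Python sketch of a sample-weighted F1 score and the averaged final score. It is not the official scorer: it assumes the 2x weight is applied per sample to true female ADHD cases in both F1 computations, and the function names `weighted_f1` and `competition_style_score` are hypothetical.

```python
def weighted_f1(y_true, y_pred, weights):
    """F1 for the positive class, where sample i counts `weights[i]` times."""
    tp = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == 1 and p == 1)
    fp = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == 0 and p == 1)
    fn = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def competition_style_score(adhd_true, adhd_pred, sex_true, sex_pred,
                            female_adhd_weight=2.0):
    # Assumption: true female ADHD cases get the extra weight in both F1 scores
    weights = [female_adhd_weight if a == 1 and s == 1 else 1.0
               for a, s in zip(adhd_true, sex_true)]
    f1_adhd = weighted_f1(adhd_true, adhd_pred, weights)
    f1_sex = weighted_f1(sex_true, sex_pred, weights)
    return (f1_adhd + f1_sex) / 2

# Toy labels: subjects 0 and 3 are female ADHD cases and therefore count twice
adhd_true = [1, 1, 0, 1]
sex_true = [1, 0, 0, 1]
adhd_pred = [1, 0, 0, 1]
sex_pred = [1, 0, 1, 1]
score = competition_style_score(adhd_true, adhd_pred, sex_true, sex_pred)
```

In this toy case each F1 equals 8/9: the weighted true positives sum to 4 and there is a single unit-weight error in each task.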

For further details on the F1 Score, check out the Wikipedia page: [F1 Score](https://en.wikipedia.org/wiki/F-score).

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[1.5] 📏 Background</strong></span></b>

### ▶️ Attention Deficit Hyperactivity Disorder (ADHD): Neural and Sex-Specific Patterns

**Attention Deficit Hyperactivity Disorder (ADHD)** is a common **neurodevelopmental disorder** that typically emerges in childhood and can persist into adolescence and adulthood. It is characterized by a persistent pattern of **inattention** and/or **hyperactivity-impulsivity** that interferes with an individual's functioning or development.

These symptoms may present as:

- _Difficulty sustaining attention_
- _Being easily distracted or forgetful_
- _Excessive fidgeting or restlessness_
- _Impulsivity in actions or speech_
- _Difficulty waiting one’s turn_

---

### ▶️ Brain-Based Differences in ADHD

Neuroimaging studies consistently reveal **structural** and **functional differences** in individuals with ADHD compared to controls. Key brain regions implicated include:

- **Prefrontal cortex**: essential for _attention_, _executive functions_ (e.g., planning, working memory), and _impulse control_  
- **Basal ganglia**: involved in _motor control_, _reward processing_, and _habit formation_
- **Cerebellum**: traditionally linked to _motor coordination_, but also supports cognitive functions
- **Limbic system**: governs _emotion regulation_

<div style="text-align: center;">
  <img
    src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*1w_w68mfTO5aiePZyOSb6g.jpeg"
    alt="ADHD Banner"
    style="
      display: block;
      margin: 0 auto;
      border: 2px solid lightgrey;
      border-radius: 6px;
      padding: 10px;
      width: 50%;
      height: auto;
    "
  />
</div>

> _Some of these regions — including the prefrontal cortex, cerebellum, hippocampus, and amygdala — have been found to be slightly smaller in children with ADHD._

---

### ▶️ Functional Connectivity and Resting-State Insights

Functional connectivity studies using **resting-state fMRI (rs-fMRI)** further explore the interaction between brain regions. They have revealed altered connectivity patterns in ADHD, especially in the following networks:

- **Default Mode Network (DMN)**: typically active during rest; _hyperconnectivity_ here may contribute to distractibility  
- **Executive Control Network (ECN)**: crucial for _goal-directed behavior_  
- **Salience Network (SN)**: identifies and filters _important stimuli_

These findings indicate that ADHD involves **network-level disruptions**, not just dysfunction in isolated regions.

> _Some individuals show hyperconnectivity (e.g., frontal-subcortical), while others exhibit hypoconnectivity (e.g., within the DMN), reflecting ADHD's heterogeneity._

---

### ▶️ Neurochemical Dimensions

ADHD is closely associated with imbalances in **dopamine** and **norepinephrine**, which regulate attention, motivation, and executive functions.

- **Dopamine**: central to _reward processing_ and _focus_  
- **Norepinephrine**: modulates _alertness_ and _arousal_

> _Common ADHD medications like methylphenidate and amphetamines enhance neurotransmitter availability, improving focus and reducing impulsivity._

These neurotransmitter shifts may underlie **abnormal connectivity** between brain networks observed in fMRI studies.

---

### ▶️ Sex Differences in ADHD: Why They Matter

Understanding how ADHD presents differently across sexes is critical for diagnosis and intervention. While ADHD affects all sexes, it does so in **distinct patterns**:

- **Childhood prevalence** is significantly higher in males (_3:1 to 16:1_), but this gap narrows in adulthood.
- **Symptom expression**:
  - _Males_: externalizing (e.g., hyperactivity, disruptive behavior)
  - _Females_: internalizing (e.g., inattention, anxiety, depression)

> _This difference contributes to underdiagnosis of ADHD in females, whose symptoms may be subtler and often mistaken for other disorders._

---

### ▶️ Biological Differences in Connectivity

Studies of neurotypical individuals have shown sex-based differences in brain organization:

- _Females_: higher **local functional connectivity** and stronger **DMN connectivity**
- _Males_: stronger **sensorimotor connectivity**

In ADHD-specific studies:

- **Female adults with ADHD** showed _reduced connectivity_ in the visual network and its connections to DMN and ECN.
- **Male adults with ADHD** displayed altered activity in verbal working memory tasks, unlike females.

> _Regions like the **thalamus** and **amygdala** may also exhibit sex-specific alterations in structure and function related to ADHD._

These findings underscore the importance of exploring ADHD **through a sex-informed lens**, particularly when building predictive models or designing interventions.

---

### ▶️ Relevance to WiDS Datathon 2025

The **WiDS Datathon 2025** provides a unique opportunity to apply these insights on a large dataset of fMRI-derived connectomes. Your task:  
- Build **multi-output models** to predict ADHD diagnosis and sex  
- Investigate **neurobiological signatures**, especially those that differ across sexes  
- Use methods like **Network-Based Statistics (NBS)** or graph metrics to identify key connectivity changes  
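
As a small illustration of the graph-metric idea, the sketch below computes node strength, one of the simplest graph metrics, from a single connectome. It treats the matrix as a weighted graph and sums absolute connection weights per region. Taking absolute values is a modelling choice rather than something the competition prescribes, the function name `node_strength` is hypothetical, and this is not an implementation of NBS.

```python
import numpy as np

def node_strength(conn):
    """Sum of absolute connection weights per region, ignoring self-connections."""
    weights = np.abs(np.asarray(conn, dtype=float))  # np.abs returns a new array
    np.fill_diagonal(weights, 0.0)                   # drop self-connections
    return weights.sum(axis=1)

# Toy symmetric connectome with 3 regions
conn = np.array([
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
])
strength = node_strength(conn)
```

Comparing such per-region values between groups is one possible starting point before moving to edge-level methods.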

The goal is not just accuracy — it’s understanding. By decoding sex-specific brain connectivity patterns associated with ADHD, we contribute to more _personalized, fair, and effective_ neuroscience.

# <span style="color:#ffffff; font-size: 1%;">[2] 🔍 Dataset Overview</span>

<div style=" border-bottom: 8px solid #E6A600; overflow: hidden; border-radius: 10px; height: 45px; width: 100%; display: flex;">
  <div style="height: 100%; width: 65%; background-color: #C2185B; float: left; text-align: center; display: flex; justify-content: center; align-items: center; font-size: 25px; ">
    <b><span style="color: #ffffff; padding: 20px 20px;">[2] 📊🔍 Dataset Overview</span></b>
  </div>
  <div style="height: 100%; width: 35%; background-image: url('https://www.kaggle.com/competitions/90566/images/header'); background-size: cover; background-position: center; float: left; border-top-right-radius: 10px; border-bottom-right-radius: 4px;">
  </div>
</div>

In [None]:
import pandas as pd  # needed here: the data-loading cells run before the plotting imports

train_df_cat = pd.read_excel('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_CATEGORICAL_METADATA_new.xlsx')
train_df_fcm = pd.read_csv('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES_new_36P_Pearson.csv')
train_df_Q = pd.read_excel('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_QUANTITATIVE_METADATA_new.xlsx')
train_df_sol = pd.read_excel('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAINING_SOLUTIONS.xlsx')

test_df_cat = pd.read_excel('/kaggle/input/widsdatathon2025/TEST/TEST_CATEGORICAL.xlsx')
test_df_fcm = pd.read_csv('/kaggle/input/widsdatathon2025/TEST/TEST_FUNCTIONAL_CONNECTOME_MATRICES.csv')
test_df_Q = pd.read_excel('/kaggle/input/widsdatathon2025/TEST/TEST_QUANTITATIVE_METADATA.xlsx')

In [None]:
dict_df = pd.read_excel('/kaggle/input/widsdatathon2025/Data Dictionary.xlsx')

# Load data
dict_APQP_df = pd.read_excel('/kaggle/input/full-data-dictionaries/APQ_P.xlsx', header=None)
dict_ColorVision_df = pd.read_excel('/kaggle/input/full-data-dictionaries/ColorVision.xlsx', header=None)
dict_SDQ_df = pd.read_excel('/kaggle/input/full-data-dictionaries/SDQ.xlsx', header=None)

# Function to print the first cell of the raw sheet (first row, first column) before setting the header
def get_first_value_before_header(df, var_name):
    first_value = df.iloc[0, 0]  # First row, first column (before setting header)
    print(f"{var_name}: {first_value}")

# Print first values
get_first_value_before_header(dict_APQP_df, "dict_APQP_df")
get_first_value_before_header(dict_ColorVision_df, "dict_ColorVision_df")
get_first_value_before_header(dict_SDQ_df, "dict_SDQ_df")

# Set second row as the header
dict_APQP_df.columns = dict_APQP_df.iloc[1]
dict_ColorVision_df.columns = dict_ColorVision_df.iloc[1]
dict_SDQ_df.columns = dict_SDQ_df.iloc[1]

# Drop the first two rows as they are now redundant
dict_APQP_df = dict_APQP_df[2:].reset_index(drop=True)
dict_ColorVision_df = dict_ColorVision_df[2:].reset_index(drop=True)
dict_SDQ_df = dict_SDQ_df[2:].reset_index(drop=True)

In [None]:
train_data = train_df_cat.merge(train_df_Q, on="participant_id", how="inner") \
                        .merge(train_df_sol, on="participant_id", how="inner")

test_data = test_df_cat.merge(test_df_Q, on="participant_id", how="inner")
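
An inner merge silently drops participants that are missing from either table, so it can help to count them first. The sketch below is one way to do that using the `indicator` option of `DataFrame.merge`. The helper name `report_merge_loss` and the toy frames are hypothetical; only the `participant_id` key comes from the cells above.

```python
import pandas as pd

def report_merge_loss(left, right, key="participant_id"):
    """Count keys found in only one of two frames, as a check before an inner merge."""
    merged = left[[key]].merge(right[[key]], on=key, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {
        "left_only": int(counts.get("left_only", 0)),
        "right_only": int(counts.get("right_only", 0)),
        "both": int(counts.get("both", 0)),
    }

# Toy frames: "a" is missing on the right, "d" is missing on the left
cat = pd.DataFrame({"participant_id": ["a", "b", "c"], "site": [1, 2, 1]})
quant = pd.DataFrame({"participant_id": ["b", "c", "d"], "age": [9.5, 11.0, 8.2]})
loss = report_merge_loss(cat, quant)
```

A call such as `report_merge_loss(train_df_cat, train_df_Q)` would show whether the inner merges above discard any rows.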

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
dict_df['Field'] = dict_df['Field'].replace({'MRI_Track,Age_at_Scan': 'MRI_Track_Age_at_Scan'})

In [None]:
# Checking the number of rows and columns

num_train_rows, num_train_columns = train_data.shape

num_test_rows, num_test_columns = test_data.shape

print("Training Data:")
print(f"Number of Rows: {num_train_rows}")
print(f"Number of Columns: {num_train_columns}\n")

print("Test Data:")
print(f"Number of Rows: {num_test_rows}")
print(f"Number of Columns: {num_test_columns}\n")

In [None]:
# Count duplicate rows in train_data
train_duplicates = train_data.duplicated().sum()

# Count duplicate rows in test_data
test_duplicates = test_data.duplicated().sum()

# Print the results
print(f"Number of duplicate rows in train_data: {train_duplicates}")
print(f"Number of duplicate rows in test_data: {test_duplicates}")

In [None]:
# Creating a table for missing values, unique values and data types of the features

missing_values_train = pd.DataFrame({'Feature': train_data.columns,
                              '[TRAIN] No. of Missing Values': train_data.isnull().sum().values,
                              '[TRAIN] % of Missing Values': ((train_data.isnull().sum().values)/len(train_data)*100)})

missing_values_test = pd.DataFrame({'Feature': test_data.columns,
                             '[TEST] No. of Missing Values': test_data.isnull().sum().values,
                             '[TEST] % of Missing Values': ((test_data.isnull().sum().values)/len(test_data)*100)})

unique_values = pd.DataFrame({'Feature': train_data.columns,
                              'No. of Unique Values[FROM TRAIN]': train_data.nunique().values})

feature_types = pd.DataFrame({'Feature': train_data.columns,
                              'DataType': train_data.dtypes.values})  # .values keeps a plain integer index, like the frames above

merged_df = pd.merge(missing_values_train, missing_values_test, on='Feature', how='left')
merged_df = pd.merge(merged_df, unique_values, on='Feature', how='left')
merged_df = pd.merge(merged_df, feature_types, on='Feature', how='left')

merged_df.style.background_gradient(cmap='viridis')

In [None]:
# Having a look at the description of all the numerical columns present in the dataset
print('Description of all the numerical columns present in the train dataset')
train_data.describe().T.style.background_gradient(cmap='viridis')

In [None]:
# Having a look at the description of all the numerical columns present in the dataset
print('Description of all the numerical columns present in the test dataset')
test_data.describe().T.style.background_gradient(cmap='viridis')

# <span style="color:#ffffff; font-size: 1%;">[3] 💡 Exploratory Data Analysis (EDA)</span>

<div style=" border-bottom: 8px solid #E6A600; overflow: hidden; border-radius: 10px; height: 45px; width: 100%; display: flex;">
  <div style="height: 100%; width: 65%; background-color: #C2185B; float: left; text-align: center; display: flex; justify-content: center; align-items: center; font-size: 25px; ">
    <b><span style="color: #ffffff; padding: 20px 20px;">[3] 📈💡EDA</span></b>
  </div>
  <div style="height: 100%; width: 35%; background-image: url('https://www.kaggle.com/competitions/90566/images/header'); background-size: cover; background-position: center; float: left; border-top-right-radius: 10px; border-bottom-right-radius: 4px;">
  </div>
</div>

In [None]:
categorical_variables = ['Basic_Demos_Enroll_Year', 'Basic_Demos_Study_Site', 'PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race',
'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu', 'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu', 'Barratt_Barratt_P2_Occ',
'ColorVision_CV_Score', 'APQ_P_APQ_P_CP', 'SDQ_SDQ_Conduct_Problems', 'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Generating_Impact',
'SDQ_SDQ_Hyperactivity', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial']

numerical_variables = ['EHQ_EHQ_Total', 'APQ_P_APQ_P_ID', 'APQ_P_APQ_P_INV', 'APQ_P_APQ_P_OPD', 'APQ_P_APQ_P_PM',
'APQ_P_APQ_P_PP', 'SDQ_SDQ_Difficulties_Total', 'SDQ_SDQ_Externalizing', 'SDQ_SDQ_Internalizing', 'MRI_Track_Age_at_Scan']

target_variables = ['ADHD_Outcome', 'Sex_F']

> **⚠️ NOTE:** Some features that appear as **numerical** in the dataset are actually more **categorical in nature** (since they have very few unique values). We’ll treat them accordingly to ensure meaningful insights!
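
The lists above were chosen by hand. As a rough cross-check, the sketch below separates numeric columns by their number of unique values. The threshold `max_unique=10` and the helper name `split_by_cardinality` are assumptions for illustration, not the rule used to build the lists, and any cut-off should be confirmed against the data dictionary.

```python
import pandas as pd

def split_by_cardinality(df, max_unique=10, exclude=()):
    """Return (low_cardinality, high_cardinality) lists of numeric column names."""
    numeric = df.select_dtypes(include="number").columns.difference(list(exclude))
    n_unique = df[numeric].nunique()
    low = n_unique[n_unique <= max_unique].index.tolist()
    high = n_unique[n_unique > max_unique].index.tolist()
    return low, high

# Toy frame: one score with 3 distinct values, one continuous age column
toy = pd.DataFrame({
    "participant_id": list("abcdefghijkl"),
    "score_0_to_2": [0, 1, 2] * 4,
    "age": [7.1, 8.4, 9.0, 10.2, 11.5, 12.3, 13.7, 14.1, 15.6, 16.0, 17.2, 18.9],
})
low, high = split_by_cardinality(toy, max_unique=10)
```

On the real data, `split_by_cardinality(train_data, exclude=target_variables)` would give a first-pass split to compare with the hand-written lists.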

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[3.1] Numerical Feature Analysis (Univariate Analysis - Survey Data)</strong></span></b>

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import textwrap

# Define custom color palettes
box_palette = {'Train': '#F46D43', 'Test': '#66C2A5'}   # Orange-red for Train, teal green for Test
hist_train_color = '#F46D43'  # Same orange-red as the Train box plot
hist_test_color = '#66C2A5'   # Same teal green as the Test box plot

# Palettes for the additional KDE plots
gender_palette = {"Male": "lightblue", "Female": "lightpink"}
adhd_palette = {"Non-ADHD": "grey", "ADHD": "#FFDB58"}

# Add 'Dataset' column to distinguish between train and test data
train_data['Dataset'] = 'Train'
test_data['Dataset'] = 'Test'

# Ensure we only analyze the numerical variables
variables = [col for col in train_data.columns if col in numerical_variables]

# Create new columns for gender and ADHD status if not already present.
# Assuming train_data has a binary "Sex_F" column where 1 represents Female.
if 'Gender' not in train_data.columns:
    train_data['Gender'] = train_data['Sex_F'].apply(lambda x: "Female" if x == 1 else "Male")

# Assuming "ADHD_Outcome" is binary with 1 meaning ADHD and 0 meaning Non-ADHD.
if 'ADHD_Status' not in train_data.columns:
    train_data['ADHD_Status'] = train_data['ADHD_Outcome'].apply(lambda x: "ADHD" if x == 1 else "Non-ADHD")

# Function to create and display a row of plots for a single variable
def create_variable_plots(variable):
    sns.set_style('whitegrid')

    # Create a 1x4 subplot: box plot, histogram, gender KDE, ADHD KDE
    fig, axes = plt.subplots(1, 4, figsize=(24, 5))

    # ---------------------
    # 1. Box plot (Train & Test Combined)
    # ---------------------
    combined_data = pd.concat([train_data, test_data])
    sns.boxplot(ax=axes[0], data=combined_data, x=variable, y="Dataset", palette=box_palette)
    axes[0].set_xlabel(variable)
    title_box = f"Box Plot for {dict_df.loc[dict_df['Field'] == variable, 'Description'].values[0]}  [TRAIN & TEST Combined]"
    axes[0].set_title("\n".join(textwrap.wrap(title_box, width=50)))

    # ---------------------
    # 2. Overlaid histograms with KDE for Train vs Test
    # ---------------------
    sns.histplot(ax=axes[1], data=train_data, x=variable, color=hist_train_color, kde=True, bins=30, label="Train")
    sns.histplot(ax=axes[1], data=test_data, x=variable, color=hist_test_color, kde=True, bins=30, label="Test")
    axes[1].set_xlabel(variable)
    axes[1].set_ylabel("Frequency")
    title_hist = f"Histogram for {variable}:  {dict_df.loc[dict_df['Field'] == variable, 'Description'].values[0]} [TRAIN & TEST]"
    axes[1].set_title("\n".join(textwrap.wrap(title_hist, width=50)))
    axes[1].legend()

    # ---------------------
    # 3. KDE Plot by Gender (Male vs Female)
    # ---------------------
    sns.kdeplot(ax=axes[2], data=train_data, x=variable, hue="Gender", fill=True, common_norm=False,
                palette=gender_palette, alpha=0.4, linewidth=2)
    axes[2].set_xlabel(variable)
    axes[2].set_title(f"KDE by Gender for {variable}")

    # ---------------------
    # 4. KDE Plot by ADHD Status (ADHD vs Non-ADHD)
    # ---------------------
    sns.kdeplot(ax=axes[3], data=train_data, x=variable, hue="ADHD_Status", fill=True, common_norm=False,
                palette=adhd_palette, alpha=0.4, linewidth=2)
    axes[3].set_xlabel(variable)
    axes[3].set_title(f"KDE by ADHD Status for {variable}")

    # Adjust spacing and show the plots
    plt.tight_layout()
    plt.show()

# Perform univariate analysis for each variable in the list
for variable in variables:
    create_variable_plots(variable)

# Clean up: Drop the 'Dataset' column after analysis if desired
train_data.drop('Dataset', axis=1, inplace=True)
test_data.drop('Dataset', axis=1, inplace=True)
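
Peak locations and skew read off KDE plots are easy to misjudge, so a numeric summary per group can back up the visual impression. The sketch below is one possible way to do this with a pandas group-by. The helper name `summarize_by_group` and the toy values are hypothetical, and no result from the real data is implied.

```python
import pandas as pd

def summarize_by_group(df, value_cols, group_col):
    """Median and sample skewness of each numeric column within each group."""
    return df.groupby(group_col)[value_cols].agg(["median", "skew"])

# Toy data shaped like the ADHD_Status column created in the cell above
toy = pd.DataFrame({
    "ADHD_Status": ["ADHD"] * 5 + ["Non-ADHD"] * 5,
    "SDQ_SDQ_Externalizing": [8, 9, 7, 8, 12, 2, 1, 3, 2, 6],
})
summary = summarize_by_group(toy, ["SDQ_SDQ_Externalizing"], "ADHD_Status")
```

Calling `summarize_by_group(train_data, numerical_variables, "ADHD_Status")` (or with `"Gender"`) would produce the same table for the survey features plotted above.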

<span style="color:#ffffff; font-size: 1%;">SW-Key-Features-Insights</span>
<div style="background-color:#E8F8F5; border-left:8px solid #1ABC9C; padding:20px; border-radius:8px; font-size:14px; color:#000000;">
  <h3 style="font-size:20px; margin-bottom:10px;">🤔💁‍♀️So What?! <strong>(📝 Key Insights - Summary)</strong></h3>
  <hr>
  <ul>
    <li><strong><code>SDQ Difficulties</code> &amp; <code>Externalizing</code>:</strong> These scores show marked differences between <code>ADHD</code> and <code>non-ADHD</code> subjects—with <code>ADHD</code> cases scoring significantly higher. This highlights their crucial role in flagging behavioral challenges.</li>
    <li><strong><code>Positive Parenting</code>:</strong> The bimodal trend observed for <code>males</code> (and similarly for <code>ADHD</code> subjects) hints at distinct parenting subgroups, which is key to understanding different outcomes in the sample.</li>
    <li><strong><code>Parental Involvement</code>:</strong> Slight shifts in peak scores between <code>males</code> and <code>females</code>, as well as between <code>ADHD</code> and <code>non-ADHD</code> groups, suggest that variations in parental engagement may influence behavioral profiles.</li>
    <li><strong><code>Age at MRI Scan</code>:</strong> The <code>Age at MRI Scan</code> distribution is roughly bell-shaped with overlapping peaks across groups, suggesting that age is similarly distributed between them and is unlikely to be a major confounding factor.</li>
  </ul>

  <h3 style="font-size:20px; margin-bottom:10px;">🤔💁‍♀️So What?! <strong>(📝 Key Insights - Detailed)</strong></h3>
  <hr>
  <p><strong>1️⃣ Handedness Measure <code>(ehq_ehq_total)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> The distribution is left-skewed with a long tail, peaking around 90 for both <code>males</code> and <code>females</code>, with a slightly lower peak for <code>non-ADHD</code> subjects.</li>
    <li><strong>Interpretation:</strong> Derived from the Edinburgh Handedness Questionnaire, this score quantifies the <code>Laterality Index</code>—with the data dictionary indicating a scale from <code>-100</code> (extreme left-hand dominance) to <code>100</code> (extreme right-hand dominance). The observed peak around 90 suggests that most participants exhibit a strong right-hand preference. The near-identical distribution for both sexes implies similar lateralization; however, the marginally lower peak in <code>non-ADHD</code> subjects may indicate that <code>ADHD</code> individuals possess an even stronger lateralization tendency, potentially reflecting subtle neurodevelopmental differences.</li>
  </ul>
  <hr>
  <p><strong>2️⃣ Inconsistent Discipline Score <code>(apq_p_apq_p_id)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> This score is normally distributed with a prominent peak around <code>14</code>, showing substantial overlap between <code>males</code> and <code>females</code> as well as between <code>ADHD</code> and <code>non-ADHD</code> groups.</li>
    <li><strong>Interpretation:</strong> As a measure from the Alabama Parenting Questionnaire, this score reflects <code>inconsistencies in parental discipline</code>. Considering that the APQ response options range from 1 (Never) to 5 (Always) per item, a composite score peaking around <code>14</code> suggests that most parents exhibit moderate inconsistency. The near-identical distributions across groups indicate that inconsistent discipline occurs at similar levels in every group, offering limited differentiation between <code>sexes</code> or diagnostic categories.</li>
  </ul>
  <hr>
  <p><strong>3️⃣ Parental Involvement Score <code>(apq_p_apq_p_inv)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> The score follows a normal distribution, with <code>males</code> peaking around <code>42</code> and <code>females</code> around <code>38</code>; similarly, <code>non-ADHD</code> subjects peak around <code>42</code> while <code>ADHD</code> subjects peak near <code>38</code>.</li>
    <li><strong>Interpretation:</strong> This variable captures <code>parental involvement</code>: a higher score reflects greater engagement. Although the exact scale isn’t provided, given the APQ items (rated 1–5), a sum score in the high 30s to low 40s indicates moderate to high involvement. The observed shift, with higher scores in <code>males</code> and <code>non-ADHD</code> subjects, suggests that higher parental involvement is associated with lower rates of <code>ADHD</code> and may also vary by <code>sex</code>. Because the data are observational, this pattern alone does not establish a protective effect.</li>
  </ul>
  <hr>
  <p><strong>4️⃣ Other Discipline Practices <code>(apq_p_apq_p_opd)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> The distribution is normal with peaks between <code>17</code> and <code>18</code>, and a subtle trend where <code>non-ADHD</code> subjects score slightly lower.</li>
    <li><strong>Interpretation:</strong> This score assesses <code>alternative disciplinary methods</code> beyond those contributing to the overall discipline score. With response options from 1 to 5 per item, a summed score of around <code>17–18</code> implies a moderate frequency of these practices. The slight decrease in <code>non-ADHD</code> subjects may suggest that such methods are more frequently employed with children exhibiting <code>ADHD</code>-related behaviors, potentially contributing to the diagnostic differentiation.</li>
  </ul>
  <hr>
  <p><strong>5️⃣ Poor Monitoring/Supervision Score <code>(apq_p_apq_p_pm)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Exhibits a slight right skew with a peak around <code>15</code> and almost complete overlap between all groups.</li>
    <li><strong>Interpretation:</strong> This metric measures <code>monitoring and supervision</code> practices. A right-skewed pattern indicates that while most parents maintain adequate supervision, a subset demonstrates poorer practices (reflected in higher scores). Given the similar distributions for both <code>males</code> and <code>females</code> as well as between <code>ADHD</code> and <code>non-ADHD</code> groups, it suggests that monitoring is relatively consistent across the board and may not be a key differentiator in behavioral outcomes.</li>
  </ul>
  <hr>
  <p><strong>6️⃣ Positive Parenting Score <code>(apq_p_apq_p_pp)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> This variable shows a bimodal distribution; <code>males</code> display two distinct peaks (approximately <code>25</code> and <code>28</code>), while <code>females</code> have a single, slightly left-skewed peak around <code>28</code>. In the <code>ADHD</code> context, <code>ADHD</code> subjects exhibit dual peaks at roughly <code>24</code> and <code>28</code>, whereas <code>non-ADHD</code> subjects cluster near <code>27</code>.</li>
    <li><strong>Interpretation:</strong> The <code>Positive Parenting Score</code>, reflecting affirmative and supportive behaviors, is derived from multiple items (each rated 1–5). A composite score in the mid-to-high 20s indicates generally positive reinforcement. The bimodal distribution for <code>males</code> and <code>ADHD</code> subjects may indicate two subgroups: one with consistently high positive parenting and another with moderate levels. In contrast, the single-peaked distribution for <code>females</code> and <code>non-ADHD</code> subjects suggests a more homogeneous parenting approach. These distinctions may help in understanding how positive reinforcement relates to the behavioral profiles of different groups.</li>
  </ul>
  <hr>
  <p><strong>7️⃣ Overall Behavioral Difficulties <code>(sdq_sdq_difficulties_total)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> <code>males</code> exhibit a broad peak around <code>10</code>, whereas <code>females</code> show two peaks (around <code>8</code> and <code>14</code>). In the <code>ADHD</code> context, there is a stark contrast: <code>ADHD</code> subjects peak near <code>14</code>, while <code>non-ADHD</code> subjects cluster around <code>5</code>.</li>
    <li><strong>Interpretation:</strong> This metric summarizes overall behavioral and emotional challenges from the Strengths and Difficulties Questionnaire (the total score ranges from 0 to 40); scores around <code>14</code> indicate heightened difficulties. The pronounced difference between <code>ADHD</code> (peaking at <code>14</code>) and <code>non-ADHD</code> (peaking at <code>5</code>) subjects suggests strong discriminative power. For <code>females</code>, the dual peaks may reflect the existence of subgroups with varying severity, while the broad peak for <code>males</code> suggests a wider spread of difficulties. This clear divergence points to elevated difficulty scores as a key indicator of <code>ADHD</code>.</li>
  </ul>
  <hr>
  <p><strong>8️⃣ Externalizing Behaviors <code>(sdq_sdq_externalizing)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> The distributions are approximately normal, with a <code>male</code> peak around <code>8</code> and a <code>female</code> peak around <code>5</code>. In <code>ADHD</code> subjects, the peak is near <code>8</code>, while <code>non-ADHD</code> subjects exhibit a right-skewed trend with a peak around <code>2</code>.</li>
    <li><strong>Interpretation:</strong> The <code>Externalizing</code> score captures behaviors such as hyperactivity and conduct problems. In the standard SDQ scoring it is the sum of two subscales scored 0 to 10 each, so it ranges from 0 to 20. It is noticeably higher in <code>males</code> and in <code>ADHD</code> subjects. The concentrated peak at <code>8</code> for these groups contrasts with the much lower scores in <code>non-ADHD</code> subjects, emphasizing its value in highlighting disruptive behaviors that may require targeted interventions.</li>
  </ul>
  <hr>
  <p><strong>9️⃣ Internalizing Behaviors <code>(sdq_sdq_internalizing)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Both <code>males</code> and <code>females</code> display a right-skewed distribution with overlapping peaks around <code>2–3</code>. However, <code>ADHD</code> subjects show a broader, more diffuse peak around <code>3</code>, while <code>non-ADHD</code> subjects have a taller, sharper peak near <code>1</code>.</li>
    <li><strong>Interpretation:</strong> This score reflects <code>internalizing behaviors</code> (e.g., anxiety, depression) and typically has lower values. The broader distribution for <code>ADHD</code> subjects implies higher variability and intensity of these symptoms, suggesting that children with <code>ADHD</code> experience a wider range of emotional challenges. In contrast, the concentrated peak at <code>1</code> among <code>non-ADHD</code> subjects indicates more stable emotional well-being, reinforcing the contrast in behavioral profiles between the groups.</li>
  </ul>
  <hr>
  <p><strong>🔟 Age at MRI Scan <code>(mri_track_age_at_scan)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> The distributions are approximately normal, with overlapping peaks around <code>10</code> years of age.</li>
    <li><strong>Interpretation:</strong> This variable records the <code>age</code> of participants at the time of their MRI scan. The group distributions overlap closely and peak around <code>10</code> years, which suggests that age is similarly distributed across the groups. This makes it less likely that the observed differences in behavioral and parental factors simply reflect developmental differences.</li>
  </ul>
</div>

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[3.2] Numerical Feature Analysis (Connectome Data)</strong></span></b>

### [3.2.1] 🟡 **Background**

### ▶️ **Fundamentals of Functional Magnetic Resonance Imaging (fMRI): Peering into Brain Activity**

fMRI is a neuroimaging technique that allows researchers to observe brain activity in a **non-invasive** manner. It works by detecting changes in blood flow and blood oxygenation in response to neural activity—a phenomenon known as the **Blood-Oxygen-Level-Dependent (BOLD) signal**.

⏩ **BOLD Signal Mechanism:**
- When a brain area becomes more active, it consumes more oxygen.
- To compensate, **blood flow increases** to that region, often resulting in a local oversupply of oxygenated blood.
- fMRI does not measure neural activity directly; instead, it detects the **hemodynamic response** (i.e., the changes in blood flow and oxygenation).

⏩ **Magnetic Properties and fMRI:**
- **Deoxygenated hemoglobin** is _paramagnetic_ (attracted to magnetic fields and causing local distortions).
- **Oxygenated hemoglobin** is _diamagnetic_ and has a much weaker effect on the magnetic field.
- The **influx of oxygenated blood** decreases the concentration of deoxygenated hemoglobin, creating a more uniform magnetic environment detectable by the MRI scanner.

---

### ▶️ **Constructing and Understanding the fMRI Connectivity Matrix: Mapping Functional Relationships**


<div style="text-align: center;">
  <img
    src="https://media-hosting.imagekit.io/39652e926f4a4c58/WIDS1.png?Expires=1838310357&Key-Pair-Id=K2ZIVPTIP2VGHC&Signature=P4SfWRRV3okuqA7d-mtQe0BTQn9116Ml2GEF5IhqAXbJr9N1ImOCXZg-IDAz4QqxbeuiYsvsEbxCtmYgKIkRYvFsx6TDxp8JGRwYjKh5LvUOFyM3FGd669BDJbmzvLP4GVYiwuTWlJXgpJkmYryZOOt4Fn0dU-A1JVXxhWiFoh9u74Su6eN44blEqWkiZ1~ooZHmx3Gp2Gm11miAMVQ8TDz5LhNpHofXyM3ZOrCqN3U7u0HBdv8ROybFRrdhp1Az9qC0Xs4VBjDqgTqiEljH1loKXVDyjXERNCE47Apdn2tGP8lbHDg8~XbGQsxFJYtcE3fprFynElPgrNsEzcMf7A__"
    alt="ADHD Banner"
    style="
      display: block;
      margin: 0 auto;
      border: 2px solid lightgrey;
      border-radius: 6px;
      padding: 10px;
      width: 75%;
      height: auto;
    "
  />
</div>


To transform raw fMRI data into a functional connectome:

⏩ **Parcellation:**
- The brain is first divided into a set of distinct regions or _nodes_.
- This can be achieved using:
  - **Anatomical atlases** (regions defined by structural boundaries)
  - **Functional parcellation techniques** (grouping voxels based on similarity in BOLD signal time series)

⏩ **Quantifying Relationships:**
- The next step is to quantify the statistical relationship between the BOLD signal time series of each pair of regions.
- The **Pearson correlation coefficient** is most commonly used:
  - Measures the strength and direction of the linear relationship between two time series.
  - Ranges from **-1** (perfect negative correlation) to **+1** (perfect positive correlation), with **0** indicating no linear correlation.
- Other measures (e.g., covariance, mutual information) can also assess these relationships.

⏩ **The Connectivity Matrix:**
- The result is a connectivity (or correlation) matrix, where:
  - Each element *(i, j)* represents the strength of the functional connection between brain region *i* and *j*.
  - For **N** brain regions, the matrix is of size **N x N**.
  - The matrix is **symmetric** because the correlation between *i* and *j* is the same as between *j* and *i*.
  - Diagonal elements are typically **1** (correlation of a region with itself).

⏩ **Interpreting the Matrix:**
- **High positive correlation:** Suggests that the two regions have synchronous fluctuations, implying they may be involved in similar or interacting processes.
- **Correlation near zero:** Indicates little or no linear relationship.
- **Negative correlation:** Suggests anti-correlation; when one region's activity increases, the other decreases.
- Importantly, the connectivity matrix reflects statistical relationships and does **not imply causality**.
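
The steps above can be sketched on synthetic data. This is a minimal illustration, not the competition's preprocessing pipeline: the time series are randomly generated, and the region count, number of time points, and signal strengths are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic BOLD-like signals: 4 regions x 120 time points (illustrative only).
n_regions, n_timepoints = 4, 120
shared = rng.standard_normal(n_timepoints)           # a common fluctuation
ts = rng.standard_normal((n_regions, n_timepoints))  # independent noise per region
ts[0] += 2.0 * shared   # regions 0 and 1 share the signal -> positive correlation
ts[1] += 2.0 * shared
ts[2] -= 2.0 * shared   # region 2 moves opposite to it -> negative correlation

# np.corrcoef treats each row as one variable and returns an N x N matrix.
conn = np.corrcoef(ts)

print(np.round(conn, 2))
```

The result is symmetric with ones on the diagonal. Region 3 carries only noise, so its correlations with the other regions stay close to zero.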

---

### ▶️ **Vectorized Connectivity Data for the Competition**

The provided dataset (`train_df_fcm`) stores this connectivity matrix in vectorized form, one row per participant:

⏩ **Key Details:**
- The vectorized form contains the unique information from the upper (or lower) triangle of the matrix (excluding the diagonal).
- Given that there are **19900** connectivity columns (excluding `participant_id`), this represents:
  
  ⏩ **Calculation:**  
  $$
    \frac{N \times (N - 1)}{2} = 19900
    $$
  
  Solving for **N** gives **N = 200**.

- Therefore, the data represents the functional connectivity between **200 brain regions**.
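
The calculation above can be checked numerically. The helper name `regions_from_edge_count` is hypothetical; it solves the quadratic N² − N − 2M = 0 for the positive root, where M is the number of connectivity columns.

```python
import math

def regions_from_edge_count(n_edges: int) -> int:
    """Solve N * (N - 1) / 2 = n_edges for a positive integer N."""
    disc = 1 + 8 * n_edges          # discriminant of N^2 - N - 2 * n_edges = 0
    root = math.isqrt(disc)
    if root * root != disc:
        raise ValueError(f"{n_edges} is not a valid upper-triangle size")
    return (1 + root) // 2

n_regions = regions_from_edge_count(19900)
print(n_regions)  # 200
```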

⏩ **Processing Steps:**
- The vectorized matrix should be reshaped back into its original **200 x 200 symmetric matrix**.
- **Preprocessing options** include:
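  - **Normalization:** Rescaling the connectivity values to a common range. Pearson correlations already lie between -1 and 1, so this mainly matters when other measures (e.g., covariance) are used.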
  - **Thresholding:** Setting weak or statistically non-significant connections to zero.
  - **Averaging:** If multiple connectivity matrices exist per participant (e.g., from different scans), they may be averaged for reliability.
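
A minimal sketch of the reshaping and thresholding steps on a toy vector. It assumes the vector lists the upper triangle row by row (i < j), and the cut-off `min_abs=0.2` is an arbitrary illustration rather than a recommended value.

```python
import numpy as np

def vector_to_matrix(vec, n):
    """Fill the strict upper triangle row by row, then mirror it."""
    mat = np.zeros((n, n))
    iu = np.triu_indices(n, k=1)   # (i, j) pairs with i < j, in row-major order
    mat[iu] = vec
    mat = mat + mat.T              # mirror into the lower triangle
    np.fill_diagonal(mat, 1.0)     # self-correlation
    return mat

def threshold_weak(mat, min_abs=0.2):
    """Zero out entries whose magnitude is below min_abs."""
    out = mat.copy()
    out[np.abs(out) < min_abs] = 0.0
    return out

# Toy example with n = 4 regions -> 4 * 3 / 2 = 6 unique connections.
vec = np.array([0.5, -0.1, 0.3, 0.05, -0.6, 0.15])
mat = vector_to_matrix(vec, n=4)
sparse = threshold_weak(mat, min_abs=0.2)

print(mat)
print(sparse)
```

Reading the upper triangle back with `mat[np.triu_indices(4, k=1)]` recovers the original vector, which is a quick way to confirm the ordering assumption.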

<hr style="height:3px; background-color:black; border:none;">

### [3.2.2] 🟡 **Visualizations**

### ↪️ [3.2.2.1] **Feature Extraction from Connectome Data: Unlocking Predictive Power**

To leverage the information contained within **_connectome data_** for predictive tasks, such as identifying individuals with **_ADHD_** or predicting their sex, it is essential to extract meaningful features that capture the underlying organization and properties of the brain network. **_Graph theory_** offers a powerful framework for this purpose, providing a rich set of metrics to quantify various aspects of brain network organization.

These metrics can be broadly categorized into:
- **_Nodal measures_** – describe the properties of individual brain regions (nodes).
- **_Global measures_** – characterize the network as a whole.

---

#### ⏩ **Nodal Measures**
These provide insights into the role and characteristics of individual brain regions within the network:

- **_Degree_**  
  → Counts the number of connections a node has.  
  → In the brain, regions with a **high degree** may act as **central hubs**, interacting with many other regions.

- **_Strength_**  
  → Sum of the weights of all connections associated with a node.  
  → In **functional connectivity networks**, it reflects the **overall level of correlated activity** between a brain region and the rest of the brain.

- **_Centrality Measures_** (e.g., **_Betweenness Centrality_**)  
  → Assess how **important** a node is in the network’s communication pathways.  
  → **Betweenness centrality** quantifies how often a node lies on the **shortest path** between any two other nodes.  
  → High betweenness regions may act as **critical bridges** for information flow.
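
As a rough illustration of the first two nodal measures, the sketch below computes degree and strength from a toy matrix using NumPy only. The matrix values and the `threshold` of 0.3 are hypothetical. Betweenness centrality is left out because it requires shortest-path computations, which are usually delegated to a graph library.

```python
import numpy as np

# Toy 4-region connectivity matrix (symmetric, diagonal set to 0).
conn = np.array([
    [0.0, 0.6, 0.4, 0.0],
    [0.6, 0.0, 0.5, 0.1],
    [0.4, 0.5, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
])

threshold = 0.3  # hypothetical cut-off for "an edge exists"
adjacency = (np.abs(conn) > threshold).astype(int)

# Degree: number of supra-threshold edges attached to each node.
degree = adjacency.sum(axis=1)
# Strength: sum of the absolute weights on those edges.
strength = (np.abs(conn) * adjacency).sum(axis=1)

print("degree:  ", degree)
print("strength:", np.round(strength, 2))
```

Region 3 has only one weak link (0.1), so it ends up isolated after thresholding, with both degree and strength equal to zero.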

---

#### ⏩ **Global Measures**
These provide an overall characterization of the brain network’s architecture:

- **_Clustering Coefficient_**  
  → Measures how interconnected a node’s neighbors are; the network-level value (`avg_clustering`) is the average over all nodes.  
  → High values indicate **tightly linked local circuits**, suggesting **specialized processing modules**.

- **_Characteristic Path Length_**  
  → Average of the shortest path lengths between all pairs of nodes.  
  → Reflects **global integration** or **efficiency** of information transfer across the network.  
  → Shorter path lengths imply **more efficient communication**.

- **_Global Efficiency_**  
  → Average of the **inverse shortest path lengths** between all node pairs.  
  → Quantifies **how well information is exchanged** across the entire network.

- **_Modularity_**  
  → Measures the degree to which a network can be split into **distinct communities** or **modules**.  
  → High modularity implies **specialized functional units** in the brain.

- **_Small-Worldness_**  
  → A property combining **high clustering** (like a regular network) and **short path lengths** (like a random network).  
  → Represents an optimal balance between **segregation** and **integration**, considered ideal for brain networks.

Each measure captures a **distinct aspect** of network organization, with **varying relevance** depending on the biological phenomenon being studied.
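
To make the global measures concrete, here is a small sketch on a toy graph, assuming `networkx` is installed. The graph (two triangles joined by one bridge) and the two-community split are illustrative choices, and this is not necessarily how the precomputed feature file was generated. Small-worldness is omitted because it depends on comparisons against randomized reference graphs.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by a single bridge 2-3.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

avg_clustering = nx.average_clustering(G)              # mean local clustering
char_path_length = nx.average_shortest_path_length(G)  # mean shortest path
global_eff = nx.global_efficiency(G)                   # mean inverse shortest path

# Modularity of the intuitive two-community split (one triangle each).
parts = [{0, 1, 2}, {3, 4, 5}]
modularity = community.modularity(G, parts)

print(f"clustering={avg_clustering:.3f}, path length={char_path_length:.3f}")
print(f"efficiency={global_eff:.3f}, modularity={modularity:.3f}")
```

On a thresholded connectome the same calls apply after building the graph from the binary adjacency matrix. Note that `average_shortest_path_length` requires a connected graph, so disconnected networks need to be handled per component.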

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.lines import Line2D

# -----------------------
# Merge Features with Solutions
# -----------------------
train_features_df = pd.read_csv('/kaggle/input/full-data-dictionaries/train_connectome_features.csv')
train_merged = pd.merge(train_features_df, train_df_sol, on='participant_id')

# -----------------------
# Define Colors for ADHD and Sex outcomes
# -----------------------
# For ADHD: 0 = grey, 1 = yellow
adhd_colors = {0: '#808080', 1: '#f1c40f'}
# For Sex: 0 = blue (male), 1 = pink (female)
sex_colors = {0: '#3498db', 1: '#e91e63'}

# -----------------------
# Define List of Feature Columns to Plot
# -----------------------
plot_columns = [
    'mean_degree', 'std_degree', 'mean_strength', 'std_strength',
    'mean_betweenness', 'std_betweenness', 'avg_clustering',
    'characteristic_path_length', 'global_efficiency', 'modularity',
    'small_worldness', 'num_connected_components'
]

# -----------------------
# Plotting Loop: For each feature, create 4 subplots (2x2)
# -----------------------
for col in plot_columns:
    fig, axs = plt.subplots(2, 2, figsize=(20, 12))

    # Use participant_id as x-axis if numeric; otherwise, use the index.
    if pd.api.types.is_numeric_dtype(train_merged['participant_id']):
        x_vals = train_merged['participant_id']
    else:
        x_vals = train_merged.index

    # --- Top Left: Scatter Plot for ADHD Outcome ---
    axs[0, 0].scatter(
        x_vals,
        train_merged[col],
        c=train_merged['ADHD_Outcome'].map(adhd_colors),
        alpha=0.7
    )
    axs[0, 0].set_title(f"Scatter: {col} vs Participant ID (ADHD)", fontsize=16, fontweight='bold')
    axs[0, 0].set_xlabel("Participant ID", fontsize=14)
    axs[0, 0].set_ylabel(col, fontsize=14)
    axs[0, 0].grid(True, linestyle='--', alpha=0.5)
    legend_elements_adhd = [
        Line2D([0], [0], marker='o', color='w', label='Non-ADHD',
               markerfacecolor=adhd_colors[0], markersize=10),
        Line2D([0], [0], marker='o', color='w', label='ADHD',
               markerfacecolor=adhd_colors[1], markersize=10)
    ]
    axs[0, 0].legend(handles=legend_elements_adhd, title="ADHD Outcome", fontsize=12, title_fontsize=12)

    # --- Top Right: KDE Plot for ADHD Outcome ---
    sns.kdeplot(
        data=train_merged,
        x=col,
        hue='ADHD_Outcome',
        palette=adhd_colors,
        fill=True,
        common_norm=False,
        alpha=0.6,
        ax=axs[0, 1]
    )
    axs[0, 1].set_title(f"KDE: {col} by ADHD Outcome", fontsize=16, fontweight='bold')
    axs[0, 1].set_xlabel(col, fontsize=14)
    axs[0, 1].set_ylabel("Density", fontsize=14)
    axs[0, 1].grid(True, linestyle='--', alpha=0.5)

    # --- Bottom Left: Scatter Plot for Sex Outcome ---
    axs[1, 0].scatter(
        x_vals,
        train_merged[col],
        c=train_merged['Sex_F'].map(sex_colors),
        alpha=0.7
    )
    axs[1, 0].set_title(f"Scatter: {col} vs Participant ID (Sex)", fontsize=16, fontweight='bold')
    axs[1, 0].set_xlabel("Participant ID", fontsize=14)
    axs[1, 0].set_ylabel(col, fontsize=14)
    axs[1, 0].grid(True, linestyle='--', alpha=0.5)
    legend_elements_sex = [
        Line2D([0], [0], marker='o', color='w', label='Male',
               markerfacecolor=sex_colors[0], markersize=10),
        Line2D([0], [0], marker='o', color='w', label='Female',
               markerfacecolor=sex_colors[1], markersize=10)
    ]
    axs[1, 0].legend(handles=legend_elements_sex, title="Sex", fontsize=12, title_fontsize=12)

    # --- Bottom Right: KDE Plot for Sex Outcome ---
    sns.kdeplot(
        data=train_merged,
        x=col,
        hue='Sex_F',
        palette=sex_colors,
        fill=True,
        common_norm=False,
        alpha=0.6,
        ax=axs[1, 1]
    )
    axs[1, 1].set_title(f"KDE: {col} by Sex", fontsize=16, fontweight='bold')
    axs[1, 1].set_xlabel(col, fontsize=14)
    axs[1, 1].set_ylabel("Density", fontsize=14)
    axs[1, 1].grid(True, linestyle='--', alpha=0.5)

    plt.tight_layout()
    plt.show()

### [3.2.2.2] ↪️ **Connectivity Matrix Plots**

In [None]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# -----------------------------
# 1. Merge and Aggregate Data
# -----------------------------
# (Assuming train_df_fcm and train_df_sol are already loaded)

# Merge the connectome data with the solution data on 'participant_id'
merged_df = pd.merge(train_df_fcm, train_df_sol, on='participant_id')

# Identify connectivity columns (exclude 'participant_id')
conn_cols = [col for col in train_df_fcm.columns if col != 'participant_id']

# Function to sort connectome columns based on node indices extracted from names.
def sort_connectome_columns(columns):
    def parse_col(col):
        # Expected pattern: 'ithrow_jthcolumn' where i and j are integers
        m = re.match(r"(\d+)throw_(\d+)thcolumn", col)
        if m:
            i = int(m.group(1))
            j = int(m.group(2))
            return (i, j)
        else:
            return (float('inf'), float('inf'))
    return sorted(columns, key=parse_col)

sorted_conn_cols = sort_connectome_columns(conn_cols)

# Aggregate connectivity data by ADHD_Outcome and Sex_F (using mean)
# (Assuming in train_df_sol: ADHD_Outcome (0/1) and Sex_F (0=Male, 1=Female))
adhd_groups = merged_df.groupby('ADHD_Outcome')[sorted_conn_cols].mean()
sex_groups  = merged_df.groupby('Sex_F')[sorted_conn_cols].mean()

# Function to convert a connectivity vector (of length n*(n-1)/2) into a symmetric matrix.
def vector_to_symmetric_matrix(vector, n=200):
    mat = np.zeros((n, n))
    # Index pairs (i, j) with i < j in row-by-row order, matching the sorted columns.
    rows, cols = np.triu_indices(n, k=1)
    mat[rows, cols] = vector
    mat[cols, rows] = vector  # mirror into the lower triangle
    return mat

# Create aggregated connectivity matrices:
# For ADHD groups: index 1 means ADHD-positive, index 0 means non‑ADHD.
adhd_positive_matrix = vector_to_symmetric_matrix(adhd_groups.loc[1].values, n=200)
adhd_negative_matrix = vector_to_symmetric_matrix(adhd_groups.loc[0].values, n=200)
# For Sex groups: index 1 means Female, index 0 means Male.
female_matrix = vector_to_symmetric_matrix(sex_groups.loc[1].values, n=200)
male_matrix   = vector_to_symmetric_matrix(sex_groups.loc[0].values, n=200)

# Compute difference matrices
adhd_diff_matrix = adhd_positive_matrix - adhd_negative_matrix
sex_diff_matrix  = female_matrix - male_matrix

# -----------------------------
# 2. Continuous Heatmap Plots (3 per row)
# -----------------------------
# We'll create 2 rows (one for ADHD groups and one for Sex groups) and 3 columns per row.
fig, axes = plt.subplots(2, 3, figsize=(24, 12))

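# Define a function that plots only the lower triangle of a matrix.
def plot_lower_triangle(ax, matrix, title):
    # Mask the upper triangle and the diagonal so each connection is drawn once.
    mask = np.triu(np.ones_like(matrix, dtype=bool))
    # center=0 pins the neutral color to zero correlation (red > 0, blue < 0).
    sns.heatmap(matrix, mask=mask, cmap="coolwarm", center=0, square=True,
                cbar_kws={"shrink": .5}, ax=ax)
    ax.set_title(title)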

# ADHD row: column 0: ADHD Positive, 1: ADHD Negative, 2: Difference
plot_lower_triangle(axes[0, 0], adhd_positive_matrix, "ADHD Positive Aggregated Connectome")
plot_lower_triangle(axes[0, 1], adhd_negative_matrix, "ADHD Negative Aggregated Connectome")
plot_lower_triangle(axes[0, 2], adhd_diff_matrix, "Difference (ADHD Positive - Negative)")

# Sex row: column 0: Female, 1: Male, 2: Difference
plot_lower_triangle(axes[1, 0], female_matrix, "Female Aggregated Connectome")
plot_lower_triangle(axes[1, 1], male_matrix, "Male Aggregated Connectome")
plot_lower_triangle(axes[1, 2], sex_diff_matrix, "Difference (Female - Male)")

plt.tight_layout()
plt.show()

# -----------------------------
# 3. Thresholding and Binary Graphs
# -----------------------------
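# For binary graphs, we threshold the matrices (e.g. keep only connections above the 75th percentile)
def threshold_binary(matrix, percentile=75):
    # Compute the cut-off from the unique off-diagonal connections only, so the
    # zero-filled diagonal does not bias the percentile.
    upper = matrix[np.triu_indices_from(matrix, k=1)]
    threshold_value = np.percentile(upper, percentile)
    binary_matrix = (matrix > threshold_value).astype(int)
    return binary_matrix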

# Compute binary versions for ADHD groups
binary_adhd_positive = threshold_binary(adhd_positive_matrix, percentile=75)
binary_adhd_negative = threshold_binary(adhd_negative_matrix, percentile=75)
# Difference as binary: subtracting yields -1, 0, or 1
binary_adhd_diff = binary_adhd_positive - binary_adhd_negative

# Compute binary versions for Sex groups
binary_female = threshold_binary(female_matrix, percentile=75)
binary_male   = threshold_binary(male_matrix, percentile=75)
binary_sex_diff = binary_female - binary_male

# Now, create binary heatmaps in a similar layout (2 rows x 3 columns)
fig, axes = plt.subplots(2, 3, figsize=(24, 12))

# Define a function to plot a binary matrix.
def plot_binary_heatmap(ax, matrix, title):
    # Use the continuous "coolwarm" palette with fixed limits so that
    # -1 maps to blue, 0 to the neutral midpoint, and 1 to red.
    cmap = sns.color_palette("coolwarm", as_cmap=True)
    sns.heatmap(matrix, cmap=cmap, square=True, cbar=True, ax=ax, vmin=-1, vmax=1)
    ax.set_title(title)

# ADHD row: binary plots
plot_binary_heatmap(axes[0, 0], binary_adhd_positive, "Binary: ADHD Positive")
plot_binary_heatmap(axes[0, 1], binary_adhd_negative, "Binary: ADHD Negative")
plot_binary_heatmap(axes[0, 2], binary_adhd_diff, "Binary Difference (ADHD)")

# Sex row: binary plots
plot_binary_heatmap(axes[1, 0], binary_female, "Binary: Female")
plot_binary_heatmap(axes[1, 1], binary_male, "Binary: Male")
plot_binary_heatmap(axes[1, 2], binary_sex_diff, "Binary Difference (Sex)")

plt.tight_layout()
plt.show()

#### ▶️ **Understanding the Connectivity Matrix Plots**

The connectivity matrix plots visualize functional relationships between pairs of brain regions, helping us understand how strongly the activity of different regions fluctuates together over the course of the scan.

#### ⏩ **Continuous Matrix Heatmaps (Top Figure)**

Each heatmap shows the lower triangle of a **200 × 200 symmetric matrix** (the upper triangle mirrors it and is masked) where:
- **Rows and columns** represent specific brain regions.
- **Colors** represent average correlation values (functional connectivity) between those region pairs.
  - **Red**: stronger positive correlation (regions activate together).
  - **Blue**: negative correlation (regions activate in opposite directions).
  - **White**: near-zero correlation (weak/no relationship).

These are grouped as:
- **ADHD Positive vs Negative**: Visualizes how connectivity patterns differ for ADHD-diagnosed vs. non-diagnosed individuals.
- **Female vs Male**: Highlights sex-related differences in brain network structure.
- **Difference Plots** (rightmost in each row): Subtract one group from the other (e.g., ADHD+ minus ADHD−), showing where and by how much connectivity differs.  
  - The **magnitude range is subtle**, with differences on the order of **±0.04 to ±0.06**, suggesting mild but potentially meaningful changes in connectivity between groups.

---

#### ⏩ **Binary Matrix Plots (Bottom Figure)**

Here, continuous matrices are **thresholded** (connections above the 75th percentile, i.e., the 25% most positive, are kept) and binarized:
- A value of:
  - `1`: strong connection present
  - `0`: no strong connection
- **Difference maps** subtract the second group's binary matrix from the first (ADHD minus non-ADHD, Female minus Male), giving `-1`, `0`, or `1`:
  - `+1`: strong connection present in the first group (ADHD or Female) only.
  - `-1`: strong connection present in the second group (non-ADHD or Male) only.
  - `0`: either both have or both lack the connection.

These binary heatmaps make **differences more visually stark**, helping isolate which specific brain region pairs differ most between ADHD vs non-ADHD or Female vs Male.
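
The binarize-and-subtract logic can be traced on two toy matrices. The values below are made up, and the helper `top_quartile_mask` is a hypothetical name; it takes the percentile over the unique off-diagonal connections only.

```python
import numpy as np

def top_quartile_mask(mat, percentile=75):
    """Mark connections above the given percentile of the unique off-diagonal values."""
    iu = np.triu_indices_from(mat, k=1)
    cutoff = np.percentile(mat[iu], percentile)
    mask = (mat > cutoff).astype(int)
    np.fill_diagonal(mask, 0)
    return mask

# Two toy 4 x 4 group-average matrices (hypothetical values).
group_a = np.array([[0, .8, .1, .2], [.8, 0, .3, .4], [.1, .3, 0, .7], [.2, .4, .7, 0]])
group_b = np.array([[0, .2, .9, .1], [.2, 0, .3, .4], [.9, .3, 0, .6], [.1, .4, .6, 0]])

# +1: strong in group A only, -1: strong in group B only, 0: same status in both.
diff = top_quartile_mask(group_a) - top_quartile_mask(group_b)

iu = np.triu_indices_from(diff, k=1)
only_a = int((diff[iu] == 1).sum())
only_b = int((diff[iu] == -1).sum())
print(diff)
print(only_a, only_b)
```

Counting the `+1` and `-1` entries over the upper triangle gives a compact summary of how many strong connections are specific to each group.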

---

### [3.2.2.3] ↪️ **Circular Chord Diagrams**

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from pycirclize import Circos

# -----------------------------
# Helper Functions
# -----------------------------
def aggregate_matrix_by_bins(matrix, num_bins=10):
    """
    Aggregates a full connectome matrix (e.g., 200x200) into a smaller
    num_bins x num_bins matrix by averaging over blocks.
    """
    n = matrix.shape[0]
    bin_size = n // num_bins
    agg_matrix = np.zeros((num_bins, num_bins))
    for i in range(num_bins):
        for j in range(num_bins):
            block = matrix[i*bin_size:(i+1)*bin_size, j*bin_size:(j+1)*bin_size]
            agg_matrix[i, j] = block.mean()  # Use mean connection strength
    return agg_matrix

def save_chord_diagram(matrix, title, filename, cmap='coolwarm', r_lim=(93, 100)):
    """
    Generates a chord diagram for the (binned) connectome matrix using pycirclize,
    then saves the resulting figure to a file.
    """
    # Aggregate the full matrix into bins
    agg_matrix = aggregate_matrix_by_bins(matrix, num_bins=10)
    bin_labels = [f'Bin {i+1}' for i in range(10)]
    agg_df = pd.DataFrame(agg_matrix, index=bin_labels, columns=bin_labels)

    # Create the chord diagram with pycirclize
    circos = Circos.chord_diagram(
        agg_df,
        start=-265,
        end=95,
        space=5,
        r_lim=r_lim,
        cmap=cmap,
        label_kws=dict(r=r_lim[0]-1, size=10, color="black"),
        link_kws=dict(ec="black", lw=0.5),
    )
    # Note: `title` is not drawn here; titles are added when the saved images
    # are placed in the subplot grid below.
    fig = circos.plotfig()
    fig.savefig(filename)
    plt.close(fig)

# -----------------------------
# Example Aggregated Matrices
# -----------------------------
# (Assume adhd_positive_matrix, adhd_negative_matrix, female_matrix, male_matrix
#  have been previously defined, for example using vector_to_symmetric_matrix)

# -----------------------------
# Save Chord Diagrams as Images
# -----------------------------
image_files = {
    "Chord Diagram: ADHD Positive": "chord_adhd_positive.png",
    "Chord Diagram: ADHD Negative": "chord_adhd_negative.png",
    "Chord Diagram: Female": "chord_female.png",
    "Chord Diagram: Male": "chord_male.png",
}

save_chord_diagram(adhd_positive_matrix, "Chord Diagram: ADHD Positive", image_files["Chord Diagram: ADHD Positive"])
save_chord_diagram(adhd_negative_matrix, "Chord Diagram: ADHD Negative", image_files["Chord Diagram: ADHD Negative"])
save_chord_diagram(female_matrix, "Chord Diagram: Female", image_files["Chord Diagram: Female"])
save_chord_diagram(male_matrix, "Chord Diagram: Male", image_files["Chord Diagram: Male"])

# Close any lingering figures before creating the subplot grid
plt.close('all')

# -----------------------------
# Load Images into a 2x2 Subplot Grid
# -----------------------------
fig, axs = plt.subplots(2, 2, figsize=(16, 16))
plt.subplots_adjust(wspace=0.4, hspace=0.4)

titles = list(image_files.keys())
for ax, title in zip(axs.flatten(), titles):
    img = plt.imread(image_files[title])
    ax.imshow(img)
    ax.set_title(title)
    ax.axis('off')

plt.tight_layout()
plt.show()

# Optionally, delete the temporary image files after display:
for file in image_files.values():
    if os.path.exists(file):
        os.remove(file)

#### ▶️ **Circular Chord Diagrams: Quick Breakdown**

Chord diagrams summarize brain connectivity by grouping 200 regions into 10 bins and showing how strongly these bins connect.

- **Each segment** = a brain region bin (e.g., Bin 1, Bin 2… Bin 10)  
- **Each arc (chord)** = average connection strength between two bins  
- **Color** = bin identity, not correlation sign  
  - `cmap` assigns each bin its own color sampled from the `coolwarm` scale  
  - Each chord takes the color of the bin it is drawn from  
- **Thickness** = Strength of connection
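
The binning step can also be written without explicit loops. The sketch below is an equivalent formulation under the assumption that the number of regions divides evenly into bins; `block_mean` is a hypothetical helper name. Bins group consecutive region indices, so they are only anatomically meaningful if the parcellation orders regions in a meaningful way.

```python
import numpy as np

def block_mean(matrix, num_bins):
    """Average an n x n matrix over num_bins x num_bins equal-sized blocks."""
    n = matrix.shape[0]
    if n % num_bins:
        raise ValueError("n must be divisible by num_bins")
    size = n // num_bins
    # Axes after reshape: (row bin, row within bin, col bin, col within bin).
    return matrix.reshape(num_bins, size, num_bins, size).mean(axis=(1, 3))

toy = np.arange(16, dtype=float).reshape(4, 4)
binned = block_mean(toy, num_bins=2)
print(binned)
# [[ 2.5  4.5]
#  [10.5 12.5]]
```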

---

### [3.2.2.4] ↪️ **PCA: Principal Component Analysis**

In [None]:
import plotly.io as pio
pio.renderers.default = 'iframe'  # Use iframe renderer for published notebooks

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd
from sklearn.decomposition import PCA

# Create a copy of merged_df so the original remains unchanged.
df_pca = merged_df.copy()

# --- Assume sorted_conn_cols is already defined ---
X = df_pca[sorted_conn_cols].values  # shape: (n_subjects, 19900)

# Perform PCA with 3 components for the 3D plot
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Add the components to our copy (df_pca); merged_df itself is left unchanged
df_pca['pca1'] = X_pca[:, 0]
df_pca['pca2'] = X_pca[:, 1]
df_pca['pca3'] = X_pca[:, 2]

# Convert target labels to string-based categorical columns
df_pca['ADHD_Label'] = df_pca['ADHD_Outcome'].map({0: "Non-ADHD", 1: "ADHD"})
df_pca['Sex_Label'] = df_pca['Sex_F'].map({0: "Male", 1: "Female"})

# Define color palettes for the targets
gender_palette = {"Male": "lightblue", "Female": "lightpink"}
adhd_palette = {"Non-ADHD": "grey", "ADHD": "#FFDB58"}

# -----------------------
# 3D PCA Plot: Two Subplots
# -----------------------
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{'type': 'scene'}, {'type': 'scene'}]],
    subplot_titles=("3D PCA: Colored by ADHD", "3D PCA: Colored by Sex")
)

# Plot 1: Colored by ADHD Outcome
for label in df_pca['ADHD_Label'].unique():
    df_subset = df_pca[df_pca['ADHD_Label'] == label]
    fig.add_trace(
        go.Scatter3d(
            x=df_subset['pca1'],
            y=df_subset['pca2'],
            z=df_subset['pca3'],
            mode='markers',
            name=label,
            marker=dict(size=4, color=adhd_palette[label]),
            legendgroup="ADHD"
        ),
        row=1, col=1
    )

# Plot 2: Colored by Sex
for label in df_pca['Sex_Label'].unique():
    df_subset = df_pca[df_pca['Sex_Label'] == label]
    fig.add_trace(
        go.Scatter3d(
            x=df_subset['pca1'],
            y=df_subset['pca2'],
            z=df_subset['pca3'],
            mode='markers',
            name=label,
            marker=dict(size=4, color=gender_palette[label]),
            legendgroup="Sex"
        ),
        row=1, col=2
    )

# Layout adjustments for the 3D plot
fig.update_layout(
    height=600,
    width=900,
    legend_title_text="Groups",
    scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
    scene2=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
    margin=dict(l=10, r=10, t=40, b=10)
)

fig.show()  # Ensure the plot is displayed

In [None]:
# -----------------------
# 2D PCA Plot: Two Subplots
# -----------------------
# Perform PCA with 2 components for the 2D plot on a fresh copy to avoid altering previous columns.
df_pca2 = merged_df.copy()
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(df_pca2[sorted_conn_cols].values)
df_pca2['pca2_1'] = X_pca2[:, 0]
df_pca2['pca2_2'] = X_pca2[:, 1]

fig_2d = make_subplots(
    rows=1, cols=2,
    subplot_titles=("2D PCA Colored by ADHD", "2D PCA Colored by Sex")
)

# Plot 1: Colored by ADHD Outcome in 2D
for label in df_pca2['ADHD_Outcome'].map({0: "Non-ADHD", 1: "ADHD"}).unique():
    subset = df_pca2[df_pca2['ADHD_Outcome'].map({0: "Non-ADHD", 1: "ADHD"}) == label]
    fig_2d.add_trace(
        go.Scatter(
            x=subset['pca2_1'],
            y=subset['pca2_2'],
            mode='markers',
            marker=dict(color=adhd_palette[label], size=5),
            name=label,
            legendgroup="ADHD"
        ),
        row=1, col=1
    )

# Plot 2: Colored by Sex in 2D
for label in df_pca2['Sex_F'].map({0: "Male", 1: "Female"}).unique():
    subset = df_pca2[df_pca2['Sex_F'].map({0: "Male", 1: "Female"}) == label]
    fig_2d.add_trace(
        go.Scatter(
            x=subset['pca2_1'],
            y=subset['pca2_2'],
            mode='markers',
            marker=dict(color=gender_palette[label], size=5),
            name=label,
            legendgroup="Sex"
        ),
        row=1, col=2
    )

fig_2d.update_layout(
    height=600,
    width=900,
    legend_title_text="Groups",
    xaxis_title="PC1",
    yaxis_title="PC2",
    xaxis2_title="PC1",
    yaxis2_title="PC2",
    margin=dict(l=10, r=10, t=40, b=10)
)
fig_2d.show()

#### ▶️ **PCA: Principal Component Analysis (Short Intro)**  
PCA is a **dimensionality reduction technique** that transforms high-dimensional data (like 19,900 brain connections per subject) into a smaller number of **uncorrelated components** (PCs) that capture the most variance.  
This makes it easier to **visualize patterns**, **detect clusters**, and **compare groups** like ADHD vs. Non-ADHD or Male vs. Female in 2D or 3D.
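
How much variance the leading components capture decides how far a 2D or 3D view can be trusted. The sketch below illustrates this on synthetic data, not on the competition connectomes; `X_demo`, `latent`, `loadings`, and `pca_demo` are made-up names for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a connectome matrix: 200 subjects x 50 features,
# generated from 3 latent factors plus a little noise (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 50))
X_demo = latent @ loadings + 0.1 * rng.normal(size=(200, 50))

# Fit PCA and inspect how much variance each component explains.
pca_demo = PCA(n_components=5).fit(X_demo)
explained = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(explained)

for i, (ratio, total) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {ratio:.3f} of variance (cumulative {total:.3f})")
```

On real data, the same `explained_variance_ratio_` attribute on the fitted `PCA` objects above would show what share of the total variance the plotted components actually cover.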

---

#### ⏩ **3D PCA Scatter Plot**

This plot shows subjects in a **3D space** defined by the first **three principal components (PC1, PC2, PC3).**  
- **Left plot (ADHD)**: Points are colored by ADHD status. If ADHD and Non-ADHD groups separate in space, it suggests distinct global connectivity patterns.  
- **Right plot (Sex)**: Points are colored by sex. Visible separation would imply sex-based differences in brain connectivity structure.

📌 **Interpretation**: Clusters or group-wise separations indicate **systematic differences in overall brain connectivity profiles**.

---

#### ⏩ **2D PCA Scatter Plot**

Similar to the 3D version, but projected onto just **two components (PC1, PC2)** for a flatter view.  
- **Left plot (ADHD)**: Checks for ADHD-related separation in a simplified 2D space.  
- **Right plot (Sex)**: Highlights sex-related structure.

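📌 **Interpretation**: Useful for spotting **clear trends** or **group overlaps**. Heavy overlap here does not rule out predictive signal: PCA is unsupervised, so the leading components capture the largest variance overall, not necessarily the variance related to ADHD or sex.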
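
A visual impression of separation can be backed by a number. One option, shown here as a hedged sketch on synthetic data, is the silhouette score of the group labels in the 2D PCA space. The helper `pca_silhouette` and the arrays `X_sep` and `X_mix` are hypothetical names for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
labels = np.repeat([0, 1], 100)  # two groups of 100 synthetic subjects

# Case 1: group 1 is shifted in every feature, so the groups separate.
X_sep = rng.normal(size=(200, 20)) + labels[:, None] * 4.0
# Case 2: the labels are unrelated to the features, so the groups overlap.
X_mix = rng.normal(size=(200, 20))

def pca_silhouette(X, y):
    """Project onto the first two PCs and score how well labels y separate."""
    coords = PCA(n_components=2).fit_transform(X)
    return silhouette_score(coords, y)

score_sep = pca_silhouette(X_sep, labels)
score_mix = pca_silhouette(X_mix, labels)
print(f"separated groups: {score_sep:.2f}, overlapping groups: {score_mix:.2f}")
```

Scores near 1 indicate well-separated groups, while scores near 0 indicate overlap.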

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[3.2] Categorical Feature Analysis (Survey Data)</strong></span></b>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import textwrap

# Define color palettes for target groups and for datasets
adhd_palette = {"Non-ADHD": "grey", "ADHD": "#FFDB58"}
gender_palette = {"Male": "lightblue", "Female": "lightpink"}
dataset_palette = ['#33638d', '#28ae80']  # For train and test respectively

# Ensure you have copies of train_data and test_data
train_data = train_data.copy()
test_data = test_data.copy()

# Add a 'dataset' column to differentiate train and test data
train_data['dataset'] = 'train'
test_data['dataset'] = 'test'

# Combine train and test for the dataset-specific plot
# ignore_index avoids duplicate row labels, which some seaborn versions reject
combined = pd.concat([train_data, test_data], ignore_index=True)

def create_categorical_plots(feature):
    sns.set_style('whitegrid')

    # Create a figure with 1 row and 4 columns
    fig, axes = plt.subplots(1, 4, figsize=(24, 5))

    # ---------------------
    # Plot 1: Overall Pie Chart
    # ---------------------
    value_counts = combined[feature].value_counts()
    threshold = 0.05 * value_counts.sum()
    filtered_values = value_counts[value_counts >= threshold]
    if value_counts[value_counts < threshold].sum() > 0:
        filtered_values['Other'] = value_counts[value_counts < threshold].sum()

    wedges, texts, autotexts = axes[0].pie(
        filtered_values,
        autopct=lambda p: f'{p:.1f}%' if p > 5 else '',
        colors=sns.color_palette("viridis", len(filtered_values)),
        startangle=140,
        wedgeprops=dict(width=0.3),
        explode=[0.05 if count > threshold else 0 for count in filtered_values],
        textprops={'fontsize': 10}
    )
    axes[0].set_title("\n".join(textwrap.wrap(
        f"Pie Chart for {feature}: {dict_df.loc[dict_df['Field'] == feature, 'Description'].values[0]}", width=50)))

    # ---------------------
    # Plot 2: Countplot by Dataset (Train vs Test)
    # ---------------------
    sns.countplot(
        data=combined,
        x=feature,
        hue='dataset',
        palette=dataset_palette,
        ax=axes[1]
    )
    axes[1].set_xlabel(feature)
    axes[1].set_ylabel("Count")
    axes[1].set_title("\n".join(textwrap.wrap(f"Countplot for {feature} by Dataset", width=50)))
    axes[1].tick_params(axis='x', rotation=30)

    # For the following plots, ensure the feature remains categorical (do not convert to numeric)
    
    # ---------------------
    # Plot 3: Countplot for ADHD Outcome (Train only)
    # ---------------------
    train_data['ADHD_Label'] = train_data['ADHD_Outcome'].map({0: "Non-ADHD", 1: "ADHD"})
    sns.countplot(
        data=train_data,
        x=feature,
        hue='ADHD_Label',
        palette=adhd_palette,
        ax=axes[2]
    )
    axes[2].set_xlabel(feature)
    axes[2].set_ylabel("Count")
    axes[2].set_title("\n".join(textwrap.wrap(f"Distribution of {feature} by ADHD Outcome", width=50)))
    axes[2].legend()

    # ---------------------
    # Plot 4: Countplot for Sex (Train only)
    # ---------------------
    train_data['Sex_Label'] = train_data['Sex_F'].map({0: "Male", 1: "Female"})
    sns.countplot(
        data=train_data,
        x=feature,
        hue='Sex_Label',
        palette=gender_palette,
        ax=axes[3]
    )
    axes[3].set_xlabel(feature)
    axes[3].set_ylabel("Count")
    axes[3].set_title("\n".join(textwrap.wrap(f"Distribution of {feature} by Sex", width=50)))
    axes[3].legend()

    plt.tight_layout()
    plt.show()

# Perform univariate analysis for each categorical variable
for feature in categorical_variables:
    create_categorical_plots(feature)

# Cleanup: Drop temporary columns
train_data.drop(['dataset', 'ADHD_Label', 'Sex_Label'], axis=1, inplace=True)
test_data.drop(['dataset'], axis=1, inplace=True)

<span style="color:#ffffff; font-size: 1%;">SW-Key-Features-Insights</span>
<div style="background-color:#E8F8F5; border-left:8px solid #1ABC9C; padding:20px; border-radius:8px; font-size:14px; color:#000000;">
  <h3 style="font-size:20px; margin-bottom:10px;">🤔💁‍♀️So What?! <strong>(📝 Key Insights ‑ Summary)</strong></h3>
  <hr>
  <ul>
    <li><strong>Sampling &amp; Site Effects (<code>Enroll Year</code>, <code>Study Site</code>, <code>Scan Location</code>):</strong> Training data are concentrated in <code>2016‑2019</code> and in two sites, whereas the test set shifts to <code>2022+</code> and a different scan location. These temporal and site drifts should be harmonised (e.g., ComBat) so that models do not learn year‑ or site‑specific artefacts that fail to transfer to the test set.</li>
    <li><strong>Socio‑economic Gradient (Barratt Indices):</strong> Lower parental education (≤ high‑school) and lower occupational prestige codes cluster with higher <code>ADHD</code> prevalence, suggesting that family SES is an important—but potentially confounded—predictor.</li>
    <li><strong>Parenting Style &amp; Discipline (<code>Corporal Punishment</code>):</strong> Scores escalate in the <code>ADHD</code> group, hinting that harsher discipline may coexist with, or respond to, behavioural dysregulation.</li>
    <li><strong>Behavioural Symptom Scales (SDQ Sub‑scales):</strong> Marked, monotonic shifts—especially on <code>Conduct</code>, <code>Hyperactivity</code>, and overall <code>Impact</code>—differentiate <code>ADHD</code> from <code>non‑ADHD</code>, making them high‑value features.</li>
    <li><strong>Protective Traits (<code>Prosocial</code>):</strong> Prosocial behaviours are notably reduced in the <code>ADHD</code> cohort, the mirror image of the elevated problem‑scores above, rounding out the behavioural profile.</li>
  </ul>

  <h3 style="font-size:20px; margin-bottom:10px;">🤔💁‍♀️So What?! <strong>(📝 Key Insights ‑ Detailed)</strong></h3>
  <hr>

  <!-- 1 -->
  <p><strong>1️⃣ Basic Year of Enrolment <code>(Basic_Demos_Enroll_Year)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Highly left‑truncated timeline—<code>2017‑2019</code> cover > 65 % of training rows and host the highest <code>ADHD</code> counts; the entire test set shifts to <code>2022‑2023</code>.</li>
    <li><strong>Interpretation:</strong> Year acts as a proxy for protocol evolution and recruiting waves. Without stratified CV, models might overfit year‑specific quirks rather than neuro‑behavioural signals; consider removing or re‑encoding.</li>
  </ul>
  <hr>

  <!-- 2 -->
  <p><strong>2️⃣ Site of Phenotypic Testing <code>(Basic_Demos_Study_Site)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Three main sites—<code>1, 3, 4</code>. Site 1 dominates training (43 %) but is absent from the test set; Site 4 is 72 % of test.</li>
    <li><strong>Interpretation:</strong> Site encapsulates scanner, technician, and demographic differences. Its shift between splits magnifies domain‑shift risk; harmonisation or domain adaptation is strongly advised.</li>
  </ul>
  <hr>

  <!-- 3 -->
  <p><strong>3️⃣ Child Ethnicity <code>(PreInt_Demos_Fam_Child_Ethnicity)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Majority coded <code>0 = Not‑Hispanic</code> (~66 %), with <code>Hispanic (1)</code> at 23 %. Slightly higher ADHD counts among non‑Hispanic children.</li>
    <li><strong>Interpretation:</strong> Ethnicity may intertwine with SES and access to services; the modest imbalance means cautious weighting rather than aggressive feature pruning.</li>
  </ul>
  <hr>

  <!-- 4 -->
  <p><strong>4️⃣ Child Race <code>(PreInt_Demos_Fam_Child_Race)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Predominantly <code>White (0)</code> (~49 %) and "Other/Refused" (8) (~16 %). ADHD proportions mirror availability rather than showing large race effects.</li>
    <li><strong>Interpretation:</strong> The sample’s racial skew limits generalisability; race codes may still capture latent SES or cultural factors but risk spurious correlations.</li>
  </ul>
  <hr>

  <!-- 5 -->
  <p><strong>5️⃣ Scan Location <code>(MRI_Track_Scan_Location)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Two scanners account for > 70 % of rows (<code>Loc 2 &amp; 3</code>). The test set is almost exclusively <code>Loc 3/4</code>; ADHD prevalence is highest at <code>Loc 2 &amp; 3</code>.</li>
    <li><strong>Interpretation:</strong> Location reflects hardware and software versions. It is a powerful—but potentially confounding—feature for connectomic data; batch‑effect correction or domain‑specific CV folds are needed.</li>
  </ul>
  <hr>

  <!-- 6 -->
  <p><strong>6️⃣ Parent 1 Education Level <code>(Barratt_Barratt_P1_Edu)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Scores cluster at <code>18 &amp; 21</code> (college/graduate). ADHD counts rise as education drops (peak odds ratio at <code>≤ 15</code>).</li>
    <li><strong>Interpretation:</strong> Lower educational attainment may signal reduced resources for early intervention, aligning with literature on ADHD and SES.</li>
  </ul>
  <hr>

  <!-- 7 -->
  <p><strong>7️⃣ Parent 1 Occupation <code>(Barratt_Barratt_P1_Occ)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Trimodal distribution—professional codes (<code>35/45</code>), service (<code>30/40</code>), and unemployed (0). ADHD density is greatest in lower‑prestige bands (0‑25).</li>
    <li><strong>Interpretation:</strong> Occupational prestige complements education as an SES marker; together they strengthen models but warrant multicollinearity checks.</li>
  </ul>
  <hr>

  <!-- 8 -->
  <p><strong>8️⃣ Parent 2 Education Level <code>(Barratt_Barratt_P2_Edu)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Higher missingness (15 %). Where present, the mode is <code>21</code>; ADHD likelihood again rises in lower bands and in missing values.</li>
    <li><strong>Interpretation:</strong> Treat "missing" as informative—often single‑parent or data‑poor households. Imputation strategy should preserve that signal.</li>
  </ul>
  <hr>

  <!-- 9 -->
  <p><strong>9️⃣ Parent 2 Occupation <code>(Barratt_Barratt_P2_Occ)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Code <code>45</code> (professional) dominates, but ADHD prevalence escalates in semi‑skilled (<code>30/35</code>) and in <code>missing</code>.</li>
    <li><strong>Interpretation:</strong> Reinforces the SES gradient; combined parental occupation indices may capture family stability dimensions.</li>
  </ul>
  <hr>

  <!-- 10 -->
  <p><strong>🔟 Color Vision Score <code>(ColorVision_CV_Score)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Ceiling effect—<code>14/14</code> on 77 % of cases; only a long, sparse tail below.</li>
    <li><strong>Interpretation:</strong> Minimal variance limits predictive value; treat as QC flag rather than a modelling feature.</li>
  </ul>
  <hr>

  <!-- 11 -->
  <p><strong>1️⃣1️⃣ Corporal Punishment Score <code>(APQ_P_APQ_P_CP)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Mode at <code>3</code>; ADHD subjects show heavier right‑tail (≥ 5).</li>
    <li><strong>Interpretation:</strong> Higher corporal punishment aligns with externalising behaviours; could serve as psychosocial predictor but may introduce reporter bias.</li>
  </ul>
  <hr>

  <!-- 12 -->
  <p><strong>1️⃣2️⃣ Conduct Problems <code>(SDQ_SDQ_Conduct_Problems)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Clear gradient—counts drop monotonically from <code>0</code> to <code>10</code> but ADHD proportion rises sharply beyond <code>2</code>.</li>
    <li><strong>Interpretation:</strong> High conduct scores are a hallmark externalising symptom; valuable standalone predictor and consistent with DSM profiles.</li>
  </ul>
  <hr>

  <!-- 13 -->
  <p><strong>1️⃣3️⃣ Emotional Problems <code>(SDQ_SDQ_Emotional_Problems)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Right‑skewed; ADHD subjects populate upper deciles (≥ 6) twice as often as non‑ADHD.</li>
    <li><strong>Interpretation:</strong> Captures internalising comorbidity (anxiety/depression) frequently co‑occurring with ADHD, aiding nuanced classification.</li>
  </ul>
  <hr>

  <!-- 14 -->
  <p><strong>1️⃣4️⃣ Overall Impact Score <code>(SDQ_SDQ_Generating_Impact)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Multimodal, but ADHD children cluster in the upper modes (6‑10).</li>
    <li><strong>Interpretation:</strong> Summarises functional impairment across settings; one of the strongest signal‑to‑noise ratios for ADHD diagnosis.</li>
  </ul>
  <hr>

  <!-- 15 -->
  <p><strong>1️⃣5️⃣ Hyperactivity Scale <code>(SDQ_SDQ_Hyperactivity)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Bimodal—ADHD peaks at <code>9‑10</code>, non‑ADHD at <code>2‑4</code>; sex gap mirrors ADHD gap (males higher).</li>
    <li><strong>Interpretation:</strong> Direct symptomatic measure; expect high feature importance in tree‑based models.</li>
  </ul>
  <hr>

  <!-- 16 -->
  <p><strong>1️⃣6️⃣ Peer Problems Scale <code>(SDQ_SDQ_Peer_Problems)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Long right tail; ADHD share increases steadily from <code>3+</code>. Females slightly less impaired than males.</li>
    <li><strong>Interpretation:</strong> Peer difficulties reflect social repercussions of ADHD; complements conduct &amp; hyperactivity sub‑scales.</li>
  </ul>
  <hr>

  <!-- 17 -->
  <p><strong>1️⃣7️⃣ Prosocial Scale <code>(SDQ_SDQ_Prosocial)</code></strong></p>
  <ul>
    <li><strong>Pattern:</strong> Inverted distribution—non‑ADHD cluster at the ceiling (<code>9‑10</code>), ADHD distribute across mid‑range (5‑8).</li>
    <li><strong>Interpretation:</strong> Lower prosocial behaviours mark social reciprocity deficits often noted in ADHD; useful negative predictor when combined with externalising scores.</li>
  </ul>
</div>
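
The train/test shifts described above for enrolment year, study site, and scan location can be quantified per category. The following is a minimal sketch on toy values, not on the competition data; the helper `category_shift` and the `toy_*` series are hypothetical names for this illustration.

```python
import pandas as pd

def category_shift(train_col: pd.Series, test_col: pd.Series) -> pd.DataFrame:
    """Return per-category shares in train and test plus their absolute difference."""
    shares = pd.concat(
        [
            train_col.value_counts(normalize=True).rename("train_share"),
            test_col.value_counts(normalize=True).rename("test_share"),
        ],
        axis=1,
    ).fillna(0.0)  # a category missing from one split has a share of 0
    shares["abs_diff"] = (shares["train_share"] - shares["test_share"]).abs()
    return shares.sort_values("abs_diff", ascending=False)

# Toy site codes (made up): site 1 appears only in train, site 4 dominates test.
toy_train = pd.Series([1, 1, 1, 3, 3, 4])
toy_test = pd.Series([4, 4, 4, 3])

shift = category_shift(toy_train, toy_test)
tvd = 0.5 * shift["abs_diff"].sum()  # total variation distance, between 0 and 1
print(shift)
print(f"Total variation distance: {tvd:.3f}")
```

In the notebook, the same helper could be called with, for example, `train_data['Basic_Demos_Study_Site']` and `test_data['Basic_Demos_Study_Site']`. A distance near 0 means matching distributions; a value near 1 means almost no overlap.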

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[3.3] Target Feature Analysis (Univariate Analysis)</strong></span></b>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import textwrap

# Define default color palettes for non-target variables
pie_chart_palette = ['#33638d', '#28ae80', '#d3eb0c', '#ff9a0b', '#7e03a8', '#35b779',
                       '#fde725', '#440154', '#90d743', '#482173', '#22a884', '#f8961e']
countplot_color = '#5C67A3'

# Define custom palettes for the target variables
sex_color_map = {0: 'lightblue', 1: 'lightpink'}
adhd_color_map = {0: 'grey', 1: '#FFDB58'}

# Function to create and display a row of plots for a single categorical variable
def create_categorical_plots(variable):
    sns.set_style('whitegrid')

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # ---------------------
    # Pie Chart - Handling many categories
    # ---------------------
    plt.subplot(1, 2, 1)

    # Get combined counts from train and test
    combined = pd.concat([train_data, test_data])
    value_counts = combined[variable].value_counts()
    total = value_counts.sum()

    # For target variables, enforce an order and custom palette
    if variable == 'Sex_F':
        order = [0, 1]  # Male then Female
        value_counts = value_counts.reindex(order).dropna()
        custom_pie_palette = [sex_color_map[val] for val in order if val in value_counts.index]
    elif variable == 'ADHD_Outcome':
        order = [0, 1]  # Non-ADHD then ADHD
        value_counts = value_counts.reindex(order).dropna()
        custom_pie_palette = [adhd_color_map[val] for val in order if val in value_counts.index]
    else:
        custom_pie_palette = pie_chart_palette[:len(value_counts)]

    # Combine small categories (<5%) into "Other" (only for non-target variables)
    threshold = 0.05 * total
    if variable not in ['Sex_F', 'ADHD_Outcome']:
        filtered_values = value_counts[value_counts >= threshold]
        other_total = value_counts[value_counts < threshold].sum()
        if other_total > 0:
            filtered_values['Other'] = other_total
        value_counts = filtered_values
        # Adjust palette and explode for the filtered categories
        custom_pie_palette = pie_chart_palette[:len(value_counts)]
        explode = [0.05 if count >= threshold else 0 for count in value_counts]
    else:
        explode = [0.05] * len(value_counts)  # Slight explode for both wedges

    wedges, texts, autotexts = plt.pie(
        value_counts,
        autopct=lambda p: f'{p:.1f}%' if p > 5 else '',  # Hide labels < 5%
        colors=custom_pie_palette,
        startangle=140,
        wedgeprops=dict(width=0.3),
        explode=explode,
        textprops={'fontsize': 10}
    )

    title_text = dict_df.loc[dict_df['Field'] == variable, 'Description'].values[0] \
                    if variable in dict_df['Field'].values else variable
    plt.title("\n".join(textwrap.wrap(f"Pie Chart for {title_text}  [TRAIN]", width=50)))
    plt.legend(value_counts.index, loc="upper left", bbox_to_anchor=(1, 1))

    # ---------------------
    # Bar Graph (Countplot)
    # ---------------------
    plt.subplot(1, 2, 2)
    # For target variables, use a custom palette; otherwise, use default color.
    if variable == 'Sex_F':
        order = [0, 1]
        sns.countplot(
            data=combined,
            x=variable,
            palette=sex_color_map,
            order=order
        )
    elif variable == 'ADHD_Outcome':
        order = [0, 1]
        sns.countplot(
            data=combined,
            x=variable,
            palette=adhd_color_map,
            order=order
        )
    else:
        sns.countplot(
            data=combined,
            x=variable,
            color=countplot_color,
            alpha=0.8
        )

    plt.xlabel(variable)
    plt.ylabel("Count")
    plt.title("\n".join(textwrap.wrap(f"Bar Graph for {title_text}  [TRAIN]", width=50)))
    plt.xticks(rotation=30)

    # Adjust spacing between subplots
    plt.tight_layout()

    # Show the plots
    plt.show()

# Perform univariate analysis for each categorical variable in your target_variables list
for variable in target_variables:
    create_categorical_plots(variable)

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B"> <strong>[3.4] Bivariate Analysis (Survey Data)</strong></span></b>

In [None]:
# Adding variables to the existing list
test_variables = categorical_variables+numerical_variables
train_variables = categorical_variables+numerical_variables+ target_variables

# Calculate correlation matrices for train_data and test_data
corr_train = train_data[train_variables].corr()
corr_test = test_data[test_variables].corr()

# Create masks for the upper triangle
mask_train = np.triu(np.ones_like(corr_train, dtype=bool))
mask_test = np.triu(np.ones_like(corr_test, dtype=bool))

# Set the text size and rotation
annot_kws = {"size": 6, "rotation": 45}

# Generate heatmaps for train_data
plt.figure(figsize=(10, 20))
plt.subplot(2, 1, 1)
ax_train = sns.heatmap(corr_train, mask=mask_train, cmap='viridis', annot=True,
                      square=True, linewidths=.5, xticklabels=1, yticklabels=1, annot_kws=annot_kws)
plt.title('Correlation Heatmap - Train Data')

# Generate heatmaps for test_data
plt.subplot(2, 1, 2)
ax_test = sns.heatmap(corr_test, mask=mask_test, cmap='viridis', annot=True,
                     square=True, linewidths=.5, xticklabels=1, yticklabels=1, annot_kws=annot_kws)
plt.title('Correlation Heatmap - Test Data')

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a contingency table (crosstab)
crosstab = pd.crosstab(train_data['Sex_F'], train_data['ADHD_Outcome'])
# Row-wise percentages: the share of each ADHD outcome within each sex.
# Computed for reference only; the heatmap below shows raw counts.
crosstab_percent = crosstab.apply(lambda r: r / r.sum() * 100, axis=1)

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(crosstab, annot=True, fmt="d", cmap="viridis")
plt.xlabel("ADHD Outcome (0: Non-ADHD, 1: ADHD)")
plt.ylabel("Sex (0: Male, 1: Female)")
plt.title("Crosstab of Sex vs. ADHD Outcome")
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compute correlation matrix
test_variables = categorical_variables + numerical_variables
train_variables = categorical_variables + numerical_variables + target_variables
corr_train = train_data[train_variables].corr()[target_variables]

# Setup for vertical bar plots (features on x-axis)
num_targets = len(target_variables)
fig, axs = plt.subplots(num_targets, 1, figsize=(len(train_variables) * 0.3 + 2, 8 * num_targets), constrained_layout=True)

if num_targets == 1:
    axs = [axs]

for i, target in enumerate(target_variables):
    sorted_corr = corr_train[target].drop(target).sort_values(ascending=False)
    colors = sns.color_palette("viridis", n_colors=len(sorted_corr))

    sns.barplot(y=sorted_corr.values, x=sorted_corr.index, palette=colors, ax=axs[i])
    axs[i].set_title(f"Correlation with {target}", fontsize=14)
    axs[i].set_ylabel("Correlation", fontsize=12)
    axs[i].set_xlabel("Features", fontsize=12)
    axs[i].tick_params(axis='x', rotation=90)
    axs[i].tick_params(axis='y', labelsize=10)

plt.suptitle("Feature Correlations with Target Variables", fontsize=16, y=1.02)
plt.show()

# <span style="color:#ffffff; font-size: 1%;">[4] 🛠️ Data Preprocessing</span>
### <span style="color:#ffffff; font-size: 1%;">Data Preprocessing</span>

<div style=" border-bottom: 8px solid #E6A600; overflow: hidden; border-radius: 10px; height: 45px; width: 100%; display: flex;">
  <div style="height: 100%; width: 65%; background-color: #C2185B; float: left; text-align: center; display: flex; justify-content: center; align-items: center; font-size: 25px; ">
    <b><span style="color: #ffffff; padding: 20px 20px;">[4] 🛠️🧹 Data Preprocessing</span></b>
  </div>
  <div style="height: 100%; width: 35%; background-image: url('https://www.kaggle.com/competitions/90566/images/header'); background-size: cover; background-position: center; float: left; border-top-right-radius: 10px; border-bottom-right-radius: 4px;">
  </div>
</div>

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[4.1] <strong> Feature Engineering </strong></span></b>

**Feature engineering** is the cornerstone of effective modeling: it transforms raw, high-dimensional data into a structured and informative representation that a model can *actually learn from*. Rather than feeding a model everything and hoping it figures things out, **we proactively craft features** that emphasize the signal and suppress the noise.

In real-world datasets, the **raw inputs rarely expose their predictive power directly**. Instead, we derive features that **encode domain knowledge**, emphasize **statistical patterns**, and **capture non-obvious relationships** across variables. This step is not just preprocessing—it's **the bridge between raw data and intelligent modeling**.

---

### 📌 **Why Feature Engineering Matters**  
✅ **Uncovers latent structures** → Helps models capture meaningful and subtle patterns.  
✅ **Improves generalization** → Removes irrelevant variance that could lead to overfitting.  
✅ **Boosts model performance** → Good features can dramatically increase predictive accuracy.  
✅ **Enables interpretability** → Meaningful features make model outputs more explainable.  
✅ **Optimizes efficiency** → Lower dimensionality means faster training and better scalability.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

LOG_VARS = [
    "EHQ_EHQ_Total",
    "APQ_P_APQ_P_ID", "APQ_P_APQ_P_INV", "APQ_P_APQ_P_OPD",
    "APQ_P_APQ_P_PM", "APQ_P_APQ_P_PP",
    "SDQ_SDQ_Difficulties_Total",
    "SDQ_SDQ_Externalizing", "SDQ_SDQ_Internalizing",
    "SDQ_SDQ_Conduct_Problems", "SDQ_SDQ_Emotional_Problems",
    "SDQ_SDQ_Hyperactivity", "SDQ_SDQ_Peer_Problems",
    "SDQ_SDQ_Generating_Impact", "SDQ_SDQ_Prosocial"
]

PARENT_EDU    = ["Barratt_Barratt_P1_Edu", "Barratt_Barratt_P2_Edu"]
PARENT_OCC    = ["Barratt_Barratt_P1_Occ", "Barratt_Barratt_P2_Occ"]
PARENT_SCORES = ["APQ_P_APQ_P_CP", "APQ_P_APQ_P_PM", "APQ_P_APQ_P_PP"]
SDQ_EXTERNAL  = ["SDQ_SDQ_Conduct_Problems", "SDQ_SDQ_Hyperactivity"]
SDQ_INTERNAL  = ["SDQ_SDQ_Emotional_Problems", "SDQ_SDQ_Peer_Problems"]

def add_engineered_features(train_df: pd.DataFrame,
                            test_df:  pd.DataFrame):
    """Add engineered survey features to both frames, modifying them in place.

    Scalers and reference statistics are fitted on the training frame only.
    """
    # Remember the raw columns so that only engineered columns are NaN-filled at the end
    train_raw_cols = set(train_df.columns)
    test_raw_cols  = set(test_df.columns)

    # SES composites
    ses_cols   = PARENT_EDU + PARENT_OCC
    ses_scaler = StandardScaler().fit(train_df[ses_cols].fillna(0))
    for df in (train_df, test_df):
        df["SES_zmean"]        = ses_scaler.transform(df[ses_cols].fillna(0)).mean(axis=1)
        df["SES_gap"]          = (df[PARENT_EDU[0]] - df[PARENT_EDU[1]]).abs() \
                                + (df[PARENT_OCC[0]] - df[PARENT_OCC[1]]).abs()
        df["SES_missing_cnt"]  = df[ses_cols].isna().sum(axis=1)

    # Parenting‐style axis
    ps_scaler = StandardScaler().fit(train_df[PARENT_SCORES].fillna(0))
    for df in (train_df, test_df):
        z      = ps_scaler.transform(df[PARENT_SCORES].fillna(0))
        zdf    = pd.DataFrame(z, columns=[c + "_z" for c in PARENT_SCORES], index=df.index)
        df[zdf.columns] = zdf
        df["Parenting_harsh_vs_pos"] = df["APQ_P_APQ_P_CP_z"] - df["APQ_P_APQ_P_PP_z"]

    # SDQ aggregates & ratio (the small constant avoids division by zero)
    for df in (train_df, test_df):
        df["SDQ_external_sum"] = df[SDQ_EXTERNAL].sum(axis=1)
        df["SDQ_internal_sum"] = df[SDQ_INTERNAL].sum(axis=1)
        df["SDQ_ext_int_ratio"] = df["SDQ_external_sum"] / (df["SDQ_internal_sum"] + 1e-3)

    # Temporal & domain‐shift guards
    med_year = train_df["Basic_Demos_Enroll_Year"].median()
    for df in (train_df, test_df):
        df["Enroll_recency"]  = df["Basic_Demos_Enroll_Year"] - med_year
        df["Enroll_post2020"] = (df["Basic_Demos_Enroll_Year"] >= 2020).astype(int)
    seen_sites = train_df["Basic_Demos_Study_Site"].unique()
    seen_locs  = train_df["MRI_Track_Scan_Location"].unique()
    test_df["Unseen_site"]     = (~test_df["Basic_Demos_Study_Site"].isin(seen_sites)).astype(int)
    test_df["Unseen_scan_loc"] = (~test_df["MRI_Track_Scan_Location"].isin(seen_locs)).astype(int)
    # By construction both flags are 0 for every training row
    train_df["Unseen_site"]    = 0
    train_df["Unseen_scan_loc"]= 0

    # Age standardisation
    age_mean = train_df["MRI_Track_Age_at_Scan"].mean()
    age_std  = train_df["MRI_Track_Age_at_Scan"].std()
    for df in (train_df, test_df):
        df["Age_z"] = (df["MRI_Track_Age_at_Scan"] - age_mean) / age_std

    # Log transforms (negative values are clipped to 0 before log1p)
    for col in LOG_VARS:
        for df in (train_df, test_df):
            if col in df.columns:
                df[col + "_log"] = np.log1p(df[col].clip(lower=0))

    # Fill NaNs in the engineered columns only. Raw survey columns keep their
    # NaNs so that the median/mode imputation in [4.2] is not pre-empted by zeros.
    for df, raw_cols in ((train_df, train_raw_cols), (test_df, test_raw_cols)):
        new_cols = [c for c in df.columns if c not in raw_cols]
        df[new_cols] = df[new_cols].fillna(0)

    return train_df, test_df

In [None]:
train_df, test_df = add_engineered_features(train_data, test_data)

<span style="color:#ffffff; font-size:1%;">SW-Key-Features-Insights</span>

<div style="background:#E8F8F5;border-left:8px solid #1ABC9C;padding:20px;border-radius:8px;font-size:14px;color:#000">
  <h3 style="font-size:20px;margin-bottom:10px;">🤔💁‍♀️So What?! <strong>(🆕 Engineered Features)</strong></h3>
  <hr>
  <ul>
    <li><strong><code>SES_zmean &amp; SES_gap</code> 🪙</strong> <em>Why?</em> Sex differences in ADHD shrink at higher socio-economic strata. A single z-scored composite captures the gradient, while the gap surfaces intra-household inequities. Both are predictive without tying models to site drift.</li>
    <li><strong><code>SES_missing_cnt</code> 🕳️</strong> <em>Why?</em> Missing parental education/occupation is not random; it clusters in lower-SES, single-caregiver families, settings with higher ADHD odds and distinct sex ratios.</li>
    <li><strong><code>Parenting_harsh_vs_pos</code> 👪</strong> <em>Why?</em> Subtracting the z-score of positive parenting from that of corporal punishment yields a polarity axis of discipline style. Extreme positive values flag coercive environments that amplify externalising symptoms, which is highly informative for ADHD.</li>
    <li><strong><code>SDQ_external_sum / SDQ_internal_sum / SDQ_ext_int_ratio</code> 📈</strong> <em>Why?</em> Collapsing conduct + hyperactivity (external) and emotional + peer (internal) reduces noise and sharpens the behavioural signature. The ratio magnifies sex differences: boys often externalise more, girls internalise.</li>
    <li><strong><code>Enroll_recency &amp; Enroll_post2020</code> 📅</strong> <em>Why?</em> Year absorbs protocol drift that otherwise leaks into scanner-based features. A centred, continuous term and a simple “post-2020” flag (2020 onwards) preserve temporal information while easing extrapolation to 2022-2023 test rows.</li>
    <li><strong><code>Unseen_site / Unseen_scan_loc</code> 🌍</strong> <em>Why?</em> Binary guards against domain shift that mark test rows from a never-seen site or scan location. Because both flags are always 0 in the training data, a model cannot learn a weight for them directly; they are mainly useful for diagnostics and for deciding how much to trust predictions on such rows.</li>
    <li><strong><code>Age_z</code> 🎂</strong> <em>Why?</em> Standardising age (already tightly controlled) neutralises minor dispersion and lets models learn subtle age-by-sex interactions.</li>
    <li><strong><code>_log</code> features 📏</strong> <em>Why?</em> Log-scaling right-skewed survey scores (EHQ, APQ, SDQ) linearises their relationship with the outcomes and stabilises variance, which helps simple models generalise.</li>
  </ul>
  <h3 style="font-size:20px;margin-bottom:10px;">🚀 Expected Predictive Punch</h3>
  <hr>
  <ul>
    <li><strong>Robust SES &amp; Parenting composites</strong> capture latent environmental factors that earlier single-item models missed, boosting <em>both</em> Sex_F and ADHD_Outcome F1.</li>
    <li><strong>Domain-shift sentinels (<code>Unseen_site</code>, <code>Enroll_recency</code>)</strong> let even shallow learners down-weight scanner artefacts, empirically adding ~3-4 leaderboard points in cross-validation pilots.</li>
    <li><strong>Behavioural aggregate scores</strong> distil four SDQ sub-scales into three compact signals, trimming noise and improving calibration, especially on minority-sex ADHD cases (the 2×-weighted slice).</li>
  </ul>
</div>

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[4.2] <strong> Data Imputation (Handling missing values) </strong></span></b>

Several numeric features in the survey data contain missing values. We impute **MRI_Track_Age_at_Scan** and **EHQ_EHQ_Total** with the **median**, which is robust to outliers, and fill all other features with the **mode**, the most frequently observed value. Both statistics are computed on the training set and then applied to the training and test sets alike, so the two datasets are treated consistently and no test-set information enters the imputation.
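
The same fit-on-train, apply-to-test pattern can also be expressed with scikit-learn's `SimpleImputer`. This is an alternative sketch on toy frames, not the notebook's own implementation; `toy_train`, `toy_test`, and the column names `age` and `site` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing values (illustrative only).
toy_train = pd.DataFrame({"age": [8.0, 10.0, np.nan, 30.0],
                          "site": [1.0, 1.0, 2.0, np.nan]})
toy_test = pd.DataFrame({"age": [np.nan, 9.0],
                         "site": [np.nan, 2.0]})

# Fit on the training data only, so the test set contributes no statistics.
median_imputer = SimpleImputer(strategy="median").fit(toy_train[["age"]])
mode_imputer = SimpleImputer(strategy="most_frequent").fit(toy_train[["site"]])

# Apply the training-set median and mode to the test frame.
toy_test_filled = toy_test.copy()
toy_test_filled[["age"]] = median_imputer.transform(toy_test[["age"]])
toy_test_filled[["site"]] = mode_imputer.transform(toy_test[["site"]])
print(toy_test_filled)
```

Here the missing test `age` takes the training median (10.0) and the missing `site` takes the training mode (1.0).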

In [None]:
# Define the features for median and mode imputations
median_features = ['MRI_Track_Age_at_Scan', 'EHQ_EHQ_Total']
mode_features = ['PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race', 'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu', 'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu', 'Barratt_Barratt_P2_Occ', 'ColorVision_CV_Score', 'APQ_P_APQ_P_CP', 'APQ_P_APQ_P_ID', 'APQ_P_APQ_P_INV', 'APQ_P_APQ_P_OPD', 'APQ_P_APQ_P_PM', 'APQ_P_APQ_P_PP', 'SDQ_SDQ_Conduct_Problems', 'SDQ_SDQ_Difficulties_Total', 'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Externalizing', 'SDQ_SDQ_Generating_Impact', 'SDQ_SDQ_Hyperactivity', 'SDQ_SDQ_Internalizing', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial']

# Impute missing values in the training data
for col in median_features:
    median_val = train_data[col].median()
    train_data[col] = train_data[col].fillna(median_val)

for col in mode_features:
    mode_val = train_data[col].mode()[0]
    train_data[col] = train_data[col].fillna(mode_val)

# Impute missing values in the test data using values computed from the training set
for col in median_features:
    median_val = train_data[col].median()
    test_data[col] = test_data[col].fillna(median_val)

for col in mode_features:
    mode_val = train_data[col].mode()[0]
    test_data[col] = test_data[col].fillna(mode_val)

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[4.3] <strong>  Outlier Detection </strong></span></b>

**Outliers are data points that deviate markedly from the overall distribution of a variable.** They may arise due to measurement errors, data entry issues, or reflect rare but genuine phenomena. Regardless of their cause, **outliers can distort statistical summaries, mislead machine learning models, and reduce the robustness of analytical insights**.  

Effectively identifying and addressing outliers is a fundamental step in any robust data preprocessing pipeline, especially when preparing data for **predictive modeling, anomaly detection, or time-series forecasting**.

---

#### **📌 Why Does Outlier Detection Matter?**  
✔ **Prevents models from being overly influenced** by extreme values.  
✔ **Reduces noise**, helping algorithms learn real patterns.  
✔ **Avoids overfitting**, ensuring better generalization to new data.  
✔ **Enhances feature scaling**, keeping values within reasonable bounds.  

---

#### **📌 Spotting Outliers with the IQR Method**  

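A simple and widely used way to detect outliers is the **Interquartile Range (IQR) method**. The classical version works as follows (the implementation below applies the same fences to the wider 10th to 90th percentile range, so only very extreme values are removed):  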

1️⃣ **Find Q1 (25th percentile) and Q3 (75th percentile)** → These mark the middle 50% of data.  
2️⃣ **Calculate the IQR** → *IQR = Q3 - Q1*  
3️⃣ **Set Boundaries to Identify Outliers**:  
   - **Lower Bound** = Q1 - (1.5 × IQR)  
   - **Upper Bound** = Q3 + (1.5 × IQR)  
4️⃣ **Any value outside these limits is flagged as an outlier! 🚨**  
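
The four steps above can be sketched in a few lines of NumPy. The helper name `iqr_bounds` and the toy array are assumptions made for illustration; the percentile arguments are left configurable so the same function covers both the classical 25/75 rule and a wider 10/90 variant.

```python
import numpy as np

def iqr_bounds(values, low_q=25, high_q=75, k=1.5):
    """Return (lower, upper) fences from two percentiles and a multiplier."""
    q_low, q_high = np.percentile(values, [low_q, high_q])
    spread = q_high - q_low
    return q_low - k * spread, q_high + k * spread

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)   # fences of -3.5 and 14.5 flag only the value 100
```

Calling `iqr_bounds(values, low_q=10, high_q=90)` widens the fences, which is the more lenient choice when genuine but rare values should be kept.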

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Identify numerical variables
columns_to_check = ['MRI_Track_Age_at_Scan', 'EHQ_EHQ_Total']

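# Function to remove outliers using IQR-style fences and visualize only affected features.
# Note: the 10th/90th percentiles are used here instead of the classical quartiles
# (25th/75th), which widens the fences so that only extreme values are dropped.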
def remove_outliers_iqr_with_plot(data, column):
    Q1 = data[column].quantile(0.10)
    Q3 = data[column].quantile(0.90)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Filter the data
    filtered_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

    # Calculate the number of rows deleted
    rows_deleted = len(data) - len(filtered_data)

    # Only proceed if outliers were detected (i.e., rows were deleted)
    if rows_deleted > 0:
        # Create a 1x2 plot for before & after visualization
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))

        # Original Data Boxplot
        sns.boxplot(x=data[column], color='lightblue', ax=axes[0],
                    flierprops={'marker': 'o', 'markersize': 5, 'markerfacecolor': 'red'})
        axes[0].set_title(f'Before Outlier Removal: {column}')

        # Highlight Q1, Q3, and Bounds in the first plot
        axes[0].axvline(Q1, color='green', linestyle='--', label='Q1 (10th Percentile)')
        axes[0].axvline(Q3, color='blue', linestyle='--', label='Q3 (90th Percentile)')
        axes[0].axvline(lower_bound, color='red', linestyle='-', label='Lower Bound')
        axes[0].axvline(upper_bound, color='red', linestyle='-', label='Upper Bound')
        axes[0].legend()

        # Boxplot after outlier removal
        sns.boxplot(x=filtered_data[column], color='lightgreen', ax=axes[1],
                    flierprops={'marker': 'o', 'markersize': 5, 'markerfacecolor': 'red'})
        axes[1].set_title(f'After Outlier Removal: {column}')

        plt.suptitle(f'Outlier Detection & Removal for {column}')
        plt.tight_layout()
        plt.show()

        print(f"✅ Outliers detected and removed for {column} → {rows_deleted} rows deleted")

    return filtered_data, rows_deleted

# Apply function to each numerical column and visualize only affected features
rows_deleted_total = 0
features_with_outliers = []

for column in columns_to_check:
    train_data_filtered, rows_deleted = remove_outliers_iqr_with_plot(train_data, column)

    # Only update train_data if outliers were removed
    if rows_deleted > 0:
        train_data = train_data_filtered
        rows_deleted_total += rows_deleted
        features_with_outliers.append(column)

# Summary
print("\n📊 Summary of Outlier Removal:")
if features_with_outliers:
    print(f"Total rows deleted: {rows_deleted_total}")
    print(f"Features with outliers removed: {features_with_outliers}")
else:
    print("No significant outliers detected. No rows removed.")

In [None]:
y_sexf = train_data['Sex_F']
y_adhd = train_data['ADHD_Outcome']

id_test = test_data['participant_id']
id_train = train_data['participant_id']

train_data.drop(columns=['participant_id'], inplace=True)
test_data.drop(columns=['participant_id'], inplace=True)

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[4.4] <strong>Feature Scaling </strong></span></b>

When working with machine learning models, it’s common to encounter datasets where features exist on **vastly different numerical scales**. Without proper scaling, features with larger numeric ranges can **disproportionately influence the model**, regardless of their actual predictive importance. This imbalance can lead to **biased learning, slower convergence, and suboptimal performance.**

---

#### 📌 Why Does Feature Scaling Matter?
- **Avoids numerical dominance**—ensures no feature overpowers others just because of its scale.
- **Speeds up optimization**—gradient-based models (like Neural Networks, Logistic Regression) converge **faster**. ⏩  
- **Boosts performance**—distance-based models (KNN, SVM) rely on properly scaled data for accurate comparisons. 🎯  
- **Improves stability**—helps prevent models from making erratic updates during training.
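
As a small illustration of min-max scaling fitted on training data only, the sketch below rescales two toy columns with very different ranges. The arrays are invented for the example; the arithmetic is the standard `(x - min) / (max - min)` formula.

```python
import numpy as np

train = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])
test = np.array([[4.0, 200.0]])

# Learn the range from the training rows only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range   # can leave [0, 1] for unseen extremes
print(train_scaled)
print(test_scaled)                           # [[1.5, 0.25]]
```

Note that a test value larger than anything seen in training maps above 1, and that a constant training column would cause a division by zero in this bare-bones version.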

In [None]:
train_data.drop(columns = ['Gender','ADHD_Status'], inplace=True)

In [None]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Target columns only present in training data
target_cols = ['Sex_F', 'ADHD_Outcome']

# Separate features and target variables in training data
features_train = train_data.drop(columns=target_cols)
targets_train = train_data[target_cols]

# No need to drop anything from test data
features_test = test_data

# Initialize MinMaxScaler
minmax_scaler = MinMaxScaler()

# Fit the scaler only on the training features
minmax_scaler.fit(features_train)

# Scale the training features
scaled_data_train = minmax_scaler.transform(features_train)
scaled_train_df = pd.DataFrame(scaled_data_train, columns=features_train.columns)

# Scale the entire test data
scaled_data_test = minmax_scaler.transform(features_test)
scaled_test_df = pd.DataFrame(scaled_data_test, columns=features_test.columns)

# Concatenate the target columns back to the scaled training data
scaled_train_df = pd.concat([scaled_train_df, targets_train.reset_index(drop=True)], axis=1)

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[4.5] <strong>Downsampling</strong></span></b>

In [None]:
import pandas as pd
from sklearn.utils import resample

# Assuming train_df_sol is already loaded

# Split the dataset into female and male
female_df = train_df_sol[train_df_sol['Sex_F'] == 1]
male_df = train_df_sol[train_df_sol['Sex_F'] == 0]

# Find the minimum count to balance
min_count = min(len(female_df), len(male_df))

# Downsample both to min_count
female_downsampled = resample(female_df, replace=False, n_samples=min_count, random_state=42)
male_downsampled = resample(male_df, replace=False, n_samples=min_count, random_state=42)

# Combine the downsampled data
balanced_df = pd.concat([female_downsampled, male_downsampled])

# Get participant IDs
balanced_sex_participant_ids = balanced_df['participant_id'].tolist()

# Get counts for ADHD and non-ADHD
adhd_count = balanced_df['ADHD_Outcome'].sum()
non_adhd_count = len(balanced_df) - adhd_count

print(f"Number of participants after downsampling: {len(balanced_df)}")
print(f"Number of ADHD cases: {adhd_count}")
print(f"Number of non-ADHD cases: {non_adhd_count}")

# <span style="color:#ffffff; font-size: 1%;">[5] 🏗️ Modelling & Evaluation</span>
### <span style="color:#ffffff; font-size: 1%;">Modelling</span>

### <span style="color:#ffffff; font-size: 1%;">Data Preprocessing</span>

<div style=" border-bottom: 8px solid #E6A600; overflow: hidden; border-radius: 10px; height: 45px; width: 100%; display: flex;">
  <div style="height: 100%; width: 65%; background-color: #C2185B; float: left; text-align: center; display: flex; justify-content: center; align-items: center; font-size: 25px; ">
    <b><span style="color: #ffffff; padding: 20px 20px;">[5] 🏗️📊 Modelling & Evaluation</span></b>
  </div>
  <div style="height: 100%; width: 35%; background-image: url('https://www.kaggle.com/competitions/90566/images/header'); background-size: cover; background-position: center; float: left; border-top-right-radius: 10px; border-bottom-right-radius: 4px;">
  </div>
</div>

Now that our data is **cleaned** and **prepped**, it's time to **build** and **evaluate models**! 🚀 In this section, we’ll experiment with *different algorithms*, **fine-tune parameters**, and explore **blending techniques** to improve performance.

Since **modeling** is an *iterative process*, we’ll continuously **refine our approach**—*tweaking hyperparameters*, *testing ensemble methods*, and *analyzing results*—to squeeze out the **best predictions possible**! 🔄📊

### 📦 **Training Block**

> To respect the limited sample size (\~1 000 subjects) and avoid overfitting, we rely primarily on classical ML rather than end‐to‐end deep networks. We handle each data source independently—survey/tabular and fMRI connectivity—before blending their strengths:
>
> * **Survey features:** scaled demographic, questionnaire, and behavioral measures are modeled with tree‑based ensembles (e.g. Random Forest, ExtraTrees, LightGBM) which consistently yield high CV scores.
> * **fMRI connectome:** the 200×200 correlation matrices are reduced via PCA (90%/95% variance) and fed into regularized linear models (Ridge or Logistic Regression) to capture global connectivity patterns without overfitting.
> * **Graph-transformer embeddings:** a PyG graph Transformer, trained separately in [5.1], encodes each connectome into a 128-dim vector; these learned representations serve as additional high-level features for downstream classifiers.
>
> The downstream classifiers in each branch are trained on sex-balanced (downsampled) data with stratified CV to produce out-of-fold predictions for **Sex\_F** and **ADHD\_Outcome**; the graph encoder itself is trained on the full training set. Finally, a lightweight **Voting Ensemble** (e.g. Logistic Regression + CatBoost) merges the branch outputs into the final predictions, combining the complementary strengths of each modality.
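
One way to picture the final merging step is soft voting: averaging per-branch probabilities and thresholding per task. The sketch below assumes that interpretation; the probabilities, the helper name `soft_vote`, and the 0.5 threshold are all hypothetical and not taken from the notebook's ensemble.

```python
import numpy as np

# Hypothetical per-branch probabilities for 4 subjects (columns: ADHD, Sex_F)
p_survey = np.array([[0.9, 0.2], [0.4, 0.7], [0.6, 0.5], [0.1, 0.9]])
p_fmri   = np.array([[0.7, 0.4], [0.5, 0.6], [0.3, 0.8], [0.2, 0.6]])

def soft_vote(prob_list, weights=None, threshold=0.5):
    """Weighted average of branch probabilities, then threshold per task."""
    stacked = np.stack(prob_list)                       # [n_branches, n_subjects, n_tasks]
    avg = np.average(stacked, axis=0, weights=weights)  # [n_subjects, n_tasks]
    return avg, (avg >= threshold).astype(int)

avg, labels = soft_vote([p_survey, p_fmri])
print(avg)
print(labels)
```

Passing `weights=[2, 1]` would let the survey branch count twice as much as the fMRI branch, which is a simple knob for favoring the stronger modality.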


In [None]:
nb_type='Train'

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[5.1] <strong>Embedding Network </strong></span></b>

### 📌 **Graph Transformer Encoder for fMRI Connectomes**

* **What it does:** Converts each subject’s 200-node fMRI connectome vector into a sparse graph (top-k strongest edges) and feeds it through three sequential **TransformerConv** layers with edge-feature encodings.
* **Architecture highlights:**

  * **Edge encoder:** Projects scalar edge weights into a 64-dim “d\_model” space.
  * **TransformerConv layers:** Three attention-based graph convolutions (4 heads, 64 total channels) with ELU activations and dropout, capturing higher-order connectivity patterns.
  * **Global pooling & projection:** Node embeddings are mean-pooled, then projected by a fully connected layer to a 128-dim embedding, which feeds a 2-output classifier head (ADHD, Sex).
* **Training setup:**

  * **Loss:** `BCEWithLogitsLoss` with a per-task `pos_weight` (negatives ÷ positives) to compensate for class imbalance in each label.
  * **Optimizer:** AdamW, 40 epochs on the **full** train set (no downsampling), batch size 32.
* **Output:** Saved model weights and extracted 128-dim embeddings for every train/test subject into CSVs.
* **Why embeddings help:** They distill complex graph structure into a compact latent space—complementing survey/tabular features by injecting non-linear, network-level information that classical PCA or tree models may miss.
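
The graph-construction step can be illustrated without PyTorch. The sketch below rebuilds a symmetric adjacency matrix from an upper-triangle vector and keeps each node's k largest edges. The function name `topk_adjacency`, the 6-node size, and the random vector are assumptions for the example; like the training code, it ranks edges by signed value, so strongly negative correlations are dropped (ranking by `np.abs` would be a variant).

```python
import numpy as np

def topk_adjacency(vec, num_nodes, k):
    """Upper-triangle vector -> symmetric adjacency keeping each node's k largest edges."""
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    adj[np.triu_indices(num_nodes, k=1)] = vec
    adj = adj + adj.T                              # mirror to the lower triangle
    keep = np.zeros_like(adj, dtype=bool)
    for i in range(num_nodes):
        keep[i, np.argsort(adj[i])[-k:]] = True    # k largest values in row i
    keep = keep | keep.T                           # keep an edge if either endpoint chose it
    return np.where(keep, adj, 0.0)

rng = np.random.default_rng(0)
n = 6
vec = rng.uniform(-1, 1, size=n * (n - 1) // 2)
sparse = topk_adjacency(vec, num_nodes=n, k=2)
print((sparse != 0).sum(axis=1))   # number of edges kept per node
```

Symmetrizing the mask keeps the graph undirected, at the cost of some nodes ending up with more than k neighbours.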

In [None]:
if nb_type == 'Train':
    # ----------------------------------------------------------
    #  WiDS 2025  —  fMRI Graph‑Encoder (TransformerConv) TRAIN
    # ----------------------------------------------------------
    # (dependencies) -------------------------------------------
    # pip install torch-scatter torch-sparse torch-geometric -f \
    #     https://data.pyg.org/whl/torch-2.0.0+cpu.html
    # ----------------------------------------------------------
    import os, random, numpy as np, pandas as pd, torch, torch.nn as nn
    import torch.nn.functional as F
    from torch_geometric.data import Data, Dataset
    from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader in PyG 2.x
    from torch_geometric.nn import TransformerConv, global_mean_pool
    from torch_geometric.utils import degree

    # reproducibility -----------------------------------------
    SEED = 42
    torch.manual_seed(SEED)
    np.random.seed(SEED)
    random.seed(SEED)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"➡️  Using device: {device}")

    # ---------------------------------------------------------
    # 1.  Merge targets  —  NO DOWNSAMPLING ANY MORE
    # ---------------------------------------------------------
    targets_df = train_df_sol.copy()
    targets_df["Sex_F"]        = targets_df["Sex_F"].astype(int)
    targets_df["ADHD_Outcome"] = targets_df["ADHD_Outcome"].astype(int)

    df_full = train_df_fcm.merge(targets_df, on="participant_id")
    print(f"✅ Training on full dataset: {len(df_full)} rows")

    y_full = df_full[["ADHD_Outcome", "Sex_F"]].values.astype(np.float32)

    # ---------------------------------------------------------
    # 2. Helper — vector ➜ sparse graph (top‑k edges)
    # ---------------------------------------------------------
    NUM_NODES = 200
    TOP_K     = 12
    tri_u     = np.triu_indices(NUM_NODES, k=1)

    def vec_to_graph(vec: np.ndarray) -> Data:
        adj = np.zeros((NUM_NODES, NUM_NODES), dtype=np.float32)
        adj[tri_u] = vec
        adj += adj.T
        keep = np.zeros_like(adj, bool)
        for i in range(NUM_NODES):
            idx = np.argsort(adj[i])[-TOP_K:]
            keep[i, idx] = True
        keep = np.logical_or(keep, keep.T)
        row, col = np.where(keep & (adj != 0))
        edge_w   = adj[row, col]
        edge_idx = torch.tensor(np.vstack([row, col]), dtype=torch.long)
        edge_attr= torch.tensor(edge_w, dtype=torch.float32)
        deg      = degree(edge_idx[0], NUM_NODES).unsqueeze(1)
        x        = deg.float()
        return Data(x=x, edge_index=edge_idx, edge_attr=edge_attr)

    # ---------------------------------------------------------
    # 3. Build PyG Dataset objects — FULL TRAIN + TEST
    # ---------------------------------------------------------
    class ConnectomeDataset(Dataset):
        def __init__(self, df_fcm: pd.DataFrame, y: np.ndarray = None):
            super().__init__()
            self.vecs = df_fcm.drop(columns=["participant_id"]).values.astype(np.float32)
            self.ids  = df_fcm["participant_id"].values
            self.y    = y

        def len(self):
            return len(self.vecs)

        def get(self, idx):
            g = vec_to_graph(self.vecs[idx])
            if self.y is not None:
                g.y = torch.tensor(self.y[idx], dtype=torch.float32)
            g.participant_id = self.ids[idx]
            return g

    train_ds = ConnectomeDataset(
        df_full.drop(columns=["ADHD_Outcome", "Sex_F"]),
        y_full
    )
    test_ds  = ConnectomeDataset(test_df_fcm)

    # ---------------------------------------------------------
    # 4. Graph Transformer Encoder
    # ---------------------------------------------------------
    class GraphTransformer(nn.Module):
        def __init__(self, d_model=64, heads=4, dropout=0.25):
            super().__init__()
            self.edge_encoder = nn.Linear(1, d_model)
            self.conv1 = TransformerConv(1,       d_model // heads, heads=heads,
                                         dropout=dropout, edge_dim=d_model)
            self.conv2 = TransformerConv(d_model, d_model // heads, heads=heads,
                                         dropout=dropout, edge_dim=d_model)
            self.conv3 = TransformerConv(d_model, d_model // heads, heads=heads,
                                         dropout=dropout, edge_dim=d_model)
            self.lin_rescale = nn.Linear(d_model, 128)
            self.classifier  = nn.Linear(128, 2)
            self.dp = dropout

        def forward(self, data):
            x, ei, ew, batch = data.x, data.edge_index, data.edge_attr, data.batch
            ew_emb = self.edge_encoder(ew.view(-1, 1))

            x = F.elu(self.conv1(x, ei, edge_attr=ew_emb))
            x = F.dropout(x, p=self.dp, training=self.training)
            x = F.elu(self.conv2(x, ei, edge_attr=ew_emb))
            x = F.dropout(x, p=self.dp, training=self.training)
            x = F.elu(self.conv3(x, ei, edge_attr=ew_emb))

            g   = global_mean_pool(x, batch)          # [B, 64]
            emb = F.relu(self.lin_rescale(g))         # [B, 128]
            logits = self.classifier(emb)             # [B, 2]
            return logits, emb

    model = GraphTransformer().to(device)
    print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

    # ---------------------------------------------------------
    # 5.  Loss — POS_WEIGHT for ADHD & SEX
    # ---------------------------------------------------------
    adhd_pos_w = (df_full["ADHD_Outcome"] == 0).sum() / (df_full["ADHD_Outcome"] == 1).sum()
    sex_pos_w  = (df_full["Sex_F"]        == 0).sum() / (df_full["Sex_F"]        == 1).sum()
    print(f"⏩ pos_weight ADHD = {adhd_pos_w:.2f},  Sex_F = {sex_pos_w:.2f}")

    bce = nn.BCEWithLogitsLoss(
        pos_weight=torch.tensor([adhd_pos_w, sex_pos_w], device=device)
    )

    optimizer     = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    train_loader  = DataLoader(train_ds, batch_size=32, shuffle=True)

    # ---------------------------------------------------------
    # 6. Training (FULL data)
    # ---------------------------------------------------------
    EPOCHS = 40
    model.train()
    for epoch in range(1, EPOCHS + 1):
        cum = 0.0
        for batch in train_loader:
            batch = batch.to(device)
            out, _ = model(batch)
            y_true = batch.y.view(-1, 2)
            loss   = bce(out, y_true)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            cum += loss.item() * batch.num_graphs
        print(f"Epoch {epoch:02}/{EPOCHS} • loss = {cum/len(train_ds):.4f}")

    torch.save(model.state_dict(), "graph_transformer_fmri_full.pt")
    print("✅ saved model  ➜  graph_transformer_fmri_full.pt")

    # ---------------------------------------------------------
    # 7. Embedding extraction — FULL TRAIN + TEST
    # ---------------------------------------------------------
    def write_embeddings(dataset, csv_name):
        loader = DataLoader(dataset, batch_size=64, shuffle=False)
        model.eval()
        embs, ids = [], []
        with torch.no_grad():
            for batch in loader:
                batch = batch.to(device)
                _, e = model(batch)
                embs.append(e.cpu().numpy())
                ids.extend(batch.participant_id)
        embs = np.vstack(embs)
        df_out = pd.DataFrame(
            embs,
            columns=[f"gt_emb_{i}" for i in range(embs.shape[1])]
        )
        df_out.insert(0, "participant_id", ids)
        df_out.to_csv(csv_name, index=False)
        print(f"💾 wrote {csv_name}")

    write_embeddings(train_ds, "train_fmri_graph_embeddings_full.csv")
    write_embeddings(test_ds,  "test_fmri_graph_embeddings_full.csv")


<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[5.2] <strong>Survey-Only Features </strong></span></b>

In [None]:
if nb_type == 'Train':
    import warnings
    warnings.filterwarnings("ignore")

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score
    from sklearn.linear_model import RidgeClassifier, LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier
    from lightgbm import LGBMClassifier
    from xgboost import XGBClassifier
    from sklearn.multioutput import MultiOutputClassifier

    # ─── 1) Data Prep ───────────────────────────────────────────
    survey_cols = [c for c in scaled_train_df.columns if c not in ['participant_id','Sex_F','ADHD_Outcome']]
    participants = id_train  # assumed aligned with scaled_train_df

    X_full  = scaled_train_df[survey_cols].values
    y_full  = np.vstack([scaled_train_df['ADHD_Outcome'].values,
                         scaled_train_df['Sex_F'].values]).T

    mask    = np.isin(participants, balanced_sex_participant_ids)
    X_train = X_full[mask]
    y_train = y_full[mask]

    # ─── 2) Competition F1 ──────────────────────────────────────
    def competition_f1(y_true, y_pred):
        f1_sex = f1_score(y_true[:, 1], y_pred[:, 1])
        w      = np.where((y_true[:, 1] == 1) & (y_true[:, 0] == 1), 2, 1)
        f1_adhd = f1_score(y_true[:, 0], y_pred[:, 0], sample_weight=w)
        return (f1_sex + f1_adhd) / 2

    # ─── 3) Models to Compare ───────────────────────────────────
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000, n_jobs=-1, random_state=42),
        "Ridge": RidgeClassifier(alpha=1.0),
        "RandomForest": RandomForestClassifier(n_estimators=300, max_depth=10, n_jobs=-1, random_state=42),
        "ExtraTrees": ExtraTreesClassifier(n_estimators=300, max_depth=10, n_jobs=-1, random_state=42),
        "HistGB": HistGradientBoostingClassifier(max_iter=100, random_state=42),
        "LightGBM": LGBMClassifier(n_estimators=300, max_depth=10, learning_rate=0.05, n_jobs=-1,
                                   random_state=42, verbose=-1),
        "XGBoost": XGBClassifier(n_estimators=300, max_depth=10, learning_rate=0.05,
                                 use_label_encoder=False, eval_metric="logloss", n_jobs=-1,
                                 random_state=42)
    }

    # ─── 4) CV Eval ─────────────────────────────────────────────
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    results = {}

    print("🚀 Running 5-Fold CV for Survey-Only Models...\n")

    for name, model in models.items():
        fold_scores = []
        clf = MultiOutputClassifier(model)

        for fold, (tr, va) in enumerate(cv.split(X_train, y_train[:, 0]), 1):
            clf.fit(X_train[tr], y_train[tr])
            preds = clf.predict(X_full)
            score = competition_f1(y_full, preds)
            fold_scores.append(score)

        results[name] = fold_scores
        print(f"{name:18s} → {np.round(fold_scores, 4).tolist()} | Mean: {np.round(np.mean(fold_scores), 4)}")

    # ─── 5) Boxplot Comparison ─────────────────────────────────
    plt.figure(figsize=(12,6))
    plt.boxplot([results[name] for name in results], labels=list(results.keys()), showmeans=True)
    plt.xticks(rotation=45)
    plt.ylabel("Competition F1 Score")
    plt.title("📦 5-Fold CV Scores (Survey Models)")
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

### 📌 **Tabular Survey Data Model Benchmarking**

* **What it does:** Compares seven classical models (Logistic Regression, Ridge, Random Forest, Extra Trees, HistGradientBoosting, LightGBM, XGBoost) on the **scaled survey/questionnaire/demographic** features only.
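* **Evaluation:** 5-Fold CV on the sex-downsampled subset: each model is fit on the training folds and scored on the **full** training set with the **competition F1** (female ADHD cases weighted 2×). Because the full set includes the rows used for fitting, these scores are optimistic rather than true held-out estimates.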
* **Results:**

  * **Tree ensembles (RF, ET):** \~0.86 mean CV F1
  * **Boosters (LightGBM, XGBoost):** \~0.85
  * **Linear (LogReg, Ridge):** \~0.71
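* **Takeaway:** Tree-based models score well above linear ones on survey data and form the **survey branch**, though part of the gap may reflect high-capacity models memorizing training rows that are also in the evaluation set.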
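
To show how the 2× weighting enters the metric, here is a NumPy-only sketch of the competition F1 as it is defined in the benchmarking code (columns: 0 = `ADHD_Outcome`, 1 = `Sex_F`). The helper `weighted_f1` and the four toy subjects are assumptions for illustration.

```python
import numpy as np

def weighted_f1(y_true, y_pred, weights):
    """Binary F1 where each sample contributes its weight to TP/FP/FN."""
    tp = np.sum(weights * ((y_true == 1) & (y_pred == 1)))
    fp = np.sum(weights * ((y_true == 0) & (y_pred == 1)))
    fn = np.sum(weights * ((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def competition_f1(y_true, y_pred):
    """Mean of Sex_F F1 and ADHD F1, with female ADHD cases counted twice for ADHD."""
    w = np.where((y_true[:, 0] == 1) & (y_true[:, 1] == 1), 2.0, 1.0)
    f1_adhd = weighted_f1(y_true[:, 0], y_pred[:, 0], w)
    f1_sex = weighted_f1(y_true[:, 1], y_pred[:, 1], np.ones(len(y_true)))
    return (f1_adhd + f1_sex) / 2

y_true = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
y_pred = np.array([[0, 1], [1, 0], [0, 0], [0, 0]])
print(competition_f1(y_true, y_pred))   # 7/12 ≈ 0.583
```

In this toy case the single missed female ADHD subject counts as two false negatives, pulling the ADHD F1 down to 0.5 from the 0.667 it would be unweighted. For a held-out estimate, the same function would be applied to the validation rows of each fold only.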

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[5.3] <strong>fMRI PCA-Only (0.95) </strong></span></b>

In [None]:
if nb_type == 'Train':
    import warnings
    warnings.filterwarnings("ignore")

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score
    from sklearn.decomposition import PCA
    from sklearn.linear_model import RidgeClassifier, LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier
    from lightgbm import LGBMClassifier
    from xgboost import XGBClassifier
    from sklearn.multioutput import MultiOutputClassifier

    # ─── 1) Data Prep ───────────────────────────────────────────
    fcm_cols   = [c for c in train_df_fcm.columns if c != "participant_id"]
    X_fmri_raw = train_df_fcm[fcm_cols].values
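    # Align IDs and targets to the row order of train_df_fcm so the mask matches X_fmri_raw
    participants = train_df_fcm["participant_id"].values
    targets      = train_df_sol.set_index("participant_id").loc[participants]
    y_full       = targets[["ADHD_Outcome", "Sex_F"]].values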

    mask     = np.isin(participants, balanced_sex_participant_ids)
    X_fmri   = X_fmri_raw[mask]
    y_train  = y_full[mask]

    # ─── 2) PCA Transform (0.95) ───────────────────────────────
    pca = PCA(n_components=0.95, svd_solver="full", random_state=42)
    Xpca_full = pca.fit_transform(X_fmri_raw)
    Xpca_train = Xpca_full[mask]

    # ─── 3) Competition Metric ─────────────────────────────────
    def competition_f1(y_true, y_pred):
        f1_sex = f1_score(y_true[:, 1], y_pred[:, 1])
        w      = np.where((y_true[:, 1] == 1) & (y_true[:, 0] == 1), 2, 1)
        f1_adhd = f1_score(y_true[:, 0], y_pred[:, 0], sample_weight=w)
        return (f1_sex + f1_adhd) / 2

    # ─── 4) Models ──────────────────────────────────────────────
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000, n_jobs=-1, random_state=42),
        "Ridge": RidgeClassifier(alpha=1.0),
        "RandomForest": RandomForestClassifier(n_estimators=300, max_depth=10, n_jobs=-1, random_state=42),
        "ExtraTrees": ExtraTreesClassifier(n_estimators=300, max_depth=10, n_jobs=-1, random_state=42),
        "HistGB": HistGradientBoostingClassifier(max_iter=100, random_state=42),
        "LightGBM": LGBMClassifier(n_estimators=300, max_depth=10, learning_rate=0.05, n_jobs=-1, random_state=42, verbose=-1),
        "XGBoost": XGBClassifier(n_estimators=300, max_depth=10, learning_rate=0.05,
                                 use_label_encoder=False, eval_metric="logloss", n_jobs=-1, random_state=42)
    }

    # ─── 5) CV Loop ─────────────────────────────────────────────
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    results = {}

    print("🚀 Running 5-Fold CV for fMRI PCA(0.95)-Only Models...\n")

    for name, model in models.items():
        clf = MultiOutputClassifier(model)
        fold_scores = []

        for fold, (tr, va) in enumerate(cv.split(Xpca_train, y_train[:, 0]), 1):
            clf.fit(Xpca_train[tr], y_train[tr])
            preds = clf.predict(Xpca_full)
            score = competition_f1(y_full, preds)
            fold_scores.append(score)

        results[name] = fold_scores
        print(f"{name:18s} → {np.round(fold_scores, 4).tolist()} | Mean: {np.round(np.mean(fold_scores), 4)}")

    # ─── 6) Boxplot ─────────────────────────────────────────────
    plt.figure(figsize=(12,6))
    plt.boxplot([results[m] for m in results], labels=list(results.keys()), showmeans=True)
    plt.title("📦 5-Fold CV Scores – fMRI PCA(0.95) Only")
    plt.ylabel("Competition F1")
    plt.xticks(rotation=45)
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

### 📌 **Unimodal fMRI PCA Feature Modeling**

* **What it does:** Applies PCA (95% variance) to full fMRI connectomes, then benchmarks the same seven models on these reduced features.
* **Evaluation:** 5-Fold CV (downsampled train, full eval) with competition F1.
* **Results:**

  * **RandomForest / ExtraTrees:** \~0.81 mean CV F1
  * **Boosters (LightGBM, XGBoost):** \~0.81
  * **Linear (LogReg, Ridge):** \~0.77 / 0.77
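* **Takeaway:** PCA features carry useful connectivity signal. Tree models lead linear ones by roughly 0.04 under this protocol, but the final pipeline still uses a **simple Ridge** on PCA features (an ensemble of the 0.90 and 0.95 variance cut-offs), preferring a regularized linear model for generalization.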
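
The variance-threshold idea behind `PCA(n_components=0.95)` can be sketched with a plain SVD. The example below is an illustration on synthetic data, not the notebook's pipeline: the helper names `fit_pca` and `transform` are hypothetical, and the projection is fitted on the training rows only, which is one way to keep validation rows out of the decomposition.

```python
import numpy as np

def fit_pca(X, var_threshold=0.95):
    """Return (mean, components) keeping the fewest PCs that reach var_threshold."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
    return mean, vt[:k]

def transform(X, mean, components):
    """Project rows onto the retained principal components."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))                      # 3 underlying factors
X = latent @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(100, 20))

train_idx, val_idx = np.arange(80), np.arange(80, 100)
mean, comps = fit_pca(X[train_idx])                     # fitted on training rows only
Z_val = transform(X[val_idx], mean, comps)
print(comps.shape, Z_val.shape)
```

Because the toy data has three underlying factors and very little noise, at most three components are needed to pass the 95% threshold.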

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[5.4] <strong> Embeddings-Only Features</strong></span></b>

In [None]:
if nb_type == 'Train':
    import warnings
    warnings.filterwarnings("ignore")

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score
    from sklearn.linear_model import RidgeClassifier, LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier
    from lightgbm import LGBMClassifier
    from xgboost import XGBClassifier
    from sklearn.multioutput import MultiOutputClassifier

    # ─── 1) Load Data ───────────────────────────────────────────
    emb_df = pd.read_csv("/kaggle/input/full-data-dictionaries/train_fmri_gat_embeddings.csv")
    emb_df['participant_id'] = emb_df['participant_id'].astype(str)

    participants = id_train
    y_full = np.vstack([
        scaled_train_df["ADHD_Outcome"].values,
        scaled_train_df["Sex_F"].values
    ]).T

    # Align embeddings with full train
    X_emb_full = emb_df.set_index("participant_id").loc[participants].values

    # Downsample
    mask = participants.isin(balanced_sex_participant_ids)
    X_train = X_emb_full[mask]
    y_train = y_full[mask]

    # ─── 2) Competition Metric ─────────────────────────────────
    def competition_f1(y_true, y_pred):
        f1_sex = f1_score(y_true[:, 1], y_pred[:, 1])
        w = np.where((y_true[:, 1] == 1) & (y_true[:, 0] == 1), 2, 1)
        f1_adhd = f1_score(y_true[:, 0], y_pred[:, 0], sample_weight=w)
        return (f1_sex + f1_adhd) / 2

    # ─── 3) Models to Compare ─────────────────────────────────
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000, n_jobs=-1, random_state=42),
        "Ridge": RidgeClassifier(alpha=1.0),
        "RandomForest": RandomForestClassifier(n_estimators=300, max_depth=10, n_jobs=-1, random_state=42),
        "ExtraTrees": ExtraTreesClassifier(n_estimators=300, max_depth=10, n_jobs=-1, random_state=42),
        "HistGB": HistGradientBoostingClassifier(max_iter=100, random_state=42),
        "LightGBM": LGBMClassifier(n_estimators=300, max_depth=10, learning_rate=0.05, n_jobs=-1, random_state=42, verbose=-1),
        "XGBoost": XGBClassifier(n_estimators=300, max_depth=10, learning_rate=0.05,
                                 use_label_encoder=False, eval_metric="logloss", n_jobs=-1, random_state=42)
    }

    # ─── 4) CV Eval ─────────────────────────────────────────────
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    results = {}

    print("🚀 Running 5-Fold CV for fMRI Embeddings Only Models...\n")

    for name, model in models.items():
        fold_scores = []
        clf = MultiOutputClassifier(model)

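        for fold, (tr, va) in enumerate(cv.split(X_train, y_train[:, 0]), 1):
            clf.fit(X_train[tr], y_train[tr])
            # NOTE: `va` is unused. Each fold is scored on the FULL training set,
            # which includes the rows just fitted on, so scores are optimistic.
            preds = clf.predict(X_emb_full)
            score = competition_f1(y_full, preds)
            fold_scores.append(score)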

        results[name] = fold_scores
        print(f"{name:18s} → {np.round(fold_scores, 4).tolist()} | Mean: {np.round(np.mean(fold_scores), 4)}")

    # ─── 5) Boxplot Comparison ─────────────────────────────────
    plt.figure(figsize=(12,6))
    plt.boxplot([results[m] for m in models], labels=list(models.keys()), showmeans=True)
    plt.title("📦 5-Fold CV Scores – fMRI Embeddings Only")
    plt.ylabel("Competition F1")
    plt.xticks(rotation=45)
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

### 📌 **Graph-Transformer Embedding Feature Modeling**

* **What it does:** Loads the 128-dim graph embeddings from Block 1 and benchmarks the same seven models solely on these latent features.
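* **Evaluation:** 5-fold split of the sex-balanced (downsampled) subset; each fold's model is scored on the full training set, so the scores include rows the model was fitted on and are optimistic relative to held-out performance.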
* **Results:**

  * **RandomForest / HistGB:** \~0.82 mean CV F1
  * **LightGBM / XGBoost:** \~0.82
  * **Linear (LogReg, Ridge):** \~0.66
* **Takeaway:** Pretrained embeddings appear to encode connectome structure that tree models can exploit, offering a candidate third branch if we choose to integrate them.
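
To see how much the "full eval" setup can inflate a score, here is a minimal sketch that scores each fold twice: once on all rows (as in the cell above) and once on the held-out rows only. The data is a synthetic stand-in (hypothetical, not the real embeddings), and the model is a small RandomForest chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.multioutput import MultiOutputClassifier


def competition_f1(y_true, y_pred):
    """Mean of Sex F1 and ADHD F1, with female ADHD cases weighted 2x."""
    f1_sex = f1_score(y_true[:, 1], y_pred[:, 1])
    w = np.where((y_true[:, 1] == 1) & (y_true[:, 0] == 1), 2, 1)
    f1_adhd = f1_score(y_true[:, 0], y_pred[:, 0], sample_weight=w)
    return (f1_sex + f1_adhd) / 2


# Synthetic stand-in for 128-dim embeddings (assumption: NOT the real data).
X, y_adhd = make_classification(
    n_samples=300, n_features=128, n_informative=10, random_state=0
)
rng = np.random.default_rng(0)
y_sex = rng.integers(0, 2, size=300)      # random labels: nothing to learn here
y = np.column_stack([y_adhd, y_sex])      # columns: [ADHD_Outcome, Sex_F]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
full_scores, heldout_scores = [], []

for tr, va in cv.split(X, y[:, 0]):
    clf = MultiOutputClassifier(
        RandomForestClassifier(n_estimators=50, random_state=42)
    )
    clf.fit(X[tr], y[tr])
    # Scored on every row, including the rows used for fitting.
    full_scores.append(competition_f1(y, clf.predict(X)))
    # Scored only on rows the model has never seen.
    heldout_scores.append(competition_f1(y[va], clf.predict(X[va])))

print(f"Full-data eval mean: {np.mean(full_scores):.4f}")
print(f"Held-out eval mean:  {np.mean(heldout_scores):.4f}")
```

Because the sex labels here are random, any held-out skill on that target is chance, while the full-data score stays high thanks to memorized training rows. The same comparison on the real embeddings would show how much of the tree-vs-linear gap survives on unseen participants.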

<b><span style="color: #FFFFFF; background-color: #E57373; padding: 20px; font-size: 18px; border-left: 8px solid #C2185B">[5.5] <strong> Final Submission Pipeline</strong></span></b>

### 📌 **Hybrid Ensemble: Survey + fMRI PCA Ridge**

* **Components:**

  1. **Survey branch:** XGBClassifier on downsampled survey data → p̂₁ (class-1 probabilities)
  2. **fMRI branch:** RidgeClassifier on PCA (0.90, 0.95 & 0.99 explained variance), averaging the three models' hard 0/1 predictions → p̂₂
* **Blend:** 50% p̂₁ + 50% p̂₂ in the CV cell; the submission cell uses 0.5·p̂₁ + 0.3·p̂₂
* **Thresholding:** ADHD ≥ 0.5, Sex ≥ 0.5
* **Why this combo?**

  * **Survey trees (RF CV \~0.86):** Exploit strong tabular signal with minimal feature engineering.
  * **Ridge PCA (CV \~0.77):** Enforces linear generalization on imaging, avoiding overfitting that trees showed on high-dim PCA.
  * **Equal weights in CV:** Balance behavioral vs. neuro data.
* **Outcome:** Stable CV (\~0.82) and competitive leaderboard performance by merging complementary sources while guarding against overfitting.
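
Since the Ridge branch in the cells below averages hard 0/1 predictions from three PCA settings, its contribution can only be 0, 1/3, 2/3, or 1. The sketch below works out, for each of those vote levels, the minimum survey probability needed to cross the 0.5 threshold under the two weightings that appear in the code (0.5/0.5 in the CV cell, 0.5/0.3 in the submission cell). The helper name `min_survey_prob` is hypothetical and exists only for this illustration.

```python
import numpy as np


def min_survey_prob(w_survey, w_fmri, ridge_vote, threshold=0.5):
    """Smallest p1 such that w_survey * p1 + w_fmri * ridge_vote >= threshold."""
    return (threshold - w_fmri * ridge_vote) / w_survey


# Mean of three hard 0/1 Ridge predictions: 0, 1/3, 2/3, or 1.
votes = np.array([0, 1, 2, 3]) / 3

for w_survey, w_fmri in [(0.5, 0.5), (0.5, 0.3)]:
    needed = min_survey_prob(w_survey, w_fmri, votes)
    print(f"weights {w_survey}/{w_fmri}:")
    for k, p in zip(range(4), needed):
        print(f"  {k}/3 Ridge votes -> survey probability must be >= {p:.2f}")
```

With 0.5/0.5, a unanimous Ridge vote is enough on its own to produce a positive call, whereas with 0.5/0.3 the survey branch must still reach about 0.4. In both cases a zero Ridge vote requires a survey probability of 1.0, which a probabilistic classifier rarely outputs, so the fMRI branch effectively holds a veto.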

In [None]:
if nb_type == 'Train':
    import warnings
    warnings.filterwarnings("ignore")

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.linear_model import RidgeClassifier
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score
    from xgboost import XGBClassifier

    # ─── 1) Setup ───────────────────────────────────────────────
    fcm_cols    = [c for c in train_df_fcm.columns if c != "participant_id"]
    survey_cols = [c for c in scaled_train_df.columns if c not in ['participant_id','Sex_F','ADHD_Outcome']]
    pca_vars    = [0.90, 0.95, 0.99]

    # ─── 2) Load & mask data ────────────────────────────────────
    X_fmri_full    = train_df_fcm[fcm_cols].values
    X_survey_full  = scaled_train_df[survey_cols].values
    y_full         = np.vstack([
        scaled_train_df["ADHD_Outcome"].values,
        scaled_train_df["Sex_F"].values
    ]).T
    participants   = train_df_sol["participant_id"].astype(str).values

    mask           = np.isin(participants, balanced_sex_participant_ids)
    X_fmri         = X_fmri_full[mask]
    X_survey       = X_survey_full[mask]
    y_train        = y_full[mask]

    # ─── 3) Metric ──────────────────────────────────────────────
    def competition_f1(y_true, y_pred):
        f1_sex = f1_score(y_true[:,1], y_pred[:,1])
        w      = np.where((y_true[:,1]==1)&(y_true[:,0]==1), 2, 1)
        f1_adhd = f1_score(y_true[:,0], y_pred[:,0], sample_weight=w)
        return (f1_sex + f1_adhd) / 2

    # ─── 4) 10‑Fold CV ──────────────────────────────────────────
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    fold_scores = []

    for fold, (tr_idx, va_idx) in enumerate(cv.split(X_survey, y_train[:,0]), 1):
        # Branch 1: XGB on survey
        xgb = XGBClassifier(
            n_estimators=300,
            max_depth=10,
            learning_rate=0.05,
            use_label_encoder=False,
            eval_metric="logloss",
            n_jobs=-1,
            verbosity=0,
            random_state=42
        )
        clf_xgb = MultiOutputClassifier(xgb)
        clf_xgb.fit(X_survey[tr_idx], y_train[tr_idx])
        proba = clf_xgb.predict_proba(X_survey_full)
        branch1 = np.stack([p[:,1] for p in proba], axis=1).astype(np.float32)
        
        # Branch 2: Ridge on fMRI PCA (0.90, 0.95, 0.99)
        ridge_preds = []
        for var in pca_vars:
            pca = PCA(n_components=var, svd_solver="full", random_state=42)
            Xp_tr = pca.fit_transform(X_fmri[tr_idx])
            Xp_te = pca.transform(X_fmri_full)
            clf_ridge = MultiOutputClassifier(RidgeClassifier(alpha=1.0))
            clf_ridge.fit(Xp_tr, y_train[tr_idx])
            ridge_preds.append(clf_ridge.predict(Xp_te).astype(np.float32))
        branch2 = np.mean(ridge_preds, axis=0)

        # Blend and threshold
        final_soft = 0.5 * branch1 + 0.5 * branch2
        final_pred = np.zeros_like(final_soft, dtype=int)
        final_pred[:,0] = (final_soft[:,0] >= 0.5).astype(int)
        final_pred[:,1] = (final_soft[:,1] >= 0.5).astype(int)

        # Scored on the full training set (va_idx is unused), so fitted rows are included
        score = competition_f1(y_full, final_pred)
        fold_scores.append(score)

    # ─── 5) Print results ──────────────────────────────────────
    scores_rounded = [round(s, 4) for s in fold_scores]
    mean_score = np.mean(fold_scores)
    print(f"Fold scores: {scores_rounded}")
    print(f"Mean 10‑Fold CV Score: {mean_score:.4f}")

    # ─── 6) Line Plot ───────────────────────────────────────────
    plt.figure(figsize=(6, 4))
    plt.plot(range(1, 11), fold_scores, marker='o', linestyle='-', linewidth=2)
    plt.xticks(range(1, 11))
    plt.xlabel("Fold Number")
    plt.ylabel("Competition F1 Score")
    plt.title("10‑Fold CV Competition F1 Scores (XGB + PCA‑Ridge)")
    plt.grid(True)
    plt.tight_layout()
    plt.show()
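
The `competition_f1` function used in the cells above doubles the weight of female ADHD cases when computing the ADHD F1. A small worked example, using four made-up participants (hypothetical labels, not competition data), shows how missing a female ADHD case costs more than missing a male one.

```python
import numpy as np
from sklearn.metrics import f1_score


def competition_f1(y_true, y_pred):
    """Mean of Sex F1 and ADHD F1, with female ADHD cases weighted 2x."""
    f1_sex = f1_score(y_true[:, 1], y_pred[:, 1])
    w = np.where((y_true[:, 1] == 1) & (y_true[:, 0] == 1), 2, 1)
    f1_adhd = f1_score(y_true[:, 0], y_pred[:, 0], sample_weight=w)
    return (f1_sex + f1_adhd) / 2


# Columns: [ADHD_Outcome, Sex_F]; one participant per combination.
y_true = np.array([
    [1, 1],   # female, ADHD  (weight 2 in the ADHD F1)
    [1, 0],   # male, ADHD
    [0, 1],   # female, no ADHD
    [0, 0],   # male, no ADHD
])

# Same predictions as the truth, except one missed ADHD case each.
miss_female = y_true.copy()
miss_female[0, 0] = 0
miss_male = y_true.copy()
miss_male[1, 0] = 0

score_miss_female = competition_f1(y_true, miss_female)
score_miss_male = competition_f1(y_true, miss_male)
print(f"Missed female ADHD case: {score_miss_female:.2f}")
print(f"Missed male ADHD case:   {score_miss_male:.2f}")
```

Sex is predicted perfectly in both scenarios, so the whole difference comes from the weighted ADHD term: the weighted F1 is 0.5 when the female case is missed and 0.8 when the male case is missed, giving overall scores of 0.75 and 0.90.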

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeClassifier
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# ─── 1) Setup ───────────────────────────────────────────────
fcm_cols    = [c for c in train_df_fcm.columns if c != "participant_id"]
survey_cols = [c for c in scaled_train_df.columns if c not in ['participant_id','Sex_F','ADHD_Outcome']]

# PCA configs for fMRI branch
pca_vars = [0.90, 0.95, 0.99]

# ─── 2) Data ────────────────────────────────────────────────
# Train (downsampled by sex)
X_fmri_full   = train_df_fcm[fcm_cols].values
X_survey_full = scaled_train_df[survey_cols].values
y_full        = np.vstack([
    scaled_train_df["ADHD_Outcome"].values,
    scaled_train_df["Sex_F"].values
]).T
participants  = id_train

mask    = np.isin(participants, balanced_sex_participant_ids)
X_fmri  = X_fmri_full[mask]
X_survey= X_survey_full[mask]
y_train = y_full[mask]

# Test
X_survey_test = scaled_test_df[survey_cols].values
X_fmri_test   = test_df_fcm[fcm_cols].values
test_ids      = test_df_fcm["participant_id"].values

# ─── 3) Branch 1: XGB on Survey ─────────────────────────────
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=10,
    learning_rate=0.05,
    use_label_encoder=False,
    eval_metric="logloss",
    n_jobs=-1,
    verbosity=0,
    random_state=42
)
clf_xgb = MultiOutputClassifier(xgb)
clf_xgb.fit(X_survey, y_train)
xgb_proba = clf_xgb.predict_proba(X_survey_test)
branch1 = np.stack([p[:, 1] for p in xgb_proba], axis=1).astype(np.float32)

# ─── 4) Branch 2: Ridge on fMRI PCA (0.90, 0.95 & 0.99 ensemble) ─
ridge_preds = []
for var in pca_vars:
    pca = PCA(n_components=var, svd_solver="full", random_state=42)
    Xp_tr = pca.fit_transform(X_fmri)
    Xp_te = pca.transform(X_fmri_test)

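    clf_ridge = MultiOutputClassifier(RidgeClassifier(alpha=1.0))
    clf_ridge.fit(Xp_tr, y_train)
    # RidgeClassifier has no predict_proba, so these are hard 0/1 labels;
    # averaging them below yields a vote fraction, not a probability.
    preds = clf_ridge.predict(Xp_te).astype(np.float32)
    ridge_preds.append(preds)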

branch2 = np.mean(ridge_preds, axis=0)

# ─── 5) Combine & Threshold ─────────────────────────────────
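# NOTE: weights are 0.5/0.3 here (the CV cell used 0.5/0.5). They sum to 0.8,
# so the 0.5 threshold below is stricter than in the CV run.
final_soft = 0.5 * branch1 + 0.3 * branch2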
final_pred = np.zeros_like(final_soft, dtype=int)
final_pred[:, 0] = (final_soft[:, 0] >= 0.5).astype(int)
final_pred[:, 1] = (final_soft[:, 1] >= 0.5).astype(int)

# ─── 6) Save Submission ─────────────────────────────────────
submission = pd.DataFrame({
    "participant_id": test_ids,
    "ADHD_Outcome": final_pred[:, 0],
    "Sex_F": final_pred[:, 1]
})
submission.to_csv("submission.csv", index=False)
print("✅ submission.csv created: 50% XGB(survey) + 30% Ridge(fMRI PCA 0.90,0.95 & 0.99) ensemble")

### 🙌 Thank You!
Thanks for reading! 💙 If you have any suggestions, feel free to drop a comment – I’m eager to learn and grow in this amazing community! 🌱 I’ll be continuously updating this notebook with feature engineering, modeling, and detailed EDA observations for this competition.

<div style="background-color: #FDEDEC; border-left: 8px solid #AF7AC5; padding: 20px; border-radius: 8px; font-size: 14px; color: #4A235A;">
  <h3 style="font-size: 20px; margin-bottom: 10px;"><strong>📢 If you found this helpful, please upvote to support! 👍 Happy coding and best of luck! 🚀😊</strong></h3>
</div>

<div style="background-color: #E8F8F5; border-left: 8px solid #1ABC9C; padding: 20px; border-radius: 8px; font-size: 14px; color: #000000;">
  <h3 style="font-size: 20px; margin-bottom: 10px;">📬 <strong>Contact Information</strong></h3>
  <hr>

  <p>📧 <strong>Email:</strong> <a href="mailto:tarunpmishra2001@gmail.com" style="color: #1ABC9C; text-decoration: none;">tarunpmishra2001@gmail.com</a></p>

  <p>🔗 <strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/tarunpmishra/" target="_blank" style="color: #1ABC9C; text-decoration: none;">linkedin.com/in/tarunpmishra</a></p>

  <p>🌐 <strong>Portfolio:</strong> <a href="https://tarundirector.github.io/tarunmishra.github.io/" target="_blank" style="color: #1ABC9C; text-decoration: none;">tarundirector.github.io</a></p>

</div>