# Data Preparation

This notebook loads Athena-preprocessed AAL ROI time series from the ADHD-200 dataset,
validates their structure, and builds a clean subject-level index by matching
time series files with phenotypic labels.

## Notebook scope

**Goal**  
This notebook constructs the subject-level dataset used for the replication study by:
- loading Athena-preprocessed AAL ROI time series,
- validating and cleaning the data format,
- matching subjects with phenotypic labels,
- applying the same site-selection criteria as the original paper,
- exporting reproducible subject manifests for downstream experiments.

**Out of scope**  
This notebook does **not** perform raw rs-fMRI preprocessing (slice timing correction, motion correction, filtering, nuisance regression, etc.), as these steps are already applied in the Athena preprocessing pipeline.

**Outputs**
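- `data/processed/subjects_train.csv` – all labeled subjects with available AAL time series  
- `data/processed/subjects_train_paper.csv` – subset restricted to sites used in the original paper  
- `data/processed/subjects_test_paper.csv` – test-release subjects from the paper sites that have a usable `DX` label  
- `data/processed/subjects_train_split_paper.csv`, `data/processed/subjects_val_split_paper.csv` – stratified 80/20 split of the paper-site training subjects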

## 1. Imports

In [19]:
from pathlib import Path
import re
import numpy as np
import pandas as pd

PROJECT_ROOT = Path("..").resolve()
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"

AAL_ROOT = DATA_RAW / "aal_tcs"
PHENO_ROOT = DATA_RAW / "phenotypic"

DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("AAL_ROOT exists:", AAL_ROOT.exists(), "->", AAL_ROOT)
print("PHENO_ROOT exists:", PHENO_ROOT.exists(), "->", PHENO_ROOT)


PROJECT_ROOT: /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication
AAL_ROOT exists: True -> /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/aal_tcs
PHENO_ROOT exists: True -> /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/phenotypic


## Data inputs

This notebook expects the following directory structure:

- **AAL ROI time series (Athena preprocessed)**  
  `data/raw/aal_tcs/<Site>/<SubjectID>/*.1D`

- **Phenotypic labels**  
  `data/raw/phenotypic/<Site>/*_phenotypic.csv`

Only the filtered variant of the Athena-derived AAL time courses (`sfnwmrda*_aal_TCs.1D`) is used; files with any other prefix are ignored.
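The file names carry the subject ID, session and run, which matters later when several runs exist for one subject. Below is a minimal sketch of parsing them, assuming the naming scheme visible in the example paths of this notebook (`sfnwmrda<SubjectID>_session_<S>_rest_<R>_aal_TCs.1D`). The helper `parse_tc_filename` and the pattern `TC_NAME` are hypothetical and are not used elsewhere in the notebook.

```python
import re

# Assumed naming scheme, inferred from the example paths in this notebook:
#   sfnwmrda<SubjectID>_session_<S>_rest_<R>_aal_TCs.1D
TC_NAME = re.compile(
    r"^sfnwmrda(?P<subject_id>\d+)_session_(?P<session>\d+)"
    r"_rest_(?P<run>\d+)_aal_TCs\.1D$"
)

def parse_tc_filename(name: str):
    """Return (subject_id, session, run), or None if the name does not match."""
    m = TC_NAME.match(name)
    if m is None:
        return None
    return m.group("subject_id"), int(m.group("session")), int(m.group("run"))

# A filtered file matches; a file with a different prefix does not.
print(parse_tc_filename("sfnwmrda0026001_session_1_rest_1_aal_TCs.1D"))
print(parse_tc_filename("snwmrda0026001_session_1_rest_1_aal_TCs.1D"))
```

Parsing the session and run explicitly would make it possible to choose which run to keep per subject, rather than relying on path sort order.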

## 2. Find the filtered AAL time-course files

In [20]:
aal_files = sorted(AAL_ROOT.rglob("sfnwmrda*_aal_TCs.1D"))

print("Filtered AAL files found:", len(aal_files))
print("Example:")
for p in aal_files[:5]:
    print("  ", p)

Filtered AAL files found: 1395
Example:
   /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/aal_tcs/Brown/0026001/sfnwmrda0026001_session_1_rest_1_aal_TCs.1D
   /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/aal_tcs/Brown/0026002/sfnwmrda0026002_session_1_rest_1_aal_TCs.1D
   /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/aal_tcs/Brown/0026004/sfnwmrda0026004_session_1_rest_1_aal_TCs.1D
   /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/aal_tcs/Brown/0026005/sfnwmrda0026005_session_1_rest_1_aal_TCs.1D
   /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/raw/aal_tcs/Brown/0026009/sfnwmrda0026009_session_1_rest_1_aal_TCs.1D


## AAL time-course file format

Each `.1D` file contains a tab-separated table with:

- **Rows**: time points (sub-bricks / TRs)
- **Columns**:
  - Metadata columns: `File`, `Sub-brick`
  - ROI signal columns: `Mean_XXXX` (116 regions from the AAL atlas)

In this notebook:
- Only the 116 ROI columns are retained.
- Metadata columns are discarded.
- The resulting data matrix has shape **(T, 116)** per subject.
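To make the layout concrete, here is a small sketch that builds a synthetic table with the same column structure and extracts the ROI matrix. The values, the file name `toy.nii` and the two ROI column names are made up for illustration. The sketch uses `pandas.read_csv`, whereas the loader in the next section uses `np.genfromtxt`.

```python
import io
import pandas as pd

# Synthetic stand-in for a .1D file: 3 time points, 2 ROI columns (made-up values).
toy_1d = (
    "File\tSub-brick\tMean_2001\tMean_2002\n"
    "toy.nii\t0\t1.0\t-0.5\n"
    "toy.nii\t1\t0.8\t-0.2\n"
    "toy.nii\t2\t1.1\t0.3\n"
)

table = pd.read_csv(io.StringIO(toy_1d), sep="\t")

# Keep only the ROI signal columns and discard the metadata columns.
roi_cols = [c for c in table.columns if c.startswith("Mean_")]
X = table[roi_cols].to_numpy(dtype=float)

print(roi_cols)
print(X.shape)  # (T, R) = (3, 2) for this toy table
```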

## 3. Load data

In [21]:
def load_athena_aal_1d(path: Path, expect_rois: int = 116):
    """
    Loads Athena AAL time courses from .1D with a header.
    Keeps only ROI columns (Mean_...) and returns X with shape (T, R).
    """
    # Header line is tab-separated
    with open(path, "r") as f:
        header = f.readline().strip().split("\t")

    roi_cols = [i for i, name in enumerate(header) if name.startswith("Mean_")]
    if len(roi_cols) != expect_rois:
        print(f"[WARN] {path.name}: found {len(roi_cols)} ROI columns (expected {expect_rois})")

    # Load as strings then slice ROI columns
    raw = np.genfromtxt(path, delimiter="\t", skip_header=1, dtype=str)
    raw = np.atleast_2d(raw)  # safety

    X = raw[:, roi_cols].astype(float)

    # Fill NaNs safely
    if np.isnan(X).any():
        col_means = np.nanmean(X, axis=0)
        col_means = np.where(np.isnan(col_means), 0.0, col_means)
        inds = np.where(np.isnan(X))
        X[inds] = np.take(col_means, inds[1])

    return X


In [22]:
for p in aal_files[:3]:
    X = load_athena_aal_1d(p)
    print(p.parts[-3], p.parts[-2], p.name, "->", X.shape, "NaNs:", np.isnan(X).sum())


Brown 0026001 sfnwmrda0026001_session_1_rest_1_aal_TCs.1D -> (247, 116) NaNs: 0
Brown 0026002 sfnwmrda0026002_session_1_rest_1_aal_TCs.1D -> (247, 116) NaNs: 0
Brown 0026004 sfnwmrda0026004_session_1_rest_1_aal_TCs.1D -> (247, 116) NaNs: 0


## Step: Build time-course manifest

We construct a subject-level manifest (`df_tc`) containing:
- site identifier,
- subject ID,
- path to the AAL time-course file,
- number of time points (T),
- number of ROIs (R).

This manifest enables deterministic merging with phenotypic labels and
reproducible downstream feature extraction.

## 4. Build subject index table

In [23]:
def parse_site_subject(p: Path):
    # data/raw/aal_tcs/<SITE>/<SUBJECT>/<FILE>
    parts = p.parts
    site = parts[parts.index("aal_tcs") + 1]
    subject_id = parts[parts.index("aal_tcs") + 2]
    return site, subject_id

rows = []
for p in aal_files:
    site, subject_id = parse_site_subject(p)
    X = load_athena_aal_1d(p)
    rows.append({
        "site": site,
        "subject_id": subject_id,
        "tc_path": str(p),
        "T": int(X.shape[0]),
        "R": int(X.shape[1]),
    })

df_tc = pd.DataFrame(rows)

# Many subjects have several filtered runs (1395 files reduce to 965 unique
# subject IDs); keep the first path in sorted order for each subject.
df_tc = df_tc.sort_values(["subject_id", "tc_path"]).drop_duplicates("subject_id")

print("Unique subjects with time courses:", len(df_tc))
df_tc.head()


Unique subjects with time courses: 965


| | site | subject_id | tc_path | T | R |
|---|---|---|---|---|---|
| 120 | NYU | 10001 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 |
| 122 | NYU | 10002 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 |
| 124 | NYU | 10003 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 |
| 125 | NYU | 10004 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 |
| 127 | NYU | 10005 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 |


## Step: Load phenotypic labels

Phenotypic data are loaded from site-specific CSV files.
Only training phenotypic files are used (TestRelease files are excluded).
Diagnostic labels are extracted from the `DX` column.

## 5. Load and combine training phenotypics

In [24]:
# Training phenotypics (exclude TestRelease files)
pheno_files = sorted([
    p for p in PHENO_ROOT.rglob("*_phenotypic.csv")
    if "TestRelease" not in p.name
])

print("Phenotypic CSVs (training):", len(pheno_files))
for p in pheno_files:
    print(" ", p.relative_to(PROJECT_ROOT))


Phenotypic CSVs (training): 9
  data/raw/phenotypic/KKI/KKI_phenotypic.csv
  data/raw/phenotypic/NYU/NYU_phenotypic.csv
  data/raw/phenotypic/NeuroIMAGE/NeuroIMAGE_phenotypic.csv
  data/raw/phenotypic/OHSU/OHSU_phenotypic.csv
  data/raw/phenotypic/Peking_1/Peking_1_phenotypic.csv
  data/raw/phenotypic/Peking_2/Peking_2_phenotypic.csv
  data/raw/phenotypic/Peking_3/Peking_3_phenotypic.csv
  data/raw/phenotypic/Pittsburgh/Pittsburgh_phenotypic.csv
  data/raw/phenotypic/WashU/WashU_phenotypic.csv


In [25]:
dfs = []
for p in pheno_files:
    site = p.parent.name  # folder name is site
    dfp = pd.read_csv(p)

    # normalize column names
    dfp.columns = [c.strip() for c in dfp.columns]

    # attach site (to match df_tc.site)
    dfp["site"] = site
    dfs.append(dfp)

pheno = pd.concat(dfs, ignore_index=True)
print("Combined phenotypic rows:", len(pheno))
pheno.head()


Combined phenotypic rows: 776


Unnamed: 0,ScanDir ID,Site,Gender,Age,Handedness,DX,Secondary Dx,ADHD Measure,ADHD Index,Inattentive,...,QC_S1_Rest_1,QC_S1_Rest_2,QC_S1_Rest_3,QC_S1_Rest_4,QC_S1_Rest_5,QC_S1_Rest_6,QC_S1_Anat,QC_S2_Rest_1,QC_S2_Rest_2,QC_S2_Anat
0,1018959.0,3,0.0,12.36,1.0,0,,2.0,44.0,47.0,...,,,,,,,,,,
1,1019436.0,3,1.0,12.98,1.0,3,,2.0,71.0,60.0,...,,,,,,,,,,
2,1043241.0,3,1.0,9.12,1.0,0,,2.0,40.0,40.0,...,,,,,,,,,,
3,1266183.0,3,0.0,9.67,1.0,0,,2.0,47.0,44.0,...,,,,,,,,,,
4,1535233.0,3,1.0,9.64,0.0,0,,2.0,42.0,41.0,...,,,,,,,,,,


In [26]:
print("Phenotypic columns:")
print(sorted(pheno.columns.tolist()))

Phenotypic columns:
['ADHD Index', 'ADHD Measure', 'Age', 'DX', 'Full2 IQ', 'Full4 IQ', 'Gender', 'Handedness', 'Hyper/Impulsive', 'IQ Measure', 'Inattentive', 'Med Status', 'Performance IQ', 'QC_Anatomical_1', 'QC_Anatomical_2', 'QC_Rest_1', 'QC_Rest_2', 'QC_Rest_3', 'QC_Rest_4', 'QC_S1_Anat', 'QC_S1_Rest_1', 'QC_S1_Rest_2', 'QC_S1_Rest_3', 'QC_S1_Rest_4', 'QC_S1_Rest_5', 'QC_S1_Rest_6', 'QC_S2_Anat', 'QC_S2_Rest_1', 'QC_S2_Rest_2', 'ScanDir ID', 'ScanDirID', 'Secondary Dx', 'Site', 'Study #', 'Verbal IQ', 'site']


## Subject ID normalization

Subject identifiers appear in different formats across data sources.
To ensure correct merging, all subject IDs are normalized to
7-digit zero-padded strings before joining time-course and phenotypic data.

In [27]:
from typing import Optional

# Choose columns explicitly (based on the column listing printed above)
subj_col = "ScanDir ID"
dx_col = "DX"

def norm_subject_id(x) -> Optional[str]:
    """
    Convert ScanDir ID to a 7-digit string with leading zeros if needed.
    Handles floats like 1018959.0 safely; returns None for missing values.
    """
    if pd.isna(x):
        return None
    # convert to int safely (drops .0)
    xi = int(float(x))
    return f"{xi:07d}"

# Normalize phenotypic subject IDs
pheno = pheno.copy()
pheno["subject_id"] = pheno[subj_col].apply(norm_subject_id)

# Normalize df_tc subject IDs (folders already strings but ensure 7 digits)
df_tc = df_tc.copy()
df_tc["subject_id"] = df_tc["subject_id"].apply(lambda s: f"{int(s):07d}")

print("Example normalized IDs (pheno):", pheno["subject_id"].dropna().head().tolist())
print("Example normalized IDs (tc):   ", df_tc["subject_id"].head().tolist())


Example normalized IDs (pheno): ['1018959', '1019436', '1043241', '1266183', '1535233']
Example normalized IDs (tc):    ['0010001', '0010002', '0010003', '0010004', '0010005']


In [28]:
df = df_tc.merge(
    pheno[["site", "subject_id", dx_col]],
    on=["site", "subject_id"],
    how="left"
)

print("Time-course subjects:", len(df_tc))
print("Merged rows:", len(df))
print("Missing DX after merge:", df[dx_col].isna().sum())

# See which sites are missing most labels (useful debugging)
missing_by_site = df[df[dx_col].isna()].groupby("site")["subject_id"].count().sort_values(ascending=False)
print("\nMissing DX by site:\n", missing_by_site.head(20))


Time-course subjects: 965
Merged rows: 965
Missing DX after merge: 256

Missing DX by site:
 site
WashU         59
Peking_1      51
NYU           41
OHSU          34
Brown         26
NeuroIMAGE    25
KKI           11
Pittsburgh     9
Name: subject_id, dtype: int64
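The column listing above contains both `ScanDir ID` and `ScanDirID`, while the merge reads only `ScanDir ID`. One plausible explanation for the 59 unlabeled WashU subjects, not verified here, is that at least one site file uses the spelling without a space, so its IDs never reach `subject_id`. The sketch below shows one way to coalesce the two spellings on toy data. The frames `site_a`, `site_b` and `pheno_toy` and their values are made up.

```python
import pandas as pd

# Toy phenotypic rows: one site uses "ScanDir ID", the other "ScanDirID".
site_a = pd.DataFrame({"ScanDir ID": [1018959.0], "DX": [0], "site": ["SiteA"]})
site_b = pd.DataFrame({"ScanDirID": [15001], "DX": [1], "site": ["SiteB"]})
pheno_toy = pd.concat([site_a, site_b], ignore_index=True)

# Take "ScanDir ID" where present, otherwise fall back to "ScanDirID".
raw_id = pheno_toy["ScanDir ID"].combine_first(pheno_toy["ScanDirID"])

# Same normalization as norm_subject_id: 7-digit zero-padded strings.
pheno_toy["subject_id"] = raw_id.map(
    lambda x: None if pd.isna(x) else f"{int(float(x)):07d}"
)

print(pheno_toy[["site", "subject_id", "DX"]])
```

Applying this to the real data would change the subject counts reported in this notebook, so those counts would need to be regenerated. The remaining unlabeled rows match the per-site sizes of the test release, whose labels are withheld.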


## Label definition

We formulate a binary classification task:
- `DX == 0` → Control (label = 0)
- `DX > 0` → ADHD (label = 1)

All ADHD subtypes are collapsed into a single class, consistent with the original paper.

In [29]:
df["dx_raw"] = pd.to_numeric(df[dx_col], errors="coerce")

# ADHD-200 convention: DX==0 is typically TDC; DX>0 ADHD (subtypes)
df["label"] = np.where(df["dx_raw"].isna(), np.nan,
                       np.where(df["dx_raw"] == 0, 0, 1))

print("DX raw counts:\n", df["dx_raw"].value_counts(dropna=False).sort_index())
print("\nBinary label counts:\n", pd.Series(df["label"]).value_counts(dropna=False))


DX raw counts:
 dx_raw
0.0    429
1.0    159
2.0     11
3.0    110
NaN    256
Name: count, dtype: int64

Binary label counts:
 label
0.0    429
1.0    280
NaN    256
Name: count, dtype: int64


In [30]:
df_train = df.dropna(subset=["label"]).copy()
df_train["label"] = df_train["label"].astype(int)

out_path = DATA_PROCESSED / "subjects_train.csv"
df_train.to_csv(out_path, index=False)

print("Saved:", out_path)
print("Train subjects:", len(df_train))
df_train.head()


Saved: /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/processed/subjects_train.csv
Train subjects: 709


| | site | subject_id | tc_path | T | R | DX | dx_raw | label |
|---|---|---|---|---|---|---|---|---|
| 0 | NYU | 10001 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 | 3.0 | 3.0 | 1 |
| 1 | NYU | 10002 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 | 3.0 | 3.0 | 1 |
| 2 | NYU | 10003 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 | 0.0 | 0.0 | 0 |
| 3 | NYU | 10004 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 | 0.0 | 0.0 | 0 |
| 4 | NYU | 10005 | /Users/mariaborca/Documents/AI_2023-2026/Semes... | 172 | 116 | 2.0 | 2.0 | 1 |


## Replication site selection

To match the experimental setup of the original study, we restrict the dataset
to the same acquisition sites used in the paper.

**Kept sites:** KKI, NeuroIMAGE, NYU, OHSU, PKU (`Peking_1`)  
**Excluded sites:** Pittsburgh, WashU, Brown, Peking_2, Peking_3

In [31]:
SITES_KEEP = {"KKI", "NeuroIMAGE", "NYU", "OHSU", "Peking_1"}
SITES_DROP = {"WashU", "Pittsburgh", "Brown", "Peking_2", "Peking_3"}

print("All sites in df_tc:", sorted(df_tc["site"].unique()))
print("Keeping:", sorted(SITES_KEEP))
print("Dropping:", sorted(SITES_DROP))


All sites in df_tc: ['Brown', 'KKI', 'NYU', 'NeuroIMAGE', 'OHSU', 'Peking_1', 'Peking_2', 'Peking_3', 'Pittsburgh', 'WashU']
Keeping: ['KKI', 'NYU', 'NeuroIMAGE', 'OHSU', 'Peking_1']
Dropping: ['Brown', 'Peking_2', 'Peking_3', 'Pittsburgh', 'WashU']


In [32]:
df_train_paper = df_train[df_train["site"].isin(SITES_KEEP)].copy()

print("Train subjects (paper sites):", len(df_train_paper))
print("By site:\n", df_train_paper["site"].value_counts())
print("\nBy label:\n", df_train_paper["label"].value_counts())
print("\nBy site+label:\n", pd.crosstab(df_train_paper["site"], df_train_paper["label"]))

Train subjects (paper sites): 511
By site:
 site
NYU           216
Peking_1       85
KKI            83
OHSU           79
NeuroIMAGE     48
Name: count, dtype: int64

By label:
 label
0    285
1    226
Name: count, dtype: int64

By site+label:
 label        0    1
site               
KKI         61   22
NYU         98  118
NeuroIMAGE  23   25
OHSU        42   37
Peking_1    61   24


In [33]:
out_path = DATA_PROCESSED / "subjects_train_paper.csv"
df_train_paper.to_csv(out_path, index=False)
print("Saved:", out_path)

Saved: /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/processed/subjects_train_paper.csv


## Dataset summary

We report dataset statistics before and after site filtering, including:
- number of subjects,
- per-site distributions,
- class balance.

These summaries are used to verify consistency with the dataset reported in the original study.

In [34]:
def summarize(name, df_):
    print(f"\n=== {name} ===")
    print("n:", len(df_))
    print("sites:", dict(df_["site"].value_counts()))
    print("labels:", dict(df_["label"].value_counts()))

summarize("df_train (all labeled sites)", df_train)
summarize("df_train_paper (paper sites only)", df_train_paper)


=== df_train (all labeled sites) ===
n: 709
sites: {'NYU': np.int64(216), 'Pittsburgh': np.int64(89), 'Peking_1': np.int64(85), 'KKI': np.int64(83), 'OHSU': np.int64(79), 'Peking_2': np.int64(67), 'NeuroIMAGE': np.int64(48), 'Peking_3': np.int64(42)}
labels: {0: np.int64(429), 1: np.int64(280)}

=== df_train_paper (paper sites only) ===
n: 511
sites: {'NYU': np.int64(216), 'Peking_1': np.int64(85), 'KKI': np.int64(83), 'OHSU': np.int64(79), 'NeuroIMAGE': np.int64(48)}
labels: {0: np.int64(285), 1: np.int64(226)}


## Notes on replication fidelity

- Athena-preprocessed AAL ROI time courses are used directly.
- No additional rs-fMRI preprocessing is applied.
- Missing values in ROI time series are handled via column-wise mean imputation.
- Only one filtered scan per subject is retained.

Any deviations from the original study are documented explicitly as part of a
partial replication.
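The column-wise mean imputation noted above can be illustrated on a toy matrix. This sketch mirrors the logic of `load_athena_aal_1d`: each missing value is replaced by the mean of its ROI column, and a column with no valid values falls back to 0. The matrix is made up for illustration.

```python
import warnings
import numpy as np

# Toy (T=4, R=3) matrix; the last ROI column is entirely missing.
X = np.array([
    [1.0,    2.0,    np.nan],
    [np.nan, 4.0,    np.nan],
    [3.0,    np.nan, np.nan],
    [5.0,    6.0,    np.nan],
])

# nanmean warns on an all-NaN column and returns NaN for it; silence the warning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    col_means = np.nanmean(X, axis=0)

# Columns with no valid samples fall back to 0.0.
col_means = np.where(np.isnan(col_means), 0.0, col_means)

# Replace every NaN with the mean of its own column.
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]

print(X)
```

In the three files inspected earlier the NaN count was 0, so this path may be exercised rarely, if at all.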

## Step 8: Build test subject manifest

In [35]:
# --- Load TestRelease phenotypic CSVs ---
test_pheno_files = sorted([
    p for p in PHENO_ROOT.rglob("*_TestRelease_phenotypic.csv")
])

print("Test phenotypic CSVs:", len(test_pheno_files))
for p in test_pheno_files:
    print(" ", p.relative_to(PROJECT_ROOT))

dfs_test = []
for p in test_pheno_files:
    site = p.parent.name
    dfp = pd.read_csv(p)
    dfp.columns = [c.strip() for c in dfp.columns]
    dfp["site"] = site
    dfs_test.append(dfp)

pheno_test = pd.concat(dfs_test, ignore_index=True)
print("Combined test phenotypic rows:", len(pheno_test))


# --- Normalize subject IDs (same function as training) ---
pheno_test = pheno_test.copy()
pheno_test["subject_id"] = pheno_test["ScanDir ID"].apply(norm_subject_id)

print("Example normalized test IDs:", pheno_test["subject_id"].dropna().head().tolist())


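# --- Merge test phenotypes with time-course manifest ---
# Note: how="left" keeps every row of df_tc (training subjects included),
# so the counts printed below cover all time-course subjects; only rows
# matched in the TestRelease files receive a DX value.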
df_test = df_tc.merge(
    pheno_test[["site", "subject_id", "DX"]],
    on=["site", "subject_id"],
    how="left"
)

print("Test subjects with time courses:", len(df_test))
print("Missing DX in test (all sites):", df_test["DX"].isna().sum())


# --- Apply paper site filtering ---
df_test_paper = df_test[df_test["site"].isin(SITES_KEEP)].copy()

print("Test subjects (paper sites):", len(df_test_paper))
print("Missing DX (paper sites):", df_test_paper["DX"].isna().sum())


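# --- Drop unlabeled test subjects (as in the paper) ---
# "withheld" DX strings become NaN; together with the unmatched training rows
# (already NaN after the left merge) they are removed by dropna below.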
df_test_paper["DX"] = df_test_paper["DX"].replace({"withheld": np.nan, "Withheld": np.nan})

df_test_paper_labeled = df_test_paper.dropna(subset=["DX"]).copy()

print("Labeled test subjects (paper sites):", len(df_test_paper_labeled))
print("By site:\n", df_test_paper_labeled["site"].value_counts())


# --- Save test manifest ---
out_path = DATA_PROCESSED / "subjects_test_paper.csv"
df_test_paper_labeled.to_csv(out_path, index=False)

print("Saved:", out_path)


# --- Final consistency check: train + test totals ---
n_train = len(df_train_paper)
n_test = len(df_test_paper_labeled)

print("\n=== Final dataset size (paper-faithful) ===")
print("Train subjects:", n_train)
print("Test subjects:", n_test)
print("TOTAL (train + test):", n_train + n_test)

Test phenotypic CSVs: 7
  data/raw/phenotypic/Brown/Brown_TestRelease_phenotypic.csv
  data/raw/phenotypic/KKI/KKI_TestRelease_phenotypic.csv
  data/raw/phenotypic/NYU/NYU_TestRelease_phenotypic.csv
  data/raw/phenotypic/NeuroIMAGE/NeuroIMAGE_TestRelease_phenotypic.csv
  data/raw/phenotypic/OHSU/OHSU_TestRelease_phenotypic.csv
  data/raw/phenotypic/Peking_1/Peking_1_TestRelease_phenotypic.csv
  data/raw/phenotypic/Pittsburgh/Pittsburgh_TestRelease_phenotypic.csv
Combined test phenotypic rows: 197
Example normalized test IDs: ['0026001', '0026002', '0026004', '0026005', '0026009']
Test subjects with time courses: 965
Missing DX in test (all sites): 768
Test subjects (paper sites): 673
Missing DX (paper sites): 511
Labeled test subjects (paper sites): 0
By site:
 Series([], Name: count, dtype: int64)
Saved: /Users/mariaborca/Documents/AI_2023-2026/Semestrul_5/KBS/Report_3/adhd-tcn-replication/data/processed/subjects_test_paper.csv

=== Final dataset size (paper-faithful) ===
Train subjec



## Note on Test Set Labels and Evaluation

The ADHD-200 dataset was originally released as part of the ADHD-200 Global
Competition, in which the data were explicitly divided into a **training set**
and a **test set**.

For the test set, **diagnostic labels (DX) are intentionally withheld** and
provided as the string `"withheld"` in the phenotypic files. This design choice
prevents label leakage and enables fair benchmarking, but it also means that
the official test set **cannot be used for quantitative evaluation** (e.g.,
accuracy, AUC) without access to the hidden ground-truth labels.

As a result:
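- The test-release subjects are **unlabeled** and can only be used for
  **prediction/inference**, not for performance evaluation. Because the cell
  above drops every row whose DX is withheld, the saved
  `subjects_test_paper.csv` contains 0 subjects in the recorded run.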
- All reported evaluation metrics in this replication study are computed using
  labeled subjects from the training set, following the protocol described in
  the original paper.
- Any additional validation is performed via splits or cross-validation within
  the labeled training data.

This limitation is inherent to the ADHD-200 dataset and is **not related to the
use of Athena-preprocessed data**.
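If an unlabeled manifest for prediction-only use is wanted, one option is to join on the test phenotypic rows and keep the withheld entries instead of dropping them. The following is a minimal sketch on toy data, not the procedure used in this notebook. The frames `tc_toy` and `pheno_test_toy`, the `has_label` column and all IDs are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_tc and pheno_test (IDs and paths are made up).
tc_toy = pd.DataFrame({
    "site": ["NYU", "NYU", "KKI"],
    "subject_id": ["0000001", "0000002", "0000003"],
    "tc_path": ["a.1D", "b.1D", "c.1D"],
})
pheno_test_toy = pd.DataFrame({
    "site": ["NYU", "KKI"],
    "subject_id": ["0000002", "0000003"],
    "DX": ["withheld", "withheld"],
})

# An inner join keeps only subjects that appear in the test phenotypic files.
test_manifest = tc_toy.merge(
    pheno_test_toy, on=["site", "subject_id"], how="inner"
)

# Flag missing labels instead of dropping the rows.
test_manifest["has_label"] = (
    test_manifest["DX"].astype(str).str.lower() != "withheld"
)

print(test_manifest)
```

With this variant the training-only subject `0000001` is excluded, and the two test subjects remain available for inference with `has_label` set to `False`.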

## Rationale for Training–Validation Split

Because diagnostic labels for the official ADHD-200 test release are
intentionally withheld, the test set cannot be used for supervised model
evaluation. In order to report quantitative performance metrics (e.g.,
classification accuracy), a labeled held-out set is required.

To address this limitation, we further partition the labeled training data
into a **training split** and a **validation split**. This validation split is
used exclusively for model selection and performance evaluation, while the
training split is used for parameter learning.

The split is performed in a stratified manner with respect to the diagnostic
label to preserve class balance. A fixed random seed is used to ensure
reproducibility. This procedure follows common practice in replication
studies when benchmark test labels are unavailable and does not affect the
construction of the official test set, which remains reserved for
prediction-only analysis.

In [36]:
from sklearn.model_selection import train_test_split

df_split_source = df_train_paper.copy()

train_idx, val_idx = train_test_split(
    df_split_source.index,
    test_size=0.2,
    random_state=42,
    stratify=df_split_source["label"]
)

df_train_split = df_split_source.loc[train_idx].copy()
df_val_split   = df_split_source.loc[val_idx].copy()

df_train_split.to_csv(DATA_PROCESSED / "subjects_train_split_paper.csv", index=False)
df_val_split.to_csv(DATA_PROCESSED / "subjects_val_split_paper.csv", index=False)

print("Saved train split:", len(df_train_split))
print("Saved val split  :", len(df_val_split))
print("Val label counts:\n", df_val_split["label"].value_counts())


Saved train split: 408
Saved val split  : 103
Val label counts:
 label
0    57
1    46
Name: count, dtype: int64
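As a quick sanity check on the stratification, the class proportions of the two splits can be compared with those of the source set. The sketch below uses only the counts printed in this notebook; the training-split counts are derived by subtraction rather than read from an output.

```python
# Class counts copied from the outputs above (label 0 = control, label 1 = ADHD).
paper_counts = {0: 285, 1: 226}  # df_train_paper
val_counts = {0: 57, 1: 46}      # df_val_split

# The training split is whatever remains after removing the validation rows.
train_counts = {k: paper_counts[k] - val_counts[k] for k in paper_counts}

def adhd_fraction(counts):
    """Fraction of subjects with label 1."""
    return counts[1] / (counts[0] + counts[1])

for name, counts in [
    ("paper", paper_counts),
    ("train", train_counts),
    ("val", val_counts),
]:
    n = sum(counts.values())
    print(f"{name:5s} n={n:3d}  ADHD fraction={adhd_fraction(counts):.3f}")
```

The three fractions agree to within about half a percentage point, which is what a label-stratified split should produce.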
