# Recurrent Neural Network

### Preprocessing the files

You are working with resting-state fMRI data from the ADHD-200 dataset, using the preprocessed time series extracted with the CC200 brain atlas. Each subject’s `.1D` file contains the BOLD signal for 190 brain regions (ROIs) over a number of timepoints that depends on the site and scan (149 in some files, up to 256 per subject in the data loaded below), capturing the temporal dynamics of brain activity during rest. Rather than collapsing this temporal information into static functional connectivity (FC) matrices via Pearson correlation, you retain the full per-region time series to train Recurrent Neural Networks (RNNs), which are well suited to modeling sequential data. The objective is to classify subjects into ADHD and control groups by identifying temporal patterns in their brain activity.
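
To make the contrast concrete, here is a minimal sketch of the two representations for a single subject. It uses random numbers in place of a real `.1D` file, and the names `ts`, `fc`, and `rnn_input` as well as the 149-timepoint length are illustrative assumptions, not part of the pipeline below.

```python
import numpy as np

# Synthetic stand-in for one subject's time series: (timepoints, ROIs).
# Real data would come from a .1D file; 149 timepoints is only illustrative.
rng = np.random.default_rng(0)
n_timepoints, n_rois = 149, 190
ts = rng.standard_normal((n_timepoints, n_rois))

# Static FC: Pearson correlation between every pair of ROIs.
# rowvar=False tells NumPy that columns (ROIs) are the variables.
fc = np.corrcoef(ts, rowvar=False)   # shape (190, 190); time axis is gone

# RNN input: keep the time axis and add a leading batch dimension.
rnn_input = ts[np.newaxis, ...]      # shape (1, 149, 190)

print(fc.shape, rnn_input.shape)
```

The FC matrix has the same shape regardless of scan length, which is convenient but discards the ordering of timepoints. The RNN input keeps that ordering, at the cost of needing a common sequence length across subjects.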

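You will be using data from multiple acquisition sites within the ADHD-200 dataset, including **KKI, NeuroIMAGE, NYU, OHSU, Peking_1, Peking_2, Peking_3**, and **Pittsburgh**, all of which contain `.1D` time series files in the same format. Note that the `sites` list in the loading code below covers the first seven; Pittsburgh is not in that list and therefore does not contribute subjects to the final dataset. The per-site data are merged and aligned with the corresponding phenotypic information so that each subject is labeled correctly during model training. The phenotypic files also contain fields such as age and medication status, but only the diagnosis column (`DX`) is kept as the label.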

In [5]:
import os
import numpy as np
import pandas as pd

# List of sites to process
sites = ["KKI", "NYU", "OHSU", "Peking_1", "Peking_2", "Peking_3", "NeuroIMAGE"]

# Set a fixed target sequence length (max seen = 256)
TARGET_LENGTH = 260

# Containers
X_all = []
y_all = []
subject_ids_all = []

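# Function to pad or truncate time series
def pad_to_length(array, target_length):
    """Return `array` (timepoints x ROIs) with exactly `target_length` rows."""
    current_len = array.shape[0]
    if current_len >= target_length:
        # Longer series are cut; timepoints past target_length are discarded
        return array[:target_length, :]
    else:
        # Shorter series get rows of zeros appended at the end
        pad_len = target_length - current_len
        pad = np.zeros((pad_len, array.shape[1]))
        return np.vstack([array, pad])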

# Main loop through sites
for site in sites:
    print(f"\n Processing site: {site}")

    pheno_csv = site + "_phenotypic.csv"
    phenotype_path = os.path.join("fMRI/ADHD200_CC200_TCs_filtfix", site, pheno_csv)

    try:
        phenotype_df = pd.read_csv(phenotype_path)
    except FileNotFoundError:
        print(f" Phenotype file missing for {site}, skipping...")
        continue

    phenotype_df["Subject_ID"] = phenotype_df["ScanDir ID"].astype(str).str.zfill(7)
    phenotype_df = phenotype_df[["Subject_ID", "DX"]]
    phenotype_df.set_index("Subject_ID", inplace=True)

    base_folder = f"fMRI/ADHD200_CC200_TCs_filtfix/{site}/"
    rnn_data_dict = {}

    if not os.path.isdir(base_folder):
        print(f" Base folder missing for {site}, skipping...")
        continue

    for subject_id in sorted(os.listdir(base_folder)):
        subject_path = os.path.join(base_folder, subject_id)

        if os.path.isdir(subject_path):
            rest_paths = [
                os.path.join(subject_path, f"sfnwmrda{subject_id}_session_1_rest_1_cc200_TCs.1D"),
                os.path.join(subject_path, f"sfnwmrda{subject_id}_session_1_rest_2_cc200_TCs.1D"),
                os.path.join(subject_path, f"sfnwmrda{subject_id}_session_1_rest_3_cc200_TCs.1D")
            ]

            merged_data = []
            for path in rest_paths:
                if os.path.exists(path):
                    try:
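                        # Whitespace-delimited file; the first row (column names) is skipped
                        fmri_data = pd.read_csv(path, sep=r'\s+', header=None, skiprows=1)
                        # Drop the first two non-signal columns, keep the ROI signals
                        fmri_data = fmri_data.iloc[:, 2:].astype(float).to_numpy()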
                        merged_data.append(fmri_data)
                    except Exception as e:
                        print(f" Error reading {path}: {e}")

            if merged_data:
                full_time_series = np.vstack(merged_data)
                rnn_data_dict[subject_id] = full_time_series

    # Align with phenotype
    valid_subjects = [sid for sid in rnn_data_dict if sid in phenotype_df.index]
    print(f" {len(valid_subjects)} valid subjects found in {site}")

    if valid_subjects:
        X_site = np.stack([
            pad_to_length(rnn_data_dict[sid], TARGET_LENGTH)
            for sid in valid_subjects
        ])
        y_site = phenotype_df.loc[valid_subjects, 'DX'].astype(int).values

        X_all.append(X_site)
        y_all.append(y_site)
        subject_ids_all.extend(valid_subjects)  # Keep subject ID order

# Final dataset assembly
X_rnn = np.concatenate(X_all, axis=0)
y_labels = np.concatenate(y_all, axis=0)
subject_ids_all = np.array(subject_ids_all)

# Sanity checks
print("\n Final dataset ready:")
print(f"X_rnn shape: {X_rnn.shape}")      # (n_subjects, 260, 190)
print(f"y_labels shape: {y_labels.shape}")  # (n_subjects,)
print(f"subject_ids_all shape: {subject_ids_all.shape}")



 Processing site: KKI
 83 valid subjects found in KKI

 Processing site: NYU
 216 valid subjects found in NYU

 Processing site: OHSU
 79 valid subjects found in OHSU

 Processing site: Peking_1
 85 valid subjects found in Peking_1

 Processing site: Peking_2
 67 valid subjects found in Peking_2

 Processing site: Peking_3
 42 valid subjects found in Peking_3

 Processing site: NeuroIMAGE
 48 valid subjects found in NeuroIMAGE

 Final dataset ready:
X_rnn shape: (620, 260, 190)
y_labels shape: (620,)
subject_ids_all shape: (620,)
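
The loop above stores `DX` as its raw integer code. For the two-class ADHD versus control objective, the codes would need to be collapsed to binary before training. The sketch below assumes the usual ADHD-200 convention, in which 0 marks typically developing controls and non-zero codes mark ADHD subtypes; check this against the phenotypic key before relying on it. The array `dx_codes` is made-up example data, not values from the dataset.

```python
import numpy as np

# Hypothetical raw DX codes for six subjects (illustrative values only).
# Assumed convention: 0 = control, 1-3 = ADHD subtypes.
dx_codes = np.array([0, 1, 3, 0, 2, 1])

# Collapse every non-zero code into a single ADHD class.
y_binary = (dx_codes > 0).astype(int)

print(y_binary)  # [0 1 1 0 1 1]
```

In the notebook, the same expression could be applied to `y_labels` to obtain binary targets, provided the coding assumption holds.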


In [7]:
# Save to disk
np.savez("rnn_dataset_with_subjects.npz", X=X_rnn, y=y_labels, subject_ids=subject_ids_all)
print(" Saved to rnn_dataset_with_subjects.npz")

 Saved to rnn_dataset_with_subjects.npz
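
Because shorter series are zero-padded, a model generally needs to know which timesteps are real. The sketch below round-trips a small synthetic dataset through `np.savez` and `np.load`, then recovers a padding mask and per-subject lengths. The toy shapes, the file name `toy_rnn_dataset.npz`, and the rule that a timestep counts as padding when all ROI values are exactly zero are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

# Build a toy padded dataset: 3 subjects, 8 timesteps, 4 ROIs.
rng = np.random.default_rng(0)
true_lengths = [5, 8, 3]
n_subjects, target_length, n_rois = 3, 8, 4
X = np.zeros((n_subjects, target_length, n_rois))
for i, n in enumerate(true_lengths):
    X[i, :n, :] = rng.standard_normal((n, n_rois))  # real signal, then zeros
y = np.array([0, 1, 1])
ids = np.array(["0000001", "0000002", "0000003"])

# Save and reload with the same keys used in the notebook (X, y, subject_ids).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_rnn_dataset.npz")
    np.savez(path, X=X, y=y, subject_ids=ids)
    with np.load(path) as data:
        X_loaded = data["X"]
        y_loaded = data["y"]
        ids_loaded = data["subject_ids"]

# A timestep is treated as real if any ROI value is non-zero.
mask = np.any(X_loaded != 0, axis=2)   # shape (n_subjects, target_length)
lengths = mask.sum(axis=1)             # number of real timesteps per subject

print(lengths)  # [5 8 3]
```

This heuristic would misclassify a genuine timestep whose ROI values are all exactly zero. Recording each subject's original length at load time would be more robust.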
