# PIPELINE

Correctly loads the 4 files from TRAIN_NEW.

Merges the tables on the participant_id ID.

Separates categorical, quantitative, and fMRI variables.

Scales the numeric variables with StandardScaler.

Encodes the categorical variables with OneHotEncoder.

Splits the dataset into stratified train/test sets.

Produces matrices ready for training ML models.

Whenever we work with fMRI connectomes (~20,000 features), the standard steps are:

Imputation

Scaling

Reduction (PCA)

Classifier (RF, SVM, XGB)
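
The four steps listed above can be chained in a single scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data, not the notebook's actual configuration: the array sizes, `n_components=20`, and the names `X_demo`, `y_demo`, and `connectome_pipeline` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a connectome matrix (assumption: not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 500))
X_demo[rng.random(X_demo.shape) < 0.01] = np.nan  # sprinkle in missing values
y_demo = rng.integers(0, 2, size=120)

# Imputation -> scaling -> reduction -> classifier, in that order.
connectome_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=20, random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

connectome_pipeline.fit(X_demo, y_demo)
demo_pred = connectome_pipeline.predict(X_demo)
print(demo_pred.shape)
```

Keeping every step inside one pipeline means each transformer is fitted only on the data passed to `fit`, which helps avoid leaking test information into the imputer, scaler, or PCA.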

In [None]:
import os

for root, dirs, files in os.walk(".", topdown=True):
    for d in dirs:
        print("DIR:", os.path.join(root, d))
    for f in files:
        print("FILE:", os.path.join(root, f))


In [1]:
# ===============================
# IMPORTS FOR PREPROCESSING & MODELING
# ===============================

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


In [2]:
# ===============================
# 1. LOAD FILES
# ===============================


base_path = "TRAIN_NEW/"

path_cat   = base_path + "TRAIN_CATEGORICAL_METADATA_new.xlsx"
path_quant = base_path + "TRAIN_QUANTITATIVE_METADATA_new.xlsx"
path_conn  = base_path + "TRAIN_FUNCTIONAL_CONNECTOME_MATRICES_new_36P_Pearson.csv"
path_sol   = base_path + "TRAINING_SOLUTIONS.xlsx"

# Load the datasets
df_cat   = pd.read_excel(path_cat)
df_quant = pd.read_excel(path_quant)
df_conn  = pd.read_csv(path_conn)
df_sol   = pd.read_excel(path_sol)

# ID column
id_col = "participant_id"

print("Categorical cols:", df_cat.columns.tolist())
print("Quantitative cols:", df_quant.columns.tolist())
print("Connectome cols:", df_conn.columns[:10].tolist())
print("Solutions cols:", df_sol.columns.tolist())



Categorical cols: ['participant_id', 'Basic_Demos_Enroll_Year', 'Basic_Demos_Study_Site', 'PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race', 'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu', 'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu', 'Barratt_Barratt_P2_Occ']
Quantitative cols: ['participant_id', 'EHQ_EHQ_Total', 'ColorVision_CV_Score', 'APQ_P_APQ_P_CP', 'APQ_P_APQ_P_ID', 'APQ_P_APQ_P_INV', 'APQ_P_APQ_P_OPD', 'APQ_P_APQ_P_PM', 'APQ_P_APQ_P_PP', 'SDQ_SDQ_Conduct_Problems', 'SDQ_SDQ_Difficulties_Total', 'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Externalizing', 'SDQ_SDQ_Generating_Impact', 'SDQ_SDQ_Hyperactivity', 'SDQ_SDQ_Internalizing', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial', 'MRI_Track_Age_at_Scan']
Connectome cols: ['participant_id', '0throw_1thcolumn', '0throw_2thcolumn', '0throw_3thcolumn', '0throw_4thcolumn', '0throw_5thcolumn', '0throw_6thcolumn', '0throw_7thcolumn', '0throw_8thcolumn', '0throw_9thcolumn']
Solutions cols: ['participant_id', 'ADHD_Out

In [3]:
# ===============================
# 2. MERGE TABLES ON participant_id
# ===============================

# Merge metadata
df_meta = df_cat.merge(df_quant, on=id_col, how="inner")

# Merge connectome (fMRI)
df_all = df_meta.merge(df_conn, on=id_col, how="inner")

# Merge targets
df_all = df_all.merge(df_sol, on=id_col, how="inner")

print("Final merged shape:", df_all.shape)

Final merged shape: (1213, 19930)
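
An inner join silently drops any participant missing from one of the tables, and duplicated IDs would multiply rows. One way to guard against both, sketched here on toy tables (the column values and the names `left`, `right`, and `audit` are hypothetical), is pandas' `validate` and `indicator` options for `merge`:

```python
import pandas as pd

# Toy tables standing in for two of the metadata files (hypothetical values).
left = pd.DataFrame({"participant_id": ["a", "b", "c"], "site": [1, 2, 1]})
right = pd.DataFrame({"participant_id": ["a", "b", "d"], "age": [9.5, 11.0, 8.2]})

# validate="one_to_one" raises MergeError if an ID is duplicated on either side.
merged = left.merge(right, on="participant_id", how="inner", validate="one_to_one")

# An outer merge with indicator=True shows which IDs an inner join would drop.
audit = left.merge(right, on="participant_id", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", "participant_id"].tolist()

print(merged.shape)     # (2, 3)
print(sorted(dropped))  # ['c', 'd']
```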


In [4]:
# ===============================
# 3. SEPARATE FEATURES AND TARGET
# ===============================

target_col = "ADHD_Outcome"
sex_col = "Sex_F"   # keep for later analysis

y = df_all[target_col]
sex = df_all[sex_col]

# Drop ID + target columns from features
X = df_all.drop(columns=[id_col, target_col, sex_col])


In [5]:
# ===============================
# 4. IDENTIFY CATEGORICAL / NUMERICAL COLUMNS
# ===============================

cat_cols = df_cat.columns.drop(id_col).tolist()
num_cols_meta = df_quant.columns.drop(id_col).tolist()

# Connectome columns = all columns in df_conn except ID
conn_cols = [c for c in df_conn.columns if c != id_col]

num_cols = num_cols_meta + conn_cols  # all numerical data

print("Categorical features:", len(cat_cols))
print("Numerical features:", len(num_cols))

Categorical features: 9
Numerical features: 19918


In [6]:

# ===============================
# 5. PREPROCESSING PIPELINE (with imputation)
# ===============================

# Numeric pipeline: median imputation + scaling
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical pipeline: mode imputation + one-hot encoding
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols),
    ]
)



In [7]:
# ===============================
# 6. TRAIN/TEST SPLIT
# ===============================

X_train, X_test, y_train, y_test, sex_train, sex_test = train_test_split(
    X, y, sex,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [8]:
# ===============================
# 7. FIT TRANSFORMER ON TRAIN 
# ===============================

X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)

print("Train processed shape:", X_train_proc.shape)
print("Test processed shape:", X_test_proc.shape)



Train processed shape: (970, 19980)
Test processed shape: (243, 19980)


# Random Forest Classifier (preliminary analysis)

Handles high dimensionality well (~20,000 features).

Tolerates noise (very common in connectomes).

Requires no distributional assumptions.

Provides a clear first metric.

Yields feature importances (although SHAP later improves on them).

1. Train the RandomForest

2. Obtain:
F1-score
Precision
Recall
Confusion matrix

3. Check for bias by sex

# Preliminary Random Forest analysis

In [9]:


# ===============================
# 8. RANDOM FOREST BASELINE
# ===============================

# Baseline model (simple settings)
rf = RandomForestClassifier(
    n_estimators=300,       # Number of trees
    max_depth=None,        # Let the trees grow fully
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    class_weight="balanced"   # Important for ADHD (imbalanced classes)
)

# Train the model
rf.fit(X_train_proc, y_train)

# Predict
y_pred = rf.predict(X_test_proc)



In [10]:
# ===============================
# 9. PERFORMANCE METRICS
# ===============================

print("=== Classification Report ===")
print(classification_report(y_test, y_pred))

print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred))


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        77
           1       0.68      1.00      0.81       166

    accuracy                           0.68       243
   macro avg       0.34      0.50      0.41       243
weighted avg       0.47      0.68      0.55       243

=== Confusion Matrix ===
[[  0  77]
 [  0 166]]
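
The confusion matrix shows that the forest predicts class 1 for every test sample, so the 0.68 accuracy is simply the share of the majority class. A metric that averages per-class recall makes this visible. The sketch below recomputes it from the matrix printed above in plain Python; the helper names are illustrative, not part of the notebook.

```python
def per_class_recall(cm):
    """Recall of each class from a confusion matrix (rows = true, cols = predicted)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def balanced_accuracy(cm):
    """Mean of the per-class recalls."""
    recalls = per_class_recall(cm)
    return sum(recalls) / len(recalls)


# Confusion matrix reported for the Random Forest baseline.
rf_cm = [[0, 77], [0, 166]]

accuracy = sum(rf_cm[i][i] for i in range(2)) / sum(map(sum, rf_cm))
print(round(accuracy, 3))        # 0.683
print(per_class_recall(rf_cm))   # [0.0, 1.0]
print(balanced_accuracy(rf_cm))  # 0.5
```

A balanced accuracy of 0.5 is what a constant predictor achieves on a binary task, which is a more honest summary of this baseline than the raw accuracy.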




# PCA

In [28]:
from sklearn.decomposition import PCA

# Try 300 components (adjustable)
pca = PCA(n_components=300, random_state=42)

# Fit PCA on the training set only
X_train_pca = pca.fit_transform(X_train_proc)
X_test_pca = pca.transform(X_test_proc)

print("Shape after PCA:", X_train_pca.shape, X_test_pca.shape)


Shape after PCA: (970, 300) (243, 300)


In [29]:
print("Total explained variance:", np.sum(pca.explained_variance_ratio_))


Total explained variance: 0.7242815519697952
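
With 300 components retaining about 72% of the variance, an alternative is to let the variance target decide the number of components. The sketch below, on synthetic data (the matrix sizes, the 0.90 target, and all variable names are assumptions), shows two equivalent ways to do that: reading the cumulative explained-variance curve, or passing a fraction as `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with about 10 dominant directions plus noise (not the real connectomes).
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Option 1: fit all components and read off the cumulative variance curve.
pca_full = PCA(svd_solver="full").fit(X_demo)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
k90 = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest k reaching 90%

# Option 2: let PCA pick the number of components for a variance target.
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_demo)

print(k90, pca_90.n_components_)
```

As in the cells above, the PCA would still be fitted on the training split only.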


# XGBoost

With PCA done (300 components) and these variables available:

X_train_pca, X_test_pca

y_train, y_test

Define the model, train, and evaluate.

In [18]:
!pip install xgboost

Collecting xgboost
  Obtaining dependency information for xgboost from https://files.pythonhosted.org/packages/30/7d/41847e45ff075f3636c95d1000e0b75189aed4f1ae18c36812575bb42b4b/xgboost-3.1.2-py3-none-win_amd64.whl.metadata
  Downloading xgboost-3.1.2-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-3.1.2-py3-none-win_amd64.whl (72.0 MB)

In [30]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# ===============================
# XGBoost with PCA (300 components)
# ===============================

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    n_jobs=-1
)

xgb.fit(X_train_pca, y_train)

y_pred_xgb = xgb.predict(X_test_pca)

print("=== XGBoost Classification Report ===")
print(classification_report(y_test, y_pred_xgb))

print("=== XGBoost Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred_xgb))


=== XGBoost Classification Report ===
              precision    recall  f1-score   support

           0       0.34      0.13      0.19        77
           1       0.69      0.89      0.77       166

    accuracy                           0.65       243
   macro avg       0.52      0.51      0.48       243
weighted avg       0.58      0.65      0.59       243

=== XGBoost Confusion Matrix ===
[[ 10  67]
 [ 19 147]]
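
Unlike the Random Forest and SVM cells, the XGBoost model is trained without any class weighting, which may contribute to the low recall on class 0. One possible remedy is per-sample weights. This sketch computes "balanced" weights with scikit-learn, using the test-set class counts (77 and 166) purely as illustrative numbers; in practice the weights would be computed from `y_train` and passed to `fit` through its `sample_weight` argument.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Illustrative labels with the class counts reported above (77 vs. 166).
y_demo = np.array([0] * 77 + [1] * 166)

# "balanced" assigns n_samples / (n_classes * class_count) to each sample.
weights = compute_sample_weight(class_weight="balanced", y=y_demo)

w0 = float(weights[y_demo == 0][0])
w1 = float(weights[y_demo == 1][0])
print(round(w0, 3), round(w1, 3))  # 1.578 0.732

# Both classes now carry the same total weight.
total0 = float(weights[y_demo == 0].sum())
total1 = float(weights[y_demo == 1].sum())
```

Whether this actually improves minority-class recall here is untested; it is a suggestion to try, not a reported result.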


# Interpretation by sex

In [31]:
# ===============================
# Evaluation by sex
# ===============================

# Masks for girls and boys in the test set
idx_female = (sex_test == 1)
idx_male = (sex_test == 0)

# Filter true and predicted labels by sex
y_test_f = y_test[idx_female]
y_pred_f = y_pred_xgb[idx_female]

y_test_m = y_test[idx_male]
y_pred_m = y_pred_xgb[idx_male]

print("=== XGBoost - Females (F) ===")
print(classification_report(y_test_f, y_pred_f))
print("Confusion matrix F:")
print(confusion_matrix(y_test_f, y_pred_f))

print("\n=== XGBoost - Males (M) ===")
print(classification_report(y_test_m, y_pred_m))
print("Confusion matrix M:")
print(confusion_matrix(y_test_m, y_pred_m))


=== XGBoost - Females (F) ===
              precision    recall  f1-score   support

           0       0.42      0.16      0.23        31
           1       0.62      0.86      0.72        49

    accuracy                           0.59        80
   macro avg       0.52      0.51      0.48        80
weighted avg       0.54      0.59      0.53        80

Confusion matrix F:
[[ 5 26]
 [ 7 42]]

=== XGBoost - Males (M) ===
              precision    recall  f1-score   support

           0       0.29      0.11      0.16        46
           1       0.72      0.90      0.80       117

    accuracy                           0.67       163
   macro avg       0.51      0.50      0.48       163
weighted avg       0.60      0.67      0.62       163

Confusion matrix M:
[[  5  41]
 [ 12 105]]
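
To compare the two groups directly, the per-class recalls can be read off the two confusion matrices above. The sketch below does this in plain Python; the helper `recalls` and the dictionary layout are illustrative choices, not part of the notebook.

```python
def recalls(cm):
    """Per-class recall from a confusion matrix (rows = true, cols = predicted)."""
    return tuple(row[i] / sum(row) for i, row in enumerate(cm))


# XGBoost confusion matrices reported above, split by sex.
xgb_cm = {
    "F": [[5, 26], [7, 42]],
    "M": [[5, 41], [12, 105]],
}

by_sex = {sex: recalls(cm) for sex, cm in xgb_cm.items()}
for sex, (r0, r1) in by_sex.items():
    print(f"{sex}: recall(0)={r0:.2f} recall(1)={r1:.2f}")
# F: recall(0)=0.16 recall(1)=0.86
# M: recall(0)=0.11 recall(1)=0.90

# Gap in ADHD recall (class 1) between males and females.
gap = by_sex["M"][1] - by_sex["F"][1]
print(f"{gap:.2f}")  # 0.04
```

On this split the ADHD recall is about 4 points lower for females, but with only 80 female test samples the difference should be read cautiously.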


# PCA + SVM (RBF)

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# ===============================
# FINAL PIPELINE: PCA + SVM (RBF)
# ===============================

pca_svm = Pipeline(steps=[
    ("pca", PCA(n_components=300, random_state=42)),
    ("svm", SVC(kernel="rbf",
                class_weight="balanced",
                probability=True,
                random_state=42))
])

# Training
pca_svm.fit(X_train_proc, y_train)

# Prediction
y_pred_pca_svm = pca_svm.predict(X_test_proc)

print("=== PIPELINE PCA + SVM (FINAL) ===")
print(classification_report(y_test, y_pred_pca_svm))
print(confusion_matrix(y_test, y_pred_pca_svm))


=== PIPELINE PCA + SVM (FINAL) ===
              precision    recall  f1-score   support

           0       0.32      0.35      0.34        77
           1       0.69      0.66      0.67       166

    accuracy                           0.56       243
   macro avg       0.50      0.50      0.50       243
weighted avg       0.57      0.56      0.56       243

[[ 27  50]
 [ 57 109]]
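
All results so far come from a single 80/20 split, so differences of a few points between models may be noise. A hedged way to get a steadier estimate is stratified k-fold cross-validation of the whole pipeline. The sketch below uses synthetic data from `make_classification`, and the sizes, `n_components=20`, and variable names are assumptions; it is not a rerun of the notebook's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the preprocessed features (not the real data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=100, n_informative=10,
    weights=[0.3, 0.7], random_state=42,
)

demo_pipe = Pipeline(steps=[
    ("pca", PCA(n_components=20, random_state=42)),
    ("svm", SVC(kernel="rbf", class_weight="balanced", random_state=42)),
])

# PCA is refitted inside every fold, so no fold sees its own validation data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")

print(scores.mean(), scores.std())
```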


In [34]:
# ===============================
# EVALUATION BY SEX: final PCA+SVM model
# ===============================

idx_female = (sex_test == 1)
idx_male   = (sex_test == 0)

y_test_f = y_test[idx_female]
y_pred_f = y_pred_pca_svm[idx_female]

y_test_m = y_test[idx_male]
y_pred_m = y_pred_pca_svm[idx_male]

print("=== PCA+SVM - Females ===")
print(classification_report(y_test_f, y_pred_f))
print(confusion_matrix(y_test_f, y_pred_f))

print("\n=== PCA+SVM - Males ===")
print(classification_report(y_test_m, y_pred_m))
print(confusion_matrix(y_test_m, y_pred_m))


=== PCA+SVM - Females ===
              precision    recall  f1-score   support

           0       0.49      0.55      0.52        31
           1       0.69      0.63      0.66        49

    accuracy                           0.60        80
   macro avg       0.59      0.59      0.59        80
weighted avg       0.61      0.60      0.60        80

[[17 14]
 [18 31]]

=== PCA+SVM - Males ===
              precision    recall  f1-score   support

           0       0.20      0.22      0.21        46
           1       0.68      0.67      0.68       117

    accuracy                           0.54       163
   macro avg       0.44      0.44      0.44       163
weighted avg       0.55      0.54      0.54       163

[[10 36]
 [39 78]]


# Explainability with SHAP

In [35]:
!pip install shap


Collecting shap

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
tables 3.8.0 requires blosc2~=2.0.0, which is not installed.
tables 3.8.0 requires cython>=0.29.21, which is not installed.



  Obtaining dependency information for shap from https://files.pythonhosted.org/packages/77/03/58e199cf59056d68b4a227ce4b2b09eeb0c9bd1d002b9e28fb574eed6200/shap-0.50.0-cp311-cp311-win_amd64.whl.metadata
  Downloading shap-0.50.0-cp311-cp311-win_amd64.whl.metadata (25 kB)
Collecting numpy>=2 (from shap)
  Obtaining dependency information for numpy>=2 from https://files.pythonhosted.org/packages/aa/44/9fe81ae1dcc29c531843852e2874080dc441338574ccc4306b39e2ff6e59/numpy-2.3.5-cp311-cp311-win_amd64.whl.metadata
  Downloading numpy-2.3.5-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting slicer==0.0.8 (from shap)
  Obtaining dependency information for slicer==0.0.8 from https://files.pythonhosted.org/packages/63/81/9ef641ff4e12cbcca30e54e72fb0951a2ba195d0cda0ba4100e53

In [36]:
import shap
import numpy as np


ImportError: 

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.11 from "C:\Users\AlessiaDerossi\anaconda3\python.exe"
  * The NumPy version is: "1.24.3"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: DLL load failed while importing _multiarray_umath: No se puede encontrar el módulo especificado.
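
The install log above shows pip collecting `numpy>=2` for shap, while the error reports NumPy 1.24.3. This suggests (it is not confirmed here) that NumPy was upgraded underneath a running kernel or left in a mixed state. A first diagnostic step that does not import the broken package is to read the installed versions from package metadata. The sketch below uses only the standard library; the helper name `installed_versions` and the package list are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Reads metadata only, so it works even when `import numpy` fails.
report = installed_versions(["numpy", "shap", "xgboost", "scikit-learn"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If the reported NumPy version differs from the one named in the error message, restarting the kernel (and, if needed, reinstalling NumPy in a clean environment) is a reasonable next thing to try.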
