# Description

## Background: 

Neuropsychiatric disorders that occur in development, like anxiety, depression, autism, and attention deficit hyperactivity disorder, or ADHD, often differ in how and to what extent they affect males and females. ADHD occurs in about 11% of adolescents, with around 14% of boys and 8% of girls having a diagnosis. There is some evidence that girls with ADHD can often go undiagnosed, as they tend to have more inattentive symptoms which are harder to detect. Girls with ADHD who are undiagnosed will continue suffering with symptoms that burden their mental health and capacity to function.


## Overview: 

In this year’s WiDS Datathon, participants will be tasked with building a model to predict both an individual’s sex and their ADHD diagnosis using functional brain imaging data of children and adolescents and their socio-demographic, emotions, and parenting information.

## Challenge question:

What brain activity patterns are associated with ADHD; are they different between males and females, and, if so, how?

## Challenge task:

The task is to create a multi-outcome model to predict two separate target variables: 1) ADHD (1=yes or 0=no) and 2) female (1=yes or 0=no).
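
As a rough illustration of what a multi-outcome model looks like, the sketch below fits one classifier per target on synthetic data. The data, the feature count, and the choice of `LogisticRegression` wrapped in scikit-learn's `MultiOutputClassifier` are assumptions made for the example; they are not the approach this notebook ends up using.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 5 features (not the competition data).
X_demo = rng.normal(size=(200, 5))

# Two binary targets in the same order as the task: [ADHD, female].
Y_demo = np.column_stack([
    (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int),
    (X_demo[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int),
])

# MultiOutputClassifier fits one independent classifier per target column.
multi_model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
multi_model.fit(X_demo, Y_demo)

# predict returns one column of 0/1 predictions per target.
Y_pred_demo = multi_model.predict(X_demo)
print(Y_pred_demo.shape)
```

Treating the two targets independently is the simplest option; it ignores any relationship between sex and diagnosis, which the challenge question is specifically interested in.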

## Why is this important? 
Tools of this nature can help identify individuals who may be at risk of ADHD, which can be difficult to diagnose particularly in females. Importantly, they help shed light on the parts of the brain relevant to ADHD in females and males, which in turn could lead to improvements in personalized medicine and therapies. Identifying ADHD early and designing therapies targeting specific brain mechanisms in a personalized way can greatly improve the mental health of affected individuals.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Importing Libraries

In [None]:
# !pip install geomstats --target=/kaggle/working/

In [None]:
# !pip install openpyxl --target=/kaggle/working/

In [None]:
import numpy as np # linear algebra and statistics
import pandas as pd # data processing
import seaborn as sns # data visualization
import matplotlib.pyplot as plt # data visualization
import geomstats.backend as gs
import openpyxl
from sklearn.preprocessing import LabelEncoder # for feature engineering
from sklearn.preprocessing import OneHotEncoder # for feature engineering
from sklearn.preprocessing import StandardScaler # for data normalization
from sklearn.preprocessing import MinMaxScaler # for data normalization
from sklearn.preprocessing import RobustScaler # for data normalization
from sklearn.metrics import f1_score # for model evaluation
from sklearn.model_selection import train_test_split # for splitting the dataset
from sklearn.impute import SimpleImputer # for feature engineering
from tqdm import tqdm  # For progress bars
import geomstats.datasets.utils as data_utils
from geomstats.geometry.skew_symmetric_matrices import SkewSymmetricMatrices
import os

# Importing data and Initial Data Exploration

In [None]:

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# loading the train datasets

# train_mri = pd.read_csv('/kaggle/input/widsdatathon2025/TRAIN/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES.csv')
train_mri = pd.read_csv('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES_new_36P_Pearson.csv')
train_labels = pd.read_excel('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAINING_SOLUTIONS.xlsx')
train_categorical = pd.read_excel('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_CATEGORICAL_METADATA_new.xlsx')
train_numerical = pd.read_excel('/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_QUANTITATIVE_METADATA_new.xlsx')

In [None]:
# loading the test datasets
test_mri = pd.read_csv('/kaggle/input/widsdatathon2025/TEST/TEST_FUNCTIONAL_CONNECTOME_MATRICES.csv')
test_categorical = pd.read_excel('/kaggle/input/widsdatathon2025/TEST/TEST_CATEGORICAL.xlsx')
test_numerical = pd.read_excel('/kaggle/input/widsdatathon2025/TEST/TEST_QUANTITATIVE_METADATA.xlsx')

In [None]:
print("Train targets: \n", train_labels.head(5))

In [None]:
print("Train MRI data: \n", train_mri.head(5))

In [None]:
print("Train categorical data: \n", train_categorical.head(5))

In [None]:
print("Train numerical features: \n", train_numerical.head(5))

In [None]:
print("Shapes of the train datasets:")
print("Shape of train_labels: \n", train_labels.shape)
print("Shape of train_mri: \n", train_mri.shape)
print("Shape of train_categorical: \n", train_categorical.shape)
print("Shape of train_numerical: \n", train_numerical.shape)

## Descriptive Statistics

In [None]:
# Concise summary of the train datasets
print(train_mri.info())

In [None]:
print(train_categorical.info())

In [None]:
print(train_numerical.info())

In [None]:
# Statistical summary of our dataset
print(train_mri.describe())

In [None]:
print(train_categorical.describe())

In [None]:
print(train_numerical.describe())

## Data Visualization

Let's visualize the numerical features and get a better understanding of what we are working with

In [None]:
train_numerical.columns

In [None]:
# EHQ_EHQ_Total
plt.figure(figsize=(12, 8))
sns.histplot(x='EHQ_EHQ_Total', data=train_numerical, color='green')
plt.title("Edinburgh Handedness Questionnaire", fontsize=16)
plt.xlabel("Laterality Index Score", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()
# -100 = 10th left 
# −28 ≤ LI < 48 = middle 
# 100 = 10th right

In [None]:
# SDQ_SDQ_Conduct_Problems
plt.figure(figsize=(12, 8))
sns.countplot(x='SDQ_SDQ_Conduct_Problems', data=train_numerical, palette = 'coolwarm')
plt.title("Strengths and Difficulties Questionnaire for Conduct Problems", fontsize=16)
plt.xlabel("Conduct Problems Scale", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()

In [None]:
# SDQ_SDQ_Emotional_Problems
plt.figure(figsize=(12, 8))
sns.countplot(x='SDQ_SDQ_Emotional_Problems', data=train_numerical, palette = 'pastel')
plt.title("Strengths and Difficulties Questionnaire for Emotional Problems", fontsize=16)
plt.xlabel("Emotional Problems Scale", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()

In [None]:
# SDQ_SDQ_Externalizing and Internalizing
plt.figure(figsize=(12, 6))
sns.countplot(x='SDQ_SDQ_Externalizing', data=train_numerical, palette = 'Set2')
plt.title("Externalizing Scores Distribution", fontsize=16)
plt.xlabel("Externalizing Score", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()
plt.figure(figsize=(12, 6))
sns.countplot(x='SDQ_SDQ_Internalizing', data=train_numerical, palette = 'Set2')
plt.title("Internalizing Scores Distribution", fontsize=16)
plt.xlabel("Internalizing score", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()

In [None]:
# MRI_Track_Age_at_Scan
plt.figure(figsize=(12,6))
sns.histplot(x='MRI_Track_Age_at_Scan', kde=True, data=train_numerical, color='Maroon')
plt.title("Distribution of Age during MRI Scan", fontsize=16)
plt.xlabel("Age", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()

In [None]:
# ADHD Distribution
print(train_labels['ADHD_Outcome'].value_counts())
plt.figure(figsize=(12,6))
sns.countplot(x='ADHD_Outcome', data=train_labels, color='Skyblue')
plt.title("ADHD Distribution", fontsize=16)
plt.xlabel("Outcome (1=Yes, 0=No)", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()

In [None]:
# Gender Distribution
print(train_labels['Sex_F'].value_counts())
plt.figure(figsize=(12,6))
sns.countplot(x='Sex_F', data=train_labels, color='Green')
plt.title("Gender Distribution", fontsize=16)
plt.xlabel("Gender (0 = Male, 1 = Female)", fontsize=12)
plt.ylabel("Count",fontsize=12)
plt.show()

In [None]:
# Correlation of Emotional Problems with ADHD outcome
train_numerical_copy = train_numerical.copy()
train_numerical_copy['ADHD_Outcome'] = train_labels['ADHD_Outcome']

plt.figure(figsize=(8, 6))
sns.boxplot(x='ADHD_Outcome', y='SDQ_SDQ_Emotional_Problems', data=train_numerical_copy)
plt.title('SDQ_SDQ_Emotional_Problems vs ADHD Outcome')
plt.xlabel('ADHD Outcome')
plt.ylabel('SDQ_SDQ_Emotional_Problems')
plt.show()

In [None]:
# Barratt_Barratt_P2_Occ - Barratt Simplified Measure of Social Status - Parent 2 Occupation
train_categorical['Barratt_Barratt_P2_Occ'].value_counts()

# 0=Homemaker, stay at home parent.
# 5=Day laborer, janitor, house cleaner, farm worker, food counter,preparation worker, busboy.
# 10=Garbage collector, short-order cook, cab driver, shoe sales, assembly line workers, masons, baggage porter.
# 15=Pa

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Barratt_Barratt_P2_Occ', data=train_categorical[['Barratt_Barratt_P2_Occ']])
plt.title(f"Distribution of Barratt Social Status Measure - Parent 2 Occupation", fontsize=14)
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Barratt_Barratt_P1_Occ', data=train_categorical[['Barratt_Barratt_P1_Occ']])
plt.title(f"Distribution of Barratt Social Status Measure - Parent 1 Occupation", fontsize=14)
plt.xticks(rotation=45)
plt.show()

In [None]:
# Let's compare Parent level of education with ADHD Outcome

sns.countplot(data=train_categorical, x='Barratt_Barratt_P1_Edu', hue=train_labels['ADHD_Outcome'])
plt.title('ADHD Prevalence by Parent 1 Education')
plt.show()

In [None]:

sns.countplot(data=train_categorical, x='Barratt_Barratt_P2_Edu', hue=train_labels['ADHD_Outcome'])
plt.title("ADHD Outcome Distribution by Parent 2 Education")
plt.show()

In [None]:
# Comparing color vision test and gender

sns.countplot(data=train_numerical, x='ColorVision_CV_Score', hue=train_labels['Sex_F'])
plt.title('Color Vision Score Distribution by Gender')
plt.show()

In [None]:
# Demographics - Race of Child vs ADHD Outcomes
print(train_categorical['PreInt_Demos_Fam_Child_Race'].value_counts())

# 0= White/Caucasian 
# 1= Black/African American 
# 2= Hispanic 
# 3= Asian 
# 4= Indian
# 5= Native American India...


sns.countplot(data=train_categorical, x='PreInt_Demos_Fam_Child_Race', hue=train_labels['ADHD_Outcome'])
plt.title('ADHD Outcomes by Race of Child')
plt.xlabel("Child Race")
plt.show()

In [None]:
# Correlation Matrix

# Correlation of numerical features with train_labels

# let's merge train_numerical with train_labels to create our dataset for correlation
cat_corr_data = pd.merge(train_numerical, train_labels, on='participant_id')
cat_corr_data.drop('participant_id', axis=1, inplace=True) # we won't need to check correlation with ids
cat_corr_matrix = cat_corr_data.corr()

In [None]:
# Detailed heat map
# sns.heatmap(cat_corr_data, 
#             cmap='YlGnBu', # choosing a yellow-green-blue colormap
#             annot=True, # Turning on annotations
#             fmt="d", # displaying annotations as integer
#             linewidths=.5, # Add gridlines with width 0.5
#             cbar=True, # Include color bar
# )
# plt.show()

# Data Preprocessing

## Duplicates and Missing Values

We'll start by checking and removing duplicates from our train_numerical and train_categorical datasets

In [None]:
# Checking for duplicates in our train data
print("Len of train_numerical before: ", len(train_numerical))
train_numerical = train_numerical.drop_duplicates() # drop_duplicates returns a new DataFrame, so reassign it
print("Len of train_numerical after: ",len(train_numerical))


print("Len of train_categorical before: ", len(train_categorical))
train_categorical = train_categorical.drop_duplicates()
print("Len of train_categorical after: ", len(train_categorical))

In [None]:
# Checking for duplicates in test data
print("Len of test_numerical before: ", len(test_numerical))
test_numerical = test_numerical.drop_duplicates() # drop_duplicates returns a new DataFrame, so reassign it
print("Len of test_numerical after: ", len(test_numerical))

print("Len of test_categorical before: ", len(test_categorical))
test_categorical = test_categorical.drop_duplicates()
print("Len of test_categorical after: ", len(test_categorical))

We see that train and test datasets don't have any duplicates

In [None]:
# Checking for missing values
print("Missing values in train_numerical: ")
train_numerical.isnull().sum()

Let's explore the column with missing values and decide on a replacement (mean or median).
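
To make the mean-versus-median choice concrete, here is a small sketch on toy data. The frames `toy_train` and `toy_test` and the `age` column are made up for the example, and it uses scikit-learn's `SimpleImputer` rather than the `fillna` call used further down. The median is less sensitive to an extreme value such as the 30 below.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the train and test numerical tables.
toy_train = pd.DataFrame({"age": [8.0, 10.0, np.nan, 12.0, 30.0]})
toy_test = pd.DataFrame({"age": [np.nan, 9.0]})

# Learn the replacement value from the training data only...
imputer = SimpleImputer(strategy="median")
toy_train_filled = imputer.fit_transform(toy_train)

# ...and reuse that same value for the test data.
toy_test_filled = imputer.transform(toy_test)

print("Median used for imputation:", imputer.statistics_[0])
print("Mean, for comparison:", toy_train["age"].mean())
```

Here the median of the observed values is 11 while the mean is 15, pulled up by the single large value.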

In [None]:
train_numerical['MRI_Track_Age_at_Scan'].describe()

Let's first check how many kids were scanned at age 0. That may be due to a clerical error and we'll have to deal with it before replacing missing values.

In [None]:
train_numerical[train_numerical['MRI_Track_Age_at_Scan'] == 0]['MRI_Track_Age_at_Scan'].value_counts()

In [None]:
train_numerical[train_numerical['MRI_Track_Age_at_Scan'] == 0].index

There are two datapoints with value of 0. Given the number is pretty low, we'll drop the rows from the dataset

In [None]:
# Drop the two rows with 'MRI_Track_Age_at_Scan' as 0
train_numerical = train_numerical[train_numerical['MRI_Track_Age_at_Scan'] != 0]
print(train_numerical[train_numerical['MRI_Track_Age_at_Scan'] == 0]['MRI_Track_Age_at_Scan'].value_counts())
# From the output, the two rows are now dropped

In [None]:
# Let's check the descriptive statistics of MRI column again.
train_numerical['MRI_Track_Age_at_Scan'].describe()

In [None]:
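# We'll now replace the missing values in 'MRI_Track_Age_at_Scan' with the mean
# (assigning the result back avoids pandas' chained-assignment pitfalls with inplace=True)
train_numerical['MRI_Track_Age_at_Scan'] = train_numerical['MRI_Track_Age_at_Scan'].fillna(train_numerical['MRI_Track_Age_at_Scan'].mean())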

In [None]:
# Let's check again for missing values in train numerical
train_numerical.isnull().sum()

In [None]:
# Missing values in train_categorical
train_categorical.isnull().sum()

In [None]:
# Let's further investigate the 'PreInt_Demos_Fam_Child_Ethnicity' feature
# 'PreInt_Demos_Fam_Child_Ethnicity' feature indicates the ethnicity of the Child
# 0= Not Hispanic or Latino 
# 1= Hispanic or Latino 
# 2= Decline to specify 
# 3= Unknown
print("Unique values for PreInt_Demos_Fam_Child_Ethnicity Feature: ")
print(train_categorical['PreInt_Demos_Fam_Child_Ethnicity'].unique(), '\n')
print("Value counts for each unique value in PreInt_Demos_Fam_Child_Ethnicity:")
print(train_categorical['PreInt_Demos_Fam_Child_Ethnicity'].value_counts())

In [None]:
# Since category 0 has the highest frequency, we'll replace the missing values in
# 'PreInt_Demos_Fam_Child_Ethnicity' with the mode (0.0).
# We'll replace missing values in the other categorical features with their mode as well.
mode_fill_cols = ['PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race',
                  'Barratt_Barratt_P1_Edu', 'Barratt_Barratt_P1_Occ', 'MRI_Track_Scan_Location']
for col in mode_fill_cols:
    # assign the result back instead of calling fillna(..., inplace=True) on a column
    train_categorical[col] = train_categorical[col].fillna(train_categorical[col].mode().iloc[0])

In [None]:
# We'll drop Barratt_Barratt_P2_Edu and Barratt_Barratt_P2_Occ because they both have too many missing values
drop_cols = ['Barratt_Barratt_P2_Edu', 'Barratt_Barratt_P2_Occ']
train_categorical.drop(drop_cols, axis = 1, inplace = True)
test_categorical.drop(drop_cols, axis = 1, inplace = True)

In [None]:
# Final check to see if we removed all missing values from train_categorical
train_categorical.isnull().sum()

## Outliers

 Outliers are anomalous or unusual values that significantly deviate from other observations.
 They can adversely impact the performance of our machine-learning models by introducing bias or skewness. 
 Detecting outliers helps us maintain our dataset's integrity by ensuring all data falls within a reasonable range of values.

Some common methods to detect outliers are:
1. **Z-score**

    Z-score of a value is the distance between that value and the dataset's mean, expressed in terms of the standard deviation.
    > z_score = (x - mean)/ standard deviation.
    > 
    Values whose **absolute z-score is greater than 3** are often considered to be outliers.

2. **Interquartile Range** (IQR)

    Interquartile range is the difference between the third quartile (75th percentile) and the first quartile (25th percentile): IQR = Q3 - Q1.
    Values that fall below the *lower bound* or above the *upper bound* defined below are often considered to be outliers.
    > lower bound = Q1 - 1.5 * IQR

    > upper bound = Q3 + 1.5 * IQR

3. **Visualization** plots

    Data visualization plots like countplots, scatterplots, and boxplots can be very helpful in visually detecting outliers from a dataset.
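
The first two rules can be sketched in a few lines of NumPy. The array `values` is made-up data for illustration, and the names ending in `_demo` are chosen so they do not clash with variables used elsewhere in this notebook.

```python
import numpy as np

# Made-up sample with one suspicious value (60).
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 60.0])

# Z-score rule: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1_demo, q3_demo = np.percentile(values, [25, 75])
iqr_demo = q3_demo - q1_demo
lower_bound_demo = q1_demo - 1.5 * iqr_demo
upper_bound_demo = q3_demo + 1.5 * iqr_demo
iqr_outliers = values[(values < lower_bound_demo) | (values > upper_bound_demo)]

print("IQR bounds:", lower_bound_demo, upper_bound_demo)
print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

On this small sample the IQR rule flags 60 while the z-score rule flags nothing, because a single extreme value also inflates the standard deviation it is measured against. The two rules can disagree, so it is worth checking both.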

Let's take a closer look at features whose min values are significantly lower than their Q1 and features whose max values are significantly higher than Q3.

Refer to our descriptive statistics section (*train_numerical.describe()*)

In [None]:
# Starting with 'APQ_P_APQ_P_CP'
Q1 = train_numerical['APQ_P_APQ_P_CP'].quantile(0.25)
Q3 = train_numerical['APQ_P_APQ_P_CP'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5 * IQR
print(f'Lower bound for "APQ_P_APQ_P_CP" is: {lower_bound}')
print(f'Upper bound for "APQ_P_APQ_P_CP" is: {upper_bound}')
# Let's check how many values lie above the upper bound
print(len(train_numerical[train_numerical['APQ_P_APQ_P_CP'] > upper_bound]))
upper_bound_df = train_numerical[train_numerical['APQ_P_APQ_P_CP'] > upper_bound]
print(upper_bound_df['APQ_P_APQ_P_CP'].value_counts())

## Standardization

In [None]:
# We'll use the standard scaler to standardize our numerical columns
scaler = StandardScaler() #initializing the scaler
# dropping the participant_id column before standardizing the numerical columns
train_numerical_scaled = scaler.fit_transform(train_numerical.drop(columns ='participant_id'))
# reuse the mean and standard deviation learned on the training set; refitting on the test set
# would put the two sets on different scales (assumes both tables share the same columns)
test_numerical_scaled = scaler.transform(test_numerical.drop(columns = 'participant_id'))

## Merging Dataframes

In [None]:
# Let's combine numerical and categorical datasets into one dataframe
train_combined = pd.merge(train_numerical, train_categorical,on ="participant_id", how ="outer").set_index("participant_id")
test_combined = pd.merge(test_numerical, test_categorical, on = "participant_id", how = "outer").set_index("participant_id")
# assert all(train_combined.index == train_labels.index), "Label IDs don't match train IDs"

In [None]:
train_combined.head(5)

## Preprocessing MRI Data

### Reshape the connectome data into symmetric matrices

We are given the upper half of the connectome matrices as vectors, which represent the functional connections between different brain regions. However, to analyze and process this data using Riemannian geometry-based methods, we need to reshape it into symmetric matrices.

By reshaping the upper half vectors into symmetric matrices, we can reconstruct the full matrix, which is a more natural representation of the brain's functional connectivity.
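
For intuition, the same reconstruction can be sketched in plain NumPy on a tiny example. The helper `vector_to_symmetric` is hypothetical, and it assumes the vector lists the strict upper triangle row by row; the ordering used in the competition file is not confirmed here, and the notebook itself relies on geomstats for this step.

```python
import numpy as np

def vector_to_symmetric(vec, n):
    """Rebuild an n x n symmetric matrix with ones on the diagonal
    from its strict upper triangle, given in row-major order."""
    expected = n * (n - 1) // 2
    if vec.shape[0] != expected:
        raise ValueError(f"expected {expected} values for n={n}, got {vec.shape[0]}")
    mat = np.eye(n)
    rows, cols = np.triu_indices(n, k=1)  # indices above the diagonal
    mat[rows, cols] = vec                 # fill the upper triangle
    mat[cols, rows] = vec                 # mirror it into the lower triangle
    return mat

# A 3 x 3 example needs 3 * 2 / 2 = 3 values.
demo_matrix = vector_to_symmetric(np.array([0.1, 0.2, 0.3]), n=3)
print(demo_matrix)

# For 200 brain regions, each participant's vector has this many entries:
print(200 * 199 // 2)
```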

In [None]:
# Extract the ADHD solutions and sort the data by participant_id
y_train_adhd = train_labels[['participant_id', 'ADHD_Outcome']].sort_values('participant_id')
train_mri = train_mri.sort_values('participant_id')

In [None]:
# Define the load_connectomes function
def load_connectomes(train_mri, y_train_adhd, as_vectors=False):
    """
    Load brain connectome data and ADHD labels, returning symmetric matrices with ones on the diagonal.
    """
    
    patient_id = gs.array(train_mri['participant_id'])
    data = gs.array(train_mri.drop('participant_id', axis=1))
    target = gs.array(y_train_adhd['ADHD_Outcome'])

    if as_vectors:
        return data, patient_id, target
    mat = SkewSymmetricMatrices(200).matrix_representation(data)
    mat = gs.eye(200) - gs.transpose(gs.tril(mat), (0, 2, 1))
    mat = 1.0 / 2.0 * (mat + gs.transpose(mat, (0, 2, 1)))

    return mat, patient_id, target

In [None]:
# Call the load_connectomes function
data, patient_id, labels = load_connectomes(train_mri, y_train_adhd)

# Print the results
print(f"There are {len(data)} connectomes: {sum(labels==0)} healthy patients and {sum(labels==1)} ADHD patients.")

In [None]:
data.shape

We now have 200 x 200 matrices for each of the 1213 patients

## Checking for SPD Manifold Membership

Check if the connectome data lies on the Symmetric Positive Definite (SPD) manifold. We use the SPDMatrices class from the geomstats library to check for SPD property.
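
To show what the SPD property means for a single matrix, here is a minimal NumPy-only check. The helper `is_spd` is hypothetical and is not how geomstats implements `belongs`; it tests symmetry and then relies on the fact that a Cholesky factorization succeeds only for positive definite matrices.

```python
import numpy as np

def is_spd(matrix, atol=1e-8):
    """Return True if the matrix is symmetric and positive definite."""
    if not np.allclose(matrix, matrix.T, atol=atol):
        return False
    try:
        # Cholesky raises LinAlgError when the matrix is not positive definite.
        np.linalg.cholesky(matrix)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_spd(np.eye(3)))                             # identity matrix
print(is_spd(np.array([[1.0, 2.0], [2.0, 1.0]])))    # symmetric, eigenvalues -1 and 3
```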

In [None]:
from geomstats.geometry.spd_matrices import SPDMatrices

manifold = SPDMatrices(200, equip=False)
print(gs.all(manifold.belongs(data)))

In [None]:
# Count the number of connectomes that do not lie on the SPD manifold

count_false = np.sum(~(manifold.belongs(data)))
print("Count of False:", count_false)

### Ensuring SPD Property

To ensure the data is Symmetric Positive Definite (SPD), we can add a small diagonal matrix to the original data. This approach modifies the data minimally while guaranteeing the SPD property. The small diagonal matrix is added to each 2D slice of the 3D matrix, but the correction is only non-zero for the slices that are not SPD.
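
The reason this works is that adding c times the identity shifts every eigenvalue by exactly c and leaves the off-diagonal entries untouched. A small worked example, using a made-up 2 x 2 matrix, illustrates this:

```python
import numpy as np

# Symmetric but not positive definite: eigenvalues are -1 and 3.
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

min_eig = np.linalg.eigvalsh(A).min()

# Shift by just enough to push the smallest eigenvalue above zero.
shift = -min_eig + 1e-6
A_spd = A + shift * np.eye(2)

new_eigs = np.linalg.eigvalsh(A_spd)
print("Smallest eigenvalue before:", min_eig)
print("Eigenvalues after:", new_eigs)
print("Off-diagonal entries unchanged:", A_spd[0, 1] == A[0, 1])
```

Note that the shift also changes the diagonal, so a corrected correlation-like matrix no longer has ones on its diagonal.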

In [None]:
# Function to add a diagonal matrix to a 2D matrix
def add_diagonal_correction(matrix):
    # eigvalsh is meant for symmetric matrices and always returns real eigenvalues
    eigenvalues = np.linalg.eigvalsh(matrix)
    min_eigenvalue = np.min(eigenvalues)

    # positive definite means the smallest eigenvalue is strictly positive
    if min_eigenvalue <= 0:
        correction = -min_eigenvalue + 1e-6
        correction_matrix = correction * np.eye(matrix.shape[0])
        return matrix + correction_matrix
    else:
        return matrix

# Apply the correction to each 2D slice of the 3D matrix
# (the loop variable is not called `slice`, which would shadow the Python builtin)
data_corrected = np.array([add_diagonal_correction(matrix) for matrix in data])

print("Original Matrix shape:", data.shape)
print("Corrected Matrix shape:", data_corrected.shape)

print(gs.all(manifold.belongs(data_corrected)))

#### Counting differences in original data and corrected data

We expect the count of differences to be 12 × 200 = 2400: the correction was added to 12 connectomes, and it changes only the 200 diagonal entries of each.

In [None]:
def count_differences(array1, array2, tolerance=1e-6):
    """
    This function compares two 3D arrays and returns the count of differences.
    """
    if array1.shape != array2.shape:
        raise ValueError("Arrays must be of the same shape")
    
    differences = np.greater(np.abs(array1 - array2), tolerance)
    count = np.sum(differences)
    
    return count

print(count_differences(data, data_corrected))

## Training on MRI data using RiemannianMinimumDistanceToMean

### Define the model for MRI data

In [None]:
from geomstats.learning.mdm import RiemannianMinimumDistanceToMean

spd_manifold = SPDMatrices(n=200, equip=True)
mdm = RiemannianMinimumDistanceToMean(space=spd_manifold)

### Split MRI data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X = data_corrected
y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=47)

### Print data statistics

We examine the class distribution in the full dataset, as well as the train and test sets, to ensure that they are similar and representative of the overall data. This is crucial for training a reliable model, as a skewed class distribution can lead to biased results.
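
One way to keep the class proportions the same across the splits, rather than only checking them afterwards, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels with a 70/30 class ratio; it is an optional variation, not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 70 positives and 30 negatives.
y_strat = np.array([1] * 70 + [0] * 30)
X_strat = np.arange(100).reshape(-1, 1)

# stratify=y_strat keeps the 70/30 ratio in both the train and the test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_strat, y_strat, test_size=0.2, stratify=y_strat, random_state=47
)

print("Share of positives in train:", y_tr.mean())
print("Share of positives in test:", y_te.mean())
```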

In [None]:
print(f"The dataset has {len(X)} connectomes.")
print(f"The train set has {len(X_train)} connectomes and has size {X_train.shape}.")
print(f"The test set has {len(X_test)} connectomes and has size {X_test.shape}.")

print("Full dataset class distribution:")
print(pd.Series(y).value_counts(normalize=True) * 100)

print("\nTrain dataset class distribution:")
print(pd.Series(y_train).value_counts(normalize=True) * 100)

print("\nTest dataset class distribution:")
print(pd.Series(y_test).value_counts(normalize=True) * 100)

### Train and Evaluate the model

In [None]:
mdm.fit(X_train, y_train)
print("Mdm score:", mdm.score(X_test, y_test))

y_pred = mdm.predict(X_test)
print("F1 score:", f1_score(y_test, y_pred))

Our F1 score is a bit low, but we now have an idea of how to work with the MRI data. Next, we'll use ensemble methods: train a separate model on each of our three data categories and combine them.
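
One common way to combine separate models is soft voting, which averages the predicted probabilities of each model. The sketch below is only an illustration of that idea: the function `soft_vote` is hypothetical, and the three probability arrays are made-up numbers standing in for the outputs of an MRI, a numerical, and a categorical model.

```python
import numpy as np

def soft_vote(prob_list, weights=None, threshold=0.5):
    """Average per-model probabilities and threshold the result."""
    stacked = np.vstack(prob_list)                      # shape: (n_models, n_samples)
    avg = np.average(stacked, axis=0, weights=weights)  # weighted mean per sample
    return avg, (avg >= threshold).astype(int)

# Made-up positive-class probabilities for three participants.
p_mri = np.array([0.9, 0.4, 0.2])
p_num = np.array([0.7, 0.6, 0.1])
p_cat = np.array([0.8, 0.3, 0.6])

avg_prob, voted = soft_vote([p_mri, p_num, p_cat])
print("Averaged probabilities:", avg_prob)
print("Voted labels:", voted)
```

Passing `weights` lets a stronger model count for more than a weaker one.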

# Model Training

In [None]:
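# Splitting the data into training and testing sets. train_test_split pairs rows by position,
# so first restrict every table to the same participants in the same order
# (assumes every participant_id in train_numerical is unique and present in the other tables)
ids = train_numerical['participant_id']
aligned = [df.set_index('participant_id').loc[ids].reset_index()
           for df in (train_mri, train_numerical, train_categorical, train_labels)]
mri_train, mri_test, num_train, num_test, cat_train, cat_test, y_train, y_test = train_test_split(
    *aligned, test_size=0.2, random_state=42
)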

In [None]:
# --- Function to load and preprocess a single participant's connectome from a CSV ---
def load_and_preprocess_mri(patient_id, csv_path):
    try:
        all_mri_data = pd.read_csv(csv_path)
        patient_row = all_mri_data[all_mri_data['participant_id'] == patient_id]
        if not patient_row.empty:
            # The first column is participant_id; the rest is the connectome vector
            matrix_values = patient_row.iloc[0, 1:].to_numpy(dtype=float)
            # Each row holds the upper half of a 200 x 200 connectome (see above), so it is
            # already a flat feature vector and cannot be reshaped to 36 x 36
            # Basic preprocessing: scale by the largest value (only if it is positive)
            max_value = matrix_values.max()
            flattened_mri = matrix_values / max_value if max_value > 0 else matrix_values
            return flattened_mri
        else:
            print(f"MRI data not found for patient: {patient_id} in the CSV: {csv_path}")
            return None
    except FileNotFoundError:
        print(f"CSV file not found at: {csv_path}")
        return None
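
The function above re-reads the whole CSV on every call, which gets slow when it is called once per participant. A possible alternative, sketched here under the assumption that the file has a `participant_id` column followed by the connectome values, is to read the file once and build a lookup table. The helper `build_mri_lookup` and the tiny in-memory CSV are made up for the example.

```python
import io
import numpy as np
import pandas as pd

def build_mri_lookup(csv_source):
    """Read the connectome CSV once and map participant_id -> 1-D feature vector."""
    table = pd.read_csv(csv_source).set_index("participant_id")
    values = table.to_numpy(dtype=float)
    # Pair each participant id with its row of values.
    return dict(zip(table.index, values))

# Tiny in-memory CSV standing in for the real file.
toy_csv = io.StringIO(
    "participant_id,c0,c1,c2\n"
    "A,0.1,0.2,0.3\n"
    "B,0.4,0.5,0.6\n"
)
mri_lookup = build_mri_lookup(toy_csv)
print(mri_lookup["B"])
```

With the real file, a path string can be passed in place of `toy_csv`, and participants missing from the lookup can be detected with `mri_lookup.get(patient_id)`.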


In [None]:
# --- Processing training MRI data ---
train_mri_path = "/kaggle/input/widsdatathon2025/TRAIN_NEW/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES_new_36P_Pearson.csv"
X_train_mri = {}
print("Loading and preprocessing training MRI data:")
for patient_id in tqdm(train_combined.index): # participant_id is the index of train_combined
    mri_data = load_and_preprocess_mri(patient_id, train_mri_path)
    if mri_data is not None:
        X_train_mri[patient_id] = mri_data

In [None]:
# --- Processing test MRI data ---
test_mri_path = "/kaggle/input/widsdatathon2025/TEST/TEST_FUNCTIONAL_CONNECTOME_MATRICES.csv"
X_test_mri = {}
print("Loading and preprocessing test MRI data:")
for patient_id in tqdm(test_combined.index): # participant_id is the index of test_combined
    mri_data = load_and_preprocess_mri(patient_id, test_mri_path)
    if mri_data is not None:
        X_test_mri[patient_id] = mri_data