#### **Running the fuzzy clustering algorithm to create cognitive and behavioral profiles.**

The following cells run a CLI-based tool that performs fuzzy
clustering on the raw behavioral and cognitive variables. This generates
the membership values needed for the graph network computation.

Initial clustering is performed only on the ABCD study. The clusters' centroids
are then used to project the GESTE and BANDA data onto the ABCD profiles.

**Please note: the clustering process can take roughly 1 h, depending on the number of cores used. It can be run directly in the notebook, but it will most likely be much faster in a dedicated terminal window. To do so, copy and paste the command line and replace the `{data_dir}` and `{output_dir}` placeholders with paths pointing to your data. You can also set the number of cores used during clustering with `--processes`.**
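The `{data_dir}` and `{output_dir}` placeholders are substituted by IPython, so a plain shell will not resolve them. One way to get a terminal-ready command is to build it in Python and print it, as in the minimal sketch below. The two paths are hypothetical and must be replaced with your own. The flags simply mirror the `FuzzyClustering` call used in this notebook.

```python
import shlex

# Hypothetical locations: replace with your own paths.
data_dir = "/path/to/results/preprocessing"
output_dir = "/path/to/results/fuzzyclustering"

# Same flags as the clustering cell of this notebook, with paths resolved.
command = [
    "FuzzyClustering",
    "--in-dataset", f"{data_dir}/abcd_data_preprocessed.xlsx",
    "--out-folder", f"{output_dir}/ABCDFuzzyCMeans/",
    "--desc-columns", "22", "--id-column", "subjectkey", "--pca",
    "--k", "20", "--m", "2", "--metric", "mahalanobis",
    "--maxiter", "5000", "--error", "1e-06",
    "--cmap", "bone_r", "--radarplot",
    "-v", "-f", "-s", "--processes", "6",
]

# shlex.join quotes each argument so the line can be pasted in a terminal.
printable = shlex.join(command)
print(printable)
```

Adjust the value after `--processes` to the number of cores you want to dedicate to the run.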

In [1]:
# Imports
import os

import pandas as pd

from neurostatx.io.utils import load_df_in_any_format

In [2]:
# Setting up relevant paths to previous steps.
repository_path = "/Users/anthonygagnon/code/Article-s-Code/" # CHANGE THIS
abcd_base_path = "/Volumes/T7/CCPM/ABCD/Release_5.1/abcd-data-release-5.1/" # CHANGE THIS
geste_base_dir = "/Volumes/T7/CCPM/GESTE/" # CHANGE THIS
banda_dir = '/Volumes/T7/CCPM/BANDA/BANDARelease1.1/' # CHANGE THIS
output_folder = "/Volumes/T7/CCPM/RESULTS_JUNE_24/" # CHANGE THIS
data_dir = f"{output_folder}/preprocessing/"
output_dir = f"{output_folder}/fuzzyclustering/"

# Create output directory if it does not exist.
os.makedirs(output_dir, exist_ok=True)

In [49]:
# Running clustering on the raw variables with a CLI tool, hence the
# leading `!`. Testing solutions with up to 20 clusters.
# ** This is a long running process. Go get a coffee ! **

!FuzzyClustering --in-dataset '{data_dir}/abcd_data_preprocessed.xlsx' \
    --out-folder "{output_dir}/ABCDFuzzyCMeans/" \
    --desc-columns 22 --id-column "subjectkey" --pca --k 20 --m 2 --metric mahalanobis \
    --maxiter 5000 --error 1e-06 --cmap "bone_r" --radarplot \
    -v -f -s --processes 6

2024-06-25 15:40:15 Anthonys-MBP.med.usherbrooke.ca root[90106] INFO Loading dataset(s)...
2024-06-25 15:40:17 Anthonys-MBP.med.usherbrooke.ca root[90106] INFO Applying PCA dimensionality reduction.
2024-06-25 15:40:17 Anthonys-MBP.med.usherbrooke.ca root[90106] INFO Bartlett's test of sphericity returned a p-value of 0.0 and Keiser-Meyer-Olkin (KMO) test returned a value of 0.6872199957929848.
2024-06-25 15:40:21 Anthonys-MBP.med.usherbrooke.ca root[90106] INFO Generating dendrogram.
2024-06-25 15:40:26 Anthonys-MBP.med.usherbrooke.ca root[90106] INFO Computing FCM from k=2 to k=20
[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 out of  20 | elapsed: 37.0min finished
2024-06-25 16:17:24 Anthonys-MBP.med.usherbrooke.ca

#### **Projecting the BANDA and GESTE studies**

The next cells use ABCD's PCA model and centroids to project the data from the GESTE and BANDA studies into ABCD's reference clustering space.
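As a rough illustration of what such a projection computes, the sketch below applies the standard fuzzy c-means membership formula to a set of fixed centroids. It is not the neurostatx implementation: it assumes plain Euclidean distance rather than the Mahalanobis metric used in this notebook, it skips the PCA transform, and `predict_membership` and its toy inputs are hypothetical names chosen for the example.

```python
import math


def predict_membership(samples, centroids, m=2.0):
    """Return one membership row per sample for fixed centroids.

    Uses u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)), where d_ij is
    the distance between sample i and centroid j (Euclidean here, which
    is a simplifying assumption).
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for sample in samples:
        # Clamp distances to avoid dividing by zero when a sample
        # sits exactly on a centroid.
        dists = [max(math.dist(sample, c), 1e-12) for c in centroids]
        row = [
            1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists
        ]
        memberships.append(row)
    return memberships


# Toy data: two centroids and two new subjects in a 2D space.
centroids = [(1.0, 0.0), (3.0, 0.0)]
samples = [(0.0, 0.0), (2.0, 0.0)]
memberships = predict_membership(samples, centroids, m=2.0)
print(memberships)
```

Each row sums to 1, and a subject lying halfway between two centroids receives equal membership in both. The centroids never move, which is why projecting a new cohort is much faster than the initial clustering.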


In [3]:
# Projecting the BANDA study using a CLI tool from neurostatx. This is a quick process.

!PredictFuzzyMembership --in-dataset '{data_dir}/banda_data_preprocessed.xlsx' \
    --out-folder '{output_dir}/BANDAProjected/' \
    --in-cntr '{output_dir}/ABCDFuzzyCMeans/CENTROIDS/clusters_centroids_4.xlsx' \
    --desc-columns 17 --id-column subjectkey --pca \
    --pca-model '{output_dir}/ABCDFuzzyCMeans/PCA/pca_model.pkl' \
    --m 2 --error 1e-06 --maxiter 5000 --metric mahalanobis --radarplot \
    --cmap "bone_r" -v -s -f

2024-11-06 15:49:21 Anthonys-MacBook-Pro.local root[53835] INFO Loading dataset(s)...
2024-11-06 15:49:21 Anthonys-MacBook-Pro.local root[53835] INFO Loading PCA model...
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
2024-11-06 15:49:21 Anthonys-MacBook-Pro.local root[53835] INFO Predicting membership matrix...
2024-11-06 15:49:21 Anthonys-MacBook-Pro.local root[53835] INFO Saving results...


In [4]:
# Projecting the GESTE study using a CLI tool from neurostatx. This is a quick process.

!PredictFuzzyMembership --in-dataset '{data_dir}/geste_data_preprocessed.xlsx' \
    --out-folder '{output_dir}/GESTEProjected/' \
    --in-cntr '{output_dir}/ABCDFuzzyCMeans/CENTROIDS/clusters_centroids_4.xlsx' \
    --desc-columns 14 --id-column subjectkey --pca \
    --pca-model '{output_dir}/ABCDFuzzyCMeans/PCA/pca_model.pkl' \
    --m 2 --error 1e-06 --maxiter 5000 --metric mahalanobis --radarplot \
    --cmap "bone_r" -v -s -f

2024-11-06 15:49:45 Anthonys-MacBook-Pro.local root[54170] INFO Loading dataset(s)...
2024-11-06 15:49:45 Anthonys-MacBook-Pro.local root[54170] INFO Loading PCA model...
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
2024-11-06 15:49:45 Anthonys-MacBook-Pro.local root[54170] INFO Predicting membership matrix...
2024-11-06 15:49:45 Anthonys-MacBook-Pro.local root[54170] INFO Saving results...


#### **Concatenating membership values from all studies and computing a Graph Network object.**

To create a common Graph Network object, we need to concatenate all datasets. The following cells match columns between datasets and append the datasets together. Columns that are missing from a dataset are created and filled with zeros.
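The same column matching can be expressed compactly with `DataFrame.reindex`, shown here as a minimal sketch on made-up frames. The values and the `extra` column are hypothetical; the other column names are borrowed from this notebook only for illustration.

```python
import pandas as pd

# Toy reference dataset (plays the role of ABCD).
reference = pd.DataFrame({
    "subjectkey": ["a1"],
    "Cohort": ["ABCD"],
    "ADHD": [1],
    "OCD": [0],
    "Cluster #1": [0.7],
})

# Toy dataset with a missing column (OCD), an extra column and a
# different column order.
other = pd.DataFrame({
    "subjectkey": ["g1", "g2"],
    "Cluster #1": [0.2, 0.9],
    "Cohort": ["GESTE", "GESTE"],
    "ADHD": [0, 1],
    "extra": [5, 6],
})

# reindex keeps only the reference columns, in the reference order, and
# fills columns absent from `other` with zeros.
aligned = other.reindex(columns=reference.columns, fill_value=0)

# Stack the datasets once their columns match.
merged = pd.concat([reference, aligned], ignore_index=True)
print(merged)
```

Columns that exist only in the secondary dataset are dropped by this approach, so check that nothing needed downstream is lost.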

In [3]:
# Load all datasets resulting from FCM analysis. 
abcd_fcm = load_df_in_any_format(f'{output_dir}/ABCDFuzzyCMeans/MEMBERSHIP_DF/clusters_membership_4.xlsx')
banda_fcm = load_df_in_any_format(f'{output_dir}/BANDAProjected/predicted_membership_matrix.xlsx')
geste_fcm = load_df_in_any_format(f'{output_dir}/GESTEProjected/predicted_membership_matrix.xlsx')

In [4]:
# Find columns present in ABCD but missing from BANDA. A plain set
# difference is used so that BANDA-only columns are not overwritten.
abcd_banda_diff = set(abcd_fcm.columns) - set(banda_fcm.columns)

# Add missing columns to BANDA dataset, filled with zeros.
for col in abcd_banda_diff:
    banda_fcm[col] = 0

# Find columns present in ABCD but missing from GESTE.
abcd_geste_diff = set(abcd_fcm.columns) - set(geste_fcm.columns)

# Add missing columns to GESTE dataset, filled with zeros.
for col in abcd_geste_diff:
    geste_fcm[col] = 0

In [5]:
# Reorder columns to match ABCD dataset.
banda_matched = banda_fcm[abcd_fcm.columns]
geste_matched = geste_fcm[abcd_fcm.columns]

# Assert that all datasets have the same columns.
assert all(abcd_fcm.columns == banda_matched.columns), "Columns do not match between ABCD and BANDA."
assert all(abcd_fcm.columns == geste_matched.columns), "Columns do not match between ABCD and GESTE."

# Other sanity checks that the datasets still have the same number of rows.
assert len(banda_fcm) == len(banda_matched), "Number of rows in the matched dataset changed, please validate."
assert len(geste_fcm) == len(geste_matched), "Number of rows in the matched dataset changed, please validate."

# Spot-check that selected columns are unchanged after reordering.
assert all(banda_fcm.loc[:, 'Cluster #1'] == banda_matched.loc[:, 'Cluster #1']), "Random value in BANDA dataset changed, please validate."
assert all(banda_fcm.loc[:, 'AgeMonths'] == banda_matched.loc[:, 'AgeMonths']), "Random value in BANDA dataset changed, please validate."
assert all(geste_fcm.loc[:, 'Cluster #1'] == geste_matched.loc[:, 'Cluster #1']), "Random value in GESTE dataset changed, please validate."
assert all(geste_fcm.loc[:, 'AgeMonths'] == geste_matched.loc[:, 'AgeMonths']), "Random value in GESTE dataset changed, please validate."

In [6]:
# Concatenate all datasets.
final_fcm = pd.concat([abcd_fcm, banda_matched, geste_matched],
                      axis=0)

# Replace string cohort identifiers with integers. This will make handling of
# cohorts in the graph network object easier. Using map() instead of replace()
# avoids the pandas downcasting FutureWarning.
final_fcm['Cohort'] = final_fcm['Cohort'].map({'ABCD': 1, 'BANDA': 2, 'GESTE': 3})

# Change Cohort column name to cohort.
final_fcm.rename(columns={'Cohort': 'cohort'}, inplace=True)

# Save final dataset.
final_fcm.to_excel(f'{output_dir}/merged_fcm_data.xlsx', index=False, header=True)



#### **Computing a Graph Network.**

To visualize the clustering results, we construct a graph network object and lay it out with a force-directed algorithm. This makes it possible to use graph network properties to evaluate how subjects are distributed across the network and across profiles. The layout is computed with the Fruchterman-Reingold force-directed algorithm, which can take a while to run for a large graph such as this one.
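To give an intuition for how a force-directed layout works, here is a toy, pure-Python version of the Fruchterman-Reingold idea: all nodes repel each other, and connected nodes attract each other in proportion to their edge weight. It is only a sketch under stated assumptions. The graph structure (subject nodes linked to cluster nodes by membership-weighted edges), the function `force_directed_layout`, and all parameter values are hypothetical and are not taken from the `ComputeGraphNetwork` implementation.

```python
import math
import random


def force_directed_layout(nodes, weighted_edges, iterations=300, seed=0):
    """Toy Fruchterman-Reingold layout returning {node: (x, y)}."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # preferred spacing between nodes
    temperature = 0.1                # maximum displacement per iteration
    cooling = temperature / (iterations + 1)

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsive force k^2 / d between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                push = k * k / dist
                disp[u][0] += dx / dist * push
                disp[u][1] += dy / dist * push
                disp[v][0] -= dx / dist * push
                disp[v][1] -= dy / dist * push

        # Attractive force weight * d^2 / k along each weighted edge.
        for (u, v), weight in weighted_edges.items():
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            pull = weight * dist * dist / k
            disp[u][0] -= dx / dist * pull
            disp[u][1] -= dy / dist * pull
            disp[v][0] += dx / dist * pull
            disp[v][1] += dy / dist * pull

        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = max(math.hypot(disp[n][0], disp[n][1]), 1e-9)
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature -= cooling

    return {n: tuple(p) for n, p in pos.items()}


# Two cluster nodes and three subjects with made-up membership values.
nodes = ["c1", "c2", "s1", "s2", "s3"]
edges = {
    ("s1", "c1"): 0.9, ("s1", "c2"): 0.1,
    ("s2", "c1"): 0.2, ("s2", "c2"): 0.8,
    ("s3", "c1"): 0.5, ("s3", "c2"): 0.5,
}
layout = force_directed_layout(nodes, edges)
print(layout)
```

In a layout of this kind, a subject with a strong membership edge should tend to settle near the corresponding cluster node, while subjects with mixed memberships tend to fall in between. The pairwise repulsion step scales quadratically with the number of nodes, which is one reason layouts of large graphs are slow.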

In [7]:
# Using the merged dataset, we will use a CLI script to generate a graph network.
# ** This is a long running process. Go get a coffee ! **

!ComputeGraphNetwork --in-dataset "{output_dir}/merged_fcm_data.xlsx" \
    --out-folder "{output_dir}/GraphNetwork/" --id-column "subjectkey" --desc-columns 28 \
    --layout spring --weight membership -v -f -s --import-data --plot-distribution

2024-11-08 20:52:50 Anthonys-MacBook-Pro.local root[97761] INFO Loading membership data.
2024-11-08 20:52:53 Anthonys-MacBook-Pro.local root[97761] INFO Computing graph network layout.
2024-11-08 20:54:52 Anthonys-MacBook-Pro.local root[97761] INFO Setting nodes position.
2024-11-08 20:54:52 Anthonys-MacBook-Pro.local root[97761] INFO Importing data within the .gml file.


In [9]:
# Copying and renaming the graph network file at the root of the output_dir.
# Paths are quoted so that directories containing spaces are handled correctly.
!cp -rL "{output_dir}/GraphNetwork/network_graph_file.gml" "{output_dir}/GraphNetwork.gml"

#### **Visualization of the Graph Network and clustering results.**

Once the network is generated, we can visualize it. The next cells output a general visualization including all studies, allowing visual inspection of the obtained clusters. They also generate graph networks with nodes highlighted for various pathologies, using all studies together or each study individually. These files are general visualizations of the results; for more sophisticated figures, please see the Visualization.ipynb notebook.

In [10]:
# Visualizing the global graph network with all cohorts merged, then highlighting subjects
# from each cohort within the global network.

!VisualizeGraphNetwork --in-graph "{output_dir}/GraphNetwork.gml" \
    --out-folder "{output_dir}/VizNetwork/" --weight membership --colormap bone_r \
    -v -s -f --title "Global clustering results" \
    --legend-title "Membership Values"

2024-11-08 20:55:06 Anthonys-MacBook-Pro.local root[608] INFO Loading graph data.
2024-11-08 20:55:09 Anthonys-MacBook-Pro.local root[608] INFO Generating graph.


In [None]:
# Visualizing participants with a diagnosis of AD, ADHD, OCD, ODD, CD, DD,
# and PSYPATHO index using all cohorts.

!VisualizeGraphNetwork --in-graph "{output_dir}/GraphNetwork.gml" \
    --out-folder "{output_dir}/VizNetworkDxGlobal/" --weight membership --colormap bone_r \
    -v -s -f --label-name AD --label-name ADHD --label-name OCD --label-name ODD \
    --label-name CD --label-name DD --label-name PSYPATHO --title "Global clustering results" \
    --legend-title "Membership Values"

2024-11-10 11:43:10 Anthonys-MacBook-Pro.local root[95127] INFO Loading graph data.
2024-11-10 11:43:13 Anthonys-MacBook-Pro.local root[95127] INFO Generating graph.
2024-11-10 11:43:17 Anthonys-MacBook-Pro.local root[95127] INFO Constructing graph(s) with custom labels.


In [12]:
# Visualizing participants with a diagnosis of AD, ADHD, OCD, ODD, CD, DD,
# and PSYPATHO index using only the ABCD cohort.

!VisualizeGraphNetwork --in-graph "{output_dir}/GraphNetwork.gml" \
    --out-folder "{output_dir}/VizNetworkDxABCD/" --weight membership --colormap bone_r \
    -v -s -f --label-name AD --label-name ADHD --label-name OCD --label-name ODD \
    --label-name CD --label-name DD --label-name PSYPATHO --title "ABCD clustering results" \
    --legend-title "Membership Values" --cohort 1

2024-11-08 20:55:51 Anthonys-MacBook-Pro.local root[1602] INFO Loading graph data.
2024-11-08 20:55:54 Anthonys-MacBook-Pro.local root[1602] INFO Generating graph.
2024-11-08 20:55:59 Anthonys-MacBook-Pro.local root[1602] INFO Constructing graph(s) with custom labels.


In [13]:
# Visualizing participants with a diagnosis of AD, ADHD, OCD, ODD, CD, DD,
# and PSYPATHO index using only the BANDA cohort.

!VisualizeGraphNetwork --in-graph "{output_dir}/GraphNetwork.gml" \
    --out-folder "{output_dir}/VizNetworkDxBANDA/" --weight membership --colormap bone_r \
    -v -s -f --label-name AD --label-name ADHD --label-name OCD --label-name ODD \
    --label-name CD --label-name DD --label-name PSYPATHO --title "BANDA clustering results" \
    --legend-title "Membership Values" --cohort 2

2024-11-08 20:56:27 Anthonys-MacBook-Pro.local root[2306] INFO Loading graph data.
2024-11-08 20:56:30 Anthonys-MacBook-Pro.local root[2306] INFO Generating graph.
2024-11-08 20:56:34 Anthonys-MacBook-Pro.local root[2306] INFO Constructing graph(s) with custom labels.


In [14]:
# Visualizing participants with a diagnosis of ADHD
# and PSYPATHO index using only the GESTE cohort.

!VisualizeGraphNetwork --in-graph "{output_dir}/GraphNetwork.gml" \
    --out-folder "{output_dir}/VizNetworkDxGESTE/" --weight membership --colormap bone_r \
    -v -s -f --label-name ADHD --label-name PSYPATHO --title "GESTE clustering results" \
    --legend-title "Membership Values" --cohort 3

2024-11-08 20:57:02 Anthonys-MacBook-Pro.local root[2973] INFO Loading graph data.
2024-11-08 20:57:05 Anthonys-MacBook-Pro.local root[2973] INFO Generating graph.
2024-11-08 20:57:09 Anthonys-MacBook-Pro.local root[2973] INFO Constructing graph(s) with custom labels.
