# Preprocessing Notebook

### Authors: William Drew, Alexander Cohen, Joey Hsu, Louis Soussand, Christopher Lin

## Last updated: June 7, 2022

### Notes:
- This notebook requires the NIMLAB Python 3 environment as a kernel and FSL on your path. Directions at: (https://github.com/nimlab/software_env)
- I realize that we are applying 'Lesion' Network Mapping to DBS lead locations, meta-analytic foci, etc., but for the sake of clarity, and to avoid saying "seed" all the time, I've used the word Lesion to mean Lesions/Masks/Foci/Seeds interchangeably.

***
# Introduction
### This is the preprocessing notebook, which is used to generate **functional and structural connectivity maps** from various **brain regions of interest**.

### This notebook is capable of processing input from:
- **Volume-space Nifti ROIs**
    - Examples:
        - Lesion masks
        - Brain coordinates
        - TMS cones
        - DBS Lead locations
        - VTAs from DBS
        - and so much more!
- **Surface-space Gifti ROIs**
    - Examples:
        - White matter atrophy maps
        - White matter growth maps

***
## 0. What do you want to name your Preprocessed Dataset?

Provide a Dataset name, i.e., a Project Name.

This will also be the name of a sub-directory, in the same directory as the notebook, that will contain the following:
1. a copy of the original lesions in `./inputs`,
2. a cleaned version of the lesions in `./sub-*/roi`,
3. a copy of your Functional/Structural Connectivity Maps in `./sub-*/connectivity`,
4. a `./README.md` that notes the original location of the seeds and which connectomes were used, and
5. a `./ProjectName.csv` that contains paths to the imaging files that can be used in later analyses, bypassing the XNAT archive.
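
As a rough illustration of the layout listed above, the sketch below builds the expected paths for a single subject with `pathlib`. The helper name `expected_layout` and the subject label `01` are hypothetical; the real folders are created by the notebook's own steps, not by this snippet.

```python
from pathlib import Path


def expected_layout(preprocess_folder, project_name, subject):
    """Return the paths described in the list above (hypothetical helper)."""
    # A blank preprocess_folder means "same directory as the notebook".
    root = Path(preprocess_folder or ".") / project_name
    return {
        "inputs": root / "inputs",
        "roi": root / f"sub-{subject}" / "roi",
        "connectivity": root / f"sub-{subject}" / "connectivity",
        "readme": root / "README.md",
        "csv": root / f"{project_name}.csv",
    }


layout = expected_layout("/PHShome/wd957/Preprocessing", "Prosopagnosia_lesions", "01")
for name, path in layout.items():
    print(f"{name:>12}: {path.as_posix()}")
```

Checking these paths after a run can be a quick way to confirm that every stage wrote its output where you expect it.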

In [None]:
# Set the project name and folder where you want the output files to go:

# EXAMPLE preprocess_folder = "/PHShome/wd957/Preprocessing"
# EXAMPLE project_name = "Prosopagnosia_lesions"
# This will create a project folder "/PHShome/wd957/Preprocessing/Prosopagnosia_lesions"

preprocess_folder = "" # If left blank, will create dataset folder in same directory as notebook
project_name = "" # Enter your chosen project name here

########################################################################

## Packages:
import os
from glob import glob
from natsort import natsorted
from nimlab import preprocessing
from nimlab import software as sf
env = preprocessing.init_env(project_name, preprocess_folder)
use_datalad = True

## Before we do ANYTHING ELSE, rename this notebook to match your chosen Dataset Name:
Example: `1_Preprocessing_LesionQA_fcSurfBcbLqtGen_nimtrack.ipynb` --> `Prosopagnosia_lesions_1_Preprocessing_LesionQA_fcSurfBcbLqtGen_nimtrack.ipynb`

***
# Input/Output Setup

## 1. First, provide your email address and User ID for future reference:

In [None]:
# Enter your email address and User ID here in quotes:
env["creator_email"] = "wdrew@bwh.harvard.edu"
env["cluster_user"] = "wd957"

preprocessing.save_env(env)

***
## 2. Where are your lesions?
**NOTE:** only run **ONE** of the following two cells depending on how you have stored your lesion files:
- If you have collected all of your lesions in one folder, choose **Option 1**
- If you have organized your lesions in a BIDS/derivatives folder, choose **Option 2**

### Option One allows for both volume-space and surface-space lesions.

- **NOTE:** ROI files of only one type (volume or surface) are permitted. If you wish to process both volume and surface ROIs, please use two notebooks, one for volume ROIs and one for surface ROIs. 

- **NOTE:** Surface files (.gii) **must** be prefixed with `lh.` or `rh.` to indicate hemisphere. Filenames between two hemisphere files **must** be identical.

    - **Example**: 
        > /path/to/lh.subject1.gii, /path/to/rh.subject1.gii

        > /path/to/lh.subject2.gii, /path/to/rh.subject2.gii

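The naming rule for surface files can be checked before loading anything. The following is a minimal sketch, assuming only the convention stated above (an `lh.` or `rh.` prefix and otherwise identical filenames); the helper `pair_hemisphere_files` is hypothetical and is not how `preprocessing.load_rois` is implemented.

```python
import os
from collections import defaultdict


def pair_hemisphere_files(paths):
    """Group surface files by the filename that follows the lh./rh. prefix."""
    pairs = defaultdict(dict)
    for path in paths:
        name = os.path.basename(path)
        hemi, _, rest = name.partition(".")
        if hemi not in ("lh", "rh") or not rest.endswith(".gii"):
            raise ValueError(f"{name}: expected an lh. or rh. prefixed .gii file")
        pairs[rest][hemi] = path
    # Every ROI needs both hemispheres under the same remaining filename.
    unpaired = sorted(rest for rest, hemis in pairs.items() if set(hemis) != {"lh", "rh"})
    if unpaired:
        raise ValueError(f"Missing a hemisphere for: {unpaired}")
    return dict(pairs)


files = [
    "/path/to/lh.subject1.gii", "/path/to/rh.subject1.gii",
    "/path/to/lh.subject2.gii", "/path/to/rh.subject2.gii",
]
for roi_name, hemis in sorted(pair_hemisphere_files(files).items()):
    print(roi_name, hemis["lh"], hemis["rh"])
```
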
In [None]:
# OPTION ONE - If you have collected all of your lesions in one folder:

env["input_folder"] = ''
# EXAMPLE env["input_folder"] = "/PHShome/wd957/test_lesions/2mm"

env["lesion_type"], env["lesions"] = preprocessing.load_rois(env["input_folder"])
preprocessing.save_env(env)

In [None]:
# # OPTION TWO - If you have organized your lesions in a BIDS/derivatives folder:
# env["lesions"] = {}
# env["input_folder"] = 'BIDS_dir'
# # EXAMPLE env["input_folder"] = "/Users/alex/data/lesions/Leigh_bids"

# lesion_files = natsorted(glob(os.path.join(env["input_folder"], "derivatives/lesions/sub*/sub-*space-MNI152NLin2009cAsym_desc-lesion_mask.nii.gz")))
# for vol_file in lesion_files:
#     subject_name = os.path.basename(vol_file).split("sub-")[1].split("space-MNI152NLin2009cAsym_desc-lesion_mask.nii.gz")[0]
#     env["lesions"][subject_name] = vol_file
# env["lesion_type"] = "volume"
# preprocessing.save_env(env)
# print("I found", len(lesion_files), "lesions files:")
# lesion_files[0:5]  # show me the first five found to verify the path is correct

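For Option Two, the subject label is taken from the BIDS filename. The sketch below shows one way to extract it with a regular expression, assuming filenames of the form `sub-<label>_..._space-MNI152NLin2009cAsym_desc-lesion_mask.nii.gz`; the helper `subject_label` and the example filenames are hypothetical. Note that the split-based code in the Option Two cell keeps the separator before `space-`, so its keys end in an underscore (for example `01_`), whereas this sketch returns the bare label.

```python
import re

# Matches the lesion mask filenames globbed in Option Two.
MASK_PATTERN = re.compile(
    r"^sub-(?P<label>[A-Za-z0-9]+)_.*"
    r"space-MNI152NLin2009cAsym_desc-lesion_mask\.nii\.gz$"
)


def subject_label(filename):
    """Return the label that follows 'sub-' in a BIDS lesion mask filename."""
    match = MASK_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a recognised lesion mask filename: {filename}")
    return match.group("label")


for example in (
    "sub-01_space-MNI152NLin2009cAsym_desc-lesion_mask.nii.gz",
    "sub-patientA_ses-1_space-MNI152NLin2009cAsym_desc-lesion_mask.nii.gz",
):
    print(example, "->", subject_label(example))
```
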

***
## 3. Which Functional/Structural Connectivity Pipelines would you like to run?

There are two **functional** connectivity pipelines available:
- Volume-space functional connectivity with `connectome_quick`
- Surface-space functional connectivity with `connectome_quick`

There are two **structural** connectivity pipelines available:
- Volume-space structural connectivity with `BCB Toolkit`
- Volume-space structural connectivity with `Lesion Quantification Toolkit`

### If at any point you would like to reset your chosen pipelines, run the cell below.

In [None]:
# Run this cell to reset your chosen pipelines
env["connectivity_analyses"] = []

preprocessing.save_env(env)

## 3a. Volume-space Functional Connectivity with `connectome_quick`

**NOTE:** the _directory_ will vary depending on which machine you are logged into. 

On ERISOne, `connectome_dir` should be `/data/nimlab/connectome_npy`

### Select a Functional Volume-space Connectome
The available volume-space connectomes are

- `GSP1000_MF`: (Default) Gender-balanced GSP 1000
- `yeo1000_dil`: Yeo 1000 connectome (Deprecated March 2023)
- `GSP1000`: GSP 1000 connectome processed with the CBIG pipeline (same pipeline as Yeo)
- `GSP346_F`: Female-only GSP 1000 with 346 subjects
- `GSP346_M`: Male-only GSP 1000 with 346 subjects
- `GSP500_F`: Female-only GSP 1000 with 500 subjects
- `GSP500_M`: Male-only GSP 1000 with 500 subjects

**NOTE**: If you are using surface-space ROIs and running the volume-space connectome, you **MUST** select the `GSP1000_MF` connectome.
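
A connectome name can be sanity-checked before it is passed to the pipeline. This is a minimal sketch built only from the list and note above; the helper `check_volume_connectome` is hypothetical, and the string `"surface"` for `lesion_type` is an assumption (only `"volume"` appears explicitly in this notebook).

```python
# Names copied from the list of volume-space connectomes above.
VOLUME_CONNECTOMES = {
    "GSP1000_MF", "yeo1000_dil", "GSP1000",
    "GSP346_F", "GSP346_M", "GSP500_F", "GSP500_M",
}


def check_volume_connectome(name, lesion_type):
    """Return the name if it is usable, or None if the pipeline is skipped."""
    if name == "":
        return None  # a blank name means: do not run this pipeline
    if name not in VOLUME_CONNECTOMES:
        raise ValueError(f"Unknown volume-space connectome: {name}")
    # Assumption: surface-space ROIs are flagged with lesion_type == "surface".
    if lesion_type == "surface" and name != "GSP1000_MF":
        raise ValueError("Surface-space ROIs require the GSP1000_MF connectome")
    return name


print(check_volume_connectome("GSP1000_MF", "surface"))
print(check_volume_connectome("", "volume"))
```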

### If you do not want to generate Functional Volume-space Connectivity maps, leave `connectome_name` blank.

In [None]:
# Set the connectome to use to make fc Maps from the lesion/seed locations:
env["connectome_name"] = ""

# If you are using any of the connectomes produced after fall 2019, dil should be the correct mask.
# If the connectome is the original one used with connectome.sh, then the mask should be 222
# This variable is for metadata only.
env["connectome_mask"] = "dil"

env["input_spaces"], env["output_spaces"], env["connectivity_analyses"] = preprocessing.add_func_pipeline(
    env["connectivity_analyses"],
    env["lesion_type"],
    env["input_spaces"],
    env["output_spaces"],
    env["connectome_name"],
    env["connectome_mask"])

preprocessing.save_env(env)

## 3b. Surface-space Functional Connectivity with `connectome_quick`
### Select a Functional Surface-space Connectome

The available surface-space connectomes are

- `GSP1000_MF_surf_fs5`: Gender-balanced GSP 1000 in fsaverage 5 space

### If you do not want to generate Functional Surface-space Connectivity maps, leave `surf_connectome_name` blank.

In [None]:
# Set the connectome to use to make fc Maps from the lesion/seed locations:
env["surf_connectome_name"] = ""

# This variable is for metadata only.
env["surf_connectome_mask"] = "fs5"
env["input_spaces"], env["output_spaces"], env["connectivity_analyses"] = preprocessing.add_surf_pipeline(
    env["connectivity_analyses"],
    env["lesion_type"],
    env["input_spaces"],
    env["output_spaces"],
    env["surf_connectome_name"],
    env["surf_connectome_mask"])
preprocessing.save_env(env)

## 3c. Volume-space structural connectivity with `BCB Toolkit`

The available structural BCB Toolkit connectomes on ERISOne are

- `Disconnectome7T`: (Default) 178 subject 1mm Dataset from HCP 7T Data
- `tracks2mm`: 100 subject 2mm Dataset
- `Base10`: 10 subject 1mm Dataset
- `Base35`: 35 subject 1mm Dataset

### If you do not want to compute structural connectivity with the BCB Toolkit, leave `bcb_connectome_name` blank.

In [None]:
# Set the connectome to use to make structural disconnection Maps from the lesion/seed locations:
env["bcb_connectome_name"] = ""

# This variable is for metadata only.
env["bcb_connectome_mask"] = "dil"
env["input_spaces"], env["output_spaces"], env["connectivity_analyses"] = preprocessing.add_bcb_pipeline(
    env["connectivity_analyses"],
    env["lesion_type"],
    env["input_spaces"],
    env["output_spaces"],
    env["bcb_connectome_name"],
    env["bcb_connectome_mask"])
preprocessing.save_env(env)

## 3d. Volume-space structural connectivity with `Lesion Quantification Toolkit`

### Configuration Options for the Lesion Quantification Toolkit:
* `connectivity_type`: ('end' or 'pass', Default 'end')

    Specifies the criteria for defining structural connections. There are two options: “end”, which defines the connections between two parcels as those streamlines that end in both parcels, or “pass”, which defines the connections between two parcels as those streamlines that either end in or pass through both parcels. “end” is recommended but will produce sparser connectivity matrices.
* `sspl_spared_thresh`: (integer 1-100, Default 50)
    
    Percent spared threshold for computing SSPLs (e.g. 100 means that only fully spared regions will be included in SSPL calculation; 1 means that all regions with at least 1% spared will be included. Default is 50)
* `smooth`: (integer, Default 2)

    Corresponds to the full-width half-maximum (FWHM) of the smoothing kernel to be applied to the percent disconnection voxel maps. A single value is required (e.g. 2 = 2 FWHM in voxel units; 0 = no smoothing).

Available LQT Connectomes are:
- `HCP842`
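
The option ranges described above can be checked before submitting jobs. The sketch below is one possible validation, assuming the options are held in a plain dictionary with the same keys as `env["lqt_options"]`; the helper `validate_lqt_options` is hypothetical and not part of the toolkit.

```python
def validate_lqt_options(options):
    """Check LQT options against the ranges documented above, filling defaults."""
    connectivity_type = options.get("connectivity_type", "end")
    if connectivity_type not in ("end", "pass"):
        raise ValueError("connectivity_type must be 'end' or 'pass'")

    sspl_spared_thresh = options.get("sspl_spared_thresh", 50)
    if not isinstance(sspl_spared_thresh, int) or not 1 <= sspl_spared_thresh <= 100:
        raise ValueError("sspl_spared_thresh must be an integer from 1 to 100")

    smooth = options.get("smooth", 2)
    if not isinstance(smooth, int) or smooth < 0:
        raise ValueError("smooth must be a non-negative integer (0 = no smoothing)")

    return {
        "connectivity_type": connectivity_type,
        "sspl_spared_thresh": sspl_spared_thresh,
        "smooth": smooth,
    }


print(validate_lqt_options({"connectivity_type": "pass", "smooth": 0}))
```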

### If you do not want to generate structural connectivity maps with the Lesion Quantification Toolkit, leave `lqt_connectome_name` blank.

In [None]:
# Leave lqt_connectome_name blank if you do not want to run the LQT pipeline
env["lqt_connectome_name"] = ""

env["lqt_connectome_mask"] = "dil"
# Set Configuration Options for the Lesion Quantification Toolkit
env["lqt_options"] = {"connectivity_type": 'end',
                      "sspl_spared_thresh": 50,
                      "smooth": 2
                      }
env["input_spaces"], env["output_spaces"], env["connectivity_analyses"] = preprocessing.add_lqt_pipeline(
    env["connectivity_analyses"],
    env["lesion_type"],
    env["input_spaces"],
    env["output_spaces"],
    env["lqt_connectome_name"],
    env["lqt_connectome_mask"])
preprocessing.save_env(env)

## Please verify the selected Functional/Structural Connectivity Pipelines below

In [None]:
env["vol_spaces"], env["surf_spaces"], env["set_connectivity_analyses"] = preprocessing.confirm_connectivity_analyses(
    env["connectivity_analyses"],
    env["lesion_type"],
    env["input_spaces"],
    env["output_spaces"],
    override = False)
preprocessing.save_env(env)

***
## 4. Set up metadata dataframe and rename files to BIDS format

In [None]:
env["meta_df"] = preprocessing.init_meta_df(env["lesion_type"], env["lesions"], env["project_path"], use_datalad)
preprocessing.save_env(env)

***
# Lesion Barbershop

The following cells walk you through the process of doing Quality Assurance on your lesion masks to identify tracing/registration/weird errors in the masks before you generate Functional/Structural Connectivity maps.

## 1. Make sure the Volume lesion dimensions match the FSL MNI 2mm Template, or make sure the Surface atrophy dimensions match the fsaverage5 template and the maps are binary

NOTE: Your volume lesions should ALREADY be registered to, or traced in, MNI space. This code just reslices them to 2mm/1mm voxels and makes the shape conform to our standard bounding box.<br>
**If your lesions are still in individual space, STOP NOW and do your registration first**.<br>
(This repo may be helpful: https://github.com/bchcohenlab/bids_lesion_code)<br>

NOTE: Your surface lesions should already be registered to the fsaverage surface space. This code is just reslicing to fsaverage5 space.

NOTE: This step may take a long time if starting from surface-space ROIs (~3-4 minutes per ROI).

**Instructions**
1. If you wish to threshold your images, set `env["doThreshold"] = True`.
2. If you wish to binarize your images, set `env["binarize"] = True`. You must also set `env["doThreshold"] = True`
3. Set the level to threshold or binarize at with `env["threshold"]`.
4. Set the threshold/binarization direction with `env["direction"]`.

    - If direction is `twosided`, will **threshold/binarize outside** the threshold level.
        - Example: if threshold is 1 and direction is "twosided", then values **between** -1 and +1 will be zeroed.


    - If direction is `less`, will **zero out values greater than** the threshold level.
        - Example: if threshold is -1 and direction is "less", then values **greater** than -1 will be zeroed.


    - If direction is `greater`, will **zero out values less than** the threshold level.
        - Example: if threshold is +1 and direction is "greater", then values **less** than +1 will be zeroed.
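
The three directions can be illustrated on a small array. This is a minimal NumPy sketch of the behaviour described in the instructions above, not the code used by `preprocessing.reslice_and_convert_rois`. The function name `threshold_roi` is hypothetical, and two details are assumptions: values exactly at the threshold are kept, and binarizing sets every surviving nonzero value to 1.

```python
import numpy as np


def threshold_roi(data, threshold, direction="twosided", binarize=False):
    """Zero out values on the excluded side of the threshold (illustrative only)."""
    data = np.asarray(data, dtype=float)
    if direction == "twosided":
        keep = np.abs(data) >= abs(threshold)  # zero values between -t and +t
    elif direction == "less":
        keep = data <= threshold               # zero values greater than t
    elif direction == "greater":
        keep = data >= threshold               # zero values less than t
    else:
        raise ValueError(f"Unknown direction: {direction}")
    out = np.where(keep, data, 0.0)
    if binarize:
        # Assumption: any surviving nonzero value becomes 1.
        out = (out != 0).astype(float)
    return out


values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for direction, level in (("twosided", 1), ("less", -1), ("greater", 1)):
    print(direction, threshold_roi(values, level, direction))
```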

In [None]:
## Options

# The Default settings do no thresholding and no binarization.

# If True, applies a threshold to the image. If you want no thresholding, set to False.
env["doThreshold"] = False

# If True, binarizes ROIs at some threshold. If you want weighted ROIs, set to False.
# If binarize is set to True, doThreshold must also be True.
env["binarize"] = False

# Binarize or threshold weighted image at this value.
env["threshold"] = 0

# Set threshold/binarize direction
env["direction"] = "twosided"

# Type of Registration Fusion approaches used to generate the mappings ("RF_M3Z" or "RF_ANTs"). Defaults to "RF_ANTs"
# RF-M3Z is recommended if data was registered from subject's space to the volumetric atlas space using FreeSurfer.
# RF-ANTs is recommended if such registrations were carried out using other tools, especially ANTs.
env["RF_type"] = "RF_ANTs"

# Interpolation method for conversion from surface to volume.
env["interp"] = "linear"

env["meta_df"] = preprocessing.reslice_and_convert_rois(
    env["lesion_type"],
    env["meta_df"],
    env["vol_spaces"],
    env["surf_spaces"],
    env["project_path"],
    env["doThreshold"],
    env["binarize"],
    env["threshold"],
    env["direction"],
    env["RF_type"],
    env["interp"],
    use_datalad,
)
preprocessing.save_env(env)

***
## 2. Review the lesions to see if any extend outside of the brain, are blank, and/or look weird:

If `visualize = True`, the code will show you each lesion, marked with where it is extending outside of the brain mask.

NOTE: While we use the `MNI152_T1_2mm_brain_mask_dil` mask for the connectivity,<br>
I believe it makes sense to still use the more restrictive `MNI152_T1_2mm_brain_mask` here to mask the lesions, since this better excludes ventricles and sinuses.

In [None]:
# Set visualize to True to see pictures of each lesion, and False if you've already done this and just want to skip ahead
# Note: This will only visualize volume lesions. Surface atrophy will not be visualized, even if visualize is True.

env["visualize"] = False

env["meta_df"], env["brain_masks"] = preprocessing.review_lesions(
    env["visualize"],
    env["lesion_type"],
    env["meta_df"],
    env["project_path"],
    env["vol_spaces"],
    env["surf_spaces"],
    env["brain_mask_2mm"],
    env["brain_mask_1mm"],
    env["brain_mask_fs5"],
    use_datalad,
)
preprocessing.save_env(env)

***
## 3. Trim the lesions to remove voxels outside the brain mask:

NOTE: This only affects the files in `Project/Lesions`.

In [None]:
env["meta_df"] = preprocessing.trim_lesions(
    env["meta_df"],
    env["vol_spaces"],
    env["surf_spaces"],
    env["project_path"],
    env["brain_masks"],
    use_datalad,
)
preprocessing.save_env(env)

***
## 4. Before we generate the fcMaps, show me an overview of where these lesions are located:

In [None]:
preprocessing.show_lesion_overview(
    env["meta_df"],
    env["vol_spaces"],
    env["surf_spaces"])

***
## 5. Generate JSON sidecars for lesion niftis

In [None]:
preprocessing.generate_roi_json_sidecars(
    env["meta_df"],
    env["vol_spaces"],
    env["surf_spaces"],
    env["project_path"],
    env["lesion_type"],
    use_datalad,
)

***
# Generate Functional/Structural Connectivity Maps:

This calls the parallel connectome_quick/BCB Disconnectome/LQT function with your cleaned seeds and the connectome you specified above.<br>
This will take a few minutes to a few hours depending on the number of seeds and the size of the connectome, e.g., 100 vs 1000 subjects.

If you get errors regarding a mismatch in the number of voxels between your seeds and the connectome you have chosen, you may need to specify a brain mask:
- `222` has 285903 voxels (this was used in the past)
- `mni_icbm152` has 225222 voxels
- `MNI152_T1_2mm_brain_mask` has 228483 voxels
- `MNI152_T1_2mm_brain_mask_dil1` has 262245 voxels
- `MNI152_T1_2mm_brain_mask_dil` has 292019 voxels (current default)
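
When an error message reports a voxel count, the table above can be used as a lookup to see which mask it corresponds to. The sketch below uses only the counts listed above; the helper `masks_matching` is hypothetical.

```python
# Voxel counts copied from the list of brain masks above.
MASK_VOXEL_COUNTS = {
    "222": 285903,
    "mni_icbm152": 225222,
    "MNI152_T1_2mm_brain_mask": 228483,
    "MNI152_T1_2mm_brain_mask_dil1": 262245,
    "MNI152_T1_2mm_brain_mask_dil": 292019,
}


def masks_matching(n_voxels):
    """Return the names of the known masks with exactly n_voxels voxels."""
    return [name for name, count in MASK_VOXEL_COUNTS.items() if count == n_voxels]


reported = 285903  # e.g. a count taken from an error message
print(masks_matching(reported) or "No known mask has this voxel count")
```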

In [None]:
# First, save list of cleaned lesions to file:
env["set_connectivity_analyses"] = preprocessing.generate_cleaned_roi_lists(
    env["set_connectivity_analyses"],
    env["meta_df"],
    env["project_path"])
preprocessing.save_env(env)

In [None]:
# Submit jobs to cluster
for analysis in env["set_connectivity_analyses"]:
    if(analysis['tool'] == "connectome_quick"):
        sf.call_connectome_quick(
            read_input=os.path.abspath(analysis['input_list']),
            output_directory=os.path.abspath((os.path.join(env["project_path"],"fc_temp"))),
            numWorkers=4,
            command="seed",
            connectome=analysis['roi_connectome'],
            brain_connectome=analysis['connectome'],
            brain_space="",
            output_mask="",
            cluster_name="eristwo-slurm",
            username=env["cluster_user"],
            cluster_email=env["creator_email"],
            dryrun=False,
            queue = "normal,nimlab",
            cores = "4",
            memory = "16000",
            no_warn="False",
            job_name="",
            job_time="",
            num_nodes="",
            num_tasks="",
            x11_forwarding="",
            service_class="",
            debug=False,
            extra=""
        )
    elif(analysis['tool'] == "BCB Disconnectome"):
        sf.call_disconnectome(
            input_directory = os.path.abspath(analysis['input_list']),
            output_directory = os.path.abspath((os.path.join(env["project_path"],"fc_temp"))),
            connectome_directory = analysis['connectome'],
            threshold = 0,
            cluster_name="eristwo-slurm",
            username=env["cluster_user"],
            cluster_email=env["creator_email"],
            queue = "normal,nimlab",
            cores = "4",
            memory = "8000",
            dryrun=False,
            job_name="",
            job_time="",
            num_nodes="",
            num_tasks="",
            x11_forwarding="",
            service_class="",
            debug=False,
            extra=""
        )
    elif(analysis['tool'] == "Lesion Quantification Toolkit"):
        sf.call_lesion_quantification_toolkit(
            input_directory = os.path.abspath(analysis['input_list']),
            output_directory = os.path.abspath((os.path.join(env["project_path"],"fc_temp"))),
            dataset_name = "preprocessing",
            connectivity_type = env["lqt_options"]["connectivity_type"],
            sspl_spared_thresh = env["lqt_options"]["sspl_spared_thresh"],
            smooth = env["lqt_options"]["smooth"],
            cluster_name="eristwo-slurm",
            username=env["cluster_user"],
            cluster_email=env["creator_email"],
            queue = "normal,nimlab",
            cores = "2",
            memory = "4000",
            dryrun=False,
            job_name="",
            job_time="",
            num_nodes="",
            num_tasks="",
            x11_forwarding="",
            service_class="",
            debug=False,
            extra=""
        )
preprocessing.save_env(env)

## Organize Functional/Structural Connectivity Output
**DO NOT RUN THIS CELL MORE THAN ONCE!**
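
One way to protect against an accidental second run is to record a flag once the step has finished. This is only a sketch: it assumes `env` behaves like a dictionary, and the flag name `connectivity_organized` and the helper `run_once` are hypothetical rather than part of `nimlab`.

```python
def run_once(env, flag, step):
    """Run step() unless env[flag] shows that it already completed."""
    if env.get(flag):
        raise RuntimeError(f"'{flag}' is already set; refusing to run this step again")
    result = step()
    env[flag] = True  # only set after the step finishes without error
    return result


demo_env = {}
print(run_once(demo_env, "connectivity_organized", lambda: "organized"))
try:
    run_once(demo_env, "connectivity_organized", lambda: "organized")
except RuntimeError as err:
    print(err)
```

Because the flag is written only after the step returns, a run that fails partway through is not marked as complete.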

In [None]:
env["meta_df"] = preprocessing.organize_connectivity_output(
    env["meta_df"],
    env["set_connectivity_analyses"],
    env["lesion_type"],
    env["project_path"],
    env["lqt_options"]
    )

preprocessing.save_env(env)

# Final Steps to allow you to easily use the results of your hard work:

### 1. Make a Human-readable `./README.md` that notes the original location of the seeds and which connectome was used

In [None]:
preprocessing.generate_readme(
    env["meta_df"],
    env["set_connectivity_analyses"],
    env["project_path"],
    env["project_name"],
    env["creator_email"],
    env["input_folder"],
    env["lesions"],
    env["lesion_type"]
    )

preprocessing.save_env(env)

### 2. Clone dataset to dl_archive and update the database

**NOTE:** Sometimes this cell will fail the first time you run it. If you get a warning, try running it a second time.

In [None]:
preprocessing.upload_to_dl_archive(
    env["project_path"],
    env["project_name"],
    env["vol_spaces"],
    env["surf_spaces"],
    env["lesion_type"]
)
preprocessing.save_env(env)

***

Your project metadata is now accessible via the metadata_editor notebook. You can build a csv file via the "Browse" tab. Please remember to add some helpful tags to it!