# Preprocessing

## Overview

### Usage

To use the preprocessing command line tool, navigate to the preprocessing directory at /mnt/nfs/lss/lss_kahwang_hpc/scripts/preprocessing and run the command below in your terminal to display the documentation for the tool.

```
$ python3 main.py --help
```
&nbsp;
```
usage: [DATASET_DIR] [SUBCOMMANDS [OPTIONS]]

Run pre-processing on whole dataset or selected subjects

Required Arguments:
  dataset_dir           Base directory of dataset.
  -h, --help            show this help message and exit

Subcommands:
  {heudiconv,mriqc,fmriprep,3dDeconvolve,regressors,3dmema,FD_stats}
    heudiconv           Convert raw data files to BIDS format. Conversion script filepath is required.
    mriqc               Run mriqc on dataset to analyze quality of data.
    fmriprep            Preprocess data with fmriprep pipeline.
    3dDeconvolve        Parse regressor files, censor motion, create stimfiles, and run 3dDeconvolve.
    regressors          Parse regressor files to extract columns and censor motion.
    3dmema              Runs 3dmema.
    FD_stats            Calculates FD statistics for dataset. Outputs csv with % of points over FD threshold and FD mean for each run and subject.
```

As you can see, the program takes the dataset directory as input, followed by a subcommand and its options. We can look at each subcommand's options by running `python3 main.py dataset_dir/ {subcommand} --help`.  
Let's try that below with fmriprep and see what happens.

```
$ python3 main.py dataset_dir fmriprep --help
```

&nbsp;
```
usage: [OPTIONS]

optional arguments:
  -h, --help            show this help message and exit
  --fmriprep_opt FMRIPREP_OPT
                        Options to add to fmriprep. Write between '' and replace - with * as shown: '**[OPTION1] arg1 ** [OPTION2] ...'

Subject arguments:
  -n NUMSUB, --numsub NUMSUB
                        The number of subjects being analyzed. If none listed, default will be whole dataset
  -s [SUBJECTS [SUBJECTS ...]], --subjects [SUBJECTS [SUBJECTS ...]]
                        The subjects being analyzed. Do not include sub- prefix. If subjects are not included, pre-processing will be run on whole dataset by default or on number of subjects given via the --numsub flag

Path arguments:
  --bids_dir BIDS_DIR   Path for bids directory if not located in dataset directory.
  --work_dir WORK_DIR   The working dir for programs. Default for argon is user dir in localscratch. Default for thalamege is work directory in dataset directory.

General Optional Arguments:
  --rerun_mem           Rerun subjects that failed due to memory constraints
  --slots SLOTS         Set number of slots/threads per subject. Default is 4.

Argon HPC Optional Arguments:
  --email               Receive email notifications from HPC
  --no_qsub             Does not submit generated bash scripts.
  --hold_jid HOLD_JID   Jobs will be placed on hold until specified job completes. [JOB_ID]
  --no_resubmit         Enable to not resubmit tasks after migration. Default is to resubmit.
  --mem MEM             Set memory for HPC
  -q QUEUE, --queue QUEUE
                        Set queue for HPC
  --stack STACK STACK   Queue jobs in dependent stacks. When all jobs complete, next will start. Two required integer arguments [# of stacks][# of jobs per stack]. Use 'split' in second argument to split remaining jobs
                        evenly amongst number of stacks.
```

Sweetness. We can see which options are available for the fmriprep subcommand, and it looks like there are a bunch of optional arguments. If you ever forget how to run a command or which options are available, use the `--help` flag.
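
The `--fmriprep_opt` help text asks you to write `*` wherever the fmriprep option would have a `-`, presumably so the forwarded options are not mistaken for flags of `main.py` itself. Here is a minimal sketch of how such a string could be turned back into fmriprep flags. `decode_forwarded_options` is a hypothetical helper written for this example, not a function from the tool, and the real decoding may differ.

```python
import shlex
from typing import List


def decode_forwarded_options(encoded: str) -> List[str]:
    """Translate an asterisk-encoded option string back into normal flags.

    Assumption: every '*' stands for a '-', as the help text suggests.
    Arguments that contain a literal '*' would be mangled by this sketch.
    """
    decoded = encoded.replace("*", "-")
    # shlex.split breaks the string into tokens the way a shell would.
    return shlex.split(decoded)


# --fs-no-reconall and --nthreads are real fmriprep flags, used here as examples.
print(decode_forwarded_options("**fs*no*reconall **nthreads 4"))
```

With that convention, the options above would be passed as `--fmriprep_opt '**fs*no*reconall **nthreads 4'`.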

### How it works

The preprocessing Python program automatically creates a bash script and then either runs it (on thalamege) or submits it (on argon). The program starts from base bash scripts (thalamege: preprocessing/thalamege, argon: preprocessing/argon) and fills in values based on user inputs. The new bash script is written to either preprocessing/thalamege/dataset_dir_name or preprocessing/argon/jobs/dataset_dir_name.
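
To make the "base script plus user inputs" idea concrete, here is a rough sketch of filling in a template and writing it under a per-dataset directory. Everything in it is made up for illustration: the `@PLACEHOLDER@` markers, the template contents, the output filename, and both function names. The real base scripts may use a different mechanism.

```python
from pathlib import Path

# Hypothetical stand-in for a base script; the real ones live in
# preprocessing/thalamege and preprocessing/argon.
BASE_SCRIPT = """#!/bin/bash
dataset_dir=@DATASET_DIR@
slots=@SLOTS@
subjects="@SUBJECTS@"
echo "Processing ${subjects} in ${dataset_dir} with ${slots} slots"
"""


def render_job_script(base_script, dataset_dir, subjects, slots=4):
    """Replace each placeholder in the base script with a user-supplied value."""
    values = {
        "@DATASET_DIR@": str(dataset_dir),
        "@SLOTS@": str(slots),
        "@SUBJECTS@": " ".join(subjects),
    }
    filled = base_script
    for placeholder, value in values.items():
        filled = filled.replace(placeholder, value)
    return filled


def write_job_script(script_text, jobs_root, dataset_dir, filename="fmriprep.sh"):
    """Write the filled-in script to <jobs_root>/<dataset directory name>/<filename>."""
    out_dir = Path(jobs_root) / Path(dataset_dir).name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    out_path.write_text(script_text)
    return out_path


print(render_job_script(BASE_SCRIPT, "/data/ThalHi", ["10001", "10002"]))
```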

The program records output info in the logs/ directory of each command directory (e.g. fmriprep/logs/). If you are running into issues or errors, check the log files first.  
Additionally, the preprocessing pipeline automatically keeps track of completed subjects in the completed_subjects.txt file and failed subjects in the failed_subjects.txt file. This is useful for datasets where subjects are continually being added, such as ThalHi. You simply run the command normally without specifying subjects and it will only run subjects that have not been completed.
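
The bookkeeping described above boils down to a set difference. The sketch below assumes the tracking files hold one subject ID per line, which is a guess about the file format. Both function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Set


def read_subject_list(path) -> Set[str]:
    """Return the subject IDs listed in a tracking file (empty set if it is missing)."""
    path = Path(path)
    if not path.exists():
        return set()
    # Assumption: one subject ID per line, blank lines ignored.
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def pending_subjects(all_subjects: Iterable[str], completed: Set[str]) -> List[str]:
    """Keep only the subjects that have not been completed yet, preserving order."""
    return [sub for sub in all_subjects if sub not in completed]


completed = {"10001"}  # would normally come from read_subject_list(...)
print(pending_subjects(["10001", "10002", "10003"], completed))
```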

### Common Flags

#### General



```
General Optional Arguments:
  --rerun_mem           Rerun subjects that failed due to memory constraints
  --slots SLOTS         Set number of slots/threads per subject. Default is 4.
```

The `--rerun_mem` option reruns all subjects listed in the failed_subjects_mem.txt file in the logs/ directory.  
The `--slots` option sets the number of slots to use per subject. This is equivalent to the number of cores.



#### Subjects

```
Subject arguments:
  -n NUMSUB, --numsub NUMSUB
                        The number of subjects being analyzed. If none listed, default will be whole dataset
  -s [SUBJECTS [SUBJECTS ...]], --subjects [SUBJECTS [SUBJECTS ...]]
                        The subjects being analyzed. Do not include sub- prefix. If subjects are not included, pre-processing will be run on whole dataset (minus completed subjects) by default or on number of subjects given via the --numsub flag
```

These might be the options you use the most. Pretty self-explanatory.  
For subjects, this is what the flag would look like:  
`--subjects 10001 10002`
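
For the curious, the usage string `-s [SUBJECTS [SUBJECTS ...]]` is what argparse prints for an option declared with `nargs="*"`. The sketch below shows how the two flags could be parsed and combined. The selection rule (explicit subjects win, otherwise take the first N) is an assumption, and `build_subject_parser` and `choose_subjects` are hypothetical names, not the tool's own.

```python
import argparse


def build_subject_parser() -> argparse.ArgumentParser:
    """Declare the two subject flags the way the help output suggests."""
    parser = argparse.ArgumentParser(add_help=False)
    group = parser.add_argument_group("Subject arguments")
    group.add_argument("-n", "--numsub", type=int, default=None)
    group.add_argument("-s", "--subjects", nargs="*", default=None)
    return parser


def choose_subjects(available, subjects=None, numsub=None):
    """Pick which subjects to run.

    Assumed rule: explicit --subjects win; otherwise --numsub takes the
    first N available subjects; otherwise everything available is run.
    """
    if subjects:
        return list(subjects)
    if numsub is not None:
        return list(available)[:numsub]
    return list(available)


args = build_subject_parser().parse_args(["--subjects", "10001", "10002"])
print(choose_subjects(["10001", "10002", "10003"], args.subjects, args.numsub))
```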


#### Paths

```
Path arguments:
  --bids_dir BIDS_DIR   Path for bids directory if not located in dataset directory.
  --work_dir WORK_DIR   The working dir for programs. Default for argon is user dir in localscratch. Default for thalamege is work directory in dataset directory.
```

Use the `--bids_dir` flag when the BIDS directory is not in your root dataset directory. For example, the HCP developmental dataset is stored in a shared directory on argon, so I would use `--bids_dir /Dedicated/inc_data/bigdata/hcpd`.  
The `--work_dir` flag changes which working directory the pipeline uses. It is mostly useful on argon for switching between localscratch, nfscratch, and lss.


#### Argon

```
Argon HPC Optional Arguments:
  --email               Receive email notifications from HPC
  --no_qsub             Does not submit generated bash scripts to be run.
  --hold_jid HOLD_JID   Jobs will be placed on hold until specified job completes. [JOB_ID]
  --no_resubmit         Enable to not resubmit tasks after job migration. Default is to resubmit. This should be enabled when running on all.q
  --mem MEM             Set memory for HPC
  -q QUEUE, --queue QUEUE
                        Set queue for HPC. Default is our queue: SEASHORE
  --stack STACKS JOBS_PER_STACK   Queue jobs in dependent stacks. When all jobs complete, next will start. Two required integer arguments [# of stacks][# of jobs per stack]. Use 'split' in second argument to split remaining jobs evenly amongst number of stacks.
```
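
The `--stack` option is the least obvious one, so here is a small sketch of how subjects could be grouped into stacks. It reflects one reading of the help text: each stack is a batch of jobs, and the next batch waits for the previous one (presumably through SGE job dependencies, as with `--hold_jid`). `plan_stacks` is a hypothetical helper, and dropping leftover subjects in the integer case is an assumption of this sketch, not documented behavior.

```python
from typing import List, Sequence, Union


def plan_stacks(
    subjects: Sequence[str],
    n_stacks: int,
    jobs_per_stack: Union[int, str],
) -> List[List[str]]:
    """Group subjects into ordered stacks.

    jobs_per_stack is either an integer or the string 'split', which
    divides all subjects as evenly as possible across n_stacks.
    """
    if n_stacks < 1:
        raise ValueError("n_stacks must be at least 1")
    if jobs_per_stack == "split":
        base, extra = divmod(len(subjects), n_stacks)
        # The first `extra` stacks each take one additional subject.
        sizes = [base + (1 if i < extra else 0) for i in range(n_stacks)]
    else:
        sizes = [int(jobs_per_stack)] * n_stacks

    stacks, start = [], 0
    for size in sizes:
        chunk = list(subjects[start:start + size])
        if chunk:
            stacks.append(chunk)
        start += size
    return stacks


print(plan_stacks(["s1", "s2", "s3", "s4", "s5", "s6", "s7"], 3, "split"))
```

Under this reading, `--stack 3 split` with seven subjects would give stacks of three, two, and two subjects.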

### Submitting on Argon vs Thalamege

The program behaves slightly differently when submitting on argon vs thalamege. On argon, jobs are submitted to the SGE scheduler and will generally be submitted as task arrays split up by subject. On thalamege, the jobs run in parallel, again split up by subject.  
The program automatically knows which host you are on, so don't worry about having to tell it anything about the host.
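
One common way to do this kind of host detection is to look at the machine's hostname. The sketch below assumes the hostnames contain "argon" or "thalamege", which may not match how the tool actually decides. `detect_host` and the example hostname are hypothetical.

```python
import socket
from typing import Optional


def detect_host(hostname: Optional[str] = None) -> str:
    """Return 'argon' or 'thalamege' based on the hostname.

    Assumption: the hostname contains the cluster or server name.
    """
    name = (hostname or socket.gethostname()).lower()
    if "argon" in name:
        return "argon"
    if "thalamege" in name:
        return "thalamege"
    raise RuntimeError(f"Unrecognized host: {name}")


# Passing a hostname explicitly makes the function easy to try out anywhere.
print(detect_host("argon-login-1"))
```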

On argon, for some jobs, especially fmriprep, it makes sense to first copy the data into localscratch to make the job run faster. Localscratch is storage local to the compute node, so argon accesses it much faster than our network lss drive. Upon the job finishing, any output data is copied over to the target output directory. Localscratch is the default working directory on argon, and it is used automatically for fmriprep. Localscratch storage is much smaller, so for subjects with lots of sessions you may run out of space. If this happens, simply change the working directory to one on the lss.  
Check out the base fmriprep script to see an example of how that's done: preprocessing/argon/fmriprep_base.sh
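
The real staging logic lives in the bash base script, but the pattern is easy to show in Python: copy inputs to fast scratch space, do the work there, copy results back, and clean up. This is only an illustration of the idea. `run_with_scratch` and its arguments are hypothetical and do not mirror the base script line by line.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_scratch(input_dir, output_dir, scratch_root, process):
    """Stage data into scratch, run `process`, then copy the results out.

    `process` is any callable taking (staged_input_dir, staged_output_dir).
    `scratch_root` must already exist (e.g. a localscratch directory).
    """
    scratch = Path(tempfile.mkdtemp(dir=scratch_root))
    try:
        staged_in = scratch / "input"
        staged_out = scratch / "output"
        shutil.copytree(input_dir, staged_in)  # copy inputs onto fast storage
        staged_out.mkdir()
        process(staged_in, staged_out)  # do the heavy work locally
        # Copy results to the (slower) target directory once the work is done.
        shutil.copytree(staged_out, output_dir, dirs_exist_ok=True)
    finally:
        # Always free the scratch space, even if processing failed.
        shutil.rmtree(scratch, ignore_errors=True)
```

Cleaning up in the `finally` block matters here because scratch space is small and shared.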

## fmriprep

## mriqc

## 3dDeconvolve

## regressors

## heudiconv

**Heudiconv is part of the preprocessing command line tool. Let's look at its input options by inputting the --help flag after the heudiconv command.**  
**Example shown below.**
\
&nbsp;


```
python main.py dataset_dir/ heudiconv --help

usage: [SCRIPT_PATH][OPTIONS]

positional arguments:
  script_path           Filename of script. Script must be located in following directory: /data/backed_up/shared/bin/heudiconv/heuristics/

optional arguments:
  -h, --help            show this help message and exit
  --post_conv_script POST_CONV_SCRIPT
                        Filepath of post-heudiconv Conversion script. Occasionally needed to make further changes after running heudiconv.

Subject arguments:
  -n NUMSUB, --numsub NUMSUB
                        The number of subjects being analyzed. If none listed, default will be whole dataset
  -s [SUBJECTS [SUBJECTS ...]], --subjects [SUBJECTS [SUBJECTS ...]]
                        The subjects being analyzed. Do not include sub- prefix. If subjects are not included, pre-processing will be run on whole dataset by default or on number of subjects given via the --numsub flag

Path arguments:
  --bids_dir BIDS_DIR   Path for bids directory if not located in dataset directory.
  --work_dir WORK_DIR   The working dir for programs. Default for argon is user dir in localscratch. Default for thalamege is work directory in dataset directory.
```


**We can see from the documentation that the heudiconv command takes one required input (SCRIPT_PATH) and some optional flags.** \
**The script path refers to the Python script used to run heudiconv.**   
```
python main.py dataset_dir/ heudiconv /data/backed_up/shared/bin/heudiconv/heuristics/{Script Name}.py
```


**Sometimes you need to make changes to the data after running heudiconv. You can specify a post-conversion script with the --post_conv_script flag.**  
**For example, for ThalHi our post-conversion script can be found at /mnt/nfs/lss/lss_kahwang_hpc/scripts/thalhi/heudiconv_post.py. The heudiconv command would then be:** 
```
python main.py dataset_dir/ heudiconv /data/backed_up/shared/bin/heudiconv/heuristics/{Script Name}.py --post_conv_script /mnt/nfs/lss/lss_kahwang_hpc/scripts/thalhi/heudiconv_post.py
```

**Like our other preprocessing commands, we can specify which subjects to run or how many to run using the --subjects and --numsub flags.**
```
python main.py dataset_dir/ heudiconv /data/backed_up/shared/bin/heudiconv/heuristics/{Script Name}.py --subjects 10001 10002
python main.py dataset_dir/ heudiconv /data/backed_up/shared/bin/heudiconv/heuristics/{Script Name}.py --numsub 2
```

**The script will run in parallel on each subject.**
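
As a rough picture of what "in parallel on each subject" means, the sketch below maps a per-subject function over a list of subjects with a thread pool. It is not the tool's implementation: `run_per_subject` and `convert_one` are hypothetical, and the real pipeline may launch separate processes from its bash script instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def run_per_subject(
    subjects: Sequence[str],
    convert_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run `convert_one` once per subject, several subjects at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with subjects.
        results = list(pool.map(convert_one, subjects))
    return dict(zip(subjects, results))


# Stand-in for the real conversion: just build the BIDS-style subject label.
print(run_per_subject(["10001", "10002"], lambda sub: "sub-" + sub))
```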

## FD_stats
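
The help output at the top of this page describes FD_stats as reporting, for each run and subject, the percentage of points over an FD threshold and the mean FD. The sketch below computes those two numbers for a single run. `fd_summary` is a hypothetical function and the 0.2 default threshold is an assumption for this example. Check the command's `--help` output for the actual options.

```python
import math
from statistics import mean
from typing import Dict, Iterable


def fd_summary(fd_values: Iterable[float], threshold: float = 0.2) -> Dict[str, float]:
    """Summarize framewise displacement (FD) for one run.

    NaN entries are skipped, since the first volume of a run typically
    has no FD value.
    """
    valid = [fd for fd in fd_values if not math.isnan(fd)]
    if not valid:
        raise ValueError("no valid FD values")
    n_over = sum(1 for fd in valid if fd > threshold)
    return {
        "fd_mean": mean(valid),
        "pct_over_threshold": 100.0 * n_over / len(valid),
    }


print(fd_summary([float("nan"), 0.0, 0.5, 0.25, 0.25]))
```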