# Applying Pre-Implemented Analyses

An *Analysis* class can be applied to a dataset flexibly: you choose how and where the data is stored, which derivatives are required, and the computing environment in which to generate them. This is all specified by the arguments used when "instantiating" an Analysis object from the given Analysis class. The clear separation between design (classes) and application (objects) makes analyses implemented in Arcana highly portable between computing environments and research centres.

## Inspecting the Analysis Class

We will start by importing a predefined Analysis class `example.analysis.BasicBrainAnalysis`, which performs the same analysis as the workflow in the [Workflows Notebook](basic_workflow.ipynb). We print the "menu", i.e. the list of inputs, derivatives and parameters that objects of this class can receive/derive, using the `static_menu` class method.

In [None]:
from example.analysis import BasicBrainAnalysis
print(BasicBrainAnalysis.static_menu())

To see the "full" menu, which includes all the intermediate derivatives that can be produced by the analysis, pass the `full` flag to `static_menu`:

In [None]:
print(BasicBrainAnalysis.static_menu(full=True))

## Defining the Dataset to Analyse

Arcana implicitly handles a lot of the menial tasks involved with data input/outputs such as file format conversions and inserting/retrieving data from a repository service (e.g. XNAT). To specify where your data is you need to create a Dataset object.

### Datasets in Directories on Local System

The simplest form of dataset object is just a directory on (or mounted on) your local file system. The structure of this directory depends on its "depth", i.e. whether it has multiple subjects and visits in it or not.

#### Depth: 0

Although it is typically only used for prototyping, you can define a dataset for a single subject by storing all the data within a single directory.

In [None]:
%%bash
# Create a dataset for a single session in a flat directory. We will copy data from the BIDS formatted ds000114
SAMPLE_DSET=output/sample-datasets/depth0
mkdir -p $SAMPLE_DSET
find data/ds000114/sub-01/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/
tree $SAMPLE_DSET

In [None]:
from arcana import Dataset
dset0 = Dataset('output/sample-datasets/depth0')
print(dset0)

Notice the `depth` of this dataset is `0`. This means that there aren't any sub-directories for separate subjects or visits in it. However, all datasets in Arcana have an implicit depth of 2 (although future versions may relax this restriction) so we can see that the single "session" (a single visit of a subject) is assigned default subject and visit IDs of 'SUBJECT' and 'VISIT' respectively.

In [None]:
print('subjects:', list(dset0.subject_ids))
print('visits:', list(dset0.visit_ids))

#### Depth: 1

For a multi-subject dataset we can add a sub-directory for each subject:

In [None]:
%%bash
# Create a dataset for multiple subjects in separate sub-directories by copying data from the BIDS formatted ds000114
SAMPLE_DSET=output/sample-datasets/depth1
mkdir -p $SAMPLE_DSET/sub1 $SAMPLE_DSET/sub2  $SAMPLE_DSET/sub3
find data/ds000114/sub-01/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub1
find data/ds000114/sub-02/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub2
find data/ds000114/sub-03/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub3
tree $SAMPLE_DSET

In [None]:
dset1 = Dataset('output/sample-datasets/depth1', depth=1)
print(dset1)
print('subjects:', list(dset1.subject_ids))
print('visits:', list(dset1.visit_ids))

**Note** that we need to explicitly provide the depth of `1` otherwise Arcana will interpret our 'sub1', 'sub2' and 'sub3' as filesets.

#### Depth: 2

For a dataset with multiple visits per subject we use a depth of `2`:

In [None]:
%%bash
# Create a dataset with multiple subjects and visits in nested sub-directories by copying data from the BIDS formatted ds000114
SAMPLE_DSET=output/sample-datasets/depth2
mkdir -p $SAMPLE_DSET/sub1/test $SAMPLE_DSET/sub1/retest $SAMPLE_DSET/sub2/test \
         $SAMPLE_DSET/sub2/retest $SAMPLE_DSET/sub3/test $SAMPLE_DSET/sub3/retest
find data/ds000114/sub-01/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub1/test
find data/ds000114/sub-02/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub2/test
find data/ds000114/sub-03/ses-test -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub3/test
find data/ds000114/sub-01/ses-retest -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub1/retest
find data/ds000114/sub-02/ses-retest -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub2/retest
find data/ds000114/sub-03/ses-retest -name '*.nii.gz' | xargs -I% cp -f % $SAMPLE_DSET/sub3/retest
tree $SAMPLE_DSET

In [None]:
dset2 = Dataset('output/sample-datasets/depth2', depth=2)
print(dset2)
print('subjects:', list(dset2.subject_ids))
print('visits:', list(dset2.visit_ids))
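To make the relationship between `depth` and the directory layout concrete, here is a minimal sketch in plain Python of how (subject, visit) pairs could be read off a directory tree. It is not Arcana's implementation: `list_sessions` is a hypothetical helper, and it assumes that a depth-1 dataset falls back to the same default visit ID (`'VISIT'`) that was reported for the depth-0 dataset.

```python
import tempfile
from pathlib import Path

# Default IDs reported for the depth-0 dataset earlier in this notebook.
DEFAULT_SUBJECT = 'SUBJECT'
DEFAULT_VISIT = 'VISIT'


def list_sessions(root, depth):
    """Return (subject_id, visit_id) pairs for a directory dataset.

    Hypothetical helper for illustration only, not part of Arcana.
    """
    root = Path(root)
    if depth == 0:
        # Flat directory: a single session with default IDs.
        return [(DEFAULT_SUBJECT, DEFAULT_VISIT)]
    subjects = sorted(p.name for p in root.iterdir() if p.is_dir())
    if depth == 1:
        # One sub-directory per subject; the visit ID is assumed to default.
        return [(s, DEFAULT_VISIT) for s in subjects]
    if depth == 2:
        # Subject directories that each contain one directory per visit.
        return [(s, v.name)
                for s in subjects
                for v in sorted((root / s).iterdir()) if v.is_dir()]
    raise ValueError(f"Unsupported depth: {depth}")


# Build a throwaway tree with the same shape as the depth-2 sample dataset.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ('sub1', 'sub2', 'sub3'):
        for visit in ('test', 'retest'):
            (Path(tmp) / sub / visit).mkdir(parents=True)
    sessions = list_sessions(tmp, depth=2)

print(sessions)
```

Reading the same tree with `depth=1` would treat `test` and `retest` as contents of each subject rather than as visits, which is why the depth has to be stated explicitly.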

If, say, the `retest` session of `sub3` were corrupted, we could exclude it from the analysis by dropping either `sub3` or `retest` from the dataset by filtering the IDs:

In [None]:
dset2_filter_subs = Dataset('output/sample-datasets/depth2', depth=2, subject_ids=['sub1', 'sub2'])
print(dset2_filter_subs)
print('subjects:', list(dset2_filter_subs.subject_ids))
print('visits:', list(dset2_filter_subs.visit_ids))

To filter the visits used in the analysis:

In [None]:
dset2_filter_vis = Dataset('output/sample-datasets/depth2', depth=2, visit_ids=['test'])
print(dset2_filter_vis)
print('subjects:', list(dset2_filter_vis.subject_ids))
print('visits:', list(dset2_filter_vis.visit_ids))

Or to filter both:

In [None]:
dset2_filter_both = Dataset('output/sample-datasets/depth2', depth=2, subject_ids=['sub1', 'sub2'], visit_ids=['test'])
print(dset2_filter_both)
print('subjects:', list(dset2_filter_both.subject_ids))
print('visits:', list(dset2_filter_both.visit_ids))
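The effect of the two filters on the set of sessions can be sketched in plain Python. This is only an illustration of the selection logic, not Arcana code; `filter_sessions` is a hypothetical helper that treats the dataset as a list of (subject, visit) pairs.

```python
from itertools import product


def filter_sessions(sessions, subject_ids=None, visit_ids=None):
    """Keep only sessions whose subject and visit IDs are in the given lists.

    A filter left as None keeps every ID (hypothetical helper for illustration).
    """
    return [(s, v) for s, v in sessions
            if (subject_ids is None or s in subject_ids)
            and (visit_ids is None or v in visit_ids)]


# The six sessions of the depth-2 sample dataset.
all_sessions = list(product(['sub1', 'sub2', 'sub3'], ['test', 'retest']))

no_sub3 = filter_sessions(all_sessions, subject_ids=['sub1', 'sub2'])
test_only = filter_sessions(all_sessions, visit_ids=['test'])
both = filter_sessions(all_sessions, subject_ids=['sub1', 'sub2'],
                       visit_ids=['test'])

print(len(all_sessions), len(no_sub3), len(test_only), len(both))
```

Because the filters act on whole subjects or whole visits, excluding a single bad session this way also drops the good sessions that share its subject ID or visit ID.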

### Datasets on XNAT

In addition to data stored on your local file system, Arcana can transparently handle all interactions (i.e. downloading/uploading) with datasets stored in XNAT repositories.

To test this we will use a public project set up on Monash's public XNAT instance

In [None]:
import os.path as op
from arcana import XnatRepo
xnat_repo = XnatRepo(server='https://xnat.monash.edu', cache_dir=op.expanduser('~/xnat-cache'))
print(xnat_repo)
xnat_dataset = xnat_repo.dataset('MISC0002')  # This is the ID of the project on MXNAT
print(xnat_dataset)
print('subjects:', list(xnat_dataset.subject_ids))
print('visits:', list(xnat_dataset.visit_ids))

**Note:** If you have a look at the 'MISC0002' project on https://xnat.monash.edu.au you will notice that subjects and sessions are labelled according to the conventions used at MBI, i.e. PROJECTID_SUBJECTID and PROJECTID_SUBJECTID_VISITID for subject and session IDs, respectively. This is a current limitation of Arcana although it should be relaxed in the next month or so.
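As a rough illustration of that labelling convention, the sketch below splits a session label of the form PROJECTID_SUBJECTID_VISITID into its subject and visit IDs. It is not how Arcana parses labels: `split_session_label` is a hypothetical helper, the example IDs `S01` and `V01` are made up, and it assumes that neither the subject ID nor the visit ID contains an underscore.

```python
def split_session_label(label, project_id):
    """Split 'PROJECTID_SUBJECTID_VISITID' into (subject_id, visit_id).

    Hypothetical helper; assumes subject and visit IDs contain no underscores.
    """
    prefix = project_id + '_'
    if not label.startswith(prefix):
        raise ValueError(
            f"Label '{label}' does not start with project ID '{project_id}'")
    parts = label[len(prefix):].split('_')
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Label '{label}' is not of the form PROJECTID_SUBJECTID_VISITID")
    subject_id, visit_id = parts
    return subject_id, visit_id


# 'S01' and 'V01' are made-up IDs used purely for illustration.
print(split_session_label('MISC0002_S01_V01', 'MISC0002'))
```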

### Notes on Other Repository/Dataset Types

At this stage XNAT is the only data repository platform supported by Arcana. However, care has been taken to modularise the code as much as possible so it should be fairly straightforward to implement support for other platforms (e.g. Loris, DaRIS, MyTaRDIS) as long as they have a REST API (or equivalent) that enables you to list, get and put data. See the base repository class `arcana.repository.base.Repository` for details on the six abstract methods that need to be overridden.

Unimelb users (hello David:), there used to be a DaRIS module in early versions of Arcana, which could be resurrected without too much effort if you have a DaRIS instance to test against.

`Banana` also adds support for the [BIDS](https://bids.neuroimaging.io) format via the `BidsDataset`. `BidsDataset` objects are able to parse the specific naming conventions and directory-tree structure that BIDS requires, and insert derivatives in the `derivatives` directory.

## Configuring the Software Environment

A key feature of Nipype and Arcana is the ability to interface with any sort of external toolkit, whether it runs in Bash, Python, Matlab, etc. For the sake of reproducibility, it is important to detect and record exactly which version of these tools was used to run the analysis.

While it is often advisable to use the latest versions of such toolkits, in some circumstances you may need to use different versions of the same package (e.g. FSL, SPM) to run different sections of your workflow (a real headache). In order to manage the installation and use of different toolkit versions on the same system, high-performance computing clusters (such as MASSIVE/CVL) typically require ["Environment Modules"](http://modules.sourceforge.net) to be loaded before running a toolkit.

Arcana encapsulates the handling of such issues within `Environment` objects that are configured by the user at runtime. There are currently two available classes, `StaticEnv` and `ModulesEnv`.

### Static Environments

As the name suggests, `StaticEnv` does not attempt to configure the software environment on the system, and simply detects and records the version of the software used. Because of this inflexibility it is typically used in the prototyping phase but could still be useful when running on local workstations without environment modules installed.

Configuring a static environment is very simple as it only takes two fairly self-explanatory parameters, `fail_on_missing` and `fail_on_undetectable` (both `True` by default). Therefore it is typically okay to just initialise the class without any parameters, i.e.


In [3]:
from arcana import StaticEnv
static_env = StaticEnv()
print(static_env)

<arcana.environment.static.StaticEnv object at 0x11e5a43d0>
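To give a feel for what "detects and records the version" involves, the following is a minimal, self-contained sketch of version detection for a command-line tool. It is not Arcana's implementation: `detect_version` and its `version_flag` argument are hypothetical, and the way the two flags are interpreted here (raise if the tool is not found, raise if no version string can be read) is an assumption based on their names.

```python
import shutil
import subprocess
import sys


def detect_version(command, version_flag='--version',
                   fail_on_missing=True, fail_on_undetectable=True):
    """Return the first line printed by '<command> <version_flag>'.

    Hypothetical helper for illustration; the flag semantics are assumed.
    """
    path = shutil.which(command)
    if path is None:
        # The tool is not installed or not on the PATH.
        if fail_on_missing:
            raise RuntimeError(f"Could not find '{command}' on the system")
        return None
    result = subprocess.run([path, version_flag],
                            capture_output=True, text=True)
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    if result.returncode != 0 or not output:
        # The tool exists but its version could not be determined.
        if fail_on_undetectable:
            raise RuntimeError(f"Could not detect version of '{command}'")
        return None
    return output.splitlines()[0]


# The running Python interpreter is used as a tool that is certain to exist.
print(detect_version(sys.executable))
# A missing tool is tolerated when fail_on_missing is switched off.
print(detect_version('no-such-tool-for-this-example', fail_on_missing=False))
```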


### Modules Environments

`ModulesEnv` loads and unloads environment modules on the system before and after each node of the workflow is run, respectively, based on software requirements specified in the Analysis class (see the `arcana.environment.requirement` and `banana.requirement` packages for examples).

In most cases, environment modules are named fairly sensibly and line up with the names of the built-in requirements of Arcana and Banana. But in the case of unconventional naming schemes, Arcana/Banana requirements can be mapped onto the names and versions of modules installed on your system with the `packages_map` and `versions_map` parameters.

For example, on the CVL there is a special version of Matlab 2017b that interacts with the machine learning package "caffe". Its module, 'matlab/r2017b-caffe', is used in place of 2017b:

In [9]:
from arcana import ModulesEnv
from arcana.environment.requirement import matlab_req
modules_env = ModulesEnv(versions_map={matlab_req: {'2017b-caffe': '2017b'}})
print(modules_env)

<arcana.environment.modules.ModulesEnv object at 0x11e5eee10>
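The general idea behind the two maps can be sketched without Arcana as a lookup that translates a required package and version into the name of a locally installed module. Everything in this sketch is an assumption made for illustration: `module_name` is a hypothetical helper, packages are keyed by plain strings rather than requirement objects, and the maps are taken to point from the requirement's name/version to the local one, which may not match the key/value order that `ModulesEnv` expects.

```python
def module_name(package, version, packages_map=None, versions_map=None):
    """Translate a required package/version into a '<name>/<version>' module.

    Hypothetical helper; entries missing from a map fall back to the
    requirement's own name or version.
    """
    packages_map = packages_map or {}
    versions_map = versions_map or {}
    local_package = packages_map.get(package, package)
    local_version = versions_map.get(package, {}).get(version, version)
    return f'{local_package}/{local_version}'


# Assumed direction: required version -> version string of the local module.
site_versions = {'matlab': {'2017b': 'r2017b-caffe'}}

print(module_name('matlab', '2017b', versions_map=site_versions))
print(module_name('fsl', '5.0.10'))  # unmapped requirements pass through
```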


**Note:** when an analysis workflow is run using `ModulesEnv` it will unload previously loaded versions of modules (such as those in the `neuro-workshop` module on the CVL).

### Notes on Other Environment Types

You can't talk about reproducibility in science without mentioning container technology these days (for good reason) so you may ask why Arcana/Banana doesn't use Docker or Singularity to manage software versions. The answer is just that I haven't found the time to implement this yet, but it is a very high priority for this summer :)

The plan is to create new `Environment` classes for both Docker and Singularity that will manage the installation of appropriate containers on the system as well as running each node of the workflow in a separate container. The workaround at present is to use the `monashbi/banana` Docker container, which has all the relevant tools installed within a single container. However, you cannot use this on the CVL as Docker is not installed for security reasons so you are stuck with the `ModulesEnv` (for now).

## Workflow Execution with Processors

`Processor`s in Arcana are very much like the concept of execution plugins in Nipype (see the [Execution Plugins Notebook](basic_plugin.ipynb)). In fact they are just very thin wrappers around execution plugins plus a couple of methods implemented in the base `Processor` class. As such, they control how the workflow graph is executed on the host system (i.e. single process, multi-process, via a job-scheduler). For the user the only important difference is that they also specify the working directory and a few other execution parameters used when running the analysis.

### SingleProc

`SingleProc` does what the name suggests: it runs the workflow in a single process. It wraps Nipype's `LinearPlugin`, and any unused keyword arguments will be passed to `LinearPlugin`. To instantiate it you only need to provide the working directory:

In [None]:
from arcana import SingleProc
single_proc = SingleProc('work/')
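Conceptually, running a workflow in a single process means visiting the nodes of the workflow graph one at a time in an order that respects their dependencies. The sketch below illustrates that idea using only the standard library (`graphlib`, Python 3.9+). It is not how Nipype or Arcana are implemented: `run_linear` and the three toy nodes are hypothetical.

```python
from graphlib import TopologicalSorter


def run_linear(nodes, dependencies):
    """Run each node once, in dependency order, within the current process.

    nodes: maps a node name to a callable taking a dict of upstream results.
    dependencies: maps a node name to the set of node names it depends on.
    Hypothetical helper for illustration only.
    """
    results = {}
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        # Gather the outputs of the upstream nodes this node depends on.
        inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = nodes[name](inputs)
        order.append(name)
    return results, order


# A toy three-node pipeline: load -> double -> report.
nodes = {
    'load': lambda inputs: 3,
    'double': lambda inputs: inputs['load'] * 2,
    'report': lambda inputs: f"result={inputs['double']}",
}
dependencies = {'load': set(), 'double': {'load'}, 'report': {'double'}}

results, order = run_linear(nodes, dependencies)
print(order)
print(results['report'])
```

Running nodes one after another like this is slow for large graphs, but it keeps tracebacks and log output in execution order, which helps when debugging a new analysis.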