# WORC Tutorial

Welcome to the WORC tutorial: a Workflow for Optimal Radiomics Classification! It will provide you with basic knowledge and practical skills on how to run WORC. For advanced topics and WORCflows, please see the *WORCAdvanced* iPython notebook, which is also provided with this tutorial.

Besides this tutorial, a special Virtual Machine (VM) has been made on which many components are pre-installed. Hence, for many parts of this tutorial, the instructions differ depending on whether you use the VM or not.

This tutorial consists of the following components:
1. WORC Installation and Configuration
2. WORC Example


For any questions or tips, please contact me at m.starmans@erasmusmc.nl!

## WORC Installation and Configuration
When using the VM, all components required for WORC are already pre-installed. When not using the VM, please use the *installation.sh* script to install the required components.

WORC makes use of [fastr](https://fastr.readthedocs.io/en/stable/), a Python package for standardizing workflows. Fastr has extensive documentation, so I will only highlight several important components here. I do recommend that you at least read the fastr introduction.

### 1. The actual [fastr configuration file](https://fastr.readthedocs.io/en/stable/static/file_description.html#config-file)
The fastr configuration file can be stored in the hidden *~/.fastr* folder as *config.py*. It is a plain Python file and is formatted as such. Note that upon installation, default settings are used and the file is not actually created. You can also have multiple configuration files, which stack: these have to be stored in the *~/.fastr/config.d* folder. For WORC and PREDICT, configuration files have been added to that folder.

You can look at the fastr configuration in the following way:

In [1]:
import fastr

print(fastr.config)

# [bool] Flag to enable/disable debugging
debug = False

# [str] Directory containing the fastr examples
examplesdir = "C:\\Users\\Marty\\Anaconda3\\envs\\py372\\lib\\site-packages\\fastr-3.0.2-py3.7.egg\\fastr\\examples"

# [str] The default execution plugin to use
execution_plugin = "ProcessPoolExecution"

# [str] Execution script location
executionscript = "C:\\Users\\Marty\\Anaconda3\\envs\\py372\\lib\\site-packages\\fastr-3.0.2-py3.7.egg\\fastr\\execution\\executionscript.py"

# [str] Redis url e.g. redis://localhost:6379
filesynchelper_url = ""

# [str] The level of cleanup required, options: all, no_cleanup, non_failed
job_cleanup_level = "no_cleanup"

# [bool] Indicate if default logging settings should log to files or not
log_to_file = False

# [str] Directory where the fastr logs will be placed
logdir = "C:\\Users\\Marty\\.fastr\\logs"

# [dict] Python logger config
logging_config = {}

loglevel = 20

# [str] Type of logging to use
logtype = "default"

# [dict] A dictionary c

We created a *~/.fastr/config.py* file for you on the VM, in which we specified two fields:

job_cleanup_level = 'no_cleanup'
execution_plugin = 'LinearExecution'

**Please also do so if you are not using the VM**. More details on these settings will be given in the Execution part of the WORC Example later in this tutorial.
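
As a minimal sketch of what such a file could look like, the snippet below writes the two fields to a *config.py* in a directory of your choice and reads them back. The helper name `write_fastr_config` is hypothetical and not part of fastr; only the two field names and their values come from this tutorial.

```python
import os
import runpy
import tempfile


def write_fastr_config(config_dir, settings):
    """Write settings as plain Python assignments to <config_dir>/config.py.

    Hypothetical helper for illustration; not part of fastr.
    """
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, 'config.py')
    with open(config_path, 'w') as fh:
        for key, value in settings.items():
            # repr() yields a valid Python literal for strings, bools and numbers
            fh.write('{} = {!r}\n'.format(key, value))
    return config_path


# The two fields used in this tutorial
tutorial_settings = {
    'job_cleanup_level': 'no_cleanup',
    'execution_plugin': 'LinearExecution',
}

# Written to a scratch directory here; on a real system this would be ~/.fastr
demo_dir = tempfile.mkdtemp()
demo_config = write_fastr_config(demo_dir, tutorial_settings)

# The file is plain Python, so it can be read back by executing it
loaded = runpy.run_path(demo_config)
print(loaded['job_cleanup_level'], loaded['execution_plugin'])
```

Writing to a scratch directory first lets you inspect the result before copying it to *~/.fastr*.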

### 2. Mounts
Fastr defines several *mounts*, i.e. paths that are referred to by a specific name. These are very useful for running pipelines across platforms through the Virtual File System (VFS) plugin, which will be detailed in the WORC Example. For example, suppose you define the mount *Data* as */home/worc/Documents/Data* on this machine and as */home/yourname/somepath* on another machine. When storing the file *image.nii.gz* in that folder on both machines, you can run a pipeline that uses *vfs://Data/image.nii.gz* on both machines without needing any adjustments.
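
To make the idea concrete, here is a toy resolver that maps a URL of the form `vfs://<mount>/<relative path>` to a local path, given a dictionary of mounts. It is only an illustration of the concept, not fastr's implementation; the function name `resolve_vfs_url` is hypothetical, and the two mount definitions are the ones from the example in the text.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_vfs_url(url, mounts):
    """Toy resolver: map vfs://<mount>/<relative path> to a local path.

    Illustration only; this is not fastr's implementation.
    """
    parts = urlsplit(url)
    if parts.scheme != 'vfs':
        raise ValueError('Not a vfs URL: {}'.format(url))
    mount_name = parts.netloc  # first element after vfs://
    relative_path = parts.path.lstrip('/')
    return posixpath.join(mounts[mount_name], relative_path)


# The same URL, resolved with the mount definitions of two machines
mounts_this_machine = {'Data': '/home/worc/Documents/Data'}
mounts_other_machine = {'Data': '/home/yourname/somepath'}

url = 'vfs://Data/image.nii.gz'
path_here = resolve_vfs_url(url, mounts_this_machine)
path_there = resolve_vfs_url(url, mounts_other_machine)
print(path_here)
print(path_there)
```

The pipeline only ever stores the URL, so moving it to another machine comes down to defining the mount there.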

WORC makes use of several of these mounts, namely *worc_example_data*, *apps*, *output* and *test*. These have already been defined for you in the *~/.fastr/config.d/WORC_config.py* file. **Important Note**: the Python *site* package is used to automatically locate your installation folder. This does not work in a virtual environment. I have tried to circumvent this, but please check the config file to see whether the packagedir actually refers to the directory where the Python packages of your virtual environment are located.

The mounts can also be found in the fastr config, where you can see the mounts we previously mentioned for WORC:

In [None]:
print(fastr.config.mounts)

## WORC Example

It's time to create and run your first WORC Example! We will use open source data from the [BMIA XNAT](https://xnat.bmia.nl/) platform. XNAT is an open source, online platform designed for storing, sharing and structuring medical image data. The BMIA XNAT is an initiative from the Netherlands to create an online biobank. Although most datasets are private and therefore not accessible, there are several public datasets. We will make use of the [STW Strategy Multidelination set](https://xnat.bmia.nl/app/template/XDATScreen_report_xnat_projectData.vm/search_element/xnat:projectData/search_field/xnat:projectData.ID/search_value/stwstrategymmd), which consists of CT and PET scans of patients with lung cancer. More details can be found in the following paper:


Aerts, H. J. W. L., Velazquez, E. R., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., … Lambin, P. (2014, June 3). Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Communications. Nature Publishing Group. http://doi.org/10.1038/ncomms5006

Feel free to use images of your own instead! **NOTE:** the pipeline execution setup for this notebook is set to be rather slow to save computing resources. Hence we suggest you start with only a handful of patients (e.g. 10) to run this notebook.

We will first download the images using the XNATpy package, convert them to Nifti files and create very simplistic masks of the patient. You do not need to understand these steps in order to use WORC, just execute them to get the correct data.

In [5]:
import xnat
import os

# Locate your home folder
home = os.path.expanduser('~')

## Download the images
# We will only download patients with these labels for the moment
subject_labels = ['interobs' + num for num in ['05', '06', '08', '09', '10', '11', '12', '13', '15']]

# Connect to XNAT and retrieve project
session = xnat.connect('https://xnat.bmia.nl/')
project = session.projects['stwstrategymmd']

# Create the data folder if it does not exist yet
datafolder = os.path.join(home, 'Documents', 'Data', 'STWStrategyMMD')
if not os.path.exists(datafolder):
    os.makedirs(datafolder)

# Download the data
for s in subject_labels:
    subject = project.subjects[s]
    for e in subject.experiments:
        experiment = subject.experiments[e]
        # NOTE: We only download the CT sessions, no PET scans
        if '_CT' in experiment.session_type:
            # NOTE: We only download the images, not the RTStruct file, which is a scan consisting of a single file
            for scan_id in experiment.scans:
                scan = experiment.scans[scan_id]
                if len(scan.files) > 1:
                    print(('Downloading patient {}, experiment {}, scan {}.').format(subject.label, experiment.label, scan.id))
                    scan.resources['DICOM'].download_dir(datafolder)
                    
# Disconnect the session
session.disconnect()

print('Done downloading!')

Downloading patient interobs05, experiment interobs05_20190220_CT, scan 1_3_6_1_4_1_9590_100_1_2_170217758912108379426621313680109428629.


 64.0 KiB |                    #            |  31.3 KiB/s Elapsed Time: 0:00:02


KeyboardInterrupt: 

In [None]:
# Some functions for the conversion and mask creation: you do not need to follow this.
import pydicom  # pydicom >= 1.0; older releases were imported as "dicom"
import SimpleITK as sitk
import numpy as np
import random


def mmdpreprocess(image_in, image_out, segmentation_out):
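        '''
        Converts input image DICOM folder to output nifti and segmentation.

        Parameters
        ----------
        image_in : str or list
            Folder containing the DICOM files of a single scan.
        image_out : str or list
            Path of the Nifti file to which the image is written.
        segmentation_out : str or list
            Path of the Nifti file to which the dummy mask is written.

        '''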
        # Convert input arguments from list to arguments
        if type(image_in) is list:
            image_in = ''.join(image_in)

        if type(image_out) is list:
            image_out = ''.join(image_out)

        if type(segmentation_out) is list:
            segmentation_out = ''.join(segmentation_out)

        # Load the DICOMs from the folder
        print('Loading DICOM')
        image, _ = load_image(image_in)

        # We make a dummy segmentation by simply selecting a cube in the image
        # Note that we use a random volume of either 10 or 20 for half width.
        print('Creating mask.')
        width = random.choice([10, 20])
        x, y, z = image.shape
        mask = np.zeros(image.shape)
        # Use floor division: slice indices have to be integers in Python 3
        mask[x // 2 - width:x // 2 + width,
             y // 2 - width:y // 2 + width,
             z // 2 - width:z // 2 + width] = 1

        # Save image and mask
        print('Saving image and segmentation to Nifti.')
        image = sitk.GetImageFromArray(image)
        mask = sitk.GetImageFromArray(mask)
        sitk.WriteImage(image, image_out)
        sitk.WriteImage(mask, segmentation_out)


def load_image(input_dir):
    '''
    Load DICOMs from input_dir to a single 3D image and make sure axial
    direction is on third axis.
    '''
    dicom_reader = sitk.ImageSeriesReader()
    dicom_file_names = dicom_reader.GetGDCMSeriesFileNames(str(input_dir))
    dicom_reader.SetFileNames(dicom_file_names)
    metadata = pydicom.read_file(dicom_file_names[0])
    dicom_image = dicom_reader.Execute()
    dicom_image = sitk.GetArrayFromImage(dicom_image)

    dicom_image = np.transpose(dicom_image, (2,1,0))
    dicom_image = np.fliplr(np.rot90(dicom_image, 3))
    dicom_image = dicom_image[:, :, ::-1]
    return dicom_image, metadata
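
The dummy mask in `mmdpreprocess` is a cube around the image centre. As a stand-alone sketch of that step, the hypothetical helper `make_cube_mask` below builds such a mask with NumPy; the volume size is made up for illustration.

```python
import numpy as np


def make_cube_mask(shape, half_width):
    """Return a binary mask with a cube of ones centred in a 3D volume."""
    mask = np.zeros(shape, dtype=np.uint8)
    # Floor division keeps the indices integers, as slicing requires
    cx, cy, cz = (s // 2 for s in shape)
    mask[cx - half_width:cx + half_width,
         cy - half_width:cy + half_width,
         cz - half_width:cz + half_width] = 1
    return mask


demo_mask = make_cube_mask((64, 64, 48), half_width=10)
print(demo_mask.shape, int(demo_mask.sum()))
```

Note that the cube has to fit inside the volume: if an axis is shorter than twice the half width, the slice is clipped and the mask becomes smaller than intended.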


In [None]:
## Convert images to Nifti
import shutil
import glob
pfolders = glob.glob(datafolder + '/*')
for pfolder in pfolders:
    
    # The DICOMS are often in a subfolder
    subfolders = glob.glob(pfolder + '/*')
    while len(subfolders) == 1:
        imfolder = subfolders[0]
        subfolders = glob.glob(imfolder + '/*')
    
    print(('Processing patient {}.').format(os.path.basename(pfolder)))
    image_out = os.path.join(pfolder, 'image.nii.gz')
    segmentation_out = os.path.join(pfolder, 'mask.nii.gz')
    mmdpreprocess(imfolder, image_out, segmentation_out)
    
    # Remove the folder with the DICOMS, but save one for later use
    dicoms = glob.glob(imfolder + '/*.dcm')
    os.rename(dicoms[0], os.path.join(pfolder, 'metadata.dcm'))
    shutil.rmtree(os.path.join(pfolder, 'scans'))
    
print('Done preprocessing patients!')

### Start using WORC

Now it's finally time to create your first WORC network.

In [9]:
import WORC
import fastr

# Create a network with the name "Tutorial", which will be used upon execution
network = WORC.WORC('Tutorial')

WORC has several sources and components you can set and use. Let's start with a minimal example only using the following components:
1. Images
2. Segmentations
3. Labels
4. The configuration generator

More components and sources are discussed in the *WORCAdvanced* iPython notebook.

In general, all sources in WORC are Python lists. You can add multiple types of the same source by simply appending to these lists. For example, when using multiple images per patient, e.g. a T2 MR and a CT scan, you can simply add them by executing network.images_train.append(MR) and network.images_train.append(CT).

Additionally, there are often two types of each source, one for training and one for testing, e.g. images_train and images_test. These correspond to two workflows:
* When only supplying training sources, cross validation will be used to train estimators and estimate performance.
* When supplying both training and testing sources, training and testing will be done on these separate sets without using a cross validation.

We will use the first option in this example.

WORC will automatically adjust the pipeline for the types and number of sources you supply!

### Images and fastr IOPlugins
Images are used to extract features from. Images can be in any image type that the Insight Segmentation and Registration Toolkit (ITK) supports, e.g. Nifti, Nifti Compressed, TIFF, MHD, RAW. Dicom folders are not supported but we do supply a tool for easy conversion, which is found in the *Advanced* notebook. Internally, WORC will convert everything to compressed Nifti, since this takes a lot less memory to process in the other parts of the pipeline.

I have already downloaded the CT images of ten of the patients from the Multidelination dataset, which can be found in the *Data* folder of the VM, i.e. */home/worc/Documents/WORCTutorial/Data/STWStrategyMMD*, or can be fetched from WIP.

#### IOPlugins
Fastr provides several [IOPlugins](https://fastr.readthedocs.io/en/stable/fastr.reference.html#ioplugin-reference) to import and export the data to fastr. When providing a source or sink, you must mention which plugin has to be used. The most straightforward way to point to this file would be to use FileSystem (called *file*), which corresponds to simply referring to your local file, e.g. for the image of patient *interobs05*:

In [None]:
source_image_patient05 = os.path.join('file://home',
                                      'Documents',
                                      'Data',
                                      'STWStrategyMMD',
                                      'interobs05_20170910_CT',
                                      'image.nii.gz')

However, when transporting your WORC pipeline to another system, you would have to redefine all your sources. Hence, I advise always using the VirtualFileSystem (called *vfs*). This uses the mounts described in the installation section of this notebook to refer to specific paths. The */home/worc* folder is already defined as the *home* mount as we saw there, hence we can now use the VFS to refer to our source:

In [None]:
source_image_patient05 = os.path.join('vfs://home',
                                      'Documents',
                                      'Data',
                                      'STWStrategyMMD',
                                      'interobs05_20170910_CT',
                                      'image.nii.gz')

As you can see, the first element after *vfs://* defines the mount that is used. Let us use the VFS IOPlugin for fastr to expand this input string or URL:

In [None]:
fastr.ioplugins # Load the plugins
VFSPlugin = fastr.ioplugins['vfs']
expanded_url = VFSPlugin.url_to_path(source_image_patient05)
print('The URL {} is converted by the VFS IOPlugin to {}!'.format(source_image_patient05, expanded_url))

# Note: don't mind the warnings that some plugins cannot be found, as we will not use those anyways.

There are several IOPlugins that you can use besides the VFS, see the fastr documentation for more detail. The only other plugin we will discuss is the XNAT plugin, which can be used to directly read data from XNAT, in the Advanced notebook.

#### Image sources

We will now use the Python *glob* package to locate all image files and turn them into VFS sources:

In [6]:
import glob
# Locate image sources and convert to VFS sources
image_sources = glob.glob(os.path.join(datafolder, '*', 'image.nii.gz'))
image_sources = [i.replace(home, 'vfs://home') for i in image_sources]
print(image_sources)

[]
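
Note that replacing the home folder in a path string keeps the path separators of the operating system, so on Windows the resulting URL contains backslashes. A minimal sketch of a conversion that always yields forward slashes is given below. The helper `to_vfs_url` is hypothetical and not part of fastr or WORC, and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_vfs_url(path, mount_root, mount_name):
    """Convert a local path below mount_root to a vfs:// URL.

    Hypothetical helper; `path` and `mount_root` are pathlib pure paths.
    """
    relative = path.relative_to(mount_root)
    # as_posix() always uses forward slashes, also for Windows paths
    return 'vfs://{}/{}'.format(mount_name, relative.as_posix())


linux_url = to_vfs_url(
    PurePosixPath('/home/worc/Documents/Data/STWStrategyMMD/interobs05/image.nii.gz'),
    PurePosixPath('/home/worc'), 'home')
windows_url = to_vfs_url(
    PureWindowsPath(r'C:\Users\Marty\Documents\Data\STWStrategyMMD\interobs05\image.nii.gz'),
    PureWindowsPath(r'C:\Users\Marty'), 'home')

print(linux_url)
print(windows_url)
```

Pure paths are used here so that the example behaves the same on every operating system; in a notebook you would typically pass `pathlib.Path` objects instead.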


Sources can be supplied to WORC/fastr by either using a list or a dictionary. When providing a list, WORC/fastr will simply name the samples ``sample_0, sample_1, ...``, which is not very informative. Moreover, later on in WORC, the sample IDs will be used to match labels and images of the same patient to each other. Hence, it is **very important** to provide all sources from patients as dictionaries in which the keys include the name in the label file, see later on in this tutorial.

Let's therefore convert the list into a dictionary with the correct labels:

In [7]:
import os
image_sources = {os.path.basename(os.path.dirname(i)): i for i in image_sources}
print(image_sources)

{}


Now we add them to the WORC object as follows:

In [10]:
network.images_train = [image_sources]

### Segmentations
When using images, segmentations of the ROI from which to extract the features are currently also required. Another option would be to include a tool that automatically performs the segmentation in the WORCflow: see the *Advanced* notebook on how to add nodes.

The segmentations can be supplied in the same manner as the images:

In [11]:
segmentation_sources = glob.glob(os.path.join(datafolder, '*', 'mask.nii.gz'))
segmentation_sources = [i.replace(home, 'vfs://home') for i in segmentation_sources]
segmentation_sources = {os.path.basename(os.path.dirname(i)): i for i in segmentation_sources}
print(segmentation_sources)
network.segmentations_train = [segmentation_sources]

{}


**Note:** It is very important that you use the same cardinality or keys in the segmentations and images object. WORC/fastr will match each sample of the segmentation source with an image source based on the sample index. Hence, when supplying different keys and therefore a different ordering or lists with different orderings as sources, mismatches will be created. 

**Note:** Additionally, the number of image sources in the images_train and images_test lists has to match the number of segmentation sources. Thus, if you supply for example an MR and a CT image per patient through network.images_train.append(MR) and network.images_train.append(CT), you will also have to supply WORC with a segmentation source for MR and one for CT, in that order!
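
Before handing the sources to WORC, you can check yourself that both dictionaries use the same keys. The helper `compare_sample_ids` below is a hypothetical, stand-alone sketch and not part of WORC; the sample IDs and URLs in the example are made up.

```python
def compare_sample_ids(images, segmentations):
    """Return the sample IDs that occur in only one of the two source dicts."""
    image_ids = set(images)
    segmentation_ids = set(segmentations)
    return {
        'missing_segmentation': sorted(image_ids - segmentation_ids),
        'missing_image': sorted(segmentation_ids - image_ids),
    }


# Made-up dictionaries in the same form as the image and segmentation sources
demo_images = {
    'patient_A': 'vfs://home/Data/patient_A/image.nii.gz',
    'patient_B': 'vfs://home/Data/patient_B/image.nii.gz',
    'patient_C': 'vfs://home/Data/patient_C/image.nii.gz',
}
demo_segmentations = {
    'patient_A': 'vfs://home/Data/patient_A/mask.nii.gz',
    'patient_C': 'vfs://home/Data/patient_C/mask.nii.gz',
    'patient_D': 'vfs://home/Data/patient_D/mask.nii.gz',
}

mismatches = compare_sample_ids(demo_images, demo_segmentations)
print(mismatches)
```

Two empty lists mean that every image has a segmentation with the same key and vice versa.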

### Labels
The labels are what your estimator will be trained on and thus what will be predicted for each patient. The labels have to be provided in a single text file containing a table. The first column should be headed *Patient* and should contain the patient IDs. During estimator training, these labels are matched to the sample IDs of the images and segmentations. The sample IDs do not have to match exactly, but they do need to include the names from the label file. The other columns contain the possible labels.
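
The matching rule, a sample ID has to include the name from the label file, can be sketched as a simple substring test. This is an illustration of the rule as described here, not the actual matching code of WORC or PREDICT; the function name `match_label_names` is hypothetical, and the second and third sample IDs are made up.

```python
def match_label_names(sample_ids, label_names):
    """Map each sample ID to the label-file name it contains, or None."""
    matches = {}
    for sample_id in sample_ids:
        hits = [name for name in label_names if name in sample_id]
        # Exactly one hit is unambiguous; anything else is reported as None
        matches[sample_id] = hits[0] if len(hits) == 1 else None
    return matches


demo_matches = match_label_names(
    ['interobs05_20170910_CT', 'interobs06_20170910_CT', 'unknown_scan'],
    ['interobs05', 'interobs06'])
print(demo_matches)
```

Samples that map to `None` would have no label, so it is worth checking for them before starting a long run.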

The label file for this dataset (which contains imaginary labels) can be found at *WORCTutorial/Data/StrategyMMD/pinfo.txt*. We will now load the file, see what it looks like and append it to the WORC labels:

In [12]:
# NOTE: Make sure you either put the mentioned pinfo.txt
# file in the datafolder, or change the fields below to point to the correct path


import numpy as np
pinfo_file = os.path.join(datafolder, 'pinfo.txt')
pinfo = np.loadtxt(pinfo_file, dtype=str)
print(pinfo)
network.labels_train.append(pinfo_file.replace(home, 'vfs://home'))

[['Patient' 'imaginary_label_1' 'imaginary_label_2']
 ['interobs05' '1' '0']
 ['interobs06' '0' '1']
 ['interobs09' '1' '1']
 ['interobs10' '0' '0']
 ['interobs11' '1' '0']
 ['interobs12' '0' '1']
 ['interobs13' '1' '1']
 ['interobs15' '0' '0']]


As you can see, this file contains the labels *imaginary_label_1* and *imaginary_label_2*. In the next section, these names are used in the configuration to tell WORC which label we want to predict.

### The configuration generator
Lastly, a configuration file for WORC is mandatory. The config file is a *.ini* file containing specific fields: see the WORC Github Wiki for an explanation of the various fields.

A default configuration can be created through the *defaultconfig* function of a WORC network. The resulting config is a ConfigParser object. You can interact with it as a dictionary to set or retrieve fields. Each section in the ConfigParser is itself dictionary-like. The config can be saved to a *.ini* file through the write function.

Note however that you can also simply supply the ConfigParser itself to WORC, which will automatically turn it into a *.ini* file upon execution.

In [15]:
config = network.defaultconfig()
for k1 in config.keys():
    print(k1)
    for k2 in config[k1].keys():
        print('-- ', k2, config[k1][k2])
    print("\n")

DEFAULT


General
--  cross_validation True
--  Segmentix False
--  PCE False
--  FeatureCalculator predict/CalcFeatures:1.0
--  Preprocessing worc/PreProcess:1.0
--  RegistrationNode 'elastix4.8/Elastix:4.8'
--  TransformationNode 'elastix4.8/Transformix:4.8'
--  Joblib_ncores 4
--  Joblib_backend multiprocessing
--  tempsave False


Segmentix
--  mask subtract
--  segtype None
--  segradius 5
--  N_blobs 1
--  fillholes False


Normalize
--  ROI Full
--  Method z_score


ImageFeatures
--  orientation True
--  texture all
--  coliage False
--  vessel False
--  log False
--  phase False
--  image_type CT
--  gabor_frequencies 0.05, 0.2, 0.5
--  gabor_angles 0, 45, 90, 135
--  GLCM_angles 0, 0.79, 1.57, 2.36
--  GLCM_levels 16
--  GLCM_distances 1, 3
--  LBP_radius 3, 8, 15
--  LBP_npoints 12, 24, 36
--  phase_minwavelength 3
--  phase_nscale 5
--  log_sigma 1, 5, 10
--  vessel_scale_range 1, 10
--  vessel_scale_step 2
--  vessel_radius 5


Featsel
--  Variance True, False
--  Groupwise

Instead of making lots of different configuration files for all tools in WORC, I decided to create a single config file which contains all values and which will be passed to all nodes. Hence, the WORC configuration object contains fields specific to WORC, but also to, for example, the PREDICT toolbox, which is mainly used for feature extraction and machine learning.

**Note:** The number of configurations you supply to WORC also has to match the number of image source lists/dicts you supply. Thus, if you supply for example an MR and a CT scan per patient through network.images_train.append(MR) and network.images_train.append(CT), you will have to supply WORC with two configs. The first configuration will be used for all general settings. For settings specific to an image, e.g. the feature extraction, the first configuration will be used for the MR and the second for the CT in this case.

**Note:** All values have to be provided as strings in a ConfigParser object, so no booleans, integers or other types. The actual tools using these fields are responsible for the correct conversion of these values.
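
The string requirement comes from the ConfigParser class in the Python standard library, which you can try out without WORC. The sketch below uses a small stand-in config with two of the *General* fields listed in the output of the default configuration; it does not use or reproduce the real WORC defaults.

```python
import configparser
import io

demo_config = configparser.ConfigParser()
demo_config.optionxform = str  # keep option names case sensitive
demo_config['General'] = {'cross_validation': 'True', 'Joblib_ncores': '4'}

# Assigning a non-string value is rejected by ConfigParser
try:
    demo_config['General']['Joblib_ncores'] = 4
    rejected = False
except TypeError:
    rejected = True

# Strings are accepted; the consuming tool converts them where needed
demo_config['General']['Joblib_ncores'] = '2'
n_cores = demo_config['General'].getint('Joblib_ncores')
use_cv = demo_config['General'].getboolean('cross_validation')

# Write to an in-memory buffer instead of a .ini file on disk
buffer = io.StringIO()
demo_config.write(buffer)
print(buffer.getvalue())
```

The `getint` and `getboolean` methods show the kind of conversion that the tools reading the config have to perform themselves.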

Let's stick mostly with the defaults for now. However, as we have only a very small dataset for this example and are not interested in lengthy validation, we turn SMOTE oversampling off, lower the number of cross validation iterations to 5, increase the size of the validation set to 30 percent and use *LinearExecution* as the execution plugin. Lastly, we will try to predict *imaginary_label_1*. We will then add the configuration to the WORC network:

In [16]:
config['SampleProcessing']['SMOTE'] = 'False'
config['CrossValidation']['N_iterations'] = '5'
config['Genetics']['label_names'] = 'imaginary_label_1'
config['HyperOptimization']['test_size'] = '0.3'
network.fastr_plugin = 'LinearExecution'
network.configs.append(config)

### Execution
You are now ready to execute your first WORC pipeline! Execution consists of three steps. 

The first is building the network. Based on the sources and configuration you provided, WORC will spawn a pipeline template. If your input is inconsistent, e.g. 16 patients but only 15 segmentations, WORC will notify you at this stage what is wrong. Additionally, after building the network, you can draw it.

In [17]:
print(network.labels_train)

network.build()
network.network.draw_network(network.network.id, draw_dimension=True)

['vfs://home\\Documents\\Data\\STWStrategyMMD\\pinfo.txt']


AttributeError: 'Network' object has no attribute 'configs'

Drawing is done using the default *draw_network* function from fastr, which uses Graphviz. The image of the pipeline is saved as a *.svg* file in the folder from which you executed the network. See the fastr documentation for details on the drawing.

Next, we set our actual sources correctly and create the outputs (called *sinks* in fastr) through the *set* function:

In [None]:
network.set()

Execution of the network is done through the *execute* function. Note that upon execution, the network is automatically drawn with Graphviz. Fastr has different [plugins to execute your network](http://fastr.readthedocs.io/en/stable/fastr.reference.html#executionplugin-reference). For example, jobs can be submitted in parallel or to separate nodes if you are using a cluster with e.g. SGE or SLURM. We have previously set the execution plugin both in *config.py* and in WORC to *LinearExecution*, which submits the jobs linearly. Although this is rather slow, it will not require the full computing power of your PC, which is fine for this example. In practice, you might want to consider alternative plugins.


Note that after running the command below, fastr will keep track of the process in the console, which is also logged. The message can grow quite extensively for large pipelines. Execution will take approximately 15 to 60 minutes, of course depending on your hardware.

In [None]:
network.execute()

#### Execution process

Fastr will create a job for each sample in each node in your network. See the Graphviz image for all nodes in your pipeline and their names. Fastr will first create jobs for the first independent nodes in your network: in this case, the source nodes for the images and the segmentations. These will be queued in the fastr job manager. Due to the *LinearExecution* execution plugin that we are using, jobs will be executed one after another. Options for cluster submission and parallel execution are also available, see the fastr documentation.

These jobs can have four states:
1. Queued
2. Failed
3. Finished
4. Cancelled

A job will only be cancelled if this is done manually by the user or if a previous job on which the current job depends has failed.
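
The way a failure propagates to dependent jobs can be illustrated with a small toy model. This is not fastr's scheduler: the function `run_linearly`, the job names and the dependency structure are all made up for illustration. Only the four state names come from the list in this section.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = 'queued'
    FAILED = 'failed'
    FINISHED = 'finished'
    CANCELLED = 'cancelled'


def run_linearly(jobs, dependencies, failing):
    """Toy model of linear execution with cancellation of dependent jobs.

    jobs: job names in execution order (dependencies listed first).
    dependencies: dict mapping a job name to the names it depends on.
    failing: set of job names that fail when executed.
    """
    states = {job: JobState.QUEUED for job in jobs}
    for job in jobs:
        upstream = dependencies.get(job, [])
        if any(states[dep] is not JobState.FINISHED for dep in upstream):
            # A dependency failed or was cancelled, so this job cannot run
            states[job] = JobState.CANCELLED
        elif job in failing:
            states[job] = JobState.FAILED
        else:
            states[job] = JobState.FINISHED
    return states


demo_states = run_linearly(
    jobs=['image', 'segmentation', 'features', 'classify'],
    dependencies={'features': ['image', 'segmentation'],
                  'classify': ['features']},
    failing={'segmentation'})

for name, state in demo_states.items():
    print(name, state.value)
```

In this model a single failed job cancels everything downstream of it, while independent jobs such as `image` still finish.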

Additionally, fastr will create a temporary directory to write all job information to. This is created in the directory defined in your network's *fastr_tmpdir* field, which by default is the *tmp* mount. The *tmp* mount is by default */tmp* on Linux machines. Fastr will create a directory with the same name that you gave your network, in this case *Tutorial*. Note that fastr also tells you the tmpdir used upon the start of the execution.

Let's see which folders and files are actually present in that folder after the execution:

In [None]:
# First, get the temporary directory name to which fastr writes the output
tempdir = os.path.join(fastr.config.mounts['tmp'], 'Tutorial')

for i in sorted(glob.glob(os.path.join(tempdir, '*'))):
    print(os.path.basename(i))

Some files are specific for fastr, i.e. the pickles, mostly for job tracking and provenance. See the fastr documentation for more details.

We can see that the configuration we made is actually saved here as a *.ini* file. Additionally, there are indeed folders for each node in our network. Remember that we specified the job_cleanup_level parameter in the *config.py* file earlier? This determines whether and how folders are cleaned up after execution. The default is non_failed, which means that the results and information on jobs that have either not run at all or have finished will be deleted. For illustration purposes, we have set it to no_cleanup to show you all temporary outputs. Please change it back on final execution.

If we look in the *images_train_CT_0* folder, we see the actual samples that were processed:

In [None]:
for i in sorted(glob.glob(os.path.join(tempdir, 'images_train_CT_0', '*'))):
    print(os.path.basename(i))

Inside the folder of each sample and thus each job of this node, there are several files present:

In [None]:
for i in sorted(glob.glob(os.path.join(tempdir, 'images_train_CT_0', 'interobs05_20170910_CT', '*'))):
    print(os.path.basename(i))

print("\n")

for i in sorted(glob.glob(os.path.join(tempdir, 'images_train_CT_0', 'interobs05_20170910_CT', 'result', '*'))):
    print(os.path.basename(i))

I will address three of the default files created by fastr here:

1. The ``__fastr_prov__.json`` file:
Contains information on the provenance of the result created. This includes the tools used and their versions, but also the sources. If the result of the tool is exported as a sink, this file will also be exported automatically.

2. The ``__fastr_stderr__.txt`` file:
Contains the standard error information of the job. When the job was shut down for reasons such as taking too much memory, this will be listed in this file.

3. The ``__fastr_stdout__.txt`` file:
Contains information on the execution process. This is often the most informative file. It also contains the exact command for the job which was run.

Fastr provides several [command line tools](http://fastr.readthedocs.io/en/stable/fastr.commandline.html), which are especially useful for debugging. Personally however, I simply look in the ``__fastr_stdout__.txt`` file for the exact command that was run (identified as the *Calling command*) and rerun that on the command line to see what the actual error was from Python, Matlab or whatever program was used for executing the task.
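
When a pipeline has many jobs, a quick way to decide where to start looking is to list the job folders whose standard error file is not empty. The helper `find_nonempty_stderr` below is a hypothetical sketch that only relies on the file name mentioned in this section; the directory tree it runs on is a made-up stand-in. Keep in mind that a non-empty file can also contain harmless warnings.

```python
import os
import tempfile


def find_nonempty_stderr(tempdir):
    """Return paths of all non-empty __fastr_stderr__.txt files below tempdir."""
    hits = []
    for root, _dirs, files in os.walk(tempdir):
        if '__fastr_stderr__.txt' in files:
            path = os.path.join(root, '__fastr_stderr__.txt')
            if os.path.getsize(path) > 0:
                hits.append(path)
    return sorted(hits)


# Build a small stand-in directory tree to demonstrate the helper
demo_tmp = tempfile.mkdtemp()
for sample, message in [('sample_ok', ''), ('sample_bad', 'MemoryError\n')]:
    job_dir = os.path.join(demo_tmp, 'some_node', sample)
    os.makedirs(job_dir)
    with open(os.path.join(job_dir, '__fastr_stderr__.txt'), 'w') as fh:
        fh.write(message)

demo_hits = find_nonempty_stderr(demo_tmp)
for path in demo_hits:
    print(path)
```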

Only when a task has finished will the result be written to the result folder. When another task depends on the output of a specific task, this is the file that will be transmitted.

**Note:** When you rerun a fastr network in the same temporary directory (which in WORC is the case when you reuse the network's name/id), fastr will smartly look at your results. Tasks that have already finished will not be rerun unless any of the tasks on which they depend has changed. Hence, if you change a source such as an image, all jobs dependent on that image will rerun. **This does not hold for any tools/code you updated.** If you have updated a script or a Python package, this is not automatically detected by fastr. Thus, you would have to manually delete the temporary results for the job to rerun.
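
Deleting temporary results by hand can be scripted. The sketch below assumes that removing the folder of a node (or of a single sample within that node) from the temporary directory is enough to make fastr rerun those jobs; this is an assumption based on the folder layout described in this tutorial, so please verify it on your own setup. The helper `remove_job_results` is hypothetical, and the demo runs on a made-up directory tree.

```python
import os
import shutil
import tempfile


def remove_job_results(tempdir, node_name, sample_id=None):
    """Delete the temporary folder of a node, or of one sample within it.

    Returns True if a folder was removed and False if it did not exist.
    """
    target = os.path.join(tempdir, node_name)
    if sample_id is not None:
        target = os.path.join(target, sample_id)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False


# Made-up stand-in for a fastr temporary directory
demo_tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_tmp, 'some_node', 'sample_1', 'result'))
os.makedirs(os.path.join(demo_tmp, 'some_node', 'sample_2', 'result'))

removed_sample = remove_job_results(demo_tmp, 'some_node', 'sample_1')
removed_missing = remove_job_results(demo_tmp, 'other_node')
remaining = sorted(os.listdir(os.path.join(demo_tmp, 'some_node')))
print(removed_sample, removed_missing, remaining)
```

Removing a single sample folder keeps the results of all other samples, so only the affected jobs need to be recomputed.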

#### Results
Output will be written by default to the fastr *output* mount, which is set by WORC as default to *$HOME/WORC/output*. Again, a folder is made for the network ID that you used, *Tutorial* in this case:

In [None]:
outputfolder = os.path.join(home, 'WORC', 'output', 'Tutorial')

for i in sorted(glob.glob(os.path.join(outputfolder, '*'))):
    print(os.path.basename(i))

You can see that by default, the features, the trained classifier (classification_0.hdf5) and the performance of the classifier are written as output. Each file has an associated *.prov.json* file which states the provenance. The performance is simply a json you can open with a text editor. 

The features and classifier are stored as *.hdf5* files, which can be loaded with the Python pandas package:

In [None]:
import pandas as pd
features = pd.read_hdf(os.path.join(outputfolder, 'features', 'features_CT_0_interobs05_20170910_CT_0.hdf5'))

# Print the contents of the pandas Dataframe
print(features)

# Print the feature labels and corresponding values
for k, v in zip(features.feature_labels, features.feature_values):
    print(k, v)
    
print(('Total number of features: {}.').format(str(len(features.feature_values))))

In [None]:
classifiers = pd.read_hdf(os.path.join(outputfolder, 'svm_all_0.hdf5'))

# Print the contents of the pandas Dataframe
print(classifiers)

I assume the contents of the feature file are straightforward. The classifiers file contains objects for each label you tried to predict: in this case, only *imaginary_label_1*. There are several fields, which are formatted as lists with an item for each iteration in the cross validation: hence in this case five. The only exceptions are the *config* and *feature_labels* fields.


The following fields are present:
* Classifiers: a [Randomized Search CV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)-like object from sklearn. It actually is a SearchCV object from PREDICT, which is an extension of the sklearn object.
* X_train: the features from the patients used for training.
* X_test: the features from the patients used for testing.
* Y_train: the labels from the patients used for training.
* Y_test: the labels from the patients used for testing.
* config: the configuration used in all cross validation iterations.
* patient_ID_train: the IDs of the patients used for training.
* patient_ID_test: the IDs of the patients used for testing.
* random_seed: the random seed used for creating the training/validation set split using the sklearn [train_test_split function](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* feature_labels: the labels of the features. This list should have the same length as the list of feature values of a single patient in the X_train and X_test fields.



The performance of the classifiers in the classifiers file is stored in the performance JSON file, which you can load in Python with the json package:

In [None]:
import json

with open(os.path.join(outputfolder, 'performance_all_0.json'), 'r') as fp:
    p = json.load(fp)
    
print(json.dumps(p, indent=4))

Performance is by default computed for the single best classifier and ensembles of the best 10 and 50 classifiers. Besides the 95% confidence intervals over all cross validations for several metrics, the patients that were always classified correctly or incorrectly in all iterations in which they were part of the test set are identified.

This experiment uses only ten patients, dummy masks that are just cubes in the center of the image and therefore don't make sense, and imaginary labels. Hence, the performance is naturally very bad.

The performance is generated from the classifiers file by the plot_SVM function from PREDICT (we only use a single classifier in the code below):

In [None]:
import PREDICT as pr

p2 = pr.plotting.plot_SVM.plot_SVM(classifiers, pinfo_file, 'imaginary_label_1')

Please note that there is some randomness in refitting the classifier. As long as your model is not that dependent on randomness and you use enough cross validations and larger ensembles, this effect should not be too large. In this case, these conditions are not met, hence the result will differ quite a lot every time you run it. We are working on making this deterministic, but as your model should satisfy the mentioned constraints, it shouldn't matter.

## End
Congratulations: you've successfully executed your first WORC pipeline! Before going to the advanced topics, we suggest you recreate the above example with your own data, as the results will then hopefully make more sense. In the WORC_example.py script, which is also provided in the WORCTutorial Github repository, we have condensed all of the above statements to the lines you actually need to run a WORC pipeline. Additionally, first have a look at the [WORC Wiki](https://github.com/MSTarmans91/WORC/wiki) to get a better understanding of the parameters, so you can tune them to your application.