**Why use a workflow?**

Some feature selection methods, such as boruta, do not produce stable output: the results for the same parameters can differ to some degree between runs.
We could fix the `random_state` to force identical results, but what we really want to know is how well a chosen set of parameters performs.
We could also increase the number of iterations that boruta is allowed to run, but this becomes memory intensive.
A simpler solution is to repeat the same parameters several times, each with as many iterations as we can get away with.
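
To make the idea of unstable output concrete, here is a minimal sketch that does not use boruta at all. The `noisy_selection` function is a hypothetical stand-in for one stochastic feature-selection run, and the miss and false-positive rates are made-up values chosen only for illustration. It shows how repeated runs with identical parameters can be compared by their pairwise overlap.

```python
import random
from itertools import combinations


def noisy_selection(features, informative, rng, miss_rate=0.2, false_rate=0.05):
    """Hypothetical stand-in for one stochastic feature-selection run."""
    selected = set()
    for feature in features:
        if feature in informative:
            # An informative feature is occasionally missed.
            if rng.random() > miss_rate:
                selected.add(feature)
        elif rng.random() < false_rate:
            # An uninformative feature is occasionally picked up.
            selected.add(feature)
    return selected


def jaccard(a, b):
    """Overlap between two selections; 1.0 means the sets are identical."""
    combined = a | b
    return len(a & b) / len(combined) if combined else 1.0


features = [f"gene_{i}" for i in range(50)]
informative = set(features[:10])

# Each repeat gets its own seed, mimicking independent runs with the same parameters.
runs = [noisy_selection(features, informative, random.Random(seed)) for seed in range(3)]

for (i, a), (j, b) in combinations(enumerate(runs), 2):
    print(f"run {i} vs run {j}: Jaccard = {jaccard(a, b):.2f}")
```

A Jaccard value well below 1.0 between repeats is the kind of variation that motivates running each parameter set more than once.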

We therefore want to explore parameters, with repeats, and do so in a memory-efficient way.
Enter `nextflow`, a workflow manager that will streamline this process.

### An Example with `boruta_multiclass`

The workflows are named after the structure of the y (target) variable.
Each workflow essentially manages calls to `boruta_prospector`.

The following must be provided to the workflow:
+ A saved `AnnotatedGEM` or otherwise compatible netcdf file.
+ A ranking model.
+ A target variable.
+ Any required arguments to boruta and the ranking model.


Consider this example `nextflow.config` file:
```groovy
// Singular data input and selection.
params.gem_netcdf = "~/GEMprospector_demo_data/osativa.nc"
params.x_label = "counts"
params.y_label = ["Treatment", "Genotype", "Subspecies"]

// Ranking model options.
params.ranking_model = "RandomForestClassifier"
params.ranking_model_opts.max_depth = [3]
params.ranking_model_opts.n_jobs = [-1]

// BorutaPy options.
params.boruta_opts.perc = [100]
params.boruta_opts.max_iter = [1000]

// How often to repeat each set of arguments.
params.repeats = 3

// Output directory.
params.out_dir = "~/GEMprospector_demo_data/lineaments/"
```

This can then be run via:

```bash
nextflow run <path/to/boruta_multiclass/main.nf> -c <path/to/nextflow.config>
```

We recommend running these feature selections using the prepared docker image:

```bash
nextflow run <path/to/boruta_multiclass/main.nf> -c <path/to/nextflow.config> -profile docker
```

The resulting lineament files should then be stored in `~/GEMprospector_demo_data/lineaments/`.
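
Most options in the example `nextflow.config` are given as lists, which suggests that the workflow runs every combination of them, repeated `params.repeats` times. The sketch below illustrates that reading in plain Python. The Cartesian-product expansion is an assumption about how the workflow behaves, not something taken from its source.

```python
from itertools import product

# Values copied from the example config; list-valued options are treated as grid axes.
grid = {
    "y_label": ["Treatment", "Genotype", "Subspecies"],
    "ranking_model_opts.max_depth": [3],
    "ranking_model_opts.n_jobs": [-1],
    "boruta_opts.perc": [100],
    "boruta_opts.max_iter": [1000],
}
repeats = 3

keys = list(grid)
# One entry per (parameter combination, repeat) pair.
runs = [
    dict(zip(keys, values), repeat=repeat)
    for values in product(*grid.values())
    for repeat in range(repeats)
]

print(len(runs))  # 3 target variables x 3 repeats = 9 runs
```

Under this assumption the example produces nine runs, which agrees with the nine `.nc` files listed in the workflow output directory later in this notebook.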

***Notebook setup***

In [None]:
import GEMprospector as gp
from pathlib import Path
import holoviews as hv
hv.extension("bokeh")

***Declare used paths***

In [6]:
ANNOTATED_GEM_PATH = "~/GEMprospector_demo_data/osativa.nc"
LINEAMENT_DIR = "~/GEMprospector_demo_data/workflow_lineaments/"

*Ensure the target paths exist*

In [7]:
assert Path(ANNOTATED_GEM_PATH).expanduser().exists() and Path(LINEAMENT_DIR).expanduser().exists()

***Load the AnnotatedGEM***

In [8]:
agem = gp.AnnotatedGEM(ANNOTATED_GEM_PATH, name="Oryza sativa")
agem

<GEMprospector.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

### Examine the workflow output directory

In [9]:
list(Path(LINEAMENT_DIR).expanduser().resolve().glob("*.nc"))

[PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Subspecies_119681.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Genotype_594c7a.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Subspecies_9f6d9a.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Genotype_f97661.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Subspecies_5cfd43.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Treatment_e1502b.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Treatment_e1f64c.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Treatment_a70735.nc'),
 PosixPath('/home/tyler/GEMprospector_demo_data/workflow_lineaments/counts_v_Genotype_2748d9.nc')]
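
The file names appear to follow an `<x_label>_v_<y_label>_<run id>.nc` pattern. Assuming that layout holds, a short sketch like the following groups the replicates by target variable before loading anything. The file names are copied from the listing above; the parsing rule is an inference from those names, not a documented convention.

```python
from collections import defaultdict
from pathlib import PurePosixPath

filenames = [
    "counts_v_Subspecies_119681.nc",
    "counts_v_Genotype_594c7a.nc",
    "counts_v_Subspecies_9f6d9a.nc",
    "counts_v_Genotype_f97661.nc",
    "counts_v_Subspecies_5cfd43.nc",
    "counts_v_Treatment_e1502b.nc",
    "counts_v_Treatment_e1f64c.nc",
    "counts_v_Treatment_a70735.nc",
    "counts_v_Genotype_2748d9.nc",
]

replicates = defaultdict(list)
for name in filenames:
    stem = PurePosixPath(name).stem
    # Assumed layout: <x_label>_v_<y_label>_<run id>
    x_label, rest = stem.split("_v_", 1)
    y_label, run_id = rest.rsplit("_", 1)
    replicates[y_label].append(run_id)

for y_label, run_ids in sorted(replicates.items()):
    print(y_label, len(run_ids), run_ids)
```

Each target variable has three replicate files, matching `params.repeats = 3` in the example config.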

Let's gather the replicates for each target variable into their own collection.

In [10]:
treatment_lcoll = gp.GeneSetCollection.from_folder(agem, LINEAMENT_DIR, 
                                                     name="Treatment", 
                                                     glob_filter="*Treatment*.nc")
genotype_lcoll = gp.GeneSetCollection.from_folder(agem, LINEAMENT_DIR, 
                                                    name="Genotype", 
                                                    glob_filter="*Genotype*.nc")
subspecies_lcoll = gp.GeneSetCollection.from_folder(agem, LINEAMENT_DIR, 
                                                      name="Subspecies", 
                                                      glob_filter="*Subspecies*.nc")

treatment_lcoll

<GEMprospector.GeneSetCollection>
    <GEMprospector.AnnotatedGEM>
    Name: Oryza sativa
    Selected GEM Variable: 'counts'
        Gene   55986
        Sample 475
GeneSet Keys and # of Selected Genes
    Lineament01322: 1094
    Lineament01323: 1077
    Lineament01321: 1064

***Combine the collections into their own lineaments:***

*First get the attributes from one lineament in each collection. Since they are all replicates, it doesn't matter which one.*

In [11]:
subspecies_attrs = next(iter(subspecies_lcoll.lineaments.values())).data.attrs
treatment_attrs = next(iter(treatment_lcoll.lineaments.values())).data.attrs
genotype_attrs = next(iter(genotype_lcoll.lineaments.values())).data.attrs

In [12]:
lineaments = {
    "Subspecies": gp.GeneSet.from_lineaments(subspecies_lcoll.lineaments.values(),
                                               agem.gene_index,
                                               name="Subspecies", attrs=subspecies_attrs),
    "Treatment": gp.GeneSet.from_lineaments(treatment_lcoll.lineaments.values(), agem.gene_index, 
                                              name="Treatment", attrs=treatment_attrs),
    "Genotype": gp.GeneSet.from_lineaments(genotype_lcoll.lineaments.values(), agem.gene_index,
                                             name="Genotype", attrs=genotype_attrs),
}

lcoll = gp.GeneSetCollection(gem=agem, lineaments=lineaments)
lcoll

<GEMprospector.GeneSetCollection>
    <GEMprospector.AnnotatedGEM>
    Name: Oryza sativa
    Selected GEM Variable: 'counts'
        Gene   55986
        Sample 475
GeneSet Keys and # of Selected Genes
    Treatment: 1171
    Genotype: 754
    Subspecies: 312
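
The combined Treatment set (1171 genes) is larger than any single Treatment replicate (1094, 1077 and 1064 genes), which is consistent with the replicates being merged by union. The exact rule used by `from_lineaments` is not stated here, so the sketch below simply contrasts three common ways of combining replicate selections using plain Python sets. The gene names are made up for illustration.

```python
from collections import Counter

# Toy replicate results; gene names are invented for this example.
replicates = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6"},
]

# Union: keep a gene if any replicate selected it.
union = set().union(*replicates)

# Intersection: keep a gene only if every replicate selected it.
intersection = set.intersection(*replicates)

# Majority vote: keep a gene if more than half of the replicates selected it.
support = Counter(gene for rep in replicates for gene in rep)
majority = {gene for gene, n in support.items() if n > len(replicates) / 2}

print(sorted(union))         # ['g1', 'g2', 'g3', 'g4', 'g5', 'g6']
print(sorted(intersection))  # ['g1', 'g2']
print(sorted(majority))      # ['g1', 'g2', 'g3']
```

A union is the most permissive choice and grows with every added replicate, while the intersection and majority vote trade recall for stability.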

In [13]:
lcoll.save(LINEAMENT_DIR)

~/GEMprospector_demo_data/workflow_lineaments/Subspecies.nc
~/GEMprospector_demo_data/workflow_lineaments/Treatment.nc
~/GEMprospector_demo_data/workflow_lineaments/Genotype.nc


---