# Selecting Genes with Boruta and Random Forests

*This notebook describes how to select genes (features) of importance using Random Forests through the Boruta algorithm.*

**Background information**:
+ Random Forests
    + The [sklearn model used](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
    + [The excellent sklearn user guide](https://scikit-learn.org/stable/user_guide.html)
+ [Python Boruta algorithm](https://github.com/scikit-learn-contrib/boruta_py)

---

***Setting up the notebook***

In [None]:
# OS-independent path management.
from os import environ
from pathlib import Path

import holoviews as hv
hv.extension("bokeh")
%matplotlib inline

import GSForge as gsf

***Declaring used paths***

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/osfstorage/oryza_sativa")).expanduser()
HYDRO_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hydro_raw.nc")
assert HYDRO_GEM_PATH.exists()

***Loading a demo AnnotatedGEM***

In [None]:
agem = gsf.AnnotatedGEM(HYDRO_GEM_PATH)
agem

In [None]:
agem.data

## Finding Genes

Selecting genes of importance is not simple, and any selection requires further experimentation to confirm.
The Boruta feature selection method makes a first pass straightforward; here we use it to construct a `GSForge.GeneSet`.

First we will need a model that, when trained, can rank or infer feature importance:

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
selecting_model = RandomForestClassifier(
    class_weight='balanced',
    max_depth=4, 
    n_estimators=1000, 
    n_jobs=-1)
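
As a minimal sketch of what "rank feature importance" means here: once fitted, scikit-learn's random forest exposes a `feature_importances_` array with one score per feature. The synthetic data from `make_classification`, the name `demo_model`, and the smaller forest are assumptions chosen only to keep the illustration fast; they are not part of the demo dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration): 200 samples and
# 20 features, of which the first 3 columns are informative (shuffle=False).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0)

# A smaller forest than the one above, so the sketch runs quickly.
demo_model = RandomForestClassifier(
    class_weight="balanced", max_depth=4, n_estimators=100, random_state=0)
demo_model.fit(X, y)

# One importance score per feature; the scores are normalized to sum to 1.
importances = demo_model.feature_importances_

# Feature indices ordered from most to least important.
ranked = np.argsort(importances)[::-1]
print(ranked[:5])
```

Boruta relies on this kind of per-feature score, which is why the estimator passed to it must expose feature importances after fitting.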

Now we can run the Boruta feature selection method.
The parameters selected here are set for a speedy run time only.

In [None]:
%%time
boruta_result = gsf.operations.BorutaProspector(
    agem, 
    estimator=selecting_model, 
    annotation_variables="treatment",
    max_iter=5,
    perc=95)

This produces an `xarray.Dataset` object.

In [None]:
boruta_result

You can determine whether the Boruta algorithm needs more iterations by checking `support_weak`, which marks features that are still tentative (neither confirmed nor rejected). If its sum is greater than zero, increase `max_iter` until all features have been resolved.

In [None]:
boruta_result.support_weak.sum().values
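
The check above can be folded into a small retry loop. This is only a sketch: `run_until_resolved`, its `run_boruta` argument, and the iteration schedule are hypothetical names and values introduced here, not part of GSForge.

```python
def run_until_resolved(run_boruta, max_iter_schedule=(25, 50, 100)):
    """Call run_boruta(max_iter) with growing budgets until no feature is tentative.

    `run_boruta` is a hypothetical user-supplied callable. It must return an
    object whose `support_weak` attribute supports `.sum()`.
    """
    for max_iter in max_iter_schedule:
        result = run_boruta(max_iter)
        # Number of features that are still neither confirmed nor rejected.
        n_tentative = int(result.support_weak.sum())
        if n_tentative == 0:
            break
    return result, max_iter, n_tentative


# Intended use in this notebook (commented out so the sketch stands alone):
# result, used_iter, n_tentative = run_until_resolved(
#     lambda n: gsf.operations.BorutaProspector(
#         agem, estimator=selecting_model,
#         annotation_variables="treatment", max_iter=n, perc=95))
```

Each call restarts the selection from scratch, so this trades run time for simplicity. If tentative features remain after the last budget, the returned count tells you how many are unresolved.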

### Creating a GeneSet from the result

Consider any metadata that should be stored, and add it to the `xarray.Dataset` result using `assign_attrs()`:

```python
attrs = {"selection_model": str(selecting_model)}
boruta_result = boruta_result.assign_attrs(attrs)
```

Nested dictionaries must be converted to JSON strings before they are stored as attributes.
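
One way to handle nested metadata is sketched below using the standard-library `json` module. The `metadata` dictionary and its `"parameters"` entry are hypothetical values made up for illustration.

```python
import json

# Hypothetical metadata for illustration; the nested "parameters" entry
# is an assumption, not something produced by GSForge.
metadata = {
    "selection_model": "RandomForestClassifier",
    "parameters": {"max_depth": 4, "n_estimators": 1000},
}

# Serialize nested dictionaries so every attribute value is a flat string.
attrs = {
    key: json.dumps(value) if isinstance(value, dict) else value
    for key, value in metadata.items()
}
print(attrs["parameters"])

# In the notebook: boruta_result = boruta_result.assign_attrs(attrs)
# Recover the dictionary later with json.loads(attrs["parameters"]).
```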

After the object is created you can easily see the number and percent of genes selected:

In [None]:
boruta_treatment_gs = gsf.GeneSet(boruta_result, name="Boruta Treatment")
boruta_treatment_gs

## Using a `GeneSetCollection`

To run more than one Boruta model in a notebook, we recommend creating them in a loop and adding them to a collection, as demonstrated below.

In [None]:
%%time

boruta_gsc = gsf.GeneSetCollection(gem=agem)

for target in ["treatment", "genotype"]:
    # Run Boruta once per annotation variable.
    boruta_ds = gsf.operations.BorutaProspector(
        agem,
        estimator=selecting_model,
        annotation_variables=target,
        perc=100,
        max_iter=25)

    # Store each result in the collection under its target name.
    boruta_gsc[target] = gsf.GeneSet(boruta_ds, name=f"Boruta_{target}")

In [None]:
# Report how many features remain tentative in each gene set.
for key, geneset in boruta_gsc.gene_sets.items():
    print(key, geneset.data.support_weak.sum().values)