## Computing mapper graphs over a grid of parameter values

In this notebook, we apply the mapper algorithm to the gene expression data over a grid of cover parameters, to find the range over which the structure of the mapper graph is relatively stable. The mapper graphs generated here are used to create the stamped view illustrated in Figures 2E and 2F of the manuscript. We assume that the directory structure matches that of the repository. If it does not, check the paths and filenames before running the code.

### Import packages
First, import the necessary packages and methods. Most of these are commonly used Python packages that are easy to install. The only uncommon one is `kmapper` (short for *Kepler Mapper*), which can be installed with pip. For more information, see the [Kepler Mapper documentation](https://kepler-mapper.scikit-tda.org/en/latest/). We also import custom functions defined in two files:
- `helper_functions.py`
- `lenses.py`

As the names suggest, `helper_functions.py` contains utility functions for tasks such as loading and scaling the input data, while `lenses.py` contains the function that computes the *lens* (more on that further down in the notebook).

In [None]:
import os
import matplotlib.pyplot as plt

# Helper functions
from helper_functions import loaddata, colorscale_from_matplotlib_cmap
from lenses import fsga_transform
from itertools import product

# Kepler Mapper
import kmapper as km

# Clusterer and Scaler for mapper
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

### Set paths, specify file names
Next, we set the paths to the directories containing `code`, `data`, and `results`, and specify the input files. The data to be analyzed is stored in two csv files: `clean_metadata.csv` contains the metadata, and `clean_RNAseq_OutlierRemoved.csv` contains the gene expression data.

The metadata file contains one sample per row, identified by its *SRA*. There are 3172 samples, and for each sample we have 7 descriptive attributes. We are particularly interested in three of them: the plant *family*, the *tissue* type, and the *stress* type. The data covers 16 plant families, 8 tissue types, and 10 stress types (including *healthy*). The RNAseq file contains the gene expression of each sample across 6335 orthogroups.

In [None]:
projdir = "../.."
datadir = projdir + "/data"
resdir = projdir + "/results"

factorfile = datadir + "/clean_metadata.csv"
rnafile = datadir + "/raw_RNAseq_OutlierRemoved.csv"

### Loading data

The input data (metadata and gene expression) is stored in two separate files, but samples can be matched using the *SRA*. The function `loaddata` does exactly that. It loads both files into pandas dataframes and merges them using the *SRA* as the key. We perform an inner join, to keep only those *SRAs* for which we have metadata as well as gene expression.
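
As a rough illustration of the kind of merge described above, here is a minimal sketch using toy data. It is not the actual `loaddata` implementation; the toy frames and the orthogroup column names (`OG1`, `OG2`) are hypothetical and exist only for this example.

```python
import pandas as pd

# Toy metadata: one row per sample, keyed by SRA (values are made up).
metadata = pd.DataFrame({
    "SRA": ["S1", "S2", "S3"],
    "stress": ["healthy", "drought", "healthy"],
    "tissue": ["leaf", "root", "leaf"],
    "family": ["Poaceae", "Poaceae", "Fabaceae"],
})

# Toy expression table: one row per sample, one column per orthogroup.
expression = pd.DataFrame({
    "SRA": ["S2", "S3", "S4"],
    "OG1": [0.5, 1.2, 3.3],
    "OG2": [2.0, 0.1, 0.7],
})

# Inner join: keep only samples present in BOTH tables.
merged = metadata.merge(expression, on="SRA", how="inner")

# Column names of the orthogroups, to select the expression matrix later.
ortho_columns = [c for c in expression.columns if c != "SRA"]
```

In this toy example, `S1` has no expression data and `S4` has no metadata, so both are dropped and only `S2` and `S3` remain in `merged`.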

In [None]:
factors = ["stress", "tissue", "family"]
df, orthos = loaddata(factorfile, rnafile, factors)

### Set factor and level for analysis

Here, we set the factor, and the corresponding factor level that will be used to construct the lens function. We experimented with the following three combinations:
- filterfactor: *stress*, filterlevel: *healthy*
- filterfactor: *tissue*, filterlevel: *root*
- filterfactor: *tissue*, filterlevel: *leaf*

These variables are also used to construct the names of files and directories in which the results will be saved.

In [None]:
filterfactor, filterlevel = ("tissue", "leaf")

figdirname = f"/{filterfactor}_{filterlevel}_lens_figures/mapper_over_grid"
figdir = resdir + figdirname
os.makedirs(figdir, exist_ok=True)

### Applying mapper

The first step in applying mapper is to initialize the mapper object and define the components it will use, such as the clustering algorithm and the metric that the clustering algorithm relies on. We use *DBSCAN*, a clustering algorithm commonly used with mapper. The metric is the correlation distance between the gene expression profiles of samples.
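
To make the choice of metric concrete, here is a small pure-Python sketch of the correlation distance under its usual definition, one minus the Pearson correlation of the two profiles. The function name `correlation_distance` is ours, for illustration only; in the notebook the distance is computed internally by the clustering library.

```python
import math


def correlation_distance(u, v):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    # Center each profile on its own mean.
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = (math.sqrt(sum(a * a for a in du))
                   * math.sqrt(sum(b * b for b in dv)))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - numerator / denominator


# Same shape, different magnitude: distance is (numerically) zero.
same_shape = correlation_distance([1, 2, 3], [10, 20, 30])
# Opposite trends: distance is (numerically) two, the maximum.
opposite = correlation_distance([1, 2, 3], [3, 2, 1])
```

Because each profile is centered and normalized, this distance compares the shape of two expression profiles rather than their overall magnitude.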

In [None]:
# Initialize mapper object
kmap = km.KeplerMapper(verbose=1)

# Define Nerve
nerve = km.GraphNerve(min_intersection=1)

# Define clustering algorithm
clust_metric = "correlation"
clusterer = DBSCAN(metric=clust_metric)

# Define colorscale to be used for node coloring
cscale = colorscale_from_matplotlib_cmap(plt.get_cmap("viridis_r"))

### Define lens / filter function

Next, we need to define the *lens*. In the Python file `lenses.py`, we have defined a function called *fsga_transform*. Given the filterfactor and the filterlevel of that factor (specified in the cell above), we construct a lens following the method described in [Nicolau et al. 2011](https://www.pnas.org/content/108/17/7265).

For example, for filterfactor: *stress*, and filterlevel: *healthy*, we take the gene expression profiles of all the *healthy* samples from the data and fit a linear model to obtain an idealized *healthy* gene expression profile. Then we project all the samples onto this linear model and compute the residuals. The *lens* is the norm of the residual vector, which measures how much a sample's gene expression deviates from the *healthy* one.
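
The idea can be sketched on toy data as follows. This is a simplified illustration, not the `fsga_transform` implementation: it assumes the linear model is the span of the reference (e.g. *healthy*) profiles and uses an ordinary least-squares projection. The function name `residual_lens` and the toy matrix are hypothetical.

```python
import numpy as np


def residual_lens(X, is_reference):
    """Project every sample onto the span of the reference samples.

    Returns the residual vectors and their Euclidean norms.
    """
    X = np.asarray(X, dtype=float)
    reference = X[np.asarray(is_reference, dtype=bool)]  # (n_ref, n_genes)
    # Least-squares coefficients expressing each sample as a linear
    # combination of the reference profiles.
    coef, *_ = np.linalg.lstsq(reference.T, X.T, rcond=None)
    fitted = (reference.T @ coef).T                      # (n_samples, n_genes)
    residuals = X - fitted
    return residuals, np.linalg.norm(residuals, axis=1)


# Toy data: 4 samples x 3 genes; the first two samples are the reference.
X = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 0],
     [0, 0, 2]]
residuals, norms = residual_lens(X, [True, True, False, False])

# Min-max scaling of the lens to [0, 1].
scaled = (norms - norms.min()) / (norms.max() - norms.min())
```

In this toy example, the third sample is a combination of the two reference profiles, so its residual is zero, while the fourth sample lies entirely outside their span and receives the largest lens value.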

Here, we compute the lens and scale it using the standard *MinMaxScaler* from scikit-learn.

In [None]:
# Define lens
scaler = MinMaxScaler()
residuals, _, _ = fsga_transform(df, orthos, filterfactor, filterlevel)
lens = kmap.project(residuals, projection="l2norm", scaler=scaler)

### Define cover parameter ranges for the grid

Next, we specify the ranges of the cover parameters: the number of intervals (*intervals*) and the percentage overlap between consecutive intervals (*overlaps*). We will loop over all (interval, overlap) pairs in the specified ranges and compute the mapper graph for each pair. The ranges set in the cell below are the ones we settled on after some experimentation. Feel free to change them, but keep in mind that the overlap is a percentage and must be between 0 and 100. Also note that increasing the number of intervals makes the algorithm slower, so we recommend not going much beyond 150.
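
To build intuition for how the two parameters interact, here is a minimal pure-Python sketch of a one-dimensional cover. It assumes one common convention: interval centers are evenly spaced, and each interval is widened so that consecutive intervals share the given fraction of their width. The exact construction inside `kmapper` may differ in details, and the function name `cover_intervals` is ours.

```python
def cover_intervals(lo, hi, n_intervals, overlap):
    """Return (start, end) pairs covering [lo, hi].

    `overlap` is a fraction in [0, 1): the share of each interval's
    width that it has in common with the next interval.
    """
    if n_intervals < 1:
        raise ValueError("n_intervals must be at least 1")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must satisfy 0 <= overlap < 1")
    step = (hi - lo) / n_intervals   # spacing between consecutive centers
    width = step / (1 - overlap)     # widened so neighbors overlap
    half = width / 2
    centers = [lo + step * (i + 0.5) for i in range(n_intervals)]
    return [(c - half, c + half) for c in centers]


# 4 intervals with 50% overlap on a lens scaled to [0, 1].
toy_cover = cover_intervals(0.0, 1.0, 4, 0.5)
shared = toy_cover[0][1] - toy_cover[1][0]           # length shared by neighbors
interval_width = toy_cover[0][1] - toy_cover[0][0]   # width of one interval
```

Under this convention, each interval is `1 / (1 - overlap)` times wider than the spacing between centers. With high overlaps such as those used in this notebook, every lens value therefore falls into several intervals, which is one reason the computation gets slower as the parameters grow.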

In [None]:
# Define ranges for cover parameters: number of intervals, percent overlap
if filterfactor == "tissue":
    if filterlevel == "root":
        intervals = [100, 110, 120]
        overlaps = [88, 90, 92]
    else:
        intervals = [130, 140, 150]
        overlaps = [85, 90, 95]
else:
    intervals = [100, 110, 120]
    overlaps = [70, 75, 80]

### Construct the mapper graphs

With all the required components in place, we loop over all possible pairs of (interval, overlap) values from the specified ranges. For each pair, we construct the corresponding mapper graph using the *map* method of the KeplerMapper object. We then create the visualization using the *visualize* method, which writes an `html` file to the specified location.

In [None]:
# Loop over combinations of cover parameters and compute mapper graphs
for (cubes, overlap) in product(intervals, overlaps):
    cover = km.cover.Cover(n_cubes=cubes, perc_overlap=overlap/100.)

    # Create mapper graph with nodes, edges and meta-information.
    graph = kmap.map(lens=lens,
                        X=df[orthos],
                        clusterer=clusterer,
                        cover=cover,
                        nerve=nerve,
                        precomputed=False,
                        remove_duplicate_nodes=True)

    # Create a tooltip column
    factor_zip = zip(df[filterfactor], lens)
    df['tooltips'] = [f"({str(p[0])}, {str(p[1])})" for p in factor_zip]

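    # Specify file to save html output
    # (adjacent f-strings are concatenated, so no stray whitespace
    # from line continuation ends up in the file name or title)
    fn = (f"Filter_{filterfactor}_{filterlevel}"
          f"_Cubes_{cubes}_Overlap_{overlap}.html")
    fpath = figdir + '/' + fn
    figtitle = (f"Filter: {filterfactor}-{filterlevel}, "
                f"#Intervals {cubes}, overlap {overlap/100.0}")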

    # Create visualization and save to specified file
    cfname = f"{filterfactor}_{filterlevel}_lens"
    mapper_data = kmap.visualize(graph,
                                    path_html=fpath,
                                    title=figtitle,
                                    color_function_name=cfname,
                                    color_values=lens,
                                    colorscale=cscale,
                                    custom_tooltips=df["tooltips"])

print("Mapper Over Cover Grid Completed")
