## Applying mapper to plant gene expression data

In this notebook, we will apply the Mapper algorithm to the gene expression data collected by last year's class.<br>
If you are running Jupyter notebooks locally, download all the files from the shared Google Drive folder into a single directory on your computer. The data is stored in two CSV files: one contains the gene expression profiles, and the other contains the metadata (sample ID, family, tissue type, stress, etc.).<br> Along with the data files, there are two notebooks (including this one!) and two Python script files that we will need. Make sure all of these files are in the same directory.

### Import useful packages / modules

In [1]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ML tools
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# For output display
from IPython.display import IFrame

# If running locally, set current directory as projdir
projdir = '.'

### Mount Google Drive

Run the next cell __only if you're planning to run this notebook on Google Colab__. If you are running this notebook locally, comment it out.

To run the notebook in Google Colab, it is best if you upload data and script files to a folder in Google Drive and access them from there. We already have a shared Google Drive folder named `PlantsAndPython-2021-10-22` that contains all the required data and script files.
Run the next code cell to mount the drive and make the files accessible. We will define the shared folder as our project directory `projdir`.

If Google Drive is not already mounted, running the cell will produce a link. Click on the link, follow the prompts to log in to your Google account, and copy the text string generated at the end. Paste the text string into the box that appears below the cell and press `Enter`.

In [3]:
# Run this cell only in Google Colab.
# DO NOT run this cell if running locally; simply comment it out.
from google.colab import drive
drive.mount('/content/gdrive')

projdir = '/content/gdrive/MyDrive/PlantsAndPython-2021-10-22'
sys.path.append(projdir)

Mounted at /content/gdrive


In [4]:
# import helper_functions
from helper_functions import loaddata
from helper_functions import colorscale_from_matplotlib_cmap

# import lens function
from lenses import fsga_transform

# KeplerMapper
import kmapper as km

ModuleNotFoundError: ignored

### Data

The data is stored in two CSV files: `clean_metadata.csv` contains the metadata, and `clean_RNAseq_sv_corrected.csv` contains the gene expression data. We are interested in three factors in particular: family, tissue type, and stress type.

In [None]:
factorfile = projdir + '/clean_metadata.csv'
rnafile = projdir + '/clean_RNAseq_sv_corrected.csv'

factors = ['stress', 'tissue', 'family']
levels = ['healthy', 'leaf', 'Poaceae']

filter_by_factor, filter_by_level = ('family', 'Poaceae')
color_by_factor, color_by_level = ('tissue', 'leaf')

We will use the custom `loaddata` function to load the data from the CSV files and merge them on the SRA accessions (the `sra` column). Feel free to take a look at the function, which is defined in `helper_functions.py`. The two files do not contain the same number of SRAs, so we keep only the SRAs for which both the gene expression profile and the factor data are available. This is done using the `merge` method from pandas.
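
As a minimal sketch of the merge step (this is not the actual `loaddata` code, which lives in `helper_functions.py`), the toy example below joins two small tables on an `sra` column with an inner join, so only accessions present in both tables survive. The accession values and the `OG_a` column are made up for illustration.

```python
import pandas as pd

# Toy metadata table: one row per SRA accession (values are made up)
meta = pd.DataFrame({
    'sra': ['SRR1', 'SRR2', 'SRR3'],
    'tissue': ['leaf', 'root', 'stem'],
})

# Toy expression table: SRR3 is missing here, and SRR4 has no metadata
expr = pd.DataFrame({
    'sra': ['SRR1', 'SRR2', 'SRR4'],
    'OG_a': [0.1, -0.2, 0.3],
})

# An inner join keeps only the accessions found in both tables
merged = pd.merge(meta, expr, on='sra', how='inner')
print(merged.shape)
```

Rows for `SRR3` and `SRR4` are dropped because each appears in only one of the two tables, which is why the merged dataframe can have fewer rows than either input.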

In [None]:
df, orthos = loaddata(factorfile, rnafile, factors)

Factor input data shape: (3172, 7)
RNA input data shape: (6335, 2672)
Dataframe shape after merge: (2671, 6342)


In [None]:
df.head()

Unnamed: 0,sra,sample_id,species,family,tissue,stress,bioproject,OG0002118,OG0002119,OG0002120,OG0002121,OG0002122,OG0002125,OG0002126,OG0002127,OG0002128,OG0002132,OG0002133,OG0002137,OG0002139,OG0002142,OG0002143,OG0002144,OG0002145,OG0002146,OG0002147,OG0002148,OG0002154,OG0002155,OG0002156,OG0002157,OG0002159,OG0002160,OG0002161,OG0002162,OG0002165,OG0002166,OG0002167,OG0002169,OG0002170,...,OG0011191,OG0011201,OG0011202,OG0011207,OG0011212,OG0011216,OG0011238,OG0011240,OG0011261,OG0011264,OG0011266,OG0011267,OG0011271,OG0011272,OG0011273,OG0011275,OG0011277,OG0011279,OG0011282,OG0011284,OG0011285,OG0011328,OG0011329,OG0011330,OG0011335,OG0011346,OG0011360,OG0011364,OG0011366,OG0011368,OG0011413,OG0011424,OG0011438,OG0011439,OG0011503,OG0011580,OG0011583,OG0011599,OG0011603,OG0011655
0,SRR1598911,A_hypochondria_001,A_hypochondria,Amaranthacea,flower,healthy,PRJNA263128,-0.001456,0.105937,-0.050513,-0.064531,0.041956,0.001421,0.015684,-0.044681,-0.040176,0.156901,0.00904,0.086616,0.198255,-0.01151,0.116082,0.153478,-0.048863,0.031323,-0.062965,-0.027801,0.059899,-0.01874,-0.027129,-0.070887,-0.081776,-0.035525,-0.072955,-0.017868,-0.048536,-0.048135,-0.106876,-0.072206,-0.101233,...,-0.039385,-0.018324,-0.072203,0.018669,-0.079136,0.006234,-0.055142,-0.074703,-0.051067,-0.003863,-0.073836,-0.009686,-0.071781,-0.051653,-0.082667,-0.055877,-0.070845,-0.107171,-0.062462,-0.068771,-0.070214,-0.064071,-0.067258,-0.055369,-0.073002,-0.07512,-0.067491,-0.074643,-0.055568,0.068693,-0.044161,-0.065706,-0.073235,-0.043796,-0.072423,-0.051586,0.034663,-0.046609,0.041908,0.02065
1,SRR1598912,A_hypochondria_002,A_hypochondria,Amaranthacea,leaf,healthy,PRJNA263128,-0.050352,0.254496,-0.229081,-0.044667,-0.029998,0.093129,-0.050958,-0.237801,-0.166522,0.27492,-0.092045,0.313335,0.687212,0.17077,0.767898,0.421275,-0.202901,0.198959,-0.188598,-0.242519,0.426952,-0.134623,-0.151437,-0.183163,-0.133909,-0.063276,-0.220159,-0.088992,-0.156517,-0.13445,-0.05281,-0.203709,0.082189,...,-0.245452,0.509246,-0.226191,0.078928,-0.268059,-0.424409,-0.235317,-0.241414,-0.201529,0.00164,-0.248217,0.163692,-0.242656,-0.227221,-0.252562,-0.233738,-0.234675,-0.366289,-0.237922,-0.235524,-0.23405,-0.223131,-0.258069,-0.14059,-0.230676,-0.254929,-0.221565,-0.255772,-0.157762,0.635246,-0.247876,-0.235771,-0.22439,-0.27184,-0.217694,0.2175,-0.084585,-0.123412,0.621273,-0.098835
2,SRR1598913,A_hypochondria_003,A_hypochondria,Amaranthacea,root,healthy,PRJNA263128,-0.207445,0.485422,-0.3006,-0.085825,-0.044407,-0.061632,-0.047539,-0.299363,-0.107795,0.566231,-0.330444,0.275767,0.763075,0.40168,0.434029,0.450636,-0.278571,0.352884,-0.218325,-0.206967,0.202653,-0.136491,-0.082368,-0.406822,0.224917,-0.139774,-0.311683,-0.132602,-0.203021,-0.334327,-0.271979,-0.256626,0.11328,...,-0.179593,-0.117367,-0.286509,-0.031956,-0.329782,0.462999,-0.202286,-0.339019,-0.251814,0.067319,-0.306012,0.073821,-0.309367,-0.187119,-0.203944,-0.296097,-0.31126,-0.04669,-0.310558,-0.279481,-0.303574,-0.2651,-0.297655,-0.227918,-0.307558,-0.31338,-0.296619,-0.314493,-0.215667,0.049767,-0.187849,-0.303212,-0.28303,0.087385,-0.28831,-0.013373,0.024199,-0.195171,0.36054,0.137576
3,SRR1598910,A_hypochondria_004,A_hypochondria,Amaranthacea,stem,healthy,PRJNA263128,-0.226247,0.661615,-0.122427,-0.194361,-0.018764,0.062965,-0.090751,-0.343471,-0.203417,0.860485,-0.163096,0.596404,1.241189,0.574202,0.576262,1.29122,-0.341597,0.507157,-0.234809,-0.328655,0.425913,-0.143535,-0.070677,-0.411144,0.06143,-0.076929,-0.348331,-0.146977,-0.183348,-0.408786,-0.013172,-0.259845,-0.747038,...,-0.218861,-0.064852,-0.326633,-0.081825,-0.371809,0.261899,-0.293581,-0.339895,-0.265222,0.301502,-0.350525,-0.358029,-0.345393,-0.280741,-0.241136,-0.330304,-0.350639,-0.219577,-0.346377,-0.313366,-0.347531,-0.300973,-0.344449,-0.271111,-0.356876,-0.357422,-0.344556,-0.359829,-0.258149,0.297424,-0.279176,-0.335267,-0.326472,-0.125061,-0.335647,-0.06714,-0.05441,-0.166659,-0.174771,-0.063741
4,SRR1598914,A_hypochondria_005,A_hypochondria,Amaranthacea,other,drought,PRJNA263128,-0.103369,1.088749,-0.300173,-0.028792,0.336153,0.015617,0.051135,-0.303071,-0.252303,1.050839,0.133557,-0.013915,1.288096,0.023644,0.630677,0.569392,-0.189553,0.297558,-0.254742,-0.247522,0.157631,-0.158832,-0.045626,-0.323843,-0.267552,-0.122531,-0.263435,-0.009662,-0.186143,-0.185139,-0.188903,-0.262444,0.365195,...,-0.256589,0.127434,-0.279744,0.087709,-0.343602,0.066369,-0.213633,-0.324498,-0.171402,-0.405118,-0.324843,-0.338195,-0.322803,-0.201093,-0.236512,-0.249201,-0.308489,-0.573443,-0.313946,-0.28421,-0.305934,-0.276824,-0.293528,-0.243731,-0.323496,-0.325155,-0.294073,-0.322954,-0.248349,0.55114,-0.183794,-0.295979,-0.302226,-0.088812,-0.278099,0.03154,0.025193,-0.184997,-0.522966,0.173639


`df` is the dataframe containing the merged factor and RNA-seq data. `orthos` is a list of orthogroup names, which is useful when we want to select only the part of the dataframe that contains RNA-seq data.

## Applying Mapper

The first step is to initialize a KeplerMapper object. You can ignore the `nerve` part.

In [None]:
# Initialize mapper object
mymapper = km.KeplerMapper(verbose=1)

# Define Nerve
nerve = km.GraphNerve(min_intersection=1)

KeplerMapper(verbose=1)


### Define lens / filter function

Next, we need to define the *lens*. In the Python file `lenses.py`, I have defined a lens called `fsga_transform`. Given a factor and a specific level of that factor (for example, factor: stress, level: healthy), we construct a lens following the method described in __[Nicolau et al. 2011](https://www.pnas.org/content/108/17/7265)__.

For example, for factor: stress, and level: healthy, we take the gene expression profiles of all the *healthy* samples from the data and fit a linear model. Then we project all the samples onto this linear model and compute the residuals. The *lens* is the norm of the residual vector.
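
To make the idea concrete, here is a minimal sketch on random toy data. It assumes the "linear model" is simply the subspace spanned by the reference samples, which may well differ from what `fsga_transform` actually does; the array names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 samples x 5 genes; the first 3 rows play the role of the reference level
X = rng.normal(size=(6, 5))
is_ref = np.array([True, True, True, False, False, False])

# Assumed "linear model": the subspace spanned by the reference samples.
# The SVD of the reference rows gives an orthonormal basis for that span.
_, s, vt = np.linalg.svd(X[is_ref], full_matrices=False)
basis = vt[s > 1e-10]                 # shape (rank, genes)

# Project every sample onto the subspace and keep what is left over
projection = X @ basis.T @ basis
residuals = X - projection

# The lens value of a sample is the L2 norm of its residual vector
lens = np.linalg.norm(residuals, axis=1)
print(lens.round(3))
```

In this sketch the reference samples lie exactly in the fitted subspace, so their lens values are numerically zero, while the other samples get a positive value that measures how far they deviate from the reference level.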

In [None]:
# Define lens
scaler = MinMaxScaler()
residuals, idx_tr, idx_te = fsga_transform(df, orthos, filter_by_factor, filter_by_level)
lens = mymapper.project(residuals, projection='l2norm', scaler=scaler)

..Projecting on data shaped (2671, 6335)

..Projecting data using: l2norm

..Scaling with: MinMaxScaler(copy=True, feature_range=(0, 1))



### Define cover

The next step is to define the cover. Specify the number of intervals (`cubes`) and the percentage overlap between consecutive intervals (`overlap`), and let `kmapper` take care of the rest. Feel free to change both parameters, but keep in mind that `overlap` must be between 0 and 100 (it is converted to a fraction when passed as `perc_overlap`). Also, increasing the number of intervals makes the algorithm run slower, so don't increase it beyond 130 or so.
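
The sketch below illustrates what a cover of overlapping intervals looks like for a one-dimensional lens. The helper `overlapping_intervals` is hypothetical and uses one simple convention (neighbouring intervals share a fixed fraction of their length); it is not necessarily the exact formula `kmapper` uses internally.

```python
def overlapping_intervals(lo, hi, n_intervals, overlap_frac):
    """Split [lo, hi] into equal-length intervals whose neighbours share
    `overlap_frac` of their length (illustrative convention only)."""
    if not 0 <= overlap_frac < 1:
        raise ValueError('overlap_frac must be in [0, 1)')
    step_frac = 1.0 - overlap_frac
    # Choose the length so that the last interval ends exactly at `hi`
    length = (hi - lo) / (1 + (n_intervals - 1) * step_frac)
    step = length * step_frac
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

intervals = overlapping_intervals(0.0, 1.0, n_intervals=4, overlap_frac=0.5)
for a, b in intervals:
    print(f'[{a:.2f}, {b:.2f}]')
```

A sample whose lens value falls in the shared part of two intervals is clustered in both, and that shared membership is what produces edges in the mapper graph. Larger overlaps therefore tend to give more edges.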

In [None]:
# Define cover
cubes, overlap = (100, 85)
cover = km.cover.Cover(n_cubes=cubes, perc_overlap=overlap/100.)

### Define clustering algorithm

We will stick to DBSCAN with its default parameters, with one exception: the metric.<br>
We will use the correlation distance (1 - correlation) between a pair of gene expression profiles. You can try changing this to the *cosine* metric, or to some other predefined metric available in scikit-learn, and see how it affects the output.
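
For reference, the correlation distance between two profiles can be computed by hand as follows. The function name and the toy vectors are illustrative only; in the notebook, the computation is handled by DBSCAN through `metric='correlation'`.

```python
import numpy as np

def correlation_distance(u, v):
    """Return 1 - Pearson correlation between two 1-D profiles."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()
    vc = v - v.mean()
    return 1.0 - (uc @ vc) / (np.linalg.norm(uc) * np.linalg.norm(vc))

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]   # same pattern as a, different scale
c = [4.0, 3.0, 2.0, 1.0]       # a reversed

print(correlation_distance(a, b))   # close to 0: perfectly correlated
print(correlation_distance(a, c))   # close to 2: perfectly anti-correlated
```

The distance ignores the overall scale and offset of a profile and ranges from 0 to 2, which is worth keeping in mind because DBSCAN's default neighbourhood radius is `eps=0.5`.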

In [None]:
# Define clustering algorithm
clust_metric = 'correlation'
clusterer = DBSCAN(metric=clust_metric)

### Construct the mapper graph

With all the components required to construct the mapper graph ready, we can go ahead and call the `map` method of the KeplerMapper object to construct the mapper graph. Keep an eye on the number of hypercubes, nodes, and edges reported by the algorithm. You can change the graph size by changing the cover parameters.

In [None]:
# Create mapper 'graph' with nodes, edges and meta-information.
graph = mymapper.map(lens=lens,
                     X=residuals,
                     clusterer=clusterer,
                     cover=cover,
                     nerve=nerve,
                     precomputed=False,
                     remove_duplicate_nodes=True)

Mapping on data shaped (2671, 6335) using lens shaped (2671, 1)

Creating 100 hypercubes.
Merged 41 duplicate nodes.

Number of nodes before merger: 361; after merger: 320


Created 1703 edges and 320 nodes in 0:00:14.192770.


### Adding components to visualization

Before we visualize the constructed mapper graph, we will add a couple of components to the visualization.<br>
First, we will color the nodes of the mapper graph using the specified factor (`color_by_factor`). The specified level (`color_by_level`) will be at one end of the colormap, and all other levels will be at the other end. The node color is determined by averaging the colors of all samples in the corresponding cluster.
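
The averaging can be sketched on a toy graph as follows. The node ids, memberships, and `toy_*` names are hypothetical, and the sketch assumes the node color is the plain mean of the member values, as described above.

```python
# Hypothetical toy graph: node id -> indices of the samples in that cluster
toy_nodes = {
    'cube0_cluster0': [0, 1, 2],
    'cube1_cluster0': [2, 3],
    'cube2_cluster0': [4],
}

# Per-sample color values: 0 for the highlighted level, 1 for everything else
toy_color_vec = [0, 0, 1, 1, 1]

# Node color = mean of the color values of its member samples
node_color = {
    node: sum(toy_color_vec[i] for i in members) / len(members)
    for node, members in toy_nodes.items()
}
print(node_color)
```

A node made up entirely of the highlighted level sits at one end of the colormap, a node with none of it sits at the other end, and mixed nodes fall in between.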

In [None]:
# Color nodes by specified color_by_factor, color_by_level
df[color_by_factor] = df[color_by_factor].astype('category')
color_vec = np.asarray([0 if(val == color_by_level) else 1 for val in df[color_by_factor]])
cscale = colorscale_from_matplotlib_cmap(plt.get_cmap('coolwarm'))

In [None]:
# Show the (filter_by_factor, color_by_factor) levels of each sample in the tooltip
temp = ['({}, {})'.format(str(p[0]), str(p[1])) for p in zip(df[filter_by_factor], df[color_by_factor])]
df['tooltips'] = temp

### Visualize the mapper graph

Lastly, we create the visualization, save it as an HTML file, and then load it into a frame.<br>
Alternatively, you can browse to the HTML file and open it in a separate browser window.

In [None]:
# Specify file to save html output
fname = 'FilterBy_{}_ColorBy_{}_Cubes_{}_Overlap_{}.html'.format(filter_by_factor,
                                                              color_by_factor,
                                                              cubes,
                                                              overlap)
figtitle = 'Lens: {} : {}, Color by {} : {}, # Intervals {}, overlap {}'.format(filter_by_factor,
                                                                                filter_by_level,
                                                                                color_by_factor,
                                                                                color_by_level,
                                                                                cubes, overlap/100.0)

fpath = projdir + '/' + fname
# Create visualization and save to specified file
_ = mymapper.visualize(graph,
                       path_html=fpath,
                       title=figtitle,
                       color_values=color_vec,
                       color_function_name=color_by_factor,
                       colorscale=cscale,
                       custom_tooltips=df['tooltips'])

# Load the html output file
IFrame(src=fpath, width=1000, height=800)

Wrote visualization to: /content/gdrive/MyDrive/PlantsAndPython-2021-10-22/FilterBy_family_ColorBy_tissue_Cubes_100_Overlap_85.html
