<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-IcaData-object" data-toc-modified-id="Load-IcaData-object-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load IcaData object</a></span></li><li><span><a href="#Bar-plots" data-toc-modified-id="Bar-plots-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Bar plots</a></span><ul class="toc-item"><li><span><a href="#Plot-Gene-Expression" data-toc-modified-id="Plot-Gene-Expression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Plot Gene Expression</a></span></li><li><span><a href="#Plot-iModulon-Activities" data-toc-modified-id="Plot-iModulon-Activities-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Plot iModulon Activities</a></span></li><li><span><a href="#Plot-sample-metadata" data-toc-modified-id="Plot-sample-metadata-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Plot sample metadata</a></span></li></ul></li><li><span><a href="#Scatterplots" data-toc-modified-id="Scatterplots-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scatterplots</a></span><ul class="toc-item"><li><span><a href="#Plot-gene-weights" data-toc-modified-id="Plot-gene-weights-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Plot gene weights</a></span></li><li><span><a href="#Compare-two-iModulon-activities" data-toc-modified-id="Compare-two-iModulon-activities-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Compare two iModulon activities</a></span></li><li><span><a href="#Compare-two-gene-expression-profiles" data-toc-modified-id="Compare-two-gene-expression-profiles-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Compare two gene expression profiles</a></span></li><li><span><a href="#Compare-iModulon-gene-weights" data-toc-modified-id="Compare-iModulon-gene-weights-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Compare iModulon gene weights</a></span></li><li><span><a href="#Compare-iModulon-gene-weights-across-organisms" 
data-toc-modified-id="Compare-iModulon-gene-weights-across-organisms-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Compare iModulon gene weights across organisms</a></span></li></ul></li><li><span><a href="#Metadata-boxplots" data-toc-modified-id="Metadata-boxplots-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Metadata boxplots</a></span></li><li><span><a href="#Regulon-plots" data-toc-modified-id="Regulon-plots-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Regulon plots</a></span></li><li><span><a href="#Activity-Clustering" data-toc-modified-id="Activity-Clustering-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Activity Clustering</a></span><ul class="toc-item"><li><span><a href="#Using-different-correlation-metrics" data-toc-modified-id="Using-different-correlation-metrics-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Using different correlation metrics</a></span></li><li><span><a href="#Automatic-Distance-Thresholding" data-toc-modified-id="Automatic-Distance-Thresholding-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Automatic Distance Thresholding</a></span></li><li><span><a href="#Manual-Distance-Threshold" data-toc-modified-id="Manual-Distance-Threshold-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Manual Distance Threshold</a></span></li><li><span><a href="#Displaying-Best-Clusters" data-toc-modified-id="Displaying-Best-Clusters-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Displaying Best Clusters</a></span></li><li><span><a href="#Naming-Clusters" data-toc-modified-id="Naming-Clusters-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>Naming Clusters</a></span></li><li><span><a href="#DIMCA" data-toc-modified-id="DIMCA-6.6"><span class="toc-item-num">6.6&nbsp;&nbsp;</span>DIMCA</a></span></li></ul></li><li><span><a href="#Coming-soon" data-toc-modified-id="Coming-soon-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Coming soon</a></span></li></ul></div>

In [None]:
from pymodulon.core import IcaData
from pymodulon.plotting import *
from pymodulon.io import load_json_model
import pandas as pd

# Load IcaData object

In [None]:
ica_data = load_json_model('../example_data/example.json')

# Bar plots
Gene expression and iModulon activities are easily viewed as bar plots. Use the `plot_expression` and `plot_activities` functions, respectively. Any numeric metadata for your experiments can be plotted using the `plot_metadata` function.  

Optional arguments:
* `projects`: Only show specific project(s)
* `highlight`: Show individual conditions for specific project(s)
* `ax`: Use a pre-existing axis (helpful if you want to manually determine the plot size)
* `legend_args`: Arguments to pass to the legend (e.g. `{'fontsize':12, 'loc':0, 'ncol':2}`)

## Plot Gene Expression
You can plot the compendium-wide expression of a gene using either its locus tag or its gene name.

In [None]:
plot_expression(ica_data,'b0002')

In [None]:
plot_expression(ica_data,'thrA',projects=['ica','fps'],highlight='fps')

## Plot iModulon Activities

In [None]:
plot_activities(ica_data,'GlpR',highlight='crp')

In [None]:
plot_activities(ica_data,'GlpR',projects='crp')

## Plot sample metadata

In [None]:
plot_metadata(ica_data,'Growth Rate (1/hr)')

# Scatterplots

Gene expression and iModulon activities can be compared with a scatter plot. Use the `compare_expression` and `compare_activities` functions, respectively. In addition, `compare_values` can be used to compare any compendium-wide value against another, including gene expression, iModulon activity, and sample metadata.

Optional arguments:
* `groups`: Mapping of samples to specific groups
* `colors`: Color of points, list of colors to use for different groups, or dictionary mapping groups to colors
* `show_labels`: Show labels for points. (default: `False`)
* `adjust_labels`: Automatically avoid label overlap
* `fit_metric`: Correlation metric of `'pearson'`,`'spearman'`, or `'r2adj'` (default: `'pearson'`)
* `ax`: Use a pre-existing axis (helpful if you want to manually determine the plot size)

Formatting arguments:
* `ax_font_args`: Arguments for axis labels (e.g. `{'fontsize':16}`)
* `scatter_args`: Arguments for scatter plot (e.g. `{'s':10}`)
* `label_font_args`: Arguments for text labels (e.g. `{'fontsize':8}`)
* `legend_args`: Arguments to pass to the legend (e.g. `{'fontsize':12, 'loc':0, 'ncol':2}`)
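
The `fit_metric` choices measure different things. As a rough, library-independent sketch (plain Python, not pymodulon's own implementation; the helper names `pearson`, `ranks`, and `spearman` are hypothetical), Pearson captures linear association, while Spearman is Pearson computed on ranks and therefore reaches 1 for any strictly increasing relationship:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data: y increases with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

On this toy data the Spearman value is 1 while the Pearson value is slightly lower, which is the kind of difference to keep in mind when choosing a `fit_metric`.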

## Plot gene weights
`plot_gene_weights` will plot an iModulon's gene weights against its genomic position. If the number of genes in the iModulon is fewer than 20, it will also show the gene names (or locus tags, if gene name is unavailable).

In [None]:
plot_gene_weights(ica_data,'GlpR')

If there are more than 20 genes, gene names will not be shown by default.

In [None]:
plot_gene_weights(ica_data,'Fnr')

Use `show_labels=True` to show gene labels. It is advisable to turn off auto-adjustment of gene labels (`adjust_labels=False`), as this may take a while with many genes.

In [None]:
plot_gene_weights(ica_data,'Fnr',show_labels=True,adjust_labels=False)

## Compare two iModulon activities

In [None]:
groups = {'minspan__wt_glc_anaero__1':'Anaerobic',
          'minspan__wt_glc_anaero__2':'Anaerobic'}
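
A `groups` mapping like the one above can also be built programmatically when many samples share a naming pattern. The snippet below is a small sketch in plain Python; the list `sample_names`, its third entry, and the variable names `groups_auto` and `colors_by_group` are hypothetical and only serve the illustration:

```python
# Hypothetical sample names; in practice these would come from your own sample table
sample_names = [
    'minspan__wt_glc_anaero__1',
    'minspan__wt_glc_anaero__2',
    'minspan__wt_glc__1',
]

# Label every sample whose name contains 'anaero'; other samples are left out of the mapping
groups_auto = {name: 'Anaerobic' for name in sample_names if 'anaero' in name}

# A dictionary mapping group names to colors, one of the accepted forms of `colors`
colors_by_group = {'Anaerobic': 'green'}

print(groups_auto)
```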

In [None]:
compare_activities(ica_data,'Fnr','ArcA-1',groups=groups)

In [None]:
compare_activities(ica_data,'Fnr','ArcA-1',groups=groups,colors=['green','orange'])

## Compare two gene expression profiles

In [None]:
compare_expression(ica_data,'arcA','fnr',groups=groups)

## Compare iModulon gene weights

In [None]:
compare_gene_weights(ica_data,'CysB','Cbl+CysB')

## Compare iModulon gene weights across organisms

In [None]:
s_acid = load_json_model('../example_data/modulome_example/saci.json')

In [None]:
m_binarized = pd.DataFrame().reindex_like(s_acid.M)

In [None]:
compare_gene_weights(ica_data,'CysB',
                     ica_data2 = s_acid, imodulon2=1, 
                     ortho_file='../example_data/example_bbh.csv')

Use `use_org1_names` to switch which organism's names are shown

In [None]:
compare_gene_weights(ica_data,'CysB',
                     ica_data2 = s_acid, imodulon2=1,  
                     ortho_file='../example_data/example_bbh.csv',
                     use_org1_names=False)

# Metadata boxplots
The function `metadata_boxplot` automatically classifies iModulon activities using the sample metadata. Optional arguments include:
* `n_boxes`: Number of boxes to create
* `strip_conc`: Remove concentrations from metadata (e.g. "glucose(2g/L)" would be interpreted as just "glucose"). Default is `True`
* `ignore_cols`: List of columns to ignore. If empty, only "project" and "condition" are ignored
* `use_cols`: List of columns to use. This supersedes `ignore_cols`
* `return_results`: Return a dataframe describing the classifications
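
To make the `strip_conc` option concrete, here is a minimal sketch of the kind of normalization it describes, written with the standard `re` module. The helper name `strip_concentration` and the regular expression are illustrative assumptions, not pymodulon's actual implementation:

```python
import re

def strip_concentration(value):
    """Drop a trailing parenthesized concentration, e.g. 'glucose(2g/L)' -> 'glucose'."""
    return re.sub(r'\s*\([^()]*\)\s*$', '', value)

# Values without a parenthesized suffix pass through unchanged
for raw in ['glucose(2g/L)', 'glycerol (4 g/L)', 'NH4Cl']:
    print(raw, '->', strip_concentration(raw))
```

With this kind of normalization, samples grown on the same compound at different concentrations fall into the same category.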

In [None]:
metadata_boxplot(ica_data,"EvgA",ignore_cols=['GEO','study','project','DOI'])

In [None]:
metadata_boxplot(ica_data,"RpoS",n_boxes=5,use_cols=['Base Media','Carbon Source (g/L)', 'Nitrogen Source (g/L)','Supplement','Evolved Sample','pH','Growth Rate (1/hr)'])

# Regulon plots
iModulon gene weights can be visualized in a histogram. If you wish to highlight genes in a regulon, it can be visualized either as overlapping bars, or side-by-side bars.

In [None]:
plot_regulon_histogram(ica_data,'Fur-1','fur')

In [None]:
plot_regulon_histogram(ica_data,'Fur-1','fur',kind='side')

# Activity Clustering

The iModulon activities in the A matrix can be clustered based on the correlation between their activities across conditions in the compendium.

Use the `cluster_activities` function to prepare a clustermap; the minimal input is simply your IcaData object.

In [None]:
cluster_activities(ica_data)

## Using different correlation metrics
You can use multiple correlation metrics, including `"pearson"`,`"spearman"`, and `"mutual_info"`. Mutual information is most likely to identify biologically similar iModulons, but can be more difficult to interpret as it finds both linear and non-linear correlations.

In [None]:
cluster_activities(ica_data,correlation_method='mutual_info')
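
The claim that mutual information also picks up non-linear relationships can be illustrated with a toy example. This is a sketch in plain Python on discrete values; the helpers `pearson` and `mutual_information` are hypothetical and are not the estimators pymodulon uses (real activities are continuous and would need binning or a dedicated estimator):

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

# y is fully determined by x, but the relationship is symmetric rather than linear
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

r = pearson(x, y)
mi = mutual_information(x, y)
print(round(r, 3), round(mi, 3))
```

Here the Pearson correlation is zero even though `y` depends entirely on `x`, while the mutual information is clearly positive.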

## Automatic Distance Thresholding

Agglomerative (hierarchical) clustering is used under the hood. Thus, a distance threshold for defining "flat" clusters from the hierarchical structure must be determined. By default, this distance threshold is automagically calculated using a sensitivity analysis. 

Different distance thresholds (this value is between 0 and 1) are tried, and the resulting clustering is assessed using a silhouette score (a measure of how separate the clusters are). The distance threshold yielding the maximum silhouette score is automatically chosen.

To see the result of this sensitivity analysis, use the `show_thresholding` option.

In [None]:
cluster_activities(ica_data, show_thresholding=True)
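
The threshold sweep described above can be sketched without any dependencies. The code below is an illustration only: it reimplements a simple average-linkage agglomerative clustering and a mean silhouette score in plain Python rather than calling scikit-learn, and the toy distance matrix `D`, the linkage choice, and the candidate thresholds are all assumptions.

```python
def average_linkage(D, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

def flat_clusters(D, threshold):
    """Merge the closest clusters until the closest pair is at least `threshold` apart."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > 1:
        pairs = [(average_linkage(D, clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        dist, p, q = min(pairs)
        if dist >= threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

def mean_silhouette(D, clusters):
    """Mean silhouette over all points; points in singleton clusters score 0."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(D[i][j] for j in c if j != i) / (len(c) - 1)
            b = min(sum(D[i][j] for j in other) / len(other)
                    for other in clusters if other is not c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy distance matrix with two obvious groups: {0, 1, 2} and {3, 4}
D = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.95, 0.90],
    [0.90, 0.85, 0.95, 0.00, 0.10],
    [0.80, 0.90, 0.90, 0.10, 0.00],
]

results = {}
for threshold in [0.05, 0.12, 0.3, 0.99]:
    clusters = flat_clusters(D, threshold)
    if len(clusters) < 2:
        continue  # the silhouette score is undefined for a single cluster
    results[threshold] = mean_silhouette(D, clusters)

best_threshold = max(results, key=results.get)
print(best_threshold, round(results[best_threshold], 3))
```

In this toy case the threshold that recovers the two obvious groups also has the highest mean silhouette score, which is the selection rule the sensitivity analysis applies.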

## Manual Distance Threshold

You may also want to vary the distance threshold manually to see how the iModulon clusters change. Setting this threshold manually (with the `distance_threshold` option) will override the automatic thresholding shown above.

Note: `distance_threshold` must be set to a value between 0 and 1; larger values will generally yield smaller numbers of larger clusters, as shown in the thresholding plot above.

In [None]:
cluster_activities(ica_data, distance_threshold=0.95)

## Displaying Best Clusters

The above clustermaps do not allow you to see which iModulons are actually being clustered together; use the `show_best_clusters` option to call out an additional plot that shows such clusters. 

By default, the clusters whose individual silhouette scores are greater than the mean silhouette score (indicating their separation from the other clusters is above-average) will be shown.

The cluster numbers come from the scikit-learn `AgglomerativeClustering` estimator that actually performs the clustering; these labels don't have any special significance and are just unique identifiers for each cluster. You can also access an iModulon-index-matched list of these labels by accessing the `labels_` attribute of the `AgglomerativeClustering` object (which is returned by `cluster_activities`).
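
Because the labels are index-matched to the iModulons, they can be turned into a cluster-to-members lookup with a few lines of plain Python. In this sketch the lists `imodulon_names` and `labels` are hypothetical stand-ins; in practice `labels` would be the `labels_` attribute of the returned clustering object, and the names would be your iModulon names in the same order:

```python
from collections import defaultdict

# Hypothetical stand-ins for the iModulon names and their index-matched cluster labels
imodulon_names = ['Fnr', 'ArcA-1', 'GlpR', 'Fur-1', 'RpoS']
labels = [3, 3, 0, 1, 0]

# Collect the iModulons that share each cluster label
members = defaultdict(list)
for name, label in zip(imodulon_names, labels):
    members[label].append(name)

for label in sorted(members):
    print(label, members[label])
```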

In [None]:
cluster_activities(ica_data, show_best_clusters=True)

So we can see here that the clustering method does seem to capture some biologically-relevant groups of iModulons: Cluster 15 contains 2 flagella regulators, Cluster 8 is membrane-related, Cluster 13 is stress-related, Cluster 25 is iron-related, Cluster 12 is carbon metabolism related, etc.

NOTE: the parenthesized numbers next to the cluster names are the clusters' silhouette scores (a measure of a cluster's separation from the pack; silhouette values range from -1 to 1, with higher values indicating better separation).

Perhaps you're only interested in seeing a specific number of the top clusters; use the `n_best_clusters` argument to specify this preference:

In [None]:
cluster_activities(ica_data, show_best_clusters=True, n_best_clusters=5)

## Naming Clusters

After performing an initial clustering and manually mapping knowledge onto your best clusters, you may decide on a new, more descriptive name for some of your clusters. To generate pretty figures that use these names instead of the soulless integer IDs, use the `cluster_names` option to map the integer IDs to names. You don't have to rename all clusters. 

In [None]:
cluster_activities(
    ica_data, show_best_clusters=True, n_best_clusters=5,
    cluster_names={15: 'Flagella', 8: 'Membrane', 13: 'Stress'}
)

## DIMCA

Differential iModulon Cluster Activity (DIMCA) analysis, a sister method to the well-used differential iModulon activity (DIMA) analysis, allows you to compare all iModulon activities between 2 or more conditions. 

`cluster_activities` itself has the capability to perform DIMCA analyses, exposing a series of `dimca_`-prefixed arguments that correspond with the arguments to `plot_dima` (which is in fact used under the hood). 

For DIMCA, a "cluster activity" will be calculated for each of your best clusters (however many you ask for) and then plotted between your 2 conditions, instead of that cluster's constituent iModulons. In this way, a DIMA plot can be rendered even simpler.

Cluster activities are simply the averages of the activities of the constituent iModulons, EXCEPT that for iModulons that are generally anti-correlated with the others in a cluster (see purR-KO from the above plot, for example), the sign of the activities is first switched.
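
The sign-flipped average described above can be sketched as follows. This is an assumption-laden illustration in plain Python: the helper names are hypothetical, and using the first member of the cluster as the reference for deciding which iModulons are anti-correlated is a simplification that may differ from what pymodulon does internally.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_activity(activities):
    """Average member activities per sample after aligning signs to the first member."""
    reference = activities[0]
    aligned = []
    for profile in activities:
        # Flip the sign of members that are anti-correlated with the reference
        sign = -1.0 if pearson(reference, profile) < 0 else 1.0
        aligned.append([sign * v for v in profile])
    n = len(aligned)
    return [sum(column) / n for column in zip(*aligned)]

# Toy activities: rows are the iModulons in one cluster, columns are samples
activities = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [-1.0, -2.0, -3.0, -4.0],  # anti-correlated member
]

result = cluster_activity(activities)
print([round(v, 3) for v in result])
```

Without the sign flip, the third member would partially cancel the other two and understate the cluster's overall response.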

In [None]:
plot_dima(ica_data, 'fur:wt_dpd', 'fur:wt_fe')

This DIMA plot is fairly busy; we can use DIMCA to reduce the number of points even further:

In [None]:
cluster_activities(ica_data, show_best_clusters=True, dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe')

Thus, we have a somewhat simpler picture of this comparison. We can increase the number of best clusters we ask for to yield only cluster points; be careful with this, though, as some of the worse-scoring clusters are actually "singleton" clusters containing just a single iModulon (these iModulons' activities are generally uncorrelated with those of the other iModulons in the dataset).

In [None]:
cluster_activities(ica_data, n_best_clusters=50, dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe')

And as before, if we name clusters, those names will propagate to the DIMCA plot (and DIMCA table if requested):

In [None]:
cluster_obj, dimca_ax, table = cluster_activities(
    ica_data, show_best_clusters=True,
    cluster_names={25: 'Iron', 13: 'Stress', 8: 'Membrane'},
    dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe', dimca_table=True
)

In [None]:
table

Note that '[Clst]' is added to cluster names in the DIMCA plot to avoid confusion with unclustered iModulons.

# Coming soon
1. Regulon Venn diagrams