<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Check-your-sample-table-(i.e.-metadata-file)" data-toc-modified-id="Check-your-sample-table-(i.e.-metadata-file)-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Check your sample table (i.e. metadata file)</a></span></li><li><span><a href="#Check-your-TRN" data-toc-modified-id="Check-your-TRN-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Check your TRN</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Load the data</a></span></li></ul></li><li><span><a href="#Regulatory-iModulons" data-toc-modified-id="Regulatory-iModulons-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Regulatory iModulons</a></span></li><li><span><a href="#Functional-iModulons" data-toc-modified-id="Functional-iModulons-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Functional iModulons</a></span><ul class="toc-item"><li><span><a href="#GO-Enrichments" data-toc-modified-id="GO-Enrichments-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>GO Enrichments</a></span></li><li><span><a href="#KEGG-Enrichments" data-toc-modified-id="KEGG-Enrichments-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>KEGG Enrichments</a></span><ul class="toc-item"><li><span><a href="#Load-KEGG-mapping" data-toc-modified-id="Load-KEGG-mapping-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Load KEGG mapping</a></span></li><li><span><a href="#Perform-enrichment" data-toc-modified-id="Perform-enrichment-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Perform enrichment</a></span></li><li><span><a href="#Convert-KEGG-IDs-to-human-readable-names" data-toc-modified-id="Convert-KEGG-IDs-to-human-readable-names-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Convert KEGG IDs to 
human-readable names</a></span></li><li><span><a href="#Save-files" data-toc-modified-id="Save-files-3.2.4"><span class="toc-item-num">3.2.4&nbsp;&nbsp;</span>Save files</a></span></li></ul></li></ul></li><li><span><a href="#Check-for-single-gene-iModulons" data-toc-modified-id="Check-for-single-gene-iModulons-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Check for single gene iModulons</a></span></li><li><span><a href="#Compare-against-published-iModulons" data-toc-modified-id="Compare-against-published-iModulons-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Compare against published iModulons</a></span><ul class="toc-item"><li><span><a href="#Visualize-linked-iModulons" data-toc-modified-id="Visualize-linked-iModulons-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Visualize linked iModulons</a></span></li></ul></li><li><span><a href="#Save-iModulon-object" data-toc-modified-id="Save-iModulon-object-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Save iModulon object</a></span></li><li><span><a href="#Coming-soon" data-toc-modified-id="Coming-soon-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Coming soon</a></span></li></ul></div>

# Setup
This IPython notebook walks through the steps of characterizing iModulons with semi-automated tools. You will need:
* M and A matrices
* Expression data (e.g. `log_tpm_norm.csv`)
* Gene table and KEGG/GO annotations (generated in `gene_annotation.ipynb`)
* Sample table with `project` and `condition` columns
* TRN file

Optional:
* iModulon table (if you already have some characterized iModulons)

In [1]:
from pymodulon.core import IcaData
from pymodulon.io import save_to_json
from os import path
import pandas as pd
import re
from Bio.KEGG import REST
from tqdm.notebook import tqdm

In [2]:
# Enter the location of your data here
data_dir = '../data/precise2/'

## Check your sample table (i.e. metadata file)
Your metadata file will probably have a lot of columns, most of which you may not care about. Feel free to save a secondary copy of your metadata file with only columns that seem relevant to you. The two most important columns are:
1. `project`
2. `condition`

Make sure that these columns exist in your metadata file:

In [4]:
df_metadata = pd.read_csv(path.join(data_dir,'metadata_qc.csv'),index_col=0)
df_metadata[['project','condition']].head()

| | project | condition |
|---|---|---|
| ecoli_00001 | control | wt_glc |
| ecoli_00002 | control | wt_glc |
| ecoli_00003 | fur | wt_dpd |
| ecoli_00004 | fur | wt_dpd |
| ecoli_00005 | fur | wt_fe |


In [5]:
print(df_metadata.project.notnull().all())
print(df_metadata.condition.notnull().all())

True
True
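The two checks above assume that both columns are present; if one were missing, the attribute access would raise an error instead of printing `False`. A minimal sketch of a more defensive check follows. It uses only the standard-library `csv` module on a small in-memory sample so that it runs without the data files. The helper name `check_required_columns` and the sample rows are illustrative assumptions, not part of pymodulon.

```python
import csv
import io

REQUIRED_COLUMNS = ("project", "condition")

def check_required_columns(rows, fieldnames, required=REQUIRED_COLUMNS):
    """Return a dict mapping each problem column to a short description."""
    problems = {}
    for col in required:
        if col not in fieldnames:
            problems[col] = "column is missing"
            continue
        # Treat blank CSV cells (empty strings or None) as missing values
        n_empty = sum(1 for row in rows if not (row[col] or "").strip())
        if n_empty:
            problems[col] = f"{n_empty} row(s) have no value"
    return problems

# Hypothetical sample table; the second row lacks a condition
sample_csv = """sample_id,project,condition
ecoli_00001,control,wt_glc
ecoli_00002,control,
ecoli_00003,fur,wt_dpd
"""

reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)
problems = check_required_columns(rows, reader.fieldnames)
print(problems)
```

An empty dictionary means both columns exist and are fully populated.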


## Check your TRN

Each row of the TRN file represents a regulatory interaction.  
**Your TRN file must have the following columns:**
1. `regulator` - Name of regulator (`/` or `+` characters will be converted to `;`)
1. `gene_id` - Locus tag of gene being regulated

The following columns are optional, but are helpful to have:
1. `regulator_id` - Locus tag of regulator
1. `gene_name` - Name of gene (can automatically update this using `name2num`)
1. `direction` - Direction of regulation ('+' for activation, '-' for repression, '?' or NaN for unknown)
1. `evidence` - Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)
1. `PMID` - Reference for regulation

You may add any other columns that could help you. TRNs may be saved as either CSV or TSV files. See below for an example:

In [8]:
df_trn = pd.read_csv(path.join(data_dir,'TRN.csv'), index_col=0)
df_trn.head()

| | regulator | evidence | gene_id |
|---|---|---|---|
| 0 | FMN | Strong | b3041 |
| 1 | Sigma19 | Strong | b4287 |
| 2 | Sigma19 | Strong | b4288 |
| 3 | Sigma19 | Strong | b4289 |
| 4 | Sigma19 | Strong | b4290 |


The `regulator` and `gene_id` columns must be filled in for every row:

In [9]:
print(df_trn.regulator.notnull().all())
print(df_trn.gene_id.notnull().all())

True
True
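The column description above notes that `/` and `+` in regulator names are converted to `;`, presumably because those characters are reserved for combining regulators in enrichment results. The snippet below sketches what that substitution looks like, using the `re` module already imported at the top of the notebook. It illustrates the stated rule and is not pymodulon's actual implementation; the function name `normalize_regulator` and the name `regA+regB` are hypothetical.

```python
import re

def normalize_regulator(name):
    """Replace the reserved characters '/' and '+' with ';'."""
    return re.sub(r"[/+]", ";", name)

# Illustrative names only; 'regA+regB' is a made-up complex
examples = ["flhD/flhC", "regA+regB", "Sigma70"]
normalized = [normalize_regulator(n) for n in examples]
print(normalized)
```

Normalizing names yourself before saving the TRN makes it easier to match them against the regulator names reported in the enrichment tables.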


## Load the data
You're now ready to load your IcaData object!

In [11]:
ica_data = IcaData(M = path.join(data_dir,'M.csv'),
                   A = path.join(data_dir,'A.csv'),
                   X = path.join(data_dir,'log_tpm_qc.csv'),
                   gene_table = path.join(data_dir,'gene_info.csv'),
                   sample_table = path.join(data_dir,'metadata_qc.csv'),
                   trn = path.join(data_dir,'TRN.csv'),
                   optimize_cutoff=True)
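
Loading can fail, or give confusing results, if the input files disagree on gene or sample labels. Assuming the usual ICA layout (M is genes × iModulons, A is iModulons × samples, and X is genes × samples), the labels can be compared before constructing the object. The sketch below works on plain label lists so that it runs without the data files; `check_labels` is a hypothetical helper, not a pymodulon function, and the labels are made up.

```python
def check_labels(m_rows, m_cols, a_rows, a_cols, x_rows, x_cols):
    """Collect label mismatches between the M, A, and X matrices."""
    issues = []
    if set(m_rows) != set(x_rows):
        issues.append("genes differ between M and X")
    if set(a_cols) != set(x_cols):
        issues.append("samples differ between A and X")
    if list(m_cols) != list(a_rows):
        issues.append("iModulon labels differ between M and A")
    return issues

# Made-up labels standing in for the index and columns of each file
genes = ["b0001", "b0002", "b0003"]
samples = ["ecoli_00001", "ecoli_00002"]
imodulons = ["0", "1"]

ok = check_labels(genes, imodulons, imodulons, samples, genes, samples)
bad = check_labels(genes, imodulons, imodulons, samples[:1], genes, samples)
print(ok)
print(bad)
```

With pandas, the same arguments would come from the `index` and `columns` of each loaded DataFrame, converted to a common type (for example strings) first.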



# Regulatory iModulons
Use `compute_trn_enrichment` to automatically check for regulatory iModulons. The more complete your TRN, the more regulatory iModulons you'll find.

You can also search for AND/OR combinations of regulators using the `max_regs` argument. Here, we see that iModulon #4 may be regulated by both ArnR;ArnR1 and ArnA;ArnB.

Regulator enrichments can be directly saved to the `imodulon_table` using the `save` argument. This saves the enrichment with the lowest q-value to the table. For iModulon #4, it will automatically save `ArnR;ArnR1` as the enrichment, but we want to save `ArnR;ArnR1+ArnA;ArnB`. We can update our enrichments accordingly, using `compute_regulon_enrichment`:

In [13]:
ica_data.compute_trn_enrichment(max_regs=2,save=True)

| | imodulon | regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | h-NS+ihf | 8.588865e-18 | 6.424643e-13 | 0.368421 | 1.000000 | 0.538462 | 7.0 | 19.0 | 7.0 | 2.0 |
| 1 | 0 | h-NS+lrp | 8.193751e-17 | 3.064545e-12 | 0.280000 | 1.000000 | 0.437500 | 7.0 | 25.0 | 7.0 | 2.0 |
| 2 | 0 | ihf+lrp | 2.621751e-15 | 6.537074e-11 | 0.179487 | 1.000000 | 0.304348 | 7.0 | 39.0 | 7.0 | 2.0 |
| 3 | 0 | Sigma70+lrp | 7.468260e-12 | 1.396602e-07 | 0.060870 | 1.000000 | 0.114754 | 7.0 | 115.0 | 7.0 | 2.0 |
| 4 | 0 | Sigma70+h-NS | 2.908183e-11 | 4.350758e-07 | 0.050360 | 1.000000 | 0.095890 | 7.0 | 139.0 | 7.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 877 | 207 | Sigma38+argR | 1.287746e-09 | 3.440214e-06 | 0.357143 | 0.277778 | 0.312500 | 5.0 | 14.0 | 18.0 | 2.0 |
| 878 | 207 | Sigma54+argR | 1.287746e-09 | 3.440214e-06 | 0.357143 | 0.277778 | 0.312500 | 5.0 | 14.0 | 18.0 | 2.0 |
| 879 | 212 | glpR | 5.799985e-14 | 1.446168e-09 | 0.555556 | 0.833333 | 0.666667 | 5.0 | 9.0 | 6.0 | 1.0 |
| 880 | 212 | Sigma70+glpR | 5.799985e-14 | 1.446168e-09 | 0.555556 | 0.833333 | 0.666667 | 5.0 | 9.0 | 6.0 | 2.0 |
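
The statistics columns in this output can be reproduced by hand, which helps when judging an enrichment. Judging from the rows shown, `precision` equals `TP / regulon_size`, `recall` equals `TP / imodulon_size`, and `f1score` is their harmonic mean. A regulator written as `A+B` appears to use the genes shared by both regulons (for example, `Sigma70+glpR` has the same regulon size as `glpR` alone). The sketch below encodes that reading in plain Python. The gene sets are made up for illustration and `enrichment_stats` is a hypothetical name, not a pymodulon function.

```python
def enrichment_stats(imodulon_genes, regulon_genes):
    """Overlap statistics, following the column definitions inferred above."""
    tp = len(imodulon_genes & regulon_genes)
    precision = tp / len(regulon_genes)
    recall = tp / len(imodulon_genes)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"TP": tp, "precision": precision, "recall": recall, "f1score": f1}

# Made-up gene sets sized to match the first row of the output above
# (imodulon_size = 7, regulon_size = 19, TP = 7)
imodulon = {f"gene{i}" for i in range(7)}
regulon_a = {f"gene{i}" for i in range(25)}
regulon_b = {f"gene{i}" for i in range(19)} | {"other1", "other2"}

# "A+B": genes regulated by both A and B (set intersection)
combined = regulon_a & regulon_b

stats = enrichment_stats(imodulon, combined)
print({k: round(v, 6) for k, v in stats.items()})
# {'TP': 7, 'precision': 0.368421, 'recall': 1.0, 'f1score': 0.538462}
```

Under this reading, a high `recall` with a low `precision` (as in the `Sigma70` rows) means the iModulon is a small subset of a very large regulon, which is weaker evidence of specific regulation.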


In [14]:
ica_data.imodulon_table.head()

| | regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | h-NS+ihf | 8.588865e-18 | 6.424643e-13 | 0.368421 | 1.0 | 0.538462 | 7.0 | 19.0 | 7.0 | 2.0 |
| 1 | cbl | 7.428275e-24 | 2.778249e-19 | 1.0 | 0.5625 | 0.72 | 9.0 | 9.0 | 16.0 | 1.0 |
| 2 | Sigma54+ntrC | 5.492748e-15 | 4.108685e-10 | 0.162791 | 1.0 | 0.28 | 7.0 | 43.0 | 7.0 | 2.0 |
| 3 | | | | | | | | | | |
| 4 | | | | | | | | | | |
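
As described above, `save=True` keeps the enrichment with the lowest q-value for each iModulon, which is why iModulon 0 shows `h-NS+ihf` here. The following is a minimal sketch of that selection rule, using a few rows copied from the enrichment output and plain dictionaries rather than the pymodulon internals, which may differ (for example in how ties are broken).

```python
# (imodulon, regulator, qvalue) rows taken from the enrichment output above
enrichments = [
    (0, "h-NS+ihf", 6.424643e-13),
    (0, "h-NS+lrp", 3.064545e-12),
    (0, "ihf+lrp", 6.537074e-11),
    (212, "glpR", 1.446168e-09),
    (212, "Sigma70+glpR", 1.446168e-09),
]

best = {}
for imodulon, regulator, qvalue in enrichments:
    # Keep the first row seen with the strictly lowest q-value
    if imodulon not in best or qvalue < best[imodulon][1]:
        best[imodulon] = (regulator, qvalue)

print(best)
```

Tied rows such as `glpR` and `Sigma70+glpR` are a good reason to review the saved regulator by hand instead of relying on the automatic choice.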


You can rename iModulons in this Jupyter notebook, or you can save the iModulon table as a CSV and edit it in Excel.

In [15]:
reg_entries = ica_data.imodulon_table[ica_data.imodulon_table.regulator.notnull()]
reg_entries

| | regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | h-NS+ihf | 8.588865e-18 | 6.424643e-13 | 0.368421 | 1.000000 | 0.538462 | 7.0 | 19.0 | 7.0 | 2.0 |
| 1 | cbl | 7.428275e-24 | 2.778249e-19 | 1.000000 | 0.562500 | 0.720000 | 9.0 | 9.0 | 16.0 | 1.0 |
| 2 | Sigma54+ntrC | 5.492748e-15 | 4.108685e-10 | 0.162791 | 1.000000 | 0.280000 | 7.0 | 43.0 | 7.0 | 2.0 |
| 5 | appY | 1.635376e-19 | 3.058235e-15 | 0.700000 | 0.875000 | 0.777778 | 7.0 | 10.0 | 8.0 | 1.0 |
| 7 | Sigma24 | 2.480964e-12 | 5.247587e-08 | 0.066116 | 0.888889 | 0.123077 | 8.0 | 121.0 | 9.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 201 | Sigma70+iclR | 2.907676e-10 | 3.624999e-06 | 1.000000 | 0.750000 | 0.857143 | 3.0 | 3.0 | 4.0 | 2.0 |
| 205 | gadW | 8.760874e-18 | 3.276655e-13 | 0.466667 | 0.875000 | 0.608696 | 7.0 | 15.0 | 8.0 | 1.0 |
| 206 | flhD;flhC+ihf | 2.338434e-12 | 1.749196e-07 | 0.571429 | 1.000000 | 0.727273 | 4.0 | 7.0 | 4.0 | 2.0 |
| 207 | Sigma38+arcA | 6.628947e-24 | 4.958585e-19 | 0.309524 | 0.722222 | 0.433333 | 13.0 | 42.0 | 18.0 | 2.0 |
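
One way to rename regulatory iModulons in the notebook is to build a dictionary of old and new names and pass it to `rename_imodulons`, which is called with a dictionary later in this notebook. The sketch below only builds the dictionary. The naming scheme (use the regulator name when an iModulon has a single enriched regulator, and add a numeric suffix on duplicates) is an illustrative assumption, and `build_rename_map` is a hypothetical helper.

```python
def build_rename_map(regulators):
    """Map iModulon index -> new name for single-regulator enrichments.

    `regulators` maps iModulon index to its saved regulator string.
    Combined regulators (containing '+' or '/') are left for manual naming.
    """
    rename, seen = {}, {}
    for imodulon, reg in regulators.items():
        if "+" in reg or "/" in reg:
            continue
        seen[reg] = seen.get(reg, 0) + 1
        # First occurrence keeps the bare name; later ones get "-2", "-3", ...
        rename[imodulon] = reg if seen[reg] == 1 else f"{reg}-{seen[reg]}"
    return rename

# Regulators copied from the table above; the second appY entry (index 99)
# is made up to show duplicate handling
saved = {0: "h-NS+ihf", 1: "cbl", 5: "appY", 7: "Sigma24", 99: "appY"}
rename_map = build_rename_map(saved)
print(rename_map)

# With a loaded object, the mapping could then be applied as:
# ica_data.rename_imodulons(rename_map)
```

Combined enrichments are skipped on purpose, since choosing a readable name for them usually needs some biological judgment.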


# Check for single gene iModulons

In [18]:
sg_imods = ica_data.find_single_gene_imodulons(save=True)

In [19]:
for i,mod in enumerate(sg_imods):
    ica_data.rename_imodulons({mod:'single_gene_'+str(i+1)})

In [20]:
ica_data.imodulon_names[:5]

[0, 1, 2, 3, 'single_gene_1']
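
A single-gene iModulon is a component whose gene weights are dominated by one gene. As a rough illustration, one simple heuristic is to flag a component when its largest absolute gene weight is more than twice the second largest. That threshold is an assumption made for this sketch and is not guaranteed to be the rule `find_single_gene_imodulons` applies; the weights below are made up.

```python
def is_single_gene(weights, ratio=2.0):
    """Flag a component whose top |weight| exceeds `ratio` times the runner-up."""
    ranked = sorted((abs(w) for w in weights), reverse=True)
    if len(ranked) < 2:
        # A lone gene trivially dominates; an empty component does not
        return len(ranked) == 1
    return ranked[0] > ratio * ranked[1]

# Made-up gene weights for two components (two columns of M)
dominated = [0.01, -0.02, 0.45, 0.03]
shared = [0.20, 0.18, -0.15, 0.02]

print(is_single_gene(dominated), is_single_gene(shared))
```

Plotting the gene weights of any flagged iModulon is a quick way to confirm that a single gene really stands apart from the rest.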

# Save iModulon object

This will save your iModulon table, your thresholds, and any other information stored in the `ica_data` object.

In [23]:
save_to_json(ica_data,'../data/precise2/precise2.json')

If you prefer to view and edit your iModulon table in Excel, save it as a CSV and reload the iModulon object as before.

In [40]:
ica_data.imodulon_table.to_csv('../example_data/modulome_example/data/iModulon_table.csv')