<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Check-your-sample-table-(i.e.-metadata-file)" data-toc-modified-id="Check-your-sample-table-(i.e.-metadata-file)-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Check your sample table (i.e. metadata file)</a></span></li><li><span><a href="#Check-your-TRN" data-toc-modified-id="Check-your-TRN-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Check your TRN</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Load the data</a></span></li></ul></li><li><span><a href="#Regulatory-iModulons" data-toc-modified-id="Regulatory-iModulons-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Regulatory iModulons</a></span></li><li><span><a href="#Check-for-single-gene-iModulons" data-toc-modified-id="Check-for-single-gene-iModulons-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Check for single gene iModulons</a></span></li><li><span><a href="#Add-Explained-Variance" data-toc-modified-id="Add-Explained-Variance-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Add Explained Variance</a></span></li><li><span><a href="#Save-iModulon-object" data-toc-modified-id="Save-iModulon-object-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Save iModulon object</a></span></li></ul></div>

# Setup
This notebook walks through the steps of characterizing iModulons with semi-automated tools. You will need:
* M and A matrices
* Expression data (e.g. `log_tpm_norm.csv`)
* Gene table and KEGG/GO annotations (generated in `gene_annotation.ipynb`)
* Sample table, with a column for `project` and `condition`
* TRN file

Optional:
* iModulon table (if you already have some characterized iModulons)

In [1]:
from pymodulon.core import IcaData
from pymodulon.io import save_to_json
from os import path
import pandas as pd
import re
from Bio.KEGG import REST
from tqdm.notebook import tqdm

In [2]:
# Enter the location of your data here
data_dir = '../data/precise2/'

## Check your sample table (i.e. metadata file)
Your metadata file will probably have a lot of columns, most of which you may not care about. Feel free to save a secondary copy of your metadata file with only columns that seem relevant to you. The two most important columns are:
1. `project`
2. `condition`

Make sure that these columns exist in your metadata file and contain no missing values.

In [3]:
df_metadata = pd.read_csv(path.join(data_dir,'metadata_qc.csv'),index_col=0)
df_metadata[['project','condition']].head()

| | project | condition |
|---|---|---|
| ecoli_00001 | control | wt_glc |
| ecoli_00002 | control | wt_glc |
| ecoli_00003 | fur | wt_dpd |
| ecoli_00004 | fur | wt_dpd |
| ecoli_00005 | fur | wt_fe |


In [4]:
print(df_metadata.project.notnull().all())
print(df_metadata.condition.notnull().all())

True
True
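
The two checks above can be bundled into one reusable function that reports every problem at once. This is a minimal sketch, assuming only that the sample table is a pandas DataFrame; `check_sample_table` is a hypothetical helper, not part of pymodulon, and the `bad` table is made up for illustration.

```python
import pandas as pd


def check_sample_table(df, required=("project", "condition")):
    """Return a list of problems found in the required columns (hypothetical helper)."""
    problems = []
    for col in required:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isnull().any():
            n_empty = int(df[col].isnull().sum())
            problems.append(f"{n_empty} empty value(s) in column: {col}")
    return problems


# Small table modeled on the first rows shown above
good = pd.DataFrame(
    {"project": ["control", "fur"], "condition": ["wt_glc", "wt_dpd"]},
    index=["ecoli_00001", "ecoli_00003"],
)

# Deliberately broken copy: no `project` column and one empty condition
bad = good.assign(condition=["wt_glc", None]).drop(columns="project")

good_problems = check_sample_table(good)
bad_problems = check_sample_table(bad)
```

An empty list means the table passes; otherwise each entry names the column that needs attention.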


## Check your TRN

Each row of the TRN file represents a regulatory interaction.  
**Your TRN file must have the following columns:**
1. `regulator` - Name of regulator (`/` or `+` characters will be converted to `;`)
1. `gene_id` - Locus tag of gene being regulated

The following columns are optional, but are helpful to have:
1. `regulator_id` - Locus tag of regulator
1. `gene_name` - Name of gene (can automatically update this using `name2num`)
1. `direction` - Direction of regulation ('+' for activation, '-' for repression, '?' or NaN for unknown)
1. `evidence` - Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)
1. `PMID` - Reference for regulation

You may add any other columns that could help you. TRNs may be saved as either CSV or TSV files. See below for an example:

In [5]:
df_trn = pd.read_csv(path.join(data_dir,'TRN.csv'), index_col=0)
df_trn.head()

| | regulator | evidence | gene_id |
|---|---|---|---|
| 0 | FMN | Strong | b3041 |
| 1 | Sigma19 | Strong | b4287 |
| 2 | Sigma19 | Strong | b4288 |
| 3 | Sigma19 | Strong | b4289 |
| 4 | Sigma19 | Strong | b4290 |


The `regulator` and `gene_id` columns must be filled in for every row.

In [6]:
print(df_trn.regulator.notnull().all())
print(df_trn.gene_id.notnull().all())

True
True
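
The TRN requirements in this section can also be checked in one pass. The sketch below validates the two required columns and illustrates the `/` and `+` to `;` conversion described above. It is an illustration of the rule as stated here, not pymodulon's own implementation; `check_trn` and the regulator names `regA` to `regD` are hypothetical.

```python
import pandas as pd

REQUIRED_TRN_COLUMNS = ["regulator", "gene_id"]


def check_trn(df):
    """Validate required TRN columns and return a copy with normalized regulator names."""
    missing = [col for col in REQUIRED_TRN_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"TRN is missing required column(s): {missing}")
    if df[REQUIRED_TRN_COLUMNS].isnull().any().any():
        raise ValueError("TRN has empty values in a required column")
    cleaned = df.copy()
    # Replace every '/' or '+' in a regulator name with ';'
    cleaned["regulator"] = cleaned["regulator"].str.replace(r"[/+]", ";", regex=True)
    return cleaned


# Toy TRN: the last two regulator names are made up to show the conversion
toy_trn = pd.DataFrame(
    {
        "regulator": ["FMN", "Sigma19", "regA/regB", "regC+regD"],
        "gene_id": ["b3041", "b4287", "b4288", "b4289"],
    }
)

cleaned_trn = check_trn(toy_trn)
cleaned_regulators = list(cleaned_trn["regulator"])
```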


## Load the data
You're now ready to load your IcaData object!

In [7]:
ica_data = IcaData(
    M=path.join(data_dir, 'M.csv'),
    A=path.join(data_dir, 'A.csv'),
    X=path.join(data_dir, 'log_tpm_qc.csv'),
    gene_table=path.join(data_dir, 'gene_info.csv'),
    sample_table=path.join(data_dir, 'metadata_qc.csv'),
    trn=path.join(data_dir, 'TRN.csv'),
    optimize_cutoff=True,
)
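
If loading fails, a common cause is a missing or misnamed input file. The sketch below lists which of the expected files are absent from a directory. It assumes the file names used in the cell above; `missing_inputs` is a hypothetical helper, not part of pymodulon, and the demo runs in a temporary directory rather than on real data.

```python
import tempfile
from os import path

# File names taken from the IcaData call above
EXPECTED_FILES = [
    "M.csv",
    "A.csv",
    "log_tpm_qc.csv",
    "gene_info.csv",
    "metadata_qc.csv",
    "TRN.csv",
]


def missing_inputs(data_dir, filenames=EXPECTED_FILES):
    """Return the expected file names that do not exist in data_dir."""
    return [name for name in filenames if not path.isfile(path.join(data_dir, name))]


# Demo: a temporary directory that contains only the M and A matrices
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["M.csv", "A.csv"]:
        open(path.join(demo_dir, name), "w").close()
    missing_in_demo = missing_inputs(demo_dir)
```

In the notebook itself, `missing_inputs(data_dir)` should return an empty list before you construct the `IcaData` object.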



# Regulatory iModulons
Use `compute_trn_enrichment` to automatically check for regulatory iModulons. The more complete your TRN, the more regulatory iModulons you'll find.

You can also search for AND/OR combinations of regulators using the `max_regs` argument. In the output below, for example, iModulon #4 is enriched both for `cysB` alone and for the combination `Sigma70+cysB`.

Regulator enrichments can be saved directly to the `imodulon_table` using the `save` argument. This saves the enrichment with the lowest q-value to the table. If the automatically saved enrichment is not the one you want (iModulon #4, for instance, has two enrichments with identical q-values), you can update it afterwards using `compute_regulon_enrichment`.

In [8]:
ica_data.compute_trn_enrichment(max_regs=2,save=True)

| | imodulon | regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | cysB | 3.768948e-28 | 1.943762e-23 | 0.500000 | 0.516129 | 0.507937 | 16.0 | 32.0 | 31.0 | 1.0 |
| 1 | 4 | Sigma70+cysB | 5.473690e-28 | 1.943762e-23 | 0.600000 | 0.483871 | 0.535714 | 15.0 | 25.0 | 31.0 | 2.0 |
| 2 | 4 | cysB+h-NS | 9.541405e-14 | 2.258832e-09 | 1.000000 | 0.193548 | 0.324324 | 6.0 | 6.0 | 31.0 | 2.0 |
| 3 | 5 | cusR+phoB | 4.186643e-16 | 1.486719e-11 | 0.857143 | 0.545455 | 0.666667 | 6.0 | 7.0 | 11.0 | 2.0 |
| 4 | 5 | phoB+yedW | 4.186643e-16 | 1.486719e-11 | 0.857143 | 0.545455 | 0.666667 | 6.0 | 7.0 | 11.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 953 | 212 | Sigma38+argR | 2.062498e-09 | 8.137929e-06 | 0.357143 | 0.263158 | 0.303030 | 5.0 | 14.0 | 19.0 | 2.0 |
| 954 | 212 | Sigma54+argR | 2.062498e-09 | 8.137929e-06 | 0.357143 | 0.263158 | 0.303030 | 5.0 | 14.0 | 19.0 | 2.0 |
| 955 | 215 | glpR | 6.862213e-14 | 1.624560e-09 | 0.555556 | 0.833333 | 0.666667 | 5.0 | 9.0 | 6.0 | 1.0 |
| 956 | 215 | Sigma70+glpR | 6.862213e-14 | 1.624560e-09 | 0.555556 | 0.833333 | 0.666667 | 5.0 | 9.0 | 6.0 | 2.0 |
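
The `precision`, `recall`, and `f1score` columns in the enrichment table above can be cross-checked by hand from `TP`, `regulon_size`, and `imodulon_size`. The formulas in this sketch are inferred from the numbers in this particular output, not taken from the pymodulon source, so treat them as an assumption; other pymodulon versions may define precision and recall relative to the other gene set.

```python
def enrichment_scores(tp, regulon_size, imodulon_size):
    """Recompute precision, recall, and F1 as they appear in the output above.

    Assumption inferred from the table: precision = TP / regulon_size and
    recall = TP / imodulon_size. F1 is the harmonic mean of the two.
    """
    precision = tp / regulon_size
    recall = tp / imodulon_size
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Row 0 of the table: iModulon 4 vs. cysB (TP=16, regulon_size=32, imodulon_size=31)
precision, recall, f1 = enrichment_scores(16, 32, 31)
print(round(precision, 6), round(recall, 6), round(f1, 6))
```

Because F1 is symmetric in precision and recall, it is the safest single column to compare across versions.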


# Check for single gene iModulons

In [9]:
sg_imods = ica_data.find_single_gene_imodulons(save=True)

In [None]:
for sg_mod in sg_imods:
    # Name each single-gene iModulon after its highest-weighted gene
    sg_name = ica_data.view_imodulon(sg_mod).sort_values(
        by='gene_weight', ascending=False).iloc[0, :]['gene_name']
    ica_data.rename_imodulons({sg_mod: f'single_gene_{sg_name}'})
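
The loop above picks the gene with the largest signed weight. Because the sign of an ICA component is arbitrary, the dominant gene can carry a negative weight, in which case ranking by absolute weight is safer. The sketch below shows the difference on a toy table; the gene names, locus tags, and weights are made up, and `dominant_gene` is a hypothetical helper, not part of pymodulon.

```python
import pandas as pd

# Toy stand-in for the table returned by view_imodulon (values are hypothetical)
toy_imodulon = pd.DataFrame(
    {
        "gene_name": ["geneA", "geneB", "geneC"],
        "gene_weight": [0.08, -0.45, 0.05],
    },
    index=["locus_1", "locus_2", "locus_3"],
)


def dominant_gene(table):
    """Return the name of the gene with the largest absolute weight."""
    strongest = table["gene_weight"].abs().idxmax()
    return table.loc[strongest, "gene_name"]


top_by_abs = dominant_gene(toy_imodulon)

# Same selection rule as the loop above: largest signed weight
top_by_signed = toy_imodulon.sort_values(
    by="gene_weight", ascending=False).iloc[0]["gene_name"]
```

Here the two rules disagree (`geneB` versus `geneA`), which is the case to watch for when a single-gene iModulon ends up with an unexpected name.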

In [14]:
ica_data.imodulon_names[:5]

[0, 'single_gene_yzfA', 'single_gene_ytiD', 3, 4]

# Save iModulon object

This will save your iModulon table, your thresholds, and any other information stored in the `ica_data` object.

In [15]:
save_to_json(ica_data, path.join(data_dir, 'precise2.json'))

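If you prefer to view and edit your iModulon table in Excel, save it as a CSV. You can then reload it together with the rest of your data when you create the `IcaData` object (this is the optional iModulon table listed under Setup).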

In [16]:
ica_data.imodulon_table.to_csv(path.join(data_dir, 'imodulon_table.csv'))
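
The CSV round trip can be sketched with a toy table. This example uses only pandas and a temporary directory; the iModulon names are hypothetical and the values are copied from the enrichment output earlier in this notebook.

```python
import tempfile
from os import path

import pandas as pd

# Toy stand-in for ica_data.imodulon_table (iModulon names are hypothetical)
toy_table = pd.DataFrame(
    {"regulator": ["cysB", "glpR"], "f1score": [0.507937, 0.666667]},
    index=["imodulon_a", "imodulon_b"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = path.join(tmp_dir, "imodulon_table.csv")
    toy_table.to_csv(csv_path)
    # index_col=0 restores the iModulon names as the row index
    reloaded = pd.read_csv(csv_path, index_col=0)
```

When editing the file in a spreadsheet, leave the first column (the iModulon names) intact so that the rows still match your iModulons on reload.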