<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Quality-Control" data-toc-modified-id="Quality-Control-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Quality Control</a></span></li><li><span><a href="#Get-regulator-enrichments-for-each-component" data-toc-modified-id="Get-regulator-enrichments-for-each-component-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get regulator enrichments for each component</a></span></li><li><span><a href="#GO-Enrichments" data-toc-modified-id="GO-Enrichments-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>GO Enrichments</a></span></li><li><span><a href="#Sensitivity-Analysis" data-toc-modified-id="Sensitivity-Analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sensitivity Analysis</a></span></li></ul></div>

**This notebook is only relevant if you are creating a new compendium for an organism. If you are just appending to PRECISE, use 3b_analyze_new_data.**

In [4]:
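import sys
sys.path.append('../')
# np, pd and stats are used below; import them explicitly rather than relying on the star import
import numpy as np
import pandas as pd
from scipy import stats
from icaviz.plotting import *  # load_data, compute_enrichments, contingency, FDR
from tqdm.notebook import tqdm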

In [5]:
DATA_DIR = '../data/example_data/'
GENE_DIR = '../data/annotation/'

In [6]:
ica_data = load_data(X=DATA_DIR+'log_tpm.csv',
                     S=DATA_DIR+'S.csv',
                     A=DATA_DIR+'A.csv',
                     metadata=DATA_DIR+'metadata.csv',
                     annotation=GENE_DIR+'gene_info.csv',
                     trn=GENE_DIR+'TRN.csv',
                     cutoff = 750) # Get correct cut-off score from 3_estimate_thresholds

# Quality Control
We can use the ICA activity matrix (`A`) to identify components that capture experimental noise rather than biological signal.
If the activities of two known replicates differ by more than 3 times the standard deviation of that component's activity vector, the component is flagged.

In [7]:
from itertools import combinations

In [8]:
bad_comps = set()  # a set, so each flagged component is recorded only once

# Flag components whose activity differs between two replicates by more than
# 3 standard deviations of that component's activity across all samples
for name,group in tqdm(ica_data.metadata.groupby(['project_id','condition_id'])):
    exps = group.index.tolist()
    for e1,e2 in combinations(exps,2):
        for k in ica_data.A.index:
            if abs(ica_data.A.loc[k,e1]-ica_data.A.loc[k,e2]) > 3*np.std(ica_data.A.loc[k]):
                bad_comps.add(k)
print(sorted(bad_comps))



[]
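To see what the replicate check flags, it can help to run the same rule on toy data. The sketch below uses only the standard library; the function `flag_noisy_components`, the nested-dict layout, and the sample names are hypothetical and are not part of icaviz. `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.

```python
from itertools import combinations
from statistics import pstdev

def flag_noisy_components(activities, replicate_groups, n_std=3):
    """Return the sorted names of components whose replicates disagree.

    activities: {component: {sample: activity}} (hypothetical toy layout)
    replicate_groups: list of lists of sample names that are replicates
    """
    flagged = set()
    for comp, acts in activities.items():
        # Threshold is n_std times the spread of this component over all samples
        threshold = n_std * pstdev(acts.values())
        for group in replicate_groups:
            for s1, s2 in combinations(group, 2):
                if abs(acts[s1] - acts[s2]) > threshold:
                    flagged.add(comp)
    return sorted(flagged)

# Toy data: eight samples forming four replicate pairs
samples = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
groups = [["s1", "s2"], ["s3", "s4"], ["s5", "s6"], ["s7", "s8"]]
activities = {
    # Replicates agree closely in every pair
    "stable": dict(zip(samples, [1.0, 1.1, -1.0, -0.9, 2.0, 2.1, -2.0, -2.2])),
    # One replicate pair disagrees strongly while all other samples are flat
    "noisy": dict(zip(samples, [4.0, -4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
}

print(flag_noisy_components(activities, groups))
```

In the toy `noisy` component the standard deviation is 2, so the threshold is 6 and the replicate difference of 8 exceeds it. Raising `n_std` to 5 lifts the threshold to 10 and the component is no longer flagged, which shows how sensitive the result is to the multiplier.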


# Get regulator enrichments for each component

In [9]:
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)

In [15]:
# Initialize objects
list2struct = []

all_genes = set(ica_data.S.index)

for k in tqdm(ica_data.S.columns):
    # Get significant genes
    genes = set(ica_data.show_enriched(k).index)

    # You can change max_tfs to account for and/or interactions between regulons
    enrichments = compute_enrichments(genes, all_genes, 
                                      ica_data.trn, max_tfs=1, 
                                      fdr_rate=1e-5)
    
    enrichments['TF'] = enrichments.index
    enrichments['component'] = k
    enrichments['n_genes'] = len(genes)

    list2struct.append(enrichments.reset_index(drop=True))
    
DF_enriched = pd.concat(list2struct, sort=False).reset_index(drop=True)
DF_enriched = DF_enriched[['component', 'TF', 'log_odds', 
                           'pvalue', 'qvalue','precision',
                           'recall','f1score','TP', 'n_genes','n_tf']]
    
# Sort by component, then by q-value
DF_enriched = DF_enriched.sort_values(['component','qvalue'])
DF_enriched.head()





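|   | component | TF | log_odds | pvalue | qvalue | precision | recall | f1score | TP | n_genes | n_tf |
|---|-----------|----|----------|--------|--------|-----------|--------|---------|----|---------|------|
| 0 | 0 | gadW | inf | 2.192619e-18 | 5.766588e-16 | 1.0 | 0.466667 | 0.636364 | 7.0 | 7 | 1.0 |
| 2 | 0 | gadX | inf | 1.546235e-14 | 2.033299e-12 | 1.0 | 0.155556 | 0.269231 | 7.0 | 7 | 1.0 |
| 1 | 0 | ydeO | 5.923492 | 1.053817e-08 | 9.23846e-07 | 0.571429 | 0.222222 | 0.32 | 4.0 | 7 | 1.0 |
| 4 | 1 | Sigma32 | inf | 3.565445e-17 | 9.37712e-15 | 1.0 | 0.083969 | 0.15493 | 11.0 | 11 | 1.0 |
| 3 | 1 | gcvB | 4.502378 | 4.163757e-08 | 5.47534e-06 | 0.454545 | 0.121951 | 0.192308 | 5.0 | 11 | 1.0 |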
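The columns in this output can be illustrated with a small standard-library sketch of a single component-versus-regulon comparison. It is not the icaviz implementation: the helper names `contingency_counts`, `precision_recall_f1` and `fisher_two_sided` are hypothetical, and the two-sided Fisher's exact test is an assumption about how the p-value is obtained. The toy sets mimic the gadW row (7 component genes, all inside a regulon of 15 genes, a size inferred from recall 0.466667 = 7/15). The universe of 100 genes is made up, so the p-value will not match the table.

```python
from math import comb

def contingency_counts(component_genes, regulon_genes, all_genes):
    """Count overlaps between a component's gene set and a regulon."""
    universe = set(all_genes)
    comp = set(component_genes) & universe
    reg = set(regulon_genes) & universe
    tp = len(comp & reg)   # in the component and in the regulon
    fp = len(comp - reg)   # in the component only
    fn = len(reg - comp)   # in the regulon only
    tn = len(universe) - tp - fp - fn
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision = TP / component size, recall = TP / regulon size."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def fisher_two_sided(tp, fp, fn, tn):
    """Two-sided Fisher's exact test: sum the hypergeometric probabilities
    of all tables that are no more probable than the observed one."""
    n = tp + fp + fn + tn
    comp_size = tp + fp
    reg_size = tp + fn

    def prob(x):
        return comb(reg_size, x) * comb(n - reg_size, comp_size - x) / comb(n, comp_size)

    p_obs = prob(tp)
    lo = max(0, comp_size + reg_size - n)
    hi = min(comp_size, reg_size)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Toy data: 100 genes, a 15-gene regulon, a 7-gene component inside it
all_genes = [f"g{i}" for i in range(100)]
regulon = all_genes[:15]
component = all_genes[:7]

tp, fp, fn, tn = contingency_counts(component, regulon, all_genes)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
pvalue = fisher_two_sided(tp, fp, fn, tn)
print(tp, round(precision, 6), round(recall, 6), round(f1, 6), pvalue)
```

With these definitions, the toy example gives precision 1.0, recall 7/15 and F1 14/22, which agree with the `precision`, `recall` and `f1score` values in the gadW row.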


In [16]:
print('Total components:',len(ica_data.S.columns))
print('Components with regulator enrichments:',len(DF_enriched.drop_duplicates('component')))
print('Unique regulators:',len(DF_enriched.TF.unique()))

Total components: 74
Components with regulator enrichments: 55
Unique regulators: 84


In [17]:
DF_enriched.to_csv(DATA_DIR+'trn_enrichments.csv')

# GO Enrichments

In [18]:
DF_GO = pd.read_csv(GENE_DIR+'DF_GO.csv',index_col=0)

In [19]:
go_dict = {}
for name,group in DF_GO.groupby('go_name'):
    genes = set(group.index)
    go_dict[name] = genes

In [24]:
enrich_list = []
all_genes = set(ica_data.S.index)
for k in tqdm(ica_data.names):
    
    # Get i-modulon genes
    ic_genes = set(ica_data.show_enriched(k).index)
    
    vals = []
    
    for go,go_genes in go_dict.items():
        ((tp,fp),(fn,tn)) = contingency(go_genes,ic_genes,all_genes)
        odds,pval = stats.fisher_exact(((tp,fp),(fn,tn)))
        vals.append([go,k,tp,len(ic_genes),pval])
    df_pvals = pd.DataFrame(vals,columns = ['go_term','component','n_matches','n_genes','pvalue'])
    enrich_list.append(FDR(df_pvals,fdr_rate=0.01))
go_enrichments = pd.concat(enrich_list).reset_index(drop=True)
go_enrichments.component = go_enrichments.component.astype(int)





In [27]:
go_enrichments.head()

|   | go_term | component | n_matches | n_genes | pvalue | qvalue |
|---|---------|-----------|-----------|---------|--------|--------|
| 0 | response to heat | 1 | 9 | 11 | 3.779609e-16 | 1.23442e-12 |
| 1 | unfolded protein binding | 1 | 4 | 11 | 2.311759e-08 | 3.775102e-05 |
| 2 | chaperone cofactor-dependent protein refolding | 1 | 3 | 11 | 6.445001e-08 | 7.016457e-05 |
| 3 | protein folding | 1 | 4 | 11 | 9.774055e-08 | 7.980516e-05 |
| 4 | identical protein binding | 1 | 7 | 11 | 3.867399e-06 | 0.002526185 |


In [28]:
go_enrichments.to_csv(DATA_DIR+'go_enrichments.csv')