LAST RUN: CAM 07/02/2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Quality-Control" data-toc-modified-id="Quality-Control-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Quality Control</a></span></li><li><span><a href="#Get-regulator-enrichments-for-each-component" data-toc-modified-id="Get-regulator-enrichments-for-each-component-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get regulator enrichments for each component</a></span></li><li><span><a href="#GO-Enrichments" data-toc-modified-id="GO-Enrichments-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>GO Enrichments</a></span></li><li><span><a href="#Sensitivity-Analysis" data-toc-modified-id="Sensitivity-Analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sensitivity Analysis</a></span></li></ul></div>

**This notebook is only relevant if you are creating a new compendium for an organism. If you are just appending to PRECISE, use 3b_analyze_new_data.**

In [1]:
import sys
sys.path.append('../..')
from icaviz.plotting import *
from tqdm.notebook import tqdm  # tqdm.tqdm_notebook is deprecated

In [2]:
DATA_DIR = '../data/precise2_data/'
GENE_DIR = '../data/annotation/'

In [3]:
ica_data = load_data(X=DATA_DIR+'log_tpm_qc.csv',
                     S=DATA_DIR+'S_95.csv',
                     A=DATA_DIR+'A_95.csv',
                     metadata=DATA_DIR+'metadata_qc.csv',
                     annotation=GENE_DIR+'gene_info.csv',
                     trn=GENE_DIR+'TRN.csv',
                     cutoff = 525) # Get correct cut-off score from 3_estimate_thresholds



# Quality Control
We can use the ICA activity matrix (`A`) to identify components that capture experimental noise rather than biological signal.
A component is flagged whenever two known replicates differ in activity by more than 3 times the standard deviation of that component's activity vector.
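As a toy illustration of this rule, the sketch below applies it to a small hand-made activity table. It is a minimal pure-Python stand-in, not the notebook's implementation: the function name `flag_replicate_outliers`, the nested-dict layout, and the sample names are all hypothetical, and the multiplier is exposed as the parameter `n_std`.

```python
from itertools import combinations
from statistics import pstdev


def flag_replicate_outliers(activities, replicate_groups, n_std=3.0):
    """Count, per component, the replicate pairs whose activity gap exceeds n_std * std.

    activities: {component: {sample: activity}}, a hypothetical stand-in for the A matrix
    replicate_groups: list of lists of sample names that are replicates of each other
    """
    flagged = {}
    for comp, row in activities.items():
        # Population standard deviation over all samples (same convention as np.std)
        threshold = n_std * pstdev(row.values())
        for samples in replicate_groups:
            for s1, s2 in combinations(samples, 2):
                if abs(row[s1] - row[s2]) > threshold:
                    flagged[comp] = flagged.get(comp, 0) + 1
    return flagged


# Five conditions with two replicates each (made-up data)
samples = [f"cond{c}_rep{r}" for c in range(5) for r in (1, 2)]
groups = [samples[i:i + 2] for i in range(0, len(samples), 2)]
activities = {
    # Component 0: replicates agree closely, so nothing is flagged
    0: dict(zip(samples, [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4.0, 4.1])),
    # Component 1: one replicate pair disagrees by 6.0, while 3 * std is 5.4
    1: dict(zip(samples, [0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
}

flagged = flag_replicate_outliers(activities, groups, n_std=3.0)
print(flagged)  # {1: 1}
```

Raising `n_std` makes the check stricter about what counts as a disagreement: with `n_std=5.0` the threshold for component 1 becomes 9.0 and nothing is flagged.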

In [13]:
from itertools import combinations

In [14]:
bad_comps = []

# Standard deviation of each component's activity across all samples
# (ddof=0 matches the np.std default)
activity_std = ica_data.A.std(axis=1, ddof=0)

# Flag a component each time a replicate pair differs by more than 3 standard deviations
for name,group in tqdm(ica_data.metadata.groupby(['project_id','condition_id'])):
    exps = group.index.tolist()
    for e1,e2 in combinations(exps,2):
        for k in ica_data.A.index:
            if abs(ica_data.A.loc[k,e1]-ica_data.A.loc[k,e2]) > 3*activity_std.loc[k]:
                bad_comps.append(k)
print(bad_comps)



[28, 28, 28, 28, 51, 28, 28, 28, 28, 28, 28, 28, 92, 94, 107, 94, 114, 72, 101, 8, 22, 8, 22, 119, 18, 22, 119, 16, 16, 18, 18, 18, 18]
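Because a component is appended once for every replicate pair that exceeds the threshold, the printed list contains repeats. A quick tally, sketched here with the standard library on the list exactly as printed above, shows which components are flagged most often.

```python
from collections import Counter

# The list printed by the quality-control cell above
bad_comps = [28, 28, 28, 28, 51, 28, 28, 28, 28, 28, 28, 28, 92, 94, 107, 94,
             114, 72, 101, 8, 22, 8, 22, 119, 18, 22, 119, 16, 16, 18, 18, 18, 18]

# Number of offending replicate pairs per component, most frequent first
counts = Counter(bad_comps)
for comp, n_pairs in counts.most_common():
    print(comp, n_pairs)

print('Distinct flagged components:', len(counts))  # 13
```

Component 28 accounts for 11 of the 33 entries and component 18 for 5; most of the others appear only once or twice.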


# Get regulator enrichments for each component

In [4]:
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)

In [5]:
# Initialize objects
list2struct = []

all_genes = set(ica_data.S.index)

for k in tqdm(ica_data.S.columns):
    # Get significant genes
    genes = set(ica_data.show_enriched(k).index)

    # You can change max_tfs to account for and/or interactions between regulons
    enrichments = compute_enrichments(genes, all_genes, 
                                      ica_data.trn, max_tfs=1, 
                                      fdr_rate=1e-5)
    
    enrichments['TF'] = enrichments.index
    enrichments['component'] = k
    enrichments['n_genes'] = len(genes)

    list2struct.append(enrichments.reset_index(drop=True))
    
DF_enriched = pd.concat(list2struct, sort=False).reset_index(drop=True)
DF_enriched = DF_enriched[['component', 'TF', 'log_odds', 
                           'pvalue', 'qvalue','precision',
                           'recall','f1score','TP', 'n_genes','n_tf']]
    
# Sort by component, then by q-value
DF_enriched = DF_enriched.sort_values(['component','qvalue'])
DF_enriched.head()





| | component | TF | log_odds | pvalue | qvalue | precision | recall | f1score | TP | n_genes | n_tf |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1 | fnr | 3.354816 | 2.012775e-21 | 5.494876e-19 | 0.763158 | 0.061966 | 0.114625 | 29.0 | 38 | 1.0 |
| 1 | 1 | narL | 3.241091 | 2.027106e-14 | 2.767e-12 | 0.394737 | 0.121951 | 0.186335 | 15.0 | 38 | 1.0 |
| 0 | 1 | dcuR | 5.101574 | 4.606447e-09 | 4.191867e-07 | 0.131579 | 0.555556 | 0.212766 | 5.0 | 38 | 1.0 |
| 2 | 1 | molybdopterin | 6.235563 | 2.397247e-08 | 1.636121e-06 | 0.105263 | 0.8 | 0.186047 | 4.0 | 38 | 1.0 |
| 3 | 1 | cueR | 5.542186 | 7.147032e-08 | 3.90228e-06 | 0.105263 | 0.666667 | 0.181818 | 4.0 | 38 | 1.0 |
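The `precision`, `recall`, and `f1score` columns can be read as overlap statistics between a component's gene set and a regulon. The sketch below assumes the usual definitions (precision = TP / component size, recall = TP / regulon size, F1 = their harmonic mean); `overlap_metrics` is a hypothetical helper, not part of `icaviz`. Under that assumption it reproduces the dcuR row of the table. The regulon size of 9 is inferred from the reported recall (5 / 0.555556), not taken from the source.

```python
def overlap_metrics(tp, n_genes, n_regulon):
    """Precision, recall, and F1 for the overlap between a component and a regulon.

    tp:        genes shared by the component and the regulon
    n_genes:   number of significant genes in the component
    n_regulon: number of genes in the regulon (assumed, see note above)
    """
    precision = tp / n_genes
    recall = tp / n_regulon
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# dcuR in component 1: TP = 5, n_genes = 38, inferred regulon size = 9
precision, recall, f1 = overlap_metrics(tp=5, n_genes=38, n_regulon=9)
print(round(precision, 6), round(recall, 6), round(f1, 6))
# 0.131579 0.555556 0.212766
```

A row such as fnr (high precision, low recall) indicates a component that sits almost entirely inside a large regulon, while dcuR (low precision, high recall) indicates a small regulon largely contained in the component.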


In [6]:
print('Total components:',len(ica_data.S.columns))
print('Components with regulator enrichments:',len(DF_enriched.drop_duplicates('component')))
print('Unique regulators:',len(DF_enriched.TF.unique()))

Total components: 127
Components with regulator enrichments: 78
Unique regulators: 105


In [7]:
DF_enriched.to_csv(DATA_DIR+'trn_enrichments.csv')

# GO Enrichments

In [11]:
DF_GO = pd.read_csv(GENE_DIR+'DF_GO.csv',index_col=0)

In [12]:
go_dict = {}
for name,group in DF_GO.groupby('go_name'):
    genes = set(group.index)
    go_dict[name] = genes

In [13]:
enrich_list = []
all_genes = set(ica_data.S.index)
for k in tqdm(ica_data.names):
    
    # Get i-modulon genes
    ic_genes = set(ica_data.show_enriched(k).index)
    
    vals = []
    
    for go,go_genes in go_dict.items():
        ((tp,fp),(fn,tn)) = contingency(go_genes,ic_genes,all_genes)
        odds,pval = stats.fisher_exact(((tp,fp),(fn,tn)))
        vals.append([go,k,tp,len(ic_genes),pval])
    df_pvals = pd.DataFrame(vals,columns = ['go_term','component','n_matches','n_genes','pvalue'])
    enrich_list.append(FDR(df_pvals,fdr_rate=0.01))
go_enrichments = pd.concat(enrich_list).reset_index(drop=True)
go_enrichments.component = go_enrichments.component.astype(int)
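To make the test inside the loop concrete, here is a minimal pure-Python sketch of a 2x2 contingency table and an enrichment p-value on made-up gene sets. The helpers `contingency_counts` and `enrichment_pvalue` are hypothetical, and the cell layout `((tp, fp), (fn, tn))` is an assumption based on how the notebook unpacks the result. Note also that the sketch computes a one-sided (over-representation) p-value from the hypergeometric tail, whereas `stats.fisher_exact` is two-sided by default, so the numbers are not interchangeable.

```python
from math import comb


def contingency_counts(term_genes, module_genes, all_genes):
    """2x2 overlap table between an annotation term and a gene module (assumed layout)."""
    tp = len(term_genes & module_genes)      # in both
    fp = len(module_genes - term_genes)      # in the module only
    fn = len(term_genes - module_genes)      # in the term only
    tn = len(all_genes) - tp - fp - fn       # in neither
    return (tp, fp), (fn, tn)


def enrichment_pvalue(tp, fp, fn, tn):
    """One-sided Fisher p-value: probability of an overlap of at least tp by chance."""
    n_total = tp + fp + fn + tn
    n_term = tp + fn
    n_module = tp + fp
    tail = sum(comb(n_term, x) * comb(n_total - n_term, n_module - x)
               for x in range(tp, min(n_term, n_module) + 1))
    return tail / comb(n_total, n_module)


# Toy universe of 20 genes; the term and the module share 4 of their 5 genes
all_genes = {f"g{i}" for i in range(20)}
term_genes = {"g0", "g1", "g2", "g3", "g4"}
module_genes = {"g0", "g1", "g2", "g3", "g10"}

(tp, fp), (fn, tn) = contingency_counts(term_genes, module_genes, all_genes)
pval = enrichment_pvalue(tp, fp, fn, tn)
print((tp, fp), (fn, tn))  # (4, 1) (1, 14)
print(pval)                # 76 / 15504 = 1/204, about 0.0049
```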





In [14]:
go_enrichments.head()

| | go_term | component | n_matches | n_genes | pvalue | qvalue |
|---|---|---|---|---|---|---|
| 0 | anaerobic respiration | 1 | 9 | 38 | 1.972797e-10 | 6.443155e-07 |
| 1 | plasma membrane fumarate reductase complex | 1 | 4 | 38 | 4.824489e-09 | 7.87839e-06 |
| 2 | succinate dehydrogenase activity | 1 | 4 | 38 | 2.397247e-08 | 2.609803e-05 |
| 3 | fermentation | 1 | 4 | 38 | 3.293946e-07 | 0.0002689507 |
| 4 | dimethyl sulfoxide reductase activity | 1 | 3 | 38 | 6.031989e-07 | 0.0003283413 |
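The `FDR` helper itself is not shown in this notebook. The q-values above look consistent with a Benjamini-Hochberg adjustment: the first four equal p × m / rank for m = 3266 tests, a value inferred from the table rather than stated in the source. The sketch below is therefore an assumption about what such a step does, written in plain Python with a hypothetical function name and made-up p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest p to largest
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is capped by the ones above it
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


pvals = [0.001, 0.008, 0.039, 0.041, 0.6]   # made-up values for illustration
qvals = benjamini_hochberg(pvals)
significant = [p for p, q in zip(pvals, qvals) if q < 0.05]

print([round(q, 5) for q in qvals])  # [0.005, 0.02, 0.05125, 0.05125, 0.6]
print(significant)                   # [0.001, 0.008]
```

The third p-value inherits the q-value of the fourth (0.05125 instead of 0.065) because of the running minimum. The same capping would explain why the fifth row in the table above has a smaller q-value than p × m / 5 would give.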


In [15]:
go_enrichments.to_csv(DATA_DIR+'go_enrichments.csv')