# Pathway enrichment analysis

<!-- Luke and Ray have shared the CRISPRi screening results, which were analyzed with the [ScreenProcessing](https://github.com/mhorlbeck/ScreenProcessing) pipeline.  -->

<!-- - $\gamma$ - gamma score -->
<!-- - $\rho$ - rho score -->

<!-- - Pathway enrichment analysis over rho scores -->
<!-- - Load screening result tables into python 
- Make sure gene names are correctly assigned 
 -->

<!-- Alex Ge: 
> If we do Enrichr analysis on the resistance/sensitivity hits defined by Max’s cutoffs, (n = 418 genes), we do see mRNA methylation (adj p = 0.018) and RNA destabilization (adj p = 0.005) come out as significant GO biological processes. METTL3 is included in these GO terms.

> If we do Enrichr analysis on just the resistance hits (n = 197), mRNA methylation is even more significant (adj p = 0.002), which makes sense since we see more METTL3 biology on the resistance side. It is one of the top five GO terms by adjusted P-value.

> This analysis was done today with the 2021 GO terms, which have updated annotations for the newer m6A genes. When I did the same Enrichr analysis in 2018, RNA destabilization and mRNA methylation were not as significant since the GO annotations were not updated.

> I think Enrichr analysis might make more sense here – in Abe’s analysis, I can see that a lot of rho scores that are < 0.2 are being included in the analysis; these are likely to be statistically insignificant. It also looks graphically like the highest bin is including rho values that are < 0?
 -->

In [1]:
from matplotlib_venn import venn2

In [2]:
from IPython.display import IFrame

In [10]:
from glob import glob

import sys
import pandas as pd
import numpy as np 
from itertools import chain, product

sys.path.append("../../")
pager_dir = "/data_gilbert/home/aarab/Projects/pager/"
sys.path.append(pager_dir)

from scripts.util import *
import ipage_down as ipd

In [4]:
# wd = '/rumi/shams/abe/Projects/Decitabine-treatment/'
wd = '/data_gilbert/home/aarab/Projects/Decitabine-treatment/DAC'

In [5]:
data = load_data(screens=True,wd=wd)

In [6]:
data.keys()

dict_keys(['hl60_exp1_DAC_rho', 'hl60_exp1_DAC_gamma', 'hl60_exp2_DAC_rho', 'hl60_exp2_DAC_gamma', 'hl60_exp2_GSK_rho', 'hl60_exp2_GSK_gamma', 'molm13_exp_DAC_rho', 'molm13_exp_DAC_gamma', 'molm13_exp_GSK_rho', 'molm13_exp_GSK_gamma'])

## Run `iPAGE`

Reference on techniques for transforming data distributions: https://medium.com/analytics-vidhya/techniques-to-transform-data-distribution-565a4d0f2da


In [7]:
from matplotlib import pyplot
from scipy.stats import yeojohnson

rho = pd.concat(find_top(data['hl60_exp1_DAC_rho'].astype(float),'rho score',0,'Mann-Whitney p-value',1)).reset_index()

up:  8896
down: 9864
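
`yeojohnson` is imported in the cell above but is not applied anywhere in the cells shown. As a minimal sketch of how such a transform could be applied to a column of scores (an assumption, not a step this notebook actually runs), the example below calls `scipy.stats.yeojohnson` on synthetic right-skewed values. `scores` is a hypothetical stand-in for a column such as `rho['rho score']`.

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Synthetic, right-skewed values with some negatives (hypothetical stand-in data).
rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=1000) - 0.2

# With no lambda given, scipy fits lambda by maximum likelihood
# and returns the transformed values together with the fitted lambda.
transformed, lmbda = yeojohnson(scores)

print(f"fitted lambda: {lmbda:.3f}")
print(f"skewness before: {skew(scores):.3f}, after: {skew(transformed):.3f}")
```

Unlike Box-Cox, Yeo-Johnson accepts zero and negative inputs, which matters for signed scores.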


In [8]:
rho[['gene_name','rho score']].to_csv(
    'hl60_exp1_DAC_rho_delta_phenotype.txt',sep='\t',index=None, header=None
)

In [9]:
!head hl60_exp1_DAC_rho_delta_phenotype.txt

A1CF	0.00493164595946
A2ML1	0.0625139181208
A4GALT	0.0952827862429
A4GNT	0.00884423785995
AADACL4	0.0252223337879
AAED1	0.0294955169881
AAK1	0.0549906505538
AAR2	0.118809960392
AARS2	0.179262551668
AASS	0.0623173699668
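
The file written above is a headerless, tab-delimited table with two columns (gene name, score). A minimal sketch of reading such a file back and running basic sanity checks is shown below. The checks (unique gene names, no missing scores) are assumptions about what is worth verifying, not documented iPAGE requirements, and the three rows are copied from the `head` output above.

```python
import io

import pandas as pd

# Three rows copied from the `head` output; a real run would pass the file path instead.
sample = (
    "A1CF\t0.00493164595946\n"
    "A2ML1\t0.0625139181208\n"
    "A4GALT\t0.0952827862429\n"
)

table = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=["gene_name", "rho score"]
)

# Basic sanity checks (assumed to be useful; adjust to the tool's actual requirements).
n_duplicated = int(table["gene_name"].duplicated().sum())
n_missing = int(table["rho score"].isna().sum())

print(table.shape, n_duplicated, n_missing)
```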


In [2]:
%%bash 
export PAGEDIR=/data_gilbert/home/aarab/iPAGE

# `nohup` must wrap `parallel` (the long-running command), not `ls`
ls *delta_phenotype.txt | nohup parallel -j18 -k bash ~/Projects/pager/ipage_loop.sh  {} &> ipage.out

Process is interrupted.


## Interpret results – `pager`
https://github.com/abearab/pager

In [11]:
exp = 'hl60_exp1_DAC_rho_delta_phenotype'

def get_pvmatrix_list(parent_path,pattern):
    """
    pattern: msigdb gene set cluster name 
    """
    return glob(f'{parent_path}/*{pattern}*/pvmatrix.txt')
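
To illustrate how the glob pattern in `get_pvmatrix_list` selects result folders, the sketch below rebuilds a throwaway directory tree using the folder names that appear elsewhere in this notebook (`msigdb_v7.4_c5.go`, `msigdb_v7.4_c3.all`, `msigdb_v7.4_c3.mir.mirdb`). The `sorted` call is an addition for deterministic output and is not part of the original function.

```python
import tempfile
from glob import glob
from pathlib import Path


def get_pvmatrix_list(parent_path, pattern):
    """Return pvmatrix.txt paths under folders whose name contains `pattern`."""
    return sorted(glob(f"{parent_path}/*{pattern}*/pvmatrix.txt"))


with tempfile.TemporaryDirectory() as root:
    # Recreate the folder layout with placeholder files.
    for name in ["msigdb_v7.4_c5.go", "msigdb_v7.4_c3.all", "msigdb_v7.4_c3.mir.mirdb"]:
        folder = Path(root) / name
        folder.mkdir()
        (folder / "pvmatrix.txt").write_text("placeholder\n")

    go_hits = [Path(p).parent.name for p in get_pvmatrix_list(root, "c5.go")]
    c3_hits = [Path(p).parent.name for p in get_pvmatrix_list(root, "c3")]

print(go_hits)
print(c3_hits)
```

A short pattern such as `c3` matches every folder whose name contains it, so more specific patterns (for example `c2.cp.kegg`) narrow the selection.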

### Draw iPAGE heatmap

### C5 GO

In [None]:
pdf = 'CRISPRi-rho-pager-GO_all.pdf'

ipd.merge_multiple_pvmat(
    get_pvmatrix_list(exp,'c5.go')
).to_csv('temp-pvmatrix.txt',sep='\t')

!bash {pager_dir}/ipage_draw_matrix.sh \
    {exp}'.txt' "temp-pvmatrix.txt" \
    {pdf} &> /dev/null

!mv -v {pdf} plots/
!rm -v 'temp-pvmatrix.txt'
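
The write, draw, move, and remove sequence in this cell recurs for every heatmap in this notebook. One possible way to reduce the repetition is a small helper that assembles the drawing command. The sketch below only builds and prints the argument list; `build_draw_command` is a hypothetical name, and the argument order simply mirrors the shell call in the cell above.

```python
import shlex


def build_draw_command(pager_dir, exp, pvmatrix_path, pdf):
    """Assemble the ipage_draw_matrix.sh call as an argument list (hypothetical helper)."""
    script = f"{pager_dir.rstrip('/')}/ipage_draw_matrix.sh"
    return ["bash", script, f"{exp}.txt", pvmatrix_path, pdf]


cmd = build_draw_command(
    "/data_gilbert/home/aarab/Projects/pager/",
    "hl60_exp1_DAC_rho_delta_phenotype",
    "temp-pvmatrix.txt",
    "CRISPRi-rho-pager-GO_all.pdf",
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` so that a failing script raises an error instead of being silenced by `&> /dev/null`.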

In [36]:
pv_signal_go1 = pd.concat([ 
    ipd.pvmat2bio_signal( 
        ipd.merge_multiple_pvmat(get_pvmatrix_list(exp,'c5.go')), s, n_clust=n 
    ) for s in ['up','both'] for n in [1,2] 
]) 

pv_signal_go2 = pd.concat([ 
    ipd.pvmat2bio_signal( 
        ipd.merge_multiple_pvmat(get_pvmatrix_list(exp,'c5.go')), 'down', n_clust=n 
    ) for n in [1,2] 
]) 

In [38]:
pdf = 'CRISPRi-rho-pager-GO-up.pdf'

pv_signal_go1.to_csv('temp-pvmatrix.txt',sep='\t')

!bash {pager_dir}/ipage_draw_matrix.sh \
    {exp}'.txt' "temp-pvmatrix.txt" \
    {pdf} &> /dev/null
!mv -v {pdf} plots/
!rm -v 'temp-pvmatrix.txt'

‘CRISPRi-rho-pager-GO-up.pdf’ -> ‘plots/CRISPRi-rho-pager-GO-up.pdf’
removed ‘temp-pvmatrix.txt’


In [39]:
pdf = 'CRISPRi-rho-pager-GO-down.pdf'

pv_signal_go2.to_csv('temp-pvmatrix.txt',sep='\t')

!bash {pager_dir}/ipage_draw_matrix.sh \
    {exp}'.txt' "temp-pvmatrix.txt" \
    {pdf} &> /dev/null
!mv -v {pdf} plots/
!rm -v 'temp-pvmatrix.txt'

‘CRISPRi-rho-pager-GO-down.pdf’ -> ‘plots/CRISPRi-rho-pager-GO-down.pdf’
removed ‘temp-pvmatrix.txt’


In [41]:
IFrame("plots/CRISPRi-rho-pager-GO-up.pdf", width=600, height=300)

### C2

In [14]:
pdf = 'CRISPRi-rho-pager-KEGG_all.pdf'

ipd.merge_multiple_pvmat(get_pvmatrix_list(exp,'c2.cp')).to_csv('temp-pvmatrix.txt',sep='\t')

!bash {pager_dir}/ipage_draw_matrix.sh \
    {exp}'.txt' "temp-pvmatrix.txt" \
    {pdf} &> /dev/null
!mv -v {pdf} plots/
!rm -v 'temp-pvmatrix.txt'

‘CRISPRi-rho-pager-KEGG_all.pdf’ -> ‘plots/CRISPRi-rho-pager-KEGG_all.pdf’
removed ‘temp-pvmatrix.txt’


In [50]:
pv_signal_c2_up = pd.concat([
    ipd.pvmat2bio_signal(
        ipd.merge_multiple_pvmat(get_pvmatrix_list(exp,pat)),s,
        n_clust=n
    )
    for s in ['up','both'] for n in [1,2,3]
    for pat in ['c2.cp.kegg','c2.cp.reactome']
])

In [51]:
pv_signal_c2_down = pd.concat([
    ipd.pvmat2bio_signal(
        ipd.merge_multiple_pvmat(get_pvmatrix_list(exp,pat)),'down',
        n_clust=n
    )
    for n in [1,2,3] for pat in ['c2.cp.kegg','c2.cp.reactome']
])

In [52]:
pdf = 'CRISPRi-rho-pager-C2-up.pdf'

pv_signal_c2_up.to_csv('temp-pvmatrix.txt',sep='\t')

!bash {pager_dir}/ipage_draw_matrix.sh \
    {exp}'.txt' "temp-pvmatrix.txt" \
    {pdf} &> /dev/null
!mv -v {pdf} plots/
!rm -v 'temp-pvmatrix.txt'

‘CRISPRi-rho-pager-C2-up.pdf’ -> ‘plots/CRISPRi-rho-pager-C2-up.pdf’
removed ‘temp-pvmatrix.txt’


In [53]:
pdf = 'CRISPRi-rho-pager-C2-down.pdf'

pv_signal_c2_down.to_csv('temp-pvmatrix.txt',sep='\t')

!bash {pager_dir}/ipage_draw_matrix.sh \
    {exp}'.txt' "temp-pvmatrix.txt" \
    {pdf} &> /dev/null
!mv -v {pdf} plots/
!rm -v 'temp-pvmatrix.txt'

‘CRISPRi-rho-pager-C2-down.pdf’ -> ‘plots/CRISPRi-rho-pager-C2-down.pdf’
removed ‘temp-pvmatrix.txt’


In [55]:
IFrame("plots/CRISPRi-rho-pager-C2-down.pdf", width=600, height=300)

### C3

In [28]:
pvmat = ipd.merge_multiple_pvmat(
    pvmat_list = glob(f'{exp}/*c3*/pvmatrix.txt')
)

bio_signal = pd.concat([
    ipd.pvmat2bio_signal(pvmat,side='down',n_clust=1),
    ipd.pvmat2bio_signal(pvmat,side='up',n_clust=1),
    ipd.pvmat2bio_signal(pvmat,side='both'),
],axis=0)

bio_signal

Unnamed: 0,[-0.12 -0.10],[-0.10 -0.06],[-0.06 -0.04],[-0.04 -0.03],[-0.03 -0.01],[-0.01 0.00],[0.00 0.02],[0.02 0.03],[0.03 0.05],[0.05 0.08],[0.08 0.1]
GGCNKCCATNK_UNKNOWN,4.507,-0.774,0.61,-1.035,-1.359,-0.567,-0.281,-0.405,-0.774,-0.405,1.257
CAVIN1_TARGET_GENES,2.35,1.547,0.906,-0.527,-1.784,-0.811,1.206,-2.655,-0.811,-0.527,0.277
MIR4772_5P,0.75,-0.837,-0.513,-0.301,1.056,-1.328,-1.328,0.498,-1.328,0.498,2.781
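
To read a table like the one above, it can help to find, for each gene set, the score bin with the largest value. The sketch below does this with `idxmax` on the values copied from the output. It assumes that larger positive values indicate stronger over-representation in a bin, which is a reading of the iPAGE convention and not something stated in this notebook. The index label `gene_set` is a hypothetical name replacing the unnamed index column.

```python
import io

import pandas as pd

# Values copied from the output above; `gene_set` is a placeholder index label.
csv_text = """gene_set,[-0.12 -0.10],[-0.10 -0.06],[-0.06 -0.04],[-0.04 -0.03],[-0.03 -0.01],[-0.01 0.00],[0.00 0.02],[0.02 0.03],[0.03 0.05],[0.05 0.08],[0.08 0.1]
GGCNKCCATNK_UNKNOWN,4.507,-0.774,0.61,-1.035,-1.359,-0.567,-0.281,-0.405,-0.774,-0.405,1.257
CAVIN1_TARGET_GENES,2.35,1.547,0.906,-0.527,-1.784,-0.811,1.206,-2.655,-0.811,-0.527,0.277
MIR4772_5P,0.75,-0.837,-0.513,-0.301,1.056,-1.328,-1.328,0.498,-1.328,0.498,2.781
"""

signal = pd.read_csv(io.StringIO(csv_text), index_col="gene_set")

# For each gene set, the bin holding its largest value and that value.
peak = pd.DataFrame({
    "peak_bin": signal.idxmax(axis=1),
    "peak_value": signal.max(axis=1),
})
print(peak)
```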


In [15]:
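# NOTE: `gs` must already be defined when this cell runs; in this notebook it is assigned in the following cell.
pvmat_list = glob(f'{exp}/*c3*/pvmatrix.txt')
gs_cluster_path = ipd.detect_gs_cluster(pvmat_list, gs=gs)

print([p.split('/')[1:3] for p in gs_cluster_path])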

gs_cluster_path = gs_cluster_path[0].split('pvmatrix.txt')[0]


[['msigdb_v7.4_c3.all', 'pvmatrix.txt'], ['msigdb_v7.4_c3.mir.mirdb', 'pvmatrix.txt']]


In [35]:
gs = 'CAVIN1_TARGET_GENES'

pd.DataFrame([
    (n,','.join(list(ipd.bin_identifier_genes(
        f'{gs_cluster_path}',str(n),gs
    ).values())[0])) for n in [0,1,2,3,4,6,7,8,9,10]
],columns=['clust',gs]).set_index('clust')[gs][0]

'ACBD5,ARID4A,ATF7IP2,CMTM3,LEPROTL1,P4HB,RBBP4,SEC14L1,SLC34A1,STAP2,TCERG1,VDAC2,ZSCAN31'

## Identifier genes of enriched pathways

### GOBP_RNA_MODIFICATION

In [17]:
gs = 'GOBP_RNA_MODIFICATION'

pd.DataFrame([
    (n,','.join(list(ipd.bin_identifier_genes(
        'hl60_exp1_DAC_rho_delta_phenotype/msigdb_v7.4_c5.go',str(n),gs
    ).values())[0])) for n in [0,1,2,3,4,6,7,8,9,10]
],columns=['clust',gs]).set_index('clust')

clust,GOBP_RNA_MODIFICATION
0,"ADAT2,CDKAL1,CMTR1,CMTR2,DTWD1,FTSJ1,NOP2,NSUN..."
1,"ALKBH3,ALKBH5,DUS3L,LCMT2,METTL16,NAT10,NHP2"
2,"ADAR,CDK5RAP1,JMJD6,LARP7,METTL6,NSUN5,TRMT44,..."
3,"ADARB2,ADAT3,APOBEC3B,FDXACB1,HENMT1,MRM1,TFB2..."
4,"ADAD1,APOBEC1,APOBEC3A,APOBEC3G,METTL2B,PUS10,..."
6,"APOBEC3C,DPH3,EMG1,FBL,METTL2A,NSUN3,PCIF1,RRN..."
7,"ADARB1,APOBEC3H,BAG4,DTWD2,DUS4L,NUDT16,OSGEP,..."
8,"AICDA,DIMT1,DUS1L,METTL8,MOCS3,MTO1,PUS7,RNMT,..."
9,"C9orf64,MEPCE,METTL1,METTL14,METTL5,NSUN4,NSUN..."
10,"AARS2,ALKBH1,ALKBH8,ANKRD16,BCDIN3D,CBLL1,CTU1..."


### GOBP_NEGATIVE_REGULATION_OF_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR

In [623]:
c5_go_gmt['GOBP_NEGATIVE_REGULATION_OF_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR']

['KDM1A',
 'SIRT1',
 'ZNF385A',
 'ING2',
 'MIR21',
 'MDM2',
 'MIF',
 'MUC1',
 'PRKN',
 'TRIAP1',
 'TAF9B',
 'BCL2',
 'BDKRB2',
 'MARCHF7',
 'TAF9',
 'PTTG1IP',
 'ELL3',
 'BCL2L12',
 'ARMC10',
 'CD44',
 'CD74']

In [313]:
gs = 'GOBP_NEGATIVE_REGULATION_OF_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR'

pd.DataFrame([
    (n,','.join(list(ipd.bin_identifier_genes(
        'hl60_exp1_DAC_rho_delta_phenotype/msigdb_v7.4_c5.go',str(n),gs
    ).values())[0])) for n in [0,1,2,8,9,10]
],columns=['clust',gs]).set_index('clust')

clust,GOBP_NEGATIVE_REGULATION_OF_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR
0,"BCL2,CD44,CD74,KDM1A,PTTG1IP,TRIAP1"
1,"ARMC10,MUC1"
2,
8,"MIF,TAF9"
9,"BCL2L12,ELL3,MDM2,TAF9B"
10,"SIRT1,ZNF385A"
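
The table above maps each bin (`clust`) to the member genes of the gene set that fall into it. To look up the bin for a particular gene, the mapping can be inverted. The sketch below does so with a plain dictionary built from the rows shown; `bins` and `gene_to_bin` are hypothetical names used only for illustration.

```python
# Rows copied from the table above: bin number -> comma-separated gene names.
bins = {
    0: "BCL2,CD44,CD74,KDM1A,PTTG1IP,TRIAP1",
    1: "ARMC10,MUC1",
    2: "",
    8: "MIF,TAF9",
    9: "BCL2L12,ELL3,MDM2,TAF9B",
    10: "SIRT1,ZNF385A",
}

# Invert to gene -> bin, skipping the empty entry for bin 2.
gene_to_bin = {
    gene: bin_id
    for bin_id, genes in bins.items()
    for gene in genes.split(",")
    if gene
}

for gene in ["BCL2", "MDM2", "SIRT1"]:
    print(gene, gene_to_bin[gene])
```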


### GOCC_SPLICEOSOMAL_COMPLEX

In [18]:
gs = 'GOCC_SPLICEOSOMAL_COMPLEX'

pd.DataFrame([
    (n,','.join(list(ipd.bin_identifier_genes(
        'hl60_exp1_DAC_rho_delta_phenotype/msigdb_v7.4_c5.go',str(n),gs
    ).values())[0])) for n in [0,1,2,8,9,10]
],columns=['clust',gs]).set_index('clust')

clust,GOCC_SPLICEOSOMAL_COMPLEX
0,"API5,DDX23,DDX39B,DDX5,DHX15,DHX8,HNRNPM,HSPA8..."
1,"ALYREF,CWC15,LGALS3,NCL,PRPF40A,SF3A2,SLU7,SNR..."
2,"ADAR,CWC22,DDX41,DQX1,HNRNPA2B1,HNRNPC,LSM3,LU..."
8,"BCAS2,CCDC12,CCDC130,EIF4A3,PRPF18,PRPF6,SF3A1..."
9,"DHX32,GPATCH1,IVNS1ABP,PPP1R8,RNPC3,SF1,SF3B4,..."
10,"AAR2,AQR,BUD13,CTNNBL1,CWF19L1,HNRNPF,HNRNPH3,..."



In [232]:
!date

Sat Sep 17 14:44:09 PDT 2022
