# Pathway enrichment analysis

<!-- Luke and Ray have shared the CRISPRi screening results, which were analyzed with the [ScreenProcessing](https://github.com/mhorlbeck/ScreenProcessing) pipeline. -->

<!-- - $\gamma$ - gamma score -->
<!-- - $\rho$ - rho score -->

<!-- - Pathway enrichment analysis over rho scores -->
<!-- - Load screening result tables into Python
- Make sure gene names are correctly assigned
 -->

<!-- Alex Ge: 
> If we do Enrichr analysis on the resistance/sensitivity hits defined by Max’s cutoffs, (n = 418 genes), we do see mRNA methylation (adj p = 0.018) and RNA destabilization (adj p = 0.005) come out as significant GO biological processes. METTL3 is included in these GO terms.

> If we do Enrichr analysis on just the resistance hits (n = 197), mRNA methylation is even more significant (adj p = 0.002), which makes sense since we see more METTL3 biology on the resistance side. It is one of the top five GO terms by adjusted P-value.

> This analysis was done today with the 2021 GO terms, which have updated annotations for the newer m6A genes. When I did the same Enrichr analysis in 2018, RNA destabilization and mRNA methylation were not as significant since the GO annotations were not updated.

> I think Enrichr analysis might make more sense here – in Abe’s analysis, I can see that a lot of rho scores that are < 0.2 are being included in the analysis; these are likely to be statistically insignificant. It also looks graphically like the highest bin is including rho values that are < 0?
 -->

In [1]:
from matplotlib_venn import venn2

In [2]:
from IPython.display import IFrame

In [3]:
from glob import glob

import sys
import pandas as pd
import numpy as np 
from itertools import chain, product

sys.path.append("../../")
pager_dir = "/data_gilbert/home/aarab/Projects/pager/"
sys.path.append(pager_dir)

from scripts.util import *
import ipage_down as ipd

In [4]:
# wd = '/rumi/shams/abe/Projects/Decitabine-treatment/'
wd = '/data_gilbert/home/aarab/Projects/Decitabine-treatment/DAC'

In [5]:
data = load_data(screens=True,wd=wd)

In [6]:
data.keys()

dict_keys(['hl60_exp1_DAC_rho', 'hl60_exp1_DAC_gamma', 'hl60_exp2_DAC_rho', 'hl60_exp2_DAC_gamma', 'hl60_exp2_GSK_rho', 'hl60_exp2_GSK_gamma', 'molm13_exp_DAC_rho', 'molm13_exp_DAC_gamma', 'molm13_exp_GSK_rho', 'molm13_exp_GSK_gamma'])

## Run `onePAGE`:

Reference on techniques for transforming data distributions: https://medium.com/analytics-vidhya/techniques-to-transform-data-distribution-565a4d0f2da


In [9]:
from matplotlib import pyplot
from scipy.stats import yeojohnson

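# Thresholds of 0 (rho score) and 1 (p-value) are passed to find_top;
# the printed counts below show the resulting up/down split.
rho = pd.concat(
    find_top(data['hl60_exp1_DAC_rho'].astype(float), 'rho score', 0, 'Mann-Whitney p-value', 1)
).reset_index()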

up:  8896
down: 9864
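
Cell 9 imports `yeojohnson` but never calls it. As a minimal sketch of what that transform does, the block below applies `scipy.stats.yeojohnson` to toy data. The array `toy_scores` is hypothetical and is not taken from the screen; whether a transform is appropriate for the rho scores is not established by this notebook.

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Hypothetical right-skewed scores with both signs (NOT the screen data).
rng = np.random.default_rng(0)
toy_scores = rng.lognormal(mean=0.0, sigma=1.0, size=1000) - 1.0

# With `lmbda` omitted, yeojohnson returns the transformed values
# together with the lambda fitted by maximum likelihood.
transformed, fitted_lambda = yeojohnson(toy_scores)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skew before: {skew(toy_scores):.2f}, after: {skew(transformed):.2f}")
```

Unlike Box-Cox, Yeo-Johnson accepts zero and negative inputs, which matters for signed scores.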


In [10]:
# def fdr_diff_table(df,fold_change,stat_val):
#     df['fdr'] = np.sign(df[fold_change] ) * (1 - df[stat_val])
#     return df

In [11]:
# Keep one rho score per gene (first occurrence) and write a tab-separated table
rho[['gene_name', 'rho score']].drop_duplicates(subset='gene_name', keep='first').to_csv(
    'hl60_exp1_DAC_rho_delta_phenotype.txt', sep='\t', index=False
)
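
Because `keep='first'` retains whichever row appears first for each gene, the value written out depends on row order. A minimal sketch with a hypothetical table (`toy` and its values are made up for illustration) shows the behavior and how one might count the rows that get dropped:

```python
import pandas as pd

# Hypothetical table with one repeated gene name (NOT the screen data).
toy = pd.DataFrame(
    {"gene_name": ["A", "B", "A"], "rho score": [0.5, -0.2, 0.1]}
)

# Count rows that share a gene name with an earlier row.
n_dup = int(toy["gene_name"].duplicated().sum())

# keep='first' retains the first row per gene, so gene A keeps 0.5, not 0.1.
deduped = toy.drop_duplicates(subset="gene_name", keep="first")

print(n_dup)
print(deduped)
```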

In [13]:
!bash ../../scripts/onePAGE.sh . \
    hl60_exp1_DAC_rho_delta_phenotype.txt \
    GOBP_MRNA_PROCESSING &> /dev/null

In [36]:
ls hl60_exp1_DAC_rho_delta_phenotype_onePAGE_GOBP_MRNA_PROCESSING/

cmdline.txt
hl60_exp1_DAC_rho_delta_phenotype.txt
hl60_exp1_DAC_rho_delta_phenotype.txt.matrix
hl60_exp1_DAC_rho_delta_phenotype.txt.pre
hl60_exp1_DAC_rho_delta_phenotype.txt.profile
hl60_exp1_DAC_rho_delta_phenotype.txt.q
hl60_exp1_DAC_rho_delta_phenotype.txt.script
hl60_exp1_DAC_rho_delta_phenotype.txt.summary
hl60_exp1_DAC_rho_delta_phenotype.txt.summary.eps
Motifs/


In [63]:
profile = pd.read_csv(
    'hl60_exp1_DAC_rho_delta_phenotype_onePAGE_GOBP_MRNA_PROCESSING/hl60_exp1_DAC_rho_delta_phenotype.txt.profile',sep='\t',header=None
).rename(columns={0:'gene',1:'member'}).set_index('gene').astype(int)

In [87]:
q = pd.read_csv(
    'hl60_exp1_DAC_rho_delta_phenotype_onePAGE_GOBP_MRNA_PROCESSING/hl60_exp1_DAC_rho_delta_phenotype.txt.q',sep='\t'
)
q.columns = ['gene','bin']
q.set_index('gene',inplace=True)

In [126]:
# For each gene-set member, flag whether it falls in bin 0 or bin 10
q.loc[profile[profile.member.eq(1)].index.to_list(), :].bin.isin([0, 10])

gene
A1CF       False
AAR2        True
ACIN1       True
ADARB1     False
AHCYL1      True
           ...  
ZC3H3       True
ZCCHC8     False
ZFP36L1    False
ZMAT2      False
ZNF473     False
Name: bin, Length: 474, dtype: bool

In [136]:
# Bin assignments restricted to genes that are members of the gene set
profile_q = q.loc[profile[profile.member.eq(1)].index.to_list(), :]

In [155]:
# Save the gene-set members found in bins 0 and 10 as the leading-edge list
pd.DataFrame(
    profile_q[profile_q.bin.isin([0, 10])].index
).to_csv('GOBP_MRNA_PROCESSING_onePAGE_leading_edge.csv', index=False)
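
The selection in cells 63 through 155 can be sketched end to end on toy data. The frames `toy_profile` and `toy_q` below are hypothetical stand-ins for the `.profile` and `.q` tables, and the sketch assumes, as the cells above do, that bins 0 and 10 are the bins of interest.

```python
import pandas as pd

genes = pd.Index(["G1", "G2", "G3", "G4", "G5"], name="gene")

# Hypothetical gene-set membership (1 = member) and per-gene bin assignments.
toy_profile = pd.DataFrame({"member": [1, 0, 1, 1, 0]}, index=genes)
toy_q = pd.DataFrame({"bin": [0, 10, 5, 10, 3]}, index=genes)

# Step 1: genes annotated as members of the gene set.
members = toy_profile.index[toy_profile["member"].eq(1)]

# Step 2: bin assignments for those members only.
member_bins = toy_q.loc[members]

# Step 3: keep members that land in the two bins of interest.
leading_edge = member_bins[member_bins["bin"].isin([0, 10])].index.to_list()

print(leading_edge)  # ['G1', 'G4']
```

G2 sits in bin 10 but is not a member, and G3 is a member but sits in bin 5, so neither is selected.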

In [85]:
!date

Sun Nov 27 17:50:14 PST 2022
