##### Generating the transcription factor (TF) to functional module regulation pairs by estimating the enrichment of each TF's target genes in each functional module. Before running this pipeline, make sure you have downloaded all essential input data from the shared directory https://osf.io/34xnm/?view_only=5b968aebebe14d4c97ff9d7ce4cb5070, which is described in the manuscript "Functional module states framework reveals cell states for drug and target prediction" by Guangrong Qin et al.

#### Please also cite the following paper to acknowledge the source of the TF-target gene pairs.
Garcia-Alonso L, Holland CH, Ibrahim MM, Turei D, Saez-Rodriguez J. Benchmark and integration of resources for the estimation of human transcription factor activities [published correction appears in Genome Res. 2021 Apr;31(4):745]. Genome Res. 2019;29(8):1363-1375.

#### The example gene expression data for the MCF7 cell line is from the L1000 platform, and only the subset of perturbations measured in the CTRP2 project is used. To use it, please cite the following papers.
Subramanian A, Narayan R, Corsello SM, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell. 2017;171(6):1437-1452.e17. doi:10.1016/j.cell.2017.10.049

Rees MG, Seashore-Ludlow B, Cheah JH, et al. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat Chem Biol. 2016;12(2):109-116. doi:10.1038/nchembio.1986

#### Please cite the following article(s) when using KEGG.
Kanehisa, M. and Goto, S.; KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000). 

Kanehisa, M; Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947-1951 (2019) 

Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., and Tanabe, M.; KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49, D545-D551 (2021). 

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import scipy 
import scipy.stats as ss
import statsmodels
from statsmodels import stats
from statsmodels.stats import multitest
sys.path.append('../Script/')
import FM_States
import FM_selection
import TF

ROOT_DIR = os.path.abspath("../")

##### Load the gene expression matrix and select the functional modules. The functional modules are from the KEGG pathways.

In [None]:
para_in = {
    'output_dir': ROOT_DIR+"/Sample_output/TF_pairs/",
    'input_expr_file': os.path.join(ROOT_DIR, "Sample_input/Example1/Sample1_data_MCF7_drugs_CTRP2.csv"),
    'out_dir': ROOT_DIR+"/Sample_output/Sample1",
    'sele_modules': ['Translation',
         'Nucleotide metabolism',
         'Signal transduction',
         'Amino acid metabolism',
         'Folding sorting and degradation',
         'Replication and repair',
         'Carbohydrate metabolism',
         'Membrane transport',
         'Cellular community - eukaryotes',
         'Lipid metabolism',
         'Metabolism of other amino acids',
         'Transcription',
         'Xenobiotics biodegradation and metabolism',
         'Signaling molecules and interaction',
         'Energy metabolism',
         'Transport and catabolism',
         'Glycan biosynthesis and metabolism',
         'Metabolism of cofactors and vitamins',
         'Cell motility',
         'Cell cycle', 
         'Apoptosis', 
         'Cellular senescence', 
         'p53 signaling pathway']
}

In [None]:
## Create the output directory if it does not exist yet

output_dir = para_in['output_dir']

if not os.path.exists(output_dir):
    try:
        os.makedirs(output_dir)
    except OSError:
        print("Creation of the directory %s failed" % output_dir)
    else:
        print("Successfully created the directory %s" % output_dir)
else:
    print("INFO: %s already exists!" % output_dir)

##### 1) Load the genes of the selected functional modules from KEGG pathways; 2) load the gene expression matrix; 3) get the TF-module pairs by estimating the enrichment of each TF's target genes in each functional module.
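
The enrichment idea in step 3 can be illustrated with a small, self-contained sketch. It is not the implementation inside `TF.get_tfpairs_for_select_pathways`, whose internals are not shown here; it assumes a one-sided Fisher's exact test followed by Benjamini-Hochberg correction, which is one common way to test whether a TF's targets are over-represented in a module. The function name `tf_module_enrichment` and all gene, TF, and module names below are hypothetical toy values.

```python
import pandas as pd
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests


def tf_module_enrichment(tf_targets, modules, background):
    """Test each (TF, module) pair for over-representation of TF targets.

    Hypothetical helper for illustration only; not part of the TF module.
    """
    background = set(background)
    rows = []
    for tf, targets in tf_targets.items():
        targets = set(targets) & background
        for module, genes in modules.items():
            genes = set(genes) & background
            a = len(targets & genes)          # targets inside the module
            b = len(targets - genes)          # targets outside the module
            c = len(genes - targets)          # module genes that are not targets
            d = len(background) - a - b - c   # all remaining background genes
            _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
            rows.append({"TF": tf, "module": module, "overlap": a, "p_value": p})
    result = pd.DataFrame(rows)
    # Benjamini-Hochberg correction across all tested pairs
    result["fdr"] = multipletests(result["p_value"], method="fdr_bh")[1]
    return result


# Toy data: 100 background genes, two TFs, two modules (all names are made up)
background = ["G%d" % i for i in range(100)]
tf_targets = {
    "TF_A": ["G%d" % i for i in range(0, 10)],
    "TF_B": ["G%d" % i for i in range(50, 60)],
}
modules = {
    "Module_1": ["G%d" % i for i in range(0, 20)],
    "Module_2": ["G%d" % i for i in range(40, 50)],
}

result = tf_module_enrichment(tf_targets, modules, background)
print(result.sort_values("fdr"))
```

In this toy example only the TF_A / Module_1 pair shares genes, so it is the only pair expected to reach a small adjusted p-value. The choice of background gene set strongly affects the result; here it is simply the full toy gene list.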


In [None]:
#Load the functional modules from KEGG pathways
File_FM = os.path.join(ROOT_DIR,"Dataset/Sample_FM.csv") #Users can also define their own modules
dic_module,  KEGG_modules = FM_States.load_function_modules(File_FM)

module_selected_gmt = KEGG_modules.loc[KEGG_modules['name'].isin(para_in['sele_modules']) ]

#Load the gene expression matrix
data_matrix_MCF7_CTRP2 = pd.read_csv(para_in['input_expr_file'], index_col = 'Unnamed: 0')

#Get the TF-module pairs by estimating the enrichment of each TF's target genes in each functional module.
#Before running this step, please make sure that 'database.csv' is located at "project/Dataset/database.csv"
#Here database.csv is the file available at https://genome.cshlp.org/content/suppl/2021/03/02/gr.240663.118.DC2/Revised_Supplemental_Table_S3.csv
TF_pairs = TF.get_tfpairs_for_select_pathways(data_matrix_MCF7_CTRP2,para_in['sele_modules'],dic_module) 
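
Because the step above depends on input files being in place, a quick check before running it can save a failed run. The sketch below is a hypothetical helper, not part of the `TF` or `FM_States` modules, and the relative paths are assumptions derived from the comments above (`Dataset/database.csv` and `Dataset/Sample_FM.csv` under the project root); adjust them to your own layout.

```python
import os


def missing_inputs(paths):
    """Return the paths from `paths` that are not existing files."""
    return [p for p in paths if not os.path.isfile(p)]


# Assumed locations relative to the notebook directory (hypothetical layout)
required = [
    os.path.join("..", "Dataset", "database.csv"),
    os.path.join("..", "Dataset", "Sample_FM.csv"),
]

missing = missing_inputs(required)
if missing:
    print("Missing input files:", missing)
else:
    print("All required input files were found.")
```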


In [None]:
TF_pairs.to_csv(os.path.join(para_in['output_dir'], "TF_pairs.csv"))

#### The result files will be used in Example1-generate-FM-matrix.ipynb