GitHub - jacobclifford/MIBBS: Mutual Information Between Binding Sites

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
cluster		cluster
data_dir		data_dir
data_dir2		data_dir2
logodds		logodds
miclasses		miclasses
rocenhancer		rocenhancer
rocproj		rocproj
Makefile.am		Makefile.am
Makefile.in		Makefile.in
README		README
aclocal.m4		aclocal.m4
cbe.fa		cbe.fa
compile		compile
configure		configure
configure.ac		configure.ac
depcomp		depcomp
format916.tex		format916.tex
install-sh		install-sh
masterscript.sh		masterscript.sh
missing		missing
pb515530p15.pdf		pb515530p15.pdf
plot.pdf		plot.pdf
plot.tex		plot.tex
script3.sh		script3.sh
sequenceData.tar.gz		sequenceData.tar.gz
Repository files navigation

06/15/2015

The 'MIBBS' directory (Mutual Information Between Binding Sites) contains the full pipeline of programs (with needed input data) necessary for generation of results for Clifford/Adami paper: "Discovery and information-theoretic characterization of transcription factor binding sites that act cooperatively".  In addition the folder MIBBS contains a gzip ('sequenceData.tar.gz') that contains a directory ('Data') with our core sequence data ('CoreDorsalsites.fa') that has annotated the known Dorsal sites based on the primary author of the paper that discussed the binding site locus, and the pubmed id of the paper.

Author: Jacob Clifford <jacobc@msu.edu>
--------------------------------------------------------------------------
BASH SCRIPT

The bash script 'masterscript.sh' is a procedural code that sequentially performs each experiment/simulation (a subprogram) using C/C++ programs and then generates a figure for the experimental results using matlab and lastly generates the weblogos for the binding sites used for the simulations.  Each subprogram is contained within its own directory.  Hence the beginning portion (first 27 lines in my editor) of 'masterscript.sh' assigns these directories to bash variables.  A 'do while' loop exist over different Twist motifs and spacer windows, once the Twist motif and spacer window are assigned (instantiation of the do while loop variables) the script then executes each experiment/simulation.  Once 'masterscript.sh' halts (presumably due to successful completion of simulations/experiments) the figures can be printed using the bash file 'script3.sh' that contains a latex program (plot.tex) to print all the figures as a pdf.  'plot.pdf' is the name of the printed figures, the code 'plot.cpp' is from M. Samee in Sinha's lab, which I manually edited the plot.cpp output (format916.tex) that contains the latex commands to 'includegraphics' in a table environment).    

--------------------------------------------------------------------------
RUNNING THE BASH SCRIPT 'masterscript.sh'

To run the masterscript.sh on a PC using linux you must change masterscript.sh's beginning portion that contains the bash variables of the directories to the correct path of the folder 'MIBBS' on your machine.  

The C/C++ programs called within materscript.sh are an extension of Saurabh Sinha's lab's GEMSTAT (in particular Xin He's code), (with minor merges from the program STAP from S.Zhong's lab, see: http://systemsbio.ucsd.edu/STAP/)), both are programs for 'thermodynamic modeling'.  A number of dependency programs/libraries are needed, such as GNU Scientific Library (GSL), matlab, weblogo (which requires biopython), latex.  To run GEMSTAT see the installation instructions found in the tar.gz file at:
http://veda.cs.uiuc.edu/Seq2Expr/
where Xin's README for GEMSTAT is pasted at the end of this README for convenience, where the explanation of the input files and parameters and classes of the C/C++ programs are explained.  All comments from my code have been stripped due to nonstandard formatting; albeit, due to Xin's variable's names being suggestive of their purpose, the code can still be human readable; if one wants the unstripped code please send a request to me.

The folders in MIBBS that run my extension of GEMSTAT each have thier own copy of the extensions to the GEMSTAT files,  which basically means the cpp files have increased in size to account for new methods and attributes to the original classes (and of course the 'main' cpp file 'seq2exp.cpp' changes in each of my directories to perform the desired simulation), hence one does not need to actually download GEMSTAT from Sinha lab.  

One may wish to download and get GEMSTAT to work using Xin's example data files if they have problems building the MIBBS source code.  This will assure that one has properly linked properly to GSL.  However, i have supplied a configure script 'configure.ac' and 'Makefile.am' files from GNU autobuild system, which should help make building and linking MIBBS C/C++ code less painful regardless of your platform (the 'autotools' are for 'building' projects, not for managing program dependencies, hence if GSL (a dependency for MIBBS C source code) is not installed, one will still need to install GSL on their machine).  

--------------------------------------------------------------------------
INSTALLATION OF C/C++ SOURCE CODE

In the MIBBS directory on your machine execute the following build commands to 'build' the project's binaries (like the execuatables):

./configure
make

The 'configure' program should have created Makefiles in each directory that contains C/C++ source code.  Calling 'make' in the MIBBS directory will 'recursively make' all the object files and executalbes for your particular platform.  Hence, if make successfully compiles all the source code, one is done with installation, otherwise try to fix the installation errors (or contact me).  (Note, sometimes in the GNU automake system, a user additionally uses 'make install', which is not necessary here).

The 'configure' script may fail to 'link' or find the gsl library on your machine, this means you have to pass the configure script the location of gsl on your machine.  The location of gsl can be found from the command line by executing : gsl-config --prefix.  Let the path returned by gsl-config --prefix = /usr/local/lib, then you must enter the following build commands from the MIBBS directory:

./configure LDFLAGS=-L/usr/local/lib
make

I have succesfully compiled the MIBBS code (i.e. ran ./configure and make) on linux (Ubuntu 14) using gcc 4.4 and gcc 4.8.4.  On windows (using cygwin) using gcc 4.8.2 a compilation error occurs due to a name clash with a local varialbe (_B), local meaning a variable from my code, and a global varialbe from cctype library (part of C standard libraries), e.g. my error was:
$ g++ -O3 -I/usr/local/include  -c ExprPredictor.cpp
In file included from /usr/lib/gcc/x86_64-pc-cygwin/4.8.2/include/c++/cctype:42:0,
                 from ExprPredictor.h:6,
                 from ExprPredictor.cpp:6:
ExprPredictor.h:514:270: error: expected ‘,’ or ‘...’ before numeric constant
 ExprFunc( const vector< Motif >& _motifs, const FactorIntFunc* _intFunc, const vector< bool >& _actIndicators, int _maxContact, const vector< bool >& _repIndicators, const IntMatrix& _repressionMat, double _repressionDistThr, const ExprPar& _par,  const vector< int >& _B ,  const vector< int >& _Br);
                                                                                                                                                                                                                                                                              ^


If this compilation error occurs ( occurring during the 'make' step of the build process) execute the following 'Stream EDit' from the program sed in the MIBBS directory:


sed -i  's/_B/B/g' ./*/ExprPredictor.*

The will 's'ubstitute all occurences of _B with B, this will occur 'g'lobally (i.e. the g flag replaces all occurences of the pattern in a given line), and * meaning all subdirectories and acts as a wildcard for file name extensions of the ExprPredictor files (cpp and h files).  Then rerun 'make'.
--------------------------------------------------------------------------
DIFFERENCE BETWEEN GEMSTAT AND MIBBS 

Each of my directories that contain an experiment and that call a 'seq2exp' executable has a terse README file to explain the input parameters not used in GEMSTAT.  For example, within masterscript.sh, the first experiment (first subprogram) clusters Dorsal transcription factor DNA binding site loci, here's what calling the executable looks like from the command line (and within the shell script):

./seq2exp -bs fas.txt -bi topbot2Int.txt -s efastanotw12.txt  -p synmyout -e expre12.tab -m factordts.wtmx -f factorexpdts2.tab -na 0 -i factorinfodts.txt -o BINS -c coopdt.txt -fo out.txt -oo corr -dc  cbe.fa -du cbe.fa -tw Ebox.txt -sa NEEandREPseq.txt   -pva .0001 -cbo cb_11b.txt   -l 9 -nr 1000 -ip 1  -nu 5 -w 4 -m2 twmotifs842014.wtmx

A number of the parameters above are not defined in GEMSTAT or STAP, in this paragraph I will call these 'new parameters'.  For example, the -w  is a new parameter, which indicates the 'width' or the distance from the current estimate of the position of a binding site that the gibbs sampler for the Expectation Maximization alignment is allowed to randomly sample new starting coordiantes within an extended sequence for binding site alignment.  For the most part, my new parameters always come after GEMSTAT's original parameters (which usually 'ended' at the -oo parameter, for example: "-dc   -du  -tw  -sa   -pva  -cbo   -l  -nr  -ip   -nu  -w  -m2 " are all new parameters - compare this list of new parameters with the full list of parameters passed to the executable above to see what 'ended' means ).  Many of the original GEMSTAT/STAP parameters being passed to the execuatble are not necessary for the experiments in the Clifford/Adami paper, however, they are necessary to 'run' the simulation. (My program requires creating an ExprPredictor object (instantiation of the class ExprPredictor, which requires all the "thermodyanmic modeling" input.  This overhead is because our work was intended to be an extension to the current repository of gene expresssion model's 'cooperativity' functions (pairwise protein-protein interactions), which is still a work in progress).  Hence a segmentation fault will occur without the initial GEMSTAT parameters being assigned data.  Even though the old GEMSTAT parameters must be passed to our executable, one can still run our executable for clustering and characterizing cooperative binding sites with data for any system in particular data that is not correlated with the data passed through the old/original parameters (the gene expression data etc..).  Hence, to run the C/C++ programs for your data (fasta of binding sites, fasta of CRMs, and a potential motif that cooperates with the binding sites in the fasta), just use all of the data we have supplied for the old GEMSTAT parameters, while assigning my new parameters to your data.

Now a list of the seven subprograms called within 'masterscript.sh'.
--------------------------------------------------------------------------
C/C++ SUBPROGRAMS

//////////////////////////////////
The 'first subprogram'  (in cluster directory)
//////////////////////////////////

This program runs the clustering of known Dorsal loci based on a Twist motif and spacer window.  (see how the bash variable 'cluster' is being used in masterscript.sh, where the variable stores the path to the directory for this experiment.)    A 'do while' loop is available to iterate over $j Twist motifs, and $i spacer windows.   Currently 'masterscript.sh' only analyzes the spacer window that corresponds to [0,30]bp, which is why the i, j loops only iterate once (notice that q=1, and k=2, meaning only one iteration occurs).

Within the while loop, each DC, DU, CB data sets ( fasta files ) that corresponds to a specified Twist motif and spacer window ($i, $j variables) are copied to each directory that contains a subprogram.




//////////////////////////////////
The 'second subprogram' (in logodds directory) 
//////////////////////////////////
This program conducts the 'loggodds' test of the predicted DC and DU sites within the CRMs.




//////////////////////////////////
The 'third subprogram' (in rocenhancer directory) 
//////////////////////////////////
This program runs the ROC of CB and OR gate using CB sites as positives and the CRM background sequences as negatives.  This also generates the mutual information (I(I;O)) plot for the variables 'Input=I' (class type of k-mer Dorsal or not Dorsal site) and 'Output=O' (prediction of Dorsal or not Dorsal site).




//////////////////////////////////
The 'fourth subprogram' (in miclasses directory) 
//////////////////////////////////
This program computes the information quantity I(C;P), for class variable (C) (e.g. DC, DU) and predictions (P) (e.g. distal or proximal; which effectively means DU type site or DC type site) of class type for each detector DC and DU for a given energy cutoff.  In a sense, this program is designed to answer the question: can DC actually predict DC sites as positives, while also predicting DU sites as negatives?  See our paper for further details. 

This program also constructs an ROC, not discussed in the original paper, that is helpful in analyzing the I(C;P) computation.  The ROC shows DC detector's ROC using DC training data as positives, and DU data as negatives, and similarly the DU detector's ROC using DU data as positives, and DC data as negatives. 




//////////////////////////////////
The 'fifth subprogram' (in ranksumproj directory) 
//////////////////////////////////
This program generates the data for the rank-sum pvalue sampling distribution in order to see if the median energies of DC and DU are significantly different.




//////////////////////////////////
sixth subprogram  (in data_dir directory)
//////////////////////////////////
This program flips (chooses a strand ) all sites in an alignment to A-rich seq, and then makes a PWM-like table of the counts of each nucleotide from the alignment (see the colored tables in 'plot.pdf').




//////////////////////////////////
seventh subprogram  (in data_dir2 directory)
//////////////////////////////////
This program is weblogo (designed by Crooks et.al. and T. Schneider), which creates the 'logo' of the motifs from the alignments of binding sites.  See their papers for details:

Crooks GE, Hon G, Chandonia JM, Brenner SE WebLogo: A sequence logo generator,
Genome Research, 14:1188-1190, (2004)

Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Res. 18:6097-6100

or visit: http://weblogo.berkeley.edu/ to download.  I have placed weblogo inside of this folder, but you will need to recompile, see WebLogo Source Code at http://weblogo.berkeley.edu/ for instructions.




--------------------------------------------------------------------------
--------------------------------------------------------------------------
--------------------------------------------------------------------------
--------------------------------------------------------------------------
--------------------------------------------------------------------------
That ends my README file, below I've pasted Xin He's README for his 'GEMSTAT' code.



--------------------------------------------------------------------------
Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression
Author: Xin He <xinhe2@illinois.edu>
--------------------------------------------------------------------------
INSTALLATION

The program needs GSL (GNU Scientific Library). After installing GSL, change the GSL directory in src/Makefile. Then type: 
cd src
make

The main executable will be generates: 
seq2expr: the main program to fit a sequence-to-expression model

Type the program name without parameters will print usage information. 
--------------------------------------------------------------------------
PROGRAM USAGE

The program takes as input: sequences, the expression profiles of these sequences, the PWMs of the relevant TFs and the expression profiles of these TFs, and computes the paramaters of the underlying sequence-to-expression model as well as the predicted expression patterns.

The data/ directory contain the example files. A simple command of running the program: 
$INSTALL_DIR/src/seq2expr -s seqs.fa -e expr.tab -m factors.wtmx -f factor_expr.tab -fo obs_pre.txt -i factor_info.txt -o Direct

For more examples of running the program with various options, see the run.sh script (need to modify the installation directory of the script). 

Explanation of parameters: 
-s <seq_file>: required, the sequence file in FASTA format. See data/seqs.fa. 

-e <expr_file>: required, the expression data of the sequences. The first line specifies the name of the expression conditions. See data/expr.tab. 

-m <motif_file>: required, the PWM (motif) of the relevant TFs. See data/factors.wtmx. 

-f <factor_expr_file>: required, the expression data of the TFs. Must match the format of expr_file, and the order of TFs in motif_file. See data/factor_expr.tab. 

-fo <output_file>: required, the output file, the predicted expression patterns of all sequences as well as the observed expression patterns (alternating, the first row is the observed and the second the predicted). 

-o <model_option>: the sequence-to-expression model. Options are: Logistic, Direct (DirectInt model), ChrMod_Unlimited (SRR model with N_MA = inf), ChrMod_Limited (SRR model with finite N_MA). 

-c <coop_file>: the list of cooperative interactions. One line per cooperative pair. If not specified, then no cooperative interaction is allowed. See data/coop_file. 

-i <factor_info_file>: the role of TFs (activators or repressors). This would be required for SRR models. See data/factor_info.txt. The second column indicates whether the TF is an activator and the third whether repressor (in theory, an activator could have two roles, thus we have two columns). 

-oo <obj_option>: the option of objective function. Options are: SSE - sum of squared error (default), Corr - average correlation coefficient. 

-mc <max_contact>: the N_MA parameter of the SRR model. 

-p <par_file>: the parameter file. When this option is specified, the values of the parameters in par_file will be used as initial parameter values of the optimizer. In particular, if no parameter estimation is performed (with the option: -na 0, see below), the parameter values will be used for predicting expression patterns. 

-na <nAlternations>: a parameter of the optimizer. The number of alternations between two optimization methods (GSL simplex method and GSL BFGS method). If it is 0, no parameter estimation. Typically 3 to 5. 

-a <annotation_file>: the file of sequences represented by a set of TFBSs (site annotation). With this option, only the specified sites will be used in the sequence-to-expression model. See data/seqs_p002.ann. Note that in the file, the first column is the start position of a site (from 1), the second is the strand of the site, and the last the factor name. 

-ct <coopDistThr>: the distance threshold of cooperative interactions. Default = 50 bp. 

-rt <repressionDistThr>: the distance threshold of short range repression. Default = 150 bp. 

--------------------------------------------------------------------------
SOURCE CODE

The main classes and functions of the program: 

Tools.*
The utility classes and functions, e.g. Matrix class, mathematical functions, I/O classes. 

class Sequence (SeqAnnotator.h): 
The DNA sequences, represented as a vector of A,C,G,T. The functions that read and write sequences in FASTA format are also included. 

class Motif (SeqAnnotator.h): 
The PWM representation of binding profiles. Defined the methods for computing the LLR score and mismatch energy of any sequence element. Also included read/write functions. 

class Site (SeqAnnotator.h): 
The TF binding sites. The data members are: position, which TF the site is associted with (index), and its mismatch energy and binding affinity (:= exp(-energy)). 

class SeqAnnotator (SeqAnnotator.h): 
Create the site representation of a sequence, i.e. a Sequence object can now be represented as a vector of Site objects. All sites that exceed certain energy thresholds will be extracted. 

class FactorIntFunc (ExprPredictor.h): 
The function that computes the interaction between two occupied TFBSs, according to the TF pairs, the distance and orientation of the sites. 

class ExprPar (ExprPredictor.h): 
The parameters of a sequence-to-expression model, including the binding parameters (K of the strongest site multiplied by TF_max, see the paper), the activation parameters (alpha), the repression parameters (beta), the basal transcription, and the TF-TF interaction matrix. Because some models only have a subset of these parameters (e.g. beta is only used in the SRR model),  methods are defined to: 1) create an ExprPar object by a vector of free parameters (constructor); 2) extract all free parameters from the ExprPar object: getFreePars() function. Exactly how these methods are implemented depend on the current model option (static ModelType modelOption). 

class ExprFunc (ExprPredictor.h): 
The class that predicts expression level of a sequence (represented by a vector of Site objects)according to TF expression levels and the model parameters. The main function is predictExpr(). For thermodynamic models, this main function is implemented through compPartFuncOff() and compPartFuncOn(). These two functions implemented the dynamic programming algorithms in the paper. Also note that all model parameters are supposed to known. 

class ExprPredictor (ExprPredictor.h): 
This is the main class of the program. 
* It contains data members that store the input data of the program: sequences and their expression profiles, the PWMs and expression profiles of TF. It computes the best model parameters through the method train() (saved in the data member ExprPar par_model), and can be applied to a new sequence through the method predict(). 
* The main optimization methods are: simplex_minimize() and gradient_minimize(). These two methods use the GSL functions, see GSL manual for the style of using the library. Also the two functions print information when doing optimization. This could be turned off by commenting out the lines in the main do-while loop. 
* Also note that several functions are defined to deal with parameters: testPar() for testing if the parameter ranges are valid and randomSamplePar() sample randomly parameter values (for random starts of the optimizer). 

seq2expr.cpp
The main() function. The default values of many parameters are defined here. Also control the output.