Skip to content

Ramialison-Lab/Trawler-2.0

Repository files navigation

===============
Welcome to Trawler standalone 2.0
===============

The GNU General Public License (GPL)
     Authors: L. Ettwiller, B. Paten, M. Ramialison, Y. Haudry


Reference: Ettwiller L., Paten B.,Ramialison M., Birney E., Wittbrodt J.
     Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation.
          Nat Methods. 2007 Jul;4(7):563-5. Epub 2007 Jun 24.
          PMID: 17589518


*****************************************************************
***                                                           ***
***               Installation procedure                      ***
***                                                           ***
***                                                           ***
***                                                           ***
***       System Requirements                                 ***
***       Installation                                        ***
***       Dependencies                                        ***
***       Licensing                                           ***
***       Usage                                               ***
***                                                           ***
***                                                           ***
*****************************************************************

*********************************
***                           ***
***     System Requirements   ***
***                           ***
*********************************

The program has been tested successfully on Unix/Linux (Fedora 9),
Mac OS X (versions 10.4, 10.5, 10.9, 10.10) and Windows (XP).

The following software packages are required for running Trawler:
perl 5.6 or above
Java 5.0 or above

*********************************
***                           ***
***     Installation          ***
***                           ***
*********************************

please see the INSTALL.txt file for detailed installation instructions.


*********************************
***                           ***
***     Dependencies          ***
***                           ***
*********************************

The following software packages are required:
- Algorithm::Cluster (CPAN: http://search.cpan.org/~mdehoon/)
- bedtools (http://bedtools.readthedocs.org/en/latest/content/installation.html)
-kentUtils v305 (http://hgdownload.cse.ucsc.edu/admin/exe/userApps.v305.src.tgz)

The following software packages are directly provided in the Trawler distribution:
- treg_comparator (http://treg.molgen.mpg.de)
- Jalview (http://www.jalview.org/)
- WebLogo (http://weblogo.berkeley.edu/)


*********************************
***                           ***
***     Licensing             ***
***                           ***
*********************************

Please see the file called LICENSE.txt


*********************************
***                           ***
***     USAGE                 ***
***                           ***
*********************************

USAGE
=====

Run trawler from the trawler main folder on the command line.

If using BED formatted input:

trawler.pl -sample [file containing bed regions] -org [organism]

The organism name is based on the build numbers indicated by UCSC, which can be found here: https://genome.ucsc.edu/FAQ/FAQreleases.html#release1

Please ensure that all chromosomes in FASTA format are present in the "genome_data" folder under the respective organism name. e.g. "genome_data/mm9" for mm9 sequence. Ensure that the chromosomes are hard repeat masked and each chromosome is in a separate file.

A list of all the chromosome sizes and gene list is also required within the organism folder. The gene list can be generated using the biomart tool on ensembl (http://www.ensembl.org/biomart/martview/6f6e7f37e9efce739f097115520faa07).

When generating the gene list, please set the gene attributes as follows:

	Ensembl Gene ID	Associated Gene Name	Chromosome Name	Gene Start (bp)	Gene End (bp)	Strand

Also required in the folder, is a list of all the chromosome sizes of the organism.

Included within this package is the genome data for Saccharomyces cerevisiae as an example of how the data should be arranged and formatted.

If using FASTA formatted files:

trawler -sample [file containing the enriched sequences] -background [file containing the background sequences]

    -sample (FASTA format) better to be repeat-masked.
    -background (FASTA format)

OPTIONAL PARAMETERS
===================

    [MOTIF DISCOVERY]
    -occurrence (optional) minimum occurrence in the sample sequences. [DEFAULT = 10]
    -mlength (optional) minimum motif length. [DEFAULT = 6]
    -wildcard (optional) number of wild card in motifs. [DEFAULT = 2]
    -strand (optional) single or double. [DEFAULT = double]

    [CLUSTERING]
    -overlap (optional) in percentage. [DEFAULT = 70]
    -motif_number (optional) total number of motifs to be clustered. [DEFAULT = 200]
    -nb_of_cluster (optional) fixed number of cluster. if this option is used, 
the k-mean clustering algorithm with fixed k will be used instead of the strongly 
connected component (SCC). Alternatively this option can be set to 'som' and in 
this case, the self organizing map (SOM) will be use. [DEFAULT = NULL]


    [VISUALIZATION]
    -directory (optional) output directory. [DEFAULT = "$TRAWLER_HOME/myResults"]
    -dir_id (optional) gives an id to the results directory. [DEFAULT = NULL]
    -xtralen (optional) add bases around the motifs for the logo. [DEFAULT = 0]
    -alignments (optional) file containing the list of files containing the aligned sequences (see README file for more info) [DEFAULT = NULL]
    -ref_species (optional) name of the reference species [DEFAULT = NULL]
    -clustering (optional) if 1 the program clusters the instances, if 0 no clustering. [DEFAULT = 1]
    -phastcon (optional) value = 1 to display conservation. Requires PhastCon scores as .bw file format saved under /genomes/[organism]/phastcon/ [DEFAULT = 0 ].
    -web (optional) if 1 the output will be a web page with all the information [DEFAULT = 1]

USAGE EXAMPLES
==============

Trawler-standalone comes in two flavors: [1] with the conservation information or
[2] without. For the conservation you need to provide Trawler-standalone with a file
containing all the orthologous sequences (not aligned).

[1] with conservation

> trawler.pl -sample file_sample -background file_background -alignments file_alignments -ref_species reference_species_name

[2] without conservation

> trawler.pl -sample file_sample -background file_background


By default Trawler-standalone performs the full analysis nevertheless it is possible to

[1] get only the over-represented motifs:

> trawler.pl -sample file_sample -background file_bakground -clustering 0

[2] get only the over-represented motifs and the cluster (PWM):

> trawler.pl -sample file_sample -background file_bakground -web 0

OPTIONS DESCRIPTION
===================

    [MOTIF DISCOVERY]

    -occurrence : [DEFAULT = 10] The minimum number of time the motif occurs in the sample sequence.
            The lower limit defined by the suffix tree is 2 motifs. Nevertheless if your chromatin IP
            contains more than just a few sequences, you would expect to see the motif occurring quite often.
            For typical chromatin IP data, a good minimum number of motif occurrence is about 10 to 20.
            If you do not get anything interesting with this values, try a higher or lower number according
            to your sample size.
    -mlength : [DEFAULT = 6] Trawler-standalone does not assess motifs of fixed length.
            The maximum length has been set to 20 nucleotides but the minimum length is a user defined parrameter.
            A good minimum length is around 5-6 bp for typical transcription factor binding sites.
    -wildcard : [DEFAULT = 2] The number of wildcards is the maximum number of positions in the motif
            that can have a degenerate nucleotide. In IUPAC code, the possible wildcards used in Trawler-standalone are
            the following :
            M = [AC], R = [AG], Y = [CT], W = [AT], S = [CG], K = [GT], N = [ATCG]
            (the latest counts as 2 mismatches). A possible motif with -wildcard option set
            to 2 can be AGCMTWA for example.
    -strand : [DEFAULT = double] The analysis can be performed either on single-stranded or double-stranded sequences.

    [CLUSTERING]

    -overlap : [DEFAULT = 70] The minimum percentage overlap for two motifs to be considered in the same cluster.
            The value is by default 70 %; lowering down this value will cluster more distant motifs.
    -motif_number : [DEFAULT = 200] The maximum number of motifs to be considered for clustering.
            If the option is set to 200, the best 200 motifs returned by the discovery step will be used
            for the clustering step.
    -nb_of_cluster : [DEFAULT = NULL] By default, the number of cluster(s) is defined by the
            number of strongly connected components (SCC). By setting this option to a integer, 
	    the user set the number of expected cluster(s). In this case, the SCC is replaced 
	    by the k-mean clustering algorithm. if the option is set to 'som' the SOM algorithm is run
	    instead.

    [VISUALIZATION]

    -directory : [DEFAULT = "$TRAWLER_HOME/myResults"] name of the directory where all the outputs
            will be stored. If no directory is specified, then a default directory "myResults" will be created,
            Individual results will be stored in a separate directory with the following pattern
            "tmp_yyyy-mm-dd_hh:mm_ss" (ex : tmp_2009-05-20_12h13_20).
    -dir_id : [DEFAULT = NULL] gives a meaningful ID to the results directory. For instance, if this option
            is set to "myID", this will create a directory named "myID_yyyy-mm-dd_hh:mm_ss".
    -xtralen : [DEFAULT = 0] The number of additional nucleotides included in the PWM flanking the core motif.
    -alignments : [DEFAULT = NULL] Alignment file in the correct format. If alignments are provided,
            the motifs are visualized in the context of an alignment using Jalview.
            If the -alignment option is activated, the option -ref_species needs to be set.
    -ref_species : [DEFAULT = NULL] Name of the reference species. Only use this option if
            the -alignment option is activated.
    -clustering : [DEFAULT = 1]. By default, Trawler proceeds to the clustering step.
            To only obtain the over-represented motifs, set the -clustering option to 0.
            If the -clustering option is set to 0, no webpage is produced.
    -web : [DEFAULT = 1] By default, Trawler proceeds to create a web page summarizing the result.
            If this option is set to 0, Trawler proceeds until the clustering step only.
            Useful for batch  analysis.


*********************************
***                           ***
*** Additional features       ***
***                           ***
*********************************


Additional reference matrices can be added to the known matrices used by Treg comparator by adding
to the file matrix_set_all your PWM in the following format :

ID#name#text#text#0.3548828125 ...
Matrix entries (float) are separated by a whitespace. The matrix entries
start with the frequency of "A", followed by that of "C", "G", and "T" in the
first position (5'), then the same for the second position and so on.


*********************************
***                           ***
*** Trawler Team              ***
***                           ***
*********************************

Laurence Ettwiller  <ettwille@embl.de>
Benedict Paten      <benedict@soe.ucsc.edu>
Mirana Ramialison   <ramialis@embl.de>
Yannick Haudry      <haudry@embl.de>

About

No description, website, or topics provided.

Resources

License

GPL-2.0, Unknown licenses found

Licenses found

GPL-2.0
LICENSE.md
Unknown
LICENSE.txt

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published