A program to identify remote homologues in protein sequence databases.
Download the jar file and executable from https://github.com/RSLabNCBS/C-HMM/releases
C-HMM is a software tool for detecting remote (distant) homologues in protein sequence databases. It uses HMMs (Hidden Markov Models) to identify deep evolutionary relationships between protein sequences. C-HMM aims to provide a platform for identifying distant protein relationships in less computational time against any user-defined protein sequence database. C-HMM is divided into three modules:
- Cascade-HMM: This is the main module of C-HMM. It allows sequence searches over many generations. Each generation consists of multiple Jackhmmer searches against a database.
- Custom-HMM: In this module, the filtered hits (first generation) obtained by Cascade-HMM are clustered, and the clustered hits are used to generate HMM profiles. These HMM profiles are then used to initiate the next generations.
- Cluster-HMM: This module clusters the hits obtained by Cascade-HMM after every generation, which helps reduce search time. It can be combined with Cascade-HMM.
To start a C-HMM search, the user needs the following files and tools:
- Input sequence: Any protein sequence in FASTA format (see the example sample file).
- Sequence database: Any protein sequence database in FASTA format (see the example database).
- HMMER3: C-HMM uses several utilities from the HMMER3 package. Download the HMMER package from http://hmmer.janelia.org/
- CD-HIT: C-HMM uses CD-HIT for clustering; it removes redundant sequences at a given threshold. CD-HIT can be downloaded from http://weizhongli-lab.org/cd-hit/download.php
Before starting a sequence search, set the paths to the above files and binaries in the cascade.properties or cascade.properties-cust file.
C-HMM is precompiled with Java 7 (CascadeCUST.jar). If you are using an older version of Java, recompile the source code (CascadeCUST.tar) using the following steps:
- Download and extract Apache Ant
- export ANT_HOME="path to ANT directory"
- export JAVA_HOME="path to Oracle JAVA 7"
- Extract CascadeCust.zip
- cd CascadeCust
- /path/apache-ant-1.5.2/bin/ant
C-HMM can be run with the following commands:
- To run Cascade-HMM: java -jar CascadeCUST.jar cascade.properties
- To run Custom-HMM: java -jar CascadeCUST.jar cascade.properties-cust
C-HMM runs on Linux and macOS. Its memory requirement depends on the size of the sequence database, so we recommend using high-memory machines or clusters. It is a multithreaded program implemented in Java. Multithreading options (number of threads, number of CPUs per thread, maximum number of threads) can be set in the cascade.properties or cascade.properties-cust file.
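The run commands and the memory note above can be combined in a small wrapper. The following is a minimal shell sketch and is not part of the C-HMM distribution: the function name `run_chmm`, the `CHMM_JAR` variable, and the `8g` heap size are hypothetical choices for illustration, while `-Xmx` is the standard JVM option for the maximum heap size.

```shell
# Hypothetical wrapper around the documented "java -jar" command.
# $1 = properties file (cascade.properties or cascade.properties-cust)
# $2 = maximum JVM heap size (assumed default: 8g; size it to your database)
run_chmm() {
  props=$1
  heap=${2:-8g}
  jar=${CHMM_JAR:-CascadeCUST.jar}   # CHMM_JAR is an illustrative variable

  # Fail early if the properties file is missing.
  if [ ! -f "$props" ]; then
    echo "properties file not found: $props" >&2
    return 1
  fi

  # Fail early if no Java runtime is available on PATH.
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH" >&2
    return 2
  fi

  java -Xmx"$heap" -jar "$jar" "$props"
}

# Usage (not executed here):
#   run_chmm cascade.properties 16g
#   run_chmm cascade.properties-cust
```

Checking the inputs before launching avoids starting a long search that fails immediately; the appropriate heap size depends on the database and is left to the user.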
After a sequence search completes, C-HMM creates a separate directory for each generation. Each generation directory contains three result files:
- gen_#_result.out: This file has information about the hits captured in each generation.
- gen_#_connection.conn: The connection file stores information about hits and query sequences, together with the E-value at which each hit was captured. This helps in backtracing and identifying the intermediate sequences in each generation.
- commulative_result_seq_name.out: This file stores the cumulative unique hits from each generation.
If the user has opted for clustering of hits, each generation directory also contains a gen_#_result_nr.out.clstr file, which has information about the clustering of hits.
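To get a quick overview of a finished run, the per-generation result files can be located by name. The sketch below is a hedged illustration: the function name `count_result_lines` is hypothetical, and because the internal format of gen_#_result.out is not described here, it reports plain line counts, which are not necessarily hit counts.

```shell
# Print "<line count><TAB><path>" for every gen_#_result.out file
# found under the given output directory.
# $1 = C-HMM output directory
count_result_lines() {
  find "$1" -type f -name 'gen_*_result.out' | sort |
  while IFS= read -r f; do
    # Arithmetic expansion strips the padding some wc implementations add.
    n=$(( $(wc -l < "$f") ))
    printf '%s\t%s\n' "$n" "$f"
  done
}

# Usage (not executed here):
#   count_result_lines /path/to/output
```

The name pattern matches only the result files, so the connection files and the .clstr clustering files are skipped.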
C-HMM provides many user-defined options, which are set in the property file. All the options available in the property files are explained below:
- cascade.maxGeneration: The maximum number of generations for which the user wants to run C-HMM.
- cascade.evalueCutoff: The E-value cutoff. The E-value describes the number of hits one can expect to see by chance when searching a database. Its default value is 10^-3 (refer to the BLAST manual for more details).
- cascade.subjectFilters: Defines the h-value and length filter criteria. The h-value is an inclusion threshold for profile generation; hits that do not meet the h-value threshold are not considered for profile generation. Its default value is 10^-3. The length filter defines the required alignment length between the query and the hit sequence; each hit sequence length must be higher than this threshold. Its default value is 75%.
- cascade.perGenerationIteration: The number of Jackhmmer iterations to be performed in each generation. Its default value is 4.
- cascade.clusteringCommandPrototype/cascade.custJack.custJackClust: These commands define the clustering parameters of CD-HIT. The user can set a sequence identity threshold for protein sequence clustering (refer to the CD-HIT documentation for other options).
- cascade.maxQueries: The maximum number of hits used to initiate the next generation of C-HMM. To include all the collected hits, leave this option blank. By default, C-HMM uses 1000 random hits.
- cascade.continuation: Restarts cascade searches from the results of a previous generation. By default this option is set to "no". To continue sequence searches, set it to "yes".
- cascade.continuedGeneration: The generation from which to restart sequence searches. For example, to start from the third generation, set cascade.continuedGeneration=3.
- cascade.continuationPrevGenOutput: The path to the previous generation's output files. For example, to restart the third generation, give the path to the output of the second generation.
- cascade.continuationExistingHitFile: The path to the "commulative_result_seq_name.out" file of the previous generation.
- cascade.inputFileExtension: The extension of the input file. By default it is filename.query.
- cascade.outputFileExtension: The extension of the output file. By default it is filename.out.
- cascade.connectionFileExtension: The extension of the connection file, which stores information about corresponding hit and query sequences.
- cascade.inputDirectory: The path to the input directory.
- cascade.outputDirectory: The path to the output directory.
- cascade.seqDatabase: The path to the protein sequence database. The database should be in FASTA format. Please see the sample database for reference.
- cascade.binaryPath.- : The paths to the different binaries used by C-HMM.
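Before launching a search, it can help to confirm which values are actually set in the property file. The following is a minimal shell sketch, assuming the file uses simple `key=value` lines as in the cascade.continuedGeneration=3 example above; the function name `prop_value` is hypothetical, and other Java properties syntax (such as `:` separators or line continuations) is not handled.

```shell
# Print the value of one key from a key=value properties file.
# $1 = properties file, $2 = key (for example cascade.maxGeneration)
prop_value() {
  awk -v key="$2" '
    /^[ \t]*#/ { next }                 # skip comment lines
    {
      pos = index($0, "=")
      if (pos == 0) next                # not a key=value line
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)    # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)    # trim whitespace around the value
      if (k == key) value = v           # the last definition wins
    }
    END { if (value != "") print value }
  ' "$1"
}

# Usage (not executed here):
#   prop_value cascade.properties cascade.maxGeneration
#   prop_value cascade.properties cascade.continuation
```

An empty result means the key is absent or left blank, which for cascade.maxQueries is a meaningful setting (include all collected hits).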
--------------------------------------X Happy Cascading X------------------------------------------------------