Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Obtaining a list of bidirectional best blast hits

mattb112885 edited this page Sep 3, 2013 · 6 revisions

Bidirectional-best blast hits (BBH) are two proteins in different organisms that, when BLAST is run against all proteins in the other organism, hit each other as the best hit. That a pair of proteins is a bidirectional-best BLAST hit is often used as evidence that the two proteins are orthologs (having a common ancestor without duplication).

Getting a list of bidirectional best blast hits for all or a set of organisms in ITEP

ITEP currently includes a single script to compute the bidirectional-best hits for a set of organisms in the database: db_bidirectionalBestHits.py. The script supports different metrics for what is considered the "best" hit:

  • Best e-value (default, and what people usually use)
  • Highest minbit score
  • Highest maxbit score

To get a list of ALL BBH with ITEP gene IDs run:

$ db_bidirectionalBestHits.py > all_bbh

This is the last two lines of the file:

$ cat all_bbh | tail -2
fig|931626.1.peg.1355   fig|386415.1.peg.951    Clostridium novyi NT    59.5229 58.5229
fig|386415.1.peg.1323   fig|386415.1.peg.1323   Clostridium novyi NT    200.0000        200.0000

The columns are: query gene, target gene, target organism, score (log-E value) from query to target and score from target to query. (Note - the exact scores might change from BLAST version to version)

To specify minbit or maxbit scores, just use the -m flag (-m minbit or -m maxbit)

You can also specify subsets of organisms from which to get bidirectional best hits with two different methods. If you have already specified a cluster group and run clustering, you can specify a run ID with the -r flag to restrict calculation of BBH to only organisms in that cluster run. Alternatively, you can generate a list of organisms by any other means and use the -f flag to pass that list of organisms into the function. For example, doing this would generate bidirectional best hits for Clostridium species only.

$ cat organisms | grep Clostridium | db_getBidirectionalBestHits.py -f -

Converting the gene IDs to locus tags

It might be desired to use locus tags instead of gene IDs for all of these bidirectional best hits. If this is the case for you, use the replaceGeneNamesWithAliases.py function to turn them into aliases (you will need a regex that locus tags match in your organisms - e.g. Cbei_\d+ for C. beijerinckii, NT01CX_\d+ for C. novii NT, and Awo_c\d+ for A. woodii). Prepare a table (using the aliases file) containing only these IDs:

$ cat $root/aliases/aliases | grep -P "(Cbei_\d+)|(NT01CX_\d+)|(Awo_c\d+)" > locus_tags

Then use it as input to the replaceGeneNamesWithAliases.py function. The same two lines as shown above for the bidirectional-best hits are repeated below after replacement.

$ cat all_bbh | replaceGeneNamesWithAliases.py -n locus_tags | tail -2
Awo_c0013055    NT01CX_087651   Clostridium novyi NT    59.5229 58.5229
NT01CX_0868323  NT01CX_0868323  Clostridium novyi NT    200.0000        200.0000

Note that this same function can be used with ANY script that has gene IDs in it. However, note that if you do this you will no longer be able to link those IDs to other ITEP analyses.

Clone this wiki locally