Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Searching for gene families by presence and absence patterns

mattb112885 edited this page Sep 10, 2016 · 9 revisions

Fetching pre-computed presence and absence patterns

The database stores a pre-computed table of which genes are present and absent in each organism for each of the clusters in the database. To access this table (or subsets of it) use the db_getPresenceAbsenceTable.py function. Just calling the function without any arguments prints out the entire table (with every cluster run and every cluster). Only the first two rows are shown in the example below:

$ db_getPresenceAbsenceTable.py
runid	clusterid	annote	Clostridium_beijerinckii_NCIMB_8052	Clostridium_novyi_NT	Acetobacterium_woodii_DSM_1030
all_I_2.0_c_0.4_m_maxbit	1424	hypothetical protein_YP_001311274.1_Cbei_4208	fig|290402.1.peg.4128	NONE	fig|931626.1.peg.2672

As you can see, by default, the function prints out the ITEP IDs in each organism for each cluster - to instead get Boolean values (0 for absent and 1 for present) just call the function with the -b flag:

$ db_getPresenceAbsenceTable.py -b
runid	clusterid	annote	Clostridium_beijerinckii_NCIMB_8052	Clostridium_novyi_NT	Acetobacterium_woodii_DSM_1030
all_I_2.0_c_0.4_m_maxbit	1424	hypothetical protein_YP_001311274.1_Cbei_4208	1	0	1

The function also lets you print out the number of gene representatives in each cluster\run pair for each organism, or pull out subsets by run ID or by cluster ID (see help text for details). Finally, if you have an organism tree with sanitized organism names as leaf names (see Building a concatinated gene tree for how to build such a tree), you can tell the program to sort the columns in a way that makes sense in light of the tree's branching order (so that for example organisms of the same species will group together) using the -t flag.

Types of presence and absence patterns that you can search with ITEP

The script db_findClustersByOrganismList.py supports finding clusters with different combinations of four bulk properties, ALL, ANY, ONLY, and NONE, with respect to a given list of organisms (the "ingroup") and the rest of the organisms in a specific cluster run (the "outgroup"). The below table specifies the possible combinations and what they represent.

  Property  | Ingroup |  Outgroup | Meaning
+-----------+---------+-----------+----------
  ALL       |  == N   |    >= 0   | Conserved genes in the specified list
+-----------+---------+-----------+----------
  ANY       |  >= 1   |    >= 0   | Genes present in the specified list
+-----------+---------+-----------+----------
  ONLY      |  >= 1   |    == 0   | Genes unique to the specified list
+-----------+---------+-----------+----------
  NONE      |  == 0   |    >= 1*  | Genes absent from the specified list
+-----------+---------+-----------+-----------
 ALL + ONLY |  == N   |    == 0   | Genes that are conserved and found only in the specified list
+-----------+---------+-----------+-----------
 ANY + ONLY |  >= 1   |    == 0   | Genes that are found only in the specified list
+-----------+---------+-----------+-----------
 ALL + NONE |
 ANY + NONE | Contradictions (raise errors).
 ONLY + NONE|
+-----------+---------+-----------

You can also specify a UNIQUE property which enforces that the members in the ingroup must be unique.

Examples with a specific list of organisms

Create a file with the two lines in it:

Clostridium beijerinckii NCIMB 8052
Clostridium novyi NT

Save it as "Clostridia_names.txt". Then run the following:

$ cat "Clostridia_names.txt" | db_findClustersByOrganismList.py -a -s all_I_2.0_c_0.4_m_maxbit

You get a list of all of the genes clusters that are found in both of our Clostridia species in our test database (-a) but not in Acetobacterium woodii (-s). By doing a line count (wc -l) we find that 568 families have this property including the following pair:

all_I_2.0_c_0.4_m_maxbit        996

We can find out what genes are in this cluster by using the db_getGenesInClusters.py function, or to find all of their functions and sequences directly use the db_getClusterGeneInformation.py function:

$ makeTabDelimitedRow.py "all_I_2.0_c_0.4_m_maxbit" "996" | db_getClusterGeneInformation.py
fig|290402.1.peg.3932	Clostridium beijerinckii NCIMB 8052	290402.1	290402.1	290402.1.NC_009617.1	4607337	4605538	-	-1	methyl-accepting chemotaxis sensory transducer_YP_001311078.1_Cbei_4012	ATGAAAG...    MKGTVVS... all_I_2.0_c_0.4_m_maxbit	996
fig|386415.1.peg.124	Clostridium novyi NT	386415.1	386415.1	386415.1.NC_008593.1	120602	122401	+	1	methyl-accepting chemotaxis protein_YP_877097.1_NT01CX_1000	ATGAAAG....    MKGTIVAT... all_I_2.0_c_0.4_m_maxbit	996

Where the DNA and amino acid sequences have been truncated for clarity. Note that as expected both C. beijerinckii and C. novyi are predicted to have this gene, but Acetobacterium is not. We could perform clustering at lower cutoffs (0.4 is rather stringent) and perform similar queries to see if this prediction holds up.

Finding presence and absence across an organism tree

This section requires you to have a tree with sanitized organism names on the leaves. A sanitized ID has everything that is not a letter or number in the organism's name (including spaces) replaced with an underscore. See the Building a concatinated gene tree tutorial for more details on how to generate such a file.

We have provided a function "makeCoreClusterAnalysisTree.py" that performs presence-absence analysis for every clade in a tree using the same methods as shown above. It produces the number of gene families with the specified properties and displays them on a tree. Optionally, it also produces an Excel file containing the runID\cluster ID pairs and a representative annotation (the one with the highest number of genes) for each cluster identified at each node in the tree (the names of the sheets correspond to the node labels on the tree), allowing quick identification of interesting gain and loss patterns at different stages of evolution.

Clone this wiki locally