Specifying lists of organisms to cluster

The concept of a cluster group

The clustering part of the database-building process uses the concept of a "cluster group" to specify sets of organisms from which to pull proteins for clustering. Any subset of organisms in your ITEP database can be specified in a cluster group. Here are some examples of what you might want to do with this:

Separately analyze complete and draft genomes
Separately analyze organisms that were isolated from different environments
Separately analyze phylogenetic clades (e.g. species or genera)

Of course, once you have analyzed the groups separately, you can also combine them into a single analysis and run comparisons to identify what is newly discovered in the subset groups.

The "all" cluster group

The database-building scripts automatically build a "groups" file in $root that includes a group called "all", which consists of all of the organisms you have downloaded and formatted for input into the database. Therefore, if all you want is the results for everything you have downloaded, you don't need to do anything further.

How to get a list of organism names

Organism names can have any characters except semicolons or quotes in them (scripts that require them to be sanitized do so automatically).

An "organisms" file is generated automatically by the first database-building script (setup_step1.sh), but it can also be generated separately by running (from $root):

$ ./generateOrganismFileFromGbk.sh

(Note that, as with all ITEP scripts, you must source the SourceMe.sh file before this can be run successfully.) Running the script creates a file called "organisms" in $root. Its contents look like this:

Clostridium beijerinckii NCIMB 8052     290402.1
Clostridium novyi NT    386415.1
Acetobacterium woodii DSM 1030  931626.1

The first column contains the organism name and the second contains the organism ID (taxid.versionnum).
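
As an illustration of the two-column layout described above, here is a minimal Python sketch that reads "organisms"-style text into (name, ID) pairs. The function name parse_organisms is hypothetical and not part of ITEP. The sketch assumes the ID is the last whitespace-separated field on each line and contains no spaces itself, which holds for the sample shown above.

```python
def parse_organisms(text):
    """Return a list of (organism_name, organism_id) tuples.

    Hypothetical helper for illustration; not an ITEP function.
    """
    organisms = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Organism names contain spaces, so split only on the LAST run of
        # whitespace (assumption: the ID itself has no whitespace).
        name, org_id = line.rsplit(None, 1)
        organisms.append((name, org_id))
    return organisms


sample = (
    "Clostridium beijerinckii NCIMB 8052\t290402.1\n"
    "Clostridium novyi NT\t386415.1\n"
    "Acetobacterium woodii DSM 1030\t931626.1\n"
)
organisms = parse_organisms(sample)
for name, org_id in organisms:
    # The ID has the form taxid.versionnum
    taxid, version = org_id.split(".")
    print(name, taxid, version)
```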

Adding new groups by string-matching organism names

If the organisms you want to group can be identified by a small set of strings in their names (e.g. Clostridium), you can automatically create a new group and add it to the groups file using the addGroupByMatch.py script:

$ ./addGroupByMatch.py -n "Clostridia" "Clostridium"

This creates a group called "Clostridia" and adds all organisms whose name matches "Clostridium" (not case-sensitive).
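
To make the matching rule concrete, the following Python sketch shows one way case-insensitive substring matching could work. It is only an illustration under stated assumptions, not the code inside addGroupByMatch.py: the function match_organisms is hypothetical, and the order of the returned names (input order) is an assumption.

```python
def match_organisms(names, patterns):
    """Return the names that contain any of the given strings, ignoring case.

    Hypothetical helper for illustration; not an ITEP function.
    """
    lowered = [p.lower() for p in patterns]
    # Keep a name if at least one search string occurs anywhere inside it.
    return [n for n in names if any(p in n.lower() for p in lowered)]


names = [
    "Clostridium beijerinckii NCIMB 8052",
    "Clostridium novyi NT",
    "Acetobacterium woodii DSM 1030",
]
# Lowercase search string still matches, because matching ignores case.
clostridia = match_organisms(names, ["clostridium"])
print(";".join(clostridia))
```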

You can give any number of strings to match. For example, we could also create a group containing only A. woodii and C. novyi by calling:

$ ./addGroupByMatch.py -n "woodii_novyi" "woodii" "novyi"

If you look at the "groups" file in $root, you should now see the following two lines:

Clostridia      Clostridium beijerinckii NCIMB 8052;Clostridium novyi NT
woodii_novyi    Acetobacterium woodii DSM 1030;Clostridium novyi NT

As you can see, the groups file is tab-delimited with two columns:

The name of the group (e.g. "all") and
A semicolon-delimited list of organism names
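
The groups file format described above can be read with a few lines of Python. This is a minimal sketch; parse_groups is a hypothetical name, and the code assumes every non-blank line has exactly one tab separating the group name from the member list.

```python
def parse_groups(text):
    """Map each group name to its list of organism names.

    Hypothetical helper for illustration; not an ITEP function.
    """
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Column 1: group name. Column 2: semicolon-delimited organism names.
        group_name, members = line.split("\t", 1)
        groups[group_name] = members.split(";")
    return groups


sample = (
    "Clostridia\tClostridium beijerinckii NCIMB 8052;Clostridium novyi NT\n"
    "woodii_novyi\tAcetobacterium woodii DSM 1030;Clostridium novyi NT\n"
)
groups = parse_groups(sample)
for group_name, members in groups.items():
    print(group_name, len(members), members)
```

Because organism names may not contain semicolons, splitting the second column on ";" is unambiguous.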

Adding new groups by pattern-matching organism names

You can use regular expressions to search for organism names that match specific patterns. To do this, use the -r option of addGroupByMatch.py. For example, the following would create a group containing all organisms whose names start with "Clos":

$ ./addGroupByMatch.py -r "^Clos"
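
To preview which organism names a pattern would select, you can try it with Python's re module. This is a sketch only: match_by_regex is a hypothetical helper, and the assumptions that the script searches anywhere in the name (as re.search does) and is case-sensitive in this mode are not confirmed by this page.

```python
import re


def match_by_regex(names, pattern):
    """Return the names in which the regular expression matches somewhere.

    Hypothetical helper for illustration; not an ITEP function.
    """
    regex = re.compile(pattern)
    # re.search scans the whole name; "^" anchors the match to the start.
    return [n for n in names if regex.search(n)]


names = [
    "Clostridium beijerinckii NCIMB 8052",
    "Clostridium novyi NT",
    "Acetobacterium woodii DSM 1030",
]
starts_with_clos = match_by_regex(names, "^Clos")
print(starts_with_clos)
```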

Adding new groups manually

You can also create a group manually by adding a line to the "groups" file: put the group name in the first column and a semicolon-delimited list of organism names (which must exactly match the names in the "organisms" file) in the second column. This method is useful, for example, for adding a small number of organisms to an existing group or removing them from it. A couple of rules are enforced to keep group definitions sane:

You can have only one group with a given set of organisms (you can't give the same set multiple names).
You cannot use the same group name more than once.
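
Before editing the groups file by hand, it can help to check these two rules yourself. The Python sketch below does so for a list of (name, members) pairs. The function check_groups is hypothetical, and treating two groups as "the same set" regardless of member order is an assumption about how ITEP compares them.

```python
def check_groups(groups):
    """Return a list of rule violations for (name, members) pairs.

    Hypothetical helper for illustration; not an ITEP function.
    """
    problems = []
    seen_names = set()
    seen_sets = {}  # frozenset of members -> first group name using it
    for name, members in groups:
        # Rule: a group name may be used only once.
        if name in seen_names:
            problems.append("duplicate group name: " + name)
        seen_names.add(name)
        # Rule: a given set of organisms may belong to only one group.
        # Assumption: member order does not matter.
        key = frozenset(members)
        if key in seen_sets:
            problems.append(name + " has the same organisms as " + seen_sets[key])
        else:
            seen_sets[key] = name
    return problems


groups = [
    ("Clostridia", ["Clostridium beijerinckii NCIMB 8052", "Clostridium novyi NT"]),
    ("woodii_novyi", ["Acetobacterium woodii DSM 1030", "Clostridium novyi NT"]),
    ("clos_copy", ["Clostridium novyi NT", "Clostridium beijerinckii NCIMB 8052"]),
]
for problem in check_groups(groups):
    print(problem)
```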
