This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Building your database 2 mcl clustering

mattb112885 edited this page Apr 22, 2014 · 11 revisions

NOTE: You should perform at least [step 1](Building your database 1 - BLASTP and BLASTN) of building the database before proceeding to this step. That step runs the prerequisite all-vs-all BLASTP and stores the results in the SQLite database.

These directions apply to running MCL clustering. If you want to use a different program (e.g. orthoMCL) for clustering, we also support importing those results, provided they are available in an appropriate format; see the directions for importing clustering results for details.

Before running MCL clustering, make sure you have set up any groups of organisms that you want to cluster (see Specifying lists of organisms to cluster). By default, clustering is run on all organisms you have downloaded and imported into ITEP, but using those directions you can specify arbitrary subsets of organisms to cluster.

Performing a cluster run

To run MCL clustering, run the following command from the directory in which the script is found (it will NOT work if you run it from another directory):

./setup_step2.sh inflation_value scoring_criteria score_cutoff

In this tutorial we will mostly use the following settings for the sake of illustration:

./setup_step2.sh 2.0 maxbit 0.4

The inflation value is a number larger than 1 that controls the granularity of the clusters. The MCL default is 2.0, so you can specify that if you don't want to tune it. The scoring criterion is computed from the BLAST hits (obtained from setup_step1.sh) and can be "minbit" or "maxbit": "minbit" is the bit score divided by the minimum of the self-bit scores of the query and target genes, while "maxbit" is the bit score divided by the maximum of the self-bit scores. The "minbit" criterion emphasizes strong hits over the entirety of the smaller protein (so it can pick up pseudogenes but is less sensitive to events such as gene fusions); the "maxbit" criterion emphasizes strong hits over the entire query and target proteins. A typical cutoff for a single genus is 0.4, but you should experiment with it and look at the score distributions to see what is appropriate for the protein families you are interested in studying.
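The two scoring criteria can be illustrated with a minimal Python sketch. This is not the ITEP implementation; the function names are hypothetical, and treating the cutoff as inclusive (score >= cutoff) is an assumption.

```python
def normalized_bit_score(bit_score, query_self_bit, target_self_bit, metric):
    """Scale a BLAST bit score by the self-bit scores of query and target.

    Hypothetical helper for illustration only (not an ITEP function).
    """
    if metric == "minbit":
        # Divide by the smaller self-bit score: rewards a strong hit
        # that covers the whole of the smaller protein.
        denominator = min(query_self_bit, target_self_bit)
    elif metric == "maxbit":
        # Divide by the larger self-bit score: rewards a strong hit
        # that covers both proteins in their entirety.
        denominator = max(query_self_bit, target_self_bit)
    else:
        raise ValueError("metric must be 'minbit' or 'maxbit'")
    return bit_score / denominator


def passes_cutoff(bit_score, query_self_bit, target_self_bit, metric, cutoff):
    """Return True if the normalized score reaches the cutoff.

    Assumption: the comparison is inclusive (>=).
    """
    score = normalized_bit_score(bit_score, query_self_bit, target_self_bit, metric)
    return score >= cutoff


# Made-up numbers: a short protein (self-bit 200) hitting a long one
# (self-bit 1000) with a bit score of 180.
short_vs_long_minbit = normalized_bit_score(180.0, 200.0, 1000.0, "minbit")
short_vs_long_maxbit = normalized_bit_score(180.0, 200.0, 1000.0, "maxbit")
```

With these made-up numbers, minbit gives 0.9 and maxbit gives 0.18, so at a cutoff of 0.4 the pair would be kept under minbit but dropped under maxbit. This is the kind of difference to look for when comparing score distributions.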

The same parameters are used to run MCL with every group in the groups file if there is more than one group present there.

The setup_step2.sh script performs these tasks:

  1. Running MCL clustering on each cluster group (group of organisms) with the specified parameters.
  2. Reformatting the clustering output files to assign each cluster to its run ID and a numeric cluster ID.
  3. Building a presence-absence table for all organisms in the cluster group based on the clustering results.
  4. Importing the calculation results into the ITEP SQLite database.
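Tasks 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in setup_step2.sh: it assumes MCL's label output with one cluster per line and whitespace-separated gene IDs, numbers clusters from 1 in file order, and relies on a hypothetical gene-to-organism dictionary that you would have to build yourself.

```python
def mcl_lines_to_rows(lines, run_id):
    """Turn MCL cluster lines into (run_id, cluster_id, gene_id) rows.

    Assumptions: one cluster per line, members separated by whitespace,
    cluster IDs numbered from 1 in the order the lines appear.
    """
    rows = []
    cluster_id = 0
    for line in lines:
        genes = line.split()
        if not genes:
            continue  # skip blank lines
        cluster_id += 1
        rows.extend((run_id, cluster_id, gene) for gene in genes)
    return rows


def presence_absence(rows, gene_to_organism):
    """Build {(run_id, cluster_id): {organism: 0 or 1}} from the rows.

    gene_to_organism is a hypothetical lookup from gene ID to organism.
    """
    organisms = sorted(set(gene_to_organism.values()))
    table = {}
    for run_id, cluster_id, gene in rows:
        # Start every cluster with all organisms marked absent.
        row = table.setdefault((run_id, cluster_id), dict.fromkeys(organisms, 0))
        row[gene_to_organism[gene]] = 1
    return table


# Made-up input: three clusters over three organisms A, B and C.
example_lines = ["geneA1\tgeneB1\tgeneC1", "geneA2\tgeneB2", "geneC2"]
example_map = {
    "geneA1": "A", "geneA2": "A",
    "geneB1": "B", "geneB2": "B",
    "geneC1": "C", "geneC2": "C",
}
example_rows = mcl_lines_to_rows(example_lines, "all_I_2.0_c_0.4_m_maxbit")
example_table = presence_absence(example_rows, example_map)
```

In this made-up example the second cluster is present in organisms A and B and absent in C. The real table built by ITEP may carry more information than a 0/1 flag.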

Differentiating multiple cluster runs

In ITEP you are allowed to store results from multiple different clustering methods in the same database, and some scripts are able to compare results from multiple different methods. This is useful for identifying sensitivity to parameters or evaluating different clustering algorithms for particular protein families. The different methods that you use are distinguished in the database by giving each distinct set of clustering results a Run ID. Many functions that extract clustering results require you to specify a run ID so that the database knows which set of clustering results to refer to.

When using ./setup_step2.sh to create a cluster run, the run ID has the following form:

groupID_I_inflation_c_cutoff_m_metric

groupID is the ID for the group of organisms from the "groups" file, inflation is the inflation parameter, cutoff is the score cutoff and metric is the homology metric (minbit or maxbit); the letters I, c and m are literal separators. If a particular cluster run already exists when setup_step2.sh is called, it is skipped and the script moves on to the next group.
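If you need to build or take apart run IDs in your own scripts, a minimal Python sketch could look like the following. The helper names are hypothetical (they are not part of ITEP). The sketch assumes the parameter values appear in the run ID exactly as typed on the command line and that the inflation, cutoff and metric fields contain no underscores, while the group ID may contain them.

```python
import re

# Group IDs may contain underscores (e.g. woodii_novyi), so the group is
# matched greedily and the three trailing fields are forbidden from
# containing underscores.
RUN_ID_PATTERN = re.compile(
    r"^(?P<group>.+)_I_(?P<inflation>[^_]+)_c_(?P<cutoff>[^_]+)_m_(?P<metric>[^_]+)$"
)


def make_run_id(group_id, inflation, cutoff, metric):
    """Assemble a run ID; pass values as strings to keep their formatting."""
    return f"{group_id}_I_{inflation}_c_{cutoff}_m_{metric}"


def parse_run_id(run_id):
    """Split a run ID into its four fields, returned as a dict of strings."""
    match = RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError(f"not a setup_step2.sh-style run ID: {run_id!r}")
    return match.groupdict()


example_id = make_run_id("woodii_novyi", "2.0", "0.4", "maxbit")
example_fields = parse_run_id(example_id)
```

Run IDs imported from other clustering programs need not follow this pattern, so parse_run_id raises a ValueError for anything that does not match.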

If you use other methods for clustering and import them into the database, you can either specify a run ID in a 3-column table format or use MCL format and allow the toolkit to automatically assign a run ID based on the file name.

Run IDs and cluster IDs

Many of the ITEP scripts require as input a (runID, clusterID) pair (as two columns in an input file). The run ID is described above. A cluster ID is an integer assigned to each cluster in the order they are imported into the database. Any time a script returns cluster IDs, it provides the corresponding run IDs as well. In turn you must provide both of these values (in a tab-delimited row) to get information about a particular cluster.

If you know a run ID and a cluster ID that you are interested in and want to make a tab-delimited row, we have provided a convenient way to do that:

$ makeTabDelimitedRow.py all_I_2.0_c_0.4_m_maxbit 1
all_I_2.0_c_0.4_m_maxbit      1

The results of this can then be piped into the commands that require both a runID and a clusterID to analyze a cluster.

How to get a list of cluster runs

A list of run IDs for the clustering results currently imported into ITEP is always available via the db_getAllClusterRuns.py script. For example, if we use the three groups we generated in the prior tutorial to run clustering with an inflation value of 2.0, a cutoff of 0.4 and the maxbit scoring criterion, we get the following list of run IDs:

$ db_getAllClusterRuns.py
Clostridia_I_2.0_c_0.4_m_maxbit
all_I_2.0_c_0.4_m_maxbit
woodii_novyi_I_2.0_c_0.4_m_maxbit

This is a good function to keep in mind for many scripts that require a run ID as input.
