This repository has been archived by the owner on Feb 16, 2019. It is now read-only.
mattb112885 edited this page Sep 10, 2016 · 62 revisions

Welcome to the clusterDbAnalysis (ITEP) wiki!

This wiki is intended to show how to perform some useful comparative analyses using the ITEP tools. For detailed documentation for each Python script, please type [nameofscript] -h or refer to the docs/ folder of ITEP.

  1. Introduction to ITEP
  2. Quick start guide and overview
  3. How to get help

Installation

You have two options for installation: using a virtual machine (recommended, because it is easier to use) or installing directly on your own machine (harder, but more flexible and more powerful). The virtual machine runs on any operating system (Linux, Mac OS, Windows), while direct installs work only on Linux.

  1. Using the ITEP virtual machine
  2. Installing ITEP on your machine

Building an ITEP database (with examples)

Follow all of these directions to build a complete ITEP database. We have already done these steps for you in the virtual machine's "tutorial" copy of ITEP, so you can follow the rest of the tutorial without performing them yourself. However, you must perform them to build a new ITEP database with your own set of organisms, so it is still important to read these directions and understand how the example database was built. Note that not all of these steps are required for all analyses (e.g. if you aren't interested in looking at conserved domain architecture, you don't need to run "Building your database 4 - RPSBLAST").

Before running any ITEP scripts, make sure you source the SourceMe.sh file (in the root directory of the repository) to set up paths correctly (note that if you move ITEP's directory you will need to re-source):

$ source SourceMe.sh
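As a quick sanity check after sourcing, the sketch below tests whether ITEP scripts can be found on your PATH. It assumes that SourceMe.sh makes the scripts available as executables on PATH, and it must be run from the same shell session in which you sourced the file. The helper name missing_from_path is hypothetical and not part of ITEP.

```python
import shutil


def missing_from_path(script_names):
    """Return the script names that cannot be found on the current PATH."""
    return [name for name in script_names if shutil.which(name) is None]


if __name__ == "__main__":
    # db_getAllClusterRuns.py is a script named on this page; add others as needed.
    missing = missing_from_path(["db_getAllClusterRuns.py"])
    if missing:
        print("Not on PATH (did you source SourceMe.sh?):", ", ".join(missing))
    else:
        print("ITEP scripts found on PATH.")
```

If the script is reported as missing, re-source SourceMe.sh from ITEP's current location and try again.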

The setup_stepXXX scripts must be run from the root directory of the repository. See directions in the individual steps below for details.

  1. How to import genomes and format them for use with ITEP
  2. Specifying lists of organisms to cluster
  3. Building your database 1 - BLASTP and BLASTN
  4. Building your database 2 - MCL Clustering
    • Use these directions to use MCL directly to build protein families
    • Includes support for different similarity metrics, cutoffs and inflation parameters
  5. Building your database 2 - Importing results from other clustering methods into ITEP
    • Use these directions if you want to use something other than MCL to make protein clusters
    • Includes directions for using our OrthoMCL wrapper or directly clustering Bidirectional Best BLAST hits
  6. Building your database 3 - Contig import
  7. Building your database 4 - RPSBLAST

Comparative genomics with ITEP

These tutorials are roughly in the order in which they should be followed if you want a full taste of the capabilities of ITEP.

Before running any ITEP scripts in the tutorials below, make sure you source the SourceMe.sh file (in the root directory of the repository) to set up your paths. This will ensure that you can run the ITEP scripts from anywhere on your machine.

$ source SourceMe.sh

Note: If you follow these tutorials with your own install of ITEP (rather than the VM), the exact cluster IDs could vary slightly due to E-value differences between BLAST versions, possible changes in the ordering of outputs, etc. However, the gene IDs and their corresponding information should always be exactly the same for the same input.
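Because cluster IDs may differ between installs while gene IDs stay the same, one way to line up clusters from two installs is to match them by gene membership rather than by ID. The following is a minimal sketch of that idea, not an ITEP tool: the function name match_clusters and all IDs in the example are hypothetical placeholders, and the input is assumed to be a plain mapping from cluster ID to gene IDs that you have built yourself.

```python
def match_clusters(run_a, run_b):
    """Pair cluster IDs from two runs whose gene memberships are identical.

    Each argument maps a cluster ID to an iterable of gene IDs.
    Returns a dict: cluster ID in run_a -> matching cluster ID in run_b,
    or None when no cluster in run_b has exactly the same genes.
    """
    by_members = {frozenset(genes): cid for cid, genes in run_b.items()}
    return {cid: by_members.get(frozenset(genes)) for cid, genes in run_a.items()}


# Hypothetical data: these IDs are placeholders, not real ITEP output.
vm_run = {"1": ["geneA", "geneB"], "2": ["geneC"]}
local_run = {"7": ["geneC"], "9": ["geneB", "geneA"]}
print(match_clusters(vm_run, local_run))  # {'1': '9', '2': '7'}
```

A result of None for a cluster means its membership itself differs between the two runs, which goes beyond a simple renumbering.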

  1. Entry points into the database (read this first!)
  2. Searching for genes by gene properties
  3. Searching for genes by homology with other genes
  4. Obtaining information about genes
  5. Obtaining the complete sequences of contigs, genes or proteins
  6. Extracting DNA and amino acid sequences from a region of a genome, gene or protein
  7. Building alignments and trees
  8. Analyzing gene neighborhoods
  9. Searching for gene families by presence and absence patterns
  10. Visualizing homology patterns
  11. Building a concatenated gene tree
  12. Generating draft metabolic reconstructions from a reference
  13. Searching for missing genes and identifying causes for absence with tBLASTn
  14. Identifying the upstream regions of homologous proteins
  15. Searching for functions using conserved domains
  16. Adding user-defined gene data to ITEP
  17. Obtaining a list of bidirectional-best BLAST hits
  18. Turning ITEP IDs into human-readable formats
  19. Comparing the results of different clustering approaches
  20. Using the Graphical User Interface

Script and library documentation

All scripts come with complete help text - to see what a script does, use the -h flag. For example:

$ db_getAllClusterRuns.py -h
Usage: db_getAllClusterRuns.py > run_id_list
Return list of all run IDs from the database
Options:
    -h, --help  show this help message and exit
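If you want to collect help text programmatically (for example, to search it for a keyword), a small wrapper around the -h flag is enough. The sketch below assumes the script is on your PATH after sourcing SourceMe.sh. The helper name help_text is hypothetical and not part of ITEP.

```python
import subprocess


def help_text(script_name):
    """Run `<script_name> -h` and return everything it prints."""
    result = subprocess.run(
        [script_name, "-h"],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=False,          # do not raise if the script exits with a non-zero status
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    try:
        print(help_text("db_getAllClusterRuns.py"))
    except FileNotFoundError:
        print("Script not found on PATH (did you source SourceMe.sh?)")
```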

Other tools building upon ITEP

Other users have contributed scripts and tools that use the ITEP database as a foundation for other analyses. Links to them are provided here for reference; see the documentation for the individual tools for further details.

  • Pop genome: Includes some scripts to use ITEP in combination with other tools to perform population genomics analysis. The scripts are written in Perl.

Development and administration

  1. Managing multiple ITEP instances on a machine
  2. Adding and removing genomes from existing ITEP databases
  3. Cleaning up and reclaiming disk space
  4. ITEP architecture
    • Includes an overview of how the ITEP code is arranged.
  5. ITEP ID standards
  6. Data format standards
    • Standard inputs and outputs for the ITEP toolkit.
  7. ITEP data limitations
  8. Known issues

Unclassified / unfinished tutorials

  1. Visualizing trees with user-specified data