pachterlab/metakallisto

These Python scripts allow metagenomic sequence data to be analyzed with kallisto, a fast and accurate RNA-Seq abundance estimator. Both taxa identification and abundance estimation can be performed at the exact-genome level, as demonstrated in our paper "Pseudoalignment for metagenomic read assignment".

## Prerequisites

The full pipeline as shown in our paper requires the genome distance estimator Mash, the RNA-Seq abundance estimator kallisto, and Python 2.7+.

The scripts also require the Python modules NumPy, SciPy, and Biopython.
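
Before running the scripts, it can help to confirm that the three modules import cleanly. The following is a minimal sketch and not part of the repository: the helper name `missing_modules` is hypothetical, and it assumes the usual import names `numpy`, `scipy`, and `Bio` (Biopython).

```python
import importlib


def missing_modules(names=("numpy", "scipy", "Bio")):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError on Python 3 as well
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are importable.")
```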

## Usage

compare_metagenomic_results_to_truth.py compares the output of a metagenomic analysis tool (kallisto, Kraken, CLARK, GASiC, or eXpress) to the ground truth of the Illumina 100 metagenomic dataset used in our paper. It is run from the command line:

python compare_metagenomic_results_to_truth.py [filename] [program (default kallisto)] 
						<-g show graphs?> <-s save graphs?> <--taxa level (defaults to all)>

positional arguments:
  filename              Output file of metagenomic analysis tool
  program               Source program that created output. Valid options are:
                        kallisto, kraken, clark, gasic, express. Defaults to
                        kallisto.

optional arguments:
  -h, --help            show this help message and exit
  -g, --show-graphs     Display graphs of calculated errors
  -s, --save-graphs     Save graphs of calculated errors to file
  --taxa TAXA           Desired taxa level of analysis. Accepts one of:
                        strain, species, genus, phylum. Defaults to all
                        levels.
  --dataset DATASET     Dataset truth to be compared to. Accepts: i100,
                        no_truth. Defaults to i100.
  --bootstraps BOOTSTRAPS
                        Directory containing .tsv files for kallisto
                        bootstraps, to be converted into errors.
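
The `--bootstraps` option takes a directory of bootstrap .tsv files. As a rough illustration of how such tables might be summarized into per-target errors, the sketch below assumes each file uses kallisto's plaintext abundance columns (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`) and takes the sample standard deviation of `est_counts` across bootstraps as the error. The function name `bootstrap_errors` and the choice of standard deviation are assumptions; the script itself may compute errors differently.

```python
import csv
import glob
import math
import os


def bootstrap_errors(directory, column="est_counts"):
    """Map each target_id to the sample standard deviation of `column`
    across every .tsv file found in `directory`."""
    values = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.tsv"))):
        with open(path) as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                values.setdefault(row["target_id"], []).append(float(row[column]))

    errors = {}
    for target, samples in values.items():
        if len(samples) < 2:
            # A spread cannot be estimated from a single bootstrap
            errors[target] = 0.0
            continue
        mean = sum(samples) / len(samples)
        variance = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
        errors[target] = math.sqrt(variance)
    return errors
```

Passing `column="tpm"` would summarize the TPM column instead of the estimated counts.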

plotfunctions.py is a helper module that compare_metagenomic_results_to_truth.py needs in order to plot the results of its analysis.

mash_kallisto_pipeline.py processes the output of Mash (run on a large set of metagenomic reference genomes) and selects the top N (user-defined) genomes that match each species Mash identified in the raw sequenced reads. The script should be run from the directory containing the raw .fa or .mfa files; it moves the matching files to a user-specified directory for later indexing. It also concatenates the contigs/chromosomes of each genome and attempts to look up taxids based on common UIDs present in the file name. A taxid allows metagenomic programs such as Kraken or CLARK to use these genomes to build a custom database.
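
The selection step can be sketched as follows. This is an illustration under stated assumptions, not the script's own code: it assumes the Mash hits have already been parsed into `(genome_file, species, distance)` tuples and that a smaller Mash distance means a closer match. The function name `top_strains_per_species` and the file names in the example are hypothetical.

```python
from collections import defaultdict


def top_strains_per_species(hits, top_n):
    """Keep the `top_n` closest genomes for each species.

    `hits` is an iterable of (genome_file, species, distance) tuples.
    Returns a dict mapping species to a list of genome file names,
    ordered from closest to furthest.
    """
    by_species = defaultdict(list)
    for genome_file, species, distance in hits:
        by_species[species].append((distance, genome_file))

    selected = {}
    for species, candidates in by_species.items():
        candidates.sort()  # ascending distance; ties broken by file name
        selected[species] = [name for _, name in candidates[:top_n]]
    return selected


# Made-up hits, for illustration only
example_hits = [
    ("ecoli_strain_a.fa", "Escherichia coli", 0.02),
    ("ecoli_strain_b.fa", "Escherichia coli", 0.01),
    ("ecoli_strain_c.fa", "Escherichia coli", 0.05),
    ("bsubtilis_strain_a.fa", "Bacillus subtilis", 0.03),
]
kept = top_strains_per_species(example_hits, top_n=2)
```

Here `top_n` plays the role of the `top_strains` argument: with a value of 2, only the two closest genomes for each species are kept.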

mash_kallisto_pipeline.py [-h] [--directory DIRECTORY] [--dry-run] filename top_strains

positional arguments:
  filename              Mash output file
  top_strains           How many strains of each species to keep for the
                        quantification step

optional arguments:
  -h, --help            show this help message and exit
  --directory DIRECTORY
                        Directory to put files for kallisto index creation.
                        Default is moving them one directory up.
  --dry-run             If set, lists files that would be moved, but does not
                        create or move any files.
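
The `--dry-run` behaviour follows a common pattern: work out the destination of each file, report it, and only touch the filesystem when the flag is off. A minimal sketch, assuming the default destination is the parent of the current directory as the help text states; `move_genomes` is a hypothetical helper, not a function taken from the script.

```python
import os
import shutil


def move_genomes(filenames, directory=None, dry_run=False):
    """Move `filenames` into `directory` (default: one directory up).

    Returns the list of (source, destination) pairs. With dry_run=True
    the pairs are printed and nothing is created or moved.
    """
    target_dir = directory if directory is not None else os.pardir
    planned = [
        (name, os.path.join(target_dir, os.path.basename(name)))
        for name in filenames
    ]
    if dry_run:
        for source, destination in planned:
            print("would move %s -> %s" % (source, destination))
        return planned

    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    for source, destination in planned:
        shutil.move(source, destination)
    return planned
```

Running with `dry_run=True` first makes it possible to review the planned moves before any genome files leave the working directory.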
