
Pipeline for mapping a bunch of metagenomes to a bunch of reference genomes with bbmap, then calculating the average percent identity and coverage values for all.


MatthewWolff/mapMetasVsRefs

 
 


Mapping Oligotrophic Metagenomes to Reference Genomes

Sarah Stevens
2016-03-10

This pipeline calculates the coverage and average nucleotide identity (ANI) of each included metagenome against each reference genome.


Matthew edit: make executes runAll.sh, and make clean removes all generated files, including the contents of the mappingResults/ directory. Because make does not pass arguments through, call ./scripts/runAll.sh <parameters> directly to run with custom parameters. Alternatively, move runAll.sh to the main directory and remove the make command from the makefile by keeping only its last two lines. The output goes through a temporary file because redirecting straight back into makefile would truncate it before tail reads it, i.e.

mv scripts/runAll.sh . && tail -n 2 makefile > makefile.tmp && mv makefile.tmp makefile

Directory Structure:

| - metagenomes/ : directory to place (or link) all the metagenomes to include in analysis
| - refGenomes/ : directory to place (or link) all the reference genomes to include in analysis
| - scripts/ : directory that contains scripts for analysis
| - mappingResults/ : directory that holds the resulting mapping files (.bam)
| - runAll.sh : script that runs pipeline
| - resetFiles.sh : script that removes intermediate files to reset the repo
| - setup.sh : sets up the directory structure to start with
| - Readme.md : this helpful file
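
As a rough illustration of the layout above, the following sketch creates the input and output directories if they are missing. It is not the contents of setup.sh: the function name `make_layout` and its `base` argument are hypothetical, and the assumption that setup creates all three directories is taken from the listing only.

```python
import os

# Directory names come from the layout above; which ones setup.sh
# actually creates is an assumption.
PIPELINE_DIRS = ("metagenomes", "refGenomes", "mappingResults")


def make_layout(base="."):
    """Create any missing pipeline directories under `base`.

    Returns the names of the directories that had to be created.
    """
    created = []
    for name in PIPELINE_DIRS:
        path = os.path.join(base, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(name)
    return created
```

Calling `make_layout()` a second time returns an empty list, so it is safe to rerun on an existing checkout.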

Requirements

  • Samtools
  • BBMap
  • Python 2.7
  • Python Modules:
    • multiprocessing
    • pandas

Setup

To set up the directory structure run

./setup.sh

Then:

  • place all the metagenomes (fasta type files) you want to map in metagenomes/
  • place all of the reference genomes you want to map to in refGenomes/
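
Once both directories are populated, every metagenome is mapped against every reference genome. A minimal sketch of enumerating those pairs is shown here. The helper names `list_inputs` and `mapping_combos` are hypothetical, and the pipeline's own scripts may build the list differently.

```python
import itertools
import os


def list_inputs(directory):
    """Return the sorted file names in `directory`, skipping hidden files."""
    return sorted(f for f in os.listdir(directory) if not f.startswith("."))


def mapping_combos(metagenomes, ref_genomes):
    """Pair every metagenome with every reference genome."""
    return list(itertools.product(metagenomes, ref_genomes))
```

For example, `mapping_combos(list_inputs("metagenomes"), list_inputs("refGenomes"))` yields one (metagenome, reference) tuple per mapping job, so 3 metagenomes and 4 references give 12 jobs.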

Running all mapping

Before starting, you may need to open runAll.sh and set the bbpath variable to the location of the bbmap software (relative to this repo).
By default, it assumes that the bbmap directory is one level above this repo and that bbmap.sh is inside that directory.
Run the full analysis with the following command:

./runAll.sh threads memlimit

Arguments (parsing is very naive and only uses positionals):

  • threads = number of threads to use (default=10)
  • memlimit = java memory limit for each mapping job (default=4g)
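
The positional handling described above can be sketched as follows. This is an illustration under assumptions, not the code in runAll.sh: `parse_args` is a hypothetical name, and the validation of the memory string is an added safeguard that the original script may not perform.

```python
import re

# Defaults as documented above.
DEFAULT_THREADS = 10
DEFAULT_MEMLIMIT = "4g"


def parse_args(argv):
    """Read `threads` and `memlimit` from positional arguments.

    Missing arguments fall back to the documented defaults.
    """
    threads = int(argv[0]) if len(argv) > 0 else DEFAULT_THREADS
    memlimit = argv[1] if len(argv) > 1 else DEFAULT_MEMLIMIT
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    # Java heap sizes look like 512m or 4g; this check is an assumption
    # added for the sketch.
    if not re.match(r"^\d+[kKmMgG]$", memlimit):
        raise ValueError("memlimit should look like 4g or 512m")
    return threads, memlimit
```

Because the arguments are positional, memlimit can only be overridden when threads is also given, e.g. `parse_args(["20", "8g"])`.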

To write date-stamped log files, run it like this:

nohup bash runAll.sh threads memlimit > $(date +%Y%m%d_%H%M%S)_nohup.log 2> $(date +%Y%m%d_%H%M%S)_nohup.err &

Example with 20 threads and 4g of memory for each mapping job:

nohup bash runAll.sh 20 4g > $(date +%Y%m%d_%H%M%S)_nohup.log 2> $(date +%Y%m%d_%H%M%S)_nohup.err &

Mapping default arguments (see bbmap for details):

  • idtag
  • minid=.8
  • threads=1 - WARNING: this does not seem to limit bbmap to 1 CPU. If you are on a shared resource, make sure you are the only one using it at that time.
  • nodisk
  • -Xmx4g (unless changed with the runAll.sh memlimit argument)

To change these settings, edit the 'cmd=...' line in runMapping.py.
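
To make the defaults above concrete, here is a sketch of how one mapping command could be assembled for a single metagenome/reference pair. It is not the actual 'cmd=...' line from runMapping.py. The function name `build_bbmap_cmd` and the paths in the usage note are hypothetical, while `ref=`, `in=` and `out=` are the standard bbmap.sh input and output parameters.

```python
def build_bbmap_cmd(bbmap_sh, ref, reads, out, memlimit="4g"):
    """Return the bbmap.sh argument list for one mapping job.

    The flags mirror the defaults listed above; only the Java memory
    limit is configurable here.
    """
    return [
        bbmap_sh,
        "ref=" + ref,        # reference genome to map against
        "in=" + reads,       # metagenome reads
        "out=" + out,        # mapping output file
        "idtag",             # report percent identity per alignment
        "minid=.8",          # minimum identity
        "threads=1",
        "nodisk",            # build the index in memory only
        "-Xmx" + memlimit,   # Java memory limit
    ]
```

A call such as `build_bbmap_cmd("../bbmap/bbmap.sh", "refGenomes/ref1.fna", "metagenomes/meta1.fna", "mappingResults/meta1-vs-ref1.sam")` returns a list that could be passed to `subprocess.call`. Building a list rather than a single string avoids shell quoting problems with unusual file names.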

Output files

  • refGenomeList.txt - List of all the reference genomes runAll.sh last ran on

  • metagenomeList.txt - List of all the metagenomes runAll.sh last ran on

  • mappingCombos.txt - All of the combinations of mapping metagenomes to reference genomes that runAll.sh last ran on

  • mappingResults/ - directory that stores all of the mapping results files (.bam)

    • *.bam - all the output files from all the combinations of mapping metagenomes to reference genomes
  • *.depth - the resulting depth (for each base) for all of the combinations of mapping metagenomes to reference genomes

  • resultingPIDs.txt - All the lines from the *.bam (converted to sam) that contain the percent identity (PID) information

  • parsedPID.txt - All of the percent identity hits, each labeled with the file it came from (which metagenome vs. which reference genome)

  • coverage.txt - The number of reads that mapped from each metagenome to each reference genome and the average coverage of each base.
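
The two summary values reported above can be sketched as follows. This is a minimal illustration, not the pipeline's own parsing code. It assumes that the *.depth files use the three-column, tab-separated layout of samtools depth (reference, position, depth) and that the percent identity appears in SAM lines as a `YI:f:` tag. The function names are hypothetical.

```python
import re

# Assumed format of the identity tag in SAM lines, e.g. "YI:f:97.50".
PID_TAG = re.compile(r"\tYI:f:([0-9.]+)")


def mean_depth(depth_lines):
    """Average the depth column over the positions present in the input."""
    total = 0
    positions = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += int(fields[2])
        positions += 1
    return float(total) / positions if positions else 0.0


def mean_pid(sam_lines):
    """Average the percent identity over all alignments carrying the tag.

    Returns None when no line carries the tag.
    """
    values = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        match = PID_TAG.search(line)
        if match:
            values.append(float(match.group(1)))
    return sum(values) / len(values) if values else None
```

Note that `mean_depth` averages only over the positions listed in the file. If zero-depth positions were omitted when the file was written (the samtools depth default without `-a`), dividing the summed depth by the reference genome length gives the per-base average instead.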

Resetting files

To reset repo use:

./resetFiles.sh

If you also want to remove the files from mappingResults:

./resetFiles.sh True

