Analysis pipeline for functional metagenomic sequencing data obtained using nanopore sequencing

poreFUME

Demultiplex, correct and annotate antibiotic resistance genes in nanopore data

Usage

usage: poreFUME.py [-h] [--PacBioLegacyBarcode] [--verbose] [--overwriteDemux]
                   [--overwriteNanocorrect] [--overwriteNanopolish]
                   [--overwriteCARD] [--skipDemux] [--skipDemuxCollect]
                   [--skipNanocorrect] [--skipNanopolish] [--skipCARD]
                   [--match [MATCH]] [--mismatch [MISMATCH]]
                   [--gapopen [GAPOPEN]] [--gapextend [GAPEXTEND]]
                   [--cores [CORES]] [--barcodeThreshold [BARCODETHRESHOLD]]
                   [--barcodeEdge [BARCODEEDGE]]
                   [--pathNanocorrect [PATHNANOCORRECT]]
                   [--pathNanopolish [PATHNANOPOLISH]] [--pathBWA [PATHBWA]]
                   [--pathRawreads [PATHRAWREADS]] [--pathCARD [PATHCARD]]
                   [--annotateAll] [--minCoverage [MINCOVERAGE]]
                   fileONTreads fileBarcodes

positional arguments:
  fileONTreads          path to FASTA where the (2D) nanopore reads are stored
  fileBarcodes          path to FASTA where the barcodes are stored, format
                        should be ie F_34 for forward and R_34 for reverse
                        barcode

optional arguments:
  -h, --help            show this help message and exit
  --PacBioLegacyBarcode
                        the pacbio_barcodes_paired.fasta file has first digit
                        as 4 instead of 04, turning this option on will fix
                        this
  --verbose             switch the logging from INFO to DEBUG
  --overwriteDemux      overwrite results in the output/barcode/runid
                        directory if they exist
  --overwriteNanocorrect
                        overwrite the results in the output/nanocorrect/runid
                        directory if they exist
  --overwriteNanopolish
                        overwrite the results in the output/nanopolish/runid
                        directory if they exist
  --overwriteCARD       overwrite the results in the output/annotation/runid
                        directory if they exist
  --skipDemux           Skip the barcode demux step and proceed with
                        nanocorrect, cannot be used with overwrite. Assumes
                        the output/barcode/ and output/ directory are
                        populated accordingly
  --skipDemuxCollect    will skip the demux itself and go to collection based
                        on the pickle
  --skipNanocorrect     Skip the nanocorrect step.
  --skipNanopolish      Skip the nanopolish step.
  --skipCARD            Skip the CARD annotation
  --match [MATCH]       Score for match in alignment (default: 2.7)
  --mismatch [MISMATCH]
                        Score for mis-match in alignment (default: -4.5)
  --gapopen [GAPOPEN]   Score for gap-open in alignment (default: -4.7)
  --gapextend [GAPEXTEND]
                        Score for gap-extend in alignment (default: -1.6)
  --cores [CORES]       Number of cores to use for multiprocessing
                        (default: 1)
  --barcodeThreshold [BARCODETHRESHOLD]
                        Minimum score for a barcode pair to pass (default: 58)
  --barcodeEdge [BARCODEEDGE]
                        Maximum amount of bp from the edge of a read to look
                        for a barcode. (default: 60)
  --pathNanocorrect [PATHNANOCORRECT]
                        Set the path to the nanocorrect files (default:
                        /Users/evand/Downloads/testnanocorrect/nanocorrect/)
  --pathNanopolish [PATHNANOPOLISH]
                        Set the path to the nanopolish files (default:
                        /Users/evand/Downloads/nanopolish/nanopolish/)
  --pathBWA [PATHBWA]   Set the path to BWA (default:
                        /Users/evand/Downloads/nanopolish/bwa)
  --pathRawreads [PATHRAWREADS]
                        Set the path to the raw reads (.fast5 files),
                        nanopolish needs this. As a hint, this should be the
                        absolute path to which the last part of the header on
                        the poretools produced fasta file refers to. poreFUME
                        will make a symlink to the directory containing the
                        .fast5 files.
  --pathCARD [PATHCARD]
                        Set the path to CARD fasta file (default:
                        inputData/n.fasta.protein.homolog.fasta)
  --annotateAll         By default only the final (demuxed and two times
                        corrected) dataset is annotated, however by turning on
                        this option all the files, raw, after demux, after 1st
                        round of correction, after 2nd round of correction are
                        annotated. This obviously takes longer.
  --minCoverage [MINCOVERAGE]
                        sequences will only be nanopolish'ed if they have a
                        coverage that is higher than this threshold. (default:
                        30)
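
To illustrate the barcode naming convention expected in fileBarcodes (F_34 for the forward and R_34 for the reverse barcode), the sketch below groups FASTA records into forward/reverse pairs. The helpers `read_fasta` and `pair_barcodes` are hypothetical names used for illustration only; they are not part of poreFUME, and the barcode sequences are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pair_barcodes(records):
    """Group F_<id> / R_<id> records into {id: {"F": seq, "R": seq}}."""
    pairs = {}
    for header, seq in records:
        name = header.split()[0]  # ignore any description after the name
        direction, _, barcode_id = name.partition("_")
        if direction not in ("F", "R") or not barcode_id.isdigit():
            raise ValueError("unexpected barcode name: %r" % header)
        pairs.setdefault(int(barcode_id), {})[direction] = seq
    return pairs


# Made-up barcode sequences, only to show the expected header format.
example = io.StringIO(">F_34\nACGTACGT\n>R_34\nTTGGCCAA\n")
pairs = pair_barcodes(read_fasta(example))
print(pairs)
```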

Example

python poreFUME.py inputData/2DnanoporeData.fasta inputData/barcodes.fasta --PacBioLegacyBarcode --cores 8 --pathCARD=inputData/n.fasta.protein.homolog.fasta --pathNanocorrect=/home/ubuntu/poreFUME/nanocorrect/ --pathRawreads=/home/ubuntu/poreFUME/test/data/testSet75 --pathNanopolish=/home/ubuntu/poreFUME/nanopolish/ --verbose

Output:

Folders output/barcode/mySample/, output/nanocorrect/mySample/, output/nanopolish/mySample/, output/annotation/mySample/ and output/ contain the output files.

The pipeline consists of 4 steps:

  1. barcode demultiplexing
  2. error correction using nanocorrect
  3. polishing using nanopolish
  4. annotation of the reads using the CARD database

Each step can be skipped using the relevant --skipXXX parameter. For example, when only barcodes need to be extracted, --skipNanocorrect, --skipNanopolish and --skipCARD can be used.

Step 1. demultiplexing of barcodes

When --skipDemux is not set (default), output/mySample.afterBC.p will be created, which contains a pickled pandas DataFrame with the barcode score for each read. The relevant parameters are --match, --mismatch, --gapopen and --gapextend, which adjust the scoring function of the barcode alignment.
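
To make the role of the four scoring parameters concrete, here is a minimal sketch of a Smith-Waterman (local alignment) score with affine gap penalties, using the default values from the usage text. It is not poreFUME's own implementation, which may differ. In particular, the function name `local_alignment_score` is hypothetical, and the convention that a gap of length k scores gapopen + (k - 1) * gapextend is an assumption.

```python
def local_alignment_score(query, target, match=2.7, mismatch=-4.5,
                          gapopen=-4.7, gapextend=-1.6):
    """Return the best local alignment score of query against target.

    Assumption: the first gap position scores `gapopen`, every further
    position in the same gap scores `gapextend`.
    """
    neg_inf = float("-inf")
    n = len(target)
    h_prev = [0.0] * (n + 1)      # best scores for the previous query row
    f_prev = [neg_inf] * (n + 1)  # previous row, alignments ending in a vertical gap
    best = 0.0
    for i in range(1, len(query) + 1):
        h_cur = [0.0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # current row, alignments ending in a horizontal gap
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] + gapopen, e + gapextend)
            f_cur[j] = max(h_prev[j] + gapopen, f_prev[j] + gapextend)
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Local alignment: a score never drops below zero.
            h_cur[j] = max(0.0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# A made-up 8 bp "barcode" found intact inside a longer made-up read.
print(local_alignment_score("ACGTACGT", "TTTTACGTACGTTTTT"))
```

With these defaults a perfectly matching barcode of length n scores 2.7 * n, and a single-base gap costs 4.7, which gives a feel for how the score relates to --barcodeThreshold.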

When --skipDemuxCollect is not set (default), output/mySample.afterBC.fasta will be created based on the output/mySample.afterBC.p file. The reads in output/mySample.afterBC.fasta are identified by the FASTA header >BC_{barcodeID}_{originalreadname}, e.g. >BC_39_nanporeEcoliGenomeReadHash-3a43-4j34.... Furthermore, output/barcode/mySample/ will contain a FASTA file for each individual barcode, again with the barcode in the FASTA header. Reads for which the barcodes could not be determined with a score of at least --barcodeThreshold are placed in unknown.fasta.
--barcodeEdge determines how far from each end of a read barcodes are searched for; the run time of this step scales linearly with this value.
Note: when poreFUME finds existing data in output/barcode/mySample/ it will terminate. This can be overridden by passing the --overwriteDemux flag, which removes all existing data in output/barcode/mySample/.
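
For downstream scripts that consume the demultiplexed FASTA, the >BC_{barcodeID}_{originalreadname} header can be split back into its parts. The following is a small sketch; `parse_demux_header` and `group_by_barcode` are hypothetical helper names, and the read names in the example are made up.

```python
import re
from collections import defaultdict

# >BC_{barcodeID}_{originalreadname}; the read name itself may contain underscores.
HEADER_RE = re.compile(r"^>?BC_([^_]+)_(.+)$")


def parse_demux_header(header):
    """Split a demultiplexed FASTA header into (barcode_id, original_read_name)."""
    m = HEADER_RE.match(header.strip())
    if m is None:
        raise ValueError("not a demultiplexed header: %r" % header)
    return m.group(1), m.group(2)


def group_by_barcode(headers):
    """Map each barcode ID to the list of original read names carrying it."""
    groups = defaultdict(list)
    for header in headers:
        barcode_id, read_name = parse_demux_header(header)
        groups[barcode_id].append(read_name)
    return dict(groups)


headers = [">BC_39_readA-3a43_2D", ">BC_39_readB", ">BC_40_readC"]
print(group_by_barcode(headers))
```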

Step 2. error correction of the demultiplexed reads using nanocorrect

When --skipNanocorrect is not set (default), poreFUME will invoke nanocorrect to error-correct the demultiplexed reads. Since nanocorrect needs its own directory when run in parallel, poreFUME will create output/nanocorrect/mySample/{barcodeID}/ directories. Inside these directories, DALIGNER and poa are run as called by nanocorrect. --pathNanocorrect can be set to point to the nanocorrect package.
Note: when poreFUME finds existing data in output/nanocorrect/mySample/ it will terminate. This can be overridden by passing the --overwriteNanocorrect flag, which removes all existing data in output/nanocorrect/mySample/.
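
The per-barcode working directories and the overwrite behaviour described above can be sketched as follows. This is an illustration of the idea under stated assumptions, not poreFUME's code: `prepare_step_dir` is a hypothetical helper, and raising an exception stands in for the pipeline terminating.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_step_dir(step_dir, barcode_ids, overwrite=False):
    """Create one working directory per barcode under step_dir.

    Refuses to continue when step_dir already holds data, unless
    overwrite is True, in which case the existing data is removed first.
    """
    step_dir = Path(step_dir)
    if step_dir.exists() and any(step_dir.iterdir()):
        if not overwrite:
            raise FileExistsError("%s already contains data" % step_dir)
        shutil.rmtree(step_dir)
    workdirs = {}
    for barcode_id in barcode_ids:
        workdir = step_dir / str(barcode_id)
        workdir.mkdir(parents=True, exist_ok=True)
        workdirs[barcode_id] = workdir
    return workdirs


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "output" / "nanocorrect" / "mySample"
    prepare_step_dir(base, ["39", "40"])
    created = sorted(p.name for p in base.iterdir())
    print(created)
```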

Step 3. polishing of the nanocorrected data using nanopolish

When --skipNanopolish is not set (default), poreFUME will invoke nanopolish to polish the nanocorrected data. Nanopolish is run in its own directory, so poreFUME will create output/nanopolish/mySample/{barcodeID}/ directories. --pathRawreads is required and should point to the directory containing the .fast5 files.
Note: when poreFUME finds existing data in output/nanopolish/mySample/ it will terminate. This can be overridden by passing the --overwriteNanopolish flag, which removes all existing data in output/nanopolish/mySample/.
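
Since a wrong --pathRawreads is an easy mistake to make, it can help to check the directory before starting a run. The sketch below verifies that the directory contains .fast5 files and creates a symlink to it, similar in spirit to the symlink mentioned in the usage text. The function `link_raw_reads` and the link location are hypothetical; where poreFUME places its own symlink is not specified here.

```python
import tempfile
from pathlib import Path


def link_raw_reads(path_rawreads, link_path):
    """Check that path_rawreads holds .fast5 files and symlink it at link_path."""
    raw = Path(path_rawreads).resolve()
    if not raw.is_dir():
        raise NotADirectoryError("%s is not a directory" % raw)
    if not any(raw.glob("*.fast5")):
        raise FileNotFoundError("no .fast5 files found in %s" % raw)
    link = Path(link_path)
    if link.is_symlink():
        link.unlink()  # replace a stale link from an earlier run
    link.symlink_to(raw, target_is_directory=True)
    return link


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "testSet75"
    raw_dir.mkdir()
    (raw_dir / "read1.fast5").write_bytes(b"")  # empty placeholder file
    link = link_raw_reads(raw_dir, Path(tmp) / "rawreads")
    linked_ok = link.is_symlink() and (link / "read1.fast5").exists()
    print(linked_ok)
```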

Step 4. annotation using CARD

When --skipCARD is not set (default), the final step annotates the data using the CARD database. This step looks for output/mySample.afterNC2.fasta and annotates that file against the CARD database. --pathCARD is used to point to the nucleotide CARD database. When --annotateAll is set, inputFiles/mySample.fasta, output/mySample.afterBC.fasta and output/mySample.afterNC1.fasta will also be annotated. The output is a CSV file in output/annotation/mySample/mySample.AFTERNC2.annotation.csv containing the read name and the relevant CARD information.
Note: with --overwriteCARD the existing output in output/annotation/mySample/ will be overwritten.
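
As a rough illustration of what annotating reads against CARD with BLAST can look like, the sketch below keeps the highest-scoring hit per read from BLAST tabular output and writes it as CSV. It assumes the standard 12-column tabular format (blastn -outfmt 6); poreFUME's own hit selection and the exact columns of its CSV may differ. The helper names and the example hits are made up.

```python
import csv
import io

# Column order of BLAST's standard tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(blast_tab_lines):
    """Return {read name: best hit}, choosing the highest bitscore per read."""
    best = {}
    for row in csv.reader(blast_tab_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


def write_annotation_csv(hits, handle):
    """Write one CSV row per read, sorted by read name."""
    writer = csv.DictWriter(handle, fieldnames=BLAST_COLUMNS)
    writer.writeheader()
    for read_name in sorted(hits):
        writer.writerow(hits[read_name])


# Made-up hits for two reads; gene1..gene3 are placeholder subject IDs.
lines = [
    "BC_39_readA\tgene1\t98.5\t800\t10\t2\t1\t800\t1\t800\t0.0\t1400",
    "BC_39_readA\tgene2\t90.0\t500\t40\t10\t1\t500\t1\t500\t1e-100\t600",
    "BC_40_readB\tgene3\t99.0\t300\t3\t0\t5\t304\t1\t300\t1e-150\t540",
]
hits = best_hits(lines)
out = io.StringIO()
write_annotation_csv(hits, out)
print(out.getvalue())
```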

Parallelization

With the --cores flag the following processes can be parallelized:

  • Smith-Waterman algorithm to detect barcodes
  • nanocorrect on multiple barcodes
  • BLAST to annotate with the CARD database
  • nanopolish variants, which is invoked with the -t (threads) parameter
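
The general pattern behind this kind of parallelization is to split the work into contiguous ranges, one per core, and hand each range to a worker process. The sketch below shows that pattern with Python's multiprocessing module. The names `job_ranges`, `count_bases` and `parallel_total_length` are hypothetical, and counting bases is only a stand-in for the real per-read work.

```python
from multiprocessing import Pool


def job_ranges(n_items, n_jobs):
    """Split range(n_items) into at most n_jobs contiguous (begin, end) ranges."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(n_items, n_jobs)
    ranges, begin = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)  # spread the remainder evenly
        if size == 0:
            break
        ranges.append((begin, begin + size))
        begin += size
    return ranges


def count_bases(chunk):
    """Stand-in for per-read work such as barcode alignment."""
    return sum(len(seq) for seq in chunk)


def parallel_total_length(seqs, cores=1):
    """Process seqs in `cores` worker processes, one contiguous range each."""
    chunks = [seqs[begin:end] for begin, end in job_ranges(len(seqs), cores)]
    with Pool(processes=cores) as pool:
        return sum(pool.map(count_bases, chunks))


print(job_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

When calling `parallel_total_length` from a script, do so under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.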

Testing

To test that poreFUME works, run nosetests -v, which should output something like:

test this ... ok
testCARDavialable (test.TestCARD) ... ok
testInputavialable (test.TestCARD) ... ok
testSegments (test.TestCARD) ... ok
testBLASTDATABASE (test.TestDependencies) ... ok
testBLASTN (test.TestDependencies) ... ok
testDBdust (test.TestDependencies) ... ok
testDBsplit (test.TestDependencies) ... ok
testF2DB (test.TestDependencies) ... ok
testLAcat (test.TestDependencies) ... ok
testPOA (test.TestDependencies) ... ok
testParallel (test.TestDependencies) ... ok
testSamtools (test.TestDependencies) ... ok
testbwaVersion (test.TestDependencies) ... ok
job ranger returns index of begin and end of job range. ... ok
testOverlap (test.TestFunctions) ... ok

----------------------------------------------------------------------
Ran 16 tests in 1.074s

OK

Requirements

poreFUME makes use of the CARD database, so when using it please cite McArthur et al. 2013. The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy, 57, 3348-3357. Furthermore, Nanocorrect and Nanopolish are used, which can be cited as Loman NJ, Quick J, Simpson JT: A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods 2015, 12:733–735.