HPCBio/allan-fluidigm-pipeline

Description

In a typical analysis pipeline for Fluidigm data, reads are compared to sequences in a taxonomic database such as SILVA to generate a taxonomic abundance table. This pipeline instead uses a custom database of tick pathogens and produces a presence/absence table restricted to the species included in that database.

old_scripts

The old_scripts folder contains legacy code. Ignore these files.

nextflow_scripts

The nextflow_scripts folder contains the current version of the pipeline, which is written in Nextflow.

test_data

In the test_data folder we have database files and raw read files that you can use to test the pipeline.

Dependencies

This program expects the following tools/languages to be installed as modules and be available in your path:

Installation instructions

  • Install all dependencies first. You may need root access to install some of these tools/languages on a cluster.
  • Run the Nextflow 'hello world' pipeline (see https://www.nextflow.io/) to confirm that Nextflow works.
  • Make a copy of the nextflow_scripts folder.

Database preparation

You can use the database we provide in the test_data folder.

To prepare your own database:

  • Make a copy of the nextflow_scripts/dbutils folder
  • Create a FASTA file with your seed sequences only
  • Download NCBI-NT from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ (this site has several databases that are already BLAST-formatted). Download only the files for the NT database. The README file on that site has full instructions.
  • Run this program to prepare a database with your seed sequences and their closest homologs:

nextflow run nextflow_scripts/dbutils/prepare-database-ballan-v1.nf

  • This program generates two files. Both of them are needed.

More preparation steps

  • Prepare the configuration file. The Fluidigm pipeline can process different kinds of amplicons, such as V1, V3 and V4. Because the analysis steps vary slightly between amplicons, each amplicon has to be run separately, with its own configuration file. The configuration file is a list of parameter-value pairs that specify the read preparation, search and filtering options. For instance, if the reads overlap, the pipeline stitches them together; otherwise only R1 is used for the analysis. Some examples are provided in nextflow_scripts/config.

  • Prepare the raw reads. This pipeline expects paired-end short reads that have already been demultiplexed. The filenames of the raw reads must follow this pattern: amplicon-sample.R[1,2].fq. Put the demultiplexed raw reads that belong to the same amplicon in the same folder, and do not mix raw reads from different amplicons in one folder. You can use any of the datasets provided in the test_data folder.
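The naming and folder rules above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: it assumes the pattern means `<amplicon>-<sample>.R1.fq` and `<amplicon>-<sample>.R2.fq`, with the amplicon name ending at the first hyphen. The function name `check_read_names` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout (not confirmed by the pipeline source):
#   <amplicon>-<sample>.R1.fq and <amplicon>-<sample>.R2.fq
READ_NAME = re.compile(r"^(?P<amplicon>[^-]+)-(?P<sample>.+)\.R(?P<mate>[12])\.fq$")


def check_read_names(filenames):
    """Return (amplicons, problems) for the filenames found in one folder."""
    mates = defaultdict(set)   # (amplicon, sample) -> set of mates seen
    amplicons = set()
    problems = []
    for name in sorted(filenames):
        match = READ_NAME.match(name)
        if match is None:
            problems.append(f"{name}: does not match <amplicon>-<sample>.R[1,2].fq")
            continue
        amplicons.add(match["amplicon"])
        mates[(match["amplicon"], match["sample"])].add(match["mate"])
    # Every sample needs both mates of the pair.
    for (amplicon, sample), seen in sorted(mates.items()):
        if seen != {"1", "2"}:
            problems.append(f"{amplicon}-{sample}: missing R1 or R2")
    # One folder should hold reads from a single amplicon only.
    if len(amplicons) > 1:
        problems.append("mixed amplicons in one folder: " + ", ".join(sorted(amplicons)))
    return amplicons, problems
```

To check a real folder, pass it the result of `os.listdir()` on the folder that holds the reads for one amplicon; an empty `problems` list means the names look consistent under the assumed pattern.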

Running the program

To run the fluidigm pipeline type this command:

nextflow run -c config fluidigm-template-ballan-v0.3.nf

Outputs

Nextflow generates two folders, .nextflow/ and work/, to keep track of execution progress. You can delete them once the execution ends successfully.

The actual results are placed in these folders:

  • readprep/ contains the results of the QC, filtering and trimming of the raw reads. If the reads were stitched together, the PEAR results are placed here too.
  • vsearchResults contains the results of the searches performed with VSEARCH and of all subsequent filtering steps applied to each demultiplexed file, as well as the final result of this step, which is the file with this pattern:

amplicon * __vsearchSummaryReport.txt

This file is a table with these columns: Sample, Amplicon, EXTENDED_SEQID, TOTAL_READS_IN_DEMULTIPLEXED_SAMPLE, TOTAL_HITS, MIN_percIdent, MAX_percIdent, AVG_percIdent, MIN_alnLen, MAX_alnLen, AVG_alnLen, MIN_Coverage, MAX_Coverage, AVG_Coverage, 5_PERCENTILE

Where:

  • EXTENDED_SEQID is the sequence identifier of the hit, formed by concatenating the accession number and the species name
  • TOTAL_READS_IN_DEMULTIPLEXED_SAMPLE is the total read count of the demultiplexed sample
  • TOTAL_HITS is the read coverage for this hit
  • MIN_percIdent, MAX_percIdent and AVG_percIdent are the minimum, maximum and average percent identity of the reads for this hit
  • MIN_alnLen, MAX_alnLen and AVG_alnLen are the minimum, maximum and average alignment length of the reads for this hit
  • MIN_Coverage, MAX_Coverage and AVG_Coverage are the minimum, maximum and average coverage of the reads for this hit
  • 5_PERCENTILE is the 5th percentile
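The columns described above can be read with a few lines of Python. This is a sketch only: it assumes the report is tab-delimited with the column names on the first line, and the cutoffs `min_hits` and `min_avg_ident` are illustrative values, not thresholds used by the pipeline. The function name `strong_hits` is hypothetical.

```python
import csv


def strong_hits(handle, min_hits=10, min_avg_ident=97.0):
    """Return (sample, seqid, fraction_of_reads) for rows passing both cutoffs.

    Assumes a tab-delimited report whose first line holds the column names.
    The default cutoffs are illustrative, not values taken from the pipeline.
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        hits = int(row["TOTAL_HITS"])
        total = int(row["TOTAL_READS_IN_DEMULTIPLEXED_SAMPLE"])
        ident = float(row["AVG_percIdent"])
        if hits >= min_hits and ident >= min_avg_ident:
            # Share of the sample's reads assigned to this hit.
            fraction = hits / total if total else 0.0
            kept.append((row["Sample"], row["EXTENDED_SEQID"], fraction))
    return kept
```

Open the report with `open(path, newline="")` and pass the file object as `handle`. If the file turns out to use a different delimiter, only the `delimiter` argument needs to change.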

Downstream analysis

This pipeline produces a presence/absence table with only the species present in the custom database. It is not an OTU table (a table of taxonomic abundance) that could be further analyzed with tools such as QIIME.
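To compare samples side by side, the per-hit rows of a summary report can be reshaped into a species-by-sample matrix of 0/1 values. The sketch below is one possible way to do that and is not part of the pipeline. It assumes a tab-delimited report with a header line, and the function name `presence_absence` and the `min_hits` cutoff are hypothetical. Samples that have no rows in the report do not appear in the result.

```python
import csv


def presence_absence(handle, min_hits=1):
    """Build {seqid: {sample: 0 or 1}} from a tab-delimited summary report.

    A species counts as present in a sample when TOTAL_HITS >= min_hits.
    min_hits is an illustrative cutoff, not a pipeline parameter.
    """
    samples = set()
    present = set()   # (seqid, sample) pairs that pass the cutoff
    for row in csv.DictReader(handle, delimiter="\t"):
        samples.add(row["Sample"])
        if int(row["TOTAL_HITS"]) >= min_hits:
            present.add((row["EXTENDED_SEQID"], row["Sample"]))
    species = sorted({seqid for seqid, _ in present})
    return {
        seqid: {sample: int((seqid, sample) in present) for sample in sorted(samples)}
        for seqid in species
    }
```

Because the values are only 0 or 1, the resulting matrix carries no abundance information and should not be treated as an OTU table.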

Citation

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/
