Welcome to the NGS-QCbox wiki!
What is NGS-QCbox?
NGS-QCbox is a parallel, automated and rapid quality control pipeline for analysing big NGS data. It is a QC toolbox for next-generation sequencing data from Illumina HiSeq and MiSeq.
Authors: KAVS Krishna Mohan, Aamir W Khan, Dadakhalandar Doddamani and Rajeev K Varshney
email: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com
ICRISAT, Patancheru, India
NGS-QCbox is a command-line pipeline that enables NGS quality control to be performed with ease. The outputs include base- and read-level statistics, genome coverage and variant information.
REQUIREMENTS
-Java 1.7 (preferably 1.7.0_11)
-bash >= v4 (assuming that you are working in a Linux environment)
-Python >= v2.6
(These are available by default on most Linux platforms.)
INSTALL
$ tar zxvf NGS-QCBox-v1.0.tar.gz
$ cd NGS-QCbox-v1.0
$ source sourcme.ngsqcbox
(This sets the $QCBIN shell variable.)
Now navigate to the workspace containing the data folder (here the fastq files to be QCed are assumed to be in a dir called samples) and prepare a file called 'samples.txt' in the following format:

sample1:500:700
sample2:550:800
sample3:300:600
sample4:450:760

Each line gives the sample id, min insert size and max insert size, delimited by colons. Make sure the 'samples.txt' file is in the current working dir, and then run

$ $QCBIN/NGS-QCbox-v1.0.py

An example session:

NGSQCBox toolkit v 1.0
~~~~~~~~~~~~~~~~~~~~~~
1) Quick mode QC
2) Complete mode QC
Enter a choice: 2
Enter reference fasta full path:
Enter the reference genome size:
Enter bowtie2 index full path:
Enter data folder path : <provide data folder containing samples in *.fastq.gz format>
Number of processors to use :
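The 'samples.txt' format described above (sample id, min insert size and max insert size, colon-delimited) can be checked before starting a run. The following is a minimal Python 3 sketch and is not part of NGS-QCbox. The names SampleEntry and parse_samples are hypothetical. Skipping blank lines and requiring 0 < min <= max are assumptions of this sketch, not documented behaviour of the pipeline.

```python
from typing import List, NamedTuple


class SampleEntry(NamedTuple):
    """One line of samples.txt (hypothetical helper type)."""
    sample_id: str
    min_insert: int
    max_insert: int


def parse_samples(text: str) -> List[SampleEntry]:
    """Parse 'sample:min:max' lines; raise ValueError on malformed input."""
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        fields = [field.strip() for field in line.split(":")]
        if len(fields) != 3 or not fields[0]:
            raise ValueError(
                "line %d: expected 'sample:min:max', got %r" % (lineno, line)
            )
        try:
            min_insert, max_insert = int(fields[1]), int(fields[2])
        except ValueError:
            raise ValueError(
                "line %d: insert sizes must be integers" % lineno
            ) from None
        # assumption: a sensible range satisfies 0 < min <= max
        if min_insert <= 0 or min_insert > max_insert:
            raise ValueError("line %d: need 0 < min <= max" % lineno)
        entries.append(SampleEntry(fields[0], min_insert, max_insert))
    return entries
```

For example, parse_samples(open("samples.txt").read()) would return one SampleEntry per sample, or raise ValueError naming the first offending line.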
Follow the instructions below for either Quick mode or Complete mode.
NGSQCbox-v1.0.py is the main script; it parallelizes the tasks across multiple samples generated on HiSeq or MiSeq.
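To illustrate the idea of running one QC task per sample with a bounded number of workers, here is a minimal Python 3 sketch. It is not the code of the main script, and it does not call bpipe. The names run_per_sample and fake_qc are hypothetical, and fake_qc is only a placeholder for the real per-sample work. The workers argument plays the role of the "Number of processors to use" prompt.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_per_sample(
    samples: List[str], task: Callable[[str], str], workers: int
) -> Dict[str, str]:
    """Apply `task` to every sample using at most `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with samples
        results = list(pool.map(task, samples))
    return dict(zip(samples, results))


def fake_qc(sample: str) -> str:
    """Placeholder task: returns the quick-mode output folder name."""
    # A real task would launch the external QC tools for this sample.
    return sample + "_QC_quick"
```

Threads are enough in a sketch like this when the heavy work happens in external processes. run_per_sample(["sample1", "sample2"], fake_qc, 2) maps each sample id to its placeholder result.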
PREREQUISITES
Remember to set the paths for the reference (fasta format) and the bowtie2 index, as well as the genome size, in complete_qc.bpipe / quick_qc.bpipe in the qcbin dir. Insert sizes need to be provided in a formatted text file called 'samples.txt'.
ASSUMPTIONS
-The input fastq files are in gzip-compressed format (*.fastq.gz).
-The samples to be analyzed are in a folder; this is the data path.
-The quality values are assumed to be in phred+33 format.
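The gzip and phred+33 assumptions above can be spot-checked on an input file before a run. The following is a minimal Python 3 sketch and is not part of NGS-QCbox. The function names are hypothetical, and the sketch assumes plain four-line FASTQ records. It relies on the fact that the +64 encodings never use characters below ';' (ASCII 59), so any lower character implies phred+33. A file containing only high-quality characters is ambiguous and is reported as False.

```python
import gzip
from typing import Optional


def min_quality_char(path: str, max_records: int = 10000) -> Optional[int]:
    """Smallest quality-character code among the first `max_records` reads."""
    lowest = None
    # gzip.open raises an error on read if the file is not gzip-compressed
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index // 4 >= max_records:
                break
            if index % 4 == 3:  # 4th line of each record holds the qualities
                quals = line.rstrip("\n")
                if quals:
                    low = min(map(ord, quals))
                    lowest = low if lowest is None else min(lowest, low)
    return lowest


def looks_like_phred33(path: str, max_records: int = 10000) -> bool:
    """True if some quality character can only occur in phred+33 data."""
    lowest = min_quality_char(path, max_records)
    return lowest is not None and lowest < 59
```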
RESULTS
A quick mode run generates a folder named [sample]_QC_quick, and a complete mode run generates [sample]_QC_complete, in the data path folder containing the samples.
-The pipeline generates two files, "detailed_qc_[complete|quick].txt" and "QC_final_[complete|quick].txt", named according to the quick/complete mode of the pipeline run.
detailed_QC_[quick|complete].txt contains the counts of reads and bases, the read length range, the counts of A/T/G/C/N, the percentage GC content and the quality range.
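As an illustration of the statistics listed above, the following minimal Python 3 sketch computes them from (sequence, quality) pairs held in memory. It is not the pipeline's own code, and the name read_stats is hypothetical. Two assumptions are made: qualities are phred+33, and GC content is expressed as a percentage of all bases, including N.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def read_stats(records: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Summarize (sequence, quality) pairs; qualities assumed phred+33."""
    base_counts = Counter()
    lengths = []
    qual_scores = set()
    for seq, qual in records:
        base_counts.update(seq.upper())
        lengths.append(len(seq))
        qual_scores.update(ord(char) - 33 for char in qual)
    total_bases = sum(lengths)
    gc_bases = base_counts["G"] + base_counts["C"]
    return {
        "reads": len(lengths),
        "bases": total_bases,
        "length_range": (min(lengths), max(lengths)) if lengths else None,
        "base_counts": {base: base_counts[base] for base in "ATGCN"},
        # assumption: GC% is taken over all bases, N included
        "gc_percent": 100.0 * gc_bases / total_bases if total_bases else 0.0,
        "quality_range": (
            (min(qual_scores), max(qual_scores)) if qual_scores else None
        ),
    }
```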
QC_final_[quick|complete].txt summarizes the above information for all the samples. In addition to the numbers of reads generated and retained after quality trimming, it contains the percentage of alignment, the genome coverage at 1X to 15X and the mean read depth observed from the alignment.
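The coverage figures mentioned above can be illustrated with a small worked example. The following minimal Python 3 sketch is not the pipeline's own calculation, and the name coverage_summary is hypothetical. It reads "coverage at kX" as the percentage of genome positions with depth of at least k, which is an interpretation. It divides by the genome size rather than by the number of depth values, so positions missing from the input count as depth 0.

```python
from typing import Dict, Sequence


def coverage_summary(
    depths: Sequence[int], genome_size: int, max_x: int = 15
) -> Dict[str, float]:
    """Percent of the genome at >=1X .. >=max_x, plus the mean depth."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    summary = {}
    for x in range(1, max_x + 1):
        covered = sum(1 for depth in depths if depth >= x)
        summary["%dX" % x] = 100.0 * covered / genome_size
    # positions absent from `depths` contribute 0 to the mean
    summary["mean_depth"] = sum(depths) / genome_size
    return summary
```

For a toy genome of 5 positions with depths [0, 1, 2, 15, 20], four positions reach 1X (80%), three reach 2X (60%), two reach 15X (40%), and the mean depth is 38 / 5 = 7.6.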
DISCLAIMER
-This tool has been tested on the Linux platform only (Ubuntu 12.10). It should work on any other Linux flavour provided you have the tools listed in the REQUIREMENTS section.
LICENSE
GPL3. A copy of GPL3 is included in the package.