Quality Control software for Diagnosing RNA-seq (QC DR)

QCDR

QCDR is bulk RNA-seq QC software that makes it easy to identify poor-quality samples in an experiment. It generates visualizations that comprehensively characterize each sample and flags samples whose values are statistically significantly aberrant. QCDR also produces a summary heatmap for each project, which shows at a glance which samples fail key QC metrics. Samples can be grouped into batches to better visualize differences across sequencing runs in a larger experiment. Crucially, QCDR lets you compare query samples to a reference dataset, which is useful for evaluating new samples against an established standard.

QCDR is written in Python and designed to be used either in Python or in a Unix command line environment.

QCDR_main.py is the main entry point for the software and can be run as follows:

QCDR_main.py -qry QueryQCTable.csv -out FolderForSavingOutputs -ref ReferenceQCTable -gc GeneCoverageData.csv -hist CountHistogramdata.csv -ctf CustomCutoffFile.csv -fla .05 -wrna .1
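For scripted runs, the same command line can be assembled from Python. The following is a minimal sketch, not part of QCDR: the helper name `build_qcdr_command` and its keyword arguments are hypothetical, and it only uses the flags shown in the command above. It assumes QCDR_main.py is in the working directory.

```python
import sys


def build_qcdr_command(query, out_dir, reference=None, gene_coverage=None,
                       histogram=None, cutoffs=None, fail_alpha=None,
                       warn_alpha=None, script="QCDR_main.py"):
    """Assemble an argument list for QCDR_main.py (hypothetical helper)."""
    # -qry and -out are always passed; everything else is optional.
    cmd = [sys.executable, script, "-qry", query, "-out", out_dir]
    optional = [
        ("-ref", reference),
        ("-gc", gene_coverage),
        ("-hist", histogram),
        ("-ctf", cutoffs),
        ("-fla", fail_alpha),
        ("-wrna", warn_alpha),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_qcdr_command("QueryQCTable.csv", "FolderForSavingOutputs",
                         fail_alpha=0.05, warn_alpha=0.1)
print(" ".join(cmd))
```

To launch the run, the list can be passed to `subprocess.run(cmd, check=True)` from the standard library.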

The inputs are summarized below.

-qry A query table of QC metrics, supplied to the main function QCDR_main through either its Unix or Python interface. The table contains the following columns: Sample, Batch, # Sequenced Reads, # Post-Trim Reads, % Overrepresented Sequences (Pre-trim), % Adapter Content (Pre-trim), % Overrepresented Sequences (Post-trim), % Adapter Content (Post-trim), # Uniquely Aligned Reads, # rRNA Reads, # Mapped to Exons. Sample and Batch must be filled; the other fields may be left blank, in which case the plots that require those data are skipped. QCDR uses percentile-based thresholding to identify and flag samples with aberrant values. To account for the non-normal distributions of many QC metrics, bootstrap sampling with replacement is used to generate an approximately normal distribution of sample means.
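A query table with the columns listed above could be written with Python's standard `csv` module. This is only a sketch: the helper `write_query_table` is hypothetical, and the sample names and read counts are made-up placeholders. It enforces the one rule stated above, that Sample and Batch are filled, and leaves every other missing field blank.

```python
import csv
import io

# Column names copied from the -qry description.
QUERY_COLUMNS = [
    "Sample",
    "Batch",
    "# Sequenced Reads",
    "# Post-Trim Reads",
    "% Overrepresented Sequences (Pre-trim)",
    "% Adapter Content (Pre-trim)",
    "% Overrepresented Sequences (Post-trim)",
    "% Adapter Content (Post-trim)",
    "# Uniquely Aligned Reads",
    "# rRNA Reads",
    "# Mapped to Exons",
]


def write_query_table(rows, handle):
    """Write rows (dicts keyed by column name) as a query table CSV."""
    writer = csv.DictWriter(handle, fieldnames=QUERY_COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        if not row.get("Sample") or not row.get("Batch"):
            raise ValueError("Sample and Batch must be filled for every row")
        writer.writerow(row)  # columns missing from the row are left blank


# Placeholder rows for illustration only.
buffer = io.StringIO()
write_query_table(
    [
        {"Sample": "S1", "Batch": "B1", "# Sequenced Reads": 1000000},
        {"Sample": "S2", "Batch": "B1"},
    ],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, open the target file with `open(path, "w", newline="")` and pass that handle.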

-out Folder where outputs should be saved

-ref (optional) A reference table in the same format as -qry. If it is supplied, samples are visualized and compared against the distribution of the reference dataset rather than that of the input dataset.

-gc (optional) A table of gene coverage data, generated using the function GeneCoverage.py and normalized to the range 0 to 1 within each sample. This function, along with some helper functions that facilitate the process, can be found in the GBC_Creation_Module folder.
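The per-sample 0 to 1 normalization can be pictured as follows. This sketch assumes min-max scaling, which the description above does not confirm (dividing by the per-sample maximum would be another reading), and the function name and coverage values are hypothetical. GeneCoverage.py remains the supported way to produce this table.

```python
def normalize_coverage(profile):
    """Min-max scale one sample's coverage profile to the range 0 to 1."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        # A flat profile has no spread to rescale; return zeros.
        return [0.0 for _ in profile]
    return [(value - lo) / (hi - lo) for value in profile]


# Made-up coverage values, one list per sample.
coverage = {"S1": [2, 4, 6], "S2": [10, 10, 10]}
normalized = {sample: normalize_coverage(p) for sample, p in coverage.items()}
print(normalized)
```

Each sample is scaled independently, so profiles from samples with very different sequencing depths end up on the same axis.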

-hist (optional) Data used to generate the gene distribution figure. Either a raw count table CSV can be supplied, or a gene histogram created from a raw count table using the included utility function GeneHistCreationModule.py.
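To give a sense of what a gene histogram summarizes, the sketch below bins one sample's raw counts by order of magnitude. The binning scheme (floor of log2 of count plus one) and the function name are assumptions made for illustration; the actual format produced by GeneHistCreationModule.py is not described here and may differ.

```python
from collections import Counter


def count_histogram(counts):
    """Count genes per bin, where bin = floor(log2(count + 1))."""
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
    bins = Counter((int(c) + 1).bit_length() - 1 for c in counts)
    return dict(sorted(bins.items()))


# Made-up raw counts for five genes in one sample.
hist = count_histogram([0, 1, 3, 7, 8])
print(hist)
```

Genes with zero counts land in bin 0, so the share of undetected genes is visible directly in the first bin.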

-ctf (optional) A table of user-specified cutoffs. If supplied, these values override the default cutoffs generated by QCDR. Entries left blank fall back to the QCDR cutoffs.

-wrna -fla (optional) To tighten or loosen the stringency of the statistical tests used to flag samples, users can set the warn alpha (-wrna) and the fail alpha (-fla) with these arguments.
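The interplay between bootstrap resampling, percentile cutoffs, and the two alphas can be sketched as below. This is an illustration under stated assumptions, not QCDR's actual code: the function names are hypothetical, the test is assumed to be one-sided on the low tail, the input values are made up, and the alphas 0.05 and 0.1 are simply those from the example command.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement from `values`."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n))
            for _ in range(n_resamples)]


def lower_cutoff(means, alpha):
    """Value below which roughly a fraction `alpha` of the means fall."""
    ordered = sorted(means)
    index = max(0, min(len(ordered) - 1, int(alpha * len(ordered))))
    return ordered[index]


def flag(value, means, fail_alpha=0.05, warn_alpha=0.1):
    """Return 'fail', 'warn', or 'pass' for one value (low tail only)."""
    if value < lower_cutoff(means, fail_alpha):
        return "fail"
    if value < lower_cutoff(means, warn_alpha):
        return "warn"
    return "pass"


# Made-up metric values standing in for a reference distribution.
reference = [61.0, 72.5, 80.1, 83.4, 85.0, 86.2, 88.9, 90.3, 91.7, 95.0]
means = bootstrap_means(reference)
print(lower_cutoff(means, 0.05), lower_cutoff(means, 0.1))
```

A smaller alpha moves the cutoff further into the tail, so fewer samples are flagged; a larger alpha flags more.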

A separate table is required to generate the gene body coverage distribution subplot. An optional utility included with the software, GBC_Creation_Module, generates it, and the CSV files it creates can be used directly as input. An example file can be found at data/SCRIPT/SCRIPT_B11_GC_info.csv. Once generated, it can be added to the QCDR output like this:

python3 QCDR_main.py -ip ../data/User_Template/User_Input.csv -out /path/to/outlocation.pdf -bgd ../data/User_Template/User_Input.csv -gc gcdata.csv

A similar utility for calculating gene histograms is supplied as the GeneHistCreationModule.py function.

Last, cutoffs can be set to specific values by filling them in the data/User_Template/user_cutoff_table.xlsx file. If the user does not want to set a cutoff for a metric, its cell can be left blank, in which case QCDR uses the default cutoff or the one determined by the -wrna or -fla flags.

python3 QCDR_main.py -ip ../data/User_Template/User_Input.csv -out /path/to/outlocation.pdf -bgd ../data/User_Template/User_Input.csv -ctf USER_cutoff_table.xlsx

This covers the capabilities of QCDR. If you are having difficulties, encounter bugs, or have other feedback about the software, please email the project maintainer at samilton840@gmail.com
