Merge pull request #30 from Illumina/GT-808

GT-808 v2.3 release
Illumina · Jul 25, 2019 · cd9a580 · cd9a580
2 parents 9648877 + 353c74c
commit cd9a580
Show file tree

Hide file tree

Showing 6 changed files with 40 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -4,8 +4,10 @@
 * [Introduction](#Introduction)
 * [Installation](#Installation)
 * [Run Paragraph from VCF](#RunParagraphFromVCF)
-    * [Example](#Example)
+    * [Test example](#TestExample)
     * [Input requirements](#InputRequirements)
+    * [Run time](#RunTime)
+    * [Population-scale genotyping](#PopulationScaleGenotyping)
 * [Run Paragraph on complex variants](#RunParagraphOnComplexVariants)
 * [Further Information](#FurtherInformation)
 	* [Documentation](#Documentation)
@@ -37,7 +39,7 @@ Genotyping calls in this paper can be found at [paper-data/download-instructions
 Please check [doc/Installation.md](doc/Installation.md) for system requirements and installation instructions.
 
 ## <a name='RunParagraphFromVCF'></a>Run Paragraph from VCF
-### <a name='Example'></a>Example
+### <a name='TestExample'></a>Test example
 After installation, run `multigrmpy.py` script from the build/bin directory on an example dataset as follows:
 
 ```bash
@@ -72,9 +74,6 @@ If successful, the last 3 lines of genotypes.vcf.gz will the same as in [expecte
 ### VCF format
 paraGRAPH will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirement of VCF 4.0+.
 
-It typically takes up to a few seconds to genotype a single event in one sample (single-threaded).
-It took us 30 minutes to genotype ~20,000 SVs using 20 CPU cores (with I/O).
-
 Currently we support 4 symbolic alleles:
 - `<DEL>` for deletion
     - Must have END key in INFO field.
@@ -90,20 +89,32 @@ Currently we support 4 symbolic alleles:
 Must be tab-deliemited.
 
 Required columns:
-- ID: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
+- id: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
 - path: Path to the BAM/CRAM file.
-- depth: Average depth across the genome. Can be calculated with bin/idxdepth or samtools.
+- depth: Average depth across the genome. Can be calculated with bin/idxdepth (faster than samtools).
 - read length: Average read length (bp) across the genome.
 
 Optional columns:
 
 - depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).
-
 - depth variance: Square of depth sd.
+- sex: Affects chrX and chrY genotyping. Allow "male" or "M", "female" or "F", and "unknown" (quotes shouldn't be included in the manifest). If not specified, the sample will be treated as unknown.
+
+## <a name='RunTime'></a>Run time
+
+- On a 30x HiSeqX sample, Paragraph typically takes 1-2 seconds to genotype a simple SV in confident regions.
+
+- If the SV is in a low-complexity region with abnormal read pileups, the running time could vary.
+
+- For efficiency, it is recommended to manually set the "-M" option (maximum allowed read count for a variant) to skip these high-depth regions. We recommend "-M" as 20 times of your mean sample depth.
+
+## <a name='PopulationScaleGenotyping'></a>Population-scale genotyping
 
-- sex: Affects chrX and chrY genotyping. 
-Allow "male" or "M", "female" or "F", and "unknown" (quotes shouldn't be included in the manifest). 
-If not specified, the sample will be treated as unknown.
+To efficiently genotype SVs across a population, we recommend doing single-sample mode as follows:
+- Create a manifest for each single sample
+- Run `multigrmpy.py` for each manifest. Be sure to set "-M" option for each sample according to its depth.
+- Multithreading (option "-t") is highly recommended for population-scale genotyping
+- Merge all `genotypes.vcf.gz` to create a big VCF of all samples. You can use either `bcftools merge` or your custom script.
 
 ## <a name='RunParagraphOnComplexVariants'></a>Run Paragraph on complex variants
 For more complicated events (e.g. genotype a deletion together with its nearby SNP), you can provide a custimized JSON to paraGRAPH:

diff --git a/src/c++/include/idxdepth/Parameters.hh b/src/c++/include/idxdepth/Parameters.hh
@@ -27,17 +27,19 @@ namespace idxdepth
 class Parameters
 {
 public:
-    explicit Parameters(std::string bam_path, std::string bam_index_path, std::string reference_path)
+    explicit Parameters(std::string bam_path, std::string bam_index_path, std::string reference_path, bool altcontig)
         : bam_path_(std::move(bam_path))
         , bam_index_path_(std::move(bam_index_path))
         , reference_path_(std::move(reference_path))
+        , alt_contig_(altcontig)
         , threads_(1)
     {
     }
 
     const std::string& bam_path() const { return bam_path_; }
     const std::string& bam_index_path() const { return bam_index_path_; }
     const std::string& reference_path() const { return reference_path_; }
+    bool include_alt_contig() const { return alt_contig_; }
 
     const std::string& include_regex() const { return include_regex_; }
     void set_include_regex(std::string const& ar) { include_regex_ = ar; }
@@ -53,6 +55,7 @@ private:
     std::string bam_path_;
     std::string bam_index_path_;
     std::string reference_path_;
+    bool alt_contig_;
     std::string include_regex_;
     std::string autosome_regex_;
     std::string sex_chromosome_regex_;

diff --git a/src/c++/lib/common/KlibImpl.hh b/src/c++/lib/common/KlibImpl.hh
@@ -39,6 +39,7 @@ extern "C" {
 #include "ksw.h"
 }
 
+#include <boost/core/noncopyable.hpp>
 #include <cstdlib>
 #include <cstring>
 #include <memory>

diff --git a/src/c++/lib/idxdepth/DepthEstimation.cpp b/src/c++/lib/idxdepth/DepthEstimation.cpp
@@ -76,6 +76,12 @@ Json::Value estimateDepths(Parameters const& parameters)
     for (int i = 0; i < header->n_targets; ++i)
     {
         const std::string contig = std::string(header->target_name[i]);
+
+        if (!parameters.include_alt_contig() && contig.length() > 5)
+        {
+            continue;
+        }
+
         if (parameters.include_regex().empty() || std::regex_match(contig, include_regex))
         {
             bam_chromosomes.insert(contig);

diff --git a/src/c++/main/idxdepth.cpp b/src/c++/main/idxdepth.cpp
@@ -62,6 +62,7 @@ int main(int argc, char const* argv[])
             ("output,o", po::value<string>(), "Output file name. Will output to stdout if omitted.")
             ("output-bins,O", po::value<string>(), "Output binned coverage in tsv format.")
             ("reference,r", po::value<string>(), "FASTA with reference genome")
+            ("altcontig",po::value<bool>()->default_value(false), "Include ALT contigs in estimation")
             ("include-regex,I", po::value<string>()->default_value(""), "Regex to identify contigs to include")
             ("autosome-regex", po::value<string>()->default_value("(chr)?[1-9][0-9]?"),
              "Regex to identify autosome chromosome names (default: '(chr)?[1-9][0-9]?'")
@@ -131,7 +132,7 @@ int main(int argc, char const* argv[])
             logger->info("Output path for binned coverage: {}", output_path_bins);
         }
 
-        Parameters parameters(bam_path, bam_index_path, reference_path);
+        Parameters parameters(bam_path, bam_index_path, reference_path, vm["altcontig"].as<bool>());
         parameters.set_include_regex(vm["include-regex"].as<string>());
         parameters.set_autosome_regex(vm["autosome-regex"].as<string>());
         parameters.set_sex_chromosome_regex(vm["sex-chromosome-regex"].as<string>());

diff --git a/src/python/lib/grm/vcfgraph/vcfupdate.py b/src/python/lib/grm/vcfgraph/vcfupdate.py
@@ -157,8 +157,11 @@ def update_vcf_from_grmpy(inVcfFilename, grmpyOutput, outVcfFilename, sample_nam
         # start at the same position but come in different records. This could be fixed
         # by collapsing these into single VCF records as alleles before running
         # through paragraph.
-        record = header.new_record(contig=raw_record.chrom, start=raw_record.start, stop=raw_record.stop,
-                                   alleles=raw_record.alleles, id=raw_record.id, qual=raw_record.qual, filter=raw_record.filter, info=raw_record.info)
+        try:
+            record = header.new_record(contig=raw_record.chrom, start=raw_record.start, stop=raw_record.stop,
+                                       alleles=raw_record.alleles, id=raw_record.id, qual=raw_record.qual, filter=raw_record.filter, info=raw_record.info)
+        except:
+            raise Exception("Format error in vcf line: " + str(raw_record))
 
         varIdCounts = defaultdict(int)
         varId = VCFGraph.generate_variant_id(record, varIdCounts)