Merge pull request #24 from PacificBiosciences/hla_consensus

changes for v0.13.0
PacificBiosciences · Jul 30, 2024 · 267a140 · 267a140
2 parents 422063a + bd5eb06
commit 267a140
Show file tree

Hide file tree

Showing 4 changed files with 36 additions and 9 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,21 @@
+# v0.13.0
+## Changes
+- The algorithm for _HLA-A_ and _HLA-B_ has been modified to use a consensus-based approach to solve the alleles, a simpler version of the algorithm for _CYP2D6_.
+  - CLI options related to consensus generation now control both HLA and _CYP2D6_ calling. These have been moved into a separate category on the CLI labeled "Consensus (HLA and CYP2D6)".
+  - In internal tests, these changes slightly improved the accuracy of 4th-field entries in the HLA calls (2nd- and 3rd-field were unaffected). Additionally, the approach significantly reduced compute time requirements, averaging ~10% of CPU time required for v0.12.0.
+  - With this change, the `--threads` option does not provide any benefit to the current algorithms. It has been deprecated, but may be added again if future optimizations allow it.
+  - The `--max-error-rate` default has been adjusted for comparison to just the reference allele for each HLA gene, with a new default of 0.07 (previously 0.05).
+  - Previous option `--min-allele-fraction` for HLA has been removed. The consensus option `--min-consensus-fraction` is used instead.
+- Added a new option, `--output-debug`, that will create a debug folder with multiple additional files that are primarily for debugging the results from HLA and CYP2D6 calling, but may be useful for researchers. This folder is subject to change as the underlying methods develop. Some of the initial files included:
+  - `consensus_{GENE}.fa` - Contains the full consensus sequences generated for a given `{GENE}`. Currently, this is only for HLA genes and _CYP2D6_.
+  - `cyp2d6_consensus.bam` - Contains mapped substrings from the reads that were used to generate CYP2D6 consensus sequences. The phase set tag (PS) indicates which consensus the sequence was a part of. Useful for visualizing how the consensus ran and whether there are potential errors.
+  - `cyp2d6_link_graph.svg` - A graphical representation of the connections present between CYP2D6 consensus segments.
+  - `hla_debug.json` - Contains the summary mapping information of each database entry to the generated HLA consensus sequences.
+
+## Fixed
+- Fixed an issue with `build` where CPIC genes with no known chromosome would cause an error and exit. These entries are now ignored with a warning.
+- Fixed an off-by-one error in the HLA gene region start coordinates. This has been corrected in the latest database release: `data/v0.13.0/pbstarphase_20240730.json.gz`
+
 # v0.12.0
 ## Changes
 - Hard-coded coordinates (GRCh38) for HLA-A, HLA-B, and CYP2D6 calling have been moved into the database file. Prior database versions without this new config information will automatically load the previously hard-coded values. This configuration is provided for transparency and experimentation in other reference coordinate systems, we do not recommend or support changing the provided default values.

diff --git a/data/v0.13.0/pbstarphase_20240730.json.gz b/data/v0.13.0/pbstarphase_20240730.json.gz
diff --git a/docs/methods.md b/docs/methods.md
@@ -11,15 +11,14 @@ The high-level process is outlined below:
 ## _HLA-A_ and _HLA-B_
 The diplotyping of HLA genes is more complicated due to frequent mismappings of other HLA sequences to the wrong positions in the genome.
 This can have the downstream effect of generating spurious or incorrect variant calling and/or phasing in the VCF file.
-As a result, pb-StarPhase uses a different, alignment-based approach to diplotype the HLA genes:
+As a result, pb-StarPhase uses a different, consensus-based approach to diplotype the HLA genes:
 
 1. **Load the HLA database** - This step loads the JSON database that was created from querying the IMGTHLA GitHub repo. Each haplotype sequence is loaded into memory for the following steps. 
-2. **Load fully spanning reads** - This step will parse the alignment file (BAM) and extract all reads that **fully span** the gene region of interest. We note that the haplotypes stored in IMGTHLA do not have the same mapping coordinates, so we have defined the regions of interest as follows (subject to change as we iterate on performance):
-    * _HLA-A_ - chr6:29942254-29945870
-    * _HLA-B_ - chr6:31353362-31357442
-3. **Score each HLA haplotype against each read** - This step will align each cDNA and DNA sequence against each read and save the edit distance of the mapping against that read. By default, any HLA haplotypes that are missing a DNA sequence are ignored. Any read with an edit distance that is too high will be ignored (this removes mismapped reads).
-4. **Call best diplotype** - This step performs a ranked-choice vote where each read mapping is allowed one vote. The process iteratively excludes candidate haplotypes based on the number of votes as well as the edit distance for tie-breaking. If all candidates have a "high" vote fraction (default: >=10%), then a more complicated full comparison is used to exclude a candidate. Once only two candidates remain, the algorithm compares the votes to determine whether it should be reported as heterozygous for both candidates or homozygous for the dominant candidate.
-5. **Report findings** - In contrast to the CPIC genes, only a single diplotype result is ever reported for the HLA genes. Additionally, relevant mapping information is saved into the output JSON.
+2. **Load fully spanning reads** - This step will parse the alignment file (BAM) and extract all reads that **fully span** the gene region of interest (including a small buffer around the defined region). We note that the haplotypes stored in IMGTHLA do not have the same mapping coordinates, so we have defined the regions of interest in the database files. Any read with an edit distance that is too high relative to the reference genome will be ignored (this removes mismapped reads).
+3. **Run a diplotype consensus on the read collection** - This step will first extract all the cDNA sequences from each read and attempt to find two unique consensuses. It checks the number of reads assigned to each consensus to determine if it passes the MAF and CDF filters. If not, the cDNA is likely identical, so it repeats this step using the DNA sequences instead to see if there is a sub-type difference. After this step, there is either one consensus (homozygous at cDNA and DNA levels) or two consensuses (heterozygous at cDNA and/or DNA levels).
+3. **Score all HLA haplotypes in the database against each consensus** - For each consensus, we generate a full-length DNA sequence and accompanying spliced cDNA sequence. We then score each defined HLA haplotype to each consensus by aligning and saving the edit distance of the mappings, prioritizing cDNA and then DNA. By default, any HLA haplotypes that are missing a DNA sequence are ignored.
+4. **Call best diplotype** - For each consensus, identify the best matching haplotype as defined above. This is the haplotype with the lowest edit fraction (`edit_distance + unmapped / total_bp`). This pair of best matching haplotypes is the final diplotype result for the HLA gene.
+5. **Report findings** - In contrast to the CPIC genes, only a single diplotype result is ever reported for the HLA genes.
 
 ## _CYP2D6_
 Diplotyping of _CYP2D6_ is further complicated due to the presence of homologous gene _CYP2D7_ as well as hybrids between the two genes.

diff --git a/docs/user_guide.md b/docs/user_guide.md
@@ -73,12 +73,10 @@ With v0.10.0, pb-StarPhase supports diplotyping of _HLA-A_, _HLA-B_, and _CYP2D6
 If using targeted sequencing datasets, see [our recommended parameters](#can-i-diplotype-using-targeted-sequencing-data).
 To enable HLA and _CYP2D6_ diplotyping, simply provide the BAM file(s) in addition to the normal parameters.
 Both HLA and _CYP2D6_ diplotyping is more computationally expensive than the CPIC genes.
-If run-time is an issue, we recommend using the `--threads` option to provide additional cores to StarPhase, which will improve the HLA diplotyping components.
 
 ```bash
 pbstarphase diplotype \
     --bam ${BAM} \
-    --threads ${THREADS} \
     ...
 ```
 
@@ -264,6 +262,8 @@ pbstarphase build \
 ```
 
 This requires an internet connection that can query the CPIC API and IMGTHLA GitHub for the latest genes, allele definitions, and variants.
+Additionally, this relies on upstream databases maintaining a known structure.
+If that structure changes, this command may fail and require an update to the software to resolve it.
 
 ## Why are some of the haplotypes ignored?
 Some CPIC genes, like _G6PD_, include reference variants that are alternate sequences relative to the GRCh38 reference genome.
@@ -293,3 +293,13 @@ However, _CYP2D6_ tends to be more difficult to accurately call with targeted se
 This is typically due to shorter read lengths, increased coverage variation across alleles, and full-allele drop out due to the capture.
 For _CYP2D6_, this is particular problematic due to the presence of deletion, duplication, and hybrid alleles that may influence the final diplotype.
 For targeted sequencing, we recommend using the following _CYP2D6_-specific additional parameters, which attempt to account for these complicating factors: `--infer-connections --normalize-d6-only`.
+
+## Can I get more detailed output information?
+There is an optional output, `--output-debug`, which is primarily for debugging issues that may occur while diplotyping the complex HLA and _CYP2D6_ genes.
+The outputs contained in this folder are subject to change as the algorithms evolve.
+Here is a brief list of some of the current debug outputs:
+
+* `consensus_{GENE}.fa` - Contains the full consensus sequences generated for a given `{GENE}`. Currently, this is only for HLA genes and _CYP2D6_.
+* `cyp2d6_consensus.bam` - Contains mapped substrings from the reads that were used to generate CYP2D6 consensus sequences. The phase set tag (PS) indicates which consensus the sequence was a part of. Useful for visualizing how the consensus ran and whether there are potential errors.
+* `cyp2d6_link_graph.svg` - A graphical representation of the connections present between CYP2D6 consensus segments.
+* `hla_debug.json` - Contains the summary mapping information of each database entry to the generated HLA consensus sequences.