Skip to content

AkeyLab/LongHap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

64 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LongHap

Table of Contents

Description of LongHap

LongHap is a read-based variant phasing algorithm that integrates methylation signals native to long-read sequencing data, such as PacBio Revio HiFi and ONT sequencing, into a unified framework. As input, LongHap requires variant calls in VCF format, aligned sequencing reads in BAM format, and, optionally, methylation calls. LongHap then uses a stepwise approach to co-phase SNVs, INDELs, and SVs. First, LongHap phases pairs of variants based on sequence information, embedding complex and low-support variants into the broader haplotype context using graph theory, that is, loopy belief propagation. In a second step, LongHap identifies differentially methylated sites on the fly and leverages them as additional, phase-informative markers to resolve variant pairs that could not be confidently phased based on sequence information alone, extending initially inferred phase blocks. Finally, LongHap outputs a phased VCF and, optionally, haplotagged read alignments and the set of differentially methylated sites used for phasing.

Installation

LongHap's only requirements are Python >= 3.12 with the following packages installed:

  • cyvcf2 >= 0.31.4
  • pysam >= 0.23.3
  • parasail-python >= 1.3.4
  • numpy >= 2.4.1
  • pandas >= 2.3.3
  • pyarrow >= 22.0.0
  • scipy >= 1.17.0
  • pyfaidx >= 0.9.0.3
  • tqdm >= 4.67.1

After it is installed, you should be able to run longhap --help.

Quick start with uvx (no installation needed)

uvx longhap --help

Installation with pip

pip install longhap

Installation from github

All Python dependencies can be installed using the provided longhap.yaml file with conda:

git clone https://github.com/AkeyLab/LongHap.git
cd LongHap/
conda env create -f longhap.yaml
conda activate longhap
./longhap.py --help

Note for macOS users: Some dependencies (e.g., parasail, cyvcf2) require C build tools. If installation fails, install the following via Homebrew first:

brew install autoconf automake libtool

Usage

LongHap takes the following files as inputs:

  • A VCF file with variant calls
  • A BAM file with aligned reads
  • Reference fasta file
  • (Optional) A BED file with methylation calls

If only a VCF and a BAM file are provided, LongHap will phase variants based on sequence information alone, similarly to WhatsHap, HapCUT2, or LongPhase. If a BED file with methylation calls is provided, LongHap will additionally leverage methylation information to phase variants that could not be phased based on sequence information alone.

If you installed LongHap with pip install longhap or from GitHub, you can run it directly with longhap. Alternatively, you can use uvx longhap to run it without installing — uv will automatically fetch the package and its dependencies into a temporary environment. Both approaches are shown in the examples below.

Variant phasing based on sequence information alone:

If installed with pip

longhap \
    --vcf input_variants.vcf \
    --bam aligned.pacbio.bam \
    --reference reference.fasta \
    --chrom chr1 \
    --pacbio \
    -o phased_variants.vcf.gz

Or with uv:

uvx longhap \
    --vcf input_variants.vcf \
    --bam aligned.pacbio.bam \
    --reference reference.fasta \
    --chrom chr1 \
    --pacbio \
    -o phased_variants.vcf.gz

Or with:

./longhap.py \
    --vcf input_variants.vcf \
    --bam aligned.pacbio.bam \
    --reference reference.fasta \
    --chrom chr1 \
    --pacbio \
    -o phased_variants.vcf.gz

Variant phasing based on sequence and methylation information:

longhap \
    --vcf input_variants.vcf \
    --bam aligned.pacbio.bam \
    --reference reference.fasta \
    --chrom chr1 \
    --methylation_calls methylation_calls.bed \
    --pacbio \
    -o phased_variants.vcf.gz

Or with uv:

uvx longhap \
    --vcf input_variants.vcf \
    --bam aligned.pacbio.bam \
    --reference reference.fasta \
    --chrom chr1 \
    --methylation_calls methylation_calls.bed \
    --pacbio \
    -o phased_variants.vcf.gz

General phasing options

When using ONT data, replace the --pacbio flag with --ont. When you want to phase SNVs only, add the --snvs_only flag. When you want to phase multiallelic variants, add the --multiallelics flag.

By default, we exclude SVs > 50000 bp. This threshold can be adjusted with the --max_allele_length flag.

To exclude variants with very low support for the minor allele, use the --min_allele_count flag. By default, this flag is set to 1, meaning that at least one read must support the minor allele for a variant to be considered for phasing.

To exclude bases covering heterozygous variants with a base quality below a certain threshold, use the --min_base_quality flag. This is particularly useful when phasing SNPs from ONT data, where low base qualities may indicate systematic errors. By default, this flag is 0 to consider all bases for PacBio HiFi data and 10 for ONT data.

The complete list of options

longhap -h
usage: longhap [-h] --vcf VCF -b BAM -r REFERENCE -c CHROM [-m METHYLATION_CALLS] [--snvs_only] [--multiallelics] [--ont] [--pacbio] [--max_allele_length MAX_ALLELE_LENGTH] [--min_allele_count MIN_ALLELE_COUNT]
                  [--min_base_quality MIN_BASE_QUALITY] [--min_mapq MIN_MAPQ] [--flank_snv FLANK_SNV] [--flank_indel FLANK_INDEL] -o OUTPUT_VCF [--output_bam OUTPUT_BAM] [--output_read_assignments OUTPUT_READ_ASSIGNMENTS]
                  [--output_blocks OUTPUT_BLOCKS] [--output_transition_matrix OUTPUT_TRANSITION_MATRIX] [--output_transition_matrix_meth OUTPUT_TRANSITION_MATRIX_METH] [--output_read_states OUTPUT_READ_STATES]
                  [--output_variant_read_mapping OUTPUT_VARIANT_READ_MAPPING] [--output_allele_coverage OUTPUT_ALLELE_COVERAGE] [--output_unphaseable_variants OUTPUT_UNPHASEABLE_VARIANTS]
                  [--output_differentially_methylated_sites OUTPUT_DIFFERENTIALLY_METHYLATED_SITES] [--use_all_methylated_sites] [--force] [--log LOG] [-v]

options:
  -h, --help            show this help message and exit
  --vcf VCF             Input VCF with called variants
  -b BAM, --bam BAM     Sorted alignment bam
  -r REFERENCE, --reference REFERENCE
                        Reference fasta. Must be indexed with samtools faidx.
  -c CHROM, --chrom CHROM
                        Chromosome
  -m METHYLATION_CALLS, --methylation_calls METHYLATION_CALLS
                        Methylation calls from pileup model
  --snvs_only           Whether to phase SNVs only ["False]
  --multiallelics       Also phase multiallelic variants or not [False]
  --ont                 Data is Oxford Nanopore data [False]
  --pacbio              Data is PacBio HiFi data [False]
  --max_allele_length MAX_ALLELE_LENGTH
                        Maximum length of alleles to consider for phasing in bp [50000]
  --min_allele_count MIN_ALLELE_COUNT
                        How many examples of the minor allele must be present in the reads to consider the variant for phasing [1]
  --min_base_quality MIN_BASE_QUALITY
                        Minimum base quality to consider a base for phasing. Only affects SNP phasing. For HiFi data, all bases should be consider, that is a minimum quality of 0. For ONT data, a threshold of 10 is recommended
                        [0]
  --min_mapq MIN_MAPQ   Minimum mapping quality to consider a read for phasing [20]
  --flank_snv FLANK_SNV
                        Number of flanking bp to use for realignment around uncertain SNVs. Default is 66 for ONT and 33 for PacBio
  --flank_indel FLANK_INDEL
                        Number of flanking bp to use for realignment around uncertain indels. Default is 200 for ONT and 100 for PacBio
  -o OUTPUT_VCF, --output_vcf OUTPUT_VCF
                        Output phased vcf
  --output_bam OUTPUT_BAM
                        Output haplotagged bam
  --output_read_assignments OUTPUT_READ_ASSIGNMENTS
                        Haplotype assignments for each read
  --output_blocks OUTPUT_BLOCKS
                        Haplotype blocks in bed format
  --output_transition_matrix OUTPUT_TRANSITION_MATRIX
                        If provided transition matrix will be saved to this file as numpy array (.npz). Allows faster re-runs.
  --output_transition_matrix_meth OUTPUT_TRANSITION_MATRIX_METH
                        If provided transition matrix filled in with methylation data will be saved to this file as numpy array (.npz). Allows faster re-runs.
  --output_read_states OUTPUT_READ_STATES
                        If provided read states will be saved to this file as json. Allows faster re-runs.
  --output_variant_read_mapping OUTPUT_VARIANT_READ_MAPPING
                        If provided read names covering a specific variant will be saved to this file as json. Allows faster re-runs.
  --output_allele_coverage OUTPUT_ALLELE_COVERAGE
                        If provided allele coverage will be saved to this file as npy (.npy). Sites with one allele absent from reads bill be ignored. Allows faster re-runs.
  --output_unphaseable_variants OUTPUT_UNPHASEABLE_VARIANTS
                        If provided unphaseable variants will be saved to this file as npz. Allows faster re-runs.
  --output_differentially_methylated_sites OUTPUT_DIFFERENTIALLY_METHYLATED_SITES
                        Write differentially methylated files used by longhap to infer transitions to file
  --use_all_methylated_sites
                        Whether to use all methylated sites or not. If False, at most 25,000 methylated sites per transition are used. This guarantees fast runtimes and does not seem to sacrifice accuracy. [False]
  --force               If transition matrix output is provided and file already exists this file will be loaded by default unless --force is set. Then the transition matrix will be re-inferred.
  --log LOG             Log file
  -v, --verbose         Print logging information to stdout

Outputs

LongHap outputs a phased VCF file by default. Optionally, it can also output a haplotagged BAM file, haplotype assignments per read, a BED file with haplotype blocks, methylated sites used for phasing, and various intermediate files that can be used to speed up re-runs (mostly for debugging purposes).

Phased VCF

LongHap stores the phase information in the GT field and the phase block coordinates in the PS field of the output VCF file.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG002
chr1	13029010	.	A	T	40	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	1|0:40:28:14,14:0.5:39,0,53:45
chr1	13029690	.	T	C	57.8	PASS	.	GT:GQ:DP:AD:VAF:PL	1/1:54:28:0,28:1:57,56,0
chr1	13030289	.	T	G	57	PASS	.	GT:GQ:DP:AD:VAF:PL	1/1:54:30:0,30:1:56,56,0
chr1	13030367	.	T	C	38.6	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	1|0:38:30:16,14:0.466667:38,0,52:45
chr1	13030751	.	C	CCT	33.4	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	0|1:32:29:14,15:0.517241:33,0,37:45
chr1	13030882	.	T	G	37.6	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	1|0:37:29:15,14:0.482759:37,0,53:45

Additional outputs

If --output_blocks is specified, LongHap will also write the phase block coordinates to the specified BED file.

A haplotagged bam file can be requested using --output_bam. In the optional haplotagged BAM file, LongHap adds a custom HP tag to each read, indicating the haplotype assignment of the read.

If --output_read_assignments is specified, LongHap will write a TSV file with the haplotype assignments for each read. The file has three columns: read name, haplotype assignment (1 or 2), and phase block ID.

Preparation of Inputs

To leverage methylation information for phasing, LongHap also requires methylation calls in the form of a BED file. Currently, LongHap expects the file to be generated with aligned_bam_to_cpg_scores from pb-cpg-tools, with the first seven rows being skipped as they are comments.

A VCF file generated with any variant caller of your choice works. We chose to use DeepVariant to call small variants and Sniffles2 to call large variants. We then merged the calls into one VCF file. For optimal performance, we recommend that the final VCF file is left-aligned and multiallelic sites are merged, using bcftools norm -m+.

For the BAM file, we recommend using minimap2 to align the reads to the reference genome, but any aligner will do. If you want to harness methylation information, make sure to use an aligner that preserves the necessary tags, that is, MM and ML tags. For exampl, starting from a raw PacBio HiFi BAM file and using minimap2 this can be achieved like this:

samtools fastq -T 'ML,MM' raw.pacbio.bam > raw.pacbio.fastq
minimap2 -ax map-hifi -y reference.fasta raw.pacbio.fastq

The -y flag tells minimap2 to retain the tags present in the fastq file.

Comparison to other phasing tools

We benchmarked LongHap and other tools on HG002, using publicly available PacBio HiFi, ONT, and UL-ONT data. We find that LongHap generally outperforms all other tools. LongHap's integration of methylation information yields larger phasing improvements that MethPhaser - a recent tool that attempts to refine the phasing by another tool (e.g., WhatsHap) using methylation information, while also creating little computational overhead. For ONT data, LongPhase usually achieves lower switch error rate by avoiding to phase "difficult" variants. LongHap's comprehensive embedding of SVs also allows it to phase them with greater accuracy than other tools.

PacBio HiFi data (38x coverage, Read length N50: 18 kb)

LongHap achieves a switch error rate as low as LongPhase (A), while also achieving longer phase blocks when using methylation information (B). LongHap's also phases more SVs with great accuracy (C).

figures/performance_pacbio.png

ONT R10.4.1 Dorado base calling data (45x coverage, Read length N50: 29 kb)

LongHap achieves a lower switch error rate than WhatsHap and HapCUT2, but slightly higher than LongPhase (A). However, LongHap phases significantly more variants and achieves longer phase blocks when using methylation information (B & C).

figures/performance_ont.png

UL-ONT R10.4.1 Dorado base calling data (44x coverage, Read length N50: 111 kb)

LongHap achieves a lower switch error rate than WhatsHap and HapCUT2, but higher than LongPhase (A). However, LongHap phases significantly more variants and achieves longer phase blocks when using methylation information (B & C).

figures/performance_ulont.png

Computational requirements

LongHap phases in <35 minutes using a single thread and <10 Gb of memory. The exact requirements depend on sequencing coverage, the number of heterozygous variants, density of structural variants that require local realignment, and density of methylated sites. Below we provide the run times for phasing chromosome 1 of HG002 using PacBio HiFi, ONT, and UL-ONT data.

Data type Coverage Read length N50 LongHap Mode Time (hh:mm:ss) Memory (Gb)
PacBio HiFi 38x 18 kb Sequence only 00:04:18 1.2
PacBio HiFi 38x 18 kb Sequence + Methylation 00:06:11 6.1
ONT R10.4.1 45x 29 kb Sequence only 00:15:47 1.5
ONT R10.4.1 45x 29 kb Sequence + Methylation 00:33:36 6.4
UL-ONT R10.4.1 44x 111 kb Sequence only 00:16:06 1.3
UL-ONT R10.4.1 44x 111 kb Sequence + Methylation 00:17:43 7.2

Citation

Aaron Pfennig and Joshua M. Akey, Harnessing methylation signals inherent in long-read sequencing data for improved variant phasing, bioRxiv, 2026, https://doi.org/10.64898/2026.03.11.710820

Contact

Aaron Pfennig, apfennig at princeton.edu

About

Harnessing methylation information for variant phasing

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors