ASMaid

Assembly-based archaic introgression detector, a Hidden Markov Model that leverages haplotype-resolved human pangenome assemblies to detect complete archaic introgression sequences.

Workflow

Background

Genetic introgression from archaic hominins has profoundly reshaped the genetic diversity and adaptive potential of modern humans, yet the full catalog of introgressed sequences, particularly those residing in structurally complex regions, has remained elusive. Here, we present ASMaid (ASseMbly-based archaic introgression detector), a Hidden Markov Model-based framework that leverages haplotype-resolved pangenome assemblies to identify archaic-derived sequences with unprecedented completeness. By integrating both single-nucleotide genotype and structural variation (SV) signals, ASMaid captures significantly more intact archaic segments than conventional reference-based approaches.

Installation

Prerequisites

  • Python 3.9+
  • Conda (recommended)

Setup

  1. Clone the repository:
    git clone https://github.com/Asian-Pan-Genome/ASMaid.git
    cd ASMaid
  2. Create the environment:
    conda env create -f environment.yml
    conda activate asmaid_env
  3. Install ASMaid:
    pip install -e .

Running ASMaid

ASMaid operates through a modular command-line interface. Use asmaid <command> --help to see specific options for each module.

1. Preprocess the VCF (preprocess)

Prepares raw VCF files by normalizing depths, reordering samples, and applying genomic masks.

  • Usage: asmaid preprocess -v input.vcf.gz -s sample.list -m mask.bed --bi_allele
  • Arguments:
    • -v, --vcf: Path to the input VCF file, which must contain both archaic and East African samples (at least 10 East Africans are recommended). The mapping and variant-calling pipeline used to produce the input VCF is available at https://github.com/Asian-Pan-Genome/ArchaicIntrogression/tree/main/01.ASMaid/01.Mapping%20and%20variant%20calling. NOTE: The reference for this VCF must be the assembly in which you want to detect introgressed segments. The VCF may contain only one chromosome.
    • -s, --sampleList: (Optional) A text file where each line contains a sample name. The sample order must be strictly maintained as follows: archaic individuals first, followed by African (AFR) samples, and finally other population samples. If this file is not provided, please ensure the VCF file has been pre-sorted according to this specific order.
    • -m, --mask: (Optional) A BED-formatted file containing genomic regions to be excluded from the analysis. This file defines complex genomic regions where variant calling or depth estimation may be unreliable. In our practice, we defined these masked regions by merging the coordinates of centromeres, rDNA, and VNTR/STR (Variable Number Tandem Repeat/Short Tandem Repeat) loci. This filtering protects the accuracy of both genotype calls and depth normalization at the remaining genomic sites, which is critical for robust HMM state inference.
    • --bi_allele: (Optional) Flag to filter only bi-allelic sites.
  • Output: A preprocessed VCF file (e.g., XXX.sample.mask.bi-allele.normDP.vcf) with normalized depth (NORM_DP) tags.
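The mask described for -m can be assembled by merging per-category interval lists into one non-overlapping set. The following is a minimal sketch, assuming each category is already available as (chrom, start, end) tuples in BED coordinates. The function names and all coordinates are hypothetical and are not part of ASMaid.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps or touches the previous interval: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def to_bed_lines(intervals):
    """Format intervals as tab-separated BED lines."""
    return ["{}\t{}\t{}".format(*iv) for iv in intervals]


# Hypothetical example coordinates for one chromosome.
centromere = [("chr21", 10_000_000, 13_000_000)]
rdna = [("chr21", 12_500_000, 12_800_000)]
vntr_str = [("chr21", 20_000_000, 20_000_500), ("chr21", 20_000_400, 20_001_000)]

mask = merge_intervals(centromere + rdna + vntr_str)
for line in to_bed_lines(mask):
    print(line)
```

Writing the printed lines to a file gives a BED file that could be passed to -m. Dedicated interval tools perform the same merge and may be preferable for large inputs.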

2. HMM Detection (hmm)

Executes the HMM decoding to identify candidate introgression segments.
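For readers unfamiliar with HMM decoding, the sketch below shows generic two-state posterior decoding with the scaled forward-backward algorithm. It is a textbook illustration only, not ASMaid's model: the states (0 = non-introgressed, 1 = introgressed), the binary observations, and every probability are made-up assumptions.

```python
def posterior_introgressed(obs, trans, emit, init):
    """Return P(state == 1 | all observations) for each site."""
    n = len(obs)
    states = (0, 1)
    fwd, scale = [], []
    prev = None
    for t, o in enumerate(obs):
        if t == 0:
            row = [init[s] * emit[s][o] for s in states]
        else:
            row = [emit[s][o] * sum(prev[r] * trans[r][s] for r in states)
                   for s in states]
        c = sum(row)                      # scaling factor avoids underflow
        row = [v / c for v in row]
        fwd.append(row)
        scale.append(c)
        prev = row
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                            for r in states) / scale[t + 1]
    post = []
    for t in range(n):
        w = [fwd[t][s] * bwd[t][s] for s in states]
        post.append(w[1] / (w[0] + w[1]))
    return post


# Toy input: 1 marks an archaic-supporting site, 0 any other site.
obs = [0] * 5 + [1] * 10 + [0] * 5
trans = [[0.99, 0.01], [0.05, 0.95]]   # assumed transition probabilities
emit = [[0.9, 0.1], [0.3, 0.7]]        # assumed emission probabilities
init = [0.95, 0.05]                    # assumed initial distribution
post = posterior_introgressed(obs, trans, emit, init)
```

In this toy input, sites inside the run of supporting observations receive a high posterior for state 1, and contiguous high-posterior sites would form a candidate tract.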

  • Usage: asmaid hmm -i input.vcf -o output_prefix [options]
| Argument | Type | Required | Description | Default |
|---|---|---|---|---|
| -i, --input | str | Yes | Input VCF file (preprocess step output). | - |
| -o, --output | str | Yes | Output file prefix. | - |
| -g, --gt-threshold | float | No | Threshold for AFR genotype '0' ratio. | 0.0 |
| -c, --cnv-threshold | float | No | Threshold for AFR abnormal norm_DP ratio. | 1.0 |
| -t, --threads | int | No | Threads for parallel processing. | 1 |
| -archaic_index | int | No | Column index of the archaic sample in VCF. | 0 |
| -afr_start | int | No | Start index of AFR samples in VCF. | 3 |
| -afr_number | int | No | Number of AFR samples in VCF. | 10 |
| --detail | flag | No | Output detailed site-wise posterior probabilities. | False |
| --cnv-disable | flag | No | Disable CNV information for HMM decoding. | False |
| --only-gt | flag | No | Use genotype information only (skip copy number information). | False |
| --params | flag | No | Export model parameters (transition/emission) to JSON. | False |
| --store-info-site | flag | No | Store informative sites in <prefix>.info.site. | False |
  • Output: A BED file (<prefix>.tracts) containing candidate introgression tracts; a JSON file (<prefix>.params.json) containing model parameters, if --params is specified; a PNG file (HMM_traing_log_likelihood.png) showing the training log-likelihood curve; and a file (<prefix>.info.site) storing informative sites, if --store-info-site is specified.
    • obs column in <prefix>.info.site: 1 for only genotype (GT)-based supportive, 2 for only copy number (CN)-based supportive, 3 for both supportive.
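The obs codes above can be read as two flags, since 3 is the combination of 1 (GT) and 2 (CN). A minimal sketch of decoding and tallying them follows. The helper names are hypothetical, and because the full column layout of <prefix>.info.site is not described here, the example works on a plain list of codes rather than parsing the file.

```python
from collections import Counter

OBS_LABELS = {1: "GT only", 2: "CN only", 3: "GT and CN"}


def decode_obs(code):
    """Map an obs code (1, 2 or 3) to its GT/CN support flags."""
    if code not in OBS_LABELS:
        raise ValueError(f"unexpected obs code: {code!r}")
    return {"gt": bool(code & 1), "cn": bool(code & 2)}


def tally_obs(codes):
    """Count informative sites per support category."""
    counts = Counter(OBS_LABELS[c] for c in codes if c in OBS_LABELS)
    return dict(counts)


# Hypothetical obs values taken from some informative sites.
example_codes = [1, 1, 3, 2, 1, 3]
summary = tally_obs(example_codes)
```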

3. Refining introgression segment candidates (refine)

Classifies and recalibrates candidate introgression tracts using an integrated Identity-by-State (IBS) metric and MASK overlap analysis.

  • Usage: asmaid refine -v input.vcf -b candidate.bed -m mask.bed [options]
  • Arguments:
    • -v, --vcf: Input VCF file (from preprocess step output).
    • -b, --bed: BED file storing candidate introgressed segments (from HMM step output).
    • -m, --mask: Original mask file (from preprocess step input).
    • --no-detail: (Optional) Disable detailed output of similarity scores.
  • Output: A refined table (<prefix>.tracts.refine) identifying lineage assignment (or "Ambiguous") for each tract. (e.g. https://github.com/Asian-Pan-Genome/ASMaid/blob/main/example/Introgressed_segements.Neanderthal.Raw.tracts.refine)
    • chro/start/end: Genomic coordinates of the candidate introgression tract (in bp).
    • intro_prob: Posterior probability of the tract being introgressed, as inferred by the HMM.
    • total_snp_num: Total number of SNPs covered within the segment coordinates.
    • intro_snp_num: Number of SNPs within the segment identified as archaic-derived.
    • gt_intro_num: Count of genotype (GT)-based supportive sites within the segment.
    • gt_notcomp_num: Count of genotype (GT)-based divergent sites within the segment.
    • gt_match_rate: GT-based Identity-by-State (IBS) score, representing the proportion of archaic-like alleles.
    • cnv_intro_num: Count of copy number (CN)-based supportive sites within the segment.
    • cnv_notcomp_num: Count of copy number (CN)-based divergent sites within the segment.
    • cnv_match_rate: CN-based similarity score, indicating parity between the sample’s normalized depth and the archaic reference.
    • weighted_match_rate: Final weighted IBS score, integrating both GT and CN components; this is the primary metric for inferring introgression history.
    • mask_flag: Quality control status. Regions exceeding the overlap threshold with MASK regions are labeled "MASK"; otherwise, they are marked "PASS".
    • ibs_final_flag: Final ancestry assignment for the segment (e.g., nean for Neanderthal, den for Denisovan, or ambiguous if inconclusive).
    • ibs_detail: Detailed array of raw IBS scores, one per sample, in the sample order of the VCF.
  • Filter Strategies: To ensure the reliability of the identified introgression tracts, we recommend a multi-step filtering approach. First, retain only segments labeled as “PASS” in the mask_flag column, which filters out segments erroneously linked across complex repetitive regions. Second, filter by ancestry consistency: retain segments where the ibs_final_flag matches the archaic lineage specified during the HMM decoding step (e.g., if the HMM was tuned for Altai Neanderthal, keep only tracts labeled as nean). Finally, we suggest applying a threshold on the intro_prob column (e.g., intro_prob > 0.9) to isolate segments with high posterior confidence.
    # example:
    awk -F'\t' '$14=="PASS" && $4>0.9 && $15=="nean" {print $0}' output.tracts.refine > high_conf.nean.tracts
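The same three-step filter can be sketched in Python for use inside a larger script. This assumes the tab-separated column order listed above (intro_prob in column 4, mask_flag in column 14, ibs_final_flag in column 15). The function name is hypothetical, and the sample rows are invented for illustration.

```python
def filter_tracts(lines, lineage="nean", min_prob=0.9):
    """Keep PASS tracts of the requested lineage with intro_prob > min_prob."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 15:
            continue                      # too few columns to evaluate
        try:
            prob = float(fields[3])       # column 4: intro_prob
        except ValueError:
            continue                      # header or malformed line
        if fields[13] == "PASS" and fields[14] == lineage and prob > min_prob:
            kept.append(line.rstrip("\n"))
    return kept


def make_row(prob, mask_flag, lineage):
    """Build an invented 16-column row for demonstration purposes."""
    cols = ["chr21", "100", "200", str(prob)] + ["0"] * 9
    return "\t".join(cols + [mask_flag, lineage, "[]"])


rows = [
    make_row(0.95, "PASS", "nean"),       # kept
    make_row(0.95, "MASK", "nean"),       # dropped: overlaps mask
    make_row(0.50, "PASS", "nean"),       # dropped: low posterior
    make_row(0.99, "PASS", "den"),        # dropped: other lineage
]
high_conf = filter_tracts(rows)
```

To filter a real file, pass an open file handle as lines and write the returned rows to the output path.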

Quick Start

You can test the pipeline using the sample data provided in the examples/ directory:

cd examples/

# 01. Preprocess VCFs
asmaid preprocess -v Merge.C001-CHA-E01#Mat#chr21.vcf.gz -s sample.order.list -m C001-CHA-E01-Mat.cent.VNTR_STR.rDNA.bed --bi_allele

# 02. HMM decoding
asmaid hmm -i Merge.C001-CHA-E01#Mat#chr21.sample.mask.bi-allele.normDP.vcf -o Introgressed_segements.Neanderthal.Raw -t 32 --params --store-info-site

# 03. Refine tracts
asmaid refine -v Merge.C001-CHA-E01#Mat#chr21.sample.mask.bi-allele.normDP.vcf -b Introgressed_segements.Neanderthal.Raw.tracts -m C001-CHA-E01-Mat.cent.VNTR_STR.rDNA.bed
  • Merge.C001-CHA-E01#Mat#chr21.vcf.gz contains three archaic samples (Altai Neanderthal, Chagyrskaya Neanderthal, and Denisovan) and ten East African samples. The commands above discover segments of Altai Neanderthal introgression in the C001-CHA-E01-Mat assembly, so in the HMM step -archaic_index is left at its default of 0, which selects the Altai Neanderthal.
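For a different sample layout, the values for -archaic_index, -afr_start, and -afr_number can be derived from the ordered sample list. The sketch below assumes 0-based sample indices, which is consistent with the defaults (0, 3, 10) for three archaic samples followed by ten East Africans. The function and all sample names are hypothetical.

```python
def hmm_sample_indices(samples, archaic_target, afr_samples):
    """Return (archaic_index, afr_start, afr_number) for an ordered sample list."""
    afr_set = set(afr_samples)
    afr_positions = [i for i, s in enumerate(samples) if s in afr_set]
    if not afr_positions:
        raise ValueError("no AFR samples found in the sample list")
    afr_start = afr_positions[0]
    afr_number = len(afr_positions)
    # AFR samples are expected to form one contiguous block.
    if afr_positions != list(range(afr_start, afr_start + afr_number)):
        raise ValueError("AFR samples must be contiguous in the sample list")
    archaic_index = samples.index(archaic_target)
    # Archaic samples are expected to come before the AFR block.
    if archaic_index >= afr_start:
        raise ValueError("archaic samples must precede AFR samples")
    return archaic_index, afr_start, afr_number


# Hypothetical sample names mirroring the example layout.
archaic = ["AltaiNeanderthal", "ChagyrskayaNeanderthal", "Denisovan"]
afr = [f"AFR{i:02d}" for i in range(1, 11)]
samples = archaic + afr + ["TargetSample"]

indices = hmm_sample_indices(samples, "AltaiNeanderthal", afr)
print(indices)
```

Choosing "Denisovan" as the target in this layout would give an archaic index of 2 with the same AFR start and count.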

License

This project is licensed under the MIT License.

  • Note: By using this software, you agree to the terms and conditions outlined in the LICENSE file included in the repository.

Contact

For any questions, bug reports, or collaboration requests, please feel free to reach out:
