Assembly-based archaic introgression detector, a Hidden Markov Model that leverages haplotype-resolved human pangenome assemblies to detect complete archaic introgression sequences.
Genetic introgression from archaic hominins has profoundly reshaped the genetic diversity and adaptive potential of modern humans, yet the full catalog of introgressed sequences, particularly those residing in structurally complex regions, has remained elusive. Here, we present ASMaid (ASseMbly-based archaic introgression detector), a Hidden Markov Model-based framework that leverages haplotype-resolved pangenome assemblies to identify archaic-derived sequences with unprecedented completeness. By integrating both single-nucleotide genotype and structural variation (SV) signals, ASMaid captures significantly more intact archaic segments than conventional reference-based approaches.
- Python 3.9+
- Conda (recommended)
- Clone the repository:
git clone https://github.com/Asian-Pan-Genome/ASMaid.git
cd ASMaid
- Create the environment:
conda env create -f environment.yml
conda activate asmaid_env
- Install ASMaid:
pip install -e .
ASMaid operates through a modular command-line interface. Use asmaid <command> --help to see specific options for each module.
Prepares raw VCF files by normalizing depths, reordering samples, and applying genomic masks.
- Usage:
asmaid preprocess -v input.vcf.gz -s sample.list -m mask.bed --bi_allele
- Arguments:
  - -v, --vcf: Path to the input VCF file, which must contain both archaic and East African samples (at least 10 East Africans are recommended). The mapping and variant-calling pipeline that produces this VCF is described at https://github.com/Asian-Pan-Genome/ArchaicIntrogression/tree/main/01.ASMaid/01.Mapping%20and%20variant%20calling. NOTE: The reference of this VCF must be the assembly in which you want to detect introgressed segments, and the VCF may contain only one chromosome.
  - -s, --sampleList: (Optional) A text file in which each line contains one sample name. The sample order must be strictly as follows: archaic individuals first, followed by African (AFR) samples, and finally samples from other populations. If this file is not provided, make sure the VCF file has already been sorted in this order.
  - -m, --mask: (Optional) A BED-formatted file of genomic regions to exclude from the analysis. It defines complex regions where variant calling or depth estimation may be unreliable. In our practice, we defined the masked regions by merging the coordinates of centromeres, rDNA, and VNTR/STR (Variable Number Tandem Repeat/Short Tandem Repeat) loci. This filtering protects the accuracy of both genotype calls and depth normalization at the remaining sites, which is critical for robust HMM state inference.
  - --bi_allele: (Optional) Flag to keep only bi-allelic sites.
- Output: A preprocessed VCF file (e.g. XXX.sample.mask.bi-allele.normDP.vcf) with normalized depth (NORM_DP) tags.
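The mask described above is built by merging several BED tracks into one. The following is a minimal sketch of that merging step, not ASMaid's own code: the function name `merge_bed_intervals` and all coordinates are hypothetical, and intervals are assumed to be BED-style half-open `(chrom, start, end)` tuples.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def merge_bed_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals, chromosome by chromosome."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    merged: List[Interval] = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_end is None:
                # First interval on this chromosome.
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps or touches the running interval: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: emit the running interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical coordinates, for illustration only.
centromere = [("chr21", 100, 500)]
rdna = [("chr21", 450, 700)]
vntr_str = [("chr21", 900, 950), ("chr21", 950, 1000)]

mask = merge_bed_intervals(centromere + rdna + vntr_str)
for chrom, start, end in mask:
    print(f"{chrom}\t{start}\t{end}")
```

The printed lines can be written to a file and passed to -m. An existing interval tool that merges BED records would serve equally well; the sketch only shows what "merging the coordinates" means here.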
Executes the HMM decoding to identify candidate introgression segments.
- Usage:
asmaid hmm -i input.vcf -o output_prefix [options]
| Argument | Type | Required | Description | Default |
|---|---|---|---|---|
| -i, --input | str | Yes | Input VCF file (preprocess step output). | - |
| -o, --output | str | Yes | Output file prefix. | - |
| -g, --gt-threshold | float | No | Threshold for AFR genotype '0' ratio. | 0.0 |
| -c, --cnv-threshold | float | No | Threshold for AFR abnormal norm_DP ratio. | 1.0 |
| -t, --threads | int | No | Threads for parallel processing. | 1 |
| -archaic_index | int | No | Column index of the archaic sample in VCF. | 0 |
| -afr_start | int | No | Start index of AFR samples in VCF. | 3 |
| -afr_number | int | No | Number of AFR samples in VCF. | 10 |
| --detail | flag | No | Output detailed site-wise posterior probabilities. | False |
| --cnv-disable | flag | No | Disable CNV information for HMM decoding. | False |
| --only-gt | flag | No | Use genotype information only (skip copy number information). | False |
| --params | flag | No | Export model parameters (transition/emission) to JSON. | False |
| --store-info-site | flag | No | Store informative sites in <prefix>.info.site. | False |
- Output: A BED file (<prefix>.tracts) containing candidate introgression tracts; a JSON file (<prefix>.params.json) containing model parameters if --params is specified; a PNG file (HMM_traing_log_likelihood.png) containing the training log-likelihood curve; and a file (<prefix>.info.site) storing informative sites if --store-info-site is specified. The obs column in <prefix>.info.site is 1 for sites supported by genotype (GT) only, 2 for sites supported by copy number (CN) only, and 3 for sites supported by both.
Classifies and recalibrates candidate introgression tracts using an integrated Identity-by-State (IBS) metric and MASK overlap analysis.
- Usage:
asmaid refine -v input.vcf -b candidate.bed -m mask.bed [options]
- Arguments:
  - -v, --vcf: Input VCF file (from the preprocess step output).
  - -b, --bed: BED file storing candidate introgressed segments (from the HMM step output).
  - -m, --mask: Original mask file (from the preprocess step input).
  - --no-detail: (Optional) Disable detailed output of similarity scores.
- Output: A refined table (<prefix>.tracts.refine) giving the lineage assignment (or "Ambiguous") for each tract (e.g. https://github.com/Asian-Pan-Genome/ASMaid/blob/main/example/Introgressed_segements.Neanderthal.Raw.tracts.refine). Its columns are:
  - chro/start/end: Genomic coordinates of the candidate introgression tract (in bp).
  - intro_prob: Posterior probability of the tract being introgressed, as inferred by the HMM.
  - total_snp_num: Total number of SNPs within the segment coordinates.
  - intro_snp_num: Number of SNPs within the segment identified as archaic-derived.
  - gt_intro_num: Count of genotype (GT)-based supportive sites within the segment.
  - gt_notcomp_num: Count of genotype (GT)-based divergent sites within the segment.
  - gt_match_rate: GT-based Identity-by-State (IBS) score, representing the proportion of archaic-like alleles.
  - cnv_intro_num: Count of copy number (CN)-based supportive sites within the segment.
  - cnv_notcomp_num: Count of copy number (CN)-based divergent sites within the segment.
  - cnv_match_rate: CN-based similarity score, indicating agreement between the sample's normalized depth and the archaic reference.
  - weighted_match_rate: Final weighted IBS score, integrating both GT and CN components; this is the primary metric for inferring introgression history.
  - mask_flag: Quality control status. Regions exceeding the overlap threshold with MASK regions are labeled "MASK"; otherwise, they are marked "PASS".
  - ibs_final_flag: Final ancestry assignment for the segment (e.g., nean for Neanderthal, den for Denisovan, or ambiguous if inconclusive).
  - ibs_detail: Raw IBS scores for each sample, in the order the samples appear in the VCF.
- Filter Strategies: To ensure the reliability of the identified introgression tracts, we recommend a multi-step filtering approach. First, retain only segments labeled "PASS" in the mask_flag column, which removes segments erroneously linked across complex repetitive regions. Second, filter by ancestry consistency: retain segments whose ibs_final_flag matches the archaic lineage specified during the HMM decoding step (e.g., if the HMM was tuned for Altai Neanderthal, keep only tracts labeled nean). Finally, we suggest applying a threshold on the intro_prob column (e.g., intro_prob > 0.9) to isolate segments with high posterior confidence.
# example:
awk -F'\t' '$14=="PASS" && $4>0.9 && $15=="nean" {print $0}' output.tracts.refine > high_conf.nean.tracts
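The same three filters can be applied from Python when the table is processed further downstream. This is a rough sketch rather than part of ASMaid: the function name `filter_tracts` is hypothetical, the file is assumed to be tab-delimited with columns in the order listed above (intro_prob 4th, mask_flag 14th, ibs_final_flag 15th), and the rows below are made-up values used only to exercise the function.

```python
import csv
import io


def filter_tracts(handle, lineage="nean", min_prob=0.9):
    """Keep rows with mask_flag == PASS, matching lineage, and intro_prob > min_prob."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 15:
            continue  # skip blank or truncated lines
        try:
            prob = float(row[3])  # intro_prob (4th column)
        except ValueError:
            continue  # non-numeric value, e.g. a header line
        # mask_flag is the 14th column, ibs_final_flag the 15th.
        if row[13] == "PASS" and row[14] == lineage and prob > min_prob:
            kept.append(row)
    return kept


def make_row(prob, mask_flag, ibs_flag):
    """Build one made-up 16-column row (illustrative values only)."""
    return "\t".join([
        "chr21", "1000", "2000", str(prob), "10", "5", "4", "1", "0.8",
        "2", "0", "1.0", "0.85", mask_flag, ibs_flag, "0.8,0.7,0.3",
    ])


header = "\t".join([
    "chro", "start", "end", "intro_prob", "total_snp_num", "intro_snp_num",
    "gt_intro_num", "gt_notcomp_num", "gt_match_rate", "cnv_intro_num",
    "cnv_notcomp_num", "cnv_match_rate", "weighted_match_rate",
    "mask_flag", "ibs_final_flag", "ibs_detail",
])
table = "\n".join([
    header,
    make_row(0.95, "PASS", "nean"),  # kept
    make_row(0.95, "MASK", "nean"),  # dropped: overlaps masked regions
    make_row(0.50, "PASS", "nean"),  # dropped: low posterior probability
    make_row(0.99, "PASS", "den"),   # dropped: different lineage
])

kept = filter_tracts(io.StringIO(table))
print(len(kept))
```

With a real file, pass an open file handle instead of the `io.StringIO` object. Whether the table carries a header line is not stated here, so the sketch tolerates one by skipping rows whose intro_prob is not numeric.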
You can test the pipeline using the sample data provided in the examples/ directory:
cd examples/
# 01. Preprocess VCFs
asmaid preprocess -v Merge.C001-CHA-E01#Mat#chr21.vcf.gz -s sample.order.list -m C001-CHA-E01-Mat.cent.VNTR_STR.rDNA.bed --bi_allele
# 02. HMM decoding
asmaid hmm -i Merge.C001-CHA-E01#Mat#chr21.sample.mask.bi-allele.normDP.vcf -o Introgressed_segements.Neanderthal.Raw -t 32 --params --store-info-site
# 03. Refine tracts
asmaid refine -v Merge.C001-CHA-E01#Mat#chr21.sample.mask.bi-allele.normDP.vcf -b Introgressed_segements.Neanderthal.Raw.tracts -m C001-CHA-E01-Mat.cent.VNTR_STR.rDNA.bed
- Three archaic samples (Altai Neanderthal, Chagyrskaya Neanderthal and Denisovan) and ten East African samples are in Merge.C001-CHA-E01#Mat#chr21.vcf.gz. The commands above discover segments of Altai Neanderthal introgression in the C001-CHA-E01-Mat assembly, so in the HMM step -archaic_index is left at 0 (the default) to select the Altai Neanderthal.
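For a different sample set, the index options of the HMM step have to match the sample order. The sketch below derives them from an ordered sample list and checks the required grouping (archaic first, then AFR, then others). It is an illustration under assumptions: the helper `hmm_indices` and every sample name are hypothetical, and the indices are taken to be 0-based positions among the VCF sample columns, which is inferred from the defaults (0, 3, 10) matching the three-archaic, ten-AFR example rather than stated explicitly.

```python
def hmm_indices(sample_order, archaic_names, afr_names, target_archaic):
    """Derive -archaic_index, -afr_start and -afr_number from an ordered sample list.

    Assumes 0-based positions among sample columns (an inference, see lead-in).
    """
    n_arch, n_afr = len(archaic_names), len(afr_names)
    # Archaic individuals must occupy the first positions.
    if set(sample_order[:n_arch]) != set(archaic_names):
        raise ValueError("archaic samples must come first in the sample order")
    # AFR samples must follow immediately after the archaic block.
    if set(sample_order[n_arch:n_arch + n_afr]) != set(afr_names):
        raise ValueError("AFR samples must directly follow the archaic samples")
    return {
        "archaic_index": sample_order.index(target_archaic),
        "afr_start": n_arch,
        "afr_number": n_afr,
    }


# Hypothetical sample names, for illustration only.
archaic = ["AltaiNeanderthal", "ChagyrskayaNeanderthal", "Denisovan"]
afr = [f"AFR{i:02d}" for i in range(1, 11)]
order = archaic + afr + ["OtherSample01"]

indices = hmm_indices(order, archaic, afr, target_archaic="AltaiNeanderthal")
print(indices)
```

Under the same assumption, targeting the Denisovan in this ordering would give an archaic index of 2.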
This project is licensed under the MIT License.
- Note: By using this software, you agree to the terms and conditions outlined in the LICENSE file included in the repository.
For any questions, bug reports, or collaboration requests, please feel free to reach out:
- Primary Contact: Mingyu Suo – suomingyu@zju.edu.cn
- Lab: Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, China
- GitHub Issues: https://github.com/Asian-Pan-Genome/ASMaid/issues
