ASMaid

Assembly-based archaic introgression detector, a Hidden Markov Model that leverages haplotype-resolved human pangenome assemblies to detect complete archaic introgression sequences.


Background

Genetic introgression from archaic hominins has profoundly reshaped the genetic diversity and adaptive potential of modern humans, yet the full catalog of introgressed sequences, particularly those residing in structurally complex regions, has remained elusive. Here, we present ASMaid (ASseMbly-based archaic introgression detector), a Hidden Markov Model-based framework that leverages haplotype-resolved pangenome assemblies to identify archaic-derived sequences more completely than was previously possible. By integrating both single-nucleotide genotype and structural variation (SV) signals, ASMaid captures significantly more intact archaic segments than conventional reference-based approaches.

Installation

Prerequisites

Python 3.9+
Conda (recommended)

Setup

Clone the repository:

git clone https://github.com/Asian-Pan-Genome/ASMaid.git
cd ASMaid

Create the environment:

conda env create -f environment.yml
conda activate asmaid_env

Install ASMaid:
```
pip install -e .
```

Running ASMaid

ASMaid operates through a modular command-line interface. Use asmaid <command> --help to see specific options for each module.

1. Preprocess the VCF (`preprocess`)

Prepares raw VCF files by normalizing depths, reordering samples, and applying genomic masks.

Usage: asmaid preprocess -v input.vcf.gz -s sample.list -m mask.bed --bi_allele
Arguments:
- -v, --vcf: Path to the input VCF file, which must contain both archaic and East African samples (at least 10 East Africans are recommended). The mapping and variant-calling pipeline used to produce the input VCF is available at https://github.com/Asian-Pan-Genome/ArchaicIntrogression/tree/main/01.ASMaid/01.Mapping%20and%20variant%20calling. NOTE: The reference for this VCF must be the assembly in which you want to detect introgressed segments. Only one chromosome is allowed per VCF.
- -s, --sampleList: (Optional) A text file with one sample name per line. The sample order must be strictly maintained as follows: archaic individuals first, followed by African (AFR) samples, and finally samples from other populations. If this file is not provided, make sure the VCF file has already been sorted in this order.
- -m, --mask: (Optional) A BED-formatted file containing genomic regions to be excluded from the analysis. This file defines complex genomic regions where variant calling or depth estimation may be unreliable. In our practice, we defined these masked regions by merging the coordinates of centromeres, rDNA, and VNTR/STR (Variable Number Tandem Repeat/Short Tandem Repeat) loci. This filtering protects the accuracy of both genotype calls and depth normalization at the remaining sites, which is critical for robust HMM state inference.
- --bi_allele: (Optional) Flag to keep only bi-allelic sites.
Output: A preprocessed VCF file (e.g. XXX.sample.mask.bi-allele.normDP.vcf) with normalized depth (NORM_DP) tags.
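
The mask described above is built by merging several region sets into one BED file. The following is a minimal sketch of that merging step, not ASMaid's own code: the function name `merge_bed_intervals` and all coordinates are hypothetical, and intervals are assumed to be half-open `(chrom, start, end)` tuples as in BED.

```python
def merge_bed_intervals(intervals):
    """Merge overlapping or adjacent (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        # Extend the previous interval when the new one touches or overlaps it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


# Hypothetical coordinates, for illustration only.
centromere = [("chr21", 10_000_000, 13_000_000)]
rdna = [("chr21", 12_500_000, 13_200_000)]
vntr_str = [("chr21", 20_000_000, 20_000_500)]

mask = merge_bed_intervals(centromere + rdna + vntr_str)
for chrom, start, end in mask:
    print(f"{chrom}\t{start}\t{end}")
```

The printed lines can be redirected to a file and passed to `-m`. In practice a dedicated interval tool would usually do this job; the sketch only shows the logic.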

2. HMM Detection (`hmm`)

Executes the HMM decoding to identify candidate introgression segments.
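
To give a feel for what HMM decoding means here, the following toy sketch runs Viterbi decoding on a two-state model. It is not ASMaid's model: the state names, every probability, and the binary observation coding are illustrative assumptions, and this README does not state which decoding algorithm ASMaid uses internally.

```python
import math

# Toy two-state HMM. All numbers are illustrative assumptions,
# not ASMaid's trained parameters.
STATES = ("non_introgressed", "introgressed")
START = {"non_introgressed": 0.95, "introgressed": 0.05}
TRANS = {
    "non_introgressed": {"non_introgressed": 0.9, "introgressed": 0.1},
    "introgressed": {"non_introgressed": 0.2, "introgressed": 0.8},
}
# Observation 1 = site supports archaic ancestry, 0 = it does not.
EMIT = {
    "non_introgressed": {0: 0.9, 1: 0.1},
    "introgressed": {0: 0.3, 1: 0.7},
}


def viterbi(observations):
    """Return the most probable state path for a non-empty list of 0/1 observations."""
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this site.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi([0, 0, 1, 1, 1, 1, 0, 0]))
```

Consecutive sites labeled `introgressed` would form one candidate tract in this toy setting.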

Usage: asmaid hmm -i input.vcf -o output_prefix [options]

| Argument | Type | Required | Description | Default |
|---|---|---|---|---|
| `-i`, `--input` | `str` | Yes | Input VCF file (preprocess step output). | - |
| `-o`, `--output` | `str` | Yes | Output file prefix. | - |
| `-g`, `--gt-threshold` | `float` | No | Threshold for AFR genotype '0' ratio. | `0.0` |
| `-c`, `--cnv-threshold` | `float` | No | Threshold for AFR abnormal `norm_DP` ratio. | `1.0` |
| `-t`, `--threads` | `int` | No | Threads for parallel processing. | `1` |
| `-archaic_index` | `int` | No | Column index of the archaic sample in VCF. | `0` |
| `-afr_start` | `int` | No | Start index of AFR samples in VCF. | `3` |
| `-afr_number` | `int` | No | Number of AFR samples in VCF. | `10` |
| `--detail` | `flag` | No | Output detailed site-wise posterior probabilities. | `False` |
| `--cnv-disable` | `flag` | No | Disable CNV information for HMM decoding. | `False` |
| `--only-gt` | `flag` | No | Use genotype information only (skip copy number information). | `False` |
| `--params` | `flag` | No | Export model parameters (transition/emission) to JSON. | `False` |
| `--store-info-site` | `flag` | No | Store informative sites in `<prefix>.info.site`. | `False` |

Output: A BED file (<prefix>.tracts) containing candidate introgression tracts; a JSON file (<prefix>.params.json) containing model parameters, if --params is specified; a PNG file (HMM_traing_log_likelihood.png) showing the training log-likelihood curve; and a file (<prefix>.info.site) storing informative sites, if --store-info-site is specified.
- obs column in <prefix>.info.site: 1 if the site is supported by genotype (GT) evidence only, 2 if it is supported by copy number (CN) evidence only, 3 if it is supported by both.
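
As a small illustration of the obs codes, the sketch below tallies a list of codes by evidence type. The helper `summarize_obs` and the example codes are hypothetical. Only the meanings of 1, 2, and 3 come from the description above, so any other value is reported as undocumented. How the obs column is laid out in the file is not assumed here; the codes are passed in as a plain list.

```python
from collections import Counter

# Meanings of the documented obs codes.
OBS_MEANING = {
    1: "GT-only support",
    2: "CN-only support",
    3: "GT and CN support",
}


def summarize_obs(codes):
    """Count informative sites per evidence type."""
    counts = Counter(OBS_MEANING.get(int(code), "undocumented code") for code in codes)
    return dict(counts)


example_codes = [1, 1, 3, 2, 1]  # hypothetical values, for illustration only
summary = summarize_obs(example_codes)
print(summary)
```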

3. Refining introgression segment candidates (`refine`)

Classifies and recalibrates candidate introgression tracts using an integrated Identity-by-State (IBS) metric and MASK overlap analysis.

Usage: asmaid refine -v input.vcf -b candidate.bed -m mask.bed [options]
Arguments:
- -v, --vcf: Input VCF file (from preprocess step output)
- -b, --bed: BED file storing candidate introgressed segments (from HMM step output).
- -m, --mask: Original mask file (from preprocess step input).
- --no-detail: (Optional) Disable detailed output of similarity scores.
Output: A refined table (<prefix>.tracts.refine) giving the lineage assignment (or "Ambiguous") for each tract (e.g. https://github.com/Asian-Pan-Genome/ASMaid/blob/main/example/Introgressed_segements.Neanderthal.Raw.tracts.refine).
- chro/start/end: Genomic coordinates of the candidate introgression tract (in bp).
- intro_prob: Posterior probability of the tract being introgressed, as inferred by the HMM.
- total_snp_num: Total number of SNPs within the segment coordinates.
- intro_snp_num: Number of SNPs within the segment identified as archaic-derived.
- gt_intro_num: Count of genotype (GT)-based supporting sites within the segment.
- gt_notcomp_num: Count of genotype (GT)-based divergent sites within the segment.
- gt_match_rate: GT-based Identity-by-State (IBS) score, representing the proportion of archaic-like alleles.
- cnv_intro_num: Count of copy number (CN)-based supporting sites within the segment.
- cnv_notcomp_num: Count of copy number (CN)-based divergent sites within the segment.
- cnv_match_rate: CN-based similarity score, indicating how closely the sample's normalized depth matches that of the archaic reference.
- weighted_match_rate: Final weighted IBS score, integrating both GT and CN components; this is the primary metric for inferring introgression history.
- mask_flag: Quality control status. Regions exceeding the overlap threshold with MASK regions are labeled "MASK"; otherwise, they are marked "PASS".
- ibs_final_flag: Final ancestry assignment for the segment (e.g., nean for Neanderthal, den for Denisovan, or ambiguous if inconclusive).
- ibs_detail: Raw IBS scores for each sample, listed in the sample order of the VCF.
Filter Strategies: To ensure the reliability of the identified introgression tracts, we recommend a multi-step filtering approach. First, retain only segments labeled as “PASS” in the mask_flag column, which filters out segments erroneously linked across complex repetitive regions. Second, filter by ancestry consistency: retain segments where the ibs_final_flag matches the archaic lineage specified during the HMM decoding step (e.g., if the HMM was tuned for Altai Neanderthal, keep only tracts labeled as nean). Finally, we suggest applying a threshold on the intro_prob column (e.g., intro_prob > 0.9) to isolate segments with high posterior confidence.
```
# example:
awk -F'\t' '$14=="PASS" && $4>0.9 && $15=="nean" {print $0}' output.tracts.refine > high_conf.nean.tracts
```
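
The same three filters can be sketched in Python, which may be easier to extend than the awk one-liner. This is a sketch under assumptions: the function name `filter_tracts` is hypothetical, the column positions follow the awk example (column 4 is intro_prob, 14 is mask_flag, 15 is ibs_final_flag), and the rows below are made-up values. Lines whose fourth field is not numeric, such as a possible header, are skipped.

```python
import csv


def filter_tracts(lines, lineage="nean", min_prob=0.9):
    """Keep PASS tracts of the given lineage with intro_prob above min_prob."""
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 15:
            continue
        try:
            intro_prob = float(row[3])
        except ValueError:
            continue  # non-numeric field, e.g. a header line
        if row[13] == "PASS" and intro_prob > min_prob and row[14] == lineage:
            kept.append(row)
    return kept


# Hypothetical rows, for illustration only.
header = ["chro", "start", "end", "intro_prob"] + ["col"] * 9 + ["mask_flag", "ibs_final_flag", "ibs_detail"]
rows = [
    ["chr21", "100", "200", "0.95"] + ["0"] * 9 + ["PASS", "nean", "."],
    ["chr21", "300", "400", "0.99"] + ["0"] * 9 + ["MASK", "nean", "."],
    ["chr21", "500", "600", "0.50"] + ["0"] * 9 + ["PASS", "nean", "."],
    ["chr21", "700", "800", "0.97"] + ["0"] * 9 + ["PASS", "den", "."],
]
example_lines = ["\t".join(fields) for fields in [header] + rows]

high_conf = filter_tracts(example_lines)
print(len(high_conf), high_conf[0][:3])
```

To run it on a real file, pass an open file handle instead of `example_lines`.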

Quick Start

You can test the pipeline using the sample data provided in the example/ directory:

cd example/

# 01. Preprocess VCFs
asmaid preprocess -v Merge.C001-CHA-E01#Mat#chr21.vcf.gz -s sample.order.list -m C001-CHA-E01-Mat.cent.VNTR_STR.rDNA.bed --bi_allele

# 02. HMM decoding
asmaid hmm -i Merge.C001-CHA-E01#Mat#chr21.sample.mask.bi-allele.normDP.vcf -o Introgressed_segements.Neanderthal.Raw -t 32 --params --store-info-site

# 03. Refine tracts
asmaid refine -v Merge.C001-CHA-E01#Mat#chr21.sample.mask.bi-allele.normDP.vcf -b Introgressed_segements.Neanderthal.Raw.tracts -m C001-CHA-E01-Mat.cent.VNTR_STR.rDNA.bed

The file Merge.C001-CHA-E01#Mat#chr21.vcf.gz contains three archaic samples (Altai Neanderthal, Chagyrskaya Neanderthal, and Denisovan) and ten East African samples. The commands above discover segments of Altai Neanderthal introgression in the C001-CHA-E01-Mat assembly, so in the HMM step -archaic_index is left at 0 (the default) to select the Altai Neanderthal.
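
The sketch below shows how the index options could map onto the sample order in this example. It assumes, based on the defaults and the description above, that the indices are 0-based positions among the VCF sample columns. The sample names and the helper `split_roles` are hypothetical and are not part of ASMaid.

```python
def split_roles(samples, archaic_index=0, afr_start=3, afr_number=10):
    """Pick the target archaic sample and the AFR panel by position."""
    archaic = samples[archaic_index]
    afr_panel = samples[afr_start:afr_start + afr_number]
    return archaic, afr_panel


# Hypothetical sample names in the required order: archaic first, then AFR.
samples = ["AltaiNeanderthal", "ChagyrskayaNeanderthal", "Denisovan"]
samples += [f"AFR{i:02d}" for i in range(1, 11)]

archaic, afr_panel = split_roles(samples)
print(archaic, len(afr_panel), afr_panel[0], afr_panel[-1])
```

Under the same assumption, targeting a different archaic genome would mean changing `-archaic_index` to that sample's position in the list.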

License

This project is licensed under the MIT License.

Note: By using this software, you agree to the terms and conditions outlined in the LICENSE file included in the repository.

Contact

For any questions, bug reports, or collaboration requests, please feel free to reach out:

Primary Contact: Mingyu Suo – suomingyu@zju.edu.cn
Lab: Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, China
GitHub Issues: https://github.com/Asian-Pan-Genome/ASMaid/issues
