The high degree of similarity between gametologous sequences on the sex chromosomes can lead to the misalignment of sequencing reads and substantially affect variant calling. Here we present XYalign, a new tool that (1) quickly infers sex chromosome ploidy in NGS data, (2) remaps reads based on the inferred sex chromosome complement of the individual, and (3) outputs quality, depth, and allele-balance metrics across chromosomes.
Webster TH; Couse M; Grande BM; Karlins E; Phung T; Richmond PA; Whitford W; Wilson MA. 2019. Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data. GigaScience 8(7): giz074. DOI: https://doi.org/10.1093/gigascience/giz074
If you use XYalign or discuss/correct for bias in mapping on the sex chromosomes, please cite this article.
See full documentation at Read The Docs -- Under construction
Post any questions you have at the XYalign Google Group
Post any bugs/issues to XYalign's issues page on Github
Quick start and examples
XYalign has only been tested on Linux and Mac systems. We recommend users install and manage XYalign (and programming environments) using Conda. To do this:
- First, download and install Anaconda or Miniconda.
- Finish installation with the following commands to install XYalign and all of its dependencies in an environment called "xyalign_env":
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n xyalign_env xyalign
- Load your new environment (containing XYalign and all related programs) with:
source activate xyalign_env
Prepare a sex-specific reference genome
Assuming XYalign is installed correctly with all associated programs and is available
on your PATH (see "Installing XYalign" above), you can use the command
(assume the following is on one line):
xyalign --PREPARE_REFERENCE --ref reference.fasta --xx_ref_out /path/to/reference.XXonly.fasta --xy_ref_out /path/to/reference.XY.fasta --x_chromosome chrX --y_chromosome chrY --reference_mask mask.bed --output_dir output_directory
In the above command,
reference.fasta is the original reference genome,
/path/to/reference.XXonly.fasta and /path/to/reference.XY.fasta are the
full paths to and names of the desired output references for XX and XY samples, and
chrX and chrY are the exact names of the X and Y chromosome
scaffolds in the assembly.
mask.bed is a BED file containing regions that
should be masked in both output fastas.
output_directory is the name of a
directory into which the logfile and other intermediate files will be deposited.
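To make the masking step concrete, here is a minimal Python sketch of what preparing the two references amounts to conceptually. It is not XYalign's implementation: it assumes, as the XXonly output name suggests, that the XX reference hides the whole Y chromosome, and that masking means replacing bases with N. The function names and the toy genome are hypothetical and exist only for illustration.

```python
def mask_regions(seq, regions):
    """Return seq with each half-open [start, end) interval replaced by N."""
    chars = list(seq)
    for start, end in regions:
        # Clip the interval to the sequence so out-of-range BED lines are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def prepare_references(genome, y_chromosome, mask):
    """Build (xx_reference, xy_reference) from a dict of name -> sequence.

    mask maps a chromosome name to a list of BED-style (start, end) intervals
    that are masked in both outputs. In the XX reference the entire Y
    chromosome is additionally masked (an assumption of this sketch).
    """
    xy_ref = {name: mask_regions(seq, mask.get(name, []))
              for name, seq in genome.items()}
    xx_ref = dict(xy_ref)
    xx_ref[y_chromosome] = "N" * len(genome[y_chromosome])
    return xx_ref, xy_ref


# Toy genome standing in for reference.fasta (hypothetical sequences).
genome = {"chr19": "ACGTACGTAC", "chrX": "GGGGCCCCAA", "chrY": "TTTTAAAA"}
mask = {"chrX": [(0, 4)]}  # equivalent to a mask.bed line "chrX 0 4"

xx_ref, xy_ref = prepare_references(genome, "chrY", mask)
print(xx_ref["chrY"])  # NNNNNNNN
print(xy_ref["chrX"])  # NNNNCCCCAA
```

Masking the Y rather than deleting it keeps scaffold names and coordinates identical between the two references, so downstream files remain comparable.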
Analyze a single bam file to explore sex chromosome content, etc.
You can use the command (assume the following is on one line):
xyalign --CHARACTERIZE_SEX_CHROMS --ref reference.fasta --bam sample1.bam --output_dir sample1_results --sample_id sample1 --cpus 4 --window_size 5000 --chromosomes chr19 chrX chrY --x_chromosome chrX --y_chromosome chrY
In the above command,
reference.fasta is the full path to the reference genome
used to generate the bam file,
sample1.bam is the full path to the bam file,
sample1_results is our desired output directory, and
sample1 is the name of
our sample. We're using four cores (
--cpus 4) and 5kb nonoverlapping
windows (--window_size 5000) for analysis. We're analyzing three chromosomes named
chr19, chrX, and chrY, and our X and Y scaffolds in the reference are named chrX and chrY.
Our output of interest will be in
sample1_results/results. Tables (.csv) of depth and mapq measurements per window
sample1_results/bed with "full_dataframe" in their file names. BED files containing windows passing ("highquality") and failing ("lowquality") filtering
thresholds will also be in
Relevant flags for filtering variants include:
--variant_site_quality --variant_genotype_quality --variant_depth
Relevant flags for filtering windows include:
--mapq_cutoff --min_depth_filter --max_depth_filter --min_variant_count
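As a rough illustration of how window filters of this kind behave, the Python sketch below splits windows into "highquality" and "lowquality" sets. It is not XYalign's code: it assumes the depth filters are multipliers of the mean depth across windows and that the mapq cutoff is a minimum, so check the tool's help output for the actual semantics. The function name, the window records, and the threshold values are hypothetical.

```python
def classify_windows(windows, mapq_cutoff, min_depth_filter, max_depth_filter):
    """Split a non-empty list of windows into (high_quality, low_quality).

    Each window is a dict with 'chrom', 'start', 'stop', 'depth', and 'mapq'.
    Depth bounds are computed relative to the mean depth over all windows
    (an assumption of this sketch).
    """
    mean_depth = sum(w["depth"] for w in windows) / len(windows)
    low_bound = mean_depth * min_depth_filter
    high_bound = mean_depth * max_depth_filter
    high, low = [], []
    for w in windows:
        passes = (w["mapq"] >= mapq_cutoff
                  and low_bound <= w["depth"] <= high_bound)
        (high if passes else low).append(w)
    return high, low


# Four hypothetical 5 kb nonoverlapping windows on chrX.
windows = [
    {"chrom": "chrX", "start": 0, "stop": 5000, "depth": 30.0, "mapq": 58.0},
    {"chrom": "chrX", "start": 5000, "stop": 10000, "depth": 28.0, "mapq": 12.0},
    {"chrom": "chrX", "start": 10000, "stop": 15000, "depth": 95.0, "mapq": 57.0},
    {"chrom": "chrX", "start": 15000, "stop": 20000, "depth": 31.0, "mapq": 60.0},
]

high, low = classify_windows(windows, mapq_cutoff=20,
                             min_depth_filter=0.2, max_depth_filter=2.0)

# Print the passing windows as BED-style lines (chrom, start, stop).
for w in high:
    print(f"{w['chrom']}\t{w['start']}\t{w['stop']}")
```

Here the mean depth is 46, so the second window fails on mapq and the third fails on depth (95 exceeds 2.0 × 46 = 92), leaving the first and last windows as high quality.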
You can get details about these (and more) flags with the command:
Analyze multiple bam files to determine sex chromosome complement, identify sex chromosome scaffolds, etc.
xyalign --CHROM_STATS --chromosomes chr1 chr8 chr19 chrX chrY --bam sample1.bam sample2.bam sample3.bam --ref null --sample_id bam_comparison1 --output_dir bam_comparison1_results
In the above command, we're analyzing five chromosomes in three different bam files. We pass
null as our reference because it's not used in these analyses.
--sample_id now becomes the name of our comparison (it's used in file names, etc.)
and our output will be located in
bam_comparison1_results/results. We could also use
--use_counts to force XYalign to simply use counts of reads on each chromosome in