methXsort is a command-line toolkit for sorting bisulfite sequencing reads into host and graft species in xenograft experiments.
methXsort supports both xengsort and bbsplit for sorting reads. We recommend xengsort: in our benchmarking it was accurate and much faster than bbsplit.
Clone this repository and ensure all dependencies (Python 3, pysam, toolshed, and xengsort) are available in your environment.
git clone https://github.com/CCRSF-IFX/SF-methXsort.git
cd SF-methXsort
All commands are run via the main script:
python methXsort.py <subcommand> [options]
Convert a reference FASTA for bisulfite mapping (C→T and G→A):
python methXsort.py convert-ref <ref_fasta> [-o OUTPUT]
Build a xengsort index from the converted host and graft references:
python methXsort.py xengsort-index --host <host.fa> --graft <graft.fa> --index <index_dir> [-n N] [--fill FILL] [--statistics STAT] [-k K] [--xengsort_path <path>] [--xengsort_extra <extra>]
Convert reads for bisulfite mapping (C→T for R1, G→A for R2):
python methXsort.py convert-reads --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] [--out <R1_out>] [--out2 <R2_out>] [--with_orig_seq]
--with_orig_seq: Store the original sequence in the header (slower, but traceable).
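To make the conversion step concrete, here is a minimal Python sketch of in silico bisulfite conversion for a single read, together with one way the original sequence could be kept in the header and recovered later. It is an illustration only, not methXsort's actual code: the function names and the `_ORIG:` header separator are assumptions made for this example, and the real header format may differ.

```python
# Illustrative sketch only; not methXsort's actual implementation.
C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")


def convert_read(seq: str, mate: int = 1) -> str:
    """Fully convert a read: C->T for mate 1 (R1), G->A for mate 2 (R2)."""
    return seq.translate(C_TO_T if mate == 1 else G_TO_A)


def tag_header(header: str, seq: str) -> str:
    """Keep the original sequence in the header.

    The '_ORIG:' separator is a made-up convention for this example.
    """
    return f"{header}_ORIG:{seq}"


def restore_sequence(tagged_header: str) -> tuple[str, str]:
    """Split a tagged header back into (header, original sequence)."""
    header, _, seq = tagged_header.rpartition("_ORIG:")
    return header, seq


if __name__ == "__main__":
    read = "ACGTCCGA"
    print(convert_read(read, mate=1))  # ATGTTTGA
    print(convert_read(read, mate=2))  # ACATCCAA
    print(restore_sequence(tag_header("@read1", read)))  # ('@read1', 'ACGTCCGA')
```

Carrying the original sequence along is what makes the conversion reversible: once C and T (or G and A) have been collapsed, the converted read alone no longer tells you which bases were changed.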
Classify the converted reads into host and graft with xengsort:
python methXsort.py xengsort-classify --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --index <index_dir> --out_prefix <prefix> --threads <N> [--xengsort_path <path>] [--xengsort_extra <extra>]
Output:
- {prefix}.host.1.fq.gz: host reads
- {prefix}.graft.1.fq.gz: graft reads
- {prefix}.both.1.fq.gz: reads that could originate from both
- {prefix}.neither.1.fq.gz: reads that originate from neither host nor graft
- {prefix}.ambiguous.1.fq.gz: (few) ambiguous reads that cannot be classified
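As a quick sanity check on the classification output, the sketch below counts the R1 records in each of the five files listed above. It is a hedged example rather than part of methXsort: `count_fastq_records` and `summarize` are hypothetical helper names, the file-name pattern is taken from the list above (adjust it if your files are named differently), and the count assumes standard four-line FASTQ records.

```python
import gzip

CATEGORIES = ("host", "graft", "both", "neither", "ambiguous")


def count_fastq_records(path: str) -> int:
    """Count records in a gzipped FASTQ file (assumes four lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def summarize(prefix: str) -> dict[str, int]:
    """Count R1 records per category, using the file names listed above."""
    return {
        cat: count_fastq_records(f"{prefix}.{cat}.1.fq.gz")
        for cat in CATEGORIES
    }


if __name__ == "__main__":
    import sys

    # Usage: python count_categories.py <prefix>
    for category, n_reads in summarize(sys.argv[1]).items():
        print(f"{category}\t{n_reads}")
```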
Output CSV statistics for split reads:
python methXsort.py stat-split --raw <raw_R1.fastq.gz> --host <host_R1.fastq.gz> --graft <graft_R1.fastq.gz>
Restore original sequences in FASTQ files classified by xengsort:
python methXsort.py restore-fastq --read <classified_R1.fq.gz> --out <restored_R1.fq.gz> [--read2 <classified_R2.fq.gz> --out2 <restored_R2.fq.gz>]
Example workflow:
- Convert reference genomes:
python methXsort.py convert-ref mm10.fa -o mm10_converted.fa
python methXsort.py convert-ref hg38.fa -o hg38_converted.fa
- Build the xengsort index (or a bbsplit index with bbsplit-index):
python methXsort.py xengsort-index --host mm10_converted.fa --graft hg38_converted.fa --index xengsort_index_7B
- Convert reads:
python methXsort.py convert-reads --read sample_R1.fastq.gz --read2 sample_R2.fastq.gz --with_orig_seq
- Run xengsort (or bbsplit with the bbsplit subcommand):
python methXsort.py xengsort-classify --read sample_R1.meth.gz --read2 sample_R2.meth.gz --index xengsort_index_7B --out_prefix sample_xengsort --threads 8
- Restore original FASTQ:
python methXsort.py restore-fastq --read sample_xengsort-graft.1.fq.gz --out sample_graft_R1_restored.fq.gz --read2 sample_xengsort-graft.2.fq.gz --out2 sample_graft_R2_restored.fq.gz
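The steps above can be chained from a small driver script. The following sketch only builds and prints the command lines (a dry run), using the subcommands, flags, and file names exactly as they appear in the example workflow. The function `workflow_commands` is a hypothetical helper, not part of methXsort, and the intermediate file names (`*_converted.fa`, `*.meth.gz`, `*-graft.1.fq.gz`) are copied from the example rather than verified against the tool's defaults.

```python
import shlex


def workflow_commands(sample, host_fa, graft_fa, index, threads=8):
    """Build the example workflow as a list of argument lists (nothing is run)."""
    base = ["python", "methXsort.py"]
    host_conv = host_fa.removesuffix(".fa") + "_converted.fa"
    graft_conv = graft_fa.removesuffix(".fa") + "_converted.fa"
    prefix = f"{sample}_xengsort"
    return [
        base + ["convert-ref", host_fa, "-o", host_conv],
        base + ["convert-ref", graft_fa, "-o", graft_conv],
        base + ["xengsort-index", "--host", host_conv,
                "--graft", graft_conv, "--index", index],
        base + ["convert-reads", "--read", f"{sample}_R1.fastq.gz",
                "--read2", f"{sample}_R2.fastq.gz", "--with_orig_seq"],
        base + ["xengsort-classify", "--read", f"{sample}_R1.meth.gz",
                "--read2", f"{sample}_R2.meth.gz", "--index", index,
                "--out_prefix", prefix, "--threads", str(threads)],
        base + ["restore-fastq",
                "--read", f"{prefix}-graft.1.fq.gz",
                "--out", f"{sample}_graft_R1_restored.fq.gz",
                "--read2", f"{prefix}-graft.2.fq.gz",
                "--out2", f"{sample}_graft_R2_restored.fq.gz"],
    ]


if __name__ == "__main__":
    for cmd in workflow_commands("sample", "mm10.fa", "hg38.fa", "xengsort_index_7B"):
        # Dry run: print each command. To execute, replace the print with
        # subprocess.run(cmd, check=True) so a failing step stops the chain.
        print(shlex.join(cmd))
```

Keeping each command as an argument list avoids shell quoting problems if sample names or paths contain spaces.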
Build a bbsplit index from the host and graft references:
python methXsort.py bbsplit-index --host <host.fa> --graft <graft.fa> --host_name <host> --graft_name <graft> [--bbsplit_path <path>] [--bbsplit_index_path <dir>]
Split reads into host and graft using bbsplit:
python methXsort.py bbsplit --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --host <host_name> --graft <graft_name> --out_host <host.bam> --out_graft <graft.bam> [--bbsplit_path <path>] [--bbsplit_extra <extra>]
Extract reads from FASTQ that are present in a BAM file (e.g., after bbsplit):
python methXsort.py filter-fastq-by-bam --read <R1.fastq.gz> [--read2 <R2.fastq.gz>] --bam <file.bam> --out <R1_out> [--out2 <R2_out>] [--filterbyname_path <path>]
- For all subcommands, use -h or --help to see detailed options.
- Make sure all required external tools (bbsplit.sh, filterbyname.sh, xengsort) are in your PATH, or specify their locations with the appropriate options.
- For paired-end data, always provide both --read2 and --out2 where required.
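Before starting a run, it can help to confirm that the external tools named above are discoverable. This is a small sketch using Python's standard `shutil.which`; the helper name `missing_tools` is an assumption for this example, and the check only covers tools found via PATH, not locations passed through options such as --xengsort_path.

```python
import shutil

# External tools named in the notes above.
REQUIRED_TOOLS = ("bbsplit.sh", "filterbyname.sh", "xengsort")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```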
Email: ccrsfifx@nih.gov
