Skip to content

Latest commit

 

History

History
91 lines (65 loc) · 2.99 KB

tutorial.md

File metadata and controls

91 lines (65 loc) · 2.99 KB

Introductory tutorial

In this tutorial, you will learn the basics of TRGT by analyzing a tiny example dataset included in this repository.

Prerequisites

Overview of the input files

Our example dataset consists of a reference genome (reference.fasta), a file with aligned reads (sample.bam), and the repeat definition file (repeat.bed). The repeat definitions are stored in a BED file that, among other information, contains repeat coordinates, repeat identifiers, and motifs:

$ cat repeat.bed
chrA    10000    10061    ID=TR1;MOTIFS=CAG;STRUC=(CAG)n

Genotype repeats

To genotype the repeat, run:

./trgt genotype --genome example/reference.fasta \
       --repeats example/repeat.bed \
       --reads example/sample.bam \
       --output-prefix sample

The output consists of files sample.vcf.gz and sample.spanning.bam. The VCF file contains repeat genotypes while the BAM file contains pieces of HiFi reads that fully span the repeat sequences.

For example, here is the first entry of the VCF file:

$ bcftools view --no-header sample.vcf.gz | head -n 1
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  sample
chrA  10002  .  CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG  CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG,CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG  0  .  TRID=TR1;END=10061;MOTIFS=CAG;STRUC=(CAG)n  GT:AL:ALCI:SD:MC:MS:AP:AM  1/1:33,33:30-39,33-33:16,16:11,11:0(0-33),0(0-33):1,1:.,.

It says that:

  • There is a tandem repeat starting at position 10001 of chrA
  • The reference sequence of this repeat is CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG
  • This repeat has two alleles spanning 16 CAG motifs each

Sort and index the outputs

TRGT genotyping outputs are not sorted. So you need to sort and index the VCF:

bcftools sort -Ob -o sample.sorted.vcf.gz sample.vcf.gz
bcftools index sample.sorted.vcf.gz

and then the BAM:

samtools sort -o sample.spanning.sorted.bam sample.spanning.bam
samtools index sample.spanning.sorted.bam

And that's it! The output files are now ready for downstream analysis.

Visualize a repeat

To visualize the repeat with the identifier "TR1", run:

./trgt plot --genome example/reference.fasta \
       --repeats example/repeat.bed \
       --vcf sample.sorted.vcf.gz \
       --spanning-reads sample.spanning.sorted.bam \
       --repeat-id TR1 \
       --image TR1.svg

TRGT outputs a file TR1.svg that contains the read pileup image. Note that the SVG file can be directly edited in vector graphics editing software like Inkscape.

The resulting pileup plot shows the sequences of reads spanning each repeat allele (blue) and the surrounding flanking sequence (green):

TR1 read pileup