Skip to content

Visualizing and evaluating alignment output

Pierre De Wit edited this page Jun 14, 2023 · 6 revisions

Visualizing and evaluating alignment output

Introduction

Mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or a de novo genome/transcriptome assembly. There are numerous programs that have been developed to map reads to a reference sequence that vary in their algorithms and therefore speed (see Flicek and Birney 2009 and references therein). The goal of mapping is to create an alignment file also known as a Sequence/Alignment Map (SAM) file for each of your samples. This SAM file will contain one line for each of the reads in your sample denoting the reference sequence (genes, contigs, or gene regions) to which it maps, the position in the reference sequence, and a Phred-scaled quality score of the mapping, among other details (Li et al. 2009).

SAM files are very large text files, and are typically compressed using an algorithm to smaller, binary encoded, files called BAM files.

In this exercise, we will explore SAM/BAM files and see what information they contain.

We will then visualize the alignments in a browser called IGV, and also count the number of reads that map to each reference contig, data that could form the basis of a differential gene expression analysis (if the input data is RNASeq) or a variant calling analysis.

Resources

SAM file description

SAMtools

Flicek, P, Birney, E. 2009. Sense from sequence reads: methods for alignment and assembly. Nature methods 6:S6-S12.

Li, H, Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754-1760.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, and Durbin R. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25: 2078-2079.

Li, H, Ruan, J, Durbin, R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research 18:1851-1858.

Picard tools: http://picard.sourceforge.net/ : Java-based programs for manipulation of .sam and .bam files.

The SAM/BAM format:

In your previous exercise, you have created BAM files of resequencing data mapped to the herring genome, as well as a vcf file containing information about variants. We will now explore these files and see what they look like.

First, connect to your interactive allocation on the compute node using ssh(or create a new allocation using sallocif the old one has expired), and then move into your directory where you have been working on the herring data. You also need to load the required modules:

$ module load bioinfo-tools
$ module load samtools/1.17

Try viewing the top of one of your BAM files in the terminal by typing

$ head FILENAME.bam

The screen outputs gibberish - this is because BAM files are binary encoded compressed files. In order to actually view the information at the top of the file, we need to use a program called samtools view, and then pipe the output to a head command.

$ samtools view FILENAME.bam | head

BAM files actually consist of two major parts:

  1. The header (on top), which contains information about all the assembly that the data have been mapped to (lines starting with @SQ), and also information about the history of the file (every time the file is changed it is logged in the header, lines starting with @PG). The header can also contain metadata about the biological samples in the bam file, this is usually encoded in lines starting with @RG ("read group"), information supplied during the mapping step which can be very useful in downstream processing.

  2. The data, where each mapped read (and also unmapped reads, if not filtered out) takes up one line. More about this further down!

By default, samtools view skips the header and goes straight for the data, but sometimes it can be useful to examine the header. You can do this by adding the -H flag, which only prints out the header:

$ samtools view -H FILENAME.bam

Or, you can view the whole file by using the -h flag (be ready to cancel the job with CTRL-C!!):

$ samtools view -h FILENAME.bam

The information about contignames in the header MUST match the contignames listed in the data part of the file, otherwise, there will be errors later! If you for some reason have mismatches between your header and your data, then you can replace the header using samtools reheader:

$ samtools reheader NEW_HEADER.sam IN.bam > OUT.bam

SAM/BAM data lines

These lines can contain different pieces of information depending on the options and software using to map, but a typical line looks like this:

NS500639:34:H3JYYBGX3:3:23402:4567:16872	256	MT100000_c0_g1_i1	96	0	13M	*	0	0 CACCATAGTAGTT AAAAAE6EAEEEA	AS:i:-1	XS:i:0	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:8A4	YT:Z:UU	ZW:f:0

Where the first 11 columns are mandatory, and stand for Table from here:

Screenshot 2023-06-08 at 13 06 27

Additional columns are optional and depend on the mapping settings.

Some of these column descriptions, like read name, sequence and phred-quality, contig names and mapping positions, are quite self-explanatory. The bit-wise flags and CIGAR strings are a bit more complicated, so we will focus on them!

Bit-wise flag

The bit-wise flag contains a lot of information about the mapping, but it is encoded in binary code, so we first need to go through how that works.

In binary code, each digit can be a 0 or a 1, and the value of the 1 is determined by it's position. For each step moving from right to left, the value of the 1 doubles. This creates a unique combination of zeros and ones for each number which can be converted to the decimal system by just adding up the value of all the ones.

Screenshot 2023-06-08 at 13 51 42

Let's convert some decimal numbers to (12-digit) binary code:

1 = 000000000001 3 = 000000000011 5 = 000000000101 4095 = 111111111111

If we want to count higher than 4095, we need to add another digit.

Now, time to practice conversions!

256 = ? 272 = ? 16 = ? 4 = ? 89 = ? 99 = ?

Why are we doing this, you might wonder? Well, to computers, the ones and zeroes can also act as flags, where a 0 means "no" and a 1 means "yes". This is the case for the bit-wise flags, they are encoded as follows:

Screenshot 2023-06-08 at 13 41 40

Now, take your binary numbers that you converted above and plug them into the table. These are all fairly common flag numbers. What do they say about the read mappings? ("Multiple segments" in the mapping refers to read pairs).

Using this information, you can now use the flag status to filter the SAM/BAM file. Try these little exercises:

  1. Count how many alignments you have in your BAM file, using a combination of samtools view and wc (word count). Read the manual if you are not familiar with the command!

  2. Count how many of the alignments in the file are "secondary" (i.e. are due to a read which is mapping to multiple locations)

HINT: You can pipe the output from samtools view to cut -f 2 to only output the second column of the file, and then grep that output for the flag of interest..

  1. Count how many of the reads have been flagged as PCR or optical duplicates

  2. Count how many unmapped reads there are in the BAM file.

  3. Plot the distribution of flags in your bam file using csvtk:

$ samtools view FILENAME.bam | csvtk plot hist -t -C = --xlab "Flag number" --ylab "Count (n)" --title "Bitwise flag distribution" -f 2 --ignore-illegal-row | display

Now that we understand how the flags work, we can use a cheat sheet to quickly learn all about a mapping. On this interactive website you can type in any decimal number and see which flags are set to 1.

CIGAR strings

The CIGAR string (column 6) gives information about how the alignment to the references looks like in terms of mismatches, InDels, alignment errors etc. The different options are shown in the table below:

Screenshot 2023-06-08 at 14 44 04

For each base in the alignment (in the order of chromosomal position, something to think about if the read is mapping to the reverse strand), the status of the alignment is noted in the CIGAR string.

As an example (from the SAM manual), these alignments:

Screenshot 2023-06-08 at 14 46 10

Would result in the following CIGAR strings (and bit-wise flags):

Screenshot 2023-06-08 at 14 46 44

Visualizing the alignment

You can get a very quick and dirty visualization of the alignment directly on the server by using samtools tview:

$ samtools tview FILENAME.bam

You can scroll down the alignemnt using the arrow keys, but the graphical userface is not the best for getting a good overview. To get a better overview, we will use the IGV genomic browser, which can be downloaded from here.

You need to copy the bam file of interest, the associated bam index (bai) file and the reference assembly that the bam file was mapped against to your local computer. Then:

Once you have gotten IGV started, load in the assembly (Genomes -> Load genome from file), then load in the BAM file (File -> Load from file…). Now you can browse the contigs in the assembly, see where the reads have aligned, and where there are sequencing errors or potential SNPs.