# Week 5 - Alignment and variant calling practical

## Learning objectives

1. Understand alignment of reads to a reference genome
2. Understand SNP and indel calling from an alignment
3. Understand how to visualise alignments and SNP calls

In this practical we will look at variant calling: taking reads from a strain of interest and aligning them to a reference genome to identify differences between our strain and the reference. The tool we will use to perform alignment and SNP calling is called **Snippy**. It provides a straightforward way to run multiple alignment and variant calling programs with a single command.

Snippy can be installed using:
`conda install -c conda-forge -c bioconda snippy`

### A bit about conda

You have seen commands like this a few times now, so it's worth understanding what this one does. *Conda* is a package manager that makes it easier to install software. Often, when installing a package, you also need to install its dependencies. These can be complex: package x can depend on package y, which in turn depends on package z at version 2.0 or later. Conda resolves all of this for you in a single command. The `-c` argument specifies the channel to install the package from. Most bioinformatics packages are available on the *bioconda* channel, while general-purpose but popular packages such as `scikit-bio`, `numpy` and `pandas` are available through the *conda-forge* channel. Conda is not limited to Python packages! It can also install software written in R, Java, C/C++ and many other languages.

Additional resources:

1. Conda website - https://conda.io/en/latest/
2. A conda cheatsheet - https://kapeli.com/cheat_sheets/Conda.docset/Contents/Resources/Documents/index
3. Bioconda website - https://bioconda.github.io/

Now, back to business...

## Description

Files for the tutorial are in the data folder. You will find part of a *Staphylococcus aureus* reference genome, and the paired-end reads from an Illumina run of a different *Staphylococcus aureus* strain. We will use Snippy to look for SNPs, indels and rearrangements between the reference strain and our mutant strain.

Run the following Snippy command in the terminal:

`snippy --outdir varcall --ref data/wildtype.fna --R1 data/mutant_R1.fastq --R2 data/mutant_R2.fastq --ram 2`

*`--ram 2` asks Snippy to try to keep its memory use under 2 GB; it is probably not needed here.*

This should take only a few seconds. The output files will all be in the varcall directory. Navigate to this directory and look at the Snippy output files.

> If Snippy doesn't run, results from a previous run are provided in the VarCall.zip file. Unzip it with the `unzip VarCall.zip` command.

### Question 1

Which file extensions do you recognise? Do you know what type of data will be in each file? Write down any you recognise below.

~~~ your response here

The snps.bam file contains the alignments of the reads in the fastq files against the reference genome. We will use a tool called samtools to look at the alignments and get some statistics about the alignment.

Run the command below:

`!samtools view -H VarCall/varcall/snps.bam`

If you ran Snippy yourself instead of unzipping VarCall.zip, use this path instead:

`!samtools view -H varcall/snps.bam`

This will show you the headers of the BAM file. These headers give you important information about the reads in the file, the reference genome, and how the file was generated.
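As a rough illustration of how the header is organised, the sketch below pulls the sequence names and lengths out of the `@SQ` lines, where the `SN` tag holds the sequence name and the `LN` tag its length. The header text is a made-up example, not the real output for snps.bam, and `reference_lengths` is a hypothetical helper name.

```python
# Hypothetical header text for illustration; not the real header of snps.bam.
SAMPLE_HEADER = """\
@HD\tVN:1.6\tSO:coordinate
@SQ\tSN:contig_1\tLN:1500
@SQ\tSN:contig_2\tLN:2500
@PG\tID:bwa\tPN:bwa
"""


def reference_lengths(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM header."""
    lengths = {}
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        # Each remaining field is TAG:VALUE; split on the first colon only.
        tags = dict(field.split(":", 1) for field in fields[1:])
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


lengths = reference_lengths(SAMPLE_HEADER)
print(lengths)
print("total reference length:", sum(lengths.values()))
```

To try this on real data, you could paste the text printed by `samtools view -H` into `SAMPLE_HEADER`.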

### Question 2

Can you look at the headers and determine how long the reference genome is?

Hint: Use https://www.samformat.info/sam-format-header to help you work out what to look at.

~~~ your response here

Now we will use the samtools flagstat command to get some basic statistics about the alignment of reads to the genome. Run the following command:

`!samtools flagstat VarCall/varcall/snps.bam`

If you ran Snippy yourself instead of unzipping VarCall.zip, use this path instead:

`!samtools flagstat varcall/snps.bam`
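The flagstat report is plain text with one line per category, each starting with a count of QC-passed reads and a count of QC-failed reads. The sketch below shows one way the mapped percentage could be recomputed from such a report. The sample text and its numbers are invented for illustration, and `flagstat_counts` is a hypothetical helper, not part of samtools.

```python
import re

# Hypothetical flagstat output for illustration; the numbers are made up.
SAMPLE_FLAGSTAT = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
1000 + 0 paired in sequencing
"""


def flagstat_counts(text):
    """Map each flagstat category to its QC-passed count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if not match:
            continue
        passed, _failed, label = match.groups()
        # Drop the trailing "(...)" detail so labels are short, e.g. "mapped".
        category = label.split(" (")[0]
        counts[category] = int(passed)
    return counts


counts = flagstat_counts(SAMPLE_FLAGSTAT)
percent_mapped = 100 * counts["mapped"] / counts["in total"]
print(f"{counts['in total']} reads, {percent_mapped:.2f}% mapped")
```

The percentage is simply the mapped count divided by the total count, which is also what flagstat prints in brackets on the `mapped` line.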

### Question 3

How many reads are there in total? What percentage of them have been aligned to the reference?

~~~ your response here

### Question 4

The snps.txt file has a summary of the variants called in this strain against the reference.

How many variants were called? How many of them are deletions, insertions, multinucleotide polymorphisms (MNPs), and single nucleotide polymorphisms (SNPs)?

**Extra activity**

You can look at the variant calls in more detail in the snps.vcf file. This is a Variant Call Format (VCF) file, which has a standard set of information about each variant.
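To get a feel for the VCF layout, the sketch below reads the tab-separated REF and ALT columns (the fourth and fifth columns) and classifies each record by comparing allele lengths. This is a simplification under stated assumptions: it ignores records with several ALT alleles and any "complex" variants, and it does not use the type annotation the variant caller may already have written to the INFO column. The records are invented, and `classify` and `count_variant_types` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical VCF records for illustration; not taken from snps.vcf.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
ref\t100\t.\tA\tG\t50\t.\t.
ref\t250\t.\tAT\tGC\t50\t.\t.
ref\t400\t.\tC\tCTT\t50\t.\t.
ref\t900\t.\tGAA\tG\t50\t.\t.
"""


def classify(ref, alt):
    """Classify a variant by comparing REF and ALT allele lengths."""
    if len(ref) == len(alt):
        return "snp" if len(ref) == 1 else "mnp"
    return "ins" if len(alt) > len(ref) else "del"


def count_variant_types(vcf_text):
    """Count variant types over all non-header lines of a VCF text."""
    counts = Counter()
    for line in vcf_text.splitlines():
        # Lines starting with '#' are meta-information or the column header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]
        counts[classify(ref, alt)] += 1
    return counts


print(count_variant_types(SAMPLE_VCF))
```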

~~~ your response here

### Question 5

To add functional annotation to the variants, the annotation program needs a reference genome that has gene annotations. This is provided in the wildtype.gbk file in the data directory.

Open the wildtype.gbk file and look at the gene annotations. This is a GenBank-format file, which has a CDS entry for every coding sequence in the genome, giving the gene name, the start and end positions on the reference genome, and a description of the protein product the gene produces. You saw the GenBank format in yesterday’s annotation lecture.

What are the start and end positions of the *dnaC* gene in the wildtype.gbk file?

Hint:
When viewing a file with the `less` command on the command line, you can type `/` followed by the string you are looking for to search through the file. For example, `/abc` will search for abc.
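The same kind of search can be done programmatically. The sketch below scans a GenBank feature table for CDS entries and records the location of each one that carries a `/gene` qualifier. It assumes simple locations of the form `start..end` or `complement(start..end)`; for joined locations it would only pick up the first span. The excerpt, gene names and coordinates are made up, and `cds_positions` is a hypothetical helper. A dedicated parser such as Biopython would be more robust for real work.

```python
import re

# Hypothetical GenBank feature-table excerpt; names and coordinates are made up.
SAMPLE_GBK = """\
FEATURES             Location/Qualifiers
     CDS             100..500
                     /gene="geneA"
                     /product="hypothetical protein"
     CDS             complement(800..1400)
                     /gene="geneB"
                     /product="example protein"
"""

# A feature line has exactly five leading spaces, a key, then a location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)$")
GENE_QUALIFIER = re.compile(r'^\s+/gene="([^"]+)"')


def cds_positions(gbk_text):
    """Return {gene name: (start, end, strand)} for CDS features."""
    positions = {}
    current = None  # (feature key, location string) of the feature being read
    for line in gbk_text.splitlines():
        feature = FEATURE_LINE.match(line)
        if feature:
            current = feature.groups()
            continue
        gene = GENE_QUALIFIER.match(line)
        if gene and current and current[0] == "CDS":
            location = current[1]
            span = re.search(r"(\d+)\.\.(\d+)", location)
            if span is None:
                continue
            strand = "-" if location.startswith("complement") else "+"
            positions[gene.group(1)] = (int(span.group(1)), int(span.group(2)), strand)
    return positions


print(cds_positions(SAMPLE_GBK))
```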

~~~ your response here