# Burrows-Wheeler Aligner

Burrows-Wheeler Aligner (BWA) is a program that aligns short sequencing reads against a large reference genome. The genome is 'indexed' once up front, and that index is then reused to make each alignment fast.
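To give a feel for what 'indexing' builds on, here is a toy sketch of the Burrows-Wheeler transform itself. It sorts all rotations of the text, which is far too slow for a real genome; BWA uses much more efficient data structures, so treat this only as an illustration of the idea. The function names `bwt` and `inverse_bwt` and the `$` sentinel are choices made for this example.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    # Sort every rotation of the text, then read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("banana")))  # banana
```

The transform groups identical characters together and is fully reversible, which is what makes it useful as the basis of a compact, searchable index.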

Example data files used in this tutorial are taken/adapted from [here](http://depts.washington.edu/cshlab/xpression/html/example.shtml).

In [1]:
%load_ext autoreload

In [24]:
%autoreload 2
import simplesam
import bmes
import os 
from IPython.display import display
from ToolBox import utils

### Set Up `DATADIR` and `BWA` Path

Since genome files and fastq files can be very large, you need to keep them in a different folder than your source files (so that when you submit your work, I don't end up with many copies of these large data files). If you want to use a different folder, set it here.

In [19]:
DATADIR = bmes.datadir() + '/bwafiles/'
bmes.mkdirif(DATADIR)


#BWAEXE = bmes.bwaexe(); # Ahmet's installation
BWAEXE = "/home/kabil/.anaconda3/envs/blast/bin/bwa"
print(BWAEXE)

/home/kabil/.anaconda3/envs/blast/bin/bwa
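The path above is specific to one machine. If `bwa` is installed somewhere on your `PATH`, a small helper like the following could locate it for you. This is a sketch using only the standard library; `find_bwa` is a hypothetical helper name, not part of `bmes`.

```python
import os
import shutil


def find_bwa(fallback=None):
    """Return a path to the bwa executable, or None if it cannot be found.

    Looks on PATH first, then tries an explicit fallback path supplied by
    the caller (only accepted if the file exists and is executable).
    """
    found = shutil.which("bwa")
    if found:
        return found
    if fallback and os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        return fallback
    return None


# The fallback here is the path used in this notebook; replace it with yours.
print(find_bwa(fallback="/home/kabil/.anaconda3/envs/blast/bin/bwa"))
```

If this prints `None`, install BWA or set `BWAEXE` by hand as in the cell above.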


### Set up genome and `fastq` files. Download if absent.

- Make sure you know how to obtain the download URL for `genomefile` from: https://www.ncbi.nlm.nih.gov/nuccore/NC_005296.1
- Visit that page, download the genome as a FASTA file, and examine the download URL.

In [14]:
# os.path.join avoids a doubled '/' when DATADIR already ends with one
genomefile = os.path.join(DATADIR, "CGA009.fasta")
fastqfile = os.path.join(DATADIR, "example.small.fastq")

if not bmes.isfileandnotempty(genomefile):
    genomefile = bmes.downloadurl("https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&id=39933080&conwithfeat=on", genomefile)
if not bmes.isfileandnotempty(fastqfile):
    fastqfile = bmes.downloadurl("http://sacan.biomed.drexel.edu/lib/exe/fetch.php?rev=&media=course:binf:nextgen:bwademo:example.small.fastq", fastqfile)
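As an alternative to copying the URL from the web page, NCBI's E-utilities `efetch` endpoint can return the same record as FASTA when given the accession. The sketch below only builds and prints the URL; `download_fasta` would need network access and is not called here. Both function names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession):
    """Build an efetch URL that returns a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def download_fasta(accession, dest):
    """Download the FASTA record to dest (requires network access)."""
    urlretrieve(fasta_url(accession), dest)
    return dest


print(fasta_url("NC_005296.1"))
```

Building the URL from the accession makes it obvious which record is being fetched, whereas the numeric `id=39933080` in the cell above has to be looked up.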

### Contents of the Genome FASTA File

In [15]:
utils.io_head(genomefile)

>NC_005296.1 Rhodopseudomonas palustris CGA009, complete sequence
ATCGGTCGAGGCGAAATCTTCACCCTGCCCTCGGAATCATATCCATTGCAGCGGAGGGGCCGTCGTGGTT
TTCATAGTCCACCCGCGACGCCCACGGCTCTTCAGATCAGCGCGGTTTGAGAACCAAGGGCGGACATGCA
AATCGGCGGCGGTGTGCATGTTGGTGGCTTTTCCAAATGACGGTCGAACTGCAACTCATAGCAATATGGG
GCTGGCGAAGCCGTGTCTGAATTCATGCGGAGGCCTTCGTTCGAAGGTTTCTGGCGAACGACGACACCGG
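The output above shows the FASTA layout: a header line starting with `>`, followed by the sequence wrapped over many lines. A minimal reader for that layout might look like this. It is a sketch with a hypothetical `read_fasta` helper, demonstrated on a small in-memory toy record rather than the real genome.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy data for illustration only (not taken from the genome file).
demo = io.StringIO(">seq1 toy record\nATCG\nGGTA\n>seq2\nTT\n")
records = list(read_fasta(demo))
print(records)
```

To use it on the genome, pass an open file instead, for example `with open(genomefile) as f: records = list(read_fasta(f))`.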


### Contents of the FASTQ Sequence Reads File

In [16]:
utils.io_head(fastqfile, 8)

@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG
+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff
@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:19255:12112#0/1
CGTAAATGGACAGCATGACCCGACATCCCACACTCGCCGC
+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:19255:12112#0/1
hhhcfhhhhhhhhhhhhghhhhhhffhhhhhhhhhhhhhg
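Each FASTQ record above spans four lines: `@name`, the bases, a `+` separator line, and one quality character per base. The sketch below reads such records and converts quality characters to numeric scores. The offset of 64 is an assumption: characters such as `h` in this file look like the older Illumina Phred+64 encoding, but check your own data, since most current files use an offset of 33. `read_fastq` and `phred_scores` are hypothetical helper names.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield name[1:], seq, qual  # drop the leading '@'


def phred_scores(qual, offset=64):
    """Convert quality characters to integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in qual]


# Toy record for illustration only.
demo = io.StringIO("@read1\nACGT\n+read1\nhh^a\n")
parsed = [(name, seq, phred_scores(qual)) for name, seq, qual in read_fastq(demo)]
print(parsed)
```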


### Index the reference genome

In [20]:
# Index the reference genome
if not bmes.isfileandnotempty(genomefile + '.bwt'):
    cmd = BWAEXE + ' index "' + genomefile + '"'
    os.system(cmd)

[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.45 seconds elapse.
[bwa_index] Update BWT... 0.02 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.52 sec
[main] Version: 0.7.17-r1188
[main] CMD: /home/kabil/.anaconda3/envs/blast/bin/bwa index /tmp/bmes/Dropbox_bmes.ahmet/bwafiles/CGA009.fasta
[main] Real time: 2.100 sec; CPU: 2.050 sec
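The cell above builds a shell string and only checks for the `.bwt` file. A slightly more defensive variant is sketched below: it passes the arguments as a list (so paths with spaces need no quoting), raises an error if `bwa` fails, and checks for all of the index files. The list of suffixes reflects what BWA 0.7.x writes as far as I know; verify it against your own output folder. The helper names are hypothetical.

```python
import os
import subprocess

# Files written next to the genome by `bwa index` (assumed for BWA 0.7.x).
INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def index_is_complete(genomefile):
    """True if every expected index file exists and is non-empty."""
    return all(
        os.path.isfile(genomefile + suffix) and os.path.getsize(genomefile + suffix) > 0
        for suffix in INDEX_SUFFIXES
    )


def index_command(bwa, genomefile):
    """Argument list for indexing; no shell quoting is needed."""
    return [bwa, "index", genomefile]


def ensure_index(bwa, genomefile):
    """Run `bwa index` only when the index is missing; return True if it ran."""
    if index_is_complete(genomefile):
        return False
    subprocess.run(index_command(bwa, genomefile), check=True)
    return True


print(index_command("bwa", "CGA009.fasta"))
```

In this notebook you would call `ensure_index(BWAEXE, genomefile)`.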


### Align Sequence Reads to Reference Genome

It's also a good idea to check whether we have already performed this step. Let's name the output `<fastqfile>.sam` and use the presence of this file to decide whether the alignment can be skipped. Note that this check looks only at whether the file exists: if you change the `genomefile` or `fastqfile` and the `samfile` was previously created from a different `genomefile` and/or `fastqfile`, the existing results are stale and no longer correct. In that case, delete the `samfile` so the alignment runs again.

In [21]:
samfile = fastqfile + '.sam'

if not bmes.isfileandnotempty(samfile):
    cmd = BWAEXE + ' mem "' + genomefile + '" "' + fastqfile + '"'
    bmes.system_redirecttofile(cmd,samfile)

Executing command: /home/kabil/.anaconda3/envs/blast/bin/bwa mem "/tmp/bmes/Dropbox_bmes.ahmet/bwafiles/CGA009.fasta" "/tmp/bmes/Dropbox_bmes.ahmet/bwafiles/example.small.fastq" > "/tmp/bmes/Dropbox_bmes.ahmet/bwafiles/example.small.fastq.sam" 2>"/tmp/tmp1kgpb6o9"
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25 sequences (1000 bp)...
[M::mem_process_seqs] Processed 25 reads in 0.001 CPU sec, 0.004 real sec
[main] Version: 0.7.17-r1188
[main] CMD: /home/kabil/.anaconda3/envs/blast/bin/bwa mem /tmp/bmes/Dropbox_bmes.ahmet/bwafiles/CGA009.fasta /tmp/bmes/Dropbox_bmes.ahmet/bwafiles/example.small.fastq
[main] Real time: 0.011 sec; CPU: 0.007 sec
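One way to guard against a stale `samfile` is to compare modification times: redo the alignment whenever an input is newer than the output. The sketch below shows the idea on temporary placeholder files. `is_up_to_date` is a hypothetical helper, and timestamps are only a heuristic, since they do not detect an input that was swapped for an older file.

```python
import os
import tempfile


def is_up_to_date(output, inputs):
    """True if output exists, is non-empty, and is at least as new as every input."""
    if not os.path.isfile(output) or os.path.getsize(output) == 0:
        return False
    out_mtime = os.path.getmtime(output)
    return all(os.path.getmtime(path) <= out_mtime for path in inputs)


# Demonstration on placeholder files with hand-set timestamps.
with tempfile.TemporaryDirectory() as tmp:
    genome = os.path.join(tmp, "genome.fasta")
    sam = os.path.join(tmp, "reads.sam")
    for path in (genome, sam):
        with open(path, "w") as fh:
            fh.write("placeholder\n")
    os.utime(genome, (1000, 1000))  # genome older than the SAM file
    os.utime(sam, (2000, 2000))
    fresh = is_up_to_date(sam, [genome])
    os.utime(genome, (3000, 3000))  # genome modified after the SAM file
    stale = is_up_to_date(sam, [genome])

print(fresh, stale)  # True False
```

In this notebook the check would be `is_up_to_date(samfile, [genomefile, fastqfile])` in place of `bmes.isfileandnotempty(samfile)`.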



### Contents of the SAM File

In [23]:
utils.io_head(samfile, 5)

@SQ	SN:NC_005296.1	LN:5459213
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:/home/kabil/.anaconda3/envs/blast/bin/bwa mem /tmp/bmes/Dropbox_bmes.ahmet/bwafiles/CGA009.fasta /tmp/bmes/Dropbox_bmes.ahmet/bwafiles/example.small.fastq
HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0	0	NC_005296.1	5250128	0	7S33M	*	0	0	CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG	acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff	NM:i:0	MD:Z:33	AS:i:33	XS:i:33	XA:Z:NC_005296.1,+4996363,7S33M,0;
HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:19255:12112#0	0	NC_005296.1	1993021	60	4S36M	*	0	0	CGTAAATGGACAGCATGACCCGACATCCCACACTCGCCGC	hhhcfhhhhhhhhhhhhghhhhhhffhhhhhhhhhhhhhg	NM:i:0	MD:Z:36	AS:i:36	XS:i:0
HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:19304:12111#0	0	NC_005296.1	4996358	0	4S36M	*	0	0	AGGGGGGCGGTGTGTACAAGGCCCGGGAACGTATTCACCG	hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhhhh	NM:i:0	MD:Z:36	AS:i:36	XS:i:36	XA:Z:NC_005296.1,+5250123,4S36M,0;
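The sixth column in the alignment lines above is the CIGAR string. For example, `7S33M` means 7 soft-clipped bases followed by 33 aligned bases. A small parser makes this concrete; the operation letters come from the SAM specification, while the function names are made up for this sketch.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # CIGAR unavailable
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than valid operations.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR: " + cigar)
    return ops


def query_length(ops):
    """Number of read bases covered (M, I, S, =, X consume the read)."""
    return sum(n for n, op in ops if op in "MIS=X")


def reference_length(ops):
    """Number of reference bases covered (M, D, N, =, X consume the reference)."""
    return sum(n for n, op in ops if op in "MDN=X")


ops = parse_cigar("7S33M")
print(ops, query_length(ops), reference_length(ops))  # [(7, 'S'), (33, 'M')] 40 33
```

The query length of 40 matches the 40-base reads in the FASTQ file, while only 33 of those bases are placed on the reference.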


### Parse the alignment results

`samread()` is available in MATLAB's Bioinformatics Toolbox. If you do not have `samread()` or are not working in MATLAB, search for other options. SAM files are tab-delimited text files; see the SAM format specification, e.g. [here](https://samtools.github.io/hts-specs/SAMv1.pdf), to explore what types of data are provided.
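Because SAM is plain tab-delimited text, a single alignment line can be parsed with nothing but string splitting. The sketch below handles the 11 mandatory columns and the optional `TAG:TYPE:VALUE` fields; it converts only integer (`i`) tags and leaves other tag types as strings. `parse_sam_line` is a hypothetical helper, and the example line is toy data.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Parse one alignment line (not a '@' header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated columns")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for field in cols[len(SAM_FIELDS):]:
        key, typ, val = field.split(":", 2)  # values may contain ':' themselves
        tags[key] = int(val) if typ == "i" else val
    record["tags"] = tags
    return record


# Toy alignment line for illustration only.
line = "read1\t0\tchrToy\t101\t60\t4M\t*\t0\t0\tACGT\thhhh\tNM:i:0\tMD:Z:4"
rec = parse_sam_line(line)
print(rec)
```

When reading a whole file, skip lines that start with `@`, since those are header lines.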

We'll be using `simplesam` to parse the SAM files. This is an alternative to `pysam` that doesn't require a C compiler during installation.

In [25]:
sams = []
with open(samfile,'r') as f:
    samiter = simplesam.Reader(f)
    for sam in samiter:
        sams.append(sam)

# Inspect first entry
sam = sams[0]

print("  -----Object  Properties-----  ")
display(vars(sam))

print("-----Additional  Properties-----")

print("ismapped: " + str(sam.mapped))
print("isreversecomplement: " + str(sam.reverse))
print("duplicate: " + str(sam.duplicate))
print("secondary: " + str(sam.secondary))
print("tags: " + str(sam.tags))

print("    -----Using the Flag-----    ")
isunmapped = sam.flag & 4
print("isunmapped: " + str(isunmapped))
#or use: not sam.mapped

isreversecomplement = sam.flag & 16
print("isreversecomplement: " + str(isreversecomplement))
#or use: sam.reverse

  -----Object  Properties-----  


{'qname': 'HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0',
 'flag': 0,
 'rname': 'NC_005296.1',
 'pos': 5250128,
 'mapq': 0,
 'cigar': '7S33M',
 'rnext': '*',
 'pnext': 0,
 'tlen': 0,
 'seq': 'CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG',
 'qual': 'acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\\dfdffff\\fff',
 '_tags': ['NM:i:0',
  'MD:Z:33',
  'AS:i:33',
  'XS:i:33',
  'XA:Z:NC_005296.1,+4996363,7S33M,0;'],
 '_cache': {}}

-----Additional  Properties-----
ismapped: True
isreversecomplement: False
duplicate: False
secondary: False
tags: {'NM': 0, 'MD': '33', 'AS': 33, 'XS': 33, 'XA': 'NC_005296.1,+4996363,7S33M,0;'}
    -----Using the Flag-----    
isunmapped: 0
isreversecomplement: 0
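The `flag` column packs several yes/no properties into one integer, which is why the cell above tests individual bits with `&`. The sketch below decodes all of the bits defined in the SAM specification at once. The bit values are from the specification; the short labels and the `decode_flag` name are chosen for this example.

```python
# Bit value -> short label (labels are this example's own naming).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(0))   # []
print(decode_flag(16))  # ['reverse']
print(decode_flag(99))  # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
```

Note that `sam.flag & 4` evaluates to an integer (0 or 4) rather than `True`/`False`; wrap it in `bool(...)` if you need a boolean.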
