# Class : Short-read mapping - Burrows-Wheeler Transform 1

---

## Before Class

Prior to class, please do the following:

1. Review the Burrows-Wheeler Transform
2. Familiarize yourself with Python's built-in `sorted` function

---
## Learning Objectives

1. Implement the Burrows-Wheeler Transform to calculate the BWT string and the suffix array of a string.


---
## Background

Today we will build the Burrows-Wheeler transform and the suffix array of a string, as described in the lecture notes.

To generate the BWT of a string, we append \$ to the string, build a matrix of all its rotations, sort the rows lexicographically, and return the last column:

```
BWT(T):
    Append $ to T
    Build matrix of all rotations of T
    sort matrix
    return last column of matrix
```
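
To make the pseudocode concrete, here is a minimal sketch that prints the sorted rotation matrix for `'googol'` and reads off the last column. The helper name `sorted_rotations` is hypothetical, and the sketch assumes the input contains no `$` and only characters that sort after `$` (true for letters and DNA bases).

```python
def sorted_rotations(text):
    """Return all rotations of text + '$', sorted lexicographically."""
    text += '$'
    return sorted(text[i:] + text[:i] for i in range(len(text)))

matrix = sorted_rotations('googol')
for row in matrix:
    print(row)
# $googol
# gol$goo
# googol$
# l$googo
# ogol$go
# ol$goog
# oogol$g

# The BWT string is the last column, read top to bottom.
last_column = ''.join(row[-1] for row in matrix)
print(last_column)  # lo$oogg
```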
    
We also need to calculate the suffix array of the string. It will be required when we use the BWT for string matching in the next class. To generate the suffix array of a string:
```
suffix_array(T):
    Append $ to T
    build matrix of all rotations of T with row index i
    sort i by lexicographic sorting of rotation matrix
    return i
```
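
Because `$` occurs exactly once and sorts before every other character, sorting the rotations gives the same order as sorting the suffixes of `T$`. The sketch below relies on that assumption to build the suffix array by sorting start positions, and then recovers the BWT string from it (the character just before each sorted suffix). The function names are hypothetical and not part of the exercise.

```python
def suffix_array_from_suffixes(text):
    """Sort the start positions of text + '$' by the suffix that begins there."""
    text += '$'
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_suffix_array(text, sa):
    """BWT character for each row is the one preceding the suffix (wrapping at 0)."""
    text += '$'
    # text[-1] is '$', so a suffix starting at position 0 wraps around correctly.
    return ''.join(text[i - 1] for i in sa)

sa = suffix_array_from_suffixes('googol')
print(sa)                                   # [6, 3, 0, 5, 2, 4, 1]
print(bwt_from_suffix_array('googol', sa))  # lo$oogg
```

Sorting by suffix avoids building the full rotation strings, although each comparison still slices the text, so this is a readability sketch rather than an efficient construction.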



---
## Burrows-Wheeler Transform



In [61]:
# function to calculate the BWT string
def BWT(string):
    ''' Function to calculate Burrows-Wheeler Transform for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        bwt_string (str): BWT string        
        
    Example:
        >>> BWT('googol')
        'lo$oogg'
        
    '''
    string += '$'
    string = sorted( string[i:]+string[:i] for i,a in enumerate(string) )
    
    return ''.join( a[-1] for a in string )

In [30]:
# function to calculate the suffix array
def suffix_array(string):
    ''' Function to calculate suffix-array for a given string.
    
    Args:
        string (str): Input string
    
    Returns:
        sa (array of integers): suffix array
        
    Example:
        >>> suffix_array('googol')
        [6, 3, 0, 5, 2, 4, 1]
        
    '''
    string += '$'
    string = sorted( ( (string[i:]+string[:i],i) for i,a in enumerate(string) ), key=lambda x: x[0] )
    
    return [ a[1] for a in string ]

In [31]:
import doctest
doctest.testmod()

TestResults(failed=0, attempted=2)

---
If you finish the two functions above, you can start on two more functions that we will use for string matching in the next class.

## Background
A key step in string matching is Last-to-First column mapping (LF mapping) within the BWT matrix. To use the LF property, we need to build two dictionaries for our reference string beforehand:
1. count: e.g.  `{'A': 0, 'C': 2, 'G': 3, 'T': 5}`

For each character `a` in a string, `count[a]` contains the number of characters in the string that are lexicographically smaller than `a`.


2. occur: e.g. `{'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}`

For each character `a` in a BWT string, `occur[a][i]` contains the number of occurrences of `a` in the first `i + 1` characters of the BWT string, i.e. in `bwt_string[:i + 1]` for `i = 0, ..., len(bwt_string) - 1`.
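
To illustrate how the two dictionaries combine, the following sketch uses one common formulation of LF mapping, `LF(i) = count[c] + occur[c][i] - 1` with `c = bwt[i]`, and walks it backwards to recover the original string from its BWT. It assumes 0-based rows, the inclusive `occur` shown above, and a `count` built over the string including `$`; the convention used in the next class may differ. The helper names `build_count`, `build_occur`, and `lf` are hypothetical.

```python
def build_count(text):
    """count[c] = number of characters in text strictly smaller than c."""
    return {c: sum(1 for x in text if x < c) for c in sorted(set(text))}

def build_occur(bwt):
    """occur[c][i] = number of occurrences of c in bwt[:i + 1]."""
    occur = {c: [] for c in sorted(set(bwt))}
    for c, counts in occur.items():
        running = 0
        for x in bwt:
            running += int(x == c)
            counts.append(running)
    return occur

def lf(i, bwt, count, occur):
    """Map row i (via its last-column character) to the row starting with that character."""
    c = bwt[i]
    return count[c] + occur[c][i] - 1

bwt = 'lo$oogg'            # BWT of 'googol'
count = build_count(bwt)   # the BWT is a permutation of 'googol$'
occur = build_occur(bwt)

# Row 0 is the rotation that starts with '$'; its last character is the
# final character of the original string. Following LF steps one position
# to the left in the text each time.
row = 0
chars = []
for _ in range(len(bwt) - 1):
    chars.append(bwt[row])
    row = lf(row, bwt, count, occur)
recovered = ''.join(reversed(chars))
print(recovered)  # googol
```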

---
## Imports

In [34]:
from collections import Counter

In [2]:
#function to build count dictionary
def cal_count(string):
    '''Function to count the number of characters in string that 
        are lexicographically smaller than a given character
    
    Args:
        string (str): Input string
    
    Returns:
        count (dict)
    
    Example:
        >>> cal_count('ATGACG')
        {'A': 0, 'C': 2, 'G': 3, 'T': 5}
    '''
    
    # note: sorting below only required to pass the doctest
    return { key:sum( 1 for i in string if i < key ) for key in sorted( set( string ) ) }

In [3]:
%%timeit

cal_count( 'AAAAATTTTTCCCCCGGGGGCCCCCTTTTTCCCCCAAAAA' )

12 µs ± 714 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
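
`cal_count` above rescans the whole string once per distinct character. A possible alternative, sketched here under the hypothetical name `cal_count_cumulative`, tallies the characters once with `Counter` and then accumulates the tallies in sorted key order. It is meant as an illustration of the idea, not as a measured improvement.

```python
from collections import Counter

def cal_count_cumulative(string):
    """Return {char: number of characters in string smaller than char}."""
    freq = Counter(string)      # one pass over the string
    count = {}
    smaller = 0                 # characters seen so far with a smaller key
    for key in sorted(freq):
        count[key] = smaller
        smaller += freq[key]
    return count

print(cal_count_cumulative('ATGACG'))  # {'A': 0, 'C': 2, 'G': 3, 'T': 5}
```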


In [55]:
#function to build occur dictionary
def cal_occur(bwt_string):
    '''Function to calculate number of occurrences of each character 
        in bwt_string[:i+1], i=0,...,len(bwt_string)-1
    
    Args:
        bwt_string (str): BWT string
    
    Returns:
        occur (dict of arrays)
    
    Example:
        >>> cal_occur('AG$CG')
        {'$': [0, 0, 1, 1, 1], 'A': [1, 1, 1, 1, 1], 'C': [0, 0, 0, 1, 1], 'G': [0, 1, 1, 1, 2]}
    '''
        
    out = { key:[] for key in sorted( set( bwt_string ) ) }
    for key in out.keys():
        cnt = 0
        for a in bwt_string:
            cnt += int( a == key )
            out[key].append( cnt )
    return out

In [60]:
import doctest
doctest.testmod()

TestResults(failed=0, attempted=4)