# Project Assignment 4, Genomic Informatics, Academic Year 2021/2022

# Aleksandar Malovic 2021/3375

## Assignment

* Implement an algorithm for indexed string search using the Burrows-Wheeler transform and the FM index, as described in the lesson slides, without additional optimizations
* Write tests for the intermediate functions as well as for the final algorithm
* Optimize the algorithm with regard to memory usage and performance. Run regression tests and check the optimization results using the assigned testing parameters

## Assumptions

**ASSUMPTION 1:** The input string already has the ending character `$` appended; otherwise the functions return None to indicate an error. The algorithm could create a local copy with the ending character appended, but for large strings, copying the whole input just to add one character would be suboptimal from a memory standpoint.
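A minimal sketch of how a caller might satisfy this assumption before invoking the functions below. The helper name `withTerminator` is hypothetical and not part of the assignment code.

```python
def withTerminator(text, terminator='$'):
    # Hypothetical caller-side helper: append the terminator once, before indexing.
    # Python strings are immutable, so this allocates one new string; doing it at
    # the call site keeps that copy out of the indexing functions themselves.
    if text is None or terminator in text:
        raise ValueError('text must be a string that does not contain the terminator')
    return text + terminator

print(withTerminator('abaaba'))  # abaaba$
```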

## Testing remarks

Tests for individual functions are grouped within testing functions to separate and enclose testing scopes. This also simplifies calling the tests at other places in the code.

## Helper functions

#### Is input a valid string with ending character

In [109]:
def isInputValid(t):
    return t is not None and len(t) > 0 and t != "$" and t.endswith('$')

def testInputValidation():
    # Test case 1: None, should return false
    assert not isInputValid(None)
    
    # Test case 2: Empty string, should return false
    assert not isInputValid('')
    
    # Test case 3: String containing only the ending character, should return false
    assert not isInputValid('$')
    
    # Test case 4: String missing ending character, should return false
    assert not isInputValid('abc')
    
    # Test case 5: Valid string, should return true
    assert isInputValid('abc$')

In [113]:
testInputValidation()

#### Arrays are equal check, both length and order

In [114]:
def arraysEqual(output, expectedOutput):
    # Both inputs being None is considered an error as well
    if output is None or expectedOutput is None:
        return False
    if len(output) != len(expectedOutput):
        return False
    for i in range(len(expectedOutput)):
        if output[i] != expectedOutput[i]:
            return False
    return True

def testArraysEqual():
    assert not arraysEqual(None, None)
    
    assert not arraysEqual(None, [])
    
    assert not arraysEqual(['0', '1', '2'], ['0', '1', '2', '3'])
    
    assert not arraysEqual(['0', '1'], ['2', '3'])
    
    assert arraysEqual(['0', '1', '2', '3'], ['0', '1', '2', '3'])

In [115]:
testArraysEqual()

#### Sets are equal check, order not important, same elements

In [128]:
def setsEqual(output, expectedOutput):
    # Both inputs being None is considered an error as well
    if output is None or expectedOutput is None:
        return False
    if len(output) != len(expectedOutput):
        return False
    # Iterate instead of indexing, so that non-subscriptable collections
    # such as dict_keys are supported as well
    for item in output:
        if item not in expectedOutput:
            return False
    return True

def testSetsEqual():
    assert not setsEqual(None, None)
    
    assert not setsEqual(None, [])
    
    assert not setsEqual(['0', '1', '2'], ['0', '1', '2', '3'])
    
    assert not setsEqual(['1', '0'], ['2', '3'])
    
    assert setsEqual(['0', '1', '3', '2'], ['2', '0', '1', '3'])

In [129]:
testSetsEqual()

## Burrows-Wheeler transform

The Burrows-Wheeler transform consists of three steps:
* Create an array of all rotations of the input string
* Sort the array in alphabetical order (this gives the Burrows-Wheeler matrix)
* Take the last column of the Burrows-Wheeler matrix
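The three steps can be traced on a small input with the compact sketch below. `bwtSketch` is a hypothetical name used only for this illustration, and the input is assumed to already end with `$`.

```python
def bwtSketch(t):
    # Step 1: all rotations of t (t is assumed to end with '$').
    rots = [t[i:] + t[:i] for i in range(len(t))]
    # Step 2: sort the rotations; '$' sorts before letters in ASCII order.
    matrix = sorted(rots)
    # Step 3: the transform is the last column of the sorted matrix.
    return matrix, ''.join(row[-1] for row in matrix)

matrix, bwt = bwtSketch('abaaba$')
for row in matrix:
    print(row)
print(bwt)  # abba$aa
```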

### Create array of all string rotations

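The function appends the string to itself, which makes each rotation a simple fixed-length slice (based on the lesson slides). Building each rotation by slicing and concatenating (`t[i:] + t[:i]`) is also possible, but is suboptimal from a memory standpoint because every rotation then needs two temporary slices.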
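For comparison, a small sketch showing that the doubled-string approach and the slice-and-concatenate approach produce the same rotations. Both function names are hypothetical, and input validation is left out for brevity.

```python
def rotationsDoubled(t):
    # One doubled string, then one fixed-length slice per rotation.
    tt = t * 2
    return [tt[i:i + len(t)] for i in range(len(t))]

def rotationsSliced(t):
    # Two temporary slices plus one concatenation per rotation.
    return [t[i:] + t[:i] for i in range(len(t))]

print(rotationsDoubled('abc$'))  # ['abc$', 'bc$a', 'c$ab', '$abc']
print(rotationsDoubled('abc$') == rotationsSliced('abc$'))  # True
```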

In [50]:
def rotations(t):
    if not isInputValid(t):
        return None
    tt = t * 2
    return [ tt[i:i+len(t)] for i in range(0, len(t)) ]

#### Tests

In [116]:
def testRotations():
    # Test case 1: None, should return None
    assert rotations(None) == None
    
    # Test case 2: Empty string, should return None
    assert rotations('') is None
    
    # Test case 3: String containing only the ending character, should return None
    assert rotations('$') is None
    
    # Test case 4: String missing ending character, should return None
    assert rotations('abc') is None
    
    # Test case 5: Input string of just one character 
    inputValue = 'a$'
    expectedOutput = ['a$', '$a']
    output = rotations(inputValue)
    assert arraysEqual(output, expectedOutput)
    
    # Test case 6: Valid input string
    inputValue = 'abcd$'
    expectedOutput = ['abcd$', 'bcd$a', 'cd$ab', 'd$abc', '$abcd']
    output = rotations(inputValue)
    assert arraysEqual(output, expectedOutput)

In [117]:
testRotations()

### Sort string rotations in alphabetical order, Burrows-Wheeler Matrix

Based on lesson slides.

In [87]:
def calculateBurrowsWheelerMatrix(t):
    r = rotations(t)
    return sorted(r) if r is not None else None

#### Tests

In [118]:
def testCalculateBurrowsWheelerMatrix():
    # Test case 1: None, should return None
    assert calculateBurrowsWheelerMatrix(None) == None
    
    # Test case 2: Empty string, should return None
    assert calculateBurrowsWheelerMatrix('') == None
    
    # Test case 3: String containing only the ending character, should return None
    assert calculateBurrowsWheelerMatrix('$') == None
    
    # Test case 4: String missing ending character, should return None
    assert calculateBurrowsWheelerMatrix('abc') == None
    
    # Test case 5
    inputValue = 'abcd$'
    expectedOutput = ['$abcd','abcd$', 'bcd$a', 'cd$ab', 'd$abc']
    output = calculateBurrowsWheelerMatrix(inputValue)
    assert arraysEqual(output, expectedOutput)

In [119]:
testCalculateBurrowsWheelerMatrix()

### Generate final Burrows-Wheeler transform

We take the last column of the sorted rotations matrix (based on lesson slides)

In [93]:
# Calculates the actual Burrows-Wheeler transform, or L index (last column of the matrix)
# Here t is the Burrows-Wheeler matrix (list of sorted rotations), not the input string
def calculateLIndex(t):
    return ''.join(map(lambda x: x[-1], t)) if t is not None else None

# Calculates the F index (first column of the matrix), needed for FM index later
def calculateFIndex(t):
    return ''.join(map(lambda x: x[0], t)) if t is not None else None

def calculateBurrowsWheelerTransform(t):
    r = calculateBurrowsWheelerMatrix(t)
    return calculateLIndex(r) if r is not None else None

#### Tests

In [120]:
def testCalculateLIndex():
    # Test case 1: None, should return None
    assert calculateLIndex(None) == None
    
    # Test case 2:
    assert calculateLIndex(['$abcd','abcd$', 'bcd$a', 'cd$ab', 'd$abc']) == 'd$abc'
    
def testCalculateFIndex():
    # Test case 1: None, should return None
    assert calculateFIndex(None) == None
    
    # Test case 2:
    assert calculateFIndex(['$abcd','abcd$', 'bcd$a', 'cd$ab', 'd$abc']) == '$abcd'

def testBurrowsWheelerTransform():
    # Test case 1: None, should return None
    assert calculateBurrowsWheelerTransform(None) == None
    
    # Test case 2: Empty string, should return None
    assert calculateBurrowsWheelerTransform('') == None
    
    # Test case 3: String containing only the ending character, should return None
    assert calculateBurrowsWheelerTransform('$') == None
    
    # Test case 4: String missing ending character, should return None
    assert calculateBurrowsWheelerTransform('abc') == None
    
    # Test case 5
    inputValue = 'abcd$'
    expectedOutput = 'd$abc'
    output = calculateBurrowsWheelerTransform(inputValue)
    assert output != None
    assert output == expectedOutput
    
    # Test case 6
    inputValue = 'abaaba$'
    expectedOutput = 'abba$aa'
    output = calculateBurrowsWheelerTransform(inputValue)
    assert output != None
    assert output == expectedOutput

In [121]:
testCalculateLIndex()

testCalculateFIndex()

testBurrowsWheelerTransform()

## FM index

The core of the FM index structure consists of the following data:
* F index (first column of the Burrows-Wheeler matrix)
* L index (last column of the Burrows-Wheeler matrix, the Burrows-Wheeler transform itself)
* Tally (matrix of character ranks over the L index)
* Suffix array of the input string
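To show how these pieces relate, the sketch below builds the suffix array naively and derives the F and L indexes from it. The names `suffixArraySketch` and `fmIndexSketch` and the dictionary layout are assumptions made for this illustration; the tally is left out here.

```python
def suffixArraySketch(t):
    # Naive construction: sort suffix start offsets by the suffix they begin.
    return sorted(range(len(t)), key=lambda i: t[i:])

def fmIndexSketch(t):
    sa = suffixArraySketch(t)
    # Row i of the Burrows-Wheeler matrix is the rotation starting at sa[i],
    # so F holds t[sa[i]] and L holds the character just before it
    # (index -1 wraps around to the terminator for the row starting at 0).
    f = ''.join(t[i] for i in sa)
    l = ''.join(t[i - 1] for i in sa)
    return {'F': f, 'L': l, 'SA': sa}

index = fmIndexSketch('abaaba$')
print(index['SA'])  # [6, 5, 2, 3, 0, 4, 1]
print(index['F'])   # $aaaabb
print(index['L'])   # abba$aa
```

Sorting full suffixes this way is simple but memory-hungry for long inputs, so it serves only as a reference for small test strings.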

### Calculate tally

The tally is a matrix of character ranks over the L index (referred to as the input string in the rest of this paragraph, for simplicity).
Each row corresponds to one distinct character of the input string (except the terminal character). The number of columns equals the length of the input string. The value of a field is the rank of that row's character at that position: the number of occurrences of the character in the input string up to and including that position.
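The sketch below illustrates what the tally is used for: counting pattern occurrences by backward search over the L index. The names `tallySketch` and `countOccurrences` are hypothetical, and the code assumes the L index contains exactly one `$`, which sorts before every other character.

```python
def tallySketch(bwt):
    # tally[c][i] = number of occurrences of c in bwt[0..i], inclusive.
    tally = {c: [0] * len(bwt) for c in set(bwt) if c != '$'}
    for i, ch in enumerate(bwt):
        for c, row in tally.items():
            row[i] = (row[i - 1] if i > 0 else 0) + (1 if c == ch else 0)
    return tally

def countOccurrences(bwt, pattern):
    tally = tallySketch(bwt)
    # first[c] = index of the first matrix row that starts with c.
    first, total = {}, 1  # row 0 starts with '$'
    for c in sorted(tally):
        first[c] = total
        total += tally[c][-1]
    lo, hi = 0, len(bwt) - 1  # inclusive range of candidate rows
    for c in reversed(pattern):
        if c not in tally:
            return 0
        before = tally[c][lo - 1] if lo > 0 else 0  # ranks of c above the range
        upto = tally[c][hi]                         # ranks of c through the range end
        lo, hi = first[c] + before, first[c] + upto - 1
        if lo > hi:
            return 0
    return hi - lo + 1

print(tallySketch('abba$aa')['a'])         # [1, 1, 1, 2, 2, 3, 4]
print(countOccurrences('abba$aa', 'aba'))  # 2
```

Here `abba$aa` is the transform of `abaaba$`, in which `aba` occurs at offsets 0 and 3.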

In [136]:
def calculateTally(t):
    # Start with empty tally
    tally = {}
    for i in range(len(t)):
        # Copy previous column values to current one
        # (the tally is still empty at i == 0, so value[i-1] is never read there)
        for value in tally.values():
            value[i] = value[i-1]
        # Take the current character of the input string. If a row for it exists,
        # increment its rank; if not, insert a row of 0s first and then increment.
        currentChar = t[i]
        if currentChar == '$':
            continue
        if currentChar not in tally:
            tally[currentChar] = [0] * len(t)
        tally[currentChar][i] += 1
    return tally

#### Tests

In [134]:
def testCalculateTally():
    # Test case 1:
    inputValue = 'abcaabc$'
    expectedOutput = {
        'a': [1, 1, 1, 2, 3, 3, 3, 3],
        'b': [0, 1, 1, 1, 1, 2, 2, 2],
        'c': [0, 0, 1, 1, 1, 1, 2, 2]
    }
    output = calculateTally(inputValue)
    assert output != None
    # Compare the key views as lists; dict_keys itself is not subscriptable
    assert setsEqual(list(output.keys()), list(expectedOutput.keys()))
    for key in output.keys():
        assert arraysEqual(output[key], expectedOutput[key])

In [137]:
testCalculateTally()