# Finding errors in a solution of SUBS
SUBS is Rosalind's assignment "Finding Motifs in DNA".  
We try out several versions of a solution, some of which contain bugs.

## solution from last week
Before we look at buggy versions, let's first set up some tests using a version that does work.  
We use one of last week's solutions.

In [1]:
def solve_subs (s, t):
    n = len(t)
    locs = []
    for i in range(len(s)):
        if s[i:i+n] == t:
            locs += [i+1]
    return locs

## the tests, introduction
For these tests, we use an `assert` statement to check the actual answer against the expected answer.  
If the condition behind `assert` is true, the program just continues, e.g. to the next test.  
If the condition behind `assert` is false, you get an `AssertionError`.

In [2]:
expected_answer = [2, 4, 10]
assert [2, 4, 10] == expected_answer
assert [1, 3, 9] == expected_answer

AssertionError: 
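
An `assert` statement may also carry a message after a comma; that message is shown when the check fails. The following is a minimal sketch, where the name `actual_answer` and the wrong value in it are made up for illustration. The `try`/`except` is only there so that the cell keeps running after the failure.

```python
expected_answer = [2, 4, 10]
actual_answer = [1, 3, 9]  # hypothetical wrong result, e.g. 0-based positions

try:
    assert actual_answer == expected_answer, f"got {actual_answer}, expected {expected_answer}"
except AssertionError as err:
    # the message after the comma becomes the text of the AssertionError
    message = str(err)
    print("AssertionError:", message)
```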

## the tests
All tests have the same structure:  
define two inputs `dna` and `motif`  
define the expected answer  
compare the function result to the expected answer  

All of these tests have to be run again for each version of `solve_subs`.

In [12]:
# Rosalind's example
dna = 'GATATATGCATATACTT'
motif = 'ATAT'
expected_answer = [2, 4, 10]
assert solve_subs(dna, motif) == expected_answer

In [13]:
# a simple test
dna = 'ACACACAT'
motif = 'AC'
expected_answer = [1, 3, 5]
assert solve_subs(dna, motif) == expected_answer

In [14]:
# a test for zero matches
dna = 'ACACACAT'
motif = 'AA'
expected_answer = []
assert solve_subs(dna, motif) == expected_answer

In [7]:
# a real Rosalind test
infile1 = open('rosalind_subs.txt')
dna = infile1.readline().strip()
motif = infile1.readline().strip()
infile1.close()

infile2 = open('answer_subs.txt')
answer_line = infile2.readline().strip()
infile2.close()
answers_str = answer_line.split()
expected_answer = [int(x) for x in answers_str]

assert solve_subs(dna, motif) == expected_answer

Remember: if no message appears, the tests passed.
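
Because the tests have to be rerun for every version, it can help to bundle them in one function. The sketch below is one possible way to do that, not part of the original notebook: `run_tests` is a hypothetical helper that takes the function under test as an argument and covers only the three in-memory tests (the file-based Rosalind test is left out). Last week's slicing version is repeated so that the block runs on its own.

```python
def run_tests(solver):
    """Run the three in-memory tests against `solver` (hypothetical helper)."""
    cases = [
        ('GATATATGCATATACTT', 'ATAT', [2, 4, 10]),  # Rosalind's example
        ('ACACACAT', 'AC', [1, 3, 5]),              # a simple test
        ('ACACACAT', 'AA', []),                     # a test for zero matches
    ]
    for dna, motif, expected_answer in cases:
        actual = solver(dna, motif)
        assert actual == expected_answer, f"{motif} in {dna}: got {actual}"
    return len(cases)


def solve_subs(s, t):
    # the working slicing version from last week, repeated here
    n = len(t)
    locs = []
    for i in range(len(s)):
        if s[i:i+n] == t:
            locs += [i+1]
    return locs


print(run_tests(solve_subs), "tests passed")
```

After redefining `solve_subs` in a later cell, calling `run_tests(solve_subs)` again checks the new version.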

## A version checking individual characters rather than slicing
We give this code twice: once with a comment for each line of code, and once as plain Python code.

In [8]:
def solve_subs (s, t):
    'find all locations (1-based) in s where t occurs as substring'
    # initialize answer
    locs = []
    # try all starting positions in s
    for i in range(len(s)):
        # find out if t matches at that position of s
        # check characters of t one by one; matching is True until found otherwise
        matching = True
        for j in range(len(t)):
            # compare j-th character, in s counting from position i
            if s[i+j] != t[j]:
                # as soon as one character doesn't match, set matching to False
                matching = False
            # and never set matching back to True (in the current loop)
        # if still matching after all characters have been tested,
        # then t is a substring of s at position i (0-based)
        if matching:
            # record the position in locs
            locs += [i+1]
    # after the loop, we have all starting positions
    return locs

In [9]:
def solve_subs (s, t):
    locs = []
    for i in range(len(s)):
        matching = True
        for j in range(len(t)):
            if s[i+j] != t[j]:
                matching = False
        if matching:
            locs += [i+1]
    return locs

First run one (or both) of the cells above, then run all cells in section "the tests" again.  
Now all tests fail with the same error: `IndexError: string index out of range`
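
To see where the error comes from, one can walk through the same two loops without indexing `s` and note the first pair `(i, j)` for which `i+j` is not a valid index. This is only a sketch using the inputs of the simple test; the names `first_bad` and `error_text` are introduced here for illustration.

```python
s = 'ACACACAT'   # dna from the simple test
t = 'AC'         # motif from the simple test

first_bad = None
for i in range(len(s)):          # same outer loop as the version above
    for j in range(len(t)):      # same inner loop
        if i + j >= len(s) and first_bad is None:
            first_bad = (i, j)   # s[i+j] would raise IndexError here

print("first invalid access at (i, j) =", first_bad)

# indexing s at that position reproduces the error message
try:
    s[first_bad[0] + first_bad[1]]
except IndexError as err:
    error_text = str(err)
    print("IndexError:", error_text)
```

For these inputs the first invalid access is at the last start position, `i = 7`, when the second character of the motif is compared.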

## The previous version corrected
It turns out that `i+j` may become equal to or larger than the length of `s`, while the last valid index is `len(s)-1`.  
Therefore the programmer changed the bound of the outer loop.

In [11]:
def solve_subs (s, t):
    locs = []
    for i in range(len(s)-len(t)):
        matching = True
        for j in range(len(t)):
            if s[i+j] != t[j]:
                matching = False
        if matching:
            locs += [i+1]
    return locs

Again, run this cell, and then rerun all tests.  
Now they all pass.

## An additional test
In fact, the last version above does contain an error.  
Define a test that reveals this error.

In [17]:
dna = 'GATATATGCATATATAT' # a DNA string that ends with the motif
motif = 'ATAT' # the motif to search for
expected_answer = [2,4,10,12,14] # 1-based positions of all occurrences
assert solve_subs(dna, motif) == expected_answer

## Final correction
The problem was a so-called off-by-one error: the outer loop stopped one start position too early.  
Now that you have seen the correction, can you still find a test that reveals the error in the previous version?

In [16]:
def solve_subs (s, t):
    locs = []
    for i in range(len(s)-len(t)+1):
        matching = True
        for j in range(len(t)):
            if s[i+j] != t[j]:
                matching = False
        if matching:
            locs += [i+1]
    return locs
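
Counting start positions explains the `+1`: a motif of length `len(t)` fits at the 0-based start positions 0 up to and including `len(s)-len(t)`, which are `len(s)-len(t)+1` positions in total. The sketch below compares both loop bounds on a string that ends with the motif. The function `solve_subs_with_bound` and its parameter `extra` are hypothetical and only serve to run both variants side by side.

```python
def solve_subs_with_bound(s, t, extra):
    """Character-by-character version; `extra` is added to the outer loop bound."""
    locs = []
    for i in range(len(s) - len(t) + extra):
        matching = True
        for j in range(len(t)):
            if s[i+j] != t[j]:
                matching = False
        if matching:
            locs += [i+1]
    return locs


dna = 'GATATATGCATATATAT'   # ends with the motif
motif = 'ATAT'

previous = solve_subs_with_bound(dna, motif, 0)   # bound len(s)-len(t)
final = solve_subs_with_bound(dna, motif, 1)      # bound len(s)-len(t)+1

print("previous version:", previous)
print("final version:   ", final)
```

Only the occurrence at the very end of `dna` differs between the two results, so a test reveals the error exactly when the motif occurs at the last possible start position.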