#Project 2: Approximate matching
JeongHo Choi, 17th June 2022 (updated)

-Write a function that, given a length-24 pattern P and an Index object built on 8-mers, finds all approximate occurrences of P within T with up to 2 mismatches.<br>
-Insertions and deletions are not allowed.<br>
-Don't consider any reverse complements.<br>
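
The task above relies on the pigeonhole principle: if P is cut into 3 segments and at most 2 substitutions are allowed, at least one segment must match T exactly. Below is a minimal sketch of that idea, assuming the same rounding rule for the segment length as the `approximate_match` cell further down. The helper name `split_into_segments` and the two mutated positions are made up for illustration.

```python
def split_into_segments(p, n):
    """Return (start, end) bounds of the n + 1 segments of p (hypothetical helper)."""
    seg_len = int(round(len(p) / (n + 1)))
    bounds = []
    for i in range(n + 1):
        start = i * seg_len
        end = min((i + 1) * seg_len, len(p))
        bounds.append((start, end))
    return bounds

p = 'GGCGCGGTGGCTCACGCCTGTAAT'
segments = split_into_segments(p, 2)
print(segments)  # [(0, 8), (8, 16), (16, 24)]

# Introduce 2 substitutions (positions chosen arbitrarily for this example)
mutated = list(p)
mutated[3] = 'A'   # was 'G', falls in the first segment
mutated[20] = 'C'  # was 'T', falls in the third segment
mutated = ''.join(mutated)

# With only 2 edits spread over 3 segments, one segment is always untouched
intact = [mutated[s:e] == p[s:e] for s, e in segments]
print(intact)  # [False, True, False]
```

Because a length-24 pattern with 2 allowed mismatches splits into three 8-mers, an index built on 8-mers can look up each segment directly.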
   
###Question 4)
How many times does the string GGCGCGGTGGCTCACGCCTGTAAT, which is derived from a human Alu sequence, occur with up to 2 substitutions in the excerpt of human chromosome 1?  
(Don't consider reverse complements here.)<br>

-Hint 1: Multiple index hits might direct you to the same match multiple times, so be careful not to count a match more than once.<br>
-Hint 2: You can check your work by comparing the output of your new function to that of the 'naive_2mm' function implemented in the previous module.
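
To illustrate Hint 1, here is a small sketch on a toy text (not the chromosome 1 excerpt). It looks up each segment with a plain string search instead of the k-mer index and skips the mismatch verification step, so it only shows why hits need to be de-duplicated. The helper `find_all` and the toy strings are assumptions for this example.

```python
def find_all(segment, text):
    """Return every offset where segment occurs in text (hypothetical helper)."""
    hits = []
    pos = text.find(segment)
    while pos != -1:
        hits.append(pos)
        pos = text.find(segment, pos + 1)
    return hits

t = 'TTAACCGGTT'
p = 'AACCGG'      # occurs exactly once in t, at offset 2
seg_len = 2       # 3 segments: 'AA', 'CC', 'GG'

candidates = []
for start in range(0, len(p), seg_len):
    for hit in find_all(p[start:start + seg_len], t):
        # convert the segment hit into the offset where all of p would start
        candidates.append(hit - start)

print(candidates)               # [2, 2, 2] -> three hits, all the same alignment
distinct = sorted(set(candidates))
print(distinct)                 # [2] -> one match
```

An exact occurrence is reached once per segment, which is why the number of hits and the number of matches can differ.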

###Question 5)
Using the instructions given in Question 4, how many total index hits are there when searching for occurrences of GGCGCGGTGGCTCACGCCTGTAAT with up to 2 substitutions in the excerpt of human chromosome 1?
(Don't consider reverse complements.)<br>

-Hint: You should be able to use the 'boyer_moore' function (or the slower 'naive' function) to double-check your answer.

In [1]:
import bisect

# from kmer_index.py by Ben Langmead, the instructor for DNA sequencing algorithms
class Index(object):
    def __init__(self, t, k):
        ''' Create index from all substrings of t of length k '''
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):  # for each k-mer
            self.index.append((t[i:i+k], i))  # add (k-mer, offset) pair
        self.index.sort()  # alphabetize by k-mer
    
    def query(self, p):
        ''' Return index hits for first k-mer of P '''
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits
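
As a quick illustration of how this kind of sorted k-mer index behaves, the following sketch rebuilds the same idea with two standalone functions on a toy text. The names `build_index` and `query_index` are made up here so the snippet runs on its own; they are not part of kmer_index.py.

```python
import bisect

def build_index(t, k):
    """Sorted list of (k-mer, offset) pairs for every k-mer in t."""
    return sorted((t[i:i + k], i) for i in range(len(t) - k + 1))

def query_index(index, kmer):
    """Offsets of all exact occurrences of kmer, found by binary search."""
    # (kmer, -1) sorts before every real (kmer, offset) pair
    i = bisect.bisect_left(index, (kmer, -1))
    hits = []
    while i < len(index) and index[i][0] == kmer:
        hits.append(index[i][1])
        i += 1
    return hits

index = build_index('ACGTACGT', 4)
print(query_index(index, 'ACGT'))  # [0, 4]
print(query_index(index, 'GTAC'))  # [2]
print(query_index(index, 'TTTT'))  # []
```

Sorting the pairs puts all entries for the same k-mer next to each other, so one binary search followed by a short scan returns every offset.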

In [2]:
def naive_2mm(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        count_mismatch = 0
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                count_mismatch += 1
        if count_mismatch <= 2:
            occurrences.append(i)  # at most 2 mismatches; record this alignment
    return occurrences

In [3]:
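def approximate_match(p, t, n):
    ''' Return (offsets where p matches t with at most n mismatches,
        total number of index hits). Pigeonhole principle: p is split
        into n+1 segments, so at least one must match exactly. '''
    segment_length = int(round(len(p) / (n + 1)))
    all_matches = set()  # a set, so each match offset is recorded only once
    p_idx = Index(t, segment_length)  # index of t on segment-length k-mers
    idx_hits = 0  # total index hits across all segments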
    for i in range(n + 1):
        start = i * segment_length
        end = min((i + 1) * segment_length, len(p))
        matches = p_idx.query(p[start:end])

        # Extend matching segments to see if whole p matches
        for m in matches:
            idx_hits += 1
            if m < start or m - start + len(p) > len(t):
                continue

            mismatches = 0

            # compare the part of p before the matched segment
            for j in range(0, start):
                if p[j] != t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            # compare the part of p after the matched segment
            for j in range(end, len(p)):
                if p[j] != t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break

            if mismatches <= n:
                all_matches.add(m - start)
    return list(all_matches), idx_hits

In [4]:
def readGenome(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line.startswith('>'):
                genome += line.rstrip()
    return genome

In [5]:
genome_file = 'chr1.GRCh38.excerpt.fasta'
t = readGenome(genome_file)
p = 'GGCGCGGTGGCTCACGCCTGTAAT'

In [6]:
matches, hit_counts = approximate_match(p, t, 2)

In [8]:
# naive_2mm to double-check the result
len(naive_2mm(p, t))

19

In [9]:
# Q4
len(matches)

19

In [10]:
# Q5
hit_counts

90