In [None]:
###########################################################
# Lab 5: Binary Search, String Matching, and Applications #
###########################################################

In [None]:
'''
Problem 1
Binary search is the ultimate divide-and-conquer algorithm. 
To find a key k in a large file containing keys A[1..n] 
in sorted order, we first compare k with A[n/2], and depending 
on the result we recurse either on the first half of the file, A[1..n/2], 
or on the second half, A[n/2+1..n]. The recurrence now is T(n)=T(n/2)+O(1). 
Plugging into the master theorem (with a=1,b=2,d=0) 
we get the familiar solution: a running time of just O(log n).

The problem is to find a given set of keys in a given array.

Given: Two positive integers n≤10^5 and m≤10^5, a sorted array A[1..n] of integers 
from −10^5 to 10^5 and a list of m integers −10^5≤k1,k2,…,km≤10^5.

Return: For each ki, output an index 1≤j≤n s.t. A[j]=ki or "-1" if there is no such index.

Sample Dataset
5
6
10 20 30 40 50
40 10 35 15 40 20
Sample Output
4 1 -1 -1 4 2
'''
A = [10, 20, 30, 40, 50]
keys = [40, 10, 35, 15, 40, 20]

with open('/Users/KevinBu/Desktop/BMI2005_Algorithms_2020/Labs/rosalind/rosalind_bins.txt', 'r') as f:
    f.readline() # skip 5
    f.readline() # skip 6
    A = [int(x) for x in f.readline().strip().split(' ')]
    keys = [int(x) for x in f.readline().strip().split(' ')]

def binary_search(A, key, lo, hi):
    # Base case first: an empty range means the key is absent. Checking this
    # before indexing A[mid] avoids reading outside [lo, hi] (or an empty list).
    if lo > hi:
        return -1
    mid = (lo + hi) // 2
    if A[mid] == key:
        return mid + 1  # Python lists are 0-based; the problem wants a 1-based index
    elif A[mid] > key:
        return binary_search(A, key, lo, mid - 1)
    else:
        return binary_search(A, key, mid + 1, hi)
    
answers = []
for k in keys:
    answers.append(binary_search(A, k, 0, len(A) - 1))
    
# Recurrence per key: T(n) = T(n/2) + O(1), so each lookup is O(log n)
# and all m lookups together cost O(m log n).

print(' '.join([str(x) for x in answers]))
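The same halving idea can be written without recursion. The following is a minimal sketch, not part of the lab's reference solution: `binary_search_iter` and `binary_search_bisect` are hypothetical names, and the second one leans on the standard library's `bisect_left` as a cross-check. Both return a 1-based index or -1, matching the problem statement.

```python
from bisect import bisect_left

def binary_search_iter(A, key):
    """Return a 1-based index j with A[j] == key, or -1 if key is absent."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:                 # an empty range ends the search
        mid = (lo + hi) // 2
        if A[mid] == key:
            return mid + 1          # convert 0-based to 1-based
        if A[mid] < key:
            lo = mid + 1            # key can only be in the right half
        else:
            hi = mid - 1            # key can only be in the left half
    return -1

def binary_search_bisect(A, key):
    """Same contract, using the standard library's leftmost insertion point."""
    i = bisect_left(A, key)
    return i + 1 if i < len(A) and A[i] == key else -1

sample_A = [10, 20, 30, 40, 50]
sample_keys = [40, 10, 35, 15, 40, 20]
print(' '.join(str(binary_search_iter(sample_A, k)) for k in sample_keys))
# prints: 4 1 -1 -1 4 2
```

When the array contains duplicates, the two functions may return different (but equally valid) indices, since the problem accepts any matching position.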

In [None]:
'''
Problem 2
We say that Pattern is a most frequent k-mer in Text if it maximizes 
Count(Text, Pattern) among all k-mers. For example, "ACTAT" is a most 
frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT", and "ATA" is a 
most frequent 3-mer of "CGATATATCCATAG".

Frequent Words Problem
Find the most frequent k-mers in a string.

Given: A DNA string Text and an integer k.

Return: All most frequent k-mers in Text (in any order).

Sample Dataset
ACGTTGCATGTCGCATGATGCATGAGAGCT
4
Sample Output
CATG GCAT
'''

s = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
k = 4


with open('/Users/KevinBu/Desktop/BMI2005_Algorithms_2020/Labs/rosalind/rosalind_ba1b.txt', 'r') as f:
    s = f.readline().strip()
    k = int(f.readline().strip())
    
    
def frequentWords(s, k):
    counts = {}
    for i in range(len(s)-k+1):
        kmer = s[i:i+k]
        if kmer not in counts:
            counts[kmer] = 0
        counts[kmer] += 1
    m = max(counts.values())
    out = []
    for kmer in counts:
        if counts[kmer] == m:
            out.append(kmer)
    return out

print(' '.join(frequentWords(s,k)))

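# Runtime is O(N) for a fixed k, where N = len(s).
# In general it is O(N*k), because each k-mer is sliced and hashed.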
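The counting loop above can be expressed more compactly with `collections.Counter`. This is a sketch under the same assumptions as the lab code; `frequent_words_counter` is a hypothetical name, and sorting the result is an added choice to make the output order stable (the problem accepts any order).

```python
from collections import Counter

def frequent_words_counter(text, k):
    """Return all k-mers with the maximum count in text, sorted alphabetically."""
    if k <= 0 or k > len(text):
        return []                   # no k-mers exist for these inputs
    # Count every window of length k in one pass.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    top = max(counts.values())
    return sorted(kmer for kmer, c in counts.items() if c == top)

print(' '.join(frequent_words_counter('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4)))
# prints: CATG GCAT
```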


In [None]:
'''
Problem 3
We say that position i in k-mers p1 … pk and q1 … qk 
is a mismatch if pi ≠ qi. For example, CGAAT and CGGAC have two 
mismatches. The number of mismatches between strings 
p and q is called the Hamming distance between these strings 
and is denoted HammingDistance(p, q).

Hamming Distance Problem
Compute the Hamming distance between two DNA strings.

Given: Two DNA strings.

Return: An integer value representing the Hamming distance.

Sample Dataset
GGGCCGTTGGT
GGACCGTTGAC
Sample Output
3
'''

p = 'GGGCCGTTGGT'
q = 'GGACCGTTGAC'

with open('/Users/KevinBu/Desktop/BMI2005_Algorithms_2020/Labs/rosalind/rosalind_ba1g.txt', 'r') as f:
    p = f.readline().strip()
    q = f.readline().strip()

def hamming_distance(p, q):
    # Precondition: p and q must have the same length. If q is shorter, q[i]
    # raises IndexError; if q is longer, its extra characters are silently ignored.
    # d is the to-be-computed Hamming distance
    d = 0
    for i in range(len(p)):
        if p[i] != q[i]:
            d += 1
    return d

print(hamming_distance(p, q))
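The loop above only makes sense when both strings have the same length. A minimal sketch that checks this explicitly and counts mismatches with `zip` follows; `hamming_distance_checked` is a hypothetical name, and raising `ValueError` is one possible design choice rather than something the problem requires.

```python
def hamming_distance_checked(p, q):
    """Count positions where p and q differ; both must have equal length."""
    if len(p) != len(q):
        raise ValueError("Hamming distance needs strings of equal length")
    # zip pairs up characters position by position; True counts as 1 in sum().
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance_checked('GGGCCGTTGGT', 'GGACCGTTGAC'))  # prints: 3
print(hamming_distance_checked('CGAAT', 'CGGAC'))              # prints: 2
```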

In [None]:
'''
Exercise 1
Given the values {2341, 4234, 2839, 430, 22, 397, 3920}, a hash table of size 7, and hash
function h(k) = k mod 7, show the resulting tables after inserting the values in the given order with each
of these collision strategies.
{2341, 4234, 2839, 430, 22, 397, 3920}
h(k) = k mod 7
2341 % 7 = 3
4234 % 7 = 6
2839 % 7 = 4
430 % 7 = 3
22 % 7 = 1
397 % 7 = 5
3920 % 7 = 0

1. separate chaining
0: [3920]
1: [22]
2: []
3: [2341, 430]
4: [2839]
5: [397]
6: [4234]

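2. linear probing
(on a collision, try the next slot, wrapping around:
 430 -> 3, 4, 5;  397 -> 5, 6, 0;  3920 -> 0, 1, 2)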
0: [397]
1: [22]
2: [3920]
3: [2341]
4: [2839]
5: [430]
6: [4234]
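
The two tables above can be checked with a short simulation. This is a sketch only: the function names are hypothetical, and linear probing here steps by one slot and wraps around, which is the usual definition but is an assumption about what the exercise intends.

```python
def insert_chaining(values, size):
    # Each slot holds a list; colliding keys are appended to the same list.
    table = [[] for _ in range(size)]
    for v in values:
        table[v % size].append(v)
    return table

def insert_linear_probing(values, size):
    # Each slot holds at most one key; on a collision, try the next slot.
    table = [None] * size
    for v in values:
        home = v % size
        for step in range(size):
            probe = (home + step) % size   # wrap around the end of the table
            if table[probe] is None:
                table[probe] = v
                break
        else:
            raise ValueError("hash table is full")
    return table

exercise_values = [2341, 4234, 2839, 430, 22, 397, 3920]
print(insert_chaining(exercise_values, 7))
# prints: [[3920], [22], [], [2341, 430], [2839], [397], [4234]]
print(insert_linear_probing(exercise_values, 7))
# prints: [397, 22, 3920, 2341, 2839, 430, 4234]
```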


Exercise 2
Suppose you are parsing a given DNA sequence for 4-mers and wish to store the possible strings in a hash table.

a) How many different keys are there assuming all possible 4-mers appear?
4^4 = 256

b) How many values would your hash table need to have a load factor of 0.8?

alpha (load factor) = n / m = # keys stored / # slots in the table (the question calls the slots "values")
# slots = # keys / alpha
        = 256 / 0.8
        = 320

c) To create a hash function, we can convert the strings into numeric representations, i.e.
A, C, G, T = 0, 1, 2, 3 so ATCG = 0312. Explain why h(k) = k mod 10 is NOT a good hash function 
(what type of hashing is this?).

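ACTG -> 0132, h(0132) = 132 mod 10 = 2

This is the division (modular) method: h(k) = k mod m with m = 10.
k mod 10 keeps only the last digit, i.e. the last nucleotide, so:
- all 4-mers ending in the same base collide (64 keys per used bucket),
  and two random 4-mers collide with probability 0.25
- only positions 0, 1, 2, 3 are ever used, so positions 4-9 stay empty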

d) Your labmate suggests using GC content. Why is this also not a good idea?
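GC content ignores the order of the bases and does not distinguish G from C:
GC = 100%
GAC = 67%

GGCC = 100%
GCCC = 100%
GCGC = 100%

For 4-mers, GC content can only be 0, 25, 50, 75, or 100%, 
so all 256 keys map to just 5 values.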

e) Eventually you decide on k mod 89. Why is this better (than the previous ones, and something such as k mod 100), 
and how many entries does our table have? What is the load factor?

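89 is prime and shares no factor with the base 10 of the encoding, so k mod 89 
depends on every digit of k. By contrast, k mod 100 keeps only the last two digits 
(the last two bases), so it can reach at most 16 buckets.

# buckets = 89
# items = 256

alpha = 256 / 89 ≈ 2.88
(alpha > 1, so the table cannot use open addressing; collisions need chaining)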

'''
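
The comparison in Exercise 2 can be made concrete by hashing all 256 possible 4-mers with each candidate function and counting how many buckets get used. The sketch below is illustrative only: `encode`, `gc_percent`, and `bucket_stats` are hypothetical helper names, and the encoding follows the exercise's A, C, G, T = 0, 1, 2, 3 scheme read as a decimal number.

```python
from collections import Counter
from itertools import product

DIGIT = {'A': '0', 'C': '1', 'G': '2', 'T': '3'}

def encode(kmer):
    """Read the k-mer as a decimal number, e.g. ATCG -> 0312 -> 312."""
    return int(''.join(DIGIT[base] for base in kmer))

def gc_percent(kmer):
    """GC content as an integer percentage."""
    return 100 * sum(base in 'GC' for base in kmer) // len(kmer)

def bucket_stats(hash_fn, keys):
    """Return (number of distinct buckets used, size of the fullest bucket)."""
    loads = Counter(hash_fn(k) for k in keys)
    return len(loads), max(loads.values())

all_4mers = [''.join(p) for p in product('ACGT', repeat=4)]

stats = {
    'k mod 10': bucket_stats(lambda s: encode(s) % 10, all_4mers),
    'GC content': bucket_stats(gc_percent, all_4mers),
    'k mod 89': bucket_stats(lambda s: encode(s) % 89, all_4mers),
}
for name, (used, worst) in stats.items():
    print(f'{name}: {used} buckets used, fullest bucket holds {worst} keys')
print(f'load factor with 89 buckets: {len(all_4mers) / 89:.2f}')
```

With k mod 10 only 4 buckets are used (64 keys each), and GC content uses only 5; k mod 89 spreads the same keys over many more buckets, which is why its fullest bucket is smaller.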