# String Algorithms 

## Exercise – Longest common prefix
Several algorithms can be designed and implemented to solve the longest common prefix (LCP) problem, each with its own computational cost.

### LCP via string by string comparison
Perhaps the simplest and most intuitive approach is to compare the first two strings in the set, find their LCP, and then repeatedly compute the LCP between the next input string and the current LCP. The last LCP computed is the result. Computational cost: O(M\*N), with N being the number of strings and M being the length of the longest one.

In [1]:
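# pairwise LCP computation
def pairwiseLCP(str1, str2):

    # the LCP cannot be longer than the shorter string
    n = min(len(str1), len(str2))

    # advance while the characters at position i match
    i = 0
    while i < n and str1[i] == str2[i]:
        i = i + 1

    # slice once instead of concatenating one character at a time
    return str1[:i]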

# LCP computation over a list of strings
def lcpCompare(listOfStrings):

    # an empty list has no common prefix
    if not listOfStrings:
        return ""

    lcp = listOfStrings[0]

    for s in listOfStrings[1:]:
        lcp = pairwiseLCP(lcp, s)
        # the LCP can only shrink, so stop as soon as it is empty
        if lcp == "":
            break

    return lcp

In [2]:
# test code 
testset = ["apple", "appointment", "appendix", "appeal", "apparel"]

lcptest = lcpCompare(testset)

if (len(lcptest)): 
    print ("longest common prefix:", lcptest) 
else: 
    print("no common prefix") 


longest common prefix: app
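
As a quick sanity check on the result above, the sketch below computes the same prefix in two other ways: a column-by-column scan, and the standard library's `os.path.commonprefix`, which compares strings character by character despite living in the path module. The helper name `lcp_vertical` is an assumption introduced here for illustration; it is not part of the exercise code.

```python
import os.path

def lcp_vertical(strings):
    """Column-by-column scan: stop at the first position where the strings disagree."""
    if not strings:
        return ""
    prefix = []
    # zip(*strings) yields one tuple of characters per position and
    # stops at the end of the shortest string
    for column in zip(*strings):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return "".join(prefix)

testset = ["apple", "appointment", "appendix", "appeal", "apparel"]
print(lcp_vertical(testset))            # app
print(os.path.commonprefix(testset))    # app
```

Both calls should agree with the pairwise approach on any input, which makes them handy as a cross-check when testing.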


### LCP using a Trie
A second approach makes use of the trie data structure. We first add all strings to the trie. For as long as the input strings share a common prefix, each node along that prefix has exactly one child. As soon as at least one of the input strings starts to differ, the trie branches out into more than one child. A simple walk from the root, stopping at the first branching node or at the first node where an input string ends, then yields the LCP.

In [3]:
# for simplicity, we assume to have an alphabet made of lower case letters only
ALPHABET = 26

class TrieNode: 
    def __init__(self): 
        self.isLeaf = False
        self.children = [None]*ALPHABET

def insert(key, root):
    # walk down from the root, creating missing nodes along the way
    x = root
    for ch in key:
        index = ord(ch) - ord('a')
        if x.children[index] is None:
            x.children[index] = TrieNode()
        x = x.children[index]
    # mark the node where this key ends
    x.isLeaf = True


def buildTrie(listOfStrings, root):
    for s in listOfStrings:
        insert(s, root)

def countChildren(node):
    # returns the number of children and the index of the last one found
    count = 0
    charindex = -1
    for i in range(ALPHABET):
        if node.children[i] is not None:
            count = count + 1
            charindex = i
    return count, charindex

def walkTrie(root):
    x = root
    lcp = ""
    count, index = countChildren(x)
    # follow the single-child chain; stop at a branch or where a key ends
    while count == 1 and not x.isLeaf:
        x = x.children[index]
        lcp = lcp + chr(ord('a') + index)
        count, index = countChildren(x)
    return lcp

def lcpTrie(listOfStrings, root): 
    buildTrie(listOfStrings, root) 
    return walkTrie(root) 

In [4]:
# test code 
testset = ["apple", "appointment", "appendix", "appeal",  "apparel"]
root = TrieNode() 
lcptest = lcpTrie(testset, root)

if (len(lcptest)): 
    print ("longest common prefix:", lcptest) 
else: 
    print("no common prefix") 

longest common prefix: app


Note that the computational complexity of the above is still O(M\*N), since the trie construction takes O(M\*N). However, once built, subsequent LCP queries would only take O(M).
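
The array-based node above assumes an alphabet of lower-case letters only. As a minimal sketch of how the same idea might extend to arbitrary characters, the variant below stores children in a dictionary. The names `DictTrieNode`, `build` and `lcp_from_trie` are assumptions chosen for this illustration and do not appear in the exercise code.

```python
class DictTrieNode:
    def __init__(self):
        self.is_end = False   # True if some input string ends at this node
        self.children = {}    # maps a character to the child node

def build(strings):
    """Insert every string into a fresh trie and return its root."""
    root = DictTrieNode()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = DictTrieNode()
            node = node.children[ch]
        node.is_end = True
    return root

def lcp_from_trie(root):
    """Follow the single-child chain starting at the root."""
    node = root
    prefix = []
    # stop at a branch, at a dead end, or where an input string ends
    while len(node.children) == 1 and not node.is_end:
        ch, node = next(iter(node.children.items()))
        prefix.append(ch)
    return "".join(prefix)

root = build(["Apple-1", "Apple-2", "Apple pie"])
print(lcp_from_trie(root))   # Apple
```

With a dictionary, counting the children of a node no longer requires scanning a fixed-size array, at the price of some per-node overhead.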

## Exercise – Pattern search 
We begin by implementing a simple brute-force algorithm that finds the occurrences of a given pattern P in a text T (computational cost O(M\*N) in the worst case, with M the length of P and N the length of T). We then look at an implementation of the Knuth-Morris-Pratt (KMP) algorithm covered during lectures. Finally, we experimentally compare the performance of the two algorithms on two inputs: a real English text, and a synthetic text of comparable length made of the repeated character A followed by a single B.

### Brute force

In [5]:
def BruteForcsePatternSearch(pattern, txt): 
    M = len(pattern) 
    N = len(txt) 
    indices = []

    for i in range(N-M + 1): 
        j = 0
        while(j < M): 
            if (txt[i + j] != pattern[j]): 
                break
            j = j + 1

        if (j == M):
            indices.append(i)

    return indices
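
To make the worst case concrete, the sketch below instruments the same brute-force scan with a counter of character comparisons. The function name `count_comparisons` is a hypothetical helper introduced here for illustration only.

```python
def count_comparisons(pattern, txt):
    """Brute-force search that also counts character comparisons."""
    M, N = len(pattern), len(txt)
    indices = []
    comparisons = 0
    for i in range(N - M + 1):
        j = 0
        while j < M:
            comparisons += 1
            if txt[i + j] != pattern[j]:
                break
            j += 1
        if j == M:
            indices.append(i)
    return indices, comparisons

# near-worst case: every alignment matches four A's before the fifth character decides
print(count_comparisons("AAAAB", "A" * 10 + "B"))   # ([6], 35)
# typical case: most alignments fail on the first character
print(count_comparisons("and", "sand and land"))    # ([1, 5, 10], 17)
```

In the first call all 7 alignments cost 5 comparisons each, which is exactly (N-M+1)\*M; in the second, only the three matching alignments cost more than one comparison.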

### KMP

In [6]:
#credit: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/

def KMPPatternSearch(pattern, txt): 
    M = len(pattern) 
    N = len(txt) 

    indices = []

    # lps[] will hold the longest prefix suffix values for pattern 
    lps = [0]*M 
    j = 0
    
    # preprocess the pattern and calculate lps[] 
    computeLPS(pattern, M, lps) 

    i = 0
    while i < N: 
        if pattern[j] == txt[i]:
            i = i+1
            j = j+1

        if j == M: 
            indices.append(i-j)
            j = lps[j-1]  
        elif i < N and pattern[j] != txt[i]: 
            if j != 0: 
                j = lps[j-1] 
            else:
                i = i+1
    return indices
                
                
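def computeLPS(pattern, M, lps): 
    # lenp: length of the current candidate prefix that is also a suffix
    lenp = 0
    lps[0] = 0  # a single character has no proper prefix
    i = 1

    while i < M: 
        if pattern[i] == pattern[lenp]: 
            lenp = lenp+1
            lps[i] = lenp
            i = i+1
        else: 
            if lenp != 0: 
                # fall back to the next shorter candidate; i is not advanced
                lenp = lps[lenp-1] 
            else: 
                lps[i] = 0
                i = i+1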
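
To see what the preprocessing step produces, the following self-contained sketch recomputes the table for two small patterns. `lps_table` is a hypothetical stand-in written for this illustration; it follows the same logic but returns the list instead of filling a preallocated one.

```python
def lps_table(pattern):
    """Return lps, where lps[i] is the length of the longest proper prefix
    of pattern[:i+1] that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0          # length of the currently matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # retry with the next shorter candidate, without advancing i
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps

print(lps_table("AAAAB"))        # [0, 1, 2, 3, 0]
print(lps_table("ABABCABAB"))    # [0, 0, 1, 2, 0, 1, 2, 3, 4]
```

For "AAAAB", a mismatch on the final B after four matched A's lets the search resume with three characters already matched rather than restarting from the next text position, which is what saves work on the synthetic A-only input.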




In [None]:
import timeit

# driver code
with open('5-mobydick.txt','r') as f:
    txt1 = f.read().rstrip("\n")
pattern1a = "and"    
pattern1b = "every kingdom"    

txt2 = "A" * len(txt1) + "B"
pattern2a = "A" * 100
pattern2b = "AAAAB"    

print("Brute force on Moby Dick...")
starttime = timeit.default_timer()
BruteForcsePatternSearch(pattern1a, txt1)
endtime = timeit.default_timer()
print("with pattern \"and\" ", round(endtime-starttime,3))
starttime = timeit.default_timer()
BruteForcsePatternSearch(pattern1b, txt1)
endtime = timeit.default_timer()
print("with pattern \"every kingdom\" ", round(endtime-starttime,3))

print("Brute force on A* text...")
starttime = timeit.default_timer()
BruteForcsePatternSearch(pattern2a, txt2)
endtime = timeit.default_timer()
print("with pattern A*", round(endtime-starttime,3))
starttime = timeit.default_timer()
BruteForcsePatternSearch(pattern2b, txt2)
endtime = timeit.default_timer()
print("with pattern A*B", round(endtime-starttime,3))

print("\n\nKMP on Moby Dick...")
starttime = timeit.default_timer()
KMPPatternSearch(pattern1a, txt1)
endtime = timeit.default_timer()
print("with pattern \"and\" ", round(endtime-starttime,3))
starttime = timeit.default_timer()
KMPPatternSearch(pattern1b, txt1)
endtime = timeit.default_timer()
print("with pattern \"every kingdom\" ", round(endtime-starttime,3))

print("KMP on A* text...")
starttime = timeit.default_timer()
KMPPatternSearch(pattern2a, txt2)
endtime = timeit.default_timer()
print("with pattern A*", round(endtime-starttime,3))
starttime = timeit.default_timer()
KMPPatternSearch(pattern2b, txt2)
endtime = timeit.default_timer()
print("with pattern A*B", round(endtime-starttime,3))

## Exercise – Longest common substring

In [7]:
def longestCommonSubstring(str1, str2): 

    lcs = ""
    maxLength = 0
    
    for i in range(len(str1)):
        if str1[i] in str2:
            # j is an exclusive end index, so it must be allowed to reach len(str1)
            for j in range(i + 1, len(str1) + 1):
                if str1[i:j] in str2:
                    if j - i > maxLength:
                        maxLength = j - i
                        lcs = str1[i:j]

    return lcs


In [8]:
print(longestCommonSubstring("appendix", "appeal"))
print(longestCommonSubstring("appendix", "compendium"))

appe
pendi


In [9]:
from random import choice
from string import digits
import timeit

N = [10, 100, 1000, 2000, 5000]


for i in range(len(N)):
    str1 = "".join(choice(digits) for _ in range(N[i]))
    str2 = "".join(choice(digits) for _ in range(2*N[i]))
        
    starttime = timeit.default_timer()
    print("longest common substring is: ", longestCommonSubstring(str1, str2))
    endtime = timeit.default_timer()
    print("time taken for N =", N[i], ":", round(endtime-starttime,3))




longest common substring is:  00
time taken for N = 10 : 0.0
longest common substring is:  6601
time taken for N = 100 : 0.002
longest common substring is:  4108855
time taken for N = 1000 : 1.147
longest common substring is:  841087
time taken for N = 2000 : 10.704


KeyboardInterrupt: 
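
The run above was interrupted at N = 5000, which suggests that the brute-force approach does not scale. One standard alternative, which is not part of the exercise code, is dynamic programming over the lengths of common suffixes. The sketch below is a minimal version under that assumption; the name `lcs_dp` is chosen here for illustration.

```python
def lcs_dp(str1, str2):
    """Longest common substring via dynamic programming.

    curr[j] is the length of the longest common suffix of
    str1[:i] and str2[:j]; prev holds the same values for str1[:i-1]."""
    best_len = 0
    best_end = 0                      # exclusive end index in str1
    prev = [0] * (len(str2) + 1)
    for i in range(1, len(str1) + 1):
        curr = [0] * (len(str2) + 1)
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                # extend the common suffix ending one character earlier
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return str1[best_end - best_len:best_end]

print(lcs_dp("appendix", "appeal"))        # appe
print(lcs_dp("appendix", "compendium"))    # pendi
```

The two nested loops perform a number of steps proportional to len(str1) \* len(str2), and only two rows of the table are kept in memory at any time.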