# String Applications

## Exercise – Phone directory lookup
If we only needed a basic lookup operation (i.e., given a full employee name, return their telephone number), a symbol table would suffice. Here, however, every time the user enters a new character we also want to return a list of name suggestions (i.e., all names in the directory that match the partial name entered so far). To support this, we use a trie. We begin by inserting all employees' names into the trie. Once the trie has been built, each newly typed character is handled in two steps: (1) we descend the trie by one level along that character, looking for a prefix match; (2) if the current prefix is found in the trie, we display all contacts that start with it; otherwise we report that there is no match. We avoid restarting the search from the root each time a new character is typed by remembering where in the trie the previous traversal ended.

In [1]:
# for simplicity, we assume an alphabet of lowercase letters only
ALPHABET = 26

class TrieNode: 
    def __init__(self): 
        self.isLeaf = False
        self.children = [None]*ALPHABET

class TriePhoneDirectory:
    def __init__(self):
        self.root = TrieNode()
        
    def buildPhoneDirectory(self, listOfContacts): 
        for i in range(len(listOfContacts)): 
            self.insert(listOfContacts[i], self.root)    
    
    def insert(self, contact, node): 
        x = node
        for level in range(len(contact)): 
            index = ord(contact[level]) - ord('a') 
            if x.children[index] is None: 
                x.children[index] = TrieNode() 
            x = x.children[index] 
        x.isLeaf = True
        
        
    def displayMatchedContacts(self, str1):
        prevNode = self.root
        prefix = ""
        
        for i in range(len(str1)): 
            prefix = prefix + str1[i] #string entered so far
            current = str1[i] #newly entered char

            currentIndex = ord(current) - ord('a')
            currentNode = prevNode.children[currentIndex] #descend trie 1 level for newly entered char
            
            if currentNode is None:
                print("Employee name", str1, "not in phone directory")
                break
            else:
                partialMatches = []
                self.retrievePartialMatches(currentNode, prefix, partialMatches)
                print("Suggested employees' names for", prefix, ":", partialMatches)
                prevNode = currentNode
                
    
    
    def retrievePartialMatches(self, currentNode, prefix, matches):
        # prefix itself is a complete employee name
        if currentNode.isLeaf:
            matches.append(prefix)
        
        # keep traversing the trie from currentNode to find all employees
        # (nodes flagged isLeaf) whose name starts with prefix; we do not stop at
        # a flagged node, since a name can be a prefix of a longer one (e.g. "jo" and "john")
        for i in range(ALPHABET):
            nextNode = currentNode.children[i]
            if nextNode is not None:
                self.retrievePartialMatches(nextNode, prefix + chr(ord('a') + i), matches)

In [2]:
#test code

phoneDir = TriePhoneDirectory() 

employeeNames = ["jacky", "james", "janice", "jeremy", "john", "jolanda", "jonathan", "joshua"]

phoneDir.buildPhoneDirectory(employeeNames)

testEmployee = "john" #assume these are entered one char at a time
phoneDir.displayMatchedContacts(testEmployee)

print("\n")

testEmployee = "jeremiah" #assume these are entered one char at a time
phoneDir.displayMatchedContacts(testEmployee)



Suggested employees' names for j : ['jacky', 'james', 'janice', 'jeremy', 'john', 'jolanda', 'jonathan', 'joshua']
Suggested employees' names for jo : ['john', 'jolanda', 'jonathan', 'joshua']
Suggested employees' names for joh : ['john']
Suggested employees' names for john : ['john']


Suggested employees' names for j : ['jacky', 'james', 'janice', 'jeremy', 'john', 'jolanda', 'jonathan', 'joshua']
Suggested employees' names for je : ['jeremy']
Suggested employees' names for jer : ['jeremy']
Suggested employees' names for jere : ['jeremy']
Suggested employees' names for jerem : ['jeremy']
Employee name jeremiah not in phone directory
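
The idea of remembering where the previous traversal ended can also be sketched as a small stateful object that consumes one keystroke at a time. This is only an illustrative sketch, not the solution above: the class `PrefixSession`, its method `type_char`, and the nested-dict trie with the `_END` marker are all hypothetical names introduced here.

```python
# Assumed sentinel marking "a full name ends at this node". It is an object,
# not a character, so it can never collide with a typed key.
_END = object()


class PrefixSession:
    """Hypothetical helper that keeps the trie position between keystrokes."""

    def __init__(self, names):
        # Nested-dict trie: each node maps a character to a child node.
        self.root = {}
        for name in names:
            node = self.root
            for ch in name:
                node = node.setdefault(ch, {})
            node[_END] = True
        self.node = self.root  # node reached by the prefix typed so far
        self.prefix = ""

    def type_char(self, ch):
        """Consume one keystroke and return all names matching the new prefix."""
        self.prefix += ch
        if self.node is not None:
            # Descend a single level from the remembered node (no restart at the root).
            self.node = self.node.get(ch)
        if self.node is None:
            return []  # dead end: any further keystroke cannot match either
        matches = []
        self._collect(self.node, self.prefix, matches)
        return matches

    def _collect(self, node, prefix, matches):
        # A name may end here and still be a prefix of longer names, so keep going.
        if _END in node:
            matches.append(prefix)
        for ch in sorted(k for k in node if k is not _END):
            self._collect(node[ch], prefix + ch, matches)


session = PrefixSession(["jacky", "james", "john", "jolanda", "jo"])
for ch in "joh":
    matches = session.type_char(ch)
    print(session.prefix, "->", matches)
```

Each call to `type_char` moves down exactly one level before collecting suggestions, so the cost of locating the prefix no longer grows with the number of characters already typed.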


## Exercise – Tag cloud
We can combine two data structures to solve this problem. We use a trie to represent all the words in D, and we extend each trie node with a counter of how many times the word represented by that node has been seen in D. To efficiently find the k most frequent words in D, we also maintain a min-heap of at most k elements. As we build the trie, whenever a word becomes one of the k most frequent words seen so far, the word and its frequency are added to the min-heap, evicting the current minimum if the heap is already full.

In [None]:
import heapq

# for simplicity, we assume an alphabet of lowercase letters only
ALPHABET = 26

K = 10

class ExtendedTrieNode: 
    def __init__(self): 
        self.isLeaf = False
        self.children = [None]*ALPHABET
        self.frequency = 0
        

class DocumentTrie:
    def __init__(self):
        self.root = ExtendedTrieNode()
        self.minheap = []
        self.tagcloud = {}
        
    def buildDocumentTrie(self, bagofwords): 
        for i in range(len(bagofwords)): 
            self.insert(bagofwords[i], self.root)    
    
    def insert(self, word, node): 
        x = node
        for level in range(len(word)): 
            index = ord(word[level]) - ord('a')             
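            if index < 0 or index >= ALPHABET:
                # character outside 'a'..'z': report it and skip the whole word
                print("Skipping", repr(word), "- unsupported character", repr(word[level]))
                return
            
            if x.children[index] is None: 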
                x.children[index] = ExtendedTrieNode() 
            x = x.children[index] 
        x.isLeaf = True
        x.frequency = x.frequency + 1

        # case 1: word is already in top k --> update its frequency
        if word in self.tagcloud:
            self.tagcloud[word] = x.frequency
            # list.remove() can break the heap invariant, so restore it before pushing
            self.minheap.remove((x.frequency - 1, word))
            heapq.heapify(self.minheap)
            heapq.heappush(self.minheap, (x.frequency, word))
            
        # case 2: there are fewer than k words counted so far --> add word to heap and cloud
        elif len(self.minheap) < K:
            heapq.heappush(self.minheap, (x.frequency, word))
            self.tagcloud[word] = x.frequency
            
        # case 3: heap is full but word has higher frequency than min heap --> make space in heap
        elif self.minheap[0][0] < x.frequency:
            remfreq, remword = heapq.heapreplace(self.minheap, (x.frequency, word))
            del self.tagcloud[remword]
            self.tagcloud[word] = x.frequency
                                                 
        

In [None]:
#test code
import string

# hack to convert mobydick text to lower case char only
with open('5-mobydick.txt','r') as f:
    document = f.read().rstrip("\n")

# strip punctuation and digits in a single pass, then lower-case
document = document.translate(str.maketrans("", "", string.punctuation + string.digits))
document = document.lower()

# -----------------------------------------------------

docTrie = DocumentTrie() 
docTrie.buildDocumentTrie(document.split())

print("Top k words (from min to max frequency)")
topk = docTrie.minheap
while topk:
    print(heapq.heappop(topk))



An alternative solution to this problem could be to use a hash table, where for each word in D we maintain a counter of how many times the word appears. We can then traverse the hash table to find the top k most frequently used words. What is the computational cost of such an approach compared to the above?
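
As a point of comparison, the hash-table alternative might be sketched as follows. This is a minimal sketch under stated assumptions rather than a reference solution: the function name `top_k_words` is hypothetical, `collections.Counter` (a `dict` subclass) stands in for the hash table, and `heapq.nlargest` performs the final traversal.

```python
import heapq
from collections import Counter


def top_k_words(words, k):
    """Return the k most frequent words as (word, count) pairs, most frequent first."""
    # Pass 1: one hash-table update per word in the document.
    counts = Counter(words)
    # Pass 2: scan the distinct words while keeping only the k largest counts.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


sample = "the whale the sea the whale".split()
print(top_k_words(sample, 2))
```

With n words in D, of which m are distinct, the counting pass performs n expected constant-time updates (ignoring the cost of hashing each word), and selecting the top k takes O(m log k) time. Unlike the trie-plus-heap version, the top k words are only available after the whole document has been counted.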

## Exercise – Prefix-free codes
We can solve this problem using a trie. We read the codes one by one and insert each into the trie. For the codes to be prefix-free, no insertion may terminate on an internal node (e.g., inserting 10 after having inserted 10100), and no insertion may extend a leaf node (e.g., inserting 10100 after having inserted 10).

In [3]:
ALPHABET = 2 #binary

class TrieNode: 
    def __init__(self): 
        self.isLeaf = False
        self.children = [None]*ALPHABET

def insert(code, root): 
    # returns True if code was inserted, False if it violates the prefix-free property
    x = root 
      
    for level in range(len(code)): 
        index = ord(code[level]) - ord('0')
        other = 1 - index  # index of the sibling branch
        
        if x.children[index] is None and x.children[other] is None:
            # case 1: trying to insert a new code that extends an already existing one
            # (node x has no children and is a leaf (= code) node)
            if x.isLeaf:
                print("        Code", code, "tries to expand upon an already existing prefix")
                return False
            else:
                # case 2: intermediate step of inserting a new code
                x.children[index] = TrieNode() 
                x = x.children[index]
                    
        elif x.children[index] is not None:
            # case 3: keep traversing trie along partially matched code
            x = x.children[index]
        else:
            # case 4: expand trie by branching out from x (prefix free)
            x.children[index] = TrieNode() 
            x = x.children[index]
                
    if x.children[0] is not None or x.children[1] is not None:
        # case 5: code is prefix to a code already inserted in the trie
        print("        Code", code, "is prefix to an already existing code")
        return False
    
    x.isLeaf = True
    return True
        
    
def buildTrie(codewords, root): 
    # returns True only if every codeword could be inserted (codebook is prefix-free)
    prefixFree = True
    for i in range(len(codewords)): 
        if not insert(codewords[i], root):
            prefixFree = False
    return prefixFree

In [4]:
# test code
codebook1 = ["01", "10", "0010", "1111"]
print("codebook:", codebook1)
root = TrieNode()
buildTrie(codebook1, root)

codebook2 = ["01", "10", "0010", "10100"]
print("codebook:", codebook2)
root = TrieNode()
buildTrie(codebook2, root)

codebook3 = ["01", "0010", "10100", "10"]
print("codebook:", codebook3)
root = TrieNode()
buildTrie(codebook3, root)



codebook: ['01', '10', '0010', '1111']
codebook: ['01', '10', '0010', '10100']
        Code 10100 tries to expand upon an already existing prefix
codebook: ['01', '0010', '10100', '10']
        Code 10 is prefix to an already existing code
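
One way to sanity-check the trie-based result is a brute-force comparison of every ordered pair of codes. The sketch below is not the trie technique itself but a deliberately simple quadratic cross-check; the function name `is_prefix_free` is hypothetical. It treats a duplicated code as a violation, since a code is a prefix of its own copy.

```python
from itertools import permutations


def is_prefix_free(codewords):
    """Brute-force check: no code may be a prefix of a code at another position."""
    # permutations(..., 2) yields every ordered pair of distinct positions.
    return not any(b.startswith(a) for a, b in permutations(codewords, 2))


for codebook in (["01", "10", "0010", "1111"],
                 ["01", "10", "0010", "10100"],
                 ["01", "0010", "10100", "10"]):
    print(codebook, "->", is_prefix_free(codebook))
```

On the three codebooks used in the test code, this check agrees with the trie: only the first one is prefix-free. The trie processes each code in time proportional to its length, whereas this check compares all pairs, so it is only suitable for small codebooks or for testing.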
