# Big Data
## Algorithms: Searching, Recursion and Data Structures
## Victor P. Debattista March 2017


Welcome to the second lecture on algorithms and data structures.  This one focusses on searching, especially with trees and hashes.

In [1]:
import numpy as np
import math
import random
import time

We are going to adapt some code from the sorting exercise.  We want to create two lists of N numbers, which we will use for storing and searching.  One list holds uniformly distributed random numbers; the other holds numbers drawn from a Gaussian (normal, bell-curve) distribution.

In [2]:
random.seed(22)
N = 10000
#N = 10

# data1 is uniformly distributed
data1 = []
for i in range(N):
    data1.append(random.uniform(0.,10000.))
print(data1[0:10])

# data2 is distributed as a Gaussian with average = 100 and sigma = 20
data2 = []
for i in range(N):
    data2.append(random.normalvariate(100.,20.))
print(data2[0:10])

[9582.093798172727, 1403.685900763948, 236.1614713882554, 9986.306536729146, 1842.5364570285308, 1205.9206321532502, 6514.212405579194, 3456.448375625667, 8895.509397958029, 2317.41489986076]
[106.56352441703974, 83.06204051851174, 97.72317951378805, 94.90001125530982, 111.84861372758255, 100.46635043274102, 121.02415209063571, 101.4950788229926, 136.49669588669343, 63.16267547503758]


We are going to experiment with open hashing, which we will implement as a list of lists in Python, akin to how we did a BinSort in week 1.

In [3]:
def InitHash(nbins):
    htable = []    # this is the empty hash table
    for i in range(0,nbins):
        htable.append([])
    return htable

We want to define a few functions to determine some statistics of our hash table occupation.  We want three quantities: the minimum, the maximum, and the average number of entries per bucket.
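A minimal sketch of such a statistics function is given below. The name `HashStats` and its `(htable, nbins)` arguments are taken from the calls later in this exercise; the body is only an assumption about what it should compute, not the reference solution.

```python
def HashStats(htable, nbins):
    # Assumed behaviour: return (minimum, average, maximum) entries per bucket.
    counts = [len(bucket) for bucket in htable[:nbins]]
    return min(counts), sum(counts) / nbins, max(counts)

# Small hand-made table: 4 buckets holding 1, 0, 2 and 0 entries
demo_table = [[3.1], [], [7.2, 7.9], []]
print(HashStats(demo_table, 4))   # (0, 0.75, 2)
```

The average is always the number of stored values divided by the number of buckets, so it is the minimum and maximum that tell us how evenly a hash function spreads the data.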

From last week's BinSort exercise, let's borrow the indexing function: given a value within a given range [lo,hi], it finds the bin to place the element into if there are N bins.  If the value is out of range, a flag value (here -1) is returned.  This will be the basis of our hash function.

In [5]:
def bin_index(val,lo,hi,N):
    # sanity check that val satisfies lo <= val <= hi, else return the flag -1
    if( (val < lo) or (val > hi) ):
        return -1
    # scale val to a bin number; val == hi would give N, so clamp to the last bin
    return min(int((val - lo) * N/(hi-lo)), N-1)

For convenience, let's add a function that takes a list, a hash function, the data range and the number of buckets, and hashes the list.
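One possible shape for this helper is sketched below. The name `Hashify` and the argument order `(data, hash_fun, lo, hi, nbins)` follow the calls in the next cell; the body, the decision to skip values whose index is flagged as out of range, and the stand-in `toy_hash` are assumptions made for illustration.

```python
def InitHash(nbins):
    # one empty bucket (list) per bin
    return [[] for _ in range(nbins)]

def Hashify(data, hash_fun, lo, hi, nbins):
    # Assumed behaviour: place every value in the bucket chosen by hash_fun.
    htable = InitHash(nbins)
    for val in data:
        idx = hash_fun(val, lo, hi, nbins)
        if 0 <= idx < nbins:          # skip flagged (-1) or invalid indices
            htable[idx].append(val)
    return htable

def toy_hash(val, lo, hi, N):
    # hypothetical stand-in hash, used only for this demo
    return int(val) % N

demo = Hashify([1.5, 2.5, 11.2], toy_hash, 0., 20., 10)
print(demo[1], demo[2])   # [1.5, 11.2] [2.5]
```

Passing the hash function as an argument lets the same helper be reused unchanged when comparing different hash functions.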

In [7]:
# first compute statistics if the numbers are uniform
hashTable = Hashify(data1,bin_index,0.,10000.,50000)
minbin1,avgbin1,maxbin1 = HashStats(hashTable,50000)
print(minbin1,avgbin1,maxbin1)

# now compute statistics if the data are more bunched
hashTable = Hashify(data2,bin_index,0,10000.,50000)
minbin2,avgbin2,maxbin2 = HashStats(hashTable,50000)
print(minbin2,avgbin2,maxbin2)


0 0.2 4
0 0.2 57


This is not very satisfying: our hash function is causing a lot of collisions, especially for the bunched data, and these are going to slow down our searches.  Develop a new hash function and compare it with the one above.

Here we're going to define a new hash function based on inverting the order of the digits.
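As a rough illustration of the idea, the sketch below reverses the digits of the integer part and wraps the result into the table. The name `hash_fun2` and its `(val, lo, hi, N)` signature match the calls in the next cell, but the body is an assumption; the original definition is not shown here and may differ.

```python
def hash_fun2(val, lo, hi, N):
    # Assumed behaviour: flag out-of-range values as bin_index does
    if val < lo or val > hi:
        return -1
    digits = str(abs(int(val)))       # digits of the integer part
    return int(digits[::-1]) % N      # reverse them, then wrap into N buckets

print(hash_fun2(1234.56, 0., 10000., 50000))   # 4321
print(hash_fun2(120.0, 0., 10000., 50000))     # 21
```

In this version, values that share an integer part still collide, and trailing zeros are lost on reversal (120 and 12 both map to 21), so bunched data remain bunched.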

In [11]:
# first compute statistics if the numbers are uniform
hashTable = Hashify(data1,hash_fun2,0.,10000.,50000)
minbin1,avgbin1,maxbin1 = HashStats(hashTable,50000)
print(minbin1,avgbin1,maxbin1)

# now compute statistics if the data are more bunched
hashTable = Hashify(data2,hash_fun2,0.,10000.,50000)
minbin2,avgbin2,maxbin2 = HashStats(hashTable,50000)
print(minbin2,avgbin2,maxbin2)

0 0.2 11
0 0.2 229


So we still have too many collisions, and need to try a different approach.  The next hash function uses the digits after the decimal point.
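A minimal sketch of this approach scales the fractional part of the value to the number of buckets. Again, `hash_fun3` and its signature come from the calls in the next cell, while the body is an assumption rather than the original definition.

```python
import math

def hash_fun3(val, lo, hi, N):
    # Assumed behaviour: flag out-of-range values as bin_index does
    if val < lo or val > hi:
        return -1
    frac = val - math.floor(val)        # fractional part, in [0, 1)
    return min(int(frac * N), N - 1)    # clamp guards against rounding up to N

print(hash_fun3(3.25, 0., 10., 100))   # 25
print(hash_fun3(7.5, 0., 10., 10))     # 5
```

The digits after the decimal point are close to uniformly distributed even when the values themselves cluster around a mean, which is why this kind of hash can spread the Gaussian data as evenly as the uniform data.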

In [13]:
# first compute statistics if the numbers are uniform
hashTable = Hashify(data1,hash_fun3,0.,10000.,50000)
minbin1,avgbin1,maxbin1 = HashStats(hashTable,50000)
print(minbin1,avgbin1,maxbin1)

# now compute statistics if the data are more bunched
hashTable = Hashify(data2,hash_fun3,0.,10000.,50000)
minbin2,avgbin2,maxbin2 = HashStats(hashTable,50000)
print(minbin2,avgbin2,maxbin2)

0 0.2 4
0 0.2 4


Let us now develop the functionality for a binary search tree.  Since this involves defining some classes, we develop that here before moving on to some questions.  We start with the class Node, which is the basic building block of a binary tree.

In [14]:
class Node:
    def __init__(self,val):
        # on initialisation, set no children and store the input value
        self.l = None
        self.r = None
        self.v = val

Now we need to build the Tree class.  We build the functionality for inserting and finding.

In [15]:
class Tree:
    def __init__(self):
        self.root = None
    
    def add(self,val):
        if( self.root is None ):
            self.root = Node(val)
        else:
            self._add(val,self.root)
    
    def _add(self,val,node):
        if( val < node.v ):
            if( node.l is not None ):
                self._add(val,node.l)
            else:
                node.l = Node(val)
        else:
            if( node.r is not None ):
                self._add(val,node.r)
            else:
                node.r = Node(val)
                
    def find(self, val):
        if( self.root is not None ):
            return self._find(val, self.root)
        else:
            return None
    
    def _find(self,val,node):
        if( val == node.v ):
            return node
        elif( val < node.v and node.l is not None ):
            return self._find(val,node.l)
        elif( val > node.v and node.r is not None ):
            return self._find(val,node.r)
        else:
            # val is not in the tree
            return None
        

In [16]:
# And here are some examples on how to use this functionality
bst = Tree()
bst.add(3)
bst.add(4)
bst.find(4)
print(bst.root.r.v)

4


OK, with this, build a BST called "tr" from the elements of data1 generated at the top of this exercise (you can try it with data2 also).

In [17]:
tr = Tree()
for val in data1[0:10]:
#for val in data1:
    tr.add(val)

Returning to the Tree class definition above, add methods for computing the number of values stored, the minimum and maximum depth of the leaves, the number of leaves and an inorder listing of the tree.  Once you have that, compute the following (during debugging, work with only 10 elements of data1, i.e. data1[0:10], by choosing the appropriate `for` line in the cell above and commenting out the other).
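One way these methods might look is sketched below. The method names and call signatures (`count_nodes(node)`, `depth(node, fn)` with `fn` being `min` or `max`, `count_leaves(node)`, `inorder(node)`) are taken from the next cell; the bodies are assumptions, and depth is assumed to count the nodes on the path from the given node down to a leaf. To keep the block self-contained, it repeats a compact `Node` and a `Tree` with an iterative `add` that uses the same ordering rule as the recursive `_add` above.

```python
class Node:
    def __init__(self, val):
        self.l = None
        self.r = None
        self.v = val

class Tree:
    def __init__(self):
        self.root = None

    def add(self, val):
        # iterative insert: smaller values go left, all others go right
        if self.root is None:
            self.root = Node(val)
            return
        node = self.root
        while True:
            if val < node.v:
                if node.l is None:
                    node.l = Node(val)
                    return
                node = node.l
            else:
                if node.r is None:
                    node.r = Node(val)
                    return
                node = node.r

    def count_nodes(self, node):
        if node is None:
            return 0
        return 1 + self.count_nodes(node.l) + self.count_nodes(node.r)

    def count_leaves(self, node):
        if node is None:
            return 0
        if node.l is None and node.r is None:
            return 1
        return self.count_leaves(node.l) + self.count_leaves(node.r)

    def depth(self, node, fn):
        # nodes on the shortest (fn=min) or longest (fn=max) path to a leaf
        if node is None:
            return 0
        children = [c for c in (node.l, node.r) if c is not None]
        if not children:
            return 1
        return 1 + fn(self.depth(c, fn) for c in children)

    def inorder(self, node):
        # left subtree, then this node, then right subtree: prints sorted order
        if node is not None:
            self.inorder(node.l)
            print(node.v)
            self.inorder(node.r)

tr_demo = Tree()
for val in [5, 3, 8, 1, 4]:
    tr_demo.add(val)
print(tr_demo.count_nodes(tr_demo.root))    # 5
print(tr_demo.depth(tr_demo.root, min))     # 2
print(tr_demo.depth(tr_demo.root, max))     # 3
print(tr_demo.count_leaves(tr_demo.root))   # 3
tr_demo.inorder(tr_demo.root)               # 1, 3, 4, 5, 8 on separate lines
```

In `depth`, only existing children are considered, so a node with a single child is not mistaken for a leaf when taking the minimum.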

In [18]:
print('Total number of nodes in tree =',tr.count_nodes(tr.root))
print('Minimum depth =',tr.depth(tr.root,min))
print('Maximum depth =',tr.depth(tr.root,max))
print('Total number of leaves =',tr.count_leaves(tr.root))
tr.inorder(tr.root)

Total number of nodes in tree = 10
Minimum depth = 2
Maximum depth = 6
Total number of leaves = 4
236.1614713882554
1205.9206321532502
1403.685900763948
1842.5364570285308
2317.41489986076
3456.448375625667
6514.212405579194
8895.509397958029
9582.093798172727
9986.306536729146
