Binary Search Trees, Traversals and Balancing

Let's create user profiles and a data structure that can store 100 million records and perform insert, search, update and list operations efficiently.

In [2]:
#simple example; generic blueprint of a user
class user:
    pass

In [40]:
#instance of a user
user1 = user()

In [4]:
#inspect the class and verify the instance's type with the following two calls.
user

__main__.user

In [5]:
type(user1)

__main__.user

We need to use a constructor to add useful information to the user class. This is a blueprint for our people who are considered objects in Python. Yes, Python objectifies people. It's nothing personal. Except the introduce yourself method. That's literally one person talking to another person. That's personal.

In [42]:
class User:
    def __init__(self, username, name, email) -> None:
        self.username = username
        self.name = name
        self.email = email
        print("There you go. You made a user. Treat your user well.")

    def introduce_yourself(self, guest_name):
        print("Hi {}, I'm {}! Contact me at {} .".format(guest_name, self.name, self.email))

In [35]:
user2 = User('Paddy', 'Paddy the Baddy', 'paddy@bakerman.com')

We can inspect user2 just like user1.

In [8]:
user2

<__main__.User at 0x10be00a60>

We can access one of the properties by writing a '.' followed by the property name:

In [23]:
user2.name 

'Paddy the Baddy'

In [34]:
user3 = User('Patty', 'Patty Cakes', 'patty@cakes.com')

In [33]:
user3.introduce_yourself('Chad')

Hi Chad, I'm Patty Cakes! Contact me at patty@cakes.com .


The user was passed automatically as self above, but you can also state the user explicitly in parentheses, as in User.introduce_yourself(user3, 'Chad'). Let's add a helper method to our User class.

In [47]:
class User:
    def __init__(self, username, name, email):
        self.username = username
        self.name = name
        self.email = email
        
    def __repr__(self):
        return "User(username='{}', name='{}', email='{}')".format(self.username, self.name, self.email)
    
    def __str__(self):
        return self.__repr__()

In [30]:
user4 = User('sumguy', 'Sumguy Sumone', 'sumguy@nobody.com')
user4

User(username='sumguy', name='Sumguy Sumone', email='sumguy@nobody.com')

Now we can see the keys more clearly for the values that we entered. Think of some ways that this could be helpful (UI/UX to name a couple). Next, we'll make a user database for our example users.

In [62]:
class UserDatabase:
    def insert(self, user):
        pass
    def find(self, username):
        pass
    def update(self, user):
        pass
    def list_all(self):
        pass

In [55]:
user1 = User('God', 'God Allah', 'god@heaven.com')

user1, user2, user3, user4

(User(username='God', name='God Allah', email='god@heaven.com'),
 User(username='Paddy', name='Paddy the Baddy', email='paddy@bakerman.com'),
 User(username='Patty', name='Patty Cakes', email='patty@cakes.com'),
 User(username='sumguy', name='Sumguy Sumone', email='sumguy@nobody.com'))

Remember that you have to run the class again in the notebook if you instantiated your first user here like I did. Otherwise, your user will output an address, and people don't like to be called 0x123456678, so God probably wouldn't like that either. We cannot put the names in a list of users like below, though...

In [56]:
users = [God, Paddy, Patty, sumguy]

NameError: name 'God' is not defined

We need to assign each User object to a variable named after its username first. Let's do that with sample data from Jovian, because it's quicker than thinking of more names off the top of my head:

In [57]:
aakash = User('aakash', 'Aakash Rai', 'aakash@example.com')
biraj = User('biraj', 'Biraj Das', 'biraj@example.com')
hemanth = User('hemanth', 'Hemanth Jain', 'hemanth@example.com')
jadhesh = User('jadhesh', 'Jadhesh Verma', 'jadhesh@example.com')
siddhant = User('siddhant', 'Siddhant Sinha', 'siddhant@example.com')
sonaksh = User('sonaksh', 'Sonaksh Kumar', 'sonaksh@example.com')
vishal = User('vishal', 'Vishal Goel', 'vishal@example.com')

In [58]:
users = [aakash, biraj, hemanth, jadhesh, siddhant, sonaksh, vishal]

Now we have overwritten our sloppy data with the clean samples. The samples are also people who work for and made Jovian, a platform that can teach you everything that I'm reproducing here. There are assignments and you can get certificates when you have completed all of them.

As for the users, you can access any property of a user by writing the variable name, a '.', and the property you want:

In [59]:
#for example
aakash.email

'aakash@example.com'

Or display all of the information:

In [60]:
aakash


User(username='aakash', name='Aakash Rai', email='aakash@example.com')

In [61]:
users

[User(username='aakash', name='Aakash Rai', email='aakash@example.com'),
 User(username='biraj', name='Biraj Das', email='biraj@example.com'),
 User(username='hemanth', name='Hemanth Jain', email='hemanth@example.com'),
 User(username='jadhesh', name='Jadhesh Verma', email='jadhesh@example.com'),
 User(username='siddhant', name='Siddhant Sinha', email='siddhant@example.com'),
 User(username='sonaksh', name='Sonaksh Kumar', email='sonaksh@example.com'),
 User(username='vishal', name='Vishal Goel', email='vishal@example.com')]

We can only list sample outputs once we implement our data structure. That'll happen shortly.

Let's come up with a simple solution first. Implement the various functions:

1. Insert: Loop through the list and add the user at a position that keeps the list sorted by username.
2. Find: Loop through the list and return the user whose username matches the query.
3. Update: Loop through the list, find the user object matching the query and update it with the new details.
4. List: Return the list whenever you want to list all of the users.

Tip: since usernames are strings, we can compare them using <, >, or ==. This will allow us to implement the functions easily. The code will be pretty simple as well.
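A quick sketch of how string comparison behaves, in case it's new to you. Python compares strings character by character using their code points, which is why uppercase letters sort before lowercase ones. The name 'Zed' is made up for the illustration; the other usernames come from the sample data.

```python
# Strings are compared lexicographically, character by character.
print('aakash' < 'biraj')     # True: 'a' comes before 'b'
print('hemanth' > 'biraj')    # True: 'h' comes after 'b'
print('biraj' == 'biraj')     # True: identical strings are equal

# Sorting a list of usernames relies on the same comparisons.
usernames = ['hemanth', 'aakash', 'biraj']
ordered = sorted(usernames)
print(ordered)                # ['aakash', 'biraj', 'hemanth']

# Careful: uppercase letters have smaller code points than lowercase ones.
uppercase_first = 'Zed' < 'aakash'
print(uppercase_first)        # True
```

If usernames can mix upper and lower case, you may want to normalize them (for example with .lower()) before comparing.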

In [74]:
class UserDatabase:
    def __init__(self):
        self.users = []
    
    def insert(self, user):
        i = 0
        while i < len(self.users):
            #compare the username until one is greater than the new username
            if self.users[i].username > user.username:
                break
            i += 1
        self.users.insert(i, user)
    
    def find(self, username):
        for user in self.users:
            if user.username == username:
                return user
    
    def update(self, user):
        target = self.find(user.username)
        target.name, target.email = user.name, user.email
        
    def list_all(self):
        return self.users

Instantiate the class to create a new user database. Note that the sample users have to be defined before you can insert them, and if you haven't indented your methods inside the class, you won't be able to insert anything either!

In [75]:
database = UserDatabase()

In [76]:
database.insert(hemanth)
database.insert(aakash)
database.insert(biraj)

Retrieve and display one of them:

In [78]:
user = database.find('hemanth')
user

User(username='hemanth', name='Hemanth Jain', email='hemanth@example.com')

In [81]:
database.update(User(username = 'hemanth', name = 'Hemanth J', email = 'hemanth@anotherexample.com'))

In [82]:
database.list_all()

[User(username='aakash', name='Aakash Rai', email='aakash@example.com'),
 User(username='biraj', name='Biraj Das', email='biraj@example.com'),
 User(username='hemanth', name='Hemanth J', email='hemanth@anotherexample.com')]

In [84]:
database.insert(siddhant)

In [85]:
database.list_all()

[User(username='aakash', name='Aakash Rai', email='aakash@example.com'),
 User(username='biraj', name='Biraj Das', email='biraj@example.com'),
 User(username='hemanth', name='Hemanth J', email='hemanth@anotherexample.com'),
 User(username='siddhant', name='Siddhant Sinha', email='siddhant@example.com')]

You can test and use more methods by adding a new cell of code to run the various methods and update properties, or add your own! Now we should analyze the complexity and identify the inefficiencies. 

Time complexities of our various operations are:
1. Insert: O(N)
2. Find: O(N)
3. Update: O(N)
4. List all: O(1)

All of them have linear complexity except list_all, which simply returns the existing list without iterating over it. That makes it a constant time operation. Space is O(1) for all operations, but is the time complexity good enough? No, because there are 100 million users on the platform, so a single insert, find or update could take up to 100 million steps.

In [86]:
%%time
for i in range(100000000):
    j = i * 1

CPU times: user 7.27 s, sys: 30.9 ms, total: 7.3 s
Wall time: 7.34 s


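Even a loop that does almost nothing takes about 7 seconds for 100 million iterations here, and real username comparisons would likely take longer. We would never want 10-15 second profile loads. People would stop using the application, so we need to optimize this. Let's choose a better data structure, so we can be senior engineers. This is where binary trees come into play.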

Each node of the tree stores a key and a value. A structure like this is often referred to as a map or treemap. Every node in a binary search tree has a left and a right subtree. To search, we start at the root and go either left or right at each node until we find the key and access its value. The left subtree contains keys that are lexicographically smaller than the node's key, while the right subtree contains keys that are larger. The tree is balanced when, at every node, the left and right subtrees have roughly the same height.
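To make the left-or-right idea concrete, here is a minimal sketch of a lookup. BSTNode and bst_find are hypothetical names used only for this illustration (they are not the classes built later in the notebook), and the tree is wired up by hand rather than through an insert method.

```python
# Hypothetical node type for illustration only.
class BSTNode:
    def __init__(self, key, value=None):
        self.key = key
        self.value = value
        self.left = None
        self.right = None

def bst_find(node, key):
    """Walk down from node: go left for smaller keys, right for larger ones."""
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None  # fell off the tree, so the key is not present

# Hand-built tree: smaller usernames on the left, larger on the right.
root = BSTNode('hemanth', 'Hemanth Jain')
root.left = BSTNode('biraj', 'Biraj Das')
root.right = BSTNode('siddhant', 'Siddhant Sinha')
root.left.left = BSTNode('aakash', 'Aakash Rai')

found = bst_find(root, 'aakash')    # hemanth -> biraj -> aakash
missing = bst_find(root, 'vishal')  # hemanth -> siddhant -> nothing
print(found.value)   # Aakash Rai
print(missing)       # None
```

Each lookup only visits one node per level, which is where the speedup over a linear scan comes from.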

When every level is full, a binary tree contains twice as many nodes in one level as in the previous level.

Level 1: 1
Level 2: 2
Level 3: 4
Level 4: 8

And so on.

If a tree of height k has every level full, it holds N = 1 + 2 + 4 + ... + 2^(k-1) nodes, which gives

N + 1 = 2^k

and

k = log2(N + 1) <= log2(N) + 1

So storing N records requires a balanced binary search tree of height no larger than log2(N) + 1. Insert, find and update will have the complexity O(log N) as well, because each of them traverses a single path down from the root of the tree. That's a tiny fraction of the work of doing it linearly.
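A small sketch to check the arithmetic, assuming every level of the tree is full. The helper names max_nodes and min_height are made up for this example. It uses int.bit_length(), which for a positive integer n equals the smallest k with n < 2^k, i.e. ceil(log2(n + 1)), without any floating point rounding.

```python
def max_nodes(height):
    """Nodes in a binary tree whose levels 1..height are all full: 2^height - 1."""
    return 2 ** height - 1

def min_height(n):
    """Fewest levels that can hold n nodes, i.e. ceil(log2(n + 1))."""
    return n.bit_length()

# Nodes per level double each time.
level_sizes = [2 ** (level - 1) for level in range(1, 5)]
print(level_sizes)              # [1, 2, 4, 8]

print(max_nodes(4))             # 15 = 1 + 2 + 4 + 8
print(min_height(15))           # 4

# A balanced tree for 100 million users is only about 27 levels deep.
height_100m = min_height(100_000_000)
print(height_100m)              # 27
```

So a lookup in a balanced tree of 100 million users touches roughly 27 nodes instead of up to 100 million list entries.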

Question: implement a binary tree with Python and show its usage with examples.

Start with the simplest case, one node, or the root:

In [88]:
class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

In [89]:
#more nodes
node0 = TreeNode(3)
node1 = TreeNode(4)
node2 = TreeNode(5)

In [90]:
#display one of them
node0

<__main__.TreeNode at 0x10c181090>

In [92]:
node0.key

3

In [94]:
node0.left = node1
node0.right = node2

That's it for a node. Now we have to figure out a way to replicate this logic for every possible node in the future. 

First, we want to track the root like so:

In [95]:
tree = node0

In [96]:
tree.key

3

In [98]:
tree.left.key

4

In [99]:
tree.right.key

5

The variable tree is connected to its children. We call it tree rather than root, because root can often mean other things in computers, and because a node stands for the whole subtree below it, so the same idea applies to any node, not necessarily the root.

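Let's try to make an unbalanced tree. We'll write it as a nested tuple of the form (left subtree, key, right subtree), where None marks a missing child.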

In [100]:
tree_tuple = ((1,3, None), 2, ((None, 3,4), 5, (6,7,8)))

In [101]:
def parse_tuple(data):
    #print(data)
    if isinstance(data, tuple) and len(data) == 3:
        node = TreeNode(data[1])
        node.left = parse_tuple(data[0])
        node.right = parse_tuple(data[2])
    elif data is None:
        node = None
    else:
        node = TreeNode(data)
    return node
    

You can see that parse_tuple creates a root node when the input is a tuple of size 3. It then invokes itself to create the left and right subtrees. That's called recursion. The chain of recursive calls ends when parse_tuple encounters a number or None as input. This idea will be used a lot for the rest of the notebook.

Exercise: add print statements inside parse_tuple to display its arguments. Does the order of the calls make sense to you?

In [103]:
tree2 = parse_tuple(((1,3,None), 2, ((None, 3, 4), 5, (6, 7, 8))))
tree2

<__main__.TreeNode at 0x10c181990>

In [104]:
tree2.left.key, tree2.right.key

(3, 5)

In [105]:
tree2.left.left.key, tree2.left.right, tree2.right.left.key, tree2.right.right.key

(1, None, 3, 7)

In [106]:
tree2.right.left.right.key, tree2.right.right.left.key, tree2.right.right.right.key

(4, 6, 8)

Exercise: define a function tree_to_tuple that converts a binary tree back to a tuple, i.e. the inverse of parse_tuple.

In [107]:
def tree_to_tuple(node):
    pass
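If you get stuck, here is one possible sketch of a solution, not the only one. It assumes the output should mirror the compact input format of parse_tuple: None for an empty node, a bare key for a leaf, and a (left, key, right) tuple otherwise. TreeNode and parse_tuple are repeated from above so the block runs on its own.

```python
class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def parse_tuple(data):
    # Same logic as the notebook's parse_tuple.
    if isinstance(data, tuple) and len(data) == 3:
        node = TreeNode(data[1])
        node.left = parse_tuple(data[0])
        node.right = parse_tuple(data[2])
    elif data is None:
        node = None
    else:
        node = TreeNode(data)
    return node

def tree_to_tuple(node):
    # Empty subtree
    if node is None:
        return None
    # Leaf: return the bare key, matching how leaves are written in the input
    if node.left is None and node.right is None:
        return node.key
    # Internal node: recurse into both subtrees
    return (tree_to_tuple(node.left), node.key, tree_to_tuple(node.right))

original = ((1, 3, None), 2, ((None, 3, 4), 5, (6, 7, 8)))
round_trip = tree_to_tuple(parse_tuple(original))
print(round_trip)
print(round_trip == original)   # True
```

Note that a leaf written as (None, 5, None) would come back as just 5, so the round trip is only exact for the compact form used in this notebook.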

In [116]:
def display_keys(node, space='\t', level=0):
    # print(node.key if node else None, level)
    
    # If the node is empty
    if node is None:
        print(space*level + '∅')
        return   
    
    # If the node is a leaf 
    if node.left is None and node.right is None:
        print(space*level + str(node.key))
        return
    
    # If the node has children
    display_keys(node.right, space, level+1)
    print(space*level + str(node.key))
    display_keys(node.left,space, level+1)    

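display_keys uses recursion to print every key indented by its level. It prints the right subtree first, then the node itself, then the left subtree, so the output reads like the tree rotated 90 degrees counterclockwise, with the root on the left. I left in the commented print statement in case you'd like to uncomment it and see the arguments for each call of the function.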

In [117]:
display_keys(tree2, '  ')

      8
    7
      6
  5
      4
    3
      ∅
2
    ∅
  3
    1


This helps us visualize the examples. Come up with great string representations in order to make better data structures.

In [118]:
display_keys(tree, '   ')

   5
3
   4


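Now let's traverse a binary tree. We need to be able to traverse it in-order (left subtree, node, right subtree), pre-order (node, left subtree, right subtree) and post-order (left subtree, right subtree, node).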

In [119]:
def traverse_in_order(node):
    if node is None:
        return []
    return (traverse_in_order(node.left)
            + [node.key]
            + traverse_in_order(node.right))


In [120]:
#with our example tree
tree = parse_tuple(((1,3,None), 2, ((None, 3, 4), 5, (6, 7, 8))))

In [122]:
display_keys(tree, '     ')

               8
          7
               6
     5
               4
          3
               ∅
2
          ∅
     3
          1


In [123]:
#in-order traversal returns the keys as a flat list
traverse_in_order(tree)

[1, 3, 2, 3, 4, 5, 6, 7, 8]
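The other two orders follow the same recursive pattern, only the position of [node.key] changes. Here is a sketch under the assumption that we name them traverse_pre_order and traverse_post_order (my names, chosen to match traverse_in_order). TreeNode and parse_tuple are repeated so the block runs on its own.

```python
class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def parse_tuple(data):
    # Same logic as the notebook's parse_tuple.
    if isinstance(data, tuple) and len(data) == 3:
        node = TreeNode(data[1])
        node.left = parse_tuple(data[0])
        node.right = parse_tuple(data[2])
    elif data is None:
        node = None
    else:
        node = TreeNode(data)
    return node

def traverse_pre_order(node):
    # Node first, then left subtree, then right subtree
    if node is None:
        return []
    return ([node.key]
            + traverse_pre_order(node.left)
            + traverse_pre_order(node.right))

def traverse_post_order(node):
    # Left subtree, then right subtree, then the node last
    if node is None:
        return []
    return (traverse_post_order(node.left)
            + traverse_post_order(node.right)
            + [node.key])

tree = parse_tuple(((1, 3, None), 2, ((None, 3, 4), 5, (6, 7, 8))))
pre = traverse_pre_order(tree)
post = traverse_post_order(tree)
print(pre)    # [2, 3, 1, 5, 3, 4, 7, 6, 8]
print(post)   # [1, 3, 4, 3, 6, 8, 7, 5, 2]
```

Pre-order always starts with the root and post-order always ends with it, which is a handy sanity check.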