# CSPB 3104 Assignment 5:  Julia Scott

***
# Instructions

This assignment is to be completed as a python3 notebook.  When you upload, please upload the completed notebook (ipynb file).

The questions provided below will ask you to either write code or
write answers in the form of markdown.

 A Markdown syntax guide is here: [click here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

Using markdown you can typeset formulae using LaTeX.
This way you can write nice, readable answers with formulae like this:

The algorithm runs in time $\Theta\left(n^{2.1\log_2(\log_2( n \log^*(n)))}\right)$, 
where $\log^*(n)$ is the iterated logarithm.

__Double click anywhere on this box to find out how your instructor typeset it. Press Shift+Enter to go back.__

***

## Question 1: AVL Trees.

 AVL trees are yet another kind of self-balancing binary search tree (BST) that is sometimes used in place of red-black trees.
 The key property of an AVL tree is that

 *for all nodes $n$ in the tree*, $\left|\ \text{height}(n.\text{left}) - \text{height}(n.\text{right}) \right| \leq 1$

 In words, the height of the left subtree and right subtree at any node can differ by at most $1$.
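
As a rough illustration of the property above, the following sketch checks the height condition at every node. The `Node` class and the convention that an empty subtree has height -1 are assumptions made for this example only, and the check covers only the balance condition, not the BST ordering.

```python
class Node:
    """Hypothetical minimal tree node, used only for this illustration."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def height(node):
    """Height in edges; an empty subtree has height -1 (assumed convention)."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def is_avl_balanced(node):
    """Return True if, at every node, the subtree heights differ by at most 1."""
    if node is None:
        return True
    if abs(height(node.left) - height(node.right)) > 1:
        return False
    return is_avl_balanced(node.left) and is_avl_balanced(node.right)


balanced = Node(2, Node(1), Node(3))
chain = Node(1, right=Node(2, right=Node(3)))  # right-leaning chain
print(is_avl_balanced(balanced))  # True
print(is_avl_balanced(chain))     # False
```

This version recomputes heights repeatedly and is meant for clarity rather than efficiency.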
 
 Let $h$ be the height of an AVL tree and $n$ be the number of nodes in the tree.  The goal of this problem is to prove a relationship between $h$ and $n$.  We've broken this into two steps:

 (A) Prove that $n \geq F_h$, where $F_h$ is the $h^{th}$ Fibonacci number. ($F_0 = 1, F_1 = 1, F_2 = 2, \ldots $)
  (*Hint* Use strong induction with two base cases. First establish the property for all AVL trees of heights 0 and 1. Next, assuming
  it holds for trees of height $\leq h$, prove it for trees of height $h+1$ ).
  
  
  Next, it is a fact that for any $k \geq 30$, $F_k \geq 1.5^k$.
  
 (B) Using the above fact and the result from part A,  show that $h = \Theta(\log(n))$.

 (C) We will briefly examine inserting a node into an AVL tree through an example. On the left, we have shown an AVL tree and to the right we show the result after a BST insert has happened.

![AVL Tree Before and After Insertion](avl-tree-insert-problem-img.jpeg "AVL Tree Insertion" )

  Devise a sequence of left and right rotations that will restore the AVL tree property.
Explain for each rotation which node is the root of the rotation and in which direction we rotate. If you wish, you may insert images showing the trees before/after rotation using markdown (see how we inserted the image above), but do not forget to upload the images with the submission.






 ### Answer 1 (Expected length: 15 lines)

__(A)__


![AVL Tree Nodes Proof](InductionProof_AVLTreeNodes.jpeg "Induction proof" )
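
A numeric sanity check can complement an induction proof of $n \geq F_h$. The sketch below assumes the usual recurrence for the fewest nodes in an AVL tree of height $h$, namely $N(h) = 1 + N(h-1) + N(h-2)$ with $N(0) = 1$ and $N(1) = 2$ (a single node has height 0), and compares it with the Fibonacci numbers as defined in the question. The function names are made up for this illustration.

```python
def min_nodes(h):
    """Fewest nodes in an AVL tree of height h, where a single node has height 0."""
    prev, curr = 1, 2  # N(0), N(1)
    if h == 0:
        return prev
    for _ in range(h - 1):
        # N(h) = 1 (root) + N(h-1) (taller subtree) + N(h-2) (shorter subtree)
        prev, curr = curr, 1 + prev + curr
    return curr


def fib(h):
    """Fibonacci numbers with F_0 = 1, F_1 = 1, F_2 = 2, as in the question."""
    a, b = 1, 1
    for _ in range(h):
        a, b = b, a + b
    return a


for h in range(6):
    print(h, min_nodes(h), fib(h))
```

A check like this does not replace the proof; it only confirms the inequality for the heights tested.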



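__(B)__ Let $k$ be the height of an AVL tree with $n$ nodes, and suppose $k \geq 30$.  
Taking the log of both sides of the given fact $F_k \geq 1.5^k$ gives:

$\log(F_k) \geq \log(1.5^k)$

$\log(F_k) \geq k \log(1.5)$

Divide both sides by $\log(1.5)$, which is positive:

$\frac{\log(F_k)}{\log(1.5)} \geq k$

From part (A) we already know that $n \geq F_k$.

Therefore, $\log(n) \geq \log(F_k)$ and $k \leq \frac{\log(n)}{\log(1.5)}$

Since $\frac{1}{\log(1.5)}$ is a constant, $k$ is upper bounded by a constant multiple of $\log(n)$, and $k = O(\log(n))$. (Trees with $k < 30$ have height bounded by a constant, so they do not affect the asymptotic bound.)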

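For the lower bound, note that an AVL tree is a binary tree, and a binary tree of height $k$ has at most $2^{k+1} - 1$ nodes, since level $i$ holds at most $2^i$ nodes. So:

$n \leq 2^{k+1} - 1$

Taking $\log_2$ of both sides of $n + 1 \leq 2^{k+1}$ gives:

$k \geq \log_2(n+1) - 1$

This means the height is lower bounded by a constant multiple of $\log(n)$, so $k = \Omega(\log(n))$. Combined with $k = O(\log(n))$, and writing $h$ for the height, this gives us

$h = \Theta(\log(n))$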
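
One way to sanity-check a $\Theta(\log(n))$ height claim numerically is to test two explicit bounds: $h \leq \log(n)/\log(1.5)$, which follows from $n \geq F_h \geq 1.5^h$, and $h \geq \log_2(n+1) - 1$, which holds because a binary tree of height $h$ has at most $2^{h+1} - 1$ nodes. The sketch below checks both for the sparsest and fullest AVL trees of a given height. The helper names and the recurrence for the sparsest tree are assumptions of this example.

```python
import math


def min_nodes(h):
    """Fewest nodes in an AVL tree of height h (single node has height 0)."""
    prev, curr = 1, 2  # N(0), N(1)
    if h == 0:
        return prev
    for _ in range(h - 1):
        prev, curr = curr, 1 + prev + curr
    return curr


def max_nodes(h):
    """Most nodes in any binary tree of height h (a perfect tree)."""
    return 2 ** (h + 1) - 1


def bounds_hold(h, n, eps=1e-9):
    """Check log2(n+1) - 1 <= h <= log(n)/log(1.5), with a float tolerance."""
    lower = math.log2(n + 1) - 1
    upper = math.log(n) / math.log(1.5)
    return lower - eps <= h <= upper + eps


for h in (30, 40, 50):
    print(h, bounds_hold(h, min_nodes(h)), bounds_hold(h, max_nodes(h)))
```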

__(C)__ Only one rotation is needed. The rotation takes place at node 29 as the root. 29 is rotated right, and 26 comes up to take its place. Now 26 is the root of that subtree, with 24 as its left child and 29 as its right child.
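
A right rotation can be sketched in a few lines. The tree from the image is not reproduced here, so the example uses a hypothetical three-node left chain with keys 29, 26 and 24 as a stand-in. The `Node` class and `rotate_right` helper are illustrative names, not part of any provided code.

```python
class Node:
    """Hypothetical minimal BST node, used only for this illustration."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def rotate_right(y):
    """Rotate right at y: y's left child x becomes the new subtree root."""
    x = y.left
    y.left = x.right   # x's right subtree moves under y
    x.right = y        # y becomes x's right child
    return x           # caller must re-attach x where y used to hang


# Stand-in for the unbalanced part of the tree: 29 -> 26 -> 24 (all left links)
subtree = Node(29, left=Node(26, left=Node(24)))
subtree = rotate_right(subtree)
print(subtree.key, subtree.left.key, subtree.right.key)  # 26 24 29
```

In a full implementation the parent of the rotated node would also need its child pointer updated to the returned node.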




***
## Question 2: Bloom Filters


 A bloom filter is a fast set data structure that maintains a set $S$ of keys. One can insert keys into the set and test whether a given key $k$ belongs to the set. It may be used in applications where the keys are "complicated" objects such as TCP packets or images that are expensive to compare with each other. 
 

 The data structure is an array $T$ of Booleans of size $m$, together with $l$ different hash functions $h_1, \ldots, h_l$.
 Initially, `T[i] = FALSE` for all `i`.

 If a key $k$ is to be inserted,
 we first compute $i_1 = h_1(k), \ldots, i_l = h_l(k)$ and then we set $T[i_1] = \cdots = T[i_l] = \text{TRUE}$.

 __Note:  A bloom filter is *not* a hash table, but they both use hash functions in interesting ways.__

 __(A)__ Suppose we wish to find out if an element $k$ is a member of the set by checking if
$T[h_1(k)], \ldots, T[h_l(k)]$ are all true. Explain whether this can lead to a *false positive*, i.e.,
the approach wrongly concludes that $k$ belongs to the set when it was never inserted; or a *false negative*,
i.e., the approach wrongly concludes that $k$ does not belong to the set when it does.

 __(B)__ Suppose our hash functions are guaranteed to be uniform, i.e., for any randomly chosen
key $k$, for any hash function $h_i$ and cell $j$,
  $$ \mathbb{P}( h_i(k) = j)  = \frac{1}{m} $$
 If $n$ keys are chosen at random and inserted into the filter, compute the probability that any given cell $T[j]$ is set to FALSE after this.

 __(C)__ Use the results from the previous part to estimate the probability of a false positive, i.e., that some $l$ cells
$i_1, i_2, \ldots, i_l$ are simultaneously set to TRUE.

 



### Answer 2 { Expected Size: 15 lines}

__(A)__ Yes, this can lead to a false positive.  Bloom filters are only probabilistically correct, and how accurate they are depends on the setup of the hash functions, the keys, and the number $m$ of cells holding the true or false values.

Let's say we have two keys, k1 and k2, where k1 is already in the filter and k2 is not.  Let's also say they happen to map to the same indices for all of the hash functions.  In this case the bloom filter would check h1(k2)... hl(k2), and it would find all values in those locations to be true, since the cells at h1(k1)... hl(k1) were already set to true when k1 was inserted.  This would yield a false positive.  The same thing happens if the cells for k2 were set by several different inserted keys rather than by a single one.

False negatives are impossible. Hash functions are deterministic, so if a key has been added, h1(k)....hl(k) map to the same indices every time, and cells are never reset to false once they have been set to true.
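
The behaviour described above can be reproduced with a small sketch. The `BloomFilter` class and the two toy hash functions on integer keys (last digit and tens digit) are hypothetical choices made so the result can be traced by hand. Real filters would use much better hash functions.

```python
class BloomFilter:
    """Minimal illustrative Bloom filter over a Boolean table T of size m."""
    def __init__(self, m, hash_functions):
        self.table = [False] * m
        self.hash_functions = hash_functions

    def insert(self, key):
        # Set T[h_i(key)] = TRUE for every hash function h_i
        for h in self.hash_functions:
            self.table[h(key)] = True

    def might_contain(self, key):
        # TRUE only if every cell T[h_i(key)] is TRUE
        return all(self.table[h(key)] for h in self.hash_functions)


m = 10
# Toy hash functions (assumptions for this example): last digit and tens digit
toy_hashes = [lambda k: k % m, lambda k: (k // 10) % m]

bf = BloomFilter(m, toy_hashes)
bf.insert(12)  # sets T[2] and T[1]
bf.insert(34)  # sets T[4] and T[3]

print(bf.might_contain(12))  # True: inserted keys are always reported present
print(bf.might_contain(13))  # True: false positive, T[3] set by 34 and T[1] by 12
print(bf.might_contain(56))  # False: T[6] was never set
```

Note that the false positive for 13 comes from two different inserted keys, each contributing one of the cells.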

__(B)__ 

The total probability for mapping or not mapping to a given cell $T[j]$, with one hash function applied to one key, is 1, making the total probability equation:

$1 = \frac{1}{m} + probability\ of\ not\ mapping$

$probability\ of\ not\ mapping = p_{not} = 1 - \frac{1}{m}$

Each of the $n$ keys is hashed by all $l$ hash functions, so $n \cdot l$ hash values are computed in total. Assuming these values are independent, the probability that none of them maps to the given cell, and thus that it is still FALSE, is:

$(p_{not})^{nl}$

Thus,

$P = (1 - \frac{1}{m})^{nl} \approx e^{-nl/m}$

__(C)__ For a false positive, the $l$ cells $T[h_1(k)], \ldots, T[h_l(k)]$ of a key $k$ that was never inserted must all be TRUE at the same time.
From part (B), each cell is TRUE with probability $1 - (1 - \frac{1}{m})^{nl}$. Treating the $l$ cells as independent (an approximation), the probabilities multiply.
Thus,

$P_{false\ positive} \approx \left(1 - (1 - \frac{1}{m})^{nl}\right)^l \approx \left(1 - e^{-nl/m}\right)^l$
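
Under the usual assumption that the $n \cdot l$ hash values are independent and uniform over the $m$ cells, the standard estimates are $(1 - \frac{1}{m})^{nl}$ for a cell remaining FALSE and $\left(1 - (1 - \frac{1}{m})^{nl}\right)^l$ for a false positive. The sketch below evaluates these expressions and compares the first with a small Monte Carlo simulation. The function names, the parameter values, and the use of `random.Random` in place of real hash functions are assumptions of this example.

```python
import math
import random


def prob_cell_false(m, n, l):
    """Probability a given cell is still FALSE after inserting n keys with l hashes."""
    return (1 - 1 / m) ** (n * l)


def prob_false_positive(m, n, l):
    """Estimated false-positive rate, treating the l probed cells as independent."""
    return (1 - prob_cell_false(m, n, l)) ** l


def simulate_cell_false(m, n, l, trials=5000, seed=0):
    """Fraction of trials in which cell 0 is never hit by any of the n*l hash values."""
    rng = random.Random(seed)
    stays_false = 0
    for _ in range(trials):
        if all(rng.randrange(m) != 0 for _ in range(n * l)):
            stays_false += 1
    return stays_false / trials


m, n, l = 100, 20, 3  # illustrative parameters
print(prob_cell_false(m, n, l))       # exact expression
print(math.exp(-n * l / m))           # exponential approximation
print(simulate_cell_false(m, n, l))   # simulated estimate
print(prob_false_positive(m, n, l))   # estimated false-positive rate
```

The simulated value should be close to the formula, with the gap shrinking as `trials` grows.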


## Testing your solutions -- Do not edit code beyond this point