# CSPB 3104 Assignment 5:

***
# Instructions

This assignment is to be completed as a python3 notebook.  When you upload, please upload the completed notebook (ipynb file).

The questions below ask you either to write code or to
write answers in the form of markdown.

A Markdown syntax guide is available [here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).

In markdown you can typeset formulae with LaTeX.
This way you can write readable answers containing formulae such as this:

The algorithm runs in time $\Theta\left(n^{2.1\log_2(\log_2( n \log^*(n)))}\right)$, 
where $\log^*(n)$ is the _iterated logarithm_.

__Double click anywhere on this box to find out how your instructor typeset it. Press Shift+Enter to go back.__

***

## Question 1: AVL Trees.

 AVL trees are another kind of self-balancing binary search tree (BST), sometimes used in place of red-black trees.
 The key property of an AVL tree is that 

 *for all nodes $n$ in the tree*, $\left|\ \text{height}(n.\text{left}) - \text{height}(n.\text{right}) \right| \leq 1$

 In words, the height of the left subtree and right subtree at any node can differ by at most $1$.
 
 Let $h$ be the height of an AVL tree and $n$ be the number of nodes in the tree.  The goal of this problem is to prove a relationship between $h$ and $n$.  We've broken this into two steps:

 (A) Prove that $n \geq F_h$, where $F_h$ is the $h^{th}$ Fibonacci number. ($F_0 = 1, F_1 = 1, F_2 = 2, \ldots $)
  (*Hint* Use strong induction with two base cases. First establish the property for all AVL trees of heights 0 and 1. Next, assuming
  it holds for trees of height $\leq h$, prove it for trees of height $h+1$ ).
  
  Next, it is a fact that for any $k \geq 30$, $F_k \geq 1.5^k$.
  
 (B) Using the above fact and the result from part A,  show that $h = \Theta(\log(n))$.

 (C) We will briefly examine inserting a node into an AVL tree through an example. On the left, we have shown an AVL tree and to the right we show the result after a BST insert has happened.

<center>
<img src="avl-tree-insert-problem-img.jpeg" alt="AVL Tree Before and After Insertion" title="AVL Tree Insertion" width="500" height="300"/>
</center>

Devise a sequence of left and right rotations that will restore the AVL tree property. For each rotation, state the node at which we rotate and the direction of the rotation. If you wish, you may insert images showing the trees before and after each rotation using markdown (see how we inserted the image above), but do not forget to upload the images with your submission.


### Answer 1 (Expected length: 15 lines)

### Part A

**Base Cases:** Let $n(h)$ denote the number of nodes in an AVL tree of height $h$. We claim that $n(h) \geq F_{h}$.

\begin{align}
n(0) & = 1 \geq 1 = F_{0} & \text{(Base case $h = 0$: a single node)} \\
n(1) & \geq 2 \geq 1 = F_{1} & \text{(Base case $h = 1$: two or three nodes)}
\end{align}

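**Induction Hypothesis (strong):** For some integer $k \geq 1$, every AVL tree of height $j$ with $0 \leq j \leq k$ satisfies

$$
n(j) \geq F_{j}.
$$

**Inductive Step:** We show that every AVL tree of height $k + 1$ satisfies

$$
n(k + 1) \geq F_{k + 1}.
$$

**Induction Proof:** An AVL tree of height $k + 1$ consists of a root and two subtrees that are themselves AVL trees. At least one subtree has height exactly $k$; let $n_{1}$ be its number of nodes. By the AVL property the other subtree has height $k$ or $k - 1$; let $n_{2}$ be its number of nodes. Since $F_{k} \geq F_{k - 1}$, the hypothesis gives $n_{2} \geq F_{k - 1}$ in either case. Using the definition of the Fibonacci sequence, $F_{k + 1} = F_{k} + F_{k - 1}$:

\begin{align}
n(k+1) & = n_{1} + n_{2} + 1 & \text{(Root plus two subtrees)} \\
& \geq F_{k} + F_{k - 1} + 1 & \text{(Induction hypothesis)} \\
& = F_{k + 1} + 1 & \text{(Definition of the Fibonacci sequence)} \\
& \geq F_{k + 1} & \text{(Dropping the $+1$)}
\end{align}

Thus, by strong induction, we can conclude that $\color{blue}{n \geq F_{h}}$.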
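
As a numerical sanity check (not a substitute for the proof), the sketch below computes the fewest nodes an AVL tree of height $h$ can have, using the recurrence $N(h) = 1 + N(h-1) + N(h-2)$, and compares it with $F_h$. The function names `fib` and `min_avl_nodes` are illustrative, and the Fibonacci indexing follows the problem statement ($F_0 = F_1 = 1$).

```python
def fib(h):
    """F_h with the indexing used in the problem: F_0 = F_1 = 1."""
    a, b = 1, 1
    for _ in range(h):
        a, b = b, a + b
    return a


def min_avl_nodes(h):
    """Fewest nodes in an AVL tree of height h (a single node has height 0)."""
    prev2, prev1 = 0, 1  # N(-1) = 0 (empty tree), N(0) = 1
    for _ in range(h):
        # Sparsest tree: root + sparsest subtree of height h-1 + one of height h-2.
        prev2, prev1 = prev1, 1 + prev1 + prev2
    return prev1


for h in range(10):
    assert min_avl_nodes(h) >= fib(h)
    print(h, min_avl_nodes(h), fib(h))
```

Because every AVL tree of height $h$ has at least `min_avl_nodes(h)` nodes, checking the sparsest tree is enough for each height tested.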

### Part B

We know from Part A that

$$
n \geq F_{h}.
$$

**Upper bound.** When $h \geq 30$, $F_{h} \geq 1.5^{h}$, so $n \geq 1.5^{h}$. Since $\log_{b}{(b^{x})} = x$ and the logarithm is increasing, taking $\log_{1.5}$ of both sides gives

\begin{align}
\log_{1.5}{(n)} & \geq \log_{1.5}{(1.5^{h})} \\
h & \leq \log_{1.5}{(n)} = \frac{\log_{2}{(n)}}{\log_{2}{(1.5)}}
\end{align}

so $h = \mathcal{O}(\log{(n)})$.

**Lower bound.** Any binary tree of height $h$ has at most $2^{h + 1} - 1$ nodes, so $n \leq 2^{h + 1} - 1$ and therefore

$$
h \geq \log_{2}{(n + 1)} - 1,
$$

so $h = \Omega(\log{(n)})$. Combining the two bounds we find

$$
\color{blue}{h = \Theta(\log{(n)}).}
$$
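
The relationship between $h$ and $\log n$ can be checked numerically at the two extremes for a given height: the sparsest AVL tree and the fullest binary tree. This is a minimal sketch; `min_avl_nodes` is an illustrative helper, and the small tolerance only guards against floating-point rounding.

```python
import math


def min_avl_nodes(h):
    """Fewest nodes in an AVL tree of height h (a single node has height 0)."""
    prev2, prev1 = 0, 1  # N(-1) = 0 (empty tree), N(0) = 1
    for _ in range(h):
        prev2, prev1 = prev1, 1 + prev1 + prev2
    return prev1


for h in range(30, 41):
    n_min = min_avl_nodes(h)   # sparsest AVL tree of height h
    n_max = 2 ** (h + 1) - 1   # fullest binary tree of height h
    upper = math.log(n_min, 1.5)      # h <= log_1.5(n)
    lower = math.log2(n_max + 1) - 1  # h >= log_2(n + 1) - 1
    assert lower - 1e-9 <= h <= upper
    print(h, round(lower, 2), round(upper, 2))
```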

### Part C

An AVL tree requires that, at every node including the root, the heights of the left and right subtrees differ by at most 1. To restore the AVL property after the insertion of 24, we perform a single right rotation at node 29. Node 26 then becomes the right child of 22, with 24 as its left child and 29 as its right child. The image below shows the result.

<center>
<img src="1C.png" alt="Solution To 1C" title="Solution To 1C" width="500" height="300"/>
</center>

After this rotation, the left subtree of the root has height 2 and the right subtree has height 3, so the balance factor of the root is $-1$. The balance factors of all nodes in the tree are:

- 12: -1
- 16: 0
- 18: 1
- 19: 0
- 21: 1
- 22: 0
- 24: 0
- 26: 0
- 29: 0

Every balance factor has absolute value at most 1, so the AVL property has been restored.
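
The right rotation at 29 can be sketched in code. This is a minimal illustration restricted to the affected subtree; its shape before the rotation (29 with left child 26, which has left child 24) is inferred from the description above rather than read from the figure, and the `Node` class, `height`, and `rotate_right` are hypothetical helpers. The balance factor is taken as left height minus right height.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def height(node):
    """Height in edges; an empty subtree has height -1."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def balance(node):
    """Balance factor: height of left subtree minus height of right subtree."""
    return height(node.left) - height(node.right)


def rotate_right(y):
    """Rotate right at y and return the new subtree root (y's old left child)."""
    x = y.left
    y.left = x.right  # x's right subtree moves under y
    x.right = y       # y becomes x's right child
    return x


# Assumed subtree before the rotation: 29 -> left 26 -> left 24.
subtree = Node(29, left=Node(26, left=Node(24)))
assert balance(subtree) == 2  # violates the AVL property at 29

subtree = rotate_right(subtree)
assert (subtree.key, subtree.left.key, subtree.right.key) == (26, 24, 29)
assert balance(subtree) == 0
```

In the full tree, the parent of the rotated subtree (node 22 in the answer) must also have its right child pointer updated to the node returned by `rotate_right`.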

***
## Question 2: Bloom Filters


 A Bloom filter is a fast set data structure that maintains a set $S$ of keys. One can insert keys into the set and test whether a given key $k$ belongs to the set. It may be used in applications where the keys are "complicated" objects, such as TCP packets or images, that are expensive to compare with each other. 
 
 The data structure is an array $T$ of Booleans of size $m$, together with $l$ different hash functions $h_1, \ldots, h_l$.
 Initially, `T[i] = FALSE` for all `i`.

 If a key $k$ is to be inserted,
 we first compute $i_1 = h_1(k), \ldots, i_l = h_l(k)$ and then we set $T[i_1] = \cdots = T[i_l] = \text{TRUE}$.

 __Note:  A bloom filter is *not* a hash table, but they both use hash functions in interesting ways.__

 __(A)__ Suppose we wish to find out if an element $k$ is a member of the set by checking if
$T[h_1(k)], \ldots, T[h_l(k)]$ are all true. Explain whether this can lead to a *false positive*, i.e.,
the approach wrongly concludes that $k$ belongs to the set when it was never inserted, or a *false negative*,
i.e., the approach wrongly concludes that $k$ does not belong to the set when it does.

 __(B)__ Suppose our hash functions are guaranteed to be uniform. That is, for any randomly chosen
key $k$, for any hash function $h_i$ and cell $j$, 
  $$ \mathbb{P}( h_i(k) = j)  = \frac{1}{m} $$
 If $n$ keys are chosen at random and inserted into the filter, compute the probability that any given cell $T[j]$ is still set to FALSE after this.

 __(C)__ Use the results from the previous part to estimate the probability of a false positive, i.e., that some $l$ cells
$i_1, i_2, \ldots, i_l$ are simultaneously set to TRUE.


### Answer 2 (Expected length: 15 lines)

### Part A

Bloom filters can produce **false positives**. Each inserted key $k$ is passed through the $l$ hash functions, and every cell $T[h_1(k)], \ldots, T[h_l(k)]$ is set to TRUE.

A false positive occurs when a key $k'$ that was never inserted hashes only to cells that other keys have already set to TRUE. The membership test then finds $T[h_1(k')], \ldots, T[h_l(k')]$ all TRUE and wrongly reports that $k'$ is in the set.

On the other hand, Bloom filters can never produce a false negative. Inserting a key sets all $l$ of its cells to TRUE, and no operation ever resets a cell to FALSE, so the membership test for an inserted key always finds all $l$ of its cells TRUE.
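
The behaviour described in this part can be illustrated with a small sketch of a Bloom filter. The class name `BloomFilter`, the method names, and the salted SHA-256 hash family are all assumptions made for illustration; the assignment does not specify particular hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, l):
        self.m = m            # number of cells in T
        self.l = l            # number of hash functions
        self.T = [False] * m  # initially T[i] = FALSE for all i

    def _indices(self, key):
        # Hypothetical hash family: h_i(key) = SHA-256("i:key") mod m.
        for i in range(self.l):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, key):
        for j in self._indices(key):
            self.T[j] = True  # cells are only ever set, never cleared

    def might_contain(self, key):
        return all(self.T[j] for j in self._indices(key))


bf = BloomFilter(m=64, l=3)
inserted = [f"key-{i}" for i in range(20)]
for k in inserted:
    bf.insert(k)

# No false negatives: every inserted key is reported as present.
assert all(bf.might_contain(k) for k in inserted)

# False positives are possible: count never-inserted keys reported as present.
others = [f"other-{i}" for i in range(1000)]
false_positives = sum(bf.might_contain(k) for k in others)
print(false_positives, "false positives out of", len(others))
```

The number of false positives printed depends on the hash family and the parameters, so it is reported rather than asserted.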

### Part B

If 

$$
P_{T} = \frac{1}{m}
$$

is the probability that a single hash evaluation lands on cell $j$, then 

$$
P_{F} = 1 - \frac{1}{m}
$$

is the probability that a single hash evaluation misses cell $j$, leaving it FALSE. But this doesn't tell the full story. Since there are $l$ different hash functions and $n$ different keys, a total of $n \cdot l$ hash evaluations take place. Assuming these evaluations are independent, cell $j$ stays FALSE only if every one of them misses it, so we raise the previous probability to the power $n \cdot l$. Namely

$$
\color{blue}{P_{F} = \left(1 - \frac{1}{m}\right)^{nl}.}
$$
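
The closed form can be compared against a small Monte Carlo simulation. This is a sketch under the same uniformity and independence assumptions as the derivation; the function names, the parameter values, and the fixed seed are illustrative choices.

```python
import random


def prob_cell_false(m, n, l):
    """Closed form: probability that a fixed cell is still FALSE."""
    return (1 - 1 / m) ** (n * l)


def simulate_cell_false(m, n, l, trials=5000, seed=0):
    """Estimate the same probability by simulating n*l uniform hash values."""
    rng = random.Random(seed)
    still_false = 0
    for _ in range(trials):
        # Watch cell 0: it is hit if any of the n*l evaluations equals 0.
        hit = any(rng.randrange(m) == 0 for _ in range(n * l))
        if not hit:
            still_false += 1
    return still_false / trials


m, n, l = 50, 10, 3
print("closed form:", prob_cell_false(m, n, l))
print("simulation: ", simulate_cell_false(m, n, l))
```

The two values should agree up to sampling noise, which shrinks as `trials` grows.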

### Part C

Previously, we found that the probability that a cell $j$ is still FALSE after all of the hashing operations is

$$
P_{F} = \left(1 - \frac{1}{m}\right)^{nl}.
$$

The probability that a cell $j$ has been set to TRUE after the hashing operations is the complement of this expression. Namely

$$
P_{T} = 1 - \left(1 - \frac{1}{m}\right)^{nl}.
$$

But this is **not** yet the probability of a false positive. A false positive for a key that was never inserted requires all $l$ of the cells it hashes to be TRUE. Treating these $l$ cells as independent (an approximation), we raise $P_{T}$ to the power $l$. Namely

$$
\color{blue}{P_{FP} = \left(1 - \left(1 - \frac{1}{m}\right)^{nl}\right)^{l}}.
$$

This expression accounts for all $l$ hash functions in the Bloom filter and gives an estimate of the probability that a false positive will occur.
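
To get a feel for the estimate, the sketch below evaluates it for a few values of $l$ and compares it with the common approximation $\left(1 - e^{-nl/m}\right)^{l}$, which follows from $(1 - 1/m)^{nl} \approx e^{-nl/m}$ for large $m$. The function names and the parameter values are illustrative assumptions.

```python
import math


def false_positive_estimate(m, n, l):
    """(1 - (1 - 1/m)^(n*l))^l, the estimate from Part C."""
    return (1 - (1 - 1 / m) ** (n * l)) ** l


def false_positive_approx(m, n, l):
    """Approximation using (1 - 1/m)^(n*l) ~ exp(-n*l/m) for large m."""
    return (1 - math.exp(-n * l / m)) ** l


m, n = 1000, 100
for l in (1, 3, 5, 7, 10):
    print(l, false_positive_estimate(m, n, l), false_positive_approx(m, n, l))
```

Adding hash functions lowers the estimate at first, because more cells must all be TRUE, but each extra hash function also fills the array faster, so the estimate does not keep decreasing as $l$ grows.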

## Testing your solutions -- Do not edit code beyond this point