# Homework 5
#### Due: End of Day May 2 (Tuesday)
#### Tristan Lakin
#### Collaborators: Andrew, Christopher, ChatGPT

## Problem 1
Assume you are given a list of real numbers. Your goal is to determine if there is a pair of
numbers in this list whose product is exactly 1. What is the fastest deterministic algorithm you
can devise to solve this problem? Can you do better in terms of expected running time if you
are allowed to use a randomized algorithm such as a hash function?

We want to check whether the list contains a pair of numbers that are reciprocals of each other. With a deterministic algorithm we can sort the list and keep two trackers, one at the front and one at the end. We take the reciprocal of the front element and compare it with the back element: if they are equal, we are done. If the back element is smaller than the reciprocal, we move the front tracker up one and repeat. If the back element is bigger than the reciprocal, we move the back tracker down one and check again. This scan only works while both trackers sit on numbers of the same sign, so we run it once on the positive numbers and once on the negative numbers (zero has no reciprocal and can be dropped). This is $O(n \log n)$ because the sorting is $O(n \log n)$ and the searching is $O(n)$.
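The two-tracker scan can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `has_reciprocal_pair_sorted` is made up for this example, and it assumes the inputs support exact arithmetic (for instance `int` or `fractions.Fraction`), since a test for a product of exactly 1 is fragile with rounded floats. It compares the product with 1 instead of comparing against a reciprocal, which is the same test within one sign class.

```python
from fractions import Fraction


def has_reciprocal_pair_sorted(nums):
    """Return True if two entries at different positions multiply to exactly 1."""

    def scan(vals):
        # vals is sorted ascending and strictly positive, so the product
        # grows when lo moves right and shrinks when hi moves left.
        lo, hi = 0, len(vals) - 1
        while lo < hi:
            prod = vals[lo] * vals[hi]
            if prod == 1:
                return True
            if prod < 1:
                lo += 1   # product too small: need a larger front element
            else:
                hi -= 1   # product too large: need a smaller back element
        return False

    # A pair with product 1 must share a sign, and zero can never take part.
    positives = sorted(x for x in nums if x > 0)
    # (-a) * (-b) == a * b, so the negatives can be scanned by magnitude.
    negatives = sorted(-x for x in nums if x < 0)
    return scan(positives) or scan(negatives)


print(has_reciprocal_pair_sorted([5, Fraction(1, 3), -2, 3]))
```

Sorting dominates the cost, so the sketch runs in $O(n \log n)$ time, matching the analysis above.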

This is pretty good, but we can do better if we only need expected running time. If we use a hash set $H$, we can find the result in expected linear time. We iterate through the list, and for each number (skipping 0, which has no reciprocal) we take its reciprocal and check whether it is in $H$; checking membership in a hash set takes expected constant time. If the reciprocal is in $H$, we are done: a pair with product 1 exists. If it is not in $H$, we add the original number to $H$ and continue. In total we expect $O(n)$ time.
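A short sketch of the hash-set pass is below. The name `has_reciprocal_pair_hashed` is hypothetical, and converting every input to `fractions.Fraction` is an assumption made here so that reciprocals are exact; the write-up itself does not say how the reals are represented.

```python
from fractions import Fraction


def has_reciprocal_pair_hashed(nums):
    """Return True if two entries at different positions multiply to exactly 1."""
    seen = set()
    for raw in nums:
        x = Fraction(raw)  # exact value, so 1 / x is exact as well
        # Look for the reciprocal among the numbers seen so far.
        if x != 0 and 1 / x in seen:
            return True
        # Add x only after the lookup, so a lone 1 never pairs with itself.
        seen.add(x)
    return False


print(has_reciprocal_pair_hashed([4, 7, Fraction(1, 7)]))
```

Each element costs one expected constant-time lookup and one insertion, which gives the expected $O(n)$ bound.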

## Problem 2
In class, you have seen an example of universal hash functions. Let p be a prime number larger
than the largest possible key, and let $n ( n \ll p)$ be the hash table size. We randomly pick
an $a$ from $\{1, 2, ..., p − 1\}$ and a $b$ from $\{0, 1, 2, ..., p − 1\}$, then the family of hash functions
$\{h_{a,b}(x) = ((ax + b) \mod p) \mod n\}$ is a universal family of hash functions.
While teaching this class before, one student came to me after class and asked me what if we
only randomly pick an $a \in \{1, 2, ..., p − 1\}$ and drop the $b$, is the resulting class of hash functions,
i.e., $\{h_a(x) = ((ax) \mod p) \mod n\}$ still universal? Please help me determine if this conjecture
is correct or not.


To see what it does, I want to look at a bunch of the possibilities and figure out whether the chance of a collision between two different inputs under $h_a(x)$ is $\le 1/n$.

```python
from tabulate import tabulate
import numpy as np


def mean(ls):
    return sum(ls) / len(ls)


def countAllHashes(p, n):
    # One hash function h_a per multiplier a = 1, ..., p-1.
    aList = range(1, p)
    # Row a holds h_a(x) = ((a*x) % p) % n for the keys x = 0, ..., n-1.
    hList = [[((a * x) % p) % n for x in range(n)] for a in aList]
    # For each h_a, the size of its fullest bucket.
    counts = [max(h.count(i) for i in h) for h in hList]
    print(tabulate(np.array(hList).transpose(),
                   headers=[f"h_(a={i})" for i in aList],
                   showindex=[f"x={i}" for i in range(n)]))
    print(counts)
    print(mean(counts))


countAllHashes(11, 10)
```

```
       h_(a=1)    h_(a=2)    h_(a=3)    h_(a=4)    h_(a=5)    h_(a=6)    h_(a=7)    h_(a=8)    h_(a=9)    h_(a=10)
---  ---------  ---------  ---------  ---------  ---------  ---------  ---------  ---------  ---------  ----------
x=0          0          0          0          0          0          0          0          0          0           0
x=1          1          2          3          4          5          6          7          8          9           0
x=2          2          4          6          8          0          1          3          5          7           9
x=3          3          6          9          1          4          7          0          2          5           8
x=4          4          8          1          5          9          2          6          0          3           7
x=5          5          0          4          9          3          8          2          7          1           6
x=6          6          1          7          2          8          3          9
```

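Above we see that the fullest bucket holds 1.9 keys on average over the choices of $a$. That hints at extra collisions, but it does not settle the question by itself, because universality is a statement about pairs: for every two different keys $x \ne y$ we need $\Pr_a[h_a(x) = h_a(y)] \le 1/n$. In fact, for $p = 11$ and $n = 10$ every pair of keys still meets that bound. A smaller case breaks it: with $p = 7$ and $n = 2$, the keys $1$ and $3$ collide for $a \in \{1, 2, 5, 6\}$, which is a probability of $4/6 > 1/2$. So the conjecture is false, and dropping $b$ does not give a universal family.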
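The pairwise definition can be checked directly by brute force. The sketch below is only an illustration: `collision_probability` and `worst_pair` are hypothetical helper names, and the parameter choices are assumptions picked to be small enough to enumerate. The case $p = 101$, $n = 2$ is included because it respects $n \ll p$.

```python
from fractions import Fraction
from itertools import combinations


def collision_probability(p, n, x, y):
    """Exact Pr over a in {1, ..., p-1} that h_a(x) == h_a(y)."""
    hits = sum(1 for a in range(1, p) if (a * x % p) % n == (a * y % p) % n)
    return Fraction(hits, p - 1)


def worst_pair(p, n):
    """Return (probability, x, y) for a key pair with the highest collision rate."""
    best = (Fraction(0), None, None)
    for x, y in combinations(range(p), 2):
        prob = collision_probability(p, n, x, y)
        if prob > best[0]:
            best = (prob, x, y)
    return best


# Universality would need every pair to collide with probability <= 1/n.
print(collision_probability(7, 2, 1, 3))    # compare against 1/2
print(collision_probability(101, 2, 1, 3))  # compare against 1/2
print(worst_pair(11, 10)[0])                # compare against 1/10
```

Using `Fraction` keeps the comparison against $1/n$ exact, so a violation cannot be an artifact of floating-point rounding.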

## Problem 3
Given a set of hierarchical intervals on the real axis (i.e., for any two intervals $I_1$ and $I_2$,
either they are disjoint, or one of them is included in the other one.) The goal is to find a subset
of disjoint intervals of maximum cardinality. Determine if the problem is NP-complete or not.
If yes prove it. If not, design an efficient polynomial time algorithm for it.


What we can immediately notice is the tree-like structure of the problem: each interval is a node, and the smaller intervals it contains form the subtree below it.

Say we have the following set of intervals, sorted by lower bound:
```
------------------
       ------
                    -------------------
                     ------
                             ---------
                              --
                                  ---
```

We can build a tree by making each interval a node whose parent is the smallest interval that contains it, so every interval directly contained in a node becomes its child. We take the root of this tree to be the entire number line. Here is that visually with the set above.
```
-∞ <------------------------------------------------------------------> ∞
                          /                 \
               ------------------  -------------------                            
                         |             /        \
                      ------        ------  ---------                            
                                              /  \
                                             --  ---                                                                 
```

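The largest disjoint set is then the set of leaves of this tree. The leaves are pairwise disjoint, since a leaf contains no other interval. No disjoint set can be larger: every node has at least one leaf below it, and two disjoint intervals have disjoint subtrees, so any disjoint set can be swapped for an equally large set of leaves. Building the tree takes $O(n \log n)$ time (sort by lower bound, then attach each interval to the innermost interval that contains it), and counting the leaves takes linear time, so the problem has an efficient polynomial-time algorithm.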
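As a rough sketch of the leaf-finding step, the code below skips building the tree explicitly and uses a shortcut instead: after sorting by lower bound (longer interval first on ties), an interval has a child exactly when the next interval in the order lies inside it. The function name `max_disjoint_hierarchical` and the `(lo, hi)` tuple representation are assumptions for illustration, and the input is assumed to be hierarchical with intervals of positive length.

```python
def max_disjoint_hierarchical(intervals):
    """Return the leaves of a hierarchical (laminar) family of (lo, hi) intervals."""
    # Parents come before their children: ascending lo, then descending hi.
    order = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    leaves = []
    for i, (lo, hi) in enumerate(order):
        if i + 1 < len(order):
            next_lo, next_hi = order[i + 1]
            # The next interval is nested inside this one, so this is not a leaf.
            if lo <= next_lo and next_hi <= hi:
                continue
        leaves.append((lo, hi))
    return leaves


# Column positions of the seven intervals drawn in the example above.
example = [(0, 17), (7, 12), (20, 38), (21, 26), (29, 37), (30, 31), (34, 36)]
print(max_disjoint_hierarchical(example))
# [(7, 12), (21, 26), (30, 31), (34, 36)]
```

The sort is the $O(n \log n)$ part and the single pass afterwards is linear, in line with the running time stated above.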

## Problem 4
For a given graph $G = (V, E)$, a vertex cover of G is a subset of vertices $C ⊆ V$ such that each
edge of $G$ has at least one endpoint in $C$. The goal of the vertex cover problem is to find the
optimal vertex cover $C^∗$ with the minimum number of vertices.
Consider the following randomized algorithm for vertex cover.
- Step 1: Start with $C = \phi$.
- Step 2: Pick an edge $e$ uniformly at random from the edges that are not covered by C (i.e., if $e$ has
endpoints $u$ and $v$, then $\{u, v\} \cap C = \phi$), and add a random endpoint of $e$ to $C$.
- Step 3: If $C$ is a vertex cover, terminate and output $C$; else go to Step 2.

Answer the following questions:
- (a) Consider the very first iteration of the algorithm. What is the probability that a vertex
from the smallest vertex cover $C^∗$ is added to $C$? (Hint: for each edge $e ∈ E$, at least one
endpoint of $e$ must be in $C^∗$.)
- (b) Consider the second iteration of the algorithm. What is the probability that a vertex from
the smallest vertex cover $C^∗$ is added to $C$? (Hint: you should discuss the two scenarios of
whether a vertex from $C^∗$ is added to $C$ in the first iteration or not.)
- (c) Let $k$ be the number of vertices in the smallest vertex cover $C^∗$. Show that the expectation
of the number of vertices of $C$ is $2k$.


(a) Fix a minimum vertex cover $C^*$. The first edge we pick has two endpoints, and at least one of them is in $C^*$. Either both endpoints are in $C^*$, so the vertex we add is in $C^*$ with probability 1, or only one is, so we add it with probability 1/2. That means the probability of adding a vertex of $C^*$ in the first iteration is $\Pr(v_1 \in C^*) \ge 0.5$.

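(b) For the second iteration, and each one after that, it is going to be the same, whether or not the first iteration added a vertex of $C^*$. We always pick an edge $e$ that is not yet covered by $C$, so neither endpoint has been added before, and since $C^*$ is a vertex cover at least one of the two endpoints is in $C^*$. So we again get $\Pr \ge 0.5$.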

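(c) Each iteration adds a new vertex, and by (a) and (b) that vertex is in $C^*$ with probability at least 50%. That means for every 1 correct vertex we pull we expect at most 1 wrong vertex, just like waiting for heads on a coin flip takes 2 flips in expectation. Once we have pulled all $k$ vertices of $C^*$, every edge is covered and the algorithm stops, so we need at most $k$ correct pulls and expect at most $k$ wrong ones. That totals to an expected size of at most $2k$, with exactly $2k$ in the worst case where every pull is a 50% chance.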
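To see the bound on a concrete graph, here is a small simulation sketch of the algorithm from the problem statement. The function name `random_vertex_cover`, the edge-list representation, and the star graph used as a test case are assumptions for illustration; the star has a minimum cover of size $k = 1$ (its center).

```python
import random


def random_vertex_cover(edges, rng=None):
    """Run the randomized algorithm on a list of (u, v) edges and return C."""
    if rng is None:
        rng = random.Random()
    cover = set()
    uncovered = list(edges)
    while uncovered:
        # Step 2: pick an uncovered edge, then one of its endpoints, at random.
        u, v = rng.choice(uncovered)
        cover.add(rng.choice((u, v)))
        # Keep only the edges that still have no endpoint in the cover.
        uncovered = [(a, b) for a, b in uncovered
                     if a not in cover and b not in cover]
    return cover


# Star graph: center 0 joined to leaves 1..5, so the optimum cover is {0}.
star = [(0, leaf) for leaf in range(1, 6)]
rng = random.Random(0)
sizes = [len(random_vertex_cover(star, rng)) for _ in range(2000)]
print(sum(sizes) / len(sizes))  # should stay at or below roughly 2k = 2
```

Rebuilding the list of uncovered edges on every iteration is simple but not the fastest option; it is used here only to keep the sketch short.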

## Problem 5 
The $k$-leaf problem is as follows: Given a graph $G(V, E)$, and an integer $k \ge 2$, determine if the
graph has a spanning tree with exactly $k$ leaves. Determine if the $k$-leaf spanning tree problem
is NP-complete. If yes, prove it. If no, develop an efficient polynomial time algorithm for it.

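The Hamiltonian path problem asks for a spanning path, which is exactly a spanning tree with 2 leaves. This means we can reduce the Hamiltonian path problem to the $k$-leaf problem by setting $k=2$: $G$ has a Hamiltonian path if and only if it has a spanning tree with exactly 2 leaves. Since the Hamiltonian path problem is NP-hard, the $k$-leaf problem is NP-hard too. The $k$-leaf problem is also in NP, since given a candidate tree we can check in linear time that it is a spanning tree of $G$ with exactly $k$ leaves. So the $k$-leaf spanning tree problem is NP-complete.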
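The membership-in-NP step can be made concrete with a certificate checker. The sketch below is one possible verifier under stated assumptions: vertices are numbered `0..n-1` with `n >= 2`, edges are given as pairs, and the name `is_k_leaf_spanning_tree` is made up for this example.

```python
def is_k_leaf_spanning_tree(n, graph_edges, tree_edges, k):
    """Check that tree_edges is a spanning tree of the graph with exactly k leaves."""
    graph = {frozenset(e) for e in graph_edges}
    # A spanning tree has n - 1 edges, all taken from the graph.
    if len(tree_edges) != n - 1:
        return False
    if any(frozenset(e) not in graph for e in tree_edges):
        return False

    adj = {v: [] for v in range(n)}
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)

    # With n - 1 edges, being connected is equivalent to being a tree.
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    if len(seen) != n:
        return False

    # Leaves are the vertices of degree 1 in the tree.
    return sum(1 for v in adj if len(adj[v]) == 1) == k


graph_edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
print(is_k_leaf_spanning_tree(4, graph_edges, [(0, 1), (1, 2), (2, 3)], 2))
```

With $k = 2$ the accepted certificates are exactly the Hamiltonian paths, which is the correspondence the reduction relies on.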