# Motivating Applications

## 1. Internet Routing

Let the vertices of the graph be

1. Hosts / Servers
2. Routers

and let there be a directed edge between two vertices wherever there is a physical / wireless connection between them.

A fundamental question is how to find the shortest (cost-minimising) path between two vertices.

However, Dijkstra's algorithm is not practical here, as the internet graph is too large and too dynamic to fit in memory.

This calls for an algorithm in which each vertex only needs local information.

One such algorithm is the [Bellman-Ford](https://en.wikipedia.org/wiki/Bellman%E2%80%93Ford_algorithm) algorithm.
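
As a rough illustration, here is a minimal, centralised sketch of Bellman-Ford. The function name `bellman_ford` and the edge-list representation `(u, v, cost)` are assumptions made for this example. Each update only looks at a single edge and the current estimates at its two endpoints, which is the "local" flavour referred to above; a real routing protocol would distribute these updates across routers rather than loop over a global edge list.

```python
import math


def bellman_ford(num_vertices, edges, source):
    """Single-source shortest path costs.

    edges is a list of directed (u, v, cost) triples over vertices
    0 .. num_vertices - 1 (a representation assumed for this sketch).
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0

    # At most n - 1 rounds are needed when there is no negative cycle.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, cost in edges:
            # Local update: only dist[u], dist[v] and this edge are used.
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                changed = True
        if not changed:
            break  # estimates have stabilised early

    # If any edge can still be improved, a negative-cost cycle is reachable.
    for u, v, cost in edges:
        if dist[u] + cost < dist[v]:
            raise ValueError("negative-cost cycle reachable from source")
    return dist
```

For example, with edges `[(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)]` and source `0`, the cheapest route to vertex `1` goes via vertex `2`.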

## 2. Sequence Alignment

We are given two strings $s_1$, $s_2$ over a set of characters $A$. For example,
$$
A = \{\text{A}, \text{C}, \text{G}, \text{T}\} \\[10pt]
s_1 = \text{AGGCT} \\
s_2 =  \text{AGGGCA} \\
$$

Compute how "similar" the two strings are.

We might want to do this to

1. Extrapolate the function of genome substrings
2. Infer evolutionary proximity

To be precise about "similarity", let us first define the distance between two strings $d(s_1, s_2, \alpha)$ to be,
$$
d(s_1, s_2, \alpha) = \sum{\text{penalties}(\alpha)} \\[20pt]
$$
for a given alignment $\alpha$ where
$$
\text{penalties}(\alpha) = 
\begin{cases}
    \text{gap} \geq 0 & \text{for each gap in} \; \alpha \\
    \text{mismatch} \geq 0 & \text{for each mismatch in} \; \alpha \\
    \quad \vdots
\end{cases}
$$

Then the "similarity" of the two strings is the minimum distance over all possible alignments of the strings.
$$
\text{similarity} = \min_{\alpha}{d(s_1, s_2, \alpha)}
$$

This is known as the [Needleman-Wunsch score](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm).
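
The minimisation over alignments can be sketched with the standard dynamic programme. This is only an illustration: the function name `alignment_score` and the default penalties `gap=1`, `mismatch=1` are assumptions, not values fixed by these notes.

```python
def alignment_score(s1, s2, gap=1, mismatch=1):
    """Minimum total penalty over all alignments of s1 and s2."""
    m, n = len(s1), len(s2)
    # cost[i][j] = minimum penalty to align s1[:i] with s2[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap  # s1[:i] aligned against gaps only
    for j in range(1, n + 1):
        cost[0][j] = j * gap  # s2[:j] aligned against gaps only

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if s1[i - 1] == s2[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution,  # match or mismatch
                cost[i - 1][j] + gap,               # gap inserted in s2
                cost[i][j - 1] + gap,               # gap inserted in s1
            )
    return cost[m][n]
```

With these assumed penalties, `alignment_score("AGGCT", "AGGGCA")` evaluates to 2, corresponding to the alignment `AGG-CT` / `AGGGCA` with one gap and one mismatch.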

# Greedy Algorithms

A greedy algorithm iteratively makes "myopic" decisions in the hope that things work out in the end.

The running time of a greedy algorithm is generally easy to compute. However, proofs of correctness are not always obvious. There are two general methods,

1. Induction ("greedy stays ahead")
2. "Exchange argument" (where we show via transformations that an optimal solution is identical to the greedy solution)

## The Caching Problem



We have two memory types 
- small fast memory
- big slow memory

The goal is to process a sequence of "page requests".

On a cache miss / "page fault", we need to bring the requested page from slow memory into the cache. To do so we will have to evict something from the cache to make room. Which page should be evicted?

The optimal algorithm is known as the "furthest-in-future" algorithm (Belady, 1960s): evict the page whose next request lies furthest in the future. However, this algorithm assumes that we know the future, so it is unimplementable. Instead, it serves as a jumping-off point for more practical algorithms.

- Least Recently Used

    Assumes that data that was requested recently will be requested again soon, and that data that was last requested a long time ago will be requested infrequently.
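
To make the two eviction rules concrete, the following sketch counts page faults for each on a fixed request sequence. The function names and the representation of requests as a Python list are assumptions for illustration. Furthest-in-future is only computable here because the whole sequence is known in advance.

```python
from collections import OrderedDict


def lru_faults(requests, capacity):
    """Count page faults when evicting the least recently used page."""
    cache = OrderedDict()  # ordered from least to most recently used
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict least recently used
        cache[page] = True
    return faults


def furthest_in_future_faults(requests, capacity):
    """Count page faults when evicting the page requested furthest ahead."""
    cache = set()
    faults = 0
    for t, page in enumerate(requests):
        if page in cache:
            continue
        faults += 1
        if len(cache) >= capacity:

            def next_use(p):
                # Index of the next request for p, or infinity if none.
                for k in range(t + 1, len(requests)):
                    if requests[k] == p:
                        return k
                return float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return faults
```

On the request sequence `[1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]` with a cache of size 3, LRU incurs 10 faults while furthest-in-future incurs 7.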

Proof: TODO

## A Scheduling Problem



We have a single processing thread, with many jobs to process. In what order should we sequence the jobs?

Each job $j$ has two parameters
1. a weight $w_j$
2. a length $l_j$

Define the completion time $c_j$ of a job $j$, with jobs numbered in the order they are processed, as
$$
c_j = \sum_{i \leq j}{l_i}
$$
That is, the time elapsed until the completion of job $j$.

One way to measure the quality of a sequence of jobs is by computing the weighted sum $s$ of completion times
$$
s = \sum_{j}{w_j \cdot c_j}
$$

We seek an algorithm that minimises $s$.

If $w_j$ is constant, then we want to schedule smaller $l_j$ first.

Likewise if $l_j$ is constant, then we want to schedule larger $w_j$ first.

In the case that $w_i > w_j$ and $l_i > l_j$ or vice versa, we will need to devise a score to order the jobs.

This score should increase with $w$ and decrease with $l$. Two natural candidates are

- $w_j - l_j$
- $\frac{w_j}{l_j}$

It can be shown via a counterexample that the difference score is not always correct. The following is a proof of the correctness of the ratio score.
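
Before the proof, one such counterexample can be checked numerically. The two jobs below, given as `(weight, length)` pairs, are values chosen for this illustration, and the helper names are assumptions.

```python
from itertools import permutations


def weighted_completion(jobs):
    """Weighted sum of completion times for jobs run in the given order."""
    elapsed, total = 0, 0
    for weight, length in jobs:
        elapsed += length          # completion time of this job
        total += weight * elapsed
    return total


def schedule_by(jobs, score):
    """Order jobs by decreasing score."""
    return sorted(jobs, key=score, reverse=True)


jobs = [(3, 5), (1, 2)]  # (weight, length), illustrative values

by_difference = schedule_by(jobs, lambda job: job[0] - job[1])
by_ratio = schedule_by(jobs, lambda job: job[0] / job[1])
optimum = min(weighted_completion(order) for order in permutations(jobs))

print(weighted_completion(by_difference))  # 23
print(weighted_completion(by_ratio))       # 22
print(optimum)                             # 22
```

The difference score schedules the short job first and ends up with a total of 23, whereas the ratio score matches the brute-force optimum of 22.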

Proof by "Exchange Argument"

Let
$$
\sigma = \text{greedy schedule} \\
\sigma^* = \text{optimal schedule}
$$

We will proceed by contradiction: suppose $\sigma^* \neq \sigma$, and show that we can then construct a schedule with a strictly lower score than $\sigma^*$.

Assumptions:
1. all $\frac{w_j}{l_j}$ are distinct
2. WLOG that $\frac{w_1}{l_1} > \frac{w_2}{l_2} > \cdots > \frac{w_n}{l_n}$ via a renaming of jobs

Therefore
$$
\sigma = 1, 2, \cdots, n
$$

Since $\sigma^* \neq \sigma$, there must be two consecutive jobs in $\sigma^*$, with job $i$ scheduled immediately before job $j$, such that $i > j$.

Then suppose we exchange the jobs $i, j$ in $\sigma^*$, leaving all other jobs unchanged. This produces a new schedule $\sigma^{**}$.

Using assumption 2, we know that since $i > j$
$$
\frac{w_i}{l_i} < \frac{w_j}{l_j} \\
\implies w_il_j < w_jl_i
$$

Let $s(\sigma)$ return the score of a schedule $\sigma$. Then,
$$
\begin{aligned}
s(\sigma^{**}) = s(\sigma^*) & - w_i(\alpha + l_i) - w_j(\alpha + l_i + l_j) \\
& + w_j(\alpha + l_j) + w_i(\alpha + l_j + l_i)
\end{aligned} \\[10pt]
\implies s(\sigma^{**}) = s(\sigma^*) + w_il_j - w_jl_i
$$
where $\alpha$ is the sum of the lengths of the jobs scheduled before $i$ in $\sigma^*$.

Further, using the inequality derived from assumption 2,
$$
w_il_j < w_jl_i \\[10pt]
w_il_j - w_jl_i < 0 \\[10pt]
\implies s(\sigma^{**}) < s(\sigma^*)
$$

This is a contradiction, as $\sigma^*$ was defined to be optimal. Therefore $\sigma^* = \sigma$, that is, no schedule has a lower score than the one produced by the greedy algorithm.

We can also relax assumption 1 that the ratios are distinct. If two consecutive jobs in $\sigma^*$ have equal ratios,
$$
\frac{w_i}{l_i} = \frac{w_j}{l_j} \\[10pt]
w_il_j - w_jl_i = 0 \\[10pt]
\implies s(\sigma^{**}) = s(\sigma^*)
$$

In general, exchanging an out-of-order consecutive pair never increases the score. Repeating such exchanges eventually turns $\sigma^*$ into $\sigma$, so $s(\sigma) \leq s(\sigma^*)$. Therefore, when there are equal ratios, the greedy algorithm still produces a sequence with the same score as the optimal schedule.
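
The effect of a single exchange, $w_il_j - w_jl_i$, can be sanity-checked numerically. The sketch below is only a check of that identity under assumed helper names; jobs are `(weight, length)` pairs, and position `k` holds job $i$ with job $j$ immediately after it.

```python
def weighted_completion(jobs):
    """Weighted sum of completion times for jobs run in the given order."""
    elapsed, total = 0, 0
    for weight, length in jobs:
        elapsed += length
        total += weight * elapsed
    return total


def swap_adjacent(jobs, k):
    """Return a copy of jobs with positions k and k + 1 exchanged."""
    swapped = list(jobs)
    swapped[k], swapped[k + 1] = swapped[k + 1], swapped[k]
    return swapped


def exchange_delta(jobs, k):
    """Predicted change in score, w_i * l_j - w_j * l_i, from the proof."""
    (w_i, l_i), (w_j, l_j) = jobs[k], jobs[k + 1]
    return w_i * l_j - w_j * l_i


jobs = [(1, 4), (3, 2), (2, 5)]  # illustrative values
for k in range(len(jobs) - 1):
    observed = weighted_completion(swap_adjacent(jobs, k)) - weighted_completion(jobs)
    assert observed == exchange_delta(jobs, k)
```

Note that the jobs before and after the exchanged pair contribute the same amount to both schedules, which is why only the two exchanged jobs appear in the formula.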

## Minimum Spanning Trees

The aim is to construct a tree that connects a bunch of objects together as cheaply as possible.

This section covers
- [Prim's Algorithm](https://en.wikipedia.org/wiki/Prim%27s_algorithm)
- [Kruskal's Algorithm](https://en.wikipedia.org/wiki/Kruskal%27s_algorithm)

both of which run in $O(m\log{n})$ time for $m$ edges and $n$ vertices when implemented using suitable data structures.

Given an undirected graph $G=(V,E)$ where,
- the graph is represented as an adjacency list
- there is a cost $c_e$ for each edge $e \in E$, $c_e \in \mathbb{R}$

Return a minimum cost tree $T \subseteq E$ that spans all the vertices, where
- $T$ is acyclic
- the subgraph $(V, T)$ is connected

To simplify the discussion, we will make 2 assumptions
1. $G$ is already connected
    
    If $G$ is not connected then no spanning tree $T$ exists. It takes linear time to check if $G$ is connected via BFS/DFS.

2. $c_e$ values are distinct

    Both algorithms will still find $T$ regardless of how ties are broken. This assumption serves to make the following discussion more streamlined.
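
The linear-time connectivity check mentioned under assumption 1 could look roughly as follows. The adjacency-list representation as a `dict` from vertex to list of neighbours, and the name `is_connected`, are assumptions for this sketch.

```python
from collections import deque


def is_connected(adj):
    """Return True if the undirected graph given by adj is connected.

    adj maps each vertex to a list of its neighbours.
    """
    if not adj:
        return True  # the empty graph is treated as connected here
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    # Connected exactly when BFS from one vertex reaches every vertex.
    return len(seen) == len(adj)
```

Each vertex is enqueued at most once and each adjacency list is scanned once, giving $O(m + n)$ time.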


### Prim's Algorithm

Similar to Dijkstra's Algorithm, we expand the horizon of touched nodes one at a time, each time being "greedy" and going for the cheapest edge.

```
Prims(G):

    X = {s}    // s is an arbitrarily chosen start vertex
    T = { }    // invariant: T spans X

    while X is not V:

        find cheapest e = (u, p) 
            where u in X but p not in X
        
        add e to T
        add p to X

    return T
```
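
A direct, unoptimised Python transcription of the pseudocode might look like this. The input format (a list of `(cost, u, v)` triples for undirected edges) and the name `prim_naive` are assumptions for this sketch. It rescans every edge on each iteration.

```python
def prim_naive(vertices, edges, start):
    """Return the list of tree edges chosen by Prim's algorithm.

    edges is a list of undirected (cost, u, v) triples.
    """
    X = {start}
    T = []
    while len(X) < len(vertices):
        # Edges with exactly one endpoint in X cross the cut (X, V - X).
        crossing = [(c, u, v) for c, u, v in edges if (u in X) != (v in X)]
        if not crossing:
            raise ValueError("graph is not connected")
        cost, u, v = min(crossing, key=lambda edge: edge[0])
        T.append((cost, u, v))
        X.add(v if u in X else u)  # add the endpoint outside X
    return T
```

For instance, on vertices `a, b, c, d` with edges `(1, a, b)`, `(4, a, c)`, `(3, a, d)`, `(2, b, d)`, `(5, c, d)` and start `a`, the edges are chosen in the order `a-b`, `b-d`, `a-c`.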

#### Proofs of Correctness

##### Proof that a spanning tree $T^*$ is computed

We define a "cut" of a graph $G = (V, E)$ to be a partition of $V$ into two non-empty sets $A$, $B$
$$
c = (A, B)
$$

1. Empty Cut Lemma
$$
\text{graph is NOT connected} \iff \exists (A,B) \; \text{with no crossing edges}
$$

2. Double Crossing Lemma

    Suppose the cycle $C \subseteq E$ has an edge crossing the cut $(A, B)$. Then so does some other edge of $C$.

    This means that if $e$ is the only edge crossing the cut $(A, B)$, then it is not in any cycle.

To show that Prim's algorithm outputs a spanning tree, we will prove that

1. Algorithm maintains an invariant that $T$ spans $X$
2. Algorithm does not get stuck with $X \neq V$
3. Algorithm never creates a cycle

The algorithm can only get stuck if, at some iteration with $X \neq V$, no edge crosses the cut $(X, V - X)$. By the Empty Cut Lemma, this would mean that the graph is not connected, which violates the first assumption.

The algorithm never creates a cycle: when an edge $e = (u, p)$ is added, every edge already in $T$ has both endpoints in $X$, so $e$ is the only edge of $T \cup \{e\}$ crossing the cut $(X, V - X)$. It is a "lone edge", and by the Double Crossing Lemma it cannot lie on any cycle.

The invariant holds by induction: initially $X = \{s\}$ and $T$ is empty, and each iteration adds one new vertex $p$ to $X$ together with one edge of $T$ connecting $p$ to a vertex already in $X$.

Thus Prim's algorithm produces a spanning tree.



##### Proof that $T^*$ is minimum

To prove that the output is a minimum, we will use the "minimum cut property".

Consider an edge $e$ of $G$,
$$
\exists (A, B) \; \text{such that} \; e \; \text{is the cheapest edge crossing} \; (A, B) \implies  e \; \text{belongs to the MST of} \; G 
$$

The minimum cut property directly implies the correctness of Prim's algorithm: at every iteration the algorithm adds the cheapest edge crossing the cut $(X, V - X)$, so every edge of $T$ belongs to the MST. Since $T$ is itself a spanning tree, it must be the MST.

Proof of the Minimum Cut Property. 

We will proceed with a proof by contradiction using an exchange argument.

Suppose there exists a cut $(A, B)$ with crossing edges $e_1, e_2, \cdots, e_n$, for $n \geq 2$. In the case where there is only one crossing edge this edge must be included in the MST as the tree would not be connected otherwise.

WLOG let $e_1$ be the cheapest edge. Further, we will assume that $e_1$ is NOT in the MST.

We then add $e_1$ into the MST. This necessarily creates a cycle $C$. 

By the double crossing lemma, there must be another edge $e_i$ of $C$ that
1. crosses the cut $(A, B)$
2. is more expensive than $e_1$, since $e_1$ is the cheapest crossing edge and costs are distinct

Then, we will show that exchanging $e_i$ for $e_1$ results in a cheaper spanning tree, which would be a contradiction.

It is required to show that swapping $e_1$ with $e_i$ results in a valid spanning tree. Before adding $e_1$, it is assumed that the spanning tree is complete. That is
1. there exists a path from any node to any other node
2. the tree is acyclic

Upon the addition of $e_1$, this therefore necessarily creates one cycle, as $e_1$ would add a new path between the nodes at the end points of $e_1$ which would be distinct from the path that already exists in the MST. 

Further, note that when removing any one edge from an existing cycle, each node in the cycle remains connected. This is the case since, for any two nodes on a cycle, there are exactly two different paths around the cycle that connect them. One of these paths necessarily contains the edge to be removed, and the other does not.

Therefore, replacing $e_i$ with $e_1$ results in a valid spanning tree. Its cost is lower than that of the original tree by $c_{e_i} - c_{e_1} > 0$, which contradicts the assumption that the original tree was an MST.

#### Run Time

##### Straightforward implementation

Inspecting the pseudo-code, there are
1. $O(n)$ iterations for $n$ vertices
2. $O(m)$ time per iteration for $m$ edges

Resulting in $O(mn)$ time.

##### Heap implementation

A heap can speed up the repeated computation of minimums to $O(\log{n})$ per operation.

If the heap is used to store edges, where the keys are the edge costs, we can achieve $O(m\log{n})$ running time. In the implementation of such an algorithm, it will be important to keep track of only the edges that cross the "frontier".
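
One possible sketch of the edge-heap variant uses Python's `heapq`. The adjacency format (a `dict` mapping each vertex to a list of `(cost, neighbour)` pairs) and the function name are assumptions. Rather than removing edges that stop crossing the frontier, this sketch discards them lazily when they are popped, which is a simplification and not necessarily the bookkeeping intended above. Ties in cost fall back to comparing vertex labels, so the labels are assumed to be mutually comparable.

```python
import heapq


def prim_edge_heap(adj, start):
    """Return the tree edges (cost, u, v) chosen by Prim's algorithm."""
    X = {start}
    T = []
    # Heap of candidate edges (cost, u, v) with u already in X.
    heap = [(cost, start, v) for cost, v in adj[start]]
    heapq.heapify(heap)

    while heap and len(X) < len(adj):
        cost, u, v = heapq.heappop(heap)
        if v in X:
            continue  # stale entry: this edge no longer crosses the frontier
        X.add(v)
        T.append((cost, u, v))
        for next_cost, w in adj[v]:
            if w not in X:
                heapq.heappush(heap, (next_cost, v, w))
    return T
```

Each edge is pushed at most twice (once from each endpoint), so there are $O(m)$ heap operations, each costing $O(\log{m}) = O(\log{n})$.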

If the heap is used to store nodes, we can store
1. Nodes in $V-X$
2. For $v \in V-X$, key[$v$] = cost of the cheapest edge $(u,v)$ with $u \in X$ (or $+\infty$ if no such edge exists)

This can be initialised in time
$$
O(m + n\log{n}) = O(m\log{n})
$$

Here the key computations cost $O(m)$ and the $n-1$ heap inserts cost $O(n\log{n})$. The equality holds because $m \geq n-1$ in a connected graph.

In order to maintain the heap invariant after each extract min,
```
when v is added to X:

    for each edge (v, w):
        if w in V-X:
            Delete w from heap
            Recompute key[w] := min(key[w], cost for (v,w))
            re-insert w
```

In general, there will be $O(m)$ heap operations,
- $(n-1)$ initialisation inserts
- $(n-1)$ extract mins
- each edge $(v,w)$ triggers at most one delete and re-insert

Since $m \geq n-1$ and each heap operation takes $O(\log{n})$ time, the overall running time is $O(m\log{n})$.
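
The vertex-keyed variant can be approximated in Python as below. Note the substitution: `heapq` has no delete operation, so instead of deleting and re-inserting `w`, this sketch pushes a fresh `(key, vertex)` entry and skips outdated entries when they are popped. The heap may therefore hold up to $O(m)$ entries rather than $n$, which still gives $O(m\log{n})$ time. The adjacency format and function name are assumptions, and the start vertex enters through the heap with key 0 rather than being placed in $X$ up front.

```python
import heapq
import math


def prim_vertex_keys(adj, start):
    """Return the total cost of the MST of a connected graph.

    adj maps each vertex to a list of (cost, neighbour) pairs.
    """
    # key[v] = cost of the cheapest known edge from X to v
    key = {v: math.inf for v in adj}
    key[start] = 0
    X = set()
    heap = [(0, start)]
    total = 0

    while heap:
        k, v = heapq.heappop(heap)
        if v in X or k > key[v]:
            continue  # outdated entry left behind by an earlier re-key
        X.add(v)
        total += k
        for cost, w in adj[v]:
            if w not in X and cost < key[w]:
                key[w] = cost  # re-key w with the cheaper edge (v, w)
                heapq.heappush(heap, (cost, w))
    return total
```

A heap that supports deletion by position would follow the pseudocode above more literally, at the cost of tracking each vertex's index in the heap.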
