# Motivating Applications

## 1. Internet Routing

Let the vertices of the graph be

1. Hosts / Servers
2. Routers

A directed edge joins two vertices whenever there is a physical or wireless connection between them.

A fundamental question is how to find the shortest (cost-minimising) path between two vertices.

Dijkstra's algorithm needs the whole graph to be available at once, but the internet graph is too large and too dynamic to fit in memory.

This calls for an algorithm that relies only on local computation.

One such algorithm is the [Bellman-Ford](https://en.wikipedia.org/wiki/Bellman%E2%80%93Ford_algorithm) algorithm.
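As a rough illustration of the relaxation rule behind Bellman-Ford, here is a minimal centralised sketch. The function name `bellman_ford`, the edge-list representation, and the four-node example network are assumptions made for this example; real routing protocols apply the same update in a distributed way, which is not shown here.

```python
import math

def bellman_ford(vertices, edges, source):
    """Return shortest-path distances from `source`.

    `edges` is a list of (tail, head, cost) triples of a directed graph.
    Raises ValueError if a negative-cost cycle is reachable from `source`.
    """
    dist = {v: math.inf for v in vertices}
    dist[source] = 0

    # At most n - 1 rounds: a shortest path uses at most n - 1 edges.
    for _ in range(len(vertices) - 1):
        changed = False
        for u, v, cost in edges:
            # The update only needs u's current estimate and the edge (u, v).
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                changed = True
        if not changed:
            break

    # One extra pass: any further improvement implies a negative cycle.
    for u, v, cost in edges:
        if dist[u] + cost < dist[v]:
            raise ValueError("negative-cost cycle detected")
    return dist

# Hypothetical 4-node network, for illustration only.
example_edges = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]
distances = bellman_ford(["a", "b", "c", "d"], example_edges, "a")
```

Each update reads only the estimate of a neighbouring vertex, which is what makes the method suitable for settings where no node holds the whole graph.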

## 2. Sequence Alignment

We are given two strings $s_1$, $s_2$ over a set of characters $A$. For example,
$$
A = \{\text{A}, \text{C}, \text{G}, \text{T}\} \\[10pt]
s_1 = \text{AGGCT} \\
s_2 =  \text{AGGGCA} \\
$$

Compute how "similar" the two strings are.

We might want to do this to

1. Extrapolate the function of genome substrings
2. Infer evolutionary proximity

To be precise about "similarity", we first define the distance $d(s_1, s_2, \alpha)$ between two strings under a given alignment $\alpha$ as the total penalty incurred by that alignment,
$$
d(s_1, s_2, \alpha) = \sum{\text{penalties}(\alpha)}
$$
where
$$
\text{penalties}(\alpha) = 
\begin{cases}
    \text{gap} \geq 0 & \text{for each gap in} \; \alpha \\
    \text{mismatch} \geq 0 & \text{for each mismatch in} \; \alpha \\
    \quad \vdots
\end{cases}
$$

Then the "similarity" of the two strings is the smallest distance achieved by any alignment of the strings.
$$
\text{similarity} = \min_{\alpha}{d(s_1, s_2, \alpha)}
$$

This is known as the [Needleman-Wunsch Score](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm#:~:text=The%20Needleman%E2%80%93Wunsch%20algorithm%20is%20still%20widely%20used%20for%20optimal,alignments%20having%20the%20highest%20score.)
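The minimisation over alignments can be computed with dynamic programming. The sketch below is one way to do it, assuming a unit gap penalty and a unit mismatch penalty; both values, and the function name `alignment_score`, are illustrative choices rather than anything fixed by these notes.

```python
def alignment_score(s1, s2, gap=1, mismatch=1):
    """Minimum total penalty over all alignments of s1 and s2."""
    n, m = len(s1), len(s2)

    # best[i][j] = minimum penalty for aligning s1[:i] with s2[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap  # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        best[0][j] = j * gap  # s2[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if s1[i - 1] == s2[j - 1] else mismatch
            best[i][j] = min(
                best[i - 1][j - 1] + substitution,  # pair the two last characters
                best[i - 1][j] + gap,               # last character of s1 faces a gap
                best[i][j - 1] + gap,               # last character of s2 faces a gap
            )
    return best[n][m]

score = alignment_score("AGGCT", "AGGGCA")
```

With these unit penalties the two example strings score 2: one gap (the lengths differ by one) plus one mismatch (the final `T` against `A`).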

# Greedy Algorithms

A greedy algorithm iteratively makes "myopic" decisions in the hope that things work out in the end.

The running time of a greedy algorithm is generally easy to analyse. However, proofs of correctness are not always obvious. There are two common proof methods,

1. Induction ("greedy stays ahead")
2. "Exchange argument" (where we show via transformations that an optimal solution is identical to the greedy solution)

## The Caching Problem



We have two memory types 
- small fast memory
- big slow memory

The goal is to process a sequence of "page requests".

On a cache miss / "page fault", we need to bring the requested page from slow memory into the cache. To do so we have to evict something from the cache to make room. Which page should be evicted?

The optimal algorithm is known as the "furthest-in-future" algorithm (Belady, 1960s). However, this algorithm assumes that we know the future requests, so it is unimplementable. Instead, it serves as a jumping-off point for more practical algorithms.

- Least Recently Used

    Assumes that data requested recently will be requested again soon, and that data last requested a long time ago will be requested infrequently.
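To compare the two eviction rules concretely, the following sketch counts page faults for each on a fixed request sequence. The function names, the request sequence, and the cache capacity of 3 are assumptions for illustration; the furthest-in-future version reads the whole sequence in advance, which is exactly what a real cache cannot do.

```python
from collections import OrderedDict

def lru_faults(requests, capacity):
    """Count page faults when evicting the least recently used page."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(cache) == capacity:
            cache.popitem(last=False)  # evict the least recently used page
        cache[page] = True
    return faults

def furthest_in_future_faults(requests, capacity):
    """Count page faults when evicting the page whose next use is furthest away."""
    cache = set()
    faults = 0
    for pos, page in enumerate(requests):
        if page in cache:
            continue
        faults += 1
        if len(cache) == capacity:
            future = requests[pos + 1:]

            def next_use(p):
                # Pages never requested again count as infinitely far away.
                return future.index(p) if p in future else len(future)

            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return faults

requests = ["a", "b", "c", "d", "a", "b", "e", "a", "b", "c", "d", "e"]
lru_count = lru_faults(requests, capacity=3)
optimal_count = furthest_in_future_faults(requests, capacity=3)
```

On this sequence LRU incurs 10 faults while furthest-in-future incurs 7, so the clairvoyant rule is a useful lower bound to measure practical policies against.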

Proof: TODO

## A Scheduling Problem



We have a single processing thread, with many jobs to process. In what order should we sequence the jobs?

Each job $j$ has two parameters
1. a weight $w_j$
2. a length $l_j$

Define the completion time $c_j$ of a job $j$, with jobs indexed in the order they are scheduled, as
$$
c_j = \sum_{i \leq j}{l_i}
$$
That is, the time elapsed until the completion of job $j$.

One way to measure the quality of a sequence of jobs is by computing the weighted sum $s$ of completion times
$$
s = \sum{w_j \cdot c_j}
$$

We seek an algorithm that minimises $s$.

If $w_j$ is constant, then we want to schedule smaller $l_j$ first.

Likewise if $l_j$ is constant, then we want to schedule larger $w_j$ first.

In the case that $w_i > w_j$ and $l_i > l_j$ or vice versa, we will need to devise a score to order the jobs.

This score should increase with $w$ and decrease with $l$. Two natural candidates are

- $w_j - l_j$
- $\frac{w_j}{l_j}$

It can be shown via a counterexample that the difference score is not always correct. The following is a proof of the correctness of the ratio score.
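One possible counterexample uses just two jobs. The instance below, with jobs written as `(weight, length)` pairs, is chosen for illustration and is not taken from the notes; the helper name `weighted_completion` is likewise an assumption.

```python
def weighted_completion(jobs):
    """Sum of weight * completion time for jobs given in schedule order."""
    time = total = 0
    for weight, length in jobs:
        time += length           # completion time of this job
        total += weight * time
    return total

jobs = [(3, 5), (1, 2)]  # hypothetical (weight, length) pairs

# Schedule in decreasing order of each candidate score.
by_difference = sorted(jobs, key=lambda job: job[0] - job[1], reverse=True)
by_ratio = sorted(jobs, key=lambda job: job[0] / job[1], reverse=True)

difference_score = weighted_completion(by_difference)
ratio_score = weighted_completion(by_ratio)
```

Here the difference rule schedules the second job first and scores 23, while the ratio rule schedules the first job first and scores 22, so the difference rule is not optimal on this instance.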

Proof by "Exchange Argument"

Let
$$
\sigma = \text{greedy schedule} \\
\sigma^* = \text{optimal schedule}
$$

We will proceed by contradiction: assuming $\sigma^* \neq \sigma$, we construct a schedule that is strictly better than $\sigma^*$.

Assumptions:
1. all $\frac{w_j}{l_j}$ are distinct
2. WLOG that $\frac{w_1}{l_1} > \frac{w_2}{l_2} > \cdots > \frac{w_n}{l_n}$ via a renaming of jobs

Therefore
$$
\sigma = 1, 2, \cdots, n
$$

Since $\sigma^* \neq \sigma$, there are consecutive jobs $i, j$ in $\sigma^*$, with $i$ scheduled immediately before $j$, such that $i > j$.

Then suppose we exchange the jobs $i, j$ in $\sigma^*$, leaving all other jobs unchanged. This produces a new schedule $\sigma^{**}$.

Using assumption 2, we know that since $i > j$
$$
\frac{w_i}{l_i} < \frac{w_j}{l_j} \\
\implies w_il_j < w_jl_i
$$

Let $s(\sigma)$ return the score of a schedule $\sigma$. Then,
$$
\begin{aligned}
s(\sigma^{**}) = s(\sigma^*) & - w_i(\alpha + l_i) - w_j(\alpha + l_i + l_j) \\
& + w_j(\alpha + l_j) + w_i(\alpha + l_j + l_i)
\end{aligned} \\[10pt]
\implies s(\sigma^{**}) = s(\sigma^*) + w_il_j - w_jl_i
$$
where $\alpha$ is the total length of the jobs scheduled before $i$ in $\sigma^*$.

Further, from assumption 2 we have
$$
w_il_j < w_jl_i \\[10pt]
w_il_j - w_jl_i < 0 \\[10pt]
\implies s(\sigma^{**}) < s(\sigma^*)
$$

This is a contradiction, as $\sigma^*$ was defined to be optimal. Therefore $\sigma^* = \sigma$, that is, the schedule produced by the greedy algorithm is optimal.

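We can also relax assumption 1 that the ratios are distinct. In the case that the ratios of the exchanged jobs are equal,
$$
\frac{w_i}{l_i} = \frac{w_j}{l_j} \\[10pt]
w_il_j - w_jl_i = 0 \\[10pt]
\implies s(\sigma^{**}) = s(\sigma^*)
$$

In general each exchange gives $s(\sigma^{**}) \leq s(\sigma^*)$, and repeating such exchanges transforms $\sigma^*$ into $\sigma$ without ever increasing the score, so $s(\sigma) \leq s(\sigma^*)$. Therefore, when there are equal ratios, the greedy algorithm still produces a sequence that has the same score as the optimal schedule.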

## Minimum Spanning Trees

The aim is to construct a tree that connects a bunch of objects together as cheaply as possible.

This will go over 
- [Prim's Algorithm](https://en.wikipedia.org/wiki/Prim%27s_algorithm)
- [Kruskal's Algorithm](https://en.wikipedia.org/wiki/Kruskal%27s_algorithm)

both of which run in $O(m\log{n})$ time for $m$ edges and $n$ vertices when implemented using suitable data structures.

Given an undirected graph $G=(V,E)$ where,
- the graph is represented as an adjacency list
- there is a cost $c_e$ for each edge $e \in E$, $c_e \in \mathbb{R}$

Return a minimum-cost tree $T \subseteq E$ that spans all the vertices, where
- $T$ is acyclic
- the subgraph $(V, T)$ is connected

To simplify the discussion, we will make 2 assumptions
1. $G$ is already connected
    
    If $G$ was not connected then the solution $T$ does not exist. It takes linear time to check if $G$ is connected via BFS/DFS

2. $c_e$ values are distinct

    Both algorithms will still find $T$ regardless of how ties are broken. This assumption serves to make the following discussion more streamlined.


### Prim's Algorithm

Similar to Dijkstra's Algorithm, we expand the horizon of touched nodes one at a time, each time being "greedy" and going for the cheapest edge.

```
Prims(G):

    X = [s]  # s is an arbitrarily chosen start vertex
    T = { }

    while X is not V:

        find cheapest e = (u, p) 
            where u in X but p not in X
        
        add e to T
        add p to X

    return T, X
```
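The pseudocode above can be turned into a direct, unoptimised implementation along the following lines. This is a sketch under assumptions: the graph is given as a vertex collection plus a list of `(u, v, cost)` triples, and the function name `prim_naive` and the small example graph are made up for illustration.

```python
def prim_naive(vertices, edges):
    """Return the list of tree edges chosen by Prim's algorithm.

    `edges` is a list of (u, v, cost) triples of an undirected graph.
    Every iteration scans all edges, so this is the O(mn) variant.
    """
    vertices = list(vertices)
    visited = {vertices[0]}  # arbitrary start vertex
    tree = []

    while len(visited) < len(vertices):
        # Edges with exactly one endpoint inside the visited set.
        crossing = [
            (u, v, cost)
            for u, v, cost in edges
            if (u in visited) != (v in visited)
        ]
        if not crossing:
            raise ValueError("graph is not connected")

        u, v, cost = min(crossing, key=lambda edge: edge[2])
        tree.append((u, v, cost))
        visited.update((u, v))  # one endpoint is new, the other already present

    return tree

# Hypothetical example graph.
example_edges = [(1, 2, 1), (2, 3, 2), (1, 3, 4), (3, 4, 3), (1, 4, 5)]
example_tree = prim_naive([1, 2, 3, 4], example_edges)
example_cost = sum(cost for _, _, cost in example_tree)
```

The explicit check for an empty `crossing` list corresponds to the case where the algorithm would get stuck on a disconnected graph.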

#### Proofs of Correctness

##### Proof that a spanning tree $T^*$ is computed

We define a "cut" of $G$ to be a partition of the vertices $V$ into two non-empty sets $A$, $B$
$$
c = (A, B)
$$

1. Empty Cut Lemma
$$
\text{graph is NOT connected} \iff \exists (A,B) \; \text{with no crossing edges}
$$

2. Double Crossing Lemma

    Suppose the cycle $C \subseteq E$ has an edge crossing the cut $(A, B)$. Then so does some other edge of $C$.

    This means that if $e$ is the only edge crossing the cut $(A, B)$, then it is not in any cycle.

To show that Prim's algorithm outputs a spanning tree, we will prove that

1. Algorithm maintains an invariant that $T$ spans $X$
2. Algorithm does not get stuck with $X \neq V$
3. Algorithm never creates a cycle

For the algorithm to get stuck, there must exist a cut $(X, V - X)$ of the graph that has no crossing edges. By the Empty Cut Lemma, this means that the graph is not connected, which contradicts our first assumption.

For the algorithm to create a cycle, the newly added edge $e = (u, p)$ would have to lie on a cycle of $T \cup \{e\}$. Every edge already in $T$ has both endpoints in $X$, so $e$ is the only edge of $T \cup \{e\}$ crossing the cut $(X, V - X)$, that is, a "lone edge". By the Double Crossing Lemma, a lone crossing edge is not in any cycle, so the addition of each edge cannot create a cycle.

The algorithm maintains the invariant because each iteration adds one new node $p$ to $X$ together with one edge connecting $p$ to a node already in $X$; therefore, by induction, $T$ spans $X$ after every iteration.

Thus Prim's algorithm produces a spanning tree.



##### Proof that $T^*$ is minimum

To prove that the output is a minimum, we will use the "minimum cut property".

Consider an edge $e$ of $G$, 
$$
\exists (A, B) \; \text{such that} \; e \; \text{is the cheapest edge crossing} \; (A, B) \implies  e \; \text{belongs to the MST of} \; G 
$$

The minimum cut property implies the correctness of Prim's algorithm, since the algorithm only includes edges that satisfy the property: for each intermediate cut $(X, V - X)$, the algorithm selects the cheapest crossing edge to add to $T$. Every edge of the output is therefore in the MST, and since the output is itself a spanning tree, it is the MST.

Proof of the Minimum Cut Property. 

We will proceed with a proof by contradiction using an exchange argument.

Suppose there exists a cut $(A, B)$ with crossing edges $e_1, e_2, \cdots, e_n$, for $n \geq 2$. In the case where there is only one crossing edge this edge must be included in the MST as the tree would not be connected otherwise.

WLOG let $e_1$ be the cheapest edge. Further, we will assume that $e_1$ does NOT exist in the MST. 

We then add $e_1$ into the MST. This necessarily creates a cycle $C$ containing $e_1$. 

By the Double Crossing Lemma, there must be another edge $e_i$ of $C$ that 
1. crosses the cut $(A, B)$
2. is more expensive than $e_1$ (since $e_1$ is the cheapest crossing edge and costs are distinct)

Then, we will show that exchanging $e_i$ for $e_1$ results in a spanning tree cheaper than the MST, which would be a contradiction.

It remains to show that swapping $e_i$ for $e_1$ results in a valid spanning tree. Before adding $e_1$, it is assumed that the spanning tree is complete. That is
1. there exists a path from any node to any other node
2. the tree is acyclic

Upon the addition of $e_1$, this therefore necessarily creates one cycle, as $e_1$ would add a new path between the nodes at the end points of $e_1$ which would be distinct from the path that already exists in the MST. 

Further, note that when removing any one edge from an existing cycle, each node in the cycle remains connected. This is the case since, for any two nodes on a cycle, there are exactly two different paths around the cycle that connect them. One of these paths necessarily contains the edge to be removed, and the other does not.

Therefore, replacing $e_i$ with $e_1$ results in a valid spanning tree. Its cost is lower than that of the assumed MST because $e_1$ is cheaper than $e_i$, which is the desired contradiction.

#### Run Time

##### Straightforward implementation

Inspecting the pseudo-code, there are
1. $O(n)$ iterations for $n$ vertices
2. $O(m)$ time per iteration for $m$ edges

Resulting in $O(mn)$ time

##### Heap implementation

A heap speeds up each repeated minimum computation to $O(\log{n})$. 

If the heap is used to store edges, where the keys are the edge costs, we can achieve $O(m\log{n})$ running time. In the implementation of such an algorithm, it will be important to keep track of only the edges that cross the "frontier".

If the heap is used to store nodes, we can store
1. Nodes in $V-X$
2. For $v \in V-X$, key[$v$] = cost of the cheapest edge $(u,v)$ with $u \in X$, or $+\infty$ if no such edge exists

This can be initialised in time
$$
O(m + n\log{n}) = O(m\log{n})
$$

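where the key computations cost $O(m)$ in total and each of the $n-1$ heap inserts costs $O(\log{n})$. The final equality uses $m \geq n - 1$, which holds because $G$ is connected.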

In order to maintain the heap invariant after each extract min,
```
when v is added to X:

    for each edge (v, w):
        if w in V-X:
            Delete w from heap
            Recompute key[w] := min(key[w], cost for (v,w))
            re-insert w
```

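In general, there will be $O(m)$ heap operations,
- $(n-1)$ initialisation inserts
- $(n-1)$ extract mins
- each edge $(v,w)$ triggers at most one re-key (a delete followed by an insert)

Each heap operation takes $O(\log{n})$ time, therefore the overall running time is $O(m\log{n})$.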
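A compact way to try the heap idea in Python is with the standard `heapq` module. Note that `heapq` has no delete or decrease-key operation, so this sketch swaps in a different technique from the one described above: instead of deleting and re-inserting a vertex, it pushes a fresh entry and skips stale entries when they are popped ("lazy deletion"). The heap can then hold up to $O(m)$ entries, which still gives $O(m\log{n})$ time. The function name `prim_heap` and the example graph are assumptions for illustration.

```python
import heapq

def prim_heap(adj, start):
    """Return the total cost of a minimum spanning tree.

    `adj` maps each vertex to a list of (neighbour, cost) pairs.
    """
    visited = {start}
    frontier = [(cost, v) for v, cost in adj[start]]
    heapq.heapify(frontier)
    total = 0

    while frontier and len(visited) < len(adj):
        cost, v = heapq.heappop(frontier)
        if v in visited:
            continue  # stale entry: v was already reached by a cheaper edge
        visited.add(v)
        total += cost
        for w, edge_cost in adj[v]:
            if w not in visited:
                heapq.heappush(frontier, (edge_cost, w))

    if len(visited) < len(adj):
        raise ValueError("graph is not connected")
    return total

# Hypothetical example graph as an adjacency list.
example_adj = {
    1: [(2, 1), (3, 4), (4, 5)],
    2: [(1, 1), (3, 2)],
    3: [(2, 2), (1, 4), (4, 3)],
    4: [(3, 3), (1, 5)],
}
example_total = prim_heap(example_adj, start=1)
```

The vertex-keyed heap described above keeps the heap at most $n$ entries large, at the price of needing a heap that supports deletion from the middle.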


# Problem Set 1

## Question 1

We are given as input a set of $n$ requests (e.g., for the use of an auditorium), with a known start time $s_i$ and finish time $t_i$ for each request $i$.

Assume that all start and finish times are distinct. Two requests conflict if they overlap in time --- if one of them starts between the start and finish times of the other. Our goal is to select a maximum-cardinality subset of the given requests that contains no conflicts. (For example, given three requests consuming the intervals
$$
[0, 3] \; [2,5] \; [4, 7]
$$
we want to return the first and third requests.) We aim to design a greedy algorithm for this problem with the following form: At each iteration we select a new request $i$, including it in the solution-so-far and deleting from future consideration all requests that conflict with $i$.

The solution is to pick the remaining request with the earliest finish time.
```
GreedyAlgo():

    S = { ...requests }
    X = []  # selected requests
    
    while S is not empty:

        r = request from S with the earliest finish time

        append r to X
        remove r from S

        for each j in S:
            if j conflicts with r:
                remove j from S
            
    return X
```
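A runnable version of the same rule can sort the requests once by finish time and then scan them, which is equivalent to repeatedly picking the earliest finisher and discarding its conflicts. This is a sketch: requests are assumed to be `(start, finish)` tuples with distinct times, and the function name is chosen for illustration.

```python
def max_compatible_requests(requests):
    """Select a maximum-size conflict-free subset of (start, finish) requests."""
    chosen = []
    last_finish = None
    for start, finish in sorted(requests, key=lambda request: request[1]):
        # A request is still available only if it starts after the
        # most recently chosen request has finished.
        if last_finish is None or start > last_finish:
            chosen.append((start, finish))
            last_finish = finish
    return chosen

selected = max_compatible_requests([(0, 3), (2, 5), (4, 7)])
```

On the example from the question this selects the first and third requests.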

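Proof by an induction argument ("greedy stays ahead").

Let $g_1, g_2, \cdots, g_k$ be the requests selected by the greedy algorithm, in the order they were selected, and let $o_1, o_2, \cdots, o_m$ be an optimal solution sorted by finish time.

Claim: for every $r \leq k$, the finish time of $g_r$ is no later than the finish time of $o_r$.

For $r = 1$ the claim holds because $g_1$ has the earliest finish time of all requests.

Assume the claim holds for $r - 1$. The request $o_r$ starts after $o_{r-1}$ finishes, and therefore after $g_{r-1}$ finishes. So $o_r$ does not conflict with any of $g_1, \cdots, g_{r-1}$ and is still available when the algorithm selects $g_r$. Since the algorithm picks the available request with the earliest finish time, $g_r$ finishes no later than $o_r$.

Finally, suppose $m > k$. Then $o_{k+1}$ starts after $o_k$ finishes, which is after $g_k$ finishes, so $o_{k+1}$ would still be available after $g_k$ is selected. This contradicts the fact that the algorithm halted with no requests remaining. Hence $m = k$ and the greedy solution is maximal.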

## Question 2

We are given as input a set of $n$ jobs, where job $j$ has a processing time $p_j$ and a deadline $d_j$.  Recall the definition of completion times $C_j$ from the video lectures.  Given a schedule (i.e., an ordering of the jobs), we define the lateness $l_j$ of job $j$ as the amount of time $C_j - d_j$ after its deadline that the job completes, or as 0 if $C_j \leq d_j$. 
$$
l_j =
\begin{cases}
C_j - d_j & \text{if} \; C_j > d_j \\
0 & \text{otherwise}
\end{cases}
$$

Our goal is to minimize the maximum lateness,
$$
\max_{j}{l_j}
$$

Solution by exchange argument.

Suppose there is some optimal ordering of jobs.

We consider the effect of swapping a pair of adjacent jobs; let's call these jobs $j_c$ and $j_r$,

where $j_c$ is scheduled immediately before $j_r$ (so $r = c + 1$).

Assume that before the swap, $j_r$ had the maximum lateness. 
$$
l_r > l_c \\
\implies \sum_{i \leq r}{p_i} - d_r > \sum_{i \leq c}{p_i} - d_c \\
\implies p_r - d_r > - d_c \\
\implies d_c > d_r - p_r
$$

After the swap, the new lateness values become,
$$
l_r \rightarrow l_r - p_c \\
l_c \rightarrow l_c + p_r
$$

In order for the swap to improve on the previous ordering, we would require the new lateness for $j_c$ to be less than the old lateness for $j_r$.
$$
l_c + p_r < l_r \\
\implies \sum_{i \leq c}{p_i} + p_r - d_c < \sum_{i \leq r}{p_i} - d_r \\
\implies - d_c < - d_r \\
\implies d_c > d_r
$$

Therefore, a swap of adjacent jobs only improves the maximum lateness if the deadlines of the two jobs are in decreasing order ($d_c > d_r$).

Therefore, the optimal ordering will have the deadlines in increasing order.
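The conclusion can be sanity-checked on a small instance by comparing the earliest-deadline-first order against every possible ordering. The job list below, given as `(processing_time, deadline)` pairs, and the function names are assumptions made for this check; brute force over permutations is only feasible for a handful of jobs.

```python
from itertools import permutations

def max_lateness(jobs):
    """Maximum lateness of jobs given in schedule order as (p, d) pairs."""
    time = worst = 0  # starting `worst` at 0 clamps early jobs to lateness 0
    for processing_time, deadline in jobs:
        time += processing_time
        worst = max(worst, time - deadline)
    return worst

def earliest_deadline_first(jobs):
    """Order the jobs by increasing deadline."""
    return sorted(jobs, key=lambda job: job[1])

# Hypothetical instance with six jobs.
jobs = [(3, 6), (2, 8), (1, 9), (4, 9), (3, 14), (2, 15)]

greedy_value = max_lateness(earliest_deadline_first(jobs))
brute_force_value = min(max_lateness(order) for order in permutations(jobs))
```

For this instance both values agree, which is consistent with the exchange argument above but is of course not a proof.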

# Optional Theory Problems

## 1

Consider a connected undirected graph $G$ with not necessarily distinct edge costs. Consider two different minimum-cost spanning trees of $G$, $T$ and $T^{\prime}$. Is there necessarily a sequence of minimum-cost spanning trees,
$$
T = T_0, T_1, T_2, \cdots , T_r = T^{\prime}
$$
with the property that each consecutive pair $T_i$, $T_{i+1}$ of MSTs differ by only a single edge swap? Prove the statement or exhibit a counterexample.

The minimum cut property states the following.

For a graph $G$ and an edge $e$ of $G$,
$$
\exists (A, B) \; \text{such that} \; e \; \text{is the cheapest edge crossing} \; (A, B) \implies  e \; \text{belongs to the MST of} \; G 
$$

In order for two MSTs $T$, $T^{\prime}$ to be distinct, there needs to be an edge $e$ in $T$, that is not in $T^{\prime}$
$$
\exists e \in T \land e \notin T^{\prime}
$$

We consider a cut $(A, B)$ for which $e$ is a cheapest crossing edge. By the minimum cut property, in order for this edge to not be in $T^{\prime}$ there must
$$
\exists e^{\prime} \; \text{crossing} \; (A, B) \; \text{with} \; e^{\prime} \in T^{\prime}
$$

such that the cost of $e$ equals the cost of $e^{\prime}$.

Equivalently (by contraposition), if the edge costs are distinct, then there is a unique MST.

Therefore, for any two distinct minimum spanning trees, there exists at least one cut $(A, B)$ that has at least two crossing edges $e_i$ with the same cheapest cost. 

In order to prove that a sequence of edge swaps results in a valid MST, we need to show after an edge swap

1. total edge cost remains the same
2. a spanning tree is maintained
3. a cycle is not created

We consider the effect of adding $e^{\prime}$ into $T$. Since $T$ is an acyclic spanning tree, adding the extra edge $e^{\prime}$ necessarily creates a cycle in $T$. 

Furthermore, this cycle contains both $e$ and $e^{\prime}$. Therefore, it is safe to remove $e$ as it is part of the cycle.

## 2

Consider the following algorithm. The input is a connected undirected graph with edge costs (distinct, if you prefer). The algorithm proceeds in iterations. If the current graph is a spanning tree, then the algorithm halts. Otherwise, it picks an arbitrary cycle of the current graph and deletes the most expensive edge on the cycle. Is this algorithm guaranteed to compute a minimum-cost spanning tree? Prove it or exhibit a counterexample

Does the algorithm result in a spanning tree?

When the algorithm halts, the graph is guaranteed to have no cycles. Furthermore, removing an edge from a cycle maintains the connectedness of the graph, since any cycle contains two distinct paths between nodes on the cycle (one containing the removed edge and one that does not). 

Therefore, if you start with a connected graph, the algorithm is guaranteed to produce a spanning tree.

Does the algorithm produce a minimum spanning tree?

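With distinct edge costs, the most expensive edge of any cycle in the graph is definitely not included in the MST (the "cycle property"). This follows from the same kind of exchange argument used for the minimum cut property.

To see this, suppose the most expensive edge $e$ of a cycle $C$ were in the MST $T^*$. Removing $e$ from $T^*$ splits it into two components, which define a cut $(A, B)$ that $e$ crosses. Since $e$ is part of the cycle $C$, by the Double Crossing Lemma some other edge $f$ of $C$ also crosses this cut, and $f$ is cheaper than $e$. Then $T^* - e + f$ is a spanning tree that is cheaper than $T^*$, which is a contradiction.

Therefore the algorithm only ever deletes edges that are not in the MST. Since it halts with a spanning tree that still contains every MST edge, the output is the MST.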

## 3

Consider the following algorithm. The input is a connected undirected graph with edge costs (distinct, if you prefer). The algorithm proceeds in phases. Each phase adds some edges to a tree-so-far and reduces the number of vertices in the graph (when there is only 1 vertex left, the MST is just the empty set). In a phase, we identify the cheapest edge $e_v$ incident on each vertex $v$ of the current graph.

Let $F = \{ e_v\}$ be the collection of all such edges in the current phase. Obtain a new (smaller) graph by contracting all of the edges in $F$ (so that each connected component of $F$ becomes a single vertex in the new graph) discarding any self-loops that result. 

Let $T$ denote the union of all edges that ever get contracted in a phase of this algorithm. 

Is $T$ guaranteed to be a minimum-cost spanning tree? Prove it or exhibit a counterexample.

I don't understand this question.

# Programming Assignment

## 1

This file ('Week 1 jobs'.txt) describes a set of jobs with positive and integral weights and lengths.  It has the format

[number_of_jobs]

[job_1_weight] [job_1_length]

[job_2_weight] [job_2_length]

...

For example, the third line of the file is "74 59", indicating that the second job has weight 74 and length 59.

You should NOT assume that edge weights or lengths are distinct.

Your task in this problem is to run the greedy algorithm that schedules jobs in decreasing order of the difference (weight - length).  Recall from lecture that this algorithm is not always optimal.  IMPORTANT: if two jobs have equal difference (weight - length), you should schedule the job with higher weight first.  Beware: if you break ties in a different way, you are likely to get the wrong answer.  You should report the sum of weighted completion times of the resulting schedule --- a positive integer --- in the box below. 

In [68]:
from typing import Tuple, TypeVar, Generic

T = TypeVar('T')

class Heap(Generic[T]):
    def __init__(self, type="min"):
        self.order = type if type in ["min", "max"] else "min"

        self.arr: list[T] = []
        self.index: dict[T, int] = dict()

        self.keys: dict[T, int] = dict()
        self.tie_breaks: dict[T, int] = dict()

        return
    
    def __contains__(self, item: T):
        return item in self.index
    
    @property
    def size(self):
        return len(self.arr)
    
    @property
    def lead(self):
        if self.size == 0:
            return None
        return self.arr[0]
    
    def compare_items(self, item_1: T, item_2: T):
        if self.order == "max":
            if self.keys[item_1] == self.keys[item_2]:
                return self.tie_breaks[item_1] > self.tie_breaks[item_2]
            return self.keys[item_1] > self.keys[item_2]
        else:
            if self.keys[item_1] == self.keys[item_2]:
                return self.tie_breaks[item_1] < self.tie_breaks[item_2]
            return self.keys[item_1] < self.keys[item_2]

    def heapify_up(self, start_idx: int):
        if start_idx == 0:
            return
        
        parent_idx = (start_idx -1) // 2

        parent_node = self.arr[parent_idx]
        start_node = self.arr[start_idx]

        if self.compare_items(parent_node, start_node):
            return

        self.arr[parent_idx], self.arr[start_idx] = self.arr[start_idx], self.arr[parent_idx]
        self.index[parent_node], self.index[start_node] = self.index[start_node], self.index[parent_node]

        self.heapify_up(parent_idx)
        
        return

    def heapify_down(self, start_idx: int):
        arr_size = len(self.arr)

        left_idx = 2 * start_idx + 1
        right_idx = 2 * start_idx + 2

        start_node = self.arr[start_idx]
        node_key = self.keys[start_node]
        
        if left_idx > arr_size-1:
            return
        
        left_node = self.arr[left_idx]

        compare_idx, compare_node = left_idx, left_node
        
        if right_idx < arr_size:

            right_node = self.arr[right_idx]

            if not self.compare_items(left_node, right_node):
                compare_idx, compare_node = right_idx, right_node

        if self.compare_items(start_node, compare_node):
            return
        
        self.arr[compare_idx], self.arr[start_idx] = self.arr[start_idx], self.arr[compare_idx]
        self.index[compare_node], self.index[start_node] = self.index[start_node], self.index[compare_node]

        self.heapify_down(compare_idx)

        return

    def extract_lead(self) -> Tuple[T, int]:
        arr_size = len(self.arr)

        if arr_size == 0:
            return None
        
        if arr_size == 1:
            min_element = self.arr.pop()

            self.tie_breaks.pop(min_element)
            self.index.pop(min_element)

            return min_element, self.keys.pop(min_element)
        
        last_element = self.arr[-1]
        
        self.arr[0], self.arr[-1] = self.arr[-1], self.arr[0]
        self.index[last_element] = 0
        
        min_element = self.arr.pop()
        key = self.keys.pop(min_element)
        self.index.pop(min_element)
        self.tie_breaks.pop(min_element)

        self.heapify_down(0)
        
        return min_element, key
        
    def insert(self, item: T, key: int=None, tie_break: int=None):
        size = len(self.arr)

        self.arr.append(item)
        self.index[item] = size
        
        if key is None:
            key = item

        if tie_break is None:
            tie_break = key

        self.keys[item] = key
        self.tie_breaks[item] = tie_break
        
        if size > 0:
            self.heapify_up(size)
        return
    
    def delete(self, item: T):
        idx = self.index[item]

        if idx == len(self.arr) -1:
            self.arr.pop()
            self.index.pop(item)
            self.keys.pop(item)
            self.tie_breaks.pop(item)  # previously left behind for the last element
            
            return

        last = self.arr[-1]

        self.arr[idx], self.arr[-1] = self.arr[-1], self.arr[idx]
        self.index[last] = idx
        
        deleted = self.arr.pop()
        
        if self.compare_items(deleted, last):
            self.heapify_down(idx)
        else:
            self.heapify_up(idx)
        
        self.index.pop(item)
        self.keys.pop(item)
        self.tie_breaks.pop(item)

        return
    
    def validate(self):

        def report():
            print(self.arr)
            print(self.index)
            print(self.keys)
            return
        
        # all bookkeeping dicts must have exactly one entry per stored item
        if len(self.arr) != len(self.keys) or len(self.arr) != len(self.index):
            report()
            return False

        for idx, item in enumerate(self.arr):
            if self.index[item] != idx:
                report()
                return False
        
        nodeStack = [0] if self.arr else []  # nothing to check in an empty heap
        while len(nodeStack) != 0:
            parent_idx = nodeStack.pop()

            lchild_idx = 2 * parent_idx + 1
            rchild_idx = 2 * parent_idx + 2

            arr_size = len(self.arr)
            
            parent_node = self.arr[parent_idx]

            if lchild_idx > arr_size -1:
                continue
            
            left_node = self.arr[lchild_idx]
            
            if rchild_idx > arr_size -1:
                if self.compare_items(parent_node, left_node):
                    continue
                
                report()
                return False

            # compare_items expects items, not keys
            right_node = self.arr[rchild_idx]
            
            if self.compare_items(parent_node, left_node) and self.compare_items(parent_node, right_node):
                nodeStack.append(lchild_idx)
                nodeStack.append(rchild_idx)
            else:
                report()
                return False

        return True
    
    def test_heap(self):
        
        for i in [2, 5, 3, 8, 9, 7, 4, 11, 1, 40, 23, 41, 52, 15, 13]:
            self.insert(i)
        
        print(f'Insert: {"PASSED" if self.validate() else "FAILED"}')
        
        for i in [3, 8, 7, 1, 11, 23, 15]:
            self.delete(i)

        print(f'Delete: {"PASSED" if self.validate() else "FAILED"}')
        
        self.extract_lead()
        print(f'Extract Lead: {"PASSED" if self.validate() else "FAILED"}')

In [69]:
heap = Heap()
heap.test_heap()

Insert: PASSED
Delete: PASSED
Extract Lead: PASSED


In [70]:
def load_data():
    with open('Week 1 jobs.txt') as f:
        next(f)  # skip the header line holding the number of jobs
        
        # each remaining line is "weight length"; blank lines are skipped
        return [[int(x) for x in line.split()] for line in f if line.strip()]

In [72]:
def greedy():
    data_arr = load_data()

    heap = Heap(type="max")

    for idx, data in enumerate(data_arr):
        weight, length = data
        key = weight - length
        heap.insert(idx, key=key, tie_break=weight)

    weighted_sum = 0
    current_time = 0
    
    while heap.size != 0:
        next_job, _ = heap.extract_lead()
        weight, length = data_arr[next_job]
        
        current_time += length
        weighted_sum += current_time * weight

    return weighted_sum

greedy()

69119377652

## 2

For this problem, use the same data set as in the previous problem.

Your task now is to run the greedy algorithm that schedules jobs (optimally) in decreasing order of the ratio (weight/length).  In this algorithm, it does not matter how you break ties.  You should report the sum of weighted completion times of the resulting schedule --- a positive integer --- in the box below. 

In [67]:
def greedy_smarter():
    data_arr = load_data()

    heap = Heap(type="max")

    for idx, data in enumerate(data_arr):
        weight, length = data
        key = weight/length
        heap.insert(idx, key=key)

    weighted_sum = 0
    current_time = 0

    while heap.size != 0:
        next_job, _ = heap.extract_lead()
        weight, length = data_arr[next_job]
        
        current_time += length
        weighted_sum += current_time * weight

    return weighted_sum

greedy_smarter()

67311454237

## 3

This file ('Week 1 edges'.txt) describes an undirected graph with integer edge costs.  It has the format

[number_of_nodes] [number_of_edges]

[one_node_of_edge_1] [other_node_of_edge_1] [edge_1_cost]

[one_node_of_edge_2] [other_node_of_edge_2] [edge_2_cost]

...

For example, the third line of the file is "2 3 -8874", indicating that there is an edge connecting vertex #2 and vertex #3 that has cost -8874. 

You should NOT assume that edge costs are positive, nor should you assume that they are distinct.

Your task is to run Prim's minimum spanning tree algorithm on this graph.  You should report the overall cost of a minimum spanning tree --- an integer, which may or may not be negative --- in the box below. 

IMPLEMENTATION NOTES: This graph is small enough that the straightforward O(mn) time implementation of Prim's algorithm should work fine. OPTIONAL: For those of you seeking an additional challenge, try implementing a heap-based version. The simpler approach, which should already give you a healthy speed-up, is to maintain relevant edges in a heap (with keys = edge costs).  The superior approach stores the unprocessed vertices in the heap, as described in lecture.  Note this requires a heap that supports deletions, and you'll probably need to maintain some kind of mapping between vertices and their positions in the heap.

In [84]:
from typing import Tuple

def load_data() -> Tuple[int, int, dict[int, list[Tuple[int, int]]]]:
    with open('Week 1 edges.txt') as f:
        line = next(f)

        num_nodes, num_edges = [int(x) for x in line.split(" ")]
        
        adj_list = {i: [] for i in range(1, num_nodes+1)}
        
        for line in f:
            node_1, node_2, cost = [int(x) for x in line.split(" ")]
            
            adj_list[node_1].append((node_2, cost))
            adj_list[node_2].append((node_1, cost))
    
        return num_nodes, num_edges, adj_list

In [88]:
import math

def prims():
    
    init_node = 1

    num_nodes, num_edges, adj_list = load_data()

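    node_heap = Heap[int](type="min")

    # key[v] = cheapest edge from init_node to v, or infinity if there is none;
    # taking the min guards against parallel edges inserting a node twice
    init_keys = {node: math.inf for node in adj_list if node != init_node}
    for adj_node, cost in adj_list[init_node]:
        if adj_node != init_node:
            init_keys[adj_node] = min(init_keys[adj_node], cost)

    for node, key in init_keys.items():
        node_heap.insert(node, key)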
    
    x = set([init_node])
    acc_cost = 0 

    while len(x) != num_nodes:
        
        next_node, edge_cost = node_heap.extract_lead()
        x.add(next_node)
        acc_cost += edge_cost
        
        for edge in adj_list[next_node]:
            other_node, cost = edge

            if other_node in node_heap:
                curr_cost = node_heap.keys[other_node]
                node_heap.delete(other_node)
                
                new_cost = min(curr_cost, cost)
                node_heap.insert(other_node, new_cost)

    return acc_cost

prims()

-3612829