# CSPB 3104 Assignment 8: Problem Set
## Instructions

> This assignment is to be completed and uploaded to 
Moodle as a Python 3 notebook. 

> Submission deadlines are posted on Moodle. 

> The questions provided below will ask you to either write code or 
write answers in the form of markdown.

> Markdown syntax guide is here: [click here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

> Using markdown you can typeset formulae using latex.

> This way you can write nice readable answers with formulae like this:

>> The algorithm runs in time $\Theta\left(n^{2.1\log_2(\log_2( n \log^*(n)))}\right)$, 
wherein $\log^*(n)$ is the inverse _Ackermann_ function.

__Double click anywhere on this box to find out how your instructor typeset it. Press Shift+Enter to go back.__


----

## Question 1: Shortest Cycle Involving a Given Node.

You are given a directed graph $G: (V, E)$ using an adjacency list representation and a vertex (node) $u$ of the graph.
Write an algorithm to perform the following tasks:

__1(A)__ Write an algorithm that decides (true/false) whether the vertex $u$ belongs to a cycle.

What is the complexity for your algorithm in terms of the number of vertices $|V|$ and the number of edges $|E|$?

Note: Throughout this assignment you may describe your algorithms using words and definitely use algorithms that you have already learned in class. A brief description will do.


Run DFS starting from $u$ and classify the edges as they are explored.
If DFS finds an edge pointing back to $u$, return True. Because the search starts at $u$, every vertex it reaches is a descendant of $u$, so any such edge is a back edge and closes a cycle through $u$.
If DFS finishes without finding an edge pointing back to $u$, return False.

The time complexity of this algorithm is $O(|V|+|E|)$ because DFS is $O(|V|+|E|)$. In the worst case the algorithm traverses the entire graph, so its complexity matches that of DFS.
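
The check described above can be sketched in Python as follows. This is a minimal sketch, not a reference solution: the function name `vertex_on_cycle` and the dictionary-based adjacency list are assumptions made for illustration, and the DFS is written iteratively with an explicit stack.

```python
def vertex_on_cycle(adj, u):
    """Return True if vertex u lies on a directed cycle.

    adj is assumed to be a dict mapping each vertex to a list of its
    successors (a hypothetical adjacency-list representation).
    """
    visited = {u}
    stack = [u]
    while stack:
        v = stack.pop()
        for w in adj.get(v, []):
            if w == u:
                # Edge v -> u with v reachable from u: a cycle through u.
                return True
            if w not in visited:
                visited.add(w)
                stack.append(w)
    # Every vertex reachable from u was explored and none points back to u.
    return False
```

Each vertex is pushed at most once and each edge is examined at most once, which is consistent with the $O(|V|+|E|)$ bound.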

__1(B)__ Write an algorithm which prints the smallest length cycle involving the vertex $u$.

What is the complexity for your algorithm in terms of the number of vertices $|V|$ and the number of edges $|E|$?

Per the instructor's remark, the answer is the same as for 1(A).
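
A DFS back edge does not necessarily close the shortest cycle, so one option for 1(B) is to swap DFS for BFS, which visits vertices in order of distance from $u$ and keeps the same $O(|V|+|E|)$ bound. The sketch below assumes that substitution is acceptable. The name `shortest_cycle_through` and the dictionary-based adjacency list are hypothetical.

```python
from collections import deque


def shortest_cycle_through(adj, u):
    """Return a shortest cycle through u as a vertex list [u, ..., u],
    or None if u is on no cycle.

    adj is assumed to be a dict mapping each vertex to its successors.
    """
    parent = {u: None}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w == u:
                # v is the closest vertex with an edge back to u.
                # Walk parents back to u, then close the cycle.
                path = []
                x = v
                while x is not None:
                    path.append(x)
                    x = parent[x]
                path.reverse()
                return path + [u]
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None
```

Because BFS dequeues vertices in nondecreasing distance from $u$, the first dequeued vertex with an edge back to $u$ yields a cycle of minimum length.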


----

## Question 2: Tracing an Epidemic

An email with a malicious attachment has evaded the antivirus software of company X.
We know that the CEO's computer was infected during a business trip last month. Since then, investigators have 
been trying to determine whose mailboxes could be infected. For an employee's mailbox to be infected, he or she must have received
and read an email sent by an already infected employee. 

Starting from the time $0$ denoting when the CEO's mailbox was first infected, investigators have "metadata" for all
the emails from all employees in the form

$(P_i, P_j, t_k, t_l)$ meaning that employee $P_i$ sent an email at time $t_k$ to employee $P_j$, and $P_j$ opened the email at
time $t_l > t_k$.  We assume that $P_j$'s mailbox is infected instantaneously at time $t_l$ if $P_i$'s mailbox was infected before time $t_k$. 

You are given a collection of email records in the form given above, and you know that person $P_0$ is the CEO who was infected at time $t = 0$.

We ask if a given person of interest $P_j$ could have been infected by a given time of interest $t = T$.

__2(A)__ Write an algorithm that, given a person $P_j$ and time $T$, determines if $P_j$'s mailbox was infected before or at time $T$. What is the worst case complexity of your algorithm in terms of the number of persons $|P|$ and the number of emails sent $|E|$?

**Hint** You need to first make a graph that represents the possible flow of the "infection" through emails. It is easier to make a complicated graph (in this case, one where each vertex represents more than just a person) and then run a simple graph algorithm (one of the vanilla algorithms we learned this week, i.e., BFS, DFS, or topological sort) than to make a simple graph and run a complicated ad hoc algorithm on it. (If your algorithm requires table lookups or passing on metadata specific to the problem at hand, it's probably too complicated.)

## Making the Graph

Make a series of two-node directed graphs, one for each email. In each email graph, the start node is the sender together with the time the email was sent, and the end node is the receiver together with the time they opened the email. Label each node as a sender or a receiver. This is our forest of emails; it has $2|E|$ nodes and takes $\Theta(|E|)$ time to create.


Each edge has the form $([P_i, t_k, \text{sender}], [P_j, t_l, \text{receiver}])$ and is weighted with the time difference $t_l - t_k$. Also build a special `emails_sent_by` list in which each bucket corresponds to one of the people in $P$. Filling it takes $\Theta(|E|)$ time, one step per email sent.


Create a single infection node for the CEO at $t = 0$. Connect this node to the sender node of every email the CEO sent at or after $t = 0$, and weight each new edge with $t_k - 0 = t_k$. In the picture of my graph included here, the CEO sends out only 3 emails. Sender nodes are colored turquoise and receiver nodes purple. Each blue arrow is an email edge, connecting a sender to a receiver. The red ring around the CEO's first node marks his infected mailbox.


The graph is built by connecting every receiver node (say a $[P_5, 4, \text{receiver}]$ node) with a time-weighted directed edge to every one of that same person's ($P_5$) sender nodes whose time satisfies $t_s \ge t_{receiver}$ (here $t_s \ge 4$). The weight is the difference $t_s - t_{receiver}$, which is never negative. For example, the upper left-hand edge from the CEO ($P_0$) has a weight of 0, and the middle edge from the CEO has a weight of $2 - 0 = 2$. These edges are colored yellow in the diagram.


As you make the graph, disregard all emails received by the CEO. The CEO is always infected in this scenario, so emails back to him don't matter. 

Store the graph in an adjacency list. In the worst case the adjacency list has $O(2|E|) = O(|E|)$ buckets, one for each sender or receiver node.
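
The construction described above might be sketched as follows. This is only an illustration under stated assumptions: the function name `build_infection_graph`, the tuple layout `(person, time, role)` for nodes, and the `received_by` helper are all hypothetical, and email records are assumed to arrive as `(P_i, P_j, t_k, t_l)` tuples.

```python
from collections import defaultdict


def build_infection_graph(emails, ceo=0):
    """Build the adjacency list described above.

    emails: list of (sender, receiver, t_sent, t_opened) tuples.
    Returns (adj, start), where adj maps a node to a list of
    (neighbor, weight) pairs and start is the CEO infection node.
    """
    start = (ceo, 0, "infected")
    adj = defaultdict(list)
    emails_sent_by = defaultdict(list)  # person -> their sender nodes
    received_by = defaultdict(list)     # person -> their receiver nodes

    for p_i, p_j, t_k, t_l in emails:
        if p_j == ceo:
            continue  # emails back to the CEO are disregarded
        s = (p_i, t_k, "sender")
        r = (p_j, t_l, "receiver")
        adj[s].append((r, t_l - t_k))   # the email edge
        emails_sent_by[p_i].append(s)
        received_by[p_j].append(r)

    # The CEO infection node behaves like a receiver node at time 0.
    received_by[ceo].append(start)

    # Link each receiver node to the same person's later sender nodes.
    for person, recv_nodes in received_by.items():
        for r in recv_nodes:
            for s in emails_sent_by[person]:
                if s[1] >= r[1]:
                    adj[r].append((s, s[1] - r[1]))
    return adj, start
```

Note that this direct version links every receiver node to all of that person's later sender nodes, so the number of such edges can grow faster than $|E|$ when one person receives and sends many emails.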


![Infection Tree](InfectionTree.jpg "Infection Tree" )

This graph allows us to trace infections from the CEO at time 0, since only the emails in the direct flow of the CEO's infecting emails are reachable from the start node.


In the example, $P_3$ (person 3) sends an email at $t = 0$, but they haven't received any emails yet, so no path connects the larger graph to that email and its two nodes. 


## Finding the First Infection of Pj

Use a modified Dijkstra's algorithm to find the shortest paths from the CEO ($P_0$) node to each node that has $P_j$ as a receiver. The algorithm keeps track of the lengths of all candidate paths to $P_j$. Whenever it encounters a node with $P_j$ as the receiver, it records the time for that receiver node in a min-heap, taking $O(\log|E|)$ per insertion. $P_j$ was infected before or at time $T$ exactly when the smallest recorded time is at most $T$.

Because we used a min-heap in Dijkstra's, the total runtime to complete the algorithm is $O((|E| + 2|E|)\log|E|) = O(|E|\log|E|)$, counting $2|E|$ nodes and assuming the number of edges is also $O(|E|)$.

The graph creation took $O(|E|)$, which leaves us with $O(|E|\log|E|)$ complexity for our solution.
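
A minimal sketch of this query, assuming the graph is stored as a dict mapping each node `(person, time, role)` to a list of `(neighbor, weight)` pairs and that the CEO start node carries the role `"infected"`. The function name `infected_by` and that node layout are hypothetical, chosen only for illustration.

```python
import heapq


def infected_by(adj, start, person, T):
    """Return True if `person` has a receiver node at distance <= T
    from the CEO start node, using Dijkstra's algorithm with a min-heap."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        if d > T:
            break  # nodes are popped in nondecreasing distance order
        if node[0] == person and node[2] != "sender":
            return True  # person opened an infected email by time T
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return False
```

Since the start node sits at time 0 and every edge weight is a time difference, the distance to a node equals that node's timestamp, which is why comparing the distance with $T$ answers the question.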






__2(B)__ Write an algorithm that prints out each person who is infected in increasing order of the times in which they
first got infected.


## Running Dijkstra's to Print Infections

Run Dijkstra's algorithm again, but this time use it to find the shortest time distance from the CEO infection node to all receiver nodes.

Using a hash map keyed by the people in $P$, we store the current shortest distance for each $P_j$ receiver node, overwriting the stored time if the new infection time is shorter than the current value for $P_j$ and leaving it unchanged otherwise. Lookup and overwrite in a hash map take $O(1)$ expected time.

The hash map adds only $O(1)$ work per node, so Dijkstra's still runs in $O(|E|\log|E|)$ time, as in 2(A). The size of this map is $|P|$, and it is important to note that only receiver nodes are considered as possible solutions in the hash map.

Printing the people in increasing order of first infection time requires sorting the map's entries, which takes $O(|P|\log|P|)$, giving $O(|E|\log|E| + |P|\log|P|)$ time in total.
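
One way this might look in Python, again as a sketch under assumptions: nodes are `(person, time, role)` tuples, the graph is a dict of `(neighbor, weight)` lists, the CEO start node has the role `"infected"`, and the name `infection_order` is hypothetical.

```python
import heapq


def infection_order(adj, start):
    """Return (person, first_infection_time) pairs sorted by time."""
    dist = {start: 0}
    heap = [(0, start)]
    first_infected = {}  # hash map: person -> earliest infection time
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        person, _, role = node
        # Only receiver nodes (and the CEO start node) mark an infection.
        if role != "sender" and d < first_infected.get(person, float("inf")):
            first_infected[person] = d
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    # Sort the hash map entries by infection time before printing.
    return sorted(first_infected.items(), key=lambda item: item[1])
```

Printing is then a loop such as `for person, t in infection_order(adj, start): print(person, t)`. People who never appear in the result were not reachable from the CEO node and so were never infected.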

----

## Question 3: Testing Moth Age Expert

A person claims to have spent his life studying the emperor gum moth  *Opodiphthera eucalypti*. 
Given two moth samples, he claims to tell us which one is the older. Of course, 
we ourselves are no experts and they all in fact look the same to us.


We test the person as follows: (a) collect a large number $n$ of moth specimens; (b) randomly
select $m$ different pairs from our collection and have the person tell us which one is older; 
(c) record their answers and analyze them to see if they are _consistent_.

Write an algorithm to detect if the "expert" opinions are _consistent_. 


**Hint:** We have refrained from discussing what consistency means in this case, but we can provide you with an example as a hint.

__Example__ 

Suppose $n= 4$ and the expert says that

Specimen \# $1$ is older than $2$, $3$ is older than $4$, $4$ is older than $2$ and $2$ is older
than $3$.

The expert's opinion is clearly *inconsistent*.

Suppose $n=4$ and the expert says that

Specimen \# $1$ is older than $2$, $3$ is older than $4$ and $4$ is older than $1$. The
expert's answer is *consistent*.
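
One reading of the two examples above is that the answers are consistent exactly when the "is older than" relation contains no directed cycle. Under that assumption, a check can be sketched with a topological sort (Kahn's algorithm). The function name `opinions_consistent` and the input format, a list of `(older, younger)` pairs over specimens numbered $1$ to $n$, are hypothetical.

```python
from collections import defaultdict, deque


def opinions_consistent(n, older_than):
    """Return True if the 'a is older than b' pairs contain no cycle.

    older_than: list of (a, b) pairs meaning specimen a is older than b.
    """
    adj = defaultdict(list)
    indegree = {v: 0 for v in range(1, n + 1)}
    for a, b in older_than:
        adj[a].append(b)
        indegree[b] += 1

    # Repeatedly remove specimens that nothing is claimed to be older than.
    queue = deque(v for v in indegree if indegree[v] == 0)
    processed = 0
    while queue:
        v = queue.popleft()
        processed += 1
        for w in adj[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)

    # If some specimens were never removed, they lie on a cycle.
    return processed == n
```

On the first example the pairs $2 \to 3 \to 4 \to 2$ form a cycle, so the function returns `False`; on the second it returns `True`. The running time is $O(n + m)$.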



YOUR ANSWER HERE

----

## Question 4: Testing if an undirected graph is acyclic

You are given a connected, undirected graph $G$ with $n$ vertices as an adjacency list. Write an algorithm that checks whether $G$ has a cycle and runs in time $\Theta(n)$.

*Hint* A connected, undirected acyclic graph is a tree. Since you are already given that $G$ is connected, you are just checking if $G$ is a tree. How many edges would a tree have?


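$G$ is connected and undirected, so it is acyclic exactly when it is a tree.

A tree with $n$ nodes has $n - 1$ edges.

So if $G$ is acyclic: number of nodes = number of edges + 1. If $G$ has more edges than that, it contains a cycle.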
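
The edge-count test can be sketched as below. It assumes a simple graph (no self-loops or parallel edges) stored as a Python list of neighbor lists, so that `len` on each list is $O(1)$ and the whole count takes $\Theta(n)$. The name `has_cycle` is hypothetical.

```python
def has_cycle(adj):
    """adj: list of neighbor lists for a connected, simple undirected graph.

    Returns True if the graph has more than n - 1 edges, i.e. a cycle.
    """
    n = len(adj)
    # Each undirected edge appears in two neighbor lists.
    degree_sum = sum(len(neighbors) for neighbors in adj)
    num_edges = degree_sum // 2
    return num_edges > n - 1
```

If the adjacency list were built from linked lists without stored lengths, the same bound could be kept by stopping the count as soon as more than $2(n - 1)$ entries have been seen.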