# Graph Search

Breadth First Search: $O(m + n)$ using a FIFO queue
- explore the nodes in layers
- can compute shortest path
- can compute connected components of an undirected graph

Depth First Search: $O(m + n)$ using a LIFO stack via recursion
- explore aggressively
- compute a topological ordering of a directed acyclic graph
- compute strongly connected components of a directed graph

## BFS

### Pseudocode

```
BFS(graph G, start vertex s):

initialise all nodes as unexplored
mark s as explored
let Q = FIFO data structure, initialised with s

while Q is not empty:
    remove the first node of Q, call it v

    for each edge(v, w):
        if w is unexplored:
            mark w as explored
            add w to Q (at the end)
```

### Shortest Path

Let $dist(v)$ be the fewest number of edges on a path from $s$ to $v$.

Initialise,
$$
dist(v) = 
\begin{cases}
0 & \text{if} \quad v = s\\
+\infty & \text{if} \quad v \neq s
\end{cases}
$$

Then when considering the edge(v, w) include the line
```
if w is unexplored:
    set dist(w) = dist(v) + 1
```

At termination, $dist(v) = i$ says that $v$ is in the $i^{th}$ layer
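
As a rough illustration, the BFS pseudocode and the extra `dist` line can be combined into one Python function. This is a minimal sketch, assuming the graph is stored as a dictionary of adjacency lists; the name `bfs_distances` and the example graph are hypothetical, not part of the notes.

```python
from collections import deque
from math import inf


def bfs_distances(graph, s):
    """Return dist[v] = fewest edges on a path from s to v (inf if unreachable)."""
    dist = {v: inf for v in graph}   # +infinity doubles as "unexplored"
    dist[s] = 0
    queue = deque([s])               # FIFO queue, initialised with s

    while queue:
        v = queue.popleft()          # remove the first node of Q
        for w in graph[v]:
            if dist[w] == inf:       # w is unexplored
                dist[w] = dist[v] + 1
                queue.append(w)      # add w to the end of Q
    return dist


# hypothetical undirected example, so each edge is listed in both directions
example = {
    "s": ["a", "b"],
    "a": ["s", "c"],
    "b": ["s", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
    "e": [],
}
layers = bfs_distances(example, "s")
print(layers)
```

Nodes that BFS never reaches keep the value `inf`, which matches the initialisation above.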

## Undirected Connectivity

Let $G = (V, E)$ be an undirected graph. A connected component is a maximal connected region.

Formally, it is an equivalence class of the relation
$$
u \sim v \iff \exists \text{ a path between } u \text{ and } v \text{ in } G
$$

### Via BFS
```
mark all nodes unexplored

for i=1 to n:
    if i is not explored:
        BFS(G, i)
```

Runs in $O(m + n)$
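
The outer loop plus BFS could look roughly like this in Python. It is a sketch under the assumption that the undirected graph is a dictionary of adjacency lists with every edge listed in both directions; `connected_components` and the example graph are hypothetical names.

```python
from collections import deque


def connected_components(graph):
    """Return a list of components, each a list of nodes, for an undirected graph."""
    explored = set()
    components = []

    for start in graph:                  # outer loop: try every node as a start
        if start in explored:
            continue
        explored.add(start)
        queue = deque([start])
        component = []
        while queue:                     # one BFS discovers exactly one component
            v = queue.popleft()
            component.append(v)
            for w in graph[v]:
                if w not in explored:
                    explored.add(w)
                    queue.append(w)
        components.append(component)
    return components


# hypothetical example: two components and one isolated node
example = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
parts = connected_components(example)
print(parts)
```

Each node is enqueued once and each adjacency list is scanned once, which is where the $O(m + n)$ bound comes from.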

## DFS

Explore aggressively, backtrack when necessary

Pseudocode (recursive version)
```
DFS(graph G, start node s):

mark s as explored

for every edge(s,v):
    if v is unexplored:
        DFS(G, v)
```

## Topological ordering

For a directed acyclic graph $G = (V, E)$ with $n = |V|$, a topological ordering is a labelling $f$ of the nodes such that
$$
f : V \to \{1, 2, \ldots, n\} \quad \text{is a bijection,} \\
(u, v) \in E \implies f(u) < f(v)
$$

Where $(u, v)$ is a directed edge from $u$ to $v$.

Theorem: acyclic $\iff$ topological ordering exists

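Proof:

- ($\Leftarrow$) if $G$ contains a directed cycle, no ordering can work: $f$ would have to increase along every arc of the cycle and still return to its starting value.
- ($\Rightarrow$) every directed acyclic graph has at least one "sink" vertex (a vertex with no outgoing arcs).
    - by the pigeonhole principle, if a "sink" vertex doesn't exist, we can keep following outgoing arcs until some vertex repeats, so the graph must contain a cycle
- the last item in the topological ordering must be a "sink" vertex, so give one of them the label $n$.
- by induction, we can recurse on the smaller graph with that "sink" vertex removed.
- this new graph is still acyclic, so it has at least one "sink" vertex.
- different choices of sink vertices give different valid orderings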
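
The inductive argument above translates fairly directly into a procedure: repeatedly pick a sink, give it the largest unused label, and remove it. The sketch below is one way to write that in Python, assuming a dictionary of adjacency lists in which every node appears as a key; the function name `topological_order_by_sinks` is hypothetical.

```python
def topological_order_by_sinks(graph):
    """Label nodes n, n-1, ..., 1 by repeatedly removing a sink; raise on a cycle."""
    out_degree = {v: len(ws) for v, ws in graph.items()}
    incoming = {v: [] for v in graph}            # reversed adjacency lists
    for v, ws in graph.items():
        for w in ws:
            incoming[w].append(v)

    sinks = [v for v in graph if out_degree[v] == 0]
    f = {}
    label = len(graph)
    while sinks:
        v = sinks.pop()
        f[v] = label                             # a sink takes the largest unused label
        label -= 1
        for u in incoming[v]:                    # "remove" v from the graph
            out_degree[u] -= 1
            if out_degree[u] == 0:
                sinks.append(u)

    if len(f) != len(graph):                     # some nodes never became sinks
        raise ValueError("graph contains a directed cycle")
    return f


# hypothetical example DAG
dag = {"s": ["v", "w"], "v": ["t"], "w": ["t"], "t": []}
order = topological_order_by_sinks(dag)
print(order)
```

If the graph has a cycle, the nodes on it never reach out-degree zero, which is how the sketch detects that no ordering exists.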

### Via DFS

```
mark all nodes unexplored

current_label = n // next label to assign; sink vertices receive the largest labels

for each vertex v:
    if v is unexplored:
        DFS(G, v)
```

Within DFS, after the loop over outgoing edges has finished, append
```
set f(s) = current_label
current_label--
```
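
Putting the outer loop and the two extra lines together gives something like the following Python sketch. It assumes a DAG stored as a dictionary of adjacency lists with every node as a key; `topological_sort` is a hypothetical name, and the recursive form would need an explicit stack (or a raised recursion limit) on deep graphs.

```python
def topological_sort(graph):
    """Return f with f[u] < f[v] for every arc (u, v) of a DAG."""
    explored = set()
    f = {}
    current_label = len(graph)

    def dfs(s):
        nonlocal current_label
        explored.add(s)
        for v in graph[s]:
            if v not in explored:
                dfs(v)
        # every node reachable from s is already labelled, so s gets a smaller label
        f[s] = current_label
        current_label -= 1

    for v in graph:
        if v not in explored:
            dfs(v)
    return f


# hypothetical example DAG
dag = {"s": ["v", "w"], "v": ["t"], "w": ["t"], "t": []}
labels = topological_sort(dag)
print(labels)
```

The label is assigned only after the recursive calls return, so a node is always labelled later (with a smaller number) than everything it points to.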

## Strong Connectivity for Directed Graphs

The strongly connected components of a directed graph are the equivalence classes of the relation

$$
u \sim v \iff \exists \text{ a path from } u \text{ to } v \text{ and a path from } v \text{ to } u \text{ in } G
$$

Here both paths are directed. Any strongly connected region with more than one node therefore contains a directed cycle.

### Kosaraju's Two-Pass Algorithm

Computes the SCCs in $O(m + n)$ time

Algorithm:
1. Let $G^{rev}$ = G with all arcs in $G$ reversed
2. run DFS-loop on $G^{rev}$
3. run DFS-loop on $G$

The first pass on the reversed graph computes the order in which to traverse the nodes; the second pass discovers the SCCs.

In the first pass, remember
$$
f(v) = \text{"finishing time" of each vertex}
$$
Then run the second pass in decreasing order of finishing times

First pass pseudocode, where we keep track of finishing times:
```
DFS-Loop(graph G):

    global t = 0     // number of nodes processed so far
    global s = NULL  // current source vertex

    // nodes labelled 1 to n
    for i=n down to 1:
        if i is unexplored:
            s = i
            DFS(G, i)

DFS(graph G, node i):

    global t
    global s

    mark i explored

    for each arc (i, j) in G:
        if j is unexplored:
            DFS(G, j)

    t++
    f(i) := t
```
Then, in the second pass, process nodes in decreasing order of finishing time, keeping track of the leader of each node:

```
DFS(graph G, node i):

    global t
    global s

    mark i explored
    leader[i] := node s

    for each arc (i, j) in G:
        if j is unexplored:
            DFS(G, j)
```
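
For a small graph, the two passes can be sketched compactly in Python. This is an illustrative recursive version, assuming a dictionary of adjacency lists with every node as a key; the name `kosaraju_scc` and the example graph are hypothetical, and the recursion would overflow on a graph as large as the one in the programming assignment.

```python
def kosaraju_scc(graph):
    """Return {node: leader}; nodes sharing a leader form one SCC."""
    # step 1: reverse every arc
    reverse = {v: [] for v in graph}
    for u, vs in graph.items():
        for v in vs:
            reverse[v].append(u)

    # step 2: first pass on the reversed graph, recording nodes as they finish
    explored = set()
    finished = []

    def dfs_finish(i):
        explored.add(i)
        for j in reverse[i]:
            if j not in explored:
                dfs_finish(j)
        finished.append(i)               # position in this list = finishing time

    for i in graph:
        if i not in explored:
            dfs_finish(i)

    # step 3: second pass on the original graph, in decreasing finishing time
    explored.clear()
    leader = {}

    def dfs_leader(i, s):
        explored.add(i)
        leader[i] = s                    # s is the source of the current DFS call
        for j in graph[i]:
            if j not in explored:
                dfs_leader(j, s)

    for i in reversed(finished):
        if i not in explored:
            dfs_leader(i, i)
    return leader


# hypothetical example with SCCs {1, 2, 3}, {4, 5} and {6}
example = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [5]}
leader = kosaraju_scc(example)
print(leader)
```

Grouping the nodes by their `leader` value recovers the components.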

Analysis of correctness:

Claim 1:

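The SCCs of a directed graph $G$ induce an acyclic "meta-graph", where
"meta-nodes" = the SCCs $C_1, \ldots , C_k$ of $G$, and there is a meta-arc from $C_i$ to $C_j$ whenever $G$ has an arc from a node in $C_i$ to a node in $C_j$.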

Lemma:

Consider two adjacent meta nodes in $G$, $C_1$ and $C_2$ such that there is an edge from $C_1$ to $C_2$. 

Let $f(v)$ = finishing times of DFS-Loop in $G^{rev}$. Then,
$$
\max_{v \in C_1} f(v) < \max_{v \in C_2} f(v)
$$

Corollary:

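The maximum $f$-value of $G$ must lie in a "sink SCC".

By Corollary:

The 2nd pass of DFS-loop begins in a sink SCC, so its first DFS call discovers exactly that SCC. Among the remaining unexplored nodes, the largest finishing time again lies in a sink SCC of what is left, so each later DFS call peels off exactly one SCC, thus finding the SCCs correctly.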

Proof of Lemma,

In $G^{rev}$, the edge points from $C_2$ to $C_1$. In the first DFS-loop, there are two cases:

Case 1: we visit a node in $C_1$ before any node in $C_2$. Since the meta-graph is acyclic, $C_2$ is not reachable from $C_1$ in $G^{rev}$, so all nodes within $C_1$ are found and finished before any node in $C_2$ is explored. Thus, every finishing time in $C_2$ is larger than every finishing time in $C_1$
$$
f(v) > f(w) \quad \forall \quad v \in C_2 \: \text{,} \: w \in C_1
$$

Case 2: we visit a node in $C_2$ before any node in $C_1$. Call the first such node $v$. Every node in $C_1$ is still unexplored and reachable from $v$ in $G^{rev}$, so $v$ will have a larger finishing time than all nodes in $C_1$, as DFS only marks a node as finished after all of its outgoing edges have been explored.
$$
f(v) > f(w) \quad \forall \quad w \in C_1
$$

# Optional Theory Problem


## 1

In the 2SAT problem, you are given a set of clauses, where each clause is the disjunction of two literals (a literal is a Boolean variable or the negation of a Boolean variable). You are looking for a way to assign a value "true" or "false" to each of the variables so that all clauses are satisfied --- that is, there is at least one true literal in each clause. For this problem, design an algorithm that determines whether or not a given 2SAT instance has a satisfying assignment. (Your algorithm does not need to exhibit a satisfying assignment, just decide whether or not one exists.) Your algorithm should run in $O(m+n)$ time where $m$ and $n$ are the number of clauses and variables, respectively. [Hint: strongly connected components.]

Disjunction means $\lor$, the "or" operator. So each clause has one of the forms
$$
(a \lor b) \\
(\neg \; a \lor b) \\
(a \lor \neg \; b) \\
(\neg \; a \lor \neg \; b)
$$
for variables $a$, $b$, etc. taking Boolean values. To satisfy the instance, we need every clause to be true. Each disjunction can be written in an "implicative normal form". That is,

$(a \lor b)$ is equivalent to
$$
(\neg \; a \implies b) \quad \land \quad (\neg \; b \implies a)
$$
since if $a$ is not true then $b$ must be true for the "or" statement to be true, and vice versa (the two implications are contrapositives of each other). With many clauses, a graph can be built whose nodes are the variables $x$ as well as their negations $\neg x$, and whose edges are these implications.

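Note that if $\neg x$ is reachable from $x$ and vice versa, then the set of clauses does not have a solution, as
$$
(x \implies \neg x) \quad \land \quad (\neg x \implies x)
$$
is contradictory: the first implication forces $x$ to be false and the second forces it to be true.

Therefore, after we construct the graph, we can look for all the strongly connected components. If any SCC contains both a variable and its negation, then a satisfying assignment does not exist. Conversely, if no SCC does, a satisfying assignment exists (a standard result for 2SAT), so this check decides the problem.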

We can then use Kosaraju's two-pass DFS algorithm, which runs in $O(m + n)$ here because the implication graph has $2n$ nodes and $2m$ edges. To modify the code, we can create a dictionary that remembers the index of the SCC that each vertex is in. Then we can do $O(n)$ iterations to check whether any variable is in the same SCC as its negation.
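
The whole procedure can be sketched in Python as follows. The encoding is an assumption made for illustration: variables are numbered $1$ to $n$, the literal $x$ is the integer `+x` and $\neg x$ is `-x`, and a clause is a pair of literals. The function name `two_sat_satisfiable` is hypothetical, and the recursive DFS is only suitable for small instances.

```python
def two_sat_satisfiable(num_vars, clauses):
    """Decide satisfiability; literals are nonzero ints, +x for x and -x for not x."""
    literals = [x for v in range(1, num_vars + 1) for x in (v, -v)]
    graph = {x: [] for x in literals}
    reverse = {x: [] for x in literals}
    for a, b in clauses:
        # (a or b) is equivalent to (not a => b) and (not b => a)
        for u, v in ((-a, b), (-b, a)):
            graph[u].append(v)
            reverse[v].append(u)

    # Kosaraju pass 1: finishing order on the reversed graph
    explored = set()
    finished = []

    def dfs_finish(i):
        explored.add(i)
        for j in reverse[i]:
            if j not in explored:
                dfs_finish(j)
        finished.append(i)

    for x in literals:
        if x not in explored:
            dfs_finish(x)

    # Kosaraju pass 2: label each literal with the leader of its SCC
    explored.clear()
    leader = {}

    def dfs_leader(i, s):
        explored.add(i)
        leader[i] = s
        for j in graph[i]:
            if j not in explored:
                dfs_leader(j, s)

    for x in reversed(finished):
        if x not in explored:
            dfs_leader(x, x)

    # unsatisfiable exactly when some variable shares an SCC with its negation
    return all(leader[v] != leader[-v] for v in range(1, num_vars + 1))


# (x1 or x2) and (not x1 or x2): satisfiable by setting x2 = true
print(two_sat_satisfiable(2, [(1, 2), (-1, 2)]))
# all four clauses over x1, x2: no assignment satisfies every one
print(two_sat_satisfiable(2, [(1, 2), (1, -2), (-1, 2), (-1, -2)]))
```

The `leader` dictionary plays the role of the "index of the SCC" dictionary described above, and the final `all(...)` is the $O(n)$ check.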

# Programming Assignment 1

The file contains the edges of a directed graph. Vertices are labeled as positive integers from 1 to 875714. Every row indicates an edge, the vertex label in first column is the tail and the vertex label in second column is the head (recall the graph is directed, and the edges are directed from the first column vertex to the second column vertex). So for example, the $11^{th}$ row looks like: "2 47646". This just means that the vertex with label 2 has an outgoing edge to the vertex with label 47646.

Your task is to code up the algorithm from the video lectures for computing strongly connected components (SCCs), and to run this algorithm on the given graph.

Output Format: You should output the sizes of the 5 largest SCCs in the given graph, in decreasing order of sizes, separated by commas (avoid any spaces). So if your algorithm computes the sizes of the five largest SCCs to be 500, 400, 300, 200 and 100, then your answer should be "500,400,300,200,100" (without the quotes). If your algorithm finds less than 5 SCCs, then write 0 for the remaining terms. Thus, if your algorithm computes only 3 SCCs whose sizes are 400, 300, and 100, then your answer should be "400,300,100,0,0" (without the quotes).  (Note also that your answer should not have any spaces in it.)

WARNING: This is the most challenging programming assignment of the course. Because of the size of the graph you may have to manage memory carefully. The best way to do this depends on your programming language and environment, and we strongly suggest that you exchange tips for doing this on the discussion forums.

In [294]:
num_nodes = 875714

In [295]:
g = {i: [] for i in range(1, num_nodes+1)}
g_rv = {i: [] for i in range(1, num_nodes+1)}

with open('Week 1 SCC.txt', 'r') as f:
    for line in f:
        items = line.split()
        
        # first column is the tail of the arc, second column is the head
        tail = int(items[0])
        head = int(items[1])
        
        g[tail].append(head)
        g_rv[head].append(tail)

In [296]:
class Stack:
    def __init__(self):
        self.arr = []
        self.length = 0
        return
    
    def __iter__(self):
        return self

    def __next__(self):
        # iterating pops from the top, so a for-loop drains the stack in LIFO order
        if self.length == 0:
            raise StopIteration
        self.length -= 1
        return self.arr.pop()
        
    def push(self, elm):
        self.length += 1
        self.arr.append(elm)
        return
        
    def get(self):
        return self.arr[self.length-1]
     
    def pop(self):
        self.length -= 1
        return self.arr.pop()

In [297]:
print_cycles = 1000

Finishing times record the order in which nodes are completed. A node is completed once all of its outgoing arcs have been traversed and every child reached through them has itself been completed.

In [298]:
def dfs1_loop(g: dict[int, list[int]]):
    # first pass: returns a Stack whose pop order is decreasing finishing time
    
    f_order = Stack()
    
    explored = {i: False for i in range(1, num_nodes+1)}
    num_explored = 0
    cycle = 0
    
    for i in range(1, num_nodes+1):
        if explored[i] == False:
            
            # each entry pairs a node with an iterator over its remaining arcs,
            # so a node is only finished after every child reached from it has finished
            nodeStack = Stack()
            nodeStack.push((i, iter(g[i])))
            explored[i] = True
            num_explored += 1
            
            while nodeStack.length > 0:
                
                if num_explored // print_cycles > cycle:
                    cycle += 1
                    completion = (num_explored / num_nodes) * 100
                    print(f'progress: {format(completion, ".2f")}%, nodes: {num_explored}', end='\r')
                
                node, children = nodeStack.get()
                
                has_unexplored_children = False
                
                for child in children:
                    if explored[child] == False:
                        has_unexplored_children = True

                        explored[child] = True
                        num_explored += 1
                        # descend into this child before looking at its siblings;
                        # marking all siblings up front would not give a DFS finishing order
                        nodeStack.push((child, iter(g[child])))
                        break
                
                if not has_unexplored_children:
                    nodeStack.pop()
                    f_order.push(node)

    completion = (num_explored / num_nodes) * 100
    print(f'progress: {format(completion, ".2f")}%, nodes: {num_explored}', end='\n')
    
    return f_order

In [300]:
f_order = dfs1_loop(g_rv)

progress: 100.00%, nodes: 875714


In [302]:
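def dfs2_loop(g: dict[int, list[int]], f_order):
    # second pass: f_order yields nodes in decreasing finishing time;
    # leaders maps each source node to the list of nodes in its SCC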

    explored = {i: False for i in range(1, num_nodes+1)}
    num_explored = 0
    cycle = 0
    
    leaders = {}

    for i in f_order:
        if explored[i] == False:

            source = i
            leaders[source] = []

            nodeStack = Stack()
            nodeStack.push(i)
            
            while nodeStack.length > 0:
                
                if num_explored // print_cycles > cycle:
                    cycle += 1
                    completion = (num_explored / num_nodes) * 100
                    print(f'progress: {format(completion, ".2f")}%, nodes: {num_explored}', end='\r')
                
                node = nodeStack.get()

                if explored[node] == False:
                    explored[node] = True
                    num_explored += 1
                
                has_unexplored_children = False
                
                for child in g[node]:
                    if explored[child] == False:
                        has_unexplored_children = True

                        explored[child] = True
                        num_explored += 1
                        nodeStack.push(child)
                
                if not has_unexplored_children:
                    leaders[i].append(nodeStack.pop())

    completion = (num_explored / num_nodes) * 100
    print(f'progress: {format(completion, ".2f")}%, nodes: {num_explored}', end='\n')
    
    return leaders

In [303]:
leaders = dfs2_loop(g, f_order)

progress: 100.00%, nodes: 875714


In [305]:
sccs = [len(v) for v in leaders.values()]
sccs.sort(reverse=True)

In [307]:
sccs[0:10]

[434821, 968, 459, 313, 211, 205, 197, 177, 162, 152]
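
To turn the sorted sizes into the answer string the assignment asks for, a small helper along these lines could be used. This is a sketch; `format_answer` is a hypothetical name, and the zero padding follows the output format described in the assignment text.

```python
def format_answer(scc_sizes, k=5):
    """Join the k largest sizes with commas, padding with zeros if fewer than k exist."""
    top = sorted(scc_sizes, reverse=True)[:k]
    top += [0] * (k - len(top))          # pad when the graph has fewer than k SCCs
    return ",".join(str(size) for size in top)


# the two examples from the assignment's output format description
print(format_answer([500, 400, 300, 200, 100]))
print(format_answer([400, 300, 100]))
```

Calling `format_answer(sccs)` on the list computed above would give the submission string with no spaces.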