# 10. Graph Search

Motivation:
1. check whether a physical network (telecommunication, transport) is connected (you can get from anywhere to anywhere)
2. or whether a non-physical network is connected: co-starring actors
    * the nodes correspond to actors and actresses, 
    * an edge between two nodes, if they played a role in a common movie. 
    * an undirected graph,
    * Versions of this movie network are publicly available on the web, and there are lots of fun questions you can ask about it: 
        * what's the minimum number of hops (= edges = movies in which two people both played a role) from one actor to another? 
        * perhaps the most famous statistic of the movie network is the Bacon Number. 
            * It refers to the fairly ubiquitous actor Kevin Bacon. 
            * The Bacon Number of an actor is defined as the minimum number of hops you need in this movie graph to get to Kevin Bacon. 
            * The Bacon Number is fundamentally not about just any path but about shortest paths: the minimum number of edges you need to traverse to get from one actor to Kevin Bacon. 
3. shortest paths from one point to another = driving directions
4. formulate a plan of how to get from an initial state to some goal state: 
    * sudoku
        * directed graph
        * nodes = partially completed puzzles
        * edge = directed; one previously empty cell is filled with a number
    * a robotic hand manipulated to grab an object
5. computing connectivity information about graphs = connected components = pieces of the graph
    * undirected graphs = easy clustering heuristics derived from connectivity
    * directed graphs = more subtle = structure of the web
     
There are many different approaches to searching a graph. Here we focus on two:
* Breadth first search
* Depth first search
  
Goals: 
* We define a source (starting vertex) and the goal is to find everything findable from it
* We want to do it efficiently: 
    * looking at each piece of the graph only once, or at most a small constant number of times.
    * So we are looking for a running time that is linear in the size of the graph, that is, O(m+n).
  
**Generic algorithm**  
(given graph G and vertex s; no assumption on the number of edges m, which may be larger or smaller than the number of vertices n)

* look at each piece of the graph only once =>
    * we remember whether or not we have explored it before. 
    * we need one Boolean per node 
    * (1) initialize every node as unexplored except s, our starting point, which starts off as explored
* It's useful to think of the nodes explored thus far as territory conquered by the algorithm.
    * there is a frontier between the conquered and unconquered territory. 
    * the goal of the generic algorithm is that at each step we supplement the conquered territory by one new node, assuming that there is one adjacent to the territory already conquered
        * (2) while possible: 
            * choose an edge (u,v) with u explored and v unexplored
            * if none, stop
            * mark v explored
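
A minimal Python sketch of the generic algorithm. The graph layout (a dict mapping each node to a list of out-neighbours), the name `generic_search` and the example graph are assumptions made for illustration. The sketch rescans all frontier edges at every step, so it shows which nodes get explored, not the O(m+n) running time:

```python
def generic_search(graph, s):
    """Return the set of nodes explored by generic search from s.

    `graph` maps each node to a list of its out-neighbours (assumed layout).
    """
    explored = {s}                      # (1) everything unexplored except s
    while True:
        # (2) collect the frontier edges (u, v): u explored, v unexplored
        frontier = [(u, v) for u in explored
                    for v in graph.get(u, []) if v not in explored]
        if not frontier:                # if none, stop
            return explored
        _, v = frontier[0]              # which frontier edge we pick is arbitrary
        explored.add(v)                 # mark v explored


# Hypothetical example: x has an arc into s, but x is not reachable from s.
example = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": [], "x": ["s"]}
reached = generic_search(example, "s")
```

BFS and DFS are this same loop with a specific rule for choosing the frontier edge.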

**Claim:**  
at the termination of this algorithm, node v is explored <=> G has a path from s to v
  
* this claim and its proof hold whether G is an undirected or a directed graph. 
* the same is true of almost everything in this course about graph search, including breadth first search and depth first search
* the one big difference: connectivity computation

Proof: 
* (=>) easy induction on the number of iterations
* (<=) by contradiction:
    * suppose G has a path p from s to v 
    * but v is unexplored at the end of the algorithm
        * s is certainly explored, since we initialize the algorithm this way
        * we traverse this path from s to v: some vertices in the middle could be explored, some not
        * since the path starts at an explored vertex and ends at an unexplored one, somewhere along it there is an edge that goes from an explored vertex to an unexplored one: 
        * <=> an edge (u,w) on the path where u is explored and w is unexplored
        * u can be s and w can be v
        * How is it possible that the algorithm terminated while this edge (u,w) still exists?
        * generic search won't terminate without exploring w
        * contradiction
        * QED
  
**BFS vs DFS**
* Note:
    * there is a cut between the explored and unexplored parts, with edges crossing it
    * How to choose among frontier edges?
* **Breadth-First-Search:**
    * Explore nodes in layers: 
        * starting point s = Layer-0. 
        * neighbors of s = Layer-1, 
        * Layer-2 = nodes that are neighbors of Layer-1 but are not already in Layer-0 or Layer-1
        * So Layer-(i+1) = nodes next to Layer-i that you haven't seen yet. 
     * there's a close correspondence between these layers and **shortest path** distances.
     * can also be used to compute **connected components** of an undirected graph; it runs in linear time O(n+m) if we use the right data structure: a queue (FiFo)
* **Depth-First-Search**
    * A strategy of how you aggressively explore a maze
    * It's not very useful for computing shortest path information, but especially in directed graphs it does some remarkable things for us. 
        * In a directed acyclic graph (a directed graph with no directed cycles) it gives us what's called a **topological ordering**. 
            * It sequences the nodes in a linear ordering from first to last so that all of the arcs of the directed graph go forward. 
            * This is useful, for example, if you have a number of tasks that need to be completed subject to precedence constraints. 
        * Compute **strongly connected components of directed graphs** in O(m+n) time (for undirected graphs it doesn't matter whether you use BFS or DFS)
        
**Time-wise**  
* both of these are superb strategies for exploring a graph. 
* They're both linear time with very good constants. 
* Like BFS, DFS runs in O(m+n) time in a graph with m edges and n vertices. 
* You do want to use a different data structure, reflecting the different search strategy. 
    * Because you explore aggressively (as soon as you get to a node, you immediately start exploring its neighbors), you want a last-in first-out data structure, also known as a stack. 
    * Depth first search also admits a very elegant recursive formulation, and in that formulation, you don't even need to maintain a stack data structure explicitly, the stack is implicitly taken care of in the recursion.    

### 10.1 BFS basics 

* Linear time: O(m+n), no conditions on m and n
* the explanation uses an undirected graph, but the entire procedure works in exactly the same way for a directed graph.

**[Algorithm]**  
BFS(Graph G, start vertex s)
* assume that all nodes are unexplored 
* mark s as explored
* let Q = queue data structure (FiFo), initialized with s
    * you can add stuff to the back in constant time 
    * and take stuff from the front in constant time
    * FiFo = first in, first out
* While Q is not empty:
    * remove the first node of Q, call it v
    * for each edge (v,w):
        * if w unexplored: 
            * mark w as explored
            * add w to Q at the end
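
The pseudocode above can be sketched in Python roughly as follows. The adjacency-list dict, the function name `bfs` and the small example graph are assumptions chosen for illustration; `collections.deque` plays the role of the FiFo queue, with constant-time `append` at the back and `popleft` at the front:

```python
from collections import deque


def bfs(graph, s):
    """Return the nodes reachable from s in the order BFS discovers them."""
    explored = {s}                 # mark s as explored
    order = [s]
    queue = deque([s])             # FiFo queue, initialized with s
    while queue:                   # while Q is not empty
        v = queue.popleft()        # remove the first node of Q
        for w in graph[v]:         # for each edge (v, w)
            if w not in explored:
                explored.add(w)    # mark w as explored
                order.append(w)
                queue.append(w)    # add w to Q at the end
    return order


# Illustrative undirected graph, stored with both directions of every edge.
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("c", "e"), ("d", "e")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

order = bfs(graph, "s")            # ['s', 'a', 'b', 'c', 'd', 'e']
```

The discovery order depends on the order in which neighbours are listed; any such order is a valid BFS execution.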

**[Steps]**  
So now, when we follow the code, what happens? 
* G = {(s,a), (s,b), (a,c), (b,c), (b,d), (c,d), (c,e), (d,e)}
* Well in the first iteration of the while loop we ask is the queue empty? 
    * No it's not, because S is in it. 
    * So we remove in this case the only node of the queue. 
    * It's S. 
* And then we iterate over the edges incident to S. 
    * Now there are two of them. 
        * between S and A 
        * between S and B. 
    * And again this is still a little under specified: the algorithm doesn't tell us which of those two edges we should look at. 
        * Turns out it doesn't matter. 
        * Each of those is a valid execution of breadth first search. 
        * But for concreteness, let's suppose that of the two possible edges, we look at the edge (s,a). 
    * So, then we ask, has A already been explored? 
        * No, it hasn't. 
    * We add A to the end of the queue and mark it as explored. 
* Now we go back to the for loop and move on to the second edge. 
    * It is also incident to S: the edge (s, b). 
    * So, we ask, have we already explored B? 
        * Nope, 
        * B gets marked as explored 
        * gets added to the queue at the end. 
* So the queue at this juncture has 
    * first a record for A, cause that was the first one we put in it after we took S out. 
    * And then B follows A in the queue. 
    * Again, depending on the execution this could go either way. But for concreteness, I've done it so that A got added before B. 
* So now we go back up to the while loop
    * we say is the queue empty? 
        * Certainly not. 
        * There's actually two elements. 
    * Now we remove the first node from queue, 
        * in this case, that's the node A 
        * let's look at all the edges incident to A. 
        * And in this case A has two incident edges. 
            * we look at (a,s) and ask an if statement: has S already been explored? Yes it has, no action needed
            * (a,c) – C we haven't seen yet. 
            * C we can mark as explored and put in the queue. 
            * So, that's gonna be our number four. 
* So how has the queue changed? 
    * we got rid of a, 
    * so now b is at the front, and we added c at the end. 
* We go back to the while loop, 
    * the queue is not empty, 
    * we take off the first vertex = b
    * b has three incident edges, 
        * (b,s): irrelevant, s is already explored 
        * (b,c): also irrelevant, c is already explored 
        * (b,d)
            * we can take the node d, mark it as explored and add it to the queue. 
            * So d is going to be the fifth one that we see. 
* And now the queue has the element C followed by D. 
* etc
  
The nodes are numbered in the order in which they are discovered, which proceeds layer by layer
  
Breadth first search is a good way to explore a graph:
* First, it finds everything findable, and obviously nothing else 
* and second, it does it without redundancy

**Claim 1**  
at the end of BFS, v explored <=> G has a path from s to v (G - directed or undirected)  
Reason: special case of the generic algorithm
* the forward direction of this claim is clear: 
    * if you actually find something (= if something's marked as explored) 
    * it's only because you found a sequence of edges that led you there. 
    * So the only way v gets marked as explored is if there's a path from s to v. 
* Conversely, to prove that any v with a path from s will be found, you can proceed by contradiction: 
    * look at the part of the path from s to v that BFS does successfully explore, 
    * and then ask: why didn't it go one more hop? 
    * It never would have terminated before reaching all the way to v. 
    

**Claim 2**   
* running time of the main while loop = O(n_s + m_s), where 
    * n_s – number of nodes reachable from s
    * m_s – number of edges reachable from s

Reason: by inspection of the code

BFS(Graph G, start vertex s)
* assume that all nodes are unexplored 
* mark s as explored **O(1)**
* let Q = queue data structure (FiFo), initialized with s **O(1)**
* While Q is not empty:
    * remove the first node of Q, call it v **O(n_s) in total**
    * for each edge (v,w): **O(m_s) in total**
        * if w unexplored: **O(1)**
            * mark w as explored
            * add w to Q at the end
            
So we can summarize the total work done in this while loop as follows. 
* initialization = O(1)
* First we just think about the vertices
    * so in this search we're only gonna ever deal, with the vertices that are findable from S.
        * there are n_s of them. 
    * And what do we do for the given node? 
        * we insert it into the queue 
        * and we delete it from the queue. 
    * we're never gonna deal with a single node more than once. 
    * => constant time overhead per vertex that we ever see, 
    * => for all vertices O(n_s) 
* Now, a given edge, we might look at it twice. 
    * So, for an edge (v,w): once when we first look at the vertex v, and again when we look at the vertex w. 
    * Each time we look at an edge we do constant work. 
    * So that means we're only gonna do constant work per edge. 
    * O(2 m_s) = O(m_s)
* So the overall running time is going to be proportional to 
    * the number of vertices findable from S 
    * plus the number of edges findable from S. 
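
To make the counting argument concrete, here is a small instrumented BFS in Python. The counters, the function name `bfs_with_counts` and the example graph are illustrative assumptions. On an undirected graph it should dequeue each reachable node once and inspect each reachable edge twice, once from each endpoint:

```python
from collections import deque


def bfs_with_counts(graph, s):
    """Run BFS from s and count dequeue operations and edge inspections."""
    explored = {s}
    queue = deque([s])
    dequeues = 0        # one per reachable node: n_s in total
    edge_checks = 0     # one per (node, incident edge) pair: 2 * m_s if undirected
    while queue:
        v = queue.popleft()
        dequeues += 1
        for w in graph[v]:
            edge_checks += 1
            if w not in explored:
                explored.add(w)
                queue.append(w)
    return dequeues, edge_checks


# Hypothetical graph: path s - a - b, plus a separate edge x - y
# that is not reachable from s and is therefore never touched.
graph = {"s": ["a"], "a": ["s", "b"], "b": ["a"], "x": ["y"], "y": ["x"]}
dequeues, edge_checks = bfs_with_counts(graph, "s")   # n_s = 3, m_s = 2
```

Here `dequeues` equals n_s = 3 and `edge_checks` equals 2 * m_s = 4, so the nodes x and y contribute nothing to the work.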
    

### 10.2 BFS and Shortest paths

Goal: extend BFS(Graph G, start vertex s) to compute, for every node v, dist(v) = the fewest number of edges on a path from s to v  
(you can define this in the same way for undirected graphs or directed graphs)

Extra code:
* in the initialization step, you set your preliminary estimate of dist(v), 
    * the shortest path distance from s to vertex v 
        * if v equals s, you know you can get from s to s on a path of length zero, the empty path. 
        * for any other vertex, you have no idea yet whether there's a path to v at all. 
            * So initially put plus infinity for all vertices other than the starting point. 
            * This is something we will of course revise once we actually discover a path to v. 

    * initialize dist(v): 
        * 0 if v = s
        * + infinity if v ≠ s

* And the only other extra code you have to add is: 
    * when considering edge(v, w):
        * if w unexplored, then set dist(w) = dist(v) + 1
        
        
**[Algorithm]**  
ShortestPath(Graph G, start vertex s)
* assume that all nodes are unexplored
* mark s as explored 
* let Q = queue data structure (FiFo), initialized with s
* initialize dist(v): 
    * 0 if v = s
    * + infinity if v ≠ s
* While Q is not empty:
    * remove the first node of Q, call it v 
    * for each edge (v,w): 
        * if w unexplored: 
            * mark w as explored
            * add w to Q at the end
            * set dist(w) = dist(v) + 1
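
A possible Python rendering of this extension, under the assumption that the graph is a dict whose keys include every node (so that unreachable nodes keep the value infinity). The name `bfs_distances` and the example graph are illustrative:

```python
from collections import deque
import math


def bfs_distances(graph, s):
    """Return dist[v] = fewest edges on a path from s to v (inf if none)."""
    dist = {v: math.inf for v in graph}    # + infinity if v != s
    dist[s] = 0                            # 0 if v = s
    explored = {s}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in explored:
                explored.add(w)
                queue.append(w)
                dist[w] = dist[v] + 1      # the one extra line
    return dist


# Hypothetical undirected graph; z is isolated, so dist["z"] stays infinite.
graph = {"s": ["a", "b"], "a": ["s", "c"], "b": ["s", "c"],
         "c": ["a", "b", "d"], "d": ["c"], "z": []}
dist = bfs_distances(graph, "s")
```

For this graph, c gets distance 2 even though it is seen from both a and b, because only the first discovery sets `dist`.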

**Claim**  
at termination dist(v) = i <=> v is in the i-th layer  
(i.e. the shortest s-v path has i edges)  
Proof idea (induction on the layer i): 
* every layer-i node w is added to Q by a layer-(i-1) node v via the edge (v,w)
    *  The inductive hypothesis tells us that 
        * distances were correctly computed for all nodes in the lower layers. 
        * So in particular, 
            * the node v from layer i-1 that was responsible for discovering w in layer i 
            * has its distance computed as i-1. 
            * dist(w) is assigned to be one more than dist(v), namely i. 
        * That pushes through the inductive step: everything in layer i indeed gets the correct label of shortest path distance i from s. 
        
Shortest path calculation is an additional feature you get from this particular (BFS) algorithm. Other search strategies do not provide it.


### 10.3 BFS and Undirected connectivity

Let G = (V, E) be an undirected graph  
(And we're not going to assume that G is connected, because a part of a goal of this algorithm is to figure out if it is)
  
**Connected components of an undirected graph** = maximal regions that are connected (in which you can get from any vertex to any other)

**Formal definition:**  
* equivalence classes of the relation u~v <=> there exists a u-v path in G
    * ~ is an equivalence relation: 
        * reflexive: everything has to be related to itself: in a graph there is a path from any node to itself = the empty path.
        * symmetric: if u and v are related then v and u are related. Because this is an undirected graph, a u-v path is also a v-u path.
        * transitive: if u and v are related and so are v and w, then so are u and w (concatenate the two paths). 
        
**Goal:** compute all connected components in O(n+m) time  
**Why?**
* for physical networks: check if it is broken
* graph visualization
* clustering: quick and dirty heuristic, super fast: linear time
    * set of web pages, genomes, images
    * and you have a pairwise function that scores how different two objects are: low score = alike, high score = different 
    * Now here's a graph you can construct using these objects and the similarity data that you have about them. 
        * So you can have a graph where the nodes are the objects. 
        * for each object, you have a single node 
        * for a given pair of nodes, you put in an edge if and only if the two objects are very similar. 
        * So for example, you could put in an edge between two objects if and only if the score is at most ten.
    * in this graph you've constructed, you can find the connected components. 
        * So each of these connected components will be a group of objects, which more or less are all very similar to each other
        * So this would be a cluster of closely related objects in your database
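
A rough Python sketch of this clustering heuristic. The function name `cluster`, the `score` callback, the threshold and the toy data are all assumptions for illustration. Note that building the graph by scoring every pair takes quadratic time in the number of objects; only the connected-components step is linear in the size of the resulting graph:

```python
from collections import deque
from itertools import combinations


def cluster(items, score, threshold):
    """Group items into connected components of the 'very similar' graph.

    An edge joins x and y iff score(x, y) <= threshold (low score = alike).
    Items are assumed to be hashable and distinct.
    """
    graph = {x: [] for x in items}              # one node per object
    for x, y in combinations(items, 2):
        if score(x, y) <= threshold:
            graph[x].append(y)
            graph[y].append(x)

    explored = set()
    clusters = []
    for start in items:                         # outer loop over all objects
        if start in explored:
            continue
        explored.add(start)
        component = [start]
        queue = deque([start])
        while queue:                            # BFS inside one component
            v = queue.popleft()
            for w in graph[v]:
                if w not in explored:
                    explored.add(w)
                    component.append(w)
                    queue.append(w)
        clusters.append(component)
    return clusters


# Toy data: numbers are "alike" when they differ by at most 10.
points = [1, 3, 9, 40, 45, 100]
clusters = cluster(points, lambda x, y: abs(x - y), 10)
# clusters == [[1, 3, 9], [40, 45], [100]]
```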

    
**[Algorithm]**
* all nodes unexplored [assume labelled 1 to n] **O(n)**
* for i from 1 to n **O(n)**
    * if i not yet explored **O(n)** (over all of the connected components)
        * BFS(G,i) => finds all nodes in i's connected component and marks them as explored
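
The outer loop plus BFS can be sketched in Python as below, assuming nodes labelled 1 to n and the graph given as a list of undirected edges. The function name `connected_components` and the returned "component index per node" format are illustrative choices, not part of the notes:

```python
from collections import deque


def connected_components(n, edges):
    """Return a dict mapping each node 1..n to the index of its component."""
    adj = {i: [] for i in range(1, n + 1)}
    for u, v in edges:                      # undirected: store both directions
        adj[u].append(v)
        adj[v].append(u)

    component_of = {}                       # doubles as the "explored" marker
    next_index = 0
    for i in range(1, n + 1):               # for i from 1 to n
        if i in component_of:               # already explored by an earlier BFS
            continue
        component_of[i] = next_index
        queue = deque([i])
        while queue:                        # BFS(G, i)
            v = queue.popleft()
            for w in adj[v]:
                if w not in component_of:
                    component_of[w] = next_index
                    queue.append(w)
        next_index += 1
    return component_of


component_of = connected_components(6, [(1, 2), (2, 3), (4, 5)])
# component_of == {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2}
```

The graph is connected exactly when every node ends up with component index 0.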

Note: finds every connected component.  
Running time: O(m+n), linear 
* Depending on the graph, either of m and n might be bigger than the other. 
* So why is it O(m+n)? 
    * As far as the nodes go, we have the initialization where we mark them all as unexplored, which takes constant time per node. 
    * We have the basic overhead of the for loop, which is constant time per node => O(n) for all nodes. 
    * And then recall we proved that breadth first search does an amount of work proportional to the size of the connected component it explores. 
        * You do constant work for each node in that connected component. 
        * Each of the nodes of the graph is in exactly one of the connected components. 
        * So you'll do constant work for each node in the BFS in which you discover that node. 
        * So that's again O(n) over all of the connected components.
    * As far as the edges go, note we don't even look at edges until we're inside one of these BFS calls. 
        * They play no role in the outer for loop or in the pre-processing. 
        * And remember what we proved about an invocation of breadth first search: 
        * you only do a constant amount of work per edge in the connected component that you're exploring. 
        * In the worst case, you look at an edge once from each endpoint, and each look triggers a constant amount of work.
        * So when you discover a given connected component, the edge work is proportional to the number of edges in that connected component. 
        * Each edge of the graph is in exactly one of the connected components, 
        * so over the entire for loop, over all of these BFS calls, each edge of the graph is responsible for only a constant amount of work. 
    * So, summarizing, because breadth-first search from a given starting node runs in time proportional to the size of that node's component, the total is
        * O(m+n) 
            * m: O(1) per edge in each BFS
            * n: O(1) per node

### 10.4 DFS basics

* explore aggressively, only backtrack when necessary
    * from the current node we immediately move on to one of its unexplored neighbours
    * when we reach a node whose neighbours are all explored, we retreat to earlier nodes and check whether they have edges we haven't followed yet

* Why bother with another strategy?
    * also computes a topological ordering of a directed acyclic graph
    * and strongly connected components of directed graphs

* Linear time O(m+n)

**[Algorithm 1]**  
Minor modifications in BFS
* Difference: 
    * instead of a queue, a stack: LiFo
        * supports constant time insertions to the front and constant time deletions from the front

DFS(Graph G, start vertex s)
* assume that all nodes are unexplored
* mark s as explored
* let S = stack data structure (LiFo), initialized with s
* While S is not empty:
    * remove the first node of S, call it v
    * for each edge (v,w):
        * if w unexplored:
            * mark w as explored
            * add w to the front of S
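
A stack-based DFS sketch in Python, using a plain list as the LiFo stack. One deliberate difference from the outline above, flagged here as a design choice of this sketch: a node is marked as explored when it is popped rather than when it is pushed, so that the visiting order matches the recursive formulation. The function name `dfs_iterative` and the example graph are assumptions:

```python
def dfs_iterative(graph, s):
    """Return the nodes reachable from s in depth-first visiting order."""
    explored = set()
    order = []
    stack = [s]                          # list used as a LiFo stack
    while stack:
        v = stack.pop()                  # remove the most recently added node
        if v in explored:                # a node may be pushed more than once
            continue
        explored.add(v)
        order.append(v)
        # Push in reverse so the first listed neighbour is visited first.
        for w in reversed(graph[v]):
            if w not in explored:
                stack.append(w)
    return order


# Hypothetical directed graph.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
order = dfs_iterative(graph, "s")        # ['s', 'a', 'c', 'd', 'b']
```

Each arc causes at most one push, so the stack never holds more than m + 1 entries and the running time stays O(m+n).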


**[Algorithm 2]**  
Recursive version

DFS(Graph G, start vertex s)
* mark s as explored
* for every edge (s,v):
    * if v unexplored
        * mark v as explored
        * DFS(G,v)
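
The recursive version translates almost line by line into Python. This is a sketch under the assumption of an adjacency-list dict; the names `dfs_recursive`, `explored` and `order` are illustrative. For very deep graphs, Python's recursion limit (see `sys.getrecursionlimit()`) may be hit, in which case an explicit stack is the safer choice:

```python
def dfs_recursive(graph, s, explored, order):
    """Explore everything reachable from s, recording the visiting order."""
    explored.add(s)                      # mark s as explored
    order.append(s)
    for v in graph[s]:                   # for every edge (s, v)
        if v not in explored:
            dfs_recursive(graph, v, explored, order)


# Hypothetical directed graph.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
explored, order = set(), []
dfs_recursive(graph, "s", explored, order)   # order: s, a, c, d, b
```

The program's call stack plays the role of the explicit stack: the call for a node returns only after the calls for all of the neighbours it discovered have returned.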

**Claim 1**  
at the end of DFS, v explored <=> G has a path from s to v (G - directed or undirected)  
Reason: special case of the generic algorithm

**Claim 2**  
running time is O(n_s + m_s) (G - directed or undirected)
* n_s – number of nodes reachable from s 
* m_s – number of edges reachable from s 

Reason: we look at each node reachable from s at most once, and at each edge reachable from s at most twice (once from each endpoint)


### 10.5 Topological sort

**Definition**  
* a topological ordering of a **directed graph** is an ordering of the vertices of the graph so that 
    * all of the arcs only go forward in the ordering. 

Mathematical def:  
a topological ordering of a directed graph G is a labelling f of the vertices of G such that:
* (1) the f(v)'s are the set {1, 2, ..., n}
* (2) (u,v) belongs to G => f(u) < f(v)
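
The two conditions of the definition can be checked mechanically. Below is a small Python checker; the function name `is_topological_ordering` and the example labellings are assumptions for illustration:

```python
def is_topological_ordering(nodes, arcs, f):
    """Check conditions (1) and (2) for a labelling f (dict node -> int)."""
    n = len(nodes)
    # (1) the labels are exactly the set {1, 2, ..., n}
    labels_ok = sorted(f[v] for v in nodes) == list(range(1, n + 1))
    # (2) every arc (u, v) goes forward: f(u) < f(v)
    arcs_ok = all(f[u] < f[v] for u, v in arcs)
    return labels_ok and arcs_ok


nodes = ["s", "v", "w", "t"]
arcs = [("s", "v"), ("v", "t"), ("s", "w"), ("w", "t")]

good = {"s": 1, "w": 2, "v": 3, "t": 4}   # all arcs go forward
bad = {"s": 1, "t": 2, "v": 3, "w": 4}    # arc (v, t) goes backward
```

A graph can have several valid orderings: swapping the labels of v and w in `good` gives another one.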

**Motivation**  
Ordering of tasks while respecting all precedence constraints

**Note**  
* If a directed graph has a directed cycle, it has no topological ordering
* What if you don't have a cycle?

**Theorem**  
no directed cycle => can compute a topological ordering

**Solution 1 Algorithm**
* Observation: Every directed acyclic graph has a sink vertex. That is a vertex without any outgoing arcs.
    * Suppose there is not (ie every vertex has an outgoing edge)
    * We can start in an arbitrary node. 
    * We know it's not a sink vertex, because we're assuming there aren't any. 
    * So there's an outgoing arc, we follow it.
    * We get to some other node.
    * We just keep following outgoing arcs, and we do this as long as we want because every vertex has at least one outgoing arc. 
    * There's a finite number of vertices; say this graph has N vertices. 
        * So if we follow N arcs, we are going to see N+1 vertices. 
        * So by the pigeon-hole principle, we're going to see some vertex twice. 
        * we have exhibited a directed cycle. 
    * we just proved that a graph with no sink vertex has to have a directed cycle. 
    * So a directed acyclic graph has to have at least one sink vertex
    * QED
* Why sinks matter: suppose the acyclic graph has a topological ordering
    * Consider the vertex that goes last in it
    * Every arc that goes backward in the ordering is a violation
    * So the last vertex cannot have an outgoing arc = it must be a sink vertex
    * All nodes with outgoing arcs must be placed earlier
    * Since the graph is directed acyclic, we know there is a sink vertex (at least one)

* (1) let v be a sink vertex of G (if there are many, pick one arbitrarily)
* (2) put it into the n-th position of the ordering: f(v) = n
* (3) recurse on the rest of the graph G-{v} 
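
One way to sketch steps (1) to (3) in Python. Instead of literally recursing on G-{v}, this version keeps an out-degree counter per node and a list of current sinks, which is an implementation choice of this sketch rather than something stated in the notes. The function name `topological_sort_by_sinks` is hypothetical:

```python
def topological_sort_by_sinks(nodes, arcs):
    """Assign labels n, n-1, ..., 1 by repeatedly removing a sink vertex."""
    out_degree = {v: 0 for v in nodes}
    predecessors = {v: [] for v in nodes}
    for u, v in arcs:
        out_degree[u] += 1
        predecessors[v].append(u)

    sinks = [v for v in nodes if out_degree[v] == 0]
    f = {}
    label = len(nodes)
    while sinks:
        v = sinks.pop()                  # (1) pick any sink of the remaining graph
        f[v] = label                     # (2) put it in the last free position
        label -= 1
        for u in predecessors[v]:        # (3) "delete" v: arc (u, v) disappears
            out_degree[u] -= 1
            if out_degree[u] == 0:       # u has become a sink
                sinks.append(u)
    if len(f) != len(nodes):             # no sink left but nodes remain
        raise ValueError("graph has a directed cycle")
    return f


nodes = ["s", "v", "w", "t"]
arcs = [("s", "v"), ("v", "t"), ("s", "w"), ("w", "t")]
f = topological_sort_by_sinks(nodes, arcs)
```

With this bookkeeping every node and every arc is handled a constant number of times.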

Why does this algorithm work?
* (1) we need to argue that in every iteration we can indeed find the sink vertex, 
     * if you take a directed acyclic graph and you delete one or more vertices from it, you're still going to have a directed acyclic graph
     * You can't create cycles by just getting rid of stuff. 
     * You can only destroy cycles, and we started with no cycles. 
     * So through all the intermediate recursive calls we have no cycles, and by our first observation there is always a sink.
* (2) The second thing that we have to argue is that we really do produce a topological ordering. 
    * That means for every edge of the graph, it goes forward in the ordering. 
    * That is the head of the arc is given a position later than the tail of the arc.
    * And this simply follows because we always use sink vertices. 
         * So consider the vertex v which is assigned to the position i. 
         * This means then, that when we're down to a graph that only has i vertices remaining, v is the sink vertex.
         * If v is the sink vertex when only the first i vertices remain, what property does it have in the original graph? 
         * Well, it means all of outgoing arcs that it has have to go to vertices that were already deleted and assigned higher positions. 
         * So for every vertex, by the time it actually gets assigned a position, it's a sink and it only has incoming arcs from the as yet unassigned vertices. 
         * Its outgoing arcs all go forward to vertices that were already assigned higher positions, and got deleted previously from the graph.

This guarantees that as long as you don't have a cycle, a topological ordering does indeed exist. Proof by construction via the algorithm above.
* Good algorithm: can be implemented in O(n+m) time

**Solution 2 DFS-based algorithm**

DFS(graph G, start vertex s) = inner loop
* mark s as explored
* for every edge (s,v):
   * if v unexplored
      * DFS(G,v)
* set f(s) = current_label
* current_label -= 1

DFS-Loop(graph G) = outer loop
* Mark all nodes unexplored
* current_label = n (global variable to keep track of the ordering)
* for each vertex v belonging to G
    * if v not yet explored (in some previous DFS call)
        * DFS(G, v)
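
A Python sketch of the two loops above. The global `current_label` becomes a `nonlocal` variable of a nested function; the function name `topological_sort_dfs` and the example graph are assumptions, and the input is assumed to be acyclic (the sketch does not detect cycles):

```python
def topological_sort_dfs(graph):
    """Return f (dict node -> label in 1..n) for a directed acyclic graph."""
    explored = set()
    f = {}
    current_label = len(graph)           # labels are handed out n, n-1, ..., 1

    def dfs(s):                          # inner loop
        nonlocal current_label
        explored.add(s)
        for v in graph[s]:
            if v not in explored:
                dfs(v)
        f[s] = current_label             # assigned when s's call finishes
        current_label -= 1

    for v in graph:                      # outer loop (DFS-Loop)
        if v not in explored:
            dfs(v)
    return f


# Hypothetical DAG; dict order makes the outer loop start from v, then s.
graph = {"v": ["t"], "s": ["v", "w"], "w": ["t"], "t": []}
f = topological_sort_dfs(graph)          # {'t': 4, 'v': 3, 'w': 2, 's': 1}
```

The node whose call finishes first is a sink of the part of the graph explored so far, so it receives the largest remaining label.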


* Let directed graph G = {(s,v), (v,t), (s,w), (w,t)}
* In the outer loop we take an arbitrary vertex, say v
    * we call DFS(G, v): 
        * mark v = explored
        * the only vertex we can explore is t: so for edge (v,t)
            * mark t as explored
            * DFS(G,t), but from t there is nowhere to go, recursion stops
        * so we assign f(t) = current label = 4
        * current label becomes 3
    * we backtrack to v
        * there is nothing left to explore
        * we set f(v) = 3
        * we decrement current label = 2
* We return to the outer loop and consider the next vertex, let it be s
    * we haven't seen it yet
    * there are 2 arcs to explore
        * (s,v): we have already seen v, so we skip it
        * (s,w) => DFS(G,w)
            * w's only arc leads to t, which is already explored
            * we finish the DFS from w and set f(w) = 2
            * we decrement current label = 1
    * we backtrack to s
        * we've already considered all its arcs, so it gets the current label: f(s) = 1
        
**Running time:** O(m + n)  
**Reason:** O(1) time per node, O(1) time per edge  
**Correctness:** need to show that if (u,v) is an edge, then f(u) < f(v)

Proof: 
* The proof of correctness splits into two cases, depending on which of the vertices u or v is visited first by depth-first search. 
* Because of our for loop, which iterates over all of the vertices of the graph g, depth-first search is going to be invoked exactly once from each of the vertices. 
* Either u or v could be first, both are possible. 

Case 1:
* So first let's assume that u was visited by DFS before v, 
* DFS on a node, is going to find everything findable from that node. 
     * So if u is visited before v, that means v has not been explored yet, so it's a candidate for being discovered. 
     * Moreover, there's an arc straight from u to v, so certainly DFS invoked at u is going to discover v. 
     * Furthermore, the recursive call corresponding to the node v is going to finish (get popped off the program stack) before that of u. 
     * The easiest way to see this is to think about the recursive structure of depth-first search. 
         * When you call depth-first search from u, that call makes further recursive calls to all of the relevant neighbors, including v, and u's call is not popped off the stack until v's has been. 
         * That's because of the last in, first out nature of a stack or of a recursive algorithm.
         * So because v's recursive call finishes before that of u, v is assigned a larger label than u. 
         * Because the labels keep decreasing as more and more recursive calls get popped off the stack. 
  
Case 2:  
* v is visited before u. 
* And here's where we use the fact that the graph has no cycles. 
    * There's a directed arc from u to v. 
    * That means there cannot be any directed path from v all the way back to u. 
    * That would create a directed cycle.
    * Therefore, DFS invoked from v is not going to discover u. 
* So it doesn't find u at all. 
    * So the recursive call of v is going to get popped before u's is even pushed onto the stack. 
    * We're totally done with v before we even start to consider u. 
    * Therefore, for the same reason as in Case 1, since v's recursive call finishes first, its label is going to be larger, which is exactly what we wanted to prove. 
    
QED    

### 10.6 Computing strong components (based on DFS)

**[Kosaraju 2-pass Algorithm]**  

What are connected components of **directed graphs**?  
the **strongly connected components** (SCCs) of a directed graph G are the equivalence classes of the relation  
u~v <=> there is a path from u to v **and** from v to u in G

You check: ~ is an equivalence relation

The algorithm to compute SCCs is built on DFS, but
* If you call DFS from just the right place, you'll actually uncover an SCC. 
* If you call it from the wrong place, it will give you no information at all.

**Theorem** SCCs of a directed graph can be computed in linear time O(m+n)

* (1) Let G_rev = G with all arcs reversed
    * Running DFS on G_rev can be implemented in 2 ways:
        * the naive way is to literally construct a new copy of the input graph with all the arcs in the reverse direction, and then run depth first search on it. 
        * the optimization is to run DFS on the original graph, but going across arcs backwards. 
* (2) run DFS-Loop on G_rev
    * DFS-Loop just means DFS with the usual outer loop that makes sure you see all of the nodes of the graph 
        * even if it's disconnected: the outer loop tries each starting point separately. 
        * If you haven't already seen a node, you run DFS from it. 
    * goal: discover the "magical" order of nodes in which the second DFS-Loop will discover the SCCs
    * let f(v) = finishing time of each v belonging to G
* (3) run DFS-Loop on G
    * goal: discover the SCCs one by one
    * processing nodes in decreasing order of finishing times
    * we label each node in the second pass with what we call a leader. 
        * The idea is that the nodes in the same strongly connected component will be labeled with exactly the same leader node. 
**DFS-Loop**  
DFS-Loop(Graph G)
* global variable t = 0 
    * = number of nodes processed so far
    * used to compute finishing times in first pass
* global variable s = NULL
    * = current vertex from which DFS was initiated
    * used to compute leaders
* Assume nodes labeled 1 to n
* For i from n down to 1
    * """ s is responsible for keeping track of the most recent node from which depth first search was initiated, 
    * """ so if i is not explored and we initiate a depth first search from it, we had better reset s
    * if i not yet explored 
        * s := i
        * DFS(G, i)            
        
DFS(graph G, node i) = inner loop
* mark i as explored 
    * """ once a node is marked explored, it's explored for this entire invocation of DFS-Loop.
    * """ so even if this DFS from a given node i finishes, and the outer for loop marches on and encounters i again, it's still marked as explored.
* set leader(i) := node s
* for every arc (i,j) belonging to G:
    * if j unexplored
      * DFS(G, j)
* t += 1
* set f(i) := t (finishing time)
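
A compact Python sketch of the two-pass algorithm, assuming nodes labelled 1 to n and arcs given as (tail, head) pairs. Rather than renaming nodes by their finishing times, the second pass simply visits nodes in decreasing order of finishing time, which should be equivalent; the names `kosaraju_scc`, `dfs_loop` and `finish_order` are illustrative, and the recursion is subject to Python's recursion limit on large graphs:

```python
def kosaraju_scc(n, arcs):
    """Return leader[v] for every node v in 1..n; equal leader = same SCC."""
    graph = {i: [] for i in range(1, n + 1)}
    graph_rev = {i: [] for i in range(1, n + 1)}
    for u, v in arcs:
        graph[u].append(v)
        graph_rev[v].append(u)           # the same arc, reversed

    def dfs_loop(adj, order):
        explored = set()
        finish_order = []                # nodes by increasing finishing time
        leader = {}

        def dfs(i, s):
            explored.add(i)
            leader[i] = s                # s = node this DFS was initiated from
            for j in adj[i]:
                if j not in explored:
                    dfs(j, s)
            finish_order.append(i)       # i's finishing time is its position here

        for i in order:
            if i not in explored:
                dfs(i, i)
        return finish_order, leader

    # Pass 1: reversed graph, nodes n down to 1, to get finishing times.
    finish_order, _ = dfs_loop(graph_rev, range(n, 0, -1))
    # Pass 2: original graph, in decreasing order of finishing time.
    _, leader = dfs_loop(graph, reversed(finish_order))
    return leader


# Hypothetical graph: cycle 1->2->3->1, arc 3->4, cycle 4->5->4, isolated 6.
arcs = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]
leader = kosaraju_scc(6, arcs)

components = {}
for node, lead in leader.items():        # group nodes by leader
    components.setdefault(lead, []).append(node)
```

On this example the groups are {1, 2, 3}, {4, 5} and {6}.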
   
   
Example on a particular graph
* First pass on the reversed graph gives the "magical order" of nodes = f(i)'s
* Second pass on the original graph: 
    * you restore the original arc directions
    * you replace node numbers by the newly found f(i)'s
  
Running time: 2 DFS passes = O(m+n)

**[Analysis]**
* observation: every directed graph has two levels of granularity. 
    * zoom out: you see a directed acyclic graph whose nodes are the strongly connected components. 
    * zoom in: you see the fine-grained structure within one SCC. 
    
**Claim:** the SCCs of a directed graph induce, in a natural way, an **acyclic metagraph**
* meta-nodes = the SCCs C1, ..., Ck of G
* There is an arc C -> C^ <=> there is an arc (i,j) belonging to G with i belonging to C and j belonging to C^
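
As an illustration of the claim, the meta-arcs can be collected in one scan over the arcs of G. The sketch below assumes the SCC of every node is already known; `component_of` and `metagraph` are hypothetical names introduced here.

```python
def metagraph(graph, component_of):
    """Return the set of meta-arcs (C, C') induced by the arcs of `graph`."""
    arcs = set()
    for i, outs in graph.items():
        for j in outs:
            # Arcs inside one SCC do not contribute a meta-arc.
            if component_of[i] != component_of[j]:
                arcs.add((component_of[i], component_of[j]))
    return arcs


# Two 3-cycles joined by the arc 3 -> 4; SCC labels "A" and "B" are given.
graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [6], 6: [4]}
component_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
meta_arcs = metagraph(graph, component_of)  # {("A", "B")}
```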

Why is this metagraph acyclic?
* Metanodes = SCCs (i.e. you can get from anywhere to anywhere else within one)
* Suppose there were a cycle involving two different metanodes: 
    * on a directed cycle you can also get from anywhere to anywhere else. 
    * So two supposedly distinct SCCs that can reach each other would collapse into a single SCC. 
        * You can get from anywhere to anywhere inside each one, and you can go between them at will, so you can get from anywhere in their union to anywhere else in the union. 
        
The SCCs of G are exactly the same as SCCs of G_rev

**Key Lemma**  
* Consider 2 adjacent SCCs in G: C1 and C2, with an arc from C1 to C2 (C2 is "downstream" of C1)
* Let f(v) = finishing times of DFS-Loop in G_rev
* Then: max(f(v) for v in C1) < max(f(v) for v in C2) 

Assume that this lemma is true. Then:
* Corollary: 
    * consider the node with the maximum finishing time in the entire graph (where the second pass of DFS-Loop begins)
    * it has to lie in a "sink SCC" (by transitivity, applying the lemma to pairs of adjacent metanodes) 
        * By contradiction
        * Consider the SCC with the maximum f value. 
        * Suppose it is not a sink SCC, i.e. it has an outgoing arc
        * follow that outgoing arc to reach some other SCC 
        * by the lemma, the SCC you arrive in has an even bigger maximum finishing time. 
        * That contradicts the fact that you started in the SCC with the maximum finishing time. 
    
**Correctness intuition of the Algorithm**  
(see notes for formal proof)
* Assuming the lemma is true, we know that the corollary is true. 
* Using this corollary, let's finish the correctness argument for Kosaraju's algorithm. 
* The maximum finishing time is located in some sink SCC. 
  
By corollary: 
* The 2nd pass of DFS-Loop begins somewhere in a sink SCC C*
    * this matters because a DFS started anywhere else would also discover everything downstream (i.e. other metanodes)
* => the first call of DFS discovers C* and nothing else
* => the rest of DFS-Loop is recursing on G-{C*}
* The next call again starts in a sink metanode of the remaining graph, the one with the largest max finishing time
* and so on: successive calls to DFS(G,i) peel off the SCCs one by one (in reverse topological order of the metagraph)
   
**Proof of Lemma**
* Let's reverse the graph G to G_rev
    * in G_rev, C1 is "downstream" of C2
    * the arc points from j (belonging to C2) to i (belonging to C1)
    * The SCCs of the reversed graph are exactly the same as those of the original graph
* The proof has two cases 
* Let v = 1st node of (C1 ∪ C2) reached by DFS-Loop (on G_rev)
* Case 1: v is in C1
    * all of C1 is explored before C2 is ever reached
    * reason: in G_rev there are no paths from C1 to C2, since the metagraph is acyclic
    * => every finishing time in C1 is smaller than every finishing time in C2
* Case 2: v is in C2
    * here we use the fact that we are using DFS, not another search algorithm
    * DFS(G_rev, v) won't finish until all of C1 and C2 are completely explored. 
    * => we won't finish with v until we finish with everything reachable from it; that's the depth-first search property. 
    * For that reason the finishing time of v is the largest of anything reachable from it. 
    * So in particular it is larger than everything else in C2 
    * but more to the point, it is larger than everything in C1, which is what we are trying to prove.
    * f(v) > f(w) for all w in C1
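
As a sanity check (not a proof), the lemma can be tested on a small example for both cases. The graph, the helper `finishing_times`, and the two visiting orders are assumptions chosen for illustration: G consists of the cycles C1 = {1,2,3} and C2 = {4,5,6} plus the arc 3 -> 4, and `g_rev` is its reversal.

```python
def finishing_times(graph, order):
    """Finishing times of a DFS-Loop that tries start nodes in `order`."""
    explored, f, t = set(), {}, 0

    def dfs(i):
        nonlocal t
        explored.add(i)
        for j in graph[i]:
            if j not in explored:
                dfs(j)
        t += 1
        f[i] = t

    for i in order:
        if i not in explored:
            dfs(i)
    return f


# Reversal of G (cycles 1->2->3->1 and 4->5->6->4, plus the arc 3->4).
g_rev = {1: [3], 2: [1], 3: [2], 4: [3, 6], 5: [4], 6: [5]}
c1, c2 = {1, 2, 3}, {4, 5, 6}

# Case 1: the first node reached is in C1. Case 2: it is in C2.
case1 = finishing_times(g_rev, [1, 2, 3, 4, 5, 6])
case2 = finishing_times(g_rev, [6, 5, 4, 3, 2, 1])
lemma_holds = all(
    max(f[v] for v in c1) < max(f[v] for v in c2) for f in (case1, case2)
)
```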
    

### 10.7 Web Structure

* vertices = web pages
* edges are directed: hyperlinks

Q: what does the web graph look like?
[assume you've already crawled all the information from the web]
Size: 200 million nodes, 1 billion edges (year 2000)
* => it's impossible to compute the whole graph
* => Compute the SCCs of the web

Today you have
* MapReduce and Hadoop 
    * specialized systems designed to operate on massive data sets. 
    * In particular, they can compute connectivity information on graph data
* But the web is also much larger now

**The Bow Tie picture of the web**

* In the middle: the giant SCC (core of the web) 
    * It would be super weird if there were two different blobs of 10 million web pages each that were not mutually reachable
* OUT region = reachable from the giant SCC, but with no way back
    * e.g. corporate sites
* IN region = can reach the giant SCC, but cannot be reached from it
    * e.g. new web pages
* Tubes = hyperlink paths that go from IN to OUT without traversing the giant SCC
* Tendrils = parts attached to IN that hold the "heads" of hyperlinks leaving IN
* Further tendrils = parts attached to OUT that hold the "tails" of hyperlinks directed into OUT
* Isolated islands


**Main findings**
* all 4 parts (giant SCC, IN, OUT, residual parts) are roughly the same size, about 25% each
    * the giant SCC might be a little bit bigger
* within core, very well connected 
    * has the "small world" property [Milgram]
    * = six degrees of separation
    * => routing information is easy (in small world networks)
* outside very poor connectivity

**Modern Web Research**
1. temporal aspects – evolution of the graph
2. informational aspects – how does new information propagate through the web (social networks)
3. finer-grained structure – how to define and compute communities in information and social networks (cuts as baby step)

Reading: Easley + Kleinberg "Networks, Crowds, and Markets"

### Problem set 5 (Main)

**1. Given an adjacency-list representation of a directed graph, where each vertex maintains an array of its outgoing edges (but *not* its incoming edges), how long does it take, in the worst case, to compute the in-degree of a given vertex? As usual, we use n and m to denote the number of vertices and edges, respectively, of the given graph. Also, let k denote the maximum in-degree of a vertex. (Recall that the in-degree of a vertex is the number of edges that enter it.)**

* G = (V,E), |V| = n, |E| = m
* v belonging to V
* k = max in-degree 
* Adj = adjacency lists of outgoing edges for each vertex of G

* We must read all edges (an edge we did not read might enter the given vertex), so the time is lower bounded by m
* We can compute the in-degree by reading all edges once and keeping a count, so the time is upper bounded by m

> answer is Theta(m)

* The out-degree of vertex v = length of Adj[v]
    * The time to compute the out-degree of v is Theta(length of Adj[v]) 
* The sum of the lengths of the Adj lists over all vertices = m
    * The time to compute the out-degrees of all vertices is Theta(m+n)
* The in-degree of vertex v = number of times it appears in all the Adj lists 
    * If we search all the lists separately for each vertex, the time to compute the in-degrees of all vertices is Theta(m * n)
    * Alternatively, we can allocate a counter c and initialize it to zero. 
        * Then we only need to scan the lists in Adj once, 
        * incrementing c when we see v in the lists. 
        * so we scan each edge exactly once => time is Theta(m)  
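
The counting argument above can be sketched as a single scan over the outgoing-edge lists. The dict-of-lists format and the function names are assumptions for illustration; the example graph is the four-node graph from question 4.

```python
def in_degree(adj, v):
    """In-degree of v: one pass over every outgoing-edge list, Theta(m + n)."""
    count = 0
    for outs in adj.values():
        for head in outs:
            if head == v:
                count += 1
    return count


def all_in_degrees(adj):
    """In-degrees of all vertices with the same single scan."""
    degrees = {u: 0 for u in adj}
    for outs in adj.values():
        for head in outs:
            degrees[head] += 1
    return degrees


adj = {"s": ["v", "w"], "v": ["w", "t"], "w": ["t"], "t": ["s"]}
t_in = in_degree(adj, "t")
degrees = all_in_degrees(adj)
```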

**2. Consider the following problem: given an undirected graph G with n vertices and m edges, and two vertices s and t, does there exist at least one s-t path?**  

**If G is given in its adjacency list representation, then the above problem can be solved in O(m+n) time, using BFS or DFS. (Make sure you see why this is true.)**  

**Suppose instead that G is given in its adjacency *matrix* representation. What running time is required, in the worst case, to solve the computational problem stated above? (Assume that G has no parallel edges.)**

* Naive algorithm:
    * mark all vertices as undiscovered
    * take vertex s and mark it as discovered
    * (Rec) scan the matrix row of the current vertex to find all vertices linked to it = n-1 checks
        * if t is among them, stop
        * mark all newly found neighbours as discovered
        * otherwise, for each newly discovered vertex  
             * (Rec) repeat the scan (only undiscovered vertices need to be followed)

* Lower bound: every time we want to find the edges adjacent to a given vertex u, we have to traverse its whole matrix row, which has length n; this can happen for all the vertices
    * Worst-case example: a path graph from s to t: the steps scan n-1, n-2, etc. entries => n(n-1)/2 => Theta(n^2)
* Upper bound: 
    * first build an adjacency list representation in Theta(n^2) time 
    * with a single scan over the given adjacency matrix, and then run BFS or DFS as in the video lectures.
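
A minimal sketch of BFS working directly on the adjacency matrix, assuming a 0-indexed square matrix of 0/1 entries (the name `has_path` is made up for this example). Each dequeued vertex costs one full row scan, which is where the n^2 comes from.

```python
from collections import deque


def has_path(matrix, s, t):
    """Return True if t is reachable from s in the graph given by `matrix`."""
    n = len(matrix)
    discovered = [False] * n
    discovered[s] = True
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for w in range(n):  # scanning one row costs Theta(n)
            if matrix[u][w] and not discovered[w]:
                discovered[w] = True
                queue.append(w)
    return False


# Path 0 - 1 - 2 plus an isolated vertex 3.
matrix = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
connected = has_path(matrix, 0, 2)
disconnected = has_path(matrix, 0, 3)
```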


**3. This problem explores the relationship between two definitions about graph distances. In this problem, we consider only graphs that are undirected and connected. The diameter of a graph is the maximum, over all choices of vertices s and t, of the shortest-path distance between s and t. (Recall the shortest-path distance between s and t is the fewest number of edges in an s-t path.)**

**Next, for a vertex s, let l(s) denote the maximum, over all vertices t, of the shortest-path distance between s and t. The radius of a graph is the minimum of l(s) over all choices of the vertex s.**

**Which of the following inequalities always hold (i.e., in every undirected connected graph) for the radius r and the diameter d? [Select all that apply.]**

**(a) r ≤ d/2**  
**(b) r ≥ d/2**  
**(c) r ≥ d**  
**(d) r ≤ d**

* l(s_i) is the largest shortest-path distance from s_i to any other vertex
* d = diameter = maximum among the l(s_i)
* r = radius = minimum among the l(s_i)
* a minimum is never larger than a maximum => r =< d, so answer (d) is correct

* Let's take a vertex c that attains the radius of G, i.e. l(c) = r
    * => the distance from c to every other vertex is at most r
    * let s and t be the endpoints of a diameter path, so dist(s,t) = d
    * => dist(s,c) =< r
    * => dist(c,t) =< r
* going from s to c and then from c to t is one way to get from s to t, and a shortest s-t path is never longer than that
    * => d = dist(s,t) =< dist(s,c) + dist(c,t) =< 2r 
    * this holds whether or not c lies on the d-path from s to t
* => d/2 =< r, so answer (b) is correct as well
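
To make the definitions concrete, here is a small sketch that computes l(s) by BFS from every vertex and then takes the minimum and maximum. The function names and the example graph (a path on 5 vertices) are assumptions for illustration.

```python
from collections import deque


def eccentricity(adj, s):
    """l(s): the largest shortest-path distance from s (graph assumed connected)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def radius_and_diameter(adj):
    ecc = [eccentricity(adj, s) for s in adj]
    return min(ecc), max(ecc)


# Undirected path 1 - 2 - 3 - 4 - 5.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
r, d = radius_and_diameter(path)  # r = 2 (from the middle vertex), d = 4
```

On this path d/2 = r, so the bound in (b) is tight here, and (a) and (c) can fail on other graphs (for example a single edge has r = d = 1).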

**4. Consider our algorithm for computing a topological ordering that is based on depth-first search (i.e., NOT the "straightforward solution"). Suppose we run this algorithm on a graph G that is NOT directed acyclic. Obviously it won't compute a topological order (since none exist). Does it compute an ordering that minimizes the number of edges that go backward?**

**For example, consider the four-node graph with the six directed edges (s,v),(s,w),(v,w),(v,t),(w,t),(t,s). Suppose the vertices are ordered s,v,w,t. Then there is one backwards arc, the (t,s) arc. No ordering of the vertices has zero backwards arcs, and some have more than one.**

* since we can start Topological DFS-Loop from any vertex, we can do that from vertex v from the example
  
* we mark all nodes unexplored
* set current label at 4
* Outer loop: for vertex v
    * v unexplored 
        * => DFS call(G,v): mark v as explored
        * DFS(v,t): mark t explored
        * DFS(t,s): mark s explored
        * DFS(s,v): v explored, backtrack to s
        * DFS(s,w): mark w explored
        * DFS(w,t): t explored, backtrack to w
            * there is nowhere left to go from w => F(w) = 4
            * backtrack to s, there is nowhere left to go from s => F(s) = 3
            * backtrack to t, there is nowhere left to go from t => F(t) = 2
            * backtrack to v: DFS(v,w)
                * w explored, backtrack to v, there is nowhere left to go from v => F(v) = 1
* Thus the "order" would be the following: v, t, s, w
    * and there would be 2 backward arcs: (s,v), (w,t)
* and if we run the algorithm from vertex s, the order is s, v, w, t with only one backward arc, (t,s)
    * => correct answer is "sometimes yes, sometimes no"
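
The two runs can be replayed with a short sketch of the DFS-based labelling. The function names are hypothetical, and the adjacency lists are ordered so that v explores t before w, matching the trace above.

```python
def dfs_order(graph, start_order):
    """Label nodes n..1 in order of DFS completion (DFS-based topological sort)."""
    explored = set()
    label = {}
    current = len(graph)

    def dfs(u):
        nonlocal current
        explored.add(u)
        for w in graph[u]:
            if w not in explored:
                dfs(w)
        label[u] = current
        current -= 1

    for u in start_order:
        if u not in explored:
            dfs(u)
    return label


def backward_arcs(graph, label):
    """Arcs that point from a larger label to a smaller one."""
    return [(u, w) for u in graph for w in graph[u] if label[u] > label[w]]


graph = {"s": ["v", "w"], "v": ["t", "w"], "w": ["t"], "t": ["s"]}
from_v = dfs_order(graph, ["v", "s", "w", "t"])
from_s = dfs_order(graph, ["s", "v", "w", "t"])
back_from_v = backward_arcs(graph, from_v)  # [("s", "v"), ("w", "t")]
back_from_s = backward_arcs(graph, from_s)  # [("t", "s")]
```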
                

**5. On adding one extra edge to a directed graph G, the number of strongly connected components...?**

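* take a graph in which each node is its own strongly connected component (5 SCCs)
    * vertices: a,b,c,d,e
    * edges: (a,b), (b,c), (b,e), (b,d), (d,e), (c,e)
* if we add an edge (a,c): nothing changes (still 5 SCCs)
* if we add an edge (c,a): the cycle a -> b -> c -> a merges (a,b,c) into a single SCC (3 SCCs remain)
* => the number of SCCs can stay the same or decrease, but it never increases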
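
The example can be checked with a brute-force sketch that groups vertices by mutual reachability. This is deliberately not Kosaraju's algorithm: it is quadratic and only meant for tiny graphs, and all names are assumptions for illustration.

```python
def reachable(graph, s):
    """Set of vertices reachable from s (iterative DFS)."""
    seen = {s}
    stack = [s]
    while stack:
        u = stack.pop()
        for w in graph[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def count_sccs(graph):
    reach = {u: reachable(graph, u) for u in graph}
    # u and w share an SCC iff each one reaches the other.
    components = {
        frozenset(w for w in graph if w in reach[u] and u in reach[w])
        for u in graph
    }
    return len(components)


def with_edge(graph, tail, head):
    """Copy of `graph` with one extra arc tail -> head."""
    copy = {u: list(outs) for u, outs in graph.items()}
    copy[tail].append(head)
    return copy


graph = {"a": ["b"], "b": ["c", "e", "d"], "c": ["e"], "d": ["e"], "e": []}
before = count_sccs(graph)
after_ac = count_sccs(with_edge(graph, "a", "c"))
after_ca = count_sccs(with_edge(graph, "c", "a"))
```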


### Optional Problems 5

**In the 2SAT problem, you are given a set of clauses, where each clause is the disjunction of two literals (a literal is a Boolean variable or the negation of a Boolean variable). You are looking for a way to assign a value "true" or "false" to each of the variables so that all clauses are satisfied, that is, there is at least one true literal in each clause. For this problem, design an algorithm that determines whether or not a given 2SAT instance has a satisfying assignment. (Your algorithm does not need to exhibit a satisfying assignment, just decide whether or not one exists.) Your algorithm should run in O(m+n) time, where m and n are the number of clauses and variables, respectively. [Hint: strongly connected components.]**

* Things to know
    * SAT (Boolean satisfiability problem) is the problem of assigning Boolean values to variables to satisfy a given Boolean formula. 
    * The Boolean formula will usually be given in CNF (conjunctive normal form), which is 
        * a conjunction of multiple clauses, 
        * where each clause is a disjunction of literals (variables or negation of variables). 
    * 2-SAT (2-satisfiability) is a restriction of the SAT problem, in 2-SAT every clause has exactly two literals.
        * (a V !b) ^ (!a V b) ^ (!a V !b) ^ (a V c)
    * Note: 
        * (a V b) is equivalent to (!a => b) ^ ( !b => a) (if one variable is false, other must be true) 
        * that's called implicative normal form 
    * we can construct a directed graph from the initial expression 
        * transform each clause into implicative normal form
        * vertices: a, !a, b, !b, etc
        * edges: implications
    * Note:
        * if x is reachable from !x and !x is reachable from x, the problem has no solution
        * because x = true would force !x = true, and x = false would force x = true; both are contradictions
* If 2 vertices are mutually reachable, they are in a single SCC
    * => For a 2-SAT instance to have a solution, it is necessary and sufficient that for every variable x the vertices x and !x lie in different strongly connected components of the implication graph.
    
    
Algorithm:
1. First pass of Kosaraju algorithm gives us 

https://cp-algorithms.com/graph/2SAT.html
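
A minimal sketch of the decision procedure implied by the criterion above: build the implication graph, compute SCCs with two DFS passes (first on the reversed graph, then on the original graph in decreasing finishing time), and compare the components of x and !x. The integer encoding of literals (2*i for x_i, 2*i + 1 for !x_i) and the function name are assumptions made for this sketch.

```python
def two_sat_satisfiable(num_vars, clauses):
    """Decide a 2-SAT instance; each clause is a pair of encoded literals."""
    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:
        # (a V b) is equivalent to (!a => b) and (!b => a); lit ^ 1 negates.
        for tail, head in ((a ^ 1, b), (b ^ 1, a)):
            graph[tail].append(head)
            reverse[head].append(tail)

    def dfs_order(adj):
        """Nodes in increasing finishing time (iterative DFS)."""
        seen, order = [False] * n, []
        for root in range(n):
            if seen[root]:
                continue
            seen[root] = True
            stack = [(root, iter(adj[root]))]
            while stack:
                node, it = stack[-1]
                advanced = False
                for nxt in it:  # the iterator resumes where it left off
                    if not seen[nxt]:
                        seen[nxt] = True
                        stack.append((nxt, iter(adj[nxt])))
                        advanced = True
                        break
                if not advanced:
                    order.append(node)
                    stack.pop()
        return order

    component = [-1] * n
    label = 0
    for root in reversed(dfs_order(reverse)):  # decreasing finishing time
        if component[root] != -1:
            continue
        component[root] = label
        stack = [root]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if component[nxt] == -1:
                    component[nxt] = label
                    stack.append(nxt)
        label += 1

    return all(component[2 * i] != component[2 * i + 1] for i in range(num_vars))


# a = 0, !a = 1, b = 2, !b = 3, c = 4, !c = 5
# (a V !b) ^ (!a V b) ^ (!a V !b) ^ (a V c), the example formula from above
sat = two_sat_satisfiable(3, [(0, 3), (1, 2), (1, 3), (0, 4)])
# (a V b) ^ (a V !b) ^ (!a V b) ^ (!a V !b) has no satisfying assignment
unsat = two_sat_satisfiable(2, [(0, 2), (0, 3), (1, 2), (1, 3)])
```

The first formula is satisfied by a = false, b = false, c = true, which agrees with the check returning True.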

### Program Assignment 5

**The file contains the edges of a directed graph. Vertices are labeled as positive integers from 1 to 875714. Every row indicates an edge, the vertex label in first column is the tail and the vertex label in second column is the head (recall the graph is directed, and the edges are directed from the first column vertex to the second column vertex).** 

**So for example, the 11-th row looks like : "2 47646". This just means that the vertex with label 2 has an outgoing edge to the vertex with label 47646**

**Your task is to code up the algorithm from the video lectures for computing strongly connected components (SCCs), and to run this algorithm on the given graph.**

**Output Format: You should output the sizes of the 5 largest SCCs in the given graph, in decreasing order of sizes, separated by commas (avoid any spaces). So if your algorithm computes the sizes of the five largest SCCs to be 500, 400, 300, 200 and 100, then your answer should be "500,400,300,200,100" (without the quotes). If your algorithm finds less than 5 SCCs, then write 0 for the remaining terms. Thus, if your algorithm computes only 3 SCCs whose sizes are 400, 300, and 100, then your answer should be "400,300,100,0,0" (without the quotes). (Note also that your answer should not have any spaces in it.)**

**WARNING: This is the most challenging programming assignment of the course. Because of the size of the graph you may have to manage memory carefully. The best way to do this depends on your programming language and environment, and we strongly suggest that you exchange tips for doing this on the discussion forums.**

### Iterative approach to PA5 (works on full dataset)

**Data Structures**

In [628]:
# Node labels range from 1 to 875714
num_nodes = 875714

# Adjacency representations of the graph and reverse graph
gr = [[] for i in range(num_nodes)]
r_gr = [[] for i in range(num_nodes)]

# The list index represents the node. If node i is unvisited then visited[i-1] == False and vice versa
visited = [False] * (num_nodes)


# Stack for DFS
from collections import deque
stack = deque()

# The finishing times arranged by node index after the first pass
fin_times = [0 for i in range(num_nodes)]

# The order of nodes arranged by finishing times (after the first pass)
fin_times_node_order = [0 for i in range(num_nodes)]

**Load data into direct and reverse graph**

In [629]:
# load final data (the with-block closes the file automatically)
with open("../algorithms_course_code/data/pg_asmt_5_scc.txt", 'r') as file:
    for line in file:
        items = line.split()
        gr[int(items[0])-1].append(int(items[1]))
        r_gr[int(items[1])-1].append(int(items[0]))

In [619]:
# load test data
# [!] before loading change number of nodes in "data structures" clause

def loadGraphTest():
    file = open("../algorithms_course_code/data/pg_asmt_5_scc_test_cases.txt", 'r')
    graph = []
    for line in file:
        line_strip = line.rstrip("\n")
        line_split = line_strip.split(" ")
        if line_split != ['']:
            line_split_int = [int(i) for i in line_split]
            graph.append(line_split_int)
    file.close()    
    return graph

test1 = loadGraphTest()
# test data copy to prevent modification of uploaded
test2 = [edge[:] for edge in test1]

for edge in test2:
    gr[int(edge[0])-1] += [int(edge[1])]
    r_gr[int(edge[1])-1] += [int(edge[0])]

**DFS on reverse graph**

In [630]:
# final on lists (without print)
def dfsReverseFinTimes(r_gr):
    time = 1    
    global stack
    global visited
    global fin_times
    
    for node in range(num_nodes, 0, -1):
        stack.append(node)
        
        while len(stack) > 0:
            stack_node = stack.pop()
                
            if visited[stack_node-1] == False:
                visited[stack_node-1] = True
                stack.append(stack_node)
                
                for out_node in r_gr[stack_node-1]:
                    if visited[out_node-1] == False:
                        stack.append(out_node)              
            else:
                if fin_times[stack_node-1] == 0:
                    fin_times[stack_node-1] = time
                    time += 1
    
    return fin_times

**Arrange nodes by finishing time (increasing; reversed later for the second pass)**

In [631]:
def arrangeNodesByFinTimes(fin_times):
    for i in range(num_nodes):
        fin_times_node_order[fin_times[i]-1] = i+1
    return fin_times_node_order

**DFS on direct graph**

In [632]:
def dfsDirectLeaders(fin_times_node_order, gr):
    global stack
    global visited 
    visited = [False] * len(visited)  # Resetting the visited variable
    fin_times_node_order.reverse()  # The nodes should be visited in reverse finishing times
    component_sizes = [] # list to store component sizes
    
    for node in fin_times_node_order:
        component_size = 0
        
        if visited[node-1] == False:
            visited[node-1] = True
            component_size += 1
            stack.append(node)
            
            while len(stack) > 0:
                stack_node = stack.pop()
                
                for out_node in gr[stack_node-1]:  

                    if visited[out_node-1] == False:
                        visited[out_node-1] = True
                        component_size +=1
                        stack.append(out_node)
        
            component_sizes.append(component_size)
    
    return component_sizes 
        

**Final function to get 5 largest SCC**

In [633]:
def fiveMaxSCC(gr, r_gr): 
    fin_times = dfsReverseFinTimes(r_gr)
    nodes_by_ft = arrangeNodesByFinTimes(fin_times)
    comp_sizes = dfsDirectLeaders(nodes_by_ft, gr)
    comp_sizes.sort(reverse = True)
    return len(comp_sizes), comp_sizes[:5]

In [634]:
# run final function on data
fiveMaxSCC(gr, r_gr)

(371762, [434821, 968, 459, 313, 211])

### Recursive approach to PA5 (works on small datasets only)

In [4]:
import copy
import random
from collections import deque

In [6]:
# 0. LOAD DATA
def loadGraph():
    file = open("../algorithms_course_code/data/pg_asmt_5_scc.txt", 'r')
    graph = []
    for line in file:
        line_split = line.split(" ")
        line_split_clean = line_split[:2]
        line_split_int = [int(i) for i in line_split_clean]
        graph.append(line_split_int)
    file.close()    
    return graph

In [635]:
# 0. LOAD TEST DATA

def loadGraphTest():
    file = open("../algorithms_course_code/data/pg_asmt_5_scc_test_cases.txt", 'r')
    graph = []
    for line in file:
        line_strip = line.rstrip("\n")
        line_split = line_strip.split(" ")
        if line_split != ['']:
            line_split_int = [int(i) for i in line_split]
            graph.append(line_split_int)
    file.close()    
    return graph

In [636]:
# 1. TRANSFORM LIST OF EDGES INTO ADJACENCY LIST
def edgeToAdjList(graph):
    
    adj_list_graph = {}
    
    for edge in graph: 
        # process tail node
        try: 
            adj_list_graph[edge[0]]["out_nodes"].append(edge[1])     
        except KeyError:
            adj_list_graph[edge[0]] = {"node_index": edge[0], "visited": 0, "f_t": 0, "lead": 0, "out_nodes": [edge[1]]}
        
        # process head node
        try: 
            adj_list_graph[edge[1]]
        except KeyError:
            adj_list_graph[edge[1]] = {"node_index": edge[1], "visited": 0, "f_t": 0, "lead": 0, "out_nodes": []}
        
    return adj_list_graph 

In [749]:
# how long does it take to execute function on a particular input
import cProfile
cProfile.run('edgeToAdjList(gr)')

         4910756 function calls in 7.186 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    6.402    6.402    6.933    6.933 <ipython-input-748-c16c5f1ede46>:2(edgeToAdjList)
        1    0.253    0.253    7.186    7.186 <string>:1(<module>)
        1    0.000    0.000    7.186    7.186 {built-in method builtins.exec}
  4910752    0.531    0.000    0.531    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [637]:
# 2. REVERSE ADJACENCY LIST

def reverseAdjListGraph(adj_graph):
    """
    takes as input an adjacency list and reverses edges
    """
    vertex_number = len(adj_graph)
    attributes_dict_structure = {"node_index": 0, "visited": 0, "f_t": 0, "lead": 0, "out_nodes": []}
    rev_adj_graph = {i: copy.deepcopy(attributes_dict_structure) for i in range(1, vertex_number+1)}
    
    for i in range(1, vertex_number+1):
        rev_adj_graph[i].update({"node_index": i})
        
        if len(adj_graph[i]['out_nodes']) != 0: 
            
            for j in range(len(adj_graph[i]['out_nodes'])):        
                new_node = adj_graph[i]['out_nodes'][j]
                
                # rev_adj_graph[new_node].update({"node_index": new_node})
                rev_adj_graph[new_node]['out_nodes'].append(i)
    return rev_adj_graph

In [638]:
# 3. FINISHING TIMES: first run of DFS loop on reverse graph

def _dfsFinTimes(adj_list_graph, node):
    # mark i as explored
    adj_list_graph[node]["visited"] = 1                       

    # for every j of edges (node, j):
    for out_node in adj_list_graph[node]["out_nodes"]:        
        # if j unexplored
        if adj_list_graph[out_node]["visited"] == 0:  
            # DFS(G,j)
            _dfsFinTimes(adj_list_graph, out_node)            
     
    global number_of_nodes_processed
    number_of_nodes_processed += 1
    
    # add finishing time to a node of a graph
    adj_list_graph[node]["f_t"] = number_of_nodes_processed   

def _dfsLoopFinTimes(adj_list_graph):       
    # for i from n down to 1
    i = len(adj_list_graph) 
    while i > 0:                                             
        # if i not yet explored
        if adj_list_graph[i]["visited"] == 0:                
            # DFS(G, i)
            _dfsFinTimes(adj_list_graph, i)                          
        i -= 1
        
    # create a list of fin_times keys
    fin_time_keys = [adj_list_graph[node]["f_t"] for node in adj_list_graph.keys()]  
    return fin_time_keys

def finishingTimesRevGraphRecursion(adj_list_graph):
    global number_of_nodes_processed
    number_of_nodes_processed = 0  
    return _dfsLoopFinTimes(adj_list_graph)

In [639]:
# 4. LEADERS AND THEIR SIZES: second run of DFS loop on direct graph with vertices replaced by finishing times

def _dfsLeadersCounter(adj_list_graph, node):
    # global current_dfs_vertex
    global component_size
    
    # mark i as explored
    adj_list_graph[node]["visited"] = 1                               
    adj_list_graph[node]["lead"] = current_dfs_vertex 
    component_size +=1
       
    # for every j of edges (node, j):
    for new_out_node in adj_list_graph[node]["new_out_nodes"]:        
        # if j unexplored 
        if adj_list_graph[new_out_node]["visited"] == 0:                        
            # DFS(G,j)
            _dfsLeadersCounter(adj_list_graph, new_out_node)         
     
    return adj_list_graph

def dfsLoopLeadersCompSizes(adj_list_graph):    
    rev_num_leaders = []
    component_sizes = []   
    
    # for i from n down to 1
    i = len(adj_list_graph)    
    while i > 0:                                
        # if i not yet explored
        if adj_list_graph[i]["visited"] == 0:    
            
            global current_dfs_vertex
            current_dfs_vertex = i
            
            global component_size
            component_size = 0
            
            # DFS(G, i)
            _dfsLeadersCounter(adj_list_graph, i)               
            
            rev_num_leaders.append(current_dfs_vertex)
            component_sizes.append(component_size)
                          
        i -= 1
    
    return component_sizes #, adj_list_graph # (no need)

In [640]:
# 5. RANDOM SELECT OF I-TH STATISTICS: to get 5 biggest SCC
from random import randrange

def _partition(x, pivot_index = 0):
    i = 0
    if pivot_index !=0: 
        x[0], x[pivot_index] = x[pivot_index], x[0]
    
    for j in range(len(x)-1):
        if x[j+1] < x[0]:
            x[j+1], x[i+1] = x[i+1], x[j+1]
            i += 1
    x[0], x[i] = x[i], x[0]
    return x,i

def RSelect(x,k):
    if len(x) == 1:
        return x[0]
    else:
        xpart = _partition(x, randrange(len(x)))
        x = xpart[0] # partitioned array
        j = xpart[1] # pivot index
        if j == k:
            return x[j]
        elif j > k:
            return RSelect(x[:j], k)
        else:
            k = k - j - 1
            return RSelect(x[(j+1):], k)
        

In [644]:
# 6. FINAL FUNCTION TO GET LARGEST SCC
def kosarajuCount(dir_graph_ed):
    
    # transform direct edges list into adj list
    dir_graph_aj = edgeToAdjList(dir_graph_ed)
    
    # reverse direct adj list graph 
    rev_graph_aj = reverseAdjListGraph(dir_graph_aj)
    
    # first DFS-Loop run on reverse graph
    # number_of_nodes_processed = 0
    fin_time_keys = finishingTimesRevGraphRecursion(rev_graph_aj)  
    
    # add finishing times to direct graph attribures list and mark nodes unvisited
    # add new out_nodes which are old ones mapped to finishing times
    for i in range(1, len(fin_time_keys) + 1):
        dir_graph_aj[i]['visited'] = 0
        dir_graph_aj[i]['f_t'] = fin_time_keys[i-1]
        dir_graph_aj[i]['new_out_nodes'] = []
        for j in range(len(dir_graph_aj[i]['out_nodes'])):
            old_node = dir_graph_aj[i]['out_nodes'][j]
            dir_graph_aj[i]['new_out_nodes'].append(fin_time_keys[old_node-1])
    
    # replace keys in dir_graph_aj by finishing times
    dir_graph_aj_fin_times = {}
    for i in range(1, len(fin_time_keys) + 1):
        ft_key = fin_time_keys[i-1]
        dir_graph_aj_fin_times[ft_key] = dir_graph_aj[i]
    
    # second DFS-Loop run on direct graph      
    component_sizes_unsorted = dfsLoopLeadersCompSizes(dir_graph_aj_fin_times)
    
    # use randomized selection to find 5 largest components
    five_biggest_scc = []
    for i in range(1,6):
        if len(component_sizes_unsorted) >= i:
            ord_stat = len(component_sizes_unsorted) - i
            size = RSelect(component_sizes_unsorted, ord_stat)
            five_biggest_scc.append(size)
        else:
            five_biggest_scc.append(0)
    
    return five_biggest_scc

In [645]:
test1 = loadGraphTest()
test2 = copy.deepcopy(test1)
for_second = kosarajuCount(test2)
for_second

[6, 3, 2, 1, 0]

In [3]:
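# attempts to increase the recursion depth did not help: the kernel still dies
# (note: threading.stack_size only affects threads created after the call,
# and no new thread is started in this cell)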

import resource, sys, threading
sys.setrecursionlimit(2 ** 20)
#resource.setrlimit(resource.RLIMIT_STACK, (2**29,-1))
threading.stack_size(67108864)
sys.getrecursionlimit()

1048576
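
One thing that might be worth trying, sketched below under assumptions: `threading.stack_size` only takes effect for threads created afterwards, so the recursive work has to run inside a new thread. The helper `run_with_big_stack` and the toy function `depth` are made up for illustration, and whether 64 MB is enough for the full dataset is not tested here.

```python
import sys
import threading


def depth(n):
    # Deliberately deep recursion, standing in for a recursive DFS.
    return 0 if n == 0 else 1 + depth(n - 1)


def run_with_big_stack(func, *args, stack_bytes=64 * 1024 * 1024, limit=2 ** 20):
    """Run func(*args) in a new thread that has a larger stack."""
    result = {}
    sys.setrecursionlimit(limit)
    threading.stack_size(stack_bytes)  # applies only to threads created afterwards

    def target():
        result["value"] = func(*args)

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return result["value"]


# Far deeper than the default recursion limit of 1000.
reached = run_with_big_stack(depth, 20_000)
```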

# 11. Dijkstra's shortest-path algorithm  

* works in any directed graph with non-negative edge lengths
* computes the shortest paths from a source vertex to all other vertices 
* very fast implementation that uses a heap data structure

**[Algorithm]**
* Input: 
    * directed graph G = (V,E) (m = |E|, n = |V|); for undirected graphs only cosmetic changes are required
    * Each edge has non-negative length l_e
    * source vertex s
* Output: for each v belonging to V, compute L(v) = length of a shortest path s-v in G (ie sum of edge lengths)
* Assumptions
    * [For convenience]: assume that there is a directed path from s to every other vertex v in the graph; otherwise the shortest-path distance is defined to be plus infinity
        * we can always detect which vertices are not reachable from s in a preprocessing step using BFS or DFS, and then delete the irrelevant part of the graph
        * Alternatively, Dijkstra's algorithm quite naturally figures out which vertices are reachable from s and which are not, so this won't really come up. 
    * [Important]: we always focus on graphs where every length is non-negative (l_e >= 0 for every e belonging to E)
        * paths can be thought of more abstractly as just a sequence of decisions. 
        * for example, maybe you're engaging in financial transactions and you have the option of both buying and selling assets at different times. 
        * If you sell, you get some kind of profit, and that would correspond to a negative edge length. 
        * So there are quite interesting applications in which negative edge lengths are relevant. 
        * If you are dealing with such an application, Dijkstra's algorithm is not the algorithm to use. 
        * The best-known algorithm for such problems is Bellman-Ford
* BFS also computes shortest paths, but only in the special case where every edge has length 1
    * Say you have an edge that has length three. 
    * Why not just replace all the edges with a path of edges of the appropriate length? 
    * Now we have a network in which every edge has unit length and now we can just run breadth-first search. 
    * => Computing shortest paths with general edge weights reduces to computing shortest paths with unit edge weights? 
        * First comment: if all of the edge lengths were just small numbers, like 1, 2, and 3 and so on, this trick would work fine.
        * The issue is when you have a network where the different edges can have very different lengths. 
        * If you start wantonly replacing single edges with these really long paths, you've blown up the size of your graph way too much. 
        * Even though BFS runs in linear time, it's now on this much larger graph. 
        * We'd prefer something which is linear time or almost linear time that works directly on the original graph. 
        * And that is exactly what Dijkstra's shortest-path algorithm is going to accomplish.
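
The edge-replacement trick above can be sketched in a few lines. This is a minimal illustration, assuming positive integer edge lengths; the helper names `subdivide` and `bfs_distances`, the naming of intermediate vertices, and the example graph are hypothetical and not part of the lecture.

```python
from collections import deque

def subdivide(edges):
    """Replace each edge (u, v, length) by a chain of `length` unit edges.

    Assumes positive integer lengths. Intermediate vertices are named
    (u, v, i); that naming is just a convention chosen for this sketch.
    """
    adj = {}
    for u, v, length in edges:
        prev = u
        for i in range(1, length):
            mid = (u, v, i)
            adj.setdefault(prev, []).append(mid)
            prev = mid
        adj.setdefault(prev, []).append(v)
        adj.setdefault(v, [])
    return adj

def bfs_distances(adj, source):
    """Plain BFS: number of unit edges from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# Hypothetical example: one long edge dominates the size of the new graph.
edges = [("s", "a", 3), ("s", "b", 1), ("b", "a", 1), ("a", "t", 1000)]
adj = subdivide(edges)
dist = bfs_distances(adj, "s")
print(dist["a"], dist["t"], len(adj))  # 2 1002 1005
```

The distances come out right, but a graph with 4 vertices has turned into one with 1005 vertices because of a single edge of length 1000, which is exactly the blow-up the notes warn about.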

Algorithm:
* One loop, in each iteration of the loop we will compute the shortest path distance to one additional vertex. 
* And by the end of the loop we'll compute shortest path distances to everybody 
* If all of the edge lengths are equal to one, Dijkstra's algorithm becomes breadth-first search
* Dijkstra's algorithm will give us our first opportunity to see the interplay between good algorithm design and good data structure design. 
    * A suitable application of the heap data structure yields an implementation of Dijkstra's algorithm that runs blazingly fast, in almost linear time: O(m * log(n)).
   
Pseudocode:  
* Input: G = (V,E) (m = |E|, n = |V|),  directed
* Initialize:
    * X = {s} [X = vertices processed so far = vertices for which we have already computed the shortest-path distance from the source vertex; we are going to augment X by one new vertex in each iteration]
    * A[s] = 0 [A = array to store computed shortest path distances for each vertex in X]
    * Additional bookkeeping, which is not needed in real implementation. Just for understanding
        * In array B we'll keep track of the shortest path itself from the source vertex s to each destination v
        * B[s] = empty path [B = array to store computed shortest paths]
* Main loop:
    * while X ≠ V: [need to grow X by one node]
        * [X = [s, ..., ..., ...]; V-X  = [..., ..., ..., ...]]
        * [We are interested in edges crossing from X to V-X]
        * [We scan through all those edges, one of them will lead us to the next vertex]
        * Among all crossing edges (v, w) belonging to E with v belonging to X and w belonging to !X
            * pick the one that minimizes A[v] + l_vw [Dijkstra's greedy criterion], [call it (v*, w*)]
            * [A[v] is already computed in an earlier step]
        * add w* to X
        * set A[w*] = A[v*] + l_v*w*
            * [A[v*] is already computed in an earlier step]
        * set B[w*] = B[v*] extended by (v*, w*)
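
The pseudocode above can be transcribed almost line by line. The sketch below is a naive version that rescans all edges in every iteration; the function name `dijkstra_naive`, the edge-list input format and the example graph are assumptions made for illustration. `B` stores each path as a list of vertices rather than edges, and the loop stops when no crossing edge is left, so unreachable vertices are simply absent from the result.

```python
def dijkstra_naive(edges, source):
    """Direct O(m*n) transcription of the pseudocode.

    `edges` is a list of (tail, head, length) triples with length >= 0.
    Returns (A, B): shortest-path distances and the paths themselves,
    for every vertex reachable from `source`.
    """
    A = {source: 0}          # the keys of A play the role of the set X
    B = {source: [source]}   # bookkeeping only: path as a list of vertices
    while True:
        best = None
        # scan all edges, keep the crossing edge with the smallest greedy score
        for v, w, length in edges:
            if v in A and w not in A:
                score = A[v] + length
                if best is None or score < best[0]:
                    best = (score, v, w)
        if best is None:     # no crossing edges left: X cannot grow further
            break
        score, v_star, w_star = best
        A[w_star] = score
        B[w_star] = B[v_star] + [w_star]
    return A, B

# Hypothetical example graph.
edges = [("s", "v", 1), ("s", "w", 4), ("v", "w", 2), ("v", "t", 6), ("w", "t", 3)]
A, B = dijkstra_naive(edges, "s")
print(A)       # {'s': 0, 'v': 1, 'w': 3, 't': 6}
print(B["t"])  # ['s', 'v', 'w', 't']
```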
        
**[Non-Example]**  
* Directed graph
    * (s, v) = 1
    * (s, t) = -2
    * (v, t) = -5
* Why can't we just reduce shortest path computation with negative edge lengths to the problem of computing shortest paths with non-negative edge lengths?
    * For example we could just add a big number to all the edges, that makes them all non-negative and then we just run Dijkstra's algorithm
    * The issue is that different paths between a common origin and destination have differing numbers of edges. 
        * So, some might have five edges, some might have two edges. 
        * Now, if you add 10 to every single edge in the graph, you're going to change path lengths by different amounts. 
        * If a path has five edges, it's going to go up by 50 when you add 10 to every edge. 
        * If a path has only two edges, it's only going to go up by 20 when you add 10 to every edge
        * you might actually screw up which path is the shortest
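    * Also: plain Dijkstra's algorithm (without adding a large number) is incorrect on this graph as well: in its first iteration it commits to A[t] = -2 via the edge (s, t), while the true shortest path s > v > t has length 1 + (-5) = -4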
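
The three-edge graph of the non-example is small enough to check by brute force. The sketch below enumerates the only two s-t paths and compares them before and after adding a constant to every edge; the helper names and the choice of `shift = 5` (the smallest constant that makes every edge non-negative here) are assumptions made for illustration.

```python
lengths = {("s", "v"): 1, ("s", "t"): -2, ("v", "t"): -5}
paths = [["s", "t"], ["s", "v", "t"]]   # the only two s-t paths in this graph

def path_length(path, lengths):
    """Sum of the edge lengths along a path given as a list of vertices."""
    return sum(lengths[(a, b)] for a, b in zip(path, path[1:]))

def shortest(paths, lengths):
    """Brute force: the path with the smallest total length."""
    return min(paths, key=lambda p: path_length(p, lengths))

shift = 5
shifted = {edge: l + shift for edge, l in lengths.items()}

before = shortest(paths, lengths)
after = shortest(paths, shifted)
print(before, path_length(before, lengths))  # ['s', 'v', 't'] -4
print(after, path_length(after, shifted))    # ['s', 't'] 3

# Dijkstra's first greedy step on the original graph: among edges leaving s,
# it picks the smallest score A[s] + l = l and never revisits that choice.
first_score, first_edge = min((l, e) for e, l in lengths.items() if e[0] == "s")
print(first_edge, first_score)               # ('s', 't') -2
```

The two-edge path gains 10 while the one-edge path gains only 5, so the shift flips which path is shortest.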
    
 
**[Proof of correctness]**  
* Theorem: For every directed graph with non-negative edge lengths, Dijkstra's algorithm correctly computes all shortest-path distances [i.e., A[v] = L(v) for every v belonging to V; A[v] = computed shortest-path distance, L(v) = true shortest-path distance from s to v]

Proof by induction: on the number of iterations
* Base case: A[s] = L(s) = 0 (correct: here we use the fact that there are no negative edge lengths, so no non-empty path from s back to s can have length less than 0, and the empty path is a shortest one)
* Inductive step: 
    * Inductive hypothesis: all previous iterations are correct <=> for all v belonging to X:
        * A[v] = L(v) and
        * B[v] is a true shortest s-v path in G
    * In current iteration: 
        * we pick an edge (v*, w*)
        * we add w* to X
        * we set B[w*] = B[v*] appended by (v*,w*)
            * By induction:
            * B[v*] has length L(v*)
            * B[w*] has length L(v*) + l_v*w*
        * Also: A[w*] = A[v*] + l_v*w*
            * By the inductive hypothesis A[v*] = L(v*)
            * =>  A[w*] = L(v*) + l_v*w*
        * Upshot: in current iteration we set:
            * (1) A[w*] = L(v*) + l_v*w*
            * (2) B[w*] = an s to w* path with length  A[w*]
    * To finish the proof: need to show that every s-w* path has length >= L(v*) + l_v*w* (if so, our path is a shortest one)
        * So let P = any s to w* path
        * Any such path crosses the frontier between explored (X) and unexplored (!X) vertices, because s by definition is in X and we chose w* in !X
        * Possibly even several times. Let's look at the first time it crosses the frontier
        * Let it be via an edge (y, z): s >>> y > z >>> w* (y in X, z in !X)
            * s >>> y part stays in X: it is at least as long as the shortest s-y path: L(y) (by def) = A[y] (by the inductive hypothesis)
            * y > z: length l_yz
            * The final part z >>> w*, if it's non-empty, has non-negative length: >= 0 (since no negative edges)
        * Total length of path P >= A[y] + l_yz
            * earlier we selected the edge (v*, w*) according to Dijkstra's greedy criterion: it minimizes A[v] + l_vw over all crossing edges
            * (y, z) is also a crossing edge, so by the greedy criterion: A[y] + l_yz >= A[v*] + l_v*w* 
            * => the length of any s-w* path P is >= the length of our path
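
The last step of the argument can be written as a single chain of inequalities. This is only a restatement of the bullets above in LaTeX, with $\text{len}(\cdot)$ introduced here as shorthand for the total length of a path segment:

$$
\text{len}(P) = \text{len}(s \leadsto y) + l_{yz} + \text{len}(z \leadsto w^*)
\;\ge\; L(y) + l_{yz} + 0
\;=\; A[y] + l_{yz}
\;\ge\; A[v^*] + l_{v^* w^*}
\;=\; A[w^*]
$$

The first inequality uses the definition of $L(y)$ and the non-negativity of edge lengths, the equality $L(y) = A[y]$ is the inductive hypothesis, and the second inequality is Dijkstra's greedy criterion applied to the crossing edge $(y, z)$.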
            
        
**[Implementation]**            
* The straightforward implementation of this algorithm is O(m*n)
    * n-1 iterations of the while loop (until we have sucked all vertices into X)
    * Within each iteration: 
        * we naively do a linear scan through all of the edges. 
        * We go through the edges. 
        * We check if it's an eligible edge, that is, if its tail is in X and its head is outside of X. 
        * We can keep track of that with an auxiliary boolean variable for each vertex remembering whether it's in X or not. 
        * And then amongst all of the eligible edges, by exhaustive search, we remember which edge has the smallest Dijkstra score
* But we can do better by organizing data: heap
    * What's the clue that a data structure might help speed up Dijkstra's shortest-path algorithm?
    * The way to figure this out is to understand where all the work is coming from
        * Why are we doing a linear (in the number of edges) amount of work in each of a linear (in the number of vertices) number of iterations? 
        * Well, at each iteration of this while loop we're just doing an exhaustive search to compute a minimum. 
        * We look at every edge, we look at those that cross the frontier, and we compute the one with the minimum Dijkstra score. 
        * Is there a data structure whose reason for being is to perform fast minimum computations? 
        * It's the heap data structure
    * Heap: performs Insert and Extract-Min in O(log(n)) time
        * logically thought of as a complete (balanced) binary tree, even though it is usually implemented as a linear array. 
        * height of the tree is approx. log_2(n)
        * The key property to maintain: at every node, the key is no larger than the keys of both of its children. 
        * This property ensures that the smallest key of them all is at the root of the tree
            * extract-min: move the last leaf up to the root, then bubble it down
            * insert: add as the last leaf, then bubble up
            * delete: swap with the last leaf, then bubble up or down as needed
        * All these operations are performed in O(log(n)) time
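
As a quick illustration of the heap properties listed above, here is a minimal sketch using Python's standard `heapq` module, which keeps a min-heap inside an ordinary list. The sample keys are arbitrary. Note that `heapq` offers insert and extract-min but no delete-by-item operation.

```python
import heapq

heap = []
for key in [7, 3, 9, 1, 4]:
    heapq.heappush(heap, key)      # insert: O(log n), bubbles the new key up

assert heap[0] == 1                # the smallest key sits at the root (index 0)

# Heap property in the array layout: the children of index i are 2i+1 and 2i+2.
for i in range(len(heap)):
    for child in (2 * i + 1, 2 * i + 2):
        if child < len(heap):
            assert heap[i] <= heap[child]

# extract-min: O(log n) each; repeated extraction yields the keys in sorted order
extracted = [heapq.heappop(heap) for _ in range(5)]
print(extracted)   # [1, 3, 4, 7, 9]
```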

How to use heaps? 2 Invariants:
* Invariant 1: elements in the heap = vertices of V-X
    * when we extract the min from the heap, it tells us which vertex to add to the set X next
* Invariant 2: for v belonging to V-X, key[v] = smallest Dijkstra greedy score of an edge (u, v) belonging to E with u belonging to X
    * NB: Dijkstra greedy score of an edge (u, v) = A[u] + l_uv
    * when several edges from X arrive at the same head vertex, the key of that vertex is the minimum greedy score among them
    * for a vertex outside of X that doesn't have any eligible edges terminating at it, we think of the key as being plus infinity
* If we can successfully maintain these two invariants, then, when we extract min from this heap, we get exactly the correct vertex w* that we're supposed to add to the set X next
* The shortest-path distance A[w*] will be the key of w* 

How to maintain these invariants without doing too much work?
* invariant 1: will really take care of itself. By definition the vertices which remain in the heap are those that we haven't processed yet, and those are the ones that are outside of capital X. 
* invariant 2: 
    * a tricky problem. 
        * think about this shortest path algorithm at some intermediate iteration. 
        * A bunch of vertices have already been added to X. 
        * A bunch of vertices are still hanging out in the heap. They haven't been added to X. 
        * There's some frontier
        * there are crossing edges, possibly in both directions. 
        * And suppose at the end of the current iteration we identify the vertex w, which we're going to extract from the heap and conceptually add to the set X. 
        * Now the reason things get complicated is that when we move a vertex from outside X to inside X, the frontier between X and V minus X changes. 
        * then the edges which cross the frontier change. 
        * there might be some edges which used to cross the frontier and now don't. 
        * Those are the ones that are coming into W. There is no problem here
        * there are edges which used to not be crossing the frontier but now they are crossing the frontier. 
        * And those are precisely the edges sticking out of W. 
        * To see why it's tricky let's remember what invariant number two says. 
            * for every vertex which is still in the heap (i.e., not yet in X), the key must be the smallest Dijkstra greedy score of any edge which comes from X and sticks into this vertex of V-X. 
            * Now in moving one vertex into X, namely this vertex w, there can be new edges sticking into vertices which are still in the heap. 
            * As a result, the appropriate key value for a vertex in the heap might become smaller than the one already assigned
        * => we need to update keys of vertices remaining in V-X
        * but this damage is local: the vertices whose keys we need to update are the heads of edges that stick out of w.  
    * when w is extracted from the heap (i.e., added to X), 
        * for each edge (w, v) belonging to E:
            * if v belongs to V - X (i.e., is in the heap)
                * delete v from the heap
                * recompute key[v] = min(key[v], A[w] + l_wv)
                * re-insert v into the heap
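
The delete / recompute / re-insert step above needs a heap that can delete an arbitrary element, which Python's `heapq` does not provide. The sketch below therefore hand-rolls a small binary heap with a position map. The class `IndexedMinHeap`, the function `dijkstra_heap`, the dict-of-dicts graph format and the example graph are all assumptions made for illustration. One small departure from the pseudocode: the source starts in the heap with key 0 instead of starting in X, which yields the same sequence of extractions.

```python
class IndexedMinHeap:
    """Binary min-heap of (key, item) pairs that also supports delete(item)."""

    def __init__(self):
        self.data = []   # array layout of the tree: list of (key, item)
        self.pos = {}    # item -> its current index in self.data

    def __contains__(self, item):
        return item in self.pos

    def _swap(self, i, j):
        self.data[i], self.data[j] = self.data[j], self.data[i]
        self.pos[self.data[i][1]] = i
        self.pos[self.data[j][1]] = j

    def _up(self, i):
        while i > 0 and self.data[i][0] < self.data[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _down(self, i):
        n = len(self.data)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.data[c][0] < self.data[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def insert(self, key, item):
        self.data.append((key, item))
        self.pos[item] = len(self.data) - 1
        self._up(len(self.data) - 1)

    def delete(self, item):
        """Remove item from the heap and return its key."""
        i = self.pos[item]
        last = len(self.data) - 1
        self._swap(i, last)          # move the item into the last leaf
        key, _ = self.data.pop()
        del self.pos[item]
        if i < last:                 # restore the heap property at index i
            self._up(i)
            self._down(i)
        return key

    def extract_min(self):
        item = self.data[0][1]
        return self.delete(item), item


def dijkstra_heap(graph, source):
    """graph: dict vertex -> {neighbor: non-negative length}; every vertex is a key."""
    heap = IndexedMinHeap()
    for v in graph:                          # invariant 1: heap holds V - X
        heap.insert(0 if v == source else float("inf"), v)
    A = {}
    while heap.data:
        key, w = heap.extract_min()          # w joins X; its key is A[w]
        A[w] = key
        for v, length in graph[w].items():   # only heads of edges out of w change
            if v in heap:                    # v is still in V - X
                old_key = heap.delete(v)
                heap.insert(min(old_key, A[w] + length), v)
    return A


# Hypothetical example graph.
graph = {"s": {"v": 1, "w": 4}, "v": {"w": 2, "t": 6}, "w": {"t": 3}, "t": {}}
print(dijkstra_heap(graph, "s"))   # {'s': 0, 'v': 1, 'w': 3, 't': 6}
```

Vertices that cannot be reached from the source keep the key plus infinity and are extracted last with that value, matching the convention in invariant 2.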

**[Running time]**
* You check: the main work is done by heap operations (O(log(n)) each) 
    * we extract the min once per iteration of the while loop = n-1 extract-mins in total
    * this triggers key updates to maintain invariant 2: deletion of an element followed by insertion 
        * it seems like a single vertex could trigger as many as n-1 key updates, which is Theta(n) heap operations. 
        * And if we sum that up over the n iterations of the while loop, that would give us Theta(n^2) heap operations. 
        * So, and indeed, in dense graphs, that can be the case. 
        * But let's have an edge-centric view. 
            * For each edge of the graph, let's think about when it can be responsible for some heap operations, in particular a decrease in key and the resulting deletion and insertion. 
            * If you have an edge and it points from the vertex V to the vertex W. 
            * There's actually only one situation in which this edge is going to be responsible for a decrease in key. 
            * And that's in the case where the tail of the edge V gets sucked into the set X before the head W of this edge gets sucked into the set X. 
            * then indeed we're gonna have to decrease the key of W, 
            * But that's all that's gonna happen: V can only get sucked into X once and never gonna leave it. 
            * So it's only responsible for this single decrease in key of its head W. 
            * And that's one insertion and one deletion. 
            * And in fact, if the endpoints of this edge get sucked into X in the opposite order (the head W of this edge gets sucked into X first), the edge doesn't trigger any key update at all. 
    * But each edge (v, w) triggers at most one Delete/Insert combo (if v is added to X first) = O(m)
        * So this means that the number of heap operations is O(n), that's for the extract-mins, + O(m), that's for the insert/delete combos triggered by edges during the key decreases.
        * O(n+m) = O(m) since the graph is weakly connected 
            * because of our assumption that there's a path from s to every other vertex. 
            * If you think about it, that means that the graph is at least weakly connected: if you picked it up, it would stay together in one piece. 
            * So that means it at least contains a tree, at least in an undirected sense, 
            * which means it contains at least n - 1 edges. 
            * So we're in the case of weakly connected graphs, where m dominates n. 
            * m is always at least as big as n, up to a plus one (m >= n - 1). 
        * So what that means is the running time of Dijkstra's algorithm, with this heap implementation, is just a log factor larger than linear. 
            * Remember, every heap operation takes logarithmic time. 
            * So we do a linear in m number of operations; each takes time logarithmic in n. 
            * So the running time = O(m * log(n))
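
The edge-centric bound can be checked empirically. The sketch below is an assumption-laden illustration, not the delete/re-insert implementation analysed above: it uses the lazy-deletion variant built on `heapq` (stale heap entries are skipped when popped), for which the same argument gives at most m + 1 pushes, since each edge is scanned once and can cause at most one push. The function name `count_heap_ops` and the random test graph are hypothetical.

```python
import heapq
import random

def count_heap_ops(graph, source):
    """Lazy-deletion Dijkstra that only counts heap pushes and pops."""
    dist = {source: 0}
    done = set()
    heap = [(0, source)]
    pushes, pops = 1, 0
    while heap:
        d, u = heapq.heappop(heap)
        pops += 1
        if u in done:            # stale entry: u was already processed
            continue
        done.add(u)
        for v, length in graph[u].items():   # each edge is scanned exactly once
            if v not in done and d + length < dist.get(v, float("inf")):
                dist[v] = d + length
                heapq.heappush(heap, (d + length, v))
                pushes += 1
    return pushes, pops

# Hypothetical random directed graph with up to 5 outgoing edges per vertex.
random.seed(0)
n = 50
graph = {u: {} for u in range(n)}
for u in range(n):
    for v in random.sample(range(n), 5):
        if v != u:
            graph[u][v] = random.randint(1, 100)

m = sum(len(neighbors) for neighbors in graph.values())
pushes, pops = count_heap_ops(graph, 0)
assert pushes <= m + 1       # at most one push per edge, plus the source
assert pops == pushes        # everything pushed is eventually popped
```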

### Problem set 6  
  
**1. Consider a directed graph with distinct and nonnegative edge lengths and a source vertex s. Fix a destination vertex t, and assume that the graph contains at least one s-t path. Which of the following statements are true?**

**(a) The shortest s-t path must include the minimum-length edge of G.**  
**(b) The shortest (i.e., minimum-length) s-t path might have as many as n−1 edges, where n is the number of vertices.**  
**(c) There is a shortest s-t path with no repeated vertices (i.e., a "simple" or "loopless" such path).**  
**(d) The shortest s-t path must exclude the maximum-length edge of G**

Answer: b, c

**2. Consider a directed graph G with a source vertex s, a destination t, and nonnegative edge lengths. Under what conditions is the shortest s-t path guaranteed to be unique?**

Answer: When all edge lengths are distinct powers of 2.
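
One way to see why distinct powers of 2 are enough: two different simple s-t paths use different sets of edges, and different sets of distinct powers of 2 always have different sums (a sum of distinct powers of 2 is just a binary number). The brute-force check below is a small sanity test of that fact for six hypothetical edge lengths, not a proof.

```python
from itertools import combinations

lengths = [2 ** k for k in range(6)]   # 1, 2, 4, 8, 16, 32

# Group every subset of edge lengths by its sum.
subsets_by_sum = {}
for r in range(len(lengths) + 1):
    for subset in combinations(lengths, r):
        subsets_by_sum.setdefault(sum(subset), []).append(subset)

# Every sum is produced by exactly one subset, so no two edge sets tie.
assert all(len(subsets) == 1 for subsets in subsets_by_sum.values())
assert len(subsets_by_sum) == 2 ** len(lengths)
```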

**3. Consider a directed graph G=(V,E) and a source vertex s with the following properties: edges that leave the source vertex s have arbitrary (possibly negative) lengths; all other edge lengths are nonnegative; and there are no edges from any other vertex to the source s. Does Dijkstra's shortest-path algorithm correctly compute shortest-path distances (from s) in this graph?**

Answer: Always (why?!)

**4. Consider a directed graph G and a source vertex s. Suppose G has some negative edge lengths but no negative cycles, meaning G does not have a directed cycle in which the sum of the edge lengths is negative. Suppose you run Dijkstra's algorithm on G (with source s). Which of the following statements are true? [Check all that apply.]**

**(a) It's impossible to run Dijkstra's algorithm on a graph with negative edge lengths.**  
**(b) Dijkstra's algorithm always terminates, but in some cases the paths it computes will not be the shortest paths from s to all other vertices.**  
**(c) Dijkstra's algorithm might loop forever.**  
**(d) Dijkstra's algorithm always terminates, and in some cases the paths it computes will be the correct shortest paths from s to all other vertices.**

Answer: b, d

**5. Consider a directed graph G and a source vertex s. Suppose G contains a negative cycle (a directed cycle in which the sum of the edge lengths is negative) and also a path from s to this cycle. Suppose you run Dijkstra's algorithm on G (with source s). Which of the following statements are true? [Check all that apply.]**

**(a) Dijkstra's algorithm might loop forever.**  
**(b) It's impossible to run Dijkstra's algorithm on a graph with a negative cycle.**  
**(c) Dijkstra's algorithm always terminates, and in some cases the paths it computes will be the correct shortest paths from s to all other vertices.**  
**(d) Dijkstra's algorithm always terminates, but in some cases the paths it computes will not be the shortest paths from s to all other vertices.**

Answer: d


### Optional Problems 6

**1. In lecture we define the length of a path to be the sum of the lengths of its edges.** 
**Define the bottleneck of a path to be the maximum length of one of its edges.** 
**A minimum-bottleneck path between two vertices s and t is a path with bottleneck no larger than that of any other s-t path.** 
**Show how to modify Dijkstra's algorithm to compute a minimum-bottleneck path between two given vertices. The running time should be O(mlogn), as in lecture.**


We replace the update A[w*] = A[v*] + l_v*w* by A[w*] = max(A[v*], l_v*w*), so that A stores the bottleneck of the best path found so far instead of its length.
We stop iterating once t is processed.

In [843]:
import heapq

def dijkstrasBottleneck(graph, start_vertex, fin_vertex):
    # a set gives O(1) membership tests (a list makes each test O(n))
    processed_verticies_indicies = set()
    
    min_bottleneck_and_prev_vert = {vertex: [float('inf'), 0] for vertex in range(1, len(graph)+1)}
    min_bottleneck_and_prev_vert[start_vertex] = [0, start_vertex]  
    
    heap_unexplored_vert = [(0, start_vertex)] # (bottleneck, vertex index)
    
    # run until the heap is empty, so an unreachable fin_vertex
    # cannot cause a pop from an empty heap
    while heap_unexplored_vert:
        current_min_bot, current_vertex = heapq.heappop(heap_unexplored_vert)
        
        # Nodes can get added to the heap multiple times. We only
        # process a vertex the first time we remove it from the heap.
        if current_vertex not in processed_verticies_indicies:           
            
            # iterate through current vertex neighbors to calculate their greedy scores (= bottlenecks) 
            for neighbor_index, edge_weight in graph[current_vertex-1].items():
                # bottleneck of a path = its largest edge, hence max instead of +
                neighbor_min_bot_neck = max(current_min_bot, edge_weight)

                # Only update the greedy score if it's better than any path we've already found.
                if neighbor_min_bot_neck < min_bottleneck_and_prev_vert[neighbor_index][0]:
                    
                    min_bottleneck_and_prev_vert[neighbor_index][0] = neighbor_min_bot_neck
                    min_bottleneck_and_prev_vert[neighbor_index][1] = current_vertex
                    
                    heapq.heappush(heap_unexplored_vert, (neighbor_min_bot_neck, neighbor_index))
            
            # if we processed fin vertex, its bottleneck is final
            if current_vertex == fin_vertex:
                
                print("[min bottleneck, previous vertex]")
                return min_bottleneck_and_prev_vert[fin_vertex]
            
            # else mark vertex as processed
            processed_verticies_indicies.add(current_vertex)
    
    # fin_vertex is unreachable: its entry is still [inf, 0]
    return min_bottleneck_and_prev_vert[fin_vertex]
            

In [844]:
dijkstrasBottleneck(gr_test, 1, 5)

[min bottleneck, previous vertex]


[2, 4]

In [833]:
gr_test = loadGraphDistTest()
gr_test

[{2: 1, 8: 2},
 {1: 1, 3: 3},
 {2: 3, 4: 1},
 {3: 1, 5: 1, 8: 2},
 {4: 1, 6: 1},
 {5: 1, 7: 5},
 {6: 5, 8: 1},
 {7: 1, 1: 2, 4: 2}]

**2. We can do better. Suppose now that the graph is undirected. Give a linear-time (O(m)) algorithm to compute a minimum-bottleneck path between two given vertices.**

**3. What if the graph is directed? Can you compute a minimum-bottleneck path between two given vertices faster than O(mlogn)?**

For answers, see [here](https://blog.asarkar.org/algorithms-design-analysis/hw-5-opt/)

### Programming assignment 6

**In this programming problem you'll code up Dijkstra's shortest-path algorithm.**

**The file contains an adjacency list representation of an undirected weighted graph with 200 vertices labeled 1 to 200. Each row consists of the node tuples that are adjacent to that particular vertex along with the length of that edge. For example, the 6th row has 6 as the first entry indicating that this row corresponds to the vertex labeled 6. The next entry of this row "141,8200" indicates that there is an edge between vertex 6 and vertex 141 that has length 8200. The rest of the pairs of this row indicate the other vertices adjacent to vertex 6 and the lengths of the corresponding edges.**

**Your task is to run Dijkstra's shortest-path algorithm on this graph, using 1 (the first vertex) as the source vertex, and to compute the shortest-path distances between 1 and every other vertex of the graph. If there is no path between a vertex v and vertex 1, we'll define the shortest-path distance between 1 and v to be 1000000.**

**You should report the shortest-path distances to the following ten vertices, in order: 7,37,59,82,99,115,133,165,188,197. You should encode the distances as a comma-separated string of integers. So if you find that all ten of these vertices except 115 are at distance 1000 away from vertex 1 and 115 is 2000 distance away, then your answer should be 1000,1000,1000,1000,1000,2000,1000,1000,1000,1000. Remember the order of reporting DOES MATTER, and the string should be in the same order in which the above ten vertices are given. The string should not contain any spaces. Please type your answer in the space provided.**

**IMPLEMENTATION NOTES: This graph is small enough that the straightforward O(mn) time implementation of Dijkstra's algorithm should work fine. OPTIONAL: For those of you seeking an additional challenge, try implementing the heap-based version. Note this requires a heap that supports deletions, and you'll probably need to maintain some kind of mapping between vertices and their positions in the heap.**

### Load data

In [803]:
def loadGraphDist():
    '''
    loads graph into an array of dicts
    each index of an array corresponds to a graph vertex_index-1
    each dict stores linked vertices with corresponding arc lengths {vert: len, vert: len, ...}
    '''
    graph = [{} for i in range(200)]
    # "with" closes the file even if parsing raises an exception
    with open("../algorithms_course_code/data/pg_asmt_6_shortest_path.txt", 'r') as file:
        for i, line in enumerate(file):
            line_split = line.split("\t")
            # first entry is the vertex label, last entry is the trailing newline
            for pair in line_split[1:-1]:
                pair_split = pair.split(",")
                graph[i][int(pair_split[0])] = int(pair_split[1])
    return graph

In [747]:
gr = loadGraphDist()

In [819]:
# load test data

def loadGraphDistTest():
    '''
    loads graph into an array of dicts
    each index of an array corresponds to a graph vertex_index-1
    each dict stores linked vertices with corresponding arc lengths {vert: len, vert: len, ...}
    '''
    graph = [{} for i in range(8)]
    i = 0
    # "with" closes the file even if parsing raises an exception
    with open("../algorithms_course_code/data/pg_asmt_6_shortest_path_test_cases.txt", 'r') as file:
        for line in file:
            line_strip = line.rstrip("\n")
            line_split = line_strip.split(" ")
            
            # skip blank lines; i only advances on lines that describe a vertex
            if line_split != ['']:
                for pair in line_split[1:]:
                    pair_split = pair.split(",")
                    graph[i][int(pair_split[0])] = int(pair_split[1])
                i += 1
    return graph

In [821]:
gr_test = loadGraphDistTest()
gr_test

[{2: 1, 8: 2},
 {1: 1, 3: 1},
 {2: 1, 4: 1},
 {3: 1, 5: 1},
 {4: 1, 6: 1},
 {5: 1, 7: 1},
 {6: 1, 8: 1},
 {7: 1, 1: 2}]

### Algorithm

In [797]:
import heapq

def dijkstras(graph, start_vertex):
    # a set gives O(1) membership tests (a list makes each test O(n))
    processed_verticies_indicies = set()
    distances = {vertex: float('inf') for vertex in range(1, len(graph)+1)}
    distances[start_vertex] = 0 
    
    heap_unexplored_vert = [(0, start_vertex)] # (distance, vertex index)
    
    # run until the heap is empty: vertices unreachable from start_vertex
    # never enter the heap and simply keep distance inf
    while heap_unexplored_vert:
        current_distance, current_vertex = heapq.heappop(heap_unexplored_vert)
        
        # Nodes can get added to the heap multiple times. We only
        # process a vertex the first time we remove it from the heap.
        if current_vertex not in processed_verticies_indicies:           
            
            # iterate through current vertex neighbors to calculate their greedy scores (= distances) 
            for neighbor_index, neighbor_dist in graph[current_vertex-1].items():
                new_neighbor_dist = current_distance + neighbor_dist
            
                # Only update the greedy score if it's better than any path we've already found.
                if new_neighbor_dist < distances[neighbor_index]:
                    distances[neighbor_index] = new_neighbor_dist
                    heapq.heappush(heap_unexplored_vert, (new_neighbor_dist, neighbor_index))
            
            # mark vertex as processed
            processed_verticies_indicies.add(current_vertex)
    
    return distances

**Test case run**

In [822]:
dijkstras(gr_test, 1)

{1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 4, 7: 3, 8: 2}

**Real data run**

In [801]:
result = dijkstras(gr, 1)
result

{1: 0,
 2: 2971,
 3: 2644,
 4: 3056,
 5: 2525,
 6: 2818,
 7: 2599,
 8: 1875,
 9: 745,
 10: 3205,
 11: 1551,
 12: 2906,
 13: 2394,
 14: 1803,
 15: 2942,
 16: 1837,
 17: 3111,
 18: 2284,
 19: 1044,
 20: 2351,
 21: 3630,
 22: 4028,
 23: 2650,
 24: 3653,
 25: 2249,
 26: 2150,
 27: 1222,
 28: 2090,
 29: 3540,
 30: 2303,
 31: 3455,
 32: 3004,
 33: 2551,
 34: 2656,
 35: 998,
 36: 2236,
 37: 2610,
 38: 3548,
 39: 1851,
 40: 4091,
 41: 2732,
 42: 2040,
 43: 3312,
 44: 2142,
 45: 3438,
 46: 2937,
 47: 2979,
 48: 2757,
 49: 2437,
 50: 3152,
 51: 2503,
 52: 2817,
 53: 2420,
 54: 3369,
 55: 2862,
 56: 2609,
 57: 2857,
 58: 3668,
 59: 2947,
 60: 2592,
 61: 1676,
 62: 2573,
 63: 2498,
 64: 2047,
 65: 826,
 66: 3393,
 67: 2535,
 68: 4636,
 69: 3650,
 70: 743,
 71: 1265,
 72: 1539,
 73: 3007,
 74: 4286,
 75: 2720,
 76: 3220,
 77: 2298,
 78: 2795,
 79: 2806,
 80: 982,
 81: 2976,
 82: 2052,
 83: 3997,
 84: 2656,
 85: 1193,
 86: 2461,
 87: 1608,
 88: 3046,
 89: 3261,
 90: 2018,
 91: 2786,
 92: 647,
 93: 3

In [800]:
keys = [7, 37, 59, 82, 99, 115, 133, 165, 188, 197]
values = []

for key in keys:
    value = result.get(key)
    values.append(value)
    
print(values)

[2599, 2610, 2947, 2052, 2367, 2399, 2029, 2442, 2505, 3068]
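
The assignment asks for the answer as a comma-separated string without spaces, with 1000000 standing in for unreachable vertices. A minimal sketch of that last formatting step follows; the helper name `format_answer` is hypothetical, and the example data is the one given in the assignment text, not the real result.

```python
def format_answer(distances, targets, unreachable=1000000):
    """Join the distances of the target vertices into 'd1,d2,...' (no spaces).

    Vertices that are missing or at distance inf are reported as `unreachable`.
    """
    parts = []
    for vertex in targets:
        d = distances.get(vertex, float("inf"))
        parts.append(str(unreachable if d == float("inf") else d))
    return ",".join(parts)

targets = [7, 37, 59, 82, 99, 115, 133, 165, 188, 197]

# Example from the assignment text: everything at 1000 except vertex 115 at 2000.
example = {vertex: 1000 for vertex in targets}
example[115] = 2000
print(format_answer(example, targets))
# 1000,1000,1000,1000,1000,2000,1000,1000,1000,1000
```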
