<a href="https://colab.research.google.com/github/Ali-Kazmi/All-Pairs-Shortest-Path/blob/master/APSP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

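In this notebook I use Python's NetworkX library to compute All-Pairs Shortest Paths (APSP) on a road-network dataset: the city of Oldenburg road network described below. The California road files fetched in the first cell supply the demo image and the helper utilities.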

The cell below downloads the files we need.

In [0]:
!if ! test -f ca-roads.zip ; then wget https://github.com/rvuduc/graph-demos-for-ali/raw/master/sssp-networkx/ca-roads.zip ; fi
!if ! test -d ca-roads ; then unzip ca-roads.zip && ls -al | grep ca-roads* ; fi
!if ! test -f ca-roads-path-demo.png ; then wget https://raw.githubusercontent.com/rvuduc/graph-demos-for-ali/master/sssp-networkx/ca-roads-path-demo.png ; fi
!if ! test -f problem_utils.py ; then wget https://github.com/rvuduc/graph-demos-for-ali/raw/master/sssp-networkx/problem_utils.py ; fi

Steps to get the dataset I used: 
1) Go to https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm 
2) Scroll down to the last dataset: City of Oldenburg road network (chosen for its small size) 
3) Open the node file and the edge file (each opens in its own browser tab) 
4) Save each as text, convert them to CSV using Excel, then upload them into Colab using the cells below 
5) Load them into pandas using the commands below (make sure the file names match)
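
As an alternative to the Excel step, the text files can probably be read into pandas directly. This is a minimal sketch, assuming the files are whitespace-separated with no header row; the two sample lines are made up to mimic that layout and are not taken from the real files.

```python
import io
import pandas as pd

# Hypothetical sample in the assumed layout: edge id, start node, end node, distance.
sample_edges = io.StringIO("0 0 1 95.5\n1 0 2 359.7\n")

# header=None keeps the first row as data instead of using it as column names.
edges_demo = pd.read_csv(
    sample_edges,
    sep=r"\s+",
    header=None,
    names=["Edge ID", "Start_Node_ID", "End_Node_ID", "L2_distance"],
)
print(edges_demo)
```

With a real file, the StringIO object would be replaced by the file name. Passing header=None matters for header-less files: without it, pandas treats the first data row as the column names and that row is lost.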

In [4]:
from google.colab import files

uploaded = files.upload()

Saving ol-nodes.csv to ol-nodes.csv


In [5]:
from google.colab import files

uploaded = files.upload()

Saving OL.cedge.csv to OL.cedge.csv


In [6]:
import pandas as pd

nodes = pd.read_csv("ol-nodes.csv",nrows=2000)
nodes.columns = ['Node_ID', 'X','Y']
print(nodes)

      Node_ID            X            Y
0           1   863.275757  3005.275635
1           2   690.196411  3333.704834
2           3  1197.556519  2984.470215
3           4  1261.188599  2985.956299
4           5   722.436707  3467.454346
...       ...          ...          ...
1995     1996  5259.831055  8158.344238
1996     1997  5121.961426  8039.456055
1997     1998  4969.244629  9184.871094
1998     1999  5509.693359  8321.816406
1999     2000  4911.975586  8922.945312

[2000 rows x 3 columns]


In [7]:
edges=pd.read_csv("OL.cedge.csv",nrows=2000)
edges.columns = ['Edge ID', 'Start_Node_ID','End_Node_ID','L2_distance']
print(edges)

      Edge ID  Start_Node_ID  End_Node_ID  L2_distance
0           1           2471         2479    29.718756
1           2           2463         2471    61.706902
2           3           2443         2448    19.080025
3           4           1417         1491    28.248583
4           5           4961         4962    15.014112
...       ...            ...          ...          ...
1995     1996           2263         4144   106.143394
1996     1997           2184         2213   115.173096
1997     1998           2168         2184    95.859863
1998     1999           4145         4147     9.629313
1999     2000           4145         4146     9.210923

[2000 rows x 4 columns]


The cell below displays an example of a single shortest path (between two vertices). For APSP, we find such a shortest path between every pair of vertices. As you can imagine, this gets expensive fast.
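
To make "every pair" concrete, here is a minimal sketch of APSP as one single-source Dijkstra run per node. The toy graph and the helper name dijkstra are made up for illustration; this is not the NetworkX implementation.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry, a shorter route was already found
        for u, w in adj[v].items():
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

# Toy undirected graph (made up for illustration).
adj = {
    "A": {"B": 1.0, "C": 5.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 5.0, "B": 2.0, "D": 1.0},
    "D": {"C": 1.0},
}

# APSP: repeat the single-source search from every node.
apsp = {s: dijkstra(adj, s) for s in adj}
print(apsp["A"])

# Number of ordered (source, target) pairs for a graph with n nodes.
n = 3278
n_pairs = n * (n - 1)
print(n_pairs)
```

The pair count grows quadratically with the number of nodes, which is why the timings later in the notebook are measured in seconds even for a few thousand nodes.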

In [9]:
from problem_utils import get_path, display_image, assert_tibbles_are_equivalent, pandas_df_to_markdown_table
print("Example of what you will produce in this problem (shortest paths on the California road network):")
display_image(get_path('ca-roads-path-demo.png'))
print("What we will be doing is calculating the shortest path between EVERY pair of vertices.")

Example of what you will produce in this problem (shortest paths on the California road network):
What we will be doing is calculating the shortest path between EVERY pair of vertices.


In [10]:
def get_edgelist(df):
    """Return a list of (start, end, {'w': distance}) tuples, one per row of df."""
    ### BEGIN SOLUTION
    return [(a, b, {'w': w}) for a, b, w in zip(df["Start_Node_ID"], df["End_Node_ID"], df["L2_distance"])]
    ### END SOLUTION
    
# Demo
edgelist = get_edgelist(edges)
edgelist[:]

[(2471, 2479, {'w': 29.718756}),
 (2463, 2471, {'w': 61.706902}),
 (2443, 2448, {'w': 19.080025}),
 (1417, 1491, {'w': 28.248583}),
 (4961, 4962, {'w': 15.014111999999999}),
 (3150, 3156, {'w': 78.55722}),
 (1629, 1633, {'w': 10.386786}),
 (2234, 2249, {'w': 55.720881999999996}),
 (2159, 2162, {'w': 37.383289000000005}),
 (2154, 2159, {'w': 160.897079}),
 (2148, 2150, {'w': 15.142239000000002}),
 (1312, 1317, {'w': 55.35796}),
 (1692, 1700, {'w': 75.79039}),
 (293, 299, {'w': 19.293741}),
 (3806, 3807, {'w': 19.894999}),
 (3521, 3528, {'w': 85.082039}),
 (3515, 3521, {'w': 104.642044}),
 (3833, 3845, {'w': 75.549019}),
 (3857, 3874, {'w': 81.339333}),
 (3848, 3851, {'w': 10.663867999999999}),
 (3833, 3848, {'w': 82.003647}),
 (472, 475, {'w': 562.864563}),
 (3202, 3203, {'w': 5.514648}),
 (0, 2, {'w': 359.674072}),
 (2, 5, {'w': 137.580414}),
 (5, 7, {'w': 314.74127200000004}),
 (7, 11, {'w': 487.54174800000004}),
 (11, 37, {'w': 227.21356200000002}),
 (0, 1, {'w': 95.952362}),
 (1, 3,

In [12]:
!pip install ipython-autotime

Collecting ipython-autotime
  Downloading https://files.pythonhosted.org/packages/e6/f9/0626bbdb322e3a078d968e87e3b01341e7890544de891d0cb613641220e6/ipython-autotime-0.1.tar.bz2
Building wheels for collected packages: ipython-autotime
  Building wheel for ipython-autotime (setup.py) ... done
  Created wheel for ipython-autotime: filename=ipython_autotime-0.1-cp36-none-any.whl size=1832 sha256=26e13d54e573c93d38384aaae1d1b17422d2c3fab51a6b11abe94b6262f16518
  Stored in directory: /root/.cache/pip/wheels/d2/df/81/2db1e54bc91002cec40334629bc39cfa86dff540b304ebcd6e
Successfully built ipython-autotime
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.1


In [0]:
%load_ext autotime

In [14]:

from networkx import Graph
G = Graph()
G.add_nodes_from(nodes["Node_ID"])
G.add_edges_from(edgelist)

print(f"The graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")

The graph has 3278 nodes and 1999 edges.
time: 435 ms
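
The node count printed above is larger than the 2,000 rows loaded from the nodes file. A likely explanation is that add_edges_from also creates any edge endpoint that is not already in the graph. A small illustration with made-up node IDs:

```python
from networkx import Graph

H = Graph()
H.add_nodes_from([1, 2, 3])            # three nodes added explicitly
H.add_edges_from([(1, 2), (10, 11)])   # 10 and 11 were never added as nodes

# The two unseen endpoints are created automatically.
print(H.number_of_nodes(), H.number_of_edges())
```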


In [17]:
def get_shortest_path(G):
    from networkx import all_pairs_dijkstra_path
    return dict(all_pairs_dijkstra_path(G, weight='w'))
start = nodes["Node_ID"].iloc[1]
finish = nodes["Node_ID"].iloc[3]
#print(f"Calculating a shortest path between all pairs of nodes")
path1 = get_shortest_path(G)
#print(f"\n==> This cell made a dictionary of shortest paths using networkX `{type(path)}`:")
#print("To use, enter two nodeID's and this will return the shortest path.")
#print("If there is no path,it will throw a key error as the pairs that cannot be reached are not stored.")

#print(path[start][finish])

time: 12.2 s


In [16]:
def get_shortest_path(G):
    from networkx import all_pairs_bellman_ford_path
    return dict(all_pairs_bellman_ford_path(G, weight='w'))
start = nodes["Node_ID"].iloc[1]
finish = nodes["Node_ID"].iloc[3]
#print(f"Calculating a shortest path between all pairs of nodes")
path = get_shortest_path(G)
#print(f"\n==> This cell made a dictionary of shortest paths using networkX `{type(path)}`:")
#print("To use, enter two nodeID's and this will return the shortest path.")
#print("If there is no path,it will throw a key error as the pairs that cannot be reached are not stored.")

#print(path[start][finish])

time: 52 s


In [18]:
print(path1==path)

True
time: 767 ms


In [20]:
print(path[1][8])

[1, 3, 4, 6, 8]
time: 958 µs
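
As the commented-out notes in the cells above say, unreachable pairs are not stored, so indexing them raises a KeyError. A minimal sketch of a safe lookup, using a small hand-written dictionary in the same nested shape (the helper name lookup_path is hypothetical):

```python
def lookup_path(paths, start, finish):
    """Return the stored path from start to finish, or None if there is none."""
    return paths.get(start, {}).get(finish)

# Hand-written stand-in for the dictionary returned by get_shortest_path.
toy_paths = {
    1: {1: [1], 3: [1, 3]},
    3: {3: [3], 1: [3, 1]},
    9: {9: [9]},  # node 9 is isolated, so it only reaches itself
}

print(lookup_path(toy_paths, 1, 3))
print(lookup_path(toy_paths, 1, 9))
```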


In [68]:
from networkx import floyd_warshall

pathfw = floyd_warshall(G)


KeyboardInterrupt: ignored

time: 43.9 s


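The cells above demonstrate an important point: the choice of algorithm changes the running time substantially. NetworkX's Bellman-Ford APSP took 52 seconds, while its Dijkstra APSP took only about 12 seconds (both on CPU, no GPU)! NetworkX's Floyd-Warshall had still not finished when I interrupted it after about 44 seconds (its cost grows with the cube of the number of nodes no matter how sparse the graph is, so it does significantly more work here). Note also that the Floyd-Warshall call does not pass weight='w', so it treats every edge as having weight 1. The advantage of Bellman-Ford is that it can handle negative edge weights, but it is slower. Below, I try to use Numba to speed this up further.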
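
To illustrate the point about negative edge weights, here is a minimal single-source Bellman-Ford sketch on a made-up directed graph. The function and variable names are my own, and this is not the NetworkX implementation.

```python
def bellman_ford(edges, nodes, source):
    """Single-source shortest distances; edges are directed (u, v, w) triples."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0.0
    # At most len(nodes) - 1 rounds of relaxing every edge are needed.
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    # If an edge can still be relaxed, a negative cycle is reachable.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist

bf_nodes = ["s", "a", "b"]
bf_edges = [("s", "a", 4.0), ("s", "b", 5.0), ("b", "a", -3.0)]
bf_dist = bellman_ford(bf_edges, bf_nodes, "s")
print(bf_dist)
```

The cheapest route to "a" goes through "b" and uses the negative edge. Dijkstra would settle "a" at distance 4 before ever looking at "b", which is why it is only safe for non-negative weights.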

In [30]:
!pip install numba

time: 2.81 s


In [71]:
from heapq import heappush, heappop
from itertools import count
import networkx as nx  # needed for nx.NodeNotFound and nx.NetworkXNoPath below
from numba import jit
#@jit(nopython=True)
def _dijkstra_multisource(G, sources, weight, pred=None, paths=None,
                          cutoff=None, target=None):
    G_succ = G._succ if G.is_directed() else G._adj

    push = heappush
    pop = heappop
    dist = {}  # dictionary of final distances
    seen = {}
    # fringe is heapq with 3-tuples (distance,c,node)
    # use the count c to avoid comparing nodes (may not be able to)
    c = count()
    fringe = []
    for source in sources:
        if source not in G:
            raise nx.NodeNotFound("Source {} not in G".format(source))
        seen[source] = 0
        push(fringe, (0, next(c), source))
    while fringe:
        (d, _, v) = pop(fringe)
        if v in dist:
            continue  # already searched this node.
        dist[v] = d
        if v == target:
            break
        for u, e in G_succ[v].items():
            cost = weight(v, u, e)
            if cost is None:
                continue
            vu_dist = dist[v] + cost
            if cutoff is not None:
                if vu_dist > cutoff:
                    continue
            if u in dist:
                if vu_dist < dist[u]:
                    raise ValueError('Contradictory paths found:',
                                     'negative weights?')
            elif u not in seen or vu_dist < seen[u]:
                seen[u] = vu_dist
                push(fringe, (vu_dist, next(c), u))
                if paths is not None:
                    paths[u] = paths[v] + [u]
                if pred is not None:
                    pred[u] = [v]
            elif vu_dist == seen[u]:
                if pred is not None:
                    pred[u].append(v)

    # The optional predecessor and path dictionaries can be accessed
    # by the caller via the pred and paths objects passed as arguments.
    return dist
#@jit(nopython=True)
def _weight_function(G, weight):
    if callable(weight):
        return weight
    # If the weight keyword argument is not callable, we assume it is a
    # string representing the edge attribute containing the weight of
    # the edge.
    if G.is_multigraph():
        return lambda u, v, d: min(attr.get(weight, 1) for attr in d.values())
    return lambda u, v, data: data.get(weight, 1)
#@jit(nopython=True)
def multi_source_dijkstra(G, sources, target=None, cutoff=None,
                          weight='weight'):
    if not sources:
        raise ValueError('sources must not be empty')
    if target in sources:
        return (0, [target])
    weight = _weight_function(G, weight)
    paths = {source: [source] for source in sources}  # dictionary of paths
    dist = _dijkstra_multisource(G, sources, weight, paths=paths,
                                 cutoff=cutoff, target=target)
    if target is None:
        return (dist, paths)
    try:
        return (dist[target], paths[target])
    except KeyError:
        raise nx.NetworkXNoPath("No path to {}.".format(target))

#@jit(nopython=True)
def multi_source_dijkstra_path(G, sources, cutoff=None, weight='weight'):
    length, path = multi_source_dijkstra(G, sources, cutoff=cutoff,
                                         weight=weight)
    return path
#@jit(nopython=True)
def single_source_dijkstra_path(G, source, cutoff=None, weight='weight'):
    return multi_source_dijkstra_path(G, {source}, cutoff=cutoff,
                                      weight=weight)
@jit() 
#@jit(nopython=True)
def numba_all_pairs_dijkstra_path(G, cutoff=None, weight='weight'):
    path = single_source_dijkstra_path
    # TODO This can be trivially parallelized.
    for n in G:
        yield (n, path(G, n, cutoff=cutoff, weight=weight))

time: 88.7 ms


In [72]:
def get_shortest_path(G):
    return dict(numba_all_pairs_dijkstra_path(G, weight='w'))
start = nodes["Node_ID"].iloc[1]
finish = nodes["Node_ID"].iloc[3]
#print(f"Calculating a shortest path between all pairs of nodes")
path4 = get_shortest_path(G)
#print(f"\n==> This cell made a dictionary of shortest paths using networkX `{type(path)}`:")
#print("To use, enter two nodeID's and this will return the shortest path.")
#print("If there is no path,it will throw a key error as the pairs that cannot be reached are not stored.")

#print(path[start][finish])

Compilation is falling back to object mode WITH looplifting enabled because Function "numba_all_pairs_dijkstra_path" failed type inference due to: Untyped global name 'single_source_dijkstra_path': cannot determine Numba type of <class 'function'>

File "<ipython-input-71-4997665e430c>", line 95:
def numba_all_pairs_dijkstra_path(G, cutoff=None, weight='weight'):
    path = single_source_dijkstra_path
    ^

  @jit()

File "<ipython-input-71-4997665e430c>", line 94:
#@jit(nopython=True)
def numba_all_pairs_dijkstra_path(G, cutoff=None, weight='weight'):
^

  state.func_ir.loc))
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-71-4997665e430c>", line 94:
#@jit(nopython=True)
def numba_all_pairs_dijkstra_path(G, cutoff=None, weight

time: 17.6 s


If you just add @jit() above numba_all_pairs_dijkstra_path, it doesn't actually speed things up! It takes slightly longer (13.5 seconds on one run and 17 seconds on a second, compared to around 12 seconds without it). As the warnings show, Numba cannot infer types for this code, so it falls back to object mode and gains nothing over plain Python. Numba's nopython mode is faster, but it can't run directly on NetworkX graphs. Numba works best on basic data structures such as NumPy arrays, which is an avenue to explore.
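
As a rough idea of what that avenue could look like, here is a sketch of Floyd-Warshall written over a plain NumPy distance matrix, which is the kind of code Numba's nopython mode can compile. This is my own illustration rather than anything timed in this notebook; the function name floyd_warshall_dense is hypothetical, and the fallback decorator only exists so the sketch still runs when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba.
    def njit(func):
        return func

@njit
def floyd_warshall_dense(dist):
    """All-pairs distances from a dense matrix (inf = no edge, 0 on the diagonal)."""
    n = dist.shape[0]
    out = dist.copy()  # leave the input untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                alt = out[i, k] + out[k, j]
                if alt < out[i, j]:
                    out[i, j] = alt
    return out

inf = np.inf
W = np.array([
    [0.0, 1.0, inf],
    [1.0, 0.0, 2.0],
    [inf, 2.0, 0.0],
])
D = floyd_warshall_dense(W)
print(D)
```

Only arrays and scalars cross the function boundary, so there is no graph object for Numba to type.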

In [40]:
print(path4==path)

True
time: 1.04 s


Now, I'm going to do the following: 

1) Convert the NetworkX graph from above into an adjacency-matrix representation in NumPy 

2) Run SciPy's shortest_path on the new adjacency matrix, to test the speed of a NumPy-based approach 

3) Use Numba to speed this up? (Edit: Numba can't simply speed up SciPy.) 
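
For step 1, one detail is worth a small sketch: to_numpy_array looks for an edge attribute called 'weight' by default and uses 1 for edges that lack it. The toy graph below is made up; it stores distances under 'w', as this notebook does.

```python
from networkx import Graph, to_numpy_array

T = Graph()
T.add_edges_from([(1, 2, {"w": 1.5}), (2, 3, {"w": 2.0})])

order = sorted(T.nodes())  # fix the row/column order explicitly

# Default weight='weight' is missing on these edges, so every edge becomes 1.
A_hops = to_numpy_array(T, nodelist=order)

# Passing weight='w' fills the matrix with the stored distances instead.
A_dist = to_numpy_array(T, nodelist=order, weight="w")

print(A_hops)
print(A_dist)
```

Passing nodelist also makes it clear which node ID each row and column refers to.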




In [58]:
import numpy as np
from networkx import to_numpy_array
adjlist = to_numpy_array(G)

time: 63.9 ms


In [59]:
print(adjlist.shape)
adjlist.view()


(3278, 3278)


array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

time: 7.09 ms


In [66]:
from scipy.sparse.csgraph import shortest_path

result = shortest_path(adjlist, method='FW')

time: 3.64 s


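Wow! SciPy can do APSP in around 3.6 seconds (averaged over 3 runs) using Floyd-Warshall! This makes NetworkX look really slow. One caveat: to_numpy_array was called without weight='w', so the matrix holds 1 for every edge and the result contains hop counts rather than road distances. SciPy also returns only distances here, not the paths themselves.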

Looking into how SciPy does it (https://github.com/scipy/scipy/blob/master/scipy/sparse/csgraph/_shortest_path.pyx), the trick is Cython, which compiles to C. Its csgraph routines are also designed to take advantage of sparse graphs.
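
A minimal sketch of how the sparse side of csgraph can be used, assuming a small made-up weighted matrix: the graph is stored as a CSR matrix, Dijkstra is selected with method='D', and return_predecessors=True gives enough information to rebuild the paths. The helper rebuild_path is my own, not part of SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Made-up weighted adjacency matrix; 0 means "no edge". Node 3 is isolated.
A = np.array([
    [0.0, 1.5, 10.0, 0.0],
    [1.5, 0.0, 2.0, 0.0],
    [10.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

dist, pred = shortest_path(
    csr_matrix(A), method="D", directed=False, return_predecessors=True
)

def rebuild_path(pred, i, j):
    """Walk the predecessor matrix back from j to i; None if j is unreachable."""
    if i != j and pred[i, j] < 0:
        return None
    path = [j]
    while j != i:
        j = int(pred[i, j])
        path.append(j)
    return path[::-1]

print(dist[0, 2], rebuild_path(pred, 0, 2))
print(dist[0, 3], rebuild_path(pred, 0, 3))
```

Row and column indices here are matrix positions, so with a real graph they would need to be mapped back to node IDs.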

In [63]:
result.view()

array([[ 0.,  2.,  1., ..., inf, inf, inf],
       [ 2.,  0.,  3., ..., inf, inf, inf],
       [ 1.,  3.,  0., ..., inf, inf, inf],
       ...,
       [inf, inf, inf, ...,  0.,  1.,  1.],
       [inf, inf, inf, ...,  1.,  0.,  2.],
       [inf, inf, inf, ...,  1.,  2.,  0.]])

time: 4.52 ms
