<a href="https://colab.research.google.com/github/Ali-Kazmi/All-Pairs-Shortest-Path/blob/master/APSP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook I will use Python's NetworkX library to perform All Pairs Shortest Paths (APSP) on the California road dataset. Then I'll use SciPy to do the same much faster. Lastly, I'll implement my own version and use Numba to speed it up, beating SciPy!

In [225]:
# Get the data 
from pandas import read_csv
# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True)
OLcnode = read_csv('https://www.cs.utah.edu/~lifeifei/research/tpq/OL.cnode',
                 names=['Node_ID', 'X','Y'],
                 sep=r'\s+', nrows=1000) 

OLcedge=read_csv('https://www.cs.utah.edu/~lifeifei/research/tpq/OL.cedge',
                 names=['Edge ID', 'Start_Node_ID','End_Node_ID','L2_distance'],
                 sep=r'\s+',nrows=1000) 


time: 903 ms


In [2]:
OLcedge.head() 

Unnamed: 0,Edge ID,Start_Node_ID,End_Node_ID,L2_distance
0,0,1609,1622,57.403187
1,1,2471,2479,29.718756
2,2,2463,2471,61.706902
3,3,2443,2448,19.080025
4,4,1417,1491,28.248583


In [3]:
OLcnode.head()

Unnamed: 0,Node_ID,X,Y
0,0,769.948669,2982.984131
1,1,863.275757,3005.275635
2,2,690.196411,3333.704834
3,3,1197.556519,2984.470215
4,4,1261.188599,2985.956299


The cells below will output an example of one shortest path (between two vertices). For APSP, we find a shortest path between every pair of vertices. As you can imagine, it gets expensive fast.
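
To make the cost concrete, here is a minimal pure-Python sketch (not NetworkX's implementation) that runs Dijkstra from one source and then simply repeats it from every source to get APSP. The toy graph and the function names `dijkstra` and `all_pairs` are made up for illustration.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from source; adj maps node -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def all_pairs(adj):
    """APSP the brute-force way: one Dijkstra run per source node."""
    return {s: dijkstra(adj, s) for s in adj}

# Hypothetical toy graph (undirected, so each edge is listed both ways).
toy = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 4.0, "b": 2.0},
    "d": {},  # isolated node: unreachable from the others
}
apsp = all_pairs(toy)
print(apsp["a"]["c"])    # 3.0, via a -> b -> c
print("d" in apsp["a"])  # False, unreachable pairs are simply absent
```

The work for one source is repeated once per node, which is why the running time climbs so quickly as the graph grows.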

In [226]:
def get_edgelist(df):
    """Return (start, end, {'w': length}) tuples, the format Graph.add_edges_from accepts."""
    return [(a, b, {'w': w}) for a, b, w in zip(df["Start_Node_ID"], df["End_Node_ID"], df["L2_distance"])]
    
# Demo
edgelist = get_edgelist(OLcedge)
edgelist[:5] #just display a few. There's a lot of these! 

[(1609, 1622, {'w': 57.403187}),
 (2471, 2479, {'w': 29.718756}),
 (2463, 2471, {'w': 61.706902}),
 (2443, 2448, {'w': 19.080025}),
 (1417, 1491, {'w': 28.248583})]

time: 10.6 ms


In [6]:
!pip install ipython-autotime

Collecting ipython-autotime
  Downloading https://files.pythonhosted.org/packages/e6/f9/0626bbdb322e3a078d968e87e3b01341e7890544de891d0cb613641220e6/ipython-autotime-0.1.tar.bz2
Building wheels for collected packages: ipython-autotime
  Building wheel for ipython-autotime (setup.py) ... done
  Created wheel for ipython-autotime: filename=ipython_autotime-0.1-cp36-none-any.whl size=1832 sha256=86d2c1dbe22d6f0e53b3f77a931a9a15d9a2327730e33662daac016c5ec210e7
  Stored in directory: /root/.cache/pip/wheels/d2/df/81/2db1e54bc91002cec40334629bc39cfa86dff540b304ebcd6e
Successfully built ipython-autotime
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.1


In [0]:
%load_ext autotime

In [227]:

from networkx import Graph
G = Graph()
G.add_nodes_from(OLcnode["Node_ID"])
G.add_edges_from(edgelist)

print(f"The graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")

The graph has 1695 nodes and 999 edges.
time: 11 ms


In [228]:
def get_shortest_path(G):
    from networkx import all_pairs_dijkstra_path
    return dict(all_pairs_dijkstra_path(G, weight='w'))

start = OLcnode["Node_ID"].iloc[1]
finish = OLcnode["Node_ID"].iloc[3]
# Build a dictionary of shortest paths between all pairs of nodes (Dijkstra).
# To use it, index with two node IDs: path1[start][finish] is the shortest path.
# Pairs that cannot be reached are not stored, so looking one up raises a KeyError.
path1 = get_shortest_path(G)

time: 2.2 s


In [229]:
def get_shortest_path(G):
    from networkx import all_pairs_bellman_ford_path
    return dict(all_pairs_bellman_ford_path(G, weight='w'))

start = OLcnode["Node_ID"].iloc[1]
finish = OLcnode["Node_ID"].iloc[3]
# Same dictionary of shortest paths as above, this time computed with Bellman-Ford.
# To use it, index with two node IDs: path[start][finish] is the shortest path.
# Pairs that cannot be reached are not stored, so looking one up raises a KeyError.
path = get_shortest_path(G)

time: 7.28 s


In [230]:
print(path1==path)

True
time: 112 ms


In [231]:
print(path[1][5])

[1, 0, 2, 5]
time: 1.27 ms


In [232]:
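# This didn't finish in almost 4 minutes on a larger graph... pretty slow, I'd say (it calculates both dist and pred, so it takes extra memory, but it only returns one of them)
# Note: the edge lengths are stored under 'w', so the default weight='weight' treats every edge as length 1.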
from networkx import floyd_warshall

pathfw = floyd_warshall(G)


KeyboardInterrupt: ignored

time: 5min 32s


In [233]:
# An earlier run of this one finished in just under 3 minutes: much worse than the other options, but still faster than the cell above
from networkx import floyd_warshall_numpy

pathfwnp = floyd_warshall_numpy(G)
# Much better here: only 24 seconds on a 1500 or so node graph 

time: 23.9 s


In [17]:
pathfwnp

matrix([[ 0.,  1.,  1., ..., inf, inf, inf],
        [ 1.,  0.,  2., ..., inf, inf, inf],
        [ 1.,  2.,  0., ..., inf, inf, inf],
        ...,
        [inf, inf, inf, ...,  0.,  2.,  3.],
        [inf, inf, inf, ...,  2.,  0.,  1.],
        [inf, inf, inf, ...,  3.,  1.,  0.]])

time: 3.93 ms


The cells above demonstrate an important point: APSP can take drastically different amounts of time depending on the algorithm. On one graph, NetworkX's Bellman-Ford APSP took 52 seconds, while its Dijkstra version took only 14 (both without a GPU). Its Floyd-Warshall took much, much longer (although it computes the predecessors as well), while its NumPy-based Floyd-Warshall ran just fine, albeit we can do better. The advantage of Bellman-Ford is that it can handle negative edge weights, but it is slower. Below, I will code my own version and then try to use Numba to speed it up even more.
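
As a rough illustration of the negative-weight point, here is a minimal single-source Bellman-Ford sketch in plain Python. It is not NetworkX's code; the function name and the small directed graph are hypothetical. The graph is directed on purpose: in an undirected graph, a single negative edge already forms a negative cycle.

```python
def bellman_ford(edges, num_nodes, source):
    """edges: list of directed (u, v, w) tuples over nodes 0..num_nodes-1."""
    inf = float("inf")
    dist = [inf] * num_nodes
    dist[source] = 0.0
    # Relax every edge up to num_nodes - 1 times; stop early if nothing changes.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    # One extra pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist

# Hypothetical directed graph with one negative edge (1 -> 2).
edges = [(0, 1, 4.0), (0, 2, 5.0), (1, 2, -3.0), (2, 3, 2.0)]
print(bellman_ford(edges, 4, 0))  # [0.0, 4.0, 1.0, 3.0]
```

Every pass touches every edge, which is where the extra time goes compared with Dijkstra's priority queue.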

(Note: if you just add @jit() above the numba_all_pairs_dijkstra path, it doesn't actually speed it up! It takes slightly longer (13.5 seconds, compared to around 12 seconds regularly; on a second run, 17 seconds) because Numba is not properly exploiting the data types. It also gives a bunch of warnings. Numba's nopython mode is faster, but it can't just run on NetworkX graphs. Numba's use case is basic data structures, which is an avenue to explore.)

In [0]:
print(path4==path)

True
time: 1.04 s


Now, I'm going to do the following: 

1) convert the NetworkX graph we had above into an adjacency matrix representation in NumPy 

2) use SciPy's shortest path routine on the new adjacency matrix, to test the speed of a NumPy-based approach 

3) use Numba to speed this up? (edit: Numba can't just speed up SciPy) 
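
Step 1 can be sketched without NetworkX, assuming nested Python lists as a stand-in for a NumPy array. The helper name `to_adjacency_matrix` is hypothetical. The node-to-index mapping matters because the node IDs in this dataset are not guaranteed to be 0..n-1.

```python
def to_adjacency_matrix(nodes, weighted_edges):
    """Build a dense, symmetric adjacency matrix as nested lists.

    nodes: iterable of node IDs; weighted_edges: iterable of (u, v, w).
    A 0.0 entry means "no edge".
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(index)
    matrix = [[0.0] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        i, j = index[u], index[v]
        matrix[i][j] = w
        matrix[j][i] = w  # undirected graph: mirror the entry
    return matrix, index

# Hypothetical three-node example; the IDs are deliberately not 0..n-1.
matrix, index = to_adjacency_matrix(
    [1609, 1622, 2471],
    [(1609, 1622, 57.4), (1622, 2471, 29.7)],
)
for row in matrix:
    print(row)
```

Using 0.0 for "no edge" is convenient but means a genuine zero-length edge cannot be represented, which is worth keeping in mind for the hand-written Floyd-Warshall later on.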




In [234]:
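import numpy as np
from networkx import to_numpy_array
# Edge lengths are stored under 'w', so the default weight='weight' fills every edge entry with 1.0
adjlist = to_numpy_array(G)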

time: 20.7 ms


In [235]:
print(adjlist.shape)
adjlist.view()


(1695, 1695)


array([[0., 1., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.]])

time: 4.78 ms


In [236]:
from scipy.sparse.csgraph import shortest_path

# For dense input, zero entries are treated as "no edge"
result = shortest_path(adjlist, method='FW')

time: 460 ms


Wow! SciPy can do APSP in around 3.6 seconds (averaged over 3 runs) using Floyd-Warshall on a 2k by 2k graph! This makes NetworkX look really slow. The output is the distance matrix. (For comparison, on a 100-node, 50-edge graph it takes 4.4 ms.)

As another experiment, for a 1700 by 1700 adjacency matrix, SciPy gets it done in 460 ms. 

Looking into how SciPy does it (https://github.com/scipy/scipy/blob/master/scipy/sparse/csgraph/_shortest_path.pyx), the trick is using Cython (which compiles to C) and NumPy. It also takes advantage of the sparsity of the problem. 
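
To see why sparsity matters here: the 1695 by 1695 matrix has about 2.9 million entries, but the graph has only 999 edges. Below is a minimal hand-rolled sketch of compressed sparse row (CSR) storage, the layout behind `scipy.sparse.csr_matrix`. It is an illustration rather than SciPy's code, and the function name `dense_to_csr` is hypothetical.

```python
def dense_to_csr(matrix):
    """Return (data, indices, indptr) for the nonzero entries of a dense matrix."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the stored weight
                indices.append(j)    # its column
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr

dense = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
data, indices, indptr = dense_to_csr(dense)
print(data)     # [1.0, 1.0, 1.0, 1.0]
print(indices)  # [1, 2, 0, 0]
print(indptr)   # [0, 2, 3, 4, 4]

# The neighbors of row 0 are read straight from one slice.
print(indices[indptr[0]:indptr[1]])  # [1, 2]
```

Only the nonzero entries are stored, so an algorithm that walks neighbors never has to scan the zeros.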

In [237]:
result.view()

array([[ 0.,  1.,  1., ..., 73., inf, inf],
       [ 1.,  0.,  2., ..., 72., inf, inf],
       [ 1.,  2.,  0., ..., 74., inf, inf],
       ...,
       [73., 72., 74., ...,  0., inf, inf],
       [inf, inf, inf, ..., inf,  0.,  1.],
       [inf, inf, inf, ..., inf,  1.,  0.]])

time: 4.45 ms


In [238]:
adjlist2 = to_numpy_array(G)

time: 19.3 ms


In [239]:
print(adjlist2)

[[0. 1. 1. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 1. 0.]]
time: 1.41 ms


In [240]:
import numpy as np 

def floyd_warshall_simple(graph,vertex_num):
  # Start with "unreachable" everywhere and zero on the diagonal
  dist = np.full((vertex_num,vertex_num), np.inf)
  for a in range(vertex_num):
    dist[a][a]=0
  # Copy the direct edges (a 0 entry in graph means "no edge")
  for b in range(vertex_num):
    for c in range(vertex_num):
      if graph[b][c] != 0:
        dist[b][c]=graph[b][c]
  # Allow each vertex k in turn as an intermediate stop between i and j
  for k in range(vertex_num):
    for i in range(vertex_num):
      for j in range(vertex_num):
        if dist[i][j] > dist[i][k] + dist[k][j]:
          dist[i][j] = dist[i][k] + dist[k][j]
  return dist

time: 9.43 ms


In [241]:
simplefwanswer=floyd_warshall_simple(adjlist2,118)

time: 1.74 s


In [242]:
print(simplefwanswer)

[[ 0.  1.  1. ... 15. inf inf]
 [ 1.  0.  2. ... 14. inf inf]
 [ 1.  2.  0. ... 16. inf inf]
 ...
 [15. 14. 16. ...  0. inf inf]
 [inf inf inf ... inf  0. inf]
 [inf inf inf ... inf inf  0.]]
time: 2.06 ms


In [243]:
#Correctness checks 
from numpy import array, inf

def test_floyd_warshall_algorithms_on_small_matrix():
    INPUT = array([
        [  0.,  inf,  -2.,  inf],
        [  4.,   0.,   3.,  inf],
        [ inf,  inf,   0.,   2.],
        [ inf,  -1.,  inf,   0.]
    ])

    OUTPUT = array([
        [ 0., -1., -2.,  0.],
        [ 4.,  0.,  2.,  4.],
        [ 5.,  1.,  0.,  2.],
        [ 3., -1.,  1.,  0.]])
    print(OUTPUT)
    print(floyd_warshall_simple(INPUT,4))
    assert (floyd_warshall_simple(INPUT, 4) == OUTPUT).all()
test_floyd_warshall_algorithms_on_small_matrix()

[[ 0. -1. -2.  0.]
 [ 4.  0.  2.  4.]
 [ 5.  1.  0.  2.]
 [ 3. -1.  1.  0.]]
[[ 0. -1. -2.  0.]
 [ 4.  0.  2.  4.]
 [ 5.  1.  0.  2.]
 [ 3. -1.  1.  0.]]
time: 9.49 ms


As we can see, on a small graph this takes 1.73 seconds, where SciPy takes 5 ms! Let's try to get faster using Numba.

Note that the call above passes vertex_num=118, so the 1.74 seconds covers only the first 118 vertices of the 1700x1700 adjacency matrix. That is much better than the NetworkX results, but still slower than SciPy (which took 460 ms on the full matrix). 
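
Floyd-Warshall's three nested loops each run vertex_num times, so the work grows with the cube of vertex_num. Here is a quick back-of-the-envelope calculation for the two sizes that appear in this notebook: 118, the value passed to floyd_warshall_simple above, and 1695, the side of the full adjacency matrix. The helper name is made up for illustration.

```python
def fw_inner_iterations(vertex_num):
    """Number of times the innermost Floyd-Warshall comparison runs."""
    return vertex_num ** 3

small = fw_inner_iterations(118)
full = fw_inner_iterations(1695)
print(small)          # 1643032
print(full)           # 4869777375
print(full // small)  # 2963
```

So the full matrix means roughly 3,000 times more inner-loop work than the 118-vertex call, which is worth remembering when comparing timings across cells.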

In [216]:
!pip install numba

time: 2.56 s


In [244]:
from numba import jit
import numpy as np 

@jit(nopython=True)
def floyd_warshall_simple_nb(graph,vertex_num):
  dist = (np.ones((vertex_num,vertex_num)) * np.inf)
  for a in range(vertex_num):
    dist[a][a]=0
  for b in range(vertex_num):
    for c in range(vertex_num):
      if graph[b][c] != 0:
        dist[b][c]=graph[b][c]
  for k in range(vertex_num):
    for i in range(vertex_num):
      for j in range(vertex_num):
        if dist[i][j] > dist[i][k] + dist[k][j]:
          dist[i][j] = dist[i][k] + dist[k][j]
  return dist

time: 5.96 ms


In [245]:
nbfwanswer=floyd_warshall_simple_nb(adjlist2,118)

time: 404 ms


In [246]:
print(nbfwanswer==simplefwanswer)

[[ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]
 ...
 [ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]
 [ True  True  True ...  True  True  True]]
time: 1.3 ms


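Just by adding jit we took the runtime down drastically, from 1.7 seconds to 404 ms (a figure that includes Numba's one-time compilation on this first call). That is a few milliseconds under the 460 ms SciPy run, which I'd call a successful day :) Keep in mind, though, that this call covers only the first 118 vertices, while SciPy processed the full 1695-node matrix, so the two timings are not a like-for-like comparison.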
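
One caveat when timing jitted functions: Numba compiles on the first call, so a single timed run mixes compilation and execution. Below is a minimal timing helper that warms up first and then reports the best of a few runs. The helper `time_call` and the stand-in workload `busy_sum` are hypothetical names, and the workload is plain Python so the sketch runs without Numba installed.

```python
import time

def time_call(fn, *args, warmup=1, repeats=3):
    """Return the best wall-clock time (seconds) of fn(*args) after warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # for a jitted function, the first call pays for compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; any callable can be timed the same way.
def busy_sum(n):
    return sum(i * i for i in range(n))

elapsed = time_call(busy_sum, 10_000)
print(f"best of 3: {elapsed * 1e3:.3f} ms")
```

In this notebook the same helper could be called as `time_call(floyd_warshall_simple_nb, adjlist2, 118)` to get a steady-state number for the jitted version.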