<img src="https://venturebeat.com/wp-content/uploads/2021/02/katana-graph.png?fit=800%2C356&strip=all" alt="Katana Graph" width="400" height="178">

This notebook is a tutorial for the Katana Python API on shared memory, originally written for [KGC](https://www.knowledgegraph.tech/) 2021.

Trying out Katana Yourself
==========================

For Python users, Katana is easiest to use in Conda, so make sure you have a [Conda installation](https://conda.io/projects/conda/en/latest/user-guide/install/linux.html).
The Katana open-source packages support only Linux x86-64.
You can create a Conda environment with Katana and Jupyter installed with:

```
conda env create -f https://raw.githubusercontent.com/KatanaGraph/katana/master/katana.yml
```

You can download this Jupyter notebook from: https://raw.githubusercontent.com/KatanaGraph/katana/master/python/examples/jupyter/Katana%20Tutorial.ipynb

Finally, you can start Jupyter and open the notebook normally:

```
conda run -n katana jupyter notebook
```
(or `jupyter lab`)


In [1]:
import numpy as np
import timeit

In [2]:
import katana.galois
from katana.datastructures import InsertBag
from katana.loops import do_all, do_all_operator
from katana.property_graph import PropertyGraph

import katana.local
katana.local.initialize()

katana.galois.set_active_threads(8)

8

In [3]:
from katana.example_utils import get_input

# Constants
INFINITY = 1073741823

# Download the input
rmat15_cleaned_symmetric_path = get_input("propertygraphs/rmat15_cleaned_symmetric")

Implementing an Algorithm in Katana Python
==========================================

In [4]:
def bfs(graph: PropertyGraph, source):
    """
    Compute the BFS distance to all nodes from source.

    The algorithm runs in bulk-synchronous fashion, level by level.

    :param graph: The graph to use.
    :param source: The source node for the traversal.
    :return: An array of distances, indexed by node ID.
    """
    next_level_number = 0

    # The work lists for the current and next levels using a Katana concurrent data structure.
    curr_level_worklist = InsertBag[np.uint32]()
    next_level_worklist = InsertBag[np.uint32]()

    # Create and initialize the distance array: source is 0, everywhere else is INFINITY.
    distance = np.empty((len(graph),), dtype=np.uint32)
    distance[:] = INFINITY
    distance[source] = 0

    # Start processing with just the source node.
    next_level_worklist.push(source)
    # Execute until the worklist is empty.
    while not next_level_worklist.empty():
        # Swap the current and next work lists
        curr_level_worklist, next_level_worklist = next_level_worklist, curr_level_worklist

        # Clear the worklist for the next level.
        next_level_worklist.clear()
        next_level_number += 1

        # Process the current worklist in parallel by applying bfs_operator to each
        # element of the worklist.
        do_all(
            curr_level_worklist,
            # The call here binds the initial arguments of bfs_operator.
            bfs_operator(graph, next_level_worklist, next_level_number, distance)
        )

    return distance

In [5]:
# This function is marked as a Katana operator, meaning that it will be compiled to
# native code and prepared for use with Katana do_all.
@do_all_operator()
def bfs_operator(graph: PropertyGraph, next_level_worklist, next_level_number, distance, node_id):
    """
    The operator called for each node in the work list.

    The initial 4 arguments are provided by bfs above. node_id is taken from
    the worklist and passed to this function by do_all.

    :param graph: The graph to traverse.
    :param next_level_worklist: The work list to add next nodes to.
    :param next_level_number: The level to assign to nodes we find.
    :param distance: The distance array to fill with data.
    :param node_id: The node we are processing.
    :return: Nothing; results are written to distance and next_level_worklist.
    """
    # Iterate over the out edges of our node
    for edge_id in graph.edges(node_id):
        # Get the destination of the edge
        dst = graph.get_edge_dest(edge_id)
        # If the destination has not yet been reached, set its level and add it
        # to the work list, so its out edges can be processed in the next level.
        if distance[dst] == INFINITY:
            distance[dst] = next_level_number
            next_level_worklist.push(dst)
        # There is a race here, but it's safe. If multiple calls to the operator add
        # the same destination, they will all set the same level. It will create
        # more work since the node will be processed more than once in the next
        # level, but it avoids atomic operations, so it can still be a win in
        # low-degree graphs.
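
For checking the parallel version on small inputs, a sequential level-by-level BFS can serve as a reference. The sketch below is not part of Katana and does not use its API: it assumes the graph is given as two CSR arrays, and the names `reference_bfs`, `offsets`, and `dests` are hypothetical, chosen only for this illustration.

```python
import numpy as np

INFINITY = 1073741823


def reference_bfs(offsets, dests, source):
    """Sequential level-by-level BFS over a CSR graph (illustrative only).

    offsets[n]..offsets[n + 1] is the range of edge indices leaving node n,
    and dests[e] is the destination node of edge e.
    """
    num_nodes = len(offsets) - 1
    distance = np.full(num_nodes, INFINITY, dtype=np.uint32)
    distance[source] = 0

    next_level = [source]
    level = 0
    while next_level:
        # Swap: the nodes found in the last round become the current level.
        curr_level, next_level = next_level, []
        level += 1
        for node in curr_level:
            for edge in range(offsets[node], offsets[node + 1]):
                dst = dests[edge]
                # Unreached destinations get the current level number.
                if distance[dst] == INFINITY:
                    distance[dst] = level
                    next_level.append(dst)
    return distance


# Tiny symmetric example: a path 0 - 1 - 2 plus an isolated node 3.
offsets = np.array([0, 1, 3, 4, 4])
dests = np.array([1, 0, 2, 1])
print(reference_bfs(offsets, dests, 0))
```

In this example node 3 is never reached, so its entry stays at `INFINITY`, mirroring how unreached nodes are represented in the Katana version above.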

In [6]:
# Load our graph
graph = PropertyGraph(rmat15_cleaned_symmetric_path)

print(f"#Nodes: {len(graph)}, #Edges: {graph.num_edges()}")

# Run our algorithm
distances = bfs(graph, 0)

#Nodes: 32768, #Edges: 363194


Rmat10 Visualized
-----------------


<img src="rmat10.jpg" alt="A graph with a large number of low-degree nodes and a few very high-degree hub nodes." />

The algorithm above is run on rmat15, which is 32 times larger than rmat10.
However, the graph structure is similar: low diameter, with a small number of hub nodes.
Node 0 is the "largest" hub node in these graphs.

In [7]:
# Look at some arbitrary results
distances[:20], distances[490:510]

(array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       dtype=uint32),
 array([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1],
       dtype=uint32))

In [8]:
# Maximum distance to a reached node (i.e., nodes that do not have infinite distance)
np.max(distances[distances < INFINITY])

2

In [9]:
# Number of reached nodes
np.count_nonzero(distances != INFINITY)

29352

In [10]:
print("Average algorithm runtime (100 runs):")
print(timeit.timeit(lambda: bfs(graph, 0), number=100) / 100 * 1000, "ms")

Average algorithm runtime (100 runs):
1.813097900012508 ms
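
A single mean over 100 runs can be skewed by one-off costs such as warm-up. As a rough sketch, `timeit.repeat` can report the minimum and median over several batches instead. The `workload` function here is a hypothetical stand-in so the example runs without Katana; in the notebook it would be replaced by the call being measured.

```python
import statistics
import timeit


def workload():
    # Hypothetical stand-in for the call being measured, e.g. bfs(graph, 0).
    return sum(range(10_000))


CALLS_PER_BATCH = 20

# Five batches; each entry is the total time in seconds for one batch.
batch_times = timeit.repeat(workload, number=CALLS_PER_BATCH, repeat=5)
per_call_ms = [t / CALLS_PER_BATCH * 1000 for t in batch_times]

print(f"min: {min(per_call_ms):.3f} ms, median: {statistics.median(per_call_ms):.3f} ms")
```

The minimum is often the most stable figure for comparing two implementations, since noise from the rest of the system can only make a batch slower.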


Calling an Existing Katana Algorithm from Python
================================================

In [11]:
from katana import analytics

# Clean the graph to allow rerunning the cell
try:
    graph.remove_node_property("distance")
except KeyError:
    pass

analytics.bfs(graph, 0, "distance")

distances = graph.get_node_property("distance").to_numpy()

In [12]:
# Look at some arbitrary results
distances[:20], distances[490:510]

(array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       dtype=uint32),
 array([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1],
       dtype=uint32))

In [13]:
# Maximum distance to a reached node (i.e., nodes that do not have infinite distance)
np.max(distances[distances < INFINITY])

2

In [14]:
# Number of reached nodes
np.count_nonzero(distances != INFINITY)

29352

In [15]:
def run_canned_bfs():
    # Clean the graph to allow rerunning the cell
    try:
        graph.remove_node_property("distance")
    except KeyError:
        pass
    
    analytics.bfs(graph, 0, "distance")

print("Average algorithm runtime (100 runs):")
print(timeit.timeit(run_canned_bfs, number=100) / 100 * 1000, "ms")

Average algorithm runtime (100 runs):
1.3316418800968677 ms


In [16]:
print(analytics.BfsStatistics(graph, "distance"))

Number of reached nodes = 29352
Maximum distance = 2
Average distance = 1.29242
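
The same three figures can be computed from a distance array with plain NumPy, which is a handy cross-check. The sketch below uses a small hand-made array rather than the real results, and `bfs_summary` is a hypothetical helper, not a Katana function. It averages over all reached nodes including the source; whether `BfsStatistics` does the same is an assumption.

```python
import numpy as np

INFINITY = 1073741823


def bfs_summary(distances):
    """Return (number reached, maximum distance, average distance)."""
    # Keep only nodes that were reached, i.e. whose distance is not INFINITY.
    reached = distances[distances < INFINITY]
    return len(reached), int(reached.max()), float(reached.mean())


# Hand-made example: source, two nodes at distance 1, one at 2, one unreached.
sample = np.array([0, 1, 1, 2, INFINITY], dtype=np.uint32)
num_reached, max_distance, avg_distance = bfs_summary(sample)

print(f"Number of reached nodes = {num_reached}")
print(f"Maximum distance = {max_distance}")
print(f"Average distance = {avg_distance}")
```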



Scaling Out
===========

The open-source Katana supports graph algorithms on a single machine.
This is useful and powerful, but limits graphs to those that can fit in memory on a single machine.
Our enterprise offering supports graph algorithms on _distributed_ graphs.
This allows much larger graphs and much more computing power.
Katana Enterprise will provide an interface similar to the one shown here for distributed graphs, including custom algorithms written in Python.

(Python support for distributed graphs will be available in enterprise Katana by the end of Q2 2021.)

Contact
=======

**https://katanagraph.com/**

**Arthur Peters <amp@katanagraph.com>**

In [17]:
# TODO(amp): Add link to Katana Python documentation when available.
