# Lord Of The Rings PyRaphtory Example Notebook 🧝🏻‍♀️🧙🏻‍♂️💍

## Setup environment and download data 💾

Import the dependencies needed to build a graph from your data in PyRaphtory, then download the CSV data from GitHub into your tmp folder (file path: /tmp/lotr.csv).

In [1]:
pip install pyvis

Note: you may need to restart the kernel to use updated packages.


In [2]:
from pathlib import Path
from pyraphtory.context import PyRaphtory
from pyraphtory.vertex import Vertex
from pyraphtory.spouts import FileSpout
from pyraphtory.builder import *
from pyvis.network import Network
import csv
import pandas as pd
import numpy as np

!curl -o /tmp/lotr.csv https://raw.githubusercontent.com/Raphtory/Data/main/lotr.csv

18:02:03.919 [io-compute-blocker-4] INFO  com.raphtory.internals.management.Py4JServer - Starting PythonGatewayServer...




  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52206  100 52206    0     0   198k      0 --:--:-- --:--:-- --:--:--  203k


## Preview data  👀

Preview the head of the dataset.

In [6]:
!head /tmp/lotr.csv

Gandalf,Elrond,33
Frodo,Bilbo,114
Blanco,Marcho,146
Frodo,Bilbo,205
Thorin,Gandalf,270
Thorin,Bilbo,270
Gandalf,Bilbo,270
Gollum,Bilbo,286
Gollum,Bilbo,306
Gollum,Bilbo,308
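
Each row holds three comma-separated fields. Judging from the ingestion code later in this notebook, they are a source character, a target character and an integer timestamp. As a minimal sketch, rows like these can be parsed with the standard `csv` module; the `SAMPLE` string and the `parse_rows` helper are illustrative names and not part of PyRaphtory.

```python
import csv
import io

# A few rows copied from the preview of /tmp/lotr.csv.
SAMPLE = """Gandalf,Elrond,33
Frodo,Bilbo,114
Blanco,Marcho,146
Frodo,Bilbo,205
Thorin,Gandalf,270
"""

def parse_rows(text):
    """Yield (source, target, timestamp) tuples from CSV text (hypothetical helper)."""
    for source, target, timestamp in csv.reader(io.StringIO(text)):
        yield source, target, int(timestamp)

rows = list(parse_rows(SAMPLE))
characters = {name for source, target, _ in rows for name in (source, target)}
print(len(rows), "rows,", len(characters), "distinct characters")
print("time range:", rows[0][2], "to", rows[-1][2])
```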


In [7]:
filename = '/tmp/lotr.csv'

## Create a new Raphtory graph 📊

Initialise Raphtory by creating a local PyRaphtory context, then create a new graph from it. The log output shows what PyRaphtory is doing in the background.

In [8]:
ctx = PyRaphtory.local()
graph = ctx.new_graph()

18:03:16.208 [io-compute-blocker-4] ERROR com.raphtory.internals.management.Prometheus$ - Failed to start Prometheus on port 9999
18:03:16.209 [io-compute-blocker-4] INFO  com.raphtory.internals.management.Prometheus$ - Prometheus started on port /0:0:0:0:0:0:0:0:59260
18:03:16.215 [spawner-akka.actor.default-dispatcher-3] INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
18:03:16.220 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.internals.components.partition.PartitionOrchestrator - Deploying new Partition Manager for graph 'entitled_erin_bee' - request by 'organisational_periwinkle_barnacle' 
18:03:16.220 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.internals.components.partition.PartitionOrchestrator$ - Creating '1' Partition Managers for 'entitled_erin_bee'.
18:03:16.220 [spawner-akka.actor.default-dispatcher-5] INFO  com.raphtory.internals.components.querymanager.QueryOrchestrator - Deploying new Query Manager for graph 'entitled_erin_bee' - r

## Ingest the data into a graph 😋

Parse the CSV file row by row, adding a vertex for each character and an edge for each co-occurrence, to build the graph.

In [10]:
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # Each row is: source character, target character, timestamp
        source_node = row[0]
        src_id = graph.assign_id(source_node)
        target_node = row[1]
        tar_id = graph.assign_id(target_node)
        time_stamp = int(row[2])
        # Add both characters as vertices, then the edge between them
        graph.add_vertex(time_stamp, src_id, Properties(ImmutableProperty("name", source_node)), Type("Character"))
        graph.add_vertex(time_stamp, tar_id, Properties(ImmutableProperty("name", target_node)), Type("Character"))
        graph.add_edge(time_stamp, src_id, tar_id, Type("Character_Co-occurence"))
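
The loop above sends two vertex updates and one edge update per row, so a pair that appears in several rows (for example `Frodo,Bilbo` at 114 and 205) is updated more than once. A toy way to picture the result is a map from each directed pair to the times it was seen. This is a plain-Python sketch and makes no claim about how Raphtory stores the graph; `build_temporal_edges` and `edges_up_to` are hypothetical helpers.

```python
from collections import defaultdict

# (source, target, timestamp) rows, taken from the data preview.
ROWS = [
    ("Gandalf", "Elrond", 33),
    ("Frodo", "Bilbo", 114),
    ("Frodo", "Bilbo", 205),
    ("Gollum", "Bilbo", 286),
    ("Gollum", "Bilbo", 306),
]

def build_temporal_edges(rows):
    """Map each directed (source, target) pair to the sorted times it was seen."""
    history = defaultdict(list)
    for source, target, timestamp in rows:
        history[(source, target)].append(timestamp)
    return {pair: sorted(times) for pair, times in history.items()}

def edges_up_to(history, time):
    """Return the pairs that have at least one update at or before `time`."""
    return {pair for pair, times in history.items() if times[0] <= time}

history = build_temporal_edges(ROWS)
print(history[("Frodo", "Bilbo")])
print(sorted(edges_up_to(history, 200)))
```

Keeping the timestamps per pair is what makes it possible to ask for the graph as it looked at a given time, which is the role `graph.at(...)` plays in the queries later in this notebook.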

## Collect simple metrics 📈

Select the metrics to show in your output dataframe. Here we select vertex name, degree, out-degree and in-degree.

In [11]:
from pyraphtory.graph import Row
df = graph \
      .select(lambda vertex: Row(vertex.name(), vertex.degree(), vertex.out_degree(), vertex.in_degree())) \
      .to_df(["name", "degree", "out_degree", "in_degree"])

18:04:07.188 [spawner-akka.actor.default-dispatcher-9] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source '0' is unblocking analysis for Graph 'entitled_erin_bee' with 15894 messages sent. Latest update time was 32674
18:04:07.230 [io-compute-blocker-2] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job 801431452_37506057014803071: Starting query progress tracker.
18:04:07.230 [io-compute-blocker-2] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job 801431452_37506057014803071: Starting output collector.
18:04:07.258 [spawner-akka.actor.default-dispatcher-9] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source 0 has completed ingesting and will now unblock
18:04:07.258 [spawner-akka.actor.default-dispatcher-9] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query '801431452_37506057014803071' received, your job ID is '801431452_37506057014803071'.
18:04:07.266 [spawner-akka.actor.default-dispatcher-7]

In [12]:
df

Unnamed: 0,timestamp,window,name,degree,out_degree,in_degree
0,32674,,Hirgon,2,2,0
1,32674,,Hador,3,1,2
2,32674,,Horn,4,1,3
3,32674,,Galadriel,19,6,16
4,32674,,Isildur,18,18,0
...,...,...,...,...,...,...
134,32674,,Faramir,29,3,29
135,32674,,Bain,2,1,1
136,32674,,Walda,13,3,10
137,32674,,Thranduil,2,0,2
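
Note that `degree` is not always `out_degree + in_degree`: Galadriel has an out-degree of 6 and an in-degree of 16 but a degree of 19. That is consistent with degree counting distinct neighbours, so a character linked in both directions is counted once. The sketch below reproduces that counting on a made-up edge list; `degree_table` is a hypothetical helper, not a PyRaphtory function.

```python
from collections import defaultdict

# Toy directed edges; A and B are linked in both directions.
EDGES = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")]

def degree_table(edges):
    """Return {vertex: (degree, out_degree, in_degree)} using distinct neighbours."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for source, target in edges:
        out_nbrs[source].add(target)
        in_nbrs[target].add(source)
    vertices = set(out_nbrs) | set(in_nbrs)
    return {
        v: (len(out_nbrs[v] | in_nbrs[v]), len(out_nbrs[v]), len(in_nbrs[v]))
        for v in vertices
    }

table = degree_table(EDGES)
print(table["A"])
```

Here `A` has two out-neighbours and two in-neighbours but only three distinct neighbours, because `B` appears on both sides.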


**Clean the dataframe by dropping the unused window column.** 🧹

In [13]:
## clean
df.drop(columns=['window'], inplace=True)

### Preview the dataframe  👀

In [14]:
df

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
0,32674,Hirgon,2,2,0
1,32674,Hador,3,1,2
2,32674,Horn,4,1,3
3,32674,Galadriel,19,6,16
4,32674,Isildur,18,18,0
...,...,...,...,...,...
134,32674,Faramir,29,3,29
135,32674,Bain,2,1,1
136,32674,Walda,13,3,10
137,32674,Thranduil,2,0,2


**Sort by highest degree, top 10**

In [15]:
df.sort_values(['degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
55,32674,Frodo,51,37,22
54,32674,Gandalf,49,35,24
97,32674,Aragorn,45,5,45
63,32674,Merry,34,23,18
32,32674,Pippin,34,30,10
56,32674,Elrond,32,18,24
52,32674,Théoden,30,22,9
134,32674,Faramir,29,3,29
118,32674,Sam,28,20,17
129,32674,Gimli,25,22,11
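
Sorting the whole dataframe and slicing works well at this size. As a sketch of the same top-k selection in plain Python, `heapq.nlargest` picks the largest rows without a full sort. The rows are the five shown in the dataframe preview earlier; `top_by` and `COLUMNS` are illustrative names.

```python
import heapq

# (name, degree, out_degree, in_degree) rows from the dataframe preview.
ROWS = [
    ("Hirgon", 2, 2, 0),
    ("Hador", 3, 1, 2),
    ("Horn", 4, 1, 3),
    ("Galadriel", 19, 6, 16),
    ("Isildur", 18, 18, 0),
]
COLUMNS = {"degree": 1, "out_degree": 2, "in_degree": 3}

def top_by(rows, column, k):
    """Return the k rows with the largest value in `column`."""
    index = COLUMNS[column]
    return heapq.nlargest(k, rows, key=lambda row: row[index])

print([row[0] for row in top_by(ROWS, "degree", 3)])
print([row[0] for row in top_by(ROWS, "out_degree", 2)])
```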


**Sort by highest in-degree, top 10**

In [16]:
df.sort_values(['in_degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
97,32674,Aragorn,45,5,45
134,32674,Faramir,29,3,29
54,32674,Gandalf,49,35,24
56,32674,Elrond,32,18,24
55,32674,Frodo,51,37,22
63,32674,Merry,34,23,18
138,32674,Boromir,18,6,17
118,32674,Sam,28,20,17
3,32674,Galadriel,19,6,16
132,32674,Legolas,25,18,16


**Sort by highest out-degree, top 10**

In [17]:
df.sort_values(['out_degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
55,32674,Frodo,51,37,22
54,32674,Gandalf,49,35,24
32,32674,Pippin,34,30,10
63,32674,Merry,34,23,18
52,32674,Théoden,30,22,9
129,32674,Gimli,25,22,11
118,32674,Sam,28,20,17
56,32674,Elrond,32,18,24
4,32674,Isildur,18,18,0
132,32674,Legolas,25,18,16


## Run a PageRank algorithm 📑

Run an algorithm on your graph; here we run PageRank. Algorithms are accessed through `PyRaphtory.algorithms`, as in the code below. The column names passed to `NodeList` select which algorithm results appear as extra columns in your dataframe.

In [18]:
cols = ["prlabel"]

df_pagerank = graph.at(32674) \
                .past() \
                .transform(PyRaphtory.algorithms.generic.centrality.PageRank())\
                .execute(PyRaphtory.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

18:04:35.561 [io-compute-blocker-8] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job PageRank:NodeList_1339944722215541969: Starting query progress tracker.
18:04:35.562 [io-compute-blocker-8] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job PageRank:NodeList_1339944722215541969: Starting output collector.
18:04:35.570 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'PageRank:NodeList_1339944722215541969' received, your job ID is 'PageRank:NodeList_1339944722215541969'.
18:04:35.571 [spawner-akka.actor.default-dispatcher-5] INFO  com.raphtory.internals.components.partition.QueryExecutor - PageRank:NodeList_1339944722215541969_0: Starting QueryExecutor.
18:04:35.644 [spawner-akka.actor.default-dispatcher-11] INFO  com.raphtory.internals.components.querymanager.QueryHandler - Job 'PageRank:NodeList_1339944722215541969': Perspective at Time '32674' took 72 ms to run. 
18:04:35.670 [spawner-akka.act

**Clean your dataframe** 🧹

In [19]:
## clean
df_pagerank.drop(columns=['window'], inplace=True)

In [20]:
df_pagerank

Unnamed: 0,timestamp,name,prlabel
0,32674,Hirgon,0.277968
1,32674,Hador,0.459710
2,32674,Horn,0.522389
3,32674,Galadriel,2.228852
4,32674,Isildur,0.277968
...,...,...,...
134,32674,Faramir,8.551166
135,32674,Bain,0.396105
136,32674,Walda,0.817198
137,32674,Thranduil,0.761719


**Top ten characters by PageRank**

In [21]:
df_pagerank.sort_values(['prlabel'], ascending=False)[:10]

Unnamed: 0,timestamp,name,prlabel
97,32674,Aragorn,13.246457
134,32674,Faramir,8.551166
56,32674,Elrond,5.621548
138,32674,Boromir,4.824014
132,32674,Legolas,4.62259
110,32674,Imrahil,4.0956
65,32674,Éomer,3.473897
42,32674,Samwise,3.292762
118,32674,Sam,2.82614
55,32674,Frodo,2.806475
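
For intuition about what these scores measure, here is a sketch of the textbook PageRank power iteration in plain Python. It is not Raphtory's implementation: the scores above exceed 1, so Raphtory evidently scales them differently from this version, in which scores sum to 1. The damping factor of 0.85, the iteration count and the even redistribution of rank from vertices without out-links are all assumptions of this sketch.

```python
# Directed edges built from the first rows of the data preview (duplicates removed).
EDGES = [
    ("Gandalf", "Elrond"), ("Frodo", "Bilbo"), ("Blanco", "Marcho"),
    ("Thorin", "Gandalf"), ("Thorin", "Bilbo"), ("Gandalf", "Bilbo"),
    ("Gollum", "Bilbo"),
]

def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration; returned scores sum to 1."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: set() for v in nodes}
    for source, target in edges:
        out_links[source].add(target)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        rank = new_rank
    return rank

scores = pagerank(EDGES)
best = max(scores, key=scores.get)
print("highest-ranked vertex:", best)
```

In this small sample Bilbo comes out on top because four of the seven edges point at him, which mirrors how characters with a high in-degree lead the ranking above.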


## Run a connected components algorithm 

Run the connected components algorithm on the graph. Each vertex is labelled with the ID of the component it belongs to.

In [22]:
cols = ["cclabel"]
df_cc = graph.at(32674) \
                .past() \
                .transform(PyRaphtory.algorithms.generic.ConnectedComponents)\
                .execute(PyRaphtory.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

18:04:44.798 [io-compute-blocker-12] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job ConnectedComponents:NodeList_6529836868478431610: Starting query progress tracker.
18:04:44.798 [io-compute-blocker-12] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job ConnectedComponents:NodeList_6529836868478431610: Starting output collector.
18:04:44.806 [spawner-akka.actor.default-dispatcher-10] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'ConnectedComponents:NodeList_6529836868478431610' received, your job ID is 'ConnectedComponents:NodeList_6529836868478431610'.
18:04:44.809 [spawner-akka.actor.default-dispatcher-7] INFO  com.raphtory.internals.components.partition.QueryExecutor - ConnectedComponents:NodeList_6529836868478431610_0: Starting QueryExecutor.
18:04:44.830 [spawner-akka.actor.default-dispatcher-5] INFO  com.raphtory.internals.components.querymanager.QueryHandler - Job 'ConnectedComponents:NodeList_6529836868478431610': Perspectiv

**Clean dataframe.**

In [23]:
## clean
df_cc.drop(columns=['window'], inplace=True)

**Preview dataframe.**

In [24]:
df_cc

Unnamed: 0,timestamp,name,cclabel
0,32674,Hirgon,-8637342647242242534
1,32674,Hador,-8637342647242242534
2,32674,Horn,-8637342647242242534
3,32674,Galadriel,-8637342647242242534
4,32674,Isildur,-8637342647242242534
...,...,...,...
134,32674,Faramir,-8637342647242242534
135,32674,Bain,-6628080393138316116
136,32674,Walda,-8637342647242242534
137,32674,Thranduil,-8637342647242242534


### Number of distinct components 

Count the distinct component labels; there are 3 in this dataframe.

In [25]:
len(set(df_cc['cclabel']))

3

### Size of components 

Calculate the size of the 3 connected components.

In [26]:
df_cc.groupby(['cclabel']).count().reset_index().drop(columns=['timestamp'])

Unnamed: 0,cclabel,name
0,-8637342647242242534,134
1,-6628080393138316116,3
2,-5499479516525190226,2
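
To see where labels and sizes like these come from, the sketch below finds components on a small edge list with a union-find structure, then counts members per label. It treats edges as undirected, which is an assumption, and it is not the algorithm Raphtory runs; `connected_components` is a hypothetical helper and the labels here are character names rather than vertex IDs.

```python
from collections import Counter

# Edges built from the first rows of the data preview (duplicates removed).
EDGES = [
    ("Gandalf", "Elrond"), ("Frodo", "Bilbo"), ("Blanco", "Marcho"),
    ("Thorin", "Gandalf"), ("Thorin", "Bilbo"), ("Gandalf", "Bilbo"),
    ("Gollum", "Bilbo"),
]

def connected_components(edges):
    """Label every vertex with a representative of its component (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller name as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {v: find(v) for v in parent}

labels = connected_components(EDGES)
sizes = Counter(labels.values())
print(len(sizes), "components")
print(sizes.most_common())
```

Counting the distinct labels gives the number of components, and counting vertices per label gives the component sizes, which is what the two dataframe cells above do.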


## Run chained algorithms at once

In this example, we chain PageRank, Connected Components and Degree algorithms, running them one after another on the graph. Specify all the columns in the output dataframe, including an output column for each algorithm in the chain.

In [27]:
cols = ["inDegree", "outDegree", "degree","prlabel","cclabel"]

df_chained = graph.at(32674) \
                .past() \
                .transform(PyRaphtory.algorithms.generic.centrality.PageRank())\
                .transform(PyRaphtory.algorithms.generic.ConnectedComponents)\
                .transform(PyRaphtory.algorithms.generic.centrality.Degree())\
                .execute(PyRaphtory.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

18:04:55.119 [io-compute-blocker-3] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job PageRank:ConnectedComponents:Degree:NodeList_8208734138950122414: Starting query progress tracker.
18:04:55.119 [io-compute-blocker-3] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job PageRank:ConnectedComponents:Degree:NodeList_8208734138950122414: Starting output collector.
18:04:55.125 [spawner-akka.actor.default-dispatcher-9] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'PageRank:ConnectedComponents:Degree:NodeList_8208734138950122414' received, your job ID is 'PageRank:ConnectedComponents:Degree:NodeList_8208734138950122414'.
18:04:55.125 [spawner-akka.actor.default-dispatcher-11] INFO  com.raphtory.internals.components.partition.QueryExecutor - PageRank:ConnectedComponents:Degree:NodeList_8208734138950122414_0: Starting QueryExecutor.
18:04:55.167 [spawner-akka.actor.default-dispatcher-11] INFO  com.raphtory.internals.components.querymanager.Qu

In [28]:
df_chained.drop(columns=['window'], inplace=True)

In [29]:
df_chained['degree_numeric'] = df_chained['degree'].astype(float)

In [30]:
df_chained

Unnamed: 0,timestamp,name,inDegree,outDegree,degree,prlabel,cclabel,degree_numeric
0,32674,Hirgon,0,2,2,0.277968,-8637342647242242534,2.0
1,32674,Hador,2,1,3,0.459710,-8637342647242242534,3.0
2,32674,Horn,3,1,4,0.522389,-8637342647242242534,4.0
3,32674,Galadriel,16,6,19,2.228852,-8637342647242242534,19.0
4,32674,Isildur,0,18,18,0.277968,-8637342647242242534,18.0
...,...,...,...,...,...,...,...,...
134,32674,Faramir,29,3,29,8.551166,-8637342647242242534,29.0
135,32674,Bain,1,1,2,0.396105,-6628080393138316116,2.0
136,32674,Walda,10,3,13,0.817198,-8637342647242242534,13.0
137,32674,Thranduil,2,0,2,0.761719,-8637342647242242534,2.0
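
One way to picture the chain: each algorithm writes its result into per-vertex state under a fixed key, and the final `NodeList` step reads the requested keys back out as columns. The toy pipeline below mimics that with plain functions and dictionaries. It is a mental model only; every function name is invented for illustration and none of it is PyRaphtory API.

```python
# Toy directed edges.
EDGES = [("Frodo", "Bilbo"), ("Gandalf", "Bilbo"), ("Gandalf", "Elrond")]

def init_state(edges):
    """Create an empty state dictionary for every vertex."""
    return {v: {} for edge in edges for v in edge}

def add_out_degree(edges, state):
    """Step 1: store the number of distinct out-neighbours as 'outDegree'."""
    for v in state:
        state[v]["outDegree"] = len({t for s, t in edges if s == v})
    return state

def add_in_degree(edges, state):
    """Step 2: store the number of distinct in-neighbours as 'inDegree'."""
    for v in state:
        state[v]["inDegree"] = len({s for s, t in edges if t == v})
    return state

def node_list(state, cols):
    """Final step: emit one row per vertex with the requested state keys."""
    return [[v] + [state[v][c] for c in cols] for v in sorted(state)]

state = init_state(EDGES)
for step in (add_out_degree, add_in_degree):
    state = step(EDGES, state)

table = node_list(state, ["inDegree", "outDegree"])
for row in table:
    print(row)
```

In this model a column requested in the final step must match a key that an earlier step wrote, which is why `cols` in the cell above lists one name per algorithm output.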


## Create visualisation by adding nodes 🔎

In [31]:
def visualise(graph, df_chained):
    # Create network object
    net = Network(notebook=True, height='750px', width='100%', bgcolor='#222222', font_color='white')
    # Set visualisation tool
    net.force_atlas_2based()
    # Get the node list 
    df_node_list = graph.at(32674) \
                .past() \
                .execute(PyRaphtory.algorithms.generic.NodeList()) \
                .to_df(['name'])
    
    nodes = df_node_list['name'].tolist()
    
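    node_data = []
    ignore_items = ['timestamp', 'name', 'window']
    # Build one hover tooltip per node from its row in df_chained
    for node_name in nodes:
        for i, row in df_chained.iterrows():
            if row['name']==node_name:
                data = ''
                # Series.items() replaces the deprecated Series.iteritems()
                for k,v in row.items():
                    if k not in ignore_items:
                        data = data+str(k)+': '+str(v)+'\n'
                node_data.append(data)
                # Stop scanning once the matching row has been found
                break
    # Add the nodes, sized by PageRank score
    net.add_nodes(nodes, title=node_data, value = df_chained.prlabel)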
    # Get the edge list
    df_edge_list = graph.at(32674) \
            .past() \
            .execute(PyRaphtory.algorithms.generic.EdgeList()) \
            .to_df(['from', 'to'])
    edges = []
    for i, row in df_edge_list[['from', 'to']].iterrows():
        edges.append([row['from'], row['to']])
    # Add the edges
    net.add_edges(edges)
    # Toggle physics
    net.toggle_physics(True)
    return net
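
The nested loop in `visualise` scans `df_chained` once for every node. On larger graphs, a dictionary keyed by name avoids the repeated scan. The sketch below shows the idea with plain dictionaries standing in for dataframe rows; `build_tooltips` and the sample records are illustrative and not part of the notebook's code.

```python
IGNORE = {"timestamp", "name", "window"}

# Two records with values copied from the df_chained preview.
RECORDS = [
    {"timestamp": 32674, "name": "Hirgon", "inDegree": 0, "outDegree": 2, "degree": 2},
    {"timestamp": 32674, "name": "Hador", "inDegree": 2, "outDegree": 1, "degree": 3},
]

def build_tooltips(nodes, records):
    """Return one 'key: value' tooltip per node, in the order of `nodes`."""
    by_name = {record["name"]: record for record in records}
    tooltips = []
    for name in nodes:
        record = by_name[name]
        lines = [f"{key}: {value}" for key, value in record.items() if key not in IGNORE]
        tooltips.append("\n".join(lines) + "\n")
    return tooltips

tooltips = build_tooltips(["Hador", "Hirgon"], RECORDS)
print(tooltips[0])
```

Because the tooltips are built in the order of `nodes`, they stay aligned with the node list even when the dataframe rows are in a different order.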

In [32]:
net = visualise(graph, df_chained)

Local cdn resources have problems on chrome/safari when used in jupyter-notebook. 
18:05:06.369 [io-compute-blocker-5] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job NodeList_2344593067198146744: Starting query progress tracker.
18:05:06.370 [io-compute-blocker-5] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job NodeList_2344593067198146744: Starting output collector.
18:05:06.371 [spawner-akka.actor.default-dispatcher-5] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'NodeList_2344593067198146744' received, your job ID is 'NodeList_2344593067198146744'.
18:05:06.372 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.partition.QueryExecutor - NodeList_2344593067198146744_0: Starting QueryExecutor.
18:05:06.387 [spawner-akka.actor.default-dispatcher-11] INFO  com.raphtory.internals.components.querymanager.QueryHandler - Job 'NodeList_2344593067198146744': Perspective at Time '32674' took 14 ms to run. 
18:05:07.213 [io-compute-blocker-4] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job EdgeList_180052264941691119: Starting query progress tracker.
18:05:07.214 [io-compute-blocker-4] INFO  com.raphtory.api.querytracker.TableOutputTracker - Job EdgeList_180052264941691119: Starting output collector.
18:05:07.217 [spawner-akka.actor.default-dispatcher-7] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'EdgeList_180052264941691119' received, your job ID is 'EdgeList_180052264941691119'.
18:05:07.217 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.partition.QueryExecutor - EdgeList_180052264941691119_0: Starting QueryExecutor.
18:05:07.230 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.querymanager.QueryHandler - Job 'EdgeList_180052264941691119': Perspective at Time '32674' took 12 ms to run. 
18:05:07.270 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.api.querytracker.

## Show the HTML file of the visualisation

In [33]:
# Write the visualisation to preview.html and display it in the notebook
net.show('preview.html')

## Shut down PyRaphtory  🛑

In [35]:
ctx.close()

18:05:31.068 [spawner-akka.actor.default-dispatcher-6] INFO  akka.actor.CoordinatedShutdown - Running CoordinatedShutdown with reason [ActorSystemTerminateReason]
