# Lord Of The Rings PyRaphtory Example Notebook 🧝🏻‍♀️🧙🏻‍♂️💍

## Setup environment and download data 💾

Import the dependencies needed to build a graph from your data in PyRaphtory, then download the CSV data from GitHub into your tmp folder (file path: /tmp/lotr.csv).

In [1]:
from pathlib import Path
from pyraphtory.context import PyRaphtory
from pyraphtory.vertex import Vertex
from pyraphtory.spouts import FileSpout
from pyraphtory.builder import *
from pyvis.network import Network
import csv
import pandas as pd
import numpy as np

!curl -o /tmp/lotr.csv https://raw.githubusercontent.com/Raphtory/Data/main/lotr.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52206  100 52206    0     0   179k      0 --:--:-- --:--:-- --:--:--  182k


## Preview data  👀

Preview the first ten rows of the dataset. Each row holds a source character, a target character and a timestamp.

In [2]:
!head /tmp/lotr.csv

Gandalf,Elrond,33
Frodo,Bilbo,114
Blanco,Marcho,146
Frodo,Bilbo,205
Thorin,Gandalf,270
Thorin,Bilbo,270
Gandalf,Bilbo,270
Gollum,Bilbo,286
Gollum,Bilbo,306
Gollum,Bilbo,308


In [3]:
filename = '/tmp/lotr.csv'

## Create a new Raphtory graph 📊

Turn on logging to see what PyRaphtory is doing, initialise Raphtory by creating and opening a PyRaphtory object, then create a new graph.

In [4]:
pr = PyRaphtory(logging=True).open()
rg = pr.new_graph()

20:37:31.796 [io-compute-blocker-4] INFO  com.raphtory.internals.management.Py4JServer - Starting PythonGatewayServer...
Port: 61827
Secret: cfa0f59ded76b25a2c1bbabc145dd768865bf3f0b9682377eb85fa2700104068




20:37:32.514 [Thread-12] INFO  com.raphtory.internals.context.LocalContext$ - Creating Service for 'random_bisque_koi'
20:37:32.532 [io-compute-blocker-1] INFO  com.raphtory.internals.management.Prometheus$ - Prometheus started on port /0:0:0:0:0:0:0:0:9999
20:37:33.243 [io-compute-blocker-1] INFO  com.raphtory.internals.components.partition.PartitionOrchestrator$ - Creating '1' Partition Managers for 'random_bisque_koi'.
20:37:34.260 [io-compute-blocker-4] INFO  com.raphtory.internals.components.partition.PartitionManager - Partition 0: Starting partition manager for 'random_bisque_koi'.


## Ingest the data into a graph 😋

Parse the CSV file row by row, adding a vertex for each character and an edge for each co-occurrence to build the graph.

In [5]:
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        source_node = row[0]
        src_id = rg.assign_id(source_node)
        target_node = row[1]
        tar_id = rg.assign_id(target_node)
        time_stamp = int(row[2])
        rg.add_vertex(time_stamp, src_id, Properties(ImmutableProperty("name", source_node)), Type("Character"))
        rg.add_vertex(time_stamp, tar_id, Properties(ImmutableProperty("name", target_node)), Type("Character"))
        rg.add_edge(time_stamp, src_id, tar_id, Type("Character_Co-occurence"))
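
The loop above reads each row as source, target, timestamp. As a minimal, library-free sketch of that parsing step, the rows can be validated and turned into (time, source, target) tuples before they are handed to add_vertex and add_edge. The `parse_rows` helper and the inline `SAMPLE` text are illustrative assumptions, not part of PyRaphtory.

```python
import csv
import io

# Three rows copied from the head of /tmp/lotr.csv
SAMPLE = """Gandalf,Elrond,33
Frodo,Bilbo,114
Blanco,Marcho,146
"""

def parse_rows(lines):
    """Yield (time, source, target) tuples from CSV lines (hypothetical helper)."""
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_number}: expected 3 fields, got {len(row)}")
        source, target, time = (field.strip() for field in row)
        yield int(time), source, target

updates = list(parse_rows(io.StringIO(SAMPLE)))
for time, source, target in updates:
    print(time, source, target)
```

Failing early on a malformed row keeps a bad line from silently producing a vertex with the wrong name or timestamp.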

## Collect simple metrics 📈

Select the metrics to show in your output dataframe. Here we select each vertex's name, degree, out-degree and in-degree.

In [6]:
from pyraphtory.graph import Row
df = rg \
      .select(lambda vertex: Row(vertex.name(), vertex.degree(), vertex.out_degree(), vertex.in_degree())) \
      .to_df(["name", "degree", "out_degree", "in_degree"])

20:37:53.760 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source '0' is unblocking analysis for Graph 'random_bisque_koi' with 7947 messages sent. Latest update time was 32674
20:37:53.763 [io-compute-blocker-1] INFO  com.raphtory.api.querytracker.QueryProgressTracker - Job 708796223_4046664532663839721: Starting query progress tracker.
20:37:53.766 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source 0 has completed ingesting and will now unblock
20:37:53.766 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query '708796223_4046664532663839721' received, your job ID is '708796223_4046664532663839721'.
20:37:53.774 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.internals.components.partition.QueryExecutor - 708796223_4046664532663839721_0: Starting QueryExecutor.
20:37:53.832 [io-compu

**Clean the dataframe by dropping the unused window column.** 🧹

In [7]:
## clean
df.drop(columns=['window'], inplace=True)

### Preview the dataframe  👀

In [8]:
df

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
0,32674,Hirgon,2,2,0
1,32674,Hador,3,1,2
2,32674,Horn,4,1,3
3,32674,Galadriel,19,6,16
4,32674,Isildur,18,18,0
...,...,...,...,...,...
134,32674,Faramir,29,3,29
135,32674,Bain,2,1,1
136,32674,Walda,13,3,10
137,32674,Thranduil,2,0,2
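
In the table above, degree can be smaller than out-degree plus in-degree (Galadriel: 19 versus 6 + 16). That is consistent with degree counting each distinct neighbour once, although this is an inference from the output rather than a statement of Raphtory's definition. The sketch below reproduces that behaviour in plain Python; `degree_table` is a hypothetical helper and one reverse edge is added purely for illustration.

```python
from collections import defaultdict

# (source, target) pairs from the first rows of /tmp/lotr.csv
EDGES = [
    ("Gandalf", "Elrond"), ("Frodo", "Bilbo"), ("Blanco", "Marcho"),
    ("Thorin", "Gandalf"), ("Thorin", "Bilbo"), ("Gandalf", "Bilbo"),
    ("Gollum", "Bilbo"),
    ("Bilbo", "Frodo"),  # hypothetical reverse edge, not in the dataset preview
]

def degree_table(edges):
    """Return degree, out-degree and in-degree per node, counting distinct neighbours."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    table = {}
    for name in set(out_nbrs) | set(in_nbrs):
        table[name] = {
            # a neighbour reached in both directions is counted once
            "degree": len(out_nbrs[name] | in_nbrs[name]),
            "out_degree": len(out_nbrs[name]),
            "in_degree": len(in_nbrs[name]),
        }
    return table

table = degree_table(EDGES)
print(table["Frodo"])
```

Frodo and Bilbo point at each other here, so Frodo has out-degree 1 and in-degree 1 but only one distinct neighbour.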


**Sort by highest degree, top 10**

In [9]:
df.sort_values(['degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
55,32674,Frodo,51,37,22
54,32674,Gandalf,49,35,24
97,32674,Aragorn,45,5,45
63,32674,Merry,34,23,18
32,32674,Pippin,34,30,10
56,32674,Elrond,32,18,24
52,32674,Théoden,30,22,9
134,32674,Faramir,29,3,29
118,32674,Sam,28,20,17
129,32674,Gimli,25,22,11


**Sort by highest in-degree, top 10**

In [10]:
df.sort_values(['in_degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
97,32674,Aragorn,45,5,45
134,32674,Faramir,29,3,29
54,32674,Gandalf,49,35,24
56,32674,Elrond,32,18,24
55,32674,Frodo,51,37,22
63,32674,Merry,34,23,18
138,32674,Boromir,18,6,17
118,32674,Sam,28,20,17
3,32674,Galadriel,19,6,16
132,32674,Legolas,25,18,16


**Sort by highest out-degree, top 10**

In [11]:
df.sort_values(['out_degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
55,32674,Frodo,51,37,22
54,32674,Gandalf,49,35,24
32,32674,Pippin,34,30,10
63,32674,Merry,34,23,18
52,32674,Théoden,30,22,9
129,32674,Gimli,25,22,11
118,32674,Sam,28,20,17
56,32674,Elrond,32,18,24
4,32674,Isildur,18,18,0
132,32674,Legolas,25,18,16


## Run a PageRank algorithm 📑

Run your selected algorithm on the graph; here we run PageRank. Algorithms are available from the PyRaphtory object you created at the start. Name the column that holds the algorithm's result (here prlabel) so that it appears in your output dataframe.

In [12]:
cols = ["prlabel"]

df_pagerank = rg.at(32674) \
                .past() \
                .transform(pr.algorithms.generic.centrality.PageRank())\
                .execute(pr.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

20:38:12.126 [io-compute-blocker-8] INFO  com.raphtory.api.querytracker.QueryProgressTracker - Job PageRank:NodeList_4685909644778744380: Starting query progress tracker.
20:38:12.134 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'PageRank:NodeList_4685909644778744380' received, your job ID is 'PageRank:NodeList_4685909644778744380'.
20:38:12.135 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.partition.QueryExecutor - PageRank:NodeList_4685909644778744380_0: Starting QueryExecutor.
20:38:12.175 [io-compute-blocker-6] INFO  com.raphtory.api.analysis.table.TableOutputTracker - Job PageRank:NodeList_4685909644778744380: Starting output collector.
20:38:12.577 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.querymanager.QueryHandler - Job 'PageRank:NodeList_4685909644778744380': Perspective at Time '32674' took 277 ms to run. 
20:38:12.577 [spawner-akka

**Clean your dataframe** 🧹

In [13]:
## clean
df_pagerank.drop(columns=['window'], inplace=True)

In [14]:
df_pagerank

Unnamed: 0,timestamp,name,prlabel
0,32674,Hirgon,0.277968
1,32674,Hador,0.459710
2,32674,Horn,0.522389
3,32674,Galadriel,2.228852
4,32674,Isildur,0.277968
...,...,...,...
134,32674,Faramir,8.551166
135,32674,Bain,0.396105
136,32674,Walda,0.817198
137,32674,Thranduil,0.761719


**The ten highest-ranked characters**

In [15]:
df_pagerank.sort_values(['prlabel'], ascending=False)[:10]

Unnamed: 0,timestamp,name,prlabel
97,32674,Aragorn,13.246457
134,32674,Faramir,8.551166
56,32674,Elrond,5.621548
138,32674,Boromir,4.824014
132,32674,Legolas,4.62259
110,32674,Imrahil,4.0956
65,32674,Éomer,3.473897
42,32674,Samwise,3.292762
118,32674,Sam,2.82614
55,32674,Frodo,2.806475
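
To give a feel for what the ranking measures, here is a minimal power-iteration sketch of PageRank in plain Python. It is not Raphtory's implementation: the `pagerank` function, the damping factor of 0.85 and the iteration count are assumptions, and the scores here sum to 1, whereas the prlabel values above are clearly on a different scale.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: set() for node in nodes}
    for src, dst in edges:
        out_links[src].add(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is shared evenly among all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# (source, target) pairs from the first rows of /tmp/lotr.csv
EDGES = [
    ("Gandalf", "Elrond"), ("Frodo", "Bilbo"), ("Thorin", "Gandalf"),
    ("Thorin", "Bilbo"), ("Gandalf", "Bilbo"), ("Gollum", "Bilbo"),
]
ranks = pagerank(EDGES)
top = max(ranks, key=ranks.get)
print(top, round(ranks[top], 3))
```

A node ranks highly when many nodes point to it, which matches the full results above, where Aragorn and Faramir lead both the in-degree and the PageRank tables.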


## Run a connected components algorithm 

Run the connected components algorithm on the graph.

In [16]:
cols = ["cclabel"]
df_cc = rg.at(32674) \
                .past() \
                .transform(pr.algorithms.generic.ConnectedComponents)\
                .execute(pr.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

20:38:26.650 [io-compute-blocker-2] INFO  com.raphtory.api.querytracker.QueryProgressTracker - Job ConnectedComponents:NodeList_9197127537465185439: Starting query progress tracker.
20:38:26.654 [spawner-akka.actor.default-dispatcher-5] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'ConnectedComponents:NodeList_9197127537465185439' received, your job ID is 'ConnectedComponents:NodeList_9197127537465185439'.
20:38:26.656 [spawner-akka.actor.default-dispatcher-11] INFO  com.raphtory.internals.components.partition.QueryExecutor - ConnectedComponents:NodeList_9197127537465185439_0: Starting QueryExecutor.
20:38:26.719 [io-compute-blocker-5] INFO  com.raphtory.api.analysis.table.TableOutputTracker - Job ConnectedComponents:NodeList_9197127537465185439: Starting output collector.
20:38:26.912 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.querymanager.QueryHandler - Job 'ConnectedComponents:NodeList_9197127537465185439': Perspect

**Clean dataframe.**

In [17]:
## clean
df_cc.drop(columns=['window'], inplace=True)

**Preview dataframe.**

In [18]:
df_cc

Unnamed: 0,timestamp,name,cclabel
0,32674,Hirgon,-8637342647242242534
1,32674,Hador,-8637342647242242534
2,32674,Horn,-8637342647242242534
3,32674,Galadriel,-8637342647242242534
4,32674,Isildur,-8637342647242242534
...,...,...,...
134,32674,Faramir,-8637342647242242534
135,32674,Bain,-6628080393138316116
136,32674,Walda,-8637342647242242534
137,32674,Thranduil,-8637342647242242534


### Number of distinct components 

Count the distinct component labels; this graph has 3 connected components.

In [19]:
len(set(df_cc['cclabel']))

3

### Size of components 

Calculate the size of each of the 3 connected components.

In [20]:
df_cc.groupby(['cclabel']).count().reset_index().drop(columns=['timestamp'])

Unnamed: 0,cclabel,name
0,-8637342647242242534,134
1,-6628080393138316116,3
2,-5499479516525190226,2
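
The labelling and counting steps above can be sketched without Raphtory using a union-find structure. This is an illustrative alternative technique, not the algorithm Raphtory runs; `connected_components` is a hypothetical helper, and edges are treated as undirected.

```python
from collections import Counter

def connected_components(edges):
    """Map each node to a representative of its component (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {node: find(node) for node in list(parent)}

# (source, target) pairs from the first rows of /tmp/lotr.csv
EDGES = [
    ("Gandalf", "Elrond"), ("Frodo", "Bilbo"), ("Blanco", "Marcho"),
    ("Thorin", "Gandalf"), ("Thorin", "Bilbo"), ("Gandalf", "Bilbo"),
    ("Gollum", "Bilbo"),
]
labels = connected_components(EDGES)
sizes = Counter(labels.values())
print(len(sizes), sorted(sizes.values(), reverse=True))
```

On these sample rows, Blanco and Marcho form their own small component, much like the two small components in the full result.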


## Run chained algorithms at once

In this example, we chain the PageRank, Connected Components and Degree algorithms, running them one after another on the graph. List every column you want in the output dataframe, including the output columns of each algorithm in the chain.

In [21]:
cols = ["inDegree", "outDegree", "degree","prlabel","cclabel"]

df_chained = rg.at(32674) \
                .past() \
                .transform(pr.algorithms.generic.centrality.PageRank())\
                .transform(pr.algorithms.generic.ConnectedComponents)\
                .transform(pr.algorithms.generic.centrality.Degree())\
                .execute(pr.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

20:38:40.904 [io-compute-blocker-3] INFO  com.raphtory.api.querytracker.QueryProgressTracker - Job PageRank:ConnectedComponents:Degree:NodeList_7032040291154029239: Starting query progress tracker.
20:38:40.907 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'PageRank:ConnectedComponents:Degree:NodeList_7032040291154029239' received, your job ID is 'PageRank:ConnectedComponents:Degree:NodeList_7032040291154029239'.
20:38:40.907 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.partition.QueryExecutor - PageRank:ConnectedComponents:Degree:NodeList_7032040291154029239_0: Starting QueryExecutor.
20:38:40.965 [io-compute-blocker-5] INFO  com.raphtory.api.analysis.table.TableOutputTracker - Job PageRank:ConnectedComponents:Degree:NodeList_7032040291154029239: Starting output collector.
20:38:41.193 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.api.querytracker.QueryProgressTrac

In [22]:
df_chained.drop(columns=['window'], inplace=True)

In [25]:
df_chained['degree_numeric'] = df_chained['degree'].astype(float)

In [26]:
df_chained

Unnamed: 0,timestamp,name,inDegree,outDegree,degree,prlabel,cclabel,degree_numeric
0,32674,Hirgon,0,2,2,0.277968,-8637342647242242534,2.0
1,32674,Hador,2,1,3,0.459710,-8637342647242242534,3.0
2,32674,Horn,3,1,4,0.522389,-8637342647242242534,4.0
3,32674,Galadriel,16,6,19,2.228852,-8637342647242242534,19.0
4,32674,Isildur,0,18,18,0.277968,-8637342647242242534,18.0
...,...,...,...,...,...,...,...,...
134,32674,Faramir,29,3,29,8.551166,-8637342647242242534,29.0
135,32674,Bain,1,1,2,0.396105,-6628080393138316116,2.0
136,32674,Walda,10,3,13,0.817198,-8637342647242242534,13.0
137,32674,Thranduil,2,0,2,0.761719,-8637342647242242534,2.0
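
One way to picture the chain is that each step writes named values onto the vertices, and the final node list picks out the requested columns. The toy model below illustrates only that idea, under the assumption that this is roughly how the columns relate to the steps; `run_chain`, `node_list` and the two step functions are hypothetical and do not reflect Raphtory's execution model.

```python
# (source, target) pairs from the first rows of /tmp/lotr.csv
EDGES = [("Gandalf", "Elrond"), ("Frodo", "Bilbo"), ("Thorin", "Gandalf")]

def out_degree_step(state, edges):
    """Write an 'outDegree' value onto every vertex."""
    for vertex in state:
        state[vertex]["outDegree"] = sum(1 for src, _ in edges if src == vertex)

def in_degree_step(state, edges):
    """Write an 'inDegree' value onto every vertex."""
    for vertex in state:
        state[vertex]["inDegree"] = sum(1 for _, dst in edges if dst == vertex)

def run_chain(edges, steps):
    """Apply each step in order to a shared per-vertex state."""
    state = {vertex: {} for edge in edges for vertex in edge}
    for step in steps:
        step(state, edges)
    return state

def node_list(state, cols):
    """Return one row per vertex: its name followed by the requested columns."""
    return [[name] + [values[col] for col in cols] for name, values in sorted(state.items())]

cols = ["inDegree", "outDegree"]
rows = node_list(run_chain(EDGES, [out_degree_step, in_degree_step]), cols)
for row in rows:
    print(row)
```

Requesting a column that no step has written raises a KeyError in this sketch, which is why every name in cols needs a matching step in the chain.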


## Create visualisation by adding nodes 🔎

In [31]:
def visualise(rg, df_chained):
    # Create network object
    net = Network(notebook=True, height='750px', width='100%', bgcolor='#222222', font_color='white')
    # Use the ForceAtlas2-based physics layout
    net.force_atlas_2based()
    # Get the node list 
    df_node_list = rg.at(32674) \
                .past() \
                .execute(pr.algorithms.generic.NodeList()) \
                .to_df(['name'])
    
    nodes = df_node_list['name'].tolist()
    
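    # Index the metrics by name so each node is matched to its own row
    metrics = df_chained.set_index('name')
    ignore_items = ['timestamp', 'name', 'window']
    node_data = []
    node_values = []
    for node_name in nodes:
        row = metrics.loc[node_name]
        # Build the hover text from every metric column
        data = ''
        for k, v in row.items():  # Series.iteritems() was removed in pandas 2.0
            if k not in ignore_items:
                data = data + str(k) + ': ' + str(v) + '\n'
        node_data.append(data)
        node_values.append(float(row['prlabel']))
    # Add the nodes, sized by PageRank score
    net.add_nodes(nodes, title=node_data, value=node_values)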
    # Get the edge list
    df_edge_list = rg.at(32674) \
            .past() \
            .execute(pr.algorithms.generic.EdgeList()) \
            .to_df(['from', 'to'])
    edges = []
    for i, row in df_edge_list[['from', 'to']].iterrows():
        edges.append([row['from'], row['to']])
    # Add the edges
    net.add_edges(edges)
    # Toggle physics
    net.toggle_physics(True)
    return net
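
The hover text attached to each node is a plain string of "column: value" lines. As a small standalone sketch of that formatting step, the hypothetical `format_tooltip` helper below works on a dictionary instead of a dataframe row; the sample values are taken from the Hirgon row of the chained dataframe.

```python
def format_tooltip(row, ignore=("timestamp", "name", "window")):
    """Join every non-ignored column of a row into 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in row.items() if key not in ignore)

row = {"timestamp": 32674, "name": "Hirgon", "inDegree": 0, "outDegree": 2, "degree": 2}
tooltip = format_tooltip(row)
print(tooltip)
```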

In [32]:
net = visualise(rg, df_chained)

20:45:36.358 [io-compute-blocker-7] INFO  com.raphtory.api.querytracker.QueryProgressTracker - Job NodeList_7770831613554784095: Starting query progress tracker.
20:45:36.361 [spawner-akka.actor.default-dispatcher-7] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query 'NodeList_7770831613554784095' received, your job ID is 'NodeList_7770831613554784095'.
20:45:36.362 [spawner-akka.actor.default-dispatcher-11] INFO  com.raphtory.internals.components.partition.QueryExecutor - NodeList_7770831613554784095_0: Starting QueryExecutor.
20:45:36.432 [io-compute-blocker-8] INFO  com.raphtory.api.analysis.table.TableOutputTracker - Job NodeList_7770831613554784095: Starting output collector.
20:45:36.653 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.api.querytracker.QueryProgressTracker - Job 'NodeList_7770831613554784095': Perspective '32674' finished in 295 ms.
20:45:36.653 [spawner-akka.actor.default-dispatcher-6] INFO  com.raphtory.internals.components.qu

## Show the HTML file of the visualisation

In [35]:
# Write the visualisation to preview.html and display it in the notebook
net.show('preview.html')

## Shut down PyRaphtory  🛑

In [27]:
pr.shutdown()

11:15:33.517 [Thread-12] INFO  com.raphtory.internals.management.PythonInterop$ - Shutting down pyraphtory
