# Lord Of The Rings PyRaphtory Example Notebook 🧝🏻‍♀️🧙🏻‍♂️💍

## Setup environment and download data 💾

Import the dependencies needed to build a graph from your data in PyRaphtory, then download the CSV data from GitHub into your tmp folder (file path: /tmp/lotr.csv).

In [None]:
pip install pyvis

In [None]:
from pathlib import Path
from pyraphtory.context import PyRaphtory
from pyraphtory.vertex import Vertex
from pyraphtory.spouts import FileSpout
from pyraphtory.builder import *
from pyvis.network import Network
import matplotlib
import csv
import pandas as pd
import numpy as np

!curl -o /tmp/lotr.csv https://raw.githubusercontent.com/Raphtory/Data/main/lotr.csv

## Preview data  👀

In [None]:
!head /tmp/lotr.csv

In [None]:
filename = '/tmp/lotr.csv'

## Create a new Raphtory graph 📊

We initialise Raphtory by creating a PyRaphtory object and use it to create a new graph. Logs can be turned on to see what is going on in PyRaphtory.

In [None]:
ctx = PyRaphtory.local()
graph = ctx.new_graph()

## Ingest the data into a graph 😋

We parse the CSV file row by row and add the vertices and edges to the PyRaphtory graph.

In [None]:
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # Each row holds: source character, target character, timestamp
        source_node = row[0]
        src_id = graph.assign_id(source_node)
        target_node = row[1]
        tar_id = graph.assign_id(target_node)
        time_stamp = int(row[2])
        # Add both characters as vertices, then the edge between them, at this timestamp
        graph.add_vertex(time_stamp, src_id, Properties(ImmutableProperty("name", source_node)), Type("Character"))
        graph.add_vertex(time_stamp, tar_id, Properties(ImmutableProperty("name", target_node)), Type("Character"))
        graph.add_edge(time_stamp, src_id, tar_id, Type("Character_Co-occurence"))
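The parsing step above assumes every row has three columns and an integer timestamp. As a minimal sketch using only the standard library, the rows could be validated before they reach the graph. The helper name `parse_edges` and the sample rows are hypothetical and only mirror the three-column layout used above.

```python
import csv
import io

def parse_edges(lines):
    """Yield (source, target, timestamp) tuples from CSV rows, skipping malformed ones."""
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # not enough columns
        source, target, raw_time = (field.strip() for field in row[:3])
        try:
            timestamp = int(raw_time)
        except ValueError:
            continue  # e.g. a header line or a non-numeric timestamp
        yield source, target, timestamp

# Made-up sample rows in the same source,target,time layout
sample = io.StringIO("Gandalf,Elrond,33\nFrodo,Bilbo,114\nsource,target,time\nbroken row\n")
edges = list(parse_edges(sample))
print(edges)
```

Each yielded tuple could then be passed to `graph.add_vertex` and `graph.add_edge` exactly as in the ingestion cell.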

## Collect simple metrics 📈

Our data has been ingested by the graph object. We can now select certain attributes and metrics to display, for instance vertex name, degree, out degree and in degree. \
We can compute these metrics temporally. To do so,
1. The first step is to create temporal slices of the graph, also called **graph perspectives**. For instance, `.range(start, end, step)` filters the data from `start` to `end` and splits it into slices of duration `step`. More functions, such as `from`, `until`, `at` and `depart`, are presented in the [documentation of the DeployedTemporalGraph object](https://docs.raphtory.com/en/development/_static/com/raphtory/api/analysis/graphview/DeployedTemporalGraph.html).
2. The second step is to specify which direction we are looking in at each time point (or **snapshot**) `t`. For instance, `.past()` aggregates all older data up to `t`, and `.future()` aggregates all data starting from `t`. We can also use the function `.window()`, which is presented in the [documentation of the DottedGraph object](https://docs.raphtory.com/en/development/_static/com/raphtory/api/analysis/graphview/DottedGraph.html).
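To make the two steps concrete, here is a plain-Python sketch of what a `range(...)` followed by `past()` computes conceptually. It is not Raphtory's implementation. The function `past_degrees` is hypothetical, and it assumes that the end point is inclusive and that degree counts distinct neighbours.

```python
from collections import defaultdict

def past_degrees(edges, start, end, step):
    """For each time t in start, start+step, ... up to end,
    compute degrees from all edges with timestamp <= t."""
    results = {}
    for t in range(start, end + 1, step):
        out_nbrs = defaultdict(set)
        in_nbrs = defaultdict(set)
        for src, dst, time in edges:
            if time <= t:  # "past": keep everything up to and including t
                out_nbrs[src].add(dst)
                in_nbrs[dst].add(src)
        names = set(out_nbrs) | set(in_nbrs)
        results[t] = {
            name: {
                "in_degree": len(in_nbrs[name]),
                "out_degree": len(out_nbrs[name]),
                "degree": len(in_nbrs[name] | out_nbrs[name]),
            }
            for name in names
        }
    return results

# Made-up timestamped edges: (source, target, time)
toy_edges = [("Frodo", "Sam", 1), ("Sam", "Frodo", 2), ("Frodo", "Gandalf", 3)]
perspectives = past_degrees(toy_edges, start=1, end=3, step=1)
for t, degrees in perspectives.items():
    print(t, degrees.get("Frodo"))
```

A character only appears in a perspective once it has at least one edge at or before `t`, which is why the number of rows per timestamp grows over time.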

We can then specify which metrics we are interested in, and choose the names of the output columns.

In [None]:
from pyraphtory.graph import Row
df = graph \
      .range(1, 32674, 1000) \
      .past()  \
      .select(lambda vertex: Row(vertex.name(), vertex.degree(), vertex.out_degree(), vertex.in_degree())) \
      .to_df(["name", "degree", "out_degree", "in_degree"])

In [None]:
df.head(5)

In [None]:
# We can delete the unused 'window' column
df.drop(columns=['window'], inplace=True)

**Get simple insights from the data 👀**

In [None]:
print('# timestamps:', len(df.timestamp.unique()))
print('# characters:', len(df.name.unique()))

In [None]:
df \
.groupby('timestamp') \
.agg({'name': 'count'}) \
.reset_index() \
.rename(columns={'name': 'nb_characters'}) \
.plot(x='timestamp', y='nb_characters', kind='scatter', title='# characters co-mentioned at least once until time t')

**Suppose we are interested in the top 5 highest degree characters.** Because the dataframe contains the degree of each character up to each time point, we can simply look at the max degree of each character across all times.

In [None]:
df \
.groupby('name') \
.agg({'degree': 'max'}) \
.reset_index() \
.sort_values(['degree'], ascending=False)[:5]

**Alternatively**, we can aggregate all temporal interactions from the start, and compute the degrees on the resulting graph.

In [None]:
df = graph \
    .at(32674) \
    .past() \
    .select(lambda vertex: Row(vertex.name(), vertex.degree(), vertex.out_degree(), vertex.in_degree())) \
    .to_df(["name", "degree", "out_degree", "in_degree"])

Note that **the resulting dataframe only has one timestamp**: the final time 32674.

In [None]:
df.head(5)

In [None]:
df.drop(columns=['window'], inplace=True)
df.sort_values(['degree'], ascending=False)[:10]

## Run a PageRank algorithm 📑

We can run any selected algorithm on our graph data. Here we run PageRank.
- Take the graph.
- The `transform` function applies a given algorithm to the graph, and returns the graph updated with the states that the algorithm generated. Here we apply the generic algorithm `ctx.algorithms.generic.centrality.PageRank()`, so `transform` returns the graph where each node now has an extra attribute: its PageRank score.
- The `execute` function applies an algorithm (same as `transform`) AND returns the results in a tabular format (graph --> table). Here we apply the `NodeList` algorithm: it outputs one row per vertex in the graph, containing the vertex name followed by the properties named in `cols`.
- We ask Raphtory to output the results as a dataframe with `to_df`.

Both functions `transform` and `execute` are explained further in the [documentation](https://docs.raphtory.com/en/development/_static/com/raphtory/api/analysis/graphview/MultilayerGraphView.html).
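For intuition about the score stored in `prlabel`, here is a minimal power-iteration sketch of PageRank in plain Python. It is a textbook version under assumed defaults (damping 0.85, 50 iterations), not Raphtory's implementation, and the function name `pagerank` and the toy edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is shared equally among all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Made-up directed edges
toy_edges = [("Frodo", "Gandalf"), ("Sam", "Gandalf"), ("Gandalf", "Frodo")]
scores = pagerank(toy_edges)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

In this toy graph the character with the most incoming links ends up with the highest score, which is the same reading we apply to the `prlabel` column.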

In [None]:
cols = ["prlabel"]

df_pagerank = graph.at(32674) \
                .past() \
                .transform(ctx.algorithms.generic.centrality.PageRank())\
                .execute(ctx.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

**Note: Here we run PageRank on the aggregated graph but we could, of course, have run it similarly on temporal slices as explained above.**

In [None]:
df_pagerank.drop(columns=['window'], inplace=True)
df_pagerank.head(5)

**The ten highest-ranked characters**

In [None]:
df_pagerank.sort_values(['prlabel'], ascending=False)[:10]

## Run a connected components algorithm 

Similarly we can look for the connected components of the graph.

In [None]:
cols = ["cclabel"]
df_cc = graph.at(32674) \
                .past() \
                .transform(ctx.algorithms.generic.ConnectedComponents)\
                .execute(ctx.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)
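As a rough illustration of what a component label means, the sketch below groups nodes with a union-find structure in plain Python, ignoring edge direction. This is an assumption-laden stand-in rather than Raphtory's algorithm: the function `connected_components` is hypothetical, and the label it assigns is simply one representative node per component.

```python
def connected_components(edges):
    """Map each node to a representative of its (weakly) connected component."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two components
    return {node: find(node) for node in list(parent)}

# Made-up edges forming two separate groups
toy_edges = [("Frodo", "Sam"), ("Sam", "Gandalf"), ("Merry", "Pippin")]
labels = connected_components(toy_edges)
print(labels)
print('# distinct components:', len(set(labels.values())))
```

Nodes that share a label are reachable from one another, so counting distinct labels gives the number of components.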

**Preview dataframe.**

In [None]:
df_cc.drop(columns=['window'], inplace=True)
df_cc.head(5)

**Number of distinct components**

In [None]:
print('# distinct components:', len(set(df_cc['cclabel'])))

**Calculate the size of the 3 connected components.**

In [None]:
df_cc \
.groupby(['cclabel']) \
.count() \
.reset_index() \
.rename(columns={'name': 'size'}) \
.drop(columns=['timestamp'])

## Run chained algorithms at once 

We can also chain PageRank, Connected Components and Degree algorithms, running them one after another on the graph. Specify all the columns in the output dataframe, including an output column for each algorithm in the chain. `cols` dictates the particular order in which we want to output the result columns.

In [None]:
cols = ["inDegree", "outDegree", "degree", "prlabel", "cclabel"]

df_chained = graph.at(32674) \
                .past() \
                .transform(ctx.algorithms.generic.centrality.PageRank())\
                .transform(ctx.algorithms.generic.ConnectedComponents)\
                .transform(ctx.algorithms.generic.centrality.Degree())\
                .execute(ctx.algorithms.generic.NodeList(*cols)) \
                .to_df(["name"] + cols)

In [None]:
df_chained.drop(columns=['window'], inplace=True)
df_chained['degree_numeric'] = df_chained['degree'].astype(float)
df_chained.head(10)

## Create visualisation by adding nodes 🔎

In [None]:
def visualise(graph, df_chained):
    # Create network object
    net = Network(notebook=True, height='750px', width='100%', bgcolor='#222222', font_color='white')
    # Use the ForceAtlas2-based physics layout
    net.force_atlas_2based()
    # Get the node list
    df_node_list = graph.at(32674) \
                .past() \
                .execute(ctx.algorithms.generic.NodeList()) \
                .to_df(['name'])

    nodes = df_node_list['name'].tolist()

    # Build one tooltip and one size value per node, in the same order as `nodes`
    ignore_items = ['timestamp', 'name', 'window']
    rows_by_name = {row['name']: row for _, row in df_chained.iterrows()}
    node_data = []
    node_values = []
    for node_name in nodes:
        row = rows_by_name[node_name]
        data = ''
        for k, v in row.items():  # Series.iteritems() was removed in pandas 2.0
            if k not in ignore_items:
                data = data + str(k) + ': ' + str(v) + '\n'
        node_data.append(data)
        node_values.append(float(row['prlabel']))
    # Add the nodes, sized by PageRank score
    net.add_nodes(nodes, title=node_data, value=node_values)
    # Get the edge list
    df_edge_list = graph.at(32674) \
                .past() \
                .execute(ctx.algorithms.generic.EdgeList()) \
                .to_df(['from', 'to'])
    # Convert the two columns into a list of [from, to] pairs
    edges = df_edge_list[['from', 'to']].values.tolist()
    # Add the edges
    net.add_edges(edges)
    # Toggle physics
    net.toggle_physics(True)
    return net

In [None]:
net = visualise(graph, df_chained)

**Show the HTML file of the visualisation**

In [None]:
net.show('preview.html')

## Shut down PyRaphtory  🛑

In [None]:
ctx.close()