# Higgs Boson Tweets PyRaphtory Example Notebook 💥

## Setup environment and download data 💾

Install and import the dependencies needed to build a graph from your data in PyRaphtory.

In [None]:
pip install pyvis

To use the full dataset, uncomment the curl command in the cell below and the head command in the preview data cell.

In [1]:
from pathlib import Path
from pyraphtory.context import PyRaphtory
from pyraphtory.vertex import Vertex
from pyraphtory.spouts import FileSpout
from pyraphtory.builder import *
from pyvis.network import Network
import csv

# !curl -o /tmp/twitter.csv https://raw.githubusercontent.com/Raphtory/Data/main/higgs-retweet-activity.csv

15:55:11.067 [io-compute-blocker-1] INFO  com.raphtory.internals.management.Py4JServer - Starting PythonGatewayServer...




## Preview data 👀

Preview the Twitter retweet data. Each line contains the source user A (the retweeter), the destination user B (the user being retweeted), and the time at which the retweet occurred.

In [None]:
# !head /tmp/twitter.csv
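
Where the `head` command is not available (for example on Windows), the same preview can be done in plain Python. The following is a minimal sketch; `preview` is a hypothetical helper, not part of PyRaphtory.

```python
from itertools import islice
from pathlib import Path


def preview(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with Path(path).open() as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example (assumes the full dataset was downloaded to /tmp/twitter.csv):
# for line in preview("/tmp/twitter.csv"):
#     print(line)
```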

## Create a new Raphtory graph 📊

Initialise Raphtory by creating a new graph from the PyRaphtory object. The log output below the cell shows what PyRaphtory is doing while it starts up.

In [2]:
graph = PyRaphtory.new_graph()

15:55:20.917 [Thread-12] INFO  com.raphtory.internals.context.LocalContext$ - Creating Service for 'interesting_olive_trout'
15:55:20.935 [io-compute-blocker-7] INFO  com.raphtory.internals.management.Prometheus$ - Prometheus started on port /0:0:0:0:0:0:0:0:9999
2022-10-21 15:55:20,947 io-compute-blocker-7 INFO No Watcher plugin is available for protocol 'jar'
15:55:20.958 [io-compute-blocker-7] INFO  org.apache.arrow.memory.BaseAllocator - Debug mode disabled.
15:55:20.958 [io-compute-blocker-7] INFO  org.apache.arrow.memory.DefaultAllocationManagerOption - allocation manager type not specified, using netty as the default type
15:55:20.959 [io-compute-blocker-7] INFO  org.apache.arrow.memory.CheckAllocator - Using DefaultAllocationManager at memory/DefaultAllocationManagerFactory.class
15:55:21.074 [io-compute-blocker-7] INFO  com.raphtory.arrowmessaging.ArrowFlightServer - ArrowFlightServer(192.168.2.243,64184) is online
15:55:21.235 [spawner-akka.actor.default-dispatcher-3] INFO  a

## Ingest the data into a graph 😋

Write a parsing function that turns each line of your CSV file into graph updates (vertices and edges), then load the file through it to build the graph.

To use the full dataset, point twitter_spout at /tmp/twitter.csv; otherwise keep it as higgstestdata.csv for testing.
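
The choice between the two files can also be made automatically. This is a minimal sketch, assuming the two paths mentioned above; `pick_dataset` is a hypothetical helper, not part of PyRaphtory.

```python
from pathlib import Path


def pick_dataset(full="/tmp/twitter.csv", sample="higgstestdata.csv"):
    """Prefer the full download when it exists, otherwise use the sample file."""
    return full if Path(full).is_file() else sample


# The resulting path can be passed to FileSpout(...) when creating the spout.
dataset_path = pick_dataset()
```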

In [3]:
def parse(graph, line: str):
    # Each line has the form "source,target,timestamp"
    parts = [v.strip() for v in line.split(",")]
    source_node = parts[0]
    src_id = graph.assign_id(source_node)
    target_node = parts[1]
    tar_id = graph.assign_id(target_node)
    time_stamp = int(parts[2])

    # Add both users as vertices, then the retweet as an edge between them
    graph.add_vertex(time_stamp, src_id, Properties(ImmutableProperty("name", source_node)), Type("User"))
    graph.add_vertex(time_stamp, tar_id, Properties(ImmutableProperty("name", target_node)), Type("User"))
    graph.add_edge(time_stamp, src_id, tar_id, Type("Tweet"))

twitter_builder = GraphBuilder(parse)
# twitter_spout = FileSpout("/tmp/twitter.csv")
twitter_spout = FileSpout("higgstestdata.csv")
graph.load(Source(twitter_spout, twitter_builder))

15:55:25.067 [spawner-akka.actor.default-dispatcher-10] INFO  com.raphtory.internals.components.ingestion.IngestionManager - Ingestion Manager for 'interesting_olive_trout' establishing new data source


com.raphtory.api.analysis.graphview.DeployedTemporalGraph@7370347e

15:55:25.073 [io-compute-2] INFO  com.raphtory.spouts.FileSpoutInstance - Spout: Processing file 'higgstestdata.csv' ...
15:55:25.079 [spawner-akka.actor.default-dispatcher-10] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source '0' is blocking analysis for Graph 'interesting_olive_trout'
15:55:26.889 [spawner-akka.actor.default-dispatcher-8] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source '0' is unblocking analysis for Graph 'interesting_olive_trout' with 15 messages sent. Latest update time was 1341101732
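
The parse function above assumes every line has exactly three comma-separated fields with an integer timestamp. If your file may contain malformed rows, the splitting step can be checked separately before any graph updates are made. The sketch below is a hypothetical helper (`split_retweet_line` is not part of PyRaphtory), and the example row uses made-up user IDs.

```python
def split_retweet_line(line):
    """Split 'source,target,timestamp' into (source, target, int timestamp).

    Raises ValueError for rows that do not have exactly three fields or
    whose timestamp is not an integer.
    """
    parts = [field.strip() for field in line.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    source, target, raw_time = parts
    return source, target, int(raw_time)


# Illustrative row with made-up user IDs (not taken from the dataset)
print(split_retweet_line("1, 2, 1341101732"))  # ('1', '2', 1341101732)
```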


## Collect simple metrics 📈

Select the metrics to include in your output dataframe. Here we select each vertex's name, degree, out-degree and in-degree. **Time to finish: ~2 to 3 minutes**

In [4]:
from pyraphtory.graph import Row
df = graph \
      .select(lambda vertex: Row(vertex.name(), vertex.degree(), vertex.out_degree(), vertex.in_degree())) \
      .to_df(["name", "degree", "out_degree", "in_degree"])

15:55:29.965 [io-compute-blocker-9] INFO  com.raphtory.api.analysis.table.TableOutputTracker - Job 737823992_7542332978495795720: Starting query progress tracker.
15:55:29.965 [io-compute-blocker-9] INFO  com.raphtory.api.analysis.table.TableOutputTracker - Job 737823992_7542332978495795720: Starting output collector.
15:55:29.983 [spawner-akka.actor.default-dispatcher-10] INFO  com.raphtory.internals.components.querymanager.QueryManager - Source 0 has completed ingesting and will now unblock
15:55:29.984 [spawner-akka.actor.default-dispatcher-10] INFO  com.raphtory.internals.components.querymanager.QueryManager - Query '737823992_7542332978495795720' received, your job ID is '737823992_7542332978495795720'.
15:55:29.990 [spawner-akka.actor.default-dispatcher-3] INFO  com.raphtory.internals.components.partition.QueryExecutor - 737823992_7542332978495795720_0: Starting QueryExecutor.
15:55:29.990 [spawner-akka.actor.default-dispatcher-10] INFO  com.raphtory.internals.components.partitio
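
To make the three metrics concrete, here is a small pure-Python sketch that computes them from a list of directed edges. It assumes that degree means the number of distinct neighbours, which may or may not match how PyRaphtory counts repeated retweets between the same pair of users. The function name `degree_table` and the edge list are made up for illustration.

```python
from collections import defaultdict


def degree_table(edges):
    """Map each vertex to (degree, out_degree, in_degree) over distinct neighbours."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    vertices = set(out_nbrs) | set(in_nbrs)
    table = {}
    for v in vertices:
        outs = out_nbrs.get(v, set())
        ins = in_nbrs.get(v, set())
        table[v] = (len(outs | ins), len(outs), len(ins))
    return table


# Made-up edges: "a" retweets "b" and "c", and "d" retweets "c"
example = degree_table([("a", "b"), ("a", "c"), ("d", "c")])
print(example["a"])  # (2, 2, 0)
print(example["c"])  # (2, 0, 2)
```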

### Clean the dataframe by dropping the unused window column 🧹

In [5]:
df.drop(columns=['window'], inplace=True)

### Preview the dataframe 👀

In [6]:
df

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
0,1341101732,99258,1,1,0
1,1341101732,8,1,0,1
2,1341101732,75083,1,1,0
3,1341101732,376989,2,2,0
4,1341101732,453850,1,1,0
5,1341101732,84647,1,0,1
6,1341101732,50329,2,0,2
7,1341101732,13813,1,0,1
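
The timestamp column is easier to read as a date. Assuming the values are Unix epoch seconds (an assumption; the notebook does not state the unit), the standard library can convert them:

```python
from datetime import datetime, timezone

# 1341101732 is the timestamp shown in the dataframe above
latest = datetime.fromtimestamp(1341101732, tz=timezone.utc)
print(latest.isoformat())  # 2012-07-01T00:15:32+00:00
```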


**Sort by highest degree, top 10**

In [7]:
df.sort_values(['degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
3,1341101732,376989,2,2,0
6,1341101732,50329,2,0,2
0,1341101732,99258,1,1,0
1,1341101732,8,1,0,1
2,1341101732,75083,1,1,0
4,1341101732,453850,1,1,0
5,1341101732,84647,1,0,1
7,1341101732,13813,1,0,1


**Sort by highest in-degree, top 10**

In [8]:
df.sort_values(['in_degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
6,1341101732,50329,2,0,2
1,1341101732,8,1,0,1
5,1341101732,84647,1,0,1
7,1341101732,13813,1,0,1
0,1341101732,99258,1,1,0
2,1341101732,75083,1,1,0
3,1341101732,376989,2,2,0
4,1341101732,453850,1,1,0


**Sort by highest out-degree, top 10**

In [9]:
df.sort_values(['out_degree'], ascending=False)[:10]

Unnamed: 0,timestamp,name,degree,out_degree,in_degree
3,1341101732,376989,2,2,0
0,1341101732,99258,1,1,0
2,1341101732,75083,1,1,0
4,1341101732,453850,1,1,0
1,1341101732,8,1,0,1
5,1341101732,84647,1,0,1
6,1341101732,50329,2,0,2
7,1341101732,13813,1,0,1


## Run a PageRank algorithm 📑

Run your selected algorithm on your graph; here we run PageRank. Algorithms can be obtained from the PyRaphtory object you created at the start. Specify where the result of the algorithm is written, e.g. as an additional column in your dataframe. **Time to finish: ~3 to 4 minutes**
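
To illustrate what PageRank computes, here is a minimal pure-Python power-iteration sketch. It is not PyRaphtory's implementation or API: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the edge list are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: set() for n in nodes}
    for src, dst in edges:
        out_links[src].add(dst)

    # Start from a uniform distribution over all vertices
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Made-up edges: "a" and "b" retweet "c", and "c" retweets "a"
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get))  # c
```

In this toy graph the most retweeted user, "c", ends up with the highest rank, which is the intuition behind using PageRank on the retweet network.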

**Clean your dataframe** 🧹

In [None]:
PyRaphtory.close_graphs()