# Loading and Cleaning the data

After reading data into Raphtory we can now make use of the graph representation to ask some interesting questions. For this tutorial, we will use a dataset from [SocioPatterns](http://www.sociopatterns.org/datasets/baboons-interactions/), comprising different behavioural interactions between a group of 22 baboons over a month. 

If you want to read more about the dataset, you can check it out in this paper: [V. Gelardi, J. Godard, D. Paleressompoulle, N. Claidière, A. Barrat, “Measuring social networks in primates: wearable sensors vs. direct observations”, Proc. R. Soc. A 476:20190737 (2020)](https://royalsocietypublishing.org/doi/10.1098/rspa.2019.0737). 

In the code below we load this dataset into a dataframe and do a small amount of preprocessing to prepare it for loading into Raphtory. This includes dropping rows with blank fields and mapping the values of the `Category` column (the behaviour category) to a `Weight` which can be aggregated. The mapping consists of the following conversions:

* Affiliative (positive interaction) → `+1`
* Agonistic (negative interaction) → `-1` 
* Other (neutral interaction) → `0`
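
As a rough illustration of the conversion listed above, the mapping can also be written as a dictionary lookup. This is a plain-Python sketch: the names `CATEGORY_WEIGHTS` and `category_to_weight` are hypothetical, and the sample values are made up rather than read from the dataset.

```python
# Hypothetical lookup table built from the three conversions listed above.
CATEGORY_WEIGHTS = {"Affiliative": 1, "Agonistic": -1, "Other": 0}


def category_to_weight(category):
    """Return the weight for a behaviour category, defaulting to 0 (neutral)."""
    return CATEGORY_WEIGHTS.get(category, 0)


# Illustrative values only, not rows from the dataset.
sample_categories = ["Affiliative", "Agonistic", "Other", "Affiliative"]
weights = [category_to_weight(c) for c in sample_categories]

print(weights)       # [1, -1, 0, 1]
print(sum(weights))  # 1: the weights can be aggregated directly
```

A function like this could be passed to `Series.apply` in place of a nested conditional expression.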

In [7]:
import pandas as pd

edges_df = pd.read_csv(
    "data/OBS_data.txt", sep="\t", header=0, usecols=[0, 1, 2, 3, 4], parse_dates=[0]
)
edges_df["DateTime"] = pd.to_datetime(edges_df["DateTime"]).astype("datetime64[ms]")
edges_df.dropna(axis=0, inplace=True)
edges_df["Weight"] = edges_df["Category"].apply(
    lambda c: 1 if (c == "Affiliative") else (-1 if (c == "Agonistic") else 0)
)
print(edges_df.head())

              DateTime   Actor Recipient  Behavior     Category  Weight
15 2019-06-13 09:50:00  ANGELE    FELIPE  Grooming  Affiliative       1
17 2019-06-13 09:50:00  ANGELE    FELIPE  Grooming  Affiliative       1
19 2019-06-13 09:51:00  FELIPE    ANGELE   Resting  Affiliative       1
20 2019-06-13 09:51:00  FELIPE      LIPS   Resting  Affiliative       1
21 2019-06-13 09:51:00  ANGELE    FELIPE  Grooming  Affiliative       1




# Creating a graph

There are plenty of ways to get data into Raphtory and start running analysis. In this tutorial we cover loading from a Pandas DataFrame. You can also add updates directly or load from a saved graph.

To get started we first need to create a graph to store our data. Printing this graph will show it as empty with no vertices, edges or update times.

In [8]:
import raphtory as rp

g = rp.Graph()
g

Graph(number_of_edges=0, number_of_vertices=0, number_of_temporal_edges=0, earliest_time="None", latest_time="None")

Next we load the dataframe into Raphtory using the `load_edges_from_pandas` function, modelling it as a weighted multi-layer graph with one layer per unique `Behavior` value.

In [9]:
g.load_edges_from_pandas(
    df=edges_df,
    src="Actor",
    dst="Recipient",
    time="DateTime",
    layer="Behavior",
    props=["Weight"],
)
print(g)


Graph(number_of_edges=290, number_of_vertices=22, number_of_temporal_edges=3196, earliest_time="1560419400000", latest_time="1562756700000")
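
To make the difference between the edge count and the temporal edge count in the summary above more concrete, here is a small plain-Python sketch. It is not Raphtory's implementation: it assumes a unique edge is a distinct directed `(src, dst)` pair and that each row is one temporal edge (update), and the rows are invented for illustration.

```python
# Illustrative updates as (time, src, dst, layer); not rows from the dataset.
updates = [
    (1, "A", "B", "Grooming"),
    (2, "A", "B", "Grooming"),
    (3, "A", "B", "Resting"),
    (4, "B", "A", "Resting"),
    (5, "C", "A", "Grooming"),
]

# Every name that appears as a source or a destination is a vertex.
vertices = {name for _, src, dst, _ in updates for name in (src, dst)}
# Assumption: a unique edge is a distinct directed (src, dst) pair.
unique_edges = {(src, dst) for _, src, dst, _ in updates}
# Each distinct behaviour becomes its own layer.
layers = {u[3] for u in updates}

print(len(vertices))      # 3
print(len(unique_edges))  # 3
print(len(updates))       # 5 temporal edges (one per update)
print(sorted(layers))     # ['Grooming', 'Resting']
```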


# Basic Metrics

Now that we have our graph let's start probing it for some basic metrics, such as how many nodes and edges it contains and the time range over which it exists. 

Note, as the property APIs are the same for the graph, vertices and edges, these are discussed together in [Property queries](https://www.raphtory.com/user-guide/querying/5_properties/).

In [10]:
print("Stats on the graph structure:")

number_of_vertices = g.count_vertices()
number_of_edges = g.count_edges()
total_interactions = g.count_temporal_edges()
unique_layers = g.unique_layers

print("Number of vertices (Baboons):", number_of_vertices)
print("Number of unique edges (src,dst,layer):", number_of_edges)
print("Total interactions (edge updates):", total_interactions)
print("Unique layers:", unique_layers, "\n")


print("Stats on the graph's time range:")

earliest_datetime = g.earliest_date_time
latest_datetime = g.latest_date_time
earliest_epoch = g.earliest_time
latest_epoch = g.latest_time

print("Earliest datetime:", earliest_datetime)
print("Latest datetime:", latest_datetime)
print("Earliest time (Unix Epoch):", earliest_epoch)
print("Latest time (Unix Epoch):", latest_epoch)

Stats on the graph structure:
Number of vertices (Baboons): 22
Number of unique edges (src,dst,layer): 290
Total interactions (edge updates): 3196
Unique layers: ['_default', 'Grooming', 'Resting', 'Presenting', 'Playing with', 'Grunting-Lipsmacking', 'Supplanting', 'Threatening', 'Submission', 'Touching', 'Avoiding', 'Attacking', 'Carrying', 'Embracing', 'Mounting', 'Copulating', 'Chasing'] 

Stats on the graph's time range:
Earliest datetime: 2019-06-13 09:50:00
Latest datetime: 2019-07-10 11:05:00
Earliest time (Unix Epoch): 1560419400000
Latest time (Unix Epoch): 1562756700000
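
The epoch values above are in milliseconds. As a quick check of how they relate to the printed datetimes, the sketch below converts them with the standard library. It assumes the timestamps are interpreted as UTC, which is consistent with the datetimes shown above; `epoch_ms_to_datetime` is a hypothetical helper, not part of Raphtory.

```python
from datetime import datetime, timezone


def epoch_ms_to_datetime(epoch_ms):
    """Convert a Unix epoch in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


earliest = epoch_ms_to_datetime(1560419400000)
latest = epoch_ms_to_datetime(1562756700000)

print(earliest)           # 2019-06-13 09:50:00+00:00
print(latest)             # 2019-07-10 11:05:00+00:00
print(latest - earliest)  # 27 days, 1:15:00
```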


## Accessing vertices and edges  
Three types of functions are provided for accessing the vertices and edges within the graph: 

* **Existence check:** Via `has_vertex()` and `has_edge()` you can check if an entity is present within the graph.
* **Direct access:** `vertex()` and `edge()` will return a vertex/edge object if the entity is present and `None` if it is not.
* **Iterable access:** `vertices` and `edges` will return iterables for all vertices/edges which can be used within a for loop or as part of a [function chain](https://www.raphtory.com/user-guide/querying/6_chaining/).

All of these functions are shown in the code below and will appear in several other examples throughout this tutorial.

In [11]:
print("Checking if specific vertices and edges are in the graph:")
if g.has_vertex(id="LOME"):
    print("Lome is in the graph")
if g.has_edge(src="LOME", dst="NEKKE", layer="Playing with"):
    print("Lome has played with Nekke \n")

print("Getting individual vertices and edges:")
print(g.vertex("LOME"))
print(g.edge("LOME", "NEKKE"), "\n")

print("Getting iterators over all vertices and edges:")
print(g.vertices)
print(g.edges)

Checking if specific vertices and edges are in the graph:
Lome is in the graph
Lome has played with Nekke 

Getting individual vertices and edges:
Vertex(name=LOME, earliest_time="1560419520000", latest_time="1562756100000")
Edge(source=LOME, target=NEKKE, earliest_time=1560421080000, latest_time=1562755980000, properties={Weight: 1}) 

Getting iterators over all vertices and edges:
Vertices(Vertex(name=ANGELE, earliest_time="1560419400000", latest_time="1562754600000"), Vertex(name=FELIPE, earliest_time="1560419400000", latest_time="1562756700000"), Vertex(name=LIPS, earliest_time="1560419460000", latest_time="1562756700000"), Vertex(name=NEKKE, earliest_time="1560419520000", latest_time="1562756700000"), Vertex(name=LOME, earliest_time="1560419520000", latest_time="1562756100000"), Vertex(name=BOBO, earliest_time="1560419520000", latest_time="1562755500000"), Vertex(name=ATMOSPHERE, earliest_time="1560419640000", latest_time="1562683260000"), Vertex(name=FEYA, earliest_time="156042

# Update history

In the code below we create a vertex object for the monkey Felipe and see when their updates occurred.

We've limited Felipe's updates to the first 10, since they have had many interactions!


In [16]:
v = g.vertex("FELIPE")
print(
    f"{v.name}'s first interaction was at {v.earliest_date_time} and their last interaction was at {v.latest_date_time}\n"
)
print(f"{v.name} had interactions at the following times: {v.history()[:10]}\n")

FELIPE's first interaction was at 2019-06-13 09:50:00 and their last interaction was at 2019-07-10 11:05:00

FELIPE had interactions at the following times: [1560419400000, 1560419460000, 1560419520000, 1560419580000, 1560419640000, 1560420720000, 1560421260000, 1560422580000, 1560423360000, 1560423420000]



# Neighbours, edges and paths


To investigate who a vertex is connected with, we can ask for its `degree()`, `edges`, or `neighbours`. As Raphtory graphs are directed, all of these functions also have `in_` and `out_` variations, allowing you to get only incoming or outgoing connections respectively. These functions return the following:

* **degree:** A count of the number of unique connections a vertex has.
* **edges:** An iterable (`Edges`) of edge objects, one for each unique `(src,dst)` pair.
* **neighbours:** An iterable of vertex objects (`PathFromVertex`), one for each node the vertex shares an edge with.
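
The three quantities listed above can be sketched over a toy directed edge list. This is a plain-Python illustration, not Raphtory code: the helper names are hypothetical, and it assumes that the undirected degree counts unique neighbours in either direction.

```python
# Illustrative directed edges as (src, dst); not taken from the dataset.
edges = [("A", "B"), ("C", "A"), ("B", "A"), ("A", "D")]


def in_neighbours(vertex):
    """Vertices with an edge pointing at `vertex`."""
    return {src for src, dst in edges if dst == vertex}


def out_neighbours(vertex):
    """Vertices that `vertex` points at."""
    return {dst for src, dst in edges if src == vertex}


in_degree = len(in_neighbours("A"))    # B and C
out_degree = len(out_neighbours("A"))  # B and D
# Assumption: the overall degree counts each neighbour once, whatever the direction.
degree = len(in_neighbours("A") | out_neighbours("A"))

print(in_degree, out_degree, degree)  # 2 2 3
```

Note that `B` is both an in-neighbour and an out-neighbour of `A`, so under this assumption the degree is smaller than the sum of the in- and out-degrees.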

In [8]:
v = g.vertex("FELIPE")
v_name = v.name
in_degree = v.in_degree()
out_degree = v.out_degree()
in_edges = v.in_edges
neighbours = v.neighbours
neighbour_names = v.neighbours.name.collect()

print(
    f"{v_name} has {in_degree} incoming interactions and {out_degree} outgoing interactions.\n"
)
print(in_edges)
print(neighbours, "\n")
print(f"{v_name} interacted with the following baboons {neighbour_names}")


FELIPE has 17 incoming interactions and 18 outgoing interactions.

Edges(Edge(source=ANGELE, target=FELIPE, earliest_time=1560419400000, latest_time=1562753640000, properties={Weight: 1}), Edge(source=LIPS, target=FELIPE, earliest_time=1560423600000, latest_time=1562756700000, properties={Weight: 1}), Edge(source=NEKKE, target=FELIPE, earliest_time=1560443040000, latest_time=1562596380000, properties={Weight: 1}), Edge(source=LOME, target=FELIPE, earliest_time=1560421260000, latest_time=1562149080000, properties={Weight: 1}), Edge(source=BOBO, target=FELIPE, earliest_time=1560423360000, latest_time=1561543080000, properties={Weight: -1}), Edge(source=ATMOSPHERE, target=FELIPE, earliest_time=1560524880000, latest_time=1561638540000, properties={Weight: 1}), Edge(source=FEYA, target=FELIPE, earliest_time=1560853500000, latest_time=1562586000000, properties={Weight: 1}), Edge(source=FANA, target=FELIPE, earliest_time=1560526140000, latest_time=1562752800000, properties={Weight: 1}), Edge(

## Exploded edges
The very first question you may have after reading this is "What if I don't want all of the layers?". For this Raphtory offers you three different ways to split the edge, depending on your use case:

* `.layers()`: which takes a list of layer names and returns a new `Edge View` which only contains updates for the specified layers - This is discussed in more detail in the [Layer views](https://www.raphtory.com/user-guide/views/3_layer/) chapter.
* `.explode_layers()`: which returns an iterable of `Edge Views`, each containing the updates for one layer.
* `.explode()`: which returns an `Exploded Edge` containing only the information from one call to `add_edge()` i.e. an edge object for each update. 
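
To illustrate how the three splits above differ, here is a plain-Python sketch over the updates of a single toy edge. The helpers `keep_layers`, `split_by_layer` and `split_by_update` are hypothetical stand-ins for `.layers()`, `.explode_layers()` and `.explode()` respectively; they only mimic the grouping, not Raphtory's view objects.

```python
from collections import defaultdict

# Illustrative updates for one (src, dst) pair as (time, layer).
edge_updates = [(10, "Grooming"), (20, "Resting"), (30, "Grooming"), (40, "Touching")]


def keep_layers(updates, names):
    """Like a layer filter: keep only updates in the named layers."""
    return [(t, layer) for t, layer in updates if layer in names]


def split_by_layer(updates):
    """One group per layer, each holding that layer's update times."""
    groups = defaultdict(list)
    for t, layer in updates:
        groups[layer].append(t)
    return dict(groups)


def split_by_update(updates):
    """One record per individual update."""
    return [{"time": t, "layer": layer} for t, layer in updates]


print(split_by_layer(edge_updates))
# {'Grooming': [10, 30], 'Resting': [20], 'Touching': [40]}
print(len(split_by_update(edge_updates)))  # 4
print(split_by_update(keep_layers(edge_updates, {"Touching", "Carrying"})))
# [{'time': 40, 'layer': 'Touching'}]
```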

In the code below you can see usage of all of these functions. We first call `explode_layers()`, seeing which layer each edge object represents and outputting its update history. Next we fully `explode()` the edge and see each update as an individual object. Thirdly we use the `layers()` function to look at only the `Touching` and `Carrying` layers and chain this with a call to `explode()` to see the updates within these individually.

In [15]:
print("Update history per layer:")
for e in g.edge("FELIPE", "MAKO").explode_layers():
    print(f"{e.src.name} interacted with {e.dst.name} with the following behaviour '{e.layer_name}' at these times: {e.history()}")

print()
print("Individual updates as edges:")
for e in g.edge("FELIPE", "MAKO").explode():
    print(f"At {e.date_time} {e.src.name} interacted with {e.dst.name} in the following manner: '{e.layer_name}'")

print()
print("Individual updates for 'Touching' and 'Carrying':")
for e in g.edge("FELIPE", "MAKO").layers(["Touching", "Carrying"]).explode():
    print(f"At {e.date_time} {e.src.name} interacted with {e.dst.name} in the following manner: '{e.layer_name}'")

Update history per layer:
FELIPE interacted with MAKO with the following behaviour 'Grooming' at these times: [1561043280000, 1561043340000]
FELIPE interacted with MAKO with the following behaviour 'Resting' at these times: [1560437400000, 1560437640000, 1560935460000, 1561117620000, 1561373880000, 1561390860000, 1561390860000, 1561390860000, 1561643580000, 1561970760000, 1562149020000, 1562671020000]
FELIPE interacted with MAKO with the following behaviour 'Playing with' at these times: [1561373880000, 1561373940000, 1561373940000, 1561390920000, 1562148960000, 1562148960000, 1562149080000]
FELIPE interacted with MAKO with the following behaviour 'Grunting-Lipsmacking' at these times: [1561373940000, 1561717080000, 1561717140000]
FELIPE interacted with MAKO with the following behaviour 'Touching' at these times: [1562149020000]
FELIPE interacted with MAKO with the following behaviour 'Carrying' at these times: [1561043280000, 1561373940000]
FELIPE interacted with MAKO with the following beh

# Chaining Functions 

When we called `v.neighbours` in [Vertex metrics](https://www.raphtory.com/user-guide/querying/3_vertex-metrics/#neighbours-edges-and-paths), a `PathFromVertex` was returned rather than a `List`. This, along with all other iterables previously mentioned (`Vertices`, `Edges`, `Properties`), is a [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation) data structure which allows you to chain multiple functions together before a final execution.

For example, for a vertex `v`, `v.neighbours.neighbours` will return the two-hop neighbours. The first call of `neighbours` returns the immediate neighbours of `v`, the second applies the `neighbours` function to each of the vertices returned by the first call.

We can continue this chain for as long as we like, with any functions in the Vertex, Edge or Property API until we either: 

* Call `.collect()`, which will execute the chain and return the result.
* Execute the chain by handing it to a Python function such as `list()`, `set()`, `sum()`, etc.
* Iterate through the chain via a loop/list comprehension.
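
The idea of a lazy chain that only runs when consumed can be illustrated with ordinary Python generators. This is an analogy under stated assumptions, not how Raphtory implements its iterables: `adjacency`, `neighbours` and `calls` are hypothetical names on a toy graph.

```python
# Illustrative adjacency lists for a toy graph; not the baboon graph.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
calls = []  # records which vertices have actually been expanded


def neighbours(vertices):
    """Lazily yield the neighbours of each vertex in `vertices`."""
    for v in vertices:
        calls.append(v)
        yield from adjacency[v]


chain = neighbours(neighbours(["A"]))  # builds the chain, nothing runs yet
print(calls)  # []

two_hop = list(chain)  # handing the chain to list() executes it
print(two_hop)  # ['A', 'C']
print(calls)    # ['A', 'B']
```

As in this toy example, a two-hop walk can lead back to the starting vertex, so a vertex may appear among its own two-hop neighbours.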

We can see a basic example of these function chains below in which we get the names of all the monkeys, the names of their two-hop neighbours, zip these together and print the result.

In [9]:
vertex_names = g.vertices.name
two_hop_neighbours = g.vertices.neighbours.neighbours.name.collect()
combined = zip(vertex_names, two_hop_neighbours)
for name, two_hop_neighbour in combined:
    print(f"{name} has the following two hop neighbours {two_hop_neighbour}") 

ANGELE has the following two hop neighbours ['ANGELE', 'LIPS', 'NEKKE', 'LOME', 'BOBO', 'ATMOSPHERE', 'FEYA', 'FANA', 'PIPO', 'MUSE', 'MAKO', 'MALI', 'PETOULETTE', 'ARIELLE', 'HARLEM', 'VIOLETTE', 'EWINE', 'SELF', 'ANGELE', 'FELIPE', 'NEKKE', 'LOME', 'BOBO', 'ATMOSPHERE', 'FEYA', 'FANA', 'PIPO', 'KALI', 'MUSE', 'MAKO', 'MALI', 'PETOULETTE', 'ARIELLE', 'HARLEM', 'VIOLETTE', 'EWINE', 'ANGELE', 'FELIPE', 'LIPS', 'LOME', 'BOBO', 'ATMOSPHERE', 'FEYA', 'FANA', 'PIPO', 'KALI', 'MUSE', 'MAKO', 'MALI', 'PETOULETTE', 'ARIELLE', 'HARLEM', 'VIOLETTE', 'EWINE', 'ANGELE', 'FELIPE', 'LIPS', 'NEKKE', 'BOBO', 'ATMOSPHERE', 'FEYA', 'FANA', 'PIPO', 'KALI', 'MUSE', 'MAKO', 'MALI', 'PETOULETTE', 'ARIELLE', 'HARLEM', 'VIOLETTE', 'EWINE', 'ANGELE', 'FELIPE', 'LIPS', 'NEKKE', 'LOME', 'ATMOSPHERE', 'FEYA', 'FANA', 'PIPO', 'KALI', 'MUSE', 'MAKO', 'MALI', 'PETOULETTE', 'ARIELLE', 'HARLEM', 'VIOLETTE', 'EWINE', 'ANGELE', 'FELIPE', 'LIPS', 'NEKKE', 'LOME', 'BOBO', 'FEYA', 'FANA', 'PIPO', 'KALI', 'MUSE', 'MAKO', 'M

# Chains with properties 

To demonstrate this further, we can include property aggregation in our chains.

In the code below we sum the `Weight` values on each of Felipe's out-edges to rank his out-neighbours by the net number of positive interactions he has initiated with them.

Following this, we find the most annoying monkey by ranking globally who, on average, has had the most negative interactions initiated against them.

In [11]:
v = g.vertex("FELIPE")
neighbours_weighted = list(
    zip(
        v.out_edges.dst.name,
        v.out_edges.properties.temporal.get("Weight").values().sum(),
    )
)
sorted_weights = sorted(neighbours_weighted, key=lambda v: v[1], reverse=True)
print(f"Felipe's favourite baboons in descending order are {sorted_weights}")

annoying_monkeys = list(
    zip(
        g.vertices.name,
        g.vertices.in_edges.properties.temporal.get("Weight")
        .values()
        .sum()  # sum the weights within each edge
        .mean()  # average the summed weights for each monkey
        .collect(),
    )
)
most_annoying = sorted(annoying_monkeys, key=lambda v: v[1])[0]
print(
    f"{most_annoying[0]} is the most annoying monkey with an average score of {most_annoying[1]}"
)


Felipe's favourite baboons in descending order are [('NEKKE', 41), ('ANGELE', 31), ('MAKO', 26), ('LOME', 23), ('LIPS', 11), ('HARLEM', 10), ('FANA', 8), ('MALI', 6), ('FEYA', 5), ('ARIELLE', 5), ('EWINE', 5), ('PIPO', 3), ('SELF', 2), ('BOBO', 1), ('ATMOSPHERE', 1), ('PETOULETTE', 1), ('VIOLETTE', 1), ('MUSE', -1)]
EXTERNE is the most annoying monkey with an average score of -2.0
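
The sum-then-mean aggregation in the chain above can be followed step by step on a toy example. The sketch below is plain Python with invented weights and hypothetical names (`in_edge_weights`, `scores`); it only mirrors the arithmetic, first summing the weights within each incoming edge and then averaging those sums per vertex.

```python
from statistics import mean

# Illustrative in-edges: target -> {source: [weight of each update]}.
in_edge_weights = {
    "X": {"A": [1, 1, -1], "B": [-1, -1]},
    "Y": {"A": [1], "B": [1, 1, 1]},
}

scores = {
    # sum the weights within each edge, then average the sums per target
    target: mean([sum(weights) for weights in by_source.values()])
    for target, by_source in in_edge_weights.items()
}

lowest = min(scores, key=scores.get)
print(lowest, scores[lowest])  # X -0.5
```

Here `X` receives edge sums of `1` and `-2`, giving an average of `-0.5`, while `Y` receives `1` and `3`, giving an average of `2`.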


# Graph Views


Raphtory can maintain hundreds of thousands of Graph Views in parallel, allows chaining view functions together to create as specific a filter as is required for your use case, and provides a unified API such that all functions mentioned can be called on a graph, vertex or edge.

A number of views are supported. For example, `at()` takes a time argument in epoch (integer) or datetime (string/datetime object) format and can be called on a graph, vertex, or edge. It returns an equivalent Graph View, Vertex View or Edge View which includes all updates between the beginning of the graph's history and the provided time (inclusive of the time provided).

You can also apply windows with the `window()` function. This allows you to set a start time as well as an end time (inclusive of start, exclusive of end).

This is useful for digging into specific ranges of the history that you are interested in, for example a given day within your data, filtering out everything outside this range. An example of this can be seen below, where we look at the number of times Lome interacts with Nekke across the full dataset and within the single day from the 13th of June to the 14th of June.

In [18]:
from datetime import datetime

start_day = datetime.strptime("2019-06-13", "%Y-%m-%d")
end_day = datetime.strptime("2019-06-14", "%Y-%m-%d")
e = g.edge("LOME", "NEKKE")
print(
    f"Across the full dataset {e.src.name} interacted with {e.dst.name} {len(e.history())} times"
)
e = e.window(start_day, end_day)  # inclusive of start_day, exclusive of end_day
print(
    f"Between {e.start_date_time} and {e.end_date_time}, {e.src.name} interacted with {e.dst.name} {len(e.history())} times"
)
print(
    f"Window start: {e.start_date_time}, First update: {e.earliest_date_time}, Last update: {e.latest_date_time}, Window End: {e.end_date_time}"
)


Across the full dataset LOME interacted with NEKKE 41 times
Between 2019-06-13 00:00:00 and 2019-06-14 00:00:00, LOME interacted with NEKKE 8 times
Window start: 2019-06-13 00:00:00, First update: 2019-06-13 10:18:00, Last update: 2019-06-13 15:05:00, Window End: 2019-06-14 00:00:00
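
The boundary rules described in this section (`at()` includes the given time, while `window()` includes the start and excludes the end) can be sketched as simple filters over a list of update times. The functions `view_at` and `view_window` are hypothetical plain-Python stand-ins, not Raphtory's implementation.

```python
# Illustrative update times; not taken from the dataset.
history = [100, 200, 300, 400]


def view_at(times, t):
    """Keep every update from the start of history up to and including t."""
    return [x for x in times if x <= t]


def view_window(times, start, end):
    """Keep updates in the half-open interval [start, end)."""
    return [x for x in times if start <= x < end]


print(view_at(history, 300))           # [100, 200, 300]
print(view_window(history, 200, 400))  # [200, 300]
```

Because the end of a window is exclusive, consecutive windows such as `[100, 300)` and `[300, 500)` never count the same update twice.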


## Rolling


