<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/lotr_graph_data_science_catalog/Native%20Graph%20Catalog%20feature%20on%20a%20Lord%20of%20the%20Rings%20dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Updated to GDS 2.0 version
* Link to original blog post: https://towardsdatascience.com/exploring-the-graph-catalog-feature-of-neo4j-graph-data-science-plugin-on-a-lord-of-the-rings-d2de0d0a023

In [3]:
!pip install neo4j


In [4]:
# Define Neo4j connections
from neo4j import GraphDatabase
host = 'bolt://3.235.2.228:7687'
user = 'neo4j'
password = 'seats-drunks-carbon'
driver = GraphDatabase.driver(host,auth=(user, password))

In [5]:
# Import libraries
import pandas as pd

def read_query(query):
    with driver.session() as session:
        result = session.run(query)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())
    
def run_query(query):
    with driver.session() as session:
        session.run(query)

## Import

I have used the GoT dataset more times than I can remember, so I decided to explore the internet and search for new exciting graphs. I stumbled upon this Lord of the Rings dataset made available by José Calvo that we will use in this blog post.

The dataset describes interactions between persons, places, groups, and things (The Ring). When choosing how to model this dataset, I decided to give the "main" nodes two labels: a primary label "Node" and a secondary label that is one of the following:

* Person
* Place
* Group
* Thing

In [12]:
# Import nodes
import_nodes_query = """

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/ontologies/ontology.csv" as row 
FIELDTERMINATOR "\t" 
WITH row, CASE row.type WHEN 'per' THEN 'Person' 
                        WHEN 'gro' THEN 'Group' 
                        WHEN 'thin' THEN 'Thing' 
                        WHEN 'pla' THEN 'Place' 
                        END as label 
CALL apoc.create.nodes(['Node',label], [apoc.map.clean(row,['type','subtype'],[null,""])]) YIELD node 
WITH node, row.subtype as class 
MERGE (c:Class{id:class}) 
MERGE (node)-[:PART_OF]->(c)

"""

run_query(import_nodes_query)

In [13]:
# Import relationships
import_relationships_query = """

UNWIND ['1','2','3'] as book 
LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/tables/networks-id-volume" + book + ".csv" AS row 
MATCH (source:Node{id:coalesce(row.IdSource,row.Source)})
MATCH (target:Node{id:coalesce(row.IdTarget,row.Target)})
CALL apoc.create.relationship(source, "INTERACTS_" + book, 
     {weight:toInteger(row.Weight)}, target) YIELD rel
RETURN distinct true

"""

run_query(import_relationships_query)

## Graph data science

The syntax for creating named graphs in Graph Catalog is:


<code>CALL gds.graph.project(graph name, node label, relationship type).</code>


### Describing nodes we want to project

In general, with the native projection variant, there are three options to describe the nodes we want to project into memory:

* Project a single node label using a string:
    * 'Label' ('*' is a wildcard operator that projects all nodes)
* Project multiple node labels using an array:
    * ['Label1', 'Label2', 'Label3']
* Project multiple node labels with their properties using a configuration map:
    * <pre>{
  Label: {
    label: "Label",
    properties: [
      "property1",
      "property2"
    ]
  },
  Label2: {
    label: "Label2",
    properties: [
      "foo",
      "bar"
    ]
  }
}</pre>
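
The three forms above map onto plain Python values when the projection is passed as a query parameter rather than inlined in the Cypher string. The following is a minimal sketch: the helper name `node_projection`, the graph name, and the labels and properties are placeholders chosen for illustration, and nothing is sent to the database.

```python
def node_projection(labels, properties=None):
    """Build the node argument for gds.graph.project (hypothetical helper).

    - a single label (str) without properties is returned unchanged
    - a list of labels without properties is returned as a list
    - otherwise a configuration map {Label: {label, properties}} is built
    """
    if isinstance(labels, str) and properties is None:
        return labels
    if isinstance(labels, str):
        labels = [labels]
    if properties is None:
        return list(labels)
    return {
        label: {"label": label, "properties": list(properties.get(label, []))}
        for label in labels
    }


single = node_projection("Person")
multiple = node_projection(["Person", "Thing"])
with_props = node_projection(
    ["Label", "Label2"],
    {"Label": ["property1", "property2"], "Label2": ["foo", "bar"]},
)

# Parameters for a query such as: CALL gds.graph.project($name, $nodes, $rels)
params = {"name": "example_graph", "nodes": with_props, "rels": "*"}
```

With the Neo4j Python driver, a dictionary like `params` can be supplied as the second argument of `session.run`.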

An important thing to note regarding projecting node labels:

> In the in-memory graph, all projected node labels are merged into a single label. Unlike for relationship projections, it is currently not possible to specify a filter on projected labels. If the graph is used as input for an algorithm, all nodes will be considered.

While we can always filter which node labels we want to project to the in-memory graph, the passage quoted above describes an earlier release. In GDS 2.0, algorithms also accept a nodeLabels parameter that restricts execution to a subset of the projected labels.




### Describing relationships we want to project

The syntax to describe the relationships we want to project is very similar to that of the nodes. 

* Project a single relationship type using a string:
    * 'TYPE' ('*' is a wildcard that projects all relationship-types)
* Project multiple relationship types using an array:
    * ['TYPE1','TYPE2']
* Project multiple relationship types with their properties using a configuration map:
    * <pre>{ALIAS_OF_TYPE: {type: 'RELATIONSHIP_TYPE',
                 orientation: 'NATURAL',
                 aggregation: 'DEFAULT',
                 properties: ['property1', 'property2']}}</pre>

The orientation parameter in the configuration map defines the direction of the relationships we want to project. Possible values are:

* 'NATURAL' -> each relationship is projected the same way as it is stored in Neo4j
* 'REVERSE' -> each relationship is reversed during graph projection
* 'UNDIRECTED' -> each relationship is projected in both natural and reverse orientation
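
To make the three orientation values concrete, here is a small pure-Python sketch of how each one would present a list of stored relationships to an algorithm. It only illustrates the semantics described above and does not call GDS; the function name `project_edges` and the sample pairs are made up.

```python
def project_edges(edges, orientation="NATURAL"):
    """Return directed (source, target) pairs as a projection would expose them."""
    if orientation == "NATURAL":
        return list(edges)
    if orientation == "REVERSE":
        return [(target, source) for source, target in edges]
    if orientation == "UNDIRECTED":
        # each stored relationship becomes traversable in both directions
        return [pair for s, t in edges for pair in ((s, t), (t, s))]
    raise ValueError(f"unknown orientation: {orientation}")


stored = [("Frodo", "Sam"), ("Gandalf", "Frodo")]  # made-up sample pairs
natural = project_edges(stored, "NATURAL")
reverse = project_edges(stored, "REVERSE")
undirected = project_edges(stored, "UNDIRECTED")
```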

An important thing to note is that the GDS library supports running graph algorithms on a multigraph. The aggregation parameter is handy when we want to convert a multigraph to a single graph (not a multigraph), but we'll take a closer look at that in another blog post.
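
As a rough illustration of what collapsing a multigraph means, the sketch below merges parallel edges between the same pair of nodes into one edge. The reducer names are modelled on aggregation strategies such as SUM, MIN, MAX and COUNT; the function `aggregate_parallel` and the sample weights are assumptions for illustration, not the library's implementation.

```python
from collections import defaultdict


def aggregate_parallel(edges, how="SUM"):
    """Collapse parallel (source, target, weight) edges into one edge each."""
    reducers = {"SUM": sum, "MIN": min, "MAX": max, "COUNT": len}
    grouped = defaultdict(list)
    for source, target, weight in edges:
        grouped[(source, target)].append(weight)
    reduce = reducers[how]
    return {pair: reduce(weights) for pair, weights in grouped.items()}


# made-up multigraph: two parallel edges between the first pair
multigraph = [
    ("Frodo", "Sam", 3),
    ("Frodo", "Sam", 5),
    ("Frodo", "Gandalf", 2),
]
summed = aggregate_parallel(multigraph, "SUM")
counted = aggregate_parallel(multigraph, "COUNT")
```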

In [8]:
def drop_graph(name):
    with driver.session() as session:
        drop_graph_query = """
        CALL gds.graph.drop('{}');
        """.format(name)
        session.run(drop_graph_query)
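
A possible variant of the helper passes the graph name as a query parameter instead of formatting it into the string, and sets the second argument of `gds.graph.drop` (failIfMissing) to false so that dropping a graph that does not exist is not an error. The sketch below uses a stand-in session class, `RecordingSession`, which is purely hypothetical and only records what would be sent.

```python
def drop_graph_if_exists(session, name):
    """Drop a named graph, passing the name as a query parameter.

    `session` is any object with a neo4j-style run(query, parameters) method.
    The literal false is the failIfMissing flag of gds.graph.drop.
    """
    query = "CALL gds.graph.drop($name, false) YIELD graphName RETURN graphName"
    return session.run(query, {"name": name})


class RecordingSession:
    """Stand-in session used only to show what would be sent."""

    def __init__(self):
        self.calls = []

    def run(self, query, parameters=None):
        self.calls.append((query, parameters))
        return []


demo = RecordingSession()
drop_graph_if_exists(demo, "whole_graph")
```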

### Whole graph projection

Let's start by projecting the entire graph into memory using the wildcard operator for both the nodes and the relationships.

In [14]:
whole_graph = "CALL gds.graph.project('whole_graph','*', '*')"
run_query(whole_graph);

Most of the time, we start the graph analysis by running the (weakly) connected components algorithm to get an idea of how (dis)connected our graph really is. 

In [15]:
wcc_whole = """

CALL gds.wcc.stream('whole_graph') YIELD nodeId, componentId 
RETURN componentId, count(*) as size 
ORDER BY size DESC LIMIT 10

"""
read_query(wcc_whole)

|   | componentId | size |
|---|---|---|
| 0 | 0 | 86 |


The graph as a whole consists of a single component. Usually, what you'll get with real-world data is a single super component (85+% of all nodes) and a few small disconnected components.
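
For readers who want to see the idea behind weakly connected components, here is a small union-find sketch in plain Python. It runs on toy data rather than the Lord of the Rings graph and is not how GDS implements the algorithm internally; it only shows that relationship direction is ignored and nodes are grouped by reachability.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring relationship direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # follow parent pointers to the component root, compressing the path
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        parent[find(source)] = find(target)

    components = {}
    for node in nodes:
        components.setdefault(find(node), []).append(node)
    # largest component first, like ORDER BY size DESC
    return sorted(components.values(), key=len, reverse=True)


nodes = ["a", "b", "c", "d", "e"]  # toy data
edges = [("a", "b"), ("c", "b"), ("d", "e")]
components = weakly_connected_components(nodes, edges)
sizes = [len(c) for c in components]
```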

In [16]:
# Drop whole graph
drop_graph('whole_graph')

### Interactions graph

In the next step, we want to ignore PART_OF relationships and only focus on INTERACTS_X relationships. We will use an array for describing relationship-types to take into account all three INTERACTS_X relationships.

In [17]:
interacts_query = """
CALL gds.graph.project('all_interacts','Node', ['INTERACTS_1', 'INTERACTS_2', 'INTERACTS_3'])
"""
run_query(interacts_query)

Let's run the weakly connected components algorithm on our new projected graph. 

In [18]:
wcc_interacts = """

CALL gds.wcc.stream('all_interacts') YIELD nodeId, componentId 
RETURN componentId, count(*) as size, collect(gds.util.asNode(nodeId).Label) as ids 
ORDER BY size DESC LIMIT 10

"""

read_query(wcc_interacts)

|   | componentId | size | ids |
|---|---|---|---|
| 0 | 0 | 73 | [Anduin, Aragorn, Arathorn, Arwen, Bag End, Ba... |
| 1 | 25 | 1 | [Old Forest] |
| 2 | 26 | 1 | [Mirkwood] |


Our new graph consists of three components. We have a single super component and two tiny components consisting of only a single node. We can deduce that locations "Mirkwood" and "Old Forest" have no INTERACTS_X relationships.

Let's use the same projected graph and only look at interactions from the first book. We can filter which relationship types the graph algorithm should consider with the relationshipTypes parameter.

In [19]:
wcc_interacts_first_book = """

CALL gds.wcc.stream('all_interacts', 
    {relationshipTypes:['INTERACTS_1']}) YIELD nodeId, componentId
RETURN componentId, count(*) as size, 
       collect(gds.util.asNode(nodeId).Label) as ids
ORDER BY size DESC LIMIT 10

"""

read_query(wcc_interacts_first_book)

|   | componentId | size | ids |
|---|---|---|---|
| 0 | 0 | 62 | [Anduin, Aragorn, Arathorn, Arwen, Bag End, Ba... |
| 1 | 21 | 1 | [Ents] |
| 2 | 26 | 1 | [Mirkwood] |
| 3 | 23 | 1 | [Eorl] |
| 4 | 24 | 1 | [Éowyn] |
| 5 | 25 | 1 | [Old Forest] |
| 6 | 22 | 1 | [Éomer] |
| 7 | 27 | 1 | [Faramir] |
| 8 | 38 | 1 | [Gorbag] |
| 9 | 6 | 1 | [Beregond] |


We get more disconnected components if we take into account only interactions from the first book. This makes sense as some of the characters/locations haven't yet been introduced in the first book, so they have no INTERACTS_1 relationships.
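
The effect of restricting relationship types can be reproduced on a toy graph. In the sketch below (hypothetical node names and edges, plain Python, no GDS involved), keeping only one relationship type leaves a node isolated, so the number of components goes up.

```python
def count_components(nodes, typed_edges, relationship_types=None):
    """Count weakly connected components using only the selected edge types."""
    neighbours = {node: set() for node in nodes}
    for source, target, rel_type in typed_edges:
        if relationship_types is None or rel_type in relationship_types:
            neighbours[source].add(target)
            neighbours[target].add(source)

    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)
    return count


nodes = ["n1", "n2", "n3"]  # toy nodes for illustration
typed_edges = [
    ("n1", "n2", "INTERACTS_1"),
    ("n2", "n3", "INTERACTS_3"),
]
all_types = count_components(nodes, typed_edges)
first_book_only = count_components(nodes, typed_edges, ["INTERACTS_1"])
```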

In [20]:
drop_graph('all_interacts')

### Undirected weighted graph

In the last example, we will show how to project an undirected weighted graph. We will consider only nodes labeled Person and Thing. For relationships, we will project all the INTERACTS_X relationships as undirected, along with their weight property.

In [21]:
load_graph = """

CALL gds.graph.project('undirected_weighted',['Person', 'Thing'], 
    {INTERACTS_1:{type: 'INTERACTS_1', 
                  orientation: 'UNDIRECTED', 
                  properties:['weight']},
     INTERACTS_2:{type:'INTERACTS_2',
                  orientation: 'UNDIRECTED',
                  properties:['weight']},
     INTERACTS_3: {type:'INTERACTS_3', 
                   orientation:'UNDIRECTED',
                   properties:['weight']}});

"""

run_query(load_graph)
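
Since the three relationship entries differ only in the book number, the configuration map could also be generated in Python and passed as a query parameter. This is a sketch of that alternative, not the approach used in the cell above; the `$name`, `$nodes` and `$rels` parameter names are arbitrary.

```python
books = ["1", "2", "3"]

# One entry per INTERACTS_X type, each undirected and carrying 'weight'
relationship_projection = {
    f"INTERACTS_{book}": {
        "type": f"INTERACTS_{book}",
        "orientation": "UNDIRECTED",
        "properties": ["weight"],
    }
    for book in books
}

query = "CALL gds.graph.project($name, $nodes, $rels)"
params = {
    "name": "undirected_weighted",
    "nodes": ["Person", "Thing"],
    "rels": relationship_projection,
}
```

The `run_query` helper defined earlier takes only a query string, so submitting this would mean calling `session.run(query, params)` directly.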

#### Unweighted pagerank

To run the unweighted pageRank on our projected graph, we don't have to specify any additional configuration.

In [22]:
unweighted_pagerank = """

CALL gds.pageRank.stream('undirected_weighted')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC LIMIT 5

"""

read_query(unweighted_pagerank)

|   | name | score |
|---|---|---|
| 0 | Aragorn | 2.243211 |
| 1 | Gandalf | 2.200864 |
| 2 | Frodo | 2.099686 |
| 3 | Sam | 1.79491 |
| 4 | Gimli | 1.63608 |


#### Weighted pagerank
To let the algorithm know that it should take relationship weights into account, we need to use the relationshipWeightProperty parameter.

In [23]:
weighted_pagerank = """

CALL gds.pageRank.stream('undirected_weighted', {relationshipWeightProperty:'weight'}) 
YIELD nodeId, score 
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC LIMIT 5

"""

read_query(weighted_pagerank)

|   | name | score |
|---|---|---|
| 0 | Frodo | 5.065515 |
| 1 | Gandalf | 3.73223 |
| 2 | Sam | 3.445908 |
| 3 | Aragorn | 3.224506 |
| 4 | Pippin | 2.338957 |


As Frodo has more interactions (defined as weight) with other characters, he comes out on top with the weighted variant of the pageRank.
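
The same kind of rank change can be seen on a toy graph with a plain power-iteration PageRank. The sketch below is an illustration under simplifying assumptions: the node names and weights are invented, every node has at least one neighbour, and scores are normalised to sum to 1, so the values are not comparable to the GDS scores shown above.

```python
def pagerank(nodes, edges, weighted=False, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph.

    edges are (u, v, weight) triples; weights are ignored unless weighted=True.
    Assumes every node has at least one neighbour.
    """
    # adjacency with weights, stored in both directions (UNDIRECTED)
    out = {node: {} for node in nodes}
    for u, v, weight in edges:
        w = weight if weighted else 1.0
        out[u][v] = out[u].get(v, 0.0) + w
        out[v][u] = out[v].get(u, 0.0) + w

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        nxt = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for u, neighbours in out.items():
            total = sum(neighbours.values())
            for v, w in neighbours.items():
                # u passes its score to neighbours in proportion to weight
                nxt[v] += damping * score[u] * w / total
        score = nxt
    return score


# "wide" has three light connections, "deep" has two heavy ones (toy data)
nodes = ["wide", "deep", "b", "c", "d"]
edges = [
    ("wide", "b", 1), ("wide", "c", 1), ("wide", "d", 1),
    ("deep", "b", 5), ("deep", "c", 5),
]
unweighted = pagerank(nodes, edges)
weighted = pagerank(nodes, edges, weighted=True)
top_unweighted = max(unweighted, key=unweighted.get)
top_weighted = max(weighted, key=weighted.get)
```

Without weights the node with more neighbours ranks first; with weights the node with fewer but heavier connections overtakes it.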

### First book analysis

To finish this blog post, we will analyze the network of the first book. We start by running the weighted pageRank on the interaction relationships from the first book only.

In [24]:
first_pagerank = """

CALL gds.pageRank.stream('undirected_weighted', 
     {relationshipWeightProperty:'weight', relationshipTypes:['INTERACTS_1']}) 
YIELD nodeId, score 
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC LIMIT 5

"""

read_query(first_pagerank)

|   | name | score |
|---|---|---|
| 0 | Frodo | 5.478835 |
| 1 | Gandalf | 2.834502 |
| 2 | Aragorn | 2.783486 |
| 3 | Sam | 2.632037 |
| 4 | Ring | 1.964534 |

Next, we run the weighted Louvain community detection algorithm on the same first-book interactions.


In [26]:
l_f = """

CALL gds.louvain.stream('undirected_weighted', 
    {relationshipWeightProperty:'weight', relationshipTypes:['INTERACTS_1']})
YIELD nodeId, communityId
RETURN communityId, collect(gds.util.asNode(nodeId).Label) as members
ORDER BY size(members) DESC LIMIT 5

"""

read_query(l_f)

|   | communityId | members |
|---|---|---|
| 0 | 36 | [Aragorn, Arathorn, Arwen, Boromir, Denethor, ... |
| 1 | 18 | [Balin, Celeborn, Durin, Galadriel, Gimli, Gló... |
| 2 | 34 | [Bilbo, Bill, Frodo, Gandalf, Gildor, Merry, P... |
| 3 | 42 | [Goldberry, Bombadil] |
| 4 | 4 | [Beregond] |


In [27]:
drop_graph('undirected_weighted')