<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/lotr_graph_data_science_catalog/Cypher%20projection%20of%20the%20Graph%20Catalog%20feature%20on%20a%20Lord%20of%20the%20Rings%20network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Updated to GDS 2.0 version
* Link to original blog post: https://towardsdatascience.com/how-to-use-cypher-projection-in-neo4j-graph-data-science-library-on-a-lord-of-the-rings-social-b3459138c4f1

In [1]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-4.4.2.tar.gz (89 kB)
Building wheels for collected packages: neo4j
  Building wheel for neo4j (setup.py) ... done
  Created wheel for neo4j: filename=neo4j-4.4.2-py3-none-any.whl size=115365 sha256=3360555015b79a30c0fdb4eb00bc66b6254349351cee360aa68306c72de18675
  Stored in directory: /root/.cache/pip/wheels/10/d6/28/95029d7f69690dbc3b93e4933197357987de34fbd44b50a0e4
Successfully built neo4j
Installing collected packages: neo4j
Successfully installed neo4j-4.4.2


In [2]:
# Define Neo4j connections
from neo4j import GraphDatabase
host = 'bolt://3.235.2.228:7687'
user = 'neo4j'
password = 'seats-drunks-carbon'
driver = GraphDatabase.driver(host,auth=(user, password))

In [3]:
# Import libraries
import pandas as pd

def read_query(query):
    with driver.session() as session:
        result = session.run(query)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())
    
def run_query(query):
    with driver.session() as session:
        session.run(query)

## Import

I have used the GoT dataset more times than I can remember, so I decided to explore the internet and search for new exciting graphs. I stumbled upon this Lord of the Rings dataset made available by José Calvo that we will use in this blog post.

The dataset describes interactions between persons, places, groups, and things (The Ring). When choosing how to model this dataset, I decided to have "main" nodes with two labels, primary label "Node" and secondary label one of the following:

* Person
* Place
* Group
* Thing

In [4]:
# Import nodes
import_nodes_query = """

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/ontologies/ontology.csv" as row 
FIELDTERMINATOR "\t" 
WITH row, CASE row.type WHEN 'per' THEN 'Person' 
                        WHEN 'gro' THEN 'Group' 
                        WHEN 'thin' THEN 'Thing' 
                        WHEN 'pla' THEN 'Place' 
                        END as label 
CALL apoc.create.nodes(['Node',label], [apoc.map.clean(row,['type','subtype'],[null,""])]) YIELD node 
WITH node, row.subtype as class 
MERGE (c:Class{id:class}) 
MERGE (node)-[:PART_OF]->(c)

"""

run_query(import_nodes_query)

In [5]:
# Import relationships
import_relationships_query = """

UNWIND ['1','2','3'] as book 
LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/tables/networks-id-volume" + book + ".csv" AS row 
MATCH (source:Node{id:coalesce(row.IdSource,row.Source)})
MATCH (target:Node{id:coalesce(row.IdTarget,row.Target)})
CALL apoc.create.relationship(source, "INTERACTS_" + book, 
     {weight:toInteger(row.Weight)}, target) YIELD rel
RETURN distinct true

"""

run_query(import_relationships_query)

## Graph data science

### Cypher projection

The general syntax for a Cypher projection is:
<pre>
CALL gds.graph.project.cypher(
    graphName: String,
    nodeQuery: String,
    relationshipQuery: String,
    configuration: Map
)
</pre>
The node query is a Cypher statement that describes the nodes we want to project. It must return the internal ids of the nodes and, optionally, any of their properties. The relationship query describes the relationships we want to project. It should return the internal ids of the source and target nodes of each relationship and, optionally, the relationship type and any of its properties.
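To make the four arguments concrete, here is a minimal sketch of a helper that assembles the call with Cypher parameters instead of string formatting. The helper name `build_cypher_projection` and the parameter names are my own for illustration and are not part of the GDS library. It only builds the query and the parameter map, so it runs without a database.

```python
def build_cypher_projection(graph_name, node_query, relationship_query, configuration=None):
    """Return (query, parameters) for a gds.graph.project.cypher call."""
    # Every procedure argument is passed as a Cypher parameter.
    query = (
        "CALL gds.graph.project.cypher("
        "$graphName, $nodeQuery, $relationshipQuery, $configuration)"
    )
    parameters = {
        "graphName": graph_name,
        "nodeQuery": node_query,
        "relationshipQuery": relationship_query,
        "configuration": configuration or {},
    }
    return query, parameters


query, parameters = build_cypher_projection(
    "example-graph",  # hypothetical graph name
    "MATCH (n) RETURN id(n) AS id",
    "MATCH (n)-[r]->(m) RETURN id(n) AS source, id(m) AS target, type(r) AS type",
)
print(query)
```

Assuming a driver session like the one defined earlier, the pair could then be handed to `session.run(query, parameters)`.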

In [6]:
def drop_graph(name):
    with driver.session() as session:
        drop_graph_query = """
        CALL gds.graph.drop('{}');
        """.format(name)
        session.run(drop_graph_query)
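
A possible variant, sketched under the assumption that the second argument of `gds.graph.drop` is the `failIfMissing` flag, passes the graph name as a query parameter and does not fail when the graph is absent. `RecordingSession` is a hypothetical stand-in for a real driver session so that the snippet runs without a database.

```python
class RecordingSession:
    """Hypothetical stand-in for a driver session; it only records calls."""

    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))


def drop_graph_if_exists(session, name):
    # Assumption: the second argument is failIfMissing; false should make
    # the call succeed quietly when the named graph is not in the catalog.
    session.run("CALL gds.graph.drop($name, false)", name=name)


session = RecordingSession()
drop_graph_if_exists(session, "whole-graph")
print(session.calls)
```

With the real driver, the same function could be called inside a `with driver.session() as session:` block.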

### Whole graph projection

We will begin with a simple scenario and project the whole graph in memory. Adding the column type in the relationship query allows the data science engine to distinguish between relationship types, which in turn gives us an option to filter relationships when executing algorithms.

In [7]:
whole_graph = """
CALL gds.graph.project.cypher( 'whole-graph', 
    // nodeQuery 
    'MATCH (n) RETURN id(n) AS id', 
    // relationshipQuery 
    'MATCH (n)-[r]->(m) RETURN id(n) AS source, id(m) AS target, type(r) as type')
"""
run_query(whole_graph);

As in the previous blog post, we will start with the weakly connected components algorithm. It is used to examine how many islands or disconnected components there are in our graph, which will help us better understand results from other graph algorithms. Also, sometimes we might want to run other graph algorithms only on the largest connected component.

In [8]:
wcc_whole = """

CALL gds.wcc.stream('whole-graph') 
YIELD nodeId, componentId 
RETURN componentId, count(*) as size 
ORDER BY size DESC LIMIT 10

"""
read_query(wcc_whole)

|   | componentId | size |
|---|-------------|------|
| 0 | 0           | 86   |


As there is only one connected component in our graph, we don't have to worry about skewed results from other graph algorithms.
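
For intuition about what the algorithm computes, here is a minimal union-find sketch in plain Python. It is not how GDS implements WCC, and the toy nodes and edges are made up for illustration. Relationship direction is ignored, which is what "weakly" connected means.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring relationship direction."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s

    components = {}
    for n in nodes:
        components.setdefault(find(n), []).append(n)
    # Largest component first, mirroring ORDER BY size DESC.
    return sorted(components.values(), key=len, reverse=True)


# Hypothetical toy graph with two islands
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("c", "b"), ("d", "e")]
components = weakly_connected_components(toy_nodes, toy_edges)
print([len(c) for c in components])  # prints [3, 2]
```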

In [9]:
# Drop whole graph
drop_graph('whole-graph')

### Undirected weighted relationships graph

Next, we are going to project an undirected weighted graph. Let's take a look at how the native projection handles undirected relationships:

* UNDIRECTED: each relationship is projected in both natural and reverse orientation

To produce an undirected relationship with the cypher projection, we project a single relationship in both directions, effectively allowing the graph data science engine to traverse the relationship in both directions. Let's take a look at the following example to gain a better understanding. We create two nodes with a single relationship between them.

<pre>
CREATE (:Test)-[:REL]->(:Test);
</pre>

To project the relationship in both directions, we only have to omit the direction of the relationship in our <code>MATCH</code> statement and that's it. A tiny, but very important detail!

<pre>
MATCH (n:Test)-[r]-(m:Test)
RETURN id(n) as source, id(m) as target;
</pre>

Results

| source | target |
|--------|--------|
| 1565   | 1566   |
| 1566   | 1565   |
We will also demonstrate a convenient way to project multiple node labels using a UNION statement in the node query.

In [17]:
undirected_interacts_query = """
CALL gds.graph.project.cypher('undirected_interactions', 
    // nodeQuery 
    'MATCH (n:Person) RETURN id(n) AS id 
     UNION MATCH (n:Thing) 
     RETURN id(n) as id', 
    // relationshipQuery (notice no direction on relationship) 
    'MATCH (n)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3]-(m) 
     RETURN id(n) AS source, id(m) AS target, type(r) as type, r.weight as weight',
     {validateRelationships:false})
"""
run_query(undirected_interacts_query)

### Random walk algorithm

To stray away from the common graph algorithms like PageRank or Louvain, let's use the Random Walk algorithm. In essence, it mimics how a drunk person would traverse our graph. It is commonly used in the node2vec algorithm. We define Frodo as the start node and run two random walks, each five nodes long.

In [18]:
random_walk = """

MATCH (n:Node{Label:'Frodo'})
CALL gds.beta.randomWalk.stream('undirected_interactions',
    {sourceNodes:[n], walkLength:5, walksPerNode:2}) 
YIELD nodeIds 
RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).Label] as result

"""

read_query(random_walk)

|   | result |
|---|--------|
| 0 | [Frodo, Wormtongue, Eorl, Aragorn, Arathorn] |
| 1 | [Frodo, Sam, Bombadil, Bill, Aragorn] |


You will get different results as it is a random walk algorithm after all.
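
The mechanics are easy to sketch in plain Python. The version below is a simplified, unbiased walk rather than the GDS implementation. I treat `walk_length` as the number of nodes in a walk, which matches the five-element lists above. The adjacency map is a small made-up example, not the real dataset.

```python
import random


def random_walks(adjacency, start, walk_length, walks_per_node, seed=None):
    """Return `walks_per_node` walks of up to `walk_length` nodes from `start`."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        walk = [start]
        while len(walk) < walk_length:
            neighbours = adjacency.get(walk[-1], [])
            if not neighbours:
                break  # dead end: stop this walk early
            walk.append(rng.choice(neighbours))
        walks.append(walk)
    return walks


# Hypothetical undirected toy graph (each edge listed in both directions)
toy_adjacency = {
    "Frodo": ["Sam", "Gandalf"],
    "Sam": ["Frodo", "Gandalf"],
    "Gandalf": ["Frodo", "Sam", "Aragorn"],
    "Aragorn": ["Gandalf"],
}
walks = random_walks(toy_adjacency, "Frodo", walk_length=5, walks_per_node=2)
for walk in walks:
    print(walk)
```

Passing a fixed `seed` makes the walks reproducible, which can be handy when comparing runs.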

### Triangle count and clustering coefficient

Another useful algorithm for analyzing social networks is the Triangle Count and Clustering Coefficient algorithm. A triangle consists of three nodes, where each node has a relationship to the other two. The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. This allows us to estimate how tightly-knit nodes in our graph are.
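
To make the two definitions concrete, here is a minimal plain-Python sketch on a made-up four-node graph. For each node it counts the pairs of neighbours that are themselves connected (the triangles) and divides by the number of possible neighbour pairs, which is the usual local clustering coefficient. It is not the GDS implementation.

```python
from itertools import combinations


def triangles_and_clustering(adjacency):
    """Map each node to (triangle count, local clustering coefficient)."""
    results = {}
    for node, neighbours in adjacency.items():
        # A triangle through `node` is a pair of its neighbours that are connected.
        triangles = sum(
            1 for a, b in combinations(sorted(neighbours), 2) if b in adjacency[a]
        )
        degree = len(neighbours)
        possible = degree * (degree - 1) / 2
        coefficient = triangles / possible if possible else 0.0
        results[node] = (triangles, coefficient)
    return results


# Hypothetical undirected toy graph: a triangle A-B-C with D attached to A
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
stats = triangles_and_clustering(toy)
average = sum(c for _, c in stats.values()) / len(stats)
print(stats["A"], stats["B"], stats["D"])
```

Here B and C sit in a closed triangle and get a coefficient of 1.0, while A gets 1/3 because only one of its three neighbour pairs is connected.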

In [20]:
triangle_count = """

CALL gds.triangleCount.write('undirected_interactions', 
    {relationshipTypes:['INTERACTS_1'], writeProperty:'triangles'}) 

"""

read_query(triangle_count)

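|   | writeMillis | nodePropertiesWritten | globalTriangleCount | nodeCount | postProcessingMillis | preProcessingMillis | computeMillis | configuration |
|---|-------------|-----------------------|---------------------|-----------|----------------------|---------------------|---------------|---------------|
| 0 | 14 | 44 | 935 | 44 | 0 | 0 | 20 | {'writeConcurrency': 4, 'writeProperty': 'tria... |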


In [25]:
cc = """

CALL gds.localClusteringCoefficient.write('undirected_interactions', 
    {relationshipTypes:['INTERACTS_1'], writeProperty:'clustering'}) 

"""

read_query(cc)

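|   | writeMillis | nodePropertiesWritten | averageClusteringCoefficient | nodeCount | postProcessingMillis | preProcessingMillis | computeMillis | configuration |
|---|-------------|-----------------------|------------------------------|-----------|----------------------|---------------------|---------------|---------------|
| 0 | 6 | 44 | 0.622108 | 44 | 0 | 0 | 24 | {'writeConcurrency': 4, 'triangleCountProperty... |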


The global or average clustering coefficient is 0.62, which means that the persons in our graph are quite tightly-knit. We can also look at individuals and their local clustering coefficients.

In [24]:
local_clustering = """

MATCH (p:Person)
RETURN p.Label as person,
       p.clustering as coefficient,
       p.triangles as triangles 
ORDER BY coefficient DESC LIMIT 5

"""

read_query(local_clustering)

|   | person    | coefficient | triangles |
|---|-----------|-------------|-----------|
| 0 | Gildor    | 1.0         | 15        |
| 1 | Arwen     | 1.0         | 6         |
| 2 | Bill      | 1.0         | 28        |
| 3 | Goldberry | 1.0         | 3         |
| 4 | Shadowfax | 1.0         | 3         |


In [26]:
drop_graph('undirected_interactions')

### Categorical PageRank

Up until this point, all of the graph analysis above could also have been done with the native projection. Cypher projection becomes useful when we want to project graphs that exist only at query time, or when we need more advanced filtering than node labels and relationship types allow.

Categorical PageRank is a concept first introduced by Kenny Bastani in his blog post. I have also written a blog post about it using the Graph algorithms library. Now it is time to demonstrate it with the Graph Data Science library as well.

The idea behind it is pretty simple. Suppose we have a graph of pages that link to each other and may also belong to one or more categories. To better understand the global PageRank score of nodes in a network, we can break our graph down into several subgraphs, one for each category, and execute the PageRank algorithm on each of those subgraphs. We store the results as a relationship property between the category and its pages. This way we can see which categories contribute to a page's global PageRank score.

We will start by assuming that each interaction is a positive endorsement (I know it's actually not, but let's pretend). We will break our graph down into several subgraphs by the class (men, elves, and so on) that the characters belong to. For example, when calculating PageRank for the category of men, all nodes will be considered, but only relationships that originate from characters belonging to the class of men will be considered.
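
The filtering rule can be sketched in plain Python: keep every node, keep only the relationships whose source belongs to the chosen class, then run a weighted PageRank on what is left. The power iteration below uses the un-normalized form `(1 - d) + d * sum(...)`, which is my assumption and not a reproduction of the GDS implementation (tolerance, scaling, and other details are left out). The toy nodes, classes, and weights are invented.

```python
from collections import defaultdict


def weighted_pagerank(nodes, edges, damping=0.85, iterations=20):
    """Power iteration over (source, target, weight) edges, un-normalized form."""
    out_weight = defaultdict(float)
    for source, _, weight in edges:
        out_weight[source] += weight
    scores = {node: 1 - damping for node in nodes}
    for _ in range(iterations):
        incoming = defaultdict(float)
        for source, target, weight in edges:
            # Each source shares its score in proportion to the edge weight.
            incoming[target] += scores[source] * weight / out_weight[source]
        scores = {node: (1 - damping) + damping * incoming[node] for node in nodes}
    return scores


def categorical_pagerank(nodes, edges, node_class, category):
    """Keep all nodes, but only edges whose source belongs to `category`."""
    filtered = [edge for edge in edges if node_class[edge[0]] == category]
    return weighted_pagerank(nodes, filtered)


# Hypothetical toy data, not taken from the LotR dataset
toy_nodes = ["a", "b", "c"]
toy_class = {"a": "men", "b": "elves", "c": "men"}
toy_edges = [("a", "b", 2.0), ("c", "b", 1.0), ("b", "a", 1.0)]
men_scores = categorical_pagerank(toy_nodes, toy_edges, toy_class, "men")
print(men_scores)
```

In this toy example the relationship from `b` to `a` is dropped because `b` is not in the class of men, so `a` keeps only the baseline score.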

In [33]:
categorical_graph = """

CALL gds.graph.project.cypher( 'categorical_men', 
    // nodeQuery 
    'MATCH (n:Person) RETURN id(n) AS id 
     UNION 
    MATCH (n:Thing) RETURN id(n) as id',
    // relationshipQuery 
    'MATCH (c:Class)<-[:PART_OF]-(n)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3]-(m) 
    // Use the parameter
    WHERE c.id = $class 
    RETURN id(n) AS source, id(m) AS target, r.weight as weight,type(r) as type', 
    {parameters: { class: 'men' }, validateRelationships:false})

"""

run_query(categorical_graph)

Let us now run the weighted PageRank on this graph.

In [34]:
weighted_pagerank = """

CALL gds.pageRank.stream('categorical_men',
    {relationshipWeightProperty:'weight'})
YIELD nodeId, score 
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC LIMIT 5

"""

read_query(weighted_pagerank)

|   | name    | score    |
|---|---------|----------|
| 0 | Gandalf | 0.490396 |
| 1 | Aragorn | 0.472205 |
| 2 | Frodo   | 0.321637 |
| 3 | Pippin  | 0.311533 |
| 4 | Théoden | 0.307829 |


Gandalf comes out on top, with Aragorn following very closely in second place. I am wondering if Aragorn has the most support from men in the third book, as he becomes their king at the end.

In [36]:
weighted_pagerank_third_book = """

CALL gds.pageRank.stream('categorical_men', 
    {relationshipTypes:['INTERACTS_3'],
     relationshipWeightProperty:'weight'})
YIELD nodeId, score 
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC LIMIT 5

"""

read_query(weighted_pagerank_third_book)

|   | name    | score    |
|---|---------|----------|
| 0 | Aragorn | 0.517452 |
| 1 | Gandalf | 0.462588 |
| 2 | Pippin  | 0.42312  |
| 3 | Faramir | 0.397746 |
| 4 | Éomer   | 0.367025 |


As predicted, Aragorn takes the lead. Frodo is no longer on the list as he is quite isolated from everybody in the third book and walks alone with Sam to Mount Doom. To be honest, if you have Sam by your side, you are never lonely though.

In [37]:
drop_graph('categorical_men')

### Virtual categorical graph

We have only looked at the class of men and calculated the categorical pagerank for that specific subgraph. It would be very time consuming if we projected a subgraph for each class separately. That is why we will project a subgraph for each class in a single named graph using virtual relationship types. In the relationship query we have the option to return whatever we feel like as the column type. We will return the original relationship type combined with the class of the person to create virtual relationship types. This will allow us to calculate the categorical pagerank for each class.
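
The naming scheme is simply the original relationship type with the class id appended. A small helper, whose name is my own, can generate the list of virtual types for a given class. It assumes the three books are numbered 1 to 3, as in the import step.

```python
def virtual_relationship_types(class_id, books=("1", "2", "3")):
    """Build the virtual relationship type names for one class."""
    return ["INTERACTS_" + book + class_id for book in books]


print(virtual_relationship_types("elves"))
# prints ['INTERACTS_1elves', 'INTERACTS_2elves', 'INTERACTS_3elves']
```

A list like this is what goes into the `relationshipTypes` configuration when running an algorithm on a single class.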

In [41]:
virtual_graph = """

CALL gds.graph.project.cypher('categorical_virtual',
    'MATCH (n:Person) RETURN id(n) AS id 
     UNION MATCH (n:Thing) 
     RETURN id(n) as id', 
    'MATCH (c:Class)<-[:PART_OF]-(n)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3]-(m) 
     RETURN id(n) AS source, id(m) AS target, type(r) + c.id as type, r.weight as weight',
     {validateRelationships:false})

"""

run_query(virtual_graph)

We can now calculate the categorical pagerank for the class of elves.

In [42]:
categorical_elves = """

CALL gds.pageRank.stream('categorical_virtual',
    {relationshipTypes:['INTERACTS_1elves','INTERACTS_2elves','INTERACTS_3elves'],
     relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Label as name, score
ORDER BY score DESC LIMIT 5

"""

read_query(categorical_elves)

|   | name      | score    |
|---|-----------|----------|
| 0 | Frodo     | 0.353865 |
| 1 | Aragorn   | 0.316407 |
| 2 | Gandalf   | 0.25172  |
| 3 | Elrond    | 0.250256 |
| 4 | Galadriel | 0.240393 |


If we want to calculate categorical pagerank for each class and store the results, we can use the same approach as in the original blog post, where we store the results in the relationship between a class and a person.

In [46]:
store_categorical = """

MATCH (c:Class)
WHERE NOT c.id IN ['pla', 'mixed', 'orcs']
CALL gds.pageRank.stream('categorical_virtual',
    {relationshipTypes:['INTERACTS_1'+c.id,'INTERACTS_2'+c.id,'INTERACTS_3'+c.id],
     relationshipWeightProperty:'weight'})
YIELD nodeId, score 
WITH c, gds.util.asNode(nodeId) as node, score 
WHERE score > 0.151 
CREATE (c)-[:PAGERANK{score:score}]->(node)

"""

run_query(store_categorical)

We can now get the top three members for each class based on their categorical pagerank score.

In [47]:
top3_by_category = """

MATCH (c:Class)-[s:PAGERANK]->(p:Person)
WITH c, p, s.score as pagerank 
ORDER BY pagerank DESC 
RETURN c.id as class,
       collect(p.Label)[..3] as top_3_members

"""

read_query(top3_by_category)

|   | class  | top_3_members              |
|---|--------|----------------------------|
| 0 | hobbit | [Frodo, Sam, Gandalf]      |
| 1 | men    | [Gandalf, Aragorn, Frodo]  |
| 2 | elves  | [Frodo, Aragorn, Gandalf]  |
| 3 | dwarf  | [Gandalf, Gimli, Legolas]  |
| 4 | ainur  | [Frodo, Bombadil, Gandalf] |
| 5 | animal | [Sam, Gandalf, Frodo]      |
| 6 | thing  | [Frodo, Gandalf, Sam]      |
| 7 | ents   | [Gandalf, Merry, Pippin]   |
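
The "order first, then collect and slice" pattern used in the last query can be mirrored in plain Python. The sketch below sorts rows by score and keeps the first three members seen for each class. The rows are invented placeholders, not scores from the dataset.

```python
from collections import defaultdict


def top_members(rows, limit=3):
    """Return the `limit` highest-scoring members per class."""
    grouped = defaultdict(list)
    # Sorting first means each class list fills up in descending score order.
    for class_id, member, score in sorted(rows, key=lambda r: r[2], reverse=True):
        if len(grouped[class_id]) < limit:
            grouped[class_id].append(member)
    return dict(grouped)


# Hypothetical (class, member, score) rows
rows = [
    ("x", "m1", 0.9),
    ("x", "m2", 0.5),
    ("x", "m3", 0.7),
    ("x", "m4", 0.2),
    ("y", "m5", 0.4),
]
top = top_members(rows)
print(top)  # prints {'x': ['m1', 'm3', 'm2'], 'y': ['m5']}
```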
