<a href="https://colab.research.google.com/github/NathVM/GA/blob/main/Neo4JGraph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measuring performance of Graph Analytics Algorithms using Neo4j graphs

---



Imports:

---



In [None]:
!pip install py2neo
!pip install neo4j
!pip install graphdatascience

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting py2neo
  Downloading py2neo-2021.2.3-py2.py3-none-any.whl (177 kB)
Collecting monotonic
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting pansi>=2020.7.3
  Downloading pansi-2020.7.3-py2.py3-none-any.whl (10 kB)
Collecting interchange~=2021.0.4
  Downloading interchange-2021.0.4-py2.py3-none-any.whl (28 kB)
Installing collected packages: monotonic, pansi, interchange, py2neo
Successfully installed interchange-2021.0.4 monotonic-1.6 pansi-2020.7.3 py2neo-2021.2.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting neo4j
  Downloading neo4j-5.7.0.tar.gz (176 kB)

In [None]:
import timeit  # used by the timing cells further down

import pandas as pd
from py2neo import Graph, Node, Relationship
from neo4j import GraphDatabase
from google.colab import drive
from graphdatascience import GraphDataScience

Setup:

---



In [None]:
drive.mount('/content/drive')
!cp -r /content/drive/MyDrive/Dataset/share/GA/nj/ /content/
!sed -i '/#dbms.security.auth_enabled/s/^#//g' nj/conf/neo4j.conf
!chmod -R 777 nj
!nj/bin/neo4j start

Mounted at /content/drive
Directories in use:
home:         /content/nj
config:       /content/nj/conf
logs:         /content/nj/logs
plugins:      /content/nj/plugins
import:       /content/nj/import
data:         /content/nj/data
certificates: /content/nj/certificates
licenses:     /content/nj/licenses
run:          /content/nj/run
Starting Neo4j.
Started neo4j (pid:1444). It is available at http://localhost:7474
There may be a short delay until the server is ready.


Neo4j connection:

---



In [None]:
graph = Graph("bolt://localhost:7687")
driver = GraphDatabase.driver("bolt://localhost:7687")

Dataset: 

https://networkrepository.com/TWITTER-Real-Graph-Partial.php

The dataset is also shared via Google Drive (folder link in the cell below).

In [None]:
%%script echo skipping
# Comment out the line above to execute this cell (a cell magic must be the first line of the cell).
# Loading the dataset is only needed for graph creation.
# Map the shared folder
# https://drive.google.com/drive/folders/113gZK1io1MZGogAULYoBdrlEUHyJcxRh?usp=sharing
# to your Google Drive and modify the file path accordingly.
file = "/content/drive/MyDrive/Dataset/share/GA/TWITTER-Real-Graph-Partial.edges"
df = pd.read_csv(file)
df.rename(columns = {'1':'source', '2':'target'}, inplace = True)
print(df.head(5))
dft = df

   source  target
0       2       1
1       3       4
2       4       3
3       3       2
4       2       3


Create Graph :

In [None]:
%%script echo skipping
# Comment out the line above to execute this cell (a cell magic must be the first line of the cell).
# Loading the dataset is only needed for graph creation.
# The database is loaded directly from Drive for execution, so there is no need to run this code.
query = """
WITH $rows AS rows
UNWIND rows AS row
MERGE (source:Node {id: row.source})
MERGE (target:Node {id: row.target})
MERGE (source)-[:CONNECTS_TO]-(target)
"""

# set batch size and index properties
batch_size = 1000
index_properties = ['id']

# create indexes on node properties
with driver.session() as session:
    for property_name in index_properties:
        session.run(f"CREATE INDEX ON :Node({property_name})")

# execute the query in batch transactions
with driver.session() as session:
    for i in range(0, len(dft), batch_size):
        batch = dft[i:i+batch_size].to_dict('records')
        session.run(query, rows=batch)

Path Analytics: 

---



In [None]:
start_time = timeit.default_timer()

query = """
MATCH (source:Node {id: 357908})
MATCH (destination:Node)
WHERE source <> destination
MATCH path = allshortestPaths((source)-[:CONNECTS_TO*]-(destination))
WITH source, destination, reduce(distance = 0, r in relationships(path) | distance + 1) AS distance, nodes(path) AS nodes
RETURN source.id, destination.id, distance, nodes, COLLECT( DISTINCT nodes)
"""
result = graph.run(query)

elapsed = timeit.default_timer() - start_time
print("Time taken for allshortestPaths = ", elapsed, "seconds")

for record in result:
    nodes = record["nodes"]
    print([node["id"] for node in nodes])

Time taken for allshortestPaths =  9.603542574999665 seconds
[357908, 357909]
[357908, 357911]
[357908, 357910]


Time taken for SSSP =  9.603542574999665 seconds
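To make the single-source shortest path (SSSP) measurement concrete, here is a minimal pure-Python sketch of what the query computes on an unweighted graph: hop distances from one source via breadth-first search. The edge list is a hypothetical toy example, not data from the Twitter graph, and this is not how Neo4j evaluates `allShortestPaths` internally.

```python
from collections import deque


def bfs_distances(edges, source):
    """Hop distance from source to every reachable node (edges treated as undirected)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical toy graph with two components: 1-2-3-4 and 5-6
toy_edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(bfs_distances(toy_edges, 1))
```

Nodes outside the source's component never appear in the result. The notebook output lists only three paths from node 357908, each one hop long, which would be consistent with that node sitting in a small component.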

In [None]:
gds = GraphDataScience("bolt://localhost:7687")
print(gds.version())
assert gds.version()

2.3.0


In [None]:
query = """
CALL gds.graph.drop('full_graph')
""" 

graph.run(query)

graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema,schemaWithOrientation
full_graph,neo4j,,-1,580768,1435110,"{relationshipQuery: 'MATCH (n)-[:CONNECTS_TO]-(m) RETURN id(n) AS source, 1 AS weight, id(m) AS target LIMIT 1435110', jobId: 'f1996b14-be07-480f-96ec-e7881b138d2b', creationTime: datetime('2023-04-05T17:58:49.872059000+00:00'), validateRelationships: true, nodeQuery: 'MATCH (n) RETURN id(n) AS id', sudo: true, readConcurrency: 4, parameters: []}",4.254814009404595e-06,datetime('2023-04-05T17:58:49.872059000+00:00'),datetime('2023-04-05T17:58:54.693821000+00:00'),"{graphProperties: {}, relationships: {__ALL__: {weight: 'Float (DefaultValue(NaN), TRANSIENT, Aggregation.NONE)'}}, nodes: {__ALL__: {}}}","{graphProperties: {}, relationships: {__ALL__: {properties: {weight: 'Float (DefaultValue(NaN), TRANSIENT, Aggregation.NONE)'}, direction: 'DIRECTED'}}, nodes: {__ALL__: {}}}"
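The cell that projected `full_graph` is not part of this notebook. The configuration echoed by the drop call above (`nodeQuery`, `relationshipQuery`) suggests a Cypher projection roughly like the sketch below. The helper name `project_with_cypher` and its `run_query` parameter are hypothetical; the two query strings are copied from the echoed configuration, and the call assumes the `gds.graph.project.cypher` procedure available in GDS 2.3.

```python
def project_with_cypher(run_query, graph_name="full_graph"):
    """Project an in-memory graph using the queries echoed in the drop output.

    run_query is any callable with the shape of py2neo's graph.run(query, **params).
    """
    query = """
    CALL gds.graph.project.cypher(
      $graph_name,
      'MATCH (n) RETURN id(n) AS id',
      'MATCH (n)-[:CONNECTS_TO]-(m) RETURN id(n) AS source, 1 AS weight, id(m) AS target LIMIT 1435110'
    )
    YIELD graphName, nodeCount, relationshipCount
    RETURN graphName, nodeCount, relationshipCount
    """
    return run_query(query, graph_name=graph_name)


# Dry run with a stand-in for graph.run, so the sketch executes without a server.
captured = []
project_with_cypher(lambda query, **params: captured.append((query, params)))
print(captured[0][1])
```

Against the live database the call would be `project_with_cypher(graph.run)`. The `LIMIT` value is taken from the echoed configuration as-is; the listing below reports a slightly different relationship count, so the graph in use may have been projected with a different limit.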


In [None]:
query = """
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
ORDER BY graphName ASC
""" 

result = graph.run(query)
print(result)

 graphName  | nodeCount | relationshipCount 
------------|-----------|-------------------
 full_graph |    580768 |           1435116 



Centrality Analytics :

---



In [None]:
start_time = timeit.default_timer()
query = """
CALL gds.beta.closeness.stream('full_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS id, score
ORDER BY score DESC
""" 

result = graph.run(query)
elapsed = timeit.default_timer() - start_time
print("Time taken for gds.beta.closeness.stream = ", elapsed, "seconds")
print(result)

Time taken for gds.beta.closeness.stream =  75.23033977599994 seconds
 id | score 
----|-------
  7 |   1.0 
  8 |   1.0 
  9 |   1.0 



Time taken for closeness =  75.23033977599994 seconds
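The top scores above are all 1.0. One plausible reading is sketched below, assuming the score is the number of reachable nodes divided by the sum of hop distances to them (an assumption about the normalisation; check the GDS documentation for the exact formula). Under that definition, any node whose reachable nodes are all direct neighbours scores exactly 1.0. The toy edges are hypothetical.

```python
from collections import deque


def closeness(edges):
    """Closeness per node: reachable node count / sum of hop distances (undirected)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    scores = {}
    for source in adjacency:
        # Breadth-first search from this source
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        reachable = len(distance) - 1  # exclude the source itself
        scores[source] = reachable / total if total else 0.0
    return scores


# Hypothetical toy graph: an isolated pair a-b and a path c-d-e
toy_edges = [("a", "b"), ("c", "d"), ("d", "e")]
scores = closeness(toy_edges)
print(scores)
```

Here `a`, `b`, and the path's middle node `d` all score 1.0, while the path's end nodes score 2/3. In a graph with many small components, a score of 1.0 therefore says little about global importance.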

In [None]:
start_time = timeit.default_timer()
query = """
CALL gds.degree.stream('full_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS id, score AS connections
ORDER BY connections DESC, id DESC
"""

result = graph.run(query)
elapsed = timeit.default_timer() - start_time
print("Time taken for gds.degree.stream = ", elapsed, "seconds")
print(result)

Time taken for gds.degree.stream =  12.989420579999205 seconds
     id | connections 
--------|-------------
 471948 |        12.0 
 415947 |        12.0 
 380973 |        12.0 



Time taken for degree centrality =  12.989420579999205 seconds

In [None]:
start_time = timeit.default_timer()
query = """
CALL gds.betweenness.stream('full_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS id, score
ORDER BY score DESC
"""

result = graph.run(query)
elapsed = timeit.default_timer() - start_time
print("Time taken for gds.betweenness.stream = ", elapsed, "seconds")

Time taken for gds.betweenness.stream =  2066.7040606800006 seconds


Time taken for betweenness centrality =  2066.7040606800006 seconds
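Betweenness is by far the slowest measurement in this notebook. A minimal sketch of Brandes' algorithm for an unweighted, undirected graph shows why exact betweenness is expensive: it runs one full breadth-first search per node. This is an illustration on a hypothetical toy graph, not the GDS implementation.

```python
from collections import deque


def betweenness(edges):
    """Exact betweenness via Brandes' algorithm (unweighted, undirected)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    scores = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        # Forward phase: BFS that counts shortest paths (sigma) to each node.
        order = []
        predecessors = {node: [] for node in adjacency}
        sigma = dict.fromkeys(adjacency, 0)
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    sigma[neighbour] += sigma[node]
                    predecessors[neighbour].append(node)

        # Backward phase: accumulate dependencies from the farthest nodes inwards.
        delta = dict.fromkeys(adjacency, 0.0)
        for node in reversed(order):
            for pred in predecessors[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1 + delta[node])
            if node != source:
                scores[node] += delta[node]

    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical toy path graph a-b-c-d
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
scores = betweenness(toy_edges)
print(scores)
```

With one traversal per node, the cost grows with the product of node and relationship counts, which fits the gap between this timing and the others. If approximate scores are acceptable, GDS documents a `samplingSize` configuration option for betweenness that restricts the computation to a sample of source nodes.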


Community Analytics :

---



In [None]:
query = """
CALL gds.louvain.stats('full_graph')
YIELD communityCount
""" 

result = graph.run(query)
print(result)

 communityCount 
----------------
         147179 



In [None]:
start_time = timeit.default_timer()
query = """
CALL gds.louvain.stream('full_graph')
YIELD nodeId, communityId
""" 

result = graph.run(query)
elapsed = timeit.default_timer() - start_time
print("Time taken for gds.louvain.stream = ", elapsed, "seconds")
print(result)

Time taken for gds.louvain.stream =  17.137901083999964 seconds
 nodeId | communityId 
--------|-------------
      0 |           1 
      1 |           1 
      2 |           1 



Time taken for louvain community detection =  17.137901083999964 seconds

No. of communities identified =  147179
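With 580768 nodes and 147179 communities, the average community holds about 3.9 nodes, so the size distribution is worth inspecting. The sketch below shows one way to summarise `(nodeId, communityId)` rows such as those streamed above; the helper name and the toy rows are hypothetical.

```python
from collections import Counter


def community_sizes(rows):
    """Map each communityId to the number of nodes assigned to it."""
    return Counter(community_id for _node_id, community_id in rows)


# Hypothetical rows shaped like the (nodeId, communityId) stream output
toy_rows = [(0, 1), (1, 1), (2, 1), (3, 7), (4, 7), (5, 9)]
sizes = community_sizes(toy_rows)
print("communities:", len(sizes))
print("largest:", sizes.most_common(1))
```

With py2neo, the rows could be built from the cursor, for example `rows = [(r["nodeId"], r["communityId"]) for r in graph.run(query)]`, assuming the query ends with `RETURN nodeId, communityId`.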

It is **not possible** to project graphs in **UNDIRECTED** orientation when **Cypher projections** are used.

**Triangle count and clustering** algorithms require that the graph was loaded with UNDIRECTED orientation, so they cannot be used with a graph projected by a Cypher projection.

Ref: https://neo4j.com/docs/graph-data-science/current/management-ops/projections/graph-project-cypher/#_relationship_orientation
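One way around this limitation is a native projection, which accepts an `orientation` setting per relationship type. The sketch below assumes the `Node` label and `CONNECTS_TO` relationship type created earlier in this notebook; the helper name `project_undirected`, its `run_query` parameter, and the graph name `full_graph_undirected` are hypothetical, and this projection was not run or timed here.

```python
def project_undirected(run_query, graph_name="full_graph_undirected"):
    """Native projection of Node/CONNECTS_TO with UNDIRECTED orientation.

    run_query is any callable with the shape of py2neo's graph.run(query, **params).
    """
    query = """
    CALL gds.graph.project(
      $graph_name,
      'Node',
      {CONNECTS_TO: {orientation: 'UNDIRECTED'}}
    )
    YIELD graphName, nodeCount, relationshipCount
    RETURN graphName, nodeCount, relationshipCount
    """
    return run_query(query, graph_name=graph_name)


# Dry run with a stand-in for graph.run, so the sketch executes without a server.
captured = []
project_undirected(lambda query, **params: captured.append((query, params)))
print(captured[0][1])
```

Against the live database the call would be `project_undirected(graph.run)`. A graph projected this way should be accepted by procedures such as `gds.triangleCount.stream` and `gds.localClusteringCoefficient.stream`.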