# Running the node2vec Embedding

In this notebook we're going to generate graph embeddings using the node2vec algorithm. We'll then explore those embeddings using Python data science tools: t-SNE from scikit-learn to reduce them to two dimensions, and Altair to plot the result.

Let's start by importing some libraries:

In [1]:
from neo4j import GraphDatabase
from neo4j.exceptions import ClientError
from sklearn.manifold import TSNE

import numpy as np
import altair as alt
import pandas as pd
import os

Once we've done that we can initialise the Neo4j driver.

In [3]:
# Connection settings come from environment variables, falling back to local defaults.
bolt_url = os.getenv("NEO4J_BOLT_URL", "bolt://localhost")
user = os.getenv("NEO4J_USER", "neo4j")
password = os.getenv("NEO4J_PASSWORD", "neo")
driver = GraphDatabase.driver(bolt_url, auth=(user, password))

We should have already imported the dataset. We can run the following query to check that the data has been imported:

In [4]:
result = {"label": [], "count": []}
with driver.session(database="eroads") as session:
    for row in session.run("CALL db.labels()"):
        label = row["label"]
        query = f"MATCH (:`{label}`) RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["label"].append(label)
        result["count"].append(count)
nodes_df = pd.DataFrame(data=result)
nodes_df = nodes_df.sort_values("count")  # keep the sorted frame; sort_values returns a copy

result = {"relType": [], "count": []}
with driver.session(database="eroads") as session:
    for row in session.run("CALL db.relationshipTypes()"):
        relationship_type = row["relationshipType"]
        query = f"MATCH ()-[:`{relationship_type}`]->() RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["relType"].append(relationship_type)
        result["count"].append(count)
rels_df = pd.DataFrame(data=result)
rels_df = rels_df.sort_values("count")  # keep the sorted frame; sort_values returns a copy

display(nodes_df)
display(rels_df)

| | label | count |
|---|---|---|
| 0 | Place | 894 |


| | relType | count |
|---|---|---|
| 0 | EROAD | 1250 |


We should have 894 `Place` nodes and 1,250 `EROAD` relationships.
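
The count queries above build Cypher by interpolating label and relationship type names between backticks. As a small defensive sketch, the helper below quotes a name before it goes into the query. It assumes Cypher's rule that a backtick inside a quoted identifier is written as two backticks; `quote_identifier` and `count_nodes_query` are hypothetical names introduced here, not part of the Neo4j driver.

```python
def quote_identifier(name):
    """Wrap a label or relationship type in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def count_nodes_query(label):
    """Build the node-count query used above for a single label (hypothetical helper)."""
    return f"MATCH (:{quote_identifier(label)}) RETURN count(*) AS count"


print(count_nodes_query("Place"))
```

For the single `Place` label in this dataset the result is the same query as before; the quoting only matters for names that themselves contain a backtick.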

Now let's run some embeddings. We're going to run the streaming version of the node2vec algorithm. We need to define the following config:

* `nodeProjection` - the node labels to use for our projected graph
* `relationshipProjection` - the relationship types to use for our projected graph
* `embeddingSize` - the number of dimensions in the vector created for each node
* `iterations` - the number of iterations to run
* `walkLength` - the number of steps in each random walk

Let's give it a try:

In [8]:
with driver.session(database="neo4j") as session:
    result = session.run("""
    CALL gds.alpha.node2vec.stream({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       iterations: 10,
       walkLength: 10
    })
    YIELD nodeId, embedding
    RETURN gds.util.asNode(nodeId).name AS place, embedding
    """)
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df.head()

| | embedding | place |
|---|---|---|
| 0 | [-1.0241117477416992, -0.09616520255804062, 3.... | Larne |
| 1 | [-1.19605553150177, -0.4890177547931671, 3.287... | Belfast |
| 2 | [-1.7883660793304443, -0.9587492942810059, 2.3... | Dublin |
| 3 | [-2.3579163551330566, -1.1036665439605713, 2.7... | Wexford |
| 4 | [-2.9394874572753906, -0.9884549975395203, 2.7... | Rosslare |
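
To build intuition for what `walkLength` controls, here is a minimal sketch of the kind of biased random walk that node2vec is built on. It is plain Python on a made-up five-node graph, not the GDS implementation. The function `node2vec_walk`, its `return_factor` and `in_out_factor` parameters, and the toy `graph` are all assumptions for illustration. In this sketch `walk_length` counts steps, so a walk visits `walk_length + 1` nodes.

```python
import random


def node2vec_walk(graph, start, walk_length, return_factor=1.0,
                  in_out_factor=1.0, rng=random):
    """Take one second-order biased random walk of `walk_length` steps from `start`."""
    walk = [start]
    while len(walk) < walk_length + 1:
        current = walk[-1]
        neighbours = sorted(graph[current])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step: there is no previous node yet, so choose uniformly.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / return_factor)   # step back to where we came from
            elif candidate in graph[previous]:
                weights.append(1.0)                   # stay close to the previous node
            else:
                weights.append(1.0 / in_out_factor)   # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Made-up undirected toy graph (adjacency sets); not taken from the E-roads data.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

walk = node2vec_walk(graph, "A", walk_length=10, rng=random.Random(42))
print(walk)
```

In node2vec, walks collected this way are then fed to a skip-gram style model that learns one vector per node; that training step is omitted here.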


So far everything looks good. Let's now store the embeddings in Neo4j, by using the write version of the algorithm:

In [10]:
with driver.session(database="neo4j") as session:
    result = session.run("""
    CALL gds.alpha.node2vec.write({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       iterations: 10,
       walkLength: 10,
       writeProperty: $embeddingProperty
    })
    """, {"embeddingProperty": "embeddingNode2vec"})
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df    

| | computeMillis | configuration | createMillis | nodeCount | nodePropertiesWritten | writeMillis |
|---|---|---|---|---|---|---|
| 0 | 2405 | {'initialLearningRate': 0.025, 'writeConcurren... | 52 | 894 | 894 | 41 |

Now that the embeddings are stored in the `embeddingNode2vec` property, let's read them back for places in a handful of countries:


In [36]:
with driver.session(database="neo4j") as session:
    result = session.run("""
    MATCH (p:Place)-[:IN_COUNTRY]->(country)
    WHERE country.code IN $countries
    RETURN p.name AS place, p.embeddingNode2vec AS embedding, country.code AS country
    """, {"countries": ["E", "GB", "F", "TR", "I", "D", "GR"]})
    X = pd.DataFrame([dict(record) for record in result])
X.head()    

| | country | embedding | place |
|---|---|---|---|
| 0 | GB | [0.31507408618927, 2.386936902999878, -1.13895... | Larne |
| 1 | GB | [-0.1721758246421814, 2.413466453552246, -1.18... | Belfast |
| 2 | E | [0.1272381842136383, 2.110171318054199, -2.385... | La Coruña |
| 3 | E | [0.010006282478570938, 2.0815513134002686, -2.... | Pontevedra |
| 4 | E | [0.3130505383014679, -1.9364209175109863, -1.0... | Huelva |
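
Before reducing the vectors to two dimensions, it can be useful to compare a few of them directly. The sketch below computes cosine similarity, a common (but not the only) way to compare embedding vectors. The `cosine_similarity` function and the three short vectors in `toy` are invented for illustration; they are not values from the E-roads embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented 3-dimensional vectors standing in for real embeddings.
toy = {
    "near_1": [1.0, 2.0, 0.0],
    "near_2": [2.0, 4.0, 0.1],
    "far": [-1.0, 0.5, 3.0],
}

print(cosine_similarity(toy["near_1"], toy["near_2"]))
print(cosine_similarity(toy["near_1"], toy["far"]))
```

Assuming the dataframe `X` from the cell above, the same function could be applied to two rows, for example `cosine_similarity(X.embedding[0], X.embedding[1])`.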


In [39]:
# Reduce each embedding to 2 dimensions with t-SNE so that it can be plotted.
X_embedded = TSNE(n_components=2, random_state=6).fit_transform(np.array(list(X.embedding)))
# One row per place, with its country and its 2D t-SNE coordinates.
df = pd.DataFrame(data={
    "place": X.place,
    "country": X.country,
    "x": X_embedded[:, 0],
    "y": X_embedded[:, 1]
})
df.head()

| | place | country | x | y |
|---|---|---|---|---|
| 0 | Larne | GB | 23.597162 | -3.478853 |
| 1 | Belfast | GB | 23.132071 | -4.331254 |
| 2 | La Coruña | E | -6.959006 | 7.212301 |
| 3 | Pontevedra | E | -6.563524 | 7.505499 |
| 4 | Huelva | E | -11.583806 | 11.09434 |


We can then use the Altair visualization library to create a scatterplot of these coordinates:

In [41]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    tooltip=['place', 'country']
).properties(width=700, height=400)
chart.save('node2vec.json')
chart

Colouring the points by country makes it easier to see whether places in the same country end up close together:

In [42]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color='country',
    tooltip=['place', 'country']
).properties(width=700, height=400)
chart.save('node2vec-color.json')
chart