# GDS Sessions

A GDS session allows running GDS operations (algorithms, ML pipelines, etc.) on graphs sourced from another system.
Most notably, that other system is an AuraDB instance, identified by a Bolt URI.
The API does however allow projecting from a local DBMS instance as well, if it is compatible and correctly configured.
In general, there are four ways to use a GDS session:

1. AuraDB -> Aura GDS session
2. local DBMS -> Aura GDS session
3. AuraDB -> local GDS session
4. local DBMS -> local GDS session

Our focus here is to demonstrate the first of these options, which is the most interesting one.


In [None]:
# Just to begin, let's make sure we have the correct version of the GDS Python Client installed

from graphdatascience import __version__

assert __version__ == "1.9a1"

First we need to configure access to our AuraDB instance. Please fill in the instance id and password.

In [None]:
db_id = "YOUR_DATABASE_ID"
db_password = "YOUR_DATABASE_PASSWORD"

Now we connect to the AuraDB instance and run some preparations for the notebook.

In [None]:
import os

from neo4j import GraphDatabase
from graphdatascience.query_runner.aura_db_arrow_query_runner import AuraDbConnectionInfo

# We need to tell the GDS client that we are working with a dev environment.
# This is only necessary for this testing phase and will not need to be set by external users.
os.environ["AURA_ENV"] = "devstrawberryfield"

db_connection_info = AuraDbConnectionInfo(
    f"neo4j+s://{db_id}-{os.environ['AURA_ENV']}.databases.neo4j-dev.io", ("neo4j", db_password)
)
# start a standard Neo4j Python Driver to connect to the AuraDB instance
driver = GraphDatabase.driver(db_connection_info.uri, auth=db_connection_info.auth)

# try out our connection
with driver.session() as session:
    display(session.run("RETURN true AS success").to_df())
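
The URI format used above is specific to the development environment of this notebook. As a minimal sketch, the same string construction can be wrapped in a small helper. The function name `aura_bolt_uri`, its `domain` default, and the instance id `abc123` are hypothetical and introduced only for illustration; they are not part of the GDS Python Client.

```python
def aura_bolt_uri(db_id: str, aura_env: str, domain: str = "databases.neo4j-dev.io") -> str:
    """Build a Bolt URI of the form used in this notebook (hypothetical helper)."""
    if not db_id or not aura_env:
        raise ValueError("both db_id and aura_env must be non-empty")
    return f"neo4j+s://{db_id}-{aura_env}.{domain}"


# "abc123" is a placeholder instance id, not a real database.
example_uri = aura_bolt_uri("abc123", "devstrawberryfield")
print(example_uri)
```

Validating the inputs early gives a clearer error than a failed connection attempt against a malformed host name.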

Let's add some very basic data to our database. 
The content does not really matter for this notebook, feel free to replace it with more interesting data.

In [None]:
with driver.session() as session:
    session.run("CREATE CONSTRAINT users FOR (u:User) REQUIRE u.id IS NODE KEY")
    session.run(
        """
        UNWIND range(0, 999) AS i
        CREATE (:User {id: i, age: toInteger(rand() * 75)})
        """
    ).consume()
    session.run(
        """
        UNWIND range(1, 8000) AS i
        WITH toInteger(rand() * 1000) AS source, toInteger(rand() * 1000) AS target
        MATCH (s:User {id: source})
        MATCH (t:User {id: target})
        CREATE (s)-[:KNOWS {since: 2020 - (rand() * 100)}]->(t)
        """
    ).consume()

    print(f"Number of nodes: {session.run('MATCH () RETURN count(*)').single().value()}")
    print(f"Number of relationships: {session.run('MATCH ()-->() RETURN count(*)').single().value()}")
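
The relationship query above draws both endpoints with `toInteger(rand() * 1000)`, which yields ids from 0 to 999, the same range as the `User` ids created just before. The following is a rough pure-Python sketch of that sampling, not the Cypher execution itself; the function name `random_edges` and the fixed seed are assumptions made for illustration. It shows that self-loops and repeated pairs can occur in such data.

```python
import random
from typing import List, Tuple


def random_edges(node_count: int, edge_count: int, seed: int = 42) -> List[Tuple[int, int]]:
    """Sample (source, target) id pairs the way the Cypher query does (illustrative only)."""
    rng = random.Random(seed)
    edges = []
    for _ in range(edge_count):
        # rng.random() is in [0, 1), so int(...) is in [0, node_count - 1]
        source = int(rng.random() * node_count)
        target = int(rng.random() * node_count)
        edges.append((source, target))
    return edges


edges = random_edges(node_count=1000, edge_count=8000)
self_loops = sum(1 for source, target in edges if source == target)
repeated_pairs = len(edges) - len(set(edges))
print(f"edges: {len(edges)}, self-loops: {self_loops}, repeated pairs: {repeated_pairs}")
```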

# A new database component: Arrow Server

We have built a new piece of software into the Neo4j DBMS: an Arrow Server.
It is akin to the already existing Bolt and HTTP servers, but it has a very specific purpose: projecting graphs to a remote location, and receiving results to write back to the database.

With the Arrow Server comes one crucial new feature: an aggregating projection function.
This aggregating function is called `gds.graph.project.remote` and is very similar to Cypher projection v2 in standard GDS.
There are three key differences between them:

1. In AuraDB, the aggregating function does not take a graph name as a parameter.
2. In AuraDB, the aggregating function does not project the graph to the local instance.
3. The aggregating function should only be called through the GDS Python Client.

The aggregating function is used in user-authored queries that look almost identical to Cypher projection v2 queries.

There is another function that comes with the Arrow Server, which is internal and undocumented, but callable: `internal.arrow.status`.
It is used as a crucial part of the GDS Python Client functionality for managing the AuraDB - GDS connection.

In [None]:
# Let's call this function and see what it returns
with driver.session() as session:
    display(session.run("CALL internal.arrow.status").to_df())

# Aura API and GDS Python Client

Apart from the extension to AuraDB, we have also added a new API to the GDS Python Client.
This API is a Python frontend to the Aura API, as well as a set of internal management features for the AuraDB - GDS connection.
In order to use the Aura API, the user needs to have Aura API credentials.
These are generated in the Aura Console (under `Account settings`) and are a pair of strings: `CLIENT_ID` and `CLIENT_SECRET`.

Using these credentials the full set of features offered by the GDS Python Client can be used.
In particular, the features are:

- Creating and connecting to GDS sessions
- Listing all existing GDS sessions
- Disconnecting GDS sessions

We will illustrate what this looks like below.

## Tenants

If the user is a member of multiple tenants, then they also need to enter their tenant id, in order to disambiguate which tenant they want to use.
In this notebook, we will use only a single tenant and omit the tenant id. 


In [None]:
from graphdatascience.gds_session.gds_session import AuraAPICredentials

# Initialise Aura API credentials

aura_api_credentials = AuraAPICredentials(
    client_id="YOUR_AURA_API_CLIENT_ID", client_secret="YOUR_AURA_API_CLIENT_SECRET"
)
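
Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, the two strings could instead be read from environment variables. The variable names `AURA_API_CLIENT_ID` and `AURA_API_CLIENT_SECRET` and the helper `read_aura_api_credentials` are assumptions introduced here; the GDS Python Client does not define them.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names; pick whatever fits your setup.
REQUIRED_KEYS = ("AURA_API_CLIENT_ID", "AURA_API_CLIENT_SECRET")


def read_aura_api_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (client_id, client_secret) from a mapping, failing loudly if either is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return env["AURA_API_CLIENT_ID"], env["AURA_API_CLIENT_SECRET"]


# Demonstration with an explicit mapping instead of the real environment.
client_id, client_secret = read_aura_api_credentials(
    {"AURA_API_CLIENT_ID": "example-id", "AURA_API_CLIENT_SECRET": "example-secret"}
)
print(client_id)
```

The returned pair can then be passed as `client_id` and `client_secret` to `AuraAPICredentials`, as in the cell above.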

# The GDS session

A key new concept is the GDS session.
This takes the place of an AuraDS instance.
(In fact, it is exactly an AuraDS instance at this time, but we don't want to expose that to the user.
They should think of it as a GDS session and a separate thing, as much as possible.)
The GDS session offers all* the GDS functionality that we are familiar with from AuraDS.
However, since the idea is to offload database work to AuraDB, the GDS session is not to be considered a database instance.

That means that all projections will go from AuraDB to GDS session, not from a co-located database.
Similarly, writing back will follow the same path back to AuraDB, and not to a co-located database.

`*` Some limitations apply, which are related to database operations.
These are not available, nor should they be.
That includes native projections, Cypher projections, looking up nodes by id, and similar.

## Implementation limitation

As mentioned in the parenthesis above, we do make use of existing AuraDS infrastructure to host the GDS sessions.
Because of this, there actually is a co-located database, but we try not to expose its Bolt URI, to prevent users from adding data to that database.

In [None]:
from graphdatascience.gds_session.gds_session import GdsSessions, DbmsConnectionInfo

sessions = GdsSessions(
    # here we specify the coordinates for the database connection
    # it can be sourced from anywhere, as long as it is compatible and correctly configured (5.14 or later, with Arrow Server enabled)
    db_connection=DbmsConnectionInfo(
        uri=db_connection_info.uri, username=db_connection_info.auth[0], password=db_connection_info.auth[1]
    ),
    # here we specify the coordinates for the GDS connection
    # either it is self-managed, and then we assume it is a DBMS+plugin, and require DB connection info
    # or, as illustrated here, it is managed through Aura, and then we need Aura API credentials
    ds_connection=aura_api_credentials,
)

#### Listing sessions

A user can list their running sessions.
By default no session is running:

In [None]:
sessions.list_sessions()

#### Creating a new session

A user can connect to a GDS session by calling `sessions.connect`.
A session is identified by a name.
If the session already exists, it will be connected to.
If it does not exist, it will be created.

An instance size can be provided. 
Possible values are `8GB`, `16GB`, `24GB` (`32GB`, `48GB`, `64GB`, `96GB` are not available in the testing environment).
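
The connect-or-create behaviour described above can be sketched with an in-memory stand-in. `InMemorySessions` is a hypothetical class written for illustration; it is not the real `GdsSessions` implementation and never talks to Aura. In particular, returning the existing session unchanged when a different size is requested is an assumption of this sketch.

```python
from typing import Dict, List

# Sizes the text names as available in the testing environment.
AVAILABLE_SIZES = ("8GB", "16GB", "24GB")


class InMemorySessions:
    """Illustrative stand-in that mimics connect-or-create semantics."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}  # session name -> instance size

    def connect(self, name: str, size: str) -> str:
        # An existing session is reused (assumption: the requested size is ignored).
        if name in self._sessions:
            return name
        if size not in AVAILABLE_SIZES:
            raise ValueError(f"unsupported size {size!r}, expected one of {AVAILABLE_SIZES}")
        self._sessions[name] = size
        return name

    def list_sessions(self) -> List[str]:
        return sorted(self._sessions)


demo = InMemorySessions()
first = demo.connect("pagerank-compute", "8GB")   # creates the session
second = demo.connect("pagerank-compute", "16GB")  # reuses the existing one
print(demo.list_sessions())
```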

Creating a new session takes a few minutes to complete.
We know that this is not ideal, and the problem is exacerbated in the development environment, where we keep only a few cloud VMs running in order to keep costs low.

💵💵💵💵💵💵

💰💰💰💰💰💰

💸💸💸💸💸💸

NOTE: the creation of a session marks the start of billable activity.
Sessions are machines that run in the cloud, and they cost money.
This cost accumulates for the lifetime of the session, and the session must be deleted manually.

In [None]:
# let's connect to a GDS session!
gds = sessions.connect("pagerank-compute", "8GB")

# Projecting Graphs

In order to project graphs from an AuraDB instance into the GDS session, we created a new projection method: `gds.graph.project`.
The projection works similarly to Cypher projections V2 and is implemented as a Cypher aggregation function.
The Cypher query containing the projection function is executed on the AuraDB instance, and the data it produces is transferred to the GDS session instance via an Arrow connection.

There are two key differences between the remote projection and Cypher projections V2:

1. In AuraDB, the aggregating function does not take a graph name as a parameter.
2. The aggregation function should only be called through the GDS Python Client endpoint `gds.graph.project`

### Limitations

The aggregation function is currently limited to projecting homogeneous graph schemas.
That means that all nodes will have the same property keys regardless of their labels, and all relationships will have the same property keys regardless of their type.
The caller of the aggregation function must supply every property for each node and relationship. Null values are not supported.

The example data in this notebook contains only `User` nodes with `age` properties.
If there are also `Product` nodes with `cost` properties then we would need to add placeholder `cost` and `age` properties on the `User` and `Product` nodes, respectively.
This is a limitation we will attempt to address.
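
To make the placeholder requirement concrete, here is a minimal pure-Python sketch that pads property maps so that every node carries the same keys with non-null values. The helper `pad_properties` and the placeholder values `-1` and `-1.0` are assumptions chosen for illustration. In an actual projection query the same effect would be achieved inside Cypher, for example with `coalesce`.

```python
from typing import Any, Dict, List


def pad_properties(nodes: List[Dict[str, Any]], defaults: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Give every node all keys from `defaults`, replacing missing or null values."""
    padded = []
    for props in nodes:
        unknown = set(props) - set(defaults)
        if unknown:
            raise KeyError(f"no placeholder defined for properties: {sorted(unknown)}")
        row = dict(defaults)
        # Keep real values, but never let a null overwrite a placeholder.
        row.update({key: value for key, value in props.items() if value is not None})
        padded.append(row)
    return padded


user = {"age": 31}       # a User node has no cost
product = {"cost": 9.5}  # a Product node has no age
padded = pad_properties([user, product], defaults={"age": -1, "cost": -1.0})
print(padded)
```

Choose placeholder values that cannot be mistaken for real data, since algorithms that read these properties will see them as ordinary numbers.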


In [None]:
G, result = gds.graph.project(
    "pagerank-graph",
    """
    MATCH (u:User) 
    OPTIONAL MATCH (u)-[r:KNOWS]->(target:User) 
    RETURN gds.graph.project.remote(u, target, {
      sourceNodeProperties: {age: u.age},
      targetNodeProperties: {age: target.age},
      sourceNodeLabels: labels(u),
      targetNodeLabels: labels(target),
      relationshipType: 'KNOWS',
      relationshipProperties: {since: r.since}
    })
    """,
)

result

# Running Algorithms

Running algorithms on the projected graph works exactly as before, especially when running stream and mutate operations.
Mutated algorithm results will be stored in the in-memory graph catalog of the session instance and the data can be retrieved via the stream operations on the graph like `gds.graph.nodeProperty.stream`.

In [None]:
print("Running PageRank ...")
pr_result = gds.pageRank.mutate(G, mutateProperty="pagerank")
print(f"Compute millis: {pr_result['computeMillis']}")
print(f"Node properties written: {pr_result['nodePropertiesWritten']}")
print(f"Centrality distribution: {pr_result['centralityDistribution']}")

# And then we will run FastRP on that
print("Running FastRP ...")
frp_result = gds.fastRP.mutate(
    G,
    mutateProperty="fastRP",
    embeddingDimension=64,
    featureProperties=["pagerank"],
    propertyRatio=0.2,
    nodeSelfInfluence=0.2,
)
print(f"Compute millis: {frp_result['computeMillis']}")
gds.graph.nodeProperties.stream(G, ["pagerank", "fastRP"], separate_property_columns=True)

# Writing back to AuraDB

The session's in-memory graph was projected from data in AuraDB.
Write back operations will thus persist the data back to the same AuraDB.

When calling any write operation, the GDS Python Client will automatically use the new remote write-back functionality, so no API changes are necessary.

The AuraDB coordinates are not stored in the GDS session, but in the client.
Thus, it is important to set up the `GdsSessions` object with the DB credentials that identify the correct database from which the projection came.

In [None]:
# if this fails once with some error like "unable to retrieve routing table"
# then run it again. this is a transient error with a stale server cache.
gds.graph.nodeProperties.write(G, "pagerank")
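
The comment in the cell above suggests simply re-running the call after a transient failure. A minimal sketch of automating that is a small retry helper. The name `retry` and its parameters are hypothetical, and catching the broad `Exception` is a simplification; real code should catch only the specific error raised by the driver.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 2, delay_seconds: float = 1.0) -> T:
    """Call `operation` up to `attempts` times, re-raising the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")


# Demonstration with a stand-in that fails once, then succeeds.
calls = {"count": 0}


def flaky_write() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("unable to retrieve routing table")
    return "written"


outcome = retry(flaky_write, attempts=2, delay_seconds=0.0)
print(outcome, calls["count"])
```

In the notebook, the write call from the cell above could be wrapped as `retry(lambda: gds.graph.nodeProperties.write(G, "pagerank"))`.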

Of course, we can also use the `.write` mode of an algorithm directly:

In [None]:
gds.fastRP.write(
    G,
    writeProperty="fastRP",
    embeddingDimension=64,
    featureProperties=["pagerank"],
    propertyRatio=0.2,
    nodeSelfInfluence=0.2,
)

We can now use the `gds.run_cypher` method to query the updated graph.
Note that the `run_cypher` method behaves differently with the new Aura GDS session: instead of querying the database that hosts GDS, it queries the *AuraDB* instance.

In [None]:
gds.run_cypher(
    """
    MATCH (u:User) 
    RETURN u.id, u.age, u.fastRP, u.pagerank AS rank 
    ORDER BY rank DESC
    LIMIT 5
    """
)

# Closing the session

Generally we intend for the sessions to only live for the time it takes to run a single workload.
If the same workload needs to be re-run, for example to work with updated data, a new session would be created.

💵💵💵💵💵💵

💰💰💰💰💰💰

💸💸💸💸💸💸

The `sessions.disconnect` operation will delete the session and release all resources associated with it.
Note that until this command is called, the customer is charged for the costs of hosting the session instance.

In [None]:
# this will return True if it did delete something
# it will return False otherwise, but it will not normally fail
sessions.disconnect("pagerank-compute")
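
Because a forgotten session keeps accruing cost, it can help to tie the cleanup to the workload itself. The following is a minimal sketch using a context manager that always calls `disconnect`, even when the workload raises. `gds_session` and `StubSessions` are hypothetical names; the stub only imitates the `connect` and `disconnect` calls used in this notebook and does not contact Aura.

```python
from contextlib import contextmanager
from typing import Iterator, Set


class StubSessions:
    """Illustrative stand-in exposing the connect/disconnect calls used above."""

    def __init__(self) -> None:
        self.active: Set[str] = set()

    def connect(self, name: str, size: str) -> str:
        self.active.add(name)
        return f"client-for-{name}"

    def disconnect(self, name: str) -> bool:
        # True if something was deleted, False otherwise.
        if name in self.active:
            self.active.remove(name)
            return True
        return False


@contextmanager
def gds_session(sessions, name: str, size: str) -> Iterator[object]:
    """Connect to a session and guarantee it is disconnected afterwards."""
    gds = sessions.connect(name, size)
    try:
        yield gds
    finally:
        sessions.disconnect(name)


stub_sessions = StubSessions()
try:
    with gds_session(stub_sessions, "pagerank-compute", "8GB") as stub_gds:
        raise RuntimeError("workload failed")
except RuntimeError:
    pass

print(stub_sessions.active)
```

The `finally` clause is what makes the cleanup unconditional, so the session does not outlive a failed workload.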