# Graph PaySim
This notebook is an example of using Neo4j with Vertex AI.  It reads PaySim data from a Neo4j database, loads it into Feature Store, and then trains two classifiers on it.  The first classifier uses the standard PaySim features.  The second uses features engineered with Neo4j Graph Data Science.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/benofben/vertex-ai-samples/blob/master/notebooks/community/neo4j/graph_paysim.ipynb" target="_parent">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/benofben/vertex-ai-samples/tree/master/notebooks/community/neo4j/graph_paysim.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Data Set
The notebook uses a version of the PaySim dataset that has been modified to work with Neo4j's graph database.  PaySim is a synthetic fraud dataset.  The goal is to identify whether or not a given transaction constitutes fraud.  The dataset is [here](https://github.com/voutilad/PaySim).

To do -- more info on importing dump into Aura DS

## Prerequisites
We assume that you've already loaded the PaySim data into a Neo4j instance and have the credentials to connect to that.  You'll also need to install the Neo4j Python driver by running the cell below.

In [3]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-4.3.7.tar.gz (76 kB)
Building wheels for collected packages: neo4j
  Building wheel for neo4j (setup.py) ... done
  Created wheel for neo4j: filename=neo4j-4.3.7-py3-none-any.whl size=100642 sha256=92f32ea29534ce013071eae2f8b186dc12555fc1ae591030354812ef4bd8c8be
  Stored in directory: /root/.cache/pip/wheels/b5/24/bb/cece9fcfdd5e1aa0683e2533945e1e3f27f70f342ff7e28993
Successfully built ne

## Working with Neo4j
In this section we're going to connect to Neo4j and look around the database.  We're going to generate some new features in the dataset using Neo4j's Graph Data Science library.  Finally, we'll load the data into a Pandas dataframe so that it's all ready to put into GCP Feature Store.

In [4]:
import pandas as pd
from neo4j import GraphDatabase

In [5]:
DB_URL = 'neo4j+s://6c443062.databases.neo4j.io:7687'
DB_USER = 'neo4j'
DB_PASS = 'some password'
DB_NAME = 'neo4j'

In [8]:
driver = GraphDatabase.driver(DB_URL, auth=(DB_USER, DB_PASS))

Now, let's explore the data in the database a bit to understand what we have to work with.

In [31]:
# node labels
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL db.labels() YIELD label
    CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as freq', {})
    YIELD value
    RETURN label, value.freq AS freq
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,label,freq
0,Node,0
1,Client,11270
2,Bank,5
3,Merchant,3465
4,Mule,0
5,CashIn,746751
6,CashOut,424574
7,Debit,130284
8,Payment,542443
9,Transfer,0


In [32]:
# relationship types
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
      """
      CALL db.relationshipTypes() YIELD relationshipType as type
      CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as freq', {})
      YIELD value
      RETURN type AS relationshipType, value.freq AS freq
      ORDER by freq DESC
      """
      ).data()
    )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,relationshipType,freq
0,PERFORMED,1844052
1,TO,1844052
2,NEXT,1833720
3,HAS_SSN,11330
4,HAS_EMAIL,11330
5,HAS_PHONE,11330
6,FIRST_TX,10332
7,LAST_TX,10332


In [33]:
# transaction types
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    MATCH (t:Transaction)
    WITH sum(t.amount) AS globalSum, count(t) AS globalCnt
    WITH *, 10^3 AS scaleFactor
    UNWIND ['CashIn', 'CashOut', 'Payment', 'Debit', 'Transfer'] AS txType
      CALL apoc.cypher.run('MATCH (t:' + txType + ')
        RETURN sum(t.amount) as txAmount, count(t) AS txCnt', {})
      YIELD value
    RETURN txType,value.txAmount AS TotalMarketValue
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,txType,TotalMarketValue
0,CashIn,104058200000.0
1,CashOut,53854100000.0
2,Payment,96468140000.0
3,Debit,1016829000.0
4,Transfer,0.0


## Create a New Feature with a Graph Embedding using Neo4j
Now we're going to create an in-memory graph representation of the data.

If you've run these examples previously, you will first need to delete the existing Cypher projection of the graph.

In [34]:
with driver.session(database = DB_NAME) as session:
    result = session.read_transaction( lambda tx: 
        tx.run(
        """
        CALL gds.graph.drop('client_graph', false)
        """
        ).data()
    )

We're going to create a representation of the data in Neo4j Graph Data Science (GDS).

In [35]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL gds.graph.create.cypher('client_graph', 
      'MATCH (c:Client) RETURN id(c) as id, c.num_transactions as num_transactions, c.total_transaction_amnt as total_dollar_amnt, c.is_fraudster as is_fraudster',
      'MATCH (c:Client)-[:PERFORMED]->(t:Transaction)-[:TO]->(c2:Client) return id(c) as source, id(c2) as target, sum(t.amount) as amount, "TRANSACTED_WITH" as type ')
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,nodeQuery,relationshipQuery,graphName,nodeCount,relationshipCount,createMillis
0,"MATCH (c:Client) RETURN id(c) as id, c.num_tra...",MATCH (c:Client)-[:PERFORMED]->(t:Transaction)...,client_graph,11270,26035,417
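The relationship query in the projection collapses every transaction between the same pair of clients into one `TRANSACTED_WITH` relationship whose `amount` is the sum of the individual amounts.  The sketch below shows the same aggregation in plain Python on made-up data; the function name and the sample transactions are illustrative assumptions, not values from the PaySim database.

```python
from collections import defaultdict


def aggregate_transactions(transactions):
    """Sum amounts per (source, target) pair, like sum(t.amount) grouped by client pair."""
    totals = defaultdict(float)
    for source, target, amount in transactions:
        totals[(source, target)] += amount
    # One weighted edge per client pair, in a stable order.
    return [(source, target, amount) for (source, target), amount in sorted(totals.items())]


# Made-up transactions: (source client id, target client id, amount).
sample_transactions = [(0, 5, 100.0), (0, 5, 50.0), (5, 10, 25.0)]
weighted_edges = aggregate_transactions(sample_transactions)
print(weighted_edges)
```

This is why the projected graph has far fewer relationships than the database has transactions: repeated transfers between two clients become a single weighted edge.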


Now we can generate an embedding from that graph.  This is a new feature we can use in our predictions.  We're using FastRP (Fast Random Projection), a node embedding algorithm that is typically much faster than Node2Vec and can also fold node properties into the embedding, as the featureProperties setting below does.  You can learn more about that [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

In [36]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL gds.fastRP.mutate('client_graph',{
      relationshipWeightProperty:'amount',
      iterationWeights: [0.0, 1.00, 1.00, 0.80, 0.60],
      featureProperties: ['num_transactions', 'total_dollar_amnt'],
      propertyRatio: .25, 
      embeddingDimension:16,
      randomSeed: 1, 
      mutateProperty:'embedding'
    })
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,createMillis,computeMillis,configuration
0,11270,0,11270,0,14,"{'relationshipWeightProperty': 'amount', 'prop..."
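To give some intuition for what the FastRP call above computes, here is a deliberately simplified sketch of the idea: give every node a random vector, repeatedly replace each vector with the weighted average of its neighbours' vectors, and combine the iterations using the iteration weights.  This is an assumption-laden toy, not the GDS implementation: it uses dense +1/-1 vectors where the real algorithm uses sparse random vectors, and it ignores `featureProperties` and `propertyRatio`.  The function name and the sample graph are hypothetical.

```python
import math
import random


def fastrp_sketch(num_nodes, edges, dim=16, iteration_weights=(0.0, 1.0, 1.0), seed=1):
    """Toy FastRP-style embedding; a simplification, not the GDS implementation."""
    rng = random.Random(seed)

    # Weighted outgoing neighbours per node, built from (source, target, weight) edges.
    neighbours = {node: [] for node in range(num_nodes)}
    for source, target, weight in edges:
        neighbours[source].append((target, weight))

    # Start from a random +1/-1 vector per node.
    current = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(num_nodes)]
    embedding = [[0.0] * dim for _ in range(num_nodes)]

    for iteration_weight in iteration_weights:
        updated = []
        for node in range(num_nodes):
            total = sum(weight for _, weight in neighbours[node])
            vector = [0.0] * dim
            if total > 0:
                # Weighted average of the neighbours' vectors from the previous step.
                for neighbour, weight in neighbours[node]:
                    for i in range(dim):
                        vector[i] += weight * current[neighbour][i] / total
            norm = math.sqrt(sum(v * v for v in vector))
            updated.append([v / norm for v in vector] if norm > 0 else vector)
        current = updated
        # Add this iteration's vectors to the final embedding, scaled by its weight.
        for node in range(num_nodes):
            for i in range(dim):
                embedding[node][i] += iteration_weight * current[node][i]
    return embedding


# Made-up graph: nodes 0 and 1 transact with each other, node 2 is isolated.
toy_edges = [(0, 1, 100.0), (1, 0, 40.0)]
toy_embedding = fastrp_sketch(3, toy_edges, dim=4, iteration_weights=(1.0,))
```

In this sketch a node with no outgoing relationships always ends up with an all-zero vector.  That may be one reason many clients in this notebook's output have all-zero embeddings, since the projection only contains client-to-client transactions.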


Finally, we stream the node properties, including the new embedding, out to a dataframe.

In [37]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL gds.graph.streamNodeProperties
    ('client_graph', ['embedding', 'num_transactions', 'total_dollar_amnt', 'is_fraudster'])
    YIELD nodeId, nodeProperty, propertyValue
    RETURN nodeId, nodeProperty, propertyValue
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,nodeId,nodeProperty,propertyValue
0,0,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,0,num_transactions,4
2,0,total_dollar_amnt,118919
3,0,is_fraudster,1
4,3,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...
45075,1858717,is_fraudster,10000
45076,1858725,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
45077,1858725,num_transactions,1
45078,1858725,total_dollar_amnt,1557.3


In [30]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    'MATCH (c:Client) RETURN id(c) as id, c.name as name'
    ).data()
  )
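  # Keep this in its own variable so the streamed-properties dataframe in df is not overwritten.
  names_df = pd.DataFrame(result)
  display(names_df)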

Unnamed: 0,id,name
0,0,Ryder Mills
1,3,Bella Nichols
2,5,Bella Pitts
3,8,Harper Mccarthy
4,10,Connor Neal
...,...,...
11265,1857442,Khloe Castillo
11266,1857502,Faith Byers
11267,1857640,Reagan Edwards
11268,1858717,Christopher Foley


Now we need to take that dataframe and shape it into something that better represents our classification problem.

In [14]:
x = df.pivot(index='nodeId', columns='nodeProperty', values='propertyValue')
x = x.reset_index()
x.columns.name = None
x

Unnamed: 0,nodeId,embedding,is_fraudster,num_transactions,total_dollar_amnt
0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1,4,118919
1,3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",10000,0,0
2,5,"[6.206621439019955e-09, -1.9226471081879026e-0...",1,80,7.48446e+06
3,8,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",10000,0,0
4,10,"[7.890789133213616e-10, -0.021041372790932655,...",1,227,3.75806e+07
...,...,...,...,...,...
11265,1857442,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",10000,1,21671
11266,1857502,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",10000,1,93.5579
11267,1857640,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",10000,1,92.3323
11268,1858717,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",10000,1,11272.2


is_fraudster has a value of 0 or 1 when it is populated.  A value of 10000 means the client is unlabeled, so we're going to drop those rows.

In [15]:
clients = x.loc[x['is_fraudster'] != 10000]
clients

Unnamed: 0,nodeId,embedding,is_fraudster,num_transactions,total_dollar_amnt
0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1,4,118919
2,5,"[6.206621439019955e-09, -1.9226471081879026e-0...",1,80,7.48446e+06
4,10,"[7.890789133213616e-10, -0.021041372790932655,...",1,227,3.75806e+07
6,15,"[3.494107261303725e-07, -0.005100565031170845,...",1,106,4.86428e+06
7,18,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,0,0
...,...,...,...,...,...
11192,1839904,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,1,52.3779
11204,1844183,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,1,133989
11224,1847685,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,1,27.6889
11242,1849983,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0,1,194573


And that's it!  The "clients" dataframe now holds a labeled dataset that we can push to GCP Feature Store.
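Many classifiers and feature stores expect flat numeric columns rather than a list-valued column, so it may help to expand the embedding into one column per dimension before going further.  The sketch below does this with pandas on a tiny made-up frame shaped like "clients"; the helper name `expand_embedding` and the `embedding_0`, `embedding_1`, ... column names are assumptions for illustration.

```python
import pandas as pd


def expand_embedding(frame, column="embedding"):
    """Replace a list-valued column with one numeric column per list element."""
    expanded = pd.DataFrame(frame[column].tolist(), index=frame.index)
    expanded.columns = [f"{column}_{i}" for i in range(expanded.shape[1])]
    return pd.concat([frame.drop(columns=[column]), expanded], axis=1)


# Made-up rows with a 3-dimensional embedding (the notebook uses 16 dimensions).
clients_example = pd.DataFrame({
    "nodeId": [0, 5],
    "embedding": [[0.0, 0.0, 0.0], [0.1, -0.2, 0.3]],
    "is_fraudster": [1, 0],
})
flat = expand_embedding(clients_example)
print(flat)
```

Applied to the real data, this would be `expand_embedding(clients)`, giving 16 embedding columns alongside the existing features.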

## Loading Data into GCP Feature Store
In this section, we'll take our dataframe with the newly engineered features and load it into GCP Feature Store.

In [None]:
pass

## Classification with Vertex AI
In this section, we're going to run two classifiers and compare results.  The first will use the standard PaySim features.  The second will use our new graph features.

In [None]:
pass

## Analyze the Predictions
Now that Vertex AI has made predictions on the dataset, we're going to use Neo4j Bloom to investigate how those predictions fit with the data they were made from.

In [None]:
pass