# Graph PaySim
This notebook shows how to use Neo4j with Vertex AI.  It reads PaySim data from a Neo4j database, loads it into Feature Store, and then runs two classifications on it.  The first classification uses the standard PaySim features.  The second uses features engineered with Neo4j Graph Data Science.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/benofben/vertex-ai-samples/blob/master/notebooks/community/neo4j/graph_paysim.ipynb" target="_parent">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/benofben/vertex-ai-samples/tree/master/notebooks/community/neo4j/graph_paysim.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Data Set
The notebook uses a version of the PaySim dataset that has been modified to work with Neo4j's graph database.  PaySim is a synthetic fraud dataset.  The goal is to identify whether or not a given transaction constitutes fraud.  The dataset is [here](https://github.com/voutilad/PaySim).

TODO: add more information on importing the dump into AuraDS.

## Prerequisites
We assume that you've already loaded the PaySim data into a Neo4j instance and have the credentials to connect to it.  You'll also need to install the Neo4j Python driver by running the cell below.

In [76]:
!pip install neo4j



## Working with Neo4j
In this section we're going to connect to Neo4j and look around the database.  We're going to generate some new features in the dataset using Neo4j's Graph Data Science library.  Finally, we'll load the data into a Pandas dataframe so that it's all ready to put into GCP Feature Store.

In [77]:
import pandas as pd
from neo4j import GraphDatabase

In [81]:
DB_URL = 'neo4j+s://6c443062.databases.neo4j.io:7687'
DB_USER = 'neo4j'
DB_PASS = 'some password'
DB_NAME = 'neo4j'

In [82]:
driver = GraphDatabase.driver(DB_URL, auth=(DB_USER, DB_PASS))

Now, let's explore the data in the database a bit to understand what we have to work with.

In [83]:
# node labels
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL db.labels() YIELD label
    CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as freq', {})
    YIELD value
    RETURN label, value.freq AS freq
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,label,freq
0,Node,0
1,Client,11270
2,Bank,5
3,Merchant,3465
4,Mule,0
5,CashIn,746751
6,CashOut,424574
7,Debit,130284
8,Payment,542443
9,Transfer,0


In [84]:
# relationship types
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
      """
      CALL db.relationshipTypes() YIELD relationshipType as type
      CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as freq', {})
      YIELD value
      RETURN type AS relationshipType, value.freq AS freq
      ORDER by freq DESC
      """
      ).data()
    )
df = pd.DataFrame(result)
display(df)

Unnamed: 0,relationshipType,freq
0,PERFORMED,1844052
1,TO,1844052
2,NEXT,1833720
3,HAS_SSN,11330
4,HAS_EMAIL,11330
5,HAS_PHONE,11330
6,FIRST_TX,10332
7,LAST_TX,10332
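
Every query cell in this notebook repeats the same session and transaction boilerplate. As a sketch, that pattern could be factored into a small helper. The name `run_read_query` is hypothetical (it is not part of the Neo4j driver), and the helper assumes the same `session.read_transaction` API that the cells above use.

```python
def run_read_query(driver, query, database="neo4j", **params):
    """Run a Cypher query in a read transaction and return its rows.

    `driver` is expected to behave like the object returned by
    GraphDatabase.driver(...). Rows come back as a list of dicts,
    which is the shape that pd.DataFrame(...) accepts.
    """
    with driver.session(database=database) as session:
        return session.read_transaction(
            # .data() materializes the result inside the transaction.
            lambda tx: tx.run(query, **params).data()
        )
```

With this in place, a cell such as the node label listing could shrink to something like `df = pd.DataFrame(run_read_query(driver, "CALL db.labels() YIELD label RETURN label", database=DB_NAME))`.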


In [85]:
# transaction types
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    MATCH (t:Transaction)
    WITH sum(t.amount) AS globalSum, count(t) AS globalCnt
    WITH *, 10^3 AS scaleFactor
    UNWIND ['CashIn', 'CashOut', 'Payment', 'Debit', 'Transfer'] AS txType
      CALL apoc.cypher.run('MATCH (t:' + txType + ')
        RETURN sum(t.amount) as txAmount, count(t) AS txCnt', {})
      YIELD value
    RETURN txType,value.txAmount AS TotalMarketValue
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,txType,TotalMarketValue
0,CashIn,104058200000.0
1,CashOut,53854100000.0
2,Payment,96468140000.0
3,Debit,1016829000.0
4,Transfer,0.0
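
The query above computes a global sum but returns only the per-type totals. As a rough sketch, the totals can be turned into each transaction type's share of the overall value on the Python side. The figures are copied from the output table above, and `value_shares` is a hypothetical helper name.

```python
# Totals copied from the output table above.
totals = {
    "CashIn": 104058200000.0,
    "CashOut": 53854100000.0,
    "Payment": 96468140000.0,
    "Debit": 1016829000.0,
    "Transfer": 0.0,
}


def value_shares(totals):
    """Return each transaction type's fraction of the combined amount."""
    grand_total = sum(totals.values())
    return {tx_type: amount / grand_total for tx_type, amount in totals.items()}


shares = value_shares(totals)
# Print the types from largest to smallest share.
for tx_type, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tx_type:<10}{share:6.1%}")
```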


## Create a New Feature with a Graph Embedding using Neo4j
Now we're going to create an in-memory graph representation of the data.

If you've run these examples previously, you will first need to drop the existing in-memory graph.

In [151]:
with driver.session(database = DB_NAME) as session:
    result = session.read_transaction( lambda tx: 
        tx.run(
        """
        // false = do not fail if 'client_graph' does not exist yet
        CALL gds.graph.drop('client_graph', false)
        """
        ).data()
    )

We're going to create a representation of the data in Neo4j Graph Data Science (GDS).

In [152]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL gds.graph.create.cypher('client_graph', 
      'MATCH (c:Client) RETURN id(c) as id, c.num_transactions as num_transactions, c.total_transaction_amnt as total_dollar_amnt, c.is_fraudster as is_fraudster', 
      'MATCH (c:Client)-[:PERFORMED]->(t:Transaction)-[:TO]->(c2:Client) return id(c) as source, id(c2) as target, sum(t.amount) as amount, "TRANSACTED_WITH" as type ')
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,nodeQuery,relationshipQuery,graphName,nodeCount,relationshipCount,createMillis
0,"MATCH (c:Client) RETURN id(c) as id, c.num_tra...",MATCH (c:Client)-[:PERFORMED]->(t:Transaction)...,client_graph,11270,26035,401


Now we can generate an embedding from that graph.  This is a new feature we can use in our predictions.  We're using FastRP (Fast Random Projection), a node embedding algorithm that typically runs much faster than random-walk approaches such as Node2Vec.  You can learn more about it [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

In [153]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL gds.fastRP.mutate('client_graph',{
      relationshipWeightProperty:'amount',
      iterationWeights: [0.0, 1.00, 1.00, 0.80, 0.60],
      featureProperties: ['num_transactions', 'total_dollar_amnt'],
      propertyRatio: .25, 
      embeddingDimension:16,
      randomSeed: 1, 
      mutateProperty:'embedding'
    })
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,createMillis,computeMillis,configuration
0,11270,0,11270,0,13,"{'relationshipWeightProperty': 'amount', 'prop..."
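
For intuition about what `iterationWeights` controls, here is a toy, pure-Python sketch of the general idea behind random-projection embeddings: start from random vectors, repeatedly average each node's neighbors, and add each iteration's normalized result to the embedding, scaled by that iteration's weight. It is a simplification and not the GDS implementation: it ignores relationship weights, `featureProperties`, `propertyRatio`, and degree normalization. The function name `toy_fastrp` and the four-node graph are made up for illustration.

```python
import math
import random


def toy_fastrp(adjacency, dim, iteration_weights, seed=1):
    """Toy random-projection embedding (illustrative only, not GDS)."""
    rng = random.Random(seed)
    # Every node starts with a random vector of +1/-1 entries.
    current = {n: [rng.choice((-1.0, 1.0)) for _ in range(dim)] for n in adjacency}
    embedding = {n: [0.0] * dim for n in adjacency}

    for weight in iteration_weights:
        # One propagation step: each node takes the mean of its neighbors.
        nxt = {}
        for node, neighbors in adjacency.items():
            vec = [0.0] * dim
            for nb in neighbors:
                for k in range(dim):
                    vec[k] += current[nb][k]
            if neighbors:
                vec = [v / len(neighbors) for v in vec]
            nxt[node] = vec

        # Add the L2-normalized step result, scaled by this iteration's weight.
        for node, vec in nxt.items():
            norm = math.sqrt(sum(v * v for v in vec))
            if norm > 0:
                for k in range(dim):
                    embedding[node][k] += weight * vec[k] / norm
        current = nxt
    return embedding


# Made-up undirected graph; node "d" has no neighbors.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
embeddings = toy_fastrp(graph, dim=4, iteration_weights=[0.0, 1.0, 1.0, 0.8, 0.6])
for node, vector in embeddings.items():
    print(node, [round(v, 3) for v in vector])
```

In this toy version, a weight of `0.0` discards that iteration's contribution entirely, and a node with no neighbors never receives any signal, so its embedding stays at zero.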


Finally, we stream the node properties out into a dataframe.

In [155]:
with driver.session(database = DB_NAME) as session:
  result = session.read_transaction( lambda tx: 
    tx.run(
    """
    CALL gds.graph.streamNodeProperties
    ('client_graph', ['embedding', 'num_transactions', 'total_dollar_amnt', 'is_fraudster'])
    YIELD nodeId, nodeProperty, propertyValue
    RETURN gds.util.asNode(nodeId).name AS name, nodeProperty, propertyValue
    """
    ).data()
  )
  df = pd.DataFrame(result)
  display(df)

Unnamed: 0,name,nodeProperty,propertyValue
0,Ryder Mills,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,Ryder Mills,num_transactions,4
2,Ryder Mills,total_dollar_amnt,118919
3,Ryder Mills,is_fraudster,1
4,Bella Nichols,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...
45075,Christopher Foley,is_fraudster,10000
45076,Zachary Swanson,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
45077,Zachary Swanson,num_transactions,1
45078,Zachary Swanson,total_dollar_amnt,1557.3


Now we need to take that dataframe and shape it into something that better represents our classification problem.

In [156]:
x = df.rename(columns={'nodeProperty': 'property', 'propertyValue': 'value'})

In [157]:
x.describe()

Unnamed: 0,name,property,value
count,45080,45080,45080
unique,10944,4,19693
top,Bella Spencer,embedding,10000
freq,12,11270,9608
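
The summary above hints at a problem for the reshaping step: 45080 rows over 4 properties corresponds to 11270 nodes, yet there are only 10944 unique names, so some clients must share a name. A minimal sketch for listing those names follows. It assumes the rows are available as a list of dicts (for example via `x.to_dict('records')`), and `duplicated_names` is a hypothetical helper.

```python
from collections import Counter


def duplicated_names(rows, property_name="embedding"):
    """Return {name: count} for names that appear on more than one node.

    Counting a single property avoids counting each node once per property.
    """
    counts = Counter(
        row["name"] for row in rows if row["property"] == property_name
    )
    return {name: n for name, n in counts.items() if n > 1}


# Tiny made-up sample in the same long format as the dataframe above.
sample = [
    {"name": "Ryder Mills", "property": "embedding", "value": [0.0]},
    {"name": "Ryder Mills", "property": "num_transactions", "value": 4},
    {"name": "Bella Nichols", "property": "embedding", "value": [0.0]},
    {"name": "Bella Nichols", "property": "embedding", "value": [0.1]},
]
print(duplicated_names(sample))
```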


In [158]:
names = x.name.unique()
ids = x.property.unique()

In [159]:
for name in names:
  print(name)
  y = x.loc[df['name'] == name]
  y.pivot(index='name', columns='property', values='value')


Ryder Mills
Bella Nichols


ValueError: ignored

In [160]:
x.loc[df['name'] == name]

Unnamed: 0,name,property,value
4,Bella Nichols,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,Bella Nichols,num_transactions,0
6,Bella Nichols,total_dollar_amnt,0
7,Bella Nichols,is_fraudster,10000
7444,Bella Nichols,embedding,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7445,Bella Nichols,num_transactions,0
7446,Bella Nichols,total_dollar_amnt,0
7447,Bella Nichols,is_fraudster,10000
23768,Bella Nichols,embedding,"[0.0, 0.0, -1.0777628745017864e-07, 0.0, -1.07..."
23769,Bella Nichols,num_transactions,14


In [143]:
x.pivot(index='name', columns='property', values='value')

ValueError: ignored
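
The `ValueError` is consistent with duplicate entries: `pivot` requires each `(name, property)` combination to appear only once, and the rows for Bella Nichols above contain several `embedding` entries. One possible way around this, sketched below, is to key the reshape on the node ID instead of the name. It assumes the streaming query is changed to also return `nodeId` (it currently returns only the name), and `long_to_wide` is a hypothetical helper, not part of pandas or the Neo4j driver.

```python
from collections import defaultdict


def long_to_wide(rows):
    """Collapse long-format property rows into one dict per node ID."""
    wide = defaultdict(dict)
    for row in rows:
        node = wide[row["nodeId"]]
        node["name"] = row["name"]
        node[row["nodeProperty"]] = row["propertyValue"]
    return [{"nodeId": node_id, **props} for node_id, props in wide.items()]


# Made-up rows: two different clients that happen to share a name.
rows = [
    {"nodeId": 1, "name": "Bella Nichols", "nodeProperty": "num_transactions", "propertyValue": 0},
    {"nodeId": 1, "name": "Bella Nichols", "nodeProperty": "is_fraudster", "propertyValue": 10000},
    {"nodeId": 2, "name": "Bella Nichols", "nodeProperty": "num_transactions", "propertyValue": 14},
    {"nodeId": 2, "name": "Bella Nichols", "nodeProperty": "is_fraudster", "propertyValue": 10000},
]
wide_rows = long_to_wide(rows)
for record in wide_rows:
    print(record)
```

The returned list of dicts can be passed to `pd.DataFrame(...)`, giving one row per client and one column per property.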

## Loading Data into GCP Feature Store
In this section, we'll take our dataframe with the newly engineered features and load it into GCP Feature Store.

In [None]:
pass

## Classification with Vertex AI
In this section, we're going to run two classifiers and compare results.  The first will use the standard PaySim features.  The second will use our new graph features.

In [None]:
pass

## Analyze the Predictions
Now that Vertex AI has made predictions on the dataset, we're going to use Neo4j Bloom to investigate how those predictions fit with the data they were made from.

In [None]:
pass