#### Copyright IBM All Rights Reserved.
#### SPDX-License-Identifier: Apache-2.0

## Visualization Demos

In this demo we will:
1. Import the required libraries for visualizing gremlin queries
2. Connect to the graph server we set up
3. Learn how to transform the gremlin result set into the required shape for the visualization tool


### Before proceeding
Please update the `connect_info` notebook with your Db2 and graph server information.

Once the notebook has been updated please run the cell and press save.

## Example scenario

You are a data scientist at health insurance company X. One of your tasks is to investigate insurance claims for fraud.

You just received a notification that your machine learning model for fraudulent claim detection has finished processing the latest transactional data in Db2 and has identified some insurance claims as suspicious.

We need to investigate the identified claim to determine whether it is fraudulent.

In [1]:
# For using notebooks as modules
import nbfinder

# These imports are for connecting, querying, traversing and returning gremlin result sets
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.traversal import T

# Make sure you have edited and run the "connect_info" notebook, then restarted this notebook
from connect_info import graph_connect_info, db2_connect_info

# When making a secure connection to the gremlin server the SSL certificate verification
# needs to be disabled when using a self signed certificate
from tornado import httpclient

# Db2 imports
import ibm_db as db
import pandas as pd

# These imports are required for working with the gremlin result set
# to transform it into something the visualization tool can work with
import json
from itertools import tee, islice, chain

In [2]:
# This helper function allows us to get the previous and next results when iterating a list
# We use this to determine how the edges connect different vertices when parsing a gremlin result set
def previous_and_next(some_iterable):
    prevs, items, nexts = tee(some_iterable, 3)
    prevs = chain([None], prevs)
    nexts = chain(islice(nexts, 1, None), [None])
    return zip(prevs, items, nexts)
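
To see what the helper yields, here is a small self-contained sketch that repeats the function and runs it on a made-up three-element list. The strings stand in for the vertices of one path and are purely illustrative.

```python
from itertools import tee, islice, chain

def previous_and_next(some_iterable):
    # three independent iterators over the same input
    prevs, items, nexts = tee(some_iterable, 3)
    # shift one copy right (pad the front) and one copy left (pad the end)
    prevs = chain([None], prevs)
    nexts = chain(islice(nexts, 1, None), [None])
    return zip(prevs, items, nexts)

# illustrative stand-ins for the vertices along one path
triples = list(previous_and_next(["claim", "patient", "disease"]))
for previous, item, nxt in triples:
    print(previous, item, nxt)
```

The first triple has `None` as its previous value and the last triple has `None` as its next value, which is how the parsing cells later detect the start and end of a path.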

We just received a notification that our machine learning algorithm has finished processing the latest transactional data in Db2 and has identified some insurance claims as potentially suspicious.

We'll start by using Db2 to load the machine learning output into our notebook.

In [3]:
conn_str="database=" + db2_connect_info["db2_database_name"] + \
    ";hostname=" + db2_connect_info["db2_hostname"] + ";port=" + db2_connect_info["db2_port"] + \
    ";protocol=tcpip;uid=" + db2_connect_info["db2_username"] + ";pwd=" + db2_connect_info["db2_password"]
conn = db.connect(conn_str,'','')
select = """
select Claim_id, charge, decimal(simavgcharge, 10,2) as simavgcharge, SIMCOUNT, SIMMINCHARGE, SIMMAXCHARGE
from demo.claim as thisclaim,
lateral (select avg(q.charge) as simavgcharge, count(*) as simcount, min(charge) as simmincharge, max(charge) as simmaxcharge
from demo.claim_similarity as subqsim, demo.claim as q where subqsim.SIM_CLAIM_ID = q.claim_id and thisclaim.claim_id = subqsim.claim_id) as q
where charge  > float(4)* simavgcharge
group by Claim_id, charge, simavgcharge, SIMCOUNT, simmaxcharge, simmincharge
order by charge/simavgcharge desc fetch first 10 rows only;
"""
stmt = db.exec_immediate(conn, select)
result = db.fetch_assoc(stmt)
data = []
while result != False:
    data.append(result)
    result = db.fetch_assoc(stmt)
db.close(conn)
pd.DataFrame.from_dict(data)

| | CLAIM_ID | CHARGE | SIMAVGCHARGE | SIMCOUNT | SIMMINCHARGE | SIMMAXCHARGE |
|---|---|---|---|---|---|---|
| 0 | C4377 | 9987487.4 | 700372.98 | 537 | 20.7 | 9973754.0 |
| 1 | C15383 | 9973754.0 | 700398.55 | 537 | 20.7 | 9987487.4 |
| 2 | C89596 | 9971312.6 | 700403.1 | 537 | 20.7 | 9987487.4 |
| 3 | C27710 | 9809869.6 | 689742.02 | 529 | 20.7 | 9987487.4 |
| 4 | C91109 | 9949034.0 | 700444.59 | 537 | 20.7 | 9987487.4 |
| 5 | C7181 | 9905392.6 | 700525.86 | 537 | 20.7 | 9987487.4 |
| 6 | C22204 | 9722586.7 | 689907.02 | 529 | 20.7 | 9987487.4 |
| 7 | C56230 | 9866634.1 | 700598.03 | 537 | 20.7 | 9987487.4 |
| 8 | C60174 | 9848933.3 | 700631.0 | 537 | 20.7 | 9987487.4 |
| 9 | C6591 | 9819330.4 | 700686.12 | 537 | 20.7 | 9987487.4 |


Your fraud detection model has highlighted claim C4377 as suspicious. Its charge of about 9.99 million is roughly 14 times the average charge of about 700 thousand across the 537 similar claims.
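
The arithmetic behind the SQL filter can be reproduced in plain Python. This is a minimal sketch that copies the first three rows of the result table above; the `RATIO` key and the `THRESHOLD` name are introduced here for illustration and are not part of the original query.

```python
# Values copied from the first three rows of the result table
rows = [
    {"CLAIM_ID": "C4377", "CHARGE": 9987487.4, "SIMAVGCHARGE": 700372.98},
    {"CLAIM_ID": "C15383", "CHARGE": 9973754.0, "SIMAVGCHARGE": 700398.55},
    {"CLAIM_ID": "C89596", "CHARGE": 9971312.6, "SIMAVGCHARGE": 700403.1},
]

# Same multiplier as the `float(4) * simavgcharge` condition in the SQL
THRESHOLD = 4

# How many times larger each charge is than the average of its similar claims
for row in rows:
    row["RATIO"] = row["CHARGE"] / row["SIMAVGCHARGE"]

# Keep claims above the threshold, largest ratio first (mirrors the ORDER BY)
flagged = [row for row in rows if row["CHARGE"] > THRESHOLD * row["SIMAVGCHARGE"]]
flagged.sort(key=lambda row: row["RATIO"], reverse=True)

for row in flagged:
    print(f'{row["CLAIM_ID"]}: {row["RATIO"]:.1f}x the average of similar claims')
```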

You now need to dig deeper to find out what is going on, and we will use Db2 Graph to do that.

In [4]:
# We will create a connection to our database
# This `g` object will be used to send all query requests to the graph server
gremlin_connect = httpclient.HTTPRequest(graph_connect_info["graph_url"], validate_cert=False)
g = traversal().withRemote(
    DriverRemoteConnection(
        gremlin_connect,
        graph_connect_info["graph_name"],
        username=graph_connect_info["graph_username"],
        password=graph_connect_info["graph_password"]
    )
)

Let's start our investigation by taking a look at the details of the policyholder for the claim.
For example, we can find what types of claims they have submitted in the past.

In [5]:
"""
This query uses our `g` object to perform the following traversal:
1. Look at vertices with the label 'DEMO.CLAIM'
2. Filter to the vertex with the 'CLAIM_ID' we are interested in
3. Traverse to the person who is insured under the claim
4. Find out what types of diseases they have previously filed claims for
5. Return the complete path from the claim to the disease
6. Return all properties for each hop in the path
7. Convert the gremlin result set into a python list
"""
prev_claims_by_same_policyholder_disease_link = g.V() \
.hasLabel('DEMO.CLAIM') \
.has('CLAIM_ID', 'C4377') \
.out('DEMO.INSURED_OF_CLAIM') \
.out('DEMO.HAS_DISEASE') \
.path() \
.by(__.valueMap(True)) \
.toList()

In [6]:
# You can view the raw output by uncommenting the print statements below
#print(prev_claims_by_same_policyholder_disease_link)
#print(prev_claims_by_same_policyholder_disease_link[0])
#for i in range(len(prev_claims_by_same_policyholder_disease_link[0])):
#    print("prev_claims_by_same_policyholder_disease_link[" + str(i) + "] = " + str(prev_claims_by_same_policyholder_disease_link[0][i]))
#    print("")


The returned value is a nested list: a list of paths, each of which is itself a list.

The length of the main list is the number of paths returned.

The length of each sublist is the number of elements along the path from the starting vertex to the disease vertex (here the claim, the patient and the disease).

Notice the `<T.id: 1>` and `<T.label: 4>` keys. They are defined by the Gremlin Python driver, and we need
the import `from gremlin_python.process.traversal import T` to access them.

In this query we go from the claim, to the person insured under the claim, to the disease for each claim filed.
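
To make the shape concrete, here is a minimal sketch of a single path element and of how an id string can be derived from it. The `FakeT` enum is a hypothetical stand-in for the driver's `T` so that the snippet runs without a graph server, `make_item_id` is a name made up for this example, and the property values are illustrative rather than real output.

```python
from enum import Enum

class FakeT(Enum):
    # hypothetical stand-in for gremlin_python.process.traversal.T
    id = 1
    label = 4

def make_item_id(item, t=FakeT):
    """Join the id prefix and the first id column into one string."""
    return item[t.id]["prefix"] + "::" + item[t.id]["idCols"][0]

# illustrative element, shaped like the valueMap(True) output parsed in this notebook
claim_item = {
    FakeT.id: {"prefix": "DEMO.CLAIM", "idCols": ["C4377"]},
    FakeT.label: "DEMO.CLAIM",
    "CLAIM_ID": ["C4377"],  # regular properties arrive as lists of values
}

item_id = make_item_id(claim_item)
print(item_id, claim_item[FakeT.label], claim_item["CLAIM_ID"][0])
```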

In [7]:
"""
Note: vis-network uses the terms nodes and edges instead of vertices and edges; nodes and vertices are interchangeable
They are called nodes here for vis-network

This cell parses our gremlin result set into something vis-network can use
to create a visualization. It starts by creating empty lists for the nodes and edges.
Next it loops through all items in the result set and conditionally parses each item into
either a node object or an edge object. The schema for these objects is available in
the vis-network documentation https://visjs.github.io/vis-network/docs/network/nodes.html and
https://visjs.github.io/vis-network/docs/network/edges.html
"""
nodes = []
edges = []
# start looping through the list containing the results
for val in prev_claims_by_same_policyholder_disease_link:
    # for every value in the results get the previous, item and next item
    for previous, item, nxt in previous_and_next(val):
        # get our current item id
        itemId = item[T.id]["prefix"] + "::" + item[T.id]["idCols"][0]
        # get our current label
        label = item[T.label]
        # set a colour value for our vertex
        colour = "red"
        # If we are on the first item then set the label value to the claim we are interested in
        if previous is None:
            itemId = "Claim 4377"
            label = "Claim 4377"
            colour = "blue"
        # if the label is disease then set the label value to the disease name
        if label == "DEMO.DISEASE":
            label = item['CONCEPT_NAME'][0]
        # if the label is patient then set the label to "Patient " + patient id
        if label == "DEMO.PATIENT":
            label = "Patient " + itemId.split("::")[1]
            colour = "orange"
        # if the next item exists in the list, meaning there is a link between our current vertex
        # and the next vertex then we need to add that link to our edges list
        if nxt is not None:
            # create the link object
            nxtId = nxt[T.id]["prefix"] + "::" + nxt[T.id]["idCols"][0]
            link = {"from": itemId, "to": nxtId, "title": nxt[T.label]}
            # and append it to the edges if it doesn't already exist
            if link not in edges:
                edges.append(link)
        # create the vertex object
        node = {"id": itemId, "label": label, "group": item[T.label], "color": colour}
        if node not in nodes:
            # append the node to our list
            nodes.append(node)

# Once we are done processing the result set we need a way to pass it to the visualization library
# To do that we will dump a json representation of the vertices and edges
# then read it back later
with open('claim_diease_links.json', 'w') as f:
    json.dump(
        {'nodes': nodes,
         'edges': edges
        },
        f, indent=4
    )
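
The `not in` checks above scan the whole list on every iteration, which is fine for a handful of paths. For larger result sets, one possible alternative (a sketch, not part of the original notebook) is to key nodes by their id and edges by their `(from, to)` pair. Note that this de-duplicates on the id alone rather than on the whole dictionary. The helper name `dedupe_graph` and the sample ids are made up for illustration.

```python
import json

def dedupe_graph(raw_nodes, raw_edges):
    """Keep the first node seen per id and the first edge seen per (from, to) pair."""
    nodes_by_id = {}
    edges_by_pair = {}
    for node in raw_nodes:
        # setdefault only stores the node if the id has not been seen yet
        nodes_by_id.setdefault(node["id"], node)
    for edge in raw_edges:
        edges_by_pair.setdefault((edge["from"], edge["to"]), edge)
    return {
        "nodes": list(nodes_by_id.values()),
        "edges": list(edges_by_pair.values()),
    }

# hypothetical input containing one duplicated node and one duplicated edge
raw_nodes = [
    {"id": "Claim 4377", "label": "Claim 4377"},
    {"id": "DEMO.PATIENT::P1", "label": "Patient P1"},
    {"id": "DEMO.PATIENT::P1", "label": "Patient P1"},
]
raw_edges = [
    {"from": "Claim 4377", "to": "DEMO.PATIENT::P1"},
    {"from": "Claim 4377", "to": "DEMO.PATIENT::P1"},
]

graph = dedupe_graph(raw_nodes, raw_edges)
print(json.dumps(graph, indent=4))
```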

In [8]:
%%html
<!-- Create a div that will contain the visualization -->
<div id="claim_diease_links">Visualization is loading...</div>
<script type="text/javascript">
// load the visualization library
require.config({
  paths: {
    Vis: "https://unpkg.com/vis-network@7.6.2/standalone/umd/vis-network.min"
  }
});
require(["Vis"], function(vis) {
  // now we will fetch the json from the previous cell
  fetch('claim_diease_links.json').then(r => r.json()).then(graph => {
    // get a reference to the container we created to hold the visualization
    var container = document.getElementById('claim_diease_links');
    // set our visualization data
    var data = {
      nodes: graph.nodes,
      edges: graph.edges
    };
    // define some default options for the visualization
    // See https://visjs.github.io/vis-network/docs/network/ for all available options
    var options = {
      width: '968px',
      height: '800px',
      nodes: {
        shape: 'dot',
      },
      interaction: {
        hover: true,
      },
      physics: {
        enabled: true,
        solver: "repulsion",
        repulsion: {
          nodeDistance: 200
        },
        stabilization: {
          enabled: true,
        },
      }
    };
    new vis.Network(container, data, options);
  })
})
</script>

Our starting vertex, Claim 4377, is coloured in blue. From our starting vertex we see a link to the patient associated with this claim, Patient 11279, in orange. From the patient we have links to all the diseases they have filed claims with in red.

We can see that this policy holder is associated with multiple chronic diseases, which may be the reason for the abnormal charge, but we can dig further to check whether this is the case.

We now know that they have filed a lot of claims, so let's take a look at those claims to see if we can find anything interesting.

In [None]:
"""
This query uses our `g` object to perform the following traversal:
1. Get the vertices with the label 'DEMO.CLAIM'
2. Filter to the vertex with the 'CLAIM_ID' we are interested in
3. Find out who the policyholder is
4. Find out what other claims they have filed
5. For each claim, find the doctors who handled it and the service providers those doctors work for, as well as the insured person
6. Return the complete path from start to end
7. Return all properties for each hop in the path
8. Convert the gremlin result set into a python list
"""
 
other_claims_for_policy_holder = g.V() \
.hasLabel('DEMO.CLAIM') \
.has('CLAIM_ID', 'C4377') \
.out('DEMO.POLICYHOLDER_OF_CLAIM') \
.in_('DEMO.POLICYHOLDER_OF_CLAIM') \
.union(__.out('DEMO.INCHARGE_OF_CLAIM').out('DEMO.INCHARGE_DEMO.SERVICE'), __.out('DEMO.INSURED_OF_CLAIM')) \
.path() \
.by(__.valueMap(True)) \
.toList()

In [None]:
# You can view the raw output by uncommenting the print statements below
#print(other_claims_for_policy_holder)
#print(other_claims_for_policy_holder[0])
#for i in range(len(other_claims_for_policy_holder[0])):
#    print("other_claims_for_policy_holder[0][" + str(i) + "] = " + str(other_claims_for_policy_holder[0][i]))
#    print("")

In [None]:
"""
Note: vis-network uses the terms nodes and edges instead of vertices and edges; nodes and vertices are interchangeable
They are called nodes here for vis-network

This cell parses the result set in much the same way as the previous one. The only difference is how
we are getting the labels for each vertex
"""
other_claims_nodes = []
other_claims_edges = []
# start looping through the list containing the results
for val in other_claims_for_policy_holder:
    # for every value in the results get the previous, item and next item
    for previous, item, nxt in previous_and_next(val):
        # If there is no previous value available then skip the iteration
        if previous is None:
            continue
        # grab the id and label for the vertex
        itemId = item[T.id]["prefix"] + "::" + item[T.id]["idCols"][0]
        label = item[T.label]
        colour = "blue"
        # if we are on the patient vertex then skip the iteration
        if label == "DEMO.PATIENT":
            continue
        # if we are on the policy holder then set a label
        # and set the colour of the vertex to green
        if label == "DEMO.POLICYHOLDER":
            label = "Policyholder " + itemId.split("::")[1]
            colour = "green"
        # if we are on a service vertex then set the label value to the
        # name of the service and the colour to orange
        if label == "DEMO.SERVICE":
            label = item["SERVICE_NAME"][0]
            colour = "orange"
        # if we are on an incharge vertex then set the label value to the doctors name and id
        # and the colour to grey
        if label == "DEMO.INCHARGE":
            label = "Dr. " + item["LNAME"][0] + " - " + item["SERVICE_ID"][0]
            colour = "grey"
        # if we are on the claim vertex then set the label to be the claim id
        if label == "DEMO.CLAIM":
            label = "Claim " + itemId.split("::")[1]
            # and if we are on the claim we are investigating set the colour to red
            if label == "Claim C4377":
                colour = "red"
        # add our edges
        if nxt is not None:
            nxtId = nxt[T.id]["prefix"] + "::" + nxt[T.id]["idCols"][0]
            link = {"from": itemId, "to": nxtId, "title": label}
            if link not in other_claims_edges:
                other_claims_edges.append(link)
        # add our vertices
        node = {"id": itemId, "label": label, "group": item[T.label], "color": colour}
        if node not in other_claims_nodes:
            other_claims_nodes.append(node)
# dump the edges and nodes to json
with open('other_claims_for_policy_holder.json', 'w') as f:
    json.dump(
        {
            'nodes': other_claims_nodes,
            'edges': other_claims_edges
        },
        f,
        indent=4
    )

In [None]:
%%html
<!-- Create a div that will contain the visualization -->
<div id="other_claims_for_policy_holder">Visualization is loading...</div>
<script type="text/javascript">
// load the visualization library
require.config({
  paths: {
    Vis: "https://unpkg.com/vis-network@7.6.2/standalone/umd/vis-network.min"
  }
});
require(["Vis"], function(vis) {
  // now we will fetch the json from the previous cell
  fetch('other_claims_for_policy_holder.json').then(r => r.json()).then(graph => {
    // get a reference to the container we created to hold the visualization
    var container = document.getElementById('other_claims_for_policy_holder');
    // set our visualization data
    var data = {
      nodes: graph.nodes,
      edges: graph.edges
    };
    // define some default options for the visualization
    // See https://visjs.github.io/vis-network/docs/network/ for all available options
    var options = {
      width: '968px',
      height: '800px',
      nodes: {
        shape: 'dot',
      },
      interaction: {
        hover: true,
      },
    };
    new vis.Network(container, data, options);
  })
})
</script>

Looking at the result of this graph query, we can quickly see that for every claim this person filed, they saw a different doctor who worked for a different service provider.

This seems out of the ordinary and very suspicious.
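
The same observation can be checked programmatically rather than by eye. The following is a rough sketch on made-up (claim, doctor, service provider) triples. In the notebook these values would first have to be extracted from the paths in `other_claims_for_policy_holder`; the ids and the helper name `repeated_values` are purely hypothetical.

```python
from collections import Counter

def repeated_values(triples, position):
    """Return the values at `position` that occur in more than one triple."""
    counts = Counter(triple[position] for triple in triples)
    return sorted(value for value, count in counts.items() if count > 1)

# hypothetical (claim id, doctor id, service provider id) triples
claim_doctor_service = [
    ("C1", "D10", "S100"),
    ("C2", "D11", "S101"),
    ("C3", "D12", "S102"),
]

# empty lists mean no doctor and no service provider appears on more than one claim
repeated_doctors = repeated_values(claim_doctor_service, 1)
repeated_services = repeated_values(claim_doctor_service, 2)
print(repeated_doctors, repeated_services)
```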

We can dig deeper by looking at the social connections of the policy holder. What other policy holders are directly, or indirectly, connected to the policy holder?

In [None]:
# We'll start by classifying our risk scores and risk score colours for a better visualization output
# Plain comparisons are used so that boundary values and non-integer scores fall into the intended band
def risk_factor(risk_score):
    if risk_score < 0:
        return "no_risk"
    elif risk_score < 20:
        return "low_risk"
    elif risk_score < 70:
        return "medium_risk"
    else:
        return "high_risk"

def risk_color(risk_score):
    if risk_score < 20:
        return "green"
    elif risk_score < 70:
        return "#FFAD73"
    else:
        return "red"
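
Threshold bands like these can also be written as a lookup table. The sketch below uses the standard library's `bisect_right`. The band edges (0, 20, 70) and the name `risk_band` are assumptions for illustration, chosen so that scores below 0 carry no risk, scores from 0 up to but not including 20 are low, scores from 20 up to but not including 70 are medium, and everything else is high.

```python
from bisect import bisect_right

# assumed band edges and the label for each interval between them
BAND_EDGES = [0, 20, 70]
BAND_NAMES = ["no_risk", "low_risk", "medium_risk", "high_risk"]

def risk_band(risk_score):
    """Look up the band whose interval contains risk_score."""
    # bisect_right counts how many edges are <= risk_score
    return BAND_NAMES[bisect_right(BAND_EDGES, risk_score)]

for score in (-5, 0, 19.5, 20, 69, 70, 100):
    print(score, risk_band(score))
```

Adding or moving a band then only requires editing the two lists rather than the chain of conditions.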

In [None]:
"""
This query uses our `g` object to perform the following traversal:
1. Start with the claim in question
2. Find out who the policyholder is
3. Repeatedly follow their social connections
4. Emit each connection found
5. Return the complete path from start to end
6. Return all properties for each hop in the path
7. Convert the gremlin result set into a python list
"""

policy_holder_connections = g.V() \
.hasLabel('DEMO.CLAIM') \
.has('CLAIM_ID', 'C4377') \
.out('DEMO.POLICYHOLDER_OF_CLAIM') \
.repeat(__.out('DEMO.POLICYHOLDER_CONNECTION')) \
.emit() \
.path() \
.by(__.valueMap(True)) \
.toList()
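
The `repeat(...).emit()` pattern returns a path for every connection reached, not only for the furthest ones. As a rough pure-Python analogy (not how the graph server implements it), the sketch below walks a small made-up adjacency map and yields each path as soon as it is extended. The policyholder ids and the helper name `repeat_out_emit` are hypothetical, the toy graph has no cycles, and the order of results may differ from what Gremlin returns.

```python
def repeat_out_emit(adjacency, start):
    """Yield every path reachable from start by following one or more outgoing edges."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        for neighbour in adjacency.get(path[-1], []):
            new_path = path + [neighbour]
            # emit the path now, then keep it for further expansion
            yield new_path
            stack.append(new_path)

# hypothetical social connections between policyholders (no cycles)
connections = {
    "PH1": ["PH2", "PH3"],
    "PH2": ["PH4"],
    "PH4": ["PH5"],
}

paths = list(repeat_out_emit(connections, "PH1"))
for path in paths:
    # hops = number of edges followed = elements in the path minus one
    print(len(path) - 1, "hop(s):", " -> ".join(path))
```

A path with four elements, such as the one ending at `PH5` here, corresponds to three degrees of separation from the starting policyholder.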

In [None]:
# You can view the raw output by uncommenting the print statements below
#print(policy_holder_connections)
#print(policy_holder_connections[0])
#for i in range(len(policy_holder_connections[0])):
#    print("policy_holder_connections[" + str(i) + "] = " + str(policy_holder_connections[0][i]))
#    print("")


In [None]:
"""
Note: vis-network uses the terms nodes and edges instead of vertices and edges; nodes and vertices are interchangeable
They are called nodes here for vis-network

This cell parses the result set in much the same way as the previous one. The only difference is how
we are getting the labels for each vertex
"""

policy_holder_connection_nodes = []
policy_holder_connection_edges = []

# Loop over the result set
for val in policy_holder_connections:
    risk_score = 0
    # for each nested list iterate over it
    for previous, item, nxt in previous_and_next(val):
        itemId = item[T.id]["prefix"] + "::" + item[T.id]["idCols"][0]
        label = item[T.label]
        # if we are on the claim label then skip this iteration
        if label == "DEMO.CLAIM":
            continue
        # if we are on the policy holder claim label then skip this iteration
        if label == "DEMO.POLICYHOLDER_OF_CLAIM":
            continue
        if label == "DEMO.POLICYHOLDER":
            # for the label POLICYHOLDER set a clean label with the policy holder id
            label = "Policyholder " + itemId.split("::")[1]
        if "RISK_SCORE" in item:
            # if risk score is available then set it
            risk_score = item["RISK_SCORE"][0]
        if nxt is not None:
            # create our edge links
            nxtId = nxt[T.id]["prefix"] + "::" + nxt[T.id]["idCols"][0]
            link = {"from": itemId, "to": nxtId, "title": label, "color": "blue"}
            if link not in policy_holder_connection_edges:
                policy_holder_connection_edges.append(link)
        # get the risk classification and colour for this vertex based on its risk score
        color = risk_color(risk_score)
        risk_group = risk_factor(risk_score)
        # if our item is the policy holder in question then set the colour to aqua
        if itemId == "DEMO.POLICYHOLDER::PH3759":
            color = "aqua"
            risk_score = 100
            risk_group = "high_risk"
        # create our vertex
        node = {
            "title": risk_group,
            "color": color,
            "id": itemId,
            "label": label,
            "group": risk_group,
            "value": risk_score
            # we are setting each vertex's value to its risk score. vis-network will increase the vertex size
            # to correspond to the value
        }
        # and append it if the vertex is not in our list
        if node not in policy_holder_connection_nodes:
            policy_holder_connection_nodes.append(node)

with open('policy_holder_connections.json', 'w') as f:
    json.dump(
        {
            'nodes': policy_holder_connection_nodes,
            'edges': policy_holder_connection_edges
        },
        f,
        indent=4
    )

In [None]:
%%html
<!-- Create a div that will contain the visualization -->
<div id="policy_holder_connections">Visualization is loading...</div>
<script type="text/javascript">
// load the visualization library
require.config({
  paths: {
    Vis: "https://unpkg.com/vis-network@7.6.2/standalone/umd/vis-network.min"
  }
});
require(["Vis"], function(vis) {
  // now we will fetch the json from the previous cell
  fetch('policy_holder_connections.json').then(r => r.json()).then(graph => {
    // get a reference to the container we created to hold the visualization
    var container = document.getElementById('policy_holder_connections');
    // set our visualization data
    var data = {
      nodes: graph.nodes,
      edges: graph.edges
    };
    // define some default options for the visualization
    // See https://visjs.github.io/vis-network/docs/network/ for all available options
    var options = {
      width: '968px',
      height: '800px',
      nodes: {
        shape: 'dot',
      },
      interaction: {
        hover: true,
      },
    };
    new vis.Network(container, data, options);
  })
})
</script>

On this graph, the larger the circle, the higher the risk score of the policy holder.

We can see that the policy holder in question (aqua coloured vertex) is directly connected to two other high risk policy holders.

They are also connected to a third high risk policy holder by 3 degrees of separation.

There is a very high probability that this claim is fraudulent.