## Modeling a fraud ring with a graph




## Why 

Graph analysis is not fraud detection by itself, but it raises suspicions for the special investigation unit (SIU) to examine.

Typical signals are shared resources, such as shared addresses.

Fraudsters typically keep a ring as small as possible in order to conceal their plot.


### Fraud ring example 1: Auto accident insurance claims

<img src="images/A fraud ring car accident.png" align = "right" alt="Drawing" style="width: 400px;"/>


Figure (A) shows an example of a fraud ring. This ring includes five people, a doctor, and a lawyer. The drivers and passengers stage accidents, and their roles can be recycled (they call it "recycling the roles"), so a driver in one accident can be a witness in another. The National Insurance Crime Bureau (NICB) reports that these staged crimes are [big business](https://www.nicb.org/prevent-fraud-theft/staged-auto-accident-fraud). The state of Florida, a no-fault state, is notorious for its auto accident fraud, though it has been brought under control in recent years. Under no-fault laws, a driver is required to carry Personal Injury Protection (PIP) to cover his own injuries in an accident, and recovers financial losses from his own insurance company up to a specified threshold. The fraud ring thus stages a hit-and-run car accident, and the fake passenger can file a claim and get paid by his insurance company.

Florida enacted the No-Fault Reform or Anti-Fraud Laws in 2012 to impose heavy penalties on medical providers who commit PIP fraud. It is [reported](https://www.prnewswire.com/news-releases/floridas-no-fault-reform-anti-fraud-laws-are-working-300017990.html) that the number and cost of PIP insurance claims have fallen since then. However, fraud rings are expected to keep innovating and to emerge in different forms, so methods and technologies to catch fraud will still be needed.

Auto fraud can take various forms. NICB warns the public about [auto repair scams](https://www.nicb.org/prevent-fraud-theft/avoid-auto-repair-scams). An unscrupulous auto shop can replace a customer's deployed airbag with a cheap one (previously deployed and salvaged), and then bill the insurance company for the retail value of a new one. In this case the insurance company should see a suspicious number of airbag replacements by a particular auto shop.

### Fraud ring example 2: First-party fraud in retail banking

First-party fraud (FPF) occurs when someone enters into a relationship with a bank <b>using either their own identity or a synthetic identity</b> with the intent to defraud. It is different from third-party fraud (also known as “identity fraud”), in which a third person's identification is used. According to the U.S. federal government, charge-off rates are currently at about 10% of all outstanding consumer credit. Most estimates of the share of charged-off credit attributed to first-party fraud range [from 10% to 25%](http://www.infoglide.com/blog/first-party-fraud-assessing-the-damage/). In other words, up to 25 cents of every charged-off dollar may be lost to FPF.

Fraudsters create <b>synthetic identities</b>, which are usually a mix of real and fictitious identity elements. How do they do that? Take a look at the following steps:
* SSN: Get the Social Security Number (SSN) of another person.
* Name: Fabricate a name to be used with the SSN.
* Birth date: Create a false birth date that matches the appearance of the fraudster, in case in-person appearances are required.
* Address: Create an address to receive mail fraudulently.
* Phone: Provide telephone numbers.
* Apply for a credit account (important): The first application will be declined by the credit reporting agencies (CRAs) or credit bureaus. However, after that the fake name with that SSN is in the CRAs' systems. Submit the application again until it is accepted.
* Add an authorized user (most popular): It is legitimate for a cardholder to add authorized users such as a spouse or a child. Fraudsters exploit the process by piggybacking their fabricated names onto existing accounts.

<img src="images/busted out fraud.png" align="right" alt="Drawing" style="width: 500px;"/>

We are not here to discuss the various ways to bypass the security system. Readers are encouraged to see other ways to [create synthetic identities](https://securityintelligence.com/synthetic-identity-theft-three-ways-synthetic-identities-are-created/). Once the identities are created, fraudsters typically build up a good history with timely payments and low utilization. They act "normally": shopping regularly, opening accounts at different organizations and checking their credit scores. They make overpayments in the final stage of the bust-out. That's why FPF is also called bust-out fraud.

<img src="images/FPF.png" align="right" alt="Drawing" style="width: 300px;"/>

Because synthetic identities are generated by recycling personally identifiable information (PII), the same PII is likely to appear multiple times. In the figure there are five fabricated cardholders. It may be perfectly normal for two cardholders to share the same phone number or address. However, it is highly unlikely that two cardholders share the same SSN. By examining the relationships among the cardholders in a huge database, there is a good chance of detecting this fraud ring.
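The idea of looking for recycled PII can be sketched in a few lines of plain Python. This is only an illustration on made-up records (the names, SSNs and phone numbers below are hypothetical, as are the helper names `names_by_value` and `shared`); it groups cardholders by one attribute and keeps the values used by more than one name.

```python
from collections import defaultdict

# Hypothetical toy records; every value here is made up for illustration.
records = [
    {"name": "Alice Adams", "ssn": "111-11-1111", "phone": "555-010-0001"},
    {"name": "Bob Brown",   "ssn": "222-22-2222", "phone": "555-010-0001"},
    {"name": "Carol Clark", "ssn": "333-33-3333", "phone": "555-010-0002"},
    {"name": "Dan Davis",   "ssn": "333-33-3333", "phone": "555-010-0003"},
    {"name": "Eve Evans",   "ssn": "333-33-3333", "phone": "555-010-0004"},
]

def names_by_value(rows, field):
    """Map each value of `field` to the set of names that use it."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[field]].add(row["name"])
    return groups

def shared(rows, field):
    """Keep only the values of `field` that are used by more than one name."""
    return {value: sorted(names)
            for value, names in names_by_value(rows, field).items()
            if len(names) > 1}

shared_ssn = shared(records, "ssn")      # strong signal: one SSN, several names
shared_phone = shared(records, "phone")  # weak signal: may be a household
print(shared_ssn)
print(shared_phone)
```

A shared phone number alone would not justify a referral, while a shared SSN is the kind of link worth passing to an investigator.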

### Why can't the "traditional" anomaly detection techniques catch them?

In the previous chapter we illustrated anomaly detection techniques with an example on credit card transactions. There are two reasons why those techniques do not work here. The first is that <b>the fraudsters' behavior looks normal</b>. These fraudsters hide within normal behavior, so it is very hard to detect them at the individual card level. The second is an <b>ineffective detection technique</b>. We create features along a single dimension, such as transaction time, location or payment type, to determine the anomaly. However, this becomes ineffective when we need to detect whether someone is using a stolen identity, a synthetic identity, a fake IP address or a hijacked device. To capture the relationships among the targets, we need to examine the connectivity among the account holders or claims.
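To make "connectivity" concrete, here is a minimal sketch, assuming accounts are linked whenever they share an SSN or a phone number. It uses a small union-find structure to group linked accounts; the account ids, field names and values are hypothetical, and a graph database does this kind of traversal for you at scale.

```python
from collections import defaultdict

def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the groups that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

def connected_groups(accounts, fields):
    """Group account ids that are linked through any shared value in `fields`."""
    parent = {acct_id: acct_id for acct_id in accounts}
    first_seen = {}  # (field, value) -> first account id that used it
    for acct_id, attrs in accounts.items():
        for field in fields:
            key = (field, attrs[field])
            if key in first_seen:
                union(parent, first_seen[key], acct_id)
            else:
                first_seen[key] = acct_id
    groups = defaultdict(list)
    for acct_id in accounts:
        groups[find(parent, acct_id)].append(acct_id)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical accounts: 1 and 2 share an SSN, 2 and 3 share a phone number.
accounts = {
    1: {"ssn": "A", "phone": "P1"},
    2: {"ssn": "A", "phone": "P2"},
    3: {"ssn": "B", "phone": "P2"},
    4: {"ssn": "C", "phone": "P3"},
    5: {"ssn": "D", "phone": "P4"},
}
groups = connected_groups(accounts, ["ssn", "phone"])
print(groups)
```

Accounts 1 and 3 share nothing directly, yet they end up in the same group through account 2. A per-account feature cannot see that indirect link.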

### Generate a dataset with synthetic identities

In the following code I first generate 1,000 normal account holders. Each record has the cardholder's full name, the SSN, the ZIP+4 and the phone number. I use the Python module "names" to generate random names. I then generate 4 fraud records in order to create the synthetic identities. The synthetic identities come from every combination of the 4 elements (name, SSN, ZIP+4 and phone number) across the 4 records. As a result, there are $4^4=256$ synthetic identities. The final dataset contains 1,000+256 = 1,256 identities.

In [2]:
# !pip install names 
import names
import random
import pandas as pd

random.seed(0)

# Generate phone number
def gen_phone(size):
    phone = []
    for _ in range(size):
        first = str(random.randint(100,999))
        second = str(random.randint(1,888)).zfill(3)
        last = str(random.randint(1,9998)).zfill(4)
        tmp =  '{}-{}-{}'.format(first,second,last)
        phone.append(tmp)
    return phone

# Generate Social Security Number (SSN)
def gen_SSN(size):
    SSN = []
    for _ in range(size):
        first = str(random.randint(100,999))
        second = str(random.randint(1,99)).zfill(2)        
        last = str(random.randint(1,9998)).zfill(4)
        tmp =  '{}-{}-{}'.format(first,second,last)
        SSN.append(tmp)
    return SSN

# Generate Zip+4
def gen_ZIP4(size):
    ZIP4 = []
    for _ in range(size):
        first = str(random.randint(0,99999)).zfill(5)
        last = str(random.randint(1,9998)).zfill(4)
        tmp =  '{}-{}'.format(first,last)
        ZIP4.append(tmp)
    return ZIP4

# Generate names
def gen_card_names(size):
    card_names=[]
    for _ in range(size):
        tmp = names.get_full_name()
        card_names.append(tmp)
    return card_names

In [3]:
# Generate the normal cardholders
size = 1000
norm_data = pd.DataFrame({'Card_name':gen_card_names(size),
                     'ZIP':gen_ZIP4(size),
                     'SSN':gen_SSN(size),
                     'Phone':gen_phone(size)} )
norm_data.head()

| | Card_name | Phone | SSN | ZIP |
|---|---|---|---|---|
| 0 | Genevieve Gallegos | 398-130-6320 | 739-39-3392 | 49206-9961 |
| 1 | Masako Holley | 127-223-3731 | 558-38-0053 | 67351-0382 |
| 2 | Pearle Goodman | 532-778-1057 | 143-55-7149 | 65528-8418 |
| 3 | Douglas Schmidt | 921-096-5093 | 663-89-0170 | 65814-3326 |
| 4 | Russell Mills | 173-346-2172 | 809-22-1692 | 55296-4461 |


In [4]:
# Generate fraudulent cardholders
size = 4
fraud_data = pd.DataFrame({'Card_name':gen_card_names(size),
                     'ZIP':gen_ZIP4(size),
                     'SSN':gen_SSN(size),
                     'Phone':gen_phone(size)})
fraud_data

| | Card_name | Phone | SSN | ZIP |
|---|---|---|---|---|
| 0 | Sharon Jones | 405-682-1878 | 638-63-1353 | 28339-8358 |
| 1 | Jennie Gibson | 699-748-8901 | 136-64-4039 | 70945-8377 |
| 2 | Jeff Smith | 669-762-0998 | 908-16-3295 | 92042-6295 |
| 3 | Joshua Berumen | 785-340-0661 | 627-60-2087 | 83946-5778 |


In [9]:
# Generate synthetic identities from the records
listOLists = [fraud_data['Card_name'],fraud_data['ZIP'],fraud_data['SSN'],fraud_data['Phone']]
index = pd.MultiIndex.from_product(listOLists, names = ["Card_name", "ZIP","SSN","Phone"])
synthetic = pd.DataFrame(index = index).reset_index()
synthetic.shape

(256, 4)
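As a sanity check on the shape above, the same Cartesian product can be sketched with `itertools.product` on the four fraud records printed earlier. This is an equivalent illustration in plain Python, not the code used to build the dataset. It also shows that each name and SSN pair occurs in 16 rows, one for every ZIP and phone combination.

```python
from itertools import product

# The four fraud records shown above, split by column.
card_names = ["Sharon Jones", "Jennie Gibson", "Jeff Smith", "Joshua Berumen"]
zips = ["28339-8358", "70945-8377", "92042-6295", "83946-5778"]
ssns = ["638-63-1353", "136-64-4039", "908-16-3295", "627-60-2087"]
phones = ["405-682-1878", "699-748-8901", "669-762-0998", "785-340-0661"]

# Every combination of one name, one ZIP, one SSN and one phone number.
combos = list(product(card_names, zips, ssns, phones))
print(len(combos))  # 4 * 4 * 4 * 4 = 256

# Rows that share one particular (name, SSN) pair: one per ZIP/phone combination.
rows_for_pair = [c for c in combos
                 if c[0] == "Sharon Jones" and c[2] == "638-63-1353"]
print(len(rows_for_pair))  # 4 * 4 = 16
```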

In [11]:
# Combine the normal identities and synthetic identities to get the final dataset
# Clear the existing indices by setting the ignore_index option to True.
data = pd.concat([norm_data,synthetic],ignore_index=True)
data['Identifier']= range(1, len(data) + 1)
data['Amount']=random.sample(range(1, 9999), len(data))
# Set the columns in desired order
data = data[['Identifier', 'Card_name','SSN','ZIP','Phone','Amount']]
data.head()
data.to_csv("/Users/chriskuo/Downloads/data.csv", index = False)
synthetic.to_csv("/Users/chriskuo/Downloads/synthetic.csv", index = False)


## Graph database: Neo4j

Neo4j is a very popular graph database.
Why use a graph database? With a graph database you can store the relationships between data items and observe those relationships easily, because the relationships in the data have meaning in themselves.


(benefits of Neo4j)

In the following you will learn what a graph database is, what the Cypher query language is and, most of all, how to use a graph database to identify a fraud ring.

### How to start Neo4j?

Click [here](https://neo4j.com/download/) to install Neo4j. I found it easy to install on Windows. Mac users may find it much easier to use [Homebrew to install Neo4j](http://brewformulas.org/Neo4j). Homebrew describes itself as "installing the stuff you need that Apple didn’t." It can be installed on macOS by following this [guide](https://brew.sh/). On my machine, once Neo4j is started the command prompt returns "Started Neo4j. It is available at http://localhost:7474/". This means Neo4j is available at this local address. Copy "http://localhost:7474/" into a browser and you will see the Neo4j browser, a graphical user interface (GUI). Now you have successfully opened Neo4j! At the top you will see a command line with a blinking cursor. In the middle you can click "Learn about Neo4j" to learn more, or click "Jump to code" to enter your Cypher queries.


<img src="Neo4j brower.png" alt="Drawing" style="width: 600px;"/>

### Learn Basic Cypher Queries for Neo4j
Neo4j uses a query language called <b>Cypher</b>. Below is an example of a basic Cypher statement. It creates three "Cardholder" nodes, each with the properties "Name" and "Age". The nodes are referred to as c1, c2 and c3.

In [None]:
// Create three nodes
CREATE (c1:Cardholder { Name:"John Appleseed", Age:24 }),
(c2:Cardholder { Name:"Mary Ruth", Age:55 }),
(c3:Cardholder { Name:"John Author", Age:34 })
RETURN c1,c2,c3

<b>Node</b>: Remember a few things about the creation of a node
* To create a node, use *CREATE*.
* A node has a label. Here "Cardholder" is the label.
* We can assign a reference to a node. Here the references are c1, c2 and c3.
* A node has properties that provide extra information about the node. There are two properties: "Name" and "Age".
* To display a node, use *RETURN*. The node will be shown as a color circle with text.

<b> Relationship: </b>
The syntax to create a relationship is as simple as the following: Node (a) is related to Node (b). Nodes are represented by parentheses and relationships by arrows. The details of the relationship go inside the square brackets.

In [None]:
// Create a relationship
MATCH (a:Cardholder),(b:Cardholder)
WHERE a.Name = "John Appleseed" AND b.Name = "Mary Ruth"
CREATE (a)-[r:RELATED_TO]->(b)
RETURN r

The above statement does the following things:
* The MATCH statement finds the two nodes,
* WHERE specifies which two nodes they are,
* CREATE creates the relationship. The relationship is written as an arrow indicating its direction: (a)-[r:RELATED_TO]->(b). "RELATED_TO" is the relationship type named by the coder. You are free to use other types such as "REPORT_TO", "BELONG", or "z".
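For readers who think more easily in Python, the node and relationship concepts above can be mimicked with a tiny in-memory model. This is only an analogy under simplifying assumptions: the `Node`, `Relationship` and `Graph` classes are hypothetical and have nothing to do with how Neo4j stores data.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str   # e.g. "Cardholder"
    props: dict  # e.g. {"Name": "Mary Ruth", "Age": 55}

@dataclass
class Relationship:
    start: Node
    rel_type: str  # e.g. "RELATED_TO"
    end: Node

@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    rels: list = field(default_factory=list)

    def create_node(self, label, **props):
        """Rough analogue of CREATE (n:Label {props})."""
        node = Node(label, props)
        self.nodes.append(node)
        return node

    def match(self, label, **where):
        """Rough analogue of MATCH (n:Label) WHERE n.key = value."""
        return [n for n in self.nodes
                if n.label == label
                and all(n.props.get(k) == v for k, v in where.items())]

    def create_rel(self, start, rel_type, end):
        """Rough analogue of CREATE (a)-[:TYPE]->(b)."""
        rel = Relationship(start, rel_type, end)
        self.rels.append(rel)
        return rel

g = Graph()
g.create_node("Cardholder", Name="John Appleseed", Age=24)
g.create_node("Cardholder", Name="Mary Ruth", Age=55)
g.create_node("Cardholder", Name="John Author", Age=34)

# MATCH both nodes, then CREATE a directed relationship between them.
for a in g.match("Cardholder", Name="John Appleseed"):
    for b in g.match("Cardholder", Name="Mary Ruth"):
        g.create_rel(a, "RELATED_TO", b)

print(len(g.nodes), len(g.rels))
```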

### Load a csv file 
Neo4j can easily load a CSV file with the <b>LOAD CSV</b> statement. When your CSV file contains headers, use WITH HEADERS to load them. The LOAD CSV statement also supports common options, such as accessing fields by column header or column index and configuring the field terminator character. For more instruction on how to load data into Neo4j, click [here](http://neo4j.com/docs/developer-manual/current/cypher/clauses/load-csv/).

We will first use the file "synthetic.csv", which contains all the synthetic records, to get familiar with the basic syntax of Neo4j. We then apply the same code to the dataset data.csv to identify the fraud ring.

In [None]:
LOAD CSV WITH HEADERS FROM "file:///synthetic.csv" AS row
RETURN count(*);

If the CSV file is large, USING PERIODIC COMMIT can be used to instruct Neo4j to commit after a number of rows. This improves efficiency by limiting the number of rows held in memory. By default the commit happens every 1000 rows. The following example reduces this to 500 rows. Note that Neo4j 5 removed USING PERIODIC COMMIT in favor of CALL { ... } IN TRANSACTIONS, so the examples here assume an older version.

In [None]:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///synthetic.csv" AS row
RETURN count(*);
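The idea behind periodic commits, processing rows in fixed-size batches so that only one batch is held at a time, can be sketched in plain Python. This is a conceptual illustration only: the `batches` and `load_in_batches` helpers and the `commit` callback are hypothetical stand-ins, not part of Neo4j or its drivers.

```python
from itertools import islice

def batches(rows, size):
    """Yield consecutive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def load_in_batches(rows, size, commit):
    """Call `commit` once per batch and return the total number of rows seen."""
    total = 0
    for chunk in batches(rows, size):
        commit(chunk)  # stand-in for "write this batch and commit"
        total += len(chunk)
    return total

# Simulate loading 1,256 rows with a commit every 500 rows.
committed_sizes = []
total = load_in_batches(range(1256), 500,
                        lambda chunk: committed_sizes.append(len(chunk)))
print(total, committed_sizes)
```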

### Create nodes and relationships
Let's continue. In the following statement, a new node with the Cardholder label is created for each row in the CSV file, with two columns from the CSV file set as properties on the node. A node with the ssn label is also created for each SSN.

In [None]:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///synthetic.csv" AS row
MERGE (c:Cardholder {name:row.Card_name, SSN: row.SSN})
MERGE (s:ssn {SSN: row.SSN})
RETURN c,s

We know there are duplicate SSNs and card names in the dataset. If we used CREATE, a separate node would be created for every row, so the same SSN would be duplicated across nodes. So we use a powerful statement, MERGE. The MERGE statement creates a new node or relationship if it does not exist, or matches the existing one. Thus the nodes and relationships will be unique. We have used four SSNs and four card names to generate the synthetic records, so each SSN should be associated with 4 names.
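The difference between CREATE and MERGE is essentially "always insert" versus "get or create". The sketch below mimics that behavior with plain Python dictionaries. It is a simplified analogy with hypothetical `create` and `merge` helpers, not a description of how Neo4j implements MERGE.

```python
def create(store, label, **props):
    """CREATE-like: always add a new node, even if an identical one exists."""
    node = {"label": label, **props}
    store.append(node)
    return node

def merge(store, label, **props):
    """MERGE-like: reuse a node with this label and these properties, else create it."""
    for node in store:
        if node["label"] == label and all(node.get(k) == v for k, v in props.items()):
            return node
    return create(store, label, **props)

# Three rows; the first and third repeat the same name and SSN.
rows = [
    ("Sharon Jones", "638-63-1353"),
    ("Jennie Gibson", "638-63-1353"),
    ("Sharon Jones", "638-63-1353"),
]

created, merged = [], []
for name, ssn in rows:
    create(created, "Card_name", name=name, SSN=ssn)
    create(created, "ssn", SSN=ssn)
    merge(merged, "Card_name", name=name, SSN=ssn)
    merge(merged, "ssn", SSN=ssn)

created_ssn = [n for n in created if n["label"] == "ssn"]
merged_ssn = [n for n in merged if n["label"] == "ssn"]
print(len(created), len(merged))
print(len(created_ssn), len(merged_ssn))
```

With the create-style helper the single SSN shows up as three separate nodes, while the merge-style helper keeps one SSN node that both names can point to.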

In [None]:
LOAD CSV WITH HEADERS FROM "file:///synthetic.csv" AS row
MERGE (c:Card_name {name:row.Card_name, SSN: row.SSN})
MERGE (s:ssn {SSN: row.SSN})
MERGE (c)-[r:z]->(s)
RETURN c,s,r

<img src="synthetic.png" alt="insert here" style="width: 600px;"/>

### Use Neo4j to identify the fraud

In [None]:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (c:Card_name {name:row.Card_name, SSN: row.SSN})
MERGE (s:ssn {SSN: row.SSN})
MERGE (c)-[r:z]->(s)
RETURN c,s,r

Below is a screenshot showing the Card_names, SSN and the relationships. Not all the nodes and relationships are shown here. Also your result may be different from this image. Most of the results are one-to-one relationships between the names and the SSNs.
<img src="Neo4j data all.png" alt="insert here" style="width: 600px;"/>
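Some SSNs will be associated with multiple card names. The code below counts, for each link between a Card_name and an SSN, how many rows in the file produce that link. A normal record produces its link once, whereas a synthetic name and SSN pair is repeated with different ZIPs and phone numbers. By keeping only the counts greater than 1, we can identify the accounts built from recycled identities that share the same SSNs.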

In [None]:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (c:Card_name {name:row.Card_name, SSN: row.SSN})
MERGE (s:ssn {SSN: row.SSN})
MERGE (c)-[r:z]->(s)
WITH c, s, r, count(r) AS f
WHERE f > 1
RETURN c,s,r

<img src="Neo4j data SSN.png" alt="insert here" style="width: 600px;"/>

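We can also display the records in a list. Below, the fraudulent account names are ranked in descending order by the number of records that share the same name and SSN. Each synthetic name and SSN pair appears in 16 records (4 ZIPs times 4 phone numbers), hence the threshold of 16.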

In [None]:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (c:Card_name {name:row.Card_name, SSN: row.SSN})
MERGE (s:ssn {SSN: row.SSN})
MERGE (c)-[r:z]->(s)
WITH c, count(r) AS f
WHERE f >= 16
RETURN c,f
ORDER BY f DESC
LIMIT 10
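The aggregation in the query above can be cross-checked outside Neo4j. The sketch below is a plain-Python approximation on a small stand-in dataset: the names, SSNs, ZIPs and phone numbers are hypothetical placeholders rather than the contents of data.csv. It counts rows per name and SSN pair, and also counts the distinct names per SSN.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical stand-in for data.csv: 4 x 4 x 4 x 4 synthetic rows plus 3 normal rows.
fraud_names = ["Fraud A", "Fraud B", "Fraud C", "Fraud D"]
fraud_zips = ["00001-0001", "00002-0002", "00003-0003", "00004-0004"]
fraud_ssns = ["900-00-0001", "900-00-0002", "900-00-0003", "900-00-0004"]
fraud_phones = ["555-010-0001", "555-010-0002", "555-010-0003", "555-010-0004"]

rows = [{"Card_name": n, "ZIP": z, "SSN": s, "Phone": p}
        for n, z, s, p in product(fraud_names, fraud_zips, fraud_ssns, fraud_phones)]
rows += [
    {"Card_name": "Normal One", "ZIP": "10001-0001", "SSN": "100-10-0001", "Phone": "555-020-0001"},
    {"Card_name": "Normal Two", "ZIP": "10002-0002", "SSN": "100-10-0002", "Phone": "555-020-0002"},
    {"Card_name": "Normal Three", "ZIP": "10003-0003", "SSN": "100-10-0003", "Phone": "555-020-0003"},
]

# Count rows per (name, SSN) pair, like count(r) grouped by the Card_name node.
pair_counts = Counter((r["Card_name"], r["SSN"]) for r in rows)
flagged = {pair: n for pair, n in pair_counts.items() if n >= 16}

# Count distinct names per SSN, the signal the graph view makes visible.
names_per_ssn = defaultdict(set)
for r in rows:
    names_per_ssn[r["SSN"]].add(r["Card_name"])
shared_ssns = {ssn: len(names) for ssn, names in names_per_ssn.items() if len(names) > 1}

print(len(flagged), shared_ssns)
```

In this stand-in every synthetic pair reaches the threshold of 16 and every recycled SSN is tied to 4 names, while the normal records are not flagged.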


<img src="Neo4j data SSN list.png" alt="insert here" style="width: 500px;"/>