### What are the common features in fraud detection?

In this chapter we survey the features studied in the literature for credit card fraud, insurance claim fraud, healthcare fraud and payment fraud. Readers will learn the nature of the problem in each type of fraud and the basic features that serve as a starting point for more creative ones.


### (1) Features for credit card fraud

Credit card fraud comes in two basic types: application fraud and transaction fraud. Application fraud is similar to identity fraud: one person uses another person's personal data to obtain a new card. Transaction fraud happens when a stolen or lost card is used to conduct fraudulent transactions. There has also been a significant rise in counterfeit cards.

A fraudster will typically try to abuse the card as much as possible in a short period of time, before the fraud is detected and the card is suspended. We should therefore expect to see abnormally frequent transactions or abnormally high amounts. Features are created by aggregating transactions over different time periods to help capture changes in spending behavior.

To demonstrate the creation of features, we use the hypothetical credit card transactions below:


|Merchant id | Merchant category code |Merchant city | Merchant state | Time | Transaction method | Transaction type | Amount |
|:--: |:--:|:--: | :--: |--: |:--: |:--:|--:|
|K2203 | BC | BOS |MA | 9:02 | Magnetic | Retail | 100.10|
|L3425 | GD | NYC |NY | 9:10 | Magnetic | Retail | 40.10 |
|F3928 | VS | NYC |NY | 10:20| Chip | Retail | 5.10 |
|W9843 | TY | POR |ME | 13:20|Magnetic | Internet| 200.00 |


Aggregating the transactions by minimum, maximum, mean or sum can reveal many insights. Hundreds of features can be created from the transaction data. Below are some suggestions:

<ul>
     <li><b>Aggregation by time:</b>
        <ul>
            <li>Average or maximum amount spent per transaction in the past one week, two weeks or XX weeks</li>
            <li>Average or maximum amount spent per day in the past one week, two weeks or XX weeks</li>
            <li>Average or maximum amount by merchant category in the past one week, two weeks or XX weeks</li>
        </ul>
    </li>
    <ul></ul>
    <li><b>Aggregation by merchant category code:</b>
        <ul>
            <li>Average amount spent per day over a 30-day period, on all transactions up to this one with the same merchant type as this transaction</li>
            <li>Total number of transactions with the same merchant during the past 30 days</li>
            <li>Average weekly amount spent during the past 3 months on the same merchant type as this transaction</li>
        </ul>
    </li>
    <ul></ul>
    <li><b>Aggregation by merchant location and time:</b>
    The first two transactions in the above table happened in Boston (BOS) and New York City (NYC) within 8 minutes of each other. The card has likely been compromised.
        <ul>
            <li>Number of retail locations per day and the time elapsed between locations in the past one week, two weeks or XX weeks</li>
            <li>Minimum number of minutes between transactions at two different retail locations in the past one week, two weeks or XX weeks</li>
        </ul>
    </li>
    <li><b>Aggregation by transaction method:</b>
    Transactions by magnetic stripe are more prone to fraud than chip or PIN transactions, so we can aggregate amounts by transaction method per card.
        <ul>
            <li>Average amount by transaction method per day in the past one week, two weeks or XX weeks</li>
            <li>Number of transactions by transaction method per day in the past one week, two weeks or XX weeks</li>
        </ul>
    </li>
</ul>
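As a rough illustration of the aggregations listed above, the sketch below computes a few of them for a single card using only the Python standard library. The rows reuse the hypothetical table; the calendar date, the dictionary layout and the helper names (`window_features`, `min_minutes_between_cities`, `amount_by_method`) are assumptions made for this example and do not come from any cited paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical transactions for one card, taken from the table above.
# The calendar date is an assumption; the table only gives the time of day.
transactions = [
    {"time": datetime(2020, 1, 6, 9, 2), "city": "BOS", "method": "Magnetic", "amount": 100.10},
    {"time": datetime(2020, 1, 6, 9, 10), "city": "NYC", "method": "Magnetic", "amount": 40.10},
    {"time": datetime(2020, 1, 6, 10, 20), "city": "NYC", "method": "Chip", "amount": 5.10},
    {"time": datetime(2020, 1, 6, 13, 20), "city": "POR", "method": "Magnetic", "amount": 200.00},
]


def window_features(txns, as_of, days):
    """Count, average and maximum amount over the `days` days before `as_of`."""
    start = as_of - timedelta(days=days)
    amounts = [t["amount"] for t in txns if start < t["time"] <= as_of]
    if not amounts:
        return {"count": 0, "avg_amount": 0.0, "max_amount": 0.0}
    return {
        "count": len(amounts),
        "avg_amount": sum(amounts) / len(amounts),
        "max_amount": max(amounts),
    }


def min_minutes_between_cities(txns):
    """Smallest gap (minutes) between consecutive transactions in different cities."""
    ordered = sorted(txns, key=lambda t: t["time"])
    gaps = [
        (b["time"] - a["time"]).total_seconds() / 60
        for a, b in zip(ordered, ordered[1:])
        if a["city"] != b["city"]
    ]
    return min(gaps) if gaps else None


def amount_by_method(txns):
    """Total amount per transaction method."""
    totals = defaultdict(float)
    for t in txns:
        totals[t["method"]] += t["amount"]
    return dict(totals)


as_of = datetime(2020, 1, 6, 23, 59)
week = window_features(transactions, as_of, days=7)
gap = min_minutes_between_cities(transactions)
by_method = amount_by_method(transactions)
```

Calling `window_features` with `days=14` or any other window gives the "two weeks or XX weeks" variants. In practice the same functions would be applied per card and per transaction, using only the transactions that precede the one being scored.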


Unfortunately, hardly any credit card datasets are publicly available for study because of the private nature of financial transactions. Lopez-Rojas et al. (2016), in their paper [PaySim: A financial mobile money simulator for fraud detection](https://www.researchgate.net/profile/Stefan_Axelsson4/publication/313138956_PAYSIM_A_FINANCIAL_MOBILE_MONEY_SIMULATOR_FOR_FRAUD_DETECTION/links/5890f87e92851cda2568a295/PAYSIM-A-FINANCIAL-MOBILE-MONEY-SIMULATOR-FOR-FRAUD-DETECTION.pdf), propose a simulation tool called PaySim that generates synthetic transactions resembling their original mobile money transaction dataset. The synthetic [dataset](https://www.kaggle.com/ntnu-testimon/paysim1/data) is available on Kaggle.com.

### (2) Features for healthcare fraud

##### The nature of the problem: medical fraud and abuse

The U.S. Department of Health and Human Services, in the pamphlet [Avoiding Medicare Fraud and Abuse: A Roadmap for Physicians](https://www.care1st.com/media/pdf/corporate/2014FWA/Avoiding_Medicare_FandA_Physicians_FactSheet_905645.pdf), states that "most physicians strive to work ethically, render high-quality medical care to their patients, and submit proper claims for payment," yet "the presence of some dishonest health care providers who exploit the health care system for illegal personal gain has created the need for laws that combat fraud and abuse and ensure appropriate quality medical care."

Unfortunately, medical fraud and abuse do exist. Medical fraud happens when a physician knowingly submits false claims or makes misrepresentations. For example, a fraudulent physician may collude with a pharmacy to add more expensive medicines to a prescription claim without the patient's awareness. This false drug prescription claim is then submitted to the insurer for reimbursement.

Medical abuse happens when any practice is inconsistent with providing patients with services that are medically necessary, meet professionally recognized standards, and are priced fairly. Typical examples include billing for unnecessary medical services, charging excessively, or misusing codes on a claim such as upcoding or unbundling codes.

Therefore, in healthcare fraud detection the focus is on features that can describe suspicious, potentially fraudulent medical practice or abuse of medical resources.

##### Types of medical fraud and abuse

A medical bill typically includes the pharmacy name, the time, the physician identifier, the patient identifier, the drug identifier, the quantity and the cost. Features can be derived from these raw transactions.

[Sparrow (2006)](https://repository.library.georgetown.edu/handle/10822/930062) describes two types of fraud: "hit-and-run" and "steal a little, all the time." "Hit-and-run" fraudsters submit many fraudulent claims in a short period of time and then disappear. "Steal a little, all the time" fraudsters file small claims over a long period of time. How can we capture these types of fraudulent billing behavior in feature engineering? The aggregation strategy from the previous section, "Features for credit card fraud," can help. Hit-and-run fraud should exhibit an abnormally high number of claims or an abnormally high amount within a very short period of time. "Steal a little, all the time" fraud may be discovered by aggregating over a long period of time and looking for anomalies in the data.
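The two patterns can be contrasted with a short-window feature and a long-horizon feature computed per provider. The following is a minimal sketch under stated assumptions: the claim records, the 7-day window and the feature names are all hypothetical and chosen only to illustrate the idea, not taken from Sparrow (2006).

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical claims: (provider_id, claim_date, amount). All values are made up.
claims = [
    ("P1", date(2020, 3, 1), 900.0),
    ("P1", date(2020, 3, 2), 950.0),
    ("P1", date(2020, 3, 3), 870.0),
    ("P1", date(2020, 3, 4), 990.0),
    ("P2", date(2020, 1, 15), 40.0),
    ("P2", date(2020, 2, 15), 45.0),
    ("P2", date(2020, 3, 15), 42.0),
    ("P2", date(2020, 4, 15), 41.0),
]


def max_claims_in_window(dates, days):
    """Largest number of claims falling in any window of `days` consecutive days."""
    dates = sorted(dates)
    best, left = 0, 0
    for right, current in enumerate(dates):
        # Shrink the window from the left until it spans fewer than `days` days.
        while current - dates[left] >= timedelta(days=days):
            left += 1
        best = max(best, right - left + 1)
    return best


def provider_features(claims, short_days=7):
    """Short-window burst size plus long-horizon activity per provider."""
    by_provider = defaultdict(list)
    for provider, claim_date, amount in claims:
        by_provider[provider].append((claim_date, amount))
    features = {}
    for provider, rows in by_provider.items():
        dates = [d for d, _ in rows]
        features[provider] = {
            "max_claims_short_window": max_claims_in_window(dates, short_days),
            "active_months": len({(d.year, d.month) for d in dates}),
            "total_amount": sum(a for _, a in rows),
        }
    return features


features = provider_features(claims)
```

In this toy data, provider `P1` concentrates all of its claims in a few days (a high `max_claims_short_window`), while `P2` bills small amounts across many months (a high `active_months` with a low burst size). Whether either value is abnormal can only be judged against the distribution over all providers.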

Sparrow (2006) categorizes healthcare fraud into seven levels, and [Thornton et al. (2013)](http://www.sciencedirect.com/science/article/pii/S2212017313002946) build upon these seven levels to develop a Medicaid multidimensional schema. In the following discussion I summarize them into three levels and propose the basic features.

| Level | Relationship | Description |
| :------------- |:-----------------------| :-----|
| 1 | In a single claim, or a transaction | The claim itself |
| 2 | Between a patient and a provider | For all their claims |
| 3a | A patient with many claims | A patient (maybe phantom) and all of its claims |
| 3b | A provider with all its claims | One provider and all of its claims and related patients |
| 4a | Between an insurer policy and a provider | Patients that are covered by the same insurance policy and are targeted by one provider |
| 4b | Between a patient and a group of providers | One patient being targeted by multiple providers within a practice |
| 5 | An insurer policy with a group of providers | Patients with the same policy being targeted by multiple providers within a practice |
| 6a | A group of patients | Groups of patients being targeted by providers (e.g. patients living in the same location) |
| 6b | A group of providers | Groups of providers targeting their patients |
| 7 | Multiparty, criminal conspiracies | Multiparty conspiracies that could involve many relationships |
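One way to read the table is that each level corresponds to a different grouping key over the same claim records. The sketch below is an illustration only: the record fields (`patient`, `provider`, `amount`) and the `aggregate` helper are hypothetical names, and it covers just levels 2, 3a and 3b.

```python
from collections import defaultdict

# Hypothetical claim records; identifiers and amounts are made up.
claims = [
    {"patient": "A", "provider": "X", "amount": 120.0},
    {"patient": "A", "provider": "X", "amount": 80.0},
    {"patient": "A", "provider": "Y", "amount": 50.0},
    {"patient": "B", "provider": "X", "amount": 200.0},
]


def aggregate(claims, key_fields):
    """Group claims by the given fields; return claim count and total amount per group."""
    groups = defaultdict(lambda: {"n_claims": 0, "total_amount": 0.0})
    for claim in claims:
        key = tuple(claim[field] for field in key_fields)
        groups[key]["n_claims"] += 1
        groups[key]["total_amount"] += claim["amount"]
    return dict(groups)


level_2 = aggregate(claims, ["patient", "provider"])  # patient-provider pairs
level_3a = aggregate(claims, ["patient"])             # one patient, all of its claims
level_3b = aggregate(claims, ["provider"])            # one provider, all of its claims

# Number of distinct patients per provider, another level-3b feature.
patients_per_provider = defaultdict(set)
for claim in claims:
    patients_per_provider[claim["provider"]].add(claim["patient"])
n_patients = {provider: len(p) for provider, p in patients_per_provider.items()}
```

Higher levels would follow the same pattern with additional fields in the key, such as an insurance policy or a practice identifier, provided those fields exist in the data.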

##### Feature engineering on medical fraud and abuse: one real example

The following features are from [Table 3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4770922/table/T3/) in the paper [Improving Fraud and Abuse Detection in General Physician Claims: A Data Mining Study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4770922/) by Joudaki et al. (2015). In their data mining study they created features for each physician, standardized the values using Z scores, used a hierarchical clustering method to segment physicians, and then identified clusters of physicians that were suspect for abuse and fraud. The study used 9 features to identify physician abuse and 5 features to identify fraudulent behavior.

Features of medical resource abuse:
* Percentage of patients who were visited more than once in a month
* The average number of prescribed drug items in a claim
* The average cost of a drug prescription claim
* The ratio of prescriptions for the 5 expensive antibiotics to all physician claims
* The ratio of injection prescriptions to all physician claims
* The ratio of total injection prescriptions to all physician claims
* The ratio of total prescribed antibiotics to all physician claims
* The ratio of injected antibiotics to all physician claims
* The ratio of injected corticosteroid prescriptions to all physician claims

Features of fraudulent behaviors:
* Percentage of reduplicative patients
* Percentage of reduplicative patients-pharmacy
* Percentage of reduplicative patients-pharmacy in a month
* The average cost of a drug prescription claim
* The ratio of claims referred to a high-cost pharmacy
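The standardization step mentioned above can be sketched as follows. This is a minimal example, not the authors' code: the physician identifiers and ratio values are made up, and the use of the population standard deviation is an assumption. The hierarchical clustering step is not shown.

```python
from statistics import mean, pstdev

# Hypothetical per-physician values of one feature (made-up numbers):
# the ratio of injection prescriptions to all physician claims.
injection_ratio = {"D1": 0.10, "D2": 0.12, "D3": 0.11, "D4": 0.45}


def z_scores(values):
    """Standardize a {physician: value} mapping to zero mean and unit variance."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption)
    if sigma == 0:
        return {key: 0.0 for key in values}
    return {key: (value - mu) / sigma for key, value in values.items()}


z = z_scores(injection_ratio)
most_extreme = max(z, key=z.get)  # physician with the largest standardized value
```

Standardizing each feature this way puts ratios, percentages and average costs on a comparable scale before they are passed to a distance-based method such as clustering.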


##### Understand ICD10, CPT and E/M codes to create features

Medical procedures are highly codified for diagnostic, billing and analytic purposes, and features are created from these medical codes. We therefore dedicate a section to a brief description of the coding systems.

[ICD-10](https://en.wikipedia.org/wiki/ICD-10) is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO). It contains codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. ICD-9 is the 9th revision.


<img src="ICD-10.jpeg" alt="Drawing" style="width: 300px;"/>

[The Current Procedural Terminology (CPT) code set](https://en.wikipedia.org/wiki/Current_Procedural_Terminology#Codes_for_evaluation_and_management:_99201.E2.80.9399499) is similar to ICD-9 and ICD-10 coding, except that it identifies the services rendered rather than the diagnosis on the claim. The CPT code set (copyright protected by the AMA) is designed to communicate uniform information about medical services and procedures among physicians, coders, patients and payers for administrative, financial, and analytical purposes. Each CPT code is five characters long. CPT codes are divided into three categories. Category I contains the most common codes, which describe most of the procedures performed in inpatient and outpatient offices and hospitals. Category II codes are supplemental tracking codes for performance management. Category III codes are temporary codes that relate to emerging and experimental technologies, services, and procedures. Category I CPT codes are divided into six large sections. Because these sections contain the most frequently used codes, I list them below:

| Category I code range | Section |
| :------------- |:----------|
| 99201–99499 | Evaluation and management |
| 00100–01999; 99100–99150 | Anesthesia |
| 10000–69990 | Surgery |
| 70000–79999 | Radiology |
| 80000–89398 | Pathology and laboratory |
| 90281–99099; 99151–99199; 99500–99607 | Medicine |
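Since the section of a Category I code is determined by its numeric range, the table above can be turned into a simple lookup that produces a categorical feature for each claim line. The sketch below assumes the ranges exactly as listed in the table; the function name `cpt_section` is hypothetical, and codes that are not five ASCII digits are simply mapped to `None`.

```python
# Category I ranges copied from the table above.
CATEGORY_I_SECTIONS = [
    (99201, 99499, "Evaluation and management"),
    (100, 1999, "Anesthesia"),
    (99100, 99150, "Anesthesia"),
    (10000, 69990, "Surgery"),
    (70000, 79999, "Radiology"),
    (80000, 89398, "Pathology and laboratory"),
    (90281, 99099, "Medicine"),
    (99151, 99199, "Medicine"),
    (99500, 99607, "Medicine"),
]


def cpt_section(code):
    """Return the Category I section for a five-digit CPT code string, or None."""
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        return None  # not a five-digit numeric code
    number = int(code)
    for low, high, section in CATEGORY_I_SECTIONS:
        if low <= number <= high:
            return section
    return None  # numeric, but outside every range in the table


section = cpt_section("99214")
```

Counting claims per section for each provider is one straightforward way to turn this lookup into aggregate features.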

[Evaluation and Management coding](https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/Downloads/eval-mgmt-serv-guide-ICN006764.pdf) (commonly known as E/M coding or E&M coding) is a medical coding process for medical billing that is based on CPT codes. Below I show some of the E/M code ranges. For example, a visit at a physician's office may be coded as 99214.

| Evaluation and management in Category I | Services   |
| :------------- |:----------|
| 99201–99215 | Office/other outpatient services  |
| 99217–99220 | Hospital observation services     |
| 99221–99239 | Hospital inpatient services       |
| 99241–99255 | Consultations                     |
| 99281–99288 | Emergency department services     |
| ... | ...    |
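The E/M ranges can be used in the same way to describe each provider's billing mix. The following sketch is illustrative only: the `(provider, code)` pairs are made up, the group names are hypothetical labels for the rows of the table above, and any code outside the listed ranges falls into an `"other"` bucket because the table is not exhaustive.

```python
from collections import Counter, defaultdict

# E/M ranges copied from the table above (the table is not exhaustive).
EM_GROUPS = [
    (99201, 99215, "office_outpatient"),
    (99217, 99220, "hospital_observation"),
    (99221, 99239, "hospital_inpatient"),
    (99241, 99255, "consultation"),
    (99281, 99288, "emergency_department"),
]


def em_group(code):
    """Map a numeric CPT code string to an E/M service group, or 'other'."""
    number = int(code)
    for low, high, name in EM_GROUPS:
        if low <= number <= high:
            return name
    return "other"


# Hypothetical (provider_id, cpt_code) pairs; all values are made up.
billed = [
    ("X", "99213"), ("X", "99214"), ("X", "99283"), ("X", "99214"),
    ("Y", "99223"), ("Y", "99244"),
]


def em_shares(billed):
    """Share of each provider's billed codes that falls in each E/M group."""
    counts = defaultdict(Counter)
    for provider, code in billed:
        counts[provider][em_group(code)] += 1
    return {
        provider: {group: n / sum(c.values()) for group, n in c.items()}
        for provider, c in counts.items()
    }


shares = em_shares(billed)
```

A provider whose shares differ sharply from those of comparable providers is not necessarily committing fraud, but such a difference is the kind of signal these features are meant to expose for further review.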
