## HEALTHCARE PROVIDER FRAUD DETECTION ANALYSIS

Description  

https://www.kaggle.com/rohitrox/healthcare-provider-fraud-detection-analysis  

Project Objectives

Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has increased sharply because of fraudulent Medicare claims. Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to file fraudulent claims.

Rigorous analysis of Medicare data has identified many physicians who engage in fraud. A common approach is to use an ambiguous diagnosis code to justify the costliest procedures and drugs. Insurance companies are the institutions most exposed to these practices. As a result, insurers have raised their premiums, and healthcare is becoming more expensive day by day.

Healthcare fraud and abuse take many forms. Some of the most common types of fraud by providers are:

a) Billing for services that were not provided.

b) Duplicate submission of a claim for the same service.

c) Misrepresenting the service provided.

d) Charging for a more complex or expensive service than was actually provided.

e) Billing for a covered service when the service actually provided was not covered.

Problem Statement

The goal of this project is to predict potentially fraudulent providers based on the claims they file. Along the way, we will identify the variables that are most helpful in detecting the behaviour of potentially fraudulent providers. We will also study fraudulent patterns in providers' claims to understand their likely future behaviour.

Introduction to the Dataset

For this project, we consider the inpatient claims, outpatient claims, and beneficiary details associated with each provider. Let's look at each in turn:

A) Inpatient Data

This data covers claims filed for patients who were admitted to hospital. It also provides additional details such as admission and discharge dates and the admitting diagnosis code.

B) Outpatient Data

This data covers claims filed for patients who visited a hospital but were not admitted.

C) Beneficiary Details Data

This data contains beneficiary KYC details such as health conditions and the region they belong to.

In [1]:
# import libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [8,10]
plt.style.use("fivethirtyeight")
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV

### 1. Data Exploration

In [2]:
train_df = pd.read_csv('healthcare-provider-fraud-detection-analysis/Train-1542865627584.csv')

In [3]:
train_df.head()

Unnamed: 0,Provider,PotentialFraud
0,PRV51001,No
1,PRV51003,Yes
2,PRV51004,No
3,PRV51005,Yes
4,PRV51007,No


In [4]:
train_df[train_df['PotentialFraud']=='Yes'].count()

Provider          506
PotentialFraud    506
dtype: int64
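
The count above gives the number of providers flagged as fraudulent, but not how that compares with the unflagged ones. A minimal sketch of checking the class balance with `value_counts` follows. The `toy_train` frame is a stand-in built from the five rows shown in the `head()` output, not the full training file, so its proportions are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_df: the five rows shown in the head() output above.
toy_train = pd.DataFrame({
    "Provider": ["PRV51001", "PRV51003", "PRV51004", "PRV51005", "PRV51007"],
    "PotentialFraud": ["No", "Yes", "No", "Yes", "No"],
})

# Absolute counts per label.
label_counts = toy_train["PotentialFraud"].value_counts()

# Share of each label; normalize=True divides by the number of rows.
label_share = toy_train["PotentialFraud"].value_counts(normalize=True)

print(label_counts)
print(label_share.round(2))
```

On the real data the same two calls on `train_df["PotentialFraud"]` would show how imbalanced the target is, which matters when choosing between accuracy and F1 as the evaluation metric.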

In [5]:
outpatient_df = pd.read_csv('healthcare-provider-fraud-detection-analysis/Train_Outpatientdata-1542865627584.csv')

In [6]:
outpatient_df.head()

Unnamed: 0,BeneID,ClaimID,ClaimStartDt,ClaimEndDt,Provider,InscClaimAmtReimbursed,AttendingPhysician,OperatingPhysician,OtherPhysician,ClmDiagnosisCode_1,ClmDiagnosisCode_2,ClmDiagnosisCode_3,ClmDiagnosisCode_4,ClmDiagnosisCode_5,ClmDiagnosisCode_6,ClmDiagnosisCode_7,ClmDiagnosisCode_8,ClmDiagnosisCode_9,ClmDiagnosisCode_10,ClmProcedureCode_1,ClmProcedureCode_2,ClmProcedureCode_3,ClmProcedureCode_4,ClmProcedureCode_5,ClmProcedureCode_6,DeductibleAmtPaid,ClmAdmitDiagnosisCode
0,BENE11002,CLM624349,2009-10-11,2009-10-11,PRV56011,30,PHY326117,,,78943,V5866,V1272,,,,,,,,,,,,,,0,56409.0
1,BENE11003,CLM189947,2009-02-12,2009-02-12,PRV57610,80,PHY362868,,,6115,,,,,,,,,,,,,,,,0,79380.0
2,BENE11003,CLM438021,2009-06-27,2009-06-27,PRV57595,10,PHY328821,,,2723,,,,,,,,,,,,,,,,0,
3,BENE11004,CLM121801,2009-01-06,2009-01-06,PRV56011,40,PHY334319,,,71988,,,,,,,,,,,,,,,,0,
4,BENE11004,CLM150998,2009-01-22,2009-01-22,PRV56011,200,PHY403831,,,82382,30000,72887,4280.0,7197.0,V4577,,,,,,,,,,,0,71947.0


In [7]:
inpatient_df = pd.read_csv('healthcare-provider-fraud-detection-analysis/Train_Inpatientdata-1542865627584.csv')

In [8]:
inpatient_df.head()

Unnamed: 0,BeneID,ClaimID,ClaimStartDt,ClaimEndDt,Provider,InscClaimAmtReimbursed,AttendingPhysician,OperatingPhysician,OtherPhysician,AdmissionDt,ClmAdmitDiagnosisCode,DeductibleAmtPaid,DischargeDt,DiagnosisGroupCode,ClmDiagnosisCode_1,ClmDiagnosisCode_2,ClmDiagnosisCode_3,ClmDiagnosisCode_4,ClmDiagnosisCode_5,ClmDiagnosisCode_6,ClmDiagnosisCode_7,ClmDiagnosisCode_8,ClmDiagnosisCode_9,ClmDiagnosisCode_10,ClmProcedureCode_1,ClmProcedureCode_2,ClmProcedureCode_3,ClmProcedureCode_4,ClmProcedureCode_5,ClmProcedureCode_6
0,BENE11001,CLM46614,2009-04-12,2009-04-18,PRV55912,26000,PHY390922,,,2009-04-12,7866,1068.0,2009-04-18,201,1970,4019,5853,7843.0,2768,71590.0,2724.0,19889.0,5849.0,,,,,,,
1,BENE11001,CLM66048,2009-08-31,2009-09-02,PRV55907,5000,PHY318495,PHY318495,,2009-08-31,6186,1068.0,2009-09-02,750,6186,2948,56400,,,,,,,,7092.0,,,,,
2,BENE11001,CLM68358,2009-09-17,2009-09-20,PRV56046,5000,PHY372395,,PHY324689,2009-09-17,29590,1068.0,2009-09-20,883,29623,30390,71690,34590.0,V1581,32723.0,,,,,,,,,,
3,BENE11011,CLM38412,2009-02-14,2009-02-22,PRV52405,5000,PHY369659,PHY392961,PHY349768,2009-02-14,431,1068.0,2009-02-22,67,43491,2762,7843,32723.0,V1041,4254.0,25062.0,40390.0,4019.0,,331.0,,,,,
4,BENE11014,CLM63689,2009-08-13,2009-08-30,PRV56614,10000,PHY379376,PHY398258,,2009-08-13,78321,1068.0,2009-08-30,975,42,3051,34400,5856.0,42732,486.0,5119.0,29620.0,20300.0,,3893.0,,,,,


In [18]:
inpatient_df.isna().sum()

BeneID                        0
ClaimID                       0
ClaimStartDt                  0
ClaimEndDt                    0
Provider                      0
InscClaimAmtReimbursed        0
AttendingPhysician          112
OperatingPhysician        16644
OtherPhysician            35784
AdmissionDt                   0
ClmAdmitDiagnosisCode         0
DeductibleAmtPaid           899
DischargeDt                   0
DiagnosisGroupCode            0
ClmDiagnosisCode_1            0
ClmDiagnosisCode_2          226
ClmDiagnosisCode_3          676
ClmDiagnosisCode_4         1534
ClmDiagnosisCode_5         2894
ClmDiagnosisCode_6         4838
ClmDiagnosisCode_7         7258
ClmDiagnosisCode_8         9942
ClmDiagnosisCode_9        13497
ClmDiagnosisCode_10       36547
ClmProcedureCode_1        17326
ClmProcedureCode_2        35020
ClmProcedureCode_3        39509
ClmProcedureCode_4        40358
ClmProcedureCode_5        40465
ClmProcedureCode_6        40474
dtype: int64
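
Raw counts are easier to judge as a share of all rows. The sketch below converts a subset of the counts printed above into fractions, assuming the inpatient table has 40,474 rows (the number of missing values in `ClmProcedureCode_6`, which `describe()` reports as entirely empty). The 0.8 cut-off for "mostly empty" is an arbitrary illustration, not a threshold from the original analysis.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above (subset of columns).
missing_counts = pd.Series({
    "AttendingPhysician": 112,
    "OperatingPhysician": 16644,
    "OtherPhysician": 35784,
    "DeductibleAmtPaid": 899,
    "ClmDiagnosisCode_10": 36547,
    "ClmProcedureCode_1": 17326,
    "ClmProcedureCode_6": 40474,
})

# Assumed total number of inpatient rows (see the lead-in).
n_rows = 40474

# Fraction of rows missing per column, largest first.
missing_share = (missing_counts / n_rows).sort_values(ascending=False)

# Columns that are missing in more than 80% of rows (illustrative threshold).
mostly_empty = missing_share[missing_share > 0.8].index.tolist()

print(missing_share.round(3))
print(mostly_empty)
```

With the real frame loaded, `inpatient_df.isna().mean()` yields the same fractions directly.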

In [19]:
inpatient_df[inpatient_df['DeductibleAmtPaid'].isna()]

Unnamed: 0,BeneID,ClaimID,ClaimStartDt,ClaimEndDt,Provider,InscClaimAmtReimbursed,AttendingPhysician,OperatingPhysician,OtherPhysician,AdmissionDt,ClmAdmitDiagnosisCode,DeductibleAmtPaid,DischargeDt,DiagnosisGroupCode,ClmDiagnosisCode_1,ClmDiagnosisCode_2,ClmDiagnosisCode_3,ClmDiagnosisCode_4,ClmDiagnosisCode_5,ClmDiagnosisCode_6,ClmDiagnosisCode_7,ClmDiagnosisCode_8,ClmDiagnosisCode_9,ClmDiagnosisCode_10,ClmProcedureCode_1,ClmProcedureCode_2,ClmProcedureCode_3,ClmProcedureCode_4,ClmProcedureCode_5,ClmProcedureCode_6
20,BENE11057,CLM36789,2009-02-03,2009-03-10,PRV51393,13000,PHY326399,PHY430692,,2009-02-03,78605,,2009-03-10,167,41519,53081,486,45340,73679,42731,04185,7907,4150,,3995.0,,,,,
21,BENE11057,CLM38115,2009-02-12,2009-02-18,PRV51342,10000,PHY375861,,,2009-02-12,0389,,2009-02-18,853,99591,78552,V4511,2760,27651,486,41071,25000,2859,,,,,,,
129,BENE11494,CLM66768,2009-09-05,2009-09-15,PRV51501,18000,PHY327234,PHY348953,,2009-09-05,41071,,2009-09-15,255,4280,5990,45829,V4581,496,4110,41401,5716,25000,,66.0,,,,,
159,BENE11592,CLM34790,2009-01-21,2009-01-26,PRV55262,6000,PHY424100,,,2009-01-21,99664,,2009-01-26,668,99664,3371,03811,70703,2869,27801,3569,58881,E8796,,,,,,,
177,BENE11670,CLM37086,2009-02-05,2009-02-08,PRV52019,400,PHY416959,,,2009-02-05,6084,,2009-02-08,711,60490,V1582,41401,56210,32723,V854,42731,,,,,,,,,
196,BENE11758,CLM31378,2008-12-28,2009-01-04,PRV53388,7000,PHY333863,,,2008-12-28,6271,,2009-01-04,740,6273,2749,7230,42830,5859,29411,2448,V5867,V103,,,,,,,
335,BENE12225,CLM39949,2009-02-25,2009-03-12,PRV56276,13000,,PHY404549,,2009-02-25,6079,,2009-03-12,689,40391,5679,5856,5990,3051,5855,99673,25040,2724,,5493.0,,,,,
347,BENE12255,CLM43290,2009-03-20,2009-03-24,PRV51245,3000,PHY315243,,,2009-03-20,78791,,2009-03-24,629,7837,V1254,78321,4589,4019,78060,2769,73300,311,,,,,,,
354,BENE12298,CLM53022,2009-05-28,2009-05-30,PRV56748,4000,PHY383151,PHY423869,PHY361104,2009-05-28,53550,,2009-05-30,372,56211,42731,29410,4589,2749,60000,2449,41401,56400,,4311.0,4019.0,,,,
402,BENE12503,CLM69108,2009-09-22,2009-09-25,PRV52072,4000,PHY427660,PHY313850,,2009-09-22,78060,,2009-09-25,855,78060,3569,44489,4241,2753,20280,42731,4019,78791,,,,,,,
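
These rows carry a reimbursement amount but no deductible. One possible way to handle them, sketched below on a toy frame, is to record the missingness in a flag column before imputing anything. Both the `DeductibleMissing` and `DeductibleAmtPaid_filled` column names and the choice of 0 as the fill value are assumptions made for illustration; the dataset does not document what a missing deductible means.

```python
import numpy as np
import pandas as pd

# Toy stand-in for inpatient_df: two complete rows and two rows with a
# missing deductible, taken from the outputs above.
toy_inpatient = pd.DataFrame({
    "ClaimID": ["CLM46614", "CLM66048", "CLM36789", "CLM38115"],
    "InscClaimAmtReimbursed": [26000, 5000, 13000, 10000],
    "DeductibleAmtPaid": [1068.0, 1068.0, np.nan, np.nan],
})

# Record which rows were missing before any imputation.
toy_inpatient["DeductibleMissing"] = toy_inpatient["DeductibleAmtPaid"].isna()

# Assumption: treat a missing deductible as "nothing paid".
# This is a modelling choice, not something the data confirms.
toy_inpatient["DeductibleAmtPaid_filled"] = toy_inpatient["DeductibleAmtPaid"].fillna(0)

print(toy_inpatient)
```

Keeping the flag means a later model can still distinguish imputed rows from observed ones.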


In [10]:
inpatient_df['BeneID'].unique().size

31289
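
With 31,289 distinct beneficiaries, some patients must appear in more than one inpatient claim. A small sketch of counting claims per beneficiary follows, using the five rows from the inpatient `head()` output as a toy frame; `Series.nunique()` is an equivalent, slightly shorter way to get the distinct count.

```python
import pandas as pd

# Toy stand-in for inpatient_df: BeneID and ClaimID from the head() output above.
toy_inpatient = pd.DataFrame({
    "BeneID": ["BENE11001", "BENE11001", "BENE11001", "BENE11011", "BENE11014"],
    "ClaimID": ["CLM46614", "CLM66048", "CLM68358", "CLM38412", "CLM63689"],
})

# Number of distinct beneficiaries (same result as .unique().size).
n_beneficiaries = toy_inpatient["BeneID"].nunique()

# Number of claims filed for each beneficiary.
claims_per_beneficiary = toy_inpatient.groupby("BeneID")["ClaimID"].count()

print(n_beneficiaries)
print(claims_per_beneficiary)
```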

In [11]:
inpatient_df.describe()

Unnamed: 0,InscClaimAmtReimbursed,DeductibleAmtPaid,ClmProcedureCode_1,ClmProcedureCode_2,ClmProcedureCode_3,ClmProcedureCode_4,ClmProcedureCode_5,ClmProcedureCode_6
count,40474.0,39575.0,23148.0,5454.0,965.0,116.0,9.0,0.0
mean,10087.884074,1068.0,5894.611759,4103.738174,4226.35544,4070.172414,5269.444444,
std,10303.099402,0.0,3049.3044,2028.182156,2282.761581,1994.409802,2780.071632,
min,0.0,1068.0,11.0,42.0,42.0,42.0,2724.0,
25%,4000.0,1068.0,3848.0,2724.0,2724.0,2758.75,4139.0,
50%,7000.0,1068.0,5369.0,4019.0,4019.0,4019.0,4139.0,
75%,12000.0,1068.0,8666.25,4439.0,5185.0,4439.0,5185.0,
max,125000.0,1068.0,9999.0,9999.0,9999.0,9986.0,9982.0,
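
Note that `describe()` summarises the `ClmProcedureCode_*` columns as numbers because pandas parsed them as floats, yet they are codes, so their mean and standard deviation carry no meaning. A hedged sketch of converting such a column to a nullable string type is shown below on the `ClmProcedureCode_1` values from the inpatient `head()` output. It assumes every non-missing value is integer-valued; any leading zeros dropped during the original parse cannot be recovered this way.

```python
import numpy as np
import pandas as pd

# Toy stand-in: ClmProcedureCode_1 values from the inpatient head() output above.
toy_codes = pd.Series([np.nan, 7092.0, np.nan, 331.0, 3893.0], name="ClmProcedureCode_1")

# float -> nullable integer (drops the ".0") -> nullable string.
# Missing values stay missing (<NA>) instead of becoming the text "nan".
codes = toy_codes.astype("Int64").astype("string")

print(codes)

# describe() now reports count / unique / top / freq rather than mean and std.
print(codes.describe())
```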


In [12]:
beneficiary_df = pd.read_csv('healthcare-provider-fraud-detection-analysis/Train_Beneficiarydata-1542865627584.csv')

In [13]:
beneficiary_df.head()

Unnamed: 0,BeneID,DOB,DOD,Gender,Race,RenalDiseaseIndicator,State,County,NoOfMonths_PartACov,NoOfMonths_PartBCov,ChronicCond_Alzheimer,ChronicCond_Heartfailure,ChronicCond_KidneyDisease,ChronicCond_Cancer,ChronicCond_ObstrPulmonary,ChronicCond_Depression,ChronicCond_Diabetes,ChronicCond_IschemicHeart,ChronicCond_Osteoporasis,ChronicCond_rheumatoidarthritis,ChronicCond_stroke,IPAnnualReimbursementAmt,IPAnnualDeductibleAmt,OPAnnualReimbursementAmt,OPAnnualDeductibleAmt
0,BENE11001,1943-01-01,,1,1,0,39,230,12,12,1,2,1,2,2,1,1,1,2,1,1,36000,3204,60,70
1,BENE11002,1936-09-01,,2,1,0,39,280,12,12,2,2,2,2,2,2,2,2,2,2,2,0,0,30,50
2,BENE11003,1936-08-01,,1,1,0,52,590,12,12,1,2,2,2,2,2,2,1,2,2,2,0,0,90,40
3,BENE11004,1922-07-01,,1,1,0,39,270,12,12,1,1,2,2,2,2,1,1,1,1,2,0,0,1810,760
4,BENE11005,1935-09-01,,1,1,0,24,680,12,12,2,2,2,2,1,2,1,2,2,2,2,0,0,1790,1200


In [14]:
beneficiary_df.describe()

Unnamed: 0,Gender,Race,State,County,NoOfMonths_PartACov,NoOfMonths_PartBCov,ChronicCond_Alzheimer,ChronicCond_Heartfailure,ChronicCond_KidneyDisease,ChronicCond_Cancer,ChronicCond_ObstrPulmonary,ChronicCond_Depression,ChronicCond_Diabetes,ChronicCond_IschemicHeart,ChronicCond_Osteoporasis,ChronicCond_rheumatoidarthritis,ChronicCond_stroke,IPAnnualReimbursementAmt,IPAnnualDeductibleAmt,OPAnnualReimbursementAmt,OPAnnualDeductibleAmt
count,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0,138556.0
mean,1.570932,1.254511,25.666734,374.424745,11.907727,11.910145,1.667817,1.506322,1.687643,1.880041,1.762847,1.644476,1.398142,1.324143,1.725317,1.74318,1.920942,3660.346502,399.847296,1298.219348,377.718258
std,0.494945,0.717007,15.223443,266.277581,1.032332,0.936893,0.470998,0.499962,0.463456,0.324914,0.425339,0.478674,0.489517,0.468056,0.446356,0.436881,0.269831,9568.621827,956.175202,2493.901134,645.530187
min,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-8000.0,0.0,-70.0,0.0
25%,1.0,1.0,11.0,141.0,12.0,12.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,170.0,40.0
50%,2.0,1.0,25.0,340.0,12.0,12.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,0.0,0.0,570.0,170.0
75%,2.0,1.0,39.0,570.0,12.0,12.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2280.0,1068.0,1500.0,460.0
max,2.0,5.0,54.0,999.0,12.0,12.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,161470.0,38272.0,102960.0,13840.0
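
The summary shows that every `ChronicCond_*` column takes only the values 1 and 2. A minimal sketch of recoding them to 0/1 indicators follows, on a toy frame built from the beneficiary `head()` output. It assumes that 1 means the condition is present and 2 means it is absent; the notebook does not state the coding, so this should be checked against the dataset documentation before use.

```python
import pandas as pd

# Toy stand-in for beneficiary_df: two chronic-condition columns from the head() output above.
toy_beneficiary = pd.DataFrame({
    "BeneID": ["BENE11001", "BENE11002", "BENE11003", "BENE11004", "BENE11005"],
    "ChronicCond_Alzheimer": [1, 2, 1, 1, 2],
    "ChronicCond_Heartfailure": [2, 2, 2, 1, 2],
})

# Select every chronic-condition column by its shared prefix.
chronic_cols = [c for c in toy_beneficiary.columns if c.startswith("ChronicCond_")]

# Assumption: 1 = condition present, 2 = condition absent, so map 2 -> 0.
toy_beneficiary[chronic_cols] = toy_beneficiary[chronic_cols].replace({2: 0})

print(toy_beneficiary)
```

After recoding, the column mean reads directly as the share of beneficiaries with the condition.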


### 2. Combining Dataframes