# Notebook for data preparation

In [None]:
import os
import sys
import math
import scipy as scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
pd.set_option('display.max_columns',100)

### Project Objectives
Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has increased exponentially because of fraudulent Medicare claims. Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to file fraudulent claims.

Rigorous analysis of Medicare data has identified many physicians who engage in fraud. One common approach is to use an ambiguous diagnosis code to justify the costliest procedures and drugs. Insurance companies are the institutions most affected by these practices. As a consequence, they have raised their premiums, and healthcare is becoming more expensive day by day.

Healthcare fraud and abuse take many forms. Some of the most common types of fraud by providers are:

a) Billing for services that were not provided.

b) Duplicate submission of a claim for the same service.

c) Misrepresenting the service provided.

d) Charging for a more complex or expensive service than was actually provided.

e) Billing for a covered service when the service actually provided was not covered.

Problem Statement
The goal of this project is to predict potentially fraudulent providers based on the claims they file. Along the way, we will identify the variables that are most helpful in detecting the behaviour of potentially fraudulent providers. Further, we will study fraudulent patterns in the providers' claims to understand their future behaviour.

Introduction to the Dataset
For the purpose of this project, we are considering the Inpatient claims, Outpatient claims, and Beneficiary details of each provider. Let's look at their details.

### Data loading

The data are loaded from the public GitHub repository https://github.com/EY-Tech-Consulting-Denmark/Graphathon-ATP/tree/main/Data/raw_data

In [None]:
beneficiary = pd.read_csv("https://raw.githubusercontent.com/EY-Tech-Consulting-Denmark/Graphathon_2023-04-14/main/Data/raw_data/Train_Beneficiarydata-1542865627584.csv")
inpatient = pd.read_csv("https://raw.githubusercontent.com/EY-Tech-Consulting-Denmark/Graphathon_2023-04-14/main/Data/raw_data/Train_Inpatientdata-1542865627584.csv")
label = pd.read_csv("https://raw.githubusercontent.com/EY-Tech-Consulting-Denmark/Graphathon_2023-04-14/main/Data/raw_data/Train-1542865627584.csv")

#### Label data
This is a list of historical data about each provider in the overall dataset.  
This information makes it possible to detect patterns that will later help to identify whether new, unfamiliar providers are potentially fraudulent.


In [None]:
label.head()

In [None]:
label.shape

In [None]:
# class imbalance check
label_overview = label.groupby("PotentialFraud").size()
print(label_overview)
label_overview.plot(kind='bar')
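
Absolute counts can be complemented with class proportions, which make the degree of imbalance easier to read. The sketch below is a minimal illustration on a made-up label table (the provider IDs and class values are hypothetical, not taken from the dataset) using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical stand-in for the label table; values are illustrative only
toy_label = pd.DataFrame({
    "Provider": ["PRV1", "PRV2", "PRV3", "PRV4", "PRV5"],
    "PotentialFraud": ["No", "No", "No", "No", "Yes"],
})

# Share of each class instead of absolute counts
class_share = toy_label["PotentialFraud"].value_counts(normalize=True)
print(class_share)
```

A strongly skewed share is a hint that plain accuracy may be a misleading metric for a later model.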

#### Beneficiary data

This data contains beneficiary KYC details such as health conditions and the region they belong to.   
The dataset contains both patients who were admitted to the hospital and patients who were not.  
Data about patients who were not admitted to the hospital will be disregarded when merging with the other dataset.  


In [None]:
beneficiary.head()

In [None]:
beneficiary.shape

In [None]:
beneficiary.describe()

In [None]:
# missing values check
beneficiary.isna().sum()

A missing value in the "DOD" column implies that the patient is still alive.  
These missing values are later encoded as 0 in the 'IsDead' flag created in the feature engineering process.  

#### Inpatient data
This data provides insights about the claims filed for patients who were admitted to the hospital. It also provides additional details such as their admission and discharge dates and the admit diagnosis code.  
This dataset will later be merged with the Beneficiary dataset presented above.  

In [None]:
inpatient.head()

In [None]:
inpatient.shape

In [None]:
inpatient.describe()

The 'ClmProcedureCode_6' column is empty for every row, so it will be dropped in the feature engineering step.

In [None]:
# missing values check
inpatient.isna().sum()

Based on the context, the missing values are not errors: they indicate that a physician, diagnosis, condition, or procedure was simply not present, or that no deductible amount was paid (missing values of 'DeductibleAmtPaid').  
Therefore, no imputation technique is needed in the feature engineering step; missing deductible amounts are simply set to 0.

### Merging of the datasets

The Beneficiary dataset can be merged with the Inpatient dataset on the 'BeneID' column.  
To keep only patients admitted to the hospital, an inner join is used.  
The label dataset is then added through an inner join on the 'Provider' column.  

In [None]:
# Merging beneficiary and inpatient datasets
data = pd.merge(beneficiary, inpatient, on='BeneID', how='inner')
# adding the label
data = pd.merge(data, label, on='Provider', how='inner')

In [None]:
data.shape
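
A merge like this can silently duplicate rows if a key is not unique on the side where it is expected to be. The following sketch, built on hypothetical miniature tables (all IDs and values are made up), shows how the `validate` argument of `pd.merge` can assert the expected key relationship, under the assumption that each beneficiary and each provider appears once in its own table.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
toy_beneficiary = pd.DataFrame({"BeneID": ["B1", "B2", "B3"], "Gender": [1, 2, 1]})
toy_inpatient = pd.DataFrame({
    "BeneID": ["B1", "B1", "B2"],
    "ClaimID": ["C1", "C2", "C3"],
    "Provider": ["P1", "P2", "P1"],
})
toy_label = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})

# validate raises pandas.errors.MergeError if the key relationship is violated
toy_data = pd.merge(toy_beneficiary, toy_inpatient, on="BeneID",
                    how="inner", validate="one_to_many")
toy_data = pd.merge(toy_data, toy_label, on="Provider",
                    how="inner", validate="many_to_one")

# B3 has no inpatient claim and is dropped; one row per claim remains
print(toy_data.shape)
```

If the assumptions hold, the merged frame has exactly one row per inpatient claim, which is easy to compare against `inpatient.shape`.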

In [None]:
data.head()

### Feature engineering

In [None]:
# data types check
data.dtypes

In [None]:
# Fixing the date columns
date_cols = ['DOB', 'ClaimStartDt', 'DOD', 'ClaimEndDt', 'AdmissionDt', 'DischargeDt']
for date_col in date_cols:
    data[date_col] = pd.to_datetime(data[date_col])
data.head()

In [None]:
# Dropping column ClmProcedureCode_6 as it is empty for every claim
data.drop('ClmProcedureCode_6', axis=1, inplace=True)
data.head()

In [None]:
# Fixing procedures codes
procedure_cols = ['ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4', 'ClmProcedureCode_5']
for col in procedure_cols:
    # missing codes are NaN (a float), not None, so pd.notna is used to keep them missing
    data[col] = data[col].apply(lambda x: "{:.0f}".format(x) if pd.notna(x) else x)
data.head()
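
When float-typed codes are formatted as strings, missing entries need care: pandas stores them as NaN, which is a float and not `None`. The small sketch below (with illustrative code values) contrasts an `is not None` test with `pd.notna`; only the latter keeps missing codes missing, which matters when non-missing codes are counted later.

```python
import numpy as np
import pandas as pd

# Illustrative values; a float column with one missing procedure code
codes = pd.Series([3893.0, np.nan, 4516.0])

# NaN is a float, not None, so an `is not None` test lets it through
naive = codes.apply(lambda x: "{:.0f}".format(x) if x is not None else x)

# pd.notna keeps missing entries missing
safe = codes.apply(lambda x: "{:.0f}".format(x) if pd.notna(x) else x)

print(naive.tolist())        # the missing entry has become the string 'nan'
print(safe.isna().tolist())  # the missing entry is still recognised as missing
```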

Each patient can file multiple claims, so the patient's age is calculated at the claim level, as of the start of the claim.  
Since a claim is filed while the patient is still alive, the flag indicating whether the patient has died is calculated at the patient level.  
These features use the date of birth ('DOB'), date of death ('DOD'), and claim start date ('ClaimStartDt') columns.

In [None]:
# adding age of the patient when the claim started column
data['Age'] = round(((data['ClaimStartDt'] - data['DOB']).dt.days)/365)
# adding whether the patient is dead or not
data['IsDead'] = np.where(data['DOD'].isna(), 0, 1)
data.head()
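
As a worked illustration of these two features, the sketch below applies the same formulas to two made-up patients (the dates are hypothetical). Note that dividing by 365 ignores leap days, so the quotient is slightly above the true age before rounding.

```python
import numpy as np
import pandas as pd

# Hypothetical patients; the second one has no date of death
toy = pd.DataFrame({
    "DOB": pd.to_datetime(["1940-01-01", "1950-06-15"]),
    "DOD": pd.to_datetime(["2009-12-01", None]),
    "ClaimStartDt": pd.to_datetime(["2009-01-01", "2009-06-15"]),
})

# Age at claim start, in whole years
toy["Age"] = round((toy["ClaimStartDt"] - toy["DOB"]).dt.days / 365)
# 1 if a date of death is recorded, 0 otherwise
toy["IsDead"] = np.where(toy["DOD"].isna(), 0, 1)

print(toy[["Age", "IsDead"]])
```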

From the provided dates, we can calculate the number of days the patient spent in the hospital,   
how long the claim lasted, and whether the claim ended after the patient had already been discharged from the hospital.

In [None]:
data['DaysAdmitted'] = ((data['DischargeDt'] - data['AdmissionDt']).dt.days)+1
data['DaysClaimLasted'] = ((data['ClaimEndDt'] - data['ClaimStartDt']).dt.days)+1
data['ClaimEndAfterDischarged'] = np.where(data['ClaimEndDt'] > data['DischargeDt'], 1, 0)
data.head()
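
The `+1` makes both durations inclusive, so a same-day admission and discharge counts as one day. A minimal sketch on two hypothetical claims (the dates are made up) illustrates the three derived columns.

```python
import numpy as np
import pandas as pd

# Hypothetical claims: a week-long stay and a same-day stay
toy = pd.DataFrame({
    "AdmissionDt": pd.to_datetime(["2009-04-12", "2009-08-31"]),
    "DischargeDt": pd.to_datetime(["2009-04-18", "2009-08-31"]),
    "ClaimStartDt": pd.to_datetime(["2009-04-12", "2009-08-31"]),
    "ClaimEndDt": pd.to_datetime(["2009-04-20", "2009-08-31"]),
})

# Inclusive day counts: first and last day both count
toy["DaysAdmitted"] = (toy["DischargeDt"] - toy["AdmissionDt"]).dt.days + 1
toy["DaysClaimLasted"] = (toy["ClaimEndDt"] - toy["ClaimStartDt"]).dt.days + 1
# 1 if the claim ended after the discharge date
toy["ClaimEndAfterDischarged"] = np.where(toy["ClaimEndDt"] > toy["DischargeDt"], 1, 0)

print(toy[["DaysAdmitted", "DaysClaimLasted", "ClaimEndAfterDischarged"]])
```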

In [None]:
# replacing missing deductible amount paid with numeric 0 (not the string '0') to keep the column numeric
data['DeductibleAmtPaid'] = data['DeductibleAmtPaid'].fillna(0)

The physician, diagnosis, and procedure code variables have many categories, which is why the TotalPhysicians, TotalDiagnosis, and TotalProcedures count columns are engineered.

In [None]:
# creating helper dataframe to calculate the totals
cols= ['AttendingPhysician', 'OperatingPhysician', 'OtherPhysician', 
       'ClmAdmitDiagnosisCode', 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_10',
       'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4',
       'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7',
       'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmProcedureCode_1',
       'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4',
       'ClmProcedureCode_5']
helper_df = data[cols].copy()
# replacing missing values with 0 in the helper dataframe
helper_df[cols]= helper_df[cols].replace({np.nan:0})
# replacing codes with number 1 for easy counting
# (.loc avoids chained assignment, which may not update helper_df)
for i in cols:
    helper_df.loc[helper_df[i] != 0, i] = 1
helper_df.head(20)

In [None]:
helper_df[cols]= helper_df[cols].astype(int)

In [None]:
data['TotalDiagnosis']= helper_df['ClmDiagnosisCode_1']+helper_df['ClmDiagnosisCode_10']+ \
helper_df['ClmDiagnosisCode_2']+ helper_df['ClmDiagnosisCode_3']+ helper_df['ClmDiagnosisCode_4']+ \
helper_df['ClmDiagnosisCode_5']+ helper_df['ClmDiagnosisCode_6']+ helper_df['ClmDiagnosisCode_7']+helper_df['ClmDiagnosisCode_8']+ helper_df['ClmDiagnosisCode_9']

In [None]:
data['TotalProcedures']= helper_df['ClmProcedureCode_1']+helper_df['ClmProcedureCode_2']+helper_df['ClmProcedureCode_3']+ \
helper_df['ClmProcedureCode_4']+ helper_df['ClmProcedureCode_5']

In [None]:
data['TotalPhysicians']= helper_df['AttendingPhysician']+helper_df['OperatingPhysician']+ \
                         helper_df['OtherPhysician']

In [None]:
data.head()
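
The same totals can also be obtained without a helper dataframe by counting non-missing entries per row. The sketch below is an alternative formulation, not the approach used above, and runs on a hypothetical miniature table (codes and physician IDs are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical claims with some missing codes and physicians
toy = pd.DataFrame({
    "ClmDiagnosisCode_1": ["4019", "2724", np.nan],
    "ClmDiagnosisCode_2": ["25000", np.nan, np.nan],
    "AttendingPhysician": ["PHY1", "PHY2", "PHY3"],
    "OperatingPhysician": [np.nan, "PHY4", np.nan],
})

# Select column groups; the prefix excludes the admit diagnosis column
diagnosis_cols = [c for c in toy.columns if c.startswith("ClmDiagnosisCode_")]
physician_cols = ["AttendingPhysician", "OperatingPhysician"]

# Count non-missing entries across each group of columns, row by row
toy["TotalDiagnosis"] = toy[diagnosis_cols].notna().sum(axis=1)
toy["TotalPhysicians"] = toy[physician_cols].notna().sum(axis=1)

print(toy[["TotalDiagnosis", "TotalPhysicians"]])
```

This relies on missing codes being stored as real missing values rather than as strings such as 'nan'.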

The RenalDiseaseIndicator column is encoded differently from the other indicator columns (it takes the two values 'Y' and '0'), hence its values are replaced with True and False.

In [None]:
# values check
data['RenalDiseaseIndicator'].value_counts()

In [None]:
# Value update
data['RenalDiseaseIndicator']= data['RenalDiseaseIndicator'].replace({'Y':True,'0':False})
data.head()

In [None]:
# Value update
data['PotentialFraud']= data['PotentialFraud'].replace({'Yes':1,'No':0})
data.head()

In [None]:
data.columns

Many machine learning algorithms expect numeric input, hence the values True and False are replaced with 1 and 0 respectively.

In [None]:
binary_cols = ['ClaimEndAfterDischarged', 'IsDead', 'PotentialFraud', 'RenalDiseaseIndicator']
for col in binary_cols:
    data[col] = np.where(data[col]==True, 1, 0)
data.head()

In [None]:
binary_cols = ['ChronicCond_Alzheimer', 'ChronicCond_Heartfailure',
       'ChronicCond_KidneyDisease', 'ChronicCond_Cancer',
       'ChronicCond_ObstrPulmonary', 'ChronicCond_Depression',
       'ChronicCond_Diabetes', 'ChronicCond_IschemicHeart',
       'ChronicCond_Osteoporasis', 'ChronicCond_rheumatoidarthritis',
       'ChronicCond_stroke']
for col in binary_cols:
    data[col] = np.where(data[col]==1, 1, 0)
data.head()
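
The recoding pattern maps one chosen value to 1 and everything else to 0. The sketch below illustrates it on made-up rows; the assumption that the chronic condition columns use 1 for "present" and another value (here 2) for "absent" is not stated in this notebook and should be checked with `value_counts()` first.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RenalDiseaseIndicator": ["Y", "0", "0"],
    # assumed coding: 1 = condition present, 2 = condition absent
    "ChronicCond_Diabetes": [1, 2, 2],
})

# Map the "positive" value to 1 and every other value to 0
toy["RenalDiseaseIndicator"] = np.where(toy["RenalDiseaseIndicator"] == "Y", 1, 0)
toy["ChronicCond_Diabetes"] = np.where(toy["ChronicCond_Diabetes"] == 1, 1, 0)

print(toy)
```

Because every non-matching value becomes 0, missing or unexpected codes are treated as "absent" under this pattern.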

In [None]:
data.head()

### Data saving

The preprocessed data are pushed to a separate folder on GitHub: https://raw.githubusercontent.com/EY-Tech-Consulting-Denmark/Graphathon_2023-04-14/main/Data/clean_data.  
Data prepared this way will be used to build a graph database and for machine learning.

In [None]:
#data.to_csv("path_to_local_git_folder\\Graphathon_2023-04-14\\Data\\clean_data\\data.csv", index=False)