<div style="text-align: center;">
    <font size="25" color="white">HEALTHCARE PROVIDER FRAUD DETECTION ANALYSIS</font>
</div>

<div style="text-align: center;">
    <img src="stethoscope-stat.jpg" alt="Stethoscope" style="width: 100%; max-width: 900px;">
</div>

## Table of Contents

* [1. Project Overview](#chapter1)
  * [1.1 Introduction](#section_1_1)
      * [1.1.1 Problem Statement](#sub_section_1_1_1)
      * [1.1.2 Objectives](#sub_section_1_1_2)
* [2. Importing Packages](#chapter2)
* [3. Loading Data](#chapter3)
* [4. Data Cleaning](#chapter4)
* [5. Exploratory Data Analysis (EDA)](#chapter5)
* [6. Data Pre-processing](#chapter6)
* [7. Model training](#chapter7)
* [8. Model Performance Comparison and evaluation](#chapter8)
* [9. Fine-tune model parameters and hyperparameters](#chapter9)
* [10. MLFlow Tracking](#chapter10)
* [11. Recommendations](#chapter11)
* [12. Conclusion](#chapter12)

# <font color=White>1. Project Overview</font> <a class="anchor" id="chapter1"></a>

Healthcare fraud is a pervasive problem in healthcare systems worldwide, including South Africa's. One of the most concerning forms in South Africa involves medico-legal claims, in which healthcare providers, often in collaboration with patients and legal professionals, file fraudulent claims for compensation. These activities burden the healthcare system and cause financial losses for insurance companies and government health programs.

Medico-legal fraud in South Africa typically involves medical professionals inflating claims, billing for services that were never provided, or misrepresenting the nature of treatments to secure higher payouts. These practices drive up healthcare costs and strain already limited public health resources. Given the complexity of these cases and their financial impact on the healthcare system, there is a pressing need for a fraud detection model that can accurately identify and flag potentially fraudulent claims.

In this project we explore a provider fraud dataset to develop a model that detects fraudulent claims. The model will be trained on data from inpatient claims, outpatient claims, and beneficiary details.

Healthcare fraud and abuse take many forms. Some of the most common types of fraud committed by providers are:

a) Billing for services that were not provided.

b) Duplicate submission of a claim for the same service.

c) Misrepresenting the service provided.

d) Charging for a more complex or expensive service than was actually provided.

e) Billing for a covered service when the service actually provided was not covered.

The goal of this project is to predict potentially fraudulent providers based on the claims they file. Along the way, we will identify the variables most useful for detecting the behaviour of potentially fraudulent providers, and study fraudulent patterns in providers' claims to better understand their future behaviour.



<b>Dataset Overview:</b>

The dataset consists of three primary components—Inpatient Data, Outpatient Data, and Beneficiary Details—used to detect potentially fraudulent behaviors in medico-legal claims. Each dataset provides unique insights that, when combined, can help identify fraudulent patterns.

Inpatient Data:
Contains claims from patients admitted to hospitals, including admission and discharge dates, diagnosis codes, and services provided during the hospital stay.

Outpatient Data:
Captures claims from patients who were not admitted but visited healthcare providers for various treatments. It includes details on diagnoses and outpatient services.

Beneficiary Details:
Comprises patient-specific information, such as medical history, demographic details, and regional data, which can offer insights into the background and health profiles of claimants.
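
As a rough illustration of how the three sources might be combined, the sketch below joins toy inpatient and outpatient claims to beneficiary records on `BeneID`. Only `BeneID` and `Gender` appear in the beneficiary table loaded later in this notebook; the `ClaimID`, `Provider`, and `ClaimType` columns and all values are hypothetical placeholders, so the real files may need different keys.

```python
import pandas as pd

# Toy stand-ins for the three sources (all values are made up for illustration).
beneficiary_toy = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3"],
    "Gender": [1, 2, 1],
})
inpatient_toy = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "ClaimID": ["C1", "C2"],    # hypothetical column name
    "Provider": ["P1", "P2"],   # hypothetical column name
})
outpatient_toy = pd.DataFrame({
    "BeneID": ["B1", "B3", "B3"],
    "ClaimID": ["C3", "C4", "C5"],
    "Provider": ["P1", "P1", "P2"],
})

# Tag each claim with its source, then stack inpatient and outpatient claims.
claims_toy = pd.concat(
    [
        inpatient_toy.assign(ClaimType="inpatient"),
        outpatient_toy.assign(ClaimType="outpatient"),
    ],
    ignore_index=True,
)

# Attach beneficiary attributes to every claim; a left join keeps all claims.
combined_toy = claims_toy.merge(beneficiary_toy, on="BeneID", how="left")

# Number of claims per provider: a simple provider-level view of the data.
claims_per_provider = combined_toy.groupby("Provider")["ClaimID"].count()

print(combined_toy)
print(claims_per_provider)
```

A left join is used here so that a claim is never dropped just because its beneficiary record is missing; whether that is the right choice depends on how complete the real beneficiary file is.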

## 1.1.1 Problem Statement <a class="anchor" id="sub_section_1_1_1"></a>

Medico-legal fraud is a growing issue within South Africa’s healthcare system, where healthcare providers and other parties work together to file fraudulent claims. Common fraudulent practices include billing for services not rendered, submitting duplicate claims, and misrepresenting the nature of treatments to inflate claim payouts. These fraudulent activities lead to increased healthcare costs and put a strain on both insurance companies and government programs.

The goal of this project is to build a fraud detection model that can accurately identify fraudulent healthcare providers. By leveraging inpatient, outpatient, and beneficiary data, the project will uncover critical features and fraud patterns that will improve fraud detection, reduce financial losses, and strengthen the overall healthcare system.

## 1.1.2 Objectives <a class="anchor" id="sub_section_1_1_2"></a>

The primary objective of this project is to develop a model to detect potentially fraudulent healthcare providers involved in medico-legal claims in South Africa. Specific goals include:

1. Develop a predictive model capable of identifying potentially fraudulent claims based on historical inpatient claims, outpatient claims, and beneficiary data.

2. Identify significant variables and features that are most predictive of fraudulent activities within the dataset.

3. Analyze fraudulent patterns and behaviors associated with healthcare providers who are frequently involved in suspicious medico-legal claims.
   

# <font color=black>2. Importing Packages</font> <a class="anchor" id="chapter2"></a>

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

# <font color=black> 3. Loading Data</font> <a class="anchor" id="chapter3"></a>

In [6]:
beneficiary = pd.read_csv("C:\\_repos\\Workplace_project\\Exploring-_healthcare\\Train_Beneficiarydata-1542865627584.csv")
beneficiary.head()

| | BeneID | DOB | DOD | Gender | Race | RenalDiseaseIndicator | State | County | NoOfMonths_PartACov | NoOfMonths_PartBCov | ... | ChronicCond_Depression | ChronicCond_Diabetes | ChronicCond_IschemicHeart | ChronicCond_Osteoporasis | ChronicCond_rheumatoidarthritis | ChronicCond_stroke | IPAnnualReimbursementAmt | IPAnnualDeductibleAmt | OPAnnualReimbursementAmt | OPAnnualDeductibleAmt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11001 | 1943-01-01 | | 1 | 1 | 0 | 39 | 230 | 12 | 12 | ... | 1 | 1 | 1 | 2 | 1 | 1 | 36000 | 3204 | 60 | 70 |
| 1 | BENE11002 | 1936-09-01 | | 2 | 1 | 0 | 39 | 280 | 12 | 12 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 30 | 50 |
| 2 | BENE11003 | 1936-08-01 | | 1 | 1 | 0 | 52 | 590 | 12 | 12 | ... | 2 | 2 | 1 | 2 | 2 | 2 | 0 | 0 | 90 | 40 |
| 3 | BENE11004 | 1922-07-01 | | 1 | 1 | 0 | 39 | 270 | 12 | 12 | ... | 2 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 1810 | 760 |
| 4 | BENE11005 | 1935-09-01 | | 1 | 1 | 0 | 24 | 680 | 12 | 12 | ... | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 0 | 1790 | 1200 |


In [14]:
beneficiary.shape

(138556, 25)

# <font color=black> 4. Data Cleaning</font> <a class="anchor" id="chapter4"></a>

In [17]:
def check_nulls_and_drop_duplicates(df):
    # Check for null values
    null_values = df.isnull().sum()
    
    # Display columns with null values
    print("Columns with null values:\n", null_values[null_values > 0])
    
    # Drop duplicates
    df_cleaned = df.drop_duplicates()
    
    # Return the cleaned DataFrame
    return df_cleaned



In [18]:
df_beneficiary = check_nulls_and_drop_duplicates(beneficiary)
df_beneficiary

Columns with null values:
 DOD    137135
dtype: int64


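| | BeneID | DOB | DOD | Gender | Race | RenalDiseaseIndicator | State | County | NoOfMonths_PartACov | NoOfMonths_PartBCov | ... | ChronicCond_Depression | ChronicCond_Diabetes | ChronicCond_IschemicHeart | ChronicCond_Osteoporasis | ChronicCond_rheumatoidarthritis | ChronicCond_stroke | IPAnnualReimbursementAmt | IPAnnualDeductibleAmt | OPAnnualReimbursementAmt | OPAnnualDeductibleAmt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11001 | 1943-01-01 | | 1 | 1 | 0 | 39 | 230 | 12 | 12 | ... | 1 | 1 | 1 | 2 | 1 | 1 | 36000 | 3204 | 60 | 70 |
| 1 | BENE11002 | 1936-09-01 | | 2 | 1 | 0 | 39 | 280 | 12 | 12 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 30 | 50 |
| 2 | BENE11003 | 1936-08-01 | | 1 | 1 | 0 | 52 | 590 | 12 | 12 | ... | 2 | 2 | 1 | 2 | 2 | 2 | 0 | 0 | 90 | 40 |
| 3 | BENE11004 | 1922-07-01 | | 1 | 1 | 0 | 39 | 270 | 12 | 12 | ... | 2 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 1810 | 760 |
| 4 | BENE11005 | 1935-09-01 | | 1 | 1 | 0 | 24 | 680 | 12 | 12 | ... | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 0 | 1790 | 1200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 138551 | BENE159194 | 1939-07-01 | | 1 | 1 | 0 | 39 | 140 | 12 | 12 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 430 | 460 |
| 138552 | BENE159195 | 1938-12-01 | | 2 | 1 | 0 | 49 | 530 | 12 | 12 | ... | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 0 | 880 | 100 |
| 138553 | BENE159196 | 1916-06-01 | | 2 | 1 | 0 | 6 | 150 | 12 | 12 | ... | 1 | 1 | 1 | 2 | 2 | 2 | 2000 | 1068 | 3240 | 1390 |
| 138554 | BENE159197 | 1930-01-01 | | 1 | 1 | 0 | 16 | 560 | 12 | 12 | ... | 2 | 2 | 1 | 2 | 2 | 2 | 0 | 0 | 2650 | 10 |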


In [24]:
df_beneficiary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138556 entries, 0 to 138555
Data columns (total 25 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   BeneID                           138556 non-null  object        
 1   DOB                              138556 non-null  datetime64[ns]
 2   DOD                              1421 non-null    datetime64[ns]
 3   Gender                           138556 non-null  int64         
 4   Race                             138556 non-null  int64         
 5   RenalDiseaseIndicator            138556 non-null  object        
 6   State                            138556 non-null  int64         
 7   County                           138556 non-null  int64         
 8   NoOfMonths_PartACov              138556 non-null  int64         
 9   NoOfMonths_PartBCov              138556 non-null  int64         
 10  ChronicCond_Alzheimer            138556 non-

In [25]:
df_beneficiary['DOD'] = pd.to_datetime(df_beneficiary['DOD'], format='%Y-%m-%d')
df_beneficiary['DOB'] = pd.to_datetime(df_beneficiary['DOB'], format='%Y-%m-%d')

In [26]:
df_beneficiary["DOD"].max()

Timestamp('2009-12-01 00:00:00')

The latest date of death (DOD) recorded is 1 December 2009, which suggests that the Beneficiary Details data originates from 2009. Missing (NaN) values in the DOD column are therefore filled with '2009-12-01'.

In [28]:
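# Fill missing dates of death with the latest observed DOD, passed as a Timestamp so the column stays datetime
df_beneficiary["DOD"] = df_beneficiary["DOD"].fillna(pd.Timestamp("2009-12-01"))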
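
The fill described above can be tried on a small toy frame first. This is a minimal sketch, not the notebook's exact cell: the `toy` frame and the `IsDeceased` flag are hypothetical additions. The flag is one possible way to remember which dates were actually observed before the missing ones are overwritten.

```python
import pandas as pd

# Toy beneficiary records; a missing DOD means no death was recorded.
toy = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3"],
    "DOD": ["2009-03-01", None, "2009-12-01"],
})
toy["DOD"] = pd.to_datetime(toy["DOD"], format="%Y-%m-%d")

# Hypothetical flag: record which rows had an observed date of death.
toy["IsDeceased"] = toy["DOD"].notna()

# Use the latest observed DOD as the reference date for the missing values.
reference_date = toy["DOD"].max()
toy["DOD"] = toy["DOD"].fillna(reference_date)

print(reference_date)
print(toy)
```

Without such a flag, a filled row is indistinguishable from a beneficiary who actually died on the reference date, which may matter if DOD is later used as a feature.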


# <font color=black>5. Exploratory Data Analysis</font> <a class="anchor" id="chapter5"></a>


# <font color=black> 6. Data Pre-processing</font> <a class="anchor" id="chapter6"></a>

# <font color=black> 7. Model training</font> <a class="anchor" id="chapter7"></a>

# <font color=black>8. Model Performance Comparison and evaluation</font> <a class="anchor" id="chapter8"></a>

# <font color=black>9. Fine-tune model parameters and hyperparameters</font> <a class="anchor" id="chapter9"></a>

# <font color=black>10. MLFlow Tracking</font> <a class="anchor" id="chapter10"></a>

# <font color=black>11. Recommendations</font> <a class="anchor" id="chapter11"></a>

# <font color=black>12. Conclusion</font> <a class="anchor" id="chapter12"></a>