## Deploy Healthcare Fraud Detection System Algorithm from AWS Marketplace 

Healthcare fraud occurs when providers, physicians, and/or beneficiaries collude to misuse medical insurance systems. Detecting fraud manually in the healthcare industry is a strenuous task. This solution scrutinizes claims and predicts potentially fraudulent ones by analyzing patterns to anticipate an entity's future behavior. By acting in time on the predicted likelihood of healthcare fraud, insurance companies can prevent or mitigate losses.

The solution is based on an ensemble machine learning algorithm and assists the decision-making process by predicting the likelihood of healthcare fraud.

Using predictive modeling to detect healthcare fraud can reduce the cost of investigations and help ensure timely payouts.

This sample notebook shows you how to deploy Healthcare Fraud Detection System Algorithm using Amazon SageMaker.

> **Note**: This is a reference notebook and cannot run unless you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has **AmazonSageMakerFullAccess**.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Healthcare Fraud Detection System.
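
The three AWS Marketplace permissions listed above can be written as an IAM policy document. The following is a minimal sketch, not an official policy: the helper name `build_marketplace_policy` is hypothetical, and the unscoped `"Resource": "*"` is an assumption that you may want to tighten for your account.

```python
import json

# The three AWS Marketplace actions listed in the prerequisites above.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def build_marketplace_policy(actions=MARKETPLACE_ACTIONS):
    """Return an IAM policy document (as a dict) that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: no resource scoping is applied in this sketch.
                "Resource": "*",
            }
        ],
    }


policy_json = json.dumps(build_marketplace_policy(), indent=2)
print(policy_json)
```

The printed JSON can be reviewed with your account administrator before it is attached to the role used by the notebook.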

#### Contents:
1. [Subscribe to the Algorithm](#1.-Subscribe-to-the-Algorithm)
2. [Prepare dataset](#2.-Prepare-dataset)
    1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
    2. [Configure and visualize train, validation and test dataset](#B.-Configure-and-visualize-train,-validation-and-test-dataset)
    3. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
3. [Train a machine learning model](#3.-Train-a-machine-learning-model)
    1. [Set up environment](#A.-Set-up-environment)
    2. [Train a model](#B.-Train-a-model)
4. [Deploy model and verify results](#4.-Deploy-model-and-verify-results)
    1. [Deploy trained model](#A.-Deploy-trained-model)
    2. [Create input payload](#B.-Create-input-payload)
    3. [Perform real-time inference](#C.-Perform-real-time-inference)
    4. [Visualize output](#D.-Visualize-output)
    5. [Delete the endpoint](#E.-Delete-the-endpoint)
5. [Perform Batch inference](#5.-Perform-Batch-inference)
    1. [Inspect the Batch Transform Output in S3](#A.-Inspect-the-Batch-Transform-Output-in-S3)
6. [Clean-up](#6.-Clean-up)
    1. [Delete the model](#A.-Delete-the-model)
    2. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))
    

#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the Algorithm

To subscribe to the Algorithm:
1. Open the algorithm listing page **Healthcare Fraud Detection System**
1. On the AWS Marketplace listing, click on the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn** displayed. This is the algorithm ARN that you need to specify while creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
# Replace this value with the Product ARN copied from the listing for your region
algorithm_arn = 'healthcare-fraud-detection-system-v1'
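
The cell above holds only the algorithm name, while the training log later in this notebook prints a full ARN. As a rough sketch, a value can be checked against the ARN shape before it is used. The pattern and the helper name `is_algorithm_arn` are assumptions inferred from the ARN shown in that log, not an official validation rule.

```python
import re

# Assumed shape: arn:<partition>:sagemaker:<region>:<12-digit account id>:algorithm/<name>
ALGORITHM_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:algorithm/[A-Za-z0-9-]+$"
)


def is_algorithm_arn(value):
    """Return True if value looks like a full SageMaker algorithm ARN."""
    return bool(ALGORITHM_ARN_PATTERN.match(value))


# ARN copied from the training log shown later in this notebook.
example_arn = (
    "arn:aws:sagemaker:us-east-2:786796469737:"
    "algorithm/healthcare-fraud-detection-system-v1"
)
print(is_algorithm_arn(example_arn))                             # True
print(is_algorithm_arn("healthcare-fraud-detection-system-v1"))  # False
```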

### 2. Prepare dataset

In [2]:
import base64
import json
import uuid
import urllib.request
from urllib.parse import urlparse

import boto3
import numpy as np
import pandas as pd
import sagemaker as sage
from IPython.display import Image
from PIL import Image as ImageEdit
from sagemaker import ModelPackage, get_execution_role
from sagemaker.algorithm import AlgorithmEstimator

#### A. Dataset format expected by the algorithm

The deployed solution has **2 steps**: training the algorithm and testing.

- The algorithm trains on a user-provided dataset.
- The training dataset must contain "train.csv" with 'utf-8' encoding.
- The machine learning model is trained in the training step; once the model is generated, it can be used to make predictions on test data.
- The testing API takes a CSV file "test.csv" and predicts whether each claim is fraudulent.

**train.csv**
- train.csv must contain the following columns:
    
    - Physician_ID: It contains the id of the Physician who attended the patient
    - Surgeon: It contains the id of the surgeon who operated on the patient.
    - Alt_Physician_ID: It contains the id of the Physician other than Attending Physician and Surgeon who treated the patient.
    - Amt_Deductible: It consists of the amount paid by the patient, which is equal to Total_claim_amount - Reimbursed_amount.
    - AnnualAmtReimb_OP: It consists of the maximum reimbursement amount for outpatient visits annually.
    - AnnualAmtDeductible_OP : It consists of a premium paid by the patient for outpatient visits annually.
    - ClaimAmtReimbursed: It contains the amount reimbursed for that particular Insurance claim.
    - AnnualAmtReimb_IP: It consists of the maximum reimbursement amount for hospitalization annually.
    - AnnualAmtDeductible_IP: It consists of a premium paid by the patient for hospitalization annually.
    - ClmAdmitDiagnosisCode: It contains codes of the diagnosis performed by the provider on the patient for that claim.
    - DiagnosisGroupCode : It contains a group code for the diagnosis done on the patient.
    - Flag_Alzheimer: Beneficiary with Alzheimer's disease indicated by 1 else 0
    - Flag_Heartfailure: Beneficiary with Heart failure indicated by 1 else 0
    - Flag_KidneyDisease: Beneficiary with kidney disease indicated by 1 else 0
    - Flag_Cancer : Beneficiary with Cancer indicated by 1 else 0
    - Flag_ObstrPulmonary : Beneficiary with obstructive pulmonary disease indicated by 1 else 0
    - Flag_Depression: Beneficiary with Depression indicated by 1 else 0
    - Flag_Diabetes : Beneficiary with Diabetes indicated by 1 else 0
    - Flag_IschemicHeart : Beneficiary with Ischemic Heart indicated by 1 else 0
    - Flag_Osteoporosis: Beneficiary with Osteoporosis indicated by 1 else 0
    - Flag_rheumatoidarthritis: Beneficiary with rheumatoid arthritis indicated by 1 else 0
    - Flag_stroke: Beneficiary with stroke indicated by 1 else 0
    - Flag_RenalDisease: Beneficiary with Renal Disease indicated by 1 else 0
    - Gender: Gender of the Beneficiary (or the patient)
    - Date_StartClm: It contains the date when the claim started in mm/dd/yyyy format.
    - Date_EndClm: It contains the date when the claim ended in mm/dd/yyyy format.
    - Age: Age of the Beneficiary (or the patient) in years
    - Flag_DOA: Beneficiary Dead or Alive. 1 indicates Dead, 0 means Alive
    - Date_Admission: It contains the date on which the patient was admitted into the hospital in mm/dd/yyyy format.
    - Date_Discharge: It contains the date on which the patient was discharged from the hospital in mm/dd/yyyy format.
    - PotentialFraud: Target variable, where 1 indicates a potentially fraudulent claim and 0 indicates a genuine claim.

- test.csv must contain all of the columns listed above except the target variable, which in this case is PotentialFraud.
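
The column requirements above can be checked before any data is uploaded. The following is a minimal sketch using only the standard library: the helper name `missing_columns` is hypothetical, and the discharge column is spelled `Date_Discharge` to match the data previews in this notebook.

```python
import csv
import io

# Feature columns listed in the dataset format description above.
FEATURE_COLUMNS = [
    "Physician_ID", "Surgeon", "Alt_Physician_ID", "Amt_Deductible",
    "AnnualAmtReimb_OP", "AnnualAmtDeductible_OP", "ClaimAmtReimbursed",
    "AnnualAmtReimb_IP", "AnnualAmtDeductible_IP", "ClmAdmitDiagnosisCode",
    "DiagnosisGroupCode", "Flag_Alzheimer", "Flag_Heartfailure",
    "Flag_KidneyDisease", "Flag_Cancer", "Flag_ObstrPulmonary",
    "Flag_Depression", "Flag_Diabetes", "Flag_IschemicHeart",
    "Flag_Osteoporosis", "Flag_rheumatoidarthritis", "Flag_stroke",
    "Flag_RenalDisease", "Gender", "Date_StartClm", "Date_EndClm", "Age",
    "Flag_DOA", "Date_Admission", "Date_Discharge",
]
TARGET_COLUMN = "PotentialFraud"


def missing_columns(csv_text, require_target):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    present = {name.strip() for name in header}
    required = FEATURE_COLUMNS + ([TARGET_COLUMN] if require_target else [])
    return [name for name in required if name not in present]


# In-memory example: a header with every feature column but no target column.
sample = ",".join(FEATURE_COLUMNS) + "\n"
print(missing_columns(sample, require_target=True))   # ['PotentialFraud']
print(missing_columns(sample, require_target=False))  # []
```

Calling it with `require_target=True` corresponds to train.csv, and with `require_target=False` to test.csv.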

#### B. Configure and visualize train, validation and test dataset

In [3]:
training_dataset='data/training/train.csv'

In [4]:
train_input_df = pd.read_csv(training_dataset)
train_input_df.head()

Unnamed: 0,Physician_ID,Alt_Physician_ID,Amt_Deductible,AnnualAmtReimb_OP,AnnualAmtDeductible_OP,Surgeon,ClaimAmtReimbursed,AnnualAmtReimb_IP,AnnualAmtDeductible_IP,ClmAdmitDiagnosisCode,...,Flag_stroke,Flag_RenalDisease,Gender,Date_StartClm,Date_EndClm,Date_Admission,Date_Discharge,Flag_DOA,Age,PotentialFraud
0,PHY373075,PHY396304,1068.0,60,10,PHY396304,4000,4000,1068,4280,...,0,0,Female,6/5/2009,6/7/2009,6/5/2009,6/7/2009,1,68.0,1
1,PHY414401,PHY396839,1068.0,30420,7000,PHY378880,18000,18000,1068,99649,...,0,1,Male,11/18/2009,11/25/2009,11/18/2009,11/25/2009,1,64.0,0
2,PHY389825,PHY380811,1068.0,52900,8750,PHY353495,10000,10000,1068,7840,...,0,1,Female,6/16/2009,6/19/2009,6/16/2009,6/19/2009,1,61.0,0
3,PHY414840,PHY354360,1068.0,0,0,PHY414840,15000,15000,1068,78060,...,0,0,Male,4/13/2009,4/27/2009,4/13/2009,4/27/2009,1,74.0,0
4,PHY394807,PHY353030,1068.0,490,1150,PHY349265,14000,14000,4068,6826,...,0,0,Female,3/28/2009,4/2/2009,3/28/2009,4/2/2009,1,69.0,1


In [5]:
test_dataset='data/transform/test.csv'

In [6]:
test_input_df = pd.read_csv(test_dataset)
test_input_df.head()

Unnamed: 0,Physician_ID,Alt_Physician_ID,Amt_Deductible,AnnualAmtReimb_OP,AnnualAmtDeductible_OP,Surgeon,ClaimAmtReimbursed,AnnualAmtReimb_IP,AnnualAmtDeductible_IP,ClmAdmitDiagnosisCode,...,Flag_rheumatoidarthritis,Flag_stroke,Flag_RenalDisease,Gender,Date_StartClm,Date_EndClm,Date_Admission,Date_Discharge,Flag_DOA,Age
0,PHY373075,PHY396304,1068,60,10,PHY396304,4000,4000,1068,4280,...,0,0,0,Female,6/5/2009,6/7/2009,6/5/2009,6/7/2009,1,68
1,PHY414401,PHY396839,1068,30420,7000,PHY378880,18000,18000,1068,99649,...,0,0,1,Male,11/18/2009,11/25/2009,11/18/2009,11/25/2009,1,64
2,PHY389825,PHY380811,1068,52900,8750,PHY353495,10000,10000,1068,7840,...,0,0,1,Female,6/16/2009,6/19/2009,6/16/2009,6/19/2009,1,61
3,PHY414840,PHY354360,1068,0,0,PHY414840,15000,15000,1068,78060,...,0,0,0,Male,4/13/2009,4/27/2009,4/13/2009,4/27/2009,1,74
4,PHY394807,PHY353030,1068,490,1150,PHY349265,14000,14000,4068,6826,...,0,0,0,Female,3/28/2009,4/2/2009,3/28/2009,4/2/2009,1,69
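
The date columns in these previews use the mm/dd/yyyy format without zero padding. As an illustration only, the sketch below parses such dates and computes the length of a claim in days. The helper names are hypothetical, and the algorithm itself may treat dates differently.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # mm/dd/yyyy, as described in the dataset format section


def parse_claim_date(text):
    """Parse a date such as '6/5/2009' (month/day/year)."""
    return datetime.strptime(text, DATE_FORMAT).date()


def claim_duration_days(start_text, end_text):
    """Return the number of days between the claim start and end dates."""
    return (parse_claim_date(end_text) - parse_claim_date(start_text)).days


# Date_StartClm and Date_EndClm from the first and fifth preview rows.
print(claim_duration_days("6/5/2009", "6/7/2009"))   # 2
print(claim_duration_days("3/28/2009", "4/2/2009"))  # 5
```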


#### C. Upload datasets to Amazon S3

In [7]:
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [8]:
# training input location
common_prefix = "heathcare-fraud-detection"
training_input_prefix = common_prefix + "/training-input-data"
TRAINING_WORKDIR = "data/training"
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)

In [9]:
TRANSFORM_WORKDIR = "data/transform"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/test.csv"
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/heathcare-fraud-detection/batch-inference-input-data/test.csv
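
When `upload_data` is given a directory, it returns the S3 URI of the key prefix rather than of an individual file, which is why the cell above appends a file name. The sketch below shows how such a URI is composed. The helper `s3_object_uri` and the bucket name are hypothetical.

```python
def s3_object_uri(bucket, key_prefix, file_name):
    """Compose the S3 URI of one file stored under a key prefix."""
    return "s3://{}/{}/{}".format(bucket, key_prefix.strip("/"), file_name)


# Hypothetical bucket name, used only for illustration.
uri = s3_object_uri(
    "example-bucket",
    "heathcare-fraud-detection/batch-inference-input-data",
    "test.csv",
)
print(uri)
```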


### 3. Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

#### A. Set up environment

In [10]:
role = get_execution_role()

#### B. Train a model

In [11]:
algo = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    base_job_name='healthcare-fraud-detection-marketplace')

In [12]:
print("Now run the training job using algorithm arn %s in region %s" % (algorithm_arn, sagemaker_session.boto_region_name))
algo.fit({'training': training_input})

Now run the training job using algorithm arn arn:aws:sagemaker:us-east-2:786796469737:algorithm/healthcare-fraud-detection-system-v1 in region us-east-2
2021-08-19 05:31:52 Starting - Starting the training job...
2021-08-19 05:32:15 Starting - Launching requested ML instancesProfilerReport-1629351111: InProgress
...
2021-08-19 05:32:47 Starting - Preparing the instances for training.........
2021-08-19 05:34:16 Downloading - Downloading input data
2021-08-19 05:34:16 Training - Downloading the training image...

2021-08-19 05:35:23 Uploading - Uploading generated training model
2021-08-19 05:35:23 Completed - Training job completed
train model_evaluation
Training seconds: 90
Billable seconds: 90


### 4. Deploy model and verify results
Now you can deploy the model for performing real-time inference.

In [13]:
model_name='heathcare-fraud-detection'

content_type='text/csv'

real_time_inference_instance_type='ml.c4.xlarge'
batch_transform_inference_instance_type='ml.c4.large'

#### A. Deploy trained model

In [14]:
# Deploy the model to a real-time endpoint
predictor = algo.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

..........
-------------!

Once the endpoint is created, you can perform real-time inference.

#### B. Create input payload

In [22]:
file_name = 'data/transform/test.csv'

#### C. Perform real-time inference

In [23]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name 'heathcare-fraud-detection' \
    --body fileb://$file_name \
    --content-type 'text/csv' \
    --region us-east-2 \
    "output.csv"

{
    "ContentType": "csv",
    "InvokedProductionVariant": "AllTraffic"
}
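
The same request can be sent from Python through the `invoke_endpoint` call of the SageMaker runtime client. The following is a sketch under stated assumptions: the helper `invoke_csv_endpoint` is hypothetical, and `FakeRuntimeClient` is a stand-in that lets the example run offline. Its echoed response is made up and is not the output of the real endpoint.

```python
import io


def invoke_csv_endpoint(runtime_client, endpoint_name, payload):
    """Send CSV bytes to a SageMaker endpoint and return the response body as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a stream: read it fully, then decode it.
    return response["Body"].read().decode("utf-8")


class FakeRuntimeClient:
    """Offline stand-in for boto3.client('sagemaker-runtime')."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Echo the CSV header with a made-up prediction column appended.
        header = Body.decode("utf-8").splitlines()[0]
        return {"Body": io.BytesIO((header + ",PRED_FRAUDCLAIM\n").encode("utf-8"))}


result = invoke_csv_endpoint(
    FakeRuntimeClient(), "heathcare-fraud-detection", b"Age,Gender\n68,Female\n"
)
print(result)
```

Against a live endpoint, the fake client would be replaced by `boto3.client("sagemaker-runtime", region_name="us-east-2")` and the payload by the bytes of `data/transform/test.csv`.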


#### D. Visualize output

In [24]:
output = pd.read_csv("output.csv")
output.head(10)

Unnamed: 0,Physician_ID,Alt_Physician_ID,Amt_Deductible,AnnualAmtReimb_OP,AnnualAmtDeductible_OP,Surgeon,ClaimAmtReimbursed,AnnualAmtReimb_IP,AnnualAmtDeductible_IP,ClmAdmitDiagnosisCode,...,Flag_stroke,Flag_RenalDisease,Gender,Date_StartClm,Date_EndClm,Date_Admission,Date_Discharge,Flag_DOA,Age,PRED_FRAUDCLAIM
0,PHY373075,PHY396304,1068,60,10,PHY396304,4000,4000,1068,4280,...,0,0,Female,6/5/2009,6/7/2009,6/5/2009,6/7/2009,1,68,1
1,PHY414401,PHY396839,1068,30420,7000,PHY378880,18000,18000,1068,99649,...,0,1,Male,11/18/2009,11/25/2009,11/18/2009,11/25/2009,1,64,0
2,PHY389825,PHY380811,1068,52900,8750,PHY353495,10000,10000,1068,7840,...,0,1,Female,6/16/2009,6/19/2009,6/16/2009,6/19/2009,1,61,0
3,PHY414840,PHY354360,1068,0,0,PHY414840,15000,15000,1068,78060,...,0,0,Male,4/13/2009,4/27/2009,4/13/2009,4/27/2009,1,74,1
4,PHY394807,PHY353030,1068,490,1150,PHY349265,14000,14000,4068,6826,...,0,0,Female,3/28/2009,4/2/2009,3/28/2009,4/2/2009,1,69,1
5,PHY412336,PHY382649,1068,200,20,PHY395504,5000,5000,1068,78321,...,0,1,Female,8/17/2009,8/19/2009,8/17/2009,8/19/2009,1,81,1
6,PHY366670,PHY358829,1068,500,300,PHY358829,12000,12000,1068,71536,...,0,0,Male,9/23/2009,9/26/2009,9/23/2009,9/26/2009,1,78,0
7,PHY342869,PHY353725,1068,2820,840,PHY317003,36000,81000,3204,486,...,0,1,Male,3/25/2009,4/2/2009,3/25/2009,4/2/2009,1,74,1
8,PHY426515,PHY422874,1068,760,70,PHY426515,37000,71700,6204,99666,...,0,1,Male,3/7/2009,3/18/2009,3/7/2009,3/18/2009,1,71,0
9,PHY342897,PHY400655,1068,290,1140,PHY342897,17000,17000,1068,4414,...,0,0,Female,1/15/2009,1/22/2009,1/15/2009,1/22/2009,1,61,0
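
A quick way to read this output is to count how many claims the model flagged. The sketch below does this with the standard library on an in-memory sample. The helper `fraud_summary` is hypothetical, and the sample keeps only two columns of the ten rows shown above.

```python
import csv
import io


def fraud_summary(csv_text, column="PRED_FRAUDCLAIM"):
    """Return (flagged, total) counts for the prediction column of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    flagged = sum(1 for row in rows if row[column].strip() == "1")
    return flagged, len(rows)


# Physician_ID and PRED_FRAUDCLAIM for the ten rows shown above.
predictions = [
    ("PHY373075", 1), ("PHY414401", 0), ("PHY389825", 0), ("PHY414840", 1),
    ("PHY394807", 1), ("PHY412336", 1), ("PHY366670", 0), ("PHY342869", 1),
    ("PHY426515", 0), ("PHY342897", 0),
]
sample = "Physician_ID,PRED_FRAUDCLAIM\n" + "\n".join(
    "{},{}".format(physician_id, pred) for physician_id, pred in predictions
)

flagged, total = fraud_summary(sample)
print("{} of {} claims flagged".format(flagged, total))  # 5 of 10 claims flagged
```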


#### E. Delete the endpoint
Now that you have successfully performed a real-time inference, you do not need the endpoint any more. You can terminate it to avoid being charged.

In [25]:
predictor.delete_endpoint(delete_endpoint_config=True)

### 5. Perform Batch inference
In this section, you will perform batch inference using multiple input payloads together. If you are not familiar with batch transform, and want to learn more, see these links:
1. [How it works](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html)
2. [How to run a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)

In [26]:
TRANSFORM_WORKDIR = "data/transform"
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/test.csv"
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/heathcare-fraud-detection/batch-inference-input-data/test.csv


In [27]:
transformer = algo.transformer(1, 'ml.m4.xlarge')
transformer.transform(transform_input, content_type='text/csv')
transformer.wait()

print("Batch Transform output saved to " + transformer.output_path)

..........
...........................
Starting the inference server with 4 workers.
[2021-08-19 05:55:48 +0000] [13] [INFO] Starting gunicorn 20.1.0
[2021-08-19 05:55:48 +0000] [13] [INFO] Listening at: unix:/tmp/gunicorn.sock (13)
[2021-08-19 05:55:48 +0000] [13] [INFO] Using worker: gevent
[2021-08-19 05:55:48 +0000] [17] [INFO] Booting worker with pid: 17
[2021-08-19 05:55:48 +0000] [18] [INFO] Booting worker with pid: 18
[2021-08-19 05:55:48 +0000] [19] [INFO] Booting worker with pid: 19
[2021-08-19 05:55:48 +0000] [21] [INFO] Booting worker with pid: 21

In [28]:
#output is available on following path
transformer.output_path

's3://sagemaker-us-east-2-786796469737/healthcare-fraud-detection-marketplace-2021-08-19-05-51-28-494'

#### A. Inspect the Batch Transform Output in S3

In [29]:
from urllib.parse import urlparse

# Split the S3 output path into the bucket name and the object key
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
file_key = '{}/{}.out'.format(parsed_url.path[1:], "test.csv")

s3_client = sagemaker_session.boto_session.client('s3')

# Fetch the batch transform output from the bucket parsed above
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
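
The path handling in the cell above can be tried without any AWS access. The sketch below applies the same `urlparse` logic to the output path printed earlier in this notebook. The helper name `batch_output_location` is hypothetical.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, input_file_name):
    """Split an S3 output path into (bucket, key) for one transformed file."""
    parsed = urlparse(output_path)
    prefix = parsed.path.lstrip("/")
    # As in the cell above, the output object is named '<input file name>.out'.
    file_name = input_file_name + ".out"
    key = "{}/{}".format(prefix, file_name) if prefix else file_name
    return parsed.netloc, key


bucket_name, file_key = batch_output_location(
    "s3://sagemaker-us-east-2-786796469737/"
    "healthcare-fraud-detection-marketplace-2021-08-19-05-51-28-494",
    "test.csv",
)
print(bucket_name)  # sagemaker-us-east-2-786796469737
print(file_key)     # healthcare-fraud-detection-marketplace-2021-08-19-05-51-28-494/test.csv.out
```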

In [30]:
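# Key prefix of the batch transform output (the part of the path after the bucket name)
bucketFolder = urlparse(transformer.output_path).path.lstrip('/')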

In [31]:
import boto3
s3_conn = boto3.client("s3")
bucket_name=bucket
with open('output.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/' + "test.csv" +'.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [32]:
output = pd.read_csv('output.csv')

In [33]:
output.head(10)

Unnamed: 0,Physician_ID,Alt_Physician_ID,Amt_Deductible,AnnualAmtReimb_OP,AnnualAmtDeductible_OP,Surgeon,ClaimAmtReimbursed,AnnualAmtReimb_IP,AnnualAmtDeductible_IP,ClmAdmitDiagnosisCode,...,Flag_stroke,Flag_RenalDisease,Gender,Date_StartClm,Date_EndClm,Date_Admission,Date_Discharge,Flag_DOA,Age,PRED_FRAUDCLAIM
0,PHY373075,PHY396304,1068,60,10,PHY396304,4000,4000,1068,4280,...,0,0,Female,6/5/2009,6/7/2009,6/5/2009,6/7/2009,1,68,1
1,PHY414401,PHY396839,1068,30420,7000,PHY378880,18000,18000,1068,99649,...,0,1,Male,11/18/2009,11/25/2009,11/18/2009,11/25/2009,1,64,0
2,PHY389825,PHY380811,1068,52900,8750,PHY353495,10000,10000,1068,7840,...,0,1,Female,6/16/2009,6/19/2009,6/16/2009,6/19/2009,1,61,0
3,PHY414840,PHY354360,1068,0,0,PHY414840,15000,15000,1068,78060,...,0,0,Male,4/13/2009,4/27/2009,4/13/2009,4/27/2009,1,74,1
4,PHY394807,PHY353030,1068,490,1150,PHY349265,14000,14000,4068,6826,...,0,0,Female,3/28/2009,4/2/2009,3/28/2009,4/2/2009,1,69,1
5,PHY412336,PHY382649,1068,200,20,PHY395504,5000,5000,1068,78321,...,0,1,Female,8/17/2009,8/19/2009,8/17/2009,8/19/2009,1,81,1
6,PHY366670,PHY358829,1068,500,300,PHY358829,12000,12000,1068,71536,...,0,0,Male,9/23/2009,9/26/2009,9/23/2009,9/26/2009,1,78,0
7,PHY342869,PHY353725,1068,2820,840,PHY317003,36000,81000,3204,486,...,0,1,Male,3/25/2009,4/2/2009,3/25/2009,4/2/2009,1,74,1
8,PHY426515,PHY422874,1068,760,70,PHY426515,37000,71700,6204,99666,...,0,1,Male,3/7/2009,3/18/2009,3/7/2009,3/18/2009,1,71,0
9,PHY342897,PHY400655,1068,290,1140,PHY342897,17000,17000,1068,4414,...,0,0,Female,1/15/2009,1/22/2009,1/15/2009,1/22/2009,1,61,0


### 6. Clean-up

#### A. Delete the model

In [35]:
transformer.delete_model()

#### B. Unsubscribe to the listing (optional)
If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.