### Bias Detection and Explainability in Healthcare Coverage Amounts
This exercise is part of *Chapter 11* in the book *Applied Machine Learning for Healthcare and Life Sciences Using AWS*. Before you begin, complete the steps outlined in the *Technical requirements* section of *Chapter 11*. Also make sure you have downloaded the dataset and created the `data` folder as described in the *Acquiring the dataset* subsection under the *Detecting bias and explaining model predictions for healthcare coverage amounts* section.

In this notebook, we will use SageMaker Clarify to compute bias metrics for both our training data and our trained model. We will also generate explainability metrics to determine feature importance for our model.

**Note**: Parts of this notebook reference the following link: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/fairness_and_explainability/fairness_and_explainability.html

Let's start by importing the required libraries and setting some environment variables. 

In [8]:
from sagemaker import Session
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import os
import boto3
from datetime import date

session = Session()
bucket = session.default_bucket()
prefix = "chapter11"
region = session.boto_region_name
role = get_execution_role()
s3_client = boto3.client("s3")

Next, we will read the `patients.csv` file into a dataframe and examine its contents.

In [9]:
raw_data=pd.read_csv('data/patients.csv')

In [10]:
raw_data.head()

Unnamed: 0,Id,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,...,BIRTHPLACE,ADDRESS,CITY,STATE,COUNTY,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE
0,1d604da9-9a81-4ba9-80c2-de3375d59b40,1989-05-25,,999-76-6866,S99984236,X19277260X,Mr.,José Eduardo181,Gómez206,,...,Marigot Saint Andrew Parish DM,427 Balistreri Way Unit 19,Chicopee,Massachusetts,Hampden County,1013.0,42.228354,-72.562951,271227.08,1334.88
1,034e9e3b-2def-4559-bb2a-7850888ae060,1983-11-14,,999-73-5361,S99962402,X88275464X,Mr.,Milo271,Feil794,,...,Danvers Massachusetts US,422 Farrell Path Unit 69,Somerville,Massachusetts,Middlesex County,2143.0,42.360697,-71.126531,793946.01,3204.49
2,10339b10-3cd1-4ac3-ac13-ec26728cb592,1992-06-02,,999-27-3385,S99972682,X73754411X,Mr.,Jayson808,Fadel536,,...,Springfield Massachusetts US,1056 Harris Lane Suite 70,Chicopee,Massachusetts,Hampden County,1020.0,42.181642,-72.608842,574111.9,2606.4
3,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,1978-05-27,,999-85-4926,S99974448,X40915583X,Mrs.,Mariana775,Rutherford999,,...,Yarmouth Massachusetts US,999 Kuhn Forge,Lowell,Massachusetts,Middlesex County,1851.0,42.636143,-71.343255,935630.3,8756.19
4,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,1996-10-18,,999-60-7372,S99915787,X86772962X,Mr.,Gregorio366,Auer97,,...,Patras Achaea GR,1050 Lindgren Extension Apt 38,Boston,Massachusetts,Suffolk County,2135.0,42.352434,-71.02861,598763.07,3772.2


We will now apply a few transformations to the data:
1. Use the `BIRTHDATE` column to calculate each patient's age and categorize patients into age groups: Infant, Toddler, Kid, Teen, Adult, and Old.
2. Calculate `COVERAGETOEXPENSERATIO` by dividing `HEALTHCARE_COVERAGE` by `HEALTHCARE_EXPENSES`.
3. Create a new binary `Target` variable that has the value `False` if `COVERAGETOEXPENSERATIO` is `<= 0.01`, and `True` otherwise.

Essentially, we want to find patients whose coverage pays for 1% or less of their healthcare expenses.
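As a minimal sketch of steps 2 and 3, the following snippet computes the ratio and the target on a small toy dataframe. The values are hypothetical and are not taken from `patients.csv`. The sketch stores the target as a boolean for brevity, whereas the notebook stores it as the strings `'True'` and `'False'`.

```python
import pandas as pd

# Hypothetical toy values, not taken from patients.csv
toy = pd.DataFrame(
    {
        "HEALTHCARE_EXPENSES": [100000.0, 50000.0, 20000.0],
        "HEALTHCARE_COVERAGE": [500.0, 500.0, 4000.0],
    }
)

# Fraction of the expenses that is paid for by coverage
toy["COVERAGETOEXPENSERATIO"] = toy["HEALTHCARE_COVERAGE"] / toy["HEALTHCARE_EXPENSES"]

# False when coverage is 1% of expenses or less, True otherwise
toy["Target"] = toy["COVERAGETOEXPENSERATIO"] > 0.01

print(toy[["COVERAGETOEXPENSERATIO", "Target"]])
```

Only the third row has a ratio above 0.01, so it is the only row labeled `True`.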

In [11]:
def calculateage(dob):
    today = date.today()
    return today.year - dob.year - ((today.month, 
                                      today.day) < (dob.month, 
                                                    dob.day))
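The tuple comparison in `calculateage` subtracts one year when the birthday has not yet occurred in the current year. The sketch below illustrates the same logic with a fixed reference date so that the result does not depend on the day it is run. The helper name `age_on` and the dates are hypothetical.

```python
from datetime import date


def age_on(dob: date, today: date) -> int:
    """Age in whole years on `today` (hypothetical helper with a fixed reference date)."""
    # The comparison is True (counted as 1) when the birthday
    # has not yet occurred in the reference year.
    birthday_still_to_come = (today.month, today.day) < (dob.month, dob.day)
    return today.year - dob.year - birthday_still_to_come


reference = date(2022, 11, 4)
age_after_birthday = age_on(date(1989, 5, 25), reference)    # birthday already passed
age_before_birthday = age_on(date(1996, 11, 18), reference)  # birthday still to come
print(age_after_birthday, age_before_birthday)
```

Because `calculateage` uses `date.today()`, the ages in the notebook change depending on when it is run.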

In [12]:
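raw_data['BIRTHDATE'] = pd.to_datetime(raw_data['BIRTHDATE'])
# Keep only living patients; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
raw_data = raw_data[raw_data['DEATHDATE'].isna()].copy()
raw_data['AGE'] = raw_data['BIRTHDATE'].apply(calculateage)
raw_data['COVERAGETOEXPENSERATIO'] = raw_data['HEALTHCARE_COVERAGE'] / raw_data['HEALTHCARE_EXPENSES']
labels = ['Infant', 'Toddler', 'Kid', 'Teen', 'Adult', 'Old']
# include_lowest=True places age 0 in the first bin instead of leaving it as NaN
raw_data['AGEGROUPS'] = pd.cut(raw_data['AGE'], labels=labels, bins=[0, 3, 7, 12, 21, 65, np.inf], include_lowest=True)
# String labels; they are label-encoded to 0/1 later in the notebook
raw_data['Target'] = raw_data['COVERAGETOEXPENSERATIO'].apply(lambda x: 'False' if x <= 0.01 else 'True')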
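To see how `pd.cut` assigns ages to the bin edges used here, the following sketch bins a few boundary ages. By default the intervals are open on the left and closed on the right, so an age of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. The sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([0, 3, 4, 12, 13, 21, 22, 65, 66])
labels = ["Infant", "Toddler", "Kid", "Teen", "Adult", "Old"]
bins = [0, 3, 7, 12, 21, 65, np.inf]

# Default: intervals are (0, 3], (3, 7], ... so age 0 is not covered
default_groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 3]
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"AGE": ages, "default": default_groups, "include_lowest": inclusive_groups}))
```

The two results differ only for age 0, which is missing in the default case and `Infant` when the lowest edge is included.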

Next, we select the `CITY`, `COUNTY`, and `AGEGROUPS` columns as our features and the `Target` column as our target. Since `Target` has only `True` and `False` values, we model this as a binary classification problem.

In [13]:
X=raw_data[['CITY','COUNTY','AGEGROUPS']]
y=raw_data[['Target']]

We can now split our patient population into training and test sets. We will use scikit-learn's `LabelEncoder` to encode our categorical columns as numerical values.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
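The split above is purely random. When the target classes are imbalanced, one optional variation is to pass `stratify` so that both splits keep the same class proportions. The sketch below shows this on hypothetical toy data. It is not part of the original exercise, and the names `X_toy` and `y_toy` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 positive and 20 negative labels
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 80 + [0] * 20, name="Target")

# stratify=y_toy keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Both splits end up with 80% positive labels, which keeps evaluation on the test set comparable to the training distribution.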

In [15]:
from sklearn import preprocessing
enc = preprocessing.LabelEncoder()
X_train=X_train.apply(enc.fit_transform)
X_test=X_test.apply(enc.fit_transform)
y_train=y_train.apply(enc.fit_transform)
y_test=y_test.apply(enc.fit_transform)
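Note that the cell above calls `fit_transform` separately on each split, so the same category can receive a different integer code in the training and test sets when the two splits do not contain exactly the same categories. One possible alternative, sketched here on hypothetical toy data, is to fit an encoder on the training split only and reuse it for the test split. This sketch uses scikit-learn's `OrdinalEncoder` in place of `LabelEncoder` and assumes scikit-learn 0.24 or later for the `handle_unknown` option.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy splits; "Salem" appears only in the test split
train_toy = pd.DataFrame(
    {"CITY": ["Boston", "Lowell", "Boston"], "AGEGROUPS": ["Adult", "Teen", "Old"]}
)
test_toy = pd.DataFrame({"CITY": ["Lowell", "Salem"], "AGEGROUPS": ["Old", "Adult"]})

# Categories unseen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the category-to-code mapping from the training split only
train_encoded = pd.DataFrame(encoder.fit_transform(train_toy), columns=train_toy.columns)

# Reuse the same mapping for the test split
test_encoded = pd.DataFrame(encoder.transform(test_toy), columns=test_toy.columns)

print(train_encoded)
print(test_encoded)
```

With this approach, `Lowell` and `Old` map to the same codes in both splits, and the unseen city `Salem` is flagged with `-1`.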

The SageMaker XGBoost container expects CSV training data with the target variable in the first column and no header row. The test dataset must not contain the target column at all. We will arrange the data this way in the next code section and save the files as CSVs.

In [16]:
X_train.insert(0, 'Target', y_train['Target'])
X_train.to_csv("data/train.csv", index=False, header=False)

In [17]:
X_test.to_csv("data/test.csv", index=False, header=False)
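As a quick check of the resulting file layout, the sketch below writes a tiny training frame to an in-memory buffer in the same way. The encoded values are hypothetical. The output has the target first and no header or index.

```python
import io

import pandas as pd

# Hypothetical encoded features and target
features = pd.DataFrame({"CITY": [3, 1], "COUNTY": [0, 2], "AGEGROUPS": [4, 5]})
target = pd.Series([1, 0])

# Put the target in the first column, as in the training file
train = features.copy()
train.insert(0, "Target", target)

# Write without index or header, mirroring the to_csv call above
buffer = io.StringIO()
train.to_csv(buffer, index=False, header=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Each line starts with the label, followed by the three feature values in column order.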

Next, let's upload the files to Amazon S3, as shown in the following code.

In [18]:
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput

train_uri = S3Uploader.upload("data/train.csv", "s3://{}/{}".format(bucket, prefix))
train_input = TrainingInput(train_uri, content_type="csv")
test_uri = S3Uploader.upload("data/test.csv", "s3://{}/{}".format(bucket, prefix))

We are now ready to run our model training on SageMaker. We choose the SageMaker `xgboost` container to run our training.

In [19]:
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator

container = retrieve("xgboost", region, version="1.2-1")
xgb = Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    disable_profiler=True,
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective="binary:logistic",
    num_round=800,
)

xgb.fit({"train": train_input}, logs=False)


2022-11-04 23:35:58 Starting - Starting the training job...
2022-11-04 23:36:15 Starting - Preparing the instances for training...........
2022-11-04 23:37:17 Downloading - Downloading input data.....
2022-11-04 23:37:47 Training - Downloading the training image.........
2022-11-04 23:38:38 Training - Training image download completed. Training in progress..
2022-11-04 23:38:48 Uploading - Uploading generated training model.
2022-11-04 23:38:59 Completed - Training job completed


Once the training completes, we can create a SageMaker model from the estimator used in the training job. SageMaker Clarify will refer to this model by its name.

In [20]:
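from datetime import datetime  # the first cell imports only `date`

model_name = "xgb-model-{}".format(datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))
model = xgb.create_model(name=model_name)
container_def = model.prepare_container_def()
# Register the model with SageMaker so that Clarify can reference it by name
session.create_model(model_name, role, container_def)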

'xgb-model-04-11-2022-23-38-59'

We now have all the prerequisites for running SageMaker Clarify. We begin by defining a SageMaker Clarify processor. 

In [21]:
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
)

Next, we define configurations for our data and model bias metrics.

In [22]:
bias_report_output_path = "s3://{}/{}/clarify-bias".format(bucket, prefix)
bias_data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=bias_report_output_path,
    label="Target",
    headers=X_train.columns.to_list(),
    dataset_type="text/csv",
)

In [23]:
model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)

We create a bias report for the `AGEGROUPS` feature (the facet) and use `CITY` as the grouping column, so that bias can be analyzed within each city. We also define the probability threshold that Clarify uses to convert the model's predicted probabilities into binary labels.

In [24]:
predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)
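To illustrate what the threshold does, the sketch below converts a few hypothetical predicted probabilities into binary labels with NumPy. It is a simplified illustration and not Clarify's own code. The probabilities deliberately avoid the exact boundary value, since how a score equal to the threshold is treated is not covered here.

```python
import numpy as np

# Hypothetical probabilities returned by a binary classifier
probabilities = np.array([0.95, 0.85, 0.60, 0.30])

# With a threshold of 0.8, only confident predictions become positive labels
strict_labels = (probabilities > 0.8).astype(int)

# With a threshold of 0.5, the record scored 0.60 also becomes positive
default_labels = (probabilities > 0.5).astype(int)

print(strict_labels)
print(default_labels)
```

A higher threshold yields fewer positive predictions, which in turn affects the post-training bias metrics computed on predicted labels.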

In [25]:
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1], facet_name="AGEGROUPS", group_name="CITY"
)
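To give a feel for the pre-training metrics that the report contains, the sketch below computes two of them by hand on hypothetical toy data: class imbalance (CI) and difference in proportions of labels (DPL). It compares one facet value against all other records. The helper `pretraining_bias` is made up for illustration and is not part of the SageMaker SDK.

```python
import pandas as pd

# Hypothetical toy data: 6 adults (4 positive labels) and 4 older patients (1 positive label)
toy = pd.DataFrame(
    {
        "AGEGROUPS": ["Adult"] * 6 + ["Old"] * 4,
        "Target": [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    }
)


def pretraining_bias(df, facet_column, facet_value, label_column, positive_label=1):
    """Hand-rolled CI and DPL for one facet value versus the rest (illustrative only)."""
    in_facet = df[facet_column] == facet_value
    n_d = int(in_facet.sum())     # records in the examined facet
    n_a = int((~in_facet).sum())  # all remaining records
    q_d = (df.loc[in_facet, label_column] == positive_label).mean()
    q_a = (df.loc[~in_facet, label_column] == positive_label).mean()
    return {
        "CI": (n_a - n_d) / (n_a + n_d),  # imbalance in group sizes
        "DPL": q_a - q_d,                 # gap in positive-label rates
    }


metrics = pretraining_bias(toy, "AGEGROUPS", "Old", "Target")
print(metrics)
```

Positive values of both metrics indicate that the examined facet is underrepresented and receives the positive label less often than the remaining records.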

The following code runs the SageMaker Clarify bias report processor.

In [26]:
clarify_processor.run_bias(
    data_config=bias_data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    pre_training_methods="all",
    post_training_methods="all",
)


Job Name:  Clarify-Bias-2022-11-04-23-39-00-817
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-068232277654/chapter11/train.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-068232277654/chapter11/clarify-bias/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-068232277654/chapter11/clarify-bias', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
...............................2022-11-04 23:44:02,641 logging.conf not found wh

In [27]:
bias_report_output_path

's3://sagemaker-us-east-1-068232277654/chapter11/clarify-bias'

Similarly, we can run the SageMaker Clarify explainability report using SHAP configurations, as shown in the following code.

In [28]:
shap_config = clarify.SHAPConfig(
    baseline=[X_test.iloc[0].values.tolist()],
    num_samples=15,
    agg_method="mean_abs",
    save_local_shap_values=True,
)

explainability_output_path = "s3://{}/{}/clarify-explainability".format(bucket, prefix)
explainability_data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=explainability_output_path,
    label="Target",
    headers=X_train.columns.to_list(),
    dataset_type="text/csv",
)
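The `agg_method="mean_abs"` setting controls how per-record (local) SHAP values are combined into a global feature importance. The sketch below shows the idea with NumPy on hypothetical local SHAP values. The numbers are made up and are not output from this model.

```python
import numpy as np

feature_names = ["CITY", "COUNTY", "AGEGROUPS"]

# Hypothetical local SHAP values: one row per record, one column per feature
local_shap = np.array(
    [
        [0.10, -0.02, 0.30],
        [-0.20, 0.01, 0.10],
        [0.05, -0.03, -0.40],
        [-0.05, 0.02, 0.20],
    ]
)

# mean_abs: average the magnitudes so positive and negative effects do not cancel
global_importance = np.abs(local_shap).mean(axis=0)

# A plain mean lets opposite-signed contributions offset each other
plain_mean = local_shap.mean(axis=0)

ranking = sorted(zip(feature_names, global_importance), key=lambda pair: pair[1], reverse=True)
print(ranking)
print(plain_mean)
```

In this toy example, `AGEGROUPS` ranks first under `mean_abs`, while its plain mean is much smaller because one large negative contribution offsets the positive ones.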

The following code runs the model explainability report using SageMaker Clarify.

In [29]:
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
)


Job Name:  Clarify-Explainability-2022-11-04-23-47-59-831
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-068232277654/chapter11/train.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-068232277654/chapter11/clarify-explainability/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-068232277654/chapter11/clarify-explainability', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
................................2022-11-04 23:53:0

Once you have generated the two reports, return to the subsection *Viewing the bias and explainability reports in SageMaker Studio* to learn how to visualize these reports in SageMaker Studio.

You can delete the model created in this exercise by running the following line of code.

In [30]:
session.delete_model(model_name)