# Diabetes Prediction: MIBA Sec B Team 6

**Goal:** Based on indicators related to a person's physical health, we would like to predict if a person is at risk of developing diabetes. 

The business case and reasoning behind this is explained in our presentation.

## Initial preparation of data and environment

### Necessary imports

In [138]:
try:
    import s3fs
    print("s3fs is already installed.")
except ImportError:
    print("s3fs not found, installing...")
    %pip install s3fs --quiet

s3fs is already installed.


In [139]:
import numpy as np
import io
import pandas as pd

import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.inputs import TrainingInput
from sagemaker.s3 import S3Downloader

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

import xgboost as xgb
from xgboost import plot_importance
import matplotlib.pyplot as plt
import tarfile
import os

from datetime import datetime, timezone
import time

### Check for new dataset upload in S3 bucket

In the following, we **define a loop to check whether a new dataset has been uploaded to the S3 bucket** within the last 24 hours.

- We establish a connection to AWS S3 using boto3, specifying a region and a bucket name, to interact with stored files.
- We define the function `get_latest_file`, which paginates through the objects in the specified S3 bucket and returns the metadata of the most recently modified file.
- We then implement the function `process_new_file`, which downloads the file identified by `get_latest_file` from S3 to local storage, reads it into a pandas DataFrame, and returns this DataFrame.
- We set up a loop that retrieves the latest file and loads it with `process_new_file`. The loop breaks once a file has been processed successfully or when no file is found, and otherwise retries after 60 seconds. The comparison of the file's `LastModified` timestamp against the 24-hour window is marked as a placeholder in the code.
- We also include error handling to manage exceptions during file retrieval and processing, so that a failed request does not crash the notebook.

In [140]:
import boto3
import pandas as pd
from datetime import datetime, timezone
import os  # For file handling

# Configure AWS region and bucket name
s3_client = boto3.client('s3', region_name='eu-north-1')
bucket_name = 'prepreocesseddiabetescpfp'

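def get_latest_file():
    """Return the metadata of the most recently modified object in the bucket, or None."""
    latest_file = None
    try:
        paginator = s3_client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name):
            # Compare across all pages, not only the first one
            for obj in page.get('Contents', []):
                if latest_file is None or obj['LastModified'] > latest_file['LastModified']:
                    latest_file = obj
    except Exception as e:
        print(f"Error getting latest file: {e}")
        return None
    return latest_file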

def process_new_file(file_info):
    local_file_name = file_info['Key']
    try:
        # Download the file from S3 to local storage
        s3_client.download_file(bucket_name, file_info['Key'], local_file_name)
        print(f"File {local_file_name} downloaded successfully.")
        
        # Read the downloaded file into a DataFrame and return it
        df_diabetes = pd.read_csv(local_file_name)
        print("Latest CSV file read into DataFrame 'df_diabetes'.")
        print(df_diabetes.head())
        
        return df_diabetes
        
    except Exception as e:
        print(f"Error processing file: {e}")
        return None

# Poll the bucket until the latest file has been loaded
df_diabetes = None  # Initialize df_diabetes before the loop
while True:
    latest_file = get_latest_file()
    if latest_file:
        # Placeholder: a check that latest_file['LastModified'] lies within
        # the last 24 hours would go here; it is not implemented in this cell
        df_diabetes = process_new_file(latest_file)  # Capture the returned DataFrame
        if df_diabetes is not None:
            break
    else:
        print("No files found in the bucket.")
        break  # Break the loop if no files are found
    time.sleep(60)  # Wait before retrying after a failed download or read

# After the loop, check if df_diabetes is not None before proceeding
if df_diabetes is not None:
    # Proceed with your data processing using df_diabetes
    data = df_diabetes

File diabetes_prediction_dataset.csv downloaded successfully.
Latest CSV file read into DataFrame 'df_diabetes'.
   gender   age  hypertension  heart_disease smoking_history    bmi  \
0  Female  80.0             0              1           never  25.19   
1  Female  54.0             0              0         No Info  27.32   
2    Male  28.0             0              0           never  27.32   
3  Female  36.0             0              0         current  23.45   
4    Male  76.0             1              1         current  20.14   

   HbA1c_level  blood_glucose_level  diabetes  
0          6.6                  140         0  
1          6.6                   80         0  
2          5.7                  158         0  
3          5.0                  155         0  
4          4.8                  155         0  
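The 24-hour recency check described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the helper names `latest_object` and `is_recent` are hypothetical, and the toy pages (keys and timestamps) are made up to mimic the shape of the paginator output.

```python
from datetime import datetime, timedelta, timezone

def latest_object(pages):
    """Return the most recently modified object across all result pages, or None."""
    objects = [obj for page in pages for obj in page.get("Contents", [])]
    return max(objects, key=lambda obj: obj["LastModified"], default=None)

def is_recent(last_modified, now=None, max_age=timedelta(hours=24)):
    """True if `last_modified` lies within `max_age` before `now` (both timezone-aware)."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified <= max_age

# Toy pages shaped like the paginator output (keys and timestamps are made up)
now = datetime(2024, 4, 2, 20, 0, tzinfo=timezone.utc)
pages = [
    {"Contents": [{"Key": "old.csv", "LastModified": now - timedelta(days=3)}]},
    {"Contents": [{"Key": "new.csv", "LastModified": now - timedelta(hours=2)}]},
    {},  # a page without objects
]

newest = latest_object(pages)
print(newest["Key"], is_recent(newest["LastModified"], now=now))
```

In the loop, a check of this kind would sit where the placeholder comment is, so that `process_new_file` is only called for a file that passes it.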


### Data Loading & preprocessing pipeline definition

Once the check above has found a **new dataset in the S3 bucket**, we load this dataset into our notebook.
The dataset was originally gathered from Kaggle, specifically from here: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/data.
However, this type of data could also be available to a healthcare system like the Spanish one, on which our business case is based.

We then conduct the following **preprocessing steps**, which we found to be necessary when inspecting the dataset:
- We separate the dataset into features (X) and the target variable (y), where 'diabetes' is the target variable we aim to predict.
- We define a preprocessing pipeline for numerical features, which fills missing values with the mean of the column and then scales the data to have a mean of 0 and a standard deviation of 1.
- We also define a preprocessing pipeline for categorical features, which fills missing values with the most frequent value in the column and then applies one-hot encoding to turn the categorical variables into numeric columns that ML algorithms can consume.
- We identify the categorical and numerical columns in our dataset, naming 'gender' and 'smoking_history' as categorical, and 'age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', and 'blood_glucose_level' as numerical.
- Finally, we combine these preprocessing steps into a single ColumnTransformer. This applies the respective transformations to the numerical and categorical columns while passing through any other columns unmodified.

In [141]:
data = df_diabetes

# Store prediction target in separate variable
X = data.drop('diabetes', axis=1)
y = data['diabetes']

# Define the preprocessing for numerical columns:
# SimpleImputer to fill missing values with the mean, then scaling the data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define the preprocessing for categorical columns:
# Filling missing values with the most frequent value then applying one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Define categorical and numerical columns
categorical_cols = [
    'gender', 'smoking_history'
]
numerical_cols = ['age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level']

# Create the ColumnTransformer with both transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ], remainder='passthrough')
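To make the two pipelines more concrete, here is a hand-rolled sketch of what they compute: mean imputation followed by standardization, and one-hot encoding that ignores unseen categories. It is an illustration under simplifying assumptions, not scikit-learn's implementation; the function names and all numbers are made up.

```python
from statistics import fmean, pstdev

def fit_numeric(values):
    """Learn the fill value, then the mean and population std of the imputed column."""
    observed = [v for v in values if v is not None]
    fill_value = fmean(observed)
    filled = [fill_value if v is None else v for v in values]
    return fill_value, fmean(filled), pstdev(filled)

def transform_numeric(values, fill_value, mean, std):
    """Fill missing values with the learned mean, then standardize."""
    filled = [fill_value if v is None else v for v in values]
    return [(v - mean) / std for v in filled]

def one_hot(values, categories):
    """Encode against the known categories; unknown values become all-zero rows."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

train_bmi = [20.0, 30.0, None, 25.0]  # made-up numbers
val_bmi = [35.0, None]

params = fit_numeric(train_bmi)              # statistics come from the training data only
train_scaled = transform_numeric(train_bmi, *params)
val_scaled = transform_numeric(val_bmi, *params)  # validation reuses the training statistics
print(train_scaled, val_scaled)

categories = sorted({"Female", "Male"})      # categories seen while fitting
encoded = one_hot(["Male", "Other"], categories)
print(encoded)  # "Other" was not seen while fitting, so its row is all zeros
```

The all-zero row mirrors the intent of `handle_unknown='ignore'`: a category that did not appear in the training data does not raise an error at transform time.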

We quickly **inspect shape and overall structure of the data** to ensure everything looks as expected:

In [142]:
data.shape

(100000, 9)

In [143]:
data.head()

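| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |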


### Train / val / test split & transformation

- We **split the dataset into training, validation, and test sets** using an 80/20 split for the initial separation of training and testing/validation data, and then a 50/50 split to further divide the test/validation data into validation and test sets. The random_state parameter ensures that our splits are reproducible.
- We apply the previously defined preprocessing steps to the training, validation, and test sets. This involves fitting the preprocessing pipeline to the training data to learn the necessary transformations and then transforming the training data based on this fit.
- The validation and test sets are transformed using the same preprocessing steps as the training set, but without fitting the preprocessing pipeline again. This ensures that all sets are transformed consistently and prevents data leakage from the validation and test sets into the model training process.

In [144]:
# Split data into training, validation, and test sets
X_train, X_testval, y_train, y_testval = train_test_split(X, y, train_size=0.8, random_state=1200)
X_val, X_test, y_val, y_test = train_test_split(X_testval, y_testval, train_size=0.5, random_state=1200)

# Apply preprocessing to the training, validation, and test sets:

# Fit and transform the training set
X_train_preprocessed = preprocessor.fit_transform(X_train)

# Transform the validation and test set
X_val_preprocessed = preprocessor.transform(X_val)
X_test_preprocessed = preprocessor.transform(X_test)
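The two-stage split can be illustrated with a small standard-library sketch that reproduces the 80/10/10 proportions. It only mimics the proportions; the shuffling is not the algorithm used by `train_test_split`, and the function name `two_stage_split` is hypothetical.

```python
import random

def two_stage_split(rows, train_size=0.8, seed=1200):
    """Shuffle once, take `train_size` for training, then halve the remainder."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_size)
    train, rest = shuffled[:n_train], shuffled[n_train:]
    n_val = len(rest) // 2
    return train, rest[:n_val], rest[n_val:]

train, val, test = two_stage_split(list(range(1000)))
print(len(train), len(val), len(test))
```

Applied to the 100,000 rows of the dataset, the same proportions give the 80,000 / 10,000 / 10,000 rows used in this notebook.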

After the split, we check the shapes of the three sets to confirm that the split and the transformation worked as expected:

In [145]:
X_train_preprocessed.shape, X_val_preprocessed.shape, X_test_preprocessed.shape

((80000, 15), (10000, 15), (10000, 15))

### Transform preprocessed data

The preprocessing returns plain arrays, so the feature names are lost. We now restore them: the numerical columns keep their names, and the names of the one-hot encoded columns are read from the fitted encoder.

In [146]:
# For the numerical features, there is no change necessary in the naming of the columns
numerical_features = numerical_cols

# For the categorical features, we have to get the new column names from the one-hot encoder
categorical_features = preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_cols)

# We then need to combine the feature names from both transformations
all_features = list(numerical_features) + list(categorical_features)

# Create DataFrames with the proper column names
X_train_preprocessed = pd.DataFrame(X_train_preprocessed, columns=all_features)
X_val_preprocessed = pd.DataFrame(X_val_preprocessed, columns=all_features)
X_test_preprocessed = pd.DataFrame(X_test_preprocessed, columns=all_features)

# Recombine X and y for each set before uploading to S3, because SageMaker expects the prediction target to be the first column.
# .copy() keeps the feature-only DataFrames unchanged when the target column is inserted.
train_preprocessed = X_train_preprocessed.copy()
train_preprocessed.insert(0, 'diabetes', y_train.values)

val_preprocessed = X_val_preprocessed.copy()
val_preprocessed.insert(0, 'diabetes', y_val.values)

test_preprocessed = X_test_preprocessed.copy()
test_preprocessed.insert(0, 'diabetes', y_test.values)

train_preprocessed

| | diabetes | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.081830 | -0.284208 | -0.202524 | -0.001838 | -0.492907 | -0.295076 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 1.695130 | -0.284208 | 4.937683 | -0.001838 | 0.441547 | -0.295076 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0 | 0.362410 | -0.284208 | -0.202524 | -0.001838 | 0.161211 | 0.540355 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0 | 0.406834 | 3.518546 | -0.202524 | -0.458663 | 1.002220 | -0.196790 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 1.695130 | -0.284208 | -0.202524 | -0.001838 | 0.628438 | 0.515784 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79995 | 0 | 0.895498 | -0.284208 | -0.202524 | 1.137219 | 1.002220 | -1.302507 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 79996 | 0 | -0.481645 | -0.284208 | -0.202524 | 1.858522 | -1.894589 | -0.295076 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 79997 | 0 | 1.295314 | -0.284208 | -0.202524 | -0.001838 | 0.441547 | 0.491212 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 79998 | 0 | -1.785933 | -0.284208 | -0.202524 | -1.083792 | -1.894589 | -1.425364 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Define S3 upload function

To work with SageMaker, we now upload our data to S3.

- We initialize a resource instance for Amazon S3 using the boto3 library. This instance allows us to interact with our S3 buckets.
- We define a function `upload_to_s3` that takes a pandas DataFrame, a bucket name, and a filename as its arguments. This function uploads the DataFrame to the given S3 bucket.
- Inside the function, we first convert the DataFrame to a CSV string without header and index columns. This conversion is done in memory using `io.StringIO`, which acts as a temporary buffer for the CSV string.
- We then create an object in the specified S3 bucket with the given filename and upload the CSV string by calling the `put` method on the S3 object. This saves the DataFrame as a CSV file in the bucket under the specified filename.

In [147]:
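s3 = boto3.resource('s3')

def upload_to_s3(df, bucket, filename):
    # Serialize the DataFrame to CSV in memory, without header and index
    placeholder = io.StringIO()
    df.to_csv(placeholder, header=False, index=False)
    # Upload csv string to S3 (named s3_object to avoid shadowing the built-in `object`)
    s3_object = s3.Object(bucket, filename)
    s3_object.put(Body=placeholder.getvalue())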
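As a small illustration of the file layout that ends up in S3 (target in the first column, no header row, no index), the following sketch serializes two rows in memory with the standard library. The helper name `to_headerless_csv` and the shortened rows are assumptions made for this example only.

```python
import csv
import io

def to_headerless_csv(rows):
    """Serialize rows to a CSV string without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

# Each row starts with the target (diabetes), followed by a few feature values
rows = [
    (0, -0.08183, 1.0, 0.0),
    (1, 1.69513, 0.0, 1.0),
]
body = to_headerless_csv(rows)
print(body)
```

The first field of every line is the label, which is the layout the training channel is expected to receive.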

After defining this, we upload the train and validation split to another S3 bucket.

In [148]:
upload_to_s3(train_preprocessed, 'diabeterstraindata', 'sagemaker-data/kc/train.csv')
upload_to_s3(val_preprocessed, 'diabeterstraindata', 'sagemaker-data/kc/val.csv')

We now verify these files have been uploaded to S3 by checking our S3 bucket using the AWS console.

## Setting up the model (XGBoost)

Before we dive into the steps we have taken to set up the model, we quickly want to explain why we decided to choose **XGBoost** (eXtreme Gradient Boosting) as a predictive model:

#### 1. High Predictive Accuracy

- XGBoost is renowned for its ability to deliver highly accurate predictions.
- This accuracy is critical in correctly identifying individuals at risk of developing diabetes, ensuring that interventions are appropriately targeted.

#### 2. Complex Relationship Modeling

- The risk factors for diabetes are numerous and complex, including lifestyle, genetic factors, and other health conditions.
- XGBoost excels at uncovering nonlinear relationships and interactions between these factors, making it particularly suited for this application.

#### 3. Efficiency with Large Datasets

- Healthcare data often encompasses extensive records with multiple features.
- XGBoost is designed to be efficient and scalable, making it capable of handling the volume and complexity of healthcare datasets without compromising on speed or performance - this would also make it possible in the future to include more features should more types of data be collected.

#### 4. Feature Importance Insights

- Understanding which factors contribute most significantly to diabetes risk is crucial for prevention efforts.
- XGBoost provides feature importance scores, offering valuable insights into the most impactful risk factors.
- This information can guide healthcare providers in designing more effective diabetes prevention programs.

#### 5. Flexibility and Customization

- XGBoost offers a wide range of hyperparameters that can be tuned to optimize performance for specific datasets.
- This flexibility allows us to tailor the model to best fit the unique characteristics of the data related to diabetes risk.

#### Summary

Given its predictive power, efficiency, and the actionable insights it can provide, we consider XGBoost a well-suited choice for developing a model to predict diabetes risk. By implementing this model, we aim to help reduce the incidence of diabetes and alleviate its financial burden on the Spanish National Health System, ultimately contributing to healthier communities and more sustainable healthcare spending.

We use the class `Estimator` from the `sagemaker.estimator` module. It creates the **environment** to run training jobs for a model.

We specify: 

- A container name (SageMaker works with containers; this code points to a pre-existing container that holds everything needed to run XGBoost).
- A role name (the training job needs a role with sufficient permissions, similar to what we saw with Lambda functions). Remember that we created this role when starting the notebook server.
- The number of instances for training (we use 1 but could use more in large jobs, to scale).
- The type of instance (we select one that's included in the SageMaker Free Tier).
- The output path, where the model and other info will be written.
- The hyperparameters of the algorithm (evaluation metrics, number of training rounds, and loss function).
- The current session (needed for internal purposes).

In [149]:
role = sagemaker.get_execution_role()
region_name = boto3.Session().region_name
container = sagemaker.image_uris.retrieve('xgboost', region_name, version='0.90-1')
output_location = 's3://diabetesstorage/sagemaker-output/'

hyperparams = {
    'eval_metric': 'auc,logloss,error',
    'num_round': '20',
    'objective': 'binary:logistic'
}

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.xlarge',  # Adjusted to a supported instance type
    output_path=output_location,
    hyperparameters=hyperparams,
    sagemaker_session=sagemaker.Session()
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


Now we create **"channels"**. For each dataset, we specify where the data is located and in which format, and we collect these inputs in a dictionary keyed by channel name.

In [150]:
train_channel = TrainingInput(
    s3_data='s3://diabeterstraindata/sagemaker-data/kc/train.csv',
    content_type='text/csv'
)
val_channel = TrainingInput(
    s3_data='s3://diabeterstraindata/sagemaker-data/kc/val.csv',
    content_type='text/csv'
)

channels_for_training = {
    'train': train_channel,
    'validation': val_channel
}

## Train model (XGBoost)

Now, a training job will be launched.

- We call the `fit` method on the estimator object, passing it `channels_for_training` as the input. This input specifies where the training and validation data are located. By setting `logs=False`, we opt not to show the detailed logs generated during training. This step starts the training job on the provided data channels.
- We access the `_current_job_name` attribute of the estimator object, which holds the unique name of the training job that was just launched. Since this attribute is private, `estimator.latest_training_job.name` is the public way to read the same name.


In [151]:
estimator.fit(inputs=channels_for_training, logs=False)

estimator._current_job_name

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-04-02-20-09-54-006



2024-04-02 20:09:54 Starting - Starting the training job..
2024-04-02 20:10:09 Starting - Preparing the instances for training...
2024-04-02 20:10:33 Downloading - Downloading input data...
2024-04-02 20:10:53 Downloading - Downloading the training image....
2024-04-02 20:11:18 Training - Training image download completed. Training in progress....
2024-04-02 20:11:39 Uploading - Uploading generated training model.
2024-04-02 20:11:50 Completed - Training job completed


'sagemaker-xgboost-2024-04-02-20-09-54-006'

### Retrieve metrics of the model (XGBoost)

We considered the following metrics for evaluating the model:
- **AUC:** We use this metric to measure the ability of a classifier to distinguish between classes. An AUC of 1 represents a perfect model, while an AUC of 0.5 suggests no discriminative power, equivalent to random guessing.
    - The train:auc indicates the AUC on the training dataset, and validation:auc shows the AUC on the validation dataset. High values on both metrics indicate that the model has good predictive performance and can distinguish well between the positive and negative classes.
- **Logloss:** Logarithmic loss (log loss) measures the performance of a classification model whose prediction is a probability between 0 and 1. The goal is to minimize this value, with 0 corresponding to perfectly confident and correct predictions.
    - The train:logloss is the log loss on the training dataset, while the validation:logloss is on the validation dataset. Consistently low log loss values on both training and validation suggest that the model is performing well and is not overfitting.
- **Error:** The classification error represents the fraction of predictions that were incorrect. The goal is to minimize this metric. A low training error is expected, but what's more important is the validation error, as it indicates how well the model is expected to perform on unseen data.
    - If the validation error is significantly higher than the training error, it could indicate that the model may not generalize well.

Evaluating these metrics together gives us a more comprehensive understanding of the model's performance. We want to see high AUC values along with low log loss and error values on both training and validation datasets. It's also important that the training and validation metrics are close to each other, which suggests that the model generalizes well and is not overfitting or underfitting.
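To make the three metrics tangible, the following sketch computes them by hand for four made-up predictions. It is a plain-Python illustration of the definitions, not the code the training container runs; the 0.5 threshold for the error follows the convention of XGBoost's `error` metric, and the pairwise AUC is only practical for small examples.

```python
import math

def classification_error(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(y_true, y_prob))
    return wrong / len(y_true)

def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(y_true, y_prob))
    return -total / len(y_true)

def auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]          # made-up labels
y_prob = [0.1, 0.6, 0.4, 0.9]  # made-up predicted probabilities

print(auc(y_true, y_prob), log_loss(y_true, y_prob), classification_error(y_true, y_prob))
```

In this toy case three of the four positive-negative pairs are ranked correctly (AUC 0.75), while two of the four samples fall on the wrong side of the 0.5 threshold (error 0.5), which shows that the metrics capture different aspects of the same predictions.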

In [152]:
# Get metrics of model
metrics = sagemaker.analytics.TrainingJobAnalytics(
    estimator._current_job_name,
    metric_names=['train:auc', 'validation:auc', 'train:logloss', 'validation:logloss', 'train:error', 'validation:error']
)
metrics.dataframe()

| | timestamp | metric_name | value |
|---|---|---|---|
| 0 | 0.0 | train:auc | 0.978681 |
| 1 | 0.0 | validation:auc | 0.976233 |
| 2 | 0.0 | train:logloss | 0.086373 |
| 3 | 0.0 | validation:logloss | 0.091624 |
| 4 | 0.0 | train:error | 0.027681 |
| 5 | 0.0 | validation:error | 0.0297 |


One of the useful things XGBoost normally enables is determining the **feature importance**, which can help us see which features are actually relevant to our prediction model. After multiple tries, we have unfortunately not been able to make this work, as it exceeds our knowledge from the course. We nevertheless document our attempt in the following cell.

In [153]:
## Feature importance
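#### Get feature importance from the model
# output_location already ends with '/', so it is stripped to avoid a double slash in the URI
# model_artifact_s3_uri = f"{output_location.rstrip('/')}/{estimator.latest_training_job.name}/output/model.tar.gz"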
#### Define a local path to download the model artifact
# local_path = 'downloaded_model.tar.gz'
#### Download the model artifact from S3
# S3Downloader.download(s3_uri=model_artifact_s3_uri, local_path=local_path)
#### Extract the model artifact
# with tarfile.open(local_path) as tar:
#    tar.extractall()
# model_bin_file = [file for file in os.listdir('.') if file.endswith('.model') or file.endswith('.bin')][0]
#### Load the model
# xgboost_model = xgb.Booster()
# xgboost_model.load_model(model_bin_file)
#### Plot the feature importance
# plt.figure(figsize=(10, 8))
# plot_importance(xgboost_model)
# plt.title("Feature Importance")
# plt.show()

### Interpretation of metrics

- **AUC (Area Under the Curve):**
    - train:auc: 0.978681 - This is a high AUC value for the training set, which suggests that our model has a strong discriminative power between the positive and negative cases in the training data.
    - validation:auc: 0.976233 - Our validation AUC is also high and quite close to the training AUC, indicating that our model's ability to differentiate between classes generalizes well to the validation set.
- **Log Loss:**
    - train:logloss: 0.086373 - The training log loss is low, indicating a good fit of our model on the training data. Our model gives well-calibrated probabilities for the training set.
    - validation:logloss: 0.091624 - Our validation log loss is only slightly higher than the training log loss. This suggests that our model's predictions on the validation set are also well-calibrated and that the model is likely not overfitting.
- **Error:**
    - train:error: 0.027681 - The low training error rate indicates that our model is accurate in its predictions on the training set.
    - validation:error: 0.029700 - The validation error is very close to the training error, which is a positive sign. It means our model achieves similar accuracy on the validation set, which is not seen during training.

Overall, these metrics suggest that our XGBoost model is performing well. The similar values for training and validation metrics indicate that the model is generalizing well to unseen data, without signs of overfitting. The high AUC values suggest strong classification capability, the low log loss indicates confidence in predictions, and the low error rates reflect high accuracy.

## Model tuning

- We use classes from the SageMaker Python SDK that were imported at the top of the notebook, including the parameter types for hyperparameter tuning and the TrainingInput class for specifying training data inputs.
- We define a range of values for various hyperparameters that will be explored during the hyperparameter tuning process. These parameters include both continuous values (like 'eta', 'min_child_weight', 'alpha', etc.) and integer values ('max_depth'). The aim is to find the best combination of these hyperparameters that optimizes the model's performance.
- We set specific hyperparameters on the estimator object, such as eval_metric, num_round, objective, and early_stopping_rounds. These settings include the evaluation metric to be optimized, the number of training rounds, the objective function of the model (binary logistic regression in this case), and the number of rounds with no improvement to wait before stopping training early.
- We specify the objective metric name ('validation:auc') that the hyperparameter tuning job should optimize for. This is the metric that SageMaker's hyperparameter tuner will attempt to maximize or minimize.
- We create a HyperparameterTuner object by providing the estimator, the objective metric name, the hyperparameter ranges, and limits on the number of jobs and parallel jobs. This setup configures how the hyperparameter tuning process will be executed.
- We prepare the training and validation datasets by specifying their locations in S3 and their content type (CSV in this case), using the TrainingInput class.
- Finally, we launch the hyperparameter tuning job by calling the fit method on the tuner object and passing it the training and validation inputs. This step initiates the search for the best hyperparameter values within the specified ranges by training multiple models in parallel, as configured, and evaluating their performance on the validation dataset.

In [154]:
# Specify the hyperparameter ranges
hyperparameter_ranges = {
    'eta': ContinuousParameter(0.01, 0.2),
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2),
    'max_depth': IntegerParameter(3, 10),
    'subsample': ContinuousParameter(0.5, 1),
    'colsample_bytree': ContinuousParameter(0.5, 1),
    'gamma': ContinuousParameter(0, 5),
    'lambda': ContinuousParameter(1e-5, 10),
}

estimator.set_hyperparameters(
    eval_metric='auc,logloss,error',
    num_round=1000,  # Set a large number for num_round and rely on early stopping
    objective='binary:logistic',
    early_stopping_rounds=10
)

# Specify the objective metric
objective_metric_name = 'validation:auc'

# Create and launch a hyperparameter tuning job
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=30,
                            max_parallel_jobs=10)

# Specify the training and validation data locations and content type
train_input = TrainingInput('s3://diabeterstraindata/sagemaker-data/kc/train.csv', content_type='text/csv')
validation_input = TrainingInput('s3://diabeterstraindata/sagemaker-data/kc/val.csv', content_type='text/csv')

# Launch a hyperparameter tuning job
tuner.fit({'train': train_input, 'validation': validation_input})

INFO:sagemaker:Creating hyperparameter tuning job with name: sagemaker-xgboost-240402-2011


..............................................................!
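The tuning jobs set `num_round=1000` and rely on `early_stopping_rounds=10` to end training once the monitored validation metric stops improving. The idea can be sketched in a few lines; this is a simplified illustration rather than XGBoost's implementation, and the function name and scores are made up.

```python
def best_round_with_early_stopping(scores, patience=10, maximize=True):
    """Return (best_round, last_round_run) for a sequence of per-round validation scores."""
    best_round, best_score = 0, scores[0]
    for current_round, score in enumerate(scores):
        improved = score > best_score if maximize else score < best_score
        if current_round == 0 or improved:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, current_round
    return best_round, len(scores) - 1

scores = [0.90, 0.95, 0.97, 0.969, 0.968, 0.97, 0.965]  # made-up validation AUC per round
result = best_round_with_early_stopping(scores, patience=3)
print(result)
```

With a patience of 3, training stops at round 5 because round 2 was the last improvement, so the final round in the list is never run.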


### Retrieving metrics of model tuning

- We initialize a SageMaker session, which provides a managed environment within Amazon SageMaker for managing training jobs, endpoints, and other SageMaker resources.
- We retrieve the name of the best training job resulting from a hyperparameter tuning process using the best_training_job() method of the tuner object. This job is identified as the one that achieved the best performance on the specified objective metric during the tuning process.
- We then obtain detailed information about this best training job by invoking the describe_training_job() method of the SageMaker session, passing in the name of the best training job. This method returns a dictionary containing comprehensive details about the training job, including its hyperparameters, input data configuration, and metrics.
- The hyperparameters used by the best training job are accessed from the 'HyperParameters' key of the dictionary returned by describe_training_job(). We print these hyperparameters to understand which values led to the best model performance.
- We also retrieve and print the metrics associated with the best training job, specifically looking for metrics related to model validation such as 'validation:auc', 'validation:logloss', and 'validation:error'. These metrics help us evaluate how well the model performed on the validation dataset, giving insights into its generalization ability and predictive performance.

In [155]:
sagemaker_session = sagemaker.Session()

# Get the name of the best training job from hyperparameter tuning
best_training_job_name = tuner.best_training_job()

# Now get the details of the best training job
best_training_job_info = sagemaker_session.describe_training_job(best_training_job_name)

# The hyperparameters of the best training job are in the 'HyperParameters' key
best_hyperparameters = best_training_job_info['HyperParameters']

print(f"The best hyperparameters are: \n {best_hyperparameters}")

# Describe the best training job to get the metrics
metrics = best_training_job_info['FinalMetricDataList']

print("\nThe following metrics are from the best model:")
# Print out the metrics for the best training job
for metric in metrics:
    if metric['MetricName'] in ['validation:auc', 'validation:logloss', 'validation:error']:
        print(f"{metric['MetricName']}: {metric['Value']}")

The best hyperparameters are: 
 {'_tuning_objective_metric': 'validation:auc', 'alpha': '0.8414728374508231', 'colsample_bytree': '0.5322761211279915', 'early_stopping_rounds': '10', 'eta': '0.17491253458439093', 'eval_metric': 'auc,logloss,error', 'gamma': '4.562856759911896', 'lambda': '0.0002383571851544438', 'max_depth': '5', 'min_child_weight': '7.7352942236405875', 'num_round': '1000', 'objective': 'binary:logistic', 'subsample': '0.9211582162396529'}

The following metrics are from the best model:
validation:logloss: 0.08240800350904465
validation:auc: 0.9786859750747681
validation:error: 0.02889999933540821
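
As the printed dictionary shows, SageMaker returns every hyperparameter value as a string. If the values are to be reused elsewhere (for example, to retrain locally), they first need to be cast to numbers. The following is a minimal sketch of such a conversion; the helper name `cast_hyperparameters` is hypothetical, and dropping keys that start with an underscore rests on the assumption that they are bookkeeping entries added by the tuner rather than model parameters.

```python
def cast_hyperparameters(raw):
    """Convert string-valued hyperparameters to int/float where possible."""
    typed = {}
    for key, value in raw.items():
        # Assumption: keys with a leading underscore are tuner bookkeeping
        if key.startswith("_"):
            continue
        for caster in (int, float):
            try:
                typed[key] = caster(value)
                break
            except ValueError:
                continue
        else:
            # Neither int nor float: keep the original string
            typed[key] = value
    return typed


# A few of the values printed above, used as sample input
sample = {
    "_tuning_objective_metric": "validation:auc",
    "eta": "0.17491253458439093",
    "max_depth": "5",
    "num_round": "1000",
    "objective": "binary:logistic",
}
print(cast_hyperparameters(sample))
```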


### Interpretation of metrics

- The **hyperparameter tuning process has optimized the model** to achieve a validation AUC of 0.9787, indicating that our model is very good at distinguishing between the two classes. The model also shows low log loss (0.0824) and low error (0.0289) on the validation set, suggesting high accuracy and well-calibrated predicted probabilities. Based on these validation metrics, we expect the model to generalize well to new data; the test-set evaluation further below checks this.
- The selected hyperparameters include a **noticeable alpha (L1 regularization term) of about 0.84 and a very small lambda (L2 regularization term) of about 0.00024**. The regularization that **guards against overfitting** therefore comes mainly from the L1 term, while the L2 term contributes little.
- A **gamma value of about 4.56** specifies the minimum loss reduction required to make a further partition on a leaf node of the tree. This relatively high value means the model is more conservative and less prone to overfitting.
- The **learning rate (eta) is about 0.175**, which is a moderate value, balancing the speed of learning and the risk of overfitting by controlling the contribution of each tree to the final model.
- The **maximum depth of the trees, 'max_depth': '5', is moderate,** allowing the model to capture interactions between features while limiting the complexity of each individual tree.
- The **'min_child_weight' of about 7.74** is **relatively high**, which can help prevent overfitting by making the algorithm more conservative: a split is only made if each resulting child node carries a sufficiently large sum of instance weights.
- The **'subsample' rate is about 0.921**, meaning that each tree is built using approximately 92.1% of the training rows, which **helps in preventing overfitting** by adding randomness to the model.
- The **'colsample_bytree' parameter is about 0.532**, indicating that each tree uses around 53.2% of the features. This feature sampling increases the diversity of the trees and further guards against overfitting.
- 'num_round' (the number of boosting rounds) is 1000. This is an **upper bound on the number of trees**: combined with early stopping, training can end well before this limit is reached.
- The early **stopping parameter 'early_stopping_rounds': '10'** means that the **model training will stop if the validation metric does not improve for 10 consecutive rounds,** helping to prevent overfitting and unnecessary computations.
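
To make the interaction between 'num_round' and 'early_stopping_rounds' concrete, the sketch below applies the stopping rule to a made-up sequence of per-round validation AUC values. It is a simplified illustration of the rule only, not XGBoost's internal implementation; the function name and the sample scores are hypothetical.

```python
def early_stopping_round(scores, patience=10, maximize=True):
    """Return (stop_round, best_round) for per-round validation scores.

    Training stops at the first round where the score has not improved
    for `patience` consecutive rounds; otherwise it runs to the end.
    """
    best_score = None
    best_round = -1
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            return i, best_round
    return len(scores) - 1, best_round


# Hypothetical AUC curve: improves for three rounds, then plateaus
auc_per_round = [0.90, 0.95, 0.97] + [0.96] * 20
stop_round, best_round = early_stopping_round(auc_per_round, patience=10)
print(f"best round: {best_round}, stopped at round: {stop_round}")
```

In this toy run the best score occurs in round 2 and training halts in round 12, long before the sequence ends, which is why a large 'num_round' does not by itself imply a large final model.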

## Deploying the model

Now that the model is ready, we can "deploy it". This will create an instance that "serves" the model continuously. This server will accept queries with input values in real time and will return the model prediction.

- We create a new estimator object by attaching it to the best training job using the Estimator.attach() method from the SageMaker SDK. This method allows us to rehydrate an estimator object from a completed training job, in this case, the one identified as the best from the hyperparameter tuning process.
- We deploy the best model (from the best training job) to a SageMaker endpoint. To do this, we call the deploy() method on the best_estimator object, specifying the initial number of instances (initial_instance_count=1), the type of instance (instance_type='ml.m5.xlarge'), and the serializer (serializer=sagemaker.serializers.CSVSerializer()). The serializer formats the request data as CSV before it is sent to the endpoint.
- This process results in the creation of a SageMaker endpoint based on the model from the best training job. This endpoint can then be used to make real-time predictions by sending it data in the CSV format.

### SageMaker Endpoint Limitation and Resolution

In our project, we are aware of AWS SageMaker's service limits on the number of model endpoints and instance types we can deploy. These limits are designed to help manage AWS resource allocation and control costs. However, encountering these limits can halt project progress by preventing the creation of new model endpoints.

#### How to Solve the Issue
There are three primary methods to address a service limit issue:

1. **Terminate Unused Endpoints**
2. **Request a Service Limit Increase**
3. **Use an Alternative Instance Type**

For the purposes of this project, we have chosen to adopt the first method: terminating unused endpoints. This approach is often the quickest way to free up resources and continue working without waiting for limit increases or changing instance types.

If you're using the project and encounter a `ResourceLimitExceeded` error, indicating you've hit the limit for the number of deployable endpoints, follow these steps to resolve the issue:

1. **Terminate Unused Endpoints**:
   - Open the [SageMaker console](https://console.aws.amazon.com/sagemaker/).
   - In the navigation pane, select **Endpoints**.
   - Review the list of endpoints to identify any that are no longer in use.
   - Select the endpoint(s) you wish to terminate.
   - Click on **Actions**, then choose **Delete** to remove the endpoint and free up resources.

By regularly reviewing and managing your SageMaker endpoints, you can ensure that you stay within the service limits and avoid potential disruptions to your workflow. Should you require more resources for your project in the future, consider requesting a service limit increase or using alternative instance types.
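
The console steps above can also be scripted. The following is a minimal sketch, assuming a boto3 SageMaker client is passed in; `list_endpoints` and `delete_endpoint` are operations of that client, while the helper name `delete_unused_endpoints` and the `keep` argument are hypothetical. Note that deleting an endpoint does not remove its endpoint configuration or model, which have to be deleted separately if no longer needed.

```python
def delete_unused_endpoints(sm_client, keep=()):
    """Delete every endpoint whose name is not in `keep`.

    Returns the list of deleted endpoint names.
    """
    # Collect all names first so that deletions do not disturb pagination
    names = []
    kwargs = {}
    while True:
        page = sm_client.list_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in page.get("Endpoints", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs = {"NextToken": token}

    deleted = []
    for name in names:
        if name not in keep:
            sm_client.delete_endpoint(EndpointName=name)
            deleted.append(name)
    return deleted


# Usage (requires AWS credentials; the endpoint name is a placeholder):
# import boto3
# delete_unused_endpoints(boto3.client("sagemaker"), keep={"endpoint-to-keep"})
```

Since deletion is irreversible, it is worth printing the endpoint names and checking them before running the function without a `keep` set.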

In [20]:
# If we had not done any hyperparameter optimization, we would be using the following command:
# predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', serializer=sagemaker.serializers.CSVSerializer())

# Create a new estimator object from the best training job
best_estimator = sagemaker.estimator.Estimator.attach(best_training_job_name)

# Now, deploy the best model
predictor = best_estimator.deploy(initial_instance_count=1,
                                  instance_type='ml.m5.xlarge',
                                  serializer=sagemaker.serializers.CSVSerializer())


2024-04-02 22:11:16 Starting - Found matching resource for reuse
2024-04-02 22:11:16 Downloading - Downloading the training image
2024-04-02 22:11:16 Training - Training image download completed. Training in progress.
2024-04-02 22:11:16 Uploading - Uploading generated training model
2024-04-02 22:11:16 Completed - Resource retained for reuse

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-04-02-22-15-14-490





INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2024-04-02-22-15-14-490
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2024-04-02-22-15-14-490


-----!

### Predicting a patient

#### One patient
To predict for a new patient, we pass the normalized feature data to predict(). The outcome is the probability of the patient having diabetes (closer to 1 means a high probability).

The output we are seeing, such as b'0.00035908748395740986', represents the model's predicted probability that the given observation belongs to class 1 (which, according to our coding of the classes, refers to "diabetes"). A value this close to 0 therefore indicates that the patient most likely does not have diabetes.

In [21]:
predictor.predict("-0.081830,-0.284208,-0.202524,-0.001838,-0.492907,-0.295076,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0")

b'0.00035908748395740986'
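
The endpoint returns raw bytes, so the response has to be decoded before it can be used as a number. The snippet below is a minimal sketch of that step using only the standard library; the helper name `parse_probabilities` is hypothetical, and the 0.5 threshold is the one applied later in this notebook.

```python
def parse_probabilities(raw, threshold=0.5):
    """Decode an endpoint response into probabilities and 0/1 labels."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Accept comma- or newline-separated values
    tokens = text.replace("\n", ",").split(",")
    probabilities = [float(tok) for tok in tokens if tok.strip()]
    # Label 1 = diabetes, 0 = no diabetes
    labels = [int(p >= threshold) for p in probabilities]
    return probabilities, labels


probabilities, labels = parse_probabilities(b"0.00035908748395740986")
print(probabilities, labels)
```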

#### Multiple patients at the same time (taken from the test set)

In order to predict more than one patient, we first need to preprocess the input data:

- We create a subset of the test dataset (X_test) by selecting the first 5 rows. This subset is intended for making predictions and demonstrating the prediction process on a smaller scale.
- We print the subset of X_test to the console, displaying the selected rows along with their column names. This step is useful for understanding the data before it undergoes preprocessing.
- The subset is then preprocessed using the previously defined preprocessor pipeline. This preprocessing includes any transformations specified earlier, such as imputation, scaling, and encoding, to prepare the data for the model.
- The preprocessing steps can result in a sparse matrix (especially after encoding categorical variables). In that case, the matrix must be converted to a dense format using the toarray() method before it is turned into a DataFrame. The cell below passes the preprocessor output to pd.DataFrame directly, which assumes that the output is already dense.
- We convert the dense array of preprocessed data into a pandas DataFrame. This step is useful for handling the data in a structured format and potentially applying further transformations or analysis.
- We prepare the preprocessed DataFrame for prediction by converting it to CSV format. This involves writing the DataFrame to an in-memory text stream (io.StringIO()) without headers and index columns, which is a common requirement for model prediction inputs. The resulting CSV format data is stored in a variable for easy access or transmission, for instance, when sending the data to a machine learning model endpoint for prediction.

In [22]:
# Create subset of X_test to predict
subset_X_test = X_test.iloc[:5]

# Display the subset of X_test with column names
print("Subset of X_test before preprocessing:")
print(subset_X_test)

# Preprocess the subset
subset_X_test_preprocessed = preprocessor.transform(subset_X_test)

# Create a DataFrame for the preprocessed data
subset_X_test_preprocessed_df = pd.DataFrame(subset_X_test_preprocessed)

# Convert the preprocessed DataFrame to CSV format for prediction
placeholder = io.StringIO()
subset_X_test_preprocessed_df.to_csv(placeholder, header=False, index=False)
csv_data = placeholder.getvalue()

Subset of X_test before preprocessing:
       gender   age  hypertension  heart_disease smoking_history    bmi  \
65074  Female  50.0             0              0         No Info  33.06   
46427  Female  46.0             0              0         No Info  27.32   
31779  Female   4.0             0              0         No Info  27.32   
33827    Male   7.0             0              0     not current  19.47   
21868  Female  43.0             0              0     not current  30.51   

       HbA1c_level  blood_glucose_level  
65074          6.0                  145  
46427          3.5                  126  
31779          6.1                  158  
33827          3.5                  130  
21868          6.1                  158  


In a next step, we then take the preprocessed data and run the prediction.

In [23]:
# Predict data with deployed model endpoint
predictions = predictor.predict(csv_data)

predictions

b'0.06106920167803764,0.0004101478843949735,0.0009868874913081527,0.00017709078383632004,0.004902362357825041'

Lastly, we generate IDs for the individual patients we are predicting and create a DataFrame with their IDs and respective predictions. We then convert it into a CSV file and export it to our predictions S3 bucket.

In [24]:
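# Decode the raw bytes returned by the endpoint, then parse the comma-separated probabilities
predicted_probabilities = np.fromstring(predictions.decode('utf-8'), sep=',')
# Classify as 1 (diabetes) when the probability is at least 0.5, otherwise 0
predicted_labels = (predicted_probabilities >= 0.5).astype(int)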

# Generate sequential patient IDs
PolicyHolder = range(1, len(predicted_labels) + 1)

# Create a DataFrame with generated patient IDs and their corresponding predictions
predictions_df = pd.DataFrame({
    "PolicyHolderID": PolicyHolder,
    "Prediction": predicted_labels
})

# Define the path for saving the CSV
predictions_csv = "predictions.csv"

# Save the DataFrame to a CSV file
predictions_df.to_csv(predictions_csv, index=False)

s3_client = boto3.client('s3')
bucket_name = 'diabetesstorage'

# Upload predictions CSV
predictions_s3_path = 'predictions/predictions.csv'
s3_client.upload_file(predictions_csv, bucket_name, predictions_s3_path)

## Model generalization performance evaluation

To assess the model's **practical impact and usefulness**, we will **assess its performance against the test set**. The generated confusion matrix will help us calculate the financial impact of the model.
- We use the performance metrics imported earlier from the sklearn.metrics module, including accuracy, precision, recall, F1 score, ROC AUC score, and the confusion matrix. These metrics are crucial for evaluating the performance of a classification model.
- A pandas DataFrame is created for the preprocessed test data, which we had transformed earlier using a preprocessing pipeline to fit the model's requirements (see above).
- We convert this DataFrame to CSV format (without headers or indexes) and use it to make predictions with the predictor object. The predictor refers to a deployed model endpoint that can accept input data and return predictions. The predictions are decoded from UTF-8 format to a string.
- The prediction results, in the form of probabilities, are converted from a string to a NumPy array. Based on these probabilities, we determine the predicted labels by applying a threshold: values equal to or above 0.5 are classified as 1 (having diabetes), and below 0.5 are classified as 0 (no diabetes).
- We calculate the confusion matrix using the true labels (y_test) and the predicted labels. The confusion matrix provides a detailed breakdown of the model's predictions, including true positives, true negatives, false positives, and false negatives.
- Various performance metrics are computed to evaluate the model, including accuracy, precision, recall, F1 score, and ROC AUC score. The ROC AUC score is calculated using the predicted probabilities rather than the binary labels to assess the model's ability to distinguish between classes.
- Finally, we print the calculated metrics and the confusion matrix to summarize the model's performance on the test dataset. These metrics provide a comprehensive overview of the model's predictive accuracy and its strengths and weaknesses.

In [25]:
# Create a DataFrame for the preprocessed test data
X_test_preprocessed_df = pd.DataFrame(X_test_preprocessed)

# Convert to CSV format and predict
predictions = predictor.predict(X_test_preprocessed_df.to_csv(header=False, index=False)).decode('utf-8')

predicted_probabilities = np.fromstring(predictions, sep=',')
predicted_labels = (predicted_probabilities >= 0.5).astype(int)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, predicted_labels)

# Label the confusion matrix
cm_df = pd.DataFrame(cm, 
                     index=['Actual Negative', 'Actual Positive'], 
                     columns=['Predicted Negative', 'Predicted Positive'])

# Compute various performance metrics
accuracy = accuracy_score(y_test, predicted_labels)
precision = precision_score(y_test, predicted_labels)
recall = recall_score(y_test, predicted_labels)
f1 = f1_score(y_test, predicted_labels)
roc_auc = roc_auc_score(y_test, predicted_probabilities)  # Use predicted probabilities for ROC AUC

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC AUC Score: {roc_auc:.4f}")

# Print the labeled confusion matrix
print("\nConfusion Matrix:")
print(cm_df)

# Output the total number of predictions made
total_predictions = len(predicted_labels)  # or use len(y_test)
print(f"\nTotal number of predictions made: {total_predictions}")

Accuracy: 0.9711
Precision: 0.9815
Recall: 0.6775
F1 Score: 0.8016
ROC AUC Score: 0.9785

Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative                9127                  11
Actual Positive                 278                 584

Total number of predictions made: 10000
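
As a cross-check, the threshold-based metrics above can be recomputed by hand from the four confusion-matrix cells. The sketch below does this with plain Python; the helper name `metrics_from_confusion` is hypothetical. ROC AUC is not included because it depends on the predicted probabilities, not on the confusion matrix.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the confusion matrix printed above
for name, value in metrics_from_confusion(tn=9127, fp=11, fn=278, tp=584).items():
    print(f"{name}: {value:.4f}")
```

The recall shows that 278 of the 862 actual positives in the test set are missed at the 0.5 threshold, even though only 11 negatives are flagged incorrectly.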


You will find the interpretation of these results in our presentation. Lastly, we convert and **export the model performance metrics to our respective S3 bucket.**

In [26]:
metrics_df = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1 Score", "ROC AUC Score"],
    "Value": [accuracy, precision, recall, f1, roc_auc]
})

metrics_csv = "metrics.csv"
metrics_df.to_csv(metrics_csv, index=False)

s3_client = boto3.client('s3')
bucket_name = 'diabetesstorage'

# Upload metrics CSV
metrics_s3_path = 'metrics/metrics.csv'
s3_client.upload_file(metrics_csv, bucket_name, metrics_s3_path)