<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Heart Failure prediction using teradataml</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>    
    
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:middle' width = '50%'>
            <ul style = 'font-size:16px;font-family:Arial'>
                <li>According to the CDC, the number of emergency room visits in 2017 for issues related to the heart and blood vessels was nearly 5 million. In 2016, 72 million people made heart disease-related visits to their doctors.</li><br>
                <li>The cost of caring for cardiovascular disease is more than \$351 billion per year. Nearly \$214 billion pays for the care of people with heart disease, while more than \$137 billion goes to lost productivity.</li><br>
                <li>Heart attack is one of the most expensive conditions treated in U.S. hospitals. Its care costs an estimated \$11.5 billion a year.</li><br>
                <li>By 2035, more than 45 percent of Americans are projected to have some form of cardiovascular disease. Total costs of cardiovascular disease are expected to reach \$1.1 trillion in 2035, with direct medical costs expected to reach \$748.7 billion and indirect costs estimated to reach \$368 billion.</li>
            </ul> 
        </td>
        <td>
            <img src="images/heart.webp" width="350"/>
        </td>
    </tr>
</table>
<p style = 'font-size:16px;font-family:Arial'>Source: <a href = 'https://www.healthline.com/health/heart-disease/statistics#How-much-does-it-cost?'>Healthline</a></p>
<p style = 'font-size:16px;font-family:Arial'>Machine learning can be useful in heart failure prediction as it can analyze large amounts of data from multiple sources and identify complex patterns that may be difficult for humans to recognize. This can potentially improve the accuracy of prediction models and help healthcare professionals identify patients who are at high risk for heart failure, allowing for earlier intervention and better outcomes.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Data:</b></p>

<p style = 'font-size:16px;font-family:Arial'>
    This is a simulated dataset based on real hospital administrative data for England called Hospital Episodes Statistics. Every public (National Health Service, NHS) hospital in the country must submit records for every admission; private hospitals also submit records for any NHS patients that they treat.</p>

<hr>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# system packages
import sys
import warnings
warnings.filterwarnings("ignore")

from teradataml import *
from teradataml import valib

# Dataset packages 
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, recall_score, ConfusionMatrixDisplay

# plotting packages
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

%matplotlib inline
configure.val_install_location = "val"
display.max_rows = 5

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Let's start by connecting to the Teradata system </b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=SurvivalAnalysis_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_HeartFailure_cloud');"        # Takes 10 seconds
%run -i ../run_procedure.py "call get_data('DEMO_HeartFailure_local');"        # Takes 20 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 5 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Data Exploration</b>

In [None]:
heart_failure = DataFrame(in_schema('DEMO_HeartFailure', 'heart_failure'))

In [None]:
print(heart_failure.shape)
heart_failure

<i><b>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</b></i>
<p style = 'font-size:16px;font-family:Arial'>The dataset above has 31 columns in total. The 'death' column is the target column: 1 means the patient died and 0 means the patient did not.
<br>
Let's check the data for people who died and who did not.
</p>

In [None]:
# Sample data for people who did not die
heart_failure[heart_failure.death == 0]

In [None]:
# Sample data for people who died
heart_failure[heart_failure.death == 1]

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.1 Mortality Rate by Gender</b></p>

In [None]:
grp_gen = heart_failure.select(['gender','death']).groupby(['gender']).agg(['mean', 'count']).to_pandas()
sns.barplot(x='gender', y='mean_death', data=grp_gen)
plt.xticks(ticks=[0, 1], labels=['male', 'female'])
plt.title('Mortality rate by gender')
plt.show()

<p style = 'font-size:16px;font-family:Arial'>The graph depicted above indicates that gender does not appear to be a determining factor in mortality rates related to heart failure.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.2 Mortality Rate by Age</b></p>

In [None]:
grp_age = heart_failure.select(['age','death']).groupby(['age']).agg(['mean', 'count']).to_pandas()
plt.figure(figsize=(15, 6))
sns.barplot(x='age', y='mean_death', data=grp_age)
plt.xticks(rotation = 90)
plt.title('Mortality rate by age')
plt.show()

<p style = 'font-size:16px;font-family:Arial'>A noticeable pattern can be observed from the graph, revealing a positive correlation between age and mortality rates associated with heart failure.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.3 Correlation Matrix</b></p>

In [None]:
corr_matrix = heart_failure.to_pandas().corr()

In [None]:
# Set figure size to 20 inches by 8 inches
sns.set(rc={"figure.figsize": (20, 8)})
# Create a heatmap to visualize the correlation matrix
ax = sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)

# Set title and show plot
plt.title('Multivariate Correlation Matrix')
plt.show()

<p style = 'font-size:16px;font-family:Arial'>A few observations from the correlation matrix above are:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Age and mortality are positively correlated.</li>
    <li>Cancer and metastatic cancer exhibit a positive correlation.</li>
    <li>The number of prior appointments attended and the number of appointments missed in the previous year are positively correlated.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>While these correlations exist, they may not be strong enough to justify removing any columns from the dataset.</p>
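
<p style = 'font-size:16px;font-family:Arial'>Reading pairs off a heatmap by eye is error-prone, so it can help to list the strongest correlations directly. The following is a minimal sketch on hypothetical toy data (not the heart_failure table); the helper name <b>top_correlated_pairs</b> is an illustrative assumption and is not part of teradataml.</p>

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude but report the signed value.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy data: death depends on age, noise is unrelated.
rng = np.random.default_rng(0)
age = rng.normal(70, 10, 500)
toy = pd.DataFrame({
    "age": age,
    "death": (age + rng.normal(0, 10, 500) > 75).astype(int),
    "noise": rng.normal(0, 1, 500),
})
top = top_correlated_pairs(toy.corr(), n=1)
print(top)
```

<p style = 'font-size:16px;font-family:Arial'>In the notebook, the same helper could be applied to <b>corr_matrix</b> to check the observations listed above.</p>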

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.4 Pairplot for multivariate correlations</b></p>

In [None]:
# Create a pairplot to visualize multivariate correlations

sns.pairplot(heart_failure.to_pandas()[["gender", "los","age","prior_appts_attended","prior_dnas","fu_time"]],
             diag_kind = 'auto', hue = 'gender')

<p style = 'font-size:16px;font-family:Arial'>The plot shown above does not yield conclusive results.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.5 Distribution plots for numeric variables</b></p>

In [None]:
# Pull the data to the client once, then plot each numeric column
heart_failure_pd = heart_failure.to_pandas()
for col in ["los","age","prior_appts_attended","prior_dnas","fu_time"]:
    # Create a new figure for each column
    plt.figure()
    sns.histplot(data=heart_failure_pd, x=col, kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

# Show all the distribution plots
plt.show()

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Data Preparation</b>
<p style = 'font-size:16px;font-family:Arial'>Here, we classify the columns into ID, numerical, categorical, and binary columns.</p>

In [None]:
num_x = ["los","age","prior_appts_attended","prior_dnas","fu_time"]
cat_x = ["gender", "quintile","ethnicgroup"]
bin_x = ["death", "cancer", "cabg","crt", "defib","dementia", "diabetes","hypertension", "ihd",
         "mental_health", "arrhythmias", "copd", "obesity","pvd", "renal_disease", "valvular_disease",
         "metastatic_cancer", "pacemaker", "pneumonia", "pci", "stroke", "senile"]
idcol = ["id"]

<p style = 'font-size:16px;font-family:Arial'>One hot encoding is useful when a categorical data element must be re-expressed as one or more numeric data elements, creating a binary numeric field for each categorical data value.</p>

In [None]:
# 1 - male, 2 - female
values1 = {1: "Gender"}
dummy1 = OneHotEncoder(values=values1, columns="gender")

# quintile (socio-economic status for patient's neighbourhood, from 1 (most affluent) to 5 (poorest))
values2 = {1: "q_richest", 2: "q_rich", 3: "q_average", 4: "q_poor", 5: "q_poorest"}
dummy2 = OneHotEncoder(values=values2, columns="quintile")

# 1 - White, 2 - Black, 3 - Indian Subcontinent, 8 - Not Known, 9 - Other
values3 = {1: "White", 2: "Black", 3: "Indian_Subcontinent", 8: "Not_Known", 9:"Other"}
dummy3 = OneHotEncoder(values=values3, columns="ethnicgroup")
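
<p style = 'font-size:16px;font-family:Arial'>To make the effect of one-hot encoding concrete, here is a plain-Python sketch that expands a list of category codes into one 0/1 column per mapped code. It reuses the ethnicgroup mapping from the cell above on hypothetical sample values. The <b>one_hot</b> helper is an illustration only, not the teradataml OneHotEncoder, and it makes no claim about how the in-database function treats unmapped or null codes (here they simply get 0 in every column).</p>

```python
def one_hot(values, mapping):
    """Expand a list of category codes into one 0/1 column per mapped code."""
    return {
        name: [1 if v == code else 0 for v in values]
        for code, name in mapping.items()
    }

# Same code-to-name mapping as the ethnicgroup encoder.
ethnic_map = {1: "White", 2: "Black", 3: "Indian_Subcontinent", 8: "Not_Known", 9: "Other"}

# Hypothetical sample values, not rows from the dataset.
sample = [1, 3, 9, 1]
encoded = one_hot(sample, ethnic_map)
for name, column in encoded.items():
    print(f"{name:20s} {column}")
```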

<p style = 'font-size:16px;font-family:Arial'>FillNa replaces missing (null) values; here it uses the most frequent value (mode) of each column. ZScore transforms each column value into the number of standard deviations it lies from the column mean.</p>

In [None]:
fn = FillNa(style = "mode", columns = num_x)
zs = ZScore(columns = num_x,
            out_columns = num_x, 
            fillna = fn)
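
<p style = 'font-size:16px;font-family:Arial'>The two steps can be sketched with the Python standard library as follows. This is an illustration on hypothetical values, under two assumptions that may not match the in-database functions: missing values are filled before the mean and standard deviation are computed, and the population standard deviation is used.</p>

```python
from statistics import mean, mode, pstdev

def fill_mode_then_zscore(values):
    """Replace None with the column mode, then standardize to z-scores."""
    observed = [v for v in values if v is not None]
    fill_value = mode(observed)
    filled = [fill_value if v is None else v for v in values]
    mu = mean(filled)
    # Assumption: population standard deviation; the in-database function may differ.
    sigma = pstdev(filled)
    return [(v - mu) / sigma for v in filled]

# Hypothetical length-of-stay values with one missing entry.
los = [2, 4, 4, None, 6]
z = fill_mode_then_zscore(los)
print([round(v, 3) for v in z])
```

<p style = 'font-size:16px;font-family:Arial'>After the transformation, the values have mean 0 and unit standard deviation, which puts all numeric columns on a comparable scale.</p>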

<p style = 'font-size:16px;font-family:Arial'>Retain the other variables, which do not need transformation.</p>

In [None]:
retain = Retain(columns = bin_x)

In [None]:
# Process the transformation
df_transformed = valib.Transform(
                            data = heart_failure, 
                            zscore = zs, 
                            one_hot_encode = [dummy1, dummy2, dummy3],
                            retain = retain,
                            index_columns = idcol,
                            key_columns = idcol
                         )

<p style = 'font-size:16px;font-family:Arial'>Save the transformed dataframe into a table <b>heart_failure_clean</b>.</p>

In [None]:
df_transformed.result.to_sql(
                "heart_failure_clean",
                schema_name = "demo_user",
                primary_index="id",
                if_exists="replace"
            )
df_transformed.result

<p style = 'font-size:16px;font-family:Arial'>Split the data into training and testing datasets in a 75:25 ratio.</p>

In [None]:
query = f'''CREATE MULTISET TABLE TrainTestSplit_output AS (
    SELECT * FROM TD_TrainTestSplit(
        ON heart_failure_clean AS InputTable
        USING
        IDColumn('id')
        trainSize(0.75)
        testSize(0.25)
        Seed(7)
    ) AS dt
) WITH DATA;'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE TrainTestSplit_output;')
    eng.execute(query)
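
<p style = 'font-size:16px;font-family:Arial'>Conceptually, a seeded 75:25 split shuffles the row ids reproducibly and cuts the shuffled list at the 75% mark. The sketch below shows the idea in plain Python on hypothetical ids. It does not reproduce the row assignment of TD_TrainTestSplit; the function name and its parameters are illustrative assumptions.</p>

```python
import random

def train_test_split_ids(ids, train_size=0.75, seed=7):
    """Shuffle ids reproducibly and cut them into train and test lists."""
    shuffled = list(ids)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical ids 1..1000.
train_ids, test_ids = train_test_split_ids(range(1, 1001))
print(len(train_ids), len(test_ids))
```

<p style = 'font-size:16px;font-family:Arial'>Fixing the seed means that rerunning the split yields the same partition, which keeps model comparisons fair across runs.</p>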

In [None]:
query = f'''CREATE MULTISET TABLE heart_failure_train AS (
    SELECT * FROM TrainTestSplit_output WHERE TD_IsTrainRow = 1
) WITH DATA;'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE heart_failure_train;')
    eng.execute(query)

In [None]:
query = f'''CREATE MULTISET TABLE heart_failure_test AS (
    SELECT * FROM TrainTestSplit_output WHERE TD_IsTrainRow = 0
) WITH DATA;'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE heart_failure_test;')
    eng.execute(query)

In [None]:
heart_failure_train = DataFrame('heart_failure_train')
heart_failure_test = DataFrame('heart_failure_test')
print("Training Set = "+str(heart_failure_train.shape[0])+". Testing Set = "+str(heart_failure_test.shape[0]))

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Model Training</b>

<p style = 'font-size:16px;font-family:Arial'>The TD_GLM function fits a generalized linear model (GLM) for regression or classification, where the response follows a Gaussian or Binomial family distribution. Here we use the Binomial family because the response, death, is binary.
<br>
<br>
Because the function uses gradient-based learning, it is highly sensitive to feature scaling. Input features should be standardized before training, for example with ScaleFit and ScaleTransform; in this notebook the numeric columns were already z-scored during data preparation. The function accepts only numeric features, so categorical features must be converted to numeric values first. Rows with missing (null) values are skipped during training.</p>

In [None]:
query = f'''CREATE TABLE glm_model AS (
    SELECT * FROM TD_GLM (
        ON heart_failure_train
        OUT TABLE MetaInformationTable(glm_out) 
        USING
            InputColumns('[3:39]')
            ResponseColumn('death')
            Family('Binomial')
            BatchSize(10)
            MaxIterNum(300)
            RegularizationLambda(0.02)
            Alpha(0.15)
            IterNumNoChange(50)
            Intercept('true')
            LearningRate('optimal')
            InitialEta(0.05)
            LocalSGDIterations(0)
    ) AS dt
) WITH DATA;
'''

try:
    eng.execute(query)
except:
    # Drop the tables and try again if the table already exists
    eng.execute(f'DROP TABLE glm_model;')
    eng.execute(f'DROP TABLE glm_out;')
    eng.execute(query)
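
<p style = 'font-size:16px;font-family:Arial'>For intuition about what gradient-based training of a binomial GLM involves, the following NumPy sketch runs minibatch stochastic gradient descent for logistic regression with an elastic-net penalty on hypothetical standardized data. It is not TD_GLM's implementation. The arguments <b>lam</b>, <b>alpha</b>, <b>eta</b> and <b>batch_size</b> loosely mirror RegularizationLambda, Alpha, InitialEta and BatchSize; the sketch assumes that alpha weights the L1 term, uses a constant step size instead of the 'optimal' schedule, and has no early stopping.</p>

```python
import numpy as np

def sgd_logistic_elastic_net(X, y, lam=0.02, alpha=0.15, eta=0.05,
                             batch_size=10, epochs=50, seed=0):
    """Minibatch SGD for logistic regression with an elastic-net penalty."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # Visit the rows in a fresh random order each epoch, one batch at a time.
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))  # predicted P(y = 1)
            err = p - y[idx]
            grad_w = X[idx].T @ err / len(idx)
            # Assumption: alpha weights the L1 term, (1 - alpha) the L2 term.
            grad_w += lam * (alpha * np.sign(w) + (1 - alpha) * w)
            w -= eta * grad_w
            b -= eta * err.mean()  # the intercept is not penalized
    return w, b

# Hypothetical standardized features; only the first one drives the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(float)

w, b = sgd_logistic_elastic_net(X, y)
accuracy = ((X @ w + b > 0) == (y == 1)).mean()
print(np.round(w, 2), round(float(accuracy), 3))
```

<p style = 'font-size:16px;font-family:Arial'>Because each update is proportional to the feature values, a feature on a much larger scale would dominate the steps, which is why standardization matters for this family of methods.</p>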

In [None]:
glm_model_out = DataFrame(in_schema('demo_user','glm_model')).to_pandas().reset_index()
feat_imp = glm_model_out[glm_model_out['attribute'] > 0].sort_values(by = 'estimate', ascending = False)

# Specify figure size
fig, ax = plt.subplots(figsize=(10, 8))

# Use ax.barh() for horizontal bar chart
ax.barh(feat_imp['predictor'], feat_imp['estimate'], edgecolor='red')

# Add text labels on right of the bars
for x, y in zip(feat_imp['estimate'], feat_imp['predictor']):
    ax.text(x, y, str(round(x, 2)), ha='left', va='center')

# Set x-axis label
ax.set_xlabel('Estimate')

plt.show()

<p style = 'font-size:16px;font-family:Arial'>The model coefficients displayed above, used here as a measure of feature importance, indicate that age and the number of appointments missed are significant factors in predicting death among heart failure patients. Other notable factors include the presence of arrhythmia (irregular heartbeat), hypertension, COPD (chronic obstructive pulmonary disease), and whether the patient's recorded ethnic group is Indian Subcontinent.</p>
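
<p style = 'font-size:16px;font-family:Arial'>Assuming the Binomial family uses the logit link, each coefficient is a change in log-odds, so exponentiating it gives an odds ratio that is often easier to interpret. The sketch below uses hypothetical coefficient values, not estimates taken from glm_model. For a z-scored feature, "one unit" means one standard deviation.</p>

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds of death per one-unit increase in a feature."""
    return math.exp(coefficient)

# Hypothetical coefficient values for illustration only.
examples = {"age (z-scored)": 0.50, "binary flag": -0.20}
for name, beta in examples.items():
    print(f"{name}: odds ratio = {odds_ratio(beta):.2f}")
```

<p style = 'font-size:16px;font-family:Arial'>An odds ratio above 1 means the feature is associated with higher odds of death, and a value below 1 with lower odds, holding the other features fixed.</p>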

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Model Validation</b>
<p style = 'font-size:16px;font-family:Arial'>The TD_GLMPredict function predicts target values (regression) or class labels (classification) for test data using a GLM model trained by the TD_GLM function.
<br>
<br>
As with TD_GLM, input features should be standardized before prediction, for example with ScaleFit and ScaleTransform. The function accepts only numeric features, so categorical features must be converted to numeric values first.</p>

In [None]:
query = '''CREATE TABLE glm_predict_out AS (
    SELECT * FROM TD_GLMPredict(
        ON "demo_user"."heart_failure_test" AS inputtable
        PARTITION BY ANY 
        ON glm_model AS ModelTable
        DIMENSION
        USING
            IDColumn('id')
            Accumulate('death')
            OutputProb('True')
            Responses('0','1')
    ) AS dt
) WITH DATA;
'''

try:
    eng.execute(query)
except Exception as e:
    eng.execute('DROP TABLE glm_predict_out;')
    eng.execute(query)

In [None]:
# Evaluate the GLM model's performance using TD_CLASSIFICATIONEVALUATOR

# Check if the necessary tables exist before executing the query
if not eng.has_table('glm_predict_out'):
    print('Error: glm_predict_out table does not exist.')
    sys.exit(1)

query = '''SELECT * from TD_CLASSIFICATIONEVALUATOR(
    ON (
        SELECT
            CAST("death" AS INTEGER) AS "death",
            CAST(prediction AS INTEGER) as prediction
        FROM glm_predict_out
    ) AS InputTable
    OUT TABLE OutputTable(additional_metrics_glm)
    USING
        Labels(0, 1)
        ObservationColumn('death')
        PredictionColumn('Prediction')
) AS dt1
ORDER BY 1, 2, 3;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE additional_metrics_glm;')
    eng.execute(query)

In [None]:
glm_result = DataFrame(in_schema('demo_user', 'glm_predict_out')).to_pandas().reset_index()
glm_result

<p style = 'font-size:16px;font-family:Arial'>In the above result, the column <b>death</b> is the ground truth, <b>prediction</b> is the predicted class, and <b>(prob_0, prob_1)</b> are the predicted probabilities of classes 0 and 1.</p>

In [None]:
metrics_glm = DataFrame(in_schema('demo_user', 'additional_metrics_glm')).to_pandas()
metrics_glm['Metric'] = metrics_glm['Metric'].str.strip('\x00')
metrics_glm

<p style = 'font-size:16px;font-family:Arial'>Recall matters most when <b>Overlooked Cases (False Negatives)</b> are more costly than <b>False Alarms (False Positives)</b>. Because failing to flag even one high-risk heart failure patient could result in a preventable death, the model should place emphasis on the recall score. It is far preferable not to miss anyone at risk, even if that means flagging some patients who are not.
</p>
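
<p style = 'font-size:16px;font-family:Arial'>The trade-off can be made concrete with a small sketch that computes recall and precision at different probability cut-offs. The labels and probabilities below are hypothetical, not output from glm_predict_out, and the helper <b>recall_precision</b> is an illustrative assumption.</p>

```python
def recall_precision(y_true, y_prob, threshold=0.5):
    """Recall and precision for the positive class at a given probability cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical ground truth and predicted probabilities of death.
y_true = [1, 1, 1, 0, 0, 0]
prob_1 = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]
for cut in (0.5, 0.3):
    r, p = recall_precision(y_true, prob_1, cut)
    print(f"threshold={cut}: recall={r:.2f}, precision={p:.2f}")
```

<p style = 'font-size:16px;font-family:Arial'>In this toy example, lowering the cut-off from 0.5 to 0.3 catches the one missed positive case at the price of one false alarm, which is the kind of exchange a recall-focused model accepts.</p>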

<p style = 'font-size:16px;font-family:Arial'>Let us consider one example. Here we check for patient number 856.</p>

In [None]:
heart_failure[heart_failure['id'] == 856]

<p style = 'font-size:16px;font-family:Arial'>Based on the data provided, the individual in question has hypertension, arrhythmia, and an age of 98. These features have been identified as significant predictors of mortality by our model. Therefore, there was a high probability of death for this patient, which is further supported by the outcome indicating death.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Visualize the results</b>

In [None]:
# Compute confusion matrix (scikit-learn expects true labels first, then predictions)
cm = confusion_matrix(glm_result['death'], glm_result['prediction'])

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No Death', 'Death'])
fig, ax = plt.subplots(figsize=(8, 8))
disp.plot(ax=ax, cmap='Blues', colorbar=True)

# Add labels and annotations
plt.title('GLM Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.xticks(ticks=[0, 1], labels=['No Death', 'Death'])
plt.yticks(ticks=[0, 1], labels=['No Death', 'Death'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, f'{cm[i, j]}', ha='center', va='center', color='white' if cm[i, j] > cm.max()/2 else 'black')
        
# Remove grid lines
ax.grid(False)

# Show the plot
plt.show()

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up work tables to prevent errors next time.</p>

In [None]:
eng.execute('DROP TABLE heart_failure_train;')

In [None]:
eng.execute('DROP TABLE heart_failure_test;')
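
<p style = 'font-size:16px;font-family:Arial'>The notebook also creates several other work tables (the split output, the model and its metadata, the prediction output, the metrics and the cleaned table). A small helper can drop a list of tables while tolerating ones that are already gone. This is a sketch: <b>drop_tables</b> and <b>fake_execute</b> are hypothetical names, and the stand-in executor only simulates a missing table. In the notebook, <b>eng.execute</b> would be passed instead.</p>

```python
def drop_tables(execute, tables):
    """Try to drop each table; return the names that could not be dropped."""
    failed = []
    for name in tables:
        try:
            execute(f"DROP TABLE {name};")
        except Exception:
            # Most likely the table does not exist; record it and keep going.
            failed.append(name)
    return failed

# Work tables created earlier in this notebook.
work_tables = ["TrainTestSplit_output", "glm_model", "glm_out",
               "glm_predict_out", "additional_metrics_glm", "heart_failure_clean"]

# Stand-in executor for illustration; it pretends glm_out is missing.
def fake_execute(sql):
    if "glm_out" in sql:
        raise RuntimeError("table does not exist")

print(drop_tables(fake_execute, work_tables))
```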

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_HeartFailure');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Dataset:</b>

- `id`: patient id
- `death`: If the patient is deceased (boolean)
- `los`: hospital length of stay (in nights)
- `age`: age of the patient (in years)
- `gender`: gender of the patient (1-male, 2-female)
- `cancer`: If the patient has cancer (boolean)
- `cabg`: If the patient has gone through previous heart bypass i.e. Coronary Artery Bypass Graft procedure (boolean)
- `crt`: If the patient has undergone Cardiac Resynchronization Therapy, a device-based treatment for heart failure (boolean)
- `defib`: If the patient has defibrillator implanted (boolean)
- `dementia`: If the patient has dementia (boolean)
- `diabetes`: If the patient has diabetes (boolean)
- `hypertension`: If the patient has hypertension (boolean)
- `ihd`: If the patient has Ischemic Heart Disease (boolean)
- `mental_health`: If the patient has been diagnosed with mental health issues (boolean)
- `arrhythmias`: If the patient has arrhythmia (boolean)
- `copd`: If the patient has Chronic Obstructive Pulmonary Disease (boolean)
- `obesity`: If the patient has obesity (boolean)
- `pvd`: If the patient has Peripheral Vascular Disease (boolean)
- `renal_disease`: If the patient has Renal Disease (boolean)
- `valvular_disease`: If the patient has Valvular Disease (boolean)
- `metastatic_cancer`: If the patient has Metastatic Cancer (boolean)
- `pacemaker`: If the patient has pacemaker (boolean)
- `pneumonia`: If the patient has pneumonia (boolean)
- `prior_appts_attended`: Number of outpatient appointments attended in the previous year
- `prior_dnas`: Number of outpatient appointments missed in the previous year
- `pci`: If the patient has gone through Percutaneous Coronary Intervention procedure (boolean)
- `stroke`: If the patient has a history of stroke (boolean)
- `senile`: If the patient has Senile amyloidosis (SSA) (boolean)
- `quintile`: Socio-economic status for patient's neighbourhood, from 1 (most affluent) to 5 (poorest)
- `ethnicgroup`: 1 - White, 2 - Black, 3 - Indian Subcontinent, 8 - Not Known, 9 - Other 
- `fu_time`: Follow-up time, i.e. time in days since admission to hospital

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>