<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Survival Analysis using teradataml</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>    
    
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:middle' width = '50%'>
            <ul style = 'font-size:16px;font-family:Arial'>
                <li>According to the CDC, the number of emergency room visits in 2017 for issues related to the heart and blood vessels was nearly 5 million. In 2016, 72 million people made heart disease-related visits to their doctors.</li><br>
                <li>The cost of caring for cardiovascular disease is more than \$351 billion per year. Nearly \$214 billion pays for the care of people with heart disease, while more than \$137 billion goes to lost productivity.</li><br>
                <li>Heart attack is one of the most expensive conditions treated in U.S. hospitals. Its care costs an estimated \$11.5 billion a year.</li><br>
                <li>By 2035, more than 45 percent of Americans are projected to have some form of cardiovascular disease. Total costs of cardiovascular disease are expected to reach \$1.1 trillion in 2035, with direct medical costs expected to reach \$748.7 billion and indirect costs estimated to reach \$368 billion.</li>
            </ul> 
        </td>
        <td>
            <img src="images/heart.webp" width="350"/>
        </td>
    </tr>
</table>
<p style = 'font-size:16px;font-family:Arial'>Source: <a href = 'https://www.healthline.com/health/heart-disease/statistics#How-much-does-it-cost?'>Healthline</a></p>
<p style = 'font-size:16px;font-family:Arial'>Machine learning can be useful in heart failure prediction as it can analyze large amounts of data from multiple sources and identify complex patterns that may be difficult for humans to recognize. This can potentially improve the accuracy of prediction models and help healthcare professionals identify patients who are at high risk for heart failure, allowing for earlier intervention and better outcomes.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Data:</b></p>

<p style = 'font-size:16px;font-family:Arial'>
    This is a simulated dataset based on real hospital administrative data for England called Hospital Episodes Statistics. Every public (National Health Service, NHS) hospital in the country must submit records for every admission; private hospitals also submit records for any NHS patients that they treat.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture
!pip install --upgrade teradataml

In [None]:
import os
# Stop the kernel process so the upgraded package is picked up after the restart
os._exit(0)

<p style = 'font-size:16px;font-family:Arial'>
    <i><b>*BEFORE proceeding, please RESTART the kernel to bring new software into Jupyter.</b></i>
</p>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# system packages
import sys
import warnings
warnings.filterwarnings("ignore")

from teradataml import *
from teradataml import valib

# Dataset packages 
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, recall_score, ConfusionMatrixDisplay

# plotting packages
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

%matplotlib inline
configure.val_install_location = "val"

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Let's start by connecting to the Teradata system </b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=SurvivalAnalysis_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, change which statement is commented out.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_SurvivalAnalysis_cloud');"        # Takes 10 seconds
# %run -i ../run_procedure.py "call get_data('DEMO_SurvivalAnalysis_local');"        # Takes 20 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 5 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Read the data from Vantage as a teradataml DataFrame</b>

In [None]:
heart_failure = DataFrame(in_schema('DEMO_SurvivalAnalysis', 'heart_failure'))

In [None]:
print(heart_failure.shape)
heart_failure.head(5)

<i><b>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</b></i>
<p style = 'font-size:16px;font-family:Arial'>The dataset above has 31 columns in total. The 'death' column is the target column, where 1 means the patient died and 0 means the patient did not.
<br>
Let's look at sample records for patients who did not die and for patients who died.
</p>

In [None]:
# Sample data for people who did not die
heart_failure[heart_failure.death == 0].head(5)

In [None]:
# Sample data for people who died
heart_failure[heart_failure.death == 1].head(5)

<p style = 'font-size:16px;font-family:Arial'>How can we look at the factors and determine potential mortality? Looking at the data above, it is not evident what causes death due to heart failure. Let's analyze further.</p>
<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Data Preparation</b>
<p style = 'font-size:16px;font-family:Arial'>Here, we classify the columns into ID, numerical, categorical, and binary columns.</p>

In [None]:
num_x = ["los","age","prior_appts_attended","prior_dnas","fu_time"]
cat_x = ["gender", "quintile","ethnicgroup"]
bin_x = ["death", "cancer", "cabg","crt", "defib","dementia", "diabetes","hypertension", "ihd",
         "mental_health", "arrhythmias", "copd", "obesity","pvd", "renal_disease", "valvular_disease",
         "metastatic_cancer", "pacemaker", "pneumonia", "pci", "stroke", "senile"]
idcol = ["id"]

<p style = 'font-size:16px;font-family:Arial'>One hot encoding is useful when a categorical data element must be re-expressed as one or more numeric data elements, creating a binary numeric field for each categorical data value.</p>

In [None]:
# 1 - male, 2 - female
values1 = {1: "Gender"}
dummy1 = OneHotEncoder(values=values1, columns="gender")

# quintile (socio-economic status for patient's neighbourhood, from 1 (most affluent) to 5 (poorest))
values2 = {1: "q_richest", 2: "q_rich", 3: "q_average", 4: "q_poor", 5: "q_poorest"}
dummy2 = OneHotEncoder(values=values2, columns="quintile")

# 1 - White, 2 - Black, 3 - Indian Subcontinent, 8 - Not Known, 9 - Other
values3 = {1: "White", 2: "Black", 3: "Indian_Subcontinent", 8: "Not_Known", 9:"Other"}
dummy3 = OneHotEncoder(values=values3, columns="ethnicgroup")
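
<p style = 'font-size:16px;font-family:Arial'>To make the idea concrete, here is a minimal pure-Python sketch of one-hot encoding. It runs in memory on a few hypothetical rows rather than in Vantage, and the helper name one_hot is an assumption made for illustration; it is not part of teradataml. The mapping mirrors values2 from the cell above.</p>

```python
# Hypothetical sample rows; the real data lives in the heart_failure table.
rows = [
    {"id": 1, "quintile": 1},
    {"id": 2, "quintile": 4},
    {"id": 3, "quintile": 5},
]

# Same value-to-column mapping as values2 in the cell above.
quintile_map = {1: "q_richest", 2: "q_rich", 3: "q_average", 4: "q_poor", 5: "q_poorest"}

def one_hot(row, column, mapping):
    """Return one 0/1 indicator per mapped category value for a single row."""
    return {name: int(row[column] == value) for value, name in mapping.items()}

# Replace the categorical column with its indicator columns
encoded = [{"id": r["id"], **one_hot(r, "quintile", quintile_map)} for r in rows]
for r in encoded:
    print(r)
```

<p style = 'font-size:16px;font-family:Arial'>Each row ends up with exactly one indicator set to 1. With a single-entry mapping such as {1: "Gender"}, the same helper yields a single indicator column that is 1 for the mapped value and 0 otherwise.</p>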

<p style = 'font-size:16px;font-family:Arial'>FillNa replaces missing (null) values; here each missing value is replaced with the mode of its column. ZScore transforms each column value into the number of standard deviations it lies from the mean value of the column.</p>

In [None]:
fn = FillNa(style = "mode", columns = num_x)
zs = ZScore(columns = num_x,
            out_columns = num_x, 
            fillna = fn)
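
<p style = 'font-size:16px;font-family:Arial'>The two steps can be sketched in plain Python as follows. The values are hypothetical, and the use of the population standard deviation is an assumption of this sketch; the in-database transformation may use a different convention.</p>

```python
from statistics import mean, mode, pstdev

# Hypothetical length-of-stay values with one missing entry (None).
los = [2, 4, 4, None, 10]

# Step 1: replace missing values with the most frequent observed value.
observed = [v for v in los if v is not None]
fill_value = mode(observed)
filled = [fill_value if v is None else v for v in los]

# Step 2: express each value as a number of standard deviations from the mean.
mu = mean(filled)
sigma = pstdev(filled)  # population standard deviation (assumption of this sketch)
z_scores = [(v - mu) / sigma for v in filled]

print(filled)
print([round(z, 3) for z in z_scores])
```

<p style = 'font-size:16px;font-family:Arial'>After the transformation the values have mean 0 and standard deviation 1, so columns measured on different scales become comparable.</p>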

<p style = 'font-size:16px;font-family:Arial'>Keep the other variables that do not need transformation.</p>

In [None]:
retain = Retain(columns = bin_x)

In [None]:
# Process the transformation
df_transformed = valib.Transform(
                            data = heart_failure, 
                            zscore = zs, 
                            one_hot_encode = [dummy1, dummy2, dummy3],
                            retain = retain,
                            index_columns = idcol,
                            key_columns = idcol
                         )

<p style = 'font-size:16px;font-family:Arial'>Save the transformed dataframe into a table <b>heart_failure_clean</b>.</p>

In [None]:
df_transformed.result.to_sql(
                "heart_failure_clean",
                schema_name = "demo_user",
                primary_index="id",
                if_exists="replace"
            )
df_transformed.result.head(5).to_pandas()

<p style = 'font-size:16px;font-family:Arial'>The chi-square test finds statistically significant associations between categorical variables by testing whether they are statistically independent. The null hypothesis here is that the target variable is independent of the given predictor column.</p>
<p style = 'font-size:16px;font-family:Arial'>The following rules are used to compute the hypothesis conclusion:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>If the chi-square statistic is greater than the critical value, then the function rejects the null hypothesis.</li>
    <li>If the chi-square statistic is less than or equal to the critical value, then the function fails to reject the null hypothesis.</li>
    </ul>
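
<p style = 'font-size:16px;font-family:Arial'>The decision rule above can be sketched in plain Python. This is a minimal illustration that assumes a hypothetical 2x2 contingency table with invented counts and the Pearson statistic without continuity correction; the internals of TD_CHISQ may differ. The critical value 3.841 corresponds to a significance level of 0.05 with one degree of freedom.</p>

```python
# Hypothetical 2x2 contingency table: rows are the two levels of a binary
# predictor, columns are (died, survived) counts.
table = [[30, 70],
         [60, 40]]

def chi_square_statistic(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if predictor and outcome were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (count - expected) ** 2 / expected
    return stat

# Critical value of the chi-square distribution for alpha = 0.05 and
# (rows - 1) * (columns - 1) = 1 degree of freedom.
CRITICAL_VALUE = 3.841

stat = chi_square_statistic(table)
reject = stat > CRITICAL_VALUE
print(round(stat, 3), "reject" if reject else "fail to reject")
```

<p style = 'font-size:16px;font-family:Arial'>A table whose rows have identical proportions produces a statistic of 0, so the null hypothesis of independence is not rejected.</p>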
    
<p style = 'font-size:16px;font-family:Arial'>If the test rejects the null-hypothesis, then that categorical predictor/column is significant and should be used in further analysis.
    <br>
    The following function performs a ChiSq test on a column and returns the result.
</p>

In [None]:
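def generate_contingency_table(col):
    """Build a death / non-death contingency table for `col` and run TD_CHISQ on it."""
    # Count deaths and non-deaths for each distinct value of the column
    q1 = f'''
            CREATE MULTISET TABLE contingency_{col} as(
                select {col} as {col}
                , sum((case when death = 1 then 1 else 0 end)) as death
                , sum((case when death = 0 then 1 else 0 end)) as non_death
                from heart_failure_clean
                group by {col}
            ) with data;
        '''
    # The DDL statement returns no rows, so execute it directly instead of reading it into pandas
    eng.execute(q1)
    # Chi-square test of independence at the 5% significance level
    q2 = f'''
            SELECT * from TD_CHISQ (
                ON contingency_{col} AS CONTINGENCY
                USING
                    Alpha (0.05)
            ) AS dt;        '''
    result = pd.read_sql(q2, eng)
    return result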

<p style = 'font-size:16px;font-family:Arial'>The above Python function generates a contingency table for a given column of the heart failure records and then performs a chi-square test on that table. For example, it can test whether dementia, or gender, is significantly associated with death. The following cell collects the categorical columns that are significant, i.e., those for which the chi-square test rejected the null hypothesis.</p>

In [None]:
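# Test only categorical and binary predictors: skip the ID, the z-scored
# numeric columns, and the target itself so that it cannot leak into the model inputs
skip = set(idcol + num_x + ["death"])
cols = []
for column in df_transformed.result.columns:
    if column in skip:
        continue
    try:
        result = generate_contingency_table(column)
        if result['conclusion'][0].rstrip('\x00') == 'Reject Null hypothesis':
            cols.append(column)
    except Exception:
        # Columns for which the test cannot be run are left out
        pass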

In [None]:
print("\033[1mOriginal categorical columns: \033[0m", cat_x + bin_x)
print("\033[1mSignificant categorical columns: \033[0m", cols)

<p style = 'font-size:16px;font-family:Arial'>The result above indicates that dementia, senile, and ethnic group (being from the Indian subcontinent or not) are the significant categorical variables for predicting death in this use case.</p>
<p style = 'font-size:16px;font-family:Arial'>Next, split the data into training and testing datasets in a 70:30 ratio.</p>

In [None]:
tdf_samples = df_transformed.result.sample(frac = [0.3, 0.7])
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'heart_failure_train', schema_name = 'demo_user', if_exists = 'replace')
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'heart_failure_test', schema_name = 'demo_user', if_exists = 'replace')

In [None]:
heart_failure_train = DataFrame('heart_failure_train')
heart_failure_test = DataFrame('heart_failure_test')
print("Training Set = "+str(heart_failure_train.shape[0])+". Testing Set = "+str(heart_failure_test.shape[0]))
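
<p style = 'font-size:16px;font-family:Arial'>For intuition, a 70:30 split can be sketched in plain Python as below. This in-memory version uses hypothetical patient IDs and a fixed seed; it only illustrates the idea and does not reproduce how the in-database sampling above assigns rows.</p>

```python
import random

# Hypothetical patient IDs standing in for the rows of the transformed table.
patient_ids = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the split is reproducible
shuffled = patient_ids[:]
rng.shuffle(shuffled)

# The first 30% of the shuffled rows form the test set, the remaining 70% the training set.
cut = int(0.3 * len(shuffled))
test_ids = shuffled[:cut]
train_ids = shuffled[cut:]

print("Training Set =", len(train_ids), "Testing Set =", len(test_ids))
```

<p style = 'font-size:16px;font-family:Arial'>The two sets are disjoint and together cover every row, which keeps test patients out of model training.</p>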

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Model Training</b>

<p style = 'font-size:16px;font-family:Arial'>DecisionForest is an ensemble algorithm for classification and regression predictive modeling problems. It extends bootstrap aggregation (bagging) of decision trees. Typically, constructing a decision tree involves evaluating every input feature in the data to select a split point.
<br>
<br>
DecisionForest instead restricts the features that can be considered at each split point to a random subset. This forces the decision trees in the forest to differ from one another, which can improve prediction accuracy.</p>

In [None]:
from teradataml import DecisionForest

DecisionForest_out = DecisionForest(data = heart_failure_train,
                            input_columns = cols + num_x,
                            response_column = 'death',
                            max_depth = 7,
                            num_trees = 100,
                            min_node_size = 2,
                            seed = 2,
                            tree_type = 'CLASSIFICATION')
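
<p style = 'font-size:16px;font-family:Arial'>The two sources of randomness described above, bootstrap samples of the rows and a random subset of the features, can be illustrated with a toy forest of one-split trees (decision stumps). This is a sketch on hypothetical binary data, not the DecisionForest implementation; the helper names fit_stump, fit_forest, and predict are assumptions made for illustration.</p>

```python
import random
from collections import Counter

# Hypothetical toy data: each row is (features, label); feature 0 determines the label.
data = [([x0, x1, x2], int(x0 == 1))
        for x0 in (0, 1) for x1 in (0, 1) for x2 in (0, 1)] * 5

def fit_stump(sample, feature_ids):
    """Pick the allowed binary feature whose single split misclassifies the fewest rows."""
    best = None
    for f in feature_ids:
        sides = {}
        for v in (0, 1):
            # Majority label among the rows that fall on this side of the split
            labels = [y for x, y in sample if x[f] == v]
            sides[v] = Counter(labels).most_common(1)[0][0] if labels else 0
        errors = sum(sides[x[f]] != y for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, f, sides)
    return best[1], best[2]

def fit_forest(rows, num_trees, features_per_split, seed):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(num_trees):
        bootstrap = [rng.choice(rows) for _ in rows]                  # rows sampled with replacement
        subset = rng.sample(range(n_features), features_per_split)   # random feature subset for the split
        forest.append(fit_stump(bootstrap, subset))
    return forest

def predict(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(sides[x[feature]] for feature, sides in forest)
    return votes.most_common(1)[0][0]

forest = fit_forest(data, num_trees=25, features_per_split=2, seed=2)
print(predict(forest, [1, 0, 0]), predict(forest, [0, 1, 1]))
```

<p style = 'font-size:16px;font-family:Arial'>Trees that never see the informative feature vote poorly on their own, but the majority vote across many differently built trees tends to be more reliable than any single tree.</p>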

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Model Testing</b>
<p style = 'font-size:16px;font-family:Arial'>DecisionForestPredict outputs the probability that each observation is in the predicted class.</p>

In [None]:
from teradataml import DecisionForestPredict

decision_forest_predict_out = DecisionForestPredict(
                                                        object = DecisionForest_out,
                                                        newdata = heart_failure_test,
                                                        id_column = "id",
                                                        detailed = False,
                                                        output_response_probdist = True,
                                                        output_prob = True,
                                                        output_responses =  ['0', '1'],
                                                        terms = 'death'
                                                    )

In [None]:
rf_pred=decision_forest_predict_out.result.to_pandas()
rf_pred['prediction'] = rf_pred['prediction'].astype('int64')

In [None]:
rf_pred.head()

<p style = 'font-size:16px;font-family:Arial'>In the above result, the column <b>death</b> is the ground truth, <b>prediction</b> is the predicted class, and <b>(prob_0, prob_1)</b> are the predicted probabilities of classes 0 and 1.
<br>
<br>
Recall matters most where <b>Overlooked Cases (False Negatives)</b> are more costly than <b>False Alarms (False Positives)</b>. Because overlooking even one at-risk patient could result in a death, the model should place emphasis on the recall score. It is far preferable not to miss any at-risk patient, even if that means flagging some patients who are not actually at risk.
</p>

In [None]:
recall_score(rf_pred.death, rf_pred.prediction)

<p style = 'font-size:16px;font-family:Arial'>The best value is 1 and the worst value is 0. The higher the recall, the better the model is for our use case.</p>
<p style = 'font-size:16px;font-family:Arial'>The confusion matrix for the same predictions follows.</p>

In [None]:
cm = confusion_matrix(rf_pred.death, rf_pred.prediction)
ConfusionMatrixDisplay(cm).plot()
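
<p style = 'font-size:16px;font-family:Arial'>To show how recall relates to the cells of the confusion matrix, here is a small plain-Python sketch on hypothetical labels (not the model output). Recall is the number of true positives divided by the sum of true positives and false negatives.</p>

```python
# Hypothetical ground-truth and predicted labels (1 = died, 0 = did not die).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # deaths correctly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # deaths the model missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-deaths correctly predicted

# Same layout as scikit-learn's confusion_matrix: rows are true labels,
# columns are predicted labels.
matrix = [[tn, fp],
          [fn, tp]]

recall = tp / (tp + fn)
print(matrix, recall)
```

<p style = 'font-size:16px;font-family:Arial'>In this example one of the four deaths is missed, so recall is 0.75; the two false alarms do not affect recall at all.</p>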

<p style = 'font-size:16px;font-family:Arial'>Looking at the above metrics, we can say that the model has performed decently well on testing data.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors the next time the notebook is run.</p>

In [None]:
eng.execute('DROP TABLE heart_failure_train;')

In [None]:
eng.execute('DROP TABLE heart_failure_test;')

In [None]:
for col in df_transformed.result.columns:
    try:
        eng.execute(f'DROP TABLE contingency_{col}')
    except Exception:
        # No contingency table was created for this column
        pass

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_SurvivalAnalysis');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Dataset:</b>

- `id`: patient id
- `death`: If the patient is deceased (boolean)
- `los`: hospital length of stay (in nights)
- `age`: age of the patient (in years)
- `gender`: gender of the patient (1-male, 2-female)
- `cancer`: If the patient has cancer (boolean)
- `cabg`: If the patient has gone through previous heart bypass i.e. Coronary Artery Bypass Graft procedure (boolean)
- `crt`: If the patient has received Cardiac Resynchronization Therapy, i.e. a device-based treatment for heart failure (boolean)
- `defib`: If the patient has defibrillator implanted (boolean)
- `dementia`: If the patient has dementia (boolean)
- `diabetes`: If the patient has diabetes (boolean)
- `hypertension`: If the patient has hypertension (boolean)
- `ihd`: If the patient has Ischemic Heart Disease (boolean)
- `mental_health`: If the patient has been diagnosed with mental health issues (boolean)
- `arrhythmias`: If the patient has arrhythmia (boolean)
- `copd`: If the patient has Chronic Obstructive Pulmonary Disease (boolean)
- `obesity`: If the patient has obesity (boolean)
- `pvd`: If the patient has Peripheral Vascular Disease (boolean)
- `renal_disease`: If the patient has Renal Disease (boolean)
- `valvular_disease`: If the patient has Valvular Disease (boolean)
- `metastatic_cancer`: If the patient has Metastatic Cancer (boolean)
- `pacemaker`: If the patient has pacemaker (boolean)
- `pneumonia`: If the patient has pneumonia (boolean)
- `prior_appts_attended`: Number of outpatient appointments attended in the previous year
- `prior_dnas`: Number of outpatient appointments missed in the previous year
- `pci`: If the patient has gone through Percutaneous Coronary Intervention procedure (boolean)
- `stroke`: If the patient has a history of stroke (boolean)
- `senile`: If the patient has Senile amyloidosis (SSA) (boolean)
- `quintile`: Socio-economic status for patient's neighbourhood, from 1 (most affluent) to 5 (poorest)
- `ethnicgroup`: 1 - White, 2 - Black, 3 - Indian Subcontinent, 8 - Not Known, 9 - Other 
- `fu_time`: Follow-up time, i.e. time in days since admission to hospital

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>