<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Diabetes prediction using BYOM
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Diabetes affects a large number of people all over the world. Studies show that in 2019, diabetes was the direct cause of 1.5 million deaths, and almost 50% of those deaths occurred before the age of 70.
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>  
Over time, diabetes can have a negative impact on multiple organs. It can damage the heart, blood vessels, eyes, kidneys, and nerves. The earlier a person receives proper treatment, the more likely he or she is to lower his or her blood glucose level. In addition, the risk of failure of other organs is also reduced.
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>  
Hence, as data science consultants, we showcase a complete approach to predicting a diagnosis of diabetes 6 months in advance. We demonstrate how to bring models that were trained with open-source technologies to Teradata Vantage for scoring. The data we are using is a sample dataset, and the results and predictions may not be entirely accurate.
</p>



<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This dataset contains data for 10,000 patients, half of whom were diagnosed with diabetes. It contains 624 columns. We used patients' visit records consisting of diagnoses, procedures, medications, and demographics. We also added a temporal aspect to the medical features by differentiating between events that occurred 1-3 months, 3-6 months, and 6-12 months before the prediction window. The main aim of the data is to distinguish between healthy people and those who were diagnosed with diabetes, according to the "target" column, which is set to 0 for non-diabetic and 1 for diabetic.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://synthea.mitre.org/'>Link to the dataset</a>: This dataset was generated by Synthea for experimentation purposes and does not reflect the actual population.</p>


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import json
import getpass
import pandas as pd
from teradataml import *
import warnings
warnings.filterwarnings("ignore")

import teradatasql
import plotly.express as px
import plotly.figure_factory as ff
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import numpy as np
from sklearn import metrics

import teradataml
from teradataml.dataframe.dataframe import in_schema
from teradataml.catalog.byom import save_byom, retrieve_byom
from teradataml.analytics.byom import H2OPredict

import sqlalchemy
from sqlalchemy import event
from sqlalchemy.types import String
from teradataml.context.context import *
from teradataml.dataframe.dataframe import DataFrame
from sqlalchemy.types import VARCHAR
from teradataml.dataframe.copy_to import copy_to_sql

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then use the down arrow to go to the next cell. Begin running steps with the Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Diabetes_Classification_BYOM.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Getting Data for This Demo</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. Here we download the data to local storage, which may yield somewhat faster execution but uses more of the available storage. We create only local databases and tables because the table has more than 600 columns, and queries on such a wide table run faster on local tables than on foreign tables.</p>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: The data loading part of this demo will be slow because the table has several hundred columns.</b></p>


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_DiabetesPrediction_BYOM_local');"
 # Takes about 4 minutes 


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next step is optional: run it if you want to see the status of the databases and tables created and the space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A simple SQL query shows a sample of the data. We use this test data to showcase the scoring capabilities in Vantage.</p>




In [None]:
query = '''
SELECT TOP 5 * FROM DEMO_DiabetesPrediction_BYOM.Diabetes_Diagnosis_BYOM_Data;
'''
df_byom=DataFrame.from_query(query)
df_byom

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Bringing our Trained H2O Model to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/BYOM/save_byom'>save_byom</a> function allows users to save various models stored in different formats such as PMML, MOJO, and so on.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function takes the following parameters:</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>model_id specifies the unique identifier for this model</li>
        <li>model_file specifies the absolute path of the file that contains the model</li>
        <li>Other parameter values are detailed in the documentation</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have our trained model in the form of a zip file. To use it in-database, we need to save it to a table.</p>

In [None]:
query = """
CREATE SET TABLE DBT_H2O_Models (
  model_id VARCHAR (30),
  model BLOB
)
PRIMARY INDEX (model_id);"""

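try:
    execute_sql(query)
except Exception:
    # Most likely the table already exists from a previous run:
    # drop it and create it again.
    db_drop_table('DBT_H2O_Models')
    execute_sql(query)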

In [None]:
save_byom(model_id="dbt_model_1", model_file="GBM_2_AutoML_1_20230214_131128.zip", table_name="DBT_H2O_Models")

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Model Scoring</b></p>
 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/BYOM/retrieve_byom'>retrieve_byom</a> API allows a user to retrieve a saved model. Output of this function can be directly passed as input to the PMMLPredict and H2OPredict functions.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function takes the following parameters:</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>model_id specifies the unique identifier of the model to be retrieved</li>
        <li>Other parameter values are detailed in the documentation</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After bringing our model to Vantage, we use it to score the data available in Vantage and check its performance.</p>

In [None]:
df_data = DataFrame(in_schema("DEMO_DiabetesPrediction_BYOM", "Diabetes_Diagnosis_BYOM_Data"))

In [None]:
modeldata = retrieve_byom("dbt_model_1", table_name="DBT_H2O_Models")

configure.byom_install_location = "mldb"

result = H2OPredict(newdata=df_data,
                    newdata_partition_column='MBR_ID',
                    newdata_order_column='MBR_ID',
                    modeldata=modeldata,
                    modeldata_order_column='model_id',
                    model_output_fields=['classProbabilities'],
                    accumulate=['MBR_ID'],
                    overwrite_cached_models='*',
                    enable_options=['contributions','stageProbabilities'],
                    model_type='OpenSource'
                    )

In [None]:
df_predict = result.result

In [None]:
%%time
df_predict=df_predict.to_pandas()

In [None]:
def Predict(x):
    if x["1"] >= 0.5:
        return 1
    return 0
def Prob(x):
    return x["1"]
Target = df_data.select(['MBR_ID', 'target'])
# DataFrame.from_query("SELECT MBR_ID, target FROM DEMO_DiabetesPrediction_BYOM.Diabetes_Diagnosis_BYOM_Data;")
Result = df_predict.merge(Target.to_pandas(),how='inner', on="MBR_ID")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are using the class probabilities returned by H2OPredict to classify the prediction as 0 or 1.</p>
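<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a minimal sketch of this step, the example below parses a few class-probability strings and applies a 0.5 threshold. The sample strings and the <code>classify</code> helper are hypothetical; the sketch assumes each value is a JSON object keyed by the class labels "0" and "1", as the scoring cells in this notebook imply.</p>

```python
import json

# Hypothetical sample values; the keys "0" and "1" are assumed to be
# the class labels, matching how the notebook reads x["1"].
sample_probabilities = [
    '{"0": 0.82, "1": 0.18}',
    '{"0": 0.35, "1": 0.65}',
    '{"0": 0.5, "1": 0.5}',
]

def classify(prob_json, threshold=0.5):
    """Return (predicted_label, probability_of_class_1) for one JSON string."""
    probs = json.loads(prob_json)
    prob_1 = probs["1"]
    # A probability equal to the threshold is classified as 1.
    label = 1 if prob_1 >= threshold else 0
    return label, prob_1

predictions = [classify(p) for p in sample_probabilities]
print(predictions)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Raising the threshold makes the classifier more conservative about predicting class 1; the notebook uses 0.5.</p>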

In [None]:
# Number of rows with a missing class-probabilities value
Result["classprobabilities"].isna().sum()

In [None]:
Result["classprobabilities"].apply(json.loads)

In [None]:
Result["classprobabilities"] = Result["classprobabilities"].apply(json.loads)

In [None]:
Result["Prediction"] = Result["classprobabilities"].apply(Predict)

In [None]:
Result["Prob_1"] = Result["classprobabilities"].apply(Prob)

In [None]:
Result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Create Confusion Matrix</b></p>

In [None]:
cm = confusion_matrix(Result['target'], Result['Prediction'])
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['DoesNotHaveDiabetes', 'HasDiabetes'])
cmd.plot()
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The confusion matrix above compares the actual values with the values predicted by the H2O model, for people with diabetes and for those without it.</p>
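<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the layout of the matrix concrete, here is a small sketch that counts the four cells by hand. The labels are made up for illustration and <code>confusion_counts</code> is a hypothetical helper, not part of scikit-learn; it returns rows for the actual class and columns for the predicted class, which is the layout <code>confusion_matrix</code> uses for labels 0 and 1.</p>

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = negative, 1 = positive)."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    # Rows are actual classes, columns are predicted classes.
    return [[tn, fp], [fn, tp]]

# Made-up labels for illustration only.
actual = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 1, 0]

matrix = confusion_counts(actual, predicted)
print(matrix)
```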


In [None]:
print(classification_report(Result['target'], Result['Prediction'], target_names=['Non-Diabetic','Diabetic']))

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output above shows the precision, recall, and F1-score for each class, together with their macro and weighted averages.</p>
<table style = 'font-size:16px;font-family:Arial;color:#00233C'>
  <tr>
    <th>Column</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Precision</td>
    <td>The positive predictive value: the fraction of relevant instances among the total retrieved instances.
        Precision answers the following question: what proportion of predicted Positives is truly Positive?
        Precision = TP / (TP + FP)</td>
  </tr>
  <tr>
    <td>Recall</td>
    <td>The fraction of relevant instances retrieved over the total number of relevant instances. Recall answers a different question: what proportion of actual Positives is correctly classified?
        Recall = TP / (TP + FN)</td>
  </tr>
  <tr>
    <td>F1</td>
    <td>The harmonic mean of precision and recall, a number between 0 and 1. The F1 score balances precision and recall for your classifier.
        F1 = 2 * (precision * recall) / (precision + recall)</td>
  </tr>
  <tr>
    <td>Support</td>
    <td>The number of actual occurrences of each class in the data.</td>
  </tr>
</table>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative</p>
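<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The formulas in the table can be checked with a short worked example. The counts below are invented for illustration and <code>binary_metrics</code> is a hypothetical helper; the sketch assumes the positive class is "Diabetic" and returns 0.0 whenever a denominator is zero.</p>

```python
def binary_metrics(tp, fp, fn):
    """Return precision, recall, F1, and support for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * (precision * recall) / denominator if denominator else 0.0
    # Support is the number of actual positives.
    support = tp + fn
    return precision, recall, f1, support

# Invented counts for illustration only.
precision, recall, f1, support = binary_metrics(tp=40, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} support={support}")
```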

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Show AUC-ROC Curve</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create the AUC-ROC curve with the scikit-learn roc_curve() function.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ROC curve shows the performance of a binary classification model as its discrimination threshold varies. For a range of thresholds, the curve plots the TPR (true-positive rate) on the y-axis against the FPR (false-positive rate) on the x-axis.</p>
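<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an illustration of how the curve is built, the sketch below computes the (FPR, TPR) point for each threshold by hand and integrates the curve with the trapezoidal rule. It is a simplified stand-in for roc_curve(), not its implementation; the labels, scores, and the helpers <code>roc_points</code> and <code>trapezoid_auc</code> are hypothetical, and the sketch assumes both classes are present in the labels.</p>

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points, one per distinct threshold, highest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels and scores for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(labels, scores)
auc_value = trapezoid_auc(points)
print(points, auc_value)
```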

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(Result['target'], Result['Prob_1'])
auc = metrics.auc(fpr, tpr)
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.legend(loc=4)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("AUC-ROC Curve")
plt.show()

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>

In [None]:
db_drop_table('DBT_H2O_Models')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_DiabetesPrediction_BYOM');" 
#Takes 2 minutes

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>