<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Parkinson's Disease prediction using Feature Projection and Decision Tree Classifier and GLM</b>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Research shows that 89 percent of people with Parkinson’s disease (PD) experience speech and voice disorders, including a soft, monotone, breathy or hoarse voice and uncertain articulation. As a result, people with PD report that they are less likely to participate in conversation or to feel confident in social settings than healthy individuals in their age group.
<br>
Speech disorders can progressively diminish quality of life for a person with PD. The earlier a person receives a baseline speech evaluation and speech therapy, the more likely he or she will be able to maintain communication skills as the disease progresses. Communication is a key element in quality of life and positive self-concept and confidence for people with PD.
<br>
In this scenario, we act as consultants approached by an organization that wants to detect Parkinson's Disease at an early stage.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>This dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.</p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://archive.ics.uci.edu/ml/datasets/parkinsons'>Link to the dataset</a>: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).</p>


<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Accessing the Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>These demos work either with foreign tables accessed from Cloud Storage via NOS, or you may import the tables to your machine. If you import data for multiple demos, you may need to use the Data Dictionary "Manage Your Space" routine to clean up tables you no longer need.</p>

<p style = 'font-size:16px;font-family:Arial'>Use the links below to access the two options for using data from the Data Dictionary notebook:</p>

[Click Here to get data for this notebook](../Data_Dictionary/Data_Dictionary.ipynb#TRNG_ParkinsonsDisease)

[Click Here to Manage Your Space](../Data_Dictionary/Data_Dictionary.ipynb#Manage_Your_Space)


<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Contents:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the Environment</li>
    <li>Initiate a connection to Vantage</li>
    <li>Analyze the raw data set</li>
    <li>Train and Test a Decision Tree Model</li>
        <ul>
            <li>4.1 Train and Test split using SAMPLE. Splitting the dataset in 80:20 ratio for Train and Test respectively</li>
            <li>4.2 Train a Decision Tree Model</li> 
                <ol style = 'font-size:16px;font-family:Arial'>
                    <li style = 'font-size:16px;font-family:Arial' >Use the TD_DecisionForest and TD_DecisionForestPredict in-database functions to predict whether a person has Parkinson's Disease. There are only two responses: '0' and '1'.</li>
                    <li style = 'font-size:16px;font-family:Arial'>Use the TD_GLM and TD_GLMPredict in-database functions to predict whether a person has Parkinson's Disease. There are only two responses: '0' and '1'.</li>
            </ol>
            <li>4.3 Evaluate the Model: evaluation is done using TD_ClassificationEvaluator, which reports metrics for the model such as Accuracy, Precision, and Recall.</li>
        </ul>
    <li>Cleanup</li>
</ol>

<hr>
<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>1. Configuring the Environment</b>
<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import json
import getpass
import pandas as pd
from teradataml import *

<hr>
<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>2. Initiate a connection to Vantage</b>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Make changes for your execution</b></p>

<p style = 'font-size:16px;font-family:Arial'>The Jupyter Module for Teradata provides a helper library called tdconnect, which can use the underlying client configs and pass a JWT token for SSO. The statement below establishes a connection to the Teradata Vantage server using the Teradata SQL Driver for Python. Before you execute it, replace the variables &lt;HOSTNAME&gt;, &lt;UID&gt; and &lt;PWD&gt; with your target Vantage system hostname (or IP address), your database user ID (QLID), and your password, respectively.</p>
    
<p style = 'font-size:14px;font-family:Arial'>td_context = create_context(host="tdprdX.td.teradata.com", username="xy123456", password=getpass.getpass(prompt='Password:'), logmech="LDAP")</p>

In [None]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = getpass.getpass())

<hr>
<p style = 'font-size:22px;font-family:Arial;color:#E37C4D'><b>3. Analyze the raw data set</b></p>
<!-- <p style = 'font-size:16px;font-family:Arial'>One of the challenges with this data set is that each recording consists of 755 individual metrics.  If this data set were to be used as input to a Decision Forest or other supervised learning algorithm "as is"; this large number of features would cause extraordinary performance degradation for very little gain in accuracy.  Not to mention, wrangling 755 columns adds additional complexity in programming and automation.</p> -->

<p style = 'font-size:16px;font-family:Arial'>A simple SQL query shows the data. We take just a sample of the data to showcase how the Teradata in-database functions can be used.</p>

In [None]:
query = '''
SELECT * FROM TRNG_ParkinsonsDisease.pd_speech_features;
'''

pd.read_sql(query, eng)

<hr>
<p style = 'font-size:22px;font-family:Arial;color:#E37C4D'><b>4. Train and Test a Decision Tree Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that we have our prepared data set, we can perform an abbreviated machine learning workflow:</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Create Train and Test data sets using SAMPLE Clause</li>
    <li>Train the model</li>
    <li>Evaluate the model using Test data</li>
</ol>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1 - Train and Test split using SAMPLE</b></p>

<p style = 'font-size:16px;font-family:Arial'>The first statement samples 80% of the rows into the training table. Using the EXCEPT clause in the second statement ensures that the test set does not intersect the training set.</p>

In [None]:
query = '''
CREATE MULTISET TABLE pd_speech_features_train AS (
    SELECT * FROM TRNG_ParkinsonsDisease.pd_speech_features SAMPLE 0.8
) WITH DATA
PRIMARY INDEX(id)
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE pd_speech_features_train;')
    eng.execute(query)

train = pd.read_sql('SELECT * FROM pd_speech_features_train', eng)
pd.read_sql('SELECT "class", COUNT(*) FROM pd_speech_features_train GROUP BY "class"', eng)

<p style = 'font-size:16px;font-family:Arial'>The output shows the number of rows in each class that are used to train the model.</p>



In [None]:
query = '''
CREATE MULTISET TABLE pd_speech_features_test AS (
    SELECT * FROM TRNG_ParkinsonsDisease.pd_speech_features
    EXCEPT
    SELECT * FROM pd_speech_features_train
) WITH DATA
PRIMARY INDEX(id)
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE pd_speech_features_test;')
    eng.execute(query)

test = pd.read_sql('SELECT * FROM pd_speech_features_test', eng)
pd.read_sql('SELECT "class", COUNT(*) FROM pd_speech_features_test GROUP BY "class"', eng)
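
<p style = 'font-size:16px;font-family:Arial'>The split logic above can be pictured with a small local sketch. It is only an analogy for SAMPLE followed by EXCEPT: the function name, the toy rows, and the fixed seed are illustrative assumptions, and the in-database SAMPLE clause does not take a seed here.</p>

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=42):
    """Return (train, test) where test holds every row that is not in train."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    n_train = round(len(rows) * train_fraction)
    train = rng.sample(rows, n_train)  # sampling without replacement
    train_ids = {row["id"] for row in train}
    # Similar in spirit to EXCEPT: keep only rows that were not sampled for training.
    test = [row for row in rows if row["id"] not in train_ids]
    return train, test

# Hypothetical toy data: 10 rows with alternating class labels.
rows = [{"id": i, "class": i % 2} for i in range(10)]
train, test = split_train_test(rows)
print(len(train), len(test))
```

<p style = 'font-size:16px;font-family:Arial'>Because the test rows are defined as "everything not in train", the two sets cannot overlap and together they cover the full table.</p>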

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.2.1 - Train a Decision Tree Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Advanced-SQL-Engine-Analytic-Functions/TD_DecisionForest'>TD_DecisionForest</a> is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. </p>

<p style = 'font-size:16px;font-family:Arial'>This function takes the training data as input, as well as the following function parameters</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>InputColumns; list or range of columns used as features</li>
        <li>ResponseColumn; the dependent or target value</li>
        <li>TreeType; either CLASSIFICATION or REGRESSION</li>
        <li>Other hyperparameter values detailed in the documentation</li>
        </ul>
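
<p style = 'font-size:16px;font-family:Arial'>To make the bagging idea concrete, here is a minimal pure-Python sketch. It is not how TD_DecisionForest is implemented: each "tree" is a one-split stump on a single hypothetical feature, all names and data are illustrative, and the random feature selection that a forest also performs is left out.</p>

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold on a single feature that minimizes training errors."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        # The stump predicts class 1 when the feature exceeds the threshold.
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagged_stumps(data, n_trees=5, seed=2):
    """Train each stump on a bootstrap sample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, class) pairs.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
forest = fit_bagged_stumps(data)
print(predict(forest, 0.05), predict(forest, 0.95))
```

<p style = 'font-size:16px;font-family:Arial'>Each stump sees a slightly different resampling of the data, and the vote averages out their individual quirks. The in-database function applies the same principle with deeper trees and many features.</p>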

In [None]:
%%time
query = '''
CREATE multiset table DF_table as (
    SELECT * FROM TD_DecisionForest (
    ON pd_speech_features_train PARTITION BY ANY
  USING
      InputColumns('[2:753]')
      ResponseColumn('"class"')
      MaxDepth(5)
      MinNodeSize(1)
      TreeType('CLASSIFICATION')
      Seed(2)
      Mtry(3)
      MtrySeed(1)
    ) as dt
) with data;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE DF_table;')
    eng.execute(query)

pd.read_sql('SELECT * FROM DF_table', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_DecisionForest function builds the tree model shown in the output above, based on the parameters passed in the query.</p>



<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.2.2 - Train a GLM Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <a href = 'https://docs.teradata.com/r/Teradata-Vantage-Machine-Learning-Engine-Analytic-Function-Reference/May-2019/Statistical-Analysis/Generalized-Linear-Model-Functions/GLM'>generalized linear model (GLM)</a> is an extension of the linear regression model that relates the linear predictor to the dependent variable through a link function. The GLM function supports several distribution families and associated link functions.</p>

<p style = 'font-size:16px;font-family:Arial'>This function takes the training data as input, as well as the following function parameters</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>InputColumns; list or range of columns used as features</li>
        <li>ResponseColumn; the dependent or target value</li>
        <li>Family; either Binomial or Gaussian</li>
        <li>Other hyperparameter values detailed in the documentation</li>
        </ul>
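
<p style = 'font-size:16px;font-family:Arial'>The role of the link function can be illustrated with a short sketch. It assumes the logit link that is commonly paired with the binomial family; the weights, intercept, and feature values are hypothetical and are not taken from the model trained in this notebook.</p>

```python
import math

def logit(p):
    """Link function: maps a probability in (0, 1) onto the real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def predict_probability(features, weights, intercept):
    """Compute the linear predictor, then apply the inverse link."""
    eta = intercept + sum(w * x for w, x in zip(weights, features))
    return inverse_logit(eta)

# Hypothetical coefficients and one hypothetical row of two features.
weights = [0.8, -1.2]
intercept = 0.1
p = predict_probability([1.0, 0.5], weights, intercept)
print(round(p, 4), round(logit(p), 4))
```

<p style = 'font-size:16px;font-family:Arial'>Applying the link to the predicted probability recovers the linear predictor, which is what "relating the linear equation to the response through a link function" means in practice.</p>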

In [None]:
%%time
query = '''
CREATE TABLE td_glm_output_credit_ex AS (
SELECT * FROM td_glm (
ON pd_speech_features_train
USING
InputColumns('[2:753]')
ResponseColumn('"class"')
Family('Binomial')
BatchSize(10)
MaxIterNum(300)
RegularizationLambda(0.02)
Alpha(0.15)
IterNumNoChange(50)
Tolerance(0.001)
Intercept('true')
LearningRate('optimal')
InitialEta(0.05)
Momentum(0.0)
LocalSGDIterations(0)
) AS dt
) WITH DATA
;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE td_glm_output_credit_ex;')
    eng.execute(query)

pd.read_sql('SELECT * FROM td_glm_output_credit_ex', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_GLM function outputs the model predictors and their values, based on the parameters passed in the query above.</p>






<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.3.1 - Evaluate the Decision Forest Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>Execute a testing prediction using the split data above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>TD_ClassificationEvaluator</a> SQL Function.</p>


<ol style = 'font-size:16px;font-family:Arial'>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Advanced-SQL-Engine-Analytic-Functions/DecisionForestPredict-SQL-Engine'>DecisionForestPredict</a> using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
    <li>Investigate the Confusion Matrix and additional metrics values</li>
    </ol>

In [None]:
query = '''
create multiset table DF_predict_test as (
    select * from DecisionForestPredict (
        ON pd_speech_features_test PARTITION BY ANY
        ON DF_table as Model dimension
        using
        IDColumn ('id')
        OutputProb ('true')
        Responses ('0', '1')
        Accumulate ('"class"')
    ) as dt
) with data;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE DF_predict_test;')
    eng.execute(query)

pd.read_sql('SELECT * FROM DF_predict_test', eng)

<p style = 'font-size:16px;font-family:Arial'>The DecisionForestPredict function returns, for each row identified by the id column, the predicted class and its probabilities. The output of the predict function is passed to TD_ClassificationEvaluator to compute the evaluation metrics.</p>



<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.3.2 - Evaluate the GLM Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>Execute a testing prediction using the split data above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>TD_ClassificationEvaluator</a> SQL Function.</p>


<ol style = 'font-size:16px;font-family:Arial'>
    <li>Execute TD_GLMPredict using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Advanced-SQL-Engine-Analytic-Functions/TD_ClassificationEvaluator'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
    <li>Investigate the Confusion Matrix and additional metrics values</li>
    </ol>

In [None]:
query = '''
CREATE TABLE vt_glm_predict_credit_ex AS (
SELECT * from TD_GLMPredict (
ON pd_speech_features_test AS INPUTTABLE
ON td_glm_output_credit_ex AS Model DIMENSION
USING
IDColumn ('ID')
Accumulate('"class"')
OutputProb('true')
Responses ('0','1')
) AS dt
) WITH DATA
;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE vt_glm_predict_credit_ex;')
    eng.execute(query)

pd.read_sql('SELECT * FROM vt_glm_predict_credit_ex', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_GLMPredict function returns, for each row identified by the id column, the predicted class and its probabilities. The output of the predict function is passed to TD_ClassificationEvaluator to compute the evaluation metrics.</p>
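
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of how per-class probabilities relate to a predicted label, the sketch below assumes the label is simply the class with the larger probability. The rows are hypothetical and only mirror the column names used in this notebook (id, prob_0, prob_1, class).</p>

```python
def label_from_probabilities(prob_0, prob_1):
    """Assumption: the predicted label is the class with the larger probability."""
    return 1 if prob_1 >= prob_0 else 0

# Hypothetical prediction rows; values are made up for illustration.
rows = [
    {"id": 1, "prob_0": 0.82, "prob_1": 0.18, "class": 0},
    {"id": 2, "prob_0": 0.35, "prob_1": 0.65, "class": 1},
    {"id": 3, "prob_0": 0.55, "prob_1": 0.45, "class": 1},
]

# (observed, predicted) pairs are what a classification evaluator compares.
pairs = [
    (row["class"], label_from_probabilities(row["prob_0"], row["prob_1"]))
    for row in rows
]
print(pairs)
```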



In [None]:
query = '''
SELECT * FROM TD_ClassificationEvaluator(
   ON (select prediction, cast("class" as VARCHAR(32000) CHARACTER SET UNICODE NOT CASESPECIFIC) as "class" from DF_predict_test) AS InputTable
   OUT VOLATILE TABLE OutputTable(additional_metrics_speech_test)
   USING
   ObservationColumn('"class"')
   PredictionColumn('prediction')
   Labels('0','1')
) AS dt;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE additional_metrics_speech_test;')
    eng.execute(query)

pd.read_sql('SELECT * FROM additional_metrics_speech_test', eng)

In [None]:
query = '''CREATE SET TABLE vt_glm_predict_credit_ex_1 ,FALLBACK ,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1,
     LOG
     (
      id BIGINT,
      prediction INT,
      "class" INT,
            prob_0 FLOAT,
      prob_1 FLOAT

)PRIMARY INDEX ( id )
;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE vt_glm_predict_credit_ex_1;')
    eng.execute(query)


In [None]:
query = '''insert into vt_glm_predict_credit_ex_1 sel id , cast(prediction as int), "class" , prob_0, prob_1 from vt_glm_predict_credit_ex;'''
eng.execute(query)

In [None]:
query = '''
SELECT * FROM TD_ClassificationEvaluator(
       ON (select prediction,  "class" from vt_glm_predict_credit_ex_1) AS InputTable
       OUT TABLE OutputTable(additional_metrics_speech_test_glm)
       USING
       ObservationColumn('"class"')
       PredictionColumn('prediction')
       Labels(0,1)
    ) AS dt;
'''

try:
    eng.execute(query)
except Exception:  # assume the table already exists from an earlier run; drop it and retry
    eng.execute('DROP TABLE additional_metrics_speech_test_glm;')
    eng.execute(query)

pd.read_sql('SELECT * FROM additional_metrics_speech_test_glm', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_ClassificationEvaluator function generates various metrics for the predictions, such as Accuracy, Precision, and Recall. As seen above, the Accuracy is 0.70, the Precision is above 0.35, and so on.</p>
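
<p style = 'font-size:16px;font-family:Arial'>For readers who want to see where such numbers come from, the sketch below computes the textbook definitions of Accuracy, Precision, and Recall from a confusion matrix for a single positive class. The labels are hypothetical and unrelated to the results above, and the evaluator's own output may apply additional averaging conventions that are not reproduced here.</p>

```python
def classification_metrics(observed, predicted, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == positive and p == positive for o, p in pairs)
    fp = sum(o != positive and p == positive for o, p in pairs)
    fn = sum(o == positive and p != positive for o, p in pairs)
    tn = sum(o != positive and p != positive for o, p in pairs)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or observed.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical observed and predicted labels.
observed = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(observed, predicted)
print(metrics)
```

<p style = 'font-size:16px;font-family:Arial'>In a screening setting such as this one, Recall for the PD class is often the metric to watch, since a missed case is usually more costly than a false alarm.</p>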



<p style = 'font-size:16px;font-family:Arial'>Thus we can evaluate and compare two different models using the in-database functions. In this case, since we have sample data, result metrics such as Accuracy, Precision, and Recall are similar for both models, but we can still see that TD_DecisionForest is slightly better than TD_GLM.</p>


<hr>
<b style = 'font-size:22px;font-family:Arial;color:#E37C4D'>5. Cleanup</b>

In [None]:
eng.execute('DROP view TRNG_ParkinsonsDisease.pd_speech_features;')

In [None]:
eng.execute('DROP TABLE gs_tables_db.TRNG_ParkinsonsDisease_pd_speech_features;')

In [None]:
eng.execute('DROP TABLE pd_speech_features_train;')

In [None]:
eng.execute('DROP TABLE pd_speech_features_test;')

In [None]:
eng.execute('DROP TABLE DF_table;')

In [None]:
eng.execute('DROP TABLE DF_predict_test;')

In [None]:
eng.execute('DROP TABLE additional_metrics_speech_test;')

In [None]:
eng.execute('DROP TABLE td_glm_output_credit_ex;')

In [None]:
eng.execute('DROP TABLE vt_glm_predict_credit_ex_1;')

In [None]:
eng.execute('DROP TABLE vt_glm_predict_credit_ex;')

In [None]:
eng.execute('DROP TABLE additional_metrics_speech_test_glm;')

In [None]:
eng.execute('DROP database TRNG_ParkinsonsDisease;')

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>