<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Parkinson's Disease prediction using Decision Forest Classifier and GLM</b>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Research shows that 89 percent of people with Parkinson’s disease (PD) experience speech and voice disorders, including soft, monotone, breathy and hoarse voice and uncertain articulation. As a result, people with PD report that they are less likely to participate in conversation, or to feel confident in social settings, than healthy individuals in their age group.
<br>
<br>    
Speech disorders can progressively diminish quality of life for a person with PD. The earlier a person receives a baseline speech evaluation and speech therapy, the more likely he or she will be able to maintain communication skills as the disease progresses. Communication is a key element in quality of life and positive self-concept and confidence for people with PD.
<br>
<br>    
As consultants, we have been approached by an organization that wants to detect Parkinson's disease at an early stage. This is not a complete data science use case; rather, it shows how Teradata In-Database functions can be used for model training and scoring, and for comparing the performance of two models. The data is sample data, so the results and predictions may not be entirely accurate.</p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>1. Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.</p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://archive.ics.uci.edu/ml/datasets/parkinsons'>Link to the dataset</a>: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).</p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>2. Contents:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the Environment</li>
    <li>Initiate a connection to Vantage</li>
    <li>Analyze the raw data set</li>
    <li>Train and Test a Decision Forest Model</li>
        <ul>
            <li>4.1 Train and Test split using SAMPLE. Splitting the dataset in 80:20 ratio for Train and Test respectively</li>
            <li>4.2 Train a Model</li> 
                <ol style = 'font-size:16px;font-family:Arial'>
                    <li style = 'font-size:16px;font-family:Arial' >Use the TD_DecisionForest and TD_DecisionForestPredict In-Database functions to predict whether the person has Parkinson's disease. There are only two responses: '0' and '1'.</li>
                    <li style = 'font-size:16px;font-family:Arial'>Use the TD_GLM and TD_GLMPredict In-Database functions to predict whether the person has Parkinson's disease. There are only two responses: '0' and '1'.</li>
            </ol>
            <li>4.3 Evaluate the Model: evaluation is done using TD_ClassificationEvaluator, which provides metrics for the model such as Accuracy, Precision, and Recall.</li>
        </ul>
    <li>Cleanup</li>
</ol>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>3. Start by connecting to the Vantage system.</b></p>


<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import json
import getpass
import pandas as pd
from teradataml import *
import warnings
warnings.filterwarnings("ignore")

<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell.</p>

In [None]:
%run -i ../startup.ipynb


In [None]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys.</p>

In [None]:
%sql SET query_band='DEMO=VAL-teradataml-Demo.ipynb;' UPDATE FOR SESSION; 

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>4. Getting Data for This Demo</b>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. In general, a demo can either use foreign tables to access the data without using any storage in your environment, or download the data to local storage, which may yield somewhat faster execution but requires available storage.
This notebook uses only the local option: the table has 755 columns, and working with it is faster from local tables.</p>


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_ParkinsonsDisease_local');"
 # Takes about 2 minutes


<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>5. Analyze the raw data set</b></p>
<!-- <p style = 'font-size:16px;font-family:Arial'>One of the challenges with this data set is that each recording consists of 755 individual metrics.  If this data set were to be used as input to a Decision Forest or other supervised learning algorithm "as is"; this large number of features would cause extraordinary performance degradation for very little gain in accuracy.  Not to mention, wrangling 755 columns adds additional complexity in programming and automation.</p> -->

<p style = 'font-size:16px;font-family:Arial'>A simple SQL query shows the data. We take just a sample of the data to showcase how the Teradata In-Database functions can be used.</p>




In [None]:
query = '''
SELECT * FROM DEMO_ParkinsonsDisease.Speech_Features;
'''

pd.read_sql(query, eng)

<p style = 'font-size:16px;font-family:Arial'>More than 750 different features of the speech recordings are used for analysis. The "CLASS" column, which is the rightmost column of the result set above (please scroll to the right), indicates whether the person has Parkinson's disease (1) or does not have Parkinson's disease (0).</p>



<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>6. Create Train and Test Dataset</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that we have our prepared data set, we can perform an abbreviated machine learning workflow:</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Create Train and Test data sets using SAMPLE Clause(80:20 split)</li>
    <li>Train the model</li>
    <li>Evaluate the model using Test data</li>
</ol>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Train and Test split using SAMPLE</b></p>

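<p style = 'font-size:16px;font-family:Arial'>The training rows are selected with a modulo filter on <code>id</code>. Using the EXCEPT clause in the second statement ensures that the test set does not intersect the training set.</p>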

In [None]:
query = '''
CREATE MULTISET TABLE pd_speech_features_train AS (
    SELECT * FROM DEMO_ParkinsonsDisease.Speech_Features WHERE id MOD 10 >= 2
) WITH DATA
PRIMARY INDEX(id)
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE pd_speech_features_train;')
    eng.execute(query)

train = pd.read_sql('SELECT * FROM pd_speech_features_train', eng)
pd.read_sql('SELECT "class", COUNT(*) FROM pd_speech_features_train GROUP BY "class"', eng)

<p style = 'font-size:16px;font-family:Arial'>The output shows the number of recordings in each class used to train the model; class 1 indicates Parkinson’s.</p>



In [None]:
query = '''
CREATE MULTISET TABLE pd_speech_features_test AS (
    SELECT * FROM DEMO_ParkinsonsDisease.Speech_Features
    EXCEPT
    SELECT * FROM pd_speech_features_train
) WITH DATA
PRIMARY INDEX(id)
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE pd_speech_features_test;')
    eng.execute(query)

test = pd.read_sql('SELECT * FROM pd_speech_features_test', eng)
pd.read_sql('SELECT "class", COUNT(*) FROM pd_speech_features_test GROUP BY "class"', eng)

<p style = 'font-size:16px;font-family:Arial'>The output shows the number of recordings in each class used to validate the model; class 1 indicates Parkinson’s.</p>
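<p style = 'font-size:16px;font-family:Arial'>The split logic can be sketched in plain Python, without a database connection: a modulo filter on <code>id</code> selects the training ids, and a set difference (the analogue of EXCEPT) leaves the test ids. The function name <code>split_by_modulo</code> and the synthetic ids are illustrative assumptions, not part of the demo data.</p>

```python
def split_by_modulo(ids, modulus, threshold):
    """Return (train, test) id lists: train keeps ids with id % modulus >= threshold."""
    train = [i for i in ids if i % modulus >= threshold]
    # Set difference plays the role of EXCEPT: test = everything not in train.
    train_set = set(train)
    test = [i for i in ids if i not in train_set]
    return train, test


ids = list(range(1000))  # synthetic ids, for illustration only
for threshold in (1, 2):
    train, test = split_by_modulo(ids, 10, threshold)
    share = len(train) / len(ids)
    overlap = len(set(train) & set(test))
    print(f"id % 10 >= {threshold}: train share = {share:.0%}, overlap = {overlap}")
```

<p style = 'font-size:16px;font-family:Arial'>On evenly distributed ids, a threshold of 1 keeps about 90% of the ids for training and a threshold of 2 keeps about 80%; in both cases the two sets are disjoint.</p>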



<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>7. Decision Forest Model</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.1 - Train a Decision Forest Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <a href = 'https://docs.teradata.com/search/all?query=TD_DecisionForest&content-lang=en-US'>TD_DecisionForest</a> is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. </p>
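<p style = 'font-size:16px;font-family:Arial'>To make the ensemble idea concrete, here is a minimal sketch of how a forest can combine its trees for classification: each tree votes for a class, the majority wins, and the vote share serves as a rough confidence. This illustrates the general bagging idea only; it is not a description of how TD_DecisionForest aggregates internally, and the per-tree votes are made up.</p>

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class labels into (winning_label, vote_share)."""
    counts = Counter(tree_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_predictions)


# Hypothetical votes from five trees for one recording.
votes = ['1', '1', '0', '1', '0']
label, share = majority_vote(votes)
print(label, share)
```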

<p style = 'font-size:16px;font-family:Arial'>This function takes the training data as input, as well as the following function parameters</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>InputColumns; list or range of columns used as features (we used an ordinal reference of columns 2:753)</li>
        <li>ResponseColumn; the dependent or target value (we used “class”, the rightmost column)</li>
        <li>TreeType; either CLASSIFICATION or REGRESSION</li>
    <li>Other hyperparameter values detailed in the documentation</li>
        </ul>

In [None]:
%%time
query = '''
CREATE multiset table DF_table as (
    SELECT * FROM TD_DecisionForest (
    ON pd_speech_features_train PARTITION BY ANY
  USING
      InputColumns('[2:753]')
      ResponseColumn('"class"')
      MaxDepth(5)
      MinNodeSize(1)
      TreeType('CLASSIFICATION')
      Seed(2)
      Mtry(3)
      MtrySeed(1)
    ) as dt
) with data;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE DF_table;')
    eng.execute(query)



<p style = 'font-size:16px;font-family:Arial'>The TD_DecisionForest function produces a model and a JSON representation of the decision tree. Below is an explanation of some fields in the JSON tree. Other details can be found <a href = 'https://docs.teradata.com/search/all?query=TD_DecisionForest&content-lang=en-US'>here</a>.</p>

<html>
   <head>
      <style>
         table, th, td {
            border: 1px solid black;
            border-collapse:collapse;
         }
      </style>
   </head>
   <body>
      <table>
         <tr>
            <th>JSON Type</th>
            <th>Description</th>             
         </tr>
         <tr>
            <td>id_</td>
            <td>"Node identifier"</td>
         </tr>
         <tr>
            <td>nodeType_</td> 
            <td>The node type. Possible values: CLASSIFICATION_NODE,CLASSIFICATION_LEAF,REGRESSION_NODE,REGRESSION_LEAF.</td>
         </tr>
         <tr>
            <td>split_</td> 
            <td>The start of JSON item that describes a split in the node.</td>
         </tr> 
         <tr>
            <td>responseCounts_</td> 
            <td>[Classification trees] Number of observations in each class at node identified by id.</td>
         </tr>
         <tr>
            <td>size_</td> 
            <td>Total number of observations at node identified by id.</td>
         </tr> 
         <tr>
            <td>maxDepth_</td> 
            <td>Maximum possible depth of tree, starting from node identified by id. For root node, the
value is max_depth. For leaf nodes, the value is 0. For other nodes, the value is the
maximum possible depth of tree, starting from that node.</td>
         </tr>  
      </table>
   </body>
</html>
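
<p style = 'font-size:16px;font-family:Arial'>Because each tree is stored as JSON, it can be inspected with ordinary JSON tooling once fetched. The sketch below walks a small hand-written tree and collects its leaves. The keys <code>id_</code>, <code>nodeType_</code>, <code>size_</code> and <code>responseCounts_</code> come from the table above; the child keys <code>leftChild_</code> and <code>rightChild_</code> and all sample values are assumptions for illustration, so check them against an actual row of DF_table.</p>

```python
import json

# Hand-written example tree; leftChild_/rightChild_ are assumed key names.
tree_json = '''
{"id_": 1, "nodeType_": "CLASSIFICATION_NODE", "size_": 10,
 "responseCounts_": {"0": 4, "1": 6},
 "leftChild_": {"id_": 2, "nodeType_": "CLASSIFICATION_LEAF", "size_": 4,
                "responseCounts_": {"0": 4}},
 "rightChild_": {"id_": 3, "nodeType_": "CLASSIFICATION_LEAF", "size_": 6,
                 "responseCounts_": {"1": 6}}}
'''


def collect_leaves(node):
    """Return the leaf nodes at or below this node, left to right."""
    if node["nodeType_"].endswith("_LEAF"):
        return [node]
    leaves = []
    for key in ("leftChild_", "rightChild_"):
        if key in node:
            leaves.extend(collect_leaves(node[key]))
    return leaves


tree = json.loads(tree_json)
leaves = collect_leaves(tree)
for leaf in leaves:
    print(leaf["id_"], leaf["size_"], leaf["responseCounts_"])
```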


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.2 - Evaluate the Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>Execute a testing prediction using the split data above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>


<ol style = 'font-size:16px;font-family:Arial'>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Scoring-Functions/DecisionForestPredict'>DecisionForestPredict</a> using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
</ol>

In [None]:
query = '''
create multiset table DF_predict_test as (
    select * from DecisionForestPredict (
        ON pd_speech_features_test PARTITION BY ANY
        ON DF_table as Model dimension
        using
        IDColumn ('id')
        OutputProb ('true')
        Responses ('0', '1')
        Accumulate ('"class"')
    ) as dt
) with data;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE DF_predict_test;')
    eng.execute(query)

pd.read_sql('SELECT * FROM DF_predict_test', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_DecisionForestPredict function outputs a prediction and class probabilities for each row, identified by the id column, with the accumulated "class" column alongside. The output of the predict function is passed to the classification evaluator to compute the model's metrics.</p>

<p style = 'font-size:16px;font-family:Arial'>DecisionForestPredict outputs the probability that each observation is in the predicted class. To use DecisionForestPredict output as input to the ML Engine ROC function, you must first transform it to show the probability that each observation is in the positive class. One way to do this is to change the probability to (1 - current probability) when the predicted class is negative. The prediction algorithm compares floating-point numbers. Due to possible inherent data type differences between ML Engine and Analytics Database executions, predictions can differ.</p>
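
<p style = 'font-size:16px;font-family:Arial'>The transformation described above is short enough to sketch in plain Python. The sketch assumes each row carries the predicted label and the probability of that predicted label; the function name and the sample rows are illustrative only.</p>

```python
def positive_class_probability(prediction, prob, positive_label='1'):
    """Convert 'probability of the predicted class' into 'probability of the positive class'."""
    return prob if prediction == positive_label else 1.0 - prob


# Hypothetical (prediction, probability-of-predicted-class) pairs.
rows = [('1', 0.9), ('0', 0.8), ('0', 0.55)]
positive_probs = [positive_class_probability(pred, p) for pred, p in rows]
print(positive_probs)
```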


<p style = 'font-size:16px;font-family:Arial'>We create the Confusion Matrix to compare the actual and the Predicted values. Confusion matrix is a very popular measure used while solving classification problems. It can be applied to binary classification as well as for multiclass classification problems. Confusion matrices represent counts from predicted and actual values. It is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes.</p>
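
<p style = 'font-size:16px;font-family:Arial'>As a small illustration of what the matrix holds, the sketch below counts (actual, predicted) pairs for a binary problem in plain Python and then divides by the total, which is what normalizing over all cells amounts to. It follows the scikit-learn convention of rows for actual labels and columns for predicted labels. The sample values are made up.</p>

```python
def confusion_counts(actual, predicted, labels):
    """Return an N x N list of lists: rows = actual label, columns = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
counts = confusion_counts(actual, predicted, labels=[0, 1])
total = sum(sum(row) for row in counts)
# Normalize over all cells so the entries sum to 1.
normalized = [[cell / total for cell in row] for row in counts]
print(counts)
print(normalized)
```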


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
df = pd.read_sql('SELECT id, cast("class" as int) "class", cast(prediction as int) prediction FROM DF_predict_test', eng)
cm = confusion_matrix(df['class'], df['prediction'], normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['DoesNotHaveParkinson', 'HasParkinson'])
cmd.plot()

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.3 - Use classification Evaluator for DecisionForestPredict</b></p>

<p style = 'font-size:16px;font-family:Arial'>Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>

<p style = 'font-size:16px;font-family:Arial'>In classification problems, a confusion matrix is used to visualize the performance of a classifier. The confusion matrix contains predicted labels represented across the row-axis and actual labels represented
across the column-axis. Each cell in the confusion matrix corresponds to the count of occurrences of labels
in the test data.</p>

<p style = 'font-size:16px;font-family:Arial'>Apart from accuracy, the secondary output table returns micro, macro, and weighted-averaged metrics of precision, recall, and F1-score values.</p>


In [None]:
query = '''
SELECT * FROM TD_ClassificationEvaluator(
   ON (select prediction, cast("class" as VARCHAR(32000) CHARACTER SET UNICODE NOT CASESPECIFIC) as "class" from DF_predict_test) AS InputTable
   OUT VOLATILE TABLE OutputTable(additional_metrics_speech_test)
   USING
   ObservationColumn('"class"')
   PredictionColumn('prediction')
   Labels('0','1')
) AS dt;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE additional_metrics_speech_test;')
    eng.execute(query)

pd.read_sql('SELECT * FROM additional_metrics_speech_test', eng)

<p style = 'font-size:16px;font-family:Arial'>The above output has the secondary output table that returns micro, macro, and weighted-averaged metrics of precision, recall, and F1-score values.</p>
<table style = 'font-size:16px;font-family:Arial'>
  <tr>
    <th>Column</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Precision</td>
    <td>The positive predictive value. Refers to the fraction of relevant instances among
the total retrieved instances.</td>
  </tr>
  <tr>
    <td>Recall</td>
    <td>Refers to the fraction of relevant instances retrieved over the total amount of
relevant instances.</td>
  </tr>
  <tr>
    <td>F1</td>
    <td>F1 score, defined as the harmonic mean of the precision and recall.</td>
  </tr>
  <tr>
    <td>Support</td>
    <td>The number of times a label appears in the ObservationColumn.</td>
  </tr>
</table>
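
<p style = 'font-size:16px;font-family:Arial'>These metrics follow directly from the confusion counts. The sketch below computes per-class precision, recall, F1 and support in plain Python, plus macro and weighted averages of F1. The labels are invented and the helper name is an assumption; it is not the evaluator's implementation.</p>

```python
def per_class_metrics(actual, predicted, label):
    """Precision, recall, F1 and support for one label, treated as the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    fp = sum(1 for a, p in zip(actual, predicted) if a != label and p == label)
    fn = sum(1 for a, p in zip(actual, predicted) if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = {label: per_class_metrics(actual, predicted, label) for label in (0, 1)}

# Macro average: unweighted mean over classes.
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
# Weighted average: mean over classes, weighted by each class's support.
total_support = sum(m["support"] for m in metrics.values())
weighted_f1 = sum(m["f1"] * m["support"] for m in metrics.values()) / total_support
print(metrics, macro_f1, weighted_f1)
```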

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>8. Generalized Linear Model(GLM)</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>8.1 - Train a GLM Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <a href = 'https://docs.teradata.com/search/all?query=TD_GLM&content-lang=en-US'>Generalized Linear Model (GLM)</a> is an extension of the linear regression model that enables the linear equation to relate to the dependent variables by a link function. The GLM function supports several distribution families and associated link functions. </p>
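
<p style = 'font-size:16px;font-family:Arial'>For a binary target, a common choice is the binomial family with a logit link: the linear predictor is passed through the logistic (sigmoid) function to give a probability. The sketch below shows that relationship in plain Python. The coefficients and feature values are invented, and treating logit as the link used by this notebook's model is an assumption.</p>

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic model: P(y=1) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and one (already scaled) feature vector.
p = predict_probability(intercept=0.5, coefficients=[1.2, -0.7], features=[0.3, 1.0])
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```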

<p style = 'font-size:16px;font-family:Arial'>This function takes the training data as input, as well as the following function parameters</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li>InputColumns; list or range of columns used as features (the training cell below uses an ordinal range, [2:353], of the scaled table)</li>
        <li>ResponseColumn; the dependent or target value (we used “class”)</li>
        <li>Family; either Binomial or Gaussian</li>
    <li>Other hyperparameter values detailed in the documentation</li>
        </ul>
        
<p style = 'font-size:16px;font-family:Arial'>Feature engineering transform functions encapsulate variable transformations during the training phase so you can chain them to create a pipeline for operationalization.</p>
<p style = 'font-size:16px;font-family:Arial'>Each TD_nameFit function outputs a table to input to the TD_nameTransform function as FitTable. For example, TD_ScaleFit outputs a FitTable for TD_ScaleTransform. We are using the mean ScaleMethod for this case.</p>

In [None]:
%%time
query = '''select * from TD_scaleFit(
on pd_speech_features_train as InputTable
OUT VOLATILE TABLE OutputTable(scaleFitOut_train)
using
TargetColumns('[2:753]')
MissValue('Keep')
ScaleMethod('mean')
GlobalScale('f')
)as dt;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE scaleFitOut_train;')
    eng.execute(query)

pd.read_sql('SELECT * FROM scaleFitOut_train', eng)

<p style = 'font-size:16px;font-family:Arial'>Using the mean ScaleMethod the ScaleFit function calculates the mean values of each feature used and the output of this ScaleFit function is used in the TD_ScaleTransform function as the fit table.</p>

In [None]:
%%time
query = '''Create table Trasformed_data_train as( SELECT * FROM TD_scaleTransform (
ON pd_speech_features_train AS InputTable
ON scaleFitOut_train AS FitTable DIMENSION
USING
Accumulate ('id','"class"')
) AS dt )with data;
'''

#eng.execute(query)

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE Trasformed_data_train;')
    eng.execute(query)

pd.read_sql('SELECT * FROM Trasformed_data_train', eng)

<p style = 'font-size:16px;font-family:Arial'>Using the ScaleTransform function, we transform the feature values to make them more suitable for the GLM model. Here we are transforming the training data.</p>

In [None]:
%%time
query = '''Create table Trasformed_data_test as( SELECT * FROM TD_scaleTransform (
ON pd_speech_features_test AS InputTable
ON scaleFitOut_train AS FitTable DIMENSION
USING
Accumulate ('id','"class"')
) AS dt )with data;
'''

#eng.execute(query)

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE Trasformed_data_test;')
    eng.execute(query)

pd.read_sql('SELECT * FROM Trasformed_data_test', eng)

<p style = 'font-size:16px;font-family:Arial'>Using the ScaleTransform function, we transform the feature values to make them more suitable for the GLM model. Here we are transforming the data for scoring, reusing the fit table computed from the training data.</p>
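
<p style = 'font-size:16px;font-family:Arial'>The fit/transform pairing matters because the scoring data should be scaled with statistics learned from the training data only. The sketch below shows the pattern in plain Python, assuming that the 'mean' method centers each column by subtracting its training mean; check the TD_ScaleFit documentation for the exact formula. The function names and values are illustrative.</p>

```python
def scale_fit(rows):
    """Learn per-column means from the training rows (the 'fit table')."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scale_transform(rows, means):
    """Center every row using previously learned means."""
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


train_rows = [[1.0, 10.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]

means = scale_fit(train_rows)                    # learned from train only
train_scaled = scale_transform(train_rows, means)
test_scaled = scale_transform(test_rows, means)  # reuses the train statistics
print(means, train_scaled, test_scaled)
```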

In [None]:
%%time
query = '''
CREATE TABLE td_glm_output AS (
SELECT * FROM td_glm (
ON Trasformed_data_train
USING
InputColumns('[2:353]')
ResponseColumn('"class"')
Family('Binomial')
BatchSize(200)
MaxIterNum(300)
RegularizationLambda(0.02)
Alpha(0.15)
IterNumNoChange(50)
Tolerance(0.001)
Intercept('true')
LearningRate('optimal')
InitialEta(0.001)
Momentum(0.0)
LocalSGDIterations(0)
) AS dt
) WITH DATA
;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE td_glm_output;')
    eng.execute(query)

pd.read_sql('SELECT * FROM td_glm_output', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_GLM function creates various output predictors and values based on the above parameters passed in the query</p>

<p style = 'font-size:16px;font-family:Arial'>The function output is a trained GLM model which can be input to the TD_GLMPredict function
for prediction. The model also contains model statistics of MSE, Loglikelihood, AIC, and BIC.
Further model evaluation can be done as a post-processing step using functions such as
TD_RegressionEvaluator,TD_ClassificationEvaluator and TD_ROC.</p>


<p style = 'font-size:16px;font-family:Arial'>By comparison, the TD_DecisionForest function in section 7 creates a set of trees based on the parameters applied in its query.</p>



<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>8.2 - Evaluate the Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>Execute a testing prediction using the split data above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>


<ol style = 'font-size:16px;font-family:Arial'>
    <li>Execute <a href = 'https://docs.teradata.com/search/all?query=TD_GLMPredict&content-lang=en-US'>GLMPredict</a> using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
</ol>

In [None]:
query = '''
CREATE TABLE vt_glm_predict AS (
SELECT * from TD_GLMPredict (
ON Trasformed_data_test AS INPUTTABLE
ON td_glm_output AS Model DIMENSION
USING
IDColumn ('ID')
Accumulate('"class"')
OutputProb('true')
Responses ('0','1')
) AS dt
) WITH DATA
;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE vt_glm_predict;')
    eng.execute(query)

pd.read_sql('SELECT * FROM vt_glm_predict', eng)

<p style = 'font-size:16px;font-family:Arial'>The TD_GLMPredict function predicts target values (regression) and class labels (classification) for test data using a GLM model trained by the TD_GLM function. As with TD_GLM, input features should be standardized, for example with TD_ScaleFit and TD_ScaleTransform, before they are used in the function. The function takes only numeric features; categorical features must be converted to numeric values prior to prediction.</p>

<p style = 'font-size:16px;font-family:Arial'>Rows with missing (null) values are skipped by the function during prediction. For prediction results evaluation, you can use TD_RegressionEvaluator, TD_ClassificationEvaluator or TD_ROC function as
postprocessing step.</p>


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>8.3 - Use classification Evaluator for GLMPredict</b></p>

<p style = 'font-size:16px;font-family:Arial'>Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>



<p style = 'font-size:16px;font-family:Arial'>TD_ClassificationEvaluator requires the prediction and class columns to have the same data type, so we create another table in which both columns are integers.</p>

In [None]:
query = '''CREATE SET TABLE vt_glm_predict_conv ,FALLBACK ,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1,
     LOG
     (
      id BIGINT,
      prediction INT,
      "class" INT,
            prob_0 FLOAT,
      prob_1 FLOAT

)PRIMARY INDEX ( id )
;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE vt_glm_predict_conv;')
    eng.execute(query)

query = '''insert into vt_glm_predict_conv sel id , cast(prediction as int), "class" , prob_0, prob_1 from vt_glm_predict;'''
eng.execute(query)    

<p style = 'font-size:16px;font-family:Arial'>Create CONFUSION MATRIX for the GLM Predict model.</p>

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
df_glm = pd.read_sql('SELECT * FROM vt_glm_predict_conv', eng)
cm = confusion_matrix(df_glm['class'], df_glm['prediction'], normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['DoesNotHaveParkinson', 'HasParkinson'])
cmd.plot()

In [None]:
query = '''
SELECT * FROM TD_ClassificationEvaluator(
       ON (select prediction,  "class" from vt_glm_predict_conv) AS InputTable
       OUT TABLE OutputTable(additional_metrics_speech_test_glm)
       USING
       ObservationColumn('"class"')
       PredictionColumn('prediction')
       Labels(0,1)
    ) AS dt;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE additional_metrics_speech_test_glm;')
    eng.execute(query)

pd.read_sql('SELECT * FROM additional_metrics_speech_test_glm', eng)

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>9. Comparison of the Metrics generated by the 2 Models. Decision Forest vs GLM</b></p>

In [None]:
query = '''CREATE MULTISET TABLE metric_union as (select cast('DecisionForest' as VARCHAR(15)) as Model, trim(Metric) as Metric,MetricValue from additional_metrics_speech_test a 
union all 
select 'GLM' as Model ,  trim(Metric) as Metric,MetricValue from additional_metrics_speech_test_glm b
)with data PRIMARY INDEX (Metric)
;
'''

try:
    eng.execute(query)
except:
    eng.execute('DROP TABLE metric_union;')
    eng.execute(query)
    
df_chart = pd.read_sql('select * from metric_union', eng)


In [None]:
from matplotlib import pyplot as plt
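# Strip NUL padding from metric names; a literal (non-regex) match behaves the same across pandas versions
df_chart['Metric'] = df_chart['Metric'].str.replace('\x00', '', regex=False)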
df_pivot = pd.pivot_table(
df_chart,
values="MetricValue",
index="Metric",
columns="Model"
)
#df_chart.plot.bar(x='Metric',y='MetricValue' , legend='model')
ax=df_pivot.plot(kind='bar')
# Get a Matplotlib figure from the axes object for formatting purposes
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(12, 6)
# Change the axes labels
ax.set_xlabel("Metrics")
ax.set_ylabel("Metric Values")

<p style = 'font-size:16px;font-family:Arial'>Here we used two different models to train on and predict the data, and the classification evaluator to evaluate and compare them. Teradata In-Database functions were used for training, prediction, and evaluation. Because this is sample data, result metrics such as Accuracy, Precision, and Recall may not be accurate for either model; still, the graph above suggests that, in this case, TD_DecisionForest performs better than TD_GLM.</p>


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>10. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>

In [None]:
eng.execute('DROP TABLE pd_speech_features_train;')

In [None]:
eng.execute('DROP TABLE pd_speech_features_test;')

In [None]:
eng.execute('DROP TABLE DF_table;')

In [None]:
eng.execute('DROP TABLE DF_predict_test;')

In [None]:
eng.execute('DROP TABLE additional_metrics_speech_test;')

In [None]:
eng.execute('DROP TABLE td_glm_output;')

In [None]:
eng.execute('DROP TABLE vt_glm_predict_conv;')

In [None]:
eng.execute('DROP TABLE vt_glm_predict;')

In [None]:
eng.execute('DROP TABLE additional_metrics_speech_test_glm;')

In [None]:
eng.execute('DROP TABLE metric_union;')

In [None]:
eng.execute('DROP TABLE Trasformed_data_train;')

In [None]:
eng.execute('DROP TABLE Trasformed_data_test;')

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ParkinsonsDisease');" 
#Takes 45 seconds

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>