<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Parkinson's Disease prediction using Decision Forest Classifier and GLM
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Research shows that 89 percent of people with Parkinson’s disease (PD) experience speech and voice disorders, including soft, monotone, breathy and hoarse voice and uncertain articulation. As a result, people with PD report they are less likely to participate in conversation or have confidence in social settings than healthy individuals in their age group.
<br>
<br>    
Speech disorders can progressively diminish quality of life for a person with PD. The earlier a person receives a baseline speech evaluation and speech therapy, the more likely he or she will be able to maintain communication skills as the disease progresses. Communication is a key element in quality of life and positive self-concept and confidence for people with PD.
<br>
<br>    
Suppose that, as consultants, we have been approached by an organization that wants to detect Parkinson's Disease at an early stage. This notebook is not a complete data science use case; it shows how the Vantage In-Database functions can be used for model training and scoring, and how to compare the performance of two models. The data is sample data, so the results and predictions may not be entirely accurate.</p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This dataset is composed of a range of biomedical voice measurements from different people with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals. Various speech signal processing algorithms, including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TWQT features, have been applied to the speech recordings of PD patients to extract clinically useful information for PD assessment. The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://archive.ics.uci.edu/ml/datasets/parkinsons'>Link to the dataset</a>: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).</p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Contents:</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Configuring the Environment</li>
    <li>Initiate a connection to Vantage</li>
    <li>Analyze the raw data set</li>
    <li>Train and Test a Decision Forest Model</li>
        <ul>
            <li>4.1 Train and Test split using SAMPLE. Splitting the dataset in 80:20 ratio for Train and Test respectively</li>
            <li>4.2 Train a Model</li> 
                <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                    <li style = 'font-size:16px;font-family:Arial;color:#00233C' >Using the DecisionForest and DecisionForestPredict In-Database functions to predict whether a person has Parkinson's Disease. There are only two responses: '0' and '1'.</li>
                    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>Using the GLM and TDGLMPredict In-Database functions to predict whether a person has Parkinson's Disease. There are only two responses: '0' and '1'.</li>
            </ol>
            <li>4.3 Evaluate the Model: evaluation is done using TD_ClassificationEvaluator, which reports metrics for the model such as Accuracy, Precision, and Recall.</li>
        </ul>
    <li>Cleanup</li>
</ol>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us start by checking the version of teradataml that is installed. The OpenSource ML functions used in this notebook require version 20.0.0.0.</p>

In [None]:
pip show teradataml

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>If the VM has a lower version, please execute the upgrade cell below. After the cell executes, restart the kernel. The simplest way to restart the kernel is to press zero twice: <b> 0 0</b></i></p>
</div>

In [None]:
%%capture
!pip install --upgrade teradataml

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import json
import getpass
import pandas as pd

from teradataml.dataframe.dataframe import DataFrame
from teradataml import *
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
display.max_rows=5

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
execute_sql('''SET query_band='DEMO=Parkinsons_Disease_Prediction_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You have the option of either running the demo using foreign tables, which access the data without using any storage on your environment, or downloading the data to local storage, which may yield somewhat faster execution at the cost of available storage. Here we create only local databases and tables: the table has 755 columns, and working with it is faster in local tables.</p> 
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: The early part of this demo will be slow because we are starting out with so many columns. The strategy of this demo is to eliminate irrelevant columns so we can focus on the ones that are the best predictors of the disease and, as a by-product, get better performance.</b></p>    


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_ParkinsonsDisease_local');"
 # Takes about 3 minutes


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Analyze the raw data set</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a DataFrame to get the data from the table created.</p>




In [None]:
speech_features = DataFrame(in_schema('DEMO_ParkinsonsDisease','Speech_Features'))
speech_features

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>There are more than 750 different features of the speech recordings which are used for analysis. The "class" column, which is the rightmost column of the result set above (please scroll to the right), indicates whether the person has Parkinson's Disease (1) or does not have Parkinson's Disease (0).</p>


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Create Train and Test Dataset</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now that we have our prepared data set, we can perform an abbreviated machine learning workflow:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create Train and Test data sets using SAMPLE Clause(80:20 split)</li>
    <li>Train the model</li>
    <li>Evaluate the model using Test data</li>
</ol>



<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Feature engineering transform functions encapsulate variable transformations during the training phase so you can chain them to create a pipeline for operationalization. We used RandomProjectionMinComponents to find the minimum number of components required, which allowed us to reduce the number of columns from 753 to 318.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next we use the <i>name</i>Fit and <i>name</i>Transform functions. Each <i>name</i>Fit function outputs a table that is passed to the matching <i>name</i>Transform function as its FitTable. For example, ScaleFit outputs a FitTable for ScaleTransform. We are using the STD ScaleMethod in this case.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With the STD ScaleMethod, the ScaleFit function calculates the statistics needed to standardize each feature (its mean and standard deviation). ScaleTransform then uses the ScaleFit output as its fit table to scale the specified input DataFrame columns.</p>
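<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the fit/transform pattern concrete, here is a minimal pure-Python sketch of standardization. It is not the in-database implementation: the helper names <code>scale_fit</code> and <code>scale_transform</code>, the sample rows, and the use of the population standard deviation are all assumptions made for illustration.</p>

```python
import math

def scale_fit(rows, columns):
    """Compute (mean, population std) per column; plays the role of a fit table."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        stats[col] = (mean, math.sqrt(variance))
    return stats

def scale_transform(rows, stats):
    """Apply (value - mean) / std to fitted columns; other columns pass through."""
    scaled = []
    for row in rows:
        new_row = dict(row)
        for col, (mean, std) in stats.items():
            # Guard against constant columns, where std is zero.
            new_row[col] = (row[col] - mean) / std if std else 0.0
        scaled.append(new_row)
    return scaled

# Hypothetical rows; only "DFA" is scaled, "id" and "class" are carried along.
rows = [
    {"id": 1, "DFA": 0.60, "class": 1},
    {"id": 2, "DFA": 0.70, "class": 0},
    {"id": 3, "DFA": 0.80, "class": 1},
]
fit_table = scale_fit(rows, ["DFA"])
scaled_rows = scale_transform(rows, fit_table)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping the statistics in a separate fit table means the same scaling can later be applied to new data without recomputing them.</p>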

In [None]:
from teradataml import ScaleFit, ScaleTransform

sf_fit = ScaleFit(data = speech_features, scale_method = 'STD',
                     target_columns = ['2:318'])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Transform() function applies numeric transformations to input columns, using Fit() output. Here the output of the ScaleFit function is used by the ScaleTransform to apply the numeric transformations to the input columns. </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: There may be some Teradata Database warnings for the ScaleTransform functions, but these are just warnings which can be ignored.</b></p>

In [None]:
sf_trns = sf_fit.transform(data = speech_features, accumulate = ['"id"','"class"'])
sf_trns.result

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Using teradataml OpenSource ML functions to find out feature importance </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Below we are splitting the train and test dataset for the model evaluation.</p>
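<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of what a seeded 80:20 split does, the sketch below shuffles a list of ids reproducibly and cuts it at 80%. The helper <code>train_test_split_ids</code> is hypothetical and does not reproduce the row assignment made by the in-database function.</p>

```python
import random

def train_test_split_ids(ids, train_size=0.8, seed=25):
    """Shuffle ids with a fixed seed and split them into train and test sets."""
    rng = random.Random(seed)      # fixed seed makes the split repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return set(shuffled[:cut]), set(shuffled[cut:])

# Hypothetical ids 1..100
train_ids, test_ids = train_test_split_ids(range(1, 101))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Splitting on the id column keeps every row in exactly one of the two sets, and reusing the seed reproduces the same split.</p>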


In [None]:
df_transform = sf_trns.result
df_transform = df_transform.assign(target = df_transform["class"])
df_transform = df_transform.drop(["class"], axis=1)

In [None]:
TrainTestSplit_out = TrainTestSplit(
                                    data = df_transform,   #sf_trns.result,
                                    id_column = "id",
                                    train_size = 0.80,
                                    test_size = 0.20,
                                    seed = 25
                                   )

In [None]:
df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)

In [None]:
X_train = df_train.drop(["target","id"], axis = 1)
y_train = df_train.select(["target"])
X_test = df_test.drop(["target","id"], axis = 1)
y_test = df_test.select(["target"])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Logistic Regression is used to create the model output which will be used further to check the feature importance.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). This is suitable here as we have to determine if the patient has Parkinson or Does not have Parkinson.</p>
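<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea can be sketched in a few lines: a weighted sum of the features is passed through the sigmoid function to give a probability, which is then thresholded into a 0/1 label. The weights, bias, and 0.5 threshold below are hypothetical values chosen only for illustration.</p>

```python
import math

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary label."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical model with two features
weights, bias = [1.5, -2.0], 0.25
label = predict_label([0.8, 0.1], weights, bias)
```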

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we are using the LogisticRegression from the teradataml OpenSource ML library. We will create the model and fit using the training data.</p> 

In [None]:
from teradataml import td_sklearn as osml
lr = osml.LogisticRegression(C=0.1, penalty='l2', solver= 'liblinear', random_state=1)
lr.fit(X_train, y_train)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will check the accuracy of the model by using the score function.</p>

In [None]:
lr.score(X_test, y_test)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The score of the model is 0.76 which means the accuracy of the model is 76%</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using the predict function, we will predict whether each patient has Parkinson's or not.</p>

In [None]:
#model predictions
predict_lr =lr.predict(X_test,y_test)
predict_lr

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Based on the predicted and the actual values from the dataset we will check the classification report for the model. We get the Precision, Recall, F1-Score and Accuracy of the model from the classification Report</p>

In [None]:
y_true_df = predict_lr.select(["target"])
y_pred_df = predict_lr.select("logisticregression_predict_1")

In [None]:
opt = td_sklearn.classification_report(y_true=y_true_df, y_pred=y_pred_df, digits=4)
print(opt)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can see that the accuracy for this model is 76%.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Below we try to get the important features using the logistic regression coefficient and plot the graph for features based on their importance</p>

In [None]:
import matplotlib.pyplot as plt
importances = pd.DataFrame(data={
    'Attribute': X_train.columns,
    'Importance': lr.coef_[0]
})
importances = importances.sort_values(by='Importance', ascending=False)
importances.to_csv("logistic_feature_imp.csv")
plt.figure(figsize=(100,30))
plt.bar(x=importances['Attribute'], height=importances['Importance'], color='#087E8B')
plt.title('Feature importances obtained from coefficients', size=20)
plt.xticks(rotation='vertical')
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graph shows the important features. Since we have a lot of features (318), we have also created a logistic_feature_imp.csv, which shows the exact values for all the features. </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As seen in the graph and the output file, apart from the top 10 to 15 variables, the coefficients of the remaining variables are very close to zero. So we will consider only the top 12 variables (those with a coefficient >= 0.3 or <= -0.3) when checking the accuracy.</p>
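<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The selection rule can be expressed as a filter on the absolute value of each coefficient. The sketch below uses made-up feature names and coefficients, not the values from the fitted model.</p>

```python
def select_by_coefficient(coefficients, threshold=0.3):
    """Keep features whose coefficient magnitude is at least the threshold."""
    selected = {name: c for name, c in coefficients.items() if abs(c) >= threshold}
    # Strongest effects first, regardless of sign
    return sorted(selected, key=lambda name: abs(selected[name]), reverse=True)

# Hypothetical coefficients for illustration only
coefficients = {
    "feature_a": 0.45,
    "feature_b": -0.38,
    "feature_c": 0.02,
    "feature_d": -0.10,
}
top_features = select_by_coefficient(coefficients)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using the absolute value matters because a strongly negative coefficient is as informative as a strongly positive one.</p>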

In [None]:
### Apart from the top 10 to 20 variables, the remaining coefficients are close to zero.
### So we take the top 12 variables (coefficient >= 0.3 or <= -0.3) and check the accuracy.
parkinson_new_df = df_transform[["id",
"DFA",
"std_delta_delta_log_energy",
"std_7th_delta",
"std_delta_log_energy",
"std_7th_delta_delta",
"GNE_mean",
"mean_MFCC_2nd_coef",                          
"mean_2nd_delta",
"mean_MFCC_5th_coef",
"std_MFCC_3rd_coef",
"mean_MFCC_6th_coef",
"GNE_SNR_TKEO", "target"]]

parkinson_new_df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With these 12 features selected, we then check the correlation between them.</p>


In [None]:
#get the correlation matrix
import seaborn as sns
pd_parkinson_new_df = parkinson_new_df.to_pandas(all_rows=True)
corr = pd_parkinson_new_df.corr()

##plot heatmap
plt.figure(figsize=(20,10))
plt.title('Correlation Matrix')
sns.heatmap(corr, annot = True);

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The correlation matrix above shows two pairs of features that are very highly correlated. At the intersections of the 5th and 6th features from the top on the Y axis with the 3rd and 4th features, respectively, from the left on the X axis, the correlations are 0.92 and 0.95.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because each of these pairs carries largely redundant information, we drop one feature from each pair (std_delta_log_energy and std_7th_delta_delta) and use only the remaining 10 important features in the evaluation below.</p>
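<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One simple way to automate this kind of pruning is to keep a feature only if it is not highly correlated with any feature already kept. The sketch below is an assumption-laden illustration: the 0.9 threshold, the helper names, and the toy columns are hypothetical, and the notebook itself selected the features by inspecting the heatmap.</p>

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep columns whose |correlation| with every kept column is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: f2 is almost an exact multiple of f1, f3 is unrelated
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
```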


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Train and Test split using SAMPLE</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create the dataset with only these 10 variables, along with the id and target columns, which will be used as the id and target variables for the evaluations below.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using 80:20 split data to create the training and testing dataset.</p>

In [None]:
tdf_samples = df_transform.sample(frac = [0.2, 0.8])[["id",
"DFA",
"std_delta_delta_log_energy",
"std_7th_delta",
"GNE_mean",
"mean_MFCC_2nd_coef",                          
"mean_2nd_delta",
"mean_MFCC_5th_coef",
"std_MFCC_3rd_coef",
"mean_MFCC_6th_coef",
"GNE_SNR_TKEO","sampleid", "target"]]

In [None]:
pd_speech_features_train = tdf_samples[tdf_samples['sampleid'] == 2]

In [None]:
pd_speech_features_test = tdf_samples[tdf_samples['sampleid'] == 1]

In [None]:
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'pd_speech_features_train', schema_name = 'demo_user',
            if_exists = 'replace')
train_df_data = DataFrame('pd_speech_features_train')
train_df_data.select(['target','id']).groupby('target').count()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output shows the number of people we are considering for each class to train the model – class 1 has Parkinson’s</p>



In [None]:
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'pd_speech_features_test', schema_name = 'demo_user', 
            if_exists = 'replace')
test_df_data = DataFrame('pd_speech_features_test')
test_df_data.select(['target','id']).groupby('target').count()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output shows the number of people we are considering for each class to test the model – class 1 has Parkinson’s</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Decision Forest Model</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 - Train a Decision Forest Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/teradataml-Analytic-Database-SQL-Engine-Analytic-Functions/Supported-on-Database-Versions-16.20.xx-17.00.xx-17.05.xx/DecisionForestPredict'>DecisionForest</a> is an ensemble algorithm used for classification and regression predictive modelling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. </p>
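<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The bagging idea can be sketched as follows: each tree is trained on a bootstrap sample (rows drawn with replacement), and for classification the forest returns the majority vote of its trees. The "trees" below are hypothetical one-line threshold rules standing in for trained trees, so this shows the mechanism only, not the in-database algorithm.</p>

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as used to train each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common class among the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, features):
    """Ask every tree for a prediction and combine them by majority vote."""
    return majority_vote([tree(features) for tree in trees])

# Hypothetical stand-ins for trained trees (simple threshold rules)
trees = [
    lambda f: 1 if f["DFA"] > 0.5 else 0,
    lambda f: 1 if f["GNE_mean"] > 0.2 else 0,
    lambda f: 1 if f["DFA"] > 0.9 else 0,
]
prediction = forest_predict(trees, {"DFA": 0.7, "GNE_mean": 0.3})
```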

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function takes the training data as input, as well as the following function parameters:</p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>InputColumns: list or range of columns used as features (here, the selected feature columns are passed by name)</li>
        <li>ResponseColumn: the dependent or target value (here, the "target" column)</li>
        <li>TreeType: either CLASSIFICATION or REGRESSION</li>
        <li>Other hyperparameter values detailed in the documentation</li>
    </ul>

In [None]:
# Only the 10 selected features are used as inputs; the row identifier ('id')
# and the sampling marker ('sampleid') are not predictors.
DecisionForest_out = DecisionForest(data = train_df_data, 
                            input_columns = ['DFA', 'std_delta_delta_log_energy', 'std_7th_delta', 'GNE_mean', 'mean_MFCC_2nd_coef', 
      'mean_2nd_delta', 'mean_MFCC_5th_coef', 'std_MFCC_3rd_coef', 'mean_MFCC_6th_coef', 'GNE_SNR_TKEO'], 
                            response_column = 'target', 
                            max_depth = 5, 
                            num_trees = 4, 
                            min_node_size = 1, 
                            mtry = 3, 
                            mtry_seed = 1, 
                            seed = 2, 
                            tree_type = 'CLASSIFICATION')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The DecisionForest function produces a model and a JSON representation of each decision tree. Below is an explanation of some of the fields in the JSON tree. The other details can be found at the link <a href = 'https://docs.teradata.com/search/all?query=TD_DecisionForest&content-lang=en-US'>here.</a></p>

<html>
   <head>
      <style>
         table, th, td {
            border: 1px solid black;
            border-collapse:collapse;
         }
      </style>
   </head>
   <body>
      <table>
         <tr>
            <th>JSON Type</th>
            <th>Description</th>             
         </tr>
         <tr>
            <td>id_</td>
            <td>Node identifier.</td>
         </tr>
         <tr>
            <td>nodeType_</td> 
            <td>The node type. Possible values: CLASSIFICATION_NODE,CLASSIFICATION_LEAF,REGRESSION_NODE,REGRESSION_LEAF.</td>
         </tr>
         <tr>
            <td>split_</td> 
            <td>The start of JSON item that describes a split in the node.</td>
         </tr> 
         <tr>
            <td>responseCounts_</td> 
            <td>[Classification trees] Number of observations in each class at node identified by id.</td>
         </tr>
         <tr>
            <td>size_</td> 
            <td>Total number of observations at node identified by id.</td>
         </tr> 
         <tr>
            <td>maxDepth_</td> 
            <td>Maximum possible depth of tree, starting from node identified by id. For root node, the
value is max_depth. For leaf nodes, the value is 0. For other nodes, the value is the
maximum possible depth of tree, starting from that node.</td>
         </tr>  
      </table>
   </body>
</html>


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 - Evaluate the Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Execute a testing prediction using the split data above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Execute <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Scoring-Functions/DecisionForestPredict'>DecisionForestPredict</a> using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
</ol>

In [None]:
decision_forest_predict_out = TDDecisionForestPredict(object = DecisionForest_out.result,
                                                        newdata = test_df_data,
                                                        id_column = "id",
                                                        detailed = False,
                                                        output_prob = True,
                                                        output_responses = ['0','1'],
                                                        accumulate = 'target')
decision_forest_predict_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For each row, identified by the id column, the DecisionForestPredict function returns the predicted class together with the associated probabilities. The output of the predict function is passed to the Classification Evaluator to compute the evaluation metrics for the model.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>DecisionForestPredict outputs the probability that each observation is in the predicted class. To use DecisionForestPredict output as input to ML Engine ROC function, you must first transform it to show the probability that each observation is in the positive class. One way to do this is to change the probability to (1- current probability) when the predicted class is negative. The prediction algorithm compares floating-point numbers. Due to possible inherent data type differences between ML Engine and Analytics Database executions, predictions can differ.</p>
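<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The transformation described above is a one-line rule. The helper name and the example values in this sketch are hypothetical.</p>

```python
def positive_class_probability(predicted_class, probability, positive_label="1"):
    """Convert 'probability of the predicted class' into 'probability of the positive class'."""
    if str(predicted_class) == positive_label:
        return probability
    # Predicted class is negative: the positive class gets the remaining probability
    return 1.0 - probability

# Hypothetical (predicted class, probability of that class) pairs
scores = [positive_class_probability(c, p) for c, p in [("1", 0.8), ("0", 0.75)]]
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This rule only applies to binary classification, where the two class probabilities sum to 1.</p>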


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We create the Confusion Matrix to compare the actual and the Predicted values. Confusion matrix is a very popular measure used while solving classification problems. It can be applied to binary classification as well as for multiclass classification problems. Confusion matrices represent counts from predicted and actual values. It is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes.</p>
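<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a small worked example, the sketch below counts a 2 x 2 confusion matrix by hand. It follows the scikit-learn convention of rows for actual labels and columns for predicted labels; the label lists are made up for illustration.</p>

```python
def confusion_counts(actual, predicted, labels=(0, 1)):
    """Return an N x N matrix where matrix[i][j] counts actual=labels[i], predicted=labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical actual and predicted labels
actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
matrix = confusion_counts(actual, predicted)
tn, fp = matrix[0]
fn, tp = matrix[1]
```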


In [None]:
predicted_data = decision_forest_predict_out.result
predicted_data

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
df = predicted_data.to_pandas().reset_index()
cm = confusion_matrix(df['target'], df['prediction'], normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['DoesNotHaveParkinson', 'HasParkinson'])
cmd.plot()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Confusion Matrix above compares the actual and predicted values from the Decision Forest model for people who have Parkinson's and those who do not. Because <code>normalize='all'</code> is used, each cell shows a fraction of all test rows rather than a count.</p>


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3 - Use classification Evaluator for DecisionForestPredict</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In classification problems, a confusion matrix is used to visualize the performance of a classifier. The confusion matrix contains predicted labels represented across the row-axis and actual labels represented
across the column-axis. Each cell in the confusion matrix corresponds to the count of occurrences of labels
in the test data.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Apart from accuracy, the secondary output table returns micro, macro, and weighted-average metrics of precision, recall, and F1-score values.</p>


In [None]:
ClassificationEvaluator_obj = ClassificationEvaluator(data=predicted_data,
                                                          observation_column='target',
                                                          prediction_column='prediction',
                                                          labels=['0','1'])

In [None]:
df_metrics = ClassificationEvaluator_obj.output_data
df_metrics

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above output has the secondary output table that returns micro, macro, and weighted-average metrics of precision, recall, and F1-score values.</p>
<table style = 'font-size:16px;font-family:Arial;color:#00233C'>
  <tr>
    <th>Column</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Precision</td>
    <td>The positive predictive value. Refers to the fraction of relevant instances among
the total retrieved instances.
        Precision answers the following question: what proportion of predicted Positives is truly Positive? 
        Precision = (TP)/(TP+FP)</td>
  </tr>
  <tr>
    <td>Recall</td>
    <td>Refers to the fraction of relevant instances retrieved over the total amount of
relevant instances. Recall answers a different question: what proportion of actual Positives is correctly classified?
Recall = (TP)/(TP+FN)</td>
  </tr>
  <tr>
    <td>F1</td>
    <td>F1 score, defined as the harmonic mean of precision and recall; it is a number between 0 and 1. The F1 score balances the precision and recall of your classifier.
                      F1 = 2 * (precision * recall) / (precision + recall)</td>
  </tr>
  <tr>
    <td>Support</td>
    <td>The number of times a label displays in the Observation Column.</td>
  </tr>
</table>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative</p>
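<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The formulas in the table can be checked with a few lines of Python. The counts used here are hypothetical and are not taken from the model output.</p>

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 4 true positives, 1 false positive, 1 false negative
precision, recall, f1 = precision_recall_f1(tp=4, fp=1, fn=1)
```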

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Generalized Linear Model (GLM)</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.1 - Train a GLM Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/search/all?query=TD_GLM&content-lang=en-US'>Generalized Linear Model (GLM)</a> is an extension of the linear regression model that enables the linear equation to relate to the dependent variables by a link function. The GLM function supports several distribution families and associated link functions. </p>
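<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To illustrate what a link function does, the sketch below pairs each family with a link and its inverse, assuming the canonical choices (logit for Binomial, identity for Gaussian). The <code>FAMILIES</code> table and <code>mean_response</code> helper are hypothetical and are not part of the GLM function's interface.</p>

```python
import math

def logit(p):
    """Link for the Binomial family: maps a probability to the real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps the linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def identity(x):
    """Link (and inverse link) for the Gaussian family."""
    return x

# Assumed canonical (link, inverse link) pair per family
FAMILIES = {
    "Binomial": (logit, inverse_logit),
    "Gaussian": (identity, identity),
}

def mean_response(family, eta):
    """Expected response for a linear predictor eta under the given family."""
    _, inverse_link = FAMILIES[family]
    return inverse_link(eta)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With the Binomial family the model output is a probability between 0 and 1, which is why it suits the 0/1 target in this notebook.</p>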

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function takes the training data as input, as well as the following function parameters:</p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>InputColumns: list or range of columns used as features (here, an ordinal reference to columns 1:10, the ten selected features)</li>
        <li>ResponseColumn: the dependent or target value (here, the "target" column)</li>
        <li>Family: either Binomial or Gaussian</li>
        <li>Other hyperparameter values detailed in the documentation</li>
    </ul>
        


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We use the GLM function to create the GLM model using the train dataset.</p>

In [None]:
from teradataml import GLM, TDGLMPredict

glm_model = GLM(data = DataFrame('"demo_user"."pd_speech_features_train"'),
                input_columns = ['1:10'], 
                response_column = 'target',
                learning_rate = 'OPTIMAL',
                terms = ['id','target'],
                momentum = 0.0,
                family = 'Binomial')

glm_model.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The GLM function creates various output predictors and values based on the above parameters passed in the query</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function output is a trained GLM model which can be input to the TDGLMPredict function
for prediction. The model also contains model statistics of MSE, Loglikelihood, AIC, and BIC.
Further model evaluation can be done as a post-processing step using functions such as
TD_RegressionEvaluator,TD_ClassificationEvaluator and TD_ROC.</p>
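<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For reference, AIC and BIC are both derived from the log-likelihood and penalize the number of fitted parameters. The sketch below uses the standard textbook definitions with hypothetical inputs; the values reported by the in-database function may be computed with different conventions.</p>

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_rows):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return num_params * math.log(num_rows) - 2 * log_likelihood

# Hypothetical model: 10 coefficients plus an intercept, fitted on 100 rows
model_aic = aic(log_likelihood=-50.0, num_params=11)
model_bic = bic(log_likelihood=-50.0, num_params=11, num_rows=100)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Lower values indicate a better trade-off between fit and complexity when comparing models trained on the same data.</p>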


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.2 - Evaluate the Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Execute a testing prediction using the split data above.  Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Execute <a href = 'https://docs.teradata.com/search/all?query=TDGLMPredict&content-lang=en-US'>TDGLMPredict</a> using the model built above</li>
    <li>Execute <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> and pass the actual classification and the predicted value</li>
</ol>

In [None]:
import teradataml
from teradataml import GLM, TDGLMPredict
obj = TDGLMPredict(newdata = DataFrame('"demo_user"."pd_speech_features_test"'),
                           id_column = 'id',
                           object = glm_model.result,
                           accumulate = 'target',
                           output_prob=True,
                           output_responses = ['0', '1'],
                           terms='target')

obj.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TDGLMPredict function predicts target values (regression) or class labels (classification) for test data using a GLM model trained by the GLM function. As with GLM, input features should be standardized before use, for example with ScaleFit and ScaleTransform. The function accepts only numeric features, so categorical
features must be converted to numeric values before prediction.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Rows with missing (null) values are skipped by the function during prediction. To evaluate the prediction results, you can use the TD_RegressionEvaluator, TD_ClassificationEvaluator, or TD_ROC function as a
post-processing step.</p>
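
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The standardization and null-handling points above can be illustrated with a small local sketch. It is plain Python, not the ScaleFit or ScaleTransform functions: the helper names are hypothetical, and the use of the population standard deviation is an assumption. The key idea is that scaling statistics come from the training rows and are then reused on the test rows.</p>

```python
import statistics

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Apply z-score scaling; rows containing None are skipped."""
    scaled = []
    for row in rows:
        if any(value is None for value in row):
            continue  # mimic skipping rows with missing values
        scaled.append([
            (value - m) / s if s > 0 else 0.0
            for value, m, s in zip(row, means, stds)
        ])
    return scaled

# Toy data for illustration only
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0], [None, 15.0]]

means, stds = fit_scaler(train)
scaled_test = transform(test, means, stds)
print(scaled_test)
```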


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.3 - Use Classification Evaluator for TDGLMPredict</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Evaluate the model by creating a confusion matrix with the <a href = 'https://docs.teradata.com/search/all?query=TD_ClassificationEvaluator&content-lang=en-US'>TD_ClassificationEvaluator</a> SQL Function.</p>



<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a confusion matrix for the GLM predictions.</p>

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
df = obj.result.to_pandas()
cm = confusion_matrix(df['target'], df['prediction'], normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['DoesNotHaveParkinson', 'HasParkinson'])
cmd.plot()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The confusion matrix above compares the actual classes with the classes predicted by GLM, for people with and without Parkinson's disease. Because it is normalized over all rows, each cell shows a share of the full test set.</p>
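
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers who want to see what the normalized matrix contains, here is a minimal pure-Python sketch of the same calculation. The label lists are toy values, not the notebook's predictions, and the helper names are hypothetical. Rows correspond to actual classes and columns to predicted classes, as in scikit-learn.</p>

```python
def confusion_counts(actual, predicted):
    """Return a 2x2 matrix [[TN, FP], [FN, TP]] for 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def normalize_all(matrix):
    """Divide every cell by the total count, like normalize='all'."""
    total = sum(sum(row) for row in matrix)
    return [[cell / total for cell in row] for row in matrix]

# Toy labels for illustration only
actual = [0, 0, 1, 1, 1, 0, 1, 1]
predicted = [0, 1, 1, 1, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
shares = normalize_all(counts)
print(counts)
print(shares)
```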


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_ClassificationEvaluator requires the prediction and class columns to have the same data type, so we create another table in which these columns share one data type.</p>

In [None]:
from teradataml import ConvertTo
glm_predicted_data = ConvertTo(data = obj.result,
                           target_columns = ['id','target', "prediction",'prob_0','prob_1'],
                           target_datatype = ["INTEGER","INTEGER","INTEGER","INTEGER","INTEGER"])

In [None]:
ClassificationEvaluator_obj_glm = ClassificationEvaluator(data=glm_predicted_data.result,
                                                          observation_column='target',
                                                          prediction_column='prediction',
                                                          labels=['0','1'])

In [None]:
glm_metrics = ClassificationEvaluator_obj_glm.output_data
glm_metrics
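
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To help interpret metrics such as Accuracy, Precision, and Recall, the sketch below applies their standard definitions for the positive class to a set of hypothetical confusion counts. The counts are not results from this notebook, and the evaluator may report additional or differently averaged metrics.</p>

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary metrics for the positive class from confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only
example_metrics = binary_metrics(tn=20, fp=5, fn=4, tp=30)
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
```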

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Comparison of the Metrics Generated by the Two Models: Decision Forest vs GLM</b></p>

In [None]:
df_metrics = df_metrics.assign(model='DecisionForest')
glm_metrics = glm_metrics.assign(model='GLM')
df_union = df_metrics.concat(glm_metrics)

In [None]:
import warnings

import pandas as pd
from matplotlib import pyplot as plt

warnings.simplefilter(action='ignore', category=FutureWarning)

df_chart = df_union.to_pandas()
# Remove null characters from the metric names
df_chart['Metric'] = df_chart['Metric'].str.replace('\x00', '')
# One row per metric, one column per model
df_pivot = pd.pivot_table(
    df_chart,
    values="MetricValue",
    index="Metric",
    columns="model"
)
ax = df_pivot.plot(kind='bar')
# Get a Matplotlib figure from the axes object for formatting purposes
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(12, 6)
# Change the axes labels
ax.set_xlabel("Metrics")
ax.set_ylabel("Metric Values")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We used two different models to train on and predict the data, and used the classification evaluator to evaluate and compare them. Training, prediction, and evaluation all used Teradata In-Database functions. Because this is sample data, result metrics such as Accuracy, Precision, and Recall may not be accurate for either model. Even so, the graph above suggests that in this case the GLM model, with an accuracy of 82%, performs better than DecisionForest, with an accuracy of 77%.</p>


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>

In [None]:
tables = ['pd_speech_features_train', 'pd_speech_features_test','additional_metrics_speech_test','df_predict_output',
          'glm_predict_output', 'additional_metrics_speech_test_glm','metric_union','DF_train','DF_Predict' ]

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore failures, e.g. when the table does not exist
        pass

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ParkinsonsDisease');" 
#Takes 45 seconds

In [None]:
remove_context()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If you have updated the teradataml package, reinstall the pinned package versions by running the code cell below.</p>

In [None]:
%%capture
!pip install teradataml==17.20.0.6 --force-reinstall
!pip install numpy==1.24.2 --force-reinstall

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023,2024. All Rights Reserved
        </div>
    </div>
</footer>