<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       ClearScape Analytic Functions for Linear Regression, Numeric Feature Transformation and Selection
  <br>
       <img id="teradata-logo" src="../../images/TeradataLogo.png" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<hr>

<br>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Demonstration of Native Numeric feature processing and Linear Regression workflow</b>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the typical process for creating Machine Learning models, a significant amount of time is spent on data preparation and feature selection.  Furthermore, these manipulations must be replicated in operations for effective deployment of any model into production.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following demonstration illustrates native <b style = 'color:#00b2b1'>ClearScape Analytics</b> functions that provide greater efficiency, ease of use, and the ability to process data at extreme scale when selecting and processing numeric features.  The demonstration then uses the prepared data set as input to a Linear Regression model training and evaluation process.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data for this demonstration consists of a Home Sales Price data set, which includes many numeric and non-numeric features.  The steps in this demo are as follows:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create an Analytic Data Set consisting of only numeric columns with all NULL values filled in, and values rescaled between 0 and 1</li>
    <li>Take the prepared data as input to a Linear Regression Model</li>
    <li>Score and evaluate model accuracy against a set of Testing data.</li>
    </ol>
    
<img src = 'Flow_Diagram_Regression.png' width = 100%>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 1 - Create a dense, numeric-only data set</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The raw data consists of 82 columns, 43 of which are non-numeric.  Additionally, some of the numeric columns contain NULL values, which need to be filled in for the algorithm to work properly.  <b>Note:</b> it is possible to convert the non-numeric columns to numeric values using other SQL functions, but this demonstration shows how to remove them.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the rows of the table</li>
    <li>Remove non-numeric columns using ANTISELECT</li>
    <li>Discover any missing values and columns with missing values</li>
    <li>Convert FLOAT Columns to INTEGER to prepare for imputation</li>
    <li>Use SimpleImputeFit/SimpleImputeTransform to fill NULL values</li>
    <li>Use ScaleFit/ScaleTransform to rescale the data</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Imports and Connection</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Import required packages and create a connection context to Vantage.</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import json
from teradataml import *

# teradataml's display options object is only defined after the import above
display.suppress_vantage_runtime_warnings = True



from IPython.display import display as ipydisplay

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# load vars json
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

eng = create_context(host=host, username=username, password=password)


# confirm connection
print(eng)

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.1 - Inspect the Data</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a Virtual DataFrame that represents the data in VantageCloud Lake OFS Storage</p>

In [None]:
tdf_housing = DataFrame('"demo_ofs"."housing_prices_full"')

In [None]:
ipydisplay(tdf_housing.shape)
ipydisplay(tdf_housing.head(5))

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.2 - Remove specific columns using ANTISELECT</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ANTISELECT takes a list of column names, column ordinals, or ranges of names/ordinals to exclude, and returns all remaining columns.</p>

In [None]:
from teradataml import Antiselect

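# Build the exclusion list from the (column name, type) pairs of the DataFrame:
# columns whose Python type is 'str' are the non-numeric ones.
# Note: '_column_names_and_types' is a private attribute of the dtypes object.
non_numeric_columns = [key for key, value in tdf_housing.dtypes.__dict__['_column_names_and_types'] if value == 'str']

as_res = Antiselect(data = tdf_housing, exclude = non_numeric_columns)
as_res.result.head(5)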

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.3 - Find missing values and columns</b>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Data-Exploration-Functions/TD_GetRowsWithMissingValues'>GetRowsWithMissingValues</a> can be used to find all rows that contain NULLs, optionally restricted to a set of target columns</li>
    <li><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Data-Exploration-Functions/TD_ColumnSummary'>ColumnSummary</a> offers a more detailed set of statistics on selected or all columns </li>
    </ul>

In [None]:
from teradataml import GetRowsWithMissingValues, ColumnSummary

GetRowsWithMissingValues(data = as_res.result).result

In [None]:
tdf_cs = ColumnSummary(data = as_res.result, target_columns = as_res.result.columns).result
tdf_cs[tdf_cs['NullCount'] > 0]

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.4 - Convert FLOAT to INTEGER</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Data-Cleaning-Functions/TD_ConvertTo'>ConvertTo</a> converts the specified input table columns to specified data types.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In order for SimpleImpute to work properly for MODE replacement of values (the example which follows this one), data types need to be one of CHAR, VARCHAR, BYTEINT, SMALLINT, or INTEGER.  <b>ConvertTo</b> will take selected columns and convert them to the target type.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Rerun ColumnSummary to verify the type change</p>

In [None]:
from teradataml import ConvertTo

res_cv = ConvertTo(data = as_res.result, target_columns = 'masvnrarea', target_datatype = 'integer')

tdf_cs = ColumnSummary(data = res_cv.result, target_columns = res_cv.result.columns).result
tdf_cs[tdf_cs['NullCount'] > 0]

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.5 - Impute Missing Values</b>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Data-Cleaning-Functions/TD_SimpleImputeFit'>SimpleImputeFit</a> will output a table with the values that will be used to substitute the missing values</li>
    <li><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Data-Cleaning-Functions/TD_SimpleImputeTransform'>SimpleImputeTransform</a> will return the input data set with the missing values filled in</li>
    <li>Verify the NULL values have been removed</li>
    </ul>
 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note one can also use the Fit table as input to <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a></p>

In [None]:
from teradataml import SimpleImputeFit, SimpleImputeTransform, ScaleFit, ColumnTransformer

si_fit = SimpleImputeFit(data = res_cv.result, 
                         stats_columns = ['lotfrontage', 'masvnrarea', 'garageyrblt'],
                         stats = ['mean','mode', 'mean'])

si_trns = SimpleImputeTransform(data = res_cv.result, object = si_fit.output)

si_trns.result.head(5)

In [None]:
# Re-run GetRowsWithMissingValues Function - verify no results

GetRowsWithMissingValues(data = si_trns.result).result
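
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the fit/transform split concrete, the following is a minimal local sketch in plain Python of the same idea: a fit step that computes one fill value per column (mean or mode) and a transform step that applies it.  The helper names fit_imputer and transform_imputer and the toy rows are hypothetical; this is not the in-database implementation, only an illustration of the pattern.</p>

```python
from statistics import mean, mode

def fit_imputer(rows, strategies):
    """Compute one fill value per column from its non-missing entries."""
    fill = {}
    for col, how in strategies.items():
        present = [r[col] for r in rows if r[col] is not None]
        fill[col] = mean(present) if how == "mean" else mode(present)
    return fill

def transform_imputer(rows, fill):
    """Return copies of the rows with None replaced by the fitted value."""
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]

# Hypothetical toy rows; None plays the role of a SQL NULL.
rows = [
    {"lotfrontage": 60.0, "masvnrarea": 0},
    {"lotfrontage": None, "masvnrarea": 120},
    {"lotfrontage": 80.0, "masvnrarea": None},
    {"lotfrontage": 70.0, "masvnrarea": 0},
]

fill = fit_imputer(rows, {"lotfrontage": "mean", "masvnrarea": "mode"})
filled = transform_imputer(rows, fill)
print(fill)
print(filled)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping the fill values in a separate fit result is what allows the same substitutions to be replayed on new data at scoring time.</p>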

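<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.6 - Rescale the Data</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ScaleFit computes the scaling statistics for the target columns and ScaleTransform applies them.  The RESCALE method with lb=0 and ub=1 maps each target column onto the range 0 to 1; the id and saleprice columns are passed through unchanged.</p>

In [None]: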
from teradataml import ScaleFit, ScaleTransform

sf_fit = ScaleFit(data = si_trns.result, scale_method = 'RESCALE (lb=0, ub=1)',
                     target_columns = ['1:36'])

sf_trns = ScaleTransform(data = si_trns.result, object = sf_fit.output, accumulate = ['id','saleprice'])
sf_trns.result.head(5)
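
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an illustration of what rescaling between a lower bound of 0 and an upper bound of 1 does to a single column, here is a small plain-Python sketch of min-max rescaling.  The function names, the sample values, and the handling of a constant column are assumptions made for this example and are not taken from the ScaleFit documentation.</p>

```python
def fit_rescale(values, lb=0.0, ub=1.0):
    """Record the column minimum and maximum together with the target bounds."""
    return {"min": min(values), "max": max(values), "lb": lb, "ub": ub}

def transform_rescale(values, params):
    """Map each value linearly from [min, max] onto [lb, ub]."""
    span = params["max"] - params["min"]
    if span == 0:
        # Assumption for this sketch: a constant column maps to the lower bound.
        return [params["lb"] for _ in values]
    width = params["ub"] - params["lb"]
    return [(v - params["min"]) / span * width + params["lb"] for v in values]

# Hypothetical lot-area values.
lot_area = [8450, 9600, 11250, 14260]

params = fit_rescale(lot_area)
scaled = transform_rescale(lot_area, params)
print(scaled)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The smallest value maps to 0 and the largest to 1, so features with very different units end up on a comparable scale before training.</p>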

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 2 - Train the Linear Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The goal here is to use the numeric, dense data set as input to the model training and validation steps.  To do so, we must split the data into training and testing data sets.  This is done with the DataFrame sample method, which corresponds to the SQL SAMPLE clause.</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create Training and Testing data sets using SAMPLE</li>
    <li>Create the Linear Regression Model</li>
    </ol>

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>2.1 - Split data using SAMPLE</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Commit the data to permanent train/test tables.  This will also materialize all the selections, transformations and imputations above.</p>

In [None]:
# sample() adds a 'sampleid' column: 1 for the first fraction (20%), 2 for the second (80%)
tdf_samples = sf_trns.result.sample(frac = [0.2, 0.8])

# 80% of the rows become the training table, 20% the testing table
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'housing_train', schema_name = 'demo_ofs', if_exists = 'replace')
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'housing_test', schema_name = 'demo_ofs', if_exists = 'replace')
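
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The split above relies on each sampled row carrying a sampleid that identifies the fraction it was drawn for.  The sketch below imitates that labelling locally in plain Python so the train/test bookkeeping is easy to follow.  The function sample_split is hypothetical and says nothing about how Vantage draws its samples internally.</p>

```python
import random

def sample_split(row_ids, fracs, seed=0):
    """Assign each sampled row id a 1-based sampleid, one per requested fraction."""
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    labels, start = {}, 0
    for sampleid, frac in enumerate(fracs, start=1):
        count = round(frac * len(shuffled))
        for row_id in shuffled[start:start + count]:
            labels[row_id] = sampleid
        start += count
    return labels

# Hypothetical row ids 1..100, split 20% / 80% as in the cell above.
labels = sample_split(range(1, 101), [0.2, 0.8])

test_ids = [i for i, s in labels.items() if s == 1]   # sampleid 1: 20%
train_ids = [i for i, s in labels.items() if s == 2]  # sampleid 2: 80%
print(len(test_ids), len(train_ids))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the two groups are cut from one shuffled list, no row can land in both the training and the testing set.</p>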

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>2.2 - Train the Model</b>

        
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Training-Functions/TD_GLM'>TD_GLM</a> function fits a generalized linear model (GLM) for regression and classification analysis on data sets where the response follows an exponential family distribution.  It supports the following models:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>    
    <li>Regression (Gaussian family): The loss function is squared error.</li>
<li>Binary Classification (Binomial family): The loss function is logistic and implements logistic regression. The only response values are 0 or 1.</li>
    </ul>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function uses the Minibatch Stochastic Gradient Descent (SGD) algorithm, which is highly scalable for large data sets. The algorithm estimates the gradient of the loss on minibatches, whose size is set by the BatchSize argument, and updates the model at a rate set by the LearningRate argument.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function also supports the following approaches:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>L1, L2, and Elastic Net Regularization for shrinking model parameters</li>
    <li>Accelerated learning using Momentum and Nesterov approaches</li>
    </ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function uses a combination of IterNumNoChange and Tolerance arguments to define the convergence criterion and runs multiple iterations (up to the specified value in the MaxIterNum argument) until the algorithm meets the criterion.
The function also supports LocalSGD, a variant of SGD, that uses LocalSGDIterations on each AMP to run multiple batch iterations locally followed by a global iteration.</p>
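
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers who want to see the core idea in isolation, the following is a minimal plain-Python sketch of minibatch SGD for a Gaussian (squared-error) linear model.  It is an illustration under simplifying assumptions, not the TD_GLM implementation: it runs a fixed number of passes instead of a tolerance-based convergence check, and it omits regularization, momentum, and LocalSGD.  The function name, parameter names, and toy data are hypothetical.</p>

```python
import random

def minibatch_sgd(X, y, batch_size=4, learning_rate=0.3, max_iter=500, seed=0):
    """Fit y ~ w.x + b by minibatch SGD on the squared-error loss."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(max_iter):
        rng.shuffle(order)  # visit the rows in a new random order each pass
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:
                # Prediction error for row i under the current model
                err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(d):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            m = len(batch)
            # Step against the average gradient of the minibatch
            w = [wj - learning_rate * gj / m for wj, gj in zip(w, grad_w)]
            b -= learning_rate * grad_b / m
    return w, b

# Toy, noise-free data generated from y = 3 * x + 2 with x already in [0, 1].
X = [[i / 10] for i in range(11)]
y = [3 * row[0] + 2 for row in X]

w, b = minibatch_sgd(X, y)
print(w, b)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>On this toy data the fitted weight and intercept approach 3 and 2.  The inputs are already between 0 and 1, which mirrors why the features were rescaled in Step 1: gradient steps behave more predictably when all features share a similar range.</p>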

In [None]:
from teradataml import GLM, TDGLMPredict

glm_model = GLM(data = DataFrame('"demo_ofs"."housing_train"'),
                input_columns = '2:37', 
                response_column = 'saleprice')
glm_model.result.head(5)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 3 - Run the prediction and score results</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Execute a test prediction using the split data above.  Model accuracy is evaluated with <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator'>RegressionEvaluator</a>, which derives various accuracy metrics including <b>Mean Absolute Error (MAE)</b> and <b>Root Mean Squared Logarithmic Error (RMSLE)</b>.  MAE reports the error in the units of the target (price in dollars), while RMSLE reflects the ratio between predicted and actual values (e.g. 30 actual/40 predicted and 300 actual/400 predicted have nearly the same ratio-based error, but a 10x difference in absolute error).</p>


<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Execute TDGLMPredict using the model built above</li>
    <li>Execute RegressionEvaluator</li>
    </ol>
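
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The contrast between MAE and RMSLE described above can be checked with a few lines of plain Python, using the same 30/40 and 300/400 pairs.  The RMSLE here uses the common log(1 + value) definition; that RegressionEvaluator computes it in exactly this form is an assumption of this sketch.</p>

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average absolute difference, in target units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error using log(1 + value)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# One prediction each: 30 actual / 40 predicted and 300 actual / 400 predicted.
mae_small, mae_large = mae([30.0], [40.0]), mae([300.0], [400.0])
rmsle_small, rmsle_large = rmsle([30.0], [40.0]), rmsle([300.0], [400.0])

print(mae_small, mae_large)
print(round(rmsle_small, 3), round(rmsle_large, 3))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The two MAE values differ by a factor of ten, while the two RMSLE values are almost identical, which is why RMSLE is useful when relative error matters more than the dollar amount.</p>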

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>3.1 - Run the Prediction Function</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Call <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Scoring-Functions/TD_GLMPredict'>TDGLMPredict</a> on the testing data that was split above, passing the trained model as the object input.  Also pass the row identifier column, and accumulate the actual sale price so that it appears alongside each prediction.</p>

In [None]:
glm_prediction = TDGLMPredict(newdata = DataFrame('"demo_ofs"."housing_test"'),
                           id_column = 'id',
                           object = glm_model.result,
                           accumulate = 'saleprice')
  
glm_prediction.result.head(5)

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>3.2 - Calculate Model Accuracy</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>RegressionEvaluator will calculate multiple evaluation metrics.  See the documentation for a full list of available metrics and their meaning.</p>

In [None]:
from teradataml import RegressionEvaluator

re_result = RegressionEvaluator(data = glm_prediction.result, 
                                observation_column = 'saleprice', 
                                prediction_column = 'prediction', 
                                metrics = ['MAE', 'RMSLE','MSE', 'MSLE', 'MAPE', 'MPE','RMSE','MPD','MGD', 'EV'])

In [None]:
re_result.result

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Simple Plotting to visualize predictions</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Retrieve a subset of rows and plot the prediction vs. actual sale price</p>

In [None]:
df_prediction = glm_prediction.result.to_pandas(num_rows = 20)
df_prediction.set_index('id', drop = True).plot(kind = 'bar', figsize = (10,10));

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Clean Up</b>

In [None]:
db_drop_table('housing_train', schema_name = 'demo_ofs')
db_drop_table('housing_test', schema_name = 'demo_ofs')

In [None]:
remove_context()