<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
    ClearScape Analytics in-database functions for ML and AI pipelines
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<hr style="height:2px;border:none;background-color:#00233C;">

<br>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Demonstration of native functions for operationalizing ML/AI and advanced analytics at scale</b>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the typical process for creating Machine Learning models, a significant amount of time is spent on data preparation and feature selection.  Then, to deploy these models in production, the typical process is to <b>re-write</b> these pipelines for operational use.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata <b>ClearScape Analytics</b> functions not only provide greater efficiency, ease of use, and scalability during development, but can also be deployed into production with minimal refactoring.  As new data arrives for processing, transformation, and scoring in real time or in batch, workload management helps keep performance within SLAs as concurrency and data volume grow.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This demonstration illustrates example functions for every step of the <b>development</b> process, and shows how to deploy the same analytics into <b>production</b> with minimal modifications, allowing organizations to democratize access to advanced analytics, Machine Learning, and AI.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Exploratory Data Analysis</b> to identify patterns and overall usefulness of the data set</li>
    <li><b>Data Preparation</b> cleansing data, removing outliers, rescaling</li>
    <li><b>Feature Engineering</b> transform raw data into usable features for training and prediction</li>
    <li><b>Model Building</b> leverage the massive scale of the Teradata MPP engine to train deep or wide predictive models</li>
    <li><b>Model Evaluation</b> measure model efficacy at scale</li>
    <li><b>Operations</b> seamlessly deploy and manage in production</li>
    </ol>
 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data for this demonstration is a home sales price data set, which includes many numeric and non-numeric features.  It is a useful example because it needs some cleansing and preparation before predictive models can be built.</p>






In [None]:
%connect unlimited

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Access and inspect data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata AI Unlimited can transparently <b>read</b> and <b>write</b> data from various third-party catalogs such as <b>AWS Glue</b>, <b>Azure OneLake</b>, etc.  To access this data, perform the following steps:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create an Authorization object for secured access</li>
    <li>Create a "DATALAKE" object that points to the catalog</li>
    <li>Use SQL syntax to inspect the catalog, database, and table metadata</li>
    </ol>


In [None]:
/*First, we'll create an authorization to establish the credentials for AWS resources */
REPLACE AUTHORIZATION unlimited.glue_auth_aws
USER 'EXAMPLE'
PASSWORD 'EXAMPLE';

In [None]:
/*Second we create the connectivity to the Iceberg Glue Data Lake*/

REPLACE DATALAKE aws_glue_catalog
EXTERNAL SECURITY CATALOG unlimited.glue_auth_aws,
EXTERNAL SECURITY STORAGE unlimited.glue_auth_aws
USING
catalog_type ('glue')
storage_region ('us-west-2')
TABLE FORMAT iceberg;

In [None]:
HELP DATALAKE aws_glue_catalog;

In [None]:
HELP DATABASE aws_glue_catalog.tddemos_glue_db;

In [None]:
HELP TABLE aws_glue_catalog.tddemos_glue_db.customer_journey;
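
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Once the metadata looks right, the table can be read with ordinary SQL.  The query below is a minimal sketch.  It assumes that the same three-part name (datalake, database, table) used in the <b>HELP TABLE</b> statement above is also accepted in a <b>SELECT</b>, and that the supplied credentials allow reads.</p>

In [None]:
/* Sketch: preview a few rows from the catalog table inspected above */
SELECT TOP 5 * FROM aws_glue_catalog.tddemos_glue_db.customer_journey;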

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 1 - Exploratory Data Analysis</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Once data gets above a certain scale, traditional approaches to EDA become cumbersome - copying data to client-side tools is time and resource intensive, and writing traditional SQL to calculate things like data distribution, percentiles, or other statistics can be complex.  Teradata Vantage has both simplified and optimized this process by providing built-in EDA functions that include capabilities for analyzing these patterns, such as</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Distributions</li>
                <br>
                <li>Univariate Statistics</li>
                <br>
                <li>Categoric Summaries</li>
                <br>
                <li>NULLs and missing data</li>
            </ol>
        </td>
        <td><img src = 'images/EDA_Gen.jpeg' width = '250'></td>
    </tr>
</table>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The raw data consists of 82 columns, 43 of which are non-numeric.</p>

In [None]:
SELECT TOP 10 * FROM demo_ofs.housing_prices_full

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For the demonstration, create a view which represents a subset of the data.</p>

In [None]:
REPLACE VIEW pricesV AS (
    SELECT id, lotfrontage, masvnrarea, alley, electrical, _1stflrsf, _2ndflrsf, saleprice
    FROM demo_ofs.housing_prices_full)

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>TD_Histogram</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_Histogram calculates the frequency distribution of a data set using one of several binning methods.  Multiple columns can be analyzed in a single call.</p>

In [None]:
SELECT * FROM TD_Histogram(
    ON pricesV as InputTable
USING
TargetColumn('lotfrontage')
MethodType('STURGES')
) as dt
ORDER BY 2;

In [None]:
%chart y=CountOfValues, x=Label, title="Simple Histogram"
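
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition about the <b>STURGES</b> method: the textbook Sturges rule chooses roughly 1 + log2(n) bins for n non-NULL values.  The query below is a sketch of that arithmetic in plain SQL.  The exact rounding that TD_Histogram applies is not shown in this notebook and may differ, so treat the result as an approximation of the bin count returned above.</p>

In [None]:
/* Sketch: textbook Sturges bin count, 1 + ceil(log2(n)).
   COUNT(lotfrontage) ignores NULLs; NULLIF guards against LN(0) on an empty column. */
SELECT COUNT(lotfrontage) AS n_values,
       1 + CEILING(LN(NULLIF(COUNT(lotfrontage), 0)) / LN(2.0)) AS sturges_bins
FROM pricesV;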

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>TD_ColumnSummary</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_ColumnSummary returns summary information for every column in the data set, such as counts of NULL, zero, and blank values.</p>

In [None]:
SELECT * FROM TD_ColumnSummary (
  ON pricesV AS InputTable
  USING
  TargetColumns ('[:]')
) AS d
ORDER BY 1;
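
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For comparison, the same NULL check written by hand needs one expression per column.  The sketch below covers only two of the columns in the view; the output column names are illustrative.</p>

In [None]:
/* Sketch: NULL counts for two columns, written by hand.
   COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the number of NULLs. */
SELECT COUNT(*) AS total_rows,
       COUNT(*) - COUNT(lotfrontage) AS lotfrontage_nulls,
       COUNT(*) - COUNT(alley) AS alley_nulls
FROM pricesV;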

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 2 - Data Preparation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Once the user understands the overall patterns in the data, they can begin to clean and prepare it for the analytic process.  As above, native functions are available to simplify and optimize this process.  Furthermore, these functions use a <b>fit and transform</b> approach, which will assist in re-use for <b>operations</b>.  Data preparation and cleansing functions include</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Type conversions</li>
                <br>
                <li>Value imputation</li>
                <br>
                <li>Futile columns identification</li>
                <br>
                <li>Outlier removal</li>
            </ol>
        </td>
        <td><img src = 'images/Cleansing_Gen.jpeg' width = '250'></td>
    </tr>
</table>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Simple Imputer</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The SimpleImpute Fit/Transform functions replace missing values in a data set with statistical (mean, median, mode) or literal values.  Note that the fit table can be persisted and re-used as part of the operational pipeline.</p>

In [None]:
SELECT * FROM TD_ColumnSummary (
  ON pricesV AS InputTable
  USING
  TargetColumns ('[:]')
) AS d
WHERE NullCount > 0
ORDER BY 1;

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Fit the model</b></p>

In [None]:
-- fit the SimpleImpute function on the train set
SELECT * FROM TD_SimpleImputeFit (
    ON pricesV as InputTable
    USING
    ColsForStats('lotfrontage', 'masvnrarea','alley', 'electrical')
    Stats('mean','median', 'mode', 'mode')
) as dt

In [None]:
DROP TABLE SI_FIT;

In [None]:
CREATE TABLE SI_FIT AS (
    SELECT * FROM TD_SimpleImputeFit (
        ON pricesV as InputTable
        USING
        ColsForStats('lotfrontage', 'masvnrarea','alley', 'electrical')
        Stats('mean','median', 'mode', 'mode')
    ) as dt
) WITH DATA;

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Transform the data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Pass the transformed data to ColumnSummary again to check for NULLs and blanks</p>

In [None]:
WITH si_transform AS (
    SELECT * FROM TD_SimpleImputeTransform (
  ON pricesV AS InputTable
  ON SI_FIT AS FitTable DIMENSION
) AS d)

SELECT * FROM TD_ColumnSummary (
  ON si_transform AS InputTable
  USING
  TargetColumns ('[:]')
) AS d
ORDER BY 1;
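
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the fit and transform split concrete, the sketch below performs mean imputation for a single column in plain SQL.  The inner query plays the role of the fit step (it computes the statistic) and <b>COALESCE</b> plays the role of the transform step.  The alias names are illustrative, and the built-in functions may treat data types and rounding differently.</p>

In [None]:
/* Sketch: hand-written mean imputation for lotfrontage only */
SELECT TOP 10 p.id,
       p.lotfrontage AS lotfrontage_raw,
       COALESCE(p.lotfrontage, s.mean_lotfrontage) AS lotfrontage_imputed
FROM pricesV AS p
CROSS JOIN (
    /* "fit": compute the replacement value once */
    SELECT AVG(lotfrontage) AS mean_lotfrontage
    FROM pricesV
) AS s;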

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Outlier Removal</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_OutlierFilterFit and TD_OutlierFilterTransform remove statistical outliers from the data set.</p>

In [None]:
SELECT * FROM TD_UnivariateStatistics (
  ON pricesV AS InputTable
  USING
  TargetColumns ('lotfrontage')
  Stats ('MEAN', 'MEDIAN', 'MODE', 'PRC')
) AS dt;

In [None]:
SELECT * FROM TD_OutlierFilterFit (
  ON pricesV AS InputTable
  USING
  TargetColumns ('lotfrontage')
  UpperPercentile (0.98)
  OutlierMethod ('Percentile')
 ) AS dt;

In [None]:
DROP TABLE OF_FIT

In [None]:
CREATE TABLE OF_FIT AS (
    SELECT * FROM TD_OutlierFilterFit (
      ON pricesV AS InputTable
    USING
      TargetColumns ('lotfrontage')
      UpperPercentile (0.98)
      OutlierMethod ('Percentile')
 ) AS dt) WITH DATA;

In [None]:
SELECT TOP 10 * FROM TD_OutlierFilterTransform (
  ON pricesV AS InputTable PARTITION BY ANY
  ON OF_FIT AS FitTable DIMENSION
) AS dt;
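
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough cross-check, the upper bound can be estimated by hand with <b>PERCENTILE_CONT</b>.  This is a sketch only.  The interpolation used by TD_OutlierFilterFit may differ, and the fit function may also apply a lower bound that this query ignores.</p>

In [None]:
/* Sketch: 98th-percentile upper bound for lotfrontage and the number of rows above it */
SELECT b.upper_bound,
       SUM(CASE WHEN p.lotfrontage > b.upper_bound THEN 1 ELSE 0 END) AS rows_above_bound
FROM pricesV AS p
CROSS JOIN (
    SELECT PERCENTILE_CONT(0.98) WITHIN GROUP (ORDER BY lotfrontage) AS upper_bound
    FROM pricesV
) AS b
GROUP BY b.upper_bound;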

In [None]:
WITH outlier_transform AS (
SELECT * FROM TD_OutlierFilterTransform (
  ON pricesV AS InputTable PARTITION BY ANY
  ON OF_FIT AS FitTable DIMENSION
) AS d)

SELECT * FROM TD_Histogram(
    ON outlier_transform  as InputTable
USING
TargetColumn('lotfrontage')
MethodType('STURGES')
) as dt
ORDER BY 2;

In [None]:
%chart y=CountOfValues, x=Label, title="Simple Histogram"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Combine the Transformations</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ColumnTransformer will execute multiple transformations in a single pass</p>

In [None]:
REPLACE VIEW Cleansed_PricesV AS (
    SELECT * FROM TD_ColumnTransformer(
    ON pricesV AS inputtable
        
    ON SI_FIT AS SimpleImputeFitTable DIMENSION
    ON OF_FIT AS OutlierFilterFitTable DIMENSION
    )AS dt
)

In [None]:
SELECT TOP 5 * FROM Cleansed_PricesV;

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 3 - Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next step in the process is to create new features from the existing data set.  As above, the fit and transform process allows the user to create <b>reusable</b> objects for production.  Feature engineering functions include</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>One-hot encoding</li>
                <br>
                <li>Rescaling</li>
                <br>
                <li>Binning</li>
                <br>
                <li>Normalization</li>
                <br>
                <li>Ordinal Encoding</li>
                <br>
                <li>Function/Polynomial conversion</li>
            </ol>
        </td>
        <td><img src = 'images/FE_Gen.jpeg' width = '250'></td>
    </tr>
</table>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>One-Hot encoding</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Replace categorical values with dummy encodings: one column per category value, holding a binary flag that indicates whether the row has that value.</p>

In [None]:
SELECT * FROM TD_CategoricalSummary (
  ON Cleansed_PricesV AS InputTable
  USING
  TargetColumns ('alley', 'electrical')
) AS dt
ORDER BY 1;

In [None]:
SELECT * FROM TD_OneHotEncodingFit(
    ON Cleansed_PricesV AS INPUTTABLE
USING
    TargetColumn('alley','electrical')
    IsInputDense('true')
    CategoryCounts(2,5)
    Approach('Auto')
) AS dt;

In [None]:
DROP TABLE OHE_FIT;

In [None]:
CREATE TABLE OHE_FIT AS (
    SELECT * FROM TD_OneHotEncodingFit(
        ON Cleansed_PricesV AS INPUTTABLE
    USING
        TargetColumn('alley','electrical')
        IsInputDense('true')
        CategoryCounts(2,5)
        Approach('Auto')
    ) AS dt
) WITH DATA;

In [None]:
SELECT TOP 5 * FROM TD_OneHotEncodingTransform (
    ON Cleansed_PricesV AS InputTable
    ON OHE_FIT AS FitTable Dimension
    USING
        IsInputDense('True')
) AS dt 
ORDER BY 1;
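
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition, a single dummy column can be written by hand with a <b>CASE</b> expression.  The sketch below flags one electrical value, 'SBrkr', which is the value passed to the scoring examples later in this notebook.  The output column name is illustrative and does not match the names the encoder generates.</p>

In [None]:
/* Sketch: one hand-written indicator column for a single category value */
SELECT TOP 5 id,
       electrical,
       CASE WHEN electrical = 'SBrkr' THEN 1 ELSE 0 END AS electrical_is_sbrkr
FROM Cleansed_PricesV;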

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Rescaling</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Rescale all the numeric data to create a final analytic data set.  Create a fit table from the one-hot encoded data set</p>

In [None]:
DROP TABLE SF_FIT;

In [None]:
WITH ohe_transformed AS (
    SELECT * FROM TD_OneHotEncodingTransform (
        ON Cleansed_PricesV AS InputTable
        ON OHE_FIT AS FitTable Dimension
    USING
        IsInputDense('True')
) AS d)

SELECT * FROM TD_ScaleFit(
    ON ohe_transformed AS InputTable
    OUT TABLE OutputTable(SF_FIT)
    USING
        TargetColumns('[1:4]','[6:10]','[12:13]')
        ScaleMethod('range')
) as dt;
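
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The sketch below shows min-max scaling of one column by hand, (x - min) / (max - min), which is assumed here to be what <b>ScaleMethod('range')</b> computes.  The alias names are illustrative.</p>

In [None]:
/* Sketch: hand-written min-max scaling of lotfrontage.
   CAST to FLOAT avoids integer division; NULLIF avoids dividing by zero for a constant column. */
SELECT TOP 5 p.id,
       p.lotfrontage,
       (p.lotfrontage - s.min_val) / NULLIF(s.max_val - s.min_val, 0) AS lotfrontage_scaled
FROM Cleansed_PricesV AS p
CROSS JOIN (
    SELECT CAST(MIN(lotfrontage) AS FLOAT) AS min_val,
           CAST(MAX(lotfrontage) AS FLOAT) AS max_val
    FROM Cleansed_PricesV
) AS s;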

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Combine the Transformations</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ColumnTransformer will execute multiple transformations in a single pass.  This same pattern can be passed as-is to <b>new</b> data coming into the system - either real-time or prepared in batch.</p>

In [None]:
REPLACE VIEW Final_ADS_V AS (
    SELECT id, lotfrontage, masvnrarea, _1stflrsf, _2ndflrsf,
        alley_0, alley_1, electrical_0, electrical_1, electrical_2,
        electrical_3, electrical_4, saleprice
    FROM TD_ColumnTransformer(
        ON pricesV AS inputtable

        ON SI_FIT AS SimpleImputeFitTable DIMENSION
        ON OF_FIT AS OutlierFilterFitTable DIMENSION
        ON OHE_FIT AS OneHotEncodingFitTable DIMENSION
        ON SF_FIT AS ScaleFitTable DIMENSION
        
    )AS dt
)

In [None]:
SELECT TOP 10 * FROM Final_ADS_V

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 4 - Model training and scoring</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For illustration purposes, the ML model training and scoring processes are presented in a single vignette.  Model building functions include</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Test/Train split</li>
                <br>
                <li>GLM</li>
                <br>
                <li>XGBoost</li>
                <br>
                <li>Decision Trees</li>
                <br>
                <li>SVM</li>
                <br>
                <li>KMeans</li>
                <br>
                <li>Vector similarity</li>
            </ol>
        </td>
        <td><img src = 'images/Model_Gen.jpeg' width = '250'></td>
    </tr>
</table>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Split the data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Test/Train Split can rapidly create testing and training data sets. Additional functionality includes stratification by column, random seeding for repeatability, etc.</p>

In [None]:
SELECT TOP 5 * FROM TD_TrainTestSplit(
ON Final_ADS_V AS InputTable
USING
IDColumn('id')
trainSize(0.75)
testSize(0.25)
)As dt;
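
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A quick sanity check on the split is to count the rows on each side.  The sketch below reuses the call above and groups by the <b>TD_IsTrainRow</b> flag; the counts should land close to a 75/25 ratio.</p>

In [None]:
/* Sketch: row counts per side of the split (1 = train, 0 = test) */
SELECT TD_IsTrainRow, COUNT(*) AS row_count
FROM TD_TrainTestSplit(
    ON Final_ADS_V AS InputTable
    USING
    IDColumn('id')
    trainSize(0.75)
    testSize(0.25)
) AS dt
GROUP BY TD_IsTrainRow;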

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Train the model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this demonstration we will use a simple GLM (Generalized Linear Model) to predict housing sale price based on the features we've created.  The model is stored as a table, which assists in operationalization.</p>

In [None]:
WITH housing_train AS (
    SELECT * FROM TD_TrainTestSplit(
    ON Final_ADS_V AS InputTable
    USING
    IDColumn('id')
    Seed(42) -- fixed seed (arbitrary value) so every cell reproduces the same split
    trainSize(0.75)
    testSize(0.25)
    )AS d
    WHERE TD_IsTrainRow = 1
)

SELECT * from TD_GLM (
    ON housing_train AS InputTable
    USING
    InputColumns('[2:12]')
    ResponseColumn('saleprice')
    Family('Gaussian')
) AS dt


In [None]:
DROP TABLE Housing_Model;

In [None]:
CREATE TABLE Housing_Model AS (

WITH housing_train AS (
    SELECT * FROM TD_TrainTestSplit(
    ON Final_ADS_V AS InputTable
    USING
    IDColumn('id')
    Seed(42) -- fixed seed (arbitrary value) so every cell reproduces the same split
    trainSize(0.75)
    testSize(0.25)
    )AS d
    WHERE TD_IsTrainRow = 1
)

SELECT * from TD_GLM (
    ON housing_train AS InputTable
    USING
    InputColumns('[2:12]')
    ResponseColumn('saleprice')
    Family('Gaussian')
) AS dt
) WITH DATA;

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Test the model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Run the prediction against the testing data set (the rows where TD_IsTrainRow = 0).</p>

In [None]:
WITH housing_test AS (
    SELECT * FROM TD_TrainTestSplit(
    ON Final_ADS_V AS InputTable
    USING
    IDColumn('id')
    Seed(42) -- fixed seed (arbitrary value) so every cell reproduces the same split
    trainSize(0.75)
    testSize(0.25)
    )AS d
    WHERE TD_IsTrainRow = 0
)


SELECT TOP 10 * from TD_GLMPredict (
  ON housing_test AS INPUTTABLE
  ON Housing_Model AS ModelTable DIMENSION
  USING
  IDColumn ('id')
  Accumulate('saleprice')
) AS dt

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 5 - Model evaluation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In most real-world scenarios, it is untenable to copy all the testing data to the client for evaluation.  Vantage provides built-in evaluation functions to ascertain model efficacy at scale, including</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Regression evaluation</li>
                <br>
                <li>Classification evaluation</li>
                <br>
                <li>ROC curve</li>
                <br>
                <li>Silhouette</li>
            </ol>
        </td>
        <td><img src = 'images/Eval_Gen.jpeg' width = '250'></td>
    </tr>
</table>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Regression Evaluation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Evaluate the efficacy of our trained model on the testing data.  The RegressionEvaluator will calculate many metrics including root mean squared error (RMSE), mean absolute error, F-statistics, etc.</p>

In [None]:
DROP TABLE housing_prediction;

In [None]:
CREATE VOLATILE TABLE housing_prediction AS (

WITH housing_test AS (
    SELECT * FROM TD_TrainTestSplit(
    ON Final_ADS_V AS InputTable
    USING
    IDColumn('id')
    Seed(42) -- fixed seed (arbitrary value) so every cell reproduces the same split
    trainSize(0.75)
    testSize(0.25)
    )AS d
    WHERE TD_IsTrainRow = 0
)


SELECT * from TD_GLMPredict (
  ON housing_test AS INPUTTABLE
  ON Housing_Model AS ModelTable DIMENSION
  USING
  IDColumn ('id')
  Accumulate('saleprice')
) AS dt) WITH DATA
ON COMMIT PRESERVE ROWS;

In [None]:
SELECT * FROM TD_RegressionEvaluator(
    ON housing_prediction AS InputTable
    USING
    ObservationColumn('saleprice')
    PredictionColumn('prediction')
    Metrics('RMSE','MAE','MAPE')
) AS dtt;
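
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Two of these metrics are simple enough to verify by hand.  The sketch below computes RMSE and MAE directly from the <b>housing_prediction</b> table, using the same observation and prediction columns as the evaluator call.  The output column names are illustrative, and small floating-point differences from the evaluator output are possible.</p>

In [None]:
/* Sketch: RMSE = sqrt(mean(err^2)), MAE = mean(|err|), where err = observed - predicted */
SELECT SQRT(AVG(err * err)) AS rmse_by_hand,
       AVG(ABS(err)) AS mae_by_hand
FROM (
    SELECT CAST(saleprice AS FLOAT) - prediction AS err
    FROM housing_prediction
) AS e;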

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>In-database plotting</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata Vantage can also generate visualizations in-database, which allows for rapid visual analysis of the results.  Here, unpivot the data for easier multi-series plotting.</p>

In [None]:
REPLACE VIEW housing_unpivot_V AS (
SELECT * FROM TD_UNPIVOTING(
ON housing_prediction AS InputTable PARTITION BY ANY 
USING 
IDCOLUMN('id')
TARGETCOLUMNS ('prediction', 'saleprice')

INCLUDENULLS('true')
)AS dt);

In [None]:
SELECT TOP 5 * FROM housing_unpivot_V;

In [None]:
EXECUTE FUNCTION
TD_PLOT(
  SERIES_SPEC(
    TABLE_NAME(housing_unpivot_V),
    ROW_AXIS(SEQUENCE(id)),
    SERIES_ID(AttributeName),
    PAYLOAD (
      FIELDS(AttributeValue),
      CONTENT(REAL)
    )
  ),
  FUNC_PARAMS(
    TITLE('XY Plot'),
    PLOTS[(
      TYPE('line'),
      LEGEND('upper right'),
      YRANGE(0.0,400000.0),
      SERIES[
          (ID(1), NAME('prediction'), FORMAT('r-')),
          (ID(2), NAME('saleprice'), FORMAT('b--'))]
    )],
    WIDTH(1024),
    HEIGHT(768)
  )
);

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 6 - Operationalization</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With many traditional approaches to the ML and AI workflow, this is the point at which the <b>hard</b> work begins.  Developers must take all the steps they've performed in various tools and translate them into a design pattern that is <b>robust, repeatable, and performant</b>.  With Teradata Vantage, most of that translation is unnecessary: the fit tables, transformation expressions, and model table built in the previous steps can be reused as they are.  Multiple design patterns can be implemented, including</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Batch pipelines</li>
                <br>
                <li>Enterprise feature store</li>
                <br>
                <li>On-demand/near-realtime processing</li>
            </ol>
        </td>
        <td><img src = 'images/Pipeline_Gen.jpeg' width = '250'></td>
    </tr>
</table>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Batch pipelines</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Depending on the use case, it may be acceptable to prepare and/or evaluate data in batch.  All or part of the pipeline developed above can be implemented using ETL tools, workflow schedulers, or as part of a CI/CD-style process.  Note that the SQL used is relatively simple compared to writing these transformation and cleansing tasks in standard SQL.  This demonstration will re-use assets developed above, including:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Fit tables</b> which contain the metadata required to perform the cleansing and feature engineering tasks</li>
    <li><b>Column transformer</b> to execute complex cleansing and transformation tasks using a single expression</li>
    <li><b>Model tables</b>.  Models are stored in Vantage as tables - and can be treated as any other data asset, with versioning, DR, and the ability to manage and execute multiple versions or model types at will</li>
    </ul>


In [None]:
--Simulate new data coming in
REPLACE VIEW new_housing_dataV AS (
    SELECT id, lotfrontage, masvnrarea, alley, electrical, _1stflrsf, _2ndflrsf, saleprice
    FROM demo_ofs.housing_prices_full)

In [None]:
/*Copy the same transformation expression from above
Replace the inputtable clause with the new data */

SELECT TOP 10 id, lotfrontage, masvnrarea, _1stflrsf, _2ndflrsf,
    alley_0, alley_1, electrical_0, electrical_1, electrical_2,
    electrical_3, electrical_4, saleprice
FROM TD_ColumnTransformer(
    ON new_housing_dataV AS INPUTTABLE

    ON SI_FIT AS SimpleImputeFitTable DIMENSION
    ON OF_FIT AS OutlierFilterFitTable DIMENSION
    ON OHE_FIT AS OneHotEncodingFitTable DIMENSION
    ON SF_FIT AS ScaleFitTable DIMENSION

)AS dt


In [None]:
/* Add the GLMPredict Function
Using the existing model */

WITH new_data_transformed AS (
    SELECT TOP 10 id, lotfrontage, masvnrarea, _1stflrsf, _2ndflrsf,
        alley_0, alley_1, electrical_0, electrical_1, electrical_2,
        electrical_3, electrical_4, saleprice
    FROM TD_ColumnTransformer(
        ON new_housing_dataV AS INPUTTABLE

        ON SI_FIT AS SimpleImputeFitTable DIMENSION
        ON OF_FIT AS OutlierFilterFitTable DIMENSION
        ON OHE_FIT AS OneHotEncodingFitTable DIMENSION
        ON SF_FIT AS ScaleFitTable DIMENSION

    )AS d)

SELECT TOP 10 * from TD_GLMPredict (
  ON new_data_transformed AS INPUTTABLE
  ON Housing_Model AS ModelTable DIMENSION
  USING
  IDColumn ('id')
  Accumulate('saleprice')
) AS dt;
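
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For a scheduled batch job, one option is to wrap the transform-and-score expression in a view so that downstream tools only need a plain <b>SELECT</b>.  The sketch below assumes the fit tables and model table from the earlier steps still exist.  The view name <b>new_housing_scores_V</b> is hypothetical and is not used elsewhere in this notebook.</p>

In [None]:
/* Sketch: a view that transforms and scores whatever new_housing_dataV currently returns.
   new_housing_scores_V is a hypothetical name chosen for this example. */
REPLACE VIEW new_housing_scores_V AS (
    SELECT * FROM TD_GLMPredict (
      ON (
        SELECT id, lotfrontage, masvnrarea, _1stflrsf, _2ndflrsf,
            alley_0, alley_1, electrical_0, electrical_1, electrical_2,
            electrical_3, electrical_4, saleprice
        FROM TD_ColumnTransformer(
            ON new_housing_dataV AS INPUTTABLE

            ON SI_FIT AS SimpleImputeFitTable DIMENSION
            ON OF_FIT AS OutlierFilterFitTable DIMENSION
            ON OHE_FIT AS OneHotEncodingFitTable DIMENSION
            ON SF_FIT AS ScaleFitTable DIMENSION
        ) AS d
      ) AS INPUTTABLE
      ON Housing_Model AS ModelTable DIMENSION
      USING
      IDColumn ('id')
      Accumulate('saleprice')
    ) AS dt
);

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A scheduler could then read from this view on whatever cadence the use case requires.  If the view is created, drop it with <b>DROP VIEW new_housing_scores_V</b> before dropping the fit and model tables it depends on.</p>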

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Interactive pipeline</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Given the performance and scale of Teradata Vantage, a similar workflow can also score individual records on demand: instead of executing against a batch of rows, the input values are passed in directly.  Two common ways to do this are:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Dynamically-generated SQL</b> using external tools or applications</li>
    <li><b>Stored Procedures</b> to minimize complexity</li>
    </ul>

In [None]:
%var lf=50, ma=100, fl1=2000, fl2=1500

In [None]:
WITH input_data AS (
    SELECT 0 as id, ${lf} as lotfrontage, ${ma} as masvnrarea, '' as alley, 'SBrkr' as electrical, 
    ${fl1} as _1stflrsf, ${fl2} as _2ndflrsf),

new_data_transformed AS (
    SELECT id, lotfrontage, masvnrarea, _1stflrsf, _2ndflrsf,
        alley_0, alley_1, electrical_0, electrical_1, electrical_2,
        electrical_3, electrical_4 --, saleprice
    FROM TD_ColumnTransformer(
        ON input_data AS INPUTTABLE

        ON SI_FIT AS SimpleImputeFitTable DIMENSION
        ON OF_FIT AS OutlierFilterFitTable DIMENSION
        ON OHE_FIT AS OneHotEncodingFitTable DIMENSION
        ON SF_FIT AS ScaleFitTable DIMENSION

    )AS d)

SELECT * from TD_GLMPredict (
  ON new_data_transformed AS INPUTTABLE
  ON Housing_Model AS ModelTable DIMENSION
  USING
  IDColumn ('id')
) AS dt;

In [None]:
REPLACE PROCEDURE housing_price_prediction_sp
(   
    IN ip_lf BIGINT,
    IN ip_ma BIGINT,
    IN ip_al VARCHAR(30),
    IN ip_elec VARCHAR(30),
    IN ip_1st BIGINT,
    IN ip_2nd BIGINT
)
DYNAMIC RESULT SETS 1
BEGIN
    DECLARE SqlStr VARCHAR(2000);
    DECLARE rslt CURSOR WITH RETURN ONLY FOR stmt;
    
    SET SQLStr = 'SELECT prediction from TD_GLMPredict (
  ON (
        SELECT id, lotfrontage, masvnrarea, _1stflrsf, _2ndflrsf,
            alley_0, alley_1, electrical_0, electrical_1, electrical_2,
            electrical_3, electrical_4
        FROM TD_ColumnTransformer(
            ON (SELECT 0 as id, '|| ip_lf ||' as lotfrontage, '|| ip_ma ||' as masvnrarea, 
            '''|| ip_al ||''' as alley, '''|| ip_elec ||''' as electrical,
            '||ip_1st||' as _1stflrsf, '||ip_2nd||' as _2ndflrsf) AS INPUTTABLE

            ON SI_FIT AS SimpleImputeFitTable DIMENSION
            ON OF_FIT AS OutlierFilterFitTable DIMENSION
            ON OHE_FIT AS OneHotEncodingFitTable DIMENSION
            ON SF_FIT AS ScaleFitTable DIMENSION

        )AS d) AS INPUTTABLE
  ON Housing_Model AS ModelTable DIMENSION
  USING
  IDColumn (''id'')
) AS dt';
   PREPARE stmt FROM SqlStr;
   OPEN rslt;
END;

In [None]:
/* Pass in
- Lot Frontage
- Masonry Veneer Area
- alley
- electrical
- 1st floor footage
- 2nd floor footage
*/
CALL housing_price_prediction_sp(95, 100, '', 'SBrkr', 2992, 1770)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The preceding demonstration showed how Teradata Vantage <b>ClearScape Analytics</b> functions give analysts, developers, and data scientists a set of powerful tools that allow organizations to develop and <b>deliver</b> advanced analytic products rapidly into production, unlocking the value of innovation and next-generation AI and ML outcomes.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Clean Up</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up all objects created during this demonstration.</p>

In [None]:
DROP VIEW pricesV;

In [None]:
DROP TABLE SI_FIT;

In [None]:
DROP TABLE OF_FIT;

In [None]:
DROP TABLE OHE_FIT;

In [None]:
DROP TABLE SF_FIT;

In [None]:
DROP VIEW Cleansed_PricesV;

In [None]:
DROP VIEW Final_ADS_V;

In [None]:
DROP TABLE Housing_Model;

In [None]:
DROP VIEW housing_unpivot_V;

In [None]:
DROP VIEW new_housing_dataV;

In [None]:
DROP PROCEDURE housing_price_prediction_sp;

In [None]:
%disconnect unlimited