<header style="padding:10px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>Customer Segmentation with K-means Clustering and Data Preparation Pipelines</b></p>
</header>
<hr>


<br>

<b style = 'font-size:24px;font-family:Arial;color:#E37C4D'>Leverage native Vantage processing for efficient and highly scalable data preparation, model training, and evaluation workflows</b>

<p style = 'font-size:16px;font-family:Arial'>K-means clustering is one of the most popular <b>unsupervised</b> machine learning algorithms.  Essentially, the algorithm seeks to group similar data points together by minimizing the distance between each data point and the center (centroid) of its cluster, where each centroid is the mean (the "means" in K-means) of the points assigned to it.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial'>
                <li>Define the number of clusters (k)</li>
                <br>
                <li>The algorithm chooses random points as centroids</li>
                <br>
                <li>Each iteration attempts to optimize the centroid locations</li>
                <br>
                <li>Iterations end once the distances have stabilized or the max iteration count is reached</li>
            </ol>
        </td>
        <td><img src = 'images/K-means_convergence.gif' width = '250'></td>
    </tr>
</table>
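
<p style = 'font-size:16px;font-family:Arial'>The four steps above can be sketched in a few lines of plain Python.  This is only an illustration of the general algorithm, run outside Vantage; it is not how TD_KMeans is implemented.  The function name <code>kmeans</code>, its parameters, and the sample points are assumptions made for this sketch.</p>

```python
import math
import random


def kmeans(points, k, max_iter=100, stop_threshold=1e-4, seed=0):
    """Plain-Python K-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: choose k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(dim) / len(members) for dim in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once no centroid moved more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < stop_threshold:
            break
    return centroids, labels


# Hypothetical 2-D points forming two loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

<p style = 'font-size:16px;font-family:Arial'>Because the starting centroids are random, different seeds can yield different cluster numbering, and on harder data different final clusters.</p>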

<p style = 'font-size:16px;font-family:Arial'>One limitation of this algorithm is that it only accepts numeric data as feature input (categorical clustering can be performed using the K-modes algorithm).  Typically, data engineers or data scientists will perform multiple <b>serial</b> steps to prepare a numeric-only data set that can be passed to the K-means algorithm.</p>

<p style = 'font-size:16px;font-family:Arial'>Vantage provides native "Fit and Transform" functions to assist in data preparation and transformation at scale.  To aid in efficiency and operationalization, Vantage provides a bulk <b>Column Transformer</b> function which can take multiple transformation directives at the same time, and act on the whole data set at once.  This simplifies both the process and the code, allowing more streamlined and robust operational deployment.</p>

<img src = 'Flow_Diagram_KMeans.png' width = 100%>
<hr>

<b style = 'font-size:24px;font-family:Arial;color:#E37C4D'>Live Demonstration</b>

<p style = 'font-size:16px;font-family:Arial'>The data for this demonstration is based on an online purchase history data set, which can be found <a href = 'https://www.kaggle.com/code/hellbuoy/online-retail-k-means-hierarchical-clustering/data'>here</a>.  The goal is to segment the customers by purchase volume and value.  The steps are as follows:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial'>
                <li>Analyze the raw data, split a testing set</li>
                <br>
                <li>Engineer numeric features</li>
                <br>
                <li>Build the K-means model</li>
                <br>
                <li>Apply in-line transformation to the testing set</li>
                <br>
                <li>Make Predictions and evaluate model accuracy</li>
            </ol>
        </td>
        <td><img src = 'images/clustering_img.png' width = '250'></td>
    </tr>
</table>

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>1. Connect to Vantage and explore the dataset</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Let's start by connecting to the Teradata system </b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press Enter, then use the down arrow to go to the next cell.</p>

In [2]:
%connect local, hidewarnings=true

Password: ········


Success: 'local' connection established and activated for user 'demo_user', with default database 'demo_user'


<p style = 'font-size:16px;font-family:Arial'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [3]:
Set query_band='DEMO=Clustering_KMeans.ipynb;' update for session;

Success: 1 rows affected

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Because this demo uses a Temporal table, we create the databases and tables in local storage and use them in the notebook. Please execute the procedure in the next cell.</p>

In [5]:
--call get_data('DEMO_Retail_cloud');    -- takes about 20 seconds, estimated space: 0 MB
call get_data('DEMO_Retail_local');     -- takes about 35 seconds, estimated space: 11 MB

Success: 0 rows affected

Success: 0 rows affected

Unnamed: 0,Message
1,That ran for 0:00:27.89 with 20 statements and 4 errors.


<p style = 'font-size:16px;font-family:Arial'>Optional step – if you want to see status of databases/tables created and space used.</p>

In [6]:
call space_report();  -- optional, takes about 10 seconds

Success: 0 rows affected

Success: 0 rows affected

Unnamed: 0,Space_Report
1,"You have: #databases=3 #tables=2 #views=8 You have used 23.7 MB of 30,851.3 MB available - 0.1% ... Space Usage OK"
2,
3,Database Name #tables #views Avail MB Used MB
4,"demo_user 0 4 30,815.4 MB 0.7 MB"
5,DEMO_CreditCard 0 1 0.0 MB 0.0 MB
6,DEMO_Retail 0 3 0.0 MB 0.0 MB
7,DEMO_Retail_db 2 0 35.9 MB 23.0 MB


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial'>For this demo, the data resides in Object Storage, which Vantage can read via ReadNOS, and was made available by the get_data procedure called above.  The query below references the table and samples its contents.  Data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [7]:
SELECT TOP 5 * FROM DEMO_Retail.UK_Retail_Data;

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1,571067,22952,60 CAKE CASES VINTAGE CHRISTMAS,24,2011-10-13 15:08:00.000000,0.55,16793,United Kingdom
2,543476,22984,CARD GINGHAM ROSE,12,2011-08-02 15:24:00.000000,0.42,13050,United Kingdom
3,543476,22441,GROW YOUR OWN BASIL IN ENAMEL MUG,8,2011-08-02 15:24:00.000000,2.1,13050,United Kingdom
4,579558,21175,GIN AND TONIC DIET METAL SIGN,12,2011-11-30 11:24:00.000000,2.55,14755,United Kingdom
5,579558,23298,SPOTTY BUNTING,3,2011-11-30 11:24:00.000000,4.95,14755,United Kingdom


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 1 - Data Preparation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we will inspect the original data set, and perform various preparation tasks.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Inspect the rows of the table</li>
    <li>Inspect the column metadata using <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Data-Exploration-Functions/TD_ColumnSummary'>TD_ColumnSummary</a></li>
    <li>Split off a testing data set to be used in evaluation</li>
    </ol>
    

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>1.1 - Inspect the Data</p>

<p style = 'font-size:16px;font-family:Arial'>Simple SQL query to show the data</p>

In [None]:
SELECT TOP 5 * FROM DEMO_Retail.UK_Retail_Data;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>1.2 - View Column Information</p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Data-Exploration-Functions/TD_ColumnSummary'>TD_ColumnSummary</a> provides more details on column values and ranges</p>

In [8]:
SELECT * FROM TD_ColumnSummary(
    ON DEMO_Retail.UK_Retail_Data as inputtable
    USING
        targetcolumns('[0:7]')
) as dt;

Unnamed: 0,ColumnName,Datatype,NonNullCount,NullCount,BlankCount,ZeroCount,PositiveCount,NegativeCount,NullPercentage,NonNullPercentage
1,StockCode,VARCHAR(10) CHARACTER SET UNICODE,536641,0,0.0,,,,0,100
2,InvoiceDate,TIMESTAMP(6),536641,0,,,,,0,100
3,InvoiceNo,VARCHAR(10) CHARACTER SET UNICODE,536641,0,0.0,,,,0,100
4,Description,VARCHAR(1024) CHARACTER SET UNICODE,536641,0,0.0,,,,0,100
5,Country,VARCHAR(1024) CHARACTER SET UNICODE,536641,0,0.0,,,,0,100
6,CustomerID,FLOAT,401604,0,,0.0,401604.0,0.0,0,100
7,Quantity,BIGINT,536641,0,,0.0,526054.0,10587.0,0,100
8,UnitPrice,FLOAT,536641,0,,2510.0,534129.0,2.0,0,100


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>1.3 - Create a Testing Data Set</p>

<p style = 'font-size:16px;font-family:Arial'>Manufactured example - select several "Customer ID" values for testing later.</p>

In [9]:
REPLACE VIEW UK_Retail_Test_V as (
    SELECT * FROM DEMO_Retail.UK_Retail_Data 
    WHERE CustomerID IN ('17307', '12503', '18268', '12908', '13693')
);

Success: 9 rows affected

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 2 - Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial'>This section illustrates how to prepare the data set for model training.  We will use standard SQL and various "Fit" functions to create the inputs that the <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>TD_ColumnTransformer</a> function uses to execute a bulk transformation.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Create a per-customer grouping of data</li>
    <li>Create Fit Tables
        <ul><li>Remove Outliers</li>
            <li>Impute Missing Values</li>
            <li>Create New Numeric Features</li>
            <li>Rescale the Data Set</li>
        </ul></li>
    <li>Call the final Transformation function</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>2.1 - Create a per-customer table</p>

<p style = 'font-size:16px;font-family:Arial'>A simple GROUP BY that excludes the testing IDs.  Note that there are 4367 unique customers in this training set.</p>

In [10]:
REPLACE VIEW Customer_ID_Group_V AS (
    SELECT CustomerID,
        SUM(quantity) as TotalQuantity , 
        SUM(UnitPrice) as TotalPrice, 
        COUNT(DISTINCT(StockCode)) as TotalItems 
    FROM DEMO_Retail.UK_Retail_Data
    WHERE CustomerID NOT IN ('17307', '12503', '18268', '12908', '13693')
    GROUP BY CustomerID
)

Success: 5 rows affected

In [11]:
SELECT COUNT(*) FROM Customer_ID_Group_V

Unnamed: 0,Count(*)
1,4367


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>2.2 - Create Fit Tables</p>

<p style = 'font-size:16px;font-family:Arial'>Vantage <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions'>Feature Engineering Transform Functions</a> use a "Fit and Transform" approach to make processing more modular and efficient.  "Fit tables" can be used as input to individual Transform functions, or passed together to a single <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>TD_ColumnTransformer</a> function.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Fit outlier removal using <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Data-Cleaning-Functions/Handling-Outliers/TD_OutlierFilterFit'>TD_OutlierFilterFit</a></li>
    <li>Fit a simple imputer to replace missing values using <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Data-Cleaning-Functions/Handling-Missing-Values/TD_SimpleImputeFit'>TD_SimpleImputeFit</a></li>
    <li>Fit column calculations to create new features using <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions/TD_NonLinearCombineFit'>TD_NonLinearCombineFit</a></li>
    <li>Call <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>TD_ColumnTransformer</a> to execute the transformations (to allow for Scaling)</li>
    <li>Rescale the data using <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions/TD_ScaleFit'>TD_ScaleFit/Transform</a></li>
            </ul></td>
        <td><img src = 'images/fit_transform.png' width = '300'></td>
    </tr>
    </table>

In [12]:
--Handle Outliers
--Replace values below the 3rd or above the 97th percentile with the median

CREATE VOLATILE TABLE outlierFit_CS as (
    SELECT * from TD_OutlierFilterFit(
        ON Customer_ID_Group_V as inputTable
        USING
        TargetColumns('TotalQuantity','TotalPrice')
        LowerPercentile(0.03)
        UpperPercentile(0.97)
        OutlierMethod('Percentile')
        PercentileMethod('PercentileCont')
        ReplacementValue('Median')
    )as dt
) WITH DATA
ON COMMIT PRESERVE ROWS;

Success: 0 rows affected
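
<p style = 'font-size:16px;font-family:Arial'>To make the percentile settings above more concrete, here is a minimal plain-Python sketch of the same idea: learn the lower and upper bounds and the median once, then replace out-of-range values with that median.  It assumes continuous (linearly interpolated) percentiles in the style of SQL PERCENTILE_CONT; the function names and sample values are hypothetical, and the Vantage functions may differ in detail.</p>

```python
import statistics


def percentile_cont(values, p):
    """Linearly interpolated percentile, in the style of SQL PERCENTILE_CONT."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def fit_outlier_filter(values, lower=0.03, upper=0.97):
    """'Fit' step: learn the bounds and the median once, on training data."""
    return {
        "lower": percentile_cont(values, lower),
        "upper": percentile_cont(values, upper),
        "median": statistics.median(values),
    }


def apply_outlier_filter(values, fit):
    """'Transform' step: swap out-of-range values for the fitted median."""
    return [
        v if fit["lower"] <= v <= fit["upper"] else fit["median"]
        for v in values
    ]


# Hypothetical per-customer quantities with one extreme value.
quantities = [10, 12, 11, 13, 9, 14, 10, 12, 11, 5000]
# Wider percentiles than the notebook uses, so the effect shows on 10 values.
fit = fit_outlier_filter(quantities, lower=0.1, upper=0.9)
cleaned = apply_outlier_filter(quantities, fit)
print(fit)
print(cleaned)
```

<p style = 'font-size:16px;font-family:Arial'>Keeping the fitted bounds separate from the data means the same bounds can later be applied to new rows without recomputing them.</p>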

In [13]:
--Impute Missing Values
--Replace any missing CustomerID with a specific value
CREATE VOLATILE TABLE ImputeFit_CS AS (
    SELECT * FROM TD_SimpleImputeFit(
        ON Customer_ID_Group_V AS InputTable
        USING
        ColsForLiterals('CustomerID')
        Literals('19000')                                        
    ) AS dt
)WITH DATA
ON COMMIT PRESERVE ROWS;

Success: 0 rows affected

In [14]:
--Create a new column by multiplying quantity and price

SELECT * FROM TD_NonLinearCombineFit (
    ON Customer_ID_Group_V as InputTable
    OUT VOLATILE TABLE OutputTable (NonLinearCombineFit_CS_TotalSales)
    USING
        TargetColumns ('TotalQuantity','TotalPrice')
        Formula ('Y = X1*X0')
        ResultColumn ('TotalSales')
) AS dt;

Success: 0 rows affected

Unnamed: 0,TotalSales,TotalQuantity,TotalPrice
1,Y = X1*X0,,


In [15]:
--Create another new column by dividing the total sales by the number of unique items

SELECT * FROM TD_NonLinearCombineFit (
   ON Customer_ID_Group_V AS InputTable
   OUT VOLATILE TABLE OutputTable (NonLinearCombineFit_CS_SalesPerItem)
   USING
   TargetColumns ('TotalQuantity','TotalPrice','TotalItems')
   Formula ('Y = (X0*X1)/X2')
   ResultColumn ('SalesPerItem')
) AS dt;

Success: 0 rows affected

Unnamed: 0,SalesPerItem,TotalQuantity,TotalPrice,TotalItems
1,Y = (X0*X1)/X2,,,
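
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of what the grouping and the two formulas compute, the sketch below repeats them in plain Python on the five sample rows displayed earlier in this notebook.  Only those five rows are used, so the totals are not the real per-customer totals; the variable names are assumptions made for this sketch.</p>

```python
from collections import defaultdict

# (CustomerID, StockCode, Quantity, UnitPrice) from the five sample rows
# displayed earlier; the full table has many more rows per customer.
rows = [
    (16793, "22952", 24, 0.55),
    (13050, "22984", 12, 0.42),
    (13050, "22441", 8, 2.10),
    (14755, "21175", 12, 2.55),
    (14755, "23298", 3, 4.95),
]

# Per-customer aggregation, mirroring the GROUP BY in Customer_ID_Group_V.
grouped = defaultdict(
    lambda: {"TotalQuantity": 0, "TotalPrice": 0.0, "StockCodes": set()}
)
for customer_id, stock_code, quantity, unit_price in rows:
    g = grouped[customer_id]
    g["TotalQuantity"] += quantity       # SUM(quantity)
    g["TotalPrice"] += unit_price        # SUM(UnitPrice)
    g["StockCodes"].add(stock_code)      # COUNT(DISTINCT(StockCode))

features = {}
for customer_id, g in grouped.items():
    x0, x1, x2 = g["TotalQuantity"], g["TotalPrice"], len(g["StockCodes"])
    features[customer_id] = {
        "TotalQuantity": x0,
        "TotalPrice": x1,
        "TotalItems": x2,
        "TotalSales": x1 * x0,             # Formula 'Y = X1*X0'
        "SalesPerItem": (x0 * x1) / x2,    # Formula 'Y = (X0*X1)/X2'
    }

for customer_id, f in sorted(features.items()):
    print(customer_id, f)
```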


In [16]:
--Execute ColumnTransformer to build the data set
REPLACE VIEW Transformed_Customer_ID_Group_V AS (
    SELECT * from TD_ColumnTransformer(
        ON Customer_ID_Group_V AS InputTable
        
        ON OutlierFit_CS AS OutlierFilterFitTable DIMENSION
        ON ImputeFit_CS AS SimpleImputeFitTable DIMENSION
        ON NonLinearCombineFit_CS_TotalSales AS NonLinearCombineFitTable DIMENSION
        ON NonLinearCombineFit_CS_SalesPerItem as NonLinearCombineFitTable DIMENSION
    )as dt
)

Success: 7 rows affected

In [19]:
--Workaround: materialize the ColumnTransformer output as a table, because TD_ScaleFit failed with error 3610 when run against the view
Create table Scaled_data as(
    SELECT * from TD_ColumnTransformer(
        ON Customer_ID_Group_V AS InputTable
        
        ON OutlierFit_CS AS OutlierFilterFitTable DIMENSION
        ON ImputeFit_CS AS SimpleImputeFitTable DIMENSION
        ON NonLinearCombineFit_CS_TotalSales AS NonLinearCombineFitTable DIMENSION
        ON NonLinearCombineFit_CS_SalesPerItem as NonLinearCombineFitTable DIMENSION  
)
AS dt
) WITH DATA
;

Success: 0 rows affected

In [17]:
--ScaleFit/Transform to rescale the data
SELECT * FROM TD_ScaleFit(
    ON Transformed_Customer_ID_Group_V AS InputTable
    OUT VOLATILE TABLE OutputTable(ScaleFit_CS)
    USING
        TargetColumns('TotalQuantity','TotalItems','TotalPrice','TotalSales','SalesPerItem')
        ScaleMethod('range')
) as dt;

ERROR: Unable to run SQL: Unable to run SQL query: Database reported error:3610:Internal error: Please do not resubmit the last request.  SubCode, CrashCode: 0, 0

In [18]:
--ScaleFit/Transform to rescale the data
SELECT * FROM TD_ScaleFit(
    ON Transformed_Customer_ID_Group_V AS InputTable
    OUT TABLE OutputTable(ScaleFit_CS)
    USING
        TargetColumns('TotalQuantity','TotalItems','TotalPrice','TotalSales','SalesPerItem')
        ScaleMethod('range')
) as dt;

ERROR: Unable to run SQL: Unable to run SQL query: Database reported error:3610:Internal error: Please do not resubmit the last request.  SubCode, CrashCode: 0, 0

In [20]:
--ScaleFit on the materialized table (Scaled_data) to compute the rescaling statistics
SELECT * FROM TD_ScaleFit(
    ON Scaled_data AS InputTable
    OUT VOLATILE TABLE OutputTable(ScaleFit_CS)
    USING
        TargetColumns('TotalQuantity','TotalItems','TotalPrice','TotalSales','SalesPerItem')
        ScaleMethod('range')
) as dt;

Success: 0 rows affected

Unnamed: 0,TD_STATTYPE_SCLFIT,TotalQuantity,TotalItems,TotalPrice,TotalSales,SalesPerItem
1,min,24.0,1.0,6.72,206.7,26.260061315496092
2,max,4796.0,1794.0,1291.0499999999995,5751069.009999999,555001.0
3,sum,2967291.0,267606.0,937127.7840000016,1065094805.8069998,17633855.62153829
4,count,4367.0,4367.0,4367.0,4367.0,4367.0
5,,0.0,0.0,0.0,0.0,0.0
6,avg,679.4804213418823,61.279138997023125,214.59303503549387,243896.2229922143,4037.979304222187
7,multiplier,1.0,1.0,1.0,1.0,1.0
8,intercept,0.0,0.0,0.0,0.0,0.0
9,location,24.0,1.0,6.72,206.7,26.260061315496092
10,scale,4772.0,1793.0,1284.3299999999997,5750862.309999999,554974.7399386845
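
<p style = 'font-size:16px;font-family:Arial'>In the fit output above, "location" equals the column minimum and "scale" equals max minus min, which is consistent with range (min-max) scaling.  The plain-Python sketch below assumes the transform is (value - location) / scale; the function names are hypothetical and the check is only an illustration, not the Vantage implementation.</p>

```python
def scale_fit(values):
    """'Fit' step: record location (min) and scale (max - min) for one column."""
    location = min(values)
    return {"location": location, "scale": max(values) - location}


def scale_transform(values, fit):
    """'Transform' step: apply previously fitted statistics to any data set."""
    return [(v - fit["location"]) / fit["scale"] for v in values]


# TotalQuantity statistics copied from the TD_ScaleFit output above.
total_quantity_fit = {"location": 24.0, "scale": 4772.0}

# The fitted minimum maps to 0, the midpoint to 0.5, and the fitted maximum to 1.
print(scale_transform([24.0, 2410.0, 4796.0], total_quantity_fit))
```

<p style = 'font-size:16px;font-family:Arial'>Because the statistics come from the training data, a new value outside the fitted range would map outside the interval from 0 to 1 under this formula.</p>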


In [None]:
REPLACE VIEW Scaled_Transformed_Customer_ID_Group_V AS (
    SELECT * FROM TD_ScaleTransform(
        ON Transformed_Customer_ID_Group_V AS InputTable
        ON ScaleFit_CS as FitTable DIMENSION
        USING
            Accumulate('CustomerID')
    )as dt
)

In [None]:
SELECT TOP 5 * FROM Scaled_Transformed_Customer_ID_Group_V

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 3 - Build the K-means Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>As discussed above, the K-means algorithm takes a number of clusters "k", chooses a random starting point for each centroid, and iterates until a hard limit or an optimum value is reached.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Finding an Ideal value for K</b></p>
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<p style = 'font-size:16px;font-family:Arial'>The example below uses a value of 5 for the number of clusters.  Typically, data scientists will build the model using various values for "k", and plot the "WCSS" (Within Cluster Sum-of-Squares) value against each value chosen for k.  The "elbow" point (where the slope of the curve flattens) is usually a good value for k.  The <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Model-Training-Functions/TD_KMeans'>TD_KMeans</a> function returns this value as "TotalWithinSS : ###" in a row of the "td_modelinfo_kmeans" column.</p></td>
        <td><img src = 'images/WCSS_elbow.png' width = '300'></td>
    </tr>
    </table>
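
<p style = 'font-size:16px;font-family:Arial'>One simple way to locate the elbow numerically is to look for the k where the WCSS curve bends most sharply (the largest second difference).  The sketch below is a heuristic under that assumption, not a Vantage feature; the function name and the WCSS values are made up for illustration, and the k values are assumed to be evenly spaced.</p>

```python
def elbow_k(wcss_by_k):
    """Return the k where the WCSS curve bends most sharply.

    Uses the largest second difference as a simple elbow heuristic and
    needs at least three evenly spaced k values.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wcss_by_k[prev_k] - wcss_by_k[k]
        drop_after = wcss_by_k[k] - wcss_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical TotalWithinSS values, one per model built with a different k.
wcss_by_k = {2: 410.0, 3: 250.0, 4: 140.0, 5: 60.0, 6: 52.0, 7: 47.0}
print(elbow_k(wcss_by_k))
```

<p style = 'font-size:16px;font-family:Arial'>With these made-up numbers the heuristic picks k = 5, since the improvement drops off sharply after that point.  In practice the plot is still worth inspecting, as real curves are often less clear-cut.</p>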

<p style = 'font-size:16px;font-family:Arial'><b>Other Function Parameters Include (but are not limited to)</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Input Table</li>
    <li>StopThreshold - The algorithm converges if the distance between the centroids from the previous iteration and the current iteration is less than the specified value.</li>
    <li>MaxIterNum - The maximum number of iterations for the K-means algorithm. The algorithm stops after performing the specified number of iterations even if the convergence criterion is not met.</li>
    </ul>

In [None]:
DROP TABLE KMeans_Model

In [None]:
Select * from TD_KMeans (
    ON Scaled_Transformed_Customer_ID_Group_V as InputTable
    OUT TABLE ModelTable(KMeans_Model)
    USING
        IdColumn('CustomerID')
        TargetColumns('TotalQuantity','TotalPrice','TotalItems','TotalSales','SalesPerItem')
        StopThreshold(0.0395)
        NumClusters(5)
        --Seed(0)
        MaxIterNum(500)
)as dt;

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 4 - Bulk Transformation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, the Fit tables created above will be passed to a single <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>TD_ColumnTransformer</a> function.  This is similar to an operational approach, where a single query will prepare new or incoming data for immediate analysis.</p>

<img src = 'images/column_transformer.png' width = '300'>

In [None]:
SELECT TOP 5 * FROM UK_Retail_Test_V

In [None]:
--Steps broken up above can be put together into a single query

REPLACE VIEW Scaled_Transformed_Test_V AS (
    
SELECT * FROM TD_ColumnTransformer(
            --Use our groupby inside the ON clause
            ON (SELECT CustomerID,
                    SUM(quantity) as TotalQuantity , 
                    SUM(UnitPrice) as TotalPrice, 
                    COUNT(DISTINCT(StockCode)) as TotalItems
                FROM UK_Retail_Test_V
                GROUP BY CustomerID
                ) AS InputTable
            
            --Pass each fit table from above as dimensions
            ON OutlierFit_CS AS OutlierFilterFitTable DIMENSION
            ON ImputeFit_CS AS SimpleImputeFitTable DIMENSION
            ON NonLinearCombineFit_CS_TotalSales AS NonLinearCombineFitTable DIMENSION
            ON NonLinearCombineFit_CS_SalesPerItem AS NonLinearCombineFitTable DIMENSION
            ON ScaleFit_CS as ScaleFitTable DIMENSION
    )as dt
)

In [None]:
SELECT TOP 5 * FROM Scaled_Transformed_Test_V

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 5 - Predict and Evaluate</b></p>

<p style = 'font-size:16px;font-family:Arial'>Finally, we run the model against new (in this case testing) data using <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Model-Scoring-Functions/TD_KMeansPredict'>TD_KMeansPredict</a>.  The preparation step has been completed in a single query above.  Additionally, we will use an evaluation function <a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Model-Evaluation-Functions/TD_Silhouette'>TD_Silhouette</a> to analyze how well the new cluster predictions match the original model.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Call TD_KMeansPredict</li>
    <li>Inspect the results</li>
    <li>Call TD_Silhouette on the output</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.1 - Call the Prediction Function</p>

<p style = 'font-size:16px;font-family:Arial'>Pass the Input Data, Model Table, and other parameters including columns to accumulate.  Note here we create a new View to assist with Silhouette analysis of the prediction.</p>

In [None]:
REPLACE VIEW KMeans_Output_V AS (
    SELECT * FROM TD_KMeansPredict (
        ON Scaled_Transformed_Test_V AS InputTable
        ON KMeans_Model as ModelTable DIMENSION
        USING
            OutputDistance('true')
            Accumulate('[1:5]')
    )as dt
)

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.2 - Inspect the Results</p>

<p style = 'font-size:16px;font-family:Arial'>Simple SELECT</p>

In [None]:
SELECT TOP 5 * FROM KMeans_Output_V 

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.3 - Evaluate the Prediction</p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs-dev.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/June-2022/Model-Evaluation-Functions/TD_Silhouette'>TD_Silhouette</a> is a native Vantage function that evaluates the similarity of an object to its cluster (cohesion) compared to other clusters (separation).  Silhouette scores range from -1 to 1 and are interpreted as follows:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>1: Data is appropriately clustered</li>
    <li>-1: Data is not appropriately clustered</li>
    <li>0: Datum is on the border of two natural clusters</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial'>See the documentation for a full listing of parameters and return values.</p>
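
<p style = 'font-size:16px;font-family:Arial'>For intuition, the standard per-point silhouette is (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster (cohesion) and b is the lowest mean distance to the members of any other cluster (separation).  The plain-Python sketch below computes that textbook definition on hypothetical points; it assumes Euclidean distance and at least two clusters, and it is not the TD_Silhouette implementation.</p>

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette (b - a) / max(a, b); needs at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:
            # Common convention: a single-member cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # cohesion
        b = min(                 # separation
            sum(d) / len(d) for lab, d in dists.items() if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical points: two tight pairs that are far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(pts, [0, 0, 1, 1])   # sensible assignment
bad = silhouette_scores(pts, [0, 1, 1, 1])    # second point mislabeled
print(sum(good) / len(good))
print(bad[1])
```

<p style = 'font-size:16px;font-family:Arial'>The sensible assignment yields scores close to 1 for every point, while the mislabeled point receives a negative score.</p>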

In [None]:
SELECT * FROM TD_Silhouette(
    ON KMeans_Output_V as inputTable
    USING
        IdColumn('CustomerID')
        ClusterIdColumn('td_clusterid_kmeans')
        TargetColumns('[3:7]')
        OutputType('SCORE')
) as dt

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>4. Clean up</b> </p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Worktables</b> </p>

In [None]:
DROP VIEW UK_Retail_Test_V

In [None]:
DROP VIEW Customer_ID_Group_V

In [None]:
DROP VIEW Transformed_Customer_ID_Group_V

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [4]:
call remove_data('DEMO_Retail');-- takes about 5 seconds, optional if you want to use the data later

Success: 0 rows affected

Success: 0 rows affected

Unnamed: 0,Message
1,Removed objects related to DEMO_Retail. That ran for 0:00:02.40


In [None]:
DROP VIEW Scaled_Transformed_Customer_ID_Group_V

In [None]:
DROP VIEW Scaled_Transformed_Test_V

In [None]:
DROP VIEW KMeans_Output_V

In [None]:
DROP TABLE KMeans_Model

In [None]:
DROP TABLE ScaleFit_CS

In [None]:
DROP TABLE NonLinearCombineFit_CS_SalesPerItem

In [None]:
DROP TABLE NonLinearCombineFit_CS_TotalSales

In [None]:
DROP TABLE ImputeFit_CS

In [None]:
DROP TABLE OutlierFit_CS

In [None]:
DROP TABLE Scaled_data