<header style="padding:10px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>ClearScape Analytics for Customer Segmentation using K-means Clustering and Data Preparation Piplelines</b></p>
</header>
<hr>

<br>

<b style = 'font-size:24px;font-family:Arial;color:#E37C4D'>Leverage ClearScape Analytics for efficient and highly scalable data preparation, model training, and evaluation workflows</b>

<p style = 'font-size:16px;font-family:Arial'>K-means clustering is one of the most popular <b>unsupervised</b> machine learning algorithms.  Essentially, the algorithm seeks to group similar data points together by minimizing the average ("means" in K-means) distance for all data points from each cluster's center (centroid).</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial'>
                <li>Define the number of clusters (k)</li>
                <br>
                <li>The algorithm chooses random points as centroids</li>
                <br>
                <li>Each iteration attempts to optimize the centroid locations</li>
                <br>
                <li>Iterations end once the distances have stabilized or the max iteration count is reached</li>
            </ol>
        </td>
        <td><img src = 'images/K-means_convergence.gif' width = '250'></td>
    </tr>
</table>

<p style = 'font-size:16px;font-family:Arial'>One limitation of this algorithm is that it only accepts numeric data as feature input (categorical clustering can be performed using K-modes algorithm).  Typically, data engineers or data scientists will perform multiple <b>serial</b> steps to prepare a numeric-only data set that can be passed to the K-means algorithm.</p>

<p style = 'font-size:16px;font-family:Arial'>ClearScape Analytics provides native "Fit and Transform" functions to assist in data preparation and transformation at scale.  To aid in efficiency and operationalization, Vantage provides a bulk <b>Column Transformer</b> function which can take multiple transformation directives at the same time, and act on the whole data set at once.  This allows for both process and code simplifcation, allowing more streamlined and robust operational deployment.</p> 

<img src = 'Flow_Diagram_KMeans.png' width = 100%>
<hr>

<b style = 'font-size:24px;font-family:Arial;color:#E37C4D'>Live Demonstration</b>

<p style = 'font-size:16px;font-family:Arial'>The data for this demonstration is based on online purchase history data set, which can be found <a href = 'https://www.kaggle.com/code/hellbuoy/online-retail-k-means-hierarchical-clustering/data'>here</a>.  The goal is to segment the customers by purchase volume and value.  Steps are as follow:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial'>
                <li>Analyze the raw data, split a testing set</li>
                <br>
                <li>Engineer numeric features</li>
                <br>
                <li>Build the K-means model</li>
                <br>
                <li>Apply in-line transformation to the testing set</li>
                <br>
                <li>Make Predictions and evaluate model accuracy</li>
            </ol>
        </td>
        <td><img src = 'images/clustering_img.png' width = '250'></td>
    </tr>
</table>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Imports and Connection</p>

<p style = 'font-size:16px;font-family:Arial'>Import required packages and create a connection context to Vantage.</p>

In [1]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter = lambda *args, **kwargs: ""


import json
from teradataml import *
warnings.simplefilter = lambda *args, **kwargs: ""

from IPython.display import display as ipydisplay
import matplotlib.pyplot as plt
%matplotlib inline

# display.print_sqlmr_query = False

In [None]:
# load vars json
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

eng = create_context(host=host, username=username, password=password)

eng.execute(f'''SET SESSION COMPUTE GROUP {session_vars['hierarchy']['users']['business_users'][1]['compute_group']}''')

# confirm connection
print(eng)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 1 - Data Preparation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we will inspect the original data set, and perform various preparation tasks.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Inspect the rows of the table</li>
    <li>Inspect the column metadata using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Exploration-Functions/TD_ColumnSummary'>ColumnSummary</a></li>
    <li>Split off a testing data set to be used in evaluation</li>
    </ol>
    

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>1.1 - Inspect the Data</p>

<p style = 'font-size:16px;font-family:Arial'>Simple SQL query to show the data</p>

In [3]:
tdf_retail_data = DataFrame('"demo_ofs"."UK_Retail_Data"')

In [4]:
ipydisplay(tdf_retail_data.shape)
ipydisplay(tdf_retail_data.head(5))

(406829, 8)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,71053,WHITE METAL LANTERN,6,2010-01-12 12:26:00.000000,3.39,17850.0,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2010-01-12 12:26:00.000000,4.25,17850.0,United Kingdom
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 12:26:00.000000,2.55,17850.0,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-01-12 12:26:00.000000,3.39,17850.0,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-01-12 12:26:00.000000,2.75,17850.0,United Kingdom


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>1.2 View Column information</p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Exploration-Functions/TD_ColumnSummary'>ColumnSummary</a> provides more details on column values and ranges</p>

In [5]:
from teradataml import ColumnSummary

warnings.simplefilter('ignore')

ColumnSummary(data = tdf_retail_data, target_columns = tdf_retail_data.columns).result

ColumnName,Datatype,NonNullCount,NullCount,BlankCount,ZeroCount,PositiveCount,NegativeCount,NullPercentage,NonNullPercentage
CustomerID,FLOAT,406829,0,,0.0,406829.0,0.0,0.0,100.0
InvoiceNo,VARCHAR(10) CHARACTER SET UNICODE,406829,0,0.0,,,,0.0,100.0
StockCode,VARCHAR(10) CHARACTER SET UNICODE,406829,0,0.0,,,,0.0,100.0
Description,VARCHAR(1024) CHARACTER SET UNICODE,406829,0,0.0,,,,0.0,100.0
UnitPrice,FLOAT,406829,0,,40.0,406789.0,0.0,0.0,100.0
Quantity,BIGINT,406829,0,,0.0,397924.0,8905.0,0.0,100.0
Country,VARCHAR(1024) CHARACTER SET UNICODE,406829,0,0.0,,,,0.0,100.0
InvoiceDate,TIMESTAMP(6),406829,0,,,,,0.0,100.0


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>1.3 Create a Testing data set</p>

<p style = 'font-size:16px;font-family:Arial'>Select several "Customer ID" values for testing later.  This uses the SQLAlchemy ClauseElement Expression capabilities in the <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/teradataml-Extension-with-SQLAlchemy/Using-SQLAlchemy-Clause-Element-and-Expression/Using-Basic-SQLAlchemy-ClauseElement-and-Expression-for-Filtering'>teradataml package</a>.</p>

In [6]:
in_expr = tdf_retail_data['CustomerID'].expression.in_(['17307', '12503', '18268', '12908', '13693'])

tdf_retail_test = tdf_retail_data[in_expr]

In [7]:
tdf_retail_test

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
C541499,84819,DANISH ROSE ROUND SEWING BOX,-1,2011-01-18 19:23:00.000000,4.25,13693.0,United Kingdom
C541499,22325,MOBILE VINTAGE HEARTS,-3,2011-01-18 19:23:00.000000,4.95,13693.0,United Kingdom
C542693,15036,ASSORTED COLOURS SILK FAN,-600,2011-01-31 16:36:00.000000,0.65,12908.0,United Kingdom
542694,15036,ASSORTED COLOURS SILK FAN,600,2011-01-31 16:37:00.000000,0.53,12908.0,United Kingdom
561680,84968A,SET OF 16 VINTAGE ROSE CUTLERY,2,2011-07-28 23:13:00.000000,12.75,18268.0,United Kingdom
C570708,M,Manual,-600,2011-12-10 14:11:00.000000,0.19,12908.0,United Kingdom
C538110,21232,STRAWBERRY CERAMIC TRINKET BOX,-144,2010-09-12 19:24:00.000000,1.06,17307.0,United Kingdom
C540271,M,Manual,-1,2011-06-01 15:51:00.000000,1126.0,12503.0,Spain
C561590,84968A,SET OF 16 VINTAGE ROSE CUTLERY,-2,2011-07-28 15:16:00.000000,12.75,18268.0,United Kingdom
C541499,22766,PHOTO FRAME CORNICE,-1,2011-01-18 19:23:00.000000,2.95,13693.0,United Kingdom


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 2 - Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial'>This section will illustrate how to prepare the data set for model training.  We will use various "Fit" functions to create input dataframes for the <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> function to take as input in order to execute a bulk transformation.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Create a per-customer grouping of data</li>
    <li>Create Fit Tables
        <ul><li>Remove Outliers</li>
            <li>Impute Missing Values</li>
            <li>Create New Numeric Features</li>
            <li>Rescale the Data Set</li>
        </ul></li>
    <li>Call the final Transformation function</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>2.1 - Create a per-customer table</p>

<p style = 'font-size:16px;font-family:Arial'>Simple GROUP BY, exclude the testing IDs.  Note there are 3930 unique customers in this training set.</p>

In [8]:
# Filter the tdf to exclued the customer IDs in our "test" set

notin_expr = tdf_retail_data.CustomerID.expression.notin_(['17307', '12503', '18268', '12908', '13693'])

tdf_train = tdf_retail_data[notin_expr].groupby('CustomerID').agg({'Quantity':['sum'], 
                                                                   'UnitPrice':['sum'], 
                                                                   'StockCode':['count']})
tdf_train.head(5)

CustomerID,sum_Quantity,sum_UnitPrice,count_StockCode
12348.0,2341,178.70999999999998,31
12350.0,197,65.3,17
12349.0,631,605.1,73
12347.0,2458,481.21,182
12346.0,0,2.08,2


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>2.2 Create Fit Tables</p>

<p style = 'font-size:16px;font-family:Arial'>Vantage <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions'>Feature Engineering Transform Functions</a> Use a "Fit and Transform" approach to make processing more modular and efficient.  "Fit tables" can be used as input to either individual Transform functions, or passed to a single <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> function.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Fit outlier removal using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Cleaning-Functions/Handling-Outliers/TD_OutlierFilterFit'>OutlierFilterFit</a></li>
    <li>Fit a simple imputer to replace missing values using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Cleaning-Functions/Handling-Missing-Values/TD_SimpleImputeFit'>SimpleImputeFit</a></li>
    <li>Fit column calculations to create new features using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_NonLinearCombineFit'>NonLinearCombineFit</a></li>
    <li>Call <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> to execute the transformations (to allow for Scaling)</li>
    <li>Rescale the data using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ScaleFit'>ScaleFit/Transform</a></li>
            </ul></td>
        <td><img src = 'images/fit_transform.png' width = '300'></td>
    </tr>
    </table>

In [None]:
from teradataml import OutlierFilterFit, SimpleImputeFit, NonLinearCombineFit, ColumnTransformer


ft_outlier = OutlierFilterFit(data = tdf_train,
                             target_columns = ['sum_Quantity', 'sum_UnitPrice'], 
                             lower_percentile = 0.03, 
                             upper_percentile = 0.97,
                             percentile_method = 'PercentileCont', 
                             replacement_value = 'Median')

ft_impute = SimpleImputeFit(data = tdf_train, 
                            stats_columns = 'sum_Quantity',
                            literals_columns = 'CustomerID', 
                            literals = '19000', 
                            stats = 'MIN')

ft_nlc_TotSales = NonLinearCombineFit(data = tdf_train, 
                            target_columns = ['sum_Quantity', 'sum_UnitPrice'], 
                            formula = 'Y = X0*X1', 
                            result_column = 'TotalSales')

ft_nlc_SalesPer = NonLinearCombineFit(data = tdf_train, 
                                     target_columns = ['sum_Quantity', 'sum_UnitPrice', 'count_StockCode'], 
                                     formula = 'Y = (X0*X1)/X2', 
                                     result_column = 'SalesPerItem')

tdf_Transformed = ColumnTransformer(input_data = tdf_train, 
                                   ouliterfilter_fit_data = ft_outlier.result, 
                                   simpleimpute_fit_data = ft_impute.output,
                                   nonlinearcombine_fit_data = ft_nlc_TotSales.result)

tdf_train_Transformed = ColumnTransformer(input_data = tdf_Transformed.result, 
                                         nonlinearcombine_fit_data = ft_nlc_SalesPer.result)

In [10]:
from teradataml import ScaleFit, ScaleTransform

ft_rescale = ScaleFit(data = tdf_train_Transformed.result, 
                     target_columns = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'],
                     scale_method = 'range')

tdf_train_scaled = ScaleTransform(data = tdf_train_Transformed.result, object = ft_rescale.output, accumulate = ['CustomerID']).result

tdf_train_scaled.head(5)

CustomerID,sum_Quantity,count_StockCode,sum_UnitPrice,TotalSales,SalesPerItem
12348.0,0.0134198211367258,0.0037584565271861,0.0043191360857765,0.0002117716367983,0.0125657613019035
12350.0,0.002537787658231,0.0020045101478326,0.0015781970029724,4.262827889238582e-05,0.0080979841175871
12349.0,0.0047405873455756,0.0090202956652468,0.0146243033154463,0.0001965289756534,0.0096670026360526
12347.0,0.0140136634487519,0.0226760210473565,0.0116300793231299,0.0005306461144156,0.0101119273162671
12346.0,0.001537899320888,0.0001252818842395,5.027028738411565e-05,3.726231469846489e-05,0.0078325881604629


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 3 - Build the K-means Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>As discussed above, the K-means algorithm takes a number of clusters "k", chooses a random starting point for each centroid, and iterates until a hard limit, or an optimium value is reached.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Finding an Ideal value for K</b></p>
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<p style = 'font-size:16px;font-family:Arial'>The example below uses a value of 5 for the number of clusters to build the model against.  Typically, data scientists will build the model using various values for "k", and plot the "WCSS" (Within Cluster Sum-of-Squares) value on a series of each value chosen for k.  The "elbow" point (where the slope changes) is usually a good value for k.  <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Training-Functions/TD_KMeans'>KMeans</a> function will return this value as "TotalWithinSS : ###" as a row in the "td_modelinfo_kmeans" column.</p></td>
        <td><img src = 'images/WCSS_elbow.png' width = '300'></td>
    </tr>
    </table>

<p style = 'font-size:16px;font-family:Arial'><b>Other Function Parameters Include (but are not limited to)</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Input dataframe</li>
    <li>StopThreshold - The algorithm converges if the distance between the centroids from the previous iteration and the current iteration is less than the specified value.</li>
    <li>MaxIterNum</li>Specify the maximum number of iterations for the K-means algorithm. The algorithm stops after performing the specified number of iterations even if the convergence criterion is not met.
    </ul>

In [13]:
from teradataml import KMeans

kmeans_res = KMeans(data = tdf_train_scaled, 
                    id_column = 'CustomerID', 
                    target_columns = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'], 
                    num_clusters = 5, 
                    iter_max = 100, 
                    threshold=0.0295)
kmeans_res.result.to_pandas().sort_index()

Unnamed: 0_level_0,sum_Quantity,count_StockCode,sum_UnitPrice,TotalSales,SalesPerItem,td_size_kmeans,td_withinss_kmeans,CustomerID,td_modelinfo_kmeans
td_clusterid_kmeans,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0.0,0.005682,0.009133,0.005448,0.000107,0.008627,1360.0,0.05292,,
1.0,0.002695,0.002202,0.001504,4.3e-05,0.008061,2137.0,0.010522,,
2.0,0.012041,0.026724,0.016331,0.000603,0.010427,730.0,0.245511,,
3.0,0.0563,0.07953,0.049629,0.010228,0.026752,132.0,1.913392,,
4.0,0.239298,0.41094,0.612624,0.30638,0.230131,8.0,3.793312,,
,,,,,,,,,Number of Iterations : 8
,,,,,,,,,Total_WithinSS : 6.01565660654618E+00
,,,,,,,,,Between_SS : 7.55969966511910E+00
,,,,,,,,,Number of Clusters : 5
,,,,,,,,,Converged : True


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 4 - Bulk Transformation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, the Fit table objects created above will be passed to a single <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> function.  This is similar to an operational approach, where a single query will prepare new or incoming data for immediate analysis.</p>

<img src = 'images/column_transformer.png' width = '300'>

In [14]:
# Recall the customer ID's we held back in the beginneing
tdf_retail_test.head(5)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
561680,84968A,SET OF 16 VINTAGE ROSE CUTLERY,2,2011-07-28 19:13:00.000000,12.75,18268.0,United Kingdom
C540271,M,Manual,-1,2011-06-01 11:51:00.000000,1126.0,12503.0,Spain
C538110,21232,STRAWBERRY CERAMIC TRINKET BOX,-144,2010-09-12 15:24:00.000000,1.06,17307.0,United Kingdom
557092,15036,ASSORTED COLOURS SILK FAN,600,2011-06-16 15:39:00.000000,0.72,12908.0,United Kingdom
542694,15036,ASSORTED COLOURS SILK FAN,600,2011-01-31 12:37:00.000000,0.53,12908.0,United Kingdom


In [15]:
# Perform the same groupby and aggregation as above
tdf_gb_test = tdf_retail_test.groupby('CustomerID').agg({'Quantity':['sum'], 
                                                        'UnitPrice':['sum'], 
                                                        'StockCode':['count']})

In [16]:
# Pass this to the columntransformer function

tdf_Transformed = ColumnTransformer(input_data = tdf_gb_test, 
                                   ouliterfilter_fit_data = ft_outlier.result, 
                                   simpleimpute_fit_data = ft_impute.output,
                                   nonlinearcombine_fit_data = ft_nlc_TotSales.result)

tdf_test_Transformed = ColumnTransformer(input_data = tdf_Transformed.result, 
                                         nonlinearcombine_fit_data = ft_nlc_SalesPer.result)

tdf_test_scaled = ScaleTransform(data = tdf_test_Transformed.result, object = ft_rescale.output, accumulate = ['CustomerID']).result


tdf_test_scaled.head(5)

CustomerID,sum_Quantity,count_StockCode,sum_UnitPrice,TotalSales,SalesPerItem
13693.0,0.0015074458689892,0.0003758456527186,0.0005341218034562,3.720700365050087e-05,0.00782096170454
18268.0,0.001537899320888,0.0001252818842395,0.0006162943886033,3.726231469846489e-05,0.0078325881604629
17307.0,0.0008070164753174,0.0,2.5618511839982013e-05,3.719864442424749e-05,0.0077790538548198
12908.0,0.001537899320888,0.0003758456527186,5.0511971458077736e-05,3.726231469846489e-05,0.0078325881604629
12503.0,0.0015328237455715,0.0,0.0272136267281318,3.679262963053515e-05,0.0074376744540028


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Step 5 - Predict and Evaluate</b></p>

<p style = 'font-size:16px;font-family:Arial'>Finally, we run the model against new (in this case testing) data using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Scoring-Functions/TD_KMeansPredict'>KMeansPredict</a>.  The preparation step has been completed in a single query above.  Additionally, we will use an evaluation function <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Evaluation-Functions/TD_Silhouette'>Silhouette</a> to analyze how well the new cluster predictions match the original model.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Call KMeansPredict</li>
    <li>Inpect the results</li>
    <li>Call Silhouette on the output</li>
    </ol>
    
<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.1 - Call the Prediction Function</p>

<p style = 'font-size:16px;font-family:Arial'>Pass the Input Data, Model Table, and other parameters including columns to accumulate.  Note here we create a new View to assist with Silhouette analysis of the prediction.</p>

In [17]:
from teradataml import KMeansPredict


kmeans_prediction = KMeansPredict(data = tdf_test_scaled, 
                                  object = kmeans_res.model_data, 
                                  output_distance = True, 
                                  accumulate = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'])

copy_to_sql(kmeans_prediction.result, table_name = 'kmeans_pred', if_exists = 'replace')
kmeans_prediction.result

CustomerID,td_clusterid_kmeans,td_distance_kmeans,sum_Quantity,count_StockCode,sum_UnitPrice,TotalSales,SalesPerItem
13693.0,1,0.0023960767222745,0.0015074458689892,0.0003758456527186,0.0005341218034562,3.720700365050087e-05,0.00782096170454
18268.0,1,0.0025474744660377,0.001537899320888,0.0001252818842395,0.0006162943886033,3.726231469846489e-05,0.0078325881604629
12503.0,0,0.023995411653506,0.0015328237455715,0.0,0.0272136267281318,3.679262963053515e-05,0.0074376744540028
12908.0,1,0.0026145535033422,0.001537899320888,0.0003758456527186,5.0511971458077736e-05,3.726231469846489e-05,0.0078325881604629
17307.0,1,0.0032672829843313,0.0008070164753174,0.0,2.5618511839982013e-05,3.719864442424749e-05,0.0077790538548198


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.2 - Evaluate the Prediction</p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Evaluation-Functions/TD_Silhouette'>Silhouette</a> is a native Vantage function that evaluates the similarity of an object to its cluster (cohesion) compared to other clusters (separation).  The silhouette scores and its definitions are as follows:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>1: Data is appropriately clustered</li>
    <li>-1: Data is not appropriately clustered</li>
    <li>0: Datum is on the border of two natural clusters</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial'>See the documentation for a full listing of parameters and return values.</p>

In [18]:
tdf_prediction = DataFrame('kmeans_pred')

In [19]:
from teradataml import Silhouette

res_sil = Silhouette(data = tdf_prediction, 
                     id_column = 'CustomerID', 
                     cluster_id_column = 'td_clusterid_kmeans', 
                    target_columns = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'])

res_sil.result

silhouette_score
0.7797773330631074


<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Clean Up</p>

In [20]:
db_drop_table('kmeans_pred')

True

In [21]:
remove_context()

True