<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Customer Segmentation with K-means Clustering and Data Preparation Pipelines
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<b style = 'font-size:20px;font-family:Arial;'>Leverage native Vantage processing for efficient and highly scalable data preparation, model training, and evaluation workflows</b>

<p style = 'font-size:16px;font-family:Arial'>K-means clustering is one of the most popular <b>unsupervised</b> machine learning algorithms.  Essentially, the algorithm seeks to group similar data points together by minimizing the average ("means" in K-means) distance for all data points from each cluster's center (centroid).</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial'>
                <li>Define the number of clusters (k)</li>
                <br>
                <li>The algorithm chooses random points as centroids</li>
                <br>
                <li>Each iteration attempts to optimize the centroid locations</li>
                <br>
                <li>Iterations end once the distances have stabilized or the max iteration count is reached</li>
            </ol>
        </td>
        <td><img src = 'images/K-means_convergence.gif' width = '250'></td>
    </tr>
</table>
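
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of the four steps above, the following plain-Python sketch runs K-means on a handful of made-up 2-D points. It is not how Vantage implements the algorithm; the function name <code>kmeans</code>, its parameters, and the sample points are assumptions chosen only for this example.</p>

```python
import math
import random


def kmeans(points, k, threshold=1e-4, iter_max=100, seed=0):
    """Plain-Python K-means; returns (centroids, labels, iterations)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct input points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for iteration in range(1, iter_max + 1):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        # Step 4: stop once no centroid moved more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, labels, iteration


# Made-up sample: two visibly separate groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels, iterations = kmeans(points, k=2)
print(labels, iterations)
```

<p style = 'font-size:16px;font-family:Arial'>The first three points end up with one label and the last three with the other. Which group receives label 0 depends on the random starting centroids.</p>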

<p style = 'font-size:16px;font-family:Arial'>One limitation of this algorithm is that it only accepts numeric data as feature input (categorical clustering can be performed using the K-modes algorithm).  Typically, data engineers or data scientists will perform multiple <b>serial</b> steps to prepare a numeric-only data set that can be passed to the K-means algorithm.</p>

<p style = 'font-size:16px;font-family:Arial'>Vantage provides native "Fit and Transform" functions to assist in data preparation and transformation at scale.  To aid in efficiency and operationalization, Vantage provides a bulk <b>Column Transformer</b> function which can take multiple transformation directives at the same time, and act on the whole data set at once.  This allows for both process and code simplification, allowing more streamlined and robust operational deployment.</p> 

<img src = 'Flow_Diagram_KMeans.png' width = 100%>


<p style = 'font-size:16px;font-family:Arial'>The data for this demonstration is based on an online purchase history data set, which can be found <a href = 'https://www.kaggle.com/code/hellbuoy/online-retail-k-means-hierarchical-clustering/data'>here</a>.  The goal is to segment the customers by purchase volume and value.  The steps are as follows:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial'>
                <li>Analyze the raw data, split a testing set</li>
                <br>
                <li>Engineer numeric features</li>
                <br>
                <li>Build the K-means model</li>
                <br>
                <li>Apply in-line transformation to the testing set</li>
                <br>
                <li>Make Predictions and evaluate model accuracy</li>
            </ol>
        </td>
        <td><img src = 'images/clustering_img.png' width = '250'></td>
    </tr>
</table>

In [None]:
# getpass prompts the user for the password so it is not stored in plain text in the notebook
import getpass
import pandas as pd

# import all Teradataml functions and supporting libraries
from teradataml import *

display.max_rows = 5

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press Enter, then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=K-Means_Clustering_and_ML_model_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Retail_local');" # Takes about 2 minutes
#%run -i ../run_procedure.py "call get_data('DEMO_Retail_cloud');"

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>Create a Teradata DataFrame (virtual DataFrame)</b></p>
<p style = 'font-size:16px;font-family:Arial'>The get_data procedure called above made the demo data available, either in local tables or as foreign tables that read from Object Storage via ReadNOS. Here we create a reference to the table as a Teradata DataFrame, which lets us sample and process its contents in the database.  Data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [None]:
tdf = DataFrame(in_schema('DEMO_Retail', 'UK_Retail_Data'))

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Data Preparation</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here, we will inspect the original data set, and perform various preparation tasks.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Inspect the rows of the table</li>
    <li>Inspect the column metadata using <a href = 'https://docs.teradata.com/search/all?query=Python+ColumnSummary&content-lang=en-US'>ColumnSummary</a></li>
    <li>Split off a testing data set to be used in evaluation</li>
    </ol>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>2.1 View Column information</b></p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs.teradata.com/search/all?query=Python+ColumnSummary&content-lang=en-US'>ColumnSummary</a> provides more details on column values and ranges. Note that the resulting DataFrame is a property of the function object. </p>

In [None]:
obj = ColumnSummary(data=tdf, target_columns="0:7")

# Note: The resulting DataFrame is accessed as a property of the function object.
obj.result

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>2.2 Create a Testing data set</b></p>

<p style = 'font-size:16px;font-family:Arial'>From our sample data we are selecting some "Customer ID" values for testing later.</p>

In [None]:
UK_Retail_Test_V = tdf.loc[tdf.CustomerID.isin(['17307', '12503', '18268', '12908', '13693'])]
UK_Retail_Test_V

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>3. Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial'>Vantage provides many in-database (inDb) feature engineering functions. In this section we will see how to prepare the data set for model training.  We will use standard SQL and various "Fit" functions to create the input that the <a href = 'https://docs.teradata.com/search/all?query=Python+ColumnTransformer&content-lang=en-US'>ColumnTransformer</a> function needs to execute a bulk transformation.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Create a per-customer grouping of data</li>
    <li>Create Fit Tables
        <ul><li>Remove Outliers</li>
            <li>Impute Missing Values</li>
            <li>Create New Numeric Features</li>
            <li>Rescale the Data Set</li>
        </ul></li>
    <li>Call the final Transformation function</li>
    </ol>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>3.1 Create a per-customer table</b></p>

<p style = 'font-size:16px;font-family:Arial'>First let's isolate the training data without the holdouts.</p>

<p style = 'font-size:16px;font-family:Arial'>Simple GROUP BY, exclude the testing IDs.<br> Note there are 4367 unique customers in this training set.</p>

In [None]:
Customer_ID_Group_V = tdf.loc[~tdf.CustomerID.isin(['17307', '12503', '18268', '12908', '13693'])] \
    .select(["CustomerID","Quantity","UnitPrice","StockCode"]) \
    .groupby("CustomerID") \
    .agg({'Quantity':['sum'],'UnitPrice':['sum'],'StockCode':['unique']})

Customer_ID_Group_V

<p style = 'font-size:16px;font-family:Arial'>Rename columns using DataFrame.assign()</p>

In [None]:
Customer_ID_Group_V = Customer_ID_Group_V.assign(
                      CustomerID=Customer_ID_Group_V.CustomerID,  
                      TotalQuantity=Customer_ID_Group_V.sum_Quantity, 
                      TotalPrice=Customer_ID_Group_V.sum_UnitPrice, 
                      TotalItems=Customer_ID_Group_V.unique_StockCode, 
                      drop_columns=True)

Customer_ID_Group_V

In [None]:
Customer_ID_Group_V.count()
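
<p style = 'font-size:16px;font-family:Arial'>To make the per-customer grouping concrete, here is a small client-side sketch of the same idea in plain Python. It runs on invented transactions rather than the Vantage table, and it assumes that the 'unique' aggregate counts distinct StockCode values. The names <code>transactions</code>, <code>held_out</code>, and <code>per_customer</code> are hypothetical.</p>

```python
from collections import defaultdict

# Invented sample transactions (not taken from the demo data set).
transactions = [
    {"CustomerID": "A", "Quantity": 2, "UnitPrice": 1.5, "StockCode": "X1"},
    {"CustomerID": "A", "Quantity": 1, "UnitPrice": 4.0, "StockCode": "X2"},
    {"CustomerID": "A", "Quantity": 3, "UnitPrice": 1.5, "StockCode": "X1"},
    {"CustomerID": "B", "Quantity": 5, "UnitPrice": 2.0, "StockCode": "X3"},
]
held_out = {"B"}  # hypothetical test customers excluded from training

groups = defaultdict(lambda: {"TotalQuantity": 0, "TotalPrice": 0.0, "codes": set()})
for t in transactions:
    if t["CustomerID"] in held_out:
        continue  # skip the held-out customers
    g = groups[t["CustomerID"]]
    g["TotalQuantity"] += t["Quantity"]   # sum of Quantity
    g["TotalPrice"] += t["UnitPrice"]     # sum of UnitPrice
    g["codes"].add(t["StockCode"])        # distinct StockCode values

per_customer = {
    cid: {
        "TotalQuantity": g["TotalQuantity"],
        "TotalPrice": g["TotalPrice"],
        "TotalItems": len(g["codes"]),
    }
    for cid, g in groups.items()
}
print(per_customer)
# {'A': {'TotalQuantity': 6, 'TotalPrice': 7.0, 'TotalItems': 2}}
```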

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>3.2 Create Fit Tables</b></p>

<p style = 'font-size:16px;font-family:Arial'>Vantage <a href = 'https://docs.teradata.com/search/all?query=Python+Function+Reference+Feature+Engineering+Transform+Functions&content-lang=en-US'>Feature Engineering Transform Functions</a> use a "Fit and Transform" approach to make processing more modular and efficient.  "Fit tables" can be used as input to individual Transform functions, or passed together to a single <a href = 'https://docs.teradata.com/search/all?query=Python+ColumnTransformer&content-lang=en-US'>ColumnTransformer</a> function.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Fit outlier removal using <a href = 'https://docs.teradata.com/search/all?query=Python+OutlierFilterFit&content-lang=en-US'>OutlierFilterFit</a></li>
    <li>Fit a simple imputer to replace missing values using <a href = 'https://docs.teradata.com/search/all?query=Python+SimpleImputeFit&content-lang=en-US'>SimpleImputeFit</a></li>
    <li>Fit column calculations to create new features using <a href = 'https://docs.teradata.com/search/all?query=Python+NonLinearCombineFit&content-lang=en-US'>NonLinearCombineFit</a></li>
    <li>Call <a href = 'https://docs.teradata.com/search/all?query=ColumnTransformer&content-lang=en-US'>ColumnTransformer</a> to execute the transformations (to allow for Scaling)</li>
    <li>Rescale the data using <a href = 'https://docs.teradata.com/search/all?query=Python+ScaleFit&content-lang=en-US'>ScaleFit/Transform</a></li>
            </ul></td>
        <td><img src = 'images/fit_transform.png' width = '300' style='background-color:white'></td>
    </tr>
    </table>

<p style = 'font-size:16px;font-family:Arial'>The <b>OutlierFilterFit</b> function calculates the lower percentile, upper percentile, row count, and median for the specified input table columns. The <b>OutlierFilterTransform</b> function uses these per-column values to detect outliers in the input table.</p>

In [None]:
# Handle outliers
# Values below the 3rd or above the 97th percentile are replaced with the column median

outlierFit_CS = OutlierFilterFit(data=Customer_ID_Group_V,
                               target_columns=['TotalQuantity','TotalPrice'],
                               lower_percentile=0.03,
                               upper_percentile=0.97,
                               outlier_method="PERCENTILE",
                               replacement_value="MEDIAN",
                               percentile_method="PERCENTILECONT")
 
# Print the result DataFrame.
outlierFit_CS.result
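
<p style = 'font-size:16px;font-family:Arial'>The fit-then-transform idea behind percentile-based outlier handling can be sketched in plain Python as follows. This is a conceptual illustration only, not the Vantage implementation; the helper names, the wider 10th/90th percentile bounds, and the sample values are assumptions made so the effect is visible on ten numbers.</p>

```python
import statistics


def percentile_cont(sorted_values, fraction):
    """Percentile with linear interpolation between neighbouring values."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def outlier_fit(values, lower=0.03, upper=0.97):
    """'Fit' step: learn the bounds and the median once."""
    ordered = sorted(values)
    return {
        "lower": percentile_cont(ordered, lower),
        "upper": percentile_cont(ordered, upper),
        "median": statistics.median(ordered),
    }


def outlier_transform(values, fit):
    """'Transform' step: replace values outside the bounds with the median."""
    return [v if fit["lower"] <= v <= fit["upper"] else fit["median"] for v in values]


# Invented quantities with one extreme value.
quantities = [10, 12, 11, 13, 9, 14, 10, 12, 11, 5000]
fit = outlier_fit(quantities, lower=0.1, upper=0.9)
cleaned = outlier_transform(quantities, fit)
print(cleaned)
```

<p style = 'font-size:16px;font-family:Arial'>Because the bounds and median are stored in <code>fit</code>, the same values can later be applied to new data without recomputing them.</p>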
    

<p style = 'font-size:16px;font-family:Arial'><b>SimpleImputeFit </b>will output a table with the values that will be used to substitute the missing values.<br><b>SimpleImputeTransform</b> will return the input data set with the missing values filled in.</p>

In [None]:
# Impute Missing Values
# Replace any missing CustomerID with a specific value

ImputeFit_CS = SimpleImputeFit(data=Customer_ID_Group_V,
                              literals_columns="CustomerID",
                              literals="19000")

# Note that this function uses "output" and not "result"
ImputeFit_CS.output

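<p style = 'font-size:16px;font-family:Arial'>The <b>NonLinearCombineFit</b> function returns the target columns and the specified formula, which defines a new feature as a non-linear combination of existing features. In the formula, X0, X1, and so on refer to the target columns in the order they are listed.</p>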


In [None]:
# Create a new column by multiplying quantity and price

NonLinearCombineFit_CS_TotalSales = NonLinearCombineFit(
                                    data = Customer_ID_Group_V,
                                    target_columns = ["TotalQuantity", "TotalPrice"],
                                    formula = "Y = X1*X0",
                                    result_column = "TotalSales")
    
# Print the result DataFrames.
NonLinearCombineFit_CS_TotalSales.result    

In [None]:
# Create another new column by dividing the total sales by the number of unique items
NonLinearCombineFit_CS_SalesPerItem = NonLinearCombineFit(
                                      data = Customer_ID_Group_V,
                                      target_columns = ["TotalQuantity","TotalPrice","TotalItems"],
                                      formula = "Y = (X0*X1)/X2",
                                      result_column = "SalesPerItem")

NonLinearCombineFit_CS_SalesPerItem.result
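
<p style = 'font-size:16px;font-family:Arial'>The two formulas above can be mimicked on the client with a short plain-Python sketch. It assumes that X0, X1, X2 map to the target columns in the order given; the function <code>nonlinear_combine</code> and the sample rows are hypothetical and exist only for this illustration.</p>

```python
def nonlinear_combine(rows, target_columns, func, result_column):
    """Add result_column = func(X0, X1, ...), where Xi is target_columns[i]."""
    out = []
    for row in rows:
        args = [row[name] for name in target_columns]
        out.append({**row, result_column: func(*args)})
    return out


# Invented per-customer rows.
customers = [
    {"CustomerID": 1, "TotalQuantity": 10, "TotalPrice": 25.0, "TotalItems": 5},
    {"CustomerID": 2, "TotalQuantity": 4, "TotalPrice": 8.0, "TotalItems": 2},
]

# Y = X1*X0
with_sales = nonlinear_combine(
    customers, ["TotalQuantity", "TotalPrice"], lambda x0, x1: x1 * x0, "TotalSales"
)
# Y = (X0*X1)/X2
with_both = nonlinear_combine(
    with_sales,
    ["TotalQuantity", "TotalPrice", "TotalItems"],
    lambda x0, x1, x2: (x0 * x1) / x2,
    "SalesPerItem",
)
for row in with_both:
    print(row["CustomerID"], row["TotalSales"], row["SalesPerItem"])
# 1 250.0 50.0
# 2 32.0 16.0
```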

<p style = 'font-size:16px;font-family:Arial'><b>The ColumnTransformer</b> function transforms the entire dataset in a single operation. You only need to provide the fit tables to the function, and it runs all the transformations that you require in one pass. Running all the fit table transformations together in one go gives approximately a 30% performance improvement over running each transformation sequentially.</p>

In [None]:
# Execute ColumnTransformer to build the data set

Transformed_Customer_ID_Group_VT = ColumnTransformer(
                                   input_data=Customer_ID_Group_V,
                                   outlierfilter_fit_data = outlierFit_CS.result,
                                   simpleimpute_fit_data = ImputeFit_CS.output,
                                   nonlinearcombine_fit_data=NonLinearCombineFit_CS_TotalSales.result,
                                   nonlinearcombine_fit_data1=NonLinearCombineFit_CS_SalesPerItem.result
                                   )

Transformed_Customer_ID_Group_VT.result

<p style = 'font-size:16px;font-family:Arial'><b>ScaleFit and ScaleTransform</b> rescale the specified input table columns. ScaleFit computes the statistics needed by the chosen scale method (for example range, mean, or standard deviation), and ScaleTransform applies them to the input columns.</p>

In [None]:
# ScaleFit/Transform to rescale the data
ScaleFit_CS = ScaleFit(data=Transformed_Customer_ID_Group_VT.result,
                       target_columns=["TotalQuantity","TotalItems","TotalPrice","TotalSales","SalesPerItem"],
                       scale_method="range")

ScaleFit_CS.output
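
<p style = 'font-size:16px;font-family:Arial'>As a conceptual sketch of the "range" method, the following plain-Python example learns the minimum and maximum of a column and then rescales values with (x - min) / (max - min). It assumes that this is what range scaling means here; the function names and sample rows are hypothetical, and the code does not call Vantage.</p>

```python
def range_fit(rows, columns):
    """'Fit' step: record (min, max) for each column."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}


def range_transform(rows, fit):
    """'Transform' step: rescale each fitted column with the stored (min, max)."""
    scaled = []
    for row in rows:
        new_row = dict(row)
        for col, (low, high) in fit.items():
            span = high - low
            new_row[col] = (row[col] - low) / span if span else 0.0
        scaled.append(new_row)
    return scaled


# Invented training rows.
train = [{"TotalQuantity": 10}, {"TotalQuantity": 20}, {"TotalQuantity": 30}]
scale_fit = range_fit(train, ["TotalQuantity"])

print(range_transform(train, scale_fit))
# [{'TotalQuantity': 0.0}, {'TotalQuantity': 0.5}, {'TotalQuantity': 1.0}]

# New data reuses the training fit, so values can fall outside 0..1.
print(range_transform([{"TotalQuantity": 40}], scale_fit))
# [{'TotalQuantity': 1.5}]
```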

In [None]:
Scaled_Transformed_Customer_ID_Group_VT = ScaleTransform(
                                          data=Transformed_Customer_ID_Group_VT.result,
                                          object=ScaleFit_CS.output,
                                          accumulate="CustomerID")

<p style = 'font-size:16px;font-family:Arial'>Let us look at the final values which we will now input to the model. </p> 

In [None]:
# Result DataFrame.
Scaled_Transformed_Customer_ID_Group_VT.result.head(5)

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>4. Build the K-means Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>As discussed above, the K-means algorithm takes a number of clusters "k", chooses a random starting point for each centroid, and iterates until a hard limit, or an optimum value is reached.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Finding an Ideal value for K</b></p>
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<p style = 'font-size:16px;font-family:Arial'>The example below uses 5 as the number of clusters.  Typically, data scientists build the model with a range of values for "k" and plot the "WCSS" (Within Cluster Sum-of-Squares) value against each k.  The "elbow" point, where the curve stops dropping steeply, is usually a good value for k.  The <a href = 'https://docs.teradata.com/search/all?query=Python+KMeans&content-lang=en-US'>KMeans</a> function returns this value as "TotalWithinSS : ###" in a row of the "td_modelinfo_kmeans" column.</p></td>
        <td><img src = 'images/WCSS_elbow.png' width = '300'></td>
    </tr>
    </table>
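
<p style = 'font-size:16px;font-family:Arial'>To show what the WCSS value measures, here is a minimal plain-Python sketch that computes it for a few hand-picked cluster assignments of four invented points. In practice the assignments would come from running K-means for each k; the helper names and data are assumptions for illustration only.</p>

```python
def centroid(points):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, c))
    return total


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
# Hand-picked assignments for k = 1, 2 and 4.
assignments = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 4: [0, 1, 2, 3]}
for k, labels in assignments.items():
    print(k, wcss(points, labels))
# 1 104.0
# 2 4.0
# 4 0.0
```

<p style = 'font-size:16px;font-family:Arial'>The value drops sharply from k=1 to k=2 and only slightly afterwards, so k=2 would be the "elbow" for this toy data.</p>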

<p style = 'font-size:16px;font-family:Arial'><b>Other Function Parameters Include (but are not limited to)</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Input Table</li>
    <li>StopThreshold - The algorithm converges if the distance between the centroids from the previous iteration and the current iteration is less than the specified value.</li>
    <li>MaxIterNum - The maximum number of iterations for the K-means algorithm. The algorithm stops after performing the specified number of iterations even if the convergence criterion is not met.</li>
    </ul>

In [None]:
number_clusters = 5

KMeans_Model = KMeans(id_column="CustomerID",
                      target_columns=["TotalQuantity","TotalItems","TotalPrice","TotalSales","SalesPerItem"],
                      data=Scaled_Transformed_Customer_ID_Group_VT.result,
                      threshold=0.0395,
                      num_clusters=number_clusters,
                      #seed=0,
                      iter_max=500
                      #, output_cluster_assignment=True
                     )

KMeans_Model.result

<p style = 'font-size:16px;font-family:Arial'>From the generated model we can see, among other details, how many IDs fall in each cluster and how many iterations the model took to converge.</p>
    

In [None]:
top_clusters = KMeans_Model.result[["td_clusterid_kmeans","td_size_kmeans"]].to_pandas().head()
top_clusters

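<p style = 'font-size:16px;font-family:Arial'>The next cell plots the cluster sizes on the client with matplotlib. Alternatively, you can use the <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide/Plotting-in-teradataml'>Plot</a> function to create the image within the database.</p>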

In [None]:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

ax.bar(top_clusters.td_clusterid_kmeans, top_clusters.td_size_kmeans, width=1, edgecolor="white", linewidth=0.7)

ax.set(xlabel="cluster ids",
       ylabel="count of ids",
      title="Number of IDs per Cluster")

plt.show()

In [None]:
Scaled_Transformed_Customer_ID_Group_VT.result

<p style = 'font-size:16px;font-family:Arial'>The bar chart above shows the number of IDs in each cluster, where the cluster IDs 0-4 are generated by the KMeans function.<br><br>Let us now visualize the clusters. We used five columns to create them: 'TotalQuantity', 'TotalPrice', 'TotalItems', 'TotalSales', and 'SalesPerItem'.<br><br>To plot in a 2-D plane, that is, with X and Y coordinates, we will use two of these columns at a time.</p>

In [None]:
KMeansPredict_out = KMeansPredict(
    object=KMeans_Model.result,
    data=Scaled_Transformed_Customer_ID_Group_VT.result,
    output_distance=True,
    #accumulate=["TotalQuantity","TotalItems","TotalPrice","TotalSales","SalesPerItem"]
    accumulate="1:5"
)

KMeansPredict_out.result

In [None]:
KMeans_predicted = KMeansPredict_out.result.head(5).to_pandas()
KMeans_predicted

In [None]:
import matplotlib as mpl
from matplotlib.markers import MarkerStyle

In [None]:
fig, ax = plt.subplots()

colors = ["b","g","r","c","y"]
markers = ["o","*","s","P","^"]
grouped = KMeans_predicted.td_clusterid_kmeans.unique()

idcolor=dict()
idmarker=dict()
idtracker=list()

for n, kmeansid in enumerate(grouped):
    idcolor[kmeansid]=colors[n]
    idmarker[kmeansid]=markers[n]
    
for n, kmeansid in enumerate(KMeans_predicted.td_clusterid_kmeans):     
    # .iloc selects by position, so this works whatever index to_pandas() assigned
    ax.scatter(KMeans_predicted.TotalQuantity.iloc[n], 
               KMeans_predicted.TotalPrice.iloc[n], 
               color=idcolor[kmeansid],
               marker=idmarker[kmeansid],
               label="" if kmeansid in idtracker else kmeansid,
               s=50
              )
    idtracker.append(kmeansid)
    
ax.set(xlabel="TotalQuantity",
       ylabel="TotalPrice",
      title="TotalQuantity v. TotalPrice Cluster")

ax.legend(title="ClusterID")
plt.show()

<p style = 'font-size:16px;font-family:Arial'>In the chart above we have plotted TotalQuantity against TotalPrice to see how the clusters look in a 2-D plane, where the X axis is TotalQuantity and the Y axis is TotalPrice. Note that the chart is limited to two dimensions only for visualization, whereas the clusters were created from all 5 columns.</p>

In [None]:
fig, ax = plt.subplots()

colors = ["b","g","r","c","y"]
markers = ["o","*","s","P","^"]
grouped = KMeans_predicted.td_clusterid_kmeans.unique()

idcolor=dict()
idmarker=dict()
idtracker=list()

for n, kmeansid in enumerate(grouped):
    idcolor[kmeansid]=colors[n]
    idmarker[kmeansid]=markers[n]
    
for n, kmeansid in enumerate(KMeans_predicted.td_clusterid_kmeans):     
    # .iloc selects by position, so this works whatever index to_pandas() assigned
    ax.scatter(KMeans_predicted.TotalQuantity.iloc[n], 
               KMeans_predicted.TotalSales.iloc[n], 
               color=idcolor[kmeansid],
               marker=idmarker[kmeansid],
               label="" if kmeansid in idtracker else kmeansid,
               s=50
              )
    idtracker.append(kmeansid)
    
ax.set(xlabel="TotalQuantity",
       ylabel="TotalSales",
      title="TotalQuantity v. TotalSales Cluster")

ax.legend(title="ClusterID")
plt.show()

<p style = 'font-size:16px;font-family:Arial'>In the chart above we have plotted TotalQuantity against TotalSales. Notice how it differs from the TotalQuantity vs. TotalPrice chart. You can change the columns used for the x-axis and y-axis to see how the clusters look with the remaining columns.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Bulk Transformation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, the Fit tables created above will be passed to a single <a href = 'https://docs.teradata.com/search/all?query=Python+ColumnTransformer&content-lang=en-US'>ColumnTransformer</a> function.  This is similar to an operational approach, where a single query will prepare new or incoming data for immediate analysis.</p>

<img src = 'images/column_transformer.png' width = '300' style='background-color:white'>

<p style = 'font-size:16px;font-family:Arial'>Now let's review the test data.</p>

In [None]:
UK_Retail_Test_V

<p style = 'font-size:16px;font-family:Arial'>Again, we'll create the aggregated DataFrame for the test data.</p>

In [None]:
UK_Retail_Test_Group_V = UK_Retail_Test_V \
    .select(["CustomerID","Quantity","UnitPrice","StockCode"]) \
    .groupby("CustomerID") \
    .agg({'Quantity':['sum'],'UnitPrice':['sum'],'StockCode':['unique']})

UK_Retail_Test_Group_V

<p style = 'font-size:16px;font-family:Arial'>And then modify the names of the columns to match the training data set for the saved transforms. </p>

In [None]:
UK_Retail_Test_Group_V = UK_Retail_Test_Group_V.assign(
                         CustomerID=UK_Retail_Test_Group_V.CustomerID,
                         TotalQuantity=UK_Retail_Test_Group_V.sum_Quantity,
                         TotalPrice=UK_Retail_Test_Group_V.sum_UnitPrice,
                         TotalItems=UK_Retail_Test_Group_V.unique_StockCode,
                         drop_columns=True)

UK_Retail_Test_Group_V

<p style = 'font-size:16px;font-family:Arial'>Now all the transformations can be applied together in a single statement using the previously created fit tables. Remember that these fit tables can be saved permanently and reused for batch or real-time scoring.</p>

In [None]:
Scaled_Transformed_Test_Group_V = ColumnTransformer(
                                  input_data=UK_Retail_Test_Group_V,
                                  outlierfilter_fit_data = outlierFit_CS.result,
                                  simpleimpute_fit_data = ImputeFit_CS.output,
                                  nonlinearcombine_fit_data=NonLinearCombineFit_CS_TotalSales.result,
                                  nonlinearcombine_fit_data1=NonLinearCombineFit_CS_SalesPerItem.result,
                                  scale_fit_data=ScaleFit_CS.output)

Scaled_Transformed_Test_Group_V.result.head(5)
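
<p style = 'font-size:16px;font-family:Arial'>The idea of applying several saved fits to new data in one call can be sketched in plain Python as below. This is only a conceptual model of a bulk transformer, not the Vantage ColumnTransformer; every function name, the stored fit values, and the sample row are hypothetical.</p>

```python
def impute_literal(row, fit):
    """Fill missing values with the literal stored in the fit."""
    return {k: (fit.get(k, v) if v is None else v) for k, v in row.items()}


def add_product(row, fit):
    """Add a column holding the product of two existing columns."""
    left, right = fit["columns"]
    return {**row, fit["result"]: row[left] * row[right]}


def scale_range(row, fit):
    """Min-max scale each fitted column using the stored (min, max)."""
    return {
        k: ((v - fit[k][0]) / (fit[k][1] - fit[k][0]) if k in fit else v)
        for k, v in row.items()
    }


def column_transformer(rows, steps):
    """Apply every (transform, fit) pair to each row in a single pass."""
    out = []
    for row in rows:
        for transform, fit in steps:
            row = transform(row, fit)
        out.append(row)
    return out


# Fits that would have been learned earlier on training data (invented values).
steps = [
    (impute_literal, {"CustomerID": "19000"}),
    (add_product, {"columns": ("TotalQuantity", "TotalPrice"), "result": "TotalSales"}),
    (scale_range, {"TotalQuantity": (10, 30), "TotalSales": (0.0, 200.0)}),
]
new_rows = [{"CustomerID": None, "TotalQuantity": 20, "TotalPrice": 5.0}]
transformed = column_transformer(new_rows, steps)
print(transformed)
# [{'CustomerID': '19000', 'TotalQuantity': 0.5, 'TotalPrice': 5.0, 'TotalSales': 0.5}]
```

<p style = 'font-size:16px;font-family:Arial'>Keeping the fits as data separate from the transform logic is what lets the same preparation be replayed on any new batch.</p>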

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Predict and Evaluate</b></p>

<p style = 'font-size:16px;font-family:Arial'>Finally, we run the model against new (in this case testing) data using <a href = 'https://docs.teradata.com/search/all?query=Python+KMeansPredict&content-lang=en-US'>KMeansPredict</a>.  The preparation step has been completed in a single query above.  Additionally, we will use an evaluation function <a href = 'https://docs.teradata.com/search/all?query=Python+Silhouette&content-lang=en-US'>Silhouette</a> to analyze how well the new cluster predictions match the original model.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Execute KMeansPredict</li>
    <li>Inspect the results</li>
    <li>Use Silhouette to evaluate the output</li>
    </ol>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>6.1 Call the Prediction Function</b></p>

<p style = 'font-size:16px;font-family:Arial'>Pass the Input Data, the existing KMeans Model, and other parameters including columns to accumulate.  Note here we create a new View to assist with Silhouette analysis of the prediction.</p>

In [None]:
KMeans_Output_V = KMeansPredict(
                  object=KMeans_Model.result,
                  data=Scaled_Transformed_Test_Group_V.result,
                  output_distance=True,
                  accumulate="1:5"
                  )

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>6.2 Inspect the Results</b></p>

<p style = 'font-size:16px;font-family:Arial'>The result property of the output holds the predictions.</p>

In [None]:
KMeans_Output_V.result

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>6.3 Evaluate the Prediction</b></p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs.teradata.com/search/all?query=Python+Silhouette&content-lang=en-US'>Silhouette</a> is a native Vantage function that evaluates the similarity of an object to its cluster (cohesion) compared to other clusters (separation).  The silhouette scores and their meanings are as follows:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>1: Data is appropriately clustered</li>
    <li>-1: Data is not appropriately clustered</li>
    <li>0: Datum is on the border of two natural clusters</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial'>See the documentation for a full listing of parameters and return values.</p>

In [None]:
from teradataml import Silhouette
Silhouette_result1 = Silhouette(id_column="CustomerID",
                                cluster_id_column="td_clusterid_kmeans",
                                target_columns="3:7",
                                output_type="SCORE",
                                data=KMeans_Output_V.result)
 
# Print the result DataFrame.
Silhouette_result1.result
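
<p style = 'font-size:16px;font-family:Arial'>To make the score definitions above concrete, the following plain-Python sketch computes the standard per-point silhouette value, (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the lowest mean distance to any other cluster. It illustrates the metric on invented points and is not the Vantage Silhouette implementation; the function name and data are assumptions.</p>

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette value per point; needs at least two clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            scores.append(0.0)  # single-member cluster: score 0 by convention
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # cohesion
        b = min(                                         # separation
            sum(d) / len(d) for lab, d in by_cluster.items() if lab != own
        )
        scores.append((b - a) / (max(a, b) or 1.0))
    return scores


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_scores(points, [0, 0, 1, 1])  # neighbours share a cluster
bad = silhouette_scores(points, [0, 1, 0, 1])   # neighbours are split apart
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

<p style = 'font-size:16px;font-family:Arial'>The sensible assignment scores close to 1 for every point, while the split assignment scores below 0.</p>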

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Clean up</b></p>

<p style = 'font-size:20px;font-family:Arial'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Retail');"  # takes about 5 seconds, optional if you want to use the data later

In [None]:
remove_context()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 8. Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial'>In this notebook we have seen some of the new in-database (inDb) functions in Teradata Vantage ClearScape Analytics, and how to build a K-means clustering model from the transformed data.</p>

<p style = 'font-size:20px;font-family:Arial'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
        <li>Teradata Analytic Function Reference:
        <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview'>
        https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Analytics-Database-Analytic-Functions-Overview</a></li>
  
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>