<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       ClearScape Analytics for Customer Segmentation using K-means Clustering and Data Preparation Pipelines
  <br>
       <img id="teradata-logo" src="../../images/TeradataLogo.png" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<hr>

<br>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Leverage ClearScape Analytics for efficient and highly scalable data preparation, model training, and evaluation workflows</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>K-means clustering is one of the most popular <b>unsupervised</b> machine learning algorithms.  Essentially, the algorithm groups similar data points together by minimizing the distance between each data point and the center (centroid) of its cluster.  Each centroid is the mean of the points assigned to it, which is the "means" in K-means.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Define the number of clusters (k)</li>
                <br>
                <li>The algorithm chooses random points as centroids</li>
                <br>
                <li>Each iteration attempts to optimize the centroid locations</li>
                <br>
                <li>Iterations end once the distances have stabilized or the max iteration count is reached</li>
            </ol>
        </td>
        <td><img src = 'images/K-means_convergence.gif' width = '250'></td>
    </tr>
</table>
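
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The four steps above can be sketched in a few lines of plain Python.  This is only an illustration of the general technique (Lloyd's algorithm) on a small in-memory list; it is not the Vantage implementation, and the function name <code>kmeans</code>, its parameters, and the demo points are assumptions made for this example.</p>

```python
import math
import random

def kmeans(points, k, max_iter=100, threshold=1e-4, seed=0):
    """Illustrative Lloyd's-algorithm sketch; not the in-database KMeans function."""
    rng = random.Random(seed)
    # Step 2: choose k distinct input points as the starting centroids
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # Step 4: stop once no centroid moved more than the threshold
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    # Final assignment against the converged centroids
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
    return centroids, labels

# Two small, well-separated groups of hypothetical points
demo_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
demo_centroids, demo_labels = kmeans(demo_points, k=2)
print(demo_labels)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the starting centroids are random, the cluster numbers themselves are arbitrary; only the grouping of points is meaningful.</p>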

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One limitation of this algorithm is that it only accepts numeric data as feature input (categorical clustering can be performed using the K-modes algorithm).  Typically, data engineers or data scientists will perform multiple <b>serial</b> steps to prepare a numeric-only data set that can be passed to the K-means algorithm.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ClearScape Analytics provides native "Fit and Transform" functions to assist in data preparation and transformation at scale.  To aid in efficiency and operationalization, Vantage provides a bulk <b>Column Transformer</b> function which can take multiple transformation directives at the same time and act on the whole data set at once.  This simplifies both the process and the code, allowing more streamlined and robust operational deployment.</p>

<img src = 'Flow_Diagram_KMeans.png' width = 100%>
<hr>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Live Demonstration</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data for this demonstration is based on an online purchase history data set, which can be found <a href = 'https://www.kaggle.com/code/hellbuoy/online-retail-k-means-hierarchical-clustering/data'>here</a>.  The goal is to segment the customers by purchase volume and value.  The steps are as follows:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Analyze the raw data, split a testing set</li>
                <br>
                <li>Engineer numeric features</li>
                <br>
                <li>Build the K-means model</li>
                <br>
                <li>Apply in-line transformation to the testing set</li>
                <br>
                <li>Make Predictions and evaluate model accuracy</li>
            </ol>
        </td>
        <td><img src = 'images/clustering_img.png' width = '250'></td>
    </tr>
</table>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Imports and Connection</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Import required packages and create a connection context to Vantage.</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import json
from teradataml import *
display.suppress_vantage_runtime_warnings = True


from IPython.display import display as ipydisplay
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# load vars json
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

eng = create_context(host=host, username=username, password=password)

eng.execute(f'''SET SESSION COMPUTE GROUP {session_vars['hierarchy']['users']['business_users'][1]['compute_group']}''')

# confirm connection
print(eng)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 1 - Data Preparation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we will inspect the original data set, and perform various preparation tasks.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the rows of the table</li>
    <li>Inspect the column metadata using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Exploration-Functions/TD_ColumnSummary'>ColumnSummary</a></li>
    <li>Split off a testing data set to be used in evaluation</li>
    </ol>
    

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.1 - Inspect the Data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a teradataml DataFrame that references the table, then show its shape and first rows.</p>

In [None]:
tdf_retail_data = DataFrame('"demo_ofs"."UK_Retail_Data"')

In [None]:
ipydisplay(tdf_retail_data.shape)
ipydisplay(tdf_retail_data.head(5))

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.2 View Column information</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Exploration-Functions/TD_ColumnSummary'>ColumnSummary</a> provides more details on column values and ranges</p>

In [None]:
from teradataml import ColumnSummary
ColumnSummary(data = tdf_retail_data, target_columns = tdf_retail_data.columns).result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>1.3 Create a Testing data set</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Select several "Customer ID" values for testing later.  This uses the SQLAlchemy ClauseElement Expression capabilities in the <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/teradataml-Extension-with-SQLAlchemy/Using-SQLAlchemy-Clause-Element-and-Expression/Using-Basic-SQLAlchemy-ClauseElement-and-Expression-for-Filtering'>teradataml package</a>.</p>

In [None]:
in_expr = tdf_retail_data['CustomerID'].expression.in_(['17307', '12503', '18268', '12908', '13693'])

tdf_retail_test = tdf_retail_data[in_expr]

In [None]:
tdf_retail_test

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 2 - Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This section will illustrate how to prepare the data set for model training.  We will use various "Fit" functions to create input dataframes for the <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> function to take as input in order to execute a bulk transformation.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create a per-customer grouping of data</li>
    <li>Create Fit Tables
        <ul><li>Remove Outliers</li>
            <li>Impute Missing Values</li>
            <li>Create New Numeric Features</li>
            <li>Rescale the Data Set</li>
        </ul></li>
    <li>Call the final Transformation function</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.1 - Create a per-customer table</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Simple GROUP BY, exclude the testing IDs.  Note there are 3930 unique customers in this training set.</p>

In [None]:
# Filter the tdf to exclude the customer IDs in our "test" set

notin_expr = tdf_retail_data.CustomerID.expression.notin_(['17307', '12503', '18268', '12908', '13693'])

tdf_train = tdf_retail_data[notin_expr].groupby('CustomerID').agg({'Quantity':['sum'], 
                                                                   'UnitPrice':['sum'], 
                                                                   'StockCode':['count']})
tdf_train.head(5)

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.2 Create Fit Tables</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Vantage <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions'>Feature Engineering Transform Functions</a> use a "Fit and Transform" approach to make processing more modular and efficient.  "Fit tables" can be used as input to either individual Transform functions, or passed to a single <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> function.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Fit outlier removal using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Cleaning-Functions/Handling-Outliers/TD_OutlierFilterFit'>OutlierFilterFit</a></li>
    <li>Fit a simple imputer to replace missing values using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Data-Cleaning-Functions/Handling-Missing-Values/TD_SimpleImputeFit'>SimpleImputeFit</a></li>
    <li>Fit column calculations to create new features using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_NonLinearCombineFit'>NonLinearCombineFit</a></li>
    <li>Call <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> to execute the transformations (to allow for Scaling)</li>
    <li>Rescale the data using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ScaleFit'>ScaleFit/Transform</a></li>
            </ul></td>
        <td><img src = 'images/fit_transform.png' width = '300'></td>
    </tr>
    </table>
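
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the "Fit and Transform" idea concrete, the sketch below separates the two phases for outlier handling in plain Python: the fit step learns percentile bounds and a median from training values only, and the transform step applies those stored values to any later data.  It is a conceptual illustration, not the Vantage functions themselves.  The helper names, the linear-interpolation percentile, and the sample values are assumptions for this example.</p>

```python
import statistics

def percentile(sorted_values, q):
    """Linearly interpolated percentile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def fit_outlier_bounds(train_values, lower=0.03, upper=0.97):
    """'Fit': learn the bounds and the replacement median from training data only."""
    ordered = sorted(train_values)
    return {
        "low": percentile(ordered, lower),
        "high": percentile(ordered, upper),
        "median": statistics.median(ordered),
    }

def transform_outliers(values, fit):
    """'Transform': replace values outside the learned bounds with the learned median."""
    return [v if fit["low"] <= v <= fit["high"] else fit["median"] for v in values]

# Hypothetical training column: 1.0, 2.0, ..., 101.0
train_values = [float(v) for v in range(1, 102)]
outlier_fit = fit_outlier_bounds(train_values)

# New rows are cleaned with the statistics learned from the training column
cleaned = transform_outliers([2.0, 50.0, 500.0], outlier_fit)
print(cleaned)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping the learned statistics separate from the data is what allows the same preparation to be replayed later on unseen rows without recomputing anything from them.</p>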

In [None]:
ft_outlier = OutlierFilterFit(data = tdf_train,
                             target_columns = ['sum_Quantity', 'sum_UnitPrice'], 
                             lower_percentile = 0.03, 
                             upper_percentile = 0.97,
                             percentile_method = 'PercentileCont', 
                             replacement_value = 'Median')

ft_impute = SimpleImputeFit(data = tdf_train, 
                            stats_columns = 'sum_Quantity',
                            literals_columns = 'CustomerID', 
                            literals = '19000', 
                            stats = 'MIN')

ft_nlc_TotSales = NonLinearCombineFit(data = tdf_train, 
                            target_columns = ['sum_Quantity', 'sum_UnitPrice'], 
                            formula = 'Y = X0*X1', 
                            result_column = 'TotalSales')

ft_nlc_SalesPer = NonLinearCombineFit(data = tdf_train, 
                                     target_columns = ['sum_Quantity', 'sum_UnitPrice', 'count_StockCode'], 
                                     formula = 'Y = (X0*X1)/X2', 
                                     result_column = 'SalesPerItem')

tdf_Transformed = ColumnTransformer(input_data = tdf_train, 
                                   outlierfilter_fit_data = ft_outlier.result, 
                                   simpleimpute_fit_data = ft_impute.output,
                                   nonlinearcombine_fit_data = ft_nlc_TotSales.result)

tdf_train_Transformed = ColumnTransformer(input_data = tdf_Transformed.result, 
                                         nonlinearcombine_fit_data = ft_nlc_SalesPer.result)

tdf_train_Transformed.result.sample(2)

In [None]:
ft_rescale = ScaleFit(data = tdf_train_Transformed.result, 
                     target_columns = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'],
                     scale_method = 'range')

tdf_train_scaled = ScaleTransform(data = tdf_train_Transformed.result, object = ft_rescale.output, accumulate = ['CustomerID']).result

tdf_train_scaled.sample(2)
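
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of range scaling, the sketch below assumes the <code>'range'</code> method corresponds to min-max scaling, (x - min) / (max - min), with the minimum and range learned from the training rows.  The helper names and sample rows are hypothetical, and this is plain Python rather than the in-database functions.</p>

```python
def fit_range(rows):
    """Learn the per-column minimum and range (max - min) from training rows."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    ranges = [(max(c) - min(c)) or 1.0 for c in columns]
    return mins, ranges

def transform_range(rows, mins, ranges):
    """Scale rows with previously learned statistics."""
    return [[(v - m) / r for v, m, r in zip(row, mins, ranges)] for row in rows]

# Hypothetical training rows with two numeric columns
train_rows = [[10.0, 100.0], [20.0, 300.0], [30.0, 500.0]]
mins, ranges = fit_range(train_rows)

train_rows_scaled = transform_range(train_rows, mins, ranges)
# A later row reuses the training statistics, so it may fall outside [0, 1]
new_rows_scaled = transform_range([[40.0, 200.0]], mins, ranges)
print(train_rows_scaled, new_rows_scaled)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Scaling matters for K-means because distances are computed across all features; without it, the column with the largest raw values would dominate the clustering.</p>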

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 3 - Build the K-means Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As discussed above, the K-means algorithm takes a number of clusters "k", chooses a random starting point for each centroid, and iterates until either a hard iteration limit or an optimum value is reached.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Finding an Ideal value for K</b></p>
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The example below builds the model with 5 clusters.  Typically, data scientists build the model with several values of "k" and plot the "WCSS" (Within Cluster Sum-of-Squares) value for each one.  The "elbow" point, where the curve stops dropping steeply and begins to flatten, is usually a good value for k.  The <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Training-Functions/TD_KMeans'>KMeans</a> function returns this value as "TotalWithinSS : ###" in a row of the "td_modelinfo_kmeans" column.</p></td>
        <td><img src = 'images/WCSS_elbow.png' width = '300'></td>
    </tr>
    </table>
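
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The sketch below shows how WCSS is computed and one simple heuristic for locating the elbow: picking the k where the curve bends most sharply (the largest second difference).  The helper names are hypothetical, and the WCSS series is made up purely for illustration; it is not output from this data set.</p>

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squares: squared distance of each point to its own centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )

def elbow_k(k_values, wcss_values):
    """Return the k with the largest second difference in WCSS (sharpest bend)."""
    bends = [
        wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        for i in range(1, len(wcss_values) - 1)
    ]
    return k_values[bends.index(max(bends)) + 1]

# Tiny hypothetical clustering: two points around (1, 0), one point at (10, 0)
example_wcss = wcss([(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)],
                    [0, 0, 1],
                    [(1.0, 0.0), (10.0, 0.0)])

# Made-up WCSS values for k = 1..6, for illustration only
k_values = [1, 2, 3, 4, 5, 6]
wcss_values = [1000.0, 600.0, 200.0, 180.0, 165.0, 155.0]
best_k = elbow_k(k_values, wcss_values)
print(example_wcss, best_k)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In practice the elbow is often judged visually from a plot, so a heuristic like this is best treated as a starting point rather than a definitive answer.</p>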

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Other Function Parameters Include (but are not limited to)</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Input dataframe</li>
    <li>StopThreshold (<code>threshold</code> in teradataml) - The algorithm converges if the distance between the centroids from the previous iteration and the current iteration is less than the specified value.</li>
    <li>MaxIterNum (<code>iter_max</code> in teradataml) - Specifies the maximum number of iterations for the K-means algorithm. The algorithm stops after performing the specified number of iterations even if the convergence criterion is not met.</li>
    </ul>

In [None]:


kmeans_res = KMeans(data = tdf_train_scaled, 
                    id_column = 'CustomerID', 
                    target_columns = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'], 
                    num_clusters = 5, 
                    iter_max = 100, 
                    threshold=0.0295)
kmeans_res.result.to_pandas().sort_index()

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 4 - Bulk Transformation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, the Fit table objects created above will be passed to a single <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Feature-Engineering-Transform-Functions/TD_ColumnTransformer'>ColumnTransformer</a> function.  This is similar to an operational approach, where a single query will prepare new or incoming data for immediate analysis.</p>

<img src = 'images/column_transformer.png' width = '300'>

In [None]:
# Recall the customer IDs we held back in the beginning
tdf_retail_test.head(5)

In [None]:
# Perform the same groupby and aggregation as above
tdf_gb_test = tdf_retail_test.groupby('CustomerID').agg({'Quantity':['sum'], 
                                                        'UnitPrice':['sum'], 
                                                        'StockCode':['count']})

In [None]:
# Pass this to the columntransformer function

tdf_Transformed = ColumnTransformer(input_data = tdf_gb_test, 
                                   outlierfilter_fit_data = ft_outlier.result, 
                                   simpleimpute_fit_data = ft_impute.output,
                                   nonlinearcombine_fit_data = ft_nlc_TotSales.result)

tdf_test_Transformed = ColumnTransformer(input_data = tdf_Transformed.result, 
                                         nonlinearcombine_fit_data = ft_nlc_SalesPer.result)

tdf_test_scaled = ScaleTransform(data = tdf_test_Transformed.result, object = ft_rescale.output, accumulate = ['CustomerID']).result


tdf_test_scaled.head(5)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 5 - Predict and Evaluate</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Finally, we run the model against new (in this case testing) data using <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Scoring-Functions/TD_KMeansPredict'>KMeansPredict</a>.  The preparation step has been completed in a single query above.  Additionally, we will use an evaluation function <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Evaluation-Functions/TD_Silhouette'>Silhouette</a> to analyze how well the new cluster predictions match the original model.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Call KMeansPredict</li>
    <li>Inspect the results</li>
    <li>Call Silhouette on the output</li>
    </ol>
    
<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 - Call the Prediction Function</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Pass the Input Data, Model Table, and other parameters including columns to accumulate.  Note that the prediction result is also saved to a new table to assist with Silhouette analysis of the prediction.</p>

In [None]:

kmeans_prediction = KMeansPredict(data = tdf_test_scaled, 
                                  object = kmeans_res.model_data, 
                                  output_distance = True, 
                                  accumulate = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'])

copy_to_sql(kmeans_prediction.result, table_name = 'kmeans_pred', if_exists = 'replace')
kmeans_prediction.result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 - Evaluate the Prediction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Evaluation-Functions/TD_Silhouette'>Silhouette</a> is a native Vantage function that evaluates the similarity of an object to its own cluster (cohesion) compared to other clusters (separation).  The silhouette scores and their definitions are as follows:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>1: Data is appropriately clustered</li>
    <li>-1: Data is not appropriately clustered</li>
    <li>0: Datum is on the border of two natural clusters</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>See the documentation for a full listing of parameters and return values.</p>
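
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition, the standard silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster.  The plain-Python sketch below illustrates that textbook definition on four hypothetical points; the function name is an assumption, and the Vantage function may differ in its options and output layout.</p>

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette coefficient using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        # Singleton clusters, or a single overall cluster, get a score of 0
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

# Two tight pairs of hypothetical points, far apart from each other
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_scores = silhouette_scores(pts, [0, 0, 1, 1])  # labels follow the natural pairs
bad_scores = silhouette_scores(pts, [0, 1, 0, 1])   # labels cut across the pairs
print(good_scores, bad_scores)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The natural labeling yields scores close to 1, while the labeling that cuts across the pairs yields negative scores, matching the interpretation listed above.</p>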

In [None]:
tdf_prediction = DataFrame('kmeans_pred')

In [None]:

res_sil = Silhouette(data = tdf_prediction, 
                     id_column = 'CustomerID', 
                     cluster_id_column = 'td_clusterid_kmeans', 
                    target_columns = ['sum_Quantity', 'count_StockCode', 'sum_UnitPrice', 'TotalSales', 'SalesPerItem'])

res_sil.result

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Clean Up</b></p>

In [None]:
db_drop_table('kmeans_pred')

In [None]:
remove_context()