<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Optimize Customer Segmentation using ClearScape Analytic Functions and Open-Source Language Models
  <br>
       <img id="teradata-logo" src="images/TeradataLogo.png" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<hr>

<br>

<b style = 'font-size:24px;font-family:Arial;color:#00233C'>Leverage highly-scalable native processing functions to create ideal customer segments using word embeddings and clustering algorithms</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Vector embedding</b> is a numerical representation of data that captures semantic relationships and similarities, making it possible to perform mathematical operations and comparisons on the data for various tasks like text analysis and recommendation systems.</p>
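<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of what "comparisons on the data" can mean, the sketch below computes cosine similarity between toy vectors.  The three-dimensional vectors and the words attached to them are invented for this example; they are not taken from the GloVe model used later in this notebook.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, made up for illustration only
toy_vectors = {
    "refund":  [0.9, 0.1, 0.0],
    "return":  [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["refund"], toy_vectors["return"])
sim_unrelated = cosine_similarity(toy_vectors["refund"], toy_vectors["weather"])
print(f"refund vs return:  {sim_related:.3f}")
print(f"refund vs weather: {sim_unrelated:.3f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Vectors that point in a similar direction score close to 1, which is the property that lets a clustering algorithm group semantically related comments.</p>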

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>K-means clustering</b> is one of the most popular <b>unsupervised</b> machine learning algorithms.  Essentially, the algorithm groups similar data points together by minimizing the squared distance between each data point and its cluster's center (centroid), where each centroid is the mean (the "means" in K-means) of the points assigned to it.</p>
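<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the centroid and distance idea concrete, here is a small sketch that computes centroids and the within-cluster sum of squares for points that are already assigned to two clusters.  The points are hypothetical, and the helper names <code>centroid</code> and <code>within_cluster_ss</code> are chosen for this example only.</p>

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def within_cluster_ss(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, center))
        for point in points
    )

# Hypothetical 2-D points, already assigned to two clusters
cluster_a = [[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]]
cluster_b = [[10.0, 10.0], [12.0, 10.0]]

center_a = centroid(cluster_a)
center_b = centroid(cluster_b)
total_wcss = (within_cluster_ss(cluster_a, center_a)
              + within_cluster_ss(cluster_b, center_b))
print(center_a, center_b, total_wcss)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>K-means searches for the assignment of points to clusters that makes this total as small as possible for a given number of clusters.</p>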

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using <b>Teradata Vantage</b> and <b >ClearScape Analytics</b>, we can combine these advanced AI and ML techniques to <b>rapidly</b> find the ideal number of customer segments based on the semantic meaning of their comment history.  This segmentation can be used on its own for marketing and other tasks, or used in further predictive analytics use cases.</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Create a Vector Embedding table using open-source LLMs applied at scale in the database</li>
                <br>
                <li>Rapidly iterate over multiple K-means models, evaluating each</li>
                <br>
                <li>Visualize the experimental results to indicate the best number of clusters</li>
            </ol>
        </td>
        <td><img src = 'images/comparative_superlative_small.jpg' width = '250'></td>
        <td><img src = 'images/K-means_convergence.gif' width = '250'></td>
    </tr>
</table>

<hr>

In [None]:
from teradataml import *
from teradatasqlalchemy.types import VARCHAR
import pandas as pd
import csv
import os
import sqlite3, getpass

from IPython.display import display as ipydisplay
import matplotlib.pyplot as plt
%matplotlib inline


import warnings
warnings.filterwarnings('ignore')

display.suppress_vantage_runtime_warnings = True

dims = [f"v{i}" for i in range(1, 51)]


# Create a SQL connection to our SQLite database
con = sqlite3.connect("/home/jovyan/JupyterLabRoot/Teradata/Config/userinfo.db")
cur = con.cursor()

# Return all results of the connection profile table

cur.execute('SELECT * FROM connectionprofile;')
res = cur.fetchall()

# obtain the connection information for the _private connection
username = ''
host = ''
for ln in res:
    if '_private' in ln[0]: # we have the local connection context
        username = ln[3]
        host = ln[4]
if len(username) == 0: username = input('Please enter username: ')
if len(host) == 0: host = input('Please enter hostname or IP address: ')

print(username)
print(host)

con.close()


eng = create_context(host=host, username=username, password=getpass.getpass(f'Enter password for {username}: '))


# confirm connection
print(eng)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 1 - Data Preparation using an LLM to create a Vector Table</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we will inspect the original data set and use native vector embedding functions to generate features.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the rows of the Customer Comments table</li>
    <li>Inspect the GloVe Model table</li>
    <li>Use the TD_WordEmbeddings function to create the vector table</li>
    </ol>
    

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.1 - Inspect the Data</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Simple DataFrame methods to show the data</p>

In [None]:
tdf_comments = DataFrame('web_commentV')

ipydisplay(tdf_comments.head(2))

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.2 - Model table</b>

<p style = 'font-size:16px;font-family:Arial'>We format the model table as in the documentation: a column for the token, and a column for each dimension of the vector space. This example uses the GloVe 50-dimensional pre-trained embeddings. We filter out non-ASCII characters to comply with the function's requirements.</p>

In [None]:
tdf_glove_50d = DataFrame('glove_6B_50d_ft')
ipydisplay(tdf_glove_50d.sample(2))
ipydisplay(tdf_glove_50d.shape)

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.3 - Embeddings</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_WordEmbeddings can perform four operations: token-embedding, doc-embedding, token2token-similarity, and doc2doc-similarity.  We will use doc-embedding as the basis for our Segmentation.</p>
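<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One common way to turn token vectors into a single document vector is to average them.  The sketch below illustrates that idea on a made-up two-dimensional model; it is an assumption for illustration and does not claim to reproduce the exact aggregation, tokenization, or handling of unknown words that TD_WordEmbeddings applies.</p>

```python
def doc_embedding(text, token_vectors):
    """Average the vectors of the known tokens in `text`; unknown tokens are skipped."""
    tokens = text.lower().split()
    known = [token_vectors[t] for t in tokens if t in token_vectors]
    if not known:
        return None  # no token of this document appears in the model
    return [sum(dim) / len(known) for dim in zip(*known)]

# Hypothetical 2-dimensional token model, invented for this example
toy_model = {
    "great":   [1.0, 0.0],
    "service": [0.0, 1.0],
    "slow":    [-1.0, 0.0],
}

vec_positive = doc_embedding("Great service", toy_model)
vec_negative = doc_embedding("slow service today", toy_model)
print(vec_positive)
print(vec_negative)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each comment ends up as one fixed-length vector regardless of how many words it contains, which is the shape of input that a clustering function expects.</p>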

In [None]:
tdf_embeddings = WordEmbeddings(
                data=tdf_comments.iloc[:1000],
                model=DataFrame('glove_6B_50d_ft'),
                id_column="comment_id",
                model_text_column="doc_id",
                model_vector_columns=dims,
                primary_column="comment_text",
                operation="doc-embedding",
                accumulate=['comment_text', 'customer_id']
                ).result
tdf_embeddings.sample(2)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 2 - Find the Ideal K-means Model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As discussed above, the K-means algorithm takes a number of clusters "k", chooses a random starting point for each centroid, and iterates until a hard limit or an optimum value is reached.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Other Function Parameters Include (but are not limited to)</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Input dataframe</li>
    <li>StopThreshold - The algorithm converges if the distance between the centroids from the previous iteration and the current iteration is less than the specified value.</li>
    <li>MaxIterNum - The maximum number of iterations for the K-means algorithm. The algorithm stops after performing the specified number of iterations even if the convergence criterion is not met.</li>
    </ul>
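<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The two stopping rules can be sketched in plain Python as follows.  This is a simplified illustration, not the in-database implementation: the function name <code>kmeans_sketch</code> is hypothetical, the starting centroids are passed in explicitly instead of being chosen at random, and measuring convergence by the largest single centroid movement is an assumption.</p>

```python
import math

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sketch(points, initial_centroids, stop_threshold=0.0295, max_iter=100):
    """Plain-Python K-means loop showing both stopping rules."""
    centroids = [list(c) for c in initial_centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
        # Convergence check: largest centroid movement in this iteration
        shift = max(math.sqrt(squared_distance(o, n))
                    for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < stop_threshold:
            return centroids, iteration, True   # stopped by the threshold
    return centroids, max_iter, False           # stopped by the iteration limit

# Hypothetical 2-D points forming two obvious groups
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
centroids, iterations, converged = kmeans_sketch(points, [[1, 1], [2, 1]])
print(centroids, iterations, converged)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A smaller threshold or a larger iteration limit lets the loop run longer; lowering the iteration limit can end the run before the centroids have settled.</p>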
    
<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>2.1 - Example Model - 4-cluster test</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The example below uses an arbitrary number of clusters to create the first model.  Note that the output metadata provides information such as the number of iterations and whether the algorithm converged.</p>

In [None]:
KMeans(data = tdf_embeddings, 
        id_column = 'comment_id', 
        target_columns = dims, 
        num_clusters = 4, 
        iter_max = 100, 
        threshold=0.0295).result.to_pandas()[['td_withinss_kmeans', 'td_modelinfo_kmeans']].fillna('').sort_index()

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>2.2 - Finding an Ideal value for K</b>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '50%'>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Typically, data scientists build the model using several values for "k" and plot the "WCSS" (Within-Cluster Sum of Squares) value against each k.  The "elbow" point (where the curve flattens noticeably) is usually a good value for k.  The <a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Model-Training-Functions/TD_KMeans'>KMeans</a> function returns this value as "TotalWithinSS : ###" in a row of the "td_modelinfo_kmeans" column.</p>
<br>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the example below, we will iterate over this function using values from 2 to 8 for "k".  Because the native training function is highly scalable, we can perform this analysis very quickly.</p>
        </td>
        <td><img src = 'images/WCSS_elbow.png' width = '300'></td>
    </tr>
    </table>

In [None]:
a = []
for k in range(2,9):
    kmeans_model = KMeans(data = tdf_embeddings, 
                        id_column = 'comment_id', 
                        target_columns = dims, 
                        num_clusters = k, 
                        iter_max = 100, 
                        threshold=0.0295)
    # Sum the per-cluster within-cluster sum of squares (metadata rows hold NaN and are skipped)
    wss = kmeans_model.result.to_pandas()['td_withinss_kmeans'].sum()
    a.append([k, wss])
    print(f'{k} clusters, WCSS = {wss}')

df = pd.DataFrame(a, columns = ['k', 'WCSS']).set_index('k')

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Step 3 - Find the ideal number of Customer Segments</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A simple plot will show the "elbow" point indicating an ideal number of clusters or segments.</p>
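<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The elbow can also be estimated numerically.  The sketch below picks the k where the drop in WCSS slows down the most (the largest second difference).  This heuristic is one of several possible choices, it assumes evenly spaced values of k, and the WCSS numbers shown are invented for illustration rather than taken from the models above.</p>

```python
def elbow_by_second_difference(ks, wcss):
    """Return the k where the WCSS curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbor on one side, so they are skipped
    for i in range(1, len(ks) - 1):
        drop_before = wcss[i - 1] - wcss[i]
        drop_after = wcss[i] - wcss[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical WCSS values for k = 2..8, made up for this example
toy_ks = [2, 3, 4, 5, 6, 7, 8]
toy_wcss = [900.0, 600.0, 400.0, 370.0, 350.0, 335.0, 325.0]

suggested_k = elbow_by_second_difference(toy_ks, toy_wcss)
print(f"Suggested number of clusters: {suggested_k}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With the results collected in this notebook, the same function could be called as <code>elbow_by_second_difference(df.index.tolist(), df['WCSS'].tolist())</code>.  Treat the result as a starting point and confirm it against the plot.</p>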

In [None]:
df.plot();

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From our simple demonstration above, we can see how data practitioners can rapidly derive powerful and unique predictive features by combining the latest AI with traditional Machine Learning <b>at scale</b>.  Furthermore, we can easily operationalize this process by combining this vector embedding and segmentation into traditional Customer 360, analytics, or additional predictive modeling tasks, all while using an on-demand compute engine.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Cleanup</b></p>

In [None]:
remove_context()