<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Complaints Clustering Using Teradata VantageCloud and open-source language models
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233c'><b>Introduction:</b></p>

<p style="font-size:16px;font-family:Arial;color:#00233c">This feature uses clustering techniques powered by <b>Teradata Vantage</b> and an <b>open-source embeddings</b> model to group similar customer complaints together. Identifying common themes and patterns across complaints gives insight into the key issues and pain points customers experience.</p>


<p style="font-size:16px;font-family:Arial;color:#00233c"><b>Key Features of Complaints Clustering:</b></p>
<ul style="font-size:16px;font-family:Arial;color:#00233c">
  <li>Leverages advanced clustering algorithms powered by <b>Teradata Vantage</b> and <b>Open source Embeddings.</b></li>
  <li>Groups similar customer complaints together, revealing common themes and pain points.</li>
  <li>Provides clients with a deeper understanding of the key issues affecting their customers.</li>
  <li>Enables clients to prioritize and address the most pressing concerns more effectively.</li>
  <li>Helps clients identify opportunities for product improvements and enhanced customer experience.</li>
</ul>


<p style = 'font-size:16px;font-family:Arial;color:#00233c'>Unlock the revolutionary potential of Generative AI to categorize and analyze complaints with unparalleled efficiency.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233c'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
  <li>Configuring the environment</li>
  <li>Connect to Vantage</li>
  <li>Create a Custom Container in Vantage</li>
  <li>Install Dependencies</li>
  <li>Operationalizing AI-powered analytics</li>
  <li>Complaints Analysis</li>
  <li>Cluster the Complaints</li>
  <li>Cleanup</li>
</ol>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:20px;font-family:Arial;color:#00233c'>1. Configuring the environment</b>

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.1 Install the required libraries</b></p>

In [None]:
%%capture

!pip install -r requirements.txt --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>Please restart the kernel after executing these two lines. The simplest way to restart the kernel is to press zero twice: <b> 0 0</b></i></p>
</div>

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.2 Import the required libraries</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
from teradataml import *
from utils.oaf_utils import *
from teradatasqlalchemy.types import *
from time import sleep
import pandas as pd
import csv, json, sys, os, warnings
from os.path import expanduser
from collections import OrderedDict
from wordcloud import WordCloud

from IPython.display import clear_output , display as ipydisplay
import matplotlib.pyplot as plt
import plotly.express as px

%matplotlib inline
warnings.filterwarnings('ignore')
display.suppress_vantage_runtime_warnings = True
from IPython.display import display, Markdown
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 100)

# load vars json
with open('./config/vars.json', 'r') as f:
    session_vars = json.load(f)

# Database login information
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

# UES Authentication information
ues_url = session_vars['environment']['UES_URI']
configure.ues_url = ues_url
pat_token = session_vars['hierarchy']['users']['business_users'][1]['pat_token']
pem_file = session_vars['hierarchy']['users']['business_users'][1]['key_file']


compute_group = session_vars['hierarchy']['users']['business_users'][1]['compute_group']


# Python version used to deploy the custom container
# (uncomment the next line to use the local interpreter's version instead)
# python_version = str(sys.version_info[0]) + '.' + str(sys.version_info[1])
python_version = "3.10"
print(f'Using Python version {python_version} for user environment')


# Hugging Face model for the demo
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

# a list of required packages to install in the custom OAF container
# modify this if using different models or design patterns
pkgs = ['transformers',
        'torch',
        'sentencepiece',
        'pandas',
        'sentence-transformers',
        'dill']

# container name - set here for easier notebook navigation
### User will also be asked to change it ###
oaf_name = 'oaf_cluster'
###########################
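
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For reference, the lookups above imply a <code>vars.json</code> layout along the following lines. This is a minimal sketch with placeholder values only: the key names are taken from the cell above, while <code>REQUIRED_USER_KEYS</code> and <code>missing_keys()</code> are hypothetical helpers added for illustration.</p>

```python
# Sketch of the structure that the configuration cell reads from ./config/vars.json.
# All values are placeholders; only the key names come from the notebook.
example_vars = {
    "environment": {
        "host": "<vantage-host>",
        "UES_URI": "<ues-url>",
    },
    "hierarchy": {
        "users": {
            "business_users": [
                {},  # index 0 is not read by this notebook
                {
                    "username": "<user>",
                    "password": "<password>",
                    "pat_token": "<personal-access-token>",
                    "key_file": "<path-to-pem-file>",
                    "compute_group": "<compute-group>",
                },
            ]
        }
    },
}

# Hypothetical helper: report which per-user keys are absent before connecting.
REQUIRED_USER_KEYS = ("username", "password", "pat_token", "key_file", "compute_group")


def missing_keys(session_vars, user_index=1):
    """Return the required keys missing for the selected business user."""
    user = session_vars["hierarchy"]["users"]["business_users"][user_index]
    return [k for k in REQUIRED_USER_KEYS if k not in user]


print(missing_keys(example_vars))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Running a check like this before connecting surfaces a missing key as a readable list rather than as a <code>KeyError</code> partway through the cell.</p>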

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>2. Connect to Vantage</b>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>2.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After connecting, check cluster status. Start it if necessary - note the cluster only needs to be running to execute the APPLY sections of the demo.</p>

In [None]:
# check for existing connection
eng = check_and_connect(
    host=host, username=username, password=password, compute_group=compute_group
)
print(eng)

# check cluster status
res = check_cluster_start(compute_group=compute_group)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>2.2  Connect to the Environment Service</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To better support integration with cloud services and common automation tools, the <b>User Environment Service</b> is accessed via RESTful APIs.  These APIs can be called directly or, as in the examples below, through the methods of the Python Package for Teradata (teradataml).</p> 

In [None]:
# check to see if there is a valid UES auth
# if not, authenticate
try:
    demo_env = get_env(oaf_name)
    print("Existing valid UES token")

except Exception as e:
    if """NoneType' object has no attribute 'value""" in str(e):
        if set_auth_token(
            ues_url=ues_url, username=username, pat_token=pat_token, pem_file=pem_file
        ):
            print("UES Authentication successful")
        else:
            print("UES Authentication failed, check URL and account info")
        pass
    else:
        raise

<hr style="height:2px;border:none;background-color:#00233C;">

<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Create a Custom Container in Vantage</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If desired, the user can create a <b>new</b> custom environment by starting with a "base" image and customizing it.  The steps are:</p> 
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>List the available "base" images the system supports</li>
    <li>List any existing "custom" environments the user has created</li>
    <li>If there are no custom environments, then create a new one from a base image</li>
    </ul>

In [None]:
# Create a new environment, or connect to an existing one
try:
    ipydisplay(list_user_envs())
except Exception as e:
    if str(e).find("No user environments found") > 0:
        print("No user environments found")
        pass
    else:
        raise

print("Use an existing environment, or create a new one:")
print(f"OAF Environment is set to {oaf_name}.")
print("Enter to accept, or input a new value.")
print("If the environment is not in the list, a new one will be created")
i = oaf_name
if len(i) != 0:
    oaf_name = i
    print(f"OAF Environment is now {oaf_name}")

try:
    demo_env = create_env(
        env_name=oaf_name,
        base_env=f"python_{python_version}",
        desc="OAF Demo env for LLM",
    )
except Exception as e:
    if str(e).find("same name already exists") > 0:
        print("Environment already exists, obtaining a reference to it")
        demo_env = get_env(oaf_name)
        pass
    elif "Invalid value for base environment name" in str(e):
        print("Unsupported base environment version, using defaults")
        demo_env = create_env(env_name=oaf_name, desc="OAF Demo env for LLM")
    else:
        raise

# Note create_env seems to be asynchronous - sleep a bit for it to register
sleep(5)

try:
    ipydisplay(list_user_envs())
except Exception as e:
    if str(e).find("No user environments found") > 0:
        print("No user environments found")
        pass
    else:
        raise

<hr style='height:2px;border:none;background-color:#00233C;'>

<b style = 'font-size:20px;font-family:Arial;color:#00233C'>4. Install Dependencies</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The second step in the customization process is to install Python package dependencies. This demonstration uses the Hugging Face <a href = 'https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2'>all-MiniLM-L6-v2</a> Sentence Transformer.  Since VantageCloud Lake Analytic Clusters are secured by default against unauthorized access to the outside network, the user can load the required libraries and model using teradataml methods:
</p> 

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>List the currently installed models and python libraries</li>
    <li><b>If necessary</b>, install any required packages</li>
    <li><b>If necessary</b>, install the pre-trained model.  This process takes several steps:
        <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
            <li>Import and download the model</li>
            <li>Create a zip archive of the model artifacts</li>
            <li>Call the install_model() method to load the model to the container</li>
        </ol></li>
    </ul>

In [None]:
ipydisplay(demo_env.models)

# just showing a sample here - remove .head(5) to see them all
ipydisplay(demo_env.libs.head(5))

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.1 A note on package versions</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next demonstration makes use of the DataFrame apply() method, which automatically passes the Python code to the Analytic Cluster.  As such, one needs to ensure the Python package versions match between the client and the container.  dill and pandas are required, as are any additional libraries for the use case.
</p> 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note</b> while not required for many OAF use cases, for this demo the required packages for the model execution must be installed in the local environment first.</p>

In [None]:
# remove_env(oaf_name)

In [None]:
# Install any Python add-ons needed by the script in the user environment
# Using option asynchronous=True for an asynchronous execution of the statement.
# Note: Avoid asynchronous installation when batch-executing all notebook statements,
#       as execution will continue even without installation being complete.
#
# Can install by passing a list of packages/versions
#   Or
# install using a requirements.txt file.

# For this demo,
# this code block will collect the current user's package versions
# for installation into the container
# when using dataframe.apply(), pandas and dill are required
# to reduce issues, match the version between client and container


# import these functions inside of a function namespace
def get_versions(pkgs):
    local_v_pkgs = []
    for p in pkgs:

        # fix up any hyphened package names
        p_fixed = p.replace("-", "_")

        # import the packages and append the strings to the list
        exec(
            f"""import {p_fixed}; local_v_pkgs.append('{p}==' + str({p_fixed}.__version__))"""
        )
    return local_v_pkgs


v_pkgs = get_versions(pkgs)

# check to see if these packages need to be installed
# by comparing the len of the intersection of the list of required packages with the installed ones
if not len(
    set([x.split("==")[0] for x in pkgs]).intersection(demo_env.libs["name"].to_list())
) == len(pkgs):

    # pass the list of packages - split off any extra info from the version property e.g., plus sign
    claim_id = demo_env.install_lib(
        [x.split("+")[0] for x in v_pkgs], asynchronous=True
    )
else:
    print(f"All required packages are installed in the {oaf_name} environment")
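
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The exec-based helper above works, but the installed versions can also be read without importing each package. The following is a minimal sketch, assuming the names in <code>pkgs</code> match the distribution names pip reports; <code>pin_local_versions()</code> is a hypothetical name and is not part of teradataml.</p>

```python
from importlib import metadata


def pin_local_versions(packages):
    """Return (pinned, missing): 'name==version' strings and names not installed locally."""
    pinned, missing = [], []
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
            continue
        # Drop any local build tag after '+', as the install cell above also does
        pinned.append(f"{name}=={version.split('+')[0]}")
    return pinned, missing


pinned, missing = pin_local_versions(["dill", "pandas"])
print(pinned, missing)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Returning the missing names separately makes it clear which packages must be installed locally before their versions can be matched in the container.</p>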

In [None]:
# claim_id = demo_env.uninstall_lib(libs=["numpy"], asynchronous=True)

# claim_id = demo_env.install_lib(libs=["setuptools"], asynchronous=True)


# claim_id = demo_env.install_lib(libs=["numpy==1.23.5"], asynchronous=True)


# claim_id = demo_env.install_lib(libs="dill==0.3.8", asynchronous=True)

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.2 Monitor library installation status</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optionally - users can monitor the library installation status using the cell below:
</p> 

In [None]:
# Check the status of installation using status() API.
# Create a loop here for demo purposes
try:
    claim_id
    ipydisplay(demo_env.status(claim_id))
    stage = demo_env.status(claim_id)["Stage"].iloc[-1]
    while stage == "Started":
        stage = demo_env.status(claim_id)["Stage"].iloc[-1]
        clear_output()
        ipydisplay(demo_env.status(claim_id))
        sleep(5)
except NameError:
    print("No installations to monitor")


# Verify the Python libraries have been installed correctly.
ipydisplay(demo_env.libs)

In [None]:
# demo_env.status(claim_id)["Additional Details"].iloc[-1]

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.3 Download and install model</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Open Analytics Framework containers do not have open access to the external network, which contributes to a very secure runtime environment.  As such, users will load pre-trained models using the below APIs.  For illustration purposes, the following code will check to see if the model archive exists locally and if it doesn't, will import and download it by creating a model object.  The archive will then be created and installed into the remote environment.
</p> 

In [None]:
# check to see if the model needs to be downloaded/archived

# construct the file name for the model:
model_fname = "models--" + model_name.replace("/", "--")
print(f"model_fname: {model_fname}")

if not os.path.isfile(f"{model_fname}.zip"):

    from sentence_transformers import SentenceTransformer
    import shutil

    print("Creating Model Archive...")

    model = SentenceTransformer(model_name)
    shutil.make_archive(
        model_fname,
        format="zip",
        root_dir=f'{expanduser("~")}/.cache/huggingface/hub/{model_fname}/',
    )
else:
    print("Local model archive exists.")

# check to see if the model is already installed
try:
    if demo_env.models.empty:  # no models installed at all
        print("Installing Model...")
        claim_id = demo_env.install_model(
            model_path=f"{model_fname}.zip", asynchronous=True
        )
    elif not any(
        model_fname in x for x in demo_env.models["Model"]
    ):  # see if model is there
        print("Installing Model...")
        claim_id = demo_env.install_model(
            model_path=f"{model_fname}.zip", asynchronous=True
        )
    else:
        print("Model already installed")
except Exception as e:
    if """NoneType' object has no attribute 'empty""" in str(e):
        print("Installing Model...")
        claim_id = demo_env.install_model(
            model_path=f"{model_fname}.zip", asynchronous=True
        )
        pass
    else:
        raise
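
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The archive step can be tried in isolation. The sketch below mirrors the naming rule and the <code>shutil.make_archive()</code> call from the cell above, but uses a temporary directory with a dummy file as a stand-in for the Hugging Face cache, so nothing is downloaded; <code>archive_name_for()</code> is a hypothetical helper name.</p>

```python
import os
import shutil
import tempfile


def archive_name_for(model_name):
    """Mirror the notebook's naming: 'org/model' -> 'models--org--model'."""
    return "models--" + model_name.replace("/", "--")


# Stand-in for ~/.cache/huggingface/hub: a temporary directory with one dummy file
with tempfile.TemporaryDirectory() as hub_dir:
    fname = archive_name_for("sentence-transformers/all-MiniLM-L6-v2")
    model_dir = os.path.join(hub_dir, fname)
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    # make_archive appends '.zip' to base_name and returns the archive path
    zip_path = shutil.make_archive(
        os.path.join(hub_dir, fname), format="zip", root_dir=model_dir
    )
    print(os.path.basename(zip_path))
```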

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.4 Monitor model installation status</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optionally - users can monitor the model installation status using the cell below:
</p> 

In [None]:
# Check the status of installation using status() API.
# Create a loop here for demo purposes
try:
    claim_id
    ipydisplay(demo_env.status(claim_id))
    stage = demo_env.status(claim_id)["Stage"].iloc[-1]
    while stage != "File Installed":
        stage = demo_env.status(claim_id)["Stage"].iloc[-1]
        clear_output()
        ipydisplay(demo_env.status(claim_id))
        sleep(5)
except NameError:
    print("No installations to monitor")


# Verify the model has been installed correctly.
demo_env.refresh()
ipydisplay(demo_env.models)
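
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The monitoring loops above run until a particular stage is reported, so they would not stop on their own if that stage never arrives. One possible refinement is a polling helper with a timeout. This is a sketch only: <code>wait_for_stage()</code> is a hypothetical helper, and the status source is stubbed here in place of <code>demo_env.status(claim_id)["Stage"].iloc[-1]</code>.</p>

```python
import time


def wait_for_stage(get_stage, done_stages, timeout_s=600, interval_s=5, sleep=time.sleep):
    """Poll get_stage() until it returns a value in done_stages or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        stage = get_stage()
        if stage in done_stages:
            return stage
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still at stage {stage!r} after {timeout_s} seconds")
        sleep(interval_s)


# Stubbed status source; the stage names are the two that appear in the cells above
stages = iter(["Started", "Started", "File Installed"])
result = wait_for_stage(lambda: next(stages), {"File Installed"}, sleep=lambda s: None)
print(result)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Passing the sleep function as an argument keeps the helper easy to exercise without waiting; in the notebook the default <code>time.sleep</code> would be used.</p>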

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The preceding demo showed how users can perform a <b>one-time</b> configuration task to prepare a custom environment for analytic processing at scale.  Once this configuration is complete, these containers can be re-used in ad-hoc development tasks, or used for operationalizing analytics in production.</p>

<hr style='height:2px;border:none;background-color:#00233C;'>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Operationalizing AI-powered analytics</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following demonstration illustrates how developers can take the next step and <b>operationalize</b> this processing, enabling the entire organization to leverage AI across the data lifecycle. The steps are:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '30%'>
           <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
               <li><b>Prepare the environment</b>.  Package the scoring function into a more robust program, and stage it on the remote environment</li>
            <br>
            <br>
               <li><b>Python Pipeline</b>.  Execute the function using Python methods</li>
            <br>
            <br>
               <li><b>SQL Pipeline</b>.  Execute the function using SQL - allowing for broad adoption and use in ETL and operational needs</li>
        </ol>
        </td>
        <td width = '20%'></td>
        <td style = 'vertical-align:top'><img src = 'images/OAF_Ops.png' width=350></td>
    </tr>
</table>


<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Check connection</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Reconnect to the database and UES, and start the cluster if necessary.</p> 

In [None]:
# check for existing connection
eng = check_and_connect(
    host=host, username=username, password=password, compute_group=compute_group
)
print(eng)


# check to see if there is a valid UES auth
# if not, authenticate
try:
    demo_env = get_env(oaf_name)

except Exception as e:
    if """NoneType' object has no attribute 'value""" in str(
        e
    ):  # UES auth expired/required
        if set_auth_token(
            ues_url=ues_url, username=username, pat_token=pat_token, pem_file=pem_file
        ):
            print("UES Authentication successful")
            try:
                demo_env = get_env(oaf_name)
                pass
            except Exception as l:
                if f"""User environment '{oaf_name}' not found""" in str(l):
                    print("User environment not found")
                    pass
                else:
                    raise
        else:
            print("UES Authentication failed, check URL and account info")
        pass
    elif f"""User environment '{oaf_name}' not found""" in str(e):
        print("User environment not found")
        pass
    else:
        raise


# check cluster status
check_cluster_start(compute_group=compute_group)
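
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The authenticate-then-retry logic in this cell repeats the pattern from section 2.2. It could be factored into one helper, sketched below under the assumption that the error messages matched above stay the same. <code>get_env_with_auth()</code> is a hypothetical name; the lookup and authentication calls are passed in as functions, and stubs are used here so the sketch runs without a Vantage connection.</p>

```python
def get_env_with_auth(name, get_env_fn, authenticate_fn):
    """Return the environment, authenticating once if the first lookup fails for lack of a token.

    Returns None when the environment does not exist.
    """

    def lookup():
        try:
            return get_env_fn(name)
        except Exception as e:
            if f"User environment '{name}' not found" in str(e):
                print("User environment not found")
                return None
            raise

    try:
        return lookup()
    except Exception as e:
        # Message matched in the cell above when no valid UES token is set
        if "NoneType' object has no attribute 'value" not in str(e):
            raise
    if not authenticate_fn():
        raise RuntimeError("UES Authentication failed, check URL and account info")
    return lookup()


# Stub demonstration: the first lookup fails as if no token were set, the second succeeds
calls = {"n": 0}


def fake_get_env(name):
    calls["n"] += 1
    if calls["n"] == 1:
        raise AttributeError("'NoneType' object has no attribute 'value'")
    return f"<env {name}>"


print(get_env_with_auth("oaf_cluster", fake_get_env, lambda: True))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the notebook, the stubs would be replaced by <code>get_env</code> and a function wrapping the <code>set_auth_token()</code> call shown above.</p>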

<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 Create a server-side embedding function</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The goal of this exercise is to create a <b>server-side</b> function which can be staged on the analytic cluster.  This offers several improvements over the DataFrame apply() method mentioned above:</p> 
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Performance</b>.  Staging the code and dependencies in the container environment reduces the amount of I/O, since the function doesn't need to get serialized to the cluster when called</li>
    <li><b>Operationalization</b>.  The execution pipeline can be encapsulated into a SQL statement, which allows for seamless use in ETL pipelines, dashboards, or applications that need access</li>
    <li><b>Flexibility</b>. Developers can express much greater flexibility in how the code works to optimize for performance, stability, data cleanliness or flow logic</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>These benefits do come with some additional work.  Developers need to account for how data is passed in and out of the code runtime, and how it is passed back to the SQL engine to assemble and return the final result set.  Code is executed when the user invokes the <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/SQL-Reference/SQL-Operators-and-User-Defined-Functions/Table-Operators/APPLY'>APPLY SQL function</a>:</p> 
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Input Query</b>.  The APPLY function takes a SQL query as input.  This query can be as complex as needed and include data preparation, cleansing, and/or any other set-based logic necessary to create the desired input data set.  This complexity can also be abstracted into a database view.  When using the Teradata client connectors for Python or R, this query is represented as a DataFrame or tibble.</li>
    <li><b>Pre-processing</b>.  Based on the query plan, data is retrieved from storage (cache, block storage, or object storage) and the input query is executed.</li>
    <li><b>Distribution</b>.  Input data can be partitioned and/or ordered to be processed on a specific container or collection of them.  For example, the user may want to process all data for a single post code in one partition, and run thousands of these in parallel.  Data can also be distributed evenly across all units of parallelism in the system.</li>
    <li><b>Input</b>.  The data for each container is passed to the runtime using standard input (stdin)</li>
    <li><b>Processing</b>.  The user's code executes, parsing stdin for the input data</li>
    <li><b>Output</b>.  Data is sent out of the code block using standard output (stdout)</li>
    <li><b>Resultset</b>.  Resultset is assembled by the analytic database, and the SQL query returns</li>
    </ol>
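
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Input, Processing, and Output steps above can be sketched as a small script skeleton. This is a hypothetical illustration and not the contents of the script installed later in this notebook: <code>placeholder_embed()</code> stands in for the model call and returns zeros, and the input layout (complaint id, then narrative, comma-delimited) is assumed from the Apply call used in this demo.</p>

```python
import csv
import io

EMBEDDING_DIM = 384  # matches the 384 embedding columns declared as return types in this demo


def placeholder_embed(text):
    """Stand-in for the model call; returns a fixed-length vector of zeros."""
    return [0.0] * EMBEDDING_DIM


def process(in_stream, out_stream, embed=placeholder_embed, delimiter=","):
    """Read 'complaint_id,narrative' rows and write 'complaint_id,narrative,e0,...' rows."""
    reader = csv.reader(in_stream, delimiter=delimiter)
    writer = csv.writer(out_stream, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        if len(row) < 2:
            continue  # skip malformed or empty lines
        complaint_id, narrative = row[0], row[1]
        writer.writerow([complaint_id, narrative, *embed(narrative)])


# In the container this would be process(sys.stdin, sys.stdout);
# in-memory streams are used here so the sketch runs anywhere.
sample_in = io.StringIO("101,Card was charged twice\n")
sample_out = io.StringIO()
process(sample_in, sample_out)
print(len(sample_out.getvalue().strip().split(",")))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each output row carries the two input columns followed by one value per embedding dimension, so the number and order of fields must line up with the return types declared for the APPLY call.</p>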


<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3 Example server-side code block</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This is the python script used in the demonstration.  It is saved to the filesystem as <code>Complaints_Clustering_OAF.py</code>.  Note here the original client-side processing function has been reused, and the additional logic is for input, output, and error handling.</p> 


<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.4 Install the file and any additional artifacts</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use the install_file() method to install this python file to the container.  As a reminder, this container is persistent, so these steps need only be done infrequently.</p> 

In [None]:
demo_env.install_file("Complaints_Clustering_OAF.py", replace=True)

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.5  Call the APPLY function </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function can be executed in two ways:</p> 
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Teradata-Package-for-Python-on-VantageCloud-Lake/Working-with-Open-Analytics/teradataml-Apply-Class-for-APPLY-Table-Operator'>Python</a></b> by calling the Apply() module function</li>
    <li><b><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/SQL-Reference/SQL-Operators-and-User-Defined-Functions/Table-Operators/APPLY'>SQL</a></b> which allows for broad adoption across the enterprise</li>
    </ul>
    

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.6 APPLY using Python</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The process is as follows</p> 
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Construct a dictionary that will define the return columns and data types</li>
    <li>Construct a teradataml DataFrame representing the data to be processed - note this is a "virtual" object representing data and logic <b>in-database</b></li>
    <li>Execute the module function.  This constructs the function call in the database, but does not execute anything.  Note the Apply function takes several arguments - the input data, environment name, and the command to run</li>
    <li>In order to execute the function, an "execute_script()" method must be called.  This method returns the server-side DataFrame representing the complete operation.  This DataFrame can be used in further processing, stored as a table, etc.</li>
    </ol>
    

In [None]:
# return types
types_dict = OrderedDict({})
types_dict["complaint_id"] = BIGINT()
types_dict["consumer_complaint_narrative"] = VARCHAR(1000)

for i in range(384):
    key = f"embeddings_{str(i)}"
    types_dict[key] = FLOAT()


# remove extra characters from text
tdf = DataFrame.from_query(
    """SELECT TOP 100 complaint_id, date_received, product,issue, sub_issue, submitted_via,
    CASE 
        WHEN consumer_complaint_narrative IS NULL THEN ' '
        ELSE OREPLACE(OREPLACE(OREPLACE(OREPLACE(OREPLACE(consumer_complaint_narrative , X'0d' , ' ') , X'0a' , ' ') , X'09', ' '), ',', ' '), '"', ' ')
    END consumer_complaint_narrative 
    FROM user.Consumer_Complaints WHERE consumer_complaint_narrative <> '';"""
)
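
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The nested OREPLACE calls in the query above replace carriage returns, line feeds, tabs, commas, and double quotes with spaces, which keeps each narrative on one line and free of the comma delimiter used when rows are passed to the script. For readers less familiar with the SQL, a rough Python equivalent is sketched below; <code>clean_narrative()</code> is a hypothetical name used only for illustration, and the notebook itself performs this step in-database.</p>

```python
# Characters the SQL replaces with a space: CR, LF, tab, comma, double quote
_REPLACEMENTS = str.maketrans({"\r": " ", "\n": " ", "\t": " ", ",": " ", '"': " "})


def clean_narrative(text):
    """Python equivalent of the nested OREPLACE calls; NULL (None) becomes a single space."""
    if text is None:
        return " "
    return text.translate(_REPLACEMENTS)


print(repr(clean_narrative('Charged twice,\nthen told "no refund"')))
```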

In [None]:
demo_env.install_file("Complaints_Clustering_OAF.py", replace=True)

apply_obj = Apply(
    data=tdf[["complaint_id", "consumer_complaint_narrative"]],
    apply_command="python Complaints_Clustering_OAF.py",
    returns=types_dict,
    env_name=oaf_name,
    delimiter=",",
)

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.7 Execute the function</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Call execute_script() and display a sample of the result to check the data.</p> 

In [None]:
embeddings_df = apply_obj.execute_script()
ipydisplay(embeddings_df)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now, let's merge the embeddings with the original columns for further processing.</p>

In [None]:
embeddings_df_new = tdf.merge(
    right=embeddings_df,
    left_on="complaint_id",
    right_on="complaint_id",
    how="inner",
    lsuffix="tdf",
    rsuffix="emb",
)

# remove the duplicate columns that came from embeddings_df
embeddings_df_new = embeddings_df_new.drop(
    columns=["complaint_id_emb", "consumer_complaint_narrative_emb"]
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now the results can be saved back to Vantage.</p> 

In [None]:
copy_to_sql(
    df=embeddings_df_new,
    table_name="complaints_embeddings",
    if_exists="replace",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now, let's load the full embeddings table from Vantage.</p>

In [None]:
complaints_embeddings_tdf = DataFrame("complaints_embeddings")
# sentiment_pdf = complaints_embeddings_tdf.to_pandas()

<hr style='height:2px;border:none;background-color:#00233C;'>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Complaints Analysis</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We'll analyze the sample of customer complaints data.</p> 


<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>6.1 Graph for Count of Product Complaints Over Years</b></p>

<p style='font-size:16px;font-family:Arial;color:#00233c'>The provided graph visualizes the count of complaints over the past few years, categorized by product names.</p>

In [None]:
viz_df = complaints_embeddings_tdf.assign(
    year=complaints_embeddings_tdf.date_received.year_of_calendar()
)

pd_df = (
    viz_df.select(["product", "year", "complaint_id_tdf"])
    .groupby(["product", "year"])
    .agg(["count"])
    .to_pandas()
)

In [None]:
# Sorting the DataFrame by year for each product
pd_df_sorted = pd_df.sort_values(by=["product", "year"])
import plotly.express as px

# Plotting using Plotly
fig = px.line(
    pd_df_sorted,
    x="year",
    y="count_complaint_id_tdf",
    color="product",
    markers=True,
    title="Count of Product Complaints Over Years",
)
fig.update_layout(
    xaxis_title="Year",
    yaxis_title="Count",
    legend_title="Product",
    width=1200,
    height=600,
)

fig.show()

<hr style='height:1px;border:none;background-color:#00233C;'> 
<p style='font-size:18px;font-family:Arial;color:#00233c'><b>6.2 Graph for Count of Complaints by Months</b></p> 
<p style='font-size:16px;font-family:Arial;color:#00233c'>The provided graph visualizes the count of complaints by months. We can see that the mean count is above 500, and the July and August months have the maximum complaints count.</p>

In [None]:
df = complaints_embeddings_tdf.assign(
    complaint_month=complaints_embeddings_tdf.date_received.month_of_year()
)

In [None]:
# df = df.assign(complaint_month=func.td_month_of_year(df.date_received.expression))
grp_gen = (
    df.select(["complaint_month", "complaint_id_tdf"])
    .groupby(["complaint_month"])
    .agg(["count"])
    .to_pandas()
)

# Define a reverse mapping dictionary
reverse_month_mapping = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December",
}

# Create a new column with month names based on reverse mapping
grp_gen["month"] = grp_gen["complaint_month"].map(reverse_month_mapping)


fig = px.bar(
    grp_gen.sort_values(by="complaint_month"),
    x="month",
    y="count_complaint_id_tdf",
    labels={"count_complaint_id_tdf": "Number of Complaints", "month": "Complaint Month"},
    title="Number of Complaints by Month",
)

# Add hover information
fig.update_traces(hovertemplate="Month: %{x}<br>Number of Complaints: %{y:,}")

fig.show()

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>6.3 Graph for Number of Complaints by Product</b></p>
<p style='font-size:16px;font-family:Arial;color:#00233c'>The graph shows the number of complaints received for each product. Most complaints relate to credit cards or prepaid cards, and to credit reporting and credit repair services.</p>

In [None]:
# Count complaints per product in Vantage, then bring the small result into pandas
grp_gen = (
    df[["product", "complaint_id_tdf"]].groupby(["product"]).agg(["count"]).to_pandas()
)

fig = px.bar(
    grp_gen,
    x="product",
    y="count_complaint_id_tdf",
    labels={"count_complaint_id_tdf": "Number of Complaints", "product": "Product"},
    title="Number of Complaints by Product",
)

# Add hover information
fig.update_traces(hovertemplate="Product: %{x}<br>Number of Complaints: %{y:,}")

fig.show()

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>6.4 Graph for Number of Complaints by Issue</b></p>
<p style='font-size:16px;font-family:Arial;color:#00233c'>The graph shows the number of complaints received for the ten most frequent issues. The most common issue is "Incorrect information on your report".</p>

In [None]:
# Count complaints per issue in Vantage, then bring the result into pandas
grp_gen = (
    df.select(["issue", "complaint_id_tdf"])
    .groupby(["issue"])
    .agg(["count"])
    .to_pandas()
)

grp_gen = grp_gen.sort_values("count_complaint_id_tdf", ascending=False)[:10]

fig = px.bar(
    grp_gen,
    x="issue",
    y="count_complaint_id_tdf",
    labels={"count_complaint_id_tdf": "Number of Complaints", "issue": "Issue"},
    title="Number of Complaints by Issue (Top 10)",
)

# Add hover information
fig.update_traces(hovertemplate="Issue: %{x}<br>Number of Complaints: %{y:,}")

fig.show()

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>6.5 Graph for Number of Complaints by Sub-Issue</b></p> 

<p style='font-size:16px;font-family:Arial;color:#00233c'>The graph shows the number of complaints received for the ten most frequent sub-issues. The most common sub-issue is "Information belongs to someone else".</p>

In [None]:
# Count complaints per sub-issue in Vantage, then bring the result into pandas
grp_gen = (
    df.select(["sub_issue", "complaint_id_tdf"])
    .groupby(["sub_issue"])
    .agg(["count"])
    .to_pandas()
)

grp_gen = grp_gen.sort_values("count_complaint_id_tdf", ascending=False)[:10]

fig = px.bar(
    grp_gen,
    x="sub_issue",
    y="count_complaint_id_tdf",
    labels={"count_complaint_id_tdf": "Number of Complaints", "sub_issue": "Sub-Issue"},
    title="Number of Complaints by Sub-Issue (Top 10)",
)

# Add hover information
fig.update_traces(hovertemplate="Sub-Issue: %{x}<br>Number of Complaints: %{y:,}")

fig.show()

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>6.6 Graph for Number of Complaints by Channel</b></p>

<p style='font-size:16px;font-family:Arial;color:#00233c'>The graph shows the number of complaints received through each submission channel. In this data, all complaints were submitted via the web channel.</p>

In [None]:
grp_gen = (
    df.select(["submitted_via", "complaint_id_tdf"])
    .groupby(["submitted_via"])
    .agg(["count"])
    .to_pandas()
)

# Create a mapping of index numbers to channel names
# (the variable names still say "product"; neither object is used by the plot below)
product_mapping = {i: product for i, product in enumerate(grp_gen["submitted_via"])}

# Add a numeric index column for each channel
grp_gen["product_num"] = grp_gen["submitted_via"].map(
    {product: i for i, product in enumerate(grp_gen["submitted_via"])}
)

fig = px.bar(
    grp_gen,
    x="submitted_via",
    y="count_complaint_id_tdf",
    labels={
        "count_complaint_id_tdf": "Number of Complaints",
        "submitted_via": "Submitted Via",
    },
    title="Number of Complaints by Channel",
)

# Add hover information
fig.update_traces(hovertemplate="Submitted Via: %{x}<br>Number of Complaints: %{y:,}")

fig.show()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>7. Cluster the Complaints</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For the clustering task, we use a sample of the data. This lets us analyze and categorize the complaints without processing the entire dataset.</p>

In [None]:
# Reference the embeddings table once and reuse it below
emb_tdf = DataFrame("complaints_embeddings")

# Columns from position 7 onward are used as clustering features;
# the first column is kept alongside them as the row identifier
embedding_cols = emb_tdf.columns[7:]
data_cols = emb_tdf.columns[:1] + embedding_cols

KMeans_Model = KMeans(
    data=emb_tdf[data_cols],
    id_column="complaint_id_tdf",
    target_columns=embedding_cols,
    output_cluster_assignment=True,
    num_clusters=5,
)

In [None]:
embeddings_cluster = DataFrame("complaints_embeddings").join(
    other=KMeans_Model.result,
    how="inner",
    on="complaint_id_tdf=complaint_id_tdf",
    lprefix="L_",
)

In [None]:
# View complaints in cluster 1
embeddings_cluster[
    ["td_clusterid_kmeans", "complaint_id_tdf", "consumer_complaint_narrative_tdf"]
].loc[embeddings_cluster.td_clusterid_kmeans == 1]

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>7.1 Visualization of Clusters with Complaints</b></p> 

<p style='font-size:16px;font-family:Arial;color:#00233c'>The graph shows the complaints grouped into distinct clusters. The data is divided into 5 clusters, the <code>num_clusters</code> value passed to KMeans, and each cluster groups complaints with similar embeddings. This view helps identify the most prevalent types of complaints, allowing for more targeted investigation and resolution efforts.</p>
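
<p style='font-size:16px;font-family:Arial;color:#00233c'>The cluster count is a fixed input to KMeans, so it is worth sanity-checking. One common approach is to compare the silhouette score across several candidate values of k and prefer the highest. The following is a minimal sketch of that idea, assuming scikit-learn is available locally. It runs on synthetic two-dimensional points, not on the complaint embeddings, and it uses scikit-learn's k-means (imported as <code>SkKMeans</code> so that it does not shadow the teradataml <code>KMeans</code> used above). The helper name <code>silhouette_by_k</code> is hypothetical.</p>

```python
from sklearn.cluster import KMeans as SkKMeans  # alias avoids shadowing teradataml's KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, seed=42):
    """Fit k-means for each candidate k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        labels = SkKMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores


# Synthetic stand-in for embedding vectors: three well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

<p style='font-size:16px;font-family:Arial;color:#00233c'>To apply the same check to the complaints, the embedding columns of a pandas sample could be passed as <code>X</code>. Silhouette scoring compares every pair of points, so it is better suited to a sample than to the full table.</p>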

In [None]:
# emb = DataFrame('kmeans_features').to_pandas()
clus = embeddings_cluster.to_pandas()

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(clus.iloc[:, 20:-1])

In [None]:
tsne_df = pd.DataFrame(tsne_result, columns=["tsne_1", "tsne_2"])
tsne_df["cluster_id"] = clus["td_clusterid_kmeans"]
# Use the complaint identifier here, not the narrative text
tsne_df["complaint_id"] = clus["complaint_id_tdf"]

In [None]:
import pandas as pd
import plotly.express as px

# Uses tsne_result and clus computed in the previous cells

# Create a new DataFrame combining t-SNE results with complaint information
tsne_complaint_df = pd.DataFrame(tsne_result, columns=["tsne_1", "tsne_2"])
tsne_complaint_df["cluster_id"] = clus["td_clusterid_kmeans"]
tsne_complaint_df["complaint_id"] = clus["complaint_id_tdf"]
tsne_complaint_df["complaint"] = clus["consumer_complaint_narrative_tdf"]

# Truncate text for hover data
max_chars = 50  # Maximum characters to display
tsne_complaint_df["truncated_complaint"] = clus[
    "consumer_complaint_narrative_tdf"
].apply(lambda x: x[:max_chars] + "..." if len(x) > max_chars else x)

# Plot using Plotly Express
fig = px.scatter(
    tsne_complaint_df,
    x="tsne_1",
    y="tsne_2",
    color="cluster_id",
    hover_data=["complaint_id", "truncated_complaint", "cluster_id"],
)

fig.update_traces(marker=dict(size=15))
fig.update_layout(
    title="t-SNE Visualization of Clusters with Complaints",
    xaxis_title="dimension-1",
    yaxis_title="dimension-2",
    xaxis=dict(tickangle=45),
    width=1000,
    height=800,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    autosize=False,
)

# Customize the hovertemplate
fig.update_traces(
    hovertemplate="<b>Complaint ID:</b> %{customdata[0]}<br>"
    "<b>Complaint:</b> %{customdata[1]}<br>"
    "<b>Cluster ID:</b> %{customdata[2]}<br><extra></extra>"
)

fig.show()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>8. Cleanup</b>

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>8.1 Stop the Cluster</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Hibernate the environment if desired.</p>

In [69]:
check_cluster_stop(compute_group)

Unnamed: 0,ComputeProfileName,InstanceName,ComputeGroupName,ComputeMapName,ComputeInstanceType,CurrentState,LastReqState,LastStartTime,LastEndTime
0,XSMALLANALYTICPROFILE,f4vgdm,ANALYTICGROUP,TD_COMPUTE_XSMALL,ANALYTIC,ACTIVE,SUSPEND,2024-08-23 04:50:25 +0000 UTC,2024-08-22 13:53:00 +0000 UTC


Cluster Suspended or Suspend request already submitted.


True

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>