<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Complaints Clustering using Vantage and LLM
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>

<p style="font-size:16px;font-family:Arial">This feature uses clustering techniques powered by <b>Teradata Vantage</b> and <b>Amazon's Titan embeddings model on AWS Bedrock</b> to group similar customer complaints together. By identifying common themes and patterns, it provides insight into the key issues and pain points experienced by customers.</p>


<p style="font-size:16px;font-family:Arial"><b>Key Features of Complaints Clustering:</b></p>
<ul style="font-size:16px;font-family:Arial">
  <li>Leverages advanced clustering algorithms powered by <b>Teradata Vantage</b> and <b>Amazon's Titan embeddings.</b></li>
  <li>Groups similar customer complaints together, revealing common themes and pain points.</li>
  <li>Provides clients with a deeper understanding of the key issues affecting their customers.</li>
  <li>Enables clients to prioritize and address the most pressing concerns more effectively.</li>
  <li>Helps clients identify opportunities for product improvements and enhanced customer experience.</li>
</ul>


<p style = 'font-size:16px;font-family:Arial'>Unlock the revolutionary potential of Generative AI to categorize and analyze complaints with unparalleled efficiency.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration</li>
    <li>Generating Embeddings</li>
    <li>Cluster the Complaints</li>
    <li>Cleanup</li>
</ol>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>1. Configuring the environment</b>
<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.1 Downloading and installing additional software needed</b></p>

In [None]:
%%capture
!pip install -r requirements.txt --upgrade --quiet

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>Please restart the kernel after executing these two lines. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.2 Import the required libraries</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Data manipulation and analysis
import pandas as pd

# Suppress warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# General imports
import os
import getpass
from dotenv import dotenv_values  # used below to read the connection settings from the .env file

# Plotting packages
import plotly.express as px
import plotly.graph_objects as go

# Teradata library
from teradataml import *
from teradatagenai import TeradataAI, TextAnalyticsAI, VSManager, VectorStore, VSApi
from sqlalchemy import func

# Display settings
display.max_rows = 5
display.print_sqlmr_query = False
display.suppress_vantage_runtime_warnings = True
configure.val_install_location = "val"
configure.byom_install_location = "byom"

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>The cell below reads the connection parameters (host, username and password) from the environment file and creates the Vantage connection context. If the file is not found, the cell prints a message asking us to contact the support team.</p>

In [None]:
print("Checking if this environment is ready to connect to VantageCloud Lake...")

if os.path.exists("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env"):
    print("Your environment parameter file exists. Please proceed with this use case.")
    # Load all the variables from the .env file into a dictionary
    env_vars = dotenv_values("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env")
    # Create the Context
    eng = create_context(host=env_vars.get("host"), username=env_vars.get("username"), password=env_vars.get("my_variable"))
    execute_sql('''SET query_band='DEMO=text_analytics_teradatagenai_aws_huggingface.ipynb;' UPDATE FOR SESSION;''')
    print("Connected to VantageCloud Lake with:", eng)
else:
    print("Your environment has not been prepared for connecting to VantageCloud Lake.")
    print("Please contact the support team.")
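
<p style = 'font-size:16px;font-family:Arial'>To illustrate what the connection cell reads, here is a minimal sketch of parsing <code>KEY=VALUE</code> lines into a dictionary. It is a simplified stand-in for <code>dotenv_values</code>, not its actual implementation. The helper name <code>read_env_file</code> and the file contents are hypothetical.</p>

```python
from pathlib import Path
import tempfile


def read_env_file(path):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Hypothetical file contents written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(
        "host=example-host\nusername=demo_user\n# a comment\nmy_variable='secret'\n"
    )
    demo_env = read_env_file(env_path)

print(sorted(demo_env))
```

<p style = 'font-size:16px;font-family:Arial'>The resulting dictionary is read with <code>.get("host")</code>, <code>.get("username")</code> and <code>.get("my_variable")</code>, mirroring the keys used in the cell above.</p>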

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:2px;border:none'>
<p style = 'font-size:20px;font-family:Arial'><b>2.1 Set up the LLM connection</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <b>teradatagenai</b> Python library can connect to cloud-based LLM services as well as instantiate private models running <b>at scale</b> on local GPU compute. In this case we will use Amazon's Titan text embeddings model, served through AWS Bedrock, which requires the following AWS credentials:</p>

<ol style = 'font-size:16px;font-family:Arial'>
<li><b>aws_access_key_id</b>: Enter your AWS access key ID</li>
<li><b>aws_secret_access_key</b>: Enter your AWS secret access key</li>
<li><b>region name</b>: Enter the AWS region you want to configure (e.g., us-east-1)</li>
</ol>

In [None]:
access_key = getpass.getpass('aws_access_key_id: ')
secret_key = getpass.getpass('aws_secret_access_key: ')
region_name = getpass.getpass('region name: ')

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Data Exploration</b>

In [None]:
df = DataFrame(in_schema('DEMO_ComplaintAnalysis', 'Consumer_Complaints'))

In [None]:
# df.columns

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>3.1 Graph for Count of Product Complaints Over Years</b></p>

<p style='font-size:16px;font-family:Arial'>The provided graph visualizes the count of complaints over the past few years, categorized by product names.</p>

In [None]:
viz_df = df.assign(year = func.td_year_of_calendar(df.date_received.expression))

In [None]:
pd_df = viz_df.select(['product','year','complaint_id']).groupby(['product', 'year']).agg(['count']).to_pandas()

In [None]:
# Sorting the DataFrame by year for each product
pd_df_sorted = pd_df.sort_values(by = ['product', 'year'])

# Plotting using Plotly
fig = px.line(
    pd_df_sorted,
    x = 'year',
    y = 'count_complaint_id',
    color = 'product',
    markers = True,
    title = 'Count of Product Complaints Over Years'
)

fig.update_layout(
    xaxis_title = 'Year',
    yaxis_title = 'Count',
    legend_title = 'Product',
    width = 1200,
    height = 600
)

fig.show()
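
<p style = 'font-size:16px;font-family:Arial'>To make the aggregation behind this chart concrete, the following sketch counts complaints per (product, year) pair in plain Python. The rows are hypothetical and only stand in for the <code>product</code>, <code>year</code> and <code>complaint_id</code> columns. In the notebook itself the grouping runs in Vantage.</p>

```python
from collections import Counter

# Hypothetical (product, year, complaint_id) rows.
sample_rows = [
    ("Mortgage", 2021, 1),
    ("Mortgage", 2021, 2),
    ("Mortgage", 2022, 3),
    ("Credit card", 2022, 4),
]

# Count one complaint per row, keyed by (product, year).
yearly_counts = Counter((product, year) for product, year, _ in sample_rows)

# Print in the same product/year order used for the line chart.
for (product, year), n_complaints in sorted(yearly_counts.items()):
    print(product, year, n_complaints)
```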

<hr style='height:1px;border:none;'> 
<p style='font-size:18px;font-family:Arial'><b>3.2 Graph for Count of Complaints by Months</b></p> 
<p style='font-size:16px;font-family:Arial'>The graph shows the count of complaints by month. The mean count is above 500, and July and August have the highest complaint counts.</p>

In [None]:
df = df.assign(complaint_month = func.td_month_of_year(df.date_received.expression))
grp_gen = df.select(['complaint_month','complaint_id']).groupby(['complaint_month']).agg(['count']).to_pandas()

# Define a reverse mapping dictionary
reverse_month_mapping = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June',
                         7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'}

# Create a new column with month names based on reverse mapping
grp_gen['month'] = grp_gen['complaint_month'].map(reverse_month_mapping)


fig = px.bar(
    grp_gen.sort_values(by = 'complaint_month'),
    x = 'month', y = 'count_complaint_id',
    labels = {
        'count_complaint_id': 'Number of Complaints',
        'month': 'Complaint Month'
    },
    title = 'Number of Complaints by Month'
)

# Add hover information
fig.update_traces(hovertemplate = 'Month: %{x}<br>Number of Complaints: %{y:,}')

fig.show()
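
<p style = 'font-size:16px;font-family:Arial'>As an alternative to typing the month names by hand, the same lookup can be built from the standard library. This is a sketch that assumes the default (English) locale, because <code>calendar.month_name</code> is locale dependent. The name <code>month_lookup</code> is illustrative.</p>

```python
import calendar

# calendar.month_name[0] is an empty string, so month numbers 1..12 index directly.
month_lookup = {number: calendar.month_name[number] for number in range(1, 13)}

print(month_lookup[7], month_lookup[8])
```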

<hr style='height:1px;border:none;'> 

<p style='font-size:18px;font-family:Arial'><b>3.3 Graph for Number of Complaints by Product</b></p> <p style='font-size:16px;font-family:Arial'>The graph displays the number of complaints received for each product. Most complaints relate to credit cards or prepaid cards, and to credit reporting and credit repair services.</p>

In [None]:
grp_gen = df.select(['product','complaint_id']).groupby(['product']).agg(['count']).to_pandas()

fig = px.bar(
    grp_gen,
    x = 'product',
    y = 'count_complaint_id',
    labels = {
        'count_complaint_id': 'Number of Complaints',
        'product': 'Product'
    },
    title = 'Number of Complaints by Product'
)

# Add hover information
fig.update_traces(hovertemplate = 'Product: %{x}<br>Number of Complaints: %{y:,}')

fig.show()

<hr style='height:1px;border:none;'> 

<p style='font-size:18px;font-family:Arial'><b>3.4 Graph for Number of Complaints by Issue</b></p> <p style='font-size:16px;font-family:Arial'>The graph displays the number of complaints received for the ten most frequent issues. Most complaints relate to the issue "Incorrect information on your report".</p>

In [None]:
grp_gen = df.select(['issue','complaint_id']).groupby(['issue']).agg(['count']).to_pandas()

grp_gen = grp_gen.sort_values('count_complaint_id', ascending = False)[:10]

fig = px.bar(
    grp_gen,
    x = 'issue',
    y = 'count_complaint_id',
    labels = {
        'count_complaint_id': 'Number of Complaints',
        'issue': 'Issue'
    },
    title = 'Number of Complaints by Issue (Top 10)'
)

# Add hover information
fig.update_traces(hovertemplate = 'Issue: %{x}<br>Number of Complaints: %{y:,}')

fig.show()
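
<p style = 'font-size:16px;font-family:Arial'>The "count, sort descending, keep the top N" pattern used for this chart can be sketched in plain Python as follows. Apart from the first one, the issue labels are hypothetical, and N is set to 2 only to keep the example short.</p>

```python
from collections import Counter

# Hypothetical issue labels, one per complaint.
sample_issues = [
    "Incorrect information on your report",
    "Fraud or scam",
    "Incorrect information on your report",
    "Billing dispute",
    "Fraud or scam",
    "Incorrect information on your report",
]

# most_common(n) returns the n most frequent labels with their counts, highest first.
top_issues = Counter(sample_issues).most_common(2)

print(top_issues)
```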

<hr style='height:1px;border:none;'> 

<p style='font-size:18px;font-family:Arial'><b>3.5 Graph for Number of Complaints by Sub-Issue</b></p> 

<p style='font-size:16px;font-family:Arial'>The graph displays the number of complaints received for the ten most frequent sub-issues. Most complaints relate to the sub-issue "Information belongs to someone else".</p>

In [None]:
grp_gen = df.select(['sub_issue','complaint_id']).groupby(['sub_issue']).agg(['count']).to_pandas()

grp_gen = grp_gen.sort_values('count_complaint_id', ascending = False)[:10]

fig = px.bar(
    grp_gen,
    x = 'sub_issue',
    y = 'count_complaint_id',
    labels = {
        'count_complaint_id': 'Number of Complaints',
        'sub_issue': 'Sub-Issue'
    },
    title = 'Number of Complaints by Sub-Issue (Top 10)'
)

# Add hover information
fig.update_traces(hovertemplate = 'Sub-Issue: %{x}<br>Number of Complaints: %{y:,}')

fig.show()

<hr style='height:1px;border:none;'> 

<p style='font-size:18px;font-family:Arial'><b>3.6 Graph for Number of Complaints by Channel</b></p>

<p style='font-size:16px;font-family:Arial'>The graph displays the number of complaints received through each submission channel. The data shows that all the complaints were submitted through the web channel.</p>

In [None]:
grp_gen = df.select(['submitted_via','complaint_id']).groupby(['submitted_via']).agg(['count']).to_pandas()

# Create a mapping of numbers to channel names
channel_mapping = {i: channel for i, channel in enumerate(grp_gen['submitted_via'])}

# Add a numeric code for each channel to the DataFrame
grp_gen['channel_num'] = grp_gen['submitted_via'].map(
    {channel: i for i, channel in enumerate(grp_gen['submitted_via'])}
)

fig = px.bar(
    grp_gen,
    x = 'submitted_via',
    y = 'count_complaint_id',
    labels = {
        'count_complaint_id': 'Number of Complaints',
        'submitted_via': 'Submitted Via'
    },
    title = 'Number of Complaints by Channel'
)

# Add hover information
fig.update_traces(hovertemplate = 'Submitted Via: %{x}<br>Number of Complaints: %{y:,}')

fig.show()

<hr style="height:2px;border:none;">
<b style='font-size:20px;font-family:Arial'>4. Generating Embeddings</b>

<div style="display: flex; align-items: center; gap: 30px; margin-top: 10px;">
  <div style="flex: 1; font-size:16px; font-family:Arial;">
    <p>
      The <code>embeddings()</code> function generates vector representations of text from a specified column, capturing the semantic meaning of each entry.
    </p>
    <p>
      These embeddings can then be used for tasks such as semantic similarity, clustering, retrieval, or as input features for downstream machine learning models.
    </p>
  </div>
</div>
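
<p style = 'font-size:16px;font-family:Arial'>To show why embeddings are useful for grouping text, the sketch below compares small hand-made vectors with cosine similarity. The three-dimensional vectors and their names are hypothetical. Real Titan embeddings have many more dimensions, but the comparison works the same way.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three complaints.
late_fee = [0.9, 0.1, 0.0]
overdraft_fee = [0.8, 0.2, 0.1]
identity_theft = [0.0, 0.2, 0.9]

similar_score = cosine_similarity(late_fee, overdraft_fee)
different_score = cosine_similarity(late_fee, identity_theft)

# Complaints about related topics should score closer to 1.
print(round(similar_score, 3), round(different_score, 3))
```

<p style = 'font-size:16px;font-family:Arial'>Vectors that point in similar directions score close to 1, which is the property that clustering relies on when it groups complaints.</p>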

In [None]:
len(df.columns)

In [None]:
# Instantiate the TeradataAI class with the Amazon Bedrock model.
llm_embedding = TeradataAI(api_type="aws",                      
               model_name="amazon.titan-embed-text-v2:0",
               access_key=access_key,
               secret_key=secret_key,
               region=region_name)

In [None]:
# Instantiate the TextAnalyticsAI class with the embedding model.
obj_embeddings = TextAnalyticsAI(llm=llm_embedding)

In [None]:
# Generate embeddings for a 10-row subset of the complaints
tdf_embeddings = obj_embeddings.embeddings(
    column = "consumer_complaint_narrative",
    data = df.iloc[:10],
    accumulate = "0:17",
    output_format = 'VECTOR'
)

In [None]:
tdf_embeddings.info()

In [None]:
tdf_embeddings.columns[:5] + tdf_embeddings.columns[6:-3] + ["Message"]

In [None]:
tdf_embeddings = tdf_embeddings.drop(columns=tdf_embeddings.columns[:5] + tdf_embeddings.columns[6:-3] + ["Message"])
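
<p style = 'font-size:16px;font-family:Arial'>The slicing expression above can be hard to read at a glance. The sketch below applies the same expression to a short, hypothetical column list to show which columns end up in the drop list and which are kept. The actual column names and order of <code>tdf_embeddings</code> may differ.</p>

```python
# Hypothetical column list; only the slicing logic mirrors the cell above.
sample_columns = [
    "c0", "c1", "c2", "c3", "c4",
    "complaint_id",
    "c6", "c7",
    "narrative", "Embedding", "Message",
]

# First five columns, everything from index 6 up to (not including) the last three, plus "Message".
to_drop = sample_columns[:5] + sample_columns[6:-3] + ["Message"]

# Whatever is not in the drop list survives.
kept = [name for name in sample_columns if name not in to_drop]

print(kept)
```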

In [None]:
copy_to_sql(df = tdf_embeddings, table_name = 'complaints_embeddings', if_exists = 'replace')

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Cluster the Complaints</b>

<p style = 'font-size:16px;font-family:Arial'>For our complaint clustering task, we'll be using a sample of the data to cluster the complaints. This approach will allow us to effectively analyze and categorize the complaints without using the entire dataset.</p>
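
<p style = 'font-size:16px;font-family:Arial'>As background, the following is a minimal pure-Python sketch of the k-means idea (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustration only, with hypothetical two-dimensional points and fixed starting centroids. The in-database <code>KMeans</code> function may differ in initialization and stopping criteria.</p>

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and updating centroids."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, centroid in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroid)  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignments, centroids


# Two well-separated groups of hypothetical 2-D "embeddings".
sample_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
cluster_of, final_centroids = kmeans(
    sample_points, centroids=[sample_points[0], sample_points[3]]
)

print(cluster_of)
```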

In [None]:
KMeans_Model = KMeans(
    data = DataFrame('complaints_embeddings'),
    id_column = "complaint_id",
    target_columns = ["Embedding"],
    output_cluster_assignment = True,
    num_clusters = 5
)

In [None]:
print("Data information: \n", KMeans_Model.model_data.shape)

In [None]:
KMeans_Model.result

In [None]:
embeddings_cluster = DataFrame('complaints_embeddings').join(
    other = KMeans_Model.result,
    how = "inner",
    on = "complaint_id=complaint_id",
    lprefix =  "L_"
)

In [None]:
# View complaints in cluster 1
embeddings_cluster[['td_clusterid_kmeans','complaint_id','consumer_complaint_narrative']] \
                    .loc[embeddings_cluster.td_clusterid_kmeans == 1]
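
<p style = 'font-size:16px;font-family:Arial'>The join and filter above amount to attaching a cluster ID to each complaint and then listing the members of one cluster. A plain-Python sketch of the same idea, using hypothetical complaint IDs, narratives and cluster assignments:</p>

```python
from collections import defaultdict

# Hypothetical narratives and cluster assignments, both keyed by complaint ID.
sample_narratives = {
    101: "Charged a late fee twice",
    102: "Unknown account on my credit report",
    103: "Late fee after on-time payment",
}
sample_cluster_ids = {101: 1, 102: 0, 103: 1}

# Inner join on complaint ID, grouped by cluster.
cluster_members = defaultdict(list)
for complaint_id, cluster_id in sample_cluster_ids.items():
    if complaint_id in sample_narratives:
        cluster_members[cluster_id].append(
            (complaint_id, sample_narratives[complaint_id])
        )

# View complaints in cluster 1.
for complaint_id, narrative in cluster_members[1]:
    print(complaint_id, narrative)
```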

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>6. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Cleanup work tables to prevent errors next time.</p>

In [None]:
tables = ['complaints_embeddings']

# Loop through the list of tables and drop each one, ignoring tables that do not exist
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        pass

<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code removes the Vantage context, which closes the connection and cleans up temporary objects created during the session.</p>

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>
<br>
<br>
<p style='font-size: 16px; font-family: Arial;'>The dataset is sourced from <a href='https://www.consumerfinance.gov/data-research/consumer-complaints/'>Consumer Financial Protection Bureau</a></p>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>