<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Topic Modelling using Vantage and Amazon Bedrock
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;'><b>Introduction:</b></p>

<p style='font-size:16px;font-family:Arial;'>In this demo, we explore topic modeling using <b>Teradata Vantage</b> and a large language model (LLM) hosted on <b>Amazon Bedrock</b>. The approach helps businesses uncover recurring themes in large volumes of consumer complaints data so that they can identify trends, improve customer satisfaction, and protect their brand reputation.</p>

<p style='font-size:16px;font-family:Arial;'><b>Key Features:</b></p> 

<ol style='font-size:16px;font-family:Arial;'> 
    <li><b>Scalable Data Ingestion</b>: Load and process large volumes of consumer complaints data in Teradata Vantage, either through foreign tables on cloud storage or from local storage.</li> 
    <li><b>Advanced Topic Modelling</b>: Utilize state-of-the-art topic modeling algorithms to identify and categorize underlying themes and sentiments within the complaints data, providing actionable insights.</li> 
    <li><b>Real-time Analytics</b>: Leverage Teradata Vantage's real-time analytics capabilities to monitor and respond to emerging trends and issues in consumer complaints.</li> 
	<li><b>Customizable Dashboards</b>: Create tailored dashboards to visualize and track key performance indicators (KPIs) and metrics specific to your business needs.</li> 
	<li><b>Integration with Amazon Bedrock</b>: Call an LLM hosted on Amazon Bedrock through Boto3 to assign a topic to each complaint.</li> </ol> 
	
<p style='font-size:16px;font-family:Arial;'><b>Benefits:</b></p> 
	
<ol style='font-size:16px;font-family:Arial;'>
<li><b>Enhanced Customer Insights</b>: Gain a deeper understanding of customer concerns and preferences, enabling data-driven decision-making.</li> 
<li><b>Improved Customer Satisfaction</b>: Identify and address recurring issues, leading to increased customer satisfaction and loyalty.</li> 
<li><b>Competitive Advantage</b>: Stay ahead of the competition by proactively addressing consumer complaints and improving brand reputation.</li> 
<li><b>Cost Savings</b>: Reduce the financial burden of handling and resolving consumer complaints by identifying and addressing root causes.</li> 
<li><b>Data-Driven Decision-Making</b>: Make informed business decisions based on actionable insights derived from topic modeling and real-time analytics.</li> </ol>

<p style = 'font-size:16px;font-family:Arial;'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial;'>
    <li>Connect to Vantage</li>
    <li>Configuring AWS CLI</li>
    <li>Exploring the data</li>
    <li>Topic Modelling</li>
    <li>Cleanup</li>
</ol>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture

!pip install  --upgrade -r requirements.txt --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>Please restart the kernel after running the cell above. The simplest way to restart the kernel is to press zero twice: <b> 0 0</b></i></p>
</div>

<hr style="height:2px;border:none;">
<p style = 'font-size:16px;font-family:Arial;'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
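# Standard-library modules used later for credential prompts, request bodies and parsing
import getpass
import json
import re
import timeit

import numpy as np
import pandas as pd
import boto3
from tqdm import tqdm
from teradataml import *
import plotly.express as px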

display.max_rows = 5
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 5)

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>1. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
execute_sql('''SET query_band='DEMO=Topic_Modelling.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial;'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_cloud');"        # Takes 1 minute
%run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_local');"        # Takes 2 minutes

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>2. Configuring AWS CLI</b>
<p style = 'font-size:16px;font-family:Arial;'>The following cell will prompt us for the following information:</p>
<ol style = 'font-size:16px;font-family:Arial;'>
<li><b>aws_access_key_id</b>: Enter your AWS access key ID</li>
<li><b>aws_secret_access_key</b>: Enter your AWS secret access key</li>
<li><b>region name</b>: Enter the AWS region you want to configure (e.g., us-east-1)</li>
</ol>

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>If the AWS CLI commands below fail to execute or encounter issues, you may proceed directly to Section 2.1, where the same configuration is performed using Boto3.</i></p>
</div>

In [None]:
# def configure_aws():
#     print("configure the AWS CLI")
#     # enter the access_key/secret_key
#     access_key = getpass.getpass("aws_access_key_id ")
#     secret_key = getpass.getpass("aws_secret_access_key ")
#     region_name = getpass.getpass("region name")

#     #set to the env
#     !aws configure set aws_access_key_id {access_key}
#     !aws configure set aws_secret_access_key {secret_key}
#     !aws configure set default.region {region_name}

In [None]:
# does_access_key_exists = !aws configure get aws_access_key_id

# if len(does_access_key_exists) == 0:
#     configure_aws()

In [None]:
# !aws configure list

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>2.1 Initialize the Bedrock Model</b>
<ul style = 'font-size:16px;font-family:Arial;'>
<li>The code below initializes a Boto3 client for the “bedrock-runtime” service.</li>
<li>The model can be used for natural language generation tasks.</li>
</ul>
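<p style = 'font-size:16px;font-family:Arial;'>As an alternative to typing credentials at a prompt, they can be read from the standard AWS environment variables. The following is a minimal sketch, not part of the original demo: <code>build_client_kwargs</code> is a hypothetical helper, and it assumes the variables <code>AWS_ACCESS_KEY_ID</code>, <code>AWS_SECRET_ACCESS_KEY</code>, <code>AWS_SESSION_TOKEN</code> and <code>AWS_DEFAULT_REGION</code> may or may not be set. Blank values are dropped, since a session token is only needed for temporary credentials.</p>

```python
import os


def build_client_kwargs(env=os.environ, default_region="us-east-1"):
    """Collect keyword arguments for boto3.client("bedrock-runtime", ...).

    Hypothetical helper: the function name and the default region are
    assumptions made for this illustration.
    """
    candidates = {
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
        "aws_session_token": env.get("AWS_SESSION_TOKEN"),
    }
    # Keep only values that are set and not blank
    kwargs = {k: v.strip() for k, v in candidates.items() if v and v.strip()}
    # Fall back to the default region when none is configured
    kwargs["region_name"] = env.get("AWS_DEFAULT_REGION") or default_region
    return kwargs
```

<p style = 'font-size:16px;font-family:Arial;'>The result could then be passed on as <code>boto3.client("bedrock-runtime", **build_client_kwargs())</code>.</p>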

In [None]:
session = boto3.Session()
credentials = session.get_credentials()

# Create a Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    aws_access_key_id = getpass.getpass("aws_access_key_id: "),
    aws_secret_access_key = getpass.getpass("aws_secret_access_key: "),
    # The session token applies only to temporary credentials; press Enter to skip it.
    aws_session_token = getpass.getpass("aws_session_token (optional): ") or None
)

<ul style = 'font-size:16px;font-family:Arial;'>
<li>The code below tests the Boto3 client connection to the Amazon Bedrock “bedrock-runtime” service by sending a short sample message to the Amazon Nova Lite model.</li>
</ul>

In [None]:
from botocore.exceptions import ClientError

try:
    response = client.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": "Hello"}]
        }],
        inferenceConfig={"maxTokens": 10}
    )
    print("Test call successful! Response:")
    print(response)
except ClientError as e:
    print(f"Error testing client: {e}")

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>3. Exploring the data</b>

In [None]:
df = DataFrame(in_schema('DEMO_ComplaintAnalysis', 'Consumer_Complaints'))

<p style = 'font-size:16px;font-family:Arial;'>Here we subset the data to get only the complaints related to Mortgage. We further analyze the issues of those complaints and pick the top 5 topics.</p>

In [None]:
df = df[df.product == 'Mortgage']

In [None]:
df.select(['issue', 'sub_issue', 'complaint_id']).groupby(['issue', 'sub_issue']).agg(['count']).sort('count_complaint_id', ascending = False)

<p style = 'font-size:16px;font-family:Arial;'>According to the result above, we can classify the issues into the following topics:</p>

<ul style = 'font-size:16px;font-family:Arial;'>
    <li><b>Mortgage Application</b>: Applying or refinancing</li>
    <li><b>Payment Trouble</b>: Issues during payment</li>
    <li><b>Mortgage Closing</b>: Finalizing the mortgage</li>
    <li><b>Report Inaccuracy</b>: Incorrect information</li>
    <li><b>Payment Struggle</b>: Difficulty paying</li>
</ul>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>4. Topic Modelling</b>

<p style = 'font-size:16px;font-family:Arial;'>Topic modeling using Large Language Models (LLMs) revolutionizes the way we understand and categorize vast collections of text data. LLMs excel in understanding the semantics and context of words, enabling sophisticated topic modeling techniques.</p>

<p style = 'font-size:16px;font-family:Arial;'>Traditionally, topic modeling algorithms like Latent Dirichlet Allocation (LDA) rely on statistical patterns within documents to identify topics. However, LLMs offer a more nuanced approach. By leveraging their deep understanding of language, LLMs can extract complex themes and topics from unstructured text data with higher accuracy and flexibility.</p>

<p style = 'font-size:16px;font-family:Arial;'>LLMs can generate coherent topics without needing predefined categories, making them ideal for exploratory analysis of diverse datasets. Moreover, their ability to capture subtle nuances in language allows for more precise topic identification, even in noisy or ambiguous texts.</p>

<p style = 'font-size:16px;font-family:Arial;'><b>Reasoning with a Chain of Thought</b>: Imagine you're trying to solve a problem. With a large language model, you start with an initial idea or question. Then, you use the model's capabilities to explore related concepts, gradually connecting them together. Each step builds upon the previous one, leading you closer to understanding or solving the problem. It's like putting together puzzle pieces, one by one, until you see the whole picture.</p>
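<p style = 'font-size:16px;font-family:Arial;'>To make the idea concrete, here is a minimal sketch of a chain-of-thought style prompt and a parser for the labelled reply. It is an illustration under stated assumptions rather than the notebook's own code: <code>build_prompt</code> and <code>parse_reply</code> are hypothetical helpers, and this prompt asks for the reasoning before the topic, whereas the prompt used later in this section lists the topic first.</p>

```python
import re

# The five topics identified in the data exploration step
TOPICS = [
    "Mortgage Application", "Payment Trouble", "Mortgage Closing",
    "Report Inaccuracy", "Payment Struggle",
]


def build_prompt(narrative, topics=TOPICS):
    """Ask for one sentence of reasoning followed by exactly one topic."""
    return (
        "The following is text from a complaint:\n\n"
        f'"{narrative}"\n\n'
        "First give one sentence of reasoning, then choose exactly one "
        f"topic from: {', '.join(topics)}\n\n"
        "Answer in the format:\n"
        "Reasoning: <one sentence>\n"
        "Topic: <one topic>"
    )


def parse_reply(text):
    """Return (topic, reasoning); either is '' when its label is missing."""
    def field(label):
        # \s* also consumes a line break, so the value may sit on the next line
        match = re.search(rf"{label}:\s*(.*)", text)
        return match.group(1).strip(" ,") if match else ""
    return field("Topic"), field("Reasoning")
```

<p style = 'font-size:16px;font-family:Arial;'>Returning empty strings for missing labels lets the caller decide how to handle a malformed reply instead of raising an exception in the middle of a long loop.</p>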

In [None]:
def get_topic(prompt):
    request_body = {
        "system": [
            {"text": "You are a review categorization agent"}
        ],
        "messages": [
            {
                "role": "user",
                "content": [{"text": prompt}]
            },
        ],
        "inferenceConfig": {
            "maxTokens": 300,
            "topP": 0.9,
            "topK": 20,
            "temperature": 0.7,
        }
    }
    
    # Invoke the model and extract the response body.
    response = client.invoke_model(
        modelId="amazon.nova-lite-v1:0",
        body=json.dumps(request_body)
    )
    model_response = json.loads(response["body"].read())
    return model_response

In [None]:
pd_df = df.to_pandas()
pd_df['Predicted Topic'] = ''
pd_df['Reasoning with Chain of Thought'] = ''

In [None]:
for i in tqdm(range(len(pd_df))):
    try:
        prompt = f'''
        The following is text from a complaint:

        “{pd_df['consumer_complaint_narrative'][i]}”

        Give me reasoning as well as topic for this complaint
        Instructions for Reasoning:
        - Give the reasoning in exactly one sentence
        Instructions for Topic:
        - The complaint falls into only one of the following Topics: Mortgage Application, Payment Trouble, Mortgage Closing, Report Inaccuracy, Payment Struggle
        - Only select one Topic

        My output comes in the format:
        Topic: ,
        Reasoning:
        '''

        response = get_topic(prompt)
        text = response['output']['message']['content'][0]['text']
        # \s* also skips a line break, so the value may follow on the next line
        topic = re.search(r'Topic:\s*(.*)', text).group(1).strip(' ,')
        reasoning = re.search(r'Reasoning:\s*(.*)', text).group(1).strip()
        # .loc writes to the DataFrame itself; chained indexing may write to a copy
        pd_df.loc[i, 'Predicted Topic'] = topic
        pd_df.loc[i, 'Reasoning with Chain of Thought'] = reasoning
    except Exception as e:
        # Leave the row blank, but report the failure instead of hiding it
        tqdm.write(f"Row {i} skipped: {e}")

In [None]:
pd_df[['complaint_id', 'consumer_complaint_narrative', 'Predicted Topic', 'Reasoning with Chain of Thought']]

<p style = 'font-size:16px;font-family:Arial;'>Now the results can be saved back to Vantage.</p>

In [None]:
copy_to_sql(df = pd_df, table_name = 'topic_prediction', if_exists = 'replace')

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>4.1 Number of Complaints by Predicted Topic</b></p>

<p style = 'font-size:16px;font-family:Arial;'>The bar chart below shows the number of complaints for each predicted topic. In our run, most complaints fell under Mortgage Application and the fewest under Mortgage Closing. Because the model samples its output, the exact counts may differ between runs.</p>
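<p style = 'font-size:16px;font-family:Arial;'>Counts per topic are only meaningful when every predicted label matches one of the five topics exactly, because a stray comma or a change in capitalization would otherwise create a separate bar. The following is a minimal sketch of normalizing labels before counting. It is not part of the original demo: <code>normalize_topic</code> is a hypothetical helper, and <code>raw_labels</code> is made-up sample data.</p>

```python
from collections import Counter

# The five topics identified in the data exploration step
TOPICS = [
    "Mortgage Application", "Payment Trouble", "Mortgage Closing",
    "Report Inaccuracy", "Payment Struggle",
]


def normalize_topic(raw, topics=TOPICS):
    """Map a raw model label to one of the allowed topics, else 'Unrecognized'."""
    # Remove surrounding whitespace, punctuation and markdown emphasis markers
    cleaned = raw.strip(" ,.\n\t\"'*").casefold()
    for topic in topics:
        if cleaned == topic.casefold():
            return topic
    return "Unrecognized"


# Made-up labels showing the kind of variation a model might produce
raw_labels = [" Mortgage Application,", "mortgage application",
              "Payment Trouble", "", "Escrow"]
counts = Counter(normalize_topic(label) for label in raw_labels)
```

<p style = 'font-size:16px;font-family:Arial;'>Keeping an explicit <code>Unrecognized</code> bucket makes it easy to see how many replies could not be matched, rather than dropping them silently.</p>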

In [None]:
grp_gen = DataFrame('topic_prediction').select(['Predicted Topic','complaint_id']).groupby(['Predicted Topic']).agg(['count']).to_pandas()

grp_gen = grp_gen.sort_values('count_complaint_id', ascending = False)[:10]

fig = px.bar(grp_gen, x='Predicted Topic', y='count_complaint_id',
             labels={'count_complaint_id': 'Number of Complaints', 'Predicted Topic': 'Predicted Topic'},
             title='Number of Complaints by Predicted Topic')

# Add hover information
fig.update_traces(hovertemplate='Topic: %{x}<br>Number of Complaints: %{y:,}')

fig.show(renderer="notebook")

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>5. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ComplaintAnalysis');"        # Takes 10 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>Dataset:</b>
<br>
<br>
<p style='font-size: 16px; font-family: Arial; color: #00233C;'>The dataset is sourced from <a href='https://www.consumerfinance.gov/data-research/consumer-complaints/'>Consumer Financial Protection Bureau</a></p>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>