<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Topic Modelling using Teradata VantageCloud and open-source language models
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;'><b>Introduction:</b></p>

<p style='font-size:16px;font-family:Arial;'>In this comprehensive user demo, we will delve into the world of topic modeling using <b>Teradata Vantage</b> and <b>open-source language models</b>. This cutting-edge technology empowers businesses to uncover hidden insights from vast amounts of consumer complaints data, enabling them to identify trends, improve customer satisfaction, and enhance their overall brand reputation.</p> 

<p style='font-size:16px;font-family:Arial;'><b>Key Features:</b></p> 

<ol style='font-size:16px;font-family:Arial;'> 
    <li><b>Scalable Data Ingestion</b>: Seamlessly integrate and process large volumes of consumer complaints data from various sources into Teradata Vantage.</li> 
    <li><b>Advanced Topic Modelling</b>: Utilize state-of-the-art topic modeling algorithms to identify and categorize underlying themes and sentiments within the complaints data, providing actionable insights.</li> 
    <li><b>Real-time Analytics</b>: Leverage Teradata Vantage's real-time analytics capabilities to monitor and respond to emerging trends and issues in consumer complaints.</li> 
	<li><b>Customizable Dashboards</b>: Create tailored dashboards to visualize and track key performance indicators (KPIs) and metrics specific to your business needs.</li> 
	<li><b>Integration with open-source language models</b>: Run open-source language models alongside the data in Vantage to analyze consumer complaints without moving the data off the platform.</li> </ol> 
	
<p style='font-size:16px;font-family:Arial;'><b>Benefits:</b></p> 
	
<ol style='font-size:16px;font-family:Arial;'>
<li><b>Enhanced Customer Insights</b>: Gain a deeper understanding of customer concerns and preferences, enabling data-driven decision-making.</li> 
<li><b>Improved Customer Satisfaction</b>: Identify and address recurring issues, leading to increased customer satisfaction and loyalty.</li> 
<li><b>Competitive Advantage</b>: Stay ahead of the competition by proactively addressing consumer complaints and improving brand reputation.</li> 
<li><b>Cost Savings</b>: Reduce the financial burden of handling and resolving consumer complaints by identifying and addressing root causes.</li> 
<li><b>Data-Driven Decision-Making</b>: Make informed business decisions based on actionable insights derived from topic modeling and real-time analytics.</li> </ol>

<p style = 'font-size:16px;font-family:Arial;'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial;'>
     <li>Configuring the environment</li>
  <li>Connect to Vantage</li>
  <li>Create a Custom Container in Vantage</li>
  <li>Install Dependencies</li>
  <li>Operationalizing AI-powered analytics</li>
  <li>Topic Modelling</li>
  <li>Cleanup</li>
</ol>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial;'>1. Configuring the environment</b>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>1.1 Install the required libraries</b></p>

In [None]:
%%capture

!pip install -r requirements.txt --quiet

In [None]:
!pip install itables teradataml==20.0.0.7 teradatamlwidgets==20.0.0.6 teradatamodelops==7.0.3 teradatasql==20.0.0.34 teradatasqlalchemy==20.0.0.7 sentencepiece sentence-transformers wordcloud --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>Please restart the kernel after executing these two lines. The simplest way to restart the kernel is to press the zero key twice: <b> 0 0</b></i></p>
</div>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>1.2 Import the required libraries</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
from teradataml import *
from teradatasqlalchemy.types import *
from time import sleep
import pandas as pd
import csv, sys, os, warnings
from os.path import expanduser
from collections import OrderedDict
from wordcloud import WordCloud
from IPython.display import clear_output , display as ipydisplay
import matplotlib.pyplot as plt
from itables import init_notebook_mode
import itables.options as opt
from dotenv import load_dotenv, dotenv_values
import time

# Set display options for dataframes, plots, and warnings
opt.style="table-layout:auto;width:auto;float:left"
opt.columnDefs = [{"className": "dt-left", "targets": "_all"}]
init_notebook_mode(all_interactive=True)
%matplotlib inline
warnings.filterwarnings('ignore')
display.suppress_vantage_runtime_warnings = True

# import utils for lake environment
import os
import sys
module_path = os.path.abspath(os.path.join('..', '..','config'))
sys.path.append(module_path)
from oaf_utils import *

# get the current Python version so the custom container can be deployed with a matching one
# python_version = str(sys.version_info[0]) + '.' + str(sys.version_info[1])
python_version = "3.10"
print(f'Using Python version {python_version} for user environment')

# Hugging Face model for the demo
model_name = 'facebook/bart-large-mnli'
# model_name = 'maximalists/BRAG-Llama-3.1-8b-v0.1'

# a list of required packages to install in the custom OAF container
# modify this if using different models or design patterns
pkgs = ['numpy',
        'transformers',
        'torch',
        'sentencepiece',
        'pandas',
        'sentence-transformers']

# container name - set here for easier notebook navigation
### User will also be asked to change it ###
# oaf_name = 'oaf_topic_classification'
oaf_name = 'oaf_demo_gpu'
###########################

<hr style="height: 2px; border: none;">
<p style = 'font-size: 20px; font-family: Arial;'><b>Part 1</b></p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>2. Connect to Vantage</b>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>2.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Load the environment variables from a .env file and use them to create a connection context to Teradata.</p>

In [None]:
print("Creating the context...") 
load_dotenv("../../.config/.env", override=True)
host = os.getenv("host")
username = os.getenv("username")
my_variable = os.getenv("my_variable")

eng = create_context(host=host, username=username, password=my_variable)
execute_sql('''SET query_band='DEMO=Topic_Modelling_VCL.ipynb;' UPDATE FOR SESSION;''')
print("Connected to Teradata:", eng)
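
<p style = 'font-size:16px;font-family:Arial;'>If a variable is missing from the .env file, <code>os.getenv</code> silently returns <code>None</code> and the failure only shows up later at connection time. The sketch below shows one way to fail early instead. <code>require_env</code> is a hypothetical helper written for this illustration; it is not part of teradataml or python-dotenv.</p>

```python
import os


def require_env(names, environ=os.environ):
    """Return {name: value} for the requested variables.

    Hypothetical helper (not part of teradataml): raises RuntimeError
    listing every variable that is missing or empty.
    """
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

<p style = 'font-size:16px;font-family:Arial;'>For example, <code>require_env(["host", "username", "my_variable"])</code> could be called right after <code>load_dotenv(...)</code> and before <code>create_context(...)</code>.</p>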

<p style = 'font-size:16px;font-family:Arial;'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial;'><b>2.2  Connect to the Environment Service</b></p>

<p style = 'font-size:16px;font-family:Arial;'>To better support integration with cloud services and common automation tools, the <b>User Environment Service</b> is accessed via RESTful APIs.  These APIs can be called directly or, as in the examples below, through methods in the Teradata Package for Python (teradataml).</p> 

In [None]:
if os.path.exists("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env"):
    print("Your environment parameter file exists.  Please proceed with this use case.")
    # Load all the variables from the .env file into a dictionary
    env_vars = dotenv_values("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env")
else:
    print("Your environment has not been prepared for connecting to VantageCloud Lake.")
    print("Please contact the support team.")

In [None]:
open_analytics_endpoint = os.getenv("ues_uri")
access_token = os.getenv("access_token")
pem_file = os.getenv("pem_file")
compute_group = os.getenv("gpu_compute_group")

if set_auth_token(base_url=env_vars.get("ues_uri"),
                  pat_token=env_vars.get("access_token"), 
                  pem_file=env_vars.get("pem_file"),
                  valid_from=int(time.time())
                 ):
    print("UES Authentication successful")
else:
    print("UES Authentication failed. Check credentials.")
    sys.exit(1)

<p style = 'font-size:16px;font-family:Arial;'>After connecting and authenticating, check cluster status. Start it if necessary - note the cluster only needs to be running to execute the APPLY sections of the demo.</p>

In [None]:
execute_sql(f"SET SESSION COMPUTE GROUP {compute_group};")
res = check_cluster_start(compute_group=compute_group)

In [None]:
list_user_envs()

<hr style="height:2px;border:none;">

<b style = 'font-size:20px;font-family:Arial;'>3. Create a Custom Container in Vantage</b>

<p style = 'font-size:16px;font-family:Arial;'>If desired, the user can create a <b>new</b> custom environment by starting with a "base" image and customizing it.  The steps are:</p> 
<ul style = 'font-size:16px;font-family:Arial;'>
    <li>List the available "base" images the system supports</li>
    <li>List any existing "custom" environments the user has created</li>
    <li>If there are no custom environments, then create a new one from a base image</li>
    </ul>

In [None]:
# Create a new environment, or connect to an existing one
try:
    ipydisplay(list_user_envs())
except Exception as e:
    if "No user environments found" in str(e):
        print("No user environments found")
        pass
    else:
        raise

print("Use an existing environment, or create a new one:")
print(f"OAF Environment is set to {oaf_name}.")
print("Enter to accept, or input a new value.")
print("If the environment is not in the list, a new one will be created")
i = oaf_name
if len(i) != 0:
    oaf_name = i
    print(f"OAF Environment is now {oaf_name}")

try:
    demo_env = create_env(
        env_name=oaf_name,
        base_env=f"python_{python_version}",
        desc="OAF Demo env for LLM",
    )
except Exception as e:
    if "same name already exists" in str(e):
        print("Environment already exists, obtaining a reference to it")
        demo_env = get_env(oaf_name)
        pass
    elif "Invalid value for base environment name" in str(e):
        print("Unsupported base environment version, using defaults")
        demo_env = create_env(env_name=oaf_name, desc="OAF Demo env for LLM")
    else:
        raise

# Note create_env seems to be asynchronous - sleep a bit for it to register
sleep(5)

try:
    ipydisplay(list_user_envs())
except Exception as e:
    if "No user environments found" in str(e):
        print("No user environments found")
        pass
    else:
        raise

<hr style='height:2px;border:none;'>

<p style = 'font-size:20px;font-family:Arial;'><b>4. Install Dependencies</b></p>

<p style = 'font-size:16px;font-family:Arial;'>The second step in the customization process is to install the Python package dependencies and the pre-trained model. This demonstration uses the Hugging Face <a href = 'https://huggingface.co/facebook/bart-large-mnli'>facebook/bart-large-mnli</a> model, as set in <code>model_name</code> above.  Since VantageCloud Lake Analytic Clusters are secured by default against unauthorized access to the outside network, the user loads the required libraries and model using teradataml methods:
</p> 

<ul style = 'font-size:16px;font-family:Arial;'>
    <li>List the currently installed models and python libraries</li>
    <li><b>If necessary</b>, install any required packages</li>
    <li><b>If necessary</b>, install the pre-trained model.  This process takes several steps:
        <ol style = 'font-size:16px;font-family:Arial;'>
            <li>Import and download the model</li>
            <li>Create a zip archive of the model artifacts</li>
            <li>Call the install_model() method to load the model to the container</li>
        </ol></li>
    </ul>

In [None]:
ipydisplay(demo_env.models)

# just showing a sample here - remove .head(5) to see them all
ipydisplay(demo_env.libs.head(5))

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>4.1 A note on package versions</b></p>

<p style = 'font-size:16px;font-family:Arial;'>The next demonstration makes use of the DataFrame apply() method, which automatically passes the Python code to the Analytic Cluster.  As such, one needs to ensure the Python package versions match between the local and remote environments.  dill and pandas are required, as are any additional libraries for the use case.
</p> 

<p style = 'font-size:16px;font-family:Arial;'><b>Note:</b> While not required for many OAF use cases, for this demo the packages required for model execution must be installed in the local environment first.</p>

In [None]:
# import these functions inside of a function namespace
def get_versions(pkgs):
    local_v_pkgs = []
    for p in pkgs:

        # fix up any hyphened package names
        p_fixed = p.replace("-", "_")

        # import the packages and append the strings to the list
        exec(
            f"""import {p_fixed}; local_v_pkgs.append('{p}==' + str({p_fixed}.__version__))"""
        )
    return local_v_pkgs


v_pkgs = get_versions(pkgs)


# check to see if these packages need to be installed
# by comparing the len of the intersection of the list of required packages with the installed ones
if not len(
    set([x.split("==")[0] for x in pkgs]).intersection(demo_env.libs["name"].to_list())
) == len(pkgs):

    # pass the list of packages - split off any extra info from the version property e.g., plus sign
    claim_id = demo_env.install_lib(
        [x.split("+")[0] for x in v_pkgs], asynchronous=True
    )
else:
    print(f"All required packages are installed in the {oaf_name} environment")
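
<p style = 'font-size:16px;font-family:Arial;'>The cell above reads each version by importing the package. As an alternative sketch, the standard library's <code>importlib.metadata</code> can read the installed version without importing heavy packages such as torch, and it accepts hyphenated distribution names directly. <code>pinned_requirements</code> is a hypothetical helper name used only for this illustration.</p>

```python
from importlib import metadata


def pinned_requirements(packages):
    """Build 'name==version' strings from locally installed distributions.

    Returns (pinned, missing): the pinned requirement strings, and the
    names of packages that are not installed locally.
    """
    pinned, missing = [], []
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
            continue
        # Drop any local version label after '+' (for example a build tag),
        # mirroring the x.split("+")[0] step in the cell above.
        pinned.append(f"{name}=={version.split('+')[0]}")
    return pinned, missing
```

<p style = 'font-size:16px;font-family:Arial;'>With the list defined earlier, <code>pinned_requirements(pkgs)</code> would return the strings that could be passed to <code>install_lib()</code>, plus any packages that still need a local install.</p>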

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>4.2 Monitor library installation status</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Optionally - users can monitor the library installation status using the cell below:
</p> 

In [None]:
# Check the status of installation using status() API.
# Create a loop here for demo purposes
try:
    claim_id
    ipydisplay(demo_env.status(claim_id))
    stage = demo_env.status(claim_id)["Stage"].iloc[-1]
    while stage == "Started":
        stage = demo_env.status(claim_id)["Stage"].iloc[-1]
        clear_output()
        ipydisplay(demo_env.status(claim_id))
        sleep(5)
except NameError:
    print("No installations to monitor")


# Verify the Python libraries have been installed correctly.
ipydisplay(demo_env.libs)
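
<p style = 'font-size:16px;font-family:Arial;'>The monitoring loop above keeps polling for as long as the stage is "Started". A minimal sketch of the same idea with a timeout is shown below, so that a stalled installation cannot block the notebook indefinitely. <code>wait_for_stage</code> and its parameters are hypothetical names for this illustration; the 30-minute default is an arbitrary assumption.</p>

```python
import time


def wait_for_stage(get_stage, in_progress=("Started",), timeout_s=1800,
                   poll_s=5, sleep=time.sleep):
    """Poll get_stage() until it reports a stage outside `in_progress`.

    Hypothetical helper: returns the first stage that is not in progress,
    or raises TimeoutError once `timeout_s` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        stage = get_stage()
        if stage not in in_progress:
            return stage
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in stage {stage!r} after {timeout_s} seconds")
        sleep(poll_s)
```

<p style = 'font-size:16px;font-family:Arial;'>In this notebook the callable could be <code>lambda: demo_env.status(claim_id)["Stage"].iloc[-1]</code>, which is the same expression the loop above evaluates.</p>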

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>4.3 Download and install model</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Open Analytics Framework containers do not have open access to the external network, which contributes to a very secure runtime environment.  As such, users will load pre-trained models using the below APIs.  For illustration purposes, the following code will check to see if the model archive exists locally and if it doesn't, will import and download it by creating a model object.  The archive will then be created and installed into the remote environment.
</p> 

In [None]:
# check to see if the model needs to be downloaded/archived

# construct the file name for the model:
model_fname = "models--" + model_name.replace("/", "--")
# model_fname = "bart-large-mnli"
print(f"model_fname: {model_fname}")

if not os.path.isfile(f"{model_fname}.zip"):

    from sentence_transformers import SentenceTransformer
    import shutil

    print("Creating Model Archive...")

    model = SentenceTransformer(model_name)
    shutil.make_archive(
            model_fname,
            format="zip",
            root_dir=f'{expanduser("~")}/.cache/huggingface/hub/{model_fname}/',
        )
else:
    print("Local model archive exists.")

# check to see if the model is already installed
try:
    if demo_env.models.empty:  # no models installed at all
        print("Installing Model...")
        claim_id = demo_env.install_model(
            model_path=f"{model_fname}.zip", asynchronous=True
        )
    elif not any(
        model_fname in x for x in demo_env.models["Model"]
    ):  # see if model is there
        print("Installing Model...")
        claim_id = demo_env.install_model(
            model_path=f"{model_fname}.zip", asynchronous=True
        )
    else:
        print("Model already installed")
except Exception as e:
    if """NoneType' object has no attribute 'empty""" in str(e):
        print("Installing Model...")
        claim_id = demo_env.install_model(
            model_path=f"{model_fname}.zip", asynchronous=True
        )
        pass
    else:
        raise
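
<p style = 'font-size:16px;font-family:Arial;'>Before uploading, it can be worth confirming that the archive holds what is expected. The sketch below lists the top-level entries of a zip file using only the standard library. A Hugging Face hub cache folder typically contains <code>blobs</code>, <code>refs</code>, and <code>snapshots</code> directories, so those are the names one would expect to see for the archive built above. <code>archive_top_level</code> is a hypothetical helper name.</p>

```python
import zipfile


def archive_top_level(zip_path):
    """Return the sorted top-level entry names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        # Each member name uses '/' separators; keep only the first component.
        return sorted({name.split("/")[0] for name in zf.namelist() if name})
```

<p style = 'font-size:16px;font-family:Arial;'>For example, <code>archive_top_level(f"{model_fname}.zip")</code> could be printed before calling <code>install_model()</code>.</p>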

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>4.4 Monitor model installation status</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Optionally - users can monitor the model installation status using the cell below:
</p> 

In [None]:
# Check the status of installation using status() API.
# Create a loop here for demo purposes
try:
    claim_id
    ipydisplay(demo_env.status(claim_id))
    stage = demo_env.status(claim_id)["Stage"].iloc[-1]
    while stage != "File Installed":
        stage = demo_env.status(claim_id)["Stage"].iloc[-1]
        clear_output()
        ipydisplay(demo_env.status(claim_id))
        sleep(5)
except NameError:
    print("No installations to monitor")


# Verify the model has been installed correctly.
demo_env.refresh()
ipydisplay(demo_env.models)

In [None]:
ipydisplay(demo_env.models)

<p style = 'font-size:16px;font-family:Arial;'>The preceding demo showed how users can perform a <b>one-time</b> configuration task to prepare a custom environment for analytic processing at scale.  Once this configuration is complete, these containers can be re-used in ad-hoc development tasks, or used for operationalizing analytics in production.</p>

<hr style="height: 2px; border: none;">
<p style = 'font-size: 20px; font-family: Arial;'><b>Part 2</b></p>

<hr style='height:2px;border:none;'>
<p style = 'font-size:20px;font-family:Arial;'><b>5. Operationalizing AI-powered analytics</b></p>
<p style = 'font-size:16px;font-family:Arial;'>The following demonstration illustrates how developers can take the next step and <b>operationalize</b> this processing, enabling the entire organization to leverage AI across the data lifecycle, including:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '30%'>
           <ol style = 'font-size:16px;font-family:Arial;'>
               <li><b>Prepare the environment</b>.  Package the scoring function into a more robust program, and stage it on the remote environment</li>
            <br>
            <br>
               <li><b>Python Pipeline</b>.  Execute the function using Python methods</li>
            <br>
            <br>
               <li><b>SQL Pipeline</b>.  Execute the function using SQL - allowing for broad adoption and use in ETL and operational needs</li>
        </ol>
        </td>
        <td width = '20%'></td>
        <td style = 'vertical-align:top'><img src = 'images/OAF_Ops.png' width=350 style="border: 4px solid #404040; border-radius: 10px;" ></td>
    </tr>
</table>


<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>5.1 Check connection</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Reconnect to the database and UES, and start the cluster if necessary.</p> 

In [None]:
# check for existing connection and connect.
eng = check_and_connect(
    host=host, username=username, password=my_variable, compute_group=compute_group
)
print(eng)

# check to see if there is a valid UES auth
if set_auth_token(
    base_url=env_vars.get("ues_uri"),
    pat_token=env_vars.get("access_token"),
    pem_file=env_vars.get("pem_file"),
    valid_from=int(time.time())
):
    print("UES Authentication successful")
else:
    print("UES Authentication failed. Check credentials.")
    sys.exit(1)

# Get environment
demo_env = get_env(oaf_name)

# Check cluster status
check_cluster_start(compute_group=compute_group)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial;'><b>5.2 Create a server-side embedding function</b></p>

<p style = 'font-size:16px;font-family:Arial;'>The goal of this exercise is to create a <b>server-side</b> function which can be staged on the analytic cluster.  This offers several improvements over passing client-side code to the cluster with DataFrame apply():</p> 
<ul style = 'font-size:16px;font-family:Arial;'>
    <li><b>Performance</b>.  Staging the code and dependencies in the container environment reduces the amount of I/O, since the function doesn't need to get serialized to the cluster when called</li>
    <li><b>Operationalization</b>.  The execution pipeline can be encapsulated into a SQL statement, which allows for seamless use in ETL pipelines, dashboards, or applications that need access</li>
    <li><b>Flexibility</b>. Developers can express much greater flexibility in how the code works to optimize for performance, stability, data cleanliness or flow logic</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;'>These benefits do come with some amount of additional work.  Developers need to account for how data is passed in and out of the code runtime, and how to pass it back to the SQL engine to assemble and return the final resultset.  Code is executed when the user expresses an <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/SQL-Reference/SQL-Operators-and-User-Defined-Functions/Table-Operators/APPLY'>APPLY SQL function</a>:</p> 
<ol style = 'font-size:16px;font-family:Arial;'>
    <li><b>Input Query</b>.  The APPLY function takes a SQL query as input.  This query can be as complex as needed and include data preparation, cleansing, and/or any other set-based logic necessary to create the desired input data set.  This complexity can also be abstracted into a database view.  When using the Teradata client connectors for Python or R, this query is represented as a DataFrame or tibble.</li>
    <li><b>Pre-processing</b>.  Based on the query plan, data is retrieved from storage (cache, block storage, or object storage) and the input query is executed.</li>
    <li><b>Distribution</b>.  Input data can be partitioned and/or ordered to be processed on a specific container or collection of them.  For example, the user may want to process all data for a single post code in one partition, and run thousands of these in parallel.  Data can also be distributed evenly across all units of parallelism in the system</li>
    <li><b>Input</b>.  The data for each container is passed to the runtime using standard input (stdin)</li>
    <li><b>Processing</b>.  The user's code executes, parsing stdin for the input data</li>
    <li><b>Output</b>.  Data is sent out of the code block using standard output (stdout)</li>
    <li><b>Resultset</b>.  Resultset is assembled by the analytic database, and the SQL query returns</li>
    </ol>
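
<p style = 'font-size:16px;font-family:Arial;'>The input, processing, and output steps above can be sketched as a small stdin-to-stdout program. This is a minimal illustration of the data contract only, not the contents of <code>Topic_Modelling_OAF.py</code>. The column order and the "#" delimiter follow the Apply() call used later in this notebook; <code>pick_topic</code> is a placeholder word-overlap scorer standing in for the language model, and <code>run</code> is a hypothetical function name.</p>

```python
import sys

DELIMITER = "#"  # assumed to match the delimiter passed to Apply()


def pick_topic(text, topics):
    """Placeholder scorer (assumption): the topic sharing the most words with the text."""
    words = set(text.lower().split())
    return max(topics, key=lambda t: len(words & set(t.lower().split())))


def run(in_stream, out_stream, classify=pick_topic):
    """Read delimited rows, classify each narrative, write delimited result rows."""
    for line in in_stream:
        line = line.rstrip("\n")
        if not line:
            continue
        try:
            # Input columns: complaint_id, narrative, comma-separated topics.
            complaint_id, rest = line.split(DELIMITER, 1)
            narrative, topics = rest.rsplit(DELIMITER, 1)
            candidates = [t.strip() for t in topics.split(",") if t.strip()]
            topic = classify(narrative, candidates)
        except Exception as exc:
            # Report the failure in the topic column so the row is not lost.
            complaint_id = line.split(DELIMITER, 1)[0]
            narrative, topic = "", f"ERROR: {type(exc).__name__}"
        # The delimiter must not appear inside an output field.
        narrative = narrative.replace(DELIMITER, " ")
        out_stream.write(DELIMITER.join([complaint_id, narrative, topic]) + "\n")
```

<p style = 'font-size:16px;font-family:Arial;'>In a staged script, the final line would be a call such as <code>run(sys.stdin, sys.stdout)</code>. Keeping the stream handling separate from the scoring function makes it possible to test the parsing locally before installing the file in the container.</p>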


<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>5.3 Example server-side code block</b></p>

<p style = 'font-size:16px;font-family:Arial;'>The Python script used in the demonstration is saved to the filesystem as <code>Topic_Modelling_OAF.py</code>.  Note that the original client-side processing function has been reused; the additional logic handles input, output, and errors.</p> 


<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>5.4.  Install the file and any additional artifacts</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Use the install_file() method to install this Python file to the container.  As a reminder, this container is persistent, so these steps need only be done infrequently.<br>
Note: Ensure that a valid .zip file path is provided in the <code>"model_path"</code> variable within the .py file below. </p> 

In [None]:
demo_env.install_file("Topic_Modelling_OAF.py", replace=True)

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>5.5  Call the APPLY function </b></p>
<p style = 'font-size:16px;font-family:Arial;'>This function can be executed in two ways:</p> 
<ul style = 'font-size:16px;font-family:Arial;'>
    <li><b><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Teradata-Package-for-Python-on-VantageCloud-Lake/Working-with-Open-Analytics/teradataml-Apply-Class-for-APPLY-Table-Operator'>Python</a></b> by calling the Apply() module function</li>
    <li><b><a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/SQL-Reference/SQL-Operators-and-User-Defined-Functions/Table-Operators/APPLY'>SQL</a></b> which allows for broad adoption across the enterprise</li>
    </ul>
    

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>5.6 APPLY using Python</b></p>

<p style = 'font-size:16px;font-family:Arial;'>The process is as follows</p> 
<ol style = 'font-size:16px;font-family:Arial;'>
    <li>Construct a dictionary that will define the return columns and data types</li>
    <li>Construct a teradataml DataFrame representing the data to be processed - note this is a "virtual" object representing data and logic <b>in-database</b></li>
    <li>Execute the module function.  This constructs the function call in the database, but does not execute anything.  Note the Apply function takes several arguments - the input data, environment name, and the command to run</li>
    <li>In order to execute the function, an "execute_script()" method must be called.  This method returns the server-side DataFrame representing the complete operation.  This DataFrame can be used in further processing, stored as a table, etc.</li>
    </ol>
    

In [None]:
# return types
types_dict = OrderedDict({})
types_dict["complaint_id"] = VARCHAR(32000)
types_dict["consumer_complaint_narrative"] = VARCHAR(10000)
types_dict["topic"] = VARCHAR(1000)

# remove extra characters from text
tdf = DataFrame.from_query(
    """SELECT TOP 10 complaint_id, date_received, product,
    CASE 
        WHEN consumer_complaint_narrative IS NULL THEN ' '
        ELSE OREPLACE(OREPLACE(OREPLACE(OREPLACE(OREPLACE(consumer_complaint_narrative , X'0d' , ' ') , X'0a' , ' ') , X'09', ' '), ',', ' '), '"', ' ')
    END consumer_complaint_narrative,
    'Mortgage Application, Payment Trouble, Mortgage Closing, Report Inaccuracy, Payment Struggle' as topics 
    FROM Demo_ComplaintAnalysis.Consumer_Complaints WHERE consumer_complaint_narrative <> '';"""
)
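
<p style = 'font-size:16px;font-family:Arial;'>The nested OREPLACE calls above matter because rows reach the script as delimited lines of text, so stray line breaks or separators inside a narrative would corrupt the row. The sketch below mirrors the same cleanup in Python for local testing. Note one addition that is <b>not</b> in the SQL above: it also replaces the "#" character, on the assumption that it is used as the Apply() delimiter. <code>clean_narrative</code> is a hypothetical helper name.</p>

```python
# Characters the SQL replaces with a space: CR, LF, tab, comma, double quote.
_CLEAN_TABLE = str.maketrans({ch: " " for ch in '\r\n\t,"'})


def clean_narrative(text, delimiter="#"):
    """Mirror the SQL cleanup; additionally blank out the row delimiter (assumption)."""
    if text is None:
        return " "  # same substitution as the CASE ... IS NULL branch
    return text.translate(_CLEAN_TABLE).replace(delimiter, " ")
```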

In [None]:
apply_obj = Apply(
    data=tdf[["complaint_id", "consumer_complaint_narrative", "topics"]],
    apply_command="python Topic_Modelling_OAF.py",
    returns=types_dict,
    env_name=demo_env,
    delimiter="#",
)

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>5.7 Execute the function</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Call execute_script() and return the results to the client to check the data.</p> 

In [None]:
topic_analysis_df = apply_obj.execute_script()
ipydisplay(topic_analysis_df)

<p style = 'font-size:16px;font-family:Arial;'>Now the results can be saved back to Vantage.</p> 

In [None]:
copy_to_sql(df=topic_analysis_df, table_name="topic_prediction", if_exists="replace")

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>6. Topic Modelling</b>

<p style = 'font-size:16px;font-family:Arial;'>Topic modeling using Large Language Models (LLMs) revolutionizes the way we understand and categorize vast collections of text data. LLMs excel in understanding the semantics and context of words, enabling sophisticated topic modeling techniques.</p>

<p style = 'font-size:16px;font-family:Arial;'>Traditionally, topic modeling algorithms like Latent Dirichlet Allocation (LDA) rely on statistical patterns within documents to identify topics. However, LLMs offer a more nuanced approach. By leveraging their deep understanding of language, LLMs can extract complex themes and topics from unstructured text data with higher accuracy and flexibility.</p>

<p style = 'font-size:16px;font-family:Arial;'>LLMs can generate coherent topics without needing predefined categories, making them ideal for exploratory analysis of diverse datasets. Moreover, their ability to capture subtle nuances in language allows for more precise topic identification, even in noisy or ambiguous texts.</p>

<p style = 'font-size:16px;font-family:Arial;'><b>Reasoning with a Chain of Thought</b>: Imagine you're trying to solve a problem. With a large language model, you start with an initial idea or question. Then, you use the model's capabilities to explore related concepts, gradually connecting them together. Each step builds upon the previous one, leading you closer to understanding or solving the problem. It's like putting together puzzle pieces, one by one, until you see the whole picture.</p>
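
<p style = 'font-size:16px;font-family:Arial;'>In this demo each complaint is assigned one of the candidate topics supplied in the <code>topics</code> column. One common way to do this with a natural language inference model such as facebook/bart-large-mnli is to score every candidate topic as a hypothesis about the text and then normalize the scores across topics. The sketch below shows only that final normalization step, in plain Python. The logit values are made up for illustration, and whether <code>Topic_Modelling_OAF.py</code> follows exactly this approach is an assumption.</p>

```python
import math


def rank_topics(entailment_logits):
    """Softmax per-topic scores and return (topic, probability) pairs, best first."""
    peak = max(entailment_logits.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {topic: math.exp(v - peak) for topic, v in entailment_logits.items()}
    total = sum(exps.values())
    return sorted(
        ((topic, e / total) for topic, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Illustrative scores only; a real model would produce these values.
example = {"Mortgage Application": 2.0, "Payment Trouble": 0.5, "Mortgage Closing": -1.0}
best_topic, best_prob = rank_topics(example)[0]
```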


<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>6.1 Number of Complaints by Predicted Topic</b></p>

<p style = 'font-size:16px;font-family:Arial;'>A graph illustrating the Number of Complaints by Predicted Topic reveals that the majority of complaints are centered around Mortgage Application, while the fewest are related to Mortgage Closing.</p>

In [None]:
import plotly.express as px

grp_gen = (
    DataFrame("topic_prediction")
    .select(["topic", "complaint_id"])
    .groupby(["topic"])
    .agg(["count"])
    .to_pandas()
)

grp_gen = grp_gen.sort_values("count_complaint_id", ascending=False)[:10]

fig = px.bar(
    grp_gen,
    x="topic",
    y="count_complaint_id",
    labels={
        "count_complaint_id": "Number of Complaints",
        "topic": "topic",
    },
    title="Number of Complaints by Predicted Topic",
)

# Add hover information
fig.update_traces(hovertemplate="Topic: %{x}<br>Number of Complaints: %{y:,}")

fig.show()

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>7. Cleanup</b>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>7.1 Remove the Container</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Remove the container if desired</p>

In [None]:
# remove_env("oaf_demo_gpu")

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial;'><b>7.2 Stop the Cluster</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Hibernate the environment if desired</p>

In [None]:
# check_cluster_stop(compute_group)

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>Dataset:</b>
<br>
<br>
<p style='font-size: 16px; font-family: Arial; color: #00233C;'>The dataset is sourced from <a href='https://www.consumerfinance.gov/data-research/consumer-complaints/'>Consumer Financial Protection Bureau</a></p>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>