# Exercise 2 - Jupyter Notebook 
## Analyzing consumer complaints using text embeddings and machine learning
SAP TechEd 2025, Hands-On Workshop: DA261 - Unlocking AI-driven insights from your business data in SAP HANA Cloud
<br><br>


### Understanding the exercise scenario [0:30s]

In this exercise, you will explore how to __classify and process consumer complaints texts__ using 
- __text analysis for sentiment detection__, 
- __similarity search__ with __text embeddings__ and 
- __AutoML__ techniques to build a __consumer complaints classification machine learning model__,  
- thus __unlocking__ the __semantic understanding of text data__ as text embedding feature with machine learning models. 

<br>

The __classification of incoming service requests__ is a __common machine learning use case__. This exercise illustrates how such a use case can be implemented using __out-of-the-box capabilities of the SAP HANA Cloud database__, including a text embedding model to generate embedding vectors, a vector engine to store the generated text embeddings, similarity search functions for vector columns, and machine learning algorithms from the Predictive Analysis Library (PAL) in the database AI engine.

The data used in this scenario derives from the __consumer complaint database__ of the US government's Consumer Financial Protection Bureau.  
Details regarding data license, download and import instructions are given in the [Appendix section](##loading-the-data-sample).
<br>

Now let's get started with the exercise!
<br><br>

## Ex. 2.0 - Connect to your SAP HANA Cloud instance

Throughout this exercise, we will be using the __Python Machine Learning client for SAP HANA__ (hana-ml). For a reference to all its functions and capabilities, see the [hana-ml documentation](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/change_log.html). The current version, released with the SAP HANA Cloud 2025 Q3 release, is 2.26.
- In general, the Python package hana-ml lets you script in Python, while SQL code is generated on the fly and passed directly to the connected SAP HANA database system for execution.
    - It allows you to access and prepare data by means of a HANA dataframe, a Python object that holds a SQL select query. Many methods are provided for the HANA dataframe, each changing the SQL select query behind the scenes.
    - At its core, it provides methods to apply AI functions (algorithms from the Predictive Analysis Library PAL and the Automated Predictive Library APL) to the data prepared with a HANA dataframe, designed so that all processing happens within the SAP HANA database.
- The Python package is delivered with the general SAP HANA Client; in addition, the latest version can always be found in the public PyPI repository at https://pypi.org/project/hana-ml/
<br><br>
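The idea of a dataframe that only holds a SQL query can be illustrated with a small toy class. This is a minimal sketch and not how hana-ml is implemented: `LazyFrame`, the schema name `MY_SCHEMA` and the table name `MY_TABLE` are hypothetical, and the SQL that hana-ml actually generates may look different.

```python
# Toy illustration only: LazyFrame is a hypothetical class, not part of hana-ml.
class LazyFrame:
    """Holds a SELECT statement; every method returns a new frame with a wrapped query."""

    def __init__(self, select_statement):
        self.select_statement = select_statement

    def select(self, *columns):
        # Restrict the column list by wrapping the current query as a subquery
        cols = ", ".join(f'"{c}"' for c in columns)
        return LazyFrame(f"SELECT {cols} FROM ({self.select_statement})")

    def filter(self, condition):
        # Add a WHERE clause around the current query
        return LazyFrame(f"SELECT * FROM ({self.select_statement}) WHERE {condition}")

    def head(self, n):
        # Limit the result to the first n rows
        return LazyFrame(f"SELECT TOP {int(n)} * FROM ({self.select_statement})")


base = LazyFrame('SELECT * FROM "MY_SCHEMA"."MY_TABLE"')
query = base.filter('"Product" = \'Mortgage\'').select("ComplaintID", "Issue").head(5)

# No data has been touched so far; only the query text has grown
print(query.select_statement)
```

Each method call returns a new object and leaves `base` unchanged, which is why intermediate dataframes can be reused freely in the steps that follow.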

In Python, installed packages must be imported into the session before they can be used.

### Step 0: Establish and check connection [2:5s]

In [None]:
## Loading the Python Machine Learning client library for SAP HANA and review the current client version
import hana_ml
print(hana_ml.__version__)

Run the referenced script prepared in the Getting Started section to connect to the SAP HANA Cloud database.

In [None]:
%run "../ex0/ex0_2-check_setup.ipynb"

<br>

## Ex. 2.1 - Exploring Consumer Complaints Data
### Step 1: Create the HANA dataframe for the consumer complaints data [5:60s]

__Introduction to SAP HANA dataframes__
- The [HANA dataframe](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/hana_ml.dataframe.html#module-hana_ml.dataframe) represents a database query as a dataframe in Python, comparable to a pandas dataframe. Most operations are designed not to bring data back from the database into the Python runtime unless explicitly requested. 
- SAP HANA dataframes can be created
    - based on a table, view, calculation view (incl. parameters) or SQL statement (incl. multi-statement)
    - or from a pandas dataframe or Spark dataframe. 

Based on the [ConnectionContext object](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/hana_ml.dataframe.html#hana_ml.dataframe.ConnectionContext) __"myconn"__ (the ConnectionContext class is part of the hana_ml.dataframe module), we are using the table() method to create the initial HANA dataframe.

In [None]:
# Check the connection using "myconn"
myconn.connection.isconnected()

In [None]:
# Create a HANA dataframe in Python against the HANA Cloud table; the data is preloaded into the system based on the details in the /Appendix/Loading the data sample section
consumercomplaints_hdf = myconn.table("CONSUMER_FINANCIAL_COMPLAINTS", schema="DA261_SHARE") 


<br>
Explore what a HANA dataframe object is within the Python environment. The Python functions print() and display() present the output of executed Python commands.

In [None]:
# Understand what is the HANA dataframe
print(consumercomplaints_hdf)
print(consumercomplaints_hdf.select_statement)

<br>
Understand the data structure of the query set underlying the dataframe

In [None]:
# Understand the data structure of the query set underlying the dataframe, using the shape, columns and dtypes() methods
print(consumercomplaints_hdf.shape, '\n')
print(consumercomplaints_hdf.columns, '\n') 
display(consumercomplaints_hdf.dtypes())

As the output of the shape property indicates, the table used in this session has 128330 rows and 19 columns.  
The column __ConsumerComplaintNarrative__ has type NCLOB, and the __ComplaintNarrative_EMBEDDING__ has type REAL_VECTOR with 768 dimensions.
<br>

<br>

For an __initial view of the data__, the __head()__ method can be used in conjunction with the __collect() method__.

- Only when the collect() method is used is data brought back into the Python environment. __Do not__ use __collect() without__ any further __filtering methods__ applied.
- The head() method adds the TOP N predicate to the SQL select clause.

In [None]:
# Filter on the TOP 5 rows, and show them in python
display(consumercomplaints_hdf.head(5).collect())

In [None]:
# Add a custom column list using the select-dataframe method

# Show the full content of each column (no truncation of long text values)
import pandas as pd
pd.set_option('max_colwidth', None) 

# Display columns 'ConsumerComplaintNarrative', 'ComplaintNarrative_EMBEDDING'
display(consumercomplaints_hdf.select('ConsumerComplaintNarrative', 'ComplaintNarrative_EMBEDDING').head(2).collect())

<br>
The dataframe with its methods applied is only a SQL query statement. So what does it look like now?

In [None]:
print(consumercomplaints_hdf.select('ConsumerComplaintNarrative', 'ComplaintNarrative_EMBEDDING').head(5).select_statement)

<br><br>

### Step 2: Explore consumer complaints data [5:120s]

Let's look at some exemplary HANA dataframe methods for aggregation, filtering and data exploration. 

The __describe()__ method provides insightful details about column value distribution and statistics, especially for numeric columns.

In [None]:
# Before calling the describe()-method, we use the drop-method to exclude the unsupported NCLOB-type columns from the analysis
consumercomplaints_hdf.drop("ConsumerComplaintNarrative").describe().collect()

Take note of the unique value counts and the null and non-null counts.  
The ComplaintID is the only numeric (integer) column; however, as an ID column it isn't really suited for looking at column value statistics.  

<br>


<br><br>
__What are the key issue categories consumers complain about?__  
Let's create a new HANA dataframe using the __sql() method__, which allows multi-line SQL statements to be used as the initial HANA dataframe query.

In [None]:
# Use a SQL query to keep only complaints whose issue category counts more than 1000 complaints. For illustration, the categories are selected in a nested subquery with a HAVING clause. [result: 108903]
cc_key_issues_hdf=myconn.sql("""
Select * from "DA261_SHARE"."CONSUMER_FINANCIAL_COMPLAINTS" as CC 
    WHERE CC."Issue" in (Select distinct "Issue" 
                         from (Select "Issue", Count("ComplaintID") as N from "DA261_SHARE"."CONSUMER_FINANCIAL_COMPLAINTS"
                               Group by "Issue" 
                               having Count("ComplaintID") > 1000 ))
""")

print('Dataframe shape: ', cc_key_issues_hdf.shape, '\n')

display(cc_key_issues_hdf.drop("ComplaintNarrative_EMBEDDING").head(2).collect())

<br>
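The filtering pattern used in the query above (keep only rows whose category passes a `HAVING` threshold) can be tried out locally without a database connection. The following is a minimal sketch using Python's built-in `sqlite3` module; the table `complaints`, its six rows and the threshold `MIN_COUNT` are made up for illustration and stand in for the real table and the threshold of 1000.

```python
import sqlite3

# In-memory toy table standing in for the consumer complaints table
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE complaints ("ComplaintID" INTEGER, "Issue" TEXT)')
rows = [(1, "Billing"), (2, "Billing"), (3, "Billing"),
        (4, "Fraud"), (5, "Fees"), (6, "Fees")]
conn.executemany("INSERT INTO complaints VALUES (?, ?)", rows)

MIN_COUNT = 1  # stands in for the threshold of 1000 used in the exercise

# The inner query finds issue categories with more than MIN_COUNT complaints,
# the outer query keeps only the complaints belonging to those categories
kept = conn.execute(
    """
    SELECT "ComplaintID", "Issue" FROM complaints
    WHERE "Issue" IN (SELECT "Issue" FROM complaints
                      GROUP BY "Issue"
                      HAVING COUNT("ComplaintID") > ?)
    ORDER BY "ComplaintID"
    """,
    (MIN_COUNT,),
).fetchall()
conn.close()

print(kept)
```

The single "Fraud" complaint is dropped because its category has only one row, while all "Billing" and "Fees" complaints are kept.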

To get an __aggregated view__ of the __number of complaints per issue category__ with >1000 counts, let's use a __bar-chart visualization__ with the HANA dataframe.
- The hana-ml visualizations automatically push down all the required aggregation queries into the SAP HANA database; only the required result sets are provided to the visualization tools.

In [None]:
# Create a bar-chart visualization
from hana_ml.visualizers.eda import EDAVisualizer 
eda = EDAVisualizer(enable_plotly=True)

fig, bar = eda.bar_plot(data=cc_key_issues_hdf, 
                        column="Issue", 
                        aggregation={'ComplaintID':'count'}, 
                        width=1200, height=600, 
                        title="Count of complaints by key Issue categories")

<br><br>
<br><br>
__How do financial companies respond to consumer complaints?__

Explore complaints by __Company public response__, looking at the full original dataframe.

In [None]:
# Create a bar-chart visualization
fig, bar = eda.bar_plot(data=consumercomplaints_hdf, 
                        column="CompanyPublicResponse", 
                        aggregation={'ComplaintID':'count'}, 
                        width=1200, height=400, orientation='h',
                        title="Count of ComplaintID by CompanyPublicResponse")

Explore complaints by __Company response category__

In [None]:
# Create a bar-chart visualization
fig, bar = eda.bar_plot(data=consumercomplaints_hdf, 
                        column="CompanyResponseToConsumer", 
                        aggregation={'ComplaintID':'count'}, 
                        width=1200, height=300, orientation='h',
                        title="Count of complaints by CompanyResponseToConsumer")

As we can see, financial consumer product companies responded to a significant number of consumer complaints with __monetary relief__. It might be of crucial interest for companies to detect early on those consumer complaints which might lead to monetary compensation. We will come back to this scenario.
<br><br>

## Ex. 2.2 - Analyze consumer sentiment using text analysis (optional)

### Step 3: Exploring the sentiment of consumer complaints [6:120s]

It is of key interest for financial product companies not to stay in dispute with customers. In the data analyzed in this exercise, the Consumer Financial Protection Bureau has discontinued the field [Consumer disputed?](https://cfpb.github.io/api/ccdb/fields.html), which gave a clear customer sentiment category regarding the dispute status.  
From the __complaint narrative__ itself (or of course from direct customer communication), financial product or services companies often __apply sentiment analysis__ techniques to __detect trends__, especially in publicly visible communications, forums, chats or review threads. "Is there a negative sentiment trend regarding our product or company?" could be a common question to be answered.  
<br>
__Sentiment analysis__, a technique in Natural Language Processing (NLP) and text analysis, is still commonly applied to detect such information early on.

In [None]:
# Let's take a look, how many complaints had been classified as disputed, using the dataframe agg()-method for an aggregation analysis
consumercomplaints_hdf.agg([('count','ComplaintID','n_COMPLAINTS')] ,group_by = ['ConsumerDisputed']).collect()

In [None]:
# If you wish, take a look at the HANA dataframe SQL query statement again
#consumercomplaints_hdf.agg([('count','ComplaintID','n_COMPLAINTS')] ,group_by = ['ConsumerDisputed']).select_statement

<br><br>

Let's try to understand, __which companies experienced larger numbers of complaints qualified by consumers as still disputed__.

In [None]:
# Creating a new, temporary dataframe where we apply the filter()-method using a SQL where-clause expression
hdf_tmp=consumercomplaints_hdf.filter('"ConsumerDisputed" = \'Yes\'')

# Then count the remaining complaints by company using the agg()-method
hdf_tmp.agg([('count','ComplaintID','n_COMPLAINTS')] ,group_by = ['Company']).sort('n_COMPLAINTS', desc=True).collect()

<br><br>
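The filter() expressions in this exercise are plain SQL WHERE-clause strings, so column names need double quotes and string values need single quotes, which quickly leads to escaped quotes in Python. A small helper can make this less error-prone. This is a minimal sketch; the function names `quote_identifier`, `quote_literal` and `equals_condition` are hypothetical and not part of hana-ml.

```python
def quote_identifier(name):
    """Wrap a column name in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Wrap a string value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def equals_condition(column, value):
    """Build a '<column> = <value>' expression usable as a WHERE-clause string."""
    return f"{quote_identifier(column)} = {quote_literal(value)}"


condition = equals_condition("ConsumerDisputed", "Yes")
print(condition)
```

The resulting string is the same expression that is passed to filter() above, and it could be passed in the same way, for example `consumercomplaints_hdf.filter(condition)`. For values coming from untrusted input, parameterized queries are the safer choice.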

Let's __filter the complaints for a company__ with a __significant number of complaints voiced to be in dispute__ by the consumer.

In [None]:
# Creating a new, temporary dataframe where we apply the filter()-method using a SQL where-clause expression
hdf_tmp=consumercomplaints_hdf.filter('"ConsumerDisputed" = \'Yes\' AND "Company" = \'UNITED SERVICES AUTOMOBILE ASSOCIATION\'')

<br>

Now, further __prepare__ the HANA dataframe for the __sentiment analysis__  
- Restrict the columns of interest using the select() method  
- Add a __new column__ within the select() method: ('<valid SQL value expression>', '<Column Name>'); in our case, we add a LANGUAGE column for the analysis

In [None]:
# Select columns of interest
hdf_tmp_sentiment=hdf_tmp.select('ComplaintID', 'ConsumerComplaintNarrative', 'CompanyPublicResponse', 'CompanyResponseToConsumer', 'Issue')

# Add a new column using a SQL column value expression, here adding a static string 'en'
hdf_tmp_sentiment=hdf_tmp_sentiment.select('*', ('\'en\'', 'LANGUAGE'))

hdf_tmp_sentiment.head(1).collect()

<br>

Apply the __sentiment analysis__ method from the __text analysis__ functions in __hana-ml__.
- Note: the text analysis functions in SAP HANA Cloud are executed by the NLP services, which must be activated during configuration of the SAP HANA Cloud instance.

In [None]:
# Let's look at the sentiment for the first 50 complaints
hdf_tmp_sentiment=hdf_tmp_sentiment.select('ComplaintID',  'ConsumerComplaintNarrative', 'LANGUAGE').head(50)

# Load the sentiment analysis function and run the analysis for all sentiment areas, incl. document- and sentence-level sentiment
# Results are returned as an ordered list of dataframes, which we name: doc_sentiment, sentence_sentiment, phrase_sentiment, sentences, extra
from hana_ml.text.ta import sentiment_analysis
doc_sentiment, sentence_sentiment, phrase_sentiment, sentences, extra = sentiment_analysis(data=hdf_tmp_sentiment, thread_ratio=0.5, timeout=20)

![](./images/ta_sentiment-classes.png)
<!--- <img src="./images/ta_sentiment-classes.png" width=400 height=150 /> --->

In [None]:
# Results for document sentiment analysis
display(doc_sentiment.sort('DOC_SENTIMENT_MAGITUDE', desc=True).head(5).collect())

Identify the ComplaintID of the complaint showing the biggest document sentiment magnitude and explore its sentence-level sentiment
- Set the identified ComplaintID into the filterSQL variable as an additional in-value

In [None]:
# Let's further investigate sentence-level sentiment for complaints

filterSQL=f'"ComplaintID" in (9999999, <ComplaintID-value>)'  

display(sentence_sentiment.filter(filterSQL).select('SENTENCE_ID', 'CONTENT', 'SENTIMENT_LABEL', 'SENTIMENT_SCORE', 'SENTIMENT_MAGNITUDE').collect())

<br><br>

## Ex. 2.3 - Search in consumer complaints narratives using similarity search (optional)

### Step 4: Explore consumer complaints narratives using Text Embeddings and similarity search [6:180s]

<br>

As we've seen earlier, the initially provided data already contains a text embedding column. See the Appendix section on embeddings for how those have been generated.

In [None]:
# Let's review the original dataframe with complaint text and the generated text embedding column
filterSQL=f'"ComplaintID" in (9999999, 5033663)'  
display(consumercomplaints_hdf.select('ComplaintID','ConsumerComplaintNarrative', 'ComplaintNarrative_EMBEDDING').filter(filterSQL).head(1).collect())

<br>

For a better understanding, let's quickly __create a text embedding__ using the __VECTOR_EMBEDDING SQL function__ in a multi-line SQL dataframe query!  
Do the text embedding vectors look alike, compared to the previous cell output?

In [None]:
# Create a dataframe for the ComplaintID 5033663 and use the VECTOR_EMBEDDING-SQL function applying the latest text embedding model in SAP HANA Cloud (SAP_GXY.20250407)
hdf_tmp_textembedding = myconn.sql(
"""
SELECT "ComplaintID", "ConsumerComplaintNarrative",
		VECTOR_EMBEDDING("ConsumerComplaintNarrative", 'DOCUMENT','SAP_GXY.20250407') AS "NEW_TEXTEMBEDDING_VECTOR"
	FROM "DA261_SHARE"."CONSUMER_FINANCIAL_COMPLAINTS" 
    WHERE "ComplaintID"=5033663
	;
"""
)
display(hdf_tmp_textembedding.collect()) 

<br><br>

For the __similarity search scenario__, now let's __look at complaints__, where the financial product or service __companies__ responded with __monetary relief__.  
We seek to explore: if we had been a __victim of identity theft__ and complained accordingly, would we find __similar complaints where the company responded with monetary relief__?

In [None]:
# Filter complaints closed with monetary relief and select columns of interest
hdf_tmp_simsearch=consumercomplaints_hdf.filter('"CompanyResponseToConsumer" in (\'Closed with monetary relief\')') 
hdf_tmp_simsearch=hdf_tmp_simsearch.select('ComplaintID','ConsumerComplaintNarrative', 'ComplaintNarrative_EMBEDDING')

# Display a sample of complaint texts 'closed with monetary relief'
import pandas as pd
pd.set_option('max_colwidth', None) 
display(hdf_tmp_simsearch.select('ComplaintID','ConsumerComplaintNarrative').head(3).collect())

<br>

Now, let's use __similarity search__ to find complaints by consumers who received monetary compensation and which are similar to our __inquiry text__ about __identity theft__.  
- Note: for the text embedding vector similarity search, our __inquiry text__ gets __transformed into a text embedding__ using the __embed_query() method__.

In [None]:
# Now, search for complaints by consumer victims similar to the topic of identity theft expressed in our complaint text
search_sentence = """As I have reported before, I am a victim of identity theft that happenned during the last black friday sales and I have been affected ever since. 
                     I have called the companies and spoke with them in length that these items are not mine. 
                     I have done everything I can to have these items removed. 
                     They are effecting my everyday life I literally have lost my trust in credit card and internet security over this identity theft. 
                     I expect monetary compensation for those fraudulent credit cards transaction caused by identity theft."""
search_sentence_embedding = myconn.embed_query(search_sentence)

# Let's see if the embedding of our query sentence worked
# Note: the slice just cuts off the printed embedding after the first 24 elements of the vector
print(search_sentence_embedding[:24]) 


<br>

Execute the Vector Engine similarity search using the sort_by_similarity() dataframe method

In [None]:
# Execute the Vector Engine similarity search using the sort_by_similarity() dataframe method
# Choose between similarity_function 'L2DISTANCE' | 'COSINE_SIMILARITY'
hdf_tmp_simsearch.sort_by_similarity("ComplaintNarrative_EMBEDDING", query=search_sentence, model_version='SAP_GXY.20250407', similarity_function='COSINE_SIMILARITY'
                                   ).select('ComplaintID', 'ConsumerComplaintNarrative', 'SIMILARITY').head(10).collect() 

<br>
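To make the two similarity functions more tangible, here is a minimal pure-Python sketch of cosine similarity and L2 (Euclidean) distance on tiny three-dimensional vectors. The vectors and document names are made up for illustration; the real embeddings have 768 dimensions and the computation happens inside the database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between two vectors: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [3.0, 0.0, 3.0],  # same direction as the query, larger magnitude
    "doc_b": [1.0, 1.0, 0.0],
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Cosine similarity: higher is more similar, so sort descending
by_cosine = sorted(documents, key=lambda d: cosine_similarity(query, documents[d]), reverse=True)
# L2 distance: lower is more similar, so sort ascending
by_l2 = sorted(documents, key=lambda d: l2_distance(query, documents[d]))

print(by_cosine)
print(by_l2)
```

In this toy data the two measures rank the documents differently: cosine similarity ignores vector length and puts `doc_a` first, while L2 distance penalizes its larger magnitude and puts it last. For length-normalized embeddings both measures produce the same ranking.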

What does such a __vector similarity search query__ look like in __SQL__?   
Would it work to pass an __inquiry__ in a __different language__ but __still get semantically similar results__ from the text embeddings searched again?

In [None]:
# Search for complaints related to "Je suis victime d'usurpation d'identité et de fraude à la carte bancaire. Je réclame donc une indemnisation."
# or compose a search text yourself "I'd like to report .... "  > let's see if you get high similarity score results with your query.
hdf_tmp_simsearch_sql = myconn.sql(
"""
SELECT "ComplaintID", "Product", "ConsumerComplaintNarrative",
		COSINE_SIMILARITY("ComplaintNarrative_EMBEDDING", 
                          VECTOR_EMBEDDING('Je suis victime de usurpation de identité et de fraude à la carte bancaire. Je réclame donc une indemnisation.', 'DOCUMENT','SAP_GXY.20250407')) AS SIM
	FROM "DA261_SHARE"."CONSUMER_FINANCIAL_COMPLAINTS" 
	ORDER BY SIM DESC
	LIMIT 10;
"""
)
display(hdf_tmp_simsearch_sql.collect()) #"Company", 

<br><br>

## Ex. 2.4 - Predicting monetary relief-response to consumer complaints using AutoML classification

Using machine learning models to __classify__ incoming customer emails, complaints, service requests and the like has been a common use case scenario.  
With the recent emergence of __text embedding__ models __unlocking the semantic understanding of text data__, and with __machine learning algorithm enhancements__ that can process such rich, high-dimensional embedding vectors as features, there is now the potential to build even better classification models for these use cases.  
- The latest machine learning algorithms can process a mix of classic numeric or categorical attributes along with vector-type features. 
- Pre-processing techniques in machine learning, like principal component analysis of vectors, allow the extraction of key information into a lower-dimensional space, e.g. from vectors with 768 or far more dimensions to 64, 128 or 256 dimensions, with only little to negligible precision loss in the machine learning model while gaining much better processing times. 

This exercise illustrates how such a use case can be implemented using __out-of-the-box capabilities of the SAP HANA Cloud database__, including a text embedding model to generate embedding vectors, a vector engine to store the generated text embeddings, and machine learning algorithms from the Predictive Analysis Library (PAL) in the database AI engine.
- PAL algorithms enabled to process vector data include AutoML, Hybrid Gradient Boosting Trees (HGBT), Multi-Target MLP (MT_MLP), KMEANS, HDBSCAN, Vector PCA, and many more.
- Note: the SAP HANA Cloud text embedding models are processed by the NLP services, which must be activated during configuration of the SAP HANA Cloud instance.

### Step 5: Data selection and preparation [5:90s]

For the exemplary __machine learning classification__ scenario, to __predict__ and thus early alert on a potentially required __monetary relief-response to a customer complaint__,  
we are seeking to select the data for __a company__ with a __high proportion of complaint responses with monetary relief__.

In [None]:
# First, to reduce the relevant data, we are filtering out complaints where company responded about acting as authorized by contract or law
hdf_tmp=consumercomplaints_hdf.filter('not "CompanyPublicResponse" in (\'Company believes it acted appropriately as authorized by contract or law\')')

# Count the remaining complaints per company, mapped into the dataframe: hdf_complaints_company
hdf_complaints_company=hdf_tmp.agg([('count','ComplaintID','n_complaints')] ,group_by = ['Company']).sort('n_complaints', desc=True)

# Count the complaints responded to with monetary relief per company, mapped into the dataframe: hdf_complaints_monetary_company
hdf_complaints_monetary_company=hdf_tmp.filter('"CompanyResponseToConsumer" = \'Closed with monetary relief\'').agg([('count','ComplaintID','n_complaints')] ,group_by = ['Company'])

<br>

Join and calculate the percentage of complaints responded to with monetary relief, and map the result into the dataframe: hdf_pct_monetary  
New dataframe functions introduced in this step are
- rename_columns({'<old_name>': '<new_name>', ...})
- sort for sorting the resulting query sets by a column's values

In [None]:
# Join the dataframes, rename column n_complaints as it is in both join-dataframes
hdf_joined=hdf_complaints_company.set_index("Company").join(hdf_complaints_monetary_company.rename_columns({'n_complaints': 'n_complaints_monetary_relief'}).set_index("Company"))

# Calculate the percentage of complaints responded to with monetary relief, map the result into the dataframe hdf_pct_monetary
hdf_pct_monetary=hdf_joined.select('Company', 'n_complaints', 'n_complaints_monetary_relief', ('("n_complaints_monetary_relief"/"n_complaints")*100', 'PCT_MON'))

# Sort and filter the final results set
hdf_pct_monetary.filter('"n_complaints_monetary_relief" > 50').sort('PCT_MON', desc=True).head(10).collect()

<br><br>

The __United Services Automobile Association (USAA)__ has a decent number of overall complaints and a significant proportion responded to with monetary relief, which makes it well suited for this hands-on exercise. USAA is a private American financial services and insurance company that provides a range of banking, insurance, investment, and retirement solutions.   
- So let's filter for the USAA data

In [None]:
# Let's explore the filtered data using the describe()-method, looking at columns for missing values, and columns suitable to include in the classification model
hdf_tmp=hdf_tmp.filter('"Company" = \'UNITED SERVICES AUTOMOBILE ASSOCIATION\'')

# Recall - we need to exclude NCLOB columns from the describe-analysis
hdf_tmp.drop("ConsumerComplaintNarrative").describe().collect()

<br>

As we start to prepare our HANA dataframe to build the classification model, let's drop columns with lots of NULL values or drop the respective rows.  
- Lots of nulls in columns 'SubProduct', 'SubIssue', 'ConsumerDisputed'
- Only a small number of nulls in column 'State'

In [None]:
# As explored with the describe-method, we want to drop columns for the classification task with lots of NULL values or drop respective rows
hdf_tmp=hdf_tmp.select('ComplaintID', 'Product', 'Issue', 'State', 'ZipCode', 'ComplaintNarrative_EMBEDDING', 'CompanyResponseToConsumer')
hdf_tmp=hdf_tmp.filter('not "State" is NULL')

In [None]:
# Rename classification target column to LABEL and replace ' ' with '_' in LABEL values for later processing
hdf_tmp=hdf_tmp.select('ComplaintID', 'Product', 'Issue', 'State', 'ZipCode', 'ComplaintNarrative_EMBEDDING', ('REPLACE("CompanyResponseToConsumer", \' \', \'_\')', 'LABEL'))

print(hdf_tmp.columns)
print(hdf_tmp.shape)

In [None]:
# Review target LABEL value distribution
hdf_tmp.agg([('count','ComplaintID','n_complaints')] ,group_by = ['LABEL']).collect()

Now we are almost ready to build the classification model.
<br><br>

### Step 6: Consumer complaints Text Embedding vector dimension reduction [3:60s]

As described earlier, the high dimensionality of vector data may significantly impact the performance of downstream algorithms. Hence, techniques that extract as much "information" as possible from the very high-dimensional space into a lower-dimensional space should be considered. One such dimension reduction technique is Principal Component Analysis, supported in the Predictive Analysis Library (PAL) for vector input data in the functions Vector-PCA and Categorical-PCA.  
Let's use Vector-PCA to reduce the dimensionality from a 768-dimension text embedding vector to a 64-dimension principal component vector.

In [None]:
# Reducing the Text Embedding vector dimensions from 768 to a vector of 64 principal component dimensions
from hana_ml.algorithms.pal.decomposition import VectorPCA
vecpca = VectorPCA(n_components=64)
hdf_pcavectors = vecpca.fit_transform(data=hdf_tmp.select('ComplaintID','ComplaintNarrative_EMBEDDING'), key='ComplaintID')

hdf_pcavectors=hdf_pcavectors.rename_columns({'SCORE_VECTOR': 'ComplaintNarrative_PCA_VECTOR'})
print(hdf_pcavectors.shape, '\n')
print(hdf_pcavectors.select_statement, '\n')
print('Vector-PCA runtime is: ', vecpca.runtime, '\n')

<br>
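For orientation, the textbook formulation of principal component analysis behind such a reduction from 768 to 64 dimensions is given below. This is the standard definition, stated under the assumption that the embeddings are mean-centered first; the exact pre-processing (for example scaling) applied by the PAL Vector-PCA function is not described here and may differ. The symbols $x_i$ (embedding of complaint $i$), $n$ (number of complaints), $W_k$ and $z_i$ are introduced only for this illustration.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top} \in \mathbb{R}^{768 \times 768}

C\, w_j = \lambda_j\, w_j ,
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{768} \ge 0

W_k = [\, w_1, \dots, w_k \,] \in \mathbb{R}^{768 \times k} ,
\qquad
z_i = W_k^{\top} (x_i - \bar{x}) \in \mathbb{R}^{k} ,
\qquad k = 64

\text{explained variance ratio} = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{768} \lambda_j}
```

Each complaint's 768-dimensional embedding $x_i$ is thus replaced by the 64-dimensional score vector $z_i$, and the explained variance ratio indicates how much of the original variation is retained.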

Note, as needed HANA dataframes can also be saved back to SAP HANA tables, temporary tables, etc. or even as SQL views.

In [None]:
#hdf_pcavectors.save('CC_PCAVECTORS', force=True ) # table_type='...'

<br>

Now, compose our final dataframe for the classification model, joining the selected feature columns with the PCA-Vectors

In [None]:
# Prepare final data for classification
hdf_cc_classification=hdf_tmp.select('ComplaintID', 'Product', 'Issue', 'State', 'ZipCode', 'LABEL').set_index("ComplaintID").join(hdf_pcavectors.set_index("ComplaintID"))

print(hdf_cc_classification.shape, '\n')
display(hdf_cc_classification.head(1).collect())

<br><br>

### Step 7: Build the AutoML classification model predicting monetary relief-response to consumers [12:300s]

Review the prepared HANA dataframe used to train the classification model. Is it still the same as above?

In [None]:
# Review the prepared HANA dataframe ...
print(hdf_cc_classification.columns, '\n')
print(hdf_cc_classification.shape, '\n')
display(hdf_cc_classification.head(1).collect())

<br>

Prepare the feature columns as a Python list variable

In [None]:
# List of features for the classification model: what would be known at the point of time a complaint is issued?
# Let's start to build a model solely using the PCA vector derived from the complaint narratives' text embedding column
class_features=['ComplaintNarrative_PCA_VECTOR'] 

# Variants of columns to potentially explore
#class_features=['Product', 'State', 'ComplaintNarrative_PCA_VECTOR'] 
#class_features=['Product', 'State']                                  # ,'ComplaintNarrative_PCA_VECTOR'

<br>

Split off a hold-out sample of 20% of the data, which is not seen or used during model training but is used to evaluate the model performance (predictive quality) via scoring.  
Stratified sampling ensures the same proportions of LABEL values in both the training sample and the hold-out sample.

In [None]:
# Split the data into a model training sample and a hold-out sample for testing the classification model
from hana_ml.algorithms.pal.partition import train_test_val_split
train_ids, test_ids, _ = train_test_val_split(data=hdf_cc_classification.select('ComplaintID','LABEL'), partition_method='stratified', stratified_column='LABEL',
                                                       training_percentage=0.8,
                                                       testing_percentage=0.2,
                                                       validation_percentage=0.0)

In [None]:
# Join the train_ids, test_ids from the sampling step before with the original data to compose the full classif_train- and classif_test-dataframes
hdf_cc_classif_train=hdf_cc_classification.drop('LABEL').set_index("ComplaintID").join(train_ids.set_index("ComplaintID")) #.select('ComplaintID')
display(hdf_cc_classif_train.shape)

hdf_cc_classif_test=hdf_cc_classification.drop('LABEL').set_index("ComplaintID").join(test_ids.set_index("ComplaintID"))
display(hdf_cc_classif_test.shape)

display(hdf_cc_classif_train.head(1).collect())

<br><br>
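The idea behind the stratified hold-out split used above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not the PAL implementation; the function `stratified_split`, the toy data and the label names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(ids_with_labels, train_fraction=0.8, seed=1234):
    """Split (id, label) pairs so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group the row IDs by their label value
    by_label = defaultdict(list)
    for row_id, label in ids_with_labels:
        by_label[label].append(row_id)

    # Shuffle and cut every label group separately at the same fraction
    train_part, test_part = [], []
    for ids in by_label.values():
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train_part.extend(ids[:cut])
        test_part.extend(ids[cut:])
    return train_part, test_part


# Hypothetical toy data: IDs 0-19 are labelled "relief", IDs 20-99 "explanation"
toy_data = [(i, "relief") for i in range(20)] + [(i, "explanation") for i in range(20, 100)]
train_part, test_part = stratified_split(toy_data)

print(len(train_part), len(test_part))
```

Because every label group is cut at the same fraction, the rare label keeps its 20% share in both samples, which a purely random split of a small dataset would not guarantee.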

The Predictive Analysis Library (PAL) __AutoML functions__ provide a means to __automatically explore different machine learning algorithms__ and algorithm __parameter combinations__ for a given machine learning scenario and the data provided. They also evaluate __chaining of multiple algorithms__ into so-called algorithm pipelines. AutoML thus benchmarks various algorithm configurations and pipelines against each other and uses a genetic algorithm to optimize towards the best possible machine learning model.  
<br>
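How a genetic search improves candidates over several generations can be sketched with a toy example. This is a strongly simplified illustration and not the PAL AutoML algorithm: the `fitness` function is a made-up stand-in for a cross-validated model score, the two hyperparameters `depth` and `rate` are hypothetical, and real AutoML additionally handles pipelines, crossover and techniques such as successive halving. The parameter names `population_size`, `generations`, `offspring_size` and `elite_number` are chosen to mirror the AutoML settings used in this exercise.

```python
import random


def fitness(config):
    """Hypothetical stand-in for a cross-validated score; best at depth=6, rate=0.1."""
    return -((config["depth"] - 6) ** 2) - 100 * (config["rate"] - 0.1) ** 2


def random_config(rng):
    return {"depth": rng.randint(1, 12), "rate": rng.uniform(0.01, 0.5)}


def mutate(config, rng):
    """Return a copy of the configuration with one parameter slightly changed."""
    child = dict(config)
    if rng.random() < 0.5:
        child["depth"] = min(12, max(1, child["depth"] + rng.choice([-1, 1])))
    else:
        child["rate"] = min(0.5, max(0.01, child["rate"] + rng.uniform(-0.05, 0.05)))
    return child


def genetic_search(population_size=10, generations=5, offspring_size=10,
                   elite_number=5, seed=1234):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        # Keep the best candidates (elites) and derive new candidates from them
        elites = sorted(population, key=fitness, reverse=True)[:elite_number]
        offspring = [mutate(rng.choice(elites), rng) for _ in range(offspring_size)]
        # Elites survive unchanged, so the best score can never get worse
        population = elites + offspring
        best_per_generation.append(max(fitness(c) for c in population))
    return max(population, key=fitness), best_per_generation


best, history = genetic_search()
print(best, history)
```

The list `history` holds the best score after each generation and never decreases, because the elite candidates are carried over unchanged.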
As the __target metric__ for the AutoML process, __we want the model to focus on the best possible predictions__ for the __LABEL value Closed_with_monetary_relief__. This is addressed with the __scorings value "F1_SCORE_Closed_with_monetary_relief"__. Apart from that, we are using a default configuration to start the AutoML process.

In [None]:
# Creating the PAL AutoML classification scenario and general settings 
from hana_ml.algorithms.pal.auto_ml import AutomaticClassification
from hana_ml.visualizers.automl_progress import PipelineProgressStatusMonitor

automl_class = AutomaticClassification(config_dict='default',      # 'default' or 'light' AutoML classification configuration (selection of algorithms and parameters)
                                       population_size=10, generations=5,  offspring_size=10,
                                       search_method='GA', successive_halving=True, fold_num=3,
                                       random_seed=1234, max_layer=2, elite_number=5, 
                                       scorings={"F1_SCORE_Closed_with_monetary_relief": 1.0}) 

<br>

The AutoML configuration determines which algorithms and parameter values are explored. It can be adjusted as needed.

In [None]:
# Display the classifier algorithms included and their configurations
automl_class.display_config_dict(category="Classifier")

#automl_class.display_config_dict()

<br>

Start the AutoML process and open the AutoML progress monitor.

In [None]:
# Typically, a database administrator would use workload classes to constrain the database resources that a machine learning task can consume
automl_class.enable_workload_class("PAL_AUTOML_WORKLOAD")
#automl_class.disable_workload_class_check()
progress_status_monitor = PipelineProgressStatusMonitor(connection_context=myconn, automatic_obj=automl_class)
progress_status_monitor.start()

automl_class.fit(data=hdf_cc_classif_train, key='ComplaintID', features=class_features, label='LABEL')

print(automl_class.runtime)

<br>

Explore the best models using the __Unified Model report__, which gives an overview of the best (elite) pipeline models found.
- What is the __F1_SCORE_Closed_with_monetary_relief__ value achieved with the best model?

In [None]:
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(automl_class).build().display()

<br>

Apply the __score()-method__ to the hold-out sample to __evaluate the model performance (e.g. accuracy)__ on data that was not seen while fitting the model.

In [None]:
score_pred_hdf, score_stats_hdf = automl_class.score( data=hdf_cc_classif_test, key='ComplaintID', features=class_features, label='LABEL')
automl_class.runtime

In [None]:
# Review the statistics: how does the F1_SCORE_Closed_with_monetary_relief from the score task compare to the one from the fit task?
score_stats_hdf.collect()

In [None]:
# Review the predicted values in detail for the complaints in the hold-out sample
score_pred_hdf.filter('SCORE = \'Closed_with_monetary_relief\'').sort('CONFIDENCE', desc=True).select('ID', 'SCORE', 'CONFIDENCE').head(5).collect()

<br>

Use the __predict()-method__ to obtain __predictions__ of the company's response for new complaints.

In [None]:
# Predict company response-category for incoming complaints
hdf_predictions = automl_class.predict( data=hdf_cc_classif_test.drop('LABEL'), key='ComplaintID', features=class_features) 
print(automl_class.runtime)

hdf_predictions_monetary=hdf_predictions.filter('SCORES = \'Closed_with_monetary_relief\'').select('ID', 'SCORES') 
#hdf_predictions_monetary.head(5).collect() 

# Complaints predicted to be addressed with monetary relief
hdf_result=consumercomplaints_hdf.select('ComplaintID', 'ConsumerComplaintNarrative').set_index("ComplaintID").join(hdf_predictions_monetary.set_index("ID"))
hdf_result.rename_columns({'SCORES': 'PREDICTED_CompanyResponseToConsumer'}).head(10).collect()

# Appendix - reference sections

## Loading the data sample (optional)

Data and License  
This repository contains a [sample CSV dataset file](/datasets/complaint_clean.csv), a small dataset of service request tickets on complaints received about financial products and services. The dataset was obtained from [Kaggle Simulations](https://www.kaggle.com/sebastienverpile/consumercomplaintsdata/home?select=Consumer_Complaints.csv). It is originally from [DATA.GOV](https://catalog.data.gov/dataset/consumer-complaint-database). The dataset is licensed under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/), which waives copyright interest in the work and dedicates it to the worldwide public domain.

The consumer complaints data was downloaded as a CSV file with the following filters applied:
- field=all
- has_narrative=true
- company_received_max=2025-08-20
- date_received_max=2025-06-30 and date_received_min=2011-12-01
- company_public_response in
    - Company believes it acted appropriately as authorized by contract or law
    - Company disputes the facts presented in the complaint
    - Company believes the complaint is the result of a misunderstanding
    - Company believes complaint caused principally by actions of third party outside the control or direction of the company
    - Company believes complaint is the result of an isolated error
    - Company believes the complaint provided an opportunity to answer consumer's questions
    - Company believes complaint represents an opportunity for improvement to better serve consumers
    - Company can't verify or dispute the facts in the complaint 
- Company response to consumer in
    - Closed with explanation 
    - Closed with non-monetary relief 
    - Closed with monetary relief 
    - Closed
    
Note: `company_public_response` is the company's optional public-facing response to a consumer's complaint. Companies can choose to select a response from a pre-set list of options, which is then posted in the public database.

See this direct download link of the filtered set [Filtered consumer-complaints download link](https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?company_public_response=Company%20believes%20it%20acted%20appropriately%20as%20authorized%20by%20contract%20or%20law&company_public_response=Company%20disputes%20the%20facts%20presented%20in%20the%20complaint&company_public_response=Company%20believes%20the%20complaint%20is%20the%20result%20of%20a%20misunderstanding&company_public_response=Company%20believes%20complaint%20caused%20principally%20by%20actions%20of%20third%20party%20outside%20the%20control%20or%20direction%20of%20the%20company&company_public_response=Company%20believes%20complaint%20is%20the%20result%20of%20an%20isolated%20error&company_public_response=Company%20believes%20the%20complaint%20provided%20an%20opportunity%20to%20answer%20consumer%27s%20questions&company_public_response=Company%20believes%20complaint%20represents%20an%20opportunity%20for%20improvement%20to%20better%20serve%20consumers&company_public_response=Company%20can%27t%20verify%20or%20dispute%20the%20facts%20in%20the%20complaint&company_received_max=2025-08-20&date_received_max=2025-06-30&date_received_min=2011-12-01&field=all&format=csv&has_narrative=true&no_aggs=true&size=128330)
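
For readers who want to adapt the filters, the following sketch rebuilds the query string of the link above from a list of parameters using only the Python standard library. It reproduces only the parameters that are visible in that link and does not perform any download. The variable names are illustrative.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/"

PUBLIC_RESPONSES = [
    "Company believes it acted appropriately as authorized by contract or law",
    "Company disputes the facts presented in the complaint",
    "Company believes the complaint is the result of a misunderstanding",
    "Company believes complaint caused principally by actions of third party "
    "outside the control or direction of the company",
    "Company believes complaint is the result of an isolated error",
    "Company believes the complaint provided an opportunity to answer consumer's questions",
    "Company believes complaint represents an opportunity for improvement to better serve consumers",
    "Company can't verify or dispute the facts in the complaint",
]

# Repeated keys are expressed as a list of (key, value) pairs
params = [("company_public_response", response) for response in PUBLIC_RESPONSES] + [
    ("company_received_max", "2025-08-20"),
    ("date_received_max", "2025-06-30"),
    ("date_received_min", "2011-12-01"),
    ("field", "all"),
    ("format", "csv"),
    ("has_narrative", "true"),
    ("no_aggs", "true"),
    ("size", "128330"),
]

# quote_via=quote encodes spaces as %20 (instead of +), matching the link above
download_url = BASE_URL + "?" + urlencode(params, quote_via=quote)
print(download_url)
```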

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Loading the data into a pandas dataframe first
consumer_data= pd.read_csv('./complaints-2025-08-21_03_58.csv')
consumer_data.head(2)
consumer_data.columns

In [None]:
consumer_data=consumer_data.rename(columns={"Date received": "DateReceived", "Sub-product": "SubProduct", "Sub-issue": "SubIssue", 
                                            "Consumer complaint narrative": "ConsumerComplaintNarrative", "Company public response": "CompanyPublicResponse", 
                                            "Consumer consent provided?": "ConsumerConsentProvided", "Company response to consumer": "CompanyResponseToConsumer", 
                                            "Submitted via": "SubmittedVia", "Date sent to company": "DateSentToCompany",  "Timely response?": "TimelyResponse",
                                            "Consumer disputed?": "ConsumerDisputed", "Complaint ID": "ComplaintID", "ZIP code": "ZipCode"})

In [None]:
consumer_data.columns
consumer_data.dtypes

In [None]:
# Move the last column to the first position
last_col = consumer_data.iloc[:, -1]  # extract the last column
consumer_data = pd.concat([last_col, consumer_data.iloc[:, :-1]], axis=1)  # prepend it to the remaining columns

In [None]:
consumer_data.head(3)
consumer_data.shape
consumer_data.columns

<br>

Create a HANA dataframe and an SAP HANA Cloud database table from the pandas dataframe, specifying explicit HANA column data types.

In [None]:
from hana_ml.dataframe import create_dataframe_from_pandas
import pandas as pd
consumercomplaints_hdf = create_dataframe_from_pandas(
        myconn,
        consumer_data, #.head(100000),
        table_name="CONSUMER_COMPLAINTS_TMP",
        force=True,
        replace=True,
        drop_exist_tab=True
        ,table_structure={"ComplaintID": "INT", "DateReceived" : "NVARCHAR(10)", "Product" : "NVARCHAR(128)", "SubProduct" : "NVARCHAR(128)", 
                          "Issue": "NVARCHAR(128)", "SubIssue"  : "NVARCHAR(256)", "ConsumerComplaintNarrative"  : "NCLOB", "CompanyPublicResponse": "NCLOB", 
                          "Company": "NVARCHAR(128)", "State": "NVARCHAR(64)", "ZipCode": "NVARCHAR(8)", "Tags": "NVARCHAR(128)", "ConsumerConsentProvided": "NVARCHAR(24)"
                          ,"SubmittedVia": "NVARCHAR(24)", "DateSentToCompany"  : "NVARCHAR(10)", "CompanyResponseToConsumer"  : "NVARCHAR(64)", 
                          "TimelyResponse": "NVARCHAR(8)",  "ConsumerDisputed": "NVARCHAR(8)"} #, "ComplaintNew": "NVARCHAR(5000)",  "ResponseNew": "NVARCHAR(5000)"
        )
print(consumercomplaints_hdf.select_statement) 