<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Combining Structured and Unstructured Data for Predictive Modeling – a recipe
 <br>       
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 150px; height: auto; margin-top: 20pt;">
  <br>
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>
    In real-world machine learning, the most useful signals don’t always live in neat, structured tables. Sure, you’ve got sensor readings and transaction logs—but a ton of valuable context often hides in messier stuff like operator notes, support tickets, or log files. The trick is: once you combine those two worlds—structured and unstructured—you can build smarter, more insightful models.
<br><br>
That’s exactly what this blog post is about: a hands-on recipe for predicting equipment outages by blending sensor data with free-text operator logs. We’ll show how to turn unstructured text into meaningful features using embeddings and clustering, then plug those into a supervised model. And the best part? You can run the whole pipeline—embedding, scoring, and all—right inside Teradata Vantage.
<br><br>
While we focus on predictive maintenance here, this pattern is super flexible. You can use it anywhere unstructured text adds value. Think customer support tickets in banking, field reports in telecom, or even patient notes in healthcare. Here's what that might look like:
<br><br>


| Industry           | Structured Data                 | Unstructured Data        | Use Case                           |
| ------------------ | ------------------------------- | ------------------------ | ---------------------------------- |
| Manufacturing      | Sensor data, production cycles  | Operator logs            | Predictive maintenance             |
| Banking            | Transaction logs                | Customer support tickets | Fraud detection                    |
| Retail             | Sales metrics, inventory levels | Customer reviews         | Demand forecasting                 |
| Healthcare         | Patient vitals, lab results     | Doctor’s notes           | Early diagnosis / readmission risk |
| Telecommunications | Network metrics                 | Technician reports       | Network failure prediction         |


<p style = 'font-size:16px;font-family:Arial'>In this walkthrough, we’re taking historical machine data, both structured (like sensor readings) and unstructured (like operator-written log entries), and using it to predict whether a machine will fail in the next 24 hours.
<br>
    Here’s what we will do:
</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>First, we turn the operator logs into embeddings—dense vector representations of the text</li>
    <li>Then we use KMeans to find common patterns (or “topic prototypes”) in those embeddings</li>
    <li>We turn those topics into features by measuring how close each log is to each cluster center</li>
    <li>Next, we combine those new features with the classic sensor data and train a model to predict outages</li>
    <li>Finally, we put it all together into a full inference pipeline that runs end-to-end inside the database—no data movement, no glue code</li>
</ul>
<p style = 'font-size:16px;font-family:Arial'>It all comes together in a clean, scalable setup. Here’s a sketch of what that looks like. Note that we assume the features from the structured data have already been calculated and the outage events have already been defined based on a given SLA/KPI:</p>

<center><img src="images/flowchart_embeddingsfeatures.png" alt="workflow_topictrend" style="background-color: #f0f0f0; border: 4px solid #404040; border-radius: 10px;"></center>

<p style = 'font-size:16px;font-family:Arial'>This method gives us an efficient way to condense free-text operator logs into a small set of interpretable topic features, which a supervised model can then use alongside the sensor data to predict outages.</p>

<p style = 'font-size:20px;font-family:Arial;'><b>Dataset Overview</b></p>
<p style = 'font-size:16px;font-family:Arial;'>
To show how you can combine structured and unstructured data for predictive maintenance, we’re using a synthetic dataset. Real-world data is often locked away due to privacy or operational reasons, so we built a dataset that mimics the messiness and patterns you'd typically run into in industrial settings.
<br><br>
We’ve got 10,000 rows, split into 8,000 for training and 2,000 for testing.
</p>
<p style = 'font-size:16px;font-family:Arial;'>Each row has:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
  <li><b>15 numeric features</b> —your usual suspects from sensors: temp, pressure, RPM, vibration, etc.</li>
  <li><b>One text column</b>, maintenance_log_aug, where operators log what they saw or did, in free-form comments.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;'>As you might expect, the log entries are all over the place:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
<li>Weird formatting</li>
<li>Typos like <code>successefully</code> or <code>prblms</code></li>
<li>Some super detailed, others barely say anything</li>
<li>Inconsistent language and terminology</li>
</ul>


<p style = 'font-size:16px;font-family:Arial;'>That makes traditional NLP techniques like bag-of-words kind of useless. So instead, we turn those logs into dense vector embeddings using a pre-trained Hugging Face model (exported to ONNX for in-database use). These embeddings are much better at capturing the real meaning of the text—even if it’s messy.
<br><br>
Next, we’ll show how to turn those embeddings into clean, interpretable features and plug them into a supervised model alongside the sensor data.</p>
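<p style = 'font-size:16px;font-family:Arial;'>To make the bag-of-words problem concrete, here is a minimal sketch in plain Python. The two log entries are hypothetical examples (not rows from the dataset), and the Jaccard overlap is just one simple way to measure shared vocabulary. Both entries describe the same event, yet they share almost no tokens once typos and abbreviations creep in.</p>

```python
import re

def bag_of_words(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of tokens two token sets have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Two hypothetical log entries describing the same event: one clean, one messy.
clean = "Pump restarted successfully, no problems found"
messy = "pmp restartd successefully - no prblms"

overlap = jaccard(bag_of_words(clean), bag_of_words(messy))
print(f"Token overlap: {overlap:.2f}")  # only the word "no" is shared
```

<p style = 'font-size:16px;font-family:Arial;'>An embedding model, by contrast, works on sub-word pieces and context, so it can place such near-duplicates close together in vector space.</p>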




<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>1. Connect to Vantage, import Python packages and explore the dataset</b></p>

In [None]:
!pip install -r requirements.txt --quiet

In [None]:
%%capture
!pip install teradataml --upgrade --quiet
!pip install teradatasqlalchemy --upgrade --quiet
!pip install teradataml-plus --upgrade --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>The above libraries have to be installed. Restart the kernel after executing these cells to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing <b> 0 0</b></i> (zero zero) and pressing <i>Enter</i>.</p>
</div>
<p style = 'font-size:16px;font-family:Arial;'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import getpass
import tdmlplus
import teradataml
import pandas as pd
import numpy as np

from huggingface_hub import hf_hub_download

from catboost import CatBoostClassifier
from sklearn.metrics import f1_score, confusion_matrix, roc_curve, auc, classification_report
import plotly.graph_objs as go
import plotly.express as px

from IPython.display import Markdown, display
from teradataml import *

In [None]:
%run utils/tab_widget.py # imports a function `display_dataframes_in_tabs`
list_relevant_tables = [] # we will progressively add the names of relevant tables to this list for display

<hr style="height:2px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'> 1.1 Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i /home/jovyan/JupyterLabRoot/UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Combine_structured_unstructured_predictive_modelling;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial;'><b>Getting Data for This Demo</b></p>


In [None]:
%run utils/_dataload.ipynb # takes about 20 seconds

<p style = 'font-size:16px;font-family:Arial;'>In addition, we want to check whether our database already has the functionality required to generate embeddings.</p>


In [None]:
VCL = False # current system is VCE/VCore

In [None]:
if VCL:
    results = execute_sql("help database mldb").fetchall()
else:
    results = execute_sql("help user mldb").fetchall()

embeddings_functions = [x[0] for x in results if x[0].startswith("ONNXEmbeddings")]
if len(embeddings_functions) > 0:
    print("\n".join(embeddings_functions))
    print("---------------------\nONNXEmbeddings is installed")
else:
    print("ONNXEmbeddings is not installed. Please Upgrade to BYOM version 6")

# Inspect the Data

Before jumping into modeling, we take a quick look at the input data using a handy little utility called `display_dataframes_in_tabs()`. It shows our key tables as interactive tabs right in the Jupyter notebook—super useful for sanity checks.

We’ve got two main datasets: **pump\_failure\_train** and **pump\_failure\_test**. Both have the same structure:

* `row_id`: A unique ID for each row
* `outage_next_24h`: Our target—1 means there was an outage in the next 24 hours, 0 means everything was fine
* 15 numeric features from sensors (things like temperature, pressure, etc.)
* `maintenance_log_aug`: A free-text field where operators wrote down what they saw or did

Pretty straightforward. This is the raw material we’ll be working with.


In [None]:
list_relevant_tables+=["pump_failure_train", "pump_failure_test"]

In [None]:
display_dataframes_in_tabs(list_relevant_tables)

In [None]:
DF_train = DataFrame(in_schema(username,"pump_failure_train"))
DF_test = DataFrame(in_schema(username,"pump_failure_test"))

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Load Hugging Face Model</b></p>
<p style = 'font-size:16px;font-family:Arial;'>To generate embeddings, we need an ONNX model capable of transforming text into vector representations. We use a pretrained model from
<a href="https://huggingface.co/Teradata/gte-base-en-v1.5" target="_blank">Teradata's Hugging Face repository</a>. That repository hosts models such as gte-base-en-v1.5; in this notebook we use bge-small-en-v1.5, which produces 384-dimensional embeddings. The model and its tokenizer are downloaded and stored in Vantage tables as BLOBs using the <code>save_byom</code> function.</p>

In [None]:
model_name = "bge-small-en-v1.5"
number_dimensions_output = 384
model_file_name = "model.onnx"

In [None]:
# Step 1: Download Model from Teradata HuggingFace Page
hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"tokenizer.json", local_dir="./")

In [None]:
# using the command line syntax as it is more reliable than the Python function
!hf download Teradata/{model_name} onnx/{model_file_name} --local-dir ./

In [None]:
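# Drop the model and tokenizer tables if they exist from a previous run
try:
    db_drop_table("embeddings_models")
except Exception:
    pass  # table did not exist yet
try:
    db_drop_table("embeddings_tokenizers")
except Exception:
    pass  # table did not exist yet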

In [None]:
# Step 2: Load Models into Vantage
# a) Embedding model
save_byom(model_id = model_name, # must be unique in the models table
               model_file = f"onnx/{model_file_name}",
               table_name = 'embeddings_models' )
# b) Tokenizer
save_byom(model_id = model_name, # must be unique in the models table
              model_file = 'tokenizer.json',
              table_name = 'embeddings_tokenizers') 

In [None]:
display_dataframes_in_tabs(["embeddings_models","embeddings_tokenizers"])

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>3. Create the Embeddings</b>

<hr style="height:2px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>3.1 Generate Embeddings with ONNXEmbeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>
Now it's time to generate the embeddings using <b>ONNXEmbeddings</b>.<br>We run the ONNXEmbeddings function on a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture ensures that embeddings are computed in parallel using <b>ONNX Runtime</b> on each node.<br>That said, generating embeddings for the entire training set can be time-consuming on a system with limited resources. In the <b>ClearScape Analytics experience</b>, only a <b>4 AMP system</b> with constrained RAM and CPU power is available. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. In a real-life scenario you would typically have several hundred AMPs with much more compute power.<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b>: The source table containing the text to be embedded. </li>
<li><b>ModelTable</b>: The table storing the ONNX model.                    </li>
<li><b>TokenizerTable</b>: The table storing the tokenizer JSON file.       </li>
<li><b>Accumulate</b>: Specifies additional columns to retain in the output </li>  
<li><b>OutputFormat</b>: Specifies the data format of the output embeddings (<b>FLOAT32(384)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process <b>10 records for testing</b> and rely on precomputed embeddings for further analysis.  
</p>


In [None]:
configure.byom_install_location = "mldb"

In [None]:
input_table = f"SELECT TOP 10 t.row_id, t.maintenance_log_aug as txt FROM {username}.pump_failure_train t" # we only create 10 embeddings to test the function
DF_sample10 = DataFrame.from_query(input_table)

In [None]:
my_model = DataFrame.from_query(f"select * from {username}.embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from {username}.embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
DF_embeddings_sample = ONNXEmbeddings(
    newdata = DF_sample10,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["row_id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_sample

In [None]:
list_relevant_tables.append("operator_log_embeddings_train")
DF_train_embeddings = DataFrame("operator_log_embeddings_train")
DF_test_embeddings = DataFrame("operator_log_embeddings_test")
DF_train_embeddings.shape

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>4. Prototype Distances as Features: Making Embeddings Work for ML</b>



<p style="font-size:16px;font-family:Arial;">
After converting the free-text maintenance logs into <b>384-dimensional embeddings</b> using a pre-trained transformer model, we’re left with a nontrivial question: 
<b>how do we combine these dense, abstract vectors with our 15 structured sensor features</b> in a way that’s efficient, interpretable, and still predictive?
</p>

<p style="font-size:16px;font-family:Arial;">
You might expect us to simply append the embedding vectors to the sensor data and move on. 
And technically, that would work—many modern ML models can ingest hundreds of features without complaint. 
But the imbalance between 15 structured features and 384 text-derived ones poses practical issues. 
It increases training time, introduces the risk of overfitting, and—perhaps most critically—renders the model harder to interpret and trust. 
So, like many, we turned to <b>dimensionality reduction</b>.
</p>

<p style="font-size:16px;font-family:Arial;">
At this point, you're probably thinking of <b>PCA</b>, maybe <b>autoencoders</b>. 
These are the usual suspects for cutting down dimensionality. 
But here's where we do something unconventional: <b>we use KMeans clustering.</b>
</p>

<p style="font-size:16px;font-family:Arial;">
At first glance, KMeans doesn’t sound like a dimensionality reduction method at all. 
Clustering is usually about grouping, not compressing. 
But in our case, we flip the usual logic. 
We don’t care about the cluster assignments. 
Instead, for each embedding vector, we compute its <b>distance to each cluster centroid</b>—turning a 384-dimensional embedding into a set of, say, 15 distances. 
These distance values then become our new features.
</p>
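<p style="font-size:16px;font-family:Arial;">
The idea fits in a few lines of NumPy. The sketch below is illustrative only: the embeddings and centroids are random stand-ins (not the notebook's data), and <code>centroid_distances</code> is a hypothetical helper name. It shows how a 384-dimensional vector per log entry becomes a 15-dimensional vector of distances.
</p>

```python
import numpy as np

def centroid_distances(embeddings, centroids):
    """Euclidean distance from every embedding (n, d) to every centroid (k, d).

    Returns an (n, k) matrix: one row per log entry, one column per cluster.
    """
    diff = embeddings[:, None, :] - centroids[None, :, :]  # shape (n, k, d)
    return np.linalg.norm(diff, axis=2)

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 384))  # stand-in for 100 log embeddings
centroids = rng.normal(size=(15, 384))    # stand-in for 15 KMeans centroids

features = centroid_distances(embeddings, centroids)
print(features.shape)  # (100, 15)
```

<p style="font-size:16px;font-family:Arial;">
Each column of <code>features</code> can then be treated like any other numeric input to a classifier.
</p>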

<p style="font-size:16px;font-family:Arial;">
<b>Why would we do that?</b><br>
First, it gives us <b>explicit control over dimensionality</b>: the number of clusters we choose is the number of features we generate. 
Second, and more importantly, it gives us a way to <b>interpret</b> the resulting features. 
Each centroid represents a prototype region of the text space. 
By looking at the log entries closest to each cluster center, we can understand what kind of language or issue that cluster "captures"—whether it's communication errors, maintenance confirmations, or vague operator remarks.
</p>

<p style="font-size:16px;font-family:Arial;">
You might be surprised, but this isn’t as ad hoc as it sounds. 
There’s a line of work that treats centroids in vector space as <b>prototypes</b>, and uses distances to them as features—seen in areas like <b>Bag-of-Visual-Words</b> for image classification (explained in the box below, feel free to skip it).
</p>

<div style="border-left:5px solid ;padding:10px;margin:10px 0;">
<p style="font-size:16px;font-family:Arial;">
<b><i>Excursus: Prototype Distances in Bag-of-Visual-Words (BoVW)</i></b><br>
While using distances to KMeans centroids as features might seem unconventional in the context of text embeddings, 
the same idea has long been a cornerstone of traditional computer vision pipelines—particularly in the <b>Bag-of-Visual-Words (BoVW)</b> model.
</p>

<p style="font-size:16px;font-family:Arial;">
Here’s how it works:
</p>
<ul style="font-size:16px;font-family:Arial;">
<li>In BoVW, an image is first decomposed into many local descriptors—for example, SIFT or ORB vectors extracted from key points in the image. 
Each descriptor is a high-dimensional vector representing a small patch of the image. 
These vectors are then clustered using KMeans, where the resulting centroids represent prototypical “visual words.” 
This creates a fixed “vocabulary” of visual patterns.</li>

<li>To represent a new image, each of its local descriptors is assigned to the nearest centroid, 
and the image is then encoded as a histogram of how often each centroid appears. 
This histogram becomes the input to downstream classifiers like SVMs.</li>

<li>In more advanced versions, instead of just counting how often a centroid is assigned, researchers use soft-assignment or distance-weighted histograms, 
where the feature reflects how close the image patches are to each centroid. 
In this sense, the image is embedded in a space defined by prototype distances.</li>

<li>This technique made it possible to take a highly variable, unstructured input (an image), 
and project it into a compact, interpretable, and fixed-length feature vector—exactly what we aim to achieve with log text embeddings.</li>
</ul>

<p style="font-size:16px;font-family:Arial;">
BoVW was widely used in state-of-the-art vision models before deep learning took over, 
and it still appears in lightweight or embedded ML applications today. 
The conceptual takeaway is clear: representing new data points by their relationship to learned prototypes is a powerful and general method for feature engineering—even if the domain shifts from pixels to text.
</p>
</div>
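<p style="font-size:16px;font-family:Arial;">
For readers who want to see the BoVW encoding in code, here is a small NumPy sketch under stated assumptions: the descriptors and centroids are random placeholders rather than real SIFT/ORB vectors, and the softmax-style weighting in <code>soft_histogram</code> is just one possible soft-assignment scheme, chosen here for brevity.
</p>

```python
import numpy as np

def _distances(descriptors, centroids):
    """Pairwise Euclidean distances, shape (n_descriptors, n_centroids)."""
    return np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)

def hard_histogram(descriptors, centroids):
    """Classic BoVW: count how often each centroid is the nearest one."""
    nearest = _distances(descriptors, centroids).argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(centroids))
    return counts / counts.sum()

def soft_histogram(descriptors, centroids, temperature=1.0):
    """Soft assignment: each descriptor spreads its vote by closeness."""
    dists = _distances(descriptors, centroids)
    # Subtract the row minimum before exponentiating for numerical stability.
    weights = np.exp(-(dists - dists.min(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights.mean(axis=0)

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))  # placeholder local descriptors of one image
centroids = rng.normal(size=(5, 8))      # placeholder "visual words"

hard = hard_histogram(descriptors, centroids)
soft = soft_histogram(descriptors, centroids)
print(hard.round(3), soft.round(3))
```

<p style="font-size:16px;font-family:Arial;">
Both encodings have a fixed length equal to the number of centroids, regardless of how many descriptors the image contains.
</p>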


<p style="font-size:16px;font-family:Arial;">
While <b>PCA</b> tends to collapse semantic structure into hard-to-interpret axes 
(and often assigns large variance to a single, uninformative direction), 
clustering preserves <b>topical or behavioral structure</b> in a way that's more meaningful 
and robust to the quirks of transformer embeddings.
</p>

<p style="font-size:16px;font-family:Arial;">
Of course, this method isn’t without trade-offs. 
Distance features tend to be <b>correlated</b>, especially when clusters overlap. 
That makes them less suitable for models that assume feature independence. 
And, like PCA, <b>KMeans is unsupervised</b>—it’s not optimizing for the predictive task directly. 
But for our needs, it's a strong compromise between <b>performance</b>, <b>interpretability</b>, 
and <b>deployment feasibility</b>.
</p>
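<p style="font-size:16px;font-family:Arial;">
The correlation issue is easy to reproduce on synthetic data. In this sketch, all points and centroids are random placeholders: centroid B is deliberately placed almost on top of centroid A, while centroid C sits on the opposite side of the space. The distances to A and B end up nearly redundant.
</p>

```python
import numpy as np

def distance_to(points, centroid):
    """Euclidean distance from each point to a single centroid."""
    return np.linalg.norm(points - centroid, axis=1)

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 16))                    # placeholder embeddings
centroid_a = rng.normal(size=16)
centroid_b = centroid_a + 0.05 * rng.normal(size=16)   # overlaps almost fully with A
centroid_c = -centroid_a                               # opposite side of the space

dist_a = distance_to(points, centroid_a)
dist_b = distance_to(points, centroid_b)
dist_c = distance_to(points, centroid_c)

corr_overlap = np.corrcoef(dist_a, dist_b)[0, 1]  # overlapping clusters
corr_far = np.corrcoef(dist_a, dist_c)[0, 1]      # well-separated clusters
print(f"A vs B: {corr_overlap:.3f}, A vs C: {corr_far:.3f}")
```

<p style="font-size:16px;font-family:Arial;">
Tree-based models usually cope well with such redundancy, which is one reason a gradient-boosted classifier is a convenient choice later on.
</p>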

<p style="font-size:16px;font-family:Arial;">
Crucially, this approach integrates seamlessly into our environment. 
<b>Teradata Vantage</b> offers a <b>highly optimized, in-database KMeans implementation</b>. 
That means we can cluster the embedding space once and then transform new log entries into feature vectors 
entirely within the database—no extra infrastructure, no model serving, no data movement.
</p>

<p style="font-size:16px;font-family:Arial;">
So yes, using <b>KMeans</b> for dimensionality reduction might seem unconventional. 
But as we show in the next section, it provides a surprisingly effective and practical foundation 
for combining <b>unstructured</b> and <b>structured features</b> in real-world ML pipelines.
</p>

<p style="font-size:16px;font-family:Arial;">
Below is the code for fitting and persisting a cluster model with <b>15 clusters</b>.
</p>


In [None]:
num_clusters = 15 # 15 features
kmeans_out = KMeans(
    id_column="row_id",
    data=DF_train_embeddings,
    target_columns=f"emb_0:emb_{number_dimensions_output-1}",
    output_cluster_assignment=False,
    num_init=10,
    num_clusters=num_clusters,
    iter_max=50,
    seed= 42
)

In [None]:
copy_to_sql(kmeans_out.model_data, "operator_log_clustermodel", if_exists="replace")

In [None]:
list_relevant_tables.append("operator_log_clustermodel")

In [None]:
display_dataframes_in_tabs(list_relevant_tables,-1)

<hr style="height:2px;border:none;">

<b style = 'font-size:18px;font-family:Arial;'>4.1 Naming the Clusters</b>

<p style="font-size:16px;font-family:Arial;">
To make our cluster-based features more <b>interpretable</b>, we give each cluster a short, meaningful name—
a kind of label that hints at what the grouped messages are all about. 
Since the clusters group together similar log texts, 
we can look at the ones closest to each centroid and get a feel for the common theme.
</p>

<p style="font-size:16px;font-family:Arial;">
Here’s how we do it: we pull the <b>top 5 messages</b> nearest to each cluster center and scan them 
(either by eye or with a little help from GPT) to figure out what kind of issue or pattern the cluster represents. 
These names—like <code>sensor_malfunction_issues</code> or <code>filter_replacement_status</code>—
become the column names in our feature set. 
That way, when we look at feature importances later, we’re not staring at <code>cluster_7</code> 
but at something that actually makes sense.
</p>

<p style="font-size:16px;font-family:Arial;">
It’s a small step that pays off big in terms of clarity—great for explaining the model to others, 
debugging unexpected behavior, or meeting <b>interpretability requirements</b> in regulated environments.
</p>
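<p style="font-size:16px;font-family:Arial;">
Before the in-database version, here is the same "closest messages per cluster" logic as a small pandas sketch. The toy table and its column names (<code>cluster_id</code>, <code>distance</code>, <code>log_text</code>) are hypothetical and chosen only for illustration.
</p>

```python
import pandas as pd

def closest_messages(df, n=5):
    """Return the n rows with the smallest distance within each cluster."""
    return (
        df.sort_values(["cluster_id", "distance"])
          .groupby("cluster_id")
          .head(n)
          .reset_index(drop=True)
    )

# Toy data: two clusters with three log messages each.
toy = pd.DataFrame({
    "cluster_id": [0, 0, 0, 1, 1, 1],
    "distance":   [0.30, 0.10, 0.20, 0.05, 0.40, 0.15],
    "log_text":   ["valve ok", "no leaks", "no leak seen",
                   "filter swapped", "odd noise", "filter replaced"],
})

top = closest_messages(toy, n=2)
print(top)
```

<p style="font-size:16px;font-family:Arial;">
Reading the surviving rows per cluster is usually enough to spot the common theme; in the toy data, cluster 0 is about leaks and cluster 1 about filters.
</p>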


In [None]:
# Step 1: Compute distance of each log embedding to cluster centroids using in-DB KMeans
DF_clusterdistance = KMeansPredict(
    data=DF_train_embeddings,
    object=DataFrame(in_schema(username, "operator_log_clustermodel")),
    output_distance=True
).result

# Step 2: Rank messages within each cluster by distance to centroid (ascending)
DF_clusterdistance = DF_clusterdistance.assign(
    rank_distance=DF_clusterdistance.td_distance_kmeans.window(
        partition_columns=DF_clusterdistance.td_clusterid_kmeans,
        order_columns=DF_clusterdistance.td_distance_kmeans
    ).dense_rank()
)

# Step 3: Keep only top 5 closest messages per cluster
DF_clusterdistance_top = DF_clusterdistance.loc[DF_clusterdistance.rank_distance <= 5]

In [None]:
# Step 4: Join back to original logs to retrieve the text content
DF_topmessages = DF_clusterdistance_top.join(
    DF_train.select(["row_id", "maintenance_log_aug"]),
    how="inner",
    on=["row_id = row_id"],
    lsuffix="a"
).select(["td_clusterid_kmeans", "maintenance_log_aug"]).drop_duplicate()

# Step 5: Convert to pandas for downstream processing
df_topmessages = DF_topmessages.to_pandas()

In [None]:
df_topmessages

In [None]:
# Set this to True if you want to dynamically generate topic names via OpenAI, you need an OpenAI key for this
I_Have_an_OpenAI_API_Key = False

In [None]:
if I_Have_an_OpenAI_API_Key:
    import os, getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI API KEY")

In [None]:
# OpenAI prompt for generating interpretable topic names
if I_Have_an_OpenAI_API_Key:
    prompt_template = """Your task is to identify a common topic of the following messages, which have shown similar vector embeddings. 
    Your answer should be a single string, maximum 30 characters, only using lowercase latin alphabet and underscores. 
    The cosine similarity with this topic will be used as a column name in a database table. Only return this column name.
    Here are the messages:
    
    {messages}
    
    ====
    Column Name:
    """

In [None]:
if I_Have_an_OpenAI_API_Key:
    from openai import OpenAI
    column_names =  {}
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    
    for i in range(num_clusters):
        cluster_feedback = '\n\n'.join(df_topmessages[df_topmessages['td_clusterid_kmeans'] == i]['maintenance_log_aug'])
        this_prompt = prompt_template.format(messages = cluster_feedback)
        try:
            chat_completion = client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": this_prompt,
                    }
                ],
                model="gpt-4o",
                temperature=0,
                max_tokens=4096
            )
            column_names[i] = chat_completion.choices[0].message.content.strip()
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            raise RuntimeError(f"Failed to call OpenAI API: {e}") from e

In [None]:
# Fallback: Predefined topics from a previous run
if not I_Have_an_OpenAI_API_Key :
    column_names = {
        0: 'emergency_stop_test_success',
        1: 'no_unusual_noises_detected',
        2: 'steady_power_consumption',
        3: 'control_panel_status',
        4: 'sensors_status_normal',
        5: 'routine_maintenance_completed',
        6: 'filter_replacement_status',
        7: 'temperature_status',
        8: 'slight_leak_detection',
        9: 'no_leaks_or_valve_issues',
        10: 'cleaned_debris_pump_area',
        11: 'steady_pressure_and_lubrication',
        12: 'slight_increase_in_noise',
        13: 'coolant_maintenance',
        14: 'latest_version_software_update'
    }

<hr style="height:2px;border:none;">

<b style = 'font-size:18px;font-family:Arial;'>Vector Distance-Based Feature Extraction (and Reuse)</b>

<b style = 'font-size:18px;font-family:Arial;'>4.2 Using Clusters as Features</b>

<p style = 'font-size:16px;font-family:Arial;'>To actually use our clusters as features, we need to answer a simple question: <em>how similar is each log message to each cluster/topic?</em> We do this by computing the <b>cosine similarity</b> between each embedding and the cluster centroids. This tells us how strongly a given message matches each semantic pattern we discovered earlier.</p>

<p style = 'font-size:16px;font-family:Arial;'>The nice part is that we can do this entirely <b>inside the Teradata database</b>, using <code>VectorDistance</code> to compute similarities and <code>Pivoting</code> to turn them into individual columns—one for each cluster. No data leaves the platform; everything stays within Teradata.</p>

<p style = 'font-size:16px;font-family:Arial;'>We wrapped this logic into a small reusable function, so we can apply the same transformation to both training and test data. The output is a tidy table of features, ready to be joined with the structured sensor data.</p>
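<p style = 'font-size:16px;font-family:Arial;'>As a mental model for what the in-database function produces, here is a NumPy sketch of the same computation. The vectors are random placeholders and the three topic names are hypothetical. In the database, <code>VectorDistance</code> returns one row per (message, centroid) pair and <code>Pivoting</code> spreads them into columns; in NumPy the result is already a wide matrix, so only the column naming remains.</p>

```python
import numpy as np

def cosine_similarity_features(embeddings, centroids):
    """Cosine similarity of every embedding (n, d) to every centroid (k, d) -> (n, k)."""
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cen_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return emb_unit @ cen_unit.T

# Hypothetical topic names, keyed by cluster id.
topic_names = {0: "no_leaks", 1: "filter_status", 2: "noise_increase"}

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(5, 8))  # placeholder log embeddings
centroids = rng.normal(size=(3, 8))   # placeholder cluster centroids

sims = cosine_similarity_features(embeddings, centroids)

# One named feature column per cluster, ordered by cluster id.
feature_columns = ["logmessagesimilarity_" + topic_names[i] for i in sorted(topic_names)]
features = {name: sims[:, j] for j, name in enumerate(feature_columns)}
print(feature_columns)
```

<p style = 'font-size:16px;font-family:Arial;'>Note that cosine similarity is one minus cosine distance, which is the conversion the function below applies to the <code>distance</code> column.</p>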


In [None]:
def get_centroid_features(DF_embeddings):
    """
    Compute similarity features between input embeddings and cluster centroids,
    and pivot the top-k similarities into individual columns.

    Args:
        DF_embeddings (DataFrame): A teradataml DataFrame containing row-wise embedding vectors.

    Returns:
        DataFrame: A DataFrame with pivoted similarity features to cluster centroids.
    """
    DF_VD = VectorDistance(
        target_data=DF_embeddings,
        target_id_column="row_id",
        reference_data=DataFrame(in_schema(username, "operator_log_clustermodel")),
        ref_id_column="td_clusterid_kmeans",
        distance_measure="COSINE",
        target_feature_columns=f"emb_0:emb_{number_dimensions_output-1}",
        ref_feature_columns=f"emb_0:emb_{number_dimensions_output-1}",
        topk=num_clusters  # keep the similarity to every centroid
    ).result

    DF_VD2 = DF_VD.assign(
        similarity=1.0 - DF_VD.distance,
        row_id=DF_VD.target_id
    )

    pivot_obj = Pivoting(
        data=DF_VD2,
        data_partition_column="row_id",
        data_order_column="reference_id",
        partition_columns="row_id",
        target_columns="similarity",
        rows_per_partition=num_clusters,
        output_column_names=["logmessagesimilarity_" + c for c in list(column_names.values())]
    )

    DF_VD_pivoted = pivot_obj.result

    return DF_VD_pivoted


In [None]:
DF_logfeatures_train = get_centroid_features(DF_train_embeddings)
DF_logfeatures_test = get_centroid_features(DF_test_embeddings)

<p style = 'font-size:16px;font-family:Arial;'>Finally, we join the structured sensor features with the newly generated log similarity features - creating a unified dataset ready for supervised learning.</p>

In [None]:
# Merge training data with log-derived features and drop redundant columns
DF_train_ADS = DF_train.merge(
    DF_logfeatures_train,
    on="row_id",
    rsuffix="emb"
).drop(
    columns=["maintenance_log_aug", "row_id_emb"]
)

In [None]:
# Merge test data with log-derived features and drop redundant columns
DF_test_ADS = DF_test.merge(
    DF_logfeatures_test,
    on="row_id",
    rsuffix="emb"
).drop(
    columns=["maintenance_log_aug", "row_id_emb"]
)


<b style = 'font-size:18px;font-family:Arial;'>Testing Feature Value: Structured, Unstructured, or Both?</b>

<p style = 'font-size:16px;font-family:Arial;'>To evaluate whether the log-derived embedding features actually improve predictive performance, we train three separate models using different feature sets: (1) classic sensor features only, (2) text embedding–based features only, and (3) a combination of both. </p>

<p style = 'font-size:16px;font-family:Arial;'>The model choice—CatBoost—is arbitrary; it’s a fast, reliable classifier that works well with heterogeneous feature types, can deal with inter-correlated features and doesn’t require much preprocessing. </p>

<p style = 'font-size:16px;font-family:Arial;'>What matters here isn’t fine-tuning the model, but comparing performance across feature sets. This experiment helps us understand whether the unstructured log data carries additional signal beyond what’s already available in the structured telemetry. Once satisfied, we can export the trained model to ONNX and use Teradata’s Bring Your Own Model (BYOM) capabilities to run the entire inference pipeline—embedding generation, feature extraction, and model scoring— fully inside the database.</p>

In [None]:
# Pulling data into Python memory for model training
df_train = DF_train_ADS.to_pandas()
df_test = DF_test_ADS.to_pandas()

In [None]:
df_train.shape, df_test.shape

In [None]:
# Define target column
target = "outage_next_24h"

# Identify feature groups: classic sensor features (first 15), text-based features (the rest)
classic_features = list(df_train.columns[1:16])
text_features = list(df_train.columns[16:])
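
<p style = 'font-size:16px;font-family:Arial;'>Positional slicing works as long as the column order never changes. As a more defensive alternative, here is a minimal sketch that picks the feature groups by name instead. It assumes that every log-derived feature starts with the <code>logmessagesimilarity</code> prefix (as the importance plot later in this post suggests) and uses a made-up, shortened column list; adapt both to the real ADS.</p>

```python
# Hypothetical, shortened column layout for illustration; the real ADS has more columns
columns = [
    "row_id", "motor_temp", "rpm",
    "logmessagesimilarity_inspection_results",
    "logmessagesimilarity_sensor_malfunction_issues",
    "outage_next_24h",
]
target = "outage_next_24h"
text_prefix = "logmessagesimilarity"  # assumed prefix of all log-derived features

# Text features are identified by prefix; classic features are everything else
text_features = [c for c in columns if c.startswith(text_prefix)]
classic_features = [
    c for c in columns
    if c not in text_features and c not in (target, "row_id")
]

# Guard against leaking the label or the key into the feature matrix
assert target not in classic_features + text_features
assert "row_id" not in classic_features + text_features

print(classic_features)
print(text_features)
```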

In [None]:
# Split training features into classic, text-only, and combined sets
X_train_classic = df_train[classic_features]
X_train_text = df_train[text_features]
X_train_combined = df_train[classic_features + text_features]

# Split test features in the same way
X_test_classic = df_test[classic_features]
X_test_text = df_test[text_features]
X_test_combined = df_test[classic_features + text_features]

# Extract training and test targets
y_train = df_train[target]
y_test = df_test[target]

In [None]:
# Define a reusable function to train and evaluate a CatBoost model
def train_eval(X_train, X_test, y_train, y_test, label):
    # Initialize CatBoost with reasonable defaults and class imbalance handling
    model = CatBoostClassifier(
        iterations=500,
        learning_rate=0.05,
        depth=6,
        eval_metric='F1',
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),  # handle class imbalance
        verbose=0
    )

    # Train the model
    model.fit(X_train, y_train)

    # Predict labels and probabilities
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # Compute evaluation metrics
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    report = classification_report(y_test, y_pred, output_dict=True)

    # Return all relevant results and the trained model
    return {
        'label': label,
        'f1': f1,
        'cm': cm,
        'fpr': fpr,
        'tpr': tpr,
        'auc': roc_auc,
        'report': report,
        'model': model
    }

# Run experiments with different feature sets and collect results
results = []
results.append(train_eval(X_train_classic, X_test_classic, y_train, y_test, 'Classic Features'))
results.append(train_eval(X_train_text, X_test_text, y_train, y_test, 'Text Features'))
results.append(train_eval(X_train_combined, X_test_combined, y_train, y_test, 'Combined Features'))
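
<p style = 'font-size:16px;font-family:Arial;'>A quick note on the <code>scale_pos_weight</code> argument used above: it is simply the ratio of negative to positive training rows, so that the rare outage class counts more during training. The sketch below reproduces that ratio on made-up labels; the helper name <code>pos_weight</code> and the explicit check for missing positives are illustrative additions, not part of the notebook.</p>

```python
def pos_weight(labels):
    """Return the negatives-to-positives ratio used to up-weight the rare class."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return n_neg / n_pos

# Made-up labels: 95 normal rows and 5 outages
toy_labels = [0] * 95 + [1] * 5
weight = pos_weight(toy_labels)
print(weight)  # 19.0
```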


<p style = 'font-size:16px;font-family:Arial;'>The results speak for themselves: mixing structured and unstructured features gives us the best model performance. Using just the sensor data gets us an F1 score of 0.29 and an AUC of 0.80, so the ranking is reasonable, but the model struggles to catch the minority class (i.e., the actual outages). The text-only model does worse on F1 (0.19), but interestingly, its AUC is on par at 0.81, which suggests the log embeddings carry useful signal even on their own.</p>

<p style = 'font-size:16px;font-family:Arial;'>
But the real magic happens when we put both feature sets together. The F1 score jumps to 0.51, and the AUC shoots up to 0.95. That’s a solid gain, and it means we’re catching more real outages without making a mess of the rest. So those operator comments? Turns out they’re worth keeping.</p>
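
<p style = 'font-size:16px;font-family:Arial;'>Why can two models have a similar AUC but a different F1? F1 is computed from hard labels at one fixed cut-off, while AUC only looks at how well the scores rank outages above non-outages. The following sketch illustrates this with plain Python on made-up scores; the helper names <code>f1_at</code> and <code>roc_auc</code> are illustrative, and the numbers have nothing to do with the pump data.</p>

```python
def f1_at(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores for eight rows, two of which are outages
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.60]

# One AUC value for the whole score list, but a different F1 per cut-off
auc_value = roc_auc(y_true, scores)
f1_by_threshold = {t: f1_at(y_true, scores, t) for t in (0.1, 0.4, 0.5)}

print(f"AUC: {auc_value:.3f}")
for t, f1 in f1_by_threshold.items():
    print(f"F1 at threshold {t}: {f1:.3f}")
```

<p style = 'font-size:16px;font-family:Arial;'>The AUC stays the same no matter which cut-off we pick, while F1 moves with it. So if F1 matters for your use case, it can be worth tuning the decision threshold on a validation set instead of relying on the default.</p>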

In [None]:
# Evaluation Summary
for res in results:
    print(f"--- {res['label']} ---")
    print(f"F1 Score: {res['f1']:.4f}")
    print(f"AUC: {res['auc']:.4f}")
    print("Confusion Matrix:")
    print(res['cm'])
    print("Classification Report:")
    for label, metrics in res['report'].items():
        if isinstance(metrics, dict):
            print(f"  {label}: precision={metrics['precision']:.2f}, recall={metrics['recall']:.2f}, f1={metrics['f1-score']:.2f}")
    print()

<p style = 'font-size:16px;font-family:Arial;'>
The ROC plot backs this up nicely: across the full range of thresholds, the combined model clearly outperforms the others. It’s a strong signal that <b>operator logs really do bring extra context</b> that complements the sensor data. Those embedding-based text features aren’t just filler; they genuinely boost the model’s ability to catch outages more reliably.</p>

In [None]:
import plotly.io as pio
pio.renderers.default = "notebook_connected"
# ROC Plot
roc_traces = [
    go.Scatter(x=res['fpr'], y=res['tpr'], mode='lines', name=f"{res['label']} (AUC={res['auc']:.2f})")
    for res in results
]

layout = go.Layout(
    title="ROC Curves",
    xaxis=dict(title='False Positive Rate'),
    yaxis=dict(title='True Positive Rate'),
    width=700, height=500
)
fig = go.Figure(data=roc_traces, layout=layout)
fig.show()

<b style = 'font-size:18px;font-family:Arial;'>4.3 Feature Importances</b>


<p style = 'font-size:16px;font-family:Arial;'>
Knowing which features matter most isn't just a nice-to-have—it’s crucial for understanding, trusting, and explaining the model. Since we’re blending sensor readings with log-based embeddings, it’s important to see what’s actually driving predictions. That’s where our earlier decision to turn high-dimensional text into interpretable prototype distances really pays off. Instead of dealing with opaque vectors, we get meaningful signals like “how much does this message sound like a known malfunction?”</p>

<p style = 'font-size:16px;font-family:Arial;'>
And it works. The usual suspects like <code>motor_temp</code> and <code>rpm</code> show up at the top, but so do several log-based features—like <code>logmessagesimilarity_inspection_results</code> and <code>logmessagesimilarity_sensor_malfunction_issues</code>. This solid mix confirms that both data types pull their weight, and the model is learning from text in a way that’s not just useful, but explainable.</p>

In [None]:
# Get feature importances from the combined model (results[2] corresponds to combined features)
importances = results[2]["model"].get_feature_importance()

# Get corresponding feature names
feat_names = results[2]["model"].feature_names_

# Create a DataFrame to organize feature names and their importance values
df_plot = pd.DataFrame({'Feature': feat_names, 'Importance': importances})

# Add a color column: red for text-based features, blue for classic sensor features
df_plot['Color'] = df_plot['Feature'].apply(
    lambda x: 'red' if x.startswith('logmessagesimilarity') else 'blue'
)

# Sort features by importance, descending
df_plot.sort_values(by='Importance', ascending=False, inplace=True)

# Create a horizontal bar chart of feature importances using Plotly
fig = px.bar(
    df_plot,
    x='Importance',
    y='Feature',
    orientation='h',
    color='Color',
    color_discrete_map='identity',  # use literal color values from the DataFrame
    category_orders={'Feature': df_plot['Feature'].tolist()},  # preserve sorted order
    labels={'Importance': 'Importance', 'Feature': 'Feature'}
)

# Final layout tweaks and display
fig.update_layout(title='Feature Importances', height=1000)
fig.show()
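
<p style = 'font-size:16px;font-family:Arial;'>To make the idea behind the <code>logmessagesimilarity_*</code> features more tangible, here is a minimal sketch of how an embedding can be turned into named similarity scores. It is not the notebook's <code>get_centroid_features</code> function (which runs in-database); cosine similarity, the tiny 3-dimensional vectors, and the centroid values are all assumptions made for illustration.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional centroids; real embeddings have hundreds of dimensions
centroids = {
    "inspection_results": [1.0, 0.0, 0.0],
    "sensor_malfunction_issues": [0.0, 1.0, 0.0],
}

def similarity_features(embedding):
    """Map one embedding to named similarity features, one per centroid."""
    return {
        f"logmessagesimilarity_{topic}": cosine_similarity(embedding, center)
        for topic, center in centroids.items()
    }

# A made-up embedding that points mostly toward the malfunction centroid
features = similarity_features([0.6, 0.8, 0.0])
print(features)
```

<p style = 'font-size:16px;font-family:Arial;'>Each feature answers one readable question ("how close is this message to topic X?"), which is why the importance chart stays interpretable even though the underlying embeddings are not.</p>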


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial;'><b>5. Deployment</b></p>


<p style = 'font-size:16px;font-family:Arial;'>Now it's time to wrap things up with deployment. In this last part, we show how to get the full pipeline running inside Teradata, even when the Python environment is gone. First, we export the trained CatBoost model to ONNX and store it in Vantage. Then, we rebuild the entire prediction flow using teradataml DataFrames—embedding generation, similarity scoring, model inference, the works. Finally, we collapse it all into one clean SQL query using a Common Table Expression (CTE), so the whole thing is ready to run entirely in-database, production-style.</p>

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial;'><b>5.1 Exporting and Registering the Model with Teradata BYOM</b></p>




<p style = 'font-size:16px;font-family:Arial;'>Here’s where we turn the trained CatBoost model into something Teradata can run in-database. We convert it to ONNX format and upload it to a BYOM model table—so it's ready to plug straight into the inference pipeline.</p>

In [None]:
# Set BYOM install location
configure.byom_install_location = "mldb"

# Save CatBoost model to ONNX format
onnxfile_catb = "catb-outage.onnx"
model_catb_id = "catb_outage"

results[2]["model"].save_model(
    onnxfile_catb,
    format="onnx",
    export_parameters={
        'onnx_domain': 'ai.catboost',
        'onnx_model_version': 1,
        'onnx_doc_string': 'model for Outage prediction',
        'onnx_graph_name': 'CatBoostModel_for_OutagePrediction'
    }
)

# Save ONNX model to Teradata BYOM table
model_table_name = "outage_models"
try:
    db_drop_table(model_table_name)
except Exception:
    # Table does not exist yet (e.g. first run); nothing to drop
    pass

save_byom(
    model_id=model_catb_id,
    model_file=onnxfile_catb,
    table_name=model_table_name
)

# Display saved model table
display_dataframes_in_tabs([model_table_name])


<p style = 'font-size:20px;font-family:Arial;'><b>5.2 Develop Inference Pipeline</b></p>


<p style = 'font-size:16px;font-family:Arial;'>In the next step, we string everything together into a full pipeline—right inside Teradata. We start by simulating new inference data and run it through several in-database ML steps: generate embeddings with a Hugging Face model, compute similarity scores to our KMeans cluster centers, combine everything with the sensor features, and finally run inference with the CatBoost model in ONNX format. One neat trick here is using <code>literal_column</code> to write raw SQL for extracting the label and probability from the ONNX JSON output. All of this happens in-database, no data movement, just Python and SQL playing nicely together.</p>

In [None]:
# Simulate new inference data by creating a subset of test data 
(DF_test
 .loc[DF_test.row_id < 100]
 .drop(columns=["outage_next_24h"])
 .to_sql("pump_failure_inference", primary_index="row_id", if_exists="replace")
)

# Reference the inference data
DF_inference = DataFrame(in_schema(username, "pump_failure_inference"))

# Step 1: Generate sentence embeddings using BYOM HuggingFace model (all processing in-database)
DF_embeddings_inference = ONNXEmbeddings(
    newdata=DF_inference.assign(
        row_id=DF_inference.row_id,
        txt=DF_inference.maintenance_log_aug,
        drop_columns=True
    ),
    modeldata=my_model,
    tokenizerdata=my_tokenizer,
    accumulate=["row_id"],
    model_output_tensor="sentence_embedding",
    output_format=f'FLOAT32({number_dimensions_output})',
    enable_memory_check=False
).result

# Step 2: Compute similarity features to precomputed cluster centroids with the previously defined function
DF_logfeatures_inference = get_centroid_features(DF_embeddings_inference)

# Step 3: Join similarity features with original inference features in Teradata
DF_inference_ADS = (DF_inference
    .merge(DF_logfeatures_inference, on="row_id", rsuffix="emb")
    .drop(columns=["maintenance_log_aug", "row_id_emb"])
)

# Step 4: Run classification inference using the deployed ONNX classification model
DF_predictions_raw = ONNXPredict(
    newdata=DF_inference_ADS,
    modeldata=DataFrame(in_schema(username, model_table_name)),
    accumulate=["row_id"],
    overwrite_cached_models="*",
    model_input_fields_map=f"features={feat_names[0]}:{feat_names[-1]}"
).result

from sqlalchemy.sql import literal_column as col

# Step 5: Extract label and probability values from ONNX JSON output. col allows us to directly use SQL for creating new columns.
DF_predictions_final = DF_predictions_raw.assign(
    row_id=DF_predictions_raw.row_id,
    json_report_json=col("NEW JSON (json_report)"),
    outage_prediction=col("CAST(json_report_json.JSONExtractValue('$.label[0]') AS INTEGER)"),
    outage_probability=col("CAST(json_report_json.JSONExtractValue('$.probabilities[0].value.1') AS DECIMAL(7,6))"),
    drop_columns=True
).select(["row_id", "outage_prediction", "outage_probability"])


In [None]:
DF_predictions_final
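
<p style = 'font-size:16px;font-family:Arial;'>If you want to spot-check a single raw prediction on the client side, the same extraction can be sketched in plain Python. Note that the JSON shape below is an assumption inferred from the two JSON paths used in Step 5 (<code>$.label[0]</code> and <code>$.probabilities[0].value.1</code>), and the values are made up; check an actual <code>json_report</code> value before relying on it.</p>

```python
import json

# Assumed shape of one json_report value, inferred from the JSON paths in Step 5
json_report = '{"label": [1], "probabilities": [{"value": {"0": 0.12, "1": 0.88}}]}'

def parse_report(raw):
    """Extract the predicted label and the positive-class probability."""
    report = json.loads(raw)
    label = int(report["label"][0])  # mirrors $.label[0]
    # mirrors $.probabilities[0].value.1
    probability = float(report["probabilities"][0]["value"]["1"])
    return label, probability

outage_prediction, outage_probability = parse_report(json_report)
print(outage_prediction, outage_probability)
```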


<center><img src="images/flowchart_embeddingsfeatures.png" alt="workflow_topictrend" style="border: 4px solid #404040; border-radius: 10px;"/></center>

<p style = 'font-size:16px;font-family:Arial;'>So far, everything runs nicely inside the database, but it still depends on an active Python session with teradataml to tie it all together. To make the pipeline truly deployable, we need something more portable. That’s where capturing the entire logic as plain SQL comes in. Using a small utility called <code>show_CTE_query()</code>, we can turn the full chain of transformations (embedding generation, similarity scoring, pivoting, feature joining, and inference) into one big Common Table Expression (CTE). The result is a standalone SQL query that mirrors our Python logic and can be deployed as a view, macro, or job, with no need for Python to be running in the background.</p>

In [None]:
inference_CTE_qu = DF_predictions_final.show_CTE_query()

In [None]:
display(Markdown(f"```sql\n{inference_CTE_qu}\n```"))

In [None]:
DataFrame.from_query(inference_CTE_qu)
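
<p style = 'font-size:16px;font-family:Arial;'>If the shape of such a query is unfamiliar: a CTE simply names each intermediate step and lets later steps select from earlier ones. The toy sketch below shows that chaining idea with plain string building. It is not how <code>show_CTE_query()</code> is implemented; the helper <code>build_cte</code>, the step names, and the simplified SELECT statements are all hypothetical.</p>

```python
def build_cte(steps, final_select):
    """Chain (name, select_sql) pairs into a single WITH query."""
    if not steps:
        return final_select
    ctes = ",\n".join(f"{name} AS (\n    {sql}\n)" for name, sql in steps)
    return f"WITH {ctes}\n{final_select}"

# Hypothetical step names and simplified SELECTs; the generated query is far longer
steps = [
    ("embeddings", "SELECT row_id, emb FROM raw_logs"),
    ("features", "SELECT row_id, emb AS similarity FROM embeddings"),
]
query = build_cte(steps, "SELECT row_id, similarity FROM features")
print(query)
```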



<p style = 'font-size:16px;font-family:Arial;'>Now that we’ve wrapped the whole pipeline into a single SQL query, the obvious next step is to turn it into a view. This gives us a clean, reusable object inside the database that handles everything—from text embedding to final prediction—in one go. And the best part? Anyone with the access rights can now query the view directly from tools like Teradata Studio or any SQL client, no Python required.</p>

In [None]:
# Define the name of the deployed inference view
inference_view_name = "outage_inference_v"

# Create or replace the view using the full CTE-based SQL pipeline
execute_sql(f"""
    REPLACE VIEW {username}.{inference_view_name} AS
    {inference_CTE_qu}
""")

<p style = 'font-size:16px;font-family:Arial;'>To make sure our view is doing what it should, we run a few quick sanity checks: first, we query it with some known data and confirm it returns predictions. Then we clear the input table—it should return nothing. Finally, we insert fresh test records and see new predictions show up. These simple tests give us confidence that the whole thing is working reliably and ready for production, whether it's part of a bigger pipeline or just used for on-demand scoring.</p>

In [None]:
# Preview inference results (expecting row_ids < 100 from initial sample)
DataFrame(inference_view_name)

In [None]:
# Clear the inference table
execute_sql(f"DELETE FROM {username}.pump_failure_inference")

# Re-check the view (should now return no rows)
DataFrame(inference_view_name)

In [None]:
# Insert new test data into the inference table (row_ids between 1000 and 1050, without labels)
execute_sql(f"""
    INSERT INTO {username}.pump_failure_inference
    SELECT * FROM ANTISELECT(
        ON {username}.pump_failure_test
        USING Exclude ('outage_next_24h')
    ) t
    WHERE row_id > 1000 AND row_id < 1050
""")

# Preview updated inference results (now expecting row_ids between 1000 and 1050)
DataFrame(inference_view_name)
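
<p style = 'font-size:16px;font-family:Arial;'>The three checks above can also be written as explicit assertions, which is handy if you want to rerun them automatically. The sketch below shows the pattern with Python's built-in <code>sqlite3</code> module standing in for Vantage, so it runs anywhere. The table, the view, and the trivial temperature rule that replaces the real model are all hypothetical.</p>

```python
import sqlite3

# SQLite stands in for Vantage here; table, view, and scoring rule are hypothetical
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inference_input (row_id INTEGER PRIMARY KEY, motor_temp REAL)")
con.execute(
    "CREATE VIEW inference_v AS "
    "SELECT row_id, CASE WHEN motor_temp > 90 THEN 1 ELSE 0 END AS outage_prediction "
    "FROM inference_input"
)

def view_rows():
    """Return the current content of the view, ordered by key."""
    return con.execute("SELECT * FROM inference_v ORDER BY row_id").fetchall()

# Check 1: known rows produce predictions
con.executemany("INSERT INTO inference_input VALUES (?, ?)", [(1, 75.0), (2, 95.0)])
assert view_rows() == [(1, 0), (2, 1)]

# Check 2: an empty input table yields an empty view
con.execute("DELETE FROM inference_input")
assert view_rows() == []

# Check 3: fresh rows show up without touching the view definition
con.execute("INSERT INTO inference_input VALUES (?, ?)", (1001, 99.0))
assert view_rows() == [(1001, 1)]
```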

<p style = 'font-size:20px;font-family:Arial;'><b>Conclusion
</b></p>


<p style = 'font-size:16px;font-family:Arial;'>
And that’s a wrap! What we’ve built here is a practical and powerful way to bring structured and unstructured data together for predictive modeling. By turning operator log texts into interpretable similarity features and combining them with sensor data, we’ve managed to boost model performance while keeping things explainable.</p>



<p style = 'font-size:16px;font-family:Arial;'>The best part? The whole inference pipeline runs inside Teradata. Embeddings, similarity scoring, classification: all of it happens in-database. And by exporting the model to ONNX and capturing the whole pipeline as a SQL query, we’ve made it super easy to deploy. Just wrap it in a view, and suddenly any analyst or app with SQL access can get predictions on demand, with no Python or extra infrastructure needed at scoring time.</p>

<p style = 'font-size:16px;font-family:Arial;'>This pattern isn’t limited to pump failures or predictive maintenance. You can reuse it for all sorts of use cases where unstructured data matters—think fraud detection, churn prediction, service ticket triage, and more. If you've got logs, notes, or any kind of free text, this is a great way to turn them into something your models can use and your business can trust.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>6. Cleanup</b>

In [None]:
%run utils/_dataremove.ipynb

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>