<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Headline Generation using Huggingface models in Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Businesses that produce large volumes of written content (such as marketing teams, news organizations, and e-commerce platforms) often struggle to consistently create compelling, concise headlines that capture attention and drive engagement. This challenge grows as teams aim to optimize content for social media visibility and customer click-through rates across multiple channels.</p>

<p style = 'font-size:16px;font-family:Arial'>Small models such as t5-small-headline-generator offer an efficient solution by automatically transforming long articles, product descriptions, or marketing copy into sharp, impactful headlines. The lightweight T5 architecture enables fast, scalable generation of titles, helping organizations save time, maintain consistent messaging, boost content performance, and accelerate their publishing workflows without compromising creativity.</p>
<p style = 'font-size:16px;font-family:Arial'>With Vantage's Bring Your Own Model (BYOM) capabilities, we can load and run these models inside the Teradata system. Keeping scoring inside the warehouse avoids repeated data extracts and transfers, which simplifies governance and helps with data residency and security.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Connect to Vantage</b>

In [None]:
import getpass
import warnings
warnings.filterwarnings('ignore')

from teradataml import *

import zipfile

# Database in which the BYOM functions are installed
configure.byom_install_location = 'mldb'

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=Headline_Generation_Python.ipynb;' UPDATE FOR SESSION;")

<p style = 'font-size:16px;font-family:Arial'>We begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_NLP_cloud');"        # Takes 30 seconds
%run -i ../run_procedure.py "call get_data('DEMO_NLP_local');"        # Takes 1 minute

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>Create a "Virtual DataFrame" that points to the data set in Vantage. Check the shape of the DataFrame as well as the data types of its columns.</p>


In [None]:
df = DataFrame(in_schema("DEMO_NLP", "Headline_Dataset"))
df

In [None]:
df.shape

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Load Model in Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>The Hugging Face model <a href = 'https://huggingface.co/JulesBelveze/t5-small-headline-generator'>JulesBelveze/t5-small-headline-generator</a> was converted to ONNX format using the <a href = 'https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py'>convert_generation.py</a> script from ONNX Runtime.</p>
<p style = 'font-size:16px;font-family:Arial'>The converted model is stored as a zipped folder in Google Cloud Storage. Let us download and unzip it, then load the model into Vantage.</p>

In [None]:
!wget -O model.zip "https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_NLP/model.zip"
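
<p style = 'font-size:16px;font-family:Arial'>If <code>wget</code> is not available in the notebook environment, the same download can be done from Python. The following is a minimal sketch that uses only the standard library; the helper name <code>download_file</code> is hypothetical and is not part of the original demo.</p>

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # urlopen handles http(s):// URLs as well as file:// URLs
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Example call (hypothetical helper; commented out so pasting the cell has no side effects):
# download_file(
#     "https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_NLP/model.zip",
#     "model.zip",
# )
```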

In [None]:
with zipfile.ZipFile("model.zip", 'r') as zip_ref:
    zip_ref.extractall()
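
<p style = 'font-size:16px;font-family:Arial'>Before loading, it can help to confirm that the archive contains the two files that the <code>save_byom</code> calls in this notebook expect. The sketch below assumes both files sit at the root of the archive, which is what the relative file names in those calls imply; the helper name <code>missing_members</code> is hypothetical.</p>

```python
import zipfile


def missing_members(zip_path, expected):
    """Return the expected file names that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        present = set(zf.namelist())
    return [name for name in expected if name not in present]


# File names taken from the save_byom calls in this notebook
expected_files = [
    "t5-small-headline-generator.onnx",
    "t5-small-headline-generator.json",
]

# Example call once model.zip has been downloaded; an empty list means nothing is missing:
# print(missing_members("model.zip", expected_files))
```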

In [None]:
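# Deploy the model to the database.
# Drop any existing table first so the demo can be re-run cleanly.
try:
    db_drop_table("byom_models")
except Exception:
    pass  # Nothing to drop if the table does not exist yet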

save_byom("t5-small-headline-generator",
          "t5-small-headline-generator.onnx",
          "byom_models"
         )

In [None]:
DataFrame("byom_models")

<p style = 'font-size:18px;font-family:Arial'><b>Tokenizer</b></p>
<p style = 'font-size:16px;font-family:Arial'>Load the tokenizer into its own table in the same way.</p>

In [None]:
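# Deploy the tokenizer to the database.
# Drop any existing table first so the demo can be re-run cleanly.
try:
    db_drop_table("byom_tokenizers")
except Exception:
    pass  # Nothing to drop if the table does not exist yet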

save_byom("t5-small-headline-generator",
          "t5-small-headline-generator.json",
          "byom_tokenizers")

In [None]:
DataFrame("byom_tokenizers")

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>4. Headline Generation</b>
<p style = 'font-size:16px;font-family:Arial'>Let's run the model to generate the headlines.</p>

In [None]:
# Note: The model and tokenizer must already be loaded in the database.
# Retrieve the model and the tokenizer from their tables.
modeldata = retrieve_byom("t5-small-headline-generator", table_name="byom_models")
tokenizerdata = retrieve_byom("t5-small-headline-generator", table_name="byom_tokenizers")

In [None]:
# The tokenizer input to ONNXSeq2Seq is selected by the column names
# tokenizer_id and tokenizer, so copy model_id and model into columns with those names.
tokenizerdata_a1 = tokenizerdata.assign(tokenizer_id=tokenizerdata.model_id)
tokenizerdata_a2 = tokenizerdata_a1.assign(tokenizer=tokenizerdata_a1.model)

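<p style = 'font-size:16px;font-family:Arial'><b>ShowModelProperties</b><br>
Set <code>show_model_properties=True</code> to display the model's properties and check the output expected from the function. We start by selecting a small sample of rows (those with <code>id</code> up to 10).</p>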

In [None]:
nlp_data = df[df.id <= 10]

<p style = 'font-size:16px;font-family:Arial'>The function expects the text for which a headline is generated to be in a column named <code>txt</code>, so we copy the <code>content</code> column into a new <code>txt</code> column.</p>

In [None]:
nlp_data = nlp_data.assign(txt=nlp_data.content)

In [None]:
nlp_data

In [None]:
# Display the properties of the t5-small-headline-generator model,
# which was created outside Vantage.
ONNXSeq2Seq_out = ONNXSeq2Seq(modeldata=modeldata,
                              tokenizerdata=tokenizerdata_a2.select(['tokenizer_id', 'tokenizer']),
                              newdata=nlp_data.select(["id", "txt"]),
                              accumulate='id',
                              model_output_tensor='sequences',
                              show_model_properties=True)

In [None]:
ONNXSeq2Seq_out.result

<p style = 'font-size:18px;font-family:Arial'><b>Inference</b></p>
<p style = 'font-size:16px;font-family:Arial'>The model takes content as text and outputs a headline, similar to a short summary.</p>

In [None]:
ONNXSeq2Seq_output = ONNXSeq2Seq(modeldata=modeldata,
                                 tokenizerdata=tokenizerdata_a2.select(['tokenizer_id', 'tokenizer']),
                                 newdata=nlp_data.select(["id", "txt"]),
                                 accumulate=["id", "txt"],
                                 model_output_tensor='sequences',
                                 const_min_length=10,
                                 const_max_length=84,
                                 const_num_beams=4,
                                 const_repetition_penalty=1.2,
                                 const_length_penalty=2.0,
                                 const_num_return_sequences=1)


In [None]:
ONNXSeq2Seq_output.result
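
<p style = 'font-size:16px;font-family:Arial'>The <code>const_*</code> arguments above configure beam search. As a rough illustration of what a setting such as <code>const_length_penalty=2.0</code> does, the sketch below assumes the common convention in which a candidate's score is the sum of its token log-probabilities divided by its length raised to the penalty. Whether the runtime behind <code>ONNXSeq2Seq</code> follows exactly this formula is an assumption, and the function name and sample values are hypothetical.</p>

```python
def beam_score(token_logprobs, length_penalty):
    """Length-normalized score: sum(logprobs) / len(tokens) ** length_penalty."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


# Two made-up candidates with the same per-token log-probability
short_candidate = [-0.5] * 4   # 4 tokens, total log-probability -2.0
long_candidate = [-0.5] * 8    # 8 tokens, total log-probability -4.0

for penalty in (0.0, 1.0, 2.0):
    print(penalty,
          beam_score(short_candidate, penalty),
          beam_score(long_candidate, penalty))
```

<p style = 'font-size:16px;font-family:Arial'>Because log-probabilities are negative, a penalty of 0 favors the shorter candidate, a penalty of 1 scores both equally, and a penalty of 2 favors the longer one. Under this assumed convention, values above 1 push generation toward longer headlines.</p>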

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors on the next run.</p>

In [None]:
tables = ['byom_models', 'byom_tokenizers']

# Loop through the list of tables and drop each one
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        pass  # Ignore tables that have already been dropped

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_NLP');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2026 All Rights Reserved
        </div>
    </div>
</footer>