<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1">
<img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>End to End ML Pipelines with Teradata Vantage</b>
</header>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Introduction</b>
<p style = 'font-size:16px;font-family:Arial'>In this notebook we explore a hypothetical end-to-end generative AI pipeline to illustrate how the tools offered by Teradata Vantage help to define a problem; collect, clean, and preprocess data; integrate training data with cloud-native machine learning tools; and operationalize the trained model.</p>
<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Contents</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Problem Definition</li>
    <li>Loading of Sample Data</li>
    <li>Data Exploration</li>
    <li>Data Cleaning and Preprocessing</li>
    <li>Model Training</li>
    <li>Operationalization</li>
</ol>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Notes</b>
<p style = 'font-size:16px;font-family:Arial'>This notebook is designed to work, with minimal extra configuration, within a ClearScape Analytics Experience environment. It can be added to a separate folder in the UseCases environment.</p>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Importing Teradata Machine Learning Libraries</b>
<p style = 'font-size:16px;font-family:Arial'>These libraries are already available within ClearScape Analytics Experience.</p>

In [None]:
import teradataml as tdml
import json
import pandas as pd
import seaborn as sns
import plotly.express as px
import tdnpathviz

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Establish Database Connection</b>
<p style = 'font-size:16px;font-family:Arial'>Set up the connection. At this point no database is defined in the connection; the default database is created in the first step below.</p>

In [3]:
%run -i ../startup.ipynb
eng = tdml.create_context(host = 'host.docker.internal', username='demo_user', password = password)

Performing setup ...
Setup complete



Enter password:  ··········


... Logon successful
Connected as: teradatasql://demo_user:xxxxx@host.docker.internal/dbc




<p style = 'font-size:16px;font-family:Arial'>Create default database</p>

In [None]:
qry = """
CREATE DATABASE teddy_retailers_ml
AS PERMANENT = 110e6;
"""
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial'>Define the default database in the context</p>

In [4]:
eng = tdml.create_context(host = 'host.docker.internal', username='demo_user', password = password, database = 'teddy_retailers_ml')



<p style = 'font-size:16px;font-family:Arial'>Next we create the tables that hold the sample mock data. These tables are populated by reading files from object storage.</p>

In [5]:
qry='''
CREATE MULTISET TABLE teddy_retailers_ml.products AS
(
  SELECT product_id, product_name, department_id
    FROM (
		LOCATION='/gs/storage.googleapis.com/clearscape_analytics_demo_data/DEMO_groceryML/products.csv') as products
) WITH DATA;
'''
eng.execute(qry)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7ff3a8119280>

In [None]:
qry='''
CREATE MULTISET TABLE teddy_retailers_ml.order_products AS
(
  SELECT order_id, product_id, add_cart_order
    FROM (
		LOCATION='/gs/storage.googleapis.com/clearscape_analytics_demo_data/DEMO_groceryML/order_products.csv') as orders_products
) WITH DATA;
'''
eng.execute(qry)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7ff376997eb0>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>At this stage we load data from the database into Teradata DataFrames. This allows us to perform the analysis with ease.</p>

In [None]:
orders = tdml.DataFrame("order_products")
orders

order_id,product_id,add_cart_order
3018,180,4
3018,2,8
3018,140,6
3018,17,14
3018,11,13
3018,14,5
3018,53,11
3018,153,10
3018,39,3
3018,142,1


<p style = 'font-size:16px;font-family:Arial'>The statement below generates a DataFrame that aggregates the number of products added to each order</p>

In [None]:
counts_per_order = orders.groupby("order_id").agg({"product_id": "count"})
counts_per_order
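
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of what this aggregation computes, the plain-Python sketch below counts the products per order on the ten sample rows displayed earlier. It runs locally without a database connection. The helper name `count_products_per_order` is an assumption made for this example and is not part of teradataml.</p>

```python
from collections import Counter

# (order_id, product_id, add_cart_order) rows copied from the sample output above.
sample_rows = [
    (3018, 180, 4), (3018, 2, 8), (3018, 140, 6), (3018, 17, 14), (3018, 11, 13),
    (3018, 14, 5), (3018, 53, 11), (3018, 153, 10), (3018, 39, 3), (3018, 142, 1),
]

def count_products_per_order(rows):
    """Return {order_id: number of product rows}, mirroring groupby + count."""
    return dict(Counter(order_id for order_id, _product_id, _position in rows))

counts_per_order_local = count_products_per_order(sample_rows)
print(counts_per_order_local)
```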

<p style = 'font-size:16px;font-family:Arial'>From the DataFrame created above we can build a histogram DataFrame, and a histogram plot can then be generated from that DataFrame.</p>

In [None]:
count_prod_hist = tdml.Histogram(data=counts_per_order,
                target_columns="count_product_id",
                method_type="Sturges") 

In [None]:
count_prod_hist_pd = count_prod_hist.result.sort("Label").to_pandas()
count_prod_hist_pd

In [None]:
fig = px.bar(count_prod_hist_pd, x="MaxValue", y="Bin_Percent")
fig.show()
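
<p style = 'font-size:16px;font-family:Arial'>The `Sturges` method type chooses the number of bins from the number of observations. The sketch below shows the textbook form of Sturges' rule, ceil(log2(n)) + 1, so the bin count in the plot can be sanity-checked by hand. It is an assumption that the in-database function follows exactly this form; the function name `sturges_bin_count` is hypothetical.</p>

```python
import math

def sturges_bin_count(n_observations):
    """Textbook Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n_observations)) + 1

# Print the suggested bin count for a few sample sizes.
for n in (8, 100, 1000):
    print(n, sturges_bin_count(n))
```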

<p style = 'font-size:16px;font-family:Arial'>Having ingested the fact data about orders, we can also add dimensional (categorical) data about products. Retrieving the product names from this dataset is useful for the illustrations below.</p>

In [17]:
products = tdml.DataFrame('products')
products

product_id,product_name,department_id,seq_product_id
59,Chicken,3,161
160,AluminumPans,4,262
36,Pepper,1,138
97,All-PurposeCleaner,4,199
137,ColdMedicine,6,239
177,ExtensionCords,4,279
15,TomatoSauce,2,117
99,OvenCleaner,4,201
19,CannedVegetables,2,121
122,Antiperspirant,5,224


<p style = 'font-size:16px;font-family:Arial'>After ingesting the product categorical data, we can merge the categorical data and factual data.</p>

In [None]:
orders_products_merged = orders.join(
    other = products,
    on = "product_id",
    how = "inner",
    lsuffix = "ordrs", 
    rsuffix = "prdt")
orders_products_merged
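
<p style = 'font-size:16px;font-family:Arial'>To make the effect of the join and its suffix arguments concrete, here is a minimal plain-Python sketch of an inner join on `product_id`. The order rows are hypothetical, the product rows reuse names from the `products` output above, and the prefixed column names follow the ones that appear later in this notebook (`ordrs_product_id`, `prdt_product_id`). The function `inner_join_on_product_id` is an illustrative helper, not a teradataml API.</p>

```python
# Hypothetical order rows; product 999 has no match and is dropped by the inner join.
orders_rows = [
    {"order_id": 3018, "product_id": 59, "add_cart_order": 1},
    {"order_id": 3018, "product_id": 36, "add_cart_order": 2},
    {"order_id": 3018, "product_id": 999, "add_cart_order": 3},
]
products_rows = [
    {"product_id": 59, "product_name": "Chicken", "department_id": 3},
    {"product_id": 36, "product_name": "Pepper", "department_id": 1},
]

def inner_join_on_product_id(left_rows, right_rows, lsuffix="ordrs", rsuffix="prdt"):
    """Inner join; the shared key column is kept twice with prefixed names."""
    right_by_id = {row["product_id"]: row for row in right_rows}
    joined = []
    for left in left_rows:
        right = right_by_id.get(left["product_id"])
        if right is None:
            continue  # inner join keeps only rows with a match on both sides
        joined.append({
            "order_id": left["order_id"],
            f"{lsuffix}_product_id": left["product_id"],
            f"{rsuffix}_product_id": right["product_id"],
            "add_cart_order": left["add_cart_order"],
            "product_name": right["product_name"],
            "department_id": right["department_id"],
        })
    return joined

merged_local = inner_join_on_product_id(orders_rows, products_rows)
for row in merged_local:
    print(row)
```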

<p style = 'font-size:16px;font-family:Arial'>With this merged table we can analyze which products are ordered most often.</p>
<p style = 'font-size:16px;font-family:Arial'>Another standard exploratory step involves identifying the most popular products and the most frequent sequences in which these products are added to the shopping cart.</p>

In [None]:
product_counts = orders_products_merged.groupby('product_name').agg({"ordrs_product_id": "count"})
product_counts.sort("count_ordrs_product_id", ascending=False)

<p style = 'font-size:16px;font-family:Arial'>To identify the most common sequences in which products are added to the shopping cart, we first need to identify the sequence in which products were added to each order. For ease of analysis, we'll express these sequences as pairs. This means we'll identify pairs of products in which one product frequently follows another.</p>
<p style = 'font-size:16px;font-family:Arial'>The Teradata nPath function is routinely used for such purposes. This function is particularly useful for identifying sequential patterns within data. The primary inputs for nPath are the rows to be analyzed, the column used for data partitioning, and the column that dictates the order of the sequence.</p>
<p style = 'font-size:16px;font-family:Arial'>In our case, we'll perform the path analysis on the data in `orders_products_merged`. Since we aim to identify sequential patterns within orders, our partition column is `order_id`. Given that we want to construct the sequence based on the order in which products were added to the cart, our ordering column is `add_cart_order`.</p>
<p style = 'font-size:16px;font-family:Arial'>Other key components of the nPath function include the conditions that a row should meet to be included in the sequence, and the patterns we want to find among the rows that meet those conditions. These are supplied to nPath as the parameters `symbols` and `pattern`, respectively.
The nPath call for our purpose is as follows:
</p>

In [None]:
common_seqs = tdml.NPath(
    data1=orders_products_merged,
    data1_partition_column="order_id",
    data1_order_column="add_cart_order",
    mode="OVERLAPPING",
    pattern="A.A",
    symbols="TRUE as A",
    result=[
        "FIRST (order_id OF A) AS order_id",
        "ACCUMULATE (product_name OF A) AS path",
        "COUNT (* OF A) AS countrank"
        ]
    ).result

<p style = 'font-size:16px;font-family:Arial'>Here the symbol definition `TRUE as A` means that every row qualifies as `A`, that is, any product added to an order. The pattern `A.A` is matched each time one product follows another product added previously to the shopping cart. We set the mode to `OVERLAPPING` because the second element of a given pair can also be the first element of another pair.</p>
<p style = 'font-size:16px;font-family:Arial'>The `FIRST` function retrieves the `order_id` from the first row that matched the pattern, while `ACCUMULATE` constructs the path from the `product_name` found in every matched row, following the sequence defined by `add_cart_order`.
</p>
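
<p style = 'font-size:16px;font-family:Arial'>The pairing logic can be sketched in plain Python to show what the overlapping `A.A` pattern yields for a single order. This is only an illustration of the idea, not how nPath is implemented. It uses the sample rows of order `3018` shown earlier and product identifiers rather than names, because the sample output does not include names. The helper `overlapping_pairs` is hypothetical.</p>

```python
from collections import Counter, defaultdict

# (order_id, product_id, add_cart_order) rows from the order_products sample output.
sample_rows = [
    (3018, 180, 4), (3018, 2, 8), (3018, 140, 6), (3018, 17, 14), (3018, 11, 13),
    (3018, 14, 5), (3018, 53, 11), (3018, 153, 10), (3018, 39, 3), (3018, 142, 1),
]

def overlapping_pairs(rows):
    """Return (order_id, (first, second)) for every pair of consecutive products."""
    by_order = defaultdict(list)
    for order_id, product_id, position in rows:
        by_order[order_id].append((position, product_id))
    pairs = []
    for order_id, items in by_order.items():
        # Sort by cart position, then keep only the product identifiers.
        ordered = [product_id for _position, product_id in sorted(items)]
        # Each product can end one pair and start the next one (overlapping matches).
        for first, second in zip(ordered, ordered[1:]):
            pairs.append((order_id, (first, second)))
    return pairs

pairs = overlapping_pairs(sample_rows)
pair_counts = Counter(path for _order_id, path in pairs)
print(pairs[:3])
print(pair_counts.most_common(3))
```

<p style = 'font-size:16px;font-family:Arial'>An order with ten products produces nine pairs, which is why a single order can contribute many rows to the result.</p>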

<p style = 'font-size:16px;font-family:Arial'>To obtain a count of the most common pairs, we group by 'path' and aggregate by the count of 'order_id', sorting the results in descending order based on the count. </p>

In [None]:
common_seqs.groupby('path').agg({"order_id": "count"}).sort('count_order_id',ascending=False)

<p style = 'font-size:16px;font-family:Arial'>The main paths found through this analysis can be plotted through Teradata's tdnpathviz visualization module.</p>

In [None]:
from tdnpathviz.visualizations import plot_first_main_paths

In [None]:
plot_first_main_paths(common_seqs,path_column='path',id_column='order_id')

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Data Cleaning and Preprocessing</b>
<p style = 'font-size:16px;font-family:Arial'>To prepare the data for processing by a machine learning library or a pre-trained model, especially those required for generative AI through NLP, we need to verify a few things:</p>
<p style = 'font-size:16px;font-family:Arial'>Product identifiers need to be sequential. This ensures that the NLP model can process them as it would process words in a sentence. We must also define markers that signify the start and end of a “sentence”, similar to a capital letter and a period in English grammar.</p>
<p style = 'font-size:16px;font-family:Arial'>We need to eliminate null values. Machine learning training via NLP tools fundamentally involves matrix multiplication, which can only be performed with numerical values.</p>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Adding sequential product identifiers:</b>
<p style = 'font-size:16px;font-family:Arial'>In our specific dataset, the product identifier `product_id` happens to be sequential, but this is a matter of chance, not design. For this reason we are going to assume that `product_id` is not sequential. This has the extra advantage of letting us choose the starting point of the sequence, which in turn lets us reserve values for our “sentence” `start` and `end` markers.</p>
<p style = 'font-size:16px;font-family:Arial'>We are going to define the number `101` as our “sentence” `start` marker and the number `102` as our “sentence” end marker. With these two constants defined we are going to assign a sequential identifier to the items in our `products` table, this sequential identifier will start at `103` and continue from there. </p>


<p style = 'font-size:16px;font-family:Arial'>We create a volatile table `temp` that matches each `product_id` with the sequential identifier that we are creating. For this we use SQL’s `ROW_NUMBER()` function with an `OVER` clause. The table has a primary index on `product_id`, and because we want its rows to be preserved for the duration of the session, we specify `ON COMMIT PRESERVE ROWS`:</p>

In [None]:
create_table_qry = '''
CREATE VOLATILE TABLE temp AS (
    SELECT product_id, 
    ROW_NUMBER() OVER (ORDER BY product_id) + 102 as seq_product_id
    FROM products
) WITH DATA PRIMARY INDEX (product_id) ON COMMIT PRESERVE ROWS;
'''
eng.execute(create_table_qry)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7ff376865040>

<p style = 'font-size:16px;font-family:Arial'>To create the sequential identifier, this statement retrieves the row number of each record in the `products` table, ordered by `product_id`, adds 102 to that number, and names the result `seq_product_id`.</p>
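
<p style = 'font-size:16px;font-family:Arial'>The same numbering can be reproduced locally as a quick check of the arithmetic: rank the product identifiers in ascending order, starting at 1, and add the end-marker value `102`, so the first product receives `103`. The function `assign_sequential_ids` is a hypothetical helper written for this sketch, and it assumes that `product_id` values are unique.</p>

```python
START_MARKER = 101
END_MARKER = 102

def assign_sequential_ids(product_ids, offset=END_MARKER):
    """Map each product_id to its 1-based rank (sorted by product_id) plus offset."""
    return {pid: rank + offset for rank, pid in enumerate(sorted(product_ids), start=1)}

# product_id values taken from the products sample shown earlier.
seq_ids = assign_sequential_ids([59, 160, 36, 97, 137])
print(seq_ids)
```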

<p style = 'font-size:16px;font-family:Arial'>We create a new column in our `products` table to store the sequential product identifiers:</p>

In [None]:
add_column_qry = '''
ALTER TABLE products
ADD seq_product_id INTEGER;
'''
eng.execute(add_column_qry)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7ff37689bdc0>

<p style = 'font-size:16px;font-family:Arial'>Finally, we add the sequential identifiers to our `products` table as `seq_product_id`:</p>

In [None]:
modify_table_qry = '''
UPDATE products
SET seq_product_id = (
    SELECT temp.seq_product_id
    FROM temp
    WHERE products.product_id = temp.product_id
);
'''
eng.execute(modify_table_qry)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7ff376865a60>

<p style = 'font-size:16px;font-family:Arial'>The resulting table can be ingested again into our `products` `DataFrame`.</p>

In [18]:
products = tdml.DataFrame('products')
products.sort('seq_product_id')

product_id,product_name,department_id,seq_product_id
1,Milk,1,103
2,Bread,1,104
3,Eggs,1,105
4,Butter,1,106
5,Cheese,1,107
6,Yogurt,1,108
7,Cereal,1,109
8,Oatmeal,1,110
9,GranolaBars,1,111
10,PancakeMix,1,112


<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Tokenization and sequence tagging:</b>
<p style = 'font-size:16px;font-family:Arial'>To predict the next product a customer will add to their shopping cart, we interpret the sequence of previously added products as a "sentence", where each product stands as a distinct "word". Widely used pre-trained NLP models, such as BERT, require a specific input format. This is where tokenization and tagging come into the picture. For the training stage we need to supply a dataset in which each order, or "sentence", is represented by the sequence of products, or "words", added to it, including markers for both the beginning and end of the sequence.</p>
<p style = 'font-size:16px;font-family:Arial'>As part of our data preparation process, we will create sequences of products added per order as we did during the data exploration phase. However, we'll use sequential identifiers rather than product names in the `path`. These sequential identifiers serve as tokens to identify the products. As previously mentioned, we will also integrate markers for the beginning and end of the sequence.</p>

<p style = 'font-size:16px;font-family:Arial'>First, we need to recreate the `orders_products_merged` `DataFrame`, this time with the updated `products` `DataFrame` in the join. We keep all the columns of the `products` `DataFrame` for reference and simplicity; in a real scenario we would keep only the identifiers.</p>

In [None]:
orders_products_merged = orders.join(
    other = products,
    on = "product_id",
    how = "inner",
    lsuffix = "ordrs", 
    rsuffix = "prdt")
orders_products_merged

order_id,ordrs_product_id,prdt_product_id,add_cart_order,product_name,department_id
3752,125,125,3,Lotion,5
3487,22,22,16,Crackers,1
3487,130,130,15,HairSpray,5
4425,50,50,2,FrozenPizza,3
4425,76,76,1,CatFood,1
3018,142,142,1,FirstAidSupplies,6
3018,39,39,3,Vinegar,1
3018,180,180,4,Brooms,4
4425,86,86,8,FacialTissues,4
3487,38,38,8,CookingOil,1


<p style = 'font-size:16px;font-family:Arial'>From this modified `orders_products_merged` we will generate a preliminary `DataFrame` with the orders represented as sequences of products. At this stage each record will have at least two items. In cases in which only one item was added to the shopping cart, for example, the record would be populated with the beginning of sentence marker, and the sequential identifier of the product added.  </p>
<p style = 'font-size:16px;font-family:Arial'>To achieve this objective, we will add a new column to the `orders_products_merged` `DataFrame` with the constant value `101` which is our beginning of sentence marker.</p>

In [None]:
orders_products_merged = orders_products_merged.assign(
    bgn = 101
).select(["order_id", "add_cart_order", "seq_product_id", "bgn"])
orders_products_merged

<p style = 'font-size:16px;font-family:Arial'>With this updated `DataFrame`, we'll use Teradata’s nPath function to generate the sequences of products per order. This process is quite similar to what we did in the data exploration stage. The difference here is that we won't include the `order_id` as it is not part of the "sentence" and not relevant for training a model. Also, we won't accumulate all tokens in one column; instead, we'll have a distinct column for each token. Additionally, the pattern will not be looking for matches on product pairs; we'll consider any sequence, even if it only contains one product. For this, we use the wildcard `*`. </p>
<p style = 'font-size:16px;font-family:Arial'>Another finding from data exploration comes in handy here. Since we need to define a column for each token, and we want to avoid generating far more columns than there are tokens, we rely on our earlier finding that the maximum number of products in an order in our dataset is `21`. We need at least one more column for the end-of-sentence marker, plus some headroom for longer lists, so the prepared dataset has 25 token columns (`c1` to `c25`) in addition to `c0`, which holds the start marker.</p>
<p style = 'font-size:16px;font-family:Arial'>Carrying out all the processes above is made significantly easier with Teradata’s nPath function.</p>

In [None]:
prepared_ds = tdml.NPath(
    data1=orders_products_merged,
    data1_partition_column="order_id",
    data1_order_column="add_cart_order",
    mode="NONOVERLAPPING",
    pattern="A*",
    symbols="TRUE as A",
    result=["FIRST (bgn OF A) AS c0",
            "NTH (seq_product_id, 1 OF A) as c1",
            "NTH (seq_product_id, 2 OF A) as c2",
            "NTH (seq_product_id, 3 OF A) as c3",
            "NTH (seq_product_id, 4 OF A) as c4",
            "NTH (seq_product_id, 5 OF A) as c5",
            "NTH (seq_product_id, 6 OF A) as c6",
            "NTH (seq_product_id, 7 OF A) as c7",
            "NTH (seq_product_id, 8 OF A) as c8",
            "NTH (seq_product_id, 9 OF A) as c9",
            "NTH (seq_product_id, 10 OF A) as c10",
            "NTH (seq_product_id, 11 OF A) as c11",
            "NTH (seq_product_id, 12 OF A) as c12",
            "NTH (seq_product_id, 13 OF A) as c13",
            "NTH (seq_product_id, 14 OF A) as c14",
            "NTH (seq_product_id, 15 OF A) as c15",
            "NTH (seq_product_id, 16 OF A) as c16",
            "NTH (seq_product_id, 17 OF A) as c17",
            "NTH (seq_product_id, 18 OF A) as c18",
            "NTH (seq_product_id, 19 OF A) as c19",
            "NTH (seq_product_id, 20 OF A) as c20",
            "NTH (seq_product_id, 21 OF A) as c21",
            "NTH (seq_product_id, 22 OF A) as c22",
            "NTH (seq_product_id, 23 OF A) as c23",
            "NTH (seq_product_id, 24 OF A) as c24",
            "NTH (seq_product_id, 25 OF A) as c25",
    ]
).result
prepared_ds

<p style = 'font-size:16px;font-family:Arial'>The aggregate function `NTH` is employed to extract the `seq_product_id` from each row in the sequence (where each product added to an order corresponds to a row in the sequence). This ID is then allocated to the appropriate column based on its position within the specific order. This is accomplished by specifying the indices `1` to `25` in the snippet above. If an order does not contain a product added at a particular position, we encounter a `None` value. These null values will need to be addressed and cleaned in the subsequent steps. </p>
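
<p style = 'font-size:16px;font-family:Arial'>The resulting row layout can be pictured with a small plain-Python sketch: the start marker goes in `c0`, each token goes in the column matching its position, and unused positions stay empty (`None`). The token values below are hypothetical, and `to_fixed_width_row` is an illustrative helper rather than part of teradataml.</p>

```python
def to_fixed_width_row(seq_product_ids, width=25, start_marker=101):
    """Return [c0, c1, ..., c<width>]: start marker, tokens by position, None padding."""
    tokens = list(seq_product_ids)[:width]  # positions beyond `width` are not captured
    return [start_marker] + tokens + [None] * (width - len(tokens))

# Hypothetical order with three products, already ordered by add_cart_order.
row = to_fixed_width_row([244, 141, 282])
print(row[:6])
```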

<p style = 'font-size:16px;font-family:Arial'>At this point, we simply need to manage the null values and add end-of-sentence markers. We will leverage Teradata's robust in-database capabilities to accomplish this.</p>

<p style = 'font-size:16px;font-family:Arial'>As a first step we will persist the `prepared_ds` `DataFrame` to the data warehouse.</p>

In [None]:
prepared_ds.to_sql("prepared_ds", if_exists="replace")

<p style = 'font-size:16px;font-family:Arial'>As a next step we need to produce a table with the following characteristics:</p>
<p style = 'font-size:16px;font-family:Arial'>We need to insert the final sentence marker `102` after the last product added to each order. </p>
<p style = 'font-size:16px;font-family:Arial'>Columns where no product is placed should hold `0` instead of `null`.</p>
<p style = 'font-size:16px;font-family:Arial'>To achieve the conditions above, we need to inspect whether the value of each column is `null` and replace that `null` conditionally, based on the preceding position. If the preceding position holds a valid product, the `null` is replaced by the end-of-“sentence” marker; if the preceding position is also `null`, it is replaced by `0`.</p>
<p style = 'font-size:16px;font-family:Arial'>The COALESCE function enables us to efficiently apply a condition when we encounter that the value of a column is `null`. In this context, it is used to inspect each column, and if the column's value is `null`, it then checks the preceding column's value using a CASE statement. The CASE statement allows for conditional logic in SQL. In this situation, if the previous column's value is `null`, it returns `0`; otherwise, it returns `102`.</p>
<p style = 'font-size:16px;font-family:Arial'>We can utilize the following script to preserve a cleaned version that is prepared for the training phase.</p>

In [None]:
create_cleaned_ds_qry = '''
CREATE TABLE cleaned_ds AS (
  SELECT
    c0,
    c1,
    COALESCE(c2, CASE WHEN c1 IS NULL THEN 0 ELSE 102 END) AS c2,
    COALESCE(c3, CASE WHEN c2 IS NULL THEN 0 ELSE 102 END) AS c3,
    COALESCE(c4, CASE WHEN c3 IS NULL THEN 0 ELSE 102 END) AS c4,
    COALESCE(c5, CASE WHEN c4 IS NULL THEN 0 ELSE 102 END) AS c5,
    COALESCE(c6, CASE WHEN c5 IS NULL THEN 0 ELSE 102 END) AS c6,
    COALESCE(c7, CASE WHEN c6 IS NULL THEN 0 ELSE 102 END) AS c7,
    COALESCE(c8, CASE WHEN c7 IS NULL THEN 0 ELSE 102 END) AS c8,
    COALESCE(c9, CASE WHEN c8 IS NULL THEN 0 ELSE 102 END) AS c9,
    COALESCE(c10, CASE WHEN c9 IS NULL THEN 0 ELSE 102 END) AS c10,
    COALESCE(c11, CASE WHEN c10 IS NULL THEN 0 ELSE 102 END) AS c11,
    COALESCE(c12, CASE WHEN c11 IS NULL THEN 0 ELSE 102 END) AS c12,
    COALESCE(c13, CASE WHEN c12 IS NULL THEN 0 ELSE 102 END) AS c13,
    COALESCE(c14, CASE WHEN c13 IS NULL THEN 0 ELSE 102 END) AS c14,
    COALESCE(c15, CASE WHEN c14 IS NULL THEN 0 ELSE 102 END) AS c15,
    COALESCE(c16, CASE WHEN c15 IS NULL THEN 0 ELSE 102 END) AS c16,
    COALESCE(c17, CASE WHEN c16 IS NULL THEN 0 ELSE 102 END) AS c17,
    COALESCE(c18, CASE WHEN c17 IS NULL THEN 0 ELSE 102 END) AS c18,
    COALESCE(c19, CASE WHEN c18 IS NULL THEN 0 ELSE 102 END) AS c19,
    COALESCE(c20, CASE WHEN c19 IS NULL THEN 0 ELSE 102 END) AS c20,
    COALESCE(c21, CASE WHEN c20 IS NULL THEN 0 ELSE 102 END) AS c21,
    COALESCE(c22, CASE WHEN c21 IS NULL THEN 0 ELSE 102 END) AS c22,
    COALESCE(c23, CASE WHEN c22 IS NULL THEN 0 ELSE 102 END) AS c23,
    COALESCE(c24, CASE WHEN c23 IS NULL THEN 0 ELSE 102 END) AS c24,
    CASE WHEN c25 IS NULL THEN 0 ELSE 102 END AS c25
  FROM prepared_ds
) WITH DATA;
'''
eng.execute(create_cleaned_ds_qry)

<p style = 'font-size:16px;font-family:Arial'>The resulting data would look like this:</p>

In [None]:
cleaned_ds_dtf = tdml.DataFrame('cleaned_ds')
cleaned_ds_dtf
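
<p style = 'font-size:16px;font-family:Arial'>The null-handling rule used to build `cleaned_ds` can also be sketched in plain Python, which makes it easy to test on a single row. The sketch applies one rule uniformly to every position after `c1`: the first empty slot after a product becomes the end marker `102`, and every later empty slot becomes `0`. The function `fill_nulls` and the sample tokens are hypothetical.</p>

```python
END_MARKER = 102
PAD = 0

def fill_nulls(row):
    """Replace the first None after a product with END_MARKER and later ones with PAD.

    `row` is [c0, c1, ..., cN] as produced by the nPath step, with None for empty slots.
    """
    cleaned = list(row[:2])  # c0 (start marker) and c1 are kept unchanged
    for index in range(2, len(row)):
        value = row[index]
        if value is None:
            # Decide based on the original (uncleaned) value of the preceding column.
            value = PAD if row[index - 1] is None else END_MARKER
        cleaned.append(value)
    return cleaned

print(fill_nulls([101, 244, 141, None, None, None]))
```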

<p style = 'font-size:16px;font-family:Arial'>The cleaned training dataset can be persisted to object storage to easily be consumed by any Machine Learning tool on any cloud. </p>
<p style = 'font-size:16px;font-family:Arial'>We could easily store the data to a parquet file in Azure, for example, with a simple statement like this:</p>


In [None]:
'''
SELECT NodeId, AmpId, Sequence, ObjectName, ObjectSize, RecordCount
FROM WRITE_NOS_FM (
    ON (
        select * from cleaned_ds
    )

    USING
    LOCATION('/AZ/<azure_blob_storage_folder>/')
    STOREDAS('PARQUET')
    COMPRESSION('GZIP')
    NAMING('RANGE')
    INCLUDE_ORDERING('TRUE')
    MAXOBJECTSIZE('4MB')
) AS d 
ORDER BY AmpId;
'''

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Model Training</b>
<p style = 'font-size:16px;font-family:Arial'>We are now ready to train a model. Teradata can easily be integrated with Azure ML, especially if you already have a Teradata VantageCloud Lake instance on that platform. Integration is also convenient with Amazon SageMaker and Google Vertex AI.</p>
<p style = 'font-size:16px;font-family:Arial'>In the example we have been discussing, the data used for training consists of sequences of products. The training stage is about discovering the relationships between these products with regard to the context in which they were added to the shopping cart. This is analogous to discovering the order of words in a sentence based on the context. The vector space in which the products (the “words”) are assigned values based on their relation to each other is called an embedding, and the type of neural network used to discover these relations is called a transformer.</p>
<p style = 'font-size:16px;font-family:Arial'>Because these networks discover the relationships by masking some values in the “sentence” and trying to guess which “word” should occupy that position, the process is computationally expensive.</p>
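
<p style = 'font-size:16px;font-family:Arial'>A single masking step can be sketched as follows. This only illustrates how one training example is derived from a token sequence; it is not a training loop. The mask value `100` is an assumption chosen so that it does not collide with the markers `101` and `102` or with any product token, and `mask_position` is a hypothetical helper.</p>

```python
MASK_TOKEN = 100  # hypothetical id reserved for the mask; not defined in this notebook

def mask_position(tokens, position, mask_token=MASK_TOKEN):
    """Return (masked_tokens, label): the input with one slot hidden and its true value."""
    if not 0 <= position < len(tokens):
        raise IndexError("position out of range")
    masked = list(tokens)
    label = masked[position]
    masked[position] = mask_token
    return masked, label

sentence = [101, 244, 141, 282, 102]  # start marker, three products, end marker
masked, label = mask_position(sentence, 2)
print(masked, label)
```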
<p style = 'font-size:16px;font-family:Arial'>Training in the cloud offers the advantage of scalability and cost-performance since compute resources can be engaged when they are cheapest, and to the amount that is strictly needed. </p>
<p style = 'font-size:16px;font-family:Arial'>During training, the predictive capability of the model is continually tested. For this purpose, the available data is usually divided into a training set, used to train the model, and a testing set. When the predictive capability of the model is deemed good enough, in accordance with the business requirements and the business and technical constraints, the model is ready for operationalization.</p>
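
<p style = 'font-size:16px;font-family:Arial'>A minimal sketch of such a split, using only the Python standard library, is shown below. The 80/20 proportion, the fixed seed, and the function name `train_test_split` are assumptions made for illustration; machine learning platforms typically offer their own utilities for this step.</p>

```python
import random

def train_test_split(sequences, test_fraction=0.2, seed=42):
    """Shuffle a copy of `sequences` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten hypothetical one-product "sentences": start marker, product token, end marker.
sequences = [[101, 103 + i, 102] for i in range(10)]
train, test = train_test_split(sequences)
print(len(train), len(test))
```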
<p style = 'font-size:16px;font-family:Arial'>In our case, once trained, our language model can be utilized to carry out tasks like generating a sequence of products, which can then serve as recommendations for consumers.</p>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Model Deployment</b>
<p style = 'font-size:16px;font-family:Arial'>In the past it was common to build services around the environment in which the model was built. These services usually exposed endpoints to request predictions from the model. Under this type of architecture, neither the processes for maintaining the model (adjusting it, retraining it on new results, and so on) nor the efficiency of the systems depending on it were optimal. Replicating the entirety of an environment is not straightforward. For this reason, standards such as ONNX have been developed to export the model configuration easily. It is rather common to export these files to an independent service and build some API endpoints around that system. However, this only solves part of the problem: the model is still isolated from the business context.</p>
<p style = 'font-size:16px;font-family:Arial'>Teradata Vantage incorporates the Bring Your Own Model (BYOM) package. BYOM lets you import the model, as an ONNX file for example, into a table in the data warehouse or lake. Under this paradigm, the model becomes another asset in your data environment that you can operationalize, update, and refine.
Deploying a model in this scenario can be as simple as the code below: you identify your model, you identify the table into which you want to import it, and that is all.</p>


In [None]:
# Deploy the downloaded model to Teradata: remove any previous copy, then save the new one.
print('Deploying model with id "%s" to table "%s"...' % (deployment_conf["model_id"], deployment_conf['model_table']))
tdml.delete_byom(
    deployment_conf["model_id"],
    deployment_conf['model_table'])
tdml.save_byom(
    deployment_conf["model_id"], 
    './' + model_file_path,
    deployment_conf['model_table'])
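
<p style = 'font-size:16px;font-family:Arial'>The deployment snippet relies on a `deployment_conf` mapping and a `model_file_path` that are not defined in this notebook. One way they might be set up is sketched below. Only the two keys, `model_id` and `model_table`, come from the snippet above; every value, the file name, and the helper `load_deployment_conf` are placeholders assumed for illustration.</p>

```python
import json

# Hypothetical configuration: the values are placeholders, and only the two keys
# ("model_id", "model_table") are taken from the deployment snippet above.
raw_conf = """
{
    "model_id": "example_model_v1",
    "model_table": "example_byom_models"
}
"""

def load_deployment_conf(text):
    """Parse the JSON text and check that the keys used above are present."""
    conf = json.loads(text)
    missing = {"model_id", "model_table"} - set(conf)
    if missing:
        raise KeyError(f"missing configuration keys: {sorted(missing)}")
    return conf

deployment_conf = load_deployment_conf(raw_conf)
model_file_path = "example_model.onnx"  # hypothetical path to the exported ONNX file
print(deployment_conf["model_id"], deployment_conf["model_table"], model_file_path)
```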


<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Model Operations</b>
<p style = 'font-size:16px;font-family:Arial'>In the context of our example, we could run very straightforward SQL commands to retrieve recommendations from the model deployed in our data warehouse or lake. This has the added advantage that the recommendations from the model can be further refined with context from other data in our data environment: the previous history of the specific customer, local offerings according to the customer's location, and many other parameters. This is an advantage that an isolated generative AI model would not be able to offer.</p>

<img id="architecture-diagram" src="./Images/architecture.jpg" alt="architecture"/>

<p style = 'font-size:16px;font-family:Arial'>The function below is a reference query for retrieving recommendations.</p>
<p style = 'font-size:16px;font-family:Arial'>We use the VantageCloud Lake function `ONNXPredict`.</p>

In [None]:
def get_context_based_recommendations(product_names, rec_number = 5, overwrite_cache = False):
    seq_ids = collect_seq_ids(product_names)
    select_tensor = generate_select_tensor(seq_ids, 32)

    query = f"""
    select tokennum as num, product_name, p.product_id, p.department_id
    from
    (
        with tbl as
        (
            select
                REGEXP_SUBSTR(json_report,'([0-9,]+)', 1, 1, 'c') score
            from
                mldb.ONNXPredict(
                on (%s)
                    on (select * from %s where model_id = '%s') dimension
                    using
                        Accumulate('input_0_0')
                        %s
                ) a
            )
        SELECT 
            tokennum, 
            cast(seq_product_id as int) seq_product_id
        FROM TABLE (STRTOK_SPLIT_TO_TABLE(1, tbl.score, ',')
            RETURNS (outkey INTEGER,
                    tokennum INTEGER,
                    seq_product_id VARCHAR(30) CHARACTER SET UNICODE)
                ) AS d
                where tokennum <= %d
        ) f
        join product_id_to_seq_product_id pitspi on pitspi.seq_product_id = f.seq_product_id
        join products p on p.product_id = pitspi.product_id
    """%(
        select_tensor, 
        deployment_conf["model_table"], 
        deployment_conf["model_id"], 
        f"OverwriteCachedModel('%s')"%deployment_conf["model_id"] if overwrite_cache else "",
        rec_number
        )

    result = {}

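    # Run the query with the engine created at the top of the notebook.
    sql_res = eng.execute(query)

    for row in sql_res:
        # Only the columns selected by the query are available on each row.
        result[row["num"]] = {"product_name": row["product_name"], "product_id": row["product_id"], "department_id": row["department_id"]}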

    return result
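
<p style = 'font-size:16px;font-family:Arial'>The parsing steps inside the query can be mirrored in plain Python to clarify what they do: `REGEXP_SUBSTR` extracts the first run of digits and commas from `json_report`, `STRTOK_SPLIT_TO_TABLE` splits that run on commas and numbers the tokens from 1, and the `WHERE` clause keeps the first `rec_number` tokens. The sample report string below is hypothetical, since the exact `ONNXPredict` output format is not shown in this notebook, and `parse_recommendations` is an illustrative helper.</p>

```python
import re

def parse_recommendations(json_report, rec_number=5):
    """Return [(tokennum, seq_product_id), ...] for the first `rec_number` tokens."""
    match = re.search(r"[0-9,]+", json_report)  # first run of digits and commas
    if match is None:
        return []
    tokens = [token for token in match.group(0).split(",") if token]
    return [
        (tokennum, int(token))
        for tokennum, token in enumerate(tokens, start=1)
        if tokennum <= rec_number
    ]

# Hypothetical report text; the real ONNXPredict output format may differ.
sample_report = '{"scores":[[105,110,103,120]]}'
print(parse_recommendations(sample_report, rec_number=2))
```

<p style = 'font-size:16px;font-family:Arial'>Note that the pattern picks up the first run of digits it encounters, so a report whose field names contain digits would need a more specific expression.</p>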