<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Shipping Delivery Delay Prediction using Open Table Format Tables
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style="font-size:20px;font-family:Arial"><b>Introduction</b></p>

<p style="font-size:16px;font-family:Arial">
The rapid growth in global shipping and e-commerce demand has significantly outpaced the operational capabilities of many logistics providers. While order volumes continue to rise, improvements in delivery infrastructure, capacity planning, and real-time visibility have not kept pace. As a result, delayed deliveries have become an increasingly common and critical risk across multiple industries.<br>
One of the primary contributors to this challenge is the fragmentation of enterprise systems and data. Logistics and retail organizations often rely on multiple, disconnected systems—such as order management, warehouse management, transportation management, and partner platforms—that maintain data in different formats and across different locations. This lack of standardization and integration limits end-to-end visibility, slows decision-making, and increases the risk of inconsistencies across the supply chain.<br>
In the e-commerce retail sector, delivery delays have a direct and cascading impact on the supply chain. Late shipments disrupt inventory replenishment cycles, delay order fulfillment, and reduce operational efficiency. Beyond internal disruptions, these delays also negatively affect customer experience. Buyers expect fast and reliable delivery, and failure to meet these expectations leads to dissatisfaction, loss of trust, and reduced customer loyalty—ultimately harming the retailer’s credibility and long-term growth.
</p>
<p style="font-size:18px;font-family:Arial"><b>Why Teradata</b></p>
<p style="font-size:16px;font-family:Arial">
 Teradata addresses these challenges by providing a <b>unified analytics platform capable of performing large-scale, in-database analysis on massive volumes of data</b>. Its massively parallel processing (MPP) architecture enables complex analytics—such as delivery performance analysis, delay prediction, and supply chain optimization—to be executed directly where the data resides, eliminating the need for costly data movement and improving time to insight.<br>   
    In addition, Teradata natively <b>supports Open Table Formats (OTF)</b> such as Apache Iceberg, allowing it to read and analyze data stored in open, cloud-native formats alongside traditional relational tables. This capability enables organizations to seamlessly combine structured enterprise data with high-volume operational and logistics data stored in data lakes, all within a single SQL and governance framework. By unifying fragmented datasets and enabling scalable in-database analytics across both native and OTF tables, Teradata empowers organizations to gain real-time visibility, improve delivery reliability, and make faster, data-driven decisions across the end-to-end supply chain.
</p>

<hr style="height:2px;border:none">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to VantageCloud</b></p>
<p style = 'font-size:16px;font-family:Arial'>Connect to VantageCloud using <code>create_context</code> from the teradataml Python library. </p>

In [None]:
from getpass import getpass
from dotenv import dotenv_values  # reads the .env parameter file used in the next cell
from teradataml import *
import os

# Suppress warnings
import warnings

warnings.filterwarnings('ignore')
display.suppress_vantage_runtime_warnings = True

In [None]:
print("Checking if this environment is ready to connect to VantageCloud Lake...")

if os.path.exists("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env"):
    print("Your environment parameter file exists. Please proceed with this use case.")
    # Load all the variables from the .env file into a dictionary
    env_vars = dotenv_values("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env")
    # Create the Context
    eng = create_context(host=env_vars.get("host"), username=env_vars.get("username"), password=env_vars.get("my_variable"))
    execute_sql('''SET query_band='DEMO=VCL_Shipping_Delivery_Delay_Prediction_using_OTF.ipynb;' UPDATE FOR SESSION;''')
    print("Connected to VantageCloud Lake with:", eng)
else:
    print("Your environment has not been prepared for connecting to VantageCloud Lake.")
    print("Please contact the support team.")
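
<p style = 'font-size:16px;font-family:Arial'>Before creating the context it can help to confirm that the parameter file actually contains every setting the connection needs. The sketch below is a minimal illustration in plain Python. The helper <code>missing_settings</code> is hypothetical (it is not part of teradataml), the key names are the ones read in the cell above, and the sample values are placeholders.</p>

```python
# Hypothetical helper (not part of teradataml): list connection settings
# that are absent or blank before calling create_context.
REQUIRED_KEYS = ("host", "username", "my_variable")  # key names used in the cell above


def missing_settings(env_vars, required=REQUIRED_KEYS):
    """Return the required keys that are missing or empty in env_vars."""
    return [key for key in required if not (env_vars.get(key) or "").strip()]


# Placeholder values for illustration only, not real credentials
sample = {"host": "example-host", "username": "demo_user", "my_variable": ""}
print(missing_settings(sample))
```

<p style = 'font-size:16px;font-family:Arial'>An empty list means all three settings are present and the connection attempt can go ahead.</p>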

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 2. Data Exploration  </b></p>

<p style = 'font-size:16px;font-family:Arial'>
The following tables are present in the database
    <ul style = 'font-size:16px;font-family:Arial'>  
        <li><b>orderitems</b>: This table maps the orders placed by customers to the products purchased</li>
        <li><b>orders</b>: This table contains data on orders placed by each user, including three columns related to delivery details: order_status, order_delivered_timestamp and order_estimated_delivery_date</li>
        <li><b>payments</b>: This table contains payments made by each user, containing payment details and transaction value</li>
        </ul>
<p style = 'font-size:16px;font-family:Arial'>        The following tables are stored in Open Table Format (OTF) in the cloud
    <ul style = 'font-size:16px;font-family:Arial'> 
        <li><b>products</b>: This table contains the list of products sold on the e-commerce platform and referenced in transactions</li>
        <li><b>customers</b>: This table contains data on customers who make product transactions.
        </li>
        </ul>
<p style = 'font-size:16px;font-family:Arial'> Let's look at the detailed contents of the tables

In [None]:
tdf_orderitems=DataFrame(in_schema("DEMO_Shipment","orderitems"))
tdf_orderitems

In [None]:
tdf_order=DataFrame(in_schema("DEMO_Shipment","orders"))
tdf_order

In [None]:
tdf_pay=DataFrame(in_schema("DEMO_Shipment","payments"))
tdf_pay

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>2.1 Connecting Open Table Format tables</b>
<p style = 'font-size:16px;font-family:Arial'>
An open table format (OTF) is a standardized, engine-agnostic way of organizing and managing large analytical datasets in data lakes, separating storage from compute while ensuring consistent access across multiple processing engines. It addresses challenges such as data silos, proprietary lock-in, lack of transactional consistency, and the inability to safely share data across tools and platforms. Open table formats like Apache Iceberg provide database-like capabilities (ACID transactions, schema and partition evolution, and time travel) on low-cost object storage. Teradata supports open table formats by natively reading and analyzing OTF tables alongside traditional relational tables, enabling high-performance, in-database analytics on large volumes of data without data movement. This allows organizations to unify structured enterprise data with lake-based data under a single SQL, security, and governance framework, accelerating insights while preserving openness and flexibility.<br>    Connecting to open table format tables requires an additional authentication object and a datalake definition. Please refer to Getting_Started_OTF for more information.

In [None]:
cust_in_schema_tbl = in_schema(schema_name="demo_shipping",
                           table_name="customers",
                           datalake_name="iceberg_glue")
c_datalake_df = DataFrame(cust_in_schema_tbl)
c_datalake_df

In [None]:
prod_in_schema_tbl = in_schema(schema_name="demo_shipping",
                           table_name="products",
                           datalake_name="iceberg_glue")
p_datalake_df = DataFrame(prod_in_schema_tbl)
p_datalake_df

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 3. Creating Shipment dataset </b></p>
<p style = 'font-size:16px;font-family:Arial'> Our goal is to build a binary classification machine learning model that can predict delays in logistics with high accuracy. We will combine all the available tables, both the Vantage tables and the OTF tables, into a single dataset, which we will then use to create our prediction model.
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>General steps we will follow:</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Combine both tables in Vantage and OTF to create a dataset</li>    
    <li>Feature engineering to create the target variable as well as other columns</li>
    <li>Creating Fit tables for Scaling, OneHotEncoding and LabelEncoding</li>
    <li>Applying ColumnTransformer to create final dataset, creating train and test datasets</li>
    <li>Creating prediction model</li>
    <li>Model evaluation and explainability</li>
</ol>    

In [None]:
tdf_ship = DataFrame.from_query('''SELECT
    o.*,
    c.  customer_zip_code_prefix ,c.customer_city,c.customer_state,
    i.product_id,i.seller_id ,i.price ,i.shipping_charges,
    p.payment_sequential ,p.payment_type,p.payment_installments,p.payment_value,
    pr.product_category_name,pr.product_weight_g,pr.product_length_cm,pr.product_height_cm,pr.product_width_cm 
FROM DEMO_Shipment.orders o
LEFT JOIN iceberg_glue.demo_shipping.customers c
    ON o.customer_id = c.customer_id
LEFT JOIN DEMO_Shipment.orderitems i
    ON o.order_id = i.order_id
LEFT JOIN DEMO_Shipment.payments p
    ON o.order_id = p.order_id
LEFT JOIN iceberg_glue.demo_shipping.products pr
    ON i.product_id = pr.product_id;''')

In [None]:
tdf_ship.shape
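
<p style = 'font-size:16px;font-family:Arial'>The row count reported above can be larger than the number of orders, because chaining LEFT JOINs to several one-to-many tables multiplies rows: an order with two items and two payments produces four rows. The sketch below illustrates this with an in-memory SQLite database rather than Vantage. The table contents are invented for illustration and only a few of the real columns are included.</p>

```python
import sqlite3

# In-memory SQLite stand-in for the Vantage tables (illustrative data only)
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders     (order_id TEXT, customer_id TEXT);
    CREATE TABLE orderitems (order_id TEXT, product_id TEXT);
    CREATE TABLE payments   (order_id TEXT, payment_sequential INTEGER);

    INSERT INTO orders     VALUES ('o1', 'c1'), ('o2', 'c2');
    INSERT INTO orderitems VALUES ('o1', 'p1'), ('o1', 'p2');
    INSERT INTO payments   VALUES ('o1', 1), ('o1', 2), ('o2', 1);
""")

# Same join shape as the notebook query: orders LEFT JOIN items LEFT JOIN payments
rows = con.execute("""
    SELECT o.order_id, i.product_id, p.payment_sequential
    FROM orders o
    LEFT JOIN orderitems i ON o.order_id = i.order_id
    LEFT JOIN payments   p ON o.order_id = p.order_id
    ORDER BY o.order_id, i.product_id, p.payment_sequential
""").fetchall()

for row in rows:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial'>Order o1 appears four times (2 items x 2 payments), while order o2, which has no items, appears once with a NULL product_id.</p>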

In [None]:
#saving intermediate table
copy_to_sql(tdf_ship, table_name="shipment_dataset", if_exists="replace")

In [None]:
tdf_ship = DataFrame("shipment_dataset")
tdf_ship

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4. Feature Engineering </b></p>

<p style = 'font-size:16px;font-family:Arial'>As a first step let us check how many nulls we have in our data.</p>

In [None]:
#checking NULLs in data
colsum = ColumnSummary(data=tdf_ship,
                        target_columns=[':']
                       )
cs = colsum.result.filter(items = ['ColumnName', 'Datatype', 'NullCount','NullPercentage'])
cs[cs['NullPercentage'] > 0.0]

<p style = 'font-size:16px;font-family:Arial'>Dropping the rows with approved date as null.</p>

In [None]:
tdf_ship = tdf_ship.dropna(how='any', subset=['order_approved_at'])

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4.1 Creating Target Variable </b></p>
<p style = 'font-size:16px;font-family:Arial'> 
We will create the target column 'is_late_s', which indicates whether the shipment was delayed. A record is marked as late when its order_status is 'delivered' and its order_delivered_timestamp is later than its order_estimated_delivery_date.    

In [None]:
# A delivery is late when a delivered order's order_delivered_timestamp
# is later than its order_estimated_delivery_date.
# Column expressions are combined with & (not Python's `and`), with each side parenthesized.
tdf_ship = tdf_ship.assign(is_late_s = case([((tdf_ship.order_status == 'delivered') & (tdf_ship.order_delivered_timestamp > tdf_ship.order_estimated_delivery_date),
                          1)], else_= 0))
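
<p style = 'font-size:16px;font-family:Arial'>To make the labeling rule concrete, here is the same logic as a plain Python function applied to a few invented records. It is a sketch of the rule only. The function name <code>is_late</code> and the sample timestamps are assumptions, and missing timestamps are treated as "not late", mirroring how a SQL comparison with NULL falls through to the ELSE branch.</p>

```python
from datetime import datetime


def is_late(order_status, delivered_at, estimated_at):
    """Return 1 for a delivered order that arrived after its estimate, else 0."""
    if order_status != "delivered" or delivered_at is None or estimated_at is None:
        return 0
    return int(delivered_at > estimated_at)


estimate = datetime(2018, 3, 10)
print(is_late("delivered", datetime(2018, 3, 12), estimate))  # arrived after the estimate
print(is_late("delivered", datetime(2018, 3, 9), estimate))   # arrived before the estimate
print(is_late("shipped", None, estimate))                     # not delivered yet
```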


<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4.2 Adding calculated columns </b></p>
<p style = 'font-size:18px;font-family:Arial'><b> Product Volume </b>
<p style = 'font-size:16px;font-family:Arial'>We add the prod_vol feature, which is the product of product_length_cm, product_height_cm, and product_width_cm. We then bin it into three categories using its 25th and 75th percentiles.    

In [None]:
tdf_ship=tdf_ship.assign(prod_vol = tdf_ship.product_length_cm*tdf_ship.product_height_cm*tdf_ship.product_width_cm)

In [None]:
#taking percentile values of prod_vol
d1=tdf_ship.assign(True, percentile_25=tdf_ship.prod_vol.percentile(0.25, interpolation=None),
            percentile_75=tdf_ship.prod_vol.percentile(0.75, interpolation=None))
p_25p=d1.get_values()[0][0].item()
p_75p=d1.get_values()[0][1].item()

In [None]:
tdf_ship = tdf_ship.assign(vol_category = case([(tdf_ship.prod_vol >= p_75p,'large'),
                                              ((tdf_ship.prod_vol >= p_25p),'medium'),
                                              (tdf_ship.prod_vol < p_25p ,'small')
                                             ], else_= 'small'))
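
<p style = 'font-size:16px;font-family:Arial'>The same three-bin pattern is reused for price, weight and shipping charges below. As a rough illustration, the sketch here reproduces it in plain Python on a small invented list. It uses <code>statistics.quantiles</code> with the inclusive method, which is an assumption: the in-database percentile may interpolate differently, so the cut points need not match exactly. The function name <code>quartile_bins</code> is hypothetical.</p>

```python
import statistics


def quartile_bins(values, labels=("small", "medium", "large")):
    """Label each value by comparing it with the 25th and 75th percentiles."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    low, mid, high = labels

    def label(value):
        # Checks run top-down, like the CASE expression in the cell above
        if value >= q75:
            return high
        if value >= q25:
            return mid
        return low

    return q25, q75, [label(v) for v in values]


volumes = [10, 20, 30, 40, 50]  # invented product volumes
q25, q75, categories = quartile_bins(volumes)
print(q25, q75, categories)
```

<p style = 'font-size:16px;font-family:Arial'>Because the conditions are evaluated in order, a value only reaches the "medium" check when it is already below the 75th percentile.</p>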

In [None]:
tdf_ship.shape

In [None]:
tdf_ship

<p style = 'font-size:18px;font-family:Arial'><b> Day Of Week </b>
<p style = 'font-size:16px;font-family:Arial'>Day of week represents the day of the week as a number 0-6. We will use the day_of_week() function available in the teradataml library.  

In [None]:
tdf_ship=tdf_ship.assign(purchase_day_of_week = tdf_ship.order_purchase_timestamp.day_of_week(),
                      approved_day_of_week = tdf_ship.order_approved_at.day_of_week())


<p style = 'font-size:18px;font-family:Arial'><b> Purchase Hour and Approved Hour</b>
<p style = 'font-size:16px;font-family:Arial'>We will extract the hour from order_purchase_timestamp and order_approved_at to create purchase_hour and approved_hour. We will use the extract() function available in the teradataml library.  

In [None]:
tdf_ship=tdf_ship.assign(purchase_hour = tdf_ship.order_purchase_timestamp.extract('HOUR'),
                      approved_hour = tdf_ship.order_approved_at.extract('HOUR'))

<p style = 'font-size:18px;font-family:Arial'><b> Price Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will categorize the price into 3 bins based on the 75th percentile and 25th percentile of the price.  

In [None]:
#taking percentile values of price
d2=tdf_ship.assign(True, percentile_25=tdf_ship.price.percentile(0.25, interpolation=None),
            percentile_75=tdf_ship.price.percentile(0.75, interpolation=None))

ship_25p=d2.get_values()[0][0].item()
ship_75p=d2.get_values()[0][1].item()

In [None]:
tdf_ship = tdf_ship.assign(price_category = case([(tdf_ship.price >= ship_75p,'expensive'),
                                              (tdf_ship.price >= ship_25p,'affordable'),
                                              (tdf_ship.price < ship_25p,'budget')
                                             ], else_= 'budget'))

<p style = 'font-size:18px;font-family:Arial'><b> Weight Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will categorize the weight into 3 bins based on the 75th percentile and 25th percentile of the weight.

In [None]:
#taking percentile values of product weight
d3=tdf_ship.assign(True, percentile_25=tdf_ship.product_weight_g.percentile(0.25, interpolation=None),
            percentile_75=tdf_ship.product_weight_g.percentile(0.75, interpolation=None))

wt_25p=d3.get_values()[0][0].item()
wt_75p=d3.get_values()[0][1].item()

In [None]:
tdf_ship = tdf_ship.assign(weigth_category = case([(tdf_ship.product_weight_g >= wt_75p,'heavy'),
                                              (tdf_ship.product_weight_g >= wt_25p,'medium'),
                                              (tdf_ship.product_weight_g < wt_25p,'light')
                                             ], else_= 'light'))

<p style = 'font-size:18px;font-family:Arial'><b> Shipping Charges Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will categorize shipping charges into 3 bins based on the 75th percentile and 25th percentile of the shipping charges.

In [None]:
#taking percentile values of shipping charges
d4=tdf_ship.assign(True, percentile_25=tdf_ship.shipping_charges.percentile(0.25, interpolation=None),
            percentile_75=tdf_ship.shipping_charges.percentile(0.75, interpolation=None))

s_25p=d4.get_values()[0][0].item()
s_75p=d4.get_values()[0][1].item()

In [None]:
tdf_ship = tdf_ship.assign(freight_category = case([(tdf_ship.shipping_charges >= s_75p,'expensive'),
                                              (tdf_ship.shipping_charges >= s_25p,'medium'),
                                               (tdf_ship.shipping_charges < s_25p,'budget')   
                                             ], else_= 'budget'))

In [None]:
tdf_ship

In [None]:
tdf_ship.shape

<p style = 'font-size:18px;font-family:Arial'><b> Zip code counts binning Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will count late deliveries per zip code and categorize the zip codes into 3 bins based on the 75th percentile and 25th percentile of the counts.

In [None]:
#taking zip code counts 
zip = tdf_ship[tdf_ship['is_late_s'] == 1].select(['customer_zip_code_prefix','order_id']) \
                                           .groupby(['customer_zip_code_prefix']).agg({'order_id' : ['count']}) 
      

In [None]:
#taking percentile of zip code counts 
d_z=zip.assign(True, percentile_25=zip.count_order_id.percentile(0.25, interpolation=None),
            percentile_75=zip.count_order_id.percentile(0.75, interpolation=None))

#75th & 25th percentile values
zip_25p=d_z.get_values()[0][0].item()
zip_75p=d_z.get_values()[0][1].item()

#binning the zip code in one of the category
zip = zip.assign(zip_category = case([(zip.count_order_id >= zip_75p,'often'),
                                       (zip.count_order_id >= zip_25p,'quite'),
                                        (zip.count_order_id < zip_25p,'rarely')
                                             ], else_= 'rarely'))

zip
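
<p style = 'font-size:16px;font-family:Arial'>The counting step used here (and repeated for city, seller and product) can be pictured with a small plain Python sketch. The rows and the helper name <code>late_counts</code> are invented for illustration. Note that a key with no late rows does not appear in the counts at all, so joining the categories back to the full dataset with a left join leaves its category as NULL.</p>

```python
from collections import Counter


def late_counts(rows, key):
    """Count late rows (is_late_s == 1) for each distinct value of `key`."""
    return Counter(row[key] for row in rows if row["is_late_s"] == 1)


# Invented sample rows, one per order line
sample_rows = [
    {"customer_zip_code_prefix": "01000", "is_late_s": 1},
    {"customer_zip_code_prefix": "01000", "is_late_s": 1},
    {"customer_zip_code_prefix": "02000", "is_late_s": 1},
    {"customer_zip_code_prefix": "02000", "is_late_s": 0},
    {"customer_zip_code_prefix": "03000", "is_late_s": 0},
]

counts = late_counts(sample_rows, "customer_zip_code_prefix")
print(dict(counts))
print(counts.get("03000"))  # no late rows for this zip code, so there is no entry
```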

<p style = 'font-size:18px;font-family:Arial'><b> Customer City counts binning Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will count late deliveries per customer city and categorize the cities into 3 bins based on the 75th percentile and 25th percentile of the counts.

In [None]:
#taking city counts 
city = tdf_ship[tdf_ship['is_late_s'] == 1].select(['customer_city','order_id']).groupby(['customer_city']). \
       agg({'order_id' : ['count']})

#taking percentile of city counts
d_c=city.assign(True, percentile_25=city.count_order_id.percentile(0.25, interpolation=None),
            percentile_75=city.count_order_id.percentile(0.75, interpolation=None))

#75th & 25th percentile values
city_25p=d_c.get_values()[0][0].item()
city_75p=d_c.get_values()[0][1].item()

#binning the city in one of the category
city = city.assign(city_category = case([(city.count_order_id >= city_75p,'often'),
                                       (city.count_order_id >= city_25p,'quite')
                                             ], else_= 'rarely'))

city

<p style = 'font-size:18px;font-family:Arial'><b> Seller counts binning Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will count late deliveries per seller and categorize the sellers into 3 bins based on the 75th percentile and 25th percentile of the counts.

In [None]:
#taking seller counts 
seller = tdf_ship[tdf_ship['is_late_s'] == 1].select(['seller_id','order_id']).groupby(['seller_id']). \
       agg({'order_id' : ['count']})

#taking percentile of seller counts
d_s=seller.assign(True, percentile_25=seller.count_order_id.percentile(0.25, interpolation=None),
            percentile_75=seller.count_order_id.percentile(0.75, interpolation=None))

#75th & 25th percentile values
seller_25p=d_s.get_values()[0][0].item()
seller_75p=d_s.get_values()[0][1].item()

#binning the seller in one of the category
seller = seller.assign(seller_category = case([(seller.count_order_id >= seller_75p,'often'),
                                       (seller.count_order_id >= seller_25p,'quite'),
                                        (seller.count_order_id < seller_25p,'rarely')       
                                             ], else_= 'rarely'))

seller

<p style = 'font-size:18px;font-family:Arial'><b> Product counts binning Category</b>
<p style = 'font-size:16px;font-family:Arial'>We will count late deliveries per product and categorize the products into 3 bins based on the 75th percentile and 25th percentile of the counts.

In [None]:
#taking product counts 
product = tdf_ship[tdf_ship['is_late_s'] == 1].select(['product_id','order_id']).groupby(['product_id']). \
       agg({'order_id' : ['count']})

#taking percentile of product counts
d_p=product.assign(True, percentile_25=product.count_order_id.percentile(0.25, interpolation=None),
            percentile_75=product.count_order_id.percentile(0.75, interpolation=None))

#75th & 25th percentile values
product_25p=d_p.get_values()[0][0].item()
product_75p=d_p.get_values()[0][1].item()

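#binning the product into one of the categories, using the product percentiles computed above
#(the column keeps the name seller_category; after the join below it surfaces as product_seller_category)
product = product.assign(seller_category = case([(product.count_order_id >= product_75p,'often'),
                                       (product.count_order_id >= product_25p,'quite')
                                             ], else_= 'rarely'))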

product

<p style = 'font-size:16px;font-family:Arial'>We now join all of these (zip, city, seller, product) with the shipment dataset.

In [None]:
tdf = tdf_ship.join(other = zip,on = "customer_zip_code_prefix", how = "left", rprefix ="zip")\
               .join(other = city,on = "customer_city", how = "left", rprefix ="city")\
               .join(other = seller,on = "seller_id", how = "left", rprefix ="seller")\
               .join(other = product,on = "product_id", how = "left", rprefix ="product")
tdf 

In [None]:
tdf= tdf.assign(drop_columns = True
                ,order_id=tdf.order_id
                ,customer_state=tdf.customer_state.cast(type_=VARCHAR(20))
                ,price=tdf.price
                ,shipping_charges=tdf.shipping_charges
                ,payment_seq=tdf.payment_sequential
                ,payment_type=tdf.payment_type
                ,payment_installments=tdf.payment_installments
                ,payment=tdf.payment_value
                ,product_type=tdf.product_category_name
                ,product_wt=tdf.product_weight_g
                ,is_late_s=tdf.is_late_s
                ,prod_vol=tdf.prod_vol
                ,vol_cat=tdf.vol_category.cast(type_=VARCHAR(20))
                ,approved_day_of_week=tdf.approved_day_of_week
                ,purchase_day_of_week=tdf.purchase_day_of_week
                ,approved_hour=tdf.approved_hour
                ,purchase_hour=tdf.purchase_hour
                ,price_cat=tdf.price_category.cast(type_=VARCHAR(20))
                ,weigth_cat=tdf.weigth_category.cast(type_=VARCHAR(20))
                ,freight_cat=tdf.freight_category.cast(type_=VARCHAR(20))
                ,zip_cat=tdf.zip_category.cast(type_=VARCHAR(20))
                ,city_cat=tdf.city_category.cast(type_=VARCHAR(20))
                ,seller_cat=tdf.seller_category.cast(type_=VARCHAR(20))
                ,product_cat=tdf.product_seller_category.cast(type_=VARCHAR(20))
               )
         

In [None]:
#saving intermediate table
copy_to_sql(tdf, table_name="data_preprocess", if_exists="replace")

In [None]:
tdf_pre = DataFrame("data_preprocess")
tdf_pre.shape

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4.3 Encoding of categorical columns </b></p>
<p style = 'font-size:16px;font-family:Arial'>First we get the list of categorical columns and calculate the distinct counts in each of them.    

In [None]:
# Create a list of column names with data type 'str'
ohe_col_list = [col.split()[0] for col in str(tdf_pre.dtypes).split('\n') if col.split()[1] == 'str']

In [None]:
#counts of all str columns
tdf_pre.select(ohe_col_list).agg('unique')

<p style = 'font-size:18px;font-family:Arial'><b> One-Hot Encoding</b>
<p style = 'font-size:16px;font-family:Arial'> Because of the number of distinct values in customer_state (27) and product_type (70), we will use label encoding for these two columns. For the other categorical columns we will use one-hot encoding.

In [None]:
ohe_col_list.remove("order_id")
ohe_col_list.remove("customer_state")
ohe_col_list.remove("product_type")

In [None]:
ohe_col_list

In [None]:
#one hot encoding and label encoding
# create fit object to encode categorical columns
hot_fit = OneHotEncodingFit(data=tdf_pre,
                                is_input_dense=True,
                                target_column=ohe_col_list,
                                category_counts=[4,2,3,3,2,2,3,3,2],
                                approach="auto",
                                other_column="other")
 
# Print the result DataFrame.
hot_fit.result
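
<p style = 'font-size:16px;font-family:Arial'>For readers new to one-hot encoding, the sketch below shows the idea in plain Python: each known category becomes its own 0/1 column, and a value outside the known categories is flagged in an "other" column. The output column names and the exact handling of the other column are assumptions made for illustration and may differ from what OneHotEncodingFit produces.</p>

```python
def one_hot_fit(values):
    """Collect the sorted list of distinct non-null categories."""
    return sorted({v for v in values if v is not None})


def one_hot_transform(column, value, categories):
    """Encode one value as 0/1 indicators, plus an 'other' flag for unseen values."""
    encoded = {f"{column}_{c}": int(value == c) for c in categories}
    encoded[f"{column}_other"] = int(value is not None and value not in categories)
    return encoded


vol_categories = one_hot_fit(["small", "large", "medium", "small"])
print(vol_categories)
print(one_hot_transform("vol_cat", "medium", vol_categories))
print(one_hot_transform("vol_cat", "huge", vol_categories))  # unseen category
```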

<p style = 'font-size:18px;font-family:Arial'><b> Label Encoding</b>

In [None]:
ordinal_fit = OrdinalEncodingFit(target_column=['customer_state','product_type'],
                                 data=tdf_pre,
                                 default_value=-1
                                )

ordinal_fit.result
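
<p style = 'font-size:16px;font-family:Arial'>Label (ordinal) encoding replaces each category with an integer code, which keeps high-cardinality columns such as customer_state and product_type in a single column. The plain Python sketch below illustrates the idea, including a fallback code of -1 for categories not seen during fitting, as set by <code>default_value</code> above. The alphabetical ordering and zero-based codes are assumptions and the in-database function may assign codes differently.</p>

```python
def ordinal_fit(values):
    """Map each distinct non-null category to an integer code."""
    categories = sorted({v for v in values if v is not None})
    return {category: code for code, category in enumerate(categories)}


def ordinal_transform(value, mapping, default_value=-1):
    """Return the code for a value, or default_value when it was not seen in fit."""
    return mapping.get(value, default_value)


state_codes = ordinal_fit(["SP", "RJ", "MG", "SP"])  # invented sample states
print(state_codes)
print(ordinal_transform("RJ", state_codes))
print(ordinal_transform("XX", state_codes))  # unseen state falls back to -1
```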

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4.4 Scaling of numerical columns </b></p>
 

In [None]:
scale_list = [col.split()[0] for col in str(tdf_pre.dtypes).split('\n') if col.split()[1] not in ('str','int')]
scale_list

In [None]:
scale_fit = ScaleFit(data=tdf_pre,
                       target_columns=scale_list,
                       scale_method="RANGE",
                       miss_value="KEEP",
                       global_scale=False)
scale_fit.output
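
<p style = 'font-size:16px;font-family:Arial'>Range scaling maps each column onto the interval from 0 to 1 using (x - min) / (max - min), computed per column. The sketch below shows this in plain Python and leaves missing values untouched, in the spirit of <code>miss_value="KEEP"</code>. Returning 0.0 for a constant column is an assumption made here to avoid dividing by zero.</p>

```python
def range_scale(values):
    """Min-max scale a list to [0, 1], keeping None entries as None."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    span = high - low
    scaled = []
    for value in values:
        if value is None:
            scaled.append(None)          # keep missing values as they are
        elif span == 0:
            scaled.append(0.0)           # constant column (assumption)
        else:
            scaled.append((value - low) / span)
    return scaled


print(range_scale([10, None, 20, 30]))
```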

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4.5 Applying all the fit tables</b> 
    <p style = 'font-size:16px;font-family:Arial'> We will apply all the fit tables created in above steps using the ColumnTransformer function and create the final dataset</p>

In [None]:
out1 = ColumnTransformer(input_data=tdf_pre,
                         scale_fit_data=scale_fit.output,
                         onehotencoding_fit_data=hot_fit.result,
                         ordinalencoding_fit_data=ordinal_fit.result
                                        )
tdf = out1.result   

In [None]:
tdf

In [None]:
all_col=tdf.columns
sel_col = list(set(all_col) - set(ohe_col_list))

In [None]:
tdf_final = tdf.select(sel_col)

In [None]:
tdf_final

In [None]:
tdf_final.shape

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 5. Create train and test data  </b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that our data is transformed and ready to be used in machine learning models, let us split the whole dataset into train and test sets for model training and scoring. We will use the <b>TrainTestSplit</b> function for this task.</p>

In [None]:
TrainTestSplit_out = TrainTestSplit(
                                    data = tdf_final,
                                    id_column = "order_id",
                                    stratify_column = "is_late_s",
                                    train_size = 0.75,
                                    test_size = 0.25,
                                    seed = 21
)

In [None]:
# Split into 2 virtual dataframes
df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)
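
<p style = 'font-size:16px;font-family:Arial'>Stratifying on is_late_s keeps the share of late shipments roughly equal in the train and test sets, which matters when late deliveries are the minority class. The plain Python sketch below illustrates the idea on invented rows. The function <code>stratified_split</code> is hypothetical, and a given seed will not reproduce the rows chosen by the in-database function.</p>

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, train_size=0.75, seed=21):
    """Split rows into train/test so each label keeps about the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)                       # shuffle within each label
        cut = round(len(group) * train_size)     # per-label train count
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Invented data: 80 on-time rows and 20 late rows
rows = [{"order_id": i, "is_late_s": int(i < 20)} for i in range(100)]
train_rows, test_rows = stratified_split(rows, "is_late_s")
print(len(train_rows), sum(r["is_late_s"] for r in train_rows))
print(len(test_rows), sum(r["is_late_s"] for r in test_rows))
```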

In [None]:
copy_to_sql(df_train, table_name="train_shipment_ds", if_exists="replace",primary_index ="order_id")
copy_to_sql(df_test, table_name="test_shipment_ds", if_exists="replace",primary_index ="order_id")

In [None]:
train_data=DataFrame("train_shipment_ds")
train_data.shape

In [None]:
train_data

In [None]:
test_data=DataFrame("test_shipment_ds")
test_data.shape

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. In-Database Model Training and Scoring</b></p>
<p style = 'font-size:16px;font-family:Arial'>For our model we will use logistic regression.<br>
  <b>Logistic regression</b> is a statistical algorithm used for binary classification problems. It is a type of supervised learning algorithm that predicts the probability of an input belonging to a certain class (e.g., positive or negative) based on its features.<br>Logistic regression works by modeling the relationship between the input features and the probability of belonging to a certain class using a logistic function. The logistic function takes the input feature values and maps them onto a probability scale between 0 and 1, which represents the probability of belonging to the positive class.<br>
    The <b>GLM</b> function fits a generalized linear model (GLM) and supports regression and classification analysis on data sets.

In [None]:
col_list=train_data.columns
col_list.remove("order_id")
col_list.remove("is_late_s")

In [None]:
glm_model = GLM(data = train_data,
                input_columns = col_list, 
                response_column = 'is_late_s',
                family = 'Binomial')

In [None]:
glm_model.result
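
<p style = 'font-size:16px;font-family:Arial'>The fitted model is a set of coefficients plus an intercept. As a minimal sketch of how logistic regression turns those into a probability, the plain Python below computes the linear score and passes it through the logistic function. The weights and feature values are invented and are not taken from the model above.</p>

```python
import math


def sigmoid(z):
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, intercept):
    """Probability of the positive class for one row of features."""
    score = intercept
    for i in range(len(weights)):
        score += weights[i] * features[i]
    return sigmoid(score)


# Invented coefficients and one invented, already scaled feature row
weights = [0.5, -0.25]
probability = predict_proba([1.0, 2.0], weights, intercept=0.0)
print(probability, int(probability >= 0.5))
```

<p style = 'font-size:16px;font-family:Arial'>A threshold (0.5 in this sketch) converts the probability into a class label.</p>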

<p style = 'font-size:16px;font-family:Arial'>We have created our model, let's do the predictions on the test dataset.

In [None]:
glm_prediction = TDGLMPredict(newdata = test_data,
                           id_column = 'order_id',
                           object = glm_model.result,
                           accumulate = 'is_late_s',
                           family = 'Binomial',   
                           output_prob=True,
                           output_responses = ['0', '1'])

In [None]:
out_glm = glm_prediction.result.assign(prediction = glm_prediction.result.prediction.cast(type_ = BYTEINT))
out_glm = out_glm.assign(prediction = out_glm.prediction.cast(type_ = VARCHAR(2)))
out_glm = out_glm.assign(is_late_s = out_glm.is_late_s.cast(type_ = VARCHAR(2)))
out_glm

<p style = 'font-size:16px;font-family:Arial'>The output above shows prob_1, the probability that the shipment will be delayed, and prob_0, the probability that it will not be delayed. The prediction column holds the class label derived from these probabilities.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.2 Evaluation of Logistic Regression Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will use the <b>ClassificationEvaluator</b> function to evaluate the trained glm model on test data. This will let us know how well our model has performed on unseen data.</p>

In [None]:
ClassificationEvaluator_glm = ClassificationEvaluator(
                                                        data = out_glm,
                                                        observation_column = 'is_late_s',
                                                        prediction_column = 'prediction',
                                                        labels = ['0', '1']
)

In [None]:
ClassificationEvaluator_glm.output_data.head(10)
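
<p style = 'font-size:16px;font-family:Arial'>The metrics reported by the evaluator are derived from the confusion matrix. The plain Python sketch below computes accuracy, precision, recall and F1 for the positive class on a few invented labels, to show how the numbers relate to each other. It is an illustration only and does not reproduce the evaluator's full output.</p>

```python
def binary_metrics(observed, predicted, positive="1"):
    """Compute accuracy, precision, recall and F1 for the positive label."""
    tp = fp = fn = tn = 0
    for i in range(len(observed)):
        actual_pos = observed[i] == positive
        predicted_pos = predicted[i] == positive
        if actual_pos and predicted_pos:
            tp += 1
        elif predicted_pos:
            fp += 1
        elif actual_pos:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / len(observed), "precision": precision,
            "recall": recall, "f1": f1}


observed = ["1", "0", "1", "0", "1", "0"]   # invented true labels
predicted = ["1", "0", "0", "0", "1", "1"]  # invented predictions
metrics = binary_metrics(observed, predicted)
print(metrics)
```

<p style = 'font-size:16px;font-family:Arial'>When late shipments are rare, precision and recall for label 1 are usually more informative than accuracy alone.</p>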

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.3 Model Explainability</b></p>
<p style = 'font-size:16px;font-family:Arial'>SHAP computes the contribution of each feature to a prediction as the average marginal contribution of the feature value across all possible coalitions. TD_SHAP also computes the mean absolute contribution of each feature as a global explanation (using the OUT clause), which can be used as a measure of feature importance.</p>

In [None]:
Shap_out = Shap(data=train_data, 
                    object=glm_model.result, 
                    id_column='order_id',
                    training_function="TD_GLM", 
                    model_type="Classification",
                    input_columns=col_list)
#Print the result DataFrame.
Shap_out.output_data
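
<p style = 'font-size:16px;font-family:Arial'>For a linear model with independent features, SHAP values have a simple closed form: the contribution of feature i for a row is w_i * (x_i - mean(x_i)), expressed in log-odds for logistic regression. The plain Python sketch below uses that closed form to compute the mean absolute contribution per feature on invented data. It illustrates the concept only and is not a description of how TD_SHAP computes its output.</p>

```python
def linear_shap(weights, row, feature_means):
    """Per-feature contributions for one row of a linear model."""
    return [weights[i] * (row[i] - feature_means[i]) for i in range(len(weights))]


def mean_abs_contribution(weights, rows):
    """Global importance: mean absolute contribution of each feature."""
    n_rows, n_features = len(rows), len(weights)
    means = [sum(r[i] for r in rows) / n_rows for i in range(n_features)]
    totals = [0.0] * n_features
    for row in rows:
        contributions = linear_shap(weights, row, means)
        for i in range(n_features):
            totals[i] += abs(contributions[i])
    return [total / n_rows for total in totals]


shap_weights = [2.0, -1.0]                 # invented coefficients
shap_rows = [[0.0, 1.0], [2.0, 3.0]]       # invented feature rows
importance = mean_abs_contribution(shap_weights, shap_rows)
print(importance)
```

<p style = 'font-size:16px;font-family:Arial'>For each row the contributions sum to the difference between that row's linear score and the average score, which is what makes them additive explanations.</p>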

<p style = 'font-size:16px;font-family:Arial'>Top 10 features

In [None]:
df = Shap_out.output_data.to_pandas()
df_t = df.T
df_t=df_t.rename(columns={'index': 'Feature', 0: 'Importance'})
df_sort=df_t.sort_values(by='Importance', ascending=False)
df_sort.head(10) 

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;'>
We need to clean up our work tables to prevent errors next time.

In [None]:
tables = ['shipment_dataset' ,'data_preprocess' ,'train_shipment_ds' ,'test_shipment_ds']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name = table)
    except Exception:
        # The table may not exist (for example on a first run); nothing to drop
        pass

In [None]:
remove_context()

<p style = 'font-size:20px;font-family:Arial'> <b> 8. Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial'>In this notebook, we explored how open table formats enable scalable, flexible analytics on large volumes of data stored in modern data lake architectures. We demonstrated how data stored in open formats can be seamlessly integrated with enterprise analytics workflows, eliminating data silos and reducing the need for complex data movement. By leveraging Teradata’s ability to perform high-performance, in-database analytics and natively read open table format tables, organizations can analyze structured and lake-based data together using a single SQL and governance framework. This approach provides a strong foundation for building scalable analytics, advanced AI use cases, and data-driven decision-making, while maintaining openness, performance, and enterprise-grade reliability.</p>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2026. All Rights Reserved
        </div>
    </div>
</footer>