<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Shipping Time Prediction Using Vantage InDB Analytic Functions</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>

<p style = 'font-size:16px;font-family:Arial'>eBay, as an online marketplace, faces the challenge of accurately estimating delivery dates for shipments from various sellers. The current estimation process, based on seller handling time and carrier transit time, often leads to inconsistent and inaccurate predictions. This results in customer dissatisfaction and potential erosion of trust in the platform. Therefore, there is a need to develop a robust system that can reliably estimate delivery dates, accounting for handling time, transit time, and other relevant variables affecting the actual delivery timeframe. This system should leverage historical data, predictive modeling techniques, and machine learning algorithms to improve accuracy. The successful implementation of an improved delivery time estimation system will enhance customer satisfaction, increase buyer trust, boost sales, and improve seller engagement on the platform.</p>

  
<p style = 'font-size:16px;font-family:Arial'>To address the problem of estimating delivery dates for eBay packages, we propose leveraging Teradata's in-database capabilities. By using Teradata's data cleaning and machine learning functionalities, we can develop a robust model to predict delivery dates. This involves collecting relevant data, cleaning it for accuracy, performing feature engineering, developing a predictive model, and validating and optimizing it. The implementation of this solution can lead to improved customer satisfaction, increased trust, higher sales, and enhanced seller engagement.</p>
   

<p style = 'font-size:16px;font-family:Arial'>
The implemented functions are from the following documentation:</p>
<li style = 'font-size:16px;font-family:Arial'> <a href='https://www.docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20'>Advanced SQL Engine 17.20 Functions</a></li>       
<li style = 'font-size:16px;font-family:Arial'> <a href= 'https://docs.teradata.com/r/Enterprise_IntelliFlex_Lake_VMware/Vantage-Analytics-Library-User-Guide/Welcome-to-Vantage-Analytics-Library'>Vantage Analytics Library</a></li>
<li style = 'font-size:16px;font-family:Arial'> <a href= 'https://docs.teradata.com/r/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference'>UAF Time-Series 17.20 Functions</a></li>    
<br> 
       
    
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Data</b></p>
      
<p style = 'font-size:16px;font-family:Arial'>The data comes from a publicly available <a href= 'https://www.kaggle.com/datasets/armanaanand/ebay-delivery-date-prediction'>Kaggle</a> dataset with the following description:</p>
    
<img src='images/DataSet.png'>
    


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import getpass
import time
import pandas as pd
import teradataml as tdml
from teradataml import *

import sqlalchemy
from sqlalchemy import event
import csv
# from teradataml.dataframe.data_transfer import read_csv
from teradatasqlalchemy.types import *
import random
from PIL import Image

configure.val_install_location = "val"

import plotly.express as px
import io
import warnings
warnings.filterwarnings('ignore')
display.max_rows=5

<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Shipping_Time_Prediction_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>The data for this demo is provided only as a local copy. Because the notebook alters the table and adds new columns for various calculations, only the local version of the database and tables is created. The data is stored in the cloud; it is loaded into a local table and then processed.</p>


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_ShippingTimePrediction_local');"
 # Takes about 1 minute 30 secs

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>3. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial'>The dataset is a shipping dataset containing 110,000 rows. A more detailed description of the features is given in the Data section of the introduction.

</p>
<p style = 'font-size:16px;font-family:Arial'>Create a DataFrame to get the data from the table created.</p>



In [None]:
raw_df=DataFrame(in_schema('DEMO_ShipTimePred', 'Delivery_date_data'))
raw_df

In [None]:
df=raw_df.to_pandas(all_rows=True).reset_index()

In [None]:
#Plotting Distribution by channel
conversions=df[['record_number','b2c_c2c']]
conversions = conversions.groupby(['b2c_c2c'], as_index=False).count()
fig = px.bar(conversions, x='b2c_c2c', y='record_number', color='b2c_c2c')

fig.update_layout(title='Channel Distribution',
                   xaxis_title='Channel',
                   yaxis_title='Distributions')
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The above chart shows the distribution of shipments by channel: B2C (Business to Customer) and C2C (Customer to Customer).

</p>
<p style = 'font-size:16px;font-family:Arial'>We can also analyze the shipments by shipment method. Since this is sample data for the purpose of this demo, the shipment methods are not named; they are identified only by numeric shipment method IDs.
</p>



In [None]:
#Plotting Distribution by Shipment Method
shipments=df[['record_number','shipment_method_id']]
shipments = shipments.groupby(['shipment_method_id'], as_index=False).count()
fig = px.bar(shipments, x='shipment_method_id', y='record_number', color='shipment_method_id')

fig.update_layout(title='Shipment Method Distribution',
                   xaxis_title='Shipment Method',
                   yaxis_title='Distributions')
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The above chart shows the distribution of shipments across the different shipment methods. As seen in the chart, most of the shipments use Shipment Method ID 0.

</p>
<p style = 'font-size:16px;font-family:Arial'>We can also analyze the shipments by category. As with the shipment methods, the categories are not named; they are identified only by numeric Category IDs.
</p>



In [None]:
#Plotting Distribution by category
categories=df[['record_number','category_id']]
categories = categories.groupby(['category_id'], as_index=False).count()
fig = px.bar(categories, x='category_id', y='record_number', color='category_id')

fig.update_layout(title='Category Distribution',
                   xaxis_title='Category',
                   yaxis_title='Distributions')
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The above chart shows the distribution of shipments across categories. Most of the shipments belong to categories with Category IDs between 0 and 5.

</p>
<p style = 'font-size:16px;font-family:Arial'>Below we examine the shipping fees of the shipments.
</p>



In [None]:
# Bubble chart of the shipping fee per record; bubble size and color both encode the fee
ShipFees=df[['record_number','shipping_fee','weight_units']]
ax=px.scatter(ShipFees, x="record_number", y="shipping_fee",
              size="shipping_fee",size_max = 70,color="shipping_fee",hover_data=['shipping_fee'],
              width=900, height=400,
              labels={
                     "shipping_fee": "Shipping Fees",
                     "record_number": "Record Number"
              })
ax.update_layout(showlegend=False)
ax.update_layout(title_text='Shipping Fees Distribution', title_x=0.5)
ax.show()

<p style = 'font-size:16px;font-family:Arial'>The above chart shows the distribution of shipping fees across shipments. The size of each bubble depends on the fee: the larger the bubble, the larger the fee. The colors also depict increasing fees, as shown in the color bar at the side: the lowest fees are dark purple and the highest are yellow, i.e. the largest yellow bubble at the top left corner has the maximum shipping fee.</p>


<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>4. Data Preprocessing and Cleaning</b></p>


<p style = 'font-size:18px;font-family:Arial'><b>Data Preprocessing:</b></p>

<p style = 'font-size:16px;font-family:Arial'>A new column, 'distance', is added to the table to store the distance between the item location and the buyer location.</p>
<p style = 'font-size:16px;font-family:Arial'>The geospatial function <b>ST_SPHERICALDISTANCE</b> in Vantage is used to calculate the distance from the latitude and longitude columns of the item and buyer. The function returns the distance in meters, so the result is divided by 1000 to store it in kilometers.</p>

In [None]:
qry='''ALTER TABLE DEMO_ShipTimePred_db.Delivery_Date_Data
ADD distance FLOAT;'''

execute_sql(qry)

In [None]:
qry='''UPDATE DEMO_ShipTimePred_db.Delivery_Date_Data
SET distance = NEW ST_Geometry('ST_Point', item_long, item_lat).ST_SPHERICALDISTANCE(NEW ST_Geometry('ST_Point', 
buyer_long, buyer_lat))/1000;'''

execute_sql(qry)
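
<p style = 'font-size:16px;font-family:Arial'>To make the distance calculation concrete, here is a minimal local sketch of a great-circle distance using the haversine formula. It is not the implementation behind ST_SPHERICALDISTANCE: the function name spherical_distance_km, the Earth radius of 6371 km and the sample coordinates are assumptions for illustration, so the values may differ slightly from those computed in Vantage.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; Vantage may use a different value

def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical item and buyer coordinates (New York and Los Angeles)
item_lat, item_long = 40.7128, -74.0060
buyer_lat, buyer_long = 34.0522, -118.2437
distance_km = spherical_distance_km(item_lat, item_long, buyer_lat, buyer_long)
print(round(distance_km))
```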

<p style = 'font-size:18px;font-family:Arial'><b>Checking and handling missing values:</b></p>

<p style = 'font-size:16px;font-family:Arial'>We create a new table from the available data so that a copy of the original data is preserved.</p>

In [None]:
qry='''CREATE multiset TABLE delivery_date_complete_dataset AS (
        SELECT *
        FROM DEMO_ShipTimePred_db.Delivery_Date_Data
    ) WITH DATA PRIMARY INDEX (record_number);'''

try:
    execute_sql(qry)
except:
    db_drop_table('delivery_date_complete_dataset')
    execute_sql(qry)
    

<p style = 'font-size:18px;font-family:Arial'><b>Get Rows With Missing Values</b></p>

<p style = 'font-size:16px;font-family:Arial'>TD_GetRowsWithMissingValues, applied to the table delivery_date_complete_dataset, selects the rows in which at least one of the target columns, here the column index range [0:23], has a missing value.</p>

In [None]:
MissingVal_df = DataFrame.from_query("""
SELECT * FROM TD_GetRowsWithMissingValues ( 
    ON delivery_date_complete_dataset  AS InputTable
    USING
    TargetColumns ('[0:23]')
) AS dt;
""")
MissingVal_df
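
<p style = 'font-size:16px;font-family:Arial'>The selection rule can be illustrated locally with a short sketch. The helper rows_with_missing_values and the sample records are hypothetical; the sketch only mirrors the idea of keeping rows where any target column is missing, not the in-database function itself.</p>

```python
def rows_with_missing_values(rows, target_columns):
    """Return the rows in which at least one of target_columns is None."""
    return [row for row in rows if any(row[col] is None for col in target_columns)]

# Hypothetical sample records
sample = [
    {"record_number": 1, "weight": 5.0, "declared_handling_days": 2.0},
    {"record_number": 2, "weight": None, "declared_handling_days": 1.0},
    {"record_number": 3, "weight": 3.0, "declared_handling_days": None},
]
missing = rows_with_missing_values(sample, ["weight", "declared_handling_days"])
print([row["record_number"] for row in missing])
```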

<p style = 'font-size:18px;font-family:Arial'><b>Replace Missing Values</b></p>

<p style = 'font-size:16px;font-family:Arial'>We create a reusable function that replaces missing values, and use it below to impute declared_handling_days, weight, carrier_min_estimate and carrier_max_estimate. The logic is as follows:</p>
<li style = 'font-size:16px;font-family:Arial'>It calculates the average value (AvgVal) of a specified column (avgColumn) for each value of another column (groupCol) in the delivery_date_complete_dataset table. Only non-missing values are considered.</li>
<li style = 'font-size:16px;font-family:Arial'>It updates the delivery_date_complete_dataset table by filling in missing values in avgColumn with the corresponding value from AverageData for the matching groupCol, or with the average of the group averages if no match is found.</li>


In [None]:
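def temp_col(col):
    # Add a VARCHAR copy of the column. Missing values appear in it as a run of
    # asterisks, which handleMissingData uses to detect them.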
    execute_sql("""
    ALTER TABLE delivery_date_complete_dataset
    ADD "{0}_varchar" VARCHAR(50);""".format(col))
    
    execute_sql("""
    UPDATE delivery_date_complete_dataset
    SET "{0}_varchar" = CAST({0} AS VARCHAR(50));""".format(col))

In [None]:
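def handleMissingData(avgColumn, groupCol):
    # Fill missing values of avgColumn with the average per groupCol value,
    # falling back to the average of the group averages when a group has none.
    temp_col(avgColumn)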
    
    try:
        execute_sql("""DROP TABLE AVERAGEDATA""")
        print("DROPPING TABLE AVERAGEDATA")
    except:
        print("[Teradata Database] [Info] Object 'AVERAGEDATA' does not exist.")
        
    execute_sql("""
        CREATE VOLATILE TABLE AverageData AS (
            SELECT DISTINCT AVG("{0}") as AvgVal, "{1}" as "{1}"
            FROM delivery_date_complete_dataset
            WHERE "{0}_varchar" <> '**********************'
            GROUP BY "{1}"
        )
        WITH DATA
        ON COMMIT PRESERVE ROWS;
    """.format(avgColumn, groupCol))
    
    execute_sql("""
        UPDATE delivery_date_complete_dataset AS E
    SET "{0}" = 
        CASE
            WHEN E."{0}_varchar" = '**********************'
                THEN COALESCE(
                    (SELECT AvgVal FROM AverageData AS D WHERE E."{1}" = D."{1}"),
                    (SELECT AVG(AvgVal) FROM AverageData)
                )
            ELSE "{0}"
        END;
    """.format(avgColumn, groupCol))
    


<p style = 'font-size:16px;font-family:Arial'>The function above is now applied to impute the missing values of the columns 'declared_handling_days', 'weight', 'carrier_min_estimate' and 'carrier_max_estimate'.</p>

In [None]:
handleMissingData("declared_handling_days", "seller_id")

In [None]:
handleMissingData("weight", "category_id") 

In [None]:
handleMissingData("carrier_min_estimate", "shipment_method_id") 

In [None]:
handleMissingData("carrier_max_estimate", "shipment_method_id") # Handle missing carrier_max_estimate
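
<p style = 'font-size:16px;font-family:Arial'>As a rough local illustration of the same imputation rule, the sketch below fills missing values with the group average and falls back to the average of the group averages. The function impute_by_group and the sample rows are hypothetical, and None stands in for a missing value; the actual processing in this notebook happens in the database.</p>

```python
from collections import defaultdict
from statistics import mean

def impute_by_group(rows, value_col, group_col):
    """Fill None in value_col with the group's mean, else with the mean of all group means."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_col] is not None:
            observed[row[group_col]].append(row[value_col])
    group_means = {group: mean(values) for group, values in observed.items()}
    fallback = mean(group_means.values())  # mirrors SELECT AVG(AvgVal) FROM AverageData
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[value_col] is None:
            new_row[value_col] = group_means.get(new_row[group_col], fallback)
        filled.append(new_row)
    return filled

# Hypothetical sample: seller C has no observed value, so the fallback applies
rows = [
    {"seller_id": "A", "declared_handling_days": 2.0},
    {"seller_id": "A", "declared_handling_days": 4.0},
    {"seller_id": "A", "declared_handling_days": None},
    {"seller_id": "B", "declared_handling_days": 1.0},
    {"seller_id": "C", "declared_handling_days": None},
]
filled = impute_by_group(rows, "declared_handling_days", "seller_id")
print([row["declared_handling_days"] for row in filled])
```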

<p style = 'font-size:16px;font-family:Arial'>The code below fills in the missing values of the package_size column, which are stored as 'NONE': each such row is assigned the package size whose average weight is closest to the row's weight.</p>

In [None]:
qry='''CREATE VOLATILE TABLE pkg_averagedata AS (
            SELECT DISTINCT AVG(weight) as AvgVal, package_size
            FROM delivery_date_complete_dataset
            WHERE package_size <> 'NONE'
            GROUP BY package_size
        )
        WITH DATA
        ON COMMIT PRESERVE ROWS;'''
execute_sql(qry)

In [None]:
qry='''CREATE VOLATILE TABLE temp_table AS (
        SELECT *
        FROM (
            SELECT
                delivery_date_complete_dataset.*,
                pkg_averagedata.package_size AS new_package_size,
                ABS(delivery_date_complete_dataset.weight - pkg_averagedata.AvgVal) AS difference,
                ROW_NUMBER() OVER (PARTITION BY delivery_date_complete_dataset.record_number ORDER BY ABS(delivery_date_complete_dataset.weight - pkg_averagedata.AvgVal)) AS rn
            FROM delivery_date_complete_dataset
            CROSS JOIN pkg_averagedata
            WHERE delivery_date_complete_dataset.package_size = 'NONE'
        ) AS subquery
        WHERE rn = 1
    ) WITH DATA PRIMARY INDEX(record_number) ON COMMIT PRESERVE ROWS;'''

execute_sql(qry)

In [None]:
qry='''UPDATE delivery_date_complete_dataset
    FROM temp_table
    SET package_size = new_package_size
    WHERE delivery_date_complete_dataset.record_number = temp_table.record_number;'''

execute_sql(qry)
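
<p style = 'font-size:16px;font-family:Arial'>The nearest-average assignment can be sketched locally as follows. The function nearest_package_size, the package size names and the average weights are hypothetical stand-ins for the contents of pkg_averagedata. On ties this sketch picks the first candidate, whereas the ROW_NUMBER() ordering in the SQL leaves the choice unspecified.</p>

```python
def nearest_package_size(weight, avg_weight_by_size):
    """Return the package size whose average weight is closest to the given weight."""
    return min(avg_weight_by_size, key=lambda size: abs(weight - avg_weight_by_size[size]))

# Hypothetical average weights per package size (the role of pkg_averagedata)
avg_weight_by_size = {
    "LETTER": 2.0,
    "PACKAGE_THICK_ENVELOPE": 10.0,
    "LARGE_PACKAGE": 60.0,
}
assigned = nearest_package_size(7.5, avg_weight_by_size)
print(assigned)
```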

<p style = 'font-size:16px;font-family:Arial'>We drop the helper VARCHAR columns that were created while handling missing values.</p>

In [None]:
qry='''ALTER TABLE delivery_date_complete_dataset
DROP declared_handling_days_varchar,
DROP weight_varchar,
DROP carrier_min_estimate_varchar,
DROP carrier_max_estimate_varchar;'''

execute_sql(qry)

<p style = 'font-size:16px;font-family:Arial'>We will standardize all the date and timestamp columns to timestamp(0).</p>

In [None]:
execute_sql('''ALTER TABLE delivery_date_complete_dataset
    ADD payment_datetime_temp TIMESTAMP(0),
    ADD acceptance_scan_timestamp_temp TIMESTAMP(0),
    ADD delivery_date_temp TIMESTAMP(0);''')
    
execute_sql('''UPDATE delivery_date_complete_dataset
    SET payment_datetime_temp = CAST(SUBSTRING(cast(payment_datetime as VARCHAR(30)), 1, 19) AS TIMESTAMP(0)),
        acceptance_scan_timestamp_temp = CAST(SUBSTRING(cast(acceptance_scan_timestamp as VARCHAR(30)), 1, 19) AS TIMESTAMP(0)),
        delivery_date_temp = cast(cast(cast(delivery_date as date format 'yyyy-mm-dd') as varchar(10)) || ' 00:00:00' as timestamp(0));''')

execute_sql('''ALTER TABLE delivery_date_complete_dataset
    DROP payment_datetime,
    DROP acceptance_scan_timestamp,
    DROP delivery_date;''')
    
execute_sql('''ALTER TABLE delivery_date_complete_dataset
    RENAME payment_datetime_temp TO payment_datetime,
    RENAME acceptance_scan_timestamp_temp TO acceptance_scan_timestamp,
    RENAME delivery_date_temp TO delivery_date;''')
    

<p style = 'font-size:16px;font-family:Arial'>Calculate number of shipping days, handling days and delivery days based on the above timestamp values.</p>

In [None]:
execute_sql('''ALTER TABLE delivery_date_complete_dataset
    ADD handling_days INTEGER,
    ADD shipping_days INTEGER,
    ADD delivery_days INTEGER;''')
    
execute_sql('''UPDATE delivery_date_complete_dataset
    SET handling_days = CAST(acceptance_scan_timestamp AS DATE) - CAST(payment_datetime AS DATE),
        shipping_days = CAST(delivery_date AS DATE) - CAST(acceptance_scan_timestamp AS DATE),
        delivery_days = CAST(delivery_date AS DATE) - CAST(payment_datetime AS DATE);''')
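
<p style = 'font-size:16px;font-family:Arial'>The three day counts are plain date differences. A small local sketch, with hypothetical timestamps and a hypothetical helper day_diff, shows the same arithmetic: the time of day is discarded before subtracting, as with the CAST to DATE above.</p>

```python
from datetime import datetime

def day_diff(later, earlier):
    """Whole calendar days between two timestamps, ignoring the time of day."""
    return (later.date() - earlier.date()).days

payment = datetime(2019, 3, 26, 15, 11, 0)     # hypothetical payment_datetime
acceptance = datetime(2019, 3, 27, 9, 53, 0)   # hypothetical acceptance_scan_timestamp
delivery = datetime(2019, 3, 29, 0, 0, 0)      # hypothetical delivery_date

handling_days = day_diff(acceptance, payment)
shipping_days = day_diff(delivery, acceptance)
delivery_days = day_diff(delivery, payment)
print(handling_days, shipping_days, delivery_days)
```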

<p style = 'font-size:16px;font-family:Arial'>Round declared_handling_days and distance to whole numbers, then delete the rows where distance, weight or item price is zero.</p>

In [None]:
execute_sql('''UPDATE delivery_date_complete_dataset
SET declared_handling_days = CAST(ROUND(declared_handling_days, 0) AS INTEGER)
,distance = CAST(ROUND(distance, 0) AS INTEGER);''')

execute_sql('''DELETE FROM delivery_date_complete_dataset
WHERE distance = 0.0 OR weight = 0 OR item_price = 0.0;''')

In [None]:
df_= DataFrame.from_query("""
    SELECT * FROM delivery_date_complete_dataset
""")
df_

In [None]:
# Drop identifiers, raw coordinates and timestamp columns that are not used as model features
df_final=df_.drop(["seller_id","declared_handling_days", "carrier_min_estimate", "carrier_max_estimate",
                "item_zip","buyer_zip", "weight_units", "item_lat","item_long","buyer_lat","buyer_long",
                   "payment_datetime", "acceptance_scan_timestamp", "delivery_date"], axis=1)
df_final

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>5. Creation of final analytic dataset </b></p>
<p style = 'font-size:16px;font-family:Arial'>The dataset has columns measured in different units. If we feed these features to the model as is, a feature with larger values may influence the result more than the others simply because of its magnitude, which does not necessarily mean it is a more important predictor. To put all features on a comparable footing, we apply feature scaling.</p>
    
<p style = 'font-size:16px;font-family:Arial'>Here we apply standard scaling with the ScaleFit and ScaleTransform functions in Vantage. The ScaleFit() function outputs the statistics that the ScaleTransform() function takes as input to scale the specified input DataFrame columns.</p> 

In [None]:
from teradataml import ScaleFit , ScaleTransform
scaler = ScaleFit(
                    data=df_final,
                    target_columns=["shipping_fee","item_price", "quantity", "weight", "distance"],
                    scale_method="STD",
                    global_scale=False)

In [None]:
ADS_scaled = ScaleTransform(data=df_final,
                         object=scaler.output,
                         accumulate=["record_number","b2c_c2c", "package_size", "delivery_days"
                                     ,"shipment_method_id","category_id"]
                           ).result
ADS_scaled

In [None]:
copy_to_sql(ADS_scaled, table_name = 'delivery_date_dataset_final', if_exists='replace')
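
<p style = 'font-size:16px;font-family:Arial'>For intuition, standard scaling subtracts a column's mean and divides by its standard deviation. The sketch below is a local illustration only: the helper standard_scale and the sample fees are hypothetical, and it assumes the population standard deviation, which should be checked against the ScaleFit reference for the "STD" method.</p>

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Return (value - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical shipping fees
fees = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standard_scale(fees)
print(scaled)
```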

In [None]:
qry=  '''Create multiset table ohe_fit as ( 
        SELECT * FROM TD_OneHotEncodingFit( 
        ON (select record_number,b2c_c2c, package_size,cast(shipment_method_id as VARCHAR(5)) as shipment_method_id, 
        cast(category_id as VARCHAR(5)) as category_id, shipping_fee ,item_price, quantity ,weight, distance,delivery_days
        from delivery_date_dataset_final) AS INPUTTABLE 
        USING 
        TargetColumn('b2c_c2c','shipment_method_id','category_id','package_size') 
        OtherColumnName('other') 
        IsInputDense('true') 
        CategoryCounts(2,23,33,6) 
        Approach('Auto') 
        ) AS dt  
        )with data;'''

try:
    execute_sql(qry)
except:
    db_drop_table('ohe_fit')
    execute_sql(qry)
    

In [None]:
qry='''CREATE MULTISET TABLE ohe_dataset AS (
    SELECT * FROM TD_OneHotEncodingTransform ( 
    ON (select record_number,b2c_c2c, package_size,cast(shipment_method_id as VARCHAR(5)) as shipment_method_id, 
        cast(category_id as VARCHAR(5)) as category_id, shipping_fee ,item_price, quantity ,weight, distance,delivery_days 
        from delivery_date_dataset_final) AS InputTable 
    ON ohe_fit AS FitTable Dimension 
    USING 
    IsInputDense('True') 
    ) AS dt)
    WITH DATA;'''

try:
    execute_sql(qry)
except:
    db_drop_table('ohe_dataset')
    execute_sql(qry)
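
<p style = 'font-size:16px;font-family:Arial'>One-hot encoding turns each categorical column into a set of 0/1 indicator columns, one per category, plus a catch-all for unseen values. The following local sketch shows the idea; the helper names and the output column naming are illustrative assumptions and do not necessarily match the columns produced by TD_OneHotEncodingTransform.</p>

```python
def one_hot_fit(values):
    """Collect the sorted distinct categories of a column."""
    return sorted(set(values))

def one_hot_transform(values, categories, column):
    """Expand a column into one 0/1 indicator per category, plus '<column>_other'."""
    encoded = []
    for value in values:
        row = {f"{column}_{cat}": int(value == cat) for cat in categories}
        row[f"{column}_other"] = int(value not in categories)
        encoded.append(row)
    return encoded

# Hypothetical channel values
channels = ["B2C", "C2C", "B2C"]
categories = one_hot_fit(channels)
encoded = one_hot_transform(channels, categories, "b2c_c2c")
print(encoded[0])
```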

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>6. Creation of Train and Test data.</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TrainTestSplit() function simulates how a model would perform on new data. The function divides the dataset into train and test subsets to evaluate machine learning algorithms and validate processes. The first subset is used to train the model. The second subset is used to make predictions and compare the predictions to actual values.</p> 

In [None]:
qry= '''CREATE MULTISET TABLE delivery_date_ads AS (
    SELECT * FROM TD_TrainTestSplit( 
        ON ohe_dataset AS InputTable 
        USING 
        IDColumn('record_number') 
        trainSize(0.75) 
        testSize(0.25) 
        Seed(42)
    ) AS dt
) WITH DATA PRIMARY INDEX(record_number);'''

try:
    execute_sql(qry)
except:
    db_drop_table('delivery_date_ads')
    execute_sql(qry)

<p style = 'font-size:16px;font-family:Arial'>Creating delivery_date_train_dataset.</p>

In [None]:
qry = '''create multiset table delivery_date_train_dataset AS(
select * FROM delivery_date_ads
    WHERE TD_IsTrainRow in (1)
)WITH DATA PRIMARY INDEX(record_number);'''

try:
    execute_sql(qry)
except:
    db_drop_table('delivery_date_train_dataset')
    execute_sql(qry)

<p style = 'font-size:16px;font-family:Arial'>Creating delivery_date_test_dataset.</p>

In [None]:
qry = '''create multiset table delivery_date_test_dataset AS(
select * FROM delivery_date_ads
    WHERE TD_IsTrainRow in (0)
)WITH DATA PRIMARY INDEX(record_number);'''

try:
    execute_sql(qry)
except:
    db_drop_table('delivery_date_test_dataset')
    execute_sql(qry)
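
<p style = 'font-size:16px;font-family:Arial'>A seeded 75/25 split can be pictured with the following local sketch. The function train_test_split_ids and the record numbers are hypothetical, and the shuffle does not reproduce the rows that TD_TrainTestSplit selects for the same seed; it only illustrates that a fixed seed gives a repeatable, non-overlapping partition.</p>

```python
import random

def train_test_split_ids(ids, train_size=0.75, seed=42):
    """Shuffle ids reproducibly and split them into train and test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical record numbers 1..100
train_ids, test_ids = train_test_split_ids(range(1, 101))
print(len(train_ids), len(test_ids))
```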

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>7. Feature Selection using Elastic Net Regularization.</b></p>
<p style = 'font-size:16px;font-family:Arial'>Feature selection is a crucial step in building predictive models as it helps identify the most relevant and informative features from a potentially large set of variables. In this context, elastic net regularization is a powerful technique that can be employed to effectively filter out features and improve model performance.</p>

<p style = 'font-size:16px;font-family:Arial'>Elastic net regularization combines the L1 (Lasso) and L2 (Ridge) regularization techniques, offering a balanced approach to feature selection. It applies a penalty term to the model's objective function, encouraging sparsity in the coefficient estimates and promoting the selection of a subset of important features while shrinking the coefficients of less relevant or redundant features.</p>


<p style = 'font-size:16px;font-family:Arial'>For more information on <b>Regularization</b>: <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM/TD_GLM-Syntax-Elements'>[Link]</a></p>
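
<p style = 'font-size:16px;font-family:Arial'>To show how the two penalties are blended, here is a small sketch of an elastic net penalty term. It uses one common parameterization, in which alpha = 1 gives a pure L1 (Lasso) penalty and alpha = 0 a pure L2 (Ridge) penalty. The function elastic_net_penalty and the sample weights are hypothetical, and the exact scaling used inside TD_GLM may differ.</p>

```python
def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficient vector
weights = [0.5, -2.0, 0.0]
penalty = elastic_net_penalty(weights, lam=0.02, alpha=0.3)
print(penalty)
```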

In [None]:
# Alternative configuration kept for reference; only `qry` below is executed.
qry1 = '''CREATE MULTISET TABLE td_glm_cal_ex AS (
    SELECT * from TD_GLM (
        ON delivery_date_train_dataset
        USING
        InputColumns('[2:74]')
        ResponseColumn('delivery_days')
        Family('Gaussian')
        RegularizationLambda(0.00175)
        LearningRate('optimal')
        IterNumNoChange(100)
    ) as dt
) WITH DATA;'''

qry='''CREATE MULTISET TABLE td_glm_cal_ex AS (
    SELECT * from TD_GLM (
        ON delivery_date_train_dataset
        USING
        InputColumns('[2:74]')
        ResponseColumn('delivery_days')
        Family('Gaussian')
        BatchSize(500)
        MaxIterNum(300)
        RegularizationLambda(0.02)
        Alpha(0.3)
        IterNumNoChange(70)
        Tolerance(0.008)
        Intercept('true')
        LearningRate('adaptive')
        InitialEta(0.015)
        Momentum(0.8)
        LocalSGDIterations(10)
    ) as dt
) WITH DATA;'''

try:
    execute_sql(qry)
except:
    db_drop_table('td_glm_cal_ex')
    execute_sql(qry)

<p style = 'font-size:16px;font-family:Arial'>In the output of the TD_GLM function, the predictors have positive values in the attribute index column, and the estimate column holds the predictor weights. For feature selection we keep all predictors whose weight is non-zero, i.e. estimate ≠ 0.</p>
<p style = 'font-size:16px;font-family:Arial'>In the for loop we build a comma-separated list of all such columns that exist in the training table; it is then used to create tables containing only the selected predictors.</p>

In [None]:
features_output = pd.read_sql("""SELECT predictor FROM td_glm_cal_ex where attribute >= 0 and estimate not in (0)""", eng)
train_values = pd.read_sql("""SELECT top 1 * FROM delivery_date_train_dataset""", eng)
train_columns = train_values.columns.tolist()

# Keep only predictors that are real columns of the training table (this skips
# the intercept row) and join them so the list never ends with a dangling comma.
selected = []
for predictor in features_output['predictor']:
    if predictor in train_columns:
        selected.append('"{}"'.format(predictor))
Str = ",".join(selected)

<p style = 'font-size:16px;font-family:Arial'>We create the train and test datasets with only these features(columns) to be used in the model for predictions.</p>

In [None]:
qry= '''CREATE MULTISET TABLE selected_ddp_features_train AS (
    select "record_number",{0},"delivery_days" from delivery_date_train_dataset
) WITH DATA PRIMARY INDEX(record_number)'''.format(Str)

try:
    execute_sql(qry)
except:
    db_drop_table('selected_ddp_features_train')
    execute_sql(qry)

In [None]:
qry = '''
CREATE MULTISET TABLE selected_ddp_features_test AS (
    select "record_number",{0},"delivery_days" from delivery_date_test_dataset
) WITH DATA
PRIMARY INDEX(record_number)
'''.format(Str)

try:
    execute_sql(qry)
except:
    db_drop_table('selected_ddp_features_test')
    execute_sql(qry)

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>8. Generalized Linear Model (GLM) in Teradata </b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_GLM function fits a generalized linear model (GLM) for regression and classification on data sets whose response follows an exponential family distribution. It supports the following models:</p>
<li style = 'font-size:16px;font-family:Arial'>Regression (Gaussian family): The loss function is squared error.</li>
<li style = 'font-size:16px;font-family:Arial'>Binary Classification (Binomial family): The loss function is logistic and implements logistic regression. The only response values are 0 or 1.</li>
<p style = 'font-size:16px;font-family:Arial'>The function uses the Minibatch Stochastic Gradient Descent (SGD) algorithm. The algorithm estimates the gradient of the loss in minibatches, whose size is defined by the BatchSize argument, and updates the model with a learning rate set through the LearningRate argument.</p>
<p style = 'font-size:16px;font-family:Arial'>Here we use regression (Gaussian family).</p>
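
<p style = 'font-size:16px;font-family:Arial'>The minibatch SGD idea can be sketched in a few lines of plain Python. This is only an illustration of the algorithm for squared-error loss, without regularization, momentum or adaptive learning rates. The function minibatch_sgd, its parameters and the synthetic data are assumptions and do not reflect how TD_GLM is implemented.</p>

```python
import random

def minibatch_sgd(X, y, batch_size=20, learning_rate=0.1, epochs=100, seed=0):
    """Fit y ~ w.x + b by minimizing mean squared error with minibatch SGD."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = [0.0] * n_features, 0.0
            # Estimate the gradient of the loss on this minibatch only
            for i in batch:
                error = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(n_features):
                    grad_w[j] += 2.0 * error * X[i][j] / len(batch)
                grad_b += 2.0 * error / len(batch)
            # Step against the gradient, scaled by the learning rate
            w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
            b -= learning_rate * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2*x1 - 3*x2 + 1
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [2.0 * x1 - 3.0 * x2 + 1.0 for x1, x2 in X]
w, b = minibatch_sgd(X, y)
print([round(v, 3) for v in w], round(b, 3))
```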

In [None]:
qry1= '''CREATE MULTISET TABLE td_glm_cal_ex AS (
    SELECT * from TD_GLM (
        ON selected_ddp_features_train
        USING
        InputColumns('[1:51]')
        ResponseColumn('delivery_days')
        Family('Gaussian')
        BatchSize(800)
        MaxIterNum(300)
        RegularizationLambda(0.0175)
        Alpha(0.2)
        IterNumNoChange(200)
        Tolerance(0.001)
        Intercept('true')
        LearningRate('adaptive')
        InitialEta(0.02)
        Momentum(0.85)
        LocalSGDIterations(20)
    ) as dt
) WITH DATA;'''

try:
    execute_sql(qry1)
except:
    db_drop_table('td_glm_cal_ex')
    execute_sql(qry1)

<p style = 'font-size:18px;font-family:Arial'><b>TD_GLMPredict </b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_GLMPredict function predicts target values (regression) and class labels (classification) for test data using a GLM model of the TD_GLM function.</p>

In [None]:
qry = '''CREATE MULTISET TABLE glm_predict_cal_ex AS (
    SELECT * from TD_GLMPredict (
      ON (SELECT * FROM selected_ddp_features_test) AS INPUTTABLE
      ON td_glm_cal_ex AS ModelTable DIMENSION
      USING
      IDColumn ('record_number')
      Accumulate('delivery_days')
    ) AS dt
) WITH DATA;'''

try:
    execute_sql(qry)
except:
    db_drop_table('glm_predict_cal_ex')
    execute_sql(qry)

In [None]:
df= DataFrame.from_query("""SELECT * FROM glm_predict_cal_ex;""")
df

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compare actual and predicted delivery days for the first 100 fetched test records
df_plot=df.to_pandas(all_rows=True).reset_index().head(100)
plt.figure(figsize=(20,8))
sns.lineplot(data= df_plot ,x="record_number",y="delivery_days",ci=None)
sns.lineplot(data= df_plot ,x="record_number",y="prediction",ci=None)
plt.grid()
# plt.xticks(np.arange(1,60, step=1))
plt.legend(['Actual Value', 'Predicted Value'], loc='best', fontsize=16)
plt.title('Comparison of Actual vs Predicted Delivery Days', fontsize=20)
plt.xlabel('Record Number', fontsize=16)
plt.ylabel('Delivery Days', fontsize=16)
plt.show()

<p style = 'font-size:18px;font-family:Arial'><b>TD_RegressionEvaluator</b></p>

<p style = 'font-size:16px;font-family:Arial'>The TD_RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.</p>

<p style = 'font-size:16px;font-family:Arial'>For more information on <b>RegressionEvaluator</b>: <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator'> [Link] </a></p>

In [None]:
DataFrame.from_query('''SELECT * FROM TD_RegressionEvaluator(
ON glm_predict_cal_ex as InputTable
USING
ObservationColumn('delivery_days')
PredictionColumn('prediction')
Metrics('RMSE','R2','FSTAT')
DegreesOfFreedom(5,48)
NUMOFINDEPENDENTVARIABLES(5)
) as dt;''')

<p style = 'font-size:16px;font-family:Arial'>The output of the regression evaluator contains the RMSE, R2 and F-statistic metrics requested in the Metrics argument.</p>

<p style = 'font-size:16px;font-family:Arial'>Root mean squared error (RMSE) is the most common metric for evaluating linear regression model performance. It measures how far the model’s predictions are from the actual observed values, so a high RMSE is “bad” and a low RMSE is “good”.</p>

<p style = 'font-size:16px;font-family:Arial'>The coefficient of determination, more commonly known as R², measures how much of the variation in the response is explained by the predictors in the model. For a least-squares linear fit with an intercept, evaluated on its training data, it equals the square of the correlation coefficient R and lies in the range 0.0–1.0. On held-out test data it can fall below 0 when the model predicts worse than the mean of the observed values. Higher values of R² are better.</p>

<p style = 'font-size:16px;font-family:Arial'>The metrics specified in the Metrics syntax element are displayed. For FSTAT, the following columns are displayed:</p>
<li style = 'font-size:16px;font-family:Arial'>F_score</li>
<li style = 'font-size:16px;font-family:Arial'>F_Critcialvalue</li>
<li style = 'font-size:16px;font-family:Arial'>p_value</li>
<li style = 'font-size:16px;font-family:Arial'>F_Conclusion</li>
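
<p style = 'font-size:16px;font-family:Arial'>To make the RMSE and R² definitions concrete, here is a minimal sketch in plain Python. The helper names <code>rmse</code> and <code>r_squared</code> and the sample values are hypothetical and are not taken from the demo tables; the sketch uses the standard textbook formulas and is not meant to reproduce the internals of TD_RegressionEvaluator.</p>

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical delivery times in days (illustration only)
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 6.5]

print("RMSE:", rmse(actual, predicted))
print("R2:  ", r_squared(actual, predicted))
```

<p style = 'font-size:16px;font-family:Arial'>Because R² is computed here as 1 - SS_res / SS_tot, predictions that are worse than always guessing the mean produce a negative value.</p>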



<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>9. Decision Forest </b></p>

<p style = 'font-size:16px;font-family:Arial'>Decision Forest is an ensemble method for both classification and regression problems. It builds on bagging, in which many decision trees are trained and their outputs are combined. Building a single decision tree normally means evaluating every input feature to choose each split point. A decision forest instead considers only a random subset of features at each split, which forces the trees in the forest to differ from one another and typically improves prediction accuracy. The TD_DecisionForest function uses a training dataset to build the model, and the TD_DecisionForestPredict function then uses that model to make predictions. Regression, binary classification, and multi-class classification are supported.</p>
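
<p style = 'font-size:16px;font-family:Arial'>The idea of considering only a random subset of features at each split can be sketched in a few lines of plain Python. This is an illustrative sketch of the general random-forest technique for a regression target, not the TD_DecisionForest implementation; the function names, the <code>mtry</code> parameter as used here, and the toy data are all assumptions made for the example.</p>

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets, mtry, rng):
    """Choose the best (feature, threshold) among `mtry` randomly drawn features."""
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), mtry)  # random feature subset
    best = None  # (score, feature index, threshold)
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue  # a split must leave rows on both sides
            score = sse(left) + sse(right)  # lower means purer child nodes
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return candidates, best

# Toy data: only feature 0 is related to the target (illustration only)
rows = [
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 10.0, 0.0],
    [4.0, 20.0, 1.0],
]
targets = [2.0, 2.0, 6.0, 6.0]

candidates, best = best_split(rows, targets, mtry=2, rng=random.Random(4))
print("features considered:", candidates)
print("chosen (score, feature, threshold):", best)
```

<p style = 'font-size:16px;font-family:Arial'>When the informative feature is not in the sampled subset, the split has to use a weaker feature. Across many trees, this is what makes the trees differ from one another.</p>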


In [None]:
query = '''Create table DF_train as (
SELECT * FROM TD_DecisionForest (
ON selected_ddp_features_train AS INPUTTABLE partition by ANY
USING
InputColumns('[1:51]')
ResponseColumn('delivery_days')
MaxDepth(32)
MinNodeSize(1)
NumTrees(4)
ModelType('REGRESSION')
Seed(4)
Mtry(-1)
MtrySeed(4)
) AS dt
) with data;
'''
# Create the model table; if it already exists, drop it and create it again
try:
    execute_sql(query)
except Exception:
    db_drop_table('DF_train')
    execute_sql(query)

<p style = 'font-size:16px;font-family:Arial'><b>TD_DecisionForestPredict</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_DecisionForestPredict function applies the model produced by TD_DecisionForest to the input data and makes predictions. For a classification model, it can output the probability that each observation belongs to the predicted class; for the regression model built here, the prediction column holds the predicted number of delivery days. Processing time is driven by the number of trees in the model. When there are more trees than fit in memory, the trees are cached in local spool space.</p>
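
<p style = 'font-size:16px;font-family:Arial'>For intuition, in the usual random-forest formulation a regression prediction is the average of the individual tree predictions. The sketch below assumes that aggregation rule and uses four hand-written stand-in "trees"; the feature names <code>distance_km</code> and <code>handling_days</code> are hypothetical, and nothing here is taken from the TD_DecisionForestPredict implementation.</p>

```python
def forest_predict(trees, row):
    """Average the per-tree predictions for one input row (regression)."""
    predictions = [tree(row) for tree in trees]
    return sum(predictions) / len(predictions)

# Four stand-in "trees", each a simple rule on a feature dictionary
trees = [
    lambda r: 3.0 if r["distance_km"] < 500 else 6.0,
    lambda r: 2.0 + r["handling_days"],
    lambda r: 4.0 if r["handling_days"] <= 1 else 5.0,
    lambda r: 5.0,
]

row = {"distance_km": 800, "handling_days": 2}
print("predicted delivery days:", forest_predict(trees, row))
```

<p style = 'font-size:16px;font-family:Arial'>Every tree must be evaluated for every row, which is why prediction cost grows with the number of trees.</p>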


In [None]:
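# Note: this scores the training table (selected_ddp_features_train), so the
# metrics computed from DF_Predict describe the fit to the training data
# rather than performance on held-out data.
query = '''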
Create table DF_Predict as (
SELECT * FROM TD_DecisionForestPredict (
ON selected_ddp_features_train AS InputTable PARTITION BY ANY
ON DF_Train AS ModelTable DIMENSION
USING
   IDColumn ('record_number')
   Accumulate('delivery_days')
) AS dt) with data;'''

try:
    execute_sql(query)
except Exception:
    db_drop_table('DF_Predict')
    execute_sql(query)

In [None]:
df_result = DataFrame('DF_Predict')
df_result

In [None]:
import matplotlib.pyplot as plt
# import matplotlib.patches as patches
import seaborn as sns

# Pull the predictions into pandas and keep the first 300 rows for plotting
df_plot = df_result.to_pandas(all_rows=True).reset_index().head(300)
plt.figure(figsize=(20,8))
sns.lineplot(data=df_plot, x="record_number", y="delivery_days", ci=None)
sns.lineplot(data=df_plot, x="record_number", y="prediction", ci=None)
plt.grid()
# plt.xticks(np.arange(1,60, step=1))
plt.legend(['Actual Value', 'Predicted Value'], loc='best', fontsize=16)
plt.title('Comparison of Actual vs Predicted Delivery Days', fontsize=20)
plt.xlabel('Record Number', fontsize=16)
plt.ylabel('Delivery Days', fontsize=16)
plt.show()

<p style = 'font-size:18px;font-family:Arial'><b>TD_RegressionEvaluator</b></p>

<p style = 'font-size:16px;font-family:Arial'>The TD_RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.</p>

<p style = 'font-size:16px;font-family:Arial'>For more information on <b>TD_RegressionEvaluator</b>: <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator'>[Link]</a></p>

In [None]:
query = '''
SELECT * FROM TD_RegressionEvaluator(
ON DF_Predict as InputTable
USING
ObservationColumn('delivery_days')
PredictionColumn('prediction')
Metrics('RMSE','R2','FSTAT')
DegreesOfFreedom(5,48)
NUMOFINDEPENDENTVARIABLES(5)
) as dt;
'''

DF_eval=DataFrame.from_query(query)
DF_eval

<p style = 'font-size:16px;font-family:Arial'>The output contains the RMSE, R2, and F-statistic metrics requested in the Metrics argument.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Root mean squared error (RMSE)</b> is one of the most common metrics for evaluating regression models. It measures how far the model’s predictions are from the observed values, in the same units as the target (here, days). A lower RMSE indicates a better fit.</p>

<p style = 'font-size:16px;font-family:Arial'>The <b>coefficient of determination (R²)</b> measures the proportion of the variance in the response that the model explains. For an ordinary least-squares fit with an intercept, it equals the square of the correlation coefficient R and lies between 0.0 and 1.0; for other models, or on held-out data, it can fall below 0. Higher values of R² are better.</p>

<p style = 'font-size:16px;font-family:Arial'>The metrics specified in the Metrics syntax element are displayed. For FSTAT, the following columns are displayed:</p>
<li style = 'font-size:16px;font-family:Arial'>F_score</li>
<li style = 'font-size:16px;font-family:Arial'>F_Critcialvalue</li>
<li style = 'font-size:16px;font-family:Arial'>p_value</li>
<li style = 'font-size:16px;font-family:Arial'>F_Conclusion</li>
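
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of where an F score and a pair of degrees of freedom such as (5, 48) can come from, the sketch below applies the textbook overall F-test for a regression with k predictors and n observations. This describes the general technique only; it is an assumption that it matches how TD_RegressionEvaluator computes FSTAT. The function name <code>f_statistic</code>, the R² value, and the observation count are hypothetical.</p>

```python
def f_statistic(r2, num_predictors, num_observations):
    """Overall F-test: F = (R2 / k) / ((1 - R2) / (n - k - 1))."""
    df_model = num_predictors
    df_resid = num_observations - num_predictors - 1
    f_score = (r2 / df_model) / ((1.0 - r2) / df_resid)
    return f_score, df_model, df_resid

# Hypothetical inputs: 5 predictors and 54 observations give
# degrees of freedom (5, 48) under this convention.
f_score, df_model, df_resid = f_statistic(0.5, 5, 54)
print("F score:", f_score)
print("degrees of freedom:", (df_model, df_resid))
```

<p style = 'font-size:16px;font-family:Arial'>The F score is then compared with a critical value of the F distribution for those degrees of freedom. A score above the critical value (equivalently, a small p-value) suggests that the predictors jointly explain more variance than would be expected by chance.</p>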



<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>10. Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have walked through an end-to-end exploration of shipping time prediction using ClearScape Analytics on Teradata Vantage. We preprocessed the data, built two models (GLM and Decision Forest) with the InDB analytic functions, and compared their performance. Because the data used here is sample data, the results may not reflect real-world accuracy. Thanks to the in-database capabilities of Teradata Vantage with ClearScape Analytics, we were able to run this exploration with the smallest notebook instance.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>11. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>

In [None]:
tables = ['DF_Predict','DF_train','glm_predict_cal_ex','td_glm_cal_ex',
          'selected_ddp_features_train','selected_ddp_features_test','delivery_date_test_dataset', 
          'delivery_date_train_dataset','delivery_date_ads','ohe_dataset','ohe_fit','delivery_date_dataset_final',
          'delivery_date_complete_dataset']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore tables that do not exist or were already dropped
        pass
      


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ShippingTimePrediction');" 
#Takes 45 seconds

In [None]:
remove_context()

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>