<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Shipping Time Prediction Using Vantage InDB Analytic Functions
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>eBay, as an online marketplace, faces the challenge of accurately estimating delivery dates for shipments from various sellers. The current estimation process, based on seller handling time and carrier transit time, often leads to inconsistent and inaccurate predictions. This results in customer dissatisfaction and a potential erosion of trust in the platform. Customer satisfaction starts with the experience, and every customer experience carries the risk of unknown or unexpected issues. There is therefore a need for a robust system that can reliably estimate delivery dates, accounting for handling time, transit time, and other relevant variables affecting the actual delivery timeframe. Teradata Vantage and ClearScape Analytics provide the features needed to examine historical data and to apply predictive modeling techniques and machine learning algorithms to improve accuracy. A successful implementation of an improved delivery time estimation system will enhance customer satisfaction, increase buyer trust, boost sales, and improve seller engagement on the platform.</p>
    
   
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Value</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Understand shipment process and what factors lead to inaccurate predictions.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Increase customer satisfaction.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Ensure timeliness and accurate scheduling.</li></p>
    <p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To build more effective ML and AI models, developers and data scientists need to look outside the box for data, tools, and techniques that can continuously enhance the accuracy, speed, and efficacy of their models. Unfortunately, most of the time, this creativity comes at a cost. Plus, combining different types of analytics and data into the development pipeline usually adds complexity, fragility, and difficulties with operationalizing the process.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Luckily, Teradata Vantage provides ClearScape Analytics functions which allow users to seamlessly combine a wide range of behavioral, text processing, statistical analysis, and advanced analytic functions with model training and deployment tools on the same platform. This allows for rapid development, testing, and validation of new techniques at scale in near-real time so new, more accurate models can easily be deployed to production.</p>

  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To address the problem of estimating delivery dates for eBay packages, we propose leveraging Teradata's in-database capabilities. By using Teradata's data cleaning and machine learning functionalities, we can develop a robust model to predict delivery dates. This involves collecting relevant data, cleaning it for accuracy, performing feature engineering, developing a predictive model, and validating and optimizing it. The implementation of this solution can lead to improved customer satisfaction, increased trust, higher sales, and enhanced seller engagement.</p>
   



<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import getpass
import time
import pandas as pd
import teradataml as tdml
from teradataml import *

import sqlalchemy
from sqlalchemy import event
import csv
# from teradataml.dataframe.data_transfer import read_csv
from teradatasqlalchemy.types import *
import random
from PIL import Image

from teradataml import *
configure.val_install_location = "val"

import plotly.express as px
import io
import warnings
warnings.filterwarnings('ignore')
display.max_rows=5

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Shipping_Time_Prediction_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_ShippingTimePrediction_local');"
 # Takes about 40 secs

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.
</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset is a shipping dataset containing 110,000 rows. A more detailed description of the features is provided at the end of the notebook.

</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage. We then begin our analysis by checking the shape of the DataFrame and examining the data types of all its columns.</p>


In [None]:
raw_df=DataFrame(in_schema('DEMO_ShipTimePred', 'Delivery_date_data'))
raw_df

In [None]:
raw_df.shape

In [None]:
conversions = raw_df.select(['record_number','b2c_c2c']).groupby('b2c_c2c').count()
conversions=conversions.to_pandas(all_rows=True).reset_index()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The aggregated data is available to us in a teradataml DataFrame. Let's visualize it to better understand the distribution of shipments by channel. Vantage's ClearScape Analytics integrates easily with third-party visualization tools such as Tableau and Power BI, and with Python modules such as plotly and seaborn. We can do all the calculations and pre-processing in Vantage and pass only the necessary information to the visualization tool; this makes the calculation faster and reduces the time spent moving data between tools. We transfer data in this way for this and the subsequent visualizations wherever necessary.</p>

In [None]:
#Plotting Distribution by channel

fig = px.bar(data_frame=conversions, x='b2c_c2c', y='count_record_number', color='b2c_c2c')

fig.update_layout(title='Channel Distribution',
                   xaxis_title='Channel',
                   yaxis_title='Distributions')
fig.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above chart shows the distribution of shipments by channel: B2C (Business to Customer) and C2C (Customer to Customer).

</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can also analyze the shipments by shipment method. Because this is sample data for the purpose of the demo, the shipment methods are not named; each one is identified only by a numeric shipment method ID.
</p>



In [None]:
#Plotting Distribution by Shipment Method
shipments=raw_df.select(['record_number','shipment_method_id']).groupby('shipment_method_id').count()
figure = Figure(width=800, height=400, heading="Shipment Method Distribution")

plot = shipments.plot(
    x=shipments.shipment_method_id,
    y=shipments.count_record_number,
    kind='bar',
    xlabel='Shipment Method',
    ylabel='Distributions',
    color='blue',
    figure=figure,
    grid_linestyle='-',
    grid_linewidth=0.5
)

plot.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above chart shows the distribution of shipments by shipment method. As seen in the chart, most of the shipments use shipment method ID 0.

</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can also analyze the shipments by category. As with the shipment methods, the categories are not named; each one is identified only by a numeric category ID.
</p>



In [None]:
#Plotting Distribution by category
categories=raw_df.select(['record_number','category_id']).groupby('category_id').count()
figure = Figure(width=800, height=400, heading="Category Distribution")

plot = categories.plot(
    x=categories.category_id,
    y=categories.count_record_number,
    kind='bar',
    xlabel='Category',
    ylabel='Distributions',
    color='blue',
    figure=figure,
    grid_linestyle='-',
    grid_linewidth=0.5
)

plot.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above chart shows the distribution of shipments by category. Most of the shipments fall in categories with category IDs between 0 and 5.

</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next, we look at the maximum shipping fee for each package size.
</p>



In [None]:
import plotly.express as px
# ShipFeesdf=df[['shipping_fee','package_size']].groupby('package_size').agg(["min", "max"])
ShipFeesdf=raw_df.select(['shipping_fee','package_size']).groupby('package_size').max()
ShipFeesdf_plot=ShipFeesdf.to_pandas()

In [None]:
fig = px.bar(ShipFeesdf_plot, x='package_size', y='max_shipping_fee')

fig.update_layout(title='Shipping Fees Variation',
                   xaxis_title='Package Size',
                   yaxis_title='Shipping Fees')
fig.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above chart shows the maximum shipping fee for each package size. As seen, the fee is highest for the largest package size.</p>


<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Data Preprocessing and Cleaning</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We create a smaller dataset to preprocess and clean, and use it for the predictions.</p>

In [None]:
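# Days between carrier acceptance and delivery
raw_df = raw_df.assign(days=raw_df.delivery_date - raw_df.acceptance_scan_timestamp.cast(type_=DATE))
# Keep shipments delivered 3 to 6 days after acceptance
raw_df = raw_df[raw_df.days >= 3]
raw_df = raw_df[raw_df.days <= 6]
# Number the rows in record_number order and keep the first 5,000
window = raw_df.days.window(order_columns='record_number')
raw_df = raw_df.assign(rn=window.row_number())
raw_df = raw_df[raw_df.rn <= 5000]
raw_df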

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Data Preprocessing:</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>New column 'distance' is added to the table which will store distance between the item location and the buyer location.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The geospatial function <b>ST_SPHERICALDISTANCE</b> in Vantage is used to calculate the distance from the latitude and longitude columns of the item and the buyer. The function returns meters, so the result is divided by 1,000 to store the distance in kilometers.</p>

In [None]:
raw_df.shape

In [None]:
copy_to_sql(raw_df,table_name = 'Delivery_Date_Data_new', schema_name = 'DEMO_ShipTimePred_db', if_exists= 'replace' )

In [None]:
qry='''ALTER TABLE DEMO_ShipTimePred_db.Delivery_Date_Data_new
ADD distance FLOAT;'''

execute_sql(qry)

In [None]:
qry='''UPDATE DEMO_ShipTimePred_db.Delivery_Date_Data_new
SET distance = NEW ST_Geometry('ST_Point', item_long, item_lat).ST_SPHERICALDISTANCE(NEW ST_Geometry('ST_Point', 
buyer_long, buyer_lat))/1000;'''

execute_sql(qry)
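
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the distance calculation concrete, the following is a minimal client-side sketch of a great-circle (haversine) distance in kilometers. It is not the Vantage implementation: the function name, the Earth radius constant and the sample coordinates are assumptions chosen for illustration, so results may differ slightly from ST_SPHERICALDISTANCE.</p>

```python
import math

# Assumed mean Earth radius in km; Vantage may use a different value internally.
EARTH_RADIUS_KM = 6371.0088


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical item and buyer locations, one degree of longitude apart on the equator.
one_degree_km = spherical_distance_km(0.0, 0.0, 0.0, 1.0)
print(round(one_degree_km, 1))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Running the calculation in the database, as the UPDATE statement above does, avoids moving the coordinate columns to the client.</p>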

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Checking and handling missing values:</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We create a new table with this available data so that we maintain the copy of the original data.</p>

In [None]:
qry='''CREATE multiset TABLE delivery_date_complete_dataset AS (
        SELECT *
        FROM DEMO_ShipTimePred_db.Delivery_Date_Data_new
    ) WITH DATA PRIMARY INDEX (record_number);'''

try:
    execute_sql(qry)
except Exception:
    # Most likely the table is left over from an earlier run: drop it and recreate it
    db_drop_table('delivery_date_complete_dataset')
    execute_sql(qry)
    

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Get Rows With Missing Values</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_GetRowsWithMissingValues, applied to the table delivery_date_complete_dataset, selects the rows in which at least one of the target columns (column range 0:23 in the call below) has a missing value.</p>

In [None]:
complete_dataset_df =  DataFrame('delivery_date_complete_dataset')
obj = GetRowsWithMissingValues(data=complete_dataset_df,
                                   target_columns='0:23')

In [None]:
complete_dataset_df.shape

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Replace Missing Values</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We create a reusable function to replace missing values; it is used below to fill missing values in declared_handling_days, weight, carrier_min_estimate and carrier_max_estimate. The logic is as follows:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>It calculates the average value (AvgVal) of a specified column (avgColumn) for each value of another column (groupCol) in the delivery_date_complete_dataset table. Only rows where avgColumn is not missing are considered, and the result is stored in the AverageData table.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>It updates the delivery_date_complete_dataset table by filling each missing avgColumn value with the average from AverageData for the matching groupCol, or with the average of the group averages if no match is found.</li>
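
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The same logic can be sketched in plain Python on a handful of rows. This is only an illustration: the function name and the sample rows are hypothetical, and None stands in for a missing value. The notebook itself performs these steps in SQL inside Vantage.</p>

```python
from collections import defaultdict
from statistics import mean


def impute_by_group(rows, value_key, group_key):
    """Return copies of rows with missing value_key filled from group averages."""
    values_by_group = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            values_by_group[row[group_key]].append(row[value_key])

    # Per-group averages, analogous to the AverageData table
    group_avg = {g: mean(vals) for g, vals in values_by_group.items()}
    # Fallback when a group has no known values: the average of the group averages
    fallback = mean(list(group_avg.values()))

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[value_key] is None:
            new_row[value_key] = group_avg.get(new_row[group_key], fallback)
        filled.append(new_row)
    return filled


# Hypothetical sample: seller "c" has no known handling days at all.
sample_rows = [
    {"seller_id": "a", "declared_handling_days": 1.0},
    {"seller_id": "a", "declared_handling_days": 3.0},
    {"seller_id": "a", "declared_handling_days": None},
    {"seller_id": "b", "declared_handling_days": 4.0},
    {"seller_id": "c", "declared_handling_days": None},
]
filled_rows = impute_by_group(sample_rows, "declared_handling_days", "seller_id")
print([r["declared_handling_days"] for r in filled_rows])
```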


In [None]:
def temp_col(col):
    execute_sql("""
    ALTER TABLE delivery_date_complete_dataset
    ADD "{0}_varchar" VARCHAR(50);""".format(col))
    
    execute_sql("""
    UPDATE delivery_date_complete_dataset
    SET "{0}_varchar" = CAST({0} AS VARCHAR(50));""".format(col))

In [None]:
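def handleMissingData(avgColumn, groupCol):
    """Fill missing values in avgColumn using the average for the row's groupCol."""
    # Add a VARCHAR copy of the column so missing values can be matched by their text form
    temp_col(avgColumn)
    
    # Drop the volatile table left over from a previous call, if any
    try:
        execute_sql("""DROP TABLE AVERAGEDATA""")
        print("DROPPING TABLE AVERAGEDATA")
    except Exception:
        print("[Teradata Database] [Info] Object 'AVERAGEDATA' does not exist.")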
        
    execute_sql("""
        CREATE VOLATILE TABLE AverageData AS (
            SELECT DISTINCT AVG("{0}") as AvgVal, "{1}" as "{1}"
            FROM delivery_date_complete_dataset
            WHERE "{0}_varchar" <> '**********************'
            GROUP BY "{1}"
        )
        WITH DATA
        ON COMMIT PRESERVE ROWS;
    """.format(avgColumn, groupCol))
    
    execute_sql("""
        UPDATE delivery_date_complete_dataset AS E
    SET "{0}" = 
        CASE
            WHEN E."{0}_varchar" = '**********************'
                THEN COALESCE(
                    (SELECT AvgVal FROM AverageData AS D WHERE E."{1}" = D."{1}"),
                    (SELECT AVG(AvgVal) FROM AverageData)
                )
            ELSE "{0}"
        END;
    """.format(avgColumn, groupCol))
    


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function above is now applied to fill the missing values in the columns 'declared_handling_days', 'weight', 'carrier_min_estimate' and 'carrier_max_estimate'.</p>

In [None]:
handleMissingData("declared_handling_days", "seller_id")

In [None]:
handleMissingData("weight", "category_id") 

In [None]:
handleMissingData("carrier_min_estimate", "shipment_method_id") 

In [None]:
handleMissingData("carrier_max_estimate", "shipment_method_id") # Handle missing carrier_max_estimate

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The code below fills in missing package_size values. For each row whose package size is 'NONE', it picks the package size whose average weight is closest to the weight of that row.</p>

In [None]:
complete_dataset_df =  DataFrame('delivery_date_complete_dataset')
pkg_averagedata = complete_dataset_df[complete_dataset_df.package_size != 'NONE']
pkg_averagedata=pkg_averagedata.select(['package_size','weight']).groupby('package_size').mean()
pkg_averagedata

In [None]:
temp_df= complete_dataset_df[complete_dataset_df.package_size == 'NONE']
temp_df = temp_df.merge(right=pkg_averagedata, how='cross', on = '1', lsuffix='t1', rsuffix = 't2')
temp_df = temp_df.assign(difference=(temp_df.weight - temp_df.mean_weight).abs())
window = temp_df.difference.window(partition_columns='record_number',
                       order_columns='difference')

temp_df = temp_df.assign(rn=window.row_number())
temp_df = temp_df[temp_df.rn == 1]
temp_df
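
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The cross join and row_number() steps above amount to a nearest-average lookup. A minimal plain-Python sketch of the same idea follows; the function names, package size labels and weights are hypothetical and only illustrate the matching rule.</p>

```python
from statistics import mean


def mean_weight_by_size(rows):
    """Average weight per known package size (rows marked 'NONE' are skipped)."""
    by_size = {}
    for row in rows:
        if row["package_size"] != "NONE":
            by_size.setdefault(row["package_size"], []).append(row["weight"])
    return {size: mean(weights) for size, weights in by_size.items()}


def fill_package_size(rows):
    """Assign each 'NONE' row the size whose mean weight is closest to its weight."""
    averages = mean_weight_by_size(rows)
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row["package_size"] == "NONE":
            weight = new_row["weight"]
            new_row["package_size"] = min(averages, key=lambda s: abs(weight - averages[s]))
        filled.append(new_row)
    return filled


# Hypothetical sample rows
sample = [
    {"record_number": 1, "package_size": "SMALL", "weight": 1.0},
    {"record_number": 2, "package_size": "SMALL", "weight": 3.0},
    {"record_number": 3, "package_size": "LARGE", "weight": 10.0},
    {"record_number": 4, "package_size": "LARGE", "weight": 14.0},
    {"record_number": 5, "package_size": "NONE", "weight": 9.0},
    {"record_number": 6, "package_size": "NONE", "weight": 3.0},
]
filled_sample = fill_package_size(sample)
print([r["package_size"] for r in filled_sample])
```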

In [None]:
complete_dataset_df = complete_dataset_df.merge(right=temp_df, how='left',on=["record_number=record_number"]
                                                ,lsuffix='t3',rsuffix='t4')

In [None]:
complete_dataset_df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We round the values for handling days and distance, and calculate the handling, shipping and delivery days.</p>

In [None]:
complete_dataset_df = complete_dataset_df.assign(drop_columns=True,
                                                 b2c_c2c = complete_dataset_df.b2c_c2c_t3,
                                                 seller_id=complete_dataset_df.seller_id_t3,
                                                 declared_handling_days= complete_dataset_df.declared_handling_days_t3.round(0),
                                                 acceptance_scan_timestamp=complete_dataset_df.acceptance_scan_timestamp_t3,
                                                 shipment_method_id=complete_dataset_df.shipment_method_id_t3,
                                                 shipping_fee=complete_dataset_df.shipping_fee_t3,
                                                 carrier_min_estimate=complete_dataset_df.carrier_min_estimate_t3,
                                                 carrier_max_estimate=complete_dataset_df.carrier_max_estimate_t3,
                                                 item_zip=complete_dataset_df.item_zip_t3,
                                                 buyer_zip=complete_dataset_df.buyer_zip_t3,
                                                 category_id=complete_dataset_df.category_id_t3,
                                                 item_price=complete_dataset_df.item_price_t3,
                                                 quantity=complete_dataset_df.quantity_t3,
                                                 payment_datetime=complete_dataset_df.payment_datetime_t3,
                                                 delivery_date=complete_dataset_df.delivery_date_t3,
                                                 weight=complete_dataset_df.weight_t3,
                                                 weight_units=complete_dataset_df.weight_units_t3,
                                                 package_size=complete_dataset_df.package_size_t2,
                                                 record_number=complete_dataset_df.record_number_t3,
                                                 item_lat=complete_dataset_df.item_lat_t3,
                                                 item_long=complete_dataset_df.item_long_t3,
                                                 buyer_lat=complete_dataset_df.buyer_lat_t3,
                                                 buyer_long=complete_dataset_df.buyer_long_t3,
                                                 distance=complete_dataset_df.distance_t3.round(0),
                                                 mean_weight=complete_dataset_df.mean_weight,
                                                 difference=complete_dataset_df.difference,
                                                 rn=complete_dataset_df.rn_t4
                                                  )
complete_dataset_df

In [None]:
complete_dataset_df = complete_dataset_df.assign(
                                    handling_days = complete_dataset_df.acceptance_scan_timestamp.cast(type_=DATE)
                                            - complete_dataset_df.payment_datetime.cast(type_=DATE),
                                    shipping_days = complete_dataset_df.delivery_date.cast(type_=DATE)
                                            - complete_dataset_df.acceptance_scan_timestamp.cast(type_=DATE),
                                    delivery_days = complete_dataset_df.delivery_date.cast(type_=DATE)
                                            - complete_dataset_df.payment_datetime.cast(type_=DATE))

complete_dataset_df
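
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a small illustration of the three day counts, the sketch below repeats the arithmetic with Python's datetime module. The function name and the sample timestamps are hypothetical. As in the cell above, timestamps are reduced to dates first, so the results are calendar-day differences.</p>

```python
from datetime import date, datetime


def day_counts(payment_datetime, acceptance_scan_timestamp, delivery_date):
    """Handling, shipping and delivery days as calendar-day differences."""
    paid = payment_datetime.date()                # drop the time of day, like CAST(... AS DATE)
    accepted = acceptance_scan_timestamp.date()
    return {
        "handling_days": (accepted - paid).days,         # payment to carrier acceptance
        "shipping_days": (delivery_date - accepted).days,  # carrier acceptance to delivery
        "delivery_days": (delivery_date - paid).days,      # payment to delivery
    }


# Hypothetical shipment
example = day_counts(
    payment_datetime=datetime(2019, 3, 1, 22, 15),
    acceptance_scan_timestamp=datetime(2019, 3, 3, 8, 0),
    delivery_date=date(2019, 3, 6),
)
print(example)
```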

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For our analysis, we remove rows where distance, weight or item price is zero and keep rows where the delivery days are between 3 and 6.</p>

In [None]:
complete_dataset_df=complete_dataset_df[complete_dataset_df.distance != 0.0] 
complete_dataset_df=complete_dataset_df[complete_dataset_df.weight != 0]
complete_dataset_df=complete_dataset_df[complete_dataset_df.item_price != 0.0]

In [None]:
complete_dataset_df=complete_dataset_df[complete_dataset_df.delivery_days >= 3] 
complete_dataset_df=complete_dataset_df[complete_dataset_df.delivery_days <= 6 ]

In [None]:
complete_dataset_df

In [None]:
# df_final=df_.select(["b2c_c2c","shipping_fee","item_price","quantity", "weight","package_size","record_number"
#                     ,"distance","shipment_method_id","category_id",])
df_final=complete_dataset_df.drop(["seller_id","declared_handling_days", "carrier_min_estimate", "carrier_max_estimate",
                "item_zip","buyer_zip", "weight_units", "item_lat","item_long","buyer_lat","buyer_long",
                   "payment_datetime", "acceptance_scan_timestamp", "delivery_date"], axis=1)
df_final

In [None]:
df_final.shape

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Creation of final analytic dataset </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Different columns in our dataset have different units and ranges. If we feed these features to the model as they are, a feature with larger values may influence the result more than the others, even though that does not necessarily make it a more important predictor. To put all the features on a comparable footing, we apply feature scaling.</p>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we apply standard scaling using the ScaleFit and ScaleTransform functions in Vantage. The ScaleFit() function outputs the statistics that the ScaleTransform() function uses to scale the specified input DataFrame columns.</p>

In [None]:
from teradataml import ScaleFit , ScaleTransform
scaler = ScaleFit(
                    data=df_final,
                    target_columns=["shipping_fee","item_price", "quantity", "weight", "distance"],
                    scale_method="STD",
                    global_scale=False)

In [None]:
ADS_scaled = ScaleTransform(data=df_final,
                         object=scaler.output,
                         accumulate=["record_number","b2c_c2c", "package_size", "delivery_days"
                                     ,"shipment_method_id","category_id"]
                           ).result
ADS_scaled
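
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit and transform split can be sketched in plain Python as follows. This is a simplified illustration of standard (z-score) scaling, not the Vantage implementation: the function names and sample values are hypothetical, and the use of the population standard deviation is an assumption.</p>

```python
from statistics import mean, pstdev


def scale_fit(values):
    """Compute the statistics needed for standard scaling (the 'fit' step)."""
    # Assumption: population standard deviation; the database function may differ.
    return {"location": mean(values), "scale": pstdev(values)}


def scale_transform(values, stats):
    """Apply (x - mean) / std using previously fitted statistics."""
    return [(v - stats["location"]) / stats["scale"] for v in values]


# Hypothetical shipping fees
shipping_fee = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = scale_fit(shipping_fee)
scaled = scale_transform(shipping_fee, stats)
print(stats)
print(scaled)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping the fitted statistics separate from the transform step means the same location and scale can later be applied to new data.</p>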

In [None]:
ADS_scaled = ADS_scaled.assign(shipment_method_id = ADS_scaled.shipment_method_id.cast(type_=VARCHAR(5)),
                               category_id = ADS_scaled.category_id.cast(type_=VARCHAR(5)))
ADS_scaled

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The OneHotEncodingFit() function records all the parameters required by the OneHotEncodingTransform() function, such as the target attributes and their categorical values to be encoded. The output of OneHotEncodingFit() is used by OneHotEncodingTransform() to encode the input data. It supports inputs in both sparse and dense format.</p>

In [None]:
copy_to_sql(ADS_scaled, table_name = 'delivery_date_dataset_final', if_exists='replace')
ADS_scaled = DataFrame('delivery_date_dataset_final')

In [None]:
fit_obj = OneHotEncodingFit(data = ADS_scaled,
                                is_input_dense=True,
                                target_column=['b2c_c2c','shipment_method_id','category_id','package_size'],
                                category_counts=[2,23,33,6],
                                approach='Auto',
                                other_column="other")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>OneHotEncodingTransform function encodes specified attributes and categorical values as one-hot numeric vectors,  using OneHotEncodingFit() function output.</p>

In [None]:
OneHotTrandform_df = OneHotEncodingTransform(data=ADS_scaled,
                                  object=fit_obj.result,
                                  is_input_dense=True)
OneHotTrandform_df.result
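
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers new to one-hot encoding, here is a minimal plain-Python sketch of the fit and transform idea, including a catch-all "other" column for unseen values. The function names and the output column naming are assumptions made for illustration and need not match the column names produced by Vantage.</p>

```python
def one_hot_fit(values):
    """Record the distinct categories seen in the data (the 'fit' step)."""
    return sorted(set(values))


def one_hot_transform(value, categories, column, other_column="other"):
    """Encode one value as 0/1 indicators, one per known category plus 'other'."""
    encoded = {f"{column}_{c}": int(value == c) for c in categories}
    # Values not seen during fit fall into the catch-all column.
    encoded[f"{column}_{other_column}"] = int(value not in categories)
    return encoded


channels = ["B2C", "C2C", "B2C"]          # hypothetical column values
categories = one_hot_fit(channels)
known = one_hot_transform("C2C", categories, "b2c_c2c")
unseen = one_hot_transform("B2B", categories, "b2c_c2c")
print(known)
print(unseen)
```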

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Creation of Train and Test data.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TrainTestSplit() function simulates how a model would perform on new data. The function divides the dataset into train and test subsets to evaluate machine learning algorithms and validate processes. The first subset is used to train the model. The second subset is used to make predictions and compare the predictions to actual values.</p> 

In [None]:
TrainTestSplit_out = TrainTestSplit(data = OneHotTrandform_df.result,
                                        id_column="record_number",
                                        train_size=0.75,
                                        test_size=0.25,
                                        seed=42)

TrainTestSplit_df = TrainTestSplit_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Creating Train and Test datasets.</p>

In [None]:
delivery_date_train_dataset = TrainTestSplit_df[TrainTestSplit_df.TD_IsTrainRow == 1]
delivery_date_test_dataset = TrainTestSplit_df[TrainTestSplit_df.TD_IsTrainRow == 0]
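
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The effect of a seeded 75/25 split can be illustrated with a short plain-Python sketch. The function name and the record numbers are hypothetical, and the shuffling here is Python's own, so it will not reproduce the exact rows chosen by TrainTestSplit(). It only shows that a fixed seed gives a repeatable, non-overlapping split.</p>

```python
import random


def train_test_split_ids(ids, train_size=0.75, seed=42):
    """Split ids into disjoint train and test sets, reproducibly for a given seed."""
    shuffled = sorted(ids)                   # fixed starting order
    random.Random(seed).shuffle(shuffled)    # seeded shuffle, independent of global state
    cut = int(round(len(shuffled) * train_size))
    return set(shuffled[:cut]), set(shuffled[cut:])


record_numbers = range(1, 101)               # hypothetical record numbers
train_ids, test_ids = train_test_split_ids(record_numbers)
print(len(train_ids), len(test_ids))
```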

In [None]:
copy_to_sql(delivery_date_train_dataset, table_name = 'delivery_date_train_dataset', if_exists='replace')
delivery_train_dataset = DataFrame('delivery_date_train_dataset')

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Feature Selection using Elastic Net Regularization.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Feature selection is a crucial step in building predictive models as it helps identify the most relevant and informative features from a potentially large set of variables. In this context, elastic net regularization is a powerful technique that can be employed to effectively filter out features and improve model performance.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Elastic net regularization combines the L1 (Lasso) and L2 (Ridge) regularization techniques, offering a balanced approach to feature selection. It applies a penalty term to the model's objective function, encouraging sparsity in the coefficient estimates and promoting the selection of a subset of important features while shrinking the coefficients of less relevant or redundant features.</p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For more information on <b>Regularization</b>: <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM/TD_GLM-Syntax-Elements'>[Link]</a></p>
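
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of how the penalty mixes L1 and L2, the sketch below uses a common textbook form of the elastic net penalty, with alpha as the L1 share and lam as the overall strength. The exact formula used by TD_GLM is an assumption here, and the function names are hypothetical. The soft-threshold helper shows why the L1 part can drive small coefficients to exactly zero, which is what makes the method usable for feature selection.</p>

```python
def elastic_net_penalty(weights, alpha, lam):
    """lam * (alpha * L1 + (1 - alpha) / 2 * L2^2); a common textbook form."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def soft_threshold(w, threshold):
    """Shrink w toward zero by threshold; values within the threshold become exactly 0."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


# Hypothetical coefficients, with alpha and lam set as in the GLM call below
penalty = elastic_net_penalty([1.0, -2.0, 0.0], alpha=0.3, lam=0.01)
print(penalty)
print(soft_threshold(0.005, 0.01), soft_threshold(-0.5, 0.1))
```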

In [None]:
GLM_out = GLM(input_columns= ['2:11','13:75'],
                    response_column = "delivery_days",
                    data = delivery_train_dataset,
                    family='Gaussian',
                    learning_rate = 'adaptive',
                    batch_size=500,
                    max_iter_num=100,
                    alpha=0.3,
                    lambda1=0.01,
                    iter_num_no_change=70,
                    tolerance=0.002,
                    intercept=True,
                    initial_eta=0.015,
                    momentum = 0.8,
                    local_sgd_iterations=10
                    )
GLM_out.result

In [None]:
glm_fs_df = GLM_out.result
copy_to_sql(glm_fs_df, table_name = 'td_glm_cal_ex', if_exists='replace')

In [None]:
glm_fs_df = glm_fs_df[glm_fs_df.attribute> 0]
glm_fs_df = glm_fs_df[glm_fs_df.estimate != 0]

In [None]:
val_list=glm_fs_df.select(['predictor']).get_values()
final_list =  list(val_list[:,0]) + ['record_number', 'delivery_days']


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the output of the TD_GLM function, predictor rows have a positive attribute index, and the estimate column holds the predictor weights. For feature selection, we keep every predictor whose weight is nonzero, i.e. estimate != 0.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We build a list of these columns, together with record_number and delivery_days, and use it to select only the columns that carry weight as predictors for the model.</p>
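
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The selection rule can be sketched on a small, made-up model table. The rows, predictor names and estimates below are hypothetical and only mimic the attribute, predictor and estimate columns used above.</p>

```python
# Hypothetical rows shaped like the model output used for feature selection
glm_output = [
    {"attribute": -1, "predictor": "Loss Function", "estimate": None},
    {"attribute": 0, "predictor": "(Intercept)", "estimate": 4.1},
    {"attribute": 1, "predictor": "distance", "estimate": 0.25},
    {"attribute": 2, "predictor": "quantity", "estimate": 0.0},
    {"attribute": 3, "predictor": "weight", "estimate": -0.03},
]


def select_features(rows, keep=("record_number", "delivery_days")):
    """Keep predictors with a positive attribute index and a nonzero estimate."""
    selected = [r["predictor"] for r in rows if r["attribute"] > 0 and r["estimate"] != 0]
    # The id and response columns are always carried along.
    return selected + list(keep)


selected_columns = select_features(glm_output)
print(selected_columns)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note that a predictor with a negative weight is kept as well, because the filter tests for a nonzero estimate rather than a positive one.</p>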

In [None]:
train_dataset = delivery_date_train_dataset[final_list]
test_dataset = delivery_date_test_dataset[final_list]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We create the train and test datasets with only these features(columns) to be used in the model for predictions.</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Generalized Linear Model (GLM) in Teradata </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_GLM function is a generalized linear model (GLM) that performs regression and classification analysis on data sets, where the response follows an exponential family distribution and supports the following models:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Regression (Gaussian family): The loss function is squared error.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Binary Classification (Binomial family): The loss function is logistic and implements logistic regression. The only response values are 0 or 1.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function uses the Minibatch Stochastic Gradient Descent (SGD) algorithm. The algorithm estimates the gradient of loss in minibatches, which is defined by the BatchSize argument and updates the model with a learning rate using the LearningRate argument.</p>
    <p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we use regression (Gaussian family), because the response column, delivery_days, is numeric.</p>
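<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the minibatch idea concrete, here is a minimal pure-Python sketch of minibatch SGD for squared-error regression. It is not the TD_GLM implementation: it uses a fixed learning rate and omits the momentum, regularization, and adaptive learning rate that the notebook's GLM call configures. The function name <code>minibatch_sgd</code>, its parameters, and the toy data are assumptions made for illustration.</p>

```python
import random


def minibatch_sgd(X, y, batch_size=4, learning_rate=0.1, epochs=500, seed=0):
    """Fit y ~ w.x + b by minimizing squared error with minibatch SGD."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the rows in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                error = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(n_features):
                    grad_w[j] += error * X[i][j]
                grad_b += error
            # Average the gradient over the batch, then step against it.
            w = [wj - learning_rate * gj / len(batch) for wj, gj in zip(w, grad_w)]
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Toy data generated without noise from y = 2*x1 - 1*x2 + 0.5
X = [[x1 / 4.0, x2 / 4.0] for x1 in range(5) for x2 in range(5)]
y = [2.0 * x1 - 1.0 * x2 + 0.5 for x1, x2 in X]

w, b = minibatch_sgd(X, y)
print([round(v, 3) for v in w], round(b, 3))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each update uses only <code>batch_size</code> rows, which is what keeps the cost of a step small on large tables; the learning rate scales how far each step moves the weights.</p>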

In [None]:
GLM_df = GLM(input_columns= ['0:50'],
                    response_column = "delivery_days",
                    data = train_dataset,
                    family='Gaussian',
                    learning_rate = 'adaptive',
                    batch_size=800,
                    max_iter_num=300,
                    alpha=0.2,
                    lambda1=0.01,
                    iter_num_no_change=200,
                    tolerance=0.002,
                    intercept=True,
                    initial_eta=0.02,
                    momentum = 0.8,
                    local_sgd_iterations=20
                    )
GLM_df.result

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>TDGLMPredict </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TDGLMPredict function predicts target values (regression) and class labels (classification) for test data using a GLM model of the TD_GLM function.</p>

In [None]:
TDGLMPredict_out = TDGLMPredict(object=GLM_df.result,
                                    newdata=test_dataset,
                                    accumulate="delivery_days",
                                    id_column="record_number")
df=TDGLMPredict_out.result
df

In [None]:
import matplotlib.pyplot as plt
# import matplotlib.patches as patches
import seaborn as sns

# Pull the predictions to the client and keep the first 200 rows for plotting
df_plot = df.to_pandas(all_rows=True).reset_index().head(200)
plt.figure(figsize=(20,8))
sns.lineplot(data= df_plot ,x="record_number",y="delivery_days",ci=None)
sns.lineplot(data= df_plot ,x="record_number",y="prediction",ci=None)
plt.grid()
# plt.xticks(np.arange(1,60, step=1))
plt.legend(['Actual Value', 'Predicted Value'], loc='best', fontsize=16)
plt.title('Comparison of Actual vs Predicted Delivery Days', fontsize=20)
plt.xlabel('Record Number', fontsize=16)
plt.ylabel('Delivery Days', fontsize=16)
plt.show()

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>RegressionEvaluator</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For more information on <b>RegressionEvaluator</b>: <a href='https://docs.teradata.com/r/Lake/Teradata-Package-for-Python-Function-Reference-on-VantageCloud-Lake/teradataml-Analytic-Database-17.20.xx-Analytic-Functions/MODEL-EVALUATION-functions/RegressionEvaluator'> [Link] </a></p>

In [None]:
RegressionEvaluator_out = RegressionEvaluator(data = df,
                                                  observation_column = "delivery_days",
                                                  prediction_column = "prediction",
                                                  freedom_degrees = [5, 48],
                                                  independent_features_num = 5,
                                                  metrics = ['RMSE','R2','FSTAT'])
RegressionEvaluator_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The RegressionEvaluator output contains the RMSE, R2, and F-statistic metrics requested in the metrics argument. The main values to check are P_VALUE and F_CONCLUSION. The lower the RMSE, the closer the predictions are to the observed values. A P_VALUE below 0.05 together with an F_CONCLUSION of 'reject null hypothesis' indicates that the model explains a statistically significant share of the variation in the response.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Root mean squared error (RMSE) is one of the most common metrics for evaluating the performance of a linear regression model. It is the square root of the average of the squared errors between observed values and predicted values.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The coefficient of determination — more commonly known as R² — allows us to measure the strength of the relationship between the response and predictor variables in the model. R Squared (R2) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The metrics specified in the Metrics syntax element are displayed. For FSTAT, the following columns are displayed:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>F_score:- F_score value from the F-test.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>F_Critcialvalue:- F critical value from the F-test. (alpha, df1, df2, UPPER_TAILED) , alpha = 95%</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>p_value:- Probability value associated with the F_score value (F_score, df1, df2, UPPER_TAILED)</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>F_Conclusion:- F-test result, either 'reject null hypothesis' or 'fail to reject null hypothesis'. If F_score > F_Critcialvalue, then 'reject null hypothesis' Else 'fail to reject null hypothesis'.</li></p>
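<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of these metrics, the sketch below computes RMSE, R2, and an overall F-statistic using the textbook formulas. This is an assumption about the general definitions, not a reproduction of how RegressionEvaluator computes FSTAT from its freedom_degrees argument. The helper name <code>regression_metrics</code> and the sample values are hypothetical. The p-value is omitted because it requires the F-distribution, which is not in the Python standard library.</p>

```python
import math


def regression_metrics(observed, predicted, num_features):
    """Return (RMSE, R2, F-statistic) using the textbook definitions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual and total sums of squares
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    # Overall F-statistic for a linear model with an intercept:
    # explained variance per feature over unexplained variance per
    # residual degree of freedom.
    f_stat = (r2 / num_features) / ((1.0 - r2) / (n - num_features - 1))
    return rmse, r2, f_stat


# Hypothetical observed and predicted delivery days
observed = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0]
predicted = [2.5, 2.5, 4.5, 4.5, 6.5, 6.5]

rmse, r2, f_stat = regression_metrics(observed, predicted, num_features=1)
print(f"RMSE={rmse:.3f}  R2={r2:.3f}  F={f_stat:.3f}")
```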




<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>9. Decision Forest </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Decision Forest is a powerful method for predicting outcomes in both classification and regression problems. It improves on the technique of combining (or "bagging") multiple decision trees. Typically, constructing a decision tree involves evaluating every input feature in the data to select a split point. Decision Forest instead considers only a random subset of features at each split point. This forces the decision trees in the forest to differ from one another, which improves prediction accuracy. The TD_DecisionForest function uses a training dataset to create a predictive model, and the TD_DecisionForestPredict function then uses that model to make predictions. The function supports regression, binary classification, and multi-class classification.</p>
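<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea of bagging with random feature subsets can be sketched in a few lines of plain Python. This is a toy illustration, not the TD_DecisionForest algorithm: each "tree" is a single-split regression stump, so drawing the feature subset once per tree is equivalent to drawing it at the only split. The names <code>fit_stump</code>, <code>fit_forest</code>, and <code>predict</code>, and the toy data, are assumptions made for this example; <code>mtry</code> is used here to mean the number of features considered at a split.</p>

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no valid split: always predict the mean
        mean = sum(targets) / len(targets)
        return (None, None, mean, mean)
    return best[1:]


def fit_forest(rows, targets, num_trees=25, mtry=1, seed=2):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(num_trees):
        sample = [rng.randrange(n) for _ in range(n)]    # bootstrap rows
        features = rng.sample(range(n_features), mtry)   # random feature subset
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], features))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps (regression)."""
    preds = []
    for f, threshold, lm, rm in forest:
        preds.append(lm if f is None or row[f] <= threshold else rm)
    return sum(preds) / len(preds)


# Toy data: [distance, weight]; delivery days depend only on distance.
rows = [[d, (d * 7) % 4] for d in range(1, 11)]
targets = [2 + 0.5 * d for d in range(1, 11)]

forest = fit_forest(rows, targets)
near, far = predict(forest, [1, 2]), predict(forest, [10, 2])
print(round(near, 2), round(far, 2))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because each stump sees a different bootstrap sample and feature subset, the individual predictions differ, and averaging them smooths out the errors of any single stump.</p>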


In [None]:
train_dataset.to_sql(table_name='train_dataset' , if_exists='replace')
test_dataset.to_sql(table_name='test_dataset' , if_exists='replace')

In [None]:
DecisionForest_out = DecisionForest(data = DataFrame('train_dataset'), 
                            input_columns = ['0:52'], 
                            response_column = 'delivery_days', 
                            max_depth = 24, 
                            num_trees = 6, 
                            min_node_size = 1, 
                            mtry = -1, 
                            mtry_seed = 2,
                            seed = 2, 
                            tree_type = 'REGRESSION')

decision_df=DecisionForest_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>TDDecisionForestPredict</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TDDecisionForestPredict function uses the model output by the TD_DecisionForest function to analyze the input data and make predictions. For classification models, the function outputs the probability that each observation is in the predicted class; for a regression model such as this one, it outputs the predicted value. Processing time is controlled by the number of trees in the model. When the number of trees is more than what can fit in memory, the trees are cached in local spool space.</p>


In [None]:
DF_Predict_out = TDDecisionForestPredict(
    newdata=DataFrame('test_dataset'),
    object=DecisionForest_out.result,
    id_column='record_number',
    accumulate='delivery_days',
    )

DF_Predict_out.result

In [None]:
df_result = DF_Predict_out.result
df_result = df_result.assign(delivery_hours = df_result.delivery_days*24,
                             prediction_hours = df_result.prediction * 24)
df_result                             

In [None]:
import matplotlib.pyplot as plt
# import matplotlib.patches as patches
import seaborn as sns

df_plot=df_result_pd=df_result.to_pandas(all_rows=True).reset_index()
plt.figure(figsize=(20,8))
sns.lineplot(data= df_plot[:200] ,x="record_number",y="delivery_days",ci=None)
sns.lineplot(data= df_plot[:200] ,x="record_number",y="prediction",ci=None)
plt.grid()
# plt.xticks(np.arange(1,60, step=1))
plt.legend(['Actual Value', 'Predicted Value'], loc='best', fontsize=16)
plt.title('Comparison of Actual vs Predicted Delivery Days', fontsize=20)
plt.xlabel('Record Number', fontsize=16)
plt.ylabel('Delivery Days', fontsize=16)
plt.show()

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>RegressionEvaluator</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For more information on <b>RegressionEvaluator</b>: <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Evaluation-Functions/TD_RegressionEvaluator'> [Link] </a></p>

In [None]:
RegressionEvaluator_dfout = RegressionEvaluator(data = df_result,
                                                  observation_column = "delivery_days",
                                                  prediction_column = "prediction",
                                                  freedom_degrees = [5, 48],
                                                  independent_features_num = 5,
                                                  metrics = ['RMSE','R2','FSTAT'])
RegressionEvaluator_dfout.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The RegressionEvaluator output contains the RMSE, R2, and F-statistic metrics requested in the metrics argument. The main values to check are P_VALUE and F_CONCLUSION. The lower the RMSE, the closer the predictions are to the observed values. A P_VALUE below 0.05 together with an F_CONCLUSION of 'reject null hypothesis' indicates that the model explains a statistically significant share of the variation in the response.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Root mean squared error (RMSE) is one of the most common metrics for evaluating the performance of a linear regression model. It is the square root of the average of the squared errors between observed values and predicted values.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The coefficient of determination — more commonly known as R² — allows us to measure the strength of the relationship between the response and predictor variables in the model. R Squared (R2) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The metrics specified in the Metrics syntax element are displayed. For FSTAT, the following columns are displayed:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>F_score:- F_score value from the F-test.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>F_Critcialvalue:- F critical value from the F-test. (alpha, df1, df2, UPPER_TAILED) , alpha = 95%</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>p_value:- Probability value associated with the F_score value (F_score, df1, df2, UPPER_TAILED)</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>F_Conclusion:- F-test result, either 'reject null hypothesis' or 'fail to reject null hypothesis'. If F_score > F_Critcialvalue, then 'reject null hypothesis' Else 'fail to reject null hypothesis'.</li></p>



<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b> Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have walked through an end-to-end exploration of shipping time prediction using ClearScape Analytics on Teradata Vantage. We preprocessed the data, built models using the InDB analytic functions, and compared the performance of the two models. Because we used sample data, the results may not be accurate. Thanks to the in-database capabilities offered by Teradata Vantage with ClearScape Analytics, we were able to run this exploration with the smallest notebook instance.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>11. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We need to clean up our work tables to prevent errors next time.</p>

In [None]:
tables = ['temp_Ship','train_dataset','test_dataset','td_glm_cal_ex',
          'delivery_date_dataset_final','delivery_date_train_dataset','delivery_date_test_dataset']

# Loop through the list of tables and drop each one
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # The table may not exist (for example, on a rerun); skip it
        pass
      


In [None]:
db_drop_table(table_name='Delivery_date_data_new', schema_name='DEMO_ShipTimePred_db') 

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ShippingTimePrediction');" 
#Takes 45 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Resources</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The implemented functions are from the following documentation:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'> <a href='https://www.docs.teradata.com/r/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20'>Advanced SQL Engine 17.20 Functions</a></li>       
<li style = 'font-size:16px;font-family:Arial;color:#00233C'> <a href= 'https://docs.teradata.com/r/Enterprise_IntelliFlex_Lake_VMware/Vantage-Analytics-Library-User-Guide/Welcome-to-Vantage-Analytics-Library'>Vantage Analytics Library</a></li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'> <a href= 'https://docs.teradata.com/r/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference'>UAF Time-Series 17.20 Functions</a></li>    
<br> 
       
    
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Data</b></p>
      
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data comes from a public dataset on <a href= 'https://www.kaggle.com/datasets/armanaanand/ebay-delivery-date-prediction'>Kaggle</a>, with the following description:</p>
    
<img src='images/DataSet.png'>
   
   
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Filters:</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Industry:</b> Transportation</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Functionality:</b> Machine Learning</li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Use Case:</b> Shipping Time Predictions</li></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Related Resources:</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href ='https://www.teradata.com/Blogs/Using-a-Lake-Centric-Modernization-Approach'>Using a Lake-Centric Modernization Approach to Clean Up a Data and Compute Mess</a></li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href ='https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right</a></li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href ='https://www.teradata.com/Blogs/Data-Analytics-Keeps-the-Wheels-on-the-Bus'>Data & Analytics Keep the Wheels on the Bus!</a></li>
  


<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>