<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Train Delay Prediction
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Customer satisfaction starts with the experience. However, every customer experience carries a risk of unknown or unexpected issues. For instance, trains run on a very structured schedule, but delays still occur. Train delays significantly affect both the operational effectiveness of railway companies and the overall experience of passengers. Teradata Vantage and ClearScape Analytics provide the features to examine historical data and determine the root causes of these delays, which in turn will enhance train operations and reduce interruptions. In Vantage, users can develop predictive models to anticipate these delays and enable proactive planning, so resources can be allocated as necessary.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Values</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Understand delays and what factors lead to these delays.</li>
    <li>Reduce the number of delays and increase customer satisfaction.</li>
    <li>Ensure timeliness and accurate scheduling.</li>
</ul>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> To build more effective ML and AI models, developers and data scientists need to look outside the box for data, tools, and techniques that can continuously enhance the accuracy, speed, and efficacy of their models. Unfortunately, most of the time, this creativity comes at a cost. Plus, combining different types of analytics and data into the development pipeline usually adds complexity, fragility, and difficulties with operationalizing the process.<br>
Teradata Vantage provides ClearScape Analytics functions which allow users to seamlessly combine a wide range of behavioral, text processing, statistical analysis, and advanced analytic functions with model training and deployment tools on the same platform.<br>  
This allows for rapid development, testing, and validation of new techniques at scale in near-real time so new, more accurate models can easily be deployed to production.</p>



<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's start by importing the libraries needed.</p>

In [None]:
# Standard libraries
import getpass
import warnings

# Third-party libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Teradata libraries
from teradataml import *
display.max_rows = 5

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Train_Delay_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_TrainDelay_cloud');"        # Takes 30 seconds
%run -i ../run_procedure.py "call get_data('DEMO_TrainDelay_local');"        # Takes 1 minute

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Data Exploration</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage. We then begin our analysis by checking the shape of the DataFrame and examining the data types of all its columns.</p>


In [None]:
mydata = DataFrame(in_schema("DEMO_TrainDelay" ,"Train_Dataset"))
mydata

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i>mydata</i> is a teradataml DataFrame object that behaves much like a pandas DataFrame and offers similar attributes and methods, such as:
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
    <li> shape to get the number of rows and columns</li>
    <li> dtypes to get the data type of each column</li>
    <li> groupby, select, agg, ... to compute aggregations</li>
    <li> iloc, loc to filter rows and columns</li>
    <li> columns to get the column names</li>
    </ul>
 <p style = 'font-size:16px;font-family:Arial;color:#00233C'>This likeness facilitates a seamless transition and interchangeability between the two, allowing us to leverage our familiarity with pandas while harnessing the power of Teradata for robust data manipulation and analysis.   
    </p>

In [None]:
type(mydata)

In [None]:
mydata.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>mydata dataframe contains 54616 rows and 3 columns.</p>

In [None]:
mydata.dtypes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset has 3 columns:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li> TravelID as int </li>
    <li> events as string</li>
    <li> datetime as datetime</li></ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an example, we can see all different events contained in the dataset:</p>

In [None]:
mydata.groupby(['Events']).agg('count')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
We can see that the aggregated data is available to us as a teradataml DataFrame. Let's visualize this data for better understanding. ClearScape Analytics integrates easily with third-party visualization tools such as Tableau and Power BI, and with Python modules such as plotly and seaborn. We can do all the calculations and pre-processing in Vantage and pass only the necessary information to the visualization tool. This not only makes the calculations faster but also reduces overall processing time, because less data moves between tools and applications. To convert a teradataml DataFrame to a pandas DataFrame we use the to_pandas() method.</p>

In [None]:
df4plot = mydata.groupby(['Events']).agg('count').to_pandas()
df4plot.head(5)

In [None]:
plt.rcParams['figure.figsize'] = [15, 10]
plt.rc('ytick', labelsize=20)
df4plot.sort_values('count_TravelID',ascending=True).plot.barh(x='Events',y='count_TravelID')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we see that there are as many departures as arrivals, which is expected. The most frequent events are <i>Door light failure</i> and <i>Normal stop</i>.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Advanced Data Exploration : Path Analysis </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <b>nPath</b> function scans a set of rows looking for patterns that we specify. For each set of input rows that matches the specified pattern, nPath produces a single output row. This is extremely useful when our goal is to identify the paths that lead to an outcome.<br>
In our example, we want to build all the paths of events a travel (or trip) passes through, meaning:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li> for each travel we want to get the sequence of events</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A travel can be modelled as a sequence of events starting with the <i>departure</i> event and ending with the <i>arrival</i> event.</p>
<center><img src="images/npath_sankey.png"/></center>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
For our example:
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>We will pass our dataset 'mydata' to the function.</li>
    <li>Provide partitioning (TravelID) and ordering column.</li>
    <li>Mode <b>OVERLAPPING</b> vs. <b>NONOVERLAPPING</b>
        <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
            <li><b>OVERLAPPING</b> finds every occurrence of the match, regardless of the current row being part of a previous match.</li>
            <li><b>NONOVERLAPPING</b> starts matching again at the row that follows the previous match.</li>
        </ul>
    </li>
    <li>Symbols.  Create a set of column expression aliases that can be assembled into a pattern to match.
        <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
            <li>Example: "events = 'departure' AS Dep" aliases a match on the events column when the event equals 'departure'.</li>
        </ul>
    </li>
      <li>Pattern.  Compose a pattern to search for across the rows of events.  This pattern is composed of Symbols and directives.
        <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
            <li>Example: '^Dep' uses the directive ^ to indicate that the Dep symbol must occur at the beginning of the group of rows.</li>
        </ul>
    </li>
    <li>Result.  Since nPath emits a single row per group-of-row matches, Result indicates what columns make up this row and how to aggregate the data.</li>
    </ol>    
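<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the difference between the two modes concrete, here is a minimal local sketch that does not call Vantage. It assumes each row has already been reduced to a one-letter symbol (the letters <i>A</i> and <i>B</i> are hypothetical) and uses Python regular expressions as a stand-in for the nPath pattern matcher; it only illustrates the matching semantics described above.</p>

```python
import re

# One letter per ordered row: "A" and "B" are hypothetical symbols
rows = "ABABA"
pattern = "ABA"

# NONOVERLAPPING: scanning resumes at the row after the previous match
nonoverlapping = [m.start() for m in re.finditer(pattern, rows)]

# OVERLAPPING: every row is tried as a potential start of a match,
# emulated here with a zero-width lookahead
overlapping = [m.start() for m in re.finditer(f"(?={pattern})", rows)]

print(nonoverlapping)  # [0]
print(overlapping)     # [0, 2]
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The row at position 2 belongs to the first match, so it can only start a second match in overlapping mode.</p>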

In [None]:
myPathAnalysis = NPath(data1     = mydata,
               data1_partition_column = 'TravelID',
               data1_order_column     = 'Datetime',
               result                 = ['FIRST (TravelID OF any (Dep,Arr)) AS TravelID',
                                         'ACCUMULATE (cast(events as VARCHAR(50) CHARACTER SET UNICODE NOT CASESPECIFIC)OF any(Other,Dep,Arr)) AS MyPath',
                                         'first(Datetime of Dep) AS departure_time',
                                         'last(Datetime of Arr) As arrival_time'
                                        ],
               mode                   = 'nonoverlapping',
               pattern                = '^Dep.Other*.Arr$',
               symbols                = ["events='departure' AS Dep",
                                         "events='arrival' AS Arr",
                                         "true as Other"
                                        ],
        )

In [None]:
npath_df=myPathAnalysis.result
npath_df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The results of the npath can be customized. We can add the path (here the *mypath* column) but also the departure and arrival time for each travel. </p>
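<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers who want to check their understanding of the pattern <i>^Dep.Other*.Arr$</i> without a Vantage connection, the following pandas sketch emulates it on a tiny, made-up event log. The toy data, the helper <i>match_travel</i> and the bracketed path format are assumptions made for illustration; this is not how nPath is implemented in the database.</p>

```python
import pandas as pd

# Hypothetical toy event log; column names mirror the demo table
toy_events = pd.DataFrame({
    "TravelID": [1, 1, 1, 2, 2, 3, 3, 3],
    "events": ["departure", "door failure", "arrival",
               "departure", "arrival",
               "normal stop", "departure", "arrival"],
    "Datetime": pd.to_datetime([
        "2023-01-01 08:00:00", "2023-01-01 08:10:00", "2023-01-01 08:30:00",
        "2023-01-01 09:00:00", "2023-01-01 09:25:00",
        "2023-01-01 10:00:00", "2023-01-01 10:05:00", "2023-01-01 10:40:00",
    ]),
})

def match_travel(group):
    """Return one result row if the ordered events match ^Dep.Other*.Arr$, else None."""
    ordered = group.sort_values("Datetime")
    sequence = ordered["events"].tolist()
    # ^Dep: the first row is a departure; Arr$: the last row is an arrival.
    # Other* accepts any rows in between because its predicate is always true.
    if len(sequence) < 2 or sequence[0] != "departure" or sequence[-1] != "arrival":
        return None
    return {
        "TravelID": ordered["TravelID"].iloc[0],
        "MyPath": "[" + ", ".join(sequence) + "]",   # assumed ACCUMULATE-style format
        "departure_time": ordered["Datetime"].iloc[0],
        "arrival_time": ordered["Datetime"].iloc[-1],
    }

# Partition by TravelID, order by Datetime, keep only the partitions that match
matches = [match_travel(group) for _, group in toy_events.groupby("TravelID")]
toy_paths = pd.DataFrame([m for m in matches if m is not None])
print(toy_paths)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Travel 3 is dropped because its first event is not a departure, which is the effect of the ^ anchor.</p>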

In [None]:
npath_df.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> We can see that we have 7300 paths where the starting event is departure and the ending event is arrival.<br>We can store these results in a table for further analysis.</p>

In [None]:
copy_to_sql(npath_df, table_name = 'npath_data', if_exists = 'replace')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To visualize the distribution of the different paths of events, we typically use a Sankey diagram built from the aggregated paths returned by the nPath function.</p>

In [None]:
from tdnpathviz.visualizations import plot_first_main_paths

In [None]:
%%time
plot_first_main_paths(npath_df,path_column='mypath',id_column='travelid',width=1000)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
To inspect a path or node, hover the mouse pointer over it. The number on a path is the count of travel IDs that follow that path, and the source and target give the incoming and outgoing events.<br>
When the pointer is over a node, for example the long purple <i>arrival</i> node at the top right, the tooltip shows incoming flow count: 4 and outgoing flow count: 0. This means that 4 different events lead to this node; similarly, the outgoing flow count gives the number of events that follow it.<br>
<br>
For the sake of clarity, it is important to focus on the most important paths from a business viewpoint. Here we decided to look at the most frequent ones, i.e. those that occur at least 20 times.</p>

In [None]:
nPathdf_group=npath_df.groupby("mypath")\
                .count()\
                .sort('count_travelid',ascending=False)
nPathdf_group

In [None]:
count_travel=nPathdf_group.count_travelid
nPathdf_group_plot=nPathdf_group[count_travel >= 20]

In [None]:
%%time
plot_first_main_paths(nPathdf_group_plot,path_column='mypath',id_column='count_travelid',width=1000)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Visualizing the paths in an event table is critical to designing the best modeling strategy. For instance, the business may decide to ignore some events because of doubts about their meaning, and can rapidly assess the importance of a given event in the entire dataset.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Data Preparation using the Massive Parallel Processing of Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In this example, we want to predict the delay induced by each event, assuming that delays add up independently of each other. For this purpose, we will use a Machine Learning algorithm to predict the travel duration from the frequency of each event.
<center><img src="images/data_science_model.png"/></center>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
It is good practice to perform data preparation using tables and/or views. In this way we can leverage the Massive Parallel Processing of Vantage. Moreover, the data preparation is shareable across the enterprise and guarantees the operationalization of the solution.<br>In this example, we decided to use a view named <i>usecase_dataset</i>. This provides a dataset that always reflects the latest data. The view can be used later to historize as many datasets as needed for training and testing.<br>
To do so, we can push a SQL query to build this view in a data lab space in Vantage. Note that this view relies on the NPATH Vantage function and timestamp manipulation to create the target feature which is the travel duration in seconds (*travel_duration_sec*).</p>

In [None]:
myquery = """REPLACE VIEW usecase_dataset (TravelID,travel_duration_sec, travel)
 AS
 SELECT TravelID,travel_duration_sec, travel  FROM (
  SELECT 
        TravelID AS TravelID,
        departure_time AS departure_time,
        arrival_time AS arrival_time,
        (arrival_time - departure_time) HOUR TO SECOND(4) as travel_duration,
        INTERVAL(PERIOD(departure_time,arrival_time)) MINUTE(3) as travel_duration_min,
        EXTRACT(HOUR FROM travel_duration)*3600 + EXTRACT(MINUTE FROM travel_duration)*60 + EXTRACT(SECOND FROM travel_duration) as travel_duration_sec,
        mypath as travel
  FROM npath_data
) A;
"""

In [None]:
execute_sql(myquery)

In [None]:
df_mydata = DataFrame("usecase_dataset")
df_mydata
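<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The target feature in the view is obtained by decomposing the interval between departure and arrival into hours, minutes and seconds. The pandas sketch below reproduces that arithmetic locally on two made-up timestamps; it assumes a trip shorter than one day and is only meant to show what <i>travel_duration_sec</i> represents.</p>

```python
import pandas as pd

# Hypothetical departure and arrival times for a single travel
departure = pd.Timestamp("2023-01-01 08:00:00")
arrival = pd.Timestamp("2023-01-01 09:12:30")
duration = arrival - departure

# Same decomposition as in the view: hours*3600 + minutes*60 + seconds
parts = duration.components
travel_duration_sec = parts.hours * 3600 + parts.minutes * 60 + parts.seconds

print(travel_duration_sec)       # 4350
print(duration.total_seconds())  # 4350.0
```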

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Model development : prepare the data for the Machine Learning Algorithm</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Feature Creation</b><br>
The dataset built in Vantage contains all the information needed to address the business question. Based on the dataset and the business question, additional features can be created that help the machine learning algorithm extract meaningful insights.<br>
In our example, the strategy proposed by the data scientist consists of splitting the paths and counting the frequency of each event in them.</p>
<center><img src="images/model_strategy.png"/></center>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We use the <i>NGramSplitter</i> function to process the paths of each travel. The function will split the corpus of texts into "terms" (grams) of selected size.</p>

In [None]:
ngrams = NGramSplitter(data=df_mydata,
                          text_column='travel',
                          delimiter = ",",
                          grams = "1",
                          overlapping=False,
                          to_lower_case=True,
                          total_gram_count=True,
                          punctuation = "[\\]\\\\[\\`]"
              )

In [None]:
ngrams.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The NGramSplitter function adds new columns (and rows). We will use two of them:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li> ngram : is the event found in the travel</li>
    <li> frequency : is the frequency of this event in the path</li>
 </ul>   
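<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough local illustration of what the <i>ngram</i> and <i>frequency</i> columns contain, the sketch below splits one made-up path into single-event terms and counts them with the Python standard library. The bracketed, comma-separated path format is an assumption, and the in-database function may normalize terms differently.</p>

```python
from collections import Counter

# Hypothetical path for one travel, in the assumed "[event, event, ...]" format
toy_path = "[departure, normal stop, door failure, normal stop, arrival]"

# Strip the surrounding brackets, split on the delimiter, lowercase each term
terms = [term.strip().lower() for term in toy_path.strip("[]").split(",")]

# Count how often each event (1-gram) occurs in the path
frequency = Counter(terms)

for ngram, count in frequency.items():
    print(ngram, count)
```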
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We need to get the number of possible ngrams: </p>

In [None]:
keys = (ngrams.result).select(['ngram','frequency']).groupby(['ngram']).sum()
keys

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can visualize again the distribution of events in the dataset:</p>

In [None]:
import matplotlib.pyplot as plt
keys=keys.to_pandas()
keys.sort_values('sum_frequency',ascending=True).plot.barh(x='ngram',figsize=(10,5),fontsize=20,legend=False)
plt.ylabel('events',fontsize=20)
plt.xlabel('frequency',fontsize=20)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In order to make the dataset ready for the Machine Learning algorithm, we need to pivot the data and fill missing values with 0.<br>
For this purpose, we use two functions:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
  <li>pivot, to pivot the data and generate one column per event type. When an event does not occur during a travel, pivot sets its frequency to NULL (NaN).</li>
    <li>assign, used here to fill the missing values with the help of the <i>isna</i> function.</li></ul>
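<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The pivot-then-fill idea can be tried locally with pandas before running it in Vantage. The sketch below uses a small, made-up long-format table; <i>pivot_table</i> and <i>fillna</i> are the pandas counterparts chosen for illustration and are not the teradataml calls used in this notebook.</p>

```python
import pandas as pd

# Hypothetical long-format output: one row per (travel, event) pair
toy_ngrams = pd.DataFrame({
    "TravelID": [1, 1, 1, 2, 2],
    "ngram": ["departure", "door failure", "arrival", "departure", "arrival"],
    "frequency": [1, 2, 1, 1, 1],
})

# One column per event type; events absent from a travel become NaN
wide = toy_ngrams.pivot_table(index="TravelID", columns="ngram",
                              values="frequency", aggfunc="sum")
print(wide)

# Replace the missing frequencies with 0 so every travel has a full feature vector
wide_filled = wide.fillna(0)
print(wide_filled)
```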

In [None]:
df_ngram = ngrams.result

In [None]:
df_ngram

In [None]:
df_ngram.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i>* The command below pivots the data and takes approximately 1 min 30 sec to execute.</i></p>

In [None]:
%%time
pivot = df_ngram.pivot(columns=df_ngram.ngram, aggfuncs=df_ngram.frequency.sum())

In [None]:
dataset = pivot.assign(drop_columns                 = True,
           travelid                              = pivot.TravelID,
           travel                                = pivot.travel,
           travel_duration_sec                   = pivot.travel_duration_sec, 
           frequency_abnormal_weather_condition  = pivot['sum_frequency_abnormalweathercondition']  if not pivot['sum_frequency_abnormalweathercondition'].isna() else (1.-pivot['sum_frequency_abnormalweathercondition'].isna()),
           frequency_accident_involving_person   = pivot['sum_frequency_accidentinvolvingperson']  if not pivot['sum_frequency_accidentinvolvingperson'].isna() else (1.-pivot['sum_frequency_accidentinvolvingperson'].isna()),
           frequency_body_on_track               = pivot['sum_frequency_bodyontrack']  if not pivot['sum_frequency_bodyontrack'].isna() else (1.-pivot['sum_frequency_bodyontrack'].isna()),          
           frequency_crowded_stop                = pivot['sum_frequency_crowdedstop']  if not pivot['sum_frequency_crowdedstop'].isna() else (1.-pivot['sum_frequency_crowdedstop'].isna()),
           frequency_door_failure                = pivot['sum_frequency_doorfailure']  if not pivot['sum_frequency_doorfailure'].isna() else (1.-pivot['sum_frequency_doorfailure'].isna()),          
           frequency_door_light_failure          = pivot['sum_frequency_doorlightfailure']  if not pivot['sum_frequency_doorlightfailure'].isna() else (1.-pivot['sum_frequency_doorlightfailure'].isna()),
           frequency_electrical_failure          = pivot['sum_frequency_electricalfailure']  if not pivot['sum_frequency_electricalfailure'].isna() else (1.-pivot['sum_frequency_electricalfailure'].isna()),          
           frequency_engine_failure              = pivot['sum_frequency_enginefailure']  if not pivot['sum_frequency_enginefailure'].isna() else (1.-pivot['sum_frequency_enginefailure'].isna()),
           frequency_normal_stop                 = pivot['sum_frequency_normalstop']  if not pivot['sum_frequency_normalstop'].isna() else (1.-pivot['sum_frequency_normalstop'].isna()),          
           frequency_road_work                   = pivot['sum_frequency_roadwork'] if not pivot['sum_frequency_roadwork'].isna() else (1.-pivot['sum_frequency_roadwork'].isna()),
           frequency_stop_sign_failure           = pivot['sum_frequency_stopsignfailure'] if not pivot['sum_frequency_stopsignfailure'].isna() else (1.-pivot['sum_frequency_stopsignfailure'].isna()),          
           frequency_unexpected_stop             = pivot['sum_frequency_unexpectedstop'] if not pivot['sum_frequency_unexpectedstop'].isna() else (1.-pivot['sum_frequency_unexpectedstop'].isna())
          )

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we decide to store the dataset in a table so that we can test different machine learning algorithms.</p>

In [None]:
copy_to_sql(dataset,table_name='my_dataset',if_exists='replace')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Model development : apply Machine Learning Algorithm</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In our case, using a Generalized Linear Model answers the following business questions:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>   
    <li>What is the travel duration when no event occurs (even if no such travel exists)? => the answer is the intercept.</li>
    <li>What is the delay induced by each event type (under the assumption that there is no interaction between events)? => the answers are the coefficients of the model.</li>
    <li>Can I simulate a new scenario? => this is addressed by scoring on new data, which can be done with any trained Machine Learning model.</li>
 </ul>   
<center><img src="images/GLM.png"/></center>
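<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A Gaussian GLM with an identity link corresponds to ordinary least-squares regression, so the interpretation above can be checked on synthetic data. The NumPy sketch below invents a base duration and two per-event delays (all values are hypothetical), generates noise-free durations, and recovers the intercept and coefficients with a least-squares solve. It is a local illustration, not the in-database training used in this notebook.</p>

```python
import numpy as np

# Hypothetical ground truth: 1800 s base trip, +120 s and +300 s per event type
true_intercept = 1800.0
true_delays = np.array([120.0, 300.0])

# Event counts per travel (rows) for the two event types (columns)
counts = np.array([[0, 0], [1, 0], [0, 1], [2, 1], [1, 2], [3, 0]], dtype=float)
durations = true_intercept + counts @ true_delays

# Prepend a column of ones so the first fitted parameter is the intercept
design = np.column_stack([np.ones(len(counts)), counts])
params, *_ = np.linalg.lstsq(design, durations, rcond=None)
intercept, delays = params[0], params[1:]

# Simulate a new scenario: one occurrence of each event type
scenario = intercept + np.array([1.0, 1.0]) @ delays

print(np.round(params, 2))
print(round(float(scenario), 2))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the synthetic durations contain no noise, the fit recovers the invented values: the intercept is the duration of a travel without events, and each coefficient is the delay added by one occurrence of the corresponding event.</p>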


In [None]:
dataset_num = DataFrame('my_dataset')

In [None]:
dataset_num

In [None]:
dataset_num.loc[:,['travelid','travel_duration_sec']].describe()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Create train and test data</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now that we have transformed our data into a form suitable for a machine learning model, let us split the dataset into train and test sets for model training and scoring. We will use the TrainTestSplit function for this task.</p>

In [None]:
TrainTestSplit_out = TrainTestSplit(
                                    data = dataset_num,
                                    id_column = "travelid",
                                    train_size = 0.75,
                                    test_size = 0.25,
                                    seed = 20
)

In [None]:
dataset_training = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
dataset_testing  = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)

In [None]:
dataset_training.shape

In [None]:
dataset_testing.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We want to predict travel_duration_sec using the frequencies of all events, so we define the formula accordingly.</p>

In [None]:
formula = 'travel_duration_sec ~ '+' + '.join(dataset_num.columns[3:-1])
formula

In [None]:
from teradataml import GLM, TDGLMPredict
glm_out = GLM(     formula      = formula,
                   linkfunction = 'IDENTITY',
                   family       = "GAUSSIAN",
                   data         = dataset_training,
                   threshold    = 0.001,
                   iter_max=300,
                   tolerance=0.001,
                   momentum=0.1,
                   nesterov=True,
                   learning_rate='CONSTANT'
                   )

In [None]:
glm_out.result

In [None]:
model_coefficients = glm_out.result.to_pandas().reset_index()
feat_imp = model_coefficients[model_coefficients['attribute'] > 0].sort_values(by = 'estimate', ascending = False)

# Specify figure size
fig, ax = plt.subplots(figsize=(10, 8))

# Use ax.barh() for horizontal bar chart
ax.barh(feat_imp['predictor'], feat_imp['estimate'], edgecolor='red')

# Add text labels on right of the bars
for x, y in zip(feat_imp['estimate'], feat_imp['predictor']):
    ax.text(x, y, str(round(x, 2)), ha='left', va='center')

# Set x-axis label
ax.set_xlabel('Estimate')

plt.title('Feature importance')

plt.show()

<br>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The figure above displays the model coefficients as a measure of feature importance: the larger a coefficient, the more the corresponding event contributes to the predicted target variable, which in our case is travel_duration_sec. </p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Model Performance</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The model accuracy is evaluated on the testing dataset (dataset_testing) using the TDGLMPredict function.</p>

In [None]:
predictions = TDGLMPredict(object=glm_out.result,
                                        newdata=dataset_testing,
                                        accumulate="travel_duration_sec",
                                        id_column="travelid")

In [None]:
predictions.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.</p>

In [None]:
from teradataml import RegressionEvaluator
RegressionEvaluator_out = RegressionEvaluator(data = predictions.result,
                                                      observation_column = "travel_duration_sec",
                                                      prediction_column = "prediction",
                                                      freedom_degrees = [1, 2],
                                                      metrics = ['RMSE','R2','FSTAT'])

In [None]:
RegressionEvaluator_out

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output of the regression evaluator contains the RMSE, R2 and F-STAT metrics requested in the <i>metrics</i> argument.<br>The regression evaluator is used to evaluate and compare models. </p>  

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Root mean squared error (RMSE): the most common metric for evaluating linear regression model performance is the root mean squared error, or RMSE. The basic idea is to measure how far the model’s predictions are from the actual observed values. So a high RMSE is “bad” and a low RMSE is “good”.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The coefficient of determination, more commonly known as R², allows us to measure the strength of the relationship between the response and predictor variables in the model. For a least-squares linear fit evaluated on its training data it equals the square of the correlation coefficient R, so its values lie in the range 0.0–1.0. Higher values of R² are better.</p>
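<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Both metrics are easy to reproduce by hand, which can help when sanity-checking the evaluator output. The NumPy sketch below computes RMSE and R² from their textbook definitions on a few made-up observed and predicted durations; the numbers are hypothetical and unrelated to the results of this notebook.</p>

```python
import numpy as np

# Hypothetical observed and predicted travel durations in seconds
observed = np.array([1800.0, 1920.0, 2100.0, 2400.0])
predicted = np.array([1810.0, 1900.0, 2130.0, 2380.0])

residuals = observed - predicted

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(round(float(rmse), 2), round(float(r2), 4))
```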

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> F-statistics (FSTAT) conducts an F-test. An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>F_score = F_score value from the F-test.</li>
<li>F_Critcialvalue = F critical value from the F-test.</li>
<li>p_value = Probability value associated with the F_score value.</li>
<li>F_conclusion = F-test result, either 'reject null hypothesis' or 'fail to reject null hypothesis'. If F_score > F_Critcialvalue, then 'reject null hypothesis' Else 'fail to reject null hypothesis'</li>
</ul>
</p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this notebook we have seen end-to-end model creation using the Teradata Vantage In-Database functions. We have seen how to use the nPath function to get the sequence of events that leads to the desired output event. We can further analyse these events by building a model that shows which event has the most impact on the outcome. Here we have built a basic model with a reasonable R2 value (regression models with an R2 higher than 0.8 are generally considered good), and you can experiment by adjusting the model parameters to observe their impact on predictions and evaluation metrics.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C;'>
We need to clean up our work tables to prevent errors the next time the notebook is run.</p>

In [None]:
tables = ['npath_data','my_dataset']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name = table)
    except Exception:
        pass  # Ignore failures, e.g. when the table does not exist



In [None]:
execute_sql('DROP VIEW usecase_dataset;')

<p style = 'font-size:18px;font-family:Arial;color:#00233C;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_TrainDelay');" 
#Takes 10 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;background-color:#00233C;">
 
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Filters:</b></p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Industry:</b> Transportation</li>
    <li><b>Functionality:</b> Machine Learning</li>
    <li><b>Use Case:</b> Delay Predictions</li>
    </ul>
    <p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Related Resources:</b></p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><a href = 'https://www.teradata.com/Blogs/Using-a-Lake-Centric-Modernization-Approach'>Using a Lake-Centric Modernization Approach to Clean Up a Data and Compute Mess</a></li>
    <li><a href = 'https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right</a></li>
    <li><a href = 'https://www.teradata.com/Blogs/Data-Analytics-Keeps-the-Wheels-on-the-Bus'>Data & Analytics Keep the Wheels on the Bus!</a></li>
        </ul> 

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
       <li>Teradata Vantage™ - Analytics Database Analytic Functions - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Introduction-to-Analytics-Database-Analytic-Functions '>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Introduction-to-Analytics-Database-Analytic-Functions </a></li>    
  <li>Teradata® Package for Python User Guide - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python'>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python</a></li>
  <li>Teradata® Package for Python Function Reference - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference'>https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference</a></li>      
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023,2024. All Rights Reserved
        </div>
    </div>
</footer>