<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Behavioral Analysis and Visualization using Vantage and Plotly</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>
This notebook demonstrates two Vantage functions for behavioral analysis:
    <li style = 'font-size:16px;font-family:Arial'> Sessionize Function - Assign a Session Identifier to a series of events within a time window </li>
    <li style = 'font-size:16px;font-family:Arial'> nPath Function - Analyze events and match patterns to collect specific user paths </li>
</p>
<br>
<p style = 'font-size:16px;font-family:Arial'>
nPath is useful when your goal is to identify the paths that lead to an outcome. The nPath function scans a set of rows, looking for patterns that you specify. For each set of input rows that matches the pattern, nPath produces a single output row. The function provides a flexible pattern-matching capability that lets you specify complex patterns in the input data and define the values that are output for each matched input set. For example, you can use nPath to analyze:
<li style = 'font-size:16px;font-family:Arial'>Web site click data, to identify paths that lead to sales over a specified amount</li>
<li style = 'font-size:16px;font-family:Arial'>Sensor data from industrial processes, to identify paths to poor product quality</li>
<li style = 'font-size:16px;font-family:Arial'>Healthcare records of individual patients, to identify paths that indicate that patients are at risk of developing conditions such as heart disease or diabetes</li>
<li style = 'font-size:16px;font-family:Arial'>Financial data for individuals, to identify paths that provide information about credit or fraud risks</li>
</p>    


<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>1.  Connect to Vantage, import Python packages and explore the dataset</b></p>


<p style = 'font-size:16px;font-family:Arial'>Let us start by importing the required libraries, setting environment variables, and connecting to Vantage.</p>

In [None]:
#import libraries
import getpass
import warnings


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from teradataml import *


display.max_rows = 5

warnings.filterwarnings('ignore')

<p style = 'font-size:16px;font-family:Arial'>You will be prompted for the password. Enter your password, press the Enter key, then use the down arrow to go to the next cell. The command below connects to the Vantage environment.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=BehavioralAnalysis_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You have the option of either running the demo using foreign tables to access the data without using any storage on your environment or downloading the data to local storage which may yield somewhat faster execution, but there could be considerations of available storage. There are two statements in the following cell, and one is commented out. You may switch which mode you choose by changing the comment string.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Retail_cloud');"
# takes about 30 seconds, estimated space: 0 MB
# %run -i ../run_procedure.py "call get_data('DEMO_Retail_local');" 
# takes about 50 seconds, estimated space: 23 MB

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<p style = 'font-size:16px;font-family:Arial'>
Source events data may come from other source systems, log files, Object Storage, etc.  This section illustrates connecting to existing customer event data using the teradataml python package.

<p style = 'font-size:16px;font-family:Arial'>
This notebook makes use of two powerful behavioral analysis functions available in Vantage:
<ol style = 'font-size:16px;font-family:Arial'>    
 <li><b>Sessionize</b>: groups a series of events into a session identified by a numeric key.</li>
<li><b>nPath</b>: a pattern-matching function that analyzes and collects data across rows.</li>
</ol>    

In [None]:
#1.  Create the teradataml dataframe from our source table - this creates a pointer to the location without moving data
tdf_retail_events = DataFrame(in_schema('DEMO_Retail', 'Retail_Events'))
tdf_retail_events.head()

<p style = 'font-size:16px;font-family:Arial'>Sample data shows the events in the table we have linked in the dataframe.

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>2.  Sessionize Function</b></p>
<p style = 'font-size:16px;font-family:Arial'>This function can be called as a method of the teradataml library or in literal SQL.<br>This notebook uses the SQL syntax; please refer to BehavioralAnalysis_Python.ipynb for the teradataml Python syntax.<br><br>
The Sessionize function maps each click in a session to a unique session identifier. A session is a sequence of clicks by one user that are separated by at most 'n' seconds. Here we use a timeout of 24 hours (86400 seconds) and observe the user behaviour within each session.
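<p style = 'font-size:16px;font-family:Arial'>The session rule described above can be sketched locally in pandas. This is only an illustration of the idea, not how Vantage implements Sessionize: the function name <code>sessionize</code>, the toy events, and the numbering scheme (integers starting at 0 within each user) are assumptions of this sketch and may not match the ids the in-database function assigns.</p>

```python
import pandas as pd


def sessionize(events, user_col, time_col, timeout_seconds):
    """Assign a session number per user: a new session starts whenever the
    gap to the user's previous event exceeds timeout_seconds."""
    ordered = events.sort_values([user_col, time_col]).copy()
    # Seconds since the same user's previous event (NaN for the first event).
    gap = ordered.groupby(user_col)[time_col].diff().dt.total_seconds()
    # NaN compares as False, so a user's first event never opens a "new" session.
    new_session = (gap > timeout_seconds).astype(int)
    # Running count of session breaks per user gives the session number.
    ordered["sessionid"] = new_session.groupby(ordered[user_col]).cumsum()
    return ordered


# Hypothetical toy data, not taken from DEMO_Retail.Retail_Events.
toy = pd.DataFrame({
    "entity_id": [1, 1, 1, 2],
    "datestamp": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-01 18:00",
        "2023-01-03 10:00", "2023-01-01 12:00",
    ]),
    "event": ["Browse", "Purchase", "Browse", "Browse"],
})

toy_sessions = sessionize(toy, "entity_id", "datestamp", 86400)
print(toy_sessions)
```

<p style = 'font-size:16px;font-family:Arial'>In the toy data, the third event of entity 1 arrives 40 hours after the second, so it opens a second session for that entity.</p>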

In [None]:
#2. Call the Sessionize function in SQL. The clauses used below:
#PARTITION BY - unique identifier of the user or entity we consolidate events for.
#ORDER BY - the column or list of columns used to order the events within each partition.
#TimeColumn - column holding the event timestamps that the session time boundary is applied to.
#TimeOut - maximum gap in seconds between consecutive events of one session (86400 = 24 hours).
#DataFrame.from_query wraps the query result in a teradataml dataframe (virtual dataframe).

qry = '''
SELECT * FROM Sessionize (
 ON DEMO_Retail.Retail_Events
 PARTITION BY entity_id
 ORDER BY datestamp
 USING
 TimeColumn ('datestamp')
 TimeOut (86400)
) ;
'''

sessionized_events = DataFrame.from_query(qry)

sessionized_events

<p style = 'font-size:16px;font-family:Arial'>In the data returned above we can see that the function has assigned a sessionid to each event based on the TimeOut value we supplied.

In [None]:
#commit our sessionized results to a permanent table:
sessionized_events.to_sql(table_name = 'demo_sessionized_events', if_exists = 'replace')

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>3.  nPath Function</b>
<p style = 'font-size:16px;font-family:Arial'>As above, this function can be called as a method of teradataml or in literal SQL.<br>This notebook uses the SQL syntax; please refer to BehavioralAnalysis_Python.ipynb for the teradataml Python syntax.

<p style = 'font-size:16px;font-family:Arial'><b>Call a basic nPath SQL function.</b>
 <ul style = 'font-size:16px;font-family:Arial'>
     <li>nPath scans a set of rows, looking for patterns.</li>
     <li>For each set of input rows that matches the pattern, nPath produces a single output row.</li>
  </ul>
<p style = 'font-size:16px;font-family:Arial'> This function allows for matching of complex patterns in the input data, as well as defining the output values for each matched set of rows.
<br>
For the below example:
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Pass the sessionized data by reference.</li>
    <li>Provide partitioning (session key) and ordering columns.</li>
    <li>Mode <b>OVERLAPPING</b> vs. <b>NONOVERLAPPING</b>
        <ul style = 'font-size:16px;font-family:Arial'>
            <li><b>OVERLAPPING</b> finds every occurrence of the match, regardless of the current row being part of a previous match.</li>
            <li><b>NONOVERLAPPING</b> starts matching again at the row that follows the previous match.</li>
        </ul>
    </li>
    <li>Symbols.  Create a set of column expression aliases that can be assembled into a pattern to match.
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Example: "EVENT = 'Mem Purchase' as P" will alias a match on the EVENT column when the content equals 'Mem Purchase'.</li>
        </ul>
    </li>
      <li>Pattern.  Compose a pattern to search for across the rows of events.  This pattern is composed of Symbols and directives.
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Example: '^P' uses a directive ^ to indicate the P Symbol must occur at the beginning of the group of rows</li>
        </ul>
    </li>
    <li>Result.  Since nPath emits a single row for each matched group of rows, Result specifies which columns make up that row and how to aggregate the data.</li>
    </ol>
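
<p style = 'font-size:16px;font-family:Arial'>The difference between the two modes can be illustrated outside the database. The sketch below is a simplification and not the nPath implementation: it assumes each row has already been reduced to a one-letter symbol, and it uses Python's <code>re</code> module in place of the nPath pattern syntax (in nPath, '.' means "followed by", which in a regular expression is plain concatenation). The function name <code>find_matches</code> is hypothetical.</p>

```python
import re


def find_matches(symbols, pattern, mode="NONOVERLAPPING"):
    """Return (start, end) row ranges in `symbols` that match `pattern`.

    symbols: one letter per row, already in partition order.
    pattern: a regular expression over those letters.
    """
    regex = re.compile(pattern)
    if mode == "NONOVERLAPPING":
        # finditer resumes the search right after the previous match.
        return [(m.start(), m.end()) for m in regex.finditer(symbols)]
    # OVERLAPPING (simplified): try a match starting at every row, even if
    # that row already took part in an earlier match.
    matches = []
    for start in range(len(symbols)):
        m = regex.match(symbols, start)
        if m:
            matches.append((m.start(), m.end()))
    return matches


# Three consecutive rows of symbol A, pattern "A followed by A".
print(find_matches("AAA", "AA"))                      # [(0, 2)]
print(find_matches("AAA", "AA", mode="OVERLAPPING"))  # [(0, 2), (1, 3)]
```

<p style = 'font-size:16px;font-family:Arial'>With three rows and a two-row pattern, the non-overlapping search reports one match, while the overlapping search reports two because the middle row is reused.</p>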
    

In [None]:
qry = '''
 SELECT * FROM NPATH
    (ON demo_sessionized_events
        PARTITION BY sessionid
        ORDER BY datestamp
    USING
        MODE (NONOVERLAPPING)
        SYMBOLS (True as A ,event IN ('Mem Cancel') AS B)
        PATTERN ('A{2,5}.B')
        RESULT (FIRST (entity_id OF A) AS entity_id, 
                FIRST (sessionid OF A) AS sessionid, 
                ACCUMULATE (cast(event as VARCHAR(50) CHARACTER SET UNICODE NOT CASESPECIFIC) OF ANY(A,B)) AS path, 
                COUNT (* OF ANY (A,B)) AS event_cnt)
    );
'''

npath_mem_cancel = DataFrame.from_query(qry)

npath_mem_cancel

<p style = 'font-size:16px;font-family:Arial'>
Each output row shows the path of events a customer took to reach the final event ('Mem Cancel') specified in the SQL call.

<p style = 'font-size:16px;font-family:Arial'>
    <b> Call the nPath function in literal SQL.</b><br>
Since nPath emits a single row per match, it greatly reduces the number of rows returned from the function call.  As such, it may be possible or desirable to return the data directly using pandas read_sql instead of creating a view, table, or other in-database object:
<br>
    <b> Construct the SQL Statement </b><br>
As an example, construct a statement to match sessions where either 'Membership Purchase' or 'Product Purchase' occurred after a series of at least one and no more than five prior actions:
<ol style = 'font-size:16px;font-family:Arial'>    
   <li>Create Three Symbols:
       <ul style = 'font-size:16px;font-family:Arial'> 
           <li>MP: Membership Purchase</li>
           <li>PP: Product Purchase</li>
           <li>A: Match any row not Membership Purchase or Product Purchase</li>
       </ul>
    </li>
    <li>Assemble the Symbols into a Pattern using directives to match any A event between one and five times preceding MP OR PP:
        <ul style = 'font-size:16px;font-family:Arial'> 
            <li> A{1,5}.(PP|MP)</li>
        </ul>
    </li>
    <li>Return the sessionid, path, and number of steps</li>
    </ol>
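
<p style = 'font-size:16px;font-family:Arial'>The three steps above can be mimicked on a single session in plain Python. This is a sketch under stated assumptions, not the nPath algorithm: the one-letter encoding (<code>P</code>, <code>M</code>, <code>A</code>), the function name <code>purchase_paths</code>, and the toy session are hypothetical, and the pattern is rewritten as the regular expression <code>A{1,5}[PM]</code>.</p>

```python
import re

# Hypothetical one-letter stand-ins for the PP and MP symbols;
# every other event is encoded as A.
SYMBOL_FOR_EVENT = {"Purchase": "P", "Mem Purchase": "M"}


def purchase_paths(events):
    """Return the event paths in one ordered session that consist of one to
    five non-purchase events followed by Purchase or Mem Purchase."""
    encoded = "".join(SYMBOL_FOR_EVENT.get(event, "A") for event in events)
    return [
        events[m.start():m.end()]
        for m in re.finditer(r"A{1,5}[PM]", encoded)
    ]


# Seven non-purchase events followed by a purchase (toy data).
session = ["Browse"] * 7 + ["Purchase"]
paths = purchase_paths(session)
for path in paths:
    print(len(path), path)
```

<p style = 'font-size:16px;font-family:Arial'>Because the pattern allows at most five preceding events, the match in this sketch keeps only the last five 'Browse' events before the purchase, giving a path of six events.</p>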

In [None]:
qry = '''
SELECT * FROM NPATH
    (ON demo_sessionized_events
        PARTITION BY sessionid
        ORDER BY datestamp
    USING
        MODE (NONOVERLAPPING)
        SYMBOLS (event = 'Purchase' AS PP, event = 'Mem Purchase' AS MP, event NOT IN ('Purchase', 'Mem Purchase') AS A)
        PATTERN ('A{1,5}.(PP|MP)')
        RESULT (FIRST (datestamp of A) as start_time, 
                FIRST (entity_id of A) as entity_id, 
                FIRST (sessionid of ANY(MP, A, PP)) as sessionid, 
                COUNT (* of ANY(MP, A, PP)) as event_cnt, 
                ACCUMULATE (cast(event as VARCHAR(50) CHARACTER SET UNICODE NOT CASESPECIFIC) of ANY(MP, A, PP)) as path)
    );
'''

npath_purchase = DataFrame.from_query(qry)

npath_purchase

<p style = 'font-size:16px;font-family:Arial'> Here too, the nPath function returns the path each customer took to the final event (Purchase or Mem Purchase) specified in the Pattern parameter of the SQL.

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>4.  Analysis and Visualization</b>
<p style = 'font-size:16px;font-family:Arial'> Analysis can be performed on the data in-database, or externally with the local pandas dataframe:

In [None]:
#Operate on the data as it lies in the database, and only retrieve the result of the aggregation

npath_mem_cancel.groupby(['path']).count().sort(['count_sessionid'], ascending = False).head()

In [None]:
#or - operate on the data in the pandas dataframe:
df_npath_pandas = npath_mem_cancel.to_pandas()

ax = df_npath_pandas['event_cnt'].value_counts().sort_values().plot(kind='bar', figsize=(7, 6), rot=0)
plt.xlabel("Event count in path")
plt.ylabel("Number of paths")
plt.title("Number of events in a Path", y = 1.02)

<p style = 'font-size:16px;font-family:Arial'> In our nPath call we used a pattern whose final event is 'Mem Cancel'. The bar chart above shows how many matched paths contain each number of events.

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1  Sankey Charts</b></p>
<p style = 'font-size:16px;font-family:Arial'> To visualize the distribution of the different event paths, we typically use a Sankey diagram of the aggregated paths reported by the nPath function.
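
<p style = 'font-size:16px;font-family:Arial'> A Sankey diagram is drawn from counted transitions between consecutive events. The sketch below shows one way such counts could be derived from path strings; it is not the code used by the plotting helper in the next cell. It assumes each path is a bracketed, comma-separated string such as <code>[Browse, Purchase]</code>, and the function names and toy paths are hypothetical.</p>

```python
from collections import Counter


def parse_path(path_text):
    """Split a path string such as '[Browse, Purchase]' into its events.
    The bracketed, comma-separated format is an assumption of this sketch."""
    inner = path_text.strip().strip("[]")
    return [step.strip() for step in inner.split(",") if step.strip()]


def sankey_links(paths):
    """Count (position, source, target) transitions over all paths."""
    links = Counter()
    for path_text in paths:
        steps = parse_path(path_text)
        # Pair each event with the one that follows it in the same path.
        for position, (source, target) in enumerate(zip(steps, steps[1:])):
            links[(position, source, target)] += 1
    return links


# Hypothetical paths, not taken from the demo data.
toy_paths = [
    "[Browse, Purchase]",
    "[Browse, Purchase]",
    "[Search, Browse, Purchase]",
]
links = sankey_links(toy_paths)
for (position, source, target), count in sorted(links.items()):
    print(position, source, "->", target, count)
```

<p style = 'font-size:16px;font-family:Arial'> Each counted triple corresponds to one link in the diagram: its source node, its target node, and the width of the flow between them.</p>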


In [None]:
from tdnpathviz.visualizations import plot_first_main_paths

In [None]:
%%time
plot_first_main_paths(npath_purchase,path_column='path',id_column='entity_id',width=1100)

<p style = 'font-size:16px;font-family:Arial'> To see the details of any path or node, hover the mouse pointer over it. For example, hover over the widest dark green path at the top, which leads to the rightmost node (Purchase). The count shows the number of entities who followed that path from Product Return ---> Purchase.<br>
When the pointer is over a node, for example the long purple Product Return node at the top right, it shows the incoming flow count and the outgoing flow count. The incoming flow count is the number of different events that led to the event under consideration, and the outgoing flow count is the number of different events that followed it. Other nodes and paths can be analyzed in the same way.
<p style = 'font-size:16px;font-family:Arial'> This visualization takes the Teradata nPath output as its input. Here too we can see the events customers took on the way to the final event of 'Purchase' or 'Mem Purchase' (membership purchase).

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>5.  Clean up</b></p>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Worktables </b></p>

In [None]:
db_drop_table(table_name='demo_sessionized_events') 

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Retail');" 
#Takes 10 seconds
#Please note that the same data is used in UseCases/TextProcessing_TF_IDF notebooks

In [None]:
remove_context()

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradata Python Package User Guide: <a href = 'https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg'>https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg</a></li>
    <li>Teradataml Python Reference: <a href = 'https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA'>https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA</a></li>
    <li>Teradata nPath Function Reference: <a href = 'https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/wjkE42ypEfeMkRFOIqVXfQ'>https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/wjkE42ypEfeMkRFOIqVXfQ</a></li>
    <li>Teradata Sessionize Function Reference: <a href = 'https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/RNbOiUg9~r~cxSZHrR~sFQ'>https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/RNbOiUg9~r~cxSZHrR~sFQ</a></li>
        <li>Python Pandas Reference: <a href = 'https://pandas.pydata.org/docs/user_guide/index.html'>https://pandas.pydata.org/docs/user_guide/index.html</a></li>
        <li>Plotly Reference: <a href = 'https://plotly.com/'>https://plotly.com/</a></li>
</ul>


<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>