<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Behavioral Analysis and Visualization using Vantage and Plotly</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>
This notebook demonstrates two Vantage functions for behavioral analysis:
    <li style = 'font-size:16px;font-family:Arial'> Sessionize Function - Assign a Session Identifier to a series of events within a time window </li>
    <li style = 'font-size:16px;font-family:Arial'> nPath Function - Analyze events and match patterns to collect specific user paths </li>
</p>
<br>
<p style = 'font-size:16px;font-family:Arial'>
nPath is useful when your goal is to identify the paths that lead to an outcome. The nPath function scans a set of rows, looking for patterns that you specify. For each set of input rows that matches the pattern, nPath produces a single output row. The function provides a flexible pattern-matching capability that lets you specify complex patterns in the input data and define the values that are output for each matched input set. For example, you can use nPath to analyze:
<li style = 'font-size:16px;font-family:Arial'>Web site click data, to identify paths that lead to sales over a specified amount</li>
<li style = 'font-size:16px;font-family:Arial'>Sensor data from industrial processes, to identify paths to poor product quality</li>
<li style = 'font-size:16px;font-family:Arial'>Healthcare records of individual patients, to identify paths that indicate that patients are at risk of developing conditions such as heart disease or diabetes</li>
<li style = 'font-size:16px;font-family:Arial'>Financial data for individuals, to identify paths that provide information about credit or fraud risks</li>
</p>    


<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>1. Import Python packages, connect to Vantage, and explore the dataset</b></p>


<p style = 'font-size:16px;font-family:Arial'>Install the Python libraries that are not installed by default in the environment.</p>


In [None]:
%%capture
!pip install --user colorlover

<p style = 'font-size:16px;font-family:Arial'><b><i>*BEFORE proceeding, please RESTART the kernel to bring new software into Jupyter.</i></b></p>

In [None]:
#import libraries
import getpass
import warnings
import datetime
from collections import defaultdict

import pandas as pd
import numpy as np

import teradataml.dataframe.dataframe as tdf
from teradataml.dataframe.dataframe import in_schema
from teradataml.context.context import create_context, remove_context, get_context
from teradataml.dataframe.copy_to import copy_to_sql
from teradataml.dataframe.fastload import fastload

from teradataml.analytics.sqle.Sessionize import Sessionize
from teradataml.analytics.sqle.NPath import NPath

##import tdconnect

from teradatasqlalchemy.types import *

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
import colorlover as cl

offline.init_notebook_mode()

warnings.filterwarnings('ignore')

<p style = 'font-size:16px;font-family:Arial'>The command below connects to the Vantage environment. You will be prompted for the password: enter it, press the Enter key, then use the down arrow to move to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=BehavioralAnalysis.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Because this demo uses a temporal table, we create the databases and tables in local storage and use them in the notebook. Execute the procedure in the next cell.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Retail_cloud');"
# Takes about 30 seconds, estimated space: 0 MB
#%run -i ../run_procedure.py "call get_data('DEMO_Retail_local');"
# Takes about 50 seconds, estimated space: 23 MB

<p style = 'font-size:16px;font-family:Arial'>The next step is optional. Run it to see the status of the databases and tables created and the space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<p style = 'font-size:16px;font-family:Arial'>
Source event data may come from other source systems, log files, object storage, and so on. This section illustrates connecting to existing customer event data using the teradataml Python package.</p>

<p style = 'font-size:16px;font-family:Arial'>
This notebook makes use of two powerful behavioral analysis functions available in Vantage:
<ol style = 'font-size:16px;font-family:Arial'>    
 <li><b>Sessionize</b>: groups a series of events into a session identified by a numeric key.</li>
<li><b>nPath</b>: a sophisticated pattern-matching function that analyzes and collects data across rows.</li>
</ol>    

In [None]:
#1.  Create the teradataml dataframe from our source table - this creates a pointer to the location without moving data
tdf_retail_events = tdf.DataFrame(in_schema('DEMO_Retail', 'Retail_Events'))
tdf_retail_events.head()

<p style = 'font-size:16px;font-family:Arial'>The sample data shows the events in the table that the dataframe points to.</p>

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>2.  Sessionize Function</b></p>
<p style = 'font-size:16px;font-family:Arial'>This function can be called as a method of the teradataml library or in literal SQL.<br>
The Sessionize function maps each click in a session to a unique session identifier. A session is a sequence of clicks by one user in which consecutive clicks are separated by at most <i>n</i> seconds. Here we use a timeout of 24 hours (86,400 seconds) and observe the user behavior within each session.</p>

In [None]:
#2. Call the Sessionize function.  This function has several required parameters:
#data_partition_column - unique identifier of the user or entity whose events are grouped into sessions.
#data_order_column - the column or list of columns used to order the events within each partition.
#time_column - the timestamp column used to measure the time between events.
#time_out - maximum gap in seconds (float) between consecutive events in the same session; 86400 seconds (24 hours) below.
#The function returns an instance of the "Sessionize" object.  Its "result" property is a teradataml dataframe (virtual dataframe).

sessionized_events = Sessionize(data = tdf_retail_events, 
                               data_partition_column = ['entity_id'], 
                               data_order_column = ['datestamp'], 
                               time_column = 'datestamp', 
                               time_out = 86400.00)

sessionized_events.result.head()

<p style = 'font-size:16px;font-family:Arial'>In the data returned above, the function has assigned a sessionid to each event based on the time_out value we supplied.</p>
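<p style = 'font-size:16px;font-family:Arial'>To make the session rule concrete, here is a minimal pure-Python sketch of the same idea, run on a few hypothetical events. It is an illustration only, not how Vantage implements Sessionize: the function name <code>sessionize</code>, the sample data, the numbering of sessions from 0, and the handling of a gap exactly equal to the timeout are all assumptions of this sketch.</p>

```python
from datetime import datetime, timedelta
from itertools import groupby


def sessionize(events, time_out_seconds):
    """Assign a per-user session number to (user, timestamp) events.

    A new session starts whenever the gap since the same user's previous
    event exceeds time_out_seconds. (Hypothetical helper, for illustration.)
    """
    limit = timedelta(seconds=time_out_seconds)
    # Order by user, then by time, like partitioning and ordering in the database
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    result = []
    for user, user_events in groupby(ordered, key=lambda e: e[0]):
        session_id = 0
        previous = None
        for _, stamp in user_events:
            if previous is not None and stamp - previous > limit:
                session_id += 1  # gap too large: start a new session
            result.append((user, stamp, session_id))
            previous = stamp
    return result


# Hypothetical sample events (not taken from the demo table)
sample_events = [
    ("u1", datetime(2023, 1, 1, 9, 0)),
    ("u1", datetime(2023, 1, 1, 18, 0)),  # 9 hours after the previous event
    ("u1", datetime(2023, 1, 3, 9, 0)),   # more than 24 hours after the previous event
    ("u2", datetime(2023, 1, 1, 10, 0)),
]

sessions = sessionize(sample_events, 86400)
for row in sessions:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial'>With a 24-hour timeout, the first two events of user u1 fall into one session and the third starts a new one, while user u2 gets a separate session sequence of their own.</p>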

In [None]:
#commit our sessionized results to a permanent table:
sessionized_events.result.to_sql(table_name = 'demo_sessionized_events', if_exists = 'replace')

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>3.  nPath Function</b></p>
<p style = 'font-size:16px;font-family:Arial'>As above, this function can be called as a method of teradataml or in literal SQL. The teradataml form is used below.</p>

In [None]:
#Call the nPath function of teradataml library

#First, use the result attribute of the sessionize object to efficiently get a pointer to the data:
tdf_sessionized_events = sessionized_events.result

<p style = 'font-size:16px;font-family:Arial'><b>Call a basic nPath function.</b>
 <ul style = 'font-size:16px;font-family:Arial'>
     <li>nPath scans a set of rows, looking for patterns.</li>
     <li>For each set of input rows that matches the pattern, nPath produces a single output row.</li>
  </ul>
<p style = 'font-size:16px;font-family:Arial'> This function allows for matching of complex patterns in the input data, as well as defining the output values for each matched set of rows.
<br>
In the example below:
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Pass the sessionized data by reference.</li>
    <li>Provide partitioning (session key) and ordering columns.</li>
    <li>Mode <b>OVERLAPPING</b> vs. <b>NONOVERLAPPING</b>
        <ul style = 'font-size:16px;font-family:Arial'>
            <li><b>OVERLAPPING</b> finds every occurrence of the match, regardless of the current row being part of a previous match.</li>
            <li><b>NONOVERLAPPING</b> starts matching again at the row that follows the previous match.</li>
        </ul>
    </li>
    <li>Symbols.  Create a set of column expression aliases that can be assembled into a pattern to match.
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Example: "EVENT = 'Mem Purchase' as P" will alias a match on the EVENT column when the content equals 'Mem Purchase'.</li>
        </ul>
    </li>
      <li>Pattern.  Compose a pattern to search for across the rows of events.  This pattern is composed of Symbols and directives.
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Example: '^P' uses a directive ^ to indicate the P Symbol must occur at the beginning of the group of rows</li>
        </ul>
    </li>
    <li>Result.  Because nPath emits a single row for each matched group of rows, Result specifies which columns make up that row and how the matched rows are aggregated into them.</li>
    </ol>
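<p style = 'font-size:16px;font-family:Arial'>The Symbols, Pattern, and Mode ideas above can be sketched locally with Python's <code>re</code> module. This is only an analogy under stated assumptions, not how nPath is implemented: each event in a hypothetical session is encoded as one character ('B' for 'Mem Cancel', 'x' for anything else), the always-true symbol A becomes the regex wildcard <code>.</code>, and the pattern A{2,5}.B becomes the regex <code>.{2,5}B</code>. The helper names <code>encode</code> and <code>match_paths</code> are made up for this sketch.</p>

```python
import re

CANCEL = "Mem Cancel"
# A{2,5}.B : 2 to 5 rows of any kind (A), followed by a 'Mem Cancel' row (B)
PATTERN = r".{2,5}B"


def encode(events):
    """Map each event to one character: 'B' for 'Mem Cancel', 'x' otherwise."""
    return "".join("B" if event == CANCEL else "x" for event in events)


def match_paths(events, overlapping=False):
    """Return the sub-lists of events matched by PATTERN (hypothetical helper)."""
    codes = encode(events)
    if overlapping:
        # A lookahead lets a match start at every row, even inside an earlier match
        regex = re.compile("(?=(" + PATTERN + "))")
        spans = [m.span(1) for m in regex.finditer(codes)]
    else:
        # finditer resumes at the row that follows the previous match
        spans = [m.span() for m in re.finditer(PATTERN, codes)]
    return [events[start:end] for start, end in spans]


# Hypothetical session, ordered by time
session = ["Browse", "Call", "Browse", "Mem Cancel", "Browse", "Call", "Mem Cancel"]

non_overlapping = match_paths(session)
overlapping = match_paths(session, overlapping=True)

print(len(non_overlapping), "non-overlapping matches")
print(len(overlapping), "overlapping matches")
for path in non_overlapping:
    print(" -> ".join(path))
```

<p style = 'font-size:16px;font-family:Arial'>In this sketch the non-overlapping search reports each row in at most one path, while the overlapping search reports a path for every starting row that can reach a 'Mem Cancel', so the same row can appear in several paths.</p>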
    

In [None]:
#Create two symbols and assemble them into a pattern:
# 1. True as A - matches any row
# 2. EVENT in ('Mem Cancel') as B - matches rows where the EVENT column is 'Mem Cancel'
# The pattern A{2,5}.B matches between 2 and 5 rows of any kind (A) followed by a 'Mem Cancel' row (B)

npath_sessions = NPath(data1 = tdf_sessionized_events, 
                      data1_partition_column = ['SESSIONID'], 
                      data1_order_column = 'datestamp', 
                      mode = 'NONOVERLAPPING', 
                      symbols = ['True as A', 'EVENT in (\'Mem Cancel\') as B'], 
                      pattern = 'A{2,5}.B', 
                      result = ['FIRST (entity_id OF A) AS entity_id', 
                               'FIRST (sessionid OF A) AS sessionid', 
                               'ACCUMULATE (cast(event as VARCHAR(50) CHARACTER SET UNICODE NOT CASESPECIFIC) OF ANY(A,B)) AS path', 
                               'COUNT (* OF ANY (A,B)) AS event_cnt'])

npath_sessions.result.head()

<p style = 'font-size:16px;font-family:Arial'>
Here the nPath function has assembled the path each customer took to the final event ('Mem Cancel') specified in the pattern.</p>

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>4.  Analysis and Visualization</b></p>
<p style = 'font-size:16px;font-family:Arial'> Analysis can be performed on the data in-database, or externally with a local pandas dataframe:</p>

In [None]:
#Operate on the data as it lies in the database, and only retrieve the result of the aggregation

npath_sessions.result.groupby(['path']).count().sort(['count_sessionid'], ascending = False).head()

In [None]:
#or - retrieve the result and operate on it in a local pandas dataframe:
df_npath_pandas = npath_sessions.result.to_pandas()
df_npath_pandas['event_cnt'].plot(kind = 'hist');

<p style = 'font-size:16px;font-family:Arial'> The nPath pattern above ends with the 'Mem Cancel' event. The histogram shows how the number of events per matched path (event_cnt) is distributed; because the pattern is A{2,5}.B, each path contains between 3 and 6 events.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1  Sankey Charts</b></p>
<p style = 'font-size:16px;font-family:Arial'> Sankey charts can help visualize flows and volumes of flows through a system. Below is a basic overview of Sankey charting with Plotly.


In [None]:
#Very basic Sankey as illustrated in the presentation
#This assumes three user paths:
#Browse->Call->Refund (1)
#Call->Browse->Refund (1)
#Browse->Refund (1)

sankeyChart = dict(
    type='sankey',
    node = dict(
      pad = 15,
      thickness = 20,
      label = ["Browse", "Call", "Call", "Browse", "Refund"], #Ordered list of unique steps
      color = cl.scales['5']['seq']['BuPu']
    ),
    link = dict(
        source = [0,1,2,3], #The index of the list of nodes that represent the "left" side of a pair-path
        target = [2,3,4,4], #The "right" side of the pair-path
        value = [1,1,1,2] #how many of those pair-paths occurred
    )
  )


fig = dict(data=[sankeyChart])
iplot(fig, validate=False)

<p style = 'font-size:16px;font-family:Arial'> This Sankey chart shows the events leading to the Refund event for the three paths Browse->Call->Refund, Call->Browse->Refund, and Browse->Refund.</p>
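<p style = 'font-size:16px;font-family:Arial'>The source, target, and value lists in the basic chart were written by hand. As a minimal sketch of how they could be derived from the three paths, the hypothetical helper <code>paths_to_sankey</code> below right-aligns the paths so that the final events share a column, treats each (column, event) pair as a node, and counts how many paths traverse each pair of adjacent nodes. It sorts nodes by column and then by name, so the node order (and therefore the indices) differs from the hand-written cell, but the links describe the same flows.</p>

```python
from collections import Counter


def paths_to_sankey(paths):
    """Derive Sankey labels and links from a list of event paths.

    Hypothetical helper for illustration: a node is a (column, event) pair,
    and a link's value is the number of paths that traverse it.
    """
    max_len = max(len(path) for path in paths)
    link_counts = Counter()
    for path in paths:
        offset = max_len - len(path)  # right-align shorter paths
        nodes = [(offset + i, event) for i, event in enumerate(path)]
        for left, right in zip(nodes, nodes[1:]):
            link_counts[(left, right)] += 1

    # Unique nodes, ordered by column and then by event name
    node_list = sorted({node for pair in link_counts for node in pair})
    index = {node: i for i, node in enumerate(node_list)}

    labels = [event for _, event in node_list]
    links = sorted((index[l], index[r], n) for (l, r), n in link_counts.items())
    source = [s for s, _, _ in links]
    target = [t for _, t, _ in links]
    value = [v for _, _, v in links]
    return labels, source, target, value


# The three user paths from the basic example
example_paths = [
    ["Browse", "Call", "Refund"],
    ["Call", "Browse", "Refund"],
    ["Browse", "Refund"],
]

labels, source, target, value = paths_to_sankey(example_paths)
print(labels)
print(source, target, value)
```

<p style = 'font-size:16px;font-family:Arial'>The resulting lists can be placed in the <code>label</code>, <code>source</code>, <code>target</code>, and <code>value</code> entries of a Sankey dictionary like the one in the cell above.</p>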

In [None]:
#Convert Teradata nPath output to plotly Sankey
#Each node key is prefixed with a three-digit position (100 + position), so paths of up to 900 events are supported

 
dataDict = defaultdict(int)
eventDict = defaultdict(int)
maxPath = df_npath_pandas['event_cnt'].max()


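for index, row in df_npath_pandas.iterrows():
    rowList = row['path'].replace('[','').replace(']','').split(',')
    #each nPath output row is one matched path, so it adds 1 to every link it traverses
    #(event_cnt is the number of events in the path, not the number of times the path occurred)
    pathCnt = 1
    pathLen = len(rowList)
    for i in range(len(rowList)-1):
        #right-align the paths so that the final events share the last column
        leftValue = str(100 + i + maxPath - pathLen) + rowList[i].strip()
        rightValue = str(100 + i + 1 + maxPath - pathLen) + rowList[i+1].strip()
        valuePair = leftValue + '+' + rightValue
        dataDict[valuePair] += pathCnt
        eventDict[leftValue] += 1
        eventDict[rightValue] += 1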

eventList = []
for key,val in eventDict.items():
    eventList.append(key)

sortedEventList = sorted(eventList)
sankeyLabel = []
for event in sortedEventList:
    sankeyLabel.append(event[3:])

sankeySource = []
sankeyTarget = []
sankeyValue = []

for key,val in dataDict.items():
    sankeySource.append(sortedEventList.index(key.split('+')[0]))
    sankeyTarget.append(sortedEventList.index(key.split('+')[1]))
    sankeyValue.append(val)

sankeyColor = []
for i in sankeyLabel:
    sankeyColor.append('blue')

sankeyChart = dict(
    type='sankey',
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(
        color = 'black',
        width = 0.5
      ),
      label = sankeyLabel,
      color = sankeyColor
    ),
    link = dict(
        source = sankeySource,
        target = sankeyTarget,
        value = sankeyValue
    )
  )
layout =  dict(
    title = "Basic Sankey Pathing",
    font = dict(
      size = 10
    )
)


fig = dict(data=[sankeyChart], layout=layout)
iplot(fig, validate=False)

<p style = 'font-size:16px;font-family:Arial'> This visualization converts the Teradata nPath output into a Plotly Sankey chart. It shows the events each customer went through before the final 'Mem Cancel' (membership cancellation) event matched by the nPath pattern above.</p>

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>5.  Clean up</b></p>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Worktables </b></p>

In [None]:
from teradataml import db_drop_table
db_drop_table(table_name='demo_sessionized_events', schema_name='DEMO_USER') 

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Retail');" 
#Takes 10 seconds
#Please note that the same data is used in UseCases/TextProcessing_TF_IDF notebooks

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradata Python Package User Guide: <a href = 'https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg'>https://docs.teradata.com/reader/eteIDCTX4O4IMvazRMypxQ/uDjppX7PJInABCckgu~KFg</a></li>
    <li>Teradataml Python Reference: <a href = 'https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA'>https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA</a></li>
    <li>Teradata nPath Function Reference: <a href = 'https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/wjkE42ypEfeMkRFOIqVXfQ'>https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/wjkE42ypEfeMkRFOIqVXfQ</a></li>
    <li>Teradata Sessionize Function Reference: <a href = 'https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/RNbOiUg9~r~cxSZHrR~sFQ'>https://docs.teradata.com/reader/CWVY0AJy8wyyf7Sm0EsK~w/RNbOiUg9~r~cxSZHrR~sFQ</a></li>
    <li>Python Pandas Reference: <a href = 'https://pandas.pydata.org/docs/user_guide/index.html'>https://pandas.pydata.org/docs/user_guide/index.html</a></li>
    <li>Plotly Reference: <a href = 'https://plotly.com/'>https://plotly.com/</a></li>
</ul>


<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>