<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Multi Touch Attribution using Vantage</b>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Target Audience</b></p>
 
<p style = 'font-size:16px;font-family:Arial'>This notebook is a simplified version of the MultiTouch_Attribution_PY_SQL notebook as it is targeted for the Business Analyst persona rather than the Data Scientist persona.</p>  
    
    
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Marketing attribution modelling techniques aim to determine the contribution of each marketing touchpoint or channel in influencing customer behaviour and driving conversions. These models provide valuable insights into the effectiveness of marketing efforts, helping businesses make informed decisions regarding resource allocation and optimization.</p>
<p style = 'font-size:16px;font-family:Arial'><a href='#rule'>Rule-based</a> attribution modelling relies on predetermined rules or heuristics to assign credit to various touchpoints along the customer journey. Common rule-based models include the First Touch, Last Touch, Uniform (linear) and Exponential (time decay) models. The First Touch model attributes all credit to the first touchpoint a customer interacts with, while the Last Touch model assigns all credit to the final touchpoint before conversion. The Uniform model evenly distributes credit across all touchpoints in the customer journey. The Exponential model assigns more credit to touchpoints closer to the conversion event.</p>
    
<p style = 'font-size:16px;font-family:Arial'><a href='#stat'>Statistical</a> and <a href='#ml'>Algorithmic-based</a> attribution modelling, on the other hand, utilizes advanced statistical and machine learning techniques to determine the contribution of each touchpoint. These models take into account various factors such as the order, timing, and interaction patterns of touchpoints.</p>
   
<p style = 'font-size:16px;font-family:Arial'>All approaches have their strengths and limitations. <a href='#rule'>Rule-based</a> models are relatively straightforward to implement and interpret, but they may oversimplify the complexity of customer journeys. <a href='#ml'>Algorithmic-based</a> models offer more sophisticated and granular insights but may require advanced analytics expertise and extensive data sets to achieve accurate results.
It is important for businesses to select the most suitable attribution modelling approach based on their specific goals, available data, and resources. Implementing an effective marketing attribution model can significantly enhance decision-making and optimize marketing strategies.</p>
    
<p style = 'font-size:16px;font-family:Arial'>In this use case we will show several different analytic techniques to perform Multi Touch Attribution modelling and analysis using Vantage.</p>
<img src="images/Attribution.png">    
<p style = 'font-size:16px;font-family:Arial'>Our innovative approach includes the use of <a href='#path'>Path Analysis</a> not only to identify and visualize customer conversion journeys but also to prepare data for advanced and sometimes creative techniques.</p>
</header>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>1. Start by connecting to the Vantage system.</b></p>


<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import teradataml as tdml
import getpass
import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
# import markov_model_attribution as mma
import seaborn as sns
import matplotlib.pyplot as plt
import tdnpathviz
from teradataml import *
import warnings
warnings.filterwarnings('ignore')
display.max_rows = 5

<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=MultiTouchAttribution_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Because this demo uses the nPath function, which needs all character data in the LATIN character set, we will only use the local option of creating tables and DDL.</p>   


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_MultiTouchAttribution_cloud');"
 # Takes about 30 secs
%run -i ../run_procedure.py "call get_data('DEMO_MultiTouchAttribution_local');"
 # Takes about 1 minute 30 secs

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>3. Analyze the raw data set</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>The dataset is digital marketing data containing 586,000 marketing touch-points from July (2018), comprising 240,000 unique customers who generated ~18,000 conversions. A more detailed description of the features is shown below:

<li style = 'font-size:14px;font-family:Arial'>Cookie: Anonymous customer id enabling us to track the progression of a given customer</li>
<li style = 'font-size:14px;font-family:Arial'>Timestamp: Date and time when the visit took place</li>
<li style = 'font-size:14px;font-family:Arial'>Interaction: Categorical variable indicating the type of interaction that took place</li>
<li style = 'font-size:14px;font-family:Arial'>Conversion: Boolean variable indicating whether a conversion took place</li>
<li style = 'font-size:14px;font-family:Arial'>Conversion Value: Value of the potential conversion event (revenue)</li>
<li style = 'font-size:14px;font-family:Arial'>Channel: The marketing channel that brought the customer to our site</li>
</p>
<p style = 'font-size:16px;font-family:Arial'>Create a DataFrame to get the data from the table created.</p>



In [None]:
attr_df=DataFrame(in_schema('DEMO_MultiTouchAttribution', 'Attribution_Data'))
attr_df

<p style = 'font-size:16px;font-family:Arial'>Each row of the Attribution data records a touchpoint's channel and timestamp, together with its conversion flag, conversion value and cost.</p>

In [None]:
df=attr_df.to_pandas().reset_index()
# Plot the number of conversions by channel
# .copy() avoids pandas' SettingWithCopyWarning when the 'time' column is added
conversions = df.loc[df['conversion'] == 1].copy()
conversions['time'] = conversions['tmstp'].dt.date
conversions = conversions[conversions['time']< pd.to_datetime("2018-7-30").date()]
# Sum only the conversion flag so that non-numeric columns are not aggregated
conversions = conversions.groupby('channel', as_index=False)['conversion'].sum()

fig = px.bar(conversions, x='channel', y='conversion', color='channel')

fig.update_layout(title='Channel Conversions',
                   xaxis_title='Channel',
                   yaxis_title='Conversions')
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The above chart shows the number of conversions by each Channel.</p>

In [None]:
channel_df=DataFrame(in_schema('DEMO_MultiTouchAttribution', 'Channel_Cost'))

<p style = 'font-size:16px;font-family:Arial'>The Channel data contains the channels and cost.</p>

In [None]:
df_plot= channel_df.to_pandas().reset_index()
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x = 'channel',y = 'cost',data = df_plot)
plt.xlabel('Channel')
plt.ylabel('Cost of Conversion')
plt.title('Channel Cost')

plt.show()

<p style = 'font-size:16px;font-family:Arial'>The cost of Online Video is highest and that of Instagram is lowest.</p>

<hr>
<a id="path"></a>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>4. PATH ANALYSIS</b></p>



<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1. Use nPath® to visualise conversion journeys</b></p>

<p style = 'font-size:16px;font-family:Arial'>We want to see how our customers are converting.</p>
<p style = 'font-size:16px;font-family:Arial'>Pathing is the process of discovering a sequence of antecedent actions that occur prior to a specific event of interest on sessionized data. Pathing discovers the most salient patterns across a group of individuals or entities based on which further actions are considered. Pathing allows you to provide an explanation of the relation and the relative importance of each factor.</p>

<p style = 'font-size:16px;font-family:Arial'>The nPath® function provides a flexible pattern-matching capability that lets you specify complex patterns in the input data and define the values that are output for each matched input set. This lets us use the nPath® analytic function in Vantage for pattern and time series analysis that is very hard to express in plain SQL. Here we want to see the common channel paths that customers take when they convert.</p>

<p style = 'font-size:16px;font-family:Arial'>In the nPath call below, note a few key points:</p>
<li style = 'font-size:16px;font-family:Arial'>The 'Pattern' we are searching for is up to 8 events followed by a conversion (conversion = 1).</li>
<li style = 'font-size:16px;font-family:Arial'>The 'Symbols' are 'EVENT' for any non-converting row (conversion = 0) and 'CONVERSION' for a row where the conversion column = 1.</li>
<li style = 'font-size:16px;font-family:Arial'>We create a dummy 'Conversion' event to enable its visualization.</li>
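<p style = 'font-size:16px;font-family:Arial'>To make the pattern concrete, the following is a minimal pure-Python sketch of what 'EVENT{0,8}.CONVERSION' selects for a single cookie. It is an illustration only, not how nPath is implemented: the helper name <code>conversion_paths</code> and its input format are assumptions, and, mirroring the CASE expression in the nPath call, the converting row itself is labelled 'Conversion'.</p>

```python
def conversion_paths(rows, max_events=8):
    """Emulate the pattern 'EVENT{0,max_events}.CONVERSION' for one cookie.

    rows: (channel, converted) tuples already ordered by timestamp.
    This helper and its input format are hypothetical, for illustration only.
    """
    paths, pending = [], []
    for channel, converted in rows:
        if converted:
            # Keep at most max_events non-converting rows directly before the conversion
            tail = pending[max(0, len(pending) - max_events):]
            paths.append(tail + ["Conversion"])
            pending = []  # non-overlapping: the next match starts after this one
        else:
            pending.append(channel)
    return paths


journey = [("Facebook", 0), ("Facebook", 0), ("Facebook", 0), ("Facebook", 1)]
print(conversion_paths(journey))
# [['Facebook', 'Facebook', 'Facebook', 'Conversion']]
```

<p style = 'font-size:16px;font-family:Arial'>A journey with more than 8 non-converting events before the conversion is truncated to the last 8, and a journey that never converts yields no path.</p>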

In [None]:
npath_sessions = NPath(data1 = attr_df, 
                      data1_partition_column = ['cookie'], 
                      data1_order_column = ['tmstp'], 
                      mode = 'NONOVERLAPPING', 
                      symbols = ['conversion=\'1\' as CONVERSION, conversion=\'0\' as EVENT'], 
                      pattern = 'EVENT{0,8}.CONVERSION', 
                      result = ['ACCUMULATE (case when conversion=\'1\' then \'Conversion\' else channel end OF ANY(CONVERSION,EVENT)) AS path',
                                  'COUNT (* of ANY(CONVERSION,EVENT)) as event_cnt',
                                  'FIRST (cookie OF ANY(CONVERSION,EVENT)) AS cookie'])


# npath_sessions.result\
#                     .groupby(['path'])\
#                     .count()\
#                     .sort('count_event_cnt',ascending=False)\
#                     .to_pandas()\
#                     .head(10)
convcntpath = npath_sessions.result
convcntpath

<p style = 'font-size:16px;font-family:Arial;'>A visualization of this gives us a lot of insight into the most common paths (the top 100 in this notebook) that users take before converting. A Sankey Diagram can be created from the output (path) of the nPath function used in the query above.</p>
<p style = 'font-size:16px;font-family:Arial;'><i>**The visualization takes around 1 minute 30 seconds to execute</i></p>

In [None]:
#Convert Teradata nPath output to plotly Sankey
#can handle paths up to 999 links in length
import pandas as pd
import plotly.graph_objects as go
from collections import defaultdict
import random

def sankeyPlot(res, direction, title_text="Sankey nPath", topN=15):
    npath_pandas = res.copy()

    if topN:
        npath_pandas = npath_pandas.sort_values(by='count_event_cnt', ascending=False).head(topN)

    if direction == "from":
        dataDict = defaultdict(int)

        for index, row in npath_pandas.iterrows():
            pathCnt = row['count_event_cnt']
            rowList = [item.strip() for item in row['path'].replace('[','').replace(']','').split(',')]
            for i in range(len(rowList)-1):
                leftValue = rowList[i] + str(i)
                rightValue = rowList[i+1] + str(i+1)
                valuePair = leftValue + '+' + rightValue
                dataDict[valuePair] += pathCnt

        eventList = []
        for key in dataDict.keys():
            leftValue, rightValue = key.split('+')
            if leftValue not in eventList:
                eventList.append(leftValue)
            if rightValue not in eventList:
                eventList.append(rightValue)

        # Strip the numeric position suffix (works for positions with more than one digit)
        sankeyLabel = [s.rstrip('0123456789') for s in eventList]
        
        sankeySource = []
        sankeyTarget = []
        sankeyValue = []

        for key,val in dataDict.items():
            sankeySource.append(eventList.index(key.split('+')[0]))
            sankeyTarget.append(eventList.index(key.split('+')[1]))
            sankeyValue.append(val)

        sankeyColor = []
        for i in sankeyLabel:
            sankeyColor.append('#'+''.join([random.choice('0123456789ABCDEF') for _ in range(6)]))

        link = dict(source = sankeySource, target = sankeyTarget, value = sankeyValue, color='light grey')
        node=dict(label=sankeyLabel, color=sankeyColor)
        data=go.Sankey(link=link, node=node)

        fig=go.Figure(data)

        fig.update_layout(
            hovermode ='closest',
            title = title_text,
            title_font_size=20,
            plot_bgcolor='white',
            paper_bgcolor='white'
        )

        fig.show()

    elif direction == "to":
        
        dataDict = defaultdict(int)
        eventDict = defaultdict(int)
        # Longest path length (number of comma-separated steps), used to right-align paths
        maxPath = npath_pandas['path'].str.count(',').max() + 1
    
        for index, row in npath_pandas.iterrows():
            rowList = row['path'].replace('[','').replace(']','').split(',')
            pathCnt = row['count_event_cnt']
            pathLen = len(rowList)
            for i in range(len(rowList)-1):
                leftValue = str(1000 + i + maxPath - pathLen) + rowList[i].strip()
                rightValue = str(1000 + i + 1 + maxPath - pathLen) + rowList[i+1].strip()
                valuePair = leftValue + '+' + rightValue
                dataDict[valuePair] += pathCnt
                eventDict[leftValue] += 1
                eventDict[rightValue] += 1
    
        eventList = []
        for key,val in eventDict.items():
            eventList.append(key)
    
        sortedEventList = sorted(eventList)
        sankeyLabel = []
        for event in sortedEventList:
            sankeyLabel.append(event[4:])
    
        sankeySource = []
        sankeyTarget = []
        sankeyValue = []

        for key,val in dataDict.items():
            sankeySource.append(sortedEventList.index(key.split('+')[0]))
            sankeyTarget.append(sortedEventList.index(key.split('+')[1]))
            sankeyValue.append(val)
    
        sankeyColor = []
        for i in sankeyLabel:
            sankeyColor.append('#'+''.join([random.choice('0123456789ABCDEF') for _ in range(6)]))
    
        link = dict(source = sankeySource, target = sankeyTarget, value = sankeyValue, color='light grey')
        data=go.Sankey(link=link, node=dict(label=sankeyLabel))
    
        fig=go.Figure(data)
        fig.update_layout(
                hovermode ='closest',
                title = title_text,
                title_font_size=20,
                plot_bgcolor='white',
                paper_bgcolor='white'
                )
    
        fig.show()

    else:
        print("Invalid direction.")


<p style = 'font-size:16px;font-family:Arial'>Consider an example where we create a path for a cookie which leads to conversion.</p>

In [None]:
attr_df[attr_df['cookie'] == 'FFfBikCE3onF3hACFCCE9iDf3'].sort('tmstp')

<p style = 'font-size:16px;font-family:Arial'>The above table shows the output for one cookie ordered by timestamp (tmstp). We can see that there were 3 touch points on the Facebook channel where no conversion happened. Finally, on the 4th touch point on the Facebook channel, conversion takes place. So the path will be </p>
<p style = 'font-size:14px;font-family:Arial'><b>Facebook</b><b style = 'font-size:12px;font-family:Arial'>(2018-07-02 16:08:02)--></b><b style = 'font-size:14px;font-family:Arial'>Facebook</b><b style = 'font-size:12px;font-family:Arial'>(2018-07-08 18:38:32)--></b><b style = 'font-size:14px;font-family:Arial'>Facebook</b><b style = 'font-size:12px;font-family:Arial'>(2018-07-10 12:30:15)--></b><b style = 'font-size:14px;font-family:Arial'>Facebook</b><b style = 'font-size:12px;font-family:Arial'>(2018-07-14 10:33:31)--></b><b style = 'font-size:14px;font-family:Arial'>Conversion</b></p>

<p style = 'font-size:16px;font-family:Arial'>Below we plot the top 100 paths that led to conversion, ranked by how often each path occurs.</p>

In [None]:
res = convcntpath\
                    .groupby(['path'])\
                    .count()\
                    .sort('count_event_cnt',ascending=False)\
                    .to_pandas()\
                    .head(100)

In [None]:
sankeyPlot(res,"to","Path to Conversion",100)

<p style = 'font-size:16px;font-family:Arial'>The above Sankey Diagram shows the paths that led to Conversion.</p>

<p style = 'font-size:16px;font-family:Arial'>To inspect any link or node, hover the mouse pointer over it. For example, hovering over the widest link at the top, which flows into the right-most node (Conversion), shows <b>2.30k, source: Facebook, target: Conversion.</b> This means that 2.30k Facebook touch points were immediately followed by Conversion. Similarly, 1.92k Online Video, 1.98k Paid Search, 873 Instagram and 816 Online Display touch points led directly to Conversion.</p>

<p style = 'font-size:16px;font-family:Arial'>When the pointer is moved over a node, for example the largest node at the top just before Conversion (<b>Facebook</b>), it shows <b>incoming flow count: 5 and outgoing flow count: 1</b>. This means that 5 different links lead into this Facebook node and 1 link leaves it, toward Conversion. Other nodes and links can be analyzed in the same way.</p>
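<p style = 'font-size:16px;font-family:Arial'>The link values shown on hover are sums of path counts over adjacent steps. The sketch below illustrates that aggregation on made-up numbers. The helper name <code>transition_counts</code> and the sample counts are assumptions, and unlike the Sankey function above it does not distinguish the position of a step within the path.</p>

```python
from collections import Counter


def transition_counts(path_counts):
    """Sum path counts over every adjacent (source, target) pair.

    path_counts: dict mapping an nPath-style path string such as
    '[Facebook, Conversion]' to the number of journeys with that path.
    Hypothetical helper for illustration; step positions are ignored.
    """
    links = Counter()
    for path, count in path_counts.items():
        steps = [step.strip() for step in path.strip("[]").split(",")]
        for source, target in zip(steps, steps[1:]):
            links[(source, target)] += count
    return links


# Made-up counts, not taken from the demo data
sample = {
    "[Facebook, Conversion]": 3,
    "[Paid Search, Facebook, Conversion]": 2,
}
print(transition_counts(sample)[("Facebook", "Conversion")])
# 5
```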



<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.2 Use nPath as a data preparation function and input to additional analytics techniques</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this step we use the nPath function to create input tables for statistical and machine learning based approaches. These tables feed later analyses, for example TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY (TF-IDF) analysis, where we score the converting and non-converting journeys.</p>

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Create a table with all converting journeys</b></p>

<p style = 'font-size:16px;font-family:Arial'>We are creating a table with all kinds of paths that lead to Conversion.  To achieve this we look at any sequence of events ending with a conversion.</p>

In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE CONV_JOURNEYS;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise


# Create the table
qry = '''
CREATE MULTISET TABLE CONV_JOURNEYS as(
SELECT * FROM nPath (
  ON (select cookie, tmstp, interaction, conversion, conversion_value, cost, 
  TRANSLATE(channel USING UNICODE_TO_LATIN) as channel from 
  DEMO_MultiTouchAttribution.Attribution_Data) PARTITION BY cookie ORDER BY tmstp
  USING
  Mode (NONOVERLAPPING)
  Pattern ('E*.C')
  Symbols (conversion='1' as C
          ,conversion='0' as E)
  Result (ACCUMULATE (channel OF ANY(C,E)) AS path
          ,COUNT (* of ANY(C,E)) as event_cnt
          ,FIRST (cookie OF ANY(C,E)) AS cookie
  )
) AS dt
where event_cnt > 1
)WITH DATA PRIMARY INDEX(cookie);
'''

# Execute the query
eng.execute(qry)
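<p style = 'font-size:16px;font-family:Arial'>The drop-then-create pattern used above recurs throughout this notebook. As a possible simplification, the sketch below wraps it in a small helper. The name <code>drop_table_if_exists</code> is an assumption, and the helper takes any callable that executes SQL (for example <code>eng.execute</code>) so that it can be tried without a database connection. It treats error code 3807 as "table does not exist", exactly as the cells in this notebook do.</p>

```python
def drop_table_if_exists(execute, table_name):
    """Drop table_name using execute(sql).

    Returns True if the table was dropped and False if it did not exist.
    Hypothetical helper; 'execute' is any callable that runs a SQL string.
    """
    try:
        execute(f"DROP TABLE {table_name};")
        return True
    except Exception as exc:
        # 3807 is the error code this notebook treats as "table does not exist"
        if "3807" in str(exc):
            return False
        raise


# Example with a stand-in executor that simulates a missing table
def fake_execute(sql):
    raise Exception("[Error 3807] Object 'CONV_JOURNEYS' does not exist.")


print(drop_table_if_exists(fake_execute, "CONV_JOURNEYS"))
# False
```

<p style = 'font-size:16px;font-family:Arial'>In this notebook the call would look like <code>drop_table_if_exists(eng.execute, 'CONV_JOURNEYS')</code>.</p>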

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Create a table with all non-converting journeys (leaving out potential converting journeys)</b></p>
<p style = 'font-size:16px;font-family:Arial'>We are creating a table with all kinds of paths that do not lead to any Conversion. To achieve this, we consider only touchpoints recorded before the last observed conversion (so that journeys which might still convert are left out), keep cookies that are not part of any converting journey (as defined above), and drop any path that contains a conversion.</p>
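<p style = 'font-size:16px;font-family:Arial'>The selection logic can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not a replacement for the in-database query: the helper name <code>non_converting_journeys</code> and the tuple layout are hypothetical, and the sketch excludes every cookie that converts at any time, which approximates the two separate filters used in the SQL.</p>

```python
from collections import defaultdict


def non_converting_journeys(rows):
    """Return {cookie: [channels...]} for journeys that never convert.

    rows: (cookie, tmstp, channel, conversion) tuples with comparable tmstp values.
    Hypothetical helper; simplified relative to the SQL in this notebook.
    """
    conversion_times = [tmstp for _, tmstp, _, conv in rows if conv == 1]
    if not conversion_times:
        return {}
    cutoff = max(conversion_times)  # timestamp of the last observed conversion
    converters = {cookie for cookie, _, _, conv in rows if conv == 1}

    journeys = defaultdict(list)
    for cookie, tmstp, channel, _ in sorted(rows, key=lambda r: (r[0], r[1])):
        # Only touchpoints before the cutoff, from cookies that never convert
        if tmstp < cutoff and cookie not in converters:
            journeys[cookie].append(channel)

    # Keep journeys with more than one event
    return {cookie: path for cookie, path in journeys.items() if len(path) > 1}


# Made-up rows with integer timestamps
rows = [
    ("a", 1, "Facebook", 0), ("a", 2, "Instagram", 0),
    ("b", 1, "Paid Search", 0), ("b", 3, "Facebook", 1),
    ("c", 2, "Online Video", 0),
]
print(non_converting_journeys(rows))
# {'a': ['Facebook', 'Instagram']}
```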

In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE NONCONV_JOURNEYS;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise


# Create the table
qry = '''
CREATE MULTISET TABLE NONCONV_JOURNEYS AS (
    SELECT path, event_cnt, cookie FROM NPATH
    (ON (select cookie, tmstp, interaction, conversion, conversion_value, cost, 
  TRANSLATE(channel USING UNICODE_TO_LATIN) as channel from DEMO_MultiTouchAttribution.Attribution_Data 
  where tmstp < (select max(tmstp) from DEMO_MultiTouchAttribution.Attribution_Data where conversion ='1'))
  PARTITION BY cookie ORDER BY tmstp
    USING
        MODE (NONOVERLAPPING)
        SYMBOLS (TRUE as A)
        PATTERN ('A*')
        RESULT (ACCUMULATE (channel of ANY(A)) as path,
                ACCUMULATE (conversion of ANY(A)) as conv
                ,COUNT (* of ANY(A)) as event_cnt
                ,FIRST (cookie OF ANY(A)) AS cookie
                )
    )
WHERE cookie IS NOT IN (SEL distinct cookie FROM CONV_JOURNEYS)
AND conv Not like '%1%'
AND event_cnt >1
)WITH DATA PRIMARY INDEX(cookie);
'''

# Execute the query
eng.execute(qry)

<hr>
<a id="rule"></a>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>5. RULE BASED MODELS</b></p>


<p style = 'font-size:16px;font-family:Arial;'>Rule Based attribution models assign conversion credits (weights) to touchpoints in a conversion path according to certain predefined rules.
</p>
<p style = 'font-size:16px;font-family:Arial;'>These rules are used to identify the position of an interaction on the conversion path and then assign conversion credit solely on the basis of its position.
</p>
<p style = 'font-size:16px;font-family:Arial;'>To execute rule based models we can leverage Vantage's native Attribution function, which supports the following methods:
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Uniform: Conversion event is attributed uniformly to preceding attributable events.</li>
    <li>First Click: Conversion event is attributed entirely to first attributable event.</li>
    <li>Last Click: Conversion event is attributed entirely to most recent attributable event.</li> 
    <li>Exponential: Conversion event is attributed exponentially to preceding attributable events (the more recent the event, the higher the attribution).</li>
 </ul>
</p>

<p style = 'font-size:16px;font-family:Arial;'>The function takes data and parameters from multiple tables and outputs attributions. Please refer to Teradata Vantage™ - Analytics Database Analytic Functions documentation for more on Attribution function.</p>

<p style = 'font-size:16px;font-family:Arial;'>Attribution Input :
<ol style = 'font-size:14px;font-family:Arial'>
<li style = 'font-size:14px;font-family:Arial'>Input tables (maximum of five) (Contain data for computing attributions).</li>
<li style = 'font-size:14px;font-family:Arial'>ConversionEventTable (Contains conversion events).</li>
<li style = 'font-size:14px;font-family:Arial'>FirstModelTable (Defines type and distributions of model - we'll create one table per model)</li></ol>
</p>

<p style = 'font-size:16px;font-family:Arial;'>Attribution Syntax Elements:
<ol style = 'font-size:14px;font-family:Arial'>
<li style = 'font-size:14px;font-family:Arial;'>EventColumn specifies the name of the input column that contains the events.</li>
<li style = 'font-size:14px;font-family:Arial;'>TimeColumn specifies the name of the input column that contains the timestamps of the  events.</li>
<li style = 'font-size:14px;font-family:Arial;'>WindowSize specifies how to determine the maximum window size for the attribution calculation.</li></ol>
    </p>
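<p style = 'font-size:16px;font-family:Arial;'>Before building the model tables, it can help to see the four rules as plain weight vectors. The sketch below is illustrative only: the function name <code>rule_weights</code> is an assumption, and the exponential rule is modelled here as a simple geometric decay normalized to sum to 1, which may differ from the exact weighting the in-database Attribution function applies.</p>

```python
def rule_weights(n, rule, alpha=0.5):
    """Credit assigned to each of n touchpoints, ordered oldest to most recent.

    Hypothetical helper for illustration. The exponential rule is modelled as
    a geometric decay with ratio alpha, normalized so the weights sum to 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if rule == "uniform":
        return [1.0 / n] * n
    if rule == "first_click":
        return [1.0] + [0.0] * (n - 1)
    if rule == "last_click":
        return [0.0] * (n - 1) + [1.0]
    if rule == "exponential":
        # The most recent touchpoint gets weight 1, the one before alpha, then alpha**2, ...
        raw = [alpha ** (n - 1 - i) for i in range(n)]
        total = sum(raw)
        return [w / total for w in raw]
    raise ValueError(f"unknown rule: {rule}")


for rule in ("uniform", "first_click", "last_click", "exponential"):
    print(rule, [round(w, 3) for w in rule_weights(4, rule)])
```

<p style = 'font-size:16px;font-family:Arial;'>In every case the weights sum to 1, so each conversion distributes exactly one unit of credit across its touchpoints.</p>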


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.1. Create Conversion Event Table.</b></p> 
<p style = 'font-size:16px;font-family:Arial'> Since we are focusing on the events that led to Conversion our ATTRIBUTION CONVERSION Table will have only one value <b>'conversion'</b>.</p>     
    

In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE ATTRIBUTION_CONVERSION;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE TABLE ATTRIBUTION_CONVERSION
(
    CONVERSION VARCHAR(100)
);
'''

# Execute the query
eng.execute(qry)

# Insert the conversion event value
qry = '''
INSERT INTO ATTRIBUTION_CONVERSION VALUES ('conversion');
'''

# Execute the query
eng.execute(qry)


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.2 Create model specifications tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>We need one model table for each type of Attribution: First Click, Last Click, Uniform and Exponential. We therefore create 4 model tables below and insert the specification rows for each model type.</p>
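<p style = 'font-size:16px;font-family:Arial'>Each model table in this notebook holds two rows: a model type ('EVENT_REGULAR') and a colon-separated specification string such as 'ALL:1.0:UNIFORM:NA'. The sketch below assembles the four specification strings used in the following cells. The helper name <code>model_spec</code> and the field names (events, weight, model, parameters) are an interpretation of those strings, so please check the Attribution function documentation for the authoritative format.</p>

```python
def model_spec(events, weight, model, params="NA"):
    """Build a colon-separated specification string.

    Hypothetical helper; the field names are an assumed reading of the
    strings used in this notebook, not an official definition.
    """
    return f"{events}:{weight}:{model}:{params}"


specs = {
    "UNIFORM": model_spec("ALL", 1.0, "UNIFORM"),
    "FIRST_CLICK": model_spec("ALL", 1.0, "FIRST_CLICK"),
    "LAST_CLICK": model_spec("ALL", 1.0, "LAST_CLICK"),
    "EXPONENTIAL": model_spec("ALL", 1.0, "EXPONENTIAL", "0.5,ROW"),
}

for name, spec in specs.items():
    print(name, "->", spec)
```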

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Uniform Model (applies equal weighting to all contributing touchpoints in the customer journey)</b></p>

In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE ATTRIBUTION_MODEL_UNIFORM;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE TABLE ATTRIBUTION_MODEL_UNIFORM
(
    ID   INT,
    MODEL VARCHAR(100)
);
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line1)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_UNIFORM VALUES (0,'EVENT_REGULAR');
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line2)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_UNIFORM VALUES (1,'ALL:1.0:UNIFORM:NA');
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> First Click Model (100% of the credit is directly attributed to the first interaction in the customer journey)</b></p>


In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE ATTRIBUTION_MODEL_FIRSTCLICK;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE TABLE ATTRIBUTION_MODEL_FIRSTCLICK
(
    ID   INT,
    MODEL VARCHAR(100)
);
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line1)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_FIRSTCLICK VALUES (0,'EVENT_REGULAR');
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line2)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_FIRSTCLICK VALUES (1,'ALL:1.0:FIRST_CLICK:NA');
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Last Click Model (100% of the credit is directly attributed to the last interaction in the customer journey)</b></p>

In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE ATTRIBUTION_MODEL_LASTCLICK;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE TABLE ATTRIBUTION_MODEL_LASTCLICK
(
    ID   INT,
    MODEL VARCHAR(100)
);
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line1)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_LASTCLICK VALUES (0,'EVENT_REGULAR');
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line2)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_LASTCLICK VALUES (1,'ALL:1.0:LAST_CLICK:NA');
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Exponential Model (assigns exponentially more weight to the interactions which are closest in time to conversion)</b></p>

In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE ATTRIBUTION_MODEL_EXPONENTIAL;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE TABLE ATTRIBUTION_MODEL_EXPONENTIAL
(
    ID   INT,
    MODEL VARCHAR(100)
);
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line1)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_EXPONENTIAL VALUES (0,'EVENT_REGULAR');
'''

# Execute the query
eng.execute(qry)

#Insert model specification values (line2)
qry = '''
INSERT INTO ATTRIBUTION_MODEL_EXPONENTIAL VALUES (1,'ALL:1.0:EXPONENTIAL:0.5,ROW');
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.3. Compute all four models and store outputs in a table</b></p>
<p style = 'font-size:16px;font-family:Arial'>After creating the four model tables we will use them in the calculation of ATTRIBUTION for each channel based on all these models as in the query below.</p> 

<p style = 'font-size:16px;font-family:Arial'>To limit every rule-based model to the 20 most recent rows preceding a conversion, we use the WindowSize argument of the Attribution function. More specifically, we use the "rows:K" option, which assigns attributions to at most K events before the conversion event. In our case K=20.</p>
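<p style = 'font-size:16px;font-family:Arial'>The effect of a row-based window can be illustrated with a few lines of Python. This sketch assumes uniform weighting and a hypothetical helper name, <code>uniform_credit_in_window</code>. It only shows how events older than the K most recent ones receive no credit, and it does not reproduce the Attribution function itself.</p>

```python
from collections import defaultdict


def uniform_credit_in_window(channels_before_conversion, k=20):
    """Split one conversion evenly across the last k events, summed by channel.

    channels_before_conversion: channels ordered oldest to most recent.
    Hypothetical helper for illustration only.
    """
    window = channels_before_conversion[-k:] if k > 0 else []
    credit = defaultdict(float)
    for channel in window:
        credit[channel] += 1.0 / len(window)
    return dict(credit)


# 25 events: the 5 oldest fall outside a 20-row window and get no credit
events = ["Instagram"] * 5 + ["Facebook"] * 20
print(sorted(uniform_credit_in_window(events, k=20)))
# ['Facebook']
```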



In [None]:
# Drop the table if it already exists
qry = 'DROP TABLE ATTRIBUTION_4MODEL_OUTPUT;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE TABLE ATTRIBUTION_4MODEL_OUTPUT
AS (
SELECT 
U.cookie,
U.tmstp,
U.channel
 ,F.ATTRIBUTION AS FIRST_CLICK_ATTRIBUTION
  ,L.ATTRIBUTION AS LAST_CLICK_ATTRIBUTION
 ,U.ATTRIBUTION AS UNIFORM_ATTRIBUTION
 ,E.ATTRIBUTION AS EXPONENTIAL_ATTRIBUTION
 ,F.TIME_TO_CONVERSION AS FIRST_CLICK_TTC
  ,L.TIME_TO_CONVERSION AS LAST_CLICK_TTC
 ,U.TIME_TO_CONVERSION AS UNIFORM_TTC
 ,E.TIME_TO_CONVERSION AS EXPONENTIAL_TTC
FROM ATTRIBUTION
     (
         ON (select cookie, tmstp, TRANSLATE(interaction USING UNICODE_TO_LATIN) as interaction, conversion, conversion_value, cost, channel 
          from DEMO_MultiTouchAttribution.Attribution_Data) AS INPUT
         PARTITION BY cookie
         ORDER BY tmstp   
         ON ATTRIBUTION_CONVERSION      AS CONVERSION    DIMENSION
         ON ATTRIBUTION_MODEL_UNIFORM    AS MODEL1        DIMENSION
         USING
         EVENTCOLUMN ('interaction') 
         TimeCOLUMN ('TMSTP')
         WINDOWSize('ROWS:20') 
     ) U,
ATTRIBUTION
     (
         ON (select cookie, tmstp, TRANSLATE(interaction USING UNICODE_TO_LATIN) as interaction, conversion, conversion_value, cost, channel 
          from DEMO_MultiTouchAttribution.Attribution_Data)  AS INPUT
         PARTITION BY cookie
         ORDER BY tmstp   
         ON ATTRIBUTION_CONVERSION      AS CONVERSION    DIMENSION
         ON ATTRIBUTION_MODEL_LASTCLICK    AS MODEL1        DIMENSION
         USING
         EVENTCOLUMN ('interaction') 
         TimeCOLUMN ('TMSTP')
         WINDOWSize('ROWS:20') 
     ) L,
 ATTRIBUTION
     (
         ON (select cookie, tmstp, TRANSLATE(interaction USING UNICODE_TO_LATIN) as interaction, conversion, conversion_value, cost, channel 
          from DEMO_MultiTouchAttribution.Attribution_Data) AS INPUT
         PARTITION BY cookie
         ORDER BY tmstp   
         ON ATTRIBUTION_CONVERSION      AS CONVERSION    DIMENSION
         ON ATTRIBUTION_MODEL_FIRSTCLICK    AS MODEL1        DIMENSION
         USING
         EVENTCOLUMN ('interaction') 
         TimeCOLUMN ('TMSTP')
         WINDOWSize('ROWS:20') 
     ) F,
 ATTRIBUTION
     (
         ON (select cookie, tmstp, TRANSLATE(interaction USING UNICODE_TO_LATIN) as interaction, conversion, conversion_value, cost, channel 
          from DEMO_MultiTouchAttribution.Attribution_Data) AS INPUT
         PARTITION BY cookie
         ORDER BY tmstp   
         ON ATTRIBUTION_CONVERSION      AS CONVERSION    DIMENSION
         ON ATTRIBUTION_MODEL_EXPONENTIAL    AS MODEL1        DIMENSION
         USING
         EVENTCOLUMN ('interaction') 
         TimeCOLUMN ('TMSTP')
         WINDOWSize('ROWS:20')  
     ) E
     WHERE F.cookie = L.cookie
AND   F.cookie = U.cookie
AND   F.cookie = E.cookie
AND   F.TMSTP = L.TMSTP
AND   F.TMSTP = U.TMSTP
AND   F.TMSTP = E.TMSTP
AND   F.channel     = L.channel
AND   F.channel     = U.channel
AND   F.channel     = E.channel
) WITH DATA;
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.4. Calculate attribution weights by channel and rule based model</b></p>

In [None]:
#Calculate attribution weights for all four rule based models
qry = '''
WITH TOTAL
AS
(SELECT
SUM(UNIFORM_ATTRIBUTION) AS TOT_UNI,
SUM(FIRST_CLICK_ATTRIBUTION) AS TOT_FC, 
SUM(LAST_CLICK_ATTRIBUTION)AS TOT_LC, 
SUM(EXPONENTIAL_ATTRIBUTION) AS TOT_EXP
from 
ATTRIBUTION_4MODEL_OUTPUT)

select 
CHANNEL, 
SUM(UNIFORM_ATTRIBUTION)/TOT_UNI AS UNIFORM_ATTRIBUTION,
SUM(FIRST_CLICK_ATTRIBUTION)/TOT_FC AS FIRST_CLICK_ATTRIBUTION, 
SUM(LAST_CLICK_ATTRIBUTION)/TOT_LC AS LAST_CLICK_ATTRIBUTION, 
SUM(EXPONENTIAL_ATTRIBUTION)/TOT_EXP as EXPONENTIAL_ATTRIBUTION
from 
ATTRIBUTION_4MODEL_OUTPUT, TOTAL
GROUP BY CHANNEL;
'''

# Execute the query
eng.execute(qry)

AttribRule=DataFrame.from_query(qry)

In [None]:
# View results
AttribRule_plot=AttribRule.to_pandas()
AttribRule_plot

<p style = 'font-size:16px;font-family:Arial'>The above output shows the Attribution values for each type of channel using different models.</p>
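<p style = 'font-size:16px;font-family:Arial'>To make these weights more concrete, the following is a minimal plain-Python sketch of how the four rules could split credit along a single journey and how per-channel shares are then normalized. It is only an illustration, not the implementation of the Vantage Attribution function: the function names, the toy journeys and the decay factor <code>alpha</code> of the exponential rule are all assumptions.</p>

```python
from collections import defaultdict

def rule_weights(n, model, alpha=0.5):
    """Return the credit given to each of n touchpoints (credits sum to 1)."""
    if model == "first_click":
        raw = [1.0] + [0.0] * (n - 1)
    elif model == "last_click":
        raw = [0.0] * (n - 1) + [1.0]
    elif model == "uniform":
        raw = [1.0] * n
    elif model == "exponential":
        # Assumed decay: a touchpoint k steps before the last one gets alpha**k.
        raw = [alpha ** (n - 1 - i) for i in range(n)]
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(raw)
    return [w / total for w in raw]

def channel_shares(journeys, model):
    """Sum the credit per channel over all journeys and normalize to shares."""
    credit = defaultdict(float)
    for path in journeys:
        for channel, weight in zip(path, rule_weights(len(path), model)):
            credit[channel] += weight
    total = sum(credit.values())
    return {channel: value / total for channel, value in credit.items()}

# Toy converting journeys (hypothetical data, ordered by time).
journeys = [
    ["Facebook", "Paid Search", "Instagram"],
    ["Paid Search", "Facebook"],
]
uniform_shares = channel_shares(journeys, "uniform")
first_click_shares = channel_shares(journeys, "first_click")
print(uniform_shares)
print(first_click_shares)
```

<p style = 'font-size:16px;font-family:Arial'>The final normalization step plays the same role as dividing each channel's summed attribution by the overall total in the SQL query above.</p>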

In [None]:
import plotly.graph_objects as go

fig = go.Figure(
    data=[
        go.Bar(name='Uniform', x=AttribRule_plot["CHANNEL"], y=AttribRule_plot["UNIFORM_ATTRIBUTION"], yaxis='y', offsetgroup=1,marker_color='#76B7B2'),
        go.Bar(name='First Click', x=AttribRule_plot["CHANNEL"], y=AttribRule_plot["FIRST_CLICK_ATTRIBUTION"], yaxis='y', offsetgroup=2, marker_color='#F28E2B'),
        go.Bar(name='Last Click', x=AttribRule_plot["CHANNEL"], y=AttribRule_plot["LAST_CLICK_ATTRIBUTION"], yaxis='y', offsetgroup=3,marker_color='#E15759'),
        go.Bar(name='Exponential', x=AttribRule_plot["CHANNEL"], y=AttribRule_plot["EXPONENTIAL_ATTRIBUTION"], yaxis='y', offsetgroup=4,marker_color='#4E79A7')
    ],
    layout={
        'yaxis': {'title': 'Attribution '},

    }
)
 
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

<p style = 'font-size:16px;font-family:Arial'>From the above graph we can see that the attribution value for the Facebook channel is the highest in all four models, and that for Online Display is the lowest.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.5. Exploring the Uniform Model in more detail</b></p>

<p style = 'font-size:16px;font-family:Arial'>Whatever the model, the attribution function outputs a score (or attribution weight) and computes the time to conversion.</p>
<p style = 'font-size:16px;font-family:Arial'>We can easily put this information in perspective with the cost to measure and visualize channel effectiveness.</p>
<p style = 'font-size:16px;font-family:Arial'>The uniform model can serve as a starting point or baseline for attribution analysis. It provides a benchmark against which more advanced attribution models can be compared. By evaluating the performance of other models relative to the uniform model, marketers can gain insights into the additional value or improvement offered by more sophisticated approaches like the Statistics based models or Machine learning models. We have used some of these models below in this notebook.</p>

In [None]:
qry = '''
 SELECT ATTRIB.channel,  sum(uniform_attribution) AS total_attribution , sum(cost) as total_cost,
AVG(-uniform_ttc)/86400 AS time_to_conversion
FROM ATTRIBUTION_4MODEL_OUTPUT As ATTRIB
INNER JOIN DEMO_MultiTouchAttribution.CHANNEL_COST AS COST
ON ATTRIB.CHANNEL=COST.CHANNEL
GROUP BY 1
'''

# Execute the query
eng.execute(qry)

AttribUni=DataFrame.from_query(qry)

AttribUni_plot=AttribUni.to_pandas()
AttribUni.head()

<p style = 'font-size:16px;font-family:Arial'>The total attribution and time to conversion come from the output of the Attribution function used above, and the cost comes from the joined CHANNEL_COST table. Here we consider only the attribution scores from the UNIFORM attribution model (sum(uniform_attribution)).</p> 
<p style = 'font-size:16px;font-family:Arial'>All three dimensions - cost, attribution and time to conversion - can be plotted on a bubble chart, the size of the bubbles showing the cost. </p>

In [None]:
import plotly.express as px
ax=px.scatter(AttribUni_plot, x="total_attribution", y="time_to_conversion",
              size="total_cost",size_max = 70,color="CHANNEL",hover_data=['CHANNEL'],
              width=900, height=400, 
              color_discrete_map = {'Online Display': '#E15759','Online Video': '#76B7B2','Facebook': '#4E79A7','Instagram': '#F28E2B' ,'Paid Search': '#59A14F'},
             labels={
                     "total_attribution": "Total  Attribution",
                     "time_to_conversion": "Time to Conversion (Days)"
        }
             )
ax.update_layout(showlegend=False)
ax.update_layout(title_text='Channel Performance - Uniform Model', title_x=0.5)
ax.show()

<p style = 'font-size:16px;font-family:Arial'>The above graph shows the Channel Performance using the UNIFORM Model. The size of each circle depends on the total cost of the channel. Hovering over a circle shows the channel, its attribution value, its time to conversion and its cost. The largest circle is for Online Video, followed by Facebook, which indicates that the Online Video channel is less performant than Facebook (higher cost, lower attribution).</p>
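<p style = 'font-size:16px;font-family:Arial'>The "higher cost, lower attribution" comparison can also be expressed as a single ratio. The sketch below computes a cost per unit of attribution and ranks channels by it. The channel names and numbers are hypothetical placeholders, not results from this notebook, and the helper name is an assumption.</p>

```python
# Hypothetical per-channel totals (placeholders, NOT results from this notebook).
channel_totals = {
    "Channel A": {"total_attribution": 120.0, "total_cost": 6000.0},
    "Channel B": {"total_attribution": 150.0, "total_cost": 3000.0},
}

def cost_per_attribution(totals):
    """Cost spent per unit of attribution; lower means a more efficient channel."""
    return {
        name: values["total_cost"] / values["total_attribution"]
        for name, values in totals.items()
    }

efficiency = cost_per_attribution(channel_totals)
# Channels ordered from most to least cost-efficient.
ranked = sorted(efficiency, key=efficiency.get)
print(efficiency)
print(ranked)
```

<p style = 'font-size:16px;font-family:Arial'>With real data, the same ratio could be computed directly on the <code>total_cost</code> and <code>total_attribution</code> columns of the DataFrame plotted above.</p>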

<hr>
<a id="stat"></a>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>6. STATISTICAL BASED MODELS</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>6.1 SIMPLE FREQUENCY ANALYSIS</b></p>


<p style = 'font-size:16px;font-family:Arial'>A simple frequency analysis (obtained by calculating the occurrences of the channel in the journeys leading to Conversion) can be used as a basic approach to compute marketing attribution.</p>

<p style = 'font-size:16px;font-family:Arial'>The NGramSplitter function tokenizes (splits) an input stream of text and outputs n multigrams (called n-grams) based on the specified delimiter and reset parameters. NGramSplitter provides more flexibility than standard tokenization when performing text analysis.</p>
<p style = 'font-size:16px;font-family:Arial;'>We just need to tokenize the paths of converting journeys and calculate the frequency of each channel.</p>
<p style = 'font-size:16px;font-family:Arial;'>For path tokenization we use the NGramSplitter function, which splits the input text (here, paths) into "terms" (grams) of the selected size (here 1, so that each event is one term) and counts them.</p>
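<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of what this frequency analysis computes, here is a small plain-Python sketch that splits comma-delimited paths into single events and derives each channel's share of all events. The example paths and the function name are assumptions made for illustration; the actual computation in this notebook is done in-database by NGramSplitter.</p>

```python
from collections import Counter

def channel_frequency_share(paths, delimiter=","):
    """Split each path into 1-grams and return each channel's share of all tokens."""
    counts = Counter()
    for path in paths:
        # Trim surrounding whitespace, as TRIM(BOTH FROM ngram) does in the SQL.
        counts.update(tok.strip() for tok in path.split(delimiter) if tok.strip())
    total = sum(counts.values())
    return {channel: n / total for channel, n in counts.items()}

# Hypothetical converting paths (assumed format: comma-separated channel names).
paths = ["Facebook, Paid Search, Facebook", "Instagram, Facebook"]
shares = channel_frequency_share(paths)
print(shares)
```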



In [None]:
#Drop table if exists
qry = 'DROP TABLE cngrams;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table.  
qry = '''
CREATE MULTISET TABLE cngrams AS(
    SEL ngram as channel
    ,frequency
    ,sum(frequency) OVER(PARTITION BY 1) as tot
    ,(1.000 * frequency/tot) as tp
    FROM(
        SEL TRIM(BOTH FROM ngram) as ngram
                , sum(frequency) as frequency
            FROM NGramSplitter (
    ON  (select * from conv_journeys where event_cnt <=20)
    USING
    TextColumn ('path')
    Delimiter (',')
    Grams ('1')
    Overlapping ('true')
    ToLowerCase ('false')
    --Punctuation ('\[.,?\!\]')
    --Reset ('\[.,?\!\]')
    Reset ('[]')
    TotalGramCount ('false')
  --  Accumulate ('cookie')
  ) AS dt
    group by 1
)as aa
) WITH DATA PRIMARY INDEX (channel);
'''

# Execute the query
eng.execute(qry)

In [None]:
freqattribution = DataFrame(in_schema('demo_user','cngrams'))
freq=freqattribution.to_pandas().reset_index()
freq

<p style = 'font-size:16px;font-family:Arial;'>The output table contains the channel (the ngram), the frequency of the channel, the total frequency (tot) and the ratio of the channel frequency to the total frequency (tp).</p>

<p style = 'font-size:16px;font-family:Arial'>Visualizing the results in a vertical bar chart.</p>

In [None]:
import plotly.express as px
fig = px.bar(freq, y="tp", x="channel", 
             color='channel', orientation='v',
             height=600,width=900,
             color_discrete_map = {'Online Display': '#E15759','Online Video': '#76B7B2','Facebook': '#4E79A7','Instagram': '#F28E2B' ,'Paid Search': '#59A14F'},
             title='Attribution Summary')
fig.update_layout(title_text='Frequency Based Attribution Summary', title_x=0.5)
fig.update_xaxes(title='Channel',tickangle=-45)
fig.update_yaxes(title='Attribution Weight')
fig.update_traces(width=0.5)
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The above graph shows the Frequency based Attribution value for each channel using the ngrams. We can see that the Attribution Value for Facebook channel is highest and that for Online Display is lowest.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>7. ASSOCIATION ANALYSIS (looking for association of channels driving conversion)</b></p>

<p style = 'font-size:16px;font-family:Arial'>Association analysis can help identify channels that are frequently used in combination within converting journeys.  This information can guide resource allocation and enable marketers to focus on the most effective channel combinations to lift conversion.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.1. Prepare data</b></p>


<p style = 'font-size:16px;font-family:Arial'>We use the nPath function to identify all cookies that lead to a conversion, and use this list of cookies to filter the original dataset.</p>
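<p style = 'font-size:16px;font-family:Arial'>The filtering logic can be sketched in plain Python as follows: order each cookie's events by time, look for a run of non-converting events that ends in a conversion, and keep the cookie if that run contains more than one event. This is a simplified illustration, not the nPath implementation. The tuple layout, the integer conversion flag and the function name are assumptions.</p>

```python
from itertools import groupby
from operator import itemgetter

def converting_cookies(events, min_events=2):
    """Return cookies with at least one journey of min_events or more events
    that ends in a conversion.

    events: iterable of (cookie, tmstp, conversion) tuples, conversion is 0 or 1.
    """
    result = set()
    ordered = sorted(events, key=itemgetter(0, 1))  # by cookie, then timestamp
    for cookie, rows in groupby(ordered, key=itemgetter(0)):
        run = 0  # events seen since the previous conversion
        for _, _, conversion in rows:
            run += 1
            if conversion == 1:
                if run >= min_events:
                    result.add(cookie)
                run = 0  # non-overlapping: the next journey starts afterwards
    return result

# Toy events (hypothetical data).
events = [
    ("a", 1, 0), ("a", 2, 1),   # one touch, then a conversion: kept
    ("b", 1, 1),                # conversion with no prior touch: dropped
    ("c", 1, 0), ("c", 2, 0),   # never converts: dropped
]
kept = converting_cookies(events)
print(kept)
```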

In [None]:
#Drop table if exists
qry = 'DROP TABLE ATTRIBUTION_DATA4_ASSO_2;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise


qry = '''
CREATE TABLE ATTRIBUTION_DATA4_ASSO_2 AS
(
Select * from DEMO_MultiTouchAttribution.Attribution_Data
where cookie in (
SELECT cookie FROM nPath (
  ON DEMO_MultiTouchAttribution.Attribution_Data PARTITION BY cookie ORDER BY tmstp
  USING
  Mode (NONOVERLAPPING)
  Pattern ('E*.C')
  Symbols (conversion='1' as C
          ,conversion='0' as E)
  Result (ACCUMULATE (case when conversion='1' then 'converted ' else channel end OF ANY(C,E)) AS path
          ,COUNT (* of ANY(C,E)) as event_cnt
          ,FIRST (cookie OF ANY(C,E)) AS cookie
  ) 
) where event_cnt >1  
)
) with data;
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.2. Compute Association Analysis</b></p>
<p style = 'font-size:16px;font-family:Arial'>We calculate the association by passing the function name 'association' to the td_analyze procedure call. The source data is the ATTRIBUTION_DATA4_ASSO_2 table created above, that is, the original dataset filtered with the output of the nPath function.</p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE CONVERSION_ASSO_GLOBAL;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Run Vantage Analytics Library. An output table is created.  
qry = '''
call 
val.td_analyze
(
'association',
'database=demo_user;
tablename=ATTRIBUTION_DATA4_ASSO_2;
groupcolumn=cookie;
itemcolumn=channel;
outputdatabase=demo_user;
outputtablename=CONVERSION_ASSO_GLOBAL;'
);
'''

# Execute the query
eng.execute(qry)

In [None]:
ConvAsso = DataFrame(in_schema('demo_user','CONVERSION_ASSO_GLOBAL'))
ConvAsso=ConvAsso.to_pandas().reset_index()
ConvAsso

<p style = 'font-size:16px;font-family:Arial'>The output of the td_analyze procedure call for the function 'association' has the above columns.</p>
<p style = 'font-size:16px;font-family:Arial'>ITEM1OF2 and ITEM2OF2 are the two channels for which the association is calculated. The measures are defined as follows:
<li style = 'font-size:16px;font-family:Arial'>Support is the percentage of groups (here, cookies) containing the items on the left (left side support), on the right (right side support) or on both sides of a rule (rule support).</li>
<li style = 'font-size:16px;font-family:Arial'>Confidence is the percentage of groups containing the left side items that also contain the right side items.</li>
<li style = 'font-size:16px;font-family:Arial'>Lift is a measure of how much the probability is raised that the right side items occur in a group given that the left side items occur in the group.</li>
<li style = 'font-size:16px;font-family:Arial'>Z Score is a statistical measure of how much the expected and actual numbers of groups containing all the items in the rule differ. (Zero means expected and actual are the same.)</li>
</p>
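<p style = 'font-size:16px;font-family:Arial'>To illustrate the first three measures, the sketch below computes rule support, confidence and lift for one pair of channels over a handful of toy groups, using the standard textbook definitions as fractions between 0 and 1. The data and the function name are assumptions, the scaling used by td_analyze may differ, and the Z score is left out because its exact formula is not given here.</p>

```python
def association_measures(groups, left, right):
    """Support, confidence and lift of the rule left -> right.

    groups: list of sets, one set of channels per group (e.g. per cookie).
    Assumes `left` and `right` each occur in at least one group.
    """
    n = len(groups)
    n_left = sum(1 for g in groups if left in g)
    n_right = sum(1 for g in groups if right in g)
    n_both = sum(1 for g in groups if left in g and right in g)
    support = n_both / n                 # share of groups containing both sides
    confidence = n_both / n_left         # P(right | left)
    lift = confidence / (n_right / n)    # P(right | left) / P(right)
    return {"support": support, "confidence": confidence, "lift": lift}

# Toy groups (hypothetical data): the channels seen for each cookie.
groups = [
    {"Facebook", "Instagram"},
    {"Facebook", "Instagram", "Paid Search"},
    {"Paid Search"},
    {"Facebook"},
]
measures = association_measures(groups, "Facebook", "Instagram")
print(measures)
```

<p style = 'font-size:16px;font-family:Arial'>A lift above 1 means the right side channel appears more often in groups that contain the left side channel than it does overall.</p>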

In [None]:
import plotly.graph_objects as go


# Label each marker with its confidence and show the lift on hover
marker_text = [f"{size}" for size in round(ConvAsso['CONFIDENCE'], 2)]
hover_text = [f"Lift: {value}" for value in round(ConvAsso['LIFT'], 2)]

fig = go.Figure(data=go.Scatter(
    x=ConvAsso['ITEM1OF2'],
    y=ConvAsso['ITEM2OF2'],
    mode='markers+text',
    text=marker_text,      # confidence shown on each marker
    hovertext=hover_text,  # lift shown on hover
    hoverinfo='text',      # only show the hover text on hover
    marker=dict(
        size=ConvAsso['CONFIDENCE'],  # marker area scales with confidence
        sizemode='area',
        sizeref=0.0004,
        symbol='square',
        color=ConvAsso['LIFT'],       # marker colour encodes lift
        colorscale='GnBu'
    )
))

fig.update_layout(title='Channel Associations in Converting Journeys', title_x=0.5)
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The strongest channel associations within conversion journeys are <b>Instagram</b> + <b>Facebook</b> and <b>Paid Search</b> + <b>Online Display</b>.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>8. TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY (TF-IDF)</b></p>

<p style = 'font-size:16px;font-family:Arial'>TF-IDF is a technique commonly used in natural language processing and text mining tasks to determine the importance of a term within a document or corpus.</p> 

<p style = 'font-size:14px;font-family:Arial'>TF-IDF measures how relevant a word is to a text within a series or corpus of texts.</p>
<p style = 'font-size:14px;font-family:Arial'>The relevance increases in proportion to the number of times the word appears in the text, but is offset by the frequency of the word in the corpus.</p>
<p style = 'font-size:14px;font-family:Arial'>It is commonly used to rank word relevance and to compare text documents.</p>

<p style = 'font-size:16px;font-family:Arial'>Treating paths (sequences of events) as text, we compute and compare the TF-IDF scores between the two sets of event paths (converting and non-converting). We can then examine the top-ranked terms (in our case, channels) with high TF-IDF scores in each set to identify the channels that are most distinctive or important within each set. This lets us compare channel contribution across Converted and Non-Converted journeys and put the calculated attribution weights in perspective.</p>



<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>8.1. Prepare Data</b></p>

 <p style = 'font-size:16px;font-family:Arial;'>Tokenize paths for both converting and non-converting journeys and save the output into a table. For path tokenization we use the NGramSplitter function, which splits the input text (here, paths) into "terms" (grams) of the selected size (here 1, so that each event is one term) and counts them.
</p>

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Converting journeys.</b></p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE convgrams;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create table 
qry = '''
 CREATE MULTISET TABLE convgrams AS(
    SEL*
       
            FROM NGramSplitter (
    ON  (select * from conv_journeys where event_cnt <=20)
    USING
    TextColumn ('path')
    Delimiter (',')
    Grams ('1')
    Overlapping ('true')
    ToLowerCase ('false')
    --Punctuation ('\[.,?\!\]')
    --Reset ('\[.,?\!\]')
    Reset ('[]')
    TotalGramCount ('false')
  --  Accumulate ('cookie')
  ) AS dt
) WITH DATA PRIMARY INDEX (ngram)
;
''' 

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Non-Converting journeys.</b></p>
<p style = 'font-size:16px;font-family:Arial'>As with the converting journeys, we also use NGramSplitter on the non-converting journeys.</p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE nonconvgrams;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create table 
qry = '''
 CREATE MULTISET TABLE nonconvgrams AS(
    SEL*
       
            FROM NGramSplitter (
    ON  (select * from nonconv_journeys)
    USING
    TextColumn ('path')
    Delimiter (',')
    Grams ('1')
    Overlapping ('true')
    ToLowerCase ('false')
    --Punctuation ('\[.,?\!\]')
    --Reset ('\[.,?\!\]')
    Reset ('[]')
    TotalGramCount ('false')
  --  Accumulate ('cookie')
  ) AS dt
) WITH DATA PRIMARY INDEX (ngram)
;
''' 

# Execute the query
eng.execute(qry)

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>8.2. Compute TF-IDF scores</b></p>

 <p style = 'font-size:16px;font-family:Arial;'>Calculate the TF-IDF scores for each term in the term-document sets (Converting and Non-Converting). TF-IDF is computed by multiplying the term frequency (TF) of a term in a document by its inverse document frequency (IDF) across the collection of documents. The TF component measures the importance of a term within an individual event path, while the IDF component captures the rarity or distinctiveness of a term across the entire set of event paths.
</p>
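<p style = 'font-size:16px;font-family:Arial'>The sketch below mirrors the structure of the SQL that follows in plain Python: TF is the channel's frequency divided by the number of events in the path, IDF is the base-10 logarithm of the total number of rows divided by the number of rows for that channel, and the score of a channel is the sum of TF x IDF over all rows. The row layout, the toy data and the function name are assumptions. The sketch uses floating-point division for the IDF ratio, whereas a ratio of two integer counts in SQL may be truncated depending on the database's type rules, so the numbers are not guaranteed to match the in-database results.</p>

```python
import math
from collections import Counter, defaultdict

def tfidf_by_channel(rows):
    """Sum of TF * IDF per channel.

    rows: list of (cookie, ngram, frequency, event_cnt) tuples,
          one row per (path, channel) pair as produced by tokenization.
    """
    n_rows = len(rows)
    rows_per_channel = Counter(ngram for _, ngram, _, _ in rows)
    scores = defaultdict(float)
    for _, ngram, frequency, event_cnt in rows:
        tf = frequency / event_cnt                          # weight within one path
        idf = math.log10(n_rows / rows_per_channel[ngram])  # rarity across paths
        scores[ngram] += tf * idf
    return dict(scores)

# Toy tokenized paths (hypothetical data).
rows = [
    ("c1", "Facebook", 2, 3),
    ("c1", "Instagram", 1, 3),
    ("c2", "Facebook", 1, 1),
]
scores = tfidf_by_channel(rows)
print(scores)
```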

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Converting journeys.</b></p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE CONVTFIDF;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create table 
qry = '''
CREATE TABLE CONVTFIDF
AS(
WITH 
TFCONV AS
(SELECT t."ngram", t.cookie, 
        1.00000 * t.frequency / t.event_cnt as tf
from convgrams t)
,
IDFCONV AS
(SELECT "ngram", log((SELECT count(cookie) FROM convgrams)/(count(*))) as IDF 
FROM convgrams 
group by "ngram"
)
SELECT TFCONV."ngram", sum((tf*idf)) AS tfidf
FROM TFCONV JOIN IDFCONV  ON TFCONV."ngram"=IDFCONV."ngram"
group by 1) WITH DATA;
''' 

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b> Non-Converting journeys.</b></p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE NONCONVTFIDF;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise
# Create table 
qry = '''
CREATE TABLE NONCONVTFIDF
AS(
WITH 
TFNONCONV AS
(SELECT 
        "ngram", cookie, 
        1.00000 * frequency / event_cnt as tf
from nonconvgrams)
,
IDFNONCONV AS
(SELECT "ngram", log((SELECT count(cookie) FROM nonconvgrams)/(count(*))) as IDF 
FROM nonconvgrams
group by "ngram")

SELECT TFNONCONV."ngram", sum((tf*idf)) AS tfidf
FROM TFNONCONV JOIN IDFNONCONV  ON TFNONCONV."ngram"=IDFNONCONV."ngram"
group by 1) WITH DATA;
''' 

# Execute the query
eng.execute(qry)

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>8.3. Rank and Compare</b></p>

 <p style = 'font-size:16px;font-family:Arial;'>The following SQL query ranks the channels by TF-IDF score in both Converting and Non-Converting journeys and returns the two rankings side by side.
</p>

In [None]:
qry = '''
SELECT
conv.channel, converted_rank,nonconverted_rank
from
(
select 
ngram as channel,
rank () over ( order by tfidf desc) as converted_rank
from CONVTFIDF) CONV
INNER JOIN
(
select 
ngram as channel,
rank () over ( order by tfidf desc) as nonconverted_rank
from nonCONVTFIDF) NONCONV
on
CONV.channel=nonconv.channel
;
'''

# Execute the query
eng.execute(qry)

 <p style = 'font-size:16px;font-family:Arial;'>We will create a Slope Chart to compare the channel significance ranking in both Converting and Non-Converting journeys.
</p>

In [None]:
slope=DataFrame.from_query(qry)
slope

In [None]:
df=slope.to_pandas()
df.sort_values(by='channel', inplace=True)

In [None]:
import matplotlib.pyplot as plt

# Sort DataFrame by channel
df.sort_values(by='channel', inplace=True)

# Create figure and axis
fig, ax = plt.subplots()

# Set x and y values for the slope chart
x = [0, 1]
channels = df['channel']
y_conv = df['converted_rank']
y_nconv = df['nonconverted_rank']

# Define custom colors for each channel
color_mapping = {
    'Instagram': '#F28E2B',
    'Facebook': '#4E79A7',
    'Online Display': '#E15759',
    'Online Video': '#76B7B2',
    'Paid Search': '#59A14F',
    # Add more channels and corresponding colors as needed
}

# Plot the slope chart with assigned colors
for channel, conv, nconv in zip(channels, y_conv, y_nconv):
    color = color_mapping.get(channel, 'black')  # Default color if channel not found in the mapping
    ax.plot(x, [conv, nconv], marker='o', markersize=10, color=color, label='_nolegend_')
    ax.text(-0.1, conv, channel, ha='right', va='center', fontsize=8, color='black')
    ax.text(1.05, nconv, channel, ha='left', va='center', fontsize=8, color='black')

# Set x-axis ticks and labels
ax.set_xticks(x)
ax.set_xticklabels(['CONVERTING', 'NON CONVERTING'])

# Set y-axis label
ax.set_ylabel('Rank')

# Set title
ax.set_title('Comparing Channel in Converting and Non Converting Paths',loc='center', pad=30)

# Remove spines (borders) of the plot
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Hide ticks and tick labels on the left spine
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('bottom')

# Set the limits of the x-axis
ax.set_xlim(-0.4, 1.2)

# Format y-axis tick labels to remove decimal values with .5 and invert the scale
ax.yaxis.set_major_locator(plt.MaxNLocator(integer=True))
ax.invert_yaxis()

# Display the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial;'><b>Online Video</b> and <b>Facebook</b> are slightly more significant in Converting journeys, while <b>Paid Search</b> is clearly more distinctive of Non-Converting journeys.
</p>

<hr>
<a id="ml"></a>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>9. MACHINE LEARNING BASED MODELS</b></p>


<p style = 'font-size:16px;font-family:Arial'>Machine Learning based models allow us to switch from rule-based/heuristic methods to probabilistic ones, moving further up the maturity scale. With a data-driven algorithmic approach, attribution outputs are derived from the data and from the modelling of that data.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>NAIVE BAYES</b></p>


<p style = 'font-size:16px;font-family:Arial'>Naive Bayes is a machine learning algorithm commonly used for classification tasks, including text classification, spam filtering, and sentiment analysis. While it is not typically used to directly compute marketing attribution, it can be employed as part of a broader marketing attribution framework.</p>

<p style = 'font-size:16px;font-family:Arial'>We will use Naive Bayes for binary text classification of paths in two categories, converted and non converted. Once the Naive Bayes classifier is trained, it can be used to estimate the probability that a specific marketing touchpoint contributed to an outcome.</p>

<p style = 'font-size:16px;font-family:Arial'>By evaluating the likelihood of the observed features associated with conversion, the algorithm can provide a probability score representing the attribution weight.</p>
<p style = 'font-size:16px;font-family:Arial'>To run a Naive Bayes classification model we can leverage Vantage's native Naive Bayes text classifier trainer function, alongside some in-database data preparation.</p>


<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Prepare Data</b></p>

 <p style = 'font-size:16px;font-family:Arial;'>Tokenize paths for both converting and non-converting journeys and save the output into a table. For path tokenization we use the NGramSplitter function, which splits the input text (here, paths) into "terms" (grams) of the selected size (here 1, so that each event is one term) and counts them.</p>
 <p style = 'font-size:16px;font-family:Arial;'>This data preparation step provides the input for the Naive Bayes model.
</p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE ALLNGRAMS;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Create the table
qry = '''
CREATE MULTISET TABLE ALLNGRAMS AS(
 SEL 
 cookie,TRIM(BOTH FROM ngram) as ngram, '1' as distcnt, frequency as totcnt, '1' as conv 
            FROM NGramSplitter (
    ON  conv_journeys
    USING
    TextColumn ('path')
    Delimiter (',')
    Grams ('1')
    Overlapping ('f')
    ToLowerCase ('false')
    --Punctuation ('\[.,?\!\]')
    --Reset ('\[.,?\!\]')
    Reset ('[]')
    TotalGramCount ('false')
   Accumulate ('cookie')
  ) AS dt
 
  UNION ALL 
  
SEL 
 cookie,TRIM(BOTH FROM ngram) as ngram, '1' as distcnt, frequency as totcnt, '0' as conv 
            FROM NGramSplitter (
    ON  nonconv_journeys
    USING
    TextColumn ('path')
    Delimiter (',')
    Grams ('1')
    Overlapping ('true')
    ToLowerCase ('false')
    --Punctuation ('\[.,?\!\]')
    --Reset ('\[.,?\!\]')
    Reset ('[]')
    TotalGramCount ('false')
   Accumulate ('cookie')
  ) AS dt
  ) with data;
  '''

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Run Naive Bayes Text Classifier model</b></p>
<p style = 'font-size:16px;font-family:Arial'>TD_NaiveBayesTextClassifierTrainer function calculates the conditional probabilities for token-category pairs, the prior probabilities, and the missing token probabilities for all categories. The trainer function trains the model with the probability values (and the predict function - not used here - would use the values to classify paths into categories).</p>

In [None]:
#Drop table if exists
qry = 'DROP TABLE NBOUTPUT;'
try:
    eng.execute(qry)
except Exception as e:
    if str(e.args).find('3807') >= 1:
        pass
    else:
        raise

# Run Naive Bayes Text Classifier and output the result in a table  
qry = '''
CREATE TABLE NBOUTPUT AS
(
  SELECT * FROM TD_NaiveBayesTextClassifierTrainer (
   ON allngrams AS InputTable
   USING
   TokenColumn ('ngram')
   DocCategoryColumn ('conv')
   DocIDColumn ('cookie')
   ModelType ('Bernoulli')
) AS dt)
WITH DATA;
'''

# Execute the query
eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Derive Attribution Weights and visualize</b></p>
<p style = 'font-size:16px;font-family:Arial'>The output of the Naive Bayes Text Classifier contains: 
    <li style = 'font-size:16px;font-family:Arial'>token: The classified training tokens (channels from tokenized paths).</li>
    <li style = 'font-size:16px;font-family:Arial'>category: The category of the token (converted, non-converted).</li>
<li style = 'font-size:16px;font-family:Arial'>prob: The probability of the token in the category.</li>
</p>    
<p style = 'font-size:16px;font-family:Arial'>This output probability is used to calculate the attribution of the channels.</p>
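<p style = 'font-size:16px;font-family:Arial'>A minimal sketch of this idea in plain Python is shown below: estimate, for each channel, the probability that it appears in a path of a given category (a Bernoulli-style estimate), then normalize the probabilities of the converted category so that they sum to 1. The add-one smoothing, the toy data and the function names are assumptions made for illustration. The smoothing and exact output of TD_NaiveBayesTextClassifierTrainer may differ.</p>

```python
from collections import defaultdict

def bernoulli_token_probs(docs, smoothing=1.0):
    """Estimate P(token present | category) with additive smoothing.

    docs: list of (set_of_tokens, category) pairs, one per path.
    """
    n_docs = defaultdict(int)   # number of paths per category
    n_with = defaultdict(int)   # number of paths per (token, category)
    vocab = set()
    for tokens, category in docs:
        n_docs[category] += 1
        for token in tokens:
            n_with[(token, category)] += 1
            vocab.add(token)
    return {
        (token, category):
            (n_with[(token, category)] + smoothing) / (n_docs[category] + 2 * smoothing)
        for category in n_docs
        for token in vocab
    }

def attribution_from_probs(probs, category="1"):
    """Normalize the probabilities of one category so they sum to 1."""
    selected = {token: p for (token, cat), p in probs.items() if cat == category}
    total = sum(selected.values())
    return {token: p / total for token, p in selected.items()}

# Toy paths (hypothetical data): category '1' = converted, '0' = not converted.
docs = [
    ({"Facebook", "Instagram"}, "1"),
    ({"Facebook"}, "1"),
    ({"Paid Search"}, "0"),
]
probs = bernoulli_token_probs(docs)
weights = attribution_from_probs(probs, category="1")
print(weights)
```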

In [None]:
qry = '''
  WITH TOTAL as
  (Select sum(prob) as total_attribution from
    NBOUTPUT
    where category='1'
    and token in ('Online Display', 'Online Video', 'Facebook','Instagram','Paid Search'))
    
  Select token as channel, prob/total_attribution as nb_attribution from
    NBOUTPUT, TOTAL
    where category='1'
    and token in ('Online Display', 'Online Video', 'Facebook','Instagram','Paid Search');
'''

# Execute the query
eng.execute(qry)

In [None]:
nbattribution=DataFrame.from_query(qry)
nbattribution=nbattribution.to_pandas()
nbattribution

<p style = 'font-size:16px;font-family:Arial'>Visualizing the results in a vertical bar chart.</p>

In [None]:
import plotly.express as px
fig = px.bar(nbattribution, y="nb_attribution", x="channel", 
             color='channel', orientation='v',
             height=600,width=900,
             color_discrete_map = {'Online Display': '#E15759','Online Video': '#76B7B2','Facebook': '#4E79A7','Instagram': '#F28E2B' ,'Paid Search': '#59A14F'},
             title='Attribution Summary')
fig.update_layout(title_text='Naive Bayes Model Attribution Summary', title_x=0.5)
fig.update_xaxes(title='Channel',tickangle=-45)
fig.update_yaxes(title='Attribution Weight')
fig.update_traces(width=0.5)
fig.show()

<p style = 'font-size:16px;font-family:Arial'>The above graph shows the Attribution value using the Naive Bayes Model. The Attribution Value for Facebook channel is highest and that for Online Display is the lowest.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>10. MULTITOUCH ATTRIBUTION MODELS SUMMARY</b></p>

<p style = 'font-size:16px;font-family:Arial'>To compare the attribution results of all models in a single chart, we group them together using the query below and create a visualization.</p>

In [None]:
qry= '''
SELECT
ATTRIB.CHANNEL,
ATTRIB.Uniform,
ATTRIB.FirstClick,
ATTRIB.LastClick,
ATTRIB.Exponential,
NB.NaiveBayes,
--ASSO.Association,
FREQ.Frequency
FROM
(
WITH TOTAL
AS
(SELECT
SUM(UNIFORM_ATTRIBUTION) AS TOT_UNI,
SUM(FIRST_CLICK_ATTRIBUTION) AS TOT_FC, 
SUM(LAST_CLICK_ATTRIBUTION)AS TOT_LC, 
SUM(EXPONENTIAL_ATTRIBUTION) AS TOT_EXP
from 
ATTRIBUTION_4MODEL_OUTPUT)
select 
CHANNEL, 
SUM(UNIFORM_ATTRIBUTION)/TOT_UNI AS Uniform,
SUM(FIRST_CLICK_ATTRIBUTION)/TOT_FC AS FirstClick, 
SUM(LAST_CLICK_ATTRIBUTION)/TOT_LC AS LastClick, 
SUM(EXPONENTIAL_ATTRIBUTION)/TOT_EXP as Exponential
from 
ATTRIBUTION_4MODEL_OUTPUT, TOTAL
GROUP BY CHANNEL)
as ATTRIB
,

(    
      WITH TOTAL as
  (Select sum(prob) as total_attribution from
    NBOUTPUT
    where category='1'
    and token in ('Online Display', 'Online Video', 'Facebook','Instagram','Paid Search'))
    
  Select token as channel, prob/total_attribution as NaiveBayes from
    NBOUTPUT, TOTAL
    where category='1'
    and token in ('Online Display', 'Online Video', 'Facebook','Instagram','Paid Search')
)AS NB
,
(
select channel, tp as frequency from cngrams
 ) AS FREQ   
 WHERE
ATTRIB.channel =NB.channel and
--ATTRIB.channel =ASSO.channel and
ATTRIB.channel =FREQ.channel
'''

# Execute the query
eng.execute(qry)



In [None]:
summary=DataFrame.from_query(qry)
summary_plot=summary.to_pandas()
summary_plot

In [None]:
import plotly.graph_objects as go

fig = go.Figure(
    data=[
        go.Bar(name='Uniform', x=summary_plot["CHANNEL"], y=summary_plot["Uniform"], yaxis='y', offsetgroup=1,marker_color='#76B7B2'),
        go.Bar(name='First Click', x=summary_plot["CHANNEL"], y=summary_plot["FirstClick"], yaxis='y', offsetgroup=2, marker_color='#F28E2B'),
        go.Bar(name='Last Click', x=summary_plot["CHANNEL"], y=summary_plot["LastClick"], yaxis='y', offsetgroup=3,marker_color='#E15759'),
        go.Bar(name='Exponential', x=summary_plot["CHANNEL"], y=summary_plot["Exponential"], yaxis='y', offsetgroup=4,marker_color='#4E79A7'),
        go.Bar(name='Naive Bayes', x=summary_plot["CHANNEL"], y=summary_plot["NaiveBayes"], yaxis='y', offsetgroup=7,marker_color='#EDC948'),
        go.Bar(name='Frequency', x=summary_plot["CHANNEL"], y=summary_plot["frequency"], yaxis='y', offsetgroup=9,marker_color='#B07AA1')
    ],
    layout={
        'yaxis': {'title': 'Attribution'},
    }
)
 
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

<p style = 'font-size:16px;font-family:Arial'>Statistical models (Simple Frequency, Association and Term Frequency) and algorithmic models (such as Naive Bayes) tend to produce slightly different attribution scores from the rule-based models.</p>
<p style = 'font-size:16px;font-family:Arial'>The bar chart above shows the attribution score that each model assigns to each channel. Comparing the statistical and ML-based models with the rule-based ones gives insight into the relative importance of the marketing channels. For the first touch, last touch and linear (uniform) models, Facebook and Paid Search are the most important channels driving conversions, while Instagram and Online Display are the least important. According to the statistical and ML-based models, however, Instagram is far more important to our conversions than the simple attribution models suggest; according to the probabilistic model, it is in fact our third most important channel. In addition, according to the Association and Naive Bayes models, Online Video appears less important than the other models indicate.</p>
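<p style = 'font-size:16px;font-family:Arial'>One way to make the ranking comparison concrete is to sort the channels by score within each model and read the orderings side by side. The snippet below is a minimal sketch, not part of the original notebook: <code>rank_channels</code>, the model names, the channel names and the scores are all hypothetical placeholders. With the notebook's data, an equivalent input could be built with <code>summary_plot.set_index("CHANNEL").to_dict()</code>.</p>

```python
def rank_channels(scores_by_model):
    """For each model, return its channels ordered from highest to lowest score."""
    return {
        model: sorted(scores, key=scores.get, reverse=True)
        for model, scores in scores_by_model.items()
    }

# Hypothetical placeholder scores (made-up values, not notebook output)
toy_scores = {
    "rule_based_model": {"channel_a": 0.40, "channel_b": 0.30, "channel_c": 0.10, "channel_d": 0.20},
    "ml_based_model": {"channel_a": 0.35, "channel_b": 0.30, "channel_c": 0.25, "channel_d": 0.10},
}

rankings = rank_channels(toy_scores)
for model, order in rankings.items():
    print(model, order)
```

<p style = 'font-size:16px;font-family:Arial'>A channel that moves up or down between the rule-based and the statistical/ML orderings is a candidate for a closer look before budgets are reallocated.</p>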

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>12. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>

In [None]:
# Work tables created by this notebook
tables = ['CONV_JOURNEYS', 'NONCONV_JOURNEYS', 'cngrams', 'convgrams', 'nonconvgrams', 'ALLNGRAMS', 'ATTRIBUTION_CONVERSION',
          'ATTRIBUTION_MODEL_UNIFORM', 'ATTRIBUTION_MODEL_FIRSTCLICK', 'ATTRIBUTION_MODEL_LASTCLICK', 'ATTRIBUTION_MODEL_EXPONENTIAL',
          'ATTRIBUTION_4MODEL_OUTPUT', 'ATTRIBUTION_DATA4_ASSO_2', 'CONVTFIDF', 'NONCONVTFIDF', 'NBOUTPUT']

# Loop through the list and drop each table from the demo_user schema
for table in tables:
    db_drop_table(table_name=table, schema_name='demo_user')

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_MultiTouchAttribution');" 
# Takes about 40 seconds

In [None]:
remove_context()

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>