### [Marketing Analytics through Markov Chain](https://towardsdatascience.com/marketing-analytics-through-markov-chain-a9c7357da2e8)

## **Essential definitions**
#### channels

- Examples: google, facebook, email...

#### path

- Examples:
    - facebook.com / referral
    - google.com / referral
    - email
    - facebook / email / instagram
    - None

## **Arguments in ref (Jéssica)**
- **Data** data.frame containing customer journeys data.
- **var_path** column name containing paths.
- **var_conv** column name containing total conversions.
- **var_null** column name containing total paths that do not lead to conversions.
- **var_value** column name containing total conversion value.
- **max_order** maximum Markov Model order considered.
- **roc_npt** number of points used for approximating roc and auc [[1]](https://medium.com/bio-data-blog/entenda-o-que-%C3%A9-auc-e-roc-nos-modelos-de-machine-learning-8191fb4df772)[[2]](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc).
- **plot** if TRUE, a plot with penalized auc with respect to order will be displayed.
- **nsim_start** <em>minimum number of simulations</em> used in computation.
- **max_step** <em>maximum number of steps for a single simulated path</em>. If NULL, it is the maximum number of steps found in Data.
- **out_more** if TRUE, transition probabilities between channels and removal effects will be shown.
- **sep** separator between the channels.
- **ncore** number of threads used in computation.
- **nfold** how many repetitions are used to verify if convergence is reached at each iteration.
- **seed** random seed. Giving this parameter the same value over different runs guarantees that results will not vary.
- **conv_par** convergence parameter for the algorithm. The estimation process ends when the percentage of variation of the results over different repetitions is less than the convergence parameter.
- **rate_step_sim** number of simulations used at each iteration is equal to the number of simulations used at previous iteration multiplied by rate_step_sim.
- **verbose** if TRUE, additional information about process convergence will be shown.

### References
##### ***You can click the words in blue to consult the references!
### [Package ‘ChannelAttribution’](https://cran.r-project.org/web/packages/ChannelAttribution/ChannelAttribution.pdf), document sent by Jéssica

### Application of the main reference (sent by Jéssica) in Python: [Markov Multi-Channel Attribution](https://stackoverflow.com/questions/51817219/channel-attribution-markov-chain-model-in-python)

### References, packages, and illustrations to explain the application and the results of the method:

### [1- pip install](https://pypi.org/project/marketing-attribution-models/)
### [2- Marketing Multi-Channel Attribution model with R (part 1: Markov chains concept)](https://www.analyzecore.com/2016/08/03/attribution-model-r-part-1/)
### [3- Marketing Multi-Channel Attribution model with R (part 2: practical issues)](https://www.analyzecore.com/2017/05/31/marketing-multi-channel-attribution-model-r-part-2-practical-issues/)
### [4- Markov chain in Python on GitHub](https://github.com/franciscoicmc/simulacao/blob/master/Markov-PageRank.ipynb)



# Below, the method is adapted to Python

In [1]:
import time
import pandas as pd
import numpy as np
import collections
from itertools import chain
import itertools
from scipy import stats  # public namespace; scipy.stats.stats is deprecated
import statistics 

### The problem of the impact of **advertising** on **conversion**.
#### In the following dataframe, **path** is the route that leads to conversions with a certain **probability**, calculated by the Markov chain method, and **conversions** is the number of conversions recorded for that path. Each term that makes up a path (for example: google, instagram) is a vertex of the graph that represents the associated Markov chain.
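
To make the graph picture concrete, the sketch below splits a few paths on the `>` separator (the same separator the helper functions in this notebook split on), adds `start` and `convert` states, and counts the transitions between consecutive vertices. The example paths and the `count_transitions` helper are illustrative assumptions, not part of the dataset or of the original code.

```python
from collections import Counter


def count_transitions(paths, sep=">"):
    """Count transitions between consecutive states over all paths.

    Hypothetical helper: each path is wrapped in 'start' and 'convert'
    states, mirroring the convention used later in this notebook.
    """
    counts = Counter()
    for path in paths:
        states = ["start"] + [s.strip() for s in path.split(sep)] + ["convert"]
        # Each pair of consecutive states is one edge of the graph.
        counts.update(zip(states, states[1:]))
    return counts


# Made-up example paths (assumption, not taken from the spreadsheet).
example_paths = ["google > instagram", "google", "instagram > google"]
transitions = count_transitions(example_paths)

for (src, dst), n in sorted(transitions.items()):
    print(f"{src} -> {dst}: {n}")
```

Dividing each count by the total number of transitions leaving its source state would turn these counts into the transition probabilities of the chain.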

In [2]:
df = pd.read_excel("channel attribution example.xlsx")
df.head()

| | path | conversions |
|---|---|---|
| 0 | google / organic | 231 |
| 1 | l.instagram.com / referral | 228 |
| 2 | (direct) / (none) | 204 |
| 3 | m.facebook.com / referral | 179 |
| 4 | PK - Sapphire Brand Campaign | 138 |


In [3]:
df.to_csv("channel.csv")

In [4]:
def unique(list1):  
    unique_list = []   
    for x in list1: 
        if x not in unique_list: 
            unique_list.append(x) 

    return(unique_list)

def split_fun(path):
    return path.split('>')

def calculate_rank(vector):
    a={}
    rank=0
    for num in sorted(vector):
        if num not in a:
            a[num]=rank
            rank=rank+1
    return[a[i] for i in vector]

def transition_matrix_func(import_data):

    z_import_data=import_data.copy()

    z_import_data['path1']='start>'+z_import_data['path']
    z_import_data['path2']=z_import_data['path1']+'>convert'


    z_import_data['pair']=z_import_data['path2'].apply(split_fun)

    zlist=z_import_data['pair'].tolist()
    zlist=list(chain.from_iterable(zlist))
    zlist=list(map(str.strip, zlist))
    T=calculate_rank(zlist)

    M = [[0]*len(unique(zlist)) for _ in range(len(unique(zlist)))]

    for (i,j) in zip(T,T[1:]):
        M[i][j] += 1

    x_df=pd.DataFrame(M)

    np.fill_diagonal(x_df.values,0)

    x_df=pd.DataFrame(x_df.values/x_df.values.sum(axis=1)[:,None])
    x_df.columns=sorted(unique(zlist))
    x_df['index']=sorted(unique(zlist))
    x_df.set_index("index", inplace = True) 
    x_df.loc['convert',:]=0
    return(x_df)

def simulation(trans,n):

    sim=['']*n
    sim[0]= 'start'
    i=1
    while i<n:
        sim[i] = np.random.choice(trans.columns, 1, p=trans.loc[sim[i-1],:])[0]
        if sim[i]=='convert':
            break
        i=i+1

    return sim[0:i+1]


def markov_chain(data_set,no_iteration=10,no_of_simulation=10000,alpha=5):


    import_dataset_v1=data_set.copy()
    import_dataset_v1=(import_dataset_v1.reindex(import_dataset_v1.index.repeat(import_dataset_v1.conversions))).reset_index()
    import_dataset_v1['conversions']=1

    import_dataset_v1=import_dataset_v1[['path','conversions']]

    import_dataset=(import_dataset_v1.groupby(['path']).sum()).reset_index()
    import_dataset['probability']=import_dataset['conversions']/import_dataset['conversions'].sum()

    final=pd.DataFrame()


    for k in range(0,no_iteration):
        start = time.time()
        import_data=pd.DataFrame({'path':np.random.choice(import_dataset['path'],size=import_dataset['conversions'].sum(),p=import_dataset['probability'],replace=True)})
        import_data['conversions']=1                           

        tr_matrix=transition_matrix_func(import_data)
        channel_only = list(filter(lambda k0: k0 not in ['start','convert'], tr_matrix.columns)) 

        ga_ex=pd.DataFrame()
        tr_mat=tr_matrix.copy()
        p=[]

        i=0
        while i<no_of_simulation:
            p.append(unique(simulation(tr_mat,1000)))
            i=i+1


        path=list(itertools.chain.from_iterable(p))
        counter=collections.Counter(path)

        df=pd.DataFrame({'path':list(counter.keys()),'count':list(counter.values())})
        df=df[['path','count']]
        ga_ex=pd.concat([ga_ex,df],ignore_index=True)  # DataFrame.append was removed in pandas 2.0

        df1=(pd.DataFrame(ga_ex.groupby(['path'])[['count']].sum())).reset_index()

        df1['removal_effects']=df1['count']/len(path)
        #df1['removal_effects']=df1['count']/sum(df1['count'][df1['path']=='convert'])
        df1=df1[df1['path'].isin(channel_only)]
        df1['ass_conversion']=df1['removal_effects']/sum(df1['removal_effects'])

        df1['ass_conversion']=df1['ass_conversion']*sum(import_dataset['conversions']) 

        final=pd.concat([final,df1],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
        end = time.time()
        t1=(end - start)
        print(t1)   

    '''
    H0: u=0
    H1: u>0
    '''


    unique_channel=unique(final['path'])
    #final=(pd.DataFrame(final.groupby(['path'])[['ass_conversion']].mean())).reset_index()
    final_df=pd.DataFrame()

    for i in range(0,len(unique_channel)):

        x=(final['ass_conversion'][final['path']==unique_channel[i]]).values
        final_df.loc[i,0]=unique_channel[i]
        final_df.loc[i,1]=x.mean()

        v=stats.ttest_1samp(x,0)
        final_df.loc[i,2]=v[1]/2

        if v[1]/2<=alpha/100:
            final_df.loc[i,3]=str(100-alpha)+'% statistically confidence'
        else:
            final_df.loc[i,3]=str(100-alpha)+'% statistically not confidence'

        final_df.loc[i,4]=len(x)
        final_df.loc[i,5]=statistics.stdev(x)
        final_df.loc[i,6]=v[0]

    final_df.columns=['channel','ass_conversion','p_value','confidence_status','frequency','standard_deviation','t_statistics']       
    final_df['ass_conversion']=sum(import_dataset['conversions']) *final_df['ass_conversion'] /sum(final_df['ass_conversion'])

    return final_df,final

import_dataset=pd.read_csv('channel.csv')

data,dataset=markov_chain(import_dataset,no_iteration=10,no_of_simulation=10000,alpha=5)

12.478021621704102
10.520028352737427
10.083999395370483
11.279914855957031
11.1840980052948
12.13890528678894
11.795907258987427
11.705999612808228
10.6172456741333
11.040041208267212


### Definition of **channel**
#### $\;\;\;\;\;$ **channel** is the route (or marketing strategy) that leads to a conversion. In the following dataframe, **path** is used with a similar meaning.


### Definition of **removal_effects**
#### <p style='text-align: justify;'>$\;\;\;\;\;$ Consider a **residual** conversion probability, defined as the total conversion probability minus the probability that would be obtained if the **path** (or specific advertising strategy) did not exist. **removal_effects** is the ratio between the **residual** conversion probability and the **total** one. Therefore, the larger the **removal_effects**, the more impact that **path** has on the number of conversions.</p>
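
As a worked illustration of this definition, the sketch below computes the removal effect exactly on a small absorbing Markov chain instead of estimating it by simulation as the notebook's code does. The transition matrix, the channel names `A` and `B`, the `null` (no conversion) state, and both function names are hypothetical assumptions chosen for the example.

```python
import numpy as np

states = ["start", "A", "B", "convert", "null"]
# Hypothetical row-stochastic transition matrix (rows: from, columns: to).
P = np.array([
    [0.0, 0.6, 0.4, 0.0, 0.0],  # start
    [0.0, 0.0, 0.5, 0.3, 0.2],  # A
    [0.0, 0.2, 0.0, 0.4, 0.4],  # B
    [0.0, 0.0, 0.0, 1.0, 0.0],  # convert (absorbing)
    [0.0, 0.0, 0.0, 0.0, 1.0],  # null (absorbing)
])


def conversion_probability(P, states):
    """Probability of ending in 'convert' when the chain begins at 'start'."""
    transient = [i for i, s in enumerate(states) if s not in ("convert", "null")]
    conv = states.index("convert")
    Q = P[np.ix_(transient, transient)]  # transient -> transient block
    r = P[transient, conv]               # transient -> convert column
    # Solve (I - Q) x = r; x[k] is the conversion probability from state k.
    x = np.linalg.solve(np.eye(len(transient)) - Q, r)
    return x[transient.index(states.index("start"))]


def removal_effect(P, states, channel):
    """Residual conversion probability divided by the total one."""
    c, null = states.index(channel), states.index("null")
    P_removed = P.copy()
    P_removed[:, null] += P_removed[:, c]  # traffic into the channel is lost
    P_removed[:, c] = 0.0
    total = conversion_probability(P, states)
    without = conversion_probability(P_removed, states)
    return (total - without) / total


for ch in ("A", "B"):
    print(ch, round(removal_effect(P, states, ch), 4))
```

In this made-up chain, removing `A` costs more conversions than removing `B`, so `A` gets the larger removal effect.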

### Definition of **ass_conversion**
$\;\;\;\;\;$ **ass_conversion** is the share of the **removal_effects** of a given strategy relative to the sum of **all removal_effects**, multiplied by the total number of conversions

$$ass\_conversion_i = \frac{removal\_effects_i}{\sum \limits _{j=0} ^{N} removal\_effects_j} \times Total\_conversion$$
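
A minimal sketch of this calculation, assuming the removal effects are already known: each channel receives its share of the summed removal effects, scaled by the total number of conversions. The channel names, the numbers, and the `attributed_conversions` helper are hypothetical.

```python
def attributed_conversions(removal_effects, total_conversions):
    """Split total conversions across channels in proportion to removal effects."""
    total_effect = sum(removal_effects.values())
    return {
        channel: total_conversions * effect / total_effect
        for channel, effect in removal_effects.items()
    }


# Made-up removal effects for three channels (assumption for illustration).
removal_effects = {"google": 0.6, "facebook": 0.3, "email": 0.1}
result = attributed_conversions(removal_effects, total_conversions=200)

for channel, value in result.items():
    print(f"{channel}: {value:.1f}")
```

By construction the attributed conversions add up to the total, which is why the values in the result tables are on the scale of conversion counts rather than fractions.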

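### Definition of **count**
$\;\;\;\;\;$ **count** is the number of times the respective path was used for the conversion

### Definition of **t_statistics**
$\;\;\;\;\;$ **t_statistics** is the one-sample t statistic of a channel's **ass_conversion** values across iterations, tested against a mean of zero. In this case it measures the impact of the strategy: the larger the t_statistics value, the greater the impact.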

### <center>![t-statistic](t-test.jpeg)</center>
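
The test applied to each channel can be sketched as follows: a one-sample t-test of the per-iteration values against zero, with the two-sided p-value halved to obtain the one-sided one, as `markov_chain` does with `v[1]/2`. The sample values below are hypothetical, and the manual formula is included only to show what the statistic measures.

```python
import numpy as np
from scipy import stats

# Hypothetical attributed conversions for one channel over 10 iterations.
x = np.array([3.1, 2.4, 4.0, 2.9, 3.5, 1.8, 3.3, 2.7, 3.9, 2.2])

# H0: mean = 0, H1: mean > 0.
t_stat, p_two_sided = stats.ttest_1samp(x, 0)
p_one_sided = p_two_sided / 2  # appropriate for H1: mean > 0 when t_stat > 0

# The same statistic by hand: mean divided by its standard error.
t_manual = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

alpha = 5  # significance level in percent, as in markov_chain
print("t statistic:", t_stat)
print("one-sided p-value:", p_one_sided)
print("significant:", p_one_sided <= alpha / 100)
```

A large t statistic here means the mean is far from zero relative to how much the values vary between iterations, so a channel with a small but very stable attribution can still score highly.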

In [5]:
pd.set_option("display.max_rows", 1000)
data

| | channel | ass_conversion | p_value | confidence_status | frequency | standard_deviation | t_statistics |
|---|---|---|---|---|---|---|---|
| 0 | (direct) / (none) | 1232.423604 | 9.891807e-18 | 95% statistically confidence | 10.0 | 21.086098 | 185.263055 |
| 1 | 0e7307fc4f-EMAIL_CAMPAIGN_2018_08_13_09_04 | 1.438747 | 5.618429e-05 | 95% statistically confidence | 10.0 | 0.702372 | 6.49295 |
| 2 | 10e1861a3d-EMAIL_CAMPAIGN_2018_10_07_11_03 | 6.018608 | 6.973012e-05 | 95% statistically confidence | 10.0 | 3.023779 | 6.309153 |
| 3 | 1a106d9a00-EMAIL_CAMPAIGN_2018_11_06_05_23 | 2.956845 | 1.196279e-05 | 95% statistically confidence | 10.0 | 1.183034 | 7.922394 |
| 4 | 3df8425c54-EMAIL_CAMPAIGN_2018_10_25_07_04_COP... | 2.58241 | 4.847408e-05 | 95% statistically confidence | 10.0 | 1.236376 | 6.620636 |
| 5 | 496348d35d-EMAIL_CAMPAIGN_2018_09_29_07_46 | 0.764974 | 0.003137931 | 95% statistically confidence | 8.0 | 0.563023 | 3.852037 |
| 6 | 49a41f9f72-EMAIL_CAMPAIGN_2018_09_14_06_54 | 2.903326 | 3.087439e-05 | 95% statistically confidence | 9.0 | 1.145401 | 7.622272 |
| 7 | 4e1a75c7ff-EMAIL_CAMPAIGN_2018_10_16_05_14 | 2.552722 | 0.0001387857 | 95% statistically confidence | 10.0 | 1.408111 | 5.746346 |
| 8 | 4f7bef2225-EMAIL_CAMPAIGN_2018_11_02_06_53 | 3.144684 | 5.519389e-06 | 95% statistically confidence | 10.0 | 1.142975 | 8.720985 |
| 9 | 65ca25afa9-EMAIL_CAMPAIGN_2018_10_18_07_55 | 2.224112 | 3.486584e-05 | 95% statistically confidence | 10.0 | 1.01995 | 6.911986 |


### Definition of **path**
#### **path** is the route (or marketing strategy) that leads to a conversion


### Definition of **removal_effects**
#### <p style='text-align: justify;'>Consider a **residual** conversion probability, defined as the total conversion probability minus the probability that would be obtained if the **path** (or specific advertising strategy) did not exist. **removal_effects** is the ratio between the **residual** conversion probability and the **total** one. Therefore, the larger the **removal_effects**, the more impact that **path** has on the number of conversions.</p>

### Definition of **ass_conversion**
#### **ass_conversion** is the share of the **removal_effects** of a given strategy relative to the sum of **all removal_effects**, multiplied by the total number of conversions

$$ass\_conversion_i = \frac{removal\_effects_i}{\sum \limits _{j=0} ^{N} removal\_effects_j} \times Total\_conversion.$$

### Definition of **count**
#### **count** is the number of times the respective path was used for the conversion

In [6]:
pd.set_option("display.max_rows", None)
dataset

| | path | count | removal_effects | ass_conversion |
|---|---|---|---|---|
| 0 | (direct) / (none) | 7642 | 0.16655 | 1217.275769 |
| 1 | 0e7307fc4f-EMAIL_CAMPAIGN_2018_08_13_09_04 | 2 | 4.4e-05 | 0.318575 |
| 2 | 10e1861a3d-EMAIL_CAMPAIGN_2018_10_07_11_03 | 45 | 0.000981 | 7.167942 |
| 3 | 1a106d9a00-EMAIL_CAMPAIGN_2018_11_06_05_23 | 20 | 0.000436 | 3.185752 |
| 4 | 3df8425c54-EMAIL_CAMPAIGN_2018_10_25_07_04_COP... | 5 | 0.000109 | 0.796438 |
| 5 | 496348d35d-EMAIL_CAMPAIGN_2018_09_29_07_46 | 1 | 2.2e-05 | 0.159288 |
| 6 | 49a41f9f72-EMAIL_CAMPAIGN_2018_09_14_06_54 | 16 | 0.000349 | 2.548601 |
| 7 | 4e1a75c7ff-EMAIL_CAMPAIGN_2018_10_16_05_14 | 14 | 0.000305 | 2.230026 |
| 8 | 4f7bef2225-EMAIL_CAMPAIGN_2018_11_02_06_53 | 25 | 0.000545 | 3.98219 |
| 9 | 65ca25afa9-EMAIL_CAMPAIGN_2018_10_18_07_55 | 19 | 0.000414 | 3.026464 |
