# New payment funnel - test results 

I've received an analytical task from an international online store. My predecessor failed to complete it: they launched an A/B test and then quit (to start a watermelon farm in Brazil). They left only the technical specifications and the test results.  

Expected result: within 14 days of signing up, users will show better conversion into product page views (the product_page event), product cart views (product_cart) and purchases (purchase). At each stage of the funnel product_page → product_cart → purchase, there will be at least a 10% increase.  

To see if the results of the test match the company's expectations, I have to:
1. Explore the data - Are there duplicates? Are there missing values? Do any data types need converting? Are there users who appear in both samples? 
1. Carry out exploratory data analysis - Is the number of events per user distributed equally in the samples? How is the number of events distributed by days?
1. Evaluate the A/B test results

In [124]:
import pandas as pd                                           
import numpy as np
import datetime as dt 
import matplotlib.pyplot as plt
from scipy import stats as st
import seaborn as sns
import plotly.express as px
from plotly import graph_objects as go
import math as mth

In [65]:
# Load each table from the working directory, falling back to /datasets
try:
    marketing_events = pd.read_csv('ab_project_marketing_events_us.csv')
except FileNotFoundError:
    marketing_events = pd.read_csv('/datasets/ab_project_marketing_events_us.csv')
try:
    users = pd.read_csv('final_ab_new_users_upd_us.csv')
except FileNotFoundError:
    users = pd.read_csv('/datasets/final_ab_new_users_upd_us.csv')
try:
    user_events = pd.read_csv('final_ab_events_upd_us.csv')
except FileNotFoundError:
    user_events = pd.read_csv('/datasets/final_ab_events_upd_us.csv')
try:
    participants = pd.read_csv('final_ab_participants_upd_us.csv')
except FileNotFoundError:
    participants = pd.read_csv('/datasets/final_ab_participants_upd_us.csv')
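
The four try/except blocks above repeat the same pattern. As a minimal sketch, the fallback could be wrapped in one helper. The name `read_first_existing` and the `demo.csv` file are hypothetical and only serve the illustration; a temporary directory stands in for `/datasets`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_first_existing(filename, directories=(".", "/datasets")):
    """Read the CSV from the first directory in `directories` that contains it."""
    for directory in directories:
        path = Path(directory) / filename
        if path.exists():
            return pd.read_csv(path)
    raise FileNotFoundError(f"{filename} not found in {list(directories)}")


# Demonstration: the first directory does not exist, so the second one is used.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.csv").write_text("user_id,group\nU1,A\nU2,B\n")
    demo = read_first_existing("demo.csv", directories=("missing_dir", tmp))

print(demo)
```

With such a helper, each table would load in a single line, for example `users = read_first_existing('final_ab_new_users_upd_us.csv')`.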

# Data exploration

## Missing values and data categories 

Using the info() method, we can detect whether there are missing values and whether any columns need a different data type. 

In [66]:
marketing_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       14 non-null     object
 1   regions    14 non-null     object
 2   start_dt   14 non-null     object
 3   finish_dt  14 non-null     object
dtypes: object(4)
memory usage: 576.0+ bytes


start_dt and finish_dt can be converted to datetime.

In [67]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58703 entries, 0 to 58702
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     58703 non-null  object
 1   first_date  58703 non-null  object
 2   region      58703 non-null  object
 3   device      58703 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


first_date can be converted to datetime.

In [68]:
user_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423761 entries, 0 to 423760
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   user_id     423761 non-null  object 
 1   event_dt    423761 non-null  object 
 2   event_name  423761 non-null  object 
 3   details     60314 non-null   float64
dtypes: float64(1), object(3)
memory usage: 12.9+ MB


event_dt can be converted to datetime.  
The details column has missing values, but we need to see whether the values in that column matter for our goal. 

In [69]:
participants.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14525 entries, 0 to 14524
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  14525 non-null  object
 1   group    14525 non-null  object
 2   ab_test  14525 non-null  object
dtypes: object(3)
memory usage: 340.6+ KB


Nothing out of the ordinary.

Let's start by converting the data types identified above.

In [70]:
marketing_events['start_dt'] = pd.to_datetime(marketing_events['start_dt'])
marketing_events['finish_dt'] = pd.to_datetime(marketing_events['finish_dt'])
users['first_date'] = pd.to_datetime(users['first_date'])
user_events['event_dt'] = pd.to_datetime(user_events['event_dt'])

Now that all the relevant columns are converted to datetime, we can take a look at the missing values in the user_events table.  
Let's have a look at the table as it is.

In [71]:
user_events.head()

Unnamed: 0,user_id,event_dt,event_name,details
0,E1BDDCE0DAFA2679,2020-12-07 20:22:03,purchase,99.99
1,7B6452F081F49504,2020-12-07 09:22:53,purchase,9.99
2,9CD9F34546DF254C,2020-12-07 12:59:29,purchase,4.99
3,96F27A054B191457,2020-12-07 04:02:40,purchase,4.99
4,1FD7660FDF94CA1F,2020-12-07 10:15:09,purchase,4.99


Details such as prices are not of interest here; we only care about the number of purchases, so I'm going to delete the column.

In [72]:
del user_events['details']

## Duplicates

Let's look for duplicates in the different tables and, if there are any, delete them.

In [73]:
marketing_events.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

Clean

In [74]:
users[users.duplicated()]

Unnamed: 0,user_id,first_date,region,device


Clean

In [75]:
user_events[user_events.duplicated()]

Unnamed: 0,user_id,event_dt,event_name


Clean

In [76]:
participants[participants.duplicated()]

Unnamed: 0,user_id,group,ab_test


There are no duplicates in the data.
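
As a compact alternative to inspecting each table separately, the duplicate counts can be collected in one pass. This is only a sketch on toy data; the tables `toy_users` and `toy_events` are made up for the illustration.

```python
import pandas as pd

# Toy tables: toy_users contains one fully duplicated row, toy_events none.
tables = {
    "toy_users": pd.DataFrame(
        {"user_id": ["U1", "U2", "U2"], "device": ["PC", "Android", "Android"]}
    ),
    "toy_events": pd.DataFrame(
        {"user_id": ["U1", "U2"], "event_name": ["login", "purchase"]}
    ),
}

# duplicated() flags every repeat of an earlier row; summing counts the flags.
duplicate_counts = {name: int(table.duplicated().sum()) for name, table in tables.items()}
print(duplicate_counts)
```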

## People in both tests

Let's see if there are users in both groups A and B.

In [77]:
participants

Unnamed: 0,user_id,group,ab_test
0,D1ABA3E2887B6A73,A,recommender_system_test
1,A7A3664BD6242119,A,recommender_system_test
2,DABC14FDDFADD29E,A,recommender_system_test
3,04988C5DF189632E,A,recommender_system_test
4,4FF2998A348C484F,A,recommender_system_test
...,...,...,...
14520,1D302F8688B91781,B,interface_eu_test
14521,3DE51B726983B657,A,interface_eu_test
14522,F501F79D332BE86C,A,interface_eu_test
14523,63FBE257B05F2245,A,interface_eu_test


In [78]:
participants['ab_test'].unique()

array(['recommender_system_test', 'interface_eu_test'], dtype=object)

Let's first check if there are people in both tests. To do so, I'll use the groupby method, group the data by user id and count the number of tests for each user. Users that appear in two tests are candidates for removal.
Afterwards, we need to divide the tests into two tables.

In [79]:
in_two_tests = participants.groupby('user_id', as_index = False).agg({'ab_test': 'count'})
in_two_tests[in_two_tests['ab_test'] == 2]

Unnamed: 0,user_id,ab_test
1,001064FEAAB631A1,2
8,00341D8401F0F665,2
23,0082295A41A867B5,2
38,00E68F103C66C1F7,2
41,00EFA157F7B6E1C4,2
...,...,...
13576,FEA0C585A53E7027,2
13582,FEC0BCA6C323872F,2
13605,FF2174A1AA0EAD20,2
13610,FF44696E39039D29,2
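
The notebook only lists these users. A minimal sketch of how they could be excluded, on toy data; `toy_participants`, `overlap_ids` and `cleaned` are hypothetical names, and whether such users should really be dropped depends on the test design, which is an assumption here.

```python
import pandas as pd

toy_participants = pd.DataFrame({
    "user_id": ["U1", "U1", "U2", "U3"],
    "group": ["A", "B", "A", "B"],
    "ab_test": ["recommender_system_test", "interface_eu_test",
                "recommender_system_test", "interface_eu_test"],
})

# Count the distinct tests each user is enrolled in.
tests_per_user = toy_participants.groupby("user_id")["ab_test"].nunique()

# Users enrolled in more than one test.
overlap_ids = tests_per_user[tests_per_user > 1].index

# Keep only the users that belong to a single test.
cleaned = toy_participants[~toy_participants["user_id"].isin(overlap_ids)]
print(cleaned)
```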


Let's divide the data into two test tables and check, using the groupby method, whether any user is in both groups within each test.

In [80]:
EU_test = participants[participants['ab_test'] == 'interface_eu_test']
not_EU = participants[participants['ab_test'] == 'recommender_system_test']

In [81]:
in_two_groups_EU = EU_test.groupby('user_id', as_index = False).agg({'group': 'count'})
in_two_groups_EU[in_two_groups_EU['group'] == 2]

Unnamed: 0,user_id,group


In [82]:
in_two_groups_not = not_EU.groupby('user_id', as_index = False).agg({'group': 'count'})
in_two_groups_not[in_two_groups_not['group'] == 2]

Unnamed: 0,user_id,group


In [83]:
in_two_groups = participants.groupby('user_id', as_index = False).agg({'group': 'count'})
in_two_groups[in_two_groups['group'] == 2]

Unnamed: 0,user_id,group
1,001064FEAAB631A1,2
8,00341D8401F0F665,2
23,0082295A41A867B5,2
38,00E68F103C66C1F7,2
41,00EFA157F7B6E1C4,2
...,...,...
13576,FEA0C585A53E7027,2
13582,FEC0BCA6C323872F,2
13605,FF2174A1AA0EAD20,2
13610,FF44696E39039D29,2


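Looks clean: within each test, no user appears in both groups. The users listed in the last output have two rows because they are enrolled in both tests, not because they are in both groups of the same test.  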
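
The per-test checks can also be run in one step by counting distinct groups per (ab_test, user_id) pair. A sketch on toy data, with made-up user ids:

```python
import pandas as pd

toy_participants = pd.DataFrame({
    "user_id": ["U1", "U1", "U2", "U2", "U3"],
    "group":   ["A",  "A",  "A",  "B",  "B"],
    "ab_test": ["recommender_system_test", "interface_eu_test",
                "recommender_system_test", "recommender_system_test",
                "interface_eu_test"],
})

# Number of distinct groups each user has inside each test.
groups_per_user = toy_participants.groupby(["ab_test", "user_id"])["group"].nunique()

# More than one distinct group inside the same test means a contaminated user.
in_both_groups = groups_per_user[groups_per_user > 1]
print(in_both_groups)
```

Here U1 is in both tests but always in group A, so only U2 is flagged. Using `nunique` instead of `count` keeps "in two tests" and "in two groups" apart.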

## Start and end dates

Let's see if the date range is correct.

In [84]:
user_events[user_events['event_dt'] < '2020-12-07 00:00:00']

Unnamed: 0,user_id,event_dt,event_name


Now let's look at the end point.

In [85]:
user_events[user_events['event_dt'] > '2021-01-01 23:59:59']

Unnamed: 0,user_id,event_dt,event_name


Looks good. Now, let's check whether any users signed up after the 21st of December.

In [86]:
users[users['first_date'] > '2020-12-21']

Unnamed: 0,user_id,first_date,region,device
22757,5815F7ECE74D949F,2020-12-22,CIS,PC
22758,32EAEA5E903E3BC1,2020-12-22,N.America,Android
22759,9DF7A3C46487EF0B,2020-12-22,EU,Android
22760,ADE98C6440423287,2020-12-22,EU,iPhone
22761,5A5833D3AEA75255,2020-12-22,N.America,PC
...,...,...,...,...
32118,165AFCBF42C043F8,2020-12-23,EU,PC
32119,54E7F36C0E976E24,2020-12-23,EU,Android
32120,7E43EB2E03A33E78,2020-12-23,EU,PC
32121,B8B679DEE9F2CA06,2020-12-23,EU,PC


OK, let's remove them.

In [87]:
users = users[~(users['first_date'] > '2020-12-21')]
users[users['first_date'] > '2020-12-21']

Unnamed: 0,user_id,first_date,region,device


## EU

Let's find the percentage of EU users in the data.

In [88]:
len(users[users['region'] == 'EU']['user_id'])/len(users['user_id'])*100

73.85104790419162

It seems we have more EU users than we thought.

## 14 days limit

Let's check that we stop tracking users 14 days after they sign up. First I'll create a joint table that contains all the participants who meet the criteria applied so far.

In [89]:
df = participants.merge(user_events, how='inner', on= 'user_id')
df = df.merge(users, how='inner', on= 'user_id')

Now I'll create a new column that contains the number of days between each event and the customer's first date.

In [90]:
df['limit'] = (df['event_dt'] - df['first_date']).dt.days
df[df['limit'] > 14]

Unnamed: 0,user_id,group,ab_test,event_dt,event_name,first_date,region,device,limit
77,66FC298441D50783,A,recommender_system_test,2020-12-29 12:59:37,login,2020-12-08,EU,iPhone,21
362,12FCEFC7D1907D47,A,recommender_system_test,2020-12-26 10:20:34,login,2020-12-09,EU,PC,17
389,172F0C1F993BE914,B,recommender_system_test,2020-12-26 04:13:01,login,2020-12-07,EU,iPhone,19
390,172F0C1F993BE914,B,recommender_system_test,2020-12-28 10:45:40,login,2020-12-07,EU,iPhone,21
597,A1C3D3C6C3CADDC5,A,recommender_system_test,2020-12-26 13:20:09,purchase,2020-12-11,EU,Android,15
...,...,...,...,...,...,...,...,...,...
97241,9197EFF2D0FB18C9,B,interface_eu_test,2020-12-28 13:42:35,product_cart,2020-12-08,EU,iPhone,20
97245,9197EFF2D0FB18C9,B,interface_eu_test,2020-12-27 07:03:53,product_page,2020-12-08,EU,iPhone,19
97246,9197EFF2D0FB18C9,B,interface_eu_test,2020-12-28 13:42:36,product_page,2020-12-08,EU,iPhone,20
97250,9197EFF2D0FB18C9,B,interface_eu_test,2020-12-27 07:03:49,login,2020-12-08,EU,iPhone,19


Let's remove them. 

In [91]:
df = df[~(df['limit'] > 14)]
df[df['limit'] > 14]

Unnamed: 0,user_id,group,ab_test,event_dt,event_name,first_date,region,device,limit
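
To make the day count concrete, here is a small sketch of the same calculation on three made-up events. The `toy` table is hypothetical; the first_date values have no time component, as in the users table.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "first_date": pd.to_datetime(["2020-12-08", "2020-12-08", "2020-12-10"]),
    "event_dt": pd.to_datetime(
        ["2020-12-08 10:00:00", "2020-12-29 12:59:37", "2020-12-24 23:00:00"]
    ),
})

# .dt.days keeps only the whole days of the time difference.
toy["limit"] = (toy["event_dt"] - toy["first_date"]).dt.days

# Same rule as in the notebook: keep events with limit <= 14.
within_window = toy[toy["limit"] <= 14]
print(toy[["user_id", "limit"]])
```

Because `.dt.days` truncates to whole days, an event late on the 14th day after sign-up still gets `limit == 14` and is kept.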


## Divide for AB test

Let's take from the df table only the rows that belong to the recommender_system_test, which is the test we need to evaluate.  

In [92]:
df_ab = df[df['ab_test'] == 'recommender_system_test']
len(df_ab['user_id'].unique())

3675

It seems we have fewer participants than we hoped for.

## Mid-way conclusion

1. The only missing values were in the details column of the user events table, which is not needed for this analysis, so I deleted it.
1. There were no duplicates.
1. Within each test, no user appears in both groups, although some users are enrolled in both tests.
1. Saw that there were more EU users than expected. 
1. I removed the users who signed up after 21 December 2020 and the events that occurred more than 14 days after sign-up, and confirmed that no events fall after 1 January 2021.
1. Selected the recommender_system_test data for the AB test.

# Exploratory data analysis

## Conversion at different funnel stages

I'll count the user_id values per event name to see how frequently each event occurs.

In [95]:
freq_aco = df.groupby('event_name', as_index= False)['user_id'].count().sort_values(by = 'user_id',ascending = False)
freq_aco

Unnamed: 0,event_name,user_id
0,login,40997
2,product_page,26505
3,purchase,13635
1,product_cart,13147


In [96]:
new_index = [0,2,1,3]
freq_aco.reindex(new_index)

Unnamed: 0,event_name,user_id
0,login,40997
2,product_page,26505
1,product_cart,13147
3,purchase,13635


Let's find the number of unique users who did each of the events at least once. To do so, I need to count the unique users per event.

In [97]:
df.groupby('event_name', as_index= False)['user_id'].nunique().sort_values(by = 'user_id',ascending = False).reindex(new_index)

Unnamed: 0,event_name,user_id
0,login,12632
2,product_page,8246
1,product_cart,4074
3,purchase,4219


Let's see the percentage of people that did each of the actions at least once. I'll take the previous table and divide it by the number of unique users.

In [101]:
df.groupby('event_name')['user_id'].nunique().sort_values(ascending = False).reindex(['login','product_page','product_cart','purchase'])/df.user_id.nunique() * 100 

event_name
login           99.984170
product_page    65.268324
product_cart    32.246319
purchase        33.394016
Name: user_id, dtype: float64

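Let's find the share of users that proceed from each stage to the next. I'll do that with pct_change(). Note that pct_change() compares each row with the row above it in the sorted table (descending by number of users), so the value shown for a stage is relative to the row that precedes it in that order.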

In [103]:
funnel = df.groupby('event_name', as_index= False)['user_id'].nunique().sort_values(by = 'user_id',ascending = False)
funnel['pac'] = funnel['user_id'].pct_change()
funnel.reindex(new_index)

Unnamed: 0,event_name,user_id,pac
0,login,12632,
2,product_page,8246,-0.347213
1,product_cart,4074,-0.034368
3,purchase,4219,-0.488358
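
A minimal sketch of the same calculation with the stages placed in funnel order before calling pct_change(), using the unique-user counts from the table above. The names `stage_order`, `unique_users`, `ordered` and `step_change` are mine and not part of the notebook.

```python
import pandas as pd

stage_order = ["login", "product_page", "product_cart", "purchase"]

# Unique users per event, in the descending order the notebook sorts them.
unique_users = pd.Series(
    {"login": 12632, "product_page": 8246, "purchase": 4219, "product_cart": 4074}
)

# Put the stages in funnel order first, then compare each stage with the previous one.
ordered = unique_users.reindex(stage_order)
step_change = ordered.pct_change()
print(step_change)
```

In this order the purchase stage comes out slightly above product_cart, which suggests that a purchase may be possible without opening the cart, so the funnel is probably not strictly sequential.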


Let's plot a funnel chart, just to see the difference visually.

In [104]:
fpc = px.funnel(funnel.reindex(new_index), x='user_id', y='event_name', title = 'users that made the next step')
fpc.show()

It does seem that the problematic parts of the funnel are the login and product page stages, where most users drop off. Let's see if we can identify a visual difference between the test groups.  

In [105]:
df_a = df[df['group'] == 'A']
df_b = df[df['group'] == 'B']

In [107]:
df_a.groupby('event_name')['user_id'].nunique().sort_values(ascending = False).reindex(['login','product_page','product_cart','purchase'])/df_a.user_id.nunique() * 100 

event_name
login           99.986457
product_page    65.980498
product_cart    31.595341
purchase        34.019502
Name: user_id, dtype: float64

In [108]:
funnel_a = df_a.groupby('event_name', as_index= False)['user_id'].nunique().sort_values(by = 'user_id',ascending = False)
funnel_a['pac'] = funnel_a['user_id'].pct_change()
funnel_a.reindex(new_index)

Unnamed: 0,event_name,user_id,pac
0,login,7383,
2,product_page,4872,-0.340106
1,product_cart,2333,-0.071258
3,purchase,2512,-0.484401


In [109]:
fpc = px.funnel(funnel_a.reindex(new_index), x='user_id', y='event_name', title = 'users that made the next step')
fpc.show()

In [60]:
funnel_b = df_b.groupby('event_name', as_index= False)['user_id'].nunique().sort_values(by = 'user_id',ascending = False)
funnel_b['pac'] = funnel_b['user_id'].pct_change()
funnel_b

Unnamed: 0,event_name,user_id,pac
0,login,5690,
2,product_page,3645,-0.359402
1,product_cart,1867,-0.487791
3,purchase,1834,-0.017675


In [61]:
df_b.groupby('event_name')['user_id'].nunique().sort_values(ascending = False)/df_b.user_id.nunique() * 100 

event_name
login           99.982428
product_page    64.048498
product_cart    32.806185
purchase        32.226322
Name: user_id, dtype: float64

In [62]:
fpc = px.funnel(funnel_b, x='user_id', y='event_name', title = 'users that made the next step')
fpc.show()

I don't see a big difference in the funnel stages between the test groups.

Let's see the results for the required test (recommender_system_test).

In [110]:
df_aba = df_ab[df_ab['group'] == 'A']
df_abb = df_ab[df_ab['group'] == 'B']

In [112]:
df_aba.groupby('event_name')['user_id'].nunique().sort_values(ascending = False).reindex(['login','product_page','product_cart','purchase'])/df_aba.user_id.nunique() * 100 

event_name
login           100.000000
product_page     64.797961
product_cart     29.996360
purchase         31.743720
Name: user_id, dtype: float64

In [115]:
funnel_aba = df_aba.groupby('event_name', as_index= False)['user_id'].nunique().sort_values(by = 'user_id',ascending = False)
funnel_aba['pac'] = funnel_aba['user_id'].pct_change()
funnel_aba.reindex(new_index)

Unnamed: 0,event_name,user_id,pac
0,login,2747,
2,product_page,1780,-0.35202
1,product_cart,824,-0.055046
3,purchase,872,-0.510112


In [116]:
faba = px.funnel(funnel_aba.reindex(new_index), x='user_id', y='event_name', title = 'users that made the next step')
faba.show()

In [117]:
df_abb.groupby('event_name')['user_id'].nunique().sort_values(ascending = False).reindex(['login','product_page','product_cart','purchase'])/df_abb.user_id.nunique() * 100 

event_name
login           99.892241
product_page    56.357759
product_cart    27.478448
purchase        27.586207
Name: user_id, dtype: float64

In [118]:
funnel_abb = df_abb.groupby('event_name', as_index= False)['user_id'].nunique().sort_values(by = 'user_id',ascending = False)
funnel_abb['pac'] = funnel_abb['user_id'].pct_change()
funnel_abb.reindex(new_index)

Unnamed: 0,event_name,user_id,pac
0,login,927,
2,product_page,523,-0.435814
1,product_cart,255,-0.003906
3,purchase,256,-0.510516


In [119]:
fabb = px.funnel(funnel_abb.reindex(new_index), x='user_id', y='event_name', title = 'users that made the next step')
fabb.show()

In the required test, group B has a much larger drop from login to product page: about 56% of its users reach the product page, versus about 65% in group A.

Now let's compare the conversion in group B with group A at each stage of the funnel.

In [134]:
fig = go.Figure()

fig.add_trace(go.Funnel(
    name = 'group a',
    x = funnel_aba.reindex(new_index)['user_id'],
    y = funnel_aba.reindex(new_index)['event_name'],
    textinfo = "value"))

fig.add_trace(go.Funnel(
    name = 'group b',
    x = funnel_abb.reindex(new_index)['user_id'],
    y = funnel_abb.reindex(new_index)['event_name'],
    textinfo = "value"))

fig.show()

We see that group B has a sharper decline at the product page: only 56% of the previous step, versus 65% in group A. It doesn't look like the test went that well. 
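
The chart shows absolute counts, so as a rough sketch the relative change of group B against group A can be computed from the per-group shares printed earlier (percentage of each group's users who reached a stage). The dictionaries below just copy those printed values; the variable names are mine.

```python
# Share of users (in %) reaching each stage, copied from the outputs above.
share_a = {"product_page": 64.797961, "product_cart": 29.996360, "purchase": 31.743720}
share_b = {"product_page": 56.357759, "product_cart": 27.478448, "purchase": 27.586207}

# Relative change of B versus A, in percent: negative means B converts worse.
relative_change = {
    stage: (share_b[stage] / share_a[stage] - 1) * 100 for stage in share_a
}

for stage, change in relative_change.items():
    print(f"{stage}: {change:+.1f}%")
```

Every stage comes out negative, which is the opposite of the expected increase of at least 10%.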

## Number of events per user

Let's see the number of events per user. To do that, I need to create a table grouped by user id that counts the number of events each user has participated in. Then I can create a histogram that shows how many users took part in each number of events. 

In [136]:
event_count = df.groupby('user_id', as_index=False).agg({'event_dt':'count'})

In [137]:
hisg = px.histogram(event_count, x="event_dt",
                   title='events per user',
                   labels={'event_dt':'number of events'}, 
                   )
hisg.show()

In the general table, we see that most users have around 1-9 events and the peak is at 6. It's logical to assume that most users purchase one or two gifts during the testing period.  

Let's see what we get for the groups of our AB test, recommender_system_test. I will group the df_ab table by user id and group, counting the number of events. Then, using the new table, I'll create a histogram that shows the number of events per user in both groups. 

In [141]:
event_ab_count = df_ab.groupby(['user_id','group'], as_index = False).agg({'event_dt':'count'})

In [143]:
hisab = px.histogram(event_ab_count, x="event_dt",
                   title='events per user',
                   labels={'event_dt':'number of events'},
                   color = 'group'
                   )
hisab.show()

Generally, both groups have a similar distribution of events per user. It does feel as though group A has a higher concentration of users in the 2-9 event range, and its peak is much greater than group B's.

## Number of events distributed by days

Let's see the number of events distributed by days using a histogram chart.

In [144]:
hist = px.histogram(df ,x = 'event_dt', color = 'group', title = 'number of events per day in every group')
hist.show()

We see a rise in events that peaks around the 21st for both groups and then quickly drops to the same number of events as at the beginning, and even lower. There are two periods with no values, possibly because of a holiday and New Year's.

Let's see if there are any marketing events that may overlap with the test.

In [145]:
marketing_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas&New Year Promo,"EU, N.America",2020-12-25,2021-01-03
1,St. Valentine's Day Giveaway,"EU, CIS, APAC, N.America",2020-02-14,2020-02-16
2,St. Patric's Day Promo,"EU, N.America",2020-03-17,2020-03-19
3,Easter Promo,"EU, CIS, APAC, N.America",2020-04-12,2020-04-19
4,4th of July Promo,N.America,2020-07-04,2020-07-11
5,Black Friday Ads Campaign,"EU, CIS, APAC, N.America",2020-11-26,2020-12-01
6,Chinese New Year Promo,APAC,2020-01-25,2020-02-07
7,Labor day (May 1st) Ads Campaign,"EU, CIS, APAC",2020-05-01,2020-05-03
8,International Women's Day Promo,"EU, CIS, APAC",2020-03-08,2020-03-10
9,Victory Day CIS (May 9th) Event,CIS,2020-05-09,2020-05-11


The only marketing event that overlaps with the test is the Christmas&New Year Promo. However, it starts on the 25th, right where we see a significant decrease in the number of events. So we can say that marketing events had nothing to do with the results.
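
The overlap can also be checked in code instead of by eye. This is a sketch on three rows copied from the table above; the test window of 2020-12-07 to 2021-01-01 is an assumption taken from the date-range checks earlier in this notebook.

```python
import pandas as pd

marketing = pd.DataFrame({
    "name": ["Christmas&New Year Promo", "Black Friday Ads Campaign", "Chinese New Year Promo"],
    "start_dt": pd.to_datetime(["2020-12-25", "2020-11-26", "2020-01-25"]),
    "finish_dt": pd.to_datetime(["2021-01-03", "2020-12-01", "2020-02-07"]),
})

# Assumed test window, based on the earliest and latest event dates checked above.
test_start = pd.Timestamp("2020-12-07")
test_end = pd.Timestamp("2021-01-01")

# Two date ranges overlap when each one starts before the other one ends.
overlapping = marketing[
    (marketing["start_dt"] <= test_end) & (marketing["finish_dt"] >= test_start)
]
print(overlapping["name"].tolist())
```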

## Mid-way conclusion

1. Made funnel graphs that show conversion rates through the different stages.
1. Looked at the number of events per user using histograms.
1. Used a histogram to see the distribution of events through time.

# AB test

To test whether conversion at each funnel stage differs between group A and group B, we need to create a pivot table that has the groups as columns, the events as index and the number of unique users as values.

In [146]:
groups_piv = df_ab.pivot_table(index = 'event_name', columns = 'group', values = 'user_id', aggfunc = 'nunique').reset_index()
groups_piv

group,event_name,A,B
0,login,2747,927
1,product_cart,824,255
2,product_page,1780,523
3,purchase,872,256


In [151]:
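def check_hypothesis(group1,group2,event,alpha=0.0125):
    """Two-sided z-test for the difference between two proportions.

    H0: the share of users who performed `event` is the same in both groups.
    """
    # Successes: unique users in each group who performed the event
    success1=groups_piv[groups_piv.event_name==event][group1].iloc[0]
    success2=groups_piv[groups_piv.event_name==event][group2].iloc[0]
    
    # Trials: all unique users in each group
    trials1=df_ab[df_ab.group==group1]['user_id'].nunique()
    trials2=df_ab[df_ab.group==group2]['user_id'].nunique()
    
    # Proportions per group and the pooled proportion under H0
    p1 = success1/trials1
    p2 = success2/trials2
    p_combined = (success1 + success2) / (trials1 + trials2)

    difference = p1 - p2
    
    # z statistic: difference divided by its standard error under H0
    z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))

    distr = st.norm(0, 1)

    # Two-sided p-value from the standard normal distribution
    p_value = (1 - distr.cdf(abs(z_value))) * 2

    print('p-value: ', p_value)

    if (p_value < alpha):
        print("Rejecting H0")
    else:
        print("Failed to reject H0")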

Now that we have everything in order, we just need to run the function. So that I don't have to write the same call for each event over and over again, I'll create a loop that runs them one after the other.  

In [152]:
for i in groups_piv.event_name.unique():
    check_hypothesis('A','B',i,alpha=0.0125)

p-value:  0.08529860212027773
Failed to reject H0
p-value:  0.14534814557238196
Failed to reject H0
p-value:  4.310980554755872e-06
Rejecting H0
p-value:  0.017592402663314743
Failed to reject H0


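The p-values are printed in the order of the pivot table rows: login, product_cart, product_page, purchase. Product_page has the largest difference out of all the events and is the only one that is statistically significant at alpha = 0.0125, sadly in favor of the control group.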
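
As a cross-check, the product_page result can be reproduced with the standard library alone. This is a sketch, not the notebook's function: `two_proportion_z_test` is a name I made up, the group sizes 2747 and 928 are derived from the 3675 unique users in df_ab and the 2747 logins (100%) in group A, and reading alpha = 0.0125 as 0.05 split across the four tests (a Bonferroni-style correction) is my assumption.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    p_pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# product_page counts from the pivot table; group sizes derived as described above.
z, p_value = two_proportion_z_test(1780, 2747, 523, 928)

# Assumption: 0.05 divided by the 4 comparisons gives the alpha used above.
alpha = 0.05 / 4
print(f"z = {z:.2f}, p-value = {p_value:.2e}, reject H0: {p_value < alpha}")
```

The positive z value means the proportion is higher in group A, which matches the reading that the control group did better at this stage.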

# Summary

1. In the data exploration stage I:  
    1. Used the info() method to detect missing values
       and columns whose data types needed to change.  
       I converted the date columns to datetime and deleted the column containing the missing values.
    1. Looked for duplicates.
    1. Checked if there are people in both tests.  
       To do so, I used the groupby method, grouped the data by user id and counted the number of tests for each user.  
       I divided the data into two test tables  
       and checked if there are users in both groups within each test using the groupby method.
    1. Checked that the date range is correct at both the start and the end point.
       Then I checked whether any users started the program after the 21st of December
       and removed those that I found.
    1. Found the percentage of EU members in the data.
    1. Checked that we stop tracking users after 14 days.
       First I created a joint table that contains all the participants who meet the criteria applied up to this point. 
       Then I created a new column that contains the number of days between each event and the customer's first date,
       and removed the events beyond the limit.
    1. Took from the df table only the part that contains the recommender_system_test for the AB test.
1. In the EDA stage I:  
    1. Counted the user_id values per event name so I could see their frequency of occurrence.  
       To find the number of unique users who did each of the events at least once,  
       I counted the unique users per event.  
       To see the percentage of people that did each of the actions at least once,  
       I took the previous table and divided it by the number of unique users.  
       To find the share of users that proceeded from each stage to the next, I used pct_change().  
       I plotted a funnel chart, just to see the difference visually.
    1. Looked at the number of events per user.
       I created a table grouped by user id that counts the number of events each user has participated in.
       Then I created a histogram that shows the number of users that took part in each number of events.
    1. Looked at the number of events distributed by days using a histogram chart.  
       Then I printed the marketing table to see if any marketing events overlap with our test and affect it. 
1. In the AB test I:
    1. Tested whether conversion at each funnel stage differs between group A and group B. 
       I created a pivot table that has the groups as columns,  
       the events as index and the number of unique users as values.
       I defined the hypothesis  
       and created a function that can check it.  
       I used the z-criterion to check the statistical difference between the proportions.  
       I created a loop that ran the events one after the other through the function.

## Conclusions

In the EDA stage we see that the funnel stages do not show the percentage increase that the company had hoped for.  
The conversions between the funnel stages stayed fairly similar, except for the conversion to the product page, which was markedly lower in group B (about 56% versus 65% in group A). If the company wants a better conversion rate, it needs to improve the first two stages, especially the second one.  
It seemed at first that the holidays might affect the test results, but they looked fine in the end.   
In the AB test stage, we saw that the only statistically significant difference between the groups is at the product_page stage, and the EDA showed that the control group did better at that stage, which suggests that the changes to the site did not have a positive effect on the conversion rate.