# PROJECT SPRINT 10: A/B TESTING

### DESCRIPTION OF THE PROJECT: This is a test and analysis done for a big online store. Together with the marketing department, I received a list of hypotheses that may help boost revenue.<br>

### PURPOSE OF THE TEST: **Prioritize these hypotheses, launch A/B test and analyze the results.** 

***

### The project is divided into several parts. Each part has its own purpose and is outlined in a sequencial order so you can follow the progress to the end.<br>

>### Part One: Pre-processing of the data.<br>
>### Part Two: Check Compliances.<br>
>### Part Three: Main KPIs (without statistical analysis).<br>
>### Part Four: Prioritizing Hypotheses.<br>
>### Part Five: A/B Test Analysis.<br>
>### Part Six: Conlusions based on the A/B test results.

***

### Description of the data:<br>
> Hypotheses dataset:<br>
> Hypotheses — brief descriptions of the hypotheses<br>
> Reach — user reach, on a scale of one to ten<br>
> Impact — impact on users, on a scale of one to ten<br>
> Confidence — confidence in the hypothesis, on a scale of one to ten<br>
> Effort — the resources required to test a hypothesis, on a scale of one to ten.<br>

> Orders dataset:<br>
> transactionId — order identifier<br>
> visitorId — identifier of the user who placed the order<br>
> date — of the order<br>
> revenue — from the order<br>
> group — the A/B test group that the user belongs to<br>

> Visits dataset:<br>
> date — date<br>
> group — A/B test group<br>
> visits — the number of visits on the date specified in the A/B test group specified

***

### Part One: Pre-processing the data

#### In this part I will import libraries, check and clean the data from both datasets, and check any inconsistencies of the data that may prevent to do further actions and analysis.

**1. Libraries**

In [1]:
# import all the necessary libraries for the whole project
import pandas as pd
import scipy.stats as stats
import datetime as dt
import numpy as np
import sidetable
import plotly.express as px 
import matplotlib.pyplot as plt 
import plotly.graph_objects as go 

**2. Reading the datasets and checking for missing values**

In [2]:
# reading the orders csv file
orders = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/orders_us.csv', parse_dates=['date'])
orders.head()

Unnamed: 0,transactionId,visitorId,date,revenue,group
0,3667963787,3312258926,2019-08-15,30.4,B
1,2804400009,3642806036,2019-08-15,15.2,B
2,2961555356,4069496402,2019-08-15,10.2,A
3,3797467345,1196621759,2019-08-15,155.1,B
4,2282983706,2322279887,2019-08-15,40.5,B


In [3]:
# info about the dataframe
orders.stb.missing(style=True)

Unnamed: 0,missing,total,percent
transactionId,0,1197,0.00%
visitorId,0,1197,0.00%
date,0,1197,0.00%
revenue,0,1197,0.00%
group,0,1197,0.00%


* We can see that there are 1197 rows and no missing vallues.

In [4]:
# reading the visits csv file
visits = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/visits_us.csv', parse_dates=['date'])
visits.head()

Unnamed: 0,date,group,visits
0,2019-08-01,A,719
1,2019-08-02,A,619
2,2019-08-03,A,507
3,2019-08-04,A,717
4,2019-08-05,A,756


In [5]:
# info about the visits dataframe
visits.stb.missing(style=True)

Unnamed: 0,missing,total,percent
date,0,62,0.00%
group,0,62,0.00%
visits,0,62,0.00%


* We can see that there are 62 rows and no missing values.

In [6]:
# reading the hypotheses csv file
pd.set_option('display.max_colwidth', None)
hypotheses = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/hypotheses_us.csv')
hypotheses

Unnamed: 0,Hypothesis;Reach;Impact;Confidence;Effort
0,Add two new channels for attracting traffic. This will bring 30% more users;3;10;8;6
1,Launch your own delivery service. This will shorten delivery time;2;5;4;10
2,Add product recommendation blocks to the store's site. This will increase conversion and average purchase size;8;3;7;3
3,Change the category structure. This will increase conversion since users will find the products they want more quickly;8;3;3;8
4,Change the background color on the main page. This will increase user engagement;3;1;1;1
5,Add a customer review page. This will increase the number of orders;3;2;2;3
6,Show banners with current offers and sales on the main page. This will boost conversion;5;3;8;3
7,Add a subscription form to all the main pages. This will help you compile a mailing list;10;7;8;5
8,Launch a promotion that gives users discounts on their birthdays;1;9;9;5


* We can see that the format is not the correct one to manipulate the data on the dataframe so I will modify the format to make columns.

In [7]:
# define the data as a list of dictionaries
hypotheses = [
    {"Hypothesis": "Add two new channels for attracting traffic. This will bring 30% more users", "Reach": 3, "Impact": 10, "Confidence": 8, "Effort": 6},
    {"Hypothesis": "Launch your own delivery service. This will shorten delivery time", "Reach": 2, "Impact": 5, "Confidence": 4, "Effort": 10},
    {"Hypothesis": "Add product recommendation blocks to the store's site. This will increase conversion and average purchase size", "Reach": 8, "Impact": 3, "Confidence": 7, "Effort": 3},
    {"Hypothesis": "Change the category structure. This will increase conversion since users will find the products they want more quickly", "Reach": 8, "Impact": 3, "Confidence": 3, "Effort": 8},
    {"Hypothesis": "Change the background color on the main page. This will increase user engagement", "Reach": 3, "Impact": 1, "Confidence": 1, "Effort": 1},
    {"Hypothesis": "Add a customer review page. This will increase the number of orders", "Reach": 3, "Impact": 2, "Confidence": 2, "Effort": 3},
    {"Hypothesis": "Show banners with current offers and sales on the main page. This will boost conversion", "Reach": 5, "Impact": 3, "Confidence": 8, "Effort": 3},
    {"Hypothesis": "Add a subscription form to all the main pages. This will help you compile a mailing list", "Reach": 10, "Impact": 7, "Confidence": 8, "Effort": 5},
    {"Hypothesis": "Launch a promotion that gives users discounts on their birthdays; This will increase customer retention", "Reach": 1, "Impact": 9, "Confidence": 9, "Effort": 5},
]
# create the DataFrame
hypotheses = pd.DataFrame(hypotheses)
# Convert column names to lowercase
hypotheses.columns = hypotheses.columns.str.lower()
hypotheses


Unnamed: 0,hypothesis,reach,impact,confidence,effort
0,Add two new channels for attracting traffic. This will bring 30% more users,3,10,8,6
1,Launch your own delivery service. This will shorten delivery time,2,5,4,10
2,Add product recommendation blocks to the store's site. This will increase conversion and average purchase size,8,3,7,3
3,Change the category structure. This will increase conversion since users will find the products they want more quickly,8,3,3,8
4,Change the background color on the main page. This will increase user engagement,3,1,1,1
5,Add a customer review page. This will increase the number of orders,3,2,2,3
6,Show banners with current offers and sales on the main page. This will boost conversion,5,3,8,3
7,Add a subscription form to all the main pages. This will help you compile a mailing list,10,7,8,5
8,Launch a promotion that gives users discounts on their birthdays; This will increase customer retention,1,9,9,5


In [8]:
# checking for duplicates on column of transactions
duplicates = orders['transactionId'].duplicated().sum()
if duplicates > 0:
  print(f'There are {duplicates} duplicate rows in the DataFrame.')
else:
  print('No duplicate rows found.')

No duplicate rows found.


* We can see that there are no duplicated transactions on the dataframe.

In [9]:
# group by test group and see when start and when ends the test
orders.groupby(['group'])['date'].agg(['min','max'])

Unnamed: 0_level_0,min,max
group,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2019-08-01,2019-08-31
B,2019-08-01,2019-08-31


* We can see that both orders and visits dataframes have same start and end dates.

***

### Part Two: Check compliances

#### In this part I will go through all the technical requirements that need to be fulfilled in order to perform a correct test.

In [10]:
# group by group to see how many users on each group
users_per_group = orders.groupby(['group'])['visitorId'].nunique().reset_index()
users_per_group

Unnamed: 0,group,visitorId
0,A,503
1,B,586


In [11]:
# filter entries for groups A and B
groupA = users_per_group[users_per_group['group'] == 'A']['visitorId'].iloc[0]
groupB = users_per_group[users_per_group['group'] == 'B']['visitorId'].iloc[0]
# calculate the difference
difference = groupB - groupA
print(f'There is a difference of {difference} users between group A and B')

There is a difference of 83 users between group A and B


* We can see there is not an even split of the groups, there is a difference of 83 users. I will check if there are users that are included on both groups by error on the next step.

In [12]:
# pylint: disable=missing-final-newline
# users in both groups
both_groups = list(orders.groupby(['visitorId'])['group'].nunique().reset_index().query('group > 1')['visitorId'])
both_groups

[8300375,
 199603092,
 232979603,
 237748145,
 276558944,
 351125977,
 393266494,
 457167155,
 471551937,
 477780734,
 818047933,
 963407295,
 1230306981,
 1294878855,
 1316129916,
 1333886533,
 1404934699,
 1602967004,
 1614305549,
 1648269707,
 1668030113,
 1738359350,
 1801183820,
 1959144690,
 2038680547,
 2044997962,
 2378935119,
 2458001652,
 2579882178,
 2587333274,
 2600415354,
 2654030115,
 2686716486,
 2712142231,
 2716752286,
 2780786433,
 2927087541,
 2949041841,
 2954449915,
 3062433592,
 3202540741,
 3234906277,
 3656415546,
 3717692402,
 3766097110,
 3803269165,
 3891541246,
 3941795274,
 3951559397,
 3957174400,
 3963646447,
 3972127743,
 3984495233,
 4069496402,
 4120364173,
 4186807279,
 4256040402,
 4266935830]

In [13]:
# not in both groups
clean_orders = orders[(orders.visitorId.apply(lambda x: x not in both_groups))]
clean_orders

Unnamed: 0,transactionId,visitorId,date,revenue,group
0,3667963787,3312258926,2019-08-15,30.4,B
1,2804400009,3642806036,2019-08-15,15.2,B
3,3797467345,1196621759,2019-08-15,155.1,B
4,2282983706,2322279887,2019-08-15,40.5,B
5,182168103,935554773,2019-08-15,35.0,B
...,...,...,...,...,...
1191,3592955527,608641596,2019-08-14,255.7,B
1192,2662137336,3733762160,2019-08-14,100.8,B
1193,2203539145,370388673,2019-08-14,50.1,A
1194,1807773912,573423106,2019-08-14,165.3,A


* We can see now that the clean list of users which is valid for the test is the users that are only participating in one version of the test.

In [14]:
# group by group to see how many users on each group
users_by_group = clean_orders.groupby(['group'])['visitorId'].nunique().reset_index()
users_by_group

Unnamed: 0,group,visitorId
0,A,445
1,B,528


In [15]:
# filter entries for groups A and B
groupA = users_by_group[users_by_group['group'] == 'A']['visitorId'].iloc[0]
groupB = users_by_group[users_by_group['group'] == 'B']['visitorId'].iloc[0]
# calculate the difference
difference = groupB - groupA
print(f'There is a difference of {difference} users between group A and B')

There is a difference of 83 users between group A and B


* We can see that there is a difference of 83 users that take part of group A and B to do the test. The difference is quite significant to continue correctly the test.

In [16]:
# double checking if there are still users on both groups
(clean_orders.groupby(['visitorId'])['group'].nunique()>1).sum()

0

* We can see that now each user if participiating only in one only version of the test. The rest of users were dropped from the test.

#### Note: We don't have data on this data sets about the existence of any marketing events during this period of August, neither the existence of special promotions or holidays. Therefore, we infer that the planned dates for the test are normal dates that retrieve a usual user behavior.

* We can also infer that there is not existence of ghosts users on the data set, users that are registered but don't take part on any of the versions of the test. I can see that from the missing values and the columns earlier on the datasets information.

***

### Part Three: KPIs

#### Following are KPIs used for further analysis, all the KPIs in this part have been done not considering the existence of outliers (yet).

In [17]:
# building an array with unique paired date-group values 
datesGroups = clean_orders[['date','group']].drop_duplicates()
datesGroups.head(10)

Unnamed: 0,date,group
0,2019-08-15,B
7,2019-08-15,A
45,2019-08-16,A
47,2019-08-16,B
55,2019-08-01,A
66,2019-08-01,B
86,2019-08-22,A
87,2019-08-22,B
124,2019-08-17,A
125,2019-08-17,B


* Orders per day.

In [18]:
# group by date and group and calculate number of unique transactions
orders_per_day = clean_orders.groupby(['date', 'group'])['transactionId'].nunique().reset_index()
orders_per_day.columns = ['date', 'group', 'orders_per_day']
orders_per_day.head()

Unnamed: 0,date,group,orders_per_day
0,2019-08-01,A,23
1,2019-08-01,B,17
2,2019-08-02,A,19
3,2019-08-02,B,23
4,2019-08-03,A,24


In [19]:
# dataFrame with orders per day, group A
OrdersA = orders_per_day[orders_per_day['group'] == 'A'][['date', 'orders_per_day']]
# dataFrame with orders per day, group B
OrdersB = orders_per_day[orders_per_day['group'] == 'B'][['date', 'orders_per_day']]

In [20]:
OrdersA.head()

Unnamed: 0,date,orders_per_day
0,2019-08-01,23
2,2019-08-02,19
4,2019-08-03,24
6,2019-08-04,11
8,2019-08-05,22


In [21]:
OrdersB.head()

Unnamed: 0,date,orders_per_day
1,2019-08-01,17
3,2019-08-02,23
5,2019-08-03,14
7,2019-08-04,14
9,2019-08-05,21


* Visits per day.

In [22]:
# group by date and group and calculate number of unique visits
visits_per_day = visits.groupby(['date', 'group'])['visits'].sum().reset_index()
visits_per_day.columns = ['date', 'group', 'visits_per_day']
visits_per_day.head()

Unnamed: 0,date,group,visits_per_day
0,2019-08-01,A,719
1,2019-08-01,B,713
2,2019-08-02,A,619
3,2019-08-02,B,581
4,2019-08-03,A,507


In [23]:
# dataFrame with visits per day, group A
VisitsA = visits_per_day[visits_per_day['group'] == 'A'][['date', 'visits_per_day']]
VisitsA.head()

Unnamed: 0,date,visits_per_day
0,2019-08-01,719
2,2019-08-02,619
4,2019-08-03,507
6,2019-08-04,717
8,2019-08-05,756


In [24]:
# dataFrame with visits per day, group A
VisitsB = visits_per_day[visits_per_day['group'] == 'B'][['date', 'visits_per_day']]
VisitsB.head()

Unnamed: 0,date,visits_per_day
1,2019-08-01,713
3,2019-08-02,581
5,2019-08-03,509
7,2019-08-04,770
9,2019-08-05,707


* General Revenue per day.

In [25]:
# create general revenue per day
revenue_per_day = clean_orders.groupby(['date', 'group'])['revenue'].sum().reset_index()
revenue_per_day.columns = ['date', 'group', 'revenue_per_day']
revenue_per_day.head()


Unnamed: 0,date,group,revenue_per_day
0,2019-08-01,A,2266.6
1,2019-08-01,B,967.2
2,2019-08-02,A,1468.3
3,2019-08-02,B,2568.1
4,2019-08-03,A,1815.2


In [26]:
# dataFrame with orders and revenue by day, group A
RevenueA = revenue_per_day[revenue_per_day['group'] == 'A'][['date', 'revenue_per_day']]
# dataFrame with orders and revenue by day, group B
RevenueB = revenue_per_day[revenue_per_day['group'] == 'B'][['date', 'revenue_per_day']]

In [27]:
RevenueA.head(10)

Unnamed: 0,date,revenue_per_day
0,2019-08-01,2266.6
2,2019-08-02,1468.3
4,2019-08-03,1815.2
6,2019-08-04,675.5
8,2019-08-05,1398.0
10,2019-08-06,668.4
12,2019-08-07,1942.0
14,2019-08-08,1404.8
16,2019-08-09,2095.2
18,2019-08-10,2387.5


In [28]:
RevenueB.head(10)

Unnamed: 0,date,revenue_per_day
1,2019-08-01,967.2
3,2019-08-02,2568.1
5,2019-08-03,1071.6
7,2019-08-04,1531.6
9,2019-08-05,1449.3
11,2019-08-06,3369.3
13,2019-08-07,3435.6
15,2019-08-08,2379.9
17,2019-08-09,1656.2
19,2019-08-10,1674.8


* Revenue by User.

In [29]:
# group clean_orders by users and sum the revenue
revenue_per_user = clean_orders.groupby(['visitorId'])['revenue'].sum().reset_index()
revenue_per_user.columns = ['visitorId', 'revenue']
revenue_per_user.head()

Unnamed: 0,visitorId,revenue
0,5114589,10.8
1,6958315,25.9
2,11685486,100.4
3,39475350,65.4
4,47206413,15.2


In [30]:
# see distribution of revenue per user
revenue_per_user['revenue'].describe()

count      973.000000
mean       136.550051
std        663.321828
min          5.000000
25%         20.800000
50%         50.400000
75%        135.200000
max      19920.400000
Name: revenue, dtype: float64

* Orders per User.

In [31]:
# group by visitorId and aggregate the number of unique transactionId and min date for group A
ordersByUsersA = clean_orders[clean_orders['group'] == 'A'].groupby('visitorId', as_index=False).agg({
    'transactionId': pd.Series.nunique,
    'date': 'min'
})
ordersByUsersA.columns = ['visitorId', 'orders', 'date']
ordersByUsersA.head()

Unnamed: 0,visitorId,orders,date
0,11685486,1,2019-08-23
1,54447517,1,2019-08-08
2,66685450,1,2019-08-13
3,78758296,1,2019-08-15
4,85103373,1,2019-08-04


In [32]:
# group by visitorId and aggregate the number of unique transactionId and min date for group B
ordersByUsersB = clean_orders[clean_orders['group'] == 'B'].groupby('visitorId', as_index=False).agg({
    'transactionId': pd.Series.nunique,
    'date': 'min'
})
ordersByUsersB.columns = ['visitorId', 'orders', 'date']
ordersByUsersB.head()

Unnamed: 0,visitorId,orders,date
0,5114589,1,2019-08-16
1,6958315,1,2019-08-04
2,39475350,1,2019-08-08
3,47206413,1,2019-08-10
4,48147722,1,2019-08-22


* Ratio of the number of orders to the number of visitors for each test group

In [33]:
ratio = users_by_group.merge(orders_per_day, how = 'left', on = 'group')
ratio.head()

Unnamed: 0,group,visitorId,date,orders_per_day
0,A,445,2019-08-01,23
1,A,445,2019-08-02,19
2,A,445,2019-08-03,24
3,A,445,2019-08-04,11
4,A,445,2019-08-05,22


In [34]:
ratio_new = (
    ratio.drop(['date'], axis=1)
    .groupby('group', as_index=False)
    .agg({'visitorId': 'sum', 'orders_per_day': 'sum'})
)
ratio_new.head()

Unnamed: 0,group,visitorId,orders_per_day
0,A,13795,468
1,B,16368,548


* p-value and difference between order sizes for both test groups.

In [35]:
# create a pd.Series object with the required length
# np.arange() function to create a list of indices
# concatenate the list from first and second parts
# pass the argument that specifies that Series objects are to be concatenated by row
sampleA = pd.concat([ordersByUsersA['orders'],pd.Series(0, index=np.arange(visits[visits['group'] == 'A']['visits'].sum() - len(ordersByUsersA['orders'])), name = 'orders')], axis = 0)
sampleB = pd.concat([ordersByUsersB['orders'],pd.Series(0, index=np.arange(visits[visits['group'] == 'B']['visits'].sum() - len(ordersByUsersB['orders'])), name = 'orders')], axis = 0)

print('{0:.3f}'.format(stats.mannwhitneyu(sampleA, sampleB)[1]))
print('{0:.3f}'.format(sampleB.mean()/sampleA.mean()-1))

0.011
0.160


* Cumulative orders and visitors by date.

In [36]:
# building an array with unique paired date-group values 
datesGroups = clean_orders[['date','group']].drop_duplicates()
# collect the aggregated cumulative daily data on orders
ordersAggregated = datesGroups.apply(lambda x: clean_orders[
    np.logical_and(clean_orders['date'] <= x['date'], clean_orders['group'] == x['group'])
].agg({'date': 'max', 'group': 'max', 'transactionId': pd.Series.nunique,
       'visitorId': pd.Series.nunique, 'revenue': 'sum'}), axis=1)
ordersAggregated = ordersAggregated.sort_values(by=['date', 'group'])
 
# get the aggregated cumulative daily data on visitors
visitorsAggregated = datesGroups.apply(lambda x: visits[
    np.logical_and(visits['date'] <= x['date'], visits['group'] == x['group'])
].agg({'date': 'max', 'group': 'max', 'visits': 'sum'}), axis=1)
visitorsAggregated = visitorsAggregated.sort_values(by=['date', 'group'])

# merge cumulative satasets of orders and visitors
cumulativeData = ordersAggregated.merge(visitorsAggregated, left_on=['date', 'group'], right_on=['date', 'group'])
cumulativeData.columns = ['date', 'group', 'orders', 'buyers', 'revenue', 'visitors']
print(cumulativeData.head(5))


        date group  orders  buyers  revenue  visitors
0 2019-08-01     A      23      19   2266.6       719
1 2019-08-01     B      17      17    967.2       713
2 2019-08-02     A      42      36   3734.9      1338
3 2019-08-02     B      40      39   3535.3      1294
4 2019-08-03     A      66      60   5550.1      1845


* Cumulative revenue by day and A/B test groups.

In [37]:
# dataFrame with cumulative orders and cumulative revenue by day, group A
cumulativeRevenueA = cumulativeData[cumulativeData['group'] == 'A'][['date','revenue', 'orders']]
# dataFrame with cumulative orders and cumulative revenue by day, group B
cumulativeRevenueB = cumulativeData[cumulativeData['group'] == 'B'][['date','revenue', 'orders']]

In [38]:
cumulativeRevenueA.head(10)

Unnamed: 0,date,revenue,orders
0,2019-08-01,2266.6,23
2,2019-08-02,3734.9,42
4,2019-08-03,5550.1,66
6,2019-08-04,6225.6,77
8,2019-08-05,7623.6,99
10,2019-08-06,8292.0,114
12,2019-08-07,10234.0,130
14,2019-08-08,11638.8,144
16,2019-08-09,13734.0,155
18,2019-08-10,16121.5,170


In [39]:
cumulativeRevenueB.head(10)

Unnamed: 0,date,revenue,orders
1,2019-08-01,967.2,17
3,2019-08-02,3535.3,40
5,2019-08-03,4606.9,54
7,2019-08-04,6138.5,68
9,2019-08-05,7587.8,89
11,2019-08-06,10957.1,112
13,2019-08-07,14392.7,135
15,2019-08-08,16772.6,157
17,2019-08-09,18428.8,176
19,2019-08-10,20103.6,198


* General Average Purchase Size for A/B test groups.

In [40]:
average_purchase = clean_orders.groupby(['date', 'group'])['revenue'].mean().reset_index()
average_purchase

Unnamed: 0,date,group,revenue
0,2019-08-01,A,98.547826
1,2019-08-01,B,56.894118
2,2019-08-02,A,77.278947
3,2019-08-02,B,111.656522
4,2019-08-03,A,75.633333
...,...,...,...
57,2019-08-29,B,112.080000
58,2019-08-30,A,136.544444
59,2019-08-30,B,156.514286
60,2019-08-31,A,106.037500


In [41]:
# dataFrame with orders and revenue by day, group A
average_purchaseA = average_purchase[average_purchase['group'] == 'A'][['date', 'revenue']]
# dataFrame with orders and revenue by day, group B
average_purchaseB = average_purchase[average_purchase['group'] == 'B'][['date', 'revenue']]

In [42]:
average_purchaseA.head()

Unnamed: 0,date,revenue
0,2019-08-01,98.547826
2,2019-08-02,77.278947
4,2019-08-03,75.633333
6,2019-08-04,61.409091
8,2019-08-05,63.545455


In [43]:
average_purchaseB.head()

Unnamed: 0,date,revenue
1,2019-08-01,56.894118
3,2019-08-02,111.656522
5,2019-08-03,76.542857
7,2019-08-04,109.4
9,2019-08-05,69.014286


* Relative Difference for General Average Purchase Size.

In [44]:
revenue_per_day.head()

Unnamed: 0,date,group,revenue_per_day
0,2019-08-01,A,2266.6
1,2019-08-01,B,967.2
2,2019-08-02,A,1468.3
3,2019-08-02,B,2568.1
4,2019-08-03,A,1815.2


In [45]:
orders_per_day.head()

Unnamed: 0,date,group,orders_per_day
0,2019-08-01,A,23
1,2019-08-01,B,17
2,2019-08-02,A,19
3,2019-08-02,B,23
4,2019-08-03,A,24


AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

* Cumulative Average Purchase Size for A/B test groups.

* Relative Difference for Cumulative Average Purchase Size.

In [46]:
# gathering the data into one DataFrame
mergedCumulativeRevenue = cumulativeRevenueA.merge(cumulativeRevenueB, left_on='date', right_on='date', how='left', suffixes=['A', 'B'])
mergedCumulativeRevenue.head()

Unnamed: 0,date,revenueA,ordersA,revenueB,ordersB
0,2019-08-01,2266.6,23,967.2,17
1,2019-08-02,3734.9,42,3535.3,40
2,2019-08-03,5550.1,66,4606.9,54
3,2019-08-04,6225.6,77,6138.5,68
4,2019-08-05,7623.6,99,7587.8,89


In [47]:
# relative difference for the average purchase sizes
mergedCumulativeRevenue['relative_difference'] = (mergedCumulativeRevenue['revenueB'] / mergedCumulativeRevenue['ordersB']) / (mergedCumulativeRevenue['revenueA'] / mergedCumulativeRevenue['ordersA']) - 1
mergedCumulativeRevenue.head()

Unnamed: 0,date,revenueA,ordersA,revenueB,ordersB,relative_difference
0,2019-08-01,2266.6,23,967.2,17,-0.422675
1,2019-08-02,3734.9,42,3535.3,40,-0.006114
2,2019-08-03,5550.1,66,4606.9,54,0.014514
3,2019-08-04,6225.6,77,6138.5,68,0.116511
4,2019-08-05,7623.6,99,7587.8,89,0.107136


* Conversion rate for each group as ratio of orders to number of visits each day.

In [48]:
# merge datasets of orders and visits for group A
conversion_rateA = OrdersA.merge(VisitsA, on = 'date', how = 'left')
conversion_rateA.columns = ['date', 'orders_per_day', 'visits_per_day']
conversion_rateA.head(10)

Unnamed: 0,date,orders_per_day,visits_per_day
0,2019-08-01,23,719
1,2019-08-02,19,619
2,2019-08-03,24,507
3,2019-08-04,11,717
4,2019-08-05,22,756
5,2019-08-06,15,667
6,2019-08-07,16,644
7,2019-08-08,14,610
8,2019-08-09,11,617
9,2019-08-10,15,406


In [49]:
# merge datasets of orders and visits for group B
conversion_rateB = OrdersB.merge(VisitsB, on = 'date', how = 'left')
conversion_rateB.columns = ['date', 'orders_per_day', 'visits_per_day']
conversion_rateB.head(10)

Unnamed: 0,date,orders_per_day,visits_per_day
0,2019-08-01,17,713
1,2019-08-02,23,581
2,2019-08-03,14,509
3,2019-08-04,14,770
4,2019-08-05,21,707
5,2019-08-06,23,655
6,2019-08-07,23,709
7,2019-08-08,22,654
8,2019-08-09,19,610
9,2019-08-10,22,369


In [50]:
# adding conversion column and its calculation to group A
conversion_rateA['conversionA'] = conversion_rateA['orders_per_day'] / conversion_rateA['visits_per_day']
conversion_rateA.head(10)

Unnamed: 0,date,orders_per_day,visits_per_day,conversionA
0,2019-08-01,23,719,0.031989
1,2019-08-02,19,619,0.030695
2,2019-08-03,24,507,0.047337
3,2019-08-04,11,717,0.015342
4,2019-08-05,22,756,0.029101
5,2019-08-06,15,667,0.022489
6,2019-08-07,16,644,0.024845
7,2019-08-08,14,610,0.022951
8,2019-08-09,11,617,0.017828
9,2019-08-10,15,406,0.036946


In [51]:
# adding conversion column and its calculation to group B
conversion_rateB['conversionB'] = conversion_rateB['orders_per_day'] / conversion_rateB['visits_per_day']
conversion_rateB.head(10)

Unnamed: 0,date,orders_per_day,visits_per_day,conversionB
0,2019-08-01,17,713,0.023843
1,2019-08-02,23,581,0.039587
2,2019-08-03,14,509,0.027505
3,2019-08-04,14,770,0.018182
4,2019-08-05,21,707,0.029703
5,2019-08-06,23,655,0.035115
6,2019-08-07,23,709,0.03244
7,2019-08-08,22,654,0.033639
8,2019-08-09,19,610,0.031148
9,2019-08-10,22,369,0.059621


* Cumulative Conversion.

In [52]:
# calculating cumulative conversion
cumulativeData['cum_conversion'] = cumulativeData['orders']/cumulativeData['visitors']
cumulativeData.head()

Unnamed: 0,date,group,orders,buyers,revenue,visitors,cum_conversion
0,2019-08-01,A,23,19,2266.6,719,0.031989
1,2019-08-01,B,17,17,967.2,713,0.023843
2,2019-08-02,A,42,36,3734.9,1338,0.03139
3,2019-08-02,B,40,39,3535.3,1294,0.030912
4,2019-08-03,A,66,60,5550.1,1845,0.035772


In [53]:
# selecting data on group A 
cumulativeDataA = cumulativeData[cumulativeData['group']=='A']
# selecting data on group B
cumulativeDataB = cumulativeData[cumulativeData['group']=='B']

* Relative Difference for Cumulative Conversion.

***

### Part Four: Prioritizing hypotheses

In [54]:
hypotheses

Unnamed: 0,hypothesis,reach,impact,confidence,effort
0,Add two new channels for attracting traffic. This will bring 30% more users,3,10,8,6
1,Launch your own delivery service. This will shorten delivery time,2,5,4,10
2,Add product recommendation blocks to the store's site. This will increase conversion and average purchase size,8,3,7,3
3,Change the category structure. This will increase conversion since users will find the products they want more quickly,8,3,3,8
4,Change the background color on the main page. This will increase user engagement,3,1,1,1
5,Add a customer review page. This will increase the number of orders,3,2,2,3
6,Show banners with current offers and sales on the main page. This will boost conversion,5,3,8,3
7,Add a subscription form to all the main pages. This will help you compile a mailing list,10,7,8,5
8,Launch a promotion that gives users discounts on their birthdays; This will increase customer retention,1,9,9,5


In [55]:
# apply the ICE framework to prioritize hypotheses. Sort them in descending order of priority
hypotheses['ICE'] = (hypotheses['impact'] * hypotheses['confidence']) / hypotheses['effort']
hypotheses[['hypothesis', 'ICE']].sort_values(by='ICE', ascending = False)

Unnamed: 0,hypothesis,ICE
8,Launch a promotion that gives users discounts on their birthdays; This will increase customer retention,16.2
0,Add two new channels for attracting traffic. This will bring 30% more users,13.333333
7,Add a subscription form to all the main pages. This will help you compile a mailing list,11.2
6,Show banners with current offers and sales on the main page. This will boost conversion,8.0
2,Add product recommendation blocks to the store's site. This will increase conversion and average purchase size,7.0
1,Launch your own delivery service. This will shorten delivery time,2.0
5,Add a customer review page. This will increase the number of orders,1.333333
3,Change the category structure. This will increase conversion since users will find the products they want more quickly,1.125
4,Change the background color on the main page. This will increase user engagement,1.0


In [56]:
# apply the RICE framework to prioritize hypotheses. Sort them in descending order of priority
hypotheses['RICE'] = (hypotheses['reach'] * hypotheses['impact'] * hypotheses['confidence']) / hypotheses['effort']
hypotheses[['hypothesis', 'RICE']].sort_values(by='RICE', ascending=False)

Unnamed: 0,hypothesis,RICE
7,Add a subscription form to all the main pages. This will help you compile a mailing list,112.0
2,Add product recommendation blocks to the store's site. This will increase conversion and average purchase size,56.0
0,Add two new channels for attracting traffic. This will bring 30% more users,40.0
6,Show banners with current offers and sales on the main page. This will boost conversion,40.0
8,Launch a promotion that gives users discounts on their birthdays; This will increase customer retention,16.2
3,Change the category structure. This will increase conversion since users will find the products they want more quickly,9.0
1,Launch your own delivery service. This will shorten delivery time,4.0
5,Add a customer review page. This will increase the number of orders,4.0
4,Change the background color on the main page. This will increase user engagement,3.0


#### From the two prioritization of hypotheses using ICE and RICE I can infer:<br>

* ICE prioritizes based on the following factors:<br>

> Reach (R): The number of users potentially affected by the hypothesis.<br>
> Impact (I): The expected impact of the hypothesis on a relevant metric (e.g., conversion rate, revenue).<br>
> Confidence (C): The level of certainty in your estimates for Reach and Impact.<br>

* RICE prioritizes based on the following factors:<br>

> Reach (R): Same as ICE.<br>
> Impact (I): Similar to ICE, but can consider effort required to implement the hypothesis.<br>
> Confidence (C): Similar to ICE.<br>
> Effort (E): The amount of time and resources needed to implement the hypothesis.<br>

* Changes in Prioritization:<br>

> Subscription form: With RICE, the "subscription form" hypothesis jumps to the top due to its high Reach and potentially high Impact on acquiring leads (assuming a good conversion rate).<br>
> Product recommendations: This hypothesis also moves up significantly in RICE due to a balance between Reach, Impact, and lower Effort compared to some ideas in ICE.<br>
> Discounts and new channels: While still considered valuable, these drop in priority with RICE because the Effort to implement them might be higher.<br>

* Key Takeaways:<br>

> RICE incorporates effort into the prioritization, potentially favoring ideas with high potential impact but lower implementation complexity.<br>
> If user base growth and immediate impact are critical, ICE might be suitable.<br>
> If resource allocation and long-term value are essential, RICE might be a better choice.

***

### Part Five: A/B Test Analysis

#### Following are the plots, analysis and conclusions of the data based on earlier KPIs and not taking into consideration the existence of outliers.

In [57]:
# plot general orders for A/B groups
fig = go.Figure()
fig.add_trace(go.Scatter(x = OrdersA['date'], 
                        y = OrdersA['orders_per_day'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'orange', width = 2 )))
fig.add_trace(go.Scatter(x = OrdersB['date'], 
                        y = OrdersB['orders_per_day'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'blue', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y',
                  xaxis_title = 'date' 
                  )
# Customize y-axis label
fig.update_yaxes(title_text = 'orders_per_day')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Daily Orders for A and B groups')
fig.update_layout(
    showlegend = True)
fig.show()


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result



In [58]:
# plot general revenue for A/B groups
fig = go.Figure()
fig.add_trace(go.Scatter(x = RevenueA['date'], 
                        y = RevenueA['revenue_per_day'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'orange', width = 2 )))
fig.add_trace(go.Scatter(x = RevenueB['date'], 
                        y = RevenueB['revenue_per_day'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'blue', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y',
                  xaxis_title = 'date')
# Customize y-axis label
fig.update_yaxes(title_text = 'visitors')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Daily Revenue for A and B groups')
fig.update_layout(
    showlegend = True)
fig.show()

In [59]:
# plot scatter plot
fig = go.Figure()
fig.add_trace(go.Scatter(x = ordersByUsersA['date'], 
                        y = ordersByUsersA['orders'],
                    mode = 'markers',
                    name = 'A', marker = dict(color = 'blue')))
fig.add_trace(go.Scatter(x = ordersByUsersB['date'], 
                        y = ordersByUsersB['orders'], 
                    mode = 'markers',
                    name = 'B', marker = dict(color = 'red')))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y',
                  xaxis_title = 'date')

fig.update_yaxes(title_text = 'number of orders')
fig.update_layout(title_text = 'Number of Orders per User',
                  showlegend = True)
fig.show()

#### Scatter plot for the number of orders per user for groups A and B.<br>

> The x-axis represents the dates from 1st August to 30th August 2019.<br>
> The y-axis represents the number of orders per user.<br>
> Group A’s orders are represented by blue dots, while Group B’s orders are represented by red dots.<br>

> Both groups show users with orders ranging from 1 to 3 per day.<br>
> Group B has a consistent presence of 1 order per user almost every day.<br>
> Group A has occasional instances where users place 2 or 3 orders.<br>

> Group A (blue dots) shows more variability with instances of 2 and 3 orders per user on certain days.<br>
> Group B (red dots) primarily shows 1 order per user, with occasional days showing 2 orders.<br>
> Both groups have days where no orders exceed 1 per user.<br>

> On several days, Group A has users placing 2 or 3 orders, such as around August 4, August 18, and late August.<br>
> Group B shows less frequent but consistent occurrences of users placing 2 orders, notably around August 11 and August 18.<br>

> Users in Group A have more variability in their order behavior, indicating some users might be more active and place multiple orders on certain days.<br>
> Group B users tend to place a single order per day, indicating more consistent but lower activity levels.

In [60]:
ordersByUsersA['orders'].describe()

count    445.000000
mean       1.051685
std        0.267669
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        3.000000
Name: orders, dtype: float64

In [61]:
ordersByUsersB['orders'].describe()

count    528.000000
mean       1.037879
std        0.210008
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        3.000000
Name: orders, dtype: float64

#### Comparative Analysis:<br>

> Both groups have a mean number of orders per user close to 1, with Group A slightly higher at 1.051685 compared to Group B’s 1.037879.<br>
> The median for both groups is 1.0, indicating that more than half of the users in both groups place exactly one order.<br>

> Group A has a higher standard deviation (0.267669) compared to Group B (0.210008), suggesting that Group A’s order counts are more variable.<br>
> The maximum number of orders is 3 for both groups, indicating occasional multiple orders per user, as seen in the scatter plot.<br>

> Group A exhibits slightly more variability in user activity, with a small portion of users placing more than one order per day more frequently than in Group B.<br>
> This aligns with the scatter plot observations where Group A had instances of 2 and 3 orders per user, while Group B was more consistent with 1 order and occasional 2 orders.<br>

> The higher standard deviation in Group A points to greater inconsistency in daily orders, while Group B users show more consistent order behavior, supported by the scatter plot showing fewer instances of multiple orders.

* Calculation of 95th and 99th percentile for number of orders per user.

In [62]:
print(np.percentile(ordersByUsersA['orders'], [95, 99]))
print(np.percentile(ordersByUsersB['orders'], [95, 99]))

[1.   2.56]
[1. 2.]


#### Users with orders greater than the 99th percentile, i.e., more than 2.56 orders can be considered outliers for group A and users with 3 orders are outliers of group B because their order count exceeds the 99th percentile.

In [63]:
ratio_new['ordersToVisitorsRatio'] = (ratio_new['orders_per_day'] / ratio_new['visitorId']).map(lambda x: "{0:.4f}".format(x))
ratio_new.head()

Unnamed: 0,group,visitorId,orders_per_day,ordersToVisitorsRatio
0,A,13795,468,0.0339
1,B,16368,548,0.0335


#### We can see that the ratio of orders for Group A is a bit higher than the orders for group B.

In [64]:
# plot cumulative revenue for A/B groups
fig = go.Figure()
fig.add_trace(go.Scatter(x = cumulativeRevenueA['date'], 
                        y = cumulativeRevenueA['revenue'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'orange', width = 2 )))
fig.add_trace(go.Scatter(x = cumulativeRevenueB['date'], 
                        y = cumulativeRevenueB['revenue'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'blue', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y', 
                  title_text = 'date')
# Customize y-axis label
fig.update_yaxes(title_text = 'visitors')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Cumulative Revenue for A and B groups')
fig.update_layout(
    showlegend = True)
fig.show()


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result



#### The plot is comparing the revenue performance of two different groups (A and B) over time.<br>
> The x-axis represents the dates, ranging from 1st August 2019 to 30th August 2019.<br>
> The y-axis represents the cumulative revenue in visitors, with values ranging from 0 to 80,000.<br>
> There are two lines on the plot:<br>
> - An orange line representing group A.<br>
> - A blue line representing group B.<br>

#### Trends:<br>

> Both groups start with low cumulative revenue at the beginning of the time period.<br>
> Group A (orange line) shows a relatively steady increase in cumulative revenue throughout the month.<br>
> Group B (blue line) also shows a steady increase initially but has a significant jump around mid-August, after which it continues to grow steadily but at a higher rate compared to group A.<br>

#### Key Observations:<br>

> The sudden spike in the blue line around mid-August suggests that group B experienced a significant event or change that greatly increased its cumulative revenue. This could be due to a successful marketing campaign, a product launch, or another impactful event.<br>
> Group A shows a more consistent and linear growth pattern, suggesting steady but slower improvements over time.

In [65]:
# plot general average purchase size for A/B groups
fig = go.Figure()
fig.add_trace(go.Scatter(x = average_purchaseA['date'], 
                        y = average_purchaseA['revenue'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'orange', width = 2 )))
fig.add_trace(go.Scatter(x = average_purchaseB['date'], 
                        y = average_purchaseB['revenue'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'blue', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y', 
                  title_text = 'date')
# Customize y-axis label
fig.update_yaxes(title_text = 'average purchase size')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Average Purchase Size for A and B groups')
fig.update_layout(
    showlegend = True)
fig.show()

In [66]:
# pylint: disable=undefined-variable
fig = go.Figure()
fig.add_trace(go.Scatter(x = cumulativeRevenueA['date'], 
                        y = cumulativeRevenueA['revenue'] / cumulativeRevenueA['orders'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'green', width = 2 )))
fig.add_trace(go.Scatter(x = cumulativeRevenueB['date'], 
                        y = cumulativeRevenueB['revenue'] / cumulativeRevenueB['orders'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'purple', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y', 
                  title_text = 'date')
# Customize y-axis label
fig.update_yaxes(title_text = 'average purchase size')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Cumulative Average Purchase Size for A and B groups')
fig.update_layout(
    showlegend = True)
fig.show()

#### The plot is comparing the average purchase size of two different groups (A and B) over time.

> The x-axis represents the dates, ranging from 1st August 2019 to 30th August 2019.<br>
> The y-axis represents the average purchase size, with values ranging from 60 to 160.<br>
> There are two lines on the plot:<br>
> - A green line representing group A.<br>
> - A purple line representing group B.<br>

#### Trends:<br>

> Group A (green line) shows some fluctuations initially, with a notable dip at the beginning but then stabilizes around the 100 mark with slight variations.<br>
> Group B (purple line) starts similarly but experiences a dramatic increase around mid-August, reaching over 160 before gradually decreasing but still maintaining an average purchase size significantly higher than group A.<br>

#### Comparison:<br>

> By the end of the time period, group B's average purchase size is higher than group A's, despite some decline after the peak.<br>
> The purple line's steep rise around mid-August indicates a significant event or change impacting the average purchase size for group B, similar to the cumulative revenue plot.<br>

#### Key Observations:<br>

> The initial part of the month shows both groups with similar average purchase sizes, hovering around the 80-100 range.<br>
> Group B’s sudden increase in mid-August suggests an impactful event or strategy that significantly boosted their average purchase size.<br>
> After the spike, group B's average purchase size slightly decreases but remains higher than group A’s average purchase size.<br>
> Group A's average purchase size remains relatively steady throughout the period, suggesting consistent purchasing behavior.

In [67]:
# plotting relative difference for the average purchase sizes
fig = px.line(mergedCumulativeRevenue,
              x = 'date',
              y = 'relative_difference',
              title = 'Relative Difference for Cumulative Average Purchase Size')
#adding reference line
fig.add_hline(y = 0,
              line_dash = 'dash',
              line_color = 'purple')
fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y')
fig.show()


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result



#### The plot is comparing the relative difference between two groups (A and B) over time in terms of their cumulative average purchase size.<br>

> The x-axis represents the dates, ranging from 1st August 2019 to 30th August 2019.<br>
> The y-axis represents the relative difference, with values ranging from approximately -0.4 to 0.4.<br>
> There is a single blue line representing the relative difference over time.<br>
> The horizontal dashed purple line represents the zero line, which serves as a reference point where the relative difference is zero.<br>

#### Trends:<br>

> At the beginning of the period, the relative difference starts below zero, indicating that the average purchase size for group A was lower than that for group B.<br>
> The relative difference increases rapidly, crosses the zero line, and goes positive, indicating that at some point, the average purchase size for group A was higher than that for group B.<br>
> The relative difference then shows fluctuations, going above and below the zero line, indicating changes in which group had a higher average purchase size.<br>
> Around mid-August, there's a sharp increase in the relative difference, indicating a significant event that caused group B’s average purchase size to exceed group A’s substantially.<br>
> Post mid-August, the relative difference remains positive but shows a gradual decrease, indicating that while group B maintained a higher average purchase size, the difference between the two groups started to diminish slightly towards the end of August.<br>

#### Key Observations:<br>

> The sharp increase around mid-August corresponds to the significant spike observed in the average purchase size plot for group B.<br>
> The fluctuations around the zero line indicate a competitive interaction between the two groups' average purchase sizes before the mid-August spike.<br>
>After the spike, group B consistently has a higher average purchase size than group A, as indicated by the relative difference staying positive.


In [68]:
# plot daily conversion rate for groups A and B
fig = go.Figure()
fig.add_trace(go.Scatter(x = conversion_rateA['date'], 
                        y = conversion_rateA['conversionA'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'green', width = 2 )))
fig.add_trace(go.Scatter(x = conversion_rateB['date'], 
                        y = conversion_rateB['conversionB'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'purple', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y', 
                  title_text = 'date')
# Customize y-axis label
fig.update_yaxes(title_text = 'conversion rate')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Daily Conversion Rate for group A and B')
fig.update_layout(
    showlegend = True)
fig.show()


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result



#### The plot shows the daily conversion rates for groups A and B over a period in August 2019.<br>

> The x-axis represents the dates, ranging from 1st August 2019 to 30th August 2019.<br>
> The y-axis represents the conversion rate ratio of orders and number of visits per day, with values ranging from approximately 0.01 to 0.06<br>

> Group A's conversion rates are represented by the green line, while Group B's are shown by the purple line.<br>
> Both groups exhibit considerable variability in their daily conversion rates.<br>
> Both groups show peaks and valleys, indicating days with high and low conversion rates.<br>
> Group A's conversion rate appears to have more frequent and possibly sharper fluctuations than Group B’s. This might suggest greater instability or sensitivity to daily changes in Group A’s conversion rate.<br>
> Group B, while also variable, might show slightly more periods of stability.<br>

> Around August 11, Group B shows a significant spike, reaching a conversion rate above 0.05, while Group A shows a more moderate peak.<br>
> After August 18, Group A shows higher conversion rates more frequently than Group B.<br>
> There are several points where the conversion rates of the two groups intersect, suggesting days where their performance was very similar.

In [69]:
fig = go.Figure()
fig.add_trace(go.Scatter(x = cumulativeDataA['date'], 
                        y = cumulativeDataA['cum_conversion'],
                    mode = 'lines',
                    name = 'A', line = dict(color = 'green', width = 2 )))
fig.add_trace(go.Scatter(x = cumulativeDataB['date'], 
                        y = cumulativeDataB['cum_conversion'], 
                    mode = 'lines',
                    name = 'B', line = dict(color = 'purple', width = 2 )))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y', 
                  title_text = 'date')
# Customize y-axis label
fig.update_yaxes(title_text = 'cumulative conversion')  # Add y-axis label
# Add title to the plot
fig.update_layout(title_text = 'Cumulative Conversion for A and B groups')
fig.update_layout(
    showlegend = True)
fig.show()

In [70]:
# merge cumulative conversions
#mergedCumulativeConversions = cumulativeDataA[['date','conversion']].merge(cumulativeDataB[['date','conversion']], left_on = 'date', right_on = 'date', how = 'left', suffixes = ['A', 'B'])
#mergedCumulativeConversions['ratio_conversions'] = mergedCumulativeConversions['conversionB'] / mergedCumulativeConversions['conversionA'] - 1
# plot a relative difference graph for the cumulative conversion rates
#fig = px.line(mergedCumulativeConversions,
              #x = 'date',
              #y = 'ratio_conversions',
              #title = 'Relative Difference for Cumulative Conversion')
#adding reference line
#fig.add_hline(y = 0,
              #line_dash = 'dash',
              #line_color = 'purple')
#fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        #l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  #xaxis_tickformat = '%d %b <br>%Y')
#fig.show()

In [71]:
#plt.plot(mergedCumulativeConversions['date'], mergedCumulativeConversions['conversionB']/mergedCumulativeConversions['conversionA']-1)
#plt.legend()

#plt.axhline(y=0, color='black', linestyle='--')
#plt.axhline(y=-0.1, color='grey', linestyle='--')

***

#### Since there is some anomalie in group A and group B, users with 2.56 orders or more in group A and users with 3 orders in group B, I will filter the data removind these outliers and check the plots again to see any differences.

In [72]:
# filter ordersByUsersA dataframe removing the users with 2.56 orders
filtered_ordersByUsersA = ordersByUsersA[ordersByUsersA['orders'] <= 2.55]
filtered_ordersByUsersA.head()

Unnamed: 0,visitorId,orders,date
0,11685486,1,2019-08-23
1,54447517,1,2019-08-08
2,66685450,1,2019-08-13
3,78758296,1,2019-08-15
4,85103373,1,2019-08-04


In [73]:
# filter ordersByUsersB dataframe removing the users with 3 orders
filtered_ordersByUsersB = ordersByUsersB[ordersByUsersB['orders'] <= 2.99]
filtered_ordersByUsersB.head()

Unnamed: 0,visitorId,orders,date
0,5114589,1,2019-08-16
1,6958315,1,2019-08-04
2,39475350,1,2019-08-08
3,47206413,1,2019-08-10
4,48147722,1,2019-08-22


In [74]:
# plot scatter plot from filtered ordersByUsersA
fig = go.Figure()
fig.add_trace(go.Scatter(x = filtered_ordersByUsersA['date'], 
                        y = filtered_ordersByUsersA['orders'],
                    mode = 'markers',
                    name = 'A', marker = dict(color = 'blue')))
fig.add_trace(go.Scatter(x = filtered_ordersByUsersB['date'], 
                        y = filtered_ordersByUsersB['orders'], 
                    mode = 'markers',
                    name = 'B', marker = dict(color = 'red')))

fig.update_layout(autosize = False, width = 800, height = 500, margin = dict(
        l = 50, r = 50, b = 100, t = 100, pad = 4), paper_bgcolor = 'LightSteelBlue',
                  xaxis_tickformat = '%d %b <br>%Y',
                  xaxis_title = 'date')

fig.update_yaxes(title_text = 'number of orders')
fig.update_layout(title_text = 'Number of Orders per User',
                  showlegend = True)
fig.show()


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result



In [75]:
filtered_ordersByUsersA['orders'].describe()

count    440.000000
mean       1.029545
std        0.169522
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        2.000000
Name: orders, dtype: float64

In [76]:
filtered_ordersByUsersB['orders'].describe()

count    526.000000
mean       1.030418
std        0.171899
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        2.000000
Name: orders, dtype: float64