# PROJECT SPRINT 10: A/B TESTING

### DESCRIPTION OF THE PROJECT: This project covers an A/B test and its analysis for a large online store. Working with the marketing department, I received a list of hypotheses that may help boost revenue.<br>

### PURPOSE OF THE TEST: **Prioritize these hypotheses, launch an A/B test, and analyze the results.** 

***

### The project is divided into several parts. Each part has its own purpose, and the parts are presented in sequential order so you can follow the progress to the end.<br>

>### Part One: Pre-processing of the data.<br>
>### Part Two: Check Compliances.<br>
>### Part Three: Main KPIs (without statistical analysis).<br>
>### Part Four: Prioritizing Hypotheses.<br>
>### Part Five: A/B Test Analysis.<br>
>### Part Six: Conclusions based on the A/B test results.

***

### Description of the data:<br>
> Hypotheses dataset:<br>
> Hypotheses — brief descriptions of the hypotheses<br>
> Reach — user reach, on a scale of one to ten<br>
> Impact — impact on users, on a scale of one to ten<br>
> Confidence — confidence in the hypothesis, on a scale of one to ten<br>
> Effort — the resources required to test a hypothesis, on a scale of one to ten.<br>

> Orders dataset:<br>
> transactionId — order identifier<br>
> visitorId — identifier of the user who placed the order<br>
> date — of the order<br>
> revenue — from the order<br>
> group — the A/B test group that the user belongs to<br>

> Visits dataset:<br>
> date — date<br>
> group — A/B test group<br>
> visits — the number of visits on the date specified in the A/B test group specified

***

### Part One: Pre-processing the data

#### In this part I will import the libraries, check and clean the data from both datasets, and look for any inconsistencies that could get in the way of further analysis.

**1. Libraries**

In [157]:
# import all the necessary libraries for the whole project
import pandas as pd
import scipy.stats as stats # type: ignore
import datetime as dt
import numpy as np
import sidetable
import plotly.express as px

**2. Reading the datasets and checking for missing values**

In [158]:
# reading the orders csv file
orders = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/orders_us.csv', parse_dates=['date'])
orders.head()

Unnamed: 0,transactionId,visitorId,date,revenue,group
0,3667963787,3312258926,2019-08-15,30.4,B
1,2804400009,3642806036,2019-08-15,15.2,B
2,2961555356,4069496402,2019-08-15,10.2,A
3,3797467345,1196621759,2019-08-15,155.1,B
4,2282983706,2322279887,2019-08-15,40.5,B


In [159]:
# info about the dataframe
orders.stb.missing(style=True)

Unnamed: 0,missing,total,percent
transactionId,0,1197,0.00%
visitorId,0,1197,0.00%
date,0,1197,0.00%
revenue,0,1197,0.00%
group,0,1197,0.00%


* We can see that there are 1197 rows and no missing values.

In [160]:
# reading the visits csv file
visits = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/visits_us.csv', parse_dates=['date'])
visits.head()

Unnamed: 0,date,group,visits
0,2019-08-01,A,719
1,2019-08-02,A,619
2,2019-08-03,A,507
3,2019-08-04,A,717
4,2019-08-05,A,756


In [161]:
# info about the visits dataframe
visits.stb.missing(style=True)

Unnamed: 0,missing,total,percent
date,0,62,0.00%
group,0,62,0.00%
visits,0,62,0.00%


* We can see that there are 62 rows and no missing values.
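The same kind of check can be done with plain pandas when `sidetable` is not installed. The following is a minimal sketch, not the method used in this notebook: `missing_summary` is a hypothetical helper and the toy dataframe contains made-up values.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        'missing': missing,
        'total': len(df),
        'percent': missing / len(df) * 100,
    })
    # columns with the most missing values come first
    return summary.sort_values('missing', ascending=False)

# toy frame standing in for the real datasets (values are made up)
toy = pd.DataFrame({'group': ['A', 'B', None], 'visits': [719, 619, 507]})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(orders)` or `missing_summary(visits)` should give counts comparable to the tables above.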

**3. Optimization of memory of the datasets**

In [162]:
# info about the orders dataset
orders.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   transactionId  1197 non-null   int64         
 1   visitorId      1197 non-null   int64         
 2   date           1197 non-null   datetime64[ns]
 3   revenue        1197 non-null   float64       
 4   group          1197 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 105.3 KB


In [163]:
# use category method to change the type of data on the column.
orders['group'] = orders['group'].astype('category')

In [164]:
# check how much memory the orders dataframe uses after the conversion.
orders.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   transactionId  1197 non-null   int64         
 1   visitorId      1197 non-null   int64         
 2   date           1197 non-null   datetime64[ns]
 3   revenue        1197 non-null   float64       
 4   group          1197 non-null   category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(2)
memory usage: 38.9 KB


* We can see that the memory usage of the orders dataframe dropped from 105.3 KB to 38.9 KB.

In [165]:
# info about the visits dataset
visits.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    62 non-null     datetime64[ns]
 1   group   62 non-null     object        
 2   visits  62 non-null     int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 4.6 KB


In [166]:
# use category method to change the type of data on the column.
visits['group'] = visits['group'].astype('category')

In [167]:
# check how much memory the visits dataframe uses after the conversion.
visits.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    62 non-null     datetime64[ns]
 1   group   62 non-null     category      
 2   visits  62 non-null     int64         
dtypes: category(1), datetime64[ns](1), int64(1)
memory usage: 1.4 KB


* We can see that the memory usage of the visits dataframe dropped from 4.6 KB to 1.4 KB.
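The saving comes from how the two dtypes store text: an `object` column keeps one Python string per row, while a `category` column keeps each distinct label once plus a small integer code per row. The sketch below illustrates this on a synthetic column (an assumption standing in for `orders['group']`), so the byte counts will differ from the ones reported above.

```python
import pandas as pd

# synthetic column of group labels, standing in for orders['group'] (assumption)
labels = pd.Series(['A', 'B'] * 500, name='group')

as_object = labels.astype('object')      # one Python string per row
as_category = labels.astype('category')  # two categories + integer codes

# deep=True also counts the memory held by the string objects themselves
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(f'object: {bytes_object} bytes, category: {bytes_category} bytes')
```

The conversion pays off most for columns with few distinct values, such as the two test groups here.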

In [168]:
# checking for duplicates on column of transactions
duplicates = orders['transactionId'].duplicated().sum()
if duplicates > 0:
  print(f'There are {duplicates} duplicate rows in the DataFrame.')
else:
  print('No duplicate rows found.')

No duplicate rows found.


* We can see that there are no duplicated transaction IDs in the dataframe.
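Checking a single identifier column and checking whole rows answer slightly different questions. The sketch below shows both on a made-up dataframe (`toy_orders` is hypothetical), along with how repeated IDs could be dropped if any were found.

```python
import pandas as pd

# made-up orders with one repeated transaction (illustration only)
toy_orders = pd.DataFrame({
    'transactionId': [1, 2, 2, 3],
    'visitorId': [10, 11, 11, 12],
    'revenue': [5.0, 7.5, 7.5, 9.0],
})

# rows whose transactionId already appeared earlier
id_dupes = toy_orders['transactionId'].duplicated().sum()
# rows that repeat an earlier row in every column
row_dupes = toy_orders.duplicated().sum()
# keep the first occurrence of each transactionId
deduped = toy_orders.drop_duplicates(subset='transactionId', keep='first')

print(id_dupes, row_dupes, len(deduped))
```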

In [169]:
# group by test group to see when the test starts and ends
orders.groupby(['group'])['date'].agg(['min','max'])





group,min,max
A,2019-08-01,2019-08-31
B,2019-08-01,2019-08-31


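* We can see that both groups in the orders dataframe cover the same period, 2019-08-01 to 2019-08-31. The visits dataframe also starts on 2019-08-01 and has 62 rows, which is consistent with 31 daily rows per group.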
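The comparison between the two dataframes can also be made explicit by computing the date range per group in each one. This is a minimal sketch on made-up data: `toy_orders` and `toy_visits` are hypothetical stand-ins for `orders` and `visits`.

```python
import pandas as pd

dates = pd.to_datetime(['2019-08-01', '2019-08-31', '2019-08-01', '2019-08-31'])
groups = ['A', 'A', 'B', 'B']

# made-up stand-ins for the orders and visits dataframes
toy_orders = pd.DataFrame({'date': dates, 'group': groups})
toy_visits = pd.DataFrame({'date': dates, 'group': groups,
                           'visits': [700, 710, 720, 730]})

# first and last date per test group in each dataframe
order_range = toy_orders.groupby('group')['date'].agg(['min', 'max'])
visit_range = toy_visits.groupby('group')['date'].agg(['min', 'max'])

# True only if both frames cover the same period for every group
same_period = order_range.equals(visit_range)
print(same_period)
```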

***

### Part Two: Check compliances

#### In this part I will go through all the technical requirements that need to be fulfilled in order to perform a correct test.

In [170]:
# group by group to see how many users on each group
orders.groupby(['group'])['visitorId'].nunique()





group
A    503
B    586
Name: visitorId, dtype: int64

* We can see that the split between the groups is not even. In the next step I will check whether any users were included in both groups by mistake.

In [171]:
# pylint: disable=missing-final-newline
# users in both groups
both_groups = list(orders.groupby(['visitorId'])['group'].nunique().reset_index().query('group > 1')['visitorId'])
both_groups

[8300375,
 199603092,
 232979603,
 237748145,
 276558944,
 351125977,
 393266494,
 457167155,
 471551937,
 477780734,
 818047933,
 963407295,
 1230306981,
 1294878855,
 1316129916,
 1333886533,
 1404934699,
 1602967004,
 1614305549,
 1648269707,
 1668030113,
 1738359350,
 1801183820,
 1959144690,
 2038680547,
 2044997962,
 2378935119,
 2458001652,
 2579882178,
 2587333274,
 2600415354,
 2654030115,
 2686716486,
 2712142231,
 2716752286,
 2780786433,
 2927087541,
 2949041841,
 2954449915,
 3062433592,
 3202540741,
 3234906277,
 3656415546,
 3717692402,
 3766097110,
 3803269165,
 3891541246,
 3941795274,
 3951559397,
 3957174400,
 3963646447,
 3972127743,
 3984495233,
 4069496402,
 4120364173,
 4186807279,
 4256040402,
 4266935830]

In [172]:
# keep only the orders from users who are not in both groups
clean_orders = orders[~orders['visitorId'].isin(both_groups)]
clean_orders

Unnamed: 0,transactionId,visitorId,date,revenue,group
0,3667963787,3312258926,2019-08-15,30.4,B
1,2804400009,3642806036,2019-08-15,15.2,B
3,3797467345,1196621759,2019-08-15,155.1,B
4,2282983706,2322279887,2019-08-15,40.5,B
5,182168103,935554773,2019-08-15,35.0,B
...,...,...,...,...,...
1191,3592955527,608641596,2019-08-14,255.7,B
1192,2662137336,3733762160,2019-08-14,100.8,B
1193,2203539145,370388673,2019-08-14,50.1,A
1194,1807773912,573423106,2019-08-14,165.3,A


* We can see that the cleaned data now contains only orders from users who took part in a single version of the test, which is what makes them valid for the analysis.
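An alternative way to find and remove the overlapping users is a set intersection followed by a vectorized filter. The sketch below uses a made-up dataframe (`toy_orders` is hypothetical) and is not the exact code used in this notebook.

```python
import pandas as pd

# made-up orders: user 1 appears in both groups (illustration only)
toy_orders = pd.DataFrame({
    'visitorId': [1, 1, 2, 3, 3, 4],
    'group':     ['A', 'B', 'A', 'B', 'B', 'A'],
    'revenue':   [10.0, 20.0, 15.0, 5.0, 7.0, 30.0],
})

# unique users per group
users_a = set(toy_orders.loc[toy_orders['group'] == 'A', 'visitorId'])
users_b = set(toy_orders.loc[toy_orders['group'] == 'B', 'visitorId'])

# users who placed orders in both groups
overlap = users_a & users_b

# drop every order placed by an overlapping user
toy_clean = toy_orders[~toy_orders['visitorId'].isin(overlap)]
print(overlap, len(toy_clean))
```

The `isin` filter works on the whole column at once, so it tends to scale better than a row-by-row `apply` when the dataset grows.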

In [173]:
# group by group to see how many users on each group
clean_orders.groupby(['group'])['visitorId'].nunique()





group
A    445
B    528
Name: visitorId, dtype: int64

* We can see that the groups are still uneven: 445 buyers in group A and 528 in group B, so group B has about 19% more buyers. This difference is large enough that it should be kept in mind when interpreting the test results.
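One rough way to put a number on this imbalance is a chi-square goodness-of-fit check against an equal split. The sketch below uses only the standard library and rests on two assumptions that are not stated in the data: that a 50/50 split was the target, and that buyer counts are a fair proxy for group size. Since the number of buyers is itself an outcome of the test, a stricter check would compare visits per group instead.

```python
import math

buyers_a, buyers_b = 445, 528          # counts from the cell above
expected = (buyers_a + buyers_b) / 2   # assumption: a 50/50 split was intended

# chi-square statistic with one degree of freedom
chi2 = ((buyers_a - expected) ** 2 + (buyers_b - expected) ** 2) / expected

# for one degree of freedom the p-value is erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

# how many more buyers group B has, relative to group A
relative_gap = (buyers_b - buyers_a) / buyers_a

print(f'chi2 = {chi2:.2f}, p-value = {p_value:.4f}, relative gap = {relative_gap:.1%}')
```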

In [174]:
# double checking if there are still users on both groups
(clean_orders.groupby(['visitorId'])['group'].nunique()>1).sum()

0

* We can see that each user now takes part in only one version of the test. The 58 users who appeared in both groups were dropped from the test.

#### Note: These datasets contain no information about marketing events, special promotions, or holidays during August. Therefore, we assume that the test dates are ordinary days that reflect usual user behavior.

* We can also infer that there are no ghost users in the dataset, that is, users who are registered but do not take part in either version of the test. This follows from the missing-value checks and the column information shown earlier: every order has a group assigned.

***

### Part Three: KPIs

In [175]:
# revenue per user
revenue_per_user = clean_orders.groupby(['visitorId'])['revenue'].sum().reset_index()
revenue_per_user.columns = ['visitorId', 'revenue']
revenue_per_user

Unnamed: 0,visitorId,revenue
0,5114589,10.8
1,6958315,25.9
2,11685486,100.4
3,39475350,65.4
4,47206413,15.2
...,...,...
968,4259830713,50.1
969,4278982564,385.7
970,4279090005,105.3
971,4281247801,45.6


In [176]:
# see distribution of revenue per user
revenue_per_user['revenue'].describe()

count      973.000000
mean       136.550051
std        663.321828
min          5.000000
25%         20.800000
50%         50.400000
75%        135.200000
max      19920.400000
Name: revenue, dtype: float64

In [179]:
# Plot the distribution of revenue per user
fig = px.histogram(revenue_per_user,
                   x = 'revenue',
                   title = 'Revenue per User Distribution',
                   nbins = 100)
# Update the y-axis label
fig.update_yaxes(title_text= 'users')
fig.show()