Let’s imagine you work on the product team at a medium-sized e-commerce business. The UX designer worked really hard on a new version of the product page, with the hope that it will lead to a higher conversion rate. The product manager (PM) told you that the current conversion rate is about 13% on average throughout the year, and that the team would be happy with an increase of 2 percentage points, meaning that the new design will be considered a success if it raises the conversion rate to 15%.

1) FORMULATING A HYPOTHESIS

First, we will need to establish a hypothesis, in order to ensure our interpretation is correct and rigorous.

Because we do not know whether the new design will perform better or worse than the old one, we will use a two-tailed test:

H₀: p = p₀

Hₐ: p ≠ p₀

where p and p₀ stand for the conversion rates of the new and old design, respectively.

We will also set a confidence level of 95%:

α = 0.05

The α value is the threshold we have set ourselves: if the probability of observing a result as extreme as ours or more extreme when the null hypothesis is true (the p-value) is lower than α, we will reject the null hypothesis. Since α = 0.05, our confidence level is 1 - α = 95%.

2) CHOOSING THE VARIABLES

In order to test our hypothesis, we need two groups:

A control group (shown the old design)
A treatment group (shown the new design)
Which design a user sees is our independent variable. Even though we are only interested in a change in conversion rate, we run two groups side by side to control for other variables that could have an effect on our results, for example seasonality.

With a control group, we can compare its results directly with those of the treatment group, because the only systematic difference between the groups is the webpage they are shown.

Our dependent variable is what we are trying to measure, in this case the conversion rate. We can code it as a binary variable:
0 - User did not buy during their session
1 - User bought during their session

This allows us to easily calculate the mean for each group to get the conversion rate of each design.
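As a minimal sketch of why the mean works, assuming a purchase is coded as 1 and no purchase as 0, the snippet below averages a small made-up list of sessions. The `sessions` values are hypothetical and are not taken from the dataset used later.

```python
from statistics import mean

# Hypothetical sessions: 1 = the user bought, 0 = the user did not buy
sessions = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# The mean of a 0/1 column is the share of sessions coded as 1
conversion_rate = mean(sessions)

print(f"Conversion rate: {conversion_rate:.0%}")
```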

3) CHOOSING A SAMPLE SIZE

Note that because we will not test our whole user base, the conversion rates we get will only be estimates of the true rates.

The best way to consider this is: The number of people we decide to capture in each group will impact the precision of our estimated conversion rates. So the larger our sample the more precise our estimates and the more likely we will detect a difference between the two groups. However, the larger our sample, the more expensive and time consuming our study becomes.
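To make the trade-off concrete, here is a rough sketch of how the precision of an estimated rate improves with sample size. It assumes the 13% baseline quoted by the PM and uses the usual normal-approximation standard error of a proportion; the helper name `standard_error` and the sample sizes in the loop are illustrative choices, not part of the analysis.

```python
from math import sqrt

def standard_error(p, n):
    """Standard error of a sample proportion (normal approximation)."""
    return sqrt(p * (1 - p) / n)

p = 0.13  # baseline conversion rate quoted by the PM

for n in (100, 1_000, 5_000, 10_000):
    # 1.96 standard errors is the approximate half-width of a 95% interval
    margin = 1.96 * standard_error(p, n)
    print(f"n = {n:>6}: estimate = {p:.0%} +/- {margin:.1%}")
```

The margin shrinks with the square root of n, so quadrupling the sample only halves the uncertainty, which is why very precise estimates get expensive quickly.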

So the question is: How many people should we have in each group?

We can use a method called Power Analysis. It is made up of a few factors:

Power of the test (1 - β) - This is the probability of finding a statistically significant difference between the groups in our test when a difference is actually present. It is usually set to 0.8 by convention (other values can be explored).

Alpha value (α) - The significance level we set earlier to 0.05.

Effect size - How big a difference we expect there to be between the conversion rates.

Since we know the team would be happy with a difference of 2 percentage points, we can use 13% and 15% to calculate the effect size.
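For intuition about what an effect size between two proportions looks like, the sketch below computes Cohen's h by hand, which is the arcsine-based measure that statsmodels' `proportion_effectsize` is documented to return. The function name `cohens_h` is just an illustrative label.

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

h = cohens_h(0.13, 0.15)

# The sign only reflects the order of the arguments; the magnitude is what matters
print(f"Cohen's h: {h:.4f}")
```

An absolute value well below 0.2 is conventionally regarded as a small effect, which is why detecting it needs a fairly large sample.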



In [1]:
# In Python, existing library functions take care of these calculations for us

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.stats.api as sms
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

%matplotlib inline

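## Plot styling preferences
## (Matplotlib 3.6 renamed the seaborn styles to 'seaborn-v0_8-*', so pick whichever name exists)
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)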
font = {'family' : 'Helvetica',
        'weight' : 'bold',
        'size' : 14}

mpl.rc('font', **font)

effect_size = sms.proportion_effectsize(0.13, 0.15)    ## This calculates our effect size based on our expected rates

required_n = sms.NormalIndPower().solve_power(         ## This calculates the sample size needed
    effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1
    )

required_n = ceil(required_n)      # Rounds up to the next whole number

print(required_n)

4720


We can see above that we would need 4720 observations for each group.

Setting the power parameter to 0.8 means in practice that if there is an actual difference in conversion rate between our designs, and that difference is the one we estimated (13% vs 15%), we have about an 80% chance of detecting it as statistically significant with the sample size we calculated.
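The required sample size can also be approximated with a textbook closed-form formula, which helps show how α, power and effect size interact. This is a sketch under the assumption of equal group sizes and a normal approximation; it ignores the negligible far-tail term that a full two-sided power calculation includes, so it should land very close to the figure above rather than being a replacement for the library call. The function name `sample_size_per_group` is hypothetical.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided test of two proportions."""
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))    # Cohen's h effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / h) ** 2)

n = sample_size_per_group(0.13, 0.15)
print(n)

# Asking for more power, or for a smaller alpha, increases the required n
print(sample_size_per_group(0.13, 0.15, power=0.9))
```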

4) COLLECTING AND PREPARING THE DATA

At this point we know exactly what sample size we need. Normally, you would now work with the engineering team to get the data.

However, to keep going, we will use an online dataset as a simulation.

In [None]:
df = pd.read_csv('ab_data.csv')
df.head()

In [None]:
df.info()

Let's make sure that each group is being shown the correct page:

In [None]:
pd.crosstab(df['group'], df['landing_page'])    

Before we go ahead and sample the data to get our subset, let's make sure there are no users that have been sampled multiple times.


In [None]:

session_counts = df['user_id'].value_counts(ascending=False)
multi_users = session_counts[session_counts > 1].count()

print(multi_users)

We can see there are 3894 users who appear more than once. As this is fairly low compared with the total number of data points, we will go ahead and remove them from the DataFrame using the following code:

In [None]:


users_to_drop = session_counts[session_counts > 1].index

df=df[~df['user_id'].isin(users_to_drop)]
print(df.shape[0])

As shown above, we have removed the 3894 users who appeared more than once. Together they accounted for 7788 rows out of the original 294478, leaving 294478 - 7788 = 286690 rows.
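The same clean-up can be illustrated on a toy table. The sketch below uses a small hypothetical DataFrame (the `toy` data is made up) and pandas' `duplicated(keep=False)`, which is an alternative to the `value_counts` approach above that flags every row belonging to a repeated `user_id`.

```python
import pandas as pd

# Hypothetical mini-log in which user 2 appears twice, once in each group
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "group": ["control", "control", "treatment", "treatment"],
})

# keep=False marks all rows of a repeated user_id, not only the later copies
is_repeat = toy["user_id"].duplicated(keep=False)
clean = toy[~is_repeat]

print(clean["user_id"].tolist())
```

Dropping every row of a repeated user, rather than keeping one copy, avoids having to decide which of the conflicting sessions to trust.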

In [None]:
pd.crosstab(df['group'], df['landing_page'])

5) SAMPLING

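Now that our DataFrame is nice and clean, we can proceed and sample the required 4720 entries for each of the groups. We will use the pandas sample function to do this. Because no random_state is passed, a different sample is drawn on every run, so the exact figures below will vary.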

In [None]:
control_sample = df[df['group'] == 'control'].sample(n=required_n)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n)

ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

ab_test

In [None]:
ab_test.info()

In [None]:
ab_test['group'].value_counts()

At this point everything looks good: we have an even split between the control and treatment groups, so we can analyse the results.

6) VISUALIZING THE RESULTS

Let's calculate some basic statistics to get an idea of what our samples look like.

In [None]:
conversion_rates = ab_test.groupby('group')['converted']

std_p = lambda x: np.std(x, ddof=0)    ##Std Deviation of the proportion
se_p = lambda x: stats.sem(x, ddof=0)  ##Std Error of the proportion (std/sqrt(n))

conversion_rates = conversion_rates.agg([np.mean, std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']

conversion_rates.style.format('{:.3f}')

Judging by the stats above, our two designs performed similarly, with the new design actually performing slightly worse.
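For 0/1 data the standard deviation and standard error have simple closed forms, which makes the summary statistics easy to sanity-check. The sketch below uses made-up outcomes (the `converted` list is hypothetical, not drawn from the dataset) and compares the population standard deviation with sqrt(p(1 - p)).

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical outcomes: 12 conversions out of 100 sessions
converted = [1] * 12 + [0] * 88

p = mean(converted)                      # conversion rate
std_from_data = pstdev(converted)        # population std, i.e. ddof=0
std_from_formula = sqrt(p * (1 - p))     # closed form for binary data
std_error = std_from_formula / sqrt(len(converted))

print(f"rate = {p:.3f}, std = {std_from_data:.3f}, std error = {std_error:.3f}")
```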

Now let's plot the data to make the results easier to digest.

In [None]:


plt.figure(figsize=(8,6))

sns.barplot(x=ab_test['group'], y=ab_test['converted'], ci=False)

plt.ylim(0, 0.17)
plt.title('Conversion rate by group', pad=20)
plt.xlabel('Group', labelpad=15)
plt.ylabel('Converted (proportion)', labelpad=15);

Our sampled conversion rates are close to each other, and both are a lot lower than the 13% we originally assumed. Some variation is expected whenever we sample from a population, so the question now is: is the difference between the two groups statistically significant?

7) TESTING THE HYPOTHESIS

The final step in our analysis is testing our initial hypothesis. Since our sample is large, we can use a normal approximation to calculate our p-value, in other words a Z-test.

We will use the statsmodels.stats.proportion module to get the p-value and confidence intervals.
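To show what a two-proportion Z-test does under the hood, here is a hand-rolled sketch using the pooled-variance form, which to my understanding is also the default behaviour of `proportions_ztest`. The function name `two_proportion_ztest` and the counts passed to it are hypothetical and are not results from this analysis.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided Z-test for equal proportions, using the pooled rate."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    # Standard error of the difference if both groups share the pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: probability of a |z| at least this large under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 120/1000 conversions vs 100/1000 conversions
z, p_value = two_proportion_ztest(120, 1000, 100, 1000)
print(f"Z score: {z:.2f}, P value: {p_value:.3f}")
```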

In [None]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'Z score: {z_stat:.2f}')
print(f'P value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

Since our p-value of 0.119 is larger than our α = 0.05 threshold, we cannot reject the null hypothesis. In other words, the test gives us no statistically significant evidence that the new design changes the conversion rate: the slightly lower rate we observed for the new design is compatible with random sampling variation.

If we also look at the confidence intervals, our old design did not reach its expected baseline of 13% in this sample.
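A normal-approximation (Wald) interval for a single proportion can be computed by hand, which is the default method of `proportion_confint` as far as I know. The sketch below assumes hypothetical counts and an illustrative helper name, `wald_interval`; it then checks whether a reference rate such as the 13% baseline falls inside the interval.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical counts: 120 conversions out of 1000 sessions
lower, upper = wald_interval(120, 1000)
baseline = 0.13

print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
print("Baseline inside interval:", lower <= baseline <= upper)
```

If a reference rate lies outside the interval, the sample is hard to reconcile with that rate at the chosen confidence level.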


Taking all of this into account, the new design did not demonstrate an improvement over the old one, so we need to return to the design stage and assess what we can do to improve conversion.