

A/B Testing


---

In [None]:
import pandas as pd
desktop=pd.read_csv('desktop.csv')
laptop=pd.read_csv('laptop.csv')

In [None]:
#to see the first five rows of each dataset

print(desktop.head())
print(laptop.head())

   userid  spending  age  visits
0       1      1250   31     126
1       2       900   27       5
2       3         0   30     459
3       4      2890   22      18
4       5      1460   38      20
   userid  spending  age  visits
0      31      1499   32      12
1      32       799   23      40
2      33      1200   45      22
3      34         0   59     126
4      35      1350   17      85


In [None]:
# import the SciPy package’s stats module so we can use it for t-tests
#print the results of three separate t-tests:
#   one comparing the spending of desktop and laptop subscribers,
#   one comparing the ages of desktop and laptop subscribers,
#   and one comparing the number of recorded website visits of desktop and laptop subscribers.

import scipy.stats
print(scipy.stats.ttest_ind(desktop['spending'],laptop['spending']))
print(scipy.stats.ttest_ind(desktop['age'],laptop['age']))
print(scipy.stats.ttest_ind(desktop['visits'],laptop['visits']))

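#Only the spending comparison is statistically significant (p is about 0.039);
#the age and visits comparisons are not (p is about 0.48 and 0.84).
#Having found that desktop and laptop subscribers differ in spending,
#we might conclude that we should send them different marketing emails.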

TtestResult(statistic=np.float64(-2.109853741030508), pvalue=np.float64(0.03919630411621095), df=np.float64(58.0))
TtestResult(statistic=np.float64(-0.7101437106800108), pvalue=np.float64(0.4804606394128761), df=np.float64(58.0))
TtestResult(statistic=np.float64(0.20626752311535543), pvalue=np.float64(0.8373043059847984), df=np.float64(58.0))


A/B testing uses experiments to help businesses
determine which practices will give them the greatest chances of success.

It consists of a few steps:
  1. experimental design
  2. random assignment into treatment and control groups
  3. careful measurement of outcomes
  4. statistical comparison of outcomes between groups


The statistical comparisons will be familiar: we use t-tests.
While t-tests are a part of the A/B testing process, they are not the only part. A/B testing is a process for collecting new data, which can then be analyzed using tests like the t-test.
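
The four steps above can be sketched end to end on simulated data. Everything in this sketch is hypothetical: the 200 users, the revenue distribution, and the assumed $30 treatment effect are invented for illustration and do not come from the datasets used in this notebook.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Steps 1-2: design the experiment and randomly assign 200 hypothetical users
n_users = 200
is_treatment = rng.permutation(n_users) < n_users // 2  # exactly half are treated

# Step 3: measure outcomes (simulated here; the +30 effect is an assumption)
revenue = rng.normal(loc=100, scale=40, size=n_users) + 30 * is_treatment

# Step 4: compare outcomes between the two groups
result = scipy.stats.ttest_ind(revenue[is_treatment], revenue[~is_treatment])
print(result.statistic, result.pvalue)
```

In a real test, step 3 is where new data is collected; only step 4 is the t-test itself.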

---

  Hypothesis 0: Changing the color of text in our emails from black to
  blue will have no effect on revenues.

  Hypothesis 1: Changing the color of text in our emails from black to
  blue will lead to a change in revenues (either an increase or a decrease).
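
One way to turn these two hypotheses into a decision rule is to compare a t-test's p-value with a significance threshold. The following is a minimal sketch: the helper name `decide` and the revenue figures are hypothetical, chosen only to exercise the rule.

```python
import scipy.stats

ALPHA = 0.05  # conventional significance threshold


def decide(sample_a, sample_b, alpha=ALPHA):
    """Return which hypothesis a two-sample t-test favors at the given alpha."""
    pvalue = scipy.stats.ttest_ind(sample_a, sample_b).pvalue
    if pvalue < alpha:
        return "reject Hypothesis 0 in favor of Hypothesis 1"
    return "fail to reject Hypothesis 0"


# Hypothetical revenue figures, invented purely to exercise the function
black_text = [100, 0, 50, 120, 80, 60, 90, 110]
blue_text = [300, 250, 280, 310, 260, 290, 270, 305]

print(decide(black_text, blue_text))   # clearly different samples
print(decide(black_text, black_text))  # identical samples
```

Note that failing to reject Hypothesis 0 is not the same as proving it: the test may simply lack the data to detect a difference.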

---

Here, we import the NumPy package, giving it the alias np, so we can
use its median() method. Then we simply take the median age of our group
of desktop subscribers and create groupa, a subset of our desktop subscribers
whose age is below or equal to the median age, and groupb, a subset of our
desktop subscribers whose age is above the median age.

In [None]:
import numpy as np
medianage=np.median(desktop['age'])
groupa=desktop.loc[desktop['age']<=medianage,:]
groupb=desktop.loc[desktop['age']>medianage,:]

In [None]:
#fabricated data that shows hypothetical outcomes for members of our two groups

emailresults1=pd.read_csv('emailresults1.csv')
print(emailresults1.head())

   userid  revenue
0       1      100
1       2        0
2       3       50
3       4      550
4       5      175


---

In this snippet, we use the pandas merge() method to combine our
dataframes. We specify on='userid', meaning that we take the row of
emailresults1 that corresponds to a particular userid and merge it with
the row of groupa that corresponds to that same userid. The end result of
using merge() is a dataframe in which every row corresponds to a particular user identified by their unique userid.
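
The behavior of merge() is easier to see on a tiny example. The two tables below are hypothetical stand-ins for the real dataframes. By default, merge() performs an inner join, so a user who is missing from either table is dropped from the result.

```python
import pandas as pd

# Hypothetical miniature tables; the real ones come from the CSV files
group = pd.DataFrame({"userid": [1, 2, 3], "age": [31, 27, 30]})
results = pd.DataFrame({"userid": [2, 3, 4], "revenue": [0, 50, 550]})

# Default how='inner': only userids present in both tables are kept
merged = group.merge(results, on="userid")
print(merged)

# how='left' keeps every group member and marks missing revenue as NaN
merged_left = group.merge(results, on="userid", how="left")
print(merged_left)
```

Comparing the row counts before and after a merge is a quick way to notice users who were silently dropped.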

In [None]:
groupa_withrevenue=groupa.merge(emailresults1,on='userid')
groupb_withrevenue=groupb.merge(emailresults1,on='userid')

In [None]:
#After preparing our data, it’s simple to perform a t-test to check whether our groups are different.

print(scipy.stats.ttest_ind(groupa_withrevenue['revenue'],groupb_withrevenue['revenue']))

#The important part of this output is the pvalue variable, which tells us
#the p-value of our test. We can see that the result says that p = 0.037, approximately

TtestResult(statistic=np.float64(-2.186454851070545), pvalue=np.float64(0.03730073920038287), df=np.float64(28.0))


---

The output is 125.0. The average groupb customer has outspent the average groupa customer by $125.

This difference is statistically significant, so we
reject Hypothesis 0 in favor of Hypothesis 1, concluding (for now, at least)
that the blue text in marketing emails leads to about $125 more in revenue
per user than black text.

In [None]:
print(np.mean(groupb_withrevenue['revenue'])-np.mean(groupa_withrevenue['revenue']))

125.0


---

This was an experiment.

1. We split a population into two groups,
2. Performed different actions on each group
3. Compared the results.

In the context of business, such an experiment is often called an
A/B test. The A/B part of the name refers to the two groups, Group A and
Group B, whose different responses to emails we compared. Every A/B
test follows the same pattern we went through here: a split into two groups,
application of a different treatment (for example, sending different emails)
to each group, and statistical analysis to compare the groups’ outcomes and
draw conclusions about which treatment is better.

---

Math Behind A/B Testing:

E(A’s revenue with black text) + E(effect of changing black → blue on A) = E(A’s revenue with blue text)

E(B’s revenue with black text) + E(effect of changing black → blue on B) = E(B’s revenue with blue text)
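
These equations imply that the difference we observe between groups mixes two things: the effect of the email, and any gap the groups already had before the email was sent. The sketch below illustrates this with a simulated population in which revenue rises with age and the email has no effect at all. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: revenue depends on age; email color has NO effect
n = 10_000
age = rng.integers(18, 70, size=n)
revenue = 5.0 * age + rng.normal(0, 50, size=n)

# Confounded design: split on median age, as in the desktop experiment
older = age > np.median(age)
gap_age_split = revenue[older].mean() - revenue[~older].mean()

# Randomized design: split by coin flip, independent of age
coin = rng.random(n) > 0.5
gap_random_split = revenue[coin].mean() - revenue[~coin].mean()

print(gap_age_split)     # large, despite a true effect of zero
print(gap_random_split)  # close to zero
```

With the age-based split, the baseline gap alone produces a large difference, which a t-test would happily report as significant.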

---

Perform an A/B test on our laptop subscriber list, but this time we’ll
use randomization to select our groups to avoid having a confounded experimental design:

In this snippet, we use the NumPy random.random() method to draw a random
number between 0 and 1 for each user, then compare it with 0.5 to produce a
column of randomly generated 0s and 1s. We can interpret a 0 to
mean that a user belongs to Group C, and a 1 to mean that a user belongs
to Group D. When we generate 0s and 1s randomly like this, the groups could
end up with different sizes.
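
If equal group sizes matter, one alternative (not the method used in this notebook) is to shuffle a balanced vector of labels. This is a minimal sketch, assuming a hypothetical 30-row stand-in for the laptop table and an even number of rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(18811015)

# Hypothetical stand-in for the laptop subscriber table
laptop_demo = pd.DataFrame({"userid": np.arange(31, 61)})

# Build a balanced vector of 0s and 1s (assumes an even row count), then shuffle it
labels = np.repeat([0, 1], len(laptop_demo) // 2)
laptop_demo["groupassignment"] = rng.permutation(labels)

print(laptop_demo["groupassignment"].value_counts())
```

Assignment is still random, so the design stays unconfounded, but each group is guaranteed to get exactly half of the users.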

In [None]:
np.random.seed(18811015)
laptop.loc[:,'groupassignment1']=1*(np.random.random(len(laptop.index))>0.5)
groupc=laptop.loc[laptop['groupassignment1']==0,:].copy()
groupd=laptop.loc[laptop['groupassignment1']==1,:].copy()

In [None]:
emailresults2=pd.read_csv('emailresults2.csv')


#join our email results to our group dataframes
groupc_withrevenue=groupc.merge(emailresults2,on='userid')
groupd_withrevenue=groupd.merge(emailresults2,on='userid')

We find that the p-value is less than 0.05, indicating that the difference
between the groups is statistically significant. This time, our experiment
isn’t confounded, because we used random assignment to ensure that the
differences between groups are the result of our different emails, not the
result of different characteristics of each group.

In [None]:
#use a t-test to check whether the revenue resulting
# from Group C is different from the revenue we get from Group D

print(scipy.stats.ttest_ind(groupc_withrevenue['revenue'],groupd_withrevenue['revenue']))

TtestResult(statistic=np.float64(-2.381320497676198), pvalue=np.float64(0.024288828555138562), df=np.float64(28.0))


---

Calculate the estimated effect here with subtraction: the mean revenue
obtained from subjects in Group D minus the mean revenue obtained
from subjects in Group C. The difference between mean revenue from
Group C and mean revenue from Group D, about $260, is the size of the
effect of our experiment.

In [None]:
print(np.mean(groupd_withrevenue['revenue'])-np.mean(groupc_withrevenue['revenue']))

260.3333333333333


---

Optimizing with the Champion/Challenger Framework:


Suppose you have a champion email and want to continue A/B testing
to try to improve it. You do another random split of your users, into a new
Group A and a new Group B. You send the champion email to Group A. You
send another email to Group B that differs from the champion email in one
way that you want to learn about; for example, maybe it uses formal rather
than informal language. When we compare the revenues from Group A and
Group B after the email campaign, we’ll be able to see whether this new
email performs better than the champion email.
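
The champion/challenger loop can be sketched as a simulation. Everything here is hypothetical: the helper `run_round`, the email variants, their true mean revenues, and the normal revenue distribution are assumptions made only to show the control flow.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)


def run_round(champion_mean, challenger_mean, n_per_group=200, alpha=0.05):
    """Simulate one A/B round; return True if the challenger wins."""
    # Simulated revenues; the normal distribution and sd=50 are assumptions
    group_a = rng.normal(champion_mean, 50, size=n_per_group)    # champion email
    group_b = rng.normal(challenger_mean, 50, size=n_per_group)  # challenger email
    result = scipy.stats.ttest_ind(group_a, group_b)
    return result.pvalue < alpha and group_b.mean() > group_a.mean()


# Hypothetical true mean revenue per user for each email variant
champion = ("informal language", 100.0)
challengers = [("formal language", 90.0), ("blue text", 130.0)]

for name, true_mean in challengers:
    if run_round(champion[1], true_mean):
        champion = (name, true_mean)  # the challenger becomes the new champion

print(champion[0])
```

A challenger replaces the champion only when it is both better on average and significantly different, so a weaker challenger leaves the champion in place.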

---

Preventing Mistakes with Twyman’s Law and A/A Testing:


A/B testing is a relatively simple process from beginning to end. Nevertheless,
we are all human and make mistakes. In any data science effort, not just
A/B testing, it’s important to proceed carefully and constantly check whether
we’ve done something wrong. One piece of evidence that often indicates
that we’ve done something wrong is that things are going too well. Twyman’s
law puts this bluntly: any figure that looks interesting or different is
usually wrong.
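
An A/A test is one common sanity check: both groups receive the same treatment, so a sound pipeline should report a significant difference only about as often as the significance threshold allows. The sketch below simulates this with hypothetical revenue data.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)

# Hypothetical revenue for 200 users who all received the SAME email
revenue = rng.normal(100, 40, size=200)

n_trials = 2000
false_alarms = 0
for _ in range(n_trials):
    # Randomly split the users into two groups with identical treatment
    in_first = rng.permutation(200) < 100
    pvalue = scipy.stats.ttest_ind(revenue[in_first], revenue[~in_first]).pvalue
    false_alarms += pvalue < 0.05

false_alarm_rate = false_alarms / n_trials
print(false_alarm_rate)  # should land near 0.05 if the pipeline is sound
```

A false alarm rate far above 0.05 would suggest a bug in the assignment, the merging, or the analysis, before any real treatment is tested.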

---

Understanding Effect Sizes:

In the first A/B test we ran, we observed a difference of $125 between the
Group A users who received a black-text email and the Group B users who
received a blue-text email. This $125 difference between groups is also
called the A/B test’s effect size. It’s natural to try to form a judgment about
whether we should consider this $125 effect size a small effect, a medium
effect, or a large effect.



To judge whether an effect is small or large, we have to compare it to
something else. Consider the following list of nominal GDP figures (in
US dollars, as of 2019) for Malaysia, Myanmar, and the Marshall Islands,
respectively:

In [None]:
# The result is 158884197328.32672, or about $158,884,197,328 (almost
# $159 billion). The standard deviation is a common way to measure how
# dispersed a dataset is.

gdps=[365303000000,65994000000,220000000]
print(np.std(gdps))

158884197328.32672


In [None]:
# The output is about 7.9e-10 (7.9 × 10^-10), which shows us that the $125 effect size
# is a little less than one-billionth of the standard deviation of our GDP figures.

print(125/np.std(gdps))

7.867365169217765e-10


In [None]:

# we conduct a survey of the prices of burgers at local restaurants. Maybe we find the following prices:
burgers=[9.0,12.99,10.50]

#standard deviation
print(np.std(burgers))

1.6455394252341695


The standard deviation of our burger price data is about 1.65. So, two
countries’ GDPs differing by about $80 billion is roughly comparable to
two burger prices differing by about 80 cents: both represent about half of
a standard deviation in their respective domains.
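
The comparison above can be checked by expressing each raw difference in units of its own dataset's standard deviation. The helper name `in_standard_deviations` is hypothetical. Note that np.std() computes the population standard deviation by default.

```python
import numpy as np

gdps = [365303000000, 65994000000, 220000000]
burgers = [9.0, 12.99, 10.50]


def in_standard_deviations(difference, data):
    """Express a raw difference in units of the data's standard deviation."""
    return difference / np.std(data)


print(in_standard_deviations(80e9, gdps))     # GDP gap of $80 billion
print(in_standard_deviations(0.80, burgers))  # burger price gap of 80 cents
```

Both ratios come out close to 0.5, which is what makes the two differences comparable despite their very different dollar amounts.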

In [None]:
#when we compare a $125 effect size to the burger price data, we see that it’s huge:

# We see that $125 is about 76 burger price standard deviations.
# Seeing a $125 difference in burger prices in your town is therefore something like
# seeing a man who is over 20 feet tall.

print(125/np.std(burgers))
#the result is about 75.96

75.96293232671214


In [None]:
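#compare the $125 effect size to the standard deviation of our email revenue data;
#the result is about 0.76
print(125/np.std(emailresults1['revenue']))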

0.763769235188029


Calculating the Significance of Data:


We typically use statistical significance as the key piece of evidence that
convinces us that an effect that we study in an A/B test is real. Mathematically,
statistical significance depends on three things:

    • The size of the effect being studied (like the increase in revenue that
    results from changing an email’s text color). Bigger effects make statistical significance more likely.
    • The size of the sample being studied (the number of people on a
    subscriber list who are receiving our marketing emails). Bigger samples
    make statistical significance more likely.
    • The significance threshold we’re using (typically 0.05). A higher
    threshold makes statistical significance more likely.

If we have a big sample size, and we’re studying a big effect, our t-tests will
likely reach statistical significance. On the other hand, if we study an effect
that’s very small, with a sample that’s very small, we may have predestined our
own failure: the probability that we detect a statistically significant result is
very low, even if the email truly does have an effect. This probability of
detecting an effect that truly exists is called statistical power. Since running an
A/B test costs time and money, we’d rather not waste resources running tests
like this that are predestined to fail to reach statistical significance.
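
The contrast between these two situations can be illustrated by simulation: generate many hypothetical A/B tests with a known true effect and count how often the t-test reaches significance. The helper `simulated_power` and all parameter values below are assumptions for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(4)


def simulated_power(effect_size, n_per_group, alpha=0.05, n_trials=2000):
    """Estimate the share of simulated A/B tests that reach significance."""
    hits = 0
    for _ in range(n_trials):
        # Outcomes in standard-deviation units; normality is an assumption
        group_a = rng.normal(0.0, 1.0, size=n_per_group)
        group_b = rng.normal(effect_size, 1.0, size=n_per_group)
        hits += scipy.stats.ttest_ind(group_a, group_b).pvalue < alpha
    return hits / n_trials


low = simulated_power(effect_size=0.1, n_per_group=10)    # small effect, small sample
high = simulated_power(effect_size=0.8, n_per_group=100)  # big effect, big sample
print(low, high)
```

The effect size here is measured in standard deviations, the same convention used for the 0.76 figure computed earlier.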

In [None]:
#Import a module into Python that makes calculating statistical power easy

from statsmodels.stats.power import TTestIndPower

In [None]:
# To calculate power with this module, we’ll need to define parameters for
# the three things that determine statistical significance (see the preceding bulleted list).

alpha=0.05

# We choose 0.05 for alpha, the standard threshold in much empirical research.

In [None]:
# Define sample size.

#A/B test on a group of email subscribers that consists of 90 people total.
# That means we’ll have 45 people in Group A and 45 people in Group B,
# so we define the number of observations in each of our groups as 45.

nobs=45

In [None]:
#define an estimated effect size

effectsize=0.5

In [None]:
# We can use a function that will take the three parameters we’ve
# defined and calculate the statistical power we should expect:

analysis = TTestIndPower()
power = analysis.solve_power(effect_size=effectsize, nobs1=nobs, alpha=alpha)

# the estimated statistical power for our hypothetical A/B test is about 0.65
print(power)

0.6501855019775578


This means that we expect about a 65 percent chance of detecting an effect from our A/B test and about a 35 percent chance that even though a true effect exists, our A/B test doesn’t find it. These odds might seem unfavorable if a given A/B test is expected to be expensive.

In [None]:
analysis = TTestIndPower()
alpha = 0.05
effect = 0.5
power = 0.8
observations = analysis.solve_power(effect_size=effect, power=power, alpha=alpha)

# the result is about 63.8
print(observations)

63.7656117754095


This means that if we want to have 80 percent statistical power for our
planned A/B test, we’ll need to recruit at least 64 participants in each
group (the result of about 63.8 is a per-group figure, rounded up). Being
able to perform these kinds of calculations can be helpful
in the planning stages of A/B tests.
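
As a cross-check on these figures, power for a two-sided, equal-size, two-sample t-test can also be computed by hand from the noncentral t distribution in SciPy. This is a sketch of that standard calculation, with a hypothetical helper name `ttest_power`. It is not a description of how the statsmodels module is implemented.

```python
import numpy as np
from scipy import stats


def ttest_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-size, two-sample t-test."""
    df = 2 * n_per_group - 2
    noncentrality = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |t| exceeds the critical value when the effect is real
    return stats.nct.sf(t_crit, df, noncentrality) + stats.nct.cdf(-t_crit, df, noncentrality)


# Smallest per-group sample size that reaches 80 percent power
n = 2
while ttest_power(0.5, n) < 0.8:
    n += 1
print(n, ttest_power(0.5, n))
```

Searching over whole numbers lands on the same per-group sample size as rounding up the solve_power() result.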

---

Applications and Advanced Considerations:

1. One of the most common applications of A/B testing is user interface/experience design. A website might randomly assign
visitors to two groups (called Group A and Group B, as usual) and show different versions of the site to each group. The site can then measure which
version leads to more user satisfaction, higher revenue, more link clicks,
more time spent on the site, or whatever else interests the company. The
whole process can be completely automated, which is what enables the high-speed, high-volume A/B testing that today’s top tech companies are doing.

2. E-commerce companies run tests, including A/B tests, on product pricing. By running an A/B test on pricing, you can measure what economists
call the price elasticity of demand, meaning how much demand changes in
response to price changes.

3. Email design, user-interface design, and product pricing are all common
concerns for business-to-consumer (B2C) business models, in which businesses
sell directly to consumers. B2C scenarios are a natural fit for A/B testing
because the number of customers, products, and transactions tends to be
higher for B2C businesses than for other businesses, so we can get large sample sizes and higher statistical power.

4. A/B testing also involves what is often called the exploration/exploitation trade-off. In this trade-off, two goals are in constant
tension: to explore (for example, to run A/B tests with possibly bad email
designs to learn which is best) and to exploit (for example, to send out only
the champion email because it seems to perform the best). Exploration
can lead to missed opportunities if one of your challengers performs much
worse than the champion; you would have been better off just sending out
the champion to everyone. Exploitation can lead to missed opportunities if
your champion is not as good as another challenger that you haven’t tested
yet because you’re too busy exploiting your champion to do the requisite
exploration.
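
One simple way to balance exploring and exploiting is an epsilon-greedy rule: most of the time send the email that looks best so far, and occasionally send a random one. This is a swapped-in illustration, not a method from this notebook, and the true mean revenues, the noise level, and epsilon are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical true mean revenue per send; unknown to the sender in practice
true_means = [100.0, 150.0]  # index 0 = champion, index 1 = challenger
epsilon = 0.1                # share of sends reserved for exploration

counts = [0, 0]
totals = [0.0, 0.0]

for send in range(2000):
    if send < 2:
        arm = send  # try each email once to start
    elif rng.random() < epsilon:
        arm = int(rng.integers(2))  # explore: pick an email at random
    else:
        # exploit: pick the email with the best average revenue so far
        arm = int(np.argmax([totals[i] / counts[i] for i in range(2)]))
    revenue = rng.normal(true_means[arm], 50.0)
    counts[arm] += 1
    totals[arm] += revenue

print(counts)
```

A larger epsilon spends more sends on learning, while a smaller one commits to the current favorite sooner.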

---


The Ethics of A/B Testing:

A/B testing is fraught with difficult ethical issues. This may seem surprising,
but remember, A/B testing is an experimental method in which we
intentionally alter human subjects’ experiences in order to study the results
for our own gain. This means that A/B testing is human experimentation (examples are given on page 92).

---

Summary:

We discussed A/B testing.

We started with a simple t-test, and then looked at the need for random, non-confounded data collection as part of the A/B testing process. We covered some nuances of A/B testing, including the champion/challenger framework and Twyman’s law, as well as ethical concerns. In the next chapter, we’ll discuss binary classification, an essential skill for any data scientist.