In [1]:
import pandas as pd
desktop=pd.read_csv('desktop.csv')
laptop=pd.read_csv('laptop.csv')

In [2]:
import scipy.stats
print(scipy.stats.ttest_ind(desktop['spending'],laptop['spending']))
print(scipy.stats.ttest_ind(desktop['age'],laptop['age']))
print(scipy.stats.ttest_ind(desktop['visits'],laptop['visits']))

Ttest_indResult(statistic=-2.109853741030508, pvalue=0.03919630411621095)
Ttest_indResult(statistic=-0.7101437106800108, pvalue=0.4804606394128761)
Ttest_indResult(statistic=0.20626752311535543, pvalue=0.8373043059847984)
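
As a rough sketch of what `ttest_ind` computes by default (a pooled-variance, two-sided test), the statistic and p-value can be rebuilt by hand. The spending figures below are made up for illustration and are not taken from `desktop.csv` or `laptop.csv`.

```python
import numpy as np
import scipy.stats

# Hypothetical spending figures, made up for illustration only.
group1 = np.array([1250.0, 900.0, 400.0, 2100.0, 1700.0, 650.0])
group2 = np.array([1500.0, 2300.0, 1900.0, 2800.0, 1200.0, 2600.0, 2050.0])

n1, n2 = len(group1), len(group2)
# Pooled variance: the default ttest_ind assumes both groups share one variance.
pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (group1.mean() - group2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom.
p_manual = 2 * scipy.stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

result = scipy.stats.ttest_ind(group1, group2)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

A negative statistic means the first sample's mean is below the second's, which matches the sign convention in the spending comparison above.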


## Running Experiments to Test New Hypotheses

Should we change the color of the text?

Hypothesis 0: Changing the color of text in our emails from black to blue will have no effect on revenues.

Hypothesis 1: Changing the color of text in our emails from black to blue will lead to a change in revenues (either an increase or a decrease).

In [3]:
import numpy as np
medianage=np.median(desktop['age'])
groupa=desktop.loc[desktop['age']<=medianage,:]
groupb=desktop.loc[desktop['age']>medianage,:]

In [8]:
emailresults1=pd.read_csv('emailresults1.csv')

print(emailresults1.head())

   userid  revenue
0       1      100
1       2        0
2       3       50
3       4      550
4       5      175


In [9]:
groupa_withrevenue=groupa.merge(emailresults1,on='userid')
groupb_withrevenue=groupb.merge(emailresults1,on='userid')
# Specifying on='userid' takes the row of emailresults1 for a particular userid
# and merges it with the row of groupa (or groupb) that has the same userid.
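
A small sketch of how the merge behaves, using hypothetical toy tables rather than the real CSV files. By default `merge` performs an inner join, so users who appear in only one table are dropped.

```python
import pandas as pd

# Hypothetical toy tables; the column names mirror the ones used above.
group = pd.DataFrame({'userid': [1, 2, 3], 'age': [25, 31, 47]})
results = pd.DataFrame({'userid': [2, 3, 4], 'revenue': [0, 50, 550]})

# The default is an inner join: only userids present in both tables survive.
merged = group.merge(results, on='userid')
print(merged)

# how='left' keeps every row of group; users without a result get NaN revenue.
merged_left = group.merge(results, on='userid', how='left')
print(merged_left)
```

Comparing the row counts before and after merging is a cheap way to notice users who silently dropped out of the analysis.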

In [10]:
print(scipy.stats.ttest_ind(groupa_withrevenue['revenue'],groupb_withrevenue['revenue']))

Ttest_indResult(statistic=-2.186454851070545, pvalue=0.03730073920038287)


In [12]:
print(np.mean(groupb_withrevenue['revenue'])-np.mean(groupa_withrevenue['revenue']))

125.0


###### A/B Testing
- (1) a split into two groups, application of a different treatment to each group
- (2) statistical analysis to compare the groups' outcomes and draw conclusions about which treatment is better

###### Understanding the Math of A/B Testing
E(A's revenue with blk text) + E(effect of changing blk → blue on A) = E(A’s revenue with blue text)

E(B’s revenue with blk text) + E(effect of changing blk → blue on B) = E(B’s revenue with blue text)

Reject Hypothesis 0? 

(1) First calculate E(effect of changing blk → blue on A) and E(effect of changing blk → blue on B)

(2) If either is not equal to 0, reject Hypothesis 0

After experiment:
    
    E(A’s revenue with blk text) = 104
    E(B’s revenue with blue text) = 229
    
    => 104 + E(effect of changing blk → blue on A) = E(A’s revenue with blue text)
    => E(B’s revenue with blk text) + E(effect of changing blk → blue on B) = 229

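Simplified equation (assuming A and B have the same expected black-text revenue and the same expected effect, which random group assignment is meant to ensure):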
    
    104 + E(effect of changing blk → blue on everyone) = 229
 => Effect of blue text: $125 revenue increase
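
The simplification only works when the two groups would have earned the same revenue without any change. A minimal simulation, under the made-up assumption that baseline revenue rises with age, suggests why a median-age split can mislead while a random split recovers the effect. All numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
true_effect = 125.0  # assumed effect of blue text, chosen to match the notes

# Assumption: baseline revenue rises with age, so age is a confounder.
age = rng.uniform(20, 70, size=n)
baseline = 4.0 * age + rng.normal(0, 50, size=n)

def estimated_effect(treated):
    """Difference in mean revenue: blue-text group minus black-text group."""
    revenue = baseline + true_effect * treated
    return revenue[treated].mean() - revenue[~treated].mean()

# Split at the median age: the older half gets blue text.
age_split = age > np.median(age)
# Random split: a coin flip decides who gets blue text.
random_split = rng.random(n) > 0.5

est_age = estimated_effect(age_split)        # inflated by the age gap between groups
est_random = estimated_effect(random_split)  # close to the true effect
print(est_age, est_random)
```

In the age split, the difference in means mixes the text-color effect with the revenue gap between older and younger users; the random split has no such gap on average.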

In [13]:
# Translating Math into Practice

np.random.seed(18811015)
laptop.loc[:,'groupassignment1']=1*(np.random.random(len(laptop.index))>0.5)
groupc=laptop.loc[laptop['groupassignment1']==0,:].copy()
groupd=laptop.loc[laptop['groupassignment1']==1,:].copy()

In [14]:
emailresults2=pd.read_csv('emailresults2.csv')

In [15]:
groupc_withrevenue=groupc.merge(emailresults2,on='userid')
groupd_withrevenue=groupd.merge(emailresults2,on='userid')

In [16]:
print(scipy.stats.ttest_ind(groupc_withrevenue['revenue'],groupd_withrevenue['revenue']))

Ttest_indResult(statistic=-2.381320497676198, pvalue=0.024288828555138562)


In [17]:
# p-value of about 0.024 < 0.05 => including a picture has a nonzero effect on revenue

In [18]:
print(np.mean(groupd_withrevenue['revenue'])-np.mean(groupc_withrevenue['revenue']))

260.3333333333333


## Optimizing with the Champion/Challenger Framework

"Since the new email is in direct competition with the champion email, we call it the challenger. If the champion performs better than the challenger, the champion retains its champion status. If the challenger performs better than the champion, that challenger becomes the new champion.

This process can continue indefinitely: we have a champion that represents the state of the art of whatever we’re doing (marketing emails, in this case). We constantly test the champion by putting it in direct competition with a succession of challengers in A/B tests. Each challenger that leads to significantly better outcomes than the champion becomes the new champion and is, in turn, put into competition against new challengers later.

This endless process is called the champion/challenger framework for A/B tests. It’s meant to lead to continuous improvement, continuous refinement, and asymptotic optimization to get to the best-possible performance in all aspects of business.”
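
The loop described above might be sketched as follows. The email designs, their true mean revenues, and the helper `run_ab_test` are hypothetical and exist only for this simulation; in practice the true means are unknown and each test uses real outcomes.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)

def run_ab_test(champion_mean, challenger_mean, n=500, sd=100.0):
    """Simulate revenue for two randomized groups and t-test the difference."""
    champion = rng.normal(champion_mean, sd, size=n)
    challenger = rng.normal(challenger_mean, sd, size=n)
    result = scipy.stats.ttest_ind(challenger, champion)
    return result.statistic, result.pvalue

# Hypothetical true mean revenues per email design (unknown in practice).
champion_name, champion_mean = 'black text', 100.0
challengers = [('blue text', 150.0), ('red text', 90.0), ('with picture', 220.0)]

for name, mean in challengers:
    stat, pvalue = run_ab_test(champion_mean, mean)
    # Promote only a challenger that is significantly better, not just different.
    if pvalue < 0.05 and stat > 0:
        champion_name, champion_mean = name, mean
    print(name, round(pvalue, 4), '-> champion:', champion_name)
```

Checking the sign of the statistic as well as the p-value matters here: a two-sided test also flags challengers that are significantly worse.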

## Preventing Mistakes with Twyman's Law and A/A Testing

“The inevitability of mistakes should lead us to be naturally suspicious of anything that seems too good, bad, interesting, or strange to be true. 

This natural suspicion is advocated by Twyman’s law, which states that “any figure that looks interesting or different is usually wrong.” 

This law has been restated in several ways, including “any statistic that appears interesting is almost certainly a mistake” and “the more unusual or interesting the data, the more likely it is to have been the result of an error.”

“A/A testing is a type of testing that is just what it sounds like; we go through the steps of randomization, treatment, and comparison of two groups just as in A/B testing, but instead of sending two different emails to our two randomized groups, we send the identical email to each group.”

“A/A testing can be a useful sanity check that can prevent us from getting carried away by the kind of unusual, interesting, too-good-to-be-true results that Twyman’s law warns us about.”
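
A simulated A/A test gives a feel for what a healthy pipeline looks like. In this sketch the revenue distribution and group sizes are assumptions; because both groups get the identical email, roughly 5 percent of tests should still come out "significant" at alpha = 0.05, purely by chance.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
alpha = 0.05
n_tests = 2000

false_alarms = 0
for _ in range(n_tests):
    # Both groups receive the identical email, so revenue comes from one distribution.
    revenue = rng.exponential(scale=100.0, size=200)
    in_group_b = rng.random(200) > 0.5
    pvalue = scipy.stats.ttest_ind(revenue[~in_group_b], revenue[in_group_b]).pvalue
    false_alarms += int(pvalue < alpha)

false_alarm_rate = false_alarms / n_tests
print(false_alarm_rate)
```

A rate far above alpha in a real A/A test would point to a problem in randomization, data collection, or analysis rather than to a real effect.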

## Understanding Effect Sizes

In [19]:
# difference of $125 -> A/B test's effect size
# is this a small effect, medium effect, or a large effect?

# consider a list of nominal GDPs for Malaysia, Myanmar, and the Marshall Islands
gdps=[365303000000,65994000000,220000000]
print(np.std(gdps))

158884197328.32672


In [20]:
print(125/np.std(gdps))

7.867365169217765e-10


In [21]:
# this shows that the $125 effect size is a little less than one billionth (about 0.79 billionths) of the std of the GDP figures

In [23]:
burgers=[9.0,12.99,10.50]
print(np.std(burgers))

1.6455394252341695


In [24]:
print(125/np.std(burgers))

75.96293232671214


In [25]:
# $125 is about 76 burger price standard deviations

Cohen's d: an effect size divided by a relevant standard deviation
- the number of standard deviations that two populations' means are apart from each other

Cohen's d:
- 0.2 or lower, small effect
- about 0.5, medium effect
- around 0.8 or higher, large effect
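
One common convention, which differs slightly from dividing by the standard deviation of all revenues as in these notes, is to divide the difference in means by the pooled standard deviation of the two groups. A minimal sketch, with a hypothetical helper `cohens_d` and made-up revenue figures:

```python
import numpy as np

def cohens_d(sample1, sample2):
    """Difference in means divided by the pooled sample standard deviation."""
    x1, x2 = np.asarray(sample1, dtype=float), np.asarray(sample2, dtype=float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x2.mean() - x1.mean()) / np.sqrt(pooled_var)

# Hypothetical revenue figures for two email groups.
black_text = [100, 0, 50, 150, 75, 125]
blue_text = [200, 150, 100, 250, 175, 225]
d = cohens_d(black_text, blue_text)
print(d)
```

With these made-up figures the means differ by 100 and the result lands well above 0.8, which the thresholds above would call a large effect.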

In [26]:
print(125/np.std(emailresults1['revenue']))

0.763769235188029


In [27]:
# 0.76 ~ 0.8 => large effect size

## Calculating the Significance of Data

Mathematically, statistical significance depends on three things:
- Size of the effect being studied: bigger effects make statistical significance more likely.
- Size of the sample being studied: bigger samples make statistical significance more likely.
- Significance threshold: a higher threshold (a larger alpha, such as 0.10 instead of 0.05) makes statistical significance more likely.

Statistical power: the probability that a correctly run A/B test will reject a false null hypothesis

In [30]:
from statsmodels.stats.power import TTestIndPower
alpha=0.05
nobs=45 # number of observations in each group (passed as nobs1)
effectsize=0.5 #using Cohen's d

analysis=TTestIndPower()
power = analysis.solve_power(effect_size=effectsize, nobs1=nobs, alpha=alpha)
print(power)

0.6501855020289931


In [31]:
# 65 percent chance of detecting an effect from our A/B test, and about a 35 percent chance that even though a true effect
# exists, our A/B test doesn't find it

In [32]:
# solve_power() can also run the calculation in reverse: it solves for whichever parameter is left unspecified
analysis = TTestIndPower()
alpha = 0.05
effect = 0.5
power = 0.8
observations = analysis.solve_power(effect_size=effect, power=power, alpha=alpha)
# nobs1 is left out above, so solve_power returns the per-group sample size needed to reach this power
print(observations)

63.765611775409525


In [33]:
# “if we want to have 80 percent statistical power for our planned A/B test, we’ll need to recruit at least 64 participants for both groups”
# (nobs1 is a per-group count, so this means at least 64 participants in each group)
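
As a cross-check that does not rely on statsmodels, the same power can be sketched with the noncentral t distribution from SciPy. The helper name `ttest_power` is an assumption of this sketch; it covers only the two-sided test with equal group sizes and should land near the 0.65 and 64-per-group figures above.

```python
import numpy as np
import scipy.stats

def ttest_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    noncentrality = effect_size * np.sqrt(n_per_group / 2)
    t_crit = scipy.stats.t.isf(alpha / 2, df)
    # Probability that the test statistic lands in either rejection region.
    upper = scipy.stats.nct.sf(t_crit, df, noncentrality)
    lower = scipy.stats.nct.cdf(-t_crit, df, noncentrality)
    return upper + lower

power_45 = ttest_power(0.5, 45)
print(power_45)

# Smallest whole number of participants per group that reaches 80 percent power.
n_needed = 2
while ttest_power(0.5, n_needed) < 0.8:
    n_needed += 1
print(n_needed)
```

Searching over whole numbers makes the rounding explicit: a fractional answer such as 63.77 has to be rounded up, never down, to keep the target power.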

## Applications and Advanced Considerations

“By running an A/B test on pricing, you can measure what economists call the price elasticity of demand, meaning how much demand changes in response to price changes. 
- If your A/B test finds only a very small change in demand when you increase the price, you should increase the price for everyone and take advantage of their greater willingness to pay. 
- If your A/B test finds that demand drops off significantly when you increase the price slightly, you can conclude that customers are sensitive to price, and their purchase decisions depend heavily on price considerations.”
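
A back-of-the-envelope sketch of the elasticity calculation, using made-up A/B pricing results. Elasticity is taken here as the percent change in demand divided by the percent change in price, measured from group A's values.

```python
# Hypothetical A/B pricing results; every number here is made up.
price_a, price_b = 10.00, 11.00          # price shown to group A and group B
purchases_a, purchases_b = 400, 380      # purchases out of equally sized groups

pct_change_price = (price_b - price_a) / price_a
pct_change_demand = (purchases_b - purchases_a) / purchases_a

# Elasticity: percent change in demand per percent change in price.
elasticity = pct_change_demand / pct_change_price
print(elasticity)

# Revenue comparison shows whether the higher price pays off for these numbers.
revenue_a = price_a * purchases_a
revenue_b = price_b * purchases_b
print(revenue_a, revenue_b)
```

With these numbers a 10 percent price increase costs only 5 percent of purchases, so revenue rises; an elasticity below -1 would reverse that conclusion.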

“business-to-consumer (B2C) business models: businesses sell directly to consumers”
- “B2C scenarios are a natural fit for A/B testing because the number of customers, products, and transactions tends to be higher for B2C businesses than for other businesses, so we can get large sample sizes and higher statistical power”

###### Exploration/exploitation trade-off in A/B tests
“In this trade-off, two goals are in constant tension: 
- to explore (for example, to run A/B tests with possibly bad email designs to learn which is best)
    - Exploration can lead to missed opportunities if one of your challengers performs much worse than the champion; you would have been better off just sending out the champion to everyone.
- to exploit (for example, to send out only the champion email because it seems to perform the best)
    - Exploitation can lead to missed opportunities if your champion is not as good as another challenger that you haven’t tested yet because you’re too busy exploiting your champion to do the requisite exploration.”
    
Multi-armed bandit problem: a mathematical formalization of the exploration/exploitation dilemma
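
One simple strategy for the bandit problem is epsilon-greedy: spend a small, fixed share of emails on exploration and send the current best design otherwise. This is only one of several approaches, and the purchase probabilities below are hypothetical values chosen for the simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical true purchase probabilities for three email designs (unknown in practice).
true_rates = np.array([0.05, 0.10, 0.20])
epsilon = 0.1                      # share of emails spent on exploration
sends = np.zeros(3)                # emails sent per design
successes = np.zeros(3)            # purchases observed per design

for _ in range(20000):
    if rng.random() < epsilon or sends.min() == 0:
        arm = rng.integers(3)      # explore: pick a design at random
    else:
        arm = np.argmax(successes / sends)   # exploit: pick the current champion
    sends[arm] += 1
    successes[arm] += rng.random() < true_rates[arm]

print(sends)
print(successes / sends)
```

Raising `epsilon` buys more reliable estimates for every design at the cost of sending more emails that are known to underperform.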

## The Ethics of A/B Testing

- Informed consent
- Risk
    - Potential downsides to participation as a human subject
    - The probability of experiencing those downsides
- Potential benefits