### What is A/B testing?
- A/B testing is a tool that allows you to test two or more different ideas against each other in the real world, and to see which one empirically performs better.
- **A/B Testing:** Test different ideas against each other in the real world
- Choose the one that performs statistically better
### Why is A/B testing important?
- No guessing
- Provides accurate answers - quickly
- ...and establish causal relationships

### A/B test process
- Develop a hypothesis about your product or business
- **Randomly** assign users to two different groups
- Expose:
    - Group 1 to the current product rules
    - Group 2 to a product that tests the hypothesis
- Pick whichever performs better according to a set of KPIs (key performance indicators)
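
The random assignment step above can be sketched as follows. This is a minimal illustration, not a production assignment scheme: the `users` table, the `uid` column, and the group labels `"C"` (control) and `"T"` (treatment) are all hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical user table; "uid" is an illustrative column name.
users = pd.DataFrame({"uid": range(1000)})

# Seeded generator so the assignment is reproducible.
rng = np.random.default_rng(seed=42)

# Assign each user to control ("C") or treatment ("T") with equal probability.
users["group"] = rng.choice(["C", "T"], size=len(users))

# Check how many users landed in each group.
group_sizes = users["group"].value_counts()
print(group_sizes)
```

Because each user is assigned independently, the two groups will be close to, but not exactly, equal in size.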

### Where can A/B testing be used?
Users + ideas -> A/B test
- testing impact of drugs
- incentivizing spending
- driving user growth
- as well as many more use cases beyond these

### Key Performance Indicators (KPIs)
- **A/B Tests:** Measure impact of changes on KPIs
- **KPIs** - metrics important to an organization
    - likelihood of a side-effect
    - revenue
    - conversion rate
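
As a rough sketch of how KPIs such as conversion rate and revenue might be computed per test group, assuming a hypothetical `results` table with one row per user (all column names and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical per-user results: group label, whether the user converted, revenue.
results = pd.DataFrame({
    "group": ["C", "C", "C", "C", "T", "T", "T", "T"],
    "converted": [0, 1, 0, 0, 1, 1, 0, 0],
    "revenue": [0.0, 4.99, 0.0, 0.0, 4.99, 9.99, 0.0, 0.0],
})

# The mean of a 0/1 column is the conversion rate; the mean revenue is revenue per user.
kpis = results.groupby("group").agg(
    conversion_rate=("converted", "mean"),
    revenue_per_user=("revenue", "mean"),
)
print(kpis)
```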

### Pandas Merge Method
- Merging - equivalent of SQL `JOIN`
- In `pandas`:
    - `pd.merge(df1,df2)`
    - `df1.merge(df2)`
    
### Merging mechanics
`df_new = df1.merge(df2, how='inner', on=['id'])`
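
To make the effect of `how` concrete, here is a small sketch with two made-up tables (`customers` and `purchases` are hypothetical names). An inner merge keeps only ids present in both tables, while a left merge keeps every row of the left table:

```python
import pandas as pd

# Two hypothetical tables sharing an "id" column.
customers = pd.DataFrame({"id": [1, 2, 3], "country": ["USA", "BRA", "DEU"]})
purchases = pd.DataFrame({"id": [1, 1, 3, 4], "price": [499, 199, 299, 99]})

# Inner merge: only ids found in both tables (1 and 3).
inner = customers.merge(purchases, how="inner", on=["id"])

# Left merge: every customer is kept; customers without purchases get NaN prices.
left = customers.merge(purchases, how="left", on=["id"])

print(inner)
print(left)
```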

### Group: `pandas.DataFrame.groupby()`
DataFrame.groupby(by=$\color{red}{\text{None}}$, axis=$\color{red}{\text{0}}$, level=$\color{red}{\text{None}}$,
<br>&emsp;&emsp;&emsp;&emsp; as_index=$\color{red}{\text{True}}$, sort=$\color{red}{\text{True}}$,
<br>&emsp;&emsp;&emsp;&emsp; group_keys=$\color{red}{\text{True}}$, squeeze=$\color{red}{\text{False}}$, **kwargs)
            
### Aggregate: `pandas.DataFrame.agg()`
DataFrame.agg(func, axis=$\color{red}{\text{0}}$, *args, **kwargs)

### Grouping Data: .groupby()
- `by`: fields to group by
- `axis`: `axis=0` (the default) splits along rows, grouping them by the values in the chosen columns; `axis=1` splits along columns
- `as_index`: `as_index=True` will use group labels as index

df_groupby = df1.groupby(by=['$\color{red}{\text{country}}$' , '$\color{red}{\text{device}}$'],
axis=$\color{red}{\text{0}}$, as_index=$\color{red}{\text{False}}$)

Let us group by "country" and "device". The next relevant argument is `axis`, which specifies whether we split along rows or columns. The default value, `0`, groups rows according to the values in the chosen columns. The other argument of interest is `as_index`. By default this is `True`, which means that the grouped-by fields become the index of the result. We set it to `False` so that they remain ordinary columns. The call returns a `DataFrameGroupBy` object.
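
The difference that `as_index` makes can be seen in a small sketch. The data below is invented for illustration, and the example leaves `axis` at its default rather than passing it explicitly:

```python
import pandas as pd

# Hypothetical purchase data.
df1 = pd.DataFrame({
    "country": ["USA", "USA", "BRA", "BRA", "USA"],
    "device": ["iOS", "and", "iOS", "iOS", "iOS"],
    "price": [499, 199, 299, 99, 299],
})

# as_index=True: the group keys become a MultiIndex on the result.
by_index = df1.groupby(by=["country", "device"], as_index=True)["price"].mean()

# as_index=False: the group keys stay as ordinary columns.
flat = df1.groupby(by=["country", "device"], as_index=False)["price"].mean()

print(by_index)
print(flat)
```

The flat form is often more convenient when the result will be merged with other tables or plotted.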

#### Aggregating data
df_groupby.price.mean()

#### Aggregate data: .agg()
Pass the name of an aggregation function to `agg()`
<br> df_groupby.price.agg('$\color{red}{\text{mean}}$')
<br> df_groupby.price.agg(['$\color{red}{\text{mean}}$' , '$\color{red}{\text{median}}$'])

#### .agg(): multiple functions, multiple columns
Pass a dictionary of column names and aggregation functions
<br> df_groupby.agg({'price' : ['$\color{red}{\text{mean}}$' , '$\color{red}{\text{min}}$' , '$\color{red}{\text{max}}$'],
<br>&emsp;&emsp;&emsp;&emsp; 'age': ['$\color{red}{\text{mean}}$' , '$\color{red}{\text{min}}$' , '$\color{red}{\text{max}}$']})

#### .agg(): custom functions
$\color{green}{\text{def}}$ $\color{red}{\text{truncate_mean}}$(data):
<br>&emsp;&emsp; $\color{red}{\text{"""Compute the mean excluding outliers"""}}$
<br>&emsp;&emsp; top_val = data.quantile($\color{red}{\text{0.9}}$)
<br>&emsp;&emsp; bot_val = data.quantile($\color{red}{\text{0.1}}$)
<br>&emsp;&emsp; trunc_data = data[(data <= top_val) & (data >= bot_val)]
<br>&emsp;&emsp; mean = trunc_data.mean()
<br>&emsp;&emsp; $\color{green}{\text{return}}$(mean)
<br>
<br> df_groupby.agg({'$\color{red}{\text{age}}$' : [truncate_mean]})
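
A runnable sketch of mixing a built-in and a custom aggregation is shown here. The function mirrors the truncated-mean idea above under the name `truncated_mean`; the sample data is invented, with one deliberate outlier so the two aggregates differ:

```python
import pandas as pd

def truncated_mean(data):
    """Compute the mean of values between the 10th and 90th percentiles."""
    top_val = data.quantile(0.9)
    bot_val = data.quantile(0.1)
    trunc_data = data[(data <= top_val) & (data >= bot_val)]
    return trunc_data.mean()

# Hypothetical data: the iOS group contains an outlier age of 90.
df1 = pd.DataFrame({
    "device": ["iOS"] * 5 + ["and"] * 5,
    "age": [20, 21, 22, 23, 90, 30, 31, 32, 33, 34],
})

# Apply a built-in aggregation (by name) and the custom function to the same column.
result = df1.groupby("device").agg({"age": ["mean", truncated_mean]})
print(result)
```

The result has a two-level column index, with the custom function labelled by its Python name.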

#### Using the Timedelta class
- **Lapse Date**: Date the trial ends for a given user

#### Parsing dates - on import
pandas.read_csv(....,
<br>&emsp;&emsp; parse_dates=$\color{red}{\text{False}}$,
<br>&emsp;&emsp; infer_datetime_format=$\color{red}{\text{False}}$,
<br>&emsp;&emsp; keep_date_col=$\color{red}{\text{False}}$,
<br>&emsp;&emsp; date_parser=$\color{red}{\text{None}}$,
<br>&emsp;&emsp; dayfirst=$\color{red}{\text{False}}$, ....)

#### Example:
df = pd.read_csv('file.csv',
<br>&emsp;&emsp; parse_dates=$\color{red}{\text{True}}$,
<br>&emsp;&emsp; infer_datetime_format=$\color{red}{\text{True}}$)


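Two of the primary arguments, `parse_dates` and `infer_datetime_format`, are shown in use above. By setting these to `True`, `read_csv()` will attempt to convert string representations of dates to datetime values on import. Note that `parse_dates=True` applies to the index; to parse specific columns, pass a list of column names instead.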
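
A small self-contained sketch of date parsing on import follows. It reads from an in-memory string instead of a file, the column names are invented, and it passes a list of column names to `parse_dates` so that a specific column is converted:

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk.
csv_text = "uid,reg_date,price\n1,2017-06-29,499\n2,2017-07-14,199\n"

# Without parse_dates, the date column is read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# With a list of column names, those columns are converted to datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["reg_date"])

print(raw["reg_date"].dtype)
print(parsed["reg_date"].dtype)
```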

#### Parsing dates - manually
`pandas.to_datetime(arg, errors='raise', ..., format=None, ...)`

##### strftime 
Python strftime cheatsheet: 
https://strftime.org
- 1993-01-27 -- `"%Y-%m-%d"`
- 05/13/2017 05:45:37 -- `"%m/%d/%Y %H:%M:%S"`
- September 01, 2017 -- `"%B %d, %Y"`
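
The three format strings listed above can be tried directly with `pd.to_datetime`. This sketch also shows `errors='coerce'`, which returns `NaT` for unparseable input instead of raising; the month-name format assumes an English locale:

```python
import pandas as pd

# Each call pairs an example string with its matching strftime format.
d1 = pd.to_datetime("1993-01-27", format="%Y-%m-%d")
d2 = pd.to_datetime("05/13/2017 05:45:37", format="%m/%d/%Y %H:%M:%S")
d3 = pd.to_datetime("September 01, 2017", format="%B %d, %Y")

# With errors="coerce", a value that does not match becomes NaT rather than an error.
bad = pd.to_datetime("not a date", format="%Y-%m-%d", errors="coerce")

print(d1, d2, d3, bad)
```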

#### Pivot table
pandas.pivot_table($\color{red}{\text{data}}$, values=$\color{red}{\text{None}}$, index=$\color{red}{\text{None}}$, columns=$\color{red}{\text{None}}$, aggfunc='$\color{red}{\text{mean}}$', <br>&emsp;&emsp; fill_value=$\color{red}{\text{None}}$, margins=$\color{red}{\text{False}}$, dropna=$\color{red}{\text{True}}$, margins_name='$\color{red}{\text{All}}$', observed=$\color{red}{\text{False}}$, sort=$\color{red}{\text{True}}$)

In [1]:
import pandas as pd
import numpy as np
# 
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

df

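|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | foo | one | small | 1 | 2 |
| 1 | foo | one | large | 2 | 4 |
| 2 | foo | one | large | 2 | 5 |
| 3 | foo | two | small | 3 | 5 |
| 4 | foo | two | small | 3 | 6 |
| 5 | bar | one | large | 4 | 6 |
| 6 | bar | one | small | 5 | 8 |
| 7 | bar | two | small | 6 | 9 |
| 8 | bar | two | large | 7 | 9 |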


In [2]:
table = pd.pivot_table(df, values='D', index=['A', 'B'],
                    columns=['C'], aggfunc=np.sum, fill_value=0)
table

| A | B | C = large | C = small |
|---|---|---|---|
| bar | one | 4 | 5 |
| bar | two | 7 | 6 |
| foo | one | 4 | 1 |
| foo | two | 0 | 6 |


In [3]:
import numpy as np
import pandas as pd
from datetime import timedelta
# 
# Define the most recent date in our data
current_date = pd.to_datetime('2018-03-17')
# The last date on which a user could lapse and still be included
max_lapse_date = current_date - timedelta(days=14)
# 
sub_time = current_date - max_lapse_date
# convert the sub_time from a timedelta to an int
sub_time = sub_time.days
print(sub_time, type(sub_time))
# 
a = np.array([[0, 1, 2],
              [0, 2, 4],
              [0, 3, 6]])
np.where(a < 4, a, -1)  # -1 is broadcast

14 <class 'int'>


array([[ 0,  1,  2],
       [ 0,  2, -1],
       [ 0,  3, -1]])

#### Correcting for seasonality with trailing averages
- We can correct for seasonal movement by calculating a trailing average over our data. A trailing average is a smoothing technique that sets the value for a given day to the average over the past n days. To smooth weekly seasonality we want n = 7. This averages over a full week, so every day is pulled towards the weekly level, which limits day-level effects.
- First, we use the pandas `rolling()` method to define the rolling window. The primary parameters of `rolling()` are `window`, which is our n value from above, and `center`, which is a Boolean. If `center` is `True`, the averaged value is placed at the middle of the window; if not, it is placed at the index we are looking back from, which is what we want.
- Once we have this, we call `mean()` to find the average over the window. We can calculate this and store it in an additional column.

###### Example:
- series_new = df.column.rolling(window=$\color{red}{\text{7}}$, center=$\color{red}{\text{False}}$)
    - window: how many data points to average over
    - center: `False` places each average at the end of its window, so it only looks backwards
- df['$\color{red}{\text{rolling_column}}$'] = series_new.mean()
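
To illustrate why a 7-day window removes weekly seasonality, here is a sketch on synthetic data with a purely weekly pattern (the `purchases` column and its values are invented). Every 7-day window covers exactly one full cycle, so the trailing average is flat:

```python
import pandas as pd

# Synthetic daily data: five low days followed by two high days, repeated.
dates = pd.date_range("2018-01-01", periods=14, freq="D")
values = [10, 10, 10, 10, 10, 20, 20] * 2
df = pd.DataFrame({"date": dates, "purchases": values})

# Trailing 7-day window; center=False places each value at the end of its window.
rolling = df["purchases"].rolling(window=7, center=False)
df["rolling_purchases"] = rolling.mean()

print(df)
```

The first six rows are `NaN` because a full 7-day window is not yet available for them.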

#### Noisy data
- Daily data can be incredibly noisy, with values that vary widely from day to day. We can apply an exponential moving average to check whether any macro trends are hidden among this noise.
- This type of average weights the points such that the earlier ones are weighted less than the more recent ones within our window. This pulls our data back to any central trend, while maintaining any recent movements.
- We can use the `ewm()` method to find these weighted windows. To do this we specify the `span` argument to be our window size. Then we find the average of this weighted data. Here we apply this to our set of purchase data, using a 30-day window. Determining windows like this can require prior knowledge of the structure of the data or some trial-and-error. See that this does remove a lot of the noise and reveals a slight upward trend.
    - `.ewm()`: exponential weighting function
    - `span`: Window to apply weights over

###### Example:
exp_mean = df.column.ewm(span=$\color{red}{\text{30}}$)
df['$\color{red}{\text{exp_mean}}$'] = exp_mean.mean()
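
A sketch of exponential smoothing on synthetic noisy data is shown here. The series is invented: a slow upward trend plus random noise. With `span=30`, pandas uses a smoothing factor of `alpha = 2 / (span + 1)`:

```python
import numpy as np
import pandas as pd

# Synthetic data: a slow upward trend hidden under random day-to-day noise.
rng = np.random.default_rng(0)
trend = np.linspace(100, 110, 120)
noise = rng.normal(0, 10, size=120)
df = pd.DataFrame({"purchases": trend + noise})

# Exponentially weighted mean: recent points receive more weight than older ones.
df["exp_mean"] = df["purchases"].ewm(span=30).mean()

# Compare the spread of the raw and smoothed series.
print(df["purchases"].std(), df["exp_mean"].std())
```

The smoothed column varies far less than the raw one, which makes the underlying trend easier to see.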


#### Summary - Data Smoothing Techniques
- **Trailing Average**:
    - Smooths seasonality by averaging over the periodicity
- **Exponential Moving Average**:
    - Reveals trend by pulling towards the central tendency
    - Weights the most recent values relative to the window more heavily
- You can use `.rolling()` and `.ewm()` for many more methods of smoothing

#### Plotting annotations - events
- `plt.axvline()`: Plots vertical line at x-intercept
    - `color`: Specify the color of the plotted line
    - `linestyle`: The type of line to plot
    
##### Example:
- df.plot(x=['$\color{red}{\text{column1}}$'], y=['$\color{red}{\text{label_1}}$', '$\color{red}{\text{label_2}}$'])
- events.Date = pd.to_datetime(events.Date)
- $\color{blue}{\text{for}}$ row $\color{blue}{\text{in}}$ events.iterrows():
- &emsp;&emsp; tmp = row[$\color{red}{\text{1}}$]
- &emsp;&emsp; plt.axvline(
- &emsp;&emsp; x=tmp.Date, color=$\color{red}{\text{'k'}}$, linestyle=$\color{red}{\text{'--'}}$)
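
The annotation loop can be written as a runnable sketch along these lines. The data and event names are invented, the non-interactive `Agg` backend is selected so no window is needed, and the series is drawn with matplotlib's `ax.plot` rather than `df.plot` to keep the date handling simple:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily revenue and a small table of dated events.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "revenue": range(10),
})
events = pd.DataFrame({
    "Date": ["2018-01-04", "2018-01-08"],
    "Event": ["price change", "holiday"],
})
events["Date"] = pd.to_datetime(events["Date"])

fig, ax = plt.subplots()
ax.plot(df["date"], df["revenue"])

# Draw one dashed black vertical line per event date.
for _, row in events.iterrows():
    ax.axvline(x=row["Date"], color="k", linestyle="--")

n_lines = len(ax.lines)
```

Unpacking `iterrows()` into `_, row` avoids the `row[1]` indexing used in the listing above.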

### Introduction to A/B testing
- An A/B test is an experiment in which you test two different values of the same variable against one another to determine which one is “better” by evaluating their performance among two randomly assigned groups of real users.
- A/B testing is a powerful and flexible tool. You could apply it to improve sales or interactions within an app or website, or in a variety of other contexts, from medicine (where such tests are called randomized controlled trials) to optimizing efficiency on an assembly line.
- With that said it is important to be aware of the cases that are and are not conducive to improvement through A/B testing. While no general rule exists, there are some things worth considering. Good problems for A/B testing are those where users are being impacted in an individual way. Questions like which color will make the user more likely to press a button or which price is better are ripe for A/B testing.
- Bad problems for A/B testing are those with network effects among users; that is, the change you want to test affects, or is directly related to, an area where users interact with one another. An example of this is the matching algorithm for a dating app. It is hard to untangle the impact of the change if users are matching with people outside of the test group. There are many nuances to this, but it is worth thinking through how directly you can connect the change made in the A/B test to the change in behaviors. If you can't do this well, perhaps consider other methods.

#### Considerations in test design
- There are two primary concerns in test design: ensuring that our test can be practically run, and ensuring that we can derive meaningful results from it. These two objectives are strongly connected.

#### Test sensitivity
- A good starting point is to ask what percentage change would be meaningful to detect in your response variable: 1%? 20%? It makes sense that smaller changes are more difficult to detect, as they can more easily be overshadowed by randomness. The minimum level of change we want to detect is called __sensitivity__.

#### Data variability
- While understanding the desired change in the data due to the treatment is important, it is also important to understand the latent variability in the data. In this case, it makes sense to understand if the purchase amount is consistent across all users, or if it varies widely. A change due to the treatment will be more easily captured in the former case.

#### Standard deviation
- We can find the standard deviation of our data by calling the pandas `std()` method on a Series of our values. Typically, we will rely on the standard deviation of the test results in evaluating our test, but using the value from our initial data is important for planning, as we will see.
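
As a brief sketch of how `std()` distinguishes consistent from widely varying data, consider two invented Series with the same mean. Note that pandas computes the sample standard deviation (`ddof=1`) by default:

```python
import pandas as pd

# Two hypothetical sets of purchase amounts with the same mean (200).
consistent = pd.Series([195, 200, 205, 200, 200])
varied = pd.Series([50, 350, 200, 100, 300])

# Sample standard deviation (pandas uses ddof=1 by default).
consistent_std = consistent.std()
varied_std = varied.std()

print(consistent_std, varied_std)
```

A treatment effect of a given size is easier to detect against the first Series than the second.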

#### Null hypothesis
- This is the hypothesis that our control and treatment, that is, our two phrases, have the same impact on the response, and that any observed difference is just due to randomness. If we can conclude this is not the case, then we say our results are statistically significant and that there is a difference.
    - A Type I error (or Type 1), is the incorrect rejection of a true null hypothesis. ($\alpha$)
        - Reject the null hypothesis when there is in fact no significant effect (false positive). The `p-value` is optimistically small.
    - A Type II error (sometimes called a Type 2 error) is the failure to reject a false null hypothesis. ($\beta$)
        - Not reject the null hypothesis when there is a significant effect (false negative). The `p-value` is pessimistically large. 
    - In interpreting the p-value of a significance test, you must specify a significance level, often referred to as the Greek lower case letter alpha ($\alpha$). A common value for the significance level is 5% written as 0.05.
    - The `p-value` is interpreted in the context of the chosen significance level. A result of a significance test is claimed to be “statistically significant” if the `p-value` is less than the significance level. This means that the null hypothesis (that there is no effect) is rejected.
    - The `p-value` is the probability of observing a value as or more extreme than the one observed under the Null Hypothesis.
        - p <= alpha: reject $H_0$, different distribution.
        - p > alpha: fail to reject $H_0$, same distribution.
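
The decision rule above can be sketched with a two-sample t-test. This is one possible test among several; the example uses `scipy.stats.ttest_ind` on simulated data, and the group means, spread, and sizes are assumptions made for illustration:

```python
import numpy as np
from scipy import stats

# Simulated response values: the treatment mean is shifted up by one unit.
rng = np.random.default_rng(1)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=11.0, scale=2.0, size=500)

alpha = 0.05  # chosen significance level

# Two-sided, two-sample t-test of the null hypothesis of equal means.
t_stat, p_value = stats.ttest_ind(control, treatment)

# Apply the decision rule: reject the null hypothesis when p <= alpha.
reject_null = bool(p_value <= alpha)
print(p_value, reject_null)
```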

#### Types of error & confidence level
- Rejecting the null hypothesis when it is true is called a type I error, and retaining a false null hypothesis is a type II error. We define our probability of not making a type I error as the Confidence Level. We will not go into great detail, but intuitively it should make sense that the higher we make this value, the larger the sample we will need. A common value is 0.95.

#### Statistical power
- Related to this is the idea of Statistical Power. Power is the probability of finding statistically significant results when the Null hypothesis is false.

####  Connecting the Different Components
- Power and Confidence level are connected to the standard error and sensitivity of our test. To estimate our needed sample size, we can choose our desired sensitivity, set our desired confidence level & power, and then estimate our standard error using these values.

#### Power formula
- The key takeaway to note is that the relation between Power and n, our sample size, is that as n goes up so too does our power. Additionally, as our confidence level goes up our power goes down.

- What is Power?
    - Power is the probability that the test detects an effect when one actually exists, rather than attributing it to chance.
    - Likelihood that the test correctly rejects the null hypothesis (i.e. "proves" your hypothesis)
    - A high power means that the test is unlikely to miss a real effect. As the power increases, the probability of making a Type II error decreases (power = 1 - $\beta$).
    - A low power means that a non-significant result is inconclusive, since a real effect could easily have been missed.
    - Helps determine if sample size is large enough.
    - The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance. It’s the likelihood that the test is correctly rejecting the null hypothesis (i.e. “proving” your hypothesis). For example, a study with 80% power has an 80% chance of producing significant results when the effect is real.
    - It is possible to perform a hypothesis test without calculating the statistical power. If your sample size is too small, your results may be inconclusive when they may have been conclusive if you had a large enough sample.
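
The two relationships described above (power rises with sample size and falls as the confidence level rises) can be checked numerically. This sketch uses `TTestIndPower` from statsmodels; the effect size of 0.2 and the sample sizes are arbitrary choices for illustration:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.2  # assumed standardized effect size (Cohen's d)

# Power for increasing per-group sample sizes at a 95% confidence level.
power_by_n = {
    n: analysis.power(effect_size=effect_size, nobs1=n, alpha=0.05)
    for n in (100, 400, 1600)
}

# Power for increasing confidence levels at a fixed sample size (alpha = 1 - confidence).
power_by_conf = {
    cl: analysis.power(effect_size=effect_size, nobs1=400, alpha=1 - cl)
    for cl in (0.90, 0.95, 0.99)
}

print(power_by_n)
print(power_by_conf)
```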
       
#### Power Analysis
- **Effect Size**: The quantified magnitude of a result present in the population. Effect size is calculated using a specific statistical measure, such as Pearson’s correlation coefficient for the relationship between variables or `Cohen’s d` for the difference between groups.
- **Sample Size**: The number of observations in the sample.
- **Significance**: The significance level used in the statistical test, e.g. alpha. Often set to 5% or 0.05.
- **Statistical Power**: The probability of accepting the alternative hypothesis if it is true.

All four variables are related. For example, a larger sample size can make an effect easier to detect, and the statistical power can be increased in a test by increasing the significance level. A power analysis involves estimating one of these four parameters given values for three other parameters. This is a powerful tool in both the design and in the analysis of experiments that we wish to interpret using statistical hypothesis tests.
For example, the statistical power can be estimated given an effect size, sample size and significance level. Alternately, the sample size can be estimated given different desired levels of significance.

We can start with sensible defaults for some parameters, such as a significance level of 0.05 and a power level of 0.80. We can then estimate a desirable minimum effect size, specific to the experiment being performed. A power analysis can then be used to estimate the minimum sample size required.

In addition, multiple power analyses can be performed to provide a curve of one parameter against another, such as the change in the size of an effect in an experiment given changes to the sample size. More elaborate plots can be created varying three of the parameters. This is a useful tool for experimental design.
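
As a sketch of running several power analyses side by side, the following solves for the per-group sample size at a few effect sizes, using the defaults mentioned above (significance level 0.05, power 0.80). The three effect sizes are the conventional small, medium, and large values of Cohen's d, chosen here for illustration:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required per-group sample size for small, medium, and large effects.
required = {}
for effect in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80, ratio=1.0)
    required[effect] = float(n)

for effect, n in required.items():
    print("effect size %.1f -> about %.0f observations per group" % (effect, n))
```

Smaller effects require substantially more observations to detect with the same power.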

##### Student’s t Test Power Analysis:
- **Significance level (alpha)**: 5% or 0.05.
- **Effect Size:** Cohen’s d of at least 0.80.
- **Statistical Power:** 95% or 0.95 (the value passed as `power` in the code that follows; 0.80 is a common default).

In [6]:
# estimate sample size via power analysis
from statsmodels.stats.power import TTestIndPower
# parameters for power analysis
effect = 0.80
alpha = 0.05
power = 0.95
# perform power analysis
analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)
print('Sample Size: %.3f' % result)

Sample Size: 41.594


In [1]:
# 
from statsmodels.stats.power import tt_ind_solve_power
# 
effect_size=.8
sample_size=30
alpha=.05
ratio=1.0
statistical_power = tt_ind_solve_power(effect_size=effect_size, nobs1=sample_size, 
                                       alpha=alpha, 
                                       ratio=1.0, alternative='two-sided')
print("The statistical power is: {0}".format(statistical_power))
print('With 30 observations per group, the test has about an 86% chance of detecting an effect of this size.')

The statistical power is: 0.861422509233477
With 30 observations per group, the test has about an 86% chance of detecting an effect of this size.
