#### Experiment Design
This notebook summarizes the protocol for A/B testing experiment design, based on course materials from the [Udacity A/B Testing course](https://learn.udacity.com/courses/ud257) and [an experiment design project](https://olgabelitskaya.github.io/P7_Design_an_A_B_Test_Overview.html).
##### Overview
1. Two situations in which A/B testing doesn't work well:
    - a. when it takes time to get results
    - b. A/B testing can't tell you if you are missing something
2. Components of experiment Design:
    - design metrics 
        - invariant metrics: metrics that should not change because of the experiment and can be used for a sanity check (randomization check)
        - evaluation metrics: metrics used to evaluate the effect of the experiment
    - design experiment
        - choose subjects: determine unit of diversion
        - choose population: determine what is the population of interest
        - sample size calculation:
            - a. sample size calculation by margin of error: $$n=\left(\frac{Z_{1-\alpha/2}\sigma}{E}\right)^2$$ where $Z_{1-\alpha/2}$ is the Z-score for significance level $\alpha$, $\sigma$ is the standard deviation, and $E$ is the margin of error, e.g. if we want the estimate to be within 0.1 of the true value, then $E = 0.1$
            - b. sample size calculation by power: $$n=\left(\frac{Z_{1-\alpha/2} + Z_{\beta}}{ES}\right)^2$$ where $Z_{\beta}$ is the Z-score corresponding to the desired power (e.g. $Z_{\beta}\approx 0.84$ for 80% power) and $ES$ is the effect size, usually computed as $ES = \frac{|\mu_1-\mu_2|}{\sigma}$. This form applies to a one-sample test; when comparing two groups with equal variance, the per-group size is twice this value. One specific example, the per-group sample size for comparing proportions between two groups, is: $$n = \frac{\left(Z_{1-\alpha/2}\sqrt{2\frac{p_1+p_2}{2}\left(1-\frac{p_1+p_2}{2}\right)}+Z_{\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)}\right)^2}{(p_1-p_2)^2}$$
        - decide duration
            - Duration of the experiment
                - long enough to reach the required sample size
                - not too long, because the decision can't wait indefinitely and running the experiment has a cost
                - if weekday behavior differs from weekend behavior, run for at least 7 days
                - if the cycle is 7 days, a 28-day duration (four whole weeks) is better than one calendar month
            - When to launch the experiment 
                - semester vs vacation/holiday
                - weekday vs weekend
            - What fraction of the traffic should be exposed to the experiment
                - if the change is hard to notice, e.g. the color of a button, it can be exposed to all the traffic
                - if the change might affect user experience, e.g. the login method, it is better to expose only a very small share of the traffic
                - whether other experiments are running at the same time
                - whether you are interested in a specific population, e.g. college student users rather than all users
    - Analysis of experiment:
        - sanity check: use invariant metric to check whether the control group and experiment group are equivalent in sample size
            - hypothesis test: $H_0: p = 0.5$, with $SE = \sqrt{\frac{0.5 \times 0.5}{n_c+n_e}}$
            - if the check fails:
                - a. check for technical errors, e.g. infrastructure, experiment setup
                - b. retrospective analysis: use the collected data to recreate the experiment diversion and understand what is causing the imbalance
                - c. compare a pre-period with the experiment period. If the mismatch appears in both, the cause is probably infrastructure; if it appears only during the experiment, the cause is probably an experiment setup error, e.g. a filter to English language only
        - single metric used:
            - check both statistical significance (p-value) and practical significance (magnitude and direction)
            - watch for Simpson's Paradox: e.g. within each department the admission rates of male and female applicants are comparable, and in some departments the female rate is higher; but because more female applicants apply to departments with low admission rates, the overall female admission rate is lower than the overall male admission rate
            - use both a parametric hypothesis test and a sign test
            - after getting the test results, slice the metrics to check for Simpson's Paradox
        - multiple metrics used:
            - more likely to see random significant results (need to control false positive/significance)
            - Assuming $m$ tests are independent: $$\alpha_{overall} = 1-(1-\alpha_{per test})^m$$ this gives: $$\alpha_{overall}\leq m\times \alpha_{per test}$$
            - Bonferroni correction (conservative): $$\alpha_{pertest}=\frac{\alpha_{overall}}{m}$$
            - Sidak correction: $$\alpha_{pertest}=1-(1-\alpha_{overall})^{1/m}$$
        - draw conclusions:
            - How do you understand the change:
                - consider results for both parametric tests and sign tests
                - consider both statistical significance and practical significance
                - consider results of both single tests and multiple tests
                - consider metric values in slicing
            - Whether it is worth launching:
                - cost of the launch
                - potential revenue increase from the launch
                - proportion of users who will benefit from the launch
                - whether the launch would benefit a group of users/features but harm another group of users/features
            - Sometimes multiple experiments are needed before making a change
3. A protocol of experiment design:
    - step 1: define problem of interest, population, unit of diversion, potential confounders
    - step 2: design metrics, decide which are invariant metrics and which are evaluation metrics. **Clearly list the expected magnitude and direction for each evaluation metric.**
    - step 3: in pre-stage or retrospective study, collect a sample to compute the variability of the **evaluation metrics** for the use of sample size calculation
        - analytical variability: $SE = \sigma/\sqrt{n}$ for a mean or $SE = \sqrt{p(1-p)/n}$ for a proportion
            - **Use $SE\propto \frac{1}{\sqrt{n}}$ to rescale this variability between samples of different sizes**
        - empirical variability: empirical SE or bootstrap
    - step 4: calculate sample size for **evaluation metrics**. If there are multiple evaluation metrics, compute sample size for each of them
    - step 5: measure total traffic, decide what proportion of it should be exposed to the experiment, and compute the duration of the exposure.
        - compute the duration for the largest required sample size
        - if the duration needed to reach the required sample size is still too long even with the maximum possible traffic, drop or modify that evaluation metric.
    - step 6: decide when to conduct experiment and collect experiment results
    - step 7: sanity check
    - step 8: hypothesis tests (single and multiple; parametric and sign tests)
    - step 9: slicing metrics and draw conclusions
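
Steps 4 and 5 of the protocol can be sketched in a few lines. This is a minimal illustration, assuming a two-sided test on a proportion metric and using the two-proportion sample size formula from the overview; the function names, the baseline rates (10% vs. 12%), the daily traffic, and the exposed fraction are all hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for detecting a difference between p1 and p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z-score for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


def duration_days(n_per_group, daily_traffic, fraction_exposed, n_groups=2):
    """Days needed to collect n_per_group units in every group."""
    daily_in_experiment = daily_traffic * fraction_exposed
    return math.ceil(n_groups * n_per_group / daily_in_experiment)


# Hypothetical inputs: baseline 10%, hoped-for 12%, 2000 units/day, half exposed.
n = sample_size_two_proportions(p1=0.10, p2=0.12)
days = duration_days(n, daily_traffic=2000, fraction_exposed=0.5)
print(f"per-group sample size: {n}, duration: {days} days")
```

With several evaluation metrics, the same calculation would be repeated per metric and the duration taken from the largest sample size.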


##### Details on Metrics Design
1. Basic Rules:
    - Be practical. Make sure you can actually collect this metric
    - Account for repeated events. For example, if click-through rate is counted per event, one user_id can click multiple times and be counted each time. Counting unique units instead (e.g. unique cookies or unique user_ids that clicked at least once) avoids this double counting
    - Avoid metrics that need a long time to collect, e.g. customer reviews after change of design of a shopping website
2. Methods:
    - step 1: Be clear with your objective, e.g. better user experience, more revenue.
    - step 2: Understand your business model and construct a **customer funnel**. For example, for a social media app that aims to increase users' engagement in the community, the customer funnel is: visitor --> lurker --> voter --> content creator --> moderator --> group creator
    - step 3: Use metrics to detail the customer funnel. For example, for the social media customer funnel, the metrics can be: # clicks on shared links/# downloads of the app --> # signed up users --> # users that at least liked/shared/saved a post --> # users that created at least one comment/post --> # users that joined at least one group --> # users that created at least one group
    - step 3.5: potential measurement units:
        - pageviews
        - clicks
        - user_id/email/phone number
        - cookie
        - device
        - event
    - step 4: After creating measurable metrics according to the funnel, we can divide the metrics between two stages to create new metrics that measure the conversion rate
        - click-through rate: $\frac{\#~clicks}{\#~pagevisits}$
        - click-through probability: $\frac{\#~\text{unique users who click}}{\#~\text{unique users who visited the page}}$
    - step 5: Further detail above created metrics.
        - Check whether the measurement is practical.
        - Decide the time interval to collect the metrics, e.g. considering seasonality, collect user_time_spent weekly, forming weekly_user_total/avg_time_spent
        - Slicing. e.g. collect weekday_user_avg_time_spent vs. weekend_user_avg_time_spent, US_user_avg_time_spent vs CN_user_avg_time_spent, student_avg_time_spent vs. employee_avg_time_spent
    - step 6: Interpret the metrics and check for validity:
        - Check whether the magnitude makes sense. For example, a 10% click-through rate is probably wrong
        - Check the distribution of metrics by different slicing (time/region/group/platform/android vs. ios) and understand the difference. For example, a difference in loading time between PC and mobile platforms is normal and due to the difference in platform infrastructure.
3. Other techniques:
    - use external data, e.g. open-source customer/market information
    - use own data, e.g. retrospective/observational data, survey/user experience data, collect new data
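
The difference between click-through rate and click-through probability in step 4 is easiest to see on a small event log. The sketch below assumes a hypothetical log of `(cookie, event_type)` pairs, with `event_type` being either `"pageview"` or `"click"`; the log format and the function name are illustrative assumptions, not part of the course material.

```python
def click_metrics(events):
    """Return (click-through rate, click-through probability) for an event log.

    events: iterable of (cookie, event_type) pairs, where event_type is
    "pageview" or "click" (a hypothetical log format).
    """
    events = list(events)
    n_pageviews = sum(kind == "pageview" for _, kind in events)
    n_clicks = sum(kind == "click" for _, kind in events)
    visitors = {cookie for cookie, kind in events if kind == "pageview"}
    clickers = {cookie for cookie, kind in events if kind == "click"}

    ctr = n_clicks / n_pageviews                    # counts every click
    ctp = len(clickers & visitors) / len(visitors)  # counts each cookie once
    return ctr, ctp


# Cookie "a" clicks twice, "b" never clicks, "c" clicks once.
log = [
    ("a", "pageview"), ("a", "click"), ("a", "click"),
    ("b", "pageview"), ("b", "pageview"),
    ("c", "pageview"), ("c", "click"),
]
ctr, ctp = click_metrics(log)
print(f"click-through rate: {ctr:.2f}, click-through probability: {ctp:.2f}")
```

Repeated clicks from the same cookie raise the rate but not the probability, which is why the two metrics can diverge.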

##### Details on experiment Design
1. Unit of Diversion

| unit of diversion | pros | cons |
| ---- | ---- | ---- |
| user_id | stable | one person can have multiple ids |
| cookie | unique per browser; can avoid multiple ids to some extent | changes when you change browser; users can clear cookies |
| event | use only for non-user-visible changes | no consistent experience |
| device_id | unchangeable by user | less common; only available on mobile; tied to a specific device, e.g. Apple Watch |
| IP address | stable and unique most of the time | less common; changes when location changes |

2. Decide Target Population. For example:
    - users on certain browser/platform/system(android vs. ios)
    - users in certain geo-region
    - users in certain language
    - users in a certain age bucket
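
One common way to keep the assignment stable for a given unit of diversion is to hash the unit's id. The sketch below is an assumption about how diversion could be implemented, not a method described in the course notes; `assign_group`, the experiment name, and the id format are hypothetical.

```python
import hashlib


def assign_group(unit_id, experiment_name, fraction_exposed=1.0):
    """Deterministically map a unit of diversion to a group.

    Returns "control", "experiment", or None when the unit is not exposed.
    The same (unit_id, experiment_name) pair always gets the same answer.
    """
    key = f"{experiment_name}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    if bucket >= fraction_exposed:
        return None
    return "control" if bucket < fraction_exposed / 2 else "experiment"


# Hypothetical ids; with a user_id the experience is stable across devices,
# with a cookie it is stable only within one browser.
groups = [assign_group(f"user-{i}", "new-login-flow") for i in range(10000)]
print(groups.count("control"), groups.count("experiment"))
```

Including the experiment name in the hash key keeps assignments in different experiments from lining up with each other.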

##### Details on experiment Analysis
1. Sanity check
    - purpose: use invariant metrics to check whether units were split evenly between the control group and the experiment group.
    - hypothesis: $H_0: p_{select~in~control} = 0.5$
    - Required information from the experiment: total sample size $N$, control group sample size $n_c$
    - How to test:
        - Step 1: under the null hypothesis, $SE = \sqrt{\frac{0.5 \times 0.5}{N}}$
        - Step 2: build a 95% interval of plausible values for the probability of being assigned to the control group: $[0.5 - 1.96 \times SE, 0.5 + 1.96 \times SE]$
        - Step 3: compute the observed select-in-control probability: $\hat p_{select~in~control} = \frac{n_c}{N}$. If $\hat p_{select~in~control}\in [0.5 - 1.96 \times SE, 0.5 + 1.96 \times SE]$, the sanity check passes.
        - Step 3.5: if there are multiple invariant metrics, conduct sanity check for each invariant metric.
2. Parametric hypothesis test
    - hypothesis: $H_0: p_c = p_e$ or $H_0: \mu_c = \mu_e$ where $p_c$ and $p_e$ are probability of binary evaluation metrics in control group and experiment group, respectively. $\mu_c$ and $\mu_e$ are mean of continuous evaluation metrics in control group and experiment group, respectively.
    - standard error: 
        - For continuous evaluation metrics: when the within-group standard deviations are known: $$SE = \sqrt{\frac{\sigma_c^2}{n_c} + \frac{\sigma_e^2}{n_e}}$$ when the within-group standard deviations are unknown: $$SE = \sigma_{pool}\sqrt{\frac{1}{n_c} + \frac{1}{n_e}}$$ where $\sigma^2_{pool} = \frac{\sum_{i=1}^{n_c}(v_{ci}-\mu_c)^2 + \sum_{i=1}^{n_e}(v_{ei} - \mu_e)^2}{N-2}$ is the pooled variance, so $\sigma_{pool}$ is the pooled standard deviation.
        - For binary evaluation metrics: $$SE = \sqrt{p_{pool}(1-p_{pool})\left(\frac{1}{n_c}+\frac{1}{n_e}\right)}$$ where $p_{pool} = \frac{p_cn_c + p_en_e}{N}$ is the probability of event in the pooled sample
    - construct 95% confidence intervals: $[\mu_e-\mu_c - 1.96 SE, \mu_e-\mu_c + 1.96 SE]$ or $[p_e-p_c - 1.96 SE, p_e-p_c + 1.96 SE]$ 
    - significance check:
        - statistical significance: check whether the confidence interval contains 0
        - practical significance: compare the confidence interval with the minimum accepted change in magnitude
3. Sign test
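    - hypothesis: $H_0: p_+ = 0.5$ where $p_+ = \frac{\sum_{i=1}^{\min(n_c, n_e)}I_i(v_{ei} > v_{ci})}{\min(n_c, n_e)}$ is the proportion of paired control-experiment results in which the experiment result is higher.
    - p-value: $$p\text{-value} = P\left(X \le \sum_{i=1}^{\min(n_c, n_e)}I_i(v_{ei} > v_{ci})\right), \quad X \sim \text{Binomial}(\min(n_c, n_e), 0.5)$$ i.e. the binomial CDF (`pbinom` in R). This is the lower-tail probability; for a two-sided test, double the smaller of the two tail probabilities.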
    - significance check: compare p-value with 0.05 or $\alpha_{pertest}$ if Bonferroni correction is used.
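
The three analysis steps above, plus the per-test significance level from the multiple-metrics discussion, can be sketched with the standard library only. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the sign test is two-sided and drops tied pairs (a common convention that the notes do not specify), and all counts in the usage lines are made up.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def sanity_check(n_control, n_experiment, z=Z95):
    """Check whether the control share is consistent with a 50/50 split."""
    n_total = n_control + n_experiment
    se = math.sqrt(0.5 * 0.5 / n_total)
    lower, upper = 0.5 - z * se, 0.5 + z * se
    observed = n_control / n_total
    return lower <= observed <= upper, (lower, upper), observed


def two_proportion_ci(x_control, n_control, x_experiment, n_experiment, z=Z95):
    """Difference p_e - p_c and its confidence interval, using the pooled SE."""
    p_c, p_e = x_control / n_control, x_experiment / n_experiment
    p_pool = (x_control + x_experiment) / (n_control + n_experiment)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_experiment))
    diff = p_e - p_c
    return diff, (diff - z * se, diff + z * se)


def sign_test_p_value(control_values, experiment_values):
    """Two-sided sign test on paired results (assumption: ties are dropped)."""
    pairs = [(c, e) for c, e in zip(control_values, experiment_values) if c != e]
    n = len(pairs)
    k = sum(e > c for c, e in pairs)  # pairs where the experiment is higher
    tail = min(k, n - k)
    p_tail = sum(math.comb(n, i) for i in range(tail + 1)) * 0.5 ** n
    return min(1.0, 2 * p_tail)


def per_test_alpha(alpha_overall, m, method="bonferroni"):
    """Per-test significance level for m tests."""
    if method == "bonferroni":
        return alpha_overall / m
    if method == "sidak":
        return 1 - (1 - alpha_overall) ** (1 / m)
    raise ValueError(f"unknown method: {method}")


# Made-up counts for illustration only.
passed, interval, observed = sanity_check(n_control=5000, n_experiment=5100)
diff, ci = two_proportion_ci(500, 5000, 600, 5000)
p_sign = sign_test_p_value([1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8])
alpha = per_test_alpha(0.05, m=3)
print(passed, round(diff, 4), [round(x, 4) for x in ci], p_sign, round(alpha, 4))
```

For practical significance, the same interval `ci` would be compared against the minimum accepted change instead of 0.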