# P7 - A/B Testing - Experiment Design

## Metric Choice

##### 1. Number of cookies ( Invariant )
Cookies are small pieces of information that a browser stores for each site. A cookie is assigned when a user visits a site (URL) for the first time and persists within that browser until it expires or is cleared. So the number of unique cookies is mostly driven by the number of users and the number of different browsers people use to visit the site. Showing the question in our experiment should have no effect on the number of users or their browsing behaviour. Hence, we expect the number of cookies to be invariant in our experiment.

##### 2. Number of user-ids
This experiment is likely to result in fewer users enrolling in the free trial, so the number of user-ids can't be used as an invariant metric for validation.

It is usable but a poor choice of evaluation metric. Suppose that in our experiment 100 unique cookies are assigned to the control group and 120 to the experiment group. At the end of the experiment we find that 50 users have completed enrollment in the control group and 60 in the experiment group. Even though we have more user-ids in the experiment group, we can't say that the treatment is effective, because the difference in `user-ids` could simply be caused by the difference in sample size rather than by the effectiveness of the treatment.

A strictly better way is to normalize the number of user-ids by dividing it by the number of unique clicks. This is in fact the gross conversion discussed below.
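The example above can be checked with a minimal Python sketch. The function name `normalized_rate` is hypothetical; the counts are the ones from the example, and the denominator used here is simply the size of each group.

```python
def normalized_rate(user_ids: int, group_size: int) -> float:
    """Return the number of enrolled user-ids per unit of group size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return user_ids / group_size


# Counts taken from the example in the text.
control_rate = normalized_rate(50, 100)
experiment_rate = normalized_rate(60, 120)

# The raw counts differ, but the normalized rates do not.
raw_difference = 60 - 50
rate_difference = experiment_rate - control_rate
print(raw_difference, rate_difference)
```

The raw difference of 10 user-ids disappears once each count is divided by its group size, which is why a normalized metric is preferable.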

##### 3. Number of clicks ( Invariant )
The question is shown after users have clicked the "Start free trial" button. Hence, we don't expect any difference in the number of clicks between the two groups, and it can be used as an invariant metric for validation.

##### 4. Click-through-probability ( Invariant )
The `Click-through-probability` is the ratio of the `number of clicks` to the `number of cookies`. As both `number of cookies` and `number of clicks` are expected to be invariant, we expect `click-through-probability` to be invariant too.

##### 5. Gross conversion (Evaluation)
If the experiment is effective, we expect the `gross conversion` to decrease. That is, we expect a smaller proportion of students to complete the enrollment process. To launch the experiment, we require the gross conversion to be both statistically and practically significantly lower in the experiment group.

##### 6. Retention (Evaluation*)
If the experiment is effective, we would expect the experiment group to have a higher retention than the control group. In order for us to launch the experiment, we require the `retention` in our experiment group to be both statistically and practically significantly greater than the one in our control group.
(As it turns out, this metric would require too long an experiment for the given alpha and beta, and hence it is not used in the final evaluation.)

##### 7. Net conversion (Evaluation)
_Net conversion_ is the ratio of the _number of unique user ids_ that continue past the free trial to the _number of clicks_.

$$ \textrm{Net conversion} = \frac{\textrm{Number of unique user ids that continue past the free trial}}{\textrm{Number of clicks}}  $$

Let's define the difference in _net conversion_ between the control and experiment groups to be

$$ \delta_{\textrm{ Net conversion}} = \textrm{Net conversion}_{experiment} - \textrm{Net conversion}_{control} $$

Then, in order to launch, we require that $ \delta_{\textrm{ Net conversion}} $ is not smaller than 0 by a statistically and practically significant amount, because
1. the experiment hypothesis states that "...the experiment would not significantly reduce the number of students to continue past the free trial", and;
2. as discussed earlier in point 3, we expect the number of clicks to be invariant, so a drop in net conversion would mean fewer students continuing past the free trial.



### Measuring Standard Deviation

| Metrics | Standard Deviation | Comparable to empirical? |
| --- | --- | --- |
| Gross conversion | 0.0202 | Yes |
| Retention | 0.0549 | No |
| Net conversion | 0.0156 | Yes |

The analytical estimate tends to be comparable to the empirical estimate when the unit of diversion and
the unit of analysis are the same.

Hence, I expect the analytical estimates for gross conversion and net conversion to be comparable to the empirical estimates. In contrast, the empirical and analytical estimates for retention are not likely to be comparable, because the unit of analysis (the number of unique user ids) is different from the unit of diversion.
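For reference, the analytical standard deviation of a probability metric is usually computed from the binomial formula $\sqrt{p(1-p)/n}$. The sketch below illustrates that formula. The baseline probability and the number of clicks in the sample are assumptions made for illustration and are not stated in this document; with them the formula happens to give a value close to the 0.0202 reported for gross conversion.

```python
import math


def analytical_sd(p: float, n: float) -> float:
    """Binomial standard deviation of a proportion p estimated from n units."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# ASSUMPTIONS (not given in this document): baseline gross conversion
# and the number of clicks expected in the sample.
ASSUMED_BASELINE_GROSS_CONVERSION = 0.20625
ASSUMED_CLICKS_IN_SAMPLE = 400

sd_gross = analytical_sd(ASSUMED_BASELINE_GROSS_CONVERSION, ASSUMED_CLICKS_IN_SAMPLE)
print(round(sd_gross, 4))
```

Note that `n` is the number of units of analysis (clicks here), not the number of page views, which is what makes the match between the unit of analysis and the unit of diversion matter.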

### Sizing
#### Number of Samples vs. Power

Using alpha = 0.05 and beta = 0.2, a total of 685,275 page views is required to conduct this experiment.

#### Duration vs. Exposure

**Risks**

The experiment is not a risky one. It does not expose the users to risk that exceeds the "minimal risk" level. No sensitive information is involved in the experiment. Asking a question about available time has hardly any physical, psychological, emotional, social or economic effect on the users.

**Exposure and Duration**

Because the experiment is not risky and we assume that no other experiments need to run simultaneously, I would divert 100% of the traffic to this experiment. It would then take 18 days to complete.
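The duration follows from dividing the required page views by the traffic diverted per day. A minimal sketch is below; the daily traffic figure is an assumption that is not stated in this document, and `days_required` is a hypothetical helper name.

```python
import math


def days_required(total_pageviews: int, daily_pageviews: int,
                  fraction_diverted: float) -> int:
    """Number of whole days needed to collect the required page views."""
    if not 0.0 < fraction_diverted <= 1.0:
        raise ValueError("fraction_diverted must be in (0, 1]")
    return math.ceil(total_pageviews / (daily_pageviews * fraction_diverted))


REQUIRED_PAGEVIEWS = 685_275        # from the sizing section
ASSUMED_DAILY_PAGEVIEWS = 40_000    # ASSUMPTION: not given in this document

days_full = days_required(REQUIRED_PAGEVIEWS, ASSUMED_DAILY_PAGEVIEWS, 1.0)
days_half = days_required(REQUIRED_PAGEVIEWS, ASSUMED_DAILY_PAGEVIEWS, 0.5)
print(days_full, days_half)
```

Under this assumed traffic level, diverting only half of the traffic would roughly double the duration, which is the trade-off between exposure and duration.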

## Experiment Analysis
### Sanity Checks

| Invariant Metrics | Lower bound | Upper bound | Observed | Passes |
| --- | --- | --- | --- | --- |
| Number of cookies | 0.4988 | 0.5012 | 0.5006 | Yes |
| Number of clicks on "Start free trial" | 0.4959 | 0.5041 | 0.5005 | Yes |
| Click-through-probability on "Start free trial" | -0.0013 | 0.0013 | 0.0001 | Yes |

The table above shows the lower and upper bounds of the confidence interval for each metric. The observed values of all three metrics are within their confidence intervals, meaning there is no statistically significant difference between the control and experiment groups for any invariant metric. Hence,
we can continue with our analysis.
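A sanity check for a count metric can be sketched as follows: if units are split evenly, the fraction landing in the control group should lie within a 95% interval around 0.5. The group counts below are hypothetical inputs that are not given in this document; they were chosen to be consistent with the number-of-cookies row of the table.

```python
import math


def sanity_check(control_count: int, experiment_count: int,
                 z: float = 1.96, expected: float = 0.5):
    """Return (lower, upper, observed, passes) for the control-group fraction."""
    total = control_count + experiment_count
    standard_error = math.sqrt(expected * (1.0 - expected) / total)
    margin = z * standard_error
    lower, upper = expected - margin, expected + margin
    observed = control_count / total
    return lower, upper, observed, lower <= observed <= upper


# HYPOTHETICAL counts (assumed for illustration, not from this document).
ASSUMED_CONTROL_COOKIES = 345_543
ASSUMED_EXPERIMENT_COOKIES = 344_660

lower, upper, observed, passes = sanity_check(
    ASSUMED_CONTROL_COOKIES, ASSUMED_EXPERIMENT_COOKIES)
print(round(lower, 4), round(upper, 4), round(observed, 4), passes)
```

The interval is built around the expected fraction rather than the observed one, because the question being asked is whether the observed split is plausible under an even assignment.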

### Result Analysis

| Evaluation Metrics | Lower bound | Upper bound | Statistically significant | Practically significant |
| --- | --- | --- | --- | --- |
| Gross conversion | -0.029123 | -0.011986 | Yes | Yes |
| Net conversion | -0.011605 | 0.001857 | No | No |

The table above shows the lower and upper bounds of the confidence interval for the difference between the control and experiment groups for each evaluation metric. Because the whole interval for gross conversion lies below zero (its upper bound is -0.011986), we can conclude that the difference in `Gross conversion` is statistically significant. Similarly, because zero is within the CI for `Net conversion`, we conclude that the difference in `Net conversion` is not statistically significant.

The practical significance levels for `Gross conversion` and `Net conversion` are 0.01 and 0.0075 respectively. Because the whole CI for `Gross conversion` lies below -0.01, the decrease is practically significant. And because the CI for `Net conversion` is not entirely outside the practical significance interval of -0.0075 to 0.0075, we conclude that the difference in `Net conversion` is *NOT* practically significant.
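The two decisions can be written as a small rule on the confidence interval. This is a sketch of one common convention (a difference is practically significant only when the whole interval lies beyond the practical significance boundary); the function name `assess` is hypothetical, and the numbers are the ones from the table.

```python
def assess(lower: float, upper: float, d_min: float):
    """Return (statistically_significant, practically_significant) for a CI."""
    # Statistically significant: the interval excludes zero.
    statistically_significant = not (lower <= 0.0 <= upper)
    # Practically significant: the whole interval lies beyond +/- d_min.
    practically_significant = upper < -d_min or lower > d_min
    return statistically_significant, practically_significant


gross = assess(-0.029123, -0.011986, d_min=0.01)
net = assess(-0.011605, 0.001857, d_min=0.0075)
print(gross, net)
```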

### Sign Tests

| Evaluation Metrics | p-value | Statistically significant |
| --- | --- | --- |
| Gross conversion | 0.0026  | Yes |
| Net conversion | 0.6776 | No |

The p-value for `Gross conversion` is 0.0026, meaning that if there were no difference in `Gross conversion`, the chance of getting a result (pattern of signs) at least as extreme as the one observed would be only 0.26%. Because 0.26% is well below our significance level of 5%, we conclude the difference in `Gross conversion` is significant.

In contrast, the p-value for `Net conversion` is about 68%, well above the 5% significance level. Hence the difference is not statistically significant.
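A two-sided sign test can be sketched with an exact binomial calculation. The day counts used below (4 and 10 positive days out of 23) are assumptions that do not appear in this document; they are used because they give p-values matching the ones in the table.

```python
from math import comb


def sign_test_p_value(successes: int, trials: int) -> float:
    """Two-sided exact sign test p-value under a fair-coin null (p = 0.5)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # Probability of the smaller tail, doubled for a two-sided test.
    k = min(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2.0 * tail)


# ASSUMED day counts (not stated in this document).
p_gross = sign_test_p_value(4, 23)
p_net = sign_test_p_value(10, 23)
print(round(p_gross, 4), round(p_net, 4))
```

The test uses only the direction of the daily difference, not its size, so it makes fewer assumptions than the effect size test but also has less power.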

### Bonferroni correction
I didn't use Bonferroni correction.

Bonferroni correction is used to correct the false positive probability of a decision rule that involves multiple hypothesis tests. 

In our experiment, there are two hypothesis tests. Each test has a 5% chance of returning a false positive. If our decision rule were to launch the experiment when **ANY** of the two hypothesis tests returns a positive result, then the probability of making a false positive decision would be $(1-0.95^2) = 0.0975$, much larger than the individual 5%. Intuitively, this is because the more tests we use, the more likely it is that at least one test returns a positive purely by chance. Bonferroni correction is used to counter that.

In contrast, in the Udacity experiment we want **BOTH** of the tests to meet our expectations at the same time. In this case, the chance of making a false positive conclusion is $0.05^2 = 0.0025$ (assuming independence), much smaller than the 5% significance level of each individual test.

To sum up, because we want **BOTH** tests to meet our expectations instead of **ANY**, Bonferroni correction is not applicable here.
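The arithmetic behind the two decision rules can be reproduced with a short sketch. The function names are hypothetical, and the tests are assumed to be independent, as in the text.

```python
def false_positive_any(alpha: float, num_tests: int) -> float:
    """Chance that at least one of num_tests independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** num_tests


def false_positive_all(alpha: float, num_tests: int) -> float:
    """Chance that all num_tests independent tests are false positives at once."""
    return alpha ** num_tests


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests


any_rate = false_positive_any(0.05, 2)
all_rate = false_positive_all(0.05, 2)
corrected = bonferroni_alpha(0.05, 2)
print(round(any_rate, 4), round(all_rate, 4), corrected)
```

Under an **ANY** rule the correction would tighten each test to a 2.5% level; under a **BOTH** rule the combined false positive rate is already far below 5%, so tightening further would only cost power.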

### Summary

There is no discrepancy between the effect size test and the sign test. The difference in gross conversion is significant in both tests, and the difference in net conversion is not significant in either.


### Recommendation


The gross conversion is both statistically and practically significantly lower in the experiment group. As we want to reduce the number of enrolled students who do not continue after the trial period, this result is what we hoped to see in order to launch.

The 95% confidence interval for the difference in net conversion is between -0.011605 and 0.001857. So we fail to reject the null hypothesis and conclude that the difference in net conversion is not statistically significant. That said, it is still possible that the net conversion is indeed lower in the experiment group, and that the difference is just not large enough to be detected by our test. A much larger sample would increase the power of our hypothesis test and might lead to a different conclusion. However, we can say we are 95% confident that the true difference in net conversion is between -0.011605 and 0.001857. In the worst-case scenario of -0.011605, the difference would not be acceptable for us to launch the experiment, because its absolute value exceeds the practical significance boundary of 0.0075. In other words, the decrease (if there is any) in net conversion would be practically significant in that case.

As we need both of the criteria to pass in order to launch, I do NOT recommend launching the experiment.

## Follow-Up Experiment

##### Description
Reward systems have an enormous impact on human behaviour and are frequently used as a motivational tool in education and many other areas (e.g. gaming). They are highly effective when used right. I propose that Udacity reward those students who spend enough time studying and complete a project during the trial period. The prize is a month of free subscription to the Nanodegree. This information is only revealed to students after they have enrolled.

##### Hypothesis
I expect students to be more motivated. More students would complete the first project in the trial to get the reward. Once they have received the free subscription, they will stay within the program, as they do not want to waste the free subscription they have just earned. By the end of the free subscription, a similar psychological effect will increase the probability of those students becoming paid subscribers, as otherwise they will feel they have wasted all of the hard work they have put in so far.

##### Evaluation metrics

Let the probability of conversion be defined as
$ P_{conversion}  = \frac{\textrm{Number of user ids who eventually paid by the end of experiment}}{\textrm{Total number of enrolled user ids in the cohort}} $

If the proposed experiment is effective, then we would expect to see more students continue after the trial and pay. The hypothesis can be expressed formally as:

$H_{0}:  \delta = 0$

$H_{A}: \delta > 0$

where $\delta = P_{conversion}^{experiment} - P_{ conversion}^{control} $
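One way to test this one-sided hypothesis is a pooled two-proportion z-test, sketched below. This is only one possible choice of test, and the enrollment and payment counts are hypothetical numbers made up for illustration.

```python
import math


def one_sided_z_test(paid_control: int, n_control: int,
                     paid_experiment: int, n_experiment: int):
    """Return (z, p_value) for H_A: P_experiment - P_control > 0."""
    p_control = paid_control / n_control
    p_experiment = paid_experiment / n_experiment
    # Pooled proportion under the null hypothesis delta = 0.
    p_pool = (paid_control + paid_experiment) / (n_control + n_experiment)
    standard_error = math.sqrt(
        p_pool * (1.0 - p_pool) * (1.0 / n_control + 1.0 / n_experiment))
    if standard_error == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_experiment - p_control) / standard_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# HYPOTHETICAL counts: 100 of 1000 paid in control, 130 of 1000 in experiment.
z, p_value = one_sided_z_test(100, 1000, 130, 1000)
reject_null = p_value < 0.05
print(round(z, 2), reject_null)
```

Only the upper tail is used because the alternative hypothesis is directional: a lower conversion in the experiment group would never lead to rejecting $H_{0}$ here.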

##### Invariant metric

* Number of students enrolled each month

Because the reward is shown to a student who has already enrolled, it should have no impact on the number of students enrolled each month. 

##### Unit of diversion
Given the evaluation metric, it is natural to use user-id as the unit of diversion for this experiment.
