In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import stats 
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import binom_test

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### Metrics and Hypothesis Decision 

* Hypothesis: Will warning users who are not able to devote sufficient time, just before they sign up for a course free trial, reduce the churn rate of those who enroll in free trials without significantly reducing the free-trial conversion rate?
* Unit of diversion: Cookie 

Possible Metrics:
* Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
* Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)
* Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
* Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
* Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
* Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
* Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
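
The four ratio metrics above can be sketched as a small helper. This is only an illustration: the function name `funnel_metrics` and the counts passed to it are hypothetical and are not taken from the baseline data.

```python
def funnel_metrics(cookies, clicks, enrollments, payments):
    """Compute the four ratio metrics from raw funnel counts."""
    return {
        "click_through_probability": clicks / cookies,
        "gross_conversion": enrollments / clicks,
        "retention": payments / enrollments,
        "net_conversion": payments / clicks,
    }

# Hypothetical daily counts, chosen only for illustration.
example = funnel_metrics(cookies=40_000, clicks=3_200, enrollments=660, payments=350)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Note that the count metrics (cookies, clicks, user-ids) are the inputs here, while the probabilities are all derived from them.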

Which of the following metrics would you choose to measure for this experiment and why? For each metric you choose
* Would you use it as an invariant metric or an evaluation metric?
    * Invariant - Num Cookies, clicks, CTP before screener
    * Evaluation
        * 1) Gross enrollment conversion - slight decrease expected (-0.01)
        * 2) 14-day Enrollment Retention - healthy increase expected (+0.01)
        * 3) 14-day Net Conversion to paid - healthy increase expected (+0.0075)
    * Omit
        * Num user_ids enrolled - not normalised and may fluctuate due to external events
        
* The practical significance boundary for each metric (difference that would have to be observed before that was a meaningful change for the business)


Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.
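
A minimal sketch of this per-day uniqueness rule, assuming a hypothetical page-view log of (day, cookie) pairs (the log and its values are made up for illustration):

```python
# Hypothetical page-view log of (day, cookie_id) pairs; not taken from the data set.
page_views = [
    ("day_1", "cookie_a"),
    ("day_1", "cookie_a"),  # repeat visit on the same day: counted once
    ("day_1", "cookie_b"),
    ("day_2", "cookie_a"),  # same cookie on a new day: counted again
]

# A set of (day, cookie) pairs keeps one entry per cookie per day.
unique_cookie_days = set(page_views)
n_unique_cookies = len(unique_cookie_days)
print(n_unique_cookies)  # 3
```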

Decision to make:
* What results will you be looking for in order to launch the experiment?
* Would a change in any one of your evaluation metrics be sufficient? 
* Would you want to see multiple metrics all move or not move at the same time in order to launch?

Let us start by constructing the baseline metrics

In [None]:
df = pd.read_csv('/kaggle/input/udacity-ab-testing/baseline.csv', names=['metric', 'value'])
df['dmin'] = [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075]
df['idx'] = ["C", "CL", "ID", "CTP", "CG", "R", "CN"]
df = df.set_index('idx')
df

### Determine Distribution and Standard Error 

Since the sample size given by Udacity is n = 5000 cookies, we first need to scale the collected count data, i.e. the number of cookies, the number of clicks and the number of user-ids.

In [None]:
df['scaled_value'] = np.nan
scaling_factor = 5000/df.loc["C"]["value"]
for i in ['C', 'CL', 'ID']:
    df.at[i, 'scaled_value'] = df.loc[i]['value'] * scaling_factor
df

Since the unit of diversion is the same as the unit of analysis (the denominator of the metric formula) for each evaluation metric (cookie in the case of gross conversion and net conversion, user-id in the case of retention) and we can assume that the metrics follow a binomial distribution, we can calculate the standard errors analytically (instead of empirically). Each metric answers one of the following questions:
* 1) Given N unique Cookies that click, what is P(enrollment)?
* 2) Given N unique Cookies that click, what is P(payment)?
* 3) Given N enrollments, what is P(payment)?

In [None]:
def checkN(n, p, metric):
    """Check whether n is large enough to assume a normal distribution (3 S.D. rule).
    n: sample size
    p: probability of event occurring
    metric: name of the metric, used in the message
    return: string with the result of the check
    """
    if n > 9*((1-p)/p) and n > 9*(p/(1-p)):
        result = f"{metric}:  n = {n} is large enough to assume normal distribution"
    else:
        result = f"{metric}:  n = {n} is not large enough to assume normal distribution"
    # print the message and also return it, as promised by the docstring
    print(result)
    return result

for i,j in zip(["CL", "CL", "ID"],["CG", "CN", "R"]):
    checkN(df.at[i, "scaled_value"], df.at[j,"value"], df.at[j,"metric"])
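
The two conditions tested in `checkN` can be derived from the requirement that the interval of three standard deviations around the expected number of successes $ np $ lies within the possible range from 0 to n. A short derivation, given as a sketch of the reasoning rather than a formal proof:

$ np - 3*\sqrt{np(1-p)} > 0 \Rightarrow n^2p^2 > 9np(1-p) \Rightarrow n > 9*\frac{1-p}{p} $

$ np + 3*\sqrt{np(1-p)} < n \Rightarrow n^2(1-p)^2 > 9np(1-p) \Rightarrow n > 9*\frac{p}{1-p} $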

Next we need to calculate the standard error for each metric, which is the standard deviation of the sampling distribution of the sample mean given by the formula:

$ SE = \sqrt{\frac{\hat{p}*(1-\hat{p})}{n}}$

with $ \sqrt{\hat{p}*(1-\hat{p})}$ estimating the population standard deviation.

In [None]:
def standardError (n, p):
    '''Return the standard error for a given probability p and sample size n'''
    return (p*(1-p)/n)**0.5


df["SE"] = np.nan
for i in ["CG", "CN"]:
    df.at[i, "SE"] = standardError(df.loc["CL"]["scaled_value"], df.loc[i]["value"]) 
    
df.at["R", "SE"] = standardError(df.loc["ID"]["scaled_value"], df.loc["R"]["value"])
df

### Determine sample size and duration

Next, we need to determine the experiment sample size. Since we do not assume common standard deviations, a more precise way to determine the required sample size is:

$ n = \frac{(z_{1-\alpha/2}*\sqrt{2*\hat{p}*(1-\hat{p})}+z_{1-\beta}*\sqrt{\hat{p}*(1-\hat{p})+(\hat{p}+dmin)*(1-(\hat{p}+dmin))})^2}{dmin^2}$

This is also the approach used by many online sample size calculators such as the one by [Evan Miller](https://www.evanmiller.org/ab-testing/sample-size.html). Further, we want to calculate the experiment sample size in terms of cookies that visit the page. Thus, we also need to account for the fact that our evaluation metrics' units of analysis are clicks and user-ids, respectively. The total experiment sample size per evaluation metric is hence given by:

$ n_{c} = \frac{n}{CTP}*2 $ and $ n_{c} = \frac{\frac{n}{CTP}}{CG}*2 $

Given our calculations, we would need around 638,940 pageviews (cookies) to test the first hypothesis (given our assumptions on alpha, beta, baseline conversions and dmin). To additionally test the third hypothesis, we would need a total of 685,336 pageviews. And, in case we would like to also test the second hypothesis, we would need a total of around 4,737,771 pageviews.

In [None]:
def get_sampleSize (p, dmin, alpha=0.05, beta=0.2):
    '''Return sample size given alpha, beta, p and dmin'''
    return (pow((stats.norm.ppf(1-alpha/2)*(2*p*(1-p))**0.5+stats.norm.ppf(1-beta)*(p*(1-p)+(p+dmin)*(1-(p+dmin)))**0.5),2))/(pow(dmin,2))

df["n_C"] = np.nan
for i in ["CG", "CN"]:
    df.at[i, "n_C"] = round((get_sampleSize(df.loc[i]["value"], df.loc[i]["dmin"])/df.loc["CTP"]["value"])*2)

df.at["R", "n_C"] = round(((get_sampleSize(df.loc["R"]["value"], df.loc["R"]["dmin"])/df.loc["CTP"]["value"])/df.loc["CG"]["value"])*2)
df

Now, for each case, we can calculate approximately how many days we would need to run the experiment in order to reach n_C. According to the challenge description, we assume that there are no other experiments we want to run simultaneously. So, theoretically, we could divert 100% of the traffic to our experiment (i.e. about 50% of all visitors would then be in the treatment condition). Given the baseline estimate of about 40,000 unique pageviews per day, this would result in:

In [None]:
def duration(total_needed, daily_traffic, pct=1):
    return round(total_needed / (daily_traffic * pct), 2)

for i, j in zip(["CG", "CN", "R"], ["CG", "CG+CN", "CG+CN+R"]):
    print("Days required for", j, ":", duration(df.loc[i]["n_C"], df.loc["C"]["value"]))

# same calculation for CG+CN when only 47% of the traffic is diverted to the experiment
print("Days required for CG+CN at 47% traffic:", duration(df.loc['CN']["n_C"], df.loc["C"]["value"], 0.47))

We see that we would need to run the experiment for about 119 days in order to test all three hypotheses (and this does not even take into account the 14 additional days (free trial period) we have to wait until we can evaluate the experiment). Such a duration (esp. with 100% traffic diverted to it) appears to be very risky. 

* First, we cannot perform any other experiment during this period (opportunity costs). 
* Secondly, if the treatment harms the user experience (frustrated students, inefficient coaching resources) and decreases conversion rates, we won't notice it (or cannot really say so) for more than four months (business risk). 
* Consequently, it seems more reasonable to only test the first and third hypotheses and to discard retention as an evaluation metric, especially since net conversion is the product of retention and gross conversion, so that we might be able to draw inferences about the retention rate from the two remaining evaluation metrics.
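
The product relationship between the three metrics follows directly from their definitions. Written out with the abbreviations used in the baseline table (CN, CG, R), and with $ n_{payments} $, $ n_{enrollments} $ and $ n_{clicks} $ as shorthand introduced here for the respective counts:

$ CN = \frac{n_{payments}}{n_{clicks}} = \frac{n_{enrollments}}{n_{clicks}} * \frac{n_{payments}}{n_{enrollments}} = CG * R $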

So, how much traffic should we divert to the experiment? Given the considerations above, we want the experiment to run relatively fast and for not more than a few weeks. Also, as the nature of the experiment itself does not seem to be very risky (e.g. the treatment doesn't involve a feature that is critical with regards to potential media coverage), we can be confident in diverting a high percentage of traffic to the experiment. 

Still, since there is always the potential that something goes wrong during implementation, we may not want to divert all of our traffic to it. Hence, 80% (22 days) would seem to be quite reasonable. However, when we look at the data provided by Udacity (see Experiment Result Analysis), we see that it takes 37 days to collect 690,203 pageviews, meaning that they most likely diverted somewhere between 45% and 50% of their traffic to the experiment.
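
The implied share of diverted traffic can be checked with a quick calculation. The helper name `implied_traffic_fraction` is hypothetical; the inputs are the figures quoted above together with the baseline estimate of about 40,000 pageviews per day.

```python
def implied_traffic_fraction(total_pageviews, days, daily_traffic):
    """Share of daily traffic needed to collect total_pageviews in the given number of days."""
    return total_pageviews / (days * daily_traffic)

# Figures quoted in the text; 40,000 daily pageviews is the baseline estimate.
fraction = implied_traffic_fraction(total_pageviews=690_203, days=37, daily_traffic=40_000)
print(f"{fraction:.1%}")
```

Under these assumptions the result falls between 45% and 50%, in line with the 47% used in the duration calculation.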

### Experiment Result Analysis

In [None]:
# load each group from its own file (the two file names were swapped before)
control = pd.read_csv('/kaggle/input/udacity-ab-testing/control.csv')
experiment = pd.read_csv('/kaggle/input/udacity-ab-testing/experiment.csv')
control.head()

In [None]:
experiment.head()

### Sanity Check

To ensure that the experiment has been run properly, we first conduct a sanity check using the three invariant metrics outlined above. We have two counts (number of cookies, number of clicks) and one probability. As stated earlier, we would expect that these metrics do not differ significantly between the control and treatment groups. Otherwise, this would imply that something is wrong with the experiment setup and that our results are biased.


**number of cookies + number of clicks**

In the provided data, the column "Pageviews" represents the number of cookies that browse the course overview page. Given our assumptions, we would expect that the total number of cookies in the treatment group and the total number of cookies in the control group each account for about 50% of the combined number of cookies of both groups (treatment + control), as they should have been assigned randomly. We can calculate the test statistic Z and compare the corresponding p-value against our selected alpha level. 
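
To make the test statistic explicit, here is a minimal hand-rolled sketch of a one-sample proportion z-test. The function name and the counts are hypothetical. The variance is estimated from the observed proportion, which is, to our understanding, what `proportions_ztest` does for a single sample when `prop_var=False`.

```python
import math

def one_sample_proportion_ztest(successes, n, p0=0.5):
    """Two-sided z-test of an observed proportion against p0.

    The standard error is based on the observed proportion, not on p0.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = (p_hat - p0) / se
    # two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not the experiment's actual totals.
z, p = one_sample_proportion_ztest(successes=5_050, n=10_000)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```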

In [None]:
#calculate the number of observations and successes 
alpha = 0.05
n = control["Pageviews"].sum()+experiment["Pageviews"].sum()
n_control = control["Pageviews"].sum()

#calculate the test-statistic Z and corresponding p_value
z_statistic, p_value = proportions_ztest(n_control, n, value=0.5, alternative="two-sided", prop_var=False)

print("z-test-statistic: ", z_statistic)
print("p-value:" , p_value)

#alternatively compute p-value using the exact binomial test
p_value_binom = binom_test(n_control, n, prop=0.5, alternative='two-sided')
print("p-value_binomial: ", p_value_binom)

if p_value_binom > alpha:
    print("The null hypothesis cannot be rejected and the sanity check is passed")
else:
    print("The null hypothesis is rejected and the sanity check is not passed")

**click-through probabilities**

To check whether the click-through probabilities in the control and treatment groups are significantly different from each other, we conduct a two-proportion z-test with a click being interpreted as a success. We thereby assume that the two sample proportions are approximately normally distributed. Note that, for two samples, `proportions_ztest` estimates the variance from the pooled proportion under the null hypothesis of no difference. We can calculate the z-test statistic and then check the corresponding p-value as shown:

In [None]:
#calculate the number of observations and successes for each group
n = np.array([control["Pageviews"].sum(), experiment["Pageviews"].sum()])
n_clicks = np.array([control["Clicks"].sum(), experiment["Clicks"].sum()])

#calculate the test-statistic Z and corresponding p_value
z_statistic, p_value = proportions_ztest(n_clicks, n, value=0, alternative="two-sided", prop_var=False)
print("z-test-statistic: ", z_statistic, "p-value:" , p_value)
if p_value > alpha:
    print("The null hypothesis cannot be rejected and the sanity check is passed")
else:
    print("The null hypothesis is rejected and the sanity check is not passed")

### Evaluation Metrics

* CI_left - lower bound of confidence interval of the metric
* CI_right - upper bound of confidence interval of the metric
* d - the observed change
* dmin - the minimum observed change for the metric to be practically relevant
* A metric is statistically significant if the confidence interval does not include 0 (you can be confident there was a change).
* A metric is practically relevant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)
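
The two rules above can be sketched as a small helper. The function name `assess_interval` and the intervals passed to it are hypothetical; the sign convention for dmin (negative when a decrease is the relevant change) follows the baseline table.

```python
def assess_interval(ci_left, ci_right, dmin):
    """Classify a confidence interval for a difference d.

    dmin is signed: a negative dmin means that a decrease of at least
    |dmin| is the change that matters.
    """
    statistically_significant = not (ci_left <= 0 <= ci_right)
    if dmin >= 0:
        # the whole interval must lie above the boundary
        practically_relevant = ci_left > dmin
    else:
        # the whole interval must lie below the boundary
        practically_relevant = ci_right < dmin
    return statistically_significant, practically_relevant

# Hypothetical intervals for illustration only.
print(assess_interval(-0.03, -0.012, dmin=-0.01))   # (True, True)
print(assess_interval(-0.012, 0.002, dmin=0.0075))  # (False, False)
```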

We test the hypotheses for our evaluation metrics using two-proportion z-tests, with an enrollment (gross conversion) or a payment (net conversion) being interpreted as a success and the clicks as the number of trials. We thereby assume that the two sample proportions are approximately normally distributed but do not necessarily have equal variances (hence p is not pooled below). Instead of computing a 95% confidence interval around the expected difference of 0, we compute the confidence interval around the observed difference d between the conversion metrics and check whether it includes 0.

a) Compute the observed difference d between the two conversion rates

$ d = p_{exp}-p_{cont} $ with $ p = \frac{n_{enrollments}}{n_{clicks}} $ for gross conversion and $ p = \frac{n_{payments}}{n_{clicks}} $ for net conversion

b) Compute the confidence interval around d and check whether 0 and dmin lie within CI

$ CI = [d-Z_{1-\alpha/2}*SE; d+Z_{1-\alpha/2}*SE] $ with $ SE = \sqrt{\frac{S_{cont}^2}{n_{cont, clicks}}+\frac{S_{exp}^2}{n_{exp, clicks}}} $

whereby $ S = \sqrt{p*(1-p)} $. This standard error is called `SE_pooled` in the code, although the proportion itself is not pooled.

In [None]:
true_sample_size = control.iloc[:23]["Pageviews"].sum()+experiment.iloc[:23]["Pageviews"].sum()
test_results = pd.DataFrame(columns=["CI_left", "CI_right", "d","stat sig?", "dmin", "pract rel?"], index=["CG", "CN"])

for i,j in zip(["Enrollments", "Payments"],["CG", "CN"]):
    #compute difference between treatment and control conversion rates 
    conv_control = control.iloc[:23][i].sum()/control.iloc[:23]["Clicks"].sum()
    conv_experiment = experiment.iloc[:23][i].sum()/experiment.iloc[:23]["Clicks"].sum()
    test_results.at[j, "d"] = conv_experiment-conv_control
    
    #compute sample standard deviations
    S_control = (conv_control*(1-conv_control))**0.5
    S_experiment = (conv_experiment*(1-conv_experiment))**0.5
    SE_pooled = (S_control**2/control.iloc[:23]["Clicks"].sum()+S_experiment**2/experiment.iloc[:23]["Clicks"].sum())**0.5
    
    #compute 95% confidence interval around observed difference d
    test_results.at[j, "CI_left"] = test_results.at[j, "d"]-(stats.norm.ppf(1-alpha/2)*SE_pooled)
    test_results.at[j, "CI_right"] = test_results.at[j, "d"]+(stats.norm.ppf(1-alpha/2)*SE_pooled)
    
    if test_results.at[j, "CI_left"] <= 0 <= test_results.at[j, "CI_right"]: #check statistical significance
        test_results.at[j, "stat sig?"] = "no"
    else:
        test_results.at[j, "stat sig?"] = "yes"

    #check if practically relevant
    test_results.at[j, "dmin"] = df.loc[j]["dmin"]
    if test_results.at[j, "dmin"] >= 0:
        #check if d is larger than dmin and if dmin lies left of the confidence interval around d
        if test_results.at[j, "d"] > test_results.at[j, "dmin"] and test_results.at[j, "CI_left"] > test_results.at[j, "dmin"]:
            test_results.at[j, "pract rel?"] = "yes"
        else:
            test_results.at[j, "pract rel?"] = "no"
    else:
        #check if d is smaller than dmin and if dmin lies right of the confidence interval around d
        if test_results.at[j, "d"] < test_results.at[j, "dmin"] and test_results.at[j, "dmin"] > test_results.at[j, "CI_right"]:
            test_results.at[j, "pract rel?"] = "yes"
        else:
            test_results.at[j, "pract rel?"] = "no"
test_results

### Interpretation of results and recommendations

**Gross conversion:** the observed gross conversion in the treatment group is around 2.06 percentage points lower than the gross conversion observed in the control group. Further, we see that the values within the confidence interval are also most compatible with a negative effect. Lastly, this effect appears to be practically relevant, as those values are smaller than dmin, the minimum effect size to be considered relevant for the business.

**Net conversion:** While we cannot reject the null hypothesis for this test, we see that the observed net conversion in the treatment group is around 0.49 percentage points lower than the net conversion observed in the control group. Further, the values that are considered most reasonably compatible with the data range from -1.16 to 0.19 percentage points.

Given these results, we can assume that the introduction of the "Free Trial Screener" may indeed help to set clearer expectations for students upfront. However, the **results are less compatible with the assumption that the decrease in gross conversion is entirely absorbed by an improvement in the overall student experience and still less compatible with dmin** (net conversion), the minimum effect size to be considered relevant for the business. Consequently, assuming that Udacity has a fair interest in increasing revenues, **we would recommend not rolling out the "Free Trial Screener" feature.**

That being said, the feature may increase the total number of people who opt for the freely available materials. If this is true, and assuming a steady conversion rate among users who first learn with the freely accessible materials and then upgrade, the feature may still help to increase net conversion. However, this effect, if it exists at all, is more likely to appear over a longer time period and, hence, would require a test with a longer timeframe.