# Statistical Tests and Experiments

## Let's analyse the Udacity A/B test!

<div><img style="height: 350px;" src="https://upload.wikimedia.org/wikipedia/commons/3/3b/Udacity_logo.png" /></div>

## Background

We are analyzing the results of an online A/B test that Udacity ran once. The dataset is part of the A/B testing course offered by Udacity and Google.

Data is available at https://docs.google.com/spreadsheets/d/1Mu5u9GrybDdska-ljPXyBjTpdZIUev_6i7t4LRDfXM8/edit#gid=0

The data consists of two sheets, one for each group of the test. Download the sheets as CSVs and union them to form one dataset.
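A minimal sketch of the union step, assuming pandas is available. The two small frames and the `Group` column name are hypothetical stand-ins for the downloaded CSVs; in practice each frame would come from `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-ins for the two downloaded sheets (first two days of each group).
control_sheet = pd.DataFrame(
    {"Date": ["Sat, Oct 11", "Sun, Oct 12"], "Pageviews": [7723, 9102], "Clicks": [687, 779]}
)
experiment_sheet = pd.DataFrame(
    {"Date": ["Sat, Oct 11", "Sun, Oct 12"], "Pageviews": [7716, 9288], "Clicks": [686, 785]}
)

# Label each row with its group (assumed column name), then stack the two frames.
combined = pd.concat(
    [control_sheet.assign(Group="control"), experiment_sheet.assign(Group="experiment")],
    ignore_index=True,
)
print(combined)
```

Keeping a `Group` column makes it possible to split the combined dataset back into its two groups later on.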

This project focuses mostly on statistical hypothesis testing with real-life data.

## To Do

* Test whether the difference in each metric between the control and experiment groups is statistically significant using a z-test at the 95% confidence level.
* Test whether the difference in each metric between the control and experiment groups is statistically significant using a t-test at the 95% confidence level.
* Compare the results of the two test methods. Explain why they differ or do not differ much.
* Choose one method (either z or t) and explore the statistical significance of any metric under different confidence levels: 60%, 90%, 95%, 99%. If conclusions about significance differ under different confidence levels, explain why.
* Calculate p-values.

## Solution

In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
from statsmodels.stats.weightstats import ztest, ttest_ind

Importing the Data

In [None]:
control = pd.read_csv("Final Project Results - Control.csv")
experiment = pd.read_csv("Final Project Results - Experiment.csv")

Calculating different metrics and adding them to the dataframe of the two groups

In [None]:
# for control group
control["CTR"] = control["Clicks"]/control["Pageviews"]
control["Retention"] = control["Payments"] / control["Enrollments"]
control["Gross Conversion"] = control["Enrollments"] / control["Clicks"]

control

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments,CTR,Retention,Gross Conversion
0,"Sat, Oct 11",7723,687,134.0,70.0,0.088955,0.522388,0.195051
1,"Sun, Oct 12",9102,779,147.0,70.0,0.085586,0.47619,0.188703
2,"Mon, Oct 13",10511,909,167.0,95.0,0.086481,0.568862,0.183718
3,"Tue, Oct 14",9871,836,156.0,105.0,0.084693,0.673077,0.186603
4,"Wed, Oct 15",10014,837,163.0,64.0,0.083583,0.392638,0.194743
5,"Thu, Oct 16",9670,823,138.0,82.0,0.085109,0.594203,0.167679
6,"Fri, Oct 17",9008,748,146.0,76.0,0.083037,0.520548,0.195187
7,"Sat, Oct 18",7434,632,110.0,70.0,0.085015,0.636364,0.174051
8,"Sun, Oct 19",8459,691,131.0,60.0,0.081688,0.458015,0.18958
9,"Mon, Oct 20",10667,861,165.0,97.0,0.080716,0.587879,0.191638


In [None]:
# for experiment group
experiment["CTR"] = experiment["Clicks"]/experiment["Pageviews"]
experiment["Retention"] = experiment["Payments"] / experiment["Enrollments"]
experiment["Gross Conversion"] = experiment["Enrollments"] / experiment["Clicks"]
experiment

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments,CTR,Retention,Gross Conversion
0,"Sat, Oct 11",7716,686,105.0,34.0,0.088906,0.32381,0.153061
1,"Sun, Oct 12",9288,785,116.0,91.0,0.084518,0.784483,0.147771
2,"Mon, Oct 13",10480,884,145.0,79.0,0.084351,0.544828,0.164027
3,"Tue, Oct 14",9867,827,138.0,92.0,0.083815,0.666667,0.166868
4,"Wed, Oct 15",9793,832,140.0,94.0,0.084959,0.671429,0.168269
5,"Thu, Oct 16",9500,788,129.0,61.0,0.082947,0.472868,0.163706
6,"Fri, Oct 17",9088,780,127.0,44.0,0.085827,0.346457,0.162821
7,"Sat, Oct 18",7664,652,94.0,62.0,0.085073,0.659574,0.144172
8,"Sun, Oct 19",8434,697,120.0,77.0,0.082642,0.641667,0.172166
9,"Mon, Oct 20",10496,860,153.0,98.0,0.081936,0.640523,0.177907


Calculating the z score for click through rate

In [None]:
# mean of each group
mean_control = control["CTR"].sum()/len(control)
mean_experiment = experiment["CTR"].sum()/len(experiment)

print(f"Control: {mean_control}")
print(f"Experiment: {mean_experiment}")

Control: 0.0821292746965395
Experiment: 0.08219052099356416


In [None]:
# standard deviation (np.std defaults to ddof=0, i.e. the population formula)
std_control = np.std(control["CTR"])
std_experiment = np.std(experiment["CTR"])

print(f"Control: {std_control}")
print(f"Experiment: {std_experiment}")

Control: 0.0031848158519031883
Experiment: 0.003073578677986675


In [None]:
# compute the standard error of the difference between the two group means
se_pooled = (std_control**2/len(control) + std_experiment**2/len(experiment))**0.5
print(f"Standard Error: {se_pooled}")


Standard Error: 0.0007276384961568549


In [None]:
zscore = (mean_control - mean_experiment) / se_pooled
zscore

-0.0841713259374707
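The z-score can be turned into a two-sided p-value with the standard normal distribution. The sketch below uses only the Python standard library and plugs in the z-score printed above; it is an illustration, not a cell from the original notebook.

```python
from statistics import NormalDist

zscore = -0.0841713259374707  # value printed above

# Two-sided p-value: probability that a standard normal value is at least as extreme as |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(zscore)))
print(f"p-value: {p_value:.3f}")
```

A p-value this far above 0.05 means the difference in mean CTR is not statistically significant at the 95% confidence level.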

Build a table that stores each metric with its z, t and p values, and shows whether the difference is statistically significant at different confidence levels

In [None]:
metricsTable = pd.DataFrame(data = {"Metric Name": ["CTR",  "Retention", "Gross Conversion"]})

# drop the missing values to calculate for retention and conversion
controldropped = control.dropna()
experimentdropped = experiment.dropna()

# helper that returns the metric values of each group as NumPy arrays
def array(control, experiment, metric):
  control_array = control[metric].to_numpy()
  experiment_array = experiment[metric].to_numpy()
  return control_array, experiment_array

# function to calculate the z-statistic and its p-value
def z_score(control, experiment, metric):
  control_array, experiment_array = array(control, experiment, metric)
  z_statistic, p_value = ztest(control_array, experiment_array, value=0, alternative='two-sided', usevar='pooled')
  return (f' {z_statistic:.3f}, {p_value:.3f}')

# function to calculate the t-statistic and its p-value
def t_score(control, experiment, metric):
  control_array, experiment_array = array(control, experiment, metric)
  t_statistic, p_value, df = ttest_ind(control_array, experiment_array, alternative='two-sided', usevar='pooled', value=0)
  return (f' {t_statistic:.3f}, {p_value:.3f}')

# Using the functions defined above to calculate the metrics
# for CTR
zCTR = z_score(control, experiment, "CTR")
tCTR = t_score(control, experiment, "CTR")

# for Retention
zRetention = z_score(controldropped, experimentdropped, "Retention")
tRetention = t_score(controldropped, experimentdropped, "Retention")

# for Gross Conversion
zGC = z_score(controldropped, experimentdropped, "Gross Conversion")
tGC = t_score(controldropped, experimentdropped, "Gross Conversion")

# add values to metric table
metricsTable["Z-Score, P-Value"] = [zCTR, zRetention, zGC]
metricsTable["T-Score, P-Value"] = [tCTR, tRetention, tGC]
# separating the z, t and p scores into different columns
metricsTable[['Z-Score','Z_P-Value']] = metricsTable.pop("Z-Score, P-Value").str.split(',', expand=True)
metricsTable[['T-Score','T_P-Value']] = metricsTable.pop("T-Score, P-Value").str.split(',', expand=True)


# using the p-value of the z-test to explore statistical significance under different confidence levels
# YES means reject the null hypothesis, NO means fail to reject it

sixty = []
ninty = []
nintyfive = []
nintynine = []

for i in metricsTable["Z_P-Value"]:
  if float(i) < 0.4:
    sixty.append("YES")
  else:
    sixty.append("NO")
  if float(i) < 0.1:
    ninty.append("YES")
  else:
    ninty.append("NO")
  if float(i) < 0.05:
    nintyfive.append("YES")
  else:
    nintyfive.append("NO")
  if float(i) < 0.01:
    nintynine.append("YES")
  else:
    nintynine.append("NO")

metricsTable["60% CI"] = sixty
metricsTable["90% CI"] = ninty
metricsTable["95% CI"] = nintyfive
metricsTable["99% CI"] = nintynine


metricsTable


Unnamed: 0,Metric Name,Z-Score,Z_P-Value,T-Score,T_P-Value,60% CI,90% CI,95% CI,99% CI
0,CTR,-0.083,0.934,-0.083,0.934,NO,NO,NO,NO
1,Retention,-1.008,0.313,-1.008,0.319,YES,NO,NO,NO
2,Gross Conversion,1.54,0.124,1.54,0.131,YES,NO,NO,NO


## Explanation
Since every p-value is above the α = 0.05 threshold, we cannot reject the null hypothesis H₀ for any of the three metrics. In other words, the experiment group did not perform significantly differently from (let alone better than) the control group at the 95% confidence level.

Comparing the z-test and the t-test at the 95% confidence level, the table above shows that the two methods give the same test statistics and lead to the same conclusions. The statistics are identical because the population standard deviation is unknown, so both tests use the same pooled sample standard deviation. The only difference is that the t-test evaluates the statistic against a t-distribution with the appropriate degrees of freedom, which is why its p-values are slightly larger for Retention and Gross Conversion (0.319 vs 0.313 and 0.131 vs 0.124).
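For reference, the statistic that both tests share can be written as follows, assuming the equal-variance (pooled) form selected by `usevar='pooled'`. The symbols $\bar{x}_c$, $\bar{x}_e$, $s_c$, $s_e$, $n_c$ and $n_e$ are notation introduced here for the group means, sample standard deviations and sample sizes of the control and experiment groups.

$$
s_p^2 = \frac{(n_c - 1)\,s_c^2 + (n_e - 1)\,s_e^2}{n_c + n_e - 2},
\qquad
\text{statistic} = \frac{\bar{x}_c - \bar{x}_e}{s_p \sqrt{\dfrac{1}{n_c} + \dfrac{1}{n_e}}}
$$

The z-test compares this statistic with the standard normal distribution, while the t-test compares it with a t-distribution with $n_c + n_e - 2$ degrees of freedom. As the sample sizes grow, the two distributions become nearly indistinguishable, and so do the p-values.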

The table shows no statistically significant difference between the two groups at the 90%, 95% and 99% confidence levels, because all p-values are greater than 0.1. The small sample size may contribute to this: with few observations, the sample may not represent the population well, and the tests have little power to detect a real difference.

At the 60% confidence level (α = 0.4), we still cannot reject the null hypothesis for CTR, but we can reject it for Retention and Gross Conversion, whose p-values (0.313 and 0.124) are below 0.4. The conclusions differ because a lower confidence level tolerates a higher risk of rejecting a null hypothesis that is actually true.
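The decision rule behind the table can be sketched in a few lines of plain Python. The p-values are the z-test values from the table above; the dictionary layout is chosen here purely for illustration.

```python
# z-test p-values copied from the results table above
p_values = {"CTR": 0.934, "Retention": 0.313, "Gross Conversion": 0.124}
confidence_levels = [0.60, 0.90, 0.95, 0.99]

decisions = {}
for metric, p in p_values.items():
    # alpha = 1 - confidence level; reject the null hypothesis when p < alpha
    decisions[metric] = {level: p < round(1 - level, 2) for level in confidence_levels}

for metric, by_level in decisions.items():
    print(metric, by_level)
```

Each `True` corresponds to a YES in the table (reject the null hypothesis) and each `False` to a NO.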

## Important Definitions
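**Central limit theorem** concerns the sampling distribution of the mean, that is, the distribution of the means of samples taken from the population. It states that this distribution approaches a normal distribution as the sample size grows, whatever the shape of the population distribution (provided its variance is finite)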
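A small simulation can make this concrete. The sketch below, using only the standard library, draws repeated samples from a strongly skewed exponential distribution and checks that the sample means behave roughly like a normal distribution centred on the population mean with spread σ/√n. The sample size, number of samples and seed are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

sample_size = 50    # observations per sample (illustrative choice)
num_samples = 2000  # number of repeated samples (illustrative choice)

# Exponential(1) is strongly skewed, with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
expected_spread = 1 / sample_size ** 0.5  # sigma / sqrt(n)

# For a normal distribution, about 95% of values fall within 1.96 standard errors of the mean.
share_within = sum(abs(m - 1) <= 1.96 * expected_spread for m in sample_means) / num_samples

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"std of sample means:  {spread_of_means:.3f} (theory: {expected_spread:.3f})")
print(f"share within 1.96 standard errors: {share_within:.3f}")
```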

**Confidence interval** communicates how accurate our estimate is likely to be. When we estimate a population parameter, it is good practice to report the estimate as a confidence interval, because the interval conveys the range in which the parameter is likely to lie

Eg: The mean of a population equals .... (WRONG)

The mean of a population lies between ..... (RIGHT)
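As an illustration, a 95% confidence interval for the difference in mean CTR can be built from the group means and standard error printed earlier in the notebook. This is a sketch using the normal approximation and the standard library only, not a cell from the original analysis.

```python
from statistics import NormalDist

# Values printed earlier in the notebook
mean_control = 0.0821292746965395
mean_experiment = 0.08219052099356416
se_pooled = 0.0007276384961568549

confidence = 0.95
# Critical value of the standard normal distribution (about 1.96 for 95%)
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

diff = mean_control - mean_experiment
lower, upper = diff - z_crit * se_pooled, diff + z_crit * se_pooled
print(f"95% CI for the difference in mean CTR: ({lower:.5f}, {upper:.5f})")
```

The interval contains 0, which is consistent with the non-significant result for CTR.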

**Statistically significant** means that we have evidence that the result we see in the sample also exists in the population. We use the p-value to decide whether a result is statistically significant: it indicates whether the effect seen in the sample reflects a real effect in the population or could have occurred by chance or sampling error

**P-Value** is the probability that, if the null hypothesis were true, sampling variation would produce an estimate at least as far from the hypothesised value as the one observed in our data. We compare it with the significance level, alpha, to decide whether to reject or fail to reject the null hypothesis

If P-Value < alpha ---> Reject the Null hypothesis

> BUT

If P-Value >= alpha ---> Do not reject the Null hypothesis