You work for an early-stage startup in Germany. Your team has been working on a redesign of the landing page. The team believes a new design will increase the number of people who click through and join your site. They have been testing the changes for a few weeks, and now they want to measure the impact of the change and need you to determine if the increase can be due to random chance or if it is statistically significant.

## 💪 Challenge
Complete the following tasks:

1. Analyze the conversion rates for each of the four groups: the new/old design of the landing page and the new/old pictures.
2. Can the increases observed be explained by randomness? (Hint: Think A/B test)
3. Which version of the website should they use?

### Evaluation Metric

Before evaluating the test and control groups, the evaluation metric should be clearly defined. Here the metric is the conversion rate, defined by the formula below:

### $ Conversion\hspace{0.15cm}Rate = \frac{num\hspace{0.15cm}converted}{total\hspace{0.15cm}observations} $

In [1]:
#Import necessary modules
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
#Read the data and display the first 5 rows
df = pd.read_csv('redesign.csv')
df.head()

Unnamed: 0,treatment,new_images,converted
0,yes,yes,0
1,yes,yes,0
2,yes,yes,0
3,yes,no,0
4,no,yes,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40484 entries, 0 to 40483
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   treatment   40484 non-null  object
 1   new_images  40484 non-null  object
 2   converted   40484 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 949.0+ KB


There are no missing values, and there are about 40K observations in total (40,484 rows). As a reminder, the column descriptions are:
- "treatment" - "yes" if the user saw the new version of the landing page, no otherwise.
- "new_images" - "yes" if the page used a new set of images, no otherwise.
- "converted" - 1 if the user joined the site, 0 otherwise.

Now let's create a separate boolean column for each of the four combinations.

In [4]:
#Control group who saw the old version of landing page without the new set of images
df['control'] = (df['treatment']=='no') & (df['new_images'] == 'no')
#Users who saw the old version of landing page and the new set of images
df['image_only'] = (df['treatment']=='no') & (df['new_images'] == 'yes')
#Users who saw the new version of landing page but did not see the new set of images
df['land_page_only'] = (df['treatment']=='yes') & (df['new_images'] == 'no')
#Users who saw the new version of landing page and the new set of images
df['both'] = (df['treatment']=='yes') & (df['new_images'] == 'yes')

In [5]:
df.head()

Unnamed: 0,treatment,new_images,converted,control,image_only,land_page_only,both
0,yes,yes,0,False,False,False,True
1,yes,yes,0,False,False,False,True
2,yes,yes,0,False,False,False,True
3,yes,no,0,False,False,True,False
4,no,yes,0,False,True,False,False


In [6]:
#Group the rows by converted value and sum the boolean group columns to count the users in each group
#Store the variables in a temporary Pandas Dataframe
temp = df.groupby('converted').apply(np.sum).iloc[:,-4:]
temp

converted,control,image_only,land_page_only,both
0,9037,8982,8906,8970
1,1084,1139,1215,1151


In [7]:
# Percentage method to get a percentage value of Pandas Series
def pct(series):
    return series/series.sum()

In [8]:
#Apply percentage and sum to temp dataframe and transpose it to get the control group in index
conv = pd.DataFrame([temp.iloc[1], temp.apply(sum),temp.apply(pct).iloc[1]], index=['converted', 'total','convert_pct']).T
conv

Unnamed: 0,converted,total,convert_pct
control,1084.0,10121.0,0.107104
image_only,1139.0,10121.0,0.112538
land_page_only,1215.0,10121.0,0.120047
both,1151.0,10121.0,0.113724


### Comments on conversion counts

* All four groups have the same number of samples (10,121).

* The number and percentage of users who joined is higher than in the control group for all test groups, but we need to analyze further whether this increase is statistically significant or can be explained by random chance.
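
As a cross-check on the table above, the same per-group figures can be obtained with a single `groupby`. The sketch below is only illustrative: it rebuilds a row-per-user frame (here called `toy`, a hypothetical name) from the counts reported above rather than reading `redesign.csv`, so the column names are assumed to match the dataset description.

```python
import pandas as pd

# Counts taken from the summary table above: (treatment, new_images) -> (converted, total)
counts = {
    ('no', 'no'): (1084, 10121),    # control
    ('no', 'yes'): (1139, 10121),   # image_only
    ('yes', 'no'): (1215, 10121),   # land_page_only
    ('yes', 'yes'): (1151, 10121),  # both
}

# Rebuild a row-per-user frame with the same columns as the dataset (illustrative only)
frames = []
for (treatment, new_images), (converted, total) in counts.items():
    frames.append(pd.DataFrame({
        'treatment': treatment,
        'new_images': new_images,
        'converted': [1] * converted + [0] * (total - converted),
    }))
toy = pd.concat(frames, ignore_index=True)

# Conversion rate = number converted / total observations, per group
summary = (
    toy.groupby(['treatment', 'new_images'])['converted']
       .agg(converted='sum', total='count', convert_pct='mean')
)
print(summary)
```

With the real data, replacing `toy` with `df` should give the same summary without the four helper columns.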

## Assumptions of Binomial Distribution

* Two types of outcome (a user either joins the site or does not)
* Events are independent (one user joining the website does not increase or decrease the probability of another user's decision)
* The probability stays the same for all observations in a test (the users in the test are randomly chosen and there is no significant difference between the user groups)

### Rule for using normal distribution

$ \hat{p}\quad=\quad num\hspace{0.15cm}converted\quad/\quad num\hspace{0.15cm}total\hspace{0.15cm}samples$

$ Num\hspace{0.15cm}samples*\hat{p}>= 5 $

$ Num\hspace{0.15cm}samples*(1 - \hat{p}) >= 5 $
  

In [33]:
# probability = converted/total in control samples
p_hat = conv.loc['control', 'convert_pct']
total = conv.loc['control', 'total']
total*p_hat > 5, total*(1-p_hat) > 5

(True, True)

Since both conditions hold, the normal distribution can be used as an approximation to the binomial distribution.

## Hypothesis Testing

We want to see the effect of new layout design and new image set on conversion rate and we are concerned with both positive & negative change. Therefore we should set our hypothesis for a 2-tailed test. 

**Null Hypothesis** = The conversion rate is not different in control group and test group

**Alternate Hypothesis** = Conversion rate of test group is different than conversion rate of control group
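
To build some intuition for what the null hypothesis implies, one option is to simulate it directly. The sketch below is not part of the original analysis: it assumes both groups share one pooled conversion rate, draws many simulated experiments from a binomial distribution, and counts how often the gap between two groups is at least as large as the one observed between the control and image-only groups. The seed and the number of simulations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily for reproducibility

n = 10121                        # users per group (from the table above)
control_conv, image_conv = 1084, 1139
observed_gap = abs(image_conv - control_conv)

# Under H0 both groups share one conversion rate; estimate it by pooling
p_null = (control_conv + image_conv) / (2 * n)

# Simulate many experiments in which H0 is true
n_sims = 20_000
sim_a = rng.binomial(n, p_null, size=n_sims)
sim_b = rng.binomial(n, p_null, size=n_sims)

# Share of simulated experiments with a gap at least as large as the observed one
p_sim = np.mean(np.abs(sim_a - sim_b) >= observed_gap)
print(f'Simulated two-sided p-value: {p_sim:.3f}')
```

The simulated share can be compared with the p-value that the z-test reports for the image-only group later in the notebook.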

### Some formulas for test

**Pooled Probability Formula** - Pooled probability of control and test groups

$ \normalsize{\hat{p} = \frac{ConversionControlGroup + ConversionTestGroup}{NumSamplesControl+NumSamplesTest}}  $ 

**Standard Error** - Standard error expected when using the sample conversion rate as an estimate of the population conversion rate

$ \normalsize{SE = \sqrt{\hat{p}*(1-\hat{p})*(\frac{1}{NumSamplesControl}+\frac{1}{NumSamplesTest})}} $ 

**Conversion Difference** - Difference of conversion rates between test and control groups

$ \normalsize{\hat{d} =  \hat{p}_{test} - \hat{p}_{control}} $


**Margin of error** - Standard error multiplied by the critical value (z-score); it determines the margin of error accepted for H0

$ \normalsize{m = z * StandardError} $


**Test Statistic** - Standardized value calculated from the test, used to determine whether the samples differ significantly

$ \normalsize{TestStatistic = \frac{(p1-p2)}{StandardError}} $


**P-Value** - Probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one calculated.

$ \normalsize{p\text{-}value = 2 * (1 - cdf(|TestStatistic|)) }$
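
The formulas above can be traced step by step on one comparison. This is a minimal sketch that plugs in the counts reported earlier for the control group and the new-landing-page-only group; the variable names are illustrative, and the test statistic is computed as control minus test to match the sign convention used by the function later in the notebook.

```python
import numpy as np
from scipy import stats

# Counts from the summary table: control vs. new landing page only
conv_control, n_control = 1084, 10121
conv_test, n_test = 1215, 10121

p_control = conv_control / n_control
p_test = conv_test / n_test

# Pooled probability under H0
p_pooled = (conv_control + conv_test) / (n_control + n_test)

# Standard error of the difference under H0
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_test))

# Conversion difference, margin of error at 95% confidence, test statistic, p-value
d_hat = p_test - p_control
margin = stats.norm.ppf(0.975) * se
z = (p_control - p_test) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print(f'd_hat={d_hat:.6f}, SE={se:.6f}, margin={margin:.6f}')
print(f'z={z:.4f}, p-value={p_value:.6f}')
```

Here the absolute difference `d_hat` is larger than the margin of error, which is the same statement as the p-value falling below 0.05.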

## Selecting confidence level to identify Z-Score

The z value represents how many standard deviations a certain value lies from the mean.
The confidence level is the area under the standard normal curve between the lower and upper z-scores.

In [34]:
#The scipy percent point function (ppf) can be used for z-score calculation
#Z-scores for a confidence level (e.g. 95%) define the upper and lower boundaries
confidence_level = 0.95
z_score_lower = stats.norm.ppf((1-confidence_level)/2).round(2)
z_score_upper = stats.norm.ppf(confidence_level+((1-confidence_level)/2)).round(2)
print(f'If a confidence level of %95 is expected on a 2 tailed test;\nZ-Score of {z_score_lower} and {z_score_upper} \
must be used for confidence interval calculation')

If a confidence level of %95 is expected on a 2 tailed test;
Z-Score of -1.96 and 1.96 must be used for confidence interval calculation


In [35]:
def check_significance(p1, n1, p2, n2=None, confidence_level=0.95):
    '''Method to apply 2 tailed test and check if there is significant change in conversion'''
    ##Calculate Z Scores
    z_score_lower = stats.norm.ppf((1-confidence_level)/2)
    z_score_upper = stats.norm.ppf(confidence_level+(1-confidence_level)/2)
    
    # If the second (test) sample size is not given, assume equal sample sizes
    if n2 is None:
        n2 = n1
    
    #Calculate the Pooled Probability
    p_hat = (p1*n1 + p2*n2) / (n1+n2)
    
    #Calculate Standard Error
    SE = np.sqrt(p_hat * (1-p_hat) * ((1/n1)+(1/n2)))
    
    #Calculate margin of error and scale by sample size to get the range of conversion counts consistent with H0 (centered on p1)
    margin_of_error_1 = z_score_upper*SE
    margin_of_error_2 = z_score_lower*SE
    upper = n2*(p1+margin_of_error_1)
    lower = n2*(p1+margin_of_error_2)
    
    #Calculate Test Statistic
    test_statistic = (p1-p2) / SE
    
    #Calculate P-Value
    p_val = (1-stats.norm.cdf(abs(test_statistic)))*2
    
    #Calculate Conversion for Test Sample
    conversion = p2*n2
    
    #Display ci, conversion, test-statistic and p-val
    print(f'Confidence interval is between {lower:.2f} and {upper:.2f}')
    print(f'Conversion Value for Test Set is {conversion}')
    print(f'Test Statistic: {test_statistic}, P Value: {p_val}\n')
    
    #Return whether the conversion is outside confidence interval of Ho (significant)
    return (conversion>upper) or (conversion<lower)

In [36]:
#Since sample sizes are equal, below formula checks the significance for conversion against control group
def temp_test(conversion):
    p1 = conv.loc['control', 'convert_pct']
    p2 = conversion
    n1 = conv.loc['control', 'total']
    return check_significance(p1, n1, p2, None, 0.95)

In [37]:
#Create an additional column to show whether there is a statistically significant change
conv['significant_change'] = conv['convert_pct'].apply(temp_test)
conv

Confidence interval is between 997.77 and 1170.23
Conversion Value for Test Set is 1084.0
Test Statistic: 0.0, P Value: 1.0

Confidence interval is between 996.81 and 1171.19
Conversion Value for Test Set is 1139.0
Test Statistic: -1.2363867031898539, P Value: 0.2163148562938333

Confidence interval is between 995.52 and 1172.48
Conversion Value for Test Set is 1215.0
Test Statistic: -2.9018903061123846, P Value: 0.0037091839675174043

Confidence interval is between 996.61 and 1171.39
Conversion Value for Test Set is 1151.0
Test Statistic: -1.5025954414248646, P Value: 0.13294339963478086



Unnamed: 0,converted,total,convert_pct,significant_change
control,1084.0,10121.0,0.107104,False
image_only,1139.0,10121.0,0.112538,False
land_page_only,1215.0,10121.0,0.120047,True
both,1151.0,10121.0,0.113724,False


### Comments on table after tests

* From the table above we can see that there is no significant change from adding the **new image set** to the website with the current landing page.

* There is a significant change for users who saw the **new landing page without the new image set**. For this option, conversion is higher than in the control samples. In other words, at the 95% confidence level, the new landing page design has a higher conversion rate.
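
Beyond a yes/no significance flag, it can help to look at a confidence interval for the size of the lift itself. The sketch below is an addition, not the notebook's own method: it uses the unpooled standard error (a common choice for intervals on a difference of proportions) with the counts reported above for the control and new-landing-page-only groups.

```python
import numpy as np
from scipy import stats

n = 10121                 # users per group
p_control = 1084 / n      # control conversion rate
p_test = 1215 / n         # new landing page only

# Unpooled standard error of the difference (no assumption that the rates are equal)
se_diff = np.sqrt(p_control * (1 - p_control) / n + p_test * (1 - p_test) / n)

# 95% confidence interval for the absolute lift in conversion rate
z_crit = stats.norm.ppf(0.975)
diff = p_test - p_control
lower, upper = diff - z_crit * se_diff, diff + z_crit * se_diff

print(f'Lift: {diff:.4f}, 95% CI: [{lower:.4f}, {upper:.4f}]')
```

An interval that excludes zero agrees with the significant result in the table, and its width gives a sense of how precisely the lift is estimated.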


Since the new images could have a different effect when combined with the new landing page, let's take the new-landing-page-only group as our control group and test whether adding the image set causes a significant change in conversion.

In [38]:
# We take the users who saw only the new landing page as the control group, and users who saw both the new landing page and the new images as the test group
p1 = conv.loc['land_page_only', 'convert_pct']
p2 = conv.loc['both', 'convert_pct']
n1 = conv.loc['land_page_only', 'total']
check_significance(p1, n1, p2, None, 0.95)

Confidence interval is between 1125.41 and 1304.59
Conversion Value for Test Set is 1151.0
Test Statistic: 1.400116394982091, P Value: 0.1614784662559714



False

In [23]:
#Another way to run a two-sample test is the ttest_ind function in the scipy.stats package
#We can pass the two arrays of conversion outcomes directly; with samples this large the t-test closely matches the z-test above
a = df.loc[df['land_page_only']==True, 'converted']
b = df.loc[df['both']==True, 'converted']
stats.ttest_ind(a, b)

Ttest_indResult(statistic=1.4001150227427406, pvalue=0.1614942052210431)

The value is not outside the confidence interval and the p-value (0.16) is above alpha=0.05, so we cannot reject the null hypothesis.
In other words, with these samples, we cannot statistically show an effect of adding the new image set to the website with the new landing page.
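
A further cross-check, offered here as an optional sketch rather than part of the original analysis, is a chi-square test on the 2x2 table of converted and non-converted users. Without the continuity correction, the chi-square statistic of a 2x2 table equals the square of the pooled two-proportion z statistic, so the p-value should agree with the z-test above. The counts are those reported earlier for the landing-page-only and both groups.

```python
import numpy as np
from scipy import stats

# Rows: land_page_only, both. Columns: converted, not converted.
table = np.array([
    [1215, 10121 - 1215],
    [1151, 10121 - 1151],
])

# correction=False disables the Yates continuity correction so the result matches the z-test
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print(f'chi2={chi2:.4f}, sqrt(chi2)={np.sqrt(chi2):.4f}, p-value={p_value:.6f}')
```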

### Conclusion

After evaluating the conversion figures in all test groups, we can conclude (at the 95% confidence level) that the new landing page design has a positive effect on conversion rate, whereas the new image set does not have a statistically significant impact.

Decision of action on the other hand can be made according to comparison of increase in conversion rate and costs for launching the new landing page on the website.

For the new image set, further tests could be run with more samples, which would decrease the margin of error and increase the statistical power of the test. Then, looking at the direction and magnitude of the change, a decision on the image set can be made.

### Additional Analysis for Statistical power & Sample Size

Statistical power is the probability that the test detects a change in the evaluation metric when one truly exists. With more samples, the statistical power (sensitivity) gets higher, which means the probability of a false negative (missing a real change) is lower.

For the sample size study we take the control group and the test group of users who saw only the new image set.

In [39]:
p1 = conv.loc['control', 'convert_pct']
p2 = conv.loc['image_only', 'convert_pct']
n1 = conv.loc['control', 'total']
check_significance(p1, n1, p2, None, 0.95)

Confidence interval is between 996.81 and 1171.19
Conversion Value for Test Set is 1139.0
Test Statistic: -1.2363867031898539, P Value: 0.2163148562938333



False

In [40]:
pct_change = (conv.loc['image_only', 'convert_pct']-conv.loc['control', 'convert_pct']) / conv.loc['control', 'convert_pct']
pct_change

0.050738007380073856

There is a 5.07% relative change in conversion rate between the control group and the new image set group, but this change is not statistically significant at the alpha=0.05 level. Our current sample sizes are 10,121 each.

Let's calculate the sample sizes that would be required to identify this change. 


In [28]:
##These methods are adapted from the Datacamp Customer Analytics and AB Testing course

def get_power(n, p1, p2, cl):
    '''Returns the power of our hypothesis test by number of samples,probabilities (conversion_rate) and confidence level'''
    alpha = 1 - cl
    
    qu = stats.norm.ppf(1 - alpha/2)
    
    diff = abs(p2 - p1)
    bp = (p1 + p2) / 2
    
    v1 = p1 * (1 - p1)
    v2 = p2 * (1 - p2)
    
    bv = bp * (1 - bp)
    
    power_part_one = stats.norm.cdf((n**0.5 * diff - qu * (2 * bv)**0.5)/ (v1 + v2)**0.5)
    power_part_two = 1 - stats.norm.cdf((n**0.5 * diff + qu * (2 * bv)**0.5)/ (v1 + v2)**0.5)
    
    power = power_part_one + power_part_two
    return(power)
    
    
def get_sample_size(power, p1, p2, cl, max_n=1000000):
    '''calculates the power for different test size until it reaches the desired power level, and returns the test size'''
    n = 1
    while n <= max_n:
        tmp_power = get_power(n, p1, p2, cl)
        if tmp_power >= power:
            return n
        else:
            n = n + 1

In [29]:
#Sample size required for our hypothesis test to reject H0 80% of the time when H1 is true, considering our test-control conversions
power = 0.8
size = get_sample_size(power, p1, p2, 0.95)
size

51966

This means that, to detect a 5.07% relative change in conversion rate 80% of the time, we would need about 52K samples in each group (roughly five times the current 10,121).
Therefore, collecting more data for the image-set test might be a good idea to better analyse its effect.
Other ways to increase power are lowering the confidence level (accepting a higher false-positive rate) or applying a one-tailed test (if we are only concerned with change in one direction).
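
To see how these levers move power in practice, the sketch below evaluates the same normal-approximation formula as `get_power` for a few settings. `approx_power` is a hypothetical helper written for this illustration (with an added `two_sided` switch that the original function does not have), and the conversion rates are the control and image-only figures reported above.

```python
import numpy as np
from scipy import stats

def approx_power(n, p1, p2, confidence_level=0.95, two_sided=True):
    '''Approximate power of a two-proportion z-test with n users per group.'''
    alpha = 1 - confidence_level
    q = stats.norm.ppf(1 - alpha / 2) if two_sided else stats.norm.ppf(1 - alpha)
    diff = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    sd_null = np.sqrt(2 * p_bar * (1 - p_bar))        # spread assumed under H0
    sd_alt = np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # spread under H1
    power = stats.norm.cdf((np.sqrt(n) * diff - q * sd_null) / sd_alt)
    if two_sided:
        # Small extra term: rejecting H0 in the opposite direction
        power += 1 - stats.norm.cdf((np.sqrt(n) * diff + q * sd_null) / sd_alt)
    return power

p1, p2 = 1084 / 10121, 1139 / 10121   # control vs. image_only conversion rates

# Power as the per-group sample size grows (95% confidence, two-tailed)
for n in (10121, 25000, 52000):
    print(f'n={n}: power={approx_power(n, p1, p2):.3f}')

# Power at the current sample size under different test settings
print(f'90% confidence: {approx_power(10121, p1, p2, confidence_level=0.90):.3f}')
print(f'one-tailed:     {approx_power(10121, p1, p2, two_sided=False):.3f}')
```

Comparing the printed values shows how sample size, confidence level and the choice of a one-tailed test each trade off against the false-positive rate.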

### Online sources

* https://campus.datacamp.com/courses/customer-analytics-and-ab-testing-in-python/the-design-and-application-of-ab-testing - AB Testing Course
* https://support.minitab.com/en-us/minitab-express/1/ - Formulas for calculation
* https://en.wikipedia.org/wiki/Statistical_hypothesis_testing - Hypothesis Testing
* https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python - Power Analysis
* https://classroom.udacity.com/courses/ud257 - AB Testing Course