# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The scenario you'll be investigating is data collected from the homepage of Audacity, a music app.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
    * Start investigating the id column:
        * How many viewers also clicked?
        * Are there any anomalies with the data; did anyone click who didn't view?
        * Is there any overlap between the control and experiment groups? 
            * If so, how do you plan to account for this in your experimental design?

In [57]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math
import random
import seaborn as sns
import statsmodels.api as sm 
from statsmodels.formula.api import ols

In [58]:
df=pd.read_csv('homepage_actions.csv')
df.head()

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
3,2016-09-24 19:59:02.646620,671993,control,view
4,2016-09-24 20:26:14.466886,536734,experiment,view


In [59]:
df.groupby('group').count()

Unnamed: 0_level_0,timestamp,id,action
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
control,4264,4264,4264
experiment,3924,3924,3924


In [60]:
df.groupby('action').count()

Unnamed: 0_level_0,timestamp,id,group
action,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
click,1860,1860,1860
view,6328,6328,6328


### How many users clicked and how many only viewed?

In [61]:
ids=df.groupby('id').count()
ids

Unnamed: 0_level_0,timestamp,group,action
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
182988,1,1,1
182994,2,2,2
183089,1,1,1
183136,1,1,1
183141,2,2,2
...,...,...,...
937003,1,1,1
937073,1,1,1
937108,1,1,1
937139,2,2,2


In [62]:
# Some users both clicked and viewed.

In [63]:
ids.groupby('action').count()

Unnamed: 0_level_0,timestamp,group
action,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4468,4468
2,1860,1860


In [64]:
"""This shows us that 4468 users only viewed, while 1860 also clicked"""

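
The same per-user breakdown can also be obtained by cross-tabulating ids against actions. The sketch below runs on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `homepage_actions.csv`); on the real data you would pass `df` instead.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of homepage_actions.csv
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 4],
    "group": ["control", "control", "experiment", "experiment", "experiment", "control"],
    "action": ["view", "click", "view", "view", "click", "view"],
})

# One row per user, one column per action; each cell holds a row count
per_user = pd.crosstab(toy["id"], toy["action"])

viewed_only = int(((per_user["view"] > 0) & (per_user["click"] == 0)).sum())
viewed_and_clicked = int(((per_user["view"] > 0) & (per_user["click"] > 0)).sum())
print(viewed_only, viewed_and_clicked)
```

Counting per user rather than per row keeps the two questions (who viewed, who also clicked) separate from how many rows each user contributes.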

### Are there anomalies: did anybody click without viewing?

In [65]:
ids.groupby('group').count()

Unnamed: 0_level_0,timestamp,action
group,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4468,4468
2,1860,1860


In [66]:
clicks=df.loc[df['action']=='click']
views=df.loc[df['action']=='view']

In [67]:
ids=clicks['id']

In [68]:
result = ids.isin(views['id'])
result.value_counts()

True    1860
Name: id, dtype: int64

In [69]:
# This shows that there are no anomalies: everyone who clicked also viewed.
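
An equivalent way to run this anomaly check is a set difference between the ids that clicked and the ids that viewed. This is only a sketch on invented rows (`toy` is hypothetical, and user 3 is deliberately given a click without a view so the check has something to find).

```python
import pandas as pd

# Hypothetical rows; user 3 clicks without a recorded view
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "action": ["view", "click", "view", "click"],
})

clickers = set(toy.loc[toy["action"] == "click", "id"])
viewers = set(toy.loc[toy["action"] == "view", "id"])

# Ids that appear among the clicks but never among the views
clicked_without_view = clickers - viewers
print(clicked_without_view)
```

An empty set here would correspond to the all-`True` result of `isin` above.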

### Are there users present in both groups?

In [70]:
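# .copy() makes these independent DataFrames, so columns can be added later without a SettingWithCopyWarning
control=df.loc[df['group']=='control'].copy()
experiment=df.loc[df['group']=='experiment'].copy()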
cids=control['id']
eids=experiment['id']

In [71]:
result = cids.isin(experiment['id'])
result.value_counts()

False    4264
Name: id, dtype: int64
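
Another way to look for overlap is to count how many distinct groups each id belongs to. The following is a sketch on made-up rows (`toy` is an assumption for illustration; user 3 is placed in both groups on purpose).

```python
import pandas as pd

# Hypothetical rows; user 3 appears in both groups
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "group": ["control", "control", "experiment", "control", "experiment"],
})

# Number of distinct groups each id was assigned to
groups_per_user = toy.groupby("id")["group"].nunique()

# Any id with more than one group was exposed to both homepages
in_both = groups_per_user[groups_per_user > 1].index.tolist()
print(in_both)
```

If this list were non-empty on the real data, those users would need to be dropped or assigned to a single group before testing.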

In [74]:
"""There are no users present in both groups. Users that appear in two rows do so because they performed two
actions, a view and a click, each with its own timestamp, as the per-id counts below also show."""

In [75]:
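# `ids` was reassigned to the click ids in In [67], so rebuild the per-id counts here
df.groupby('id').count()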

Unnamed: 0_level_0,timestamp,group,action
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
182988,1,1,1
182994,2,2,2
183089,1,1,1
183136,1,1,1
183141,2,2,2
...,...,...,...
937003,1,1,1
937073,1,1,1
937108,1,1,1
937139,2,2,2


## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [76]:
"""A Welch's t-test makes sense here, to see whether there is a significant difference in click rate
between the control group and the experimental group."""

In [78]:
"""The null hypothesis is that there is no difference in click rate between the experimental group
and the control group.
The alternative hypothesis is that there is a difference, and in particular that the experimental
group has a higher click rate than the control group (a one-sided test)."""
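
Written symbolically, with $p_c$ and $p_e$ as assumed notation (not used elsewhere in this lab) for the click rates of the control and experiment groups, the one-sided test reads:

$$H_0: p_e = p_c \qquad \text{versus} \qquad H_1: p_e > p_c$$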

In [81]:
control['clicks']=0
experiment['clicks']=0


In [82]:
control.loc[control['action']=='click', 'clicks']=1



In [83]:
control['clicks'].value_counts()

0    3332
1     932
Name: clicks, dtype: int64

In [84]:
experiment.loc[experiment['action']=='click', 'clicks']=1



In [85]:
experiment['clicks'].value_counts()

0    2996
1     928
Name: clicks, dtype: int64

In [86]:
# import flatiron_stats as fs ## for some reason this doesn't work so I'm gonna import single functions

In [87]:
def welch_t(a, b):
    """Return the absolute value of Welch's t statistic for samples a and b."""
    numerator = np.abs(np.mean(a) - np.mean(b))
    denominator = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return numerator / denominator

In [88]:
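def welch_df(a, b):
    """Return the Welch-Satterthwaite degrees of freedom for samples a and b."""
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    numerator = (var_a / len(a) + var_b / len(b)) ** 2
    denominator = (var_a ** 2 / (len(a) ** 2 * (len(a) - 1))
                   + var_b ** 2 / (len(b) ** 2 * (len(b) - 1)))
    return numerator / denominator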

In [89]:
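def p_value(a, b, two_sided=False):
    """Return the p-value of Welch's t-test for samples a and b.

    welch_t returns an absolute value, so the one-sided result does not
    check the direction of the difference."""
    t = welch_t(a, b)
    df = welch_df(a, b)
    p = 1 - stats.t.cdf(t, df)  # upper-tail probability
    return 2 * p if two_sided else p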
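
As a cross-check on hand-written helpers like these, SciPy's `stats.ttest_ind` runs Welch's test when called with `equal_var=False`. The sketch below assumes SciPy 1.6 or newer (for the `alternative` argument) and uses simulated 0/1 data, not the lab's dataset; the names `a`, `b`, `t_manual`, and `dof` are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up 0/1 click indicators, not the lab data
a = rng.binomial(1, 0.20, size=500)  # stand-in for control
b = rng.binomial(1, 0.30, size=450)  # stand-in for experiment

# Welch's t statistic and Welch-Satterthwaite degrees of freedom, by hand
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
t_manual = (b.mean() - a.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
p_manual = stats.t.sf(t_manual, dof)

# SciPy's built-in Welch test; H1 is "mean of b is greater than mean of a"
t_scipy, p_scipy = stats.ttest_ind(b, a, equal_var=False, alternative="greater")
print(t_manual, t_scipy, p_manual, p_scipy)
```

Unlike the helper above, this version keeps the sign of the statistic, so the one-sided p-value depends on which group has the higher mean.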

In [90]:
p_value(control['clicks'], experiment['clicks'], two_sided=False)

0.026743886922199422

In [91]:
"""With p ≈ 0.027, below the conventional significance level of 0.05, the difference in click rate
between the two groups is statistically significant."""

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not.
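
One way to build that per-individual binary variable is to collapse the action log to one row per user. The sketch below uses invented rows (`toy`, `per_user`, and `click_rate` are hypothetical names). Note that it differs from the row-based `clicks` column created earlier, which treats each view row and each click row as a separate observation.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of homepage_actions.csv
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 4],
    "group": ["control", "control", "control", "experiment", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "click", "view"],
})

# One row per user: 1 if any of that user's rows is a click, else 0
per_user = (
    toy.assign(clicked=(toy["action"] == "click").astype(int))
       .groupby(["group", "id"], as_index=False)["clicked"]
       .max()
)

# Share of users in each group who clicked
click_rate = per_user.groupby("group")["clicked"].mean()
print(click_rate)
```

With one row per user, the group size in the denominator is the number of viewers rather than the number of logged actions.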

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1 - p)$$
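
A quick simulation can make this formula concrete. The values of `n` and `p` below are arbitrary illustrations, not taken from the lab data; the sketch only checks that the sample variance of many binomial draws lands close to the theoretical value.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 0.2  # illustrative values, not taken from the lab data

# Each draw is the number of successes in n independent trials with success probability p
successes = rng.binomial(n, p, size=200_000)

theoretical = n * p * (1 - p)
empirical = successes.var()
print(theoretical, empirical)
```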

Given this, perform 3 steps to verify the results of your statistical test:

* Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group.
* Calculate the number of standard deviations that the actual number of clicks was from this estimate.
* Finally, calculate a p-value using the normal distribution based on this z-score.

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [92]:
totclicks=control['clicks'].sum()
totview=len(control['clicks'])

In [93]:
clickrate=totclicks/totview
clickrate

0.21857410881801126

In [94]:
experclicks=experiment['clicks'].sum()

In [95]:
expertot=len(experiment['clicks'])

In [96]:
expctd=clickrate*expertot
expctd

857.6848030018762

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [97]:
variance=totview*(clickrate)*(1-clickrate)
variance

728.2889305816135

In [98]:
std=np.sqrt(variance)
std

26.986828835222813

In [99]:
difference=experclicks-expctd
num=difference/std
num

2.6055375912248526

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [101]:
p=1-stats.norm.cdf(num)
p

0.004586510360224505
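
One point worth checking in Step 2: the variance above is computed with `totview`, the control group's row count, while the expected click count refers to the experiment group. A sketch of the same three steps with the binomial n set to the experiment group's size follows. It uses the counts reported earlier in this notebook (932 clicks in 4264 control rows, 928 clicks in 3924 experiment rows); the variable names are illustrative, and this is an alternative reading, not necessarily the intended solution.

```python
import numpy as np
from scipy import stats

# Counts reported earlier in this notebook
control_clicks, control_n = 932, 4264
exper_clicks, exper_n = 928, 3924

# Step 1: expected experiment clicks at the control click rate
rate = control_clicks / control_n
expected = rate * exper_n

# Step 2: binomial standard deviation with n set to the experiment group's size
std_exper = np.sqrt(exper_n * rate * (1 - rate))
z = (exper_clicks - expected) / std_exper

# Step 3: upper-tail p-value under the normal approximation
p_upper = stats.norm.sf(z)
print(z, p_upper)
```

Because the experiment group is smaller than the control group, this standard deviation is smaller and the z-score somewhat larger than the value computed above; the conclusion at the usual significance levels is unchanged.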

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

In [102]:
"""The p-value is lower in this case but still consistent with the result found above: it is below alpha,
so we can reject the null hypothesis and conclude that the new website layout produces a statistically
significant difference in clicks."""

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the initial problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the central limit theorem, standard deviation, z-scores, and their accompanying p-values.