# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The data you'll investigate was collected from the homepage of a music app, Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
    * Start investigating the id column:
        * How many viewers also clicked?
        * Are there any anomalies with the data; did anyone click who didn't view?
        * Is there any overlap between the control and experiment groups? 
            * If so, how do you plan to account for this in your experimental design?
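
The checks in the hints above can be sketched with set operations and a `groupby`. This is a minimal sketch, not the lab's reference solution: the `toy` frame is a hypothetical stand-in with the same `id`, `group`, and `action` columns as `homepage_actions.csv`, so the same lines should carry over to the real `data` frame.

```python
import pandas as pd

# Hypothetical toy rows (assumption) mirroring the columns of homepage_actions.csv
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 4],
    "group": ["experiment", "experiment", "control",
              "control", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "view", "click"],
})

viewer_ids = set(toy.loc[toy["action"] == "view", "id"])
clicker_ids = set(toy.loc[toy["action"] == "click", "id"])

# How many viewers also clicked?
n_view_and_click = len(viewer_ids & clicker_ids)

# Anomaly check: ids that clicked without a recorded view
clicked_without_view = clicker_ids - viewer_ids

# Overlap check: ids that appear in more than one group
groups_per_id = toy.groupby("id")["group"].nunique()
overlap_ids = groups_per_id[groups_per_id > 1].index.tolist()

print(n_view_and_click, clicked_without_view, overlap_ids)
```

An empty `clicked_without_view` set and an empty `overlap_ids` list would indicate that neither anomaly is present.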

In [1]:
#Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
data = pd.read_csv("homepage_actions.csv")
data.head()

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
3,2016-09-24 19:59:02.646620,671993,control,view
4,2016-09-24 20:26:14.466886,536734,experiment,view


In [3]:
data.info()
# number of entries and the data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  8188 non-null   object
 1   id         8188 non-null   int64 
 2   group      8188 non-null   object
 3   action     8188 non-null   object
dtypes: int64(1), object(3)
memory usage: 256.0+ KB


In [4]:
data.value_counts("id")
#number of unique id's = 6328

id
937217    2
575502    2
583751    2
583515    2
583153    2
         ..
568845    1
568937    1
568992    1
569142    1
182988    1
Length: 6328, dtype: int64

In [5]:
exp_pple = data.loc[data['group'] == "experiment"]
len(exp_pple.value_counts("id"))
# number of unique ids in the experimental group

2996

In [6]:
exp_click = exp_pple.loc[exp_pple["action"] == "click"]
len(exp_click.value_counts("id"))
# number of unique experimental ids that clicked


928

In [7]:
exp_viewers = exp_pple.loc[exp_pple["action"] != "click"]
# view rows for the experimental group (every row whose action is not "click")

In [8]:
exp_viewers

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
4,2016-09-24 20:26:14.466886,536734,experiment,view
5,2016-09-24 20:32:25.712659,681598,experiment,view
...,...,...,...,...
8176,2017-01-18 07:07:50.090346,540466,experiment,view
8179,2017-01-18 08:53:50.910310,615849,experiment,view
8183,2017-01-18 09:11:41.984113,192060,experiment,view
8184,2017-01-18 09:42:12.844575,755912,experiment,view


In [9]:
exp_click_id = exp_click["id"].unique()
#experimental click ids

In [10]:
exp_click_id

array([349125, 601714, 487634, 468601, 555973, 444902, 269335, 596892,
       653403, 922848, 283438, 370483, 826660, 246990, 832871, 839729,
       332165, 691753, 226057, 852585, 366151, 835449, 408715, 489523,
       349239, 828266, 825953, 786605, 420863, 755354, 619230, 838042,
       378232, 638030, 539338, 251716, 649807, 884709, 615186, 559646,
       700205, 619415, 645349, 432416, 696656, 297965, 430193, 399915,
       918596, 826501, 737683, 395723, 578003, 419220, 687985, 655266,
       206558, 279949, 559647, 424505, 766624, 378476, 825944, 335999,
       292135, 311308, 883141, 768170, 565462, 897647, 261705, 676381,
       832343, 831394, 246836, 305577, 411360, 183938, 520428, 215711,
       629121, 324368, 913773, 330972, 594134, 654566, 914320, 901121,
       355128, 843456, 451893, 191118, 925767, 432786, 603861, 712674,
       540722, 838743, 582967, 247961, 589699, 292642, 303720, 804645,
       360932, 714267, 288313, 541646, 277266, 414354, 314594, 277682,
      

In [11]:
exp_non_click = set(exp_viewers["id"]).difference(set(exp_click["id"]))
len(exp_non_click)
#number of experimental viewers who did not click

2068

In [12]:
ctrl_pple = data.loc[data['group'] == "control"]
len(ctrl_pple.value_counts("id"))
# number of unique ids in the control group

3332

In [13]:
ctrl_click = ctrl_pple.loc[ctrl_pple["action"] == "click"]
len(ctrl_click.value_counts("id"))
# number of unique control ids that clicked


932

In [14]:
ctrl_viewers = ctrl_pple.loc[ctrl_pple["action"] != "click"]
len(ctrl_viewers)
# number of view rows in the control group; it matches the 3332 unique control ids,
# which is consistent with every control clicker also having a view

3332

In [15]:
ctrl_click_id = ctrl_click["id"].unique()
#control click ids

In [16]:
ctrl_non_click = set(ctrl_viewers["id"]).difference(set(ctrl_click["id"]))
len(ctrl_non_click)
# number of control viewers who did not click

2400

In [17]:
tot_viewers_non_click = exp_non_click.union(ctrl_non_click)
len(tot_viewers_non_click)

4468
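
The counts above are enough to compare the two groups directly. The sketch below only restates numbers already printed in this notebook (the variable names are illustrative) and derives each group's click-through rate. It also checks that the two group sizes add up to the 6328 unique ids found earlier, which is what you would expect if no id appears in both groups.

```python
# Counts copied from the cell outputs above
exp_ids, exp_clickers = 2996, 928
ctrl_ids, ctrl_clickers = 3332, 932
total_unique_ids = 6328

# Group sizes summing to the total is consistent with no overlap between groups
no_overlap = (exp_ids + ctrl_ids) == total_unique_ids

# Click-through rate = unique clickers / unique viewers
exp_ctr = exp_clickers / exp_ids
ctrl_ctr = ctrl_clickers / ctrl_ids

print(no_overlap)
print(f"experiment CTR: {exp_ctr:.3f}")  # 0.310
print(f"control CTR:    {ctrl_ctr:.3f}")  # 0.280
```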

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [18]:
# Create experimental and control data frames indexed by id, with binary columns for clicking or not clicking

In [19]:
#Your code here

exp_pple_df = pd.DataFrame(exp_pple["id"].unique(),columns=["id"],index=exp_viewers.index)
exp_pple_df

Unnamed: 0,id
0,804196
1,434745
2,507599
4,536734
5,681598
...,...
8176,540466
8179,615849
8183,192060
8184,755912


In [40]:
exp_pple_df["view"] = 1
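exp_pple_df["click"] = (exp_pple["action"].replace({"click":1,"view":0}))
# NOTE: this assignment aligns on the row index. exp_pple_df is indexed by the
# view rows only, so every id picks up a "view" action and click ends up 0 for all rows.
exp_pple_df["no_click"]= (exp_pple_df["view"])-(exp_pple_df["click"])
# creating binary columns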

In [21]:
#exp_pple_df = exp_pple_df.set_index("id")

In [37]:
#exp_pple_df["click"] = exp_pple["action"].replace({"click":True,"view":False}).astype(int)
exp_pple_df

Unnamed: 0,id,view,click,no_click
0,804196,1,0,1
1,434745,1,0,1
2,507599,1,0,1
4,536734,1,0,1
5,681598,1,0,1
...,...,...,...,...
8176,540466,1,0,1
8179,615849,1,0,1
8183,192060,1,0,1
8184,755912,1,0,1


In [38]:
exp_pple_df["click"].value_counts()

0    2996
Name: click, dtype: int64
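
The all-zero count above does not match the 928 experimental clickers found during exploration. One way to build the binary column is to test id membership instead of relying on index alignment. The sketch below uses a hypothetical `toy` frame and a `per_user` name that are not part of the notebook; applied to the real frames, the equivalent line would be along the lines of `exp_pple_df["id"].isin(exp_click_id).astype(int)`.

```python
import pandas as pd

# Hypothetical experiment-group rows (assumption), same columns as the real data
toy = pd.DataFrame({
    "id": [10, 10, 11, 12, 12],
    "action": ["view", "click", "view", "view", "click"],
})

views = toy[toy["action"] == "view"]
clicks = toy[toy["action"] == "click"]

# One row per unique viewer id
per_user = pd.DataFrame({"id": views["id"].unique()})
per_user["view"] = 1
# 1 if the id appears among the click rows, else 0
per_user["click"] = per_user["id"].isin(clicks["id"]).astype(int)
per_user["no_click"] = per_user["view"] - per_user["click"]

print(per_user)
```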

In [45]:
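# Debugging attempt: the 0/1 values are treated as row labels here,
# so .loc returns rows 0 and 1 over and over instead of filtering by click
exp_viewers.loc[exp_pple["action"].replace({"click":1,"view":0})]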

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
0,2016-09-24 17:42:27.839496,804196,experiment,view
0,2016-09-24 17:42:27.839496,804196,experiment,view
0,2016-09-24 17:42:27.839496,804196,experiment,view
0,2016-09-24 17:42:27.839496,804196,experiment,view
...,...,...,...,...
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
0,2016-09-24 17:42:27.839496,804196,experiment,view
0,2016-09-24 17:42:27.839496,804196,experiment,view


In [35]:
exp_pple_df["click"].value_counts()

0    2996
Name: click, dtype: int64

In [26]:
# for i in range(len(exp_pple_df)):
#     for j in range(len(exp_click)):
#         if exp_pple_df.iloc[i]["id"] == exp_click.iloc[j]["id"]:
#             exp_pple_df.iloc[i]["click"].replace(to)
# return exp_pple_df["click"]

In [27]:
# exp_pple_df[exp_pple_df["id"].isin(exp_click_id).replace(to_replace=exp_pple_df)]

In [28]:
import flatiron_stats as fs

ModuleNotFoundError: No module named 'flatiron_stats'
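
Since `flatiron_stats` is not installed in this environment, one possible alternative is Welch's t-test from SciPy on the binary click indicators. This is a sketch under stated assumptions, not necessarily the test the lab intends: the two arrays are rebuilt from the counts found during exploration (928 of 2996 experimental ids clicked, 932 of 3332 control ids clicked), and the one-sided p-value is obtained by halving the two-sided result when the t-statistic is positive.

```python
import numpy as np
from scipy import stats

# Binary click indicators rebuilt from the counts in the exploratory analysis
exp_clicks = np.concatenate([np.ones(928), np.zeros(2996 - 928)])
ctrl_clicks = np.concatenate([np.ones(932), np.zeros(3332 - 932)])

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_two_sided = stats.ttest_ind(exp_clicks, ctrl_clicks, equal_var=False)

# One-sided alternative: experiment click rate is greater than control
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(t_stat, p_one_sided)
```

A one-sided p-value below the chosen significance level (0.05 is a common choice) would suggest the experimental homepage had a higher click-through rate.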

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
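
Written out, the three steps amount to the following, where the symbols are notation introduced here for illustration: $n_{exp}$ is the number of individuals in the experiment group, $x_{exp}$ is its observed number of clicks, $\hat{p}_{ctrl}$ is the control click-through rate, and $\Phi$ is the standard normal CDF.

$$\mu = n_{exp} \cdot \hat{p}_{ctrl}, \qquad \sigma = \sqrt{n_{exp} \cdot \hat{p}_{ctrl} \cdot (1-\hat{p}_{ctrl})}, \qquad z = \frac{x_{exp} - \mu}{\sigma}, \qquad p = 1 - \Phi(z)$$

The last expression is the one-sided p-value for the experiment group having a higher click-through rate than the control group.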

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [None]:
#Your code here
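
One way to approach this step, assuming the unique-id counts from the exploratory analysis are the right inputs (the variable names are illustrative):

```python
# Counts taken from the exploratory analysis earlier in this notebook
n_exp = 2996       # unique ids in the experiment group
n_ctrl = 3332      # unique ids in the control group
ctrl_clicks = 932  # unique control ids that clicked

# Control click-through rate applied to the experiment group size
ctrl_rate = ctrl_clicks / n_ctrl
expected_exp_clicks = n_exp * ctrl_rate

print(round(expected_exp_clicks, 2))  # 838.02
```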

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [None]:
#Your code here
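
A possible sketch for this step, again using the counts from the exploratory analysis and treating the control click-through rate as the binomial parameter $p$ (an assumption that ignores sampling error in the control rate itself):

```python
import math

# Counts taken from the exploratory analysis earlier in this notebook
n_exp, exp_clicks = 2996, 928
n_ctrl, ctrl_clicks = 3332, 932

p = ctrl_clicks / n_ctrl               # control click-through rate
expected = n_exp * p                   # expected experiment clicks under that rate
std = math.sqrt(n_exp * p * (1 - p))   # binomial standard deviation

# Number of standard deviations between observed and expected clicks
z_score = (exp_clicks - expected) / std

print(round(z_score, 2))
```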

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [None]:
#Your code here
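
A minimal sketch for this step using only the standard library. It recomputes the z-score from the same counts so the cell runs on its own, then takes the upper tail of the standard normal distribution as a one-sided p-value. `scipy.stats.norm.sf` would be an equivalent choice.

```python
import math
from statistics import NormalDist

# Counts taken from the exploratory analysis earlier in this notebook
n_exp, exp_clicks = 2996, 928
n_ctrl, ctrl_clicks = 3332, 932

p = ctrl_clicks / n_ctrl
z_score = (exp_clicks - n_exp * p) / math.sqrt(n_exp * p * (1 - p))

# One-sided p-value: probability of a z-score at least this large under the null
p_value = 1 - NormalDist().cdf(z_score)

print(p_value)
```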

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.