<a id='intro'></a>
## Introduction

This notebook analyzes an A/B test that compares the performance of two landing pages on an e-commerce website. The goal is to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before making a decision.

<a id='probability'></a>
#### Part I - Probability

To get started, let's import our libraries.

In [2]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seed so that the results are reproducible
random.seed(42)
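
Note that `random.seed` only seeds Python's built-in `random` module; NumPy keeps its own generator state. If later cells draw random samples with NumPy (an assumption, since no sampling happens in this part), a separately seeded generator is one option. A minimal sketch:

```python
import random
import numpy as np

random.seed(42)                   # seeds Python's built-in generator only
rng = np.random.default_rng(42)   # separate, explicitly seeded NumPy generator

# Draw five 0/1 values from the seeded NumPy generator
first_draw = rng.integers(0, 2, size=5)

# Re-creating the generator with the same seed reproduces the same draws
second_draw = np.random.default_rng(42).integers(0, 2, size=5)
print((first_draw == second_draw).all())
```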

`1.` Now, read in the `ab_data.csv` data. Store it in `df`.  

a. Read in the dataset and take a look at the top few rows here:

In [3]:
df = pd.read_csv('ab_data.csv'); df.head(3)

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |


b. Use the cell below to find the number of rows in the dataset.

In [4]:
df.shape[0]

294478

c. The number of unique users in the dataset.

In [5]:
df['user_id'].nunique()

290584

d. The proportion of users converted.

In [6]:
df[df['converted'] == 1].user_id.nunique() / df['user_id'].nunique()

0.12104245244060237

e. The number of rows where `new_page` and `treatment` don't line up (`treatment` without `new_page`, or `new_page` without `treatment`).

In [7]:
df[(df['group'] == 'treatment') & (df['landing_page'] != 'new_page')].shape[0] +\
df[(df['group'] != 'treatment') & (df['landing_page'] == 'new_page')].shape[0]

3893
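
A cross-tabulation is another way to see where the mismatches sit, since it shows all four group/page combinations at once. The sketch below uses a small made-up frame (`toy` is hypothetical, not taken from `ab_data.csv`) so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the real dataset
toy = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment', 'control', 'treatment'],
    'landing_page': ['old_page', 'old_page', 'new_page', 'old_page', 'new_page', 'new_page'],
})

# Rows are groups, columns are landing pages, cells are row counts
counts = pd.crosstab(toy['group'], toy['landing_page'])
print(counts)

# Mismatches are the two off-diagonal cells
n_mismatch = int(counts.loc['treatment', 'old_page'] + counts.loc['control', 'new_page'])
print(n_mismatch)
```

On the real data, the same two off-diagonal cells of `pd.crosstab(df['group'], df['landing_page'])` should add up to the count computed above.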

f. Do any of the rows have missing values?

In [10]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page. Drop them.

In [11]:
# Index labels of rows where group is "treatment" but landing_page is not "new_page"
drop_list_1 = df[(df['group'] == 'treatment') & (df['landing_page'] != 'new_page')].index.tolist()
# Index labels of rows where landing_page is "new_page" but group is not "treatment"
drop_list_2 = df[(df['group'] != 'treatment') & (df['landing_page'] == 'new_page')].index.tolist()

In [12]:
# Drop the mismatched rows and store the result in a new DataFrame, df2
df2 = df.drop(drop_list_1 + drop_list_2)

In [13]:
# Double-check that all of the mismatched rows were removed - this should be 0
df2[(df2['group'] == 'treatment') != (df2['landing_page'] == 'new_page')].shape[0]

0
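
The same filtering can also be written with a single boolean mask instead of two index lists. This is an alternative sketch on made-up data (`toy` and `toy_clean` are hypothetical names), not the method used in the cells above:

```python
import pandas as pd

# Hypothetical toy data; rows 3 and 4 are mismatched
toy = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment', 'control', 'treatment'],
    'landing_page': ['old_page', 'old_page', 'new_page', 'old_page', 'new_page', 'new_page'],
})

# A row is aligned when "is treatment" and "is new_page" are both True or both False
aligned = (toy['group'] == 'treatment') == (toy['landing_page'] == 'new_page')

# Keep only the aligned rows; the original index labels are preserved
toy_clean = toy[aligned]
print(toy_clean)
```

Keeping the original index labels matters here because a later step drops a row by its label.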

a. How many unique **user_id**s are in **df2**?

In [14]:
df2.nunique()


user_id         290584
timestamp       290585
group                2
landing_page         2
converted            2
dtype: int64

Since there are **290585** unique timestamps but only **290584** unique user ids, we infer that one user id appears in two rows.

b. There is one **user_id** repeated in **df2**.  What is it?

In [15]:
# Get the user_id that appears more than once
rep_id = df2.loc[df2['user_id'].duplicated(), 'user_id'].iloc[0]; rep_id

773192

c. What is the row information for the repeat **user_id**? 

In [16]:
# Subset of the DataFrame with the duplicated user_id
df2[df2['user_id'] == rep_id]

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


- The rows with the duplicated user_id **773192** have indexes **1899** and **2893**
- These rows have different timestamps but the same group, **treatment**, and the same landing_page, **new_page**
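
For reference, `duplicated(keep=False)` flags every occurrence of a repeated value, so all repeated rows can be pulled out in one step without first looking up the id. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data in which user_id 102 appears twice
toy = pd.DataFrame({
    'user_id': [101, 102, 103, 102],
    'converted': [0, 1, 0, 1],
})

# keep=False marks all occurrences of a duplicated user_id, not just the later ones
repeats = toy[toy['user_id'].duplicated(keep=False)]
print(repeats)
```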

d. Remove **one** of the rows with a duplicate **user_id**, but keep dataframe as **df2**.

In [17]:
# drop the row with index 1899
df2 = df2.drop(1899); df2.head(1)

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
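
Dropping by a hard-coded index label works here because there is exactly one known duplicate. A more general alternative, sketched below on made-up data (`toy` and `deduped` are hypothetical names), is `drop_duplicates` with `subset='user_id'`; `keep='last'` mirrors the choice above of removing the earlier row:

```python
import pandas as pd

# Hypothetical toy data in which user_id 102 appears at index 1 and index 3
toy = pd.DataFrame({
    'user_id': [101, 102, 103, 102],
    'converted': [0, 1, 0, 1],
})

# Keep the last occurrence of each user_id and drop any earlier ones
deduped = toy.drop_duplicates(subset='user_id', keep='last')
print(deduped)
```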


a. What is the probability of an individual converting regardless of the page they receive?

In [18]:
df2[df2['converted'] == 1].shape[0] / df2.shape[0]

0.11959708724499628

b. Given that an individual was in the `control` group, what is the probability they converted?

In [19]:
actual_p_old = df2[(df2['group'] == 'control') & (df2['converted'] == 1)].shape[0]  / df2[df2['group'] == 'control'].shape[0]
actual_p_old

0.1203863045004612

In [20]:
df2[(df2['group'] == 'control') & (df2['landing_page'] == 'old_page')].shape[0] 

145274

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [21]:
actual_p_new = df2[(df2['group'] == 'treatment') & (df2['converted'] == 1)].shape[0]  / df2[df2['group'] == 'treatment'].shape[0]
actual_p_new

0.11880806551510564

In [22]:
df2[(df2['group'] == 'treatment') & (df2['landing_page'] == 'new_page')].shape[0] 

145310
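
Because `converted` is a 0/1 column, its mean is the conversion rate, so the overall and per-group probabilities can also be read off with `mean` and `groupby`. A minimal sketch on made-up data (`toy` is hypothetical and its rates are not the ones reported above):

```python
import pandas as pd

# Hypothetical toy data: 1 of 4 control users and 2 of 4 treatment users converted
toy = pd.DataFrame({
    'group': ['control'] * 4 + ['treatment'] * 4,
    'converted': [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 column equals the share of ones
overall_rate = toy['converted'].mean()

# One conversion rate per group, indexed by group name
rate_by_group = toy.groupby('group')['converted'].mean()
print(overall_rate)
print(rate_by_group)
```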

d. What is the probability that an individual received the new page?

In [25]:
df2[df2['landing_page'] == 'new_page'].shape[0] / df2.shape[0]

0.5000619442226688

Actual difference between the conversion rates (new page minus old page):

In [26]:
act_diff = actual_p_new - actual_p_old; act_diff

-0.0015782389853555567

Considering the results above, we **don't have statistically significant evidence** that the new page leads to more conversions.

The probability that an individual received the new page in our test is approximately 0.5. The absolute counts confirm this:
- the number of individuals in the treatment group who received the new page is **145310**
- the number of individuals in the control group who received the old page is **145274**

These counts show that we are comparing nearly equal numbers of old-page and new-page visits, so the split between the two groups is balanced.

Based on the conversion probabilities computed above, the rate for the new page is **0.1188**, which is lower than the rate for the old page, **0.1204**.

The two rates are almost equal (the difference is about 0.0016), and descriptive statistics alone cannot establish a significant difference in either direction. So instead of stating that the old page converts more, the preferred statement is
> "we don't have statistically significant evidence that the new page converts more than the old page"
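
One way to put a number on that statement is a one-sided two-proportion z-test. The sketch below is an illustration under stated assumptions, not part of the analysis above: it uses a pooled standard error and a normal approximation, the helper names (`normal_cdf`, `one_sided_z_test`) are made up for this example, and the conversion counts are recovered by multiplying the rates and group sizes reported in the cell outputs.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def one_sided_z_test(conv_old, n_old, conv_new, n_new):
    """Return (z, p_value) for the alternative p_new > p_old, using a pooled rate."""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    p_pool = (conv_old + conv_new) / (n_old + n_new)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / se
    # Upper-tail probability: how likely a z this large or larger is under the null
    p_value = 1.0 - normal_cdf(z)
    return z, p_value

# Group sizes and conversion rates copied from the cell outputs above
n_old, n_new = 145274, 145310
rate_old, rate_new = 0.1203863045004612, 0.11880806551510564

# Conversion counts recovered as rate * size, rounded to whole users
conv_old = round(rate_old * n_old)
conv_new = round(rate_new * n_new)

z, p_value = one_sided_z_test(conv_old, n_old, conv_new, n_new)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.3f}")
```

Since the new page's observed rate is lower than the old page's, the z statistic is negative and the one-sided p-value is above 0.5, which is consistent with the statement that the data give no significant evidence in favor of the new page.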