## Analyze A/B Test Results



## Table of Contents
- [Introduction](#intro)
- [A/B Test](#ab_test)



<a id='intro'></a>
### Introduction

A/B tests are very commonly performed by data analysts and data scientists.  It is important that you get some practice working with the difficulties of these tests.

### Scenario
You will be working to understand the results of an A/B test run by an e-commerce website.  Your goal is to work through this notebook to help the company understand whether they should implement the new page, keep the old page, or perhaps run the experiment longer before making their decision.



To get started, let's import our libraries.

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
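random.seed(42)
np.random.seed(42)  # NumPy keeps its own random state; seed it too so the np.random simulations are reproducible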

`1.` Now, read in the `ab_data.csv` data. Store it in `df`.

a. Read in the dataset and take a look at the top few rows here:

In [None]:
df=pd.read_csv('ab_data.csv')
df.head()

b. Use the cell below to find the number of rows in the dataset.

In [None]:
df.info()

c. The number of unique users in the dataset.

d. The proportion of users converted.
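
Parts (c) and (d) have no cell of their own, so here is a minimal sketch of how they might be computed. The `toy` dataframe is a hypothetical stand-in with made-up values; only the column names (`user_id`, `group`, `landing_page`, `converted`) are taken from the cells in this notebook. In practice the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for ab_data.csv (values are made up for illustration)
toy = pd.DataFrame({
    'user_id':      [1, 2, 3, 3, 4],
    'group':        ['control', 'treatment', 'treatment', 'treatment', 'control'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'new_page', 'old_page'],
    'converted':    [0, 1, 0, 0, 1],
})

# (c) number of distinct users
n_unique_users = toy['user_id'].nunique()

# (d) the mean of a 0/1 column is the proportion of rows that converted
prop_converted = toy['converted'].mean()

print(n_unique_users, prop_converted)
```

Note that the mean is taken over rows, so a user who appears twice is counted twice until duplicates are removed.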

e. The number of times the `new_page` and `treatment` don't match.

In [None]:
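dif1 = df.query('group == "control" and landing_page == "new_page"').shape[0]  # control group with new_page
dif2 = df.query('group == "treatment" and landing_page == "old_page"').shape[0]  # treatment group with old_page
dif1 + dif2  # total number of mismatched rows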


f. Do any of the rows have missing values?

In [None]:
# df.info() already showed that there are no missing values; an output of False here confirms it.
df.isnull().values.any()


`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page.  


a. Store your new dataframe in **df2**.

In [None]:
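# Drop control rows that saw new_page, then treatment rows that saw old_page
df2 = df.drop(df.query('group == "control" and landing_page == "new_page"').index)
df2 = df2.drop(df2.query('group == "treatment" and landing_page == "old_page"').index)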

In [None]:
# Double-check that all of the mismatched rows were removed - this should be 0
df2[(df2['group'] == 'treatment') != (df2['landing_page'] == 'new_page')].shape[0]


`3.` 

a. How many unique **user_id**s are in **df2**?

b. Check for duplicates: is any **user_id** repeated in **df2**?

c. What is the row information for the repeat **user_id**? 

d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.
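
Parts (a) through (d) can be sketched as follows. The `toy2` dataframe and its values are hypothetical; only the column names come from this notebook. The same calls would be applied to `df2` in practice. Keeping the first of the duplicate rows is an arbitrary choice made here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df2 with one repeated user_id (12)
toy2 = pd.DataFrame({
    'user_id':      [10, 11, 12, 12, 13],
    'group':        ['control', 'treatment', 'treatment', 'treatment', 'control'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'new_page', 'old_page'],
    'converted':    [0, 1, 0, 0, 1],
})

# (a) number of unique user_ids
n_unique = toy2['user_id'].nunique()

# (b) user_ids that appear more than once
repeated_ids = toy2.loc[toy2['user_id'].duplicated(), 'user_id'].tolist()

# (c) every row belonging to a repeated user_id
repeat_rows = toy2[toy2['user_id'].duplicated(keep=False)]

# (d) keep only the first row for each user_id
toy2 = toy2.drop_duplicates(subset='user_id', keep='first')

print(n_unique, repeated_ids)
print(repeat_rows)
```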

`4.`

a. What is the probability of an individual converting regardless of the page they receive?

In [None]:
df2[df2['converted']==True].user_id.nunique()/df2.user_id.nunique()

b. Given that an individual was in the `control` group, what is the probability they converted?

In [None]:
df2.query('group == "control"').converted.mean()

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [None]:
df2.query('group == "treatment"').converted.mean()

d. What is the probability that an individual received the new page?

In [None]:
# Use the cleaned df2 so that mismatched and duplicate rows do not distort the proportion
df2[df2['landing_page']=='new_page'].user_id.nunique()/df2.user_id.nunique()

e. Discussion question: 
<br>
<br>
Consider your results from parts (a) through (d) above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions.

<a id='ab_test'></a>
###  A/B Test

Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each event was observed.  

The hard question then is whether you stop as soon as one page is considered significantly better than the other, or whether that result needs to hold consistently for a certain amount of time.  How long do you run the test before deciding that neither page is better than the other?  

These questions are the difficult parts associated with A/B tests in general.  
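
To illustrate why stopping at the first significant result is risky, here is a small simulation sketch. Everything in it is an assumption made for illustration: two pages with the same true conversion rate (`p = 0.12`), five hypothetical checkpoints, and a one-sided pooled z-test at the 5% level. Because both pages are identical, every rejection is a false positive.

```python
import numpy as np

rng = np.random.default_rng(0)

def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b / n - conv_a / n) / se

# Hypothetical settings: both pages share the same true rate p
n_sims, n_total, p = 2000, 5000, 0.12
checkpoints = [1000, 2000, 3000, 4000, 5000]
z_crit = 1.645  # one-sided 5% critical value of the standard normal

reject_final = 0  # rejections when testing only once, at the end
reject_any = 0    # rejections when stopping at the first significant checkpoint
for _ in range(n_sims):
    a = rng.random(n_total) < p
    b = rng.random(n_total) < p
    cum_a, cum_b = np.cumsum(a), np.cumsum(b)
    zs = [z_stat(cum_a[n - 1], cum_b[n - 1], n) for n in checkpoints]
    reject_final += zs[-1] > z_crit
    reject_any += max(zs) > z_crit

fpr_final = reject_final / n_sims
fpr_any = reject_any / n_sims
print(fpr_final, fpr_any)
```

Testing once at the end should give a false positive rate near 5%, while checking at every checkpoint can only match or exceed it, since the final look is one of the checkpoints.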


`1.` For now, consider you need to make the decision just based on all the data provided.  If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be?  You can state your hypotheses in words or in terms of **$p_{old}$** and **$p_{new}$**, which are the conversion rates for the old and new pages.
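
Under the assumptions stated in this question (the old page is kept unless the new page is shown to be better), one way to write the hypotheses is as a one-sided test:

$$H_0: p_{new} - p_{old} \le 0$$

$$H_1: p_{new} - p_{old} > 0$$

The null is rejected only if the p-value falls below the 5% Type I error rate.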

`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page. <br><br>

Use a sample size for each page equal to the ones in **ab_data.csv**.  <br><br>

a. What is the **conversion rate** for $p_{new}$ under the null? 

In [None]:
p_new=df2.converted.mean() #under the null, both rates are equal. So we just calculate the converted rate no matter the page.
p_new

b. What is the **conversion rate** for $p_{old}$ under the null? <br><br>

In [None]:
p_old=df2.converted.mean() #under the null, both rates are equal. So we just calculate the converted rate no matter the page.
p_old

c. What is $n_{new}$, the number of individuals in the treatment group?

In [None]:
n_new=df2[df2['group'] == 'treatment'].shape[0]
n_new

d. What is $n_{old}$, the number of individuals in the control group?

In [None]:
n_old=df2[df2['group'] == 'control'].shape[0]
n_old

e. Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null.  Store these $n_{new}$ 1's and 0's in **new_page_converted**.

In [None]:
new_page_converted = np.random.choice([0,1],n_new, p=(1-p_new,p_new))
np.mean(new_page_converted)

f. Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null.  Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [None]:
old_page_converted = np.random.choice([0,1],n_old, p=(1-p_old,p_old))
np.mean(old_page_converted)
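
A single pair of simulated samples says little on its own. One possible next step, sketched here under assumptions, is to repeat the simulation many times and compare the observed difference in conversion rates with the resulting null distribution. The values of `n_new`, `n_old`, `p_null`, and `obs_diff` below are hypothetical placeholders; in the notebook they would come from **df2**. The name `p_diffs` and the count of 10,000 repetitions are also choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins; in the notebook these would be computed from df2
n_new, n_old = 1000, 1000
p_null = 0.12        # shared conversion rate under the null
obs_diff = -0.0015   # observed (treatment rate - control rate), made up here

# Draw the number of conversions per page for each repetition,
# then turn the counts into conversion rates
n_sims = 10_000
new_rates = rng.binomial(n_new, p_null, size=n_sims) / n_new
old_rates = rng.binomial(n_old, p_null, size=n_sims) / n_old
p_diffs = new_rates - old_rates

# One-sided p-value: share of simulated differences larger than the observed one
p_value = (p_diffs > obs_diff).mean()
print(p_value)
```

Drawing binomial counts is equivalent to summing `n_new` (or `n_old`) individual 0/1 draws, but avoids building the full arrays on every repetition.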

l. Use the portions above to think correctly about statistical significance. Fill in the cell below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let `n_old` and `n_new` refer to the number of rows associated with the old and new pages, respectively.

In [None]:
import statsmodels.api as sm

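convert_old = df2.query('landing_page == "old_page"').converted.sum()  # conversions on the old page
convert_new = df2.query('landing_page == "new_page"').converted.sum()  # conversions on the new page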
n_old = 145274
n_new = 145310

m. Now use `sm.stats.proportions_ztest` to compute your test statistic and p-value.  [Here](http://knowledgetack.com/python/statsmodels/proportions_ztest/) is a helpful link on using the built-in function.

In [None]:
# alternative='larger' tests whether the first proportion (new page) exceeds the second (old page)
z_s, p_v = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
z_s,p_v
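
To see what goes into the z-score and p-value, here is a hand-rolled sketch of a pooled two-proportion z-test with a one-sided "larger" alternative. The function name `one_sided_prop_ztest` and the counts passed to it are hypothetical. This is an illustration of the textbook formula, not the library's actual implementation.

```python
import math

def one_sided_prop_ztest(x1, n1, x2, n2):
    """Test H1: p1 > p2 using the pooled standard error.

    x1, x2 are conversion counts; n1, n2 are sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal: P(Z > z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120 of 1000 convert on the new page, 130 of 1000 on the old
z, p = one_sided_prop_ztest(120, 1000, 130, 1000)
print(z, p)
```

A negative z-score means the first group converted at a lower rate than the second, and a p-value above 0.05 means the null hypothesis is not rejected at the 5% level.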

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?  