# A/B Testing

In [None]:
Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)

In [2]:
### Introduction

#A/B tests are very commonly performed by data analysts and data scientists.  It is important that you get some practice working with the difficulties of these 
#For this project, you will be working to understand the results of an A/B test run by an e-commerce website.  Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.
#**As you work through this notebook, follow along in the classroom and answer the corresponding quiz questions associated with each question.** The labels for each classroom concept are provided for each question.  This will ensure you are on the right track as you work through the project, and you can feel more confident that your final submission meets the criteria.  As a final check, ensure you meet all the criteria on the [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric).

In [None]:
#### Part I - Probability

In [3]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#`1.` Now, read in the `ab_data.csv` data. Store it in `df`.  **Use your dataframe to answer the questions in Quiz 1 of the classroom.**

In [None]:
# a. Read in the dataset and take a look at the top few rows here:

In [9]:
df = pd.read_csv("ab_data.csv")
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [6]:
# b. Use the below cell to find the number of rows in the dataset.
df.shape

(294478, 5)

In [12]:
#c. The number of unique users in the dataset.
unique_users=df['user_id'].nunique()

In [13]:
unique_users

290584

In [None]:
#d. The proportion of users converted.

In [14]:
df.converted.mean()   # converted is 0/1, so the mean is the proportion of users who converted

0.11965919355605512

In [None]:
#e. The number of times the `new_page` and `treatment` don't line up.

In [15]:
df

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1
...,...,...,...,...,...
294473,751197,2017-01-03 22:28:38.630509,control,old_page,0
294474,945152,2017-01-12 00:51:57.078372,control,old_page,0
294475,734608,2017-01-22 11:45:03.439544,control,old_page,0
294476,697314,2017-01-15 01:20:28.957438,control,old_page,0


In [31]:
not_line_up_1=df[(df['group']=='treatment')& (df['landing_page']=='old_page')].count()
not_line_up_2=df.query(" group=='control' and landing_page=='new_page' ").count()

In [32]:
not_line_up_1+not_line_up_2

user_id         3893
timestamp       3893
group           3893
landing_page    3893
converted       3893
dtype: int64
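
The sum above repeats the same count once per column because `.count()` returns one value per column. A minimal sketch of getting a single integer instead, assuming a small made-up frame (`toy` is hypothetical and not taken from `ab_data.csv`):

```python
import pandas as pd

# Toy data for illustration only (hypothetical, not rows from ab_data.csv).
toy = pd.DataFrame({
    "group": ["control", "treatment", "treatment", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "new_page"],
})

# A row is misaligned when exactly one of the two conditions holds.
mismatch = (toy["group"] == "treatment") != (toy["landing_page"] == "new_page")

# Summing a boolean Series counts its True values, giving one integer.
n_mismatch = int(mismatch.sum())
print(n_mismatch)
```

The same mask could be applied to `df` to count both kinds of mismatch in one step.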

In [None]:
#f. Do any of the rows have missing values?

In [33]:
df.isnull().sum()  

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [34]:
#`2.` For the rows where **treatment** is not aligned with **new_page** or **control** is not aligned with **old_page**, we cannot be sure if this row truly received the new or old page.  Use **Quiz 2** in the classroom to provide how we should handle these rows.  

#a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz.  Store your new dataframe in **df2**.

In [35]:
df

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1
...,...,...,...,...,...
294473,751197,2017-01-03 22:28:38.630509,control,old_page,0
294474,945152,2017-01-12 00:51:57.078372,control,old_page,0
294475,734608,2017-01-22 11:45:03.439544,control,old_page,0
294476,697314,2017-01-15 01:20:28.957438,control,old_page,0


In [41]:
# Keep only rows where group and landing_page agree (treatment/new_page or control/old_page).
df2 = df[(df['group'] == 'treatment') == (df['landing_page'] == 'new_page')]

In [43]:
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

0
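
A cross-tabulation of `group` against `landing_page` is another way to check the alignment, since the off-diagonal cells should be zero after filtering. This is a sketch on made-up data (`toy` and `aligned` are hypothetical names), not output from the real dataset:

```python
import pandas as pd

# Toy data for illustration only (hypothetical, not rows from ab_data.csv).
toy = pd.DataFrame({
    "group": ["control", "treatment", "treatment", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "new_page"],
})

# Same alignment filter as used for df2, applied to the toy frame.
aligned = toy[(toy["group"] == "treatment") == (toy["landing_page"] == "new_page")]

# Rows are groups, columns are landing pages, cells are row counts.
table = pd.crosstab(aligned["group"], aligned["landing_page"])
print(table)
```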

In [46]:
df2['converted']

0         0
1         0
2         0
3         0
4         1
         ..
294473    0
294474    0
294475    0
294476    0
294477    0
Name: converted, Length: 290585, dtype: int64

In [47]:
df2['converted'].value_counts()

0    255832
1     34753
Name: converted, dtype: int64

In [None]:
#`3.` Use **df2** and the cells below to answer questions for **Quiz3** in the classroom.

In [50]:
#a. How many unique **user_id**s are in **df2**?
df2['user_id'].nunique()

290584

In [60]:
#b. There is one **user_id** repeated in **df2**.  What is it?
df2['user_id'].duplicated().sum()

1

In [62]:
#c. What is the row information for the repeat **user_id**? 
# To show every row that shares the duplicated user_id, set keep=False.

df2[df2['user_id'].duplicated(keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


In [68]:
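#d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.
# Deduplicate on user_id only (the two rows differ in timestamp) and assign the result back to df2.
df2 = df2.drop_duplicates(subset='user_id', keep='first')
df2['user_id'].duplicated().sum()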

0
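
`drop_duplicates` compares entire rows unless a `subset` is given, and it returns a new frame rather than modifying the original. The sketch below illustrates the difference on made-up data (`toy`, `whole_row`, and `by_user` are hypothetical names):

```python
import pandas as pd

# Toy data for illustration only: user 2 appears twice with different timestamps.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "timestamp": ["t1", "t2", "t3", "t4"],
    "converted": [0, 0, 0, 1],
})

# No subset: whole rows are compared, and the two user 2 rows are not identical.
whole_row = toy.drop_duplicates(keep="first")

# subset="user_id": only the id is compared, so the second user 2 row is dropped.
by_user = toy.drop_duplicates(subset="user_id", keep="first")

print(len(whole_row), len(by_user))
```

In both calls `toy` itself is left unchanged, which is why the result has to be assigned to a name.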

In [69]:
#`4.` Use **df2** in the below cells to answer the quiz questions related to **Quiz 4** in the classroom.

#a. What is the probability of an individual converting regardless of the page they receive?

In [70]:
round(df2.converted.mean(),4)

0.1196

In [81]:
#b. Given that an individual was in the `control` group, what is the probability they converted?
# `converted` holds integers (0/1), so its mean within the control group is the conversion rate.
control_rate = df2.query("group == 'control'")['converted'].mean()
round(control_rate, 4)
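
Because `converted` is an integer column, comparing it against the string `'1'` matches no rows, which yields a rate of zero. A hedged sketch on made-up data (`toy` and `rates` are hypothetical) shows this, and also how `groupby` gives the conditional conversion rate for every group at once:

```python
import pandas as pd

# Toy data for illustration only (hypothetical, not rows from ab_data.csv).
toy = pd.DataFrame({
    "group": ["control", "control", "control", "control",
              "treatment", "treatment", "treatment", "treatment"],
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Integer values never equal the string "1", so this count is zero.
n_string_match = int((toy["converted"] == "1").sum())

# Mean of a 0/1 column within each group is P(converted | group).
rates = toy.groupby("group")["converted"].mean()

print(n_string_match)
print(rates)
```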

In [75]:
#c. Given that an individual was in the `treatment` group, what is the probability they converted?
