## Analyze A/B Test Results

## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='intro'></a>
### Introduction

The data for this project is provided by Udacity as part of the [Data Analyst Nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002?) program. The dataset contains the results of an A/B test run by an e-commerce website. I analyze the dataset to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before making a decision.

<a id='probability'></a>
#### Part I - Probability

To get started, let's import our libraries.

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
random.seed(42)

`1.` I read in the `ab_data.csv` data and store it in `df`.

a. I read in the dataset and examine the top few rows

In [2]:
df = pd.read_csv('ab_data.csv')
df.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


b. Number of rows in the dataset

In [3]:
df.shape[0]

294478

c. The number of unique users in the dataset.

In [4]:
df.user_id.nunique()

290584

d. The proportion of users converted.

In [5]:
df.converted.mean()

0.11965919355605512

e. The number of rows where `group` and `landing_page` don't match (`treatment` without `new_page`, or `control` without `old_page`).

In [6]:
mismatch1 = df[(df['group']=='treatment') \
               & (df['landing_page']=='old_page')]

In [7]:
mismatch2 = df[(df['group']=='control') \
               & (df['landing_page']=='new_page')]

In [8]:
mismatch = mismatch1['user_id'].count() + mismatch2['user_id'].count()
mismatch

3893

`treatment` and `new_page` are mismatched in 3893 rows.
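The mismatch count can also be cross-checked with a contingency table. The sketch below uses a small made-up frame (`toy` is hypothetical, not the project data) and `pd.crosstab`; the off-diagonal cells hold the mismatched rows.

```python
import pandas as pd

# Toy data for illustration only; not the real ab_data.csv
toy = pd.DataFrame({
    'group':        ['control', 'control', 'treatment', 'treatment', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page', 'new_page'],
})

# Rows are groups, columns are landing pages
table = pd.crosstab(toy['group'], toy['landing_page'])

# Mismatches: control users on new_page plus treatment users on old_page
toy_mismatch = table.loc['control', 'new_page'] + table.loc['treatment', 'old_page']
print(table)
print(toy_mismatch)
```

Applied to `df`, the same two cells should add up to the count computed above.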

f. Checking to see if any of the rows have missing values

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


All the columns have 294478 non-null values. Hence, no row has missing values.
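A more direct check than reading the `info()` output is to count nulls per column. This is a minimal sketch on a made-up frame (`toy` is hypothetical and deliberately contains one missing value so the check has something to find).

```python
import pandas as pd
import numpy as np

# Toy data for illustration only; one value is deliberately missing
toy = pd.DataFrame({
    'user_id':   [1, 2, 3],
    'converted': [0, np.nan, 1],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# True if any cell in the frame is missing
any_missing = toy.isnull().values.any()

print(missing_per_column)
print(any_missing)
```

On `df`, every per-column count would be expected to be zero, consistent with the `info()` output above.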

`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, I cannot be sure if this row truly received the new or old page.

a. Because the 3893 mismatched rows are a small fraction (about 1.3%) of the 294478 total rows, I delete them and store the result in a new dataframe, **df2**.

In [10]:
df2 = df.drop(mismatch1.index)
df2 = df2.drop(mismatch2.index)

In [11]:
df2.shape

(290585, 5)

In [12]:
# Double Check all of the correct rows were removed - this should be 0
df2[(df2['group'] == 'treatment') != (df2['landing_page'] == 'new_page')].shape[0]

0
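As an alternative to dropping the two mismatch sets by index, the aligned rows can be selected in one step with a boolean mask. The sketch below uses a made-up frame (`toy`, `is_treatment`, `is_new_page` and `aligned` are illustrative names, not part of the original notebook).

```python
import pandas as pd

# Toy data for illustration only; rows 1 and 3 are mismatched
toy = pd.DataFrame({
    'group':        ['control', 'control', 'treatment', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page'],
    'converted':    [0, 1, 0, 1],
})

is_treatment = toy['group'] == 'treatment'
is_new_page = toy['landing_page'] == 'new_page'

# Keep rows where both flags agree: treatment/new_page or control/old_page
aligned = toy[is_treatment == is_new_page]
print(aligned)
```

This relies on the data containing only the two groups and two pages seen here; other labels would need explicit handling.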

`3.` Analyzing the dataframe **df2**

a. Number of unique **user_id**s in **df2**

In [13]:
df2['user_id'].nunique()

290584

b. The one **user_id** that is repeated in **df2**

In [14]:
df2[df2['user_id'].duplicated()]

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


c. Row information for the repeat **user_id**

In [15]:
df2[df2['user_id'].duplicated(keep=False)]

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


d. I remove **one** of the rows with a duplicate **user_id**

In [16]:
df2 = df2.drop(2893)

In [17]:
df2.shape  # check the number of rows in df2

(290584, 5)
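Dropping by a hard-coded index works here because the duplicate row is known. A more general option is `drop_duplicates`, sketched below on a made-up frame (`toy` and `deduped` are hypothetical names). With `keep='first'` the earlier row is retained, which matches dropping the second occurrence (index 2893) above.

```python
import pandas as pd

# Toy data for illustration only; user_id 10 appears twice
toy = pd.DataFrame({
    'user_id':   [10, 11, 10, 12],
    'converted': [0, 1, 0, 0],
})

# Keep the first row for each user_id and drop later repeats
deduped = toy.drop_duplicates(subset='user_id', keep='first')
print(deduped)
```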

`4.` Calculating the summary statistics and probability

a. Probability of an individual converting regardless of the page they receive

In [18]:
df2['converted'].mean()

0.11959708724499628

b. Conditional probability that the individual converted given that an individual was in the `control` group

In [19]:
df2.query('group == "control"')['converted'].mean()

0.1203863045004612

c. Conditional probability that the individual converted given that an individual was in the `treatment` group

In [20]:
df2.query('group == "treatment"')['converted'].mean()

0.11880806551510564

d. Probability that an individual received the new page

In [21]:
df2.query('landing_page == "new_page"')['user_id'].count()/df2.shape[0]

0.50006194422266881

e. Based on your results from parts (a) through (d) above, is there sufficient evidence to conclude that the new treatment page leads to more conversions?

There is **not sufficient** evidence to conclude that the new treatment page leads to more conversions.

The probability that an individual receives the new page is about 50%, so an individual was equally likely to receive the old page or the new page.

The probability of an individual converting regardless of the page they receive is 11.96%. The probability of converting given that the individual received the new page (treatment group) is 11.88%. This is marginally lower than the probability of converting given that the individual received the old page (control group), which is 12.04%.

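Hence, the new page does not appear to increase conversions: the treatment rate is about 0.16 percentage points lower than the control rate, and these summary statistics alone do not show whether that gap is statistically significant.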
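The comparison above boils down to a difference between two group means. The sketch below shows that calculation on a made-up frame (`toy`, `rates` and `obs_diff` are illustrative names, and the toy rates are not the project's). With the figures reported above, the treatment minus control difference is roughly -0.0016.

```python
import pandas as pd

# Toy data for illustration only; not the real conversion rates
toy = pd.DataFrame({
    'group':     ['control'] * 4 + ['treatment'] * 4,
    'converted': [1, 0, 0, 0, 1, 1, 0, 0],
})

# Conversion rate per group (mean of a 0/1 column)
rates = toy.groupby('group')['converted'].mean()

# Observed difference: treatment rate minus control rate
obs_diff = rates['treatment'] - rates['control']
print(rates)
print(obs_diff)
```

A positive `obs_diff` would favor the new page and a negative one the old page; judging whether the gap is more than noise requires a hypothesis test.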

