# Analyze A/B Test Results 

- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)
- [Final Check](#finalcheck)
- [Submission](#submission)

<a id='intro'></a>
## Introduction

A/B tests are very commonly performed by data analysts and data scientists. For this project, I will be working to understand the results of an A/B test run by an e-commerce website.  My goal is to work through this notebook to help the company understand if they should:
- Implement the new webpage, 
- Keep the old webpage, or 
- Perhaps run the experiment longer to make their decision.

Each **ToDo** task below has an associated quiz present in the classroom.  Though the classroom quizzes are **not necessary** to complete the project, they help ensure you are on the right track as you work through the project, and you can feel more confident in your final submission meeting the [rubric](https://review.udacity.com/#!/rubrics/1214/view) specification. 

<a id='probability'></a>
## Part I - Probability

To get started, let's import our libraries.

In [2]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
#set the seed to assure you get the same answers
random.seed(42)

In [3]:
df= pd.read_csv("ab_data.csv")
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'ab_data.csv'

number of rows in the dataset.

In [None]:
df.shape

number of unique users in the dataset.

In [None]:
df.nunique()

proportion of users converted.

In [None]:
len(df.query('converted == 1')) / df.shape[0]

number of times when the "group" is `treatment` but "landing_page" is not a `new_page`.

In [None]:
len(df.query('group == "treatment" and landing_page != "new_page"  or  group == "control" and landing_page != "old_page"'))

Do any of the rows have missing values?

In [None]:
df.query('group == "treatment" and landing_page == "new_page"').isna().any()

In [None]:
# Remove the inaccurate rows, and store the result in a new dataframe df2
df2= df.query('(group == "control" and landing_page == "old_page") or (group == "treatment" and landing_page == "new_page")')

In [None]:
df2.shape

In [None]:
# Double Check all of the incorrect rows were removed from df2 - 
# Output of the statement below should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

How many unique **user_id**s are in **df2**?

In [None]:
df2.nunique()

find if there are repeated IDs

In [None]:
df2[df2.duplicated(subset=['user_id'])]

In [None]:
df2.loc[df2.user_id.duplicated(), :]

In [None]:
# Remove one of the rows with a duplicate user_id..
# Hint: The dataframe.drop_duplicates() may not work in this case because the rows with duplicate user_id are not entirely identical. 
df2= df2.drop_duplicates(subset= ['user_id'])

In [None]:
# Check again if the row with a duplicate user_id is deleted or not
df2[df2.duplicated(subset=['user_id'])]

What is the probability of an individual converting regardless of the page they receive?<br><br>


In [None]:
len(df2.query('converted == 1')) / len(df2)

Given that an individual was in the `control` group, what is the probability they converted?

In [None]:
d1= len(df2.query('group == "control"  and  converted == 1')) / len(df2.query('group == "control"'))
d1

Given that an individual was in the `treatment` group, what is the probability they converted?

In [None]:
d2= len(df2.query('group == "treatment"  and  converted == 1')) / len(df2.query('group == "treatment"'))
d2

In [None]:
# Calculate the actual difference (obs_diff) between the conversion rates for the two groups.
obs_diff= d2 - d1
obs_diff

What is the probability that an individual received the new page?

In [None]:
len(df2.query('landing_page == "new_page"')) / len(df2)

>**based on the data on hand now.. the new page has done worse than old page (but almost no change)..
 so it's better to wait and collect some more data**

<a id='ab_test'></a>
## Part II - A/B Test

Now do the A/B Test to apply statistics to find out which page is better.  


I will assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%.  

>**
𝐻0: P𝑜𝑙𝑑 >= Pnew ....
𝐻1: P𝑜𝑙𝑑 < Pnew 
**

In this section, I will: 

- Simulate (bootstrap) sample data set for both groups, and compute the  "converted" probability $p$ for those samples. 


- Use a sample size for each group equal to the ones in the `df2` data.


- Compute the difference in the "converted" probability for the two samples above. 


- Perform the sampling distribution for the "difference in the converted probability" between the two simulated-samples over 10,000 iterations; and calculate an estimate. 



a. What is the **conversion rate** for $p_{new}$ under the null hypothesis? 

In [None]:
# assuming that 𝑝𝑛𝑒𝑤 =  𝑝_𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 >> so p_new will be calculated on the whole df2
p_new= df2.converted.mean()
p_new

**b.** What is the **conversion rate** for $p_{old}$ under the null hypothesis? 

In [None]:
# assuming that p_old =  𝑝_𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 >> so p_new will be calculated on the whole df2
p_old= df2.converted.mean()
p_old

**c.** What is $n_{new}$, the number of individuals in the treatment group? <br><br>

In [None]:
n_new= len(df2.query('group == "treatment" '))
n_new

**d.** What is $n_{old}$, the number of individuals in the control group?

In [None]:
n_old= len(df2.query('group == "control" '))
n_old

**e. Simulate Sample for the `treatment` Group**<br> 
Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null hypothesis.  <br><br>

In [None]:
new_page_converted= np.random.choice([0,1] , size=n_new , p=[(1-p_new), p_new])
new_page_converted

**f. Simulate Sample for the `control` Group** <br>
Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null hypothesis. <br> 

In [None]:
old_page_converted= np.random.choice([0,1] , size=n_old , p=[(1-p_old), p_old])
old_page_converted

**g.** Find the difference in the "converted" probability $(p{'}_{new}$ - $p{'}_{old})$ for simulated samples from the parts (e) and (f) above. 

In [None]:
obs_diff2= new_page_converted.mean() - old_page_converted.mean()
obs_diff2
# note: if obs_diff2 is positive.. this supports the alternative hypothesis


**h. Sampling distribution** <br>
Re-create `new_page_converted` and `old_page_converted` and find the $(p{'}_{new}$ - $p{'}_{old})$ value 10,000 times using the same simulation process used in parts (a) through (g) above. 

<br>
then will store all  $(p{'}_{new}$ - $p{'}_{old})$  values in a NumPy array called `p_diffs`.

In [None]:
# Sampling distribution 
p_diffs = []

for i in range(10000):
    new_page_converted= np.random.choice([0,1] , size=n_new , p=[1-p_new, p_new])
    old_page_converted= np.random.choice([0,1] , size=n_old , p=[1-p_old, p_old])
    obs_diff3= new_page_converted.mean() - old_page_converted.mean()
    p_diffs.append(obs_diff3)

**i. Histogram**<br> 
Plot a histogram of the **p_diffs**. 


In [None]:
plt.hist(p_diffs)
plt.axvline(obs_diff2, color='r')
plt.xlabel('difference')
plt.ylabel('frequency')
# H0: P𝑜𝑙𝑑 >= Pnew .... H1: P𝑜𝑙𝑑 < Pnew

**j.** What proportion of the **p_diffs** are greater than the actual difference observed in the `df2` data?

In [None]:
(p_diffs > obs_diff2).mean()
# this is the p-value
# remember that big p-value means that we can't reject H0 >> so H0 is true
# --> so we stay at our assumption of: P𝑜𝑙𝑑 >= Pnew

**l. Using Built-in Methods for Hypothesis Testing**<br>
Will try to use built-in to achieve similar results.  

- `convert_old`: number of conversions with the old_page
- `convert_new`: number of conversions with the new_page
- `n_old`: number of individuals who were shown the old_page
- `n_new`: number of individuals who were shown the new_page

In [None]:
import statsmodels.api as sm

# number of conversions with the old_page
convert_old = len(df2.query('converted == 1  and  landing_page == "old_page"'))

# number of conversions with the new_page
convert_new =len(df2.query('converted == 1  and  landing_page == "new_page"'))

# number of individuals who were shown the old_page
n_old = len(df2.query('landing_page == "old_page"'))

# number of individuals who received new_page
n_new = len(df2.query('landing_page == "new_page"'))


### The two-sample z-test
We determine whether or not the $Z_{score}$ lies in the "rejection region" in the distribution. A "rejection region" is an interval where the null hypothesis is rejected iff the $Z_{score}$ lies in that region.



In [None]:
import statsmodels.api as sm
z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
print('z_score: ', z_score, '\n', 'p_value: ', p_value)

* **to reject H0 z-score need to be more than (1.95996) .. but it's (-1.311) .. so we can't reject H0**
> **which means that the new page didn't help imporve the conversion rates.**

* **also the p-value here is same with the one calulated manually**
> **which give the same results .. that the new page didn't help imporve the conversion rates.**

<a id='regression'></a>
### Part III - A regression approach

I will get the same result achieved in the A/B test in Part II above by performing regression.<br><br> 



**b.** First need to create the following two columns in the `df2` dataframe:
 1. `intercept` - It should be `1` in the entire column. 
 2. `ab_page` - It's a dummy variable column, having a value `1` when an individual receives the **treatment**, otherwise `0`.  

In [None]:
df2['intercept']= 1

df2[['2', 'ab_page']]= pd.get_dummies(df2['group'])
df2= df2.drop(columns=['2'])
df2.head()

**c.** Use **statsmodels** to instantiate your regression model on the two columns you created in part (b). above, then fit the model to predict whether or not an individual converts. 


In [None]:
y= df2['converted']
x= df2[['intercept', 'ab_page']]

model= sm.Logit(y, x)
results= model.fit()

**d.** Provide the summary of model below.

In [None]:
results.summary2()

**e.** What is the p-value associated with **ab_page**? Why does it differ from the value you found in **Part II**?<br><br>  


In [None]:
np.exp(-0.0150)
# people who recieved 'treatment' are .985 more likely to convert
# (same as)people who didn't recieve 'treatment' are 1.015 more likely to convert

> * **p-value associated with ab_page: 0.1899 (bigger than the Type I error rate (0.05))**
> * **p-value in partII: 0.6349**
 ____________
> * **H0 in logistic regression: there is no relation between X variables and Y variable (two-sided)**
> * **H0 in partII: old page results to higher conversion rates than new page (one-sided)**

**f.** Considering other things that might influence whether or not an individual converts.

In [None]:
df2.head()

>* **other factors in the dataset is 'landing_page' (but won't use it as it assosiated with 'group')**
>* **if possible I would suggest gathering more data about the users (like age, gender, location)**
>* **other factors to consider also might be the (click through rate)**

**g. Adding countries**<br> 
Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. 

In [None]:
# Read the countries.csv
cdf= pd.read_csv('countries.csv')
cdf.head(1)

In [None]:
# Join with the df2 dataframe
df_merged= df2.merge(cdf, on='user_id')
df_merged.head(1)

In [None]:
# Create the necessary dummy variables
df_merged[['CA', 'UK', 'US']] =pd.get_dummies(df_merged['country'])
df_merged.head()

In [None]:
# Fit your model, and summarize the results
y2= df_merged['converted']
# in x2 I dropped 'CA' to make it the baseline
x2= df_merged[['intercept', 'UK', 'US']]
# in model2 I used OLS becase there are 3 variables not 2.. so it's multi linear not logistic
model2= sm.Logit(y2, x2)
results2= model2.fit()
results2.summary2()

statistical conclusion:

>**users from UK are .0053 more likely to convert (compared to users from CA)**

>**users from US are .0042 more likely to convert (compared to users from CA)**

practical conclusion:

>**we shouldn't care much about which country users belong to... as it almost doesn't affect the model**

**h. Fit your model and obtain the results**<br> 
Now will look at an interaction between page and country to see if are there significant effects on conversion.

In [None]:
df_merged['CA_ab_page'] = df_merged['CA']*df_merged['ab_page']
df_merged['US_ab_page'] = df_merged['US']*df_merged['ab_page']
df_merged['UK_ab_page'] = df_merged['UK']*df_merged['ab_page']
df_merged.head()

In [None]:
y3= df_merged['converted']
# dropped 'CA' to make it the baseline
x3= df_merged[['intercept','ab_page', 'UK', 'US', 'UK_ab_page', 'US_ab_page']]
# used OLS becase there are 3 variables not 2.. so it's multi linear not logistic
model3= sm.Logit(y3, x3)
results3= model3.fit()
results3.summary2()

statistical conclusion:

>**the coefficient parameters are still low.. which indicates that using the new interactions columns didn't make new difference in results**

practical conclusion:

>**we shouldn't care much about which country users belong to... as it almost doesn't affect the model**

*after the previous analysis we conclude that the old page has higher conversion rate.