### Project Hypothesis Testing:   “Free Trial” Screener
----
**Project description**: 

Udacity is an online learning system (https://www.udacity.com/).  At the time of the experiment, Udacity courses had two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

For further background on this problem, see: https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True


#### Data description :
Columns:
- Pageviews: Number of unique cookies to view the course overview page that day. 
- Clicks: Number of unique cookies to click the "start free trial" button on the course overview page that day. 
- Enrollments: Number of user-ids to enroll in the free trial that day. 
- Payments: Number of user-ids who enrolled on that day to remain enrolled for 14 days and thus make a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)


---
### Instructions
Based on the scenario outlined above and the data associated with this project, you are required to perform A/B testing and Machine Learning analysis. You must also answer the quiz questions linked with this project. For convenience, the quiz questions are copied below. 

**Deadline:** Please complete the quiz and submit all the data analysis work you carried out in this Jupyter notebook via GitHub (see instructions below) by 23h59 on Thursday 8 August 2019 at the very latest. 

**Why this project?**


This project is designed with two main points in mind:

- We need to understand your proficiency in important data science concepts (statistical, algorithmic, and others) and hard skills (advanced programming in Python or R).
- We want to give you an additional opportunity to add more data science project experience to your portfolio.


So please put in maximum effort to demonstrate your skills in this project. Answer the quiz with diligence, and perform the data analysis as best as you can. 

**Detailed instruction**:

- Fork or clone the project Jupyter notebooks and the corresponding data from this GitHub repository: https://github.com/10acad/piq2019
The Jupyter notebook “HypothesisTesting.ipynb”, which contains these instructions in its first markdown cell, is for this project.
- The “data/UdacityABtesting.xlsx” Excel file is the data for this project.
- Following the recommended tasks below, perform A/B testing and machine learning analysis on the data while answering the questions listed below.
- At a minimum, you must perform an analysis that allows you to answer the questions, visualize features, and produce a model from which to draw a reasonable conclusion. The more detailed your data analysis and the clearer your answers to the quiz, the better your chances of being selected as a 10 Academy Fellow and of getting job interviews. Note that these notebooks and your other work on GitHub are critical for your data science career, as they are the evidence of your skills. So even after submitting whatever you managed to do by the deadline, keep improving your model and explanations.  
- PLEASE SUBMIT WHATEVER YOU MANAGED TO DO BEFORE THE DEADLINE. WE KNOW THE TIME IS SHORT, AND IT IS FOR A PURPOSE. 
- Upload your Jupyter notebook to your public GitHub repository. If you have forked the GitHub repository above, which is what we recommend, then you just have to do the following:
    - git add -u *  # add all modified files tracked by git 
    - git commit -m 'submit' 
    - git push
- Copy the GitHub link to your version of “HypothesisTesting.ipynb” and paste it here (the 10 Academy quiz page). If you prefer, you can also upload your Jupyter notebook directly.
- If you have any questions or are unsure about what you are expected to do in this project or how to submit, please contact community@10academy.org well before the deadline.


### Objective 1:  A/B testing 

**Quiz**:
- From the project description above, which metric does the A/B test intend to improve? Note that in many cases A/B testing is measured using the company's Key Performance Indicators, such as page visits, customer satisfaction, etc.
- How many days of observation are there in the control and experiment groups?
- How many missing values are there in each of the control and experiment datasets?
- The experiment for this project involves displaying a screen if a user clicks a particular button. What is the underlying statistical probability distribution for data collected from this type of experiment? Why?
- Assessment of the statistical significance of an A/B test depends on what kind of probability distribution the experimental data follows. Given your answer above, which statistical tests are appropriate to use for this project?
- In frequentist analysis, which is mostly used for A/B testing, we use p-values to measure the significance of the experimental feature against the null hypothesis (the hypothesis that the new feature does not have an impact). How are p-values computed? What information do p-values provide? Are you familiar with type-I and type-II errors? Can you comment on which error types p-values relate to?
- Is the number of data points in the experiment enough to make a reasonable judgement, or should Udacity run a longer experiment? Remember that running the experiment longer may be costly for many reasons, so you should always optimize the number of samples to make a statistically sound decision.
- What does your A/B testing analysis tell you? Does the experimental feature improve Enrollment, the target variable?
- Bonus points: Briefly describe your understanding of Bayesian A/B testing.

**Data analysis tasks**:
Tasks you need to perform here to demonstrate your understanding:
 * Plan your analysis steps and write down your plan in a Jupyter markdown cell
 * Load and explore the control and experiment data tables
 * Visualize some of the features to understand patterns and relationships 
 * Perform the A/B testing analysis, paying attention to the following details:
    * Missing values
    * Errors on your final result


### Objective 2: Machine Learning

**Quiz**:
- Which data features are relevant to predicting the target variable? 
- Explain the difference between using A/B testing to test a hypothesis (in this case, showing a message window) and using machine learning to learn the viability of the same effect.
- Understand why machine learning could be a better approach for performing A/B testing than traditional statistical inference (e.g. z-score, t-test).
- Explain the purpose of training with k-fold cross validation instead of using the whole data to train the ML models.
- Does the "Experiment" column prove to be relevant to predicting Enrollment? What does this tell you? Compare this with the A/B testing you did earlier. 
- What information do you gain using the Machine Learning approach that you couldn’t obtain using A/B testing?

**Data analysis tasks**:
- Combine the control_tbl and experiment_tbl, adding an “id” column indicating if the data was part of the experiment or not
- Add a “row_id” column to help track which rows are selected for training and testing in the modeling section
- Create a “Day of Week” feature from the “Date” column
- Drop the “Date” column and the “Payments” column
- Handle the missing data (NA) by removing these rows.
- Shuffle the rows to mix the data up for learning
- Using the “Enrollments” column as the target variable, train machine learning models with 5-fold cross validation using the following 3 algorithms:
    - Linear Regression
    - Decision Trees
    - XGBoost
- Calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of each model using the test data. See <a href=https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d> here </a> for reference on these metrics.
- Compute feature importance: what is driving the model? Which parameters are important predictors for the different ML models? What contributes to the goal of gaining Enrollments?
- Discuss your results and draw some conclusions. For example, how is the Experiment=0 or 1 variable related to the Enrollment prediction? Hint: think of positive and negative correlations. 
- Explain what information you gain using the Machine Learning approach that you couldn’t obtain using A/B testing.
- Make a recommendation on what Udacity should do.
- Comment on what will improve your model.
- Comment on the challenges you encountered.
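
As a rough illustration of the two error metrics named in the tasks above, the sketch below computes MAE and RMSE by hand. The "actual" values are the first four control enrollment counts shown later in this notebook; the predicted values and the helper names `mae` and `rmse` are made up for illustration and do not come from any fitted model.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

# Actual enrollments (first four control days) and hypothetical predictions
actual = [134, 147, 167, 156]
predicted = [130, 150, 160, 160]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because RMSE squares the residuals before averaging, it penalizes the single large miss (7 enrollments) more heavily than MAE does, which is why the two numbers differ.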

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
pageviews = 5000

In [3]:
df_control = pd.read_csv("data/Control.csv")
df_experiment = pd.read_csv("data/Experiment.csv")

In [4]:
df_control.head(5)


Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723.0,687.0,134.0,70.0
1,"Sun, Oct 12",9102.0,779.0,147.0,70.0
2,"Mon, Oct 13",10511.0,909.0,167.0,95.0
3,"Tue, Oct 14",9871.0,836.0,156.0,105.0
4,"Wed, Oct 15",10014.0,837.0,163.0,64.0


In [5]:
df_basevals = pd.read_csv("data/baseline_value.csv", index_col=False,header = None, names = ['metric','baseline_val'])
df_basevals.metric = df_basevals.metric.map(lambda x: x.lower())
df_basevals

Unnamed: 0,metric,baseline_val
0,metric,baseline_val
1,unique cookies to view course overview page pe...,40000
2,"unique cookies to click ""start free trial"" per...",3200
3,enrollments per day:,660
4,"click-through-probability on ""start free trial"":",0.08
5,"probability of enrolling, given click:",0.20625
6,"probability of payment, given enroll:",0.53
7,"probability of payment, given click",0.1093125


In [6]:

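# Analytical standard error of gross conversion (enrollments/clicks) for a
# sample of `pageviews` cookies, scaled by the baseline click-through rate (3200/40000)
round(np.sqrt((.206250*(1-.206250))/(pageviews*3200/40000)),4)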

0.0202
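
The one-line calculation above can be unpacked as follows. This is a minimal sketch that assumes the enrollment count is binomial in the number of clicks; the helper name `analytic_se` and the variable names are illustrative, while the baseline figures are taken from the table above.

```python
import math

def analytic_se(p, n):
    """Standard error of a binomial proportion p estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Baseline figures from the table above
baseline_pageviews = 40000
baseline_clicks = 3200
p_enroll_given_click = 0.20625

# Sample size used in the notebook
sample_pageviews = 5000

# Clicks expected in the sample, scaled from the baseline click-through rate
sample_clicks = sample_pageviews * baseline_clicks / baseline_pageviews

se_gross = analytic_se(p_enroll_given_click, sample_clicks)
print(round(se_gross, 4))
```

The unit of analysis for gross conversion is a click, not a pageview, which is why the 5000 pageviews are first converted to an expected 400 clicks.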

In [7]:
results = {"Control":pd.Series([df_control.Pageviews.sum(),df_control.Clicks.sum(),
                                  df_control.Enrollments.sum(),df_control.Payments.sum()],
                                  index = ["cookies","clicks","enrollments","payments"]),
           "Experiment":pd.Series([df_experiment.Pageviews.sum(),df_experiment.Clicks.sum(),
                               df_experiment.Enrollments.sum(),df_experiment.Payments.sum()],
                               index = ["cookies","clicks","enrollments","payments"])}
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Control,Experiment
cookies,345543.0,344660.0
clicks,28378.0,28325.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


In [8]:
##Count Metrics
df_results['Total']=df_results.Control + df_results.Experiment
df_results['Prob'] = 0.5
df_results['StdErr'] = np.sqrt((df_results.Prob * (1- df_results.Prob))/df_results.Total)
df_results["MargErr"] = 1.96 * df_results.StdErr
df_results["CI_lower"] = df_results.Prob - df_results.MargErr
df_results["CI_upper"] = df_results.Prob + df_results.MargErr
df_results["Obs_val"] = df_results.Experiment/df_results.Total
df_results["Pass_Sanity"] = df_results.apply(lambda x: (x.Obs_val > x.CI_lower) and (x.Obs_val < x.CI_upper),axis=1)
df_results['Diff'] = abs((df_results.Experiment - df_results.Control)/df_results.Total)

df_results

Unnamed: 0,Control,Experiment,Total,Prob,StdErr,MargErr,CI_lower,CI_upper,Obs_val,Pass_Sanity,Diff
cookies,345543.0,344660.0,690203.0,0.5,0.000602,0.00118,0.49882,0.50118,0.49936,True,0.001279
clicks,28378.0,28325.0,56703.0,0.5,0.0021,0.004116,0.495884,0.504116,0.499533,True,0.000935
enrollments,3785.0,3423.0,7208.0,0.5,0.005889,0.011543,0.488457,0.511543,0.474889,False,0.050222
payments,2033.0,1945.0,3978.0,0.5,0.007928,0.015538,0.484462,0.515538,0.488939,True,0.022122
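
The sanity check in the cell above can also be written as a small standalone function. The sketch below assumes each unit is assigned to control or experiment with probability 0.5; the function name `sanity_check` is hypothetical. The check is most meaningful for counts the experiment should not affect, such as cookies and clicks.

```python
import math

def sanity_check(n_control, n_experiment, expected_share=0.5, z=1.96):
    """Check whether the experiment group's share of the total count lies
    inside the confidence interval around the expected share."""
    total = n_control + n_experiment
    std_err = math.sqrt(expected_share * (1 - expected_share) / total)
    margin = z * std_err
    lower = expected_share - margin
    upper = expected_share + margin
    observed = n_experiment / total
    return lower, upper, observed, lower < observed < upper

# Cookie counts from the results table above
lower, upper, observed, passed = sanity_check(345543, 344660)
print(round(lower, 5), round(upper, 5), round(observed, 5), passed)
```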


In [9]:
# click through probability (clicks/cookies)

control_cookies = df_results.loc['cookies','Control']
control_clicks = df_results.loc['clicks','Control']

exp_cookies = df_results.loc['cookies','Experiment']
exp_clicks = df_results.loc['clicks', 'Experiment']

## control value 
cont_p_hat = control_clicks/control_cookies

## observed value (experimental value)
exp_p_hat = exp_clicks/exp_cookies

## Standard Error
SE_ClickProb = np.sqrt((cont_p_hat * (1- cont_p_hat))/control_cookies)


## margin of error for 95% confidence interval (z = 1.96)

ME_ClickProb = SE_ClickProb * 1.96

## CI
upper_ClickProb = exp_p_hat + ME_ClickProb
lower_ClickProb = exp_p_hat - ME_ClickProb

## Sane in the membrane (yes, it passes)
print(cont_p_hat,exp_p_hat,lower_ClickProb,upper_ClickProb, SE_ClickProb, ME_ClickProb)

0.08212581357457682 0.08218244066616376 0.08126698684411665 0.08309789448821087 0.0004670682765546443 0.0009154538220471028


In [10]:
## Evaluation Metrics Results Calculation
# Keep only the days on which enrollments were recorded in each group
df_control_notnull = df_control[df_control.Enrollments.notnull()]
df_experiment_notnull = df_experiment[df_experiment.Enrollments.notnull()]


In [11]:
results_notnull = {"Control":pd.Series([df_control_notnull.Pageviews.sum(),df_control_notnull.Clicks.sum(),
                                  df_control_notnull.Enrollments.sum(),df_control_notnull.Payments.sum()],
                                  index = ["cookies","clicks","enrollments","payments"]),
           "Experiment":pd.Series([df_experiment_notnull.Pageviews.sum(),df_experiment_notnull.Clicks.sum(),
                               df_experiment_notnull.Enrollments.sum(),df_experiment_notnull.Payments.sum()],
                               index = ["cookies","clicks","enrollments","payments"])}
df_results_notnull = pd.DataFrame(results_notnull)
df_results_notnull

Unnamed: 0,Control,Experiment
cookies,212163.0,211362.0
clicks,17293.0,17260.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


In [12]:
df_results_notnull['Total']=df_results_notnull.Control + df_results_notnull.Experiment

df_results_notnull

Unnamed: 0,Control,Experiment,Total
cookies,212163.0,211362.0,423525.0
clicks,17293.0,17260.0,34553.0
enrollments,3785.0,3423.0,7208.0
payments,2033.0,1945.0,3978.0


In [13]:
# experiment values

enrollments_exp = df_results_notnull.loc["enrollments"].Experiment
clicks_exp = df_results_notnull.loc["clicks"].Experiment
payments_exp = df_results_notnull.loc["payments"].Experiment

# control values

enrollments_cont = df_results_notnull.loc["enrollments"].Control
clicks_cont = df_results_notnull.loc["clicks"].Control
payments_cont = df_results_notnull.loc["payments"].Control



# metrics

GrossConversion_exp = enrollments_exp/clicks_exp
NetConversion_exp = payments_exp/clicks_exp
GrossConversion_cont = enrollments_cont/clicks_cont
NetConversion_cont = payments_cont/clicks_cont

GrossConversion = (enrollments_exp + enrollments_cont)/(clicks_cont + clicks_exp)
NetConversion = (payments_cont + payments_exp)/(clicks_cont + clicks_exp)

In [14]:
print('GrossConversion: {} \nNetConversion:{}'.format(GrossConversion,NetConversion))

GrossConversion: 0.20860706740369866 
NetConversion:0.1151274853124186


In [15]:
GrossConversion_cont

0.2188746891805933

In [16]:

GrossConversion_exp

0.19831981460023174

In [17]:
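def stats_prop(p_hat,z_score,N_cont,N_exp,diff):
    """Return the pooled standard error, margin of error, and confidence
    interval bounds for a difference between two proportions."""
    # pooled standard error of the difference in proportions
    std_err = np.sqrt((p_hat * (1- p_hat ))*(1/N_cont + 1/N_exp))
    marg_err = z_score * std_err
    ci_lower = diff - marg_err
    ci_upper = diff + marg_err

    return std_err,marg_err,ci_lower,ci_upper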

In [18]:
GrossConversion_diff = GrossConversion_exp - GrossConversion_cont
GrossConversion_diff

-0.020554874580361565

In [19]:
se_gross,me_gross,cil_gross,ciu_gross = stats_prop(GrossConversion,1.96,clicks_cont,
                                                   clicks_exp,GrossConversion_diff)

In [20]:
print(se_gross,me_gross,cil_gross,ciu_gross)

0.004371675385225936 0.008568483755042836 -0.0291233583354044 -0.01198639082531873


In [21]:
NetConversion_diff = NetConversion_exp - NetConversion_cont
NetConversion_diff

-0.0048737226745441675

In [22]:
se_net,me_net,cil_net,ciu_net = stats_prop(NetConversion,1.96,clicks_cont,
                                           clicks_exp,NetConversion_diff)

In [23]:
print(se_net,me_net,cil_net,ciu_net)

0.0034341335129324238 0.0067309016853475505 -0.011604624359891718 0.001857179010803383
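
The confidence intervals above can be complemented with a z-score and a two-sided p-value. The sketch below assumes the normal approximation to the difference in proportions holds; the function name `two_sided_p_value` is hypothetical, and the inputs are the differences and standard errors printed in the cells above.

```python
import math

def two_sided_p_value(diff, std_err):
    """Return the z-score and the two-sided p-value P(|Z| >= |z|)
    for a standard normal Z."""
    z = diff / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Gross conversion: difference and pooled standard error from above
z_gross, p_gross = two_sided_p_value(-0.020554874580361565, 0.004371675385225936)

# Net conversion: difference and pooled standard error from above
z_net, p_net = two_sided_p_value(-0.0048737226745441675, 0.0034341335129324238)

print("gross:", z_gross, p_gross)
print("net:  ", z_net, p_net)
```

A p-value below 0.05 corresponds to a 95% confidence interval that excludes zero, so these numbers should agree with the intervals printed above: the gross conversion interval excludes zero, while the net conversion interval contains it.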


In [24]:
df_SignTest = pd.merge(df_control_notnull,df_experiment_notnull,on="Date")
df_SignTest['GrossConversion_cont'] = df_SignTest.Enrollments_x/df_SignTest.Clicks_x
df_SignTest['GrossConversion_exp'] = df_SignTest.Enrollments_y/df_SignTest.Clicks_y
df_SignTest['NetConversion_cont'] = df_SignTest.Payments_x/df_SignTest.Clicks_x
df_SignTest['NetConversion_exp'] = df_SignTest.Payments_y/df_SignTest.Clicks_y

cols = ['Date','GrossConversion_cont','GrossConversion_exp','NetConversion_cont','NetConversion_exp']

In [25]:
df_SignTest = df_SignTest[cols]

In [26]:
df_SignTest.head()

Unnamed: 0,Date,GrossConversion_cont,GrossConversion_exp,NetConversion_cont,NetConversion_exp
0,"Sat, Oct 11",0.195051,0.153061,0.101892,0.049563
1,"Sun, Oct 12",0.188703,0.147771,0.089859,0.115924
2,"Mon, Oct 13",0.183718,0.164027,0.10451,0.089367
3,"Tue, Oct 14",0.186603,0.166868,0.125598,0.111245
4,"Wed, Oct 15",0.194743,0.168269,0.076464,0.112981


In [27]:
df_SignTest['GC_Sign'] = df_SignTest.GrossConversion_cont - df_SignTest.GrossConversion_exp
df_SignTest['NC_Sign'] = df_SignTest.NetConversion_cont - df_SignTest.NetConversion_exp

In [28]:
len(df_SignTest)

23

len(df_SignTest[df_SignTest.GC_Sign > 0])
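
Once the number of days with a positive sign is known, a sign test asks how likely a split at least that lopsided would be if each day were a fair coin flip. The following is a minimal sketch of an exact two-sided binomial sign test; the function name `sign_test_p_value` and the example counts are illustrative and are not results from this notebook.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Two-sided exact binomial p-value for `successes` out of `n`
    trials under the null hypothesis p = 0.5."""
    k = min(successes, n - successes)
    # Probability of a result at least as extreme in one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1
    return min(1.0, 2 * tail)

# Hypothetical example: 2 positive days out of 10
example_p = sign_test_p_value(2, 10)
print(example_p)
```

For the notebook's data, `successes` would be the count returned by the expression above and `n` would be `len(df_SignTest)`.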

  QUESTION TWO---MACHINE LEARNING
  

In [84]:
df_control.head(3)

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723.0,687.0,134.0,70.0
1,"Sun, Oct 12",9102.0,779.0,147.0,70.0
2,"Mon, Oct 13",10511.0,909.0,167.0,95.0


In [85]:
df_experiment.head(3)


Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0


In [94]:
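# Combine the control_tbl and experiment_tbl
# (an outer merge on all shared columns stacks the rows of both tables,
# but it does not record which group each row came from)
merged_data = pd.merge(df_control, df_experiment, how = 'outer')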

In [155]:
merged_data.describe()

Unnamed: 0,Pageviews,Clicks,Enrollments,Payments
count,74.0,74.0,46.0,46.0
mean,9327.067568,766.256757,156.695652,86.478261
std,719.455794,66.005616,32.289571,21.730408
min,7434.0,632.0,94.0,34.0
25%,8891.5,713.0,131.75,70.0
50%,9379.5,768.5,153.5,91.0
75%,9779.0,826.5,175.5,100.75
max,10667.0,909.0,233.0,128.0
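
The task list asks for a column indicating whether each row came from the experiment, plus a `row_id` column and a shuffle. One possible way to do this is with `pd.concat` rather than a merge. The sketch below uses small toy tables built from the first rows shown above; the column name `Experiment` and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the control and experiment tables
control_tbl = pd.DataFrame({
    "Date": ["Sat, Oct 11", "Sun, Oct 12"],
    "Pageviews": [7723, 9102],
    "Clicks": [687, 779],
    "Enrollments": [134.0, 147.0],
})
experiment_tbl = pd.DataFrame({
    "Date": ["Sat, Oct 11", "Sun, Oct 12"],
    "Pageviews": [7716, 9288],
    "Clicks": [686, 785],
    "Enrollments": [105.0, 116.0],
})

# Stack the two tables and flag the group each row belongs to
combined = pd.concat(
    [control_tbl.assign(Experiment=0), experiment_tbl.assign(Experiment=1)],
    ignore_index=True,
)

# Sequential identifier so rows can be traced after shuffling
combined["row_id"] = range(1, len(combined) + 1)

# Shuffle the rows reproducibly
shuffled = combined.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```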


In [100]:
# Declare an empty list that is to be converted into a column 
row_id = [] 
  
# Using 'Train' as the column name 
# and equating it to the list 
#merged_data['Train'] = row_id 

In [154]:
from datetime import datetime
# Create a “Day of Week” feature from the “Date” column
#merged_data['Date'] = pd.to_datetime(merged_data['Date'])  

In [133]:
merged_data['Date'] 

In [134]:
type(merged_data['Date'])

pandas.core.series.Series
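
Since the `Date` strings already start with a weekday abbreviation (for example "Sat, Oct 11") and carry no year, one option for the "Day of Week" feature is to take the text before the comma instead of parsing a full date. This is a sketch under that assumption, using a toy Series in place of `merged_data['Date']`; the `DOW` prefix is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for merged_data['Date']
dates = pd.Series(["Sat, Oct 11", "Sun, Oct 12", "Mon, Oct 13"], name="Date")

# The weekday abbreviation is the text before the first comma
day_of_week = dates.str.split(",").str[0].str.strip()

# One-hot encode the weekday so it can be fed to a regression model
dow_dummies = pd.get_dummies(day_of_week, prefix="DOW")

print(day_of_week.tolist())
print(dow_dummies.columns.tolist())
```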