## Preparation

In [12]:
import pandas as pd
import numpy as np
import scipy.stats as st

In [2]:
# load data
test_result_path = 'dataset/Translation_Test/test_table.csv'
user_info_path = 'dataset/Translation_Test/user_table.csv'
test_result = pd.read_csv(test_result_path)
test_result.head(5)

Unnamed: 0,user_id,date,source,device,browser_language,ads_channel,browser,conversion,test
0,315281,2015-12-03,Direct,Web,ES,,IE,1,0
1,497851,2015-12-04,Ads,Web,ES,Google,IE,0,1
2,848402,2015-12-04,Ads,Web,ES,Facebook,Chrome,0,0
3,290051,2015-12-03,Ads,Mobile,Other,Facebook,Android_App,0,1
4,548435,2015-11-30,Ads,Web,ES,Google,FireFox,0,1


In [3]:
user_info = pd.read_csv(user_info_path)
user_info.head(5)

Unnamed: 0,user_id,sex,age,country
0,765821,M,20,Mexico
1,343561,F,27,Nicaragua
2,118744,M,23,Colombia
3,987753,F,27,Venezuela
4,554597,F,20,Spain


## Experiment Overview
In Spanish-speaking countries, the language displayed on the site might influence visitor conversion. To test whether site language has a causal effect on the conversion rate, the scientists set up an experiment as follows:
For control group: display Spanish
For test group: display the local language

Make sure the noise caused by "unrelated" variables is as similar as possible in both groups. These variables are: marketing channel, device, ads channel, browser, and the sex and age of users.
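One way to eyeball this balance is to compare, for each variable, the share of every category within the control and test groups. The sketch below is only an illustration: `group_shares` is a hypothetical helper, and the small `sample` frame is made-up data that stands in for `test_result`.

```python
import pandas as pd

def group_shares(df, column, group_col="test"):
    """Share of each `column` value within each experiment group (hypothetical helper)."""
    # normalize="columns" scales every group's column so that it sums to 1
    return pd.crosstab(df[column], df[group_col], normalize="columns")

# Made-up rows, only to show the call; not the real experiment data
sample = pd.DataFrame({
    "test":   [0, 0, 0, 0, 1, 1, 1, 1],
    "device": ["Web", "Web", "Mobile", "Mobile", "Web", "Web", "Web", "Mobile"],
})
shares = group_shares(sample, "device")
print(shares)
```

On the real data, the same call could be repeated for source, ads_channel and browser; large gaps between the two columns would suggest the groups are not comparable on that variable.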

An additional sanity check: for users in the test group whose country is not Spain, browser_language should be "Other"; for users in the control group, browser_language should always be "ES".
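The sanity check could be sketched as follows, taking the rule exactly as stated above. `find_violations` is a hypothetical helper, the join on `user_id` is an assumption about how the two tables relate, and the toy frames are made-up data rather than the real tables.

```python
import pandas as pd

def find_violations(test_result, user_info):
    """Return rows that break the browser_language rule (hypothetical helper)."""
    # Assumption: user_id links the test results to the user table
    merged = test_result.merge(user_info, on="user_id", how="left")
    is_test = merged["test"] == 1
    # Test group, outside Spain: browser_language is expected to be "Other"
    bad_test = is_test & (merged["country"] != "Spain") & (merged["browser_language"] != "Other")
    # Control group: browser_language is expected to be "ES"
    bad_control = ~is_test & (merged["browser_language"] != "ES")
    return merged[bad_test | bad_control]

# Made-up rows, only to show the call
toy_results = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "test": [1, 1, 0, 0],
    "browser_language": ["Other", "ES", "ES", "EN"],
})
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["Mexico", "Mexico", "Spain", "Colombia"],
})
violations = find_violations(toy_results, toy_users)
print(violations["user_id"].tolist())
```

With `how="left"`, users missing from the user table keep a NaN country, so test-group rows without a known country are also flagged unless their browser_language is "Other".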

## Question 1:
### Based on the conversion rate of test and control groups, is it true that the test is negative?

In [7]:
# Separate rows in the test_result dataframe with the test column as 1(test) or 0(control)
test_group = test_result.loc[test_result['test'] == 1]
control_group = test_result.loc[test_result['test'] == 0]

In [9]:
# sample sizes of the test and control groups
n_test = test_group.shape[0]
n_control = control_group.shape[0]
print([n_test,n_control])

[215983, 237338]


In [10]:
# total number of converted users in each group
cov_test = test_group['conversion'].sum()
cov_control = control_group['conversion'].sum()
print([cov_test,cov_control])

[9379, 13096]


To check whether the difference in conversion rates between the test and control groups is negative, we run a one-sided **two-sample proportion test**. (https://onlinecourses.science.psu.edu/stat414/node/268/)

In [11]:
# conversion rates
p_test = float(cov_test/n_test)
p_control = float(cov_control/n_control)
# delta
delta = p_test - p_control
# theta_square: estimated variance of delta, using the pooled conversion rate
p_pool = float((cov_test+cov_control)/(n_test+n_control))
theta_square = p_pool*(1-p_pool)*(1/n_test + 1/n_control)

In [15]:
print("Conversion Rate of Test Group: %s" % round(p_test, 5))
print("Conversion Rate of Control Group: %s" % round(p_control, 5))

Conversion Rate of Test Group: 0.04342
Conversion Rate of Control Group: 0.05518


Null Hypothesis H0: delta >= 0

Alternative Hypothesis Ha: delta < 0

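Test statistic:
$$ Z = \frac{\delta - 0}{\theta}$$

Here $\delta$ is the observed difference in conversion rates (test minus control) and $\theta$ is the square root of theta_square.

Reject H0 at the $5\%$ significance level if $Z \leq -Z_{0.95}$.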
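As a compact restatement of this decision rule, the sketch below wraps the pooled one-sided test in a single function and also reports the lower-tail p-value. `one_sided_prop_ztest` is a hypothetical helper name; the counts passed in are the ones printed earlier in this notebook.

```python
import math
import scipy.stats as st

def one_sided_prop_ztest(x_test, n_test, x_control, n_control):
    """Pooled two-sample proportion z-test for Ha: p_test < p_control (hypothetical helper)."""
    p_test = x_test / n_test
    p_control = x_control / n_control
    # Pooled conversion rate, used under H0 to estimate the variance of delta
    p_pool = (x_test + x_control) / (n_test + n_control)
    theta = math.sqrt(p_pool * (1 - p_pool) * (1 / n_test + 1 / n_control))
    z = (p_test - p_control) / theta
    # Lower-tail probability, because the alternative is delta < 0
    p_value = st.norm.cdf(z)
    return z, p_value

# Counts taken from the outputs above: conversions and sample sizes per group
z, p_value = one_sided_prop_ztest(9379, 215983, 13096, 237338)
reject_h0 = z <= -st.norm.ppf(0.95)
print(z, p_value, reject_h0)
```

Comparing `p_value` with 0.05 is equivalent to comparing `z` with the critical value, so either form of the rule leads to the same decision.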

In [18]:
z_statistics = delta/np.sqrt(theta_square)
z_95 = st.norm.ppf(1-0.05) # find Z0.95 with scipy.stats pkg
print([z_statistics, -z_95])

[-18.20833653382862, -1.6448536269514722]


### Since the Z statistic (about -18.2) is far below $-Z_{0.95}$ (about -1.64), we reject H0 and conclude that the conversion rate of the test group is indeed lower than that of the control group.