## Descriptive statistics

Using data from an A/B test on a MOOC platform, this notebook explores the available dimensions and computes descriptive 
statistics of revenue per session (rps) for the 'Gender' dimension, restricted here to female users. It reports the following quantities:

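rps_diff_frac = fractional lift of revenue per session for female users, computed as the ratio of the test mean to the control mean (a value of 1.062 corresponds to a lift of about 6.2%)

rps_diff = absolute difference in mean revenue per session between the test and control variants

rps_ctrl = mean revenue per session for the control variant of the experiment

rps_diff_err = standard error of the mean (SEM) of revenue per session, computed over the raw control sessions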

* Note: The standard error of the mean (SEM), σM = σ / √N, is the standard deviation of the sampling distribution of the mean, where σ is the standard deviation of the original distribution and N is the sample size (the number of observations each mean is based on).

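The SEM can be computed in two equivalent ways, either with `scipy.stats.sem` or directly from the sample standard deviation:

a) s = pd.Series(np.random.randn(1000))

stats.sem(s.values) # defaults are axis=0, ddof=1, i.e. n - 1 degrees of freedom

b) s.std() / np.sqrt(len(s)) # pandas std() also defaults to ddof=1, so both results agree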
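As a quick check of the two approaches above, the following sketch computes the SEM both ways on simulated data and also shows the effect of `ddof`. The seed and the sample size are arbitrary choices made for illustration, not values from the experiment.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated standard-normal sample; the seed is arbitrary and only makes the run repeatable.
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(1000))

sem_scipy = stats.sem(s.values)            # scipy default: ddof=1
sem_manual = s.std() / np.sqrt(len(s))     # pandas std() default: ddof=1
sem_ddof0 = stats.sem(s.values, ddof=0)    # population formula, slightly smaller

print(sem_scipy, sem_manual, sem_ddof0)
```

With the default `ddof=1` the two results match to floating-point precision. Passing `ddof=0` divides by N instead of N - 1 and gives a slightly smaller value.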


----
Output:

rps_diff_frac = 1.062

rps_diff = 0.54

rps_ctrl = 8.71

rps_diff_err = 0.059
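To make the four quantities concrete, here is a minimal sketch on a toy per-session table. The data values are made up, and the sketch works directly on per-session rows rather than on the aggregated hypercube used in the notebook. Note that `rps_diff` keeps its sign here, whereas the notebook takes the absolute value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy per-session data; every value here is invented for illustration.
toy = pd.DataFrame({
    'variant': ['control'] * 4 + ['test'] * 4,
    'rps':     [0, 10, 0, 30, 0, 20, 0, 28],
})

ctrl = toy.loc[toy['variant'] == 'control', 'rps']
test = toy.loc[toy['variant'] == 'test', 'rps']

rps_ctrl = ctrl.mean()                    # mean revenue per session, control
rps_test = test.mean()                    # mean revenue per session, test
rps_diff = rps_test - rps_ctrl            # signed difference of the means
rps_diff_frac = rps_test / rps_ctrl       # ratio of test mean to control mean
rps_diff_err = stats.sem(ctrl.values)     # SEM of the control sessions (ddof=1)

print(rps_ctrl, rps_test, rps_diff, rps_diff_frac, rps_diff_err)
```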


In [5]:
import pandas as pd
import numpy as np
from scipy import stats

# Aggregated ("hypercube") sessions and the raw per-session records
sessions = pd.read_csv('./data/sessions-hypercube.csv')
sessions_orig = pd.read_csv('./data/sessions-with-features.csv')

# Raw sessions of female users in the control variant (used for the SEM)
orig_sessions_female_control = sessions_orig[
    (sessions_orig['gender'] == 'female') & (sessions_orig['variant'] == 'control')]
print(sessions_orig.head(), len(sessions_orig))
print(orig_sessions_female_control.head(), len(orig_sessions_female_control))

# Hypercube rows for female users, split by variant.
# .copy() avoids writing the new column into a view of `sessions`.
is_female = sessions['gender'] == 'female'
female_sessions_control = sessions[is_female & (sessions['variant'] == 'control')].copy()
female_sessions_test = sessions[is_female & (sessions['variant'] == 'test')].copy()

# Mean revenue per session within each hypercube row
female_sessions_control['mean_rps'] = female_sessions_control.rps_sum / female_sessions_control.n
female_sessions_test['mean_rps'] = female_sessions_test.rps_sum / female_sessions_test.n

print(female_sessions_control.head(), type(female_sessions_control))

# Unweighted average of the per-row means for each variant
rps_ctrl = female_sessions_control['mean_rps'].mean()
rps_test = female_sessions_test['mean_rps'].mean()

print("\nrps_ctrl: ", rps_ctrl)
print("rps_test: ", rps_test)

# Absolute difference and test/control ratio of the two means
print("rps_diff: ", np.abs(rps_ctrl - rps_test))
print("rps_diff_frac: ", np.divide(rps_test, rps_ctrl))

# Standard error of the mean over the raw female control sessions
print("rps_diff_err: ", stats.sem(orig_sessions_female_control['rps']))
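The code above averages the per-row `mean_rps` values with equal weight and takes the SEM from the raw sessions. As a hedged sketch, assuming `rps2_sum` holds the per-row sum of squared `rps` values (the column name suggests this, but the source does not confirm it), a session-weighted mean and the SEM can both be recovered from the hypercube sums alone. The toy data and the `group` column are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy raw sessions in two hypothetical groups of unequal size (values made up).
raw = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b'],
    'rps':   [0.0, 0.0, 10.0, 2.0, 20.0, 40.0],
})

# Aggregate to count, sum and sum of squares per group,
# mirroring the assumed meaning of n, rps_sum and rps2_sum.
cube = (raw.assign(rps2=raw['rps'] ** 2)
           .groupby('group')
           .agg(n=('rps', 'count'), rps_sum=('rps', 'sum'), rps2_sum=('rps2', 'sum')))

n = cube['n'].sum()
total = cube['rps_sum'].sum()
total_sq = cube['rps2_sum'].sum()

mean_of_means = (cube['rps_sum'] / cube['n']).mean()   # equal weight per group
pooled_mean = total / n                                # weighted by session count

# Sample variance (ddof=1) from the sums, then the SEM.
var = (total_sq - n * pooled_mean ** 2) / (n - 1)
sem_from_sums = np.sqrt(var / n)

print(mean_of_means, pooled_mean, sem_from_sums)
```

The two means differ whenever the groups have unequal session counts, so which one is appropriate depends on whether each hypercube row or each session should carry equal weight.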



   customer_id   datestamp  variant engagement_level  gender  ips  cps  pps  \
0            0  2016-04-11     test     less_engaged  female    6    3    1   
1            0  2016-04-14     test     less_engaged  female    3    1    0   
2            1  2016-04-11  control     less_engaged  female    2    1    0   
3            2  2016-04-17  control     less_engaged  female    3    0    0   
4            3  2016-04-12  control     less_engaged  female    3    0    0   

   rps  
0   25  
1    0  
2    0  
3    0  
4    0   165649
    customer_id   datestamp  variant engagement_level  gender  ips  cps  pps  \
2             1  2016-04-11  control     less_engaged  female    2    1    0   
3             2  2016-04-17  control     less_engaged  female    3    0    0   
4             3  2016-04-12  control     less_engaged  female    3    0    0   
6             5  2016-04-12  control     less_engaged  female    1    0    0   
12           10  2016-04-12  control     less_engaged  female   

pandas.core.frame.DataFrame

pandas.core.frame.DataFrame

     datestamp  variant engagement_level  gender     n  ips_sum  ips2_sum  \
0   2016-04-11  control     less_engaged  female  6012    16791     64643   
3   2016-04-11  control     more_engaged  female  2395    30543    528387   
12  2016-04-12  control     less_engaged  female  6351    17901     69519   
15  2016-04-12  control     more_engaged  female  2401    30251    508169   
24  2016-04-13  control     less_engaged  female  6245    17425     67085   

    cps_sum  cps2_sum  pps_sum  pps2_sum  rps_sum  rps2_sum   mean_rps  
0      4463      7967      883      1011    18064    537322   3.004657  
3      8859     52487     1722      3376    35157   1641085  14.679332  
12     4756      8452      955      1089    20018    615046   3.151945  
15     8786     52492     1670      3174    34997   1618677  14.576010  
24     4820      8688      949      1089    19490    565980   3.120897   <class 'pandas.core.frame.DataFrame'>

rps_ctrl:  8.70970871788
rps_test:  9.24934892878
rps_diff: 