In [1]:
import psycopg2
import pandas as pd

**Task 1**

- **Business Hypothesis**:
    - H0: There is no association between the onboarding flow and onboarding completion
    - Ha: There is an association between the onboarding flow and onboarding completion

In [2]:
conn = psycopg2.connect(user='postgres',dbname='dwh',host='localhost',password='gun125')

In [3]:
sql = '''
SELECT a.variant
,count(case when b.user_id is not null then a.user_id end) as completed
,count(case when b.user_id is null then a.user_id end) as not_completed
FROM exp_assignment a
LEFT JOIN game_actions b on a.user_id = b.user_id
 and b.action = 'onboarding complete'
WHERE a.exp_name = 'Onboarding'
GROUP BY 1
'''
df1 = pd.read_sql(sql,conn)



In [4]:
df1

Unnamed: 0,variant,completed,not_completed
0,variant 1,38280,11995
1,control,36268,13629


In [5]:
from scipy.stats import chi2_contingency

In [7]:
stat, p, dof, expected=chi2_contingency(df1[['completed','not_completed']])

In [9]:
print(p)

5.397977210897444e-36


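The p-value (about 5.4e-36) is far below the conventional 0.05 threshold, so we reject H0 and conclude that onboarding completion is associated with the onboarding flow a user was assigned. In the counts above, variant 1 completed onboarding at about 76.1% (38280 of 50275) versus about 72.7% (36268 of 49897) for control.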
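The test can be reproduced without a database connection from the counts printed in `df1`. The sketch below hard-codes those counts (the `counts` and `completion_rate` names are illustrative, not part of the original notebook) and also reports the completion rate per variant, which shows the direction of the association.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the df1 output above.
counts = pd.DataFrame(
    {"completed": [38280, 36268], "not_completed": [11995, 13629]},
    index=["variant 1", "control"],
)

# Completion rate per variant: completed users / all assigned users.
completion_rate = counts["completed"] / counts.sum(axis=1)

# Same call as in the notebook. For a 2x2 table SciPy applies
# Yates' continuity correction by default (correction=True).
stat, p, dof, expected = chi2_contingency(counts)

print(completion_rate.round(4))
print(f"chi2={stat:.1f}, dof={dof}, p={p:.3g}")
```

Reporting the rates next to the p-value makes it clear which flow performed better, something the chi-square statistic alone does not show.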

**Task 2**

- **Business Hypothesis**:
    - H0: There is no difference in the average amount spent between users who experienced the new onboarding flow and users who experienced the old onboarding flow
    - Ha: There is a difference in the average amount spent between users who experienced the new onboarding flow and users who experienced the old onboarding flow

In [10]:
sql = '''
SELECT a.variant
    ,a.user_id
    ,sum(coalesce(b.amount,0)) as amount
    FROM exp_assignment a
    LEFT JOIN game_purchases b on a.user_id = b.user_id
    WHERE a.exp_name = 'Onboarding'
    GROUP BY 1,2
'''
df2 = pd.read_sql(sql,conn)



In [12]:
df2.head()

Unnamed: 0,variant,user_id,amount
0,control,6216,0.0
1,control,56447,0.0
2,variant 1,46230,0.0
3,control,62649,0.0
4,control,53258,0.0


In [13]:
old_flow = df2[df2['variant']=='control']['amount']
new_flow = df2[df2['variant']=='variant 1']['amount']

In [14]:
import scipy.stats as stats

In [15]:
old_flow.shape

(49897,)

In [16]:
new_flow.shape

(50275,)

In [17]:
stats.ttest_ind(old_flow,new_flow,equal_var=False)

Ttest_indResult(statistic=0.776545812794534, pvalue=0.43742861555660695)

The p-value (about 0.44) is well above 0.05, so we fail to reject H0: the data provide no evidence that the average amount spent differs between users assigned to the new onboarding flow and users assigned to the old one.
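Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which does not assume the two groups share a variance. The sketch below recomputes the statistic by hand and compares it with SciPy. The data are simulated (the `simulate_spend` helper and its parameters are hypothetical, chosen only to mimic spend data where most users spend nothing) and are not the values in `df2`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_spend(n, rng):
    """Hypothetical zero-inflated spend: ~5% of users buy, the rest spend 0."""
    buys = rng.random(n) < 0.05
    return np.where(buys, rng.exponential(20.0, n), 0.0)

def welch_t(x, y):
    """Welch's t statistic and two-sided p-value for mean(x) - mean(y)."""
    vx = x.var(ddof=1) / x.size  # squared standard error of mean(x)
    vy = y.var(ddof=1) / y.size  # squared standard error of mean(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

a = simulate_spend(5000, rng)
b = simulate_spend(5000, rng)

t_manual, p_manual = welch_t(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The sign of the statistic follows the argument order: a positive value means the first sample has the larger mean.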

**Task 3**

- **Business Hypothesis**:
    - H0: There is no difference in the average amount spent between users who completed the new onboarding flow and users who completed the old onboarding flow
    - Ha: There is a difference in the average amount spent between users who completed the new onboarding flow and users who completed the old onboarding flow

In [18]:
sql = '''
SELECT a.variant
    ,a.user_id
    ,sum(coalesce(b.amount,0)) as amount
    FROM exp_assignment a
    LEFT JOIN game_purchases b on a.user_id = b.user_id
    JOIN game_actions c on a.user_id = c.user_id
     and c.action = 'onboarding complete'
    WHERE a.exp_name = 'Onboarding'
    GROUP BY 1,2
'''
df3 = pd.read_sql(sql,conn)



In [19]:
df3.head()

Unnamed: 0,variant,user_id,amount
0,control,6216,0.0
1,control,56447,0.0
2,variant 1,46230,0.0
3,control,62649,0.0
4,control,53258,0.0


In [20]:
old_flow = df3[df3['variant']=='control']['amount']
new_flow = df3[df3['variant']=='variant 1']['amount']

In [21]:
old_flow.shape

(36268,)

In [22]:
new_flow.shape

(38280,)

In [23]:
stats.ttest_ind(old_flow,new_flow,equal_var=False)

Ttest_indResult(statistic=2.2296491118689623, pvalue=0.025773724266754013)

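The p-value (about 0.026) is below 0.05, so we reject H0 at the 5% significance level: the average amount spent differs between users who completed the new flow and users who completed the old flow. Because `old_flow` is passed as the first argument, the positive t-statistic means the control group has the higher sample mean.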
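A p-value alone does not say how large the difference is. One way to report size and direction is a confidence interval for the difference in means. The sketch below is a minimal version using Welch's standard error; `mean_diff_ci` is a hypothetical helper, and the two groups are simulated rather than taken from `df3`.

```python
import numpy as np
from scipy import stats

def mean_diff_ci(x, y, level=0.95):
    """Return (diff, low, high) for mean(x) - mean(y) using Welch's standard error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    margin = stats.t.ppf(0.5 + level / 2, df) * se
    diff = x.mean() - y.mean()
    return diff, diff - margin, diff + margin

rng = np.random.default_rng(42)
group_a = rng.exponential(5.0, 2000)  # simulated spend, illustration only
group_b = rng.exponential(4.0, 2000)

diff, low, high = mean_diff_ci(group_a, group_b)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"diff={diff:.3f}, 95% CI=({low:.3f}, {high:.3f}), t={t_stat:.2f}, p={p_value:.3g}")
```

With the notebook's data the call would be `mean_diff_ci(old_flow, new_flow)`, which gives the control mean minus the variant mean.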

**Task 4**

- **Business Hypothesis**:
    - H0: There is no difference in the email opt-in rate before vs after the legislation
    - Ha: There is a difference in the email opt-in rate before vs after the legislation

In [36]:
sql = '''
WITH t1 AS (
    SELECT a.created
        ,CASE WHEN a.created BETWEEN '2020-01-13' AND '2020-01-26' THEN 'pre'
              WHEN a.created BETWEEN '2020-01-27' AND '2020-02-09' THEN 'post'
         END AS variant
        ,b.user_id AS opted_in
        ,a.user_id AS cohorted
    FROM game_users a
    LEFT JOIN game_actions b on a.user_id = b.user_id
     and b.action = 'email_optin'
    WHERE a.created BETWEEN '2020-01-13' AND '2020-02-09'
),
t2 AS (
    SELECT created
        ,variant
        ,count(distinct opted_in) AS opted_in
        ,count(distinct cohorted) AS cohorted
    FROM t1
    GROUP BY 1,2
)
SELECT created
    ,variant
    ,opted_in::float/cohorted::float AS perc_opted
FROM t2

'''
df4 = pd.read_sql(sql,conn)



In [38]:
df4.head()

Unnamed: 0,created,variant,perc_opted
0,2020-01-13,pre,0.603009
1,2020-01-14,pre,0.594896
2,2020-01-15,pre,0.616311
3,2020-01-16,pre,0.580982
4,2020-01-17,pre,0.600287


In [39]:
pre = df4[df4['variant']=='pre']['perc_opted']
post = df4[df4['variant']=='post']['perc_opted']

In [40]:
pre.shape

(14,)

In [41]:
post.shape

(14,)

In [43]:
pre.mean()

0.5860421451914261

In [44]:
post.mean()

0.40678111885456747

In [42]:
stats.ttest_ind(pre,post,equal_var=True)

Ttest_indResult(statistic=28.54822227709363, pvalue=3.70597995898679e-21)

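The p-value (about 3.7e-21) is very low, so we reject H0 and conclude that the daily email opt-in rate differs between the two periods. The mean daily rate fell from about 58.6% before the legislation to about 40.7% after it. This is consistent with the legislation having reduced opt-ins, although a pre/post comparison alone cannot rule out other changes over the same period.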
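The test above uses `equal_var=True`, which pools the variances of the two periods. One common (though not the only) way to check that assumption is Levene's test, falling back to Welch's test when the variances look unequal. The sketch below uses simulated daily rates, not the values in `df4`; the means, spread, and the 0.05 cut-off are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical daily opt-in rates, 14 days per period (not the df4 values).
pre_rates = rng.normal(0.59, 0.015, 14)
post_rates = rng.normal(0.41, 0.015, 14)

# Levene's test: H0 is that both periods have equal variance.
lev_stat, lev_p = stats.levene(pre_rates, post_rates)

# Pool the variances only when there is no evidence they differ;
# otherwise use Welch's test (equal_var=False).
equal_var = bool(lev_p >= 0.05)
result = stats.ttest_ind(pre_rates, post_rates, equal_var=equal_var)

print(f"Levene p={lev_p:.3f}, equal_var={equal_var}")
print(f"t={result.statistic:.2f}, p={result.pvalue:.3g}")
```

With the notebook's data the same check would be `stats.levene(pre, post)` before the call to `stats.ttest_ind`.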