Take the dataset from Google Drive: https://drive.google.com/file/d/1MpWBFIbqu4mbiD0BBKYX6YhS-f4mN3Z_. Test which variant of the experiment (control/personalization) has the higher conversion rate (converted) and whether the difference is statistically significant.

In [1]:
import numpy as np
import pandas as pd

from statsmodels.stats import proportion


In [2]:
!wget 'https://drive.google.com/uc?export=download&id=1MpWBFIbqu4mbiD0BBKYX6YhS-f4mN3Z_' -O data.zip

--2022-04-12 11:37:28--  https://drive.google.com/uc?export=download&id=1MpWBFIbqu4mbiD0BBKYX6YhS-f4mN3Z_
Resolving drive.google.com (drive.google.com)... 172.253.115.139, 172.253.115.113, 172.253.115.100, ...
Connecting to drive.google.com (drive.google.com)|172.253.115.139|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-04-c0-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/9vcjauc4g2aprvja769p5sorar5h5gmo/1649763450000/14904333240138417226/*/1MpWBFIbqu4mbiD0BBKYX6YhS-f4mN3Z_?e=download [following]
--2022-04-12 11:37:30--  https://doc-04-c0-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/9vcjauc4g2aprvja769p5sorar5h5gmo/1649763450000/14904333240138417226/*/1MpWBFIbqu4mbiD0BBKYX6YhS-f4mN3Z_?e=download
Resolving doc-04-c0-docs.googleusercontent.com (doc-04-c0-docs.googleusercontent.com)... 172.217.15.97, 2607:f8b0:4004:811::2001
Connecting to doc-04-c0-docs.googleusercontent.com (doc-04-c0

In [3]:
!unzip data.zip

Archive:  data.zip
  inflating: marketing description.txt  
  inflating: marketing_campaign.csv  
  inflating: subscribers.csv         
  inflating: users.csv               


In [4]:
marketing_campaign = pd.read_csv('marketing_campaign.csv')
marketing_campaign

Unnamed: 0,user_id,date_served,marketing_channel,variant,language_displayed,converted
0,a1000,1/1/18,House Ads,personalization,English,True
1,a1001,1/1/18,House Ads,personalization,English,True
2,a1002,1/1/18,House Ads,personalization,English,True
3,a1003,1/1/18,House Ads,personalization,English,True
4,a1004,1/1/18,House Ads,personalization,English,True
...,...,...,...,...,...,...
10032,a11032,1/17/18,Email,control,German,True
10033,a11033,1/17/18,Email,control,German,True
10034,a11034,1/5/18,Instagram,control,German,False
10035,a11035,1/17/18,Email,control,German,True


In [5]:
marketing_campaign.variant.value_counts()

control            5091
personalization    4946
Name: variant, dtype: int64

In [6]:
marketing_campaign.converted.value_counts()

False    8946
True     1076
Name: converted, dtype: int64
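
The counts above total 10,022 (8946 + 1076), while the variant counts in cell 5 total 10,037, which suggests that 15 rows have no `converted` value. A minimal sketch of how such gaps could be surfaced, using a small made-up frame (`toy` is a hypothetical stand-in, not the real file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for marketing_campaign (hypothetical values, not the real data)
toy = pd.DataFrame({
    "variant": ["control", "control", "personalization", "personalization", "control"],
    "converted": [True, False, True, np.nan, np.nan],
})

# dropna=False keeps missing values as their own category in the tally
counts_with_nan = toy["converted"].value_counts(dropna=False)

# isna() flags the missing entries; summing the flags counts them
n_missing = int(toy["converted"].isna().sum())

print(counts_with_nan)
print("rows with missing converted:", n_missing)
```

On the real data the same two calls would be applied to `marketing_campaign['converted']`.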

In [7]:
subscribers = pd.read_csv('subscribers.csv')
subscribers

Unnamed: 0,user_id,subscribing_channel,date_subscribed,date_canceled,is_retained
0,a1000,House Ads,1/1/18,,True
1,a1001,House Ads,1/1/18,,True
2,a1002,House Ads,1/1/18,,True
3,a1003,House Ads,1/1/18,,True
4,a1004,House Ads,1/1/18,,True
...,...,...,...,...,...
10032,a11032,Email,1/17/18,1/24/18,False
10033,a11033,Email,1/17/18,,True
10034,a11034,Email,1/17/18,,True
10035,a11035,Email,1/17/18,,True


In [8]:
users = pd.read_csv('users.csv')
users

Unnamed: 0,user_id,age_group,language_preferred
0,a1000,0-18 years,English
1,a1001,19-24 years,English
2,a1002,24-30 years,English
3,a1003,30-36 years,English
4,a1004,36-45 years,English
...,...,...,...
10032,a11032,45-55 years,German
10033,a11033,55+ years,German
10034,a11034,55+ years,German
10035,a11035,0-18 years,German


In [9]:
data = marketing_campaign.copy()

In [11]:
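z_crit_value = 1.96  # two-sided critical value for a 95% confidence level
# k = number of conversions, n = number of rows in each variant.
# Note: shape[0] counts every row, including rows where `converted` is missing.
k1 = data[data['variant']=='control']['converted'].sum()
n1 = data[data['variant']=='control'].shape[0]
k2 = data[data['variant']=='personalization']['converted'].sum()
n2 = data[data['variant']=='personalization'].shape[0]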

k1, n1, k2, n2

(371, 5091, 705, 4946)

In [12]:
grouped = data.pivot_table(values='converted', index='variant', aggfunc=['sum', 'count'])
grouped

variant,sum (converted),count (converted)
control,371,5076
personalization,705,4946
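
The `count` column gives 5076 for control, whereas `shape[0]` in cell 11 gave 5091. `count` skips missing values and `shape[0]` does not, so the two sample sizes differ by the rows where `converted` is missing. A hedged sketch of computing both k and n from non-missing rows only follows; `conversion_counts` is a hypothetical helper and `toy` is made-up data, neither comes from the notebook:

```python
import numpy as np
import pandas as pd

def conversion_counts(df, variant):
    """Return (conversions, observations) for one variant,
    ignoring rows where `converted` is missing."""
    col = df.loc[df["variant"] == variant, "converted"].dropna()
    return int(col.astype(bool).sum()), int(col.shape[0])

# Toy stand-in for the campaign data (hypothetical values)
toy = pd.DataFrame({
    "variant": ["control", "control", "control", "personalization", "personalization"],
    "converted": [True, False, np.nan, True, False],
})

k_control, n_control = conversion_counts(toy, "control")
k_pers, n_pers = conversion_counts(toy, "personalization")
print(k_control, n_control, k_pers, n_pers)
```

With this approach the denominator matches the `count` column of the pivot table, so rows with an unknown outcome are not treated as non-conversions.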


In [13]:
p1, p2 = k1/n1, k2/n2
p1, p2

(0.07287369868395208, 0.14253942579862516)

In [14]:
P = (p1*n1 + p2*n2) / (n1 + n2)
z = (p1 - p2) / (P * (1 - P) * (1/n1 + 1/n2))**(1/2)
z

-11.278864170859038
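
For reference, the statistic computed in cell 14 appears to be the pooled two-proportion z-statistic. Written with the notebook's symbols (k for conversions, n for sample sizes, P for the pooled proportion), it reads:

```latex
\hat{p}_1 = \frac{k_1}{n_1}, \qquad
\hat{p}_2 = \frac{k_2}{n_2}, \qquad
P = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2} = \frac{k_1 + k_2}{n_1 + n_2}

z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{P\,(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
```

The negative sign of z only reflects the order of subtraction (control minus personalization).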

In [15]:
if abs(z) > z_crit_value:
    print("We may reject the null hypothesis!")
else:
    print("We have failed to reject the null hypothesis")

We may reject the null hypothesis!
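
Comparing |z| with 1.96 is equivalent to comparing a two-sided p-value with 0.05. A minimal sketch of that conversion using only the standard library; `two_sided_p_value` is a hypothetical helper name, and the standard normal distribution of z under the null hypothesis is assumed:

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

# At the critical value the p-value sits at the 5% significance level
p_at_critical = two_sided_p_value(1.96)

# z value taken from cell 14 of the notebook
p_observed = two_sided_p_value(-11.278864170859038)

print(f"p at z=1.96:      {p_at_critical:.4f}")
print(f"p at observed z:  {p_observed:.3e}")
```

Printing the observed p-value in scientific notation avoids it being rounded down to zero.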


In [16]:
from statsmodels.stats import proportion

z_score, z_pvalue = proportion.proportions_ztest(np.array([k1, k2]), 
                                                 np.array([n1, n2]))

print(f'Results are z_score={z_score:.3f} pvalue={z_pvalue:.3f}')

Results are z_score=-11.279 pvalue=0.000


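Since the p-value is effectively zero (it prints as 0.000 after rounding to three decimals, far below 0.05), we can reject the null hypothesis: the difference in conversion between the control and personalization groups is statistically significant. Conversion is higher in the personalization group (about 14.3% versus about 7.3% in control).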
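
A significance test says that the groups differ but not by how much. As a complement, here is a hedged sketch of a 95% confidence interval for the difference in conversion rates, using the normal (Wald) approximation with unpooled variances. This is an addition, not part of the original analysis, and `diff_confidence_interval` is a hypothetical helper:

```python
import math

def diff_confidence_interval(k1, n1, k2, n2, z_crit=1.96):
    """Wald confidence interval for p2 - p1 (normal approximation)."""
    p1, p2 = k1 / n1, k2 / n2
    # Unpooled standard error of the difference of two proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

# Counts taken from cell 11: control (k1, n1), personalization (k2, n2)
low, high = diff_confidence_interval(371, 5091, 705, 4946)
print(f"95% CI for the uplift: [{low:.4f}, {high:.4f}]")
```

An interval that lies entirely above zero agrees with the test's conclusion that personalization converts better.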