### Problem

You are an analyst at a startup that wants to test a new product feature: explaining to users why a particular item was recommended to them. The current conversion rate is 13%, and the product manager wants to raise it to 15%.

### A/B test formulation

Metric: conversion rate (written as CTR in the formulas below). Each observation is 0 if the user did not buy and 1 if they did.

$$H_0: CTR = 13\%$$
$$H_1: CTR = 15\%$$
$$\alpha = 5\%$$


Next, split the sample into a control group and a treatment group. How many users are needed in each? Besides $\alpha$ and the effect size, this requires choosing the power of the test, usually $1 - \beta = 0.8$: if the conversion rate really does rise from 13% to 15%, a sample of this size gives an 80% chance of detecting the lift as statistically significant.
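The three inputs (significance level, effect size, power) combine into a standard closed-form approximation for the per-group sample size. The sketch below is an illustration using only the standard library; the function name `sample_size_per_group` is hypothetical, and the effect size is assumed to be Cohen's h (the arcsine-transformed difference of the two proportions).

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided two-proportion z-test."""
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    # Equal group sizes: n = 2 * ((z_alpha + z_beta) / h) ** 2
    return ceil(2 * ((z_alpha + z_beta) / h) ** 2)


n_per_group = sample_size_per_group(0.13, 0.15)
print(n_per_group)
```

Because the effect size sits in the denominator squared, halving the expected lift roughly quadruples the required sample.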


### Calculation

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.stats.api as sms
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil


from statsmodels.stats.proportion import proportions_ztest, proportion_confint

%matplotlib inline

Compute the required sample size

In [4]:
effect_size = sms.proportion_effectsize(0.13, 0.15)    # Calculating effect size based on our expected rates

required_n = sms.NormalIndPower().solve_power(
    effect_size, 
    power=0.8, 
    alpha=0.05, 
    ratio=1
    )                                                  # Calculating sample size needed

required_n = ceil(required_n)                          # Rounding up to next whole number                          

print(required_n)

4720
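One way to sanity-check a sample size of 4720 per group is to simulate many experiments in which the true rates really are 13% and 15%, then count how often a pooled z-test rejects $H_0$. This is a rough Monte Carlo sketch, not part of the original notebook; `simulated_power`, the number of simulations, and the seed are all illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulated_power(p_control, p_treatment, n, alpha=0.05, n_sims=300, seed=0):
    """Fraction of simulated experiments where a pooled z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Draw n Bernoulli outcomes per group and count conversions
        conv_c = sum(rng.random() < p_control for _ in range(n))
        conv_t = sum(rng.random() < p_treatment for _ in range(n))
        pooled = (conv_c + conv_t) / (2 * n)
        se = sqrt(pooled * (1 - pooled) * 2 / n)
        z = (conv_t - conv_c) / n / se
        rejections += abs(z) > z_crit
    return rejections / n_sims


power_estimate = simulated_power(0.13, 0.15, n=4720)
print(f"estimated power: {power_estimate:.2f}")
```

With only a few hundred simulations the estimate is noisy, so it should be read as "roughly 0.8" rather than as a precise figure.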


Inspect the data

In [6]:
df = pd.read_csv('ab_data.csv')

df.head()
df.tail()

        user_id                   timestamp      group  landing_page  converted
294473   751197  2017-01-03 22:28:38.630509    control      old_page          0
294474   945152  2017-01-12 00:51:57.078372    control      old_page          0
294475   734608  2017-01-22 11:45:03.439544    control      old_page          0
294476   697314  2017-01-15 01:20:28.957438    control      old_page          0
294477   715931  2017-01-16 12:40:24.467417  treatment      new_page          0


In [7]:
df.groupby(["group", "landing_page"])["landing_page"].count()

group      landing_page
control    new_page          1928
           old_page        145274
treatment  new_page        145311
           old_page          1965
Name: landing_page, dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


294478 observations, each a user session, with 5 columns:

    user_id - The user ID of each session
    timestamp - Timestamp for the session
    group - Which group the user was assigned to for that session {control, treatment}
    landing_page - Which design each user saw on that session {old_page, new_page}
    converted - Whether the session ended in a conversion or not (binary, 0=not converted, 1=converted)



In [9]:
df.user_id.value_counts()

805339    2
754884    2
722274    2
783176    2
898232    2
         ..
642985    1
771499    1
923606    1
712675    1
715931    1
Name: user_id, Length: 290584, dtype: int64

In [None]:
# Drop every user who appears in more than one session
session_counts = df['user_id'].value_counts(ascending=False)
users_to_drop = session_counts[session_counts > 1].index
df = df[~df['user_id'].isin(users_to_drop)]
print(f'The updated dataset now has {df.shape[0]} entries')

The updated dataset now has 286690 entries


Draw a random sample of 4720 users from each group

In [None]:
control_sample = df[df['group'] == 'control'].sample(n=required_n, random_state=22)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n, random_state=22)

ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

In [None]:
treatment_sample

        user_id                   timestamp      group  landing_page  converted
259346   860447  2017-01-11 21:20:47.193292  treatment      new_page          0
237647   845654  2017-01-06 21:49:33.725054  treatment      new_page          0
73088    833106  2017-01-11 21:56:24.637002  treatment      new_page          0
121106   665687  2017-01-08 04:17:45.135586  treatment      new_page          0
78032    658409  2017-01-22 13:18:58.765132  treatment      new_page          0
...         ...                         ...        ...           ...        ...
46153    908512  2017-01-14 22:02:29.922674  treatment      new_page          0
235886   873211  2017-01-05 00:57:16.167151  treatment      new_page          0
268794   631276  2017-01-20 18:56:58.167809  treatment      new_page          0
190461   662301  2017-01-03 08:10:57.768806  treatment      new_page          0


In [None]:
treatment_sample.groupby(["group", "landing_page"])["landing_page"].count()

group      landing_page
treatment  new_page        4720
Name: landing_page, dtype: int64

In [None]:
# treatment_sample["converted"][:400]=1

In [None]:
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

In [None]:
ab_test['group'].value_counts()

control      4720
treatment    4720
Name: group, dtype: int64

Descriptive statistics

In [None]:
conversion_rates = ab_test.groupby('group')['converted']

std_p = lambda x: np.std(x, ddof=0)              # Std. deviation of the proportion
se_p = lambda x: stats.sem(x, ddof=0)            # Std. error of the proportion (std / sqrt(n))

conversion_rates = conversion_rates.agg([np.mean, std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']


conversion_rates.style.format('{:.3f}')

           conversion_rate  std_deviation  std_error
group
control              0.123          0.329      0.005
treatment            0.126          0.331      0.005
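For a 0/1 metric the standard deviation and standard error are determined by the conversion rate alone: the population standard deviation is $\sqrt{p(1-p)}$ and the standard error is that value divided by $\sqrt{n}$. The short check below illustrates this on a hypothetical toy sample (3 conversions in 20 sessions), not on the notebook's data.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical toy sample: 3 conversions out of 20 sessions
converted = [1] * 3 + [0] * 17

p = mean(converted)                             # conversion rate
std_formula = sqrt(p * (1 - p))                 # std of a 0/1 variable from the rate
std_direct = pstdev(converted)                  # same quantity from the data (ddof=0)
std_error = std_formula / sqrt(len(converted))  # standard error of the rate

print(f"rate={p:.3f} std={std_formula:.3f} std_direct={std_direct:.3f} se={std_error:.3f}")
```

The two standard deviations agree, which is why the `std_deviation` column in the table carries no information beyond the rate itself.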


There appears to be almost no difference between the old and new versions of the product: 12.3% versus 12.6%.

Is this difference statistically significant? Check it with a z-test.

In [None]:
control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)

(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')


z statistic: -0.34
p-value: 0.732
ci 95% for control group: [0.114, 0.133]
ci 95% for treatment group: [0.116, 0.135]
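To make the test statistic less of a black box, here is a minimal hand-rolled version of a two-sided, pooled two-proportion z-test, which is what `proportions_ztest` computes with its default arguments. The function name `two_proportion_ztest` and the counts in the usage line are hypothetical and are not taken from the notebook's data.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equal proportions using the pooled variance."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Under H0 both groups share one rate, so estimate it from the pooled data
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts for illustration only
z, p_value = two_proportion_ztest(120, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

The sign of z depends only on the order of the groups: with control passed first, a negative value means the treatment rate is higher.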


Conclusion: p-value = 0.732 > 0.05, so we cannot reject H0; the difference between the two designs is not statistically significant.

The 95% confidence interval for the treatment group, [0.116, 0.135], includes the 13% baseline and excludes the target of 15%.
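The interval check can be reproduced with the normal-approximation (Wald) formula, which is the default method of `proportion_confint`. The sketch below is an illustration: the notebook does not print the raw number of conversions, so the count of 594 out of 4720 is an assumption chosen only because it is consistent with the reported rate of about 12.6%.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Assumption: 594 conversions out of 4720 (about 12.6%); illustrative only,
# the actual count is not shown in the notebook.
lower, upper = wald_interval(594, 4720)
for rate in (0.13, 0.15):
    print(f"{rate:.0%} inside interval: {lower <= rate <= upper}")
```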