The product is a mobile application in which we launched a monetization A/B test with an alternative subscription type.

The test group was offered a one-week subscription with a three-day trial period. The control group kept the one-month subscription with a three-week trial period. The task is to compare the two versions.

Description of columns:

- user_id — user identifier
- install_date — installation date
- test_group — the group the user was assigned to
- country — user's country code
- trial — whether the user activated the trial period
- paid — whether the user made a purchase after the trial period
- subscription_name — subscription name
- revenue_1m — revenue generated by the user during the first month after the trial period ended.

In [16]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from urllib.parse import quote

In [17]:
sheet_id = '1TTSv9l4E89PHMpbg_6xgk-SN1GYZ53BbQDjysAzgAnU'
sheet_name = 'Task 2'
encoded_sheet_name = quote(sheet_name)
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={encoded_sheet_name}'

In [18]:
df = pd.read_csv(url)

In [19]:
df

Unnamed: 0,user_id,install_date,country,test_group,trial,paid,subscription_name,revenue_1m
0,0000dd3fa4702a63d1b76aaffe1ab39b,2023-06-05,US,treatment,0,0,,0.0
1,0001f27ab7e22228e54c8b2028b43f24,2023-06-07,AU,treatment,0,0,,0.0
2,0006c5c547801308b36ea3cf669856ae,2023-06-07,AU,treatment,0,0,,0.0
3,000d1a300263c5db91cbefa3852898a7,2023-06-07,AU,control,0,0,,0.0
4,000e5e62e8746e467ed9f49ac5de3208,2023-06-06,US,treatment,0,0,,0.0
...,...,...,...,...,...,...,...,...
15280,ffeef2252b1e6f9084eb9eefdb72461b,2023-06-04,AU,control,1,0,monthly.5.99.3d.trial,0.0
15281,ffef2243e364f6d53cf5bee90473a4d7,2023-06-06,CA,treatment,0,0,,0.0
15282,fff046eac6fd5329dd1fe44ad5e162cb,2023-06-07,US,treatment,0,0,,0.0
15283,fff32685daea5e16da8f1243f40467a0,2023-06-01,GB,treatment,0,0,,0.0


In [20]:
nan_mean = df.isna().mean()
nan_mean

user_id              0.000000
install_date         0.000000
country              0.000000
test_group           0.000000
trial                0.000000
paid                 0.000000
subscription_name    0.887275
revenue_1m           0.000000
dtype: float64

In [21]:
# Check the column data types. install_date is stored as object (string),
# so it needs to be converted to datetime
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15285 entries, 0 to 15284
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   user_id            15285 non-null  object 
 1   install_date       15285 non-null  object 
 2   country            15285 non-null  object 
 3   test_group         15285 non-null  object 
 4   trial              15285 non-null  int64  
 5   paid               15285 non-null  int64  
 6   subscription_name  1723 non-null   object 
 7   revenue_1m         15285 non-null  float64
dtypes: float64(1), int64(2), object(5)
memory usage: 955.4+ KB


In [22]:
# Convert install_date to datetime; the dates are formatted as YYYY-MM-DD
df['install_date'] = pd.to_datetime(df['install_date'], format="%Y-%m-%d")

#### Using the Z-test for the difference in proportions

##### The Z-test for the difference in proportions is a statistical test used to determine if there is a significant difference between the proportions of two independent groups. This test is particularly applicable when dealing with binary outcomes, such as success/failure or yes/no.
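
To make the statistic concrete, here is a minimal sketch of the pooled two-proportion Z-test written with the standard library only. The function name `two_proportion_ztest` and the counts are hypothetical and are not taken from the dataset; the pooled variance is, to my understanding, what `proportions_ztest` uses by default for two samples, but that is an assumption worth checking against the statsmodels documentation.

```python
import math

def two_proportion_ztest(successes_a, total_a, successes_b, total_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    # Pooled proportion under the null hypothesis that both groups share one rate
    p_pool = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 120 of 1000 payers in one group, 90 of 1000 in the other
z, p_value = two_proportion_ztest(120, 1000, 90, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A positive `z` means the first group has the higher proportion, which is why the order of the arguments matters when reading the sign.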

In [23]:
treatment_paid = df.query('test_group == "treatment" and paid != 0')['user_id'].count()
treatment_total = df.query('test_group == "treatment"')['user_id'].count()

control_paid = df.query('test_group == "control" and paid != 0')['user_id'].count()
control_total = df.query('test_group == "control"')['user_id'].count()

stat, p_value = proportions_ztest([treatment_paid, control_paid], [treatment_total, control_total])

print(f'Z-statistics: {stat}')
print(f'p-value: {p_value}')

alpha = 0.05
if p_value < alpha:
    print("The difference in conversion to payment is statistically significant")
else:
    print("The difference in conversion to payment is statistically insignificant")

Z-statistics: 2.011677116982162
p-value: 0.04425398431753958
The difference in conversion to payment is statistically significant


#### Breaking the results down by country shows a statistically significant difference only for Canada

In [24]:
treatment_paid = df.query('test_group == "treatment" and paid != 0 and country == "CA"')['user_id'].count()
treatment_total = df.query('test_group == "treatment" and country == "CA"')['user_id'].count()

control_paid = df.query('test_group == "control" and paid != 0 and country == "CA"')['user_id'].count()
control_total = df.query('test_group == "control" and country == "CA"')['user_id'].count()

stat, p_value = proportions_ztest([treatment_paid, control_paid], [treatment_total, control_total])

print(f'Z-statistics: {stat}')
print(f'p-value: {p_value}')

alpha = 0.05
if p_value < alpha:
    print("The difference in conversion to payment is statistically significant")
else:
    print("The difference in conversion to payment is statistically insignificant")

Z-statistics: 2.5361806260856774
p-value: 0.011206891346182076
The difference in conversion to payment is statistically significant


In [25]:
treatment_paid = df.query('test_group == "treatment" and paid != 0 and country != "CA"')['user_id'].count()
treatment_total = df.query('test_group == "treatment" and country != "CA"')['user_id'].count()

control_paid = df.query('test_group == "control" and paid != 0 and country != "CA"')['user_id'].count()
control_total = df.query('test_group == "control" and country != "CA"')['user_id'].count()

stat, p_value = proportions_ztest([treatment_paid, control_paid], [treatment_total, control_total])

print(f'Z-statistics: {stat}')
print(f'p-value: {p_value}')

alpha = 0.05
if p_value < alpha:
    print("The difference in conversion to payment is statistically significant")
else:
    print("The difference in conversion to payment is statistically insignificant")

Z-statistics: 0.9648496589421626
p-value: 0.3346201189352176
The difference in conversion to payment is statistically insignificant


#### We use the bootstrap to compare ARPPU and ARPU between groups. This method was chosen because the revenue distribution is not normal
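
As a toy illustration of the percentile bootstrap, the sketch below builds a 95% confidence interval for a difference in means using only the standard library. The function name `bootstrap_mean_diff_ci`, the fixed seed, and the revenue values are all hypothetical; the seed is added only so that reruns give the same interval.

```python
import random
import statistics

def bootstrap_mean_diff_ci(sample_a, sample_b, num_iterations=2000, seed=42):
    """Percentile bootstrap 95% CI for mean(sample_a) - mean(sample_b)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(num_iterations):
        # Resample each group independently, with replacement, at its original size
        resample_a = rng.choices(sample_a, k=len(sample_a))
        resample_b = rng.choices(sample_b, k=len(sample_b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    # Approximate 2.5th and 97.5th percentiles of the bootstrap distribution
    lower = diffs[int(0.025 * num_iterations)]
    upper = diffs[int(0.975 * num_iterations) - 1]
    return lower, upper

# Hypothetical, heavily skewed revenue: most users pay nothing
treatment_revenue = [0.0] * 90 + [4.99] * 10
control_revenue = [0.0] * 95 + [5.99] * 5

lower, upper = bootstrap_mean_diff_ci(treatment_revenue, control_revenue)
print(f"95% CI for the difference in mean revenue: [{lower:.3f}, {upper:.3f}]")
```

If the interval contains 0, the difference is not statistically significant at the 5% level under this method.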

In [26]:
data = df[df['country'] == "CA"]

def calculate_arpu(data):
    # Average revenue per user: total revenue divided by the number of users
    return data['revenue_1m'].sum() / data['revenue_1m'].count()

def calculate_arppu(data):
    # Average revenue per paying user; 0 if the sample contains no payers
    paying_users = data[data['paid'] == 1]
    if paying_users.empty:
        return 0
    return paying_users['revenue_1m'].sum() / paying_users['paid'].sum()

def bootstrap(data, num_iterations=1000):
    # Resample users with replacement and recompute both metrics each time
    bootstrap_arpu = np.zeros(num_iterations)
    bootstrap_arppu = np.zeros(num_iterations)

    for i in range(num_iterations):
        bootstrap_sample = data.sample(n=len(data), replace=True)
        bootstrap_arpu[i] = calculate_arpu(bootstrap_sample)
        bootstrap_arppu[i] = calculate_arppu(bootstrap_sample)

    return bootstrap_arpu, bootstrap_arppu

treatment_data = data[data['test_group'] == 'treatment']
control_data = data[data['test_group'] == 'control']

bootstrap_arpu_treatment, bootstrap_arppu_treatment = bootstrap(treatment_data)
bootstrap_arpu_control, bootstrap_arppu_control = bootstrap(control_data)

arpu_diff = bootstrap_arpu_treatment - bootstrap_arpu_control
arppu_diff = bootstrap_arppu_treatment - bootstrap_arppu_control

confidence_interval_arpu = np.percentile(arpu_diff, [2.5, 97.5])
confidence_interval_arppu = np.percentile(arppu_diff, [2.5, 97.5])

print("ARPU Difference 95% Confidence Interval:", confidence_interval_arpu)
print("ARPPU Difference 95% Confidence Interval:", confidence_interval_arppu)

ARPU Difference 95% Confidence Interval: [-0.00400428  0.08500696]
ARPPU Difference 95% Confidence Interval: [-1.7853125  -0.17611111]


For Canada, the ARPPU confidence interval does not include 0, which indicates a statistically significant difference between the groups: the change lowered revenue per paying user. The ARPU confidence interval includes 0, so there is no statistically significant difference in ARPU. This suggests that the increase in conversion and the decrease in revenue per paying user offset each other.
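
The offsetting effect follows from the identity ARPU = conversion × ARPPU, which holds when non-paying users bring no revenue (an assumption made here). A minimal sketch with hypothetical numbers, where `revenue_metrics` is an illustrative helper and not part of the analysis:

```python
def revenue_metrics(revenues, paid_flags):
    """Return (conversion, ARPPU, ARPU) for parallel per-user lists."""
    n_users = len(revenues)
    n_payers = sum(paid_flags)
    total_revenue = sum(revenues)
    conversion = n_payers / n_users
    arppu = total_revenue / n_payers if n_payers else 0.0
    arpu = total_revenue / n_users
    return conversion, arppu, arpu

# Hypothetical groups: treatment converts more users, but at a lower price
control = revenue_metrics([5.99] * 5 + [0.0] * 95, [1] * 5 + [0] * 95)
treatment = revenue_metrics([4.99] * 6 + [0.0] * 94, [1] * 6 + [0] * 94)

for name, (conversion, arppu, arpu) in [("control", control), ("treatment", treatment)]:
    # ARPU equals conversion * ARPPU because non-payers contribute zero revenue
    print(f"{name}: conversion={conversion:.2%}, ARPPU={arppu:.2f}, ARPU={arpu:.4f}")
```

In this toy case the higher conversion and the lower ARPPU almost exactly cancel, leaving ARPU nearly unchanged.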

In [27]:
data = df[df['country'] != "CA"]

# calculate_arpu, calculate_arppu and bootstrap are defined in the previous cell and reused here

treatment_data = data[data['test_group'] == 'treatment']
control_data = data[data['test_group'] == 'control']

bootstrap_arpu_treatment, bootstrap_arppu_treatment = bootstrap(treatment_data)
bootstrap_arpu_control, bootstrap_arppu_control = bootstrap(control_data)

arpu_diff = bootstrap_arpu_treatment - bootstrap_arpu_control
arppu_diff = bootstrap_arppu_treatment - bootstrap_arppu_control

confidence_interval_arpu = np.percentile(arpu_diff, [2.5, 97.5])
confidence_interval_arppu = np.percentile(arppu_diff, [2.5, 97.5])

print("ARPU Difference 95% Confidence Interval:", confidence_interval_arpu)
print("ARPPU Difference 95% Confidence Interval:", confidence_interval_arppu)

ARPU Difference 95% Confidence Interval: [-0.02660521  0.03329056]
ARPPU Difference 95% Confidence Interval: [-1.06393292 -0.03645437]


For the other countries the pattern is the same: the ARPPU confidence interval lies below 0, while the ARPU confidence interval includes 0.

A two-sided Z-test for the difference in proportions shows a statistically significant difference in conversion to payment between the test and control groups (z = 2.01, p = 0.044), with higher conversion in the test group. Broken down by country, the difference is significant only for Canada (p = 0.011); for all other countries combined it is not (p = 0.335).

The bootstrap was used to compare ARPPU (Average Revenue Per Paying User) and ARPU (Average Revenue Per User) because revenue is not normally distributed. It shows a statistically significant decrease in ARPPU in the test group, both in Canada and in the other countries, while the 95% confidence intervals for the ARPU difference include 0 in both segments. ARPU combines conversion and revenue per payer, so the increase in conversion to payment and the decrease in revenue per payer offset each other. At this stage, no significant change in ARPU is detected for the test group.

What's next?
We should not roll the change out to all users yet. After some time, the analysis should be repeated for the cohorts acquired during the test to assess their subsequent performance.