In [1]:
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.proportion import proportions_ztest
import random

In a project where users purchase in-game currency for real money, the design team has proposed redesigning the payment screen for mobile devices to increase the number of transactions. The new version started being implemented on a group of users starting from July 24th at 00:00. The base version was retained for a portion of the users. Based on the payment data, we have to answer following question:

__Should the innovation be implemented for everyone or should it be rejected?__

Let's load our dataset containing complete user information using Pandas

In [4]:
user_df = pd.read_csv(r'C:\Users\Микита\Desktop\Test\raw_data.csv')

user_df.head()

Unnamed: 0,id_user,gender,date_reg,platform,id_traffic_source,country_group,age_group,system,date_payment,method,amount,successful_payment,split_group
0,21e32a673252ae39cc68d0f0a5da14b3bd4d13355b7098...,female,2021-07-16 02:06:45,mobile,Alderaan,1,3,Android,,,,,0
1,7c10f6579762217e505adcd1d39cfde05a32a566edcfb8...,male,2021-07-17 01:48:29,mobile,Mandalore,2,1,Android,,,,,0
2,eb48efe2760f26d57e6cc835f663054815c37ddfcc95a7...,female,2021-07-17 05:05:10,mobile,Alderaan,1,3,Android,,,,,0
3,4047dc22673448be39086342345d470442ae0b519adf4c...,female,2021-07-02 11:21:50,mobile,Mandalore,1,4,iOS,,,,,0
4,255ec0d0e1bac8c474827de94d99cf1661ac1e424c91e1...,female,2021-07-02 17:59:34,desktop,Coruscant,4,3,Windows,,,,,0


In [20]:
user_df.shape

(58938, 13)

According to the conditions of the A/B test, we are investigating user behavior specifically on mobile platforms. Therefore, we need to create a new table that includes only those who access the platform from their phones.

In [21]:
mobile_users_df = user_df[user_df['platform'] == 'mobile']

mobile_users_df.head()
mobile_users_df.shape

(46619, 13)

As we can see in this case, number of users for new table is significantly different from the original table

First, let's extract the test and control groups from the table and count number of users in both.

In [22]:
test = mobile_users_df[mobile_users_df['split_group'] == 1]
control = mobile_users_df[mobile_users_df['split_group'] == 0]

print(f'{test.shape = }')
print(f'{control.shape = }')

test.shape = (6680, 13)
control.shape = (39939, 13)


We can observe that the sample sizes of the two groups differ significantly, which is not acceptable for testing purposes. This means that we need to somehow select a subset of users from the control group that has a similar size to the test group. We can do this either by randomly selecting users or by using stratified sampling. Let's try both methods.
For test clearity we will choose only those users that are or did not make payment or this payment was only after the start of experiment.

In [43]:
control['date_payment'] = pd.to_datetime(control['date_payment'], errors ='coerce')

# Set the threshold date for filtering
threshold_date = pd.to_datetime('2021-07-24')

# Filter and select a random subset of rows
filtered_df = control[(control['date_payment'].isna()) | (control['date_payment'] >= threshold_date)]
control_group = filtered_df.sample(n = test.shape[0])

payments_control = control_group[control_group['date_payment'].notna()].shape[0]
payments_test = test[test['date_payment'].notna()].shape[0]

(payments_control, payments_test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  control['date_payment'] = pd.to_datetime(control['date_payment'], errors ='coerce')


(573, 950)

Now, let's simple calculate conversions for both groups:

In [45]:
conversion_control, conversions_test = payments_control/test.shape[0], payments_test/test.shape[0]

conversion_control, conversions_test

(0.08577844311377246, 0.14221556886227546)

So, in this A/B test, we need to determine whether the new design variant of the application will have a higher conversion rate than before or not. As a result, we have the following hypothesis: $$H_{0}: C_{B} = C_{A};\\H_{1}: C_{B} > C_{A}.$$ For testing proportions, the best approach is to use the so-called Z-test, as it is better suited for testing proportions.

In [18]:
def z_test(conversions1, sample_size1, conversions2, sample_size2):
    p1 = conversions1 / sample_size1
    p2 = conversions2 / sample_size2

    p_combined = (conversions1 + conversions2) / (sample_size1 + sample_size2)
    standard_error = math.sqrt(p_combined * (1 - p_combined) * ((1 / sample_size1) + (1 / sample_size2)))

    z_score = (p1 - p2) / standard_error
    p_value = 1 - stats.norm.cdf(z_score)

    return z_score, p_value

Now let's move on to calculating the necessary sample size for conducting an experiment.

There are several factors to consider when determining the sample size for an experiment. These factors include the desired level of confidence, the desired level of precision, the expected effect size, and any anticipated variability in the data.

In [25]:
def count_sample_size(first_type_error, second_type_error, delta, sigma):
    power_index = (stats.norm.ppf(first_type_error / 2) + stats.norm.ppf(second_type_error)) ** 2
    size = 2 * (sigma ** 2) * power_index / (delta ** 2)
    
    return int(size)

In [26]:
size = count_sample_size(0.05, 0.2, 0.0065, 0.16)

print(f'{size = }')

size = 9511


Let's verify that at this sample size, the probabilities of errors are controlled at the specified levels. This can be done using synthetic A/A and A/B tests. We will generate pairs of samples with and without an effect and calculate the proportions of cases where the Z-test made a mistake. If the proportions of Type I and Type II ($\alpha = 0.05$, $\beta = 0.2$) errors match the ones we specified earlier, we can consider the criterion suitable for our futher work.

The following method seems reasonable for the verification: for the A/A test, we will generate Bernoulli distributed data instead of continuous data, as we are comparing conversions that can be described as Bernoulli random variables.

In [50]:
def verify_criterion(sample_size, cycle_lenght, alpha):
    first_type_errors = []
    
    for _ in range(cycle_lenght):
        control = np.random.binomial(1, 1 / 4, sample_size)
        test = np.random.binomial(1, 1 / 4, sample_size)
        
        conversion_control = np.sum(control)
        conversion_test = np.sum(test)
    
        _, pvalue_aa = z_test(conversion_control, sample_size, conversion_test, sample_size)
        first_type_errors.append(pvalue_aa < alpha)
    
        part_first_type_errors = np.mean(first_type_errors)
    
    print(f'part_first_type_errors = {part_first_type_errors:0.3f}')
    

In [51]:
verify_criterion(size, 5000, 0.05)

part_first_type_errors = 0.048


As we can see, the criterion is working correctly. Of course, there may be some slight deviations due to the randomness of the sample generation, but in any case, they are sufficiently insignificant to have a strong impact on the experiment. On the other hand, if the error rate in our artificial A/A test significantly exceeded the threshold, it would be a bad sign. It would indicate that the test would make mistakes more often, and the probability of making an error would increase, which is something we definitely don't want as a business. However, if the rate of Type I errors were less than 0.05, it would be less concerning.

In [52]:
z_test(conversions_control, test_group.shape[0], conversions_test, test_group.shape[0])

(24.436576977677692, 0.0)

The p-value is practically zero, which allows us to conclude that the __null hypothesis of equal conversion should be rejected and the alternative hypothesis, stating that the conversion in the test group is greater than the conversion in the control group, should be accepted.__


However, it is important to note the following: we have not yet examined the number of successful payments in both groups. Although we have payment data, we need to ensure that the payments were indeed successful. Otherwise, what is the point of introducing changes that do not lead to an increase in profit? Therefore, we will proceed as follows: we will test the conversions not only based on payment occurrences but also based on their successful completion. This will help us better understand the true observed effect.

In [55]:
new_control_conversion = control_group[control_group['successful_payment'] == 1]
new_test_conversion  = test_group[test_group['successful_payment'] == 1]

successful_payment_conversion_test = new_test_conversion.shape[0]
successful_payment_conversion_control = new_control_conversion.shape[0]

(669, 4569)

In [49]:
z_test(successful_payment_conversion_test, test_group.shape[0], successful_payment_conversion_control, test_group.shape[0])

(7.698975085758187, 6.8833827526759706e-15)

As we can see, in this case as well, we have a low p-value, which again allows us to reject the null hypothesis.

In the previous analysis, we considered all purchases made by users, including multiple purchases by the same individual. However, it is possible that the new design may influence the purchasing behavior of specific users, leading to increased activity. Let's now conduct the same test but consider __only unique purchases__. This will provide us with an understanding of how the modified version affects the entire player population as a whole.

In [56]:
dataset_unique = mobile_users_df.drop_duplicates(subset = ['id_user'])

dataset_unique.shape

(40515, 13)

In [57]:
test_group_unique = dataset_unique[dataset_unique['split_group'] == 1]
control_group_unique = dataset_unique[dataset_unique['split_group'] == 0].sample( n = test_group_unique.shape[0])

print(f'{test_group_unique.shape = }')
print(f'{control_group_unique.shape = }')

test_group_unique.shape = (5961, 13)
control_group_unique.shape = (5961, 13)


In [58]:
conversions_test_group_unique = test_group_unique['successful_payment'].sum()
conversions_control_group_unique = control_group_unique['successful_payment'].sum()

size_unique = test_group_unique.shape[0]

print(f'{conversions_test_group_unique = }')
print(f'{conversions_control_group_unique = }')

conversions_test_group_unique = 153.0
conversions_control_group_unique = 123.0


In [59]:
z_score, p_value = z_test(conversions_test_group_unique, size_unique, conversions_control_group_unique, size_unique)

print(f'{p_value = }')

p_value = 0.03384535219349505


How we can see here, $p-value$ is lower than $\alpha = 0.05$, so we should reject the $H_{0}$ in case of unique values. However, let's build some confidence interval for our conversions

In [61]:
def bootstrap(data, number_bootstrap = 5000): #Performing bootstrap resampling on the data.
    n_samples = len(data)
    bootstrap_samples = []
    
    for _ in range(number_bootstrap):
        sample = random.choices(data, k = n_samples)
        bootstrap_samples.append(np.mean(sample))
    
    return bootstrap_samples

payments_test_group = [1 if i == 1.0 else 0 for i in test_group_unique['successful_payment'].values]
payments_control_group = [1 if i == 1.0 else 0 for i in control_group_unique['successful_payment'].values]

bootstrap_samples_test = bootstrap(payments_test_group)
bootstrap_samples_control = bootstrap(payments_control_group)

confidence_interval_control = np.percentile(bootstrap_samples_control, [2.5, 97.5])
confidence_interval_test = np.percentile(bootstrap_samples_test, [2.5, 97.5])

print("Bootstrap 95% Confidence Interval for control group: [{:.5f}, {:.5f}]".format(confidence_interval_control[0], confidence_interval_control[1]))
print("Bootstrap 95% Confidence Interval for test group: [{:.5f}, {:.5f}]".format(confidence_interval_test[0],confidence_interval_test[1]))

Bootstrap 95% Confidence Interval for control group: [0.01694, 0.02432]
Bootstrap 95% Confidence Interval for test group: [0.02164, 0.02970]


By constructing confidence intervals using the bootstrap technique, some details become clearer. Indeed, overall conversion rate appears to increase, judging by the confidence interval bounds.

## Final Conclusions

So, we conducted two versions of the conversion test for the control and test groups on mobile platforms, with the null hypothesis being that there is no difference in conversion and the alternative hypothesis being that the conversion rate for the test group is higher than that of the control group. In the first version, it was decided to conduct the test considering that users might make more than one purchase, whereas in the second version, we excluded such cases. It was found that in both versions, there is significant statistical evidence in favor of the alternative hypothesis. Confidence intervals for the conversions were also constructed for both groups in the second case. Therefore, overall, there is an increase in conversion for the entire population of unique users. The conclusion is as follows: due to the reasons mentioned above, __I would recommend implementing the new design__.