## A/B Test Analysis

We need to analyze A/B test data from the popular [game Cookie Cats](https://www.facebook.com/cookiecatsgame). It's a classic "match-three" puzzle game where the player must connect tiles of the same color to clear the board and win a level. The board also features singing cats :)

During gameplay, players encounter gates that force them to wait a certain amount of time before they can progress or make an in-app purchase. In this task block, we will analyze the results of an A/B test in which the first gate in Cookie Cats was moved from level 30 to level 40. Specifically, we will examine the impact on player retention: did moving the gate 10 levels later change how many days after installation players kept coming back to the game?

We will be working with data from the file `cookie_cats.csv`. The variables in the dataset are as follows:

- userid - a unique number that identifies each player.
- version - indicates whether the player was assigned to the control group (gate_30 – gate at level 30) or the test group (gate_40 – gate at level 40).
- sum_gamerounds - the number of game rounds played by the player during the first week after installation.
- retention_1 - did the player return and start playing 1 day after installation?
- retention_7 - did the player return and start playing 7 days after installation?

When a player installed the game, they were randomly assigned to either the gate_30 or the gate_40 group.

1. Read the A/B test data into a variable named `df` and display the average value of the `retention_7` metric (7-day retention) by game version. Formulate a hypothesis: which version provides better retention 7 days after game installation?

In [3]:
import pandas as pd

data_path = './data/cookie_cats.csv'
df = pd.read_csv(data_path)
df

Unnamed: 0,userid,version,sum_gamerounds,retention_1,retention_7
0,116,gate_30,3,False,False
1,337,gate_30,38,True,False
2,377,gate_40,165,True,False
3,483,gate_40,1,False,False
4,488,gate_40,179,True,True
...,...,...,...,...,...
90184,9999441,gate_40,97,True,False
90185,9999479,gate_40,30,False,False
90186,9999710,gate_30,28,True,False
90187,9999768,gate_40,51,True,False


In [4]:
df['retention_7'] = df['retention_7'].astype(int)
group_A_retention_7_mean = df['retention_7'][df.version == 'gate_30'].mean()
group_B_retention_7_mean = df['retention_7'][df.version == 'gate_40'].mean()
group_A_retention_7_mean, group_B_retention_7_mean

(0.19020134228187918, 0.18200004396667327)

From these averages, we can hypothesize that 7-day retention is higher in the first group (gate_30): about 19.0% versus about 18.2% for gate_40.
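The same per-version averages can also be obtained with a single `groupby`. The sketch below does not read `cookie_cats.csv`; instead it rebuilds a stand-in frame (`toy_df`, a hypothetical name) from the per-group counts that appear in the contingency table later in this analysis, so it can run on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for df: rebuilt from the per-group counts of this dataset
# (gate_30: 8502 retained, 36198 not; gate_40: 8279 retained, 37210 not).
toy_df = pd.DataFrame({
    "version": np.repeat(["gate_30", "gate_40"], [8502 + 36198, 8279 + 37210]),
    "retention_7": np.concatenate([
        np.repeat([1, 0], [8502, 36198]),   # gate_30 players
        np.repeat([1, 0], [8279, 37210]),   # gate_40 players
    ]),
})

# Mean of a 0/1 column is the retention rate; sum and count give the raw numbers.
retention_by_version = (
    toy_df.groupby("version")["retention_7"].agg(["mean", "sum", "count"])
)
print(retention_by_version)
```

With the real data, replacing `toy_df` by `df` should give the same two means as the cell above.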

2. Use a z-test, as shown in the lecture example, to check whether one of the game versions results in a better `retention_7` rate at a significance level of 0.05. Also, compute the confidence intervals for both samples. Display the result in the following format:
```
z statistic: ...
p-value: ...
95% confidence interval for the group control: [..., ...]
95% confidence interval for the group treatment: [..., ...]
```
where each `...` is replaced by the calculated value. In your conclusion, answer two questions:
    1. Is there a statistically significant difference between the behavior of users in different versions of the game?
    2. Do the confidence intervals for user retention in different versions of the game overlap? What does this indicate?
    
Note that tasks like this call for a `proportion` z-test, because the dependent variable is binary (the user returns or does not, clicks or does not; only two possible values: 0/1, True/False). If we were instead testing, for example, whether men and women in a sample differ significantly in weight, we would use the `statsmodels.stats.weightstats.ztest` function, because the dependent variable `weight` is continuous (of type float rather than an int or bool with only two possible values).
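To make the proportion z-test concrete, here is a minimal sketch of the pooled two-proportion z statistic computed by hand, using only the standard library. The counts are the retained and total players per group in this dataset (they also appear in the contingency table further down); the variable names are illustrative. As far as I know, `proportions_ztest` uses the same pooled-variance formula by default, so the two should agree.

```python
from math import sqrt
from statistics import NormalDist

retained = [8502, 8279]     # 7-day retained players: gate_30, gate_40
players = [44700, 45489]    # total players: gate_30, gate_40

p_a = retained[0] / players[0]
p_b = retained[1] / players[1]

# Under H0 both groups share one retention rate, estimated by pooling the groups.
p_pool = sum(retained) / sum(players)
se = sqrt(p_pool * (1 - p_pool) * (1 / players[0] + 1 / players[1]))

z = (p_a - p_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(f"z statistic: {z:.2f}")
print(f"p-value: {p_value:.3f}")
```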

In [8]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

group_A_results = df['retention_7'][df.version == 'gate_30']
group_B_results = df['retention_7'][df.version == 'gate_40']

n_group_A = group_A_results.count()
n_group_B = group_B_results.count()
successes = [group_A_results.sum(), group_B_results.sum()]
nobs = [n_group_A, n_group_B]

z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'95% confidence interval for the group control: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'95% confidence interval for the group treatment: [{lower_treat:.3f}, {upper_treat:.3f}]')

z statistic: 3.16
p-value: 0.002
95% confidence interval for the group control: [0.187, 0.194]
95% confidence interval for the group treatment: [0.178, 0.186]


There is a statistically significant difference between the groups gate_30 and gate_40: the obtained p-value is 0.002, which is less than the significance level of 0.05. The confidence intervals for the two groups do not overlap, which further supports the presence of a statistically significant difference between the groups.
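The confidence intervals printed above can be reproduced by hand with the normal approximation, $\hat p \pm z_{1-\alpha/2}\sqrt{\hat p(1-\hat p)/n}$, which is the default method of `proportion_confint`. The sketch below uses the group counts of this dataset; `normal_ci` is a hypothetical helper name, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(successes, n, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    half_width = z_crit * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

ci_control = normal_ci(8502, 44700)     # gate_30
ci_treatment = normal_ci(8279, 45489)   # gate_40

# Two intervals overlap if each one starts before the other one ends.
overlap = ci_control[0] <= ci_treatment[1] and ci_treatment[0] <= ci_control[1]

print(f"control:   [{ci_control[0]:.3f}, {ci_control[1]:.3f}]")
print(f"treatment: [{ci_treatment[0]:.3f}, {ci_treatment[1]:.3f}]")
print(f"intervals overlap: {overlap}")
```

One caveat on interpretation: non-overlapping 95% intervals imply a significant difference at the 5% level, but the reverse does not hold. Two intervals can overlap slightly while the test on the difference is still significant, so the overlap check is a quick visual aid rather than a replacement for the test.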

3. There is another type of test used for binary metrics like "will the user take an action or not": [**the Chi-square test**](https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/8-chi-squared-tests) (here is another [explanation](https://www.scribbr.com/statistics/chi-square-tests/) of the test with examples). It has different null (H0) and alternative (H1) hypotheses compared to the z- and t-tests. Additionally, this test can be used when more than two groups are under investigation, that is, when it is not an A/B test but, for example, an A/B/C/D test.

In **z- and t-tests** (which differ in that the z-test treats the population variance as known, while the t-test estimates it from the sample; on a large dataset the two tests yield very similar results), **we check if there is a difference in the mean values between user groups**.

In the **Chi-square test, we check if there is a relationship between the user group and whether they will take the desired action**. It investigates essentially the same question in a slightly different way. As a cross-check, we can run several tests. This is especially useful when one of them gives a borderline result, such as a p-value of 0.07: we fail to reject H0 at the 5% significance level, but it is interesting to see what other tests say. So we will also run a Chi-square test and compare its result with the z-test.

You can read more about the differences between the tests [here](https://stats.stackexchange.com/a/178860). This is just an explanation from a Cross Validated (Stack Exchange) user, but there are smart people there.

To perform the Chi-square test, we will use the `chi2_contingency` function from `scipy.stats`, which computes the Chi-square statistic and the p-value for hypothesis testing. The function takes a 2x2 contingency table: the number of players for each combination of game version and `retention_7` value.

**Task**: perform the Chi-square test at the 5% significance level to determine if there is a relationship between the game version and whether the player returns on the 7th day after installing the game. 
The hypotheses are as follows:
- H0: The value of retention_7 is independent of the game version.
- H1: There is a relationship between the game version and the value of retention_7.

Output the p-value and draw a conclusion.


In [10]:
from scipy.stats import chi2_contingency

In [37]:
group_A_true_count = df[(df.version == 'gate_30') & (df['retention_7'] == 1)]['retention_7'].count()
group_A_false_count = df[(df.version == 'gate_30') & (df['retention_7'] == 0)]['retention_7'].count()
group_B_true_count = df[(df.version == 'gate_40') & (df['retention_7'] == 1)]['retention_7'].count()
group_B_false_count = df[(df.version == 'gate_40') & (df['retention_7'] == 0)]['retention_7'].count()

# Assemble the 2x2 contingency table: rows are game versions, columns are retention_7 outcomes.

observed_table = pd.DataFrame({
    "True": [group_A_true_count, group_B_true_count],  
    "False": [group_A_false_count, group_B_false_count]
}, index=["Group A", "Group B"])
observed_table

Unnamed: 0,True,False
Group A,8502,36198
Group B,8279,37210
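As an alternative to counting each cell by hand, `pd.crosstab` can build the same contingency table in one call. The sketch below rebuilds a stand-in frame (`toy_df`, a hypothetical name) from the counts in the table above so that it runs without the CSV file; with the real data, `df` would take its place.

```python
import numpy as np
import pandas as pd

# Stand-in for df, rebuilt from the four cell counts shown above.
toy_df = pd.DataFrame({
    "version": np.repeat(["gate_30", "gate_40"], [8502 + 36198, 8279 + 37210]),
    "retention_7": np.concatenate([
        np.repeat([1, 0], [8502, 36198]),   # gate_30 players
        np.repeat([1, 0], [8279, 37210]),   # gate_40 players
    ]),
})

# Rows: game version; columns: retention_7 value (0 = not retained, 1 = retained).
observed = pd.crosstab(toy_df["version"], toy_df["retention_7"])
print(observed)
```

Note that the columns come out in the order 0, 1, the reverse of the hand-built table. This does not affect the Chi-square test, which is invariant to the order of rows and columns.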


In [42]:
chi2_stat, p_value, dof, expected = chi2_contingency(observed_table)

print('chi2_stat = ', chi2_stat)
print('p_value = ', p_value)
print('dof = ', dof)
print('expected = ', expected)

chi2_stat =  9.959086799559165
p_value =  0.0016005742679058301
dof =  1
expected =  [[ 8317.09742873 36382.90257127]
 [ 8463.90257127 37025.09742873]]


The obtained p-value is 0.0016, which is less than the significance level of 0.05. Therefore, we reject the null hypothesis that the value of retention_7 is independent of the game version.
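The Chi-square statistic above (about 9.96) is close to, but not exactly, the square of the z statistic from the earlier task (3.16² ≈ 10.0). The gap comes from Yates' continuity correction, which `chi2_contingency` applies by default to 2x2 tables. The following sketch, with illustrative variable names, shows that with `correction=False` the statistic matches the squared pooled z statistic:

```python
from math import sqrt
from scipy.stats import chi2_contingency

# Observed counts: rows are gate_30 / gate_40, columns are retained / not retained.
observed = [[8502, 36198],
            [8279, 37210]]

chi2_corrected, p_corrected, _, _ = chi2_contingency(observed)            # default: Yates' correction
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)  # no correction

# Pooled two-proportion z statistic computed from the same counts.
n_a, n_b = sum(observed[0]), sum(observed[1])
p_a, p_b = observed[0][0] / n_a, observed[1][0] / n_b
p_pool = (observed[0][0] + observed[1][0]) / (n_a + n_b)
z = (p_a - p_b) / sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

print(f"chi2 with correction:    {chi2_corrected:.3f} (p = {p_corrected:.4f})")
print(f"chi2 without correction: {chi2_plain:.3f} (p = {p_plain:.4f})")
print(f"z squared:               {z**2:.3f}")
```

Either way, the p-value stays well below 0.05, so both tests lead to the same conclusion here.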