# Business Problem

Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It's a classic "connect three" style puzzle game where the player must connect tiles of the same color in order to clear the board and win the level. It also features singing cats. We're not kidding!

As players progress through the game they will encounter gates that force them to wait some time before they can progress or make an in-app purchase. In this project, we will analyze the result of an A/B test where the first gate in Cookie Cats was moved from level 30 to level 40. In particular, we will analyze the impact on player retention.

<center><iframe width="560" height="315" src="https://www.datacamp.com/projects/184" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>


### Data Description (from https://www.datacamp.com/projects/184)

<p>The data is from 90,189 players that installed the game while the AB-test was running. The variables are:</p>
<ul>
<li><code>userid</code> - a unique number that identifies each player.</li>
<li><code>version</code> - whether the player was put in the control group (<code>gate_30</code> - a gate at level 30) or the test group (<code>gate_40</code> - a gate at level 40).</li>
<li><code>sum_gamerounds</code> - the number of game rounds played by the player during the first week after installation.</li>
<li><code>retention_1</code> - did the player come back and play 1 day after installing?</li>
<li><code>retention_7</code> - did the player come back and play 7 days after installing?</li>
</ul>
<p>When a player installed the game, he or she was randomly assigned to either <code>gate_30</code> or <code>gate_40</code>. </p>

# 0.0 Imports

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import chi2_contingency

## 0.1 Load Data

In [5]:
path = 'C:/Users/edils/repos/teste_ab/data/'

df_raw = pd.read_csv(path + 'cookie_cats.csv')

In [6]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90189 entries, 0 to 90188
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   userid          90189 non-null  int64 
 1   version         90189 non-null  object
 2   sum_gamerounds  90189 non-null  int64 
 3   retention_1     90189 non-null  bool  
 4   retention_7     90189 non-null  bool  
dtypes: bool(2), int64(2), object(1)
memory usage: 2.2+ MB


# 1.0 Data Understanding

In [27]:
df1 = df_raw.copy()

In [28]:
df1.head()

   userid  version  sum_gamerounds  retention_1  retention_7
0     116  gate_30               3        False        False
1     337  gate_30              38         True        False
2     377  gate_40             165         True        False
3     483  gate_40               1        False        False
4     488  gate_40             179         True         True


In [32]:
df1.groupby('version').agg({
        'version':['count'],
        'sum_gamerounds':['min','max','mean','std'],
        'retention_1':['count'],
        'retention_7':['count']
    
}).reset_index()


   version  version  sum_gamerounds                               retention_1  retention_7
              count             min   max       mean         std        count        count
0  gate_30    44699               0  2961  51.342111  102.057598        44699        44699
1  gate_40    45489               0  2640  51.298776  103.294416        45489        45489


We have an outlier with sum_gamerounds = 49854. Let's remove it.

In [30]:
df1 = df1.loc[df1['sum_gamerounds'] != 49854,:]
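The cell above removes the outlier by its exact value. As a minimal sketch of how such a row could be located without hard-coding the number, the example below uses a small made-up frame (`toy` is a stand-in for `df1`, not the real data) and drops the row with the largest `sum_gamerounds`:

```python
import pandas as pd

# Toy stand-in for df1; only the extreme value 49854 comes from the text above.
toy = pd.DataFrame({'sum_gamerounds': [3, 38, 165, 49854, 179]})

# Locate the row holding the largest value and inspect it before dropping.
outlier_idx = toy['sum_gamerounds'].idxmax()
outlier_value = toy.loc[outlier_idx, 'sum_gamerounds']
print(f"Largest sum_gamerounds: {outlier_value}")

# Drop that single row by index label.
cleaned = toy.drop(index=outlier_idx)
print(cleaned['sum_gamerounds'].max())
```

This only makes sense when a single extreme row is known to be the problem; for several outliers a threshold or quantile rule would be a more natural choice.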

In [52]:
# Build a frequency table for sum_gamerounds
game_rounds = df1['sum_gamerounds'].value_counts(normalize=True).reset_index()
game_rounds.columns = ['sum_gamerounds', 'percentage']

# Display the percentage distribution of sum_gamerounds
print(game_rounds)

     sum_gamerounds  percentage
0                 1    0.061405
1                 2    0.051071
2                 0    0.044285
3                 3    0.043886
4                 4    0.040238
..              ...         ...
936             933    0.000011
937             617    0.000011
938            1462    0.000011
939             578    0.000011
940             708    0.000011

[941 rows x 2 columns]


We can see that about 4% of users have sum_gamerounds = 0, meaning they never played the game at all. These players should be excluded, since they would pollute the samples.
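A minimal sketch of that exclusion, assuming a simple filter on `sum_gamerounds` is what is intended. The frame `toy` and its values are made up for illustration and stand in for `df1`:

```python
import pandas as pd

# Toy stand-in for df1; the values are invented for illustration only.
toy = pd.DataFrame({
    'userid': [1, 2, 3, 4],
    'version': ['gate_30', 'gate_40', 'gate_30', 'gate_40'],
    'sum_gamerounds': [0, 12, 0, 45],
})

# Keep only players who played at least one round.
played = toy.loc[toy['sum_gamerounds'] > 0].copy()

# Share of rows removed by the filter.
share_removed = 1 - len(played) / len(toy)
print(f"Removed {share_removed:.0%} of rows")
```

Applying the same filter to both groups keeps the comparison fair, but it does change the group sizes, so any table computed before the filter would need to be recomputed.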

# 2.0 Experiment Design

--- 

# Objective

Analyze the effect on player retention of moving the gate from level 30 to level 40.

H0: Moving the gate does not increase retention.

H1: Moving the gate increases retention.

## 2.1 Sample Size

In [34]:
pd.crosstab(df1['version'], df1['retention_1'])

retention_1  False  True
version
gate_30      24665  20034
gate_40      25370  20119
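Raw counts are hard to compare because the two groups differ in size. A minimal sketch that converts them to 1-day retention rates, assuming the counts are copied by hand from the crosstab above; the column names `not_retained` and `retained` are illustrative and not part of the dataset:

```python
import pandas as pd

# Counts copied from the crosstab above (rows: version).
# Column names are hypothetical labels for retention_1 == False / True.
counts = pd.DataFrame(
    {'not_retained': [24665, 25370], 'retained': [20034, 20119]},
    index=pd.Index(['gate_30', 'gate_40'], name='version'),
)

# Divide each row by its total to turn counts into proportions.
rates = counts.div(counts.sum(axis=1), axis=0)
retention_rate_1 = rates['retained']
print(retention_rate_1.round(4))
```

On the full data, `pd.crosstab(df1['version'], df1['retention_1'], normalize='index')` should give the same proportions directly.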


In [37]:
# Build the contingency table for retention_1
contingency_table_1 = pd.crosstab(df1['version'], df1['retention_1'])

# Run the chi-squared test for retention_1
chi2_stat_1, p_value_1, dof_1, expected_1 = chi2_contingency(contingency_table_1)

# Display the results for retention_1
print("Retention 1-day:")
print(f"Chi-squared Statistic: {chi2_stat_1}")
print(f"P-value: {p_value_1}")
print(f"Degrees of Freedom: {dof_1}")
print("Contingency Table:")
print(contingency_table_1)
print("Expected Frequencies:")
print(expected_1)

Retention 1-day:
Chi-squared Statistic: 3.169835543170799
P-value: 0.07500999897705692
Degrees of Freedom: 1
Contingency Table:
retention_1  False  True 
version                  
gate_30      24665  20034
gate_40      25370  20119
Expected Frequencies:
[[24798.35970417 19900.64029583]
 [25236.64029583 20252.35970417]]
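One detail worth knowing when reading this result: for a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default. The sketch below, which assumes the counts are copied by hand from the table above, runs the test with and without the correction so the two can be compared:

```python
import numpy as np
from scipy.stats import chi2_contingency

# 1-day retention counts copied from the contingency table above.
# Rows: gate_30, gate_40. Columns: not retained, retained.
observed = np.array([[24665, 20034],
                     [25370, 20119]])

# Default call: Yates' continuity correction is applied to 2x2 tables.
chi2_yates, p_yates, dof, _ = chi2_contingency(observed)

# Same test with the continuity correction switched off.
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print(f"With correction:    chi2={chi2_yates:.4f}, p={p_yates:.4f}")
print(f"Without correction: chi2={chi2_plain:.4f}, p={p_plain:.4f}")
```

The default call reproduces the statistic reported above (about 3.17). With samples this large the correction changes the numbers only slightly, and either way the 1-day difference does not reach the conventional 0.05 threshold.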


In [38]:
contingency_table_7 = pd.crosstab(df1['version'], df1['retention_7'])

# Run the chi-squared test for retention_7
chi2_stat_7, p_value_7, dof_7, expected_7 = chi2_contingency(contingency_table_7)

# Display the results for retention_7
print("Retention 7-day:")
print(f"Chi-squared Statistic: {chi2_stat_7}")
print(f"P-value: {p_value_7}")
print(f"Degrees of Freedom: {dof_7}")
print("Contingency Table:")
print(contingency_table_7)
print("Expected Frequencies:")
print(expected_7)

Retention 7-day:
Chi-squared Statistic: 9.91527552890567
P-value: 0.0016391259678654425
Degrees of Freedom: 1
Contingency Table:
retention_7  False  True 
version                  
gate_30      36198   8501
gate_40      37210   8279
Expected Frequencies:
[[36382.49203885  8316.50796115]
 [37025.50796115  8463.49203885]]
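The chi-squared test is non-directional, while H1 as stated is one-sided (retention increases with `gate_40`). One way to check the direction is a pooled two-proportion z-test, sketched below under the assumption that the 7-day counts are copied by hand from the table above. This is a swapped-in technique for illustration, not the test used in the notebook:

```python
import numpy as np
from scipy.stats import norm

# 7-day retention counts copied from the contingency table above.
retained = np.array([8501, 8279])    # gate_30, gate_40
totals = np.array([44699, 45489])    # group sizes (row sums of the table)

rates = retained / totals
diff = rates[1] - rates[0]           # gate_40 minus gate_30

# Pooled standard error under H0 (equal retention in both groups).
pooled = retained.sum() / totals.sum()
se = np.sqrt(pooled * (1 - pooled) * (1 / totals).sum())
z = diff / se

# One-sided p-value for H1: gate_40 retention is higher than gate_30.
p_increase = norm.sf(z)
# Two-sided p-value, comparable to the chi-squared result.
p_two_sided = 2 * norm.sf(abs(z))

print(f"rates: gate_30={rates[0]:.4f}, gate_40={rates[1]:.4f}, diff={diff:.4f}")
print(f"z={z:.3f}, p(increase)={p_increase:.4f}, p(two-sided)={p_two_sided:.4f}")
```

With these counts the difference is negative: 7-day retention is roughly 0.8 percentage points lower for `gate_40`. The significant chi-squared result therefore reflects a decrease, and the one-sided test for an increase gives no support to H1.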
