**READ FILE AS RELATIVE PATH -- ACHIEVING REUSABILITY**

In [None]:
import os 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')

# Build the path relative to the current working directory
csv_path = os.path.join(os.getcwd(), 'query_results.csv')

raw_data = pd.read_csv(csv_path)
raw_data
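
The same idea can be sketched with `pathlib`, assuming the CSV sits in the directory the notebook is launched from. `data_path` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def data_path(filename, base_dir=None):
    """Return the path of `filename` inside `base_dir` (default: current working directory)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / filename


# Resolves against wherever the notebook is started, so no absolute path is hard-coded.
csv_path = data_path("query_results.csv")
print(csv_path.name)
```

The resulting `csv_path` can be passed straight to `pd.read_csv`.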

## EDA

**LET'S VISUALISE THE SIZE OF THE A/B TEST GROUPS IN OUR WHOLE DATAFRAME**

In [None]:
import seaborn as sns 
sns.countplot(x="abtest_group", data=raw_data)
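
The count plot shows absolute group sizes. If percentages are wanted, a minimal sketch with `value_counts(normalize=True)` could look like this; the toy frame is a hypothetical stand-in for `raw_data`.

```python
import pandas as pd

# Toy stand-in for raw_data; the real column comes from query_results.csv.
toy = pd.DataFrame({"abtest_group": ["A", "A", "A", "B"]})

# normalize=True returns fractions instead of counts; multiply by 100 for percent.
group_share = toy["abtest_group"].value_counts(normalize=True) * 100
print(group_share)
```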

In [None]:
raw_data = raw_data.fillna(0)
raw_data

**Randomised Date, Daily Installations per User**

In [None]:
daily_installations = raw_data[raw_data.install_date == "2017-03-01"]
daily_installations

**LET'S CALCULATE THE YEARLY INSTALLATION RATE AKA USER GROWTH RATE PER YEAR BASED ON NEW INSTALLATIONS**

In [None]:
monthly_installations = len(daily_installations)*30  # extrapolate one day's installs to a 30-day month
monthly_installations

In [None]:
yearly_installation_rate_or_growth_rate = len(raw_data)/monthly_installations
yearly_installation_rate_or_growth_rate
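
Multiplying one day's installs by 30 assumes every day looks like 2017-03-01. An alternative sketch counts installs per calendar month directly. The toy frame below is hypothetical and only mirrors the `install_date` column.

```python
import pandas as pd

# Hypothetical install dates, not taken from query_results.csv.
toy = pd.DataFrame({"install_date": ["2017-03-01", "2017-03-01", "2017-03-15", "2017-04-02"]})
toy["install_date"] = pd.to_datetime(toy["install_date"])

# Group by calendar month and count the rows (installs) in each month.
installs_per_month = toy.groupby(toy["install_date"].dt.to_period("M")).size()
print(installs_per_month)
```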

**YEARLY PURCHASE RATE**

In [None]:
sum_of_yearly_purchases = sum(raw_data.purchases)
sum_of_yearly_purchases

In [None]:
YPR = len(raw_data)/sum(raw_data.purchases)  # rows per purchase, i.e. the inverse of purchases per row
YPR

In [None]:
import statistics

statistics.mean(raw_data.purchases)

In [None]:
statistics.stdev(raw_data.purchases)

In [None]:
statistics.mean(raw_data.gameends)

In [None]:
statistics.stdev(raw_data.gameends)

**CHECKING UNIQUE VALUES IN DATES**

In [None]:
raw_data["activity_date"].unique()

In [None]:
raw_data["assignment_date"].unique()

**I WANTED TO CALCULATE RETENTION (AND HENCE CHURN RATE) BASED ON THE LAST ACTIVITY DATE COMPARED TO THE INSTALL DATE, BUT I THINK THIS IS NOT POSSIBLE WITH THIS DATA**

**THIS SHOWS YEARLY DATA**

In [None]:
raw_data["install_date"].unique()

**HOW MANY PLAYERS IN OUR DATAFRAME HAVE NEVER CONVERTED?**

In [None]:
no_conversion_data = raw_data[raw_data.conversion_date == 0]
no_conversion_data

In [None]:
purchases_max_number_in_no_conversion_data = no_conversion_data[no_conversion_data.purchases >= 2]
purchases_max_number_in_no_conversion_data

**in our non-conversion data, only 30 players have made 2 or more purchases -- a relatively small number**

**HOW MANY OF OUR PLAYERS HAVE ACTUALLY CONVERTED OVER THIS WHOLE PERIOD?**

In [None]:
merged = raw_data.merge(no_conversion_data, how='left', indicator=True)
conversion = merged[merged['_merge']=='left_only']
conversion = conversion.drop("_merge", axis=1)
conversion_date_data = conversion.fillna(0)
conversion_date_data

Well, 1468 players have converted. The number is small but not insignificant, since every customer counts and brings in money. Let's see at a later stage whether the new feature could bring more converted customers in the future.
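
The converted players can also be selected with a plain boolean filter instead of an anti-join merge. This is a minimal sketch on a hypothetical toy frame, assuming (as in the notebook after `fillna(0)`) that `0` marks a player who never converted.

```python
import pandas as pd

# Toy stand-in for raw_data after fillna(0): 0 means "never converted".
toy = pd.DataFrame({
    "playerid": [1, 2, 3, 4],
    "conversion_date": [0, "2017-03-02", 0, "2017-04-10"],
})

# Keep the rows whose conversion_date is not the 0 placeholder.
converted = toy[toy["conversion_date"] != 0]
print(len(converted))
```

The filter avoids the merge on every column, which can duplicate rows when the frame contains repeated records.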

**LET'S CALCULATE ONE KPI: YEARLY CONVERSION RATE**

In [None]:
yearly_conversion_rate = len(conversion_date_data)/len(raw_data)*100  # converted players as a percentage of all players
yearly_conversion_rate
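
A conversion rate divides the converted players by all players; dividing by the non-converted players gives the odds instead. The sketch below contrasts the two. The function names and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def conversion_rate(n_converted, n_total):
    """Share of all players who converted, in percent."""
    return n_converted / n_total * 100


def conversion_odds(n_converted, n_total):
    """Converted players per non-converted player (odds, not a rate)."""
    return n_converted / (n_total - n_converted)


# Hypothetical counts: 25 converted players out of 1000.
rate = conversion_rate(25, 1000)
odds = conversion_odds(25, 1000)
print(rate, odds)
```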

**HOW MANY TOTAL PURCHASES PER AB TEST GROUP?**

In [None]:
new = conversion_date_data.groupby('abtest_group')["purchases"].sum()  # total purchases per group (size() would count players)
new
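
On a grouped column, `size()` counts rows while `sum()` adds up the values. A small hypothetical frame makes the difference visible.

```python
import pandas as pd

# Hypothetical data: two players in group A, one in group B.
toy = pd.DataFrame({"abtest_group": ["A", "A", "B"], "purchases": [1, 4, 2]})
grouped = toy.groupby("abtest_group")["purchases"]

players_per_group = grouped.size()   # number of rows (players) in each group
purchases_per_group = grouped.sum()  # total purchases in each group
print(players_per_group, purchases_per_group)
```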

How many players made 2 or more purchases?

In [None]:
purchases_max_number = conversion_date_data[conversion_date_data.purchases >= 2]
purchases_max_number

**HOW MANY PLAYERS (2 OR MORE PURCHASES) PER TEST GROUP?**

In [None]:
purchases = purchases_max_number["purchases"].groupby(purchases_max_number['abtest_group']).size()
purchases

**VISUALISING WHETHER THERE IS ANY RELATIONSHIP BETWEEN THE ABTEST GROUP AND THE NUMBER OF ROUNDS PLAYED**

In [None]:
sns.scatterplot(x=purchases_max_number["abtest_group"], y = purchases_max_number["gameends"], data=purchases_max_number)

Some outliers in Group B show that some gamers played more rounds

In [None]:
sns.scatterplot(x=purchases_max_number["abtest_group"], y = purchases_max_number["purchases"], data=purchases_max_number)

Based on the above graph, the new feature does not appear to lead to more purchases

In [None]:
purchases_min_number = conversion_date_data[conversion_date_data.purchases <= 2]
purchases_min_number

**LET'S SEE WHAT INSIGHTS WE GET IF THE NUMBER OF PURCHASES IS SMALL**

## HOW IS THE NUMBER OF PURCHASES RELATED TO THE NUMBER OF ROUNDS PLAYED?

In [None]:
import numpy as np
purchases_gameends = conversion_date_data["gameends"].groupby(conversion_date_data['purchases']).mean().astype(float).astype(np.int32)
purchases_gameends

**based on the above table, more purchases tend to go together with more rounds, although this correlation does not imply causation**

In [None]:
group_B = conversion_date_data[conversion_date_data.abtest_group=="B"]
group_B

## Finding the mean game ends (rounds) in conversion group B per purchase number

In [None]:
purchases_gameends_group_B = group_B["gameends"].groupby(group_B['purchases']).mean().astype(float).astype(np.int32)
purchases_gameends_group_B = pd.DataFrame(purchases_gameends_group_B)
purchases_gameends_group_B

In [None]:
sns.countplot(x="purchases", data=group_B)

In [None]:
group_A = conversion_date_data[conversion_date_data.abtest_group=="A"]
group_A

In [None]:
purchases_gameends_group_A = group_A["gameends"].groupby(group_A['purchases']).mean().astype(float).astype(np.int32)
purchases_gameends_group_A = pd.DataFrame(purchases_gameends_group_A)
purchases_gameends_group_A

In [None]:
sns.countplot(x="purchases", data=group_A)

**Some conclusions: The number of rounds played in both groups tends to rise as the number of purchases rises. This shows that some players are real fans of the game**

In [None]:
groupa_pur = sum(group_A.purchases)
groupa_gameends = sum(group_A.gameends)
print(groupa_pur, groupa_gameends)

In [None]:
game_behaviour_group_B = sum(group_B.purchases)/sum(group_B.gameends)*100
game_behaviour_group_B

In [None]:
game_behaviour_group_A = sum(group_A.purchases)/sum(group_A.gameends)*100
game_behaviour_group_A

In Group B, when the number of purchases is higher, the number of rounds is higher too. The same happens in Group A, but less consistently, since a high number of purchases does not always mean more rounds.

**DAILY TIME SPENT IN THE GAME BY THOSE WITH THE HIGHEST PURCHASES AND GAMEENDS**

In [None]:
# Players with more than 10 purchases and at least 5 rounds
fans = raw_data[(raw_data['purchases'] > 10) & (raw_data['gameends'] >=5)]
fans


From the above data we can conclude the following:

Some users make about 1 purchase per month and play 1-2 times per month, like the player with playerid == 14907662. The third, fourth and fifth players in the dataframe above fall into the same category. The second-to-last player also seems to belong here, judging by their playing behaviour and the time between install date and activity date. So we need to take the time span into account, not just the total number of purchases and gameends.

Other users are real fans, like playerid == 41458856, who made 383 purchases within a two-month timeframe (see: install date to activity date).
This seems to be an almost daily user of the game who plays approx. 1 hour per day, as seen from the gameends.
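
One way to account for the time span is to divide purchases by the number of days between install and last activity. The sketch below uses hypothetical players and assumes `activity_date` holds the last recorded activity. `days_active` and `purchases_per_day` are illustrative column names.

```python
import pandas as pd

# Hypothetical players; column names follow the notebook.
toy = pd.DataFrame({
    "playerid": [1, 2],
    "install_date": ["2017-01-01", "2017-01-01"],
    "activity_date": ["2017-03-02", "2017-12-27"],
    "purchases": [60, 12],
})
for col in ("install_date", "activity_date"):
    toy[col] = pd.to_datetime(toy[col])

# Days between install and last activity (at least 1, to avoid dividing by zero).
toy["days_active"] = (toy["activity_date"] - toy["install_date"]).dt.days.clip(lower=1)
toy["purchases_per_day"] = toy["purchases"] / toy["days_active"]
print(toy[["playerid", "days_active", "purchases_per_day"]])
```

Player 1 buys about once a day, while player 2 buys roughly once a month, even though both have a double-digit purchase count.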

In [None]:
raw_data.install_date = pd.to_datetime(raw_data.install_date)
raw_data.install_date

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**FROM THE TIME SERIES GRAPHS BELOW WE CAN SEE THAT SPECIFIC DATES AND MONTHS YIELD MORE PURCHASES ON A YEARLY BASIS, E.G. 2016-05-01**

In [None]:
# Plot a 10-row slice of the converted players in group B
subset_B = group_B.iloc[100:110]
sns.lineplot(x="conversion_date", y="purchases", data=subset_B)
plt.xticks(rotation=20)
plt.title('Month of more purchases and conversions in Group B')
plt.show()

In [None]:
# Plot a 10-row slice of the converted players in group A
subset_A = group_A.iloc[200:210]
sns.lineplot(x="conversion_date", y="purchases", data=subset_A)
plt.xticks(rotation=20)
plt.title('Month of more purchases and conversions in Group A')
plt.show()

**SOME INSIGHTS: MIDWINTER AND SUMMER ARE PERIODS WHEN PEOPLE SEEM TO CONVERT MORE AND PURCHASE MORE**

**STATISTICS AND DISTRIBUTIONS**

Calculating each A/B test group's share of the converted players

In [None]:
# Share of converted players that belong to group A
group_a_conversion_rate = len(group_A)/len(conversion_date_data)
group_a_conversion_rate

In [None]:
# Share of converted players that belong to group B
group_b_conversion_rate = len(group_B)/len(conversion_date_data)
group_b_conversion_rate

**TIME TO USE STATISTICS, PROBABILITY AND DISTRIBUTIONS TO ANALYSE THE A/B TEST RESULTS IN THE WHOLE DATA FRAME**

In [None]:
raw_data

In [None]:
# Recode conversion_date as a 0/1 flag: 1 if the player converted, 0 otherwise
raw_data["conversion_date"] = (raw_data["conversion_date"] != 0).astype(int)
raw_data

In [None]:
raw_data["conversion_date"].unique()

**ONLY 2.58% OF ALL OUR CUSTOMERS CONVERTED**

In [None]:
raw_data["conversion_date"].sum()/len(raw_data)*100

In [None]:
# Find unique users
print("Unique users:", len(raw_data.playerid.unique()))

# Check for not unique users
print("Non-unique users:", len(raw_data)-len(raw_data.playerid.unique()))

In [None]:
# Probability of user converting
print("Probability of user converting:", raw_data.conversion_date.mean())

In [None]:
# Probability of control group converting

old_features = raw_data[raw_data['abtest_group']=='A']['conversion_date'].mean()
old_features

In [None]:
# Probability of experimental group converting

popup_feature = raw_data[raw_data['abtest_group']=='B']['conversion_date'].mean()
popup_feature

In [None]:
p_diff = popup_feature-old_features

print("Difference in probability of conversion for new and old features (not under H_0):", p_diff)

In [None]:
raw_data.loc[raw_data["abtest_group"] == "B", "abtest_group"] = 1  # encode the pop-up group (B) as 1
raw_data

In [None]:
n_new = (raw_data['abtest_group'] == 1).sum()  # number of players in the pop-up group

print("new:", n_new)

In [None]:
raw_data.loc[raw_data["abtest_group"] == "A", "abtest_group"] = 2  # encode the control group (A) as 2
raw_data

In [None]:
n_old = (raw_data['abtest_group'] == 2).sum()  # number of players in the control group

print("old:", n_old) 
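
As an alternative sketch, the same coding (B as 1, A as 2) can be written to a separate column with `map`, which keeps the original labels available. `group_code` is a hypothetical column name and the toy frame stands in for `raw_data`.

```python
import pandas as pd

# Toy stand-in for raw_data with the original string labels.
toy = pd.DataFrame({"abtest_group": ["A", "B", "B", "A"]})

# Same coding as the notebook (B -> 1, A -> 2), stored in a new column.
toy["group_code"] = toy["abtest_group"].map({"B": 1, "A": 2})

# Group sizes as plain integers.
n_new = int((toy["group_code"] == 1).sum())
n_old = int((toy["group_code"] == 2).sum())
print(n_new, n_old)
```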

In [None]:

import statsmodels.api as sm

# Calculate number of conversions

convert_old = len(raw_data[(raw_data['abtest_group']==2)&(raw_data['conversion_date']==1)])
convert_new = len(raw_data[(raw_data['abtest_group']==1)&(raw_data['conversion_date']==1)])

print("convert_old:", convert_old, 
      "\nconvert_new:", convert_new,
      "\nn_old:", n_old,
      "\nn_new:", n_new)


In [None]:
import numpy as np
from scipy.stats import norm

mu_B = popup_feature
mu_A = old_features

var_B = mu_B * (1-mu_B)
var_A = mu_A * (1-mu_A)

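# Sample sizes are the group sizes, not the number of conversions
n_B = (raw_data['abtest_group'] == 1).sum()
n_A = (raw_data['abtest_group'] == 2).sum()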

Z = (mu_B - mu_A)/np.sqrt(var_B/n_B + var_A/n_A)
pvalue = norm.sf(Z)

print("Z-score: {0}\np-value: {1}".format(Z,pvalue))
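
The calculation can be checked by hand with the standard library. This is a minimal sketch of a one-sided two-proportion z-test with unpooled variances. The function name and the counts are hypothetical and are not the notebook's results.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """One-sided z-test of H1: rate_b > rate_a, using unpooled variances."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Standard error of the difference; n_a and n_b are the group sizes.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a) / se
    # Upper-tail area of the standard normal, the same quantity as norm.sf(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 250 of 10,000 convert in A, 260 of 10,000 in B.
z, p_value = two_proportion_z(conv_a=250, n_a=10_000, conv_b=260, n_b=10_000)
print(z, p_value)
```

With these made-up counts the difference of 0.1 percentage points is well within sampling noise.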

In [None]:
import matplotlib.pyplot as plt

z = np.arange(-3, 3, 0.1)
plt.plot(z, norm.pdf(z))
plt.fill_between(z[z>Z], norm.pdf(z[z>Z]))
plt.show()

Using the standard deviation of conversions per group instead of the binomial variance

In [None]:
import numpy as np
from scipy.stats import norm

mu_B = popup_feature
mu_A = old_features

# Standard deviation of the 0/1 conversion flag within each group
std_B = np.std(raw_data.loc[raw_data["abtest_group"]==1, "conversion_date"].astype(float))
std_A = np.std(raw_data.loc[raw_data["abtest_group"]==2, "conversion_date"].astype(float))

# Sample sizes are the group sizes, not the number of conversions
n_B = (raw_data["abtest_group"]==1).sum()
n_A = (raw_data["abtest_group"]==2).sum()

Z = (mu_B - mu_A)/np.sqrt(std_B**2/n_B + std_A**2/n_A)
pvalue = norm.sf(Z)

print("Z-score: {0}\np-value: {1}".format(Z,pvalue))

**WHETHER WE USE THE VARIANCE OR THE STANDARD DEVIATION, WE CANNOT REJECT THE NULL HYPOTHESIS THAT THE POP-UP FEATURE DOES NOT BRING MORE CONVERSIONS THAN THE OLD FEATURES**

In [None]:
from scipy.stats import norm
norm.cdf(Z) # cumulative probability up to Z, i.e. 1 minus the one-sided p-value

In [None]:
norm.ppf(1-(0.05)) #critical value of 95% confidence


The z-score is less than the critical value at the 95% confidence level, so we fail to reject the null hypothesis. Failing to reject is not the same as accepting it: the data simply do not show that the pop-up feature brings more conversions.
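
A confidence interval for the difference in conversion rates tells a similar story in the units of the metric. The sketch below builds a two-sided normal-approximation (Wald) interval. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation (Wald) interval for rate_b - rate_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, about 1.96 for a 95% level.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts: 250 of 10,000 convert in A, 260 of 10,000 in B.
low, high = diff_confidence_interval(250, 10_000, 260, 10_000)
print(low, high)
```

An interval that contains 0, as here, is consistent with failing to reject the null hypothesis of no difference.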

**LOGISTIC REGRESSION FOR STATS ANALYSIS**

In [None]:
#adding an intercept column
raw_data['intercept'] = 1

#Create dummy variable column (1 = pop-up group B, 0 = control group A)
raw_data['ab_feature'] = (raw_data['abtest_group'] == 1).astype(int)

raw_data.head()

In [None]:
import statsmodels.api as sm
model=sm.Logit(raw_data['conversion_date'].astype(float),raw_data[['intercept','ab_feature']])
results=model.fit()

In [None]:
results.summary()
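
With an intercept and a single 0/1 predictor, the fitted logistic regression reproduces the two group conversion rates, so its coefficients can be read as log-odds. The sketch below shows that relationship with hypothetical rates, not the notebook's estimates.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


# Hypothetical conversion rates for the control (A) and pop-up (B) groups.
rate_a, rate_b = 0.025, 0.026

intercept = logit(rate_a)            # log-odds of converting in group A
ab_coef = logit(rate_b) - intercept  # log odds ratio of B versus A
odds_ratio = math.exp(ab_coef)       # how many times higher the odds are in B
print(intercept, ab_coef, odds_ratio)
```

An `ab_feature` coefficient near 0 (odds ratio near 1) means the two groups convert at almost the same rate.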

Conclusions:
None of the variables have significant p-values. Therefore, we will fail to reject the null and conclude that there is not sufficient evidence to suggest that the new feature will bring more conversions and more customers.

In the larger picture, based on the available information, we do not have sufficient evidence to suggest that the new feature results in more conversions than the old game architecture.