# AB Testing / Titanic Datasets

AB Testing Steps:

* 1- Create a hypothesis
* 2- Check the assumptions
  * Assumption 1: The distribution of each group is normal
  * Assumption 2: Variances are homogeneous
* 3- Run the hypothesis test and check the p value. If the p value is less than 0.05 we reject H0; otherwise we fail to reject H0.
  * a) If the assumptions hold, use the independent samples t-test (parametric test, ttest_ind)
  * b) If the assumptions do not hold, use the Mann-Whitney U test (non-parametric test, mannwhitneyu)
 
**Note**: If assumption 1 does not hold, we can go directly to the non-parametric test (option b). If assumption 1 holds but assumption 2 does not, we can still use the parametric test and pass equal_var=False to ttest_ind to indicate that the variances are not homogeneous.
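
The decision rule in the steps and note above can be sketched as a small function. This is only an illustration: the name `pick_test` and the p values passed to it are made up for the example, and the 0.05 threshold is the one used in this notebook.

```python
def pick_test(p_normal_a, p_normal_b, p_levene, alpha=0.05):
    """Map assumption-check p values to the test suggested by the steps above.

    pick_test is a hypothetical helper, not part of SciPy.
    """
    # Assumption 1: both groups must look normal (Shapiro p value >= alpha)
    normal = p_normal_a >= alpha and p_normal_b >= alpha
    if not normal:
        return "mannwhitneyu"
    # Assumption 2: homogeneous variances (Levene p value >= alpha)
    if p_levene >= alpha:
        return "ttest_ind(equal_var=True)"
    # Normal groups but unequal variances: t-test with equal_var=False
    return "ttest_ind(equal_var=False)"


# Illustrative p values only, not results from the Titanic data
print(pick_test(0.20, 0.35, 0.60))
print(pick_test(0.20, 0.35, 0.01))
print(pick_test(0.01, 0.35, 0.60))
```

The function only encodes the branching; the p values themselves would come from shapiro and levene.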

___

**Import libraries and create dataframe called df**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.api as sms
from scipy.stats import ttest_1samp, shapiro, levene, ttest_ind, mannwhitneyu, \
pearsonr, spearmanr, kendalltau, f_oneway, kruskal

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df = sns.load_dataset('titanic')

### Problem: Is there a statistically significant difference between the average age of men and women on the Titanic?

**Step 1: Create hypothesis and check the average age of women and men**

Hypothesis (H0): There is no statistically significant difference between the average age of men and women on the Titanic.

u1 = Average age of women<br>
u2 = Average age of men

H0: u1 = u2<br>
H1: u1 != u2

In [3]:
df.head()

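|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |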


In [5]:
df.groupby('sex').agg({'age': 'mean'})

| sex | age |
|---|---|
| female | 27.91571 |
| male | 30.72664 |


**Step 2: Assumption Check**

Assumption 1: The distribution of each group is normal (use the Shapiro-Wilk test, shapiro)<br>
Assumption 2: Variances are homogeneous (use the Levene test, levene)<br>

Assumption 1:

H0: Distribution is normal. <br>
H1: Distribution is not normal.

In [12]:
test_stat, p_value = shapiro(df.loc[df['sex'] == 'female', 'age'].dropna())
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.98479
p value: 0.00705


The p value is less than 0.05, so we reject H0: the age distribution of women is not normal.

In [18]:
test_stat, p_value = shapiro(df.loc[df['sex'] == 'male', 'age'].dropna())
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.97473
p value: 0.00000


The p value is less than 0.05, so we reject H0: the age distribution of men is not normal either. Since neither distribution is normal, we could go directly to the non-parametric test, but let's check the second assumption anyway.

Assumption 2:

H0: Variances are homogeneous. <br>
H1: Variances are not homogeneous.

In [21]:
test_stat, p_value = levene(df.loc[df['sex'] == 'female', 'age'].dropna(),
                           df.loc[df['sex'] == 'male', 'age'].dropna())
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.00130
p value: 0.97121


The p value is greater than 0.05, so we fail to reject H0: there is no evidence that the variances differ.
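
The two assumption checks can also be wrapped in one reusable function. The sketch below is an illustration under stated assumptions: `check_assumptions` is a hypothetical helper name, and the samples are synthetic, seeded random numbers rather than Titanic ages.

```python
import numpy as np
from scipy.stats import shapiro, levene


def check_assumptions(group_a, group_b, alpha=0.05):
    """Run Shapiro-Wilk on each group and Levene on both (hypothetical helper)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Drop missing values, mirroring the .dropna() calls above
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]

    _, p_normal_a = shapiro(a)
    _, p_normal_b = shapiro(b)
    _, p_levene = levene(a, b)
    return {
        "p_normal_a": float(p_normal_a),
        "p_normal_b": float(p_normal_b),
        "p_levene": float(p_levene),
        # Assumption 1 holds only if neither group rejects normality
        "normal": bool(p_normal_a >= alpha and p_normal_b >= alpha),
        # Assumption 2 holds if Levene does not reject equal variances
        "equal_var": bool(p_levene >= alpha),
    }


# Synthetic samples (not Titanic data): one normal, one strongly skewed
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=30, scale=14, size=200)
sample_b = rng.exponential(scale=30, size=200)
result = check_assumptions(sample_a, sample_b)
print(result)
```

With the notebook's data, the same function could be called on the female and male age columns of df; pandas Series are accepted because the inputs are converted with np.asarray.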

**Step 3: Apply Hypothesis**

Non parametric solution: mannwhitneyu

In [24]:
test_stat, p_value = mannwhitneyu(df.loc[df['sex'] == 'female', 'age'].dropna(),
                                 df.loc[df['sex'] == 'male', 'age'].dropna())
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 53212.50000
p value: 0.02609


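Because the p value is less than 0.05, we reject H0. There is a statistically significant difference between the ages of men and women on the Titanic, one that is unlikely to be due to chance alone. Note that the Mann-Whitney U test compares the two age distributions through ranks rather than comparing the means directly.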
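
To build intuition for the test statistic printed above, here is a minimal pure-Python sketch of how a Mann-Whitney U statistic can be computed by comparing every pair of observations. The function name `mann_whitney_u` and the tiny samples are made up for illustration. Which of the two U values a library reports can differ between versions, so check the SciPy documentation for the version you run.

```python
def mann_whitney_u(sample_x, sample_y):
    """Return (U_x, U_y) by direct pairwise comparison; ties count as 0.5."""
    u_x = 0.0
    for x in sample_x:
        for y in sample_y:
            if x > y:
                u_x += 1.0   # x beats y
            elif x == y:
                u_x += 0.5   # a tie is split between the two groups
    # The two statistics always sum to the number of pairs
    u_y = len(sample_x) * len(sample_y) - u_x
    return u_x, u_y


# Tiny made-up samples, not Titanic ages
ages_a = [1, 2, 3]
ages_b = [2, 3, 4]
u_a, u_b = mann_whitney_u(ages_a, ages_b)
print(u_a, u_b)  # prints: 2.0 7.0
```

The pairwise loop is quadratic in the sample sizes, so it is meant for understanding the statistic rather than for large datasets.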