# AB Testing / Diabetes Datasets

AB Testing Steps:

* 1- Create a hypothesis
* 2- Assumptions check
  * Assumption 1: The distribution of each group is normal
  * Assumption 2: Variances are homogeneous
* 3- Apply the hypothesis test and check the p value. If the p value is less than 0.05 we reject H0; otherwise we fail to reject H0.
  * a) If the assumptions hold, use the independent samples t-test (parametric test, `ttest_ind`)
  * b) If the assumptions do not hold, use the Mann-Whitney U test (non-parametric test, `mannwhitneyu`)
 
**Note**: If assumption 1 does not hold, we can go directly to the non-parametric test (option b). If assumption 1 holds but assumption 2 does not, we can still use the parametric test and pass an argument stating that the variances are not homogeneous (`equal_var=False` in `ttest_ind`).
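The decision rule described in the steps and note above can be sketched as a small helper. The function name `compare_two_groups`, the `alpha` parameter, and the synthetic samples are assumptions made for illustration only; they are not part of the notebook's own code.

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu


def compare_two_groups(a, b, alpha=0.05):
    """Pick and run a two-sample test following the steps above (hypothetical helper)."""
    # Assumption 1: normality of each group (Shapiro-Wilk test).
    _, p_norm_a = shapiro(a)
    _, p_norm_b = shapiro(b)
    if p_norm_a < alpha or p_norm_b < alpha:
        # Normality rejected for at least one group: use the non-parametric test.
        stat, p = mannwhitneyu(a, b, alternative='two-sided')
        return 'mannwhitneyu', stat, p
    # Assumption 2: homogeneity of variances (Levene's test).
    _, p_var = levene(a, b)
    equal_var = p_var >= alpha
    # equal_var=False makes ttest_ind run Welch's t-test.
    stat, p = ttest_ind(a, b, equal_var=equal_var)
    name = 'ttest_ind' if equal_var else 'ttest_ind (Welch)'
    return name, stat, p


# Synthetic, clearly non-normal samples (not the diabetes data).
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=10.0, size=200)
group_b = rng.exponential(scale=12.0, size=200)

test_name, test_stat, p_value = compare_two_groups(group_a, group_b)
print(test_name, 'p value: %.5f' % p_value)
```

Because the synthetic samples are strongly skewed, the helper should fall through to `mannwhitneyu`, which mirrors the path taken for the `Age` column later in this notebook.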

___

**Import libraries and create dataframe called df**

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.api as sms
from scipy.stats import ttest_1samp, shapiro, levene, ttest_ind, mannwhitneyu, \
pearsonr, spearmanr, kendalltau, f_oneway, kruskal

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df = pd.read_csv('diabetes.csv')

### Problem: Is there a statistically significant difference between the average age of diabetic and non-diabetic patients?

**Step 1: Create the hypothesis and check the average age of the diabetic and non-diabetic groups**

Hypothesis (H0): There is no statistically significant difference between the average age of the diabetic and non-diabetic groups.

u1 = Average age of the diabetic group<br>
u2 = Average age of the non-diabetic group

H0: u1 = u2<br>
H1: u1 != u2

In [9]:
df.head()

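| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |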


Outcome = 1: diabetic<br>
Outcome = 0: non-diabetic

In [10]:
df.groupby('Outcome').agg({'Age': 'mean'})

| Outcome | Age |
|---|---|
| 0 | 31.19 |
| 1 | 37.06716 |


**Step 2: Assumption Check**

Assumption 1: The distribution of each group is normal (use the Shapiro-Wilk test, `shapiro`)<br>
Assumption 2: Variances are homogeneous (use Levene's test, `levene`)<br>

Assumption 1:

H0: The distribution is normal.<br>
H1: The distribution is not normal.

In [21]:
test_stat, p_value = shapiro(df.loc[df['Outcome'] == 1, 'Age'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.95457
p value: 0.00000


The p value is less than 0.05, so we reject H0 (the distribution is normal) for the diabetic group.

In [22]:
test_stat, p_value = shapiro(df.loc[df['Outcome'] == 0, 'Age'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.80117
p value: 0.00000


The p value is less than 0.05, so we reject H0 (the distribution is normal) for the non-diabetic group as well. Neither distribution is normal, so we can go directly to the non-parametric test, but let's check the second assumption anyway.

Assumption 2:

H0: Variances are homogeneous. <br>
H1: Variances are not homogeneous.

In [14]:
test_stat, p_value = levene(df.loc[df['Outcome'] == 1, 'Age'],
                           df.loc[df['Outcome'] == 0, 'Age'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 2.22521
p value: 0.13619


The p value is greater than 0.05, so we fail to reject H0: there is no evidence that the variances differ.

**Step 3: Apply Hypothesis**

Non-parametric solution: Mann-Whitney U test (`mannwhitneyu`)

In [15]:
# H1 is u1 != u2, so request the two-sided alternative explicitly.
test_stat, p_value = mannwhitneyu(df.loc[df['Outcome'] == 1, 'Age'],
                                  df.loc[df['Outcome'] == 0, 'Age'],
                                  alternative='two-sided')
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 92050.00000
p value: 0.00000


The p value is less than 0.05, so we reject H0. There is a statistically significant difference between the ages of the diabetic and non-diabetic groups, and it is unlikely to be due to chance alone.
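Several outputs above print `p value: 0.00000`. That is an effect of the `%.5f` format, which rounds to five decimals; it does not mean the p value is exactly zero. A minimal sketch of the difference, using a made-up p value rather than one computed from the diabetes data:

```python
# A hypothetical, very small p value used only to illustrate formatting.
p_value = 1.2e-9

fixed = 'p value: %.5f' % p_value       # rounds to zero at five decimals
scientific = 'p value: %.3e' % p_value  # keeps the order of magnitude

print(fixed)
print(scientific)
```

Switching the `print` calls in this notebook to a scientific format such as `%.3e` would show how small each p value really is.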