# ANOVA - Analysis of Variance

Analysis of variance (ANOVA) compares the means of several populations using random, independent samples drawn from each one. It tests the null hypothesis that all population means are equal against the alternative that at least one mean differs. ANOVA is a parametric test: it assumes that the values in each group are normally distributed and that the groups have equal variances.
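
To make the idea concrete, here is a minimal from-scratch sketch of the F statistic used by one-way ANOVA: the ratio of between-group variability to within-group variability. The function name `one_way_f` and the toy samples are illustrative assumptions and are not part of the case study.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric samples."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)        # mean square between, df = k - 1
    ms_within = ss_within / (n - k)          # mean square within, df = n - k
    return ms_between / ms_within

# Hypothetical toy samples, not taken from the tips data.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(toy_groups)
print(round(f_stat, 4))  # 13.0
```

A large F means the group means are spread out relative to the noise inside each group, which is evidence against equal means.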

## Case Study

Problem: Does the total bill differ between days of the week?

**Import Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.api as sms
from scipy.stats import ttest_1samp, shapiro, levene, ttest_ind, mannwhitneyu, \
pearsonr, spearmanr, kendalltau, f_oneway, kruskal

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

**Load dataset**

In [3]:
df = sns.load_dataset('tips')

In [4]:
df.head()

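| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |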


**Group by day and check the mean total bill**

In [5]:
df.groupby('day').agg({'total_bill': 'mean'})

| day | total_bill |
|---|---|
| Thur | 17.68274 |
| Fri | 17.15158 |
| Sat | 20.44138 |
| Sun | 21.41 |
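
Means alone hide group size and spread, both of which matter for the tests that follow. One possible richer summary is sketched below on a hypothetical toy frame (`toy` is made up for illustration and only mirrors the column names of the tips data).

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "total_bill": [10.0, 20.0, 5.0, 10.0, 30.0, 12.0],
})

# Several statistics per group in one call: size, centre, and spread.
summary = toy.groupby("day")["total_bill"].agg(["count", "mean", "median", "std"])
print(summary)
```

The same `agg` call should work on `df` from the cells above; a group with a single observation gets `NaN` for `std`.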


**The means differ between days, but is the difference statistically significant or just random variation?**

**Step 1: State the hypotheses**

H0: μ1 = μ2 = μ3 = μ4 (the mean total bill is the same on every day)<br>
H1: At least one mean differs from the others.

**Step 2: Assumption Check**

Assumption 1: The values in each group are normally distributed (checked with the Shapiro-Wilk test)<br>
Assumption 2: The group variances are homogeneous (checked with Levene's test)<br>
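
Both checks can be bundled into one helper. The sketch below is an illustration under stated assumptions: `check_anova_assumptions` is a hypothetical name, and the samples are synthetic rather than the tips data. It relies only on `scipy.stats.shapiro` and `scipy.stats.levene`.

```python
import numpy as np
from scipy.stats import shapiro, levene

def check_anova_assumptions(groups, alpha=0.05):
    """groups: dict mapping a label to a 1-D sample. Returns a summary dict."""
    # Shapiro-Wilk per group: H0 is that the sample comes from a normal distribution.
    normality_p = {label: shapiro(values).pvalue for label, values in groups.items()}
    # Levene across all groups: H0 is that the group variances are equal.
    levene_p = levene(*groups.values()).pvalue
    return {
        "normality_p": normality_p,
        "all_normal": all(p >= alpha for p in normality_p.values()),
        "levene_p": levene_p,
        "equal_variances": levene_p >= alpha,
    }

# Hypothetical synthetic samples: two normal groups and one clearly skewed group.
rng = np.random.default_rng(0)
samples = {
    "a": rng.normal(20, 5, 100),
    "b": rng.normal(22, 5, 100),
    "c": rng.exponential(20, 100),
}
report = check_anova_assumptions(samples)
print(report)
```

If `all_normal` is false, a rank-based test such as Kruskal-Wallis is the usual fallback.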

Assumption 1:

H0: Distribution is normal. <br>
H1: The distribution is not normal.

In [18]:
for day in df['day'].unique():
    p_value = shapiro(df.loc[df['day'] == day, 'total_bill'])[1]
    print(day + ' p value: %.5f' % p_value)

Sun p value: 0.00357
Sat p value: 0.00001
Thur p value: 0.00003
Fri p value: 0.04086


All p-values are below 0.05, so we reject H0 for every day: the total bill is not normally distributed in any group. This means we should use a non-parametric test. Checking Assumption 2 is therefore not strictly necessary, but we do it anyway for completeness.

Assumption 2:

H0: Variances are homogeneous. <br>
H1: Variances are not homogeneous.

In [21]:
test_stat, p_value = levene(df.loc[df['day'] == 'Sun', 'total_bill'],
                           df.loc[df['day'] == 'Sat', 'total_bill'],
                           df.loc[df['day'] == 'Thur', 'total_bill'],
                           df.loc[df['day'] == 'Fri', 'total_bill'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.66536
p value: 0.57408


The p-value is greater than 0.05, so we fail to reject H0: there is no evidence that the variances differ between days.

**Parametric Solution**

We will rely on the non-parametric solution, but for illustration we also run the parametric one-way ANOVA.

In [25]:
test_stat, p_value = f_oneway(df.loc[df['day'] == 'Sun', 'total_bill'],
                              df.loc[df['day'] == 'Sat', 'total_bill'],
                              df.loc[df['day'] == 'Thur', 'total_bill'],
                              df.loc[df['day'] == 'Fri', 'total_bill'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 2.76748
p value: 0.04245


The p-value is less than 0.05, so we reject H0 (μ1 = μ2 = μ3 = μ4). Because the normality assumption was violated, this result should be treated with caution.
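
Writing one `df.loc[...]` selection per day becomes repetitive as the number of groups grows. A possible alternative, sketched here on a hypothetical toy frame (`toy` and its values are assumptions that only mirror the column names of the tips data), builds the samples with `groupby` and unpacks them into `f_oneway`.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# One array per day, built in a single pass over the frame.
samples = {day: grp["total_bill"].to_numpy() for day, grp in toy.groupby("day")}

# f_oneway accepts any number of samples as positional arguments.
result = f_oneway(*samples.values())
print(sorted(samples), round(result.statistic, 4), round(result.pvalue, 4))
```

The same pattern should carry over to `kruskal` and `levene`, which also take the samples as positional arguments.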

**Non-Parametric Solution**

We need the non-parametric solution because the distributions are not normal. The Kruskal-Wallis test is the rank-based counterpart of one-way ANOVA.

In [26]:
test_stat, p_value = kruskal(df.loc[df['day'] == 'Sun', 'total_bill'],
                             df.loc[df['day'] == 'Sat', 'total_bill'],
                             df.loc[df['day'] == 'Thur', 'total_bill'],
                             df.loc[df['day'] == 'Fri', 'total_bill'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 10.40308
p value: 0.01543


The p-value is less than 0.05, so we reject H0. For the Kruskal-Wallis test, H0 states that all groups come from the same distribution. There is a statistically significant difference in total bill between the days.
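
To show what the Kruskal-Wallis statistic measures, the sketch below computes H by hand from pooled ranks and compares it with `scipy.stats.kruskal`. The helper `kruskal_h` is a hypothetical name, the toy samples are invented, and the tie correction is deliberately omitted, so the two values agree only when all observations are distinct.

```python
import numpy as np
from scipy.stats import rankdata, kruskal

def kruskal_h(groups):
    """Kruskal-Wallis H without tie correction (assumes all values are distinct)."""
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    ranks = rankdata(np.concatenate(groups))      # ranks 1..n over the pooled data
    # Split the pooled ranks back into their groups and sum each group's ranks.
    rank_sums = [r.sum() for r in np.split(ranks, np.cumsum(sizes)[:-1])]
    weighted = sum(rs ** 2 / m for rs, m in zip(rank_sums, sizes))
    return 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)

# Hypothetical toy samples with no ties.
toy_groups = [[1.2, 3.4, 2.2], [4.1, 5.3, 6.8], [7.7, 9.1, 8.4]]
h_manual = kruskal_h(toy_groups)
h_scipy = kruskal(*toy_groups).statistic
print(round(h_manual, 4), round(h_scipy, 4))
```

Because only ranks enter the formula, the statistic is unaffected by skewed values or outliers, which is why the test does not need the normality assumption.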

**We found a statistically significant difference between the groups. But which days differ?**

Tukey's HSD test compares every pair of days while controlling the family-wise error rate (FWER).

In [27]:
from statsmodels.stats.multicomp import MultiComparison

In [28]:
comprasion = MultiComparison(df['total_bill'], df['day'])

In [29]:
tukey = comprasion.tukeyhsd(alpha=0.05)

In [34]:
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
   Fri    Sat   3.2898 0.4554 -2.4802  9.0598  False
   Fri    Sun   4.2584 0.2373 -1.5859 10.1028  False
   Fri   Thur   0.5312    0.9 -5.4437   6.506  False
   Sat    Sun   0.9686 0.8921 -2.6089  4.5462  False
   Sat   Thur  -2.7586 0.2375 -6.5456  1.0284  False
   Sun   Thur  -3.7273 0.0669 -7.6266  0.1721  False
----------------------------------------------------


**At the 95% confidence level (alpha = 0.05), no pair of days differs significantly. Let's check at the 90% confidence level (alpha = 0.10).**

In [37]:
print(comprasion.tukeyhsd(alpha=0.1))

Multiple Comparison of Means - Tukey HSD, FWER=0.10 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
   Fri    Sat   3.2898 0.4554 -1.8479  8.4275  False
   Fri    Sun   4.2584 0.2373 -0.9455  9.4624  False
   Fri   Thur   0.5312    0.9  -4.789  5.8513  False
   Sat    Sun   0.9686 0.8921 -2.2169  4.1542  False
   Sat   Thur  -2.7586 0.2375 -6.1307  0.6134  False
   Sun   Thur  -3.7273 0.0669 -7.1994 -0.2552   True
----------------------------------------------------


**At the 90% confidence level, the last row shows a significant difference between Sunday and Thursday (reject = True).**
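
Tukey's HSD is itself a parametric procedure, while the Shapiro-Wilk results above rejected normality. One rank-based alternative, swapped in here purely as an illustration and not used in the analysis above, is pairwise Mann-Whitney U tests with a Bonferroni correction. The helper `pairwise_mannwhitney` is a hypothetical name and the samples are synthetic.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

def pairwise_mannwhitney(groups, alpha=0.05):
    """groups: dict label -> sample. Returns rows of (a, b, adjusted p, reject)."""
    pairs = list(combinations(groups, 2))
    rows = []
    for a, b in pairs:
        p = mannwhitneyu(groups[a], groups[b], alternative="two-sided").pvalue
        # Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
        p_adj = min(1.0, p * len(pairs))
        rows.append((a, b, p_adj, p_adj < alpha))
    return rows

# Hypothetical synthetic samples: "high" is shifted far away from the other two.
rng = np.random.default_rng(42)
samples = {
    "low": rng.normal(10, 2, 40),
    "mid": rng.normal(10.5, 2, 40),
    "high": rng.normal(20, 2, 40),
}
rows = pairwise_mannwhitney(samples)
for a, b, p_adj, reject in rows:
    print(f"{a:>4} vs {b:<4}  p-adj={p_adj:.4f}  reject={reject}")
```

Bonferroni is conservative; it is chosen here only because it is simple to verify by hand.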