### ANOVA in Python


In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#### Loading the data
As usual, we start by loading in a dataset of our sample observations. This particular table is of salaries in IT and has 4 columns:

- S - the individual's salary
- X - years of experience
- E - education level (1-Bachelors, 2-Masters, 3-PhD)
- M - management (0-no management, 1-yes management)

In [2]:
df = pd.read_csv("D:\\Programming\\Datasets\\IT_salaries.csv")
df.head()

Unnamed: 0,S,X,E,M
0,13876,1,1,1
1,11608,1,3,0
2,18701,1,3,1
3,11283,1,2,0
4,11767,1,3,0


#### Generating the ANOVA table
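To generate the ANOVA table, you first fit a linear model and then generate the table from the fitted model object. The response (dependent) variable goes on the left of the `~`, and the factors go on the right. Our formula will be written as:

    Response_Column ~ C(factor_col1) + factor_col2 + C(factor_col3) + ... + X

_We indicate categorical variables by wrapping them with C(). Unwrapped columns are treated as numeric._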

In [3]:
formula = "S ~ C(E) + C(M) + X"
lm = ols(formula, df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

                sum_sq    df           F        PR(>F)
C(E)      9.152624e+07   2.0   43.351589  7.672450e-11
C(M)      5.075724e+08   1.0  480.825394  2.901444e-24
X         3.380979e+08   1.0  320.281524  5.546313e-21
Residual  4.328072e+07  41.0         NaN           NaN


#### Interpreting the table
For now, focus on the outermost columns. On the left are the model terms (the factors), and on the right, in `PR(>F)`, is the p-value for each one: the probability of seeing an F-statistic at least this large if the factor had no effect. Values less than 0.05 (or whatever we set α to) indicate rejection of the null hypothesis. In this case, all three factors appear influential. Management has the smallest p-value, followed by years of experience and, finally, education level.
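
The comparison against the significance level can also be done in code. This is a minimal sketch that rebuilds only the `PR(>F)` column from the values printed above rather than refitting the model; `alpha = 0.05` is an assumed threshold.

```python
import pandas as pd

# p-values copied from the ANOVA table printed above
table = pd.DataFrame(
    {"PR(>F)": [7.672450e-11, 2.901444e-24, 5.546313e-21, float("nan")]},
    index=["C(E)", "C(M)", "X", "Residual"],
)

alpha = 0.05  # assumed significance level

# The Residual row has no p-value, so drop it before comparing
p_values = table["PR(>F)"].dropna()

# Keep the terms below alpha, smallest p-value first
significant = p_values[p_values < alpha].sort_values()
print(significant)
```

With a fitted model, the same two lines work directly on the object returned by `sm.stats.anova_lm`.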

## EXERCISE

In [None]:
#Loading dataset
toothgrowth = pd.read_csv("D:\\Programming\\Datasets\\ToothGrowth.csv")
toothgrowth.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


In [5]:
#Generating the anova table
formula = "len ~ C(supp) + dose"
lm = ols(formula, toothgrowth).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df           F        PR(>F)
C(supp)    205.350000   1.0   11.446768  1.300662e-03
dose      2224.304298   1.0  123.988774  6.313519e-16
Residual  1022.555036  57.0         NaN           NaN


- Both p-values are below 0.05, so we reject the null hypothesis for each term.
- Both supplement type and dose have a significant effect on tooth length.
- Dose shows the stronger effect: its F-statistic is larger (about 124 versus 11.4) and its p-value is far smaller.

#### Compare to t-tests
Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [None]:

#splitting the dataset into two samples based on supplement type
oj_sample = toothgrowth[toothgrowth["supp"] == "OJ"]
vc_sample = toothgrowth[toothgrowth["supp"] == "VC"]

print("OJ sample\n", oj_sample.head())
print("VC sample\n", vc_sample.head())

OJ sample
      len supp  dose
30  15.2   OJ   0.5
31  21.5   OJ   0.5
32  17.6   OJ   0.5
33   9.7   OJ   0.5
34  14.5   OJ   0.5
VC sample
     len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5


Now run a t-test between these two groups and print the associated two-sided p-value:

In [17]:
from scipy.stats import ttest_ind

#extracting tooth length for each group
oj_length = oj_sample["len"]
vc_length = vc_sample["len"]

#Performing t_test
t_stat, p_value = ttest_ind(oj_length, vc_length)

print("two sided p_value:", p_value)

two sided p_value: 0.06039337122412849
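
To see how the two approaches relate, note that with exactly two groups a one-way ANOVA and an equal-variance t-test are the same test: the F-statistic equals the squared t-statistic and the p-values match. The sketch below checks this on synthetic samples; the group means, spreads, and seed are arbitrary assumptions, not the ToothGrowth data.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Two synthetic groups (assumption: illustrative values only)
rng = np.random.default_rng(42)
group_a = rng.normal(20.0, 6.0, size=30)
group_b = rng.normal(17.0, 8.0, size=30)

# Student's t-test; ttest_ind assumes equal variances by default
t_stat, t_p = ttest_ind(group_a, group_b)

# One-way ANOVA on the same two groups
f_stat, f_p = f_oneway(group_a, group_b)

print("t squared:", t_stat ** 2, "F:", f_stat)
print("t-test p:", t_p, "ANOVA p:", f_p)
```

In the ToothGrowth exercise the ANOVA model also included `dose`, which plausibly explains why `C(supp)` gets a much smaller p-value there (about 0.0013) than in the standalone t-test (about 0.06): accounting for dose removes variance that the t-test leaves in the error term.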
