# ANOVA using `ModelSpec`


In this lab we illustrate how to run specific ANOVA analyses
using `ModelSpec`.

In [1]:
import numpy as np
import pandas as pd

from statsmodels.api import OLS
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec,
                         derived_feature,
                         summarize)

## The `Hitters` data

We will run these analyses on the `Hitters`
data. We wish to predict a baseball player's `Salary` on the
basis of various statistics associated with performance in the
previous year.

In [2]:
Hitters = load_data('Hitters')
np.isnan(Hitters['Salary']).sum()

59

    
We see that `Salary` is missing for 59 players. The
`dropna()` method of data frames removes all of the rows that have missing
values in any variable (this is the default; see `Hitters.dropna?`).

In [3]:
Hitters = Hitters.dropna()
Hitters.columns

Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
       'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'League', 'Division',
       'PutOuts', 'Assists', 'Errors', 'Salary', 'NewLeague'],
      dtype='object')
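
As a small aside, the default behaviour of `dropna()` can be checked on a toy data frame. The sketch below uses made-up values (not taken from `Hitters`) and contrasts the default with the `subset` argument, which restricts the missing-value check to chosen columns.

```python
import numpy as np
import pandas as pd

# Toy data frame with hypothetical values, for illustration only.
toy = pd.DataFrame({'Hits': [81, 130, np.nan, 87],
                    'Salary': [np.nan, 475.0, 480.0, 500.0]})

# Default: drop every row that has a missing value in any column.
complete_cases = toy.dropna()

# Only require that `Salary` is present.
has_salary = toy.dropna(subset=['Salary'])

print(complete_cases.shape[0], has_salary.shape[0])
```

For `Hitters` the two calls would presumably agree if `Salary` is the only column with missing entries, which the count of 59 above suggests but does not by itself prove.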

## Grouping variables

A look at the [description](https://islp.readthedocs.io/en/latest/datasets/Hitters.html) of the data shows
that it contains both career and 1986 offensive statistics, as well as some defensive statistics.

Let's group the variables into 1986 offensive statistics, career offensive statistics, and defensive statistics.
The remaining categorical variables (`Division`, `League` and `NewLeague`) form a fourth group, `confounders`.

In [4]:
offense_1986 = derived_feature(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks'],
                               name='offense_1986')
offense_career = derived_feature(['CAtBat', 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks'],
                                 name='offense_career')
defense_1986 = derived_feature(['PutOuts', 'Assists', 'Errors'],
                               name='defense_1986')
confounders = derived_feature(['Division', 'League', 'NewLeague'],
                              name='confounders')

We will start with a sequential ANOVA, in which the terms are added one at a time in the order given to `ModelSpec`.

In [5]:
design = ModelSpec([confounders, offense_1986, defense_1986, offense_career]).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)

Before running any ANOVA, we fit the model specified by `design` by ordinary least squares
and summarize its coefficients.

In [6]:
M = OLS(Y, X).fit()
summarize(M)

| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 148.2187 | 73.595 | 2.014 | 0.045 |
| Division[W] | -116.0404 | 40.188 | -2.887 | 0.004 |
| League[N] | 63.7503 | 79.006 | 0.807 | 0.421 |
| NewLeague[N] | -24.3989 | 78.843 | -0.309 | 0.757 |
| AtBat | -1.9509 | 0.624 | -3.125 | 0.002 |
| Hits | 7.4395 | 2.363 | 3.148 | 0.002 |
| HmRun | 4.3449 | 6.19 | 0.702 | 0.483 |
| Runs | -2.3312 | 2.971 | -0.785 | 0.433 |
| RBI | -1.067 | 2.595 | -0.411 | 0.681 |
| Walks | 6.2196 | 1.825 | 3.409 | 0.001 |


We now produce the sequential, or Type I, ANOVA results. This builds up the model one term at a time and compares
each pair of successive models.

In [7]:
anova_lm(*[OLS(Y, D).fit() for D in design.build_sequence(Hitters, anova_type='sequential')])

| | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 262.0 | 53319110.0 | 0.0 | | | |
| 1 | 259.0 | 51312630.0 | 3.0 | 2006478.0 | 6.741147 | 0.0002144265 |
| 2 | 253.0 | 35898420.0 | 6.0 | 15414220.0 | 25.89351 | 6.063309e-24 |
| 3 | 250.0 | 34718820.0 | 3.0 | 1179602.0 | 3.963099 | 0.008730527 |
| 4 | 244.0 | 24208570.0 | 6.0 | 10510250.0 | 17.655596 | 5.701196e-17 |
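
The `F` column of the sequential table can be reproduced by hand, which makes clear what is being compared. The sketch below copies the rounded values printed above and relies on the documented default of `anova_lm`, where the error variance (`scale`) is estimated from the largest model, here the last row. The variable names are ours, chosen for illustration.

```python
# Values copied from the sequential ANOVA table (rounded as printed).
ssr_full, df_resid_full = 24208570.0, 244.0   # last row: largest model
ss_diff = [2006478.0, 15414220.0, 1179602.0, 10510250.0]
df_diff = [3.0, 6.0, 3.0, 6.0]

# Error variance estimated from the largest model.
scale = ssr_full / df_resid_full

# F = (extra sum of squares per added degree of freedom) / scale
F = [ss / df / scale for ss, df in zip(ss_diff, df_diff)]
print([round(f, 3) for f in F])
```

Up to rounding of the printed inputs, these values should match the `F` column. Note that every row shares the same denominator, so an early term is judged against the residual variance of the final model rather than that of the model in which it first appears.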


We can similarly compute the Type II ANOVA results, which drop each term in turn and compare the reduced model to the full model.

In [8]:
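D_full = design.transform(Hitters)
OLS_full = OLS(Y, D_full).fit()
dfs = []
# Each design matrix in the sequence omits one term of the full model.
for d in design.build_sequence(Hitters, anova_type='drop'):
    # Compare the reduced model to the full one; keep only the comparison row.
    dfs.append(anova_lm(OLS(Y, d).fit(), OLS_full).iloc[1:])
df = pd.concat(dfs)
df.index = design.names
df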

| | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| intercept | 244.0 | 24208570.0 | 1.0 | 402425.4 | 4.056076 | 0.04511037 |
| confounders | 244.0 | 24208570.0 | 3.0 | 966173.8 | 3.246046 | 0.02261572 |
| offense_1986 | 244.0 | 24208570.0 | 6.0 | 3097572.0 | 5.203444 | 4.648586e-05 |
| defense_1986 | 244.0 | 24208570.0 | 3.0 | 1467933.0 | 4.931803 | 0.002415732 |
| offense_career | 244.0 | 24208570.0 | 6.0 | 10510250.0 | 17.655596 | 5.701196e-17 |
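
The drop-one-term comparison can also be sketched with plain NumPy, without `ModelSpec` or `anova_lm`. The example below uses synthetic data and a hypothetical two-column group; it is a minimal illustration of the F test for dropping a group of columns, not a reimplementation of `build_sequence`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic design: an intercept, a group of two columns, and one other column.
intercept = np.ones((n, 1))
group = rng.normal(size=(n, 2))
other = rng.normal(size=(n, 1))
y = 1.0 + group @ np.array([2.0, -1.0]) + 0.5 * other[:, 0] + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

X_full = np.hstack([intercept, group, other])
X_drop = np.hstack([intercept, other])   # full model without `group`

rss_full, rss_drop = rss(X_full, y), rss(X_drop, y)
df_resid = n - X_full.shape[1]                 # residual df of the full model
df_diff = X_full.shape[1] - X_drop.shape[1]    # number of columns dropped

# F statistic for dropping `group`, scaled by the full model's error variance.
F_group = ((rss_drop - rss_full) / df_diff) / (rss_full / df_resid)
print(df_diff, F_group)
```

In the `Hitters` tables, `offense_career` has the same `ss_diff` and `F` in both analyses. This is expected: it is the last term added in the sequential analysis, so adding it last and dropping it from the full model are the same comparison.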
