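# Analysis of Variance and Model Performance
## 1) Analysis of Variance (ANOVA)
- RSS alone is a limited performance measure: it depends on the scale of the data
- ANOVA instead uses the relationship between the variance of the dependent variable and the variance of the independent variables
    - Use 1) Comparing two different models
    - Use 2) When an independent variable is categorical, analyzing the effect of each category value


- Variation of y: TSS (total sum of squares)
    - Not exactly the variance: it is the sum of squared deviations from the mean, without dividing by the sample size
    - Measures the range over which the dependent variable moves


- Variation of $\hat{y}$: ESS (explained sum of squares)


- Variation of the residual e: RSS (residual sum of squares)
    - Measures the size of the error
    - If the regression model includes a constant term and is fit by OLS, the mean of the residuals e is 0

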
$$
\begin{aligned}
\bar{e} &= \bar{y} - \bar{\hat{y}} = 0 \\
\bar{y} &= \bar{\hat{y}}
\end{aligned}
$$


$$
TSS = ESS + RSS
$$

- Conclusion
    - ESS (variation of the model predictions) <= TSS (variation of the dependent variable)
    - The model performs well when ESS is close to TSS
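The decomposition can be checked by hand. The following is a minimal sketch using only NumPy on synthetic data; the slope, intercept, noise level, and seed are arbitrary assumptions chosen for illustration, not values from these notes.

```python
import numpy as np

# Synthetic data: all constants here are illustrative assumptions
rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=N)

# OLS with a constant term: the design matrix is [1, x]
X = np.column_stack([np.ones(N), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w
e = y - y_hat

# Sums of squares, each measured around the corresponding mean
tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y_hat.mean()) ** 2)
rss = np.sum(e ** 2)

print("mean of residuals:", e.mean())
print("TSS:", tss)
print("ESS + RSS:", ess + rss)
print("ESS / TSS:", ess / tss)
```

Because the design matrix contains a column of ones, the residuals are orthogonal to it, which is what makes their mean vanish and the decomposition hold up to floating-point error.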

In [7]:
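import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression

# Synthetic one-feature regression data (no random_state, so each run differs)
X0, y, coef = make_regression(n_samples=100, n_features=1, noise=30, coef=True)
df = pd.DataFrame({"X": X0.reshape(100), "Y": y})

# OLS fit; the formula interface adds the constant term automatically
model = sm.OLS.from_formula("Y ~ X", data=df)
result = model.fit()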


Exercise: TSS = RSS + ESS

In [8]:
# result.ess is the explained sum of squares; result.mse_model is ESS / df_model
# and equals ESS here only because df_model = 1.
# With a constant term, TSS must be the centered version (deviations from the mean).
print(f"ESS = {result.ess:.4f}")
print(f"RSS = {result.ssr:.4f}")
print(f"ESS + RSS = {result.ess + result.ssr:.4f}")
print(f"TSS = {result.centered_tss:.4f}")
print(f"R squared = {result.rsquared:.4f}")
print(f"ESS / TSS = {result.ess / result.centered_tss:.4f}")

ESS = 250718.5844
RSS = 93984.2311
ESS + RSS = 344702.8155
TSS = 344702.8155
R squared = 0.7273
ESS / TSS = 0.7273


Exercise: y = $\hat{y}$ + e

In [9]:
pd.DataFrame({'y':y, 'y hat': result.fittedvalues, 
              'resid': result.resid, 
              'sum' : result.fittedvalues + result.resid}).tail(3)

            y      y hat      resid        sum
97 -49.708137 -36.868776 -12.839360 -49.708137
98 -62.752708 -50.576745 -12.175963 -62.752708
99  42.131043  41.684536   0.446507  42.131043


## 2) Coefficient of Determination
$$
R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}} = \dfrac{\text{ESS}}{\text{TSS}}
$$

- The square of the sample correlation coefficient r between y and $\hat{y}$ equals the coefficient of determination $R^2$

In [29]:
# The off-diagonal entry is the sample correlation r between y and y hat
print(pd.DataFrame({'y hat': result.fittedvalues, 'y': y}).corr())
print()
print("R squared: ", result.rsquared)

          y hat         y
y hat  1.000000  0.852846
y      0.852846  1.000000

R squared:  0.7273470744581731


## 3) ANOVA Table
- Sum of squares -> used to compute $R^2$
  - TSS
  - ESS
  - RSS
- Mean squares (each sum of squares divided by its degrees of freedom) -> used to compute the F-test statistic

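## 4) Relationship Between the Regression F-test and ANOVA
- The ANOVA results provide the test statistic needed for the F-test
  - $\hat{w}$: under the null hypothesis, a sample from a normal distribution with expected value 0
  - $\hat{y}$: normally distributed, since $\hat{y} = \hat{w}^Tx$ is a linear combination
  - e: normally distributed, since $e = M\epsilon$ is a linear combination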
  

- $\therefore$ The ratio of ESS to RSS, each divided by its degrees of freedom, follows an F-distribution

$$
\dfrac{ESS}{K-1} \div \dfrac{RSS}{N-K} \sim F(K-1, N-K)
$$

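- cf. F-distribution: the distribution of the ratio of two independent chi-squared random variables, each divided by its degrees of freedom
- cf. Chi-squared distribution: the distribution of the sum of squares of samples of a random variable X that follows a standard normal distribution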
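The two definitions can be illustrated by simulation. This is a rough sketch, assuming arbitrary degrees of freedom `d1 = 5` and `d2 = 20`, an arbitrary sample count, and a fixed seed; it only compares sample means with the known theoretical means (d for a chi-squared variable with d degrees of freedom, and d2 / (d2 - 2) for F(d1, d2)).

```python
import numpy as np

# Illustrative settings (assumptions, not values from the notes)
rng = np.random.default_rng(0)
n_draws, d1, d2 = 100_000, 5, 20

# Chi-squared samples: sum of squares of d independent standard normal draws
chi2_d1 = np.sum(rng.standard_normal((n_draws, d1)) ** 2, axis=1)
chi2_d2 = np.sum(rng.standard_normal((n_draws, d2)) ** 2, axis=1)

# F samples: ratio of two independent chi-squared variables,
# each divided by its degrees of freedom
f_samples = (chi2_d1 / d1) / (chi2_d2 / d2)

print("chi-squared mean:", chi2_d1.mean(), "theory:", d1)
print("F mean:", f_samples.mean(), "theory:", d2 / (d2 - 2))
```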

Exercise: ANOVA table

In [13]:
sm.stats.anova_lm(result)

            df         sum_sq        mean_sq           F        PR(>F)
X          1.0  250718.584402  250718.584402  261.431317  2.077939e-29
Residual  98.0   93984.231085     959.022766         NaN           NaN
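The F value in this table can be recomputed from the sums of squares alone. The sketch below copies `sum_sq` and `df` from the table above and assumes SciPy is available for the F-distribution tail probability; since the inputs are rounded to six decimals, the results should agree with the table only approximately.

```python
from scipy import stats

# Values copied from the ANOVA table above (N = 100 samples, K = 2 parameters)
ess, rss = 250718.584402, 93984.231085
N, K = 100, 2

# Mean squares: each sum of squares divided by its degrees of freedom
ms_model = ess / (K - 1)
ms_resid = rss / (N - K)

# F statistic and its p-value (upper tail of F(K-1, N-K))
F = ms_model / ms_resid
p_value = stats.f.sf(F, K - 1, N - K)

print("mean_sq (model):   ", ms_model)
print("mean_sq (residual):", ms_resid)
print("F:", F)
print("PR(>F):", p_value)
```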


## 4.1) Using the F-test: Model Comparison
- Used to drop variables that do not contribute
- Full model ($x_1, x_2, x_3$) vs. reduced model ($x_1$)
- $H_0: w_2 = w_3 = 0$
- code
  - `sm.stats.anova_lm(reduced_model.fit(), full_model.fit())`


In [50]:
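# Test whether INDUS and AGE have no effect on the price (MEDV)
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston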

boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston["MEDV"] = boston.target

full_model_f = "MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS"
reduced_model_f = "MEDV ~ CRIM + ZN + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS"

full_model = sm.OLS.from_formula(full_model_f, data=df_boston)
reduced_model = sm.OLS.from_formula(reduced_model_f, data=df_boston)

print(sm.stats.anova_lm(reduced_model.fit(), full_model.fit()))
print()
print("p-value = 0.94, so we fail to reject the null hypothesis: INDUS and AGE are not needed")

   df_resid           ssr  df_diff   ss_diff         F    Pr(>F)
0     494.0  11081.363952      0.0       NaN       NaN       NaN
1     492.0  11078.784578      2.0  2.579374  0.057274  0.944342

p-value = 0.94, so we fail to reject the null hypothesis: INDUS and AGE are not needed
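The comparison statistic is the drop in RSS per removed variable, divided by the residual mean square of the full model. As a sketch, the numbers printed above can be plugged in directly; the variable names are hypothetical, and SciPy is assumed for the tail probability.

```python
from scipy import stats

# Values copied from the anova_lm output above
rss_reduced, df_reduced = 11081.363952, 494
rss_full, df_full = 11078.784578, 492

# Number of coefficients set to zero under H0 (here INDUS and AGE)
df_diff = df_reduced - df_full

# F = (RSS reduction per dropped variable) / (full-model residual mean square)
F = ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value = stats.f.sf(F, df_diff, df_full)

print("ss_diff:", rss_reduced - rss_full)
print("F:", F)
print("Pr(>F):", p_value)
```

A small F means the two extra variables barely reduce the residual sum of squares, which is why the p-value is large.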


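## 4.2) Using the F-test: Comparing Variable Importance
- Compare the full model with a reduced model that drops variable z -> measures the importance of z
- The resulting p-value is identical to that of the single-coefficient t-test
- code
  - `sm.stats.anova_lm(full_model.fit(), typ=2)`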

In [52]:
sm.stats.anova_lm(full_model.fit(), typ=2)

              sum_sq   df          F        PR(>F)
CRIM      243.219699  1.0  10.801193    0.00108681
ZN        257.492979  1.0  11.435058  0.0007781097
INDUS       2.516668  1.0   0.111763     0.7382881
NOX       487.155674  1.0  21.634196  4.245644e-06
RM       1871.324082  1.0  83.104012  1.979441e-18
AGE         0.061834  1.0   0.002746     0.9582293
DIS      1232.412493  1.0  54.730457  6.013491e-13
RAD       479.153926  1.0  21.278844  5.070529e-06
TAX       242.257440  1.0  10.758460   0.001111637
PTRATIO  1194.233533  1.0  53.034960  1.308835e-12


In [55]:
full_model.fit().summary()

Dep. Variable:     MEDV              R-squared:           0.741
Model:             OLS               Adj. R-squared:      0.734
Method:            Least Squares     F-statistic:         108.1
Date:              Sun, 11 Nov 2018  Prob (F-statistic):  6.72e-135
Time:              11:58:40          Log-Likelihood:      -1498.8
No. Observations:  506               AIC:                 3026.0
Df Residuals:      492               BIC:                 3085.0
Df Model:          13
Covariance Type:   nonrobust

               coef  std err       t  P>|t|   [0.025   0.975]
Intercept   36.4595    5.103   7.144  0.000   26.432   46.487
CRIM        -0.1080    0.033  -3.287  0.001   -0.173   -0.043
ZN           0.0464    0.014   3.382  0.001    0.019    0.073
INDUS        0.0206    0.061   0.334  0.738   -0.100    0.141
NOX        -17.7666    3.820  -4.651  0.000  -25.272  -10.262
RM           3.8099    0.418   9.116  0.000    2.989    4.631
AGE          0.0007    0.013   0.052  0.958   -0.025    0.027
DIS         -1.4756    0.199  -7.398  0.000   -1.867   -1.084
RAD          0.3060    0.066   4.613  0.000    0.176    0.436

Omnibus:        178.041  Durbin-Watson:     1.078
Prob(Omnibus):  0.0      Jarque-Bera (JB):  783.126
Skew:           1.521    Prob(JB):          8.84e-171
Kurtosis:       8.281    Cond. No.          15100.0
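The claim that the per-variable F-test matches the single-coefficient t-test can be checked against the two outputs above, since for one restriction F = t². The sketch below copies the t values from the summary and the F values from the `typ=2` ANOVA table; because the summary rounds t to three decimals, the comparison uses a tolerance of roughly 2·|t|·0.0005, which is an assumption about the rounding error rather than part of the original notes.

```python
# (t, F) pairs copied from the summary and the typ=2 ANOVA outputs above
reported = {
    "CRIM": (-3.287, 10.801193),
    "ZN": (3.382, 11.435058),
    "INDUS": (0.334, 0.111763),
    "NOX": (-4.651, 21.634196),
    "RM": (9.116, 83.104012),
    "AGE": (0.052, 0.002746),
    "DIS": (-7.398, 54.730457),
    "RAD": (4.613, 21.278844),
}

matches = {}
for name, (t, F) in reported.items():
    # t is rounded to 3 decimals, so t**2 can be off by about 2*|t|*0.0005
    tol = 2 * abs(t) * 0.0005 + 1e-6
    matches[name] = abs(t ** 2 - F) <= tol
    print(f"{name:6s} t^2 = {t ** 2:10.6f}  F = {F:10.6f}  match = {matches[name]}")
```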
