# StatsModels
- R-style formula interface: models are specified with a DataFrame and a formula string
## patsy
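- Preprocessing package for regression analysis: transforms a DataFrame (encoding, transformations)
- `dmatrix`: builds an experiment design matrix
    - `data_transformed = dmatrix(formula, data=df)`
    - Default: an intercept (constant term) column is added
    - Formula operators
        - `+`/`-`: add/remove an explanatory variable
        - `-1`, `+0`: remove the intercept
        - `:`: interaction (product of the two variables)
        - `a*b` = `a + b + a:b`
        - `a/b` = `a + a:b`
        - `~`: separates the dependent variable (left) from the independent variables (right)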
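
As a rough illustration of what these operators expand to for numerical variables, the columns produced by `x1 * x2` and `x1 / x2` can be assembled by hand with NumPy. This is only a sketch of the resulting columns, not how patsy builds them internally. The helper names `expand_star` and `expand_slash` are hypothetical, and the values are the first three rows of the demo data used in this section.

```python
import numpy as np

# First three rows of x1 and x2 from the demo data in this section
x1 = np.array([1.764052, 0.400157, 0.978738])
x2 = np.array([-0.977278, 0.950088, -0.151357])

def expand_star(a, b):
    """Columns for `a * b`: Intercept, a, b, a:b (hypothetical helper)."""
    return np.column_stack([np.ones_like(a), a, b, a * b])

def expand_slash(a, b):
    """Columns for `a / b`: Intercept, a, a:b (hypothetical helper)."""
    return np.column_stack([np.ones_like(a), a, a * b])

star = expand_star(x1, x2)    # 4 columns
slash = expand_slash(x1, x2)  # 3 columns: same as star without the b column
print(star.shape, slash.shape)
print(np.round(star[:, 3], 5))  # the x1:x2 interaction column
```

Appending `- 1` to the formula corresponds to dropping the first (all ones) column.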

#### Operator usage examples

In [11]:
import numpy as np
import pandas as pd
from patsy import balanced, build_design_matrices, demo_data, dmatrix
df = pd.DataFrame(demo_data('x1', 'x2', 'y'))
df

         x1        x2         y
0  1.764052 -0.977278  0.144044
1  0.400157  0.950088  1.454274
2  0.978738 -0.151357  0.761038
3  2.240893 -0.103219  0.121675
4  1.867558  0.410599  0.443863


In [12]:
# Useful for trying out one variable at a time
dmatrix("x1", df)

DesignMatrix with shape (5, 2)
  Intercept       x1
          1  1.76405
          1  0.40016
          1  0.97874
          1  2.24089
          1  1.86756
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)

In [18]:
dmatrix("x1 + x2 - 1 + x1:x2", df)

DesignMatrix with shape (5, 3)
       x1        x2     x1:x2
  1.76405  -0.97728  -1.72397
  0.40016   0.95009   0.38018
  0.97874  -0.15136  -0.14814
  2.24089  -0.10322  -0.23130
  1.86756   0.41060   0.76682
  Terms:
    'x1' (column 0)
    'x2' (column 1)
    'x1:x2' (column 2)

In [19]:
dmatrix("x1 * x2 - 1", df)

DesignMatrix with shape (5, 3)
       x1        x2     x1:x2
  1.76405  -0.97728  -1.72397
  0.40016   0.95009   0.38018
  0.97874  -0.15136  -0.14814
  2.24089  -0.10322  -0.23130
  1.86756   0.41060   0.76682
  Terms:
    'x1' (column 0)
    'x2' (column 1)
    'x1:x2' (column 2)

In [20]:
dmatrix("x1 / x2 - 1", df)

DesignMatrix with shape (5, 2)
       x1     x1:x2
  1.76405  -1.72397
  0.40016   0.38018
  0.97874  -0.14814
  2.24089  -0.23130
  1.86756   0.76682
  Terms:
    'x1' (column 0)
    'x1:x2' (column 1)

#### Mathematical operations (transforms)

In [21]:
dmatrix("x1 + np.log(np.abs(x2))", df)

DesignMatrix with shape (5, 3)
  Intercept       x1  np.log(np.abs(x2))
          1  1.76405            -0.02298
          1  0.40016            -0.05120
          1  0.97874            -1.88811
          1  2.24089            -2.27090
          1  1.86756            -0.89014
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'np.log(np.abs(x2))' (column 2)

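#### Stateful transforms
- Quantities computed during the transform, such as the mean, are stored as internal state
     - `center(x)`: subtracts the mean
     - `standardize(x)`: subtracts the mean and scales by the standard deviation
     - `scale(x)`: alias of `standardize(x)`
     - The state is stored in the `dm.design_info` attribute
- Training on mean-centered data often gives better performance
- Test data must then be preprocessed with the mean learned from the training data, which is where the stored state is useful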
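
The idea behind a stateful transform can be sketched without patsy: learn the mean once from the training data, keep it as state, and reuse it on new data. The class `Center` below is a hypothetical stand-in written for illustration, not patsy's implementation. The sample values are the `x1` column of the demo data.

```python
import numpy as np

class Center:
    """Hypothetical minimal stateful transform: remembers the training mean."""

    def __init__(self):
        self.mean_ = None  # state learned from the training data

    def fit(self, x):
        self.mean_ = float(np.mean(x))
        return self

    def transform(self, x):
        if self.mean_ is None:
            raise RuntimeError("call fit() before transform()")
        return np.asarray(x, dtype=float) - self.mean_

x_train = np.array([1.764052, 0.400157, 0.978738, 2.240893, 1.867558])
x_new = x_train * 10  # new data on a different scale

center = Center().fit(x_train)
train_centered = center.transform(x_train)  # sums to zero up to rounding
new_centered = center.transform(x_new)      # shifted by the training mean, not by its own mean
print(np.round(new_centered, 5))
```

Centering `x_new` with its own mean instead would silently change the meaning of the model input, which is the mistake the stored state guards against.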
    

In [26]:
# Mean-center x1
dm = dmatrix("center(x1) + 0", df)
dm

DesignMatrix with shape (5, 1)
  center(x1)
     0.31377
    -1.05012
    -0.47154
     0.79061
     0.41728
  Terms:
    'center(x1)' (column 0)

In [28]:
df.x1 - np.mean(df.x1)

0    0.313773
1   -1.050123
2   -0.471542
3    0.790613
4    0.417278
Name: x1, dtype: float64

In [30]:
dm.design_info

DesignInfo(['center(x1)'],
           factor_infos={EvalFactor('center(x1)'): FactorInfo(factor=EvalFactor('center(x1)'),
            type='numerical',
            state=<factor state>,
            num_columns=1)},
           term_codings=OrderedDict([(Term([EvalFactor('center(x1)')]),
  [SubtermInfo(factors=(EvalFactor('center(x1)'),),
               contrast_matrices={},
               num_columns=1)])]))

In [31]:
# Reusing the stored centering state on new data
# dm: the mean-centered design matrix built above
df_new = df.copy()
df_new["x1"] = df_new["x1"] * 10
df_new

          x1        x2         y
0  17.640523 -0.977278  0.144044
1   4.001572  0.950088  1.454274
2   9.787380 -0.151357  0.761038
3  22.408932 -0.103219  0.121675
4  18.675580  0.410599  0.443863


In [32]:
build_design_matrices([dm.design_info], df_new)

[DesignMatrix with shape (5, 1)
   center(x1)
     16.19024
      2.55129
      8.33710
     20.95865
     17.22530
   Terms:
     'center(x1)' (column 0)]

In [36]:
# Same result as above
# Row 0: raw value 17.64 -> subtract the training mean (about 1.45) -> model input 16.19
df_new.x1 - np.mean(df.x1)

0    16.190244
1     2.551292
2     8.337100
3    20.958652
4    17.225300
Name: x1, dtype: float64

In [34]:
# The new data must be centered with the training mean, not with its own mean,
# so the call below gives the wrong result
dmatrix("center(x1) + 0", df_new)

DesignMatrix with shape (5, 1)
  center(x1)
     3.13773
   -10.50123
    -4.71542
     7.90613
     4.17278
  Terms:
    'center(x1)' (column 0)

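#### Protecting expressions (arithmetic operators between variables)
- `I()`: evaluates its contents as ordinary arithmetic instead of formula operators; also used for polynomial regression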

In [37]:
dmatrix("I(x1 + x2)", df)

DesignMatrix with shape (5, 2)
  Intercept  I(x1 + x2)
          1     0.78677
          1     1.35025
          1     0.82738
          1     2.13767
          1     2.27816
  Terms:
    'Intercept' (column 0)
    'I(x1 + x2)' (column 1)

In [38]:
# Compare with the cell above: without I(), + adds x1 and x2 as separate terms
dmatrix("x1 + x2", df)

DesignMatrix with shape (5, 3)
  Intercept       x1        x2
          1  1.76405  -0.97728
          1  0.40016   0.95009
          1  0.97874  -0.15136
          1  2.24089  -0.10322
          1  1.86756   0.41060
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'x2' (column 2)

## Polynomial regression

In [45]:
dmatrix("x1 + I(x1**2) + I(x1**3)", df)

DesignMatrix with shape (5, 4)
  Intercept       x1  I(x1 ** 2)  I(x1 ** 3)
          1  1.76405     3.11188     5.48952
          1  0.40016     0.16013     0.06408
          1  0.97874     0.95793     0.93756
          1  2.24089     5.02160    11.25287
          1  1.86756     3.48777     6.51362
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'I(x1 ** 2)' (column 2)
    'I(x1 ** 3)' (column 3)
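
To show how such polynomial columns are used, the sketch below builds the same `1, x, x**2, x**3` layout with NumPy and fits it by ordinary least squares. The data and the coefficients in `true_coef` are made up for this illustration and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
true_coef = np.array([1.0, -2.0, 0.5, 3.0])  # assumed coefficients for 1, x, x^2, x^3

# Columns 1, x, x**2, x**3: the layout of "x1 + I(x1**2) + I(x1**3)" plus the intercept
X = np.vander(x, N=4, increasing=True)
y = X @ true_coef + rng.normal(scale=0.1, size=x.size)

# Least-squares estimate of the polynomial coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))
```

The model stays linear in its coefficients even though it is cubic in `x`, which is why a linear least-squares solver is enough.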

In [55]:
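# When the degree is too high to write the terms out by hand:
# Poly treats x1 as a categorical variable with ordered levels and builds orthogonal polynomial contrasts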
dmatrix("C(x1, Poly)", balanced(x1=10), return_type="dataframe")

,Intercept,"C(x1, Poly).Linear","C(x1, Poly).Quadratic","C(x1, Poly).Cubic","C(x1, Poly)^4","C(x1, Poly)^5","C(x1, Poly)^6","C(x1, Poly)^7","C(x1, Poly)^8","C(x1, Poly)^9"
0,1.0,-0.495434,0.522233,-0.453425,0.336581,-0.214834,0.116775,-0.052694,0.018699,-0.004535
1,1.0,-0.275241,-0.087039,0.377854,-0.317882,-0.035806,0.389249,-0.503518,0.373979,-0.163266
2,1.0,-0.165145,-0.261116,0.334671,0.056097,-0.393863,0.23355,0.245904,-0.52357,0.380953
3,1.0,-0.055048,-0.348155,0.12955,0.336581,-0.214834,-0.3114,0.327872,0.261785,-0.57143
4,1.0,0.055048,-0.348155,-0.12955,0.336581,0.214834,-0.3114,-0.327872,0.261785,0.57143
5,1.0,0.165145,-0.261116,-0.334671,0.056097,0.393863,0.23355,-0.245904,-0.52357,-0.380953
6,1.0,0.275241,-0.087039,-0.377854,-0.317882,0.035806,0.389249,0.503518,0.373979,0.163266
7,1.0,0.385337,0.174078,-0.151142,-0.411377,-0.50128,-0.428174,-0.275179,-0.130893,-0.040816
8,1.0,0.495434,0.522233,0.453425,0.336581,0.214834,0.116775,0.052694,0.018699,0.004535
9,1.0,-0.385337,0.174078,0.151142,-0.411377,0.50128,-0.428174,0.275179,-0.130893,0.040816


## `OLS.from_formula`
- Defines the model with a formula string

In [71]:
import statsmodels.api as sm
x1 = np.random.rand(20)
x2 = np.random.randn(20)
y = 2 * x1 + 3 * x2 + np.random.rand(20)
df0 = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

In [72]:
# Build the response and the design DataFrame by hand
dfy = df0['y']
dfx = sm.add_constant(df0[['x1', 'x2']])
model = sm.OLS(dfy, dfx)
print(model.fit().summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     1972.
Date:                Thu, 08 Nov 2018   Prob (F-statistic):           7.53e-21
Time:                        08:24:22   Log-Likelihood:                 1.2918
No. Observations:                  20   AIC:                             3.416
Df Residuals:                      17   BIC:                             6.404
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5494      0.110      5.004      0.0

In [76]:
# Define the same model with a formula
model2 = sm.OLS.from_formula("y ~ x1 + x2", data=df0)
print(model2.fit().summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     1972.
Date:                Thu, 08 Nov 2018   Prob (F-statistic):           7.53e-21
Time:                        09:17:55   Log-Likelihood:                 1.2918
No. Observations:                  20   AIC:                             3.416
Df Residuals:                      17   BIC:                             6.404
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5494      0.110      5.004      0.0
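
Both models above fit the same linear model, so their estimates match. As a sketch of what the fit computes, the coefficients can also be obtained with NumPy's least-squares solver on a design matrix with an explicit constant column. The seed and the use of `default_rng` are assumptions made for reproducibility; the notebook did not fix a seed, so the numbers will differ from the summaries above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, not used in the notebook
x1 = rng.random(20)
x2 = rng.standard_normal(20)
y = 2 * x1 + 3 * x2 + rng.random(20)  # uniform noise with mean 0.5

# Design matrix with an explicit constant column,
# playing the role of sm.add_constant or the formula's Intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: minimize ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = beta
print(f"Intercept={intercept:.3f}, x1={b1:.3f}, x2={b2:.3f}")
```

Because the noise is uniform on [0, 1), its mean of 0.5 is absorbed into the intercept, which is consistent with the estimate of about 0.55 in the summaries.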