### Multiple linear regression

This notebook performs multiple linear regression with the statsmodels library.

statsmodels is a Python module that provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data.

Official site:  
https://www.statsmodels.org/stable/index.html  


**Installation**: `conda install statsmodels`  
**Update**: `conda update statsmodels`

### 1. Import libraries



In [1]:
import numpy as np

In [2]:
import statsmodels
# Print the installed version
print("Statsmodels version: {}".format(statsmodels.__version__))

Statsmodels version: 0.8.0


#### For interactive use

In [3]:
import statsmodels.api as sm



#### Formula API

In [4]:
import statsmodels.formula.api as smf
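
The formula API imported here is not used in the remaining steps of this notebook. As a minimal sketch of how it could be applied, the example below fits a model of the same form from a formula string. The DataFrame, its column names (`x1`, `x2`, `y`), and the noise-free synthetic data are assumptions made for illustration only. With a formula, the intercept term is included automatically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical, noise-free data following y = 0.25 + 0.1*x1 + 0.5*x2
x1 = np.linspace(0.0, 10.0, 50)
x2 = x1 ** 2
df = pd.DataFrame({"x1": x1, "x2": x2, "y": 0.25 + 0.1 * x1 + 0.5 * x2})

# The formula names the DataFrame columns; an intercept is added by default
formula_results = smf.ols("y ~ x1 + x2", data=df).fit()
print(formula_results.params)  # indexed by "Intercept", "x1", "x2"
```

The array-based interface used in the rest of this notebook requires adding the constant column by hand, which is why step 4 exists.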

### 2. Read data

Read the file "two_obs_var.csv" in the data folder with NumPy's `loadtxt()` function.

Part of the file is shown below.

``` 
# X1, X2, Y generated from y = 0.25 + 0.1*x1 + 0.5*x2 + epsilon
   0.000,   0.000,   0.323
   0.101,   0.010,   0.338
   0.202,   0.041,   0.364
   0.303,   0.092,   0.399
   ...
   ...
```
Before loading, check the basic properties of the data file:

1. How many rows does it have?  
2. How many columns does it have?  
3. Does it contain comment lines, i.e., lines that start with a symbol such as "#"?  
4. If there are no comment lines, is there a header line describing each column?  
5. What delimiter does the file use?  
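
These checks can be done programmatically. The sketch below works on a hypothetical in-memory sample that copies the first rows of the excerpt shown earlier, so it does not depend on the actual file; the variable names are illustrative.

```python
import io
import numpy as np

# Hypothetical in-memory sample in the same layout as the file excerpt
sample = """# X1, X2, Y
   0.000,   0.000,   0.323
   0.101,   0.010,   0.338
   0.202,   0.041,   0.364
"""

lines = sample.splitlines()
comment_lines = [ln for ln in lines if ln.lstrip().startswith("#")]
data_lines = [ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]

n_rows = len(data_lines)                 # number of data rows
n_cols = len(data_lines[0].split(","))   # number of columns, assuming "," as delimiter
print(n_rows, n_cols, len(comment_lines))

# loadtxt skips lines starting with the comment symbol and splits on the delimiter
arr = np.loadtxt(io.StringIO(sample), comments="#", delimiter=",")
print(arr.shape)
```

For a real file, the same logic applies after reading its lines with `open(filename)`.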


In [5]:
filename = 'data/two_obs_var.csv'
data = np.loadtxt(filename, comments="#", delimiter=',') # variable "data" is numpy array
print(type(data))

<class 'numpy.ndarray'>


### 3. Independent and dependent variables
Separate the variable *data* into independent and dependent variables.


1. Columns 1 and 2: the two independent variables, x1 and x2.  
2. Column 3: the dependent variable, Y.


In [6]:
X = data[:, :2]  # data[rows, cols]: ":" selects all rows, ":2" selects columns 0 and 1 (x1, x2)
Y = data[:, 2]   # all rows of column 2 (the third column), which holds Y
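
To see what this slicing produces, here is a small sketch on a hypothetical 4x3 array (the values are made up). Slicing two columns yields a 2-D array, while selecting a single column by index yields a 1-D array, which is the shape `OLS` expects for the response.

```python
import numpy as np

# Hypothetical array: two predictor columns followed by one response column
toy = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 4.0, 3.0],
    [3.0, 9.0, 4.0],
])

X_toy = toy[:, :2]  # all rows, columns 0 and 1 -> 2-D array
Y_toy = toy[:, 2]   # all rows, column 2 -> 1-D array

print(X_toy.shape)  # (4, 2)
print(Y_toy.shape)  # (4,)
```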

### 4. Add a constant column to X for the intercept

The intercept is the value of the dependent variable predicted by the fitted regression line when every independent variable is 0 (zero). `OLS` does not include an intercept by default, so a constant column is added at the beginning of X with `sm.add_constant()`.

For a single independent variable, the regression line is

y = ax + b 

a: slope  
b: intercept  


In [7]:
X = sm.add_constant(X)
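
In effect, this call prepends a column of ones to X, so that the first fitted coefficient acts as the intercept. A NumPy-only sketch of the same idea, using a small hypothetical array rather than the notebook's data:

```python
import numpy as np

# Hypothetical predictor matrix with two columns (x1, x2)
X_small = np.array([[0.0, 0.00],
                    [0.1, 0.01],
                    [0.2, 0.04]])

# Prepend a column of ones: each row becomes [1, x1, x2]
ones = np.ones((X_small.shape[0], 1))
X_with_const = np.hstack([ones, X_small])
print(X_with_const)
```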

### 5. Construct a model
Formula: y = c + a*x1 + b*x2. Given the observed x1, x2, and Y, the coefficients a, b, and c are estimated by multiple regression on the data.

Construct the linear regression model with the OLS (Ordinary Least Squares) class in statsmodels.

For the class definition, see https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html?highlight=ols

> Note: Pay attention to the shape of X. Rows correspond to observations and columns correspond to variables (measured quantities).  

``` python
class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
```
**endog** (array-like) – 1-d endogenous response variable. The dependent variable.  
**exog** (array-like) – A **nobs x k** array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.  
**missing** (str) – Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none.’  
**hasconst** (None or bool) – Indicates whether the RHS includes a user-supplied constant. If True, a constant is not checked for and k_constant is set to 1 and all result statistics are calculated as if a constant is present. If False, a constant is not checked for and k_constant is set to 0.

In [8]:
model = sm.OLS(Y, X)

### 6. Run the regression analysis

``` python
OLS.fit(method='pinv', cov_type='nonrobust', cov_kwds=None, use_t=None, **kwargs)

Parameters:	
   method (str, optional) – Can be “pinv”, “qr”. “pinv” uses the Moore-Penrose pseudoinverse to solve the least   
                            squares problem. “qr” uses the QR factorization. 
   cov_type (str, optional) – See regression.linear_model.RegressionResults for a description of the available 
                            covariance estimators
   cov_kwds (list or None, optional) – See linear_model.RegressionResults.get_robustcov_results for a 
                            description required keywords for alternative covariance estimators
   use_t (bool, optional) – Flag indicating to use the Student’s t distribution when computing p-values. 
                             Default behavior depends on cov_type. 
                             See linear_model.RegressionResults.get_robustcov_results for implementation details.
Returns:	
    A RegressionResults class instance.

```

In [9]:
results = model.fit()
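
With the default `method='pinv'`, the least-squares problem is solved with the Moore-Penrose pseudoinverse, as the parameter description states. The NumPy-only sketch below illustrates that idea on hypothetical noise-free data; it is an illustration of the technique under these assumptions, not the statsmodels implementation itself.

```python
import numpy as np

# Hypothetical noise-free data following y = 0.25 + 0.1*x1 + 0.5*x2
x1 = np.linspace(0.0, 10.0, 100)
x2 = x1 ** 2
y = 0.25 + 0.1 * x1 + 0.5 * x2

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution via the pseudoinverse: beta = pinv(A) @ y
beta = np.linalg.pinv(A) @ y
print(beta)  # ordered as [const, x1, x2]
```

Because the synthetic data contains no noise, the recovered coefficients agree with the generating ones up to floating-point error.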

### 7. Results

`summary()` - summarizes the regression results  
`conf_int(alpha=0.05, cols=None)` - confidence intervals, shown in the `[0.025      0.975]` columns  

Available fields:  

Top left: 

nobs     - No. Observations  
df_model - Df Model  
df_resid - Df Residuals  
cov_type - Covariance Type  

Top right:  

**rsquared**  - R-squared  
**rsquared_adj** - Adj. R-squared  
**fvalue**       - F-statistic  
**f_pvalue**     - Prob (F-statistic)  
llf          - Log-Likelihood  
aic          - AIC  
bic          - BIC  
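
As a sketch of how the first two of these statistics are defined for a model with an intercept, the example below computes R-squared and adjusted R-squared from the residual and total sums of squares. The noisy synthetic data and all variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical noisy data following y = 0.25 + 0.1*x1 + 0.5*x2 + epsilon
rng = np.random.default_rng(0)
n = 100
x1 = np.linspace(0.0, 10.0, n)
x2 = x1 ** 2
y = 0.25 + 0.1 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ beta
ssr = np.sum(resid ** 2)             # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares about the mean

k = A.shape[1] - 1                   # regressors excluding the constant (Df Model)
df_resid = n - k - 1                 # Df Residuals
rsquared = 1.0 - ssr / sst
rsquared_adj = 1.0 - (1.0 - rsquared) * (n - 1) / df_resid
print(rsquared, rsquared_adj)
```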

Middle (coefficient table):

**params**  - coef [const, x1, x2]  
**bse**     - std err [const, x1, x2]  
**tvalues** - t [const, x1, x2]  
**pvalues** - P>|t| [const, x1, x2]  
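
Under the classical (nonrobust) OLS assumptions, the standard errors are the square roots of the diagonal of the estimated coefficient covariance matrix, and each t value is the coefficient divided by its standard error. A NumPy-only sketch on hypothetical noisy data (names and data are illustrative assumptions):

```python
import numpy as np

# Hypothetical noisy data following y = 0.25 + 0.1*x1 + 0.5*x2 + epsilon
rng = np.random.default_rng(1)
n = 100
x1 = np.linspace(0.0, 10.0, n)
x2 = x1 ** 2
y = 0.25 + 0.1 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), x1, x2])
params, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ params
df_resid = n - A.shape[1]                  # observations minus estimated parameters
sigma2 = resid @ resid / df_resid          # residual variance estimate
cov = sigma2 * np.linalg.inv(A.T @ A)      # nonrobust covariance of the estimates
bse = np.sqrt(np.diag(cov))                # std err, ordered [const, x1, x2]
tvalues = params / bse                     # t, ordered [const, x1, x2]
print(bse, tvalues)
```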


Diagnostics (bottom):

`diagn` - a dictionary with these keys:  
```python
    diagn['omni'] - Omnibus    
    diagn['omnipv'] - Prob(Omnibus)  
    diagn['skew'] - Skew  
    diagn['kurtosis'] - Kurtosis  
    diagn['jb'] - Jarque-Bera (JB)  
    diagn['jbpv'] - Prob(JB)  
    diagn['condno'] - Cond. No.  
```    
**condition_number** -  Cond. No.

In [10]:
print(results.summary()) # Summary of the regression analysis

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.110e+11
Date:                Mon, 28 May 2018   Prob (F-statistic):               0.00
Time:                        15:12:54   Log-Likelihood:                 662.56
No. Observations:                 100   AIC:                            -1319.
Df Residuals:                      97   BIC:                            -1311.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3230   9.58e-05   3371.908      0.0

In [11]:
results.bse   # Std err

array([  9.58010089e-05,   4.42791063e-05,   4.28485204e-06])

In [12]:
results.tvalues # t value

array([   3371.90808429,    2257.34196078,  116691.37931038])

In [13]:
results.pvalues # P>|t|

array([  1.15276439e-247,   9.25813926e-231,   0.00000000e+000])

In [14]:
results.condition_number  # a condition number above 1000 may indicate strong multicollinearity or other numerical problems

144.29876800227544

In [15]:
# If the smallest eigenvalue (eigenvals[-1]) is below 1e-10, this may indicate strong multicollinearity
# problems or that the design matrix is singular
print(results.eigenvals)
print(results.eigenvals[-1])

[  2.06233393e+05   2.40612534e+02   9.90452702e+00]
9.90452702449
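
The printed eigenvalues and the condition number reported earlier are consistent with the condition number being the square root of the ratio of the largest to the smallest eigenvalue. A short arithmetic check using only the values printed in this notebook:

```python
import math

# Values printed by results.eigenvals in this notebook
eigenvals = [2.06233393e+05, 2.40612534e+02, 9.90452702e+00]

# Square root of (largest eigenvalue / smallest eigenvalue)
cond = math.sqrt(eigenvals[0] / eigenvals[-1])
print(cond)  # approximately 144.3, in line with results.condition_number
```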


In [16]:
results.params  # corresponds to [const (c), a, b]

array([ 0.3230322 ,  0.09995308,  0.50000529])

### Looking at the results
The data in two_obs_var.csv was generated with the following formula,  
Y = 0.25 + 0.1*x1 + 0.5*x2 + epsilon  
where epsilon represents random error.


|   |  True  |  Regression  |
|-- | ------ |  ----------- |
| c |  0.25  |  0.3230      |
| a |  0.1   |  0.1000      |
| b |  0.5   |  0.5000      |


The estimated slopes a and b are very close to the true values. The estimated intercept c (0.3230) differs from the stated 0.25 by about 0.073; it matches the first data row shown in section 2, where x1 = x2 = 0 and Y = 0.323.
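
The comparison in the table can be made explicit by subtracting the true coefficients from the estimates printed by `results.params`. The dictionary names below are illustrative; the numbers are copied from this notebook.

```python
# True coefficients from the generating formula
true_coefs = {"c": 0.25, "a": 0.1, "b": 0.5}
# Estimates copied from the results.params output
estimated = {"c": 0.3230322, "a": 0.09995308, "b": 0.50000529}

differences = {name: estimated[name] - true_coefs[name] for name in true_coefs}
for name, diff in differences.items():
    print(f"{name}: {diff:+.7f}")
```

The differences for a and b are on the order of 1e-5 or smaller, while the difference for c is about 0.073.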


