## Key words
### Linear regression, dependent variable, independent variable, coefficient of determination, ols, LinearRegression, mean_absolute_error, mean_squared_error, train_test_split (data splitting)

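## Characteristics of simple regression analysis
- A technique for checking the linear relationship between a `continuous` dependent variable and an independent variable, and how much of the variation that relationship explains
- A simple linear regression model has exactly one dependent variable and one independent variable
 - With two or more independent variables it becomes multiple linear regression
- Model performance is evaluated with `explanatory power` (R-squared) together with `error metrics`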
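
As a rough illustration of what "fitting a line" and "explanatory power" mean, the sketch below computes the least-squares slope, intercept, and R-squared by hand in plain Python. The function names `fit_simple_ols` and `r_squared` and the toy data are made up for this example; they are not part of statsmodels or sklearn.

```python
# Closed-form ordinary least squares for a single predictor (pure Python).
def fit_simple_ols(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    mean_y = sum(y) / len(y)
    # Residual sum of squares versus total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data lying exactly on y = 2x + 1 (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_ols(x, y)
r2 = r_squared(x, y, slope, intercept)
print(slope, intercept, r2)
```

Because the toy points lie exactly on a line, the residuals are zero and R-squared is 1; with real data it falls between 0 and 1 for a model with an intercept.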

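### statsmodels - ols()
- The statsmodels function for linear regression analysis
- The dependent and independent variables are declared in the formula passed to ols()
- The model is fitted with the fit() method of the object returned by ols()
- An error occurs when a variable name contains certain special characters such as a period (.)
- Predictions are made with the predict() method of the fitted model object
- When the dependent variable is continuous and the independent variable is nominal, ANOVA can be used
- When the independent variable is numeric, regression analysis can be used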
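
The declare, fit, predict sequence from the list above can be sketched on a tiny made-up DataFrame. The column names `x` and `y` and all values are assumptions for illustration only.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data lying close to a straight line.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [5.1, 7.9, 11.2, 13.8, 17.0],
})

# "y ~ x" declares y as the dependent and x as the independent variable.
toy_model = ols(formula="y ~ x", data=toy).fit()

# predict() needs a DataFrame that contains the columns used in the formula.
new_data = pd.DataFrame({"x": [6.0]})
toy_pred = toy_model.predict(new_data)

print(toy_model.params)  # Intercept and slope
print(toy_pred)
```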

In [1]:
import pandas as pd
from statsmodels.formula.api import ols

In [56]:
df = pd.read_csv("iris.csv")
print(df)

     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]

#### Building a linear regression with ols

In [3]:
# model = ols(formula= "Sepal.Length ~ Sepal.Width", data = df).fit()
# model
# The period (.) in the column names causes an error, so the columns must be renamed

In [4]:
df.columns = ["SL", "SW", "PL", "PW", "species"]
df.head(2)

Unnamed: 0,SL,SW,PL,PW,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
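
When there are many columns, typing every new name by hand gets tedious. One possible alternative is to replace the periods programmatically with `DataFrame.rename`. The two-row frame below is a stand-in for iris.csv, and the choice of an underscore as the replacement character is an assumption.

```python
import pandas as pd

# Two-row stand-in for iris.csv with the same style of dotted column names.
raw = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9],
    "Sepal.Width": [3.5, 3.0],
})

# rename() accepts a function that is applied to every column name
# and returns a new DataFrame, leaving `raw` unchanged.
clean = raw.rename(columns=lambda name: name.replace(".", "_"))
print(list(clean.columns))
```

The resulting names contain no periods, so they can be used directly in an ols() formula.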


In [5]:
model = ols(formula= "SL ~ SW", data = df).fit()
# typing just `model` does not display the results, so call summary()
model.summary()

0,1,2,3
Dep. Variable:,SL,R-squared:,0.014
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,2.074
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.152
Time:,18:24:14,Log-Likelihood:,-183.0
No. Observations:,150,AIC:,370.0
Df Residuals:,148,BIC:,376.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.5262,0.479,13.628,0.000,5.580,7.473
SW,-0.2234,0.155,-1.440,0.152,-0.530,0.083

0,1,2,3
Omnibus:,4.389,Durbin-Watson:,0.952
Prob(Omnibus):,0.111,Jarque-Bera (JB):,4.237
Skew:,0.36,Prob(JB):,0.12
Kurtosis:,2.6,Cond. No.,24.2


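- Prob (F-statistic): the p-value of the F test statistic. [Interpretation] At 0.152 (> 0.05) the null hypothesis cannot be rejected, so there is no statistically significant linear relationship between SW and SL -> a model to discard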



In [6]:
model = ols(formula= "PL ~ PW", data = df).fit()
# typing just `model` does not display the results, so call summary()
model.summary()

0,1,2,3
Dep. Variable:,PL,R-squared:,0.927
Model:,OLS,Adj. R-squared:,0.927
Method:,Least Squares,F-statistic:,1882.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,4.68e-86
Time:,18:24:14,Log-Likelihood:,-101.18
No. Observations:,150,AIC:,206.4
Df Residuals:,148,BIC:,212.4
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.0836,0.073,14.850,0.000,0.939,1.228
PW,2.2299,0.051,43.387,0.000,2.128,2.332

0,1,2,3
Omnibus:,2.438,Durbin-Watson:,1.43
Prob(Omnibus):,0.295,Jarque-Bera (JB):,1.966
Skew:,0.211,Prob(JB):,0.374
Kurtosis:,3.369,Cond. No.,3.7


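### Reading order
- R-squared ~ Prob (F-statistic)
1. Prob (F-statistic): the value is effectively 0 (4.68e-86), so the null hypothesis is rejected and the alternative accepted, meaning there is a significant linear relationship between the independent and dependent variables
2. R-squared (explanatory power, coefficient of determination)
3. Adj. R-squared (adjusted explanatory power, adjusted coefficient of determination)

- In the second table, look at coef, t, and P>|t|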
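
The same quantities can also be read from the fitted results object instead of the printed summary. The sketch below uses made-up data; `f_pvalue`, `rsquared`, `rsquared_adj`, `params`, `tvalues`, and `pvalues` are attributes of a statsmodels regression result, while the `report` dictionary and the 0.05 threshold are choices made for this example.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up, nearly linear data.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
})
res = ols("y ~ x", data=toy).fit()

# Collect the values in the order suggested above.
report = {
    "f_pvalue": res.f_pvalue,            # Prob (F-statistic)
    "r_squared": res.rsquared,           # R-squared
    "adj_r_squared": res.rsquared_adj,   # Adj. R-squared
    "coef": res.params.to_dict(),        # coef column
    "t": res.tvalues.to_dict(),          # t column
    "p": res.pvalues.to_dict(),          # P>|t| column
}

# Reject the null hypothesis of "no linear relationship" at the 5% level?
significant = report["f_pvalue"] < 0.05
print(report["r_squared"], significant)
```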

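### Variable section
```text
             coef  std err       t  P>|t|  [0.025  0.975]
Intercept  1.0836    0.073  14.850  0.000   0.939   1.228
PW         2.2299    0.051  43.387  0.000   2.128   2.332
```
- Intercept: the intercept term
- PW: the independent variable we put into the model
- t: the t test statistic
- P>|t|: the p-value for the t statistic, i.e. the probability, under the null hypothesis that the coefficient is 0, of observing a t statistic larger in absolute value than the one computed. Here it is below 0.05, so the null hypothesis is rejected
- coef: the fitted line is `y = 2.2299x + 1.0836`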

---
predict(): predicted values

In [7]:
# model = ols(formula= "PL ~ PW", data = df).fit()

In [8]:
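model.predict(df.iloc[:6, ]) # pass only the first 6 rows; strictly only the columns used in training are needed, but the formula-based model picks out the column it needs (PW) by itself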

0    1.529546
1    1.529546
2    1.529546
3    1.529546
4    1.529546
5    1.975534
dtype: float64
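
The predictions above can be checked by hand from the coefficients in the summary table. The sketch below plugs the rounded values 1.0836 and 2.2299 into the fitted line; `predict_pl` is a hypothetical helper name, and small differences in the later decimals are expected because the table shows rounded coefficients.

```python
# Rounded coefficients copied from the summary table of the "PL ~ PW" model.
INTERCEPT = 1.0836
SLOPE_PW = 2.2299


def predict_pl(pw):
    """Predicted petal length (PL) for a given petal width (PW)."""
    return INTERCEPT + SLOPE_PW * pw


# PW = 0.2 matches the first five rows; PW = 0.4 is consistent with the sixth.
for pw in (0.2, 0.4):
    print(pw, round(predict_pl(pw), 4))
```

The results (about 1.5296 and 1.9756) agree with the predict() output of 1.529546 and 1.975534 to roughly four decimal places.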

In [9]:
df["pred"] = model.predict(df) # predictions can also be stored as a new column like this
df.head()

Unnamed: 0,SL,SW,PL,PW,species,pred
0,5.1,3.5,1.4,0.2,setosa,1.529546
1,4.9,3.0,1.4,0.2,setosa,1.529546
2,4.7,3.2,1.3,0.2,setosa,1.529546
3,4.6,3.1,1.5,0.2,setosa,1.529546
4,5.0,3.6,1.4,0.2,setosa,1.529546


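## sklearn: a dedicated machine learning library
### sklearn - LinearRegression()
- The sklearn class for linear regression analysis
- The fit_intercept argument of LinearRegression() controls whether an intercept is fitted -> usually left at its default
- Training data is passed to the fit() method of LinearRegression()
- The coef_ and intercept_ attributes of the fitted model hold the coefficients and the intercept, respectively
- Predictions are made with the predict() method of the model object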
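
A minimal sketch of this workflow on made-up NumPy arrays is shown below. The data values are assumptions; the point is only that `X` is 2-D (one column per feature) while `y` may be 1-D.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (4, 1): 2-D feature matrix
y = np.array([3.0, 5.0, 7.0, 9.0])          # shape (4,): 1-D target

reg = LinearRegression(fit_intercept=True).fit(X, y)

slope = reg.coef_[0]        # one coefficient per feature
intercept = reg.intercept_  # a scalar when y is 1-D
pred = reg.predict(np.array([[5.0]]))
print(slope, intercept, pred)
```

When `y` is passed as a 2-D object such as `df[["PW"]]`, `coef_` and `intercept_` come back as arrays with an extra dimension, which is why the outputs later in these notes appear as `array([[...]])` and `array([...])`.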

In [10]:
from sklearn.linear_model import LinearRegression

In [11]:
# model = LinearRegression().fit(X = df["PL"], y = df["PW"])
# model
# Raises an error because X was passed as a Series (1-D); sklearn expects a 2-D X

In [12]:
df["PL"].head() # a Series

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: PL, dtype: float64

In [13]:
df[["PL"]].head(2) # double brackets keep the DataFrame (2-D) structure

Unnamed: 0,PL
0,1.4
1,1.4


In [14]:
df.iloc[0, ] # returns a Series

SL              5.1
SW              3.5
PL              1.4
PW              0.2
species      setosa
pred       1.529546
Name: 0, dtype: object

In [15]:
df.iloc[[0], ] # returns a DataFrame

Unnamed: 0,SL,SW,PL,PW,species,pred
0,5.1,3.5,1.4,0.2,setosa,1.529546
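
The difference between the 1-D and 2-D forms is easiest to see from `.shape`. The small frame below is made up, and `to_frame()` and `to_numpy().reshape(-1, 1)` are shown as two other common ways to get a 2-D input.

```python
import pandas as pd

toy = pd.DataFrame({"PL": [1.4, 1.4, 1.3], "PW": [0.2, 0.2, 0.2]})

as_series = toy["PL"]                            # single brackets: 1-D Series
as_frame = toy[["PL"]]                           # double brackets: 2-D DataFrame
also_frame = toy["PL"].to_frame()                # Series converted to a DataFrame
as_array = toy["PL"].to_numpy().reshape(-1, 1)   # NumPy column vector

print(as_series.shape, as_frame.shape, also_frame.shape, as_array.shape)
```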


Tip: with "PL" selected (dragged over) with the mouse, pressing [ wraps the selection in brackets

In [16]:
model = LinearRegression().fit(X = df[["PL"]], y = df[["PW"]]) # X must be 2-D, hence the double brackets; note that this model predicts PW from PL
model

LinearRegression()

In [17]:
model.coef_

array([[0.41575542]])

In [18]:
model.intercept_

array([-0.36307552])

In [19]:
model.predict(df[["PL"]]) # the input must keep its DataFrame (2-D) structure

array([[0.21898206],
       [0.21898206],
       [0.17740652],
       [0.2605576 ],
       [0.21898206],
       ...
(output truncated; one prediction per row, 150 in total)

## Evaluation metric functions
### sklearn - mean_absolute_error()
- The sklearn function for computing MAE (Mean Absolute Error)

### sklearn - mean_squared_error()
- The sklearn function for computing MSE (Mean Squared Error)
- Taking the square root of the result gives RMSE (Root Mean Squared Error)
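
To make the definitions concrete, the three metrics can be written out by hand in plain Python. The functions `mae`, `mse`, and `rmse` and the sample values are illustrative only; in practice the sklearn functions are used.

```python
import math


def mae(y_true, y_pred):
    # Mean of the absolute errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    # Mean of the squared errors.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    # Square root of MSE, which brings the unit back to that of y.
    return math.sqrt(mse(y_true, y_pred))


# Made-up values: the errors are 1, 0, and -3.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because the errors are squared before averaging, the single large error of 3 dominates MSE and RMSE much more than it does MAE.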

In [20]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# most functions imported from sklearn.metrics take y_true and y_pred arguments

In [21]:
mean_absolute_error(y_true = df["PL"], y_pred = df["PW"]) # arbitrary inputs, only to demonstrate the call (PW is not a prediction of PL)

2.558666666666667

In [22]:
mean_squared_error(y_true = df["PL"], y_pred = df["PW"])

7.645466666666667

In [23]:
mean_squared_error(y_true = df["PL"], y_pred = df["PW"]) ** 0.5 # RMSE

2.76504370067937

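### 1. What is the coefficient of determination when registered is the dependent variable and temp is the independent variable?
- bike.csv
- Use the statsmodels function
- Set the training data ratio to 70% and the seed to 123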

In [24]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [25]:
from statsmodels.formula.api import ols

In [26]:
model = ols(formula = "registered ~ temp" ,data = df).fit() # first attempt: fitted on the full data, without the train/test split
model.summary()

0,1,2,3
Dep. Variable:,registered,R-squared:,0.101
Model:,OLS,Adj. R-squared:,0.101
Method:,Least Squares,F-statistic:,1229.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,2.87e-255
Time:,18:24:14,Log-Likelihood:,-69485.0
No. Observations:,10886,AIC:,139000.0
Df Residuals:,10884,BIC:,139000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,30.6172,3.818,8.018,0.000,23.133,38.102
temp,6.1755,0.176,35.062,0.000,5.830,6.521

0,1,2,3
Omnibus:,2988.023,Durbin-Watson:,0.442
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7629.148
Skew:,1.499,Prob(JB):,0.0
Kurtosis:,5.799,Cond. No.,60.4


1. Solution

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
df_train, df_test = train_test_split(df, train_size=0.7, random_state=123) # 70% training data, seed 123
df_train.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
4046,2011-09-19 15:00:00,3,0,1,2,24.6,30.305,60,15.0013,44,143,187
9262,2012-09-09 07:00:00,3,0,0,1,22.14,25.76,73,11.0014,20,50,70


In [29]:
model = ols(formula = "registered ~ temp" ,data = df_train).fit()
model.summary()

0,1,2,3
Dep. Variable:,registered,R-squared:,0.106
Model:,OLS,Adj. R-squared:,0.106
Method:,Least Squares,F-statistic:,902.3
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,1.92e-187
Time:,18:24:14,Log-Likelihood:,-48650.0
No. Observations:,7620,AIC:,97300.0
Df Residuals:,7618,BIC:,97320.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,27.5151,4.559,6.036,0.000,18.579,36.452
temp,6.3391,0.211,30.038,0.000,5.925,6.753

0,1,2,3
Omnibus:,2097.525,Durbin-Watson:,2.022
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5337.402
Skew:,1.502,Prob(JB):,0.0
Kurtosis:,5.79,Cond. No.,60.1


### 2. What is the RMSE when casual is the dependent variable and atemp is the independent variable?
- bike.csv
- Use the statsmodels function
- Set the training data ratio to 70% and the seed to 123

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [32]:
df_train, df_test = train_test_split(df, train_size=0.7, random_state=123)
df_train.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
4046,2011-09-19 15:00:00,3,0,1,2,24.6,30.305,60,15.0013,44,143,187
9262,2012-09-09 07:00:00,3,0,0,1,22.14,25.76,73,11.0014,20,50,70


In [33]:
model = ols(formula = "casual ~ atemp", data = df_train).fit()
model.summary()

0,1,2,3
Dep. Variable:,casual,R-squared:,0.219
Model:,OLS,Adj. R-squared:,0.219
Method:,Least Squares,F-statistic:,2138.0
Date:,"Wed, 23 Mar 2022",Prob (F-statistic):,0.0
Time:,18:24:14,Log-Likelihood:,-39689.0
No. Observations:,7620,AIC:,79380.0
Df Residuals:,7618,BIC:,79400.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-29.2974,1.498,-19.554,0.000,-32.234,-26.360
atemp,2.7672,0.060,46.243,0.000,2.650,2.885

0,1,2,3
Omnibus:,4125.373,Durbin-Watson:,1.973
Prob(Omnibus):,0.0,Jarque-Bera (JB):,34148.771
Skew:,2.494,Prob(JB):,0.0
Kurtosis:,12.092,Cond. No.,74.1


In [34]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [35]:
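mean_squared_error(y_true=df_train["casual"], y_pred=df_train["atemp"]) ** 0.5 # first attempt: this compares casual with the raw atemp values, not with model predictions on the test set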

48.30664766403011

2. Solution

In [36]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [37]:
df_train, df_test = train_test_split(df, train_size=0.7, random_state=123)
df_train.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
4046,2011-09-19 15:00:00,3,0,1,2,24.6,30.305,60,15.0013,44,143,187
9262,2012-09-09 07:00:00,3,0,0,1,22.14,25.76,73,11.0014,20,50,70


In [38]:
model = ols(formula = "casual ~ atemp", data = df_train).fit() # fit on the training set
pred = model.predict(df_test) # predict on the test set
pred[:4]

6495    31.499001
7050    12.626390
558     10.537120
5085    33.588271
dtype: float64

In [39]:
mean_squared_error(y_pred=pred, y_true=df_test["casual"]) ** 0.5

44.462370102714324
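
The pattern used in this solution (split, fit on the training part, predict on the test part, score) can be wrapped in a small helper. This is only a sketch: `holdout_rmse` is a hypothetical name, and the data is randomly generated rather than taken from bike.csv.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols


def holdout_rmse(data, formula, target, train_size=0.7, seed=123):
    """Fit an OLS model on the training split and return the test-set RMSE."""
    train, test = train_test_split(data, train_size=train_size, random_state=seed)
    model = ols(formula=formula, data=train).fit()
    pred = model.predict(test)
    return mean_squared_error(y_true=test[target], y_pred=pred) ** 0.5


# Synthetic stand-in data: a linear trend plus noise with standard deviation 5.
rng = np.random.default_rng(0)
atemp = rng.uniform(0, 40, size=200)
casual = 3.0 * atemp - 20 + rng.normal(0, 5, size=200)
toy = pd.DataFrame({"atemp": atemp, "casual": casual})

toy_rmse = holdout_rmse(toy, "casual ~ atemp", "casual")
print(toy_rmse)
```

With noise of standard deviation 5, the test RMSE should come out close to 5, which is a quick sanity check that the helper scores predictions rather than raw inputs.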

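### 3. What is the difference in RMSE between summer and winter when casual is the dependent variable and atemp is the independent variable?
- bike.csv
- Use the statsmodels function
- Set the training data ratio to 70% and the seed to 123
- Take the absolute value of the RMSE difference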

In [40]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [41]:
df["season"].unique()

array([1, 2, 3, 4], dtype=int64)

In [42]:
df["summer"] = (df["season"] == 2) + 0
df["summer"].unique()

array([0, 1])

In [43]:
df["winter"] = (df["season"] == 4) + 0
df["winter"].unique()

array([0, 1])

In [44]:
df_summer = df.loc[df["summer"] == 1]
df_summer.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,summer,winter
1323,2011-04-01 00:00:00,2,0,1,3,10.66,12.88,100,11.0014,0,6,6,1,0
1324,2011-04-01 01:00:00,2,0,1,3,10.66,12.88,100,11.0014,0,4,4,1,0


In [45]:
df_winter = df.loc[df["winter"] == 1]
df_winter.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,summer,winter
4055,2011-10-01 00:00:00,4,0,0,1,20.5,24.24,63,26.0027,24,106,130,0,1
4056,2011-10-01 01:00:00,4,0,0,1,19.68,23.485,67,22.0028,11,47,58,0,1


In [46]:
s_train, s_test = train_test_split(df_summer, train_size=0.7, random_state=123)
model_s = ols(formula = "casual ~ atemp", data = s_train).fit() # must be fitted on the training data (mistake: at first I passed the full data)
pred_s = model_s.predict(s_test) # must predict on the test data

In [47]:
a = mean_squared_error(y_true=s_test["casual"], y_pred=pred_s) ** 0.5 # y_true: the dependent variable of the test set, y_pred: the predictions

In [48]:
w_train, w_test = train_test_split(df_winter, train_size=0.7, random_state=123)
model_w = ols(formula = "casual ~ atemp", data = w_train).fit()
pred_w = model_w.predict(w_test)

In [49]:
b = mean_squared_error(y_true=w_test["casual"], y_pred=pred_w) ** 0.5

In [50]:
abs(a - b)

8.648423450414178

3. Solution

In [51]:
df_s2 = df.loc[df["season"] == 2, ]
df_s4 = df.loc[df["season"] == 4, ]

In [52]:
# train_test_split

In [53]:
# build the model

In [54]:
# predict

In [55]:
# similar to what I wrote above
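
The placeholder steps above (split, build the model, predict) can be sketched as a loop over the season codes, which avoids creating separate indicator columns. `rmse_by_group` is a hypothetical helper, and the data is synthetic, with winter (season 4) deliberately given more noise than summer (season 2).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols


def rmse_by_group(data, group_col, groups, formula, target, seed=123):
    """Return {group: test RMSE}, fitting one OLS model per group."""
    result = {}
    for g in groups:
        subset = data.loc[data[group_col] == g]
        train, test = train_test_split(subset, train_size=0.7, random_state=seed)
        model = ols(formula=formula, data=train).fit()   # fit on train only
        pred = model.predict(test)                       # predict on test only
        result[g] = mean_squared_error(y_true=test[target], y_pred=pred) ** 0.5
    return result


# Synthetic stand-in for bike.csv: noise sd 2 for season 2, sd 8 for season 4.
rng = np.random.default_rng(1)
frames = []
for season, noise_sd in [(2, 2.0), (4, 8.0)]:
    atemp = rng.uniform(5, 35, size=100)
    casual = 2.5 * atemp + rng.normal(0, noise_sd, size=100)
    frames.append(pd.DataFrame({"season": season, "atemp": atemp, "casual": casual}))
toy = pd.concat(frames, ignore_index=True)

scores = rmse_by_group(toy, "season", [2, 4], "casual ~ atemp", "casual")
gap = abs(scores[2] - scores[4])
print(scores, gap)
```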