# Statistics (Regression Analysis)

### (1) Simple linear regression
- Regression with one dependent variable and one independent variable

In [2]:
import pandas as pd
import numpy as np
house = pd.read_csv('./data/kc_house_data.csv')
house = house[['price','sqft_living']]
house.corr()

Unnamed: 0,price,sqft_living
price,1.0,0.702035
sqft_living,0.702035,1.0


In [4]:
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

# Assign variables
X = house['sqft_living']
y = house['price']

# Fit the simple linear regression model
formula = 'price ~ sqft_living'
lr = ols(formula, data=house).fit()
y_pred = lr.predict(X)
lr.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.493
Model:,OLS,Adj. R-squared:,0.493
Method:,Least Squares,F-statistic:,21000.0
Date:,"Mon, 29 May 2023",Prob (F-statistic):,0.0
Time:,16:40:01,Log-Likelihood:,-300270.0
No. Observations:,21613,AIC:,600500.0
Df Residuals:,21611,BIC:,600600.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-4.358e+04,4402.690,-9.899,0.000,-5.22e+04,-3.5e+04
sqft_living,280.6236,1.936,144.920,0.000,276.828,284.419

0,1,2,3
Omnibus:,14832.49,Durbin-Watson:,1.983
Prob(Omnibus):,0.0,Jarque-Bera (JB):,546444.713
Skew:,2.824,Prob(JB):,0.0
Kurtosis:,26.977,Cond. No.,5630.0


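#### Interpretation

1. Is the regression model statistically significant?
- Null hypothesis: the regression model is not significant.
- Alternative hypothesis: the regression model is significant.

-> Check the model's F-statistic and its p-value. Prob (F-statistic) is reported as 0.0, which is below the 0.05 significance level, so the null hypothesis is rejected. In other words, the regression model is significant.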

2. How well does the model explain the data?
- R-squared is 0.493, so the model explains about 49.3% of the variance in price.

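3. Are the regression coefficients in the model significant?
- The Intercept is the model's constant term and is not of interest here.
- The coefficient of sqft_living is 280.6236 and its p-value is below the 0.05 significance level, so the null hypothesis (the coefficient is 0) is rejected. In other words, the coefficient is significant.

Regression equation: price = 280.6236 * sqft_living - 4.358e+04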
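
As a quick check on the numbers in the summary above, the sketch below plugs a value into the fitted equation and confirms that, for a simple linear regression, R-squared equals the square of the correlation between price and sqft_living. The helper name `predict_price` and the 2,000 sq ft input are illustrative assumptions; the coefficients and the correlation are copied from the outputs above.

```python
# Coefficients copied from the OLS summary above
INTERCEPT = -4.358e+04
SLOPE = 280.6236  # change in predicted price per additional sq ft of living area

def predict_price(sqft_living):
    """Predicted price from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * sqft_living

# Illustrative input (an assumption): a house with 2,000 sq ft of living area
example_price = predict_price(2000)
print(round(example_price, 1))

# In simple linear regression, R-squared is the squared Pearson correlation
corr_price_sqft = 0.702035  # from house.corr() above
r_squared_from_corr = corr_price_sqft ** 2
print(round(r_squared_from_corr, 3))
```

The predicted price comes out at roughly 517,667, and the squared correlation rounds to 0.493, matching the R-squared in the summary.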


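### (2) Multiple regression
- One dependent variable, two or more independent variables
- Multicollinearity
1. A problem in which independent variables are strongly correlated with each other (correlation of 0.9 or higher in absolute value)
2. Regress an independent variable suspected of multicollinearity on the other independent variables and compute the tolerance (1 - R²); a tolerance of 0.1 or below indicates a serious multicollinearity problem
3. VIF = 1 / (1 - R²), the reciprocal of the tolerance, so a tolerance of 0.1 or below corresponds to a VIF of 10 or above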
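
The tolerance and VIF definitions above can be sketched by hand: regress each independent variable on the others (with a constant), take R², and invert. This is a minimal illustration on synthetic data, not the notebook's dataset; `vif_by_hand` and the variables `a`, `b`, `c` are hypothetical names chosen for the example.

```python
import numpy as np

def vif_by_hand(X):
    """Return a (tolerance, VIF) pair for each column of X (shape n x p)."""
    n, p = X.shape
    results = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on a constant plus all other columns
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        tolerance = 1.0 - r2
        results.append((tolerance, 1.0 / tolerance))
    return results

# Synthetic data (an assumption for illustration): b is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
X_demo = np.column_stack([a, b, c])

for name, (tolerance, vif) in zip("abc", vif_by_hand(X_demo)):
    print(f"{name}: tolerance={tolerance:.3f}, VIF={vif:.1f}")
```

The two nearly identical columns should show a tolerance well below 0.1 and a VIF well above 10, while the unrelated column stays close to 1.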

In [6]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')
Cars = pd.read_csv('./data/Cars93.csv')
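# Remove the literal '.' from column names; regex=False keeps '.' from being read as a wildcard
Cars.columns = Cars.columns.str.replace('.', '', regex=False)
model = smf.ols(formula='Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=Cars)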
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Price,R-squared:,0.572
Model:,OLS,Adj. R-squared:,0.542
Method:,Least Squares,F-statistic:,19.14
Date:,"Mon, 29 May 2023",Prob (F-statistic):,4.88e-14
Time:,16:55:20,Log-Likelihood:,-302.94
No. Observations:,93,AIC:,619.9
Df Residuals:,86,BIC:,637.6
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-32.2157,17.812,-1.809,0.074,-67.625,3.193
EngineSize,4.4732,1.410,3.172,0.002,1.670,7.276
RPM,0.0071,0.001,5.138,0.000,0.004,0.010
Weight,0.0056,0.003,1.634,0.106,-0.001,0.012
Length,-0.0464,0.094,-0.496,0.621,-0.232,0.139
MPGcity,-0.3478,0.448,-0.776,0.440,-1.239,0.544
MPGhighway,0.0582,0.460,0.126,0.900,-0.856,0.973

0,1,2,3
Omnibus:,62.984,Durbin-Watson:,1.446
Prob(Omnibus):,0.0,Jarque-Bera (JB):,383.289
Skew:,2.074,Prob(JB):,5.89e-84
Kurtosis:,12.039,Cond. No.,161000.0


In [8]:
# Check for multicollinearity

Cars[['EngineSize', 'RPM','Weight','Length','MPGcity','MPGhighway']].corr()

# MPGcity and MPGhighway have a correlation coefficient of 0.94 (above 0.9), which indicates a multicollinearity problem.

Unnamed: 0,EngineSize,RPM,Weight,Length,MPGcity,MPGhighway
EngineSize,1.0,-0.547898,0.845075,0.780283,-0.710003,-0.626795
RPM,-0.547898,1.0,-0.427931,-0.441249,0.363045,0.313469
Weight,0.845075,-0.427931,1.0,0.806274,-0.843139,-0.810658
Length,0.780283,-0.441249,0.806274,1.0,-0.666239,-0.542897
MPGcity,-0.710003,0.363045,-0.843139,-0.666239,1.0,0.943936
MPGhighway,-0.626795,0.313469,-0.810658,-0.542897,0.943936,1.0
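
Reading the matrix by eye works for six variables, but the same check can be automated. The sketch below scans the upper triangle of the correlation matrix for pairs at or above the 0.9 threshold in absolute value. The matrix values are copied from the output above; `high_corr_pairs` is a hypothetical helper name.

```python
import numpy as np

names = ["EngineSize", "RPM", "Weight", "Length", "MPGcity", "MPGhighway"]

# Correlation matrix copied from the output above
corr = np.array([
    [1.0, -0.547898, 0.845075, 0.780283, -0.710003, -0.626795],
    [-0.547898, 1.0, -0.427931, -0.441249, 0.363045, 0.313469],
    [0.845075, -0.427931, 1.0, 0.806274, -0.843139, -0.810658],
    [0.780283, -0.441249, 0.806274, 1.0, -0.666239, -0.542897],
    [-0.710003, 0.363045, -0.843139, -0.666239, 1.0, 0.943936],
    [-0.626795, 0.313469, -0.810658, -0.542897, 0.943936, 1.0],
])

def high_corr_pairs(corr, names, threshold=0.9):
    """List (var1, var2, r) for each pair with |r| >= threshold."""
    # k=1 skips the diagonal, so each pair is visited once
    rows, cols = np.triu_indices(len(names), k=1)
    return [
        (names[i], names[j], float(corr[i, j]))
        for i, j in zip(rows, cols)
        if abs(corr[i, j]) >= threshold
    ]

pairs = high_corr_pairs(corr, names)
print(pairs)
```

Only the MPGcity and MPGhighway pair clears the 0.9 threshold here; lowering `threshold` to 0.8 would also surface the pairs involving Weight.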


In [10]:
# Compute VIF values

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

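# dmatrices builds the dependent variable (y) and the design matrix of independent variables (X, including an Intercept column) as DataFrames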
y,X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data = Cars, return_type = 'dataframe')

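vif_list = []
# Start at index 1 to skip the Intercept column of the design matrix
for i in range(1, len(X.columns)):
    vif_list.append([variance_inflation_factor(X.values, i), X.columns[i]])
pd.DataFrame(vif_list, columns=['vif', 'variable'])

# MPGcity has the highest VIF (13.67, above 10), so it should be removed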

Unnamed: 0,vif,variable
0,4.605118,EngineSize
1,1.446859,RPM
2,8.685973,Weight
3,4.013002,Length
4,13.668288,MPGcity
5,12.943133,MPGhighway


In [11]:
# Refit the multiple linear regression after removing MPGcity

model = smf.ols(formula = 'Price ~ EngineSize + RPM + Weight + Length + MPGhighway', data = Cars)
result= model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Price,R-squared:,0.569
Model:,OLS,Adj. R-squared:,0.544
Method:,Least Squares,F-statistic:,22.95
Date:,"Mon, 29 May 2023",Prob (F-statistic):,1.28e-14
Time:,17:03:03,Log-Likelihood:,-303.27
No. Observations:,93,AIC:,618.5
Df Residuals:,87,BIC:,633.7
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-35.8122,17.158,-2.087,0.040,-69.916,-1.709
EngineSize,4.6591,1.386,3.361,0.001,1.904,7.415
RPM,0.0071,0.001,5.173,0.000,0.004,0.010
Weight,0.0053,0.003,1.567,0.121,-0.001,0.012
Length,-0.0194,0.087,-0.224,0.823,-0.191,0.153
MPGhighway,-0.2500,0.231,-1.082,0.282,-0.709,0.209

0,1,2,3
Omnibus:,61.903,Durbin-Watson:,1.397
Prob(Omnibus):,0.0,Jarque-Bera (JB):,363.806
Skew:,2.044,Prob(JB):,1.00e-79
Kurtosis:,11.785,Cond. No.,156000.0


### (3) Quantile regression

#### Example

The data below has user_counts as the dependent variable.

1. Use quantile regression to estimate the regression coefficients. (Round to two decimal places.)

2. Using the coefficients from the model in 1, predict user_counts when temperature = 10.5, wind = 8.2, and precipitation = 3.5.

In [12]:
import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/27/problem8.csv')
df.head()


Unnamed: 0,temperature,wind,precipitation,user_counts
0,10.4,4.6,0.844944,6368
1,5.666667,4.625,0.04086,5902
2,4.933333,4.725,0.008696,6226
3,3.4,2.675,0.156989,5829
4,8.9,3.95,7.988462,7589


In [16]:
model = smf.quantreg('user_counts~ temperature+wind+precipitation',data=df)
result = model.fit(q=0.5)
result.summary()

0,1,2,3
Dep. Variable:,user_counts,Pseudo R-squared:,0.3723
Model:,QuantReg,Bandwidth:,840.9
Method:,Least Squares,Sparsity:,5590.0
Date:,"Mon, 29 May 2023",No. Observations:,2097.0
Time:,17:09:51,Df Residuals:,2093.0
,,Df Model:,3.0

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5941.8395,198.127,29.990,0.000,5553.293,6330.386
temperature,268.8920,6.571,40.918,0.000,256.005,281.779
wind,-129.4050,46.259,-2.797,0.005,-220.124,-38.686
precipitation,-83.3843,7.891,-10.567,0.000,-98.859,-67.910
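
With q=0.5, quantile regression fits the conditional median rather than the conditional mean: it minimizes the check (pinball) loss instead of squared error. The sketch below uses made-up numbers and a hypothetical `pinball_loss` helper to show the intercept-only case, where the constant that minimizes the loss at q=0.5 is the sample median and is not pulled toward an outlier the way the mean is.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Mean check (pinball) loss of predictions at quantile level q."""
    diff = np.asarray(y, dtype=float) - pred
    # Under-predictions are weighted by q, over-predictions by (1 - q)
    return float(np.mean(np.where(diff >= 0, q * diff, (q - 1) * diff)))

# Made-up sample (an assumption for illustration) with one large outlier
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Grid search over constant predictions 0, 1, ..., 100
candidates = np.arange(0.0, 101.0, 1.0)
losses = [pinball_loss(y_demo, c, q=0.5) for c in candidates]
best_constant = float(candidates[int(np.argmin(losses))])

print(best_constant, float(np.median(y_demo)), float(np.mean(y_demo)))
```

The best constant coincides with the median of the sample, while the mean sits far above it because of the single outlier.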


In [18]:
# Regression equation (q = 0.5): user_counts = 5941.84 + 268.89 * temperature - 129.41 * wind - 83.38 * precipitation

temperature= 10.5
wind = 8.2 
precipitation= 3.5

user_counts = 5941.84 + 268.89 * temperature - 129.41 * wind - 83.38 * precipitation
user_counts


7412.192999999999
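
The prediction above uses coefficients rounded to two decimals, as the exercise asks. The sketch below repeats the calculation with a small hypothetical helper, `predict_from_coefs`, and compares it against the four-decimal coefficients printed in the summary, to show how much the rounding moves the result. All coefficient values are copied from the outputs above.

```python
def predict_from_coefs(coefs, features):
    """Intercept plus the sum of coefficient * feature value."""
    return coefs["Intercept"] + sum(
        coefs[name] * value for name, value in features.items()
    )

# Coefficients rounded to two decimals, as the exercise asks
coefs_rounded = {
    "Intercept": 5941.84, "temperature": 268.89,
    "wind": -129.41, "precipitation": -83.38,
}
# Coefficients as printed in the summary table (four decimals)
coefs_summary = {
    "Intercept": 5941.8395, "temperature": 268.8920,
    "wind": -129.4050, "precipitation": -83.3843,
}

new_day = {"temperature": 10.5, "wind": 8.2, "precipitation": 3.5}

pred_rounded = predict_from_coefs(coefs_rounded, new_day)
pred_summary = predict_from_coefs(coefs_summary, new_day)
print(round(pred_rounded, 2), round(pred_summary, 2))
```

The two predictions differ by roughly 0.05 users, which is negligible here. If the fitted `result` object is still in memory, calling `result.predict` on a one-row DataFrame with the same three columns should give the prediction without any manual rounding.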