# Linear Regression: Regression Analysis for Explanation
 
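 1) Dependent variable y: symptom score on the COPD Assessment Test (CAT), a quality-of-life measure for chronic obstructive pulmonary disease (COPD)
 * CAT: COPD Assessment Test
 * The CAT uses 8 questionnaire items to assess the patient's quality of life in terms of respiratory symptoms, activity level, sleep, and confidence
 * Scores range from 0 (best quality of life) to 40 (worst quality of life)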
 
2) Independent variables x (7)
 * age: age, continuous
 * sex: sex (1: male, 0: female), categorical
 * FEV1%: forced expiratory volume in one second (FEV1), continuous
  - (> 70: mild, 60-69: moderate, 50-59: moderately severe, 35-49: severe, < 35: very severe)
 * Smoke_pack_year: pack-years, a measure of a person's tobacco exposure, defined as average packs smoked per day × years of smoking
 * Chol: blood cholesterol (mg/dL), continuous
 * Comorbid: presence of comorbidity (1: yes, 0: no), categorical
 * premium: average monthly health insurance premium (a proxy for income level), continuous
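
The pack-year definition and the FEV1% severity bands listed above can be sketched in code. This is a minimal illustration, not part of the original analysis: the helper names `pack_years` and `fev1_severity` are hypothetical, and the handling of boundary values (exactly 70 counted as mild, non-integer values assigned to the band whose lower bound they have reached) is an assumption.

```python
import numpy as np
import pandas as pd

def pack_years(packs_per_day, years_smoked):
    """Pack-years = average packs smoked per day x years of smoking."""
    return packs_per_day * years_smoked

def fev1_severity(fev1_pct):
    """Map FEV1% values to the severity bands from the variable list.

    Assumption: each band is closed on the left, so 35 is 'severe'
    and 70 is 'mild'.
    """
    bins = [-np.inf, 35, 50, 60, 70, np.inf]
    labels = ["very severe", "severe", "moderately severe", "moderate", "mild"]
    return pd.cut(pd.Series(fev1_pct), bins=bins, labels=labels, right=False)

# Made-up values for illustration only
example = fev1_severity([30, 35, 55, 65, 80])
print(example.tolist())
print(pack_years(1.5, 20))  # 1.5 packs/day for 20 years
```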
 
 * More than 80% of COPD patients have a smoking history of at least 10 pack-years; among men in particular, more than 90% of cases are related to smoking

## Loading the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

In [None]:
df=pd.read_csv('copdcat.csv')

In [None]:
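# Predictors: every column except the outcome
X = df.drop(["CATScore"], axis=1)
y = df["CATScore"]
# Add a squared pack-year term so the smoking effect can be nonlinear
X["Smoke_pack_year2"] = df["Smoke_pack_year"]**2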

In [None]:
import statsmodels.api as sm
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

## Feature selection using the wrapper method
### Backward elimination

In [None]:
import feature_selection as fsel

In [None]:
result_back = fsel.backwardSelection(X, y, model_type="linear", elimination_criteria="aic")
result_back

#### Final model

In [None]:
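# Final predictor set: Chol and the linear pack-year term are left out; the squared term is kept
X_reduced = df.drop(["CATScore", "Chol", "Smoke_pack_year"], axis=1)
X_reduced["Smoke_pack_year2"] = df["Smoke_pack_year"]**2
y = df["CATScore"]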

In [None]:
model_reg3 = sm.OLS(y,sm.add_constant(X_reduced))
result_reg3 = model_reg3.fit()
result_reg3.summary()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Compute VIF on a design matrix that includes the intercept
X_design = sm.add_constant(X_reduced)
vif = pd.DataFrame()
vif["Features"] = X_reduced.columns
# Column 0 of X_design is the constant, so feature i sits at index i + 1
vif["VIF Factor"] = [np.round(variance_inflation_factor(X_design.values, i + 1), 4)
                     for i in range(len(X_reduced.columns))]
vif

#### Linearity

In [None]:
sns.scatterplot(x=df['CATScore'], y=result_reg3.fittedvalues, alpha=0.1)
plt.plot([0, 30], [0, 30], color='red')
plt.ylabel("Fitted values")
plt.show()

#### Equal variance

In [None]:
sns.scatterplot(x=result_reg3.fittedvalues, y=result_reg3.resid, alpha=0.1)
plt.axhline(y=0, color='red')
plt.xlabel("Fitted values")
plt.ylabel("Residual")
plt.show()

In [None]:
sm.qqplot(result_reg3.resid, line='r')
plt.title("Normal Q-Q plot")
plt.show()

## Feature selection using the embedded method
### Ridge regression

In [None]:
model_ridge = Ridge(alpha=1, fit_intercept=True)
result_ridge = model_ridge.fit(X, y)
# Pair each feature name with its fitted coefficient (rounded to 6 decimals)
coef = pd.DataFrame({"Features": X.columns,
                     "Coefficients": np.round(result_ridge.coef_, 6)})
coef
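
Penalized coefficients depend on the scale of each feature, so features measured in very different units are penalized unevenly. A common remedy, sketched here on synthetic data as an assumption rather than as part of the original analysis, is to standardize the features inside a pipeline before fitting. The column names `small_scale` and `large_scale` are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features with equal true importance but very different units
rng = np.random.default_rng(3)
X_demo = pd.DataFrame({
    "small_scale": rng.normal(size=200),
    "large_scale": rng.normal(scale=1000.0, size=200),
})
y_demo = (3.0 * X_demo["small_scale"]
          + 0.003 * X_demo["large_scale"]
          + rng.normal(size=200))

# StandardScaler centers each column and scales it to unit variance,
# so the ridge penalty treats both features on a common scale
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_demo, y_demo)

coefs = pd.Series(pipe.named_steps["ridge"].coef_, index=X_demo.columns)
print(coefs.round(4))
```

After standardization the two coefficients are directly comparable: each is the expected change in the outcome per one standard deviation of the feature.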

### Lasso regression

In [None]:
model_lasso = Lasso(alpha=1,fit_intercept=True)
result_lasso = model_lasso.fit(X, y)
# Pair each feature name with its fitted coefficient (rounded to 6 decimals)
coef = pd.DataFrame({"Features": X.columns,
                     "Coefficients": np.round(result_lasso.coef_, 6)})
coef

In [None]:
model_lasso = Lasso(alpha=0.5,fit_intercept=True)
result_lasso = model_lasso.fit(X, y)
# Pair each feature name with its fitted coefficient (rounded to 6 decimals)
coef = pd.DataFrame({"Features": X.columns,
                     "Coefficients": np.round(result_lasso.coef_, 6)})
coef
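
The two Lasso fits above use hand-picked penalties (`alpha=1` and `alpha=0.5`). One way to choose the penalty from the data is cross-validation with scikit-learn's `LassoCV`. The sketch below uses synthetic data and standardized features; both choices are assumptions made for the illustration, not steps from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the COPD dataset): only x1 and x2 affect y
rng = np.random.default_rng(4)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=["x1", "x2", "noise1", "noise2"])
y_demo = 2.0 * X_demo["x1"] - 1.5 * X_demo["x2"] + rng.normal(size=300)

# LassoCV fits a grid of alpha values and keeps the one with the
# lowest 5-fold cross-validated mean squared error
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
pipe.fit(X_demo, y_demo)

lasso = pipe.named_steps["lassocv"]
coefs = pd.Series(lasso.coef_, index=X_demo.columns)
print("chosen alpha:", lasso.alpha_)
print(coefs.round(4))
```

Features whose coefficient is exactly zero at the chosen penalty are the ones the Lasso has dropped.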

### Elastic Net regression

In [None]:
model_enet = ElasticNet(alpha=1, l1_ratio=0.5, fit_intercept=True)
result_enet = model_enet.fit(X, y)
# Pair each feature name with its fitted coefficient (rounded to 6 decimals)
coef = pd.DataFrame({"Features": X.columns,
                     "Coefficients": np.round(result_enet.coef_, 6)})
coef

In [None]:
model_enet = ElasticNet(alpha=0.5, l1_ratio=0.5, fit_intercept=True)
result_enet = model_enet.fit(X, y)
# Pair each feature name with its fitted coefficient (rounded to 6 decimals)
coef = pd.DataFrame({"Features": X.columns,
                     "Coefficients": np.round(result_enet.coef_, 6)})
coef
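
The three penalized fits in this section differ only in their penalty term. As documented for scikit-learn's `Ridge`, `Lasso`, and `ElasticNet`, the objectives being minimized can be written as follows, where $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ the `alpha` argument, and $\rho$ the `l1_ratio` argument (the symbols $w$ and $\rho$ are notation introduced here, not names from the notebook):

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

Because the Ridge loss is not divided by $2n$, the same numeric `alpha` does not represent the same strength of penalty in `Ridge` as in `Lasso` or `ElasticNet`. With `l1_ratio=0.5`, the Elastic Net fits above split the penalty evenly between the L1 and L2 terms.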