<a href="https://colab.research.google.com/github/JakeOh/202105_itw_bd26/blob/main/lab_ml/ml06_multiple_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

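A linear regression model with several features (independent variables):
* Linear regression using first-order terms only
* Linear regression that includes higher-order terms
* Regularization: a technique for reducing overfitting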

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

# Data preparation

In [2]:
# GitHub URL of the dataset
fish_csv = 'https://github.com/rickiepark/hg-mldl/raw/master/fish.csv'

In [3]:
# Create the DataFrame
fish = pd.read_csv(fish_csv)

In [4]:
fish.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Species   159 non-null    object 
 1   Weight    159 non-null    float64
 2   Length    159 non-null    float64
 3   Diagonal  159 non-null    float64
 4   Height    159 non-null    float64
 5   Width     159 non-null    float64
dtypes: float64(5), object(1)
memory usage: 7.6+ KB


In [5]:
fish.head()

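|   | Species | Weight | Length | Diagonal | Height | Width |
|---|---------|--------|--------|----------|--------|-------|
| 0 | Bream | 242.0 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | Bream | 290.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | Bream | 340.0 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 29.0 | 33.5 | 12.73 | 4.4555 |
| 4 | Bream | 430.0 | 29.0 | 34.0 | 12.444 | 5.134 |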


Goal of the regression: predict the weight (Weight) of a perch from its other features (Length, Diagonal, Height, Width).

Weight ~ Length + Diagonal + Height + Width

In [6]:
perch = fish[fish.Species == 'Perch']  # select only the perch rows
perch.head()

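|    | Species | Weight | Length | Diagonal | Height | Width |
|----|---------|--------|--------|----------|--------|-------|
| 72 | Perch | 5.9 | 8.4 | 8.8 | 2.112 | 1.408 |
| 73 | Perch | 32.0 | 13.7 | 14.7 | 3.528 | 1.9992 |
| 74 | Perch | 40.0 | 15.0 | 16.0 | 3.824 | 2.432 |
| 75 | Perch | 51.5 | 16.2 | 17.2 | 4.5924 | 2.6316 |
| 76 | Perch | 70.0 | 17.4 | 18.5 | 4.588 | 2.9415 |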


In [9]:
# features (independent variables)
X = perch[['Length', 'Diagonal', 'Height', 'Width']].values

In [10]:
X.shape

(56, 4)

In [13]:
# label / target (dependent variable)
y = perch['Weight'].values

In [14]:
y.shape

(56,)

# train/test split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=42)

In [16]:
X_train.shape, X_test.shape

((42, 4), (14, 4))

In [17]:
y_train.shape, y_test.shape

((42,), (14,))
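The split sizes above follow directly from `test_size=0.25`: a quarter of the 56 rows (14) go to the test set and the remaining 42 to the training set. The following is a minimal sketch of that behaviour on hypothetical stand-in arrays (`X_demo`, `y_demo` are illustrative and are not the perch data); it also checks that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the perch features (56 x 4).
X_demo = np.arange(56 * 4, dtype=float).reshape(56, 4)
y_demo = np.arange(56, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25,
                                          random_state=42)

# 25% of 56 rows is 14 test rows, leaving 42 training rows.
print(X_tr.shape, X_te.shape)

# Splitting again with the same random_state gives the same test rows.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo,
                                      test_size=0.25,
                                      random_state=42)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```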

# Linear regression with first-order terms only

$
\hat{y} = w_0 + w_1 \times x_1 + w_2 \times x_2 + w_3 \times x_3 + w_4 \times x_4
$

In [18]:
lin_reg = LinearRegression()  # create the linear regression estimator

In [19]:
lin_reg.fit(X_train, y_train)  # fit the estimator to the training data (train the model)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [23]:
lin_reg.intercept_  # w0: intercept (bias)

-610.0275364260526

In [24]:
lin_reg.coef_  # array of coefficients [w1 w2 w3 w4]
# w1 * length + w2 * diagonal + w3 * height + w4 * width

array([-40.18338554,  47.80681727,  67.34086612,  35.34904264])
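A fitted `LinearRegression` predicts by evaluating the equation above: the intercept plus the dot product of the features with the coefficients. The sketch below illustrates this on hypothetical noise-free data (`X_demo`, `true_w`, and the intercept `10.0` are made up for the example), so the recovered coefficients can be compared against known values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 30 samples, 4 features, generated from known weights.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
true_w = np.array([1.5, -2.0, 0.5, 3.0])
y_demo = 10.0 + X_demo @ true_w  # no noise, so OLS can recover the weights

reg = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 for every row.
manual = reg.intercept_ + X_demo @ reg.coef_
matches = np.allclose(manual, reg.predict(X_demo))
print(matches)
```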

In [25]:
train_pred = lin_reg.predict(X_train)  # predictions on the training set

In [26]:
train_pred[:5]

array([ 50.07831254, 149.63115115,  26.52323981, -11.85322276,
       727.07849472])

In [28]:
y_train[:5]  # actual values

array([ 85., 135.,  78.,  70., 700.])

In [29]:
# RMSE
np.sqrt(mean_squared_error(y_train, train_pred))

73.07651173088374

In [30]:
# coefficient of determination (R^2)
r2_score(y_train, train_pred)

0.9567246116638569
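To make the two metrics concrete, the sketch below computes RMSE and R² by hand and compares them with the scikit-learn functions. It uses only the first five training targets and predictions printed above, so the resulting numbers describe those five rows and will differ from the full training-set values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# First five training targets and predictions, copied from the outputs above.
y_true = np.array([85., 135., 78., 70., 700.])
y_pred = np.array([50.07831254, 149.63115115, 26.52323981,
                   -11.85322276, 727.07849472])

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```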

In [31]:
test_pred = lin_reg.predict(X_test)  # predictions on the test set

In [32]:
test_pred[:5]

array([-334.87262176,   53.65873458,  318.38723843,  178.88939119,
        155.66294578])

In [33]:
y_test[:5]  # actual test-set values

array([  5.9, 100. , 250. , 130. , 130. ])

In [34]:
np.sqrt(mean_squared_error(y_test, test_pred))  # RMSE

110.1835310901991

In [35]:
r2_score(y_test, test_pred)  # coefficient of determination (R^2)

0.879046561599027

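The first-order-only linear regression model overfits slightly: R² drops from about 0.957 on the training set to about 0.879 on the test set, and RMSE rises from about 73.1 to about 110.2. It also predicts negative weights for some small fish (for example, about -334.9 for a test fish whose actual weight is 5.9).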

# Linear regression with terms up to second order

$
\hat{y} = w_0 + w_1 \times x_1 + \cdots + w_4 \times x_4 + w_5 \times x_1^2 + \cdots
$


In [36]:
poly = PolynomialFeatures(include_bias=False)  # create a transformer that adds polynomial terms
# degree=2 (default): include terms up to second order
# interaction_only=False (default): add squares and cross terms (x1^2, x2^2, x1*x2, ...)

In [38]:
poly.fit_transform(X_train)[:2]

array([[ 19.6       ,  20.8       ,   5.1376    ,   3.0368    ,
        384.16      , 407.68      , 100.69696   ,  59.52128   ,
        432.64      , 106.86208   ,  63.16544   ,  26.39493376,
         15.60186368,   9.22215424],
       [ 22.        ,  23.5       ,   5.875     ,   3.525     ,
        484.        , 517.        , 129.25      ,  77.55      ,
        552.25      , 138.0625    ,  82.8375    ,  34.515625  ,
         20.709375  ,  12.425625  ]])
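The column order in the output above is: the original features, then every square and pairwise product. A small sketch on a hypothetical two-feature row (`demo` is illustrative only) makes the pattern easier to see, and also confirms that 4 input features expand to 14 columns (4 first-order terms plus 10 second-order terms).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single row with two features: x0 = 2, x1 = 3.
demo = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(demo)

# Columns are x0, x1, x0^2, x0*x1, x1^2.
print(expanded)  # [[2. 3. 4. 6. 9.]]

# With 4 input features the transformer produces 14 output columns.
n_out = PolynomialFeatures(degree=2, include_bias=False).fit(np.zeros((1, 4))).n_output_features_
print(n_out)  # 14
```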

In [39]:
scaler = StandardScaler()  # create the standardization transformer

In [40]:
lin_reg = LinearRegression()  # create the ML estimator

In [41]:
# Create the Pipeline object
model = Pipeline(steps=[('poly', poly),
                        ('scaler', scaler),
                        ('lin_reg', lin_reg)])

In [42]:
# Fit the ML model to the data (train on the training set).
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('poly',
                 PolynomialFeatures(degree=2, include_bias=False,
                                    interaction_only=False, order='C')),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lin_reg',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)
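Calling `fit` on the pipeline runs `fit_transform` on each transformer in order and then fits the final estimator; calling `predict` applies `transform` (not `fit_transform`) before predicting. The sketch below checks that equivalence on hypothetical data; the names ending in `_demo` and the target formula are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: 40 samples, 2 features, quadratic target.
rng = np.random.default_rng(1)
X_demo = rng.uniform(1.0, 10.0, size=(40, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2

# Pipeline version.
pipe = Pipeline(steps=[('poly', PolynomialFeatures(include_bias=False)),
                       ('scaler', StandardScaler()),
                       ('lin_reg', LinearRegression())])
pipe.fit(X_demo, y_demo)

# Manual version: fit each step on the training data in the same order.
poly_demo = PolynomialFeatures(include_bias=False)
scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(poly_demo.fit_transform(X_demo))
reg_demo = LinearRegression().fit(X_scaled, y_demo)

# New rows are only transformed (never re-fitted) before prediction.
X_new = rng.uniform(1.0, 10.0, size=(5, 2))
manual_pred = reg_demo.predict(scaler_demo.transform(poly_demo.transform(X_new)))
same_pred = np.allclose(pipe.predict(X_new), manual_pred)
print(same_pred)
```

Keeping the steps in one pipeline means the scaler statistics come only from the training data, which avoids accidentally fitting the scaler on test rows.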

In [43]:
model['lin_reg'].intercept_  # intercept found by the linear regression step after fitting

400.83333333332587
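The intercept changes character once the features are standardized. `StandardScaler` centers every column at zero on the training set, and ordinary least squares with an intercept then returns the mean of the training targets as the intercept, so the value above would be expected to match `y_train.mean()`. A minimal sketch on hypothetical random data (not the perch data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 50 samples, 3 features, arbitrary targets.
rng = np.random.default_rng(2)
X_demo = rng.uniform(5.0, 40.0, size=(50, 3))
y_demo = rng.uniform(0.0, 1000.0, size=50)

# After standardization every feature column has zero mean.
X_std = StandardScaler().fit_transform(X_demo)
reg = LinearRegression().fit(X_std, y_demo)

# With zero-mean features, the fitted intercept equals the target mean.
intercept_is_mean = np.isclose(reg.intercept_, y_demo.mean())
print(intercept_is_mean)
```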

In [44]:
model['lin_reg'].coef_  # coefficients found by the linear regression step after fitting

array([   -443.26816039,    1150.91134799,    -650.22360319,
          -368.62831244,  115424.97558536, -210083.78541706,
        -49872.08633924,   29100.85132271,   91656.18352525,
         53699.90248992,  -27521.03052328,    1226.11352267,
         -5243.73927458,    2288.55011685])

In [45]:
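# List of polynomial terms generated by the PolynomialFeatures transformer.
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
model['poly'].get_feature_names()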

['x0',
 'x1',
 'x2',
 'x3',
 'x0^2',
 'x0 x1',
 'x0 x2',
 'x0 x3',
 'x1^2',
 'x1 x2',
 'x1 x3',
 'x2^2',
 'x2 x3',
 'x3^2']

In [46]:
train_pred = model.predict(X_train)  # predictions on the training set

In [47]:
train_pred[:5]

array([ 86.22462498, 117.8371985 ,  65.36623277,  51.32036181,
       688.61814191])

In [48]:
y_train[:5]

array([ 85., 135.,  78.,  70., 700.])

In [49]:
np.sqrt(mean_squared_error(y_train, train_pred))  # training-set RMSE

31.408812188346158

In [50]:
r2_score(y_train, train_pred)  # training-set R^2

0.9920055538341124

In [51]:
test_pred = model.predict(X_test)

In [52]:
test_pred[:5]

array([ 23.11093892,  16.86703258, 283.14558245, 126.83444969,
       121.43654058])

In [53]:
y_test[:5]

array([  5.9, 100. , 250. , 130. , 130. ])

In [54]:
np.sqrt(mean_squared_error(y_test, test_pred))  # test-set RMSE

71.36392024375351

In [55]:
r2_score(y_test, test_pred)  # test-set R^2

0.949260960155265