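Let's look at how linearly dependent and linearly independent features behave!

Linear dependence
- Generate the expression x + 2x + 3x + 4x + 5x = y

Linear independence
- Generate the expression x + x^2 + x^3 + x^4 + x^5 = y

We feed each of these into a linear model, a DT (decision tree) model, and an MLP model to see how the regression coefficients and the performance differ!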
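
For reference, the two cases can be written out with the usual definition of linear dependence. The column symbols c_1, ..., c_5 and the coefficients a_i are notation introduced here for illustration, not taken from the notebook.

```latex
\[
c_1,\dots,c_5 \text{ are linearly dependent}
\iff \exists\,(a_1,\dots,a_5)\neq 0 \ \text{such that}\ \sum_{i=1}^{5} a_i c_i = 0 .
\]
\[
\text{Dependent case: } c_i = i\,x
\;\Rightarrow\; 2c_1 - c_2 = 0,
\qquad y = \sum_{i=1}^{5} i\,x = 15x .
\]
\[
\text{Independent case: } c_i = x^{i}
\;\Rightarrow\; \sum_{i=1}^{5} a_i x^{i} = 0 \ \text{for every sampled } x
\ \text{only if all } a_i = 0,
\]
\[
\text{provided } x \text{ takes at least five distinct nonzero values.}
\]
```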

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [2]:
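# 100 random integers in [0, 100), shaped as a single column
x = np.random.randint(100, size = (100, 1))
# Dependent case: every column is a multiple of x, so the matrix has rank 1
dependent_X = np.concatenate([x, 2*x, 3*x, 4*x, 5*x], axis = 1)
dependent_Y = dependent_X.sum(axis = 1)

# Independent case: the powers of x are linearly independent columns
independent_X = np.concatenate([x, x**2, x**3, x**4, x**5], axis = 1)
independent_Y = independent_X.sum(axis = 1)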

In [68]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

def get_linear_result(X, Y):
    model = LinearRegression(fit_intercept = True)
    model.fit(X, Y)
    score = model.score(X, Y)
    rank = model.rank_
    coef = model.coef_
    intercept = model.intercept_

    print('score: ', score)
    print('rank: ', rank)
    print('coef: ', coef)
    print('intercept: ', intercept)


def get_dt_result(X, Y, random_state):
    model = DecisionTreeRegressor(random_state = random_state)
    model.fit(X, Y)
    score = model.score(X, Y)
    feature_importances = model.feature_importances_
    print('random_state: ', random_state)
    print('score: ', score)
    print('feature_importances: ', feature_importances)

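def get_nn_result(X, Y):
    # No random_state is set, so every call starts from different random weights
    model = MLPRegressor(hidden_layer_sizes = (100, 50,))
    model.fit(X, Y)
    # best_loss_ is the lowest loss reached during fitting, on the raw scale of Y
    best_loss = model.best_loss_
    coef = model.coefs_
    intercept = model.intercepts_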

    print('best_loss: ', best_loss)
    # print('coef: ', coef)
    # print('intercept: ', intercept)

# LinearRegression

In [36]:
get_linear_result(dependent_X, dependent_Y)

score:  1.0
rank:  1
coef:  [0.27272727 0.54545455 0.81818182 1.09090909 1.36363636]
intercept:  2.2737367544323206e-13
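
The coefficients above match 3/11 times [1, 2, 3, 4, 5], which is what a minimum-norm least-squares solution would look like for this data. The sketch below checks that reading with `np.linalg.lstsq`. It is an illustration under assumptions, not the notebook's own code: the seed, the float conversion, and the variable names `w_min_norm` and `w_expected` are introduced here, and the fit is done without an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen here only for reproducibility
x = rng.integers(1, 100, size=(100, 1)).astype(float)
X = np.concatenate([x, 2 * x, 3 * x, 4 * x, 5 * x], axis=1)
y = X.sum(axis=1)  # equals 15 * x

# For a rank-deficient X, lstsq returns the minimum-norm least-squares solution
w_min_norm, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Closed form: w = c * [1, 2, 3, 4, 5] with c * (1 + 4 + 9 + 16 + 25) = 15
w_expected = (15 / 55) * np.arange(1, 6)

print(rank)
print(w_min_norm)
print(w_expected)
```

Any other coefficient vector whose weighted sum against [1, 2, 3, 4, 5] equals 15 would fit the data equally well; the minimum-norm choice is just the one this solver picks.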


In [49]:
dependent_Y

array([1110, 1425,  405,  675,  360,  105, 1095, 1440,  735, 1035,  960,
        525,  675,  990,  435, 1200,  135, 1170,  750,  315, 1350,  915,
         30,  345,  120, 1335, 1275,  870,  870, 1005, 1410,  180,  150,
       1170,  240,  795,  420,  510,  390, 1155,  495,  660, 1200,  645,
         15,  330, 1050, 1290, 1455,  555,  150,  120,  915, 1170,  915,
        405,   90, 1155, 1350, 1110,  270,  735,  975, 1080,  375, 1425,
       1200,  195,  975,  750, 1425,  300, 1395,  915,  615, 1410, 1050,
        690, 1245,  510, 1125,  540,  765,  735,  120,  825,  180,  210,
        285,  105,  630,  165, 1230,   15, 1035,  600, 1245,  705,  630,
        720])

In [53]:
x.reshape(-1)

array([74, 95, 27, 45, 24,  7, 73, 96, 49, 69, 64, 35, 45, 66, 29, 80,  9,
       78, 50, 21, 90, 61,  2, 23,  8, 89, 85, 58, 58, 67, 94, 12, 10, 78,
       16, 53, 28, 34, 26, 77, 33, 44, 80, 43,  1, 22, 70, 86, 97, 37, 10,
        8, 61, 78, 61, 27,  6, 77, 90, 74, 18, 49, 65, 72, 25, 95, 80, 13,
       65, 50, 95, 20, 93, 61, 41, 94, 70, 46, 83, 34, 75, 36, 51, 49,  8,
       55, 12, 14, 19,  7, 42, 11, 82,  1, 69, 40, 83, 47, 42, 48])

In [51]:
(x +  2*x + 3*x + 4*x + 5*x).reshape(-1)

array([1110, 1425,  405,  675,  360,  105, 1095, 1440,  735, 1035,  960,
        525,  675,  990,  435, 1200,  135, 1170,  750,  315, 1350,  915,
         30,  345,  120, 1335, 1275,  870,  870, 1005, 1410,  180,  150,
       1170,  240,  795,  420,  510,  390, 1155,  495,  660, 1200,  645,
         15,  330, 1050, 1290, 1455,  555,  150,  120,  915, 1170,  915,
        405,   90, 1155, 1350, 1110,  270,  735,  975, 1080,  375, 1425,
       1200,  195,  975,  750, 1425,  300, 1395,  915,  615, 1410, 1050,
        690, 1245,  510, 1125,  540,  765,  735,  120,  825,  180,  210,
        285,  105,  630,  165, 1230,   15, 1035,  600, 1245,  705,  630,
        720])

In [60]:
_x = 0.27272727 * dependent_X[:,0]
_x

array([20.18181798, 25.90909065,  7.36363629, 12.27272715,  6.54545448,
        1.90909089, 19.90909071, 26.18181792, 13.36363623, 18.81818163,
       17.45454528,  9.54545445, 12.27272715, 17.99999982,  7.90909083,
       21.8181816 ,  2.45454543, 21.27272706, 13.6363635 ,  5.72727267,
       24.5454543 , 16.63636347,  0.54545454,  6.27272721,  2.18181816,
       24.27272703, 23.18181795, 15.81818166, 15.81818166, 18.27272709,
       25.63636338,  3.27272724,  2.7272727 , 21.27272706,  4.36363632,
       14.45454531,  7.63636356,  9.27272718,  7.09090902, 20.99999979,
        8.99999991, 11.99999988, 21.8181816 , 11.72727261,  0.27272727,
        5.99999994, 19.0909089 , 23.45454522, 26.45454519, 10.09090899,
        2.7272727 ,  2.18181816, 16.63636347, 21.27272706, 16.63636347,
        7.36363629,  1.63636362, 20.99999979, 24.5454543 , 20.18181798,
        4.90909086, 13.36363623, 17.72727255, 19.63636344,  6.81818175,
       25.90909065, 21.8181816 ,  3.54545451, 17.72727255, 13.63

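We can see that the reference vector has changed: instead of x, the fitted model effectively builds on 0.2727 * x, and weighting that vector by 1, 4, 9, 16, 25 still reproduces y.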

In [61]:
(_x +  4*_x + 9*_x + 16*_x + 25*_x).reshape(-1)

array([1109.9999889 , 1424.99998575,  404.99999595,  674.99999325,
        359.9999964 ,  104.99999895, 1094.99998905, 1439.9999856 ,
        734.99999265, 1034.99998965,  959.9999904 ,  524.99999475,
        674.99999325,  989.9999901 ,  434.99999565, 1199.999988  ,
        134.99999865, 1169.9999883 ,  749.9999925 ,  314.99999685,
       1349.9999865 ,  914.99999085,   29.9999997 ,  344.99999655,
        119.9999988 , 1334.99998665, 1274.99998725,  869.9999913 ,
        869.9999913 , 1004.99998995, 1409.9999859 ,  179.9999982 ,
        149.9999985 , 1169.9999883 ,  239.9999976 ,  794.99999205,
        419.9999958 ,  509.9999949 ,  389.9999961 , 1154.99998845,
        494.99999505,  659.9999934 , 1199.999988  ,  644.99999355,
         14.99999985,  329.9999967 , 1049.9999895 , 1289.9999871 ,
       1454.99998545,  554.99999445,  149.9999985 ,  119.9999988 ,
        914.99999085, 1169.9999883 ,  914.99999085,  404.99999595,
         89.9999991 , 1154.99998845, 1349.9999865 , 1109.99998

In [62]:
(0.27272727 * dependent_X[:,0]) + (0.54545455 * dependent_X[:,1]) + (0.81818182 * dependent_X[:,2]) + (1.09090909  * dependent_X[:,3]) + (1.36363636 * dependent_X[:,4])

array([1109.99999926, 1424.99999905,  404.99999973,  674.99999955,
        359.99999976,  104.99999993, 1094.99999927, 1439.99999904,
        734.99999951, 1034.99999931,  959.99999936,  524.99999965,
        674.99999955,  989.99999934,  434.99999971, 1199.9999992 ,
        134.99999991, 1169.99999922,  749.9999995 ,  314.99999979,
       1349.9999991 ,  914.99999939,   29.99999998,  344.99999977,
        119.99999992, 1334.99999911, 1274.99999915,  869.99999942,
        869.99999942, 1004.99999933, 1409.99999906,  179.99999988,
        149.9999999 , 1169.99999922,  239.99999984,  794.99999947,
        419.99999972,  509.99999966,  389.99999974, 1154.99999923,
        494.99999967,  659.99999956, 1199.9999992 ,  644.99999957,
         14.99999999,  329.99999978, 1049.9999993 , 1289.99999914,
       1454.99999903,  554.99999963,  149.9999999 ,  119.99999992,
        914.99999939, 1169.99999922,  914.99999939,  404.99999973,
         89.99999994, 1154.99999923, 1349.9999991 , 1109.99999

In [18]:
get_linear_result(independent_X, independent_Y)

score:  1.0
rank:  5
coef:  [0.99999999 1.         1.         1.         1.        ]
intercept:  0.0
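
The `rank` values reported by `LinearRegression` (1 for the dependent data, 5 for the independent data) can be cross-checked directly with `np.linalg.matrix_rank`. This is a minimal sketch on a small hand-picked input (x = 1, ..., 10, an assumption made here to keep the numbers well conditioned), not on the notebook's random data.

```python
import numpy as np

# Small illustrative input; the notebook itself uses 100 random integers
x = np.arange(1, 11, dtype=float).reshape(-1, 1)

dependent = np.concatenate([x, 2 * x, 3 * x, 4 * x, 5 * x], axis=1)
independent = np.concatenate([x, x**2, x**3, x**4, x**5], axis=1)

# matrix_rank counts singular values above a numerical tolerance
rank_dep = np.linalg.matrix_rank(dependent)
rank_ind = np.linalg.matrix_rank(independent)
print(rank_dep, rank_ind)
```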


# DecisionTreeRegressor

In [19]:
for random_state in range(1, 11):
    get_dt_result(dependent_X, dependent_Y, random_state)
    print()

random_state:  1
score:  1.0
feature_importances:  [0.00252374 0.20164982 0.78987836 0.00179868 0.0041494 ]

random_state:  2
score:  1.0
feature_importances:  [0.02507944 0.13799863 0.06311274 0.76968275 0.00412644]

random_state:  3
score:  1.0
feature_importances:  [4.31396659e-04 1.42279637e-01 9.77387174e-03 8.33183425e-01
 1.43316690e-02]

random_state:  4
score:  1.0
feature_importances:  [0.14061879 0.004072   0.02107422 0.0640551  0.77017989]

random_state:  5
score:  1.0
feature_importances:  [0.01299489 0.06463138 0.90562379 0.01114598 0.00560396]

random_state:  6
score:  1.0
feature_importances:  [9.00707876e-02 1.72235629e-02 1.29823942e-01 7.62633321e-01
 2.48386105e-04]

random_state:  7
score:  1.0
feature_importances:  [0.0028127  0.0036742  0.01871088 0.90945296 0.06534926]

random_state:  8
score:  1.0
feature_importances:  [0.00497483 0.07507906 0.75081258 0.16728954 0.00184399]

random_state:  9
score:  1.0
feature_importances:  [0.00440544 0.19279247 0.01963767 0

In [20]:
for random_state in range(1, 11):
    get_dt_result(independent_X, independent_Y, random_state)
    print()

random_state:  1
score:  1.0
feature_importances:  [3.07078125e-03 1.24332531e-01 7.96280487e-01 3.93294709e-04
 7.59229069e-02]

random_state:  2
score:  1.0
feature_importances:  [0.0157423  0.12213388 0.07481671 0.78597099 0.00133612]

random_state:  3
score:  1.0
feature_importances:  [0.01236591 0.12262137 0.06616117 0.79493768 0.00391388]

random_state:  4
score:  1.0
feature_importances:  [0.12562783 0.01059352 0.07848212 0.01154205 0.77375448]

random_state:  5
score:  1.0
feature_importances:  [2.54251767e-03 1.45857339e-04 9.91118023e-01 9.56927275e-04
 5.23667491e-03]

random_state:  6
score:  1.0
feature_importances:  [1.27169410e-02 2.93202986e-03 1.89498355e-01 7.94801989e-01
 5.06853702e-05]

random_state:  7
score:  1.0
feature_importances:  [0.0038448  0.00249819 0.00207619 0.9902685  0.00131232]

random_state:  8
score:  1.0
feature_importances:  [1.16978929e-02 4.10270908e-04 8.47898495e-01 1.36797631e-01
 3.19571038e-03]

random_state:  9
score:  1.0
feature_importa
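
One way to read the two runs above: every column is an increasing function of the same x, so each column offers the tree the same set of candidate splits, and which column ends up credited can depend on `random_state` alone. The sketch below illustrates this under assumptions (the seed, the float conversion, and the three-seed loop are choices made here): the predictions stay exact for every seed while the importances are free to move.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # seed chosen here only for reproducibility
x = rng.integers(0, 100, size=(100, 1)).astype(float)
X = np.concatenate([x, 2 * x, 3 * x, 4 * x, 5 * x], axis=1)
y = X.sum(axis=1)

importances = []
predictions = []
for seed in range(1, 4):
    tree = DecisionTreeRegressor(random_state=seed).fit(X, y)
    importances.append(tree.feature_importances_)
    predictions.append(tree.predict(X))

# The importances may differ between seeds even though the fit is identical
for seed, imp in zip(range(1, 4), importances):
    print(seed, np.round(imp, 3))
```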

# MLP

In [69]:
get_nn_result(dependent_X, dependent_Y)

best_loss:  2.4044585392222966


In [70]:
get_nn_result(dependent_X, dependent_Y)

best_loss:  0.38402544489646084


In [71]:
get_nn_result(independent_X, independent_Y)

best_loss:  1968074374353.7734


In [72]:
get_nn_result(independent_X, independent_Y)

best_loss:  37245179964715.17
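
The raw `best_loss_` values are hard to compare because the targets differ in scale by many orders of magnitude and neither inputs nor targets are standardized. A hedged sketch of one alternative is to standardize both and report training R^2 instead. The helper name `scaled_mlp_r2`, the seed, and `max_iter=2000` are assumptions introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # seed chosen here only for reproducibility
x = rng.integers(0, 100, size=(100, 1)).astype(float)
dependent_X = np.concatenate([x, 2 * x, 3 * x, 4 * x, 5 * x], axis=1)
independent_X = np.concatenate([x, x**2, x**3, x**4, x**5], axis=1)


def scaled_mlp_r2(X, y, seed=0):
    """Hypothetical helper: fit an MLP on standardized X and y, return training R^2."""
    model = TransformedTargetRegressor(
        regressor=make_pipeline(
            StandardScaler(),  # standardize the input columns
            MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=2000,
                         random_state=seed),
        ),
        transformer=StandardScaler(),  # standardize the target as well
    )
    model.fit(X, y)
    return model.score(X, y)  # R^2, which does not depend on the target scale


r2_dependent = scaled_mlp_r2(dependent_X, dependent_X.sum(axis=1))
r2_independent = scaled_mlp_r2(independent_X, independent_X.sum(axis=1))
print(r2_dependent, r2_independent)
```

R^2 is used here because it is unaffected by rescaling the target, which makes the dependent and independent cases comparable on one axis.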


# Conclusion

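- Linear dependence
  - With LinearRegression, the fitted coefficients (0.27, 0.55, 0.82, 1.09, 1.36) differ from the ones we used to generate the data (all 1). They are, however, identical no matter how many times the model is run. The model estimates its coefficients from the data, and with linearly dependent features infinitely many coefficient vectors reproduce y exactly (the reference vector can be chosen in many ways). The deterministic solver simply returns one of them, which here appears to be the minimum-norm solution, (3/11) * [1, 2, 3, 4, 5]. The individual coefficients are therefore not uniquely determined and can be unstable.
  - DT also estimates its structure from the data. Under linear dependence the chosen feature changes with random_state, so feature selection (feature_importances_) is very unstable.
  - MLP uses nonlinear activation functions and can form nonlinear combinations, so it should be able to use all features even under linear dependence. In this experiment, however, its loss was poor and varied between runs (2.40 versus 0.38 on the same data). This seems to be because the data consists of very simple functions that need no complex learning; the unscaled inputs and targets and the unfixed random seed probably contribute as well.

- Linear independence
  - With linearly independent features, LinearRegression recovers the generating coefficients (all approximately 1, with rank 5), so all features are used well. The DT feature importances still vary with random_state here, most likely because every column is an increasing function of the same x. MLP again performs poorly on this simple data, although its best_loss_ values (about 2.0e12 and 3.7e13) are squared errors on targets as large as roughly 1e10, so they are not directly comparable with the dependent case.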
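
The non-uniqueness described for the dependent case can be checked by hand. The sketch below uses a small illustrative input (x = 1, ..., 10) and three candidate coefficient vectors; the dictionary labels are names made up here. All three reproduce y, which is why the individual coefficients carry no meaning when the columns are dependent.

```python
import numpy as np

x = np.arange(1, 11, dtype=float).reshape(-1, 1)  # small illustrative input
X = np.concatenate([x, 2 * x, 3 * x, 4 * x, 5 * x], axis=1)
y = X.sum(axis=1)  # equals 15 * x

# Three different coefficient vectors; each satisfies sum(i * w_i) = 15
candidates = {
    "generating weights": np.ones(5),
    "minimum-norm weights": (3 / 11) * np.arange(1, 6),
    "first column only": np.array([15.0, 0.0, 0.0, 0.0, 0.0]),
}

# Largest absolute prediction error for each candidate
max_errors = {name: np.max(np.abs(X @ w - y)) for name, w in candidates.items()}
for name, err in max_errors.items():
    print(name, err)
```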


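In the end, the difference between linear dependence and linear independence seems to come down to whether a model can use all features in a stable way, and whether the regression coefficients are stable or unstable.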