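##### Boston house price prediction model
- Dataset : boston.csv
- Learning method : supervised learning >> regression
- Features / independent variables : 13
- Target / dependent variable : 1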

[1] Data preparation

In [2]:
# Load modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split

In [3]:
# Path to the data file
DATA_FILE = '../Data/boston.csv'

In [4]:
# CSV => DataFrame
dataDF = pd.read_csv(DATA_FILE)
dataDF.head(2)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6


In [5]:
# Check basic information about the data
dataDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


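[2] Preprocessing
- [2-1] Data cleaning

- Check for anomalous data: missing values, duplicates, outliers, and the unique values of each column

- [2-2] Standardization & normalization ===> whether it helps performance depends on the model and the data
    * Models that assume normally distributed inputs ==> StandardScaler, MinMaxScaler, log transform
    * Reducing range differences between features ==> feature scaling: MinMaxScaler, RobustScaler, ...
    * Categorical features ===> numeric encoding: OneHotEncoder, OrdinalEncoder
    * String targets ===> integer label encoding: LabelEncoder

- [2-3] Split features and target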

In [6]:
featureDF = dataDF.iloc[:, : -1]
targetSR = dataDF["MEDV"]

In [7]:
print(f"featureDF : {featureDF.shape} targetSR : {targetSR.shape}")

featureDF : (506, 13) targetSR : (506,)
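
The cleaning checks listed under [2-1] are not run in this notebook. A minimal sketch of what they could look like, using a small made-up frame (`toyDF` and its values are hypothetical stand-ins for `dataDF`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dataDF (values are made up)
toyDF = pd.DataFrame({
    "RM":   [6.5, 6.4, np.nan, 6.4, 7.1],
    "CHAS": [0, 0, 1, 0, 1],
    "MEDV": [24.0, 21.6, 34.7, 21.6, 36.2],
})

# Missing values per column
missing_counts = toyDF.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_count = toyDF.duplicated().sum()

# Distinct values per column (a low count hints at a categorical column)
unique_counts = toyDF.nunique()

# IQR rule as one possible outlier check on a numeric column
q1, q3 = toyDF["MEDV"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toyDF["MEDV"] < q1 - 1.5 * iqr) | (toyDF["MEDV"] > q3 + 1.5 * iqr)

print(missing_counts, duplicate_count, unique_counts, outlier_mask.sum(), sep="\n")
```

The same calls should apply to `dataDF` unchanged. The 1.5 x IQR threshold is a common convention, not something this notebook prescribes.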


[3] Training preparation

[3-1] Split into training and test datasets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR, random_state = 10)

In [9]:
print(f"X_train : {X_train.shape} y_train : {y_train.shape}")
print(f"X_test : {X_test.shape} y_test : {y_test.shape}")


X_train : (379, 13) y_train : (379,)
X_test : (127, 13) y_test : (127,)
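
The 379 / 127 split follows from the default `test_size` of 0.25, with the test count rounded up. A small check on placeholder arrays of the same shape (`X_demo` and `y_demo` are stand-ins, not the real data):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506                      # row count reported by dataDF.info()
X_demo = np.zeros((n_samples, 13))   # placeholder features
y_demo = np.zeros(n_samples)         # placeholder target

Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, random_state=10)

# Default test_size is 0.25; the test count is rounded up
expected_test = math.ceil(0.25 * n_samples)
print(Xa.shape, Xb.shape, expected_test)
```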


[3-2] Fit the scaler on the training dataset

In [10]:
### - Numeric feature ranges differ widely ==> apply scaling
ssScaler = StandardScaler()

ssScaler.fit(X_train)

In [11]:
X_train_scaled = ssScaler.transform(X_train)
X_test_scaled = ssScaler.transform(X_test)
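
The notebook imports three scalers but uses only `StandardScaler`. As a rough illustration of how they differ, the sketch below applies each one to a single made-up column containing one extreme value (the numbers are invented for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Made-up single feature with one extreme value
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled_by = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # fit_transform learns the statistics from col and applies them
    scaled_by[type(scaler).__name__] = scaler.fit_transform(col).ravel()

for name, values in scaled_by.items():
    print(name, values.round(3))
```

`RobustScaler` centers on the median and divides by the interquartile range, so the four ordinary values stay spread out, while `MinMaxScaler` squeezes them close to 0 because the extreme value defines the maximum.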


[4] Training ==> run with cross-validation

In [12]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge, Lasso

In [13]:
### Hyperparameters drive model performance, so tune alpha
# Note: alpha = 0 is ordinary least squares (scikit-learn advises LinearRegression for that case),
#       and max_iter = 3 is very small for coordinate descent, so convergence warnings can appear.
alpha_values = [0., 1., 10, 100]

for value in alpha_values:
    # Create the model instance (this cell fits Lasso, not Ridge)
    lasso_model = Lasso(alpha = value, max_iter = 3)    # default alpha is 1.0

    # Run training with cross-validation
    # - cv : 3 folds
    # - scoring : 'neg_mean_squared_error', 'r2'
    # - return_train_score : also report scores on the training folds
    # - return_estimator : keep the estimator fitted on each fold
    result = cross_validate(lasso_model, X_train_scaled, y_train,
                            cv = 3,
                            scoring = ['neg_mean_squared_error', 'r2'],
                            return_train_score = True,
                            return_estimator = True)
    print(result)

    resultDF = pd.DataFrame(result)[['test_r2', 'train_r2']]
    resultDF['diff'] = resultDF['test_r2'] - resultDF['train_r2']
    # Fold whose test and train R2 are closest (smallest absolute gap)
    best_idx = resultDF['diff'].abs().idxmin()
    print(best_idx, resultDF.loc[best_idx, 'diff'])

    print(result['estimator'][best_idx].coef_)
    print(f'[Lasso (alpha = {value})]')
    print(resultDF, "\n\n")

    model = result['estimator'][best_idx]
    print(model.predict(X_train_scaled))

{'fit_time': array([0.00812006, 0.        , 0.        ]), 'score_time': array([0.00842166, 0.        , 0.        ]), 'estimator': [Lasso(alpha=0.0, max_iter=3), Lasso(alpha=0.0, max_iter=3), Lasso(alpha=0.0, max_iter=3)], 'test_neg_mean_squared_error': array([-17.97755983, -24.15628474, -25.25978452]), 'train_neg_mean_squared_error': array([-22.0562488 , -19.76097481, -19.44626821]), 'test_r2': array([0.73873047, 0.73930894, 0.6443532 ]), 'train_r2': array([0.73246304, 0.71790954, 0.7594718 ])}


0 0.006267434331734156
[-0.76918209  1.30798802 -1.3660128   0.70871821 -1.12810945  3.13078874
  0.20140226 -3.18951128  0.40006951 -1.02796444 -1.33246342  1.05170534
 -2.85931196]
[Lasso (alpha = 0.0)]
    test_r2  train_r2      diff
0  0.738730  0.732463  0.006267
1  0.739309  0.717910  0.021399
2  0.644353  0.759472 -0.115119 


[20.89447192 24.91350293 34.14210688 33.45356876  7.55452514 35.95547143
 24.38676976 17.01412673 23.94896554 27.79993157 36.28033589  5.49691167
  6.11669451 28.81216952 12.52507926 17.79148918 19.89042972  5.96951469
 13.72090277 38.53677317 26.33652325 23.22842271 25.60682368 12.47997489
 20.60321099 38.08215048 21.0730438  10.0118152  17.66468448 25.48698294
  8.92582647 16.03731787 27.44502163 11.11508975 10.92768036 17.87410136
 18.7475499  32.3304534  22.59427867 26.25927632 10.22140401 22.57244524
  6.0093879  27.44669466 22.21201488 23.88449041 19.3017314  26.68771245
 17.08616762 12.23982655 24.82742478 22.24397153 20.80644297 20.73943852
 22.53644

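
The loop above prints per-fold details for each alpha, which is hard to compare across values. One possible way to condense it is sketched below. It makes several assumptions that are not in the original cell: it uses `Ridge` (the model the labels refer to), synthetic data from `make_regression` instead of the Boston split, and a `make_pipeline` wrapper so that the scaler is refit inside each fold rather than once on the whole training set.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (not the Boston data)
X_syn, y_syn = make_regression(n_samples=300, n_features=13,
                               noise=15.0, random_state=10)

rows = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    # Scaling lives inside the pipeline, so each fold fits its own scaler
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    cv = cross_validate(model, X_syn, y_syn, cv=3,
                        scoring="r2", return_train_score=True)
    rows.append({
        "alpha": alpha,
        "mean_test_r2": cv["test_score"].mean(),
        "mean_train_r2": cv["train_score"].mean(),
    })

summaryDF = pd.DataFrame(rows)
# Gap between train and test R2 as a rough overfitting indicator
summaryDF["gap"] = summaryDF["mean_train_r2"] - summaryDF["mean_test_r2"]
best_row = summaryDF.loc[summaryDF["mean_test_r2"].idxmax()]

print(summaryDF)
print("best alpha by mean test R2:", best_row["alpha"])
```

One row per alpha makes it easier to see where the mean test score peaks and how the train/test gap moves as regularization grows.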


- Run hyperparameter tuning and cross-validation together

In [14]:
from sklearn.model_selection import GridSearchCV


In [15]:
# Set the Ridge hyperparameter values to search
params = {'alpha' : [0., 0.1, 0.5, 1.0],
          'max_iter' : [3, 5]}

# ==> 0., 3 => Model  # ==> 0., 5 => Model
# ==> 0.1, 3 => Model  # ==> 0.1, 5 => Model
# ==> 0.5, 3 => Model  # ==> 0.5, 5 => Model
# ==> 1.0, 3 => Model  # ==> 1.0, 5 => Model
# ==> 8 Ridge candidates in total (4 alpha values x 2 max_iter values)
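
The eight combinations listed in the comments can also be enumerated programmatically. A short sketch using `ParameterGrid` from scikit-learn, with the same `params` dictionary; the fold count of 3 is taken from the `GridSearchCV` call in this notebook:

```python
from sklearn.model_selection import ParameterGrid

params = {'alpha': [0., 0.1, 0.5, 1.0],
          'max_iter': [3, 5]}

# Expand the grid into the individual candidate settings
candidates = list(ParameterGrid(params))
for i, candidate in enumerate(candidates):
    print(i, candidate)

# With cv = 3, every candidate is fitted once per fold
n_fits = len(candidates) * 3
print(len(candidates), "candidates,", n_fits, "fits")
```

The enumeration order matches the row order of `cv_results_`, so a candidate's position here corresponds to its row index there.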

In [16]:
# Create the model instance
rModel = Ridge()

# Create the GridSearchCV instance
searchCV = GridSearchCV(rModel, params, cv = 3, verbose = True, return_train_score = True)

In [17]:
# Run training
searchCV.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


In [18]:
# Check the best parameters after fit()
searchCV.best_params_

{'alpha': 1.0, 'max_iter': 3}

In [19]:
best_model = searchCV.best_estimator_
best_model

In [20]:
searchCV.best_index_

6
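
`best_index_` is the row of `cv_results_` that holds the winning candidate. The notebook stops before scoring on the held-out split, so the sketch below shows both the relationship and one possible final check. It runs on synthetic data from `make_regression` with a reduced grid; all names ending in `_syn`, `_tr`, `_te` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Boston data)
X_syn, y_syn = make_regression(n_samples=300, n_features=13,
                               noise=15.0, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=10)

# Fit the scaler on the training split only, then transform both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

search = GridSearchCV(Ridge(), {'alpha': [0.1, 0.5, 1.0]}, cv=3)
search.fit(X_tr_s, y_tr)

idx = search.best_index_
# The row at best_index_ carries the best parameters and the best mean score
same_params = search.cv_results_['params'][idx] == search.best_params_
same_score = search.cv_results_['mean_test_score'][idx] == search.best_score_

# best_estimator_ is refit on the full training split; score it on held-out data
test_r2 = search.best_estimator_.score(X_te_s, y_te)
print(idx, same_params, same_score, round(test_r2, 3))
```

In the results table of this notebook, rows 6 and 7 share rank 1 with identical scores, and `best_index_` reports the first of them.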

In [21]:
resultDF = pd.DataFrame(searchCV.cv_results_)
resultDF

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_max_iter,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,0.005446,0.007701,0.002667,0.003772,0.0,3,"{'alpha': 0.0, 'max_iter': 3}",0.747022,0.756482,0.680801,0.728101,0.033669,7,0.75572,0.740082,0.786156,0.760653,0.019131
1,0.000214,0.000302,0.0,0.0,0.0,5,"{'alpha': 0.0, 'max_iter': 5}",0.747022,0.756482,0.680801,0.728101,0.033669,7,0.75572,0.740082,0.786156,0.760653,0.019131
2,0.002038,0.002883,0.0,0.0,0.1,3,"{'alpha': 0.1, 'max_iter': 3}",0.747159,0.756462,0.680831,0.728151,0.033675,5,0.75572,0.740081,0.786156,0.760652,0.019131
3,0.0,0.0,0.001021,0.001444,0.1,5,"{'alpha': 0.1, 'max_iter': 5}",0.747159,0.756462,0.680831,0.728151,0.033675,5,0.75572,0.740081,0.786156,0.760652,0.019131
4,0.000556,0.000786,0.000675,0.000955,0.5,3,"{'alpha': 0.5, 'max_iter': 3}",0.747682,0.756385,0.680927,0.728331,0.033708,3,0.755705,0.74007,0.786141,0.760639,0.019129
5,0.0,0.0,0.0,0.0,0.5,5,"{'alpha': 0.5, 'max_iter': 5}",0.747682,0.756385,0.680927,0.728331,0.033708,3,0.755705,0.74007,0.786141,0.760639,0.019129
6,0.0,0.0,0.0,0.0,1.0,3,"{'alpha': 1.0, 'max_iter': 3}",0.748283,0.756292,0.680991,0.728522,0.033768,1,0.755663,0.740039,0.786097,0.7606,0.019124
7,0.00401,0.005671,0.0,0.0,1.0,5,"{'alpha': 1.0, 'max_iter': 5}",0.748283,0.756292,0.680991,0.728522,0.033768,1,0.755663,0.740039,0.786097,0.7606,0.019124
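
In the table above, each pair of rows that shares an alpha has identical scores for `max_iter` 3 and 5, which suggests `max_iter` has no effect with Ridge's default solver on this data. The sketch below trims `cv_results_` to the columns that matter and checks the same pattern on synthetic data (an assumption: `make_regression` output with a reduced alpha grid, not the Boston split).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Boston data)
X_syn, y_syn = make_regression(n_samples=300, n_features=13,
                               noise=15.0, random_state=10)

grid = {'alpha': [0.1, 0.5, 1.0], 'max_iter': [3, 5]}
search = GridSearchCV(Ridge(), grid, cv=3, return_train_score=True)
search.fit(X_syn, y_syn)

# Keep only the columns needed to compare candidates, best rank first
cols = ['param_alpha', 'param_max_iter',
        'mean_test_score', 'mean_train_score', 'rank_test_score']
compactDF = pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score')
print(compactDF)

# Spread of the mean test score across max_iter values, per alpha
spread = compactDF.groupby('param_alpha')['mean_test_score'].agg(
    lambda s: s.max() - s.min())
print(spread)
```

A spread of zero for every alpha means `max_iter` could be dropped from the grid, halving the number of fits.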
