## Cross-Validation

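- A way to evaluate a model during training that rotates the validation split, so the model does not end up overfitted to a single fixed validation set
- Puts the splitting strategies covered in `7.Split` of `02.PreProcessing` to practical use when evaluating model performance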
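
To make the idea concrete, here is a minimal sketch of what cross-validation does when written out by hand. The dataset sizes, the seed, and the names `X_demo`, `y_demo`, and `fold_scores` are assumptions made for this illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data (sizes and seed are arbitrary choices for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Split the data into 5 folds; each fold serves as the validation set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, val_idx in kf.split(X_demo):
    # Fit a fresh model on the training part of this split
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Score it on the part that was held out
    y_val_pred = model.predict(X_demo[val_idx])
    fold_scores.append(r2_score(y_demo[val_idx], y_val_pred))

print(f"Per-fold R^2: {np.round(fold_scores, 4)}")
print(f"Mean R^2: {np.mean(fold_scores):.4f}")
```

The helper functions in the rest of this section wrap this same loop, so the manual version is mostly useful for understanding what they return.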

#### Cross-Validation Workflow

![](https://velog.velcdn.com/images/newnew_daddy/post/76491917-2cd6-49d9-a417-59efacdf64dd/image.png)


#### cross_val_score
- Runs cross-validation with a single scoring metric and returns only the score obtained on each fold

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
import pandas as pd

# Generate a synthetic regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_targets=1,
    noise=50
    )

In [2]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

In [3]:
from sklearn.model_selection import cross_val_score

## Run cross-validation
## Available scoring metrics -> https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
scores = cross_val_score(lr, X, y, cv=5, scoring='r2')

print(f"Cross-Validation Scores: {scores}")
print(f"Mean Score: {scores.mean()}")

Cross-Validation Scores: [0.92869561 0.94849197 0.94867325 0.93794079 0.93186519]
Mean Score: 0.9391333624815659
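
When `cv` is an integer and the estimator is a regressor, scikit-learn uses an unshuffled `KFold`. As a hedged sketch of how a splitter object and a different metric could be passed instead, the example below uses a shuffled `KFold` with `neg_mean_squared_error`. The data, the seeds, and the names `splitter`, `mse_scores`, and `rmse_per_fold` are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for this sketch (seed chosen arbitrarily)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=50, random_state=0)

# A splitter object gives explicit control over shuffling and reproducibility
splitter = KFold(n_splits=5, shuffle=True, random_state=42)

mse_scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=splitter,
    scoring="neg_mean_squared_error",
)

# Error metrics are negated so that "higher is better" holds for every scorer;
# flip the sign and take the square root to read them as RMSE
rmse_per_fold = (-mse_scores) ** 0.5

print(f"RMSE per fold: {rmse_per_fold}")
print(f"Mean RMSE: {rmse_per_fold.mean():.3f}")
```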


In [None]:
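# List the names accepted by the `scoring` parameter
# (public API; avoids importing the private `_SCORERS` dict)
from sklearn.metrics import get_scorer_names

get_scorer_names()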

#### cross_validate
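- Accepts multiple scoring metrics at once
- Returns a dict that also includes the fit time (`fit_time`) and scoring time (`score_time`) for each fold
- More flexible and informative than `cross_val_score`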

In [4]:
from sklearn.model_selection import cross_validate

cv = cross_validate(
    lr,
    X,
    y,
    cv=5,
    scoring=['r2', 'neg_root_mean_squared_error'],
    return_train_score=True
    )

indices = [f"fold_{i}" for i in range(1,6)]

pd.DataFrame(cv, index=indices)

| | fit_time | score_time | test_r2 | train_r2 | test_neg_root_mean_squared_error | train_neg_root_mean_squared_error |
|---|---|---|---|---|---|---|
| fold_1 | 0.001106 | 0.000584 | 0.928696 | 0.9442 | -55.179173 | -50.107214 |
| fold_2 | 0.000621 | 0.000351 | 0.948492 | 0.939314 | -50.605887 | -51.225276 |
| fold_3 | 0.000569 | 0.000319 | 0.948673 | 0.939265 | -50.488786 | -51.210632 |
| fold_4 | 0.000314 | 0.00028 | 0.937941 | 0.942179 | -53.613127 | -50.456033 |
| fold_5 | 0.000304 | 0.00027 | 0.931865 | 0.943097 | -48.013394 | -51.830098 |
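
Per-fold results are usually summarized before comparing models. The sketch below is one possible way to do that: it flips the sign of the negated RMSE columns and reports the mean and standard deviation across folds. The data, the seed, and the names `results_df`, `rmse_cols`, and `summary` are assumptions introduced for this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data for this sketch (seed chosen arbitrarily)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=50, random_state=0)

results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=5,
    scoring=["r2", "neg_root_mean_squared_error"],
    return_train_score=True,
)
results_df = pd.DataFrame(results)

# Flip the sign so RMSE reads as a positive error, then shorten the column names
rmse_cols = [c for c in results_df.columns if c.endswith("neg_root_mean_squared_error")]
results_df[rmse_cols] = -results_df[rmse_cols]
results_df = results_df.rename(
    columns=lambda c: c.replace("neg_root_mean_squared_error", "rmse")
)

# Mean and standard deviation across folds for train and test metrics
summary = results_df[["train_r2", "test_r2", "train_rmse", "test_rmse"]].agg(["mean", "std"])
print(summary)
```

Comparing the train and test columns side by side is a quick check for overfitting: a large gap between them suggests the model fits the training folds much better than the held-out folds.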


#### cross_val_predict
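- For each fold, fits the model on the training portion and predicts the held-out portion, so every sample receives a prediction from a model that never saw it during fitting
- Returns only the predictions; no scores or timings are provided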
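
As a rough sketch of how these out-of-fold predictions might be used, the example below computes residuals and a single pooled R² over all samples. The data, the seed, and the names `oof_pred`, `residuals`, and `pooled_r2` are assumptions for illustration. Note that a pooled score is not the same quantity as the mean of the per-fold scores from `cross_val_score`, so the two should not be compared directly.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic data for this sketch (seed chosen arbitrarily)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=50, random_state=0)

# One prediction per sample, each made by a model that did not train on that sample
oof_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=5)

# Residuals are handy for diagnostic plots such as predicted vs. actual
residuals = y_demo - oof_pred

# A single score computed over all out-of-fold predictions at once
pooled_r2 = r2_score(y_demo, oof_pred)

print(f"Predictions shape: {oof_pred.shape}")
print(f"Pooled out-of-fold R^2: {pooled_r2:.4f}")
```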

In [70]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(lr, X, y, cv=5)
y_pred  # display the out-of-fold predictions

array([-1.36388735e+02, -2.30912987e+02, -9.26551564e+00, -6.89509543e+01,
        5.57270004e+01, -9.60295201e+01,  6.82360550e+01, -1.80438951e+00,
       -2.91967354e+02, -1.49160749e+02,  7.76382865e+01,  1.00840762e+01,
       -4.69079504e+01, -1.38006909e+02,  1.74068660e+02, -2.97659713e+02,
       -2.40519851e+02,  8.86774696e+01, -3.35214070e+01,  2.06739085e+02,
       -1.30470723e+02,  2.81536832e+02, -1.14532265e+02,  8.86237269e+01,
       -1.02520357e+02,  1.37931067e+02,  4.00623163e+02,  9.15362061e+00,
       -1.21457167e+01,  2.95445016e+02,  2.67248332e+02, -5.93153405e+01,
        1.32970862e+02,  5.16079082e+01, -5.19252341e+01, -1.36317516e+02,
       -2.57234736e+01,  4.78496314e+01,  1.06764764e+01, -1.34846872e+01,
        4.90125017e+01,  9.63911208e+01,  2.77788233e+01,  8.94562977e+01,
        5.65720094e-01, -1.42379475e+02,  4.78313773e+01,  1.74157067e+02,
        2.54689158e+02, -3.02645260e+01, -1.29691841e+02, -6.77306578e+01,
        8.79583666e+01,  