<a href="https://colab.research.google.com/github/Null2648/google-colab/blob/main/10_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

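# What is Boosting?
 - Boosting is an ensemble technique that combines many weak decision trees.
 - The training errors of each weak model are weighted and passed on sequentially to the next model, so that the sequence as a whole forms a strong predictive model.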


# GBM (Gradient Boosting)
 - GBM trains several weak learners one after another, improving on earlier errors by giving more weight to the data points that were predicted incorrectly.
 - The weights are updated using gradient descent.
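
The sequential error correction described above can be sketched for the squared-error case, where the negative gradient of the loss is simply the residual `y - prediction`. This is a minimal illustration on made-up 1-D data using decision stumps written in NumPy; the helper names (`fit_stump`, `predict_stump`, `gradient_boost`, `predict`) are hypothetical and this is not how any particular library implements GBM.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the residuals left by the previous ones."""
    base = y.mean()
    prediction = np.full(y.shape, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    out = np.full(x.shape, base, dtype=float)
    for stump in stumps:
        out += learning_rate * predict_stump(stump, x)
    return out

# Made-up data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

model = gradient_boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - predict(model, x)) ** 2)
print("baseline MSE:", mse_baseline)
print("boosted MSE:", mse_boosted)
```

Each stump on its own is a very weak model, but because every round fits what the previous rounds got wrong, the training error of the combined model falls below that of the constant baseline.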


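# What is XGBoost?
 - XGBoost stands for Extreme Gradient Boosting.
 - Gradient Boosting is the representative boosting algorithm, and XGBoost is a library that implements it with support for parallel training.
 - It supports both regression and classification problems, and is popular for its performance and resource efficiency.
 - Advantages of XGBoost
  1. Faster than GBM: parallel processing speeds up training and prediction.
  2. Regularization against overfitting: standard GBM has no built-in overfitting regularization, whereas XGBoost includes one, which makes it more robust.
  3. Strong predictive performance in both classification and regression.
    - That is, it uses an ensemble of CART (Classification and Regression Tree) models.
  4. Supports early stopping.
  5. Offers many options and is easy to customize.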
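
To make the early stopping item concrete, here is a minimal sketch of the stopping rule itself: training halts once the validation loss has gone a fixed number of rounds without a new best value. The function name `run_with_early_stopping`, the `patience` argument, and the loss values are made up for illustration; XGBoost exposes the same idea through its `early_stopping_rounds` setting, and this sketch is not its actual implementation.

```python
def run_with_early_stopping(val_losses, patience):
    """Scan per-round validation losses and stop once `patience` rounds
    pass without a new best loss. Returns (best_round, last_round_run)."""
    best_loss = float("inf")
    best_round = -1
    last_round = -1
    for round_idx, loss in enumerate(val_losses):
        last_round = round_idx
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, last_round

# Hypothetical validation losses, one per boosting round.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.34, 0.33]
best_round, stopped_at = run_with_early_stopping(losses, patience=3)
print(f"best round: {best_round}, stopped at round: {stopped_at}")
# prints "best round: 3, stopped at round: 6"
```

Note the trade-off: with `patience=3` the loop stops at round 6 and never sees the lower losses at rounds 7 and 8, so a small patience saves training time at the risk of stopping too soon.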

In [55]:
import numpy as np
import io
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

In [56]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Korean font settings
mpl.rc('font', family = 'malgun gothic')
# Render minus signs as ASCII hyphens instead of the Unicode minus sign
mpl.rc('axes', unicode_minus = False)

# Set the chart style
sns.set(font='malgun gothic', rc={'axes.unicode_minus':False}, style = 'darkgrid')
plt.rc('figure', figsize=(10, 8))


warnings.filterwarnings('ignore')

# Logistic Regression and Evaluation Metrics

## Dataset - Wisconsin Breast Cancer Prediction

In [57]:
!pip install xgboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [58]:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [59]:
X = cancer.data
y = cancer.target

cancer_df = pd.DataFrame(data = X, columns = cancer.feature_names)
cancer_df['target'] = y
cancer_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [60]:
print('data shape:', X.shape)
print('target shape:', y.shape)

data shape: (569, 30)
target shape: (569,)


In [61]:
print(cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [62]:
# Count the samples in each class (0 = malignant, 1 = benign)
np.unique(cancer.target, return_counts = True)

(array([0, 1]), array([212, 357]))
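
The raw counts are easier to judge as proportions. The following small sketch reuses the counts printed above (hard-coded here so it runs on its own) and converts them to class shares; in scikit-learn's version of this dataset, class 0 is malignant and class 1 is benign.

```python
import numpy as np

# Counts copied from the output above: 212 samples of class 0, 357 of class 1.
classes = np.array([0, 1])
counts = np.array([212, 357])

proportions = counts / counts.sum()
for label, count, share in zip(classes, counts, proportions):
    print(f"class {label}: {count} samples ({share:.1%})")
# prints:
# class 0: 212 samples (37.3%)
# class 1: 357 samples (62.7%)
```

The classes are moderately imbalanced, at roughly 37% to 63%.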

In [63]:
# Check what the 30 features are
# enumerate returns the index along with each value
for i, feature in enumerate(cancer.feature_names):
  print(f'feature{(i+1)}:', feature)

feature1: mean radius
feature2: mean texture
feature3: mean perimeter
feature4: mean area
feature5: mean smoothness
feature6: mean compactness
feature7: mean concavity
feature8: mean concave points
feature9: mean symmetry
feature10: mean fractal dimension
feature11: radius error
feature12: texture error
feature13: perimeter error
feature14: area error
feature15: smoothness error
feature16: compactness error
feature17: concavity error
feature18: concave points error
feature19: symmetry error
feature20: fractal dimension error
feature21: worst radius
feature22: worst texture
feature23: worst perimeter
feature24: worst area
feature25: worst smoothness
feature26: worst compactness
feature27: worst concavity
feature28: worst concave points
feature29: worst symmetry
feature30: worst fractal dimension


## Standardization

In [64]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(cancer_df.drop('target', axis=1))
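
As a rough illustration of what standardization computes, each feature column is shifted by its mean and divided by its standard deviation. The sketch below does this by hand with NumPy on a made-up toy matrix (`X_toy` is not part of the dataset); it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Toy matrix: 4 samples, 2 features on very different scales (made-up values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contain identical values even though their original scales differed by a factor of 100, which is the point of the transformation.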

## Splitting into Training and Test Sets