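* Decision tree
    * Advantages
        * It makes no assumptions about the data. A linear model, for example, assumes normally distributed errors and a linear relationship between the independent and dependent variables. A decision tree makes no such assumptions, so it can be applied to a wide range of data.
        * It is largely unaffected by outliers.
        * The tree graph makes the model easy to understand and explain intuitively, so it lends itself well to visualization.
    * Disadvantages
        * If the tree is allowed to grow without a depth limit, it overfits.
        * Its predictive power is considerably lower than that of the more advanced tree-based models covered later.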
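
The overfitting risk listed above can be illustrated with a minimal sketch. It assumes scikit-learn is installed and uses a synthetic dataset from `make_classification`, not the salary data. The noise level (`flip_y=0.2`), `max_depth=3`, and the random seeds are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (hypothetical, for illustration only)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Unrestricted tree: keeps splitting until every training leaf is pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree: at most three levels of splits
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train_acc = deep_tree.score(X_train, y_train)
deep_test_acc = deep_tree.score(X_test, y_test)
shallow_train_acc = shallow_tree.score(X_train, y_train)
shallow_test_acc = shallow_tree.score(X_test, y_test)

print("deep    depth:", deep_tree.get_depth(), "train:", deep_train_acc, "test:", deep_test_acc)
print("shallow depth:", shallow_tree.get_depth(), "train:", shallow_train_acc, "test:", shallow_test_acc)
```

The unrestricted tree typically memorizes the training set while scoring noticeably lower on the held-out data. That gap between training and test accuracy is the sign of overfitting.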

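## 8.1 Problem Definition

* Mission: Using a salary dataset that includes education level, years of education, marital status, and occupation, predict each person's salary class (`<=50K` or `>50K`).
* Algorithm: Decision Tree
* Problem type: Classification
* Evaluation metric: Accuracy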

## 8.2 Loading Libraries and Data, Inspecting the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
file_url = 'https://media.githubusercontent.com/media/musthave-ML10/data_source/main/salary.csv'
data = pd.read_csv(file_url, skipinitialspace = True)

In [3]:
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
data['class'].unique() # check the unique values of the dependent variable

array(['<=50K', '>50K'], dtype=object)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      46033 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  47985 non-null  object
 13  class           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB
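
The non-null counts above imply missing values in `workclass`, `occupation`, and `native-country` (for example, 48842 - 46043 = 2799 missing `workclass` entries). Below is a minimal sketch of counting them directly with `isna()`. To stay self-contained it uses a small hand-built frame, `sample`, that copies three columns from the `head()` rows rather than the full dataset.

```python
import pandas as pd

# Three columns copied from the five rows shown by data.head();
# None marks the empty cells in row 4
sample = pd.DataFrame({
    'workclass': ['Private', 'Private', 'Local-gov', 'Private', None],
    'occupation': ['Machine-op-inspct', 'Farming-fishing', 'Protective-serv',
                   'Machine-op-inspct', None],
    'native-country': ['United-States'] * 5,
})

# Number of missing values per column
missing_counts = sample.isna().sum()
# Share of missing values per column
missing_ratio = sample.isna().mean()

print(missing_counts)
print(missing_ratio)
```

Applied to the full `data` frame, the same two calls would show how much of each column is missing before deciding how to handle it.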


In [6]:
data.describe() # print summary statistics

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,10.078089,1079.067626,87.502314,40.422382
std,13.71051,2.570973,7452.019058,403.004552,12.391444
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,48.0,12.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


In [7]:
data.describe(include='all') # summary statistics including object-type columns

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
count,48842.0,46043,48842,48842.0,48842,46033,48842,48842,48842,48842.0,48842.0,48842.0,47985,48842
unique,,8,16,,7,14,6,5,2,,,,41,2
top,,Private,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,15784,,22379,6172,19716,41762,32650,,,,43832,37155
mean,38.643585,,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,,12.0,,,,,,0.0,0.0,45.0,,
max,90.0,,,16.0,,,,,,99999.0,4356.0,99.0,,
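
Because the evaluation metric is accuracy, the `top` and `freq` rows for `class` give a useful reference point: a model that always predicts the most frequent class, `<=50K`, is already right for 37,155 of the 48,842 rows. A short sketch of that arithmetic, with the variable names chosen here for illustration:

```python
# Counts taken from the describe(include='all') output above
n_rows = 48842      # 'count' of the class column
n_majority = 37155  # 'freq' of the most frequent class, '<=50K'

# Accuracy of a model that always predicts the majority class
baseline_accuracy = n_majority / n_rows
print(f"{baseline_accuracy:.4f}")  # 0.7607
```

Any trained model should beat roughly 76% accuracy to be considered better than this trivial baseline.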


## 8.3 Preprocessing: Categorical Data

In [8]:
data['class'] = data['class'].map({'<=50K': 0, '>50K': 1}) # convert to numbers

In [9]:
data['class'][:5]

0    0
1    0
2    1
3    1
4    0
Name: class, dtype: int64
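
`Series.map` with a dictionary returns NaN for any value that is not one of the dictionary's keys, and the keys are case-sensitive, so a mistyped key silently produces missing values. The sketch below shows one way to guard against that. It uses a small hand-built Series, `labels`, copied from the first five `class` values in `data.head()`, and the variable names are illustrative.

```python
import pandas as pd

# Labels copied from the first five rows shown by data.head()
labels = pd.Series(['<=50K', '<=50K', '>50K', '>50K', '<=50K'])

# A key with the wrong case ('>50k') matches nothing, so those rows become NaN
typo_encoded = labels.map({'<=50K': 0, '>50k': 1})
n_unmatched = int(typo_encoded.isna().sum())
print(n_unmatched)  # 2

# With the exact labels as keys, every row is mapped
mapping = {'<=50K': 0, '>50K': 1}
encoded = labels.map(mapping)

# Guard: fail loudly if any label was missing from the mapping
unmapped = labels[encoded.isna()].unique()
assert len(unmapped) == 0, f"unmapped labels: {unmapped}"

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 0, 1, 1, 0]
```

Checking `isna()` immediately after the mapping catches this kind of mistake before it reaches model training.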

### 8.3.1 Inspecting object-type variables

In [10]:
data['age'].dtype

dtype('int64')
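
Checking one column at a time does not scale to 14 columns. As a minimal sketch, the non-numeric columns can be collected in one step with `select_dtypes`. The frame `sample` below is a hand-built stand-in for `data`, with values copied from the first three `head()` rows.

```python
import pandas as pd

# Small hand-built frame (stand-in for `data`, values from data.head())
sample = pd.DataFrame({
    'age': [25, 38, 28],
    'workclass': ['Private', 'Private', 'Local-gov'],
    'education-num': [7, 9, 12],
    'sex': ['Male', 'Male', 'Male'],
})

# Columns that are not numeric, i.e. the ones that still need encoding
non_numeric_cols = sample.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)  # ['workclass', 'sex']

# Number of distinct values in each non-numeric column
unique_counts = {col: sample[col].nunique() for col in non_numeric_cols}
print(unique_counts)  # {'workclass': 2, 'sex': 1}
```

The distinct-value counts give a rough idea of how many new columns each variable would add if it were one-hot encoded.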