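* Decision tree
    * Advantages
        * The model makes no assumptions about the data. A linear model, for example, assumes normality and a linear relationship between the independent and dependent variables, whereas a decision tree makes no such assumptions and can be applied to a wide range of data.
        * It is barely affected by outliers.
        * It can be understood and explained intuitively through a tree diagram, so it lends itself well to visualization.
    * Disadvantages
        * If the tree is allowed to grow without limit, it overfits.
        * Its predictive power falls well short of the more advanced tree-based models covered later.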
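
The interpretability point above can be illustrated with a small sketch. It does not use the salary data: it assumes scikit-learn's bundled iris dataset and an arbitrary `max_depth=2`, and prints the learned rules as text with `export_text`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the iris toy dataset stands in for real data.
iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# Each indented line is one split; leaves show the predicted class.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Every prediction can be traced by following one path from the root to a leaf, which is what makes the model easy to explain.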

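## 8.1 Problem definition

* Mission: predict salary using a salary dataset that contains education, years of education, marital status, and occupation.
* Algorithm: Decision Tree
* Problem type: classification
* Metric: accuracy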

## 8.2 Loading libraries and data, inspecting the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
file_url = 'https://media.githubusercontent.com/media/musthave-ML10/data_source/main/salary.csv'
data = pd.read_csv(file_url, skipinitialspace=True)
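
A minimal sketch of what `skipinitialspace=True` changes. The two-row CSV text below is a made-up sample, not the real file: it mimics values separated by a comma followed by a space.

```python
import io
import pandas as pd

# Hypothetical sample: every delimiter is followed by a space.
raw = "age, workclass, class\n25, Private, <=50K\n38, Local-gov, >50K\n"

plain = pd.read_csv(io.StringIO(raw))
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(plain.columns))     # names keep the leading space
print(list(stripped.columns))  # leading spaces removed
print(repr(plain[' class'][0]), repr(stripped['class'][0]))
```

Without the option, both column names and string values keep the leading space, so later lookups such as `data['class']` or comparisons against `'<=50K'` would silently miss.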

In [3]:
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
data['class'].unique() # check the unique values of the dependent variable

array(['<=50K', '>50K'], dtype=object)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      46033 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  47985 non-null  object
 13  class           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB


In [6]:
data.describe() # print summary statistics

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,10.078089,1079.067626,87.502314,40.422382
std,13.71051,2.570973,7452.019058,403.004552,12.391444
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,48.0,12.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


In [7]:
data.describe(include='all') # summary statistics including object-type columns

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
count,48842.0,46043,48842,48842.0,48842,46033,48842,48842,48842,48842.0,48842.0,48842.0,47985,48842
unique,,8,16,,7,14,6,5,2,,,,41,2
top,,Private,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,15784,,22379,6172,19716,41762,32650,,,,43832,37155
mean,38.643585,,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,,12.0,,,,,,0.0,0.0,45.0,,


## 8.3 Preprocessing: categorical data

In [8]:
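# Convert the labels to numbers. Keys are case-sensitive: the outputs recorded
# below come from a run that used '>50k', so every '>50K' row appears as NaN there.
data['class'] = data['class'].map({'<=50K': 0, '>50K': 1})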

In [9]:
data['class'][:5]

0    0.0
1    0.0
2    NaN
3    NaN
4    0.0
Name: class, dtype: float64
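
`Series.map` with a dictionary leaves any value without a matching key as NaN, so checking the result right after mapping can be useful. A small sketch on a hand-made Series (the five labels mirror the first five rows shown by `data.head()`):

```python
import pandas as pd

labels = pd.Series(['<=50K', '<=50K', '>50K', '>50K', '<=50K'])

# Lowercase 'k' never matches '>50K', so those rows become NaN.
wrong = labels.map({'<=50K': 0, '>50k': 1})
# Keys that match every label exactly give a clean integer column.
right = labels.map({'<=50K': 0, '>50K': 1})

print(wrong.isna().sum())  # number of rows left unmapped
print(right.tolist())
```

A quick `isna().sum()` after mapping is one way to catch a mismatched key before it propagates into later steps.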

### 8.3.1 Checking the object-type variables

In [10]:
data['age'].dtype

dtype('int64')

In [11]:
for i in data.columns:
    print(i, data[i].dtype)

age int64
workclass object
education object
education-num int64
marital-status object
occupation object
relationship object
race object
sex object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object
class float64


In [12]:
obj_list = []
for i in data.columns:
    if data[i].dtype == 'object':
        obj_list.append(i)

In [13]:
obj_list

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
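
The loop above can also be written with `select_dtypes`. This is a sketch on a tiny hand-made frame, and it selects every non-numeric column (`exclude='number'`) rather than testing for `object` specifically, which is an assumption that happens to give the same columns for data like this.

```python
import pandas as pd

# Hypothetical miniature of the salary data.
df = pd.DataFrame({
    'age': [25, 38],
    'workclass': ['Private', 'Local-gov'],
    'hours-per-week': [40, 50],
    'sex': ['Male', 'Female'],
})

# Keep the names of all columns that are not numeric.
obj_list = df.select_dtypes(exclude='number').columns.tolist()
print(obj_list)
```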

### 8.3.2 Selecting the variables to preprocess

In [15]:
for i in obj_list:
    print(i, data[i].nunique())

workclass 8
education 16
marital-status 7
occupation 14
relationship 6
race 5
sex 2
native-country 41


In [16]:
for i in obj_list:
    if data[i].nunique() >= 10:
        print(i, data[i].nunique())

education 16
occupation 14
native-country 41


### 8.3.3 Handling the education variable

In [17]:
data['education'].value_counts() # check how often each unique value occurs

HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64

In [18]:
np.sort(data['education-num'].unique())

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
      dtype=int64)

In [19]:
data['education'] == 1 # education holds strings, so comparing against 1 is always False

0        False
1        False
2        False
3        False
4        False
         ...  
48837    False
48838    False
48839    False
48840    False
48841    False
Name: education, Length: 48842, dtype: bool

In [21]:
data[data['education-num'] == 1]

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
779,64,Private,Preschool,1,Married-civ-spouse,Handlers-cleaners,Husband,Asian-Pac-Islander,Male,0,0,40,Philippines,0.0
818,21,Private,Preschool,1,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,25,Mexico,0.0
1029,57,,Preschool,1,Separated,,Not-in-family,White,Male,0,0,40,United-States,0.0
1059,31,Private,Preschool,1,Never-married,Handlers-cleaners,Not-in-family,Amer-Indian-Eskimo,Male,0,0,25,United-States,0.0
1489,19,Private,Preschool,1,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,36,Mexico,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48079,31,State-gov,Preschool,1,Never-married,Other-service,Not-in-family,White,Male,0,0,24,United-States,0.0
48316,40,Private,Preschool,1,Married-civ-spouse,Other-service,Husband,White,Male,0,1672,40,Mexico,0.0
48505,40,Private,Preschool,1,Never-married,Other-service,Not-in-family,White,Female,0,0,20,United-States,0.0
48640,46,Private,Preschool,1,Married-civ-spouse,Machine-op-inspct,Other-relative,Black,Male,0,0,75,Dominican-Republic,0.0


In [22]:
data[data['education-num'] == 1]['education'].unique()

array(['Preschool'], dtype=object)

In [23]:
for i in np.sort(data['education-num'].unique()):
    print(i, data[data['education-num'] == i]['education'].unique())

1 ['Preschool']
2 ['1st-4th']
3 ['5th-6th']
4 ['7th-8th']
5 ['9th']
6 ['10th']
7 ['11th']
8 ['12th']
9 ['HS-grad']
10 ['Some-college']
11 ['Assoc-voc']
12 ['Assoc-acdm']
13 ['Bachelors']
14 ['Masters']
15 ['Prof-school']
16 ['Doctorate']


In [24]:
data.drop('education', axis=1, inplace=True)
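
Dropping `education` is safe only if it carries the same information as `education-num`. A possible way to confirm the one-to-one relationship without printing every pair, sketched on a hand-made frame:

```python
import pandas as pd

# Hypothetical miniature with the same two columns.
df = pd.DataFrame({
    'education': ['Preschool', 'HS-grad', 'HS-grad', 'Bachelors'],
    'education-num': [1, 9, 9, 13],
})

# Count distinct partners in both directions.
labels_per_num = df.groupby('education-num')['education'].nunique()
nums_per_label = df.groupby('education')['education-num'].nunique()

# One-to-one means every group has exactly one partner.
is_one_to_one = bool((labels_per_num == 1).all() and (nums_per_label == 1).all())
print(is_one_to_one)
```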

### 8.3.4 Handling the occupation variable

In [25]:
data['occupation'].value_counts()

Prof-specialty       6172
Craft-repair         6112
Exec-managerial      6086
Adm-clerical         5611
Sales                5504
Other-service        4923
Machine-op-inspct    3022
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64

### 8.3.5 Handling the native-country variable

In [27]:
data['native-country'].value_counts()

United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                     

In [28]:
data['native-country'].nunique()

41

In [29]:
data.groupby('native-country').mean(numeric_only=True).sort_values('class') # numeric_only avoids a TypeError on object columns in pandas 2.x

Unnamed: 0_level_0,age,education-num,capital-gain,capital-loss,hours-per-week,class
native-country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Cambodia,36.892857,9.392857,697.464286,194.821429,42.035714,0.0
Jamaica,37.141509,9.811321,495.915094,17.801887,39.160377,0.0
Japan,37.358696,11.423913,1874.586957,59.445652,42.282609,0.0
Laos,35.217391,8.826087,125.434783,75.652174,39.391304,0.0
Mexico,33.635121,6.026288,415.954784,32.656151,40.21346,0.0
Nicaragua,36.285714,9.0,138.653061,69.938776,36.938776,0.0
Outlying-US(Guam-USVI-etc),38.826087,10.043478,0.0,76.608696,41.347826,0.0
Peru,36.434783,9.826087,39.804348,40.173913,36.543478,0.0
Philippines,39.633898,10.722034,1508.823729,88.522034,39.620339,0.0
Poland,42.758621,10.068966,471.91954,70.390805,37.689655,0.0


In [30]:
country_group = data.groupby('native-country')['class'].mean() # select the column first so only 'class' is averaged

In [31]:
country_group = country_group.reset_index()

In [32]:
country_group

Unnamed: 0,native-country,class
0,Cambodia,0.0
1,Canada,0.0
2,China,0.0
3,Columbia,0.0
4,Cuba,0.0
5,Dominican-Republic,0.0
6,Ecuador,0.0
7,El-Salvador,0.0
8,England,0.0
9,France,0.0


In [33]:
data = data.merge(country_group, on='native-country', how='left')

In [34]:
data

Unnamed: 0,age,workclass,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class_x,class_y
0,25,Private,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0.0,0.0
1,38,Private,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0.0,0.0
2,28,Local-gov,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,,0.0
3,44,Private,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,,0.0
4,18,,10,Never-married,,Own-child,White,Female,0,0,30,United-States,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0.0,0.0
48838,40,Private,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,,0.0
48839,58,Private,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0.0,0.0
48840,22,Private,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0.0,0.0


In [35]:
data.drop('native-country', axis=1, inplace=True)

In [36]:
data = data.rename(columns={'class_x': 'class', 'class_y': 'native-country'})
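
The merge, drop, and rename steps above replace each country name with the mean of `class` for that country. As a sketch of an alternative, the same substitution can be done in one step by mapping the grouped means back onto the column. The four-row frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature: one row has a missing country.
df = pd.DataFrame({
    'native-country': ['United-States', 'United-States', 'Mexico', None],
    'class': [1, 0, 0, 1],
})

# Mean of the target per country (missing keys are skipped by groupby).
country_mean = df.groupby('native-country')['class'].mean()

# Replace each country name with its mean; unmatched rows stay NaN.
df['native-country'] = df['native-country'].map(country_mean)
print(df)
```

Rows with a missing country remain NaN, so a later `fillna` is still needed, just as with the merge-based version.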

## 8.4 Preprocessing: handling missing values and converting to dummy variables

In [37]:
data.isna().mean()

age               0.000000
workclass         0.057307
education-num     0.000000
marital-status    0.000000
occupation        0.057512
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
class             0.239282
native-country    0.017546
dtype: float64

In [38]:
data['native-country'] = data['native-country'].fillna(-99)

In [40]:
data['workclass'].value_counts()

Private             33906
Self-emp-not-inc     3862
Local-gov            3136
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

In [41]:
data['workclass'] = data['workclass'].fillna('Private')

In [42]:
data['occupation'].value_counts()

Prof-specialty       6172
Craft-repair         6112
Exec-managerial      6086
Adm-clerical         5611
Sales                5504
Other-service        4923
Machine-op-inspct    3022
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64

In [43]:
data['occupation'] = data['occupation'].fillna('Unknown')

In [44]:
data = pd.get_dummies(data, drop_first=True)
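
A small sketch of what `drop_first=True` does, using a made-up three-row frame: for each categorical column the first category (in sorted order) is dropped, since it is implied when all the other dummies are 0.

```python
import pandas as pd

# Hypothetical miniature with one numeric and one categorical column.
df = pd.DataFrame({'age': [25, 38, 28], 'sex': ['Male', 'Female', 'Male']})

full = pd.get_dummies(df)
reduced = pd.get_dummies(df, drop_first=True)

print(list(full.columns))     # one dummy per category
print(list(reduced.columns))  # first category of each variable removed
```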

In [52]:
data = data.dropna()

## 8.5 Modeling and evaluation

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop('class', axis=1), data['class'], test_size=0.4, random_state=100)

In [55]:
from sklearn.tree import DecisionTreeClassifier

In [56]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)

In [57]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

1.0
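
An accuracy of exactly 1.0 from an untuned tree is unusual enough to deserve a sanity check. One possible cause is a target that has collapsed to a single class during preprocessing. The sketch below uses a made-up `y_test` to show two quick checks: how many classes remain, and what accuracy always predicting the majority class would give.

```python
import pandas as pd

# Hypothetical test labels, not the notebook's actual y_test.
y_test = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Number of distinct classes left in the target.
n_classes = y_test.nunique()

# Share of the most frequent class = accuracy of a constant predictor.
baseline_accuracy = y_test.value_counts(normalize=True).max()

print(n_classes, baseline_accuracy)
```

If `n_classes` is 1, any model scores 1.0 and the result says nothing about predictive power; otherwise the model's accuracy should be read against `baseline_accuracy`.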

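## 8.6 Understanding the decision tree

* DecisionTreeClassifier splits the data so that each resulting node is as pure as possible. Purity is measured with impurity criteria such as the Gini index and cross-entropy, where lower values mean purer nodes.
* At each node, the tree picks the variable and threshold value that yield the lowest Gini index and splits on it.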
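
The two criteria can be written out directly. This is an illustrative sketch, not scikit-learn's internal implementation; the function names are chosen here for the example. A candidate split is scored by the size-weighted impurity of its two child nodes.

```python
import numpy as np

def gini(labels):
    """Gini index: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_gini(left, right):
    """Impurity of a split: child Gini values weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # evenly mixed node
print(gini([0, 0, 0, 0]))                         # pure node
print(weighted_gini([0, 0, 0], [1, 1, 1]))        # perfect split
print(weighted_gini([0, 0, 1], [0, 1, 1]))        # poor split
```

The split that separates the classes completely has a weighted Gini of 0, so it would be preferred over the mixed one.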

## 8.7 The overfitting problem

Bias-variance trade-off
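
A rough sketch of how the trade-off shows up in a decision tree. It assumes synthetic data from `make_classification` with 20% of labels flipped as noise (not the salary data), and compares an unconstrained tree with one limited to `max_depth=3`; both settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: noisy synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=100)

scores = {}
for depth in (None, 3):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this depth.
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

The unconstrained tree memorizes the training set, noise included (low bias, high variance), so its training accuracy is far above its test accuracy. The shallow tree fits the training set less closely, and the gap between the two scores typically narrows.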

## 8.8 Parameter tuning
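
One common way to tune a decision tree is a cross-validated grid search over parameters that limit its growth. The sketch below is an assumption about how that could look, not this notebook's procedure: it uses synthetic data and an arbitrary grid over `max_depth` and `min_samples_leaf`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data and an illustrative parameter grid.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
param_grid = {'max_depth': [3, 5, 7, None], 'min_samples_leaf': [1, 10, 50]}

# 3-fold cross-validation scores every combination by accuracy.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring='accuracy', cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all the data with the best combination.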