<h1> Ensemble - Bagging (Random Forest)

Data: financial data. The target is whether a customer signs up for a deposit after a telemarketing call.

So which customer characteristics should we focus on?

In [6]:
import os
import pandas as pd

In [7]:
os.chdir(r'C:\Users\hjb38\Documents\데이터 분석 과정\data\ml_data')
data = pd.read_csv('bank-additional-full.csv', sep = ";")
# CSV files are usually comma-separated,
# but this file uses ; as the delimiter, so it has to be passed explicitly via sep.

In [8]:
data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


<h2> 1. Variables such as job and marital are categorical, not numeric,
    <div> so we convert them to numeric columns with one-hot encoding.

In [9]:
# First, use dtypes to check the type of each variable.
data.dtypes

# Columns with dtype object are the categorical ones.

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [10]:
# Apply one-hot encoding. y is excluded because it will be used as the target variable.
data = pd.get_dummies(data, columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome'])
data.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
1,57,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
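As a minimal sketch of what `pd.get_dummies` does with the `columns` argument, here is a toy frame. The values are made up for illustration; only the column names mirror the dataset. Passing `dtype=int` is an assumption added here so the dummies print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "marital": ["married", "single", "married"],
    "y": ["no", "yes", "no"],
})

# Only the listed columns are expanded; age and y pass through unchanged.
# dtype=int forces 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(list(encoded.columns))
print(encoded)
```

Each category becomes its own `marital_<value>` column, which is why the encoded frame above is so much wider than the original.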


<h2> 2. Split into train and test data

In [11]:
# Add a unique id per row so that train and test rows can be told apart later
data['id'] = range(len(data))

In [12]:
len(data)

41188

In [13]:
# Use 30000 of the 41188 rows as train data
train = data.sample(30000, replace = False, random_state = 2020).reset_index(drop = True)
# The remaining rows become test data
test = data.loc[ ~data['id'].isin(train['id'])].reset_index(drop = True)
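The same kind of split can also be done with scikit-learn's `train_test_split`, which removes the need for a helper `id` column. This is an alternative sketch on a stand-in frame, not what the notebook runs; the `stratify` argument is an addition here that keeps the yes/no ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame of 100 rows; the notebook would pass `data` instead.
toy = pd.DataFrame({"x": range(100), "y": ["no"] * 90 + ["yes"] * 10})

# An integer train_size is an absolute row count (30000 in the notebook).
train_toy, test_toy = train_test_split(
    toy, train_size=70, random_state=2020, stratify=toy["y"]
)

# The two parts are disjoint and together cover every row.
assert set(train_toy.index).isdisjoint(test_toy.index)
print(len(train_toy), len(test_toy))
```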

<h1> Random Forest Practice

<h3> Characteristics

1. The model is hard to interpret.
2. Training is slow.
3. Performance is typically better than using a single decision tree alone.
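To make the bagging idea behind these points concrete, here is a hand-rolled sketch on synthetic data: each tree is fitted on a bootstrap sample and the trees vote. It is a simplified illustration, not scikit-learn's implementation (`RandomForestClassifier` averages predicted probabilities rather than counting hard votes), and the tree count of 25 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap: draw row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" considers a random subset of features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5.
votes = np.stack([t.predict(X_test) for t in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print(bagged_acc)
```

Fitting many trees is what makes training slower, and a prediction that comes from a vote across trees is harder to trace than a single tree's path.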

<h3> Parameters

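1. n_estimators : how many decision trees to build. Typically 100 to 500.
2. min_samples_split : the minimum number of samples a node must contain to be split.
A node with fewer samples than this is not split any further.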
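A quick way to see what the two parameters control is to fit two small forests on synthetic data and inspect them. The sizes below (20 trees, thresholds 2 and 50) are arbitrary values chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

rf_split2 = RandomForestClassifier(
    n_estimators=20, min_samples_split=2, random_state=0
).fit(X, y)
rf_split50 = RandomForestClassifier(
    n_estimators=20, min_samples_split=50, random_state=0
).fit(X, y)

# n_estimators sets how many trees are fitted.
print(len(rf_split2.estimators_))

# A larger min_samples_split stops splitting earlier, so the trees have fewer nodes.
nodes_split2 = sum(t.tree_.node_count for t in rf_split2.estimators_)
nodes_split50 = sum(t.tree_.node_count for t in rf_split50.estimators_)
print(nodes_split2, nodes_split50)
```

Raising `min_samples_split` therefore gives shallower trees, which tends to reduce overfitting at the cost of some detail.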

<h2> 1) Train the random forest

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Define the model
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 10)

In [15]:
# Check the train columns to use as input variables: everything except y and id
train.columns

Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'mon

In [16]:
# The list is long, so store it in a variable named input_var

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

In [17]:
# Fit on the train data: input variables and target variable
rf.fit( train[input_var], train['y'])

RandomForestClassifier(min_samples_split=10, n_estimators=500)

In [18]:
# Predict on the test data
predictions = rf.predict(test[input_var])

In [19]:
test['pred'] = predictions

In [20]:
(test['pred'] == test['y']).mean()

0.9132999642474079

 91.3% accuracy
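The comparison `(test['pred'] == test['y']).mean()` is the same quantity that `sklearn.metrics.accuracy_score` returns. The sketch below checks this on made-up labels and also computes the accuracy of always predicting the most frequent class, a baseline that may be worth comparing against when one class dominates. The labels are hypothetical stand-ins for `test['y']` and `test['pred']`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions standing in for test['y'] and test['pred'].
toy = pd.DataFrame({
    "y":    ["no", "no", "no", "yes", "yes", "no", "no", "no"],
    "pred": ["no", "no", "yes", "yes", "no", "no", "no", "no"],
})

# Share of rows where the prediction matches the label.
manual_acc = (toy["pred"] == toy["y"]).mean()
sklearn_acc = accuracy_score(toy["y"], toy["pred"])

# Accuracy of always predicting the most frequent class.
baseline_acc = toy["y"].value_counts(normalize=True).max()
print(manual_acc, sklearn_acc, baseline_acc)
```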

Comparison with a decision tree

In [86]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(min_samples_split = 10)

In [87]:
dt.fit(train[input_var], train['y'])

DecisionTreeClassifier(min_samples_split=10)

In [88]:
test['pred'] = dt.predict(test[input_var])

In [89]:
(test['pred'] == test['y']).mean()

0.8984626385412943

The decision tree reaches 89.8% accuracy
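One possible way to approach the opening question of which customer characteristics to focus on is the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names; in the notebook the inputs would be `train[input_var]` and `train['y']`, and the index would be `input_var`. Impurity-based importances are only a rough guide, so treat the ranking as a starting point.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: with shuffle=False the first two columns carry the signal.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
cols = [f"feature_{i}" for i in range(6)]  # hypothetical column names
X = pd.DataFrame(X, columns=cols)

model = RandomForestClassifier(
    n_estimators=100, min_samples_split=10, random_state=0
).fit(X, y)

# One impurity-based importance per input column; the values sum to 1.
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances.head(3))
```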