## Changes
1. Validation / test: both now use the same model, the `best_estimator_` selected during validation.
2. At test time, the validation data is excluded from the training data.
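
The two changes above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: `make_classification`, the single `DecisionTreeClassifier`, and the sample sizes are stand-ins chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): one pool split into "train" and "test" parts.
X_all, y_all = make_classification(n_samples=800, n_features=10, random_state=0)
X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

# Change 2: the validation rows are split off and never used for fitting.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=1004)

grid = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=0),
    param_grid={'max_depth': [3, 5]},
    scoring='accuracy', cv=3, refit=True)
grid.fit(X_train, y_train)

# Change 1: one refit model is reused for both the validation and the test evaluation.
best_model = grid.best_estimator_
valid_acc = accuracy_score(y_valid, best_model.predict(X_valid))
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f'validation accuracy: {valid_acc:.4f}')
print(f'test accuracy      : {test_acc:.4f}')
```

Because `refit=True`, `best_estimator_` has already been retrained on all of `X_train` with the best parameters, so no second fit is needed before scoring.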

In [8]:
# Install Korean fonts
!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 40 not upgraded.
Need to get 9,604 kB of archives.
After this operation, 29.5 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 fonts-nanum all 20170925-1 [9,604 kB]
Fetched 9,604 kB in 1s (9,734 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package fonts-nanum.
(Reading database ... 155501 files and d

In [1]:
pip install gtts

Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'C:\anaconda3\envs\cakd3\python.exe -m pip install --upgrade pip' command.


In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
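import os
import warnings

import pandas as pd
from gtts import gTTS
from IPython.display import Audio, display

warnings.filterwarnings('ignore')


def speak(text):
    """Read `text` aloud in Korean with gTTS and play it in the notebook."""
    tts = gTTS(text=text, lang='ko')
    filename = 'voice.mp3'
    tts.save(filename)
    # Audio embeds the file contents, so the temporary file can be removed afterwards.
    display(Audio(filename=filename, autoplay=True))
    os.remove(filename)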

In [2]:
# change directory
%cd /content/drive/MyDrive/Cakd3_Project/1.ldata_현정

/content/drive/MyDrive/Cakd3_Project/1.ldata_현정


In [4]:
# Load the datasets
dataset1 = pd.read_csv('./full_dataset1.csv', index_col=0)
dataset2 = pd.read_csv('./full_dataset2.csv', index_col=0)

In [7]:
# classification

import warnings

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier, plot_importance

warnings.filterwarnings('ignore')


def clf(train, test=None):
    """Grid-search five classifiers on `train`; report validation and optional test metrics."""

    if test is not None:
        X_test = test.drop('y', axis=1)
        y_test = test['y']

    X = train.drop('y', axis=1)
    y = train['y']
    # Hold out 30% of the training data for validation; it is never used for fitting.
    X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                          test_size=0.3,
                                                          random_state=1004)

    # Create the estimators
    dct_clf = DecisionTreeClassifier(criterion='entropy')
    rf_clf = RandomForestClassifier()
    lr_clf = LogisticRegression()
    xgb_clf = XGBClassifier(eval_metric='logloss')
    lgb_clf = LGBMClassifier()

    # Hyperparameter grids
    dct_parameters = {'max_depth': [3, 5]}
    rf_parameters = {'n_estimators': [100, 200, 300], 'max_depth': [7, 9, 11]}
    # Note: the default lbfgs solver supports only 'l2' and no penalty, so the 'l1' and
    # 'elasticnet' candidates are expected to fail during fitting and never be selected.
    lr_parameters = {'penalty': ['l1', 'l2', 'elasticnet', 'none'], 'C': [1, 10, 100, 1000]}
    xgb_parameters = {'n_estimators': [300], 'learning_rate': [0.05, 0.1], 'max_depth': [4, 5, 6]}
    lgb_parameters = {'n_estimators': [300], 'learning_rate': [0.05, 0.1], 'max_depth': [4, 5, 6]}

    clf_param = [(dct_clf, dct_parameters), (rf_clf, rf_parameters), (lr_clf, lr_parameters),
                 (xgb_clf, xgb_parameters), (lgb_clf, lgb_parameters)]

    # `model` (not `clf`) is used as the loop variable to avoid shadowing this function's name.
    for model, parameter in clf_param:
        grid_clf = GridSearchCV(model, param_grid=parameter, scoring='accuracy', cv=3, refit=True)
        grid_clf.fit(X_train, y_train)

        # Print the cross-validation result
        class_name = model.__class__.__name__
        scores_df = pd.DataFrame(grid_clf.cv_results_)  # kept for inspection; not printed
        print(f'{class_name} best hyperparameters:', grid_clf.best_params_)

        # Evaluate the refit best estimator on the given data
        def clf_predict(X_eval, y_eval, mode):
            best_clf = grid_clf.best_estimator_
            pred = best_clf.predict(X_eval)
            pred_proba = best_clf.predict_proba(X_eval)[:, 1]

            accuracy = accuracy_score(y_eval, pred)
            precision = precision_score(y_eval, pred)
            recall = recall_score(y_eval, pred)
            f1 = f1_score(y_eval, pred)
            auc = roc_auc_score(y_eval, pred_proba)

            print(f'< {mode} dataset >')
            print('accuracy  :  {:.4f}'.format(accuracy))
            print('precision :  {:.4f}'.format(precision))
            print('recall    :  {:.4f}'.format(recall))
            print('f1        :  {:.4f}'.format(f1))
            print('roc_auc   :  {:.4f}\n'.format(auc))

        clf_predict(X_valid, y_valid, 'Validation')
        if test is not None:
            clf_predict(X_test, y_test, 'Test')
        print('='*80)
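
The logistic regression grid in `clf` pairs every penalty with the default solver. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to give `GridSearchCV` a list of grids so that each penalty is combined only with a solver that supports it. The solver choices and `l1_ratio` values below are illustrative, and the no-penalty option is left out because its accepted spelling (`'none'` versus `None`) depends on the installed scikit-learn version.

```python
from itertools import product

C_VALUES = [1, 10, 100, 1000]

# Hypothetical grid: each dict pairs a penalty with a solver that supports it.
lr_param_grid = [
    {'penalty': ['l2'], 'solver': ['lbfgs'], 'C': C_VALUES},
    {'penalty': ['l1'], 'solver': ['liblinear'], 'C': C_VALUES},
    {'penalty': ['elasticnet'], 'solver': ['saga'],
     'l1_ratio': [0.3, 0.5, 0.7], 'C': C_VALUES},
]


def count_candidates(param_grid):
    """Count the parameter combinations a list-of-dicts grid expands to."""
    return sum(len(list(product(*grid.values()))) for grid in param_grid)


print(count_candidates(lr_param_grid))  # 4 + 4 + 12 = 20
```

Such a list can be passed as `param_grid=lr_param_grid`; the `saga` solver may need a larger `max_iter` to converge on unscaled features.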

# full dataset

In [8]:
# Validation dataset
clf(dataset1)

DecisionTreeClassifier best hyperparameters: {'max_depth': 5}
< Validation dataset >
accuracy  :  0.6930
precision :  0.6345
recall    :  0.6271
f1        :  0.6308
roc_auc   :  0.7559

RandomForestClassifier best hyperparameters: {'max_depth': 11, 'n_estimators': 200}
< Validation dataset >
accuracy  :  0.7163
precision :  0.6700
recall    :  0.6338
f1        :  0.6514
roc_auc   :  0.7831

LogisticRegression best hyperparameters: {'C': 1, 'penalty': 'l2'}
< Validation dataset >
accuracy  :  0.6545
precision :  0.6046
recall    :  0.5027
f1        :  0.5490
roc_auc   :  0.7021

XGBClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7105
precision :  0.6600
recall    :  0.6350
f1        :  0.6473
roc_auc   :  0.7815

LGBMClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7093
precision :  0.6593
recall    :  0.6313
f1        :  0.6450
roc_auc   :  0.7806
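
As a quick sanity check on the output above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The helper name `f1_from` is hypothetical; the inputs are the DecisionTreeClassifier validation figures reported above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DecisionTreeClassifier, validation dataset: precision 0.6345, recall 0.6271
print(f'{f1_from(0.6345, 0.6271):.4f}')  # 0.6308
```

The recomputed value agrees with the printed `f1` of 0.6308, up to the rounding of the inputs.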



In [9]:
speak('검증 데이터셋 완료')

In [10]:
# Test dataset
clf(dataset1, dataset2)

DecisionTreeClassifier best hyperparameters: {'max_depth': 5}
< Validation dataset >
accuracy  :  0.6930
precision :  0.6345
recall    :  0.6271
f1        :  0.6308
roc_auc   :  0.7559

< Test dataset >
accuracy  :  0.6839
precision :  0.6808
recall    :  0.6657
f1        :  0.6732
roc_auc   :  0.7394

RandomForestClassifier best hyperparameters: {'max_depth': 11, 'n_estimators': 100}
< Validation dataset >
accuracy  :  0.7114
precision :  0.6652
recall    :  0.6242
f1        :  0.6441
roc_auc   :  0.7818

< Test dataset >
accuracy  :  0.6914
precision :  0.6968
recall    :  0.6534
f1        :  0.6744
roc_auc   :  0.7594

LogisticRegression best hyperparameters: {'C': 1, 'penalty': 'l2'}
< Validation dataset >
accuracy  :  0.6545
precision :  0.6046
recall    :  0.5027
f1        :  0.5490
roc_auc   :  0.7021

< Test dataset >
accuracy  :  0.6315
precision :  0.6485
recall    :  0.5381
f1        :  0.5882
roc_auc   :  0.6905

XGBClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7105
precision :  

In [11]:
speak('테스트 데이터셋 완료')

# small dataset

In [12]:
# Load the datasets
s_dataset1 = pd.read_csv('./small_dataset1.csv', index_col=0)
s_dataset2 = pd.read_csv('./small_dataset2.csv', index_col=0)

In [16]:
clf(s_dataset1, s_dataset2)
speak('small dataset 완료')

DecisionTreeClassifier best hyperparameters: {'max_depth': 3}
< Validation dataset >
accuracy  :  0.7175
precision :  0.6678
recall    :  0.6459
f1        :  0.6567
roc_auc   :  0.7705

< Test dataset >
accuracy  :  0.6982
precision :  0.6991
recall    :  0.6721
f1        :  0.6854
roc_auc   :  0.7577

RandomForestClassifier best hyperparameters: {'max_depth': 11, 'n_estimators': 200}
< Validation dataset >
accuracy  :  0.7163
precision :  0.6657
recall    :  0.6463
f1        :  0.6558
roc_auc   :  0.7853

< Test dataset >
accuracy  :  0.6986
precision :  0.6993
recall    :  0.6734
f1        :  0.6861
roc_auc   :  0.7662

LogisticRegression best hyperparameters: {'C': 1, 'penalty': 'l2'}
< Validation dataset >
accuracy  :  0.7140
precision :  0.6707
recall    :  0.6213
f1        :  0.6451
roc_auc   :  0.7759

< Test dataset >
accuracy  :  0.6876
precision :  0.6964
recall    :  0.6404
f1        :  0.6672
roc_auc   :  0.7571

XGBClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7104
precision :  

# sfm dataset

In [17]:
# Load the datasets
sfm_dataset1 = pd.read_csv('./sfm_dataset1.csv', index_col=0)
sfm_dataset2 = pd.read_csv('./sfm_dataset2.csv', index_col=0)

In [18]:
clf(sfm_dataset1, sfm_dataset2)
speak('sfm dataset 완료')

DecisionTreeClassifier best hyperparameters: {'max_depth': 5}
< Validation dataset >
accuracy  :  0.6950
precision :  0.6400
recall    :  0.6192
f1        :  0.6294
roc_auc   :  0.7573

< Test dataset >
accuracy  :  0.6842
precision :  0.6850
recall    :  0.6559
f1        :  0.6702
roc_auc   :  0.7397

RandomForestClassifier best hyperparameters: {'max_depth': 11, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7104
precision :  0.6613
recall    :  0.6305
f1        :  0.6455
roc_auc   :  0.7823

< Test dataset >
accuracy  :  0.6949
precision :  0.6977
recall    :  0.6636
f1        :  0.6802
roc_auc   :  0.7612

LogisticRegression best hyperparameters: {'C': 1, 'penalty': 'none'}
< Validation dataset >
accuracy  :  0.6277
precision :  0.5981
recall    :  0.3350
f1        :  0.4294
roc_auc   :  0.6751

< Test dataset >
accuracy  :  0.5877
precision :  0.6486
recall    :  0.3426
f1        :  0.4484
roc_auc   :  0.6601

XGBClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7107
precision :

# forward selection dataset

In [19]:
# Load the datasets
f_dataset1 = pd.read_csv('./f_dataset1.csv', index_col=0)
f_dataset2 = pd.read_csv('./f_dataset2.csv', index_col=0)

In [20]:
clf(f_dataset1, f_dataset2)
speak('f dataset 완료')

DecisionTreeClassifier best hyperparameters: {'max_depth': 5}
< Validation dataset >
accuracy  :  0.6943
precision :  0.6410
recall    :  0.6122
f1        :  0.6262
roc_auc   :  0.7573

< Test dataset >
accuracy  :  0.6819
precision :  0.6852
recall    :  0.6465
f1        :  0.6653
roc_auc   :  0.7397

RandomForestClassifier best hyperparameters: {'max_depth': 9, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7109
precision :  0.6645
recall    :  0.6238
f1        :  0.6435
roc_auc   :  0.7820

< Test dataset >
accuracy  :  0.6952
precision :  0.7007
recall    :  0.6576
f1        :  0.6785
roc_auc   :  0.7618

LogisticRegression best hyperparameters: {'C': 1, 'penalty': 'none'}
< Validation dataset >
accuracy  :  0.6073
precision :  0.5632
recall    :  0.2726
f1        :  0.3674
roc_auc   :  0.6422

< Test dataset >
accuracy  :  0.5628
precision :  0.6170
recall    :  0.2793
f1        :  0.3845
roc_auc   :  0.6268

XGBClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7081
precision : 

# backward elimination dataset

In [21]:
# Load the datasets
b_dataset1 = pd.read_csv('./b_dataset1.csv', index_col=0)
b_dataset2 = pd.read_csv('./b_dataset2.csv', index_col=0)

In [22]:
clf(b_dataset1, b_dataset2)
speak('b dataset 완료')

DecisionTreeClassifier best hyperparameters: {'max_depth': 5}
< Validation dataset >
accuracy  :  0.6982
precision :  0.6450
recall    :  0.6192
f1        :  0.6318
roc_auc   :  0.7566

< Test dataset >
accuracy  :  0.6875
precision :  0.6915
recall    :  0.6519
f1        :  0.6711
roc_auc   :  0.7400

RandomForestClassifier best hyperparameters: {'max_depth': 9, 'n_estimators': 200}
< Validation dataset >
accuracy  :  0.7138
precision :  0.6677
recall    :  0.6288
f1        :  0.6477
roc_auc   :  0.7818

< Test dataset >
accuracy  :  0.6910
precision :  0.6945
recall    :  0.6575
f1        :  0.6755
roc_auc   :  0.7605

LogisticRegression best hyperparameters: {'C': 1000, 'penalty': 'l2'}
< Validation dataset >
accuracy  :  0.6228
precision :  0.5851
recall    :  0.3375
f1        :  0.4281
roc_auc   :  0.6553

< Test dataset >
accuracy  :  0.5770
precision :  0.6254
recall    :  0.3366
f1        :  0.4377
roc_auc   :  0.6398

XGBClassifier best hyperparameters: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300}
< Validation dataset >
accuracy  :  0.7044
precision :
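
To compare the five feature sets side by side, the RandomForestClassifier test metrics printed in the sections above can be collected into one structure. This is a small sketch; the dictionary layout and key names are assumptions, and the numbers are copied from the outputs above.

```python
# (accuracy, roc_auc) of RandomForestClassifier on the test dataset, per feature set.
rf_test_scores = {
    'full': (0.6914, 0.7594),
    'small': (0.6986, 0.7662),
    'sfm': (0.6949, 0.7612),
    'forward selection': (0.6952, 0.7618),
    'backward elimination': (0.6910, 0.7605),
}

# Rank the feature sets by test ROC AUC, best first.
ranking = sorted(rf_test_scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (accuracy, roc_auc) in ranking:
    print(f'{name:<22} accuracy={accuracy:.4f}  roc_auc={roc_auc:.4f}')
```

On these figures the small dataset ranks first, but every gap is under one percentage point, and `RandomForestClassifier()` is created without a `random_state`, so a rerun may shift the order.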

# <font color='green'>Feature importance (XGB)</font>

In [None]:
# # Feature importance (XGB)
# import matplotlib.pyplot as plt
# %matplotlib inline

# xgb_clf = XGBClassifier(learning_rate=0.05, max_depth=4, n_estimators=300)
# xgb_clf.fit(X_train, y_train)
# fig, ax = plt.subplots(1,1,figsize=(10,8))
# plot_importance(xgb_clf, ax=ax, max_num_features=20, height=0.4)