## Read Data

In [1]:
import pandas as pd
train = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test_x.csv', index_col=0)
submission = pd.read_csv('sample_submission.csv', index_col=0)

## Checking the shapes of the data

In [2]:
print(train.shape)
print(test.shape)
print(submission.shape)

(45532, 77)
(11383, 76)
(11383, 1)


## Import methods for the classification task

In [3]:
from pycaret.classification import *

## Set up the environment

- In PyCaret you have to set up the environment before experimenting with models. This is done with the 'setup' function.
- In the setup stage, PyCaret automatically infers the column types of the given data and asks the user to confirm that it has inferred them correctly. You can override how individual columns are interpreted through the parameters of 'setup'. In this tutorial we accept the automatic inference by pressing Enter.
- It also asks what fraction of the dataset to use for constructing the train / validation sets. Pressing Enter again uses the whole dataset. The setup call in this tutorial additionally passes train_size = 0.7, so 70% of the rows are used for training and the remaining 30% are kept as a hold-out set.
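
The setup call in this tutorial passes train_size = 0.7 and session_id = 904. As a rough illustration of what a 70/30 split means for the 45,532 training rows, here is a minimal sketch in plain Python. The helper `split_indices` is hypothetical and is not how PyCaret implements its split (PyCaret may shuffle or stratify differently, and options such as remove_outliers change the final row count), so the exact rows and counts inside PyCaret can differ.

```python
import random


def split_indices(n_rows, train_size=0.7, seed=904):
    """Shuffle row positions and split them into train and hold-out parts.

    Hypothetical helper for illustration only; not PyCaret's implementation.
    """
    positions = list(range(n_rows))
    # A fixed seed makes the shuffle reproducible, similar in spirit to session_id.
    random.Random(seed).shuffle(positions)
    n_train = int(n_rows * train_size)
    return positions[:n_train], positions[n_train:]


train_idx, holdout_idx = split_indices(45532, train_size=0.7, seed=904)
print(len(train_idx), len(holdout_idx))
```

Fixing the seed means the same rows end up in the hold-out part on every run, which is what makes results comparable between experiments.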

In [4]:
# The 'voted' column is the prediction target, so pass it as the target argument
clf = setup(data = train, target = 'voted', train_size = 0.7,
            normalize = True,
            remove_outliers = True,
            outliers_threshold = 0.01,
            polynomial_features = True,
            create_clusters = True, cluster_iter = 20,
            session_id = 904)

Setup Succesfully Completed!


Unnamed: 0,Description,Value
0,session_id,904
1,Target Type,Binary
2,Label Encoded,"1: 0, 2: 1"
3,Original Data,"(45532, 77)"
4,Missing Values,False
5,Numeric Features,41
6,Categorical Features,35
7,Ordinal Features,False
8,High Cardinality Features,False
9,High Cardinality Method,


## Train and compare models

- Now that the environment is set up, we will train and compare the default models provided by PyCaret.
- With the 'compare_models' function we can easily train the 15 default models provided in the package and compare their performance.
- We will select and store the top 3 models by AUC, because AUC is the evaluation metric for this competition.
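
Since AUC drives the model selection here, it may help to recall what it measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that definition. The function name `roc_auc` and the toy data are made up for illustration, and the quadratic loop is for clarity, not speed. It is not the routine PyCaret uses internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives followed by two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Because only the ordering of the scores matters, AUC rewards well-ranked probabilities rather than hard class labels.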

In [5]:
best_3 = compare_models(sort = 'AUC', n_select = 3)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
0,Gradient Boosting Classifier,0.6944,0.7641,0.639,0.7632,0.6955,0.3935,0.3998,28.0026
1,Light Gradient Boosting Machine,0.6933,0.7639,0.6418,0.7596,0.6957,0.3909,0.3966,0.8002
2,CatBoost Classifier,0.6906,0.7632,0.6527,0.749,0.6975,0.384,0.3879,22.6184
3,Linear Discriminant Analysis,0.69,0.7602,0.6616,0.7429,0.6999,0.3815,0.3843,0.8958
4,Logistic Regression,0.6903,0.7592,0.667,0.7404,0.7018,0.3815,0.3838,1.0157
5,Extra Trees Classifier,0.69,0.7585,0.647,0.7511,0.6951,0.3833,0.3877,3.9443
6,Ada Boost Classifier,0.6881,0.7551,0.6529,0.7448,0.6958,0.3787,0.3822,6.4905
7,Extreme Gradient Boosting,0.6735,0.7442,0.6613,0.7188,0.6888,0.3467,0.3481,6.9024
8,Naive Bayes,0.6487,0.7082,0.776,0.6502,0.707,0.2766,0.2842,0.1338
9,Random Forest Classifier,0.649,0.7075,0.5976,0.7136,0.6504,0.3035,0.3084,0.3833


- Gradient Boosting Classifier, LightGBM, and CatBoost Classifier are the best 3 models by AUC. They are now stored in the best_3 variable.

## Model Ensemble

In [6]:
tuned_top3 = [tune_model(i) for i in best_3]

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6996,0.7639,0.6566,0.7608,0.7049,0.4024,0.4069
1,0.6841,0.7617,0.6334,0.7495,0.6866,0.3726,0.378
2,0.7085,0.7878,0.673,0.7653,0.7162,0.4191,0.4227
3,0.6964,0.7718,0.645,0.7627,0.6989,0.397,0.4026
4,0.683,0.757,0.6352,0.7469,0.6865,0.3702,0.3752
5,0.7059,0.7756,0.6555,0.7719,0.7089,0.4156,0.4213
6,0.6903,0.7579,0.6346,0.7592,0.6913,0.3856,0.3919
7,0.6906,0.767,0.641,0.7558,0.6937,0.3855,0.3908
8,0.6906,0.7559,0.6404,0.7562,0.6935,0.3855,0.391
9,0.6926,0.7647,0.6317,0.7647,0.6919,0.3906,0.3977


- We will now ensemble the three tuned models. To optimize the score for this competition we have to predict probabilities, so we will soft-vote ensemble the three models using the 'blend_models' function.
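
To make the idea of soft voting concrete, here is a minimal sketch in plain Python: each model outputs a positive-class probability per row, and the ensemble averages them. A hard vote is included for contrast. The function names and the three probability lists are hypothetical, and equal model weights are assumed.

```python
def soft_vote(prob_lists):
    """Average the positive-class probabilities predicted by several models."""
    n_models = len(prob_lists)
    return [sum(row) / n_models for row in zip(*prob_lists)]


def hard_vote(prob_lists, threshold=0.5):
    """Majority vote over thresholded predictions, shown for contrast."""
    votes = [[1 if p >= threshold else 0 for p in probs] for probs in prob_lists]
    return [1 if sum(row) * 2 > len(prob_lists) else 0 for row in zip(*votes)]


# Hypothetical probabilities from three models for three rows.
model_a = [0.9, 0.4, 0.2]
model_b = [0.6, 0.6, 0.1]
model_c = [0.3, 0.55, 0.3]

soft = soft_vote([model_a, model_b, model_c])
hard = hard_vote([model_a, model_b, model_c])
print(soft)
print(hard)
```

The soft vote keeps a continuous score for each row, which is what a ranking metric such as AUC needs. The hard vote collapses every row to 0 or 1 and loses that ordering.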

In [7]:
blended = blend_models(estimator_list = tuned_top3, fold = 5, method = 'soft')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6897,0.7627,0.6523,0.7477,0.6967,0.3822,0.3859
1,0.7024,0.7773,0.6697,0.7575,0.7109,0.4067,0.41
2,0.6899,0.7648,0.6469,0.7512,0.6951,0.3832,0.3876
3,0.688,0.7621,0.6485,0.7471,0.6943,0.3789,0.3828
4,0.6886,0.7585,0.6398,0.7532,0.6919,0.3813,0.3865
Mean,0.6917,0.7651,0.6514,0.7513,0.6978,0.3864,0.3906
SD,0.0054,0.0065,0.01,0.0038,0.0067,0.0102,0.0098


## Prediction
- We will now make predictions with the ensembled model.
- The setup environment already contains a hold-out set, so we predict on it to evaluate the model's performance.

In [8]:
pred_holdout = predict_model(blended)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.7018,0.7711,0.6596,0.7625,0.7074,0.4065,0.411


- With an AUC of 0.7711 on the hold-out set, the model performs reasonably well.

## Re-training the model on the whole data

- So far we have split the given train data once more into train / validation sets for our experiments, so the models have not been trained on the full training data.
- For the best performance we will now train the model on the whole dataset.

In [None]:
final_model = finalize_model(blended)

## Predicting on the test set for the competition

- We will now use the re-trained model with the 'predict_model' function to predict on the competition test set.

In [10]:
predictions = predict_model(final_model, data = test)

In [11]:
predictions

Unnamed: 0,QaA,QaE,QbA,QbE,QcA,QcE,QdA,QdE,QeA,QeE,...,wr_06,wr_07,wr_08,wr_09,wr_10,wr_11,wr_12,wr_13,Label,Score
0,3.0,736,2.0,2941,3.0,4621,1.0,4857,2.0,2550,...,0,0,1,0,1,0,1,1,2,0.5487
1,3.0,514,2.0,1952,3.0,1552,3.0,821,4.0,1150,...,0,0,0,0,0,0,0,0,2,0.5495
2,3.0,500,2.0,2507,4.0,480,2.0,614,2.0,1326,...,0,1,1,0,1,0,1,1,2,0.5099
3,1.0,669,1.0,1050,5.0,1435,2.0,2252,5.0,2533,...,1,1,1,1,1,1,1,1,2,0.5108
4,2.0,499,1.0,1243,5.0,845,2.0,1666,2.0,925,...,0,1,1,0,1,1,1,1,2,0.5487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11378,5.0,427,5.0,1066,5.0,588,1.0,560,2.0,1110,...,0,1,1,0,1,0,1,1,2,0.5487
11379,1.0,314,5.0,554,5.0,230,1.0,956,2.0,1173,...,1,1,1,1,1,1,1,1,2,0.5724
11380,1.0,627,2.0,799,1.0,739,2.0,1123,1.0,829,...,0,1,1,0,1,0,1,1,2,0.5108
11381,2.0,539,1.0,2090,2.0,4642,1.0,673,2.0,1185,...,0,1,1,0,1,1,1,0,2,0.5374


- The probability values are stored in the 'Score' column, so we copy them into the submission file and submit it to DACON.

In [12]:
submission['voted'] = predictions['Score']

In [13]:
submission.to_csv('submission_proba.csv', index = False)
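
One thing worth checking before submitting is what the 'Score' column actually contains. Depending on the PyCaret version, it may hold the probability of the predicted label rather than the probability of one fixed class. That is an assumption to verify against your installed version, not something this notebook confirms. If it does hold the predicted label's probability, a conversion like the sketch below gives the probability of a chosen class. The helper name, the choice of 2 as the positive label, and the second row are hypothetical.

```python
def positive_class_probability(label, score, positive_label=2):
    """Convert (predicted label, probability of that label) into P(positive class).

    Assumes `score` is the probability of `label`; verify this for your setup.
    """
    return score if label == positive_label else 1.0 - score


# (Label, Score) pairs; the row with label 1 is made up for illustration.
rows = [(2, 0.5487), (1, 0.6200), (2, 0.5099)]
probs = [positive_class_probability(label, score) for label, score in rows]
print(probs)
```

If 'Score' already holds the probability of a single fixed class, no conversion is needed and the values can be submitted as they are.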

- You will probably get an AUC of around 0.77, and I expect the score can be improved further with additional work.