## For tabular data competitions, throw it at AutoML(?) and get started!

In this kernel we will use PyCaret, an AutoML package, to enter a data science competition on structured (tabular) data. The given data is used as-is, with no feature engineering, and the default models are trained without tuning, so I expect better scores once we engineer additional features and tune the models.

Personally, I think PyCaret is appropriate for single-output prediction tasks, but I still haven't found an easy way to apply it to multi-output prediction tasks. I would appreciate it if anyone could share tutorial code that applies PyCaret to a multi-output prediction task.

## Define your path

In [1]:
path = 'data/'

In [2]:
import os
os.listdir(path)

['train.csv', 'test_x.csv', 'sample_submission.csv']
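
Before reading anything, it can help to confirm that the expected files are actually in the data folder. A minimal sketch, assuming the three file names from the listing above; `missing_files` and `REQUIRED` are hypothetical names introduced here, not part of any library.

```python
import os

def missing_files(folder, required):
    """Return the required file names that are not present in `folder`."""
    present = set(os.listdir(folder))
    return [name for name in required if name not in present]

# File names taken from the directory listing above.
REQUIRED = ['train.csv', 'test_x.csv', 'sample_submission.csv']
```

Calling `missing_files(path, REQUIRED)` returns an empty list when everything is in place.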

## Read Data

In [3]:
import pandas as pd
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test_x.csv')
submission = pd.read_csv(path + 'sample_submission.csv')

## Checking the shapes of the data

In [4]:
print(train.shape)
print(test.shape)
print(submission.shape)

(45532, 78)
(11383, 77)
(11383, 2)
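
The three shapes can also be checked against each other rather than by eye. The sketch below assumes that train carries exactly one extra column (the target) and that the submission has one row per test row; `check_shapes` is a hypothetical helper.

```python
def check_shapes(train_shape, test_shape, submission_shape):
    """Return a list of human-readable problems; an empty list means the shapes agree."""
    problems = []
    # Assumption: train has the same features as test plus one target column.
    if train_shape[1] != test_shape[1] + 1:
        problems.append('train should have exactly one more column than test')
    # Assumption: the submission file has one row per test row.
    if submission_shape[0] != test_shape[0]:
        problems.append('submission and test row counts differ')
    return problems

# Shapes copied from the output above.
shape_problems = check_shapes((45532, 78), (11383, 77), (11383, 2))
print(shape_problems)  # prints []
```

With the DataFrames loaded, the call would be `check_shapes(train.shape, test.shape, submission.shape)`.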


## Install PyCaret

## Import methods for the classification task

In [6]:
from pycaret.classification import *

## Set up the environment

- In PyCaret you have to set up the environment before experimenting with models. This is done with the 'setup' function.
- During setup, PyCaret automatically infers the type of each column and asks the user to confirm that it has interpreted them correctly. You can override how individual columns are interpreted through the parameters of 'setup'. In this tutorial we simply accept the automatic interpretation by pressing Enter.
- It also asks what fraction of the dataset to use for constructing the train / validation sets. We will use 100% of the dataset, so just press Enter again.

In [7]:
# 'voted' is the column to predict, so pass it as the target argument
clf = setup(data = train, target = 'voted')

Unnamed: 0,Description,Value
0,session_id,5972
1,Target,voted
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(45532, 78)"
5,Missing Values,False
6,Numeric Features,42
7,Categorical Features,35
8,Ordinal Features,False
9,High Cardinality Features,False
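
If the automatic type inference gets a column wrong, the override can be passed to 'setup' directly. The sketch below only builds the keyword arguments. As far as I know, `numeric_features`, `categorical_features`, `train_size`, and `session_id` are 'setup' parameters in PyCaret, but the columns chosen here (`QaE`, `wr_06`) are illustrative assumptions and not settings used in this kernel. `build_setup_kwargs` is a hypothetical helper.

```python
def build_setup_kwargs(target, numeric=None, categorical=None,
                       train_size=0.7, session_id=None):
    """Collect keyword arguments for pycaret.classification.setup.

    Only overrides that were actually given are included, so columns that
    are left out still fall back to PyCaret's automatic inference.
    """
    kwargs = {'target': target, 'train_size': train_size}
    if numeric:
        kwargs['numeric_features'] = list(numeric)
    if categorical:
        kwargs['categorical_features'] = list(categorical)
    if session_id is not None:
        kwargs['session_id'] = session_id  # fixes the random seed
    return kwargs

# Hypothetical overrides, for illustration only.
setup_kwargs = build_setup_kwargs(
    target='voted',
    numeric=['QaE'],
    categorical=['wr_06'],
    session_id=5972,  # the session_id reported in the setup output above
)
```

The result would then be used as `clf = setup(data = train, **setup_kwargs)`. Reusing the reported `session_id` is one way to make a rerun reproduce the same split.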


## Train models and compare

- Now that the environment is set up, we will train and compare the default models provided in PyCaret.
- With the 'compare_models' function we can easily train the 15 default models provided in the package and compare their performance.
- We keep the top 3 models by AUC, because AUC is the evaluation metric for this competition.

In [8]:
best_3 = compare_models(sort = 'AUC', n_select = 3)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.6949,0.7659,0.7534,0.6359,0.6896,0.3938,0.3994,5.201
catboost,CatBoost Classifier,0.6931,0.7658,0.7317,0.6389,0.6821,0.3881,0.3915,5.749
lightgbm,Light Gradient Boosting Machine,0.6943,0.7652,0.7433,0.6375,0.6863,0.3915,0.396,1.18
lda,Linear Discriminant Analysis,0.6914,0.7616,0.7163,0.6405,0.6762,0.3831,0.3854,0.552
et,Extra Trees Classifier,0.6927,0.7606,0.741,0.6362,0.6846,0.3884,0.3928,1.778
ada,Ada Boost Classifier,0.6877,0.7574,0.7204,0.6349,0.6749,0.3767,0.3796,1.063
rf,Random Forest Classifier,0.6899,0.7556,0.7427,0.6325,0.6831,0.3833,0.3881,1.728
xgboost,Extreme Gradient Boosting,0.6763,0.7457,0.6845,0.629,0.6556,0.3512,0.3524,7.613
dt,Decision Tree Classifier,0.6128,0.6092,0.5731,0.5693,0.5712,0.2182,0.2183,0.509
lr,Logistic Regression,0.5499,0.5828,0.0072,0.5078,0.0141,0.0012,0.0081,1.061


- Gradient Boosting Classifier, CatBoost Classifier, and Light Gradient Boosting Machine (LGBM) are the best 3 models by AUC. They are now stored in the best_3 variable.
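
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The pure-Python sketch below follows that definition directly. It is not how PyCaret computes the metric internally, it is quadratic in the number of rows, and the toy labels and scores are made up.

```python
def pairwise_auc(y_true, y_score, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == positive]
    neg = [s for y, s in zip(y_true, y_score) if y != positive]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # prints 0.75
```

Because only the ordering of the scores matters, AUC rewards well-ranked probabilities rather than hard class labels.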

## Model Ensemble

- We will now ensemble the three trained models. The competition score is optimized by predicting probabilities, so we will soft-vote the three models using the 'blend_models' function.

In [9]:
blended = blend_models(estimator_list = best_3, fold = 5, method = 'soft')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6962,0.7709,0.7564,0.6367,0.6914,0.3965,0.4022
1,0.6867,0.7617,0.74,0.6292,0.6801,0.3771,0.3819
2,0.696,0.7684,0.7542,0.6369,0.6906,0.3959,0.4013
3,0.6988,0.7697,0.7365,0.6448,0.6876,0.3992,0.4025
4,0.7003,0.7699,0.7483,0.6438,0.6921,0.4035,0.4078
Mean,0.6956,0.7681,0.7471,0.6383,0.6884,0.3944,0.3992
SD,0.0047,0.0033,0.0078,0.0056,0.0044,0.0091,0.0089
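
Soft voting averages the probabilities predicted by each model instead of counting their hard votes. A minimal sketch with made-up probabilities for three hypothetical models; `soft_vote` is an illustrative helper, not a PyCaret function.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-model probabilities for one class, row by row."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models  # plain (unweighted) average
    n_rows = len(prob_lists[0])
    if any(len(probs) != n_rows for probs in prob_lists):
        raise ValueError('every model must predict the same number of rows')
    total = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_rows)
    ]

# Made-up probabilities from three hypothetical models for three rows.
model_a = [0.9, 0.4, 0.2]
model_b = [0.7, 0.6, 0.1]
model_c = [0.8, 0.5, 0.3]
blended_probs = soft_vote([model_a, model_b, model_c])
```

Here the averaged probabilities come out at roughly 0.8, 0.5, and 0.2. A hard-vote ensemble would only return class labels, which is less useful when the metric is AUC.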


## Prediction
- We will now predict with the ensembled model.
- The setup environment already holds a hold-out set, so we predict on it first to evaluate the model's performance.



In [11]:
pred_holdout = predict_model(blended)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.6947,0.7667,0.7439,0.6464,0.6918,0.3923,0.3961
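
The hold-out set is simply a slice of the training data that the models never see while fitting. The sketch below illustrates the idea on row indices, assuming a 70/30 ratio (the default `train_size` in PyCaret, as far as I know). PyCaret performs its own split internally, so `holdout_split` is a hypothetical stand-in and will not reproduce the same rows.

```python
import random

def holdout_split(n_rows, train_size=0.7, seed=0):
    """Shuffle row indices and cut them into train and hold-out parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = int(n_rows * train_size)
    return indices[:cut], indices[cut:]

# 45532 is the number of training rows reported earlier.
train_idx, holdout_idx = holdout_split(45532)
print(len(train_idx), len(holdout_idx))  # prints 31872 13660
```

Metrics measured on the hold-out rows estimate performance on unseen data, which is why they can differ slightly from the cross-validation averages.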





- The hold-out AUC is about 0.77 (0.7667 in the output above), which is a fairly decent score.

## Re-training the model on the whole data


- So far the given train data has been split again into train / validation sets for the experiments, so the models have not been trained on the full training set.
- We will retrain the model on the whole dataset to get the best performance.

In [32]:
final_model = finalize_model(blended)

## Predicting on the test set for the competition

- We will now use the 'predict_model' function to predict on the competition test set with the retrained model.

In [33]:
predictions = predict_model(final_model, data = test)

In [34]:
predictions

Unnamed: 0,index,QaA,QaE,QbA,QbE,QcA,QcE,QdA,QdE,QeA,...,wr_06,wr_07,wr_08,wr_09,wr_10,wr_11,wr_12,wr_13,Label,Score
0,0,3.0,736,2.0,2941,3.0,4621,1.0,4857,2.0,...,0,0,1,0,1,0,1,1,2,0.6475
1,1,3.0,514,2.0,1952,3.0,1552,3.0,821,4.0,...,0,0,0,0,0,0,0,0,2,0.8857
2,2,3.0,500,2.0,2507,4.0,480,2.0,614,2.0,...,0,1,1,0,1,0,1,1,2,0.5256
3,3,1.0,669,1.0,1050,5.0,1435,2.0,2252,5.0,...,1,1,1,1,1,1,1,1,1,0.1998
4,4,2.0,499,1.0,1243,5.0,845,2.0,1666,2.0,...,0,1,1,0,1,1,1,1,2,0.7567
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11378,11378,5.0,427,5.0,1066,5.0,588,1.0,560,2.0,...,0,1,1,0,1,0,1,1,1,0.3924
11379,11379,1.0,314,5.0,554,5.0,230,1.0,956,2.0,...,1,1,1,1,1,1,1,1,2,0.8792
11380,11380,1.0,627,2.0,799,1.0,739,2.0,1123,1.0,...,0,1,1,0,1,0,1,1,1,0.2230
11381,11381,2.0,539,1.0,2090,2.0,4642,1.0,673,2.0,...,0,1,1,0,1,1,1,0,1,0.3271


- The probability values are stored in the 'Score' column, so we copy them into the submission file and submit it to DACON.
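
In the rows shown above, 'Score' is below 0.5 whenever 'Label' is 1, so it reads as the probability of class 2. As far as I know, some later PyCaret releases instead report the probability of whichever label was predicted. If that applies to your version, the values need converting before submission. A hedged sketch for the binary case, where `prob_of_class` is a hypothetical helper and the rows are made up:

```python
def prob_of_class(labels, scores, wanted):
    """Turn 'probability of the predicted label' into 'probability of `wanted`'.

    Binary case only: when the predicted label is not `wanted`, the
    probability of `wanted` is one minus the reported score.
    """
    return [s if lab == wanted else 1.0 - s for lab, s in zip(labels, scores)]

# Made-up rows in which each score is the confidence in the predicted label.
pred_labels = [2, 2, 1]
pred_scores = [0.65, 0.89, 0.80]
p_class_2 = prob_of_class(pred_labels, pred_scores, wanted=2)
```

The third row was predicted as class 1 with confidence 0.80, so its probability of class 2 comes out at about 0.20. With the output format shown in this kernel no conversion is needed.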

In [37]:
submission['voted'] = predictions['Score']

In [40]:
submission.to_csv('submission_proba.csv', index = False)
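
Before uploading, a quick sanity check on the predicted values can catch a malformed file. A minimal sketch, assuming the metric expects one probability in [0, 1] per test row; `validate_probabilities` is a hypothetical helper.

```python
import math

def validate_probabilities(values, expected_rows):
    """Raise ValueError if the values cannot be a valid probability column."""
    values = list(values)
    if len(values) != expected_rows:
        raise ValueError(f'expected {expected_rows} rows, got {len(values)}')
    for i, v in enumerate(values):
        if math.isnan(float(v)):
            raise ValueError(f'row {i} is NaN')
        if not 0.0 <= v <= 1.0:
            raise ValueError(f'row {i} is outside [0, 1]: {v}')
    return True
```

In this kernel the call would be `validate_probabilities(submission['voted'], len(test))`.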

- You will probably get around 0.77 AUC, and I expect the score can be improved further with additional work.