# "A Guide to a Machine Learning Approach": Table of Contents
## 1. Library
## 2. Data Loading
## 3. Feature Engineering
### 3-1. Feature Generation
#### Code shared with "파베르"
#### (https://dacon.io/competitions/official/235745/codeshare/2851?page=1&dtype=recent)
### 3-2. Feature Engineering
#### 3-2-1. Encoding
#### 3-2-2. Scaling
## 4. Modeling with Pycaret
## 5. Modeling with CatBoostRegressor

## 1. Library

In [1]:
# for "2. Data Loading"
import pandas as pd

# for "3-1. Feature Generation"
import numpy as np

# for "3-2. Feature Engineering"
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import RobustScaler, StandardScaler

# for "4. Modeling with Pycaret"
from pycaret.regression import *

# for "5. Modeling with CatBoostRegressor"
from catboost import CatBoostRegressor
import optuna
from optuna import Trial
from optuna.samplers import TPESampler
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, StratifiedKFold

## 2. Data Loading

In [2]:
train = pd.read_csv('data/train0723(5).3.csv')
test = pd.read_csv('data/test0723.fi.csv')
train = train.set_index("code")
test = test.set_index("code")

## 3. Feature Engineering

### 3-1. Feature Generation
#### Code shared with "파베르"
#### (https://dacon.io/competitions/official/235745/codeshare/2851?page=1&dtype=recent)
##### For the EDA related to Feature Generation, please see the link above :)

## 5. Modeling with CatBoostRegressor

- Setting Data For Model

In [3]:
X = train.drop(columns = ['target'])
y = train['target']

- Hyperparameter Tuning

In [4]:
def objective(trial: Trial) -> float:
    # Fixed settings plus the search space that Optuna samples for this trial.
    params_cat = {
        "random_state": 42,
        "learning_rate": 0.05,
        "n_estimators": 10000,  # upper bound; early stopping ends training sooner
        "verbose": 1,
        "objective": "MAE",
        "max_depth": trial.suggest_int("max_depth", 1, 16),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.8, 1.0),
        "subsample": trial.suggest_float("subsample", 0.3, 1.0),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "max_bin": trial.suggest_int("max_bin", 200, 500),
    }

    # Hold out 20% for validation; the fixed seed scores every trial on the same split.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    model = CatBoostRegressor(**params_cat)
    model.fit(
        X_tr,
        y_tr,
        eval_set=[(X_tr, y_tr), (X_val, y_val)],
        early_stopping_rounds=10,
        verbose=False,
    )

    # The validation MAE is the value Optuna minimizes.
    cat_pred = model.predict(X_val)
    mae = mean_absolute_error(y_val, cat_pred)

    return mae
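
Each trial is scored on a single 20% holdout, so how that holdout is drawn affects how comparable the trial scores are. When `random_state` is set in `train_test_split`, every call returns the same rows; when it is left unset, each call draws a new split. The following is a minimal sketch of that behaviour on a toy array. The names `rows`, `val_a`, `val_b`, and `same_holdout` are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the row positions of a dataset.
rows = np.arange(20)

# Two calls with the same seed return the same holdout rows.
_, val_a = train_test_split(rows, test_size=0.2, random_state=42)
_, val_b = train_test_split(rows, test_size=0.2, random_state=42)

same_holdout = np.array_equal(val_a, val_b)
print("Identical holdout across calls:", same_holdout)
print("Holdout size:", len(val_a))
```

With a shared holdout, differences in MAE between trials come from the sampled parameters rather than from which rows happened to land in the validation set.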

In [5]:
sampler = TPESampler(seed=42)
study = optuna.create_study(
    study_name="cat_opt",
    direction="minimize",
    sampler=sampler,
)
study.optimize(objective, n_trials=10)
print("Best Score:", study.best_value)
print("Best trial:", study.best_trial.params)

[I 2021-07-25 23:24:47,062] A new study created in memory with name: cat_opt
[I 2021-07-25 23:24:47,505] Trial 0 finished with value: 138.79743196539047 and parameters: {'max_depth': 6, 'colsample_bylevel': 0.9901428612819833, 'subsample': 0.8123957592679836, 'min_child_samples': 62, 'max_bin': 246}. Best is trial 0 with value: 138.79743196539047.
[I 2021-07-25 23:24:47,643] Trial 1 finished with value: 125.90885383485603 and parameters: {'max_depth': 3, 'colsample_bylevel': 0.8116167224336399, 'subsample': 0.9063233020424546, 'min_child_samples': 62, 'max_bin': 413}. Best is trial 1 with value: 125.90885383485603.
[I 2021-07-25 23:24:47,775] Trial 2 finished with value: 122.02544719055965 and parameters: {'max_depth': 1, 'colsample_bylevel': 0.9939819704323989, 'subsample': 0.8827098485602951, 'min_child_samples': 25, 'max_bin': 254}. Best is trial 2 with value: 122.02544719055965.
[I 2021-07-25 23:24:47,903] Trial 3 finishe

Best Score: 90.5308168076066
Best trial: {'max_depth': 3, 'colsample_bylevel': 0.8608484485919076, 'subsample': 0.6673295021425665, 'min_child_samples': 46, 'max_bin': 287}


In [6]:
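# best_trial.params holds only the values suggested by Optuna. The fixed settings in
# objective() (learning_rate, n_estimators, objective="MAE", random_state) are not
# included, so CatBoost falls back to its defaults for them here.
cat_p = study.best_trial.params
cat = CatBoostRegressor(**cat_p)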
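
As the "Best trial" output shows, `study.best_trial.params` contains only the five tuned values. If the final model is meant to reuse the fixed settings from `objective()` as well, the two dictionaries could be merged before building the model. The sketch below assumes that intent. `fixed_params`, `best_params`, and `merge_params` are hypothetical names, and the values are copied from the cells above.

```python
# Fixed settings copied from objective() (the "verbose" entry is omitted here).
fixed_params = {
    "random_state": 42,
    "learning_rate": 0.05,
    "n_estimators": 10000,
    "objective": "MAE",
}

# Tuned values copied from the "Best trial" output above.
best_params = {
    "max_depth": 3,
    "colsample_bylevel": 0.8608484485919076,
    "subsample": 0.6673295021425665,
    "min_child_samples": 46,
    "max_bin": 287,
}


def merge_params(fixed: dict, tuned: dict) -> dict:
    """Return one dict with the fixed settings plus the tuned values."""
    overlap = set(fixed) & set(tuned)
    if overlap:
        # A key defined in both places would be silently overwritten otherwise.
        raise ValueError(f"Parameters defined twice: {sorted(overlap)}")
    return {**fixed, **tuned}


final_params = merge_params(fixed_params, best_params)
print(final_params)
# In the notebook this could be passed on as CatBoostRegressor(**final_params).
```

Note that with `n_estimators=10000` a final fit without an `eval_set` has no early stopping, so either an evaluation set or a smaller iteration count would likely be needed.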

- Stratified K-Fold for Regression

In [7]:
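# StratifiedKFold needs class labels, so bin the continuous target into 10 equal-width classes.
y_cat = pd.cut(y, 10, labels=range(10))
skf = StratifiedKFold(5)

preds = []
for tr_id, val_id in skf.split(X, y_cat):
    # Only the training part of each fold is used; val_id is not used for scoring here.
    X_tr = X.iloc[tr_id]
    y_tr = y.iloc[tr_id]

    cat.fit(X_tr, y_tr, verbose=0)

    # Predict the test set with the model trained on this fold.
    pred = cat.predict(test)
    preds.append(pred)

# Average the five fold models' test predictions.
cat_pred = np.mean(preds, axis=0)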
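
The loop above does not use `val_id`, so it produces no validation score. One way the held-out part of each fold could be used is to collect out-of-fold predictions and compute a single MAE over all rows. The following is a minimal sketch under several assumptions: the data is synthetic, the names `X_demo`, `y_demo`, `y_bins`, and `oof` are made up, and scikit-learn's `LinearRegression` stands in for `CatBoostRegressor` so that the example runs quickly without CatBoost.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import StratifiedKFold

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"f1": rng.uniform(0, 1, 200), "f2": rng.uniform(0, 1, 200)})
y_demo = pd.Series(1000 * X_demo["f1"] + rng.normal(scale=10, size=200))

# Bin the continuous target into 10 equal-width classes for stratification.
# labels=False returns the integer bin codes 0..9 directly.
y_bins = pd.cut(y_demo, 10, labels=False)

skf = StratifiedKFold(n_splits=5)
oof = np.full(len(y_demo), np.nan)  # one out-of-fold prediction per row

for tr_id, val_id in skf.split(X_demo, y_bins):
    model = LinearRegression()  # stand-in for CatBoostRegressor
    model.fit(X_demo.iloc[tr_id], y_demo.iloc[tr_id])
    # Each row is predicted by the one model that did not see it in training.
    oof[val_id] = model.predict(X_demo.iloc[val_id])

oof_mae = mean_absolute_error(y_demo, oof)
print(f"Out-of-fold MAE: {oof_mae:.2f}")
```

Because every row lands in exactly one validation fold, `oof` ends up fully populated, and the resulting MAE is an estimate of generalization error that can be checked before submitting.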

In [8]:
sample = pd.read_csv('data/sample_submission.csv')
sample['num'] = cat_pred
sample.to_csv('sub/cat0725_5.csv', index=False)

In [9]:
sample

|   | code | num |
|---|------|-----|
| 0 | C1072 | 737.367282 |
| 1 | C1128 | 1237.697214 |
| 2 | C1456 | 510.684007 |
| 3 | C1840 | 499.108403 |
| 4 | C1332 | 1135.048148 |
| 5 | C1563 | 1827.699492 |
| 6 | C1794 | 836.777456 |
| 7 | C1640 | 495.316096 |
| 8 | C1377 | 362.507132 |
| 9 | C2072 | 268.875325 |
