<a href="https://colab.research.google.com/github/LeeSeungwon89/Kaggle_Dacon_Practice/blob/main/3.%20Bike_Sharing_Demand_improve_model_performance1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install kaggle
from google.colab import files
files.upload()

In [None]:
ls -1ha kaggle.json

kaggle.json


In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Restrict file permissions so the Kaggle CLI does not emit a permission warning.
!chmod 600 ~/.kaggle/kaggle.json

# List competitions (this also confirms that the API token works).
!kaggle competitions list

In [None]:
!kaggle competitions download -c bike-sharing-demand

Downloading bike-sharing-demand.zip to /content
  0% 0.00/189k [00:00<?, ?B/s]
100% 189k/189k [00:00<00:00, 63.1MB/s]


In [None]:
!ls

bike-sharing-demand.zip  kaggle.json  sample_data


In [None]:
!unzip bike-sharing-demand.zip

Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv    
  inflating: test.csv                
  inflating: train.csv               


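# **1. Performance Improvement Procedure**

We will attempt to improve performance in two chapters. In the first chapter, we tune the hyperparameters and check the resulting score right away. In the second chapter, we combine the approach from the first chapter with one additional feature engineering step: removing the 'month' feature.

# **2. Feature Engineering from the Baseline Model**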

In [None]:
import numpy as np
import pandas as pd
import random

np.random.seed(2022)
random.seed(2022)

# Set the maximum number of columns and rows to display.
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

# Load the data.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submission = pd.read_csv('sampleSubmission.csv')

## **2.1. Removing Outlier Records**

We remove the records whose 'weather' value is 4.

In [None]:
train = train[train['weather']!=4]
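One way to sanity-check this kind of filter is to count the records per category and compare the row counts before and after. The following is a minimal sketch on a toy frame; the values in `toy` are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv.
toy = pd.DataFrame({'weather': [1, 2, 1, 4, 3], 'count': [16, 40, 32, 164, 13]})

# How many records fall into each weather category?
print(toy['weather'].value_counts().sort_index())

# A boolean mask filters rows (records), not columns (features).
filtered = toy[toy['weather'] != 4]
print(len(toy), len(filtered))
```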

## **2.2. Combining the Training and Test Sets**

We combine the training and test sets so that feature engineering is applied to both at once.

In [None]:
all_data = pd.concat([train, test], ignore_index=True)
all_data

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,16.0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,40.0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,32.0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,13.0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
17373,2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,
17374,2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,
17375,2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,
17376,2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,


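## **2.3. Splitting Features**

We split the 'datetime' feature into year, month, hour, and day-of-week features. Date, day, minute, and second features are unsuitable for modeling, so they are not kept.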

In [None]:
from datetime import datetime
import calendar

# Create the year, month, and hour features.
all_data['year'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[0])
all_data['month'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[1])
all_data['hour'] = all_data['datetime'].apply(lambda x: x.split()[1].split(':')[0])

# Extract the date and convert it to a numeric day of the week (Monday=0, Sunday=6).
all_data['date'] = all_data['datetime'].apply(lambda x: x.split()[0])
all_data['day_of_week'] = all_data['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').weekday())

# Drop the date-related features that are no longer needed.
all_data.drop(['datetime', 'date'], axis=1, inplace=True)

In [None]:
all_data.head(1)

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,year,month,hour,day_of_week
0,1,0,0,1,9.84,14.395,81,0.0,3.0,13.0,16.0,2011,1,0,5
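The same split can also be expressed with the pandas `.dt` accessor instead of string splitting. This is a sketch of an alternative, not the approach used in this notebook; `toy` and `parsed` are names introduced only for the example. `strftime` keeps the values as zero-padded strings, so they would still be one-hot encoded in the same way later on.

```python
import pandas as pd

# Two timestamps in the same format as the 'datetime' column.
toy = pd.DataFrame({'datetime': ['2011-01-01 00:00:00', '2012-12-31 22:00:00']})
parsed = pd.to_datetime(toy['datetime'])

# Zero-padded strings, matching what the split-based code produces ('01', '00', ...).
toy['year'] = parsed.dt.strftime('%Y')
toy['month'] = parsed.dt.strftime('%m')
toy['hour'] = parsed.dt.strftime('%H')

# Monday is 0 and Sunday is 6, the same convention as datetime.weekday().
toy['day_of_week'] = parsed.dt.dayofweek

print(toy.drop('datetime', axis=1))
```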


## **2.4. Removing Unnecessary Features**

We remove the features that are not needed.

In [None]:
feature_list = ['casual', 'windspeed', 'registered']
all_data.drop(feature_list, axis=1, inplace=True)

## **2.5. Feature Encoding**

The derived features 'year', 'month', and 'hour' are nominal features, so we apply one-hot encoding to them.

In [None]:
all_data_ohe = pd.get_dummies(all_data)
all_data_ohe.head()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,count,day_of_week,year_2011,year_2012,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,hour_00,hour_01,hour_02,hour_03,hour_04,hour_05,hour_06,hour_07,hour_08,hour_09,hour_10,hour_11,hour_12,hour_13,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23
0,1,0,0,1,9.84,14.395,81,16.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,9.02,13.635,80,40.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,9.02,13.635,80,32.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,1,9.84,14.395,75,13.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,9.84,14.395,75,1.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
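When `pd.get_dummies` is called without the `columns` argument, it expands only object, string, or category columns. That is why the string-valued 'year', 'month', and 'hour' become dummy columns while integer-coded features such as 'season' and 'day_of_week' are left unchanged. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'season': [1, 2],          # integer column: left as-is
    'year': ['2011', '2012'],  # string column: expanded into dummy columns
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

To one-hot encode an integer-coded feature as well, it could be listed explicitly in the `columns` argument.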


# **3. Modeling**

## **3.1. Preparing the Data**

We prepare the data for training.

In [None]:
train_num = len(train) # Number of training records (after outlier removal).
X_train_df = all_data_ohe[:train_num].drop('count', axis=1) # Training features.
X_test_df = all_data_ohe[train_num:].drop('count', axis=1) # Test features.
y_train = train['count'] # Target values.

We apply a log transform to the target values.

In [None]:
log_y_train = np.log(y_train)
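Because the model is trained on the log of the target, predictions have to be mapped back with `np.exp` before they are submitted. A small round-trip sketch with made-up counts:

```python
import numpy as np

# Hypothetical rental counts; np.log is finite here because every count is at least 1.
counts = np.array([1.0, 16.0, 40.0])

log_counts = np.log(counts)   # scale the model is trained on
restored = np.exp(log_counts) # inverse transform applied to predictions

print(log_counts)
print(restored)
```

If a target could contain zeros, `np.log1p` with `np.expm1` as its inverse would be a common alternative; that is an assumption about other datasets, not something this notebook needs.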

## **3.2. XGBoost**

### **3.2.1. Hyperparameter Tuning**

We use optuna to tune the hyperparameters.

In [None]:
pip install optuna


The evaluation metric for this competition is RMSLE. We declare a function that computes it.

In [None]:
def rmsle(y, prediction, exponent=True): # Exponentiation is applied by default.
    # If the inputs are on the log scale, exponentiate them back to the original scale.
    if exponent:
        y = np.exp(y)
        prediction = np.exp(prediction)

    # Implement the RMSLE formula.
    # Apply the log transform and use NumPy's nan_to_num() function to replace NaN values with 0.
    log_y = np.nan_to_num(np.log(y + 1))
    log_prediction = np.nan_to_num(np.log(prediction + 1))
    result = np.sqrt(np.mean((log_y - log_prediction)**2))

    return result
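As a worked check of the formula, RMSLE is the square root of the mean squared difference between `log(y + 1)` and `log(prediction + 1)`. The sketch below uses a hypothetical helper, `rmsle_raw`, that works on the original count scale, and two made-up data points where the result can be verified by hand: the first pair contributes 0 and the second contributes `(log 4 - log 8)^2 = (log 2)^2`, so the score is `log(2) / sqrt(2)`.

```python
import numpy as np

def rmsle_raw(y, prediction):
    """RMSLE on the original count scale (hypothetical helper for this example)."""
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(prediction)) ** 2))

y_true = np.array([1.0, 3.0])
y_pred = np.array([1.0, 7.0])

score = rmsle_raw(y_true, y_pred)

# Starting from log-transformed values and exponentiating first gives the same
# score, which is the case that exponent=True handles in rmsle().
score_from_logs = rmsle_raw(np.exp(np.log(y_true)), np.exp(np.log(y_pred)))

print(score, np.log(2) / np.sqrt(2))
```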

We now run the hyperparameter tuning.

In [None]:
import optuna
from optuna.samplers import TPESampler

# Suppress the per-trial log output.
optuna.logging.set_verbosity(optuna.logging.WARNING)

In [None]:
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def objective_XGB(trial, X_train_df, log_y_train):
    X_train, X_valid, y_train, y_valid = train_test_split(X_train_df, log_y_train,
                                                          test_size=0.2,
                                                          random_state=42)
    
    params = {
        'booster': trial.suggest_categorical('booster', ['gbtree', 'dart']),
        'objective': 'reg:squarederror',
        'learning_rate': trial.suggest_float('learning_rate', 0.0001, 0.1),
        'n_estimators': trial.suggest_int('n_estimators', 100, 3000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 100),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 100),
        'gamma': trial.suggest_float('gamma', 0, 9),
        'random_state': 42
    }
    model = XGBRegressor(**params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
              early_stopping_rounds=50,
              verbose=False)
    prediction = model.predict(X_valid)
    rmsle_score = rmsle(y_valid, prediction)

    return rmsle_score

In [None]:
study = optuna.create_study(direction='minimize', sampler=TPESampler())
study.optimize(lambda trial: objective_XGB(trial, X_train_df, log_y_train),
               n_trials=50,
               show_progress_bar=True)

print(f'Best trial score: {study.best_trial.value}')
print(f'Best params: {study.best_trial.params}')


Best trial score: 0.29452980905665616
Best params: {'booster': 'gbtree', 'learning_rate': 0.09815274040846987, 'n_estimators': 1821, 'max_depth': 9, 'subsample': 0.7800160517868404, 'colsample_bytree': 0.8134210145981454, 'reg_alpha': 0.36022635152073, 'reg_lambda': 86.35377373358983, 'gamma': 0.0021797946912791456}


We visualize the hyperparameter importances.

In [None]:
optuna.visualization.plot_param_importances(study)

### **3.2.2. Training the Model**

We build the model with the best parameters found.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_df, log_y_train,
                                                      test_size=0.2,
                                                      random_state=42)

params = study.best_trial.params
xgb_reg_model = XGBRegressor(**params)
xgb_reg_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                  early_stopping_rounds=50)

### **3.2.3. Checking the Model Score**

We create the submission file and check the score.

In [None]:
xgb_prediction_test = xgb_reg_model.predict(X_test_df)
submission['count'] = np.exp(xgb_prediction_test)
submission.to_csv('Bike_Sharing_Demand_submission4.csv', index=False)

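The private score is 0.39687 (lower is better). This is a large improvement over the 0.51315 obtained with the default parameters, and it corresponds to 199th place out of 3,242 teams.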

## **3.3. LightGBM**

### **3.3.1. Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor

def objective_LGBM(trial, X_train_df, log_y_train):
    X_train, X_valid, y_train, y_valid = train_test_split(X_train_df, log_y_train,
                                                          test_size=0.2,
                                                          random_state=42)
        
    params = {
        'boosting_type': trial.suggest_categorical('boosting_type', ['gbdt', 'dart']),
        'learning_rate': trial.suggest_float('learning_rate', 0.0001, 10),
        'n_estimators': trial.suggest_int('n_estimators', 100, 3000),
        'num_leaves': trial.suggest_int('num_leaves', 20, 200),
        'max_depth': -1,
        'subsample': trial.suggest_float('subsample', 0.6, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 100),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 100),
        'min_child_samples': trial.suggest_int('min_child_samples', 50, 1000),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-3, 10),
        'random_state': 42
    }
    model = LGBMRegressor(**params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
              early_stopping_rounds=50,
              verbose=False)
    prediction = model.predict(X_valid)
    rmsle_score = rmsle(y_valid, prediction)

    return rmsle_score

In [None]:
study = optuna.create_study(direction='minimize', sampler=TPESampler())
study.optimize(lambda trial: objective_LGBM(trial, X_train_df, log_y_train),
               n_trials=500,
               show_progress_bar=True)

print(f'Best trial score: {study.best_trial.value}')
print(f'Best params: {study.best_trial.params}')


Best trial score: 0.29165284609239966
Best params: {'boosting_type': 'gbdt', 'learning_rate': 0.16222788567232288, 'n_estimators': 2185, 'num_leaves': 115, 'subsample': 0.9213391499130541, 'colsample_bytree': 0.9987381522834238, 'reg_lambda': 69.98144902032159, 'min_child_samples': 169, 'min_child_weight': 6.402405128134849}


In [None]:
optuna.visualization.plot_param_importances(study)

### **3.3.2. Training the Model**

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_df, log_y_train,
                                                      test_size=0.2,
                                                      random_state=42)

params = study.best_trial.params
lgbm_reg_model = LGBMRegressor(**params)
lgbm_reg_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                  early_stopping_rounds=50, verbose=False)

LGBMRegressor(colsample_bytree=0.9987381522834238,
              learning_rate=0.16222788567232288, min_child_samples=169,
              min_child_weight=6.402405128134849, n_estimators=2185,
              num_leaves=115, reg_lambda=69.98144902032159,
              subsample=0.9213391499130541)

### **3.3.3. Checking the Model Score**

In [None]:
lgbm_prediction_test = lgbm_reg_model.predict(X_test_df)
submission['count'] = np.exp(lgbm_prediction_test)
submission.to_csv('Bike_Sharing_Demand_submission10.csv', index=False)

The private score is 0.40596. This is slightly worse (higher) than the 0.40073 obtained before hyperparameter tuning.