<a href="https://colab.research.google.com/github/LeeSeungwon89/Kaggle_Dacon_Practice/blob/main/5.%20Bike_Sharing_Demand_final_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install kaggle
from google.colab import files
files.upload()

In [2]:
ls -1ha kaggle.json

kaggle.json


In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Restrict file permissions so the Kaggle CLI does not raise a permission warning.
!chmod 600 ~/.kaggle/kaggle.json

# Optionally list the competitions.
# !kaggle competitions list

In [4]:
!kaggle competitions download -c bike-sharing-demand

Downloading bike-sharing-demand.zip to /content
  0% 0.00/189k [00:00<?, ?B/s]
100% 189k/189k [00:00<00:00, 74.2MB/s]


In [5]:
!ls

bike-sharing-demand.zip  kaggle.json  sample_data


In [6]:
!unzip bike-sharing-demand.zip

Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv    
  inflating: test.csv                
  inflating: train.csv               


# **1. Final Model**

After working through the full workflow, XGBoost trained with both the 'month' and 'season' features gave the best performance.

# **2. Final Feature Engineering**

In [7]:
import numpy as np
import pandas as pd
import random

np.random.seed(2022)
random.seed(2022)

# Set the maximum number of columns and rows to display.
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

# Load the data.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submission = pd.read_csv('sampleSubmission.csv')

## **2.1. Removing Outlier Records**

Remove the records whose 'weather' value is 4.

In [8]:
train = train[train['weather']!=4]

## **2.2. Combining the Training and Test Sets**

Combine the training and test sets so that feature engineering is applied to both at once.

In [9]:
all_data = pd.concat([train, test], ignore_index=True)
all_data

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,16.0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,40.0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,32.0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,13.0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
17373,2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,
17374,2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,
17375,2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,
17376,2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,


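## **2.3. Feature Splitting**

Split the 'datetime' feature into year, month, hour, and day-of-week features. Day, minute, and second features are unsuitable for modeling, so they are not created; the date is extracted only as an intermediate value and then dropped.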

In [10]:
from datetime import datetime

# Create the year, month, and hour features (kept as strings).
all_data['year'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[0])
all_data['month'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[1])
all_data['hour'] = all_data['datetime'].apply(lambda x: x.split()[1].split(':')[0])

# Extract the date and convert it to a numeric day of the week (Monday=0).
all_data['date'] = all_data['datetime'].apply(lambda x: x.split()[0])
all_data['day_of_week'] = all_data['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').weekday())

# Drop the raw datetime and intermediate date features.
all_data.drop(['datetime', 'date'], axis=1, inplace=True)

In [11]:
all_data.head(1)

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,year,month,hour,day_of_week
0,1,0,0,1,9.84,14.395,81,0.0,3.0,13.0,16.0,2011,1,0,5
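
The same split can also be sketched with pandas' datetime accessor instead of string splitting. This is an alternative, not the notebook's method, and it assumes the column parses cleanly with `pd.to_datetime`; the two-row frame `demo` is hypothetical. Note that this variant yields integers rather than zero-padded strings, so the columns would need to be cast to `str` before `pd.get_dummies` treats them as categorical.

```python
import pandas as pd

# Hypothetical two-row frame standing in for all_data.
demo = pd.DataFrame({'datetime': ['2011-01-01 00:00:00', '2012-12-31 22:00:00']})

parsed = pd.to_datetime(demo['datetime'])
demo['year'] = parsed.dt.year
demo['month'] = parsed.dt.month
demo['hour'] = parsed.dt.hour
# Monday=0 ... Sunday=6, the same convention as datetime.weekday().
demo['day_of_week'] = parsed.dt.dayofweek

print(demo.drop(columns='datetime'))
```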


## **2.4. Removing Unnecessary Features**

Remove the unnecessary features.

In [12]:
feature_list = ['casual', 'windspeed', 'registered']
all_data.drop(feature_list, axis=1, inplace=True)

## **2.5. Feature Encoding**

The derived features 'year', 'month', and 'hour' are nominal features, so one-hot encoding is applied to them.

In [13]:
all_data_ohe = pd.get_dummies(all_data)
all_data_ohe.head()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,count,day_of_week,year_2011,year_2012,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,hour_00,hour_01,hour_02,hour_03,hour_04,hour_05,hour_06,hour_07,hour_08,hour_09,hour_10,hour_11,hour_12,hour_13,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23
0,1,0,0,1,9.84,14.395,81,16.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,9.02,13.635,80,40.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,9.02,13.635,80,32.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,1,9.84,14.395,75,13.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,9.84,14.395,75,1.0,5,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
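
When `pd.get_dummies` is called without `columns=`, it encodes only string (object) or categorical columns. That is why 'year', 'month', and 'hour' were expanded above while integer columns such as 'season' and 'day_of_week' were left unchanged. A small sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame: one integer column and one string column.
demo = pd.DataFrame({
    'season': [1, 2],       # integer column: left untouched
    'hour': ['00', '13'],   # string column: expanded into indicator columns
})

encoded = pd.get_dummies(demo)
print(list(encoded.columns))
```

Passing `columns=[...]` would also one-hot encode integer-coded nominal columns, but that would be a change to the pipeline rather than what this notebook does.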


# **3. Final Modeling**

## **3.1. Data Preparation**

Prepare the data.

In [14]:
train_num = len(train) # Number of training rows.
X_train_df = all_data_ohe[:train_num].drop('count', axis=1) # Training features.
X_test_df = all_data_ohe[train_num:].drop('count', axis=1) # Test features.
y_train = train['count'] # Target values.
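
The positional split works because the sets were concatenated with `ignore_index=True` after the outlier rows had been removed from `train`, so the first `len(train)` rows of the combined frame are exactly the training rows. A minimal sketch with hypothetical toy frames (`train_demo`, `test_demo`):

```python
import pandas as pd

# Hypothetical stand-ins for the training and test sets.
train_demo = pd.DataFrame({'temp': [9.84, 9.02], 'count': [16, 40]})
test_demo = pd.DataFrame({'temp': [10.66]})  # no target column

combined = pd.concat([train_demo, test_demo], ignore_index=True)
n_train = len(train_demo)

# Slice by position, then drop the target from both parts.
X_train_demo = combined[:n_train].drop('count', axis=1)
X_test_demo = combined[n_train:].drop('count', axis=1)

print(len(X_train_demo), len(X_test_demo))
```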

Apply a log transform to the target values.

In [15]:
log_y_train = np.log(y_train)
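
The log transform is undone with `np.exp` when predictions are converted back to counts. A minimal round-trip sketch, assuming every count is at least 1 so that `np.log` stays finite (the array `counts` is hypothetical):

```python
import numpy as np

counts = np.array([1.0, 16.0, 40.0])  # hypothetical target values, all >= 1
log_counts = np.log(counts)           # transform applied before training
restored = np.exp(log_counts)         # inverse applied to predictions

print(np.allclose(restored, counts))
```

If a target could contain zeros, `np.log1p` with `np.expm1` as its inverse is a common alternative; that would be a different choice from the one made in this notebook.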

## **3.2. XGBoost**

### **3.2.1. Hyperparameter Tuning**

Tune the hyperparameters with optuna.

In [16]:
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.0.5-py3-none-any.whl (348 kB)
Collecting cliff
  Downloading cliff-4.1.0-py3-none-any.whl (81 kB)
Collecting importlib-metadata<5.0.0
  Downloading importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.9.1-py3-none-any.whl (210 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.9.0-py3-none-any.whl (23 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any

The evaluation metric for this problem is RMSLE. Declare a function that computes it.

In [17]:
def rmsle(y, prediction, exponent=True): # Exponentiation is enabled by default.
    # If requested, convert log-scale targets and predictions back to the original scale.
    if exponent:
        y = np.exp(y)
        prediction = np.exp(prediction)

    # Implement the RMSLE formula.
    # Apply the log transform and use NumPy's nan_to_num() to replace NaN values with 0.
    log_y = np.nan_to_num(np.log(y + 1))
    log_prediction = np.nan_to_num(np.log(prediction + 1))
    result = np.sqrt(np.mean((log_y - log_prediction)**2))

    return result
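
For reference, the final three lines of the function correspond to the standard RMSLE definition. The symbols are chosen here for illustration: $p_i$ is the predicted count, $a_i$ the actual count, and $n$ the number of samples.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}
```

With `exponent=True`, the inputs are first mapped back from the log scale with `np.exp`, so $p_i$ and $a_i$ are counts on the original scale.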

Run the hyperparameter tuning.

In [18]:
import optuna
from optuna.samplers import TPESampler

# Suppress the per-trial log output.
optuna.logging.set_verbosity(optuna.logging.WARNING)

In [19]:
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def objective_XGB(trial, X_train_df, log_y_train):
    X_train, X_valid, y_train, y_valid = train_test_split(X_train_df, log_y_train,
                                                          test_size=0.2,
                                                          random_state=42)
    
    params = {
        'booster': trial.suggest_categorical('booster', ['gbtree']),
        'objective': 'reg:squarederror',
        'learning_rate': trial.suggest_float('learning_rate', 0.0001, 0.1),
        'n_estimators': trial.suggest_int('n_estimators', 100, 3000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 100),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 100),
        'gamma': trial.suggest_float('gamma', 0, 9),
        'random_state': 42
    }
    model = XGBRegressor(**params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
              early_stopping_rounds=50,
              verbose=False)
    prediction = model.predict(X_valid)
    rmsle_score = rmsle(y_valid, prediction)

    return rmsle_score

In [20]:
study = optuna.create_study(direction='minimize', sampler=TPESampler())
study.optimize(lambda trial: objective_XGB(trial, X_train_df, log_y_train),
               n_trials=50,
               show_progress_bar=True)

print(f'Best trial score: {study.best_trial.value}')
print(f'Best params: {study.best_trial.params}')

Best trial score: 0.29896693308999195
Best params: {'booster': 'gbtree', 'learning_rate': 0.09987366081308952, 'n_estimators': 2315, 'max_depth': 6, 'subsample': 0.7082052036080327, 'colsample_bytree': 0.8521354269767685, 'reg_alpha': 7.059389840321787, 'reg_lambda': 10.96960972736687, 'gamma': 0.03755162275829448}


Visualize the parameter importances.

In [21]:
optuna.visualization.plot_param_importances(study)

### **3.2.2. Model Training**

Build the model with the best parameters.

In [25]:
params = study.best_trial.params
xgb_reg_model = XGBRegressor(**params)
xgb_reg_model.fit(X_train_df, log_y_train)



XGBRegressor(colsample_bytree=0.8521354269767685, gamma=0.03755162275829448,
             learning_rate=0.09987366081308952, max_depth=6, n_estimators=2315,
             reg_alpha=7.059389840321787, reg_lambda=10.96960972736687,
             subsample=0.7082052036080327)

### **3.2.3. Checking the Model Score**

Create the submission file and check the score.

In [26]:
xgb_prediction_test = xgb_reg_model.predict(X_test_df)
submission['count'] = np.exp(xgb_prediction_test)
submission.to_csv('Bike_Sharing_Demand_submission14.csv', index=False)

The private score is 0.39534, which ranks 190th out of 3,242 teams.