# Bike Sharing Demand

- Forecast usage of a city bike-sharing system
- Download `train.csv` and `test.csv` from the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand) competition on [Kaggle](https://www.kaggle.com)
- Save the two files in the `datasets` directory as `bike_train.csv` and `bike_test.csv`

- The task is to predict the number of bike rentals
- Evaluation: Submissions are evaluated on the Root Mean Squared Logarithmic Error (RMSLE).
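
As a minimal sketch of the metric, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The helper name `rmsle` is an assumption made for illustration and is not used elsewhere in this notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which stays finite when a count is zero
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```

Because the error is measured on a log scale, missing a small count by a given ratio costs about as much as missing a large count by the same ratio.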

## Data exploration and preprocessing

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline
plt.style.use('ggplot')
# Print the list of available styles
#plt.style.available

* [Style reference](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html)

In [3]:
bike_train = pd.read_csv('bike_train.csv')
bike_train.shape

(10886, 12)

In [4]:
bike_train.head(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32


In [5]:
bike_train.tail(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.91,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129
10885,2012-12-19 23:00:00,4,0,1,1,13.12,16.665,66,8.9981,4,84,88


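datetime: hourly date + timestamp  
season: 1 = spring, 2 = summer, 3 = fall, 4 = winter  
holiday: 1 = a holiday (such as a public holiday) that is not a Saturday or Sunday, 0 = not a holiday  
workingday: 1 = a weekday that is neither a weekend day nor a holiday, 0 = weekend or holiday  
weather:  
• 1 = clear, partly cloudy  
• 2 = mist, mist + cloudy  
• 3 = light snow, light rain + thunderstorm  
• 4 = heavy snow/rain, thunder/lightning  
temp: temperature (Celsius)  
atemp: "feels like" temperature (Celsius)  
humidity: relative humidity  
windspeed: wind speed  
casual: number of rentals by non-registered users  
registered: number of rentals by registered users  
count: total number of rentals  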

In [6]:
bike_test = pd.read_csv('bike_test.csv')
bike_test.shape

(6493, 9)

In [7]:
bike_test.head(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0


### Data exploration

In [8]:
bike_train['datetime'] = bike_train.datetime.apply(pd.to_datetime)

In [9]:
bike_train['year'] = bike_train.datetime.apply(lambda x : x.year)
bike_train['month'] = bike_train.datetime.apply(lambda x : x.month)
bike_train['day'] = bike_train.datetime.apply(lambda x : x.day)
bike_train['hour'] = bike_train.datetime.apply(lambda x: x.hour)

In [10]:
bike_test['datetime'] = bike_test.datetime.apply(pd.to_datetime)

In [11]:
bike_test['year'] = bike_test.datetime.apply(lambda x : x.year)
bike_test['month'] = bike_test.datetime.apply(lambda x : x.month)
bike_test['day'] = bike_test.datetime.apply(lambda x : x.day)
bike_test['hour'] = bike_test.datetime.apply(lambda x: x.hour)
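
The four cells above repeat the same steps for the training and test frames. A hedged alternative is sketched below: it parses the whole column with `pd.to_datetime` and reads the parts through the `.dt` accessor instead of a per-row `apply`. The helper name `add_datetime_parts` and the two-row `demo` frame are assumptions made for illustration.

```python
import pandas as pd

def add_datetime_parts(df, column="datetime"):
    """Return a copy of df with year, month, day and hour columns added."""
    out = df.copy()
    # Parse the whole column at once rather than row by row
    out[column] = pd.to_datetime(out[column])
    for part in ("year", "month", "day", "hour"):
        out[part] = getattr(out[column].dt, part)
    return out

# Tiny stand-in for bike_train / bike_test
demo = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-12-19 23:00:00"]})
demo = add_datetime_parts(demo)
```

Using one helper for both frames keeps the training and test features defined in a single place.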

In [12]:
bike_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
 12  year        10886 non-null  int64         
 13  month       10886 non-null  int64         
 14  day         10886 non-null  int64         
 15  hour        10886 non-null  int64         
dtypes: datetime64[ns](1), 

In [13]:
bike_train.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
year          0
month         0
day           0
hour          0
dtype: int64

In [14]:
bike_test.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
year          0
month         0
day           0
hour          0
dtype: int64

In [15]:
bike_train.describe()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,year,month,day,hour
count,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0
mean,2.506614,0.028569,0.680875,1.418427,20.23086,23.655084,61.88646,12.799395,36.021955,155.552177,191.574132,2011.501929,6.521495,9.992559,11.541613
std,1.116174,0.166599,0.466159,0.633839,7.79159,8.474601,19.245033,8.164537,49.960477,151.039033,181.144454,0.500019,3.444373,5.476608,6.915838
min,1.0,0.0,0.0,1.0,0.82,0.76,0.0,0.0,0.0,0.0,1.0,2011.0,1.0,1.0,0.0
25%,2.0,0.0,0.0,1.0,13.94,16.665,47.0,7.0015,4.0,36.0,42.0,2011.0,4.0,5.0,6.0
50%,3.0,0.0,1.0,1.0,20.5,24.24,62.0,12.998,17.0,118.0,145.0,2012.0,7.0,10.0,12.0
75%,4.0,0.0,1.0,2.0,26.24,31.06,77.0,16.9979,49.0,222.0,284.0,2012.0,10.0,15.0,18.0
max,4.0,1.0,1.0,4.0,41.0,45.455,100.0,56.9969,367.0,886.0,977.0,2012.0,12.0,19.0,23.0


In [16]:
# The data is hourly, so a windspeed of 0 is hard to treat as a missing value.
# The smallest non-zero value is about 6.
bike_train["windspeed"].value_counts()

0.0000     1313
8.9981     1120
11.0014    1057
12.9980    1042
7.0015     1034
15.0013     961
6.0032      872
16.9979     824
19.0012     676
19.9995     492
22.0028     372
23.9994     274
26.0027     235
27.9993     187
30.0026     111
31.0009      89
32.9975      80
35.0008      58
39.0007      27
36.9974      22
43.0006      12
40.9973      11
43.9989       8
46.0022       3
56.9969       2
47.9988       2
51.9987       1
50.0021       1
Name: windspeed, dtype: int64
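
The two numbers discussed for `windspeed`, the share of zero readings and the smallest non-zero reading, can be computed directly. The sketch below uses a five-value stand-in series (an assumption made for illustration); on the real data the same expressions would be applied to `bike_train["windspeed"]`.

```python
import pandas as pd

# Stand-in values taken from the value_counts() index above
wind = pd.Series([0.0, 0.0, 6.0032, 8.9981, 11.0014])

zero_share = (wind == 0).mean()      # fraction of readings that are exactly 0
min_nonzero = wind[wind > 0].min()   # smallest reading above 0
```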

### Preparing for model training

In [17]:
bike_train.columns

Index(['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count',
       'year', 'month', 'day', 'hour'],
      dtype='object')

In [18]:
bike_test.columns

Index(['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'year', 'month', 'day', 'hour'],
      dtype='object')

In [19]:
X_train = bike_train.drop(['datetime', 'casual', 'registered', 'count'], axis=1)
y_train = bike_train['count']
X_test = bike_test.drop(['datetime'], axis=1)
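
`casual` and `registered` are absent from the test set, so they cannot serve as features. In the rows displayed by `head(3)` and `tail(3)` they also add up to `count`. The sketch below checks that relationship on those six displayed rows only. Whether it holds for every row is an assumption that would need the same comparison on the full `bike_train` frame.

```python
import pandas as pd

# The six rows displayed by head(3) and tail(3) above
shown = pd.DataFrame({
    "casual": [3, 8, 5, 4, 12, 4],
    "registered": [13, 32, 27, 164, 117, 84],
    "count": [16, 40, 32, 168, 129, 88],
})

# True when casual + registered equals count in every displayed row
parts_match = (shown["casual"] + shown["registered"] == shown["count"]).all()
```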

In [20]:
X_train.columns

Index(['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
       'humidity', 'windspeed', 'year', 'month', 'day', 'hour'],
      dtype='object')

In [21]:
X_test

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,year,month,day,hour
0,1,0,1,1,10.66,11.365,56,26.0027,2011,1,20,0
1,1,0,1,1,10.66,13.635,56,0.0000,2011,1,20,1
2,1,0,1,1,10.66,13.635,56,0.0000,2011,1,20,2
3,1,0,1,1,10.66,12.880,56,11.0014,2011,1,20,3
4,1,0,1,1,10.66,12.880,56,11.0014,2011,1,20,4
...,...,...,...,...,...,...,...,...,...,...,...,...
6488,1,0,1,2,10.66,12.880,60,11.0014,2012,12,31,19
6489,1,0,1,2,10.66,12.880,60,11.0014,2012,12,31,20
6490,1,0,1,1,10.66,12.880,60,11.0014,2012,12,31,21
6491,1,0,1,1,10.66,13.635,56,8.9981,2012,12,31,22


In [22]:
X_train.shape

(10886, 12)

In [23]:
y_train.shape

(10886,)

In [24]:
y_train.value_counts()

5      169
4      149
3      144
6      135
2      132
      ... 
948      1
589      1
629      1
637      1
943      1
Name: count, Length: 822, dtype: int64

In [25]:
from sklearn.preprocessing import StandardScaler

In [26]:
std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train)
X_test_scaled = std_scaler.transform(X_test)
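
The scaler here is fitted on the whole training set, and the cross-validation later in this notebook reuses `X_train_scaled`, so each validation fold has already contributed to the scaling statistics. One possible way to avoid that is to wrap the scaler and the model in a pipeline, so the scaler is refitted inside every fold. The sketch below uses synthetic data and `LinearRegression` as placeholders. `X_demo`, `y_demo` and `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bike features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=5)
```

For tree-based models the effect of scaling is usually small, so this matters most for the linear, kernel and SVM models.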

Model training

In [27]:
from sklearn.ensemble import RandomForestRegressor

In [28]:
rnd_rg = RandomForestRegressor()

In [29]:
rnd_rg.fit(X_train_scaled, y_train)

RandomForestRegressor()

In [30]:
from sklearn.linear_model import SGDRegressor

In [31]:
sg_rg = SGDRegressor()

In [32]:
sg_rg.fit(X_train_scaled, y_train)

SGDRegressor()

In [33]:
from sklearn.linear_model import LinearRegression

In [34]:
lin_rg = LinearRegression()

In [35]:
lin_rg.fit(X_train_scaled, y_train)

LinearRegression()

In [36]:
from sklearn.kernel_ridge import KernelRidge

In [37]:
k_ridge = KernelRidge()

In [38]:
k_ridge.fit(X_train_scaled, y_train)

KernelRidge()

In [39]:
from xgboost.sklearn import XGBRegressor

In [40]:
xgb_rg = XGBRegressor()

In [41]:
xgb_rg.fit(X_train_scaled, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=6, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [42]:
from sklearn.ensemble import GradientBoostingRegressor

In [43]:
gb_rg = GradientBoostingRegressor()

In [44]:
gb_rg.fit(X_train_scaled, y_train)

GradientBoostingRegressor()

In [45]:
from sklearn.svm import SVR

In [46]:
svr = SVR()

In [47]:
svr.fit(X_train_scaled, y_train)

SVR()

In [48]:
from sklearn.ensemble import AdaBoostRegressor

In [49]:
ada_rg = AdaBoostRegressor()

In [50]:
ada_rg.fit(X_train_scaled, y_train)

AdaBoostRegressor()

In [51]:
from sklearn.ensemble import BaggingRegressor

In [52]:
bag_rg = BaggingRegressor()

In [53]:
bag_rg.fit(X_train_scaled, y_train)

BaggingRegressor()

In [54]:
from sklearn.ensemble import ExtraTreesRegressor

In [55]:
x_rg = ExtraTreesRegressor()

In [56]:
from sklearn.linear_model import TweedieRegressor

In [57]:
tw_rg = TweedieRegressor()

In [58]:
tw_rg.fit(X_train_scaled, y_train)

TweedieRegressor()

In [59]:
from sklearn.linear_model import Ridge

In [60]:
coeff_df = pd.DataFrame()
alphas = [0, 0.1, 1, 10, 100]

# Fit one Ridge model per alpha and store its coefficients as a column
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    coeff_df['alpha:' + str(alpha)] = pd.Series(data=ridge.coef_, index=X_train.columns)

coeff_df.sort_values(by="alpha:0", ascending=False)

Unnamed: 0,alpha:0,alpha:0.1,alpha:1,alpha:10,alpha:100
hour,53.789495,53.789152,53.786064,53.754926,53.427736
year,41.380872,41.380513,41.377285,41.3451,41.029538
atemp,39.610678,39.601776,39.522172,38.773798,34.183298
month,34.212618,34.205448,34.141125,33.517899,28.794163
temp,12.836519,12.845302,12.923839,13.661584,18.147702
windspeed,4.938645,4.938426,4.936478,4.918666,4.840694
day,2.06944,2.069403,2.069067,2.065823,2.040435
workingday,0.080639,0.080582,0.080072,0.075321,0.049234
holiday,-0.992872,-0.993117,-0.995305,-1.016281,-1.162501
weather,-3.098754,-3.098949,-3.100702,-3.117863,-3.264687


In [61]:
from sklearn.linear_model import Lasso

In [62]:
coeff_df = pd.DataFrame()
alphas = [0.07, 0.1, 0.5, 1, 3]

# Fit one Lasso model per alpha and store its coefficients as a column
for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train_scaled, y_train)
    coeff_df['alpha:' + str(alpha)] = pd.Series(data=lasso.coef_, index=X_train.columns)

coeff_df.sort_values(by="alpha:0.07", ascending=False)

Unnamed: 0,alpha:0.07,alpha:0.1,alpha:0.5,alpha:1,alpha:3
hour,53.749109,53.73173,53.470671,53.122184,51.73591
year,41.316182,41.288721,40.907333,40.419286,38.488528
atemp,39.299209,39.245603,38.563886,37.703242,33.993551
month,31.733683,30.671699,25.286628,24.744788,22.578999
temp,13.094578,13.123876,13.551384,14.148911,16.78854
windspeed,4.869479,4.842636,4.444387,3.918374,1.798338
day,2.003739,1.975746,1.602919,1.138015,0.0
workingday,0.018592,0.0,0.0,0.0,0.0
holiday,-1.007903,-1.0119,-0.752083,-0.250237,-0.0
weather,-3.042058,-3.017701,-2.6849,-2.255744,-0.554888


In [63]:
from sklearn.linear_model import ElasticNet

In [64]:
coeff_df = pd.DataFrame()
alphas = [0.07, 0.1, 0.5, 1, 3]

# Fit one ElasticNet model per alpha and store its coefficients as a column
for alpha in alphas:
    elastic = ElasticNet(alpha=alpha, l1_ratio=0.7)
    elastic.fit(X_train_scaled, y_train)
    coeff_df['alpha:' + str(alpha)] = pd.Series(data=elastic.coef_, index=X_train.columns)

coeff_df.sort_values(by="alpha:0.07", ascending=False)

Unnamed: 0,alpha:0.07,alpha:0.1,alpha:0.5,alpha:1,alpha:3
hour,52.907291,52.520481,47.759612,42.88585,30.465971
year,40.542933,40.195399,36.049021,31.893812,21.640153
atemp,31.212507,30.041524,26.022898,24.380913,20.001153
month,24.752267,22.757034,14.802071,12.296972,8.497325
temp,20.949381,21.99671,24.310735,23.756704,20.075502
windspeed,4.800963,4.819799,5.195774,5.355199,4.660551
day,1.968998,1.932879,1.534246,1.149338,0.198552
season,0.414182,2.24733,8.311844,8.881534,7.41147
workingday,0.001925,0.0,0.0,0.0,0.0
holiday,-1.228057,-1.250628,-1.0158,-0.621567,-0.0


Model evaluation

In [65]:
from sklearn.model_selection import cross_val_score

RandomForestRegressor

In [66]:
y_scores_tree = cross_val_score(rnd_rg, X_train_scaled, y_train, scoring="neg_mean_squared_log_error", cv=5, n_jobs=-1)

In [67]:
np.sqrt(-(y_scores_tree))

array([0.80022563, 0.37100188, 0.53991458, 0.37162261, 0.36381077])

In [68]:
np.sqrt(-(y_scores_tree)).mean()

0.489315092387531

SGDRegressor

In [69]:
y_scores_sg = cross_val_score(sg_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [70]:
-y_scores_sg

array([1.10123444, 0.88068674, 1.1884913 , 1.08590995, 1.00592241])

In [71]:
-y_scores_sg.mean()

1.052448968549866

Linear Regression

In [72]:
y_scores_lin_rg = cross_val_score(lin_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [73]:
y_scores_lin_rg

array([-1.11742627, -0.87749727, -1.16494405, -1.08443716, -1.01748897])

In [74]:
-y_scores_lin_rg.mean()

1.052358745772206

Kernel Ridge Regression

In [75]:
y_scores_k_ridge = cross_val_score(k_ridge, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [76]:
y_scores_k_ridge

array([-77.51290495, -60.59108605, -31.92599115, -46.49624687,
       -74.91589775])

In [77]:
-y_scores_k_ridge.mean()

58.28842535597006

XGBoost Regressor

In [78]:
y_scores_xgb_rg = cross_val_score(xgb_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [79]:
y_scores_xgb_rg

array([-0.38500685, -0.11694149, -0.17741521, -0.13402383, -0.12109065])

In [80]:
-y_scores_xgb_rg.mean()

0.18689560444902967

Gradient Boosting Regression

In [81]:
y_scores_gb_rg = cross_val_score(gb_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [82]:
y_scores_gb_rg

array([-0.50262127, -0.14455434, -0.21131492, -0.14619408, -0.16260276])

In [83]:
-y_scores_gb_rg.mean()

0.23345747225467411

Support Vector Machine

In [84]:
y_scores_svr = cross_val_score(svr, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [85]:
y_scores_svr

array([-0.91170796, -0.68230809, -0.7851179 , -0.6932306 , -0.85574933])

In [86]:
-y_scores_svr.mean()

0.7856227736778381

AdaBoost Regressor

In [87]:
y_scores_ada_rg = cross_val_score(ada_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [88]:
y_scores_ada_rg

array([-0.55567833, -0.40631275, -0.50575862, -0.49411438, -0.54364346])

In [89]:
-y_scores_ada_rg.mean()

0.5011015086186775

Bagging Regressor

In [90]:
y_scores_bag_rg = cross_val_score(bag_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [91]:
y_scores_bag_rg

array([-0.51385175, -0.14376485, -0.26286748, -0.15619862, -0.18507618])

In [92]:
-y_scores_bag_rg.mean()

0.25235177675791254

ExtraTrees Regressor

In [93]:
y_scores_x_rg = cross_val_score(x_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [94]:
y_scores_x_rg

array([-0.50398326, -0.12819389, -0.18627818, -0.12061243, -0.12969462])

In [95]:
-y_scores_x_rg.mean()

0.21375247296614336

Tweedie Regressor

In [96]:
y_scores_tw_rg = cross_val_score(tw_rg, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

In [97]:
y_scores_tw_rg

array([-1.40483991, -1.0446499 , -1.28327081, -1.25527775, -1.23976601])

In [98]:
-y_scores_tw_rg.mean()

1.245560875872026

Ridge

In [99]:
ridge = Ridge(alpha=0.01)
y_scores_r = cross_val_score(ridge, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
y_scores_r

array([-1.1174262 , -0.87749732, -1.16494387, -1.08443722, -1.01748532])

In [100]:
-y_scores_r.mean()

1.0523579861365888

Lasso

In [101]:
lasso = Lasso(alpha=0.001)
y_scores_l = cross_val_score(lasso, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
y_scores_l

array([-1.11722712, -0.87747244, -1.16402665, -1.08409159, -1.00971378])

In [102]:
-y_scores_l.mean()

1.0505063163708428

ElasticNet

In [103]:
elastic = ElasticNet(alpha=0.001)
y_scores_elastic = cross_val_score(elastic, X_train_scaled, np.log1p(y_train), scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
y_scores_elastic

array([-1.11727688, -0.87750033, -1.16440622, -1.08430983, -1.01221127])

In [104]:
-y_scores_elastic.mean()

1.0511409051708254
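
The scores in this section are on two different scales. The random forest cell takes a square root, so its 0.489 is an RMSLE. The cells that score `np.log1p(y_train)` with `neg_mean_squared_error` report the mean squared log error without the root. To compare them, the root can be taken per fold, as sketched here with the XGBoost fold scores copied from the output above. The variable names are assumptions made for illustration.

```python
import numpy as np

# Per-fold scores copied from the XGBoost cell above
# (neg_mean_squared_error on np.log1p(y_train))
xgb_neg_mse_log = np.array([-0.38500685, -0.11694149, -0.17741521,
                            -0.13402383, -0.12109065])

# The square root of each fold's mean squared log error is that fold's RMSLE
xgb_rmsle_per_fold = np.sqrt(-xgb_neg_mse_log)
xgb_rmsle_mean = xgb_rmsle_per_fold.mean()
```

On this scale the XGBoost mean comes out at about 0.42, which can be set directly against the random forest's 0.489.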

Hyperparameter tuning

In [105]:
from sklearn.model_selection import GridSearchCV

In [106]:
param_grid = [
    {'n_estimators': [300], 'max_depth' : [94], "max_features" : [11]}
] # Consider adding other parameters worth trying

grid_search = GridSearchCV(rnd_rg, param_grid = param_grid, scoring="neg_mean_squared_log_error", cv=5, n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid=[{'max_depth': [94], 'max_features': [11],
                          'n_estimators': [300]}],
             scoring='neg_mean_squared_log_error')

In [107]:
grid_search.best_params_

{'max_depth': 94, 'max_features': 11, 'n_estimators': 300}

In [108]:
np.sqrt(-grid_search.best_score_)

0.5173474527102001
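
Each grid in this section lists a single value per parameter, so `GridSearchCV` evaluates only one candidate and has nothing to choose between. The sketch below shows a search over several candidates. It uses synthetic count-like data, and the grid values are placeholders chosen to run quickly, not recommendations. `X_demo`, `y_demo`, `param_grid_demo` and `search` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with non-negative, count-like targets
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = np.round(np.exp(1.0 + X_demo[:, 0])).astype(int)

# Two values per parameter give 2 x 2 = 4 candidates to compare
param_grid_demo = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid_demo,
    scoring="neg_mean_squared_log_error",
    cv=3,
)
search.fit(X_demo, y_demo)
```

After fitting, `search.cv_results_["params"]` lists every candidate tried and `search.best_params_` holds the best-scoring one.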

In [109]:
param_grid = [
    {"n_estimators": [4], "learning_rate": [1], "loss": ["exponential"]}
] # Consider adding other parameters worth trying

grid_search = GridSearchCV(ada_rg, param_grid = param_grid, scoring="neg_mean_squared_log_error", cv=5, n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=AdaBoostRegressor(), n_jobs=-1,
             param_grid=[{'learning_rate': [1], 'loss': ['exponential'],
                          'n_estimators': [4]}],
             scoring='neg_mean_squared_log_error')

In [110]:
grid_search.best_params_

{'learning_rate': 1, 'loss': 'exponential', 'n_estimators': 4}

In [111]:
np.sqrt(-grid_search.best_score_)

0.9007229899588901

In [112]:
param_grid = [
    {"n_estimators": [190], "bootstrap": [True]}
] # Consider adding other parameters worth trying

grid_search = GridSearchCV(bag_rg, param_grid = param_grid, scoring="neg_mean_squared_log_error", cv=5, n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=BaggingRegressor(), n_jobs=-1,
             param_grid=[{'bootstrap': [True], 'n_estimators': [190]}],
             scoring='neg_mean_squared_log_error')

In [113]:
bag_rg

BaggingRegressor()

In [114]:
grid_search.best_params_

{'bootstrap': True, 'n_estimators': 190}

In [115]:
np.sqrt(-grid_search.best_score_)

0.5205004607582815

In [116]:
param_grid = [
    {"n_estimators": [10], "max_depth": [17]}
] # Consider adding other parameters worth trying

grid_search = GridSearchCV(x_rg, param_grid = param_grid, scoring="neg_mean_squared_log_error", cv=5, n_jobs=-1)

grid_search.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=ExtraTreesRegressor(), n_jobs=-1,
             param_grid=[{'max_depth': [17], 'n_estimators': [10]}],
             scoring='neg_mean_squared_log_error')

In [117]:
x_rg

ExtraTreesRegressor()

In [118]:
grid_search.best_params_

{'max_depth': 17, 'n_estimators': 10}

In [119]:
np.sqrt(-grid_search.best_score_)

0.560162587449117

In [134]:
param_grid = [
    {"n_estimators": [100], "max_depth": [6]}
] # Consider adding other parameters worth trying

grid_search = GridSearchCV(gb_rg, param_grid = param_grid, scoring="neg_mean_squared_error", cv=5, n_jobs=-1)

grid_search.fit(X_train_scaled, np.log1p(y_train))

GridSearchCV(cv=5, estimator=GradientBoostingRegressor(), n_jobs=-1,
             param_grid=[{'max_depth': [6], 'n_estimators': [100]}],
             scoring='neg_mean_squared_error')

In [121]:
gb_rg

GradientBoostingRegressor()

In [135]:
grid_search.best_params_

{'max_depth': 6, 'n_estimators': 100}

In [136]:
np.sqrt(-grid_search.best_score_)

0.4199137358812572

In [137]:
grid_search.best_estimator_

GradientBoostingRegressor(max_depth=6)

In [138]:
final_model = grid_search.best_estimator_

In [139]:
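# Note: this refits on the raw counts, whereas the grid search above
# scored the model on np.log1p(y_train)
final_model.fit(X_train_scaled, y_train)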

GradientBoostingRegressor(max_depth=6)
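
The grid search scored this model on `np.log1p(y_train)`, while the refit uses the raw counts. One possible way to keep the two consistent is to train on the log-transformed target and map predictions back with `np.expm1`. The sketch below does this with scikit-learn's `TransformedTargetRegressor` on synthetic data. `X_demo`, `y_demo` and `log_target_model` are assumptions for illustration, and the final clip to zero is an added safeguard, not part of the original notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in with non-negative, count-like targets
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.round(np.exp(1.0 + X_demo[:, 0])).astype(float)

# Fit on log1p(y); predict() applies expm1 to return to the count scale
log_target_model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(max_depth=6, n_estimators=100,
                                        random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_target_model.fit(X_demo, y_demo)

# expm1 of a negative log-scale prediction is negative, so clip at zero
pred_counts = np.clip(log_target_model.predict(X_demo), 0, None)
```

Non-negative predictions also matter for the metric itself, since RMSLE takes the logarithm of one plus the prediction.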

Final performance evaluation

In [140]:
y_pred = final_model.predict(X_test_scaled)

In [141]:
y_pred.shape

(6493,)

In [146]:
y_pred

Creating the submission CSV

In [147]:
submission = pd.read_csv("sampleSubmission.csv")
submission

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,0
1,2011-01-20 01:00:00,0
2,2011-01-20 02:00:00,0
3,2011-01-20 03:00:00,0
4,2011-01-20 04:00:00,0
...,...,...
6488,2012-12-31 19:00:00,0
6489,2012-12-31 20:00:00,0
6490,2012-12-31 21:00:00,0
6491,2012-12-31 22:00:00,0


In [148]:
submission["count"] = y_pred
submission.head()

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,20.828627
1,2011-01-20 01:00:00,0.029532
2,2011-01-20 02:00:00,3.149258
3,2011-01-20 03:00:00,3.069114
4,2011-01-20 04:00:00,3.069114


In [149]:
ver = 10 

submission.to_csv("ver_{0}_submission.csv".format(ver), index=False)