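# Bike-Sharing Demand - Model Selection/Evaluation
In this lesson you will learn how to implement a regression model. The goal is to predict the hourly number of bikes from the time, season, holiday flag, working-day flag, weather conditions, temperature, apparent ("feels-like") temperature, humidity, and wind speed. Along the way you will learn to use the Python packages pandas and numpy to manipulate data, matplotlib and seaborn to visualize it, and scikit-learn to build the model.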

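### Environment notes
Before running this example, check that the Jupyter notebook is configured correctly: from the main menu choose "Edit" > "Notebook settings", set "Runtime type" to "Python 3", change the "Hardware accelerator" drop-down from "None" to "GPU", then click "Save".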

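### Course structure
In the bike-sharing project, we guide you through building a machine learning model and using it to predict bike demand. The project has four main steps:

>1.   How to preprocess the data (Processing)

>2.   How to carry out exploratory data analysis (Exploratory Data Analysis)

>3.   How to apply feature engineering (Feature Engineering)

>4.   How to choose a model and evaluate how well it has learned (Model & Inference)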

---

**4.1 Load the required packages**

---

In [1]:
# 4-1
# First load the required packages. It is common to write `import (package_name) as (xxx)` to shorten a package name, which makes later calls more convenient

from sklearn.ensemble import RandomForestRegressor
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
pd.options.mode.chained_assignment = None

import warnings
plt.style.use('ggplot')
warnings.filterwarnings('ignore')
%matplotlib inline

---

**4.2 Load the dataset**

---

Download the required data from https://www.kaggle.com/c/bike-sharing-demand/data. There are three CSV files: test, train, and sampleSubmission.

In [2]:
# 4-2
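# pandas provides functions for reading tabular files: pd.read_csv('file name') for CSV files
# and pd.read_excel('file name') for Excel files such as the .xls files used here

# Training data
train = pd.read_excel('train/train_new.xls')
Y_train = train["count"]

# Test data
test = pd.read_excel('test/test_new.xls')
submit = pd.read_csv('sampleSubmission.csv')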
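
As a small illustration of the loading step, the sketch below reads CSV text from memory instead of a file and then separates the target column from the features. The first two rows are copied from the `train.head()` output in this notebook, but only four of the columns are kept, and the names `csv_text`, `train_demo`, `X_demo`, and `Y_demo` are made up for this example.

```python
import io

import pandas as pd

# Two rows from the notebook's train.head() output, trimmed to four columns.
# The leading empty header cell is the unnamed index column.
csv_text = """,atemp,count,holiday,humidity
14630,14.395,2.772589,0,81
14631,13.635,3.688879,0,80
"""

# index_col=0 tells pandas to use the first column as the row index
train_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Separate the target from the features, as the notebook does for "count"
Y_demo = train_demo["count"]
X_demo = train_demo.drop("count", axis=1)

print(X_demo.columns.tolist())
print(Y_demo.tolist())
```

Passing a file path instead of the `io.StringIO` object reads from disk in the same way.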

In [3]:
# 4-3

train.head()

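| | atemp | count | holiday | humidity | season | temp | weather | windspeed | workingday | hour | year | weekday | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14630 | 14.395 | 2.772589 | 0 | 81 | 1 | 9.84 | 1 | 6.878964 | 0 | 0 | 2011 | 5 | 1 |
| 14631 | 13.635 | 3.688879 | 0 | 80 | 1 | 9.02 | 1 | 6.578061 | 0 | 1 | 2011 | 5 | 1 |
| 14632 | 13.635 | 3.465736 | 0 | 80 | 1 | 9.02 | 1 | 6.578061 | 0 | 2 | 2011 | 5 | 1 |
| 14633 | 14.395 | 2.564949 | 0 | 75 | 1 | 9.84 | 1 | 6.666139 | 0 | 3 | 2011 | 5 | 1 |
| 14634 | 14.395 | 0.0 | 0 | 75 | 1 | 9.84 | 1 | 6.666139 | 0 | 4 | 2011 | 5 | 1 |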


In [11]:
train.iloc[:,0]

14630    14.395
14631    13.635
14632    13.635
14633    14.395
14634    14.395
0        12.880
14635    13.635
14636    12.880
14637    14.395
14638    17.425
1        19.695
2        16.665
3        21.210
4        22.725
5        22.725
6        21.970
7        21.210
8        21.970
9        21.210
10       21.210
11       20.455
12       20.455
13       20.455
14       22.725
15       22.725
16       21.970
17       21.210
18       22.725
19       22.725
20       21.210
          ...  
8976     22.725
8977     22.725
8978     21.970
8979     19.695
8980     19.695
8981     16.665
8982     17.425
15884    16.665
8983     17.425
15885    15.910
8984     15.910
8985     15.150
8986     13.635
8987     12.120
8988     14.395
8989     12.880
8990     13.635
8991     14.395
8992     16.665
8993     20.455
8994     20.455
8995     21.210
8996     21.210
8997     21.210
8998     21.210
8999     19.695
9000     17.425
9001     15.910
9002     17.425
9003     16.665
Name: atemp, Length: 102

In [4]:
# 4-4

test.head()

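| | atemp | datetime | holiday | humidity | season | temp | weather | windspeed | workingday | hour | year | weekday | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9004 | 11.365 | 2011-01-20 00:00:00 | 0 | 56 | 1 | 10.66 | 1 | 26.0027 | 1 | 0 | 2011 | 3 | 1 |
| 15886 | 13.635 | 2011-01-20 01:00:00 | 0 | 56 | 1 | 10.66 | 1 | 8.772007 | 1 | 1 | 2011 | 3 | 1 |
| 15887 | 13.635 | 2011-01-20 02:00:00 | 0 | 56 | 1 | 10.66 | 1 | 8.772007 | 1 | 2 | 2011 | 3 | 1 |
| 9005 | 12.88 | 2011-01-20 03:00:00 | 0 | 56 | 1 | 10.66 | 1 | 11.0014 | 1 | 3 | 2011 | 3 | 1 |
| 9006 | 12.88 | 2011-01-20 04:00:00 | 0 | 56 | 1 | 10.66 | 1 | 11.0014 | 1 | 4 | 2011 | 3 | 1 |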


In [5]:
# 4-5

a = test[['atemp','holiday']]
a.head()

| | atemp | holiday |
|---|---|---|
| 9004 | 11.365 | 0 |
| 15886 | 13.635 | 0 |
| 15887 | 13.635 | 0 |
| 9005 | 12.88 | 0 |
| 9006 | 12.88 | 0 |


---

**4.3 Model training and prediction**

---

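Finally, feed the data into a random forest, train it, and write the predictions to a CSV file; uploading that CSV should place you at roughly the top 17% of the leaderboard. The whole program takes about 1 to 5 minutes to run. The only random forest parameter set here is the very basic `n_estimators=1000`. `random_state=42` fixes the seed of the random number generator: much like fixing the way a deck of cards is shuffled, setting a constant value means that everyone who runs this program should get the same output as I do.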
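
To make the role of `random_state` concrete, here is a minimal sketch on synthetic data (not the bike-sharing dataset). The helper `fit_and_predict` and the arrays `X_demo` and `y_demo` are hypothetical names introduced only for this illustration, and the forest is kept small so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)


def fit_and_predict(seed):
    """Train a small forest with the given seed and return its predictions."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_demo)


same_seed_a = fit_and_predict(42)
same_seed_b = fit_and_predict(42)
other_seed = fit_and_predict(7)

# Two runs with the same seed give identical predictions
print(np.array_equal(same_seed_a, same_seed_b))
# A different seed draws different bootstrap samples, so predictions usually differ
print(np.array_equal(same_seed_a, other_seed))
```

The same reasoning applies to the full model: with the same data, library version, and `random_state`, rerunning the training cell should reproduce the same forest.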

In [6]:
# 4-6

X_train = train.drop("count", axis=1)

In [17]:
X_train

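| | atemp | holiday | humidity | season | temp | weather | windspeed | workingday | hour | year | weekday | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14630 | 14.395 | 0 | 81 | 1 | 9.84 | 1 | 6.878964 | 0 | 0 | 2011 | 5 | 1 |
| 14631 | 13.635 | 0 | 80 | 1 | 9.02 | 1 | 6.578061 | 0 | 1 | 2011 | 5 | 1 |
| 14632 | 13.635 | 0 | 80 | 1 | 9.02 | 1 | 6.578061 | 0 | 2 | 2011 | 5 | 1 |
| 14633 | 14.395 | 0 | 75 | 1 | 9.84 | 1 | 6.666139 | 0 | 3 | 2011 | 5 | 1 |
| 14634 | 14.395 | 0 | 75 | 1 | 9.84 | 1 | 6.666139 | 0 | 4 | 2011 | 5 | 1 |
| 0 | 12.880 | 0 | 75 | 1 | 9.84 | 2 | 6.003200 | 0 | 5 | 2011 | 5 | 1 |
| 14635 | 13.635 | 0 | 80 | 1 | 9.02 | 1 | 6.578061 | 0 | 6 | 2011 | 5 | 1 |
| 14636 | 12.880 | 0 | 86 | 1 | 8.20 | 1 | 6.579490 | 0 | 7 | 2011 | 5 | 1 |
| 14637 | 14.395 | 0 | 75 | 1 | 9.84 | 1 | 6.666139 | 0 | 8 | 2011 | 5 | 1 |
| 14638 | 17.425 | 0 | 76 | 1 | 13.12 | 1 | 9.239897 | 0 | 9 | 2011 | 5 | 1 |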


In [18]:
Y_train

14630    2.772589
14631    3.688879
14632    3.465736
14633    2.564949
14634    0.000000
0        0.000000
14635    0.693147
14636    1.098612
14637    2.079442
14638    2.639057
1        3.583519
2        4.025352
3        4.430817
4        4.543295
5        4.663439
6        4.700480
7        4.532599
8        4.204693
9        3.555348
10       3.610918
11       3.583519
12       3.526361
13       3.332205
14       3.663562
15       2.833213
16       2.833213
17       2.197225
18       1.791759
19       1.098612
20       0.693147
           ...   
8976     5.509388
8977     5.505332
8978     5.894403
8979     6.263398
8980     5.866468
8981     5.590987
8982     5.123964
15884    4.882802
8983     4.394449
15885    3.713572
8984     2.708050
8985     1.098612
8986     1.609438
8987     1.945910
8988     3.433987
8989     4.718499
8990     5.894403
8991     5.758902
8992     5.099866
8993     5.298317
8994     5.463832
8995     5.361292
8996     5.384495
8997     5.468060
8998     5

In [1]:
# Optional hyperparameter search with GridSearchCV (left commented out)
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.model_selection import GridSearchCV 

# rf = RandomForestRegressor(oob_score=True, random_state=1, n_jobs=-1)
# param_grid = { "min_samples_leaf" : [1, 5, 10], "min_samples_split" : [2, 4, 10, 12, 16, 20], "n_estimators": [50, 100, 400, 700, 1000]}
# 'accuracy' is a classification metric, so a regression scorer is used for the regressor
# gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)

# Alternative: default scorer (R^2 for regressors) with 5-fold cross-validation
# gs = GridSearchCV(RandomForestRegressor(),param_grid=param_grid, cv=5)
# gs = gs.fit(X_train, Y_train)


# print(gs.best_score_)
# print(gs.best_params_)
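
The commented-out grid search can be tried in a reduced form. The sketch below is an assumption-laden illustration: it runs on synthetic data, uses a much smaller grid so that it finishes in seconds, and scores with `neg_mean_squared_error`, because accuracy is a classification metric and is not defined for continuous targets. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(120, 4)
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# A deliberately tiny grid; the notebook's grid is much larger
param_grid = {
    "min_samples_leaf": [1, 5],
    "n_estimators": [10, 30],
}

gs = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
gs.fit(X_demo, y_demo)

print(gs.best_params_)
# best_score_ is the negated MSE, so flip the sign to read it as an error
print(-gs.best_score_)
```

With the notebook's full grid (3 x 6 x 5 settings, each cross-validated, with up to 1000 trees) the search would take far longer than this reduced version.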

In [None]:
# 4-7

rfModel = RandomForestRegressor(n_estimators=1000,random_state=42)
rfModel.fit(X_train,Y_train)
preds = rfModel.predict(X= X_train)
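
Cell 4-7 predicts on the same rows the model was trained on, so comparing `preds` with `Y_train` would give an optimistic picture of the model. One common way to get a fairer estimate is to hold out part of the training data. The sketch below shows the idea on synthetic data with a hypothetical log-scale target; `X_demo`, `y_demo`, and the 25% split size are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a log-scale-like target (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
y_demo = 4.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=300)

# Keep 25% of the rows aside; the model never sees them during training
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_fit, y_fit)

# Root mean squared error on the rows used for fitting and on the held-out rows
rmse_fit = np.sqrt(mean_squared_error(y_fit, model.predict(X_fit)))
rmse_val = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

print(f"RMSE on fitted rows:   {rmse_fit:.3f}")
print(f"RMSE on held-out rows: {rmse_val:.3f}")
```

Because the notebook's target is already on a log scale, an RMSE computed this way is close in spirit to the competition's RMSLE metric, although RMSLE uses log(count + 1) and the two are not identical.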

In [None]:
# 4-8

pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rfModel.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:5]
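
The same ranking can be written more compactly with a `pd.Series` indexed by column name. This is a sketch on synthetic data: the column names are borrowed from the dataset, but the values and the target are made up, and `X_demo`, `y_demo`, `importances`, and `top` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; only the column names come from the bike-sharing data
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(
    rng.rand(200, 4), columns=["hour", "temp", "humidity", "windspeed"]
)
# Made-up target that depends mostly on "hour"
y_demo = 5.0 * X_demo["hour"] + 0.5 * X_demo["temp"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# feature_importances_ follows the column order of the training frame
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
top = importances.sort_values(ascending=False).head(3)
print(top)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity reduction attributed to that feature.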

---

**4.4 Export the predictions**

---

In [None]:
# 4-9

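# Keep the timestamps for the submission file, then drop them from the features
datetimecol = test["datetime"]
X_test = test.drop("datetime", axis=1)

predsTest = rfModel.predict(X= X_test)
# The model predicts on the log scale, so np.exp converts back to counts;
# max(0, x) guards against negative values in the submission
submission = pd.DataFrame({
        "datetime": datetimecol,
        "count": [max(0, x) for x in np.exp(predsTest)]
    })
submission.to_csv('bike_predictions_RF.csv', index=False)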

In [None]:
# 4-10

submission.head()

In [None]:
# 4-11

predsTest[0]

In [None]:
# 4-12

a = np.exp(predsTest[0])
a
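
The `count` values shown earlier in `train.head()` look like natural logarithms of the hourly counts, which would explain why the predictions are passed through `np.exp`. The sketch below checks that reading on the five values from that output; treating them as log counts is an inference from the numbers, not something the notebook states.

```python
import numpy as np

# "count" values copied from the train.head() output in this notebook
log_counts = np.array([2.772589, 3.688879, 3.465736, 2.564949, 0.0])

# np.exp undoes a natural-log transform
counts = np.exp(log_counts)
print(np.round(counts))  # rounds to 16, 40, 32, 13 and 1

# Round trip: taking the log again recovers the stored values
print(np.allclose(np.log(counts), log_counts))
```

Note that a log value of 0 corresponds to a count of 1, so `np.exp` never returns a value at or below zero.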

----