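# Bike Sharing Demand - Data Preprocessing
In this lesson we build a regression model that predicts the hourly number of bike rentals from the time, season, holiday flag, working-day flag, weather, temperature, "feels-like" temperature, humidity, and wind speed. Along the way you will use the Python packages pandas and NumPy to manipulate data, matplotlib and seaborn to visualize it, and scikit-learn to build the model.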

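### Environment notes
Before running this example, check that the Jupyter notebook is configured correctly: from the main menu choose "Edit" > "Notebook settings", set "Runtime type" to "Python 3", change the "Hardware accelerator" drop-down from "None" to "GPU", and click "Save".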

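### Course structure
In the bike sharing project, we will build a machine learning model and use it to predict bike demand. The work consists of the following four steps:

>1.   How to preprocess the data (Preprocessing)

>2.   How to carry out exploratory data analysis (Exploratory Data Analysis)

>3.   How to apply feature engineering (Feature Engineering)

>4.   How to choose a model and evaluate how well it learns (Model & Inference)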

---

**1.1 Load the required packages**

---

In [9]:
# 1-1
# Load the required packages first; `import (package_name) as (xxx)` gives a package a shorter alias so that later calls are more convenient

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import calendar
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor

import warnings
plt.style.use('ggplot')
warnings.filterwarnings('ignore')
pd.options.mode.chained_assignment = None
%matplotlib inline

---

**1.2 Load the dataset**

---

Download the data from https://www.kaggle.com/c/bike-sharing-demand/data. There are three CSV files: train, test, and sampleSubmission.

In [10]:
# 1-2
# Use the pandas function pd.read_csv('file_name') to read a CSV file

# Training data
train = pd.read_csv('train/train.csv')

# Test data and the sample submission file
test = pd.read_csv('test/test.csv')
submit = pd.read_csv('sampleSubmission.csv')

# Combine train and test; ignore_index=True renumbers the rows from 0
data = pd.concat([train, test], ignore_index=True)
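The merge step can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own cell: the frames `train_toy` and `test_toy` and their values are made up, and it assumes the combined frame will later be split back into its training and test parts by row position.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only)
train_toy = pd.DataFrame({"temp": [9.84, 9.02, 9.02], "count": [16, 40, 32]})
test_toy = pd.DataFrame({"temp": [10.66, 10.66]})

# Stack the rows; ignore_index=True renumbers the index 0..n-1
combined = pd.concat([train_toy, test_toy], ignore_index=True)

# Rows that came from the test set have no target, so "count" is NaN there
n_train = len(train_toy)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns="count")
```

Because the training rows come first, the first `len(train_toy)` rows of the combined frame are exactly the training rows, which makes the later split straightforward.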

---

**1.3 Inspect the data distribution**

---

In [11]:
train.head(5)

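| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |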
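In the five rows shown by `train.head(5)`, `count` equals `casual` plus `registered`. The sketch below checks this on those five rows only; it re-types the values by hand into a hypothetical frame `head_rows`, so it says nothing about the rest of the file, where the same comparison could be run on `train` directly.

```python
import pandas as pd

# The five rows shown by train.head(5), limited to the three rental columns
head_rows = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# True when count equals casual + registered in every row
totals_match = bool(
    (head_rows["casual"] + head_rows["registered"] == head_rows["count"]).all()
)
print(totals_match)
```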


In [12]:
# 1-3
# Use info() on train and test to check whether there are any missing values

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB


In [13]:
# 1-4

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB
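In the `info()` output, a column has no missing values when its non-null count equals the number of entries. As a complementary check, missing values can be counted directly. The sketch below uses a made-up frame `toy` with one deliberately missing value; the same two calls could be applied to `train` and `test`.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only)
toy = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02],
    "humidity": [81, 80, 80],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if any column contains a missing value
has_missing = bool(toy.isnull().values.any())
```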


In [14]:
# 1-5

train.describe()

| | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 |
| mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.88646 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
| std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.82 | 0.76 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 13.94 | 16.665 | 47.0 | 7.0015 | 4.0 | 36.0 | 42.0 |
| 50% | 3.0 | 0.0 | 1.0 | 1.0 | 20.5 | 24.24 | 62.0 | 12.998 | 17.0 | 118.0 | 145.0 |
| 75% | 4.0 | 0.0 | 1.0 | 2.0 | 26.24 | 31.06 | 77.0 | 16.9979 | 49.0 | 222.0 | 284.0 |
| max | 4.0 | 1.0 | 1.0 | 4.0 | 41.0 | 45.455 | 100.0 | 56.9969 | 367.0 | 886.0 | 977.0 |


In [15]:
# 1-6

test.describe()

| | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|
| count | 6493.0 | 6493.0 | 6493.0 | 6493.0 | 6493.0 | 6493.0 | 6493.0 | 6493.0 |
| mean | 2.4933 | 0.029108 | 0.685815 | 1.436778 | 20.620607 | 24.012865 | 64.125212 | 12.631157 |
| std | 1.091258 | 0.168123 | 0.464226 | 0.64839 | 8.059583 | 8.782741 | 19.293391 | 8.250151 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.82 | 0.0 | 16.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 13.94 | 16.665 | 49.0 | 7.0015 |
| 50% | 3.0 | 0.0 | 1.0 | 1.0 | 21.32 | 25.0 | 65.0 | 11.0014 |
| 75% | 3.0 | 0.0 | 1.0 | 2.0 | 27.06 | 31.06 | 81.0 | 16.9979 |
| max | 4.0 | 1.0 | 1.0 | 4.0 | 40.18 | 50.0 | 100.0 | 55.9986 |


In [16]:
# 1-7

# Columns 5 to 11 are the numeric columns temp through count
for i in range(5,12):
    name = train.columns[i]
    print('{0} skewness: {1} , kurtosis: {2}'.format(name,train[name].skew(),train[name].kurt()))

temp skewness: 0.003690844422472008 , kurtosis: -0.9145302637630794
atemp skewness: -0.10255951346908665 , kurtosis: -0.8500756471754651
humidity skewness: -0.08633518364548581 , kurtosis: -0.7598175375208864
windspeed skewness: 0.5887665265853944 , kurtosis: 0.6301328693364932
casual skewness: 2.4957483979812567 , kurtosis: 7.551629305632764
registered skewness: 1.5248045868182296 , kurtosis: 2.6260809999210672
count skewness: 1.2420662117180776 , kurtosis: 1.3000929518398334
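To help read these numbers: a skewness near 0 suggests a roughly symmetric distribution, and a positive value a longer right tail. pandas `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores 0 and negative values indicate lighter tails. The sketch below computes the same two statistics on a made-up frame `toy`, selecting columns by name rather than by position; the column names and values are hypothetical.

```python
import pandas as pd

# Toy data: one symmetric column and one with a long right tail (illustrative only)
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 20],
})

# Select columns by name instead of by position, then compute both statistics
cols = ["symmetric", "right_skewed"]
summary = pd.DataFrame({
    "skewness": toy[cols].skew(),
    "kurtosis": toy[cols].kurt(),
})
print(summary)
```

Selecting by name keeps the loop correct even if the column order of the frame changes.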


----