# Feature Engineering on the Bikeshare Dataset

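1. Task description
Perform a regression analysis on the bike rental data provided by Capital Bikeshare, a bike-sharing company in Washington, D.C., USA. Given each day's weather information, predict the number of shared-bike rides on that day.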

Original dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
1) Files
day.csv: rental counts per day (the only file needed for this assignment)
hour.csv: rental counts per hour (not used here)
readme: dataset description file

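2) Fields
instant: record index
dteday: date
season: season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr: hour (0 to 23) (only in hour.csv; ignored in this assignment)
holiday: whether the day is a holiday
weekday: day of the week, values 0 to 6
workingday: whether the day is a working day (1 = working day, 0 = weekend or holiday)
weathersit: weather (1: clear or partly cloudy; 2: mist or overcast; 3: light snow or light rain; 4: heavy rain, heavy snow, or dense fog)
temp: temperature in Celsius (stored in normalized form, values between 0 and 1)
atemp: feels-like temperature (normalized)
hum: humidity (normalized)
windspeed: wind speed (normalized)
casual: number of rentals by casual (non-registered) users
registered: number of rentals by registered users
cnt: total number of rentals on the given day (per hour in hour.csv); the response variable y (cnt = casual + registered)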

casual, registered, and cnt are all possible targets y rather than input features; this assignment only requires predicting cnt.
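The relationship cnt = casual + registered can be checked directly. The sketch below uses only the count columns of the first five rows of day.csv, as printed by `train.head()` in this notebook; the names `sample` and `mismatch` are illustrative and not part of the original code. Because the two components sum to the target, using casual or registered as input features when predicting cnt would leak the answer.

```python
import pandas as pd

# Count columns of the first five rows of day.csv (values from train.head()).
sample = pd.DataFrame({
    "casual":     [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
    "cnt":        [985, 801, 1349, 1562, 1600],
})

# Rows where the two components do not add up to the total.
mismatch = sample[sample["casual"] + sample["registered"] != sample["cnt"]]
print(len(mismatch))  # 0 for these five rows
```

The same comparison could be run on the full `train` frame once it is loaded.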

## Import the required packages

In [1]:
# Data loading and basic processing
import pandas as pd
import numpy as np

## Load the data

Preprocessing must be applied identically to the training data and the test data, so the two are read in together.
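One reason to process both splits together is that one-hot encoding a split on its own can produce a different set of columns when a category is missing from it. The sketch below illustrates this with two hypothetical splits (`train_part` and `test_part` are made-up frames, not part of the assignment data) and shows one way to encode them jointly and split them back.

```python
import pandas as pd

# Hypothetical splits: weathersit=3 occurs only in the test split.
train_part = pd.DataFrame({"weathersit": [1, 2, 1]})
test_part = pd.DataFrame({"weathersit": [2, 3]})

# Encoding each split separately yields different column sets.
train_cols = pd.get_dummies(train_part, columns=["weathersit"]).columns.tolist()
test_cols = pd.get_dummies(test_part, columns=["weathersit"]).columns.tolist()
print(train_cols)  # ['weathersit_1', 'weathersit_2']
print(test_cols)   # ['weathersit_2', 'weathersit_3']

# Encoding the combined frame and splitting it back keeps the columns identical.
combined = pd.concat([train_part, test_part], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["weathersit"])
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(list(train_enc.columns) == list(test_enc.columns))  # True
```

In this notebook only day.csv is loaded, so the whole file is encoded in one pass and the issue does not arise.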

In [2]:
# Load the data
dpath = "./data/"
train = pd.read_csv( dpath + "day.csv")

train.head()
#print("train : " + str(train.shape))

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [3]:
#train.info()

There are no missing values.
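Besides `train.info()`, missing values can be counted per column with `isnull().sum()`. The sketch below runs on a small hypothetical frame (`toy`, with one missing value inserted on purpose) so the effect is visible; on the real `train` frame, every count would be expected to be zero, consistent with the note above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value (not taken from day.csv).
toy = pd.DataFrame({"temp": [0.34, np.nan, 0.20], "cnt": [985, 801, 1349]})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single flag: does the frame contain any missing value at all?
has_missing = bool(toy.isnull().values.any())
print(has_missing)  # True for this toy frame
```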

## Feature engineering

### Encoding categorical features
Apply one-hot encoding to the categorical features.

In [4]:
# Categorical features to be one-hot encoded
categorical_features = ['season','mnth','weathersit','weekday']

# Cast to object dtype: by default get_dummies only encodes object/category columns
for col in categorical_features:
    train[col] = train[col].astype('object')
    
X_train_cat = train[categorical_features]
X_train_cat = pd.get_dummies(X_train_cat)
X_train_cat.head()

Unnamed: 0,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,mnth_6,...,weathersit_1,weathersit_2,weathersit_3,weekday_0,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
0,1,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
2,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
3,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
4,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
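The cast to `object` is needed because `pd.get_dummies` leaves integer columns untouched by default. An alternative is to name the columns through the `columns` parameter, which encodes them regardless of dtype. The sketch below compares both routes on a hypothetical miniature frame (`df` is made up for illustration).

```python
import pandas as pd

# Hypothetical miniature of the integer-coded categorical columns.
df = pd.DataFrame({"season": [1, 1, 2], "weekday": [6, 0, 1]})

# Without a cast, integer columns are passed through unchanged.
untouched = pd.get_dummies(df)

# Route 1 (as in the cell above): cast to object first.
via_cast = pd.get_dummies(df.astype("object"))

# Route 2: list the columns explicitly; no cast is needed.
via_columns = pd.get_dummies(df, columns=["season", "weekday"])

print(list(untouched.columns))    # ['season', 'weekday']
print(sorted(via_columns.columns))
```

Note that pandas 2.0 and later return boolean dummy columns by default rather than the `uint8` columns shown in this notebook's output; passing `dtype=int` to `get_dummies` gives 0/1 integers if those are preferred.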


### Numerical features
Rescale the numerical features (standardization or MinMaxScaler) so that they share a common, unit-free scale.

In [5]:
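# Preprocess the numerical features.
# The data appears to be normalized already (all values lie between 0 and 1);
# MinMaxScaler is applied once more so that each column spans exactly [0, 1].
from sklearn.preprocessing import MinMaxScaler
mn_X = MinMaxScaler()
numerical_features = ['temp','atemp','hum','windspeed']
scaled = mn_X.fit_transform(train[numerical_features])

# Keep the original row index so the later concat aligns rows correctly
X_train_num = pd.DataFrame(data=scaled, columns=numerical_features, index=train.index)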
X_train_num.head()

Unnamed: 0,temp,atemp,hum,windspeed
0,0.35517,0.373517,0.82862,0.284606
1,0.379232,0.360541,0.715771,0.466215
2,0.171,0.14483,0.449638,0.46574
3,0.17553,0.174649,0.607131,0.284297
4,0.20912,0.197158,0.449313,0.339143
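With its default settings, `MinMaxScaler` maps each column's observed minimum to 0 and its maximum to 1. Since the observed extremes of the already normalized columns are not exactly 0 and 1, the values shift slightly (for example, the first `temp` value 0.344167 becomes 0.35517 in the output above). The sketch below writes the same formula out in NumPy on hypothetical values; `x` is made up for illustration.

```python
import numpy as np

# Hypothetical raw values of a single feature.
x = np.array([2.0, 4.0, 10.0])

# Min-max scaling: (x - min) / (max - min).
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # values 0.0, 0.25, 1.0
```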


In [6]:
# Join categorical and numerical features
X_train = pd.concat([X_train_cat, X_train_num, train['holiday'],  train['workingday']], axis = 1, ignore_index=False)
X_train.head()

Unnamed: 0,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,mnth_6,...,weekday_3,weekday_4,weekday_5,weekday_6,temp,atemp,hum,windspeed,holiday,workingday
0,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0.35517,0.373517,0.82862,0.284606,0,0
1,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0.379232,0.360541,0.715771,0.466215,0,0
2,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0.171,0.14483,0.449638,0.46574,0,1
3,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0.17553,0.174649,0.607131,0.284297,0,1
4,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0.20912,0.197158,0.449313,0.339143,0,1
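`pd.concat(..., axis=1)` matches rows by index label, not by position, which is why the scaled frame was built with `index=train.index`. The sketch below shows the alignment on two hypothetical frames (`left` and `right`) whose labels are the same but stored in a different order.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
# Same labels as `left`, but stored in reverse order.
right = pd.DataFrame({"b": [10, 20, 30]}, index=[2, 1, 0])

# Rows are paired by label: label 0 receives b=30, not b=10.
joined = pd.concat([left, right], axis=1)
print(joined.loc[0, "b"])  # 30
```

If the indexes did not share the same labels, the result would contain NaN in the unmatched rows.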


In [9]:
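# Assemble the record id, engineered features, year flag, and target, then save to CSV
FE_train = pd.concat([train['instant'], X_train,  train['yr'],train['cnt']], axis = 1)
# index=False keeps the row index out of the saved file
FE_train.to_csv( dpath + 'FE_day.csv', index=False)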
FE_train.head()

Unnamed: 0,instant,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,...,weekday_5,weekday_6,temp,atemp,hum,windspeed,holiday,workingday,yr,cnt
0,1,1,0,0,0,1,0,0,0,0,...,0,1,0.35517,0.373517,0.82862,0.284606,0,0,0,985
1,2,1,0,0,0,1,0,0,0,0,...,0,0,0.379232,0.360541,0.715771,0.466215,0,0,0,801
2,3,1,0,0,0,1,0,0,0,0,...,0,0,0.171,0.14483,0.449638,0.46574,0,1,0,1349
3,4,1,0,0,0,1,0,0,0,0,...,0,0,0.17553,0.174649,0.607131,0.284297,0,1,0,1562
4,5,1,0,0,0,1,0,0,0,0,...,0,0,0.20912,0.197158,0.449313,0.339143,0,1,0,1600
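Writing with `index=False` means the saved file can be read back without an extra index column. The sketch below checks the round trip through an in-memory buffer instead of a file; the frame `fe` is a hypothetical three-column excerpt, with values taken from the first two rows of the output above.

```python
import io
import pandas as pd

# Hypothetical excerpt of the engineered table (first two rows, three columns).
fe = pd.DataFrame({"instant": [1, 2], "temp": [0.35517, 0.379232], "cnt": [985, 801]})

buffer = io.StringIO()
fe.to_csv(buffer, index=False)  # do not write the row index as a column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))   # ['instant', 'temp', 'cnt']
```

Had the index been written, reading the file back without `index_col` would add a column named `Unnamed: 0`.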


In [8]:
FE_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 35 columns):
instant         731 non-null int64
season_1        731 non-null uint8
season_2        731 non-null uint8
season_3        731 non-null uint8
season_4        731 non-null uint8
mnth_1          731 non-null uint8
mnth_2          731 non-null uint8
mnth_3          731 non-null uint8
mnth_4          731 non-null uint8
mnth_5          731 non-null uint8
mnth_6          731 non-null uint8
mnth_7          731 non-null uint8
mnth_8          731 non-null uint8
mnth_9          731 non-null uint8
mnth_10         731 non-null uint8
mnth_11         731 non-null uint8
mnth_12         731 non-null uint8
weathersit_1    731 non-null uint8
weathersit_2    731 non-null uint8
weathersit_3    731 non-null uint8
weekday_0       731 non-null uint8
weekday_1       731 non-null uint8
weekday_2       731 non-null uint8
weekday_3       731 non-null uint8
weekday_4       731 non-null uint8
weekday_5       731 