Predicting Shared-Bike Ride Counts with Linear Regression

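I. Data description: bike-sharing data provided by Capital Bikeshare (a bike-sharing company in Washington, D.C., USA). The data contains each day's date, weather, and related information; the task is to predict the daily number of bike rides.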

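Original dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
1)	File description
day.csv: bike-share counts aggregated by day (the assignment only uses this file)
hour.csv: bike-share counts aggregated by hour (can be ignored)
readme: data description file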

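2)	Field description
instant: record index
dteday: date
season: season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr: hour (0 to 23) (only in hour.csv; ignored in this assignment)
holiday: whether the day is a holiday (0/1)
weekday: day of the week, 0 to 6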
workingday: whether the day is a working day (1 = working day, 0 = weekend or holiday)
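weathersit: weather (1: clear, cloudy; 2: mist, overcast; 3: light snow, light rain; 4: heavy rain, heavy snow, heavy fog)
temp: air temperature (Celsius)
atemp: feels-like temperature
hum: humidity
windspeed: wind speed
### The next two fields can be dropped during feature engineering; only cnt matters
casual: rides by non-registered users (ignored in this assignment)
registered: rides by registered users (ignored in this assignment)
cnt: total number of rentals on the given day (day.csv) or in the given hour (hour.csv); the response variable y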

casual, registered, and cnt are all candidate target variables y (cnt = casual + registered); the assignment only requires predicting cnt.
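
The relationship cnt = casual + registered can be checked directly, and the two component columns dropped so they cannot leak the target into the features. A minimal sketch on a tiny hand-made frame; the numbers are illustrative and are not rows from day.csv, and the names `toy`, `components_match`, and `features_and_target` are hypothetical.

```python
import pandas as pd

# Illustrative values only; these are not rows from day.csv.
toy = pd.DataFrame({
    "casual": [10, 20, 30],
    "registered": [90, 180, 270],
    "cnt": [100, 200, 300],
})

# cnt should equal casual + registered on every row.
components_match = bool((toy["casual"] + toy["registered"] == toy["cnt"]).all())

# Drop the two component columns so only cnt remains as the target.
features_and_target = toy.drop(columns=["casual", "registered"])
print(components_match, list(features_and_target.columns))
```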
# RMSE
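
RMSE (root mean squared error) is the metric printed by the model cells in this notebook: the square root of the mean of the squared differences between true and predicted values, sqrt((1/n) * sum((y_i - y_hat_i)^2)). The sketch below computes it by hand on made-up numbers and compares the result with `np.sqrt(mean_squared_error(...))`, the expression the notebook uses. The arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Manual RMSE: squared errors are 1, 0, 1, 4, so the mean is 1.5.
rmse_manual = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as cnt (rides per day).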

In [1]:
# Import the required packages
import sys
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [2]:
# Plot and display formatting
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (30, 10),
          'axes.labelsize': 'x-large',
          'axes.titlesize':'x-large',
          'xtick.labelsize':'x-large',
          'ytick.labelsize':'x-large'}

sn.set_style('whitegrid')
sn.set_context('talk')

plt.rcParams.update(params)
pd.options.display.max_colwidth = 600

In [38]:
# Load the data
train = pd.read_csv("FE_BikeSharing.csv")
train.head()

Unnamed: 0,instant,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,...,weekday_5,weekday_6,temp,atemp,hum,windspeed,holiday,workingday,yr,cnt
0,1,1,0,0,0,1,0,0,0,0,...,0,1,0.35517,0.373517,0.82862,0.284606,0,0,0,985
1,2,1,0,0,0,1,0,0,0,0,...,0,0,0.379232,0.360541,0.715771,0.466215,0,0,0,801
2,3,1,0,0,0,1,0,0,0,0,...,0,0,0.171,0.14483,0.449638,0.46574,0,1,0,1349
3,4,1,0,0,0,1,0,0,0,0,...,0,0,0.17553,0.174649,0.607131,0.284297,0,1,0,1562
4,5,1,0,0,0,1,0,0,0,0,...,0,0,0.20912,0.197158,0.449313,0.339143,0,1,0,1600
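
The column names (season_1 to season_4, weathersit_1 to weathersit_3, and numeric fields lying in [0, 1]) suggest that FE_BikeSharing.csv was produced by one-hot encoding the categorical fields and min-max scaling the numeric ones. The notebook that generated the file is not shown here, so the following is only a sketch under that assumption, run on made-up rows; `raw`, `encoded`, and the column groupings are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up rows using day.csv field names; the values are illustrative.
raw = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weathersit": [1, 2, 3, 1],
    "temp": [0.2, 0.5, 0.8, 0.3],
    "hum": [0.8, 0.6, 0.5, 0.7],
    "cnt": [985, 3000, 6000, 2500],
})

categorical_cols = ["season", "weathersit"]
numeric_cols = ["temp", "hum"]

# One-hot encode: each category value becomes its own 0/1 column, e.g. season_1.
encoded = pd.get_dummies(raw, columns=categorical_cols, dtype=int)

# Min-max scale the numeric columns into the [0, 1] range.
encoded[numeric_cols] = MinMaxScaler().fit_transform(encoded[numeric_cols])

print(encoded.columns.tolist())
```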


In [39]:
train.shape

(731, 35)

In [40]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 35 columns):
instant         731 non-null int64
season_1        731 non-null int64
season_2        731 non-null int64
season_3        731 non-null int64
season_4        731 non-null int64
mnth_1          731 non-null int64
mnth_2          731 non-null int64
mnth_3          731 non-null int64
mnth_4          731 non-null int64
mnth_5          731 non-null int64
mnth_6          731 non-null int64
mnth_7          731 non-null int64
mnth_8          731 non-null int64
mnth_9          731 non-null int64
mnth_10         731 non-null int64
mnth_11         731 non-null int64
mnth_12         731 non-null int64
weathersit_1    731 non-null int64
weathersit_2    731 non-null int64
weathersit_3    731 non-null int64
weekday_0       731 non-null int64
weekday_1       731 non-null int64
weekday_2       731 non-null int64
weekday_3       731 non-null int64
weekday_4       731 non-null int64
weekday_5       731 

In [41]:
# For numeric variables, describe() shows their distribution
train.describe()

Unnamed: 0,instant,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,...,weekday_5,weekday_6,temp,atemp,hum,windspeed,holiday,workingday,yr,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,...,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,0.247606,0.25171,0.257182,0.243502,0.084815,0.077975,0.084815,0.082079,0.084815,...,0.142271,0.143639,0.543594,0.518864,0.645649,0.346536,0.028728,0.683995,0.500684,4504.348837
std,211.165812,0.431917,0.434293,0.43738,0.429489,0.278797,0.268316,0.278797,0.274674,0.278797,...,0.349567,0.350963,0.228091,0.213909,0.146457,0.159766,0.167155,0.465233,0.500342,1937.211452
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
25%,183.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.346343,0.339674,0.534704,0.232045,0.0,0.0,0.0,3152.0
50%,366.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.547268,0.535113,0.644388,0.326928,0.0,1.0,1.0,4548.0
75%,548.5,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.743002,0.695083,0.750857,0.434622,0.0,1.0,1.0,5956.0
max,731.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8714.0


In [42]:
# Categorical features have already been one-hot encoded into int64 0/1 columns
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 35 columns):
instant         731 non-null int64
season_1        731 non-null int64
season_2        731 non-null int64
season_3        731 non-null int64
season_4        731 non-null int64
mnth_1          731 non-null int64
mnth_2          731 non-null int64
mnth_3          731 non-null int64
mnth_4          731 non-null int64
mnth_5          731 non-null int64
mnth_6          731 non-null int64
mnth_7          731 non-null int64
mnth_8          731 non-null int64
mnth_9          731 non-null int64
mnth_10         731 non-null int64
mnth_11         731 non-null int64
mnth_12         731 non-null int64
weathersit_1    731 non-null int64
weathersit_2    731 non-null int64
weathersit_3    731 non-null int64
weekday_0       731 non-null int64
weekday_1       731 non-null int64
weekday_2       731 non-null int64
weekday_3       731 non-null int64
weekday_4       731 non-null int64
weekday_5       731 

In [44]:
# Use scikit-learn's train_test_split to randomly split the full dataset into training and test sets

from sklearn.model_selection import train_test_split

y = train['cnt']
y.shape


(731,)

In [45]:
X = train.drop('cnt', axis = 1)
X.shape

(731, 34)

In [46]:
X_train_data,X_test_data, y_train_data, y_test_data = train_test_split(X, y, test_size=0.2, random_state = 42)
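
With 731 rows and `test_size=0.2`, the split yields 584 training rows and 147 test rows, which matches the counts in the `describe()` outputs that follow. The sketch below reproduces only the sizes, using placeholder arrays rather than the bike-sharing data, and shows that fixing `random_state` makes the partition repeatable. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 731  # number of rows reported by train.shape
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder feature matrix
y_demo = np.arange(n_samples)                 # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Same seed, same split: repeating the call reproduces the partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = bool(np.array_equal(X_te, X_te2))
```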

In [47]:
X_train_data.describe()

Unnamed: 0,instant,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,...,weekday_4,weekday_5,weekday_6,temp,atemp,hum,windspeed,holiday,workingday,yr
count,584.0,584.0,584.0,584.0,584.0,584.0,584.0,584.0,584.0,584.0,...,584.0,584.0,584.0,584.0,584.0,584.0,584.0,584.0,584.0,584.0
mean,370.325342,0.241438,0.258562,0.258562,0.241438,0.092466,0.080479,0.068493,0.082192,0.094178,...,0.15411,0.130137,0.148973,0.549558,0.524359,0.649029,0.346938,0.025685,0.669521,0.517123
std,210.479312,0.428322,0.43822,0.43822,0.428322,0.289931,0.272267,0.252807,0.274892,0.292327,...,0.361363,0.336743,0.356367,0.226224,0.212212,0.146188,0.159584,0.158329,0.470789,0.500135
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,187.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.35413,0.351343,0.536311,0.230777,0.0,0.0,0.0
50%,378.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.565959,0.556663,0.652314,0.324347,0.0,1.0,1.0
75%,550.5,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.743781,0.696319,0.754713,0.43624,0.0,1.0,1.0
max,731.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.989717,0.864144,1.0,1.0,1.0


In [63]:
X_train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 584 entries, 682 to 102
Data columns (total 34 columns):
instant         584 non-null int64
season_1        584 non-null int64
season_2        584 non-null int64
season_3        584 non-null int64
season_4        584 non-null int64
mnth_1          584 non-null int64
mnth_2          584 non-null int64
mnth_3          584 non-null int64
mnth_4          584 non-null int64
mnth_5          584 non-null int64
mnth_6          584 non-null int64
mnth_7          584 non-null int64
mnth_8          584 non-null int64
mnth_9          584 non-null int64
mnth_10         584 non-null int64
mnth_11         584 non-null int64
mnth_12         584 non-null int64
weathersit_1    584 non-null int64
weathersit_2    584 non-null int64
weathersit_3    584 non-null int64
weekday_0       584 non-null int64
weekday_1       584 non-null int64
weekday_2       584 non-null int64
weekday_3       584 non-null int64
weekday_4       584 non-null int64
weekday_5       58

In [48]:
y_train_data.describe()

count     584.000000
mean     4560.977740
std      1916.243513
min       431.000000
25%      3290.250000
50%      4577.500000
75%      6031.750000
max      8714.000000
Name: cnt, dtype: float64

In [49]:
X_test_data.describe()

Unnamed: 0,instant,season_1,season_2,season_3,season_4,mnth_1,mnth_2,mnth_3,mnth_4,mnth_5,...,weekday_4,weekday_5,weekday_6,temp,atemp,hum,windspeed,holiday,workingday,yr
count,147.0,147.0,147.0,147.0,147.0,147.0,147.0,147.0,147.0,147.0,...,147.0,147.0,147.0,147.0,147.0,147.0,147.0,147.0,147.0,147.0
mean,348.816327,0.272109,0.22449,0.251701,0.251701,0.054422,0.068027,0.14966,0.081633,0.047619,...,0.095238,0.190476,0.122449,0.519902,0.497033,0.632224,0.344936,0.040816,0.741497,0.435374
std,213.729121,0.446567,0.418672,0.435474,0.435474,0.227624,0.252653,0.357957,0.27474,0.213687,...,0.294547,0.394019,0.328924,0.234651,0.219901,0.147252,0.161023,0.198541,0.439309,0.497501
min,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.047675,0.05101,0.193231,0.051298,0.0,0.0,0.0
25%,162.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.32237,0.315521,0.520994,0.235894,0.0,0.0,0.0
50%,341.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.48912,0.484539,0.622108,0.342338,0.0,1.0,0.0
75%,540.0,1.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.728577,0.690929,0.732695,0.430444,0.0,1.0,1.0
max,730.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.965734,0.928746,1.0,1.0,1.0,1.0,1.0


In [64]:
X_test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 703 to 135
Data columns (total 34 columns):
instant         147 non-null int64
season_1        147 non-null int64
season_2        147 non-null int64
season_3        147 non-null int64
season_4        147 non-null int64
mnth_1          147 non-null int64
mnth_2          147 non-null int64
mnth_3          147 non-null int64
mnth_4          147 non-null int64
mnth_5          147 non-null int64
mnth_6          147 non-null int64
mnth_7          147 non-null int64
mnth_8          147 non-null int64
mnth_9          147 non-null int64
mnth_10         147 non-null int64
mnth_11         147 non-null int64
mnth_12         147 non-null int64
weathersit_1    147 non-null int64
weathersit_2    147 non-null int64
weathersit_3    147 non-null int64
weekday_0       147 non-null int64
weekday_1       147 non-null int64
weekday_2       147 non-null int64
weekday_3       147 non-null int64
weekday_4       147 non-null int64
weekday_5       14

In [50]:
y_test_data.describe()

count     147.000000
mean     4279.374150
std      2009.315116
min        22.000000
25%      2413.000000
50%      4302.000000
75%      5819.000000
max      8167.000000
Name: cnt, dtype: float64

In [71]:
# Train an ordinary least-squares model
from sklearn.linear_model import LinearRegression
# Mean squared error
from sklearn.metrics import mean_squared_error
# Initialize with the default configuration
lr = LinearRegression()
# Fit the model parameters
lr.fit(X_train_data, y_train_data)
# Predict on the training and test sets
y_train_predict_data = lr.predict(X_train_data)
y_test_predict_data = lr.predict(X_test_data)
print("Training RMSE: {}".format(np.sqrt(mean_squared_error(y_train_data, y_train_predict_data))))
print("Test RMSE: {}".format(np.sqrt(mean_squared_error(y_test_data, y_test_predict_data))))


Training RMSE: 748.0284747352988
Test RMSE: 792.0291778273759
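
An RMSE figure is easier to interpret next to a naive baseline. A model that always predicts the training mean has a test RMSE close to the standard deviation of the test target, which `y_test_data.describe()` reports as about 2009, so an error near 790 is a clear improvement over that. The sketch below shows the comparison on synthetic data, not on the bike-sharing set; the data, the `rmse` helper, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 5))
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y_syn = X_syn @ true_coef + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

def rmse(y_true, y_pred):
    """Root mean squared error, as computed in the notebook cells."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
# Ordinary least squares with default settings.
ols = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = rmse(y_te, baseline.predict(X_te))
ols_rmse = rmse(y_te, ols.predict(X_te))
print(f"baseline RMSE: {baseline_rmse:.3f}, OLS RMSE: {ols_rmse:.3f}")
```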


In [73]:
from sklearn.linear_model import Ridge
# Use ridge regression
rg = Ridge()
# Fit the model parameters
rg.fit(X_train_data, y_train_data)
# Predict on the training and test sets
y_train_predict_data = rg.predict(X_train_data)
y_test_predict_data = rg.predict(X_test_data)
print("Training RMSE: {}".format(np.sqrt(mean_squared_error(y_train_data, y_train_predict_data))))
print("Test RMSE: {}".format(np.sqrt(mean_squared_error(y_test_data, y_test_predict_data))))

Training RMSE: 751.8147441561619
Test RMSE: 795.1050720950882
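
`Ridge()` with no arguments uses the default regularization strength `alpha=1.0`, so the result above reflects a single untuned setting. One common way to choose `alpha` is cross-validation over a grid, for example with `RidgeCV`. The sketch below runs on synthetic data, and the candidate grid is an arbitrary assumption rather than a recommendation for this dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data: two informative features out of ten.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
y_syn = 2.0 * X_syn[:, 0] - X_syn[:, 1] + rng.normal(scale=0.5, size=200)

# Hypothetical grid of regularization strengths to compare.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV evaluates each alpha by cross-validation and refits with the best one.
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X_syn, y_syn)
selected_alpha = float(ridge_cv.alpha_)
print("selected alpha:", selected_alpha)
```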


In [74]:
from sklearn.linear_model import Lasso
# Use Lasso regression
lasso = Lasso()
# Fit the model parameters
lasso.fit(X_train_data, y_train_data)
# Predict on the training and test sets
y_train_predict_data = lasso.predict(X_train_data)
y_test_predict_data = lasso.predict(X_test_data)
print("Training RMSE: {}".format(np.sqrt(mean_squared_error(y_train_data, y_train_predict_data))))
print("Test RMSE: {}".format(np.sqrt(mean_squared_error(y_test_data, y_test_predict_data))))

Training RMSE: 749.2714224948047
Test RMSE: 795.6395596888104
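
Unlike ridge, the L1 penalty in Lasso can set coefficients exactly to zero, so it doubles as a feature selector. The sketch below illustrates this on synthetic data where only two of eight features carry signal. The data, `alpha=0.5`, and the raised `max_iter` are assumptions chosen for the illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data: only the first two features influence the target.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 8))
y_syn = 5.0 * X_syn[:, 0] - 3.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=300)

# A larger max_iter gives the coordinate-descent solver more room to converge.
lasso_demo = Lasso(alpha=0.5, max_iter=10000).fit(X_syn, y_syn)

# Count coefficients that the L1 penalty shrank exactly to zero.
n_zero = int(np.sum(lasso_demo.coef_ == 0.0))
kept_features = np.flatnonzero(lasso_demo.coef_).tolist()
print("zeroed coefficients:", n_zero, "kept feature indices:", kept_features)
```

On the one-hot encoded bike-sharing features, the same check (`lasso.coef_ == 0`) would show which dummy columns the default `alpha=1.0` discards, if any.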


