# The Regular Neural Network for the Prediction of Bike Sharing

The goal of this project is to predict cnt (the count of new bike shares) for each hour from the given factors. To achieve this, a regular (fully connected) neural network is built with TensorFlow.

First, the required packages and the training data are imported. The data comes from https://www.kaggle.com/c/cee-498-project1-london-bike-sharing

In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd

In [2]:
df = pd.read_csv("input/train.csv")
df

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season
0,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0
1,2015-01-04 07:00:00,75,1.0,-1.0,100.0,7.0,4.0,0.0,1.0,3.0
2,2015-01-04 08:00:00,131,1.5,-1.0,96.5,8.0,4.0,0.0,1.0,3.0
3,2015-01-04 09:00:00,301,2.0,-0.5,100.0,9.0,3.0,0.0,1.0,3.0
4,2015-01-04 10:00:00,528,3.0,-0.5,93.0,12.0,3.0,0.0,1.0,3.0
...,...,...,...,...,...,...,...,...,...,...
12218,2017-01-03 19:00:00,1042,5.0,1.0,81.0,19.0,3.0,0.0,0.0,3.0
12219,2017-01-03 20:00:00,541,5.0,1.0,81.0,21.0,4.0,0.0,0.0,3.0
12220,2017-01-03 21:00:00,337,5.5,1.5,78.5,24.0,4.0,0.0,0.0,3.0
12221,2017-01-03 22:00:00,224,5.5,1.5,76.0,23.0,4.0,0.0,0.0,3.0


## Data Preprocessing

During preprocessing, "timestamp" is split into "year", "month", "day" and "hour". In addition, "t1" is dropped because "t1" and "t2" are highly correlated.

In [3]:
df = df.dropna(axis=0, how='any')  # dropna returns a new frame, so assign it back
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df = df.drop(['timestamp','t1'], axis=1)
df

Unnamed: 0,cnt,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,year,month,day,hour
0,134,2.5,96.5,0.0,1.0,0.0,1.0,3.0,2015,1,4,2
1,75,-1.0,100.0,7.0,4.0,0.0,1.0,3.0,2015,1,4,7
2,131,-1.0,96.5,8.0,4.0,0.0,1.0,3.0,2015,1,4,8
3,301,-0.5,100.0,9.0,3.0,0.0,1.0,3.0,2015,1,4,9
4,528,-0.5,93.0,12.0,3.0,0.0,1.0,3.0,2015,1,4,10
...,...,...,...,...,...,...,...,...,...,...,...,...
12218,1042,1.0,81.0,19.0,3.0,0.0,0.0,3.0,2017,1,3,19
12219,541,1.0,81.0,21.0,4.0,0.0,0.0,3.0,2017,1,3,20
12220,337,1.5,78.5,24.0,4.0,0.0,0.0,3.0,2017,1,3,21
12221,224,1.5,76.0,23.0,4.0,0.0,0.0,3.0,2017,1,3,22
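
The statement that "t1" and "t2" are highly correlated can be checked with a Pearson correlation before "t1" is dropped. The following is a minimal sketch, not the check used in this project: the helper `column_correlation` and the small frame `sample` are hypothetical, and `sample` holds only the t1/t2 pairs visible in the first data preview, so the printed coefficient says little about the full dataset.

```python
import pandas as pd

def column_correlation(frame: pd.DataFrame, a: str, b: str) -> float:
    """Pearson correlation between two numeric columns (hypothetical helper)."""
    return float(frame[a].corr(frame[b]))

# Illustrative values only: the t1/t2 pairs visible in the data preview.
sample = pd.DataFrame({
    "t1": [2.5, 1.0, 1.5, 2.0, 3.0, 5.0, 5.0, 5.5, 5.5],
    "t2": [2.5, -1.0, -1.0, -0.5, -0.5, 1.0, 1.0, 1.5, 1.5],
})

r = column_correlation(sample, "t1", "t2")
print(f"corr(t1, t2) on the sample rows: {r:.3f}")
```

On the real data, the same call would be made on the full frame right after `pd.read_csv`, while "t1" is still present.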


The columns are converted to a TensorFlow dataset. "cnt" is the target and the remaining columns are the features.

In [4]:
cnt = df.pop('cnt')
dataset = tf.data.Dataset.from_tensor_slices((df.values, cnt.values))

In [5]:
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))

Features: [2.500e+00 9.650e+01 0.000e+00 1.000e+00 0.000e+00 1.000e+00 3.000e+00
 2.015e+03 1.000e+00 4.000e+00 2.000e+00], Target: 134
Features: [-1.000e+00  1.000e+02  7.000e+00  4.000e+00  0.000e+00  1.000e+00
  3.000e+00  2.015e+03  1.000e+00  4.000e+00  7.000e+00], Target: 75
Features: [-1.000e+00  9.650e+01  8.000e+00  4.000e+00  0.000e+00  1.000e+00
  3.000e+00  2.015e+03  1.000e+00  4.000e+00  8.000e+00], Target: 131
Features: [-5.000e-01  1.000e+02  9.000e+00  3.000e+00  0.000e+00  1.000e+00
  3.000e+00  2.015e+03  1.000e+00  4.000e+00  9.000e+00], Target: 301
Features: [-5.000e-01  9.300e+01  1.200e+01  3.000e+00  0.000e+00  1.000e+00
  3.000e+00  2.015e+03  1.000e+00  4.000e+00  1.000e+01], Target: 528


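The training dataset is shuffled and split into mini-batches of 150 samples. The batched dataset is repeated 20 times, so each pass that Keras counts as one epoch covers the training data 20 times.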

In [91]:
train_dataset = dataset.shuffle(len(df)).batch(batch_size=150).repeat(20)

In [106]:
for feat, targ in train_dataset.take(1):
  print ('Features: {}, Target: {}'.format(feat, targ))

Features: [[15.  72.5 21.  ...  4.  25.  11. ]
 [14.  88.   9.  ... 10.  16.  22. ]
 [ 9.  84.5  6.  ...  1.   5.  18. ]
 ...
 [18.  47.5 24.  ...  9.  17.  16. ]
 [13.  61.   8.  ...  4.  13.  23. ]
 [ 6.5 93.5 15.  ...  4.   4.   4. ]], Target: [1555  369 2561  422 1234   45  995  150  479  604 2845   26  944 2205
 1136  301   57 2288 3000   95 3143  855  258 1306  372  143   63  224
  213 1517  356 1410   70   98  171  693  912  523 1555  225  568  511
 1283  700  419  632 2927  234  926 1954 3317 2250   81   25   12 1565
  979  991  282 1170  521   98 1721 1708 2819 1105 1256 1313  316  762
  997   79  168 1375   90 1003 1024  717  549 1740 1892  756  237  350
   86   47  266  911 1075  840  264 1273 1180 1603 1905 4628 1087 1107
 1260   60  192  906  141 1779  147 1717  512   31 1306 2314 2733 4447
 2272 1175  982 1884  341  548   87  153  164   30 2017   19 1081 3687
 1829  720   55 1600  580 3671 2065 1127 2857   35  133  114 1734 4076
 1176  459 2075   60   67  341 1432 1948  4
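
The effect of the batching settings on the amount of training can be worked out by hand. The sketch below assumes the 12,222 training rows shown in the data preview (index 0 to 12221) and the settings from this notebook (batch size 150, `repeat(20)`, 50 epochs); the function name `batches_per_pass` is only illustrative.

```python
import math

def batches_per_pass(n_rows: int, batch_size: int) -> int:
    """Number of mini-batches in one pass; the last batch may be smaller."""
    return math.ceil(n_rows / batch_size)

n_rows = 12222       # rows in the training preview (assumes none were dropped)
batch_size = 150
repeats = 20         # dataset.repeat(20)
epochs = 50          # model.fit(..., epochs=50)

per_pass = batches_per_pass(n_rows, batch_size)
steps_per_epoch = per_pass * repeats          # what Keras counts as one epoch
total_steps = steps_per_epoch * epochs
last_batch = n_rows - (per_pass - 1) * batch_size

print(per_pass, last_batch, steps_per_epoch, total_steps)
```

Under these assumptions each Keras epoch runs 1,640 gradient steps, that is 20 passes over the data, so 50 epochs amount to 1,000 passes.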

## The Structure of Model

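The layers are shown below. The loss is the mean squared error. The learning rate is kept constant for the first 10 epochs; after that, it is multiplied in every epoch by exp(-0.1) raised to the power epoch//3 - 3, so the decay gets stronger every three epochs.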

In [96]:
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(units=32, input_shape=(11,)),
     tf.keras.layers.BatchNormalization(),
     tf.keras.layers.Dense(256, activation='relu'),
     tf.keras.layers.BatchNormalization(),
     tf.keras.layers.Dense(16, activation='relu'),
     tf.keras.layers.BatchNormalization(),
     tf.keras.layers.Dense(1)
    ])

model.compile(
     optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # 'lr' is a deprecated alias
     loss='mean_squared_error',
    )

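def scheduler(epoch, lr):
    # Keep the initial rate for the first 10 epochs (epoch is 0-based).
    if epoch < 10:
        return lr
    # Afterwards scale the current rate; return a plain float for the callback.
    return float(lr * np.exp(-0.1) ** (epoch // 3 - 3))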
    
    
callback = tf.keras.callbacks.LearningRateScheduler(scheduler)

trained_model = model.fit(train_dataset, epochs=50, callbacks=[callback])

Epoch 1/50
...
Epoch 50/50
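
Because `LearningRateScheduler` passes the current learning rate to the schedule function, the decay in `scheduler` compounds from epoch to epoch. The following pure-Python sketch replays the schedule outside Keras to show the resulting values. It assumes the initial rate of 0.001 and the 50 epochs used above, and `simulate_schedule` is a hypothetical helper, not part of TensorFlow.

```python
import math

def scheduler(epoch: int, lr: float) -> float:
    """Same rule as in the training cell, written with math.exp."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1) ** (epoch // 3 - 3)

def simulate_schedule(initial_lr: float, epochs: int) -> list:
    """Feed each epoch's result back in, as the Keras callback does."""
    rates = []
    lr = initial_lr
    for epoch in range(epochs):   # Keras passes 0-based epoch indices
        lr = scheduler(epoch, lr)
        rates.append(lr)
    return rates

rates = simulate_schedule(0.001, 50)
for epoch in (0, 11, 12, 20, 30, 49):
    print(f"epoch {epoch:2d}: lr = {rates[epoch]:.3e}")
```

Under these assumptions the rate stays at 0.001 through epoch 11 and then shrinks quickly, ending near 5e-15 in the last epoch, so the late epochs change the weights very little.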


## The Test of Model

The test data is imported and preprocessed in the same way as the training data.

The performance of the neural network is evaluated based on the RMSE of predictions.
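
RMSE is the square root of the mean squared error that is used as the training loss, so it is expressed in the same units as cnt. A minimal NumPy sketch follows; the function name `rmse` and the example numbers are illustrative and not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root-mean-square error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up counts and predictions, for illustration only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]
print(rmse(actual, predicted))
```

The `ravel()` calls let the function accept a prediction array of shape (n, 1), which is what a network ending in `Dense(1)` returns.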

In [97]:
df2 = pd.read_csv("input/test.csv")
df2 = df2.dropna(axis=0, how='any')  # dropna returns a new frame, so assign it back
df2['timestamp'] = pd.to_datetime(df2['timestamp'])
df2['year'] = df2['timestamp'].dt.year
df2['month'] = df2['timestamp'].dt.month
df2['day'] = df2['timestamp'].dt.day
df2['hour'] = df2['timestamp'].dt.hour
df_test = df2.drop(['timestamp','t1'], axis=1)
df_test

Unnamed: 0,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,year,month,day,hour
0,2.0,93.0,6.0,3.0,0.0,1.0,3.0,2015,1,4,0
1,2.5,93.0,5.0,1.0,0.0,1.0,3.0,2015,1,4,1
2,2.0,100.0,0.0,1.0,0.0,1.0,3.0,2015,1,4,3
3,0.0,93.0,6.5,1.0,0.0,1.0,3.0,2015,1,4,4
4,2.0,93.0,4.0,1.0,0.0,1.0,3.0,2015,1,4,5
...,...,...,...,...,...,...,...,...,...,...,...
5186,2.5,76.0,11.0,1.0,1.0,0.0,3.0,2017,1,2,16
5187,0.0,81.0,11.0,1.0,1.0,0.0,3.0,2017,1,2,19
5188,-1.5,81.0,14.0,1.0,0.0,0.0,3.0,2017,1,3,9
5189,0.0,78.0,21.0,1.0,0.0,0.0,3.0,2017,1,3,11


In [98]:
test_dataset = tf.data.Dataset.from_tensor_slices(df_test.values).batch(1)
for feat in test_dataset.take(5):
  print ('Features: {}'.format(feat))

Features: [[2.000e+00 9.300e+01 6.000e+00 3.000e+00 0.000e+00 1.000e+00 3.000e+00
  2.015e+03 1.000e+00 4.000e+00 0.000e+00]]
Features: [[2.500e+00 9.300e+01 5.000e+00 1.000e+00 0.000e+00 1.000e+00 3.000e+00
  2.015e+03 1.000e+00 4.000e+00 1.000e+00]]
Features: [[2.000e+00 1.000e+02 0.000e+00 1.000e+00 0.000e+00 1.000e+00 3.000e+00
  2.015e+03 1.000e+00 4.000e+00 3.000e+00]]
Features: [[0.000e+00 9.300e+01 6.500e+00 1.000e+00 0.000e+00 1.000e+00 3.000e+00
  2.015e+03 1.000e+00 4.000e+00 4.000e+00]]
Features: [[2.000e+00 9.300e+01 4.000e+00 1.000e+00 0.000e+00 1.000e+00 3.000e+00
  2.015e+03 1.000e+00 4.000e+00 5.000e+00]]


In [99]:
prediction = model.predict(test_dataset)

In [100]:
df2['cnt'] = prediction.ravel()  # predict returns shape (n, 1); flatten to one value per row
df2

Unnamed: 0,timestamp,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,year,month,day,hour,cnt
0,2015-01-04 00:00:00,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0,2015,1,4,0,333.182678
1,2015-01-04 01:00:00,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0,2015,1,4,1,199.737000
2,2015-01-04 03:00:00,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0,2015,1,4,3,141.838959
3,2015-01-04 04:00:00,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0,2015,1,4,4,77.079193
4,2015-01-04 05:00:00,2.0,2.0,93.0,4.0,1.0,0.0,1.0,3.0,2015,1,4,5,33.481956
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5186,2017-01-02 16:00:00,5.0,2.5,76.0,11.0,1.0,1.0,0.0,3.0,2017,1,2,16,1029.974976
5187,2017-01-02 19:00:00,3.0,0.0,81.0,11.0,1.0,1.0,0.0,3.0,2017,1,2,19,439.895355
5188,2017-01-03 09:00:00,2.5,-1.5,81.0,14.0,1.0,0.0,0.0,3.0,2017,1,3,9,2271.223389
5189,2017-01-03 11:00:00,4.0,0.0,78.0,21.0,1.0,0.0,0.0,3.0,2017,1,3,11,809.702209


In [101]:
output = df2.drop(['t1','t2','hum','wind_speed','weather_code','is_holiday','is_weekend','season','year','month','day','hour'], axis=1)

The results are written to a CSV file, which is uploaded to Kaggle for evaluation.

In [102]:
output.to_csv('output_32_256_16_1_001xe0.1xepoch3-3_50.csv',index=False)

The model is saved in HDF5 (.h5) format.

In [104]:
model.save("32_256_16_1_001xe0.1xepoch3-3_50.h5")