
# Introduction

Following the analysis and cleansing of a dataset on weather conditions and traffic collisions in New York City, this document applies linear regression and deep neural network models to the exported data to predict how these two factors might interact on a given day. Previous analysis found a positive correlation between temperature (or dew point) and traffic collisions, although yearly fluctuations occurred due to external factors such as restrictions on road traffic and the impact of COVID-19, with lockdowns limiting travel. As such, the data was split into four groupings with similar patterns, to account for four potential scenarios the city could face in the future.

# Methodology

The initial dataset was created through Google's BigQuery platform, utilising databases of both global weather and traffic incidents in New York City. Following this, the data was manually cleansed to remove extreme high values, and initial analysis was carried out in Colab using the R programming language.

As data collection for traffic incidents began in the middle of 2012, this year was discarded for being incomplete. The years 2013-15 were grouped together as they showed similar patterns, and 2016-18 were also collated because they showed their own separate pattern, potentially caused by new developments or restrictions in the city. 2019 showed unique data, because the city implemented limitations on traffic in certain areas, which resulted in a decline in collisions unrelated to weather conditions. The data for 2020 is incomplete, but data from after the city locked down in March was useful in showing a conclusive correlation between temperature/dew point and collisions, as the smaller scale of the data showed fewer outliers.
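
The grouping described above can be written down as a small lookup. This is only a sketch: the function name `period_for_year` and the label strings are assumptions made for illustration, chosen to match the subsection headings used later in this document.

```python
from typing import Optional


def period_for_year(year: int) -> Optional[str]:
    """Return the dataset grouping a calendar year belongs to, or None."""
    if 2013 <= year <= 2015:
        return "2013-15"
    if 2016 <= year <= 2018:
        return "2016-18"
    if year == 2019:
        return "2019"
    if year == 2020:
        return "2020"
    # 2012 is discarded as incomplete; other years are outside the dataset
    return None


print(period_for_year(2014))
print(period_for_year(2012))
```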

Eight datasets were created in total: four simple sets to use with linear regression models, and four with one-hot encoding for month and day values for greater compatibility with a deep neural network model. This document uses Python and the TensorFlow library to train and test models on the data, calculating the RMSE (root-mean-square error) of each model on held-out rows, and testing it with example values for months, days, temperature and dew point to predict the number of collisions under those conditions.
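
As an illustration of what one-hot encoding does to the day and month values, here is a minimal sketch in plain Python. The helper names `one_hot` and `encode_row` and the column layout (seven day flags, twelve month flags, then the weather values) are assumptions; the exported one-hot files may be laid out differently.

```python
def one_hot(value: int, size: int) -> list:
    """Encode a 1-based category (e.g. month 1-12) as a list of 0/1 flags."""
    if not 1 <= value <= size:
        raise ValueError(f"value {value} outside 1..{size}")
    return [1 if i == value else 0 for i in range(1, size + 1)]


def encode_row(day: int, month: int, temp: float, dewp: float) -> list:
    """Build one feature row: 7 day flags, 12 month flags, then temp and dewp."""
    return one_hot(day, 7) + one_hot(month, 12) + [temp, dewp]


# values taken from the first row of the 2013-15 file shown later
row = encode_row(day=2, month=1, temp=38.0, dewp=25.6)
print(len(row))
print(row)
```

Encoding the categories this way stops a model from treating month 12 as "twelve times" month 1, which matters more for a neural network than for the simple inputs used in the linear models.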

# Findings

## Linear Regression

This section details the results of applying linear regression to the gathered data, examining the four different datasets created and the variations in the relationship between collisions and weather conditions shown. The main predictors selected for the following models are day, month and temperature, as these factors have been demonstrated to have the greatest impact on collision totals.
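
For readers who want to see what a linear fit does before the TensorFlow cells, the sketch below fits a line to a single predictor using the closed-form least-squares formulas. It is a stand-in for illustration only: the notebook itself trains a `LinearRegressor` with three predictors and the Adam optimizer, and the five rows used here (temperature and collision values copied from the shuffled sample printed further down) are far too few to be representative.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# temp and NUM_COLLISIONS from five rows of the 2013-15 data
temps = [42.4, 46.3, 42.5, 49.0, 64.4]
collisions = [0.411444, 0.867347, 0.785235, 0.833333, 0.812081]

slope, intercept = fit_line(temps, collisions)
print(slope, intercept)
```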

### 2013-15

In [None]:
# needed to create the data frame
import pandas as pd

# create the dataframe from 2013-15 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/lineardata1315.csv', index_col=0, )

In [None]:
# check the data is correct
print(df[:6])
# print(df) #all

     day  mo  temp  dewp   max   min  NUM_COLLISIONS
185    2   1  38.0  25.6  41.0  33.1        0.000000
186    3   1  27.5  12.1  33.1  21.9        0.336735
187    4   1  21.8   7.8  28.9  16.0        0.571429
188    5   1  32.2  21.1  41.0  24.1        0.421769
189    6   1  37.3  24.5  42.1  30.9        0.027211
190    7   1  35.7  31.3  44.1  23.0        0.040816


In [None]:
# needed for calculations
import numpy as np

# shuffle rows at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select the first three columns (day, mo, temp) as predictors
predictors = shuffle.iloc[:,0:3]

# print the first 6 rows of the predictors
print(predictors[:6])

      day  mo  temp
1000    5   3  42.4
256     3   3  46.3
885     2  12  42.5
483     6  10  49.0
803     4   9  64.4
689     2   5  56.4


In [None]:
# print the first five rows of the shuffled data
shuffle[:5]

      day  mo  temp  dewp   max   min  NUM_COLLISIONS
1000    5   3  42.4  40.0  48.9  37.9        0.411444
256     3   3  46.3  41.1  50.0  41.0        0.867347
885     2  12  42.5  32.4  57.0  39.0        0.785235
483     6  10  49.0  32.9  57.9  32.0        0.833333
803     4   9  64.4  60.3  73.0  54.0        0.812081


In [None]:
# select all rows of the specified column
targets = shuffle.iloc[:,-1]

# print the first 6 rows of the targets
print(targets[:6])

1000    0.411444
256     0.867347
885     0.785235
483     0.833333
803     0.812081
689     0.916107
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the rows form the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 3
# set number of output values/targets
noutputs = 1
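
Because the rows are shuffled without a fixed seed, the split and therefore the reported RMSE change on every run. The sketch below shows the same 80/20 split made reproducible in plain Python; the function name `shuffled_split` and the seed value are assumptions for illustration. In the notebook, calling `np.random.seed(...)` before `np.random.permutation` would have a comparable effect.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split into train and test."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * train_fraction)
    return shuffled[:train_size], shuffled[train_size:]


rows = list(range(10))  # stand-in for the dataframe rows
train, test = shuffled_split(rows)
print(len(train), len(test))
```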

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/linear_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out test rows
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to the output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# baseline: RMSE when predicting the training mean for every test row
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9da57ca9b0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Ru

The RMSE is relatively low (around 0.2 to 0.25), although it reflects the outliers that remain around the line of best fit despite the earlier removal of the most obvious outlying values. Reducing the training set to less than 20% of the full array increases the RMSE, suggesting there is a level of variability present, or that the model may be overfitting the data.
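
An RMSE figure is easier to judge against a baseline, which is what the "Just using average" line in the training cell reports. The sketch below restates both calculations in plain Python; the function names and the numbers are made up for illustration and are not results from the notebook.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def baseline_rmse(train_targets, test_targets):
    """RMSE obtained by predicting the training mean for every test row."""
    mean = sum(train_targets) / len(train_targets)
    return rmse(test_targets, [mean] * len(test_targets))


# made-up numbers, only to exercise the functions
train_targets = [0.2, 0.4, 0.6, 0.8]
test_targets = [0.5, 0.7]
model_predictions = [0.55, 0.65]

print(rmse(test_targets, model_predictions))
print(baseline_rmse(train_targets, test_targets))
```

A model is only adding value if its RMSE on the test rows is clearly below the baseline figure.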

This can be checked below by looking at the predictions made for existing rows that were held out of the training set.

In [None]:
print(preds)

{'scores': array([0.74313205, 0.7680194 , 0.55781   , 0.6056054 , 0.6082678 ,
       0.79998755, 0.68714446, 0.84736365, 0.5668215 , 0.4292273 ,
       0.62388116, 0.54640967, 0.643962  , 0.6662653 , 0.78877234,
       0.63320404, 0.754669  , 0.75332284, 0.52928483, 0.7976273 ,
       0.7823356 , 0.49643943, 0.6810591 , 0.69331706, 0.53123546,
       0.820935  , 0.7385187 , 0.62819207, 0.5604141 , 0.680766  ,
       0.6077509 , 0.65480626, 0.55572283, 0.5219316 , 0.5853307 ,
       0.6494909 , 0.65159667, 0.64572716, 0.6346578 , 0.7151817 ,
       0.6562693 , 0.47760817, 0.5207319 , 0.65893066, 0.6937463 ,
       0.8332912 , 0.46277508, 0.69767696, 0.62426186, 0.63453054,
       0.7747289 , 0.6781619 , 0.6330385 , 0.6369002 , 0.59748185,
       0.67679715, 0.5924009 , 0.72316873, 0.6195413 , 0.59191334,
       0.48203564, 0.63499844, 0.5733362 , 0.7424297 , 0.74459475,
       0.579304  , 0.49662495, 0.6551763 , 0.61195457, 0.5063771 ,
       0.62039053, 0.5763975 , 0.5756275 , 0.591035

Manually selecting data outside of the set shown above returns results within range of the actual collision totals, although experimentation showed the accuracy of predictions was sometimes low, suggesting collisions cannot be predicted by temperature or season alone in every situation.

In [None]:
input = pd.DataFrame.from_dict(data =
        {'day' : [3, 5, 6],
         'mo' : [6, 1, 12],
         'temp' : [58.6, 32.2, 42.1]})

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9fc26b38>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

However, using imagined values demonstrates the model's overall effectiveness at capturing seasonal patterns. Setting the test predictors to January, May and August, each with a corresponding temperature, shows the model predicts suitable collision values, increasing during later months as the temperature rises.

In [None]:
input = pd.DataFrame.from_dict(data =
        {'day' : [1, 1, 1],
         'mo' : [1, 5, 8],
         'temp' : [8.9, 20.5, 54.0]})

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9fc81518>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

The above results use Monday as the day predictor in order to focus on seasonal changes in temperature rather than on changes in traffic throughout the week. If the day values are changed, the weekly pattern can be seen clearly: setting the first day to Sunday dramatically lessens the predicted number of collisions, with the number of collisions increasing throughout the week. Monday remains the day with the highest collisions in this particular dataset.

In [None]:
input = pd.DataFrame.from_dict(data =
        {'day' : [7, 3, 5],
         'mo' : [1, 5, 8],
         'temp' : [8.9, 20.5, 54.0]})

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d97944668>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

The process is repeated below, this time including the dew point value in addition to temperature, while excluding month and day values.

In [None]:
# create the dataframe from 2013-15 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/lineardata1315.csv', index_col=0, )

In [None]:
# check the data is correct
print(df[:6])
# print(df) #all

     day  mo  temp  dewp   max   min  NUM_COLLISIONS
185    2   1  38.0  25.6  41.0  33.1        0.000000
186    3   1  27.5  12.1  33.1  21.9        0.336735
187    4   1  21.8   7.8  28.9  16.0        0.571429
188    5   1  32.2  21.1  41.0  24.1        0.421769
189    6   1  37.3  24.5  42.1  30.9        0.027211
190    7   1  35.7  31.3  44.1  23.0        0.040816


In [None]:
# needed for calculations
import numpy as np

# shuffle rows at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select the temp and dewp columns as predictors
predictors = shuffle.iloc[:,2:4]

# print the first 6 rows of the predictors
print(predictors[:6])

     temp  dewp
839  63.2  59.7
807  57.5  47.6
281  39.7  28.6
391  66.1  64.4
817  59.7  55.3
600  39.6  32.2


In [None]:
# print the first five rows of the shuffled data
shuffle[:5]

     day  mo  temp  dewp   max   min  NUM_COLLISIONS
839    5  10  63.2  59.7  70.0  60.1        0.909396
807    1   9  57.5  47.6  66.0  46.9        0.781879
281    7   4  39.7  28.6  48.0  26.1        0.238095
391    5   7  66.1  64.4  69.8  64.0        0.802721
817    4   9  59.7  55.3  68.0  48.9        0.382550


In [None]:
# select all rows of the specified column
targets = shuffle.iloc[:,-1]

# print the first 6 rows of the targets
print(targets[:6])

839    0.909396
807    0.781879
281    0.238095
391    0.802721
817    0.382550
600    0.600671
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the rows form the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 2
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/linear_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out test rows
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to the output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# baseline: RMSE when predicting the training mean for every test row
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9fca9828>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Ru

The RMSE for the model using these two features together (excluding day and month) tends to appear slightly higher, highlighting the importance of seasonality.

In [None]:
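input = pd.DataFrame.from_dict(data =
        # column order must match the training predictors (temp, dewp),
        # because the values are passed to the model as a positional array
        {'temp' : [8.9, 20.5, 54.0],
         'dewp' : [5.4, 16.9, 48.3]})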

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9f6a5978>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

The predicted values are not as varied as those created with input from month and day, suggesting weather conditions alone are not enough to create helpful predictions due to the fluctuations introduced by changing months and days of the week. It is also likely dew point introduces a level of variability that temperature does not.
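
One practical detail when building prediction inputs for this model: the estimator receives a plain array of values, so the columns must be in the same order as the training predictors, which for this model were selected with `iloc[:,2:4]` (temp, then dewp). The helper below is a hypothetical sketch of enforcing that order in plain Python; with pandas, selecting the columns by name in the training order before taking `.values` would serve the same purpose.

```python
def rows_in_training_order(features, training_columns):
    """Turn a dict of column -> values into rows ordered like the training data."""
    missing = [c for c in training_columns if c not in features]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {len(features[c]) for c in training_columns}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of values")
    ordered = (features[c] for c in training_columns)
    return [list(row) for row in zip(*ordered)]


# the dict lists dewp first, but the model was trained on (temp, dewp)
features = {"dewp": [5.4, 16.9, 48.3], "temp": [8.9, 20.5, 54.0]}
rows = rows_in_training_order(features, ["temp", "dewp"])
print(rows)
```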

### 2016-18

In [None]:
# needed to create the data frame
import pandas as pd

# create the dataframe from 2016-18 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/lineardata1618.csv', index_col=0, )

In [None]:
# check the data is correct
print(df[:6])
# print(df) #all

      day  mo  temp  dewp   max   min  NUM_COLLISIONS
1283    1   1  38.2  27.4  52.0  34.0        0.368613
1284    2   1  27.7  18.1  43.0  24.1        0.510949
1285    3   1  33.5  19.3  45.0  21.0        0.456204
1286    4   1  40.5  29.9  46.9  24.1        0.153285
1287    5   1  41.3  33.1  48.0  30.9        0.372263
1288    6   1  45.9  38.6  50.0  30.9        0.197080


In [None]:
# needed for calculations
import numpy as np

# shuffle rows at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select the first three columns (day, mo, temp) as predictors
predictors = shuffle.iloc[:,0:3]

# print the first 6 rows of the predictors
print(predictors[:6])

      day  mo  temp
1595    5  11  54.9
1462    5   7  66.9
2250    2   8  73.0
1538    4   9  64.5
1587    4  11  58.8
2370    3  12  31.8


In [None]:
# print the first five rows of the shuffled data
shuffle[:5]

      day  mo  temp  dewp   max   min  NUM_COLLISIONS
1595    5  11  54.9  41.0  63.0  46.9        0.602190
1462    5   7  66.9  63.6  77.0  61.0        0.919708
2250    2   8  73.0  71.2  84.0  66.9        0.796296
1538    4   9  64.5  53.4  73.9  53.1        0.937956
1587    4  11  58.8  56.2  63.0  51.1        0.689781


In [None]:
# select all rows of the specified column
targets = shuffle.iloc[:,-1]

# print the first 6 rows of the targets
print(targets[:6])

1595    0.602190
1462    0.919708
2250    0.796296
1538    0.937956
1587    0.689781
2370    0.051852
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the rows form the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 3
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/linear_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out test rows
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to the output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# baseline: RMSE when predicting the training mean for every test row
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efddffcf630>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Ru

The RMSE for data from 2016-18 often appears higher than for previous years, which can be attributed to larger fluctuations in the data and the greater number of values gathered. Runs with an RMSE of around 0.26 were the most useful for making accurate predictions.

In [None]:
input = pd.DataFrame.from_dict(data =
        {'day' : [1, 1, 1],
         'mo' : [1, 5, 8],
         'temp' : [8.9, 20.5, 54.0]})

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efddfc61710>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

A regression model with relatively low RMSE predicts suitable values for each of the input values that rise throughout the year, although compared to the 2013-15 model (and depending on the shuffled input values) it shows wider variability between the highest and lowest values - mirroring the wider spread in collision numbers throughout these years.

### 2019

In [None]:
# needed to create the data frame
import pandas as pd

# create the dataframe from 2019 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/lineardata19.csv', index_col=0, )

In [None]:
# check the data is correct
print(df[:6])
# print(df) #all

      day  mo  temp  dewp   max   min  NUM_COLLISIONS
2376    2   1  47.4  41.6  54.0  39.9        0.083095
2377    3   1  35.0  23.4  55.0  28.0        0.292264
2378    4   1  39.4  32.9  46.9  28.0        0.297994
2379    5   1  39.0  30.2  46.9  30.9        0.567335
2380    6   1  43.8  42.4  46.9  30.9        0.157593
2382    1   1  27.4  11.7  45.0  21.9        0.524355


In [None]:
# needed for calculations
import numpy as np

# shuffle rows at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select the first three columns (day, mo, temp) as predictors
predictors = shuffle.iloc[:,0:3]

# print the first 6 rows of the predictors
print(predictors[:6])

      day  mo  temp
2555    6   6  65.8
2442    5   3  27.2
2479    7   4  50.8
2671    3  10  61.4
2432    2   2  29.2
2453    2   3  35.2


In [None]:
# print the first five rows of the shuffled data
shuffle[:5]

      day  mo  temp  dewp   max   min  NUM_COLLISIONS
2555    6   6  65.8  63.5  80.1  57.9        0.744986
2442    5   3  27.2  10.7  35.1  15.1        0.722063
2479    7   4  50.8  48.7  68.0  45.0        0.163324
2671    3  10  61.4  57.7  68.0  53.1        0.770774
2432    2   2  29.2   6.8  42.1  26.1        0.773639


In [None]:
# select all rows of the specified column
targets = shuffle.iloc[:,-1]

# print the first 6 rows of the targets
print(targets[:6])

2555    0.744986
2442    0.722063
2479    0.163324
2671    0.770774
2432    0.773639
2453    0.472779
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the rows form the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 3
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/linear_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out test rows
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to the output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# calculate the baseline RMSE from always predicting the training-set mean
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9ec56be0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Ru

The 2019 model shows more fluctuation in RMSE values, as external conditions affected collision numbers and weakened their correlation with seasonal changes and temperature. Predicted values show the same fluctuations, but typically follow the same increasing pattern.

In [None]:
input = pd.DataFrame.from_dict(data={
    'day': [1, 1, 1],
    'mo': [1, 5, 8],
    'temp': [8.9, 20.5, 54.0]})

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9ec46f60>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

### 2020

In [None]:
# needed to create the data frame
import pandas as pd

# create the dataframe from 2020 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/lineardata20.csv', index_col=0, )

In [None]:
# check the data is correct
print(df[:6])
# print(df) #all

      day  mo  temp  dewp   max   min  NUM_COLLISIONS
2826    4   3  41.2  33.6  46.9  36.0        0.138235
2827    5   3  47.6  36.5  59.0  35.1        0.150000
2828    6   3  42.3  33.0  59.0  30.0        0.132353
2829    7   3  44.8  41.2  53.1  30.0        0.077941
2830    1   3  44.7  42.2  48.9  39.9        0.104412
2831    2   3  39.4  32.1  48.9  34.0        0.094118


In [None]:
# needed for calculations
import numpy as np

# shuffle rows at random
shuffle = df.iloc[np.random.permutation(len(df))]

# specify the column to select all rows from
predictors = shuffle.iloc[:,0:3]

# print the first 6 rows of the predictors
print(predictors[:6])

      day  mo  temp
2894    2   6  57.1
2865    1   5  56.3
2945    4   7  74.1
2954    6   8  72.6
2972    3   8  70.4
2900    1   6  62.6


In [None]:
# print the first five rows of the shuffled data
shuffle[:5]

Unnamed: 0,day,mo,temp,dewp,max,min,NUM_COLLISIONS
2894,2,6,57.1,46.6,62.1,48.9,0.152941
2865,1,5,56.3,46.2,69.1,46.9,0.117647
2945,4,7,74.1,70.9,82.0,64.9,0.351471
2954,6,8,72.6,66.2,82.9,62.1,0.414706
2972,3,8,70.4,62.6,78.1,62.6,0.283824


In [None]:
# select all rows of the specified column
targets = shuffle.iloc[:,-1]

# print the first 6 rows of the targets
print(targets[:6])

2894    0.152941
2865    0.117647
2945    0.351471
2954    0.414706
2972    0.283824
2900    0.207353
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the array forms the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 3
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/linear_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out 20% test set
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to the output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# calculate the baseline RMSE from always predicting the training-set mean
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9fc5be80>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Ru

Because the 2020 dataset is smaller and its collision values are lower, the RMSE varies considerably between training runs: attempts returned values as small as 0.04 and as large as 0.3, although most remained around 0.25.
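
One way to make this run-to-run variation visible is to repeat the shuffle and 80/20 split several times and summarise the resulting RMSE values. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than `lineardata20.csv`, it swaps the TensorFlow `LinearRegressor` for an ordinary least-squares fit via `np.linalg.lstsq`, and the helper names `rmse` and `repeated_split_rmse` are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def repeated_split_rmse(X, y, n_repeats=20, train_frac=0.8, seed=0):
    """Return one test-set RMSE per random 80/20 split.

    A least-squares fit stands in for the TensorFlow LinearRegressor
    used in the notebook (assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # append a column of ones so the fit includes an intercept
    X1 = np.column_stack([X, np.ones(len(X))])
    trainsize = int(len(y) * train_frac)
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))
        train, test = order[:trainsize], order[trainsize:]
        coef, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        scores.append(rmse(y[test], X1[test] @ coef))
    return np.array(scores)

# synthetic stand-in data (NOT the notebook's dataset): day, mo, temp
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.integers(1, 8, 200),    # day of week, 1-7
    rng.integers(3, 11, 200),   # month, 3-10
    rng.uniform(35, 85, 200),   # temperature
])
y_demo = 0.005 * X_demo[:, 2] + rng.normal(0, 0.05, 200)

scores = repeated_split_rmse(X_demo, y_demo)
print('mean RMSE {:.3f}, std {:.3f}, min {:.3f}, max {:.3f}'.format(
    scores.mean(), scores.std(), scores.min(), scores.max()))
```

Reporting the mean and spread over several splits, rather than a single run, gives a steadier figure to compare against the "just using average" baseline.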

In [None]:
# sample inputs; the third column is temp, matching the training predictors (day, mo, temp)
input = pd.DataFrame.from_dict(data={
    'day': [1, 1, 1],
    'mo': [1, 5, 8],
    'temp': [8.9, 20.5, 54.0]})

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9f3b8358>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

The sample values used in each previous prediction model are less helpful for this example. Data up to late March was excluded to create a dataset reflecting conditions since the beginning of COVID restrictions, and the data collected only reaches mid-October. With values missing for certain months, particularly those where the temperature may be far lower than the values already present, predictions may return negative values.
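
A minimal sketch of how such cases could be handled follows, assuming `NUM_COLLISIONS` is scaled to the range 0 to 1 as the sample values suggest. It flags inputs that fall outside the range of the training data and clips predictions to valid values. The helper names and the demo numbers are hypothetical and only illustrate the idea.

```python
import numpy as np

def clip_predictions(scores, lower=0.0, upper=1.0):
    """Limit predictions to the range the scaled target can take."""
    return np.clip(np.asarray(scores, dtype=float), lower, upper)

def outside_training_range(inputs, training):
    """Return True for each input row that has any feature below the
    training minimum or above the training maximum."""
    inputs = np.asarray(inputs, dtype=float)
    training = np.asarray(training, dtype=float)
    lo, hi = training.min(axis=0), training.max(axis=0)
    return np.any((inputs < lo) | (inputs > hi), axis=1)

# illustrative values only; columns are day, mo, temp
training_demo = np.array([[1, 3, 39.4], [4, 6, 62.6], [6, 8, 74.1], [2, 10, 55.0]])
inputs_demo = np.array([[1, 1, 8.9], [1, 5, 20.5], [1, 8, 54.0]])

# rows flagged True ask the model to extrapolate beyond what it was trained on
print(outside_training_range(inputs_demo, training_demo))
print(clip_predictions([-0.12, 0.05, 0.31]))
```

Clipping only hides the symptom. A flagged row still means the model is extrapolating, so its prediction should be treated with caution.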

## Deep Neural Network

It is evident that there are limitations in the linear regression models used above, particularly when examining 2020's data. A deep neural network model may be better suited to predicting collision numbers due to its ability to model potential non-linear relationships in the data provided.

### 2013-15

In [None]:
# needed to create the data frame
import pandas as pd

# create data frame from 2013-15 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/dnn1315.csv', index_col=0)

In [None]:
# print the data
print(df[:6])

   Apr  Aug  Dec  Feb  Jan  Jul  ...  Tue  Wed  year  temp  dewp  NUM_COLLISIONS
1    0    0    0    0    1    0  ...    1    0  2013  38.0  25.6        0.000000
2    0    0    0    0    1    0  ...    0    1  2013  27.5  12.1        0.336735
3    0    0    0    0    1    0  ...    0    0  2013  21.8   7.8        0.571429
4    0    0    0    0    1    0  ...    0    0  2013  32.2  21.1        0.421769
5    0    0    0    0    1    0  ...    0    0  2013  37.3  24.5        0.027211
6    0    0    0    0    1    0  ...    0    0  2013  35.7  31.3        0.040816

[6 rows x 23 columns]


In [None]:
# needed for calculations
import numpy as np

# shuffle the data by row at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select columns to take predictors from
predictors = shuffle.iloc[:,0:22]

# print the first 6 rows of predictors
print(predictors[:6])

     Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Sun  Thu  Tue  Wed  year  temp  dewp
698    0    0    0    1    0    0    0  ...    0    0    0    0  2015  37.1  34.3
904    0    0    0    0    0    0    0  ...    0    0    0    0  2015  67.6  57.8
240    0    0    0    0    0    0    0  ...    0    0    0    1  2013  56.1  42.7
271    0    0    0    0    0    0    0  ...    1    0    0    0  2013  58.3  47.4
690    0    0    0    0    1    0    0  ...    0    0    0    0  2015  24.5  12.9
598    0    0    0    0    0    0    0  ...    0    0    0    1  2014  64.9  62.1

[6 rows x 22 columns]


In [None]:
# print the first 5 rows of the shuffled data
shuffle[:5]

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,NUM_COLLISIONS
698,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2015,37.1,34.3,0.452316
904,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,2015,67.6,57.8,0.53951
240,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,2013,56.1,42.7,0.646259
271,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,2013,58.3,47.4,0.370748
690,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2015,24.5,12.9,0.395095


In [None]:
# select all rows of the chosen column
targets = shuffle.iloc[:,22]

# show the first six rows of the targets
print(targets[:6])

698    0.452316
904    0.539510
240    0.646259
271    0.370748
690    0.395095
598    0.664430
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the array forms the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 22
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/DNN_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out 20% test set
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# calculate the baseline RMSE from always predicting the training-set mean
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9da57c5668>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Runni

A lower RMSE is present when using a deep neural network model. As demonstrated previously, the model can be validated using real entries in the dataset that were not used as part of the training data. These entries were identified by manually selecting them from the full dataset and comparing them with the predictions shown below.

In [None]:
print(preds)

{'scores': array([0.4629023 , 0.5429624 , 0.42276162, 0.43025172, 0.6825472 ,
       0.72752494, 0.5749796 , 0.6859055 , 0.70134467, 0.4440515 ,
       0.6937975 , 0.4812401 , 0.6950825 , 0.5909236 , 0.19610704,
       0.7259197 , 0.6440246 , 0.32073882, 0.7026121 , 0.44772714,
       0.30961323, 0.5427398 , 0.54237056, 0.66540974, 0.7764637 ,
       0.55159223, 0.5990605 , 0.73318005, 0.60288405, 0.7264879 ,
       0.6289759 , 0.5827084 , 0.6023879 , 0.60652155, 0.3930419 ,
       0.7432711 , 0.7216243 , 0.5887956 , 0.6451551 , 0.6753668 ,
       0.7560595 , 0.5295702 , 0.47913128, 0.5753547 , 0.7686741 ,
       0.19264102, 0.48136884, 0.5537827 , 0.5495374 , 0.40602833,
       0.68962914, 0.40351397, 0.6132774 , 0.7418737 , 0.4831059 ,
       0.63177496, 0.80118454, 0.60134935, 0.72231513, 0.6182646 ,
       0.66004866, 0.37863636, 0.6104003 , 0.40712497, 0.39578626,
       0.5430946 , 0.68439513, 0.708652  , 0.21687254, 0.7201941 ,
       0.5435876 , 0.6394695 , 0.7162373 , 0.660674

In [None]:
input = pd.DataFrame.from_dict(data = 
				{
         'Apr' : [0,0,0],
         'Aug' : [0,0,1],
         'Dec' : [0,0,0],
         'Feb' : [0,1,0],
         'Jan' : [1,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,1],  # NOTE: the third sample also sets Aug above; only one month flag should be 1 per row
         'Fri' : [1,0,0],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,1],
         'year' : [2013,2014,2015],
         # temp must come before dewp to match the training column order (year, temp, dewp)
         'temp' : [28.1, 24.3, 62.4],
         'dewp' : [16.6, 8.9, 55.2]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9da581e748>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100

The accuracy of these predicted collision numbers is poorer than that of the linear regression model, although this model shows a better understanding of the daily and monthly variations in collisions. This is shown below using hypothetical values.

In [None]:
input = pd.DataFrame.from_dict(data = 
				{
         'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,1,1],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'year' : [2015,2015,2015],
         # temp must come before dewp to match the training column order (year, temp, dewp)
         'temp' : [28.1, 24.3, 30.4],
         'dewp' : [16.6, 8.9, 24.2]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9eff2b00>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100

Looking at a Sunday, Monday and Friday in January 2015, an increase can be seen throughout the working week, in accordance with previous findings concerning factors in collisions. This is much clearer than the increase shown in the linear regression model, suggesting this model may work better with month and day values, while the other is more effective in understanding the relationship specifically with temperature.
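
The hand-written one-hot dictionaries used for these predictions are easy to get wrong, because the estimator receives `input.values` and so depends entirely on column position. The sketch below shows one possible way to generate the rows from readable tuples, using the column order printed for the shuffled data above. The names `MONTHS`, `DAYS`, `COLUMNS` and `make_dnn_inputs` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# column order copied from the dnn1315.csv header shown in the notebook output
MONTHS = ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul',
          'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']
DAYS = ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']
COLUMNS = MONTHS + DAYS + ['year', 'temp', 'dewp']

def make_dnn_inputs(samples):
    """Build DNN inputs from (month, day, year, temp, dewp) tuples.

    Exactly one month flag and one day flag are set per row, and the
    columns are returned in the same order as the training predictors.
    """
    rows = []
    for month, day, year, temp, dewp in samples:
        if month not in MONTHS or day not in DAYS:
            raise ValueError('unknown month or day: {0}, {1}'.format(month, day))
        row = {name: 0 for name in MONTHS + DAYS}
        row[month] = 1
        row[day] = 1
        row.update({'year': year, 'temp': temp, 'dewp': dewp})
        rows.append(row)
    return pd.DataFrame(rows, columns=COLUMNS)

# the Sunday, Monday and Friday in January 2015 used in the example above
demo = make_dnn_inputs([
    ('Jan', 'Sun', 2015, 28.1, 16.6),
    ('Jan', 'Mon', 2015, 24.3, 8.9),
    ('Jan', 'Fri', 2015, 30.4, 24.2),
])
print(demo[['Jan', 'Sun', 'Mon', 'Fri', 'year', 'temp', 'dewp']])
```

The resulting frame could then be passed to `estimator.predict(x=demo.values)` in place of a manually typed dictionary.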

### 2016-18

In [None]:
# needed to create the data frame
import pandas as pd

# create data frame from 2016-18 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/dnn1618.csv', index_col=0)

In [None]:
# print the data
print(df[:6])

   Apr  Aug  Dec  Feb  Jan  Jul  ...  Tue  Wed  year  temp  dewp  NUM_COLLISIONS
1    0    0    0    0    1    0  ...    0    0  2016  38.2  27.4        0.368613
2    0    0    0    0    1    0  ...    1    0  2016  27.7  18.1        0.510949
3    0    0    0    0    1    0  ...    0    1  2016  33.5  19.3        0.456204
4    0    0    0    0    1    0  ...    0    0  2016  40.5  29.9        0.153285
5    0    0    0    0    1    0  ...    0    0  2016  41.3  33.1        0.372263
6    0    0    0    0    1    0  ...    0    0  2016  45.9  38.6        0.197080

[6 rows x 23 columns]


In [None]:
# needed for calculations
import numpy as np

# shuffle the data by row at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select columns to take predictors from
predictors = shuffle.iloc[:,0:22]

# print the first 6 rows of predictors
print(predictors[:6])

     Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Sun  Thu  Tue  Wed  year  temp  dewp
46     0    0    0    1    0    0    0  ...    0    0    0    0  2016  47.8  40.9
474    0    0    0    0    0    1    0  ...    1    0    0    0  2017  67.1  57.7
669    0    0    0    0    0    0    0  ...    0    1    0    0  2018  36.5  28.1
629    0    0    0    0    1    0    0  ...    0    1    0    0  2018  28.0  12.4
735    0    0    0    0    0    0    1  ...    1    0    0    0  2018  52.8  48.1
549    0    0    0    0    0    0    0  ...    0    0    1    0  2017  51.1  39.0

[6 rows x 22 columns]


In [None]:
# print the first 5 rows of the shuffled data
shuffle[:5]

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,NUM_COLLISIONS
46,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2016,47.8,40.9,0.423358
474,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,2017,67.1,57.7,0.384615
669,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,2018,36.5,28.1,0.7
629,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2018,28.0,12.4,0.377778
735,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,2018,52.8,48.1,0.092593


In [None]:
# select all rows of the chosen column
targets = shuffle.iloc[:,22]

# show the first six rows of the targets
print(targets[:6])

46     0.423358
474    0.384615
669    0.700000
629    0.377778
735    0.092593
549    0.860806
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale the data
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the array forms the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 22
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/DNN_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train");

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out 20% test set
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# calculate RMSE
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));

# calculate mean number of collisions
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# calculate the baseline RMSE from always predicting the training-set mean
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9f06e208>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Runni

The RMSE for 2016-18 is very similar to that found using the linear regression model. However, as shown below, this model struggles to make useful predictions, returning collision numbers within a very close range of each other. This is unrepresentative of the spread of collision numbers across the years included in the dataset.

In [None]:
input = pd.DataFrame.from_dict(data = 
				{
         'Apr' : [0,0,0],
         'Aug' : [0,0,1],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,1,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [1,1,1],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'year' : [2017,2017,2017],
         # temp must come before dewp to match the training column order (year, temp, dewp)
         'temp' : [8.9, 20.5, 54.0],
         'dewp' : [5.4, 16.9, 48.3]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9ed33860>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100

In this case, it appears the linear regression model is more capable of handling the wider spread of data used in this set. This data may require further cleansing and splitting if a neural network were to be used further. As shown below, the model also fails to represent the relationship between days of the week as clearly as with the 2013-15 data, suggesting an issue with this dataset that limits its compatibility with a DNN.

In [None]:
input = pd.DataFrame.from_dict(data = 
				{
         'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,1,1],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'year' : [2017,2017,2017],
         # temp must come before dewp to match the training column order (year, temp, dewp)
         'temp' : [28.1, 24.3, 30.4],
         'dewp' : [16.6, 8.9, 24.2]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9da789a1d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100

### 2019

In [None]:
# needed to create the data frame
import pandas as pd

# create data frame from 2019 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/dnn19.csv', index_col=0)

In [None]:
# print the data
print(df[:6])

   Apr  Aug  Dec  Feb  Jan  Jul  ...  Tue  Wed  year  temp  dewp  NUM_COLLISIONS
1    0    0    0    0    1    0  ...    1    0  2019  47.4  41.6        0.083095
2    0    0    0    0    1    0  ...    0    1  2019  35.0  23.4        0.292264
3    0    0    0    0    1    0  ...    0    0  2019  39.4  32.9        0.297994
4    0    0    0    0    1    0  ...    0    0  2019  39.0  30.2        0.567335
5    0    0    0    0    1    0  ...    0    0  2019  43.8  42.4        0.157593
6    0    0    0    0    1    0  ...    0    0  2019  27.4  11.7        0.524355

[6 rows x 23 columns]


In [None]:
# needed for calculations
import numpy as np

# shuffle the data by row at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select columns to take predictors from
predictors = shuffle.iloc[:,0:22]

# print the first 6 rows of predictors
print(predictors[:6])

     Apr  Aug  Dec  Feb  Jan  Jul  Jun  ...  Sun  Thu  Tue  Wed  year  temp  dewp
314    0    0    0    0    0    0    0  ...    0    1    0    0  2019  43.3  36.2
264    0    0    0    0    0    0    0  ...    0    0    0    0  2019  56.3  48.8
248    0    0    0    0    0    0    0  ...    0    0    0    0  2019  60.7  54.2
349    0    0    1    0    0    0    0  ...    0    0    1    0  2019  44.7  42.6
164    0    0    0    0    0    0    1  ...    0    0    1    0  2019  60.4  58.8
38     0    0    0    1    0    0    0  ...    0    0    0    0  2019  31.5  10.3

[6 rows x 22 columns]


In [None]:
# print the first 5 rows of the shuffled data
shuffle[:5]

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,NUM_COLLISIONS
314,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,2019,43.3,36.2,0.707736
264,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,2019,56.3,48.8,0.100287
248,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,2019,60.7,54.2,0.252149
349,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2019,44.7,42.6,0.123209
164,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,2019,60.4,58.8,0.919771


In [None]:
# select all rows of the target column (NUM_COLLISIONS)
targets = shuffle.iloc[:,22]

# show the first six rows of the targets
print(targets[:6])

314    0.707736
264    0.100287
248    0.252149
349    0.123209
164    0.919771
38     0.401146
Name: NUM_COLLISIONS, dtype: float64
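
The predictors and targets above are selected by column position, which has to be adjusted whenever a file has a different number of columns (the 2019 file has 23, the 2020 file 21). As a minimal sketch of an alternative, the same split can be made by column name. The small frame and the names `frame`, `predictors_by_name` and `targets_by_name` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# tiny illustrative frame with the same kind of columns as the exported data
frame = pd.DataFrame({
    'Jan': [1, 0],
    'Mon': [0, 1],
    'year': [2019, 2019],
    'temp': [47.4, 35.0],
    'dewp': [41.6, 23.4],
    'NUM_COLLISIONS': [0.083095, 0.292264],
})

TARGET = 'NUM_COLLISIONS'

# every column except the target is a predictor, whatever the column count
predictors_by_name = frame.drop(columns=[TARGET])
targets_by_name = frame[TARGET]

print(list(predictors_by_name.columns))
print(targets_by_name.tolist())
```

Selecting by name keeps the same cell usable across yearly files whose month columns differ.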


In [None]:
# the targets already lie between 0 and 1, so a scale factor of 1.0 leaves them unchanged
SCALE_NUM_COLLISIONS = 1.0
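
The target values printed above already lie between 0 and 1, which is presumably why the scale factor is 1.0. The following is a minimal sketch of how such scaling can be applied and reversed, assuming min-max normalisation was used when the data was exported (an assumption, since the export step is not shown here). The raw counts and both helper functions are hypothetical.

```python
import numpy as np

# hypothetical raw daily collision counts (illustrative values only)
raw_counts = np.array([420.0, 515.0, 610.0, 480.0, 700.0])

def minmax_scale(values):
    """Scale values linearly onto [0, 1] and return the bounds used."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi

def minmax_unscale(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return scaled * (hi - lo) + lo

scaled, lo, hi = minmax_scale(raw_counts)
restored = minmax_unscale(scaled, lo, hi)

print(scaled)
print(restored)
```

Keeping the bounds alongside the model allows a normalised prediction to be reported as an actual number of collisions.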

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the rows form the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors
nppredictors = 22
# set number of output values/targets
noutputs = 1
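
Because the rows are shuffled at random, each run produces a different 80/20 split and therefore a slightly different RMSE. A minimal sketch of a repeatable split is given below, assuming a seeded NumPy generator is acceptable; the seed value and the row count are arbitrary illustrative choices.

```python
import numpy as np

# a fixed seed makes the shuffle repeatable (the value 42 is arbitrary)
rng = np.random.default_rng(seed=42)

n_rows = 365  # hypothetical number of daily rows in one year of data
order = rng.permutation(n_rows)

# first 80% of the shuffled row positions for training, the rest for testing
train_size = int(n_rows * 0.8)
train_idx = order[:train_size]
test_idx = order[train_size:]

print(len(train_idx), len(test_idx))
```

The index arrays can then be passed to `DataFrame.iloc` to pick the training and test rows.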

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/DNN_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train")

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out test rows
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# calculate RMSE of the model on the test rows
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse))

# calculate mean number of collisions in the training rows
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# baseline: RMSE obtained by always predicting the training mean
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse))

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9fc30278>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Runni
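
The cell above compares the network's RMSE with a baseline that always predicts the mean of the training targets. A model is only useful if it beats that baseline. The comparison can be sketched on its own as follows; all numbers are invented for illustration and the `rmse` helper is hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# hypothetical normalised collision values
train_targets = np.array([0.2, 0.4, 0.6, 0.8])
test_targets = np.array([0.3, 0.5, 0.7])
model_preds = np.array([0.35, 0.5, 0.65])

# baseline: predict the training mean for every test row
baseline = train_targets.mean()
baseline_rmse = rmse(test_targets, np.full_like(test_targets, baseline))
model_rmse = rmse(test_targets, model_preds)

print('model RMSE:', model_rmse)
print('baseline RMSE:', baseline_rmse)
```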

The 2019 model again achieves a reasonable RMSE. The predictions below reflect the seasonal increase in collisions acceptably, but the spread between the predicted totals remains narrow.

In [None]:
#"Apr","Aug","Dec","Feb","Jan","Jul","Jun","Mar","May","Nov","Oct","Sep","Fri","Mon","Sat","Sun","Thu","Tue","Wed","year","temp","dewp","NUM_COLLISIONS"
# three Mondays in 2019 to predict: one each in January, May and August
# the estimator matches features by position, so the columns follow the training order (temp before dewp)
new_days = pd.DataFrame.from_dict(data =
        {
         'Apr' : [0,0,0],
         'Aug' : [0,0,1],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,1,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [1,1,1],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'year' : [2019,2019,2019],
         'temp' : [8.9, 20.5, 54.0],
         'dewp' : [5.4, 16.9, 48.3]
        })

# restore the trained model from its checkpoint directory
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(new_days.values)))

preds = estimator.predict(x=new_days.values)

# print the predicted (normalised) collision values
print(preds['scores'])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9ec300f0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100
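
The prediction cells pass a plain NumPy array (`.values`) to the estimator, so features are matched to the network by position rather than by name. The prediction frame therefore has to list its columns in exactly the order of the training predictors. A minimal sketch of one way to enforce this is shown below, using the 2019 column order printed earlier. The `make_row` helper and the variable names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# column order of the 2019 training predictors, as printed earlier
TRAIN_COLUMNS = ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May',
                 'Nov', 'Oct', 'Sep', 'Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue',
                 'Wed', 'year', 'temp', 'dewp']

def make_row(month, day, year, temp, dewp):
    """Describe one day; only the active month and day flags are set."""
    row = {'year': year, 'temp': temp, 'dewp': dewp}
    row[month] = 1
    row[day] = 1
    return row

rows = [
    make_row('Jan', 'Mon', 2019, 8.9, 5.4),
    make_row('May', 'Mon', 2019, 20.5, 16.9),
    make_row('Aug', 'Mon', 2019, 54.0, 48.3),
]

# reindex puts the columns in training order; fillna zeroes the unset flags
new_data = pd.DataFrame(rows).reindex(columns=TRAIN_COLUMNS).fillna(0)

print(new_data[['Jan', 'May', 'Aug', 'Mon', 'temp', 'dewp']])
```

Building the frame this way means a misplaced or missing column cannot silently shift the remaining features into the wrong inputs.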

### 2020

In [None]:
# needed to create the data frame
import pandas as pd

# create data frame from 2020 csv file
df = pd.read_csv('https://raw.githubusercontent.com/15007919uhi/15007919_DataAnalytics/master/dnn20.csv', index_col=0)

In [None]:
# print the data
print(df[:6])

   Apr  Aug  Feb  Jan  Jul  Jun  ...  Tue  Wed  year  temp  dewp  NUM_COLLISIONS
1    0    0    0    1    0    0  ...    0    1  2020  40.3  29.9        0.357353
2    0    0    0    1    0    0  ...    0    0  2020  39.6  28.9        0.466176
3    0    0    0    1    0    0  ...    0    0  2020  45.8  42.9        0.527941
4    0    0    0    1    0    0  ...    0    0  2020  45.4  43.9        0.373529
5    0    0    0    1    0    0  ...    0    0  2020  40.1  33.8        0.283824
6    0    0    0    1    0    0  ...    0    0  2020  33.5  24.0        0.532353

[6 rows x 21 columns]


In [None]:
# needed for calculations
import numpy as np

# shuffle the data by row at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select columns to take predictors from
predictors = shuffle.iloc[:,0:20]

# print the first 6 rows of predictors
print(predictors[:6])

     Apr  Aug  Feb  Jan  Jul  Jun  Mar  ...  Sun  Thu  Tue  Wed  year  temp  dewp
95     1    0    0    0    0    0    0  ...    0    0    0    0  2020  42.3  38.4
182    0    0    0    0    0    1    0  ...    0    0    1    0  2020  66.3  62.4
212    0    0    0    0    1    0    0  ...    0    1    0    0  2020  74.4  72.0
71     0    0    0    0    0    0    1  ...    0    0    0    1  2020  47.2  39.3
122    0    0    0    0    0    0    0  ...    0    0    0    0  2020  51.3  49.1
191    0    0    0    0    1    0    0  ...    0    1    0    0  2020  70.9  67.4

[6 rows x 20 columns]


In [None]:
# print the first 5 rows of the shuffled data
shuffle[:5]

Unnamed: 0,Apr,Aug,Feb,Jan,Jul,Jun,Mar,May,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,NUM_COLLISIONS
95,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2020,42.3,38.4,0.091176
182,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,2020,66.3,62.4,0.270588
212,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,2020,74.4,72.0,0.302941
71,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,2020,47.2,39.3,0.558824
122,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,2020,51.3,49.1,0.136765


In [None]:
# select all rows of the target column (NUM_COLLISIONS)
targets = shuffle.iloc[:,20]

# show the first six rows of the targets
print(targets[:6])

95     0.091176
182    0.270588
212    0.302941
71     0.558824
122    0.136765
191    0.322059
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# the targets already lie between 0 and 1, so a scale factor of 1.0 leaves them unchanged
SCALE_NUM_COLLISIONS = 1.0

In [None]:
# split data into a training set that is 80% of the full shuffled array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# the remaining 20% of the rows form the test set
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# set number of input values/predictors (the 2020 data has 20 predictor columns)
nppredictors = 20
# set number of output values/targets
noutputs = 1

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check version
print(tf.__version__)

import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# remove saved model from previous training attempts
shutil.rmtree('/tmp/DNN_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# show model is beginning to train
print("starting to train")

# use predictors and target values to train the model
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# predict on the held-out test rows
preds = estimator.predict(x=predictors[trainsize:].values)

# apply the same scale to output
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# calculate RMSE of the model on the test rows
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse))

# calculate mean number of collisions in the training rows
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# baseline: RMSE obtained by always predicting the training mean
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse))

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9f475c88>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Runni

The DNN is slightly more useful for predictions on the 2020 data, as it is less likely to return negative values, but in this case the seasonal pattern is lost.

In [None]:
#"Apr","Aug","Feb","Jan","Jul","Jun","Mar","May","Oct","Sep","Fri","Mon","Sat","Sun","Thu","Tue","Wed","year","temp","dewp","NUM_COLLISIONS"
# three Mondays in 2020 to predict: one each in March, May and August
# the estimator matches features by position, so the columns follow the training order shown above
new_days = pd.DataFrame.from_dict(data =
        {
         'Apr' : [0,0,0],
         'Aug' : [0,0,1],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [1,0,0],
         'May' : [0,1,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [1,1,1],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'year' : [2020,2020,2020],
         'temp' : [8.9, 20.5, 54.0],
         'dewp' : [5.4, 16.9, 48.3]
        })

# restore the trained model from its checkpoint directory
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(new_days.values)))

preds = estimator.predict(x=new_days.values)

# print the predicted (normalised) collision values
print(preds['scores'])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9ee23cc0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100
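
Since a collision count cannot be negative, one simple sanity check on any set of predictions is to count the negative values and measure the spread of the rest. The sketch below shows this on invented numbers. Clipping at zero is a suggestion made here for illustration and is not a step taken in the original analysis.

```python
import numpy as np

# hypothetical normalised predictions, including two impossible negatives
preds = np.array([-0.04, 0.12, 0.31, -0.01, 0.45])

# how many predictions fall below zero
n_negative = int((preds < 0).sum())

# replace negative predictions with zero, leaving the others unchanged
clipped = np.clip(preds, 0.0, None)

# range between the largest and smallest prediction after clipping
spread = float(clipped.max() - clipped.min())

print(n_negative, clipped, spread)
```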

As with two of the other models, the DNN gives more helpful predictions of how collisions increase through the week, despite the narrower range of values it returns.

In [None]:
# three January 2020 days to predict: a Sunday, a Monday and a Friday
# the estimator matches features by position, so the columns follow the training order (temp before dewp)
new_days = pd.DataFrame.from_dict(data =
        {
         'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,1,1],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'year' : [2020,2020,2020],
         'temp' : [28.1, 24.3, 30.4],
         'dewp' : [16.6, 8.9, 24.2]
        })

# restore the trained model from its checkpoint directory
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(new_days.values)))

preds = estimator.predict(x=new_days.values)

# print the predicted (normalised) collision values
print(preds['scores'])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9d9ee25828>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_regression_trained_model/model.ckpt-100

# Conclusions

The scope of this short report is limited, and there is clear room for further exploration of how effective these models are. In most cases, a linear regression model appears to be the more useful tool for New York's emergency services to predict collision numbers under different scenarios based on temperature and season, while the DNN's strength lies mostly in predicting collisions by day of the week. Neither model can predict external factors such as road closures or changes to traffic regulations, because the data they are trained on does not capture them. However, by providing four distinct models, it is hoped that the way these external factors affect seasonal or weekly patterns is clear and that, with further exploration, predictions from these scenarios could help emergency staff to be deployed more effectively.

# References

Patterson, J. & Gibson, A. (2017) *Deep Learning: A Practitioner's Approach* California: O'Reilly Media.

Paolucci, R. (2020) 'Linear Regression v.s. Neural Networks' *Towards Data Science* [online]. Available from: https://towardsdatascience.com/linear-regression-v-s-neural-networks-cd03b29386d4 [Accessed 8th December 2020].