# **INTRODUCTION**

This is the second part of the report where two regression models will be used: 

* The Linear Regression Model
* The Deep Neural Network Regression Model

The above models will be trained and tested with the data prepared in the first part of the report. The aim of the second part is to evaluate the conclusions of the first part using the linear regression model and the DNN regression model.

# **METHODOLOGY**

For the purposes of this report the following components have been used for linear and DNN regression models:

* [Python](https://www.python.org/) programming language
* [pandas](https://pandas.pydata.org/) software for analysis and data manipulation in Python
* [NumPy](https://numpy.org/) library for Python, providing advanced mathematical functions on large multi-dimensional arrays and matrices
* [TensorFlow](https://www.tensorflow.org/) - free and open-source software library for machine learning


The following statistical techniques have been used:

* [Linear regression](https://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-932) - a model that assumes a linear relationship between the variables x and y
* [Root Mean Square Error (RMSE)](https://www.researchgate.net/publication/262980567_Root_mean_square_error_RMSE_or_mean_absolute_error_MAE) as a performance measure. It is used to verify experimental results by showing how large the model's prediction errors are.
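
As a minimal sketch of the RMSE measure named above, the function below computes it in plain Python for two short lists. The function name `rmse` and the sample values are illustrative assumptions and are not taken from the report's data.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only: two exact predictions and one that is off by 2.
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example)
```

Because the errors are squared before they are averaged, a single large error contributes far more to the result than several small ones.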


# **RESULTS**

## **I. LINEAR REGRESSION MODEL**

In this section a data frame will be created from the data prepared for the linear regression model. In part 1 of this report the data was saved on GitHub as the CSV file **data_for_linear_regression_full.csv**.

In [None]:
# create the data frame using pandas
import pandas as pd

# create data frame from csv file hosted on our github for linear regression
df = pd.read_csv('https://raw.githubusercontent.com/20023160uhi/20023160_Data_Analytics/main/data_for_linear_regression_full.csv', index_col=0, )

It is worth checking what the data looks like:

In [None]:
# print the data
print(df)

      day  temp  dewp  NUM_COLLISIONS
1       3  27.5  12.1       -1.077771
2       4  21.8   7.8       -0.162511
3       5  32.2  21.1       -0.746155
4       1  35.9  31.0       -0.533921
5       2  39.8  34.1       -0.640038
...   ...   ...   ...             ...
1749    2  42.8  33.7       -0.668066
1750    4  35.9  30.3       -2.084936
1751    5  40.1  35.0       -1.669647
1752    1  39.6  36.3       -0.814638
1753    2  44.7  42.6       -1.730719

[1753 rows x 4 columns]


In the next step the rows of the data will be shuffled randomly. Then the columns to be used as predictors for the linear regression model will be selected. In this case the predictors are:

* day - day of the week
* temp - temperature
* dewp - dew point

It is worth noting that, as decided in part 1 of the report, only weekdays are taken into account for this model; Saturdays and Sundays are excluded.

In [None]:
# import numpy package to help with speedy maths based calculations
import numpy as np

# use iloc to select by rows
# Shuffle the data by rows determined at random
shuffle = df.iloc[np.random.permutation(len(df))]

# select all rows of the first three columns (positions 0 to 2: day, temp, dewp)
predictors = shuffle.iloc[:,0:3]

# print out the first 6 rows of predictors.
print(predictors[:6])

      day  temp  dewp
1154    5  67.7  66.7
83      3  49.9  41.4
750     2  58.0  51.5
747     4  51.1  46.4
456     4  65.0  63.4
1079    5  47.8  42.6
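
The shuffle above draws a new random order on every run, so the training and test rows differ each time the notebook is executed. As a sketch only, assuming a fixed seed is acceptable for this analysis, a seeded NumPy generator makes the order repeatable. The seed value `42` and the names `order_a` and `order_b` are hypothetical.

```python
import numpy as np

SEED = 42      # hypothetical value; any fixed integer works
n_rows = 1753  # number of rows in the linear regression data frame

# Two generators created with the same seed produce the same permutation.
order_a = np.random.default_rng(SEED).permutation(n_rows)
order_b = np.random.default_rng(SEED).permutation(n_rows)

print(np.array_equal(order_a, order_b))
print(order_a[:6])
```

In the notebook the resulting order could be passed to `df.iloc[...]` in place of the unseeded permutation.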


The first 6 rows of the predictors are shown above. The first 6 rows of the full shuffled data (including the number of collisions) are shown below.

In [None]:
# print out the shuffled data (first 6 rows)
shuffle[:6]

      day  temp  dewp  NUM_COLLISIONS
1154    5  67.7  66.7        1.539175
83      3  49.9  41.4        0.195635
750     2  58.0  51.5        1.776413
747     4  51.1  46.4        0.773438
456     4  65.0  63.4        0.420227
1079    5  47.8  42.6        0.805939


The next step is to specify the target, which is the number of collisions. Its first 6 values are presented below.

In [None]:
# Select all rows of the 4th column (position 3, NUM_COLLISIONS)
targets = shuffle.iloc[:,3]

# print out the first 6 rows of the targets data.
print(targets[:6])

1154    1.539175
83      0.195635
750     1.776413
747     0.773438
456     0.420227
1079    0.805939
Name: NUM_COLLISIONS, dtype: float64


In [None]:
# scale factor for the targets (1.0 leaves them unchanged)
SCALE_NUM_COLLISIONS = 1.0

The dataset prepared for this model will be divided into two parts: 80 percent of the data will be used for training the model, and the remaining 20 percent for testing.

In [None]:
# Split the data into a training set: 80% of the length of the shuffle array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# The test set size is 20% of the length of the shuffle array.
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# Define the number of input values (predictors)
nppredictors = 3
# Define the number of output values (targets)
noutputs = 1
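
As a quick check of the split sizes, the sketch below repeats the arithmetic of the cell above for the 1753 rows of this data frame. A plain list of row positions stands in for the shuffled data frame, which is an assumption made only to keep the example self-contained.

```python
n_rows = 1753  # rows in the linear regression data frame printed earlier

trainsize = int(n_rows * 0.8)  # int() truncates 1402.4 down to 1402
testsize = n_rows - trainsize  # the remainder goes to the test set

rows = list(range(n_rows))     # stand-in for the shuffled row positions
train_rows = rows[:trainsize]  # first 80 percent
test_rows = rows[trainsize:]   # remaining 20 percent

print(trainsize, testsize)
```

Slicing with `[:trainsize]` and `[trainsize:]` guarantees that the two sets do not overlap and together cover every row.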

In [None]:
# see predictors
print(predictors)

      day  temp  dewp
1154    5  67.7  66.7
83      3  49.9  41.4
750     2  58.0  51.5
747     4  51.1  46.4
456     4  65.0  63.4
...   ...   ...   ...
455     3  64.9  62.1
1268    2  45.3  45.3
52      2  39.6  35.5
1092    3  53.1  51.8
746     3  43.7  32.9

[1753 rows x 3 columns]


In [None]:
# see values of predictors
print(predictors.values)

[[ 5.  67.7 66.7]
 [ 3.  49.9 41.4]
 [ 2.  58.  51.5]
 ...
 [ 2.  39.6 35.5]
 [ 3.  53.1 51.8]
 [ 3.  43.7 32.9]]


In [None]:
# see values of targets
print(targets[:trainsize].values.reshape(trainsize, noutputs))

[[ 1.53917472]
 [ 0.19563478]
 [ 1.77641276]
 ...
 [-0.23379967]
 [ 0.59560338]
 [ 0.0586751 ]]
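
The `reshape(trainsize, noutputs)` call above turns the flat vector of targets into a column with one row per sample. The sketch below shows the effect on the first three target values, and also why both operands should have the same shape before errors are computed. The variable names are illustrative.

```python
import numpy as np

noutputs = 1
flat_targets = np.array([1.539175, 0.195635, 1.776413])  # first three targets shown above

# reshape(n, noutputs) turns shape (3,) into a column of shape (3, 1)
column_targets = flat_targets.reshape(len(flat_targets), noutputs)
print(flat_targets.shape, column_targets.shape)

# Caution: subtracting a flat vector from a column broadcasts to a 3x3 matrix
broadcast_diff = column_targets - flat_targets
print(broadcast_diff.shape)
```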


### **Training the linear regression model**

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# import shutil module for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model', ignore_errors=True)

# This is the core of the linear regressor

# The model is saved to model_dir and trained with the Adam optimization algorithm, an extension
# of stochastic gradient descent that has seen broad adoption for deep learning applications
# in computer vision and natural language processing. The feature columns are inferred from the input,
# which interprets all inputs as dense, fixed-length float values.

# See the link for more information
# https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/contrib/learn/LinearRegressor
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors[trainsize:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

#pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('LinearRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the Number of vehicle collisions values.
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using Number of vehicle collisions values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this case, it doesn't seem to be the case but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f92aef050>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Ru

This model has been run four times. Only on the fourth run was the model's RMSE lower than the RMSE obtained by simply predicting the mean.

Each run gave the following values:

* First attempt: `LinearRegression has RMSE of 0.9154298743550359. Just using average = 0.3449799013019476 has RMSE of 0.9152411831919334`
* Second attempt: `LinearRegression has RMSE of 0.9467731581299792. Just using average = 0.34797110838693696 has RMSE of 0.9221486842188588`
* Third attempt: `LinearRegression has RMSE of 1.0418983215461957. Just using average = 0.3092714154978759 has RMSE of 0.8745654169987936`
* Fourth attempt: `LinearRegression has RMSE of 0.8986772152123146. Just using average = 0.349233942180536 has RMSE of 0.906158995855328`

### **Testing the linear regression model**

The data displayed below will be used as a reference for the values used in the tests. The tests check whether increasing only one of the three predictors, while the other two are held constant, makes the model predict the expected increase in the number of collisions.

In [None]:
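# print the first `testsize` rows of the predictors as reference values for the tests
# (these are the first rows of the shuffled data, not the held-out rows from trainsize onwards)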
print(predictors[:testsize])

      day  temp  dewp
1154    5  67.7  66.7
83      3  49.9  41.4
750     2  58.0  51.5
747     4  51.1  46.4
456     4  65.0  63.4
...   ...   ...   ...
1006    4  38.9  32.3
88      3  59.4  54.7
1178    4  67.1  65.1
165     5  70.6  65.0
1656    2  71.2  68.7

[351 rows x 3 columns]


In [None]:
# see row 1656
print(df.loc[[1656]])

      day  temp  dewp  NUM_COLLISIONS
1656    2  71.2  68.7       -1.413144


#### **1. Day of the week**

The day of the week is the first factor that may affect the number of collisions. For the purposes of this model, it was decided in the first part of the report that only the days from Monday to Friday would be taken into account. The hypothesis presented earlier is that the number of collisions increases through the working week: it should be lowest on Monday and highest on Friday.

In [None]:
# Test inputs: vary the day of the week, keep temperature and dew point constant
# (named test_input so that the built-in input() is not shadowed)
test_input = pd.DataFrame.from_dict(data={
    'day': [3, 4, 5],
    'temp': [38.9, 38.9, 38.9],
    'dewp': [32.3, 32.3, 32.3],
})

# Re-create the estimator so that it restores the trained parameters saved in model_dir
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(test_input.values)))

# Predict and print the scores (no rescaling is applied)
preds = estimator.predict(x=test_input.values)
print(preds['scores'])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f8300ab50>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

The test was run three times:

> For the first attempt, days 1, 3 and 5 were selected, corresponding to Monday, Wednesday and Friday. The predicted values were estimated keeping the same temperature (value 60) and dew point (value 54.7) for all three days. The following results were obtained:
* `-0.2299912` for day 1 (Monday)
* `0.04447675` for day 3 (Wednesday)
* `0.3189448` for day 5 (Friday)

> For the second attempt, the days were 2, 3 and 4. The temperature was set to 51.1 and the dew point to 46.4. The following results were obtained:
* `-0.19163096` for day 2 (Tuesday)
* `-0.05439687` for day 3 (Wednesday)
* `0.0828371` for day 4 (Thursday)

> For the third attempt, the days were 3, 4 and 5. The temperature was set to 38.9 and the dew point to 32.3. The following results were obtained:
* `-0.14617062` for day 3 (Wednesday)
* `-0.00893664` for day 4 (Thursday)
* `0.12829733` for day 5 (Friday)

Based on the results it can be concluded that the model predicts an increase in the number of collisions from Monday to Friday. This confirms the conclusions drawn in the first part of the report.
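
Because temperature and dew point were held constant within each attempt, the change in the prediction per day gives the weight that the model implicitly assigns to `day`. The sketch below computes it from the reported values, using days 1, 3 and 5 for the first attempt as stated in its description. The variable names are illustrative.

```python
# (day values, reported predictions) for the three attempts listed above
attempts = [
    ([1, 3, 5], [-0.2299912, 0.04447675, 0.3189448]),
    ([2, 3, 4], [-0.19163096, -0.05439687, 0.0828371]),
    ([3, 4, 5], [-0.14617062, -0.00893664, 0.12829733]),
]

slopes = []
for days, preds in attempts:
    # change in prediction divided by change in day, first to last point
    slope = (preds[-1] - preds[0]) / (days[-1] - days[0])
    slopes.append(slope)
    print(f"days {days}: {slope:.4f} per day")
```

All three attempts give roughly the same slope, about 0.137 per day in the units of the NUM_COLLISIONS column, which is what a linear model with a single weight for `day` should produce.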

#### **2. Temperature**

As for the day of the week, the influence of temperature on the number of traffic collisions will now be examined, with the other predictors (day of the week and dew point) held constant.

In [None]:
# Test inputs: vary the temperature, keep day of the week and dew point constant
# (named test_input so that the built-in input() is not shadowed)
test_input = pd.DataFrame.from_dict(data={
    'day': [3, 3, 3],
    'temp': [49.9, 59.4, 67.7],
    'dewp': [41.4, 41.4, 41.4],
})

# Re-create the estimator so that it restores the trained parameters saved in model_dir
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(test_input.values)))

# Predict and print the scores (no rescaling is applied)
preds = estimator.predict(x=test_input.values)
print(preds['scores'])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f82ff4590>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

Three tests were conducted:
> For the first attempt, the day was set to 2 and the dew point to 54.7. The following results were obtained:
* `-0.64346063` for the temperature value of 38.9
* `-0.10841703` for the temperature value of 59.4
* `0.18389952` for the temperature value of 70.6

> For the second attempt, the day was set to 1 and the dew point to 51.5. The following results were obtained:
* `-0.41084194` for the temperature value of 51.1
* `-0.23075402` for the temperature value of 58.0
* `-0.19421446` for the temperature value of 59.4

> For the third attempt, the day was set to 3 and the dew point to 41.4. The following results were obtained:
* `-0.00534689` for the temperature value of 49.9
* `0.2426002` for the temperature value of 59.4
* `0.45922756` for the temperature value of 67.7

The conclusion presented in the first part of the report that the number of collisions increases when temperature increases is confirmed by the values predicted by the model.
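
The claim that the prediction rises with temperature can be checked mechanically. The sketch below tests that both the temperatures and the reported predictions are strictly increasing within each attempt. The helper `strictly_increasing` is a hypothetical name introduced for this illustration.

```python
# (temperature values, reported predictions) for the three attempts listed above
attempts = [
    ([38.9, 59.4, 70.6], [-0.64346063, -0.10841703, 0.18389952]),
    ([51.1, 58.0, 59.4], [-0.41084194, -0.23075402, -0.19421446]),
    ([49.9, 59.4, 67.7], [-0.00534689, 0.2426002, 0.45922756]),
]

def strictly_increasing(values):
    """True when every value is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))

results = [strictly_increasing(temps) and strictly_increasing(preds)
           for temps, preds in attempts]
print(results)
```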

#### **3. Dew point**

The last weather factor left to be tested is the dew point. As with the previous tests, the other predictors (in this case: day of the week and temperature) will have a constant value.

In [None]:
# Test inputs: vary the dew point, keep day of the week and temperature constant
# (named test_input so that the built-in input() is not shadowed)
test_input = pd.DataFrame.from_dict(data={
    'day': [4, 4, 4],
    'temp': [51.1, 51.1, 51.1],
    'dewp': [32.3, 46.4, 63.4],
})

# Re-create the estimator so that it restores the trained parameters saved in model_dir
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(test_input.values)))

# Predict and print the scores (no rescaling is applied)
preds = estimator.predict(x=test_input.values)
print(preds['scores'])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f82f0fdd0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

Three attempts were carried out:

> For the first attempt, the day was set to 1 and the temperature to 67.1. The following results were obtained:
* `0.31537247` for the dew point value 32.3
* `-0.04468346` for the dew point value 54.7
* `-0.2102449` for the dew point value 65.0

> For the second attempt, the day was set to 5 and the temperature to 67.7. The following results were obtained:
* `0.38006926` for the dew point value 63.4
* `0.3270253` for the dew point value 66.7
* `0.2948774` for the dew point value 68.7

> For the third attempt, the day was set to 4 and the temperature to 51.1. The following results were obtained:
* `0.30947948` for the dew point value 32.3
* `0.0828371` for the dew point value 46.4
* `-0.19041961` for the dew point value 63.4


Based on the results above, it cannot be concluded that a higher dew point results in a higher number of collisions. The outputs indicate the opposite: as the dew point increases, the predicted number of collisions decreases.
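
Since a linear regressor's prediction is an intercept plus a weighted sum of its inputs, the weights used by the restored model can be estimated from the reported predictions. The sketch below fits them by least squares to the third attempt of each of the three tests. It assumes that the model is purely linear in `day`, `temp` and `dewp` and that all nine predictions came from the same saved checkpoint; the variable names are illustrative.

```python
import numpy as np

# (day, temp, dewp, reported prediction) from the third attempt of each test above
points = np.array([
    [3, 38.9, 32.3, -0.14617062],
    [4, 38.9, 32.3, -0.00893664],
    [5, 38.9, 32.3, 0.12829733],
    [3, 49.9, 41.4, -0.00534689],
    [3, 59.4, 41.4, 0.2426002],
    [3, 67.7, 41.4, 0.45922756],
    [4, 51.1, 32.3, 0.30947948],
    [4, 51.1, 46.4, 0.0828371],
    [4, 51.1, 63.4, -0.19041961],
])

# Design matrix with a leading column of ones for the intercept
features = np.column_stack([np.ones(len(points)), points[:, :3]])
predictions = points[:, 3]

# Least-squares solution of features @ coefficients = predictions
coefficients, *_ = np.linalg.lstsq(features, predictions, rcond=None)
intercept, w_day, w_temp, w_dewp = coefficients
print(f"intercept={intercept:.4f} day={w_day:.4f} temp={w_temp:.4f} dewp={w_dewp:.4f}")
```

Under these assumptions the estimated weight for `dewp` is negative while the weights for `day` and `temp` are positive, which matches the direction of the three sets of results above.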



---



## **II. DNN REGRESSION MODEL**

The deep neural network (DNN) regression model is a more complex model that can detect patterns which are difficult to find with the analysis performed in the first part of the report. The data prepared for this model includes all weather factors, days, months and years from 2013 to 2019, together with the number of vehicle collisions per day.
One-hot encoding has been used to create indicator columns for the months and the days of the week.
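
As an illustration of one-hot encoding, the sketch below encodes a weekday as seven 0/1 values, using the same column order as the weekday columns of the data frame shown later in this section. It is written in plain Python and is not the code used to prepare the dataset in part 1; the function name `one_hot` is hypothetical.

```python
# Weekday columns in the order in which they appear in the data frame
WEEKDAY_COLUMNS = ["Fri", "Mon", "Sat", "Sun", "Thu", "Tue", "Wed"]

def one_hot(day_name, columns=WEEKDAY_COLUMNS):
    """Return a list with 1 in the position of day_name and 0 elsewhere."""
    if day_name not in columns:
        raise ValueError(f"unknown day: {day_name}")
    return [1 if column == day_name else 0 for column in columns]

encoded = one_hot("Tue")
print(dict(zip(WEEKDAY_COLUMNS, encoded)))
```

Each row therefore has exactly one 1 among its weekday columns, and likewise exactly one 1 among its twelve month columns.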

In [13]:
# create the data frame using pandas
import pandas as pd

# create data frame from csv file hosted on github
df = pd.read_csv('https://raw.githubusercontent.com/20023160uhi/20023160_Data_Analytics/main/data_for_dnn_regression_full.csv', index_col=0)

In [14]:
# make sure we have our data by printing it out
print(df)

      Apr  Aug  Dec  Feb  Jan  Jul  ...   max   min  prcp  sndp  fog  NUM_COLLISIONS
1       0    0    0    0    1    0  ...  33.1  21.9  0.00   NaN    0             480
2       0    0    0    0    1    0  ...  28.9  16.0  0.00   NaN    0             549
3       0    0    0    0    1    0  ...  41.0  24.1  0.00   NaN    0             505
4       0    0    0    0    1    0  ...  42.1  28.0  0.01   NaN    0             521
5       0    0    0    0    1    0  ...  48.0  25.0  0.00   NaN    0             513
...   ...  ...  ...  ...  ...  ...  ...   ...   ...   ...   ...  ...             ...
2445    0    0    1    0    0    0  ...  50.0  34.0  0.00   NaN    0             530
2446    0    0    1    0    0    0  ...  46.4  28.9  0.00   NaN    0             414
2447    0    0    1    0    0    0  ...  48.0  28.9  0.00   NaN    0             448
2448    0    0    1    0    0    0  ...  45.0  30.9  0.39   NaN    0             518
2449    0    0    1    0    0    0  ...  52.0  37.9  0.88   NaN  

The data prepared for this model is shown above. The dataset contains 33 columns: 12 for months, 7 for days of the week, one for the year, 12 for weather factors (temp, dewp, slp, visib, wdsp, mxpsd, gust, max, min, prcp, sndp, fog) and the last one for the number of vehicle collisions (NUM_COLLISIONS).

The next step is to shuffle the data and define the predictors. For this model the predictors are the months, days, year and weather factors (temp, dewp, slp, visib, wdsp, mxpsd, gust, max, min, prcp, sndp, fog).

In [15]:
# needed to help with speedy maths based calculations
import numpy as np

# iloc allows us to select by rows. Here, we are shuffling the data by rows determined at random.
shuffle = df.iloc[np.random.permutation(len(df))]

# select all rows of the first 32 columns (positions 0 to 31), i.e. every column except NUM_COLLISIONS
predictors = shuffle.iloc[:,0:32]
# Since NUM_COLLISIONS is the last column, the same selection can also be written as
# predictors = shuffle.iloc[:,:-1]

# print out the first 6 rows of predictors.
print(predictors[:6])

      Apr  Aug  Dec  Feb  Jan  Jul  ...  gust   max   min  prcp  sndp  fog
1093    0    0    0    1    0    0  ...  38.1  50.0  36.0  0.02   NaN    1
2261    0    0    0    0    0    0  ...   NaN  69.1  57.9  0.32   NaN    1
371     0    0    0    0    1    0  ...  24.1  28.9   9.0  0.00   NaN    0
1263    0    1    0    0    0    0  ...   NaN  82.0  66.9  0.00   NaN    1
531     0    0    0    0    0    1  ...  26.0  77.0  62.1  0.11   NaN    0
1460    0    0    0    1    0    0  ...   NaN  48.9  30.0  0.00   NaN    0

[6 rows x 32 columns]


This is what the first 5 rows of the shuffled data look like:

In [16]:
# print out the shuffled data (first 5 rows)
shuffle[:5]

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS
1093,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2016,44.5,41.7,1018.9,4.8,18.5,26.0,38.1,50.0,36.0,0.02,,1,615
2261,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,2019,62.3,59.6,1011.6,6.7,6.8,9.9,,69.1,57.9,0.32,,1,591
371,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2014,16.5,5.1,1014.4,9.7,11.1,19.0,24.1,28.9,9.0,0.0,,0,612
1263,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2016,74.2,65.0,1009.0,8.1,9.9,13.0,,82.0,66.9,0.0,,1,540
531,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,2014,68.2,59.1,1016.4,10.0,14.1,18.1,26.0,77.0,62.1,0.11,,0,451


Now the targets will be defined:

In [17]:
# Select all rows for the last column (NUM_COLLISIONS)
targets = shuffle.iloc[:,32]

# print out the first 6 rows of the targets data.
print(targets[:6])

1093    615
2261    591
371     612
1263    540
531     451
1460    591
Name: NUM_COLLISIONS, dtype: int64


The values of the targets can be scaled by a constant factor. In the cell below the factor is set to 1.0, so the targets are left unchanged.

In [20]:
# scale the values of targets
SCALE_NUM_COLLISIONS = 1.0

The next step is to split the data into two sets: one for model training (80%) and one for testing (20%). Then the number of input values (predictors) is defined as 32 and the number of output values (targets, NUM_COLLISIONS) is set to 1.

In [22]:
# Split our data into a training set i.e. 80% of the length of the shuffle array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# Define the number of input values (predictors)
nppredictors = 32
# Define the number of output values (targets)
noutputs = 1
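
The training cell that follows builds a network with `hidden_units=[20,18,14]`. To give a sense of its size, the sketch below counts the trainable parameters, assuming that every layer is fully connected and has one bias per unit. That assumption about the internals of the regressor, and the helper name `count_parameters`, are not taken from the report.

```python
def count_parameters(n_inputs, hidden_units, n_outputs=1):
    """Weights plus biases of a fully connected network."""
    layer_sizes = [n_inputs, *hidden_units, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus one bias per unit
    return total

params_32_inputs = count_parameters(32, [20, 18, 14])
print(params_32_inputs)
```

Under the same assumption, a network with 30 inputs instead of 32 would have 40 fewer parameters, because only the first layer depends on the number of predictors.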

> **Issues running the DNN regression model**

In [26]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_house_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_house_regression_trained_model', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors[trainsize:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the Number of Collisions Values.
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using the Number of Collisions Values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this case, it doesn't seem to be the case but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa96bd47e50>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_house_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow

InvalidArgumentError: ignored

The model could not be trained in the cell above. This might be caused by NaN values in the dataset, which will be investigated below.
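
A quick way to see how NaN values are distributed is to count them per column. The sketch below does this with NumPy for the `gust`, `sndp` and `max` values of the five shuffled rows displayed earlier. It is an illustration on a small sample, not the check used in the notebook.

```python
import numpy as np

nan = np.nan
# gust, sndp and max for the five shuffled rows displayed earlier
sample = np.array([
    [38.1, nan, 50.0],
    [nan,  nan, 69.1],
    [24.1, nan, 28.9],
    [nan,  nan, 82.0],
    [26.0, nan, 77.0],
])

nan_per_column = np.isnan(sample).sum(axis=0)
print(dict(zip(["gust", "sndp", "max"], nan_per_column.tolist())))

# A single NaN is enough to turn an aggregate such as the mean into NaN
print(np.isnan(sample[:, 0].mean()))
```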

> **Cleaning the data for DNN model**

In [40]:
# check missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2449 entries, 1 to 2449
Data columns (total 33 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Apr             2449 non-null   int64  
 1   Aug             2449 non-null   int64  
 2   Dec             2449 non-null   int64  
 3   Feb             2449 non-null   int64  
 4   Jan             2449 non-null   int64  
 5   Jul             2449 non-null   int64  
 6   Jun             2449 non-null   int64  
 7   Mar             2449 non-null   int64  
 8   May             2449 non-null   int64  
 9   Nov             2449 non-null   int64  
 10  Oct             2449 non-null   int64  
 11  Sep             2449 non-null   int64  
 12  Fri             2449 non-null   int64  
 13  Mon             2449 non-null   int64  
 14  Sat             2449 non-null   int64  
 15  Sun             2449 non-null   int64  
 16  Thu             2449 non-null   int64  
 17  Tue             2449 non-null   i

A significant number of missing values are in the columns gust and sndp. These columns will be removed, and then the rows that still contain NaN values in the remaining columns will be removed. The result is saved as the ***df_2*** dataset.

In [42]:
# remove columns with significant number of NaNs
df_2 = df.drop(['gust', 'sndp'], axis=1)

In [44]:
# remove other rows with NaNs from the rest of columns
df_2 = df_2.dropna(axis = 0)
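
The order of the two cleaning steps matters. The sketch below mirrors them in plain Python on three hypothetical rows (only the column names are taken from the dataset) and compares the result with dropping rows first.

```python
import math

# Hypothetical rows; only the column names are taken from the dataset
rows = [
    {"temp": 44.5, "gust": 38.1, "sndp": math.nan, "prcp": 0.02},
    {"temp": 62.3, "gust": math.nan, "sndp": math.nan, "prcp": 0.32},
    {"temp": 16.5, "gust": 24.1, "sndp": math.nan, "prcp": math.nan},
]

def has_missing(row):
    """True when any value in the row is NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in row.values())

# Step 1: drop the columns that are mostly empty
dropped_columns = {"gust", "sndp"}
without_columns = [
    {name: value for name, value in row.items() if name not in dropped_columns}
    for row in rows
]

# Step 2: drop every row that still contains a missing value
cleaned = [row for row in without_columns if not has_missing(row)]

# For comparison: dropping rows before dropping the sparse columns
rows_first = [row for row in rows if not has_missing(row)]

print(len(cleaned), "rows kept;", len(rows_first), "rows kept when rows are dropped first")
```

In this sample every row has a missing `sndp` value, so dropping rows first would discard all of them, whereas removing the sparse columns first keeps two of the three rows.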

In [45]:
# see if all NaNs have been removed
df_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2427 entries, 1 to 2449
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Apr             2427 non-null   int64  
 1   Aug             2427 non-null   int64  
 2   Dec             2427 non-null   int64  
 3   Feb             2427 non-null   int64  
 4   Jan             2427 non-null   int64  
 5   Jul             2427 non-null   int64  
 6   Jun             2427 non-null   int64  
 7   Mar             2427 non-null   int64  
 8   May             2427 non-null   int64  
 9   Nov             2427 non-null   int64  
 10  Oct             2427 non-null   int64  
 11  Sep             2427 non-null   int64  
 12  Fri             2427 non-null   int64  
 13  Mon             2427 non-null   int64  
 14  Sat             2427 non-null   int64  
 15  Sun             2427 non-null   int64  
 16  Thu             2427 non-null   int64  
 17  Tue             2427 non-null   i

Now that the data has been cleaned, the procedure for the DNN model has to be repeated:

> **Repeat steps for the DNN Regression Model with cleaned dataset *df_2*** 

In [46]:
# make sure we have our data by printing it out
print(df_2)

      Apr  Aug  Dec  Feb  Jan  ...   max   min  prcp  fog  NUM_COLLISIONS
1       0    0    0    0    1  ...  33.1  21.9  0.00    0             480
2       0    0    0    0    1  ...  28.9  16.0  0.00    0             549
3       0    0    0    0    1  ...  41.0  24.1  0.00    0             505
4       0    0    0    0    1  ...  42.1  28.0  0.01    0             521
5       0    0    0    0    1  ...  48.0  25.0  0.00    0             513
...   ...  ...  ...  ...  ...  ...   ...   ...   ...  ...             ...
2445    0    0    1    0    0  ...  50.0  34.0  0.00    0             530
2446    0    0    1    0    0  ...  46.4  28.9  0.00    0             414
2447    0    0    1    0    0  ...  48.0  28.9  0.00    0             448
2448    0    0    1    0    0  ...  45.0  30.9  0.39    0             518
2449    0    0    1    0    0  ...  52.0  37.9  0.88    1             443

[2427 rows x 31 columns]


In [47]:
# needed to help with speedy maths based calculations
import numpy as np

# iloc selects rows by integer position. Here we reorder the rows with a random permutation, i.e. we shuffle the data.
shuffle = df_2.iloc[np.random.permutation(len(df_2))]

# select all rows of the first 30 columns (positions 0-29) as predictors;
# the last column (NUM_COLLISIONS, position 30) is the target and is left out
predictors = shuffle.iloc[:,0:30]
# Since the target is the last column, we could equivalently use
# predictors = shuffle.iloc[:,:-1]

# print out the first 6 rows of predictors.
print(predictors[:6])

      Apr  Aug  Dec  Feb  Jan  Jul  ...  wdsp  mxpsd   max   min  prcp  fog
1325    0    0    0    0    0    0  ...  24.8   36.9  57.0  51.8  3.76    0
1084    0    0    0    1    0    0  ...  17.9   31.1  54.0  32.0  1.44    1
303     0    0    0    0    0    0  ...  11.4   20.0  64.0  37.0  0.01    0
1724    0    0    1    0    0    0  ...  15.0   26.0  57.0  28.0  0.47    0
532     0    0    0    0    0    1  ...  13.9   15.9  73.9  63.0  0.04    0
1527    0    0    0    0    0    0  ...   6.2    8.9  55.0  39.9  0.00    0

[6 rows x 30 columns]


In [48]:
# print out the shuffled data (first 5 rows)
shuffle[:5]

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,slp,visib,wdsp,mxpsd,max,min,prcp,fog,NUM_COLLISIONS
1325,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2016,54.1,43.8,1016.8,8.0,24.8,36.9,57.0,51.8,3.76,0,539
1084,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2016,40.8,38.4,1012.4,3.5,17.9,31.1,54.0,32.0,1.44,1,607
303,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,2013,57.8,55.7,1015.6,7.1,11.4,20.0,64.0,37.0,0.01,0,646
1724,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2017,51.3,48.7,1012.7,7.9,15.0,26.0,57.0,28.0,0.47,0,651
532,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,2014,68.6,62.8,1012.8,9.4,13.9,15.9,73.9,63.0,0.04,0,614


In [50]:
# Select all rows for the last column (NUM_COLLISIONS)
targets = shuffle.iloc[:,30]

# print out the first 6 rows of the targets data.
print(targets[:6])

1325    539
1084    607
303     646
1724    651
532     614
1527    651
Name: NUM_COLLISIONS, dtype: int64


In [51]:
# scale factor for the targets (1.0 means the targets are left unscaled)
SCALE_NUM_COLLISIONS = 1.0

This time the number of predictors is 30, because two weather factors (gust and sndp) were removed during cleaning.

In [53]:
# Split our data into a training set i.e. 80% of the length of the shuffle array
trainsize = int(len(shuffle['NUM_COLLISIONS'])*0.8)
# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testsize = len(shuffle['NUM_COLLISIONS']) - trainsize

# Define the number of input values (predictors)
nppredictors = 30
# Define the number of output values (targets)
noutputs = 1
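The shuffle above uses an unseeded random permutation, so each fresh run of the notebook produces a different split. As a hedged sketch of a reproducible variant, the snippet below fixes a seed and derives the number of predictors from the frame instead of hard-coding 30. The frame `demo`, the seed value 42 and the names `n_train`, `n_test` and `n_predictors` are assumptions made for this illustration; `demo` only mimics the shape of ***df_2*** (2427 rows, 30 predictors plus the target).

```python
import numpy as np
import pandas as pd

# Fixed seed (assumed value) so the shuffle is the same on every run.
rng = np.random.default_rng(42)

# Hypothetical stand-in for df_2: 2427 rows, 30 predictor columns + 1 target.
demo = pd.DataFrame(rng.random((2427, 31)))
demo = demo.rename(columns={30: "NUM_COLLISIONS"})

# Shuffle the rows, then split 80% / 20% by position.
shuffled = demo.iloc[rng.permutation(len(demo))]
n_train = int(len(shuffled) * 0.8)
n_test = len(shuffled) - n_train

# Every column except the target is a predictor.
n_predictors = shuffled.shape[1] - 1

train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
print(n_train, n_test, n_predictors)
```

With 2427 rows this gives 1941 training rows and 486 test rows, and the two parts never share a row.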

### **Training the DNN regression model**

In [57]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_house_regression_trained_model', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_house_regression_trained_model', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors.values)))

# Prints a log to show model is starting to train
print("starting to train")

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors[:trainsize].values, targets[:trainsize].values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors[trainsize:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate the RMSE, i.e. how well the model performs on the test set.
# Take the difference between the actual and the predicted values, square it,
# average the squares and then take the square root.
# Because the errors are squared, the RMSE puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((targets[trainsize:].values - predslistscale)**2))
print('DNNRegression has RMSE of {0}'.format(rmse))


# Calculate the mean number of collisions in the training set.
avg = np.mean(shuffle['NUM_COLLISIONS'][:trainsize])

# Baseline: predict the training-set mean for every test row and calculate the RMSE of that.
# The fit of a proposed regression model should be better than the fit of this mean model.
# The DNN result varies from run to run, so compare the two printed RMSE values.
rmse = np.sqrt(np.mean((shuffle['NUM_COLLISIONS'][trainsize:] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse))

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa96ba99a90>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_house_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow

The model was trained four times on the same train/test split. The fourth run gave the smallest RMSE (the smallest errors). In every run the RMSE of the DNN model is lower than the RMSE of the mean baseline, so the model performs better than simply predicting the average.

Each run gave the following values:

* First attempt: `DNNRegression has RMSE of 65.24945640459521`, `Just using average = 598.0499742400824 has RMSE of 86.07378007407621`
* Second attempt: `DNNRegression has RMSE of 65.46612998488212`, `Just using average = 598.0499742400824 has RMSE of 86.07378007407621`
* Third attempt: `DNNRegression has RMSE of 65.07330068754943`, `Just using average = 598.0499742400824 has RMSE of 86.07378007407621`
* Fourth attempt: `DNNRegression has RMSE of 64.91302882407783`, `Just using average = 598.0499742400824 has RMSE of 86.07378007407621`
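The RMSE and the mean-baseline comparison used in the training cell can be sketched on made-up numbers as follows. The helper `rmse` is a hypothetical name introduced here; it flattens both arrays first, which guards against NumPy silently broadcasting a column of predictions of shape `(n, 1)` against targets of shape `(n,)`. For simplicity the toy baseline uses the mean of the same values it is scored on, whereas the report uses the training-set mean. The last lines use only the two RMSE values reported for the fourth attempt.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error of two equally long sequences."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only.
actual = np.array([500.0, 600.0, 700.0])
column_preds = np.array([[510.0], [590.0], [690.0]])  # shape (3, 1)

model_rmse = rmse(actual, column_preds)

# Mean baseline: predict the same average for every row.
baseline_rmse = rmse(actual, np.full_like(actual, actual.mean()))

# Relative improvement of the best reported run over the reported baseline.
best_dnn, baseline = 64.91302882407783, 86.07378007407621
improvement = 1 - best_dnn / baseline

print(model_rmse, baseline_rmse)
print('Improvement over the mean baseline: {0:.1%}'.format(improvement))
```

On the reported figures, the best run reduces the RMSE by roughly a quarter compared with always predicting the average.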

### **Testing the DNN regression model**

#### **Checking seasonality**

In the first part of the report, seasonality was noted in the number of collisions throughout the year: it increases from January to the end of May (~ day 150 of the year), decreases from mid-year to mid-August (~ days 225-234 of the year), and then increases again until mid-December (~ day 350 of the year). At the end of the year (~ days 360-365), a significant drop in the number of collisions is visible.

Tests will be carried out below to check whether the number of traffic collisions predicted by the DNN regression model confirms this observation.

To help choose input values for the individual weather factors that do not differ significantly from the conditions typical of each part of the year, a sample of the real data is shown below.
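As an alternative to reading typical values off the raw rows, the weather ranges per month could be summarised directly. The sketch below is only an illustration on a made-up frame (`toy`) with two month columns and two weather columns; with the report's data the same idea would presumably be applied to the cleaned frame with all twelve one-hot month columns. The names `month_cols`, `month` and `summary` are assumptions.

```python
import pandas as pd

# Made-up rows: one-hot month columns plus two weather factors.
toy = pd.DataFrame({
    "Jan": [1, 1, 0, 0],
    "May": [0, 0, 1, 1],
    "temp": [31.4, 39.4, 48.3, 56.8],
    "prcp": [0.00, 0.47, 0.00, 0.01],
})

month_cols = ["Jan", "May"]

# Turn the one-hot columns back into a single month label per row.
month = toy[month_cols].idxmax(axis=1).rename("month")

# Minimum and maximum of each weather factor within each month.
summary = toy.groupby(month)[["temp", "prcp"]].agg(["min", "max"])
print(summary)
```

Test inputs chosen inside these per-month ranges stay close to conditions the model has actually seen.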

In [71]:
# print out the shuffled data (first 60 rows) - help for input values for tests
shuffle[:60]

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep,Fri,Mon,Sat,Sun,Thu,Tue,Wed,year,temp,dewp,slp,visib,wdsp,mxpsd,max,min,prcp,fog,NUM_COLLISIONS
1325,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2016,54.1,43.8,1016.8,8.0,24.8,36.9,57.0,51.8,3.76,0,539
1084,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2016,40.8,38.4,1012.4,3.5,17.9,31.1,54.0,32.0,1.44,1,607
303,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,2013,57.8,55.7,1015.6,7.1,11.4,20.0,64.0,37.0,0.01,0,646
1724,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2017,51.3,48.7,1012.7,7.9,15.0,26.0,57.0,28.0,0.47,0,651
532,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,2014,68.6,62.8,1012.8,9.4,13.9,15.9,73.9,63.0,0.04,0,614
1527,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,2017,48.3,34.9,1014.8,10.0,6.2,8.9,55.0,39.9,0.0,0,651
846,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,2015,53.4,50.1,1019.5,6.6,13.7,19.0,61.0,51.1,0.08,0,642
1877,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2018,50.5,49.6,1016.9,6.4,5.5,9.9,57.0,48.0,0.0,1,689
624,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,2014,65.2,60.4,1012.5,9.4,12.3,31.1,70.0,60.1,0.07,1,639
1266,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2016,70.6,67.2,1023.2,9.9,8.6,14.0,80.1,64.0,0.0,0,618


In [74]:
#"Apr", "Aug", "Dec", "Feb", "Jan", "Jul", "Jun", "Mar", "May", "Nov", "Oct", "Sep", "Fri", "Mon", "Sat", "Sun", "Thu", "Tue", "Wed", "year", "temp", "dewp", "slp", "visib", "wdsp", "mxpsd", "max", "min", "prcp", "fog"
input = pd.DataFrame.from_dict(data = 
				{
         'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [1,1,1],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,1],
         'year' : [2019, 2015, 2013],
         'temp' : [44, 45.1, 37.6],
         'dewp' : [18.8, 34.3, 45.2],
         'slp' : [1026.6, 1011.5, 1018.3],
         'visib' : [8, 9, 10],
         'wdsp' : [9.2, 10.2, 13.7],
         'mxpsd' : [18, 24.2, 16.9],
         'max' : [40.1, 48.2, 54.9],
         'min' : [29.8, 35.2, 32.9],
         'prcp' : [0, 0.22, 0.39],
         'fog' : [0, 1, 0]
        })

# Re-create the estimator with the same architecture; the trained weights are restored from model_dir
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_house_regression_trained_model', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

# predict the number of collisions for the three test rows
preds = estimator.predict(x=input.values)

# print the predicted values
predslistnorm = preds['scores']
print(predslistnorm)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa96bda2190>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_house_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_house_regression_trained_model/mo
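Typing all 30 one-hot and weather values by hand for every test is error-prone, for example by setting two months to 1 in the same row. As one possible alternative, the sketch below builds each test row from a month name, a day name and a dictionary of weather values. The helper `make_row` and the constants `MONTHS`, `DAYS` and `WEATHER` are hypothetical names introduced for this illustration; the column order follows the comment at the top of the test cell, and the two example rows reuse the first two December inputs from that cell.

```python
import pandas as pd

MONTHS = ["Apr", "Aug", "Dec", "Feb", "Jan", "Jul",
          "Jun", "Mar", "May", "Nov", "Oct", "Sep"]
DAYS = ["Fri", "Mon", "Sat", "Sun", "Thu", "Tue", "Wed"]
WEATHER = ["year", "temp", "dewp", "slp", "visib", "wdsp",
           "mxpsd", "max", "min", "prcp", "fog"]

def make_row(month, day, weather):
    """Build one test row with exactly one month and one day set to 1."""
    if month not in MONTHS or day not in DAYS:
        raise ValueError("unknown month or day")
    missing = [c for c in WEATHER if c not in weather]
    if missing:
        raise ValueError("missing weather values: {0}".format(missing))
    row = {m: int(m == month) for m in MONTHS}
    row.update({d: int(d == day) for d in DAYS})
    row.update({c: weather[c] for c in WEATHER})
    return row

rows = [
    make_row("Dec", "Sun", {"year": 2019, "temp": 44, "dewp": 18.8,
                            "slp": 1026.6, "visib": 8, "wdsp": 9.2,
                            "mxpsd": 18, "max": 40.1, "min": 29.8,
                            "prcp": 0, "fog": 0}),
    make_row("Dec", "Mon", {"year": 2015, "temp": 45.1, "dewp": 34.3,
                            "slp": 1011.5, "visib": 9, "wdsp": 10.2,
                            "mxpsd": 24.2, "max": 48.2, "min": 35.2,
                            "prcp": 0.22, "fog": 1}),
]

# Fix the column order so it matches the order used for training.
test_inputs = pd.DataFrame(rows, columns=MONTHS + DAYS + WEATHER)
print(test_inputs.shape)
```

The resulting frame could then be passed to the estimator in the same way as the hand-written one, via its `.values` attribute.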

The test was performed twelve times in total: three times for each of the following months:

* January
* May
* August
* December

Different values for weather factors, days and years were used. The obtained results are presented below:




> **For January**:

*1st attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,1,1],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [1,0,0],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,1,0],
         'year' : [2013, 2014, 2015],
         'temp' : [42.7, 31.4, 39.4],
         'dewp' : [41.5, 24.2, 34.6],
         'slp' : [1022.6, 1029.7, 1014.2],
         'visib' : [7.1, 10.0, 8.6],
         'wdsp' : [7.9, 6.6, 8.6],
         'mxpsd' : [13.0, 14.0, 14.0],
         'max' : [46.4, 42.1, 41.0],
         'min' : [39.2, 21.9, 37.0],
         'prcp' : [0.17, 0.00, 0.47],
         'fog' : [0, 0, 1]
         `


*1st attempt - Outputs:*
* `553.6691`
* `562.4535`
* `606.4983`


*2nd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,1,1],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,0,0],
         'Sat' : [0,0,1],
         'Sun' : [0,0,0],
         'Thu' : [0,1,0],
         'Tue' : [1,0,0],
         'Wed' : [0,0,0],
         'year' : [2016, 2017, 2018],
         'temp' : [33.8, 35.0, 32.1],
         'dewp' : [21.0, 34.2, 27.5],
         'slp' : [1019.2, 1011.0, 1016.5],
         'visib' : [8.3, 7.9, 10.1],
         'wdsp' : [6.8, 9.2, 7.7],
         'mxpsd' : [14.0, 13.0, 13.5],
         'max' : [40.9, 44.5, 42.0],
         'min' : [22.7, 35.4, 25.9],
         'prcp' : [0.00, 0.38, 0.22],
         'fog' : [1, 0, 1]
         `

*2nd attempt - Outputs:*
* `572.71326`
* `580.30835`
* `510.4562`


*3rd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [1,1,1],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,1],
         'year' : [2019, 2015, 2013],
         'temp' : [36.6, 31.2, 38.7],
         'dewp' : [29.5, 28.6, 26.0],
         'slp' : [1012.3, 1019.1, 1015.4],
         'visib' : [9.4, 8.1, 9.5],
         'wdsp' : [7.0, 7.6, 8.2],
         'mxpsd' : [12.9, 14.1, 13.9],
         'max' : [41.2, 40.8, 41.3],
         'min' : [32.3, 26.4, 29.8],
         'prcp' : [0.11, 0.08, 0.19],
         'fog' : [0, 1, 0]
         `

*3rd attempt - Outputs:*
* `450.92697`
* `551.10675`
* `568.18286`

> **For May**:

*1st attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [1,1,1],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [1,0,0],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,1,0],
         'year' : [2013, 2014, 2015],
         'temp' : [48.3, 50.5, 56.8],
         'dewp' : [34.9, 49.6, 49.5],
         'slp' : [1014.8, 1016.9, 1014.3],
         'visib' : [10, 6.4, 7],
         'wdsp' : [6.2, 5.5, 10],
         'mxpsd' : [8.9, 9.9, 15],
         'max' : [55, 57, 64.9],
         'min' : [39.9, 48, 45],
         'prcp' : [0, 0, 0.01],
         'fog' : [0, 1, 0]
         `


*1st attempt - Outputs:*
* `615.00916`
* `632.8248`
* `684.0004`


*2nd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [1,1,1],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,0,0],
         'Sat' : [0,0,1],
         'Sun' : [0,0,0],
         'Thu' : [0,1,0],
         'Tue' : [1,0,0],
         'Wed' : [0,0,0],
         'year' : [2016, 2017, 2018],
         'temp' : [51.1, 54.0, 53.3],
         'dewp' : [45.2, 41.3, 47.2],
         'slp' : [1015.0, 1014.9, 1016.1],
         'visib' : [8.6, 7.8, 9.2],
         'wdsp' : [5.8, 7.9, 6.6],
         'mxpsd' : [12, 10.2, 13.2],
         'max' : [61.8, 58.2, 60.1],
         'min' : [41.3, 47.2, 46.3],
         'prcp' : [0.01, 0.01, 0],
         'fog' : [1, 0, 0]
         `
         
*2nd attempt - Outputs:*
* `642.1402`
* `652.09546`
* `577.7129`


*3rd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [1,1,1],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,1],
         'year' : [2019, 2015, 2013],
         'temp' : [49.7, 52.2, 54.8],
         'dewp' : [35.8, 48.9, 45.4],
         'slp' : [1016.1, 1015.5, 1014.7],
         'visib' : [6.9, 8.2, 7.3],
         'wdsp' : [9.8, 6.8, 7.4],
         'mxpsd' : [9.2, 12.4, 11.6],
         'max' : [57.2, 62.1, 59.9],
         'min' : [40.1, 45.8, 42.6],
         'prcp' : [0, 0.01, 0],
         'fog' : [0, 1, 0]
         `

*3rd attempt - Outputs:*
* `515.43024`
* `624.81854`
* `639.746`

> **For August**:

*1st attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [1,1,1],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [1,0,0],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,1,0],
         'year' : [2013, 2014, 2015],
         'temp' : [70.6, 67.5, 68.3],
         'dewp' : [67.2, 65.3, 58.5],
         'slp' : [1023.2, 1017.6, 1018.7],
         'visib' : [9.9, 3.8, 10],
         'wdsp' : [8.6, 7.1, 12.3],
         'mxpsd' : [14, 11.1, 17.1],
         'max' : [80.1, 77, 75.9],
         'min' : [64, 62.1, 63],
         'prcp' : [0, 0, 0],
         'fog' : [0, 1, 0]
         `


*1st attempt - Outputs:*
* `592.21747`
* `608.9144`
* `646.9852`


*2nd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [1,1,1],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,0,0],
         'Sat' : [0,0,1],
         'Sun' : [0,0,0],
         'Thu' : [0,1,0],
         'Tue' : [1,0,0],
         'Wed' : [0,0,0],
         'year' : [2016, 2017, 2018],
         'temp' : [68.3, 69.8, 70.1],
         'dewp' : [61.1, 64.8, 62.7],
         'slp' : [1020.4, 1020.1, 1019.9],
         'visib' : [6, 9, 5.5],
         'wdsp' : [10.8, 8.4, 11.2],
         'mxpsd' : [12, 16, 13.5],
         'max' : [78.3, 77.8, 76.5],
         'min' : [62.5, 63.9, 63],
         'prcp' : [0, 0, 0],
         'fog' : [0, 0, 1]
         `
         
*2nd attempt - Outputs:*
* `613.399`
* `621.2207`
* `559.2971`


*3rd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [1,1,1],
         'Dec' : [0,0,0],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,1],
         'year' : [2019, 2015, 2013],
         'temp' : [69.2, 68.4, 69.9],
         'dewp' : [59.1, 67, 61.5],
         'slp' : [1022, 1021, 1018.9],
         'visib' : [9.5, 5.2, 8.1],
         'wdsp' : [7.9, 9, 11.5],
         'mxpsd' : [13, 15, 16.6],
         'max' : [79.6, 76.4, 78],
         'min' : [63.1, 62.9, 64],
         'prcp' : [0, 0, 0],
         'fog' : [1, 0, 0]
         `

*3rd attempt - Outputs:*
* `499.8673`
* `591.60535`
* `611.2499`



> **For December**:

*1st attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [1,1,1],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,1],
         'Mon' : [1,0,0],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,1,0],
         'year' : [2013, 2014, 2015],
         'temp' : [51.3, 31.8, 48.6],
         'dewp' : [48.7, 12.5, 45.7],
         'slp' : [1012.7, 1027.5, 1009.7],
         'visib' : [7.9, 10, 8],
         'wdsp' : [15, 11.1, 8.8],
         'mxpsd' : [26, 15.9, 15],
         'max' : [57, 39.9, 52],
         'min' : [28, 27, 41],
         'prcp' : [0.47, 0, 0.1],
         'fog' : [0, 0, 1]
         `


*1st attempt - Outputs:*
* `624.53766`
* `608.4956`
* `661.6003`


*2nd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [1,1,1],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,0,0],
         'Sat' : [0,0,1],
         'Sun' : [0,0,0],
         'Thu' : [0,1,0],
         'Tue' : [1,0,0],
         'Wed' : [0,0,0],
         'year' : [2016, 2017, 2018],
         'temp' : [42.6, 38.2, 45.3],
         'dewp' : [44.4, 28.6, 30.4],
         'slp' : [1015.3, 1022.9, 1021.1],
         'visib' : [9, 8, 9],
         'wdsp' : [12, 13, 10.4],
         'mxpsd' : [17.2, 19.9, 16.4],
         'max' : [44.2, 46.8, 50.1],
         'min' : [32.8, 38.7, 40],
         'prcp' : [0.21, 0.2, 0.32],
         'fog' : [1, 0, 0]
         `
         
*2nd attempt - Outputs:*
* `624.9684`
* `629.9253`
* `561.87256`


*3rd attempt - Inputs:*

`'Apr' : [0,0,0],
         'Aug' : [0,0,0],
         'Dec' : [1,1,1],
         'Feb' : [0,0,0],
         'Jan' : [0,0,0],
         'Jul' : [0,0,0],
         'Jun' : [0,0,0],
         'Mar' : [0,0,0],
         'May' : [0,0,0],
         'Nov' : [0,0,0],
         'Oct' : [0,0,0],
         'Sep' : [0,0,0],
         'Fri' : [0,0,0],
         'Mon' : [0,1,0],
         'Sat' : [0,0,0],
         'Sun' : [1,0,0],
         'Thu' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,1],
         'year' : [2019, 2015, 2013],
         'temp' : [44, 45.1, 37.6],
         'dewp' : [18.8, 34.3, 45.2],
         'slp' : [1026.6, 1011.5, 1018.3],
         'visib' : [8, 9, 10],
         'wdsp' : [9.2, 10.2, 13.7],
         'mxpsd' : [18, 24.2, 16.9],
         'max' : [40.1, 48.2, 54.9],
         'min' : [29.8, 35.2, 32.9],
         'prcp' : [0, 0.22, 0.39],
         'fog' : [0, 1, 0]
         `

*3rd attempt - Outputs:*
* `492.23422`
* `609.13715`
* `620.2196`

The average of the nine predictions per month (three attempts with three input rows each) returned by the DNN regression model is 550.7017 for January, 620.4197 for May, 593.8618 for August and 603.6656 for December.
Rounded to whole numbers, the average predicted number of vehicle collisions is as follows:

* 551 for January
* 620 for May
* 594 for August
* 604 for December

For the four months tested, this is consistent with the seasonality noted in the first part of the report: the predicted number of collisions increases from January to May, then declines until August and then increases again until December.
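The monthly averages quoted above can be recomputed from the listed outputs. The sketch below copies the nine predictions per month from the results and also prints the smallest and largest prediction, to show how much individual predictions vary around each average. The dictionary names `outputs` and `averages` are introduced here for illustration.

```python
import numpy as np

# Predictions copied from the outputs listed in this section (3 attempts x 3 rows).
outputs = {
    "January": [553.6691, 562.4535, 606.4983, 572.71326, 580.30835,
                510.4562, 450.92697, 551.10675, 568.18286],
    "May": [615.00916, 632.8248, 684.0004, 642.1402, 652.09546,
            577.7129, 515.43024, 624.81854, 639.746],
    "August": [592.21747, 608.9144, 646.9852, 613.399, 621.2207,
               559.2971, 499.8673, 591.60535, 611.2499],
    "December": [624.53766, 608.4956, 661.6003, 624.9684, 629.9253,
                 561.87256, 492.23422, 609.13715, 620.2196],
}

averages = {month: float(np.mean(values)) for month, values in outputs.items()}

for month, values in outputs.items():
    print('{0}: average {1:.4f} (min {2:.1f}, max {3:.1f})'.format(
        month, averages[month], min(values), max(values)))
```

The spread within each month is larger than the differences between the monthly averages, so the averages are best read as an indication of the trend rather than a precise estimate.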

# **CONCLUSIONS**

The number of collisions predicted by the **Linear Regression Model**:

* confirms that the number of collisions increases from Monday to Friday,
* confirms that the number of collisions increases as the temperature increases,
* contradicts the expected increase in the number of collisions with an increasing dew point: as the dew point rises, the predicted number of collisions decreases.

The number of collisions predicted by the **Deep Neural Network Regression Model** confirms the seasonality: the predicted number of collisions increases from January to May, then decreases until August and then increases again until December.

# **REFERENCES**

Python Software Foundation. Python Language Reference, [https://www.python.org](https://www.python.org)

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. [https://www.R-project.org/](https://www.R-project.org/). 

The pandas development team (2020). pandas-dev/pandas: Pandas. Zenodo. [https://doi.org/10.5281/zenodo.3509134](https://doi.org/10.5281/zenodo.3509134)

McKinney, Wes (2010). [Data structures for statistical computing in python](https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf). Proceedings of the 9th Python in Science Conference, Volume 445.

[NumPy — NumPy](https://numpy.org/). numpy.org. NumPy developers.

McKinney, Wes (2017). Python for Data Analysis : Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). Sebastopol: O'Reilly. ISBN 978-1-4919-5766-0.

VanderPlas, Jake (2016). "Introduction to NumPy". Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly. pp. 33–96. ISBN 978-1-4919-1205-8.

Kingma, Diederik P. & Ba, Jimmy (2015). [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980).

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo,
Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Rafal Jozefowicz, Yangqing Jia,
Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Mike Schuster,
Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Jonathon Shlens,
Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker,
Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas,
Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke,
Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: Large-scale machine learning on heterogeneous systems,
2015. Software available from [tensorflow.org](https://www.tensorflow.org/).

shutil — High-level file operations,  [https://docs.python.org/3/library/shutil.html](https://docs.python.org/3/library/shutil.html), Source code: [Lib/shutil.py](https://github.com/python/cpython/blob/3.9/Lib/shutil.py)

Chai, Tianfeng & Draxler, R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? Geosci. Model Dev. 7. 10.5194/gmdd-7-1525-2014.