# **Introduction**

In this report I generate linear regression and deep neural network (DNN) regressor models to help predict daily traffic collision numbers and to assist in planning emergency response and staffing for New York City.

# **Methodology**

Using datasets previously prepared with data analysis techniques, I will train, test and analyse linear and deep neural network regression models to determine which model, and which choice of data, offers the best efficacy against a set of test data held out from the original datasets.

The results will be used to identify which model or models could be used to predict the number of traffic collisions per day, allowing New York City's emergency response and staffing to be optimised.
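The comparison described above can be sketched in a few lines. This is a minimal illustration, not the notebook's own code: the helper names `rmse` and `mean_baseline_rmse` and the sample counts are hypothetical, and the baseline is assumed to be "predict the training mean for every held-out row".

```python
import numpy as np


def rmse(targets, predictions):
    """Root mean squared error between two equal-length 1-D sequences."""
    targets = np.asarray(targets, dtype=float).ravel()
    predictions = np.asarray(predictions, dtype=float).ravel()
    if targets.shape != predictions.shape:
        raise ValueError("targets and predictions must have the same length")
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))


def mean_baseline_rmse(train_targets, test_targets):
    """RMSE obtained by predicting the training mean for every test row."""
    train_mean = float(np.mean(train_targets))
    return rmse(test_targets, np.full(len(test_targets), train_mean))


# Hypothetical daily collision counts, for illustration only
train_counts = [500, 600, 700]
test_counts = [550, 650]
model_predictions = [560, 640]  # stand-in for a model's output on the test rows

print(rmse(test_counts, model_predictions))
print(mean_baseline_rmse(train_counts, test_counts))
```

A model is only useful for planning if its RMSE on the held-out rows is clearly lower than the baseline figure.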


# **Modelling and Results**

# **1. Mean and Minimum Temperature**

The first models are built from a large collated dataset of weather and traffic collision data. Mean and minimum temperature are the predictors for these models.

In [1]:
# Import pandas to allow creation of data frames
import pandas as pd

# Create data frame from the CSV file hosted on GitHub
source_dataframe = pd.read_csv('https://raw.githubusercontent.com/15014370uhi/15014370_DataAnalytics/master/bq-collated_collision_numbers_per_day_nyc.csv', index_col=0)

In [2]:
# Check that correct data is present
print(source_dataframe) 

     year  mo  da collision_date  temp  ...   min   prcp   sndp  fog  NUM_COLLISIONS
day                                     ...                                         
7    2012   7   1     2012-07-01  83.6  ...  66.0   0.00  999.9    0             538
1    2012   7   2     2012-07-02  80.3  ...  66.9   0.00  999.9    0             564
2    2012   7   3     2012-07-03  79.8  ...  63.0   0.00  999.9    0             664
3    2012   7   4     2012-07-04  81.8  ...  68.0   0.06  999.9    0             432
4    2012   7   5     2012-07-05  86.7  ...  70.0  99.99  999.9    0             591
..    ...  ..  ..            ...   ...  ...   ...    ...    ...  ...             ...
5    2020  11  13     2020-11-13  53.5  ...  52.0   1.28  999.9    1             362
6    2020  11  14     2020-11-14  50.5  ...  46.9   0.02  999.9    0             262
7    2020  11  15     2020-11-15  46.4  ...  32.0   0.00  999.9    0             221
1    2020  11  16     2020-11-16  54.9  ...  50.0   0.24  999.9  

In [3]:
# Find the total length of data rows
totalNumberOfRows = len(source_dataframe);
print(totalNumberOfRows);

3121


In [4]:
# Import NumPy for fast array maths
import numpy as np

In [5]:
# Shuffle all source data
source_dataframe = source_dataframe.iloc[np.random.permutation(len(source_dataframe))]; 

In [6]:
# Number of rows to reserve for testing
number_of_test_rows = 50

In [7]:
# Store training data as (all rows - number_of_test_rows) of source data
df_train = source_dataframe[:-number_of_test_rows];

# Shuffle training data a second time
df_train = df_train.iloc[np.random.permutation(len(df_train))]; 

# Store validation test data as last number_of_test_rows of rows of source data
df_test = source_dataframe[-number_of_test_rows:];

# Shuffle test data a second time
df_test = df_test.iloc[np.random.permutation(len(df_test))];

# Keep only the columns used by this model: temp (4), min (12) and NUM_COLLISIONS (16)
df_train = df_train.iloc[:, [4,12,16]]; 
df_test = df_test.iloc[:, [4,12,16]];
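Because the shuffle above is unseeded, every run produces a different train/test split, which is why results vary from run to run. A minimal sketch of a repeatable split follows. The function name `shuffle_and_split`, the seed value and the small synthetic frame are assumptions for illustration, and `n_test` must be greater than zero.

```python
import numpy as np
import pandas as pd


def shuffle_and_split(frame, n_test, seed):
    """Shuffle rows with a seeded generator, then hold out the last n_test rows."""
    rng = np.random.default_rng(seed)
    shuffled = frame.iloc[rng.permutation(len(frame))]
    return shuffled.iloc[:-n_test], shuffled.iloc[-n_test:]


# Small synthetic frame standing in for the collision dataset
demo = pd.DataFrame({
    "temp": np.arange(10, dtype=float),
    "min": np.arange(10, dtype=float) - 5.0,
    "NUM_COLLISIONS": np.arange(500, 510),
})

train_a, test_a = shuffle_and_split(demo, n_test=3, seed=42)
train_b, test_b = shuffle_and_split(demo, n_test=3, seed=42)

print(len(train_a), len(test_a))
print(test_a.equals(test_b))  # same seed, same held-out rows
```

Once the source rows have been shuffled, shuffling each part again changes only the row order, not which rows land in the training or test set.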


In [8]:
# Confirm training data is stored
print(df_train)

     temp   min  NUM_COLLISIONS
day                            
1    69.9  59.0             533
6    61.4  57.0             500
7    54.7  46.0             303
6    35.1  19.9             523
5    61.7  57.0             749
..    ...   ...             ...
5    42.7  34.0             676
5    63.2  55.0             738
1    47.6  43.0             716
4    70.2  66.9             695
4    65.7  62.6             572

[3071 rows x 3 columns]


In [9]:
# Confirm test data rows are stored 
print(df_test);

     temp   min  NUM_COLLISIONS
day                            
6    59.9  52.0             558
2    51.1  39.2             565
3    49.3  39.9             679
3    46.3  41.0             636
4    74.1  66.0             356
5    69.8  66.0             296
4    63.6  57.9             249
7    65.2  55.9             451
2    54.8  46.9             741
7    43.2  34.0             446
3    69.2  64.0             660
2    53.8  50.0             570
1    29.0  26.6             553
6    62.3  53.1             603
5    70.1  48.9             760
4    56.8  42.8             630
7    69.0  62.1             441
5    67.2  61.0             604
5    69.1  66.9             652
2    63.0  45.0             620
4    36.5  33.1             689
5    64.9  59.0             684
5    53.0  48.9             617
1    49.5  37.0             687
3    32.2  28.0             531
7    56.3  46.9             544
1    67.7  61.0             602
7    59.6  51.1             505
3    44.7  41.0             706
2    69.

In [10]:
# Select Predictor columns for training and testing data to be used to predict the outcome

predictors_train = df_train.iloc[:, [0,1]];
predictors_test = df_test.iloc[:, [0,1]];

In [11]:
# confirm predictor holds correct data
print(predictors_train)

     temp   min
day            
1    69.9  59.0
6    61.4  57.0
7    54.7  46.0
6    35.1  19.9
5    61.7  57.0
..    ...   ...
5    42.7  34.0
5    63.2  55.0
1    47.6  43.0
4    70.2  66.9
4    65.7  62.6

[3071 rows x 2 columns]


In [12]:
# confirm predictor holds correct data
print(predictors_test);

     temp   min
day            
6    59.9  52.0
2    51.1  39.2
3    49.3  39.9
3    46.3  41.0
4    74.1  66.0
5    69.8  66.0
4    63.6  57.9
7    65.2  55.9
2    54.8  46.9
7    43.2  34.0
3    69.2  64.0
2    53.8  50.0
1    29.0  26.6
6    62.3  53.1
5    70.1  48.9
4    56.8  42.8
7    69.0  62.1
5    67.2  61.0
5    69.1  66.9
2    63.0  45.0
4    36.5  33.1
5    64.9  59.0
5    53.0  48.9
1    49.5  37.0
3    32.2  28.0
7    56.3  46.9
1    67.7  61.0
7    59.6  51.1
3    44.7  41.0
2    69.2  66.0
7    70.6  66.9
6    62.4  59.0
2    67.2  50.0
6    42.8  32.0
5    53.3  50.0
2    56.1  51.1
6    66.9  61.0
3    68.5  55.0
7    39.9  37.0
3    39.8  30.9
4    45.7  28.9
7    48.3  32.0
5    31.7  28.9
1    60.5  57.0
5    37.6  30.9
1    53.4  50.0
4    62.5  53.1
4    68.4  61.0
5    38.4  33.1
6    38.1  23.0


In [13]:
# Select target column
targets_train = df_train.iloc[:,2];

# Select target column
targets_test = df_test.iloc[:,2];

In [14]:
print(targets_train);

day
1    533
6    500
7    303
6    523
5    749
    ... 
5    676
5    738
1    716
4    695
4    572
Name: NUM_COLLISIONS, Length: 3071, dtype: int64


In [15]:
print(targets_test);

day
6    558
2    565
3    679
3    636
4    356
5    296
4    249
7    451
2    741
7    446
3    660
2    570
1    553
6    603
5    760
4    630
7    441
5    604
5    652
2    620
4    689
5    684
5    617
1    687
3    531
7    544
1    602
7    505
3    706
2    654
7    574
6    304
2    727
6    555
5    723
2    627
6    534
3    651
7    500
3    520
4    635
7    457
5    600
1    490
5    753
1    408
4    616
4    306
5    743
6    493
Name: NUM_COLLISIONS, dtype: int64


In [16]:
# Set scale value
SCALE_NUM_COLLISIONS = 1000.0;

In [17]:
# Get size of training set 
trainsize = int(len(df_train['NUM_COLLISIONS']));

# Get size of test set 
testsize = int(len(df_test['NUM_COLLISIONS']));

In [18]:
print(trainsize);

3071


In [19]:
print(testsize);

50


In [20]:

# Define the number of predictor column input values
nppredictors = len(predictors_train.columns);

# Define the number of target column output values
noutputs = 1;


In [21]:
print(nppredictors)

2


# **1.1 Linear Regression Model**

In [22]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_temp_min', ignore_errors=True)
   
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_temp_min', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Prints log to show start of training model
print("starting to train...\n");

# Train the model using predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Check predictions based on predictor values
preds = estimator.predict(x=predictors_train.values)

# Apply Scale value to outputs
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE using predictions and targets
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nLinearRegression has RMSE of {0}'.format(rmse));

# Store linear regressor value
rmse_LR = getattr(rmse, "tolist", lambda: rmse)()

# Calculate mean value of Number of Collisions
avg = np.mean(df_train['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using Number of Collision Values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2));
print('Just using an average = {0}, has RMSE of {1}'.format(avg, rmse)); 

# store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

if(rmse_LR < rmse_avg): # If linear regression RMSE is lower than the mean-baseline RMSE
  print('\nGreat! Your Linear Regression model performs better than finding average!');
else: 
  print('\nSorry! On this run, your model performs worse than just finding the average!');


TensorFlow 1.x selected.
1.15.2
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please access pandas data directly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please convert numpy dtypes explicitly.
Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimat
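The RMSE in the training cell above subtracts the predictions from the targets element-wise, which only works as intended when both arrays have the same shape. The sketch below shows what goes wrong otherwise. It assumes, purely for illustration, that predictions arrive as a column of shape `(n, 1)`; the notebook output does not confirm the actual shape returned by the estimator, and the values are made up.

```python
import numpy as np

targets = np.array([500.0, 600.0, 700.0])             # shape (3,)
preds_column = np.array([[510.0], [590.0], [700.0]])  # shape (3, 1), hypothetical

# Broadcasting silently produces a 3x3 matrix of pairwise differences
print((targets - preds_column).shape)

# Flattening both sides gives the intended element-wise differences
errors = targets.ravel() - preds_column.ravel()
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(errors.shape, rmse)
```

Printing the shapes of both arrays once before computing the RMSE is a cheap way to rule this out.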

**1.1.1 Linear Regression Validation Test**

Perform linear regression validation test using values from the original data set reserved for testing.


In [23]:

# Perform linear regression validation test using values from the original data set reserved for testing.                                                                                                 
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_temp_min', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

preds = estimator.predict(x=predictors_test.values)  # Use test data values 
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS  # Adjust for scale
pred = format(str(predslistscale))
print("\n--- Predicted number of collisions ---\n", pred)
print("\n-- ROW -- Number Collisions ----\n", df_test['NUM_COLLISIONS'].values)



INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff2041be240>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_temp_min', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model

Overall, the linear regression model produced reasonable values, with roughly the same accuracy as predicting the mean, depending on how the data was shuffled. The linear model does not appear to capture some of the outliers at the maximum and minimum ends of the range of values.
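Reading two printed arrays side by side makes it hard to spot which rows are badly predicted. A minimal sketch of a comparison table follows; the helper name `comparison_table` and the sample values are hypothetical and are not results from the notebook.

```python
import numpy as np
import pandas as pd


def comparison_table(targets, predictions):
    """Return a frame of actual vs predicted counts with the signed error."""
    table = pd.DataFrame({
        "actual": np.asarray(targets, dtype=float).ravel(),
        "predicted": np.asarray(predictions, dtype=float).ravel(),
    })
    table["error"] = table["predicted"] - table["actual"]
    return table


# Hypothetical values for illustration; not results from the notebook
demo = comparison_table([300, 600, 750], [520.0, 590.0, 610.0])
print(demo)

# Row with the largest absolute error
print(demo.loc[demo["error"].abs().idxmax()])
```

In the notebook, `targets_test.values` and the rescaled predictions could be passed to such a helper in place of the sample lists.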

# **1.2 Deep Neural Network Regressor**



In [24]:
# Import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# Check the version
print(tf.__version__)

# Required for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# Remove any model saved by a previous training run
shutil.rmtree('/tmp/DNN_collision_regression_trained_model_temp_min', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_temp_min', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Print message to display start of training log
print("starting to train");

# Train the model by passing predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_train.values)

# Rescale the outputs back to collision counts (targets were divided by SCALE_NUM_COLLISIONS for training).
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE value to determine how well the model works using prediction and target values.
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nDNNRegression has RMSE of {0}'.format(rmse));

# Store DNN regression RMSE value
rmse_DNN = getattr(rmse, "tolist", lambda: rmse)()

# Calculate the mean of the Number of Collision Values.
avg = np.mean(df_train['NUM_COLLISIONS'])

# Calculate RMSE using COLLISION Values and the mean of all target values to determine
# if the DNN model is better than calculating the mean value.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

# Output success or failure message for this model
if(rmse_DNN < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your DNN Regression model performs better than finding the average!'); # Success
else: 
  print('\nSorry! But on this run, your DNN Regression model performs worse than just finding the average!'); # Failure


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1a9a6a978>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_temp_min', '_session_creation_timeout_secs': 7200}
starting to train
Instructions for updating:
Please use `layer.__call__` method instead.
INFO:tensorf

**1.2.1 Deep Neural Network Validation Test**

In [25]:
# Perform validation assessment of DNN model using test values reserved from original dataset
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_temp_min', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

# Use test values reserved from original data set
preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = format(str(predslistscale))
print("\n==== Predicted Number of Collisions ====", pred)
print("\n====Target Collision Values====", targets_test.values)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff2046cdef0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_temp_min', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_collision_regression

In most runs of the DNN validation test, many of the predicted values approximated the target values, although the DNN model failed to capture many of the very large and very small target values.
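One way to quantify how a model behaves at the extremes is to compare its mean signed error for low, middle and high target values. The sketch below is an illustration only: the function name `mean_error_by_range`, the quartile cut-offs and the sample data are assumptions, not part of the notebook.

```python
import numpy as np


def mean_error_by_range(targets, predictions, low_q=0.25, high_q=0.75):
    """Mean signed error (predicted - actual) for low, middle and high targets."""
    targets = np.asarray(targets, dtype=float).ravel()
    errors = np.asarray(predictions, dtype=float).ravel() - targets
    low_cut, high_cut = np.quantile(targets, [low_q, high_q])
    groups = {
        "low": targets <= low_cut,
        "middle": (targets > low_cut) & (targets < high_cut),
        "high": targets >= high_cut,
    }
    # Skip any group that contains no rows
    return {name: float(errors[mask].mean()) for name, mask in groups.items() if mask.any()}


# Hypothetical targets and predictions, for illustration only
result = mean_error_by_range(
    [200, 400, 500, 600, 800],
    [450.0, 480.0, 500.0, 520.0, 560.0],
)
print(result)
```

A positive mean error in the low group together with a negative one in the high group would indicate predictions being pulled towards the middle of the range.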

# **1.3 Summary**

The linear model performed worse than simply predicting the mean number of collisions on about half the runs, whereas the DNN model beat the mean on the majority of runs.
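Statements such as "about half the runs" can be made measurable by repeating the split over a fixed set of seeds and counting the wins. The sketch below does this with an ordinary least-squares fit from `np.linalg.lstsq` as a stand-in for the TensorFlow estimators, and it scores on the held-out rows rather than the training rows. The function name `count_wins`, the seeds and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def rmse(targets, predictions):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(targets) - np.asarray(predictions)) ** 2)))


def count_wins(features, targets, n_test, seeds):
    """Count the seeds on which a least-squares fit beats the training-mean baseline."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    design = np.column_stack([features, np.ones(len(features))])  # add intercept column
    wins = 0
    for seed in seeds:
        order = np.random.default_rng(seed).permutation(len(targets))
        train, test = order[:-n_test], order[-n_test:]
        coef, *_ = np.linalg.lstsq(design[train], targets[train], rcond=None)
        model_rmse = rmse(targets[test], design[test] @ coef)
        baseline_rmse = rmse(targets[test], np.full(n_test, targets[train].mean()))
        wins += model_rmse < baseline_rmse
    return int(wins)


# Synthetic data with a strong linear signal, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(20, 90, size=(200, 2))
y = 400 + 3.0 * x[:, 0] + rng.normal(0, 5, size=200)
wins = count_wins(x, y, n_test=50, seeds=range(10))
print(wins, "wins out of 10 seeds")
```

Reporting the win count alongside the seeds used lets a reader reproduce the comparison exactly.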

# **2. Day, Month and Minimum Temperature**

I will now investigate whether adding day and month data to the temperature factors changes the performance of the models. I will also use a different dataset with enhanced cleansing and normalisation of the weather data.

In [26]:
# Import pandas to allow creation of data frames
import pandas as pd

# Create data frame from the CSV file hosted on GitHub
source_dataframe = pd.read_csv('https://raw.githubusercontent.com/15014370uhi/15014370_DataAnalytics/master/bq-cleansed_weather_reduced_assignment_1.csv', index_col=0)

In [27]:
# Check that correct data is present
print(source_dataframe) 

      mo  da  day  temp  dewp  ...  mxpsd   max   min  prcp  NUM_COLLISIONS
year                           ...                                         
2012   7   1    7  83.6  63.0  ...    9.9  93.0  66.0  0.00             538
2012   7   2    1  80.3  54.1  ...   15.0  88.0  66.9  0.00             564
2012   7   3    2  79.8  56.7  ...   12.0  88.0  63.0  0.00             664
2012   7   4    3  81.8  65.6  ...   11.1  91.0  68.0  0.06             432
2012   7   6    5  81.9  62.3  ...    9.9  91.0  66.9  0.00             638
...   ..  ..  ...   ...   ...  ...    ...   ...   ...   ...             ...
2020  12   1    2  58.1  54.3  ...   34.0  63.0  45.0  0.88             253
2020  12   2    3  47.5  35.9  ...   22.9  63.0  42.1  0.59             204
2020  12   3    4  45.1  32.4  ...   18.1  52.0  36.0  0.02             218
2020  12   4    5  53.6  43.8  ...   18.1  59.0  36.0  0.00             319
2020  12   5    6  51.9  48.9  ...   32.1  59.0  39.9  0.32             222

[3054 rows 

In [28]:
# Find the total length of data rows
totalNumberOfRows = len(source_dataframe);
print(totalNumberOfRows);

3054


In [29]:
# Import NumPy for fast array maths
import numpy as np

In [30]:
# Shuffle all source data
source_dataframe = source_dataframe.iloc[np.random.permutation(len(source_dataframe))]; 

In [31]:
# Number of rows to reserve for testing
number_of_test_rows = 50

In [32]:
# Store training data as (all rows - number_of_test_rows) of source data
df_train = source_dataframe[:-number_of_test_rows];

# Shuffle training data a second time
df_train = df_train.iloc[np.random.permutation(len(df_train))]; 

# Store validation test data as last number_of_test_rows of rows of source data
df_test = source_dataframe[-number_of_test_rows:];

# Shuffle test data a second time
df_test = df_test.iloc[np.random.permutation(len(df_test))];

# Keep only the columns used by this model: day (2), temp (3), min (10) and NUM_COLLISIONS (12)
df_train = df_train.iloc[:, [2,3,10,12]]; 
df_test = df_test.iloc[:, [2,3,10,12]];
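Positional indices such as `[2,3,10,12]` are easy to get wrong when the column layout changes between datasets. A short sketch of how to check which names a list of positions selects, and how to select by name instead, follows. The frame below is a hypothetical reduced version of the dataset, so its positions differ from the notebook's.

```python
import pandas as pd

# Hypothetical frame with a reduced set of columns, for illustration only
demo = pd.DataFrame(
    [[7, 1, 7, 83.6, 66.0, 538]],
    columns=["mo", "da", "day", "temp", "min", "NUM_COLLISIONS"],
)

# Show which column names a list of positions refers to
positions = [2, 3, 4, 5]
print(list(demo.columns[positions]))

# Selecting by name makes the intent explicit and survives column reordering
wanted = ["day", "temp", "min", "NUM_COLLISIONS"]
by_name = demo.loc[:, wanted]
by_position = demo.iloc[:, positions]
print(by_name.equals(by_position))
```

Running the same check against `source_dataframe.columns` shows which predictors a model actually receives.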


In [33]:
# Confirm training data is stored 
print(df_train)

      day  temp   min  NUM_COLLISIONS
year                                 
2018    5  38.0  27.0             705
2014    1  44.6  37.0             625
2013    3  55.3  44.1             539
2014    4  52.0  46.0             596
2016    2  73.8  66.9             699
...   ...   ...   ...             ...
2015    2  65.5  59.0             703
2013    6  66.2  55.9             527
2013    4  42.8  30.0             620
2020    4  44.6  39.0             581
2019    6  67.1  55.9             602

[3004 rows x 4 columns]


In [34]:
# Confirm test data rows are stored
print(df_test);

      day  temp   min  NUM_COLLISIONS
year                                 
2020    4  60.7  51.8             322
2016    7  51.3  37.9             518
2014    1  50.9  39.0             451
2019    5  73.7  69.1             645
2015    1  67.6  63.0             579
2016    3  45.7  37.0             653
2012    1  65.3  61.0             607
2019    3  35.6  28.0             264
2015    5  67.0  61.0             451
2013    6  69.3  66.0             492
2019    3  43.7  39.9             613
2013    7  35.7  23.0             393
2013    6  24.4  21.9             558
2016    7  67.5  64.4             529
2019    7  38.4  30.9             384
2014    3  67.5  53.1             538
2015    4  59.7  57.0             683
2020    4  40.6  30.9             455
2017    3  62.6  60.1             744
2019    3  62.3  57.9             591
2016    4  73.6  66.9             672
2017    5  13.7   6.1             598
2017    3  29.8  25.0             650
2017    3  53.9  50.0             641
2019    2  4

In [35]:
# Select Predictor columns for training and testing data to be used to predict the outcome

predictors_train = df_train.iloc[:, [0,1,2]];
predictors_test = df_test.iloc[:,  [0,1,2]];

In [36]:
# confirm predictor holds correct data
print(predictors_train)

      day  temp   min
year                 
2018    5  38.0  27.0
2014    1  44.6  37.0
2013    3  55.3  44.1
2014    4  52.0  46.0
2016    2  73.8  66.9
...   ...   ...   ...
2015    2  65.5  59.0
2013    6  66.2  55.9
2013    4  42.8  30.0
2020    4  44.6  39.0
2019    6  67.1  55.9

[3004 rows x 3 columns]


In [37]:
# confirm predictor holds correct data
print(predictors_test);

      day  temp   min
year                 
2020    4  60.7  51.8
2016    7  51.3  37.9
2014    1  50.9  39.0
2019    5  73.7  69.1
2015    1  67.6  63.0
2016    3  45.7  37.0
2012    1  65.3  61.0
2019    3  35.6  28.0
2015    5  67.0  61.0
2013    6  69.3  66.0
2019    3  43.7  39.9
2013    7  35.7  23.0
2013    6  24.4  21.9
2016    7  67.5  64.4
2019    7  38.4  30.9
2014    3  67.5  53.1
2015    4  59.7  57.0
2020    4  40.6  30.9
2017    3  62.6  60.1
2019    3  62.3  57.9
2016    4  73.6  66.9
2017    5  13.7   6.1
2017    3  29.8  25.0
2017    3  53.9  50.0
2019    2  48.8  44.1
2014    2  22.2  17.1
2015    7  41.6  34.0
2019    6  23.5  10.9
2019    6  39.3  32.0
2019    7  26.0  18.0
2015    3  73.9  69.1
2015    3  65.2  57.9
2019    6  51.9  42.1
2016    3  45.0  41.0
2012    6  43.6  39.0
2013    4  69.8  68.0
2016    7  57.6  48.0
2013    7  64.0  50.0
2014    6  59.4  57.0
2015    6  21.9   3.9
2015    7  57.0  52.0
2017    1  51.1  48.0
2014    3  69.2  60.1
2020    7 

In [38]:
# Select target columns
targets_train = df_train.iloc[:,3];

# Select target columns
targets_test = df_test.iloc[:,3];

In [39]:
print(targets_train);

year
2018    705
2014    625
2013    539
2014    596
2016    699
       ... 
2015    703
2013    527
2013    620
2020    581
2019    602
Name: NUM_COLLISIONS, Length: 3004, dtype: int64


In [40]:
print(targets_test);

year
2020    322
2016    518
2014    451
2019    645
2015    579
2016    653
2012    607
2019    264
2015    451
2013    492
2019    613
2013    393
2013    558
2016    529
2019    384
2014    538
2015    683
2020    455
2017    744
2019    591
2016    672
2017    598
2017    650
2017    641
2019    592
2014    504
2015    422
2019    577
2019    530
2019    374
2015    594
2015    667
2019    557
2016    510
2012    542
2013    438
2016    572
2013    459
2014    531
2015    517
2015    312
2017    729
2014    591
2020    166
2019    411
2019    515
2015    830
2019    502
2020    309
2016    700
Name: NUM_COLLISIONS, dtype: int64


In [41]:
# Set scale value
SCALE_NUM_COLLISIONS = 1000.0;

In [42]:
# Get size of training set 
trainsize = int(len(df_train['NUM_COLLISIONS']));

# Get size of test set 
testsize = int(len(df_test['NUM_COLLISIONS']));

In [43]:
print(trainsize);

3004


In [44]:
print(testsize);

50


In [45]:

# Define the number of predictor column input values
nppredictors = len(predictors_train.columns);

# Define the number of target column output values
noutputs = 1;


In [46]:
print(nppredictors)

3


# **2.1 Linear Regression Model**

In [48]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_mo_day_temp_min', ignore_errors=True)
   
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_mo_day_temp_min', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Prints log to show start of training model
print("starting to train...\n");

# Train the model using predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Check predictions based on predictor values
preds = estimator.predict(x=predictors_train.values)

# Apply Scale value to outputs
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE using predictions and targets
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nLinearRegression has RMSE of {0}'.format(rmse));

# Store linear regressor value
rmse_LR = getattr(rmse, "tolist", lambda: rmse)()

# Calculate mean value of Number of Collisions
avg = np.mean(df_train['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using Number of Collision Values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2));
print('Just using an average = {0}, has RMSE of {1}'.format(avg, rmse)); 

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

if(rmse_LR < rmse_avg): # If linear regression RMSE is lower than the mean-baseline RMSE
  print('\nGreat! Your Linear Regression model performs better than finding average!');
else: 
  print('\nSorry! On this run, your model performs worse than just finding the average!');


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1afb89f60>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_mo_day_temp_min', '_session_creation_timeout_secs': 7200}
starting to train...

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized

**2.1.1 Linear Regression Validation Test**

Perform linear regression validation test using values from the original data set reserved for testing.


In [49]:

# Perform linear regression validation test using values from the original data set reserved for testing.                                                                                                 
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_mo_day_temp_min', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

preds = estimator.predict(x=predictors_test.values)  # Use test data values 
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS  # Adjust for scale
pred = format(str(predslistscale))
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1afcd9cc0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_mo_day_temp_min', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_traine

# **2.2 Deep Neural Network Regressor**



In [50]:
# Import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# Check the version
print(tf.__version__)

# Required for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# Remove any model saved by a previous training run
shutil.rmtree('/tmp/DNN_collision_regression_trained_model_mo_day_temp_min', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_mo_day_temp_min', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Print message to display start of training log
print("starting to train");

# Train the model by passing predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_train.values)

# Rescale the outputs back to collision counts (targets were divided by SCALE_NUM_COLLISIONS for training).
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE value to determine how well the model works using prediction and target values.
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nDNNRegression has RMSE of {0}'.format(rmse));

# Store DNN regression RMSE value
rmse_DNN = getattr(rmse, "tolist", lambda: rmse)()

# Calculate the mean of the Number of Collision Values.
avg = np.mean(df_train['NUM_COLLISIONS'])

# Calculate RMSE using COLLISION Values and the mean of all target values to determine
# if the DNN model is better than calculating the mean value.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

# Output success or failure message for this model
if(rmse_DNN < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your DNN Regression model performs better than finding the average!'); # Success
else: 
  print('\nSorry! But on this run, your DNN Regression model performs worse than just finding the average!'); # Failure


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1afa9a278>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_mo_day_temp_min', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finali

**2.2.1 Deep Neural Network Validation Test**

In [51]:
# Perform validation assessment of DNN model using test values reserved from original dataset
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_mo_day_temp_min', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

# Use test values reserved from original data set
preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)  # Convert predictions to a printable string
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1afa8eda0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_mo_day_temp_min', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_collision_reg

The validation test shows that, like the previous models, the DNN model struggles to predict outlier values that are unusually large or small. Some of the predicted values are nevertheless a close match to their target values.
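
The comparison above is made by eye. A single error figure for the reserved rows may make it easier to set the validation result beside the training RMSE. The following is a minimal sketch rather than the notebook's own code: `rmse` is a hypothetical helper, and the example values are placeholders chosen only so the block runs on its own. In the notebook, `targets_test.values` and `predslistscale` would be passed instead.

```python
import numpy as np

def rmse(targets, predictions):
    """Root mean squared error between two equal-length sequences."""
    # Flatten both inputs so an (N, 1) array and an (N,) array are compared
    # row by row instead of being broadcast against each other.
    targets = np.asarray(targets, dtype=float).ravel()
    predictions = np.asarray(predictions, dtype=float).ravel()
    if targets.shape != predictions.shape:
        raise ValueError("targets and predictions must have the same length")
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))

# Placeholder values for illustration only (not output from the trained model)
example_targets = [600.0, 650.0, 200.0]
example_predictions = [610.0, 640.0, 590.0]

print("Validation RMSE:", rmse(example_targets, example_predictions))
```

A validation RMSE that is much larger than the training RMSE would suggest the model is fitting the training rows more closely than it generalises.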

# **2.3 Summary**

The results show that, on most runs, the linear model performed worse than simply predicting the mean number of collisions, whereas the DNN model generally performed better than the mean.

The choice of an improved dataset, and the selection of day, month and minimum temperature as predictors, may have contributed to the improved model.

#**3. Wind and Air Pressure Factors**

I will now investigate whether air pressure and wind speed factors change the performance of the models.
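
Before training, a quick correlation check can hint at whether these factors carry any linear signal. The block below is a minimal sketch, assuming the column names `slp`, `wdsp`, `mxpsd` and `NUM_COLLISIONS` that appear in the printed frames of this section. The five rows are copied from the printed training sample, which is far too few for a meaningful result; they only make the sketch runnable.

```python
import pandas as pd

# Five rows copied from the printed training sample in this section
sample = pd.DataFrame({
    "slp": [1019.2, 999.6, 1019.7, 1011.3, 1006.4],
    "wdsp": [15.4, 12.9, 4.1, 13.0, 17.0],
    "mxpsd": [21.0, 19.0, 9.9, 24.1, 21.0],
    "NUM_COLLISIONS": [742, 729, 644, 701, 616],
})

# Pearson correlation of each predictor with the target column
correlations = sample.corr()["NUM_COLLISIONS"].drop("NUM_COLLISIONS")
print(correlations)
```

On the full dataset the same call would be made on the loaded frame restricted to these four columns. Values close to zero would suggest that a linear model has little to work with.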

In [52]:
# Import pandas to allow creation of data frames
import pandas as pd

# Create data frame from github csv file
source_dataframe = pd.read_csv('https://raw.githubusercontent.com/15014370uhi/15014370_DataAnalytics/master/bq-cleansed_weather_reduced_assignment_1.csv', index_col=0,);

In [53]:
# Check that correct data is present
print(source_dataframe) 

      mo  da  day  temp  dewp  ...  mxpsd   max   min  prcp  NUM_COLLISIONS
year                           ...                                         
2012   7   1    7  83.6  63.0  ...    9.9  93.0  66.0  0.00             538
2012   7   2    1  80.3  54.1  ...   15.0  88.0  66.9  0.00             564
2012   7   3    2  79.8  56.7  ...   12.0  88.0  63.0  0.00             664
2012   7   4    3  81.8  65.6  ...   11.1  91.0  68.0  0.06             432
2012   7   6    5  81.9  62.3  ...    9.9  91.0  66.9  0.00             638
...   ..  ..  ...   ...   ...  ...    ...   ...   ...   ...             ...
2020  12   1    2  58.1  54.3  ...   34.0  63.0  45.0  0.88             253
2020  12   2    3  47.5  35.9  ...   22.9  63.0  42.1  0.59             204
2020  12   3    4  45.1  32.4  ...   18.1  52.0  36.0  0.02             218
2020  12   4    5  53.6  43.8  ...   18.1  59.0  36.0  0.00             319
2020  12   5    6  51.9  48.9  ...   32.1  59.0  39.9  0.32             222

[3054 rows 

In [54]:
# Find the total length of data rows
totalNumberOfRows = len(source_dataframe);
print(totalNumberOfRows);

3054


In [55]:
# Import numpy to improve speed of math calculations
import numpy as np

In [56]:
# Shuffle all source data
source_dataframe = source_dataframe.iloc[np.random.permutation(len(source_dataframe))]; 

In [57]:
# Number of rows to reserve for testing
number_of_test_rows = 50

In [58]:
# Store training data as (all rows - number_of_test_rows) of source data
df_train = source_dataframe[:-number_of_test_rows];

# Shuffle training data a second time
df_train = df_train.iloc[np.random.permutation(len(df_train))]; 

# Store validation test data as the last number_of_test_rows rows of source data
df_test = source_dataframe[-number_of_test_rows:];

# Shuffle test data a second time
df_test = df_test.iloc[np.random.permutation(len(df_test))];

# Store only relevant columns of data for this model
df_train = df_train.iloc[:, [5,7,8,12]]; 
df_test = df_test.iloc[:, [5,7,8,12]];
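
Because `np.random.permutation` is called without a seed, each run of the cells above produces a different train and test split, which may account for some of the run-to-run variation in the results. The following is a minimal sketch of a reproducible split, not the notebook's own method: `split_train_test` and its `seed` parameter are hypothetical, and the small `demo` frame holds placeholder values.

```python
import numpy as np
import pandas as pd

def split_train_test(frame, number_of_test_rows, seed=42):
    """Shuffle rows once with a seeded generator, then reserve the last rows for testing."""
    rng = np.random.default_rng(seed)
    shuffled = frame.iloc[rng.permutation(len(frame))]
    return shuffled.iloc[:-number_of_test_rows], shuffled.iloc[-number_of_test_rows:]

# Placeholder frame; the notebook would pass source_dataframe and 50 instead
demo = pd.DataFrame({"x": range(10), "NUM_COLLISIONS": range(100, 110)})
train_a, test_a = split_train_test(demo, number_of_test_rows=3)
train_b, test_b = split_train_test(demo, number_of_test_rows=3)

print(len(train_a), len(test_a))
print(test_a.equals(test_b))  # Same seed gives the same reserved rows
```

A single shuffle before the split is enough to randomise which rows are reserved, so the sketch does not shuffle the two parts a second time.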


In [59]:
# Confirm training data is stored 
print(df_train)

         slp  wdsp  mxpsd  NUM_COLLISIONS
year                                     
2015  1019.2  15.4   21.0             742
2017   999.6  12.9   19.0             729
2014  1019.7   4.1    9.9             644
2013  1011.3  13.0   24.1             701
2016  1006.4  17.0   21.0             616
...      ...   ...    ...             ...
2017  1019.1   7.5   14.0             771
2014  1018.6   6.9   13.0             639
2019  1007.6  10.4   19.0             633
2017  1021.4  14.1   20.0             541
2013  1003.8  13.6   19.0             604

[3004 rows x 4 columns]


In [60]:
# Confirm test data rows are stored
print(df_test);

         slp  wdsp  mxpsd  NUM_COLLISIONS
year                                     
2017  1013.6  12.4   22.0             661
2012  1023.9   3.0    8.9             511
2014  1017.9  13.4   24.1             547
2020  1011.2  15.0   22.9             191
2013  1016.3  11.8   18.1             791
2016  1020.4  13.1   22.0             702
2012  1019.9   4.1    9.9             494
2017  1028.6  17.1   25.1             703
2016  1015.4  18.6   27.0             798
2019  1019.3   9.1   18.1             580
2018  1009.8  13.8   18.1             601
2019  1021.8   7.8   17.1             445
2015  1019.7   8.2   11.1             646
2018  1028.8  14.5   17.1             494
2020  1014.7   3.9    8.0             295
2018  1025.8   9.8   12.0             654
2018  1019.1   9.5   13.0             580
2013  1018.1  12.5   19.0             570
2019  1023.6   6.7   11.1             613
2018  1010.6  13.9   20.0             746
2017  1018.5   5.7   11.1             639
2014  1014.7   7.0   18.1         

In [61]:
# Select Predictor columns for training and testing data to be used to predict the outcome

predictors_train = df_train.iloc[:, [0,1,2]];
predictors_test = df_test.iloc[:,  [0,1,2]];

In [62]:
# confirm predictor holds correct data
print(predictors_train)

         slp  wdsp  mxpsd
year                     
2015  1019.2  15.4   21.0
2017   999.6  12.9   19.0
2014  1019.7   4.1    9.9
2013  1011.3  13.0   24.1
2016  1006.4  17.0   21.0
...      ...   ...    ...
2017  1019.1   7.5   14.0
2014  1018.6   6.9   13.0
2019  1007.6  10.4   19.0
2017  1021.4  14.1   20.0
2013  1003.8  13.6   19.0

[3004 rows x 3 columns]


In [63]:
# confirm predictor holds correct data
print(predictors_test);

         slp  wdsp  mxpsd
year                     
2017  1013.6  12.4   22.0
2012  1023.9   3.0    8.9
2014  1017.9  13.4   24.1
2020  1011.2  15.0   22.9
2013  1016.3  11.8   18.1
2016  1020.4  13.1   22.0
2012  1019.9   4.1    9.9
2017  1028.6  17.1   25.1
2016  1015.4  18.6   27.0
2019  1019.3   9.1   18.1
2018  1009.8  13.8   18.1
2019  1021.8   7.8   17.1
2015  1019.7   8.2   11.1
2018  1028.8  14.5   17.1
2020  1014.7   3.9    8.0
2018  1025.8   9.8   12.0
2018  1019.1   9.5   13.0
2013  1018.1  12.5   19.0
2019  1023.6   6.7   11.1
2018  1010.6  13.9   20.0
2017  1018.5   5.7   11.1
2014  1014.7   7.0   18.1
2018  1005.6  16.6   26.0
2019  1026.1   7.6   12.0
2018  1011.1  11.9   22.0
2020  1034.5  13.4   26.0
2018  1016.7  10.7   15.0
2019  1035.1   5.2    8.9
2019   999.3  21.7   28.0
2015  1017.8   6.4    9.9
2018  1026.8   7.2   14.0
2012  1016.1   3.1    9.9
2020  1020.2  10.4   12.0
2017  1009.8  10.3   17.1
2019  1013.8   8.8   17.1
2015  1013.9  16.6   22.0
2018  1008.6

In [64]:
# Select target column for training data
targets_train = df_train.iloc[:,3];

# Select target column for test data
targets_test = df_test.iloc[:,3];

In [65]:
print(targets_train);

year
2015    742
2017    729
2014    644
2013    701
2016    616
       ... 
2017    771
2014    639
2019    633
2017    541
2013    604
Name: NUM_COLLISIONS, Length: 3004, dtype: int64


In [66]:
print(targets_test);

year
2017    661
2012    511
2014    547
2020    191
2013    791
2016    702
2012    494
2017    703
2016    798
2019    580
2018    601
2019    445
2015    646
2018    494
2020    295
2018    654
2018    580
2013    570
2019    613
2018    746
2017    639
2014    589
2018    662
2019    448
2018    617
2020    225
2018    557
2019    520
2019    530
2015    760
2018    568
2012    630
2020    299
2017    657
2019    395
2015    686
2018    629
2014    635
2012    574
2018    693
2017    537
2015    732
2018    721
2013    542
2017    606
2015    582
2018    465
2013    570
2019    587
2018    753
Name: NUM_COLLISIONS, dtype: int64


In [67]:
# Set scale value
SCALE_NUM_COLLISIONS = 1000.0;
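
The scale value is used twice in the cells that follow: targets are divided by it before training, and predictions are multiplied by it afterwards. The round trip can be sketched as below, using the first five training targets printed above. This is an illustration of the arithmetic only and does not involve the estimator.

```python
import numpy as np

SCALE_NUM_COLLISIONS = 1000.0

# First five training targets printed above
targets = np.array([742, 729, 644, 701, 616])

# Divide by the scale and shape as (rows, outputs), as done before training
scaled = targets.reshape(len(targets), 1) / SCALE_NUM_COLLISIONS

# Multiply by the same value to return to collision counts
restored = scaled.ravel() * SCALE_NUM_COLLISIONS

print(scaled.ravel())
print(restored)
```

Dividing by 1000 keeps the training targets below 1 for this data, which is why the predictions have to be rescaled before they are compared with the collision counts.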

In [68]:
# Get size of training set 
trainsize = int(len(df_train['NUM_COLLISIONS']));

# Get size of test set 
testsize = int(len(df_test['NUM_COLLISIONS']));

In [69]:
print(trainsize);

3004


In [70]:
print(testsize);

50


In [71]:

# Define the number of predictor column input values
nppredictors = len(predictors_train.columns);

# Define the number of target column output values
noutputs = 1;


In [72]:
print(nppredictors)

3


# **3.1 Linear Regression Model**

In [73]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_windspeed', ignore_errors=True)
   
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_windspeed', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Prints log to show start of training model
print("starting to train...\n");

# Train the model using predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Check predictions based on predictor values
preds = estimator.predict(x=predictors_train.values)

# Apply Scale value to outputs
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE using predictions and targets
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nLinearRegression has RMSE of {0}'.format(rmse));

# Store linear regressor value
rmse_LR = getattr(rmse, "tolist", lambda: rmse)()

# Calculate mean value of Number of Collisions
avg = np.mean(df_train['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using Number of Collision Values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2));
print('Just using an average = {0}, has RMSE of {1}'.format(avg, rmse)); 

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

if(rmse_LR < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your Linear Regression model performs better than finding average!');
else: 
  print('\nSorry! On this run, your model performs worse than just finding the average!');


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1af7d6198>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_windspeed', '_session_creation_timeout_secs': 7200}
starting to train...

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO

**3.1.1 Linear Regression Validation Test**

Perform linear regression validation test using values from the original data set reserved for testing.


In [74]:

# Perform linear regression validation test using values from the original data set reserved for testing.
                                                                                                 
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_windspeed', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)  # Convert predictions to a printable string
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1af7e0668>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_windspeed', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_mode

# **3.2 Deep Neural Network Regressor**



In [75]:
# Import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# Check the version
print(tf.__version__)

# Required for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# Remove any model saved by a previous training run
shutil.rmtree('/tmp/DNN_collision_regression_trained_model_windspeed', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_windspeed', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Print message to display start of training log
print("starting to train");

# Train the model by passing predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_train.values)

# Rescale the outputs back to collision counts (targets were divided by SCALE_NUM_COLLISIONS for training).
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE value to determine how well the model works using prediction and target values.
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nDNNRegression has RMSE of {0}'.format(rmse));

# Store DNN regression RMSE value
rmse_DNN = getattr(rmse, "tolist", lambda: rmse)()

# Calculate the mean of the Number of Collision Values.
avg = np.mean(df_train['NUM_COLLISIONS'])

# Calculate RMSE using COLLISION Values and the mean of all target values to determine
# if the DNN model is better than calculating the mean value.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

# Output success or failure message for this model
if(rmse_DNN < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your DNN Regression model performs better than finding the average!'); # Success
else: 
  print('\nSorry! But on this run, your DNN Regression model performs worse than just finding the average!'); # Failure


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1af78ebe0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_windspeed', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I

**3.2.1 Deep Neural Network Validation Test**

In [76]:
# Perform validation assessment of DNN model using test values reserved from original dataset
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_windspeed', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

# Use test values reserved from original data set
preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)  # Convert predictions to a printable string
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1af79b0f0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_windspeed', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_collision_regressio

The validation test shows that many of the predicted values are a close match to the target values, although the DNN model under-estimated some of the higher target values.
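
Under-estimation of this kind can be quantified with signed errors rather than read off the printed arrays. The block below is a minimal sketch: `signed_errors` is a hypothetical helper, the targets are the first five test targets printed in this section, and the predictions are placeholders rather than output from the trained model.

```python
import numpy as np

def signed_errors(targets, predictions):
    """Prediction minus target for each row; negative values are under-estimates."""
    targets = np.asarray(targets, dtype=float).ravel()
    predictions = np.asarray(predictions, dtype=float).ravel()
    return predictions - targets

# First five test targets printed in this section
example_targets = [661, 511, 547, 191, 791]
# Placeholder predictions for illustration only
example_predictions = [600.0, 580.0, 590.0, 570.0, 610.0]

errors = signed_errors(example_targets, example_predictions)
print("Mean signed error:", errors.mean())
print("Largest under-estimate:", errors.min())
```

In the notebook the helper would be called with `targets_test.values` and `predslistscale`. A strongly negative error on the rows with the highest targets would support the observation above.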

# **3.3 Summary**

The results show that the linear model performed worse than simply predicting the mean number of collisions, whereas the DNN model generally performed better than the mean.

In this case, the DNN model is a good candidate for use as a predictive tool for collision numbers, and to inform emergency services in New York City.
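
The comparison against the mean used in these summaries is a yes or no test. Expressing it as a fraction shows how much better or worse a model is, which may help when ranking candidate models. This is a minimal sketch with a hypothetical helper name and placeholder values, not a result from the models above.

```python
import numpy as np

def improvement_over_mean(targets, predictions):
    """Fractional RMSE reduction relative to always predicting the mean of the targets."""
    targets = np.asarray(targets, dtype=float).ravel()
    predictions = np.asarray(predictions, dtype=float).ravel()
    rmse_model = np.sqrt(np.mean((targets - predictions) ** 2))
    rmse_mean = np.sqrt(np.mean((targets - targets.mean()) ** 2))
    # Assumes the targets are not all identical, otherwise rmse_mean is zero
    return float(1.0 - rmse_model / rmse_mean)

# Placeholder values for illustration only
example_targets = [500.0, 600.0, 700.0]
example_predictions = [520.0, 600.0, 680.0]

print(improvement_over_mean(example_targets, example_predictions))
```

A positive value means the model beats the mean, zero means it matches the mean, and a negative value corresponds to the failure message printed by the training cells.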

#**4. Dew Point, Visibility and Precipitation**

I will now investigate whether dew point, visibility and precipitation change the performance of the models.

In [131]:
# Import pandas to allow creation of data frames
import pandas as pd

# Create data frame from github csv file
source_dataframe = pd.read_csv('https://raw.githubusercontent.com/15014370uhi/15014370_DataAnalytics/master/bq-cleansed_weather_reduced_assignment_1.csv', index_col=0,);

In [132]:
# Check that correct data is present
print(source_dataframe) 

      mo  da  day  temp  dewp  ...  mxpsd   max   min  prcp  NUM_COLLISIONS
year                           ...                                         
2012   7   1    7  83.6  63.0  ...    9.9  93.0  66.0  0.00             538
2012   7   2    1  80.3  54.1  ...   15.0  88.0  66.9  0.00             564
2012   7   3    2  79.8  56.7  ...   12.0  88.0  63.0  0.00             664
2012   7   4    3  81.8  65.6  ...   11.1  91.0  68.0  0.06             432
2012   7   6    5  81.9  62.3  ...    9.9  91.0  66.9  0.00             638
...   ..  ..  ...   ...   ...  ...    ...   ...   ...   ...             ...
2020  12   1    2  58.1  54.3  ...   34.0  63.0  45.0  0.88             253
2020  12   2    3  47.5  35.9  ...   22.9  63.0  42.1  0.59             204
2020  12   3    4  45.1  32.4  ...   18.1  52.0  36.0  0.02             218
2020  12   4    5  53.6  43.8  ...   18.1  59.0  36.0  0.00             319
2020  12   5    6  51.9  48.9  ...   32.1  59.0  39.9  0.32             222

[3054 rows 

In [133]:
# Find the total length of data rows
totalNumberOfRows = len(source_dataframe);
print(totalNumberOfRows);

3054


In [135]:
# Import numpy to improve speed of math calculations
import numpy as np

In [136]:
# Shuffle all source data
source_dataframe = source_dataframe.iloc[np.random.permutation(len(source_dataframe))]; 

In [137]:
# Number of rows to reserve for testing
number_of_test_rows = 50

In [138]:
# Store training data as (all rows - number_of_test_rows) of source data
df_train = source_dataframe[:-number_of_test_rows];

# Shuffle training data a second time
df_train = df_train.iloc[np.random.permutation(len(df_train))]; 

# Store validation test data as the last number_of_test_rows rows of source data
df_test = source_dataframe[-number_of_test_rows:];

# Shuffle test data a second time
df_test = df_test.iloc[np.random.permutation(len(df_test))];

# Store only relevant columns of data for this model
df_train = df_train.iloc[:, [4,6,11,12]]; 
df_test = df_test.iloc[:, [4,6,11,12]];
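
Selecting columns by position depends on the column order of the CSV file staying the same. Selecting by name is a possible alternative that states the intent directly. The sketch below assumes the column order that can be inferred from the printed frames in this report, and uses a single placeholder row of zeros so that it runs on its own.

```python
import pandas as pd

# Column order inferred from the printed frames in this report (an assumption)
columns = ["mo", "da", "day", "temp", "dewp", "slp", "visib", "wdsp",
           "mxpsd", "max", "min", "prcp", "NUM_COLLISIONS"]

# One placeholder row of zeros; the notebook would use df_train or df_test
demo = pd.DataFrame([[0] * len(columns)], columns=columns)

wanted = ["dewp", "visib", "prcp", "NUM_COLLISIONS"]
by_name = demo.loc[:, wanted]
by_position = demo.iloc[:, [4, 6, 11, 12]]

# Both selections pick the same columns when the order matches the assumption
print(list(by_name.columns) == list(by_position.columns))
```

With named columns, the later predictor and target selections could also refer to `NUM_COLLISIONS` directly instead of position 3.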


In [139]:
# Confirm training data is stored 
print(df_train)

      dewp  visib  prcp  NUM_COLLISIONS
year                                   
2018  38.5   10.0  0.00             610
2012  45.1    7.0  0.00             526
2015  31.4   10.0  0.00             443
2013  51.0    5.7  0.00             698
2016  55.8    6.5  0.00             767
...    ...    ...   ...             ...
2018  21.7   10.0  0.01             494
2018  31.7    9.1  0.00             668
2018  56.7   10.0  0.00             700
2012  32.1    9.3  0.00             480
2014   8.0   10.0  0.00             547

[3004 rows x 4 columns]


In [140]:
# Confirm test data rows are stored
print(df_test);

      dewp  visib  prcp  NUM_COLLISIONS
year                                   
2014  44.2    6.6  0.03             732
2019  60.1    5.6  0.13             734
2019  34.9    9.6  0.00             510
2013  32.9   10.0  0.00             626
2019  16.2   10.0  0.00             562
2016  49.8    5.6  0.13             649
2017  63.7    8.8  0.00             747
2013  11.3   10.0  0.00             648
2014  61.5    8.2  0.11             662
2020  53.3    7.0  0.00             198
2018  32.9   10.0  0.00             629
2013  29.2    9.8  0.26             600
2015  55.4    8.0  0.00             389
2014  52.3    9.3  0.00             607
2015  64.4    8.8  0.00             598
2014  10.5   10.0  0.00             474
2015  33.8   10.0  0.00             628
2014  48.1    8.1  0.15             546
2015  68.8    9.5  0.00             552
2015  69.6    9.0  0.00             614
2016  24.3   10.0  0.00             482
2012  71.3    7.6  0.00             617
2014  60.8    9.1  0.02             625


In [141]:
# Select Predictor columns for training and testing data to be used to predict the outcome

predictors_train = df_train.iloc[:, [0,1,2]];
predictors_test = df_test.iloc[:,  [0,1,2]];

In [142]:
# confirm predictor holds correct data
print(predictors_train)

      dewp  visib  prcp
year                   
2018  38.5   10.0  0.00
2012  45.1    7.0  0.00
2015  31.4   10.0  0.00
2013  51.0    5.7  0.00
2016  55.8    6.5  0.00
...    ...    ...   ...
2018  21.7   10.0  0.01
2018  31.7    9.1  0.00
2018  56.7   10.0  0.00
2012  32.1    9.3  0.00
2014   8.0   10.0  0.00

[3004 rows x 3 columns]


In [143]:
# confirm predictor holds correct data
print(predictors_test);

      dewp  visib  prcp
year                   
2014  44.2    6.6  0.03
2019  60.1    5.6  0.13
2019  34.9    9.6  0.00
2013  32.9   10.0  0.00
2019  16.2   10.0  0.00
2016  49.8    5.6  0.13
2017  63.7    8.8  0.00
2013  11.3   10.0  0.00
2014  61.5    8.2  0.11
2020  53.3    7.0  0.00
2018  32.9   10.0  0.00
2013  29.2    9.8  0.26
2015  55.4    8.0  0.00
2014  52.3    9.3  0.00
2015  64.4    8.8  0.00
2014  10.5   10.0  0.00
2015  33.8   10.0  0.00
2014  48.1    8.1  0.15
2015  68.8    9.5  0.00
2015  69.6    9.0  0.00
2016  24.3   10.0  0.00
2012  71.3    7.6  0.00
2014  60.8    9.1  0.02
2014  61.3    9.9  0.00
2014  10.4    6.3  0.02
2016  14.7    8.1  0.01
2015  68.1    9.2  0.30
2020  41.9    3.7  0.48
2013  67.1    6.5  0.02
2015  38.5    5.0  0.55
2020  69.3    2.1  0.43
2014  57.0    7.1  0.00
2015  11.0    8.1  0.04
2018  -0.5   10.0  0.00
2014  33.6    5.3  0.06
2020  46.2    9.1  0.00
2015  68.7    9.7  0.01
2014  54.0    9.2  0.01
2020  52.1   10.0  0.00
2019  59.3    3.

In [144]:
# Select target column for training data
targets_train = df_train.iloc[:,3];

# Select target column for test data
targets_test = df_test.iloc[:,3];

In [145]:
print(targets_train);

year
2018    610
2012    526
2015    443
2013    698
2016    767
       ... 
2018    494
2018    668
2018    700
2012    480
2014    547
Name: NUM_COLLISIONS, Length: 3004, dtype: int64


In [146]:
print(targets_test);

year
2014    732
2019    734
2019    510
2013    626
2019    562
2016    649
2017    747
2013    648
2014    662
2020    198
2018    629
2013    600
2015    389
2014    607
2015    598
2014    474
2015    628
2014    546
2015    552
2015    614
2016    482
2012    617
2014    625
2014    538
2014    649
2016    597
2015    606
2020    190
2013    597
2015    565
2020    362
2014    528
2015    757
2018    476
2014    960
2020    173
2015    595
2014    719
2020    264
2019    697
2014    670
2018    703
2016    534
2016    629
2017    696
2016    665
2017    757
2017    675
2014    643
2018    672
Name: NUM_COLLISIONS, dtype: int64


In [147]:
# Set scale value
SCALE_NUM_COLLISIONS = 1000.0;

In [148]:
# Get size of training set 
trainsize = int(len(df_train['NUM_COLLISIONS']));

# Get size of test set 
testsize = int(len(df_test['NUM_COLLISIONS']));

In [149]:
print(trainsize);

3004


In [150]:
print(testsize);

50


In [151]:

# Define the number of predictor column input values
nppredictors = len(predictors_train.columns);

# Define the number of target column output values
noutputs = 1;


In [152]:
print(nppredictors)

3


# **4.1 Linear Regression Model**

In [154]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_dew_visib_precip', ignore_errors=True)
   
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dew_visib_precip', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Prints log to show start of training model
print("starting to train...\n");

# Train the model using predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Check predictions based on predictor values
preds = estimator.predict(x=predictors_train.values)

# Apply Scale value to outputs
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE using predictions and targets
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nLinearRegression has RMSE of {0}'.format(rmse));

# Store linear regressor value
rmse_LR = getattr(rmse, "tolist", lambda: rmse)()

# Calculate mean value of Number of Collisions
avg = np.mean(df_train['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using Number of Collision Values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2));
print('Just using an average = {0}, has RMSE of {1}'.format(avg, rmse)); 

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

if(rmse_LR < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your Linear Regression model performs better than finding average!');
else: 
  print('\nSorry! On this run, your model performs worse than just finding the average!');


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1aa5c6898>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_dew_visib_precip', '_session_creation_timeout_secs': 7200}
starting to train...

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalize

On several shuffled runs of this data, the linear regression model performed far worse than simply predicting the mean number of collisions.  Its results varied greatly between runs, so the model is unlikely to be useful.
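
The run-to-run variation described above can be measured rather than judged by eye. The following is a minimal sketch, not the TensorFlow estimator used in this report: it assumes an ordinary least-squares fit from `np.linalg.lstsq` as a stand-in linear model, uses synthetic data rather than the collision dataset, and the helper names `rmse` and `compare_to_mean_over_shuffles` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def compare_to_mean_over_shuffles(X, y, n_runs=20, n_test=50, seed=0):
    """Fit a least-squares line on several shuffled splits and count how
    often it beats the training-mean baseline on the held-out rows."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    wins = 0
    model_scores, mean_scores = [], []
    for _ in range(n_runs):
        order = rng.permutation(len(y))
        train, test = order[:-n_test], order[-n_test:]
        # Add an intercept column, then solve ordinary least squares
        A_train = np.column_stack([X[train], np.ones(len(train))])
        A_test = np.column_stack([X[test], np.ones(len(test))])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        model_rmse = rmse(y[test], A_test @ coef)
        mean_rmse = rmse(y[test], np.full(len(test), y[train].mean()))
        model_scores.append(model_rmse)
        mean_scores.append(mean_rmse)
        wins += int(model_rmse < mean_rmse)
    return wins, np.array(model_scores), np.array(mean_scores)

# Synthetic stand-in data (NOT the collision dataset)
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(500, 3))
y_demo = 600 + 40 * X_demo[:, 0] + demo_rng.normal(scale=20, size=500)
wins, model_scores, mean_scores = compare_to_mean_over_shuffles(X_demo, y_demo)
print(f"model beat the mean on {wins} of {len(model_scores)} splits")
print(f"model RMSE spread (std): {model_scores.std():.2f}")
```

Reporting the number of wins and the spread of RMSE over many seeded splits gives a more stable picture than a single shuffled run, and fixing the seed makes each run repeatable.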

**4.1.1 Linear Regression Validation Test**

Perform linear regression validation test using values from the original data set reserved for testing.


In [None]:

# Perform linear regression validation test using values from the original data set reserved for testing.
                                                                                                 
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_dew_visib_precip', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

The DNN performed better than linear regression and produced quite reasonable values.  Again, the DNN under- and over-estimated at the extremes, for both the outlier low and outlier high target values.

# **4.2 Deep Neural Network Regressor**



In [156]:
# Import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# Check the version
print(tf.__version__)

# Required for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# Remove any previously saved model from the last training attempt
# (the path must match model_dir below, otherwise the old checkpoint is reused)
shutil.rmtree('/tmp/DNN_collision_regression_trained_model_dew_visib_precip', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_dew_visib_precip', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Print message to display start of training log
print("starting to train");

# Train the model by passing predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_train.values)

# Apply the scale value to convert the outputs back to collision counts.
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE value to determine how well the model works using prediction and target values.
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nDNNRegression has RMSE of {0}'.format(rmse));

# Store DNN regression RMSE value
rmse_DNN = getattr(rmse, "tolist", lambda: rmse)()

# Calculate the mean of the Number of Collision Values.
avg = np.mean(df_train['NUM_COLLISIONS'])

# Calculate RMSE using COLLISION Values and the mean of all target values to determine
# if the DNN model is better than calculating the mean value.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

# Output success or failure message for this model
if(rmse_DNN < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your DNN Regression model performs better than finding the average!'); # Success
else: 
  print('\nSorry! But on this run, your DNN Regression model performs worse than just finding the average!'); # Failure


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1aa4535c0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_dew_visib_precip', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was final

**4.2.1 Deep Neural Network Validation Test**

In [157]:
# Perform validation assessment of DNN model using test values reserved from original dataset
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_dew_visib_precip', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

# Use test values reserved from original data set
preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1aa6c3400>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_dew_visib_precip', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_collision_re

From the results of the DNN validation test, many of the predicted values look implausible, and in most runs the predictions are largely identical to one another.
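
A quick numeric check can confirm whether a set of predictions has collapsed to near-identical values. The sketch below is illustrative only: the helper `summarise_predictions` is hypothetical, the target values are the first five test targets printed earlier in this section, and the predictions are made-up numbers, not output from the model.

```python
import numpy as np

def summarise_predictions(preds, targets, baseline_mean):
    """Compare predictions with a constant baseline and report how
    varied the predictions are."""
    preds = np.asarray(preds, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()
    model_rmse = float(np.sqrt(np.mean((targets - preds) ** 2)))
    mean_rmse = float(np.sqrt(np.mean((targets - baseline_mean) ** 2)))
    return {
        "model_rmse": model_rmse,
        "mean_rmse": mean_rmse,
        # Number of distinct predictions after rounding to one decimal place
        "n_distinct": int(len(np.unique(np.round(preds, 1)))),
        # Standard deviation close to zero means near-constant output
        "pred_std": float(np.std(preds)),
    }

demo_targets = np.array([732, 734, 510, 626, 562])
# Hypothetical predictions for illustration only (not model output)
demo_preds = np.array([600.0, 600.0, 600.0, 600.0, 601.0])
summary = summarise_predictions(demo_preds, demo_targets, baseline_mean=600.0)
print(summary)
```

A very small `pred_std` together with a `model_rmse` close to `mean_rmse` would suggest the network has learned little beyond a constant output. In real use, `baseline_mean` would be the mean of the training targets.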

# **4.3 Summary**

From the results of the linear and DNN models, we can see that the linear model performed worse than simply predicting the mean number of collisions, and the DNN model produced implausible, largely identical predictions.  This is not a good model to use.


# **5. Month, Day, Mean Temperature, Max and Min Temp, Precipitation**

I will combine predictors from the better-performing models to determine whether model performance improves.

In [183]:
# Import pandas to allow creation of data frames
import pandas as pd

# Create data frame from github csv file
source_dataframe = pd.read_csv('https://raw.githubusercontent.com/15014370uhi/15014370_DataAnalytics/master/bq-cleansed_weather_reduced_assignment_1.csv', index_col=0,);

In [184]:
# Check that correct data is present
print(source_dataframe) 

      mo  da  day  temp  dewp  ...  mxpsd   max   min  prcp  NUM_COLLISIONS
year                           ...                                         
2012   7   1    7  83.6  63.0  ...    9.9  93.0  66.0  0.00             538
2012   7   2    1  80.3  54.1  ...   15.0  88.0  66.9  0.00             564
2012   7   3    2  79.8  56.7  ...   12.0  88.0  63.0  0.00             664
2012   7   4    3  81.8  65.6  ...   11.1  91.0  68.0  0.06             432
2012   7   6    5  81.9  62.3  ...    9.9  91.0  66.9  0.00             638
...   ..  ..  ...   ...   ...  ...    ...   ...   ...   ...             ...
2020  12   1    2  58.1  54.3  ...   34.0  63.0  45.0  0.88             253
2020  12   2    3  47.5  35.9  ...   22.9  63.0  42.1  0.59             204
2020  12   3    4  45.1  32.4  ...   18.1  52.0  36.0  0.02             218
2020  12   4    5  53.6  43.8  ...   18.1  59.0  36.0  0.00             319
2020  12   5    6  51.9  48.9  ...   32.1  59.0  39.9  0.32             222

[3054 rows 

In [185]:
# Find the total length of data rows
totalNumberOfRows = len(source_dataframe);
print(totalNumberOfRows);

3054


In [186]:
# Import numpy to improve speed of math calculations
import numpy as np

In [187]:
# Shuffle all source data
source_dataframe = source_dataframe.iloc[np.random.permutation(len(source_dataframe))]; 

In [188]:
# Number of rows to reserve for testing
number_of_test_rows = 50

In [189]:
# Store training data as (all rows - number_of_test_rows) of source data
df_train = source_dataframe[:-number_of_test_rows];

# Shuffle training data a second time
df_train = df_train.iloc[np.random.permutation(len(df_train))]; 

# Store validation test data as last number_of_test_rows of rows of source data
df_test = source_dataframe[-number_of_test_rows:];

# Shuffle test data a second time
df_test = df_test.iloc[np.random.permutation(len(df_test))];

# Store only relevant columns of data for this model
df_train = df_train.iloc[:, [0,2,3,9,10,11,12]]; 
df_test = df_test.iloc[:, [0,2,3,9,10,11,12]];
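
The shuffle, split and column selection above can also be written so that the result is repeatable and does not depend on column positions. This is a minimal sketch under stated assumptions: `split_train_test` is a hypothetical helper, the frame is synthetic stand-in data rather than `source_dataframe`, and a single seeded shuffle replaces the repeated unseeded shuffles used in this report.

```python
import numpy as np
import pandas as pd

def split_train_test(df, columns, n_test=50, seed=0):
    """Shuffle once with a fixed seed, keep only the named columns,
    and hold out the last n_test rows for testing."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    selected = shuffled.loc[:, columns]
    return selected.iloc[:-n_test], selected.iloc[-n_test:]

# Small synthetic frame standing in for source_dataframe
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "mo": demo_rng.integers(1, 13, size=200),
    "day": demo_rng.integers(1, 8, size=200),
    "temp": demo_rng.normal(55, 15, size=200).round(1),
    "dewp": demo_rng.normal(40, 15, size=200).round(1),
    "NUM_COLLISIONS": demo_rng.integers(150, 900, size=200),
})
wanted = ["mo", "day", "temp", "NUM_COLLISIONS"]
demo_train, demo_test = split_train_test(demo, wanted, n_test=50, seed=0)
print(demo_train.shape, demo_test.shape)
```

Selecting by name means the selection still works if the column order in the CSV changes, and a fixed `seed` allows two models to be compared on exactly the same held-out rows.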


In [190]:
# Confirm training data is stored 
print(df_train)

      mo  day  temp   max   min  prcp  NUM_COLLISIONS
year                                                 
2019   1    3  44.4  46.9  28.9  0.12             514
2019   8    7  71.2  79.0  62.1  0.00             481
2016   8    5  74.4  81.0  64.9  0.00             707
2018   4    1  46.2  57.9  41.0  0.07             656
2015  10    6  52.7  66.0  41.0  0.07             565
...   ..  ...   ...   ...   ...   ...             ...
2016   2    7   5.8  27.0  -2.0  0.00             309
2020   5    4  48.3  57.0  36.0  0.00             211
2018   6    5  54.8  62.1  43.0  0.00             757
2015  11    5  58.3  61.0  51.1  0.07             677
2018   5    1  53.0  63.0  46.0  0.48             695

[3004 rows x 7 columns]


In [191]:
# Confirm test data rows are stored
print(df_test);

      mo  day  temp   max   min  prcp  NUM_COLLISIONS
year                                                 
2013   6    2  61.8  66.0  57.2  0.34             613
2018  10    3  48.8  59.0  44.1  0.03             719
2017   7    3  63.6  72.0  55.0  0.00             635
2019  11    4  35.0  48.0  21.9  0.00             545
2019  12    2  44.7  52.0  37.9  0.88             443
2015   6    6  68.0  79.0  61.0  0.02             395
2016   2    6  23.5  27.0  12.9  0.00             480
2018   6    1  56.2  68.0  51.1  0.00             686
2016   1    6  45.7  51.1  39.9  1.30             563
2012  11    3  37.1  48.2  30.9  0.00             718
2018   9    7  63.9  75.9  54.0  0.00             597
2015   8    7  67.0  77.0  64.0  0.00             489
2017   8    5  67.7  75.9  61.0  0.00             758
2018   4    5  47.7  51.1  37.0  0.02             755
2015  11    3  39.6  50.0  28.9  0.00             717
2020   4    2  41.8  50.0  39.0  0.13             185
2018  12    1  32.0  43.0  2

In [192]:
# Select Predictor columns for training and testing data to be used to predict the outcome

predictors_train = df_train.iloc[:, [0,1,2,3,4,5]];
predictors_test = df_test.iloc[:,  [0,1,2,3,4,5]];

In [193]:
# confirm predictor holds correct data
print(predictors_train)

      mo  day  temp   max   min  prcp
year                                 
2019   1    3  44.4  46.9  28.9  0.12
2019   8    7  71.2  79.0  62.1  0.00
2016   8    5  74.4  81.0  64.9  0.00
2018   4    1  46.2  57.9  41.0  0.07
2015  10    6  52.7  66.0  41.0  0.07
...   ..  ...   ...   ...   ...   ...
2016   2    7   5.8  27.0  -2.0  0.00
2020   5    4  48.3  57.0  36.0  0.00
2018   6    5  54.8  62.1  43.0  0.00
2015  11    5  58.3  61.0  51.1  0.07
2018   5    1  53.0  63.0  46.0  0.48

[3004 rows x 6 columns]


In [194]:
# confirm predictor holds correct data
print(predictors_test);

      mo  day  temp   max   min  prcp
year                                 
2013   6    2  61.8  66.0  57.2  0.34
2018  10    3  48.8  59.0  44.1  0.03
2017   7    3  63.6  72.0  55.0  0.00
2019  11    4  35.0  48.0  21.9  0.00
2019  12    2  44.7  52.0  37.9  0.88
2015   6    6  68.0  79.0  61.0  0.02
2016   2    6  23.5  27.0  12.9  0.00
2018   6    1  56.2  68.0  51.1  0.00
2016   1    6  45.7  51.1  39.9  1.30
2012  11    3  37.1  48.2  30.9  0.00
2018   9    7  63.9  75.9  54.0  0.00
2015   8    7  67.0  77.0  64.0  0.00
2017   8    5  67.7  75.9  61.0  0.00
2018   4    5  47.7  51.1  37.0  0.02
2015  11    3  39.6  50.0  28.9  0.00
2020   4    2  41.8  50.0  39.0  0.13
2018  12    1  32.0  43.0  21.0  0.00
2016  11    6  47.8  55.9  36.0  0.00
2019   3    5  42.3  48.0  36.0  0.75
2020   1    4  29.5  42.1  24.1  0.34
2015  12    3  46.5  61.0  43.0  0.35
2019   3    6  39.3  46.0  32.0  0.00
2015   5    5  43.8  57.0  39.0  0.00
2015   3    3  36.1  41.0  23.0  0.31
2018   7    

In [195]:
# Select target column (NUM_COLLISIONS) for the training set
targets_train = df_train.iloc[:,6];

# Select target column (NUM_COLLISIONS) for the test set
targets_test = df_test.iloc[:,6];

In [196]:
print(targets_train);

year
2019    514
2019    481
2016    707
2018    656
2015    565
       ... 
2016    309
2020    211
2018    757
2015    677
2018    695
Name: NUM_COLLISIONS, Length: 3004, dtype: int64


In [197]:
print(targets_test);

year
2013    613
2018    719
2017    635
2019    545
2019    443
2015    395
2016    480
2018    686
2016    563
2012    718
2018    597
2015    489
2017    758
2018    755
2015    717
2020    185
2018    622
2016    626
2019    718
2020    517
2015    673
2019    530
2015    642
2015    574
2018    622
2012    511
2018    720
2019    502
2016    448
2013    517
2014    642
2020    154
2018    700
2018    749
2012    703
2019    638
2013    562
2017    603
2014    650
2014    591
2015    618
2015    515
2015    555
2014    616
2014    496
2019    635
2020    497
2018    463
2019    523
2012    547
Name: NUM_COLLISIONS, dtype: int64


In [198]:
# Set scale value
SCALE_NUM_COLLISIONS = 1000.0;
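
The scale value is used twice in the cells that follow: targets are divided by it before training, and predictions are multiplied by it afterwards so that RMSE is reported in collisions per day. A minimal sketch of that round trip is shown here; the helper names are hypothetical, and the sample values are the first five training targets printed above.

```python
import numpy as np

SCALE = 1000.0  # plays the same role as SCALE_NUM_COLLISIONS

def scale_targets(y, n_outputs=1):
    """Reshape targets to (rows, n_outputs) and divide by the scale value."""
    return np.asarray(y, dtype=float).reshape(-1, n_outputs) / SCALE

def unscale_predictions(scaled):
    """Flatten model outputs and convert them back to collision counts."""
    return np.asarray(scaled, dtype=float).ravel() * SCALE

raw = np.array([514, 481, 707, 656, 565])
scaled = scale_targets(raw)
restored = unscale_predictions(scaled)
print(scaled.ravel())
print(restored)
```

Flattening before comparison matters: if predictions have shape `(N, 1)` and targets have shape `(N,)`, NumPy broadcasts the subtraction to an `(N, N)` array and the resulting RMSE is no longer a row-by-row comparison.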

In [199]:
# Get size of training set 
trainsize = int(len(df_train['NUM_COLLISIONS']));

# Get size of test set 
testsize = int(len(df_test['NUM_COLLISIONS']));

In [200]:
print(trainsize);

3004


In [176]:
print(testsize);

50


In [201]:

# Define the number of predictor column input values
nppredictors = len(predictors_train.columns);

# Define the number of target column output values
noutputs = 1;


In [178]:
print(nppredictors)

6


# **5.1 Linear Regression Model**

In [202]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/linear_regression_trained_model_combined_day_temps_percp', ignore_errors=True)
   
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_combined_day_temps_percp', optimizer=tf.train.AdamOptimizer(learning_rate=0.1), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Prints log to show start of training model
print("starting to train...\n");

# Train the model using predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Check predictions based on predictor values
preds = estimator.predict(x=predictors_train.values)

# Apply Scale value to outputs
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE using predictions and targets
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nLinearRegression has RMSE of {0}'.format(rmse));

# Store linear regressor value
rmse_LR = getattr(rmse, "tolist", lambda: rmse)()

# Calculate mean value of Number of Collisions
avg = np.mean(df_train['NUM_COLLISIONS'][:trainsize])

# Calculate the RMSE using Number of Collision Values and the mean of all target values.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2));
print('Just using an average = {0}, has RMSE of {1}'.format(avg, rmse)); 

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

if(rmse_LR < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your Linear Regression model performs better than finding average!');
else: 
  print('\nSorry! On this run, your model performs worse than just finding the average!');


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1af985da0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_combined_day_temps_percp', '_session_creation_timeout_secs': 7200}
starting to train...

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was 

**5.1.1 Linear Regression Validation Test**

Perform linear regression validation test using values from the original data set reserved for testing.


In [203]:

# Perform linear regression validation test using values from the original data set reserved for testing.
                                                                                                 
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model_combined_day_temps_percp', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1afaea4a8>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_combined_day_temps_percp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regressi

# **5.2 Deep Neural Network Regressor**



In [204]:
# Import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# Check the version
print(tf.__version__)

# Required for high-level file management
import shutil  

# logging for tensorflow
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # Suppress verbosity
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) # Show verbosity

# Remove any previously saved training model training
shutil.rmtree('/tmp/DNN_collision_regression_trained_model_combined_day_temps_percp', ignore_errors=True)

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_combined_day_temps_percp', hidden_units=[20,18,14], optimizer=tf.train.AdamOptimizer(learning_rate=0.01), enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Print message to display start of training log
print("starting to train");

# Train the model by passing predictor and target values
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE_NUM_COLLISIONS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_train.values)

# Apply the scale value to convert the outputs back to collision counts.
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS

# Calculate RMSE value to determine how well the model works using prediction and target values.
rmse = np.sqrt(np.mean((targets_train.values - predslistscale)**2))
print('\nDNNRegression has RMSE of {0}'.format(rmse));

# Store DNN regression RMSE value
rmse_DNN = getattr(rmse, "tolist", lambda: rmse)()

# Calculate the mean of the Number of Collision Values.
avg = np.mean(df_train['NUM_COLLISIONS'])

# Calculate RMSE using COLLISION Values and the mean of all target values to determine
# if the DNN model is better than calculating the mean value.
rmse = np.sqrt(np.mean((df_train['NUM_COLLISIONS'] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

# Store RMSE for average
rmse_avg = getattr(rmse, "tolist", lambda: rmse)()

# Output success or failure message for this model
if(rmse_DNN < rmse_avg): # If rmse is lower than average rmse
  print('\nGreat! Your DNN Regression model performs better than finding the average!'); # Success
else: 
  print('\nSorry! But on this run, your DNN Regression model performs worse than just finding the average!'); # Failure


1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff1b1eae6d8>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_collision_regression_trained_model_combined_day_temps_percp', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph w

**5.2.1 Deep Neural Network Validation Test**

In [None]:
# Perform validation assessment of DNN model using test values reserved from original dataset
estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir='/tmp/DNN_collision_regression_trained_model_combined_day_temps_percp', hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

# Use test values reserved from original data set
preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE_NUM_COLLISIONS
pred = str(predslistscale)
print("\n==== Predicted Number of Collisions ====", pred)
print("\n==== Target Collision Values ====", targets_test.values)

From the results of the validation test, many of the predicted values are a very close match to the target values, although, as with previous models, the DNN model struggles to predict some of the higher target values, which it under-estimated.
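
Under-estimation at the high end can be checked by grouping the test rows by target size and looking at the average error in each group. The sketch below is illustrative only: `mean_residual_by_band` is a hypothetical helper, the targets are ten values taken from the test targets printed earlier in this section, and the predictions are made-up numbers rather than model output.

```python
import numpy as np

def mean_residual_by_band(preds, targets, n_bands=3):
    """Sort rows by target value, split them into n_bands groups (low to
    high), and return the mean of (prediction - target) in each group."""
    preds = np.asarray(preds, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()
    order = np.argsort(targets)
    bands = np.array_split(order, n_bands)
    return [float(np.mean(preds[idx] - targets[idx])) for idx in bands]

demo_targets = np.array([154, 185, 395, 443, 545, 613, 635, 719, 755, 758])
# Hypothetical predictions for illustration only (not model output)
demo_preds = np.array([260, 270, 420, 450, 560, 600, 640, 690, 700, 705])
residuals = mean_residual_by_band(demo_preds, demo_targets)
print(residuals)
```

A positive value in the lowest band and a negative value in the highest band would indicate predictions being pulled towards the middle of the range, which is the pattern described for the higher target values.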

# **5.3 Summary**

By combining key data points from the weather and date elements that previously contributed to the better-performing models, the resulting deep neural network model consistently appears to achieve a lower RMSE than the mean baseline.  The linear regression model also performed well, and generally produced a result superior to the mean calculation.

The DNN model in this case could be a good candidate for use as a predictive tool for collision numbers and to inform emergency service planning in New York City.

# **6. Conclusion**

Having built both linear regression and deep neural network regressor models on different datasets and data selections, I can summarise the potential to predict the number of collisions in New York City as follows:

**Linear Regression Models**

Across the models examined, linear regression alone gave mostly unsatisfactory results.  In many cases its performance depended heavily on the specific rows of data used for training, and no selection of data produced a consistently reliable prediction of the number of collisions.

It was surprising to see that, depending on how the data was shuffled, linear regression could sometimes outperform the mean calculation, but this was inconsistent and unreliable, and it was rarely better than the mean.

I suspect the relatively poor performance of linear regression in this case may be because the available datasets do not include all of the data required to make such a prediction.  In reality, multiple complex and chaotic factors contribute to a collision taking place on a particular day or at a particular location, and without including all of them, human factors among them, any linear regression model is likely to be inadequate for consistent and detailed forecasting.


**Deep Neural Network Models**

Compared to the linear regression models, the deep neural network (DNN) models performed consistently better overall, particularly when applied to temperature and precipitation data in combination with day and month data.  As with the linear regression models, the performance of the DNN models relied heavily on the particular training data supplied, which over-simplifies the complex factors that contribute to many collisions.

**Summary**

Overall, DNN models performed better than linear regression for most of the data selected.  I would recommend that the emergency services of New York City use the DNN model trained on *Month, Day, Mean Temperature, Max Temp, Min Temp and Precipitation data*, as during testing this combination gave the most consistent, better-than-mean predictive ability for the number of traffic collisions per day.



