Before creating the DNN model, we need to import pandas, NumPy, TensorFlow and scikit-learn. We need pandas for loading and manipulating the dataset, NumPy for numerical operations, TensorFlow for building and training the model, and the MinMaxScaler class from scikit-learn to normalise the feature and target variables.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler

After that we need to load the data that will be used to train the model.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/23009256uhi/23009256_DataAnalytics/main/model_data_dnn.csv')

In [None]:
print(df.head())
print(df.describe())

   day  year  mo  da collision_date  temp  dewp     slp  visib  wdsp  mxpsd  \
0    2  2018   9  23     2018-09-23  59.8  50.2  1023.4   10.0   3.0    5.1   
1    6  2018  12  20     2018-12-20  38.6  34.4  1020.2    9.6   5.0    7.0   
2    4  2013  11   5     2013-11-05  43.5  30.4  1037.8   10.0   3.9    7.0   
3    5  2012   7  11     2012-07-11  77.1  62.0  1019.9   10.0   1.9    7.0   
4    6  2012   8   9     2012-08-09  78.2  69.6  1013.6    9.3   2.3    7.0   

   gust   max   min  prcp  fog  NUM_COLLISIONS   z_score  
0   NaN  78.1  53.1   0.0    0             475 -0.112394  
1   NaN  48.0  21.0   0.0    0             806  1.942315  
2   NaN  50.0  37.9   0.0    0             510  0.104871  
3   NaN  84.0  64.9   0.0    0             565  0.446288  
4  15.0  88.0  61.0   0.0    0             581  0.545609  
              day         year          mo          da        temp  \
count  662.000000   662.000000  662.000000  662.000000  662.000000   
mean     3.895770  2015.580060 

Let's check if there are any NaN values.

In [None]:
print(df.isna().sum())

day                 0
year                0
mo                  0
da                  0
collision_date      0
temp                0
dewp                0
slp                 0
visib               0
wdsp                0
mxpsd               0
gust              594
max                 0
min                 0
prcp                0
fog                 0
NUM_COLLISIONS      0
z_score             0
dtype: int64


As 'gust' is NaN in 594 of the 662 rows, we can just drop it. We can also drop the z-score and the collision date, as neither is used as an input to the model.

In [None]:
# Drop non-numeric or irrelevant columns
df = df.drop(columns=['collision_date', 'z_score', 'gust'])
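
The decision to drop 'gust' can also be made from the data rather than by eye. The following is a minimal sketch on a small toy frame; the 0.5 threshold and the names `toy`, `MAX_MISSING` and `sparse_columns` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the collision data (a few illustrative rows).
toy = pd.DataFrame({
    "temp": [59.8, 38.6, 43.5, 77.1],
    "gust": [np.nan, np.nan, np.nan, 15.0],
    "NUM_COLLISIONS": [475, 806, 510, 565],
})

# Fraction of missing values in each column.
missing_fraction = toy.isna().mean()

# Hypothetical rule: drop any column with more than half of its values missing.
MAX_MISSING = 0.5
sparse_columns = missing_fraction[missing_fraction > MAX_MISSING].index.tolist()
cleaned = toy.drop(columns=sparse_columns)

print(missing_fraction)
print("Dropped:", sparse_columns)
```

On the real data, the same expression applied to `df` would flag 'gust', since 594 of its 662 values are missing.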

Now we have to split the data into a training set and a test set.

In [None]:
# Train-test split: the training set (80% of the data) is used to train the model,
# and the test set (the remaining 20%) is used to evaluate it.
training_set = df.sample(frac=0.8, random_state=0)
test_set = df.drop(training_set.index)

training_features = training_set.copy()
test_features = test_set.copy()

training_labels = training_features.pop('NUM_COLLISIONS')
test_labels = test_features.pop('NUM_COLLISIONS')
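
A quick sanity check on the split is to confirm that the two sets do not share rows and together cover the whole frame. This is a sketch on a toy frame (the names `toy`, `train`, `test` and `overlap` are illustrative). It assumes, as the code above does, that the DataFrame index is unique, because `drop(train.index)` removes rows by index label.

```python
import pandas as pd

# Toy frame with a default (unique) integer index.
toy = pd.DataFrame({"temp": range(10), "NUM_COLLISIONS": range(100, 110)})

# Same pattern as above: sample 80% of the rows, keep the rest for testing.
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# The index sets should not overlap and should add up to the full frame.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```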

We also need to create separate scalers for the feature and target variables. The MinMaxScaler class from the scikit-learn library is used for this purpose.

In [None]:
# Create a scaler object for the features
feature_scaler = MinMaxScaler()

# Create a separate scaler object for the target
target_scaler = MinMaxScaler()

# Fit the scaler to the training data and transform both training and test data
training_features_scaled = feature_scaler.fit_transform(training_features)
test_features_scaled = feature_scaler.transform(test_features)

# Fit the scaler to the training target data and transform both training and test target data
training_labels_scaled = target_scaler.fit_transform(training_labels.values.reshape(-1, 1))
test_labels_scaled = target_scaler.transform(test_labels.values.reshape(-1, 1))
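
To make the scaling step concrete, the sketch below reproduces the min-max formula, `(x - min) / (max - min)`, in plain NumPy for a single column. It illustrates what MinMaxScaler computes with its default range of 0 to 1, not how the library implements it; the helper `min_max_scale` and the toy arrays are assumptions, and the sketch does not handle a constant column (where max equals min).

```python
import numpy as np

# Toy single-column data standing in for one feature.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [25.0]])

# The minimum and maximum come from the training data only,
# so no information from the test set leaks into the scaling.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def min_max_scale(x):
    """Scale x with the training minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max_scale(train)
test_scaled = min_max_scale(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```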

In [None]:
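# Note: this Normalization layer is adapted here but is not added to the model below,
# which is trained on the MinMaxScaler-scaled features instead.
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features))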

Now we can create a function to build the model. After building the model with this function, we will compile it, train it and finally evaluate its performance.

In [None]:
# Model building function
def build_dnn_model():
    model = tf.keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=[len(training_features.keys())]),
        layers.Dense(64, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])
    return model
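
As a rough check on the size of this network, the number of trainable parameters in a stack of dense layers can be worked out by hand: each layer has one weight per input-output pair plus one bias per unit. The sketch below assumes 14 input features, which is the number of columns left after the drops above once NUM_COLLISIONS is removed; `dense_param_count` is a helper written for this example.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total

# Assumed: 14 features (day, year, mo, da, temp, dewp, slp, visib,
# wdsp, mxpsd, max, min, prcp, fog).
n_features = 14
total_params = dense_param_count(n_features, [64, 64, 32, 1])
print(total_params)
```

If the assumption about the feature count holds, `dnn_model.summary()` should report the same total.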

In [None]:
# 1. Build the model
dnn_model = build_dnn_model()

In [None]:
# 2. Compile the model
dnn_model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                  loss='mean_absolute_error')

In [None]:
# 3. Train the model
history = dnn_model.fit(
    training_features_scaled,
    training_labels_scaled,
    epochs=100,
    validation_split=0.2,
    verbose=0
)
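
The `history` object returned by `fit` stores the per-epoch training and validation loss in `history.history` under the keys `loss` and `val_loss`. The sketch below shows one way to inspect such curves; the values in `toy_history` are made up for illustration and are not results from this model.

```python
import numpy as np

# Hypothetical loss curves; the real ones come from history.history.
toy_history = {
    "loss":     [0.30, 0.20, 0.15, 0.12, 0.10, 0.09],
    "val_loss": [0.32, 0.22, 0.18, 0.17, 0.19, 0.21],
}

val_loss = np.array(toy_history["val_loss"])

# Epoch (0-based) with the lowest validation loss.
best_epoch = int(val_loss.argmin())

# A widening gap between training and validation loss suggests overfitting.
gap = val_loss[-1] - toy_history["loss"][-1]

print(f"Lowest validation loss at epoch {best_epoch + 1}: {val_loss[best_epoch]:.2f}")
print(f"Final train/validation gap: {gap:.2f}")
```

Note that with `validation_split=0.2`, Keras takes the validation portion from the last 20% of the arrays passed to `fit`, before any shuffling.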

In [None]:
# 4. Evaluate the model
dnn_model_results = dnn_model.evaluate(test_features_scaled, test_labels_scaled, verbose=2)
print(dnn_model_results)

5/5 - 0s - loss: 0.1051 - 31ms/epoch - 6ms/step
0.1051439493894577
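
The loss above is a mean absolute error on the scaled target, so it is a fraction of the target's training range rather than a number of collisions. Because min-max scaling is linear, multiplying the scaled error by `(max - min)` gives the error in original units. The sketch below checks this on toy numbers; `target_min`, `target_max` and the arrays are hypothetical.

```python
import numpy as np

# Hypothetical target range, standing in for the training minimum and maximum.
target_min, target_max = 200.0, 1000.0
data_range = target_max - target_min

# Toy actual and predicted collision counts.
actual = np.array([475.0, 806.0, 510.0])
predicted = np.array([500.0, 700.0, 560.0])

def scale(y):
    """Min-max scale the target with the assumed range."""
    return (y - target_min) / data_range

mae_scaled = np.mean(np.abs(scale(actual) - scale(predicted)))
mae_original = np.mean(np.abs(actual - predicted))

# The two printed values should agree up to floating-point error.
print(mae_scaled * data_range, mae_original)
```

With the objects in this notebook, the equivalent conversion would be `dnn_model_results * target_scaler.data_range_[0]`.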


At this point we can test the model by using the November collision data. First we have to load the data.

In [None]:
df_nov = pd.read_csv('https://raw.githubusercontent.com/23009256uhi/23009256_DataAnalytics/main/november_collated_collision_data.csv')

In [None]:
print(df_nov.head())
print(df_nov.describe())

   day  year  mo  da collision_date  temp  dewp     slp  visib  wdsp  mxpsd  \
0    1  2023  11  13     11/13/2023  39.7  29.6  1026.7   10.0   3.8    8.0   
1    7  2023  11   5      11/5/2023  53.0  48.0  1016.7    9.4   5.0   11.1   
2    7  2023  11  12     11/12/2023  41.2  29.6  1030.0   10.0   7.1   11.1   
3    2  2023  11  21     11/21/2023  38.7  24.2  1033.0   10.0   8.8   12.0   
4    1  2023  11   6      11/6/2023  48.4  43.7  1019.6    9.9   7.9   13.0   

    gust   max   min  prcp   sndp  fog  NUM_COLLISIONS  
0  999.9  48.0  30.0   0.0  999.9    0             300  
1  999.9  61.0  43.0   0.0  999.9    0             235  
2   14.0  54.0  30.9   0.0  999.9    0             233  
3   19.0  46.0  30.9   0.0  999.9    0             352  
4   19.0  61.0  39.9   0.0  999.9    0             275  
             day    year    mo         da       temp       dewp          slp  \
count  30.000000    30.0  30.0  30.000000  30.000000  30.000000    30.000000   
mean    3.966667  2023.

We can extract the actual number of collisions from the data to compare it later with the model predictions.

In [None]:
# Extract the actual number of collisions
actual_collisions = df_nov['NUM_COLLISIONS']

Now we have to drop the columns that are not used as features, including the target column.

In [None]:
df_nov = df_nov.drop(columns=['collision_date', 'sndp', 'gust', 'NUM_COLLISIONS'])
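
The scaler and the model both expect the same columns, in the same order, as the training features. A small check before scaling can catch a mismatch early. This is a sketch on a toy frame: `train_columns` is written out by hand here, whereas in the notebook it would be `list(training_features.columns)`.

```python
import pandas as pd

# Column order assumed to match the training features above.
train_columns = ["day", "year", "mo", "da", "temp", "dewp", "slp", "visib",
                 "wdsp", "mxpsd", "max", "min", "prcp", "fog"]

# Toy one-row frame with the right columns in the wrong order.
nov = pd.DataFrame([{name: 0.0 for name in reversed(train_columns)}])

# Fail loudly if any column is missing or unexpected.
missing = [name for name in train_columns if name not in nov.columns]
extra = [name for name in nov.columns if name not in train_columns]
if missing or extra:
    raise ValueError(f"Column mismatch: missing={missing}, extra={extra}")

# Reorder the columns to match the training data.
nov_aligned = nov[train_columns]
print(list(nov_aligned.columns) == train_columns)
```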

The next steps are to scale the data with the same feature scaler that was fitted on the training data, make predictions with the trained model, and finally inverse-scale the predictions to convert them back to the original scale.

In [None]:
# Scale the data
nov_features_scaled = feature_scaler.transform(df_nov)

# Make predictions with the model
nov_predictions_scaled = dnn_model.predict(nov_features_scaled)

# Inverse scale the predictions
nov_predictions = target_scaler.inverse_transform(nov_predictions_scaled)
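
When new data is scaled with a scaler fitted on older data, some features can land outside the 0 to 1 range the model saw during training, which means the model is extrapolating. The sketch below flags such columns on toy frames. The training years used here are an assumption for illustration; the truncated output above does not show the real range, so whether this applies to the November data would need to be checked on `nov_features_scaled`.

```python
import pandas as pd

# Toy training data with a hypothetical range of years and temperatures.
train = pd.DataFrame({"year": [2012, 2015, 2018], "temp": [30.0, 60.0, 90.0]})

# Toy new data, using two of the November rows shown above.
new = pd.DataFrame({"year": [2023, 2023], "temp": [39.7, 53.0]})

# Min-max scale the new data with the training statistics.
col_min, col_max = train.min(), train.max()
new_scaled = (new - col_min) / (col_max - col_min)

# Flag columns where any scaled value falls outside [0, 1].
out_of_range = ((new_scaled < 0) | (new_scaled > 1)).any()
flagged = out_of_range[out_of_range].index.tolist()

print(new_scaled)
print("Outside training range:", flagged)
```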



Let's now create a new data frame and add the predicted collisions and the actual collisions to check the accuracy of the model.

In [None]:
nov_results_df = pd.DataFrame({
    'Actual Collisions': actual_collisions,
    'Predicted Collisions': nov_predictions.flatten()})

print(nov_results_df)

    Actual Collisions  Predicted Collisions
0                 300            551.688660
1                 235            694.152405
2                 233            784.981873
3                 352            585.987854
4                 275            605.349182
5                 240            537.905640
6                 320            689.820496
7                 290            730.942139
8                 203            562.793518
9                 263            563.291687
10                294            647.820251
11                233            684.898621
12                242            497.066803
13                216            569.610718
14                238            715.317810
15                266            664.935486
16                315            606.781433
17                208            561.381714
18                270            702.380554
19                220            545.986328
20                267            714.156250
21                275           
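
The size of the gap between predicted and actual values can be summarised with a mean absolute error and a mean predicted-to-actual ratio. The sketch below uses only the first five rows printed above, so its figures describe that sample rather than the whole month; the names `sample`, `errors`, `mae` and `mean_ratio` are illustrative.

```python
import pandas as pd

# First five rows of the printed results above.
sample = pd.DataFrame({
    "Actual Collisions": [300, 235, 233, 352, 275],
    "Predicted Collisions": [551.688660, 694.152405, 784.981873,
                             585.987854, 605.349182],
})

# Signed error: positive values mean the model overestimates.
errors = sample["Predicted Collisions"] - sample["Actual Collisions"]

mae = errors.abs().mean()
mean_ratio = (sample["Predicted Collisions"] / sample["Actual Collisions"]).mean()

print(f"MAE: {mae:.1f} collisions")
print(f"Mean predicted/actual ratio: {mean_ratio:.2f}")
```

The same calculation on `nov_results_df` would cover all 30 days.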

The DNN model shows a similar pattern to the linear regression model in terms of predicting the number of collisions. In both cases, the models' predictions are approximately double the actual number of collisions. This consistent overestimation might indicate that the models are not taking into account the changes in traffic patterns due to the pandemic. To improve the models' predictions, we could incorporate data from both before and during the COVID-19 pandemic. This could help the models adapt to the shifts in traffic patterns caused by the pandemic and lead to more accurate predictions.