#Introduction

In this part of the project, we import our previously preprocessed data and further process the data before using TensorFlow to train, evaluate, and make predictions about the number of collisions on any given day.

Pradhan and Sameen (2019) state that "predictive analytics and computational models are essential to forecast future scenarios of road safety." They go on to describe how predictive models fall into two main groups: statistical and computational intelligence. For the next part of our assignment, we will use models from both groups. First, we will use a linear regression model (statistical), and later we will use a deep neural network (DNN) (computational intelligence). We will train, evaluate, and use both models to predict the number of traffic collisions on any given day in New York City.

#Import Data

In [None]:
# needed to create the data frame
import pandas as pd

#create non-standardised df
non_stan = pd.read_csv('https://raw.githubusercontent.com/Ritchie-Robinson/22024961_DataAnalytics/main/lmfiltered.csv', index_col=0)

#create standardised df
stan = pd.read_csv('https://raw.githubusercontent.com/Ritchie-Robinson/22024961_DataAnalytics/main/lmfiltered_standardised.csv', index_col=0)


In [None]:
print(non_stan[:10])
print(stan[:10])

    day_num  max_temp  num_collisions
1         4      39.0             586
2         5      19.4             705
3         7      48.9             445
4         3      57.9             502
5         1      33.1             703
6         3      45.0             514
7         4      45.0             580
8         1      16.0             412
9         5      39.9             666
10        4      59.0             592
    day_num  max_temp_standardised  num_collisions
1         4              -1.466071             586
2         5              -2.561928             705
3         7              -0.912551             445
4         3              -0.409351             502
5         1              -1.795946             703
6         3              -1.130604             514
7         4              -1.130604             580
8         1              -2.752026             412
9         5              -1.415751             666
10        4              -0.347848             592


#Shuffle and Normalise

Below we will shuffle the rows in both data sets to minimise the risk of unintended patterns in the data.

In [None]:
# import
import numpy as np

# shuffle rows with iloc using random permutation.
shuffle_non_stan = non_stan.iloc[np.random.permutation(len(non_stan))]

shuffle_stan = stan.iloc[np.random.permutation(len(stan))]
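
The permutation above is unseeded, so each run produces a different row order and therefore slightly different training results. A minimal sketch of a reproducible shuffle follows, assuming a fixed seed is acceptable. The seed value, the helper name `shuffle_rows`, and the toy data frame are illustrative and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def shuffle_rows(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Return a copy of df with its rows in a seeded random order."""
    rng = np.random.default_rng(seed)  # seeded generator, so the order repeats
    return df.iloc[rng.permutation(len(df))].copy()


# toy frame standing in for non_stan (hypothetical values)
toy = pd.DataFrame({"day_num": [1, 2, 3, 4, 5],
                    "num_collisions": [703, 412, 502, 586, 705]})

shuffled_a = shuffle_rows(toy)
shuffled_b = shuffle_rows(toy)
print(shuffled_a.equals(shuffled_b))  # same seed, so the two orders match
```

pandas offers a similar effect through `df.sample(frac=1, random_state=...)`, which may be preferable when NumPy is not otherwise needed.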


Let us identify the min and max values for the number of collisions. This will allow us to better assess and interpret the Mean Absolute Error (MAE) when we evaluate our non-normalised models.

In [None]:
print("Min ", shuffle_stan['num_collisions'].min(), " Max ", shuffle_stan['num_collisions'].max())

Min  353  Max  845


Now that the rows in both datasets have been shuffled, we normalise the collision values in the standardised dataset only, because this column has not yet been rescaled.

In [None]:
SCALE_NUM_COLL = 1.0

shuffle_stan['num_collisions'] = (shuffle_stan['num_collisions'] - shuffle_stan['num_collisions'].min()) / (shuffle_stan['num_collisions'].max() - shuffle_stan['num_collisions'].min()) * SCALE_NUM_COLL

# print out the first 10 rows
print("Normalised output variables first 10 rows:\n\n", shuffle_stan[:10])

Normalised output variables first 10 rows:

       day_num  max_temp_standardised  num_collisions
1022        5               0.267174        0.684959
426         7              -1.024373        0.142276
3786        1              -0.521173        0.699187
1742        4               0.988427        0.780488
1055        3              -0.789546        0.526423
489         1              -1.130604        0.278455
4000        6              -0.571493        0.247967
1788        2               0.267174        0.532520
3928        5              -0.627404        0.193089
3952        4              -1.024373        0.123984
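
The min-max scaling used above, and the inverse applied later to turn predictions back into collision counts, can be sketched as a pair of helper functions. This assumes `SCALE_NUM_COLL` stays at 1.0 and uses the minimum and maximum printed earlier; the function names are illustrative.

```python
import numpy as np

MIN_COLL = 353   # minimum printed earlier in the notebook
MAX_COLL = 845   # maximum printed earlier in the notebook


def normalise(values, lo=MIN_COLL, hi=MAX_COLL):
    """Map values from [lo, hi] onto [0, 1]."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def denormalise(scaled, lo=MIN_COLL, hi=MAX_COLL):
    """Invert normalise(): map [0, 1] back onto [lo, hi]."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


collisions = np.array([353, 599, 845])
scaled = normalise(collisions)
print(scaled)               # 0.0, 0.5, 1.0
print(denormalise(scaled))  # back to 353, 599, 845
```

Keeping both directions in one place avoids hard-coding the minimum and maximum again when predictions are converted back.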


#TensorFlow

Now we will use both sets to train and evaluate different models. For this we will need to import the TensorFlow library.

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

2.15.0


##Model 0 - Day Only Non-Standardised

Create new dataframe.

In [None]:
#create dataframe with inputs and output from imported dataframe.
df_input_data_day = [shuffle_non_stan["day_num"], shuffle_non_stan["num_collisions"]]
#create headers for the new dataframe.
df_input_headers_day = ["day_num", "num_collisions"]
#create final dataframe by concatenating new dataframe and headers.
df_input_day = pd.concat(df_input_data_day, axis=1, keys=df_input_headers_day)

Construct training data.

In [None]:
#construct training set and test set (with 0.8 for an 80% training set and 20% for test)
training_set_day = df_input_day.sample(frac=0.8, random_state=0)
test_set_day = df_input_day.drop(training_set_day.index)

Copy the datasets and remove output variable.

In [None]:
#copy datasets and remove output variable
training_features_day = training_set_day.copy()
test_features_day = test_set_day.copy()

training_labels_day = training_features_day.pop('num_collisions')
test_labels_day = test_features_day.pop('num_collisions')

In [None]:
print(training_features_day)

      day_num
2401        3
2246        5
1927        3
1713        2
1380        4
...       ...
584         1
3585        2
1074        6
1455        6
423         3

[2222 rows x 1 columns]


###Train Model 0

In [None]:
model_0 = tf.keras.Sequential([
    layers.Dense(units=1)
])

In [None]:
model_0.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')
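
A single `Dense` unit with no activation computes a weighted sum of its inputs plus a bias, which for one input feature is the straight line y = w * x + b. The sketch below reproduces that forward pass and the mean absolute error loss in NumPy. The weight, bias, and sample values are made up for illustration and are not the trained parameters of Model 0.

```python
import numpy as np


def dense_forward(x, w, b):
    """Forward pass of one linear unit: y = w * x + b."""
    return w * np.asarray(x, dtype=float) + b


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# hypothetical parameters and data, for illustration only
w, b = 30.0, 400.0
day_num = np.array([1, 3, 5])
observed = np.array([440, 480, 560])

predicted = dense_forward(day_num, w, b)         # 430, 490, 550
print(mean_absolute_error(observed, predicted))  # 10.0
```

During training, the optimiser adjusts `w` and `b` to reduce this loss on the training data.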

In [None]:
%%time
#fit model
#run 100 times (epochs) and apply further 20% validation split.
history = model_0.fit(
    training_features_day,
    training_labels_day,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13.1 s, sys: 641 ms, total: 13.7 s
Wall time: 12.8 s


###Evaluate Model 0

In [None]:
#evaluate model using test features and labels.
mean_absolute_error_model_0 = model_0.evaluate(
    test_features_day,
    test_labels_day, verbose=0)

In [None]:
#print out mean absolute error
print(mean_absolute_error_model_0)

116.93606567382812


The MAE shows that the model's predictions are out by about 117 collisions a day on average. Given that the number of daily collisions ranges from a minimum of 353 to a maximum of 845, the model captures some of the pattern, but the error is still large.
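
One way to put the MAE in context is to express it as a fraction of the observed range of daily collisions. The sketch below does this for the values printed above; the helper name `relative_mae` is illustrative.

```python
def relative_mae(mae: float, lo: float, hi: float) -> float:
    """Express an MAE as a fraction of the range [lo, hi]."""
    return mae / (hi - lo)


mae_model_0 = 116.93606567382812   # value printed by the evaluation cell
min_coll, max_coll = 353, 845      # range printed earlier in the notebook

share = relative_mae(mae_model_0, min_coll, max_coll)
print(f"MAE is {share:.1%} of the observed range")
```

For Model 0 this comes to just under a quarter of the range.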

###Predictions for Model 0

In [None]:
#create custom dataframe with 5 dummy values
input_0 = pd.DataFrame.from_dict(data =
				{
            'day_num' : [3,5,7,1,2]
        })

In [None]:
input_0.head()

   day_num
0        3
1        5
2        7
3        1
4        2


Below we use Model 0 to make predictions and cast them to integers, which drops the fractional part, to give a realistic whole number of potential collisions based on the day of the week.

In [None]:
model_0_predictions = model_0.predict(input_0[:5])
model_0_predictions = model_0_predictions.astype(int)
print(model_0_predictions)

[[520]
 [584]
 [649]
 [455]
 [487]]
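
Note that `astype(int)` truncates rather than rounds. The short sketch below contrasts the two on made-up prediction values, in case rounding to the nearest whole collision is preferred.

```python
import numpy as np

# hypothetical model outputs, for illustration only
raw_predictions = np.array([[519.9], [584.2], [648.7]])

truncated = raw_predictions.astype(int)          # drops the fractional part
rounded = np.rint(raw_predictions).astype(int)   # rounds to the nearest integer

print(truncated.ravel())  # 519, 584, 648
print(rounded.ravel())    # 520, 584, 649
```

The difference is at most one collision per prediction, so it does not change the overall picture.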


##Model 1 - Day Only Non-Standardised but with Normalisation Layer

Create new dataframe.

In [None]:
#create dataframe with inputs and output from imported dataframe.
df_input_data_day_norm = [shuffle_non_stan["day_num"], shuffle_non_stan["num_collisions"]]
#create headers for the new dataframe.
df_input_headers_day_norm = ["day_num", "num_collisions"]
#create final dataframe by concatenating new dataframe and headers.
df_input_day_norm = pd.concat(df_input_data_day_norm, axis=1, keys=df_input_headers_day_norm)

Construct training data.

In [None]:
#construct training set and test set (with 0.8 for an 80% training set and 20% for test)
training_set_day_norm = df_input_day_norm.sample(frac=0.8, random_state=0)
test_set_day_norm = df_input_day_norm.drop(training_set_day_norm.index)

Copy the datasets and remove output variable.

In [None]:
#copy datasets and remove output variable
training_features_day_norm = training_set_day_norm.copy()
test_features_day_norm = test_set_day_norm.copy()

training_labels_day_norm = training_features_day_norm.pop('num_collisions')
test_labels_day_norm = test_features_day_norm.pop('num_collisions')

In [None]:
print(training_features_day_norm)

      day_num
2401        3
2246        5
1927        3
1713        2
1380        4
...       ...
584         1
3585        2
1074        6
1455        6
423         3

[2222 rows x 1 columns]


For this model we will create a normalisation layer.

In [None]:
#create normalisation layer
normaliser_day = tf.keras.layers.Normalization(input_shape=[1,], axis=None)
normaliser_day.adapt(np.array(training_features_day_norm))
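
With `axis=None`, the `Normalization` layer learns a single mean and variance from the data passed to `adapt()` and then rescales each input to `(x - mean) / sqrt(variance)`. The following NumPy sketch mirrors that calculation, ignoring the small epsilon Keras uses to guard against division by zero. The sample values are illustrative.

```python
import numpy as np

day_num = np.array([3, 5, 3, 2, 4, 1, 2, 6, 6, 3], dtype=float)  # sample inputs

mean = day_num.mean()        # statistic learned during adapt()
variance = day_num.var()     # population variance (ddof=0)

normalised = (day_num - mean) / np.sqrt(variance)

print(round(float(normalised.mean()), 6))  # approximately 0
print(round(float(normalised.std()), 6))   # approximately 1
```

Because the statistics are learned from the training features only, the same transformation is applied unchanged to the test data at evaluation time.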

###Train Model 1

Train the model.

In [None]:
model_1 = tf.keras.Sequential([
    normaliser_day,
    layers.Dense(units=1)
])

In [None]:
model_1.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
#fit model
#run 100 times (epochs) and apply further 20% validation split.
history = model_1.fit(
    training_features_day_norm,
    training_labels_day_norm,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13.3 s, sys: 662 ms, total: 13.9 s
Wall time: 21.1 s


###Evaluate Model 1

In [None]:
#evaluate model using test features and labels.
mean_absolute_error_model_1 = model_1.evaluate(
    test_features_day_norm,
    test_labels_day_norm, verbose=0)

In [None]:
#print out mean absolute error
print(mean_absolute_error_model_1)

87.38441467285156


With the added normalisation layer, the MAE falls to around 87 collisions a day. This is an improvement on Model 0, but the error is still substantial.

###Predictions for Model 1

In [None]:
#create custom dataframe with 5 dummy values
input_1 = pd.DataFrame.from_dict(data =
				{
            'day_num' : [2,1,5,7,2]
        })

In [None]:
input_1.head()

   day_num
0        2
1        1
2        5
3        7
4        2


In [None]:
model_1_predictions = model_1.predict(input_1[:5])
model_1_predictions = model_1_predictions.astype(int)
print(model_1_predictions)

[[563]
 [578]
 [518]
 [489]
 [563]]


##Model 2 - Day Only with Normalised Collision Values

Create new dataframe.

In [None]:
#create dataframe with inputs and output from imported dataframe.
input_data_day_stan = [shuffle_stan["day_num"], shuffle_stan["num_collisions"]]
#create headers for the new dataframe.
input_headers_day_stan = ["day_num", "num_collisions"]
#create final dataframe by concatenating new dataframe and headers.
input_day_stan = pd.concat(input_data_day_stan, axis=1, keys=input_headers_day_stan)

Construct training data.

In [None]:
#construct training set and test set (with 0.8 for an 80% training set and 20% for test)
training_set_day_stan = input_day_stan.sample(frac=0.8, random_state=0)
test_set_day_stan = input_day_stan.drop(training_set_day_stan.index)

Copy the datasets and remove output variable.

In [None]:
#copy datasets and remove output variable
training_features_day_stan = training_set_day_stan.copy()
test_features_day_stan = test_set_day_stan.copy()

training_labels_day_stan = training_features_day_stan.pop('num_collisions')
test_labels_day_stan = test_features_day_stan.pop('num_collisions')

In [None]:
print(training_features_day_stan)

      day_num
2483        5
3019        5
1946        5
619         5
1312        4
...       ...
947         5
2794        4
4085        4
1021        7
491         3

[2222 rows x 1 columns]


###Train Model 2

In [None]:
model_2 = tf.keras.Sequential([
    layers.Dense(units=1)
])

In [None]:
model_2.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
#fit model
#run 100 times (epochs) and apply further 20% validation split.
history = model_2.fit(
    training_features_day_stan,
    training_labels_day_stan,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13.1 s, sys: 634 ms, total: 13.7 s
Wall time: 21.1 s


###Evaluate Model 2

In [None]:
#evaluate model using test features and labels.
mean_absolute_error_model_2 = model_2.evaluate(
    test_features_day_stan,
    test_labels_day_stan, verbose=0)

In [None]:
#print out mean absolute error
print(mean_absolute_error_model_2)

0.14864429831504822


###Predictions for Model 2

In [None]:
#create custom dataframe with 5 dummy values
input_2 = pd.DataFrame.from_dict(data =
				{
            'day_num' : [3,4,6,1,7]
        })

In [None]:
input_2.head()

   day_num
0        3
1        4
2        6
3        1
4        7


In [None]:

model_2_predictions = model_2.predict(input_2[:5])

print("\n\nNormalised:\n", model_2_predictions)

SCALE_NUM_COLL = 1.0

min_val = 353
max_val = 845

unnormalised_predictions_2 = model_2_predictions / SCALE_NUM_COLL * (max_val - min_val) + min_val

unnormalised_predictions_2 = unnormalised_predictions_2.astype(int)

print("\nAbsolute Values:\n", unnormalised_predictions_2)







Normalised:
 [[0.47863847]
 [0.44864917]
 [0.38867053]
 [0.53861713]
 [0.3586812 ]]

Absolute Values:
 [[588]
 [573]
 [544]
 [617]
 [529]]


##Model 3 - Day and Weather (Standardised) with Normalised Collisions Values

Create new dataframe.

In [None]:
#create dataframe with inputs and output from imported dataframe.
input_data_stan = [shuffle_stan["day_num"], shuffle_stan["max_temp_standardised"], shuffle_stan["num_collisions"]]
#create headers for the new dataframe.
input_headers_stan = ["day_num", "max_temp_standardised", "num_collisions"]
#create final dataframe by concatenating new dataframe and headers.
input_stan = pd.concat(input_data_stan, axis=1, keys=input_headers_stan)

Construct training data.

In [None]:
#construct training set and test set (with 0.8 for an 80% training set and 20% for test)
training_set_stan = input_stan.sample(frac=0.8, random_state=0)
test_set_stan = input_stan.drop(training_set_stan.index)

Copy the datasets and remove output variable.

In [None]:
#copy datasets and remove output variable
training_features_stan = training_set_stan.copy()
test_features_stan = test_set_stan.copy()

training_labels_stan = training_features_stan.pop('num_collisions')
test_labels_stan = test_features_stan.pop('num_collisions')

In [None]:
print(training_features_stan)

      day_num  max_temp_standardised
2483        5               1.100249
3019        5               1.553129
1946        5               1.273574
619         5              -1.180924
1312        4              -0.521173
...       ...                    ...
947         5              -0.962871
2794        4               1.553129
4085        4              -0.627404
1021        7               0.043529
491         3              -2.187324

[2222 rows x 2 columns]


###Train Model 3

In [None]:
model_3 = tf.keras.Sequential([
    layers.Dense(units=1)
])

In [None]:
model_3.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
#fit model
#run 100 times (epochs) and apply further 20% validation split.
history = model_3.fit(
    training_features_stan,
    training_labels_stan,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13.4 s, sys: 636 ms, total: 14.1 s
Wall time: 13.2 s


###Evaluate Model 3

In [None]:
#evaluate model using test features and labels.
mean_absolute_error_model_3 = model_3.evaluate(
    test_features_stan,
    test_labels_stan, verbose=0)

In [None]:
#print out mean absolute error
print(mean_absolute_error_model_3)

0.16116999089717865


###Predictions for Model 3

In [None]:
#create custom dataframe with 5 dummy values for each input column
input_3 = pd.DataFrame.from_dict(data =
				{
            'day_num' : [7,1,3,6,5],
             'max_temp_standardised' : [1.391727, 1.539546, -0.063549, -1.213534, -0.113404,]
        })

In [None]:
input_3.head()

   day_num  max_temp_standardised
0        7               1.391727
1        1               1.539546
2        3              -0.063549
3        6              -1.213534
4        5              -0.113404


In [None]:
model_3_predictions = model_3.predict(input_3[:5])

print("\n\nNormalised:\n", model_3_predictions)

SCALE_NUM_COLL = 1.0

min_val = 353
max_val = 845

unnormalised_predictions_3 = model_3_predictions / SCALE_NUM_COLL * (max_val - min_val) + min_val

unnormalised_predictions_3 = unnormalised_predictions_3.astype(int)

print("\nAbsolute Values:\n", unnormalised_predictions_3)






Normalised:
 [[0.49540544]
 [0.5732825 ]
 [0.5669234 ]
 [0.54155844]
 [0.5409717 ]]

Absolute Values:
 [[596]
 [635]
 [631]
 [619]
 [619]]


#Conclusion

Overall, none of the above models is perfect, and all require further work to improve their accuracy. However, their predictions fall within the observed range of daily collisions, and the process has demonstrated the importance of standardisation and normalisation in improving the accuracy of machine learning models.

Model 0 used only the non-standardised day-of-week variable and achieved an MAE of approximately 117, meaning that its predictions are off by about 117 collisions per day on average. Because the output variable was neither standardised nor normalised, this error is in absolute collisions. Model 1 used the same day-only input but added a normalisation layer, and its MAE dropped to approximately 87, a notable improvement.

Furthermore, this project illustrated how input variables that correlate strongly with the output help to improve linear models, whereas weakly correlated variables add noise and reduce accuracy. Model 2, which used only the day of the week with normalised collision values, produced an MAE of approximately 0.1486. When max temperature was added in Model 3, the MAE increased to approximately 0.1612, indicating lower accuracy. Both of these values are on the normalised 0 to 1 scale, so they are not directly comparable with the MAE values of Models 0 and 1.
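
To compare all four models on one scale, a normalised MAE can be multiplied by the range used for min-max scaling, since each absolute error was divided by that range. The sketch below applies this to the values printed in the evaluation cells; the helper name `mae_to_collisions` is illustrative.

```python
MIN_COLL, MAX_COLL = 353, 845          # range used for min-max scaling
SCALE_NUM_COLL = 1.0                   # scale factor from the notebook


def mae_to_collisions(mae_normalised: float) -> float:
    """Rescale an MAE measured on the 0-1 scale back to collisions per day."""
    return mae_normalised / SCALE_NUM_COLL * (MAX_COLL - MIN_COLL)


mae_by_model = {
    "model_0": 116.93606567382812,                      # already in collisions
    "model_1": 87.38441467285156,                       # already in collisions
    "model_2": mae_to_collisions(0.14864429831504822),
    "model_3": mae_to_collisions(0.16116999089717865),
}

for name, mae in mae_by_model.items():
    print(f"{name}: {mae:.1f} collisions per day")
```

On this scale Models 2 and 3 come out at roughly 73 and 79 collisions per day. The comparison is indicative only, because the two datasets were shuffled independently and so the models were evaluated on different test rows.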

Several areas of further work could improve the accuracy of these models.

Data Pre-processing:
Grady et al. (2017) describe how data cleansing can be an open-ended task and how an agile approach can help to solve the problem of poorly managed data. Using an agile methodology within this project, we can iteratively revisit each stage of the project, including the initial data pre-processing stage and thus, identify any errors that may have been missed previously, or look again at the threshold of where outliers are situated within the dataset.

Feature Engineering:
Dong and Huan (2018) state that machine learning tasks rely on effective feature engineering. Therefore, to improve our models, further feature engineering is an important consideration for future work. It is likely that there are other relevant features that could improve the model's predictive power, such as location details identified in the previous data science element of this project. Although these were omitted on this occasion, these variables would likely be very useful predictors.

Cross-Validation:
To ensure that the models generalise well to new data, splitting the data into training and test sets is an important step (Google, 2024). We used an 80/20 split between training and test data. Future work could use cross-validation, which splits the data into blocks (folds) and uses each block in turn as the test set while training on the rest. For example, splitting the data into five blocks of 20% each is known as five-fold cross-validation, or more generally k-fold cross-validation for k blocks (Shrivastava, 2020). Together, a held-out test set and cross-validation allow us to assess how well the model generalises to unseen data and to check that it is not overfitting the training data.
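
A minimal sketch of k-fold cross-validation is shown below, using NumPy only. It assumes a least-squares straight-line fit via `np.polyfit` as a stand-in for the Keras model, and it runs on synthetic data rather than the real collision dataset; the function name `k_fold_mae` and all values are illustrative.

```python
import numpy as np


def k_fold_mae(x, y, k=5, seed=0):
    """Return the test MAE of a straight-line fit for each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, k)       # k roughly equal blocks
    scores = []
    for i, test_idx in enumerate(folds):
        # train on every block except the one held out for testing
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[test_idx] + intercept
        scores.append(float(np.mean(np.abs(y[test_idx] - predictions))))
    return scores


# synthetic stand-in for the collision data (not the real dataset)
rng = np.random.default_rng(1)
day_num = rng.integers(1, 8, size=200).astype(float)
num_collisions = 450 + 30 * day_num + rng.normal(0, 40, size=200)

fold_scores = k_fold_mae(day_num, num_collisions, k=5)
print([round(s, 1) for s in fold_scores])
print("mean MAE:", round(float(np.mean(fold_scores)), 1))
```

Averaging the per-fold MAE gives a steadier estimate than a single split, and a large spread between folds would suggest that the result depends heavily on which rows end up in the test set.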

Use a different model:
Cross-validation could also be used to compare different models across the same blocks, because our simple linear regression model may not capture the complexity of the underlying relationships in the data, particularly for a problem with so many potential variables. We could therefore consider a more complex model, such as polynomial regression (pavankumar, 2023) or another form of non-linear regression. In the next part of this project, we will use a deep neural network with multiple layers. Both polynomial regression and DNNs can capture non-linear patterns and relationships in data.
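
As an illustration of why a non-linear model can help, the sketch below fits a straight line and a quadratic to synthetic data with a curved relationship and compares their MAE. The data, coefficients, and the helper name `fit_and_score` are made up for this example and do not come from the collision dataset.

```python
import numpy as np

# synthetic data with a curved relationship (illustrative, not the real dataset)
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 100)
y = 600 - 40 * x**2 + rng.normal(0, 10, size=x.size)


def fit_and_score(x, y, degree):
    """Least-squares polynomial fit; returns the MAE on the fitted data."""
    coefficients = np.polyfit(x, y, deg=degree)
    predictions = np.polyval(coefficients, x)
    return float(np.mean(np.abs(y - predictions)))


mae_linear = fit_and_score(x, y, degree=1)
mae_quadratic = fit_and_score(x, y, degree=2)
print(f"linear MAE: {mae_linear:.1f}, quadratic MAE: {mae_quadratic:.1f}")
```

The error here is measured on the same data used for fitting, so a lower value for the higher-degree model should be confirmed on held-out data before concluding that it generalises better.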

Ultimately, iteratively experimenting with each step in the process and assessing the impact will help us to improve model accuracy.

#Bibliography

Dong, G. and Huan, L.e. (2018) Feature Engineering for Machine Learning and Data Analytics. Boca Raton, FL: CRC Press.

Google (2024) Training and Test Sets: Splitting Data. Available at: https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data (Accessed: 7 Jan 2024).

Grady, N.W., Payne, J.A. and Parker, H. (2017) 'Agile big data analytics: AnalyticsOps for data science', 2017 IEEE International Conference on Big Data (Big Data). Boston, MA: IEEE, pp.2331-39.

pavankumar (2023) Polynomial Regression Unraveled: Modeling Complex Relationships. Available at: https://medium.com/@uppalapavankumar18/polynomial-regression-unraveled-modeling-complex-relationships-bfc2b347e397 (Accessed: 7 Jan 2024).

Pradhan, B. and Sameen, M.I. (2019) 'Predicting Injury Severity of Road Traffic Accidents Using a Hybrid Extreme Gradient Boosting and Deep Neural Network Approach', Laser Scanning Systems in Highway and Safety Assessment, pp.119–27.

Shrivastava, S. (2020) Cross Validation in Time Series. Available at: https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4 (Accessed: 7 Jan 2024).