Before creating the linear regression model, we need to import pandas, NumPy and TensorFlow. We use pandas to load and manipulate the dataset, NumPy for numerical operations, and TensorFlow to build and train the model.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

After that we need to load the data that will be used to train the model.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/23009256uhi/23009256_DataAnalytics/main/model_data.csv')

In [None]:
print(df.head())
print(df.describe())

   day  temp  dewp  visib  NUM_COLLISIONS
0    2  59.8  50.2   10.0             475
1    6  38.6  34.4    9.6             806
2    4  43.5  30.4   10.0             510
3    5  77.1  62.0   10.0             565
4    6  78.2  69.6    9.3             581
              day        temp        dewp      visib  NUM_COLLISIONS
count  662.000000  662.000000  662.000000  662.00000      662.000000
mean     3.895770   59.225076   52.460423    8.54864      593.027190
std      1.987026   12.978881   14.408704    2.12104       89.757713
min      1.000000   22.200000    8.300000    1.50000      264.000000
25%      2.000000   51.000000   43.625000    7.70000      528.250000
50%      4.000000   63.000000   56.700000    9.70000      596.000000
75%      6.000000   68.875000   64.000000   10.00000      654.750000
max      7.000000   85.300000   73.600000   10.00000      841.000000


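Now we can calculate the standard deviation of 'NUM_COLLISIONS' and use it to normalise the target value. Dividing the target by its standard deviation puts it on a scale comparable to the normalised features, which tends to make training more stable. The same factor is used later to convert predictions back to collision counts.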

In [None]:
# Normalisation scale: the standard deviation of the target, used to scale 'NUM_COLLISIONS' for training and to rescale predictions afterwards.
SCALE_NUM_COLLISIONS = df['NUM_COLLISIONS'].std()
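
As a minimal sketch of what this scaling does, the snippet below applies the same idea to the five 'NUM_COLLISIONS' values printed by `df.head()` above, using only the standard library. The variable names are illustrative. `statistics.stdev` computes the sample standard deviation, which is also the default for the pandas `.std()` method.

```python
from statistics import stdev

# First five NUM_COLLISIONS values from the df.head() output above
collisions = [475, 806, 510, 565, 581]

# Sample standard deviation (ddof=1), the same default pandas uses for .std()
scale = stdev(collisions)

# Normalise: divide by the scale. The values are rescaled, not centred.
normalised = [c / scale for c in collisions]

# Undo the normalisation: multiply by the same scale
restored = [n * scale for n in normalised]

print(normalised)
print(restored)
```

After the division the values have a standard deviation of 1, and multiplying by the same factor recovers the original counts.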

We can now write a function that creates and trains a model for a given set of features.

In [None]:
# Function to create and train a model for a given set of features
def train_model(features):
    # Select features from the df along with the target column "NUM_COLLISIONS" to create a new df
    df_input = df[features + ["NUM_COLLISIONS"]]

    # Train-test split, training set (80% of the data) and test set (the remaining 20%). The training set is used to train the model, the test set is used to evaluate the model.
    training_set = df_input.sample(frac=0.8, random_state=0)
    test_set = df_input.drop(training_set.index)

    # Separate features (input variables) and labels (target value)
    training_features = training_set.copy()
    test_features = test_set.copy()

    # Extract the target variable ('NUM_COLLISIONS') from both sets and normalise it
    training_labels = training_features.pop('NUM_COLLISIONS') / SCALE_NUM_COLLISIONS
    test_labels = test_features.pop('NUM_COLLISIONS') / SCALE_NUM_COLLISIONS

    # Feature normalisation using a Keras Normalization layer. adapt() learns the mean and variance of each feature from the training set.
    normaliser = tf.keras.layers.Normalization(axis=-1)
    normaliser.adapt(np.array(training_features))

    # 1. Linear model definition using TensorFlow's Keras API. It takes the normalised features as input and outputs a single value (the predicted number of collisions).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(training_features.shape[1],)),
        normaliser,
        layers.Dense(units=1)
    ])

    # 2. Linear model compilation
    model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.1), loss='mean_absolute_error')

    # 3. Linear model training, trained for 100 epochs, 20% of the training data used as validation set to monitor the model's performance during training. No training logs are printed during the process.
    history = model.fit(training_features, training_labels, epochs=100, verbose=0, validation_split=0.2)

    # 4. Linear model evaluation using test_features and test_labels. The returned loss (MAE) is in normalised units, i.e. multiples of SCALE_NUM_COLLISIONS.
    mae = model.evaluate(test_features, test_labels, verbose=0)
    print(f"Mean Absolute Error for {features} model: {mae}")

    return model
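
To make the structure of this model concrete, here is a small NumPy sketch of the arithmetic that a normalisation step followed by a single dense unit performs: each feature is standardised, then combined in a weighted sum plus a bias. The function name, the toy inputs, the weights and the bias are all made up for illustration and are not the values learned by the model above.

```python
import numpy as np

def linear_predict(x, mean, std, weights, bias):
    """Standardise the features, then apply one linear unit: z @ w + b."""
    z = (x - mean) / std
    return z @ weights + bias

# Toy inputs: two samples with two features each (hypothetical values)
x = np.array([[1.0, 10.0],
              [3.0, 30.0]])

# Per-feature statistics, playing the role of what adapt() learns
mean = x.mean(axis=0)   # [2.0, 20.0]
std = x.std(axis=0)     # [1.0, 10.0] (population standard deviation)

# Hypothetical weights and bias; in the real model these are learned during fit()
weights = np.array([0.5, 0.25])
bias = 6.0

print(linear_predict(x, mean, std, weights, bias))  # [5.25 6.75]
```

Because there is no hidden layer or activation function, the prediction is a linear function of the inputs, which is why this model is a linear regression.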

We can now train a model using the function defined above.

In [None]:
model = train_model(['day', 'dewp', 'temp', 'visib'])

Mean Absolute Error for ['day', 'dewp', 'temp', 'visib'] model: 0.6733102202415466
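
Because the labels were divided by the standard deviation of 'NUM_COLLISIONS', this error is expressed in normalised units. A short sketch of the conversion back to collisions, using the standard deviation printed by `df.describe()` above (89.757713), follows. The helper name `to_collisions` is illustrative.

```python
def to_collisions(mae_normalised, scale):
    """Convert an error in normalised units back to a number of collisions."""
    return mae_normalised * scale

# MAE reported above and the std of NUM_COLLISIONS from df.describe()
mae_normalised = 0.6733102202415466
scale = 89.757713

print(round(to_collisions(mae_normalised, scale), 1))  # 60.4
```

On the test set, the predictions are therefore off by about 60 collisions per day on average.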


Let's now test the model with some made-up data. First we need to create a DataFrame holding the made-up inputs.

In [None]:
# Made up data
input_custom = pd.DataFrame({
    'day': [1, 4, 7],
    'temp': [60, 70, 80],
    'dewp': [50, 60, 70],
    'visib': [10, 5, 8]
})

In [None]:
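# Keras reads DataFrame columns by position, so pass them in the same order used for training
predictions_custom = model.predict(input_custom[['day', 'dewp', 'temp', 'visib']])
# Multiply by the scaling factor to convert the normalised predictions back to collision counts
scaled_predictions = predictions_custom * SCALE_NUM_COLLISIONS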
print(scaled_predictions)



[[597.0029 ]
 [661.0499 ]
 [746.83594]]
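
One thing to watch when predicting from a DataFrame: as far as we are aware, a Keras Sequential model reads the columns by position and ignores their names, so the column order should match the feature list used for training. The sketch below shows one way to enforce this. `FEATURES` and `align_features` are hypothetical names introduced here and are not part of the notebook.

```python
import pandas as pd

# Feature order used when the model was trained
FEATURES = ['day', 'dewp', 'temp', 'visib']

def align_features(frame, feature_order=FEATURES):
    """Return the feature columns in training order, failing loudly if one is missing."""
    missing = [col for col in feature_order if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return frame[list(feature_order)]

# Same made-up data as above, declared with 'temp' before 'dewp'
input_custom = pd.DataFrame({
    'day': [1, 4, 7],
    'temp': [60, 70, 80],
    'dewp': [50, 60, 70],
    'visib': [10, 5, 8]
})

aligned = align_features(input_custom)
print(list(aligned.columns))
```

The aligned frame can then be passed to `model.predict`, so that 'dewp' and 'temp' reach the model in the positions it was trained on.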


The model returns plausible values for the made-up data. Now we can use the November collision data to check the accuracy of the model's predictions. First we have to load the data and convert the 'collision_date' column to datetime format.


In [None]:
df_nov = pd.read_csv('https://raw.githubusercontent.com/23009256uhi/23009256_DataAnalytics/main/november_collated_collision_data.csv')

In [None]:
# Convert 'collision_date' to datetime format
df_nov['collision_date'] = pd.to_datetime(df_nov['collision_date'])

# Sort the DataFrame based on 'collision_date'
df_nov_sorted = df_nov.sort_values(by='collision_date')

In [None]:
print(df_nov_sorted.head())
print(df_nov_sorted.describe())

    day  year  mo  da collision_date  temp  dewp     slp  visib  wdsp  mxpsd  \
29    3  2023  11   1     2023-11-01  43.9  35.9  1014.6    9.5  11.2   17.1   
28    4  2023  11   2     2023-11-02  42.0  32.6  1027.2   10.0   7.7   11.1   
7     5  2023  11   3     2023-11-03  44.0  33.8  1032.0   10.0   8.2   15.0   
14    6  2023  11   4     2023-11-04  56.0  51.5  1024.6   10.0  11.2   18.1   
1     7  2023  11   5     2023-11-05  53.0  48.0  1016.7    9.4   5.0   11.1   

     gust   max   min  prcp   sndp  fog  NUM_COLLISIONS  
29   27.0  48.9  39.9  0.06  999.9    0             257  
28   20.0  50.0  36.0  0.06  999.9    0             231  
7   999.9  53.1  30.9  0.00  999.9    0             290  
14   22.0  60.1  52.0  0.00  999.9    0             238  
1   999.9  61.0  43.0  0.00  999.9    0             235  
             day    year    mo         da       temp       dewp          slp  \
count  30.000000    30.0  30.0  30.000000  30.000000  30.000000    30.000000   
mean    3.9
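
The rows printed above pair each 'day' value with a date, which lets us check how the day of the week is encoded. The sketch below uses only the standard library and suggests the values are consistent with ISO weekday numbering (Monday = 1, Sunday = 7). The helper name is illustrative, and whether the training data uses the same encoding is an assumption that should be checked separately.

```python
from datetime import date

def iso_weekday(iso_date):
    """Return the ISO weekday (Monday = 1 ... Sunday = 7) for a 'YYYY-MM-DD' string."""
    return date.fromisoformat(iso_date).isoweekday()

# (collision_date, day) pairs taken from the printed rows above
samples = {"2023-11-01": 3, "2023-11-04": 6, "2023-11-05": 7}

for collision_date, day in samples.items():
    print(collision_date, day, iso_weekday(collision_date) == day)
```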

Now we can select the relevant features from the df_nov dataset that will be used to test the model.

In [None]:
# Select relevant features from df_nov
features_nov = df_nov_sorted[['day', 'dewp', 'temp', 'visib']]

In [None]:
print(features_nov)

    day  dewp  temp  visib
29    3  35.9  43.9    9.5
28    4  32.6  42.0   10.0
7     5  33.8  44.0   10.0
14    6  51.5  56.0   10.0
1     7  48.0  53.0    9.4
4     1  43.7  48.4    9.9
11    2  51.5  56.2    9.2
20    3  34.0  48.1   10.0
5     4  37.5  45.3    9.4
24    5  39.9  49.4   10.0
10    6  34.4  47.9   10.0
2     7  29.6  41.2   10.0
0     1  29.6  39.7   10.0
9     2  35.1  41.8   10.0
23    3  36.6  45.1   10.0
18    4  46.0  52.5    9.7
6     5  50.1  51.9    6.5
21    6  52.7  54.5    8.9
15    7  32.3  44.3   10.0
16    1  27.6  41.7   10.0
3     2  24.2  38.7   10.0
27    3  45.0  48.4    6.2
22    4  41.2  46.6   10.0
13    5  34.4  45.6   10.0
17    6  20.9  34.5   10.0
8     7  28.8  35.5   10.0
26    1  46.8  50.8    8.7
25    2  25.3  40.5   10.0
19    3  17.7  34.1   10.0
12    4  29.4  41.6   10.0


To generate predictions from the trained model, we only need to call its predict method and multiply the results by the scaling factor.

In [None]:
# Call the predict method
november_predictions = model.predict(features_nov)

# Apply the scaling factor
scaled_nov_predictions = november_predictions * SCALE_NUM_COLLISIONS

print(scaled_nov_predictions)



[[547.33203]
 [562.7845 ]
 [583.49225]
 [645.4215 ]
 [661.2479 ]
 [523.3371 ]
 [551.5275 ]
 [522.8242 ]
 [572.092  ]
 [592.0107 ]
 [593.7676 ]
 [619.4694 ]
 [488.99313]
 [530.3336 ]
 [547.378  ]
 [586.05817]
 [622.5125 ]
 [654.2202 ]
 [620.4385 ]
 [471.31992]
 [489.2658 ]
 [565.1053 ]
 [586.6853 ]
 [580.11194]
 [580.6345 ]
 [638.0524 ]
 [525.7639 ]
 [487.54123]
 [498.36685]
 [548.7148 ]]


Let's now create a new data frame and add the predicted collisions and the actual collisions to check the accuracy of the model.

In [None]:
# Convert the predictions to a DataFrame
predictions_df = pd.DataFrame(scaled_nov_predictions, columns=['Predicted_Collisions'])

df_nov_sorted['Predicted_Collisions'] = predictions_df['Predicted_Collisions'].values

In [None]:
print(df_nov_sorted[['collision_date', 'NUM_COLLISIONS', 'Predicted_Collisions']])

   collision_date  NUM_COLLISIONS  Predicted_Collisions
29     2023-11-01             257            547.332031
28     2023-11-02             231            562.784485
7      2023-11-03             290            583.492249
14     2023-11-04             238            645.421509
1      2023-11-05             235            661.247925
4      2023-11-06             275            523.337097
11     2023-11-07             233            551.527527
20     2023-11-08             267            522.824219
5      2023-11-09             240            572.091980
24     2023-11-10             215            592.010681
10     2023-11-11             294            593.767578
2      2023-11-12             233            619.469421
0      2023-11-13             300            488.993134
9      2023-11-14             263            530.333618
23     2023-11-15             271            547.377991
18     2023-11-16             270            586.058167
6      2023-11-17             320            622

It looks like the model's predictions are roughly twice the actual number of collisions. The model was trained on data averaging about 593 collisions per day (minimum 264), whereas the November 2023 days shown above range from 215 to 320, so weather and day of the week alone cannot account for the lower level. This discrepancy could be due to various factors. One possible reason might be the impact of COVID-19: the pandemic led to changes in traffic volumes and travel patterns as many people started working remotely, along with other behavioural changes. The next step will be to create a DNN model to better understand and predict these patterns.
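
The size of the gap can be quantified rather than judged by eye. The sketch below computes the mean absolute error and the overall predicted-to-actual ratio for the first five November rows printed above. The function names are illustrative, and the same calculation could be applied to the full 'NUM_COLLISIONS' and 'Predicted_Collisions' columns.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def overall_ratio(actual, predicted):
    """Total predicted collisions divided by total actual collisions."""
    return sum(predicted) / sum(actual)

# First five rows of the comparison table above (1-5 November 2023)
actual = [257, 231, 290, 238, 235]
predicted = [547.332031, 562.784485, 583.492249, 645.421509, 661.247925]

print(f"MAE: {mean_absolute_error(actual, predicted):.1f} collisions")   # MAE: 349.9 collisions
print(f"Predicted / actual: {overall_ratio(actual, predicted):.2f}")     # Predicted / actual: 2.40
```

For these five days the error is several times larger than the roughly 60 collisions measured on the test set, which points to a shift between the training data and the November data rather than to ordinary model error.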