# **DATA ANALYTICS ON THE WEB - ASSIGNMENT 2**


# **STUDENT: 22015866**


# **Colin Stevenson - Applied Data Science**

# **Introduction**

In this report, we explore the relationship between weather conditions and traffic collisions in New York City. The primary objective is to develop predictive models that can accurately estimate the number of traffic collisions on any given day based on various weather parameters.

The analysis is divided into two parts. The first part constructs a linear regression model, a simple but effective tool for understanding and predicting relationships between variables.

The second part focuses on developing a Deep Neural Network (DNN) regression model, leveraging the advanced capabilities of machine learning.

Three distinct datasets have been created and hosted on GitHub for this analysis. Each dataset incrementally introduces more weather-related variables, starting from basic data on the day of the week and number of collisions, and gradually including temperature, precipitation, and dew point. This progressive approach allows us to understand the impact of each additional weather variable on the accuracy of our predictions.


# **Part 1 - Linear Regression Model**

# **Methodology**

The methodology comprises three stages:

**Data Acquisition and Preparation:**

We first ensure that the data is clean and that all necessary features are included. To predict which days have the highest number of collisions, we include the day of the week together with other relevant variables such as weather conditions.

**Model Building:**

We will then build the linear regression model which will include features that will impact the number of collisions, such as weather conditions and day of the week.

**Model Training and Evaluation:**

After building the model, we'll train it with the training dataset and evaluate its performance using the test dataset.

Metrics such as Mean Absolute Error (MAE) will be used for evaluation.
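For reference, MAE is the average absolute difference between the actual and predicted values. The following is a minimal sketch in plain Python; the numbers are made up for illustration and are not taken from the collision datasets.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only (not taken from the datasets)
actual = [400, 500, 600]
predicted = [420, 470, 610]

mae = mean_absolute_error(actual, predicted)
print(mae)  # (20 + 30 + 10) / 3 = 20.0
```

Because MAE is expressed in the same units as the target, a value of 20 here would mean the predictions are off by 20 collisions per day on average.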

### **Data Acquisition and Preparation**

The initial phase of our analysis involves acquiring and preparing the data for our regression models. This process is fundamental to any data science project, as the quality and structure of the data directly influence the performance and reliability of the models.

We begin by importing the necessary libraries and loading the datasets. We use pandas, a data manipulation library, and numpy for numerical operations. Three separate datasets (df1, df2, df3) are loaded from the GitHub repository. Each of these datasets represents a different combination of variables:

- df1: Contains 'day' and 'NUM_COLLISIONS'.
- df2: Adds 'temp' (temperature) to the variables in df1.
- df3: Adds 'dewp' (dew point) to the variables in df2.

In [None]:
# needed to create the data frame
import pandas as pd

# needed to help with speedy maths based calculations
import numpy as np

# create data frames from csv file we hosted on our github
df1 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata1.csv', index_col=0, )

df2 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata2.csv', index_col=0, )

df3 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata3.csv', index_col=0, )


print(df1)
print(df2)
print(df3)

     NUM_COLLISIONS
day                
4               381
5               480
6               549
7               505
2               389
..              ...
7               448
2               355
1               384
3               518
4               443

[2535 rows x 1 columns]
     temp  NUM_COLLISIONS
day                      
4    37.8             381
5    27.1             480
6    28.4             549
7    33.4             505
2    36.1             389
..    ...             ...
7    49.4             448
2    48.0             355
1    42.6             384
3    39.4             518
4    38.7             443

[2535 rows x 2 columns]
     temp  dewp  NUM_COLLISIONS
day                            
4    37.8  23.6             381
5    27.1  10.5             480
6    28.4  14.1             549
7    33.4  18.6             505
2    36.1  18.7             389
..    ...   ...             ...
7    49.4  40.9             448
2    48.0  37.4             355
1    42.6  30.2             384


In [None]:
# A scale is not required here, but the constant will be useful
# SCALE_NUM_COLLISIONS = 0.001

We now import TensorFlow and Keras, which are essential libraries for building neural network models. Although our current focus is on linear regression, importing these libraries at this stage prepares us for the subsequent Deep Neural Network regression analysis.

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

2.15.0


## **Number of Collisions vs Day**

We address a crucial aspect of data preparation: restructuring the DataFrame. Since the 'day' variable is set as an index in df1, we reset it to a regular column to facilitate easier manipulation and analysis. We then reconstruct df1 to align the input variables ('day') with the target variable ('NUM_COLLISIONS'). This step is critical for ensuring that our data is in the right format for applying regression models.

In [None]:
# Reset the index if 'day' is set as index
df1.reset_index(inplace=True)

# Now, create the DataFrame for input data
df1_input_data = [df1["day"], df1["NUM_COLLISIONS"]]

# Create headers for our new DataFrame. These should correlate with the above.
df1_input_headers = ["day", "NUM_COLLISIONS"]

# Create a final DataFrame using our new DataFrame and headers.
df1 = pd.concat(df1_input_data, axis=1, keys=df1_input_headers)

print(df1)


      day  NUM_COLLISIONS
0       4             381
1       5             480
2       6             549
3       7             505
4       2             389
...   ...             ...
2530    7             448
2531    2             355
2532    1             384
2533    3             518
2534    4             443

[2535 rows x 2 columns]


### **NUM_COLLISIONS vs Day -  Model Development and Evaluation**

We can now develop the linear regression model using TensorFlow, a widely used framework for building machine learning models.

**Training the Model**

This code block is crucial for dividing our dataset into a training set and a test set. We use an 80-20 split, meaning 80% of the data is used for training the model, and the remaining 20% is reserved for testing its performance.

In [None]:
# Construct a training set to run through the model and a test set. We use sample with frac=0.8 for an 80% training set, leaving 20% for testing.
training_set_1 = df1.sample(frac=0.8, random_state=0)
test_set_1 = df1.drop(training_set_1.index)
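As a sanity check on this sample-then-drop pattern, the sketch below applies it to a small toy DataFrame and confirms that the two subsets are disjoint and together cover every row. The toy values are illustrative only and stand in for df1.

```python
import pandas as pd

# Toy frame standing in for df1 (values are illustrative only)
toy = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    "NUM_COLLISIONS": [410, 455, 480, 500, 520, 530, 550, 400, 460, 470],
})

# Same pattern as above: sample 80% for training, drop those rows for testing
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# No row should appear in both subsets
overlap = set(train.index) & set(test.index)
print(len(train), len(test), overlap)
```

Note that this pattern relies on the index being unique. That holds here because the DataFrame was rebuilt with a default integer index after reset_index.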

Here, we prepare our features (inputs) and labels (outputs) for the training and test sets. We separate the 'NUM_COLLISIONS' column from our datasets, which is the target variable we aim to predict. This separation is a standard practice in supervised learning where the model learns to predict the labels from the features.

In [None]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features_1 = training_set_1.copy()
test_features_1 = test_set_1.copy()

training_labels_1 = training_features_1.pop('NUM_COLLISIONS')
test_labels_1 = test_features_1.pop('NUM_COLLISIONS')

We print the training features for inspection and create a normalisation layer for our TensorFlow model. This layer is an essential part of preprocessing in neural networks, ensuring that our model inputs have a uniform scale.

In [None]:
print(training_features_1)

      day
828     7
2409    4
2249    5
936     3
2390    6
...   ...
1874    6
718     6
1149    7
387     3
1369    4

[2028 rows x 1 columns]


Normalization helps in speeding up the training process and improving the performance of the model.

In [None]:
# Boilerplate for this model. The normalisation layer is adapted to the training features, so it learns their mean and variance.
normaliser_1 = tf.keras.layers.Normalization(input_shape=[1,], axis=None) # tf.keras.layers.Normalization(axis=-1)
normaliser_1.adapt(np.array(training_features_1))
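To make the effect of normalisation concrete, the sketch below standardises a small array with numpy by subtracting the mean and dividing by the standard deviation. To our understanding this is the same transformation the Normalization layer applies once adapt() has been called, but the code is a stand-alone illustration with made-up day codes and does not use TensorFlow.

```python
import numpy as np

# Illustrative day codes only (not the real training data)
days = np.array([7.0, 4.0, 5.0, 3.0, 6.0])

mean = days.mean()
std = days.std()  # population standard deviation (ddof=0)

# Standardise: zero mean and unit variance
normalised = (days - mean) / std

print(normalised.mean(), normalised.std())
```

After this step the feature has a mean of roughly 0 and a standard deviation of roughly 1, which keeps features measured on different scales (for example day codes and temperatures) comparable.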

We define our linear regression model (model_1) using TensorFlow's Sequential API. The model comprises a normalisation layer followed by a dense layer with a single unit, which is characteristic of a simple linear regression model. The model is compiled with the Adam optimizer and mean absolute error as the loss function. This setup is fundamental in defining how the model learns during training.

In [None]:
# I have decided to call the model, model_1. We add our normaliser and we are expecting a single output.
model_1 = tf.keras.Sequential([
    normaliser_1,
    layers.Dense(units=1)
])

In [None]:
model_1.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
# Fit the model using the training features and labels. We run it for 100 epochs and hold out a further 20% of the training data for validation.
history = model_1.fit(
    training_features_1,
    training_labels_1,
    epochs=100,
    verbose=0,
    validation_split=0.2)

CPU times: user 15.3 s, sys: 801 ms, total: 16.1 s
Wall time: 42.7 s
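The validation_split argument deserves a brief note. As documented for Keras, it holds out the last fraction of the samples passed to fit, taken before any shuffling. The sketch below mimics that slicing in plain Python to show how many rows end up in each part; the helper name is hypothetical and the code is not the Keras implementation.

```python
def keras_style_validation_split(rows, validation_split):
    """Hold out the final fraction of rows, mirroring the documented behaviour."""
    split_at = int(len(rows) * (1.0 - validation_split))
    return rows[:split_at], rows[split_at:]


# 2028 is the number of training rows printed earlier in this section
rows = list(range(2028))
fit_rows, val_rows = keras_style_validation_split(rows, 0.2)

print(len(fit_rows), len(val_rows))
```

Because the training set was drawn with a random sample, its row order is already shuffled, so taking the final 20% should give a reasonably representative validation subset.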


## **Number of Collisions vs Day and Temp**

This section repeats the data preparation for df2: the 'day' index is reset to a regular column and df2 is rebuilt so that the input variables ('day' and 'temp') are aligned with the target variable ('NUM_COLLISIONS').

### **Number of Collisions vs Day and Temp - Model Development and Evaluation**


First, we create the DataFrame with the variable headers and print it to show the first and last rows for day, temp and NUM_COLLISIONS.

The model development and evaluation process used for the day and NUM_COLLISIONS DataFrame is then repeated with the 'temp' variable added:

In [None]:
# Reset the index if 'day' is set as index
df2.reset_index(inplace=True)

# create a dataframe with the inputs and the output at the end using the imported dataframe. This can be replicated for any configuration; in this case, the inputs are day and temp
df2_input_data = [df2["day"], df2["temp"], df2["NUM_COLLISIONS"]]
# create headers for our new dataframe. These should correlate with the above.
df2_input_headers = ["day", "temp", "NUM_COLLISIONS"]
# create a final dataframe using our new dataframe and headers.
df2 = pd.concat(df2_input_data, axis=1, keys=df2_input_headers)

print(df2)

      day  temp  NUM_COLLISIONS
0       4  37.8             381
1       5  27.1             480
2       6  28.4             549
3       7  33.4             505
4       2  36.1             389
...   ...   ...             ...
2530    7  49.4             448
2531    2  48.0             355
2532    1  42.6             384
2533    3  39.4             518
2534    4  38.7             443

[2535 rows x 3 columns]


In [None]:
# Construct a training set to run through the model and a test set. We use sample with frac=0.8 for an 80% training set, leaving 20% for testing.
training_set_2 = df2.sample(frac=0.8, random_state=0)
test_set_2 = df2.drop(training_set_2.index)

In [None]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features_2 = training_set_2.copy()
test_features_2 = test_set_2.copy()

training_labels_2 = training_features_2.pop('NUM_COLLISIONS')
test_labels_2 = test_features_2.pop('NUM_COLLISIONS')

In [None]:
print(training_features_2)

      day  temp
828     7  60.5
2409    4  67.7
2249    5  44.3
936     3  81.3
2390    6  74.1
...   ...   ...
1874    6  35.7
718     6  53.2
1149    7  33.4
387     3  34.8
1369    4  53.4

[2028 rows x 2 columns]


In [None]:
# Boilerplate for this model. The normalisation layer is adapted to the training features, so it learns their mean and variance.
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features_2))

In [None]:
# I have decided to call the model, model_2. We add our normaliser and we are expecting a single output.
model_2 = tf.keras.Sequential([
    normaliser,
    layers.Dense(units=1)
])

In [None]:
# more boiler plate for creating a sequential model, we need an optimiser and loss parameter. Here we are going to be using the mean absolute error MAE
model_2.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
# Fit the model using the training features and labels. We run it for 100 epochs and hold out a further 20% of the training data for validation.
history = model_2.fit(
    training_features_2,
    training_labels_2,
    epochs=100,
    verbose=0,
    validation_split=0.2)

CPU times: user 12.9 s, sys: 605 ms, total: 13.5 s
Wall time: 12.8 s


## **Number of Collisions vs Day, Temp and Dewp**

First, we create the DataFrame with the variable headers and print it to show the first and last rows for day, temp, dewp and NUM_COLLISIONS.

The model development and evaluation process used for the day, temp and NUM_COLLISIONS DataFrame is then repeated with the 'dewp' variable added:

In [None]:
# Reset the index if 'day' is set as index
df3.reset_index(inplace=True)

# create a dataframe with the inputs and the output at the end using the imported dataframe. This can be replicated for any configuration; in this case, the inputs are day, temp and dewp
df3_input_data = [df3["day"], df3["temp"], df3["dewp"], df3["NUM_COLLISIONS"]]
# create headers for our new dataframe. These should correlate with the above.
df3_input_headers = ["day", "temp", "dewp", "NUM_COLLISIONS"]
# create a final dataframe using our new dataframe and headers.
df3 = pd.concat(df3_input_data, axis=1, keys=df3_input_headers)

print(df3)

      day  temp  dewp  NUM_COLLISIONS
0       4  37.8  23.6             381
1       5  27.1  10.5             480
2       6  28.4  14.1             549
3       7  33.4  18.6             505
4       2  36.1  18.7             389
...   ...   ...   ...             ...
2530    7  49.4  40.9             448
2531    2  48.0  37.4             355
2532    1  42.6  30.2             384
2533    3  39.4  38.3             518
2534    4  38.7  34.9             443

[2535 rows x 4 columns]


In [None]:
# Construct a training set to run through the model and a test set. We use sample with frac=0.8 for an 80% training set, leaving 20% for testing.
training_set_3 = df3.sample(frac=0.8, random_state=0)
test_set_3 = df3.drop(training_set_3.index)

In [None]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features_3 = training_set_3.copy()
test_features_3 = test_set_3.copy()

training_labels_3 = training_features_3.pop('NUM_COLLISIONS')
test_labels_3 = test_features_3.pop('NUM_COLLISIONS')


In [None]:
# No scale factor is applied to the labels here (equivalent to a scale of 1), so these assignments leave the labels unchanged.
# They are kept as placeholders: if scaling were needed, dividing by 1000 would suit the range of this data.
training_labels_3 = training_labels_3
test_labels_3 = test_labels_3

In [None]:
# Display the first few rows of training_features_3 to check its structure
print(training_features_3.head())


      day  temp  dewp
828     7  60.5  48.6
2409    4  67.7  56.8
2249    5  44.3  24.1
936     3  81.3  60.0
2390    6  74.1  66.5


In [None]:
# Normalize features
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(training_features_3[['day', 'temp', 'dewp']])


In [None]:
# Model architecture
model_3 = tf.keras.Sequential([
    normaliser,
    layers.Dense(units=10, activation='relu'),
    layers.Dense(units=1)
])

# Compile the model
model_3.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss='mean_absolute_error')


In [None]:
# Train the model
history = model_3.fit(
    training_features_3[['day', 'temp', 'dewp']],
    training_labels_3,
    epochs=100,
    verbose=0,
    validation_split=0.2
)


# **Results**

To choose the input rows used for the example predictions, we generate a random permutation of the row positions of the test set. The test data was not used in model training, so these rows are unseen by the models.

We then select rows using these shuffled positions. The same permutation is applied to all three test sets, so the selected rows correspond to the same days in each.

From each shuffled test set we keep only the feature columns as predictors: the first column ('day') for Model 1, the first two ('day', 'temp') for Model 2, and the first three ('day', 'temp', 'dewp') for Model 3.

The result is three DataFrames (input_predictors_1, input_predictors_2 and input_predictors_3) containing the selected features.

In [None]:
# Create Predictor Data Inputs

# Set a random seed for reproducibility
np.random.seed(42)

# Assuming test_set_1, test_set_2, and test_set_3 have the same length and alignment
# Shuffle indices once
shuffled_indices = np.random.permutation(len(test_set_1))

# Use the same shuffled indices for all test sets
input_predictors_1 = test_set_1.iloc[shuffled_indices, 0:1]
input_predictors_2 = test_set_2.iloc[shuffled_indices, 0:2]
input_predictors_3 = test_set_3.iloc[shuffled_indices, 0:3]

# Print out the first 3 rows of predictors
print(input_predictors_1[:3])
print(input_predictors_2[:3])
print(input_predictors_3[:3])


      day
931     5
1413    7
2443    3
      day  temp
931     5  85.4
1413    7  48.6
2443    3  65.3
      day  temp  dewp
931     5  85.4  65.7
1413    7  48.6  42.6
2443    3  65.3  48.5
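The reason for shuffling once and reusing the same positions is that it keeps rows aligned across tables. The sketch below illustrates this with two small numpy arrays; the values are illustrative only and the array names are not from the notebook.

```python
import numpy as np

np.random.seed(42)

# Two aligned columns: position i in each array describes the same day
days = np.array([5, 7, 3, 1])
temps = np.array([85.4, 48.6, 65.3, 20.1])

# One permutation, applied to both arrays
order = np.random.permutation(len(days))
shuffled_days = days[order]
shuffled_temps = temps[order]

# The (day, temp) pairs are unchanged; only their order differs
pairs_before = set(zip(days.tolist(), temps.tolist()))
pairs_after = set(zip(shuffled_days.tolist(), shuffled_temps.tolist()))
print(pairs_before == pairs_after)
```

Had each array been shuffled with its own permutation, the pairing between day and temperature would have been lost.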


### **Number of Collisions Predictions vs Days**

In [None]:
# we create a custom dataframe using the input_predictor dataframe created in the last step
input_1 = input_predictors_1

# Predict for the first three rows. No scaling was applied to NUM_COLLISIONS, so the outputs are already collision counts.
linear_day_predictions = model_1.predict(input_1[:3])
print(linear_day_predictions)

[[516.2692 ]
 [547.68146]
 [484.85706]]


### **Predicting for Collisions against Days of the Week**

In [None]:
# Prepare the data for prediction
# Assuming you want to predict for all days, create a DataFrame for this
days_data = pd.DataFrame({'day': range(1, 8)})  # 1 to 7 representing days of the week

# Make predictions
predictions = model_1.predict(days_data)

# Add predictions to the DataFrame
days_data['Predicted_Collisions'] = predictions.flatten()

# Assuming the scale factor used was SCALE_NUM_COLLISIONS
# SCALE_NUM_COLLISIONS = 0.001  # Replace with your actual scale factor
# days_data['Predicted_Collisions'] = days_data['Predicted_Collisions'] / SCALE_NUM_COLLISIONS

# Analyze which days have higher predicted collisions
print(days_data.sort_values(by='Predicted_Collisions', ascending=False))


   day  Predicted_Collisions
6    7            547.681458
5    6            531.975342
4    5            516.269226
3    4            500.563141
2    3            484.857056
1    2            469.150940
0    1            453.444824
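Since model_1 is a single dense unit on one normalised input, its predictions should lie on a straight line in the day code. The sketch below recovers that line from two of the predictions printed above and checks it against a third. It uses only the printed numbers and does not call the model.

```python
# Two predictions copied from the model_1 output above
pred_day_1 = 453.444824
pred_day_7 = 547.681458

# A straight line through those two points
slope = (pred_day_7 - pred_day_1) / (7 - 1)
intercept = pred_day_1 - slope * 1


def line_prediction(day):
    """Prediction of the fitted line for a given day code."""
    return intercept + slope * day


# Compare with the printed prediction for day 4 (500.563141)
pred_day_4 = line_prediction(4)
print(round(slope, 3), round(pred_day_4, 3))
```

The line reproduces the printed day 4 value, with a constant step of about 15.7 collisions per day code. This is a property of the model form rather than something learned separately for each day.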


**Conclusions and Insights**

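**Day-wise Collision Risk:** Based on these predictions, the model estimates a higher number of collisions later in the week, with the highest predicted collisions on day 7. Because Model 1 is linear in a single integer-coded feature, its predictions rise by a constant amount (about 15.7 collisions) per day code, so it cannot represent a mid-week peak or dip.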

**Actionable Insights:** Emergency services could use this information to prepare more resources or be on higher alert towards the end of the week, especially on days 6 and 7.

### **Number of Collisions Predictions vs Days and Temp**

In [None]:
# Prediction Input
# we create a custom dataframe using the input_predictor dataframe created in the last step
input_2 = input_predictors_2

# Collisions Predictions vs Days & Temp
linear_day_temp_predictions = model_2.predict(input_2[:3])
print(linear_day_temp_predictions)

[[555.37164]
 [539.4067 ]
 [497.4407 ]]


### **Predictions vs Actual for Number of Collisions vs Day & Temp**

In [None]:
# Use df2 for predictions
predictions_2 = model_2.predict(df2[['day', 'temp']])

# Add predictions to df2 for analysis
df2['Predicted_Collisions'] = predictions_2.flatten()

# Analyze the predictions
print(df2.sort_values(by='Predicted_Collisions', ascending=False))


      day  temp  NUM_COLLISIONS  Predicted_Collisions
198     7  89.1             697            592.003967
1338    7  85.0             715            586.679321
1282    7  83.0             798            584.081909
149     7  83.0             736            584.081909
1289    7  82.8             769            583.822144
...   ...   ...             ...                   ...
1456    1  20.1             406            406.912476
1822    1  19.2             494            405.743622
768     1  18.9             454            405.354004
1815    1   9.7             475            393.406006
1131    1   6.9             309            389.769653

[2535 rows x 4 columns]


### **Number of Collisions Predictions vs Days, Temp and Dewp**

In [None]:
# Prediction input
input_3 = input_predictors_3

# Predictions
linear_day_temp_dewp_predictions = model_3.predict(input_3[:3])
print(linear_day_temp_dewp_predictions)


[[668.24963]
 [659.5333 ]
 [596.5163 ]]


### **Predictions vs Actual for Number of Collisions vs Day, Temp & Dewp**

In [None]:
# Use df3 for predictions
predictions_3 = model_3.predict(df3[['day', 'temp', 'dewp']])

# Add predictions to df3 for analysis
df3['Predicted_Collisions'] = predictions_3.flatten()

# Analyze the predictions
print(df3.sort_values(by='Predicted_Collisions', ascending=False))


      day  temp  dewp  NUM_COLLISIONS  Predicted_Collisions
198     7  89.1  72.1             697            728.042358
1338    7  85.0  69.2             715            712.851929
1310    7  82.4  73.5             729            706.930298
1282    7  83.0  69.8             798            706.659912
2036    7  82.5  71.4             703            705.988525
...   ...   ...   ...             ...                   ...
40      1  25.2   8.5             400            470.635040
26      1  24.5   7.5             429            469.160309
386     1  21.4   5.3             465            468.675537
1456    1  20.1   7.2             406            465.991882
1808    1  20.9  12.3             462            459.927399

[2535 rows x 5 columns]


### **Model Comparison**

To compare the three models and determine which is best for prediction, we evaluate each on a separate test set that was not used during training. The comparison is based on a performance metric; here we use Mean Absolute Error (MAE), although Root Mean Squared Error (RMSE) would be an alternative.

We approach this as follows:

1. **Separate Test Set**

We have split the data into training and testing sets. This is crucial for evaluating the model's performance on unseen data. We use test_set_1, test_set_2 and test_set_3, the held-out 20% of df1, df2 and df3 respectively.

2. **Evaluate Each Model**

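We can now evaluate each model on its own test set. Because all three splits use the same random_state on DataFrames of the same length, the test sets contain the same rows, and each model sees only the relevant subset of features. For example, test_set_3 includes the features 'day', 'temp', and 'dewp'.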

We will now use the following steps to compare the MAE of each model against the other:

In [None]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_1 = model_1.evaluate(
    test_features_1,
    test_labels_1, verbose=0)

In [None]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_2 = model_2.evaluate(
    test_features_2,
    test_labels_2, verbose=0)

In [None]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_3 = model_3.evaluate(
    test_features_3,
    test_labels_3, verbose=0)

In [None]:
# The mean absolute error of the model can be printed out. Remember, we want to minimise this. It will vary on each training run due to randomisation.
print(mean_absolute_error_model_1)
print(mean_absolute_error_model_2)
print(mean_absolute_error_model_3)

112.15725708007812
110.32427215576172
63.96727752685547


In [None]:
from sklearn.metrics import mean_absolute_error

# Evaluate model_1
predictions_1 = model_1.predict(test_set_1[['day']])
mae_1 = mean_absolute_error(test_set_1['NUM_COLLISIONS'], predictions_1.flatten())

# Evaluate model_2
predictions_2 = model_2.predict(test_set_2[['day', 'temp']])
mae_2 = mean_absolute_error(test_set_2['NUM_COLLISIONS'], predictions_2.flatten())

# Evaluate model_3
predictions_3 = model_3.predict(test_set_3[['day', 'temp', 'dewp']])
mae_3 = mean_absolute_error(test_set_3['NUM_COLLISIONS'], predictions_3.flatten())

# Print the MAE for each model
print(f"MAE of Model 1: {mae_1}")
print(f"MAE of Model 2: {mae_2}")
print(f"MAE of Model 3: {mae_3}")


MAE of Model 1: 112.15726821568356
MAE of Model 2: 110.32426732531665
MAE of Model 3: 63.96728070200783
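One check that is not part of the comparison above is a naive baseline: predicting the same constant (for example the median of the training labels) for every test day. The sketch below shows how such a baseline MAE could be computed. The function name and the toy numbers are hypothetical, and no baseline figure is claimed for the real data.

```python
from statistics import median


def baseline_mae(train_labels, test_labels):
    """MAE obtained by predicting the training median for every test row."""
    constant = median(train_labels)
    return sum(abs(y - constant) for y in test_labels) / len(test_labels)


# Illustrative values only (not taken from the datasets)
toy_train = [400, 500, 600]
toy_test = [450, 550, 700]

toy_baseline = baseline_mae(toy_train, toy_test)
print(toy_baseline)  # (50 + 50 + 200) / 3 = 100.0
```

Applied to training_labels_1 and test_labels_1, a function like this would give a reference point for judging how much each model improves on a constant guess.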


# **Conclusion**

The primary objective of this analysis was to develop regression models that could predict the number of traffic collisions on any given day in New York City, aiding the emergency services in optimizing their response strategies. Three different models were developed and compared:

**Model 1 (Linear Regression with 'day'):** Utilised only the day of the week as the predictor variable. The predictions from this model indicated a progressive increase in the number of collisions as the week progressed, with the highest number of collisions predicted for day 7 which is Friday. The Mean Absolute Error (MAE) for this model was approximately 112.16.

**Model 2 (Linear Regression with 'day' and 'temp'):** Included both the day of the week and temperature. This model provided a nuanced understanding, showing that specific combinations of days and temperatures led to higher predicted collisions. For instance, the model predicted the highest collisions on day 7 with a temperature of 89.1. The MAE for this model was slightly lower than Model 1 at 110.32.

**Model 3 (Regression with 'day', 'temp', and 'dewp'):** This model indicated that certain combinations of day, temperature, and dew point were associated with higher predicted collision numbers. For example, the highest collisions were predicted on day 7 with a temperature of 89.1 and a dew point of 72.1. The MAE for this model was the lowest of the three at 63.97, making it the most accurate. However, unlike Models 1 and 2, Model 3 includes a hidden layer of 10 ReLU units and uses a lower learning rate (0.01), so it is not a strictly linear model. The improvement therefore cannot be attributed to the dew point feature alone; the extra layer likely allows the model to capture non-linear effects that the purely linear models cannot.




# **Part 2 - DNN Regression Model**

### **Dataframe Creation:**

We will create a DataFrame with the necessary features and the target variable (NUM_COLLISIONS).

In [None]:
# needed to create the data frame
import pandas as pd
import numpy as np
import tensorflow as tf
import random as python_random

np.random.seed(321)
python_random.seed(321)
tf.random.set_seed(321)


# create data frame from csv file we hosted on our github
df_dnn_1 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/dnnregressiondata1.csv', index_col=0)

print(df_dnn_1)

     Aug  Dec  Feb  Jan  Jul  Jun  Mar  May  Nov  Oct  ...  Sun  Thu  Tue  \
Apr                                                    ...                  
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    1   
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    0   
0      0    0    0    1    0    0    0    0    0    0  ...    0    1    0   
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    0   
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    0   
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    1    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    1   

In [None]:
# read_csv with index_col=0 made the first column ('Apr') the index; reset it so 'Apr' becomes a regular column again
df_dnn_1.reset_index(inplace=True)

# make sure we have our data by printing it out
df_dnn_1[:5]
# df #all

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,...,Sun,Thu,Tue,Wed,year,month,temp,prcp,dewp,NUM_COLLISIONS
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,2013,Jan,37.8,0.0,23.6,381
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,2013,Jan,27.1,0.0,10.5,480
2,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,2013,Jan,28.4,0.0,14.1,549
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,2013,Jan,33.4,0.0,18.6,505
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,2013,Jan,36.1,0.0,18.7,389


In [None]:
# Columns to keep, in order: the 23 input features followed by the target
headers = [
    "year", "temp", "dewp", "prcp",
    "Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri",
    "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    "NUM_COLLISIONS",
]
df_dnn_1_input = df_dnn_1[headers].copy()
df_dnn_1_input.head()

Unnamed: 0,year,temp,dewp,prcp,Sat,Sun,Mon,Tue,Wed,Thu,...,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,NUM_COLLISIONS
0,2013,37.8,23.6,0.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,381
1,2013,27.1,10.5,0.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,480
2,2013,28.4,14.1,0.0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,549
3,2013,33.4,18.6,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,505
4,2013,36.1,18.7,0.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,389
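The day-of-week and month columns in this DataFrame appear to be one-hot encoded: each row shown has a 1 in a single day column and a single month column. The encoding was done before the CSV was produced and is not shown in this notebook, so the following is only a minimal sketch of the idea. The helper name one_hot is hypothetical.

```python
# Day columns in the order used by the DataFrame above
DAYS = ["Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]


def one_hot(value, categories):
    """Return a dict with 1 for the matching category and 0 for the rest."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return {category: int(category == value) for category in categories}


row = one_hot("Tue", DAYS)
print(row)
```

Encoding the day this way lets the network learn a separate effect for each day, rather than assuming that collisions change by a fixed amount per day code as the single-feature linear model does.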


### **Train-Test Split:**

**Random Sampling for Diverse Test Data**

We use random sampling to shuffle the rows of the DataFrame, which generally ensures a mix of different values across columns in both subsets.

After shuffling, we split the data into training (80%) and testing (20%) sets, which is standard practice for model evaluation.

In [None]:
# Set a random seed for reproducibility
# np.random.seed(123)  # We can choose any number as the seed

# Shuffle the entire dataset randomly
df_shuffled = df_dnn_1_input.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the data into training and testing sets
# We want 80% of the data for training and 20% for testing
split_index = int(len(df_shuffled) * 0.8)
training_set = df_shuffled[:split_index]
test_set = df_shuffled[split_index:]

# We now have a test set with a diverse selection of data
print(test_set.head())

      year  temp  dewp  prcp  Sat  Sun  Mon  Tue  Wed  Thu  ...  Apr  May  \
2028  2018  40.4  32.0  1.06    0    0    1    0    0    0  ...    0    0   
2029  2018  52.0  32.8  0.00    0    0    0    0    0    1  ...    0    0   
2030  2014  49.4  30.9  0.00    0    0    1    0    0    0  ...    0    0   
2031  2013  76.1  62.6  0.00    0    1    0    0    0    0  ...    0    0   
2032  2017  49.2  34.4  0.00    0    0    0    0    0    1  ...    0    0   

      Jun  Jul  Aug  Sep  Oct  Nov  Dec  NUM_COLLISIONS  
2028    0    0    0    0    0    0    1             689  
2029    0    0    0    0    0    1    0             631  
2030    0    0    0    0    1    0    0             591  
2031    1    0    0    0    0    0    0             505  
2032    0    0    0    0    0    0    0             563  

[5 rows x 24 columns]


### **Feature-Label Separation:**

We can now separate the features and the target variable (NUM_COLLISIONS) for both training and testing sets.

In [None]:
training_features = training_set.copy()
test_features = test_set.copy()

training_labels = training_features.pop('NUM_COLLISIONS')
test_labels = test_features.pop('NUM_COLLISIONS')

### **Normalization:**

We now use tf.keras.layers.Normalization to normalize the features, which is important for neural network models as they are sensitive to the scale of input data.

In [None]:
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features))
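Conceptually, the normalisation layer standardises each feature column using the mean and variance learned from the training features during `adapt`. The NumPy sketch below illustrates that arithmetic on hypothetical values; it is a simplified illustration of the calculation, not the Keras implementation.

```python
import numpy as np

# Hypothetical training features: two columns on very different scales
train = np.array([[2013.0, 30.0],
                  [2015.0, 50.0],
                  [2018.0, 70.0]])

# Per-column statistics, learned from the training rows only
mean = train.mean(axis=0)
var = train.var(axis=0)

def standardise(x):
    """Shift each column to zero mean and scale it to unit variance."""
    return (x - mean) / np.sqrt(var)

scaled = standardise(train)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Because the statistics come from the training rows alone, the test rows are scaled with the same mean and variance and do not leak into the fit.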

### **Model Architecture**:

The DNN model (dnn_model_1) has two hidden layers with 64 neurons each (an increase from the 48 neurons tried earlier, which improved the predictions) and one output neuron.

In [None]:
# Unlike a single-layer model, this network stacks the normalisation layer
# (23 input features), two hidden layers of 64 neurons each, and one output neuron.
# The hidden-layer width was raised from 48 to 64 to improve the predictions.
dnn_model_1 = keras.Sequential([
    normaliser,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

dnn_model_1.compile(loss='mean_absolute_error',
                    optimizer=tf.keras.optimizers.Adam(0.001))
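As a rough size check, the number of trainable weights in the three dense layers can be worked out by hand. The sketch below assumes 23 input features (year, temp, dewp, prcp, seven day flags and twelve month flags) and leaves out the normalisation layer, whose statistics are not trained.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 23  # year, temp, dewp, prcp, 7 day flags, 12 month flags
layer_sizes = [64, 64, 1]

total = 0
fan_in = n_features
for units in layer_sizes:
    total += dense_params(fan_in, units)
    fan_in = units  # this layer's outputs feed the next layer

print(total)  # 5761
```

The total breaks down as 1,536 + 4,160 + 65 parameters, which is small relative to the few thousand rows in the dataset.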

### **Model Training**

We will train the model for 100 epochs and use a 20% validation split, which allows the model to be evaluated on a part of the training data it hasn't seen during training, helping to monitor for overfitting.

In [None]:
%%time
history = dnn_model_1.fit(
    training_features,
    training_labels,
    validation_split=0.2,
    verbose=0,
    epochs=100)

CPU times: user 16.9 s, sys: 748 ms, total: 17.7 s
Wall time: 21.8 s
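One simple way to use the validation split when monitoring for overfitting is to find the epoch at which the validation loss was lowest. The helper below is a hypothetical illustration (`best_epoch` is not part of Keras); it expects a dictionary shaped like `history.history`, which holds `loss` and `val_loss` lists when a validation split is used. The loss values shown are made up.

```python
def best_epoch(history_dict):
    """Return (epoch_number, val_loss) for the lowest validation loss."""
    val_loss = history_dict["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1, val_loss[best_index]

# Hypothetical loss curves: training loss keeps falling,
# validation loss bottoms out at epoch 3 and then rises
toy_history = {
    "loss": [90.0, 70.0, 60.0, 55.0, 52.0],
    "val_loss": [88.0, 72.0, 65.0, 66.0, 69.0],
}

epoch, loss = best_epoch(toy_history)
print(epoch, loss)  # 3 65.0
```

In the notebook this could be called as `best_epoch(history.history)`. A best epoch well before the final one would suggest that training past it mainly fits the training rows.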


### **Model Evaluation**

After training, we evaluate the model on the test set, obtaining a mean absolute error (MAE) of about 52.75 collisions per day.

In [None]:
# Lower MAE is better: the model with the lowest test MAE is preferred
dnn_model_1_results = dnn_model_1.evaluate(test_features, test_labels, verbose=0)
print(dnn_model_1_results)

52.74506759643555


In [None]:
# Extracting three specific rows from the test data
# Change the row indices [0, 1, 2] to any other indices you want to use
selected_rows = test_set.iloc[[0, 1, 2]]

# Keep only the 23 feature columns, in the same order used for training
input_1 = selected_rows[['year', 'temp', 'dewp', 'prcp', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']]

# Make predictions using the DNN model
# (despite the variable name, these are DNN predictions)
linear_day_predictions = dnn_model_1.predict(input_1)

# Output the predictions
print(linear_day_predictions)


[[638.4358 ]
 [672.46826]
 [601.8736 ]]


**Step 1: Retrieve Specific Input Combinations**

For each of the three predictions, we display the corresponding input row. Because input_1 holds the same three rows taken from the test set, printing it shows the inputs behind each prediction:

In [None]:
print(input_1)


      year  temp  dewp  prcp  Sat  Sun  Mon  Tue  Wed  Thu  ...  Mar  Apr  \
2028  2018  40.4  32.0  1.06    0    0    1    0    0    0  ...    0    0   
2029  2018  52.0  32.8  0.00    0    0    0    0    0    1  ...    0    0   
2030  2014  49.4  30.9  0.00    0    0    1    0    0    0  ...    0    0   

      May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  
2028    0    0    0    0    0    0    0    1  
2029    0    0    0    0    0    0    1    0  
2030    0    0    0    0    0    1    0    0  

[3 rows x 23 columns]


**Step 2: Compare Predictions with Actual NUM_COLLISIONS**

Next, we compare the model's predictions with the actual number of collisions for these specific days. Because test_set includes the NUM_COLLISIONS column and input_1 keeps the same row labels, we can directly compare the predictions with the actual values:

In [None]:
# input_1 keeps the row labels of test_set, so .loc lines up the actual values
actual_collisions = test_set.loc[input_1.index, 'NUM_COLLISIONS']

# Combine predictions with actual values for comparison
comparison_df = input_1.copy()
comparison_df['Predicted_Collisions'] = linear_day_predictions.flatten()
comparison_df['Actual_Collisions'] = actual_collisions

# Display the comparison
print(comparison_df)


      year  temp  dewp  prcp  Sat  Sun  Mon  Tue  Wed  Thu  ...  May  Jun  \
2028  2018  40.4  32.0  1.06    0    0    1    0    0    0  ...    0    0   
2029  2018  52.0  32.8  0.00    0    0    0    0    0    1  ...    0    0   
2030  2014  49.4  30.9  0.00    0    0    1    0    0    0  ...    0    0   

      Jul  Aug  Sep  Oct  Nov  Dec  Predicted_Collisions  Actual_Collisions  
2028    0    0    0    0    0    1            638.435791                689  
2029    0    0    0    0    1    0            672.468262                631  
2030    0    0    0    1    0    0            601.873596                591  

[3 rows x 25 columns]


## **Results:**

**DNN Model Analysis**

**First Prediction (638.44)**

Year: 2018
Temperature: 40.4°F
Dew Point: 32.0°F
Precipitation: 1.06 inches
Day of the Week: Monday
Month: December

**Second Prediction (672.47)**

Year: 2018
Temperature: 52.0°F
Dew Point: 32.8°F
Precipitation: 0.00 inches
Day of the Week: Thursday
Month: November

**Third Prediction (601.87)**

Year: 2014
Temperature: 49.4°F
Dew Point: 30.9°F
Precipitation: 0.00 inches
Day of the Week: Monday
Month: October
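The day and month shown for each prediction are read from the one-hot flag columns. A minimal sketch of that decoding is given below; `decode_row` is a hypothetical helper, and the example row is rebuilt by hand from the second prediction's printed flags.

```python
import pandas as pd

DAYS = ["Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def decode_row(row):
    """Return the (day, month) whose one-hot flags are set in a row."""
    return row[DAYS].idxmax(), row[MONTHS].idxmax()

# The second prediction's row: Thu and Nov are the only flags set
flags = {name: 0 for name in DAYS + MONTHS}
flags.update({"Thu": 1, "Nov": 1})
row = pd.Series(flags)

print(decode_row(row))  # ('Thu', 'Nov')
```

The helper assumes exactly one day flag and one month flag is set. If a group were all zeros, `idxmax` would return the first label in that group, so such rows would need a separate check.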

**Accuracy Assessment**

**First Prediction vs. Actual**

Predicted: 638.44
Actual: 689
Difference: -50.56 (Underestimation)

**Second Prediction vs. Actual**

Predicted: 672.47
Actual: 631
Difference: 41.47 (Overestimation)

**Third Prediction vs. Actual**

Predicted: 601.87
Actual: 591
Difference: 10.87 (Slight Overestimation)
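The signed differences can be recomputed directly from the printed predictions and actual values as a quick check on the hand-worked figures. This sketch uses only the three rows shown in the comparison table.

```python
predicted = [638.4358, 672.46826, 601.8736]  # model outputs printed earlier
actual = [689, 631, 591]                     # NUM_COLLISIONS for the same rows

# Signed error: negative means the model underestimated
errors = [p - a for p, a in zip(predicted, actual)]
for e in errors:
    print(round(e, 2))

# Mean absolute error over just these three rows
mae_three = sum(abs(e) for e in errors) / len(errors)
print(round(mae_three, 2))
```

This gives differences of -50.56, 41.47 and 10.87, and a mean absolute error of about 34.3 for these three rows, which is below the test-set MAE of about 52.75.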

**Contextual Interpretation**

**Impact of Weather and Time:**

The model's predictions appear to be influenced by weather conditions and time factors, although three rows cannot isolate any single effect. Prediction 2 (52.0°F, dry) is higher than prediction 1 (40.4°F, 1.06 inches of precipitation), but prediction 3 (49.4°F, dry) is the lowest of the three, so temperature alone does not explain the ordering. The year (2014 versus 2018), day and month also differ between the rows.

**Days of the Week and Month Influence:**

The predictions also reflect the influence of specific days and months. For instance, prediction 2 (Thursday in November) has a higher predicted number than prediction 1 (Monday in December), possibly indicating the model's sensitivity to these temporal factors.

**Model's Predictive Power and Accuracy:**

The model shows a mix of underestimation and overestimation across these scenarios. The three absolute errors (roughly 51, 41 and 11 collisions) all fall below the test-set MAE of about 52.75, with prediction 3 the closest, but the remaining deviation suggests room for model improvement.

**Implications for Emergency Services Planning**

**Weather-Related Planning:**

The model's sensitivity to weather conditions can help predict collision numbers under different weather scenarios and so support resource allocation during adverse conditions.

**Temporal Trends:**

Understanding how different days of the week and months impact collision numbers can inform long-term planning and preparedness strategies.

**Model Refinement for Improved Accuracy:**

To enhance the model's utility, further refinement could be done. This could involve incorporating additional relevant features, adjusting the model architecture, or employing more advanced modeling techniques.

**Remove Precipitation Variable To Improve Model**

Since the precipitation variable "prcp" had the lowest correlation with NUM_COLLISIONS among the inputs to DNN Model 1, we remove it to see whether accuracy improves. We call this model 'dnn_model_2', read in the data from the Assignment 1 report, and follow the same steps as for dnn_model_1.
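A ranking of this kind can be produced with a Pearson correlation of each feature against the target. The sketch below uses a small hypothetical frame, with values invented purely to illustrate the calculation, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the calculation
toy = pd.DataFrame({
    "temp": [30, 40, 50, 60, 70, 80],
    "dewp": [20, 25, 45, 40, 60, 55],
    "prcp": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "NUM_COLLISIONS": [500, 520, 560, 580, 610, 640],
})

# Pearson correlation of every feature with the target
corr = toy.corr()["NUM_COLLISIONS"].drop("NUM_COLLISIONS")

# Rank by absolute strength, weakest first
ranked = corr.abs().sort_values()
print(ranked)

weakest = ranked.index[0]
print(weakest)
```

A low linear correlation does not rule out a non-linear contribution that a neural network could still exploit, which is why the effect of dropping the feature is tested empirically here.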

In [None]:
# needed to create the data frame
import pandas as pd

# needed to help with speedy maths based calculations
import numpy as np

# create data frame from csv file we hosted on our github
df_dnn_2 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/dnnregressiondata2.csv', index_col=0)

# Reset the index if 'day' is set as index
df_dnn_2.reset_index(inplace=True)

# Confirm the data loaded by displaying the first five rows
df_dnn_2[:5]



Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,...,Sat,Sun,Thu,Tue,Wed,year,month,temp,dewp,NUM_COLLISIONS
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,2013,Jan,37.8,23.6,381
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,2013,Jan,27.1,10.5,480
2,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,2013,Jan,28.4,14.1,549
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,2013,Jan,33.4,18.6,505
4,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,2013,Jan,36.1,18.7,389


In [None]:
# Keep the 22 feature columns (no 'prcp') plus the target, in a fixed order
headers = ["year","temp", "dewp", "Sat","Sun","Mon","Tue","Wed","Thu","Fri","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec","NUM_COLLISIONS"]
df_dnn_input_2 = df_dnn_2[headers].copy()

# random_state=42 makes the shuffle reproducible and gives the same
# train/test rows as the split used for dnn_model_1

# Shuffle the entire dataset randomly
df_shuffled_2 = df_dnn_input_2.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the data into training and testing sets
# We want 80% of the data for training and 20% for testing
split_index = int(len(df_shuffled_2) * 0.8)
training_set_2 = df_shuffled_2[:split_index]
test_set_2 = df_shuffled_2[split_index:]

# Separate features and target variable for training and testing sets
training_features_2 = training_set_2.copy()
test_features_2 = test_set_2.copy()

training_labels_2 = training_features_2.pop('NUM_COLLISIONS')
test_labels_2 = test_features_2.pop('NUM_COLLISIONS')

# Normalizer layer
normaliser_2 = tf.keras.layers.Normalization(axis=-1)
normaliser_2.adapt(np.array(training_features_2))

# Define the model
dnn_model_2 = keras.Sequential([
    normaliser_2,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

# Compile the model
dnn_model_2.compile(loss='mean_absolute_error', optimizer=tf.keras.optimizers.Adam(0.001))

# Train the model
history_2 = dnn_model_2.fit(
    training_features_2,
    training_labels_2,
    validation_split=0.2,
    verbose=0,
    epochs=100
)




In [None]:
# Evaluate the model
dnn_model_2_results = dnn_model_2.evaluate(test_features_2, test_labels_2, verbose=0)
print(dnn_model_2_results)


53.3503532409668


In [None]:
# Use the same first three test rows as for dnn_model_1
selected_rows_2 = test_set_2.iloc[[0, 1, 2]]

# Make predictions
predictions_2 = dnn_model_2.predict(selected_rows_2.drop('NUM_COLLISIONS', axis=1))

# Compare predictions with actual values
comparison_df_2 = selected_rows_2.copy()
comparison_df_2['Predicted_Collisions'] = predictions_2.flatten()

# Display the comparison
print(comparison_df_2)


      year  temp  dewp  Sat  Sun  Mon  Tue  Wed  Thu  Fri  ...  May  Jun  Jul  \
2028  2018  40.4  32.0    0    0    1    0    0    0    0  ...    0    0    0   
2029  2018  52.0  32.8    0    0    0    0    0    1    0  ...    0    0    0   
2030  2014  49.4  30.9    0    0    1    0    0    0    0  ...    0    0    0   

      Aug  Sep  Oct  Nov  Dec  NUM_COLLISIONS  Predicted_Collisions  
2028    0    0    0    0    1             689            625.869690  
2029    0    0    0    1    0             631            667.956726  
2030    0    0    1    0    0             591            598.903381  

[3 rows x 24 columns]


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Predictions from dnn_model_1
predictions_1 = dnn_model_1.predict(test_features)
mae_1 = np.mean(np.abs(predictions_1.flatten() - test_labels))
mse_1 = mean_squared_error(test_labels, predictions_1.flatten())
r2_1 = r2_score(test_labels, predictions_1.flatten())

# Predictions from dnn_model_2
# Note: Make sure test_features_2 does not contain the 'prcp' column
predictions_2 = dnn_model_2.predict(test_features_2)
mae_2 = np.mean(np.abs(predictions_2.flatten() - test_labels_2))
mse_2 = mean_squared_error(test_labels_2, predictions_2.flatten())
r2_2 = r2_score(test_labels_2, predictions_2.flatten())

# Print the metrics for comparison
print("dnn_model_1 Metrics: MAE =", mae_1, "MSE =", mse_1, "R2 =", r2_1)
print("dnn_model_2 Metrics: MAE =", mae_2, "MSE =", mse_2, "R2 =", r2_2)


dnn_model_1 Metrics: MAE = 52.74507318681044 MSE = 5203.674955493212 R2 = 0.40122212457980744
dnn_model_2 Metrics: MAE = 53.35035016362719 MSE = 5245.4375827317335 R2 = 0.3964165713076052


The performance of dnn_model_1 and dnn_model_2 is compared below using MAE, MSE, and R², with values taken from the output above:

**Mean Absolute Error (MAE):**

dnn_model_1: MAE = 52.75

dnn_model_2: MAE = 53.35

**Interpretation:**

The MAE indicates the average deviation of predicted values from the actual values. dnn_model_1 has a slightly lower MAE compared to dnn_model_2, meaning it generally makes predictions that are closer to the actual number of collisions.

**Mean Squared Error (MSE):**

dnn_model_1: MSE = 5203.67

dnn_model_2: MSE = 5245.44

**Interpretation:**

MSE is similar to MAE but gives more weight to larger errors, as the errors are squared before they are averaged. The lower MSE in dnn_model_1 suggests that it not only predicts values closer to the actual figures on average (as indicated by MAE) but also makes fewer large errors compared to dnn_model_2.

**R-squared (R²):**

dnn_model_1: R² = 0.4012

dnn_model_2: R² = 0.3964

**Interpretation:**

R² measures the proportion of the variance in the dependent variable (number of collisions) that is predictable from the independent variables (features like weather conditions, day of the week, etc.). A higher R² value indicates a better fit of the model to the data. dnn_model_1 has a slightly higher R², suggesting it explains the variability in collision numbers marginally better than dnn_model_2.
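The three metrics can be written out from their definitions. The sketch below is a NumPy illustration on a small made-up example; `regression_metrics` is a hypothetical helper, not the scikit-learn implementation used in the cell above.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R²) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))                 # average absolute error
    mse = np.mean(residuals ** 2)                    # average squared error
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

mae, mse, r2 = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(mae, mse, r2)
```

For this toy input the residuals are 1, 0, -1 and 0, giving MAE = 0.5, MSE = 0.5 and R² = 1 - 2/20 = 0.9.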

**Which Model to Choose?**

Overall Performance: dnn_model_1 outperforms dnn_model_2 across all three metrics. It has a lower MAE and MSE and a higher R². This suggests that dnn_model_1 is generally more accurate and has a better fit to the data.

**Impact of 'prcp' Feature:**

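The presence of the 'prcp' (precipitation) feature in dnn_model_1 appears to contribute positively to the model's performance. Its removal in dnn_model_2 led to a marginal decrease in performance across all metrics (MAE up by about 0.6, R² down by about 0.005), which suggests that precipitation data plays a small role in predicting traffic collisions. Because each model was trained once and network weights are initialised randomly unless a TensorFlow seed is fixed, a difference this small should be read with caution.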
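One way to judge whether so small a gap in MAE is distinguishable from noise is a paired bootstrap over the shared test rows. This is an illustrative sketch and not part of the original analysis: `paired_bootstrap_mae_diff` is a hypothetical helper, the error values are made up, and the approach assumes both models are evaluated on the same test rows in the same order.

```python
import numpy as np

def paired_bootstrap_mae_diff(abs_err_a, abs_err_b, n_boot=2000, seed=0):
    """Bootstrap the difference in MAE (b minus a) over shared test rows.

    Returns the observed difference and a 95% percentile interval.
    """
    diffs = np.asarray(abs_err_b, dtype=float) - np.asarray(abs_err_a, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(diffs)
    # Resample row indices with replacement and average the paired differences
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), low, high

# Hypothetical absolute errors for five shared test rows
err_a = [50.0, 40.0, 10.0, 70.0, 30.0]
err_b = [55.0, 38.0, 12.0, 71.0, 29.0]
observed, low, high = paired_bootstrap_mae_diff(err_a, err_b)
print(observed, low, high)
```

In the notebook the inputs could be `np.abs(predictions_1.flatten() - test_labels)` and `np.abs(predictions_2.flatten() - test_labels_2)`. An interval that includes zero would indicate that the two models are not clearly separated on this test set.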

**Recommendation:**

Based on these results, dnn_model_1 is recommended as the preferred model for predicting traffic collisions. It demonstrates better accuracy and fit, implying that it could provide slightly more reliable predictions for use in applications such as planning and optimizing emergency services responses.

# **Conclusion**

**Context and Objective**

This comprehensive report aimed to develop predictive models to estimate the number of traffic collisions in New York City based on various factors, focusing on aiding the emergency services in optimizing response strategies.

The analysis encompassed linear regression models and Deep Neural Network (DNN) regression models, utilising datasets with incremental inclusion of weather-related variables.


**Findings from DNN Regression Models**

**DNN Model 1 (Multiple Variables including 'prcp'):**

Demonstrated sensitivity to weather conditions and time factors. MAE: 52.75, MSE: 5203.67, R²: 0.4012.

**DNN Model 2 (Excluding 'prcp'):**

Slight decrease in performance upon removing the 'prcp' variable. MAE: 53.35, MSE: 5245.44, R²: 0.3964.

**DNN Model Insights:**

Weather and Time Impact: Predictions indicated that weather conditions and time factors (days of the week, months) influence collision numbers, with the models explaining roughly 40% of the day-to-day variance (R² ≈ 0.40).

**Accuracy Assessment:**

The models showed a mix of underestimation and overestimation. The DNN Model 1 was slightly more accurate overall.

**Recommendation:**

DNN Model 1 is preferred for its marginally better accuracy and fit. The inclusion of 'prcp' contributes positively, albeit slightly, to model performance.

**Comprehensive Summary**

**Linear Regression Models:**

Offer straightforward predictions with increasing complexity and accuracy from Model 1 to Model 3.

**DNN Models:**

Provide a more nuanced understanding of the interplay between weather, time, and collision numbers, with DNN Model 1 slightly outperforming Model 2.

**Recommendation for Emergency Services:**

For simplicity and immediate insights: Linear Regression Model 1.
For detailed planning and comprehensive analysis: Linear Regression Model 3 or DNN Model 1.

**Actionable Strategy:**

Enhanced preparedness towards the weekend and during extreme temperatures (hot and cold) or humid conditions could be beneficial.



