
# **Introduction**

In this report, we explore the intricate relationship between weather conditions and traffic collisions in New York City. The primary objective is to develop predictive models that can accurately estimate the number of traffic collisions on any given day based on various weather parameters. This study is particularly significant for the emergency services in New York City, as it aims to enable them to optimize their emergency response strategies based on predictive insights.

The analysis is divided into two parts: The first part involves constructing a linear regression model, a fundamental yet powerful tool for understanding and predicting relationships between variables. The second part, which will be discussed in a subsequent report, focuses on developing a Deep Neural Network (DNN) regression model, leveraging the advanced capabilities of machine learning.

Four distinct datasets have been curated and hosted on GitHub for this analysis. Each dataset incrementally introduces more weather-related variables, starting from basic data on the day of the week and number of collisions, and gradually including temperature, dew point, and precipitation. This progressive approach allows us to understand the impact of each additional weather variable on the accuracy of our predictions.

The methodology section that follows will detail the data acquisition process, the preprocessing steps undertaken, the specific approach to linear regression analysis, and the evaluation metrics used to assess the models' performance.

# **Part 1 - Linear Regression Model**

**Data Preparation:**

The data is imported from GitHub and checked to confirm that it is clean and that all the necessary features are present. To predict the days with the highest number of collisions, the features include the day of the week and other relevant variables such as weather conditions.

**Model Building:**

A linear regression model is built for each dataset, using features that could affect the number of collisions, such as weather conditions and day of the week.

**Model Training and Evaluation:**

Each model is trained on a training dataset and its performance is evaluated on a test dataset.

Metrics such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) can be used for evaluation.
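As a minimal sketch of the two metrics named above, the following computes MAE and RMSE with NumPy. The helper names and the example values are made up for illustration and are not results from this report.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors, in the units of y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [500, 600, 700, 400]      # hypothetical daily collision counts
predicted = [520, 590, 640, 400]   # hypothetical model output

print(mae(actual, predicted))
print(rmse(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few errors are much larger than the rest.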

**Interpretation:**

The results are interpreted to understand which factors contribute most to collisions. This can usually be done by examining the coefficients of the linear regression model.

# **Methodology**

### **Data Acquisition and Preparation**

The initial phase of our analysis involves acquiring and preparing the data for our regression models. This process is fundamental to any data science project, as the quality and structure of the data directly influence the performance and reliability of the models.

We begin by importing the necessary libraries and loading the datasets. We use pandas, a data manipulation library, and numpy for numerical operations. Four separate datasets (df1, df2, df3, df4) are loaded from a GitHub repository. Each of these datasets represents a different combination of variables:

- df1: Contains 'day' and 'NUM_COLLISIONS'.
- df2: Adds 'temp' (temperature) to the variables in df1.
- df3: Adds 'dewp' (dew point) to the variables in df2.
- df4: Adds 'prcp' (precipitation) to the variables in df3.
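A small check along the following lines could confirm that each loaded frame has the variables listed above. It is only a sketch: `EXPECTED_COLUMNS`, `missing_columns` and `demo_df2` are hypothetical names, and the stand-in frame uses two rows copied from the printed df2 output instead of downloading the CSV.

```python
import pandas as pd

# Expected variables per dataset, taken from the descriptions above.
# 'day' is included because it becomes the index when loading with index_col=0.
EXPECTED_COLUMNS = {
    "df1": ["day", "NUM_COLLISIONS"],
    "df2": ["day", "temp", "NUM_COLLISIONS"],
    "df3": ["day", "temp", "dewp", "NUM_COLLISIONS"],
    "df4": ["day", "temp", "dewp", "prcp", "NUM_COLLISIONS"],
}

def missing_columns(frame, expected):
    """Return expected names found neither in the columns nor as the index name."""
    present = set(frame.columns) | {frame.index.name}
    return [name for name in expected if name not in present]

# Stand-in shaped like df2 after read_csv(..., index_col=0).
demo_df2 = pd.DataFrame(
    {"temp": [37.8, 27.1], "NUM_COLLISIONS": [381, 480]},
    index=pd.Index([4, 5], name="day"),
)

print(missing_columns(demo_df2, EXPECTED_COLUMNS["df2"]))  # nothing missing
print(missing_columns(demo_df2, EXPECTED_COLUMNS["df3"]))  # 'dewp' is absent
```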

In [6]:
# needed to create the data frame
import pandas as pd

# needed to help with speedy maths based calculations
import numpy as np

# create data frames from csv file we hosted on our github
df1 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata1.csv', index_col=0, )

df2 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata2.csv', index_col=0, )

df3 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata3.csv', index_col=0, )

df4 = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/linearregressiondata4.csv', index_col=0, )

print(df1)
print(df2)
print(df3)
print(df4)

     NUM_COLLISIONS
day                
4               381
5               480
6               549
7               505
2               389
..              ...
7               448
2               355
1               384
3               518
4               443

[2535 rows x 1 columns]
     temp  NUM_COLLISIONS
day                      
4    37.8             381
5    27.1             480
6    28.4             549
7    33.4             505
2    36.1             389
..    ...             ...
7    49.4             448
2    48.0             355
1    42.6             384
3    39.4             518
4    38.7             443

[2535 rows x 2 columns]
     temp  dewp  NUM_COLLISIONS
day                            
4    37.8  23.6             381
5    27.1  10.5             480
6    28.4  14.1             549
7    33.4  18.6             505
2    36.1  18.7             389
..    ...   ...             ...
7    49.4  40.9             448
2    48.0  37.4             355
1    42.6  30.2             384


In [7]:
# make sure we have our data by printing it out
print(df1[:6])
# print(df) #all

     NUM_COLLISIONS
day                
4               381
5               480
6               549
7               505
2               389
1               393


In [8]:
# A scale is not required here, but the constant will be useful
SCALE_NUM_COLLISIONS = 0.001

We now import TensorFlow and Keras, the libraries used to build the models. Although our current focus is on linear regression, the same libraries are used for the subsequent Deep Neural Network regression analysis. Printing the TensorFlow version (2.14.0) records exactly which version of the library was used, which supports reproducibility of results:

In [4]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

2.14.0


## **Number of Collisions vs Day**

We address a crucial aspect of data preparation: restructuring the DataFrame. Since the 'day' variable is set as an index in df1, we reset it to a regular column to facilitate easier manipulation and analysis. We then reconstruct df1 to align the input variables ('day') with the target variable ('NUM_COLLISIONS'). This step is critical for ensuring that our data is in the right format for applying regression models.

In [9]:
# Reset the index if 'day' is set as index
df1.reset_index(inplace=True)

# Now, create the DataFrame for input data
df1_input_data = [df1["day"], df1["NUM_COLLISIONS"]]

# Create headers for our new DataFrame. These should correlate with the above.
df1_input_headers = ["day", "NUM_COLLISIONS"]

# Create a final DataFrame using our new DataFrame and headers.
df1 = pd.concat(df1_input_data, axis=1, keys=df1_input_headers)

print(df1)


      day  NUM_COLLISIONS
0       4             381
1       5             480
2       6             549
3       7             505
4       2             389
...   ...             ...
2530    7             448
2531    2             355
2532    1             384
2533    3             518
2534    4             443

[2535 rows x 2 columns]


### **Day and NUM_COLLISIONS -  Model Development and Evaluation**

Continuing our methodology, we delve into the development of the linear regression model using TensorFlow, a robust framework for building machine learning models.

This code block is crucial for dividing our dataset into a training set and a test set. We use an 80-20 split, meaning 80% of the data is used for training the model, and the remaining 20% is reserved for testing its performance. This split is vital for evaluating the model on unseen data, ensuring that our model can generalise well to new data.

In [10]:
# construct a training set for running through the model and a test set. We use sample with frac=0.8 for an 80% training set and keep the remaining 20% for testing.
training_set_1 = df1.sample(frac=0.8, random_state=0)
test_set_1 = df1.drop(training_set_1.index)
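The split can be sanity-checked with a short sketch like the one below, which confirms that the two sets are disjoint and together cover every row. The frame `demo` is synthetic and only matches df1 in its row count (2535); the approach relies on the index labels being unique, which holds here because the index was reset earlier.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as df1; the values are synthetic.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "day": rng.integers(1, 8, size=2535),
    "NUM_COLLISIONS": rng.integers(300, 800, size=2535),
})

training_set = demo.sample(frac=0.8, random_state=0)
test_set = demo.drop(training_set.index)

# The two sets should share no index labels and together cover every row.
overlap = training_set.index.intersection(test_set.index)
print(len(training_set), len(test_set), len(overlap))
```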

Here, we prepare our features (inputs) and labels (outputs) for the training and test sets. We separate the 'NUM_COLLISIONS' column from our datasets, which is the target variable we aim to predict. This separation is a standard practice in supervised learning where the model learns to predict the labels from the features.

In [11]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features_1 = training_set_1.copy()
test_features_1 = test_set_1.copy()

training_labels_1 = training_features_1.pop('NUM_COLLISIONS')
test_labels_1 = test_features_1.pop('NUM_COLLISIONS')

Label scaling is addressed in this section. The raw collision counts (values such as 381 or 549) are divided by 1000 so that the labels fall roughly between 0 and 1. The same factor is used later to convert predictions back into collision counts.

In [12]:
# Scale the labels by dividing by 1000, which suits the range of NUM_COLLISIONS in this data. Predictions are multiplied by 1000 later to get back to collision counts.
training_labels_1 = training_labels_1/1000
test_labels_1 = test_labels_1/1000
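The scaling and its inverse can be illustrated with a few values. This is a sketch only: the array holds the first five collision counts printed for df1, and it multiplies by `SCALE_NUM_COLLISIONS` (0.001), which is equivalent to dividing by 1000 as the cell above does.

```python
import numpy as np

SCALE_NUM_COLLISIONS = 0.001  # same value as the constant defined earlier

# First five collision counts from the printed df1 output.
raw_labels = np.array([381, 480, 549, 505, 389], dtype=float)

scaled = raw_labels * SCALE_NUM_COLLISIONS   # what the model is trained on
restored = scaled / SCALE_NUM_COLLISIONS     # back to collisions per day

print(scaled)
print(restored)
```

Any prediction or error reported in scaled units needs the same division by `SCALE_NUM_COLLISIONS` before it can be read as a number of collisions.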

We print the training features for inspection and create a normalization layer for our TensorFlow model. This layer is an essential part of preprocessing in neural networks, ensuring that our model inputs have a uniform scale.

In [13]:
print(training_features_1)

      day
828     7
2409    4
2249    5
936     3
2390    6
...   ...
1874    6
718     6
1149    7
387     3
1369    4

[2028 rows x 1 columns]


Normalization helps in speeding up the training process and improving the performance of the model.

In [14]:
# Boilerplate for this model. The normalisation layer is adapted to the training features (the inputs), so it learns their mean and variance.
normaliser_1 = tf.keras.layers.Normalization(input_shape=[1,], axis=None) # tf.keras.layers.Normalization(axis=-1)
normaliser_1.adapt(np.array(training_features_1))
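What the normalisation layer does can be sketched in plain NumPy: it stores the mean and variance seen during `adapt` and then applies (x - mean) / sqrt(variance). The example below uses only the first five 'day' values from the training features printed above, so its statistics differ from those of the full training set.

```python
import numpy as np

# First five 'day' values from the printed training features.
day = np.array([7, 4, 5, 3, 6], dtype=float)

mean = day.mean()
variance = day.var()   # population variance, as a normalisation layer would store

normalised = (day - mean) / np.sqrt(variance)

print(mean, variance)
print(normalised)
```

After this transformation the values have mean 0 and standard deviation 1, which keeps features with different ranges (for example day and temperature) on a comparable scale.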

We define our linear regression model (model_1) using TensorFlow's Sequential API. The model comprises a normalization layer followed by a dense layer with a single unit, which is characteristic of a simple linear regression model. The model is compiled with the Adam optimizer and mean absolute error as the loss function. This setup is fundamental in defining how the model learns during training.

In [15]:
# I have decided to call the model, model_1. We add our normaliser and we are expecting a single output.
model_1 = tf.keras.Sequential([
    normaliser_1,
    layers.Dense(units=1)
])

In [16]:
model_1.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')
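A normalisation layer followed by a single dense unit computes y = w * (x - mu) / sigma + b, which is a straight line in the original input with slope w / sigma and intercept b - w * mu / sigma. The sketch below checks that equivalence numerically. The values of `mu`, `sigma`, `w` and `b` are hypothetical, not the weights learned in this notebook.

```python
import numpy as np

mu, sigma = 4.0, 2.0   # hypothetical statistics stored by the normalisation layer
w, b = 0.04, 0.60      # hypothetical dense-layer kernel and bias

def model(x):
    """Normalisation followed by a one-unit dense layer with no activation."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return w * z + b

# Equivalent straight line expressed in the original units of x.
slope = w / sigma
intercept = b - w * mu / sigma

days = np.arange(1, 8)
print(model(days))
print(slope * days + intercept)
```

In Keras, the trained kernel and bias of the dense layer can be inspected with the layer's `get_weights()` method, which is one way to read off the fitted slope.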

In [17]:
%%time
# now we are going to fit the model where we require the training features and labels. We will run it 100 times i.e. epochs and we have applied a further 20% validation split.
history = model_1.fit(
    training_features_1,
    training_labels_1,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13 s, sys: 517 ms, total: 13.5 s
Wall time: 13.1 s


In [18]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_1 = model_1.evaluate(
    test_features_1,
    test_labels_1, verbose=0)

In [19]:
# Print the mean absolute error of the model; we want to minimise this. It is in scaled units (collisions / 1000) and will vary between training runs due to random initialisation.
print(mean_absolute_error_model_1)

0.06635782867670059
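The printed value is in scaled units, so it helps to convert it back to collisions and to compare it against a naive baseline. The sketch below does the conversion with the value printed above, then shows how a "predict the training median every day" baseline could be computed. The baseline labels are synthetic, so the baseline figure is illustrative only.

```python
import numpy as np

SCALE_NUM_COLLISIONS = 0.001

# Test MAE printed above, in scaled units.
scaled_mae = 0.06635782867670059
mae_in_collisions = scaled_mae / SCALE_NUM_COLLISIONS
print(round(mae_in_collisions, 1))   # roughly 66 collisions per day

# Naive baseline on synthetic labels: always predict the training median.
rng = np.random.default_rng(0)
train_labels = rng.normal(600, 90, size=2028)
test_labels = rng.normal(600, 90, size=507)
baseline_mae = float(np.mean(np.abs(test_labels - np.median(train_labels))))
print(round(baseline_mae, 1))
```

A model is only useful for planning if its MAE is clearly below such a baseline computed on the real test labels.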


## **Number of Collisions vs Day and Temp**

As with df1, we restructure the DataFrame before modelling. Since the 'day' variable is set as the index in df2, we reset it to a regular column. We then rebuild df2 so that the input variables ('day' and 'temp') come first and the target variable ('NUM_COLLISIONS') comes last. The split, label scaling, normalisation and training steps then follow the same pattern as for the first model.

### **Number of Collisions vs Day and Temp - Model Development and Evaluation**


In [20]:
# Reset the index if 'day' is set as index
df2.reset_index(inplace=True)

# create a dataframe with the inputs and the output at the end using the imported dataframe. This can be replicated for any configuration; in this case, the inputs are day and temp
df2_input_data = [df2["day"], df2["temp"], df2["NUM_COLLISIONS"]]
# create headers for our new dataframe. These should correlate with the above.
df2_input_headers = ["day", "temp", "NUM_COLLISIONS"]
# create a final dataframe using our new dataframe and headers.
df2 = pd.concat(df2_input_data, axis=1, keys=df2_input_headers)

print(df2)

      day  temp  NUM_COLLISIONS
0       4  37.8             381
1       5  27.1             480
2       6  28.4             549
3       7  33.4             505
4       2  36.1             389
...   ...   ...             ...
2530    7  49.4             448
2531    2  48.0             355
2532    1  42.6             384
2533    3  39.4             518
2534    4  38.7             443

[2535 rows x 3 columns]


In [21]:
# construct a training set for running through the model and a test set. We use sample with frac=0.8 for an 80% training set and keep the remaining 20% for testing.
training_set_2 = df2.sample(frac=0.8, random_state=0)
test_set_2 = df2.drop(training_set_2.index)

In [22]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features_2 = training_set_2.copy()
test_features_2 = test_set_2.copy()

training_labels_2 = training_features_2.pop('NUM_COLLISIONS')
test_labels_2 = test_features_2.pop('NUM_COLLISIONS')

In [23]:
# Scale the labels by dividing by 1000, which suits the range of NUM_COLLISIONS in this data. Predictions are multiplied by 1000 later to get back to collision counts.
training_labels_2 = training_labels_2/1000
test_labels_2 = test_labels_2/1000

In [24]:
# Boilerplate for this model. The normalisation layer is adapted to the training features (the inputs), so it learns the mean and variance of each column.
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features_2))

In [25]:
# This model is called model_2. We add our normaliser and we are expecting a single output.
model_2 = tf.keras.Sequential([
    normaliser,
    layers.Dense(units=1)
])

In [26]:
# more boiler plate for creating a sequential model, we need an optimiser and loss parameter. Here we are going to be using the mean absolute error MAE
model_2.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [27]:
%%time
# now we are going to fit the model where we require the training features and labels. We will run it 100 times i.e. epochs and we have applied a further 20% validation split.
history = model_2.fit(
    training_features_2,
    training_labels_2,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 14.2 s, sys: 592 ms, total: 14.8 s
Wall time: 14.8 s


In [28]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_2 = model_2.evaluate(
    test_features_2,
    test_labels_2, verbose=0)

In [29]:
# Print the mean absolute error of the model; we want to minimise this. It is in scaled units (collisions / 1000) and will vary between training runs due to random initialisation.
print(mean_absolute_error_model_2)

0.08275976777076721


In [30]:
# Try scaling NUM_COLLISIONS
training_labels_2 = training_set_2['NUM_COLLISIONS']*SCALE_NUM_COLLISIONS
test_labels_2 = test_set_2['NUM_COLLISIONS']*SCALE_NUM_COLLISIONS

# Normalization layer for features
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features_2[['day', 'temp']]))

# Updated model architecture. The extra ReLU layer makes this a small neural network rather than a strictly linear regression.
model_2 = tf.keras.Sequential([
    normaliser,
    layers.Dense(units=10, activation='relu'),  # additional layer
    layers.Dense(units=1)
])

# Compile with a lower learning rate
model_2.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss='mean_absolute_error')

# Train the model
history = model_2.fit(training_features_2, training_labels_2, epochs=100, verbose=1, validation_split=0.2)



Epoch 1/100
...
Epoch 77/100
Epoch 78

## **Number of Collisions vs Day, Temp and Dewp**

In [31]:
# Reset the index if 'day' is set as index
df3.reset_index(inplace=True)

# create a dataframe with the inputs and the output at the end using the imported dataframe. This can be replicated for any configuration; in this case, the inputs are day, temp and dewp
df3_input_data = [df3["day"], df3["temp"], df3["dewp"], df3["NUM_COLLISIONS"]]
# create headers for our new dataframe. These should correlate with the above.
df3_input_headers = ["day", "temp", "dewp", "NUM_COLLISIONS"]
# create a final dataframe using our new dataframe and headers.
df3 = pd.concat(df3_input_data, axis=1, keys=df3_input_headers)

print(df3)

      day  temp  dewp  NUM_COLLISIONS
0       4  37.8  23.6             381
1       5  27.1  10.5             480
2       6  28.4  14.1             549
3       7  33.4  18.6             505
4       2  36.1  18.7             389
...   ...   ...   ...             ...
2530    7  49.4  40.9             448
2531    2  48.0  37.4             355
2532    1  42.6  30.2             384
2533    3  39.4  38.3             518
2534    4  38.7  34.9             443

[2535 rows x 4 columns]


In [32]:
# construct a training set for running through the model and a test set. We use sample with frac=0.8 for an 80% training set and keep the remaining 20% for testing.
training_set_3 = df3.sample(frac=0.8, random_state=0)
test_set_3 = df3.drop(training_set_3.index)

In [33]:
# copy the datasets. NUM_COLLISIONS is not popped here, so it remains in the feature frames; the input columns are selected explicitly when normalising and fitting.
training_features_3 = training_set_3.copy()
test_features_3 = test_set_3.copy()

# Take the labels from the unscaled NUM_COLLISIONS column; they are scaled in the next cell.
training_labels_3 = training_set_3['NUM_COLLISIONS']
test_labels_3 = test_set_3['NUM_COLLISIONS']

In [34]:
# Scale the labels by dividing by 1000, which suits the range of NUM_COLLISIONS in this data. Predictions are multiplied by 1000 later to get back to collision counts.
training_labels_3 = training_labels_3/1000
test_labels_3 = test_labels_3/1000

In [35]:
# Display the first few rows of training_features_3 to check its structure
print(training_features_3.head())


      day  temp  dewp  NUM_COLLISIONS
828     7  60.5  48.6             679
2409    4  67.7  56.8             529
2249    5  44.3  24.1             584
936     3  81.3  60.0             705
2390    6  74.1  66.5             646


In [36]:
# Normalize features
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(training_features_3[['day', 'temp', 'dewp']])


In [37]:
# Model architecture
model_3 = tf.keras.Sequential([
    normaliser,
    layers.Dense(units=10, activation='relu'),
    layers.Dense(units=1)
])

# Compile the model
model_3.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss='mean_absolute_error')


In [38]:
# Train the model
history = model_3.fit(
    training_features_3[['day', 'temp', 'dewp']],
    training_labels_3,
    epochs=100,
    verbose=1,
    validation_split=0.2
)


Epoch 1/100
...
Epoch 77/100
Epoch 78

# **Results**

### **Number of Collisions Predictions vs Days**

In [40]:
# iloc allows us to select by position. Here, we shuffle the rows into a random order.
shuffle = df3.iloc[np.random.permutation(len(df3))]

# select all rows of the first three columns (positions 0 to 2): day, temp and dewp
predictors = shuffle.iloc[:,0:3]
# The target is the last column, so it could be selected with
# predictorTest = shuffle.iloc[:,-1]

# print out the first 3 rows of predictors.
print(predictors[:3])

      day  temp  dewp
1371    6  60.1  54.1
1947    1  67.5  60.9
945     5  77.0  59.7


In [42]:
# we create a custom dataframe with 3 values per feature.
input_1 = pd.DataFrame.from_dict(data = {'day' : [6,1,5]})

# next we can check this out, you can multiply by 1000 to get more realistic NUM_COLLISIONS values.
linear_day_predictions = model_1.predict(input_1[:3])*1000 # essentially 1000 in this instance would give back realistic numbers based on the NUM_COLLISIONS data
print(linear_day_predictions)

[[642.8704 ]
 [538.7159 ]
 [622.03955]]


In [74]:
# Prepare the data for prediction
# Assuming you want to predict for all days, create a DataFrame for this
days_data = pd.DataFrame({'day': range(1, 8)})  # 1 to 7 representing days of the week

# Make predictions
predictions = model_1.predict(days_data)

# Add predictions to the DataFrame
days_data['Predicted_Collisions'] = predictions.flatten()

# Assuming the scale factor used was SCALE_NUM_COLLISIONS
SCALE_NUM_COLLISIONS = 0.001  # Replace with your actual scale factor
days_data['Predicted_Collisions'] = days_data['Predicted_Collisions'] / SCALE_NUM_COLLISIONS

# Analyze which days have higher predicted collisions
print(days_data.sort_values(by='Predicted_Collisions', ascending=False))


   day  Predicted_Collisions
6    7            663.701294
5    6            642.870422
4    5            622.039551
3    4            601.208618
2    3            580.377747
1    2            559.546814
0    1            538.715881


**Conclusions and Insights**

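**Day-wise Collision Risk:** The model predicts more collisions later in the week, with the highest value on day 7 (which day of the week this is depends on how the days are coded). The predictions rise by the same amount, about 20.8 collisions, for each step in the day number. Because 'day' enters the model as a single numeric feature, a linear model can only produce a straight-line trend across the week, so this ranking reflects the overall slope rather than an independent estimate for each day.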

**Actionable Insights:** Emergency services could use this information to prepare more resources or be on higher alert towards the end of the week, especially on days 6 and 7.
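The straight-line behaviour of the day-only model can be checked directly from the predictions printed above: the step between consecutive days is constant. The second half of the sketch shows two ways to obtain a separate estimate per day, namely per-day means and one-hot encoding. The frame `demo` is hypothetical data used only to show the calls.

```python
import numpy as np
import pandas as pd

# Predictions printed above for days 1 to 7.
predicted = np.array([538.715881, 559.546814, 580.377747, 601.208618,
                      622.039551, 642.870422, 663.701294])
steps = np.diff(predicted)
print(steps.round(2))   # the same step between every pair of consecutive days

# Hypothetical data: treating 'day' as a category lets each day have its own level.
demo = pd.DataFrame({
    "day": [1, 1, 2, 2, 6, 6, 7, 7],
    "NUM_COLLISIONS": [500, 520, 610, 630, 700, 720, 540, 560],
})
per_day_mean = demo.groupby("day")["NUM_COLLISIONS"].mean()
print(per_day_mean)

# One-hot columns could replace the single numeric 'day' feature in the model.
one_hot = pd.get_dummies(demo["day"], prefix="day")
print(one_hot.columns.tolist())
```

With one-hot features, a linear model can rank the days in any order, for example a quiet day 7 after a busy day 6, which a single numeric 'day' feature cannot express.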

### **Number of Collisions Predictions vs Days and Temp**

In [44]:
# Prediction Input
input_2 = pd.DataFrame.from_dict({'day': [6, 1, 5], 'temp': [60.1, 67.5, 77.0]})

# Predictions
linear_day_temp_predictions = model_2.predict(input_2)*1000
print(linear_day_temp_predictions)

[[634.70325]
 [505.4407 ]
 [641.42535]]


In [76]:
# Use df2 for predictions
predictions_2 = model_2.predict(df2[['day', 'temp']])

# Add predictions to df2 for analysis
df2['Predicted_Collisions'] = predictions_2.flatten()

# Assuming the scale factor used was SCALE_NUM_COLLISIONS
SCALE_NUM_COLLISIONS = 0.001  # Replace with your actual scale factor
df2['Predicted_Collisions'] = df2['Predicted_Collisions'] / SCALE_NUM_COLLISIONS

# Analyze the predictions
print(df2.sort_values(by='Predicted_Collisions', ascending=False))


      day  temp  NUM_COLLISIONS  Predicted_Collisions
198     7  89.1             697            707.153442
1338    7  85.0             715            700.081177
1282    7  83.0             798            696.631287
149     7  83.0             736            696.631287
1289    7  82.8             769            696.286316
...   ...   ...             ...                   ...
1456    1  20.1             406            453.098236
1822    1  19.2             494            452.481415
768     1  18.9             454            452.275848
1815    1   9.7             475            445.970825
1131    1   6.9             309            444.051849

[2535 rows x 4 columns]


### **Number of Collisions Predictions vs Days, Temp and Dewp**

In [45]:
# Prediction input
input_3 = pd.DataFrame({'day': [6, 1, 5], 'temp': [60.1, 67.5, 77.0], 'dewp': [54.1, 60.9, 59.7]})

# Predictions
linear_day_temp_dewp_predictions = model_3.predict(input_3)*1000
print(linear_day_temp_dewp_predictions)


[[662.01465]
 [535.937  ]
 [672.81256]]


In [77]:
# Use df3 for predictions
predictions_3 = model_3.predict(df3[['day', 'temp', 'dewp']])

# Add predictions to df3 for analysis
df3['Predicted_Collisions'] = predictions_3.flatten()

# Assuming the scale factor used was SCALE_NUM_COLLISIONS
SCALE_NUM_COLLISIONS = 0.001  # Replace with your actual scale factor
df3['Predicted_Collisions'] = df3['Predicted_Collisions'] / SCALE_NUM_COLLISIONS

# Analyze the predictions
print(df3.sort_values(by='Predicted_Collisions', ascending=False))


      day  temp  dewp  NUM_COLLISIONS  Predicted_Collisions
198     7  89.1  72.1             697            721.016602
1929    5  77.0  34.6             792            717.288391
926     7  77.8  49.0             490            715.323181
2349    7  81.2  58.2             753            715.147278
149     7  83.0  63.4             736            714.777466
...   ...   ...   ...             ...                   ...
768     1  18.9   4.1             454            468.264648
1822    1  19.2   0.6             494            466.495636
1512    1  22.8  -8.1             489            463.188019
1815    1   9.7  -6.6             475            458.994019
1131    1   6.9 -16.1             309            452.812561

[2535 rows x 5 columns]


To compare the three models and determine which is best for prediction, they should be evaluated on common ground: a test set that was not used during training, and the same performance metric, such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).

The approach is as follows:

1. Separate Test Set

Each dataset was split 80/20 into training and test sets earlier (test_set_1, test_set_2 and test_set_3). All three splits use random_state=0 on frames of the same length, so they select the same rows, as the matching row labels in the printed training features show. The test sets are therefore directly comparable.

2. Evaluate Each Model

Each model is evaluated on its test set using only the features it was trained on: 'day' for model_1, 'day' and 'temp' for model_2, and 'day', 'temp' and 'dewp' for model_3.

In [84]:
from sklearn.metrics import mean_absolute_error

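# NOTE: the models were trained on labels divided by 1000, so predict() returns
# values in those scaled units, while NUM_COLLISIONS here is the raw count.
# The MAE values printed below are therefore close to the average daily
# collision count. To get an error in collisions, divide the predictions by
# SCALE_NUM_COLLISIONS before comparing.

# Evaluate model_1
predictions_1 = model_1.predict(test_set_1[['day']])
mae_1 = mean_absolute_error(test_set_1['NUM_COLLISIONS'], predictions_1.flatten())

# Evaluate model_2
predictions_2 = model_2.predict(test_set_2[['day', 'temp']])
mae_2 = mean_absolute_error(test_set_2['NUM_COLLISIONS'], predictions_2.flatten())

# Evaluate model_3
predictions_3 = model_3.predict(test_set_3[['day', 'temp', 'dewp']])
mae_3 = mean_absolute_error(test_set_3['NUM_COLLISIONS'], predictions_3.flatten())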

# Print the MAE for each model
print(f"MAE of Model 1: {mae_1}")
print(f"MAE of Model 2: {mae_2}")
print(f"MAE of Model 3: {mae_3}")


MAE of Model 1: 600.6678987587462
MAE of Model 2: 600.6774156478617
MAE of Model 3: 600.6540629563365
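MAE values of about 600 are what would be expected when predictions in scaled units (collisions divided by 1000) are compared against raw collision counts. The sketch below reproduces that effect and shows the rescaling that puts both on the same footing. The labels and predictions are synthetic, so the numbers illustrate the mechanism rather than the performance of the models in this report.

```python
import numpy as np

SCALE_NUM_COLLISIONS = 0.001

rng = np.random.default_rng(1)

# Synthetic raw collision counts and scaled predictions with some error.
labels = rng.normal(600, 90, size=507)
scaled_predictions = (labels + rng.normal(0, 60, size=507)) * SCALE_NUM_COLLISIONS

# Comparing raw labels with scaled predictions: the result is close to the mean label.
mismatched_mae = float(np.mean(np.abs(labels - scaled_predictions)))

# Rescaling the predictions first gives an error in collisions per day.
rescaled_predictions = scaled_predictions / SCALE_NUM_COLLISIONS
corrected_mae = float(np.mean(np.abs(labels - rescaled_predictions)))

print(round(mismatched_mae, 1), round(float(labels.mean()), 1))
print(round(corrected_mae, 1))
```

Applied to the cell above, the same idea would mean dividing each `predictions_n.flatten()` by `SCALE_NUM_COLLISIONS` before passing it to `mean_absolute_error`.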


# **Conclusion**

The primary objective of this analysis was to develop regression models that could predict the number of traffic collisions on any given day in New York City, aiding the emergency services in optimizing their response strategies. Three different models were developed and compared:

Model 1 (Linear Regression with 'day'): Utilized only the day of the week as the predictor variable. Its predictions rose steadily through the week, with the highest number of collisions predicted for day 7. Evaluated on the scaled labels, its test MAE was about 0.066, or roughly 66 collisions per day. The MAE of approximately 600.67 printed in the comparison cell compares scaled predictions (collisions divided by 1000) against unscaled collision counts, so it mostly reflects the average daily collision count rather than the model's error.

Model 2 ('day' and 'temp'): Included both the day of the week and temperature. The model predicted the highest collisions on day 7 with a temperature of 89.1. The first, strictly linear version had a test MAE of about 0.083 on the scaled labels, or roughly 83 collisions per day. The version used for the predictions adds a 10-unit ReLU layer, so it is no longer strictly linear. The comparison cell printed approximately 600.68 for this model, subject to the same unit mismatch.

Model 3 ('day', 'temp', and 'dewp'): Added dew point to the features. The highest collisions were predicted on day 7 with a temperature of 89.1 and a dew point of 72.1. Like the final Model 2, it includes a 10-unit ReLU layer. The comparison cell printed approximately 600.65 for this model, again subject to the unit mismatch.

Key Insights and Recommendations:

Progression Throughout the Week: All models predicted more collisions towards the end of the week, suggesting a need for heightened preparedness by emergency services during these days. Because the day enters each model as a single number, this is a fitted trend across the week rather than a separate estimate for each day.

Impact of Weather Conditions: Models 2 and 3, which included weather conditions (temperature and dew point), offered more detailed predictions. The MAE values from the comparison cell (600.65 to 600.68) cannot be used to rank the models, because the predictions and the labels were on different scales. The only like-for-like test figures are 0.066 for Model 1 and 0.083 for the first version of Model 2, both on the scaled labels, so an accuracy benefit from adding weather variables has not been established here.

Model Selection for Emergency Services: Until the models are re-evaluated on a common scale, Model 1 (using only the day of the week) is a reasonable default for its simplicity and ease of interpretation. Model 3 may be preferred if the emergency services value the additional weather detail, but its accuracy advantage has not been demonstrated.

Final Note:
While the models provide valuable insights, the predictions should be used as one of several tools in decision-making. Other factors, such as special events, traffic pattern changes, and city-wide initiatives, should also be considered in the planning process by the emergency services. Additionally, re-evaluating all models on a common scale and incorporating additional data, such as the precipitation variable in the fourth dataset, could be explored to enhance predictive accuracy.

From test inputs we have used, the first element in the array here is similar to this actual data:

"165",1,62.4,5.6,0.668458757342322

We can see that the temperature is slightly higher in the test data (which means fewer trips), but the slightly higher wind means more trips. So the difference between (actual) 0.668 and (predicted) 0.576 (rounded to 3 significant figures) seems reasonable.

Similarly with the second:

"389",1,26.6,3.1,0.763954173062719, which has higher number of trips due to a lower temperature and also with a slightly higher wind speed.

And with the third:

"571",1,77.2,8.4,0.724652060408235

The last prediction with the higher temperature seems to punish the values more.

This test uses day 6 (Friday) instead of day 1 (Sunday), which shows a higher number of trips. The other values were left the same.

Things to think about for the assignment: make a validation set, i.e. 5% of the data (or maybe more), and use it for this type of testing. My values are simply made up.
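
One way such a validation set could be held out is sketched below. The data frame here is synthetic (the values are made up purely for illustration) and the names `demo_df`, `validation_df` and `remaining_df` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; every value here is made up
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "day": rng.integers(1, 8, size=200),
    "temp": rng.uniform(20, 90, size=200),
    "NUM_COLLISIONS": rng.integers(300, 900, size=200),
})

# Hold out 5% of the rows as a validation set the model never sees
validation_df = demo_df.sample(frac=0.05, random_state=0)

# Everything else stays available for training and testing
remaining_df = demo_df.drop(validation_df.index)

print(len(validation_df), len(remaining_df))
```

Fixing `random_state` keeps the split reproducible, so the same rows are held out on every run.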

You should also remember to use different models with different data. In this case, I would maybe take each input variable separately and make a regression model for each, then try different combinations, e.g. any two together.
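
The idea of fitting one model per variable and then per pair can be sketched as follows. This assumes synthetic data and an ordinary least squares fit via NumPy; the feature names mirror the notebook, but the values, the target and the `fit_and_score` helper are hypothetical. The MAE here is measured on the training data only, so a real comparison should use held-out data instead.

```python
from itertools import combinations

import numpy as np

# Synthetic features and target; all values are made up for illustration
rng = np.random.default_rng(0)
n = 300
features = {
    "day": rng.integers(1, 8, size=n).astype(float),
    "temp": rng.uniform(20, 90, size=n),
    "dewp": rng.uniform(0, 75, size=n),
}
target = 500 + 10 * features["day"] + 0.5 * features["temp"] + rng.normal(0, 20, size=n)


def fit_and_score(names):
    """Fit least squares on the named columns and return the training MAE."""
    # Stack the chosen columns and append a column of ones for the intercept
    X = np.column_stack([features[name] for name in names] + [np.ones(n)])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean(np.abs(X @ coefs - target)))


# Try every single variable, then every pair of variables
results = {}
for size in (1, 2):
    for names in combinations(features, size):
        results[names] = fit_and_score(names)

# List the variants from lowest to highest MAE
for names, mae in sorted(results.items(), key=lambda item: item[1]):
    print(names, round(mae, 2))
```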

Remember, you need to write up your results in the assignment.

# **Part 2 - DNN Regression Model**

In [46]:
# needed to create the data frame
import pandas as pd

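# needed to help with speedy maths based calculations
import numpy as np

# needed to build and train the DNN later in this part
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers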

# create data frame from csv file we hosted on our github
df = pd.read_csv('https://raw.githubusercontent.com/22015866uhi/22015866_Data_Analytics/main/dnnregressiondata1.csv', index_col=0)

print(df)

     Aug  Dec  Feb  Jan  Jul  Jun  Mar  May  Nov  Oct  ...  Sun  Thu  Tue  \
Apr                                                    ...                  
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    1   
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    0   
0      0    0    0    1    0    0    0    0    0    0  ...    0    1    0   
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    0   
0      0    0    0    1    0    0    0    0    0    0  ...    0    0    0   
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    1    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    0   
0      0    1    0    0    0    0    0    0    0    0  ...    0    0    1   

In [48]:
# index_col=0 above turned the first column ('Apr') into the index; reset it so 'Apr' is an ordinary column again
df.reset_index(inplace=True)

# make sure we have our data by printing it out
df[:5]
# df #all

Unnamed: 0,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,...,Sun,Thu,Tue,Wed,year,month,temp,prcp,dewp,NUM_COLLISIONS
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,2013,Jan,37.8,0.0,23.6,381
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,2013,Jan,27.1,0.0,10.5,480
2,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,2013,Jan,28.4,0.0,14.1,549
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,2013,Jan,33.4,0.0,18.6,505
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,2013,Jan,36.1,0.0,18.7,389


In [49]:
dnn_input_data = [df["year"], df["temp"], df["dewp"], df["prcp"], df["Sat"], df["Sun"], df["Mon"], df["Tue"], df["Wed"], df["Thu"], df["Fri"], df["Jan"], df["Feb"], df["Mar"], df["Apr"], df["May"], df["Jun"], df["Jul"], df["Aug"], df["Sep"], df["Oct"], df["Nov"], df["Dec"], df["NUM_COLLISIONS"]]
headers = ["year","temp", "dewp", "prcp", "Sat","Sun","Mon","Tue","Wed","Thu","Fri","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec","NUM_COLLISIONS"]
df_dnn = pd.concat(dnn_input_data, axis=1, keys=headers)
df_dnn.head()

Unnamed: 0,year,temp,dewp,prcp,Sat,Sun,Mon,Tue,Wed,Thu,...,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,NUM_COLLISIONS
0,2013,37.8,23.6,0.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,381
1,2013,27.1,10.5,0.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,480
2,2013,28.4,14.1,0.0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,549
3,2013,33.4,18.6,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,505
4,2013,36.1,18.7,0.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,389


In [50]:
training_dataset = df_dnn.sample(frac=0.8, random_state=0)
test_dataset = df_dnn.drop(training_dataset.index)

In [58]:
# A scale constant for the collision counts; defined first so the cells below can use it
SCALE_NUM_COLLISIONS = 0.001

In [59]:
training_features = training_dataset.copy()
test_features = test_dataset.copy()

# Separate the target column from the features
training_labels = training_features.pop('NUM_COLLISIONS')*SCALE_NUM_COLLISIONS
test_labels = test_features.pop('NUM_COLLISIONS')*SCALE_NUM_COLLISIONS

In [60]:
# Undo the scaling, so the labels used for training are raw collision counts again
training_labels = training_labels/SCALE_NUM_COLLISIONS
test_labels = test_labels/SCALE_NUM_COLLISIONS

In [61]:
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features))
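
As a rough illustration of what the normalisation layer does once adapted: it learns a mean and variance per feature column and then rescales each input as (value - mean) / sqrt(variance). The sketch below reproduces that arithmetic in plain NumPy on a few illustrative rows (year, temp, dewp); it is not the Keras implementation itself, and the sample values are chosen only for demonstration.

```python
import numpy as np

# A few illustrative feature rows: year, temp, dewp
train = np.array([
    [2013, 37.8, 23.6],
    [2013, 27.1, 10.5],
    [2014, 28.4, 14.1],
    [2015, 33.4, 18.6],
], dtype="float64")

# Per-column statistics, analogous to what adapt() learns from the training features
mean = train.mean(axis=0)
variance = train.var(axis=0)

# Standardise each column to mean 0 and standard deviation 1
normalised = (train - mean) / np.sqrt(variance)

print(normalised.round(3))
```

Because the statistics come from the training features only, the test data is rescaled with the same mean and variance rather than its own.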

In [62]:
# The only difference from a linear model: instead of a single layer, we have our normalisation layer (23 inputs), 2 hidden layers of 48 units, and 1 output. The 48 can be adjusted to improve the net.
dnn_model_1 = keras.Sequential([
      normaliser,
      layers.Dense(48, activation='relu'),
      layers.Dense(48, activation='relu'),
      layers.Dense(1)
  ])

dnn_model_1.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.001))

In [63]:
%%time
history = dnn_model_1.fit(
    training_features,
    training_labels,
    validation_split=0.2,
    verbose=0,
    epochs=100)

CPU times: user 16.9 s, sys: 659 ms, total: 17.5 s
Wall time: 21.3 s
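
Because `validation_split` is set, the returned `history.history` dictionary holds a `loss` and a `val_loss` list with one entry per epoch. A minimal sketch of how those curves might be inspected is shown below; the numbers in `demo_history` are hypothetical stand-ins, not results from this run.

```python
# Hypothetical loss curves standing in for history.history
demo_history = {
    "loss": [120.0, 80.0, 65.0, 58.0, 55.0, 53.0],
    "val_loss": [125.0, 85.0, 70.0, 66.0, 67.0, 69.0],
}

val_loss = demo_history["val_loss"]

# Epoch (0-indexed) at which the validation loss was lowest
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

print(f"Best epoch: {best_epoch}, validation MAE: {val_loss[best_epoch]}")
```

A validation loss that starts rising while the training loss keeps falling, as in the made-up curves here, is a common sign of overfitting.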


In [64]:
# evaluate returns the loss (mean absolute error) on the test set. We want to minimise this: the model with the lowest value is the best.
dnn_model_1_results = dnn_model_1.evaluate(test_features, test_labels, verbose=0)
print(dnn_model_1_results)

54.296348571777344
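
The figure returned by `evaluate` is the loss the model was compiled with, here the mean absolute error. As a small sketch of what that metric measures, the calculation can be done by hand; the `actual` values are the first four NUM_COLLISIONS entries printed earlier, while the `predicted` values are made up for illustration.

```python
import numpy as np

# First four NUM_COLLISIONS values from the data frame shown earlier
actual = np.array([381.0, 480.0, 549.0, 505.0])

# Made-up predictions, purely for illustration
predicted = np.array([400.0, 450.0, 600.0, 500.0])

# Mean absolute error: the average size of the miss, ignoring its direction
mae = np.mean(np.abs(actual - predicted))

print(mae)
```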


In [66]:
# make sure the labels match up with the dataframe from earlier.
input_1 = pd.DataFrame.from_dict(data =
				{
         'year' : [2018,2019,2018],
         'temp' : [60.1, 67.5, 77.0],
         'dewp' : [54.1, 60.9, 59.7],
         'prcp' : [0.2,0,0.24],
         'Sat' : [0,0,0],
         'Sun' : [0,0,0],
         'Mon' : [0,0,0],
         'Tue' : [0,0,0],
         'Wed' : [0,0,0],
         'Thu' : [0,0,0],
         'Fri' : [1,1,1],
         'Jan' : [0,0,0],
         'Feb' : [0,0,1],
         'Mar' : [0,0,0],
         'Apr' : [0,0,0],
         'May' : [0,0,0],
         'Jun' : [0,0,0],
         'Jul' : [0,0,0],
         'Aug' : [1,0,0],
         'Sep' : [0,0,0],
         'Oct' : [0,0,0],
         'Nov' : [0,0,0],
         'Dec' : [0,1,0],
        })
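
Typing out every one-hot column by hand makes it easy to set the wrong flag. One possible alternative is a small helper that builds each row from a day and month name. `make_input_row`, `DAYS` and `MONTHS` are hypothetical names introduced here for illustration; the column order follows the `headers` list used to build `df_dnn`, minus the label.

```python
import pandas as pd

DAYS = ["Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def make_input_row(year, temp, dewp, prcp, day, month):
    """Hypothetical helper: build one model input row with one-hot day and month."""
    if day not in DAYS or month not in MONTHS:
        raise ValueError(f"Unknown day or month: {day!r}, {month!r}")
    row = {"year": year, "temp": temp, "dewp": dewp, "prcp": prcp}
    # Exactly one day column and one month column are set to 1
    row.update({d: int(d == day) for d in DAYS})
    row.update({m: int(m == month) for m in MONTHS})
    return row


# The same three test inputs as above, expressed by name
demo_input = pd.DataFrame([
    make_input_row(2018, 60.1, 54.1, 0.2, "Fri", "Aug"),
    make_input_row(2019, 67.5, 60.9, 0.0, "Fri", "Dec"),
    make_input_row(2018, 77.0, 59.7, 0.24, "Fri", "Feb"),
])

print(demo_input.shape)
```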

In [67]:
# The model predicts raw collision counts; multiplying by the scale expresses them in thousands of collisions
dnn_predictions = dnn_model_1.predict(input_1[:3])*SCALE_NUM_COLLISIONS
dnn_predictions



array([[0.58995116],
       [0.70589536],
       [0.64243084]], dtype=float32)
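
Since the predictions above were multiplied by `SCALE_NUM_COLLISIONS`, dividing by the same constant recovers the collision counts. A minimal sketch using the three values printed above:

```python
import numpy as np

SCALE_NUM_COLLISIONS = 0.001

# The three scaled predictions printed above
scaled_predictions = np.array([[0.58995116],
                               [0.70589536],
                               [0.64243084]])

# Undo the scaling and round to whole collisions
collision_counts = np.rint(scaled_predictions / SCALE_NUM_COLLISIONS).astype(int).ravel()

print(collision_counts.tolist())
```

This gives roughly 590, 706 and 642 predicted collisions for the three test inputs.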

For the first, you can see below, we are in August, Saturday (we have a Monday), 2009. We have a lower temperature and lower wind speed. The difference in day is likely to account for the higher actual number of trips.

205 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2009 64.9 4.2 0.700748221591537

For second (December):

337 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2009 35.2 6.6 0.62902717790116

For third (February):

28 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2009 38.1 3.6 0.675798566550843

The results overall look reasonable. You can look into this further.

Considerations for the assignment: I would likely use another method to either standardise or normalise num_trips, and perhaps not bother scaling it in here with SCALE_NUM_TRIPS and a generally large number.
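
One such alternative is to standardise the label to zero mean and unit variance, then invert the transform on the predictions. The sketch below applies this to the first five NUM_COLLISIONS values shown earlier in this notebook; it is only an illustration of the arithmetic, and in practice the mean and standard deviation would be computed from the training labels alone.

```python
import numpy as np

# First five NUM_COLLISIONS values from the data frame shown earlier
labels = np.array([381.0, 480.0, 549.0, 505.0, 389.0])

# Statistics used for the transform and its inverse
label_mean = labels.mean()
label_std = labels.std()

# Standardise: zero mean, unit standard deviation
standardised = (labels - label_mean) / label_std

# Invert the transform to get back to collision counts
restored = standardised * label_std + label_mean

print(standardised.round(3))
print(restored)
```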

Obviously, you will likely want to have a number of variations but there is no reason you can't use most of the data given. Remember, the DNN is trying to solve a complex relationship, not a linear one.

Another thing to consider when doing this is to set aside validation data. Of course, in the assignment you will have a much larger range of data, i.e. from some date x to y.

This will give you more data. Taking real validation data that hasn't been shown to the model will give real results for you to check against rather than what I have done here (which is just for information).