<h2 style = "text-align: center;">Advanced Data Science for Traffic and Transportation Engineering</h2>

<h3 style = "text-align: center;">Determine the Position of Road Inspectors - Rijkswaterstaat</h3><br><br>

<div style="display:flex">
     <div style="flex:1;padding-right:10px;">
          <img src="./Images/rws.jpg" width="260"/>
     </div>
     <div style="flex:1;padding-left:10px;">
          <img src="./Images/tud.png" width="200"/>
     </div>
</div>
<h3 style = "text-align: left;">Group 3</h3>

<ul>
     <li>Yiman Bao (5691648)</li>
     <li>Juan Camargo Fonseca (5834112)</li>
     <li>Tijmen Hoedjes (4959183)</li>
     <li>Max Lange (5169402)</li>
     <li>Wail Abdellaoui (5130654)</li>
</ul>

### **Content** ###

**This notebook is structured as follows:**

<b>1. Introduction</b>
<b><p style="margin-left: 40px">1.1 Research Objective and Research Questions</p></b>
<b><p style="margin-left: 40px">1.2 Tech Stack</p></b>
<b><p style="margin-left: 40px">1.3 Ethical Considerations</p></b>
<b><p style="margin-left: 40px">1.4 Project Management</p></b>


<b>2. Data Story</b>
<b><p style="margin-left: 40px">2.1 Data Overview</p></b>
<b><p style="margin-left: 40px">2.2 Data Filtering and Preprocessing</p></b>
<b><p style="margin-left: 40px">2.3 Streamlit App</p></b>

<b>3. Algorithm</b>
<b><p style="margin-left: 40px">3.1 Algorithm Requirements</p></b>
<b><p style="margin-left: 40px">3.2 Methodology - Overview</p></b>

<b>4. Results</b>

<b>5. Validation</b>

<b>6. Conclusions</b>





### 1. Introduction ###

Have you ever been involved in a road incident in the Netherlands? Have you ever been late to an appointment or event because an incident occurred? If you answered yes, you most likely relied on a road inspector to make your life easier. Road inspectors are important for ensuring safety on highways and a smooth traffic flow. When incidents occur, they go to the scene and make sure that traffic can quickly resume, so it is important that they arrive as soon as possible. Consequently, an optimal distribution of inspectors is necessary to ensure that they arrive shortly after an incident occurs. In this document we discuss the considerations made for reaching this objective and its implications for road users in the Netherlands.

#### 1.1 Research Objective and Research Questions ####

The research objective is to find the optimal locations of road inspectors in the Dutch road network, such that travel time to incidents is minimized. To reach this objective, the following research questions are proposed. <br><br><b>Main Research Question:</b> What would be the optimal location of road inspectors in the Netherlands, such that travel times to incidents are minimized?

<b>Sub questions:</b>
<ol>
    <li> To what extent is the data provided by Rijkswaterstaat useful for the research objective? </li>
    <li> How are the incidents distributed in time and space in the Dutch network? </li>
    <li> What type of method is suitable for evaluating the accident probabilities? </li>
    <li> What are the locations in which accidents are more likely to happen? </li>
    <li> What variables would affect the response time of the inspectors? </li>
    <li> How can the response time of the inspectors be calculated?  </li>
    <li> What is the optimal number of road inspectors needed?  </li>
    <li> What is the optimal location for the road inspectors?  </li>
</ol>
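
To make the objective concrete, one possible way to score a candidate placement is the probability-weighted travel time from the nearest inspector to each network node. The sketch below is only an illustration under stated assumptions: the toy graph, the `travel_time` edge attribute, the incident probabilities and the function name `expected_response_time` are all hypothetical and are not taken from the project code.

```python
import networkx as nx


def expected_response_time(graph, incident_prob, inspector_nodes, weight="travel_time"):
    """Probability-weighted travel time from the nearest inspector to each node.

    Assumes every node in `incident_prob` is reachable from at least one inspector.
    """
    # Shortest travel time from the closest inspector to every reachable node.
    nearest = nx.multi_source_dijkstra_path_length(
        graph, set(inspector_nodes), weight=weight
    )
    # Weight each node's travel time by its (hypothetical) incident probability.
    return sum(p * nearest[node] for node, p in incident_prob.items())


# Toy network: four nodes on a line, edge weights are travel times in minutes.
toy = nx.Graph()
toy.add_weighted_edges_from(
    [("A", "B", 5.0), ("B", "C", 10.0), ("C", "D", 5.0)], weight="travel_time"
)
# Hypothetical incident probabilities per node (they sum to 1).
prob = {"A": 0.1, "B": 0.4, "C": 0.4, "D": 0.1}

# Compare a few candidate placements; lower scores are better.
for placement in (["A"], ["B"], ["B", "C"]):
    print(placement, expected_response_time(toy, prob, placement))
```

In this toy example the score drops as inspectors are moved towards the nodes with the highest incident probability, which is the intuition behind the main research question.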
    
#### 1.2 Tech Stack ####

Given the requirements of this project, the following tech stack will be used to reach the research objective.

<b> Coding: </b>The main programming language used will be Python, and necessary packages for data analytics such as Pandas, NumPy, GeoPandas, SciPy and scikit-learn will be imported as necessary.

<b> Visualization: </b>For standard data visualization, Python libraries such as Matplotlib, Plotly and Seaborn will be employed. For Geospatial visualization, additional tools such as Folium and Rasterio are considered.

<b> Version Control: </b>Git and GitHub will be used for code tracking and collaboration.

<b> Communication: </b>Weekly in-person meetings are planned. However, Microsoft Teams can be used as necessary.

<b> Documentation: </b>Jupyter Notebooks will be used for documenting code, analysis, results and conclusions.

The following code block imports the necessary packages and libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import Point
from datetime import datetime
import geopandas as gpd
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde
import networkx as nx
import node_probability
import random
from matplotlib.lines import Line2D
import scipy.stats

import data_filtering
from projection_conversions import DutchRDtoWGS84, WGS84toDutchRD
# import validation
from collections import defaultdict

#### 1.3 Ethical Considerations ####

For this project, a Value Sensitive Design approach is followed, for which the stakeholders of the project need to be identified, along with their values and the value conflicts that may arise. The first stakeholder is Rijkswaterstaat, owner of the project and provider of the data. As previously stated, its main goal is to optimize the response time of the road inspectors when an incident occurs on the road network; however, it has limited inspector capacity. Other stakeholders are the road users, for whom the motorway network is a means of meeting their daily needs, and who expect the network to be safe and to enable them to travel from A to B in the least possible time. Finally, another important stakeholder is police and law enforcement: incidents sometimes involve fatalities or injuries, and in those cases specific procedures must be carried out in which they need to be involved.<br><br>
Different values can be recognized from these perspectives: in the case of Rijkswaterstaat, efficiency and sustainability are the main driving factors, whereas from users’ perspective safety and reliability are important, and from law enforcement’s side privacy and due diligence are relevant as well. Unfortunately, these objectives cannot be met simultaneously in the project; for example, if many inspectors are assigned to the network, response time can be optimized but the sustainability of the project will be compromised, as well as user safety and ability to travel in the least time possible; the other way around also generates conflicts, as having few inspectors will harm efficiency and also cooperation with authorities in case it is needed.<br><br>
On the other hand, the following issues regarding privacy, fairness and bias were identified:
<ol>
    <li>Data collection period is autumn/winter. (<b>bias</b>).<br><br> First, the incident data collection period is from August to December, which may cause some bias in the incident analysis results as only late summer, fall and early winter are represented. For example, weather conditions can vary significantly between seasons, with fall potentially being more prone to rainy days and falling leaves, while early winter can be accompanied by cold weather, ice and snow, leading to differences in the type and frequency of incidents in different seasons. Finally, traffic flow on the roads may differ by season due to holidays or special events, a factor that may affect the occurrence of incidents and make the analysis prone to flow-related biases if it is not accounted for.</li><br>
    <li>Accidents can have some privacy issues (police, insurance claims, etc.) (<b>privacy</b>).<br><br> Privacy issues in accident data analysis may have some important impacts on the results. Due to privacy concerns, some sensitive information may have been removed or anonymized, resulting in limited availability of the data. This may result in missing or incomplete data sets, compromising the comprehensiveness of the accident analysis. Missing critical information may bias the results of the analysis, as all relevant factors cannot be considered. At the same time, data sampling may be limited due to privacy issues. Certain types of accidents or accidents for specific groups of people may not be adequately reflected in the data. This can lead to sampling bias, making the results less representative of the overall situation. Most importantly, however, relevant legal and ethical regulations must be followed when handling incident data to ensure that individual privacy rights are respected. These legal provisions may restrict the use, storage and sharing of data. Violating these regulations may lead to legal issues and may also affect the availability and quality of the data.</li><br>
    <li>Algorithm prioritizing criteria (<b>Bias</b>) <br><br> After data processing and filtering, a large majority of the reported incidents occurred on motorways (A roads), while less than 7% of the data was reported on national highways (N roads). We consider that this can introduce bias into the project in two ways, depending on the approach. The first is representation bias, meaning that the main focus of the model will be on motorways and thus the solution devised may not capture the features needed for an optimal inspector assignment on national highways. The second is that the impact of a disruption may vary depending on the road type on which it occurs. Solely optimizing the response time of the road inspectors will result in a ‘first come, first served’ approach. This means that, if an incident on a motorway and another incident on a national highway occur close in time, the inspector assignment may not be optimal at a network level. </li><br>
    <li>Representation bias: unequal representation of accident types (<b>bias</b>)<br><br>Following a similar line of reasoning, around 80% of the incidents reported correspond to vehicle obstructions, 13% correspond to accidents, and the remaining 7% are catalogued as general obstructions. Consequently, general trends will resemble those of the vehicle obstruction category, and the resulting model may perform better for these than for accidents and general obstructions, which may undermine its capability to optimize inspector positions.</li><br>
    <li>Data leakage (<b>privacy</b>)<br><br> Data from incidents involving citizens is used. The data itself does not reveal anything about individuals. However, combining this data with other information can lead to serious privacy violations. Therefore, sharing the data with other parties should be prohibited, and transparency about the use of the data is important.</li>
    
</ol>
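
The representation and seasonal-coverage concerns listed above can be checked with simple summaries of the incident table. The following is a minimal sketch on a tiny made-up sample: the column names `road_type` and `incident_type` and the helper `category_shares` are hypothetical, and only `starttime_new` is a column name that appears in this notebook.

```python
import pandas as pd


def category_shares(df, column):
    """Percentage of incidents per category in `column`, largest first."""
    return (df[column].value_counts(normalize=True) * 100).round(1)


# Tiny made-up sample; "road_type" and "incident_type" are hypothetical column names.
sample = pd.DataFrame(
    {
        "starttime_new": pd.to_datetime(
            ["2019-08-03", "2019-09-14", "2019-10-01", "2019-11-20", "2019-12-05"]
        ),
        "road_type": ["A", "A", "A", "A", "N"],
        "incident_type": ["vehicle_obstruction"] * 4 + ["accident"],
    }
)

# Share of incidents per road type and per incident type (representation bias).
print(category_shares(sample, "road_type"))
print(category_shares(sample, "incident_type"))

# Calendar months without any incidents indicate a seasonal gap in the data.
months_present = set(sample["starttime_new"].dt.month)
missing_months = sorted(set(range(1, 13)) - months_present)
print("Months not covered:", missing_months)
```

Strongly unequal shares or long runs of missing months would be a signal to interpret model results for the under-represented categories and seasons with caution.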

#### 1.4 Project Management ####

To reach the end goal of the project, a Scrum project management framework was used. More detailed information can be found in the link below.

<a href="https://tud365-my.sharepoint.com/:f:/g/personal/jcamargofonsec_tudelft_nl/Eqbxs-pfs0ZBh7_sFyhtFqsBP2QVSMZRoqllJnYncS0trA?e=UxQ82N">Access Backlog Diary</a>

### 2. Data Story ###

#### 2.1 Data Overview ####

For the development of the project, two datasets were provided. The first is the Rijkswegen road network shapefile, which contains geometric and functional information on the main roads that compose the Dutch motorway network. The second is a CSV file of 88,851 incidents that occurred in the Netherlands between July 31st, 2019 and December 31st, 2019. For every incident, it includes the location, classification, start time, end time and road number.

#### 2.2 Data Filtering and Preprocessing ####

After an initial exploration of the data, it was clear that some preprocessing and filtering was necessary. Incidents meeting any of the following criteria were removed:

<ol>
    <li>Incidents that occurred on roads not included in the NWD Road Network Data.</li>
    <li>Incidents that had a duration of zero (0) minutes or lasted longer than one day.</li>
    <li>Incidents that did not occur in the Netherlands.</li>
       
</ol>

After applying these procedures, a total of 13,977 incidents were removed. 
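
The filtering criteria above can be sketched as follows. This is a minimal illustration, not the implementation of `data_filtering.filter_out` used in the next cell: the column names `road_number`, `longitude` and `latitude`, the function `filter_incidents` and the rough bounding box for the Netherlands are all assumptions made for the example.

```python
import pandas as pd

# Approximate WGS84 bounding box of the European Netherlands (an assumption for this sketch).
NL_LON = (3.2, 7.3)
NL_LAT = (50.7, 53.7)


def filter_incidents(df, known_roads):
    """Keep incidents on known roads, with a valid duration, inside the bounding box."""
    duration = df["endtime_new"] - df["starttime_new"]
    # Criterion 1: the road must be part of the road network data.
    on_known_road = df["road_number"].isin(known_roads)
    # Criterion 2: duration must be positive and at most one day.
    valid_duration = (duration > pd.Timedelta(0)) & (duration <= pd.Timedelta(days=1))
    # Criterion 3: the location must fall inside the Netherlands bounding box.
    in_netherlands = df["longitude"].between(*NL_LON) & df["latitude"].between(*NL_LAT)
    return df[on_known_road & valid_duration & in_netherlands]


# Made-up incidents: only the first row satisfies all three criteria.
incidents = pd.DataFrame(
    {
        "road_number": ["A2", "A2", "A12", "X99", "A4"],
        "starttime_new": pd.to_datetime(["2019-09-01 10:00"] * 5),
        "endtime_new": pd.to_datetime(
            [
                "2019-09-01 10:30",  # valid
                "2019-09-01 10:00",  # zero duration
                "2019-09-03 10:00",  # longer than one day
                "2019-09-01 10:30",  # road not in the network
                "2019-09-01 10:30",  # located outside the Netherlands
            ]
        ),
        "longitude": [5.1, 5.1, 5.1, 5.1, 2.35],
        "latitude": [52.1, 52.1, 52.1, 52.1, 48.85],
    }
)

kept = filter_incidents(incidents, known_roads={"A2", "A4", "A12"})
print(f"Kept {len(kept)} of {len(incidents)} incidents")
```

A bounding box is only a coarse stand-in for a country test; a point-in-polygon check against a national boundary would be more precise.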

In [3]:
# Load the road network shapefile and reproject it to WGS84 (EPSG:4326)
highway_shapefile = 'Data/Shapefiles/Snelheid_Wegvakken.shp'
road_network = gpd.read_file(highway_shapefile)
road_network = road_network.to_crs("EPSG:4326")

# Load the incident records and parse the start and end timestamps
path = 'Data/incidents19Q3Q4.csv'
df_incident = pd.read_csv(path)
df_incident['starttime_new'] = pd.to_datetime(df_incident['starttime_new'])
df_incident['endtime_new'] = pd.to_datetime(df_incident['endtime_new'])

# Count incidents before and after applying the filtering criteria
tot_inc = len(df_incident)
df_incident = data_filtering.filter_out(df_incident, road_network)
red_inc = len(df_incident)

# Print the number and percentage of incidents filtered out
print('Total incidents filtered out: ', tot_inc-red_inc)
print('Percentage of incidents filtered out: ', np.round((tot_inc-red_inc)/tot_inc*100,2), '%')


Total incidents filtered out:  13977
Percentage of incidents filtered out:  15.73 %


#### 2.3 Streamlit App ####

For better data visualization, and with the intent of creating a data story for our project, a Streamlit dashboard was created. The dashboard contains seven pages:

<ol>
    <li>Homepage: Contains an introduction and brief description of the project.<br><br></li>
    <li>Research: Describes the research objective and research questions.<br><br></li>
    <li>Visualise data: Enables data visualization according to different filtering parameters.<br><br></li>
    <li>Maps: Shows the incident clustering, heatmap and KDE map for the project.<br><br></li>
    <li>Spatio-temporal: Enables visualization of incidents that occurred within a specific timeframe. <br><br></li>
    <li>Algorithm: Explains the algorithm and visualizes the results of the inspector placement.<br><br></li>
    <li>Evaluation: Contains considerations made regarding validation and testing of results.<br><br></li>
</ol>
To open the Streamlit dashboard, please run the cell below. Note that after you close the dashboard, you should manually interrupt the cell execution as well.



In [4]:
!streamlit run Homepage.py

^C


### 3. Algorithm ###