# TPM034A Machine Learning for socio-technical systems 
## `Assignment 02: Embeddings and Explainable Artificial Intelligence (xAI)`

**Delft University of Technology**<br>
**Q2 2024**<br>
**Instructors:** Sander van Cranenburgh & Giacomo Marangoni <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instruction below). 

### `Google Colab workspace set-up`

Uncomment the following code lines if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM034A/Q2_2024
#!pip install -r Q2_2024/requirements_colab.txt

# `Application: Visual urban environment as component for bike speed and safety` <br>

### **Introduction**

In the previous assignment, you worked with machine learning models to predict cycling speeds using tabular data, such as road infrastructure. While this approach allowed us to explore cycling speeds and make some predictions, it represents only one way to understand and predict cycling dynamics. This assignment builds on your previous work by introducing new datasets and tools to approach the same problem from two different perspectives: exploring the visual components of the urban space for predicting cycling speed, and examining the explainability of cycling accidents.

In the first part of this assignment, you will focus on using images of urban spaces to predict cycling speeds. By generating image embeddings that represent visual features, such as road quality and surrounding infrastructure, you will train machine learning models to incorporate this new dimension of data. This will provide insights into how visual elements contribute to cycling speeds and how they can complement tabular data.

The second part of the assignment shifts the focus to safety and explainability. By integrating accident data with our tabular information, you will explore and explain bike accidents. Using explainable AI techniques, you will interpret model predictions to uncover patterns and risk factors, offering actionable insights for improving urban cycling infrastructure and safety.

#### **Data**
For this assignment you have access to several datasets. All of them will be available in the `data` folder after you run the cell below these instructions. The `data` folder contains four sub-folders: `image_tabular`, `bike_speeds`, `traffic_accidents` and `images`. The following list describes the datasets within the folders.

1. `data/image_tabular/image_metadata.csv`: A csv file with the image metadata (e.g., year, month or location) of Rotterdam images. The column `in_folder` indicates whether the image file is present in `data/images`.
2. `data/image_tabular/image_embeddings.csv`: A tabular csv file with image embeddings from Rotterdam.
3. `data/bike_speeds/bike_speeds.gpkg`: A geo dataset of linestrings (streets) for the Netherlands with bike speed data.
4. `data/traffic_accidents/accident_events.geojson`: A geographical dataset with information and characteristics of accidents for Rotterdam.
5. `data/traffic_accidents/accident_parties.csv`: A tabular csv with the information of the parties involved in the accidents.
6. `data/images`: A folder with image files from Rotterdam. Images present in this folder are marked with a 1 in the `in_folder` column of `data/image_tabular/image_metadata.csv`.

As indicated, run the code in the cell below to prepare the datasets. The cell will download the datasets for this assignment and place them in the `data` folder automatically. It may take up to two minutes to download the data.

Remember to transform all geographical datasets (if needed) to the Dutch projection: **EPSG:28992**.
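
As a minimal sketch of the reprojection step, the snippet below converts a GeoDataFrame from WGS84 to EPSG:28992 with `to_crs`. The point and the `name` column are made up for illustration and are not taken from the assignment data.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical point in WGS84 (longitude, latitude) near Rotterdam; not from the assignment data
gdf = gpd.GeoDataFrame({"name": ["example"]}, geometry=[Point(4.48, 51.92)], crs="EPSG:4326")

# Reproject only when the dataset is not already in the Dutch projection
if gdf.crs.to_epsg() != 28992:
    gdf = gdf.to_crs(epsg=28992)

x, y = gdf.geometry.iloc[0].x, gdf.geometry.iloc[0].y
print(gdf.crs.to_epsg(), round(x), round(y))
```

After reprojection the coordinates are expressed in metres, which is what makes buffer distances such as "5 m" meaningful.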

In [None]:
## IMPORTANT: You have to be on the TU Delft network (eduroam) or connected through eduVPN to run this script.
## You can comment out these lines after downloading the data.
from assets import data_downloader as dld
dld.download_data()

### **Tasks and grading**

Your assignment is divided into two sections: Part I: Image Embeddings, and Part II: Explainable AI. The specific tasks and grading points for each section are outlined below.

1. **Part I: Image embeddings** [5 pts]<br>
    1. **Data preparation and exploration** [1.0 pnt]<br>
        - Loading the datasets. 
        - Preparing the image dataset. 
        - Combining bike speed data with the images. 
        - Exploring the street view images 
    1. **Model training** [2.5 pnt]<br>
        - Data preparation for training (random and spatial) 
        - Training a Linear Regression (LR) 
        - Training a Random Forest (RF) 
        - Training a Multi-Layer Perceptron (MLP) 
    1. **Model selection and application** [1.5 pnt]<br>
        - Discussion on data split
        - Model selection and application
1. **Part II: Explainable AI** [5 pts]<br>
    1. **Data preparation and exploration** [1.0 pnt]<br>
        - Loading the datasets.
        - Combining bike speed data with accidents. 
    1. **Exploring explainability using accidents aggregated by streets** [1.5 pnt]
        - Data preparation
        - Model preparation
        - Explainability
    1. **Exploring explainability using accidents instances** [1.5 pnt]
        - Data preparation
        - Model preparation
        - Explainability
    1. **Reflection** [1.0 pnt]

### **Submission**
- The deadline for this assignment is **Monday, December 12th, 2024** 
- Use **Python 3.11**
- Submit your work as a **fully executed** `.ipynb` notebook on Brightspace.

In [None]:
# Data manipulation
import numpy as np
import pandas as pd
import geopandas as gpd

# Plotting and visualization
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
from PIL import Image
import contextily as ctx

# Machine learning
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Explainable AI
import shap
from lime import lime_tabular

# Others
from shapely.geometry import Point
from pathlib import Path

# pd settings
pd.set_option('display.max_columns', None)

## Part I: Image embeddings
### **1. Data preparation and exploration**
#### **1.1. Loading datasets.** Load the datasets from the folders `image_tabular` and `bike_speeds`. Explore them with `df.head()` to familiarize yourself with the tables.

#### **1.2. Preparing the image dataset.** Combine the `img_metadata` dataframe with `img_embeddings` to consolidate them into a single dataframe.
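
A minimal sketch of such a merge is shown below. The key column `img_id` and the other column names are assumptions for illustration; the actual tables may use a different identifier.

```python
import pandas as pd

# Toy stand-ins for the two tables; the column names (img_id, year, emb_0, ...) are assumptions
img_metadata = pd.DataFrame({"img_id": [1, 2, 3], "year": [2020, 2021, 2021], "in_folder": [1, 0, 1]})
img_embeddings = pd.DataFrame({"img_id": [1, 2, 3], "emb_0": [0.1, 0.4, 0.3], "emb_1": [0.9, 0.2, 0.5]})

# validate="one_to_one" raises an error if an image id is duplicated in either table
images = img_metadata.merge(img_embeddings, on="img_id", how="inner", validate="one_to_one")
print(images.shape)
```

Using `validate` is a cheap way to catch duplicated keys before they silently multiply rows.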

#### **1.3. Combining bike speed data with the image embeddings** 
- (a) Link the bike speed data (represented as linestrings for streets) with the images (represented as point locations). You can apply any suitable spatial aggregation method to join the data. *HINT*: Use 'u-v' as the unique identifier for each street so that each street is kept only once, and consider using the buffer function in GeoPandas for this task.
- (b) As one street may contain several images, group the rows by street and average the embedding features. As a result, each street should appear once. Remove the columns that no longer make sense after averaging and include the linestring geometry.
- (c) Visualize the result on a map. Plot all the images (points) and the resulting streets with image data. For better visualization, show only Y within 436000-438000 and X within 92000-96000 (using EPSG:28992).
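
Steps (a) and (b) can be sketched on toy data as follows. This is one possible approach, not the required one: the 10 m buffer width, the `emb_*` column names, and the geometries are assumptions chosen for illustration.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Toy data in EPSG:28992 (metres); column names other than 'u-v' and 'speed' are assumptions
streets = gpd.GeoDataFrame(
    {"u-v": ["1-2", "2-3"], "speed": [15.0, 8.0]},
    geometry=[LineString([(0, 0), (100, 0)]), LineString([(100, 0), (100, 100)])],
    crs="EPSG:28992",
)
images = gpd.GeoDataFrame(
    {"emb_0": [0.2, 0.4, 1.0], "emb_1": [1.0, 3.0, 5.0]},
    geometry=[Point(10, 3), Point(60, -4), Point(103, 50)],
    crs="EPSG:28992",
)

# (a) Buffer the streets (10 m is an arbitrary choice) and attach each image to the buffer it falls in
buffers = streets[["u-v", "geometry"]].copy()
buffers["geometry"] = buffers.geometry.buffer(10)
joined = gpd.sjoin(images, buffers, how="inner", predicate="within")

# (b) Average the embeddings per street, then bring back speed and the linestring geometry
emb_cols = ["emb_0", "emb_1"]
street_emb = joined.groupby("u-v")[emb_cols].mean().reset_index()
street_images = streets.merge(street_emb, on="u-v", how="inner")
print(street_images[["u-v", "speed"] + emb_cols])
```

Note that an image close to an intersection can fall inside several buffers and then contributes to each of those streets; whether that is acceptable is a modelling decision.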

#### **1.4. Exploring the street view images.**
- (a) Explore the distribution of the speed (`speed`) in the dataset built in 1.3. Remove outliers and plot the final distribution as a histogram and as a map. In the map, color the streets based on the speed.
- (b) Sample ten street view images (present in the folder, i.e., `in_folder == 1`) from streets with a speed between 20 and 50 km/h.
- (c) Repeat (b), sampling images from streets with a speed above 0 and up to 10 km/h.
- (d) Do you see differences in the urban scenery when comparing the images from (b) and (c)?
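
The task does not prescribe an outlier rule. As one hedged example, the sketch below applies the common 1.5 × IQR rule to a toy `speed` column; the values are made up, and another rule (e.g. fixed thresholds) may suit the real data better.

```python
import pandas as pd

# Toy speeds in km/h; the 1.5 * IQR rule is one common choice, not one prescribed by the assignment
speeds = pd.DataFrame({"speed": [12.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 21.0, 95.0]})

q1, q3 = speeds["speed"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose speed lies inside the IQR fences
clean = speeds[speeds["speed"].between(lower, upper)]
print(f"kept {len(clean)} of {len(speeds)} rows, bounds = [{lower:.1f}, {upper:.1f}]")
```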

### **2. Model training**. Train different machine learning models for predicting the bike speed based on the embedding characteristics. 

#### **2.1. Dataset for training**. Prepare the dataset for the modelling. 
- (a) Keep only the relevant columns for the training phase: embedding columns and bike speed.
- (b) Split the dataset train/test using two different approaches:
    - (b.1) Random-based: Split 80%/20% randomly.
    - (b.2) Location-based: Split 80%/20% geographically. Create a buffer around a coordinate to select 20% of the data points as test set (taking all datapoints inside the buffer).
- (c) Plot both approaches on a map, using blue for the streets in the training set and red for those in the test set.
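
The two splitting strategies can be sketched with plain NumPy on toy coordinates. The centre point is arbitrary, and the trick of setting the buffer radius to the 20% quantile of the distances is an assumption about how to reach roughly 20% test data, not the only way to do it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy street centroids in metres; real data would derive them from the street geometries
xy = rng.uniform([92000, 436000], [96000, 438000], size=(500, 2))
idx = np.arange(len(xy))

# (b.1) Random split: 80% train, 20% test
train_rand, test_rand = train_test_split(idx, test_size=0.2, random_state=0)

# (b.2) Location-based split: everything within `radius` of a chosen centre becomes the test set.
# Using the 20% quantile of the distances makes the buffer capture about 20% of the points.
centre = np.array([94000.0, 437000.0])  # arbitrary centre, an assumption
dist = np.linalg.norm(xy - centre, axis=1)
radius = np.quantile(dist, 0.2)
test_geo = idx[dist <= radius]
train_geo = idx[dist > radius]
print(len(test_rand), len(test_geo), round(radius))
```

In the location-based split, the test streets form one contiguous area, so the model is evaluated on a neighbourhood it has not seen at all.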

#### **2.2. Training a Linear Regression (LR)** Train an LR for predicting the bike speed using both splits separately. Report the R2 and the MSE for train and test in both cases.

#### **2.3. Training a Random Forest (RF)**. Train an RF for predicting the bike speed using both splits separately. Report the R2 and the MSE for train and test in both cases. Hyperparameter tuning is NOT needed in this task.

#### **2.4. Training a Multi-Layer Perceptron (MLP)**. Train an MLP for predicting the bike speed using both splits separately. Report the R2 and the MSE for train and test in both cases. Hyperparameter tuning is NOT needed in this task.
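
Since the three models are evaluated in the same way, a small helper can keep the reporting consistent. The sketch below uses synthetic data in place of the embeddings; the helper name `report` and the choice to standardize the MLP inputs are assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding features (X) and bike speed (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 15 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def report(model, X_train, y_train, X_test, y_test):
    """Fit the model and return R2 and MSE on both the train and test sets."""
    model.fit(X_train, y_train)
    out = {}
    for name, (X_, y_) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
        pred = model.predict(X_)
        out[f"r2_{name}"] = r2_score(y_, pred)
        out[f"mse_{name}"] = mean_squared_error(y_, pred)
    return out

models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    # Scaling the inputs usually helps an MLP converge; other hyperparameters are defaults
    "MLP": make_pipeline(StandardScaler(), MLPRegressor(max_iter=1000, random_state=0)),
}
results = {name: report(m, X_train, y_train, X_test, y_test) for name, m in models.items()}
for name, res in results.items():
    print(name, {k: round(v, 3) for k, v in res.items()})
```

Calling `report` once per split (random and location-based) gives directly comparable train and test indicators.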

### **3. Model selection and application**
#### **3.1. Reflection on data splits**. Reflect on the model indicators.
- (a) Did you see differences between the splits? If so, what could cause them?
- (b) Which of the two splits do you think is more robust? Why?

#### **3.2. Model selection and application**. 
- (a) Choose a model for its application and justify the choice.
- (b) Apply the selected model and plot the predicted values and the errors in two different maps.

## Part II: Explainable AI
### **1. Data preparation and exploration**
#### **1.1. Loading datasets.** Load the datasets from the folder `traffic_accidents`. Explore them with `df.head()` to familiarize yourself with the tables.

#### **1.2. Combining bike speed data with accidents**
- (a) Filter the accident events to keep only those in which a `Fiets`, `Bromfiets`, or `Snorfiets` party is involved.
- (b) Generate two different datasets for analyzing accidents. *HINT*: Create a buffer of 5 m around the streets for the spatial merge between streets and accidents.
    - Accident-based dataset. Each row corresponds to a unique accident, and you have to associate it with the characteristics of the segment where it happened.

    - Street-based dataset. Each row corresponds to a unique street segment, and you have to aggregate the data of the accidents that happened within 5 meters of the street segment. For each street, create the following columns:
        - A column (named `n_severe`) with the total number of severe accidents (`Injury` and `Fatal` in `AP3`).
        - A column (named `n_minor`) with the total number of minor accidents (`Only material damage` in `AP3`).
        - A column (named `n_parties`) with the average number of parties involved in the accidents of that segment (using `N_PARTIES`).
        - A column with the max speed limit (using `MAX_SPEED`).
        - A column (named `rain_prop`) with the percentage of accidents in the segment occurring in rain, fog, or snow (`Rain`, `Fog` and `Snow/Hail` in `WHEATER_1`).
        - A column (named `dark_prop`) with the percentage of accidents in the segment occurring in low-light conditions (`Darkness` in `LIGHT_CONDITION`).
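
The street-based aggregation can be sketched with a pandas `groupby` once each accident carries the identifier of its street. The data below are made up, the column names are copied from the task text and should be checked against the real tables, and `max_speed` and the value `Daylight` are hypothetical.

```python
import pandas as pd

# Toy result of the spatial join: one row per accident, tagged with the street it falls on.
# Values are made up; column names follow the task text and should be verified against the data.
weather_col = "WHEATER_1"  # spelling as written in the task text
acc = pd.DataFrame({
    "u-v": ["1-2", "1-2", "1-2", "2-3"],
    "AP3": ["Injury", "Only material damage", "Fatal", "Only material damage"],
    "N_PARTIES": [2, 1, 3, 2],
    "MAX_SPEED": [50, 50, 50, 30],
    weather_col: ["Rain", "Dry", "Dry", "Fog"],
    "LIGHT_CONDITION": ["Darkness", "Daylight", "Darkness", "Daylight"],
})

# Boolean helper columns turn the per-street aggregation into sums and means
acc["is_severe"] = acc["AP3"].isin(["Injury", "Fatal"])
acc["is_minor"] = acc["AP3"].eq("Only material damage")
acc["is_rain"] = acc[weather_col].isin(["Rain", "Fog", "Snow/Hail"])
acc["is_dark"] = acc["LIGHT_CONDITION"].eq("Darkness")

street_stats = acc.groupby("u-v").agg(
    n_severe=("is_severe", "sum"),
    n_minor=("is_minor", "sum"),
    n_parties=("N_PARTIES", "mean"),
    max_speed=("MAX_SPEED", "max"),   # hypothetical column name
    rain_prop=("is_rain", "mean"),    # share in [0, 1]; multiply by 100 for a percentage
    dark_prop=("is_dark", "mean"),
).reset_index()
print(street_stats)
```

Streets without any accident do not appear in this result; a left merge onto the full street table followed by filling the counts with 0 would bring them back.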

### **2. Exploring explainability using accidents aggregated by streets**
#### **2.1. Data preparation**

You will be using the `street_accidents` dataset to explore the relationship between accidents and the characteristics of the streets where they occurred.
Key variables are:

- `n_severe`: Total number of severe accidents (fatal or injured);
- `n_minor`: Total number of minor accidents (only material damage).

(a) Add a column `n_any` with the total number of accidents that happened on that street, whether minor or severe.

(b) Then, filter the dataset to keep only streets with:
- at least one accident;
- `speed` greater than 0.

(c) Create a column `highway2`, mapping the values of `highway` according to this table:

| highway          | highway2              |
|------------------|-----------------------|
| primary          | major_road            |
| secondary        | major_road            |
| tertiary         | major_road            |
| primary_link     | major_road            |
| secondary_link   | major_road            |
| tertiary_link    | major_road            |
| residential      | minor_road            |
| unclassified     | minor_road            |
| living_street    | minor_road            |
| service          | minor_road            |
| services         | minor_road            |
| footway          | pedestrian_cycleway   |
| pedestrian       | pedestrian_cycleway   |
| cycleway         | pedestrian_cycleway   |
| path             | pedestrian_cycleway   |
| busway           | public_transit        |

(d) Then create 3 dummy columns (i.e. one for each `highway2` value except `public_transit`), true or false depending on the value of `highway2`:
- highway2_major_road
- highway2_minor_road
- highway2_pedestrian_cycleway

(e) Finally, remove all non-numeric columns, make sure the remaining ones are numeric, and drop any rows with NAs.
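
Steps (c) to (e) can be sketched as below on a toy table. The dummies are stored as 0/1 integers rather than booleans so that they survive the numeric-column filter; that choice is an assumption, as is the toy `speed` column.

```python
import pandas as pd

# Toy street table; only a few of the `highway` values from the mapping table are shown
streets = pd.DataFrame({
    "highway": ["primary", "residential", "cycleway", "busway", "service"],
    "speed": [18.0, 14.0, 16.5, 12.0, 9.0],
})

highway_groups = {
    "major_road": ["primary", "secondary", "tertiary", "primary_link", "secondary_link", "tertiary_link"],
    "minor_road": ["residential", "unclassified", "living_street", "service", "services"],
    "pedestrian_cycleway": ["footway", "pedestrian", "cycleway", "path"],
    "public_transit": ["busway"],
}
# Invert the grouping into a value -> group lookup
highway_map = {value: group for group, values in highway_groups.items() for value in values}
streets["highway2"] = streets["highway"].map(highway_map)

# One dummy per group, then drop public_transit so it acts as the reference category
dummies = pd.get_dummies(streets["highway2"], prefix="highway2").astype(int)
dummies = dummies.drop(columns="highway2_public_transit", errors="ignore")
streets = pd.concat([streets, dummies], axis=1)

# Keep numeric columns only and drop incomplete rows
numeric = streets.select_dtypes(include="number").dropna()
print(numeric.columns.tolist())
```

Any `highway` value missing from the lookup maps to `NaN` in `highway2` and gets zeros in all dummies, so it is worth checking `streets["highway2"].isna().sum()` on the real data.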

#### **2.2. Model preparation**

(a) Set up a classification model called `model4xai` using a RandomForest to predict whether an accident on a street segment is severe (i.e., involves death or injury).

Use as features:
- `oneway` (0 = false, 1 = true)
- `length`
- `speed` (i.e. the average speed)
- `n_observations` (i.e. the traffic)
- the `highway2` dummies you created above
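
A minimal sketch of the classifier set-up follows, using random data in place of the street table. The target definition (a street is labelled severe when `n_severe > 0`), the 80/20 split, and the hyperparameters are assumptions; the `highway2` dummies are left out only to keep the toy table short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the street table; values are random and carry no real meaning
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "oneway": rng.integers(0, 2, n),
    "length": rng.uniform(20, 500, n),
    "speed": rng.uniform(5, 30, n),
    "n_observations": rng.integers(1, 1000, n),
    "n_severe": rng.integers(0, 3, n),
})

features = ["oneway", "length", "speed", "n_observations"]
# Assumed target: a street counts as "severe" when it has at least one severe accident
y = (df["n_severe"] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    df[features], y, test_size=0.2, random_state=0, stratify=y
)

model4xai = RandomForestClassifier(n_estimators=200, random_state=0)
model4xai.fit(X_train, y_train)
proba_severe = model4xai.predict_proba(X_test)[:, 1]  # P(severe) for each test street
print(f"test accuracy: {model4xai.score(X_test, y_test):.2f}")
```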

#### **2.3. Explainability**

(a) How does `speed` affect the likelihood of severe accidents? Use a partial dependence plot and explain the resulting relationship on the train dataset.

(b) Use a SHAP `KernelExplainer` to compute the SHAP values for the test set.
As background parameter, use `shap.kmeans` to summarize the train dataset via 10 centroids.

(c) Plot and explain a summary plot of the SHAP values. What insights can you get?

(d) Take the first row of the test dataset: is it a "safe" street according to our model? Explain how each feature contributed to the final prediction.
Hint: use a waterfall plot.

(e) Compute the explanation you would get with LIME: how does it compare with SHAP? How can that be explained?
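
To make the idea behind the partial dependence plot in (a) concrete, the sketch below computes a one-dimensional partial dependence curve by hand: the feature of interest is overwritten with each grid value for every row, and the predicted probabilities are averaged. The data are synthetic, and the helper `partial_dependence_1d` is a hypothetical name, not a library function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data in which the probability of the positive class rises with feature 0 ("speed")
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(500, 3))
y = (X[:, 0] + rng.normal(scale=5, size=500) > 15).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average predicted P(class 1) when `feature` is forced to each grid value."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        pd_values.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(pd_values)

grid = np.linspace(0, 30, 7)
pdp = partial_dependence_1d(model, X, feature=0, grid=grid)
print(np.round(pdp, 2))
```

In practice, scikit-learn's `PartialDependenceDisplay.from_estimator` produces the plot directly; the manual version only shows what is being averaged.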

### **3. Exploring explainability using accidents instances**

Use the `accident_street` dataset, containing accidents as rows.

#### **3.1. Data preparation**

(a) Simplify the columns following these remapping tables (1st column: original values, 2nd column: mapped values):

- Road type (as above, see highway2)

- Weather conditions

| WEATHER_1        | weather2     |
|-----------------|------------------|
| Dry           | clear            |
| Rain           | precipitation    |
| Snow/Hail    | precipitation    |
| Fog            | adverse          |
| Strong wind gusts| adverse          |
| Unknown       | other            |

- Surface Types Mapping

| ROAD_SURFACE_CONDITION                             | surface2 |
|--------------------------------------|--------------|
| Overig asfalt                        | asphalt      |
| ASFALT                               | asphalt      |
| ASVALT                               | asphalt      |
| ASFALT TEGELS                        | asphalt      |
| ASFALT FIETSPAD EN STOEPTEGELS       | asphalt      |
| ZOAB                                 | asphalt      |
| BITUMEN EN KLINKERS                  | asphalt      |
| KLINKERS OVERGAANDE IN BITUMEN       | asphalt      |
| BITUMEN FIETSPAD KLINKERWEG          | asphalt      |
| STOEPTEGELSFIETSPAD BITUMEN          | asphalt      |
| DEELS KLINKERS DEELS BITUMEN         | asphalt      |
| BITUMEN                              | asphalt      |
| KLINKERS EN BITUMEN                  | asphalt      |
| Klinkers                             | pavement     |
| Beton                                | pavement     |
| KEIEN                                | pavement     |
| STOEPTEGELS                          | pavement     |
| TEGELS                               | pavement     |
| 30 30 BETONTEGELS                    | pavement     |
| STOEPTEGELS 30X30                    | pavement     |
| TROTTOIR TEGELS                      | pavement     |
| STENEN                               | pavement     |
| TEGELS VOETPAD                       | pavement     |
| STRAATTEGELS                         | pavement     |
| KUNSTSTOF RIJPLATEN                  | pavement     |
| BETONNEN RIJPLATEN                   | pavement     |
| GRAS                                 | pavement     |
| TRAMRAILS MET DAARIN EEN GAT I       | other        |
| TRAMRAILS                            | other        |
| LOS GRAVEL                           | other        |
| IJZEREN ROOSTER                      | other        |
| GLAD WEGDEK                          | other        |
| KUNSTSTOF RIJPLATEN                  | other        |

- Road Light Conditions Mapping

| ROAD_LIGHT_CONDITION           | light2 |
|--------------------|--------------|
| Lit           | lit          |
| Not Lit      | not_lit      |
| Not present      | no_lighting  |


(b) Then create 0/1 dummy columns for each of `highway2`, `weather2`, `surface2`, and `light2`. Create a `severe` column that is 1 if `AP3` is `Injury` or `Fatal`, and 0 otherwise. Drop all non-numeric columns, rows with NAs, and rows with `speed` = 0.
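
The clean-up in (b) can be sketched as follows on a toy accident table. Only two of the four categorical columns are shown, the values are made up, and the order of the operations is one reasonable reading of the task rather than the only one.

```python
import pandas as pd

# Toy accident table; values are made up and only a subset of the columns is shown
acc = pd.DataFrame({
    "AP3": ["Injury", "Only material damage", "Fatal", "Only material damage"],
    "weather2": ["clear", "precipitation", "clear", "adverse"],
    "light2": ["lit", "not_lit", "lit", None],
    "speed": [14.0, 0.0, 18.5, 12.0],
})

acc["severe"] = acc["AP3"].isin(["Injury", "Fatal"]).astype(int)

# dtype=int gives 0/1 columns; a missing category yields all-zero dummies for that row
acc = pd.get_dummies(acc, columns=["weather2", "light2"], dtype=int)

# Keep numeric columns, then drop rows with missing values or zero speed
acc = acc.select_dtypes(include="number").dropna()
acc = acc[acc["speed"] != 0].reset_index(drop=True)
print(acc)
```

Because `get_dummies` encodes a missing category as all zeros, rows with an unknown `light2` survive `dropna` here; dropping them before creating the dummies is the alternative if such rows should be excluded.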

#### **3.2. Model preparation**

(a) Set up a classification model called `model4xai2` using a RandomForest to predict whether an accident is severe (i.e., involves death or injury).
Use as features:
- `oneway` (0 = false, 1 = true)
- `speed` (i.e. the average speed)
- `n_observations` (i.e. a proxy for the traffic)
- the `highway2` dummies created above
- the `weather2` dummies created above
- the `surface2` dummies created above
- the `light2` dummies created above

#### **3.3. Explainability**

(a) Use a SHAP `KernelExplainer` to compute the SHAP values for the test set.
As background parameter, use `shap.kmeans` to summarize the train dataset via 10 centroids.

(b) What are the 3 most important features? Use a summary BAR plot.

(c) Plot a scatter plot of speed (x-axis) vs its SHAP value (y-axis). What is the difference with the partial dependence plot?

Hint: use shap.plots.scatter

### **4. Reflection**
- (a) How do the two analyses with the two datasets above differ? What are the similarities?
- (b) What are the benefits of an XAI-informed model for predicting severe accidents? What could be the risks?