# TPM034A Machine Learning for socio-technical systems 
## `Assignment 01: Data Exploration and MultiLayer Perceptrons`

**Delft University of Technology**<br>
**Q2 2024**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instructions below).

### `Google Colab workspace set-up`

Uncomment the following code lines if you are running this notebook on Colab.

In [1]:
#!git clone https://github.com/TPM034A/Q2_2024
#!pip install -r Q2_2024/requirements_colab.txt
#!mv "/content/Q2_2024/Assignments/assignment_01/data" /content/data

## `Application: Cycling speed prediction for Rotterdam` <br>

### **Introduction**

In the Dutch urban context, cycling is an important mode of transportation, serving both personal and commercial purposes, including delivery services. However, one of the challenges that individuals and companies frequently encounter is that routing algorithms rarely propose relevant cycling itineraries. This is partly due to the lack of accurate cycling speed information per road link. Hence, accurate information on cycling speeds could improve the itineraries recommended by routing apps and, in turn, help individuals and companies choose better cycling routes.

One way to obtain accurate cycling speed data is by installing speed sensors on all road links. However, this is a costly and time-consuming process. Alternatively, one could use machine learning models to estimate cycling speeds based on other variables, such as road infrastructure, traffic, and weather conditions.

Seeing a business opportunity, a data-analytics start-up company has collected data on average cycling speeds on several roads within a specific neighborhood. Now it needs to develop a machine learning model that can predict cycling speeds on any road link in the city of Rotterdam. You have been hired by this company for this task. Specifically, in this assignment you need to develop an MLP that is capable of predicting the (average) cycling speeds per road link for the entire city of Rotterdam based on publicly available street infrastructure data.

List of tasks:
1. Explore the cycling speed data provided by the company, to determine if it can be used for your task
1. Train an MLP model to predict average cycling speed
1. Evaluate and reflect on the performance of your model

### **Data**

You have access to 2 data sets:
1. Training dataset: first_campaign.gpkg
1. Testing dataset: validation_campaign.gpkg
<br>

### **Tasks and grading**

Your assignment is divided into 4 subtasks: (1) data inspection, (2) regression model, (3) MLP model, (4) performing an out-of-sample validation.

1.  **Data inspection: Load the dataset and make a first inspection** [1.5 pnt]
    1. Distribution of cycling speed:
        1. Visualize the statistical distribution of cycling speed per street segment
        1. Plot its spatial distribution on a map. Do the data make sense?
    1. Number of observations per street:
        1. Plot the distribution of the number of observations per street
        1. How many street segments have fewer than 3 observations? Why is it a problem?
1. **Training a first regression model** [3 pnt]
    1. Selecting relevant columns: Which columns should you keep in your analysis? Why? 
    1. One-hot encoding: Which variables would you need to encode and why? (provide a list)
    1. Split the data between training and test set.
    1. Start with a simple model:
        - Use a linear regression model
        - Make sure that the columns are in the same order in both train and test sets.
        - Evaluate the performance on the test set using the R2. Is the performance good enough to use this model?
    1. Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?
    1. Print the coefficients of the regression. Which features contribute to faster speed? Does that make sense?
1. **Training a more advanced MLP model** [2.5 pnt]
    1. Scaling variables: Which variables would you scale and why? (provide a list) Use a min-max scaler.
    1. Hyperparameter tuning: Design a grid search over the following hyperparameter space:
        - hidden_layer_sizes: [(18),(10,10,10,10,10),(18,10)]
        - learning_rate_init: [1e-2,1e-3,1e-5]
        - alpha: [1,0.1]
    1. Evaluate the performance on the test set, using the R2. How is the performance compared to the previous regression model?
    1. Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?
1. **Out-of-sample validation** [3 pnt]
    The company that hired you is impressed by your results. To gain more confidence in the generalisation performance, however, they decided to collect a new data set in another neighborhood.
    1. Load and preprocess the validation data.
        - Make sure that the validation data have the same columns as the original data in the same order
        - Apply the same preprocessing as the data used for training
    1. Measure the generalisation performance of the MLP model on the hold-out sample data.
        1. Measure the performance
        1. Plot the model's prediction against the true speed
    1. Measure the generalisation performance of the regression model on the hold-out sample data
        1. Measure the performance
        1. Plot the model's prediction against the true speed
    1. Reflect on the generalisation performance of your model
        1. Discuss reasons why the models might perform better or worse on the validation data.
        1. Which model performs better? Why?


### **Competition for bonus**

Within this assignment, you can participate in the TPM034A Competition to earn up to 0.5 bonus points on your grade. Please read the details [here](/Assignments/assignment_01/competition/competition.md).

### **Submission**
- The deadline for this assignment is **Monday, November 25th, 2024 at 9:00 am** 
- Use **Python 3.11**
- Submit your work on Brightspace as a zip file containing the **fully executed** ipynb notebook.

In [2]:
import geopandas as gpd
import os
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

pd.set_option('display.max_columns', None)

### 1. Data inspection: Load the dataset and make a first inspection
#### 1.1 Distribution of cycling speed:

#### 1.1.1 Visualize the statistical distribution of cycling speed per street segment
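
As a starting point, the distribution can be inspected with summary statistics and a histogram. The sketch below uses synthetic data; the column name `speed_kmh` and the km/h unit are assumptions, so replace them with the actual column of the campaign data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the campaign data; `speed_kmh` is a hypothetical column name.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"speed_kmh": rng.normal(loc=18, scale=3, size=500).clip(5, 35)})

# Summary statistics give a first impression of centre, spread and outliers.
summary = demo["speed_kmh"].describe()
print(summary)

# Histogram of the average speed per street segment.
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(demo["speed_kmh"], bins=30, edgecolor="white")
ax.set_xlabel("Average cycling speed (km/h)")
ax.set_ylabel("Number of street segments")
ax.set_title("Distribution of cycling speed per street segment")
```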

#### 1.1.2 Plot its spatial distribution on a map. Do the data make sense?

#### 1.2 Number of observations per street

#### 1.2.1 Plot the distribution of number of observations per street

#### 1.2.2 How many street segments have fewer than 3 observations? Why is it a problem?
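
Counting sparse segments comes down to a boolean mask on the observation count. This is a minimal sketch on a toy table; `segment_id` and `n_obs` are hypothetical column names, not necessarily those in the provided dataset.

```python
import pandas as pd

# Hypothetical layout: one row per street segment, with the number of
# speed observations stored in a column called `n_obs` here.
segments = pd.DataFrame({
    "segment_id": [1, 2, 3, 4, 5, 6],
    "n_obs": [1, 2, 3, 10, 25, 2],
})

min_obs = 3  # threshold from the task: fewer than 3 observations
too_few = segments["n_obs"] < min_obs

n_sparse = int(too_few.sum())
share_sparse = too_few.mean()
print(f"{n_sparse} of {len(segments)} segments ({share_sparse:.0%}) "
      f"have fewer than {min_obs} observations")

# One possible follow-up: keep only segments with enough observations.
reliable = segments.loc[~too_few].reset_index(drop=True)
```

Whether to drop, down-weight, or keep such segments is a modelling choice that you should justify in your answer.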

### 2. Training a multiple linear regression model

#### 2.1 Selecting relevant columns: Which columns should you keep in your analysis? Why?

#### 2.2 One-hot encoding: Which variables would you need to encode and why? (provide a list)
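
One common way to one-hot encode categorical columns in pandas is `pd.get_dummies`. The sketch below assumes a categorical column named `road_type` and a numeric column named `length_m`; both names are made up for illustration.

```python
import pandas as pd

# Hypothetical columns: `road_type` is categorical, `length_m` is numeric.
df = pd.DataFrame({
    "road_type": ["cycleway", "residential", "cycleway", "primary"],
    "length_m": [120.0, 80.5, 200.0, 45.0],
})

categorical_cols = ["road_type"]

# get_dummies replaces each categorical column by one 0/1 column per category.
# drop_first=True drops one category per variable, which avoids perfectly
# collinear columns in a linear regression with an intercept.
encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The dropped category (here `cycleway`, the first in alphabetical order) becomes the reference level against which the other dummy coefficients are interpreted.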

#### 2.3 Split the data into a training and test set.

#### 2.4 Train your first model:
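
The split, fit, and evaluation steps can be sketched as follows. The data are synthetic and the feature names (`length_m`, `is_cycleway`) are hypothetical; the 80/20 split ratio is also an assumption, since the assignment does not prescribe one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data; the feature names are hypothetical.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "length_m": rng.uniform(20, 300, size=400),
    "is_cycleway": rng.integers(0, 2, size=400),
})
y = 15 + 0.01 * X["length_m"] + 3 * X["is_cycleway"] + rng.normal(0, 1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Enforce the same column order in the test set as in the training set.
X_test = X_test[X_train.columns]

linreg = LinearRegression().fit(X_train, y_train)
r2_test = r2_score(y_test, linreg.predict(X_test))
print(f"R2 on the test set: {r2_test:.3f}")
```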

#### 2.5 Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?
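
A predicted-versus-true scatter plot with a diagonal reference line makes systematic over- or under-prediction visible. The arrays below are synthetic stand-ins for the true test speeds and the model predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic true and predicted speeds, standing in for y_test and model.predict(X_test).
rng = np.random.default_rng(0)
y_true = rng.uniform(8, 30, size=200)
# Predictions deliberately shrunk towards the mean, to mimic a weak model.
y_pred = 0.6 * y_true + 7 + rng.normal(0, 1.5, size=200)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred, s=10, alpha=0.6)

# Points on the dashed diagonal are perfect predictions.
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="black")
ax.set_xlabel("True speed")
ax.set_ylabel("Predicted speed")
ax.set_aspect("equal")
```

Points above the diagonal are over-predictions and points below it are under-predictions, so a point cloud that is flatter than the diagonal indicates predictions that are pulled towards the mean speed.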

#### 2.6 Print the coefficients of the regression. Which features contribute to faster speed? Does that make sense?
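
Pairing the fitted coefficients with the column names makes them easier to read. A minimal sketch on synthetic data, with hypothetical features `is_cycleway` and `n_crossings`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known positive and a known negative effect.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "is_cycleway": rng.integers(0, 2, size=300),
    "n_crossings": rng.integers(0, 6, size=300),
})
y = 16 + 2.5 * X["is_cycleway"] - 0.8 * X["n_crossings"] + rng.normal(0, 0.5, size=300)

model = LinearRegression().fit(X, y)

# Label each coefficient with its feature name and sort from most positive
# (faster) to most negative (slower).
coefs = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(coefs)
print(f"intercept: {model.intercept_:.2f}")
```

Keep in mind that the magnitude of a coefficient depends on the unit of its feature, so coefficients of unscaled features are not directly comparable in size.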

### 3. Train a more advanced model: MLP
Hint:
- When feeding tabular data to sklearn use a pandas dataframe instead of a numpy array 
- It allows sklearn to control for the order of the columns
- It will be useful later in the assignment

#### 3.1 Scaling numerical features
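
A minimal sketch of min-max scaling that keeps the data in DataFrames: the scaler is fitted on the training set only and then reused on the test set. The column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_train = pd.DataFrame({"length_m": [50.0, 100.0, 150.0], "is_cycleway": [0, 1, 0]})
X_test = pd.DataFrame({"length_m": [75.0, 200.0], "is_cycleway": [1, 1]})

numeric_cols = ["length_m"]  # 0/1 dummy columns are already in [0, 1]

scaler = MinMaxScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Fit on the training data only, then reuse the fitted scaler on the test data.
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train_scaled)
print(X_test_scaled)
```

Test values can fall outside [0, 1] (here 200 m maps to 1.5) because the minimum and maximum come from the training data.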

#### 3.2 Hyperparameter tuning: Design a grid search over the following hyperparameter space:
 - hidden_layer_sizes: [(18),(10,10,10,10,10),(18,10)]
 - learning_rate_init: [1e-2,1e-3,1e-5]
 - alpha: [1,0.1]

Fixed hyperparameters:
 - activation = 'relu'
 - solver = 'adam'
 - batch_size = 50
 - max_iter = 2000
 - random_state = 42
 
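Parameters for the grid search:
 - cv = 5
 - scoring = 'r2'
 - random_state = 42 (`GridSearchCV` itself takes no `random_state` argument; pass the seed to the estimator or to the cross-validation splitter)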
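
The grid search described above can be sketched as follows on a small synthetic dataset. Three things in this sketch are assumptions rather than part of the assignment: the feature names, the reduced `max_iter` (100 instead of 2000, only to keep the example fast), and the use of a shuffled `KFold` splitter to carry the cross-validation seed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor

# Small synthetic data set (already in [0, 1]); feature names are hypothetical.
rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.uniform(0, 1, size=(150, 3)), columns=["f1", "f2", "f3"])
y_train = 15 + 5 * X_train["f1"] - 3 * X_train["f2"] ** 2 + rng.normal(0, 0.3, size=150)

# Note: in Python (18) is the integer 18, while (18,) is a one-element tuple.
param_grid = {
    "hidden_layer_sizes": [(18,), (10, 10, 10, 10, 10), (18, 10)],
    "learning_rate_init": [1e-2, 1e-3, 1e-5],
    "alpha": [1, 0.1],
}

mlp = MLPRegressor(
    activation="relu",
    solver="adam",
    batch_size=50,
    max_iter=100,      # reduced from 2000 to keep this sketch fast
    random_state=42,   # seeds weight initialisation and batch shuffling
)

# GridSearchCV has no random_state argument; the seed goes into the CV splitter.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(mlp, param_grid, cv=cv, scoring="r2")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"best mean CV R2: {grid.best_score_:.3f}")
```

With the reduced `max_iter`, scikit-learn will emit convergence warnings for some configurations. By default `GridSearchCV` refits the best configuration on the full training set, so `grid.best_estimator_` can be used directly for prediction.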

#### 3.3 Train the model with the best parameters and evaluate the performance on the test set, using the R2.

#### 3.4 Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?

### 4. Out-of-sample validation
The company that hired you finds the results of your models suspiciously good, so it decided to collect more data in another neighborhood to check their performance.<br>
Performance could change drastically.

#### 4.1. Measure the generalisation performance of the MLP model on the hold-out sample data.
 - Make sure that the validation data have the same columns as the original data, in the same order. Hint: use a pandas dataframe instead of a numpy array (it allows sklearn to control for the order of the columns)
 - Apply the same preprocessing as the original data
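
One way to align a new dataset with the training features is `DataFrame.reindex`, followed by the scaler that was fitted on the training data. The sketch below is built on toy data with hypothetical column names; it assumes one-hot encoded road types and a single scaled numeric column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Training features after one-hot encoding (hypothetical names).
X_train = pd.DataFrame({
    "length_m": [50.0, 100.0, 150.0],
    "road_type_primary": [0, 1, 0],
    "road_type_residential": [1, 0, 0],
})
scaler = MinMaxScaler().fit(X_train[["length_m"]])

# Raw validation data: different column order and a different set of categories.
val_raw = pd.DataFrame({
    "road_type": ["residential", "footway"],
    "length_m": [75.0, 125.0],
})
val_encoded = pd.get_dummies(val_raw, columns=["road_type"], dtype=int)

# Add dummy columns missing in the validation data (filled with 0), drop
# columns unseen during training, and restore the training column order.
X_val = val_encoded.reindex(columns=X_train.columns, fill_value=0)

# Reuse the scaler fitted on the training data; do not refit it.
X_val[["length_m"]] = scaler.transform(X_val[["length_m"]])
print(X_val)
```

Categories that appear only in the validation data (here `footway`) are silently dropped by this approach, which is worth keeping in mind when interpreting the validation performance.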

#### 4.1.1 Measure the performance

#### 4.1.2 Plot the model's prediction against the true speed. Comment on the model's prediction behavior.

#### 4.2.1 Measure the generalisation performance of the regression model on the hold-out sample data

#### 4.2.2 Plot the model's prediction against the true speed. Comment on the model's prediction behavior.

#### 4.3 Interpretation of the results:

#### 4.3.1 Do you observe a decrease in the performance? If so, what is the cause?

#### 4.3.2 Which of the two models performs better? Why?