# TPM034A Machine Learning for socio-technical systems 
## `Assignment 01: Data Exploration and MultiLayer Perceptrons`

**Delft University of Technology**<br>
**Q2 2024**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instruction below). 

### `Google Colab workspace set-up`

Uncomment the following lines of code if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM034A/Q2_2024
#!pip install -r Q2_2024/requirements_colab.txt
#!mv "/content/Q2_2024/Assignments/assignment_01/data" /content/data

## `Application: Cycling speed prediction for Rotterdam` <br>

### **Introduction**

In the Dutch urban context, cycling is an important mode of transportation, serving both personal and commercial purposes, including delivery services. However, one of the challenges that individuals and companies frequently encounter is the lack of relevant cycling itineraries proposed by routing algorithms. This is partly due to the lack of accurate cycling speed information per road link. Hence, accurate information on cycling speeds could improve itinerary recommendations by routing apps and, in turn, help individuals and companies to choose better cycling routes.

One way to obtain accurate cycling speed data is by installing speed sensors on all road links. However, this is a costly and time-consuming process. Alternatively, one could use machine learning models to estimate cycling speeds based on other variables, such as road infrastructure, traffic, and weather conditions.

Seeing a business opportunity, a data-analytics start-up company has collected data on average cycling speeds on several roads within a specific neighborhood. Now it needs to develop a machine learning model that can predict cycling speeds on any road link in the city of Rotterdam. You have been hired by this company for this task. Specifically, in this assignment you need to develop an MLP that is capable of predicting the (average) cycling speeds per road link for the entire city of Rotterdam based on publicly available street infrastructure data.

List of tasks:
1. Explore the cycling speed data provided by the company, to determine if it can be used for your task
1. Train an MLP model to predict average cycling speed
1. Evaluate and reflect on the performance of your model

### **Data**

You have access to 2 data sets:
1. Training dataset: first_campaign.gpkg
1. Testing dataset: validation_campaign.gpkg
<br>

### **Tasks and grading**

Your assignment is divided into 4 subtasks: (1) data inspection, (2) regression model, (3) MLP model, (4) performing an out-of-sample validation.

1.  **Data inspection: Load the dataset and make a first inspection** [1.5 pnt]
    1. Distribution of cycling speed:
        1. Visualize the statistical distribution of cycling speed per street segment
        1. Plot its spatial distribution on a map. Do the data make sense?
    1. Number of observations per street:
        1. Plot the distribution of number of observations per street
        1. How many street segments have fewer than 3 observations? Why is it a problem?
1. **Training a first regression model** [3 pnt]
    1. Selecting relevant columns: Which columns should you keep in your analysis? Why? 
    1. One-hot encoding: Which variables would you need to encode and why? (provide a list)
    1. Split the data between training and test set.
    1. Start with a simple model:
        - Use a linear regression model
        - Make sure that the columns are in the same order in both train and test sets.
        - Evaluate the performance on the test set using the R2. Is the performance good enough to use this model?
    1. Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?
    1. Print the coefficient of the regression. Which features contribute to faster speed? Does that make sense?
1. **Training a more advanced MLP model** [2.5 pnt]
    1. Scaling variables: Which variables would you scale and why? (provide a list) Use a min-max scaler.
    1. Hyperparameter tuning: Design a grid search over the following hyperparameter space:
        - hidden_layer_sizes: [(18),(10,10,10,10,10),(18,10)]
        - learning_rate_init: [1e-2,1e-3,1e-5]
        - alpha: [1,0.1]
    1. Evaluate the performance on the test set, using the R2. How is the performance compared to the previous regression model?
    1. Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?
1. **Out-of-sample validation** [3 pnt]
    The company that hired you is impressed by your results. But to get more confidence about the generalisation performance, they decided to collect a new data set in another neighborhood.
    1. Load and preprocess the validation data.
        - Make sure that the validation data have the same columns as the original data in the same order
        - Apply the same preprocessing as the data used for training
    1. Measure the generalisation performance of the MLP model on the hold-out sample data.
        1. Measure the performance
        1. Plot the model's prediction against the true speed
    1. Measure the generalisation performance of the regression model on the hold-out sample data
        1. Measure the performance
        1. Plot the model's prediction against the true speed
    1. Reflect on the generalisation performance of your model
        1. Discuss reasons why the models might perform better or worse on the validation data.
        1. Which model performs better? Why?


### **Submission**
- The deadline for this assignment is **Monday, XX XXXXXX XXXX** 
- Use **Python 3.XX or above**
- You have to submit your work in a zip file containing the ipynb **(fully executed)**

In [None]:
import geopandas as gpd
import os
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

pd.set_option('display.max_columns', None)

### 1. Data inspection: Load the dataset and make a first inspection
#### 1.1 Distribution of cycling speed:

In [None]:
# Loading the data
data_path = Path('data')
cycling_links = gpd.read_file(data_path / 'first_campaign.gpkg')

#### 1.1.1 Visualize the statistical distribution of cycling speed per street segment

In [None]:
# Visualize the statistical distribution of cycling speed per street segment
cycling_links['speed'].hist(bins=30)

#### 1.1.2 Plot its spatial distribution on a map. Do the data make sense?

In [None]:
# Plot its spatial distribution on a map. Do the data make sense?
cycling_links.explore(column='speed')

# Answer: Yes, the data makes sense. Speed is higher along main roads.

#### 1.2 Number of observations per street

#### 1.2.1 Plot the distribution of number of observations per street

In [None]:
cycling_links['n_observations'].hist(bins=30)

#### 1.2.2 How many street segments have fewer than 3 observations? Why is it a problem?

In [None]:
len(cycling_links.loc[cycling_links['n_observations']<3])
# Answer: Many street segments have fewer than 3 observations. Their average speed is noisy, which threatens reliable predictions.
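
To illustrate why segments with few observations are a concern, the sketch below simulates how noisy a segment average is for different numbers of observations. The true speed (18 km/h), the spread of individual observations (4 km/h) and the sample sizes are made-up values for illustration, not properties of the campaign data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_speed = 18.0   # hypothetical true average speed (km/h)
noise_sd = 4.0      # hypothetical spread of individual observations (km/h)

spread_of_mean = {}
for n_obs in (2, 10, 50):
    # Simulate 5000 segments that each have n_obs observations
    samples = rng.normal(true_speed, noise_sd, size=(5000, n_obs))
    segment_means = samples.mean(axis=1)
    spread_of_mean[n_obs] = segment_means.std()
    print(f'n_obs={n_obs:>2}: std of segment mean = {spread_of_mean[n_obs]:.2f} km/h '
          f'(theory: {noise_sd / np.sqrt(n_obs):.2f} km/h)')
```

The spread of the segment average shrinks roughly with the square root of the number of observations, so a target value based on one or two observations carries much more noise than one based on fifty.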

### 2. Training multiple linear regression model

#### 2.1 Selecting relevant columns: Which columns should you keep in your analysis? Why?

In [None]:
cycling_links.columns
# Keep all columns except 'osmid' (no relevant information), 'geometry' and 'n_observations'.

In [None]:
cols_keep = cycling_links.columns.drop(['n_observations','osmid', 'geometry'])
cycling_links = cycling_links[cols_keep]

#### 2.2 One-hot encoding: Which variables would you need to encode and why? (provide a list)

In [None]:
# Encode categorical features
features_categorical = ['oneway', 'highway', 'bridge', 'service', 'access', 'tunnel', 'junction','street_light'] 

In [None]:
cycling_links[features_categorical] = cycling_links[features_categorical].astype('category')
cycling_links[features_categorical].info()

In [None]:
for feature in features_categorical:
    print(f'FEATURE = {feature}\n')
    print(cycling_links[feature].value_counts())

In [None]:
# One-hot encode categorical features
df = pd.get_dummies(cycling_links, columns=features_categorical, drop_first=True)
df.head()
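
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy DataFrame with and without it. The column values are illustrative and not taken from the assignment data.

```python
import pandas as pd

# Toy data; the values are made up for illustration
toy = pd.DataFrame({'highway': ['primary', 'residential', 'cycleway', 'residential'],
                    'length': [120.0, 45.0, 300.0, 80.0]})

full = pd.get_dummies(toy, columns=['highway'])
reduced = pd.get_dummies(toy, columns=['highway'], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With all dummy columns kept, the dummies of one variable always sum to 1 per row, which makes them perfectly collinear with the intercept of a linear regression. Dropping the first category (here `cycleway`) turns it into the reference level against which the other coefficients are read.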

#### 2.3 Split the data into a training and test set.

In [None]:
# Split the DataFrame into train and test sets
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

features = df.columns.tolist()
features.remove('speed')
features

X_train = df_train[features]
y_train = df_train['speed']
X_test = df_test[features]
y_test = df_test['speed']

#### 2.4 Train your first model:

In [None]:
# Use a linear regression model
# Make sure that the columns are in the same order in both train and test sets.
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluate the performance on the test set, using the R2
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R² Score: {r2}' )
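
For reference, the R² reported by `r2_score` can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up speeds and predictions, not the assignment data.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([12.0, 15.0, 18.0, 22.0, 25.0])   # made-up speeds (km/h)
y_hat = np.array([13.0, 15.5, 17.0, 20.0, 24.0])    # made-up predictions (km/h)

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores an R² of 0
r2_mean_model = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(f'Manual R²: {r2_manual:.4f}, sklearn R²: {r2_score(y_true, y_hat):.4f}')
```

An R² of 0 therefore means "no better than predicting the average speed everywhere", and negative values are possible when a model does worse than that baseline.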

#### 2.5 Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?

In [None]:
# Scatter plot of the true values vs the predicted values using sns_regplot
sns.regplot(x=df_test['speed'], y=y_pred, scatter_kws={'alpha':0.3})
plt.xlim(5, 35)
plt.ylim(5, 35)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('True vs Predicted values')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
# Outliers are located in the extremes of the speed range, and upper outliers make the model overestimate the speed. 

#### 2.6 Print the coefficient of the regression. Which features contribute to faster speed? Does that make sense?

In [None]:
# print the coefficients with their corresponding feature names
coefficients = pd.DataFrame({'feature': features, 'coefficient': model.coef_})

# Add the intercept to the DataFrame
intercept_df = pd.DataFrame({'feature': ['intercept'], 'coefficient': [model.intercept_]})
coefficients = pd.concat([intercept_df,coefficients], ignore_index=True)

coefficients.sort_values('coefficient', ascending=False)

# Larger roads (primary, secondary) have a high positive impact on speed.
# Street lights have a negative impact on speed.
# Residential roads have a negative impact on speed.

### 3. Train a more advanced model: MLP
Hint:
- When feeding tabular data to sklearn use a pandas dataframe instead of a numpy array 
- It allows sklearn to control for the order of the columns
- It will be useful later in the assignment

#### 3.1 Scaling numerical features

In [None]:
features_numeric = ['length','lanes','maxspeed']

df_train_scaled = df_train.copy()
df_test_scaled = df_test.copy()
# Scaling numerical features
scaler = MinMaxScaler()

X_train_numeric = X_train[features_numeric]
X_train_cat = X_train.drop(columns=features_numeric)
X_test_numeric = X_test[features_numeric]
X_test_cat = X_test.drop(columns=features_numeric)

# Fit the scaler on the training data
scaler.fit(X_train_numeric)

X_train_numeric_scaled = scaler.transform(X_train_numeric)
X_test_numeric_scaled = scaler.transform(X_test_numeric)

# Scaling the training and test data
df_X_train_scaled_numeric = pd.DataFrame(X_train_numeric_scaled, columns=features_numeric)
df_X_train_scaled = pd.concat([df_X_train_scaled_numeric, X_train_cat.reset_index(drop=True)], axis=1)
X_test_scaled_numeric_df = pd.DataFrame(X_test_numeric_scaled, columns=features_numeric)
df_X_test_scaled = pd.concat([X_test_scaled_numeric_df, X_test_cat.reset_index(drop=True)], axis=1)
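
The reason for fitting the scaler on the training data only can be seen in a toy example. The values below are made up; the point is that the minimum and maximum come from the training set, so test values outside that range land outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration
train = pd.DataFrame({'length': [50.0, 100.0, 250.0], 'maxspeed': [30.0, 50.0, 50.0]})
test = pd.DataFrame({'length': [25.0, 300.0], 'maxspeed': [30.0, 80.0]})

toy_scaler = MinMaxScaler().fit(train)   # min and max are learned from the training data only
train_scaled = pd.DataFrame(toy_scaler.transform(train), columns=train.columns)
test_scaled = pd.DataFrame(toy_scaler.transform(test), columns=test.columns)

print(test_scaled)
```

Refitting the scaler on the test data would force those values back into [0, 1], but the same scaled value would then mean something different in training and testing, and information from the test set would leak into the preprocessing.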

#### 3.2 Hyperparameter tuning: Design a grid search over the following hyperparameter space:
 - hidden_layer_sizes: [(18),(10,10,10,10,10),(18,10)]
 - learning_rate_init: [1e-2,1e-3,1e-5]
 - alpha: [1,0.1]

Fixed hyperparameters:
 - activation = 'relu'
 - solver = 'adam'
 - batch_size = 50
 - max_iter = 2000
 - random_state = 42
 
 Parameters for the grid search:
 - cv = 5
 - scoring = 'r2'
 - random_state = 42

In [None]:
mlp_gs = MLPRegressor(activation='relu', solver='adam', batch_size=50, max_iter=2000, random_state=42)

# Define the hyperparameter search space
hyperparameter_space = {
    'hidden_layer_sizes': [(18,), (10,10,10,10,10), (18,10)],  # (18,) is a one-element tuple; (18) is just the integer 18
    'alpha': [1, 0.1],
    'learning_rate_init': [1e-2, 1e-3, 1e-5]
    }

folds = 5

mlp_gridsearch = GridSearchCV(mlp_gs, hyperparameter_space, n_jobs=-1, cv=folds, scoring = 'r2')
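
Before launching the search, it can help to estimate how many models will be trained. The sketch below counts the combinations for the search space listed in the task description, using scikit-learn's `ParameterGrid`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The search space as listed in the task description
space = {
    'hidden_layer_sizes': [(18,), (10, 10, 10, 10, 10), (18, 10)],
    'learning_rate_init': [1e-2, 1e-3, 1e-5],
    'alpha': [1, 0.1],
}

n_combinations = len(ParameterGrid(space))
n_folds = 5
# Each combination is fitted once per fold, plus one final refit (refit=True is the default)
n_fits = n_combinations * n_folds + 1

print(f'{n_combinations} combinations, {n_fits} model fits in total')
```

The number of fits grows multiplicatively with every extra hyperparameter value, which is one reason to store the fitted search on disk rather than rerunning it.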

In [None]:
# Conduct the grid search (just once) and cache the result on disk
output_folder = data_path / 'output'
model_sav = 'my_tuned_MLPmodel.sav'
if (output_folder / model_sav).exists():
    # Reuse the previously fitted grid search
    with open(output_folder / model_sav, 'rb') as f:
        mlp_gridsearch = pickle.load(f)
else:
    output_folder.mkdir(parents=True, exist_ok=True)

    # Conduct the grid search on all scaled features (numeric and one-hot encoded)
    mlp_gridsearch.fit(df_X_train_scaled, y_train)

    # Save the results
    with open(output_folder / model_sav, 'wb') as f:
        pickle.dump(mlp_gridsearch, f)

#### 3.3 Train the model with the best parameters and evaluate the performance on the test set, using the R2.

In [None]:
df_gridsearch = pd.DataFrame.from_dict(mlp_gridsearch.cv_results_)
# Add new column with a label for the hyperparameter combinations
df_gridsearch['gs_combinations'] = 'L2 = '+ df_gridsearch['param_alpha'].astype('str') + '; Learning_rate = '+ df_gridsearch['param_learning_rate_init'].astype('str') + '; Layers = ' + df_gridsearch['param_hidden_layer_sizes'].astype('str')
df_gridsearch = df_gridsearch.sort_values('rank_test_score')

# Visualise deviation in performance across hyper parameter settings
plt.figure(figsize = (16,6))
ax = sns.barplot(x = df_gridsearch.gs_combinations, y=df_gridsearch.mean_test_score, palette="Blues_d", hue = df_gridsearch.gs_combinations)
plt.xticks(rotation=90)

print('Best hyperparameters found:\t', mlp_gridsearch.best_params_)
print('Best model performance:\t\t', mlp_gridsearch.best_score_)

In [None]:
# Create a new mlp object using the optimised hyperparameters, just using the train/test split
layers = mlp_gridsearch.best_params_['hidden_layer_sizes']
lr = mlp_gridsearch.best_params_['learning_rate_init']
alpha = mlp_gridsearch.best_params_['alpha']
mlp_gs = MLPRegressor(hidden_layer_sizes = layers, solver='adam', learning_rate_init = lr,
                       alpha=alpha, batch_size=50, activation = 'relu', max_iter = 2000, random_state=42)

# Train the model
mlp_gs.fit(df_X_train_scaled,y_train)
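
As an aside, a fitted `GridSearchCV` with the default `refit=True` already holds a model retrained on the full training data in `best_estimator_`. The sketch below shows this on synthetic data, using `Ridge` as a fast, deterministic stand-in for the MLP; the data and the alpha grid are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(Ridge(), {'alpha': [0.01, 1.0, 100.0]}, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already trained on all of X_demo
best_model = search.best_estimator_
manual = Ridge(alpha=search.best_params_['alpha']).fit(X_demo, y_demo)

print(search.best_params_, np.allclose(best_model.coef_, manual.coef_))
```

Rebuilding the model by hand, as done in the cell above, gives the same kind of model and makes the chosen hyperparameters explicit. For a stochastic learner such as an MLP the two only match exactly when the random seed is fixed.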

In [None]:
# Evaluate the performance on the test set, using the R2
y_pred = mlp_gs.predict(df_X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(df_test['speed'], y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R² Score: {r2}' )

#### 3.4 Plot the predicted speed on the test set against the true speed. Is there a speed region where the predictions are better/poorer?

In [None]:
# Scatter plot of the true values vs the predicted values using sns_regplot
sns.regplot(x=df_test['speed'], y=y_pred, scatter_kws={'alpha':0.3})
plt.xlim(5, 35)
plt.ylim(5, 35)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('True vs Predicted values')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
# The model overestimates small values and underestimates large values.

### 4. Out-of-sample validation
The company that hired you finds the results of your models suspiciously good, so it decided to collect more data in another neighborhood to check their performance.<br>
Performance could change drastically.

#### 4.1. Measure the generalisation performance of the MLP model on the hold-out sample data.
 - Make sure that the test data have the same columns as the original data in the same order: hint use a pandas dataframe instead of a numpy array (it allows sklearn to control for the order of the columns)
 - Apply the same preprocessing as the original data

In [None]:
out_of_sample = gpd.read_file(data_path / 'validation_campaign.gpkg')

In [None]:
df_out_of_sample = pd.get_dummies(out_of_sample, columns=features_categorical, drop_first=True)
for col in set(df_X_train_scaled.columns) - set(df_out_of_sample.columns):
    df_out_of_sample[col] = 0

In [None]:
df_out_of_sample = df_out_of_sample.reindex(columns=df_X_train_scaled.columns, fill_value=0)
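
The effect of `reindex` with `fill_value=0` can be checked on a toy example. In the sketch below, the new sample lacks one training category and contains one unseen category; all values are made up for illustration.

```python
import pandas as pd

# Toy data: the new sample lacks 'primary' and contains an unseen category 'track'
train_toy = pd.DataFrame({'highway': ['primary', 'residential', 'cycleway']})
new_toy = pd.DataFrame({'highway': ['residential', 'track', 'cycleway']})

train_dummies = pd.get_dummies(train_toy, columns=['highway'])
new_dummies = pd.get_dummies(new_toy, columns=['highway'])

# Reindex to the training columns: missing columns are filled with 0, unseen ones are dropped
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(aligned)
```

Note that the row with the unseen category ends up with all dummy columns set to 0, so the model has no information about that category. It is worth checking how often this happens in the new data.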

In [None]:
df_out_of_sample_scaled = df_out_of_sample.copy()
df_out_of_sample_scaled[features_numeric] = scaler.transform(df_out_of_sample_scaled[features_numeric])
# Order the columns in the same way as in the training set
df_out_of_sample_scaled = df_out_of_sample_scaled[df_X_train_scaled.columns]

#### 4.1.1 Measure the performance

In [None]:
y_pred_out_of_sample = mlp_gs.predict(df_out_of_sample_scaled)
mse_out_of_sample = mean_squared_error(out_of_sample['speed'], y_pred_out_of_sample)
r2_out_of_sample = r2_score(out_of_sample['speed'], y_pred_out_of_sample)
print(f'Mean Squared Error: {mse_out_of_sample}')
print(f'R² Score: {r2_out_of_sample}' )

#### 4.1.2 Plot the model's prediction against the true speed. Comment on the model's prediction behavior.

In [None]:
# Scatter plot of the true values vs the predicted values using sns_regplot
sns.regplot(x=out_of_sample['speed'], y=y_pred_out_of_sample, scatter_kws={'alpha':0.3})
plt.xlim(5, 35)
plt.ylim(5, 35)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('True vs Predicted values')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
# Speeds below 20 km/h are mapped to around 10 km/h, and speeds above 20 km/h are mapped to around 25 km/h.

#### 4.2.1 Measure the generalisation performance of the regression model on the hold-out sample data

In [None]:
y_pred_out_of_sample = model.predict(df_out_of_sample[features])
mse_out_of_sample = mean_squared_error(out_of_sample['speed'], y_pred_out_of_sample)
r2_out_of_sample = r2_score(out_of_sample['speed'], y_pred_out_of_sample)
print(f'Mean Squared Error: {mse_out_of_sample}')
print(f'R² Score: {r2_out_of_sample}' )

#### 4.2.2 Plot the model's prediction against the true speed. Comment on the model's prediction behavior.

In [None]:
# Scatter plot of the true values vs the predicted values using sns_regplot
sns.regplot(x=out_of_sample['speed'], y=y_pred_out_of_sample, scatter_kws={'alpha':0.3})
plt.xlim(5, 35)
plt.ylim(5, 35)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('True vs Predicted values')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
# Speeds below 20 km/h mapped to around 15 km/h, and speeds above 20 km/h mapped to around 25 km/h.

#### 4.3 Interpretation of the results:

#### 4.3.1 Do you observe a decrease in the performance? If so, what is the cause?

- Decrease in performance
- Data leak: High spatial correlation between the training and test sets (street segments are from the same neighborhood, sometimes the same street)
- Overestimation of the performance on the test set
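
One common way to limit this kind of leakage is to split by group, so that all segments of one street fall on the same side of the split. The sketch below uses scikit-learn's `GroupShuffleSplit` on a hypothetical `street_id` array; the assignment data is not known to contain such a column, so the grouping variable is an assumption.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: 12 segments belonging to 4 streets (street ids are made up)
street_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X_toy = np.arange(12).reshape(-1, 1)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, groups=street_id))

# No street contributes segments to both sides of the split
shared = set(street_id[train_idx]) & set(street_id[test_idx])
print('Streets in both sets:', shared)
```

A test score obtained from such a split is usually closer to what can be expected in a new neighborhood than a score from a purely random split.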

#### 4.3.2 Which of the two models performs better? Why?
