# TSFS12 Hand-in exercise 5, extra assignment, solution: Learning predictive driver models with neural networks
Erik Frisk (erik.frisk@liu.se)

This exercise is based on data from the I-80 data set from the U.S. Department of Transportation. The data can be downloaded from the course directory in 
Lisam, and are available in the directory /courses/tsfs12/i80_data in the student labs at campus. 

I-80 data set citation: U.S. Department of Transportation Federal Highway Administration. (2016). Next Generation Simulation (NGSIM) Vehicle
Trajectories and Supporting Data. [Dataset]. Provided by ITS DataHub through Data.transportation.gov. Accessed 2020-09-29 from http://doi.org/10.21949/1504477. More details about the data set are 
available through this link.  

Make initial imports. The exercise requires the Python packages numpy, tensorflow, scikit-learn, pandas, matplotlib, and seaborn.

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns
from i80_utility import plot_road, lane_bounds, plot_prediction, load_i80_features, load_i80_trajectories, get_trajectory_from_datapoint

In [None]:
%matplotlib

# Introduction
The raw data used in the exercise is available from https://www.its.dot.gov/data/, the US Department of Transportation Intelligent Transport Systems datahub. More specifically, we will use the I-80 data from the NGSIM program. The data was collected through a network of synchronized digital video cameras and then transcribed from the video to vehicle trajectory data. This trajectory data provides the precise location of each vehicle within the study area at 10 Hz, resulting in detailed lane positions and locations relative to other vehicles.

https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Vehicle-Trajector/8ect-6jqj

The raw data is described in the file ```I-80_Metadata_Documentation.pdf```. There are predefined functions for reading the raw data; they also convert units to SI units.

From the raw trajectory data, we have designed features to be able to build predictive models. The data needed for this exercise can be downloaded from Lisam, thus you _do not_ have to download anything outside of Lisam.

First, define where the data resides: on your own computer or, if you are working there, in the student labs. The variable ```i80_data_dir``` points to the directory where the data directory ```i80_data``` is located.

In [None]:
i80_data_dir = './'  # data downloaded in the current directory
# i80_data_dir = '/courses/tsfs12/'  # student labs

Create a random number generator with a specified seed so that results are reproducible.

In [None]:
rg = np.random.default_rng(seed=1891)

# Load I-80 feature data

Based on the I-80 data, we have designed features to be able to build predictive models. Each datapoint has 41 features. The feature data contains 95591 datapoints and is stored in three variables:
* x - The feature data, a (95591 x 41)-matrix.
* y - True label for each datapoint.
* info - Information about which trajectory, dataset, and time-stamp the datapoint corresponds to.

The feature data is described in more detail in the handin documentation and the file ```features.md```.

In [None]:
x, y, info = load_i80_features(i80_data_dir)
print(f"Read {x.shape[0]} datapoints with {x.shape[1]} features.")

Show how many datapoints correspond to switching lane left, right, and staying in the same lane.

In [None]:
print(f"Left: {np.sum(y == 0)}, Straight: {np.sum(y == 1)}, Right: {np.sum(y == 2)}")
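
The class counts can also be expressed as fractions, which makes the imbalance easier to read. The following is a minimal sketch on synthetic labels; the array ```y_demo``` and its class proportions are made up for illustration and are not taken from the I-80 data.

```python
import numpy as np

# Synthetic labels standing in for y (0 = left, 1 = straight, 2 = right).
# The proportions below are invented for this demo.
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=10_000, p=[0.02, 0.95, 0.03])

counts = np.bincount(y_demo, minlength=3)  # number of datapoints per class
fractions = counts / counts.sum()          # share of each class

for name, n, f in zip(("Left", "Straight", "Right"), counts, fractions):
    print(f"{name:>8}: {int(n):6d} ({100 * f:.1f} %)")
```

Note that ```np.bincount``` requires non-negative integers; if the labels are stored as floats, cast them first, e.g. with ```y.astype(int)```.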

# Factor out validation data set

The next step is to create a validation data-set and carefully ensure that there is no leakage of validation datapoints into the training dataset. First, collect indices for all datapoints corresponding to each class.

In [None]:
left_class_idx = np.argwhere(y == 0).reshape(-1)
straight_class_idx = np.argwhere(y == 1).reshape(-1)
right_class_idx = np.argwhere(y == 2).reshape(-1)

Select M datapoints at random from each class to include in the validation dataset, which then becomes balanced. Due to the large class imbalance, M cannot be too large; otherwise very few datapoints from the minority classes would be left for training. Experiment with this number M.

In [None]:
M = 50
val_class_idx = (rg.choice(left_class_idx, M, replace=False),
                 rg.choice(straight_class_idx, M, replace=False),
                 rg.choice(right_class_idx, M, replace=False))
validation_index = np.hstack(val_class_idx)

# Balance training data by resampling

Due to the severe class imbalance in the data, some measure needs to be taken. Here, the data is balanced by oversampling the underrepresented classes so that those datapoints are weighted higher. The code below samples M datapoints, _with replacement_, from each class (excluding the validation data). Experiment also with the number M.

In [None]:
M = 1500  # Number of samples from each class
train_class_idx = (rg.choice(np.setdiff1d(left_class_idx, val_class_idx[0]), M),
                   rg.choice(np.setdiff1d(straight_class_idx, val_class_idx[1]), M),
                   rg.choice(np.setdiff1d(right_class_idx, val_class_idx[2]), M))

train_index = np.hstack(train_class_idx)
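
A quick sanity check of such a split is to verify that no index appears in both sets and that both sets are balanced. The sketch below reproduces the sampling scheme on synthetic labels; the names ```y_demo```, ```m_val```, and ```m_train``` and the chosen sizes are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, imbalanced labels (made-up proportions) standing in for y.
y_demo = rng.choice([0, 1, 2], size=5_000, p=[0.03, 0.94, 0.03])

m_val, m_train = 20, 300  # hypothetical sizes for this demo
val_idx, train_idx = [], []
for c in range(3):
    class_idx = np.flatnonzero(y_demo == c)
    v = rng.choice(class_idx, m_val, replace=False)   # validation: without replacement
    remaining = np.setdiff1d(class_idx, v)            # exclude validation points
    t = rng.choice(remaining, m_train, replace=True)  # training: with replacement
    val_idx.append(v)
    train_idx.append(t)
val_idx = np.hstack(val_idx)
train_idx = np.hstack(train_idx)

# No datapoint may appear in both sets.
assert np.intersect1d(train_idx, val_idx).size == 0
# Both sets should contain the same number of datapoints from each class.
print(np.bincount(y_demo[val_idx], minlength=3))
print(np.bincount(y_demo[train_idx], minlength=3))
```

The same two checks can be applied to ```train_index``` and ```validation_index``` in the notebook.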

Collect data points and labels for training and validation in arrays ```x_train```, ```y_train```, ```x_val```, ```y_val```.

In [None]:
x_train = x[train_index]
y_train = y[train_index]
x_val = x[validation_index]
y_val = y[validation_index]

# Normalize data

As the last step before building models, normalize the data so that each feature has mean 0 and standard deviation 1 on the training set.

In [None]:
mu = np.mean(x_train, axis=0)
std = np.std(x_train, axis=0)

x_val = (x_val - mu) / std
x_train = (x_train - mu) / std

Do not forget to apply the same normalization, using ```mu``` and ```std``` computed from the training data, when making predictions.
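
One way to make the normalization harder to forget is to wrap it in small helper functions. The sketch below is an illustration on synthetic data, not part of the provided utilities; the function names ```fit_normalizer``` and ```normalize``` are hypothetical, and replacing a zero standard deviation by 1 is an assumption made here to avoid division by zero for constant features.

```python
import numpy as np

def fit_normalizer(x_train):
    """Return per-feature mean and standard deviation of the training data."""
    mu = np.mean(x_train, axis=0)
    std = np.std(x_train, axis=0)
    # Assumption in this sketch: constant features get std = 1
    # so that the division below is always well defined.
    std = np.where(std > 0, std, 1.0)
    return mu, std

def normalize(x, mu, std):
    """Apply training-set statistics to any data (training, validation, or new)."""
    return (x - mu) / std

rng = np.random.default_rng(0)
x_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # synthetic features
x_demo[:, 3] = 1.0                                      # a constant feature

mu_demo, std_demo = fit_normalizer(x_demo)
x_demo_norm = normalize(x_demo, mu_demo, std_demo)
print(x_demo_norm.mean(axis=0).round(6))
print(x_demo_norm.std(axis=0).round(6))
```

The statistics are computed once from the training data and then reused unchanged for validation data and for single datapoints at prediction time.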

# Formulate model and train

You will not need any advanced architectures in this exercise. Start with the "Hello World" example at https://www.tensorflow.org/overview and adapt it to the lane-change prediction model in this exercise. Also try experimenting with regularization techniques, e.g., ```tf.keras.layers.Dropout``` layers (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout).

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(20, input_shape=(#YOUR_CODE_HERE,), activation=#YOUR_CODE_HERE),
    #YOUR_CODE_HERE
    tf.keras.layers.Dense(#YOUR_CODE_HERE, activation=#YOUR_CODE_HERE)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
hist = model.fit(x_train, y_train, epochs=30, validation_data=(x_val, y_val))

model.evaluate(x_val, y_val)

Plot loss and accuracy for the training and validation datasets.

In [None]:
plt.figure(10, clear=True)
plt.plot(hist.history['loss'], label='train')
plt.plot(hist.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.title('Loss')
plt.legend()
sns.despine()

plt.figure(11, clear=True)
plt.plot(hist.history['accuracy'], label='train')
plt.plot(hist.history['val_accuracy'], label='validation')
plt.xlabel('Epoch')
plt.title('Accuracy')
plt.legend()
sns.despine()

Take a random datapoint from the validation dataset, make a prediction, and compare it with the true label.

In [None]:
xi_index = rg.choice(validation_index, 1)
yhat = model.predict((x[xi_index] - mu) / std)
print(f"Prediction: {yhat}")
print(f"True label: {int(y[xi_index][0])}")
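
The prediction is one score per class rather than a single label. Assuming the final layer uses a softmax activation (an assumption, since the model definition is left as an exercise), the scores can be read as class probabilities and the predicted label is the index of the largest one. A minimal sketch with a hypothetical output vector ```yhat_demo```:

```python
import numpy as np

class_names = ("left", "straight", "right")  # label order 0, 1, 2 as in y

# Hypothetical model output for one datapoint, shape (1, 3).
# With a softmax output layer, each row sums to 1.
yhat_demo = np.array([[0.10, 0.75, 0.15]])

predicted_class = int(np.argmax(yhat_demo, axis=1)[0])
print(f"Predicted: {class_names[predicted_class]} "
      f"(p = {yhat_demo[0, predicted_class]:.2f})")
```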

Compute the confusion matrix for training and validation data using the imported ```confusion_matrix``` function. Function ```np.argmax``` can also be useful.

In [None]:
?confusion_matrix  # Run this for help

In [None]:
C_train = 0  #YOUR_CODE_HERE
print(C_train)

C_val = 0  # YOUR_CODE_HERE
print(C_val)
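
To illustrate what a confusion matrix contains, the sketch below builds one with NumPy only, from hypothetical class probabilities. The helper ```confusion_counts``` and the arrays ```p_demo``` and ```y_true_demo``` are made up for this example. It follows the same convention as scikit-learn's ```confusion_matrix(y_true, y_pred)```: rows are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=3):
    """C[i, j] = number of datapoints with true class i predicted as class j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(C, (y_true, y_pred), 1)  # accumulate one count per (true, predicted) pair
    return C

# Hypothetical class probabilities for six datapoints (each row sums to 1).
p_demo = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.4, 0.5, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])
y_true_demo = np.array([0, 1, 2, 0, 1, 2])

y_pred_demo = np.argmax(p_demo, axis=1)  # most probable class per datapoint
C_demo = confusion_counts(y_true_demo, y_pred_demo)
print(C_demo)
print(f"Accuracy: {np.trace(C_demo) / C_demo.sum():.2f}")
```

The diagonal holds the correctly classified datapoints, so the accuracy is the trace divided by the total count. Off-diagonal entries show which classes are confused with each other; here one left lane change is predicted as staying in lane.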

# Evaluate on validation trajectories

Below is a simple visualization of model predictions given the vehicle trajectories. First, load all trajectories from the I-80 dataset.

## Load and explore trajectories

In [None]:
trajectories = load_i80_trajectories(i80_data_dir)

print(f"0400pm-0415pm: {len(trajectories[0])} trajectories.")
print(f"0500pm-0515pm: {len(trajectories[1])} trajectories.")
print(f"0515pm-0530pm: {len(trajectories[2])} trajectories.")

The trajectories are stored as pandas dataframes. For example, the first samples of the first trajectory in the first data set (0400pm-0415pm) look as follows.

In [None]:
trajectories[0][0].head()

Plot N=100 random trajectories from the first data-set.

In [None]:
N = 100
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
plt.figure(20, clear=True)  # separate figure number so the loss plot in figure 10 is not overwritten
for trajectory_idx in rg.choice(range(len(trajectories[0])), N):
    trajectory = trajectories[0][trajectory_idx]
    plt.plot(trajectory.Local_X, trajectory.Local_Y, color=colors[trajectory.Lane_ID.iloc[0]], lw=0.5)
plot_road()
plt.xlabel('x [m]')
plt.ylabel('y [m]')
sns.despine()

## Visualize model predictions

Plot random trajectories from the validation dataset. The function ```get_trajectory_from_datapoint``` finds the trajectory that contains the prediction point, and also returns the indices of all points on that trajectory that are included in the feature dataset.

In [None]:
N = 100  # Number of trajectories
plt.figure(30, clear=True)
plot_road()
for val_index in rg.choice(validation_index, N):
    trajectory, _ = get_trajectory_from_datapoint(val_index, info, trajectories)
    plt.plot(trajectory.Local_X, trajectory.Local_Y, color=colors[trajectory.Lane_ID.iloc[0]], lw=0.5)
plt.xlabel('x [m]')
plt.ylabel('y [m]')
sns.despine()

Select a point at random from the validation dataset, find the corresponding trajectory, and make predictions along the trajectory.

In [None]:
val_i = rg.choice(validation_index)
trajectory, data_points = get_trajectory_from_datapoint(val_i, info, trajectories)

plt.figure(40, clear=True)
plot_road()
plt.plot(trajectory.Local_X, trajectory.Local_Y)
for ti, xi in zip(info[data_points][:, 1], x[data_points]):
    x_norm = (xi - mu) / std
    lane_change_prediction = model.predict(x_norm[None, :])[0]
    
    pos_prediction = (trajectory.Local_X.iloc[ti], trajectory.Local_Y.iloc[ti])
    plot_prediction(pos_prediction, lane_change_prediction, lane_bounds)

    plt.plot(trajectory.Local_X.iloc[ti], trajectory.Local_Y.iloc[ti], 'ro')
plt.xlabel('x [m]')
plt.ylabel('y [m]')
sns.despine()