# Fitting a GMM to the AdVitam Dataset


<hr style="border-top: 1px solid white;">


## Preamble


Python Libraries


In [19]:
import numpy as np
import pandas as pd
import neurokit2 as nk
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
)

<br></br>


Custom Functions


In [20]:
from useful_functions.driving_data.dd_dictionary import create_dd_dictionary
from useful_functions.driving_data.process_driving_data import processing_driving_data

from useful_functions.physio_data.pd_dictionary import create_pd_dictionary
from useful_functions.physio_data.process_physio_timestamps import process_physio_timestamps
from useful_functions.physio_data.process_physio_data import process_physio_data

from useful_functions.demographic_data.process_driver_demographic_data import (
    process_driver_demographic_data,
)

from useful_functions.construct_observations import construct_observations

from useful_functions.takeover_dataframe import create_takeover_timestamps
from useful_functions.check_for_missing_data import check_for_missing_data

<br></br>

Storing the folder paths to raw data


In [21]:
driving_data_folder = "../AdVitam/Exp2/Raw/Driving"
physio_data_folder = "../AdVitam/Exp2/Raw/Physio/Txt"

<br></br>

Storing a list of driver files to exclude

In [22]:
drivers_to_exclude = check_for_missing_data(driving_data_folder, physio_data_folder)
drivers_to_exclude.extend(["NST77", "NST11"])

`check_for_missing_data()` returns a list of every driver that is missing from either the `driving` or the `physio` folder. Drivers `NST77` and `NST11` are then added to the exclusion list manually.
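
A minimal sketch of the idea behind this check, assuming each driver is identified by an ID string such as `NST01` taken from the file names. The helper `drivers_missing_from_either` is hypothetical and is not the implementation in `useful_functions`.

```python
def drivers_missing_from_either(driving_ids, physio_ids):
    """Return the sorted IDs that appear in only one of the two collections."""
    driving_ids, physio_ids = set(driving_ids), set(physio_ids)
    # Symmetric difference: present in one set or the other, but not in both
    return sorted(driving_ids ^ physio_ids)


driving_ids = ["NST01", "NST02", "ST03"]  # illustrative IDs only
physio_ids = ["NST01", "ST03", "ST04"]
missing = drivers_missing_from_either(driving_ids, physio_ids)
print(missing)  # ['NST02', 'ST04']
```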

<br></br>
<br></br>

# Importing Data + Preprocessing


---


<br>


## Driving Data


<hr style="border-top: 1px dashed white; border-bottom: 0px">


### Data Description


**Driver Data:**
| Feature | Description | Notes |
| --- | --- | --- |
| Time | Time elapsed since the software was launched (in seconds) | |
| EngineSpeed | Engine speed (in rpm) | Removed |
| GearPosActual | Current gear | Removed |
| GearPosTarget | Next planned gear | Removed |
| AcceleratorPedalPos | Position of gas pedal. | Recording problem, Removed |
| DeceleratorPedalPos | Position of brake pedal. | Recording problem, Removed |
| SteeringWheelAngle | Steering wheel angle (in degrees) | |
| VehicleSpeed | Vehicle speed (in mph) | |
| Position X | Vehicle position along the x-axis in the simulated driving environment | |
| Position Y | Vehicle position along the y-axis in the simulated driving environment | |
| Position Z | Vehicle position along the z-axis in the simulated driving environment | |
| Autonomous Mode (T/F) | Autonomous pilot status. The driver is in control of the car when the value of the column "Autonomous Mode (T/F)" is False | True = Activated, False = Deactivated (driver in control) |
| Obstacles | Events that occurred during the driving simulation. | See Below |


<br></br>


**Obstacles:**

| Event         | Description                                                                                                                                                                                             |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| TriggeredObsX | Time at which each takeover request was triggered by the experimenter                                                                                                                                   |
| Obs1          | Deer                                                                                                                                                                                                    |
| Obs2          | Traffic cone                                                                                                                                                                                            |
| Obs3          | Frog                                                                                                                                                                                                    |
| Obs4          | Traffic cone                                                                                                                                                                                            |
| Obs5          | False alarm (x2)                                                                                                                                                                                        |
| Detected      | Time at which the driver pressed the steering wheel button to notify he/she understood the situation. |


<br></br>


### Driving Data Dictionary


Creates a dictionary of the raw driving data files.


In [23]:
driving_data_dictionary = create_dd_dictionary(driving_data_folder, drivers_to_exclude)

`create_dd_dictionary()` loads every file in `driving_data_folder`, except those of excluded drivers, into a dictionary keyed by driver ID.
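
As a rough illustration of this pattern, the sketch below builds such a dictionary from a folder of files. It assumes one CSV file per driver named `<driver_id>.csv`; the real AdVitam file layout and the actual `create_dd_dictionary()` may differ, and `load_folder_as_dictionary` is a hypothetical name.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_folder_as_dictionary(folder, drivers_to_exclude):
    """Map driver ID (file stem) -> DataFrame, skipping excluded drivers."""
    dictionary = {}
    for path in sorted(Path(folder).glob("*.csv")):
        driver_id = path.stem  # assumption: the file is named <driver_id>.csv
        if driver_id not in drivers_to_exclude:
            dictionary[driver_id] = pd.read_csv(path)
    return dictionary


# Tiny demonstration with throwaway files
with tempfile.TemporaryDirectory() as folder:
    for driver_id in ["NST01", "NST11"]:
        frame = pd.DataFrame({"Time": [0.0, 0.1]})
        frame.to_csv(Path(folder) / f"{driver_id}.csv", index=False)
    demo = load_folder_as_dictionary(folder, drivers_to_exclude=["NST11"])

print(list(demo))  # ['NST01']
```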

<br></br>


Processing driving data


In [24]:
# Fitting a Label Encoder to the `Obstacles` column
driver_data = driving_data_dictionary["NST01"]
driver_data = driver_data.fillna("Nothing")
enc = preprocessing.LabelEncoder()
enc.fit(driver_data["Obstacles"])

# Processing the driving data
driving_data_dictionary = processing_driving_data(driving_data_dictionary, enc)

A label encoder is fit to the `Obstacles` column of driver `NST01`, after missing values are replaced with `"Nothing"`.

`processing_driving_data()` label-encodes the `Obstacles` column for every driver and resamples the driving data to a 10 ms interval (100 Hz).
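
The two operations can be sketched on a toy recording as follows. This is an illustration under assumptions, not the code in `processing_driving_data()`: the column values are made up, time is given in integer milliseconds for simplicity, and the rule for combining events within a 10 ms window (`first_event`) is a hypothetical choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative recording: irregular samples with a sparse obstacle event
raw = pd.DataFrame(
    {
        "VehicleSpeed": [50.0, 50.2, 50.1, 49.8, 49.5],
        "Obstacles": [None, None, "TriggeredObs1", None, None],
    },
    index=pd.to_timedelta([0, 4, 13, 21, 30], unit="ms"),
)
raw["Obstacles"] = raw["Obstacles"].fillna("Nothing")

# Fit the encoder on the string labels, as in the notebook cell above
encoder = LabelEncoder().fit(raw["Obstacles"])


def first_event(labels):
    """Keep the first real event in a window, otherwise 'Nothing'."""
    events = labels[labels != "Nothing"]
    return events.iloc[0] if len(events) else "Nothing"


# Resample onto a regular 10 ms grid: average the signal, keep any event
resampled = pd.DataFrame(
    {
        "VehicleSpeed": raw["VehicleSpeed"].resample("10ms").mean(),
        "Obstacles": raw["Obstacles"].resample("10ms").apply(first_event),
    }
)
resampled["Obstacles"] = encoder.transform(resampled["Obstacles"])
print(resampled)
```

Averaging suits continuous signals such as speed, while event labels need a rule that does not drop a single-sample event when several samples share a window.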

<br></br>


Creating driving data takeover timestamps


In [25]:
driving_timestamps = create_takeover_timestamps(driving_data_dictionary, enc)

`create_takeover_timestamps()` extracts timestamps from each driver takeover.

Driving Timestamps include:
- `TriggeredObsX`: When the takeover request for obstacle X was triggered
- `TakeoverObsX`: When the driver took over the vehicle
- `ReleaseObsX`: When the driver released control of the vehicle to the automation
- `TOTObsX`: The time it took the driver to take over for obstacle X (`TakeoverObsX` - `TriggeredObsX`)
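
The takeover time is a plain difference of two timestamps. A small sketch with made-up values, assuming the timestamps are stored as timedelta columns (the `subject_id` column name is hypothetical):

```python
import pandas as pd

timestamps = pd.DataFrame(
    {
        "subject_id": ["NST01", "ST02"],  # hypothetical column name
        # Illustrative values, given in milliseconds since the session start
        "TriggeredObs1": pd.to_timedelta([120_000, 118_500], unit="ms"),
        "TakeoverObs1": pd.to_timedelta([122_400, 123_100], unit="ms"),
    }
)

# TOT = takeover time minus trigger time, per driver
timestamps["TOTObs1"] = timestamps["TakeoverObs1"] - timestamps["TriggeredObs1"]
tot_seconds = timestamps["TOTObs1"].dt.total_seconds().round(3).tolist()
print(tot_seconds)
```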


<br></br>
<br></br>


## Physiological Signals & Markers


<hr style="border-top: 1px dashed white; border-bottom: 0px">


### Data Description


**Signals:**

| Feature | Description            | Notes  |
| ------- | ---------------------- | ------ |
| min     | Time Elapsed           |        |
| ECG     | Electrocardiogram      | 1000Hz |
| EDA     | Electrodermal Activity | 1000Hz |
| RESP    | Respiration            | 1000Hz |


<br></br>


**Markers:**

Contains the timestamps for each period of the experiment.

- Training1 = Baseline phase
- Training2 = Practice phase in the driving simulator
- Driving = Main driving session in conditionally automated driving.

Note that the timestamps here are in seconds, whereas they are in minutes in the raw data.


<br></br>


**Timestamps:**

Time elapsed (in seconds) between the start of the main driving session and the appearance of the obstacles.

- TrigObsX: the time when the takeover request for the obstacle was triggered
- DetObsX: the time when the driver pressed the button to report having understood the situation
- RepObsX: the time when the driver actually took over control

X corresponds to one of the obstacles or the false alarm.


<br></br>


### Physio data dictionary


Creates a dictionary of the raw physiological data and their markers


In [26]:
phsyiological_data_dictionary = create_pd_dictionary(physio_data_folder, drivers_to_exclude)

`create_pd_dictionary()` loads every file in `physio_data_folder`, except those of excluded drivers, into a dictionary.

<br></br>


Processing the Physiological data


In [27]:
phsyiological_data_dictionary = process_physio_data(phsyiological_data_dictionary)

`process_physio_data()` resamples the data to a 10 ms interval (100 Hz) and then segments it by experimental phase (Baseline, Training, Driving).
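
A sketch of both steps on a synthetic 1000 Hz signal. The marker times and the averaging-based downsampling are assumptions for illustration; `process_physio_data()` may resample and segment differently.

```python
import numpy as np
import pandas as pd

# Illustrative 1000 Hz signal: 3 seconds of samples, one per millisecond
n_samples = 3000
signal = pd.DataFrame(
    {"ECG": np.sin(np.linspace(0, 20, n_samples))},
    index=pd.to_timedelta(np.arange(n_samples), unit="ms"),
)

# Downsample to 100 Hz by averaging each 10 ms window
signal_100hz = signal.resample("10ms").mean()

# Hypothetical phase markers as (start, end) in seconds
markers = {"Baseline": (0, 1), "Training": (1, 2), "Driving": (2, 3)}

segments = {}
for phase, (start, end) in markers.items():
    start_td = pd.Timedelta(seconds=start)
    end_td = pd.Timedelta(seconds=end)
    # Half-open interval so that no sample lands in two phases
    in_phase = (signal_100hz.index >= start_td) & (signal_100hz.index < end_td)
    segments[phase] = signal_100hz[in_phase]

print({phase: len(segment) for phase, segment in segments.items()})
```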


<br></br>


### Physio timestamps


A dataframe storing the trigger time, takeover time, release time, and TOT for each obstacle and every driver, similar to `driving_timestamps`.


In [28]:
physio_timestamps = pd.read_csv(
    "../AdVitam/Exp2/Preprocessed/Physio and Driving/timestamps_obstacles.csv"
)

<br></br>


### Processing the Physio Timestamps

Steps:

1. Change column names to match the driving timestamps
2. Remove preselected drivers
3. Reformat subject id to match
4. Transform timestamps into timedelta objects


In [29]:
physio_timestamps = process_physio_timestamps(physio_timestamps, drivers_to_exclude)
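
The steps listed above might look roughly like the following in pandas. All column names, the raw ID format, and the zero-padding rule are assumptions made for this sketch; the real `timestamps_obstacles.csv` and `process_physio_timestamps()` may differ. The ID is reformatted before filtering here so that it matches the format of the exclusion list.

```python
import pandas as pd

# Illustrative table; the real column names may differ
demo = pd.DataFrame(
    {
        "subject_id": ["NST1", "NST11", "ST2"],
        "TrigObs1": [120.0, 130.0, 118.5],
        "RepObs1": [122.5, 133.0, 123.0],
    }
)
excluded = ["NST11"]

# Rename columns to match the driving timestamps
processed = demo.rename(
    columns={"TrigObs1": "TriggeredObs1", "RepObs1": "TakeoverObs1"}
)

# Reformat the subject id, e.g. "NST1" -> "NST01" (assumed format)
processed["subject_id"] = processed["subject_id"].str.replace(
    r"^(\D+)(\d+)$",
    lambda match: f"{match.group(1)}{int(match.group(2)):02d}",
    regex=True,
)

# Remove preselected drivers
processed = processed[~processed["subject_id"].isin(excluded)]
processed = processed.reset_index(drop=True)

# Convert the timestamps from seconds to timedelta objects
for column in ["TriggeredObs1", "TakeoverObs1"]:
    processed[column] = pd.to_timedelta(processed[column], unit="s")

print(processed)
```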

<br></br>


## Driver Demographic Data


<hr style="border-top: 1px dashed white; border-bottom: 0px">


### Data Description


**Driver Demographic Data Description:**
| Feature | Description | Note |
| --- | --- | --- |
| code | Driver code: Secondary Task (ST) or No Secondary Task (NST), followed by a unique id (1, 2, ...) | In the form (ST/NST)# |
| date | Day of data collection | Removed |
| time | Hour of data collection | Removed |
| condition | Experimental condition for mental workload | Removed (contained in driver code) |
| sex | Driver's sex | |
| age | Age of drivers in years | |
| mothertongue | Driver's first language | |
| education | Highest education degree | |
| driving_license | Year in which the driving license was obtained | |
| km_year | Average number of kilometers driven per year | |
| accidents | Number of accidents during the last 3 years | |
| nasa_tlx_N | Answer to the NASA TLX for question N | Removed |
| danger_O | Subjective ranking of the danger of obstacle O | Removed |
| realism_O | Subjective ranking of the realism of obstacle O | Removed |
| sart_N_O | Subjective answer to the sart for question N related to obstacle O | Removed |
| demand_O | Demands on attentional resources (complexity, variability, and instability of the situation) | Removed |
| supply_O | Supply of attentional resources (division of attention, arousal, concentration, and spare mental capacity) | Removed |
| understanding_O | Understanding of the situation (information quantity, information quality and familiarity). | Removed |


<br></br>


### Driver Demographic Data


Grabbing the driver demographic data


In [30]:
driver_demographic_data = pd.read_csv(
    "../AdVitam/Exp2/Preprocessed/Questionnaires/Exp2_Database.csv",
    usecols=[
        "code",
        "sex",
        "age",
        "mothertongue",
        "education",
        "driving_license",
        "km_year",
        "accidents",
    ],
)

<br></br>


### Processing driver demographic data


Steps:

1. Remove preselected drivers
2. Reformat code to match data
3. Convert driving license from year obtained to number of years held
4. Normalize km/y


In [31]:
driver_demographic_data = process_driver_demographic_data(
    driver_demographic_data, drivers_to_exclude
)
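
A sketch of the removal, license conversion, and normalization steps on made-up rows. Two things are assumptions here and are not taken from `process_driver_demographic_data()`: the reference year `collection_year` is a placeholder, and min-max scaling is only one possible reading of "normalize".

```python
import pandas as pd

demographics = pd.DataFrame(
    {
        "code": ["NST01", "ST02", "NST11"],
        "driving_license": [2010, 2015, 2000],
        "km_year": [5000, 15000, 10000],
    }
)
excluded = ["NST11"]
collection_year = 2019  # placeholder: replace with the actual year of data collection

# Remove preselected drivers
demographics = demographics[~demographics["code"].isin(excluded)]
demographics = demographics.reset_index(drop=True)

# Year obtained -> number of years the license has been held
demographics["driving_license"] = collection_year - demographics["driving_license"]

# Min-max normalization of km/year to the range [0, 1] (assumed method)
km = demographics["km_year"]
demographics["km_year"] = (km - km.min()) / (km.max() - km.min())

print(demographics)
```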

<br></br>


# Constructing Sequence of Observations


---


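The idea is to train two models: one on observations associated with a 'slow' takeover and one on observations associated with a 'fast' takeover. In this notebook, each model is a Gaussian mixture model (GMM).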


In [32]:
slow_observations, fast_observations = construct_observations(
    driving_data_dictionary,
    phsyiological_data_dictionary,
    driving_timestamps,
    physio_timestamps,
    driver_demographic_data,
)
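
How `construct_observations()` separates 'slow' from 'fast' takeovers is not shown in this notebook. One plausible approach is a threshold on the takeover time (TOT); the sketch below assumes a median split and random stand-in sequences purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up takeover times in seconds, one per takeover
tot_seconds = np.array([1.8, 2.4, 3.1, 4.6, 5.2, 2.9])

# One observation sequence per takeover: (time steps x features)
sequences = [rng.normal(size=(50, 3)) for _ in tot_seconds]

# Assumed rule: a takeover is "slow" if its TOT exceeds the median TOT
threshold = np.median(tot_seconds)
slow = [seq for seq, tot in zip(sequences, tot_seconds) if tot > threshold]
fast = [seq for seq, tot in zip(sequences, tot_seconds) if tot <= threshold]

# Stacking the sequences gives the 2D array that GaussianMixture.fit expects
stacked_slow = np.vstack(slow)
print(len(slow), len(fast), stacked_slow.shape)
```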

<br>
<br>


# Training the GMMs


---


**Train/Validate/Test Split**

In [33]:
slow_observations_train, slow_observations_test = train_test_split(
    slow_observations, test_size=0.1, random_state=42
)

fast_observations_train, fast_observations_test = train_test_split(
    fast_observations, test_size=0.1, random_state=42
)

<br>
<br>


# Hyperparameters


---


In [34]:
# initializing the hyperparameters
n_components = np.arange(1, 11)
covariance_type = ["full", "tied", "diag", "spherical"]
tol = np.arange(0.001, 0.011, 0.001)
init_params = ["kmeans", "k-means++", "random", "random_from_data"]
random_state = np.arange(0, 11)

hyperparametes = {
    "n_components": n_components,
    "covariance_type": covariance_type,
    "tol": tol,
    "init_params": init_params,
    "random_state": random_state,
}

In [35]:
max_iter = 1000

# initialize the model
slow_model = GaussianMixture(max_iter=max_iter)
fast_model = GaussianMixture(max_iter=max_iter)

In [36]:
# initialize the grid search cv
slow_grid = GridSearchCV(slow_model, hyperparametes, cv=5, n_jobs=-1, verbose=1)
fast_grid = GridSearchCV(fast_model, hyperparametes, cv=5, n_jobs=-1, verbose=1)

# fit the model
s = np.vstack(slow_observations_train)
slow_grid.fit(s)
f = np.vstack(fast_observations_train)
fast_grid.fit(f)

Fitting 5 folds for each of 17600 candidates, totalling 88000 fits
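
The candidate count reported above is the product of the number of values per hyperparameter in the grid, and each candidate is fit once per fold. A quick check of the arithmetic:

```python
import math

# Number of values per hyperparameter in the grid
grid_sizes = {
    "n_components": 10,
    "covariance_type": 4,
    "tol": 10,
    "init_params": 4,
    "random_state": 11,
}
n_folds = 5

n_candidates = math.prod(grid_sizes.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 17600 88000
```

Including `random_state` in the grid multiplies the cost by 11. If the search is too slow, `RandomizedSearchCV` from `sklearn.model_selection` is one option for sampling a subset of these combinations.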




In [None]:
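# best_estimator_ is already refit on the full training data
# (GridSearchCV uses refit=True by default)
slow_hmm = slow_grid.best_estimator_
fast_hmm = fast_grid.best_estimator_

# Fitting again repeats that final fit with the same best parameters
slow_hmm.fit(s)
fast_hmm.fit(f)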

In [None]:
# Test the fitted models: each held-out sequence is assigned to whichever
# model gives it the higher average log-likelihood (GaussianMixture.score)
correct = 0

for obs in slow_observations_test:
    if slow_hmm.score(obs) > fast_hmm.score(obs):
        correct += 1

for obs in fast_observations_test:
    if fast_hmm.score(obs) > slow_hmm.score(obs):
        correct += 1

accuracy = correct / (len(slow_observations_test) + len(fast_observations_test))
print("Accuracy: ", accuracy)
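
The likelihood comparison used in the test loop can be tried end to end on synthetic data. The sketch below is self-contained and uses made-up, well-separated Gaussian sequences in place of the AdVitam observations, so its accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)


def make_sequences(mean, n_sequences):
    """Synthetic stand-in for observation sequences: (time steps x features)."""
    return [rng.normal(loc=mean, scale=1.0, size=(40, 2)) for _ in range(n_sequences)]


slow_train, slow_test = make_sequences(0.0, 20), make_sequences(0.0, 5)
fast_train, fast_test = make_sequences(3.0, 20), make_sequences(3.0, 5)

# One mixture model per class, each fit on the stacked training sequences
slow_gmm = GaussianMixture(n_components=2, random_state=0).fit(np.vstack(slow_train))
fast_gmm = GaussianMixture(n_components=2, random_state=0).fit(np.vstack(fast_train))


def predict_label(sequence):
    # score() returns the average per-sample log-likelihood of the sequence
    return "slow" if slow_gmm.score(sequence) > fast_gmm.score(sequence) else "fast"


predictions = [predict_label(seq) for seq in slow_test + fast_test]
truth = ["slow"] * len(slow_test) + ["fast"] * len(fast_test)
accuracy = float(np.mean([p == t for p, t in zip(predictions, truth)]))
print("Accuracy:", accuracy)
```

Because `score()` averages over the samples in a sequence, sequences of different lengths can be compared on the same scale.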

<br>
<br>
