# Fitting a GMM to the AdVitam Dataset


<hr style="border-top: 1px solid white;">


## Preamble


Python Libraries


In [1]:
import numpy as np
import pandas as pd
import neurokit2 as nk
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    RandomizedSearchCV
)

<br></br>


Custom Functions


In [2]:
# in order of appearance
from useful_functions.check_for_missing_data import check_for_missing_data

# physio data
from useful_functions.physio_data.pd_dictionary import create_pd_dictionary
from useful_functions.physio_data.process_physio_timestamps import process_physio_timestamps
from useful_functions.physio_data.preprocess_physio_data import preprocess_physio_data

from useful_functions.demographic_data.process_driver_demographic_data import (
    process_driver_demographic_data,
)

from useful_functions.construct_observations import construct_observations

<br></br>

Storing the folder paths to raw data


In [3]:
driving_data_folder = "../AdVitam/Exp2/Raw/Driving"
physio_data_folder = "../AdVitam/Exp2/Raw/Physio/Txt"

<br></br>

Storing a list of driver files to exclude

In [4]:
drivers_to_exclude = check_for_missing_data(driving_data_folder, physio_data_folder)
drivers_to_exclude.extend(["NST77", "NST11", "ST22", "NST87", "ST14", "ST12", "NST73", "ST10"])

`check_for_missing_data()` returns a list of any driver that is _not_ in both the `driving` and `physio` folders.

Participants: ST22, NST87, ST14, ST12, NST73 and ST10 seem to have issues with there physiological data

<br></br>
<br></br>

# Importing Data + Preprocessing


---


<br>


## Physiological Signals & Markers


<hr style="border-top: 1px dashed white; border-bottom: 0px">


### Data Description


**Signals:**

| Feature | Description            | Notes  |
| ------- | ---------------------- | ------ |
| min     | Time Elapsed           |        |
| ECG     | Electrocardiogram      | 1000Hz |
| EDA     | Electrodermal Activity | 1000Hz |
| RESP    | Resperatory            | 1000Hz |


<br></br>


**Markers:**

Contains the timestamps for each period of the experiment.

- Training1 = Baseline phase
- Training2 = Practice phase in the driving simulator
- Driving = Main driving session in conditionally automated driving.

Be careful, the timestamps are here in seconds while they are in minutes in the raw data.


<br></br>


**Timestamps:**

Time elapsed (in seconds) between the start of the main driving session and the appearance of the obstacles.

- TrigObsX: the time when the driver pressed the button to report having understood the situation
- DetObsX: and the time when the driver actually took over control
- RepObsX: X corresponds to one of obstacle or the false alarm.


<br></br>


### Physio data dictionary


Creates a dictionary of the raw physiological data and their markers


In [5]:
phsyiological_data_dictionary = create_pd_dictionary(physio_data_folder, drivers_to_exclude)

`create_pd_dictionary()` stores every file in the `physio_data_folder` in a dictionary

<br></br>


Processing the Physiological data


In [6]:
phsyiological_data_dictionary = preprocess_physio_data(phsyiological_data_dictionary)

`process_physio_data()` resamples the data to 10ms (100Hz) and then segments the data into each experimental phase (Baseline, Training, Driving)


<br></br>


### Physio timestamps


A dataframe to store the trigger time, takeover time, release time, and TOT for each obstacle, for every driver. Similar to `driver_timestamps`.


In [7]:
physio_timestamps = pd.read_csv(
    "../AdVitam/Exp2/Preprocessed/Physio and Driving/timestamps_obstacles.csv"
)

<br></br>


### Processing the Physio Timestamps

Steps:

1. Change column names to match driving timestamps
1. Remove preselected drivers
1. Reformat subject id to match
1. Transfrom timestamps into timedelta objects


In [8]:
physio_timestamps = process_physio_timestamps(physio_timestamps, drivers_to_exclude)

<br></br>


## Driver Demographic Data


<hr style="border-top: 1px dashed white; border-bottom: 0px">


### Data Description


**Driver Demographic Data Description:**
| Feature | Description | Note |
| --- | --- | --- |
| code | Code of driver Secondary Task (ST) vs No ST (NST) + unique id (1,2,...) | In the form (ST/NST)# |
| date | Day of data collection | Removed |
| time | Hour of data collection | Removed |
| condition | Experimental condition for mental workload | Removed (contained in driver code |
| sex | driver sex | |
| age | Age of drivers in years | |
| mothertongue | drivers first language | |
| education | Highest education degree | |
| driving_license | Year of obtenstion of driving license | |
| km_year | Number of kilometers covered per year in average | |
| accidents | Number of accidents during the last 3 years | |
| nasa_tlx_N | Answer to the NASA TLX for question N | Removed |
| danger_O | Subjective ranking of the danger of obstacle O | Removed |
| realism_O | Subjective ranking of the realism of obstacle O | Removed |
| sart_N_O | Subjective answer to the sart for question N related to obstacle O | Removed |
| demand_O | Demands on attentional resources (complexity, variability, and instability of the situation) | Removed |
| supply_O | Supply of attentional resources (division of attention, arousal, concentration, and spare mental capacity) | Removed |
| understanding_O | Understanding of the situation (information quantity, information quality and familiarity). | Removed |


<br></br>


### Driver Demographic Data


Grabbing the driver demographic data


In [9]:
driver_demographic_data = pd.read_csv(
    "../AdVitam/Exp2/Preprocessed/Questionnaires/Exp2_Database.csv",
    usecols=[
        "code",
        "sex",
        "age",
        "mothertongue",
        "education",
        "driving_license",
        "km_year",
        "accidents",
    ],
)

<br></br>


### Processing driver demographic data


Steps:

1. Remove preselected drivers
2. Reformat code to match data
3. Coverting driving licence from year obtained to of years obtained
4. Normalize km/y


In [10]:
driver_demographic_data = process_driver_demographic_data(
    driver_demographic_data, drivers_to_exclude
)

<br></br>


# Constructing Sequence of Observations


---


The idea is to train 2 HMM trained to observations assosiated with a 'slow' takeover, and a 'fast' takeover.


In [11]:
# initialize lists to store observations
slow_observations = []
fast_observations = []

# loop through each driver
for driver in phsyiological_data_dictionary.keys():
    # data for each driver
    driver_phyio_baseline_data = phsyiological_data_dictionary[driver]["baseline"]
    driver_physio_data = phsyiological_data_dictionary[driver]["driving"]

    # timestamps
    driver_physio_timestamps = physio_timestamps[physio_timestamps["subject_id"] == driver]

    # loop through every takeover
    for column in driver_physio_timestamps.columns:
        if "TOT" in column:
            # get the obstacle number
            obstacle = column.replace("TOT", "")

            # store the obstacle triggers for driving and physio
            physio_obstacle_trigger = driver_physio_timestamps["Triggered" + obstacle].iloc[0]

            # check if the obstacle triggers are not null
            if pd.isnull(physio_obstacle_trigger):
                continue

            # trim the data to the 10s before the takeover
            physio_data_10_sec = driver_physio_data[
                (
                    driver_physio_data["Time"]
                    >= (
                        driver_physio_data.Time.min()
                        + physio_obstacle_trigger
                        - pd.to_timedelta("10s")
                    )
                )
                & (
                    driver_physio_data["Time"]
                    < driver_physio_data.Time.min() + physio_obstacle_trigger
                )
            ].copy()

            # # Store the Difference between the baseline and the takeover
            # hrv_difference = takeover_hrv - baseline_hrv

            # # rename the columns
            # hrv_difference.columns = [
            #     "HRV_" + column + "_Difference" for column in hrv_difference.columns
            # ]

            # # concatenate the dataframes
            # hrv = pd.concat([baseline_hrv, takeover_hrv, hrv_difference], axis=1)

            # reset the Time index
            physio_data_10_sec = physio_data_10_sec.set_index("Time")

            # set the index to 0
            physio_data_10_sec.index = physio_data_10_sec.index - physio_data_10_sec.index.min()

            # merge the data
            driver_data = pd.merge(
                physio_data_10_sec,
                left_index=True,
                right_index=True,
            )

            # reset the index
            driver_data.reset_index(inplace=True)

            # Remove Time, Position X, Position Y, Position Z, Autonomous Mode (T/F), Obstacles
            driver_data = driver_data.drop(
                columns=[
                    "Time",
                    "Autonomous Mode (T/F)",
                    "Obstacles",
                ]
            )

            # grab driver demogrpahic data
            demo_data = driver_demographic_data[driver_demographic_data["code"] == driver]

            # Broadcast to repeat the static data for each row of the dynamic data
            demo_data = pd.concat([demo_data] * len(driver_data), ignore_index=True)

            # Broadcast the hrv data
            # hrv = pd.concat([hrv] * len(driver_data), ignore_index=True)

            # merge the data
            driver_data = pd.merge(driver_data, demo_data, left_index=True, right_index=True)
            # driver_data = pd.merge(driver_data, hrv, left_index=True, right_index=True)

            # change the code value to the driver id
            driver_data["code"] = driver_data["code"].apply(lambda x: x.split("T")[1])
            # cast code to int
            driver_data["code"] = driver_data["code"].astype(int)

            if len(driver_data) != 1000:
                continue

            # determine if the takeover was slow or fast
            if physio_timestamps[column].iloc[0] > pd.to_timedelta("3s"):
                slow_observations.append(driver_data.to_numpy())
            else:
                fast_observations.append(driver_data.to_numpy())

TypeError: merge() missing 1 required positional argument: 'right'

<br>
<br>


# Training the HMMs


---


**Train/Validate/Test Split**

In [None]:
slow_observations_train, slow_observations_test = train_test_split(
    slow_observations, test_size=0.1
)

fast_observations_train, fast_observations_test = train_test_split(
    fast_observations, test_size=0.1
)

<br>
<br>


# Hyperparameters


---


In [None]:
# initializing the hyperparameters
n_components = np.arange(1, 11)
covariance_type = ["full", "tied", "diag", "spherical"]
tol = np.arange(0.001, 0.011, 0.001)
init_params = ["kmeans", "k-means++", "random", "random_from_data"]
random_state = np.arange(0, 11)
max_iter = np.linspace(100, 10000, 100).astype(int)

hyperparametes = {
    "n_components": n_components,
    "covariance_type": covariance_type,
    "tol": tol,
    "init_params": init_params,
    "random_state": random_state,
    "max_iter": max_iter,
}

# initialize the model
slow_model = GaussianMixture(reg_covar=1e-4)
fast_model = GaussianMixture(reg_covar=1e-4)

# initialize the random search
slow_random_search = RandomizedSearchCV(
    slow_model, hyperparametes, n_iter=1000, cv=5, n_jobs=-1, error_score='raise'
)
fast_random_search = RandomizedSearchCV(
    fast_model, hyperparametes, n_iter=1000, cv=5,  n_jobs=-1, error_score='raise'
)

# fit the model
s = np.vstack(slow_observations_train)
slow_random_search.fit(s)
f = np.vstack(fast_observations_train)
fast_random_search.fit(f)



In [None]:
slow_model = slow_random_search.best_estimator_
fast_model = fast_random_search.best_estimator_

slow_model.fit(s)
fast_model.fit(f)

In [None]:
# test the HMM
accuracy = 0

for obs in slow_observations_test:
    if slow_model.score(obs) > fast_model.score(obs):
        accuracy += 1

for obs in fast_observations_test:
    if fast_model.score(obs) > slow_model.score(obs):
        accuracy += 1

accuracy = accuracy / (len(slow_observations_test) + len(fast_observations_test))
print("Accuracy: ", accuracy)

Accuracy:  0.5161290322580645


In [None]:
"""
# initialize the grid search cv
slow_grid = GridSearchCV(slow_model, hyperparametes, cv=5, n_jobs=-1, verbose=1)
fast_grid = GridSearchCV(fast_model, hyperparametes, cv=5, n_jobs=-1, verbose=1)

# fit the model
s = np.vstack(slow_observations_train)
slow_grid.fit(s)
f = np.vstack(fast_observations_train)
fast_grid.fit(f)

# get the best estimators
slow_hmm = slow_grid.best_estimator_
fast_hmm = fast_grid.best_estimator_

slow_hmm.fit(s)
fast_hmm.fit(f)

# test the HMM
accuracy = 0

for obs in slow_observations_test:
    if slow_hmm.score(obs) > fast_hmm.score(obs):
        accuracy += 1

for obs in fast_observations_test:
    if fast_hmm.score(obs) > slow_hmm.score(obs):
        accuracy += 1

accuracy = accuracy / (len(slow_observations_test) + len(fast_observations_test))
print("Accuracy: ", accuracy)
"""

'\n# initialize the grid search cv\nslow_grid = GridSearchCV(slow_model, hyperparametes, cv=5, n_jobs=-1, verbose=1)\nfast_grid = GridSearchCV(fast_model, hyperparametes, cv=5, n_jobs=-1, verbose=1)\n\n# fit the model\ns = np.vstack(slow_observations_train)\nslow_grid.fit(s)\nf = np.vstack(fast_observations_train)\nfast_grid.fit(f)\n\n# get the best estimators\nslow_hmm = slow_grid.best_estimator_\nfast_hmm = fast_grid.best_estimator_\n\nslow_hmm.fit(s)\nfast_hmm.fit(f)\n\n# test the HMM\naccuracy = 0\n\nfor obs in slow_observations_test:\n    if slow_hmm.score(obs) > fast_hmm.score(obs):\n        accuracy += 1\n\nfor obs in fast_observations_test:\n    if fast_hmm.score(obs) > slow_hmm.score(obs):\n        accuracy += 1\n\naccuracy = accuracy / (len(slow_observations_test) + len(fast_observations_test))\nprint("Accuracy: ", accuracy)\n'

Initial Accuracy Including Every Feature:
51.61%
