The first step in wrangling our EEG results is to import our datasets. They are split across three files: `run4.csv`, `run5.csv`, and `run4_5.txt`. The split exists because our `keylogger` (a program that captures the timing of keyboard inputs) writes every keypress to a single file, whereas we separated out the runs in the `OpenBCI` program.

To import the raw EEG data, we will use the `Pandas` Python module to read each raw CSV file into a `DataFrame`, a labeled, table-like object that makes large datasets easy to access and manipulate.

Once we have the DataFrames for the two selected runs, we need to combine them so that we can access the entire dataset at once. `Pandas` will also help with this.

In [109]:
import pandas as pd # Import Pandas to project

pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [110]:
# Read run4.csv and run5.csv into Pandas DataFrames, then combine them

run_4 = pd.read_csv(r'dataset/run4.csv')
run_5 = pd.read_csv(r'dataset/run5.csv')

eeg_data = pd.concat([run_4, run_5])

eeg_data 

Unnamed: 0,Sample Index,EXG Channel 0,EXG Channel 1,EXG Channel 2,EXG Channel 3,EXG Channel 4,EXG Channel 5,EXG Channel 6,EXG Channel 7,Accel Channel 0,...,Other.3,Other.4,Other.5,Other.6,Analog Channel 0,Analog Channel 1,Analog Channel 2,Timestamp,Other.7,Timestamp (Formatted)
0,171.00000,6179.94442,2068.16221,-9348.55006,-324.45792,-591.33775,-1583.41993,-12547.66584,-13145.91028,0.05200,...,7.00000,208.00000,30.00000,208.00000,0.00000,0.00000,0.00000,1678313733.21671,0.00000,2023-03-08 14:15:33.216
1,172.00000,6184.52653,2071.13499,-9344.77262,-283.77775,-575.13274,-1581.56473,-12546.43649,-13146.46907,0.05200,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678313733.21674,0.00000,2023-03-08 14:15:33.216
2,173.00000,6185.86763,2072.36434,-9343.85620,-361.22654,-581.57004,-1584.02343,-12544.93893,-13149.68773,0.05200,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678313733.21675,0.00000,2023-03-08 14:15:33.216
3,174.00000,6186.67229,2071.64908,-9342.15746,-430.98634,-593.55057,-1585.36453,-12543.15079,-13155.65564,0.05200,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678313733.21676,0.00000,2023-03-08 14:15:33.216
4,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.05200,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678313733.29276,0.00000,2023-03-08 14:15:33.292
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86972,184.00000,-1268.39444,-3476.34446,-9734.16236,-2671.12287,-4573.34573,-6089.30809,-13128.69944,-13841.11659,0.05800,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678315269.78043,0.00000,2023-03-08 14:41:09.780
86973,185.00000,-1268.50620,-3475.18217,-9729.64731,-3879.41347,-4632.98018,-6111.30221,-13127.22422,-13850.01258,0.05800,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678315269.78534,0.00000,2023-03-08 14:41:09.785
86974,186.00000,-1269.06499,-3474.95866,-9731.54720,-4805.75917,-4645.83244,-6106.02720,-13128.38651,-13845.02814,0.05800,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678315269.78930,0.00000,2023-03-08 14:41:09.789
86975,187.00000,-1265.39931,-3476.85855,-9737.09044,-3011.27172,-4561.36519,-6075.49472,-13130.28641,-13830.83479,0.05800,...,8.00000,208.00000,30.00000,128.00000,0.00000,0.00000,0.00000,1678315269.79332,0.00000,2023-03-08 14:41:09.793
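
One detail of `pd.concat` is worth keeping in mind for later cells: by default it keeps each input's own index, so row labels can repeat in the combined DataFrame. The sketch below uses two tiny stand-in frames (hypothetical values, not our recordings) to show the difference that `ignore_index=True` makes. It is an illustration only, not a change to the cell above.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical values, not the real recordings)
run_a = pd.DataFrame({'Timestamp': [1.0, 2.0]})
run_b = pd.DataFrame({'Timestamp': [3.0, 4.0]})

kept = pd.concat([run_a, run_b])                      # index labels: 0, 1, 0, 1
fresh = pd.concat([run_a, run_b], ignore_index=True)  # index labels: 0, 1, 2, 3

print(kept.index.is_unique)   # False
print(fresh.index.is_unique)  # True
```

Repeated labels matter whenever a row is later selected or written by label, because one label can then refer to more than one row.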


Now that the raw CSV files have been read into DataFrames, we can see above that there are many columns (or channels) of data. Not all of them are necessary for our purpose, which is to read the raw EEG data, so we can drop the extras using the `df.drop()` function from Pandas. Doing this keeps the DataFrame smaller, which should reduce memory use and may speed up the lookup algorithm in future cells.

One problem with this dataset is that every column name was written to the CSV file with a leading space, so we need to strip that space from each column name.

We only want the raw electrode data, so we keep the `EXG` channels and discard the rest, such as the `Accel` channels. Rather than manually listing every column to drop, we can filter the columns of the `DataFrame` by whether their names include `Accel`, `Other`, `Analog`, or `Formatted`, and then use that filtered `List` as the list of columns to drop.

In [111]:
eeg_cols_raw = eeg_data.columns # Read DataFrame columns

DISALLOWED_PHRASES = ['Accel', 'Other', 'Analog', 'Formatted'] # List of keywords not allowed in columns

# Columns to drop: any column whose name contains a disallowed phrase
eeg_cols_filtered = [col for col in eeg_cols_raw
                     if any(phrase in col for phrase in DISALLOWED_PHRASES)]

eeg_data = eeg_data.drop(eeg_cols_filtered, axis=1) # Drop unnecessary columns

stripped_cols = [col.lstrip() for col in eeg_data.columns] # Get rid of beginning space

eeg_data.columns = stripped_cols # Set columns

# eeg_data['Timestamp'] -= 28800

eeg_data

Unnamed: 0,Sample Index,EXG Channel 0,EXG Channel 1,EXG Channel 2,EXG Channel 3,EXG Channel 4,EXG Channel 5,EXG Channel 6,EXG Channel 7,Timestamp
0,171.00000,6179.94442,2068.16221,-9348.55006,-324.45792,-591.33775,-1583.41993,-12547.66584,-13145.91028,1678313733.21671
1,172.00000,6184.52653,2071.13499,-9344.77262,-283.77775,-575.13274,-1581.56473,-12546.43649,-13146.46907,1678313733.21674
2,173.00000,6185.86763,2072.36434,-9343.85620,-361.22654,-581.57004,-1584.02343,-12544.93893,-13149.68773,1678313733.21675
3,174.00000,6186.67229,2071.64908,-9342.15746,-430.98634,-593.55057,-1585.36453,-12543.15079,-13155.65564,1678313733.21676
4,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1678313733.29276
...,...,...,...,...,...,...,...,...,...,...
86972,184.00000,-1268.39444,-3476.34446,-9734.16236,-2671.12287,-4573.34573,-6089.30809,-13128.69944,-13841.11659,1678315269.78043
86973,185.00000,-1268.50620,-3475.18217,-9729.64731,-3879.41347,-4632.98018,-6111.30221,-13127.22422,-13850.01258,1678315269.78534
86974,186.00000,-1269.06499,-3474.95866,-9731.54720,-4805.75917,-4645.83244,-6106.02720,-13128.38651,-13845.02814,1678315269.78930
86975,187.00000,-1265.39931,-3476.85855,-9737.09044,-3011.27172,-4561.36519,-6075.49472,-13130.28641,-13830.83479,1678315269.79332
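
The same filtering can also be written with the string methods that Pandas exposes on column indexes. The sketch below is a possible alternative, not the code we ran. It uses a made-up one-row frame whose column names carry the same leading space, and it assumes the disallowed phrases contain no regex special characters.

```python
import pandas as pd

# Toy frame with the same naming quirk (leading spaces); values are made up
df = pd.DataFrame(
    [[1, 0.5, 0.05, 7, 0, 10.0, 'x']],
    columns=['Sample Index', ' EXG Channel 0', ' Accel Channel 0',
             ' Other', ' Analog Channel 0', ' Timestamp', ' Timestamp (Formatted)'],
)

pattern = 'Accel|Other|Analog|Formatted'      # regex alternation of the disallowed phrases
drop_mask = df.columns.str.contains(pattern)  # True for every column to remove
df = df.loc[:, ~drop_mask]                    # keep only the remaining columns
df.columns = df.columns.str.lstrip()          # strip the leading space from each name

print(list(df.columns))  # ['Sample Index', 'EXG Channel 0', 'Timestamp']
```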


Now that we only have the required columns for analysis in our DataFrame, the next step is to add a `KeyPressed` column, whose entries will be the key the user was pressing at the time of the reading. 

A major complication is that the timestamps in the KeyLogger output may not align exactly with those in the EEG data. The KeyLogger and OpenBCI run at different sampling rates, they may not have been started at exactly the same time for each run, and the computers themselves differ.

There is no perfect solution for this, but one that works well enough for our purposes is to match each keypress to the EEG sample with the closest timestamp.

The first step is to read the KeyLogger output into two `Lists` whose indices match up exactly. The `timestamps` `List` will contain the `UNIX Timestamp` of each keypress as a float, so it can be compared directly with the `Timestamp` column of the EEG data, and the `keys` `List` will contain the key pressed at that time.

The reason we are using two parallel `Lists` and not a `Dictionary` is that a `Dictionary` is built for exact-key lookups, whereas we need the nearest timestamp rather than an exact match. Parallel `Lists` also preserve the order of the keypresses and convert directly into the columns of a `DataFrame`, where operations over every row at once are efficient.
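
To illustrate how a nearest-timestamp lookup can work on a plain sorted `List`, here is a minimal sketch using Python's standard `bisect` module. The helper `nearest_index` and the sample values are hypothetical and are not part of our pipeline. The sketch assumes the timestamps are already sorted in ascending order.

```python
from bisect import bisect_left

def nearest_index(sorted_ts, target):
    """Return the position of the value in sorted_ts closest to target."""
    pos = bisect_left(sorted_ts, target)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    # target lies between two neighbours; pick the closer one
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return pos if (after - target) < (target - before) else pos - 1

ts_example = [10.0, 20.0, 30.0]    # hypothetical, already sorted
keys_example = ['f', 'w', 'g']     # key at the matching position

i = nearest_index(ts_example, 22.0)
print(keys_example[i])  # w
```

Because the two `Lists` share positions, the index found in one can be used directly in the other.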

In [112]:
from datetime import datetime

timestamps = []
keys = []

with open(r'dataset/run4_5.txt') as key_file:
    for line in key_file:
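        # Split on the last comma only, so a logged ',' key would not break parsing
        raw_key, raw_timestamp = line.rsplit(',', 1)
        key = raw_key.replace("'", "")
        timestamp = float(raw_timestamp) # float() ignores the trailing newline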

        timestamps.append(timestamp)
        keys.append(key)

obj = { 'Timestamp': timestamps, 'key': keys }

key_pressed = pd.DataFrame(data=obj)

key_pressed = key_pressed.sort_values('Timestamp') # sort_values returns a new DataFrame, so assign it back

key_pressed

Unnamed: 0,Timestamp,key
0,1678313724.80382,f
1,1678313745.28523,w
2,1678313745.79649,w
3,1678313747.93245,/
4,1678313748.92906,g
...,...,...
1844,1678315277.10736,Button.left
1845,1678315281.81075,Button.left
1846,1678315289.75182,Button.left
1847,1678315296.42435,Button.right


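Now that we have the timestamps and keys imported and readable, we need to add the `KeyPressed` column, which contains the appropriate key. For each keypress, we find the EEG row with the nearest timestamp and record the key there; rows with no keypress are filled with `NO_BUTTON`.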

In [113]:
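def get_closest_key_press(key, df_ts):
    '''
        Steps:
            - Find the eeg_data row whose timestamp is closest to the keypress timestamp
            - Record the key in that row's KeyPressed column
    '''
    # Work with the row position, so the write hits exactly one row even if index labels repeat
    result_pos = eeg_data['Timestamp'].sub(df_ts).abs().to_numpy().argmin()
    # Single .iloc assignment instead of chained indexing, which may write to a temporary copy
    eeg_data.iloc[result_pos, eeg_data.columns.get_loc('KeyPressed')] = key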

def back_fill_empties(keyPressed):
    print(keyPressed)
    if keyPressed == 'None':
        return 'NO_KEY'
    return keyPressed

eeg_data['KeyPressed'] = None

key_pressed.apply(lambda x: get_closest_key_press(x['key'], x['Timestamp']), axis=1)

eeg_data['KeyPressed'] = eeg_data['KeyPressed'].fillna(value='NO_BUTTON')

eeg_data.to_csv('eeg_val_to_key_press.csv')
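
The matching in `get_closest_key_press` scans every EEG row once per keypress. A possible alternative is `pd.merge_asof` with `direction='nearest'`, which does the nearest-timestamp match in a single call. The sketch below uses hypothetical miniature frames (`eeg` and `presses`, with a helper column `row`) rather than our data. It assumes both frames are sorted by `Timestamp` and that the column has the same float type in each.

```python
import pandas as pd

# Hypothetical miniature versions of eeg_data and key_pressed
eeg = pd.DataFrame({'Timestamp': [0.0, 0.1, 0.2, 0.3, 0.4]})
presses = pd.DataFrame({'Timestamp': [0.12, 0.38], 'key': ['w', 'g']})

# Both inputs must be sorted by the merge key
eeg = eeg.sort_values('Timestamp').reset_index(drop=True)
presses = presses.sort_values('Timestamp')

# Remember each EEG row's label so the match can be written back
eeg_rows = eeg[['Timestamp']].assign(row=eeg.index)

# For every keypress, find the EEG row with the nearest timestamp
matched = pd.merge_asof(presses, eeg_rows, on='Timestamp', direction='nearest')

eeg['KeyPressed'] = 'NO_BUTTON'
eeg.loc[matched['row'], 'KeyPressed'] = matched['key'].to_numpy()

print(eeg)
```

As in the cell above, if two keypresses land on the same EEG row, the later one overwrites the earlier one.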